How to detect AI writing — methodology, signals, and a workflow that holds up.

Most "how to detect AI" articles stop at five surface tells and a button-press. This is the deeper guide built for educators, managers, and editors who need a defensible verdict rather than a vibe check. Inside: the classifier-family methodology that separates ML-based detectors from perplexity tools and hybrids, the six-signal manual checklist a trained reader can run in two minutes, a four-step tool workflow built around sentence-level highlights, the ESL false-positive caveat with calibration numbers, and three confidence tiers for converting a score into a decision. The score is never the verdict by itself; the methodology around the score is what makes the verdict hold up.

Classifier families

Pick the right detector family for the job.

Not all detectors are doing the same thing under the hood. The three families differ in what they measure, what they generalise to, and where they fail. Knowing the family tells you how much weight to give the score before you even read it.

ML-based classifiers (TextSight, GPTZero)

Transformer models trained directly on paired corpora of millions of human and AI samples. They learn the joint distribution across many signals at once and adapt as new models ship. This family holds up best on GPT-4, Claude, Gemini, and Llama output, which is what most submissions in 2026 actually look like. TextSight runs an ensemble of a transformer classifier and a hand-tuned signal layer with a calibration network on top. GPTZero leans transformer-first with a perplexity head as backup. For educators, hiring managers, and editors handling real volume, this is the right default family because the accuracy floor is higher on newer models.

Perplexity-based tools (older 2022-era detectors)

Run the text through a smaller language model, measure how surprising the next word is on average, and report the result. The original GPTZero and many open-source tools started here. The approach worked passably on GPT-3, dropped on GPT-3.5, and collapses on GPT-4 and Claude because frontier models now sample with higher temperature and produce higher-perplexity prose. Use perplexity-only tools as a cross-check second opinion, not as your primary detector in 2026.
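As a rough illustration of the measurement described above, perplexity is the exponential of the average negative log-probability per token. The sketch below assumes the per-token log-probabilities have already been obtained from some scoring model; that model, and the sample probabilities, are hypothetical and not taken from any particular detector.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average surprise per token, then exponentiate.
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a scoring model might assign to each next word.
predictable = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprising = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

print(perplexity(predictable))  # about 2: the model was rarely surprised
print(perplexity(surprising))   # about 10: the model was often surprised
```

A perplexity-only tool reduces a whole passage to this one number, which is why it has less to work with than a classifier that weighs many signals together.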

Hybrid ensembles (Originality.ai, Copyleaks Enterprise)

Combine a transformer head with explicit signal-layer measurements and sometimes a stylometric model for authorship. When a new model launches and the transformer head has not yet been retrained, the explicit signal layer still catches obvious cases. The trade-off is configuration cost; hybrid ensembles usually expose more knobs, which is great for teams with calibration discipline and overwhelming for everyone else. For methodology-curious educators, the hybrid family is worth understanding even if you do not use one daily.

The manual checklist

Six signals you can check by reading carefully.

A trained reader can run this six-point check in two minutes. Any single signal can appear in genuine human writing. Two or more in the same short passage is when the case starts to firm up, and three or more alongside a tool flag is usually conclusive.

1. Uniform sentence length

Count the words in five consecutive sentences. If every sentence lands between 16 and 22 words, burstiness is low and the passage reads as AI even when the vocabulary is clean. Human writers vary length deliberately. A 30-word sentence followed by a five-word punchline is the cleanest signal that a piece was written rather than generated. Watch for paragraphs where no sentence drops below twelve words.
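The five-sentence count can be automated for longer pieces. This is a minimal sketch, assuming a naive sentence splitter (it breaks on ., ! or ? followed by whitespace, so abbreviations will confuse it); the function names are illustrative, and the 16 to 22 band and five-sentence window come from the check above.

```python
import re

def sentence_lengths(text):
    """Word count per sentence, using a naive punctuation-based split."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def uniform_length_signal(text, low=16, high=22, window=5):
    """True if any `window` consecutive sentences all fall within [low, high] words."""
    lengths = sentence_lengths(text)
    for i in range(len(lengths) - window + 1):
        if all(low <= n <= high for n in lengths[i:i + window]):
            return True
    return False

# Five 18-word sentences in a row trip the signal.
flat = " ".join(["word " * 17 + "end."] * 5)
print(sentence_lengths(flat), uniform_length_signal(flat))

# Alternating long and short sentences does not.
varied = " ".join(["word " * 17 + "end.", "No."] * 3)
print(sentence_lengths(varied), uniform_length_signal(varied))
```

A positive result is one signal among six, not a finding in its own right.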

2. Transition phrase clustering

Furthermore, Moreover, In addition, Additionally, In conclusion. ChatGPT stacks these at paragraph boundaries. Humans usually trust the paragraph break itself to carry the transition. If three or four consecutive paragraphs each open with one of these phrases, the writing is templated regardless of what the headline score says. The fix when editing is to delete the transition; the diagnostic when detecting is to count them.
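Counting transition openers is easy to script. The sketch below assumes paragraphs are separated by blank lines and checks only the five phrases listed above; the function names are illustrative.

```python
TRANSITIONS = ("furthermore", "moreover", "in addition", "additionally", "in conclusion")

def transition_openers(text):
    """One boolean per paragraph: does it open with a listed transition phrase?"""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p.lower().startswith(TRANSITIONS) for p in paragraphs]

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

sample = (
    "Furthermore, the data supports this.\n\n"
    "Moreover, the trend is consistent.\n\n"
    "In addition, costs have fallen.\n\n"
    "Nobody expected that."
)
flags = transition_openers(sample)
print(flags, longest_run(flags))
```

A run of three or four matches the templated pattern described above; an isolated "Moreover" does not.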

3. Vocabulary clustering (delve, tapestry, navigate)

Frontier models have favourite words. The reliable tells in 2026 are delve, robust, leverage as a verb, navigate used metaphorically, underscore, showcase, myriad, tapestry, multifaceted, and foster. Two or three in a 500-word passage is statistically unusual for natural writing. Five or more is a near-certainty. Most undergraduate writers and most working journalists use zero or one of these in a typical piece. The cluster, not any single word, is the signal.
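The cluster count can be approximated with a word tally normalised to 500 words. This is a rough sketch under stated limits: it matches exact word forms only, so it misses inflections such as "delves", and it cannot tell "leverage" the verb from the noun or a metaphorical "navigate" from a literal one. The word list is the one given above; the function names are illustrative.

```python
import re
from collections import Counter

TELL_WORDS = {
    "delve", "robust", "leverage", "navigate", "underscore",
    "showcase", "myriad", "tapestry", "multifaceted", "foster",
}

def tell_word_hits(text):
    """Return (Counter of tell words found, total word count)."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter(w for w in words if w in TELL_WORDS)
    return hits, len(words)

def hits_per_500(text):
    """Tell-word occurrences normalised to a 500-word passage."""
    hits, total = tell_word_hits(text)
    return 0.0 if total == 0 else sum(hits.values()) * 500 / total

# A 500-word passage containing three tell words.
passage = "We delve into a robust tapestry. " + "filler " * 494
print(hits_per_500(passage))
```

Read the result against the bands above: two or three per 500 words is unusual, five or more is a strong cluster.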

4. Tripled adjectives

"A robust, comprehensive, multifaceted approach." Three adjectives stacked in front of one noun is a strong AI tell. Human writers usually pick one adjective and let the noun do the work, or replace the stack with a specific example. Count tripled-adjective constructions per page. Two or three per 500 words is unusual outside of AI prose, and they tend to cluster in the same sentences as the vocabulary tells above.

5. Polite-assistant register

Frontier models inherit their tuning from chatbot conversations, which leaves a distinctive politeness layer in long-form prose: hedges like "it is worth noting," "it is important to consider," "while X, it is also true that Y," summary nudges like "in essence" and "ultimately," and balanced "on one hand, on the other hand" constructions. Humans hedge too, but they vary the phrasing. AI uses the same three or four hedges across an entire piece. Uniform hedging is one of the cleanest signals when present.

6. Summary closers

AI prose tends to end every section with a closer that restates the thesis or summarises the section. Watch for paragraphs that end with "in summary," "overall," "to recap," or a sentence that reiterates what the paragraph already said. Human writers more often end on a beat, a turn, a question, or simply the next thought. Five sections in a row that each close by tying back to the headline thesis is templated, not earned.

The four-step scan

The tool workflow: scan, review, cross-verify, tier.

The methodology is what makes the score load-bearing. Run all four steps; a tool flag without manual cross-verification is fragile, and a manual read without a tool flag is too easy to argue with.

Step 1: Scan with a calibrated ML classifier

Paste the text into TextSight at app.textsight.ai. Free for three scans a day without signup. You get an overall 0 to 100 score, a per-sentence highlight map, and a bundled Plagiarism Risk score that catches copy-paste from public sources in the same scan. The ML-based head is calibrated against Turnitin-correlated patterns, so the score lines up with what an institutional integrity workflow would surface independently.

Step 2: Review sentence-level highlights, not just the headline

Look at the highlight map. Are the red sentences clustered in one section or scattered across the piece? Clustered red sentences are stronger evidence than the same percentage spread thinly. Are the flagged sentences carrying obvious vocabulary tells, transition clusters, or tripled adjectives? If yes, you have specific anchors you can quote in a follow-up conversation. The headline percentage is the summary; the highlights are the case.
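One way to put a number on "clustered versus scattered" is to ask what share of the flagged sentences sit in the single longest run. The sketch below assumes the highlight map has been reduced to one boolean per sentence; that representation and the function names are hypothetical, not an export format any tool is known to provide.

```python
from itertools import groupby

def flagged_runs(flags):
    """Lengths of each run of consecutive flagged sentences."""
    return [sum(1 for _ in group) for flagged, group in groupby(flags) if flagged]

def cluster_share(flags):
    """Fraction of all flagged sentences that fall in the longest single run."""
    runs = flagged_runs(flags)
    return max(runs) / sum(runs) if runs else 0.0

# Both passages have 4 of 10 sentences flagged.
clustered = [False, False, True, True, True, True, False, False, False, False]
scattered = [True, False, False, True, False, True, False, False, True, False]

print(cluster_share(clustered))  # every flagged sentence is in one block
print(cluster_share(scattered))  # flagged sentences are spread thinly
```

Both samples carry the same headline percentage, which is the point: the share separates the two cases where the headline number cannot.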

Step 3: Cross-verify against the six-signal checklist

Go back to the text and run the six-signal manual check from the previous section. Mark each signal that appears. If the tool flagged 78 percent AI and you find four of six manual signals, the two methods agree and the verdict is firm. If the tool flagged 78 percent and you find only one manual signal, the score is real but the case is weaker; consider a second independent classifier before acting. Methodology means using both ends of the workflow against each other.

Step 4: Place the result in a confidence tier

Above 85 percent on a calibrated ML classifier with three or more manual signals is high-confidence: a defensible flag that warrants a closer review or a conversation about process. Between 60 and 85 percent is medium-confidence: bring in a second tool, or look for clustered residual highlights, or talk to the writer before drawing a conclusion. Below 60 percent is low-confidence: the writing usually reads as a human writer, possibly one with a formal or structured style. Tier the result before you tier the response.

ESL writers and the calibration gap.

If you are running this workflow on any classroom, hiring pipeline, or editorial queue with non-native English writers, the ESL caveat is the single most important thing on this page. Get this wrong and the methodology produces unjust outcomes regardless of how clean the score looks.

What the research actually says

Multiple peer-reviewed studies published since 2023 have shown that AI detectors flag English-as-a-second-language writing as AI-written at roughly three to five times the rate of native English writing on the same task. The cause is structural, not a malfunction. Learned-second-language English tends to use more uniform sentence shapes, a narrower active vocabulary, and more formal register, all of which overlap with the statistical signature classifiers were trained to recognise. The detector is not failing in those cases; it is correctly measuring something that happens to mean a different thing in ESL prose than in native prose.

What calibrated tools do about it

TextSight is tuned to flag ESL prose less readily than the open-source baselines it benchmarks against, by training on diverse English varieties rather than only US academic prose. The practical effect is a lower false-positive rate, not a zero false-positive rate. No detector eliminates the structural overlap; the best ones narrow it. If you know the writer is ESL, weight the score more cautiously and lean on the manual six-signal check, where the vocabulary clustering and tripled-adjective tells are more language-neutral than the burstiness or hedge density signals.

What to do operationally

If your queue includes ESL writers, build the calibration into the workflow rather than the score. Drop a flagged score by 15 to 20 points before deciding what tier it falls into. Require two manual signals plus the tool flag before treating a medium-confidence score as actionable. For high-stakes decisions like grades, hiring, or contract termination, never act on the score alone with an ESL writer; bring the per-sentence evidence into a conversation about process and drafts before drawing a conclusion.
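The two operational rules above can be written down so they are applied the same way every time. This is a sketch, not a calibrated procedure: the function names are illustrative, the default deduction of 15 is simply the low end of the 15 to 20 point range, and the 60 to 85 medium band is the one used elsewhere in this guide.

```python
def esl_adjusted_score(score, deduction=15):
    """Lower a flagged score by 15 to 20 points before tiering it."""
    if not 15 <= deduction <= 20:
        raise ValueError("the guide's range is 15 to 20 points")
    return max(0, score - deduction)

def medium_is_actionable(score, manual_signals, deduction=15):
    """Medium-band result (after adjustment) backed by at least two manual signals."""
    adjusted = esl_adjusted_score(score, deduction)
    return 60 <= adjusted <= 85 and manual_signals >= 2

print(esl_adjusted_score(70, deduction=20))       # 50: falls out of the medium band
print(medium_is_actionable(90, 2, deduction=20))  # adjusted to 70, two signals
print(medium_is_actionable(90, 1, deduction=20))  # adjusted to 70, one signal
```

"Actionable" here means only that the result is worth a conversation about process and drafts; for high-stakes decisions the score still never stands alone.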

From score to decision

Three confidence tiers for acting on the result.

A score is information, not a verdict. The tier converts the score plus the manual evidence into a defensible decision band. Most institutional integrity workflows in 2026 use a tiering structure like this whether or not they call it one.

High-confidence flag (above 85 percent)

The ML classifier returns above 85 percent AI, the highlight map shows clustered red sentences in a single section or across most paragraphs, and the manual checklist surfaces three or more of the six signals. This is the band where the tool and the methodology agree, and the case is defensible. Treat it as the start of a conversation about process and drafts, not as a verdict on its own. For graded or contracted work, pair the per-sentence evidence with a request to walk through the writing process; a genuine writer can reconstruct their process in two minutes, and the absence of that reconstruction is usually more diagnostic than the score itself.

Medium-confidence (60 to 85 percent)

The score is real but the case is contested. Some flagged sentences cluster, others scatter. The manual check surfaces one or two signals rather than three. This is where you cross-check with a second independent classifier; if both detectors agree on the high end of the medium band, treat it closer to high-confidence. If they disagree, the result is closer to inconclusive and should not drive a unilateral action. Medium-confidence is also where ESL calibration matters most; a 70 percent score on an ESL writer often drops to 50 percent once the structural overlap is accounted for.

Low-confidence (under 60 percent)

The text usually reads as written by a human, possibly one with a structured or formal style. Expect scattered red highlights with no clustering, zero or one manual signal, and no model-specific vocabulary tells. This is the band where action would be unjust and where the methodology saves you from over-reading a noisy number. Document the result if your workflow requires a record, but treat the writing as human unless new evidence changes the picture.
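The three bands can be sketched as a small function. The thresholds are the ones given above. One case is not covered by the guide and is filled in here as an assumption: a score above 85 with fewer than three manual signals is treated as medium, because the tool and the manual read do not yet agree. The function name is illustrative.

```python
def confidence_tier(score, manual_signals):
    """Map a 0-100 detector score plus a manual signal count (0-6) to a tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 60:
        return "low"
    if score > 85 and manual_signals >= 3:
        return "high"
    # Covers 60 to 85, and (by assumption) above 85 without three manual signals.
    return "medium"

print(confidence_tier(90, 4))  # high
print(confidence_tier(78, 1))  # medium
print(confidence_tier(55, 0))  # low
print(confidence_tier(90, 1))  # medium, under the assumption stated above
```

The tier names the kind of follow-up that is defensible; it does not replace the conversation, the second classifier, or the reviewer's judgement.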

When the simple tiers break

Edge cases worth naming explicitly.

The clean three-tier model handles most cases. These three edge patterns are where the methodology has to flex, and where most operational mistakes happen in real-world workflows.

AI-edited human content

A writer drafts the piece themselves, then runs it through a chatbot for grammar polish or light tightening. The underlying reasoning, structure, and voice are theirs, but the surface prose now carries AI-edit fingerprints. Detectors usually score this in the medium tier with scattered rather than clustered red highlights. The methodology call is to treat the work as human-authored with AI assistance, document the assistance honestly, and not penalise the structural pattern. This is the most common false-positive pattern in 2026 hiring and freelance pipelines.

Heavily-rewritten AI content

The mirror case. A piece that was AI-generated but then carefully rewritten by a human can come back below 60 percent overall, which sounds clean. The diagnostic is whether residual flagged sentences cluster. If three or four sentences in a single paragraph still flag red while the rest of the piece is clean, the residual signal means "AI text with human polish," which is a different story than human writing throughout. Use the highlight map, not just the headline number, to surface this case.

Mixed-author drafts

Common in collaborative work and graduate writing. Different sections by different authors produce uneven highlight density across the piece. The headline score averages out to a noisy mid-tier number that obscures which sections are actually flagged. The methodology call is to read each section's highlight density on its own rather than the global percentage. A 55 percent overall score with one section at 90 percent and the rest at 20 percent is a localised case for that section, not a verdict on the whole draft.
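The arithmetic behind that example can be checked directly. The sketch below assumes the headline score behaves like a word-weighted average of per-section scores, which is a simplification and may not be how any particular detector aggregates; the section sizes and function names are hypothetical.

```python
def overall_score(sections):
    """Word-weighted mean of per-section scores; sections is [(word_count, score), ...]."""
    total_words = sum(words for words, _ in sections)
    if total_words == 0:
        raise ValueError("sections contain no words")
    return sum(words * score for words, score in sections) / total_words

def localised_sections(sections, threshold=85):
    """Indices of sections that score above the high-confidence threshold on their own."""
    return [i for i, (_, score) in enumerate(sections) if score > threshold]

# Two equal-length sections: one at 90 percent, one at 20 percent.
draft = [(400, 90), (400, 20)]
print(overall_score(draft))       # 55.0
print(localised_sections(draft))  # [0]
```

Under that assumption, a 55 percent headline number and a 90 percent section are both true at once, which is why the per-section reading is the one to act on.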

FAQ

Detecting AI writing: frequently asked questions.

What is the difference between ML-based and perplexity-based AI detectors?

ML-based classifiers like TextSight and GPTZero are transformer models trained on millions of paired human and AI samples. They learn ensemble weightings across many signals. Perplexity-based tools, the older 2022 generation, score the text against a single language model and report how surprising the next word is. ML-based classifiers outperform perplexity alone by 15 to 20 percentage points of accuracy on GPT-4 and Claude output. Hybrid detectors combine both heads.

What manual signals point to AI writing?

The working checklist has six signals: uniform sentence length, where every sentence lands between 16 and 22 words; transition phrase clustering, with Furthermore, Moreover, and In addition stacked at paragraph starts; vocabulary clustering on delve, tapestry, navigate, and robust; tripled adjectives in front of one noun; a polite-assistant register that repeats the same few hedges; and summary closers that restate the thesis. Two or more in a short passage is a meaningful signal.

How should I read the overall confidence score?

Use three tiers. Above 85 percent on a calibrated ML classifier is a high-confidence flag and warrants a closer review or conversation. Between 60 and 85 is medium-confidence, where the text could be AI or a structured human writer, and a second-tool cross-check is the right next step. Below 60 percent is low-confidence and usually indicates a human writer, possibly one with a formal or templated style.

Are ESL writers more likely to be false-flagged?

Yes. Multiple peer-reviewed studies show that AI detectors flag English-as-a-second-language writing as AI-written at roughly three to five times the rate of native English writing. Learned-second-language English uses more uniform sentence shapes and a narrower active vocabulary, which overlaps with the AI signature. Calibrated detectors like TextSight run lower false-positive rates on ESL prose than open-source baselines, but no tool eliminates the risk.

What about AI-edited human content or heavily rewritten AI text?

These are the hardest edge cases. A human draft lightly polished by AI usually scores in the medium tier with scattered red highlights rather than clusters. Heavily rewritten AI text can come back below 60 percent overall but still show clustered residual flags on specific paragraphs. Mixed-author drafts often look like a 40 to 70 percent score with uneven highlight density across sections. Use the highlight map, not just the headline number.

Can I detect AI writing without using any tool?

Partially. A trained reader can spot two or three of the six manual signals reliably, especially vocabulary clustering and tripled adjectives. But the human eye is unreliable at estimating sentence-length variance precisely and at separating natural formal prose from AI templating. The defensible workflow combines a calibrated tool with the manual checklist, not one or the other alone.

What workflow do educators and editors use in practice?

Scan first with a calibrated ML classifier like TextSight, look at the sentence-level highlights, cross-verify against the six-signal checklist, then place the result in the high, medium, or low tier. For high-tier flags, bring the per-sentence evidence into a conversation about process and drafts. For medium-tier results, run a second independent classifier. For low-tier scores, document and move on. The score never replaces judgement.

How long does this methodology take per piece?

About five minutes on a 500 to 800 word piece once you can recognise the six signals by sight. The tool scan takes 30 seconds; reading the highlights takes one to two minutes; manual signal cross-verification takes two to three minutes. Budget closer to ten minutes the first time through; the time drops sharply with practice.