Most "how to detect AI" articles stop at five surface tells and a button-press. This is the deeper guide built for educators, managers, and editors who need a defensible verdict rather than a vibe check. Inside: the classifier-family methodology that separates ML-based detectors from perplexity tools and hybrids, the six-signal manual checklist a trained reader can run in two minutes, a four-step tool workflow built around sentence-level highlights, the ESL false-positive caveat with calibration numbers, and three confidence tiers for converting a score into a decision. The score is never the verdict by itself; the methodology around the score is what makes the verdict hold up.
Not all detectors are doing the same thing under the hood. The three families differ in what they measure, what they generalise to, and where they fail. Knowing the family tells you how much weight to give the score before you even read it.
ML-based classifiers are transformer models trained directly on paired corpora of millions of human and AI samples. They learn the joint distribution across many signals at once and can be retrained as new models ship. This family holds up best on GPT-4, Claude, Gemini, and Llama output, which is what most submissions in 2026 actually look like. TextSight runs an ensemble of a transformer classifier and a hand-tuned signal layer with a calibration network on top. GPTZero leans transformer-first with a perplexity head as backup. For educators, hiring managers, and editors handling real volume, this is the right default family because the accuracy floor is higher on newer models.
Perplexity tools run the text through a smaller language model, measure how surprising the next word is on average, and report the result. The original GPTZero and many open-source tools started here. The approach worked passably on GPT-3, dropped on GPT-3.5, and collapses on GPT-4 and Claude, whose prose a small scoring model no longer finds reliably predictable, so its perplexity overlaps with human writing. Use perplexity-only tools as a cross-check second opinion, not as your primary detector in 2026.
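The "how surprising the next word is on average" measurement is conventionally reported as perplexity: the exponential of the negative mean log-probability per token. The sketch below assumes you already have per-token natural-log probabilities from some scoring model; the function name and the sample values are illustrative, not taken from any particular detector.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-(mean natural-log probability per token))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities a small scoring model might assign.
predictable = [math.log(0.5)] * 4   # every token was given probability 0.5
surprising = [math.log(0.05)] * 4   # every token was given probability 0.05

print(perplexity(predictable), perplexity(surprising))
```

The first passage scores a perplexity of about 2 and the second about 20. A perplexity-only detector reads low values as "machine-like", which is exactly the assumption that stops holding once the generating model's prose is no longer predictable to the scoring model.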
Hybrid detectors combine a transformer head with explicit signal-layer measurements and sometimes a stylometric model for authorship. When a new model launches and the transformer head has not yet been retrained, the explicit signal layer still catches obvious cases. The trade-off is configuration cost; hybrid ensembles usually expose more knobs, which is great for teams with calibration discipline and overwhelming for everyone else. For methodology-curious educators, the hybrid family is worth understanding even if you do not use one daily.
A trained reader can run this six-point check in two minutes. Any single signal can appear in genuine human writing. Two or more in the same short passage is when the case starts to firm up, and three or more alongside a tool flag is usually conclusive.
Count the words in five consecutive sentences. If every sentence lands between 16 and 22 words, the burstiness signal is low and the passage reads AI even when the vocabulary is clean. Human writers vary length deliberately. A 30-word subordinate clause followed by a five-word punchline is the cleanest signal a piece was written rather than generated. Watch for paragraphs where no sentence drops below twelve words.
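The five-sentence count can be automated as a rough first pass. This is a minimal sketch: the sentence splitter is a naive regex (it will mis-split abbreviations such as "e.g."), and the function names are hypothetical. The window of five and the 16 to 22 word range come from the check above.

```python
import re

def sentence_lengths(text):
    """Split on ., ! or ? followed by whitespace; return words per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def uniform_windows(lengths, window=5, low=16, high=22):
    """Start indexes of every run of `window` consecutive sentences
    whose word counts all fall inside [low, high]."""
    return [
        i for i in range(len(lengths) - window + 1)
        if all(low <= n <= high for n in lengths[i:i + window])
    ]

# Word counts for seven sentences: five uniform ones, then a short and a long one.
lengths = [18, 17, 20, 21, 16, 5, 30]
print(uniform_windows(lengths))
```

Only the window starting at the first sentence is reported; the five-word and 30-word sentences break every later window, which is the variation the check treats as a human signal.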
Furthermore, Moreover, In addition, Additionally, In conclusion. ChatGPT stacks these at paragraph boundaries. Humans usually trust the paragraph break itself to carry the transition. If three or four consecutive paragraphs each open with one of these phrases, the writing is templated regardless of what the headline score says. The fix when editing is to delete the transition; the diagnostic when detecting is to count them.
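Counting the openers is mechanical enough to script. The sketch below assumes paragraphs are separated by blank lines and uses only the five phrases listed above; the function name is hypothetical, and simple prefix matching will occasionally over-count (for example "In additional cases").

```python
TRANSITIONS = ("furthermore", "moreover", "in addition", "additionally", "in conclusion")

def transition_opener_run(text):
    """Longest run of consecutive paragraphs that open with a stock transition."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    longest = current = 0
    for paragraph in paragraphs:
        if paragraph.lower().startswith(TRANSITIONS):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

sample = (
    "Furthermore, the data agrees.\n\n"
    "Moreover, the trend holds.\n\n"
    "In addition, costs fell.\n\n"
    "Nobody expected that.\n\n"
    "Additionally, sales rose."
)
print(transition_opener_run(sample))
```

A result of three or four matches the "templated" threshold described above.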
Frontier models have favourite words. The reliable tells in 2026 are delve, robust, leverage as a verb, navigate used metaphorically, underscore, showcase, myriad, tapestry, multifaceted, and foster. Two or three in a 500-word passage is statistically unusual for natural writing. Five or more is a near-certainty. Most undergraduate writers and most working journalists use zero or one of these in a typical piece. The cluster, not any single word, is the signal.
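A quick way to apply the cluster test is to count hits and normalise to a 500-word passage. This is a sketch under stated assumptions: the pattern matches word stems, so it cannot tell "leverage" the verb from the noun or a literal "navigate" from a metaphorical one, and the function names are hypothetical. Treat the number as a prompt to read the flagged words in context.

```python
import re

# Stems for the ten vocabulary tells listed above.
TELL_PATTERN = re.compile(
    r"\b(delv\w*|robust\w*|leverag\w*|navigat\w*|underscor\w*|showcas\w*|"
    r"myriad|tapestr\w*|multifaceted|foster\w*)\b",
    re.IGNORECASE,
)

def count_tells(text):
    """Number of vocabulary-tell occurrences in the text."""
    return len(TELL_PATTERN.findall(text))

def tells_per_500_words(text):
    """Occurrences scaled to a 500-word passage."""
    words = len(text.split())
    return count_tells(text) * 500 / words if words else 0.0

passage = "We delve into a robust tapestry of ideas."
print(count_tells(passage), tells_per_500_words(passage))
```

On real submissions, run it over passages of a few hundred words or more; on a single sentence like the sample, the scaled rate is not meaningful.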
"A robust, comprehensive, multifaceted approach." Three adjectives stacked in front of one noun is a strong AI tell. Human writers usually pick one adjective and let the noun do the work, or replace the stack with a specific example. Count tripled-adjective constructions per page. Two or three per 500 words is unusual outside of AI prose, and they tend to cluster in the same sentences as the vocabulary tells above.
Frontier models inherit their tuning from chatbot conversations, which leaves a distinctive politeness layer in long-form prose: hedges like "it is worth noting," "it is important to consider," "while X, it is also true that Y," summary nudges like "in essence" and "ultimately," and balanced "on one hand, on the other hand" constructions. Humans hedge too, but they vary the phrasing. AI uses the same three or four hedges across an entire piece. Uniform hedging is one of the cleanest signals when present.
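Uniform hedging shows up as the same few phrases repeating, so a frequency count per phrase is more useful than a single total. The sketch below uses a subset of the hedges quoted above; the list and the function name are illustrative and should be extended for your own queue.

```python
HEDGES = (
    "it is worth noting",
    "it is important to consider",
    "in essence",
    "ultimately",
    "on one hand",
    "on the other hand",
)

def hedge_counts(text):
    """Map each hedge phrase that appears to its number of occurrences."""
    lowered = text.lower()
    return {h: lowered.count(h) for h in HEDGES if h in lowered}

sample = (
    "It is worth noting that costs fell. Ultimately, the plan worked. "
    "It is worth noting that risks remain."
)
print(hedge_counts(sample))
```

A long piece where two or three phrases account for nearly all the hedging is the uniform pattern described above; a piece with many different hedges used once each is not.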
AI prose tends to end every section with a closer that restates the thesis or summarises the section. Watch for paragraphs that end with "in summary," "overall," "to recap," or a sentence that reiterates what the paragraph already said. Human writers more often end on a beat, a turn, a question, or simply the next thought. Five sections in a row that each close by tying back to the headline thesis is templated, not earned.
The methodology is what makes the score load-bearing. Run all four steps; a tool flag without manual cross-verification is fragile, and a manual read without a tool flag is too easy to argue with.
Paste the text into TextSight at app.textsight.ai. Free for three scans a day without signup. You get an overall 0 to 100 score, a per-sentence highlight map, and a bundled Plagiarism Risk score that catches copy-paste from public sources in the same scan. The ML-based head is calibrated against Turnitin-correlated patterns, so the score lines up with what an institutional integrity workflow would surface independently.
Look at the highlight map. Are the red sentences clustered in one section or scattered across the piece? Clustered red sentences are stronger evidence than the same percentage spread thinly. Are the flagged sentences carrying obvious vocabulary tells, transition clusters, or tripled adjectives? If yes, you have specific anchors you can quote in a follow-up conversation. The headline percentage is the summary; the highlights are the case.
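The clustered-versus-scattered distinction can be made concrete with a simple run-length measure. This sketch assumes you have transcribed the highlight map by hand into one boolean per sentence (True for a red sentence); it does not call any detector API, and both function names are hypothetical.

```python
def longest_flagged_run(flags):
    """Length of the longest streak of consecutive flagged sentences."""
    longest = current = 0
    for flagged in flags:
        current = current + 1 if flagged else 0
        longest = max(longest, current)
    return longest

def cluster_share(flags):
    """Fraction of all flagged sentences that sit in the single longest streak."""
    total = sum(flags)
    return longest_flagged_run(flags) / total if total else 0.0

# Both passages have 4 of 10 sentences flagged (the same 40 percent).
clustered = [False, False, True, True, True, True, False, False, False, False]
scattered = [True, False, False, True, False, True, False, False, True, False]
print(cluster_share(clustered), cluster_share(scattered))
```

The two passages carry the same headline percentage, but the first has every flagged sentence in one streak and the second has none adjacent, which is why the highlights say more than the number.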
Go back to the text and run the six-signal manual check from the previous section. Mark each signal that appears. If the tool flagged 78 percent AI and you find four of six manual signals, the two methods agree and the verdict is firm. If the tool flagged 78 percent and you find only one manual signal, the score is real but the case is weaker; consider a second independent classifier before acting. Methodology means using both ends of the workflow against each other.
Above 85 percent on a calibrated ML classifier with three or more manual signals is high-confidence: a defensible flag that warrants a closer review or a conversation about process. Between 60 and 85 percent is medium-confidence: bring in a second tool, or look for clustered residual highlights, or talk to the writer before drawing a conclusion. Below 60 percent is low-confidence: the writing usually reads as a human writer, possibly one with a formal or structured style. Tier the result before you tier the response.
Free includes 3 detector scans a day and a 1,500-word AI rewriter quota. Paid tiers raise the quotas and add the Chrome extension, file upload, and REST API. Yearly billing saves 25%.
If you are running this workflow on any classroom, hiring pipeline, or editorial queue with non-native English writers, the ESL caveat is the single most important thing on this page. Get this wrong and the methodology produces unjust outcomes regardless of how clean the score looks.
Multiple peer-reviewed studies published since 2023 have shown that AI detectors flag English-as-a-second-language writing as AI-written at roughly three to five times the rate of native English writing on the same task. The cause is structural rather than a malfunction. English learned as a second language tends to use more uniform sentence shapes, a narrower active vocabulary, and a more formal register, all of which overlap with the statistical signature classifiers were trained to recognise. The detector is not failing in those cases; it is correctly measuring something that happens to mean a different thing in ESL prose than in native prose.
TextSight tunes its threshold roughly 40 percent lower for ESL prose than the open-source baselines it benchmarks against, by training on diverse English varieties rather than only US academic prose. The practical effect is a lower false-positive rate, not a zero false-positive rate. No detector eliminates the structural overlap; the best ones narrow it. If you know the writer is ESL, weight the score more cautiously and lean on the manual six-signal check, where the vocabulary clustering and tripled-adjective tells are more language-neutral than the burstiness or hedge density signals.
If your queue includes ESL writers, build the calibration into the workflow rather than the score. Drop a flagged score by 15 to 20 points before deciding what tier it falls into. Require two manual signals plus the tool flag before treating a medium-confidence score as actionable. For high-stakes decisions like grades, hiring, or contract termination, never act on the score alone with an ESL writer; bring the per-sentence evidence into a conversation about process and drafts before drawing a conclusion.
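The ESL adjustment is plain arithmetic, and writing it down keeps reviewers from applying it inconsistently. In this sketch the function names are hypothetical; the 15 to 20 point drop and the two-signal requirement come from the paragraph above, and the 60 to 85 medium band comes from the tier definitions earlier in the guide.

```python
def esl_adjusted_score(score, drop=15):
    """Lower a flagged 0-100 score by 15 to 20 points, never below zero."""
    if not 15 <= drop <= 20:
        raise ValueError("drop should stay within the 15 to 20 point range")
    return max(0, score - drop)

def medium_is_actionable(adjusted_score, manual_signals):
    """A medium-band score counts only with at least two manual signals."""
    return 60 <= adjusted_score <= 85 and manual_signals >= 2

raw = 70
adjusted = esl_adjusted_score(raw, drop=20)
print(adjusted, medium_is_actionable(adjusted, manual_signals=2))
```

A raw 70 adjusted by the full 20 points lands at 50, below the medium band, so it is not actionable however many manual signals appear. Even an actionable result here means "open a conversation about process and drafts", not "act on the score".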
A score is information, not a verdict. The tier converts the score plus the manual evidence into a defensible decision band. Most institutional integrity workflows in 2026 use a tiering structure like this whether or not they call it one.
In the high-confidence band, the ML classifier returns above 85 percent AI, the highlight map shows clustered red sentences in a single section or across most paragraphs, and the manual checklist surfaces three or more of the six signals. This is the band where the tool and the methodology agree, and the case is defensible. Treat it as the start of a conversation about process and drafts, not as a verdict on its own. For graded or contracted work, pair the per-sentence evidence with a request to walk through the writing process; a genuine writer can reconstruct their process in two minutes, and the absence of that reconstruction is usually more diagnostic than the score itself.
In the medium-confidence band, between 60 and 85 percent, the score is real but the case is contested. Some flagged sentences cluster, others scatter. The manual check surfaces one or two signals rather than three. This is where you cross-check with a second independent classifier; if both detectors agree on the high end of the medium band, treat it closer to high-confidence. If they disagree, the result is closer to inconclusive and should not drive a unilateral action. Medium-confidence is also where ESL calibration matters most; a 70 percent score on an ESL writer often drops to 50 percent once the structural overlap is accounted for.
In the low-confidence band, below 60 percent, the text usually reads as written by a human, possibly one with a structured or formal style. Expect scattered red highlights with no clustering, zero or one manual signal, and no model-specific vocabulary tells. This is the band where action would be unjust and where the methodology saves you from over-reading a noisy number. Document the result if your workflow requires a record, but treat the writing as human unless new evidence changes the picture.
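The three bands reduce to a small decision function. This is a sketch, not a policy: the 85 and 60 percent cut-offs come from the guide, the number of manual signals required for the high band is a parameter (default three, as in the high-confidence description above), and two choices are assumptions of this example: a score above 85 without enough manual signals falls back to medium, and a score of exactly 85 counts as medium.

```python
def confidence_tier(score, manual_signals, min_signals=3):
    """Map a 0-100 detector score plus a manual-signal count to a tier."""
    if score > 85 and manual_signals >= min_signals:
        return "high"
    if score >= 60:
        # Assumption: a high score the manual read does not support is
        # treated as contested (medium) rather than high-confidence.
        return "medium"
    return "low"

for score, signals in [(90, 4), (90, 1), (78, 4), (55, 0)]:
    print(score, signals, confidence_tier(score, signals))
```

Keeping the tier as a function of both inputs makes it harder to act on the percentage alone, which is the failure mode the tiers exist to prevent.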
The clean three-tier model handles most cases. These three edge patterns are where the methodology has to flex, and where most operational mistakes happen in real-world workflows.
A writer drafts the piece themselves, then runs it through a chatbot for grammar polish or light tightening. The underlying reasoning, structure, and voice are theirs, but the surface prose now carries AI-edit fingerprints. Detectors usually score this in the medium tier with scattered rather than clustered red highlights. The methodology call is to treat the work as human-authored with AI assistance, document the assistance honestly, and not penalise the structural pattern. This is the most common false-positive pattern in 2026 hiring and freelance pipelines.
The mirror case. A piece that was AI-generated but then carefully rewritten by a human can come back below 60 percent overall, which sounds clean. The diagnostic is whether residual flagged sentences cluster. If three or four sentences in a single paragraph still flag red while the rest of the piece is clean, the residual signal means "AI text with human polish," which is a different story than human writing throughout. Use the highlight map, not just the headline number, to surface this case.
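Checking for a residual cluster means looking per paragraph rather than at the overall number. The sketch below assumes each paragraph has been transcribed from the highlight map as a list of booleans (True for a red sentence); the function name is hypothetical, and the default of three flagged sentences follows the "three or four" rule of thumb above.

```python
def residual_cluster_paragraphs(paragraph_flags, min_flagged=3):
    """Indexes of paragraphs with at least `min_flagged` red sentences."""
    return [
        i for i, flags in enumerate(paragraph_flags)
        if sum(flags) >= min_flagged
    ]

# A mostly clean piece with one paragraph that still flags heavily.
piece = [
    [False, False, False, False],
    [True, True, False, True],
    [False, True, False, False],
    [False, False, False],
]
print(residual_cluster_paragraphs(piece))
```

Here only 4 of 15 sentences are flagged overall, yet the second paragraph (index 1) holds three of them, which is the pattern the headline number hides.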
Mixed-authorship drafts are common in collaborative work and graduate writing. Different sections by different authors produce uneven highlight density across the piece. The headline score averages out to a noisy mid-tier number that obscures which sections are actually flagged. The methodology call is to read each section's highlight density on its own rather than the global percentage. A 55 percent overall score with one section at 90 percent and the rest at 20 percent is a localised case for that section, not a verdict on the whole draft.
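The 55 percent example can be reproduced with a length-weighted average. This sketch assumes the headline score behaves like a word-count-weighted mean of per-section scores, which is an assumption of the example rather than a documented property of any detector; the section sizes and function names are hypothetical.

```python
def overall_score(sections):
    """Word-count-weighted mean of (word_count, score) pairs."""
    total_words = sum(words for words, _ in sections)
    return sum(words * score for words, score in sections) / total_words

def localised_sections(sections, threshold=85):
    """Indexes of sections whose own score exceeds the threshold."""
    return [i for i, (_, score) in enumerate(sections) if score > threshold]

# One 1,500-word section at 90 percent, three 500-word sections at 20 percent.
draft = [(1500, 90), (500, 20), (500, 20), (500, 20)]
print(overall_score(draft), localised_sections(draft))
```

The draft averages to 55 percent, which would land in the low tier, while the first section on its own sits well above the high-confidence cut-off. Reading sections separately is what surfaces it.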