Every detector publishes an accuracy number. Almost none of them mean what readers think they mean. This guide walks through what TPR and FPR actually measure, why ESL writing gets over-flagged, what the peer-reviewed literature says, and which tools hold up on identical passages. No marketing language, no hidden bias, no claim that any one tool is always right.
Three things to keep in mind before you read any vendor's claim about detection accuracy.
One. A vendor accuracy number is almost always a true positive rate measured on a vendor-chosen test set. It says how often the tool catches AI on writing similar to that set. It does not say how often it wrongly flags a real student.
Two. The number that determines whether you can trust a verdict is the false positive rate, broken out by writer type. A 1% FPR on native English can sit next to a 22% FPR on ESL writing inside the same tool. Vendors rarely publish that split. Independent researchers do.
Three. Every accuracy number degrades the moment a paraphraser, a round of human editing, or an unfamiliar topic enters the picture. Treat any score as a probability, not a fact.
Every detector pitch hides one of two numbers: the true positive rate (TPR) or the false positive rate (FPR). Reading them together is the only honest way to evaluate a tool.
TPR is the share of AI-generated passages the tool correctly flags. A 92% TPR means the detector catches 92 out of every 100 AI samples. Vendors love this number because it is easy to push up. It says nothing about cost. A tool with 99% TPR and a 30% FPR is worse than one with 91% TPR and a 3% FPR for any real classroom.
FPR is the share of human-written passages wrongly flagged as AI. This is the number that determines whether you can trust a verdict. On a class of 30 essays, a 5% FPR means roughly 1.5 students wrongly accused. A 22% FPR, which Stanford measured on TOEFL essays in 2023, means closer to 7. The cost of a false positive falls on the writer.
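The two definitions above reduce to simple ratios over a labeled test set. A minimal sketch, assuming you have the four outcome counts for a detector; the `ConfusionCounts` class and the example counts are illustrative, not taken from any vendor's data.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts from running a detector on a labeled test set."""
    true_positives: int   # AI passages flagged as AI
    false_negatives: int  # AI passages passed as human
    false_positives: int  # human passages flagged as AI
    true_negatives: int   # human passages passed as human


def true_positive_rate(c: ConfusionCounts) -> float:
    """Share of AI passages the detector correctly flags."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of human passages the detector wrongly flags."""
    return c.false_positives / (c.false_positives + c.true_negatives)


def expected_false_flags(fpr: float, human_essays: int) -> float:
    """Average number of human essays wrongly flagged in a batch."""
    return fpr * human_essays


# Illustrative counts: 100 AI samples and 100 human samples.
counts = ConfusionCounts(true_positives=92, false_negatives=8,
                         false_positives=5, true_negatives=95)

print(f"TPR: {true_positive_rate(counts):.0%}")
print(f"FPR: {false_positive_rate(counts):.0%}")
print(f"Class of 30 at 5% FPR:  {expected_false_flags(0.05, 30):.1f} wrong flags")
print(f"Class of 30 at 22% FPR: {expected_false_flags(0.22, 30):.1f} wrong flags")
```

Note that the two rates use different denominators: TPR is computed only over AI samples, FPR only over human samples. That is why a high TPR tells you nothing about the FPR.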
Every detector has one knob: the threshold. Lower it and TPR rises but FPR rises too. Raise it and FPR drops but TPR drops with it. The vendor's headline number is whichever point on the curve makes the marketing read best. To compare honestly, fix the threshold or compare full curves.
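The threshold trade-off is easy to see on a small set of scores. A sketch, assuming the detector returns a probability-of-AI score between 0 and 1 and flags anything at or above the threshold; the scores below are made up for illustration.

```python
def rates_at_threshold(ai_scores, human_scores, threshold):
    """Return (TPR, FPR) when every score >= threshold is flagged as AI."""
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    return tpr, fpr


# Made-up detector scores, for illustration only.
ai_scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.98, 0.80, 0.65]
human_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.58, 0.15, 0.72, 0.25, 0.35]

# Sweep the threshold upward: both rates fall together.
for threshold in (0.3, 0.5, 0.7, 0.9):
    tpr, fpr = rates_at_threshold(ai_scores, human_scores, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.0%}  FPR={fpr:.0%}")
```

Each printed row is one point on the curve. A headline number is one such row, quoted without the other rows or the threshold that produced it.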
ML literature calls TPR "recall" and pairs it with "precision" (of everything flagged, how much was AI). Marketing pages translate these into "accuracy" or "detection rate" without disclosing the test set or threshold. If a page says "99% accurate" without splitting TPR and FPR, the writer either does not know the difference or is hoping you do not.
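Precision depends on TPR, FPR, and how much of the incoming writing is AI in the first place. A sketch using the two hypothetical tools compared earlier (99% TPR with 30% FPR, and 91% TPR with 3% FPR); the 10% share of AI-written submissions is an assumption chosen for illustration, not a measured figure.

```python
def precision(tpr: float, fpr: float, ai_share: float) -> float:
    """Of everything flagged, the fraction that really is AI-written.

    ai_share is the fraction of all submissions that are AI-written
    (the base rate). It is an input you must estimate for your setting.
    """
    flagged_ai = tpr * ai_share
    flagged_human = fpr * (1.0 - ai_share)
    return flagged_ai / (flagged_ai + flagged_human)


AI_SHARE = 0.10  # assumed base rate, for illustration only

high_recall_tool = precision(tpr=0.99, fpr=0.30, ai_share=AI_SHARE)
low_fpr_tool = precision(tpr=0.91, fpr=0.03, ai_share=AI_SHARE)

print(f"99% TPR / 30% FPR -> precision {high_recall_tool:.0%}")
print(f"91% TPR /  3% FPR -> precision {low_fpr_tool:.0%}")
```

Under that assumed base rate, most flags from the high-recall tool land on human writers, while most flags from the low-FPR tool are correct. A lower base rate makes both numbers worse.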
Over-flagging of ESL writing is the single most documented failure mode in the academic literature. It is worth understanding before trusting any verdict that involves a non-native English writer.
In July 2023, Liang and colleagues at Stanford published a paper in Patterns (Cell Press) measuring detector accuracy on TOEFL essays by non-native English speakers. More than half were misclassified as AI-generated by mainstream detectors, with an average false positive rate of about 61% across the detectors tested. On essays from US eighth-graders, the same detectors held under 5%. The detectors were reading the structural footprint of formally taught English as machine-generated.
Classical detectors score perplexity and burstiness. Second-language academic writing uses a constrained vocabulary, follows taught templates, and produces uniform sentence lengths. All three reduce perplexity and burstiness. The signal the detector reads as "machine" overlaps the signal of a careful non-native writer. Not a bug in any one tool: a property of the underlying method.
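A toy example shows why constrained vocabulary lowers perplexity. This is a sketch only: real detectors score text with a neural language model, while the function below uses a unigram word-frequency model with add-one smoothing, an assumption made to keep the example self-contained. The reference text and sample sentences are invented.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under a unigram model built from `reference`.

    Lower values mean the words were more predictable to the model.
    Add-one smoothing reserves one extra vocabulary slot for unseen words.
    """
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    vocab = len(ref_counts) + 1  # +1 slot shared by all unseen words
    words = text.lower().split()
    log_prob = sum(
        math.log((ref_counts[w] + 1) / (total + vocab)) for w in words
    )
    return math.exp(-log_prob / len(words))


# Invented reference text standing in for "what the model has seen".
reference = ("the study shows that the results of the study are clear "
             "and the method is sound")

predictable = "the study shows the results"
surprising = "quirky marmots juggle neon kazoos"

print(f"predictable: {unigram_perplexity(predictable, reference):.1f}")
print(f"surprising:  {unigram_perplexity(surprising, reference):.1f}")
```

The sentence built from common, expected words scores lower. A careful writer who stays inside a taught vocabulary produces the same effect without any machine involved.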
Most major vendors have re-tuned since the Stanford paper. GPTZero shipped a 2024 ESL update. Turnitin recalibrated threshold defaults. Originality.ai added a language-aware second pass. Gains are uneven: independent retests since 2024 still measure ESL FPR between 14% and 25%. In our June 2026 benchmark, TextSight measures 6% on our sample mix.
If you write in formally taught English (Indian-curriculum, Filipino academic, Chinese university register), expect more false flags than the vendor's headline number predicts. Pre-scan before submission. Re-scan any flag on a second independent tool. If you teach or grade ESL essays, treat any single verdict as a starting point for a conversation, not as evidence.
If your prose has any of these traits, expect a higher false positive rate from any detector. Knowing the pattern in advance lets you defend the work.
Writers who stick to a controlled vocabulary, whether by training, register, or genre, produce text the detector reads as predictable. STEM students, legal writers, and disciplined ESL writers sit in this band. The fix is not to inject random vocabulary; it is to know the pattern and name it in any review.
Burstiness measures how sentence length varies across a paragraph. Human writing tends to be spiky. Polished academic prose, especially in second-language or technical registers, smooths the variance out. Detectors read smoothness as machine generation. This is the single biggest reason carefully edited human writing gets flagged.
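Burstiness can be approximated in a few lines. A sketch, assuming burstiness is taken as the coefficient of variation of sentence length (standard deviation divided by mean); vendors do not publish their exact formulas, so this is one plausible definition, not any tool's actual metric. The sentence splitter is deliberately naive.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, splitting naively on . ! and ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length (std dev / mean).

    0.0 means every sentence has the same length. Higher means spikier.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.pstdev(lengths) / statistics.mean(lengths)


smooth = ("The method is simple and clear. The results are stable and strong. "
          "The model is small and fast.")
spiky = ("It failed. We reran the whole pipeline overnight with a new seed "
         "and twice the data. Nothing changed.")

print(f"smooth: {burstiness(smooth):.2f}")
print(f"spiky:  {burstiness(spiky):.2f}")
```

Both samples are human-written. The smooth one scores zero variation, which is the pattern a burstiness-based detector would treat as machine-like.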
"In conclusion," "furthermore," "on the other hand" are taught in formal writing instruction and used in good faith. They are also overrepresented in early GPT and Claude output. A paragraph that opens with a learned transition and closes with a hedging clause looks like every freshman essay the model saw during training.
Engineering reports, financial briefs, and clinical write-ups lean on bullet lists, numbered steps, and parallel grammatical structure. All three are signals machine-generated text exhibits at high rates. STEM students get over-flagged for the same reason ESL writers do: genre conventions overlap the AI signal.
Most detectors need four to six sentences before they score reliably. Below 250 words, scores swing widely and false positives spike. Run short snippets through at least two tools and weight any verdict accordingly.
The table sets each vendor's self-published FPR (from its own docs) next to the FPR measured on the same 100 native English and 100 ESL passages in our June 2026 benchmark.
| Detector | Self-published FPR | Measured FPR (native EN) | Measured FPR (ESL) | Source / citation |
|---|---|---|---|---|
| GPTZero | ~1% | 4% | 22% | GPTZero 2024 docs · Stanford 2023 |
| Turnitin | 4% | ~5% | 14 to 18% | Turnitin docs · Weber-Wulff 2023 |
| Originality.ai | <2% | 6% | 19% | Originality.ai docs · internal 2026 |
| Copyleaks | <1% | 8% | 17% | Copyleaks docs · Weber-Wulff 2023 |
| Winston AI | <1% | 7% | 16% | Vendor claim · internal 2026 |
| Sapling | not published | 9% | 18% | Internal 2026 benchmark |
| ZeroGPT | not published | 12% | 25% | Internal 2026 benchmark |
| Crossplag | not published | 11% | 21% | Weber-Wulff 2023 |
| Content at Scale | <2% | 8% | 19% | Vendor claim · internal 2026 |
| Writer.com | not published | 7% | 15% | Internal 2026 benchmark |
| Smodin | not published | 13% | 24% | Internal 2026 benchmark |
| QuillBot Detector | not published | 9% | 18% | Internal 2026 benchmark |
| Scribbr | <1% | 6% | 14% | Vendor claim · internal 2026 |
| Hive Moderation | not published | 8% | 16% | Internal 2026 benchmark |
| TextSight | 2% | 2% | 6% | June 2026 internal benchmark |
Self-published numbers from each vendor's docs as of June 2026. Measured numbers from TextSight's internal benchmark on 100 ESL + 100 native English passages (methodology). Independent citations: Weber-Wulff et al., International Journal for Educational Integrity, 2023; Liang et al., Patterns (Cell Press), 2023.
A focused four-tool head-to-head on the four passage types that matter most. Methodology, threshold, and sample notes below the table.
| Passage segment | GPTZero TPR/FPR | Turnitin TPR/FPR | Originality TPR/FPR | TextSight TPR/FPR |
|---|---|---|---|---|
| Native English AI (GPT-4, n=100) | 96% TPR | 93% TPR | 95% TPR | 97% TPR |
| Native English Human (n=100) | 4% FPR | 5% FPR | 6% FPR | 2% FPR |
| ESL Human (academic, n=100) | 22% FPR | 16% FPR | 19% FPR | 6% FPR |
| Humanized AI (post-edit, n=100) | 41% TPR | 58% TPR | 64% TPR | 78% TPR |
Row 1. Four detectors within four points on raw GPT-4. The easy case. Not decision-grade alone.
Row 2. Field range is 2 to 6% FPR. A 2-point swing is two fewer wrongly flagged students per 100.
Row 3 is the decision row. The 16-point gap between GPTZero (22%) and TextSight (6%) is the difference between a class of 30 ESL essays with 7 wrong accusations and the same class with 2. If your population skews ESL, this row matters more than the other three combined.
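The class-of-30 arithmetic in Row 3 can be reproduced for all four tools. A sketch using the ESL FPR values from the table above; the probability calculation assumes each essay is flagged independently of the others, which is a simplification.

```python
# ESL false positive rates from Row 3 of the table above.
ESL_FPR = {
    "GPTZero": 0.22,
    "Turnitin": 0.16,
    "Originality": 0.19,
    "TextSight": 0.06,
}


def expected_wrong_flags(fpr: float, class_size: int) -> float:
    """Average number of human essays wrongly flagged in one class."""
    return fpr * class_size


def prob_at_least_one(fpr: float, class_size: int) -> float:
    """Chance that at least one human essay in the class is wrongly flagged.

    Assumes flags on different essays are independent (a simplification).
    """
    return 1.0 - (1.0 - fpr) ** class_size


CLASS_SIZE = 30
for tool, fpr in ESL_FPR.items():
    expected = expected_wrong_flags(fpr, CLASS_SIZE)
    at_least_one = prob_at_least_one(fpr, CLASS_SIZE)
    print(f"{tool:<12} expected wrong flags: {expected:.1f}   "
          f"P(at least one): {at_least_one:.0%}")
```

Even at a 6% FPR, a wrong flag somewhere in a class of 30 ESL essays is more likely than not under this assumption. A lower rate shrinks the problem; it does not remove the need to review each flag.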
Row 4. Where most detectors collapse. GPTZero drops to 41% TPR because paraphrasers add exactly the variance that perplexity scoring treats as a human signal.
A five-step protocol used by students, teachers, and editors to dispute a wrong AI flag. Practical, not adversarial.
The instinct after a flag is to rewrite. Resist it. A panic rewrite destroys version history, your strongest evidence. Capture Google Docs revision history, Word AutoRecover, Notion page history, or browser autosave. Edit timestamps showing 40 minutes of incremental revision are the closest thing to proof.
One verdict is a probability. Two agreeing is a stronger signal. Run the passage through a detector using a different signal family: if the first was GPTZero (perplexity), run TextSight (sentence rhythm). Disagreement is itself evidence the verdict is not reliable.
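How much a second verdict should move you can be sketched with Bayes' rule. This assumes the two detectors err independently given the true author, which is only plausible when they use different signal families, and it assumes a 10% prior chance that the passage is AI-written. Both are illustrative assumptions. The TPR and FPR values reuse the GPTZero and TextSight figures from the benchmark table (native English AI TPR, ESL human FPR).

```python
def posterior_ai(prior: float,
                 verdicts: list[tuple[float, float, bool]]) -> float:
    """Probability the passage is AI-written after a series of verdicts.

    Each verdict is (tpr, fpr, flagged). Verdicts are combined as if the
    detectors were conditionally independent, which is an assumption.
    """
    odds = prior / (1.0 - prior)
    for tpr, fpr, flagged in verdicts:
        if flagged:
            odds *= tpr / fpr                    # evidence toward AI
        else:
            odds *= (1.0 - tpr) / (1.0 - fpr)    # evidence toward human
    return odds / (1.0 + odds)


PRIOR = 0.10  # assumed base rate, for illustration only

one_flag = posterior_ai(PRIOR, [(0.96, 0.22, True)])
both_flag = posterior_ai(PRIOR, [(0.96, 0.22, True), (0.97, 0.06, True)])
disagree = posterior_ai(PRIOR, [(0.96, 0.22, True), (0.97, 0.06, False)])

print(f"first tool flags:          {one_flag:.0%}")
print(f"both tools flag:           {both_flag:.0%}")
print(f"first flags, second clears: {disagree:.0%}")
```

Under these assumptions a single flag leaves the passage more likely human than AI, agreement raises the probability sharply, and disagreement drops it below the starting prior. The exact figures move with the prior, so treat them as a way to reason, not as a verdict.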
Any reviewer using a detector verdict against you owes you three things: the published methodology, the threshold (for example, a 50%, 60%, or 80% confidence floor), and the per-sentence breakdown. Most academic integrity policies require this disclosure on request. Refusal is grounds for escalation.
Modern detectors show which sentences scored high and why. If the flagged sentences carry formulaic transitions, learned templates, or technical register, that pattern alone explains the flag and is worth naming. "This sentence flagged because it has low burstiness, a structural property of careful academic prose" lands differently than a generic denial.