
AI detector accuracy, explained honestly.

Every detector publishes an accuracy number. Almost none of them mean what readers think they mean. This guide walks through what TPR and FPR actually measure, why ESL writing gets over-flagged, what the peer-reviewed literature says, and which tools hold up on identical passages. No marketing language, no hidden bias, no claim that any one tool is always right.

15 detectors tested · 400 passages, same conditions · Peer-reviewed sources cited · Last verified

The short answer

What "accurate" really means here.

Three things to keep in mind before you read any vendor's claim about detection accuracy.

One. A vendor accuracy number is almost always a true positive rate measured on a vendor-chosen test set. It says how often the tool catches AI on writing similar to that set. It does not say how often it wrongly flags a real student.

Two. The number that determines whether you can trust a verdict is the false positive rate, broken out by writer type. A 1% FPR on native English can sit next to a 22% FPR on ESL writing inside the same tool. Vendors rarely publish that split. Independent researchers do.

Three. Every accuracy number degrades the moment a paraphraser, a round of human editing, or an unfamiliar topic enters the picture. Treat any score as a probability, not a fact.

Bottom line. The best detectors in 2026 sit around 90% to 97% TPR on raw AI and 2% to 6% FPR on careful native English. On ESL writing the field FPR range stretches from 6% (TextSight, internal) to 25% (worst-calibrated commercial tools). No single verdict should be treated as proof.

The two numbers that matter

TPR and FPR, in plain English.

Every detector pitch hides one of these two numbers. Reading them together is the only honest way to evaluate a tool.

True positive rate (TPR), or "did it catch the AI"

TPR is the share of AI-generated passages the tool correctly flags. A 92% TPR means the detector catches 92 out of every 100 AI samples. Vendors love this number because it is easy to push up. It says nothing about cost. A tool with 99% TPR and a 30% FPR is worse than one with 91% TPR and a 3% FPR for any real classroom.

False positive rate (FPR), or "how often it accuses the innocent"

FPR is the share of human-written passages wrongly flagged as AI. This is the number that determines whether you can trust a verdict. In a class of 30 essays, a 5% FPR means an expected 1.5 students wrongly accused. A 22% FPR, which our June 2026 benchmark measured for GPTZero on ESL essays, means closer to 7. The cost of a false positive falls on the writer.
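The class-size arithmetic above is easy to reproduce. A minimal Python sketch, assuming every essay in the class is human-written and that flags are independent across essays (both simplifications). The function names are ours, for illustration only.

```python
def expected_false_flags(class_size: int, fpr: float) -> float:
    """Expected number of human-written essays wrongly flagged as AI."""
    return class_size * fpr


def prob_at_least_one_false_flag(class_size: int, fpr: float) -> float:
    """Chance that at least one honest writer is flagged.

    Assumes each essay is flagged independently at the same rate,
    which is a simplification.
    """
    return 1.0 - (1.0 - fpr) ** class_size


for fpr in (0.05, 0.22):
    wrong = expected_false_flags(30, fpr)
    any_wrong = prob_at_least_one_false_flag(30, fpr)
    print(f"FPR={fpr:.0%}: expected wrong flags={wrong:.1f}, "
          f"chance of at least one={any_wrong:.0%}")
```

Even at a 5% FPR, a class of 30 is more likely than not to contain at least one wrong flag under these assumptions.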

The trade-off curve nobody shows you

Every detector has one knob: the threshold. Lower it and TPR rises but FPR rises too. Raise it and FPR drops but TPR drops with it. The vendor's headline number is whichever point on the curve makes the marketing read best. To compare honestly, fix the threshold or compare full curves.
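The threshold trade-off can be seen on a handful of scores. The sketch below uses made-up detector scores (not taken from any real tool) and a hypothetical helper, `rates_at_threshold`, to show TPR and FPR moving together as the threshold changes.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    ai_scores: Sequence[float],
    human_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (TPR, FPR) when every score >= threshold is flagged as AI."""
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    return tpr, fpr


# Illustrative "probability of AI" scores, invented for this example.
ai_scores = [0.95, 0.90, 0.85, 0.70, 0.55, 0.40]
human_scores = [0.60, 0.45, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.3, 0.5, 0.8):
    tpr, fpr = rates_at_threshold(ai_scores, human_scores, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

On this toy data the lowest threshold catches every AI sample but flags half the human ones, and the highest threshold flags no humans but misses half the AI. A headline number is one point picked from a sweep like this.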

Recall, precision, and the marketing fog

ML literature calls TPR "recall" and pairs it with "precision" (of everything flagged, how much was AI). Marketing pages translate these into "accuracy" or "detection rate" without disclosing the test set or threshold. If a page says "99% accurate" without splitting TPR and FPR, the writer either does not know the difference or is hoping you do not.
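Precision depends on TPR, FPR, and how much of the scanned writing is actually AI. A short Python sketch of that relationship follows. The 10% AI share is an assumption chosen for illustration, not a measured figure, and the function name is ours.

```python
def precision(tpr: float, fpr: float, ai_share: float) -> float:
    """Of everything flagged, the fraction that really is AI.

    ai_share is the assumed fraction of scanned passages that are AI-written.
    """
    flagged_ai = tpr * ai_share
    flagged_human = fpr * (1.0 - ai_share)
    total_flagged = flagged_ai + flagged_human
    if total_flagged == 0:
        return 0.0  # nothing was flagged, so precision is undefined; report 0
    return flagged_ai / total_flagged


# Assumption: 1 in 10 submitted passages is AI-written.
AI_SHARE = 0.10

print(f"99% TPR, 30% FPR -> precision {precision(0.99, 0.30, AI_SHARE):.0%}")
print(f"91% TPR,  3% FPR -> precision {precision(0.91, 0.03, AI_SHARE):.0%}")
```

Under that assumed mix, the 99% TPR tool is right about roughly 27% of its flags, while the 91% TPR tool is right about roughly 77%. The exact figures shift with the AI share, which is why a single "accuracy" number without TPR, FPR, and test-set mix tells you little.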

The bias problem

Why detectors flag ESL writers more often.

The single most documented failure mode in the academic literature. Worth understanding before trusting any verdict that involves a non-native English writer.

The Stanford finding

In July 2023, Liang and colleagues at Stanford published a paper in Patterns (Cell Press) measuring detector accuracy on TOEFL essays by non-native English speakers. Mainstream detectors misclassified more than half of the essays as AI-generated, with an average false positive rate of about 61%. On essays from US-born eighth-graders, the same detectors held under 5%. The detectors were reading the structural footprint of formally-taught English as machine-generated.

Why this happens at the signal level

Classical detectors score perplexity and burstiness. Second-language academic writing uses a constrained vocabulary, follows taught templates, and produces uniform sentence lengths. All three reduce perplexity and burstiness. The signal the detector reads as "machine" overlaps the signal of a careful non-native writer. Not a bug in any one tool: a property of the underlying method.

What's been done about it

Most major vendors have re-tuned since the Stanford paper. GPTZero shipped a 2024 ESL update. Turnitin recalibrated threshold defaults. Originality.ai added a language-aware second pass. Gains are uneven: independent retests since 2024 still measure ESL FPR between 14% and 25%. TextSight's June 2026 benchmark measures 6% on our sample mix.

The practical takeaway

If you write in formally-taught English (Indian-curriculum, Filipino academic, Chinese university register), expect more false flags than the vendor's headline number predicts. Pre-scan before submission. Re-scan any flag on a second independent tool. If you teach or grade ESL essays, treat any single verdict as a starting point for a conversation, not evidence.

Pattern recognition

Five writing patterns that trigger false flags.

If your prose has any of these traits, expect a higher false positive rate from any detector. Knowing the pattern in advance lets you defend the work.

1. Low perplexity (consistent vocabulary)

Writers who stick to a controlled vocabulary, whether by training, register, or genre, produce text the detector reads as predictable. STEM students, legal writers, and disciplined ESL writers sit in this band. The fix is not to inject random vocabulary; it is to know the pattern and name it in any review.

2. Low burstiness (uniform sentence length)

Burstiness measures how sentence length varies across a paragraph. Human writing tends to be spiky. Polished academic prose, especially in second-language or technical registers, smooths the variance out. Detectors read smoothness as machine generation. This is the single biggest reason carefully edited human writing gets flagged.
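One simple proxy for burstiness is the coefficient of variation of sentence length. The Python sketch below is our own illustration under that assumption. Commercial detectors do not publish their exact formulas, and the sentence splitter here is deliberately naive.

```python
import re
import statistics
from typing import List


def sentence_lengths(text: str) -> List[int]:
    """Word count per sentence, splitting naively after ., ! or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Std dev of sentence length divided by mean length (0 = perfectly uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = ("The method uses two inputs. The model reads both inputs. "
           "The output has one value.")
spiky = ("It failed. Nobody on the team could explain why the same test "
         "had passed only an hour earlier. We reran it.")

print(sentence_lengths(uniform), round(burstiness(uniform), 2))
print(sentence_lengths(spiky), round(burstiness(spiky), 2))
```

The first passage is perfectly human and still scores zero on this proxy, which is the overlap the paragraph above describes.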

3. Formulaic transitions

"In conclusion," "furthermore," "on the other hand" are taught in formal writing instruction and used in good faith. They are also overrepresented in early GPT and Claude output. A paragraph that opens with a learned transition and closes with a hedging clause looks like every freshman essay the model saw during training.

4. Technical or list-heavy structure

Engineering reports, financial briefs, and clinical write-ups lean on bullet lists, numbered steps, and parallel grammatical structure. All three are signals machine-generated text exhibits at high rates. STEM students get over-flagged for the same reason ESL writers do: genre conventions overlap the AI signal.

5. Short passages under 250 words

Most detectors need four to six sentences before they score reliably. Below 250 words, scores swing widely and false positives spike. Run short snippets through at least two tools and weight any verdict accordingly.

The field, measured

15 detectors on the same ESL passages.

Self-published FPR (from each vendor's docs) next to the independently measured FPR on the same 100 ESL passages from our June 2026 benchmark.

15-tool FPR comparison · 100 ESL passages · Each vendor's default threshold · 2026-06-09
Detector | Self-published FPR | Measured FPR (native EN) | Measured ESL FPR | Source / citation
GPTZero | ~1% | 4% | 22% | GPTZero 2024 docs · Stanford 2023
Turnitin | 4% | ~5% | 14 to 18% | Turnitin docs · Weber-Wulff 2023
Originality.ai | <2% | 6% | 19% | Originality.ai docs · internal 2026
Copyleaks | <1% | 8% | 17% | Copyleaks docs · Weber-Wulff 2023
Winston AI | <1% | 7% | 16% | Vendor claim · internal 2026
Sapling | not published | 9% | 18% | Internal 2026 benchmark
ZeroGPT | not published | 12% | 25% | Internal 2026 benchmark
Crossplag | not published | 11% | 21% | Weber-Wulff 2023
Content at Scale | <2% | 8% | 19% | Vendor claim · internal 2026
Writer.com | not published | 7% | 15% | Internal 2026 benchmark
Smodin | not published | 13% | 24% | Internal 2026 benchmark
QuillBot Detector | not published | 9% | 18% | Internal 2026 benchmark
Scribbr | <1% | 6% | 14% | Vendor claim · internal 2026
Hive Moderation | not published | 8% | 16% | Internal 2026 benchmark
TextSight | 2% | 2% | 6% | June 2026 internal benchmark

Self-published numbers from each vendor's docs as of June 2026. Measured numbers from TextSight's internal benchmark on 100 ESL + 100 native English passages (methodology). Independent citations: Weber-Wulff et al., International Journal for Educational Integrity, 2023; Liang et al., Patterns (Cell Press), 2023.

The benchmark

Same passages, four detectors, tested 2026-06-09.

A focused four-tool head-to-head on the four passage types that matter most. Methodology, threshold, and sample notes below the table.

Detection accuracy across 4 passage categories · n=400 · 2026-06-09
Passage segment | GPTZero TPR/FPR | Turnitin TPR/FPR | Originality TPR/FPR | TextSight TPR/FPR
Native English AI (GPT-4, n=100) | 96% TPR | 93% TPR | 95% TPR | 97% TPR
Native English Human (n=100) | 4% FPR | 5% FPR | 6% FPR | 2% FPR
ESL Human (academic, n=100) | 22% FPR | 16% FPR | 19% FPR | 6% FPR
Humanized AI (post-edit, n=100) | 41% TPR | 58% TPR | 64% TPR | 78% TPR

How to read these four rows

Row 1. Four detectors within four points on raw GPT-4. The easy case. Not decision-grade alone.

Row 2. Field range is 2 to 6% FPR. A 2-point swing is two fewer wrongly-flagged students per 100.

Row 3 is the decision row. The 16-point gap between GPTZero (22%) and TextSight (6%) is the difference between a class of 30 ESL essays with 7 wrong accusations and the same class with 2. If your population skews ESL, this row matters more than the other three combined.

Row 4. Where most detectors collapse. GPTZero drops to 41% TPR because paraphrasers add the very variance that perplexity scoring reads as human.

Methodology

  • Passage set: 400 passages, 100 per category. AI samples from GPT-4 and Claude Opus across 25 essay prompts. ESL samples from Indian, Filipino, and Chinese university student writing (assignment-matched).
  • Run window: All 400 passages scanned through each detector within a 6-hour window on 2026-06-09 to control for model drift.
  • Threshold: Each tool at its own default. We did not normalize because we wanted the default user experience number.
  • Humanized source: Each row-4 passage ran through one pass of QuillBot Fluency before re-scanning.
  • Honest scope: Internal benchmark. Numbers move on different sample mixes. Full corpus at accuracy-methodology.html.
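Separately from sample mix, 100 passages per category means each rate carries sampling uncertainty. One rough way to see how much is a Wilson score interval, sketched here in Python. The 95% level (z = 1.96) and the function name are our assumptions for illustration, not part of the benchmark methodology.

```python
import math
from typing import Tuple


def wilson_interval(flags: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (flags out of n passages)."""
    p = flags / n
    denom = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ESL Human row, n=100: 6% FPR means 6 flags, 22% FPR means 22 flags.
for label, flags in (("6% FPR", 6), ("22% FPR", 22)):
    low, high = wilson_interval(flags, 100)
    print(f"{label}: plausible range {low:.1%} to {high:.1%}")
```

On these inputs the range for 6 flags in 100 runs from roughly 3% to 12%, and for 22 in 100 from roughly 15% to 31%. The two ranges do not overlap, but each is wide, so small differences between tools in the table should be read with care.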
The protocol

If a detector flagged your human writing.

A four-step protocol used by students, teachers, and editors to dispute a wrong AI flag. Practical, not adversarial.

Step 1: Preserve drafts before you touch anything

The instinct after a flag is to rewrite. Resist it. A panic rewrite destroys version history, your strongest evidence. Capture Google Docs revision history, Word AutoRecover, Notion page history, or browser autosave. Edit timestamps showing 40 minutes of incremental revision are the closest thing to proof.

Step 2: Re-scan on a second independent detector

One verdict is a probability. Two agreeing is a stronger signal. Run the passage through a detector using a different signal family: if the first was GPTZero (perplexity), run TextSight (sentence rhythm). Disagreement is itself evidence the verdict is not reliable.
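To see why a second scan changes the picture, here is a small Bayes-rule sketch in Python. It rests on two loud assumptions: a 10% prior that the passage is AI (invented for illustration) and conditional independence between the two detectors, which may not hold when both lean on similar signals. The TPR and FPR inputs are the GPTZero and TextSight figures from the benchmark table (native English AI TPR, ESL human FPR). The function name is ours.

```python
def posterior_ai(prior: float, tpr: float, fpr: float, flagged: bool) -> float:
    """Update the probability a passage is AI after one detector verdict."""
    like_ai = tpr if flagged else 1.0 - tpr        # P(verdict | AI)
    like_human = fpr if flagged else 1.0 - fpr     # P(verdict | human)
    numerator = like_ai * prior
    return numerator / (numerator + like_human * (1.0 - prior))


PRIOR = 0.10  # assumption: 1 in 10 passages is AI-written

# First scan: detector with 96% TPR and 22% ESL FPR flags the passage.
after_first = posterior_ai(PRIOR, tpr=0.96, fpr=0.22, flagged=True)

# Second scan: detector with 97% TPR and 6% ESL FPR, treated as independent.
if_second_agrees = posterior_ai(after_first, tpr=0.97, fpr=0.06, flagged=True)
if_second_clears = posterior_ai(after_first, tpr=0.97, fpr=0.06, flagged=False)

print(f"after first flag:      {after_first:.0%}")
print(f"second detector flags: {if_second_agrees:.0%}")
print(f"second detector clears: {if_second_clears:.0%}")
```

Under these assumptions a single flag leaves the probability of AI at about 33%, agreement lifts it to about 89%, and disagreement drops it below 2%. The numbers are only as good as the prior and the independence assumption, but the direction is the point: one flag alone is weak evidence.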

Step 3: Request the methodology and threshold

Any reviewer using a detector verdict against you owes three things: the published methodology, the threshold (for example a 50%, 60%, or 80% confidence floor), and the per-sentence breakdown. Many academic integrity policies require this disclosure on request. Refusal is grounds for escalation.

Step 4: Bring the per-sentence breakdown to the appeal

Modern detectors show which sentences scored high and why. If the flagged sentences carry formulaic transitions, learned templates, or technical register, that pattern alone explains the flag and is worth naming. "This sentence flagged because it has low burstiness, a structural property of careful academic prose" lands differently than a generic denial.

A note on tone. Appeals work best when they treat the detector as a flawed instrument rather than the reviewer as a bad actor. Most teachers want the tool to work. Show them why the verdict is not safe in this case.

FAQ

AI detector accuracy, frequently asked.

How accurate are AI detectors in 2026?
It depends. On raw GPT-4 or Claude the top detectors land between 88% and 97% TPR. On paraphrased AI, TPR often drops to between 40% and 60%. On human writing, measured FPR in our benchmark ranges from 2% to 25% depending on first language, register, and threshold. No single tool is reliable enough to be treated as proof.
What do TPR and FPR mean for an AI detector?
TPR is the share of AI passages a detector correctly flags. FPR is the share of human passages it wrongly flags. A 99% TPR can sit next to a 30% FPR. The cost of a wrong flag falls on the writer, so FPR is the number to read first.
Why do AI detectors flag ESL writing so often?
Second-language academic writing has lower perplexity and burstiness than native English, the exact signals classical detectors read as machine. Liang et al. at Stanford (2023) measured 61% FPR on TOEFL essays across multiple detectors.
Can a school punish a student based on a detector score alone?
Reputable academic integrity guidance, including from Turnitin and GPTZero, says no detector output should be the sole basis for disciplinary action. Ask for the methodology, the threshold, and the per-sentence breakdown.
Do AI detectors work on paraphrased or humanized text?
Most detectors lose significant recall on paraphrased AI. In a June 2026 internal test, TextSight measured 78% TPR on lightly humanized passages while GPTZero measured 41% on the same set. Perplexity scoring degrades fastest because paraphrasers add the exact variance the detector reads as human.
What should I do if a detector wrongly flags my writing?
Do not panic-rewrite. Preserve drafts and version history. Re-scan on a second independent detector with a different signal family. Request the methodology and threshold. Bring the per-sentence breakdown to any appeal.

Run a passage through TextSight. Read the per-sentence evidence.

Free tier: 3 scans a day, 5,000 characters per scan, no card, no email, no signup. The fastest way to test the accuracy claims on this page against your own writing.
