
AI detector accuracy, explained honestly.

Q: How accurate are AI detectors in 2026?

It depends on the writer and the text. On raw GPT-4 or Claude output the top detectors land between 88% and 97% TPR. On lightly edited or paraphrased AI, accuracy often drops into the 40 to 60 percent range. On human writing, field FPR ranges between 1% and 22% depending on first language, academic register, and threshold. No single tool is reliable enough to be treated as proof on its own.

Q: What do TPR and FPR mean for an AI detector?

TPR (true positive rate) is the share of AI passages a detector correctly flags. FPR (false positive rate) is the share of human passages it wrongly flags. A 99% TPR can sit next to a 30% FPR inside the same tool. The cost of a wrong flag falls on the writer, so FPR is the number to read first.

Q: Why do AI detectors flag ESL writing so often?

Second-language academic writing has lower perplexity and lower burstiness than native English, the exact signals classical detectors use to flag machine generation. Liang et al. at Stanford (2023) measured 61% FPR on TOEFL essays across multiple detectors. The bias is real and measurable.

Q: Can a school punish a student based on a detector score alone?

Reputable academic integrity guidance, including from Turnitin and GPTZero themselves, says no detector output should be the sole basis for disciplinary action. A verdict is a probability against a calibration set the student was probably never in. Ask for the methodology, the threshold, and the per-sentence breakdown.

Q: Do AI detectors work on paraphrased or humanized text?

Most detectors lose significant recall on paraphrased AI. On lightly humanized passages, TextSight holds up better than detectors that lean on the older surface tells. Perplexity-based scoring degrades fastest because paraphrasers add the exact variance the detector reads.

Q: What should I do if a detector wrongly flags my writing?

Do not panic-rewrite. Preserve drafts and version history. Re-scan on a second independent detector that uses a different signal family. Request the methodology and threshold from the reviewer. Bring the per-sentence breakdown to any appeal. If you write in a polished or technical register, name that pattern in the appeal.

Every detector publishes an accuracy number. Almost none of them mean what readers think they mean. This guide walks through what TPR and FPR actually measure, why ESL writing gets over-flagged, what the peer-reviewed literature says, and which tools hold up on identical passages. No marketing language, no hidden bias, no claim that any one tool is always right.


15 detectors tested · 400 passages, same conditions · Peer-reviewed sources cited · Last verified June 9, 2026

The short answer

What "accurate" really means here.

Three things to keep in mind before you read any vendor's claim about detection accuracy.

One. A vendor accuracy number is almost always a true positive rate measured on a vendor-chosen test set. It says how often the tool catches AI on writing similar to that set. It does not say how often it wrongly flags a real student.

Two. The number that determines whether you can trust a verdict is the false positive rate, broken out by writer type. A 1% FPR on native English can sit next to a 22% FPR on ESL writing inside the same tool. Vendors rarely publish that split. Independent researchers do.

Three. Every accuracy number degrades the moment a paraphraser, a round of human editing, or an unfamiliar topic enters the picture. Treat any score as a probability, not a fact.

Bottom line. The best detectors in 2026 sit around 90 to 97 percent TPR on raw AI and 2 to 6 percent FPR on careful native English. On ESL writing the field FPR range stretches from 6% (TextSight, internal) to 25% (worst-calibrated commercial tools). No single verdict should be treated as proof.
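To make the split by writer type concrete, here is a minimal Python sketch. The group names, the 100-passages-per-group counts, and the function name `fpr_by_group` are illustrative assumptions, not data from any benchmark. It shows how one pooled FPR can hide a much higher rate for one group.

```python
from collections import defaultdict

def fpr_by_group(records):
    """records: (group, flagged) pairs for HUMAN-written passages only.
    Returns {group: share of that group's passages wrongly flagged}."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, was_flagged in records:
        total[group] += 1
        if was_flagged:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}

# Hypothetical results: 100 human passages per group.
records = (
    [("native", True)] * 1 + [("native", False)] * 99
    + [("esl", True)] * 22 + [("esl", False)] * 78
)

rates = fpr_by_group(records)
pooled = sum(was_flagged for _, was_flagged in records) / len(records)

print(rates)   # {'native': 0.01, 'esl': 0.22}
print(pooled)  # 0.115
```

The pooled figure of 11.5% describes neither group well, which is why a single headline FPR is hard to act on.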

The two numbers that matter

TPR and FPR, in plain English.

Every detector pitch hides one of these two numbers. Reading them together is the only honest way to evaluate a tool.

True positive rate (TPR), or "did it catch the AI"

TPR is the share of AI-generated passages the tool correctly flags. A 92% TPR means the detector catches 92 out of every 100 AI samples. Vendors love this number because it is easy to push up. It says nothing about cost. A tool with 99% TPR and a 30% FPR is worse than one with 91% TPR and a 3% FPR for any real classroom.

False positive rate (FPR), or "how often it accuses the innocent"

FPR is the share of human-written passages wrongly flagged as AI. This is the number that determines whether you can trust a verdict. On a class of 30 human-written essays, a 5% FPR means roughly 1.5 students wrongly accused. A 22% FPR, the high end of the field range on human writing, means closer to 7. At the 61% FPR Stanford measured on TOEFL essays in 2023, it means about 18. The cost of a false positive falls on the writer.
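Both rates come from the same four counts. The sketch below is a minimal Python illustration: the test set (100 AI passages, 100 human passages) and the function name `tpr_fpr` are assumptions made up for the example. The last loop redoes the class-of-30 arithmetic from the paragraph above, assuming every essay in the class is human-written.

```python
def tpr_fpr(labels, flags):
    """labels[i] is True if passage i is AI-written.
    flags[i] is True if the detector flagged passage i."""
    pairs = list(zip(labels, flags))
    tp = sum(1 for is_ai, hit in pairs if is_ai and hit)          # AI, flagged
    fn = sum(1 for is_ai, hit in pairs if is_ai and not hit)      # AI, missed
    fp = sum(1 for is_ai, hit in pairs if not is_ai and hit)      # human, flagged
    tn = sum(1 for is_ai, hit in pairs if not is_ai and not hit)  # human, cleared
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical test set: 100 AI passages (92 caught), 100 human passages (5 flagged).
labels = [True] * 100 + [False] * 100
flags = [True] * 92 + [False] * 8 + [True] * 5 + [False] * 95

tpr, fpr = tpr_fpr(labels, flags)
print(f"TPR={tpr:.0%}  FPR={fpr:.0%}")  # TPR=92%  FPR=5%

# Expected number of wrongly flagged writers in a class of 30 human essays.
for rate in (0.05, 0.22):
    print(f"FPR {rate:.0%}: about {30 * rate:.1f} of 30")
```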

The trade-off curve nobody shows you

Every detector has one knob: the threshold. Lower it and TPR rises but FPR rises too. Raise it and FPR drops but TPR drops with it. The vendor's headline number is whichever point on the curve makes the marketing read best. To compare honestly, fix the threshold or compare full curves.
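A small Python sketch of the knob, using made-up detector scores (both score lists and the function name `tpr_fpr_at` are hypothetical). A passage is flagged when its score is at or above the threshold.

```python
def tpr_fpr_at(threshold, ai_scores, human_scores):
    """Flag a passage when its score is >= threshold, then return (TPR, FPR)."""
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    return tpr, fpr

# Hypothetical "probability of AI" scores for 10 AI and 10 human passages.
ai_scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.40, 0.30]
human_scores = [0.75, 0.60, 0.45, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]

for t in (0.3, 0.5, 0.7, 0.9):
    tpr, fpr = tpr_fpr_at(t, ai_scores, human_scores)
    print(f"threshold={t:.1f}  TPR={tpr:.0%}  FPR={fpr:.0%}")
# threshold=0.3  TPR=100%  FPR=50%
# threshold=0.5  TPR=80%  FPR=20%
# threshold=0.7  TPR=50%  FPR=10%
# threshold=0.9  TPR=20%  FPR=0%
```

On this toy data a vendor could truthfully advertise "100% detection" or "0% false positives" from the same tool, just by quoting different thresholds.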

Recall, precision, and the marketing fog

ML literature calls TPR "recall" and pairs it with "precision" (of everything flagged, how much was AI). Marketing pages translate these into "accuracy" or "detection rate" without disclosing the test set or threshold. If a page says "99% accurate" without splitting TPR and FPR, the writer either does not know the difference or is hoping you do not.
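Precision also depends on how much of the scanned pile is AI to begin with, which no vendor page can know. The Python sketch below applies the standard precision formula to two illustrative tools (99% TPR with 30% FPR, and 91% TPR with 3% FPR). The 10% AI share and the names `precision` and `AI_SHARE` are assumptions chosen for the example.

```python
def precision(tpr, fpr, ai_share):
    """Of everything flagged, the share that really is AI.
    ai_share is the fraction of scanned passages that are AI-written."""
    flagged_ai = tpr * ai_share            # true positives per passage scanned
    flagged_human = fpr * (1 - ai_share)   # false positives per passage scanned
    return flagged_ai / (flagged_ai + flagged_human)

AI_SHARE = 0.10  # hypothetical: 1 in 10 submitted passages is AI-written

for name, tpr, fpr in [("tool A", 0.99, 0.30), ("tool B", 0.91, 0.03)]:
    print(f"{name}: precision {precision(tpr, fpr, AI_SHARE):.0%}")
# tool A: precision 27%
# tool B: precision 77%
```

Under that assumption, roughly three out of four flags from the "99% accurate" tool land on human writers.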

The bias problem

Why detectors flag ESL writers more often.

The single most documented failure mode in the academic literature. Worth understanding before trusting any verdict that involves a non-native English writer.

The Stanford finding

In July 2023, Liang and colleagues at Stanford published a paper in Patterns (Cell Press) measuring detector accuracy on TOEFL essays by non-native English speakers. More than half were misclassified as AI-generated by mainstream detectors, an average FPR of about 61% across the detectors tested. On essays from US-born eighth-graders, the same detectors held under 5%. The detectors were reading the structural footprint of formally taught English as machine-generated.

Why this happens at the signal level

Classical detectors score perplexity and burstiness. Second-language academic writing uses a constrained vocabulary, follows taught templates, and produces uniform sentence lengths. All three reduce perplexity and burstiness. The signal the detector reads as "machine" overlaps the signal of a careful non-native writer. Not a bug in any one tool: a property of the underlying method.
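A toy Python sketch of the perplexity side. Real detectors score text with a large language model. This example swaps in a tiny unigram model with add-one smoothing, fit on a made-up reference string, purely to show the direction of the effect. The function name and all sample texts are hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(text, reference):
    """Perplexity of `text` under a unigram model fit on `reference`.
    Add-one smoothing; every unseen word shares one 'unknown' slot."""
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    denom = len(ref_tokens) + len(counts) + 1  # +1 for the unknown slot
    tokens = text.lower().split()
    log_prob = sum(math.log((counts[t] + 1) / denom) for t in tokens)
    return math.exp(-log_prob / len(tokens))

# Hypothetical reference text standing in for "what the model expects".
reference = (
    "the study shows that the results are clear and the method is simple "
    "the results show that the method works"
)
constrained = "the results show that the method is clear"
varied = "quixotic marmalade dreams unsettle the drowsy cartographer"

print(unigram_perplexity(constrained, reference))
print(unigram_perplexity(varied, reference))
```

The constrained sentence scores lower because every word is one the model has already seen often. That is the same property a taught academic template has.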

What's been done about it

Most major vendors have re-tuned since the Stanford paper. GPTZero shipped a 2024 ESL update. Turnitin recalibrated threshold defaults. Originality.ai added a language-aware second pass. Gains are uneven: independent retests since 2024 still measure ESL FPR between 14% and 25%. TextSight is calibrated to keep its false-positive rate low on the same writing.

The practical takeaway

If you write in formally-taught English (Indian-curriculum, Filipino academic, Chinese university register), expect more false flags than the vendor's headline number predicts. Pre-scan before submission. Re-scan any flag on a second independent tool. If you teach or grade ESL essays, treat any single verdict as a starting point for a conversation, not evidence.

Pattern recognition

Five writing patterns that trigger false flags.

If your prose has any of these traits, expect a higher false positive rate from any detector. Knowing the pattern in advance lets you defend the work.

1. Low perplexity (consistent vocabulary)

Writers who stick to a controlled vocabulary, whether by training, register, or genre, produce text the detector reads as predictable. STEM students, legal writers, and disciplined ESL writers sit in this band. The fix is not to inject random vocabulary; it is to know the pattern and name it in any review.

2. Low burstiness (uniform sentence length)

Burstiness measures how sentence length varies across a paragraph. Human writing tends to be spiky. Polished academic prose, especially in second-language or technical registers, smooths the variance out. Detectors read smoothness as machine generation. This is the single biggest reason carefully edited human writing gets flagged.
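One simple way to put a number on this, sketched in Python. Detectors do not publish their exact formulas, so the measure used here (standard deviation of sentence length divided by the mean) is an assumed proxy, and the function names and sample passages are hypothetical.

```python
import re
import statistics

def sentence_lengths(text):
    """Word count of each sentence, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length (higher = spikier)."""
    lengths = sentence_lengths(text)
    return statistics.pstdev(lengths) / statistics.mean(lengths)

spiky = (
    "No. The committee met for three hours and still could not agree "
    "on a single line of the budget. Then it adjourned."
)
smooth = (
    "The committee met on Monday to review the budget. "
    "The members discussed each line of the proposal. "
    "The chair closed the meeting after three hours."
)

print(sentence_lengths(spiky), round(burstiness(spiky), 2))
print(sentence_lengths(smooth), round(burstiness(smooth), 2))
```

Both passages are human-written. Only the second has the even rhythm a detector tends to read as machine output.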

3. Formulaic transitions

"In conclusion," "furthermore," "on the other hand" are taught in formal writing instruction and used in good faith. They are also overrepresented in early GPT and Claude output. A paragraph that opens with a learned transition and closes with a hedging clause looks like every freshman essay the model saw during training.

4. Technical or list-heavy structure

Engineering reports, financial briefs, and clinical write-ups lean on bullet lists, numbered steps, and parallel grammatical structure. All three are signals machine-generated text exhibits at high rates. STEM students get over-flagged for the same reason ESL writers do: genre conventions overlap the AI signal.

5. Short passages under 250 words

Most detectors need four to six sentences before they score reliably. Below 250 words, scores swing widely and false positives spike. Run short snippets through at least two tools and weight any verdict accordingly.

The field, measured

15 detectors on the same ESL passages.

Self-published FPR (from each vendor's docs) next to measured FPR from TextSight's internal benchmark and independent academic studies.

15-tool FPR comparison · 100 ESL passages · Each vendor's default threshold · 2026-06-09
Detector	Self-published FPR	Measured FPR (native EN)	Measured ESL FPR	Source / citation
GPTZero	~1%	4%	22%	GPTZero 2024 docs · Stanford 2023
Turnitin	4%	~5%	14 to 18%	Turnitin docs · Weber-Wulff 2023
Originality.ai	<2%	6%	19%	Originality.ai docs · internal 2026
Copyleaks	<1%	8%	17%	Copyleaks docs · Weber-Wulff 2023
Winston AI	<1%	7%	16%	Vendor claim · internal 2026
Sapling	not published	9%	18%	Internal 2026 benchmark
ZeroGPT	not published	12%	25%	Internal 2026 benchmark
Crossplag	not published	11%	21%	Weber-Wulff 2023
Content at Scale	<2%	8%	19%	Vendor claim · internal 2026
Writer.com	not published	7%	15%	Internal 2026 benchmark
Smodin	not published	13%	24%	Internal 2026 benchmark
QuillBot Detector	not published	9%	18%	Internal 2026 benchmark
Scribbr	<1%	6%	14%	Vendor claim · internal 2026
Hive Moderation	not published	8%	16%	Internal 2026 benchmark
TextSight	2%	2%	6%	June 2026 internal benchmark

Self-published numbers from each vendor's docs as of June 2026. Measured numbers from TextSight's internal benchmark on 100 ESL + 100 native English passages (methodology). Independent citations: Weber-Wulff et al., International Journal for Educational Integrity, 2023; Liang et al., Patterns (Cell Press), 2023.

The protocol

If a detector flagged your human writing.

A four-step protocol used by students, teachers, and editors to dispute a wrong AI flag. Practical, not adversarial.

Step 1: Preserve drafts before you touch anything

The instinct after a flag is to rewrite. Resist it. A panic rewrite destroys version history, your strongest evidence. Capture Google Docs revision history, Word AutoRecover, Notion page history, or browser autosave. Edit timestamps showing 40 minutes of incremental revision are the closest thing to proof.

Step 2: Re-scan on a second independent detector

One verdict is a probability. Two agreeing is a stronger signal. Run the passage through a detector using a different signal family: if the first was GPTZero (perplexity), run TextSight (sentence rhythm). Disagreement is itself evidence the verdict is not reliable.
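The arithmetic behind using a different signal family, as a rough Python sketch. The two FPR values and the function name `both_flag_bounds` are illustrative assumptions. The key assumption is independence: if the two detectors make unrelated mistakes, the chance both wrongly flag the same human passage is the product of their FPRs. If their mistakes overlap completely, it is as high as the smaller FPR.

```python
def both_flag_bounds(fpr_a, fpr_b):
    """Chance that two detectors BOTH wrongly flag one human passage.
    Returns (independent-errors estimate, fully-overlapping worst case)."""
    independent = fpr_a * fpr_b       # mistakes unrelated to each other
    worst_case = min(fpr_a, fpr_b)    # mistakes fall on the same passages
    return independent, worst_case

# Hypothetical per-tool false positive rates on the writer's kind of prose.
independent, worst_case = both_flag_bounds(0.22, 0.06)
print(f"independent errors: {independent:.1%}")  # 1.3%
print(f"fully overlapping:  {worst_case:.1%}")   # 6.0%
```

Two perplexity-based tools sit near the overlapping end, so their agreement adds little. Tools reading different signals sit nearer the independent end, which is why a second opinion from a different signal family carries more weight.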

Step 3: Request the methodology and threshold

Any reviewer using a detector verdict against you owes three things: the published methodology, the threshold (for example a 50%, 60%, or 80% confidence floor), and the per-sentence breakdown. Most academic integrity policies require this disclosure on request. Refusal is grounds for escalation.

Step 4: Bring the per-sentence breakdown to the appeal

Modern detectors show which sentences scored high and why. If the flagged sentences carry formulaic transitions, learned templates, or technical register, that pattern alone explains the flag and is worth naming. "This sentence flagged because it has low burstiness, a structural property of careful academic prose" lands differently than a generic denial.

A note on tone. Appeals work best when they treat the detector as a flawed instrument rather than the reviewer as a bad actor. Most teachers want the tool to work. Show them why the verdict is not safe in this case.



