
AI Detector False Positives: causes and what to do.

A false positive is a detector calling human writing AI-generated. Independent academic studies have commonly measured rates between 4 and 22 percent depending on the tool and the writer, and far higher on some ESL samples, so any single verdict deserves a second look. ESL writers, polished academic stylists, STEM and technical writers, and short-passage submitters carry the highest risk. Below: the measured false positive rate (FPR) by detector, the five writing patterns most likely to trigger a false flag, and a five-step protocol if your writing has been wrongly flagged.

15 detectors tested · 100 ESL passages (n=100) · Vendor-default thresholds · Last verified
The definition

What counts as a false positive.

Before you appeal a verdict you need to know what the verdict actually is. A false positive is a measurable statistical event, not a judgement call.

A false positive is the technical term for a detector flagging a passage as AI-generated when it was, in fact, written by a human. The metric that quantifies this is the false positive rate, abbreviated FPR. It is the share of confirmed-human passages in a test set that the detector wrongly labels positive. If you scan 100 human-written essays and the tool flags 8 of them, the measured FPR on that sample is 8 percent.
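The arithmetic in that example can be written out as a minimal Python sketch. The function name and the list-of-booleans input are illustrative assumptions, not any vendor's API.

```python
def false_positive_rate(flags_on_human_text):
    """Share of confirmed-human passages that a detector flagged as AI.

    flags_on_human_text: one boolean per confirmed-human passage,
    True when the detector flagged it (hypothetical input format).
    """
    if not flags_on_human_text:
        raise ValueError("need at least one confirmed-human passage")
    return sum(flags_on_human_text) / len(flags_on_human_text)


# The worked example from the text: 100 human essays, 8 of them flagged.
flags = [True] * 8 + [False] * 92
print(f"FPR = {false_positive_rate(flags):.0%}")  # prints "FPR = 8%"
```

Note that only confirmed-human passages go into the denominator. AI-written passages in the same test set affect the true positive rate, not the FPR.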

Three things matter when you read an FPR number. The first is the sample: a vendor running its own internal eval gets to choose what counts as human writing, and a calibration set assembled from staff blog posts is not representative of a global mix. The second is the threshold: every detector has a tunable cutoff above which a score becomes a flag. Raising the threshold cuts FPR but also cuts true positive rate, the share of actual AI text caught. The third is who the writer is, because measured FPR varies sharply across populations, especially between native and non-native English writers.
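The threshold trade-off is easier to see with numbers. The sketch below sweeps a cutoff across two small sets of scores and reports both rates at each setting. Every score is invented for illustration, and `rates_at_threshold` is a hypothetical helper, not a real detector interface.

```python
def rates_at_threshold(human_scores, ai_scores, threshold):
    """Return (FPR, TPR) when every score at or above threshold is a flag."""
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    return fpr, tpr


# Hypothetical "probability AI" scores in [0, 1], made up for this sketch.
human_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.62, 0.70, 0.15, 0.30]
ai_scores = [0.45, 0.58, 0.66, 0.72, 0.80, 0.85, 0.90, 0.93, 0.97, 0.99]

for threshold in (0.5, 0.65, 0.8):
    fpr, tpr = rates_at_threshold(human_scores, ai_scores, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.0%}  TPR={tpr:.0%}")
```

On these made-up scores, each step up in the threshold removes some false flags and also lets more AI text through. That is why an FPR figure only means something when the threshold it was measured at is stated next to it.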

"I disagree with this verdict" is not the same thing as a false positive. A verdict is a probability, not a finding. The honest reading is: this detector judged the passage above its threshold, against a calibration set the writer was probably never in. That framing is what makes the protocol later on this page actually work.

The field, measured

False positive rates published, false positive rates measured.

Self-reported vendor numbers live in one column. Independent academic measurements live in the other. The gap between the two is the entire story.

The 2023 paper by Weber-Wulff and colleagues in the International Journal for Educational Integrity tested 14 detectors on a fixed corpus and measured false positive rates anywhere from 0 to 50 percent depending on the tool, with most clustering between 5 and 15 percent. The 2023 Stanford study by Liang et al. focused specifically on ESL writing, sampling TOEFL essays and English-language student work from non-native writers; several major detectors that report sub-2 percent FPR on native English clocked above 60 percent on the ESL sample. Elkhatat and co-authors published a separate 2023 evaluation showing sharp accuracy drops once passages crossed model boundaries, with detectors trained on GPT-2 era output performing notably worse on GPT-4 and Claude generations.

Self-reported numbers tell a different story. GPTZero published a 1 percent FPR on its 2024 methodology page. Turnitin documents a 4 percent FPR on long-form human prose. Originality.ai claims under 2 percent. Copyleaks reports under 1 percent. The vendor numbers are not lies; they are correct against the calibration set the vendor chose. They are simply not the number you should expect on your own traffic, especially if your traffic includes ESL writers or short passages or polished technical prose.

Our own 400-passage June 2026 benchmark, with the methodology listed in the Benchmark section below, gives TextSight a 2 percent FPR on native English academic prose and 6 percent on ESL writing. We publish both numbers because the gap is the honest part. Any vendor showing you only the better of the two is hiding the population that matters most.

The four high-risk groups

Who carries the false-positive risk.

False positives are not random. Four writer profiles absorb most of the misfires. If you fit any of them, treat any single detector verdict as a starting point, not a conclusion.

1. ESL writers

The biggest single risk group. Second-language academic writing tends to be technically correct, vocabulary-consistent, and structurally tidy, exactly the surface features that overlap with machine-generated prose. The Liang 2023 study put this on the record: several detectors that report under 5 percent FPR on native English measured above 60 percent on TOEFL essays. Our own benchmark shows the same direction at a smaller magnitude: on one identical sample of Indian, Filipino, and Chinese university writing, measured ESL FPR ran from 6 percent to 25 percent across the 15 detectors.

2. High-achieving native writers with polished register

Counter-intuitively, the best human writers are also at risk. Award-winning student essays, peer-reviewed journal abstracts, and senior-level editorial prose tend to have lower perplexity and lower burstiness than rougher drafts. The fingerprints that detectors lean on for "human" partly reward inconsistency. Clean writing is statistically harder to distinguish from clean machine writing, and the field is honest about that being a real limit.

3. STEM and technical writers

Methods sections, lab reports, and code documentation use formulaic structures and constrained vocabularies by professional convention. The signal a detector reads as "this looks templated" is, in scientific writing, just the genre. Engineering and CS student writing flags at roughly twice the rate of humanities student writing in our internal samples, even when both sets are confirmed human-authored.

4. Short-passage submitters

Anyone submitting under 250 words is fighting noise. Detectors need enough sentences to average their signal across, and below that floor the score swings wildly with a single rephrase. A 120-word discussion-board reply can score 28 percent AI on Monday and 71 percent AI on Tuesday with no edits, just sampling variance. If your workflow is short replies or paragraph-length scans, expect higher variance and lower trust on any single result.

The mechanism

The five writing patterns that trigger false flags.

Five surface features recur in nearly every wrongly-flagged passage we have audited. They are statistical properties of the text, not stylistic crimes. Knowing them lets you read a verdict critically.

Low perplexity

Perplexity measures how predictable each next word is, given the words before it. Consistent vocabulary across an essay, especially in a narrow topic area, drives perplexity down. Classical detectors read low perplexity as a generation signal. The mechanism is real, but plenty of careful human writers produce low-perplexity prose simply by being careful.
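For readers who want the number behind the word: perplexity is conventionally the exponential of the average negative log-probability a language model assigns to each token. The sketch below applies that standard formula to two made-up lists of token probabilities. In practice the probabilities would come from a scoring model, and which model a given detector uses is not something this page can confirm.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Hypothetical per-token probabilities, invented for illustration.
predictable = [0.90, 0.80, 0.85, 0.90]   # every next word was expected
surprising = [0.90, 0.05, 0.60, 0.01]    # two words the model did not expect

print(f"predictable passage: {perplexity(predictable):.2f}")
print(f"surprising passage:  {perplexity(surprising):.2f}")
```

The predictable list yields the lower perplexity. A careful human writer staying inside a narrow vocabulary produces exactly that kind of list, which is the overlap the paragraph above describes.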

Low burstiness

Burstiness is variance in sentence length across the document. Human writing tends to spike (a 4-word sentence next to a 38-word one). AI writing is smoother. Writers trained in formal academic register often suppress the spikes deliberately because uniformity is read as polish. That suppression is exactly what the burstiness signal punishes.
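Taking the definition above literally, burstiness can be approximated as the variance of sentence length in words. This is a rough sketch with a naive sentence splitter (it breaks on ., ! and ? and will stumble on abbreviations), and it is not a claim about how any specific detector computes the signal.

```python
import re
import statistics


def sentence_lengths(text):
    """Word count per sentence, using a naive split on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Population variance of sentence length; 0.0 for fewer than 2 sentences."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pvariance(lengths)


uniform = "One two three four. Five six seven eight. Nine ten eleven twelve."
spiky = "Stop. This sentence runs on for quite a few more words than the first."

print(sentence_lengths(uniform), burstiness(uniform))
print(sentence_lengths(spiky), burstiness(spiky))
```

The uniform sample scores zero because every sentence has the same length. Formal academic prose tends toward that end of the scale, which is why it is exposed to this signal.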

Formulaic transitions

Furthermore, Moreover, In conclusion, It is important to note that. These transitions appear in both AI output and well-taught student writing for the same reason: they were learned as paragraph-glue in formal education. The detector cannot tell whether the writer learned them from a curriculum or from a model's training corpus.
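To check your own draft for this pattern, a simple count per 100 words is enough. The sketch below uses only the four transitions named above. The phrase list and the per-100-words normalisation are assumptions chosen for illustration, and a high count says nothing about who wrote the text.

```python
import re

# The four transitions named in the text; extend the tuple as needed.
TRANSITIONS = (
    "furthermore",
    "moreover",
    "in conclusion",
    "it is important to note that",
)


def transition_rate(text):
    """Formulaic transitions per 100 words, matched case-insensitively."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0
    lowered = text.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in TRANSITIONS
    )
    return 100 * hits / word_count


sample = "Furthermore, the results hold. Moreover, they replicate."
print(f"{transition_rate(sample):.1f} transitions per 100 words")
```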

Technical or STEM register

Dense noun phrases, passive constructions, and chained "of"-clauses ("the calibration of the threshold of the detector") read as templated text to a generic classifier. In their native habitat (a methods section, a clinical write-up) they are simply correct genre. A detector tuned mostly on essays and blog posts is reading STEM prose against the wrong reference distribution.

Short passages under 250 words

The structural problem. Most detectors need at least 4-5 sentences of context to lock in a stable score, and the floor of usable accuracy is widely understood to be around 250 words. Below that you are drawing a single sample from a high-variance distribution and treating it as if it were the mean.
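A simplified model shows why length matters. Assume, purely for illustration, that a passage score is the average of independent per-sentence scores with a fixed spread. The uncertainty of that average then shrinks with the square root of the sentence count. Real detectors are not known to work exactly this way, and the 0.25 spread below is an invented figure.

```python
import math


def standard_error(per_sentence_sd, n_sentences):
    """Standard error of a mean of n independent per-sentence scores."""
    if n_sentences < 1:
        raise ValueError("need at least one sentence")
    return per_sentence_sd / math.sqrt(n_sentences)


# Hypothetical spread of 0.25 on a 0-to-1 score scale.
for n in (5, 10, 20, 40):
    print(f"{n:>2} sentences: standard error = {standard_error(0.25, n):.3f}")
```

Under this model, halving the uncertainty takes four times as many sentences. A paragraph-length scan cannot be made as stable as a full-essay scan by rephrasing alone.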

Benchmark

FPR by detector on the same 100 ESL passages.

15 detectors. One fixed sample of 100 ESL academic passages from Indian, Filipino, and Chinese university student writing. Each tool run at its default threshold inside a single 6-hour window on 2026-06-09. Methodology bullets after the table.

Detection false positive rate by tool, same 100-passage ESL sample, vendor default thresholds, 2026-06-09

| Detector | Self-published FPR | Independent FPR | ESL FPR (measured) | Citation |
|---|---|---|---|---|
| GPTZero | ~1% | 4% | 22% | GPTZero 2024 / Stanford 2023 |
| Turnitin AI | 4% | ~5% | 14-18% | Turnitin docs / Weber-Wulff 2023 |
| Originality.ai | <2% | 6% | 19% | Originality docs / internal 2026 |
| Copyleaks | <1% | 8% | 17% | Copyleaks docs / Weber-Wulff |
| Winston AI | <1% | 7% | 16% | Vendor / internal 2026 |
| Sapling | n/a | 9% | 18% | Internal 2026 |
| ZeroGPT | n/a | 12% | 25% | Internal 2026 |
| Crossplag | n/a | 11% | 21% | Weber-Wulff 2023 |
| Content at Scale | <2% | 8% | 19% | Vendor / internal 2026 |
| Writer.com | n/a | 7% | 15% | Internal 2026 |
| Smodin | n/a | 13% | 24% | Internal 2026 |
| QuillBot Detector | n/a | 9% | 18% | Internal 2026 |
| Scribbr | <1% | 6% | 14% | Vendor / internal 2026 |
| Hive Moderation | n/a | 8% | 16% | Internal 2026 |
| TextSight | 2% | 2% | 6% | June 2026 benchmark, n=400 |

What the table is saying

Read across any row and the gap between the vendor's self-published number and the measured ESL number is the part to take seriously. Detectors that publish under 2 percent are not lying, but the population that number applies to (typically a tidy internal sample of native English long-form writing) is not the population a real classroom or editorial pipeline contains. The ESL column reflects the same passages running through every tool the same day. Differences are the model, not the sample.

TextSight's 6 percent ESL FPR is the lowest in the table, and the gap to the next-best mainstream tool (Scribbr at 14 percent) is the entire reason we publish this page. We are not claiming a no-false-positive product, because that product does not exist. We are claiming a calibrated, transparent rate that holds up on the population most affected.

Methodology

  • Sample: 100 ESL academic passages drawn from Indian (IIT, IIM, DU, JNU), Filipino, and Chinese university student writing. All confirmed human-authored with retained draft history.
  • Length: 300 to 800 words per passage, mean ~520 words.
  • Threshold: Each detector run at its vendor-default flag threshold, no tuning.
  • Window: All 15 detectors scanned within a single 6-hour window on 2026-06-09 to control for model drift.
  • Self-published column: Sourced from each vendor's public methodology page or pricing page as of 2026-06-09.
  • Independent column: Where available, Weber-Wulff et al. 2023 or Liang et al. 2023; otherwise our June 2026 internal benchmark.
  • Honest scope: This is TextSight's measurement. Vendors will likely score differently on different ESL mixes or different threshold settings. We re-run quarterly and update the page.
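A rate measured on 100 passages carries sampling uncertainty, and it helps to see how much. The sketch below computes a Wilson score interval, one standard way to put a range around a measured proportion. It is an illustrative addition, not part of the benchmark methodology, and the function name is hypothetical.

```python
import math


def wilson_interval(flagged, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0 or not 0 <= flagged <= n:
        raise ValueError("need n > 0 and 0 <= flagged <= n")
    p = flagged / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Example: a detector that flagged 6 of the 100 confirmed-human passages.
low, high = wilson_interval(6, 100)
print(f"measured 6%, interval roughly {low:.1%} to {high:.1%}")
```

At n=100 the interval spans several points either side of the measured rate, so small differences between rows in the table should be read with that range in mind.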
If you have been flagged

A five-step protocol if your writing has been wrongly flagged.

The five things to do before you do anything else. In order. They work because they prioritise evidence preservation over emotional reaction, and because they force the institution to engage with methodology rather than verdict.

1. Do not panic-rewrite

The flagged document is now evidence. Rewriting it destroys the paper trail. Leave the file untouched. Open a copy if you need to reference it, but do not save changes over the original.

2. Preserve drafts and version history

Google Docs has full revision history under File > Version history. Word with track changes keeps the timeline if it was on. Notion and most modern editors keep autosaves. Export the version history now, while it still exists.

3. Re-scan on two independent detectors

One verdict is a signal; three verdicts are a finding. Run the same passage through two more detectors and screenshot both results with the URL and timestamp visible. Agreement across tools is much harder to dismiss than any single score.

4. Request the methodology

Ask your institution which detector was used, at what threshold, and against what calibration set. Most reputable vendors publish this. If the institution cannot answer those three questions, the verdict is being treated as oracle output rather than a measurement.

5. Escalate with evidence attached

Go through the proper appeal channel. Include the per-sentence breakdown, your two independent re-scans, the methodology request, and your exported draft history. The combination shifts the burden of proof to the verdict, where it belongs.
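One optional way to support steps 1 to 3 is a small evidence record: a hash of the untouched original file plus your re-scan results, stamped with the time you recorded them. The sketch below is only a suggestion. The field names are hypothetical, and a hash shows that a file has not changed since you recorded it, not who wrote it.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(path):
    """SHA-256 of the file's bytes; the value changes if the file is edited."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def evidence_record(path, rescans):
    """Bundle the file hash, a UTC timestamp, and re-scan results."""
    return {
        "file": str(path),
        "sha256": fingerprint(path),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
        "rescans": rescans,  # e.g. [{"detector": "Detector A", "score": 0.31}]
    }


def save_record(record, out_path):
    """Write the record as readable JSON to attach to an appeal."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

Calling `evidence_record` on the original file straight after the flag, and again just before the appeal, should give the same hash if the file was left untouched.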

Most academic integrity frameworks (including Turnitin's own documentation and GPTZero's published guidance) explicitly say a detector verdict should not be the sole basis for disciplinary action. That is the lever the protocol exists to pull. You are not asking the institution to ignore the verdict; you are asking it to follow its own published policy and corroborate the verdict against drafts, history, and a second signal. In our experience the protocol resolves the majority of cases at step 3 or step 4, before a formal hearing is needed.

FAQ

AI detector false positives, frequently asked.

How often do AI detectors get human writing wrong?
Across independent academic studies, false positive rates commonly range from roughly 4 percent to 22 percent depending on the tool, the threshold, and the writer's background. The widest variance is on ESL writing: in our benchmark, 14 of the 15 detectors measured between 14 percent and 25 percent, and the Liang et al. study recorded rates above 60 percent for some tools. Detectors generally publish a single self-reported number, but those numbers reflect calibration on a controlled internal sample rather than the mixed real-world traffic most users actually run through them.
Can a school punish me based on a detector verdict alone?
Most reputable academic integrity frameworks say no, and several vendors say the same in their own documentation. Turnitin's published guidance states that detector output should not be the sole basis for an academic misconduct decision. GPTZero's documentation echoes the point. A verdict is a probability against a calibration set, not evidence of cheating. Disciplinary action without corroboration (drafts, version history, an interview) is widely treated as a procedural failure.
Why do AI detectors flag ESL writers more often?
Second-language academic writers tend to produce text with lower perplexity (a smaller, more consistent vocabulary) and lower burstiness (more uniform sentence length). Those are the same signals that classical detectors read as AI generation. The 2023 Stanford study by Liang et al. quantified this bias, measuring false positive rates above 60 percent on some ESL samples for detectors that reported under 5 percent on native-English samples. The pattern is statistical, not malicious, but the impact on real students is the same.
What writing patterns trigger false positives?
Five patterns recur in our 2026 testing. First, low perplexity, meaning a consistent vocabulary with few surprises. Second, low burstiness, meaning sentences of similar length without natural variance. Third, formulaic transitions like Furthermore, Moreover, In conclusion. Fourth, technical or STEM register that leans on dense noun phrases. Fifth, very short passages under 250 words, where the detector has too little signal to average over. Polished, well-edited human writing routinely trips one or two of these.
What should I do if I've been wrongly flagged?
Five steps. First, do not panic-rewrite the document; you need it as evidence. Second, preserve drafts, autosaves, and version history immediately. Google Docs, Word with track changes, and Notion all keep a timeline. Third, re-scan the same passage on two independent detectors and screenshot the results. Fourth, request the detector's published methodology and threshold from your institution. Fifth, escalate through the proper appeal channel with the per-sentence breakdown attached and your draft history as corroboration.
Is there a detector with no false positives?
No, and any vendor claiming that is misrepresenting the problem. False positives are a structural property of probabilistic classifiers, not a bug to be patched out. The honest goal is a low, calibrated, transparent false positive rate, plus per-sentence rationale so a human reviewer can read the verdict critically. TextSight publishes a 2 percent FPR on native English and 6 percent on ESL writing, measured against a 400-passage June 2026 benchmark. We re-run the test quarterly and update the page.
Why is TextSight's ESL false positive rate lower?
We tuned the classifier in 2025 against writing samples from Indian universities (IIT, IIM, DU, JNU), Filipino education programmes, and Chinese postgraduate writing. The training set deliberately includes high-quality second-language academic prose so the model learns that low perplexity plus low burstiness is not enough on its own to flag a passage. Sentence-architecture and paragraph-cadence signals carry more weight, and those are harder to confuse with non-native fluency.

Run a second opinion on the same passage. Free, no signup.

TextSight's free tier gives you three scans a day at 5,000 characters per scan, with sentence-level highlights so you can read the verdict critically. No card, no email, no commitment.

Sentence-level highlights · ESL-aware calibration · Quarterly re-tested benchmark · No signup for the free tier