A false positive is a detector calling human writing AI-generated. Independent academic studies have measured rates between 4 and 22 percent depending on the tool and the writer, high enough that any single verdict deserves a second look. ESL writers, polished academic stylists, and short-passage submitters carry the highest risk. Below: measured FPR by detector, the five writing patterns most likely to trigger a false flag, and a five-step protocol if your writing has been wrongly flagged.
Before you appeal a verdict you need to know what the verdict actually is. A false positive is a measurable statistical event, not a judgement call.
A false positive is the technical term for a detector flagging a passage as AI-generated when it was, in fact, written by a human. The metric that quantifies this is the false positive rate, abbreviated FPR. It is the share of confirmed-human passages in a test set that the detector wrongly labels positive. If you scan 100 human-written essays and the tool flags 8 of them, the measured FPR on that sample is 8 percent.
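The arithmetic above can be sketched in a few lines. This is a minimal illustration, not any vendor's evaluation code: the function names are hypothetical, and the Wilson score interval is added here only to show how much uncertainty a 100-essay sample carries.

```python
import math

def false_positive_rate(flagged_human: int, total_human: int) -> float:
    """Share of confirmed-human passages the detector wrongly flagged."""
    if total_human <= 0:
        raise ValueError("total_human must be positive")
    return flagged_human / total_human

def wilson_interval(flagged_human: int, total_human: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for the measured proportion."""
    p = false_positive_rate(flagged_human, total_human)
    n = total_human
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# The example from the text: 8 of 100 human-written essays flagged.
fpr = false_positive_rate(8, 100)
low, high = wilson_interval(8, 100)
print(f"FPR = {fpr:.1%}, 95% interval roughly {low:.1%} to {high:.1%}")
```

On a sample this small the interval runs from roughly 4 to 15 percent, which is one reason a single published FPR figure should be read together with its sample size.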
Three things matter when you read an FPR number. The first is the sample: a vendor running its own internal eval gets to choose what counts as human writing, and a calibration set assembled from staff blog posts is not representative of a global mix. The second is the threshold: every detector has a tunable cutoff above which a score becomes a flag. Raising the threshold cuts FPR but also cuts true positive rate, the share of actual AI text caught. The third is who the writer is, because measured FPR varies sharply across populations, especially between native and non-native English writers.
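The threshold trade-off can be made concrete with a toy example. The scores below are invented for illustration and do not come from any real detector; `rates_at_threshold` is a hypothetical helper that flags every score at or above the cutoff.

```python
def rates_at_threshold(human_scores, ai_scores, threshold):
    """Return (FPR, TPR) when scores at or above `threshold` are flagged."""
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    return fpr, tpr

# Made-up detector scores (0 = human-like, 1 = AI-like), for illustration only.
human_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.72, 0.85]
ai_scores = [0.40, 0.55, 0.62, 0.70, 0.75, 0.80, 0.88, 0.90, 0.95, 0.99]

for threshold in (0.5, 0.7, 0.9):
    fpr, tpr = rates_at_threshold(human_scores, ai_scores, threshold)
    print(f"threshold={threshold:.1f}  FPR={fpr:.0%}  TPR={tpr:.0%}")
```

On this invented data, moving the cutoff from 0.5 to 0.9 removes every false positive but also lets most of the AI passages through, which is the trade-off a published FPR hides when the threshold is not stated.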
"I disagree with this verdict" is not the same thing as a false positive. A verdict is a probability, not a finding. The honest reading is: this detector judged the passage above its threshold, against a calibration set the writer was probably never in. That framing is what makes the protocol later on this page actually work.
Self-reported vendor numbers live in one column. Independent academic measurements live in the other. The gap between the two is the entire story.
The 2023 paper by Weber-Wulff and colleagues in the International Journal for Educational Integrity tested 14 detectors on a fixed corpus and measured false positive rates anywhere from 0 to 50 percent depending on the tool, with most clustering between 5 and 15 percent. The 2023 Stanford study by Liang et al. focused specifically on ESL writing, sampling TOEFL essays and English-language student work from non-native writers; several major detectors that report sub-2 percent FPR on native English clocked above 60 percent on the ESL sample. Elkhatat and co-authors published a separate 2023 evaluation showing sharp accuracy drops once passages crossed model boundaries, with detectors trained on GPT-2 era output performing notably worse on GPT-4 and Claude generations.
Self-reported numbers tell a different story. GPTZero published a 1 percent FPR on its 2024 methodology page. Turnitin documents a 4 percent FPR on long-form human prose. Originality.ai claims under 2 percent. Copyleaks reports under 1 percent. The vendor numbers are not lies; they are correct against the calibration set the vendor chose. They are simply not the number you should expect on your own traffic, especially if your traffic includes ESL writers, short passages, or polished technical prose.
Our own 400-passage June 2026 benchmark, with the methodology linked at the bottom of the next section, gives TextSight a 2 percent FPR on native English academic prose and 6 percent on ESL writing. We publish both numbers because the gap is the honest part. Any vendor showing you only the better of the two is hiding the population that matters most.
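To see why both numbers matter, here is a rough sketch of how per-population rates combine into the rate you would see on mixed traffic. The 2 and 6 percent figures are the ones quoted above; the 70/30 native/ESL split and the 500-submission volume are assumptions made up for the example, and `blended_fpr` is a hypothetical helper.

```python
def blended_fpr(shares, fprs):
    """Traffic-weighted FPR across writer populations."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(shares[group] * fprs[group] for group in shares)

# FPRs quoted in the text; the traffic mix is an assumed example.
fprs = {"native": 0.02, "esl": 0.06}
shares = {"native": 0.70, "esl": 0.30}

overall = blended_fpr(shares, fprs)
expected_false_flags = overall * 500  # assumed 500 human-written submissions

print(f"blended FPR = {overall:.1%}")
print(f"expected false flags per 500 human submissions = {expected_false_flags:.0f}")
```

With this assumed mix, the blended rate sits above the native-only figure, so a vendor quoting only the native number would understate what a mixed classroom actually experiences.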
False positives are not random. Four writer profiles absorb most of the misfires. If you fit any of them, treat any single detector verdict as a starting point, not a conclusion.
ESL (non-native English) writers are the biggest single risk group. Second-language academic writing tends to be technically correct, vocabulary-consistent, and structurally tidy, exactly the surface features that overlap with machine-generated prose. The Liang 2023 study put this on the record: several detectors that report under 5 percent FPR on native English measured above 60 percent on TOEFL essays. Our own benchmark shows the same direction at a smaller magnitude (a 6 percent versus 22 percent gap across leading detectors on identical Indian, Filipino, and Chinese university samples).
Counter-intuitively, the best human writers are also at risk. Award-winning student essays, peer-reviewed journal abstracts, and senior-level editorial prose tend to have lower perplexity and lower burstiness than rougher drafts. The signals detectors read as "human" partly reward inconsistency. Clean human writing is statistically hard to distinguish from clean machine writing, and the field is honest about that being a real limit.
Methods sections, lab reports, and code documentation use formulaic structures and constrained vocabularies by professional convention. The signal a detector reads as "this looks templated" is, in scientific writing, just the genre. Engineering and CS student writing flags at roughly twice the rate of humanities student writing in our internal samples, even when the underlying authorship is identical.
Anyone submitting under 250 words is fighting noise. Detectors need enough sentences to average their signal across, and below that floor the score swings wildly with a single rephrase. A 120-word discussion-board reply can score 28 percent AI on Monday and 71 percent AI on Tuesday with no edits, just sampling variance. If your workflow is short replies or paragraph-length scans, expect higher variance and lower trust on any single result.
Five surface features recur in nearly every wrongly-flagged passage we have audited. They are statistical properties of the text, not stylistic crimes. Knowing them lets you read a verdict critically.
Perplexity measures how predictable each next word is, given the words before it. Consistent vocabulary across an essay, especially in a narrow topic area, drives perplexity down. Classical detectors read low perplexity as a generation signal. The mechanism is real, but plenty of careful human writers produce low-perplexity prose simply by being careful.
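As a minimal sketch of the definition: perplexity is the exponential of the average negative log-probability a language model assigns to each token. The probabilities below are invented; a real detector would obtain them from a language model, which is not shown here.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs or any(p <= 0 or p > 1 for p in token_probs):
        raise ValueError("need at least one probability in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Invented per-token probabilities, for illustration only.
predictable = [0.5, 0.5, 0.5, 0.5]    # every next word is easy to guess
surprising = [0.5, 0.05, 0.5, 0.05]   # every other word is unexpected

print(f"predictable passage: {perplexity(predictable):.2f}")
print(f"surprising passage:  {perplexity(surprising):.2f}")
```

The predictable sequence scores lower, which is the direction a perplexity-based detector treats as more machine-like, regardless of who actually wrote it.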
Burstiness is variance in sentence length across the document. Human writing tends to spike (a 4-word sentence next to a 38-word one). AI writing is smoother. Writers trained in formal academic register often suppress the spikes deliberately because uniformity is read as polish. That suppression is exactly what the burstiness signal punishes.
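One simple way to compute the signal described above is the variance of sentence length in words. This is a sketch under that assumption; real detectors may define burstiness differently, and the sentence splitter here is deliberately naive.

```python
import re
import statistics

def sentence_lengths(text):
    """Word counts per sentence, using a naive split on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Population variance of sentence length in words (one possible definition)."""
    lengths = sentence_lengths(text)
    return statistics.pvariance(lengths) if lengths else 0.0

spiky = ("It failed. We reran the whole evaluation twice over the weekend "
         "because nobody on the team believed the first number. Still bad.")
smooth = ("The evaluation was repeated twice over the weekend. "
          "The first result was not believed by the team. "
          "The second result confirmed the original finding exactly.")

print(sentence_lengths(spiky), round(burstiness(spiky), 2))
print(sentence_lengths(smooth), round(burstiness(smooth), 2))
```

Both passages are plainly human-written, yet the second, written in an even formal register, has almost no variance in sentence length and would look smoother to a burstiness-based signal.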
Stock transitions such as "Furthermore," "Moreover," "In conclusion," and "It is important to note that" appear in both AI output and well-taught student writing for the same reason: they were learned as paragraph-glue in formal education. The detector cannot tell whether the writer learned them from a curriculum or from a model's training corpus.
Dense noun phrases, passive constructions, and chained "of"-clauses ("the calibration of the threshold of the detector") read as templated text to a generic classifier. In their native habitat (a methods section, a clinical write-up) they are simply correct genre. A detector tuned mostly on essays and blog posts is reading STEM prose against the wrong reference distribution.
Short length is the structural problem. Most detectors need at least 4-5 sentences of context to lock in a stable score, and the floor of usable accuracy is widely understood to be around 250 words. Below that you are sampling from a wide variance interval and treating one sample as if it were the mean.
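The variance argument can be illustrated with a toy simulation. It assumes a detector's passage score is the average of noisy per-sentence signals, which is a simplification for illustration and not a description of any specific product; the noise level and function names are hypothetical.

```python
import random
import statistics

def simulated_passage_score(n_sentences, rng, true_mean=0.5, sentence_sd=0.25):
    """Average of noisy per-sentence signals, each clipped to [0, 1]."""
    signals = [min(1.0, max(0.0, rng.gauss(true_mean, sentence_sd)))
               for _ in range(n_sentences)]
    return statistics.fmean(signals)

def score_spread(n_sentences, trials=2000, seed=0):
    """Standard deviation of the passage score across repeated simulated scans."""
    rng = random.Random(seed)
    scores = [simulated_passage_score(n_sentences, rng) for _ in range(trials)]
    return statistics.stdev(scores)

for n in (5, 15, 40):
    print(f"{n:>2} sentences: score spread ~ {score_spread(n):.3f}")
```

Under this assumption the spread shrinks roughly with the square root of the sentence count, so a five-sentence reply swings far more from scan to scan than a full essay does.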
15 detectors. One fixed sample of 100 ESL academic passages from Indian, Filipino, and Chinese university student writing. Each tool run at its default threshold inside a single 6-hour window on 2026-06-09. Methodology bullets after the table.
| Detector | Self-published FPR | Independent FPR | ESL FPR (measured) | Citation |
|---|---|---|---|---|
| GPTZero | ~1% | 4% | 22% | GPTZero 2024 / Stanford 2023 |
| Turnitin AI | 4% | ~5% | 14-18% | Turnitin docs / Weber-Wulff 2023 |
| Originality.ai | <2% | 6% | 19% | Originality docs / internal 2026 |
| Copyleaks | <1% | 8% | 17% | Copyleaks docs / Weber-Wulff |
| Winston AI | <1% | 7% | 16% | Vendor / internal 2026 |
| Sapling | n/a | 9% | 18% | Internal 2026 |
| ZeroGPT | n/a | 12% | 25% | Internal 2026 |
| Crossplag | n/a | 11% | 21% | Weber-Wulff 2023 |
| Content at Scale | <2% | 8% | 19% | Vendor / internal 2026 |
| Writer.com | n/a | 7% | 15% | Internal 2026 |
| Smodin | n/a | 13% | 24% | Internal 2026 |
| QuillBot Detector | n/a | 9% | 18% | Internal 2026 |
| Scribbr | <1% | 6% | 14% | Vendor / internal 2026 |
| Hive Moderation | n/a | 8% | 16% | Internal 2026 |
| TextSight | 2% | 2% | 6% | June 2026 benchmark, n=400 |
Read across any row and the gap between the vendor's self-published number and the measured ESL number is the part to take seriously. Detectors that publish under 2 percent are not lying, but the population that number applies to (typically a tidy internal sample of native English long-form writing) is not the population a real classroom or editorial pipeline contains. The ESL column reflects the same passages running through every tool the same day. Differences are the model, not the sample.
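Reading across a row can be done mechanically. The sketch below copies a handful of rows from the table and computes the gap in percentage points; entries published as "<2%" or "<1%" are entered at their upper bound and "~1%" as 1, which is an assumption that can only understate the gap.

```python
# (detector, self-published FPR %, measured ESL FPR %), copied from the table.
# "<N%" entries use N as an upper bound; "~1%" is entered as 1 (assumptions).
rows = [
    ("GPTZero", 1.0, 22.0),
    ("Originality.ai", 2.0, 19.0),
    ("Copyleaks", 1.0, 17.0),
    ("Scribbr", 1.0, 14.0),
    ("TextSight", 2.0, 6.0),
]

def esl_gap(self_published, esl_measured):
    """Gap in percentage points between measured ESL FPR and the vendor figure."""
    return esl_measured - self_published

gaps = {name: esl_gap(sp, esl) for name, sp, esl in rows}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} gap = {gap:.0f} percentage points")
```

The same calculation extends to any other row that has a numeric self-published figure; rows marked n/a have no vendor number to compare against.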
TextSight's 6 percent ESL FPR is the lowest in the table, and the gap to the next-best mainstream tool (Scribbr at 14 percent) is the entire reason we publish this page. We are not claiming a no-false-positive product, because that product does not exist. We are claiming a calibrated, transparent rate that holds up on the population most affected.
The five things to do before you do anything else. In order. They work because they prioritise evidence preservation over emotional reaction, and because they force the institution to engage with methodology rather than verdict.
Step 1. The flagged document is now evidence. Rewriting it destroys the paper trail. Leave the file untouched. Open a copy if you need to reference it, but do not save changes over the original.
Step 2. Google Docs has full revision history under File > Version history. Word with track changes keeps the timeline if it was on. Notion and most modern editors keep autosaves. Export the version history now, while it still exists.
Step 3. One verdict is a signal; three verdicts are a finding. Run the same passage through two more detectors and screenshot both results with the URL and timestamp visible. Agreement across tools is much harder to dismiss than any single score.
Step 4. Ask your institution which detector was used, at what threshold, and against what calibration set. Most reputable vendors publish this. If the institution cannot answer those three questions, the verdict is being treated as oracle output rather than a measurement.
Step 5. Go through the proper appeal channel. Include the per-sentence breakdown, your two independent re-scans, the methodology request, and your exported draft history. The combination shifts the burden of proof to the verdict, where it belongs.
Most academic integrity frameworks (including Turnitin's own documentation and GPTZero's published guidance) explicitly say a detector verdict should not be the sole basis for disciplinary action. That is the lever the protocol exists to pull. You are not asking the institution to ignore the verdict; you are asking it to follow its own published policy and corroborate the verdict against drafts, history, and a second signal. In our experience the protocol resolves the majority of cases at step 3 or step 4, before a formal hearing is needed.
A source-cited audit of GPTZero's published claims against independent academic studies, including the ESL accuracy gap.
Read the audit ›
The mechanism page. Perplexity, burstiness, model drift, and the structural reasons every detector has a non-zero error rate.
Read the explainer ›
Turnitin's published 4 percent FPR against independent academic measurements and the ESL skew measured on identical samples.
Read the breakdown ›
Side-by-side decision guide if GPTZero's verdict pattern is not working for your traffic, especially ESL and short-passage workloads.
See the alternative ›
The full methodology TextSight uses to measure and re-test detection accuracy quarterly, including sample composition and threshold logic.
Read the methodology ›
The full head-to-head with sentence-level highlights, ESL false-positive rates, pricing, free tier, and API exposed side-by-side.
Read the compare ›
TextSight's free tier gives you three scans a day at 5,000 characters per scan, with sentence-level highlights so you can read the verdict critically. No card, no email, no commitment.