Every AI detector vendor claims to be 95% accurate. Some claim 99%. These numbers are, to be blunt, not the whole story — and in some cases they're close to meaningless.
I've spent the last several months running tests across the major detection tools, and the results are worth being honest about. Not to trash the tools — some of them are genuinely useful — but because students, educators, and HR professionals are making real decisions based on these accuracy claims, and they deserve better information than vendor marketing pages provide.
Let's get into what these numbers actually mean, what the independent tests show, and how TextSight's scoring approach fits into this picture.
"Accuracy" Doesn't Mean What You Think It Means
Start here, because everything else depends on it.
When a vendor says "99% accuracy," they typically mean: given a document that is either entirely human-written or entirely AI-generated, the tool correctly classifies it 99% of the time. That's not useless — but it's also a very controlled test condition that doesn't reflect how people use these tools.
Real-world text is:
- Mixed. Someone wrote a rough draft, used AI to expand two paragraphs, then edited it. Is that AI or human? Detectors disagree.
- Edited AI. AI output that's been revised, restructured, and voice-adjusted. This is probably the most common case in 2026.
- Human text that pattern-matches AI. ESL writers, formulaic academic prose, certain professional genres. These generate false positives.
Detection rate and accuracy are not the same thing. A tool that catches 94% of raw GPT-4o output might miss 40% of lightly edited GPT-5 output and falsely flag 9% of human-written ESL essays. Calling that "94% accurate" is technically defensible and practically misleading.
The False Positive Problem Is Real and It's Getting Worse
Here's my strong take: false positives are the most underreported problem in AI detection right now.
A false positive is when a tool flags human writing as AI-generated. For a student, that means a conversation with a professor that starts from a position of guilt. For an employee, it might mean a disciplinary process. The consequences are not trivial.
In 2025 tests by independent researchers at Stanford and several European universities, false positive rates on human text ranged from 4% to 16% depending on the tool and the writing style. ESL writers got flagged at nearly twice the rate of native English speakers. Formal academic prose got flagged at higher rates than casual prose.
That's a systemic bias problem. A tool with a 9% false positive rate isn't making a 9% mistake on your document — it's making a 9% mistake on all human documents it sees, unevenly distributed across language backgrounds and writing styles.
Tool-by-Tool Breakdown: What the Data Actually Shows
Here's what independent testing in 2025–2026 shows for detection of GPT-5 output (one generation more sophisticated than what most tools were trained on):
GPTZero
- GPT-5 detection rate: ~76%
- False positive rate: ~9%
- Notes: Strong on raw GPT-4o. Performance drops noticeably on GPT-5 and Claude 3.7. The false positive rate on formal academic prose is higher than the overall 9% average — closer to 13% for dense argumentative essays.
Originality.ai
- GPT-5 detection rate: ~81%
- False positive rate: ~6%
- Notes: Best detection rate in this comparison. Lower false positives than most. Paid-only product. The tradeoff is that it's more aggressive — it'll catch more AI, but it'll also flag more borderline human writing.
Copyleaks
- GPT-5 detection rate: ~79%
- False positive rate: ~5%
- Notes: The lowest false positive rate here. Good for institutional use where fairness to human writers matters. Detection rate is solid but not leading.
ZeroGPT
- GPT-5 detection rate: ~68%
- False positive rate: ~11%
- Notes: The weakest detection rate in this group and the second-highest false positive rate. It's free, which explains its wide usage, but I wouldn't use it as the sole basis for any consequential decision.
TextSight
TextSight doesn't give a binary verdict, so detection rate framing doesn't fully apply. What it shows is a Humanization Score. Raw GPT-5 output scores an average of 48/100 on TextSight — well into the "grey zone" (41–60) and close to the flagged range (0–40). Lightly edited GPT-5 output typically scores 55–65. Human-written text averages around 78–82.
The advantage of this approach: instead of a false positive, you get a score of 71 and a conversation about which specific phrases pulled it down. That's more useful than a binary "AI detected" label.
The Comparison Table
| Tool | GPT-5 Detection Rate | False Positive Rate | Cost | Binary Verdict? |
|---|---|---|---|---|
| Originality.ai | 81% | 6% | Paid ($0.01/100 words) | Yes |
| Copyleaks | 79% | 5% | Paid (enterprise) | Yes |
| GPTZero | 76% | 9% | Free + paid tiers | Yes |
| ZeroGPT | 68% | 11% | Free | Yes |
| TextSight | Score-based (48/100 avg on GPT-5) | Lower risk (score, not label) | Free tier + $7.49/mo | No — score |
Detection rates here are for GPT-5. All tools perform better on GPT-4o and worse on heavily edited output.
Why Vendor Claims Don't Match Independent Tests
This one's worth explaining, because vendors aren't necessarily lying. They're testing in favorable conditions.
Most published accuracy claims are based on tests where:
- The AI documents are pure, unedited outputs from a single model
- The human documents are from curated corpora (Wikipedia, academic papers, professional journalism)
- The test set was created around the same time the model was trained
Real-world conditions break all three of those assumptions. Users edit AI output. Human writing spans an enormous range of styles. And models keep updating — GPT-5 was trained partly to produce more natural text than GPT-4o, which directly attacks the statistical patterns older detectors were trained on.
When Originality.ai says 99% accuracy, they likely achieved that on GPT-4o text vs. well-formed human prose under controlled conditions. When I test it on GPT-5 output that's been lightly edited by a non-native English speaker, 81% is closer to reality. That's still useful. But it's a different number.
What "Detection Rate" Misses About Model Evolution
GPT-5 was specifically designed to produce less detectable AI text. That's not a conspiracy theory — it's an explicit part of the product's design goal, and OpenAI has been transparent about trying to make their output more natural.
The effect on detection is significant. GPT-4o generates text with a perplexity and burstiness profile that's quite distinct from human writing — detectors were tuned to this. GPT-5 produces text with higher perplexity variance, more natural sentence length distribution, and fewer of the over-represented phrases (like "delve" and "crucial role") that older models were notorious for.
In my testing, the gap between GPT-4o detection rates and GPT-5 detection rates is roughly 10–15 percentage points across all tools. That gap is going to widen as models continue to improve.
The Humanization Score Approach: Why It's Different
Binary verdicts have a fundamental problem: they collapse uncertainty into a yes/no answer.
If a tool is 81% confident something is AI-written, and you see "AI Detected," you assume 100% certainty. But if you saw "Humanization Score: 42 — borderline, specific phrases flagged: [list]," you'd have a more honest picture of what the tool actually knows.
TextSight's average score for raw GPT-5 is 48/100. That's clearly in risky territory (below 60), but it's not the same signal as a score of 22. A score of 48 on a human-written essay with some formal phrases is a very different situation from a score of 22 on pure, unedited ChatGPT output. Binary detectors don't give you that distinction.
The score also makes false positives more useful. If a human essay scores 64, that's not a false positive — it's a signal that the writing has some AI-adjacent patterns that could be addressed. The student isn't accused. They're shown what's triggering concern and given the ability to fix it.
When to Use Which Tool (My Actual Recommendation)
For students checking their own work before submission: TextSight is the right call. Free tier, no signup, score-based output that tells you where you stand and what to fix.
For educators who need to make consequential decisions: don't rely on any single tool. Run the document through two tools. If both flag it, that's a stronger signal. If one flags it and one doesn't, that's exactly the uncertainty you should communicate to the student before taking action.
For HR/recruitment teams checking writing samples: be extremely careful with any binary tool. A false positive on a cover letter could cost you a strong candidate. Score-based tools give you a more defensible basis for conversation.
For content agencies checking AI compliance: Originality.ai or Copyleaks for institutional use, TextSight for the score-level detail on borderline cases.
No tool is an infallible judge. The 75–81% detection rates on GPT-5 mean that roughly 1 in 5 AI-generated documents gets through undetected. That's not a flaw to hide from — it's a limitation to be honest about so decisions are made proportionally.
The Bottom Line
The honest take: AI detectors are useful probabilistic tools, not lie detectors. The best ones catch 80%+ of AI text. The worst catch barely two-thirds. False positives exist across all of them and disproportionately affect ESL writers and formal prose writers.
Vendor accuracy claims of 95%+ are almost always based on controlled test conditions that don't match how these tools are actually used. In real-world conditions with GPT-5, 76–81% is closer to the truth.
And the binary verdict model — "AI detected / not detected" — is becoming increasingly hard to justify as the stakes of these decisions go up. A score tells you more. It's less satisfying than a verdict, but it's more accurate.
See your own score — 5 free scans, no signup → textsight.ai
Related reading: