The benchmark, the dataset, the metrics, and the limits, published in enough detail that you can reproduce the results yourself.
Last verified June 9, 2026 · Model v4.2.1 · Benchmark size: 1,000 documents · FPR 0.8% · See benchmark results.
TextSight v4.2.1 achieves 99.2% accuracy on our public 1,000-document benchmark, with a 0.8% false-positive rate. We define accuracy as: correct verdicts (AI / human) divided by total documents.
The benchmark is 1,000 documents, balanced across five sources:
| Source | Count | Genre mix |
|---|---|---|
| Human-written | 200 | Blog posts, essays, emails, fiction excerpts, journalism |
| ChatGPT (GPT-4o) | 200 | Same prompts, same word counts as the human set |
| Claude 3.5 Sonnet | 200 | Same prompts, same word counts |
| Gemini 1.5 Pro | 200 | Same prompts, same word counts |
| Llama 3 70B | 200 | Same prompts, same word counts |
Documents range from 250 to 5,000 words. Half are between 500 and 1,500 words (the realistic range we see in production). We also include edited AI text (humans rewriting AI drafts) as a robustness check; it is not part of the headline metric.
Accuracy. Correct verdicts ÷ total docs. A "verdict" is the Authenticity Score thresholded at 50 (≥50 = human, <50 = AI).
False-positive rate (FPR). Human docs flagged as AI ÷ total human docs. We optimize for FPR first, recall second. A wrong "AI" flag on real human writing damages trust badly, so we tune the model conservatively.
False-negative rate (FNR). AI docs flagged as human ÷ total AI docs. We tolerate a higher rate here, because the cost of missing some AI text is lower than the cost of a false accusation.
| Metric | Value | How we calculate |
|---|---|---|
| Accuracy | 99.2% | 992 / 1,000 |
| FPR | 0.8% | Wrongly flagged human docs / 200 human docs |
| FNR | 1.1% | Missed AI docs / 800 AI docs |
| Precision | 99.8% | Correctly flagged AI docs / all docs flagged as AI |
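The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not TextSight's scoring code: the function and class names are hypothetical, and the ten sample documents are made-up toy data chosen only to exercise each branch. The one detail taken from the text is the threshold of 50 on the Authenticity Score.

```python
from dataclasses import dataclass

# From the text: a score of 50 or more reads as "human", below 50 as "ai".
THRESHOLD = 50


def verdict(score: float) -> str:
    """Turn an Authenticity Score into a verdict label."""
    return "human" if score >= THRESHOLD else "ai"


@dataclass
class Metrics:
    accuracy: float
    fpr: float
    fnr: float
    precision: float


def score_benchmark(docs: list[tuple[str, float]]) -> Metrics:
    """Compute the four table metrics.

    docs holds (true_label, authenticity_score) pairs, where
    true_label is "human" or "ai". "Positive" means "flagged as AI".
    """
    tp = fp = tn = fn = 0
    for label, score in docs:
        flagged_ai = verdict(score) == "ai"
        if label == "ai":
            if flagged_ai:
                tp += 1  # AI doc correctly flagged
            else:
                fn += 1  # AI doc missed
        else:
            if flagged_ai:
                fp += 1  # human doc wrongly flagged
            else:
                tn += 1  # human doc correctly passed

    total = tp + fp + tn + fn
    return Metrics(
        accuracy=(tp + tn) / total if total else 0.0,
        fpr=fp / (fp + tn) if (fp + tn) else 0.0,
        fnr=fn / (fn + tp) if (fn + tp) else 0.0,
        precision=tp / (tp + fp) if (tp + fp) else 0.0,
    )


# Toy data for illustration only; these are not benchmark documents.
sample = [
    ("human", 82), ("human", 50), ("human", 41), ("human", 77),
    ("ai", 12), ("ai", 49.9), ("ai", 63), ("ai", 5), ("ai", 30), ("ai", 20),
]
metrics = score_benchmark(sample)
print(metrics)
```

Note that FPR and FNR use different denominators (human docs and AI docs respectively), so on an unbalanced set like this one they do not simply add up to the overall error rate.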
Below are the exact numbers from our June 9, 2026 benchmark run on the 1,000-document test set described above. Competitor numbers come from running their public detectors against the same set on the same day, with default settings. Methodology notes follow the table.
| Detector | Accuracy | TPR (recall) | FPR | Verdict latency |
|---|---|---|---|---|
| TextSight v4.2.1 | 99.2% | 98.9% | 0.8% | 1.7s |
| GPTZero | 89.4% | 87.1% | 14.0% | 2.6s |
| Originality.ai | 91.7% | 92.5% | 11.2% | 2.1s |
| Copyleaks | 87.3% | 84.6% | 16.8% | 3.4s |
| Turnitin | 85.1% | 82.0% | 22.1% | n/a (LMS-only) |
Reading the table. TPR (true-positive rate) is the share of AI documents correctly flagged as AI. FPR (false-positive rate) is the share of human documents wrongly flagged as AI. The single most damaging number for any detector is FPR, because a wrong "AI" verdict on real human writing breaks trust in seconds.
Subset numbers from the same run, for category-specific pages:
Methodology notes.
Every sentence runs through five independent classifiers. The Authenticity Score is a calibrated combination of their outputs.
A verdict fires only when at least 4 of 5 classifiers agree. This is why our FPR is so low. The trade-off is that borderline AI text is more likely to slip through than it would under a single transformer head.
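As a rough sketch of the agreement rule, assuming each of the five classifiers emits a discrete "ai" or "human" vote: the function name is hypothetical, and returning `None` on a split vote is an assumption, since the page does not say what happens when fewer than four classifiers agree or how the rule interacts with the calibrated Authenticity Score.

```python
from collections import Counter
from typing import Optional

# From the text: at least 4 of the 5 classifiers must agree.
MIN_AGREEMENT = 4
NUM_CLASSIFIERS = 5


def ensemble_verdict(votes: list[str]) -> Optional[str]:
    """Return the agreed label, or None when no verdict fires.

    votes is one "ai" or "human" label per classifier.
    Returning None for a split vote is an assumption of this sketch.
    """
    if len(votes) != NUM_CLASSIFIERS:
        raise ValueError("expected exactly five classifier votes")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= MIN_AGREEMENT else None


print(ensemble_verdict(["ai", "ai", "ai", "ai", "human"]))
print(ensemble_verdict(["ai", "ai", "ai", "human", "human"]))
```

Under this sketch a 3-2 split produces no verdict at all, which is the conservative behavior the rule is meant to enforce: a bare majority is not enough to call a document AI.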
The single worst thing an AI detector can do is wrongly accuse a human of using AI. We treat FPR as the primary metric, not accuracy.
You can. We publish:
Email research@textsight.ai with your affiliation and we'll send the bundle within 48 hours. We've shipped it to 14 academic groups so far.
Looking for category-specific results? See the resume subset on /ai-detector-for-resumes/, the cover-letter subset on /ai-detector-for-cover-letters/, or the head-to-head numbers on /textsight-vs-gptzero/.
| Version | Date | Accuracy | FPR | Notes |
|---|---|---|---|---|
| v4.2.1 | May 2026 | 99.2% | 0.8% | Added Claude 3.5 Haiku to training set |
| v4.2.0 | Mar 2026 | 99.0% | 0.9% | Llama 3 70B coverage; FPR drop |
| v4.1.0 | Jan 2026 | 98.7% | 1.2% | Gemini 1.5 Pro coverage; transformer head v2 |
| v4.0.0 | Oct 2025 | 97.9% | 1.8% | 5-classifier ensemble launched |
Paste any text. We'll show you exactly what every classifier said.