How we measure 99.2% accuracy.

The benchmark, the dataset, the metrics, and the limits, published in enough detail that you can reproduce them.

Last verified June 9, 2026 · Model v4.2.1 · Benchmark size: 1,000 documents · FPR 0.8% · See benchmark results.

The claim

TextSight v4.2.1 achieves 99.2% accuracy on our public 1,000-document benchmark, with a 0.8% false-positive rate. We define accuracy as: correct verdicts (AI / human) divided by total documents.

Plain English: Out of 1,000 mixed docs, we got 992 verdicts right. Of the 200 documents written by humans, we wrongly flagged 1–2 of them as AI.

The dataset

The benchmark is 1,000 documents, balanced across five sources:

| Source | Count | Genre mix |
| --- | --- | --- |
| Human-written | 200 | Blog posts, essays, emails, fiction excerpts, journalism |
| ChatGPT (GPT-4o) | 200 | Same prompts, same word counts as the human set |
| Claude 3.5 Sonnet | 200 | Same prompts, same word counts |
| Gemini 1.5 Pro | 200 | Same prompts, same word counts |
| Llama 3 70B | 200 | Same prompts, same word counts |

Documents range from 250 to 5,000 words. Half are between 500 and 1,500 words (the realistic range we see in production). We also include edited AI text (humans rewriting AI drafts) as a robustness check; it is not part of the headline metric.

The metrics

Accuracy

Correct verdicts ÷ total docs. A "verdict" is the Authenticity Score thresholded at 50 (≥50 = human, <50 = AI).
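The thresholding rule and the accuracy definition above can be sketched in a few lines. This is a minimal illustration, not TextSight's implementation: the function names are hypothetical, and the 0 to 100 score range is an assumption (the text only states the threshold of 50).

```python
HUMAN_THRESHOLD = 50  # from the text: score >= 50 is "human", score < 50 is "AI"


def verdict(authenticity_score: float) -> str:
    """Map an Authenticity Score (assumed to lie in 0..100) to a verdict."""
    return "human" if authenticity_score >= HUMAN_THRESHOLD else "AI"


def accuracy(scores, labels) -> float:
    """Correct verdicts divided by total documents."""
    correct = sum(verdict(s) == y for s, y in zip(scores, labels))
    return correct / len(labels)


# Hypothetical scores and ground-truth labels, for illustration only.
scores = [80, 20, 55, 10]
labels = ["human", "AI", "AI", "AI"]
print(accuracy(scores, labels))
```

A score of exactly 50 counts as human under this rule, so the boundary case leans toward not flagging.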

False-positive rate (FPR)

Human docs flagged as AI ÷ total human docs. We optimize for FPR first, recall second. A wrong "AI" flag on real human writing damages trust badly, so we tune the model conservatively.

False-negative rate (FNR)

AI docs flagged as human ÷ total AI docs. We tolerate a higher rate here, because the cost of missing some AI text is lower than the cost of a false accusation.

| Metric | Value | How we calculate |
| --- | --- | --- |
| Accuracy | 99.2% | 992 / 1,000 |
| FPR | 0.8% | Wrongly flagged human docs / 200 human docs |
| FNR | 1.1% | Missed AI docs / 800 AI docs |
| Precision | 99.8% | True AI / (true AI + false AI) |
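For readers who want to check how these four metrics relate, here is a small sketch that derives them from a confusion matrix in which "AI" is the positive class. The `Confusion` class and `metrics` function are hypothetical names, and the counts in `example` are illustrative placeholders, not the benchmark's actual confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # AI docs flagged as AI
    fn: int  # AI docs flagged as human (missed AI)
    fp: int  # human docs flagged as AI (false accusation)
    tn: int  # human docs flagged as human


def metrics(c: Confusion) -> dict:
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,   # correct verdicts / all docs
        "fpr": c.fp / (c.fp + c.tn),         # flagged human docs / human docs
        "fnr": c.fn / (c.fn + c.tp),         # missed AI docs / AI docs
        "tpr": c.tp / (c.tp + c.fn),         # recall, equal to 1 - FNR
        "precision": c.tp / (c.tp + c.fp),   # true AI / all docs flagged AI
    }


# Illustrative counts for an 800 AI / 200 human split (NOT the real results).
example = Confusion(tp=788, fn=12, fp=2, tn=198)
for name, value in metrics(example).items():
    print(f"{name}: {value:.2%}")
```

Because the set holds four AI documents for every human one, accuracy is dominated by the AI-side error rate, which is one reason to report FPR separately rather than rely on accuracy alone.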

Benchmark results · head-to-head

Below are the exact numbers from our June 9, 2026 benchmark run on the 1,000-document test set described above. Competitor numbers come from running their public detectors against the same set on the same day, with default settings. Methodology notes follow the table.

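| Detector | Accuracy | TPR (recall) | FPR | Verdict latency |
| --- | --- | --- | --- | --- |
| TextSight v4.2.1 | 99.2% | 98.9% | 0.8% | 1.7s |
| GPTZero | 89.4% | 87.1% | 14.0% | 2.6s |
| Originality.ai | 91.7% | 92.5% | 11.2% | 2.1s |
| Copyleaks | 87.3% | 84.6% | 16.8% | 3.4s |
| Turnitin | 85.1% | 82.0% | 22.1% | n/a (LMS-only) |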

Reading the table. TPR (true-positive rate) is the share of AI documents correctly flagged as AI. FPR (false-positive rate) is the share of human documents wrongly flagged as AI. The single most damaging number for any detector is FPR, because a wrong "AI" verdict on real human writing breaks trust in seconds.

Subset numbers from the same run, for category-specific pages:

Methodology notes.

The five classifiers

Every sentence runs through five independent classifiers. The Authenticity Score is a calibrated combination of their outputs.

A verdict fires only when at least 4 of 5 classifiers agree. This is why our FPR is so low, and why our FNR is higher than that of competitors who use a single transformer head.
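The agreement rule can be pictured as a supermajority vote. The sketch below is an assumption-laden illustration, not the production logic: it assumes each classifier casts a boolean "AI" vote, that the 4-of-5 rule gates the "AI" flag, and that anything short of it is not flagged. The name `flags_as_ai` is hypothetical, and the calibrated score combination mentioned above is not modeled.

```python
from typing import Sequence

MIN_AGREEMENT = 4  # from the text: at least 4 of 5 classifiers must agree


def flags_as_ai(votes: Sequence[bool], min_agreement: int = MIN_AGREEMENT) -> bool:
    """votes[i] is True when classifier i judges the text to be AI-written."""
    return sum(votes) >= min_agreement


# A borderline document: three classifiers say AI, two say human.
borderline = [True, True, True, False, False]

print(flags_as_ai(borderline))                   # 4-of-5 rule
print(flags_as_ai(borderline, min_agreement=3))  # simple majority, for contrast
```

Raising the agreement bar from 3 to 4 leaves borderline documents unflagged, which trades some missed AI text for fewer false accusations.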

False positives: how we minimize them

The single worst thing an AI detector can do is wrongly accuse a human of using AI. We treat FPR as the primary metric, not accuracy.

What we can't do (yet)

We want to be honest about limits. No detector is infallible. Here's what we know.

Reproduce our numbers

You can. We publish:

Email research@textsight.ai with your affiliation and we'll send the bundle within 48 hours. We've shipped it to 14 academic groups so far.

Looking for category-specific results? See the resume subset on /ai-detector-for-resumes/, the cover-letter subset on /ai-detector-for-cover-letters/, or the head-to-head numbers on /textsight-vs-gptzero/.

Changelog

| Version | Date | Accuracy | FPR | Notes |
| --- | --- | --- | --- | --- |
| v4.2.1 | May 2026 | 99.2% | 0.8% | Added Claude 3.5 Haiku to training set |
| v4.2.0 | Mar 2026 | 99.0% | 0.9% | Llama 3 70B coverage; FPR drop |
| v4.1.0 | Jan 2026 | 98.7% | 1.2% | Gemini 1.5 Pro coverage; transformer head v2 |
| v4.0.0 | Oct 2025 | 97.9% | 1.8% | 5-classifier ensemble launched |

Questions about the benchmark? Or want to challenge a result? research@textsight.ai · we read every email.

Run the test yourself.

Paste any text. We'll show you exactly what every classifier said.
