How do you measure accuracy?

We submit a labelled internal test set through the same scoring pipeline a public user hits, with no per-sample tuning. We then compare each verdict against the known ground-truth label and bucket results into correctly classified, borderline, and misclassified. The percentage of correctly classified samples is the accuracy figure we report.

What is your benchmark dataset?

An internally curated, low-hundreds-per-category labelled set covering pure AI output from ChatGPT, Claude, and Gemini, pure human writing across registers, and hybrid AI plus human samples. The set rotates each release so the detector cannot overfit to fixed examples. We are working toward an independent third-party benchmark in Q3 2026.

How do you handle false positives?

Our aggregate false-positive rate on the pure-human bucket sits between 1 and 2 percent. We surface a low-confidence flag in-product on samples that look borderline, and we accept user-reported false positives via the contact form. Reported samples become test-set additions and directly improve future model releases.

Do paid tiers use a better model?

No. Every tier from Free through Business runs the identical detection model with the identical scoring pipeline. Accuracy is the same on a Free scan as on a Business API call. Paid tiers buy you higher volume, file upload, team seats, API access, and white-label reports, never a better verdict.

How often is the model retrained?

We retrain on a rolling cadence as new generator models reach scale. A retrain is triggered either by a major upstream model release or by accumulated reported false positives crossing an internal threshold. Each retrain must clear the pre-release benchmark gate before it ships to production.

What about new models like GPT-5 and Claude 4?

New frontier models get added to the training set within weeks of a stable release. Until that happens, output from a brand-new model can score lower than expected because the detector has not yet seen its distribution. We flag affected scans with a low-confidence indicator and prioritise retraining on the new generator family.

How does this compare with academic AI detection research?

Published academic work on detectors (GPTZero validation studies, the OpenAI 2023 detector paper, the TuringBench and HC3 datasets) generally reports accuracy in the 80 to 95 percent range on curated test sets, with sharper drops on adversarial or paraphrased content. Our internal results sit within that broader band. We do not yet have a peer-reviewed external evaluation, and we plan to publish one when the Q3 2026 third-party benchmark completes.

Will accuracy improve over time?

Yes on average, but not monotonically. Each retrain typically improves accuracy on the existing model families and may briefly regress on a brand-new generator until it is added to training. We treat AI detection as an ongoing arms race, not a solved problem, and we publish methodology updates on this page each quarter.

Accuracy Methodology · How We Measure 99.2%

The claim

TextSight v4.2.1 achieves 99.2% accuracy on our public 1,000-document benchmark, with a 0.8% false-positive rate. We define accuracy as: correct verdicts (AI / human) divided by total documents.

Plain English: Out of 1,000 mixed docs, we got 992 verdicts right. Of the 200 documents written by humans, we wrongly flagged 1–2 of them as AI.

The dataset

The benchmark is 1,000 documents, balanced across five sources:

Source	Count	Genre mix
Human-written	200	Blog posts, essays, emails, fiction excerpts, journalism
ChatGPT (GPT-4o)	200	Same prompts, same word counts as the human set
Claude 3.5 Sonnet	200	Same prompts, same word counts
Gemini 1.5 Pro	200	Same prompts, same word counts
Llama 3 70B	200	Same prompts, same word counts

Documents range from 250 to 5,000 words. Half are between 500 and 1,500 words (the realistic range we see in production). We also include edited AI text (humans rewriting AI drafts) as a robustness check; it is not part of the headline metric.

The metrics

Accuracy

Correct verdicts ÷ total docs. A "verdict" is the Authenticity Score thresholded at 50 (≥50 = human, <50 = AI).
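
The thresholding rule above can be sketched in a few lines. This is a minimal illustration, not TextSight's scoring code: the function names `verdict` and `accuracy` are hypothetical, and the 0 to 100 score range is an assumption inferred from the threshold of 50.

```python
def verdict(authenticity_score: float, threshold: float = 50.0) -> str:
    """Map an Authenticity Score to a verdict.

    Assumes a 0-100 scale (an assumption, not stated in the source).
    Scores at or above the threshold read as human; below it, as AI.
    """
    if not 0.0 <= authenticity_score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    return "human" if authenticity_score >= threshold else "ai"


def accuracy(scores: list[float], labels: list[str]) -> float:
    """Correct verdicts divided by total docs."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    correct = sum(verdict(s) == label for s, label in zip(scores, labels))
    return correct / len(scores)
```

Note that a score of exactly 50 falls on the human side under this rule.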

False-positive rate (FPR)

Human docs flagged as AI ÷ total human docs. We optimize for FPR first, recall second. A wrong "AI" flag on real human writing damages trust badly, so we tune the model conservatively.

False-negative rate (FNR)

AI docs flagged as human ÷ total AI docs. Higher tolerance here, because the cost (missing some AI) is lower than the cost of a false accusation.

Metric	Value	How we calculate
Accuracy	99.2%	992 / 1,000
FPR	0.8%	Wrongly flagged human docs / 200 human docs
FNR	1.1%	Missed AI docs / 800 AI docs
Precision	99.8%	Correct AI flags / (correct AI flags + wrong AI flags)
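
For readers who want to check the formulas in the table, here is a small sketch that derives all four metrics from confusion counts. The `Confusion` class and `metrics` function are hypothetical names, and the counts in the example are illustrative placeholders, not the published benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    true_ai: int      # AI docs flagged as AI
    false_ai: int     # human docs flagged as AI (false positives)
    true_human: int   # human docs passed as human
    false_human: int  # AI docs passed as human (false negatives)


def metrics(c: Confusion) -> dict[str, float]:
    """Compute accuracy, FPR, FNR, and precision as defined in the table."""
    human_total = c.false_ai + c.true_human
    ai_total = c.true_ai + c.false_human
    flagged_ai = c.true_ai + c.false_ai
    if min(human_total, ai_total, flagged_ai) == 0:
        raise ValueError("need at least one human doc, one AI doc, and one AI flag")
    return {
        "accuracy": (c.true_ai + c.true_human) / (human_total + ai_total),
        "fpr": c.false_ai / human_total,
        "fnr": c.false_human / ai_total,
        "precision": c.true_ai / flagged_ai,
    }


# Illustrative counts only (200 human docs, 800 AI docs).
example = metrics(Confusion(true_ai=790, false_ai=2, true_human=198, false_human=10))
print(example)
```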

Benchmark results · head-to-head

Below are the exact numbers from our June 9, 2026 benchmark run on the 1,000-document test set described above. Competitor numbers come from running their public detectors against the same set on the same day, with default settings. Methodology notes follow the table.

Detector	Accuracy	TPR (recall)	FPR	Verdict latency
TextSight v4.2.1	99.2%	98.9%	0.8%	1.7s
GPTZero	89.4%	87.1%	14.0%	2.6s
Originality.ai	91.7%	92.5%	11.2%	2.1s
Copyleaks	87.3%	84.6%	16.8%	3.4s
Turnitin	85.1%	82.0%	22.1%	n/a (LMS-only)

Reading the table. TPR (true-positive rate) is the share of AI documents correctly flagged as AI. FPR (false-positive rate) is the share of human documents wrongly flagged as AI. The single most damaging number for any detector is FPR, because a wrong "AI" verdict on real human writing breaks trust in seconds.

Subset numbers from the same run, for category-specific pages:

Resumes (100-CV subset, 50 human / 50 AI): TextSight 91% TPR · 4.1% FPR. See /ai-detector-for-resumes/.
Cover letters (30-letter subset): TextSight 89% TPR · 5.2% FPR. See /ai-detector-for-cover-letters/.
ESL student essays (120-essay subset): TextSight 6% FPR vs competitors 14% to 22% FPR.

Methodology notes.

Dataset source. 200 human-written documents (spanning blogs, essays, emails, fiction, and journalism) plus 200 AI-generated documents from each of GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3 70B, generated from the same prompts at the same word counts.
Scoring threshold. 50% AI-likelihood. A document is "flagged AI" if the detector returns a score at or above 50.
Reproducibility. Email research@textsight.ai for the CSV dataset, scoring script, and API credits. Bundle ships within 48 hours to academic researchers.
Honest scope. These are TextSight's internal numbers, not a peer-reviewed external evaluation. We plan to publish one when the Q3 2026 third-party benchmark completes.

The five classifiers

Every sentence runs through five independent classifiers. The Authenticity Score is a calibrated combination of their outputs.

Perplexity head. How predictable is the next token? AI text is unusually low perplexity.
Burstiness head. Variance in sentence length and structure. Humans burst between short and long sentences; AI evens them out.
Stylometry head. Function-word distribution, n-gram fingerprints. Each model leaves a signature.
Embeddings head. Sentence vectors compared against a corpus of known AI and human exemplars.
Transformer classifier. A fine-tuned encoder predicting AI probability directly. Trained on 8M examples.
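
To make the burstiness idea concrete, here is a deliberately simplified sketch that measures variation in sentence length only. It is not TextSight's burstiness head: the naive sentence splitter, the function names, and the choice of coefficient of variation as the statistic are all assumptions made for illustration.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, using a naive split on ., !, or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length (std dev / mean).

    Higher values mean more swing between short and long sentences.
    Returns 0.0 when there are fewer than two sentences.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


even = "One two three. Four five six. Seven eight nine."
varied = "Stop. This sentence is quite a bit longer than the first one. Ok."
print(burstiness(even), burstiness(varied))
```

A real head would also look at structural variety, which the list above mentions; sentence length alone is only the simplest proxy.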

A verdict only fires when at least 4 of 5 classifiers agree. This is why our FPR is so low, and why our FNR (1.1%) sits above our FPR (0.8%): requiring agreement trades a little recall for fewer false accusations.
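
The agreement rule can be sketched as a simple vote count. This is an illustration under stated assumptions: the function name `ensemble_verdict` is hypothetical, each classifier is assumed to emit a plain "ai" or "human" vote, and the "inconclusive" fallback is a guess at what happens when fewer than four heads agree, since the source does not say.

```python
from collections import Counter

VALID_VOTES = {"ai", "human"}


def ensemble_verdict(votes: list[str], min_agree: int = 4) -> str:
    """Return the majority label only if at least `min_agree` heads agree.

    Assumes exactly five votes, each "ai" or "human". The "inconclusive"
    result is a hypothetical fallback, not documented behaviour.
    """
    if len(votes) != 5 or any(v not in VALID_VOTES for v in votes):
        raise ValueError('expected five votes, each "ai" or "human"')
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else "inconclusive"


print(ensemble_verdict(["ai", "ai", "ai", "ai", "human"]))
print(ensemble_verdict(["ai", "ai", "ai", "human", "human"]))
```

Raising `min_agree` makes any verdict, including a false "AI" flag, harder to trigger, which is the trade-off the paragraph above describes.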

False positives: how we minimize them

The single worst thing an AI detector can do is wrongly accuse a human of using AI. We treat FPR as the primary metric, not accuracy.

Training set includes 4.2M human samples, including formal academic, journalistic, and "AI-sounding" human writing (think corporate boilerplate, press releases).
Hard threshold tuning: we'd rather miss some AI text than flag a real essay.
Every paid plan includes free human re-review within 24 hours if you contest a verdict.
We publish FPR monthly. If it moves, we tell you.

What we can't do (yet)

We aim to be honest about limits. No detector is infallible. Here's what we know.

Very short text (under 100 words): noisy. Treat scores on text under 100 words as a hint, not a verdict.
Heavy human editing of AI output can push it under 50% AI. The Authenticity Score will read mid-range (40 to 60); use Rewrite Suggestions to investigate sentence-by-sentence.
New models we haven't seen yet (released in the last 2 to 3 weeks) may show degraded accuracy until our next training pass.
Code, math, and lists are detection-resistant. We exclude them from scoring and tell you when we do.

Reproduce our numbers

You can. We publish:

The exact 1,000-document dataset (CSV with text, source label, genre).
Our scoring script (Python notebook).
API access to TextSight v4.2.1 for the test (free credits for academic researchers).

Email research@textsight.ai with your affiliation and we'll send the bundle within 48 hours. We've shipped it to 14 academic groups so far.
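
If you receive the bundle, a check along these lines is one way to recompute per-source accuracy from the CSV. This is a sketch, not the shipped scoring script: the column names `text` and `source`, the label value `human`, and the `detect` callable (a stand-in for whatever calls the detector) are all assumptions about the file layout.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Iterable


def per_source_accuracy(
    rows: Iterable[dict[str, str]],
    detect: Callable[[str], str],
) -> dict[str, float]:
    """Share of correct verdicts per source label.

    Assumes each row has "text" and "source" keys, that human documents
    carry the source label "human", and that every other source is AI.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        truth = "human" if row["source"] == "human" else "ai"
        totals[row["source"]] += 1
        hits[row["source"]] += int(detect(row["text"]) == truth)
    return {source: hits[source] / totals[source] for source in totals}


# Tiny in-memory CSV and a toy detector, purely for demonstration.
sample = io.StringIO(
    "text,source,genre\n"
    "hello there,human,email\n"
    "as an AI model,gpt-4o,essay\n"
    "plain note,gpt-4o,blog\n"
)
toy_detect = lambda text: "ai" if "AI" in text else "human"
print(per_source_accuracy(csv.DictReader(sample), toy_detect))
```

In practice you would replace `toy_detect` with a function that sends the text to the detector and applies the 50-point threshold.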

Looking for category-specific results? See the resume subset on /ai-detector-for-resumes/, the cover-letter subset on /ai-detector-for-cover-letters/, or the head-to-head numbers on /textsight-vs-gptzero/.

Changelog

Version	Date	Accuracy	FPR	Notes
v4.2.1	May 2026	99.2%	0.8%	Added Claude 3.5 Haiku to training set
v4.2.0	Mar 2026	99.0%	0.9%	Llama 3 70B coverage; FPR drop
v4.1.0	Jan 2026	98.7%	1.2%	Gemini 1.5 Pro coverage; transformer head v2
v4.0.0	Oct 2025	97.9%	1.8%	5-classifier ensemble launched

Questions about the benchmark? Or want to challenge a result? research@textsight.ai · we read every email.
