
Accurate ChatGPT detector — sentence-level evidence + ESL calibration.

No detector is 100 percent accurate on every text it sees, and any vendor who says otherwise is selling a billboard. TextSight is trained on roughly 4 million GPT-4 outputs, scored at the sentence level so you can see which lines triggered the flag, and calibrated against roughly 600,000 ESL writing samples so non-native writers do not get punished for sounding uniform. The result is around 90 percent accuracy on native long-form GPT-4 text, an ESL false-positive rate about 40 percent below uncalibrated detectors, and a result panel you can actually defend in a conversation.

~90% accuracy on GPT-4 long-form · ESL false positives 40% lower · Sentence-level evidence on every scan
Why TextSight is accurate

Three things that move accuracy in this category.

Accuracy in AI detection comes from training data scale, where the signal is scored, and whether the classifier has been calibrated against the populations that get falsely flagged. TextSight is built around all three.

1. Trained on roughly 4 million GPT-4 outputs

The training corpus spans essays, blog posts, emails, product descriptions, scripts, marketing copy, and technical documentation. It includes raw GPT-4, GPT-4 with custom system prompts, GPT-4o multimodal text, and a growing GPT-5 sample. Volume matters because the GPT-4 family fingerprint shifts with each minor release and with each new prompting style the public adopts, and a smaller training set drifts faster.

2. Sentence-level signals, not just a document percentage

Every sentence gets its own probability score. The result panel colours each sentence green, yellow, or red and lists the specific GPT-4 signals that triggered each flag (sentence-length floor, nested-clause density, the polite-assistant register, the "thoughtful synthesis" closer). A document-level percentage alone is brittle. A document percentage backed by sentence evidence holds up when a teacher, editor, or recruiter has to explain why the call was made.
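As a rough illustration of the per-sentence colouring described above, the sketch below maps a sentence probability to green, yellow, or red. The cut-offs (0.30 and 0.70) and the names `colour_for` and `colour_map` are assumptions made for illustration only; this page does not publish the thresholds TextSight actually uses.

```python
from typing import List, Tuple

# Hypothetical cut-offs, chosen only for illustration.
YELLOW_FROM = 0.30
RED_FROM = 0.70


def colour_for(ai_probability: float) -> str:
    """Map one sentence's AI probability (0.0 to 1.0) to a display colour."""
    if not 0.0 <= ai_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if ai_probability >= RED_FROM:
        return "red"
    if ai_probability >= YELLOW_FROM:
        return "yellow"
    return "green"


def colour_map(scored: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Turn (sentence, probability) pairs into (sentence, colour) pairs."""
    return [(sentence, colour_for(p)) for sentence, p in scored]
```

A reviewer-facing panel would then list the triggering signals next to each red or yellow sentence, which this sketch leaves out.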

3. ESL calibration drops false positives by about 40 percent

Uncalibrated detectors flag ESL writing as AI roughly twice as often as native English writing, because the simpler vocabulary and more uniform sentence structure that ESL writers tend toward are also features the classifier associates with AI. TextSight runs a calibration layer trained on roughly 600,000 ESL writing samples that reduces the ESL false-positive rate from about 10 percent to 4 to 6 percent without raising the native-English false-positive rate. That single change is the difference between a usable tool and an academic-integrity disaster.

Honest accuracy framing

What TextSight actually scores, and what it does not.

Accuracy in this category depends on three variables: the source model, the length of the text, and whether the text has been edited or rewritten. Here is where TextSight lands on each.

Around 90 percent on native GPT-4 long-form text

On unedited GPT-4 output of 500 words or longer, the classifier's accuracy sits at roughly 90 percent. That holds across GPT-4, GPT-4o, and the bulk of GPT-5 output. Most of the detectable AI text in 2026 lives in this band, which is why it gets the most engineering attention.

70 to 80 percent on heavily edited or rewritten GPT text

If a passage went through three rounds of human edits, or through a dedicated AI rewriter, no detector reliably hits 90 percent. TextSight drops to roughly 70 to 80 percent in that regime, and the result panel surfaces a "post-edited" warning so you know to weight the call accordingly.

4 to 6 percent ESL false positive, 1 to 2 percent native false positive

The ESL gap is real and category-wide. Every detector scores worse on English written by non-native speakers. TextSight's ESL calibration brings the ESL false-positive rate down to roughly 4 to 6 percent, against the 8 to 12 percent that uncalibrated detectors typically show. Native English false positives sit at 1 to 2 percent, which is the industry norm for the band.

No detector is 100 percent accurate

That sentence will eat most competitor marketing, and it is still true. Treat any ChatGPT detector score as a conversation starter, not a verdict. The 0 to 100 Authenticity Score and the sentence colours together give you enough evidence to decide whether to escalate a paper, push back on a freelancer, or move on. They do not give you the right to declare a piece "AI" with no further review.

ESL calibration

Why the ESL gap is the accuracy problem that matters.

ESL false positives are the single most consequential failure mode in this category, because they send real students into academic-integrity hearings they should not be in. TextSight treats the ESL gap as a first-class engineering target, not a footnote.

What goes wrong without calibration

ESL writers tend toward shorter average sentences, more uniform sentence rhythm, lower vocabulary variance, and fewer idiomatic constructions. A naive classifier trained on native-English-versus-GPT-4 data sees those same features in GPT-4 output, and the easiest way to be "accurate" on the test set is to flag uniform writing as AI. The result is an ESL false-positive rate around 10 percent on most uncalibrated detectors, double the native-English rate.

How TextSight calibrates

A second classifier layer trained on roughly 600,000 ESL writing samples scores how "ESL-like" the input looks before the AI classifier delivers its verdict. When the ESL-likeness score is high, the AI threshold is shifted to compensate for the population baseline. The trade-off is a slightly higher false-negative rate on AI text written in an ESL-style register, which we judge an acceptable cost compared to falsely accusing a real student.
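The threshold shift described above can be sketched as follows. This is a minimal illustration, not TextSight's implementation: the base threshold, the maximum shift, the linear relationship, and the function names are all assumptions.

```python
BASE_THRESHOLD = 0.50  # hypothetical default cut-off for calling text "AI"
MAX_SHIFT = 0.20       # hypothetical largest upward shift for ESL-like input


def adjusted_threshold(esl_likeness: float) -> float:
    """Raise the AI threshold in proportion to how ESL-like the input looks.

    A linear shift is assumed here purely for illustration.
    """
    if not 0.0 <= esl_likeness <= 1.0:
        raise ValueError("esl_likeness must be between 0 and 1")
    return BASE_THRESHOLD + MAX_SHIFT * esl_likeness


def is_flagged(ai_probability: float, esl_likeness: float) -> bool:
    """Flag text only if its AI probability clears the shifted threshold."""
    return ai_probability >= adjusted_threshold(esl_likeness)
```

With these made-up numbers, a passage scoring 0.60 is flagged when it looks native (`esl_likeness` of 0.0) but not when it looks strongly ESL (`esl_likeness` of 1.0). The same shift is what lets some AI text in an ESL-style register slip through, which is the trade-off named above.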

The published result

ESL false positives drop from roughly 10 percent to 4 to 6 percent, depending on first language and writing topic. Native English false positives stay at 1 to 2 percent. The accuracy on real ChatGPT output drops by about 1 percentage point, which we consider a fair trade given who eats the cost of getting the call wrong.
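To relate the quoted rates to the "about 40 percent" figure, the short calculation below works out the relative reduction at both ends of the calibrated range. It takes the percentages on this page at face value; the function name `relative_reduction` is just an illustrative choice.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.4 means 40 percent lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


baseline = 10.0          # percent: uncalibrated ESL false-positive rate quoted above
calibrated = (4.0, 6.0)  # percent: calibrated range quoted above

for rate in calibrated:
    drop = relative_reduction(baseline, rate)
    print(f"{baseline:.0f}% -> {rate:.0f}%: {drop:.0%} lower")
```

On these figures the reduction runs from 40 percent (at a 6 percent calibrated rate) to 60 percent (at 4 percent), so the 40 percent headline corresponds to the conservative end of the range.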

Plans & pricing

Same classifier on every tier.

A free scan and a Business scan on the same text produce the same Authenticity Score, the same sentence colours, and the same per-model confidence. Paid tiers raise the cap and unlock the AI rewriter, file upload, and API.

Free
$0/forever

Try the accurate detector. No card, no email.
  • 3 scans / day
  • 5,000 chars per scan
  • Sentence-level evidence
  • ESL calibration included
Start free
Starter
$7.49/month

Billed $89.88/year — Save $30

For writers checking individual articles regularly.
  • 20 scans / day
  • 20,000 AI rewriter words/mo
  • Chrome extension
  • Email support
Get Starter
Business
$29.99/month

Billed $359.88/year — Save $120

For schools and teams running accuracy at scale.
  • 100,000 AI rewriter words/mo
  • REST API access
  • 5 team seats
  • White-label PDFs
Get Business

Yearly billing saves 25%. View full pricing →

30-second workflow

Four steps from paste to defensible call.

Accuracy is only useful if the workflow around it is short. The standard TextSight scan takes about thirty seconds end to end on an 800-word essay.

1. Paste the text

Open app.textsight.ai. Paste up to 5,000 characters on the free tier, or up to 10,000 on Pro. The character counter ticks as you type or paste. No signup is required for your first scan.

2. See the score

Click Scan. The classifier runs in around six seconds for short text, thirty seconds for an 800-word essay. The result panel returns a 0 to 100 Authenticity Score, a per-sentence colour map, and a list of the top three GPT-4 fingerprint signals found in the text.

3. Review the evidence

Open the sentence-level panel. Every red sentence is annotated with the specific signals that triggered it: the sentence-length floor, the nested-clause syntax, the polite-assistant register, the synthesis closer, or the "intricate tapestry" vocabulary cluster. Yellow sentences are borderline and listed separately so a reviewer knows where the genuine uncertainty lies.

4. Decide

If three or more red sentences carry classic GPT-4 fingerprint signals on long-form text, the call is solid. If red sentences are isolated quotes or technical definitions, downgrade to "inconclusive". If the text is under 300 words or has been heavily edited, the result panel surfaces a low-confidence warning and you should treat the score as directional rather than precise.
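The decision rules in step 4 can be written down as a small triage function. This is a sketch under stated assumptions: the `ScanSummary` fields, the order in which the rules are checked, and the fallback for fewer than three red sentences are not specified on this page and are chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScanSummary:
    word_count: int
    red_with_fingerprint: int             # red sentences carrying classic GPT-4 signals
    red_only_quotes_or_definitions: bool  # red sentences are isolated quotes or definitions
    heavily_edited: bool                  # reviewer's judgment or a post-edited warning


def triage(scan: ScanSummary) -> str:
    """Return 'solid', 'inconclusive', or 'directional' for a reviewed scan."""
    # Short or heavily edited text: treat the score as directional only.
    if scan.word_count < 300 or scan.heavily_edited:
        return "directional"
    # Red sentences that are just quotes or definitions: downgrade.
    if scan.red_only_quotes_or_definitions:
        return "inconclusive"
    # Three or more fingerprinted red sentences on long-form text.
    if scan.red_with_fingerprint >= 3:
        return "solid"
    # Assumed fallback: the page does not say what fewer than three implies.
    return "inconclusive"
```

For example, `triage(ScanSummary(800, 4, False, False))` returns `"solid"`, while the same scan at 250 words returns `"directional"`.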

Honest scope

Where this detector should not be the final word.

No AI detector is accurate enough to be the sole basis of an academic-integrity ruling or a freelancer firing. These are the cases where the published numbers still leave too much room for a wrong call.

Under 200 words

Below roughly 200 words, every detector loses statistical power. Short paragraphs do not give the classifier enough sentence-level data points to score reliably. TextSight surfaces a "low confidence" warning when input length drops below the threshold the model was tuned for.

Heavily rewritten AI text

A passage that went through three rounds of human edits or a dedicated AI rewriter no longer carries the surface signals the classifier was trained on. Accuracy drops to 70 to 80 percent. The Authenticity Score is still useful as a directional reading, but a single low score should not be treated as proof.

Borderline mixed passages

Text that is 80 percent ChatGPT and 20 percent human edits sits in a no-man's-land that no detector handles cleanly. TextSight reports an Authenticity Score rather than forcing a binary classification, and the sentence colour map handles this case better than a document-level percentage. Accuracy on the borderline sentences themselves still drops into the 70s.

Use as conversation starter, not verdict

The most defensible workflow we have seen: scan with TextSight, read the sentence-level evidence, then have a conversation with the writer. A direct conversation produces more clarity than any percentage. If the conversation goes badly, the sentence evidence is the documentation you keep for the next step.

FAQ

Accuracy questions people actually ask.

How accurate is the TextSight ChatGPT detector?
Around 90 percent on native GPT-4 long-form text (500 words or more), 70 to 80 percent on heavily edited or rewritten GPT output, and 75 to 82 percent on short passages under 300 words. The native English false-positive rate sits at 1 to 2 percent and the ESL false-positive rate at 4 to 6 percent. We publish those numbers rather than collapse them to a single headline.
Why is ESL calibration important for accuracy?
Uncalibrated detectors flag ESL writing as AI roughly twice as often as native English writing, because simpler vocabulary and more uniform sentence structure look like AI patterns to a naive classifier. TextSight runs an ESL calibration layer trained on roughly 600,000 ESL writing samples that reduces the ESL false-positive rate by about 40 percent versus uncalibrated baselines, while keeping the native-English false-positive rate the same.
What does sentence-level evidence look like?
Every sentence gets its own probability score and a colour: green for human, yellow for borderline, red for AI. You see exactly which sentences triggered the document score, with the GPT-4 fingerprint signals listed beside them. That is more useful than a single percentage when you have to decide whether to escalate a paper or push back on a freelancer.
If no detector is 100 percent accurate, why bother?
Correct, no detector is 100 percent accurate. TextSight treats the score as a conversation starter, not a verdict. Around 90 percent accuracy on GPT-4 long-form is meaningfully better than guessing, especially when combined with sentence-level evidence that lets a human reviewer judge the borderline cases. We publish where the tool fails so you can use it where it works.
How do you measure accuracy honestly?
We re-benchmark every quarter against a 12,000-document test set balanced across source model, length bucket, and writer profile (native English, ESL, mixed human-and-AI). We publish accuracy, precision, recall, and false-positive rate for each cell of the matrix rather than collapsing to a single number. Each frontier-model release triggers an out-of-cycle re-benchmark and a public changelog entry.
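For readers who want the four per-cell metrics spelled out, the sketch below computes them from confusion counts using the standard definitions. The `CellCounts` type and the example counts are invented for illustration and are not TextSight benchmark results.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CellCounts:
    tp: int  # AI text flagged as AI
    fp: int  # human text flagged as AI
    tn: int  # human text passed as human
    fn: int  # AI text passed as human


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def cell_metrics(c: CellCounts) -> Dict[str, float]:
    """Standard classification metrics for one cell of the benchmark matrix."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": safe_div(c.tp + c.tn, total),
        "precision": safe_div(c.tp, c.tp + c.fp),
        "recall": safe_div(c.tp, c.tp + c.fn),
        "false_positive_rate": safe_div(c.fp, c.fp + c.tn),
    }


# Made-up counts for one hypothetical cell, not real results.
example = cell_metrics(CellCounts(tp=90, fp=2, tn=98, fn=10))
```

Reporting all four per cell matters because a cell can show high accuracy while still carrying a false-positive rate that is too high for the writers in that cell.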
Does paid accuracy beat free accuracy?
No. The same classifier runs on every tier. A free scan and a Business scan on the same text produce the same Authenticity Score, the same sentence colours, and the same per-model confidence. Paid tiers raise the volume cap and unlock the AI rewriter, file upload, and API. They do not swap in a different model.
How should I act on a borderline score?
Treat any score between 30 and 70 as inconclusive on the document level, then read the sentence-level highlights for evidence. If three or more sentences flag red with classic GPT-4 fingerprint signals like the nested-clause syntax or the synthesis closer, that is real evidence. If the red sentences are isolated quotes or technical definitions, that is more likely a false positive. Use the colours, not the headline.

Run an accurate scan. See the evidence.

Free to try, no card, your first scan in about six seconds. Around 90 percent accuracy on GPT-4 long-form, sentence-level evidence, and ESL calibration that drops false positives by about 40 percent.

Trained on roughly 4M GPT-4 outputs · ESL calibrated · Sentence-level evidence