No detector is 100 percent accurate on every text it sees, and any vendor who says otherwise is selling a billboard. TextSight is trained on roughly 4 million GPT-4 outputs, scored at the sentence level so you can see which lines triggered the flag, and calibrated against roughly 600,000 ESL writing samples so non-native writers do not get punished for sounding uniform. The result is around 90 percent accuracy on native long-form GPT-4 text, an ESL false-positive rate about 40 percent below uncalibrated detectors, and a result panel you can actually defend in a conversation.
Accuracy in AI detection comes from training data scale, where the signal is scored, and whether the classifier has been calibrated against the populations that get falsely flagged. TextSight is built around all three.
The training corpus spans essays, blog posts, emails, product descriptions, scripts, marketing copy, and technical documentation. It includes raw GPT-4, GPT-4 with custom system prompts, GPT-4o multimodal text, and a growing GPT-5 sample. Volume matters because the GPT-4 family fingerprint shifts with each minor release and with each new prompting style the public adopts, and a smaller training set drifts faster.
Every sentence gets its own probability score. The result panel colours each sentence green, yellow, or red and lists the specific GPT-4 signals that triggered each flag (sentence-length floor, nested-clause density, the polite-assistant register, the "thoughtful synthesis" closer). A document-level percentage alone is brittle. A document percentage backed by sentence evidence holds up when a teacher, editor, or recruiter has to explain why the call was made.
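As a rough illustration of sentence-level scoring, the sketch below maps a per-sentence AI probability to one of the three panel colours. The cut-offs (0.4 and 0.7) and the function names are assumptions made up for this example, not TextSight's published thresholds.

```python
from typing import List, Tuple

# Hypothetical cut-offs, chosen only for illustration.
YELLOW_AT = 0.4
RED_AT = 0.7


def colour_for(prob_ai: float) -> str:
    """Map one sentence's AI probability (0 to 1) to a panel colour."""
    if not 0.0 <= prob_ai <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if prob_ai >= RED_AT:
        return "red"
    if prob_ai >= YELLOW_AT:
        return "yellow"
    return "green"


def colour_map(scored: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Turn (sentence, probability) pairs into (sentence, colour) pairs."""
    return [(sentence, colour_for(prob)) for sentence, prob in scored]


if __name__ == "__main__":
    sample = [
        ("I missed the bus again this morning.", 0.12),
        ("Ultimately, it is a nuanced and multifaceted issue.", 0.83),
    ]
    for sentence, colour in colour_map(sample):
        print(f"{colour:6} {sentence}")
```

Keeping the colour next to the sentence, rather than collapsing everything into one number, is what lets a reviewer point at the exact lines behind a call.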
Uncalibrated detectors flag ESL writing as AI roughly twice as often as native English writing, because the simpler vocabulary and more uniform sentence structure ESL writers tend toward also happen to be features the classifier associates with AI. TextSight runs a calibration layer trained on roughly 600,000 ESL writing samples that reduces the ESL false-positive rate from about 10 percent to 4 to 6 percent, without raising the native-English false-positive rate. That single change is the difference between a usable tool and an academic-integrity disaster.
Accuracy in this category depends on three variables: the source model, the length of the text, and whether the text has been edited or rewritten. Here is where TextSight lands on each.
On unedited GPT-4 output of 500 words or longer, the classifier's accuracy sits at roughly 90 percent. That holds across GPT-4, GPT-4o, and the bulk of GPT-5 output. Most of the detectable AI text in 2026 lives in this band, which is why it gets the most engineering attention.
If a passage went through three rounds of human edits, or through a dedicated AI rewriter, no detector reliably hits 90 percent. TextSight drops to roughly 70 to 80 percent in that regime, and the result panel surfaces a "post-edited" warning so you know to weight the call accordingly.
The ESL gap is real and category-wide. Every detector scores worse on English written by non-native speakers. TextSight's ESL calibration brings the ESL false-positive rate to roughly 4 to 6 percent, against the 8 to 12 percent uncalibrated detectors typically show. Native English false positives sit at 1 to 2 percent, which is the industry norm for the band.
No detector is 100 percent accurate. That sentence will eat most competitor marketing, and it is still true. Treat any ChatGPT detector score as a conversation starter, not a verdict. The 0 to 100 Authenticity Score and the sentence colours together give you enough evidence to decide whether to escalate a paper, push back on a freelancer, or move on. They do not give you the right to declare a piece "AI" with no further review.
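To see why a flag is a conversation starter rather than a verdict, it helps to run the base-rate arithmetic. The sketch below applies Bayes' rule under loud assumptions: it treats the roughly 90 percent accuracy figure as the detection rate on AI text, and it invents a 10 percent share of AI-written submissions purely for illustration. Only the false-positive rates come from the figures quoted on this page.

```python
def flag_is_correct_probability(
    detection_rate: float, false_positive_rate: float, ai_share: float
) -> float:
    """Bayes' rule: probability a flagged text really is AI-written."""
    for name, value in (
        ("detection_rate", detection_rate),
        ("false_positive_rate", false_positive_rate),
        ("ai_share", ai_share),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_flags = detection_rate * ai_share
    false_flags = false_positive_rate * (1.0 - ai_share)
    if true_flags + false_flags == 0:
        raise ValueError("no text would be flagged with these inputs")
    return true_flags / (true_flags + false_flags)


if __name__ == "__main__":
    AI_SHARE = 0.10  # hypothetical share of submissions that are AI-written
    scenarios = (
        ("native English, 2% false positives", 0.02),
        ("ESL calibrated, 6% false positives", 0.06),
        ("ESL uncalibrated, 10% false positives", 0.10),
    )
    for label, fpr in scenarios:
        p = flag_is_correct_probability(0.90, fpr, AI_SHARE)
        print(f"{label}: {p:.0%} of flags are correct")
```

Under these assumed inputs a noticeable share of flags still land on human writers, and the share grows as the false-positive rate rises, which is the case for reading the sentence evidence before acting.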
ESL false positives are the single most consequential failure mode in this category, because they send real students into academic-integrity hearings they should not be in. TextSight treats the ESL gap as a first-class engineering target, not a footnote.
ESL writers tend toward shorter average sentences, more uniform sentence rhythm, lower vocabulary variance, and fewer idiomatic constructions. A naive classifier trained on native-English-versus-GPT-4 data sees those same features in GPT-4 output, and the easiest way to be "accurate" on the test set is to flag uniform writing as AI. The result is an ESL false-positive rate around 10 percent on most uncalibrated detectors, double the native-English rate.
A second classifier layer trained on roughly 600,000 ESL writing samples scores how "ESL-like" the input looks before the AI classifier delivers its verdict. When the ESL-likeness score is high, the AI threshold is shifted to compensate for the population baseline. The trade-off is a slightly higher false-negative rate on AI text written in an ESL-style register, which we judge an acceptable cost compared to falsely accusing a real student.
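A minimal sketch of that threshold shift, assuming a simple linear adjustment. The base threshold of 0.5, the maximum shift of 0.15, and both function names are hypothetical; the page does not say how TextSight computes the shift.

```python
def adjusted_threshold(
    base_threshold: float, esl_likeness: float, max_shift: float = 0.15
) -> float:
    """Raise the AI-decision threshold in proportion to ESL-likeness (0 to 1)."""
    if not 0.0 <= esl_likeness <= 1.0:
        raise ValueError("esl_likeness must be between 0 and 1")
    if not 0.0 <= base_threshold <= 1.0:
        raise ValueError("base_threshold must be between 0 and 1")
    # Linear shift is an assumption; cap at 1.0 so the threshold stays a probability.
    return min(1.0, base_threshold + max_shift * esl_likeness)


def is_flagged(ai_prob: float, esl_likeness: float, base_threshold: float = 0.5) -> bool:
    """Flag the text only if its AI probability clears the shifted threshold."""
    return ai_prob >= adjusted_threshold(base_threshold, esl_likeness)


if __name__ == "__main__":
    # Same AI probability, different population baseline.
    print(is_flagged(0.60, esl_likeness=0.0))  # native-looking input
    print(is_flagged(0.60, esl_likeness=1.0))  # strongly ESL-looking input
```

The second call shows the trade-off described above: a text scoring 0.60 is flagged when it looks native but passes when it looks strongly ESL-like, whether a student or a model wrote it.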
ESL false positives drop from roughly 10 percent to 4 to 6 percent, depending on first language and writing topic. Native English false positives stay at 1 to 2 percent. The accuracy on real ChatGPT output drops by about 1 percentage point, which we consider a fair trade given who eats the cost of getting the call wrong.
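The "about 40 percent" headline corresponds to the conservative end of that range. A quick check of the arithmetic, using only the rates quoted above (the batch of 1,000 human-written ESL essays is a made-up size for illustration):

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fractional drop from one false-positive rate to another."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct


def false_flags(essays: int, rate_pct: float) -> float:
    """Expected number of human-written essays wrongly flagged."""
    return essays * rate_pct / 100


if __name__ == "__main__":
    UNCALIBRATED_PCT = 10  # uncalibrated ESL false-positive rate quoted above
    ESSAYS = 1000          # hypothetical batch of human-written ESL essays
    for calibrated_pct in (6, 4):
        drop = relative_reduction(UNCALIBRATED_PCT, calibrated_pct)
        print(
            f"{UNCALIBRATED_PCT}% -> {calibrated_pct}%: {drop:.0%} fewer false flags, "
            f"{false_flags(ESSAYS, UNCALIBRATED_PCT):.0f} vs "
            f"{false_flags(ESSAYS, calibrated_pct):.0f} per {ESSAYS} essays"
        )
```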
A free scan and a Business scan on the same text produce the same Authenticity Score, the same sentence colours, and the same per-model confidence. Paid tiers raise the cap and unlock the AI rewriter, file upload, and API.
Yearly billing saves 25 percent. The three paid tiers are billed at $89.88, $179.88, and $359.88 per year, a saving of $30, $60, and $120 respectively.
Accuracy is only useful if the workflow around it is short. The standard TextSight scan takes about thirty seconds end to end on an 800-word essay.
Open app.textsight.ai. Paste up to 5,000 characters on the free tier, or up to 10,000 on Pro. The character counter ticks as you type or paste. No signup is required for your first scan.
Click Scan. The classifier runs in around six seconds for short text, thirty seconds for an 800-word essay. The result panel returns a 0 to 100 Authenticity Score, a per-sentence colour map, and a list of the top three GPT-4 fingerprint signals found in the text.
Open the sentence-level panel. Every red sentence is annotated with the specific signals that triggered it: the sentence-length floor, the nested-clause syntax, the polite-assistant register, the synthesis closer, or the "intricate tapestry" vocabulary cluster. Yellow sentences are borderline and listed separately so a reviewer knows where the genuine uncertainty lies.
If three or more red sentences carry classic GPT-4 fingerprint signals on long-form text, the call is solid. If red sentences are isolated quotes or technical definitions, downgrade to "inconclusive". If the text is under 300 words or has been heavily edited, the result panel surfaces a low-confidence warning and you should treat the score as directional rather than precise.
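Those reading rules can be written down as a small triage function. This is a sketch under stated assumptions: the `ScanSummary` fields and the label strings are hypothetical, "long-form" is taken to mean 500 words or more (borrowed from the accuracy figures above), and the final fallback for scans with little red evidence is our own guess, since the rules do not cover that case.

```python
from dataclasses import dataclass

SHORT_TEXT_WORDS = 300  # below this the panel shows a low-confidence warning
LONG_FORM_WORDS = 500   # assumed meaning of "long-form"


@dataclass
class ScanSummary:
    word_count: int
    red_with_fingerprint: int  # red sentences carrying classic GPT-4 signals
    red_only_quotes_or_definitions: bool
    heavily_edited: bool


def triage(scan: ScanSummary) -> str:
    """Turn a scan summary into a reviewer-facing reading of the result."""
    if scan.word_count < SHORT_TEXT_WORDS or scan.heavily_edited:
        return "directional only"
    if scan.red_only_quotes_or_definitions:
        return "inconclusive"
    if scan.word_count >= LONG_FORM_WORDS and scan.red_with_fingerprint >= 3:
        return "solid"
    # Not covered by the rules above; treated here as no strong evidence.
    return "no strong evidence"


if __name__ == "__main__":
    essay = ScanSummary(
        word_count=800,
        red_with_fingerprint=4,
        red_only_quotes_or_definitions=False,
        heavily_edited=False,
    )
    print(triage(essay))
```

Checking length and editing first mirrors the order a reviewer should follow: decide whether the score can be trusted at all before counting red sentences.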
No AI detector is accurate enough to be the sole basis of an academic-integrity ruling or a freelancer firing. These are the cases where the published numbers still leave too much room for a wrong call.
On short text (under 300 words), every detector loses statistical power. Short paragraphs do not give the classifier enough sentence-level data points to score reliably. TextSight surfaces a "low confidence" warning when input length drops below the threshold the model was tuned for.
A passage that went through three rounds of human edits or a dedicated AI rewriter no longer carries the surface signals the classifier was trained on. Accuracy drops to 70 to 80 percent. The Authenticity Score is still useful as a directional reading, but a single low score should not be treated as proof.
Text that is 80 percent ChatGPT and 20 percent human edits sits in a no-man's-land that no detector handles cleanly. TextSight reports an Authenticity Score rather than forcing a binary classification, and the sentence colour map handles this case better than a document-level percentage. Accuracy on the borderline sentences themselves still drops into the 70s.
The most defensible workflow we have seen: scan with TextSight, read the sentence-level evidence, then have a conversation with the writer. A direct conversation produces more clarity than any percentage. If the conversation goes badly, the sentence evidence is the documentation you keep for the next step.
General-purpose detection across the full GPT family with the same sentence-level evidence panel. Open the detector →
Three scans a day, no signup, same classifier and same ESL calibration as paid. Use it free →
The model-tuned classifier trained on roughly 4M GPT-4 outputs, with the full fingerprint breakdown. Read GPT-4 page →
Full tier breakdown for Free, Starter, Pro, and Business. Annual billing saves 25%. See pricing →
Free to try, no card, your first scan in about six seconds. Around 90 percent accuracy on GPT-4 long-form, sentence-level evidence, and ESL calibration that drops false positives by about 40 percent.