An AI detector does not know who wrote your text. It measures statistical patterns: how predictable each word is, how much your sentences vary, and the fingerprint of your style. A model then converts those patterns into a probability that the passage was machine-generated. Below: what a detector actually does, the core signals it reads (perplexity, burstiness, stylometry, word distribution), the model types that turn signals into a score, why detectors disagree, and how TextSight scores text sentence by sentence so you can read a verdict critically.
A detector is a classifier. It sorts text into "more likely human" or "more likely machine" by measuring how the words sit next to each other. It is not a lie detector and it cannot see authorship.
When you paste text into an AI detector, the tool is answering one narrow statistical question: do the patterns in this passage look more like the human writing it was trained on, or more like the machine-generated writing it was trained on? It outputs a probability, usually shown as a percentage, and a verdict if that probability crosses a threshold. That is the whole job. The tool has no record of who typed the words, no access to your draft history, and no way to "know" the answer the way a witness would.
This matters because the output is often read as a finding when it is really a measurement. A 78 percent AI score means the text shares enough surface features with the detector's machine-writing training set to land above its cutoff. It does not mean a person did or did not write it. Understanding the difference between a probability and a proof is the single most useful thing to take from this page, and it is why TextSight shows per-sentence evidence rather than only a headline number.
Every detector works in two stages. First it extracts features from the text, the statistical signals covered in the next section. Then it feeds those features into a model that has learned, from millions of labelled examples, where the boundary between human and machine writing tends to fall. The quality of a detector is the quality of those two stages, and its honesty is whether it tells you how it measured.
Almost every detector reads some combination of four families of signal. None of them is proof on its own. Together they form the pattern a model is trained to recognise.
Perplexity: how predictable each next word is, given the words before it. AI writing tends to choose the statistically likely next word, so it has low perplexity. Detectors read low, smooth predictability as a generation signal. Careful human writers can produce low perplexity too, which is where false flags begin.
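The perplexity calculation can be sketched in a few lines. This is a minimal illustration, assuming a toy add-one-smoothed unigram model in place of the large language model a real detector would use; the function name `unigram_perplexity` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Toy stand-in: real detectors condition each word on the words before it.
    """
    ref_tokens = reference.lower().split()
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    counts = Counter(ref_tokens)
    vocab_size = len(set(ref_tokens) | set(tokens))
    total = len(ref_tokens)
    # Sum the negative log-probability of every token.
    neg_log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        neg_log_prob -= math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(neg_log_prob / len(tokens))


reference = "the cat sat on the mat and the dog sat on the rug"
predictable = "the cat sat on the mat"
surprising = "quantum marmalade unsettles bureaucrats"

print(unigram_perplexity(predictable, reference))
print(unigram_perplexity(surprising, reference))
```

The familiar sentence scores lower than the unfamiliar one, which is the direction a detector reads as more machine-like.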
Burstiness: the variance in sentence length and complexity across a passage. Humans are bursty: a four-word sentence next to a thirty-eight-word one. Machine writing is more even. Low burstiness (sentences of similar length and rhythm) is one of the strongest classical AI signals.
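One common way to put a number on burstiness is the coefficient of variation of sentence length. The sketch below assumes that definition and a naive sentence splitter; neither is confirmed as what any particular detector uses, and the function names are hypothetical.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    # Naive split after ., ! or ? followed by whitespace.
    # Real tools use a proper sentence tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Standard deviation of sentence length divided by the mean length."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # one sentence has no variation to measure
    return statistics.pstdev(lengths) / statistics.mean(lengths)


even = "The model reads the text. The score goes up fast. The tool shows a flag."
bursty = (
    "It failed. After three weeks of careful tuning across every dataset "
    "we had collected, the detector still flagged the essay."
)

print(burstiness(even))
print(burstiness(bursty))
```

The evenly paced passage scores zero, while the passage that pairs a two-word sentence with an eighteen-word one scores much higher.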
Stylometry: the style fingerprint of the text, made up of vocabulary richness, punctuation habits, and function-word ratios (how often "the", "of", "and" appear). Human authors have idiosyncratic fingerprints. Machine output tends toward a flatter, more average profile, which a model can learn to spot.
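A style fingerprint of this kind can be approximated with three simple ratios. This is a rough sketch: the function-word list is a small illustrative subset, and `style_profile` and its keys are hypothetical names rather than any tool's actual feature set.

```python
import re
from collections import Counter

# Illustrative subset only; real stylometry uses a much longer list.
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "that", "it", "is", "for"}


def style_profile(text: str) -> dict[str, float]:
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text must contain at least one word")
    counts = Counter(words)
    punctuation = re.findall(r"[,;:!?.]", text)
    return {
        # Distinct words divided by total words: a crude vocabulary-richness measure.
        "type_token_ratio": len(counts) / len(words),
        # Share of all words that are common function words.
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / len(words),
        # Punctuation marks per word.
        "punctuation_per_word": len(punctuation) / len(words),
    }


sample = "The cat and the dog sat in the sun, and it was good."
print(style_profile(sample))
```

Comparing these ratios across two texts by the same author, or against a population average, is what gives the fingerprint its meaning; a single profile in isolation says little.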
Word distribution: the frequency of specific words and short word sequences (n-grams). Over-use of transition phrases like "Furthermore", "Moreover", and "It is important to note" shifts the distribution toward patterns common in machine output and formulaic instruction.
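Counting n-grams and transition phrases is straightforward to sketch. The phrase list below contains only the three examples named above, and the per-100-words rate is an assumed normalisation chosen for illustration, not a documented detector feature.

```python
import re
from collections import Counter

# Only the phrases named in the text; a real list would be far longer.
TRANSITIONS = ("furthermore", "moreover", "it is important to note")


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(text: str, n: int) -> Counter:
    """Count every run of n consecutive words."""
    words = tokenize(text)
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def transition_rate(text: str) -> float:
    """Transition-phrase occurrences per 100 words."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = 0
    for phrase in TRANSITIONS:
        hits += ngram_counts(text, len(phrase.split()))[phrase]
    return 100 * hits / len(words)


sample = (
    "Furthermore, the results are clear. "
    "Moreover, it is important to note that the method works."
)
print(transition_rate(sample))
```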
The key thing to hold onto is that these are statistical properties of text, not stylistic crimes. Low perplexity is not cheating; it is just consistency. Low burstiness is not dishonesty; it is just uniform rhythm, which formal academic training actively rewards. A detector cannot tell whether your even, vocabulary-consistent paragraph came from a language model or from a meticulous student who edits hard. That ambiguity is exactly why these signals produce errors, a topic we return to below in the discussion of why detectors disagree.
Once the signals are extracted, something has to convert them into a verdict. Detectors broadly fall into three families, and the family largely determines how a tool behaves on edge cases.
The earliest and simplest detectors measure perplexity and burstiness directly and apply a threshold. They are fast, transparent, and need no large training set, but they are also the most fragile. They were largely calibrated on older model output, so they degrade as new models write with more variance, and they over-flag the careful human writing that happens to score low on the same axes.
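A threshold scorer of this kind can be sketched as follows. The cutoff values are invented purely for illustration and are not taken from any real tool; the inputs are assumed to be a perplexity and a burstiness value computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_perplexity: float = 30.0  # hypothetical cutoff
    max_burstiness: float = 0.35  # hypothetical cutoff


def threshold_verdict(
    perplexity: float, burstiness: float, t: Thresholds = Thresholds()
) -> str:
    low_perplexity = perplexity < t.max_perplexity
    low_burstiness = burstiness < t.max_burstiness
    if low_perplexity and low_burstiness:
        return "likely machine"
    if low_perplexity or low_burstiness:
        return "uncertain"
    return "likely human"


print(threshold_verdict(12.0, 0.10))  # smooth and even
print(threshold_verdict(80.0, 0.90))  # surprising and varied
print(threshold_verdict(12.0, 0.90))  # signals point in different directions
```

The fragility is visible in the first call: a meticulous human writer who happens to produce those two numbers gets the same verdict as a language model, because the rule sees nothing else.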
Transformer classifiers are the modern mainstream. A transformer model (often a BERT-family encoder) is fine-tuned on millions of paired human and AI samples to predict an AI probability directly, rather than reading a single hand-picked statistic. These classifiers learn subtler combinations of the signals above and generally outperform pure perplexity scorers on long-form text, at the cost of needing a large, well-curated training corpus and periodic retraining as models evolve.
Ensemble detectors combine signals rather than betting on one, which makes them the most robust approach. In practice the strongest workflow is also an ensemble at the human level: running a passage through two independent detectors, where agreement is the strongest evidence and a single-tool verdict is the weakest. As TextSight's methodology puts it, ensemble use of two detectors remains the most reliable workflow.
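The two-detector workflow can be expressed as a small agreement check. This is a sketch under assumptions: both tools are taken to return a probability between 0 and 1, the shared 0.5 threshold is arbitrary, and `ensemble_verdict` is a hypothetical helper, not part of any product API.

```python
def ensemble_verdict(score_a: float, score_b: float, threshold: float = 0.5) -> str:
    """Combine two independent detector probabilities by agreement."""
    for score in (score_a, score_b):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
    a_flags = score_a >= threshold
    b_flags = score_b >= threshold
    if a_flags and b_flags:
        return "agree_machine"  # strongest evidence of machine text
    if not a_flags and not b_flags:
        return "agree_human"  # strongest evidence of human text
    return "disagree"  # treat as inconclusive


print(ensemble_verdict(0.91, 0.84))
print(ensemble_verdict(0.12, 0.20))
print(ensemble_verdict(0.78, 0.31))
```

Treating disagreement as its own outcome, rather than averaging the two scores, keeps the weakest case visible instead of hiding it inside a blended number.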
TextSight sits in the transformer-classifier family: a fine-tuned encoder that predicts AI probability directly from the text, which is part of why its false positive rate stays low. It pairs that model with sentence-level analysis so the verdict comes with evidence, and it recommends ensemble agreement across two tools for any decision that carries consequences. We describe the architecture this way because it is how the rest of the site already describes it; we do not invent capabilities the product does not have.
Most detectors return one number for the whole passage. That number averages over everything, which hides where the signal actually lives.
A document-level score is the easiest thing to produce and the hardest thing to act on. If a 1,200-word essay comes back at 64 percent AI, you have no idea whether that is one heavily machine-like paragraph dragging up four perfectly human ones, or an even spread across the whole text. The headline number cannot tell you, and neither can it tell a reviewer where to look. Averaging is exactly the operation that makes a verdict feel authoritative while removing the detail that would let you check it.
Sentence-level scoring solves this by running the analysis at a finer grain and highlighting which specific sentences carry the AI signal and which read as human. That turns an opaque percentage into something a person can actually read and challenge. For a writer, it points at the passages worth revising for clarity and voice. For an educator, it shows where to ask a follow-up question rather than where to accuse. TextSight's AI detector is built around this idea: the per-sentence breakdown is the product, and the headline score is just the summary.
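The loss of detail from averaging is easy to show with numbers. The per-sentence scores below are made up for illustration, and the sketch assumes the document score is a plain mean of sentence scores, which real tools may not do.

```python
from statistics import mean


def summarize(sentence_scores: list[float], threshold: float = 0.5) -> dict:
    """Return the averaged score plus the indices of sentences above the threshold."""
    flagged = [i for i, score in enumerate(sentence_scores) if score >= threshold]
    return {
        "document_score": round(mean(sentence_scores), 2),
        "flagged_sentences": flagged,
    }


# Hypothetical per-sentence AI probabilities for two five-sentence passages.
concentrated = [0.10, 0.10, 0.95, 0.95, 0.10]  # two sentences carry the signal
spread = [0.44, 0.44, 0.44, 0.44, 0.44]  # mild signal everywhere

print(summarize(concentrated))
print(summarize(spread))
```

Both passages average to the same document score, yet only the per-sentence view shows that the first has two sentences worth a closer look and the second has none above the cutoff.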
No detector is perfect, and any vendor claiming zero errors is misrepresenting the problem. Knowing the failure modes is what lets you read a verdict for what it is.
False positives are the biggest limit. Second-language academic writers and highly polished native writers both tend to produce low-perplexity, low-burstiness prose, which shows the exact signals detectors read as machine output. The result is human writing flagged as AI, with second-language writers carrying the most risk. This is a structural property of the signals, not a malfunction. We cover it in depth on AI detector false positives.
Running machine text through a paraphraser raises its perplexity and adds variance, which can lower a detector score. Detectors respond by training on paraphraser output, so the advantage erodes with every model update. Chasing a lower number is a moving target, which is why TextSight frames the work around understanding and improving honest writing, not evading a tool.
Below roughly 250 words a detector has too little signal to average over, and the score swings with a single rephrase. A short reply can read very differently on two scans with no edits at all. Short passages are where two tools disagree most, and where any single verdict deserves the least trust.
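A length guard is one simple way to act on this. The sketch below uses the rough 250-word floor mentioned above as a default; `confidence_note` is a hypothetical helper, and whitespace splitting is only an approximate word count.

```python
MIN_WORDS = 250  # the rough floor mentioned in the text, not a hard rule


def confidence_note(text: str, min_words: int = MIN_WORDS) -> str:
    """Warn when a passage is too short for a score to be stable."""
    n = len(text.split())
    if n < min_words:
        return f"low confidence: {n} words is below the ~{min_words}-word floor"
    return f"sufficient length: {n} words"


print(confidence_note("too short to judge"))
print(confidence_note("word " * 300))
```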
Every new generation of language model shifts the statistical fingerprint a detector was trained to recognise. A detector tuned on last year's output performs worse on this year's, which is why measured accuracy drifts and why responsible tools re-test and retrain on a schedule rather than publishing one number forever.
Put together, these failure modes explain why two reputable detectors can return different scores on the identical passage. Different model, different training set, different threshold. The practical takeaway is the one TextSight states in its methodology: ensemble agreement across two independent tools is the strongest evidence, and a single-tool verdict is the weakest.
The same mechanics, applied with sentence-level transparency and an honest account of the limits.
TextSight uses a fine-tuned transformer classifier that predicts AI probability directly from the text, and reports 99.2 percent accuracy on its public 1,000-document benchmark. We publish the methodology behind that number, including sample composition and threshold logic, because a number without a method is just marketing. The model is tuned with second-language academic prose deliberately in the training mix, which is part of why our false positive rate stays low on the population most often wrongly flagged.
What we do differently is refuse to stop at a headline percentage. Every scan returns a per-sentence breakdown so you can see exactly which sentences carry the AI signal, read the verdict critically, and decide what to do with it. We also tell you plainly that no detector is infallible and that ensemble agreement across two tools beats any single score. The goal is understanding and better honest writing, not a tool to evade. If you want the full measurement detail, the accuracy methodology page documents how we test and re-test.
Why detectors flag human writing, measured false positive rates by tool, and a step-by-step protocol if you have been wrongly flagged.
Read the explainer ›
How TextSight turns the same signals into a readability-style score you can use to improve your own writing, not to game a detector.
See the score ›
A criteria-led guide to choosing a detector: what accuracy, transparency, and ESL calibration actually mean when you compare tools.
Read the guide ›
A head-to-head with sentence-level highlights, ESL false-positive rates, pricing, free tier, and API exposed side-by-side.
Read the compare ›
The full method behind the 99.2 percent benchmark, including sample composition, threshold logic, and how we re-test as models evolve.
Read the methodology ›
Run a passage now. Sentence-level highlights, free tier with no card required, and an honest verdict you can read critically.
Open the detector ›
TextSight's free tier gives you three scans a day at 5,000 characters per scan, with sentence-level highlights so you can see exactly which sentences carry the AI signal and why. No card, no email, no commitment.