GPT-5 is OpenAI's newest flagship, and it was tuned to read more human. The obvious GPT-4-era giveaways are mostly gone, the cadence varies more, and single-word lexical tells are weaker. The signals that hold up are statistical and structural: consistent rhetorical scaffolding, even-handed both-sides hedging, semantic evenness, and perplexity and burstiness distributions that still differ from human writing. TextSight is re-fit on fresh GPT-5 samples, marks the sentences that carry those signals with colour-coded highlights, and checks for Gemini and other models in the same scan. Free to try. No card.
GPT-5 is the default model behind a large share of new ChatGPT output, and it was deliberately tuned to look less machine-written. Most detectors were calibrated on GPT-4-era vocabulary and now underrate GPT-5. TextSight was re-fit on fresh GPT-5 samples and weights the statistical and structural patterns that survive the cleanup, alongside Gemini and other model signals. No detector is perfect on a frontier model, so we frame the score as evidence to weigh, not an automated verdict.
GPT-5 is harder to detect than its predecessors precisely because OpenAI removed the easy tells. The much-discussed filler words appear far less often, the rhythm varies more, and a first read can feel genuinely human. That means a detector cannot lean on a banned-vocabulary list and expect to catch it. TextSight instead reads how the text is built: its rhetorical shape, the evenness of its emphasis, the way it hedges, and the perplexity and burstiness profile underneath.
You do not need to tell TextSight which model produced the text. The classifier reads the prose and flags GPT-5-shaped sentences, Gemini-shaped sentences, and other model-shaped sentences in the same pass. Mixed-source documents (one paragraph drafted in GPT-5, another reworded in Gemini) are handled sentence by sentence, because each sentence is scored on its own pattern rather than on the document average.
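The per-sentence idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the regex sentence splitter is naive, and `toy_scorer` is a hypothetical placeholder standing in for a trained model-attribution classifier, not TextSight's.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_each(text: str, scorer: Callable[[str], float]) -> list[tuple[str, float]]:
    # Each sentence is scored independently, so one paragraph's source
    # does not leak into the scores of its neighbours.
    return [(s, scorer(s)) for s in split_sentences(text)]


def toy_scorer(sentence: str) -> float:
    # Hypothetical placeholder: treats any sentence mentioning "balanced"
    # as model-like. A real scorer would be a trained classifier.
    return 0.9 if "balanced" in sentence.lower() else 0.2


if __name__ == "__main__":
    doc = "I wrote this bit myself. The answer is balanced and tidy. Then I trailed off."
    for sentence, score in score_each(doc, toy_scorer):
        print(f"{score:.1f}  {sentence}")
```

Because nothing is averaged across sentences, a document that mixes sources comes back as a mix of high and low scores rather than one blended number.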
Colour-coded sentence highlights point to the specific lines that carry GPT-5 markers: the tidy setup-and-resolution scaffolding, the both-sides hedging, and the flat semantic density where every sentence carries a similar load. Reviewers see exactly which sentences drove the score rather than guessing from a single percentage.
Output from chatgpt.com, the ChatGPT desktop and mobile apps, or any downstream tool built on the OpenAI API carries the same fingerprints. The classifier treats GPT-5 as a model, not as a product surface, so detection works regardless of where the user pasted from.
The honest starting point: GPT-5 is a deliberately cleaner writer than GPT-4. It dropped most of the giveaway vocabulary, varies its sentence length more, and reads convincingly human on a first pass. So the reliable tells are no longer single words. They are statistical and structural, and they fall into five families that hold up even after light editing. None of them is a verdict on its own; the classifier weighs them together.
GPT-5 reaches for the same shape across answers: a clean orienting setup, a balanced body that walks through the considerations in order, and a tidy summarising resolution. A human writer commits earlier, buries the lede, or trails off; GPT-5 almost always closes the loop. When the same setup-body-resolution arc recurs across paragraphs that should have had different shapes, the structure itself becomes the signal, no buzzword required.
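One crude way to quantify "almost always closes the loop" is the share of paragraphs that end on a summarising sentence. The sketch below assumes paragraphs are separated by blank lines and uses a hypothetical cue list (`RESOLUTION_CUES`); a trained classifier would learn the setup-body-resolution shape from examples rather than from a fixed list of openers.

```python
import re

# Hypothetical cue phrases, for illustration only.
RESOLUTION_CUES = ("in short", "overall", "ultimately", "in summary", "the bottom line")


def closes_loop(paragraph: str) -> bool:
    # True when the paragraph's final sentence opens with a summarising cue.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return bool(sentences) and sentences[-1].lower().startswith(RESOLUTION_CUES)


def resolution_rate(text: str) -> float:
    # Share of paragraphs (split on blank lines) that end on a tidy resolution.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    return sum(closes_loop(p) for p in paragraphs) / len(paragraphs)
```

A rate near 1.0 across paragraphs that cover different ground is the recurring-arc pattern; one tidy paragraph on its own says little.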
Ask GPT-5 a question with a clear answer and it still tends to present both sides evenly, qualifying its own claims and granting the counterpoint before moving on. The hedging is symmetrical and almost diplomatic, where a person with a view leans one way and shows it. This balanced neutrality is one of the more durable GPT-5 fingerprints because it comes from how the model was aligned, not from a phrase it can be told to drop.
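Because the hedging comes from alignment rather than a stock phrase, a marker count is only a rough proxy, but it shows the kind of per-passage ratio a classifier might weigh. The `HEDGES` and `PIVOTS` lists below are hypothetical examples, not TextSight's features; a trained model would pick up the balance from context rather than from a word list.

```python
import re

# Hypothetical marker lists, for illustration only.
HEDGES = ("may", "might", "could", "arguably", "perhaps")
PIVOTS = ("however", "on the other hand", "that said", "conversely")


def _has_any(sentence: str, phrases: tuple[str, ...]) -> bool:
    # Whole-word / whole-phrase match, case-insensitive.
    s = sentence.lower()
    return any(re.search(rf"\b{re.escape(p)}\b", s) for p in phrases)


def hedge_profile(text: str) -> dict[str, float]:
    # Share of sentences that hedge, and share that pivot to a counterpoint.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n = len(sentences) or 1
    return {
        "hedged": sum(_has_any(s, HEDGES) for s in sentences) / n,
        "pivoted": sum(_has_any(s, PIVOTS) for s in sentences) / n,
    }
```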
Human writing is lumpy. Some sentences carry a lot, others coast. GPT-5 tends to spread information evenly, so each sentence delivers a similar load and the emphasis never spikes. That flatness reads smooth but is statistically distinctive: the per-sentence information curve stays unusually level across a passage. TextSight measures this evenness directly rather than inferring it from vocabulary.
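A minimal sketch of measuring that evenness, assuming a crude proxy for per-sentence information load (the number of distinct words of four or more letters) and using the coefficient of variation as the flatness measure. Both choices are illustrative assumptions; a production system might use model-based surprisal per sentence instead.

```python
import re
import statistics


def sentence_loads(text: str) -> list[int]:
    # Crude proxy for information load: distinct words of 4+ letters per sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len({w.lower() for w in re.findall(r"[A-Za-z]{4,}", s)}) for s in sentences]


def evenness_cv(loads: list[int]) -> float:
    # Coefficient of variation (std / mean); lower means a flatter curve.
    # With fewer than two sentences there is nothing to compare, so 0.0 is returned.
    if len(loads) < 2 or statistics.mean(loads) == 0:
        return 0.0
    return statistics.pstdev(loads) / statistics.mean(loads)
```

Lower values mean a more level curve: `evenness_cv([5, 5, 5, 5])` is 0.0, while `evenness_cv([2, 10])` is about 0.67.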
GPT-5 narrowed the gap with human writing on both perplexity (how predictable its word choices are) and burstiness (how much its sentence length varies), which is exactly why older detectors miss it. But the distributions have not closed entirely. Word choices remain a little more predictable than a person's, and sentence-length variance, while wider than GPT-4's, still clusters in a measurable range. The classifier reads the shape of these distributions across the whole passage, not any one sentence in isolation.
The famous markers are weaker now. Words like "delve" appear far less often, and counting them no longer works as a detection strategy. They have not vanished entirely, so they remain a minor input, but TextSight deliberately puts little weight on any fixed vocabulary list. Leaning on word counts is what causes buzzword-based detectors to return falsely low scores on GPT-5; the heavier weight belongs on structure, hedging balance, semantic evenness, and the statistical profile above.
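To make the weighting concrete, here is a logistic combination of the five signal families. Every number in `WEIGHTS` and `BIAS` is invented for illustration, the feature names are hypothetical, and the page does not state TextSight's actual weights; the point is only that a small vocabulary weight keeps a word-heavy but structurally plain passage from scoring high.

```python
import math

# Hypothetical weights: structure, hedging, evenness and the statistical
# profile carry most of the weight; the vocabulary family is deliberately small.
WEIGHTS = {
    "scaffolding": 1.6,
    "hedge_balance": 1.4,
    "evenness": 1.2,
    "perplexity_burstiness": 1.5,
    "vocabulary": 0.2,
}
BIAS = -2.5


def combined_score(features: dict[str, float]) -> float:
    # Features are assumed normalised to 0..1; missing features count as 0.
    z = BIAS + sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(round(combined_score({"vocabulary": 1.0}), 2))
    structural = {"scaffolding": 1.0, "hedge_balance": 1.0, "evenness": 1.0, "perplexity_burstiness": 1.0}
    print(round(combined_score(structural), 2))
```

With these made-up numbers, maxing out the vocabulary family alone gives roughly 0.09, while the four structural and statistical families together give roughly 0.96.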
Pro at $19.99 a month standard, $14.99 a month on yearly, is the right fit for solo editors, instructors, and reviewers running steady individual scans. Business at $39.99 a month standard, $29.99 a month on yearly, fits teams scanning fifty or more pieces a month with shared history and REST API access. Full details on the pricing page.
Yearly billing saves 25%: Pro is billed $179.88 a year (save $60), Business $359.88 a year (save $120), and a lower-priced plan $89.88 a year (save $30).
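The 25% figure follows from the monthly prices listed for the two named plans; a quick check of the arithmetic:

```python
# Monthly prices as listed on this page (USD).
PLANS = {
    "Pro": {"monthly": 19.99, "yearly_per_month": 14.99},
    "Business": {"monthly": 39.99, "yearly_per_month": 29.99},
}


def yearly_breakdown(monthly: float, yearly_per_month: float) -> tuple[float, float, float]:
    # Returns (amount billed per year, dollars saved vs. 12 standard months, % saved).
    billed = round(yearly_per_month * 12, 2)
    saved = round(monthly * 12 - billed, 2)
    pct = round(100 * saved / (monthly * 12), 1)
    return billed, saved, pct


if __name__ == "__main__":
    for name, p in PLANS.items():
        billed, saved, pct = yearly_breakdown(p["monthly"], p["yearly_per_month"])
        print(f"{name}: billed ${billed:.2f}/year, saves ${saved:.2f} ({pct}%)")
```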
Detector disagreement on GPT-5 is common, and it usually runs in one direction: older tools score it too low. Many AI detectors were calibrated on GPT-3.5 and GPT-4 output, where a handful of vocabulary markers and a uniform cadence did most of the work. GPT-5 removed exactly those crutches, so a detector tuned to the previous generation reads it as suspiciously human.
Detectors trained mostly on GPT-4-era output learned the stock transitions, the predictable filler, and the flatter cadence of that generation. GPT-5 writes cleaner, varies its rhythm, and drops the giveaway vocabulary, so it does not light up those features. The detector reads it as low confidence and returns a human-ish score even when the prose is straightforwardly GPT-5.
A detector built on counting flagged words was always brittle, and GPT-5 makes the weakness obvious. With the obvious markers thinned out, a vocabulary-based classifier has little to fire on. TextSight puts the weight on structure, hedging balance, semantic evenness, and the perplexity and burstiness distribution instead, because those persist after the easy words are gone.
When TextSight reports a high AI score on a paragraph and an older detector reports a low one, the disagreement is usually a calibration gap, not a contradiction. The two tools are reading different distributions, and the older one is looking for tells GPT-5 no longer leaves. Sentence-level highlights make this concrete: a reviewer can point to the specific lines carrying the structural and statistical markers and decide whether to act. No detector is perfect on a frontier model, so the highlights matter more than the headline number.
When OpenAI updates GPT-5, its stylistic distribution drifts. TextSight refits the classifier against fresh GPT-5 samples on a rolling cadence rather than freezing a snapshot. The page you are reading reflects the current distribution, and because the approach reads statistical fingerprints rather than a fixed word list, it degrades gracefully as the model evolves instead of going stale overnight.
GPT-5 turns up wherever ChatGPT does, which is nearly everywhere. Because it is the default flagship, a large share of new ChatGPT-produced text is GPT-5 output. The highest-volume contexts are student essays, marketing and SEO copy, and customer-support replies, with code documentation close behind. Each context calls for a slightly different read of the scan.
Students reach for GPT-5 because it produces a clean, balanced essay that no longer trips the obvious word-level filters. Instructors reviewing submissions see the giveaway differently now: the recurring setup-and-resolution shape across paragraphs, the both-sides hedging on questions that wanted a stance, and the unnaturally even information density. Sentence highlights make the pattern explicit, which is more useful in an integrity conversation than a single percentage.
Content teams lean on GPT-5 for blog drafts, landing pages, and email sequences because the prose reads polished and the old AI tells are gone. The polish is now the tell. The copy is smooth and competent but oddly shapeless, every paragraph hitting the same rhythm and granting every counterpoint. Reviewers running a pre-delivery scan catch the lift-and-publish case before it ships.
Support teams route a growing share of replies through GPT-5 behind the OpenAI API. The diplomatic, evenly hedged tone is exactly what the model produces by default. Detection here is less about misconduct and more about quality control: flagging canned, un-reviewed replies before they go to a customer, and spotting reviews or product descriptions that were generated wholesale.
Engineering teams use GPT-5 to draft README files, API references, and longer technical notes. The structure fits, but the prose around the code carries the same flat evenness and tidy resolution. A quick scan catches documentation that has not been read by a human before publication, a separate quality concern from academic integrity but one the same signals surface.
A single percentage is not an evidence trail, and on a frontier model the evidence matters more than the number. The TextSight result panel surfaces which sentences carried GPT-5 markers and why, with paragraph-level rollups for longer pieces, so reviewers can point to specific lines and pair the score with their own judgment.
Every sentence is colour-coded by its own AI-likeness score. Red sentences clustered through the tidy setup-and-resolution scaffolding and the even both-sides hedging are a stronger signal than scattered yellows. The visual makes the structural pattern legible without forcing a reviewer to study the percentage, which is exactly what you want when the word-level tells are no longer there to point at.
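A minimal sketch of banding sentence scores and measuring clustering. The 0.75 and 0.45 cut-offs and the "none" band are hypothetical; the page does not publish TextSight's thresholds. The longest run of consecutive red sentences is one simple way to separate a clustered pattern from scattered yellows.

```python
RED_AT = 0.75     # hypothetical cut-off
YELLOW_AT = 0.45  # hypothetical cut-off


def colour_for(score: float) -> str:
    # Map an AI-likeness score in [0, 1] to a highlight band.
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= RED_AT:
        return "red"
    if score >= YELLOW_AT:
        return "yellow"
    return "none"


def longest_red_run(scores: list[float]) -> int:
    # Length of the longest stretch of consecutive red sentences.
    best = run = 0
    for score in scores:
        run = run + 1 if colour_for(score) == "red" else 0
        best = max(best, run)
    return best
```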
Longer pieces get paragraph-level rollups so reviewers can see which paragraph is dragging the headline score. On GPT-5 content this usually points at the body sections, where the flat information density and symmetrical hedging are most concentrated. Targeting the highest-scoring paragraph first is the fastest way to confirm the read.
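Paragraph rollups can be sketched as a plain mean of each paragraph's sentence AI-likeness scores, with the paragraph to check first being the one with the highest rollup. The mean is an assumption; a weighted or max-based rollup would also be reasonable.

```python
from statistics import mean


def paragraph_rollups(paragraph_scores: list[list[float]]) -> list[float]:
    # Roll each paragraph's sentence scores up to one number (plain mean here).
    return [mean(scores) if scores else 0.0 for scores in paragraph_scores]


def paragraph_to_check_first(paragraph_scores: list[list[float]]) -> int:
    # Index of the paragraph pulling the headline score up the most.
    rollups = paragraph_rollups(paragraph_scores)
    if not rollups:
        raise ValueError("no paragraphs to rank")
    return max(range(len(rollups)), key=rollups.__getitem__)
```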
Perplexity measures how predictable word choices are to a language model. GPT-5 narrowed the gap with human writing deliberately, so its perplexity runs closer to human levels than GPT-4's did, which is the whole reason older detectors miss it. The diagnostic still helps: an unusually smooth, low-variance perplexity curve across a long passage is informative even when no single sentence looks damning.
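For reference, perplexity over a run of tokens is the exponential of the mean negative log-probability each token receives from a scoring model. The sketch assumes those natural-log probabilities are already available from some language model (which model and tokenizer is left open) and adds a spread measure for the per-sentence curve described above.

```python
import math
import statistics


def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p(token_i | preceding tokens))), natural log.
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_curve_spread(sentence_logprobs: list[list[float]]) -> float:
    # Population std of per-sentence perplexity; near zero means a smooth, level curve.
    curve = [perplexity(lp) for lp in sentence_logprobs]
    return statistics.pstdev(curve)
```

A uniform probability of 0.25 per token gives a perplexity of 4.0, and a passage whose per-sentence perplexities barely move has a spread near zero.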
Burstiness measures sentence-length variance. GPT-5 varies its sentence length more than earlier models, so high burstiness alone no longer clears a passage. What reads as a signal is the combination: moderate burstiness paired with the structural scaffolding and even semantic density. The diagnostics are context for the verdict, not the verdict, and on a frontier model that distinction is the point.
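Following the page's definition of burstiness as sentence-length variance, a minimal sketch with a hypothetical combination rule. The `low`, `high` and `min_rate` thresholds and the `scaffold_rate` input (for example, the share of paragraphs that end on a summarising sentence) are illustrative assumptions, not TextSight's settings.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    # Word counts per sentence, using a naive punctuation-based split.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    # Population variance of sentence length, in words squared.
    lengths = sentence_lengths(text)
    return float(statistics.pvariance(lengths)) if len(lengths) > 1 else 0.0


def combined_flag(burst: float, scaffold_rate: float,
                  low: float = 4.0, high: float = 60.0, min_rate: float = 0.6) -> bool:
    # Hypothetical rule: burstiness in a moderate band counts as a signal only
    # when it comes with recurring setup-and-resolution scaffolding.
    return low <= burst <= high and scaffold_rate >= min_rate
```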
More LLM-specific detection guides:
ChatGPT output: the most-trained-on model, with the obvious tells GPT-5 left behind.
Claude output: Anthropic's warm, reflective register and why it reads more human than GPT-5.
The method: why statistical fingerprints survive frontier-model updates and word lists do not.
The roundup: how TextSight stacks up against detectors that froze on the GPT-4 generation.
Free to try. No card. Pro at $14.99 a month on yearly for solo reviewers; Business at $29.99 a month on yearly for detection teams.