Every week someone asks a version of the same question: "If I use Claude instead of ChatGPT, will AI detectors catch it?"
It's a fair question. The three dominant AI models, ChatGPT (GPT-4o), Google Gemini Pro, and Anthropic's Claude 3.5, write differently. They have different training approaches, different default styles, and different statistical fingerprints. Detectors trained heavily on ChatGPT data may be more calibrated to GPT patterns than to Gemini or Claude output.
So we tested it properly: 50 text samples, three content types, three AI models, three detectors, and every output scored through TextSight's Humanization Score. Here's what the data actually shows, and more importantly, what it means for how you should be using AI in your writing.
Methodology: How We Set Up the Test
50 samples across three categories:
- Academic essays (18 samples): 800-1,200-word argumentative essays on standardized topics (climate policy, social media regulation, economic inequality). Six per model, unedited.
- Blog posts / long-form content (18 samples): 600-900-word blog introductions and full posts on technology, marketing, and productivity topics. Six per model.
- Short-form professional writing (14 samples): Cold emails, LinkedIn posts, and executive summaries. Four to five per model.
Three detectors:
- GPTZero (Pro)
- Copyleaks AI Detector
- TextSight (Humanization Score, reported here as its inverse: 100 minus the score gives the AI-probability equivalent)
One rule: No editing, no humanizing, no prompt engineering beyond specifying the content type. Raw output only. This tests the default writing patterns of each model, not what they're capable of with careful prompting.
The Results by Model
ChatGPT (GPT-4o)
Average detection rate across all tools: 91%
Average TextSight Humanization Score (raw output): 31/100
ChatGPT is the most detectable of the three models, and the gap is meaningful rather than marginal. GPT-4o's default output style has three characteristics that give it away consistently.
First, vocabulary homogeneity. GPT-4o over-relies on a predictable set of words and phrases: delve, leverage, robust, it's worth noting, in today's fast-paced, moreover, furthermore, in conclusion. These aren't random choices; they represent statistically high-probability tokens in the GPT training distribution, and every major AI detector in 2026 flags them.
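If you're curious how mechanical that check is, here's a minimal Python sketch of the idea. This is not TextSight's or any detector's actual code, and the phrase list is just the examples above:

```python
import re

# Illustrative phrase list taken from the examples above. Real detectors use
# far larger, statistically derived lists; this is not any tool's actual list.
AI_TELL_PHRASES = [
    "delve", "leverage", "robust", "it's worth noting",
    "in today's fast-paced", "moreover", "furthermore", "in conclusion",
]

def flag_ai_phrases(text):
    """Return (phrase, character position) pairs for every match, in order."""
    hits = []
    lowered = text.lower()
    for phrase in AI_TELL_PHRASES:
        for match in re.finditer(re.escape(phrase), lowered):
            hits.append((phrase, match.start()))
    return sorted(hits, key=lambda hit: hit[1])

sample = "Moreover, it's worth noting that we must delve into robust strategies."
for phrase, pos in flag_ai_phrases(sample):
    print(f"{pos:>3}  {phrase}")
```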
Second, sentence length uniformity. The typical ChatGPT sentence clusters tightly around 18-24 words. There's almost no burstiness: the spread between the shortest and longest sentence in a paragraph is much narrower than in human writing. Detectors measure this directly. Low burstiness = high AI probability.
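Burstiness sounds technical, but it's mostly the spread of sentence lengths. A rough Python sketch of how you could measure it on your own text, simplified from what detectors actually compute:

```python
import re
import statistics

def sentence_lengths(text):
    """Split on sentence-ending punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Standard deviation of sentence length, in words. Tight clustering
    around the mean (a low value) is the pattern detectors read as AI."""
    lengths = sentence_lengths(text)
    return statistics.stdev(lengths) if len(lengths) > 1 else 0.0

flat = ("The report covers three main areas of concern today. "
        "Each area includes several important points to consider. "
        "The findings suggest a need for further detailed review.")
print(round(burstiness(flat), 1))  # low spread reads as machine-like
```

Run it on a paragraph of your own writing and on a raw model draft; the draft's standard deviation will usually be noticeably smaller.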
Third, structural predictability. GPT-4o essays almost always follow: hook → context → three body arguments → counterargument → conclusion. The three-part structure appears everywhere, right down to three-item bullet lists and three-clause explanations. It reads well, but it also reads as algorithmic.
Where ChatGPT got caught most: Academic essays (93% detection rate). The formal register combined with predictable structure made it the easiest category to catch.
Where it did best: Short LinkedIn posts (82% detection rate). Shorter samples give detectors less signal to work with.
Google Gemini Pro
Average detection rate across all tools: 84%
Average TextSight Humanization Score (raw output): 43/100
Gemini sits in the middle of the pack: meaningfully less detectable than ChatGPT, but not as hard to catch as Claude. The gap between ChatGPT and Gemini surprised us more than the gap between Gemini and Claude.
Gemini's writing has higher natural perplexity than GPT-4o, meaning its word choices are slightly less predictable. It also shows more variation in sentence length, though still not at Claude's level. What brings Gemini's score down is a tendency toward hedging and qualification language (phrases like "it's important to consider," "one might argue," "there are various factors to keep in mind") that appears at higher rates than in genuine human writing.
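Perplexity, for the curious, is just a function of how probable a scoring model finds each word. Here's a sketch of the arithmetic, assuming you already have per-token probabilities from some language model; the numbers below are made up for illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative log-probability
    per token. Higher values mean less predictable word choices, which most
    detectors read as more human."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Made-up per-token probabilities from a hypothetical scoring model:
predictable = [0.90, 0.85, 0.88, 0.92]   # the model saw these words coming
surprising = [0.20, 0.05, 0.30, 0.10]    # the model did not
print(round(perplexity(predictable), 2))  # low perplexity, reads as AI-like
print(round(perplexity(surprising), 2))   # high perplexity, reads as more human
```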
Gemini also has a noticeable tic toward transitional over-explanation. Human writers assume their reader can follow a logical jump. Gemini tends to bridge every idea explicitly, which creates a readable but visibly engineered flow that detectors (and experienced human readers) pick up on.
Where Gemini got caught most: Blog content (87% detection rate). Longer form gave detectors more signal.
Where it did best: Professional emails (78% detection rate). Gemini's business-appropriate tone and cleaner vocabulary reduced its fingerprint in this category.
Claude 3.5 (Sonnet)
Average detection rate across all tools: 78%
Average TextSight Humanization Score (raw output): 52/100
Claude is the hardest to detect of the three models, and the reasons are structural rather than superficial. Anthropic's RLHF process appears to have produced a model that naturally generates higher burstiness and higher perplexity, the two primary signals AI detectors measure.
Sentence length variance is Claude's biggest advantage. Where GPT-4o sentences cluster in a 6-word range, Claude regularly swings between a punchy 5-word sentence and a 45-word sentence with multiple subordinate clauses in the same paragraph. This looks far more human statistically.
Vocabulary breadth is the second factor. Claude avoids the highest-frequency AI vocabulary tells: you won't find "delve" in a Claude essay, and "leverage" appears far less often than in GPT output. The word distribution across a Claude sample looks more like a human writer's distribution.
Structural variety is the third. Claude essays don't default to three-part structures. They meander slightly in ways that feel intentional rather than patterned. An argument will be introduced, supported, complicated, and then returned to, rather than introduced and then systematically closed down.
That said, Claude still has detectable tells. The most consistent one is over-nuance: a tendency to acknowledge multiple perspectives even when the prompt asks for a direct answer. Human writers take positions. Claude tends to hedge them. In academic essays especially, this pattern pushes the detection rate back up.
Where Claude did best (hardest to detect): Short professional writing (68% detection rate). The combination of high burstiness and clean vocabulary made Claude emails the most likely to pass.
Where it got caught most: Academic essays (85% detection rate). The over-nuancing pattern is most visible in formal argumentative writing.
Side-by-Side Results
| Metric | ChatGPT (GPT-4o) | Gemini Pro | Claude 3.5 |
|---|---|---|---|
| Avg Detection Rate | 91% | 84% | 78% |
| Avg TextSight Score (raw) | 31/100 | 43/100 | 52/100 |
| Sentence Burstiness | Low | Medium | High |
| Vocabulary Variety | Low | Medium | High |
| Structural Predictability | High | Medium | Low |
| Best Content Type (hardest to detect) | Short social posts | Professional emails | Short professional writing |
| Worst Content Type (easiest to detect) | Academic essays | Blog content | Academic essays |
| Primary Detection Tells | Vocab, structure | Hedging, transitions | Over-nuancing |
What These Numbers Actually Mean
The most important thing to take from this data is not the ranking; it's the floor.
Claude, the hardest to detect model, still gets caught 78% of the time on raw, unedited output. That means if you're submitting raw Claude text to any serious AI detector, you will be flagged roughly 4 out of every 5 times.
The difference between the "best" model (Claude at 78%) and the "worst" model (ChatGPT at 91%) is 13 percentage points. That's meaningful; Claude is consistently harder to detect. But practically, neither 78% nor 91% is an acceptable detection rate if the stakes matter.
Model choice is not the answer. It's the starting point.
Per Content Type: Where Every Model Struggles Most
The model-level averages hide something important: content type matters as much as model choice.
Academic Essays
All three models score highest detection rates on essays. The reason is signal volume: longer text gives detectors more statistical evidence. GPT-4o at 93%, Gemini at 87%, Claude at 85%. Even the best model is nearly guaranteed to be caught on a raw AI essay.
Blog Posts and Long-Form Content
GPT-4o: 92% | Gemini: 87% | Claude: 80%. Long-form remains high-risk. The one advantage here is that blog content is often not run through institutional detectors the way academic content is, but client-facing content increasingly is.
Emails and Short Professional Writing
GPT-4o: 82% | Gemini: 78% | Claude: 68%. This is where the model gap matters most and where Claude has the most practical advantage. Short-form text gives detectors less to work with, and Claude's burstiness makes emails genuinely harder to distinguish from human writing.
Social Media Posts (under 280 characters)
All models perform similarly here, with detection rates of 65-72%. Short samples reduce detector accuracy across the board. The human/AI distinction at this length is genuinely hard to measure statistically.
The Humanization Factor: What You Can Do About It
The detection rates above are for raw, unedited output. Here's what the same tests looked like after running each output through a humanization workflow using TextSight:
| Model | Raw Detection Rate | After Humanization | TextSight Score After |
|---|---|---|---|
| ChatGPT (GPT-4o) | 91% | 31% | 78/100 |
| Gemini Pro | 84% | 24% | 82/100 |
| Claude 3.5 | 78% | 19% | 86/100 |
The pattern holds: Claude starts from a better position and ends at a better position after humanization. But the more important observation is how much headroom humanization creates across all three models. A raw ChatGPT essay caught 91% of the time drops to 31% after targeted edits guided by the TextSight Humanization Score.
That 60-point swing, from 91% detection down to 31%, comes from fixing a relatively small number of specific problems: replacing flagged vocabulary, breaking structural predictability, and adding burstiness to sentence length. The TextSight AI Vocabulary Highlighter identifies which exact phrases are pulling the score down, so you're not rewriting blindly.
The Practical Takeaway: What Model Should You Actually Use?
If detection risk matters to you (academic submissions, client deliverables, professional proposals), here's the honest guidance from this data:
Start with Claude if you can. The 13-point difference in raw detection rate is real and consistent across content types. Starting from a TextSight score of 52 rather than 31 means you need fewer edits to cross the 75+ threshold where most detectors pass.
Use ChatGPT where speed and instruction-following matter most. GPT-4o is more obedient to specific formatting requests and faster on structured outputs. If you're generating a first draft that you plan to heavily rewrite anyway, the detection gap matters less.
Use Gemini for professional and business writing. Its business-register defaults and cleaner vocabulary make it naturally better for email and proposal contexts.
But always run it through TextSight before it matters. The model gives you a starting point. The Humanization Score tells you exactly where you stand and what to fix. A 52 from Claude that you've pushed to 84 is safer than a 52 from Claude that you assumed was fine and submitted raw.
Why This Research Matters Beyond Model Choice
The broader finding from 50 tests is this: AI detection is not a binary problem. It's a spectrum. Every piece of AI-generated text sits somewhere on a scale from clearly machine to clearly human, and where it sits depends on the model, the content type, the prompt, and whether it was edited.
The detectors that schools, companies, and clients are using in 2026 operate on that same spectrum. A Humanization Score is more useful than a flag because it tells you where on the spectrum your text sits and gives you a path to move it.
Whether you're a student trying to protect yourself from false positives on work you wrote yourself, a writer trying to make sure your AI-assisted drafts meet client standards, or a content team trying to publish at scale without tanking your brand's credibility, the question isn't which AI model to use. It's where your text lands on the human-to-AI scale, and whether you've done the work to move it.
Check your Humanization Score free at TextSight. Paste any text. Get a 0-100 score. See exactly what to fix. No signup required.
Appendix: Test Conditions
All tests were conducted in May 2026 using the default (non-custom) output of each model. ChatGPT-4o accessed via ChatGPT Pro. Gemini Pro accessed via Google AI Studio. Claude 3.5 Sonnet accessed via Claude.ai. Prompts were standardized across models. No system prompts, personas, or style instructions were applied. Detectors used: GPTZero Pro, Copyleaks AI Detector, TextSight. Each sample was run independently with no cross-contamination. Detection rate = percentage of samples flagged as majority-AI by at least two of three detectors. TextSight Humanization Scores are reported as averages across samples within each category.
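That two-of-three rule is a simple majority vote. For clarity, here's a small Python sketch of the aggregation, with invented per-sample flags purely for illustration:

```python
def detection_rate(per_sample_flags):
    """Each inner list holds the three detectors' verdicts for one sample
    (True = flagged as majority-AI). A sample counts as detected when at
    least two of the three detectors flag it."""
    detected = sum(1 for flags in per_sample_flags if sum(flags) >= 2)
    return 100 * detected / len(per_sample_flags)

# Invented flags for five samples, purely for illustration:
flags = [
    [True, True, False],    # detected (2 of 3)
    [True, True, True],     # detected
    [False, True, False],   # not detected (1 of 3)
    [True, False, True],    # detected
    [False, False, False],  # not detected
]
print(f"{detection_rate(flags):.0f}%")  # 60%
```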