Detect AI in academic papers: section-by-section calibration that holds up.

Written for journal editors, peer reviewers, and dissertation committees who need a defensible call rather than a single percentage. A research paper is not a blog post with footnotes. Literature reviews are citation-heavy and read formal. Methods are templated by genre convention. Discussions reward fluent abstract prose, which is exactly what frontier models produce by default. Scanning a whole manuscript as one block and reading one number averages those very different baselines into noise. The five-step section-by-section workflow below pairs TextSight's sentence-level signal with iThenticate cross-verification and an ESL-aware calibration, so the case you bring to the author is built on per-section evidence rather than a global score.

Why papers are different

A paper is five small papers stitched together.

Each section is written to its own genre convention. A single percentage flattens those very different baselines into a number that hides where the signal actually lives. Reading the paper section by section is what separates triage from evidence.

Abstract

The shortest section and the most outsourced. Authors under deadline pressure often draft the abstract last and reach for a chatbot when time runs short. The Abstract is also where AI-tell vocabulary shows up most densely, because the section is short enough that two or three frontier-favourite words read as a cluster. Treat a high Abstract score paired with clustered red highlights as one of the strongest signals on the page.

Literature Review

Citation-heavy paraphrase reads formal and templated. Parenthetical citations every two or three sentences, author-date constructions, and the convention of summarising other researchers' work all push the statistical signature toward AI-adjacent territory. Expect the Lit Review to flag higher than the rest of the paper even when a human wrote it. Discount the Lit Review score by roughly the gap you observe between it and the Methods section in clearly human-written reference papers; do not drop it to zero.
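
The discount described above can be sketched as a small calculation. This is a minimal illustration, assuming you have already collected per-section scores (on a 0 to 100 scale) for a handful of papers you know were human-written; the function names, the dictionary keys, and the sample numbers are all hypothetical and are not part of any TextSight output format.

```python
from statistics import mean

def lit_review_discount(reference_papers):
    """Average (Lit Review - Methods) gap across known-human reference papers."""
    gaps = [p["literature_review"] - p["methods"] for p in reference_papers]
    # A negative average gap would mean no discount is warranted.
    return max(0.0, mean(gaps))

def discounted_lit_review(score, discount):
    """Shrink the Lit Review score by the discount; the floor only prevents negatives."""
    return max(0.0, score - discount)

# Hypothetical scores from three clearly human-written reference papers.
reference = [
    {"literature_review": 34.0, "methods": 22.0},
    {"literature_review": 41.0, "methods": 27.0},
    {"literature_review": 30.0, "methods": 20.0},
]

discount = lit_review_discount(reference)         # gaps are 12, 14, 10
adjusted = discounted_lit_review(58.0, discount)  # 58 minus the average gap
print(discount, adjusted)
```

The score is reduced rather than discarded, which matches the advice not to drop it to zero: a Lit Review that still stands out after the discount remains worth reading closely.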

Methods

Templated by design. Uniform third-person, narrow technical vocabulary, repeated sentence structures, no figurative language: those are the conventions of good methods writing. They also overlap with the statistical signature of AI prose. A clean human-written methods section can score higher than the same author's discussion. Read methods scores as ambient noise unless something specific stands out, and put the diagnostic weight elsewhere.

Results

Partially protected. Reporting specific procedures, numbers, instruments, and outcomes constrains the prose in ways that frontier models do not naturally replicate. A high Results score is unusual and worth a second look; a low Results score is the expected baseline rather than evidence of authenticity.

Discussion

The most variable section and the most diagnostic. Discussions reward fluent generalised abstraction, which is exactly what chatbots produce by default. A discussion section that flags substantially higher than the methods section, especially with clustered red highlights and AI-favoured vocabulary, is the cleanest signal the paper offers. This is where the section-by-section workflow earns its weight.

The five-step workflow

Scan, review, check, cross-verify, discuss.

Roughly fifteen minutes per manuscript once you are practised. The workflow is designed so the cheap steps catch the easy cases, and the expensive step, the conversation with the author, is reserved for papers where the evidence has converged.

Step 1: Scan the paper in TextSight

Paste the manuscript into TextSight at app.textsight.ai. The calibrated ML classifier returns a sentence-level highlight map and per-section density in seconds, before any plagiarism queue has started. This is the pre-Turnitin and pre-iThenticate signal: a fast generative-AI read that tells you whether the prose was generated, which is a separate question from whether it was lifted. The free tier is enough for one paper a day; Pro at $14.99 a month on yearly billing removes the cap and is the right plan for any editor handling more than a manuscript a week.

Step 2: Review per-section scores

Split the manuscript into Abstract, Literature Review, Methods, Results, and Discussion, and look at the density of each section rather than a single global percentage. The pattern matters more than the headline number. A paper where the Discussion section is twice as densely flagged as the Methods section is telling you something the one-shot scan would have buried.
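
One way to make the per-section comparison concrete is to compute the share of flagged sentences in each section and then the Discussion-to-Methods ratio. The sketch below assumes you have transcribed the highlight map by hand into one list of booleans per section (True for a flagged sentence); the section keys and function names are illustrative assumptions, not a TextSight export format.

```python
SECTIONS = ["abstract", "literature_review", "methods", "results", "discussion"]

def section_density(flags):
    """Fraction of sentences flagged in one section; empty sections count as 0.0."""
    return sum(flags) / len(flags) if flags else 0.0

def discussion_to_methods_ratio(flags_by_section):
    """Return per-section densities and the Discussion/Methods ratio (None if undefined)."""
    densities = {
        name: section_density(flags_by_section.get(name, []))
        for name in SECTIONS
    }
    methods = densities["methods"]
    ratio = densities["discussion"] / methods if methods > 0 else None
    return densities, ratio

# Hypothetical paper: 2 of 10 Methods sentences flagged, 4 of 10 in the Discussion.
paper = {
    "methods": [True] * 2 + [False] * 8,
    "discussion": [True] * 4 + [False] * 6,
}
densities, ratio = discussion_to_methods_ratio(paper)
print(densities, ratio)
```

A ratio near 2 is the "twice as densely flagged" pattern described above. The ratio is undefined when Methods has no flagged sentences, so the sketch returns None in that case rather than guessing.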

Step 3: Check sentence-level highlights

Read the red-highlighted sentences in each section. Do they cluster in one paragraph or scatter across the section? Clustered red sentences in the Discussion or Abstract are a stronger signal than the same percentage spread thinly across the paper. The highlight map is the diagnostic layer; the headline percentage is triage. Note the specific sentences you would quote in a follow-up conversation.
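
The difference between clustered and scattered highlights can be approximated with a simple measure: the longest run of consecutive flagged sentences, taken as a share of all flagged sentences in the section. This is only a rough sketch of the idea, with hypothetical function names and hand-entered flags; it is not how TextSight itself defines a cluster.

```python
def longest_flagged_run(flags):
    """Length of the longest run of consecutive flagged sentences."""
    best = current = 0
    for flagged in flags:
        current = current + 1 if flagged else 0
        best = max(best, current)
    return best

def cluster_share(flags):
    """Share of flagged sentences that sit in the single longest run (0.0 if none flagged)."""
    total = sum(flags)
    return longest_flagged_run(flags) / total if total else 0.0

# Two hypothetical sections, both with 4 of 10 sentences flagged.
clustered = [False, False, True, True, True, True, False, False, False, False]
scattered = [True, False, False, True, False, False, True, False, False, True]

print(cluster_share(clustered))  # all flagged sentences are adjacent
print(cluster_share(scattered))  # flagged sentences are isolated
```

Both sections carry the same percentage, but the first concentrates every flagged sentence in one passage, which is the stronger pattern according to the step above.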

Step 4: Cross-verify with iThenticate

Send the manuscript through iThenticate or Turnitin in parallel for plagiarism similarity. The two outputs answer different questions: TextSight tells you whether the prose was generated; iThenticate tells you whether it was lifted from indexed sources. A paper that flags on both warrants a far closer conversation than one that flags on only one. Editors at Nature, Science, The Lancet, and JAMA already run AI screening; pair it with the similarity layer rather than treating either signal as a verdict on its own.

Step 5: Discuss with the author

Treat the result as a conversation-starter, not a verdict. Open an exchange about drafting process, AI-assistance disclosure, and the specific section-level pattern you observed. Lead with per-sentence highlights and section densities rather than a global percentage; you get more honest answers, and your case holds up better on appeal. A genuine author can reconstruct their drafting process in five minutes; the absence of that reconstruction is usually more diagnostic than the score itself.

Plans & pricing

Detector and AI rewriter on every tier.

Free includes 3 detector scans a day and a 1,500-word AI rewriter quota. Paid tiers raise the quotas and add the Chrome extension, file upload, and REST API. Yearly billing saves 25%.

Free
$0/forever

Try the detector and AI rewriter. No card.
  • 3 detector scans/day
  • 1,500 AI rewriter words
  • All 3 AI rewriter modes
  • Sentence-level highlights

Starter
$7.49/month, billed $89.88/year (save $30)

For freelancers and light writers.
  • 20,000 AI rewriter words/mo
  • Unlimited detector scans
  • Chrome extension
  • Email support

Business
$29.99/month, billed $359.88/year (save $120)

For journals and editorial offices.
  • 150,000 AI rewriter words/mo
  • REST API access
  • 5 team seats
  • Webhook integrations


Pre-Turnitin and pre-iThenticate

Fast generative signal before the similarity queue.

Turnitin and iThenticate answer the plagiarism question. TextSight answers the generative-AI question. Starting with the AI scan while the similarity check runs in parallel is what gives editors the converging evidence a defensible reviewer note needs.

Different signals, different questions

Plagiarism engines surface text reuse from indexed sources. They were not built to recognise generated prose that never appeared in any indexed corpus, which is precisely what frontier models produce by default. A paper drafted in ChatGPT and never copy-pasted from a public source can pass Turnitin and iThenticate cleanly while being almost entirely AI-generated. The sentence-level generative signal that TextSight returns in seconds catches that case before the similarity queue completes.

Run them in parallel, read them together

The strongest reviewer position combines both outputs. A paper that flags on TextSight only is a generative-AI question and warrants a conversation about drafting process and AI-assistance disclosure. A paper that flags on iThenticate only is a similarity question and warrants the standard plagiarism conversation. A paper that flags on both is the strongest case and warrants the closest follow-up. Reviewers who treat the two outputs as a single converging picture, rather than as competing verdicts, build cases that hold up on appeal.
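
The four combinations above amount to a small decision table. The sketch below writes it out, assuming you have already decided for each tool whether the paper "flags"; the thresholds behind those two booleans are left to the reviewer, and the function name and wording of the outcomes are illustrative only.

```python
def follow_up(textsight_flag: bool, ithenticate_flag: bool) -> str:
    """Map the two independent signals to the kind of conversation they warrant."""
    if textsight_flag and ithenticate_flag:
        return "closest follow-up: both signals converge"
    if textsight_flag:
        return "generative-AI question: drafting process and AI-assistance disclosure"
    if ithenticate_flag:
        return "similarity question: standard plagiarism conversation"
    return "no flag on either signal"

# Print the full table so both signals are read together.
for ts in (False, True):
    for ith in (False, True):
        print(f"TextSight={ts!s:5} iThenticate={ith!s:5} -> {follow_up(ts, ith)}")
```

Keeping the two inputs separate, rather than averaging them into one number, reflects the point that they answer different questions.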

Journal context: who is already screening

Nature, Science, The Lancet, JAMA, and several other major journals have published AI-use disclosure policies and routinely run automated screening on submissions. Editors at smaller journals and conference programme committees increasingly do the same. The screening layer is not a verdict; it is triage. Reviewers and editors still receive a flagged paper and have to decide what to do with it. A section-by-section workflow with per-sentence evidence is what separates a defensible reviewer note from a single-percentage gotcha that loses on appeal.

The false-positive that matters most

International authors and the 40 percent calibration gap.

For any journal with international authors, which is almost every journal that publishes, the ESL caveat is the single most important calibration on this page. Get this wrong and the workflow produces unjust outcomes regardless of how clean the score looks.

What the research says about ESL false positives

Multiple peer-reviewed studies published since 2023 have shown that off-the-shelf AI detectors flag English-as-a-second-language writing as AI-written at roughly three to five times the rate of native-English writing on the same task. The reason is structural rather than accidental. Learned-second-language English uses more uniform sentence shapes, a narrower active vocabulary, and a more formal register, all of which overlap with the statistical signature classifiers were trained to recognise. The detector is not failing; it is correctly measuring something that happens to mean a different thing for ESL authors than for native ones.

TextSight's false-positive rate on ESL prose is about 40 percent lower

TextSight trains on diverse English varieties rather than only US academic prose, which narrows the structural overlap by roughly 40 percent against open-source baselines. The practical effect is a lower false-positive rate, not a zero false-positive rate. No detector eliminates the overlap; the best ones narrow it. If you know the author is writing in a second language, weight the score cautiously and lean on sentence-level evidence in the Discussion rather than the Lit Review, since the vocabulary-cluster and clustered-highlight signals are more language-neutral than the burstiness or hedge-density signals.

What to do operationally

If your journal publishes international authors, build the calibration into the workflow rather than into the score. Drop a flagged score by 15 to 20 points for ESL authors before deciding what tier it falls into. Require clustered sentence-level highlights in the Discussion plus a vocabulary cluster plus an iThenticate hit before treating an ESL paper as a high-confidence generative-AI case. For any high-stakes decision, including rejection or sanction, never act on the score alone; lead with the per-sentence evidence and an honest conversation about drafting process.
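
The operational rule above can be sketched as two small checks: a score adjustment for ESL authors and a three-part requirement before calling a case high-confidence. This is a minimal sketch under stated assumptions: scores are on a 0 to 100 scale, the default drop of 15 is simply the low end of the 15 to 20 range given above, and all names are hypothetical.

```python
def esl_adjusted_score(score: float, esl_author: bool, drop: float = 15.0) -> float:
    """Lower a flagged score for ESL authors before assigning it to a tier."""
    if not 15.0 <= drop <= 20.0:
        raise ValueError("drop should stay within the 15 to 20 point range")
    if not esl_author:
        return score
    # The floor only prevents a negative score.
    return max(0.0, score - drop)

def high_confidence_esl_case(clustered_discussion_highlights: bool,
                             vocabulary_cluster: bool,
                             ithenticate_hit: bool) -> bool:
    """All three signals must be present before an ESL paper counts as high-confidence."""
    return clustered_discussion_highlights and vocabulary_cluster and ithenticate_hit

print(esl_adjusted_score(72.0, esl_author=True))             # default 15-point drop
print(esl_adjusted_score(72.0, esl_author=True, drop=20.0))  # upper end of the range
print(high_confidence_esl_case(True, True, False))           # one signal missing
```

Even when all three signals are present, the result is a reason to open the conversation with per-sentence evidence, not a basis for acting on the score alone.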

Score as trigger, not verdict

A conversation-starter, not a gotcha.

The detector did not catch the author. The detector flagged sections for closer reading, which the reviewer then evaluated against per-sentence evidence and section-level patterns. Framing the result as a conversation-starter rather than a verdict is what separates reviewer authority from reviewer overreach.

What to bring to the author

The section-by-section densities, the specific highlighted sentences and the markers they match, the iThenticate cross-verification result, and a request to walk through the drafting process. A genuine author can reconstruct their process in five minutes: where the idea started, which sections were drafted first, what AI assistance was used and how it was disclosed. The reconstruction is usually more diagnostic than the score itself. Reviewers who lead with the evidence rather than the percentage hold up better on appeal and get more honest answers.

What not to bring

A single global percentage with no per-section breakdown. A confident verdict on the strength of one number. A demand for explanation phrased as an accusation. Reviewers who treat the score as the case rather than as the trigger for evidence-building lose authority the moment the author pushes back, and the journal loses the ability to act if the case turns out to be real. The score opens the conversation; it does not close it.

When to escalate

Escalate to the editor when the evidence has converged: clustered red highlights in two or more sections, a vocabulary cluster across Abstract and Discussion, an iThenticate hit on the same passages, and a drafting-process explanation that does not match the per-section pattern. Each one of those signals on its own is information, not evidence. Two of them is a question worth asking the author. Three or more is a case worth escalating.
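
The counting rule above can be written as a short checklist. The sketch assumes the reviewer records each of the four signals as a yes or no; the dictionary keys and function name are hypothetical, and the "nothing to follow up" outcome for zero signals is an assumption added for completeness.

```python
def escalation_tier(signals: dict) -> str:
    """Count converging signals and map the count to the tiers described above."""
    count = sum(bool(present) for present in signals.values())
    if count >= 3:
        return "case worth escalating to the editor"
    if count == 2:
        return "question worth asking the author"
    if count == 1:
        return "information, not evidence"
    return "nothing to follow up"  # assumption: the guide does not cover zero signals

# Hypothetical review notes for one manuscript.
review = {
    "clustered_highlights_in_two_or_more_sections": True,
    "vocabulary_cluster_across_abstract_and_discussion": True,
    "ithenticate_hit_on_same_passages": False,
    "drafting_explanation_mismatch": True,
}
print(escalation_tier(review))
```

Recording each signal separately also leaves a written trail of which evidence converged, which is useful if the decision is later contested.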

FAQ

Academic AI detection: frequently asked questions.

Why detect AI section by section rather than as a single global scan?
A research paper is built from sections with very different genre conventions. Literature reviews are citation-heavy and read formal; methods sections are templated by design; discussion sections reward fluent generalised prose, which is exactly what frontier models produce by default. A single global percentage on a 6,000-word paper averages these very different baselines into a number that hides where the signal actually lives. Scanning section by section surfaces the diagnostic pattern that a single-shot scan buries.
Where in the paper does AI-generated prose usually show up?
Most often in the Abstract, Introduction, and Discussion. These three sections reward fluent generalised prose and are the easiest to outsource to a chatbot under deadline pressure. Methods and Results are partially protected by the discipline of reporting specific procedures, numbers, and instruments. Literature Reviews sit in the middle: citation-heavy paraphrase often reads AI-adjacent even when a human wrote it, which is why the Lit Review flag rate is structurally elevated and should be discounted slightly.
How does this workflow compare to running Turnitin or iThenticate?
It runs before them rather than instead of them. Turnitin and iThenticate are similarity engines built to surface text reuse from indexed sources, which is a different question from whether prose was generated. TextSight gives a sentence-level generative-AI signal in seconds, before any plagiarism queue completes. The two outputs answer different questions, and a paper that flags on both signals warrants a far closer reviewer conversation than one that flags on only one.
What about journals that already run AI screening?
Nature, Science, The Lancet, JAMA, and several other major journals have published AI-use disclosure policies and routinely run screening on submissions. The screening is not a verdict; it is a triage layer. Reviewers and editors still receive a flagged paper and have to decide how to act on it. A section-by-section workflow with per-sentence evidence is what separates a defensible reviewer note from a single-percentage gotcha that will lose on appeal.
How should I handle ESL authors and international submissions?
With more care, not less. Learned-second-language English shares structural features with AI prose: uniform sentence shapes, narrower active vocabulary, more formal register. Off-the-shelf detectors can flag ESL authors at three to five times the native-English rate on the same content. TextSight's false-positive rate on ESL prose is roughly 40 percent lower than open-source baselines because it trains on diverse English varieties, but no detector eliminates the structural overlap. If you know the author is writing in a second language, weight the score cautiously and lean on sentence-level evidence in the Discussion rather than the Lit Review.
Why is the methods section often a false-positive trap?
Methods sections are written to a tight genre template by design. Authors describe procedures in uniform third-person, use a narrow technical vocabulary, repeat sentence structures across paragraphs, and avoid figurative language. Those properties overlap with the statistical signature classifiers were trained to recognise. A clean human-written methods section can score higher than the same author's discussion section. Read methods scores as ambient noise and put the diagnostic weight on Abstract, Introduction, and Discussion.
Should I frame the result as a verdict or as a conversation?
As a conversation. A detector score, even at 90 percent with clustered highlights, is the trigger for evidence-gathering, not the evidence itself. Reviewers who present a verdict on the strength of a single percentage tend to lose authority when the author pushes back. Reviewers who present per-sentence highlights, section-by-section density patterns, and a request to walk through the drafting process get more honest answers and hold up better on appeal. The score opens the conversation; it does not close it.
How long does the full section-by-section workflow take per paper?
About fifteen minutes once you are practised. The scans themselves take under a minute per section; reading the highlights and noting section densities takes five to seven minutes; iThenticate cross-verification runs in parallel and adds no reviewer time. The full sequence is faster than a careful first read of the paper, and it produces a defensible record that survives a contested decision much better than a single global scan does.

Detect by section, decide with evidence.

Calibrated ML classifier with per-section densities and sentence-level highlights. Free to try with no card. 3 detector scans a day, the full evidence layer on every result, ESL-aware calibration on by default.

Section-by-section, sentence-level, ESL-calibrated. The way reviewers actually do it.