The honest answer is: it depends. And most schools aren't pausing long enough to ask what it depends on.
AI detection in education has become a reflex rather than a policy. A student submits an essay. The teacher runs it through ZeroGPT or Turnitin's AI detector. A number comes back. A decision gets made. Very few institutions have thought carefully about what that number actually means, what its error rate is, or what rights students have when it's used against them.
Let's slow down and work through this properly.
What Detectors Are Actually Good At
There's a version of AI detector use that makes sense. It's narrower than most schools think.
AI detectors are reasonably reliable at catching obviously raw, unedited AI submissions: the ones where a student has typed a prompt into ChatGPT and pasted the result directly. These submissions have the full signature: low burstiness, hedging in the passive voice, consistent paragraph structure, high-frequency AI vocabulary. On a 0–100 scale where lower means more AI-like, detection tools score these in the 15–35 range with reasonable accuracy. At this end of the scale, the tools are doing something useful.
They're also useful as a rough screening tool — a way to flag essays that might warrant a follow-up conversation, not as evidence but as a starting point. "I noticed some unusual patterns in the structure of your essay — can you walk me through your drafting process?" That's a legitimate use.
That's about where the legitimate uses end.
What Detectors Are Terrible At
The limitations are serious and not well understood by most educators.
They can't distinguish AI-assisted from AI-generated. A student who used ChatGPT to outline their essay, wrote it themselves, then ran it through Grammarly has used AI tools but has also done substantial human intellectual work. A student who used AI to generate a first draft and then rewrote it heavily has also done real intellectual work. Detectors can't tell the difference between these students and a student who copied raw output.
They can't catch edited AI output reliably. Any student who spends 30 minutes varying sentence lengths, replacing AI vocabulary, and restructuring paragraphs will clear most detectors. Detection is easily evaded by anyone who knows what they're doing. This creates a perverse incentive: the students who get caught are the ones who didn't bother to conceal their AI use, while more sophisticated users sail through.
They have serious false positive rates. This is the most important limitation and the least discussed.
ZeroGPT's false positive rate on genuine human writing is approximately 16%, meaning roughly one in six completely human-written essays gets flagged. On ESL student writing, that rate climbs to 61.3%. ESL students are being flagged at rates that make detection meaningless: if you're flipping a coin weighted toward "guilty" for a particular demographic, you're not running a detection system. You're running a bias system.
This isn't a fringe critique. It's been documented in peer-reviewed research and in a growing body of academic integrity case reviews where students successfully appealed AI misconduct findings. The tools that institutions are treating as authoritative are flagging human writing at rates that would be considered unacceptable in any other evidentiary context.
The False Positive Crisis Is an Institutional Problem
Let me be specific about what's at stake.
An institution that uses a detection score as primary evidence for an academic misconduct finding is exposing itself to serious legal and ethical risk. In the United States, public universities must meet constitutional due process requirements when they make academic misconduct determinations. In the UK, GDPR and related data protection frameworks impose obligations around algorithmic decision-making. In most jurisdictions, using an AI detector score as the primary evidence in a disciplinary proceeding is legally questionable at best.
Several lawsuits have already been filed by students who received academic misconduct findings based on AI detector scores for work they wrote themselves. Some have been settled. Others are ongoing. The outcomes are being watched carefully.
This isn't a technicality. A 16% false positive rate on human writing means that if a teacher runs 100 genuinely human-written essays through a detector, roughly 16 of them will be flagged anyway. Discipline everyone who scores below the threshold, and those 16 students are punished for work they wrote themselves. In a cohort of 100 ESL students writing their own essays, that number is over 60.
No academic integrity process that generates this error rate should be described as fair.
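The arithmetic behind those figures is simple enough to check by hand, but a short sketch makes it easy to rerun for a different class size. The rates are the ones quoted in this article. The function name and the cohort size of 100 are illustrative assumptions, not part of any detector's tooling.

```python
def expected_false_flags(human_essays: int, false_positive_rate: float) -> float:
    """Expected number of genuinely human-written essays a detector flags."""
    if human_essays < 0:
        raise ValueError("human_essays must be non-negative")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return human_essays * false_positive_rate


# False positive rates as quoted in the article.
GENERAL_FPR = 0.16
ESL_FPR = 0.613

if __name__ == "__main__":
    # Cohort size of 100 is an illustrative assumption.
    for label, rate in (("general", GENERAL_FPR), ("ESL", ESL_FPR)):
        flagged = expected_false_flags(100, rate)
        print(f"{label}: about {flagged:.0f} of 100 human-written essays flagged")
```

Note that the input is the number of essays that were actually written by humans, not the number of essays submitted. The false positive rate says nothing about how many submissions were AI-generated in the first place.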
What Teachers Who Defend Detection Say
I want to give this position a fair hearing, because it isn't irrational.
The core argument: AI submission is a real problem. It's growing. Without any tool to identify it, there's no mechanism for enforcement. Detectors, even imperfect ones, represent a deterrent that shapes behavior. And the deterrent effect matters even if the detection rate isn't perfect — knowing a detector exists might prevent students from submitting raw AI output.
There's also a time argument. Teachers at under-resourced institutions are managing 150+ students. A tool that flags obviously problematic submissions for closer review — even if it's imperfect — saves time that would otherwise require reading every paper for AI patterns manually.
These are real points. The deterrence argument is actually fairly strong, and the time pressure on teachers is not imaginary.
The problem is that these arguments justify using detectors as a screening tool that raises questions. They don't justify using detectors as a verdict tool that answers them. Most institutional policies have quietly slid from the former to the latter.
What Works Better Than Detection
Here's where I'll give a direct recommendation.
Assignment design is the most durable solution, and it's the one most institutions aren't implementing because it requires changing how they teach, not just buying a tool.
AI detectors are only necessary when assignments are structured in ways that AI can complete easily. A five-paragraph essay on "the causes of World War I" is completable by any modern AI in about 90 seconds. Change the assignment to "analyze one primary source document from 1914 using the specific historical context framework we discussed in week 3 lectures" and the AI problem largely resolves itself — not because AI can't write the essay, but because the assignment requires demonstrated engagement with specific class content.
Conversation-based assessment is extremely effective. A brief 5-minute conversation ("walk me through your argument in paragraph three" or "what research did you do that didn't make it into the paper?") immediately distinguishes students who engaged with the material from students who didn't. You can't fake engagement you don't have in a live conversation. This scales reasonably: for a class of 30 students, 5-minute check-in conversations add up to about two and a half hours, which is feasible.
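How far this scales is a matter of multiplication. As a rough sketch, the helper below (a hypothetical name, not from any tool) converts a student count and a per-student conversation length into total hours, ignoring scheduling overhead. The 30-student class and 5-minute length come from the paragraph above, and the 150-student load is the figure cited earlier for under-resourced institutions.

```python
def check_in_hours(students: int, minutes_each: float = 5.0) -> float:
    """Total conversation time in hours, ignoring scheduling overhead."""
    if students < 0 or minutes_each < 0:
        raise ValueError("students and minutes_each must be non-negative")
    return students * minutes_each / 60.0


if __name__ == "__main__":
    for students in (30, 150):
        hours = check_in_hours(students)
        print(f"{students} students x 5 minutes = {hours} hours")
```

For a teacher carrying 150 students, talking to everyone on every assignment is a much heavier commitment, which is one reason to reserve conversations for essays that have already raised a question.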
Process documentation requires students to submit outlines, drafts, and revision histories alongside final papers. A student using AI as their sole tool will struggle to produce a plausible revision history. A student using AI as an editing tool will produce one that reflects real thinking alongside tool use. This also creates a record — if a misconduct finding is later disputed, the process documentation provides real evidence.
Where Detectors Fit in This Picture
Used correctly, detectors are one input among several — a screening tool that raises a flag, not a verdict tool that settles a question.
The best use case: a teacher reads an essay that feels unusual (structurally inconsistent, topics covered too uniformly, lacking any personal voice). They run it through a detector. The score comes back at 28. Combined with the teacher's own reading, this justifies initiating a conversation with the student. Not a misconduct proceeding. A conversation.
The conversation either resolves the concern or it doesn't. If the student can speak fluently about their argument, their sources, and their process — the conversation is over. If they can't, that's when further process begins.
This is what detectors are good for: surfacing cases worth looking at more closely. Not closing them.
A Note on TextSight Specifically
Most academic contexts are using tools that return a binary verdict: "AI" or "human." The score-based approach, like TextSight's Humanization Score from 0–100, is, honestly, better suited to educational contexts, because it places a text on a continuum rather than handing down a verdict.
A score of 42 is in the grey zone. It might be AI. It might be ESL writing. It might be heavily edited AI. It might be a very formal human writer. A score of 42 is a reason to look more closely, not a reason to make a finding.
A score of 12 is different. Paired with a teacher's own reading and a conversation where the student can't discuss their work, a score of 12 is meaningful contextual evidence.
That's the model: score informs probability, not verdict. Educators who use any detector this way — as a calibrated probability estimate rather than a pass/fail gate — are using it appropriately.
Most aren't using it this way yet. But the legal and ethical pressure to change is building fast.
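To make the screening-not-verdict model concrete, here is a minimal sketch of how a score might feed into a next step. Everything in it is an assumption made for illustration: the cut-offs of 35 and 60, the function and enum names, and the idea of encoding the process in code at all. It is not how TextSight or any other detector works. The point it illustrates is that the score can prompt a closer look or a conversation, while only the outcome of the conversation can move things further.

```python
from enum import Enum
from typing import Optional


class NextStep(Enum):
    NO_ACTION = "no action"
    LOOK_CLOSER = "read the essay more closely"
    CONVERSATION = "talk with the student"
    FURTHER_PROCESS = "begin further process"


# Hypothetical cut-offs on a 0-100 scale where lower means more AI-like.
LOW_SCORE = 35
GREY_ZONE_TOP = 60


def next_step(
    score: int,
    teacher_has_concerns: bool,
    student_explained_work: Optional[bool] = None,
) -> NextStep:
    """Suggest a next step. The score alone never produces a finding."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")

    # Once a conversation has happened, its outcome decides, not the score.
    if student_explained_work is True:
        return NextStep.NO_ACTION
    if student_explained_work is False:
        return NextStep.FURTHER_PROCESS

    # No conversation yet: a low score plus the teacher's own reading
    # justifies a conversation, and anything less only a closer read.
    if score <= LOW_SCORE and teacher_has_concerns:
        return NextStep.CONVERSATION
    if score <= GREY_ZONE_TOP or teacher_has_concerns:
        return NextStep.LOOK_CLOSER
    return NextStep.NO_ACTION


if __name__ == "__main__":
    print(next_step(28, teacher_has_concerns=True).value)
    print(next_step(42, teacher_has_concerns=False).value)
```

No branch returns a misconduct finding. The furthest the sketch goes is "begin further process", and it gets there only after a conversation in which the student could not discuss their own work.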
How to Talk to Students About AI Detection
One thing most teachers aren't doing, and should be, is having a direct, honest conversation with students about what detection tools can and can't do.
Students who know detection tools exist and know their limitations will make different decisions than students who fear a black box. If you explain that ZeroGPT has a 16% false positive rate on human writing, and that ESL writers are flagged at over 60%, you're not undermining detection. You're being honest about what it measures.
That honesty has a useful side effect. Students who understand that raw AI output is easily caught, but that heavily edited, AI-assisted work is not, will make more thoughtful choices about how they use these tools. The students most at risk of getting caught are the ones who don't understand how detection works. Transparency serves everyone.
The conversation also shifts the framing from punishment to learning. "I'm using this tool to raise questions, not reach verdicts, and if it flags your work we'll talk about it" is a much more defensible policy than a quiet detection regime students only discover when they're facing misconduct proceedings.
The Bottom Line
Should teachers use AI detectors? Yes — as a screening tool that raises questions, in combination with assignment design that reduces AI completion, conversation-based assessment that requires demonstrated engagement, and process documentation that creates a real evidentiary record.
No — as primary evidence in a misconduct finding. No — as a verdict tool for ESL students. No — as a substitute for the harder pedagogical work of redesigning assignments and building relationships with students.
The honest answer is that detection is a partial solution to a pedagogical problem. The tool is useful. The way most institutions are using it isn't.