Run a French essay through GPTZero and you'll get a score. That score is probably wrong. Not slightly off — meaningfully unreliable, in ways that can have serious consequences for the student who wrote it.
This is an open problem in AI detection that mainstream coverage rarely addresses. Detectors were trained overwhelmingly on English-language data. Most of the major players (GPTZero, Turnitin's AI detection layer, ZeroGPT) built their models using English-language text corpora, English-language AI outputs, and English-language human writing samples. That's where the training data was abundant, that's where the initial market was, and that's where the models work.
Everywhere else? It's complicated. And often bad.
What Happens When You Run a Non-English Essay Through a Detector
The short answer: you get a number that doesn't mean what you think it means.
Here's why. AI detectors are essentially classifiers trained to distinguish statistical patterns in text — token probability distributions, sentence structure variance, vocabulary frequency patterns, syntactic regularity. When those models are trained on English, they learn what English human writing looks like vs. what English AI writing looks like.
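To make one of those statistical patterns concrete, here is a minimal sketch of a single feature, sentence-length variation. It is an illustration only, not how any named detector works; the function names and the two sample strings are made up for the example.

```python
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    """Split on ., !, ? and return the word count of each non-empty sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_variation(text: str) -> float:
    """Coefficient of variation of sentence length (std dev / mean).

    Lower values mean more uniform sentences. This is one illustrative
    feature, not a detector on its own.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)

# Hypothetical sample texts, chosen only to show the two extremes.
uniform = "The cat sat down. The dog ran off. The bird flew up."
varied = "Stop. The dog ran off across the wide field before anyone noticed. Why?"

print(length_variation(uniform))  # 0.0, every sentence has four words
print(length_variation(varied))   # greater than 1.0
```

Note that even this toy feature bakes in English assumptions: it splits sentences on ASCII punctuation and words on whitespace, which already breaks for Chinese text, where sentences end in "。" and words are not separated by spaces.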
French, Spanish, Arabic, Mandarin Chinese: each of these languages differs from English in syntactic structure, vocabulary distribution, and stylistic norms. Formal writing in French sounds different from formal writing in English. The vocabulary overlap is limited. The sentence rhythms are different. A detector trained on English has no calibrated baseline for what "human-sounding French" looks like because it never learned one.
The result is wild variance. In tests of GPTZero and ZeroGPT against French and Spanish text, scores are effectively random — human-written essays come back flagged at 40-80% AI, and actual AI-generated text in French sometimes scores lower than human-written text. The model is guessing, because it has no real signal.
Turnitin has acknowledged this problem to some degree in its documentation, noting that its AI detection is "optimized for English" and may produce unreliable results for other languages. That's a responsible disclosure. It doesn't help the student who gets flagged.
The Translated Essay Problem Is Even Harder
Here's the scenario that matters most for actual university students: an international student who isn't confident in written English drafts their essay in their first language — Spanish, Mandarin, Arabic, Hindi — and then uses Google Translate or DeepL or GPT-4o to translate it into English before submission.
Was AI involved? Yes. Did the student write the content? Also yes. Is this academic dishonesty? That depends entirely on the institution's policy, and most institutions haven't thought through where translation assistance ends and AI writing assistance begins.
The detection problem for this scenario is severe. The translated essay often scores very high on AI detectors, not because the ideas are AI-generated, but because the translation process imposes a statistical regularity on the text that mirrors GPT output. Neural translation systems and GPT-4o are both trained to produce fluent, high-probability text, so both tend toward the same predictable word choices. To a detector that doesn't know a translation happened, the surface-level English that comes out of a translation can look indistinguishable from AI-generated text.
So you get a student who wrote every word of the content, in their own language, having their work flagged as AI-generated in English. That outcome is deeply unfair. The detector is technically detecting a real signal (the text has AI-translation fingerprints), but it's misidentifying what that signal means.
The Training Data Problem Is Hard to Solve
Let's be honest about why this hasn't been fixed: building reliable AI detection for multiple languages isn't a minor update to existing models. It's building new models.
A detection system for French needs:
- Large corpora of human-written French text across many genres, registers, and demographic groups
- Large corpora of AI-generated French text from the major models (GPT-4o in French, Claude in French, Gemini in French, and so on)
- Validation data that's properly balanced across language registers
- Calibration work to establish what false positive rates look like in French vs. English
Then you do the same for Spanish. And Arabic. And Mandarin. And 50+ other languages if you want genuinely multilingual detection.
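The calibration item in the list above comes down to measuring, per language, how often human-written text gets flagged. A minimal sketch, assuming a hypothetical validation set of (language, is_ai, score) rows and a hypothetical 0.5 flagging threshold; the numbers are invented for illustration and are not measured results:

```python
from collections import defaultdict

def false_positive_rates(samples, threshold=0.5):
    """Per-language share of human-written texts scored at or above threshold.

    samples: iterable of (language, is_ai, score) tuples, where score is a
    detector output in [0, 1]. Only human-written rows (is_ai False) count.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for language, is_ai, score in samples:
        if is_ai:
            continue
        total[language] += 1
        if score >= threshold:
            flagged[language] += 1
    return {lang: flagged[lang] / total[lang] for lang in total}

# Made-up validation rows for illustration only; not measured results.
validation = [
    ("en", False, 0.10), ("en", False, 0.20), ("en", False, 0.60), ("en", False, 0.05),
    ("fr", False, 0.70), ("fr", False, 0.40), ("fr", False, 0.80), ("fr", False, 0.55),
    ("en", True, 0.90), ("fr", True, 0.30),
]

print(false_positive_rates(validation))  # {'en': 0.25, 'fr': 0.75}
```

The code is trivial; the hard part is the data. Each language needs its own labeled human-written sample that is large and varied enough for the resulting rate to mean anything.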
This is real engineering work with real research costs. Most AI detection startups built English-first because that's where the product-market fit was and where the training data was easiest to get. Expanding to other languages is a second (or third or fourth) product effort.
The larger players like Turnitin have more resources to do this work, but they also move slowly and have competing institutional priorities. GPTZero has done some multilingual work but hasn't published performance data across language families. The honest state of the field is: reliable multilingual AI detection doesn't exist yet at scale.
What TextSight Currently Covers
TextSight is an English-first product. The Humanization Score and AI Vocabulary Highlighter are calibrated for English text. Running non-English text through TextSight will give you a score, but that score isn't calibrated against a multilingual training corpus — it's applying an English-trained model to non-English text, and the same limitations that affect other detectors apply here too.
That's a direct, honest answer. TextSight's score is reliable for English. For other languages, you're operating outside the model's training distribution, and the score should be treated skeptically.
The AI Vocabulary Highlighter component does identify specific phrases that appear frequently in AI outputs, and some of those phrase-level patterns do transfer across languages to a limited degree — AI models in French and Spanish also tend toward formal register, hedging language, and additive transitional structures. But the calibration isn't there yet for a full Humanization Score to mean the same thing in French as it does in English.
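For readers wondering what phrase-level matching involves mechanically, here is a minimal sketch. It is not TextSight's implementation: the phrase list, the function name, and the sample sentence are all hypothetical and exist only to show the general technique of case-insensitive phrase matching.

```python
import re

# Hypothetical phrase list for illustration; not TextSight's actual list.
FLAGGED_PHRASES = ["it is important to note", "furthermore", "in conclusion"]

def find_flagged_phrases(text: str, phrases=FLAGGED_PHRASES):
    """Return (start, end, phrase) for each case-insensitive whole-word match."""
    hits = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), phrase))
    return sorted(hits)

sample = "Furthermore, it is important to note that the results vary."
for start, end, phrase in find_flagged_phrases(sample):
    print(start, end, sample[start:end])
```

A literal list like this one is language-specific by construction. The tendencies it targets (formal register, hedging, additive transitions) may carry over to French or Spanish, but the strings themselves do not, so each language would need its own list.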
Building multilingual support properly is on the roadmap. Right now, TextSight is honest about its scope.
The Specific Case of Indian English
One interesting case worth calling out: Indian English.
India has one of the world's largest English-speaking populations, and Indian English has distinct syntactic and stylistic patterns that differ from American and British English. The vocabulary choices, sentence structures, and formality conventions that are natural and correct in Indian English can look unusual to an American-trained detector.
More importantly, English education in India has historically emphasized formal, structured writing. Academic writing in Indian universities often follows more prescriptive, formal conventions than contemporary American academic writing. This isn't wrong — it's a different standard. But it's a standard that AI detectors weren't trained on, so essays written in standard Indian academic English often score higher for AI than they should.
This is a variant of the broader non-native speaker bias problem, and it applies even to students writing in their first or near-first language. The detector's baseline isn't neutral. It's calibrated to a particular kind of English that reflects the cultural and demographic makeup of who wrote the training data.
What a Realistic Path to Multilingual Detection Looks Like
The field will get there. It'll just take time. Here's what actually needs to happen:
Language-specific model calibration. Not translation of English models, but purpose-built detection layers trained on each target language's human and AI writing samples. French detection requires a French model. Spanish requires Spanish. Shortcutting this with translation is what produces the unreliable scores we have now.
Regional register calibration. Spanish written in Mexico reads differently from Spanish written in Spain. Arabic written in formal Modern Standard Arabic reads differently from colloquial Levantine Arabic. Multilingual detection has to account for regional variation, not just language variation.
Honest confidence reporting. Detectors should communicate uncertainty. A system that says "confidence: high" for English and "confidence: low / language not fully supported" for French is more trustworthy than one that just gives you a score for everything without disclosing that some scores are less reliable.
Institutional guidance. Universities with significant international student populations need to adopt policies that account for detection unreliability across languages. "Our AI detector flagged this" can't be treated as reliable evidence for a student writing in their third language.
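The confidence-reporting idea above (a "high" label for English, a "low / language not fully supported" label for French) can be sketched in a few lines. Everything here is an assumption made for illustration: the set of calibrated languages, the DetectionReport type, and the wording of the note do not come from any real detector.

```python
from dataclasses import dataclass

# Hypothetical set of languages the detector was calibrated on.
CALIBRATED_LANGUAGES = {"en"}

@dataclass
class DetectionReport:
    score: float      # raw detector output in [0, 1]
    language: str     # language code of the submitted text
    confidence: str   # "high" or "low"
    note: str         # caveat shown next to the score, empty if none

def build_report(score: float, language: str) -> DetectionReport:
    """Attach a confidence label based on whether the language is calibrated."""
    if language in CALIBRATED_LANGUAGES:
        return DetectionReport(score, language, "high", "")
    return DetectionReport(
        score, language, "low",
        "language not fully supported; do not treat this score as evidence",
    )

print(build_report(0.72, "en"))
print(build_report(0.72, "fr"))
```

The same raw score of 0.72 comes back with two different labels. The point of the design is that the caveat travels with the number instead of living in documentation that an instructor may never read.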
What To Do If You're Writing in English as a Second Language
If you're an international student or ESL writer submitting work in English, here's practical advice:
First, understand that your work may be disproportionately flagged — not because you used AI, but because your writing style may be more formal and regular than American native-speaker writing, which is exactly what detectors flag. This isn't fair. It's the current reality.
Second, if you used translation assistance (Google Translate, DeepL, AI translation), be aware that translated text scores high for AI on most detectors. Know your institution's policy on translation assistance before you submit.
Third, if you're using TextSight as a self-check tool, focus on the phrase-level highlights rather than the aggregate Humanization Score. The vocabulary patterns that TextSight flags — formal hedging phrases, additive transitional structures, certain vocabulary clusters — are useful signals even in ESL writing, because they can help you identify places where your writing sounds more formal or AI-like than you intend.
Fourth, keep drafts. If you're ever challenged on whether you wrote your work, a version history showing the evolution of your ideas is the strongest evidence you can have.
The multilingual AI detection problem is real, it's underacknowledged, and it's producing unfair outcomes for real students right now. That's worth saying plainly, even when the full solution isn't here yet.
The Cascade Effect: When Policies Don't Account for Detection Failures
There's a downstream problem that gets less attention than the detection failures themselves.
Universities are adopting AI detection policies faster than they're updating the academic integrity frameworks those policies operate inside. The result: students can be referred to misconduct proceedings based on tool output that, for their language background, is closer in accuracy to a coin flip than to a reliable detection system.
The institutional assumption is that if the tool flagged it, there's a real signal worth investigating. That assumption doesn't hold when the tool's false positive rate for a particular student population is 61% or higher. The non-English and translated-text scenario is an even more extreme version of that: the score for non-English essays isn't just unreliable, it's essentially arbitrary.
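A quick worked example shows why a flag carries so little weight at that false positive rate. The 61% figure is the one quoted above. The other two inputs are assumptions picked purely for illustration: that 10% of submissions in the group really are AI-written, and that the detector catches 90% of those.

```python
def posterior_ai_given_flag(prevalence: float,
                            true_positive_rate: float,
                            false_positive_rate: float) -> float:
    """Bayes' rule: probability a flagged essay is actually AI-written."""
    flagged_ai = true_positive_rate * prevalence
    flagged_human = false_positive_rate * (1.0 - prevalence)
    return flagged_ai / (flagged_ai + flagged_human)

# Assumed inputs (hypothetical): 10% prevalence, 90% true positive rate.
# The 61% false positive rate is the figure quoted in the text.
p = posterior_ai_given_flag(prevalence=0.10,
                            true_positive_rate=0.90,
                            false_positive_rate=0.61)
print(round(p, 3))  # 0.141
```

Under those assumptions, roughly 6 out of 7 flagged essays in that group were written by the student. Different assumed inputs change the exact number, but not the conclusion that a flag alone is weak evidence.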
Some universities have already started adjusting. The University of Sydney, several UK institutions, and a handful of US liberal arts colleges have issued guidance specifically noting that AI detection results are not considered sufficient evidence for academic misconduct without corroborating factors. That's the right approach. It's not universal yet.
For international students at institutions that haven't made this adjustment: the best protection is procedural. Maintain version histories. Save drafts with timestamps. If you use any AI tool at all — even for translation or grammar checking — document when and how. In a misconduct hearing, documentation of your writing process is far stronger evidence than arguing about whether a detection score is accurate.
The Bigger Picture: Whose Writing Counts as "Normal"?
There's a values question buried inside all the technical detail.
AI detection systems define "human writing" implicitly by what they were trained on. If the training data is predominantly native English speakers in American academic contexts, then "human" gets operationalized as that specific demographic's writing patterns. Everyone who writes differently — by language background, educational tradition, regional English variety, or simply personal style — is measured against a standard that wasn't built with them in mind.
This isn't unique to AI detection. Standardized testing, automated essay scoring, grammar checkers calibrated to American English — the same dynamic shows up across writing assessment tools. AI detection just makes it more consequential because the output is a quasi-accusation rather than a score on a scale.
What would a genuinely inclusive approach look like? It would require training data that represents a wide range of writing backgrounds, not just native-speaker English. It would require language-specific calibration rather than one-size-fits-all models. And it would require honest communication to institutions about where the models are and aren't reliable — not just a score presented as if it means the same thing for every writer.
That's not the industry we have today. It might be the industry we get in three to five years, as the demographic consequences of current practices become harder to ignore.
Until then, the gap between "what these tools claim to detect" and "what they actually reliably detect" is widest for the students who are most vulnerable to the consequences of a false accusation.