AI-detection programs discriminate against non-native English speakers, claims study
Programs used to detect whether an essay or application was generated using artificial intelligence (AI) often flag content written by non-native speakers of English, a study by researchers at Stanford University has found. This can have a serious impact on individuals' futures, The Guardian reported.
The explosion in the use of artificial intelligence (AI)-powered platforms to generate content has led to a corresponding rise in tools that detect their use. While students got away with submitting AI-generated content or code in the early days of ChatGPT, colleges have moved quickly to flag such submissions.
If plagiarism checks were a staple of academia in the last decade, checks for AI-generated content look set to dominate this one, and they are already being used to vet admissions essays and even job applications.
Where popular text detectors fail
Colleges are now equipped with tools that claim to detect whether the content a student submits is AI-generated. Although many of these tools are still a work in progress, they claim an accuracy of 99 percent, a figure the Stanford researchers suggest is "misleading".
Interestingly, these tools also deploy AI to flag content, and, as with other things AI does, they too can carry a bias.
James Zou, an assistant professor of biomedical data science at Stanford University, tested 91 essays against seven of the most popular detectors used in colleges today. All of the essays were written by non-native English speakers for the Test of English as a Foreign Language (TOEFL), a widely recognized English language proficiency test.
More than half of the essays were flagged as AI-generated by these programs, with one even marking 98 percent of them as written by a bot. The researchers then ran essays written by eighth-graders in the US, who are native English speakers, through the same tools; more than 90 percent of those were classified as written by humans.
How do AI detection tools evaluate content?
The researchers then looked into why the AI detection tools were so biased against non-native speakers. They found that the discrimination stemmed from a property of the text called "text perplexity": a measure of how "surprised" or "confused" a generative AI model is when predicting the next word in a sentence.

If the model can predict the next word easily, the text's perplexity is low; if the word is hard to predict, the perplexity is high. Large language models (LLMs) such as ChatGPT tend to churn out low-perplexity text, so detectors use low perplexity as a signal that content is AI-generated.
Since non-native speakers are more likely to use common words and more familiar sentence patterns in their writing, their content also tends to have low perplexity and is therefore more likely to be flagged as bot-generated.
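To make the perplexity idea concrete, the sketch below shows one way such a score could be computed, using the open GPT-2 model from the Hugging Face transformers library. This is an illustrative assumption, not the method used by the study or by any particular detector, and the threshold value is invented for the example.

```python
# Minimal sketch of a perplexity-based "AI text" check (illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score a passage: lower means the model finds it more predictable."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the same ids as labels makes the model return the average
        # cross-entropy loss of predicting each next token.
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

THRESHOLD = 50.0  # illustrative cutoff, not taken from the study

for sample in [
    "The cat sat on the mat and looked at the dog.",
    "Quixotic zephyrs braid unlikely syntax through the marmalade dusk.",
]:
    score = perplexity(sample)
    verdict = "flagged as AI-like" if score < THRESHOLD else "treated as human-like"
    print(f"{score:8.1f}  {verdict}  |  {sample}")
```

Text built from common words and predictable phrasing yields a lower score than unusual wording, which is exactly why formulaic but entirely human writing can fall below a detector's cutoff.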
The researchers warn that biased detectors could falsely flag applications and assignments as AI-generated and could even marginalize non-native speakers on the internet, since search engines such as Google also use such tools to assess content.
In academic settings, this could even push students to use AI tools to make their writing sound more human, put their career prospects at risk, or harm their psychological well-being.
The research findings were published in the journal Patterns.
Study abstract:
GPT detectors frequently misclassify non-native English writing as AI-generated, raising concerns about fairness and robustness. Addressing the biases in these detectors is crucial to prevent the marginalization of non-native English speakers in evaluative and educational settings and to create a more equitable digital landscape.