Chen says that while content moderation policies from Facebook, Twitter, and others succeeded in filtering out some of the most obvious English-language disinformation, the systems often miss such content when it’s in other languages. That work instead fell to volunteers like her team, who looked for disinformation and were trained to defuse it and minimize its spread. “Those mechanisms meant to catch certain words and stuff don’t necessarily catch that dis- and misinformation when it’s in a different language,” she says.
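To make the gap concrete, here is a toy sketch in Python of the kind of keyword matching Chen describes (our illustration, with hypothetical phrases; no platform publishes its actual rules):

```python
# A naive blocklist tuned to English phrases.
BLOCKLIST = {"stolen election", "fake ballots"}

def is_flagged(post: str) -> bool:
    """Flag a post if it contains any blocklisted English phrase."""
    text = post.lower()
    return any(phrase in text for phrase in BLOCKLIST)

print(is_flagged("They used fake ballots to steal the election"))  # True
print(is_flagged("Usaron boletas falsas para robar la elección"))  # False: same claim in Spanish slips through
```

The identical claim, once translated, sails past a filter that was never given non-English vocabulary, which is exactly the hole volunteer moderators ended up plugging.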
Google’s translation services and technologies such as Translatotron and real-time translation headphones use artificial intelligence to convert between languages. But Xiong finds these tools inadequate for Hmong, a deeply complex language where context is incredibly important. “I think we’ve become really complacent and dependent on advanced systems like Google,” she says. “They claim to be ‘language accessible,’ and then I read it and it says something totally different.”
(A Google spokesperson admitted that smaller languages “pose a more difficult translation task” but said that the company has “invested in research that particularly benefits low-resource language translations,” using machine learning and community feedback.)
All the way down
The challenges of language online go beyond the US—and down, quite literally, to the underlying code. Yudhanjaya Wijeratne is a researcher and data scientist at the Sri Lankan think tank LIRNEasia. In 2018, he started tracking bot networks whose activity on social media encouraged violence against Muslims: in February and March of that year, a string of riots by Sinhalese Buddhists targeted Muslims and mosques in the cities of Ampara and Kandy. His team documented “the hunting logic” of the bots, catalogued hundreds of thousands of Sinhalese social media posts, and took the findings to Twitter and Facebook. “They’d say all sorts of nice and well-meaning things, basically canned statements,” he says. (In a statement, Twitter said it uses human review and automated systems to “apply our rules impartially for all people in the service, regardless of background, ideology, or placement on the political spectrum.”)
When contacted by MIT Technology Review, a Facebook spokesperson said the company commissioned an independent human rights assessment of the platform’s role in the violence in Sri Lanka, which was published in May 2020, and made changes in the wake of the attacks, including hiring dozens of Sinhala- and Tamil-speaking content moderators. “We deployed proactive hate speech detection technology in Sinhala to help us more quickly and effectively identify potentially violating content,” they said.
When the bot behavior continued, Wijeratne grew skeptical of the platitudes. He decided to look at the code libraries and software tools the companies were using, and found that the mechanisms to monitor hate speech in most non-English languages had not yet been built.
“Much of the research, in fact, for a lot of languages like ours has simply not been done yet,” Wijeratne says. “What I can do with three lines of code in Python in English literally took me two years of looking at 28 million words of Sinhala to build the core corpuses, to build the core tools, and then get things up to that level where I could potentially do that level of text analysis.”
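Roughly speaking, the “three lines” for English might look like the snippet below, which leans on NLTK’s VADER, a prebuilt English-only sentiment lexicon (our choice of tool for illustration, not necessarily Wijeratne’s):

```python
# Requires: pip install nltk, then nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()  # loads a lexicon that exists only for English
print(sia.polarity_scores("You people are vermin and should leave."))
# -> a strongly negative "compound" score, with no training required.
# No such ready-made lexicon exists for Sinhala; it has to be built first.
```

The shortcut works only because decades of English corpora and annotation already sit behind that one import.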
After suicide bombers targeted churches in Colombo, the Sri Lankan capital, in April 2019, Wijeratne built a tool to analyze hate speech and misinformation in Sinhala and Tamil. The system, called Watchdog, is a free mobile application that aggregates news and attaches warnings to false stories. The warnings come from volunteers who are trained in fact-checking.
Wijeratne stresses that this work goes far beyond translation.
“Many of the algorithms that we take for granted that are often cited in research, in particular in natural-language processing, show excellent results for English,” he says. “And yet many identical algorithms, even used on languages that are only a few degrees of difference apart—whether they’re West Germanic or from the Romance tree of languages—may return completely different results.”
Natural-language processing is the basis of automated content moderation systems. In 2019, Wijeratne published a paper examining how the accuracy of such systems varies across languages. He argues that the more digital resources that exist for a language, such as data sets and web pages, the better the algorithms can work. Languages from poorer countries or communities are disadvantaged.
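The imbalance shows up in everyday NLP tooling. As a rough sketch (assuming spaCy’s current language coverage, which lists Sinhala only at the experimental, tokenizer-only level):

```python
import spacy

# English: a full pretrained pipeline (tagging, parsing, named entities)
# is one download away: python -m spacy download en_core_web_sm
nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Rioters attacked mosques in Kandy in March 2018.")
print([(ent.text, ent.label_) for ent in doc.ents])  # entities, out of the box

# Sinhala: spacy.blank("si") yields only a bare tokenizer. The tagger,
# parser, and entity models above it would have to be trained on
# annotated corpora that, as Wijeratne found, often do not yet exist.
nlp_si = spacy.blank("si")
```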
“If you’re building, say, the Empire State Building for English, you have the blueprints. You have the materials,” he says. “You have everything on hand and all you have to do is put this stuff together. For every other language, you don’t have the blueprints.
“You have no idea where the concrete is going to come from. You don’t have steel and you don’t have the workers, either. So you’re going to be sitting there tapping away one brick at a time and hoping that maybe your grandson or your granddaughter might complete the project.”
Deep-seated issues
The movement to provide those blueprints is known as language justice, and it is not new. The American Bar Association describes language justice as a “framework” that preserves people’s rights “to communicate, understand, and be understood in the language in which they prefer and feel most articulate and powerful.”