Computers have become incredibly precise at translating spoken words into text messages and sifting through huge troves of information to find answers to complex questions. At least, that is, as long as you speak English or another of the world’s dominant languages.
But try talking to your phone in Yoruba, Igbo, or any number of other widely spoken African languages and you will run into problems that can hamper access to information, commerce, personal communications, customer service, and other advantages of the global technological economy.
“We’re getting to the point where if a machine doesn’t understand your language, it will be like it never existed,” said Vukosi Marivate, head of data science at the University of Pretoria in South Africa, in a call to action at a December virtual gathering of the world’s artificial intelligence researchers.
America’s tech giants don’t have much experience in making their language technology work well outside the wealthiest markets, a problem that has also made it harder for them to spot dangerous fake news on their platforms.
Marivate is part of a coalition of African researchers who have tried to change this. Among their projects, one revealed that machine translation tools were failing to properly translate online COVID-19 surveys from English to several African languages.
“Most people want to be able to interact with the rest of the information superhighway in their local language,” Marivate said in an interview. He is a founding member of Masakhane, a pan-African research project aimed at improving the representation of dozens of languages in the branch of AI known as natural language processing. It is the largest of a number of grassroots language technology projects that have sprung up from the Andes to Sri Lanka.
Tech giants offer their products in many languages, but they don’t always pay attention to the nuances needed for these apps to work in the real world. Part of the problem is that there simply isn’t enough data online in these languages, including scientific and medical terms, for AI systems to learn to understand them effectively.
Google, for example, offended members of the Yoruba community several years ago when its language app mistranslated Esu, a benevolent trickster god, as the devil. Facebook’s linguistic misunderstandings have been linked to political conflicts around the world and to its failure to contain damaging misinformation about COVID-19 vaccines. More mundane translation errors have been turned into joking online memes.
Omolewa Adedipe has become frustrated with trying to share her thoughts on Twitter in the Yoruba language, because her automatically translated tweets usually end up with different meanings.
The 25-year-old content designer once tweeted, “T’Ílù ò bà dùn, T’Ílù ò bà t’òrò. Èyin l’ęmò bí ę şe şé”, which means: “If the land (or the country, in this context) is not peaceful, or happy, you are responsible.” Twitter, however, came up with the translation: “If you are not happy, if you are not happy.”
For complex Nigerian languages like Yoruba, accent marks, which often indicate tone, make all the difference in communication. ‘Ogun’, for example, is a Yoruba word meaning war, but it can also mean a state in Nigeria (Ògùn), the god of iron (Ògún), to stab (Ógún), or twenty or property (Ogún).
“Some biases are deliberate given our history,” said Marivate, who has devoted part of his AI research to the southern African languages Xitsonga and Setswana, which are spoken by members of his family, as well as to the common conversational practice of “code switching” between languages.
“The history of the African continent and of colonized countries in general is that when a language had to be translated, it was very narrowly translated,” he said. “You were not allowed to write a general text in any language because the colonizing country might fear that people would communicate and write books about insurgencies or revolutions. But they would allow religious texts.”
Google and Microsoft are among the companies that say they are trying to improve the technology for so-called “low-resource” languages, for which AI systems do not have enough data. Computer scientists at Meta, the company formerly known as Facebook, announced in November a breakthrough on the path to a “universal translator” capable of translating multiple languages at once and working better with low-resource languages such as Icelandic or Hausa.
This is an important step, but for now only big tech companies and big AI labs in developed countries can build these models, said David Ifeoluwa Adelani. He is a researcher at Saarland University in Germany and another member of Masakhane, whose mission is to strengthen and stimulate African-led research to address technology “that does not include our names, our cultures, our places, our history.”
Improving systems requires not only more data, but also careful human scrutiny from native speakers who are under-represented in the global tech workforce. It also requires a level of computing power that can be difficult to access for independent researchers.
Writer and linguist Kola Tubosun created a multimedia dictionary for the Yoruba language and also built a text-to-speech machine for it. He is now working on similar speech recognition technologies for Nigeria’s other two main languages, Hausa and Igbo, to help people who want to write short sentences and passages.
“We are financing ourselves,” he said. “The point is to show that these things can be profitable.”
Tubosun led the team that created Google’s “Nigerian English” voice and accent used in tools like maps. But he said it remains difficult to raise the funds to develop technology that could allow a farmer to use a voice tool to track market or weather trends.
In Rwanda, software engineer Remy Muhire is helping create a new open source voice dataset for the Kinyarwanda language that involves many volunteers who record themselves reading newspaper articles and other texts in Kinyarwanda.
“They are native speakers. They understand the language,” said Muhire, a member of Mozilla, maker of the Internet browser Firefox. Part of the project involves collaboration with a government-backed smartphone app that answers questions about COVID-19. To improve AI systems in various African languages, Masakhane researchers are also tapping into information sources across the continent, including Voice of America’s Hausa service and the BBC’s Igbo broadcast.
More and more people are coming together to develop their own language approaches instead of waiting for elite institutions to solve problems, said Damián Blasi, who studies linguistic diversity at the Harvard Data Science Initiative.
Blasi is the co-author of a recent study that analyzed the uneven development of language technology across more than 6,000 languages around the world. For example, the study found that although Dutch and Swahili both have tens of millions of speakers, there are hundreds of scientific reports on natural language processing in the Western European language and only about 20 in the East African language.