
How to teach an AI 200 languages

Meta's translation programme NLLB can handle over 200 languages - significantly more than DeepL or Google Translate. The researchers now explain how this was possible.
If you want to translate between languages, you no longer have to rely on tedious word-for-word translation. AI translation programmes such as DeepL or Google Translate can render whole passages of text from one language into another in no time at all - provided the language in question is widely spoken in the global North, such as English, French or German. If, on the other hand, you want to translate the Bantu language Luganda, which is spoken mainly in southern Uganda, you will usually run into problems: because only a small amount of digital content exists in this language, it is very difficult to train an AI on it. In 2022, however, Meta released the open-source translation programme NLLB (No Language Left Behind), which can handle 204 languages, including 150 resource-poor ones such as Luganda. On 5 June 2024, the Meta team explained in the scientific journal "Nature" how this feat was achieved.
Beyond the lack of text sources for resource-poor languages, there is another major difficulty in building a comprehensive AI translation programme: if you train such an algorithm on as many languages as possible, overall quality usually suffers. A programme that is otherwise very good at translating between German and English, for example, typically performs worse once it also has to master 40 other languages. To prevent this loss of performance, the models usually have to be enlarged - which in turn means significantly more training effort and longer runtimes.
To escape this "curse of multilingualism", the Meta team divided the NLLB language model into many smaller AI models, each of which is particularly good at one task. One model, for example, serves the Benue-Congo languages common in sub-Saharan Africa, while another focusses on languages that share a script; yet another might specialise in idioms. Routing inputs to these separate expert models prevents the quality losses that a large number of languages would otherwise cause; a minimal sketch of the routing idea follows below.
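In the NLLB paper this idea appears as a sparsely gated mixture-of-experts architecture: rather than literally separate programmes, a small "router" network inside the model sends each token to a few specialised expert sub-networks. The PyTorch sketch below is a minimal illustration of that routing principle, not Meta's code; the class name MoELayer, the layer sizes and the top-2 routing rule are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparsely gated mixture-of-experts layer: a router sends each
    token to its top-k expert networks, so total capacity grows
    without every input paying for every parameter."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        # Score all experts per token, keep only the top-k.
        gate_logits = self.router(x)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        # Each token's output is a weighted sum of its chosen experts.
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)  # 16 toy token embeddings
layer = MoELayer()
print(layer(tokens).shape)     # torch.Size([16, 512])

The appeal of the design is that adding experts enlarges the model's capacity while each token only pays the compute cost of its two chosen experts.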
A comprehensive data set
One of the most important components of the NLLB model, however, is the data set: "Flores-200" is accessible to everyone and comprises 204 different languages. The language model was trained on three different types of data. First, the researchers collected publicly accessible texts from the internet, as well as 6,000 hand-picked example sentences in 39 languages with extremely few resources. In addition, they used sentences together with their corresponding translations that are available in web archives. Using this data, they trained an algorithm to assign each sentence high-dimensional coordinates (an embedding) such that sentences with the same meaning in different languages (such as "Ich mag Wissenschaft", "I like science" and "j'aime la science") end up close to each other. In this way, the experts were able to generate sentence pairs with the same meaning in different languages on which to train their large AI model; a sketch of this mining step follows below.
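The NLLB team's mining pipeline builds on the multilingual LASER family of sentence encoders; the Python sketch below illustrates only the general principle, with random stand-in vectors in place of real encoder output. The function name mine_sentence_pairs, the 0.8 similarity threshold and the toy data are assumptions for illustration, not part of NLLB.

import numpy as np

def mine_sentence_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Pair up sentences whose multilingual embeddings point in
    nearly the same direction (high cosine similarity)."""
    # Normalise rows so a plain dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n_src, n_tgt) similarity matrix
    best = sims.argmax(axis=1)         # closest target for each source
    return [
        (i, j, sims[i, j])
        for i, j in enumerate(best)
        if sims[i, j] >= threshold
    ]

# Toy stand-ins for encoder output; a real system would embed
# e.g. "Ich mag Wissenschaft" and "I like science" with a
# multilingual encoder such as LASER.
rng = np.random.default_rng(0)
src_vecs = rng.normal(size=(5, 16))
tgt_vecs = src_vecs + rng.normal(scale=0.1, size=(5, 16))  # near-duplicates
for i, j, s in mine_sentence_pairs(src_vecs, tgt_vecs):
    print(f"source {i} -> target {j} (cosine {s:.2f})")

Only pairs above the similarity threshold survive, which is what lets the approach harvest translation pairs from web text that was never aligned by hand.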
The NLLB translation programme has now been in use for two years. "It provides translations of reasonable quality in several low-resource languages," writes computer scientist David I. Adelani from University College London, who is not part of the Meta team, in an accompanying "Nature" article. "However, the quality of these translations is still significantly worse than that of languages with many resources," such as German or French. To change this, future language models could draw on grammars and dictionaries to improve their understanding of language, as studies published in March 2024 suggest. Even so, it will probably be a long time before a translation programme masters all of the world's roughly 7,000 languages.
Original article on Spektrum.de
