Content translation/Machine Translation/NLLB-200

From mediawiki.org

NLLB-200

Machine translation support for Content Translation has now been further extended. In addition to Apertium, Yandex, Google, Youdao, and others, we are now adding NLLB-200 to the list of machine translation (MT) systems available for users of Content Translation. This will result in additional machine translation support for a few languages.

NLLB-200 is provided by an AI research team at Meta. This machine translation API was created by this team to make their research available for the translation of Wikipedia articles in a small set of languages for evaluation purposes. The AI research team at Meta released the translation models used by NLLB-200 with an open source license.

Various teams at the Wikimedia Foundation have collaborated with the team supporting NLLB-200 to work out an agreement that allows the use of NLLB-200 in a way that is consistent with Wikipedia's policies on attribution of rights, privacy of our users, and brand representation. NLLB-200 is based on open source models. For ease of use, we access the service hosted by that team through its machine translation API. More details of the agreement are given below, and we are happy to hear any questions you may have about this service.

Key features

  • No nonpublic personal information of users is sent to NLLB-200. The MT system is accessed via an API key. Article content (freely licensed) is sent to the NLLB-200 server from Wikimedia Foundation servers. No direct communication takes place between the user and external services, and no nonpublic personal information of users (IP, username) is sent to the NLLB-200 service. The client contacting NLLB-200 is open source and you can check it here. No part of the NLLB-200 service or code will be part of the Wikimedia infrastructure or the Content Translation codebase. Please also see a diagram of this technical setup at the end of the section.
  • Information is returned from NLLB-200 under a free license. When NLLB-200 is used, a translated version of Wikipedia content under the same free license is obtained. Users can modify it and publish it as part of Wikipedia without conflicts with existing policies. The resulting content translated by NLLB-200 and the user modifications will be available under the same license that is used for the rest of the articles in Wikipedia.
  • Benefits the wider open source translation community. Translations obtained from NLLB-200 and user modifications will be publicly available. The post-edited translations are of special interest for the translation research community (including the NLLB-200 team) who can use this resource to create new translation services to support languages for which open source machine translation is not available yet. This will help developers create and improve machine translation systems.
  • Users can disable it. Automatic translation is an optional tool in Content Translation. Users can disable it if they do not find it useful. Although many Content Translation users have requested translation services, each individual user ultimately decides whether to use them or not.
Communication diagram of MT Client

Summary of our agreement with Meta

Meta’s obligations

  • Provide an API key to the Wikimedia Foundation that allows volunteers on Wikimedia sites to translate articles

Wikimedia Foundation’s obligations

  • To provide the volunteer-edited versions of the text translated by the translation tool to improve the functionality of the NLLB-200 tool
  • No personal data of translators will be shared
  • Just the original content to translate, its language, and the translation target language will be sent in the request to Meta

Important notes

  • All content will remain licensed under CC BY-SA 3.0
  • Meta is not requiring any "branding" on Wikimedia Sites, but NLLB-200 may be listed as a translation tool option in the translation interface drop-down menu
  • There is no exchange of personal information of volunteers
  • Agreement is governed by U.S. law
  • The translations published by translators, with or without the help of machine translation services, will be provided in the form of parallel corpora by the Content Translation APIs. These APIs will be developed incrementally and results will be freely available for everyone, not just Meta

Questions about this service

We have addressed some immediate questions about NLLB-200 in this section. These answers are also available on the Content Translation FAQ page.

What languages are being handled by NLLB-200? Are there plans to add more?

NLLB-200 can be used to translate English, Spanish and French content into 23 languages: Assamese, Asturian, Aymara, Bashkir, Cantonese, Chinese, Hausa, Icelandic, Igbo, Iloko, Kongo, Lingala, Luganda, Northern Sotho, Occitan, Oromo, Sorani, Swati, Tigrinya (supported only from French and Spanish), Tsonga, Tswana, Wolof (supported only from English and French), and Zulu. In addition, a few specific combinations are supported when translating from Catalan and Portuguese (to Occitan), Chinese (to Cantonese) and Russian (to Bashkir). Based on community input and data analysis, support for more languages can be added in the future.
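The language pairs listed above can be written down as a small lookup table. The following is only an illustrative sketch transcribed from that paragraph: it is keyed by language name rather than by any language code, and the names `GENERAL_SOURCES`, `RESTRICTED_SOURCES`, `EXTRA_SOURCES` and `is_supported` are hypothetical, not part of Content Translation.

```python
# Illustrative only: pairs transcribed from the paragraph above, keyed by language name.
GENERAL_SOURCES = {"English", "Spanish", "French"}

TARGETS = {
    "Assamese", "Asturian", "Aymara", "Bashkir", "Cantonese", "Chinese",
    "Hausa", "Icelandic", "Igbo", "Iloko", "Kongo", "Lingala", "Luganda",
    "Northern Sotho", "Occitan", "Oromo", "Sorani", "Swati", "Tigrinya",
    "Tsonga", "Tswana", "Wolof", "Zulu",
}

# Targets that accept only a subset of the general source languages.
RESTRICTED_SOURCES = {
    "Tigrinya": {"French", "Spanish"},
    "Wolof": {"English", "French"},
}

# Additional source languages that are available for specific targets only.
EXTRA_SOURCES = {
    "Occitan": {"Catalan", "Portuguese"},
    "Cantonese": {"Chinese"},
    "Bashkir": {"Russian"},
}


def is_supported(source: str, target: str) -> bool:
    """Return True if the paragraph above lists source -> target as supported."""
    if target not in TARGETS:
        return False
    allowed = RESTRICTED_SOURCES.get(target, GENERAL_SOURCES)
    return source in allowed or source in EXTRA_SOURCES.get(target, set())
```

For example, `is_supported("English", "Tigrinya")` is false because Tigrinya is listed as supported only from French and Spanish, while `is_supported("Chinese", "Cantonese")` is true because of the extra combination.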

How is using NLLB-200 different than using Apertium or others?

As a user of Content Translation, you will not notice any difference in the translation interface: NLLB-200 displays the translated content in the same way Apertium or other services currently do for the supported language pairs. Different services provide different levels of translation quality depending on the language and the specific content. You can switch among the available services to find the one that provides the best initial translation for a given paragraph.

How is the machine translation being done if I choose NLLB-200?

NLLB-200 is available through an API key that allows access to their translation system. Content Translation also uses that unique API key to access NLLB-200. When a user starts translating an article, the HTML content of each section of the source article is sent to NLLB-200 and a translated version is obtained and displayed on the respective translation column of Content Translation. Links and references are adapted as usual and users can modify the content as required.

This process continues for all the sections of the article being translated. For better performance, the translations for consecutive sections are pre-fetched. The user can save the unpublished translation (to work on it again at a later time), revise, or publish the article in the usual manner. The article is published on Wikipedia like any other normal article with appropriate attribution and licenses.
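The section-by-section flow with pre-fetching described above can be sketched roughly as follows. This is a minimal illustration and not the actual Content Translation code: the class `SectionTranslator` is hypothetical, and `translate` is a placeholder callable standing in for whatever client performs the request to the MT service.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class SectionTranslator:
    """Translate sections on demand and fetch the next one in the background."""

    def __init__(self, sections: List[str], translate: Callable[[str], str]):
        self._sections = sections
        self._translate = translate  # hypothetical stand-in for the MT client call
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[int, Future] = {}

    def _schedule(self, index: int) -> None:
        # Start a request unless the index is out of range or was already requested.
        if 0 <= index < len(self._sections) and index not in self._pending:
            self._pending[index] = self._pool.submit(
                self._translate, self._sections[index]
            )

    def get(self, index: int) -> str:
        """Return the translation of one section and pre-fetch the following one."""
        if not 0 <= index < len(self._sections):
            raise IndexError(index)
        self._schedule(index)
        self._schedule(index + 1)  # pre-fetch the consecutive section
        return self._pending[index].result()

    def close(self) -> None:
        """Wait for outstanding requests and release the worker thread."""
        self._pool.shutdown(wait=True)
```

Because each section is requested at most once, asking for section 1 after section 0 reuses the request that was already started in the background instead of sending a second one.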

Here’s a diagram of the process.

Is NLLB-200 based on open source software?

The AI research team at Meta released the translation models used by NLLB-200 with an open source license, as part of the No Language Left Behind project. The translation models used by NLLB-200 are made available through an API.

Content Translation evolved from a long-standing need to bridge the gap in the amount of content between Wikipedias in different languages. Like all other software used on Wikimedia sites, Content Translation is open source. In this particular case as well, we are using an open source client to interact with the external service and import freely licensed content in order to help users expand our free knowledge. NLLB-200 currently runs on external servers to facilitate the integration, but this external dependency may not be required in the future once the models are publicly available.

To use NLLB-200 we are not adding any proprietary software to the Content Translation code or to the Wikimedia websites and servers. The service is free of charge as part of Meta’s offering to the Wikimedia Foundation. Only the freely available Wikipedia article content (in segments) is sent to the NLLB-200 service, and the translated content obtained is freely usable on Wikipedia pages. The translated content can be modified by users, and this data is also publicly available under a free license through the Content Translation API. This is a valuable resource for the community to develop open source translation services for those languages where they do not exist yet. After studying the implications carefully, we found that the fact that the content was previously stored in a closed source service does not limit the freedom of our knowledge or our software in the present or the future. We have taken special care to make sure that the content provided is freely licensed and complies with Wikipedia policies. This included a long process of legal and technical evaluation and compliance. The summary of our agreement is also available above.

From user feedback, we have seen that machine translation support is really helpful for users and we want to support all languages in the best way. Guided by the principles of Wikimedia Foundation's resolution to support free and open source software, we will prioritize the integration of open source services whenever they are available for a language. Apertium has been a critical part of Content Translation since its inception, but currently, it only provides machine translations for about 30 of the numerous possible language combinations that Wikipedia can support.

Should I be worried about my personal information when using NLLB-200?

Irrespective of the service being used, you can be sure that only Wikipedia content from existing articles is sent and only freely licensed content will be added back to the translation. Communication with those services happens at the server side, so they are isolated from the user device and they have no access to nonpublic personal information of users. Please refer to this diagram for more details.
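As a rough illustration of this server-side isolation, the sketch below builds an outbound request body that carries only the three items named in the agreement summary: the content to translate, its language, and the target language. The `EditorRequest` shape, the function `build_mt_payload` and the payload key names are all assumptions made for this example; they do not describe the real client or the real NLLB-200 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditorRequest:
    """What the server may know about an incoming request (hypothetical shape)."""
    username: str
    ip_address: str
    source_language: str
    target_language: str
    section_html: str


def build_mt_payload(request: EditorRequest) -> dict:
    """Copy only content and language information into the outbound payload.

    The username and IP address are deliberately never read here, so they
    cannot end up in what is sent to the external service.
    """
    return {
        "content": request.section_html,
        "source_language": request.source_language,
        "target_language": request.target_language,
    }
```

Building the payload from an explicit list of allowed fields, rather than removing sensitive fields from a copy of the request, means that any field added to the request later is excluded by default.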

What if NLLB-200 is the only machine translation tool available and I don't want to use it?

Machine Translation is an optional feature in Content Translation that you can easily disable at will. If more machine translation systems are added for your languages, you can choose to enable MT again and select the MT service of your choice.

Will the content translated by NLLB-200 be free for use in Wikipedia?

Yes. The content received from NLLB-200 is otherwise freely available on the web translation platform. For ease of use Content Translation receives it via an API key to make it seamlessly available on the translation interface. This content can be modified by the users (if necessary) and used in Wikipedia articles under free licenses.

Can this content be used for improving machine translation systems in general?

Yes. Translations made in Content Translation are saved in our database. This information will be made publicly available for anyone to use as translation examples to improve their translation services (university research groups, open source projects, commercial companies, anyone!). The content can be accessed via the Content Translation API. Please note that only information related to the translated text is publicly available. This includes the source and translated text, the source and target languages, and an identifier for the segment of text.
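To make the shape of such translation examples concrete, here is a minimal sketch of one record holding exactly the pieces of information listed above, together with a helper that writes records as one JSON object per line. The class `TranslationUnit`, its field names and the JSON Lines layout are assumptions for illustration only; the actual format returned by the Content Translation API may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class TranslationUnit:
    """One aligned segment of a parallel corpus (hypothetical field names)."""
    segment_id: str
    source_language: str
    target_language: str
    source_text: str
    translated_text: str


def to_json_lines(units: Iterable[TranslationUnit]) -> str:
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(
        json.dumps(asdict(unit), ensure_ascii=False, sort_keys=True)
        for unit in units
    )
```

Note that the record has no field for the translator's identity, matching the statement that only information related to the translated text is published.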