Content translation/Machine Translation/MT Clients

Content translation accesses machine translation (MT) services through client modules. Clients for Apertium and Yandex already exist in the code. Any number of such MT service clients can be added and mapped to language pairs. This page explains the MT client architecture.

Technical requirements


A new MT client can connect to a locally hosted machine translation system or to a remote one accessed through an API. API-based services are recommended because they keep the MT system isolated as a separate service. If the system is freely licensed and well packaged for Linux distributions, we can consider hosting it in the Wikimedia cluster. For example, Apertium is hosted inside wmflabs, whereas Yandex is not hosted by Wikimedia. Both Apertium and Yandex are accessed through their web APIs.

Translation API
A machine translation API takes a source language, a target language and source content, and outputs the translated content.
 * If the API is not public, it can accept an authentication token, usually a key.
 * The output format can be JSON for convenience.
 * The API should accept POST requests.
 * The API should not demand any user-identifiable information such as a user name. CXServer does not provide it to the MT client.
 * The API should be capable of accepting a reasonable number of requests per minute.
 * The API should accept a reasonable amount of content per request.
 * A dashboard for analysing API usage is recommended, covering requests per day/week/month and the number of characters translated per day/week/month.

The API must be publicly documented, including its error codes.
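As a rough illustration of the requirements above, the following sketch builds a POST request for a hypothetical JSON-based MT service. The endpoint URL, the field names (source, target, content, key) and the function name are assumptions made for this example and do not describe any real service.

```javascript
// Hypothetical endpoint; a real service documents its own URL.
const BABELFISH_URL = 'https://babelfish.example/api/translate';

/**
 * Build the options for a translation request.
 * Only languages, content and an optional key are sent;
 * no user-identifiable information is included.
 */
function buildTranslationRequest(sourceLang, targetLang, content, apiKey) {
  const body = { source: sourceLang, target: targetLang, content: content };
  if (apiKey) {
    // Only needed when the API is not public.
    body.key = apiKey;
  }
  return {
    url: BABELFISH_URL,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  };
}

const request = buildTranslationRequest('en', 'es', '<p>Hello</p>', 'secret-key');
console.log(request.method, request.url);
```

The returned object can be handed to whichever HTTP library the client uses; keeping request construction separate from the network call also makes it easy to test.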

Guidelines of performance
Content translation is still a beta feature, available only to logged-in users who opt in, so the current usage pattern may not be a reliable predictor of future usage. Moreover, as we expand machine translation to more languages, there will be more users and more requests. Some baselines based on our current usage are given below. Note that these are not a final assessment; APIs must be designed to accept more than this.


 * At least 10,000 requests per day
 * At least 10 million characters per day
 * At least 5000 characters per request
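The baselines above can be compared against a service's published limits with a small helper. This is only a sketch: the property names and the findShortfalls function are made up for the example, and the limits passed in are illustrative values, not measurements of any real service.

```javascript
// Baselines taken from the list above.
const BASELINE = {
  requestsPerDay: 10000,
  charactersPerDay: 10000000,
  charactersPerRequest: 5000
};

// Return the names of the limits that are missing or below the baseline.
function findShortfalls(limits) {
  return Object.keys(BASELINE).filter((name) => !(limits[name] >= BASELINE[name]));
}

// Illustrative limits for an imaginary service.
const shortfalls = findShortfalls({
  requestsPerDay: 50000,
  charactersPerDay: 2000000,
  charactersPerRequest: 10000
});
console.log(shortfalls); // prints [ 'charactersPerDay' ]
```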

Input format
The content that CX sends for translation is HTML. Translating HTML while preserving markup is challenging, but some MT engines are capable of it (for example, Yandex). Apertium does not handle HTML markup. Depending on the engine's capability, CX can send either a plain text version or the HTML of the content.
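The choice between the two input formats can be sketched as follows. The supportsHtml flag and both function names are hypothetical, and the regular-expression tag stripper is deliberately naive: it only shows the idea, and is not how cxserver converts HTML to text. A real client needs a proper HTML parser and a way to restore the markup after translation.

```javascript
// Naive tag stripper, for illustration only.
function toPlainText(html) {
  return html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Pick the input format according to the engine's (assumed) capability flag.
function prepareContent(html, engine) {
  return engine.supportsHtml
    ? { format: 'html', content: html }
    : { format: 'text', content: toPlainText(html) };
}

const html = '<p>Hello <b>world</b></p>';
console.log(prepareContent(html, { supportsHtml: true }).format);   // prints "html"
console.log(prepareContent(html, { supportsHtml: false }).content); // prints "Hello world"
```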

Quality of translation
We evaluate the quality of MT by requesting feedback from Wikipedia contributors in the language concerned. CX uses MT as an initial translation template and encourages translators to improve it. Because of that, we can use an MT engine unless the feedback we receive shows that its quality is quite bad.

Developing a new MT Client Module
The best way to learn this is to refer to an existing client module such as Yandex or Apertium. The client modules are in [https://phabricator.wikimedia.org/diffusion/GCXS/browse/master/lib/mt/ cxserver's] lib/mt folder. Let us call our client the BabelFish MT client. Create a file named BabelFish.js in the lib/mt folder.

If your BabelFish service cannot translate HTML while keeping all markup in the appropriate position in the translation, your client has to implement the plain-text translation method instead of the HTML one. Refer to Apertium.js for such an example. Yandex.js is an example of an MT client that can handle both HTML and text content.
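A minimal sketch of what a text-only BabelFish.js might look like is given below. It is not copied from cxserver: the MTClient stand-in class, the translateText and buildPayload method names, the conf.send transport function and the shape of the response are all assumptions made so that the example runs on its own. Check Apertium.js and Yandex.js for the interface the real clients implement.

```javascript
// Stand-in base class so this sketch runs on its own. In cxserver the client
// would extend the project's own base class, whose interface may differ.
class MTClient {
  constructor(options) {
    this.conf = options.conf;
  }
}

class BabelFish extends MTClient {
  // Build the request payload for the hypothetical BabelFish service.
  buildPayload(sourceLang, targetLang, content) {
    return { source: sourceLang, target: targetLang, content: content, key: this.conf.key };
  }

  // Translate plain text. `conf.send` is a hypothetical transport function
  // (for example a wrapper around an HTTP POST) that resolves to an object
  // with a `translation` property.
  async translateText(sourceLang, targetLang, sourceText) {
    const response = await this.conf.send(this.buildPayload(sourceLang, targetLang, sourceText));
    if (!response || typeof response.translation !== 'string') {
      throw new Error('BabelFish: unexpected response');
    }
    return response.translation;
  }
}

// Usage with a stubbed transport instead of a real network call.
const client = new BabelFish({
  conf: {
    key: 'secret-key',
    send: async (payload) => ({ translation: '[' + payload.target + '] ' + payload.content })
  }
});

client.translateText('en', 'es', 'Hello').then((translation) => {
  console.log(translation); // prints "[es] Hello"
});
```

Injecting the transport through the configuration is a design choice for this sketch: it lets the client be exercised in unit tests without contacting the MT service.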

You need to add an entry in lib/mt/index.js for your new client.

To map a language pair to this client, create a config file in the config folder. You may refer to the existing configuration files for examples. Then enable this MT engine in the cxserver config.yaml, again following the existing entries as examples.
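Conceptually, a language pair mapping answers the question "can this client translate from language A to language B?". The sketch below models that lookup in JavaScript. The data shape (source language mapped to a list of target languages), the language codes and the supportsPair function are assumptions for illustration; the format of the real configuration files should be taken from the existing ones.

```javascript
// Assumed shape: source language -> list of supported target languages.
const babelFishPairs = {
  en: ['es', 'fr'],
  es: ['en']
};

// Check whether the mapping covers a given language pair.
function supportsPair(pairs, sourceLang, targetLang) {
  const targets = pairs[sourceLang];
  return Array.isArray(targets) && targets.includes(targetLang);
}

console.log(supportsPair(babelFishPairs, 'en', 'fr')); // prints true
console.log(supportsPair(babelFishPairs, 'fr', 'en')); // prints false
```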

Restart cxserver and test your client. The existing unit tests for Apertium are a useful starting point for writing your own tests.

Machine translation clients
The following are machine translation clients that support Content Translation in different languages:
 *  ( languages supported )
 * OpusMT ( languages supported )
 *  ( languages supported )
 *  ( languages supported )
 *  ( languages supported )
 *  ( languages supported )
 * Elia ( formerly known as Matxin ) ( languages supported )
 * NLLB-200 ( formerly known as Flores ) ( languages supported )