Content translation/Machine Translation/MT Clients/fr

Content translation accesses machine translation (MT) services through client modules. Clients for Apertium and Yandex already exist in the code. Any number of such MT service clients can be added and mapped to language pairs. This page explains the MT client architecture.



Technical requirements
A new MT client can wrap either a locally hosted machine translation system or a remote one accessed through an API. API-based services are recommended because they can be isolated as a separate service. If the system is freely licensed and well packaged for Linux distributions, hosting it in the Wikimedia cluster can be considered. For example, Apertium is hosted inside wmflabs, whereas Yandex is not hosted by Wikimedia. Both Apertium and Yandex are accessed through their web APIs.



Translation API
A machine translation API takes a source language, a target language, and source content, and returns the translated content. The API must be publicly documented, including its error codes.
 * If the API is not public, it can accept an authentication token, typically a key.
 * The output format can be JSON, for convenience.
 * The API must accept POST.
 * The API should not require any user-identifiable information such as a user name. CXServer does not provide it to the MT client.
 * The API must be able to accept a reasonable number of requests per minute.
 * The API should accept a reasonable amount of content per request.
 * A dashboard for analysing API usage is recommended, including requests per day/week/month and number of characters translated per day/week/month.
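
The request shape described above can be sketched as follows. This is only an illustration: the endpoint URL, the JSON field names (`source`, `target`, `content`) and the `Authorization` header scheme are assumptions, since each real MT service defines its own.

```javascript
// Hypothetical endpoint, used only for illustration.
const BABELFISH_URL = 'https://babelfish.example/translate';

/**
 * Describe a POST request for a hypothetical MT API.
 * No user-identifiable information is included, only the
 * language pair, the content and an optional API key.
 */
function buildTranslationRequest(sourceLang, targetLang, content, apiKey) {
  const headers = { 'Content-Type': 'application/json' };
  if (apiKey) {
    // Assumed scheme: many services pass a key as a bearer token.
    headers.Authorization = `Bearer ${apiKey}`;
  }
  return {
    url: BABELFISH_URL,
    options: {
      method: 'POST',
      headers,
      body: JSON.stringify({ source: sourceLang, target: targetLang, content })
    }
  };
}

const request = buildTranslationRequest('en', 'fr', '<p>Hello</p>', 'secret-key');
console.log(request.options.method, request.url);
```

On Node.js 18 or later, the result could be passed to the built-in `fetch(request.url, request.options)` and the JSON response parsed from there.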



Performance guidelines
Content translation is still a beta feature, available only to logged-in users who opt in, so the current usage pattern may not predict future usage. Moreover, as machine translation is extended to more languages, there will be more users and requests. Some baselines based on current usage are given below. They are not a final assessment; APIs must be designed to accept more than this.


 * At least 10,000 requests per day
 * At least 10 million characters per day
 * At least 5,000 characters per request
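
As a quick way to compare a candidate service against these baselines, the check below lists which of the three figures a service falls short of. The property names and the sample limits are made up for illustration; only the three baseline numbers come from the list above.

```javascript
// Baselines taken from the list above.
const BASELINE = {
  requestsPerDay: 10000,
  charactersPerDay: 10000000,
  charactersPerRequest: 5000
};

/**
 * Return the names of the baselines that the given limits do not meet.
 * A missing or non-numeric limit counts as a shortfall.
 */
function findShortfalls(limits) {
  return Object.keys(BASELINE).filter((key) => !(limits[key] >= BASELINE[key]));
}

// Hypothetical limits published by some MT service.
const candidate = {
  requestsPerDay: 50000,
  charactersPerDay: 2000000,
  charactersPerRequest: 10000
};

console.log(findShortfalls(candidate)); // [ 'charactersPerDay' ]
```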



Input format
The content that CX sends for translation is HTML. Translating HTML while preserving markup is challenging, but some MT engines can do it (for example, Yandex). Apertium does not handle HTML markup. Depending on the engine's capability, CX can send either a plain-text version or the HTML of the content.
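
The choice between HTML and plain text can be pictured with the sketch below. It is not how CX itself prepares content: the `supportsHtml` flag is a hypothetical capability marker, and the tag stripping is deliberately naive (it does not decode entities or handle every HTML edge case).

```javascript
/**
 * Naive HTML-to-text conversion for illustration only:
 * replaces tags with spaces and collapses whitespace.
 */
function htmlToPlainText(html) {
  return html
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

/**
 * Pick the input format according to what the engine can handle.
 * `engine.supportsHtml` is an assumed flag, not a cxserver setting.
 */
function prepareContent(html, engine) {
  if (engine.supportsHtml) {
    return { format: 'html', content: html };
  }
  return { format: 'text', content: htmlToPlainText(html) };
}

const source = '<p>Hello <b>world</b></p>';
console.log(prepareContent(source, { supportsHtml: true }));
console.log(prepareContent(source, { supportsHtml: false }));
```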



Translation quality
We evaluate MT quality by requesting feedback from Wikipedia contributors in the language concerned. CX uses MT as an initial translation template and encourages translators to improve it. For that reason, an engine can be used unless the feedback indicates that its quality is quite bad.



Developing a new MT client module
The best way to learn this is to refer to an existing client module such as Yandex or Apertium. Client modules live in the lib/mt folder of cxserver. Let us call our client the BabelFish MT client. Create a file named BabelFish.js in the lib/mt folder. If your BabelFish service cannot translate HTML while keeping all markup in the appropriate position in the translation, you will have to write a text-based translation method in that file instead of an HTML-based one. Refer to Apertium.js for such an example. Yandex.js is an example of an MT client that can handle both HTML and text content.
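
A rough idea of what a text-only BabelFish.js could contain is sketched below. Everything in it is an assumption made for illustration: the method name `translateText`, the constructor options, the payload fields and the `translation` response field are hypothetical, and the real base class and method names should be taken from Apertium.js and Yandex.js. The network call is injected as a `send` function so the sketch runs without a real service.

```javascript
// Hypothetical client sketch; names and shapes are not taken from cxserver.
class BabelFish {
  /**
   * @param {Object} options
   * @param {string} options.apiUrl Endpoint of the (hypothetical) service
   * @param {string} [options.apiKey] Optional authentication key
   * @param {Function} options.send async (url, payload) => parsed JSON reply
   */
  constructor(options) {
    this.apiUrl = options.apiUrl;
    this.apiKey = options.apiKey;
    this.send = options.send;
  }

  /**
   * Translate plain text; resolves to the translated string.
   */
  async translateText(sourceLang, targetLang, sourceText) {
    const payload = { source: sourceLang, target: targetLang, content: sourceText };
    if (this.apiKey) {
      payload.key = this.apiKey;
    }
    const result = await this.send(this.apiUrl, payload);
    if (!result || typeof result.translation !== 'string') {
      throw new Error('BabelFish: unexpected response from MT service');
    }
    return result.translation;
  }
}

// Usage with a fake transport that tags the text instead of translating it.
const fakeSend = async (url, payload) => ({
  translation: `[${payload.target}] ${payload.content}`
});
const client = new BabelFish({ apiUrl: 'https://babelfish.example/translate', send: fakeSend });
client.translateText('en', 'fr', 'Hello').then((text) => console.log(text));
```

In a real module the class would be exported so that it can be registered alongside the other clients.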

You need to add an entry in lib/mt/index.js for your new client.

To map a language pair to this client, create a config file in the config folder. You may refer to the existing configuration files for examples. Then enable the MT engine in the cxserver config.yaml, again following the existing entries as examples.
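
Conceptually, the mapping amounts to a lookup from a language pair to the clients that serve it. The sketch below shows that idea only: the object layout, the engine name `OtherEngine` and the language pairs are invented for illustration, and the actual file format is whatever the existing files in the config folder use.

```javascript
// Hypothetical mapping: engine name -> source language -> target languages.
const mtConfig = {
  BabelFish: { en: ['fr', 'es'] },
  OtherEngine: { en: ['fr'], es: ['ca'] }
};

/**
 * List the engines configured for a given source/target pair.
 */
function clientsForPair(config, source, target) {
  return Object.keys(config).filter((name) => {
    const targets = config[name][source];
    return Array.isArray(targets) && targets.includes(target);
  });
}

console.log(clientsForPair(mtConfig, 'en', 'fr')); // [ 'BabelFish', 'OtherEngine' ]
console.log(clientsForPair(mtConfig, 'es', 'ca')); // [ 'OtherEngine' ]
console.log(clientsForPair(mtConfig, 'fr', 'en')); // []
```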

Restart cxserver and test your client. The existing unit tests for Apertium are a useful reference when writing your own tests.



Machine translation clients
The following machine translation clients support Content Translation in different languages:
 *  (supported languages)
 * OpusMT (supported languages)
 *  (supported languages)
 *  (supported languages)
 *  (supported languages)
 *  (supported languages)
 * Elia (formerly known as Matxin) (supported languages)
 * NLLB-200 (formerly known as Flores) (supported languages)