Content translation/Machine Translation/Apertium/Service

Background
With https://gerrit.wikimedia.org/r/#/c/157787 (a proof of concept only) in the Content Translation server configuration, we are introducing our first service: Apertium. It is a critical component of Content Translation (Machine Translation).

Why Apertium as a service?
1. The WMF Language Engineering (LE) team works closely with Apertium developers, reporting bugs and helping to improve Apertium. This requires LE to run Apertium's latest code and fixes, kept up to date and in step with the Apertium instance at apertium.org. This collaboration and communication loop works only if the LE team has full control over the software running on the Apertium service (currently running on Labs at http://apertium.wmflabs.org).

2. The Apertium packages in Ubuntu/Debian are outdated.

3. Updating them in Ubuntu/Debian takes extra effort, and there is no guarantee that the distribution packages will stay up to date (even though we work closely with the Debian Science team and, with help from Apertium upstream, have proposed an up-to-date package to that team).

4. A new version of Apertium is unlikely to enter Debian/Ubuntu soon.

5. We are using Apertium-Apy (http://wiki.apertium.org/wiki/Apertium-apy) to serve MT requests.

6. #4 will make sure that users get decent results when requesting MT from Apertium, which is the only Free and Open Source machine translation service available at the moment. Work on Google/Bing MT services will take some time and will put us in a similar situation (we will handle it via the LCA team first!), without any control over those services.

Current scenario
We are pointing the mt.apertium configuration to apertium.wmflabs.org, which runs on Labs. Details of this instance are below:


 * Size: m1.large, 4 cores, 8GB RAM
 * It has Apertium and language pairs from the apt repo that the Apertium developers maintain.
 * The server runs APY, a Python Tornado-based web server that preloads the machine translation pipelines. MT is processor-intensive and requires multi-core machines.
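
As an illustration of how a client talks to APY, here is a minimal sketch in Python. It assumes the `/translate` endpoint with `langpair` and `q` query parameters and a JSON reply carrying `responseData.translatedText` and `responseStatus`, as described in the Apertium-APY documentation. The helper names `build_translate_url` and `extract_translation` are hypothetical, and the exact error format may differ between APY versions.

```python
import json
from urllib.parse import urlencode


def build_translate_url(base_url, source, target, text):
    """Return the APY /translate URL for one request (hypothetical helper)."""
    # APY expects the pair as "source|target"; urlencode escapes the pipe.
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{base_url.rstrip('/')}/translate?{query}"


def extract_translation(body):
    """Return the translated text from an APY JSON response body."""
    payload = json.loads(body)
    # Assumption: a successful reply carries responseStatus 200.
    if payload.get("responseStatus") != 200:
        raise RuntimeError(f"Unexpected APY response: {payload}")
    return payload["responseData"]["translatedText"]
```

The URL returned by `build_translate_url` can be fetched with `urllib.request.urlopen`, and the decoded body passed to `extract_translation`. Keeping URL construction and response parsing separate from the network call makes both easy to check without a running APY instance.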

Proposal
We need to replicate a similar setup for Beta, and later for Production, to make use of Apertium. This is only possible with the latest upstream code or packages from Apertium. Because this is a critical component of the project, the LE team is ready to invest time and effort in it.

The following work needs to be done by the team in order to set up Apertium as a service for Content Translation:


 * 1) Review packages from Apertium upstream.
 * 2) (re)Package Apertium.
 * 3) Package Apertium language pairs.
 * 4) Package dependencies (Python packages, list to be finalized)
 * 5) Package Apertium APY.
 * 6) Write startup scripts as needed.
 * 7) Puppetize Apertium packages and its dependencies.
 * 8) Set up the needed Debian repositories.
 * 9) Set up the Apertium service running on top of the needed packages.
 * 10) Provision hardware.

Core packages

 * apertium
 * libapertium
 * lttoolbox
 * apertium-dbus

APY and dependencies

 * apertium-apy

Language pairs

 * apertium-af-nl
 * apertium-br-fr
 * apertium-ca-it
 * apertium-cy-en
 * apertium-dan-nor
 * apertium-en-ca
 * apertium-en-es
 * apertium-en-gl
 * apertium-eo-ca
 * apertium-eo-en
 * apertium-eo-es
 * apertium-eo-fr
 * apertium-es-an
 * apertium-es-ast
 * apertium-es-ca
 * apertium-es-gl
 * apertium-es-it
 * apertium-es-pt
 * apertium-es-ro
 * apertium-eu-en
 * apertium-eu-es
 * apertium-fr-ca
 * apertium-fr-es
 * apertium-hbs-eng
 * apertium-hbs-mkd
 * apertium-hbs-slv
 * apertium-id-ms
 * apertium-is-sv
 * apertium-isl-eng
 * apertium-kaz-tat
 * apertium-mk-bg
 * apertium-mk-en
 * apertium-mt-ar
 * apertium-nno-nob
 * apertium-oc-ca
 * apertium-oc-es
 * apertium-pt-ca
 * apertium-pt-gl
 * apertium-sv-da
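
Once the service is running, one way to confirm that the packaged pairs are actually being served is to compare the package list against what APY reports. The sketch below is only an illustration: it assumes APY's `/listPairs` reply is a JSON object whose `responseData` is a list of `sourceLanguage`/`targetLanguage` entries, and that the language codes reported by APY match the codes in the package names (if APY reports a different code form, for example three-letter codes for a two-letter package name, a mapping step would be needed). The function names are hypothetical.

```python
def pair_from_package(name):
    """Split a pair package name such as 'apertium-en-es' into ('en', 'es')."""
    parts = name.split("-")
    if len(parts) != 3 or parts[0] != "apertium":
        raise ValueError(f"Not a language pair package name: {name}")
    return parts[1], parts[2]


def missing_pairs(packages, list_pairs_response):
    """Return the pair packages that APY does not serve in either direction.

    `packages` should contain only language pair package names.
    `list_pairs_response` is the already-decoded JSON from /listPairs.
    """
    # Treat pairs as unordered: a package may provide one or both directions.
    served = {
        frozenset((entry["sourceLanguage"], entry["targetLanguage"]))
        for entry in list_pairs_response["responseData"]
    }
    return [
        name for name in packages
        if frozenset(pair_from_package(name)) not in served
    ]
```

A non-empty result points at packages that are installed but not loaded by APY, or at a code mismatch between the package name and the pair as APY reports it.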

Single-language tools

 * apertium-arg
 * apertium-ava
 * apertium-ara
 * apertium-bak
 * apertium-ben
 * apertium-bre
 * apertium-bul
 * apertium-ces
 * apertium-chv
 * apertium-dan
 * apertium-ell
 * apertium-eus
 * apertium-fao
 * apertium-fin
 * apertium-glv
 * apertium-hbs
 * apertium-hin
 * apertium-hye
 * apertium-isl
 * apertium-ita
 * apertium-kaa
 * apertium-kaz
 * apertium-kir
 * apertium-kum
 * apertium-lvs
 * apertium-mkd
 * apertium-mlt
 * apertium-nld
 * apertium-nno
 * apertium-nob
 * apertium-nog
 * apertium-rus
 * apertium-san
 * apertium-spa
 * apertium-sqi
 * apertium-swe
 * apertium-tat
 * apertium-tuk
 * apertium-tur
 * apertium-ukr
 * apertium-urd
 * apertium-uzb

Tools

 * apertium-lex-tools

Links

 * 1) Content Translation technical architecture: https://www.mediawiki.org/wiki/Content_translation/Technical_Architecture
 * 2) Apertium: https://www.mediawiki.org/wiki/Content_translation/Apertium