Content translation/Machine Translation/Apertium/Service

We are introducing our first self-hosted service for Content Translation (CX): Apertium (proof of concept patch). Apertium provides machine translation (MT), which is a critical component of CX.

Why Apertium as a service?
1. The WMF Language Engineering (LE) team works closely with the Apertium developers, reporting bugs and getting Apertium improved. We need to run a recent version of Apertium with its fixes and keep it up to date. This collaboration and communication loop works only if the LE team has full control over the software running on the Apertium service. The LE team currently uses its own instance on Labs at http://apertium.wmflabs.org

2. The Apertium packages in Ubuntu/Debian are outdated. Updating them in the distributions is extra effort, and there is no guarantee that the distribution packages will stay up to date, even though we work closely with the Debian Science team and, with help from Apertium upstream, have proposed an up-to-date package to that team.

3. We are going to use Apertium-APY to serve MT requests. APY provides a web service on top of Apertium and makes Apertium scalable by loading the processing pipelines once for all requests. This is the scalable production setup recommended by Apertium.

4. Using Apertium-APY and the latest Apertium ensures that users get decent results when requesting MT from Apertium, which is the best Free and Open Source machine translation software available. Work on Google/Bing MT services will take some time (we will handle it via the LCA team first!) and would put us in a similar situation, depending on a service over which we have no control.
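
To illustrate the web service mentioned in point 3, here is a minimal sketch of how a client could build an MT request and read the answer. It assumes that APY exposes a `/translate` endpoint taking `langpair` and `q` parameters and that it answers with a JSON object containing `responseData.translatedText` and `responseStatus`; treat both as assumptions to check against the APY version actually deployed. The helper names are hypothetical and no network call is made here.

```python
import json
from urllib.parse import urlencode


def build_translate_url(base_url, source, target, text):
    """Return an APY-style /translate URL for one piece of text.

    Assumption: the service accepts 'langpair' as 'source|target' and the
    text to translate in 'q'.
    """
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{base_url.rstrip('/')}/translate?{query}"


def parse_translation(body):
    """Extract the translated text from a JSON response body.

    Assumption: the response carries 'responseStatus' and
    'responseData.translatedText'.
    """
    payload = json.loads(body)
    if payload.get("responseStatus") != 200:
        raise RuntimeError(f"MT request failed: {payload.get('responseDetails')}")
    return payload["responseData"]["translatedText"]
```

A caller would fetch the URL returned by `build_translate_url("http://apertium.wmflabs.org", "en", "es", "Hello world")` with any HTTP client and pass the response body to `parse_translation`.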

Current scenario
The mt.apertium configuration currently points to apertium.wmflabs.org, which runs on Labs. Details of this instance are below:


 * Size: m1.large, 4 cores, 8GB RAM
 * It has Apertium and language pairs from the apt repo that the Apertium developers maintain.
 * The server runs APY, a Python Tornado-based web server that preloads the machine translation pipelines. MT is CPU-intensive and requires multi-core machines.
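
The idea of preloading pipelines can be sketched as a small cache that loads each language pair once and reuses it for later requests. This is only an illustration of the principle, not APY's actual code: `PipelineCache` and `fake_loader` are hypothetical names, and the loader stands in for the expensive step of starting an Apertium pipeline.

```python
class PipelineCache:
    """Load each language-pair pipeline once and reuse it afterwards."""

    def __init__(self, loader):
        # Hypothetical callable: (source, target) -> translate function.
        self._loader = loader
        self._pipelines = {}

    def get(self, source, target):
        key = (source, target)
        if key not in self._pipelines:
            # The expensive load happens only on the first request for a pair.
            self._pipelines[key] = self._loader(source, target)
        return self._pipelines[key]


def fake_loader(source, target):
    """Stand-in for starting a real pipeline; counts how often it is called."""
    fake_loader.calls += 1
    return lambda text: f"[{source}->{target}] {text}"


fake_loader.calls = 0
```

With `cache = PipelineCache(fake_loader)`, repeated calls to `cache.get("en", "es")` return the same pipeline object, so the loading cost is paid once per pair rather than once per request.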

Proposal
We need to replicate a similar setup for Beta and later for Production to make use of Apertium. This is only possible with the latest upstream code or packages from Apertium. Because this is a critical component of the project, the LE team is ready to invest time and effort in it.

The team needs to do the following work to set up Apertium as a service for Content Translation:


 * 1) Review packages from Apertium upstream.
 * 2) (re)Package Apertium.
 * 3) Package Apertium language pairs.
 * 4) Package dependencies (Python packages, list to be finalized)
 * 5) Package Apertium APY.
 * 6) Write startup scripts as needed.
 * 7) Puppetize Apertium packages and their dependencies.
 * 8) Set up the needed Debian repositories.
 * 9) Set up the Apertium service running on top of the needed packages.
 * 10) Hardware provisioning.

Core packages

 * apertium
 * apertium-dbus
 * cg3
 * libapertium
 * lttoolbox

APY

 * apertium-apy

Tools

 * apertium-lex-tools

Language pairs

 * apertium-af-nl
 * apertium-br-fr
 * apertium-ca-it
 * apertium-cy-en
 * apertium-dan-nor
 * apertium-en-ca
 * apertium-en-es
 * apertium-en-gl
 * apertium-eo-ca
 * apertium-eo-en
 * apertium-eo-es
 * apertium-eo-fr
 * apertium-es-an
 * apertium-es-ast
 * apertium-es-ca
 * apertium-es-gl
 * apertium-es-it
 * apertium-es-pt
 * apertium-es-ro
 * apertium-eu-en
 * apertium-eu-es
 * apertium-fr-ca
 * apertium-fr-es
 * apertium-hbs-eng
 * apertium-hbs-mkd
 * apertium-hbs-slv
 * apertium-id-ms
 * apertium-is-sv
 * apertium-isl-eng
 * apertium-kaz-tat
 * apertium-mk-bg
 * apertium-mk-en
 * apertium-mt-ar
 * apertium-nno-nob
 * apertium-oc-ca
 * apertium-oc-es
 * apertium-pt-ca
 * apertium-pt-gl
 * apertium-sv-da
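
As a small aid for working with the list above, the following sketch splits a package name into its two language codes and checks whether a package is listed for a given pair. The function names are hypothetical. The check only looks at the naming pattern `apertium-<code>-<code>`; it does not validate the codes, and a listed package does not by itself guarantee that both translation directions are supported.

```python
def pair_from_package(package_name):
    """Split a name like 'apertium-en-es' into ('en', 'es')."""
    prefix, _, rest = package_name.partition("-")
    codes = rest.split("-")
    if prefix != "apertium" or len(codes) != 2 or not all(codes):
        raise ValueError(f"not a language-pair package: {package_name!r}")
    return codes[0], codes[1]


def is_listed(packages, source, target):
    """Check whether a package is listed for the two languages, in either order."""
    pairs = {pair_from_package(name) for name in packages}
    return (source, target) in pairs or (target, source) in pairs
```

For example, `is_listed(["apertium-en-es", "apertium-nno-nob"], "es", "en")` finds the `apertium-en-es` package even though the codes are given in the opposite order.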

Single language tools (Optional)
See http://wiki.apertium.org/wiki/Languages for more details on monolingual packages.
 * apertium-arg
 * apertium-ava
 * apertium-ara
 * apertium-bak
 * apertium-ben
 * apertium-bre
 * apertium-bul
 * apertium-ces
 * apertium-chv
 * apertium-dan
 * apertium-ell
 * apertium-eus
 * apertium-fao
 * apertium-fin
 * apertium-glv
 * apertium-hbs
 * apertium-hin
 * apertium-hye
 * apertium-isl
 * apertium-ita
 * apertium-kaa
 * apertium-kaz
 * apertium-kir
 * apertium-kum
 * apertium-lvs
 * apertium-mkd
 * apertium-mlt
 * apertium-nld
 * apertium-nno
 * apertium-nob
 * apertium-nog
 * apertium-rus
 * apertium-san
 * apertium-spa
 * apertium-sqi
 * apertium-swe
 * apertium-tat
 * apertium-tuk
 * apertium-tur
 * apertium-ukr
 * apertium-urd
 * apertium-uzb

Links

 * 1) CX Technical architecture
 * 2) Apertium in CX