Content translation/Product Definition/MT API Proof of Concept

Requirements

There should be a “Machine Translator” abstract interface. It should be as simple as possible to implement, so that we can support a wide range of MT engines; features such as caching, queueing and storage should be implemented elsewhere. There should be one main method (a rough sketch follows the list below):

translate($targetLang, array($sourceLang => $sourceText)) → $targetText or NULL
  • The method performs synchronous machine pre-translation
  • It may take up to several seconds to run
  • The MT engine is not responsible for providing responsive real-time feedback to user requests
    • Instead, the ContentTranslation system will intelligently queue and cache MT requests
    • This lets us support MT systems with limited throughput / responsiveness
  • $sourceLang and $targetLang are BCP 47 codes (e.g. 'en', 'zh-Hant')
  • $sourceText and $targetText are plain-text strings containing a single segment (or paragraph)
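
A minimal sketch of what such an interface could look like in the Node.js proof of concept follows. The class names, the Promise-based signature and the demo engine are assumptions for illustration, not the actual PoC code.

  'use strict';

  // Sketch only: names and signatures are assumptions, not the actual PoC code.
  class MachineTranslator {
      /**
       * Machine pre-translation of a single segment (or paragraph).
       * @param {string} targetLang BCP 47 code, e.g. 'es'
       * @param {Object} source One source language mapped to its text, e.g. { en: 'A cat sat on the mat.' }
       * @return {Promise<string|null>} Translated plain text, or null if no translation is available
       */
      translate( targetLang, source ) {
          throw new Error( 'translate() must be implemented by a concrete MT engine' );
      }
  }

  // A concrete engine only implements translate(); caching, queueing and storage
  // are handled elsewhere in the ContentTranslation system.
  class UppercasingDemoTranslator extends MachineTranslator {
      translate( targetLang, source ) {
          const [ [ sourceLang, sourceText ] ] = Object.entries( source );
          // A real engine would call an external MT service here, which may take
          // several seconds; resolving to null signals "no translation available".
          if ( sourceLang === targetLang ) {
              return Promise.resolve( null );
          }
          return Promise.resolve( sourceText.toUpperCase() );
      }
  }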

TODO: support the following (a hypothetical request/response example follows this list):

  • Source text containing markup
    • This is important for articles with hyperlinks!
  • Source text containing tagged translated terms
  • Target text containing word-to-word alignment
  • Supplying multiple source segments is permitted (as per the Translate extension) but not required. [TODO: Do we need this? If so, the response should say which source text(s) actually got translated]
  • The method may return NULL, and the return value may differ if the method is called with the same arguments a second time.
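
As a rough illustration of the extended inputs and outputs above, the shapes below show source markup, a tagged pre-translated term and word alignment. The field names and the <tm> tag are invented for this sketch; they are not a defined format.

  // Hypothetical request/response shapes for the TODO items above;
  // every field name here is an assumption, not a specification.
  const request = {
      sourceLang: 'en',
      targetLang: 'es',
      // Source text containing markup (a hyperlink) and a tagged, already-translated term
      sourceText: 'The <a href="./Cat">cat</a> sat on the <tm translation="alfombra">mat</tm>.'
  };

  const response = {
      targetText: 'El <a href="./Cat">gato</a> se sentó en la <tm>alfombra</tm>.',
      // Word-to-word alignment as token indices; the values are made up and
      // only show the shape such data could take.
      alignment: [
          { sourceTokens: [ 1 ], targetTokens: [ 1 ] },
          { sourceTokens: [ 4 ], targetTokens: [ 5 ] }
      ]
  };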

Other methods may allow discovery of supported language pairs, performance characteristics, usage limits, etc.
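
One possible shape for such discovery methods, continuing the sketch above (method names and return shapes are assumptions):

  // Possible additional methods on an MT engine; names are assumptions.
  class ExampleEngineTranslator extends MachineTranslator {
      // Supported language pairs, so the ContentTranslation system knows what to offer
      getLanguagePairs() {
          return Promise.resolve( [
              { source: 'es', target: 'ca' },
              { source: 'ca', target: 'es' }
          ] );
      }

      // Rough usage-limit characteristics that the queueing layer can respect
      getUsageLimits() {
          return { maxCharactersPerRequest: 10000, maxRequestsPerMinute: 60 };
      }

      translate( targetLang, source ) {
          // A real implementation would call the engine's own API here
          return Promise.resolve( null );
      }
  }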

Considerations about Node.js

The proof of concept is written in JavaScript and runs on Node.js. Here are some considerations about using Node.js for the final implementation.

  • Organisational
    • +1 In use at WMF already (Parsoid)
    • +1 Significant Node.js expertise at WMF
    • +1 JavaScript is widely known inside and outside WMF
    • -1 But Node.js's asynchronous request-handling model is unfamiliar to someone coming from PHP
    • NL: Possible code reuse in the frontend? But no code sharing with Translate.
  • Performance (caveat: I Am Not A Performance Engineer!)
    • We'll need to service thousands of tiny requests
      • e.g. every load/save for a segment or suggestion
    • -1 Possibly less in-house expertise about parallelisation?
    • 0 Caching architecture may ultimately be more important than server performance characteristics (cf. MediaWiki as a whole)

Relevant Architecture Considerations

These are just the architecture questions that are primarily relevant to the choice of server technology.

  • Separation
    • We're calling MT engines, which are separate systems
    • The same should probably be true for our Translation Memory solution(s)
    • And for dictionary resources
    • How much separation do we want from MediaWiki?
      • NL: MediaWiki provides very little help for the backend. In addition, it has high startup costs for each request, it is written in a different language, and the core API code is planned to change in the future.
      • NL: Scaling and monitoring performance is easier on a separate service than on the MediaWiki app servers. The same goes for poking holes in the firewall if needed.
      • NL: Access control (especially if we are proxying paid services)
  • Performance
    • Prioritisation and pre-calculation of suggestion info
    • Storage and delivery of suggestion info
    • Caching in the browser
    • Bunching of segment requests (a rough sketch follows this list)
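
To make the separation and performance points above more concrete, here is a very rough sketch of how queueing/bunching and caching could wrap an MT engine in a separate Node.js service. All names are assumptions; prioritisation, persistent storage and browser caching are not addressed here.

  // Sketch of bunching and caching around an MT engine; names are assumptions.
  class TranslationQueue {
      constructor( translator ) {
          this.translator = translator;
          this.cache = new Map();   // key: source lang | target lang | source text
          this.pending = [];        // queued segment requests waiting to be bunched
      }

      // Called for every segment: answer from cache immediately when possible,
      // otherwise queue the request so segments can be sent to the engine in bunches.
      request( sourceLang, targetLang, sourceText ) {
          const key = `${ sourceLang }|${ targetLang }|${ sourceText }`;
          if ( this.cache.has( key ) ) {
              return Promise.resolve( this.cache.get( key ) );
          }
          return new Promise( ( resolve ) => {
              this.pending.push( { sourceLang, targetLang, sourceText, key, resolve } );
          } );
      }

      // Flush the queue periodically (e.g. from a timer), respecting the engine's
      // limited throughput by translating one queued segment at a time.
      async flush() {
          const batch = this.pending.splice( 0 );
          for ( const item of batch ) {
              const text = await this.translator.translate(
                  item.targetLang, { [ item.sourceLang ]: item.sourceText }
              );
              this.cache.set( item.key, text );
              item.resolve( text );
          }
      }
  }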

Document Segmentation

The main tool is ...

Data storage hierarchy

Project: A passage of evolving parallel text in multiple languages

Like an evolving Wikipedia article

  • (trivial case: one source language -- like a wiki article now!)
  • (basic case: one source, one target -- content translation demo)
  • (simple case: one source, many targets -- most translation systems)
  • (general case: multiple languages are both source and target)

Task: A particular translation task in the Project

(In the simple case, corresponds to a particular source text)

  • One source language
  • One target language
  • Document tree, containing delimited segment locations
    • Note: it suffices to store the source document and the segmentation algorithm
    • Note: paragraph separation exists in the tree
    • Note: segmentation algorithms are language-dependent
  • List of segments with unique IDs

Segment: Storage for a minimal standalone piece of text

(Usually a sentence, title or label; note that a segment can contain markup, including placeholders)

  • Segment ID
  • Key-value mapping of lang to text
  • Cached MT, TM and dictionary search results
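
A rough sketch of how the Project → Task → Segment hierarchy could look as plain data objects; the field names are assumptions for illustration, not a defined schema.

  // Illustrative data shapes for the storage hierarchy; all field names are assumptions.
  const segment = {
      id: 'seg-0007',                 // Segment ID
      text: {                         // key-value mapping of lang to text
          en: 'The cat sat on the mat.',
          es: 'El gato se sentó en la alfombra.'
      },
      cached: {                       // cached MT, TM and dictionary search results
          mt: { 'en>es': 'El gato se sentó en la alfombra.' },
          tm: [],
          dict: []
      }
  };

  const task = {
      sourceLang: 'en',               // one source language
      targetLang: 'es',               // one target language
      documentTree: {},               // source document tree with delimited segment locations
      segmentIds: [ 'seg-0001', 'seg-0002', 'seg-0007' ]
  };

  const project = {
      title: 'Example article',
      languages: [ 'en', 'es' ],
      tasks: [ task ],
      segments: { 'seg-0007': segment }
  };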