Content translation/Product Definition/MT API Proof of Concept
- 1 Requirements
- 2 Considerations about Node.js
- 3 Relevant Architecture Considerations
- 4 Document Segmentation
- 5 Data storage hierarchy
Requirements
There should be a “Machine Translator” abstract interface. It should be as simple as possible to implement, so that we can support a wide range of MT engines. Features such as caching, queueing and storage should be implemented elsewhere. There should be one main method:
translate($targetLang, array($sourceLang => $sourceText)) → $targetText or NULL
- The method performs synchronous machine pre-translation
- It may take up to several seconds to run
- The MT engine is not responsible for providing responsive real-time feedback to user requests.
- Instead the ContentTranslation system will intelligently queue and cache MT requests.
- This lets us support MT systems with limited throughput or responsiveness.
- $sourceLang and $targetLang are BCP47 codes (e.g. 'en', 'zh-Hant')
- $sourceText and $targetText are plaintext strings of a single segment (or paragraph).
TODO: support the following:
- Source text containing markup
- This is important for articles with hyperlinks!
- Source text containing tagged translated terms:
- E.g. to specify that the surname “Gray” should be translated as “Gray” and not as the colour gray.
- See http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc9
- This will allow using Wikidata to translate terms
- Not all MT engines will support this
- Target text containing word-to-word alignment:
- See http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc11
- This would be useful for subphrase selection, auto-completion etc.
- Supplying multiple source segments is permitted (as per the Translate extension) but not required. [TODO: Do we need this? If so, the response should say which source text(s) actually got translated.]
- The method may return NULL. The return value may be different if the method is called with the same arguments a second time.
Other methods may allow discovery of supported language pairs, performance and usage limit characteristics, etc.
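The contract above can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: the names `LangCode`, `MachineTranslator` and `LookupTranslator` are hypothetical, and the table-backed engine is a stand-in for a real MT system. Only the shape of `translate` (target language, a mapping of source language to source text, and a result that may be null) comes from the requirements.

```typescript
// Hypothetical names; only the translate() contract is taken from the requirements.
type LangCode = string; // BCP47 code, e.g. 'en' or 'zh-Hant'

interface MachineTranslator {
  // Synchronous and possibly slow. Returns null when no translation is available.
  translate(targetLang: LangCode, sources: Record<LangCode, string>): string | null;
}

// Toy engine for illustration: looks up whole segments in a fixed table.
class LookupTranslator implements MachineTranslator {
  private readonly table: Record<string, string>;

  constructor(table: Record<string, string>) {
    this.table = table;
  }

  translate(targetLang: LangCode, sources: Record<LangCode, string>): string | null {
    // Try each supplied source language and return the first one that has an entry.
    for (const sourceLang of Object.keys(sources)) {
      const hit = this.table[`${sourceLang}>${targetLang}:${sources[sourceLang]}`];
      if (hit !== undefined) {
        return hit;
      }
    }
    return null;
  }
}

const engine = new LookupTranslator({ "en>es:Hello world": "Hola mundo" });
console.log(engine.translate("es", { en: "Hello world" })); // prints: Hola mundo
console.log(engine.translate("fr", { en: "Hello world" })); // prints: null
```

Keeping the interface to a single method means that an adapter for a new engine only has to map language codes and text in and out; queueing, caching and storage stay outside it.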
Considerations about Node.js
- +1 In use in WMF already (Parsoid)
- +1 Significant Node.js expertise in WMF
- -1 BUT Node.js’s asynchronous request handling model is unfamiliar for someone coming from PHP.
- See http://s3.amazonaws.com/four.livejournal/20091117/jsconf.pdf for an intro
- NL: Possible code re-use in frontend? But no code-share from Translate.
- Performance (caveat: I Am Not A Performance Engineer!)
- +1 Lightweight connections are cheap in Node.js (http://nodejs.org/about/)
- We'll need to service thousands of tiny requests
- e.g. every load/save for a segment or suggestion
- -1 Possibly less expertise about parallelisation?
- 0 Caching architecture may ultimately be more important than server performance characteristics (cf. MediaWiki as a whole)
Relevant Architecture Considerations
This section covers only the architecture questions that are primarily relevant to the choice of server technology.
- We're calling MT engines which are separate systems
- The same should probably be true for our Translation Memory solution(s)
- And dictionary resources
- How much separation do we want from MediaWiki?
- NL: MW provides very little help for the backend. In addition, it has high startup costs for each request, it is written in a different language, and its core API code is planned to change in the future.
- NL: Scaling and monitoring performance is easier on a separate service than on the MW app servers. The same goes for poking holes in the firewall if needed.
- NL: Access control (especially if we are proxying paid services)
- Prioritisation and pre-calculation of suggestion info
- Storage and delivery of suggestion info
- Delivery of suggestion info
- Caching in the browser
- Bunching of segment requests
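As an illustration of keeping caching outside the MT engine, here is a minimal in-memory sketch. The names `CachingTranslator`, `slowEngine` and `cachedEngine` are hypothetical, the interface repeats the assumed shape of the `translate` method from the requirements, and a real deployment would presumably use a shared cache rather than process memory.

```typescript
// Hypothetical names; a sketch of a cache placed in front of a slow MT engine.
interface MachineTranslator {
  translate(targetLang: string, sources: Record<string, string>): string | null;
}

class CachingTranslator implements MachineTranslator {
  // Plain object used as the cache; every key is a JSON array string.
  private readonly cache: Record<string, string> = {};
  private readonly inner: MachineTranslator;

  constructor(inner: MachineTranslator) {
    this.inner = inner;
  }

  translate(targetLang: string, sources: Record<string, string>): string | null {
    // Sort source languages so the key does not depend on property order.
    const langs = Object.keys(sources).sort();
    const key = JSON.stringify([targetLang, langs.map((lang) => [lang, sources[lang]])]);
    const cached = this.cache[key];
    if (cached !== undefined) {
      return cached;
    }
    const result = this.inner.translate(targetLang, sources);
    // A null result is not cached, because a later call may succeed.
    if (result !== null) {
      this.cache[key] = result;
    }
    return result;
  }
}

// Stand-in engine that counts how often it is actually called.
let engineCalls = 0;
const slowEngine: MachineTranslator = {
  translate(targetLang, sources) {
    engineCalls += 1;
    const text = sources["en"];
    return text === undefined ? null : `[${targetLang}] ${text}`;
  },
};

const cachedEngine = new CachingTranslator(slowEngine);
cachedEngine.translate("es", { en: "Hello" });
cachedEngine.translate("es", { en: "Hello" });
console.log(engineCalls); // prints: 1
```

Because the wrapper implements the same interface as the engine, the same approach could be layered further, for example with a queue in front of the cache, without the engine adapters changing.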
Document Segmentation
The main tool is ...
Data storage hierarchy
Project: A passage of evolving parallel text in multiple languages
Like an evolving Wikipedia article
- (trivial case: one source language -- like a wiki article now!)
- (basic case: one source, one target -- content translation demo)
- (simple case: one source, many targets -- most translation systems)
- (general case: multiple languages are both source and target)
Task: A particular translation task in the Project
(In the simple case, corresponds to a particular source text)
- One source language
- One target language
- Document tree, containing delimited segment locations
- Note: it suffices to store the source document and segmentation algorithm
- Note: paragraph separation exists in the tree
- Note: segmentation algorithms are language-dependent
- List of segments with unique IDs
Segment: Storage for a minimal standalone piece of text
(Usually a sentence, title or label) (Note: can contain markup, including placeholders)
- Segment ID
- Key-value mapping of language code to text
- Cached MT, TM and dictionary search results
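The hierarchy above might be modelled with types along the following lines. All field names are hypothetical, and storing segments on the Project (so that several Tasks can refer to the same Segment by ID) is an assumption rather than something the outline decides. The `taskProgress` helper is only there to show how the three levels fit together.

```typescript
// Hypothetical field names; the three levels follow the hierarchy outlined above.
interface Segment {
  id: string;
  // Language code -> text; the text may contain markup and placeholders.
  text: Record<string, string>;
  // Cached MT, TM and dictionary results, keyed by target language.
  suggestions: Record<string, string[]>;
}

interface Task {
  sourceLang: string;
  targetLang: string;
  // Source document; with the segmentation algorithm it determines the document tree.
  sourceDocument: string;
  segmentIds: string[];
}

interface Project {
  title: string;
  tasks: Task[];
  // Assumption: segments are stored per project and shared between tasks.
  segments: Record<string, Segment>;
}

// Fraction of a task's segments that already have text in the target language.
function taskProgress(project: Project, task: Task): number {
  if (task.segmentIds.length === 0) {
    return 0;
  }
  let done = 0;
  for (const id of task.segmentIds) {
    const segment = project.segments[id];
    if (segment !== undefined && segment.text[task.targetLang] !== undefined) {
      done += 1;
    }
  }
  return done / task.segmentIds.length;
}

const demoTask: Task = {
  sourceLang: "en",
  targetLang: "es",
  sourceDocument: "<p>Hello. Goodbye.</p>",
  segmentIds: ["s1", "s2"],
};

const demoProject: Project = {
  title: "Greetings",
  tasks: [demoTask],
  segments: {
    s1: { id: "s1", text: { en: "Hello.", es: "Hola." }, suggestions: {} },
    s2: { id: "s2", text: { en: "Goodbye." }, suggestions: { es: ["Adiós."] } },
  },
};

console.log(taskProgress(demoProject, demoTask)); // prints: 0.5
```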