Content translation/Product Definition/MT API Proof of Concept

Requirements

There should be a “Machine Translator” abstract interface. It should be as simple as possible to implement, so that we can support a wide range of MT engines; features such as caching, queueing and storage should be implemented elsewhere. There should be one main method (a rough sketch follows the list below):

translate($targetLang, array($sourceLang => $sourceText)) → $targetText or NULL
  • The method performs synchronous machine pre-translation
  • It may take up to several seconds to run
  • The MT engine is not responsible for providing responsive real-time feedback to user requests
    • Instead, the ContentTranslation system will intelligently queue and cache MT requests
    • This lets us support MT systems with limited throughput / responsiveness
  • $sourceLang and $targetLang are BCP 47 codes (e.g. 'en', 'zh-Hant')
  • $sourceText and $targetText are plain-text strings containing a single segment (or paragraph)
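
A minimal sketch of what such an interface could look like in the Node.js proof of concept follows. The class names, the Promise-based signature and the demo engine are assumptions for illustration, not the actual PoC code.

  'use strict';

  // Sketch only: names and signatures are assumptions, not the actual PoC code.
  class MachineTranslator {
      /**
       * Machine pre-translation of a single segment (or paragraph).
       * @param {string} targetLang BCP 47 code, e.g. 'es'
       * @param {Object} source One source language mapped to its text, e.g. { en: 'A cat sat on the mat.' }
       * @return {Promise<string|null>} Translated plain text, or null if no translation is available
       */
      translate( targetLang, source ) {
          throw new Error( 'translate() must be implemented by a concrete MT engine' );
      }
  }

  // A concrete engine only implements translate(); caching, queueing and storage
  // are handled elsewhere in the ContentTranslation system.
  class UppercasingDemoTranslator extends MachineTranslator {
      translate( targetLang, source ) {
          const [ [ sourceLang, sourceText ] ] = Object.entries( source );
          // A real engine would call an external MT service here, which may take
          // several seconds; resolving to null signals "no translation available".
          if ( sourceLang === targetLang ) {
              return Promise.resolve( null );
          }
          return Promise.resolve( sourceText.toUpperCase() );
      }
  }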

TODO: support the following (a hypothetical request/response example follows this list):

  • Source text containing markup
    • This is important for articles with hyperlinks!
  • Source text containing tagged translated terms
  • Target text containing word-to-word alignment
  • Supplying multiple source segments is permitted (as per the Translate extension) but not required. [TODO: Do we need this? If so, the response should say which source text(s) actually got translated]
  • The method may return NULL, and the return value may differ if the method is called with the same arguments a second time.
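
As a rough illustration of the extended inputs and outputs above, the shapes below show source markup, a tagged pre-translated term and word alignment. The field names and the <tm> tag are invented for this sketch; they are not a defined format.

  // Hypothetical request/response shapes for the TODO items above;
  // every field name here is an assumption, not a specification.
  const request = {
      sourceLang: 'en',
      targetLang: 'es',
      // Source text containing markup (a hyperlink) and a tagged, already-translated term
      sourceText: 'The <a href="./Cat">cat</a> sat on the <tm translation="alfombra">mat</tm>.'
  };

  const response = {
      targetText: 'El <a href="./Cat">gato</a> se sentó en la <tm>alfombra</tm>.',
      // Word-to-word alignment as token indices; the values are made up and
      // only show the shape such data could take.
      alignment: [
          { sourceTokens: [ 1 ], targetTokens: [ 1 ] },
          { sourceTokens: [ 4 ], targetTokens: [ 5 ] }
      ]
  };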

Other methods may allow discovery of supported language pairs, performance characteristics, usage limits, etc.
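
One possible shape for such discovery methods, continuing the sketch above (method names and return shapes are assumptions):

  // Possible additional methods on an MT engine; names are assumptions.
  class ExampleEngineTranslator extends MachineTranslator {
      // Supported language pairs, so the ContentTranslation system knows what to offer
      getLanguagePairs() {
          return Promise.resolve( [
              { source: 'es', target: 'ca' },
              { source: 'ca', target: 'es' }
          ] );
      }

      // Rough usage-limit characteristics that the queueing layer can respect
      getUsageLimits() {
          return { maxCharactersPerRequest: 10000, maxRequestsPerMinute: 60 };
      }

      translate( targetLang, source ) {
          // A real implementation would call the engine's own API here
          return Promise.resolve( null );
      }
  }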

Considerations about Node.js

The proof of concept is written in JavaScript and runs on Node.js. Here are some considerations about using Node.js for the final implementation.

  • Organisational
    • +1 In use at WMF already (Parsoid)
    • +1 Significant Node.js expertise at WMF
    • +1 JavaScript is widely known inside and outside WMF
    • -1 But Node.js's asynchronous request-handling model is unfamiliar to someone coming from PHP
    • NL: Possible code reuse in the frontend? But no code sharing with Translate.
  • Performance (caveat: I Am Not A Performance Engineer!)
    • We'll need to service thousands of tiny requests
      • e.g. every load/save for a segment or suggestion
    • -1 Possibly less in-house expertise about parallelisation?
    • 0 Caching architecture may ultimately be more important than server performance characteristics (cf. MediaWiki as a whole)

Relevant Architecture Considerations

These are just the architecture questions that are primarily relevant to the choice of server technology.

  • Separation
    • We're calling MT engines, which are separate systems
    • The same should probably be true for our Translation Memory solution(s)
    • And for dictionary resources
    • How much separation do we want from MediaWiki?
      • NL: MediaWiki provides very little help for the backend. In addition, it has high startup costs for each request, it is written in a different language, and the core API code is planned to change in the future.
      • NL: Scaling and monitoring performance is easier on a separate service than on the MediaWiki app servers. The same goes for poking holes in the firewall if needed.
      • NL: Access control (especially if we are proxying paid services)
  • Performance
    • Prioritisation and pre-calculation of suggestion info
    • Storage and delivery of suggestion info
    • Caching in the browser
    • Bunching of segment requests (a rough sketch follows this list)
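
To make the separation and performance points above more concrete, here is a very rough sketch of how queueing/bunching and caching could wrap an MT engine in a separate Node.js service. All names are assumptions; prioritisation, persistent storage and browser caching are not addressed here.

  // Sketch of bunching and caching around an MT engine; names are assumptions.
  class TranslationQueue {
      constructor( translator ) {
          this.translator = translator;
          this.cache = new Map();   // key: source lang | target lang | source text
          this.pending = [];        // queued segment requests waiting to be bunched
      }

      // Called for every segment: answer from cache immediately when possible,
      // otherwise queue the request so segments can be sent to the engine in bunches.
      request( sourceLang, targetLang, sourceText ) {
          const key = `${ sourceLang }|${ targetLang }|${ sourceText }`;
          if ( this.cache.has( key ) ) {
              return Promise.resolve( this.cache.get( key ) );
          }
          return new Promise( ( resolve ) => {
              this.pending.push( { sourceLang, targetLang, sourceText, key, resolve } );
          } );
      }

      // Flush the queue periodically (e.g. from a timer), respecting the engine's
      // limited throughput by translating one queued segment at a time.
      async flush() {
          const batch = this.pending.splice( 0 );
          for ( const item of batch ) {
              const text = await this.translator.translate(
                  item.targetLang, { [ item.sourceLang ]: item.sourceText }
              );
              this.cache.set( item.key, text );
              item.resolve( text );
          }
      }
  }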

Document Segmentation

The main tool is ...

Data storage hierarchy

Project: A passage of evolving parallel text in multiple languages

Like an evolving Wikipedia article

  • (trivial case: one source language -- like a wiki article now!)
  • (basic case: one source, one target -- content translation demo)
  • (simple case: one source, many targets -- most translation systems)
  • (general case: multiple languages are both source and target)

Task: A particular translation task in the Project

(In the simple case, corresponds to a particular source text)

  • One source language
  • One target language
  • Document tree, containing delimited segment locations
    • Note: it suffices to store the source document and the segmentation algorithm
    • Note: paragraph separation exists in the tree
    • Note: segmentation algorithms are language-dependent
  • List of segments with unique IDs

Segment: Storage for a minimal standalone piece of text

(Usually a sentence, title or label; note that a segment can contain markup, including placeholders)

  • Segment ID
  • Key-value mapping of lang to text
  • Cached MT, TM and dictionary search results
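
A rough sketch of how the Project → Task → Segment hierarchy could look as plain data objects; the field names are assumptions for illustration, not a defined schema.

  // Illustrative data shapes for the storage hierarchy; all field names are assumptions.
  const segment = {
      id: 'seg-0007',                 // Segment ID
      text: {                         // key-value mapping of lang to text
          en: 'The cat sat on the mat.',
          es: 'El gato se sentó en la alfombra.'
      },
      cached: {                       // cached MT, TM and dictionary search results
          mt: { 'en>es': 'El gato se sentó en la alfombra.' },
          tm: [],
          dict: []
      }
  };

  const task = {
      sourceLang: 'en',               // one source language
      targetLang: 'es',               // one target language
      documentTree: {},               // source document tree with delimited segment locations
      segmentIds: [ 'seg-0001', 'seg-0002', 'seg-0007' ]
  };

  const project = {
      title: 'Example article',
      languages: [ 'en', 'es' ],
      tasks: [ task ],
      segments: { 'seg-0007': segment }
  };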