Content translation/Technical Architecture

Abbreviations and glossary

 * 1) CX - Content Translation
 * 2) MT - Machine Translation
 * 3) TM - Translation memory
 * 4) Segment - Smallest unit of text which is fairly self-contained grammatically. This usually means a sentence, a title, a phrase in a bulleted list, etc.
 * 5) Segmentation algorithm - rules to split a paragraph into segments. Weakly language-dependent (sensible default rules work quite well for many languages).
 * 6) Parallel bilingual text - two versions of the same content, each written in a different language.
 * 7) Sentence alignment - matching corresponding sentences in parallel bilingual text. In general this is a many-many mapping, but it is approximately one-one if the texts are quite strict translations.
 * 8) Word alignment - matching corresponding words in parallel bilingual text. This is strongly many-many.
 * 9) Lemmatization - closely related to stemming, and often loosely called that. Mapping multiple grammatical variants of the same word to a root form; e.g. (swim, swims, swimming, swam, swum) -> swim. Derivational variants are not usually mapped to the same form (so happiness !-> happy).
 * 10) Morphological analysis - mapping words into morphemes, e.g. swims -> swim/3rdperson_present
 * 11) Service providers - External systems which provide MT/TM/Glossary services. Example: Google
 * 12) Translation tools (also called translation support tools, translation aids or translation support) - Context-aware translation tools such as MT, dictionary and link localization
 * 13) Link localization - Converting a wiki article link from one language to another with the help of Wikidata. Example: http://en.wikipedia.org/wiki/Sea becomes http://es.wikipedia.org/wiki/Mar
 * 14) Redis http://en.wikipedia.org/wiki/Redis

Introduction
This document captures the current architectural understanding of the Content Translation project. It will evolve as the team explores each component in depth.

Architecture considerations

 * 1) The translation aids that we want to provide are mainly from third party service providers, or otherwise external to the CX software itself.
 * 2) Depending on the capability of each external system, varying delays can be expected. This emphasises the need for server-side optimizations around service provider APIs, such as:
 * caching of their results
 * proper scheduling to make better use of capacity and to stay within API usage limits


 * 3) We are providing a free-to-edit interface, so we should not block any edits (translations, in the context of CX) if the user decides not to use translation aids. We can prepare the translation aids incrementally and provide them in context as they become available.

Server communication

 * 1) We can provide translation aids for the content in increments, as and when we receive them from service providers. We do not need to wait for all service providers to return data before proceeding.
 * 2) For large articles, we will need some kind of paging mechanism to provide this data in batches.
 * 3) This means that client and server should communicate at regular intervals, or as and when data is available at the server. We considered traditional HTTP pull methods (Ajax) and push communication (WebSockets).
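
The paging idea in point 2 can be sketched as follows. This is only an illustration: the function name pageSegments and the pageSize parameter are assumptions, not part of the design.

```javascript
// Split an ordered list of segment ids into fixed-size pages, so that
// translation aids for a large article can be delivered batch by batch.
// pageSize is a hypothetical tuning parameter; the design does not fix one.
function pageSegments(segmentIds, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  const pages = [];
  for (let i = 0; i < segmentIds.length; i += pageSize) {
    pages.push(segmentIds.slice(i, i + pageSize));
  }
  return pages;
}
```

Each page could then be sent as soon as the aids for its segments are ready, whichever transport (pull or push) is chosen.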

Client side

 * 1) jQuery
 * 2) Socket.io client
 * 3) contenteditable
 * 4) LESS, Grid, HTML templating?

Server side

 * 1) Node.js with
 * Socket.io
 * Node.js built-in cluster module or http://learnboost.github.com/cluster/
 * express


 * 2) Redis http://en.wikipedia.org/wiki/Redis Also see MW usage of Redis: https://www.mediawiki.org/wiki/Redis WMF uses Redis in production.
 * 3) A proxy server like Apache or Nginx

Node instances
Instances are managed with the Node.js built-in cluster module. Currently the code is borrowed from Parsoid. The cluster master forks Express server instances according to the number of processors available on the system. It also forks new processes to replace instances that were killed or that exited on their own.

This approach uses the Node.js built-in cluster module. An alternative that may be better is the cluster Node module from the Socket.io developers: http://learnboost.github.io/cluster/

Load balancing
When the app is scaled out in a cluster environment, a load balancer distributes requests across the Node instances. This can break Socket.io, because a request may reach an instance with which that client has not completed the socket handshake.

For such situations, load balancers have a feature called ‘sticky sessions’, also known as ‘session affinity’. The idea is that if this property is set, all requests following the first load-balanced request go to the same server instance.

 app.use(cookieParser);
 app.use(express.session({ store: sessionStore, key: 'cxsessionid', secret: 'your secret here' }));

With this configuration:
 * 1) Express sets a session cookie with name cxsessionid.
 * 2) When socket.io connects, it uses that same cookie and hits the load balancer.
 * 3) The load balancer always routes it to the server instance that set the cookie.
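
To make the affinity idea concrete, here is a minimal sketch of one way a balancer could map a session id to a fixed backend, by hashing the cookie value. It is an assumption for illustration only: real load balancers provide their own stickiness mechanisms, and the names hashSessionId and pickBackend are hypothetical.

```javascript
// Turn a session id (for example the value of the cxsessionid cookie)
// into an unsigned 32-bit number with a simple multiplicative hash.
function hashSessionId(sessionId) {
  let hash = 0;
  for (const ch of sessionId) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep it unsigned 32-bit
  }
  return hash;
}

// Pick the same backend every time for the same session id,
// as long as the list of backends does not change.
function pickBackend(sessionId, backends) {
  if (backends.length === 0) {
    throw new Error('no backends configured');
  }
  return backends[hashSessionId(sessionId) % backends.length];
}
```

Note that this simple scheme remaps sessions whenever a backend is added or removed, which is one reason production balancers often use other techniques.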

Reference technology stacks:
 * 1) Trello http://blog.fogcreek.com/the-trello-tech-stack
 * 2) WMF Parsoid project

Security
To be designed. The content itself might be sensitive (private wikis) and thus should not always be shared through translation memory. We also need to restrict access to service providers, especially if they are paid ones.

WMF Infrastructure
To be designed. https://wikitech.wikimedia.org/wiki/Parsoid gives some idea.

Article preprocessing
We take a concrete example to work out the details of the data model: the Hydrogen article from English Wikipedia, https://en.wikipedia.org/wiki/Hydrogen

Article - Original

Hydrogen is a chemical element with chemical symbol H and atomic number 1. With an atomic weight of 1.00794 u, hydrogen is the lightest element on the periodic table. Its monatomic form (H) is the most abundant chemical substance in the universe, constituting roughly 75% of all baryonic mass.[7][note 1] Non-remnant stars are mainly composed of hydrogen in its plasma state. The most common isotope of hydrogen, termed protium (name rarely used, symbol 1H), has a single proton and zero neutrons.

The universal emergence of atomic hydrogen first occurred during the recombination epoch. At standard temperature and pressure, hydrogen is a colorless, odorless, tasteless, non-toxic, nonmetallic, highly combustible diatomic gas with the molecular formula H2. Since hydrogen readily forms covalent compounds with most non-metallic elements, most of the hydrogen on Earth exists in molecular forms such as in the form of water or organic compounds. Hydrogen plays a particularly important role in acid–base reactions. In ionic compounds, hydrogen can take the form of a negative charge (i.e., anion) known as a hydride, or as a positively charged (i.e., cation) species denoted by the symbol H+. The hydrogen cation is written as though composed of a bare proton, but in reality, hydrogen cations in ionic compounds are always more complex species than that would suggest.

As the simplest atom known, the hydrogen atom has had considerable theoretical application. For example, the hydrogen atom is the only neutral atom with an analytic solution to the Schrödinger equation. Hydrogen gas was first artificially produced in the early 16th century, via the mixing of metals with acids. In 1766–81, Henry Cavendish was the first to recognize that hydrogen gas was a discrete substance,[8] and that it produces water when burned, a property which later gave it its name: in Greek, hydrogen means "water-former".

Industrial production is mainly from the steam reforming of natural gas, and less often from more energy-intensive hydrogen production methods like the electrolysis of water.[9] Most hydrogen is employed near its production site, with the two largest uses being fossil fuel processing (e.g., hydrocracking) and ammonia production, mostly for the fertilizer market.

Hydrogen is a concern in metallurgy as it can embrittle many metals,[10] complicating the design of pipelines and storage tanks.[11]

First Pass - Prepare the article
We need to remove references, images, templates and anything else that we do not allow to be translated, or that we cannot translate, from the article.

''TODO: It is not yet clear which component does this clean-up and prepares the data for CX. Add your ideas.''

Hydrogen is a chemical element with chemical symbol H and atomic number 1. With an atomic weight of 1.00794 u, hydrogen is the lightest element on the periodic table. Its monatomic form (H) is the most abundant chemical substance in the universe, constituting roughly 75% of all baryonic mass. Non-remnant stars are mainly composed of hydrogen in its plasma state. The most common isotope of hydrogen, termed protium (name rarely used, symbol 1H), has a single proton and zero neutrons.

The universal emergence of atomic hydrogen first occurred during the recombination epoch. At standard temperature and pressure, hydrogen is a colorless, odorless, tasteless, non-toxic, nonmetallic, highly combustible diatomic gas with the molecular formula H2. Since hydrogen readily forms covalent compounds with most non-metallic elements, most of the hydrogen on Earth exists in molecular forms such as in the form of water or organic compounds. Hydrogen plays a particularly important role in acid–base reactions. In ionic compounds, hydrogen can take the form of a negative charge (i.e., anion) known as a hydride, or as a positively charged (i.e., cation) species denoted by the symbol H+. The hydrogen cation is written as though composed of a bare proton, but in reality, hydrogen cations in ionic compounds are always more complex species than that would suggest.

As the simplest atom known, the hydrogen atom has had considerable theoretical application. For example, the hydrogen atom is the only neutral atom with an analytic solution to the Schrödinger equation. Hydrogen gas was first artificially produced in the early 16th century, via the mixing of metals with acids. In 1766–81, Henry Cavendish was the first to recognize that hydrogen gas was a discrete substance, and that it produces water when burned, a property which later gave it its name: in Greek, hydrogen means "water-former".

Industrial production is mainly from the steam reforming of natural gas, and less often from more energy-intensive hydrogen production methods like the electrolysis of water. Most hydrogen is employed near its production site, with the two largest uses being fossil fuel processing (e.g., hydrocracking) and ammonia production, mostly for the fertilizer market.

Hydrogen is a concern in metallurgy as it can embrittle many metals, complicating the design of pipelines and storage tanks.
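
As a rough illustration of the reference-removal part of this pass, the sketch below strips bracketed markers such as [7] and [note 1] from plain text. It assumes plain text input and only these two marker shapes; a real implementation would more likely work on the parsed HTML, and the function name is hypothetical.

```javascript
// Remove reference markers like [7] or [note 1] from plain text.
// Assumption: markers always look like "[digits]" or "[note digits]".
function stripReferenceMarkers(text) {
  const marker = '\\[(?:\\d+|note \\d+)\\]';
  return text
    // A marker glued to the next word ("water.[9]Most") becomes a space.
    .replace(new RegExp(marker + '(?=\\w)', 'g'), ' ')
    // Every other marker is simply dropped.
    .replace(new RegExp(marker, 'g'), '');
}
```

Applied to the original paragraphs above, this removes markers such as [8] and [10] and restores the missing space after "water.".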

Second pass
According to our current understanding, this should be done by the server. Segments are marked in the document using spans, possibly with a unique class. The implementation details of language-independent sentence segmentation are unknown at this point.
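
A deliberately naive sketch of marking segments with spans is shown below. The splitting rule (sentence punctuation followed by whitespace and an uppercase letter) is an assumption that only suits some languages, and the class name cx-segment and the data-segmentid attribute are hypothetical.

```javascript
// Escape the characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Naive rule: a sentence ends at ., ! or ? when followed by
// whitespace and an uppercase letter. Decimal numbers such as
// 1.00794 are left alone because no whitespace follows the dot.
function splitSentences(paragraph) {
  return paragraph.split(/(?<=[.!?])\s+(?=[A-Z])/).filter((s) => s.length > 0);
}

// Wrap each sentence in a span that carries a segment id.
function markSegments(paragraph, firstId) {
  return splitSentences(paragraph)
    .map((sentence, i) =>
      `<span class="cx-segment" data-segmentid="segment${firstId + i}">` +
      `${escapeHtml(sentence)}</span>`)
    .join(' ');
}
```

Abbreviations followed by a capitalised word would be split wrongly by this rule, which is the kind of case the segmentation work has to address.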

Segmentation
See Content_translation/Segmentation

Workflow
See Content_translation/Workflow

First version
Suggested data structures - semantic data model - master copy at server - version 0

 Article {
     "sourceLang": "en",
     "targetLang": "cy",
     "sourceLocation": "Hydrogen",
     "segmentedContent": " Hydrogen is a chemical element with chemical symbol H and atomic number 1 ...",
     "segmentCount": 100,
     "segments": {
         segment1: {
             "source": "Hydrogen is a chemical element with chemical symbol H and atomic number 1",
             "target": null,
             "mt": null,
             "tm": null,
             "tmTstamp": null,
             "dictSource": null,
             "dictIds": null
         },
         segment2: {
             "source": "With an atomic weight of 1.00794 u, hydrogen is the lightest element on the periodic table",
             "target": null,
             "mt": null,
             "tm": null,
             "tmTstamp": null,
             "dictSource": null,
             "dictIds": null
         }
     },
     "dictionary": null,
     "glossary": null,
     "links": null
     // bunch of other stuff that is less relevant in a data structure overview
 }

Version Updates
When the server has completed all the tasks:

 {
     "sourceLang": "en",
     "targetLang": "cy",
     "sourceLocation": "Hydrogen",
     "segmentedContent": " Hydrogen is a chemical element with chemical symbol H and atomic number 1 ...",
     "segments": {
         segment1: {
             "source": "Hydrogen is a chemical element with chemical symbol H and atomic number 1",
             "target": null,
             "mt": [ // An empty list means no results (this differs from null)
                 {
                     "confidence": 0.5,
                     "engine": "Bangor",
                     "target": "Hydrogen yw elfen cemegol gyda symbol cemegol H a rhif atomig 1"
                 },
                 // Translation from another engine
             ],
             "tm": [ // Translation memory. Eg: TWN
                 {
                     "match": 0.7,
                     "sourceLang": "en", // might search related languages
                     "targetLang": "cy", // might search related languages
                     "referenceDoc": "Helium", // Previous translation source article
                     "referenceDocSegmentId": "segment1", // Previous translation location
                     "confidence": 0.8,
                     "tags": ["Science", "Chemistry", "Physics"],
                     "source": "Helium is a chemical element with chemical symbol He",
                     "target": "Elfen cemegol yw Heliwm gyda'r symbol cemegol He"
                 },
                 {
                     "match": 0.25,
                     "sourceLang": "en",
                     "targetLang": "cy",
                     "referenceDoc": "Hydrogen_(band)",
                     "referenceDocSegmentId": "segment7",
                     "confidence": 1.0,
                     "tags": ["Music", "Scotland"],
                     "source": "Hydrogen is a Scottish rock band",
                     "target": "Band roc o'r Alban yw Hydrogen"
                 }
             ],
             // Time at which the suggestion search occurred. Important because we can do a
             // subsequent "updateSuggestions" search which only considers segments added
             // after this time stamp.
             "suggestionsTstamp": 1391419264386,
             "dictSource": "Hydrogen is a chemical element with chemical symbol H and atomic number 1",
             "linkSource": "Hydrogen is a chemical element with chemical symbol H and atomic number 1",
             "dictIds": ["d5", "d6", "d7", "d9"],
             "linkIds": ["l1"],
         },
         segment2: // ...
     },
     "dictionary": {
         // ...
         "d5": { "reference": "Glossary/Computer science", "source": "atomic", "target": "elfennol", "note": "(computer science)" },
         "d6": { "reference": "Wiktionary", "source": "atomic number", "target": "rhif atomig", "note": "" },
         "d7": { "reference": "Glossary/Chemistry", "source": "atomic", "target": "atomig", "note": "(of an atom)" },
         // ...
         "d9": { "reference": "Glossary/Environment", "source": "chemicals", "target": "cemegolion", "note": "" }
     },
     "glossary": null,
     "links": { l1: "http://cy.wikipedia.org/wiki/Elamend" }
     // bunch of other stuff that is less relevant in a data structure overview
 }

Datamodel diff
Passing the whole data model again and again wastes bandwidth. We can be smarter about it. One approach is to send only the changed data: compute an object diff at the server, send that diff to the client, and apply it there with an object patching algorithm.

To be Designed
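
While the design is open, the diff-and-patch idea can be illustrated with a minimal sketch. It assumes the data model is plain JSON, replaces arrays and scalars wholesale, and does not handle deleted keys; the names diffModel and applyPatch are hypothetical.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Server side: collect only the keys whose values changed.
// Nested objects are diffed recursively; arrays and scalars are
// sent whole. Keys removed from newModel are not reported.
function diffModel(oldModel, newModel) {
  const diff = {};
  for (const key of Object.keys(newModel)) {
    const before = oldModel[key];
    const after = newModel[key];
    if (isPlainObject(before) && isPlainObject(after)) {
      const nested = diffModel(before, after);
      if (Object.keys(nested).length > 0) {
        diff[key] = nested;
      }
    } else if (JSON.stringify(before) !== JSON.stringify(after)) {
      diff[key] = after;
    }
  }
  return diff;
}

// Client side: merge a diff produced by diffModel into the local copy.
function applyPatch(model, diff) {
  for (const key of Object.keys(diff)) {
    if (isPlainObject(diff[key]) && isPlainObject(model[key])) {
      applyPatch(model[key], diff[key]);
    } else {
      model[key] = diff[key];
    }
  }
  return model;
}
```

For example, when only the mt field of segment1 is filled in, the diff contains just segments.segment1.mt rather than the whole article.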

Task manager
We will require a task manager at the server per client session (a job queue such as https://github.com/learnboost/kue is one example). This task manager will be responsible for:
 * 1) Creating asynchronous processes that contact external service providers, wait for the results and update the data model
 * 2) Intelligent scheduling to take care of API limitations
 * 3) Managing the pushing of data model updates to the client
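
The scheduling responsibility can be sketched as a pure planning step that spaces out requests per provider. This is an assumption about how limits might be expressed (a fixed number of calls per time window); the function name planRequests and the task fields are hypothetical, and a real task manager would sit on top of a job queue.

```javascript
// Give each task a start delay so that, for every provider, at most
// `limit` requests are started in each consecutive window of
// `windowMs` milliseconds. Tasks keep their original order.
function planRequests(tasks, limit, windowMs) {
  const countByProvider = new Map();
  return tasks.map((task) => {
    const alreadyPlanned = countByProvider.get(task.provider) || 0;
    countByProvider.set(task.provider, alreadyPlanned + 1);
    return {
      ...task,
      delayMs: Math.floor(alreadyPlanned / limit) * windowMs
    };
  });
}
```

The caller could then start each task with a timer set to its delayMs and push a data model update to the client when the result arrives.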

Displaying paragraph source text from data model

 * 1) Start with the segmentedContent for the document structure and markup
 * 2) For each segment:
 * Replace the content with markup from dictSource if available.
 * Visually annotate words with dictionary entries
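
The annotation step can be sketched as finding where dictionary entries occur in a segment. This sketch assumes a case-insensitive literal match on the first occurrence only, with no stemming; the function name findDictionaryMatches is hypothetical, while dictIds and the dictionary object follow the data model above.

```javascript
// For one segment, locate the dictionary entries that literally occur
// in the source text. Returns character ranges, earliest first; when
// two entries start at the same place, the longer one comes first.
function findDictionaryMatches(source, dictIds, dictionary) {
  const haystack = source.toLowerCase();
  const matches = [];
  for (const id of dictIds) {
    const term = dictionary[id].source.toLowerCase();
    const start = haystack.indexOf(term);
    if (start !== -1) {
      matches.push({ id, start, end: start + term.length });
    }
  }
  return matches.sort((a, b) => a.start - b.start || b.end - a.end);
}
```

The UI could use these ranges to highlight words and show the matching entries. Literal matching can also hit inside longer words, so a real implementation would need word boundaries and the morphology handling discussed in the Dictionary section.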

Constructing a paragraph translation text from data model

 * 1) Start with the segmentedContent for the document structure and markup
 * 2) For each segment:
 * If an MT is available, replace the segment content with the best MT. Otherwise, keep the source language text
 * Translate link target and text with data from Wikidata
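
The segment-level choice above can be sketched as follows, assuming "best MT" means the candidate with the highest confidence value in the data model. The function names and the fromMT flag are hypothetical, and link translation is left out.

```javascript
// Return the MT candidate with the highest confidence, or null when
// the segment has no MT results (mt is null or an empty list).
function bestMachineTranslation(segment) {
  if (!Array.isArray(segment.mt) || segment.mt.length === 0) {
    return null;
  }
  return segment.mt.reduce((best, candidate) =>
    (candidate.confidence > best.confidence ? candidate : best));
}

// Build the initial translation text for each segment: the best MT
// when there is one, otherwise the source language text.
function buildTranslationTemplate(segments) {
  return Object.keys(segments).map((id) => {
    const best = bestMachineTranslation(segments[id]);
    return {
      id,
      text: best ? best.target : segments[id].source,
      fromMT: best !== null
    };
  });
}
```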

Machine learning
We need to capture translators' improvements to the translation suggestions (for all translation aids). Because we allow free-text translation, the extent to which we can do this is limited, as explained in the segment alignment section of this document. But for the draft/final version of a translation, we need to capture the following in the CX system, to continuously improve it and potentially to provide the data to service providers:
 * 1) Dictionary corrections/improvements - Whether the user used CX suggestions or used a new meaning. Also, out of n suggestions, which suggestion was used.
 * 2) Link localization - TBD: Can we give this back to Wikidata?
 * 3) MT - Edited translations
 * 4) Translation memory - if suggestions are used, this affects the confidence score
 * 5) Segmentation corrections
 * TBD: Whether we are doing it automatically or when the user saves/publishes
 * TBD: How exactly we analyse the translation to create semantic information for feeding the above-mentioned systems

One potential outcome of this is parallel corpora for many language pairs, which are quite useful for any kind of morphological analysis.

Segment alignment
We allow free-text editing for translation. We can offer word meanings and translation templates. But since we allow users to summarize, to combine multiple sentences, and to leave things untranslated, we will not be able to learn the translation.

We need to align source and target segments. Example: translatedSentence24 is the translation of originalSentence58.

TODO: replace the word alignment diagram opposite with a suitable segment alignment diagram

This context is completely different from the structured translation we do with the Translate extension. The following types of edits are possible:
 * 1) Translator combines multiple segments into a single segment - like summarising
 * 2) Translator splits a single segment into multiple segments to construct simple sentences
 * 3) Translator leaves multiple segments untranslated
 * 4) Translator re-orders segments in a paragraph or even across multiple paragraphs
 * 5) Translator paraphrases content from multiple segments into another set of segments

The CX UI provides visual synchronization for segments wherever possible. From the UI perspective, word and sentence alignment will help user orientation, but it is not a critical component that will break the experience if it is not 100% perfect.

Alignment of words and sentences is useful to provide additional context to the user. When getting information on a word from the translation, it is good to have it highlighted in the source so that it becomes easier to see the context in which the word is used. This is not illustrated in the prototype, but it is something that Google Translate does, for example.

Pau thinks it is totally fine if the alignment does not work in all cases. For example, we may want to support it when the translation is done by modifying the initial template, but it is harder if the user starts from scratch, so we do not do it in the latter case.

The following best-effort approach is proposed:
 * 1) We assume the source article is annotated with identifiers for segments. When we use machine translation for the template, we can try copying these annotations to the template. The feasibility of this depends on several things: (a) the ability of MT providers to take HTML and return translations without losing these annotations; (b) mapping the link ids by using href as the key on our server side; (c) if no MT backend is available, we give the source article as the translation template, so there is no issue there.
 * 2) If the user edits the translation template, the segment markup may get destroyed. This can happen when two sentences are combined, or even just because the first word is deleted. But wherever the annotations remain, it should be easy for us to do the visual sync-up.
 * 3) If we do not get enough matching annotations (ids), we ignore them.
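
Points 2 and 3 can be sketched as a simple id comparison. The coverage threshold is an assumption (the document does not say how many matches are "enough"), and the names alignBySegmentIds and minCoverage are hypothetical.

```javascript
// Keep the source segment ids whose annotation survived in the
// translation. If too few survived, give up and align nothing.
// minCoverage is a fraction between 0 and 1 (hypothetical parameter).
function alignBySegmentIds(sourceIds, targetIds, minCoverage) {
  const surviving = new Set(targetIds);
  const aligned = sourceIds.filter((id) => surviving.has(id));
  const coverage = sourceIds.length === 0 ? 0 : aligned.length / sourceIds.length;
  return coverage >= minCoverage ? aligned : [];
}
```

The returned ids are the segments for which the UI can offer visual synchronization; all others are simply left unaligned.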

Over further iterations, this approach can be improved in multiple ways:
 * 1) Consider incorporating sentence alignment using minimal linguistic data for simple language pairs if possible. Example: English-Welsh (?)
 * 2) Consider preserving contenteditable nodes in a translation template so that we have segment ids to match (follow the VE team's developments in this area)

An approach to be evaluated: instead of making the whole translation column contenteditable, mark each segment as contenteditable when a translation template is inserted. This would prevent the destruction of segments.

Server Side caching
Redis: http://blog.stevenlu.com/2013/03/07/using-redis-for-caching-in-nodejs/

What to cache:
 * 1) Data model as a whole?
 * 2) Link localization to be shared across sessions
 * 3) Dictionaries (is it sensible to load them to Redis?)
 * 4) TBD
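
The caching pattern itself can be sketched without committing to what is cached. The sketch below uses an in-memory Map as a stand-in for Redis, purely as an assumption for illustration; the names createCache and cachedLookup and the key format are hypothetical.

```javascript
// A tiny cache with a time-to-live, standing in for Redis here.
// `now` is injectable so that expiry can be tested without waiting.
function createCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) {
        return undefined;
      }
      if (now() >= entry.expiresAt) {
        entries.delete(key); // expired: forget it
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    }
  };
}

// Cache-aside lookup: use the cached value when present,
// otherwise compute it once and remember it.
function cachedLookup(cache, key, compute) {
  const hit = cache.get(key);
  if (hit !== undefined) {
    return hit;
  }
  const value = compute(key);
  cache.set(key, value);
  return value;
}
```

A shared entry such as a localized link could use a key like 'link:en:es:Sea', so that every session translating the same link reuses one lookup.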

Client Side caching

 * 1) Not much to cache at client
 * 2) Code etc. cached by resource loader

Machine Translation

 * MT searches are stored per-segment
 * Many MT engines (e.g. Moses) can output translation confidence

MT Api design

 * Notes by David https://docs.google.com/a/wikimedia.org/document/d/115lECbLvkM3KTlrCCrkgivZ3yV3pzh2IeC767Sva79s/edit
 * http://blog.programmableweb.com/2013/01/15/63-translation-apis-bing-google-translate-and-google-ajax-language/ 63 Translation APIs: Bing, Google Translate and Google AJAX Language
 * http://www.microsoft.com/web/post/using-the-free-bing-translation-apis
 * Yandex API http://api.yandex.com/translate/ http://api.yandex.com/translate/doc/dg/reference/translate.xml

Dictionary

 * Google dictionary https://code.google.com/p/google-ajax-apis/issues/detail?id=16
 * There is one dictionary data object for the whole article.
 * To list entries on a per-segment level would get extremely repetitious.
 * Morphology: The dictionary data contains entries in citation form. However, matched text may contain inflected forms (note that “chemical” matched “chemicals” in the example above).
 * What we need is a pluggable architecture that can be used for “practical”, “minimal” stemmers
 * A stemmer that produces false positives is acceptable, because it is acceptable to occasionally suggest an incorrect meaning. It is also acceptable not to give a meaning for every word
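
A pluggable "minimal" stemmer could look like the sketch below. The registry shape and the single English rule (drop a trailing "s") are assumptions chosen only to reproduce the "chemical"/"chemicals" match; the names STEMMERS, stem and matchesEntry are hypothetical.

```javascript
// One small stemming function per language code. Languages without
// an entry fall back to plain lowercasing.
const STEMMERS = {
  // Deliberately crude: strips one trailing "s". It will produce
  // false positives ("gas" -> "ga"), which the notes above accept.
  en: (word) => word.toLowerCase().replace(/s$/, '')
};

function stem(word, lang) {
  const stemmer = STEMMERS[lang];
  return stemmer ? stemmer(word) : word.toLowerCase();
}

// A word in the text matches a dictionary entry when both
// reduce to the same stem.
function matchesEntry(word, entrySource, lang) {
  return stem(word, lang) === stem(entrySource, lang);
}
```

Adding a language then means adding one function to the registry, without touching the matching code.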

Link localization
This should be done with the help of Wikidata. The results should also be cached.
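
One way to look up the target title is the Wikidata wbgetentities API with a site filter. The sketch below only builds the request URL and reads the answer; it assumes the site id is the language code followed by "wiki" (not true for every wiki), and the function names are hypothetical. Fetching and caching are left out.

```javascript
// Build a wbgetentities request asking for the sitelink of `title`
// (on the source language Wikipedia) in the target language Wikipedia.
function sitelinkQueryUrl(sourceLang, targetLang, title) {
  const params = new URLSearchParams({
    action: 'wbgetentities',
    sites: `${sourceLang}wiki`, // assumption: site id is lang + "wiki"
    titles: title,
    props: 'sitelinks',
    sitefilter: `${targetLang}wiki`,
    format: 'json'
  });
  return `https://www.wikidata.org/w/api.php?${params}`;
}

// Read the localized title out of a parsed wbgetentities response.
// Returns null when no sitelink for the target wiki is present.
function localizedTitle(response, targetLang) {
  for (const entity of Object.values(response.entities || {})) {
    const link = entity.sitelinks && entity.sitelinks[`${targetLang}wiki`];
    if (link) {
      return link.title;
    }
  }
  return null;
}
```

For the glossary example, sitelinkQueryUrl('en', 'es', 'Sea') asks for the Spanish counterpart of the English article Sea.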

Proof of concept

 * POC: Client server communication model
 * POC: Socket.io based data model sync