Content translation/Technical Architecture

Abbreviations and glossary

 * 1) CX - Content Translation
 * 2) MT - Machine Translation
 * 3) TM - Translation memory
 * 4) Segment - Smallest unit of text which is fairly self-contained grammatically. This usually means a sentence, a title, a phrase in a bulleted list, etc.
 * 5) Segmentation algorithm - rules to split a paragraph into segments. Weakly language-dependent (sensible default rules work quite well for many languages).
 * 6) Parallel bilingual text - two versions of the same content, each written in a different language.
 * 7) Sentence alignment - matching corresponding sentences in parallel bilingual text. In general this is a many-many mapping, but it is approximately one-one if the texts are quite strict translations.
 * 8) Word alignment - matching corresponding words in parallel bilingual text. This is strongly many-many.
 * 9) Lemmatization - closely related to stemming, though not identical to it. Mapping multiple grammatical variants of the same word to a root form; e.g. (swim, swims, swimming, swam, swum) -> swim. Derivational variants are not usually mapped to the same form (so happiness !-> happy).
 * 10) Morphological analysis - mapping words into morphemes, e.g. swims -> swim/3rdperson_present
 * 11) Service providers - External systems which provide MT/TM/Glossary services. Example: Google
 * 12) Translation tools (also called translation support tools, translation aids, or translation support) - Context-aware tools such as MT, dictionaries, and link localization.
 * 13) Link localization - Converting a wiki article link from one language to another language with the help of Wikidata. Example: http://en.wikipedia.org/wiki/Sea becomes http://es.wikipedia.org/wiki/Mar
 * 14) Redis http://en.wikipedia.org/wiki/Redis
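
Link localization (item 13) can be illustrated with a minimal sketch. It assumes the sitelinks of a Wikidata item are already available; here a small in-memory table (SAMPLE_ITEMS, a hypothetical stand-in) replaces the actual Wikidata lookup. The function names and the "<lang>wiki" site-id convention are assumptions for illustration, and the convention is a simplification that does not hold for every language code.

```javascript
// Hypothetical in-memory stand-in for Wikidata sitelinks (one object per item).
const SAMPLE_ITEMS = [
  { enwiki: 'Sea', eswiki: 'Mar' },
];

// Find the target-language title for a source title, or null if unknown.
function lookupTitle(sourceLang, title, targetLang, items = SAMPLE_ITEMS) {
  const item = items.find((it) => it[`${sourceLang}wiki`] === title);
  return item ? item[`${targetLang}wiki`] || null : null;
}

// Rewrite a Wikipedia article URL to point at the target-language article.
function localizeLink(url, targetLang, lookup = lookupTitle) {
  const parsed = new URL(url);
  const match = /^([a-z-]+)\.wikipedia\.org$/.exec(parsed.hostname);
  if (!match || !parsed.pathname.startsWith('/wiki/')) return null;
  const sourceTitle = decodeURIComponent(
    parsed.pathname.slice('/wiki/'.length)
  ).replace(/_/g, ' ');
  const targetTitle = lookup(match[1], sourceTitle, targetLang);
  if (!targetTitle) return null; // no sitelink: leave the decision to the caller
  const path = encodeURIComponent(targetTitle.replace(/ /g, '_'));
  return `${parsed.protocol}//${targetLang}.wikipedia.org/wiki/${path}`;
}

console.log(localizeLink('http://en.wikipedia.org/wiki/Sea', 'es'));
```

Returning null when no sitelink exists lets the caller decide whether to keep the source link or mark the link as missing.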

Introduction
This document captures the current architectural understanding of the Content Translation project. It will evolve as the team explores each component in more depth.

Architecture considerations

 * 1) The translation aids that we want to provide are mainly from third party service providers, or otherwise external to the CX software itself.
 * 2) Depending on the capabilities of the external system, varying delays can be expected. This calls for server-side optimization of calls to service provider APIs, such as:
   * caching their results
   * scheduling requests to make better use of capacity and to stay within API usage limits
 * 3) We are providing a free-to-edit interface, so we should not block any edits (translations, in the context of CX) if the user decides not to use translation aids. We can prepare the translation aids incrementally and offer them in context as they become available.
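
The two server-side optimizations named above (caching and staying within API usage limits) could be sketched as follows. This is an illustration only: a Map stands in for a shared store such as Redis, the quota is a simple fixed-window counter, and the names cached and makeQuota are hypothetical.

```javascript
// Wrap a provider call so repeated requests for the same key are served
// from memory until the entry expires. `now` is injectable for testing.
function cached(fetchFn, ttlMs, now = Date.now) {
  const store = new Map(); // key -> { value, expires }
  return function (key) {
    const hit = store.get(key);
    if (hit && hit.expires > now()) return hit.value;
    const value = fetchFn(key); // may be a Promise; the Promise itself is cached
    store.set(key, { value, expires: now() + ttlMs });
    // Do not keep failed requests in the cache.
    if (value && typeof value.then === 'function') {
      value.then(null, () => store.delete(key));
    }
    return value;
  };
}

// Fixed-window quota: allow at most `limit` calls per `windowMs`.
function makeQuota(limit, windowMs, now = Date.now) {
  let windowStart = now();
  let used = 0;
  return function tryAcquire() {
    if (now() - windowStart >= windowMs) {
      windowStart = now();
      used = 0;
    }
    if (used >= limit) return false;
    used += 1;
    return true;
  };
}
```

A caller would check tryAcquire() before contacting a provider and queue or skip the request when it returns false. Because the edit interface never waits on translation aids, skipping is always safe.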

Server communication

 * 1) We can provide translation aids for the content in increments as and when we receive them from service providers. We do not need to wait for all service providers to return data before proceeding.
 * 2) For large articles, we will need some kind of paging mechanism to deliver this data in batches.
 * 3) This means that the client and server should communicate at regular intervals, or whenever data becomes available at the server. We considered traditional HTTP pull methods (Ajax) and push communication (WebSockets).
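
The points above (incremental delivery, batching, push as data arrives) can be sketched with Node.js built-ins only. An EventEmitter stands in for a Socket.io connection, and the event names 'aids' and 'aids-failed', the provider names, and the function names are all hypothetical.

```javascript
const { EventEmitter } = require('events');

// Split the article's segments into batches of at most `size` items.
function* batches(segments, size) {
  for (let i = 0; i < segments.length; i += size) {
    yield segments.slice(i, i + size);
  }
}

// Ask every provider about every batch and push each answer as soon as it
// arrives, without waiting for the slower providers.
function pushAids(channel, segments, providers, batchSize) {
  const pending = [];
  for (const batch of batches(segments, batchSize)) {
    for (const [name, fetchAids] of Object.entries(providers)) {
      pending.push(
        Promise.resolve()
          .then(() => fetchAids(batch))
          .then((aids) => channel.emit('aids', { provider: name, batch, aids }))
          .catch((err) =>
            channel.emit('aids-failed', { provider: name, batch, reason: err.message })
          )
      );
    }
  }
  return Promise.all(pending);
}

// Demo with two fake providers that answer after different delays.
const delay = (ms, value) => new Promise((resolve) => setTimeout(() => resolve(value), ms));
const channel = new EventEmitter();
channel.on('aids', (msg) => console.log(msg.provider, msg.aids));
pushAids(channel, ['s1', 's2', 's3'], {
  fastMT: (batch) => delay(5, batch.map((s) => `mt(${s})`)),
  slowTM: (batch) => delay(20, batch.map((s) => `tm(${s})`)),
}, 2);
```

With a real push transport the emit calls would become socket messages. With HTTP pull, the same batches would instead be held on the server until the client's next poll.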

Client side

 * 1) jQuery
 * 2) Socket.io client
 * 3) contenteditable
 * 4) LESS, Grid, HTML templating?

Server side

 * 1) Node.js with
   * Socket.io
   * the Node.js built-in cluster module or http://learnboost.github.com/cluster/
   * express
 * 2) Redis http://en.wikipedia.org/wiki/Redis Also see MediaWiki's usage of Redis: https://www.mediawiki.org/wiki/Redis WMF uses Redis in production.
 * 3) A proxy server such as Apache or Nginx

Architecture diagram
https://docs.google.com/a/wikimedia.org/drawings/d/1Cbvq2mvmLWmfsXpfUBXGdPXYcLNE3viTRsMVN3Re_EU

Node instances
Instances are managed with the Node.js built-in cluster module. The code is currently borrowed from Parsoid. The cluster forks Express server instances depending on the number of processors available on the system. It also forks new processes to replace instances that were killed or that exited on their own.

This approach uses the Node.js built-in cluster module. A better alternative may be the cluster node module from the socket.io developers: http://learnboost.github.io/cluster/
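
A minimal sketch of this kind of instance management, using the built-in cluster module, might look like the following. It is not the Parsoid-derived code. The function name superviseWorkers, the environment variable CX_RUN_CLUSTER_DEMO, and port 8000 are assumptions made for illustration, and a plain http server stands in for the Express app.

```javascript
const os = require('os');

// `clusterApi` is expected to behave like Node's built-in `cluster` module:
// it offers fork() and emits 'exit' whenever a worker process ends.
function superviseWorkers(clusterApi, workerCount = os.cpus().length) {
  for (let i = 0; i < workerCount; i += 1) {
    clusterApi.fork();
  }
  clusterApi.on('exit', () => {
    // Replace the worker regardless of why it ended (crash, kill, own exit).
    clusterApi.fork();
  });
  return workerCount;
}

// Only start real processes when explicitly asked to (hypothetical switch).
if (process.env.CX_RUN_CLUSTER_DEMO === '1') {
  const cluster = require('cluster');
  const http = require('http');
  if (cluster.isWorker) {
    // Each worker runs its own server; the port is shared between workers.
    http.createServer((req, res) => res.end(`worker ${process.pid}\n`)).listen(8000);
  } else {
    superviseWorkers(cluster);
  }
}
```

Passing the cluster object in as a parameter keeps the re-fork logic testable without spawning processes.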

Load balancing
When the app is scaled out in a cluster environment, the load balancer distributes requests across different node instances. This breaks Socket.io, because a request can reach an instance with which that client's socket has not completed its handshake (is not authenticated).

For such situations, load balancers have a feature called ‘sticky sessions’, also known as ‘session affinity’. The idea is that, if this property is set, all requests following the first load-balanced request go to the same server instance.

 app.use(cookieParser);
 app.use(express.session({ store: sessionStore, key: 'cxsessionid', secret: 'your secret here' }));

 * 1) Express sets a session cookie with name cxsessionid.
 * 2) When socket.io connects, it uses that same cookie and hits the load balancer.
 * 3) The load balancer always routes it to the same server that the cookie was set in.
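
To make the routing step concrete, here is a sketch of how a balancer could map the cxsessionid cookie to a backend. It is an assumption-laden illustration, not the behaviour of any particular load balancer. It hashes the session id, so a session always lands on the same instance, but that instance is not necessarily the one that first set the cookie; this therefore relies on a shared session store. The function names and backend addresses are hypothetical.

```javascript
const crypto = require('crypto');

// Extract the value of one cookie from a raw Cookie header.
function sessionIdFromCookie(cookieHeader, name = 'cxsessionid') {
  for (const part of (cookieHeader || '').split(';')) {
    const [key, ...rest] = part.trim().split('=');
    if (key === name) return rest.join('=');
  }
  return null;
}

// Choose a backend: stable per session id, random when there is no session yet.
function pickBackend(cookieHeader, backends) {
  const id = sessionIdFromCookie(cookieHeader);
  if (id === null) {
    return backends[Math.floor(Math.random() * backends.length)];
  }
  const digest = crypto.createHash('sha256').update(id).digest();
  return backends[digest.readUInt32BE(0) % backends.length];
}

// Hypothetical backend addresses.
const backends = ['127.0.0.1:8001', '127.0.0.1:8002', '127.0.0.1:8003'];
console.log(pickBackend('lang=en; cxsessionid=abc123', backends));
```

In practice, proxies such as Nginx or Apache offer their own affinity mechanisms, which would normally be preferred over custom routing code.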

Reference technology stacks:
 * 1) Trello http://blog.fogcreek.com/the-trello-tech-stack
 * 2) WMF Parsoid project

Security
To be designed

WMF Infrastructure
To be designed. https://wikitech.wikimedia.org/wiki/Parsoid gives some idea

The content itself might be sensitive (private wikis) and thus should not always be shared through translation memory. We also need to restrict access to service providers, especially if they are paid ones.

Article preprocessing
We take a concrete example to work out the details of the data model: the Hydrogen article from English Wikipedia, https://en.wikipedia.org/wiki/Hydrogen

Article - Original
From https://en.wikipedia.org/wiki/Hydrogen

Hydrogen is a chemical element with chemical symbol H and atomic number 1. With an atomic weight of 1.00794 u, hydrogen is the lightest element on the periodic table. Its monatomic form (H) is the most abundant chemical substance in the universe, constituting roughly 75% of all baryonic mass.[7][note 1] Non-remnant stars are mainly composed of hydrogen in its plasma state. The most common isotope of hydrogen, termed protium (name rarely used, symbol 1H), has a single proton and zero neutrons.

The universal emergence of atomic hydrogen first occurred during the recombination epoch. At standard temperature and pressure, hydrogen is a colorless, odorless, tasteless, non-toxic, nonmetallic, highly combustible diatomic gas with the molecular formula H2. Since hydrogen readily forms covalent compounds with most non-metallic elements, most of the hydrogen on Earth exists in molecular forms such as in the form of water or organic compounds. Hydrogen plays a particularly important role in acid–base reactions. In ionic compounds, hydrogen can take the form of a negative charge (i.e., anion) known as a hydride, or as a positively charged (i.e., cation) species denoted by the symbol H+. The hydrogen cation is written as though composed of a bare proton, but in reality, hydrogen cations in ionic compounds are always more complex species than that would suggest.

As the simplest atom known, the hydrogen atom has had considerable theoretical application. For example, the hydrogen atom is the only neutral atom with an analytic solution to the Schrödinger equation. Hydrogen gas was first artificially produced in the early 16th century, via the mixing of metals with acids. In 1766–81, Henry Cavendish was the first to recognize that hydrogen gas was a discrete substance,[8] and that it produces water when burned, a property which later gave it its name: in Greek, hydrogen means "water-former".

Industrial production is mainly from the steam reforming of natural gas, and less often from more energy-intensive hydrogen production methods like the electrolysis of water.[9] Most hydrogen is employed near its production site, with the two largest uses being fossil fuel processing (e.g., hydrocracking) and ammonia production, mostly for the fertilizer market.

Hydrogen is a concern in metallurgy as it can embrittle many metals,[10] complicating the design of pipelines and storage tanks.[11]
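
As a first step towards a data model, the paragraphs above can be split into segments (see the glossary). The sketch below uses one naive default rule, assumed here purely for illustration: split after '.', '!' or '?' when followed by whitespace and an uppercase letter. The function name and the { id, text } shape are hypothetical, not a decided format.

```javascript
// Naive default segmentation rule: a segment ends at sentence-final
// punctuation that is followed by whitespace and an uppercase letter.
function segment(paragraph) {
  return paragraph
    .split(/(?<=[.!?])\s+(?=[A-Z])/)
    .map((text) => text.trim())
    .filter((text) => text.length > 0)
    .map((text, index) => ({ id: index, text }));
}

const sample =
  'Hydrogen is a chemical element with chemical symbol H and atomic number 1. ' +
  'With an atomic weight of 1.00794 u, hydrogen is the lightest element on the periodic table.';

for (const s of segment(sample)) {
  console.log(s.id, s.text);
}
```

The rule does not split inside "1.00794", because no whitespace follows that period. It also misses boundaries where a citation marker directly follows the period, as in "mass.[7][note 1] Non-remnant", and it would wrongly split after an abbreviation that is followed by a capitalised word, so real rules would need to handle such cases.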