Content translation/Developers/Markup

In ContentTranslation, translators translate HTML content. The HTML contains all the markup that a typical Wikipedia article has, which means machine translation must operate on HTML content. However, not all MT engines support HTML input.

Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words:

 $ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
 this is |0-1| a |2-2| small |3-3| house |4-4|

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered) or syntactic errors (if mappings are not one-to-one):

 $ echo 'legal persons ' | apertium en-es -f html
 Personas legales

 $ echo 'I am David' | apertium en-es -f html
 Soy  David

Other MT engines exhibit similar problems. This makes it challenging to provide machine translation of formatted text. This document explains how this challenge is tackled in ContentTranslation.
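For engines that do expose alignment, the trace format above can be parsed mechanically into phrase/source-span pairs. Below is a minimal sketch in Python; the function name and the exact trace format assumed are ours, not part of ContentTranslation:

```python
import re

def parse_moses_alignment(trace):
    """Parse Moses '-t' trace output into (target phrase, source span) pairs.

    Assumes the format shown above: each target phrase is followed by
    |start-end|, the 0-based span of source tokens it was produced from.
    """
    pairs = []
    for m in re.finditer(r'(.+?)\s*\|(\d+)-(\d+)\|\s*', trace):
        pairs.append((m.group(1).strip(), (int(m.group(2)), int(m.group(3)))))
    return pairs

print(parse_moses_alignment('this is |0-1| a |2-2| small |3-3| house |4-4|'))
# → [('this is', (0, 1)), ('a', (2, 2)), ('small', (3, 3)), ('house', (4, 4))]
```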

As the examples above show, a machine translation engine can introduce the following errors in the translated HTML. The errors are listed in descending order of severity; all of them can cause a bad experience for translators.
 * 1) Corrupt markup - If the machine translation engine is unaware of the HTML structure, it can move HTML tags around unpredictably, corrupting the markup in the MT result.
 * 2) Wrongly placed annotations - The two Apertium examples above illustrate this. It is more severe if the content includes links and the link targets are swapped or placed arbitrarily in the MT output.
 * 3) Missing annotations - The MT engine may drop some tags during the translation process.
 * 4) Split annotations - During translation a single word can become more than one word. If the source word carries markup, will the MT engine apply the tag wrapping both words, or apply it to each word separately?

Apart from the potential issues with markup transfer, there is another aspect of sending HTML content to MT engines. Compared to the plain text version of a paragraph, the HTML version is bigger in size (bytes). Most of the extra bytes are tags and attributes, which should be unaffected by the translation, so sending them is unnecessary bandwidth usage. If the MT engine is metered (non-free, with API access measured and limited), it is also uneconomical.

Overview
ContentTranslation makes sure that MT engines translate only plain text, and that markup is applied as a post-MT processing step:
 * The input HTML content is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
 * Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
 * The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
 * The alignment information is used to reapply markup to the translated text.
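The steps above can be sketched in miniature. The real LinearDoc implementation (in the ContentTranslation server) is JavaScript and handles nested markup; the toy Python version below assumes a single level of inline tags and is only meant to show the idea of storing markup as attributes on a linear array of text chunks:

```python
import re

def linearize(html):
    """Split simple HTML (one level of inline tags) into (text, tags) chunks.

    This mimics the LinearDoc idea: inline markup becomes an attribute on
    a linear array of chunks instead of a nested tree.
    """
    chunks = []
    pos = 0
    for m in re.finditer(r'<(\w+)[^>]*>(.*?)</\1>', html):
        if m.start() > pos:
            chunks.append((html[pos:m.start()], []))   # unannotated text
        chunks.append((m.group(2), [m.group(1)]))      # annotated chunk
        pos = m.end()
    if pos < len(html):
        chunks.append((html[pos:], []))
    return chunks

def plain_text(chunks):
    """The markup-free text that would be sent to the MT engine."""
    return ''.join(text for text, _ in chunks)

chunks = linearize('The <b>issue</b> was reported by a <a>registered blind</a> user.')
print(chunks)
print(plain_text(chunks))  # → 'The issue was reported by a registered blind user.'
```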

To transfer the markup, we initially tried an algorithm based on observing case changes. To locate the translation of a word that may have been reordered in the translated text, the sentence is translated twice: as-is, and with that particular word upper-cased. Comparing the two outputs shows where the word's translation appears. This approach and its limitations are described first, followed by a more advanced algorithm.

Deriving subsentence alignment from case change observations
ContentTranslation tries to derive subsentence alignment by selectively changing certain portions of the text into upper case and seeing how the translation changes. Suppose we are trying to translate the following formatted text:

 The issue was reported by a registered blind user.

First, ContentTranslation will obtain the translation for the plain text sentence:

 en: The issue was reported by a registered blind user.
 es: El asunto estuvo informado por un usuario ciego registrado.

Then, for each range of markup that occurs, it will obtain the translation of the sentence with that range upper cased:

 en: The ISSUE was reported by a registered blind user.
 es: El ASUNTO estuvo informado por un usuario ciego registrado.

 en: The issue was reported by a REGISTERED BLIND user.
 es: El asunto estuvo informado por un usuario CIEGO REGISTRADO.

It compares these translations, and uses the differences to calculate a partial range mapping, showing in this case:
 * Characters 4-8 of the original text (i.e. 'issue') map to characters 3-8 of the translation (i.e. 'asunto').
 * Characters 28-44 of the original text (i.e. 'registered blind') map to characters 42-58 of the translation (i.e. 'ciego registrado').

This range mapping is used to apply formatting to the plain text translation:

 El asunto estuvo informado por un usuario ciego registrado.

Note that the only change made to the modified text was to turn portions into upper case. This is important, because it means the translation context is not being changed. Without the same context, phrase translations can change; e.g. Apertium translates 'registered blind' on its own as 'Registrado ciego', which is different from the translation in the full context.
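The comparison step can be illustrated with a character-level diff of the two translations. This is a sketch under the technique's own precondition that the two translations are identical apart from letter case; the function name is hypothetical:

```python
def find_uppercased_range(base, variant):
    """Locate the target-side range by diffing the plain translation against
    the translation of the sentence with one source range upper-cased.

    Assumes both translations are identical modulo letter case (one of the
    preconditions of the case-change technique). Returns a (start, end)
    half-open character range, or None if the technique cannot be applied,
    in which case the markup is simply not reapplied.
    """
    if len(base) != len(variant) or base.lower() != variant.lower():
        return None  # translations diverged; fall back gracefully
    diff = [i for i, (b, v) in enumerate(zip(base, variant)) if b != v]
    if not diff:
        return None
    return (diff[0], diff[-1] + 1)

base = 'El asunto estuvo informado por un usuario ciego registrado.'
upper = 'El ASUNTO estuvo informado por un usuario ciego registrado.'
print(find_uppercased_range(base, upper))  # → (3, 9), i.e. 'asunto'
```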

Limitations
Note that if a mapping is not found for a particular phrase, ContentTranslation falls back gracefully by simply not reapplying the corresponding markup; there is no risk of semantic change or syntactic error. The technique nevertheless has several limitations:
 * Only language pairs whose scripts have case can use this technique, so it won't work for Devanagari, Arabic, Chinese, etc.
 * Uppercase-only phrases, like sentence-initial 'A' or 'HIV', cannot be mapped. (Lowercase/titlecase words can usually be upper-cased, but not the reverse; e.g. Apertium will not recognise 'britain' or 'hiv'.)
 * The MT engine must give the same translation modulo upper-casing. This seems to be the case for Apertium, but not for the Google Translate API (which also does not offer alignment information).
 * The MT engine must reproduce case in the target language.
 * The MT engine runs several times per sentence. This is not a showstopper, because Apertium is quite fast.

Annotation mapping using translation subsequence approximation

 * 1) For the text to translate, find the text of inline annotations such as bold, italics, and links. We call these subsequences.
 * 2) Pass the full text and the subsequences to the plain text machine translation engine, using a delimiter so that we can map between the source items (full text and subsequences) and the translated items.
 * 3) The translated full text will contain the translations of the subsequences somewhere in the text. To locate a subsequence translation in the full text translation, use an approximate search algorithm.
 * 4) The approximate search algorithm returns the start position and length of the match. To that range we map the annotation from the source HTML.
 * 5) The approximate match involves calculating the edit distance between words in the translated full text and the translated subsequence. It is not strings that are searched, but n-grams with n = the number of words in the subsequence. Each word in the n-gram is matched independently.
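The steps above can be sketched as a word-level sliding window: compare the translated subsequence against every n-word window of the translated full text using per-word edit distance, and pick the cheapest window. This is an illustrative Python version, not the production implementation; the function names are ours:

```python
def edit_distance(a, b):
    """Levenshtein distance between two words (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def approximate_find(subseq, full_text):
    """Find the n-word window of full_text best matching the translated
    subsequence; each word of the n-gram is matched independently.

    Returns (start word index, matched words) or None.
    """
    sub_words = subseq.split()
    words = full_text.split()
    n = len(sub_words)
    best = None
    for i in range(len(words) - n + 1):
        cost = sum(edit_distance(s, w)
                   for s, w in zip(sub_words, words[i:i + n]))
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return None
    return best[1], ' '.join(words[best[1]:best[1] + n])

print(approximate_find('ciego registrado',
                       'El asunto estuvo informado por un usuario ciego registrado.'))
# → (7, 'ciego registrado.') — punctuation sticks to the last word under
# whitespace tokenisation, which is why the match must be approximate.
```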

An analysis of this algorithm, with many examples, is available at http://etherpad.wikimedia.org/p/cx-markup-alignment