Content translation/Developers/Markup

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered), or syntactic errors (if mappings are not one-to-one).

    $ echo 'legal persons ' | apertium en-es -f html
    Personas legales
    $ echo 'I am David' | apertium en-es -f html
    Soy  David

Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This document explains how this challenge is tackled in ContentTranslation.

Overview

 * HTML is converted into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
 * Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
 * The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
 * The alignment information is used to reapply markup to the translated text.
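
To make the linearized representation more concrete, here is a minimal Python sketch of text chunks that carry their inline markup as attributes. It is not the actual LinearDoc API: the names `Chunk`, `plain_text`, `annotated_ranges` and `slice_chunks` are hypothetical, and the choice of a bold tag on 'issue' is arbitrary. Ranges are half-open (start, end), as in Python slicing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A run of text plus the inline tags that apply to it (hypothetical type)."""
    text: str
    tags: tuple = ()


def plain_text(chunks):
    """The text that would be sent to the MT engine, with markup stripped."""
    return "".join(c.text for c in chunks)


def annotated_ranges(chunks):
    """Return (start, end, tags) for every chunk that carries inline markup."""
    ranges = []
    pos = 0
    for c in chunks:
        if c.tags:
            ranges.append((pos, pos + len(c.text), c.tags))
        pos += len(c.text)
    return ranges


def slice_chunks(chunks, start, end):
    """Return the chunks covering characters [start, end), keeping their tags."""
    out = []
    pos = 0
    for c in chunks:
        lo = max(start, pos)
        hi = min(end, pos + len(c.text))
        if lo < hi:
            out.append(Chunk(c.text[lo - pos:hi - pos], c.tags))
        pos += len(c.text)
    return out


doc = [Chunk("The "), Chunk("issue", ("b",)), Chunk(" was reported.")]
```

Slicing by character offset is a simple loop here, whereas the same operation on an HTML string would have to avoid cutting through tags.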

Types of failures in annotation transfer
Failure types (most serious first):
 * 1) Corrupt markup
 * 2) Wrongly placed annotations
 * 3) Missing annotations
 * 4) Split annotations

Deriving subsentence alignment from case change observations
Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.

    $ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
    this is |0-1| a |2-2| small |3-3| house |4-4|

However, many MT engines, including Apertium, do not offer this information directly. ContentTranslation tries to derive it by selectively changing certain portions of the text into upper case, and seeing how the translation changes.

Suppose we are trying to translate the following formatted text, in which 'issue' and 'registered blind' each carry inline markup:

    The issue was reported by a registered blind user.

First, ContentTranslation will obtain the translation for the plain text sentence:

    en: The issue was reported by a registered blind user.
    es: El asunto estuvo informado por un usuario ciego registrado.

Then, for each range of markup that occurs, it will obtain the translation of the sentence with that range upper cased:

    en: The ISSUE was reported by a registered blind user.
    es: El ASUNTO estuvo informado por un usuario ciego registrado.
    en: The issue was reported by a REGISTERED BLIND user.
    es: El asunto estuvo informado por un usuario CIEGO REGISTRADO.

It compares these translations, and uses the differences to calculate a partial range mapping, showing in this case (character offsets are zero-based and inclusive):
 * Characters 4-8 of the original text (i.e. 'issue') map to characters 3-8 of the translation (i.e. 'asunto').
 * Characters 28-43 of the original text (i.e. 'registered blind') map to characters 42-57 of the translation (i.e. 'ciego registrado').

This range mapping is used to apply formatting to the plain text translation:

    El asunto estuvo informado por un usuario ciego registrado.

Note that the only change made to the modified text was to turn portions to upper case. This is important, because it means the translation context is not being changed. Without the same context, phrase translations can change; e.g. Apertium translates 'registered blind' as 'Registrado ciego', which is different from the translation in the full context.
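
The case-change comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not ContentTranslation's implementation: `toy_translate` is a hypothetical stand-in for the MT engine that only knows the example sentence, and `map_range` is a hypothetical helper name. Ranges are half-open (start, end), as in Python slicing.

```python
SRC = "The issue was reported by a registered blind user."

# Hypothetical stand-in for an MT engine: (source word index, target word)
# pairs listed in target-language order. It handles only the example sentence.
_TOY_TABLE = [
    (0, "El"), (1, "asunto"), (2, "estuvo"), (3, "informado"), (4, "por"),
    (5, "un"), (8, "usuario"), (7, "ciego"), (6, "registrado"),
]


def toy_translate(sentence):
    """Translate the example sentence, reproducing upper case per word."""
    words = sentence.rstrip(".").split()
    out = [t.upper() if words[i].isupper() else t for i, t in _TOY_TABLE]
    return " ".join(out) + "."


def map_range(text, start, end, translate):
    """Map source range [start, end) to a target range, or None if not found."""
    plain = translate(text)
    modified = translate(text[:start] + text[start:end].upper() + text[end:])
    # The two translations must be identical apart from letter case.
    if len(plain) != len(modified) or plain.lower() != modified.lower():
        return None
    changed = [i for i, (a, b) in enumerate(zip(plain, modified)) if a != b]
    if not changed:
        return None  # nothing changed, e.g. the range was already upper case
    return changed[0], changed[-1] + 1
```

With the example sentence, `map_range(SRC, 4, 9, toy_translate)` yields the range of 'asunto', and `map_range(SRC, 28, 44, toy_translate)` yields the range of 'ciego registrado', even though the word order differs between the two languages.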

Limitations
 * Only language pairs in which both scripts distinguish upper and lower case can use this technique, so it won't work for Devanagari, Arabic, Chinese, ...
 * Uppercase-only phrases, like sentence-initial 'A' or 'HIV', cannot be mapped. (Note that lowercase or titlecase words can usually be upper-cased, but the reverse is not true: e.g. Apertium will not recognise 'britain' or 'hiv'.)
 * The MT engine must give the same translation modulo upper casing. This seems to be the case for Apertium, but not for the Google Translate API (which does not offer alignment info).
 * The MT engine must reproduce case in the target language.
 * The MT engine runs several times per sentence. This is not a showstopper because Apertium is pretty fast anyway.

However, note that if a mapping is not found for a particular phrase, ContentTranslation will fall back gracefully by simply not reapplying the corresponding markup. There is no risk of semantic change or syntactic error.
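
The graceful fallback can be illustrated with a short Python sketch: annotations whose target range could not be determined are simply left out, so the output is always well-formed. The function name `reapply_markup` and the (tag, range) input format are assumptions made for this example, and tags are reduced to bare element names with no attributes.

```python
from html import escape


def reapply_markup(translation, annotations):
    """Wrap mapped ranges of the translation in tags; skip unmapped ones.

    annotations is a list of (tag, target_range) pairs, where target_range
    is a half-open (start, end) tuple, or None if no mapping was found.
    """
    mapped = sorted((rng, tag) for tag, rng in annotations if rng is not None)
    out = []
    pos = 0
    for (start, end), tag in mapped:
        if start < pos:
            continue  # overlaps an earlier range: drop it rather than nest badly
        out.append(escape(translation[pos:start]))
        out.append(f"<{tag}>{escape(translation[start:end])}</{tag}>")
        pos = end
    out.append(escape(translation[pos:]))
    return "".join(out)
```

An annotation with no mapping costs only its own formatting; the translated text itself is never altered.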

Annotation mapping using translation subsequence approximation

 * 1) In the text to translate, find the text covered by each inline annotation (bold, italics, links, etc.). We call these pieces subsequences.
 * 2) Pass the full text and the subsequences to the plain text machine translation engine, separated by a delimiter so that the translated items can be mapped back to the source items (full text and subsequences) by array position.
 * 3) The translation of each subsequence should appear, at least approximately, somewhere in the translated full text. To locate it, use an approximate search algorithm.
 * 4) The approximate search algorithm returns the start position and length of the match. The annotation from the source HTML is mapped onto that range.
 * 5) The approximate match is computed from the edit distance between words of the translated full text and words of the translated subsequence. The search is not over character strings but over n-grams, where n is the number of words in the subsequence. Each word in the n-gram is matched independently.
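
Steps 3 to 5 can be sketched in Python as below. This is one possible reading, not the algorithm as implemented: it interprets "matched independently" as comparing each subsequence word with its closest word in the window regardless of order, and the names `edit_distance` and `find_subsequence`, as well as the `max_ratio` acceptance threshold, are assumptions introduced for illustration.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_subsequence(full, sub, max_ratio=0.5):
    """Return (start, length) of the best matching n-gram in full, or None."""
    words = [(m.start(), m.group().lower()) for m in re.finditer(r"\w+", full)]
    target = [w.lower() for w in re.findall(r"\w+", sub)]
    n = len(target)
    if n == 0 or n > len(words):
        return None
    best = None
    for i in range(len(words) - n + 1):
        window = [w for _, w in words[i:i + n]]
        # Each subsequence word is scored against its closest word in the window.
        cost = sum(min(edit_distance(t, w) for w in window) for t in target)
        if best is None or cost < best[0]:
            best = (cost, i)
    cost, i = best
    if cost > max_ratio * sum(len(t) for t in target):
        return None  # too different: report no match instead of guessing
    start = words[i][0]
    last_start, last_word = words[i + n - 1]
    return start, last_start + len(last_word) - start
```

For example, searching for 'Registrado ciego' in 'El asunto estuvo informado por un usuario ciego registrado.' locates 'ciego registrado' even though the two words appear in the opposite order.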

An analysis of this algorithm, with many examples, is available at http://etherpad.wikimedia.org/p/cx-markup-alignment