Content translation/Developers/Markup

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered), or syntactic errors (if mappings are not one-to-one).

 $ echo 'legal persons ' | apertium en-es -f html
 Personas legales
 $ echo 'I am David' | apertium en-es -f html
 Soy  David

Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This document explains how this challenge is tackled in ContentTranslation.

Overview

 * HTML is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
 * Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
 * The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
 * The alignment information is used to reapply markup to the translated text.
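
The linearized representation described above can be illustrated with a minimal Python sketch. This is not ContentTranslation's actual LinearDoc implementation: the `Chunk` class, the helper names, and the `b` tag in the test data are all assumptions made for illustration. It only shows why a flat array of annotated text chunks makes slicing straightforward.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Chunk:
    """A run of text plus the inline tags that wrap it (hypothetical model)."""
    text: str
    tags: tuple = ()  # outermost tag first, e.g. ("a", "b")


def plain_text(chunks):
    """Concatenate chunk text, dropping all inline markup."""
    return "".join(c.text for c in chunks)


def slice_chunks(chunks, start, end):
    """Return the chunks covering plain-text characters [start, end)."""
    out, pos = [], 0
    for c in chunks:
        lo, hi = max(start, pos), min(end, pos + len(c.text))
        if lo < hi:
            # Keep the annotations of the chunk the text came from.
            out.append(Chunk(c.text[lo - pos:hi - pos], c.tags))
        pos += len(c.text)
    return out


def to_html(chunks):
    """Serialize chunks back to HTML, wrapping each chunk in its tags."""
    parts = []
    for c in chunks:
        text = escape(c.text)
        for tag in reversed(c.tags):  # innermost tag is applied first
            text = f"<{tag}>{text}</{tag}>"
        parts.append(text)
    return "".join(parts)


# Illustrative data: the bold annotation on "issue" is an assumption.
doc = [Chunk("The "), Chunk("issue", ("b",)), Chunk(" was reported")]
```

Slicing the plain-text range 2 to 7 of `doc` yields two chunks, the second of which still carries its bold annotation, without any string surgery on HTML.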

Types of failures in annotation transfer
Failure types (most serious first):
 * 1) Corrupt markup
 * 2) Wrongly placed annotations
 * 3) Missing annotations
 * 4) Split annotations

Deriving subsentence alignment from case change observations
Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.

 echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
 this is |0-1| a |2-2| small |3-3| house |4-4|

However, many MT engines, including Apertium, do not offer this information directly. ContentTranslation tries to derive it by selectively changing certain portions of the text into upper case, and seeing how the translation changes.

Suppose we are trying to translate the following formatted text:

 The issue was reported by a registered blind user.

First, ContentTranslation will obtain the translation for the plain text sentence:

 en: The issue was reported by a registered blind user.
 es: El asunto estuvo informado por un usuario ciego registrado.

Then, for each range of markup that occurs, it will obtain the translation of the sentence with that range upper cased:

 en: The ISSUE was reported by a registered blind user.
 es: El ASUNTO estuvo informado por un usuario ciego registrado.
 en: The issue was reported by a REGISTERED BLIND user.
 es: El asunto estuvo informado por un usuario CIEGO REGISTRADO.

It compares these translations, and uses the differences to calculate a partial range mapping, showing in this case (counting characters from 0, with both ends inclusive):
 * Characters 4-8 of the original text (i.e. 'issue') map to characters 3-8 of the translation (i.e. 'asunto').
 * Characters 28-43 of the original text (i.e. 'registered blind') map to characters 42-57 of the translation (i.e. 'ciego registrado').

This range mapping is used to apply formatting to the plain text translation:

 El asunto estuvo informado por un usuario ciego registrado.

Note that the only change made to the modified text was to turn portions to upper case. This is important, because it means the translation context is not being changed. Without the same context, phrase translations can change; e.g. Apertium translates 'registered blind' as 'Registrado ciego', which is different from the translation in the full context.
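
The comparison step can be sketched in Python as follows. This is a simplified illustration, not ContentTranslation's code: `changed_range`, `derive_mapping` and the `translate` callable are hypothetical names, and the MT engine is replaced by a lookup table holding the example translations quoted in this section. The sketch uses Python's half-open ranges, so each end index is one past the last character of the span.

```python
def changed_range(plain, modified):
    """Return (start, end) of the span that differs only by casing.

    Returns None when the two translations are identical, or when they
    differ in anything other than case (the context changed).
    """
    if len(plain) != len(modified) or plain.lower() != modified.lower():
        return None
    diffs = [i for i, (a, b) in enumerate(zip(plain, modified)) if a != b]
    if not diffs:
        return None
    # Spaces inside the span do not change case, so take first to last.
    return diffs[0], diffs[-1] + 1


def derive_mapping(source, ranges, translate):
    """Map each source range to a target range via upper-case probes.

    `translate` is any callable from source text to target text
    (an assumption; a real client would call the MT engine here).
    """
    plain = translate(source)
    mapping = {}
    for start, end in ranges:
        probe = source[:start] + source[start:end].upper() + source[end:]
        target = changed_range(plain, translate(probe))
        if target is not None:  # otherwise the markup is simply not reapplied
            mapping[(start, end)] = target
    return plain, mapping


# Stand-in for the MT engine, using the translations quoted in this section.
EXAMPLE = {
    "The issue was reported by a registered blind user.":
        "El asunto estuvo informado por un usuario ciego registrado.",
    "The ISSUE was reported by a registered blind user.":
        "El ASUNTO estuvo informado por un usuario ciego registrado.",
    "The issue was reported by a REGISTERED BLIND user.":
        "El asunto estuvo informado por un usuario CIEGO REGISTRADO.",
}
SOURCE = "The issue was reported by a registered blind user."
plain, mapping = derive_mapping(SOURCE, [(4, 9), (28, 44)], EXAMPLE.__getitem__)
# mapping == {(4, 9): (3, 9), (28, 44): (42, 58)}
```

Ranges for which no clean case-only difference is found are left out of the mapping, which corresponds to dropping that markup rather than guessing its position.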

Limitations
 * Only language pairs whose scripts have letter case can use this technique, so it won't work for Devanagari, Arabic, Chinese, ...
 * Uppercase-only phrases, like sentence-initial 'A' or 'HIV', cannot be mapped. (Note that lowercase/titlecase words can usually be upper-cased, but not the reverse; e.g. Apertium will not recognise 'britain' or 'hiv'.)
 * The MT engine must give the same translation modulo upper casing. This seems to be the case for Apertium, but not for the Google Translate API (which does not offer alignment info).
 * The MT engine must reproduce case in the target language.
 * The MT engine runs several times per sentence. This is not a showstopper, because Apertium is fast.

However, if no mapping is found for a particular phrase, ContentTranslation falls back gracefully: it simply does not reapply the corresponding markup. There is no risk of semantic change or syntactic error.
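
The first two limitations can be detected before any request is sent to the MT engine. The check below is a hedged sketch (the function name `can_probe_with_case` is an assumption, not part of ContentTranslation): a phrase is only worth probing if upper-casing it actually changes it.

```python
def can_probe_with_case(phrase):
    """True if upper-casing the phrase produces an observable change."""
    return phrase.upper() != phrase


# Phrases in caseless scripts and already-uppercase phrases are skipped.
candidates = ["issue", "Britain", "HIV", "A", "हिन्दी", "中文"]
probeable = [p for p in candidates if can_probe_with_case(p)]
```

Phrases that fail this check would fall through to the graceful fallback, i.e. their markup is not reapplied.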

Annotation mapping using translation subsequence approximation

 * 1) Do not pass HTML to Apertium at all for language pairs that differ significantly in word order. For example, for es-ca pass HTML; for en-es do not pass HTML, and instead pass the plain text version of the content to be translated.
 * 2) Along with the plain text version, pass the text of the inline markup. See the examples above. The subsentences listed for each full sentence are the extra content we pass to Apertium: they are the text content of the inline markup, following the tag hierarchy.
 * 3) To apply markup to a source word sequence, get its translation and find it in the full sentence translation.
 * 4) If an exact match is found, we are done.
 * 5) Otherwise, since exact subsentence matching fails when the translated subsequence has changed (for example, because it is inflected), we need approximate pattern matching. Edit distance (Levenshtein distance) can be used, with a threshold tuned for the language. We may also want to couple this with a language-dependent check such as 'first letter is the same'. See examples at http://etherpad.wikimedia.org/p/cx-markup-alignment
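
Steps 3 to 5 can be sketched in Python as below. This is one possible reading of the approach, not the implemented algorithm: the function names, the word-window search, the `max_ratio` threshold of 0.3 and the optional first-letter check are assumptions, and the phrase 'usuarios ciegos' in the usage data is an invented inflected variant, not real Apertium output.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_approximate(phrase, sentence, max_ratio=0.3, same_initial=False):
    """Find the word window of `sentence` closest to `phrase`.

    Returns (start, end) character offsets, or None if no window is
    within max_ratio * len(phrase) edits. Assumes the match has the
    same number of words as the phrase.
    """
    exact = sentence.find(phrase)
    if exact != -1:  # step 4: exact match, nothing more to do
        return exact, exact + len(phrase)
    words = list(re.finditer(r"\S+", sentence))
    n = len(phrase.split())
    if n == 0:
        return None
    best = None
    for i in range(len(words) - n + 1):
        start, end = words[i].start(), words[i + n - 1].end()
        candidate = sentence[start:end]
        if same_initial and candidate[:1].lower() != phrase[:1].lower():
            continue
        dist = levenshtein(phrase.lower(), candidate.lower())
        if best is None or dist < best[0]:
            best = (dist, start, end)
    if best is not None and best[0] <= max_ratio * len(phrase):
        return best[1], best[2]
    return None


sentence = "El asunto estuvo informado por un usuario ciego registrado."
span = find_approximate("usuarios ciegos", sentence)  # invented inflected form
```

The threshold is expressed relative to the phrase length so that short phrases do not match arbitrary words; the right value would need tuning per language, as noted in step 5.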

An analysis of this algorithm, illustrated with many examples, is available at http://etherpad.wikimedia.org/p/cx-markup-alignment