Content translation/Developers/Markup

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes where words are reordered, and to syntactic errors where mappings are not one-to-one.

 $ echo 'legal persons ' | apertium en-es -f html
 Personas legales

 $ echo 'I am David' | apertium en-es -f html
 Soy  David

Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This document explains how this challenge is tackled in ContentTranslation.

Overview

 * HTML is converted into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
 * Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
 * The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
 * The alignment information is used to reapply markup to the translated text.
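
The first two steps above can be sketched roughly as follows. This is only an illustration of the idea of a linear array of annotated text chunks, not ContentTranslation's actual LinearDoc code: the names `Lineariser`, `linearise` and `plain_text` are hypothetical, the `<b>` and `<i>` tags in the sample are assumed for the example, and Python's standard `html.parser` stands in for whatever parser the real implementation uses.

```python
from html.parser import HTMLParser


class Lineariser(HTMLParser):
    """Flatten inline HTML into (text, open_tags) chunks (toy stand-in for LinearDoc)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # inline tags currently open, outermost first
        self.chunks = []     # linear array of (text, tuple_of_open_tags)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Each run of text remembers which tags were open around it.
        self.chunks.append((data, tuple(self.open_tags)))


def linearise(html):
    parser = Lineariser()
    parser.feed(html)
    parser.close()
    return parser.chunks


def plain_text(chunks):
    """The markup-free sentence that would be sent to the MT engine."""
    return "".join(text for text, _ in chunks)


chunks = linearise("In the <b>new year</b> the <i>winning team</i> will be announced.")
sentence = plain_text(chunks)
```

Because the markup lives beside the text rather than inside it, slicing or reordering the chunks never produces unbalanced tags, which is the property the linearized format is chosen for.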

Deriving subsentence alignment from case change observations
Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.

 $ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
 this is |0-1| a |2-2| small |3-3| house |4-4|

However, many MT engines, including Apertium, do not offer this information directly. ContentTranslation tries to derive it by selectively changing certain portions of the text into upper case, and seeing how the translation changes.

Suppose we are trying to translate the following formatted text, in which "new year", "winning team" and "announced" each carry inline markup:

 In the new year the winning team will be announced.

First, ContentTranslation will obtain the translation for the plain text sentence:

 en: In the new year the winning team will be announced.
 es: En el año nuevo el equipo ganador será anunciado.

Then, for each range of markup that occurs, it will obtain the translation of the sentence with that range upper cased:

 en: In the NEW YEAR the winning team will be announced.
 es: En el AÑO NUEVO el equipo ganador será anunciado.

 en: In the new year the WINNING TEAM will be announced.
 es: En el año nuevo el EQUIPO GANADOR será anunciado.

 en: In the new year the winning team will be ANNOUNCED.
 es: En el año nuevo el equipo ganador será ANUNCIADO.
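
The case-change technique can be sketched as below. This is a simplified illustration under stated assumptions, not the ContentTranslation implementation: all function names are hypothetical, `translate` is a stub that replays the translations quoted above instead of calling a real MT engine, the comparison is done per character rather than per word, and the `b`, `i` and `u` tags are invented for the example.

```python
# Stub MT engine: replays the en-es pairs quoted in this document.
CANNED = {
    "In the new year the winning team will be announced.":
        "En el año nuevo el equipo ganador será anunciado.",
    "In the NEW YEAR the winning team will be announced.":
        "En el AÑO NUEVO el equipo ganador será anunciado.",
    "In the new year the WINNING TEAM will be announced.":
        "En el año nuevo el EQUIPO GANADOR será anunciado.",
    "In the new year the winning team will be ANNOUNCED.":
        "En el año nuevo el equipo ganador será ANUNCIADO.",
}


def translate(text):
    return CANNED[text]


def span_of(text, phrase):
    start = text.index(phrase)
    return start, start + len(phrase)


def find_upper_range(baseline, probe_translation):
    """Return the (start, end) span whose case changed, or None if unusable."""
    if (len(baseline) != len(probe_translation)
            or baseline.upper() != probe_translation.upper()):
        return None  # the translations differ in more than case
    diffs = [i for i, (a, b) in enumerate(zip(baseline, probe_translation)) if a != b]
    return (diffs[0], diffs[-1] + 1) if diffs else None


def align_ranges(source, source_ranges, translate):
    """Map each source range to a target range by upper-casing it and retranslating."""
    baseline = translate(source)
    mapping = {}
    for start, end in source_ranges:
        probe = source[:start] + source[start:end].upper() + source[end:]
        mapping[(start, end)] = find_upper_range(baseline, translate(probe))
    return baseline, mapping


def reapply(target, tagged_ranges):
    """Wrap non-overlapping target ranges in tags, right to left so offsets stay valid."""
    for (start, end), tag in sorted(tagged_ranges, reverse=True):
        target = (target[:start] + f"<{tag}>" + target[start:end]
                  + f"</{tag}>" + target[end:])
    return target


source = "In the new year the winning team will be announced."
tags = {span_of(source, "new year"): "b",
        span_of(source, "winning team"): "i",
        span_of(source, "announced"): "u"}
baseline, mapping = align_ranges(source, list(tags), translate)
marked_up = reapply(baseline, [(mapping[r], tags[r]) for r in tags if mapping[r]])
```

With the stub above, `marked_up` comes out as `En el <b>año nuevo</b> el <i>equipo ganador</i> será <u>anunciado</u>.` A range is simply dropped when upper-casing changes the translation in more than case, since no reliable alignment can then be read off from the comparison.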