Talk:Content translation/Developers/Markup

From mediawiki.org

Deriving subsentence alignment from case change observations

ContentTranslation tries to derive subsentence alignment (which ranges of the source text correspond to which ranges of the translation) by selectively changing certain portions of the text to upper case and observing how the translation changes. Suppose we are trying to translate the following formatted text:

The <b>issue</b> was reported by a <a href="x">registered blind</a> user.

First, ContentTranslation will obtain the translation for the plain text sentence:

en: The issue was reported by a registered blind user.
es: El asunto estuvo informado por un usuario ciego registrado.

Then, for each range of markup that occurs, it will obtain the translation of the sentence with that range upper cased:

en: The ISSUE was reported by a registered blind user.
es: El ASUNTO estuvo informado por un usuario ciego registrado.
en: The issue was reported by a REGISTERED BLIND user.
es: El asunto estuvo informado por un usuario CIEGO REGISTRADO.
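The upper-casing step can be sketched roughly as follows. This is only an illustration, not ContentTranslation's actual code: the function name `uppercase_variants` and the representation of ranges as zero-based `(start, end)` pairs with the end position included are assumptions made for the example.

```python
def uppercase_variants(text, ranges):
    """Return one copy of `text` per range, with only that range upper-cased.

    Assumption for this sketch: ranges are zero-based (start, end) pairs
    and the end position is included in the range.
    """
    variants = []
    for start, end in ranges:
        # Everything outside the range is left untouched, so the
        # translation context stays the same.
        variants.append(text[:start] + text[start:end + 1].upper() + text[end + 1:])
    return variants


source = "The issue was reported by a registered blind user."
# (4, 8) covers 'issue'; (28, 43) covers 'registered blind'.
for variant in uppercase_variants(source, [(4, 8), (28, 43)]):
    print(variant)
```

Each variant would then be sent to the MT engine in the same way as the plain sentence.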

It compares these translations, and uses the differences to calculate a partial range mapping, showing in this case:

  • Characters 4-8 of the original text (i.e. 'issue') map to characters 3-8 of the translation (i.e. 'asunto').
  • Characters 28-43 of the original text (i.e. 'registered blind') map to characters 42-57 of the translation (i.e. 'ciego registrado').

(Positions are counted from zero, and both ends of each range are included.)
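One possible way to do the comparison is sketched below. It is a simplification under stated assumptions, not the real implementation: `changed_range` is a hypothetical helper, positions are counted from zero with the end position included, and the case-changed words are assumed to form a single contiguous run in the translation.

```python
def changed_range(plain, cased):
    """Return the (start, end) range where `cased` differs from `plain`.

    Returns None if the two translations differ by more than letter case,
    or if no case change came through at all.
    """
    # The technique only works if the translation is identical modulo case.
    if len(plain) != len(cased) or plain.lower() != cased.lower():
        return None
    differing = [i for i, (a, b) in enumerate(zip(plain, cased)) if a != b]
    if not differing:
        return None
    # Spaces inside the run do not change case, so take the outermost
    # differing positions as the range boundaries.
    return differing[0], differing[-1]


plain = "El asunto estuvo informado por un usuario ciego registrado."
print(changed_range(plain, "El ASUNTO estuvo informado por un usuario ciego registrado."))  # (3, 8)
```

Pairing each returned range with the source range that was upper-cased gives the partial range mapping.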

This range mapping is used to apply formatting to the plain text translation:

El <b>asunto</b> estuvo informado por un usuario <a href="x">ciego registrado</a>.
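Reapplying the formatting from a range mapping could look something like this. Again a sketch only: `apply_markup` and the tuple layout `(start, end, open_tag, close_tag)` are invented for the example, ranges are zero-based with the end included, and they are assumed not to overlap. No HTML escaping is attempted.

```python
def apply_markup(text, annotations):
    """Wrap each mapped range of `text` in its opening and closing tags.

    annotations: list of (start, end, open_tag, close_tag) tuples, with
    zero-based positions, inclusive ends and no overlapping ranges.
    """
    out = text
    # Insert from right to left so that earlier offsets remain valid
    # after tags have been added further along the string.
    for start, end, open_tag, close_tag in sorted(annotations, reverse=True):
        out = out[:start] + open_tag + out[start:end + 1] + close_tag + out[end + 1:]
    return out


translation = "El asunto estuvo informado por un usuario ciego registrado."
print(apply_markup(translation, [
    (3, 8, "<b>", "</b>"),
    (42, 57, '<a href="x">', "</a>"),
]))
```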

Note that the only change made to the input text is turning portions of it to upper case. This is important, because it means the translation context is not being changed. Without the same context, phrase translations can change; e.g. translated on its own, 'registered blind' comes out of Apertium as 'Registrado ciego', which is different from its translation in the full sentence.

Limitations

  • Only language pairs in which both languages are written with letter case can use this technique, so it won't work for Devanagari, Arabic, Chinese, ...
  • Uppercase-only phrases, like sentence-initial 'A' or 'HIV', cannot be mapped. (Note lowercase/titlecase words can usually be upper-cased, but not the reverse, e.g. Apertium will not recognise 'britain' or 'hiv').
  • The MT engine must give the same translation modulo upper casing. This seems to be the case for Apertium, but not Google Translate API (which does not offer alignment info).
  • The MT engine must reproduce case in the target language.
  • The MT engine runs several times per sentence. This is not a showstopper because Apertium is pretty fast anyway.

However, note that if a mapping is not found for a particular phrase, ContentTranslation falls back gracefully: it simply does not reapply the markup for that phrase. There is no risk of semantic change or syntactic error.
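The fallback behaviour can be illustrated with an end-to-end sketch. Everything here is hypothetical: `reapply_markup` is an invented name, `fake_mt` is a lookup-table stand-in for a real MT engine, and ranges are zero-based with the end included. The stub deliberately loses the case change for 'ISSUE' to simulate a phrase that cannot be mapped.

```python
SOURCE = "The issue was reported by a registered blind user."

# Stand-in for an MT engine: a fixed lookup table, not real engine output.
FAKE_TRANSLATIONS = {
    SOURCE: "El asunto estuvo informado por un usuario ciego registrado.",
    "The issue was reported by a REGISTERED BLIND user.":
        "El asunto estuvo informado por un usuario CIEGO REGISTRADO.",
}


def fake_mt(text):
    # Unknown inputs get the plain translation, simulating an engine
    # that does not reproduce the case change.
    return FAKE_TRANSLATIONS.get(text, FAKE_TRANSLATIONS[SOURCE])


def reapply_markup(source, annotations, translate):
    """Translate `source` and reapply only the markup that could be mapped."""
    plain = translate(source)
    mapped = []
    for start, end, open_tag, close_tag in annotations:
        variant = source[:start] + source[start:end + 1].upper() + source[end + 1:]
        cased = translate(variant)
        if len(cased) != len(plain) or cased.lower() != plain.lower():
            continue  # translation changed beyond case: drop this markup
        diff = [i for i, (a, b) in enumerate(zip(plain, cased)) if a != b]
        if not diff:
            continue  # no case change observed: drop this markup
        mapped.append((diff[0], diff[-1], open_tag, close_tag))
    out = plain
    # Insert tags right to left so earlier offsets stay valid.
    for start, end, open_tag, close_tag in sorted(mapped, reverse=True):
        out = out[:start] + open_tag + out[start:end + 1] + close_tag + out[end + 1:]
    return out


annotations = [(4, 8, "<b>", "</b>"), (28, 43, '<a href="x">', "</a>")]
print(reapply_markup(SOURCE, annotations, fake_mt))
```

With this stub the bold markup is silently dropped and only the link is reapplied; the words of the translation are the same either way.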