Content translation/Developers/Markup/fr

In Content Translation, translators work on HTML content. The HTML contains all the markup that a typical Wikipedia article can contain. This means that machine translation is done on HTML content. But not all machine translation engines support HTML content.

Some machine translation engines, such as Moses, directly generate subsentence alignment information showing which source words correspond to which target words.

$ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
this is |0-1| a |2-2| small |3-3| house |4-4|
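The `|i-j|` markers in the Moses output above pair each target phrase with the source word span it came from. A minimal sketch of parsing them (the function name is illustrative, not part of Moses or cxserver):

```python
import re

def parse_moses_alignment(output):
    """Parse Moses '-t' output such as
    'this is |0-1| a |2-2| small |3-3| house |4-4|'
    into (target_phrase, (src_start, src_end)) pairs."""
    pairs = []
    for match in re.finditer(r'(.+?)\s*\|(\d+)-(\d+)\|\s*', output):
        phrase = match.group(1).strip()
        start, end = int(match.group(2)), int(match.group(3))
        pairs.append((phrase, (start, end)))
    return pairs

print(parse_moses_alignment('this is |0-1| a |2-2| small |3-3| house |4-4|'))
# [('this is', (0, 1)), ('a', (2, 2)), ('small', (3, 3)), ('house', (4, 4))]
```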

The Apertium machine translation engine does not faithfully translate formatted text. Markup such as HTML is treated as whitespace. This can change the semantics (if words are reordered) or produce semantic errors (if the one-to-one correspondence is not preserved).

$ echo 'legal persons ' | apertium en-es -f html
Personas legales

$ echo 'I am David' | apertium en-es -f html
Soy  David

Other machine translation engines show similar problems. This is what makes it difficult to provide machine translation of formatted text. This document explains how this conflict is handled in Content Translation.

As shown in the examples above, a machine translation engine can cause the following errors in the translated HTML. The errors are listed in decreasing order of severity.
 * 1) Corrupt markup. If the machine translation engine does not know how to handle HTML, it may move HTML tags around, producing inconsistent markup in the machine-translated result.
 * 2) Wrongly placed annotations. The two examples above illustrate this. It is worse if the content contains links and their targets are swapped or distributed randomly in the machine-translated output.
 * 3) Missing annotations. Sometimes the machine translation engine can eat some tags during the translation process.
 * 4) Split annotations. During translation, one word may be translated into one or more words. If the source word carries markup, should the machine translation engine wrap the whole group of words in it, or wrap each word separately?

All of these problems can lead to a bad experience for translators.

Apart from the potential problems with markup transfer, there is another aspect of sending HTML content to machine translation engines. Compared to the plain text version of a paragraph, the HTML version is larger in byte size. The extra bytes come partly from tags and attributes that must pass through translation untouched. This is a needless use of bandwidth. And if the machine translation engine is metered (for example non-free, with controlled and limited API access), we have not gained much.

Overview

 * The input HTML content is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
 * Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
 * The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
 * The alignment information is used to reapply markup to the translated text.
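The linearization step above can be pictured as follows. This is a minimal sketch of the idea only; the real LinearDoc lives in cxserver and is more general, and the `Chunk` class and helper here are illustrative names:

```python
# Represent a paragraph as a flat list of text chunks, each carrying the
# inline tags that cover it. Reordering and slicing become list operations.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    tags: list = field(default_factory=list)  # inline tags covering this text

def to_plain_text(chunks):
    """Strip inline markup: this is what gets sent to the MT engine."""
    return ''.join(c.text for c in chunks)

# '<p>The <b>red</b> dog</p>' linearized into three chunks:
chunks = [Chunk('The '), Chunk('red', ['b']), Chunk(' dog')]
print(to_plain_text(chunks))  # The red dog
```

The markup is not lost: it stays attached to the chunks, ready to be reapplied to the translated text using the alignment information.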

This ensures that MT engines translate only plain text, and markup is applied as a post-MT processing step.

To transfer the markup, we initially tried an algorithm based on case-change observation. To locate the translation of a word that may be reordered in the translated text, the sentence is translated twice: once as-is, and once with that particular word upper-cased. Diffing the two outputs tells us where the word's translation appears. This approach and its limitations are listed below. Later we will see a more advanced algorithm.
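The upper-casing probe can be sketched as follows. This is an illustration only, assuming a `translate()` callable and an MT engine that preserves the case of the probed word; the toy engine below just reverses word order to exercise the diff:

```python
def locate_word(sentence, word_index, translate):
    """Find where the translation of one source word lands in the target
    sentence, by diffing a normal translation against one in which that
    word was upper-cased."""
    words = sentence.split()
    probed = words[:]
    probed[word_index] = probed[word_index].upper()

    base = translate(' '.join(words)).split()
    marked = translate(' '.join(probed)).split()

    # Positions where the two outputs differ are where the word ended up.
    return [i for i, (a, b) in enumerate(zip(base, marked)) if a != b]

# Toy 'engine' that reverses word order, standing in for a real MT engine:
toy = lambda s: ' '.join(reversed(s.split()))
print(locate_word('the red dog', 1, toy))  # [1]
print(locate_word('the red dog', 0, toy))  # [2]
```

The probe breaks down when the engine normalizes case, inflects the word, or translates it into several words, which is what motivated the fuzzy-matching algorithm described next.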



Annotation mapping using subsequence approximation of the translation
This is the algorithm currently used in ContentTranslation. It tries to overcome the limitations of the upper-casing algorithm above. Essentially, it does a fuzzy match to find the target locations in the translated text at which to apply annotations. Here too, the content given to MT engines is plain text only.

The steps are given below.


 * 1) For the text to translate, find the text of inline annotations like bold, italics, links, etc. We call these subsequences.
 * 2) Pass the full text and the subsequences to the plain text machine translation engine. Use a delimiter so that we can map between the source items (full text and subsequences) and the translated items.
 * 3) The translated full text will contain the subsequences somewhere in the text. To locate each subsequence translation in the full text translation, use an approximate search algorithm.
 * 4) The approximate search algorithm returns the start position and length of the match. To that range we map the annotation from the source HTML.
 * 5) The approximate match involves calculating the edit distance between words in the translated full text and the translated subsequence. It is not strings that are searched, but n-grams with n = number of words in the subsequence. Each word in the n-gram is matched independently.
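Steps 3 to 5 can be sketched as a word-level fuzzy search. This is a simplified illustration, not the actual cxserver implementation; the function names and the `max_distance` threshold are assumptions:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_subsequence(full_translation, sub_translation, max_distance=2):
    """Return (start_word_index, length) of the best fuzzy match, or None."""
    words = full_translation.split()
    target = sub_translation.split()
    n = len(target)
    best = None
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        dist = sum(edit_distance(w, t) for w, t in zip(window, target))
        if dist <= max_distance and (best is None or dist < best[0]):
            best = (dist, start)
    return (best[1], n) if best else None

# 'moderno' vs 'moderna': one trailing letter differs (word inflection),
# so an exact search fails but the fuzzy search still locates the word.
print(find_subsequence('el arte moderno es abstracto', 'moderna'))  # (2, 1)
```

The returned word range is where the annotation from the source HTML gets reapplied.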

To understand this, let us try the algorithm on some example sentences.


 * 1) Translating the Spanish sentence   to Catalan: the plain text version is  . And the subsequence with annotation is    . We give both the full text and the subsequence to MT. The full text translation is  , and the word     is translated as  . We search for   in the full text translation. The search will be successful and the  tag will be applied, resulting in  . The search performed in this example is a plain exact text search, but the following example illustrates why it cannot always be an exact search.
 * 2) Translating an English sentence   to Spanish. The full text translation is   . One of the subsequences,  , gets translated as  . The case of   differs, and the search should be smart enough to identify   as a match for  . Word order differences between the source text and the translation are already handled by the algorithm. The following example illustrates that it is not just case change that happens.
 * 3) Translating   to Spanish. The plain text version gets translated as  , and the word with the annotation, modern, gets translated as  . We need a match for   and  . We get  . This is a case of word inflection: a single letter at the end of the word changes.
 * 4) Now let us look at an example where the subsequence is more than one word, with nested subsequences. Translating the English sentence   to Spanish. Here, the subsequence   is in bold, and inside it, red is in italics. In this case we need to translate the full text and the subsequences   and  . So we have El perro rojo grande as the full translation, and Rojo grande and Rojo as the translations of the subsequences.   needs to be located first and the bold tag applied; then we search for   and apply italics. Then we get the final result.
 * 5) How does this work with heavily inflected languages like Malayalam? Suppose we translate I am from Kerala to Malayalam. The plain text translation is ഞാന്‍ കേരളത്തില്‍ നിന്നാണു്, and the subsequence Kerala gets translated to കേരളം. So we need to match കേരളം and കേരളത്തില്‍. They differ by an edit distance of 7, and the changes are at the end of the word. This shows that we will require language-specific tailoring to get reasonable output.

The algorithm for the approximate string match can be a simple Levenshtein distance, but what is an acceptable edit distance? That must be configurable per language module. And the following example illustrates that edit-distance-based matching alone won't work.

Translating   to English. The plain text translation is  , and   translates as  . With an edit distance approach,   will match   better than  . To address such cases, we add a second criterion: the words should start with the same letter. This also illustrates that the algorithm needs language-specific modules.

There are still cases that cannot be solved by the algorithm described above. Consider the following example.

Translating  . The plain text translation to Spanish is  , and the phrase cannot translates as  . Here we need to match   and  , which of course won't match with the approach we have explained so far.

To address this case, we do not treat the subsequence as a string but as an n-gram, where n = number of words in the subsequence. The fuzzy matching should be per word in the n-gram, not over the entire string. That is,   is fuzzy-matched with   and  , and   is fuzzy-matched with   and  , left to right, until a match is found. This takes care of word order changes as well as inflections.
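The per-word n-gram matching, combined with the same-first-letter criterion mentioned earlier, can be sketched as follows. The helper names and the similarity threshold are illustrative assumptions, not the actual cxserver code:

```python
from difflib import SequenceMatcher

def word_matches(candidate, target, min_ratio=0.6):
    """A candidate word matches a target word if they share their first
    letter and are similar enough (threshold configurable per language)."""
    if not candidate or not target or candidate[0].lower() != target[0].lower():
        return False
    return SequenceMatcher(None, candidate.lower(), target.lower()).ratio() >= min_ratio

def match_ngram(words, target_words):
    """Slide an n-gram window (n = len(target_words)) over the translated
    text. Each window word is matched independently against any target
    word, so reordering inside the n-gram is tolerated."""
    n = len(target_words)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(any(word_matches(w, t) for t in target_words) for w in window):
            return start, n
    return None

# 'rojo grande' found even though the subsequence translated as 'roja grande'
# (inflection) - each word is fuzzy-matched on its own:
print(match_ngram('el perro rojo grande'.split(), ['roja', 'grande']))  # (2, 2)
```

Because each word is matched independently, a single string-level edit distance over the whole phrase is no longer the bottleneck when word order changes or individual words inflect.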

Revisiting the four types of errors that happen in annotation transfer: with the algorithm explained so far, in the worst case we will miss annotations. There is no case of corrupted markup.

As ContentTranslation adds support for more languages, language-specific customization of the above approach will be required.