Help:Extension:Translate/Translation memory architecture

This documentation explains the current architecture (as of 2020-10) of the translation memory implemented in the Translate extension. The target audience is developers and other people interested in, or tasked with, improving the translation memory architecture.

What is a translation memory

The purpose of a translation memory is to speed up the translation process by suggesting previously translated segments that are similar to the text being translated.

Database 1:

id | translations
--------------------------------------------------
 1 | en: Are you a bunny rabbit?
   | fi: Oletko sinä pupujänis?
--------------------------------------------------
 2 | en: Who are you?
   | fi: Kuka sinä olet?

If you are now translating the string “Are you a human?”, the translation memory may return one or more suggestions. Each suggestion usually comes with a score that indicates how closely its source text matches the string being translated, and the same score is used to select the best candidates.

Translation memory output 1:

Query | value: Are you a human?
      | source language: en
      | target language: fi
----
Sug 1 | value: Oletko sinä pupujänis?
      | match: 80%
----
Sug 2 | value: Kuka sinä olet?
      | match: 60%
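
As a rough illustration of the lookup above, here is a minimal sketch of an in-memory translation memory. It is not how Translate implements it: `MEMORY`, `similarity` and `suggest` are hypothetical names, and Python's `difflib` ratio is only a stand-in scoring function, so the scores it prints will differ from the illustrative 80% and 60% shown above.

```python
from difflib import SequenceMatcher

# Toy translation memory mirroring "Database 1": language code -> text.
MEMORY = [
    {"en": "Are you a bunny rabbit?", "fi": "Oletko sinä pupujänis?"},
    {"en": "Who are you?", "fi": "Kuka sinä olet?"},
]


def similarity(a: str, b: str) -> float:
    """Stand-in scoring function: difflib's ratio, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest(text: str, source: str, target: str, threshold: float = 0.0):
    """Return (score, translation) pairs that reach the threshold, best first."""
    scored = [
        (similarity(text, entry[source]), entry[target])
        for entry in MEMORY
        if source in entry and target in entry
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


for score, translation in suggest("Are you a human?", "en", "fi"):
    print(f"{score:.0%}  {translation}")
```

The score is computed against the source-language text, but the value handed to the translator is the stored translation in the target language.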

Why is a translation memory hard to implement

The most essential feature of a translation memory is the function to calculate the score (later: scoring function). The function should aim to find a balance of:

  1. finding the best suggestions as judged by the translators
  2. being performant, in the sense that it returns suggestions in a reasonable time and does not consume too many computing resources

Why is finding the best suggestions difficult

Both of these are difficult to do. What counts as the best suggestion is subjective, and any kind of large-scale human evaluation is laborious to carry out. As an illustration of the challenges:

Database 2:

id | translations
----
 1 | en: Where is the highway?
   | fi: Missä se maantie on?
----
 2 | en: Germany has a number of old highway strips
   | fi: Saksassa on useita vanhoja lentokoneiden varalaskupaikkoja
----
 3 | en: The old highway strip was 1111 meters long.
   | fi: Vanha lentokoneiden varalaskupaikka oli 1111 metriä pitkä.

Which one would be the best suggestion for “Where is the old highway strip?” A naive system would return document #1, since all four of its words appear in the query. However, the translator can likely write the translation of “where is ___” very quickly, but may not be sure what the best translation for “highway strip” is. Documents #2 and #3 would answer that question. Of these, #3 would potentially be the better match for the likely translation “Missä se vanha lentokoneiden varalaskupaikka on?”, because it contains a long run of words that can be used directly, “vanha lentokoneiden varalaskupaikka” (apart from fixing the case of one letter), while document #2 would need several words corrected to be grammatically correct. Correcting individual words takes more time than removing and adding whole words.
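
To make the “naive system” concrete, here is a small sketch that ranks the English side of Database 2 by word overlap (Jaccard similarity). This is only one plausible naive measure, chosen for illustration; the function names are hypothetical and nothing here is taken from Translate.

```python
import re

# English side of "Database 2" from the text.
DOCUMENTS = {
    1: "Where is the highway?",
    2: "Germany has a number of old highway strips",
    3: "The old highway strip was 1111 meters long.",
}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(query: str, candidate: str) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    q, c = tokens(query), tokens(candidate)
    return len(q & c) / len(q | c) if q | c else 0.0


query = "Where is the old highway strip?"
ranking = sorted(DOCUMENTS, key=lambda i: word_overlap(query, DOCUMENTS[i]), reverse=True)
print(ranking)  # [1, 3, 2]
```

Document #1 wins because every one of its words occurs in the query, even though it offers the translator nothing for the hard part, “highway strip”.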

Why is it difficult to make a performant scoring function

Performance is determined by the complexity of the scoring function. A simple equality comparison using a hash (which can be precomputed) is very fast, and it hardly matters how many times we run it. Something more useful, such as the Levenshtein edit distance, takes polynomial time in the length of the strings being compared (usually quadratic, depending on the implementation). Scoring a single long paragraph can therefore take a good portion of a second, which makes it impossible to run thousands of such slow comparisons.

There are, of course, some optimizations, such as filtering out candidates that are too short or too long compared to the query string, meaning they would never meet the minimum score.
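
The cost and the length-based optimization can be sketched as follows. This is a textbook dynamic-programming Levenshtein distance, not the code Translate or ElasticSearch runs; the normalization in `score` and the names `can_reach` and `best_matches` are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming; time is O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string to keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def score(a: str, b: str) -> float:
    """Normalize the distance to a 0.0-1.0 similarity (assumed convention)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def can_reach(query: str, candidate: str, threshold: float) -> bool:
    """Cheap pre-filter: the distance is at least the length difference,
    so the score can never exceed shorter_length / longer_length."""
    shorter, longer = sorted((len(query), len(candidate)))
    return longer == 0 or shorter / longer >= threshold


def best_matches(query: str, candidates: list[str], threshold: float = 0.75):
    """Run the expensive scoring only on candidates that pass the length filter."""
    scored = ((score(query, c), c) for c in candidates if can_reach(query, c, threshold))
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

With a threshold of 0.75, a 12-character query never needs to be compared against a 23-character candidate, because 12/23 is already below the threshold.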

How Translate implements translation memory

The Translate extension has a component named TTMServer. Its name possibly stands for Translate (extension) Translation Memory Server. TTMServer is two things:

  1. Multilingual translation search
  2. Translation memory

These two features currently share the same data and backend, but they perform different types of queries against it. Translation search is relevant here because we work under the assumption that translation search and translation memory can continue to share the same dataset, and that translation search is not degraded through reduced functionality or reduced performance.

TTMServer provides an abstraction (a poor one --author) for the rest of the Translate extension to update, search, or query the translation memory. In theory it can support multiple backends, but in practice only one is in serious use.

There are two backends: ElasticSearch and Database. The Database backend does not support translation search and is provided only as a convenience for development environments. There used to be another backend for Solr, but it was removed because we did not have the resources to maintain a backend we do not use ourselves.

TTMServer storage schema

For this section, do familiarize yourself with what an index means in ElasticSearch, for example from https://www.elastic.co/what-is/elasticsearch.

The translations (and definitions) are stored in an index. The index makes no distinction between definitions and translations, so you can basically search from any language to any language, though it is not possible to have per-language analyzers with this schema.

The schema can handle multiple wikis sharing the same index, which is often desirable. The exception to this is private wikis, which should have their own private index.

Glossary

  • Message: A piece of text with a source language and all its translations. In the case of the schema, a message can have multiple versions.
  • Document: One version of the message in one language.

The schema has the following fields:

  • _id: Also known as the global id. It is a token in the format wikiId-localId-revisionId/languageCode.
  • content: The text value of the definition or translation.
  • group: Which message group this string belongs to. May not be globally unique.
  • language: Language code for the content.
  • localid: Page title of the “message”.
  • uri: Link to the message.
  • wiki: Wiki database identifier.

Here is an example entry:

field    | value
----
_id      | translatewiki_net-bw_-Wikimedia:Xtools-projects-8849980/de
content  | Projekte
group    | xtools
language | de
localid  | Wikimedia:Xtools-projects
uri      | https://translatewiki.net/wiki/Wikimedia:Xtools-projects/de
wiki     | translatewiki_net-bw_
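
The relationship between the fields and the `_id` can be sketched like this. `Document`, `global_id` and `with_language` are hypothetical names for this example and not part of TTMServer; the revision id is modelled as its own attribute only because it appears in the `_id` format, even though it is not listed as a separate schema field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    wiki: str      # wiki database identifier
    localid: str   # page title of the message
    revision: int  # revision id; part of _id, not a separate schema field
    language: str  # language code of the content
    content: str   # text of the definition or translation

    @property
    def global_id(self) -> str:
        """Format given in the text: wikiId-localId-revisionId/languageCode."""
        return f"{self.wiki}-{self.localid}-{self.revision}/{self.language}"


def with_language(global_id: str, language: str) -> str:
    """Swap the language code, assumed to be everything after the last '/'."""
    base, _, _ = global_id.rpartition("/")
    return f"{base}/{language}"


doc = Document("translatewiki_net-bw_", "Wikimedia:Xtools-projects", 8849980, "de", "Projekte")
print(doc.global_id)                       # the _id from the example entry
print(with_language(doc.global_id, "fi"))  # same message and revision, Finnish
```

Going in the other direction, from an `_id` back to its parts, is not attempted here: both the wiki identifier and the page title may contain hyphens, so splitting on “-” would be ambiguous.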

How is the index updated

The index supports multiple versions of the same message. Consider the following situation: a message is translated into many languages, then the definition is changed, and the translations are marked as outdated and later updated.

When the definition is updated, we add a new document to the index. Since the revision number is different, it won’t overwrite the existing one. When the translations are updated, any previous versions of the translations for that message are deleted, and a new document is added for the latest version.

Fuzzy translations are never inserted into the index.
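
These update rules can be modelled with a plain dictionary standing in for the index. This is a sketch under assumptions, not the TTMServer code: all function names are hypothetical, the revision in a translation's `_id` is assumed to be the revision of the definition it translates, and a fuzzy update is assumed to leave older versions untouched (the text does not say what happens to them).

```python
# Hypothetical in-memory stand-in for the index: _id -> document fields.
index: dict[str, dict] = {}


def make_id(wiki: str, localid: str, revision: int, language: str) -> str:
    return f"{wiki}-{localid}-{revision}/{language}"


def add_document(wiki, localid, revision, language, content):
    index[make_id(wiki, localid, revision, language)] = {
        "wiki": wiki, "localid": localid, "language": language, "content": content,
    }


def update_definition(wiki, localid, revision, language, content):
    """A changed definition has a new revision id, so the old document is kept."""
    add_document(wiki, localid, revision, language, content)


def update_translation(wiki, localid, revision, language, content, fuzzy=False):
    """Replace earlier versions of this translation with the latest one."""
    if fuzzy:
        return  # fuzzy translations are never inserted (assumption: nothing else changes)
    stale = [doc_id for doc_id, doc in index.items()
             if (doc["wiki"], doc["localid"], doc["language"]) == (wiki, localid, language)]
    for doc_id in stale:
        del index[doc_id]
    add_document(wiki, localid, revision, language, content)


update_definition("wiki", "Greeting", 1, "en", "Who are you?")
update_translation("wiki", "Greeting", 1, "fi", "Kuka sinä olet?")
update_definition("wiki", "Greeting", 2, "en", "Who are you, really?")
update_translation("wiki", "Greeting", 2, "fi", "Kuka sinä oikeasti olet?")
print(sorted(index))
```

After these four calls the index holds both versions of the definition but only the latest Finnish translation.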

How is the index queried

The current algorithm is complicated, but I’ll try to explain it in parts. The high-level description is:

Inputs: text, sourceLanguage, targetLanguage, threshold

Algorithm:

  1. Matching Query: Query the index for documents whose language is sourceLanguage and whose content “is similar to” text, keep those whose “scoring function” score is higher than threshold, and order them by that score. Return all matching documents.
  2. Construct searchTerms to retrieve matching translations by replacing the language code in the _id with the target language.
  3. Retrieving Query: Query the index where _id is any of searchTerms.

The definition for “is similar to” uses ElasticSearch’s “fuzzy_like_this” query. Note that this query type is deprecated and has been removed in newer versions of ElasticSearch.

The definition for “scoring function” uses levenshtein_distance_score to calculate the edit distance and uses that as the score.
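
The three steps can be sketched against a tiny in-memory index. This is an illustration only: `INDEX`, `similarity` and `query_memory` are hypothetical names, and `difflib`'s ratio stands in for both the “is similar to” filter and the scoring function, which in the real system are ElasticSearch features.

```python
from difflib import SequenceMatcher

# Hypothetical index contents: _id -> fields. "wiki-Bunny-7" has no Finnish translation.
INDEX = {
    "wiki-Greeting-1/en": {"language": "en", "content": "Who are you?"},
    "wiki-Greeting-1/fi": {"language": "fi", "content": "Kuka sinä olet?"},
    "wiki-Bunny-7/en": {"language": "en", "content": "Are you a bunny rabbit?"},
}


def similarity(a: str, b: str) -> float:
    """Stand-in for "is similar to" plus the scoring function."""
    return SequenceMatcher(None, a, b).ratio()


def query_memory(text, source_language, target_language, threshold):
    # 1. Matching Query: score every source-language document, keep the good ones.
    matches = sorted(
        ((similarity(text, doc["content"]), doc_id)
         for doc_id, doc in INDEX.items()
         if doc["language"] == source_language),
        reverse=True,
    )
    matches = [(score, doc_id) for score, doc_id in matches if score >= threshold]

    # 2. Build searchTerms by replacing the language code in each _id.
    search_terms = {
        doc_id.rpartition("/")[0] + "/" + target_language: score
        for score, doc_id in matches
    }

    # 3. Retrieving Query: fetch the documents whose _id is one of searchTerms.
    return [
        (score, INDEX[term]["content"])
        for term, score in search_terms.items()
        if term in INDEX
    ]


print(query_memory("Who are you", "en", "fi", threshold=0.3))
```

With a low threshold, the bunny definition is scored and survives step 1, but step 3 finds no Finnish document for it, so that scoring work produces no suggestion.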

An astute reader may already notice a major issue here: we run a slow matching and scoring query over a long list of documents, which is pointless if those documents do not have any translations.

The current algorithm tries to work around this by running the above algorithm up to twice. First it takes the top 100 results from the Matching Query, hoping they are sufficient to return enough[1] results from the Retrieving Query. If they are not, it queries for 500 more (doing redundant work) and merges the results.
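
The retry logic might look roughly like the sketch below. Several things are assumptions here: `run_queries(offset, size)` is a hypothetical callable standing for the Matching Query plus the Retrieving Query over one slice of matches, the second pass is assumed to ask for the next 500 matches rather than re-reading the first 100, and “unique” is assumed to mean unique by translation text.

```python
from typing import Callable

Suggestion = tuple[float, str]  # (score, translated text)


def unique(suggestions: list[Suggestion]) -> list[Suggestion]:
    """Drop repeated translation texts, keeping the best-scoring copy of each."""
    seen: set[str] = set()
    result: list[Suggestion] = []
    for score, text in sorted(suggestions, reverse=True):
        if text not in seen:
            seen.add(text)
            result.append((score, text))
    return result


def suggest_two_pass(
    run_queries: Callable[[int, int], list[Suggestion]],
    first_batch: int = 100,
    second_batch: int = 500,
    enough: int = 6,
) -> list[Suggestion]:
    """Try a small batch first; fall back to a larger one only if needed."""
    results = unique(run_queries(0, first_batch))
    if len(results) < enough:
        # Not enough unique suggestions: fetch more matches and merge.
        results = unique(results + run_queries(first_batch, second_batch))
    return results
```

The second call is skipped entirely whenever the first 100 matches already yield six or more unique suggestions.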

  [1] Having 6 or more unique suggestions.