Help:Extension:Translate/Translation memories

TTMServer is a translation memory that comes with the Translate extension. It has no external dependencies. It is enabled by default and replaces the support for tmserver from translatetoolkit, which was hard to set up. TTMServer is a simple translation memory and doesn't use any advanced algorithms. It takes advantage of MediaWiki's excellent language support and database abstraction features.

There are four different ways to use TTMServer:

Configuration
All translation aids including translation memories are configured with the  configuration setting. Example configuration of TTMServers:

Possible keys and values are:

Currently, MySQL is the only supported database.

TTMServer API
If you would like to implement your own TTMServer service, here are the specifications.

Query parameters:

Your service must accept the following parameters:

Your service must provide a JSON object which must have key  with an array of objects. Those objects must contain the following data:

Example:

URL: http://translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm

Response:
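The example request above can be assembled from its query parameters. The following is a minimal client-side sketch in Python; the helper name `build_ttmserver_url` is hypothetical and not part of the extension, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode


def build_ttmserver_url(base, source_language, target_language, text, fmt="json"):
    """Build a TTMServer query URL (hypothetical helper, for illustration only)."""
    params = {
        "action": "ttmserver",
        "sourcelanguage": source_language,
        "targetlanguage": target_language,
        "text": text,
        "format": fmt,
    }
    # urlencode percent-encodes the values and keeps the dict's insertion order
    return base + "?" + urlencode(params)


url = build_ttmserver_url(
    "http://translatewiki.net/w/api.php", "en", "fi", "january", fmt="jsonfm"
)
print(url)
```

Using `urlencode` rather than string concatenation means that texts containing spaces or non-ASCII characters are escaped correctly.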

TTMServer architecture
The backend contains three tables, corresponding to sources, targets and fulltext. You can find the table definitions in. The sources table contains all the message definitions. Even though they are usually all in the same language, say, English, the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique id and two extra fields, length and context. Length is used as the first pass filter, so that when querying we don't need to compare the text we're searching with every entry in the database. The context stores the title of the page where the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestions back to "MediaWiki:Jan/de", which makes it possible for translators to quickly fix things, or just to determine where that kind of translation was used.
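The step from a stored context to the matching translation page can be illustrated with a short Python sketch. It assumes the context always ends in a slash followed by a language code, as in the example above; the function name `translation_title` is hypothetical.

```python
def translation_title(context, language_code):
    """Swap the language suffix of a stored context title.

    Assumes the context has the form "<page>/<language code>";
    this helper is illustrative and not part of the extension.
    """
    base, separator, _source_code = context.rpartition("/")
    if not separator:
        raise ValueError("context has no language suffix: " + context)
    return base + "/" + language_code


print(translation_title("MediaWiki:Jan/en", "de"))
```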

The second pass of filtering comes from the fulltext search. The definitions are mingled with an ad hoc algorithm. First the text is segmented into segments (words) with MediaWiki's. If there are enough segments, we strip basically everything that is not word letters and normalize the case. Then we take the first ten unique words, which are at least 5 bytes long (5 letters in English, but even shorter words for languages with multibyte code points). Those words are then stored in the fulltext index for further filtering for longer strings.
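The keyword selection described above can be sketched roughly as follows in Python. This is not the extension's code: whitespace splitting stands in for MediaWiki's word segmentation, the `min_segments` threshold of 4 is an assumed value for "enough segments", and returning an empty list for short texts is likewise an assumption.

```python
import re


def fulltext_keywords(text, min_segments=4, max_words=10, min_bytes=5):
    """Pick up to max_words unique keywords for the fulltext index (sketch)."""
    # Stand-in for MediaWiki's segmentation: split on whitespace
    segments = text.split()
    if len(segments) < min_segments:
        # Assumed behaviour: short texts get no fulltext keywords
        return []
    keywords = []
    seen = set()
    for segment in segments:
        # Strip everything that is not a letter, then normalize the case
        word = re.sub(r"[\W\d_]+", "", segment).lower()
        # The length limit is measured in bytes, so multibyte words may be shorter
        if len(word.encode("utf-8")) < min_bytes or word in seen:
            continue
        seen.add(word)
        keywords.append(word)
        if len(keywords) == max_words:
            break
    return keywords


print(fulltext_keywords("Please confirm your email address before continuing"))
```

Because the limit is counted in bytes, a four-letter word such as "ääni" (6 bytes in UTF-8) still qualifies, which matches the note about multibyte code points.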

When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:


 * E : edit distance
 * S : the text we are searching suggestions for
 * Tc : the suggestion text
 * To : the original text which the Tc is translation of

The quality of suggestion Tc is calculated as E/min(length(Tc),length(To)). Depending on the length of the strings, we use either PHP's native levenshtein function or, if either string is longer than 255 bytes, an implementation of the Levenshtein algorithm written in PHP. It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the others being the fulltext search and segmentation).
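As an illustration of the formula above, here is a minimal Python sketch. It makes two assumptions that the text does not state: E is taken to be the edit distance between S and To, and lengths are counted in characters rather than bytes. The function names are hypothetical, and the pure-Python distance function is a textbook implementation, not the one used by the extension.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggestion_ratio(search_text, original_text, suggestion_text):
    """Compute E / min(length(Tc), length(To)) as given in the text.

    Assumption: E is the edit distance between S and To.
    """
    shortest = min(len(suggestion_text), len(original_text))
    if shortest == 0:
        raise ValueError("empty texts cannot be scored")
    return levenshtein(search_text, original_text) / shortest


print(suggestion_ratio("january", "January", "tammikuu"))
```

Under these assumptions, an original text identical to the search text yields a ratio of 0, and the ratio grows as the texts diverge.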

There is a script which fills the translation memory with translations from the active message groups. Even big sites should be able to bootstrap the memory in half an hour when using multiple threads with the  parameter. The time depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap). New translations are automatically added by a hook. New sources (definitions) are added when the first translation is added.

Old translations which are no longer used and do not belong to any message group are not purged automatically; they are removed only when you rerun the bootstrap script. When the translation of a message is updated, the previous translation is removed from the memory. When the definition is updated, nothing happens immediately. When translations are updated against the new definition, a new entry is added. The old definition and its old translations remain in the database until purged by rerunning the bootstrap script. Fuzzy translations are not added to the translation memory, but existing translations are not removed from the memory when they are marked as fuzzy.

Solr backend
Much of the above also applies to the TTMServer using the Solr search platform as backend, except for the details on database layout and queries. The results are still ranked with the Levenshtein algorithm to provide unified score values.

In Solr there are no tables. Instead we have documents with fields. Here is an example document:

So all the different translations are just fields of the same document. This way we can retrieve all the translations at the same time when we query for likely matches. The  field is again used for limiting the results that are not of similar length, because the edit distance would be too large for them. The content and language of the source text is stored in the fields  and. The field  is used for full text searching and has more processing, like stemming and tokenizing. The global unique identifier for the document consists of the source wiki identifier, the name of the message and the hash of the message contents. This way, if the source text changes, we will get a new document only with translations that match the current source. The existing translations against the existing text in the database will stay there until purged. There are a few more fields to identify the source wiki and the message where the translations come from.
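The way the unique identifier changes with the source text can be sketched in Python as follows. The text only states which three parts make up the identifier; the SHA-1 hash, the "-" separator and the wiki identifier "examplewiki" used here are assumptions for illustration.

```python
import hashlib


def document_id(wiki_id, message_name, source_text):
    """Combine wiki identifier, message name and content hash (illustrative sketch).

    The hash function and the separator are assumptions, not the extension's format.
    """
    content_hash = hashlib.sha1(source_text.encode("utf-8")).hexdigest()
    return "-".join([wiki_id, message_name, content_hash])


old_id = document_id("examplewiki", "MediaWiki:Jan", "January")
new_id = document_id("examplewiki", "MediaWiki:Jan", "January (month)")
print(old_id != new_id)
```

Because the hash is part of the identifier, editing the source text produces a different identifier, so translations against the new text are stored in a new document while the old document remains until it is purged.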

Installation
Here are the general quick steps for installing and configuring Solr for TTMServer. You should adapt them to your situation.

Then we need to install and configure Solarium, and make Solr the default translation memory backend.

Finally, we can populate the translation memory with content.