Help:Extension:Translate/Translation memories

TTMServer is a translation memory that comes with the Translate extension. It needs no external dependencies. It is enabled by default and is the recommended option, obsoleting the old tmserver from the Translate Toolkit, which was hard to set up. Like tmserver, TTMServer is lightweight and doesn't use any advanced algorithms. It takes advantage of MediaWiki's language support and database abstraction features.

All suggestion services are configured with the $wgTranslateTranslationServices configuration setting. The default configuration of TTMServer is as follows:
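
A sketch of what the default entry looks like (the key names and default values here are recalled from the extension and may differ between versions, so treat this as illustrative rather than authoritative):

```php
// Illustrative only: check the extension's own Translate.php for the
// authoritative defaults of $wgTranslateTranslationServices.
$wgTranslateTranslationServices['TTMServer'] = array(
    'database' => false,    // false = store translations in the wiki's own database
    'cutoff'   => 0.75,     // minimum suggestion quality; 1 means exact match
    'type'     => 'ttmserver',
    'public'   => false,    // whether anyone may query the memory via the web API
);
```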

The key database accepts a database name as a string, which tells TTMServer to use a shared database for storing the translations. Bad suggestions are filtered out by the cutoff option, which is given as a float between 0 and 1, where 1 means an exact match. Moreover, the translation editor will only show the three best suggestions (which may come from more than three sources). Finally, you can make the translation memory public, which enables anyone to query the memory using the MediaWiki web API.
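
When the memory is public, a suggestion query against the web API looks roughly like the following (the action name and parameter names are recalled from the Translate extension's API module and may differ in your version; check api.php?action=help on your wiki):

```
api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=January&format=json
```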

Currently only MySQL is supported.

Architecture
The backend contains three tables: translate_tms, translate_tmt and translate_tmf. These correspond to sources, targets and fulltext, respectively. You can find the table definitions in sql/translate_tm.sql. The sources table contains all the message definitions. Even though these are usually all in the same language, say English, the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique id and two extra fields, length and context. Length is used as a first-pass filter, so that when querying we don't need to compare the search text with every entry in the database. The context stores the title of the page the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestions back to "MediaWiki:Jan/de", which makes it possible for translators to quickly fix things, or simply to see where that kind of translation was used.
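
The back-link is essentially a suffix swap on the stored context title; a minimal sketch (the helper name is made up for illustration):

```python
def suggestion_page(context, target_language):
    """Map a stored context title like "MediaWiki:Jan/en" to the
    corresponding translation page in the target language."""
    base, _, _source_language = context.rpartition("/")
    return "%s/%s" % (base, target_language)

print(suggestion_page("MediaWiki:Jan/en", "de"))  # MediaWiki:Jan/de
```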

The second pass of filtering comes from the fulltext search. The definitions are munged with an ad hoc algorithm. First the text is segmented into words with MediaWiki's Language::segmentByWord. If there are enough segments, we strip basically everything that is not a word character and normalise the case. Then we take the first ten unique words that are at least 5 bytes long (5 letters in English, but even shorter words in languages with multibyte code points). Those words are then stored in the fulltext index for further filtering of longer strings.
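
The munging could be sketched roughly as below; the real code uses MediaWiki's Language::segmentByWord and its own normalisation, so this Python version is only an approximation of the idea:

```python
import re

def fulltext_words(text, min_bytes=5, max_words=10):
    """Approximate TTMServer's fulltext munging: split into words,
    lowercase, then keep the first ten unique words that are at
    least five bytes long in their UTF-8 encoding."""
    words = re.findall(r"\w+", text.lower(), re.UNICODE)
    kept = []
    for word in words:
        if len(word.encode("utf-8")) >= min_bytes and word not in kept:
            kept.append(word)
        if len(kept) == max_words:
            break
    return kept

print(fulltext_words("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'jumps']
```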

When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:
 * E: the edit distance
 * S: the text we are searching suggestions for
 * Tc: the suggestion text
 * To: the original text which Tc is a translation of

The quality of suggestion Tc is calculated as 1 - E/min(length(Tc), length(To)), so that an exact match scores 1, matching the scale of the cutoff option. Depending on the length of the strings, we either use PHP's native levenshtein function or, if either of the strings is longer than 255 bytes, a PHP implementation of the Levenshtein algorithm. It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the other being the fulltext search and segmentation).
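
As a sketch in Python, assuming the distance E is taken between the search text S and the stored definition To, and with quality normalised so that 1 is an exact match (the native/PHP split in the real code only affects speed, not the result):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def quality(search_text, definition, suggestion):
    """Score a suggestion: 1 is an exact match, lower is worse."""
    e = levenshtein(search_text, definition)
    return 1 - e / min(len(suggestion), len(definition))

print(quality("January", "January", "Januar"))  # 1.0: an exact match
```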

There is a script which fills the translation memory with translations from the active message groups. Even big sites should be able to bootstrap the memory in half an hour when using multiple threads with the --thread parameter. The time depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap). New translations are automatically added by a hook. New sources (definitions) are added when the first translation is added.

Old translations which are no longer used and do not belong to any message group are not purged automatically unless you rerun the bootstrap script. When the translation of a message is updated, the previous translation is removed from the memory. When the definition is updated, nothing happens immediately; once translations are updated against the new definition, new entries are added. The old definition and its old translations remain in the database until purged by rerunning the bootstrap script. Fuzzy translations are not added to the translation memory, but neither are existing translations removed from the memory when they are fuzzied.

It is not yet possible to add a remote TTMServer (one queried through the provided web API module instead of directly through the database) as a translation aid, but this feature is planned for later.

tmserver
This section documents how to set up a local translation memory with the aid of tmserver from the Translate Toolkit. You will need SQLite support enabled in PHP.

First we need to install (or just extract) the Translate Toolkit. Download the latest release from its download page (not the big green button), then extract the package somewhere, preferably outside of the web document root.

Now we need to create an empty database. On the command line, go to the extracted folder and type the following commands. The second command throws an error, but it should still create a tmdb.sqlite file. Skip the first line if you have already installed the package by other means, and drop the path prefix from the second command as well.

You can use the sqlite3 command to check that the database is initialised correctly:
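
If the sqlite3 command-line tool is not at hand, the same check can be done with Python's built-in sqlite3 module; a small sketch (assuming the file is named tmdb.sqlite as above):

```python
import sqlite3

def list_tables(path):
    """Return the names of all tables in an SQLite database file."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

if __name__ == "__main__":
    # After initialisation the tmserver tables should show up here.
    print(list_tables("tmdb.sqlite"))
```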

You can now move the database file to a suitable place. You need to set up the permissions so that the database can be accessed by both the webserver and yourself. This means that the parent directory must also be writable by both.

Then configure the database file following the examples given in Translate.php. Now you can bootstrap the translation memory with existing translations by running tm-export.php in the scripts folder of the Translate extension.

Lastly, don't forget to start the translation memory server itself. Here is an example script. You can use cron or init files to start it automatically with the system. We lack a proper init script, so if you can make one it would be appreciated. Of course, replace the port and path to reflect your configuration. Change 0.0.0.0 to 127.0.0.1 if you don't want to make the server publicly accessible (the database must currently be on the same host as the server itself). Again, you can skip the path to the command if you have installed the toolkit.

The database is updated with new translations immediately when they are made, provided everything is working and the webserver has write permissions to the database.