Help:Extension:Translate/Translation memories

From MediaWiki.org

TTMServer is the translation memory that comes with the Translate extension. It has no external dependencies and is enabled by default. It replaces the support for tmserver from translatetoolkit, which was hard to set up. TTMServer is a simple translation memory and doesn't use any advanced algorithms. Instead, it takes advantage of MediaWiki's excellent language support and database abstraction features.

There are three different ways to use TTMServer:

Feature                         | Local database | Remote API       | Shared database
--------------------------------|----------------|------------------|------------------------
Enabled by default              | Yes            | No               | No
Can have multiple sources       | No             | Yes              | Yes
Updated with local translations | Yes            | No               | Yes
Accesses database directly      | Yes            | No               | Yes
Access to source                | Editor         | Link             | Editor if local or link
Can be shared as API service    | Yes            | Yes              | Yes
Type identifier                 | ttmserver      | remote-ttmserver | shared-ttmserver

All translation aids, including translation memories, are configured with the $wgTranslateTranslationServices configuration setting. Example configurations of TTMServers:

Default configuration:

$wgTranslateTranslationServices['TTMServer'] = array(
        'database' => false, // Passed to wfGetDB
        'cutoff' => 0.75,
        'type' => 'ttmserver',
        'public' => false,
);

Remote API configuration:

$wgTranslateTranslationServices['example'] = array(
        'url' => 'http://example.com/w/api.php',
        'displayname' => 'example.com',
        'cutoff' => 0.75,
        'timeout-sync' => 4,
        'timeout-async' => 4,
        'type' => 'remote-ttmserver',
);

Shared database configuration:

$wgTranslateTranslationServices['TTMServer'] = array(
        'database' => 'sharedttmserver', // Passed to wfGetDB
        'cutoff' => 0.75,
        'type' => 'shared-ttmserver',
);

Possible keys and values are:

Key           | Applies to       | Description
--------------|------------------|------------
cutoff        | All              | Minimum quality threshold for a matching suggestion. Only the few best suggestions are shown, even if more of them are above the threshold.
database      | Local and shared | For shared databases, or if you just want to store the translation memory in a different location, you can specify the database name here. You also have to configure MediaWiki's load balancer to know how to connect to that database.
displayname   | Remote           | The text shown in the tooltip when hovering over the suggestion source link (the bullets).
public        | All              | Whether this TTMServer can be queried through the api.php of this wiki.
symbol        | All              | The suggestion source link text. Defaults to ‣ for remote and to • otherwise.
timeout-async | Remote           | How long to wait for an answer when the suggestion is loaded separately via an AJAX call.
timeout-sync  | Remote           | How long to wait for an answer when the suggestion is loaded inside a PHP request.
type          | All              | Type of the TTMServer, see the table above.
url           | Remote           | URL to api.php of the remote TTMServer.

Note: You must use the key TTMServer as the array index to $wgTranslateTranslationServices if you want the translation memory to be updated with new translations. Remote TTMServers cannot be used for that, because they cannot be updated.

Currently only MySQL is supported for the databases.
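
As an illustration of how these keys can be combined, the following sketch keeps the local memory, makes it queryable through api.php, and adds one remote memory next to it. Only the keys come from the documentation above; the URL, the display name, the array index example and the chosen symbol are placeholder values, not recommendations.

```php
<?php
// LocalSettings.php fragment (illustrative sketch only).

// Local memory: must use the index 'TTMServer' to receive new translations.
$wgTranslateTranslationServices['TTMServer'] = array(
        'database' => false, // Passed to wfGetDB
        'cutoff' => 0.75,
        'type' => 'ttmserver',
        'public' => true, // Let other wikis query this memory via api.php
);

// Additional read-only remote memory (placeholder host and name).
$wgTranslateTranslationServices['example'] = array(
        'url' => 'http://example.com/w/api.php',
        'displayname' => 'example.com',
        'symbol' => '‣', // Link text for suggestions from this source
        'cutoff' => 0.75,
        'timeout-sync' => 4,
        'timeout-async' => 4,
        'type' => 'remote-ttmserver',
);
```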

TTMServer API

If you would like to implement your own TTMServer service, here are the specifications.

Query parameters:

Your service must accept the following parameters:

Key            | Value
---------------|------
format         | json
action         | ttmserver
service        | Optional service identifier if there are multiple shared translation memories. If not provided, the default service is assumed.
sourcelanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639.
targetlanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639.
text           | Source text in the source language.
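
A query against such a service could be assembled as in the sketch below. The function name buildTtmserverQuery is hypothetical, the host is a placeholder, and the source-text parameter is assumed here to be named text. Only PHP's standard http_build_query is used.

```php
<?php
/**
 * Build the URL for a TTMServer query (sketch; names are illustrative).
 * The optional service identifier is only sent when given.
 */
function buildTtmserverQuery(
    string $apiUrl,
    string $sourceLanguage,
    string $targetLanguage,
    string $text,
    ?string $service = null
): string {
    $params = array(
        'format' => 'json',
        'action' => 'ttmserver',
    );
    if ($service !== null) {
        $params['service'] = $service;
    }
    $params['sourcelanguage'] = $sourceLanguage;
    $params['targetlanguage'] = $targetLanguage;
    $params['text'] = $text; // assumed parameter name for the source text

    // Pass '&' explicitly so the result does not depend on php.ini settings.
    return $apiUrl . '?' . http_build_query($params, '', '&');
}

$url = buildTtmserverQuery('http://example.com/w/api.php', 'en', 'fi', 'January');
```

The resulting URL can then be fetched with any HTTP client; the timeouts described in the configuration table apply on the MediaWiki side, not in this sketch.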

Your service must return a JSON object with the key ttmserver, whose value is an array of objects. Each of those objects must contain the following data:

Key      | Value
---------|------
source   | Original source text.
target   | Translation suggestion.
context  | Local identifier for the source, optional.
location | URL to the page where the suggestion can be seen in use.
quality  | Decimal number in range [0..1] describing the suggestion quality. 1 means perfect match.

Example:

{
        "ttmserver": [
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Wikimedia:Messages\\x5b'January'\\x5d\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Wikimedia:Messages%5Cx5b%27January%27%5Cx5d\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Mantis:S month january\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Mantis:S_month_january\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "Tammikuu",
                        "context": "FUDforum:Month 1\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/FUDforum:Month_1\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuun",
                        "context": "MediaWiki:January-gen\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January-gen\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "MediaWiki:January\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January\/fi",
                        "quality": 0.85714285714286
                }
        ]
}
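
On the consuming side, a response of this shape could be read as in the following sketch. The function name parseTtmserverResponse is hypothetical, and dropping incomplete entries and sorting by quality are choices made for this example rather than behaviour required by the specification.

```php
<?php
/**
 * Decode a TTMServer response and return its suggestions, best first.
 * Entries lacking a required key are skipped; context is optional.
 */
function parseTtmserverResponse(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['ttmserver']) || !is_array($data['ttmserver'])) {
        throw new InvalidArgumentException('Response has no "ttmserver" array');
    }

    $suggestions = array();
    foreach ($data['ttmserver'] as $item) {
        if (!is_array($item)) {
            continue;
        }
        foreach (array('source', 'target', 'location', 'quality') as $key) {
            if (!isset($item[$key])) {
                continue 2; // skip this entry entirely
            }
        }
        $suggestions[] = array(
            'source' => (string)$item['source'],
            'target' => (string)$item['target'],
            'context' => isset($item['context']) ? (string)$item['context'] : null,
            'location' => (string)$item['location'],
            'quality' => (float)$item['quality'],
        );
    }

    // Highest quality first.
    usort($suggestions, function ($a, $b) {
        return $b['quality'] <=> $a['quality'];
    });

    return $suggestions;
}
```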

TTMServer architecture

The backend contains three tables: translate_tms, translate_tmt and translate_tmf. They correspond to sources, targets and fulltext. You can find the table definitions in sql/translate_tm.sql. The sources contain all the message definitions. Even though they are usually all in the same language, say, English, the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique id and two extra fields, length and context. Length is used as the first pass filter, so that when querying we don't need to compare the text we're searching with every entry in the database. The context stores the title of the page where the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestions back to "MediaWiki:Jan/de", which makes it possible for translators to quickly fix things, or just to determine where that kind of translation was used.

The second pass of filtering comes from the fulltext search. The definitions are processed with an ad hoc algorithm. First the text is split into segments (words) with MediaWiki's Language::segmentByWord. If there are enough segments, we strip basically everything that is not a word letter and normalize the case. Then we take the first ten unique words that are at least 5 bytes long (5 letters in English, but even shorter words for languages with multibyte code points). Those words are then stored in the fulltext index for further filtering for longer strings.
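
The word selection described above can be sketched as follows. This is not the extension's code: the function name fulltextWords is hypothetical, a plain whitespace split stands in for Language::segmentByWord, and the "enough segments" check is left out because its threshold is not stated here. The sketch requires the mbstring extension.

```php
<?php
/**
 * Pick up to ten unique words of at least 5 bytes for a fulltext index.
 * Sketch only: whitespace splitting replaces MediaWiki's segmentation.
 */
function fulltextWords(string $text): array
{
    $segments = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($segments === false) {
        return array(); // invalid UTF-8 input
    }

    $words = array();
    foreach ($segments as $segment) {
        // Keep letters only, then normalize the case.
        $word = mb_strtolower((string)preg_replace('/[^\p{L}]+/u', '', $segment), 'UTF-8');

        // strlen() counts bytes, so multibyte words qualify with fewer letters.
        if (strlen($word) >= 5) {
            $words[$word] = true; // array keys keep the words unique
        }
        if (count($words) >= 10) {
            break;
        }
    }

    return array_keys($words);
}
```

Because the length test counts bytes, a four-letter word such as "ääni" (6 bytes in UTF-8) passes the filter while the English "over" does not.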

When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:

E
edit distance
Ts
the text we are searching suggestions for
Tc
the suggestion text
To
the original text which Tc is a translation of

The quality of suggestion Tc is calculated from the ratio E/min(length(Tc),length(To)): an edit distance of zero is a perfect match, and the quality drops as the ratio grows. Depending on the length of the strings, we use either PHP's native levenshtein function or, if either of the strings is longer than 255 bytes, the PHP implementation of the Levenshtein algorithm.[1] It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the other being the fulltext search and segmentation).
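
The following sketch shows one way the ranking step could look. It rests on several assumptions that are not confirmed by the text above: the distance is taken between the searched text and the stored source text, lengths are counted in bytes, and the quality is one minus the ratio of edit distance to the shorter length (which would reproduce a value such as 0.857... for one edit in a seven-letter word). The function names are hypothetical, and the fallback is a plain dynamic-programming implementation, not the one referenced in [1].

```php
<?php
/** Byte-wise Levenshtein distance in plain PHP, for strings of any length. */
function levenshteinFallback(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $prev = range(0, $lenB); // distances from the empty prefix of $a
    for ($i = 1; $i <= $lenA; $i++) {
        $cur = array($i);
        for ($j = 1; $j <= $lenB; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,        // deletion
                $cur[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost // substitution or match
            );
        }
        $prev = $cur;
    }
    return $prev[$lenB];
}

/** Quality in [0..1], assuming quality = 1 - E / min(length, length). */
function suggestionQuality(string $searched, string $original): float
{
    $lenA = strlen($searched);
    $lenB = strlen($original);
    $shorter = min($lenA, $lenB);
    if ($shorter === 0) {
        return ($lenA === $lenB) ? 1.0 : 0.0;
    }

    // Use the native function only for strings of at most 255 bytes.
    $distance = ($lenA > 255 || $lenB > 255)
        ? levenshteinFallback($searched, $original)
        : levenshtein($searched, $original);

    return max(0.0, 1.0 - $distance / $shorter);
}
```

Both code paths compare bytes rather than characters, so a single changed multibyte letter counts as more than one edit. This is the kind of weakness for non-English source text that the paragraph above points out.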

There is a script which fills the translation memory with translations from the active message groups. Even big sites should be able to bootstrap the memory in half an hour when using multiple threads with the --thread parameter. The time depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap). New translations are automatically added by a hook. New sources (definitions) are added when the first translation is added.

Old translations which are no longer used and do not belong to any message group are not purged automatically, unless you rerun the bootstrap script. When the translation of a message is updated, the previous translation is removed from the memory. When the definition is updated, nothing happens immediately. When translations are updated against the new definition, a new entry is added. The old definition and its old translations remain in the database until purged by rerunning the bootstrap script. Fuzzy translations are not added to the translation memory, but existing translations are not removed from the memory when they are marked as fuzzy either.
