Help:Extension:Translate/Translation memories

From MediaWiki.org

TTMServer is a translation memory server that comes with the Translate extension. It has no external dependencies. It is enabled by default and it replaces the support for tmserver from Translate Toolkit, which was hard to set up. TTMServer is a simple translation memory and it doesn't use any advanced algorithms. It does, however, take advantage of MediaWiki's excellent language support and database abstraction features.

There are three different ways to use TTMServer:

                                   Local database  Remote API  Solr backend
Enabled by default                 Yes             No          No
Can have multiple sources          No              Yes         Yes
Updated with local translations    Yes             No          Yes
Accesses the database directly     Yes             No          No
Access to source                   Editor          Link        Editor if local or link
Can be shared as an API service    Yes             Yes         Yes

Configuration

All translation aids, including translation memories, are configured with the $wgTranslateTranslationServices setting. Example configurations for TTMServers:

Default configuration
$wgTranslateTranslationServices['TTMServer'] = array(
        'database' => false, // Passed to wfGetDB
        'cutoff' => 0.75,
        'type' => 'ttmserver',
        'public' => false,
);
Remote API configuration
$wgTranslateTranslationServices['example'] = array(
        'url' => 'http://example.com/w/api.php',
        'displayname' => 'example.com',
        'cutoff' => 0.75,
        'timeout' => 3,
        'type' => 'ttmserver',
        'class' => 'RemoteTTMServer',
);
Solr backend configuration
$wgTranslateTranslationServices['TTMServer'] = array(
        'type' => 'ttmserver',
        'class' => 'SolrTTMServer',
        'cutoff' => 0.75,
        /* See http://wiki.solarium-project.org/index.php/V2:Basic_usage
        'config' => This will be passed to Solarium_Client
         */
);
See installation notes at the bottom of this page.

Possible keys and values are:

Key          Applies to  Description
config       Solr        Solr instance configuration for Solarium, see below.
cutoff       All         Minimum threshold for a suggestion to qualify. Only the best suggestions are shown, even if more of them exceed the threshold.
database     Local       If you want to store the translation memory in a different location, you can specify the database name here. If you use a load balancer, you must also tell it which database to connect to.
displayname  Remote      The text shown in the tooltip when hovering over the suggestion source link (the bullets).
public       All         Whether this TTMServer can be queried through this wiki's api.php.
symbol       All         The suggestion source link text. Defaults to ‣ for remote and to • otherwise.
timeout      Remote      How long to wait for an answer from the remote service.
type         All         Type of TTMServer in terms of result format.
url          Remote      URL to api.php of the remote TTMServer.
You must use the key TTMServer as the array index to $wgTranslateTranslationServices if you want the translation memory to be updated with new translations. Remote TTMServers cannot be used for that, because they cannot be updated.

Currently only MySQL is supported for the databases.

TTMServer API

If you would like to implement your own TTMServer service, here are the specifications.

Query parameters:

Your service must accept the following parameters:

Key             Value
format          json
action          ttmserver
service         Optional service identifier if there are multiple shared translation memories. If not provided, the default service is assumed.
sourcelanguage  Language code as used in MediaWiki, see IETF language tags and ISO 639.
targetlanguage  Language code as used in MediaWiki, see IETF language tags and ISO 639.
text            Source text in the source language.
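
As a rough illustration of the query parameters listed above, the following Python sketch builds a request URL for a TTMServer service. The function name is hypothetical, the endpoint is the placeholder URL from the remote configuration example, and the source-text parameter is assumed here to be named text.

```python
from urllib.parse import urlencode


def build_ttmserver_url(api_url, source_language, target_language, text, service=None):
    """Build a query URL for a TTMServer-style API (illustrative helper, not part of Translate)."""
    params = {
        "format": "json",
        "action": "ttmserver",
        "sourcelanguage": source_language,
        "targetlanguage": target_language,
        # Assumption: the source text is passed in a parameter named "text".
        "text": text,
    }
    # The service identifier is optional; omit it to use the default service.
    if service is not None:
        params["service"] = service
    return api_url + "?" + urlencode(params)


url = build_ttmserver_url("http://example.com/w/api.php", "en", "fi", "January")
print(url)
```

The values are URL-encoded by urlencode, so source texts containing spaces or non-ASCII characters can be passed as-is.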

Your service must provide a JSON object that must have the key ttmserver with an array of objects. Those objects must contain the following data:

Key       Value
source    Original source text.
target    Translation suggestion.
context   Local identifier for the source (optional).
location  URL to the page where the suggestion comes from.
quality   Decimal number in the range [0..1] describing the quality of the suggestion. 1 means a perfect match.

Example:

{
        "ttmserver": [
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Wikimedia:Messages\\x5b'January'\\x5d\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Wikimedia:Messages%5Cx5b%27January%27%5Cx5d\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Mantis:S month january\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Mantis:S_month_january\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "Tammikuu",
                        "context": "FUDforum:Month 1\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/FUDforum:Month_1\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuun",
                        "context": "MediaWiki:January-gen\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January-gen\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "MediaWiki:January\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January\/fi",
                        "quality": 0.85714285714286
                }
        ]
}
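
A client consuming such a response might look roughly like the Python sketch below. It parses the JSON, keeps the entries at or above a threshold and sorts them by quality. The function name, the client-side threshold and the second sample entry are assumptions made for illustration; only the key names come from the specification above.

```python
import json


def best_suggestions(response_body, cutoff=0.75):
    """Return suggestions with quality >= cutoff, best first (illustrative helper)."""
    data = json.loads(response_body)
    suggestions = [s for s in data.get("ttmserver", []) if s["quality"] >= cutoff]
    return sorted(suggestions, key=lambda s: s["quality"], reverse=True)


# First entry is taken from the example above; the second one is hypothetical.
sample = """
{
  "ttmserver": [
    {"source": "January", "target": "tammikuu",
     "location": "https://translatewiki.net/wiki/MediaWiki:January/fi",
     "quality": 0.85714285714286},
    {"source": "January", "target": "tammi",
     "location": "https://example.org/hypothetical", "quality": 0.5}
  ]
}
"""

for suggestion in best_suggestions(sample):
    print(suggestion["target"], suggestion["quality"], suggestion["location"])
```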

TTMServer architecture

The backend contains three tables: translate_tms, translate_tmt and translate_tmf. These correspond to sources, targets and fulltext, respectively. You can find the table definitions in sql/translate_tm.sql. The sources table contains all the message definitions. They are usually all in the same language, say English, but the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique id and two extra fields, length and context. Length is used as the first pass filter, so that when querying we don't need to compare the text we're searching with every entry in the database. The context stores the title of the page where the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestions back to "MediaWiki:Jan/de", which makes it possible for translators to quickly fix things, or just to determine where that kind of translation was used.

The second pass of filtering comes from the fulltext search. The definitions are processed with an ad hoc algorithm. First the text is split into segments (words) with MediaWiki's Language::segmentByWord. If there are enough segments, we strip basically everything that is not a word letter and normalize the case. Then we take the first ten unique words that are at least 5 bytes long (5 letters in English, but fewer letters in languages with multibyte code points). Those words are then stored in the fulltext index for further filtering of longer strings.
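
The word selection described above can be sketched in Python as follows. This is only an approximation under stated assumptions: whitespace splitting stands in for Language::segmentByWord, the "enough segments" check is left out because its threshold is not given here, and the function and parameter names are hypothetical.

```python
import re


def fulltext_words(text, max_words=10, min_bytes=5):
    """Pick the first unique words of at least min_bytes UTF-8 bytes (illustrative sketch)."""
    words = []
    seen = set()
    # Assumption: plain whitespace splitting replaces Language::segmentByWord.
    for segment in text.split():
        # Strip non-word characters (digits and underscores survive \W) and normalize case.
        word = re.sub(r"\W+", "", segment).lower()
        # Length is measured in bytes, so multibyte letters count more than once.
        if len(word.encode("utf-8")) < min_bytes or word in seen:
            continue
        seen.add(word)
        words.append(word)
        if len(words) == max_words:
            break
    return words


print(fulltext_words("The quick brown fox jumps over the lazy brown dog"))
```

Because the length is counted in bytes, a three-letter word such as "äää" occupies six bytes in UTF-8 and passes the filter, which matches the note about shorter words in languages with multibyte code points.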

When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:

E
the edit distance
Ts
the text we are searching suggestions for
Tc
the suggestion text
To
the original text which Tc is a translation of

The quality of suggestion Tc is calculated as E/min(length(Tc),length(To)). Depending on the length of the strings, we use either PHP's native levenshtein function or, if either of the strings is longer than 255 bytes, a PHP implementation of the Levenshtein algorithm.[1] It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the others being the fulltext search and segmentation).
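
For readers unfamiliar with the edit distance, here is a minimal Python sketch of the Levenshtein algorithm together with the ratio from the formula above. It is not the code used by TTMServer. It compares Unicode code points rather than bytes, the function names are hypothetical, and both strings passed to the ratio are assumed to be non-empty.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def distance_ratio(e, t_c, t_o):
    """E / min(length(Tc), length(To)) as written in the text; inputs must be non-empty."""
    return e / min(len(t_c), len(t_o))


print(levenshtein("kitten", "sitting"))
```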

There is a script which fills the translation memory with translations from the active message groups. Even big sites should be able to bootstrap the memory in half an hour when using multiple threads with the --threads parameter. The time depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap). New translations are automatically added by a hook. New sources (definitions) are added when the first translation is added.

Old translations which are no longer used and do not belong to any message group are not purged automatically, unless you rerun the bootstrap script. When the translation of a message is updated, the previous translation is removed from the memory. When the definition is updated, nothing happens immediately. When translations are updated against the new definition, a new entry is added. The old definition and its old translations remain in the database until purged by rerunning the bootstrap script. Fuzzy translations are not added to the translation memory, but existing translations are not removed from the memory when they are marked fuzzy either.

Solr backend

Much of the above also applies to the TTMServer using the Solr search platform as backend, except the details on database layout and queries. The results are by default ranked with the Levenshtein algorithm on the Solr side, but other available string matching algorithms, such as n-gram matching, can also be used.

In Solr there are no tables. Instead we have documents with fields. Here is an example document:

  <doc>
    <str name="wiki">sandwiki-bw_</str>
    <str name="uri">http://localhost/wiki/MediaWiki:Action-read/bn</str>
    <str name="messageid">MediaWiki:Action-read</str>
    <str name="globalid">sandwiki-bw_-MediaWiki:Action-read-813862/bn</str>
    <str name="language">bn</str>
    <str name="content">এই পাতাটি পড়ুন</str>
    <arr name="group">
      <str>core</str>
      <str>core-1.20</str>
      <str>core-1.19</str>
      <str>mediawiki</str>
    </arr>
    <long name="_version_">1421795636117766144</long>
  </doc>

Each translation has its own document, and each message definition has one too. To actually get suggestions, we first perform a search, sorted by the string similarity algorithm, over all documents in the source language. Then we do another query to fetch the translations, if any, for those messages.

We use many hooks to keep the translation memory database updated in almost real time. If a user translates similar messages one after another, the previous translation can (in the best case) be displayed as a suggestion for the next message.

Initial import

  1. Execute ttmserver-export.php command line script for each wiki using the shared translation memory.

New translation (if not fuzzy)

  1. Create document

Updated translation (if not fuzzy)

  1. Delete wiki:X language:Y message:Z
  2. Create document

Updated message definition

  1. Create new document

All existing documents for the message stay around because globalid is different.

Translation is fuzzied

  1. Delete wiki:X language:Y message:Z

Message changes group membership

  1. Delete wiki:X message:Z
  2. Create document (for all languages)

Message goes out of use

  1. Delete wiki:X message:Z
  2. Create document (for all languages)

Any further changes to definitions or translations are not updated to TM.
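
The update rules for translations listed above can be modelled with a small in-memory store. The Python sketch below is a toy model, not the Solr integration: a dictionary keyed by (wiki, message, version, language) stands in for the documents, and all class and method names are hypothetical.

```python
class TranslationMemorySketch:
    """Toy stand-in for the index; each dictionary key plays the role of one document."""

    def __init__(self):
        self.documents = {}

    def create(self, wiki, message, version, language, content):
        self.documents[(wiki, message, version, language)] = content

    def delete(self, wiki, message, language=None):
        # Mirrors "Delete wiki:X language:Y message:Z"; language=None covers all languages.
        doomed = [
            key for key in self.documents
            if key[0] == wiki and key[1] == message
            and (language is None or key[3] == language)
        ]
        for key in doomed:
            del self.documents[key]

    def new_translation(self, wiki, message, version, language, content, fuzzy=False):
        # Fuzzy translations are not stored.
        if not fuzzy:
            self.create(wiki, message, version, language, content)

    def updated_translation(self, wiki, message, version, language, content, fuzzy=False):
        # Delete the old document first, then create the replacement.
        if not fuzzy:
            self.delete(wiki, message, language)
            self.create(wiki, message, version, language, content)

    def translation_fuzzied(self, wiki, message, language):
        self.delete(wiki, message, language)


tm = TranslationMemorySketch()
tm.new_translation("wikiX", "MediaWiki:January", 1, "fi", "tammikuu")
tm.updated_translation("wikiX", "MediaWiki:January", 1, "fi", "Tammikuu")
print(tm.documents)
```

Deleting by wiki, message and language rather than by full key is what removes translations made against older versions of the definition when a translation is updated.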

Translation memory query

  1. Collect similar messages with strdist("message definition",content)
  2. Collect translation with globalid:[A,B,C]

Search query

  1. Find all matches with text:"search query"

Can be narrowed further by facets on language or group field.

Identifier fields

The field globalid uniquely identifies the translation or message definition by combining the following fields:

  • wiki identifier (MediaWiki database id)
  • message identifier (Title of the base page)
  • message version identifier (Revision id of the message definition page)
  • message language

The format used is wiki-message-version/language.

In addition we have separate fields for wiki id, message id and language to make the delete queries listed above possible.
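
As a small illustration of the identifier format, this Python sketch assembles a globalid from its four parts. The function name is hypothetical; the values are taken from the example document earlier in this section.

```python
def build_globalid(wiki, message, version, language):
    """Combine the identifier fields as wiki-message-version/language (illustrative helper)."""
    return f"{wiki}-{message}-{version}/{language}"


globalid = build_globalid("sandwiki-bw_", "MediaWiki:Action-read", 813862, "bn")
print(globalid)
```

Note that the wiki identifier and the message title in this example both contain hyphens themselves, so splitting a globalid back into its parts is not reliable. Querying the separate wiki, message and language fields avoids that problem.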

Installation

Here are the general quick steps for installing and configuring Solr for TTMServer. You should adapt them to your situation.

# Solr needs java
sudo apt-get install openjdk-6-jre-headless
# Download and extract solr from  http://lucene.apache.org/solr/downloads.html
wget http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar xzf apache-solr-*.tgz
cd apache-solr-*/example
# Copy the config from the extension directory
cp .../Translate/ttmserver/schema.xml solr/conf
# Start the server
java -jar start.jar

To use the Solr backend you also need the Solarium library. The easiest way is to install the Solarium MediaWiki extension. See the example configuration for the Solr backend in the configuration section of this page. You can pass extra configuration to Solarium via the config key, as is done, for example, in the Wikimedia configuration.

And finally we can populate the translation memory with content.

php Translate/scripts/ttmserver-export.php --threads 2