Help:Extension:Translate/Translation memories

The Translate extension's translation memory supports multiple backends. The available backends are database, Solr and ElasticSearch. This page helps you install the one that suits you best and explains each backend in more detail.

Unlike other translation aids, such as external machine translation services, the translation memory is constantly updated with new translations made in your wiki. If you choose Solr or ElasticSearch, advanced search across translations is also available at Special:SearchTranslations.

Comparison

The database backend is used by default: it has no dependencies and doesn't need configuration. The database backend can't be shared among multiple wikis, and it does not scale to large amounts of translated content. Hence we also support Solr and ElasticSearch backends. It is also possible to use another wiki's translation memory if its web API is open. Unlike the others, remote backends are not updated with translations from the current wiki.

                                 Database             Remote API   Solr or ElasticSearch
Enabled by default               Yes                  No           No
Can have multiple sources        No                   Yes          Yes
Updated with local translations  Yes                  No           Yes
Direct database access           Yes                  No           No
Access to sources                Editor               Link         Editor if local, link otherwise
Can be shared as an API service  Yes                  No           Yes
Performance                      Does not scale well  Unknown      Reasonable

Requirements

ElasticSearch backend

ElasticSearch is relatively easy to set up. If it is not available in your distribution packages, you can get it from their website. You will also need to get the Elastica extension. Finally, please see puppet/modules/elasticsearch/files/elasticsearch.yml for specific configuration needed by Translate.
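
For example, fetching the Elastica extension might look like this (a hedged sketch: the clone URL is the standard Wikimedia Gerrit location, but the extensions path is an assumption; afterwards, enable the extension in LocalSettings.php):

# Assumes MediaWiki lives at /var/www/mediawiki
cd /var/www/mediawiki/extensions
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Elastica
# Pull in the bundled Elastica library
cd Elastica && composer install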

The bootstrap script will create the necessary schemas. If you are using the ElasticSearch backend with multiple wikis, they will share the translation memory by default, unless you set the index parameter in the configuration.
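
If you want each wiki to keep a separate index instead, a minimal sketch could look like this (the index name ttmserver-wiki1 is purely illustrative; see the index key in the configuration table below):

$wgTranslateTranslationServices['TTMServer'] = array(
        'type' => 'ttmserver',
        'class' => 'ElasticSearchTTMServer',
        'cutoff' => 0.75,
        // Hypothetical per-wiki index name.
        'index' => 'ttmserver-wiki1',
);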

When upgrading to the next major version of elasticsearch (e.g. upgrading from 2.x to 5.x), it is highly recommended to read the release notes and the documentation regarding the upgrade process. Elastic offers a migration plugin that may help you prepare your system before installing the next version. Follow these steps carefully, otherwise elasticsearch may refuse to upgrade and you may end up in a delicate situation where you cannot roll back to the previous version.

Once the next major version is installed, the next run of ttmserver-export.php will probably fail. This is because elasticsearch performs an internal migration process that may create a mapping incompatible with the one ttmserver-export.php wants to update. You have to use the --reindex flag to force a full rebuild of the index.

As a rule of thumb, it is recommended to run ttmserver-export.php with the --reindex flag after major upgrades. This ensures that your indices are always created with the current elasticsearch version, which is a requirement for major upgrades.
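
For example, assuming the script lives in the Translate extension's scripts directory (the usual location), a full rebuild could be run as:

php extensions/Translate/scripts/ttmserver-export.php --reindex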

Solr backend

Here are the general quick steps for installing and configuring Solr for TTMServer. You should adapt them to your situation.

# Solr needs java
sudo apt-get install openjdk-6-jre-headless
# Download and extract solr from: http://lucene.apache.org/solr/downloads.html
wget http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar xzf apache-solr-*.tgz
cd apache-solr-*/example
# Copy the config from the extension directory
cp .../Translate/ttmserver/schema.xml solr/conf
# Start the server
java -jar start.jar

To use the Solr backend you also need the Solarium library. The easiest way is to install the Solarium extension. See the example configuration for the Solr backend in the configuration section of this page. You can pass extra configuration to Solarium via the config key, as done for example in the Wikimedia configuration.

Installation

After putting the requirements in place, installation requires you to tweak the configuration and then run the bootstrap.

Configuration

All translation aids, including the translation memory, are configured with the $wgTranslateTranslationServices configuration variable.

The primary translation memory backend must use the key TTMServer. The primary backend receives translation updates and is used by Special:SearchTranslations.

Example configuration of TTMServers:

Default configuration
$wgTranslateTranslationServices['TTMServer'] = array(
        'database' => false, // Passed to wfGetDB
        'cutoff' => 0.75,
        'type' => 'ttmserver',
        'public' => false,
);
Remote API configuration
$wgTranslateTranslationServices['example'] = array(
        'url' => 'http://example.com/w/api.php',
        'displayname' => 'example.com',
        'cutoff' => 0.75,
        'timeout' => 3,
        'type' => 'ttmserver',
        'class' => 'RemoteTTMServer',
);
ElasticSearch backend configuration
$wgTranslateTranslationServices['TTMServer'] = array(
        'type' => 'ttmserver',
        'class' => 'ElasticSearchTTMServer',
        'cutoff' => 0.75,
        /*
         * See http://elastica.io/getting-started/installation.html
         * See https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Client.php
        'config' => This will be passed to \Elastica\Client
         */
);
ElasticSearch multiple backends configuration (supported by MLEB 2017.04)
// Defines the default service used for read operations
// Allows quickly switching to another backend
$wgTranslateTranslationDefaultService = 'cluster1';
$wgTranslateTranslationServices['cluster1'] = array(
        'type' => 'ttmserver',
        'class' => 'ElasticSearchTTMServer',
        'cutoff' => 0.75,
        /*
         * Defines the list of services to replicate writes to.
         * Only "writable" services are allowed here.
         */
        'mirrors' => [ 'cluster2' ],
        'config' => [ 'servers' => [ 'host' => 'elastic1001.cluster1.mynet' ] ]
);
$wgTranslateTranslationServices['cluster2'] = array(
        'type' => 'ttmserver',
        'class' => 'ElasticSearchTTMServer',
        'cutoff' => 0.75,
        /*
         * if "cluster2" is defined as the default service it will start to replicate writes to "cluster1".
         */
        'mirrors' => [ 'cluster1' ],
        'config' => [ 'servers' => [ 'host' => 'elastic2001.cluster2.mynet' ] ]
);
Solr backend configuration
$wgTranslateTranslationServices['TTMServer'] = array(
        'type' => 'ttmserver',
        'class' => 'SolrTTMServer',
        'cutoff' => 0.75,
        /* See http://wiki.solarium-project.org/index.php/V2:Basic_usage
        'config' => This will be passed to Solarium_Client
         */
);

The possible keys are:

Key                  Used by                 Description
config               Solr and ElasticSearch  Configuration passed to Solarium or Elastica.
cutoff               All                     Minimum threshold for a matching suggestion. Even though there would be more matches above the threshold, only a few of the best suggestions are shown.
database             Local                   If you want to store the translation memory in a different location, you can specify the database name here. You also need to configure MediaWiki's load balancer to know how to connect to that database.
displayname          Remote                  The text shown in the tooltip when hovering over the suggestion source link (the bullet icons).
index                ElasticSearch           The index to use in ElasticSearch. Default: ttmserver.
public               All                     Whether this TTMServer can be queried through this wiki's api.php.
replicas             ElasticSearch           If you are running a cluster, you can increase the number of replicas. Default: 0.
shards               ElasticSearch           How many shards to use. Default: 5.
timeout              Remote                  How long to wait, in seconds, for a reply from the remote service.
type                 All                     Type of the TTMServer in terms of the results format.
url                  Remote                  URL to api.php of the remote TTMServer.
use_wikimedia_extra  ElasticSearch           Boolean; when the extra plugin is deployed, you can disable dynamic scripting on elastic v1.x. This plugin is mandatory for elastic 2.x clusters.
mirrors              Writable services       Array of strings; defines the list of services to replicate writes to, keeping multiple TTM services up to date. Useful for fast switch-overs or to reduce downtime during planned maintenance operations. (Added in MLEB 2017.04)
For the translation memory to be updated with new translations, you must use the key TTMServer as the array index in $wgTranslateTranslationServices. Remote TTMServers cannot do this, because they cannot be updated. As of MLEB 2017.04 the key TTMServer can be configured with the configuration variable $wgTranslateTranslationDefaultService.

Only MySQL is currently supported as the database backend.

Bootstrap

When you have chosen Solr or ElasticSearch and set up the requirements and configuration, run ttmserver-export.php to bootstrap the translation memory. Bootstrapping is also required when changing translation memory backend. If you are using a shared translation memory backend for multiple wikis, you'll need to bootstrap each of them separately.

Sites with lots of translations should consider using multiple threads with the --threads parameter to speed up the process. The time needed depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap). New translations are automatically added by a hook. New sources (message definitions) are added when the first translation is created.
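
A typical bootstrap invocation might look like this (a sketch: the script path assumes a standard install, and the thread count of 4 is an arbitrary example):

php extensions/Translate/scripts/ttmserver-export.php --threads 4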

Bootstrap does the following things, which don't happen otherwise:

  • adding and updating the translation memory schema;
  • populating the translation memory with existing translations;
  • cleaning up unused translation entries by emptying and re-populating the translation memory.

When the translation of a message is updated, the previous translation is removed from the translation memory. However, when translations are updated against a new definition, a new entry is added, but the old definition and its old translations remain in the database until purged. When a message changes definition or is removed from all message groups, nothing happens immediately. Saving a translation as fuzzy neither adds a new translation nor deletes an old one in the translation memory.

TTMServer API

If you want to implement your own TTMServer service, here are the details.

Query parameters:

Your service must accept the following parameters:

format          json
action          ttmserver
service         Optional service identifier when there are multiple shared translation memories. If not provided, the default service is used.
sourcelanguage  A language code as used in MediaWiki; see IETF language tags and ISO 639.
targetlanguage  A language code as used in MediaWiki; see IETF language tags and ISO 639.
text            The text to find suggestions for, in the source language.
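
For example, a query for the suggestions shown in the sample response below might look like this (an illustrative URL against translatewiki.net):

https://translatewiki.net/w/api.php?action=ttmserver&format=json&sourcelanguage=en&targetlanguage=fi&text=January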

Your service must reply with a JSON object that contains the key ttmserver with an array of objects. Those objects must contain the following data:

source    The original source text.
target    The suggested translation.
context   Local identifier for the source; optional.
location  URL to a page where the suggestion can be seen in use.
quality   A decimal number in the range [0..1] describing the quality of the suggestion. 1 means a perfect match.

For example:

{
        "ttmserver": [
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Wikimedia:Messages\\x5b'January'\\x5d\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Wikimedia:Messages%5Cx5b%27January%27%5Cx5d\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "Mantis:S month january\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/Mantis:S_month_january\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "Tammikuu",
                        "context": "FUDforum:Month 1\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/FUDforum:Month_1\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuun",
                        "context": "MediaWiki:January-gen\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January-gen\/fi",
                        "quality": 0.85714285714286
                },
                {
                        "source": "January",
                        "target": "tammikuu",
                        "context": "MediaWiki:January\/en",
                        "location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January\/fi",
                        "quality": 0.85714285714286
                }
        ]
}

Database backend

The backend contains three tables: translate_tms, translate_tmt and translate_tmf. They correspond to sources, targets and fulltext respectively. You can find the table definitions in sql/translate_tm.sql. The sources contain all the message definitions. Even though they are usually all in the same language, say English, the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique ID and two extra fields, length and context. The length is used as the first filter when querying, so that we don't need to compare the text we are searching for against every entry in the database. The context stores the title of the page the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestion to "MediaWiki:Jan/de", which helps translators quickly fix an existing translation or determine which translation to use.

The second filter is the fulltext index. Its definition is intermixed with an ad hoc algorithm. First the text is segmented into segments (words) with MediaWiki's Language::segmentByWord. If there are enough segments, we normalize them, mostly by removing anything that is not a word character. We then take the first ten unique words that are at least five bytes long (five letters in English, fewer characters for languages with multibyte characters). Those words are stored in the fulltext index for future filtering of longer strings.
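
A rough standalone approximation of this term extraction is sketched below (the real code uses MediaWiki's Language::segmentByWord and lives in the Translate extension, so the details differ):

function extractFulltextTerms( $text ) {
        // Split into words, approximating Language::segmentByWord.
        $segments = preg_split( '/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
        $terms = array();
        foreach ( $segments as $segment ) {
                // Normalize: strip everything that is not a letter or digit.
                $word = preg_replace( '/[^\p{L}\p{N}]+/u', '', $segment );
                // Keep the first ten unique words of at least five bytes.
                if ( strlen( $word ) >= 5 && !in_array( $word, $terms, true ) ) {
                        $terms[] = $word;
                        if ( count( $terms ) === 10 ) {
                                break;
                        }
                }
        }
        return $terms;
}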

After the filtering, the matching targets are fetched from the target table. The final filtering and ranking is then done with an edit distance algorithm. Let's define:

E: the edit distance
T: the text we are searching suggestions for
Tc: the text of the suggestion
To: the original text of the translation Tc

The quality of the suggestion Tc is calculated as E/min(length(Tc), length(To)). We use PHP's native levenshtein function, but fall back to a levenshtein algorithm implemented in PHP when either string is longer than 255 bytes.[1] It has not been tested whether the native levenshtein handles multibyte characters correctly. This may be another issue when the source language is not English (as with the fulltext indexing and segmentation).
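
Sketched in PHP, the ranking step might look like this (custom_levenshtein stands in for the PHP-space fallback mentioned above and is hypothetical):

function suggestionQuality( $t, $to, $tc ) {
        // PHP's native levenshtein() only accepts strings up to 255 bytes.
        if ( strlen( $t ) > 255 || strlen( $to ) > 255 ) {
                $e = custom_levenshtein( $t, $to ); // hypothetical PHP-space fallback
        } else {
                $e = levenshtein( $t, $to );
        }
        // E/min(length(Tc), length(To)) as defined above.
        return $e / min( strlen( $tc ), strlen( $to ) );
}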

Solr backend

The Solr search platform backend works similarly to the database backend, except that it uses a dedicated search engine for increased speed. By default the results are ranked with the levenshtein algorithm on the Solr side, but other available string matching algorithms can also be used, for example ngram matching.

In Solr there are no tables; instead there are documents with fields. Here is an example document:

  <doc>
    <str name="wiki">sandwiki-bw_</str>
    <str name="uri">http://localhost/wiki/MediaWiki:Action-read/bn</str>
    <str name="messageid">MediaWiki:Action-read</str>
    <str name="globalid">sandwiki-bw_-MediaWiki:Action-read-813862/bn</str>
    <str name="language">bn</str>
    <str name="content">এই পাতাটি পড়ুন</str>
    <arr name="group">
      <str>core</str>
      <str>core-1.20</str>
      <str>core-1.19</str>
      <str>mediawiki</str>
    </arr>
    <long name="_version_">1421795636117766144</long>
  </doc>

Each translation has its own document, as does each message definition. To actually get suggestions, we first search all documents in the source language, ranked by a string similarity algorithm, and then make a second query to fetch the translations of those messages.

We use plenty of hooks to keep the translation memory database up to date in real time. If a user translates similar messages one after another, the earlier translation will (in the best case) already be shown as a suggestion when translating the later ones.

New translation (if not fuzzy)

  1. Create document

Updated translation (if not fuzzy)

  1. Delete wiki:X language:Y message:Z
  2. Create document

Updated message definition

  1. Create new document

All existing documents for the message stay around because globalid is different.

Translation is fuzzied

  1. Delete wiki:X language:Y message:Z

Message changes group membership

  1. Delete wiki:X message:Z
  2. Create document (for all languages)

Message goes out of use

  1. Delete wiki:X message:Z
  2. Create document (for all languages)

Any further changes to definitions or translations are not propagated to the translation memory.

Translation memory query

  1. Collect similar messages with strdist("message definition",content)
  2. Collect translation with globalid:[A,B,C]

Search query

  1. Find all matches with text:"search query"

Can be narrowed further by facets on language or group field.
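
For illustration, such a search against a local Solr 3.x instance could look like this (host, port and core path are assumptions based on a default standalone setup):

http://localhost:8983/solr/select?q=text:%22search+query%22&fq=language:fi&facet=true&facet.field=group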

Identifier fields

The globalid field uniquely identifies the translation or message definition by combining the following fields:

  • wiki identifier (MediaWiki database id)
  • message identifier (Title of the base page)
  • message version identifier (Revision id of the message definition page)
  • message language

The format used is wiki-message-version/language, for example sandwiki-bw_-MediaWiki:Action-read-813862/bn in the sample document above.

In addition we have separate fields for wiki id, message id and language to make the delete queries listed above possible.