Content translation/Documentation/FAQ

What is the Content Translation tool?
It's a tool that helps editors create a new article based on a corresponding article about the same topic in a different language.

What is CX?
"CX" is an abbreviation for "ContentTranslation". See glossary below.

How does the Content Translation tool differ from the Translate extension?
The Translate extension was initially built with a focus on translating software user interface messages for MediaWiki and other programs. It can also translate MediaWiki pages, but experience shows that it is not practical for translating articles of the kind found in Wikipedia, Wikivoyage or similar sites: it requires adding markup to the source article to prepare it for translation, and it can break things if the source article changes drastically, as often happens in Wikipedia. This works fairly well for documentation on mediawiki.org, Meta and many other sites, but it doesn't scale for Wikipedia.

Is it available for all users of a wiki?
It is available for logged-in users of a wiki where it's enabled, and it must be enabled as a beta feature in the preferences.

Can it be used only to translate articles?
The focus of initial development is articles in Wikipedia and possibly Wikivoyage. It may later be extended to articles in the style of other sites.

Will there be special features to insert links and references from the original article?
Links will be automatically inserted when a corresponding link can be found using interlanguage links.

The tool will try to adapt references as much as possible between the source and target languages. This may be challenging given that different languages use different citation formats.
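The link insertion described above can be sketched as follows. This is only an illustration, not the tool's actual implementation: the `INTERLANGUAGE_LINKS` table, the `adapt_links` function and the policy of reducing unmatched links to plain text are all assumptions made for the example. The Sea/Mar pair is the one used in the glossary.

```python
import re

# Hypothetical interlanguage-link table: (source title, target language) -> target title.
# Real data would come from interlanguage links; this entry is illustrative only.
INTERLANGUAGE_LINKS = {
    ("Sea", "es"): "Mar",
}

# Matches [[Title]] and [[Title|label]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")


def adapt_links(wikitext, target_lang, table=INTERLANGUAGE_LINKS):
    """Rewrite wikilinks to their target-language titles where a counterpart is known.

    Links without a known counterpart are reduced to their visible text.
    That is one possible policy, assumed here for the sketch.
    """
    def replace(match):
        title = match.group(1).strip()
        label = match.group(2) or title
        target = table.get((title, target_lang))
        if target is None:
            return label  # no counterpart found: keep only the visible text
        return f"[[{target}|{label}]]"

    return WIKILINK.sub(replace, wikitext)
```

The translator would still be able to review the result, since an automatically matched title is not guaranteed to fit the context.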

Will Content Translation use information from Wikidata?
Yes.

The earliest release will use interlanguage links from Wikidata to auto-fill the links in the translated article. There are plans to use labels and aliases soon afterwards.

It is likely that as templates in different Wikipedias use more data from Wikidata, ContentTranslation will simply pick it up.

What are the translation aids that will be made available?
The current plan is:
 * Dictionaries: translation and definitions of words.
 * Link adaptation: Links will be adapted automatically when they are available as interlanguage links to the target language. It will be possible to perform basic operations on them: remove them or pick them from other sources.
 * Machine translation and translation memory: These are similar to what is used in the Translate extension.

Will you provide suggestions from translation memory?
Yes, in the future.

The data for translation memory will have to be filled from some initial translations, so it may take a while from the time that translation memory is enabled for ContentTranslation until it becomes useful.
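To illustrate why a translation memory is only useful once it has been filled, here is a minimal sketch of one. The `TranslationMemory` class, its method names and the similarity cutoff are hypothetical and chosen for the example. Fuzzy matching is done with Python's standard `difflib` module, which is not necessarily what a real translation memory service uses.

```python
import difflib


class TranslationMemory:
    """Toy in-memory translation memory; a real service would persist and index entries."""

    def __init__(self):
        self._entries = {}  # source segment -> translated segment

    def add(self, source, translation):
        """Record a finished translation so that it can be suggested later."""
        self._entries[source] = translation

    def suggest(self, segment, cutoff=0.8):
        """Return (source, translation) pairs whose source text is similar to `segment`."""
        matches = difflib.get_close_matches(
            segment, list(self._entries), n=3, cutoff=cutoff
        )
        return [(match, self._entries[match]) for match in matches]


tm = TranslationMemory()
print(tm.suggest("The sea is very large."))  # empty memory: no suggestions yet
tm.add("The sea is large.", "El mar es grande.")
print(tm.suggest("The sea is very large."))  # now a similar segment is found
```

An empty memory returns nothing, so suggestions only start to appear after enough translations have been stored.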

How are you integrating machine translations?
For languages in which machine translation is available, the machine translation will be auto-filled when a paragraph in the translation area is clicked.

Initially we're using the Apertium engine, which is free software and can be installed and maintained on our own servers. At a later point we may use Moses and other engines.

How can I improve support for my language?
Contribute to an existing Apertium pair, or create a new one!

Get in contact with the Apertium community via IRC or in many other ways.

Are you building on other efforts as well?
There has been a lot of research on the topic; see Machine translation. For instance: «The quantitative results show that the contributions can improve the accuracy of a combination of RBMT-SPE pipeline at around 10 %, after the post-edition of 50,000 words in the Computer Science domain. We believe that these conclusions can be extended to MT engines involving other less-resourced languages lacking big parallel corpora or frequently updated lexical knowledge» (10.1007/978-3-642-35085-6_4).

Can the machine-translated content be edited manually?
Yes.

We treat machine translation only as a tool that may help a human translator work faster. Publishing machine-translated articles is not the intention of ContentTranslation.

Will there be a feature to prevent bulk publishing of unedited machine translated text?
Yes!

We take article quality seriously. Machine translation is only a tool that helps the translator be more efficient, and the developers understand well that all translations must be edited by a human. The translation interface will show a warning if the translator tries to publish an article that contains only machine translation. The developers will work with the editing communities to adjust this.
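One way such a warning could be triggered is sketched below. This is not the tool's actual check: the function names, the segment-by-segment comparison and the default threshold are assumptions made for illustration. The threshold is exposed as a parameter because the text above says the behaviour is meant to be adjusted together with the editing communities.

```python
def unedited_fraction(mt_segments, published_segments):
    """Return the share of segments published exactly as the machine produced them.

    Segments are compared pairwise after trimming surrounding whitespace.
    """
    pairs = list(zip(mt_segments, published_segments))
    if not pairs:
        return 0.0
    unchanged = sum(1 for mt, final in pairs if mt.strip() == final.strip())
    return unchanged / len(pairs)


def should_warn(mt_segments, published_segments, threshold=1.0):
    """Warn when the unedited share reaches the (hypothetical) threshold.

    The default of 1.0 warns only when every segment is unedited machine translation.
    """
    return unedited_fraction(mt_segments, published_segments) >= threshold
```

Lowering `threshold` makes the check stricter, for example warning as soon as half of the segments are unedited.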

What dictionaries will be available?
Initially, the dictionaries will be taken from the free dictionaries of the freedict project. Later, other dictionaries may be added, such as Wiktionary, OmegaWiki, terminology collections, and possibly other open sites.

Can I copy images over from the source article?
Yes, images will be copied simply by clicking.

How will templates be handled? How are you handling infoboxes?
Initially, templates will simply be blacklisted by default. They will not even be shown in the source column of the translation interface. Many templates are project-specific, so it won't be possible to handle their translation at all. Some simple templates that have no parameters and do have a corresponding template in the target language will have

Can I set up the Content Translation extension on my local wiki?
Yes.

Just install the extension and follow the configuration guide. The default configuration has a bias for Wikipedia, so be sure to set it up correctly for your wiki.

What is cxserver?
ContentTranslation by definition works with multiple wikis and it needs to synchronize information between them, so it uses an additional component called "ContentTranslation server" or "cxserver" for short, to facilitate that. It also optimizes much of the connection to translation tools, such as dictionaries, machine translation, link adaptation, etc.

Is there interest in this feature?
Definitely! In the past there were so many attempts at making similar tools that it's impossible to count them. Some are listed at Machine translation (please add there any you know of).

Glossary

 * annotation : Markup applied to some part of the text. Basically, HTML tags such as anchor, bold, italic, underline, etc.
 * card : a box which appears in the tools column on the special page and provides translation tools for specific context, e.g. a box that allows editing links
 * columns : vertical areas in which Special:ContentTranslation is divided: there are currently three columns (source, translation, tools)
 * Content Translation (CX) : This tool, consisting of the ContentTranslation extension and the cxserver backend. It could be more intuitive to call it "CT", but this is already used for CategoryTree.
 * cxserver : Backend for CX written in Node.js, handling text segmentation and providing consistent API for services like machine translation, dictionaries and translation memories.
 * glossary : A list of terms with definitions or translations.
 * GWT (Given-When-Then): GWT is a semi-structured way to write down test cases. They can either be tested manually or automated as browser tests with Selenium.
 * lemmatization : also called stemming. Mapping multiple grammatical variants of the same word to a root form; e.g. (swim, swims, swimming, swam, swum) -> swim. Derivational variants are not usually mapped to the same form (so happiness !-> happy).
 * link localization : Converting a wiki article link from one language to another with the help of Wikidata. Example: http://en.wikipedia.org/wiki/Sea becomes http://es.wikipedia.org/wiki/Mar
 * machine translation (MT) : Initial translation made by computer algorithms to help translate faster.
 * morphological analysis : mapping words into morphemes, e.g. swims -> swim/3rdperson_present
 * parallel bilingual text : two versions of the same content, each written in a different language.
 * segmented : divided into segments
 * segment : Smallest unit of text which is fairly self-contained grammatically. This usually means a sentence, a title, a phrase in a bulleted list, etc.
 * segmentation algorithm : rules to split a paragraph into segments. Weakly language-dependent (sensible default rules work quite well for many languages).
 * sentence alignment : matching corresponding sentences in parallel bilingual text. In general this is a many-many mapping, but it is approximately one-one if the texts are quite strict translations.
 * service : Things like MT, TM, Glossary
 * service providers : External systems which provide a service. Example: Google
 * source column : the column showing the segmented article in source language.
 * template destruction : inlining a template's contents when a suitable template does not exist in the target wiki
 * tools column : the column where cards appear
 * translation column : the column where the translation is done.
 * translation memory (TM) : A service which suggests translations based on previous translations.
 * translation tools (translation support tools, translation aids) : Context-aware translation tools like MT, Dictionary, link localization
 * word alignment : matching corresponding words in parallel bilingual text. This is strongly many-many.
 * translation dashboard: Listing of all translations of a user. A new translation can also be started from here.
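
The "segment" and "segmentation algorithm" entries above can be illustrated with a deliberately naive sketch. It is not the algorithm cxserver uses: the `segment` function and its single splitting rule are assumptions for the example. The rule splits after sentence-ending punctuation that is followed by whitespace and an uppercase letter or digit, which is the kind of simple default that works for many languages but fails on abbreviations such as "Dr. Smith".

```python
import re

# Split after ".", "!" or "?" when followed by whitespace and an uppercase letter or digit.
# This is a simplistic default rule, assumed here for illustration.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def segment(paragraph):
    """Split a paragraph into sentence-like segments, dropping empty pieces."""
    parts = (part.strip() for part in _BOUNDARY.split(paragraph))
    return [part for part in parts if part]
```

A language without spaces between sentences or without letter case would need different rules, which is why the glossary calls segmentation weakly language-dependent.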