Content translation/Published translations

Information about published translations is useful to machine translation developers and others for a variety of purposes. Content translation aims to provide as much of this data as possible, in different ways. The amount and detail of the data will improve over time. This page captures the current state.

List of published source and target titles
Content translation has an API to get the list of all published translations across languages. Currently, the API returns the following details (illustrated with examples):
 * List of all published translations across all languages. Example: https://ca.wikipedia.org/w/api.php?action=query&list=cxpublishedtranslations&limit=50&offset=5
 * List of all published translations between two languages. Example: https://es.wikipedia.org/w/api.php?action=query&list=cxpublishedtranslations&limit=50&offset=5&from=en&to=es
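The two example requests above can be built programmatically. The following is a minimal Python sketch, not an official client: the function names, the User-Agent string, and the added `format=json` parameter (a standard MediaWiki API option) are assumptions made for illustration, and the shape of the returned JSON is deliberately left uninterpreted.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(wiki_host, limit=50, offset=0, source=None, target=None):
    """Build a cxpublishedtranslations query URL for the given wiki host.

    `source` and `target` map to the `from` and `to` parameters shown in
    the examples above; leave them as None to list all language pairs.
    """
    params = {
        "action": "query",
        "list": "cxpublishedtranslations",
        "format": "json",  # assumption: ask the MediaWiki API for JSON output
        "limit": limit,
        "offset": offset,
    }
    if source:
        params["from"] = source
    if target:
        params["to"] = target
    return f"https://{wiki_host}/w/api.php?" + urlencode(params)


def fetch_published_translations(wiki_host, **kwargs):
    """Fetch one page of results and return the decoded JSON as-is."""
    request = Request(
        build_url(wiki_host, **kwargs),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "cx-published-example/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (performs a network request, so it is left commented out):
# data = fetch_published_translations(
#     "es.wikipedia.org", limit=50, offset=5, source="en", target="es")
```

To page through the full list, increase `offset` in steps of `limit` until a request returns no further translations.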

The stats data shows the percentage of translation completion. human indicates the manual translation percentage, and mt indicates the machine translation percentage. Any edit on top of machine translation is counted as a manual edit. The percentages are calculated at section level. any indicates the total translation (any = human + mt). Content translation does not require a full translation of the source article; users can freely translate as many or as few sections as they want. mtSectionsCount shows the number of translated sections. These stats are also used for abuse prevention; the abuse prevention page explains the percentage calculation in more detail.
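As a worked illustration of the relation any = human + mt, here is a small Python sketch. The numbers in `example_stats` are invented for the example, and whether the API reports values as fractions or on a 0 to 100 scale is not stated here; the relation holds on either scale. The helper names are hypothetical.

```python
import math


def is_consistent(stats, tolerance=1e-6):
    """Check that the total equals the human part plus the MT part."""
    return math.isclose(
        stats["any"], stats["human"] + stats["mt"], abs_tol=tolerance
    )


def human_share(stats):
    """Return the fraction of the translated content that is manual.

    Returns 0.0 when nothing has been translated, to avoid dividing by zero.
    """
    if stats["any"] == 0:
        return 0.0
    return stats["human"] / stats["any"]


# Hypothetical values: 80% of the source translated in total,
# of which 50 points are manual and 30 points are unmodified MT.
example_stats = {"any": 0.8, "human": 0.5, "mt": 0.3, "mtSectionsCount": 3}

print(is_consistent(example_stats))
print(human_share(example_stats))
```

A high `human_share` suggests that most of the published text was written or edited by the translator rather than left as raw machine translation.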

Parallel corpora
The content translation tool is now widely used for translations. In addition to producing new articles, the pairs of source and translated articles are a good source of parallel text. Content translation collects this text and makes it available to machine translation developers, who can use it to train their machine translation systems. Content translation also captures the alignment of sections between source and translation, and in some cases even alignment at sentence granularity.

API
To access the parallel text of a single translation, there is an API. First, one should know the translation id. This can be obtained from the  API explained above. To get the section level aligned parallel text, use  API.

Example:

You can see that the output is JSON formatted and contains section-level contents. A section is a paragraph, a header, or a figure; technically, it is a block-level element in HTML. By default, the section contents are HTML. If you prefer a plain-text version of each section, use the striphtml argument in the API. Every section contains up to three versions:
 * The source content.
 * The machine-translated content. If the language pair involved has a machine translation service and the translator used it, this part of the output holds the unmodified machine translation of the source section. It is empty if MT was not used or is not available.
 * The final translation by the user. This is either an improved version, produced by manual edits on top of the machine translation, or a translation from scratch if there is no MT.
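The three versions described above lend themselves to two common uses: extracting source/translation pairs for training, and measuring how much the translator changed the machine translation. The Python sketch below assumes each section has already been loaded into a dictionary of plain-text strings; the keys `source`, `mt`, and `user` and the function names are hypothetical stand-ins chosen for this example, not the confirmed field names of the API output.

```python
from difflib import SequenceMatcher


def training_pairs(sections):
    """Collect (source, final translation) pairs from aligned sections.

    Sections that lack either side are skipped.
    """
    pairs = []
    for section in sections:
        source = section.get("source")  # hypothetical key
        user = section.get("user")      # hypothetical key
        if source and user:
            pairs.append((source, user))
    return pairs


def post_edit_similarity(section):
    """Return a 0..1 similarity between the raw MT and the final text.

    Returns None when the section has no machine translation or no final
    translation, for example when the user translated from scratch.
    """
    mt = section.get("mt")  # hypothetical key
    user = section.get("user")
    if not mt or not user:
        return None
    return SequenceMatcher(None, mt, user).ratio()


# Invented sample data for illustration only.
sections = [
    {"source": "Hello world.", "mt": "Hola mundo.", "user": "Hola mundo."},
    {"source": "Good morning.", "mt": "", "user": "Buenos días."},
]

for section in sections:
    print(post_edit_similarity(section))
```

A similarity of 1.0 means the machine translation was published unchanged, while lower values indicate heavier post-editing. `SequenceMatcher` compares characters, so a word-level or token-level metric may be a better fit for real evaluation work.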

Example:

If you wish to get only source and user versions, use  argument. By default its value is

Example: |user

Note: The output of this API will be empty for old translations (specifically, translations made before Jan 22, 2016). This is because the API and the required infrastructure were introduced only on that date, and the parallel text of older translations was not captured. If you have a good aligner, you can still use the real article pairs from Wikipedia, using the output of  API.

Parallel corpora dumps
For large-scale machine learning, accessing parallel text one translation ID at a time is not convenient. We plan to provide monthly dumps per language pair, with TMX-formatted data, in the near future.