Internationalisation wishlist 2017

This is a document of the "bag of issues" type, created by Wikimedia i18n aficionados Nikerabbit and Nemo_bis. Its purpose is to lay out mostly independent projects that would make Wikimedia's i18n infrastructure even more awesome, and in some cases prevent it from falling off the train of i18n progress.

Previous editions

 * Internationalisation wishlist 2013
 * Internationalisation wishlist 2014

Visual page translation
Tags: #epic #visualeditor #parsoid #translate #translationadmins

The wiki page translation feature of the Translate extension does not currently work with Visual Editor due to the special tags it uses. More specifically, the problem concerns editing the pages that serve as the source for translations, not the translation process itself. The work can be divided into three steps:
 * 1) Migrate the special tag handling to a more standard way of handling tags in the parser. This needs some changes to the PHP parser so that it can produce the desired output.
 * 2) Add support to Parsoid and Visual Editor so that editing page contents preserves the structures that page translation adds to keep track of the content.
 * 3) Add to Visual Editor some visual aid for marking the parts of the page that can (or cannot) be translated.

This is a difficult project because of the complexities of wikitext parsing and because it spans several products: Translate, MediaWiki core parser, Parsoid, Visual Editor.

Better insertables
Tags: #translators #translate #rtl #php #javascript #gsoc-outreachy

Various i18n libraries use different ways to mark variables. Some examples:

 $1 // MediaWiki
 $var
 %1$s, %2 // Many C/Gettext projects
 ${var}

With insertables (those buttons that can be activated to insert these), we have made it easier to add these and avoid spelling mistakes in them. However, some of these formats, namely those containing Latin letters, are difficult and confusing to use in right-to-left language translations. One possible approach is to unify all these formats, so that translators only see one of them, even though the underlying code will see whatever syntax the project uses. We can also make it so that in right-to-left languages we use syntax that does not cause issues.

Another aspect of unification is that for translation memory, we should replace all variables with a similar placeholder, so that translation memory matching is more accurate.
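A minimal sketch of that normalisation in Python, assuming the variable styles listed above. The patterns, the placeholder string and the function name are illustrative assumptions, not part of Translate:

```python
import re

# Illustrative patterns for the variable styles listed above; a real
# deployment would need per-project rules.
VARIABLE_PATTERN = re.compile(
    r"""
      \$\{\w+\}          # ${var}
    | %\d+\$[sd]         # %1$s (positional printf)
    | %\d+               # %2
    | \$\d+              # $1 (MediaWiki)
    | \$[A-Za-z_]\w*     # $var
    """,
    re.VERBOSE,
)

PLACEHOLDER = "{VAR}"  # hypothetical shared placeholder


def normalize_for_memory(text):
    """Replace every recognised variable with one shared placeholder."""
    return VARIABLE_PATTERN.sub(PLACEHOLDER, text)
```

With this, "Welcome, $1!" and "Welcome, %1$s!" normalise to the same string, so a translation memory lookup can treat them as an exact match.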

If we want to take this even further: insertables should perhaps be easier to control, so that project contacts have more visibility into them without having to write PHP code to support them. We can do this by 1) supporting the most common formats out of the box and 2) allowing regular expressions to be specified directly in the YAML configuration.
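As a rough illustration of the second point, the sketch below takes regular expressions from a configuration structure and uses them to find the insertables of a message. The dictionary stands in for an already parsed YAML file; the key names `INSERTABLES` and `pattern` are assumptions made for this example and not Translate's actual schema.

```python
import re

# Hypothetical shape of a message group's configuration after the YAML has
# been parsed; the key names are illustrative only.
group_config = {
    "INSERTABLES": [
        {"pattern": r"\$\d+"},        # MediaWiki style: $1
        {"pattern": r"%\d+\$[sd]"},   # printf style: %1$s
    ]
}


def find_insertables(source_text, config):
    """Return the distinct variables in source_text, in order of appearance."""
    patterns = [entry["pattern"] for entry in config["INSERTABLES"]]
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    seen = []
    for match in combined.finditer(source_text):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen
```

Each returned string would become one insertable button in the translation editor.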

Librarization of MediaWiki i18n
We should have a reference library which embeds all our learnings and best practices on i18n handling and l10n formats, to promote and use it widely in PHP and JavaScript projects. The library should also try to unify the custom/diverse formats like those for dates from moment.js or others (compare T31235).

Currently, we have a sort of conflict between our own PHP and JavaScript libraries, and even many Wikimedia projects in PHP end up using custom solutions. We don't have recommendations for important languages like Python, which is "stuck" with Gettext (or custom formats like pywikibot?).

Extract our PHP message parsing code to a library
There are many PHP projects that would benefit from a high-quality i18n library. MediaWiki has many excellent features such as extensive handling of parameters, parameter types etc. It has some drawbacks, though, such as not being able to support nested constructions. See also https://github.com/Nikerabbit/monkey-i18n

At translatewiki.net we have multiple PHP projects. The licence (GPL-2.0+) might be a problem if they want to reuse code from MediaWiki.

File format support
Some more file format work?

Language addition process
Adding a new language on translatewiki.net (Translatewiki.net languages) requires many decisions and checks (e.g. ISO status, names in Wikipedia/CLDR/request, jquery.uls) and changes in various repositories. It's also not clear to translators what the status of their request is, and sometimes data is forgotten. Only core staff can help (in practice just a single person) since full access to configuration and repositories is needed.

We suggest writing good documentation for the process, with clear criteria that anyone can follow, leaving only +2 and oversight to admins. Thanks to more active code review tracking, patches there are slightly less likely to get stuck.

Move export thresholds to message groups
It would be helpful to alert users when translations are not being exported due to not meeting the export threshold. This information should be accessible to the Translate extension. Currently this is specified in the repository management. If this information is moved to the message group configuration, we would avoid duplication, and simplify repository management for exports.
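A sketch of what this could enable, assuming the threshold is a percentage of translated messages and is stored under a hypothetical `export_threshold` key in the message group configuration (both are assumptions of this example):

```python
# Hypothetical message group definitions with the export threshold stored
# alongside the rest of the group configuration.
message_groups = {
    "example-project": {"export_threshold": 35},  # percent translated
}


def export_status(group_id, translated, total, groups=message_groups):
    """Return (is_exported, missing): missing counts the translations still needed."""
    threshold = groups[group_id]["export_threshold"]
    # Smallest number of translated messages that reaches the threshold.
    needed = -(-threshold * total // 100)  # ceiling division
    missing = max(0, needed - translated)
    return missing == 0, missing
```

The Translate extension could then show a translator a notice such as "20 more translations are needed before this language is exported".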

Typed message parameters
Tags: #php #javascript #gsoc-outreach?

The MediaWiki message library is very versatile, but some limitations have become apparent over time. The main one is the inability to embed structures that themselves contain linguistic content in sentences. This is best illustrated with the case of links. None of the current alternatives is nice:

Alternative 1: lego

 msg1: Please see our $1 for more information
 msg2: terms of service
 call: $this->msg( "msg1" )->rawParams( Html::element( 'a', [ 'href' => '...', ], $this->msg( "msg2" )->text() ) )->escaped();

Alternative 2: markup 1

 msg1: Please see our <a href="$1">terms of service</a> for more information
 call: $this->msg( "msg1", '...' )->text(); // Lacks proper escaping!!!

Alternative 3: markup 2

 msg1: Please see our $1terms of service$2 for more information
 call: $this->msg( "msg1" )->rawParams( '<a href="...">', '</a>' )->escaped();

Instead, if we could do embedding, things would be quite simple for translators and developers:

Suggested solution

 msg1: Please see our for more information
 call: $this->msg( "msg1" )->rawParams( Html::element( 'a', [ 'href' => '...' ], '$1' ) )->escaped();
 // The $1 inside the link gets replaced with "terms of service" from the translation with same escaping as the rest of the message.

It is also possible to devise a custom syntax to make it shorter, but that is probably not necessary, as translators already encounter this kind of syntax a lot with PLURAL, GRAMMAR, GENDER and some others.

See https://github.com/Nikerabbit/monkey-i18n for proof of concept for this idea. It also supports typed parameters, so that GENDER, PLURAL etc can validate that they are really getting a user or number, and even format it automatically without the need to use numParams.
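To make the two ideas concrete, here is a small Python sketch of embedding and typed parameters. It does not reproduce monkey-i18n or any MediaWiki API: the `EMBED` construct, the `Num` and `Markup` types and the function names are all invented for this illustration, and the plural rule is the English one only.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class Num:
    """A parameter declared to be a number."""
    value: int


@dataclass
class Markup:
    """Trusted HTML supplied by the developer; '$1' marks the slot for text."""
    template: str


# {{PLURAL:$1|one|many}} and {{EMBED:$1|text}}; EMBED is an invented name.
CONSTRUCT = re.compile(r"\{\{(PLURAL|EMBED):\$(\d+)\|([^{}]*)\}\}")


def expand(match, params):
    kind, index, body = match.group(1), int(match.group(2)), match.group(3)
    param = params[index - 1]
    if kind == "PLURAL":
        if not isinstance(param, Num):
            raise TypeError(f"${index} must be a number for PLURAL")
        forms = body.split("|")
        # English-style rule only; real plural rules are language specific.
        return html.escape(forms[0] if param.value == 1 else forms[-1])
    if not isinstance(param, Markup):
        raise TypeError(f"${index} must be markup for EMBED")
    # The translated text is escaped exactly like the rest of the message.
    return param.template.replace("$1", html.escape(body))


def render(message, params):
    """Escape the literal text and expand each construct in between."""
    parts, position = [], 0
    for match in CONSTRUCT.finditer(message):
        parts.append(html.escape(message[position:match.start()]))
        parts.append(expand(match, params))
        position = match.end()
    parts.append(html.escape(message[position:]))
    return "".join(parts)
```

The translator sees the whole sentence, including the link text, in one message, and passing a plain string where a number is expected fails loudly instead of producing a wrong plural form.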

Translation of non-prose MediaWiki strings
Tags: #php #javascript #full-stack #translate #structured-data #translatewiki.net

Magic words, special page aliases and namespaces should be translatable with a web interface to:
 * allow translators to change or update translations easily and quickly, without having to know about order of precedence or allowed characters and so on, while also getting reports on mistakes;
 * keep translations in a data format which is resilient to mistakes (no fatals due to data errors) and can be easily exported to the repositories (without worrying about removing translations which should be kept for backwards compatibility), like some JSON format on ContentHandler pages on translatewiki.net;
 * ideally, export such updates as part of the usual scripts to follow the usual continuous translation model and reduce breakage.
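The "resilient to mistakes" requirement could look roughly like the following Python sketch, which validates special page aliases stored as JSON and reports problems instead of failing. The JSON layout (page name mapped to a list of aliases) and the function name are assumptions; the forbidden characters are those MediaWiki does not allow in page titles.

```python
import json

FORBIDDEN = set("#<>[]|{}")  # characters MediaWiki does not allow in page titles


def load_aliases(raw_json):
    """Return (aliases, problems); malformed entries are reported, never fatal."""
    problems = []
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return {}, [f"invalid JSON: {error.msg}"]
    if not isinstance(data, dict):
        return {}, ["top-level value must be an object"]
    aliases = {}
    for page, names in data.items():
        if not isinstance(names, list):
            problems.append(f"{page}: expected a list of aliases")
            continue
        good = []
        for name in names:
            if isinstance(name, str) and name and not FORBIDDEN & set(name):
                good.append(name)
            else:
                problems.append(f"{page}: rejected alias {name!r}")
        if good:
            aliases[page] = good
    return aliases, problems
```

The list of problems can be shown to the translator in the web interface, while the export only ever sees the entries that passed validation.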

Handle translation of multiple branches
Software translated in translatewiki.net uses the master branch as input and export. This means that once a stable branch is created, it stops receiving translation updates. It should be possible to translate, import and export multiple branches simultaneously. When translating, the messages which are the same across branches should only be translated once.

Branch support has two benefits:
 * 1) software that is branched but not yet released can receive translation updates
 * 2) software that is already released can release minor updates with the latest translations

In the past, for MediaWiki core, minor releases were kept up to date with a great deal of manual effort, see Repository management.
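The "translate once" requirement can be sketched as grouping messages by key and source text across branches. This is only an illustration in Python; the input structure and the function name are assumptions, and a real implementation would also have to decide what counts as "the same" message (for example, whether documentation changes matter).

```python
from collections import defaultdict


def group_shared_messages(branches):
    """Map (key, source text) to the list of branches that contain it.

    `branches` maps a branch name to {message key: source text}.
    """
    shared = defaultdict(list)
    for branch, messages in branches.items():
        for key, source in messages.items():
            shared[(key, source)].append(branch)
    return dict(shared)
```

Every entry of the result needs one translation, which is then exported to all branches in its list. A message whose source text differs between branches ends up in separate entries and is translated separately.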

Translatable PageForms
Tags: #translate #pageforms #translationadmins #php #gsoc-outreachy

PageForms is an extension previously known as SemanticForms. It allows creating forms for inputting data in a structured manner. It would be great if, when creating a form with a form, one could ask for the new form to be made translatable. This can be done manually after the form page has been created, but it is a laborious process: all the strings have to be extracted manually, dozens of pages created and configuration added to LocalSettings.php (only available to wiki administrators). All of this could be simplified with better integration, down to something like one checkbox to tick during form creation. After that one could go to Special:Translate to translate the form.

Gadget localisation
Integrate jquery.i18n and Translate to provide localisation facilities for gadgets. We can start doing this even before Duke Nukem Forever Gadgets 2.0 happens. Would help with: Multilingual Commons.

MediaWiki multilingual documentation
Most of the help pages and manuals are on Meta or rather on local Wikipedias: this leaves most small Wikipedias and other projects in "small languages" with no docs or outdated docs. Everything should be central, translatable, easily and equally available from all wikis. There should be a policy to make user-facing documentation translatable at mediawiki.org.

Multilingual Commons
Help Commons become truly multilingual. Content, categories, templates, gadgets etc. should be translatable.

«Jon Liechty [...] indicated that half of Wikimedia [Commons] uses the English language template, but the rest of the languages fall off logarithmically. He is concerned about the "exponential hole" separating the languages on each side of the curve.»

MT Translation of discussions
Tags: #community #javascript #gadget?

Wikimedia now has infrastructure for providing machine translations via a service. These services are now in use by the Content Translation tool and the Translate extension for page translation. We could also use these services in wiki discussions, to request translation of a comment or a whole thread, to help non-speakers understand what is being discussed without having to copy-paste the text manually to a translation service. This could be integrated into Flow, for example. One component that is required for this is detection of the source language. Often we can assume it is the default language of the wiki or page, but in multilingual wikis such as Meta and MediaWiki.org, it is necessary to use a library or service that identifies the language.
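The fallback logic described above might be sketched as follows in Python. The detector shown is a deliberately crude stand-in that maps the dominant script to a single language (Cyrillic is of course not always Russian); a real deployment would call a proper language identification library or service. All names here are assumptions made for the example.

```python
import unicodedata
from collections import Counter

# Crude stand-in mapping from script to one language; an assumption of this
# sketch, not a usable language identifier.
SCRIPT_TO_LANGUAGE = {"CYRILLIC": "ru", "HEBREW": "he", "ARABIC": "ar"}


def detect_by_script(text):
    """Guess a language from the dominant script, or return None."""
    scripts = Counter(
        unicodedata.name(ch, "").split(" ")[0] for ch in text if ch.isalpha()
    )
    if not scripts:
        return None
    dominant = scripts.most_common(1)[0][0]
    return SCRIPT_TO_LANGUAGE.get(dominant)


def source_language(text, wiki_language, is_multilingual, detect=detect_by_script):
    """Use the wiki language unless the wiki is multilingual and detection succeeds."""
    if not is_multilingual:
        return wiki_language
    guess = detect(text)
    return guess if guess is not None else wiki_language
```

Passing the detector as a parameter keeps the decision logic independent of whichever identification library or service is eventually chosen.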

Language selection for anonymous users for Wikimedia sites
Tags: #universal-language-selector #wmf-production

Multilingual Wikimedia sites such as Commons, Meta, Wikidata and MediaWiki.org require registering a user account to change the interface language. It should be possible to change the language without registering a user account and logging in.

Activity reporting and engagement
Project administrators/coordinators (project contacts on translatewiki.net) should be able not only to have a clear sense of what work is going on, but also of what translators/languages may need an additional effort (or vice versa are going especially well), in order to be able to contact translators where needed. Detailed reporting may be needed, if not an interface to semi-automatically send notifications in certain cases (such as translators who've reduced activity a lot in a language which needs more translations).

Thanking translators is still best done manually, but project contacts need to know whom to thank (knowing about new languages exported may also be helpful to tweak their configuration to actually use them, at times). The ability to easily communicate with "your own" translators can help project administrators build a sense of community and make them feel they're still in charge of the project even though they've merged with a larger wiki/community.

Translators should be able to stay on top of new translation work easily, e.g. by subscribing to feeds and notifications in the projects of their interest when there are new messages in the source language or requests for translation updates (which no longer trigger edits and hence escape enotifwatchlist). They are also interested in knowing how they rank against others; the tools we currently have for this purpose are a monthly rank on the main page, a contribution count with a babel template and total "ranks" with language statistics.

Relevant and reliable translation statistics
Tags: #php #javascript #full-stack

Sometimes projects want to know more about the workload for translators and so on. Translate offers a lot of reporting, but one simple feature we're currently lacking is the ability to count translations by number of words rather than by number of messages. Needed components:
 * 1) a way to compute the number of words (should work in any language)
 * 2) storing the number of words somewhere in the database for quick access
 * 3) updating the statistics pages to use words instead of messages
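For the first component, a rough heuristic in Python is sketched below: runs of letters count as words, and characters from scripts written without spaces are counted one by one. The character ranges cover only common CJK blocks and the whole approach is an approximation chosen for this example; proper segmentation (for instance for Thai) would need a dedicated library.

```python
import re

# Characters from scripts written without spaces between words; the ranges
# cover common CJK blocks only and are an assumption of this sketch.
UNSPACED = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")
# A run of letters, optionally joined by apostrophes or hyphens.
SPACED_WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(text):
    """Approximate word count that does not depend on the language code."""
    unspaced = len(UNSPACED.findall(text))
    remainder = UNSPACED.sub(" ", text)
    return unspaced + len(SPACED_WORD.findall(remainder))
```

Variables such as $1 contain no letters and are therefore not counted, which seems reasonable for estimating translator workload.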

Additionally, our statistics pages are currently missing information about proofreading progress, which could also be added.

A reliable way for system administrators or wiki administrators to force hard updates of statistics and all caches may also be welcome, to easily overcome any problem with the cache, the job queue or other components (compare T145295).

See also https://phabricator.wikimedia.org/tag/mediawiki-extensions-translate/ column "statistics"

Display "where are my translations deployed" information at translatewiki.net
Tags: #git #php #javascript #full-stack #gsoc-outreachy #translatewiki.net

It is often unclear to people when their translations will appear in the software. With some more integration of repository scripts, it should be possible to add metadata to translation revisions recording which commits or branches include them. Different kinds of summaries can then be built on this data, such as "these translations of yours are still waiting to be exported".

The implementation would consist of two mostly independent parts:
 * 1) a tool that reads git repositories to check, for each translation, which branches it is in
 * 2) an interface that displays relevant information to the translators
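The first part could start from `git branch --contains`, which lists the branches that include a given commit. The Python sketch below assumes the export commit for a translation is already known, which is the metadata the repository scripts would have to record; the function names are hypothetical.

```python
import subprocess


def parse_branch_list(output):
    """Parse the output of `git branch --contains`, one branch per line."""
    branches = []
    for line in output.splitlines():
        # Drop the markers git puts in front of the current or checked-out branch.
        name = line.lstrip("*+ ").strip()
        if name and not name.startswith("("):  # skip "(HEAD detached at ...)"
            branches.append(name)
    return branches


def branches_containing(repo_path, commit):
    """Return the local branches of the repository that contain the commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "branch", "--contains", commit],
        capture_output=True, text=True, check=True,
    )
    return parse_branch_list(result.stdout)
```

An empty result for a translation's export commit, or no export commit at all, would correspond to "still waiting to be exported" in the interface.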

As an extension, we could try to hook into the Wikimedia LocalisationUpdate process and the release processes of different projects to also record when translations are deployed. This is likely much more complicated.