Internationalisation wishlist 2017

This is a document of the "bag of issues" type, created by Wikimedia i18n aficionados Nikerabbit and Nemo_bis. Its purpose is to lay out mostly independent projects that would make Wikimedia's i18n infrastructure even more awesome, and in some cases prevent it from falling off of the train of i18n progress.

The scope is limited to Wikimedia and translatewiki.net and to technical improvements (almost all involve some kind of coding or system admin work).

Previous editions

 * Internationalisation wishlist 2013
 * Internationalisation wishlist 2014

Visual page translation
Tags: #epic #visualeditor #parsoid #translate #translationadmins

The wiki page translation feature of the Translate extension does not currently work with Visual Editor due to the special tags it uses. More specifically, this is about editing the pages that serve as the source for translations, not the translation process itself. The work can be divided into three steps:
 * 1) Migrate the special tag handling to a more standard way of handling tags in the parser. This needs some changes to the PHP parser for it to be able to produce the wanted output.
 * 2) Add support to Parsoid and Visual Editor so that editing page contents preserves the structures that page translation adds to keep track of the content.
 * 3) Add to Visual Editor some visual aid for marking the parts of the page that can (or cannot) be translated.

This is a difficult project due to the complexities of wikitext parsing and because it intersects multiple different products: Translate, the MediaWiki core parser, Parsoid and Visual Editor.

Better insertables
Tags: #translators #translate #rtl #php #javascript #gsoc-outreachy

Various i18n libraries use different ways to mark variables. Some examples:
 * $1 // MediaWiki
 * $var
 * %1$s, %2 // Many C/Gettext projects
 * ${var}

With insertables (those buttons that can be activated to insert these variables), we have made it easier to add them and to avoid spelling mistakes. However, some of these formats, those with Latin letters, are difficult and confusing to use in right-to-left language translations. One possible approach is to unify all these formats, so that translators only see one of them, even though the underlying code will see whatever syntax it uses. We could also make sure that the syntax used in right-to-left languages does not cause issues.

Another aspect of unification is that, for translation memory, we should replace all variables with a uniform placeholder, so that translation memory matching is more accurate.
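A minimal sketch of this normalization in JavaScript; the pattern list and the ⟦N⟧ placeholder token are illustrative assumptions, not existing Translate code:

```javascript
// Normalize the variable syntaxes of several i18n formats to a single
// neutral placeholder, so translation memory matching is not thrown off
// by format differences. Patterns and the ⟦N⟧ token are illustrative.
const PATTERNS = [
  /\$\d+/g,        // MediaWiki: $1, $2
  /%\d+\$[sd]/g,   // positional printf: %1$s
  /%[sd]/g,        // plain printf: %s, %d
  /\$\{?\w+\}?/g,  // shell/template style: $var, ${var}
];

function normalizePlaceholders(text) {
  let counter = 0;
  let result = text;
  for (const pattern of PATTERNS) {
    // Numbering follows pattern order, which is enough for fuzzy matching.
    result = result.replace(pattern, () => `⟦${++counter}⟧`);
  }
  return result;
}
```

Two source strings that differ only in placeholder syntax then normalize to the same text, so translation memory lookups match across formats.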

If we want to take this even further: insertables should perhaps be easier to control, so that project contacts have more visibility into them without having to write PHP code to support them. We can do this by 1) supporting the most common formats out of the box and 2) allowing a regular expression to be specified directly in the YAML configuration.

Extensive and robust localisation file format coverage
The Translate extension supports multiple file formats. The formats have been developed on an "as needed" basis, and many formats are not yet supported or the support is incomplete. In this project the aim would be to make existing file formats (for example Android XML) more robust, so that they meet the following properties (example known bugs: 31331, 36584, 38479, 40712, 31300, 57964, 49412):
 * the code does not crash on unexpected input,
 * there is a validator for the file format,
 * the code can handle the full file format specification,
 * the code is secure (does not execute any code in the files nor have known exploits).
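To illustrate the first two properties (no crashes, a validator), here is a sketch of a parser for a simple key=value format that collects errors instead of throwing; the format and names are illustrative, not the Translate extension's actual file format support classes:

```javascript
// Tiny parser for a Java-properties-like format that never crashes on
// unexpected input: malformed lines are recorded as errors and skipped.
function parseProperties(content) {
  const messages = {};
  const errors = [];
  content.split(/\r?\n/).forEach((line, i) => {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) {
      return; // blank line or comment
    }
    const eq = trimmed.indexOf('=');
    if (eq <= 0) {
      // Unexpected input: record it and keep going.
      errors.push(`line ${i + 1}: expected "key=value", got "${trimmed}"`);
      return;
    }
    messages[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  });
  return { messages, errors };
}
```

The same error list doubles as a validator report: a file is valid exactly when `errors` is empty.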

In addition, new file formats can be implemented: in particular Apache Cocoon and Android XML string arrays have existing interest and patches to work on, but we would also like TMX, for example. Adding new formats is a good chance to learn how to write parsers and generators for simple data in complicated file formats. For some formats, it might be possible to take advantage of existing PHP libraries for parsing and file generation. (More example formats other platforms support: OpenOffice.org SDF/GSI, Desktop, Joomla INI, Magento CSV, Maker Interchange Format (MIF), .plist, Qt Linguist (TS), subtitle formats, Windows .rc, Windows resource (.resx), HTML/XHTML, Mac OS X strings, WordFast TXT, ical.)

This project paves the way for future improvements, like automatic file format detection, support for more software projects and extension of the ability to add files for translation by normal users via a web interface.

Transparent and fast language addition process for translatewiki.net
Tags: #documentation #translatewiki.net

Adding a new language on translatewiki.net (Translatewiki.net languages) requires many decisions and checks (e.g. ISO status; names in Wikipedia, CLDR and the request; jquery.uls) and changes in various repositories. It is also not clear to translators what the status of their request is, and sometimes data is forgotten. Only core staff can help (in practice just a single person), since full access to configuration and repositories is needed.

The suggestion is to build good documentation for the process, with clear criteria, so that it can be executed by anyone, leaving only +2 and oversight to admins. Thanks to more active code review tracking, patches there are slightly less likely to get stuck.

Transparent and fast project addition process for translatewiki.net
Tags: #documentation #translatewiki.net

Adding a new project on translatewiki.net requires many decisions and checks (e.g. file format, access rights, license, string quality) and changes in various places. People are asked to join an IRC channel to discuss this. There is no tracking of the process (unless they file a report in Wikimedia Phabricator).

The suggestion is to build good documentation for the process, with clear criteria, so that it can be executed by anyone, leaving only +2 and oversight to admins. Thanks to more active code review tracking, patches there are slightly less likely to get stuck.

Better export thresholds
Tags: #translatewiki.net #yaml #front-end

It would be helpful to alert users when translations are not being exported from translatewiki.net because they do not meet the export threshold. This information should be accessible to the Translate extension. Currently the thresholds are specified in the repository management. If this information is moved to the message group configuration, we would avoid duplication and simplify repository management for exports.

It would also be good to reconsider whether the current export levels make sense. One should check how many translations we currently have that are not being exported due to not meeting the export level. We should also consider lowering our thresholds, in the hope of increased translator motivation, faster deployment of translations and less wasted work.

Glossaries
There must be a lot of glossaries and terminologies out there. Some of them would be useful to integrate into translatewiki.net.

Provide technical support for building glossaries with the Translate extension and in translatewiki.net. These should integrate directly into the translation editor.

Better support for formal and informal variants
Tags: #php #back-end #mediawiki-core #translatewiki.net

Currently the formal and informal variants are a bit hit and miss. The fact that not every message needs a separate translation (compare with variants of English) makes them problematic. For languages which want to take this seriously, we could make formality an inline feature, in addition to the existing PLURAL, GENDER and GRAMMAR features.

Formality could be an additional option in the user preferences (like gender) or driven by the language codes directly. Languages could choose their own number of formality levels, not limited to two.

Example (current solution):
 * (es) ¿Estás ?
 * (es-formal) ¿Está ?

Example (proposed solution, with a hypothetical inline FORMALITY keyword):
 * (es) ¿{{FORMALITY:Estás|Está}} ?

The underlying assumption here is that only MediaWiki uses these formal/informal variants. If other projects use them, they could still keep them as separate languages until a better solution is devised.

See for example 54957.
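As a sketch of how such an inline feature could behave; the {{FORMALITY:...}} keyword and the expansion function are hypothetical, nothing like this exists in MediaWiki today:

```javascript
// Expand a hypothetical {{FORMALITY:informal|formal|...}} inline feature,
// analogous to PLURAL/GENDER, for a chosen formality level.
function expandFormality(message, formalityIndex) {
  return message.replace(/\{\{FORMALITY:([^}]*)\}\}/g, (match, options) => {
    const forms = options.split('|');
    // Fall back to the first variant if the requested level is missing,
    // mirroring how not every message needs every variant.
    return forms[formalityIndex] !== undefined ? forms[formalityIndex] : forms[0];
  });
}
```

For a message like "¿{{FORMALITY:Estás|Está}} de acuerdo?", level 0 yields the informal "¿Estás de acuerdo?" and level 1 the formal "¿Está de acuerdo?".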

Typed message parameters
Tags: #php #javascript #gsoc-outreach?

The MediaWiki message library is very versatile, but some limitations have become apparent over time. The main one is the inability to embed structures that themselves contain linguistic content in sentences. This is best illustrated with the case of links. None of the current alternatives is nice:

 * 1) Alternative 1: lego

   msg1: Please see our $1 for more information
   msg2: terms of service
   call: $this->msg( 'msg1' )->rawParams( Html::element( 'a', [ 'href' => '...' ], $this->msg( 'msg2' )->text() ) )->escaped();

 * 2) Alternative 2: markup 1

   msg1: Please see our [$1 terms of service] for more information
   call: $this->msg( 'msg1', '...' )->text(); // Lacks proper escaping!!!

 * 3) Alternative 3: markup 2

   msg1: Please see our $1terms of service$2 for more information
   call: $this->msg( 'msg1' )->rawParams( '<a href="...">', '</a>' )->escaped();

Suggested solution: instead, if we could do embedding, things would be quite simple for translators and developers (the embedding syntax below is illustrative):

   msg1: Please see our $1[terms of service] for more information
   call: $this->msg( 'msg1' )->rawParams( Html::element( 'a', [ 'href' => '...' ], '$1' ) )->escaped(); // The $1 inside the link gets replaced with "terms of service" from the translation, with the same escaping as the rest of the message.

It is also possible to devise a custom syntax to make it shorter, but that is probably not necessary, as translators already encounter this kind of syntax a lot with PLURAL, GRAMMAR, GENDER and some others.

See https://github.com/Nikerabbit/monkey-i18n for a proof of concept of this idea. It also supports typed parameters, so that GENDER, PLURAL etc. can validate that they are really getting a user or a number, and even format it automatically without the need to use numParams.
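A sketch of the embedding idea in JavaScript; the $1[...] syntax and both function names are illustrative, loosely modelled on the proof of concept rather than taken from it:

```javascript
// Minimal HTML escaping helper for the sketch.
function escapeHtml(s) {
  return s.replace(/[&<>"]/g, (c) =>
    ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }[c]));
}

// message: translated string where $1[...] carries the linguistic content
// that belongs inside the embedded structure; raw: an HTML fragment whose
// own $1 marks where that content goes. The translated text is escaped,
// the raw markup is not, so escaping stays consistent with the message.
function expandEmbedded(message, raw) {
  return message.replace(/\$1\[([^\]]*)\]/, (match, inner) =>
    raw.replace('$1', escapeHtml(inner)));
}

const html = expandEmbedded(
  'Please see our $1[terms of service] for more information',
  '<a href="/tos">$1</a>'
);
// → 'Please see our <a href="/tos">terms of service</a> for more information'
```

The key property is that the link text travels with the translation, so translators see one sentence, while the markup stays in code.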

Automated exports from translatewiki.net
Tags: #php #puppet #back-end #security #git #translatewiki.net

Translatewiki.net exports are currently semi-automated to the level that one needs to run one script and watch the output. Ideally, it should be fully automated, run by a cron job.

The issues we currently consider blocking this are:
 * 1) migration away from personal ssh keys to a key used by translatewiki.net
 * 2) secure handling of this service ssh key
 * 3) migration of all projects to our new repository management tool (repong)
 * 4) reliability of exports
 * 5) automating the process of addition for new languages
 * 6) defining the automation via puppet

Step 1) can be completed by poking existing products to add access with the new key, preferably using an account for translatewiki.net. See for example https://github.com/translatewiki.

Step 2) would need advice from people with experience on this kind of thing to make sure it is secure. Obviously the automated exports would need access to this key, which is currently password protected.

Step 3) requires adding support for non-git version control systems to repong (written in PHP).

Step 4) would entail adding more checks on our end to verify we are not creating broken files. This rarely happens at the syntactic level, as we use pretty standard libraries and battle-tested code, but at a higher level it can happen (e.g. not outputting authorship info). We should also handle failures better (logging, making sure admins can easily see and act on them). One issue is that the project might commit changes between our last import and the following export. At minimum we should abort the export if this happens, by checking that we are exporting against the same revision as we imported.

Step 5) needs more thought. Many projects need to add a code map or register the new language in a separate file. Perhaps we can devise a safe way to run scripts that these projects create, or just not export new languages automatically, falling back to humans to add those manually.

Step 6) is just making sure this all happens automatically by having cron or similar execute exports periodically.

Translation of non-prose MediaWiki strings
Tags: #php #javascript #full-stack #translate #structured-data #translatewiki.net

Magic words, special page aliases and namespaces should be translatable with a web interface to:
 * allow translators to change or update translations easily and quickly, without having to know about order of precedence or allowed characters and so on, while also getting reports on mistakes;
 * keep translations in a data format which is resilient to mistakes (no fatals due to data errors) and can be easily exported to the repositories (without worrying about removing translations which should be kept for backwards compatibility), like some JSON format on ContentHandler pages on translatewiki.net;
 * ideally, export such updates as part of the usual scripts to follow the usual continuous translation model and reduce breakage.
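The resilient-format idea could look like this: aliases stored as JSON and validated before export, so that a bad translation produces a report instead of a fatal. The validation rules shown here are illustrative, not MediaWiki's full constraints:

```javascript
// Validate a JSON mapping of canonical names to translated aliases,
// e.g. { "Watchlist": ["Beobachtungsliste"] }. Returns a list of
// human-readable errors; an empty list means the data is exportable.
function validateAliases(data) {
  const errors = [];
  for (const [canonical, aliases] of Object.entries(data)) {
    if (!Array.isArray(aliases) || aliases.length === 0) {
      errors.push(`${canonical}: must be a non-empty array`);
      continue;
    }
    for (const alias of aliases) {
      // Illustrative rule: aliases must be strings without whitespace.
      if (typeof alias !== 'string' || /\s/.test(alias)) {
        errors.push(`${canonical}: bad alias ${JSON.stringify(alias)}`);
      }
    }
  }
  return errors;
}
```

Because bad entries only ever surface as report lines, a data error can never take the wiki down, and the export script can simply skip or flag invalid groups.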

Handle translation of multiple branches
Tags: #translate #translatewiki.net #php #yaml

Software translated at translatewiki.net usually uses the master branch for both import and export. This means that once a stable branch is created, it stops receiving translation updates. It should be possible to translate, import and export multiple branches simultaneously. When translating, messages which are the same across branches should only be translated once.

Branch support has two benefits:
 * 1) software that is branched but not yet released can receive translation updates
 * 2) software that is already released can ship minor updates with the latest translations

In the past, for MediaWiki core, minor releases were kept up to date with a great deal of manual effort, see Repository management.

Complete TranslateSVG
Tags: #php #javascript #translate #full-stack #wmf-production

Subtitle translation
Tags: #epic #translate #wmf-deployment

Commons supports subtitles on videos. These are translated by editing a wiki page containing a special syntax. For multiple reasons this is not ideal. We should be looking into possibilities of integrating this with our existing translation tools, or with some other free software tools that already exist (perhaps by integrating those into our tools). The goals should be: discoverability of things to translate, easy translation, modification tracking and no extra steps needed for translations to become available.

See also TimedText integration with Translate.

Would help with: Multilingual Commons.

Translatable PageForms
Tags: #translate #pageforms #translationadmins #php #gsoc-outreachy

PageForms is an extension previously known as SemanticForms. It allows forms to be created for inputting data in a structured manner. It would be great if, when creating a form with a form, one could ask for the form to be made translatable. This can be done manually after the form page has been created, but it is a laborious process: extracting all the strings manually, creating dozens of pages and doing manual configuration in LocalSettings.php (only available to wiki administrators). All this could be simplified with better integration, to something like one checkbox to tick during form creation. After that one could go to Special:Translate to translate the form.

Gadget localisation
Integrate jquery.i18n and Translate to provide localisation facilities for gadgets. We can start doing this even before Duke Nukem Forever Gadgets 2.0 happens. Would help with: Multilingual Commons.

MediaWiki multilingual documentation
Most of the help pages and manuals are on Meta or, rather, on local Wikipedias: this leaves most small Wikipedias and other projects in "small languages" with no docs or outdated docs. Everything should be central, translatable, and easily and equally available from all wikis. Have a policy of making user-facing documentation at mediawiki.org translatable.

Multilingual Commons
Help Commons become truly multilingual. Content, categories, templates, gadgets etc. should be translatable.

«Jon Liechty [...] indicated that half of Wikimedia [Commons] uses the English language template, but the rest of the languages fall off logarithmically. He is concerned about the "exponential hole" separating the languages on each side of the curve.»

Big projects at translatewiki.net
Lately the number of active translators at twn has not grown. We should try to get big projects like KDE to lure in more translators. Would help with: Glossaries in TWN, Promote i18n best practices.

Alternatively, convince people like the FSF to adopt MediaWiki+Translate for the translation of their software, with as few quirks as possible.

Machine translation of discussions
Tags: #community #javascript #gadget?

Wikimedia now has infrastructure for providing machine translations via a service (partially based on unfree software). These services are now in use by the Content Translation tool and by the Translate extension for page translation. We could also use these services in wiki discussions, to request translation of a comment or a whole thread, helping non-speakers understand what is being discussed without having to copy-paste the text manually into a translation service. This could be integrated into Flow, for example. One component that is required for this is detecting the source language. Often we can assume it is the default language of the wiki or page, but in multilingual wikis such as Meta and MediaWiki.org, it is necessary to use a library or service that identifies the language.

Language selection for anonymous users for Wikimedia sites
Tags: #universal-language-selector #wmf-production

Multilingual Wikimedia sites such as Commons, Meta, Wikidata and MediaWiki.org require registering a user account to change the interface language. It should be possible to change the language without registering an account and logging in.

Librarization of MediaWiki i18n
Tags: #php #mediawiki-core

We should have a reference library which embeds all our learnings and best practices on i18n handling and l10n formats, to promote and use it widely in PHP and JavaScript projects. The library should also try to unify the custom/diverse formats like those for dates from moment.js or others (compare T31235).

Currently, we have a sort of conflict between our own PHP and JavaScript libraries, and even many Wikimedia projects in PHP end up using custom solutions. We don't have recommendations for important languages like Python, which are "stuck" with Gettext (or custom formats, as in Pywikibot?).

Extract our PHP message parsing code to a library
There are many PHP projects that would benefit from a high-quality i18n library. MediaWiki has many excellent features, such as extensive handling of parameters, parameter types, etc. It has some drawbacks though, such as not being able to support nested constructions. See also https://github.com/Nikerabbit/monkey-i18n

At translatewiki.net we have multiple PHP projects. The licence (GPL-2.0+) might be a problem if they want to reuse code from MediaWiki.

Outreach on our i18n best practices
NL: Redundant with above?

Activity reporting and engagement
Project administrators/coordinators (project contacts on translatewiki.net) should have a clear sense not only of what work is going on, but also of which translators/languages may need additional effort (or, vice versa, are doing especially well), in order to be able to contact translators where needed. Detailed reporting may be needed, if not an interface to semi-automatically send notifications in certain cases (such as translators who have greatly reduced their activity in a language which needs more translations).

Thanking translators is still best done manually, but project contacts need to know whom to thank (knowing about new languages exported may also be helpful to tweak their configuration to actually use them, at times). The ability to easily communicate with "your own" translators can help project administrators build a sense of community and make them feel they're still in charge of the project even though they've merged with a larger wiki/community.

Translators should be able to stay on top of new translation work easily, e.g. by subscribing to feeds and notifications in the projects of their interest when there are new messages in the source language or requests to update translations (which no longer trigger edits and hence escape enotifwatchlist). They are also interested in knowing how they rank against others, but our tools for this purpose are limited: currently we have a monthly ranking on the main page, a contribution count in a babel template and overall "ranks" in the language statistics.

Translator hub

 * Originally proposed as: Translate Roll

The number of wikis using the Translate extension has increased significantly. At translatewiki.net, in some rare cases people run out of things to translate. It would be beneficial to have some kind of central place to see translation status across the Translate universe. It would facilitate cross-project collaboration and raise awareness that different wikis have different kinds of content to translate.

Various ideas have been floated for implementation, from one special page just listing overall translation coverage in each wiki for a given language, to a "blog roll" type of links across wikis as well as single sign-on systems to ease moving between wikis.

Relevant and reliable translation statistics
Tags: #php #javascript #full-stack

Sometimes projects want to know more about the workload for translators and so on. Translate offers a lot of reporting, but one simple feature we're currently lacking is the ability to count translations by number of words rather than by number of messages. Needed components:
 * 1) a way to compute the number of words (should work in any language)
 * 2) storing the number of words somewhere in the database for quick access
 * 3) updating the statistics pages to use words instead of messages
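For component 1), a language-aware word count could build on Intl.Segmenter (standard ECMA-402, available in Node.js 16+ and modern browsers), which handles languages written without spaces between words, unlike a naive whitespace split:

```javascript
// Count words in a language-aware way: Intl.Segmenter's word
// granularity marks word-like segments even in languages such as
// Japanese, where splitting on whitespace would count one "word".
function countWords(text, lang) {
  const segmenter = new Intl.Segmenter(lang, { granularity: 'word' });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) {
      count++;
    }
  }
  return count;
}
```

Storing this per-message count in the database at import time (component 2) would make the statistics pages cheap to switch over.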

Additionally, our statistics pages are currently missing information about proofread progress, that could also be added.

A reliable way for system administrators or wiki administrators to force hard updates of statistics and all caches would also be welcome, to easily overcome any problem with the cache, job queue or other components (compare T145295).

See also https://phabricator.wikimedia.org/tag/mediawiki-extensions-translate/ column "statistics"

Display "where are my translations deployed" information at translatewiki.net
Tags: #git #php #javascript #full-stack #gsoc-outreachy #translatewiki.net

New translators in particular want to easily get feedback on their translations and what happened to them. On Translate wikis in general, the watchlist (for the accept log and modifications) and contributions (for a mere list) are not enough, especially for unlogged/separate actions like setting a workflow state, pushing to CentralNotice, copying to another wiki or exporting to a VCS. Credit: Gloria_S.

It is especially unclear to people when their translations will appear in the software. With some more integration with the repository scripts, it should be possible to add metadata to translation revisions about which commits or branches they are included in. Different kinds of summaries can then be built on this data, such as "these translations of yours are still waiting to be exported".

The implementation would consist of two mostly independent parts:
 * 1) a tool that reads git repositories to check, for each translation, which branches contain it,
 * 2) an interface that displays relevant information to the translators.

The interface could also be RSS, Twitter, IRC, whatever makes sense (perhaps also for imports): the main benefit would be transparency for what and how much we do. Since 2016, rakkaus sends some messages to #mediawiki-i18n for autosyncs, but these are cryptic. Going further, we could send out notices to translators "Your translations are now visible to users".

As an extension, we could try to hook up into the Wikimedia LocalisationUpdate process and the release processes of different projects to also record the information when they are deployed. This is likely much more complicated.

TUX for statistics
Special:LanguageStats and Special:MessageGroupStats look outdated compared to the TUX editor and need a facelift too. In addition, we could make them Web 2.0 compliant and make them faster with AJAX, by not loading all information immediately.

Near-real time translation collaboration
When multiple people are translating the same group (say, a translatable page), it would be helpful to see the updates they make live, something akin to Etherpad. It doesn't need to support multiple people editing the same message at the same time. Even seeing which messages are open (and their content) would help. Credit: neverendingo.