Talk:Requests for comment/Localisation format

Philip J. Rayment (talkcontribs)

I just downloaded an updated (bugfixed) extension that included a .i18n.php "shim" which said that it "maintains compatibility back to MediaWiki 1.17". However, versions prior to 1.20 only required PHP 5.2.3, whereas the shim seems to require 5.3, which means that it does not necessarily provide compatibility back to MediaWiki 1.17. Surely this should be relatively easy to fix?

Nemo bis (talkcontribs)

Yes, the issue is known. A patch is pending (and AFAIK ready) to fix it: gerrit:125706.

Nikerabbit (talkcontribs)

Running old core with latest extension is always risky. Many of the extensions already depend on PHP 5.3 regardless of the shims.

Reply to "Broken shim"

Implementation of RfC completed

Siebrand (talkcontribs)

The implementation of the RfC is now complete. The following patch sets have been generated with their respective states:

Total number of involved patch sets: 611.

Below is a copy of a mailing list message I sent out a few days ago to wikitech-l:

Cheers!

Siebrand

Subject: Implementation JSON based localisation format for MediaWiki nearly completed

Long! tl;dr is in the first two paragraphs.

With the merging of https://gerrit.wikimedia.org/r/#/c/122787/ , probably
the largest patch set for MediaWiki ever (+548314, -714438), MediaWiki core
is now using JSON for localisation of interface messages, per a recently
adopted RfC[1]. Thanks Krinkle/Timo for reviewing!

Please be aware that if you have open patch sets touching *.i18n.php
messages files or MessagesXx.php files, you will have to update your patch
sets to match the new file layout and format.

In December 2013, the first MediaWiki extensions were already migrated
to use the JSON format. Today, Antoine/hashar enabled a JSON linter on the
jslint job that runs on many Gerrit repositories' patch sets.

Last week I started migrating MediaWiki extensions to JSON i18n, beginning
with those used by Wikimedia. At this time, 1.23wmf20 has about
50% of its extensions using the updated format. Migration of two extensions
is taking a little longer[2], but Matt Flaschen is helping with that, and I
expect that to be resolved soon.

Migration of all extensions has been going very smoothly - it's about 80%
done. With the help of reedy/Sam Reed, Raimond Spekking/Raymond, Niklas
Laxström/Nikerabbit and Adam Wight, so far 427 patch sets related to this
project have already been reviewed and merged[3], 40 await review and I
expect some 90 more to be submitted for the project to be completed.

Thanks also go to Roan Kattouw/Catrope for implementation of parts of the
RfC together with Niklas, to Niklas for rewriting LocalisationUpdate to
support the JSON format and more, and all who helped draft the RfC,
including James Forrester, Santhosh Thottingal, David Chan, Ed Sanders,
Robert Thomas Moen, and those who deserve credit but I have forgotten to
mention.

Once all migrations are complete, I'll be doing a full export from
translatewiki.net, which will cause a lot of JSON files to be touched, but
will mostly update encoding (full UTF-8) and add a newline at end of file
where missing.
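
As a rough illustration of what such an export touches, the sketch below
rewrites one JSON message file as literal UTF-8 (no \uXXXX escapes) and
makes sure it ends with a newline. It is not the translatewiki.net
exporter; the helper name `normalise_json_file` and the tab indentation
are assumptions made for the example.

```python
import json
from pathlib import Path


def normalise_json_file(path: Path) -> bool:
    """Rewrite a JSON message file as literal UTF-8 ending in a newline.

    Returns True if the file content changed. Hypothetical helper, not
    the actual translatewiki.net export code.
    """
    original = path.read_text(encoding="utf-8")
    data = json.loads(original)
    # ensure_ascii=False writes "é" itself instead of the escape "\u00e9".
    # Tab indentation is an assumption made for this sketch.
    normalised = json.dumps(data, ensure_ascii=False, indent="\t") + "\n"
    if normalised != original:
        path.write_text(normalised, encoding="utf-8")
        return True
    return False
```

Running it a second time on the same file should report no change, which
is why such an export only produces one large, one-off diff.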

What's next? With this project almost completed, next order of business is
creating an RfC on where to go with the data that now remains in the
MessagesXx.php files (like date formatting, fallback, directionality,
namespace names, special page names, etc.) and localisation for special
page names, magic words and namespace names that are still being
implemented using $wgExtensionMessagesFiles. Maybe this is something we
could discuss and prototype during the hackathon. Please let me know if
this is something you'd like to work on.

Again, thanks for the help, and apologies for the inconvenience these
changes may have caused you!

[1] https://www.mediawiki.org/wiki/Requests_for_comment/Localisation_format
[2]
https://gerrit.wikimedia.org/r/#/q/status:open+topic:json-i18n-special,n,z
[3] https://gerrit.wikimedia.org/r/#/q/status:merged+topic:json-i18n,n,z
Siebrand (talkcontribs)

Some small open issues:

  • Update core L10n/i18n documentation.
  • Update extension L10n/i18n documentation.
  • Clean up core's maintenance/languages/
    • messages.inc can go
    • messageTypes.inc can go
Nemo bis (talkcontribs)

Thanks Siebrand, Niklas and Raimond for all the migration work!

Reply to "Implementation of RfC completed"

Next steps from RFC review meeting

Sharihareswara (WMF) (talkcontribs)

At the RFC review meeting in December 2013, we agreed:

  • ACTION: RoanKattouw to remove groups
  • ACTION: RoanKattouw to look at the number of stat() calls and consider optimisations

I'll ping Roan to check on this.

Sharihareswara (WMF) (talkcontribs)

from #mediawiki just now:

<sumanah> RoanKattouw: https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Localisation_format/Next_steps_from_RFC_review_meeting --  "ACTION: RoanKattouw to remove groups" and "ACTION: RoanKattouw to look at the number of stat() calls and consider optimisations" -- https://www.mediawiki.org/w/index.php?title=Requests_for_comment/Localisation_format&action=history doesn't have you editing the doc since that date :)
<RoanKattouw> sumanah: Whoops sorry
<RoanKattouw> sumanah: I did remove groups in the implementation
<sumanah> RoanKattouw: cool!
<RoanKattouw> The implementation was in fact merged into core a while ago
<sumanah> oho!
<RoanKattouw> I did not research stat calls :(
<sumanah> RoanKattouw: is https://www.mediawiki.org/wiki/Requests_for_comment/Localisation_format#Implementation accurate (in showing what is done and not done)?
<RoanKattouw> sumanah: Yeah that looks accurate
<sumanah> RoanKattouw: okay. What would you call the status of the RFC? because "doc status: draft/implementation: underway" does not sound right.
<RoanKattouw> sumanah: The tragedy of this RfC is basically that both of the primary devs on it went back to school and chose it as a target for responsibility-shedding
<sumanah> :(
<RoanKattouw> (Niklas and myself)
<RoanKattouw> I'd say it's partially implemented
<RoanKattouw> The infrastructure it calls for is almost fully implemented
<RoanKattouw> It's just not used nearly everywhere yet
<RoanKattouw> And there's some tangential work not mentioned there that has yet to be done, although I think that bit is on Niklas's plate
<RoanKattouw> "that bit" being support in LocalisationUpdate for JSON i18n so that extensions that have been converted to JSON get the same daily message updates in production as the PHP ones
<sumanah> RoanKattouw: ok. mind if I just copy and paste your braindump here into the talkpage?
<RoanKattouw> That's the main thing that's blocking mass conversion AIUI
<RoanKattouw> Go for it
<sumanah> RoanKattouw: moving from "draft" to "end run around acceptance" I guess
<RoanKattouw> Sounds reasonable
<RoanKattouw> I think the acceptance is mostly there, socially, there's just a missing feature that needs to be implemented before we can really go bonkers on it
Nemo bis (talkcontribs)

The LU patches in question were submitted last week, I linked them from the page per above.

Nikerabbit (talkcontribs)

Thanks. The rewrite is not complete yet, there will be more patches this week.

Siebrand (talkcontribs)

All patches for the LU rewrite are now in Gerrit. I've submitted a conversion of the installer files for core. Am working on converting $messages of MessagesXx.php core files.

Jdforrester (WMF) (talkcontribs)

Excellent, thank you!

Did we reach a decision regarding non-$messages entries?

Reply to "Next steps from RFC review meeting"
Nemo bis (talkcontribs)
Nikerabbit (talkcontribs)

FYI: LU2 was mentioned, not the current LU.

Jdforrester (WMF) (talkcontribs)

I believe that until LU is updated, extensions that have been converted to JSON won't get the daily updates of translations (but they will when anyone scaps, which happens at least weekly).

Nemo bis (talkcontribs)

Thanks James for the answer. That's unfortunate, I expect lots of confused translators (for each translation you need to know if it's using json or not). I hope it's not too hard to update LU (be it 1.1 or 2).

Jdforrester (WMF) (talkcontribs)

I would argue that translators expecting out-of-process localisation updates are the ones who are confusing others. "This software doesn't change at all except on Thursdays. Oh, except for messages. And things that look like interface components but are actually strings of advanced formatting." We may wish to re-visit whether we use this in future, or move to a more frequent deployment process that has only very rare exceptions.

Nemo bis (talkcontribs)

What are "out-of-process localisation updates"? The current process (since a few years ago) is the daily sync via LocalisationUpdate. Out of process updates would be syncs to master/cherry-picks, manual changes to l10n or manual runs of LocalisationUpdate. Is this what you mean?

Jdforrester (WMF) (talkcontribs)

No, you have it the wrong way around. LocalisationUpdate runs daily because we used to do regular updates only every few weeks or even months, and so localisation updates were significantly delayed in being released.

However, I don't think that it's ever been a satisfactory outcome to have random localisation cache issues created by a bot in an unattended, out-of-process production deployment. Nowadays, given that we do production deployments almost every day, it's an unnecessary complication and weakness in the integrity of the service we offer our readers.

Nikerabbit (talkcontribs)

Most deployments don't update messages though. Given the rapid changes in the software, it is not okay that translators have to wait up to two weeks until the untranslated strings in the interface are replaced with translations.

Jdforrester (WMF) (talkcontribs)

Where on Earth is "two weeks" from? Reedy's deployments every Tuesday and Thursday do full scap runs even if no other deployments happen. Unless you mean at Christmas/etc.?

Nikerabbit (talkcontribs)

New strings added on Wednesday. Translation is added on Thursday. Reedy makes a new branch. Raymond exports translations later that day. Next Thursday new branch is created which contains the translation. Next Thursday after that Wikipedia finally starts using that branch.

With LU the translation will be there late Thursday on the same day it was made.

Perhaps you are imagining a solution in the middle, where LU runs every day, but only on deployment server and only scap would push it out to production?

Jdforrester (WMF) (talkcontribs)

Yes.

Nemo bis (talkcontribs)

That however would "only" be a change in the l10n-update script at WMF; the extension would need to be fixed first anyway.

Nemo bis (talkcontribs)
Nikerabbit (talkcontribs)

We do not need James' approval for LUv2. The service will be useful to 3rd parties (of MediaWiki, maybe also other products) and it is then up to WMF to consider if they want to use it.

Reply to "LocalisationUpdate"
MaxSem (talkcontribs)

An extension with no UI and thus no messages except for descriptionmsg. 350 files with one message each. Almost the same stuff for extensions with very few messages.

Jdforrester (WMF) (talkcontribs)

… so? inodes are cheap.

Parent5446 (talkcontribs)

Well, if we are following the jQuery.i18n specification, then putting all messages in one file should be supported (just like how it's done now, except in JSON). See the bottom of this section for more info.

NEverett (WMF) (talkcontribs)

I believe the proposal is for the contents of the files to follow jQuery.i18n but to require one file per language. The "350 files with one message each" argument isn't enough to convince me that supporting the all-messages-in-one-file format is required. I actually think a strong argument might be "let's just copy jQuery.i18n 100% so we don't have to tell people it is just like it _but_." Even with that, I still favor one file per language from a performance and simplicity perspective.

TheDJ (talkcontribs)

True, but one language per file could be a convention instead of a technical rule of our logic for reading the data file, of course. Or you could add another convention and say that people should use qall.json if they are putting all languages in a single file, if you still want to enable this, but not allow people to put multiple languages in a single file.

Siebrand (talkcontribs)

I think we should choose to go for putting each locale in a separate file. This reduces code complexity.

Seb35 (talkcontribs)

I like the JSON format, but I am a bit skeptical about the per-language files, although I also understand that heavy files could be impractical.

As a translator, I sometimes open the i18n.php file with `vi' to search all translations of a message or navigate between the original English and other languages; if they are in separated files, it’s still possible but less practical (e.g. use `grep -R' or `cat *.json | vi -').
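
With one file per language, the "show me every translation of this message" lookup could be scripted instead. A minimal sketch, assuming one flat JSON object per language in files named <code>.json; the function name `translations_of` and the directory layout are assumptions for illustration only.

```python
import json
from pathlib import Path


def translations_of(i18n_dir: Path, key: str) -> dict:
    """Return {language code: text} for one message key.

    Assumes one JSON file per language, named <code>.json, each holding
    a flat object of message key -> text.
    """
    found = {}
    for path in sorted(i18n_dir.glob("*.json")):
        messages = json.loads(path.read_text(encoding="utf-8"))
        if key in messages:
            # The file name without ".json" is taken as the language code.
            found[path.stem] = messages[key]
    return found
```

Called as `translations_of(Path("i18n"), "some-message-key")`, where both the directory and the key are placeholders, it returns only the languages that actually define the message.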

Another minor point is that maintenance across all languages could be more difficult. I am thinking about multiple lines in Gerrit in the localisation updates (example), or mass operations such as removing a message.

These are minor points, but they also affect usability for developers and other people looking at the code.

Jdforrester (WMF) (talkcontribs)

On your points:

Using .i18n.php directly for translation
The main workflow for translators is translatewiki.net, where the English original (and qqq message) are clearly shown alongside the input box – I don't know if many people are manually paging through the i18n.php file for the English original. However, I agree that wanting to know how a message was translated into a related language (e.g. French and Spanish, or German and Swiss German) for consistency is probably a useful artefact we might want to suggest to TWN as a translation tool?
Multiple files touched in one git commit
I see this as a feature, not a bug – as someone who manually reviews every single VisualEditor git commit from the localisation update, it's now much much easier for me to see at a glance what languages got new messages or updated messages, and particularly when a language is newly added.
Removing a message across several files
The official guidance on this is that only the -en message should be removed by the developers, and the system will remove the non-en messages, so I don't think this is a problem here.
Seb35 (talkcontribs)

Thanks for your replies. You convinced me on the last two items. For the first, I still prefer navigating in a single file when reading the translations on my computer, but I understand developers could prefer splitting them. As for an interface for comparing the languages, one exists, but I’m rather a command-line addict when possible.

Reply to "A quite popular case..."
😂 (talkcontribs)

Couple of big reasons:

  1. It moves configuration out of code (and i18n is config, imho)
  2. JSON is more portable than PHP and more tools can input/output to it natively
  3. Some extensions have FREAKING HUGE i18n files and this will help cut them down to manageable sizes
  4. Since extension i18n is now split like core, it's harder for translation updates to conflict with code changes affecting only the en messages

I won't be able to make the RfC meeting tomorrow to discuss this, but I heartily endorse the proposal. I'm sure any rough edges in the proposal can be figured out by the interested parties :)

Reply to "I like this"
Nikerabbit (talkcontribs)

I could not get to sleep last night, so I was thinking about this RFC. Here are the thoughts (after sleeping) I would like to highlight.

It's about time. Migration to a non-executable format has come up with various people, so it's not a new idea. This is a step that will allow further improvements to our i18n: LocalisationUpdate v2 could be an example.

Keep it simple, secure. PHP is an executable file format. Loading the messages for translation at translatewiki.net and any other kind of manipulation is icky. Some people choose the easy way and just execute the code, which leads to potential security issues. The other option, parsing the syntax by hand, is not fun either. All that complexity drops away when we separate the messages and use a format for which we have parsers.

It could be faster, you know. The biggest i18n files in extensions could be megabytes. While handling JSON can be slower than using PHP (no data), overall this can improve performance by avoiding loading unnecessary data, as each language is split into a separate file. For example, a change to one translation in one extension would no longer trigger recaching of all languages, as we can now track timestamps per language. This benefit applies to MediaWiki core as well as to translatewiki.net.
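
To make the per-language timestamp idea concrete, here is a minimal sketch of how a cache could decide which languages need rebuilding. It is not how LocalisationCache works; the function name `stale_languages` and the `cached_mtimes` mapping are hypothetical.

```python
from pathlib import Path


def stale_languages(i18n_dir: Path, cached_mtimes: dict) -> list:
    """List language codes whose JSON file is new or changed since caching.

    cached_mtimes maps language code -> modification time recorded when
    that language was last cached (a made-up structure for this sketch).
    """
    stale = []
    for path in sorted(i18n_dir.glob("*.json")):
        code = path.stem
        # A missing entry (new language) or a different mtime means the
        # cached data for this one language is out of date.
        if cached_mtimes.get(code) != path.stat().st_mtime:
            stale.append(code)
    return stale
```

With a single multi-language file there is only one timestamp to compare, so any edit would mark every language as stale.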

Siebrand (talkcontribs)

Thanks a lot, Niklas. I used some of this to update the rationale of the RfC.

Happy-melon (talkcontribs)

Tangentially-relevant data: the last time I was asked to benchmark PHP-arrays versus JSON-objects in terms of loading performance (player data fixtures for testing a game server), JSON came out significantly faster, because it's a more limited format - a JSON object is certainly going to just be a data object; an included PHP file could be a class, a load of functions, anything.

Siebrand (talkcontribs)

Thank you for that background information, Happy-melon.

TheDJ (talkcontribs)

I guess we could get some benchmarking out of the VE experimenting quite quickly if we wanted to, right? That might be a good idea.

Reply to "Motivations"
Krenair (talkcontribs)

Will we be able to override messages from other extensions/core with this?

Either way, this is great. +1

Siebrand (talkcontribs)

There should not be a change in functionality, so this question is not relevant in the scope of this RfC I think. As far as I know, it is possible now, and should remain possible, although it's discouraged. There is https://gerrit.wikimedia.org/r/#/c/98078/ to allow a proper solution.

Reply to "Overriding other messages?"

Effects on caching implementation

Parent5446 (talkcontribs)

There's one thing I'd like to hopefully clarify on this: how will this be affecting message caching?

Right now we have the following caching structure (note that this is the exact order in which they are checked):

  1. [MessageCache] Entire languages of database message overrides are stored in serialized PHP files
  2. [MessageCache] Entire languages of database message overrides are stored in Memcached
  3. [MessageCache] Entire languages of database overrides are loaded manually from the database
  4. [MessageCache] Individual message overrides are stored in Memcached
  5. [MessageCache] Individual message overrides are loaded manually from the database
  6. [LocalisationCache] Entire languages of messages (with extension overrides) are stored in CDB files (or the database or cache, depending on config)
  7. [LocalisationCache] Entire languages of messages (without extension overrides) are stored in PHP files (the usual MessagesEn.php)
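
The ordering above amounts to a chain of lookups in which the first layer that has the message wins. A minimal sketch of that idea follows; it is not the MessageCache or LocalisationCache code, and the name `lookup_message` and the (name, function) layer shape are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer is a name plus a function (language code, message key) -> text or None.
Layer = Tuple[str, Callable[[str, str], Optional[str]]]


def lookup_message(layers: Sequence[Layer], lang: str, key: str):
    """Return (layer name, text) from the first layer that knows the key.

    Layers are tried in the order given. Returns (None, None) if no
    layer has the message.
    """
    for name, fetch in layers:
        text = fetch(lang, key)
        if text is not None:
            return name, text
    return None, None
```

In use, each of the seven sources would be wrapped as one layer, for example `("memcached", some_lookup_function)` with a hypothetical lookup function, and passed in the order listed.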

I'm not going to ask wtf is going on in MessageCache because it's out of scope for this RFC, but the real question comes in LocalisationCache. One of the main advantages mentioned by this RFC is that the modular design allows the cache to be expired in portions. This makes a lot of sense since right now LocalisationCache uses the entire extension file as a FileDependency rather than just a single language file. However, how will message groups and sub-groups be accounted for? Since the CDB cache is for an entire language, it still has to be regenerated for any message change. Are we going to split the CDB files by group?

In fact, how are groups working in the first place? How will the software know which message keys are in each group? I looked at the VisualEditor example thinking maybe the message keys would be prefixed with the group name, but that is not the case. I'm beginning to think the point of separating messages into groups is solely to make development easier, which makes sense, but that should be explicitly stated.

Jdforrester (WMF) (talkcontribs)
There's one thing I'd like to hopefully clarify on this: how will this be affecting message caching?

Immediately, it won't; the intent of this RfC is to change the file format, and nothing more, to lay the groundwork for future changes like better caching, client-side soft loading of messages, etc.

Since the CDB cache is for an entire language, it still has to be regenerated for any message change. Are we going to split the CDB files by group?

Yes, this is a possibility, but not in this RfC; this just makes it possible.

In fact, how are groups working in the first place? How will the software know which message keys are in each group? I looked at the VisualEditor example thinking maybe the message keys would be prefixed with the group name, but that is not the case. I'm beginning to think the point of separating messages into groups is solely to make development easier, which makes sense, but that should be explicitly stated.

Yes, we were thinking automatic prefixing, but not in this RfC; this just makes it possible.

Siebrand (talkcontribs)

Going to correct James here a bit :).

Since the CDB cache is for an entire language, it still has to be regenerated for any message change. Are we going to split the CDB files by group?

Maybe. The current cache format is very fast and efficient. We've not discussed this. In any case, it's out of scope of this RfC.

Jdforrester (WMF) (talkcontribs)

Ha. I removed a stress of "possible" too much to make it sound like we wanted to definitely change CDB files; fixed.

Siebrand (talkcontribs)

I'm beginning to think the point of separating messages into groups is solely to make development easier, which makes sense

As is written in the rationale, that is not solely the point. If it's the only point you see, that's great, because it's well within the scope of the RfC that developers don't hate it. So thanks for your feedback :).

Reply to "Effects on caching implementation"
Tim Starling (talkcontribs)

What is the reason for the omission of keys other than $messages from the new format? We already need to deliver non-message keys to JavaScript code, for example, we have mw.language.getData() and callers such as convertNumber(). Presumably these needs for non-message data will continue to increase. The fact that you're following an example from jQuery.i18n does not seem like a good excuse, given the much larger scope of this proposal.

How will existing non-message keys in extension message files be handled if $wgExtensionMessageFiles is deprecated?

Siebrand (talkcontribs)

On $wgExtensionMessageFiles: We chose a limited scope ($messages only) to keep things manageable and to increase the chances of being able to implement this within a limited amount of time. Increasing the scope would create longer discussion and increase the chances of small issues in the periphery holding up the main change. I acknowledge that we need to find a solution for special page aliases and namespace names later. This might be done by extending the JSON format to support (associative) arrays in a similar fashion to what is currently done for the @metadata key.

On one $messages for core and language classes: Where possible we would like to change our currently "manually maintained" i18n related configurations for languages to data driven implementations based on collections by 3rd parties. The plural rules are an example of that. Future opportunities are in date/timezone information, number formatting, etc. Our plan is to address these things one feature at a time, to not create a big plan that has decreased chances of being implemented, because of complexity or a lack of resources.

Tim Starling (talkcontribs)

I'm not really buying that. To a large extent, how you do messages determines how you do non-message data, so you should write both on the RFC so that we can discuss them both with clarity. This RFC has clear implications for non-message keys, and I'm not sure if I support those implications. It's not necessary to implement the whole RFC in one go, so it only extends the scope of the design work, not the implementation work.

Nikerabbit (talkcontribs)

I'm not convinced that we need to increase the scope of this RFC. I see no issues with keeping non-message data as they currently are. For now it is actually a benefit, since we are separating the messages, which are (mostly) machine maintained, from the non-messages, which are maintained by hand. This alone simplifies message handling in core and at translatewiki.net.

In my opinion LocalisationCache is currently nicely abstracting away where and how we store the data, although it probably wasn't the issue you had in mind when writing LC.

As James proposes below, there is a pretty straightforward mapping from PHP to JSON for non-messages, if we want to do it. For now there is no compelling reason to change how we do it. We would have to consider the implications of future decisions about the extent to which we are going to use 3rd party language data. And if we go into motivations like using a common format for backend and frontend, then we are again in the scope of another RFC in progress about frontend i18n.


How will existing non-message keys in extension message files be handled if $wgExtensionMessageFiles is deprecated?

On this I think we should clarify the RFC (assuming there is agreement) that we would only be deprecating that variable for messages for now. Magic words and aliases, which are already in separate i18n files for the majority of extensions because translatewiki.net forces that, would keep working until we decide to do something about them.

Jdforrester (WMF) (talkcontribs)

In general, we should probably just use @-prefixed strings, objects or arrays as necessary – specifically[1]:

$namespaceNames
"@namespaceNames": { "NS_MEDIA": "Média", "NS_SPECIAL": "Spécial", … }
$namespaceAliases
"@namespaceAliases": { "Discuter": "NS_TALK", "Discussion_Utilisateur": "NS_USER_TALK", … }
$specialPageAliases
"@specialPageAliases": { "Activeusers": [ "Utilisateurs_actifs", "UtilisateursActifs" ], "Allmessages": [ "Messages_système", "Messages_systeme", "Messagessystème", "Messagessysteme" ], … }
$magicWords
"@magicWords": { "redirect": [ "0", "#REDIRECTION", "#REDIRECT" ], "notoc": [ "0", "__AUCUNSOMMAIRE__", "__AUCUNETDM__", "" ], … }
$bookstoreList
"@bookstoreList": { "Amazon.fr": "http://www.amazon.fr/exec/obidos/ISBN=$1", "alapage.fr": "http://www.alapage.com/mx/?tp=F&type=101&l_isbn=$1&donnee_appel=ALASQ&devise=&", … }
$linkTrail
"@linkTrail": "/^([a-zàâçéèêîôûäëïöüùÇÉÂÊÎÔÛÄËÏÖÜÀÈÙ]+)(.*)$/sDu"[2]
$dateFormats
"@dateFormats": { "mdy time": "H:i", "mdy date": "F j, Y", … }
$defaultDateFormat
"@defaultDateFormat": "zh"
$datePreferences
"@datePreferences": [ "default", "ISO 8601" ]
$separatorTransformTable
"@separatorTransformTable": { ",": "\xc2\xa0", ".": "," }
$linkTrail
"@linkTrail": "/^([a-zʻʼ“»]+)(.*)$/sDu"
$linkPrefixCharset
"@linkPrefixCharset": "a-zA-Z\\x80-\\xffʻʼ«„"
$linkPrefixExtension
"@linkPrefixExtension": "true"
$fallback
"@fallback": "zh-hans"
$fallback8bitEncoding
"@fallback8bitEncoding": "windows-1252"

I've probably made this too simple to work, on reflexion – have I missed something?

  1. Examples from MessagesFr.php, MessagesUz.php, MessagesZh-hans.php.
  2. Or do we want to convert this from PHP format somehow?
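
For what it's worth, a reader of such a file could separate the @-prefixed entries from ordinary messages quite mechanically. A minimal sketch, assuming the @-prefix convention suggested above were adopted; the function name `split_message_file` is hypothetical and nothing like it is claimed to exist in MediaWiki.

```python
import json


def split_message_file(text: str):
    """Split a JSON i18n document into (messages, extras).

    Keys starting with "@" (such as "@metadata", or the "@fallback"
    suggested in this thread) go to extras with the "@" stripped;
    every other key is treated as a message.
    """
    messages, extras = {}, {}
    for key, value in json.loads(text).items():
        if key.startswith("@"):
            extras[key[1:]] = value
        else:
            messages[key] = value
    return messages, extras
```

One design consequence is that no message key could itself start with "@", since it would be read as data rather than as a message.
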
Reply to "Omission of non-message keys"