User:Liangent/wb-lang


 * Target: Language fallback and conversion feature for data stored in Wikibase / Wikidata.
 * Mentor: User:Denny?

Introduction
Wikidata currently stores multilingual content. Labels (names, descriptions, etc.) are expected to be written in every language, so that every user can read them in their own language. However, there are some problems at present:


 * If some content doesn't exist in a specific language, users who have exactly that language set in their preferences see something meaningless (the item's ID instead). This makes Wikidata close to unusable in languages with fewer users (and thus fewer labels filled in).
 * Some similar languages often share the same value. Populating strings for every language one by one wastes resources and may let them fall out of sync later.
 * Even for languages that are not "that similar", MediaWiki already has a facility to transliterate (a.k.a. convert) content from a sister language (a.k.a. variant), which can be used to provide better results for users.

This proposal aims to resolve these issues by displaying content from another language to users, based on user preferences (some users may know more than one language), language similarity (the language fallback chain), or the possibility of transliteration, and by allowing proper editing of that content.

Although Wikidata is still in a stage of rapid development, a lot of data has already been added to it. The later we resolve these issues, the more duplication may be created, which will require more cleanup work in the future, like what we had to face before and when the language converter (that transliteration system) was introduced for the Chinese Wikipedia. Besides, including this in the Wikidata design is better than patching it in later in ad hoc ways.

Finally, we don't have much workforce for LanguageConverter-related work, so it would be nice to have me accepted to do this now. :)

Requirements

 * Every user may define their own language preference order
 * Every language has its system fallback order
 * Some languages can be derived (converted) automatically from other (prime & sister) languages (variants)
 * Display the value the user prefers most, to the extent of what's available in the current data, with an annotation saying what language a string is actually in when it falls back to another language
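
The requirements above can be sketched as a small lookup routine. This is only an illustration under stated assumptions, not Wikibase code: the `SYSTEM_FALLBACKS` table, the `CONVERTERS` registry, the toy character mapping, and the names `ResolvedLabel`, `candidate_languages` and `resolve_label` are all hypothetical. Real fallback chains and converters would come from MediaWiki's language configuration and LanguageConverter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical system fallback chains (illustrative values only).
SYSTEM_FALLBACKS: Dict[str, List[str]] = {
    "zh-cn": ["zh-hans", "zh"],
    "zh-tw": ["zh-hant", "zh"],
    "de-at": ["de"],
}

# Toy stand-in for a real variant converter: maps three characters only.
_TOY_TABLE = str.maketrans({"維": "维", "數": "数", "據": "据"})

# Hypothetical registry: (source variant, target variant) -> converter.
CONVERTERS: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("zh-tw", "zh-cn"): lambda text: text.translate(_TOY_TABLE),
}


@dataclass(frozen=True)
class ResolvedLabel:
    text: str        # string to display
    requested: str   # the user's first-choice language
    actual: str      # language the stored string is actually in
    converted: bool  # True if text was derived by conversion


def candidate_languages(user_prefs: List[str]) -> List[str]:
    """Expand each preferred language with its system fallback chain."""
    seen: List[str] = []
    for lang in user_prefs:
        for code in [lang] + SYSTEM_FALLBACKS.get(lang, []):
            if code not in seen:
                seen.append(code)
    return seen


def resolve_label(labels: Dict[str, str],
                  user_prefs: List[str]) -> Optional[ResolvedLabel]:
    """Return the best available label, annotated with its real language."""
    if not user_prefs:
        return None
    requested = user_prefs[0]
    for code in candidate_languages(user_prefs):
        if code in labels:
            return ResolvedLabel(labels[code], requested, code, False)
        # No stored value: try converting from a sister variant.
        for (src, dst), convert in CONVERTERS.items():
            if dst == code and src in labels:
                return ResolvedLabel(convert(labels[src]), requested, src, True)
    return None  # caller falls back to showing the ID
```

For example, `resolve_label({"de": "Berlin"}, ["de-at"])` yields the German label with `actual == "de"`, which a front-end could use to show the annotation. Trying conversion before moving on to the next fallback language is a design choice of this sketch, not something the proposal has settled.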

Technical notes

 * Caching issues need some care
 * Site links? We have variants in titles...
 * Wikibase is under fast development. Talk with others to minimize merge conflicts
 * btw. "Commons media" needs to be multilingual sometimes... for items about concepts / properties of diagrams
 * If the only known value is in zh-cn, should it be stored under zh or zh-cn? If the former: how do we tag that the plain string is actually in zh-cn?
 * Special case: when a zh-cn user is populating data for a new field, should it be stored under zh or zh-cn?
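
To make the zh / zh-cn question concrete, the two storage options can be compared with a minimal sketch. The `StoredTerm` class and its field names `slot` and `written_in` are hypothetical, used here only for illustration; they are not part of the Wikibase data model, and the sketch does not decide which option is better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredTerm:
    slot: str                         # language code the value is filed under
    text: str
    written_in: Optional[str] = None  # actual variant, if more specific than slot

    def actual_language(self) -> str:
        """Language the plain string is really in."""
        return self.written_in or self.slot


# Option A: file the value directly under the variant it is written in.
option_a = StoredTerm(slot="zh-cn", text="维基数据")

# Option B: file it under the prime language, tagged with the real variant.
option_b = StoredTerm(slot="zh", text="维基数据", written_in="zh-cn")
```

Both options report zh-cn as the actual language. They differ only in which slot a reader of another variant would find the value under.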

Timeline

 * May 27, 1900 UTC: Announced.
 * May 28 - June 6
 * (I'll be busy for the first one or two weeks after June 17, so I may have to start early to compensate for that)
 * Investigate places where work visible to users and to external developers (such as bot authors) is needed. These may include the API (new interfaces or parameters may be needed), the repo front-end (obviously), the client front-end (for example, the add-link dialog) and exported data (for example, data dumps, if we plan to provide per-language dumps at some point). Design the interface where needed.


 * June 7 - June 16
 * Investigate current data exchange structures (API, embedded JavaScript data or anything else; the internal storage structure shouldn't be affected much, as we're just doing fallback before data is sent out) and see whether they still meet our needs. Design new data structures when necessary.


 * June 17: Beginning.
 * June 17 - June 30 (and as soon as any design is done)
 * Send the designs of the interface and data structures to the mentor & others for review. During this period I may be somewhat busy, so it won't block me much if there's some delay in others' actions.


 * July 1 - July 20
 * Code up anything internal (data structures, API etc.) based on the designs done in the previous periods.


 * July 21 - July 29
 * Front-end development based on design, part I.


 * July 29, 1900 UTC - August 2, 1900 UTC
 * Mid-term: write a summary of the current design and coding work as mid-term evaluation documents.


 * August 3 - August 11
 * Front-end development, part II.


 * August 12 - August 25
 * Test it to see whether it works well as an integrated product; tweak code when necessary.


 * August 26 - September 2
 * Test it on a larger data set? (optional; continue coding & testing work if it's not done yet)


 * September 3 - September 16
 * Try to have it deployed on Wikidata and test it in the real world? (optional; continue coding & testing work if it's not done)


 * September 16 / September 23, 0900 UTC - September 27, 1900 UTC: Final documentation and reports.

Target

 * Expected target: Have it deployed on Wikidata.
 * Minimal target: Have a working codebase with the required features completed. In this case I'll still be interested in eventually getting it live, as it will be a required feature for Wikidata users.

Links

 * Wikidata/Data model
 * Wikidata/Notes/Data model primer
 * Wikidata/Notes/Language fallback
 * Writing systems
 * Language in MediaWiki