Extension:UniversalLanguageSelector/Fonts for Chinese wikis/proposal

UniversalLanguageSelector Fonts for Chinese wikis

 * Bugzilla report:
 * Announcement: Proposal announcement at the wikitech-l mailing list.

Name and contact information

 * Name: Aaron(Xiangquan) Xiao
 * Email: xiaoxiangquan@gmail.com
 * IRC or IM networks/handle(s): aaron_xiao
 * Location: Beijing, China
 * Typical working hours: (UTC+8:00) waking hours are 9:00 to 23:00, typical working hours are 10:00 to 18:00.

Synopsis
Chinese uses more than 80,000 characters, of which 70,217 are included in Unicode 5.0. However, only about 3,500 of them are used in daily life. Fonts covering the rarely used characters are seldom installed on readers' systems. Even Chinese users rely heavily on GBK fonts, which contain only about 20,000 characters. So readers are bound to run into tofu (missing-glyph boxes), at which point the webfonts service is triggered.

However, including all characters in the font file makes it huge. We may want to tailor the font file for every page based on characters used on that page. Once finished, this feature can be applied to other languages facing the same problem, such as Japanese.

As of writing, there is no good enough free font that includes all the Chinese characters in Unicode. Since the "wiki" concept itself encourages collaborative content creation, it would be nice to invite users to create the glyph whenever the system sees a character with no existing glyph data.


 * Mentors: User:DChan_(WMF) User:Liangent

Any Chinese font is certain to be missing some characters, and the current tofu detection algorithm cannot detect that. Loading a font with all Chinese characters is impossible, or at least inefficient.
 * How Important It Is

Deliverables
Tailor the font file according to the characters used on each page. During the SoC event, I may only be able to finish the Chinese Tailor as an experiment. If it works well, we can extend it to other languages or even make it a universal feature.
 * Chinese Font Tailor

When tofu occurs, encourage the user to contribute the missing glyph.
 * Glyph Collector

Design docs, development docs, user docs, and anything else we think is useful.
 * Documents


 * Long Term Support [after the event]

First, I will keep pushing the work forward until it is finally released, just as I did for other open source projects.

I love i18n projects, so I would like to stay with this topic. For example, I would be more than pleased to be a GSoC mentor in the future :)

Knowledge Preparation
Frankly speaking, I have never touched the technical side of MediaWiki before, so it will take some time to get my hands dirty. I have started reading the docs about ULS and the HOWTOs for developers. Next, I will work through the steps listed below. All of this will be done before the coding start date.
 * Get to the details of the implementation
 * Keep on discussing with mentors and the community
 * Finally give more accurate approaches and schedule

Building the Workflow
I am due to graduate around 1 July, and I have to spend some time on graduation affairs, so less work will be done during the first half of the event. In that period I can build the developer environment and write more documentation, covering both design and implementation details. It is also a good time for us to plan further. In short, I will build the workflow so that I can focus on the feature afterwards.
 * Build mediawiki developer environment
 * Find suitable fonts to use, such as: (Note that we will deal with Chinese first)
 * Free Chinese Fonts suggested by Ubuntu CN
 * WenQuanYi for Chinese
 * Hanazono for Japanese
 * efont for Japanese
 * UnFonts for Korean
 * Find a suitable glyph creator so that users can contribute glyphs
 * Documentation on design and implementation
 * Do some experimental coding, of course

Chinese Font Tailor


In the ULS webfonts repository ( ULS:/data/fontrepo/fonts ), Autonym contains all the characters needed for the ULS UI, as the picture shows. It needs only a few dozen Chinese characters, so it keeps just those and drops the other tens of thousands. That is how the Chinese Font Tailor will work, except that it is meant for general use, not only for the UI.
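
The first step of tailoring is deciding which characters a page actually uses. The following is a minimal sketch in Python, assuming the page content is already available as plain text. The function name `collect_codepoints` and the default restriction to the CJK Unified Ideographs block (U+4E00 to U+9FFF) are illustrative assumptions, not part of ULS.

```python
def collect_codepoints(page_text, lo=0x4E00, hi=0x9FFF):
    """Return the sorted, de-duplicated code points of page_text within [lo, hi].

    The default range is the basic CJK Unified Ideographs block; this choice
    is an assumption made for the sketch only.
    """
    return sorted({ord(ch) for ch in page_text if lo <= ord(ch) <= hi})


if __name__ == "__main__":
    sample = "维基百科，自由的百科全书"
    needed = collect_codepoints(sample)
    # Only these characters would have to be kept in the tailored font.
    print(len(needed), "distinct Han characters:", "".join(chr(cp) for cp in needed))
```

A real tailor would also have to cover the CJK extension blocks, since that is where many of the rarely used characters live; the range parameters are there to make that easy to try.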

I think there are two approaches. Whenever a page is updated, scan it to see what characters are used and then generate the font for it.
 * Aggressive update

Advantage: quick response on every visit

Disadvantage: it seems to affect logic outside ULS (the page update path)

Whenever a page is visited and ULS is called, scan it to see what characters are used and then generate the font for it. Cache the font with a timestamp, so that later visits can reuse it directly as long as comparing the font's timestamp with the page's shows that it is still up to date.
 * Lazy update

Advantage: better cohesion

Disadvantage: may cause a noticeable delay on the first visit

I suggest the latter, as pages tend to be written once and read many times.
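
The lazy update approach can be sketched as a small cache lookup that compares timestamps. This is only an illustration in Python under stated assumptions: the names `CachedFont`, `get_tailored_font` and `build_font` are hypothetical, the cache is a plain in-memory dictionary, and timestamps are integers. None of this reflects existing ULS code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedFont:
    data: bytes        # tailored font bytes
    page_touched: int  # page timestamp the font was built against


def get_tailored_font(
    page_name: str,
    page_touched: int,
    cache: Dict[str, CachedFont],
    build_font: Callable[[str], bytes],
) -> bytes:
    """Return the cached font if it is at least as new as the page, else rebuild."""
    entry = cache.get(page_name)
    if entry is not None and entry.page_touched >= page_touched:
        return entry.data  # cache hit: the page has not changed since the build
    # Cache miss or stale entry: this visit pays the cost of tailoring.
    data = build_font(page_name)
    cache[page_name] = CachedFont(data, page_touched)
    return data
```

Only the first visit after an edit triggers `build_font`; every later visit is served from the cache, which is why this fits pages that are written once and read many times.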


 * Implementation


 * A script that runs on the server will tailor the font file for a specified page, say ABC. The font file will be named ABC_FONT-NAME_TIMESTAMP.ttf.
 * Modify the webfonts JS (ULS:/resources/js/) to pass along the parameters needed, such as the page name
 * A PHP script serves the right font in response to the webfonts call
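
The file naming scheme above can be illustrated with a short Python sketch. The function name `tailored_font_filename` is hypothetical, and percent-encoding the page name is an added assumption (page titles can contain characters such as "/" that are unsafe in file names); the proposal itself only specifies the ABC_FONT-NAME_TIMESTAMP.ttf pattern. The actual font subsetting is not shown.

```python
from urllib.parse import quote


def tailored_font_filename(page_name: str, font_name: str, timestamp: int) -> str:
    """Build a file name of the form PAGE_FONT-NAME_TIMESTAMP.ttf."""
    # Assumption: percent-encode the page name so that "/" or ":" in a title
    # cannot produce an invalid or nested path.
    safe_page = quote(page_name, safe="")
    return f"{safe_page}_{font_name}_{timestamp}.ttf"


if __name__ == "__main__":
    print(tailored_font_filename("ABC", "FONT-NAME", 1400000000))
```

Note that underscores are left untouched by the encoding, so a page title containing "_" would make the name ambiguous to parse back; a real implementation would need a different separator or a lookup table.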

Glyph Collector
I think this is a simple feature without many technical barriers, so only one or two weeks are assigned to it. This lets me finish the main features before the pencils-down date.

Follow-up (Maybe after the event)
More testing, a beta release, and fixes for reported bugs. Make it stable and more efficient, then push it forward until it is finally merged into trunk.

Schedule

 * now ~ 18 May, Knowledge Preparation
 * Keep on discussing with mentors and the community to make the proposal better
 * Read the docs for ULS and the HOWTOs for developers (1 week)
 * Get into the details of the implementation (reading code) and find the points I am going to work on (3 weeks)
 * Give more accurate approaches and schedule (1 week, before 18 May)


 * 19 May ~ 22 June, Building the Workflow
 * Build mediawiki developer environment (2 or 3 days)
 * Find proper fonts and glyph creator to use (1 week)
 * Design and plan more, document it (1 week)
 * Some experimental coding (1 week)


 * 23 June ~ 10 August, Implementation
 * Finish Chinese font tailor (3 weeks)
 * Glyph collector (1 or 1.5 weeks)
 * Update documents, test and fix bugs (2 weeks)


 * 11 August ~ 18 August, Final Term
 * Finish the final report, present the result to the community and Google

As promised, keep pushing the work forward until it is finally released
 * Follow up

About Me
See my User Page.

See my LinkedIn Page

M.S. in Computer Science, Peking University, Beijing, China
 * Education completed or in progress:

I searched the organizations list with the keyword "i18n", as it is one of my working fields. I am always trying to introduce great open source projects to Chinese users, as well as to other non-English users.
 * How did you hear about this program?

Through the end of June, I must spend some time on my thesis and graduation affairs. After that I can work full time in July, August and September. (I know the event schedule. I will finish the main features on time, and then of course continue working to push it forward to a final release, even after the GSoC event.)
 * Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

As a male applicant, only SoC.
 * We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?

FOSS Projects
Wireshark is a great open source packet capturing and analyzing engine. I implemented dissectors for several major P2P Kademlia protocols, such as BT-uTP, Vuze-DHT and BT-DHT. They have been released since 1.6 and 1.7.
 * Wireshark

Blender is one of the best open source 3D modeling tools. In GSoC 2011, I implemented the i18n module, which has been released since 2.60.
 * Blender

Ogre is one of the best open source 3D rendering engines. To save time and memory, we wanted to load BIG terrains on demand, which means a higher LOD is loaded only when necessary. The existing work was unstable, so I refactored it and added some new features, including an "Endless World" demo. It was finally accepted into the repository and has been released since Ogre 1.9. Click here to see the wiki
 * Ogre

BitBucket
 * Personal Repository

Google Code


 * FOSS ASIA

I like not only to contribute code, but also to share experience and publicize my organizations at events. During this year's FOSS ASIA event, I gave two talks. One was about "How to ask for help in Open Source projects", and the other was "Optimizations of the Terrain System" in OGRE.

Relevant Projects
The most relevant one is the Blender i18n module. In that project I learned to build the i18n process with GNU gettext. We then found the translators' community, managed fonts and PO files, and tried creators such as FontForge.

I also once wrote a Java IDE called FreeJava as a course project with 3 classmates. We used the Java i18n mechanism to support English, Chinese and Japanese. We had to write all the language files ( ./mess_*.properties ) before compiling. The program would load the specified one and display text based on the setProperty("locale","en") call in the source. It was ugly, and we threw it away right after the semester.

Interested Projects
I'm interested in i18n, game development and mobile app projects, but this year I have applied only to MediaWiki. I would be glad to work on other i18n-related projects if the feature in this proposal has low priority.

Any other info
Most Chinese users cannot speak English or any other foreign language, so they are kept away from great products that do not support i18n. For example, Blender, the best free 3D modeling tool, had few Chinese users before the i18n module was finished. That is why I am so addicted to such technologies.

Now I am eager to improve Chinese support in MediaWiki and to make it much easier for non-English users. Although I know little about the technical side of MediaWiki, I think I can learn fast and make it in time.