Extension:UniversalLanguageSelector/Fonts for Chinese wikis/proposal

UniversalLanguageSelector Fonts for Chinese wikis

Bugzilla reports: bug 31791, bug 63122
Announcement: Proposal announcement at the wikitech-l mailing list.

Name and contact information

Name: Aaron (Xiangquan) Xiao
Email: xiaoxiangquan@gmail.com
IRC or IM networks/handle(s): aaron_xiao
Location: Beijing, China
Typical working hours: (UTC+8:00) waking hours are 9:00 to 23:00; typical working hours are 10:00 to 18:00.

Synopsis

Chinese uses more than 80,000 characters, of which 70,217 are included in Unicode 5.0. However, only about 3,500 of them are used in daily life. Most of the rarely used characters are not installed on readers' systems. Even Chinese users rely heavily on GBK fonts, which contain only about 20,000 characters. So readers are certain to run into tofu (missing-glyph boxes), and at that point the webfonts service is triggered.

However, including all characters in the font file makes it huge. We may want to tailor the font file for every page based on characters used on that page. Once finished, this feature can be applied to other languages facing the same problem, such as Japanese.
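The first step of per-page tailoring, finding which characters a page actually uses, can be sketched as below. This is only an illustration: the function names `characters_used` and `as_unicode_list` are hypothetical and not part of ULS, and a real implementation would hand the resulting list to a font subsetting tool.

```python
def characters_used(page_text: str) -> list[str]:
    """Return the distinct non-whitespace characters of a page, sorted by code point."""
    return sorted({ch for ch in page_text if not ch.isspace()})


def as_unicode_list(chars: list[str]) -> str:
    """Format characters as a comma-separated list such as 'U+4E2D,U+6587'."""
    return ",".join(f"U+{ord(ch):04X}" for ch in chars)


# Hypothetical page text; repeated characters are counted only once.
used = characters_used("中文 中文")
print(as_unicode_list(used))  # U+4E2D,U+6587
```

A page that repeats the same few hundred characters therefore needs only those few hundred glyphs, regardless of how long the page is.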

As of writing, there is no free font that is "good" enough and includes all Chinese characters in Unicode. Since the "wiki" concept itself encourages collaborative content creation, it would be nice to invite users to create a glyph whenever the system sees a character without existing data.

Mentors: User:DChan_(WMF) User:Liangent

How Important It Is

Any Chinese font is certain to miss some characters, and the current tofu detection algorithm (bug 63122) cannot detect that. Loading a font with all Chinese characters is impossible, or at least inefficient.

Deliverables

Chinese Font Tailor

Tailor the font file according to the characters used in each page. During the SoC event, I may only be able to finish the Chinese Tailor as an experiment. If it works well, we can extend it to other languages or even make it a universal feature.

Glyph Collector

When tofu occurs, encourage the user to contribute the missing glyph.
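Deciding which glyphs to ask for comes down to comparing the characters on a page with the characters a font covers. A minimal sketch, assuming the font's coverage is already available as a set of code points (`missing_characters` and `font_coverage` are hypothetical names, not existing ULS APIs):

```python
def missing_characters(page_text: str, font_coverage: set[int]) -> list[str]:
    """Return the characters of page_text that the font has no glyph for."""
    used = {ord(ch) for ch in page_text if not ch.isspace()}
    return [chr(cp) for cp in sorted(used - font_coverage)]


# Hypothetical coverage: the font only has glyphs for U+4E2D and U+6587.
coverage = {0x4E2D, 0x6587}
# U+20000 is a rarely used CJK Extension B character.
print(missing_characters("中文\U00020000", coverage) == ["\U00020000"])  # True
```

Each character in the returned list is a candidate for a "contribute this glyph" prompt.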

Documents

Design docs, development docs, user docs, and anything else we find useful.

Long Term Support [after the event]

First, I will keep pushing the work forward until it is released, just as I did for other open source projects.

I love i18n projects, so I would like to stay with this topic. For example, I would be more than happy to be a GSoC mentor in the future :)

Participation

Knowledge Preparation

Frankly speaking, I have never touched the technical side of MediaWiki before, so it will take some time to get my hands dirty. I have started reading the docs about ULS and the HOWTOs for developers. Next, I will:

Get to the details of the implementation
Keep on discussing with mentors and the community
Finally give more accurate approaches and schedule

All of this will be done before the coding start date.

Building the Workflow

I expect to graduate around July 1, and I have to spend some time on graduation matters. So during the first half of the event, less coding will be done. I can set up the development environment and write more documentation, covering both design and implementation details. It is also a good time for us to plan further.

Build the MediaWiki development environment
Find suitable fonts to use, such as the following (note that we deal only with Chinese at first):
- Free Chinese Fonts suggested by Ubuntu CN
- WenQuanYi for Chinese
- Hanazono for Japanese
- efont for Japanese
- unfonts for Korean
Find a suitable glyph editor for users to contribute glyphs
Documentation on design and implementation
Do some experimental coding

In short, build the workflow so that I can focus on the feature afterwards.

Chinese Font Tailor

In the ULS webfonts repository (ULS:/data/fontrepo/fonts), Autonym contains all the characters needed for the ULS UI, as the picture shows. The UI needs only a few dozen Chinese characters, so the font keeps those and drops the other tens of thousands. That is how the Chinese Font Tailor will work, except that it is for general-purpose use, not only for the UI.

I think there are two approaches.

Aggressive update

Whenever a page is updated, scan it to see what characters are used and then generate the font for it.

Advantage: fast response when the page is visited

Disadvantage: it appears to affect logic outside ULS

Lazy update

Whenever a page is visited and ULS is called, scan it to see what characters are used and then generate the font for it. Cache the font with a timestamp; on later visits, compare the font's timestamp with the page's, and reuse the cached font directly if it is still up to date.

Advantage: better cohesion

Disadvantage: may cause a noticeable delay on the first visit

I suggest the latter, as pages tend to be written once and read many times.
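The lazy update check can be sketched as follows. This is an illustration under assumptions, not the actual design: `CachedFont`, `font_for_page`, the in-memory `cache` dictionary, and the `generate` callback are all hypothetical, and timestamps are taken to be plain integers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedFont:
    path: str          # where the tailored font file is stored
    generated_at: int  # timestamp of when the font was generated


def font_for_page(
    page_name: str,
    page_touched: int,
    cache: dict[str, CachedFont],
    generate: Callable[[str], CachedFont],
) -> CachedFont:
    """Reuse the cached font if it is at least as new as the page, else regenerate."""
    cached = cache.get(page_name)
    if cached is not None and cached.generated_at >= page_touched:
        return cached              # cache hit: no delay for the reader
    fresh = generate(page_name)    # first visit, or the page changed since
    cache[page_name] = fresh
    return fresh
```

Only the first visit after an edit pays the generation cost; every later visit is served from the cache until the page's timestamp moves past the font's.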

Implementation

A script that runs on the server will tailor the font file for a specified page, say ABC. The font file will be named ABC_FONT-NAME_TIMESTAMP.ttf.
Modify the webfonts JS (ULS:/resources/js/) to pass the parameters needed, such as the page name
A PHP script serves the right font in response to the webfonts call
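The naming scheme above could be built and parsed roughly like this. The percent-encoding of the page name and the requirement that the font name contain no underscore are assumptions added here so that the name can be split apart again; they are not part of the proposal, and both function names are hypothetical.

```python
from urllib.parse import quote, unquote


def tailored_font_filename(page: str, font_name: str, timestamp: int) -> str:
    """Build PAGE_FONT-NAME_TIMESTAMP.ttf, percent-encoding unsafe page characters."""
    if "_" in font_name:
        raise ValueError("font name must not contain '_' (assumption of this sketch)")
    return f"{quote(page, safe='')}_{font_name}_{timestamp}.ttf"


def parse_tailored_font_filename(filename: str) -> tuple[str, str, int]:
    """Split a tailored font filename back into (page, font name, timestamp)."""
    if not filename.endswith(".ttf"):
        raise ValueError("expected a .ttf filename")
    stem = filename[: -len(".ttf")]
    # Split from the right, because page names may themselves contain underscores.
    safe_page, font_name, timestamp = stem.rsplit("_", 2)
    return unquote(safe_page), font_name, int(timestamp)


print(tailored_font_filename("ABC", "WenQuanYi", 1400000000))
# ABC_WenQuanYi_1400000000.ttf
```

Splitting from the right keeps page titles such as `Main_Page` intact, since only the last two underscores are treated as separators.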

Glyph Collector

I think this is a simple feature without many technical barriers, so only one or two weeks are assigned to it. That way I can finish the main features before the pencils-down date.

Follow-up (Maybe after the event)

More testing, a beta release, and fixing reported bugs. Make it stable and more efficient, then push it forward until it is merged into trunk.

Schedule

Before May 18, Knowledge Preparation

Keep on discussing with mentors and the community to make the proposal better
Read the docs of ULS and the HOWTOs for developers (1 week)
Get to the details of the implementation (reading code) and find the point I'm going to work around (3 weeks)
Give more accurate approaches and schedule (1 week, before 18 May)

May 19 ~ June 22, Building the Workflow

Build the MediaWiki development environment (2 or 3 days)
Find suitable fonts and a glyph editor to use (1 week)
Design and plan more, and document it (1 week)
Some experimental coding (1 week)

June 23 ~ Aug 10, Implementation

Finish Chinese font tailor (3 weeks)
Glyph collector (1 or 1.5 weeks)
Update documents, test and fix bugs (2 weeks)

Aug 11 ~ Aug 18, Final Term

Finish the final report, present the result to the community and Google

Follow up

Push the work forward to be merged into trunk

About Me

See my User Page.

See my LinkedIn Page.

Education completed or in progress

M.S. in Computer Science, Peking University, Beijing, China

How did you hear about this program?

I searched the organizations list with the keyword "i18n", as it is one of my working fields. I'm always trying to introduce great open source projects to Chinese users, as well as to other non-English users.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

Through the end of June, I must spend some time on my thesis and graduation matters. After that I can work full time in July, August and September. (I know the event schedule. I'll finish the main features on time, and then continue working until the feature is released, even after the GSoC event.)

We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?

As a male applicant, I can apply only to GSoC.

Past experience

FOSS Projects

Wireshark

Wireshark is a great open source packet capturing and analyzing engine. I implemented dissectors for several major P2P Kademlia protocols, such as BT-uTP, Vuze-DHT and BT-DHT. They have been released since 1.6 and 1.7.

Blender

Blender is one of the best open source 3D modeling tools. In GSoC 2011, I implemented the i18n module, which has been released since 2.60.

Ogre

Ogre is one of the best open source 3D rendering engines. To save time and memory, we wanted to load BIG terrains on demand, meaning that a higher LOD is loaded only when necessary. The existing work was unstable, so I refactored it and added some new features, including an "Endless World" demo. It was eventually accepted into the repository and has been released since Ogre 1.9. Click here to see the wiki

Personal Repository

BitBucket

Google Code

FOSS ASIA

I like not only to contribute code, but also to share experience and publicize my organizations at events. During this year's FOSS ASIA event, I gave two talks. One was about "How to ask for help in Open Source projects", and the other was "Optimizations of the Terrain System" in OGRE.

Relevant Projects

The most relevant one is the Blender i18n module. In that project I learned to build the i18n process with GNU gettext. We then founded the translators' community, managed fonts and PO files, and tried font editors such as FontForge.

I also once wrote a Java IDE called FreeJava as a course project with 3 other classmates. We used the Java i18n mechanism to support English, Chinese and Japanese. All language files ( ./mess_*.properties ) had to be written before compiling. The program loaded the specified one and displayed it based on the setProperty("locale","en") call in the source. It was ugly and we threw it away right after the semester.

Interested Projects

I'm interested in i18n, game development and mobile app projects, but this year I applied only to MediaWiki. I'd be glad to work on other i18n-related projects if the feature in this proposal has low priority.

Any other info

Most Chinese users cannot speak English or any other foreign language, so they are kept away from great products that don't support i18n. For example, Blender, the best free 3D modeling tool, had few Chinese users before the i18n module was finished. That's why I'm so addicted to such technologies.

Now I'm eager to enhance Chinese support in MediaWiki, and to make it much easier for non-English users. Although I know little about the technical side of MediaWiki, I think I can learn fast and finish in time.