Extension:UniversalLanguageSelector/Fonts for Chinese wikis

Introduction
Including every Chinese character makes a webfont file too large to download. Instead, we can tailor the font file for each page based on the characters used on that page. Once finished, this feature can also be applied to other languages facing the same problem, such as Japanese.

At the time of writing, no free font that is good enough includes all the Chinese characters in Unicode. Since the wiki concept itself encourages collaborative content creation, it would be nice to invite users to create a glyph whenever the system encounters a character without existing glyph data.

Proposal
Go to Proposal

Mentors
DChan,  Liangent

Repository
Font Tailor

Tofu Detection

Demo Site
Go to Demo Site

With a browser debugging tool you can see that only 20KB is downloaded. (You can also click on a red tofu to see the new tofu-detection feature, which is more accurate because it compares pixels.)

Another example has more characters. About 40KB is downloaded.

To avoid conflicts, please create a new page when trying it out. You can write things like

SOMETHING YOU WANT

Development Report
Go to Development Report

Milestones

 * May 19: Start coding.
   * Warm up with the code and the development tool set
   * Clarify what to do next
 * June 22: Mid-term evaluation: Finish the prototype of Font Tailor
 * July 20: Finish the Font Tailor (ttf tailor finished and well tested; svg/woff/eot tailors finished but with no guarantee)
 * Aug 11: Pencils down: Tofu detection with font family settings
 * Aug 22: Final evaluation: Documents (the page you're reading)

Next Step
 * Product Implementation: Resolve the Known Issues as described.
 * Future Mentoring: Having graduated this year, I can no longer participate in GSoC as a student. But I'd like to be a mentor here to help others with language-related projects.

Dynamic WebFonts
For the standard WebFonts service, a static font file is downloaded. The @font-face rule looks like this:

 @font-face {
   font-family: WenQuanYi;
   ...
   src: url('fontspath/wenquanyi.ttf') format('truetype'), ...;
 }

Now we need to return a different font for each page, tailored to contain all of, and only, the characters on that page. So we change the url to:

 @font-face {
   font-family: WenQuanYi;
   ...
   src: url('FontRequest.php?font=WenQuanYi...') format('truetype'), ...;
 }

When the page is visited, a font request is fired towards FontRequest.php. The script gets the information it needs from the request parameters. If a tailored font file exists and is up to date, it is returned as an attachment:

 header( "Content-Type: application/octet-stream" );
 header( "Content-Disposition: attachment; filename=\"$wanted_filename\"" );
 readfile( $tailored_fontfile );

If no tailored font file is available, or it is out of date, the script should generate one.
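The URL rewrite described above can be sketched in JavaScript as follows. This is only an illustration: the function name `buildFontFaceRule` and the `page` query parameter are hypothetical, since the source shows only `FontRequest.php?font=WenQuanYi...` and does not list the remaining parameters.

```javascript
// Sketch only: builds a @font-face rule whose src points at a font-serving
// script instead of a static file. Parameter names other than "font" are
// placeholders; the real query string is not shown in the source.
function buildFontFaceRule(family, endpoint, extraParams = {}) {
  const query = new URLSearchParams({ font: family, ...extraParams });
  const url = `${endpoint}?${query.toString()}`;
  return [
    '@font-face {',
    `  font-family: ${family};`,
    // 'truetype' is the standard CSS format hint for .ttf files.
    `  src: url('${url}') format('truetype');`,
    '}',
  ].join('\n');
}

const rule = buildFontFaceRule('WenQuanYi', 'FontRequest.php', { page: 'Main_Page' });
console.log(rule);
```

Because the rule differs only in its `src` url, the rest of the WebFonts stylesheet can stay unchanged.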

Tailored Font Management
Under the font's path, there are three subtrees:


 * tailored/
   * 02c68248c6b40670c2889218987af948.ttf
   * 9efbe2b03fd390fa3e4bec7d65b36f46.ttf
 * tailored_for_title/
   * Main_Page_17.ttf -> 02c68248c6b40670c2889218987af948.ttf
   * Main_Page_16.ttf -> 02c68248c6b40670c2889218987af948.ttf
 * tailored_for_url/
   * %2Fwiki%2FMain_Page.ttf -> 02c68248c6b40670c2889218987af948.ttf
   * %2Fwiki%2FTest%3Fdebug%3Dtrue.ttf -> 9efbe2b03fd390fa3e4bec7d65b36f46.ttf

The tailored tree contains all the real tailored font files. Each distinct set of characters maps to one tailored font, e.g. 'abcde' and 'abcdef' map to different files. The file name is the MD5 hash of the character sequence.

The tailored_for_title and tailored_for_url trees contain soft links to tailored font files.
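A minimal sketch of this naming scheme, written in JavaScript for Node.js rather than the extension's PHP. The helper names are hypothetical, and reading "Main_Page_17.ttf" as title plus revision id is an assumption based on the listing above.

```javascript
const crypto = require('crypto');

// Reduce a page's text to its sorted sequence of distinct characters.
// Iterating a string visits whole code points, so rare CJK ideographs
// outside the BMP are not split into surrogate halves.
function charSequence(text) {
  return Array.from(new Set(text)).sort().join('');
}

// File name under tailored/: MD5 of the character sequence.
function tailoredFileName(text, ext = 'ttf') {
  const hash = crypto
    .createHash('md5')
    .update(charSequence(text), 'utf8')
    .digest('hex');
  return `${hash}.${ext}`;
}

// Hypothetical helper: link name under tailored_for_title, assuming the
// pattern is <title>_<revision id>.<ext>.
function titleLinkName(title, revisionId, ext = 'ttf') {
  return `${title}_${revisionId}.${ext}`;
}

console.log(tailoredFileName('banana') === tailoredFileName('nab')); // true
console.log(titleLinkName('Main_Page', 17)); // Main_Page_17.ttf
```

Two texts that use the same characters, in any order and with any repetition, map to the same file, which is what lets several pages or revisions share one tailored font.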

Font Tailor Workflow
 * Trigger the Tailor

By hooking the ArticleViewHeader event, FontTailor checks whether the tailored fonts are ready for an article when it is requested. This is done by checking whether the (title, revision) pair already exists under the tailored_for_title tree. If not, the tailor is fired.

 * Do Tailoring

Get the article's content and generate its character set by sorting the characters and removing duplicates. Look up the set's MD5 value under the tailored tree to see whether a tailored font already exists. If not, call php-font-lib to generate one there. This mechanism is somewhat like Git: different articles or revisions may share the same tailored font.

Create a soft link under the tailored_for_title tree, so that future requests will hit it.

Create or update a soft link under the tailored_for_url tree; it is used in the next step. Note that the same url may present different article revisions over time, so the soft link should always be updated, whether or not a real tailoring happened.
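The trigger and tailoring steps can be sketched as below. This is a simplified in-memory simulation in JavaScript, not the extension's PHP code: the three Maps stand in for the three directory trees, `subsetFont` is a placeholder for the php-font-lib call, and the raw character sequence is used as the key where the real system uses its MD5.

```javascript
// In-memory stand-in for the three directory trees; a real implementation
// would use files and symbolic links on disk.
function createStore() {
  return { tailored: new Map(), forTitle: new Map(), forUrl: new Map() };
}

// Key for a character set (the source hashes this sequence with MD5).
function charsetKey(text) {
  return Array.from(new Set(text)).sort().join('');
}

// Returns the key of the tailored font that serves this request.
function ensureTailored(store, { title, revision, url, text }, subsetFont) {
  const titleKey = `${title}_${revision}`;
  let key = store.forTitle.get(titleKey);
  if (key === undefined) {
    key = charsetKey(text);
    if (!store.tailored.has(key)) {
      store.tailored.set(key, subsetFont(key)); // the only expensive step
    }
    store.forTitle.set(titleKey, key);
  }
  // Always refreshed: the same url may show a newer revision later.
  store.forUrl.set(url, key);
  return key;
}

const store = createStore();
let subsetCalls = 0;
const fakeSubset = (chars) => {
  subsetCalls += 1;
  return `font(${chars})`;
};

ensureTailored(store, { title: 'Main_Page', revision: 16, url: '/wiki/Main_Page', text: 'abcab' }, fakeSubset);
ensureTailored(store, { title: 'Main_Page', revision: 17, url: '/wiki/Main_Page', text: 'cba' }, fakeSubset);
console.log(subsetCalls); // 1
```

Both revisions use the characters a, b and c, so the second request creates a new title link but reuses the font generated for the first.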

 * Request Tailored Font

When the article is ready on the client, it fires a request for the font, which we have changed from a static font file to FontRequest.php. The script reads $_SERVER['HTTP_REFERER'] to get the requester's url, and looks it up under the tailored_for_url tree.
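The lookup from a referer to a link name might look like the following JavaScript sketch. It assumes the link name is the URL-encoded path plus query string, which is consistent with names such as `%2Fwiki%2FTest%3Fdebug%3Dtrue.ttf` in the listing above; `urlLinkName` is a hypothetical helper, not a function from the extension.

```javascript
// Map a Referer header value to a link name under tailored_for_url.
// Returns null when the header is missing or not a valid absolute URL,
// in which case the caller would need some fallback font.
function urlLinkName(referer, ext = 'ttf') {
  if (!referer) return null;
  let parsed;
  try {
    parsed = new URL(referer);
  } catch (err) {
    return null;
  }
  return `${encodeURIComponent(parsed.pathname + parsed.search)}.${ext}`;
}

console.log(urlLinkName('http://example.org/wiki/Test?debug=true'));
// %2Fwiki%2FTest%3Fdebug%3Dtrue.ttf
```

Browsers do not always send a Referer header, so a missing value is handled explicitly here rather than treated as an error.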


 * Download and Render



If everything goes well, you'll see a properly rendered page like the attachment. The tailored font contains only the characters on the page, reducing the download size from 4.5MB to 20KB.

Known Issues
 * php-font-lib bug

Strangely, the font file output by php-font-lib does not work in WebFonts. But if you open it in another font editor (FontCreator or FontForge) and save it to another file, it works. The two files differ in some ways, and I don't know why yet. If someone has knowledge of TTF fonts, please take a look:

- Output TTF of php-font-lib

- Fixed TTF by FontForge

The current solution is to run an extra fix step with a FontForge script:

 #!/usr/bin/env fontforge
 Open('input.ttf',1)
 SelectAll()
 Copy()
 Generate('output.ttf')
 Close()

It's ugly to call exec in PHP, and it's also ugly to require FontForge. So I want to fix the problem in php-font-lib if possible.

 * Concurrent Requests

As described above, a FontTailor request tailors a font, writes it to disk, and creates two soft links. The whole process can take 3 seconds or more. In a production environment, many concurrent requests are likely to arrive within such a long window, and several tailoring runs may be started for the same page. So we need some kind of lock while tailoring.
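One possible shape for such a lock is a lock file created with an exclusive-create flag, sketched here in JavaScript for Node.js. The source does not say which locking mechanism is intended, so the whole approach and every name in it (`tryLock`, `tailorWithLock`, the `.lock` suffix) are assumptions.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Try to take an exclusive lock by creating a lock file. The 'wx' flag makes
// the open fail with EEXIST if the file already exists, so only one caller wins.
function tryLock(lockPath) {
  try {
    fs.closeSync(fs.openSync(lockPath, 'wx'));
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

function unlock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Run tailorFn only if no other request is already tailoring the same key.
function tailorWithLock(lockDir, key, tailorFn) {
  const lockPath = path.join(lockDir, `${key}.lock`);
  if (!tryLock(lockPath)) return { started: false };
  try {
    return { started: true, result: tailorFn(key) };
  } finally {
    unlock(lockPath);
  }
}

const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'fonttailor-'));
const outcome = tailorWithLock(lockDir, 'abc', (key) => `tailored:${key}`);
console.log(outcome.started); // true
```

A request that loses the race could wait and retry, or serve a fallback font. A crashed process would leave a stale lock file behind, so a real deployment would also need a timeout or cleanup rule.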

 * Additional Content Loaded with AJAX

We don't currently handle such complicated scenarios. If the new content contains characters that are not in the original HTML, it may not be rendered as expected.
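To make the problem concrete, the following JavaScript sketch computes which characters of newly loaded content fall outside the set the page's font was tailored for. It is only an illustration of the gap, not something the project implements, and `uncoveredChars` is a hypothetical name.

```javascript
// Characters in newText that the already-tailored font does not cover.
// coveredChars is the character sequence the font was tailored for.
function uncoveredChars(coveredChars, newText) {
  const covered = new Set(coveredChars);
  return Array.from(new Set(newText)).filter((ch) => !covered.has(ch));
}

console.log(uncoveredChars('abc', 'cabbage')); // [ 'g', 'e' ]
```

A non-empty result means the AJAX content needs glyphs that the downloaded font lacks, so those characters would fall back to system fonts or show as tofu.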

 * Redundant Subsetting

Currently every subsetted font contains every character on the page. For example, if ABC is set in Font1 and DEF in Font2, we tailor both Font1 and Font2, but each contains all of ABCDEF rather than only the characters it needs.
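A less redundant approach would collect characters per font family before subsetting. The JavaScript sketch below assumes the page has already been broken into text runs, each tagged with its font; how those runs are obtained is not covered by the source, and `charsPerFont` is a hypothetical helper.

```javascript
// runs: [{ font, text }], one entry per styled text run on the page.
// Returns an object mapping each font to its sorted distinct characters.
function charsPerFont(runs) {
  const perFont = new Map();
  for (const { font, text } of runs) {
    if (!perFont.has(font)) perFont.set(font, new Set());
    for (const ch of text) perFont.get(font).add(ch);
  }
  const result = {};
  for (const [font, chars] of perFont) {
    result[font] = Array.from(chars).sort().join('');
  }
  return result;
}

const subsets = charsPerFont([
  { font: 'Font1', text: 'ABC' },
  { font: 'Font2', text: 'DEF' },
]);
console.log(subsets); // { Font1: 'ABC', Font2: 'DEF' }
```

Each font would then be tailored with its own character sequence instead of the whole page's.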

Tofu Detection with FontFamily
If a Chinese character is rendered as a tofu, the reason is that its glyph is not available in any of the fonts, whether from the WebFonts service or from the system. Reportedly, the most reliable way to detect a tofu is to compare its image with a known tofu's image, such as that of Unicode code point U+0D00.

However, you cannot do that with a fixed fontFamily such as sans-serif, because a WebFonts service may render the character properly using remote fonts. So the current detectTofu method may produce false positives. We should detect tofu using the element's real fontFamily setting. Tofus also look different from font to font, as you can see below:
 * &#x0d00; [sans-serif tofu]
 * &#x0d00; [Linux Libertine tofu]
 * &#x0d00; [宋体 tofu]
 * &#x0d00; [Georgia tofu]

Detect Tofu by Comparing Image
Use HTML5's canvas element to draw each character, and compare with the tofu's image.

This was introduced in another patch of mine; see patch 122277.
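A minimal sketch of the pixel comparison, assuming a browser environment for the drawing part. It is not the code from the patch: the function names, the canvas size, and the choice of reference character are assumptions. U+0D00 follows the text above, but it only works as a reference while no available font has a glyph for it, which may no longer hold on newer systems.

```javascript
// Pure comparison of two RGBA pixel buffers (as returned by getImageData().data).
function samePixels(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i += 1) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Browser-only: draw one character with the given font-family on a canvas
// and return its pixels.
function renderPixels(ch, fontFamily, size = 32) {
  const canvas = document.createElement('canvas');
  canvas.width = size * 2;
  canvas.height = size * 2;
  const ctx = canvas.getContext('2d');
  ctx.font = `${size}px ${fontFamily}`;
  ctx.textBaseline = 'top';
  ctx.fillText(ch, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height).data;
}

// A character counts as tofu if it renders exactly like the reference
// character under the same font-family setting.
function isTofu(ch, fontFamily, referenceChar = '\u0D00') {
  return samePixels(
    renderPixels(ch, fontFamily),
    renderPixels(referenceChar, fontFamily)
  );
}
```

Passing the element's computed font-family to `isTofu`, rather than a fixed value like sans-serif, is what ties the comparison to the tofu shape of the font actually in use.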

Popup to Show Tofu Information
Traverse the DOM tree to find all text nodes, mark the tofu characters in red, and bind a click event that opens a popup showing each tofu's information. In the future we can guide users to the font's contribution page or to our own glyph-contribution page.
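The traversal step can be sketched as follows. The helper names are hypothetical, and `isTofu` is passed in by the caller, so any detection method can be plugged in. The code relies only on the `nodeType`, `nodeValue` and `childNodes` properties, so it works on real DOM nodes as well as on plain objects shaped like them.

```javascript
const TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM

// Depth-first collection of all text nodes under root.
function collectTextNodes(root, out = []) {
  if (root.nodeType === TEXT_NODE) {
    out.push(root);
  } else {
    for (const child of Array.from(root.childNodes || [])) {
      collectTextNodes(child, out);
    }
  }
  return out;
}

// Distinct characters under root for which isTofu(ch) returns true.
function findTofuChars(root, isTofu) {
  const found = new Set();
  for (const node of collectTextNodes(root)) {
    for (const ch of node.nodeValue) {
      if (isTofu(ch)) found.add(ch);
    }
  }
  return Array.from(found);
}
```

In a browser, each character found this way could then be wrapped in a red span with a click handler that opens the popup.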