Extension:UniversalLanguageSelector/Fonts for Chinese wikis/final post

From mediawiki.org

Enhance WebFonts with Auto-Subsetting: Help Hanzi Survive in Our Era

Chap I. We should admit that Hanzi is not friendly to the Internet

Two points are enough to prove it.

First of all, there has never been a complete Hanzi font.

English contains 26 letters. Generally you need only one byte for any character used in an article or web page. For Hanzi, we don’t even know exactly how many characters exist. About 80,000 of them may be in use (including those in Japanese and Korean). Unicode defines about 70,000. GBK, an encoding widely used in mainland China, defines a collection of about 20,000. GB2312 defines a collection of about 6,000. And BIG5, an encoding used in Hong Kong and Taiwan, defines a collection of about 13,000. Generally any of them works well, because we use only about 3,500 characters in daily life.
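The difference between these character sets can be checked with a short script. The following is only a sketch using the codecs that ship with Python's standard library; the helper names `byte_length` and `can_encode` and the sample characters are my own choices for illustration, not part of the original project.

```python
def byte_length(ch: str, codec: str = "utf-8") -> int:
    """Number of bytes needed to store one character in the given encoding."""
    return len(ch.encode(codec))


def can_encode(ch: str, codec: str) -> bool:
    """True if the character belongs to the character set behind the codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Sample characters (illustrative): a Latin letter, a common Hanzi,
    # a traditional Hanzi, and a rare Hanzi from CJK Extension B.
    samples = ["A", "中", "體", "\U00020000"]
    for ch in samples:
        coverage = {c: can_encode(ch, c) for c in ("gb2312", "gbk", "big5")}
        print(f"U+{ord(ch):04X} utf-8 bytes={byte_length(ch)} coverage={coverage}")
```

A Latin letter fits in one byte, while a Hanzi needs several, and a character outside a given set simply cannot be written in that encoding at all.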

But incomplete fonts do cause problems. If the computer does not have a glyph for a character, the character is rendered as something like 口, called “Tofu” in the Wikipedia community. Most of us developers have run into it in software UIs, web pages or terminals. In 2014, China’s President Xi visited Germany, and the locals made a banner with lots of Tofus. It is really difficult for non-Chinese readers to tell a “square-shaped character” from a “square shape”.

Secondly, font files are too large.

Open your system’s font directory and sort by size. The largest ones are most likely Hanzi fonts (if you have installed any). Other fonts are generally tens to hundreds of KB, but even the smallest Hanzi font may take several MB. On my Mac, a Kaiti Hanzi font takes 70MB. If you really had a “complete Hanzi font”, it would surely take hundreds of MB. You might think that doesn’t matter, as storage is so cheap nowadays. But in some scenarios, size really matters. In 2011, I was working on Blender’s I18N system. In order to support Chinese, I needed to embed a 3MB font into the release package. The community considered it unacceptable, as the package itself was only 27MB. They would put a lot of effort into saving just 1MB of space, while I was going to add 3MB all at once.

The WebFonts solution faces the same problem. It fetches fonts from the server when a web page is loaded. Google already offers such an online font service, but it still doesn’t support Chinese. It is unreasonable to make users download several MB of fonts when browsing a web page of several KB. The download time would significantly damage the user experience, not to mention the bandwidth costs.

Should we give up Hanzi, or, as I once thought, abolish uncommon characters and standardize on only 3,000 common ones? That is a solution of sorts, but it is not desirable after all. Hanzi has been alive for 3,000 years, since the oracle bone script. It carries the entire history of Chinese civilization. The Internet showed up just a few decades ago. We are not sure which one will live longer. For us programmers, one of the things we can do is to help Hanzi, as well as other languages, live better in our Internet era.

Chap II. Subsetting Implementation

WebFonts is a new solution to the problem of missing fonts on the web. Almost all mainstream browsers support the following CSS syntax:

@font-face {
      font-family: Arial;
      src: url('http://path.com/arial.ttf') format('truetype');
}

The browser will load the font from the URL, so the font no longer needs to exist locally. Google already offers such an online font service[2]. The Wikimedia Foundation has also established its own font repository. Of course, neither of them provides Hanzi. To solve this problem, I developed the Auto-Subsetting feature[3] for MediaWiki during this year’s Google Summer of Code event.

The solution is divided into three parts:

  • 1. A hook on the server side subsets the fonts before the content is sent to the client. The subsetted font contains only the characters used on that page, so it is very small. I use the php-font-lib[4] library to do the subsetting. The library currently has some bugs, so I have to call fontforge once to fix the result.
  • 2. The CSS code for WebFonts has been changed to:
@font-face {
     font-family: Arial;
     src: url('http://path.com/FontRequest.php?font=arial') format('truetype');
}

Note that the URL now points to a PHP script instead of a static TTF file. I call it a “Dynamic Font”.

  • 3. On the server side, the “Dynamic Font” request should return the subsetted font correctly. According to some experiments, both redirecting to a static font file:
<?php
     header ("Location: $path_to_tailored_font");
?>

or modifying the MIME type:

<?php
     header ('Content-Type: application/octet-stream');
     header ('Content-Disposition: attachment; filename=arial.ttf');
     readfile ($path_to_tailored_font);
?>

will work. I usually use the second method, as it avoids an additional HTTP request.
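The first step, finding out which characters a page actually uses, can be sketched as follows. This is not the extension's code (which is PHP and relies on php-font-lib); it is a minimal illustration in Python, and the names `chars_used` and `_TextCollector` are hypothetical. It assumes that only visible text matters, so the contents of script and style elements are skipped.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the distinct non-whitespace characters of the visible text."""

    def __init__(self):
        super().__init__()
        self.chars = set()
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars.update(ch for ch in data if not ch.isspace())


def chars_used(html: str) -> str:
    """Return the characters a subsetted font must keep, sorted by code point."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(sorted(collector.chars))


if __name__ == "__main__":
    page = "<html><body><p>汉字很美，汉字很多</p></body></html>"
    keep = chars_used(page)
    print(len(keep), "distinct characters to keep:", keep)
```

The resulting string is what a subsetter would receive as the list of glyphs to keep; everything else in the font file can be dropped.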

Chap III. Demo Site

I set up an instance on the Wikimedia Foundation’s Labs[5]. The second line of text on the home page uses the “WenQuanYi Micro Hei” font. The complete font is 4.5MB, but with a debugging tool you’ll see that only 20KB is downloaded. Another, larger test page[6] needs to download only about 40KB.

WebFonts technology can finally be applied to Chinese now! You can safely design your page without having to worry about fonts. To verify this, you can also create new wiki pages (PLEASE don’t modify the demo pages) with code such as:

<p style = "font-family: WenQuanYi Micro Hei"> SOMETHING YOU WANT </p>

Then you can see that the page loads a new subsetted font containing only the characters you used. Of course, subsetting is time-consuming, so you might have to wait several seconds on the first visit.
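Since only the first visit is slow, the subsetted file is presumably reused afterwards. One way to do that, sketched here under my own assumptions, is to key a file cache on the font name plus the set of characters. The names `cache_key` and `cached_subset`, the file layout, and the `subset_fn` callback (a stand-in for the real subsetter) are all hypothetical and not taken from the extension.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(font: str, chars: str) -> str:
    """File name that depends only on the font and the *set* of characters."""
    canonical = "".join(sorted(set(chars)))
    digest = hashlib.sha1(f"{font}\n{canonical}".encode("utf-8")).hexdigest()
    return f"{digest}.ttf"


def cached_subset(font: str, chars: str, cache_dir: Path,
                  subset_fn: Callable[[str, str], bytes]) -> Path:
    """Return the path of the subsetted font, running subset_fn only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_key(font, chars)
    if not path.exists():
        canonical = "".join(sorted(set(chars)))
        path.write_bytes(subset_fn(font, canonical))  # the slow step
    return path


if __name__ == "__main__":
    def fake_subsetter(font: str, chars: str) -> bytes:
        # Placeholder: a real subsetter would return the bytes of a TTF file.
        return f"{font}:{chars}".encode("utf-8")

    with tempfile.TemporaryDirectory() as tmp:
        first = cached_subset("WenQuanYi Micro Hei", "汉字", Path(tmp), fake_subsetter)
        again = cached_subset("WenQuanYi Micro Hei", "字汉", Path(tmp), fake_subsetter)
        print(first == again)
```

Sorting and de-duplicating the characters before hashing means two pages with the same character set share one cached file, whatever the order of their text.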

Chap IV. Universal Implementation

The feature for Wikipedia has been completed, but I think it is generic and can solve some big problems, so I ported it to GitHub[7]. In this edition, font information is collected on the client side using JavaScript, which issues an AJAX request to the server asking for subsetting. The server returns a URL for each font. After the client receives them, JavaScript generates the CSS code and appends it to the <head>. WebFonts then takes effect.
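The data flow described above can be sketched in a few lines. The real project does this in JavaScript in the browser; the Python below only mirrors the logic, and the function names, the input format, and the example URLs are assumptions made for illustration. The server round trip is replaced by a plain dictionary.

```python
from typing import Dict, Iterable, Tuple


def collect_font_usage(elements: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge (font family, text) pairs into one character set per font."""
    usage: Dict[str, set] = {}
    for font, text in elements:
        usage.setdefault(font, set()).update(ch for ch in text if not ch.isspace())
    return {font: "".join(sorted(chars)) for font, chars in usage.items()}


def _css_string(value: str) -> str:
    """Quote a value as a CSS string, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_font_face_css(font_urls: Dict[str, str]) -> str:
    """Generate one @font-face rule per (font family, subsetted font URL)."""
    rules = []
    for font, url in font_urls.items():
        rules.append(
            "@font-face {\n"
            f"    font-family: {_css_string(font)};\n"
            f"    src: url({_css_string(url)}) format(\"truetype\");\n"
            "}"
        )
    return "\n".join(rules)


if __name__ == "__main__":
    usage = collect_font_usage([
        ("WenQuanYi Micro Hei", "汉字"),
        ("WenQuanYi Micro Hei", "字体"),
    ])
    # What the client would send to the server in the AJAX request.
    print(usage)
    # Stand-in for the server's answer: one URL per font (example URL only).
    urls = {"WenQuanYi Micro Hei": "http://example.org/fonts/subset.ttf"}
    print(build_font_face_css(urls))
```

The generated text is what would be placed in a style element inside the <head> once the server has answered.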

This implementation is very simple and easy to deploy. The project comes with sample code, and you can also take a quick look at the demo [8].

Last Word

I hope as many websites as possible apply this feature, to make the Internet much more friendly to Hanzi and vice versa. I am willing to provide assistance. I also hope there will be an online font service for Hanzi with subsetting enabled. It would be a killer application for the Chinese cyber world.

  • Sina Weibo: @甜菜萧
  • Facebook/Twitter: xiaoxiangquan
  • LinkedIn: linkedin.com/in/xiaoxiangquan

[1] https://app.yinxiang.com/shard/s28/sh/a5f355c8-ea38-4fc0-ae01-b4ae8840b193/70556a90514f96f117cedddd81707d0a

[2] https://www.google.com/fonts

[3] https://www.mediawiki.org/wiki/Extension:UniversalLanguageSelector/Fonts_for_Chinese_wikis

[4] https://github.com/PhenX/php-font-lib

[5] http://fonttailor.wmflabs.org

[6] http://fonttailor.wmflabs.org/index.php/Test

[7] https://github.com/xiaoxq/webfonts-subsetting

[8] http://fonttailor.wmflabs.org/webfonts-subsetting