User:TJones (WMF)/Notes/Chinese Analyzer Analysis

February/March 2017 — See TJones_(WMF)/Notes for other projects. See also T158202.

Test/Analysis Plan
All of the Chinese segmentation candidates I've found to date (See T158202) expect Simplified Chinese characters as input. Chinese Wikipedia supports both Traditional and Simplified characters, and converts them at display time according to user preferences. There is a Elasticsearch plugin (STConvert) that converts Traditional to Simplified (T2S) or vice versa (S2T). I suggest trying to set up an analysis chain using that, and segment and index everything as Simplified.

There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers. The developer of the wrappers for IK and MMSEG recommends IK for segmenting, but I plan to test both.

So, my evaluation plan is to first compare the content and output of STConvert with MediaWiki ZhConversion.php to make sure they do not differ wildly, and maybe offer some cross-pollination to bring them more in line with each other if that seem profitable. (I've pinged the legal team about licenses, data, etc.)

I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.

I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being easily patched might be more valuable than 0.02% better accuracy, for example.

It also makes sense to compare these segmenters to the baseline performance in prod (using icu_normalizer and icu_tokenizer), and the standard tokenizer.

We'll also test the highlighting for cross-character type (Traditional/Simplified/mixed) queries.

In parallel, I'll try to talk @dcausse into reviewing the code as needed to see if anything sticks out as particularly unmaintainable.

An Example
For reference in the discussion outline below, here's an example.

Right now (Feb, 27 2017), searching for Traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for Simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results—so they have some overlapping results.

Searching for the mixed T/S query (the last two characters, meaning "finals", are Traditional, the rest is Simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of Traditional and Simplified characters that are possible for this query.

Initial Draft Proposal
Once we have all the necessary tools available, we have to figure out how best to deploy them.

The current draft proposal, after discussing what's possible with the Elasticsearch inner workings with David & Erik, is: The working assumption here is that whether a typical searcher searches for Simplified 年欧洲冠军联赛决赛, Traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the underlying characters are Traditional or Simplified. When they use quotes, they want those words as a phrase.
 * Convert everything to Simplified characters for indexing and use a segmenter to break the text into words, in both the text and plain fields. Do the same for normal queries at search time.
 * Index the text as is in the source plain field, and use a unigram segmenter.

For advanced searchers or editors who want to find specific characters (e.g., to distinguish the simple, Traditional, and mixed examples above), insource: would provide that ability.

We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.

STConvert
STConvert is an Elasticsearch plugin that converts between Traditional and Simplified Chinese characters. It includes an analyzer, tokenizer, token-filter, and char-filter. It has released that are compatible with many versions of ES, including ES 5.1.2 and ES 5.2.1 (and ES 2.3.5, which I'm currently running in vagrant).

STConvert vs ZhConversion: Mappings
ZhConversion.php is a MediaWiki PHP module that also converts between Traditional and Simplified Chinese characters. It is used to convert Chinese wiki projects at display time.

In order to get a rough estimate of the coverage and likely comparative accuracy of the two, I compared the data they use to do their respective conversions. The ZhConversion data is in a PHP array and the STConvert data file is a colon-separated text file. WARNING—both of those links go to large, slow loading files that may crash your browser.

In each case, I normalized the list of mappings by removing the syntactic bits of the file and converting to a tab-separated format. For each mapping, I sorted the elements being mapped (so A:B and B:A would both become A:B), and sorted and de-duped the lists.

I did a diff of the files and took a look. There were no obvious wild inconsistencies. I also did a more careful automated review. The results are below.

For these first two sets of stats: STConvert (Elasticsearch) ZhConversion (Mediawiki) The duplicates and conflicts are not surprising. This kind of info is gathered from many sources, so duplicates and even conflict are likely. "Conflicts" and duplicates are not necessarily incorrect, either, since some mappings are many-to-one. If A and B both map to C in one direction, but C preferentially maps back to A in the other direction, you get both "conflict" and a sorted duplicate: {A:C, B:C, C:A} → {A:C, B:C, A:C}, so A:C is a dupe and C has a conflict.
 * raw mappings are the number of mappings present in the original PHP or text file
 * unique sorted mappings are what's left after re-ordering and de-duping
 * internal conflicts are the number of character strings that map to or from more than one thing, so having A:B and B:C would be an internal conflict for B.
 * raw mappings: 11708
 * unique sorted mappings: 11591
 * internal conflicts: 192
 * raw mappings: 20381
 * unique sorted mappings: 15890
 * internal conflicts: 2748

I then compared the mappings more systematically against each other. They have a lot of overlap in the sorted mappings, though both have some the other does not. I also looked for mismatches of several types, where the existing mappings don't agree. STConvert vs ZhConversion: Mappings Both singleton and complex mismatches might not be errors. These kinds of mappings are typically driven by examples, and will never be complete (there's always new vocabulary being created, and weird exceptions are the rule with human language), so one project may have had a request to add A:C, while another had a request to add B:C, but neither is wrong (and this is even more likely if A and B are Traditional and C is Simplified).
 * singleton mismatches occur when each has exactly one mapping for a character and they disagree; these are clear disagreements (though not necessarily errors, see below)
 * complex mismatches occur when there internal conflicts, and they aren't the same between mappings. For example, {A:B, A:C, A:D, A:E, A:F} vs {A:C, A:D, A:E, A:F}, in this case there's lots of overlap, but they don't match exactly. I didn't dig into the details of these kinds of mismatches.
 * merged unique sorted mappings: 17891
 * 9590 unique sorted mappings occur in both
 * STConvert has 2001 unique sorted mappings ZhConversion does not
 * ZhConversion has 4291 unique sorted mappings STConvert does not
 * Mismatches
 * singleton mismatches: 191
 * complex mismatches: 1692

Another important thing to note is that the distribution of Chinese characters is far from even. English Wikipedia says that educated Chinese know about 4,000 characters. A BBC guide to Chinese says that while there are over 50,000 characters, comprehensive dictionaries list only 20,000, educated speakers know 8,000, but only 2,000-3,000 are needed to read a newspaper.

In general, this gives me confidence that both systems are on par with each other in terms of the most common characters.

There is still a potential problem, given the obscure completeness often found in Wikipedia, that for two underlying forms of an uncommon word, one Simplified and one Traditional, the display forms (driven by ZhConversion) could be the same, while the indexed forms (driven by STConvert) could be different. This would be particularly confusing for most users because the two forms would look the same on the screen, but searching for either would only find one of the forms. The reverse situation, where STConvert merges them in the index but ZhConversion fails to render them the same on the screen might actually look "smart", because while they look different, search can still find both!

Based on the numbers above, I think the likelihood of this ever happening in almost 100% (I'm pretty sure I could engineer an example by trawling through the tables), but the likelihood of it happening often in actual real-world use is very small. I will try to get a quantitative answer by getting some Chinese Wikipedia text and running both converters in both directions and seeing how often they disagree.

If there is a significant mismatch, we could either share data across the two platforms (I'm not 100% sure how that works, from a licensing standpoint) or we could fork STConvert and convert the ZhConversion data into it's format. There's a small chance of some remaining incompatibilities based on implementation differences, but it would be easy to re-test, and I'd expect such differences to be significantly smaller than what we have now with independent parallel development.

STConvert vs ZhConversion: Real-World Performance
I extracted 10,000 random articles from Chinese Wikipedia, and deduped lines (to eliminate the equivalent of "See Also", "References", and other oft-repeated elements). The larger sample turned out to be somewhat unwieldy to work with, so I took a smaller one-tenth sample of the deduped data, which is approximately 1,000 articles' worth.

The smaller sample had 577,127 characters in it (6,257 unique characters); of those 401,449 were CJK characters (5,722 unique CJK characters). I ran this sample through each of STConvert and ZhConversion to convert the text to Simplified (T2S) characters.

STConvert changed 72,555(12.6%) of the characters and ZhConversion changed 74,518(12.9%) of the characters."Note that the sample is a mix of Simplified and Traditional characters, and that some characters are the same for both. I also ran the same process using STConvert to convert Simplified to Traditional characters (S2T), which resulted in 11.8% of characters being changed. Between the S2T and T2S outputs from STConvert, 24.3% changed (i.e., ~12.6% + ~11.8% with rounding error). So, roughly, it looks like ~3/4 of characters on Chinese Wikipedia are the same after either Simplified and Traditional conversion, and ~1/8 are either Simplified or Traditional that can be converted to the other."I did a character-by-character diff of their outputs against each other, and 2,627 characters where changed from the ZhConversion output, and 2,841 where changed from the STConvert output (0.46%-0.49%). The differences cover around 80 distinct characters, but by far the largest source of differences in in quotation marks, with some problems for high surrogate characters.

Of the 2627 diffs from ZhConversion: Of the 2841 diffs from STConvert: Discounting the 2084 quotes from the 2627 ZhConversion diffs leaves 543 characters.
 * 1920 are “ and ”
 * 164 are ‘ and ’
 * The rest are CJK characters.
 * 1916 are 「 and 」
 * 164 are 『 and 』
 * 4 are ｢ and ｣
 * 159 are \
 * 145 of which precede "
 * 12 of which precede u (and are part of \u Unicode encodings, like \uD867, which is an invalid "high surrogate" character)
 * The rest are CJK characters.

Discounting the 2084 quotes, 159 slashes, and reducing the 60 Unicode encodings to 12 individual characters, from the 2841 STConvert diffs also leaves 550 characters. (I'm not sure what's caused the 543 vs 550 character difference.)

So, other than quotes, ZhConversion and STConvert disagree on only 0.094% - 0.095% (less than a tenth of a percent) of the more than half a million characters in this sample.

387 (~71%) of the character differences are accounted for by the 11 most common characters (those with >10 occurrences). It's easy to line up most of these characters, since the counts are the same. I've lined them up and looked at English Wiktionary to see which is the likely correct form.

So the bulk of the difference is quotation marks (which we should be able to fix if needed), leaving about 0.1% disagreement on the rest, with attributable token-level errors roughly evenly divided between the two converters.

That seems close enough to work with!

Coming Up
Installing and running SmartCN, IK, and MMSEG and running them against SIGHAN data and Chinese Wikipedia data, code reviews, patchability reviews, highlight testing, quote testing, and recommendations.