User:TJones (WMF)/Notes/Chinese Analyzer Analysis

February/March 2017 — See TJones_(WMF)/Notes for other projects. See also T158202.

Test/Analysis Plan
All of the Chinese segmentation candidates I've found to date (see T158202) expect Simplified Chinese characters as input. Chinese Wikipedia supports both Traditional and Simplified characters, and converts them at display time according to user preferences. There is an Elasticsearch plugin (STConvert) that converts Traditional to Simplified (T2S) or vice versa (S2T). I suggest trying to set up an analysis chain using that, and segmenting and indexing everything as Simplified.

There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers. The developer of the wrappers for IK and MMSEG recommends IK for segmenting, but I plan to test both.

So, my evaluation plan is to first compare the content and output of STConvert with MediaWiki ZhConversion.php to make sure they do not differ wildly, and maybe offer some cross-pollination to bring them more in line with each other if that seems profitable. (I've pinged the legal team about licenses, data, etc.)

I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.

I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being easily patched might be more valuable than 0.02% better accuracy, for example.

It also makes sense to compare these segmenters to the baseline performance in prod (using icu_normalizer and icu_tokenizer), and the standard tokenizer.

We'll also test the highlighting for cross-character type (Traditional/Simplified/mixed) queries.

In parallel, I'll try to talk @dcausse into reviewing the code as needed to see if anything sticks out as particularly unmaintainable.

An Example
For reference in the discussion outline below, here's an example.

Right now (Feb 27, 2017), searching for Traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for Simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results, so they have some overlapping results.

Searching for the mixed T/S query (the last two characters, meaning "finals", are Traditional, the rest is Simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of Traditional and Simplified characters that are possible for this query.

Initial Draft Proposal
Once we have all the necessary tools available, we have to figure out how best to deploy them.

The current draft proposal, after discussing what's possible with the Elasticsearch inner workings with David & Erik, is:
 * Convert everything to Simplified characters for indexing and use a segmenter to break the text into words, in both the text and plain fields. Do the same for normal queries at search time.
 * Index the text as is in the source plain field, and use a unigram segmenter.

The working assumption here is that whether a typical searcher searches for Simplified 年欧洲冠军联赛决赛, Traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the underlying characters are Traditional or Simplified. When they use quotes, they want those words as a phrase.

For advanced searchers or editors who want to find specific characters (e.g., to distinguish the Simplified, Traditional, and mixed examples above), insource: would provide that ability.
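As a rough illustration of the proposal, the two analysis chains could be expressed as Elasticsearch index settings along the following lines. This is only a sketch under assumptions: the analyzer, char filter, and tokenizer names (text, source_plain, t2s_convert, unigram) are made up for the example, smartcn_tokenizer merely stands in for whichever segmenter ends up being chosen, and a size-1 ngram tokenizer is just one possible way to get a unigram segmenter. The stconvert char filter with convert_type t2s follows the STConvert plugin's documented usage, but check it against the installed plugin version.

```python
import json

# Hypothetical index settings; all names below are illustrative assumptions.
settings = {
    "analysis": {
        "char_filter": {
            # Traditional -> Simplified conversion (STConvert plugin).
            "t2s_convert": {"type": "stconvert", "convert_type": "t2s"},
        },
        "tokenizer": {
            # One possible "unigram segmenter": n-grams of length exactly 1.
            "unigram": {"type": "ngram", "min_gram": 1, "max_gram": 1},
        },
        "analyzer": {
            # text/plain fields: convert to Simplified, then segment into words.
            "text": {
                "type": "custom",
                "char_filter": ["t2s_convert"],
                "tokenizer": "smartcn_tokenizer",  # placeholder segmenter
            },
            # source plain field: index the text as is, one character per token.
            "source_plain": {"type": "custom", "tokenizer": "unigram"},
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The same "text" analyzer would be applied to normal queries at search time, so that Traditional, Simplified, and mixed queries all end up as the same Simplified tokens.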

We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.

STConvert
STConvert is an Elasticsearch plugin that converts between Traditional and Simplified Chinese characters. It includes an analyzer, tokenizer, token-filter, and char-filter. It has releases that are compatible with many versions of ES, including ES 5.1.2 and ES 5.2.1 (and ES 2.3.5, which I'm currently running in vagrant).

STConvert vs ZhConversion: Mappings
ZhConversion.php is a MediaWiki PHP module that also converts between Traditional and Simplified Chinese characters. It is used to convert Chinese wiki projects at display time.

In order to get a rough estimate of the coverage and likely comparative accuracy of the two, I compared the data they use to do their respective conversions. The ZhConversion data is in a PHP array and the STConvert data file is a colon-separated text file. WARNING—both of those links go to large, slow loading files that may crash your browser.

In each case, I normalized the list of mappings by removing the syntactic bits of the file and converting to a tab-separated format. For each mapping, I sorted the elements being mapped (so A:B and B:A would both become A:B), and sorted and de-duped the lists.
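The normalization step can be sketched roughly as follows. This is not the script actually used; it assumes the syntactic bits have already been stripped so that each line is a simple "A:B" pair, and the function name normalize_mappings is invented for the example.

```python
def normalize_mappings(lines, sep=":"):
    """Sort the two sides of each mapping, then sort and de-dupe the list."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        left, right = line.split(sep, 1)
        # A:B and B:A both become (A, B)
        pairs.add(tuple(sorted((left.strip(), right.strip()))))
    return sorted(pairs)


raw = ["A:C", "B:C", "C:A"]
print(normalize_mappings(raw))  # [('A', 'C'), ('B', 'C')]
```

Writing the result out one tab-separated pair per line gives files that can be compared with a plain diff.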

I did a diff of the files and took a look. There were no obvious wild inconsistencies. I also did a more careful automated review. The results are below.

For these first two sets of stats:
 * raw mappings are the number of mappings present in the original PHP or text file
 * unique sorted mappings are what's left after re-ordering and de-duping
 * internal conflicts are the number of character strings that map to or from more than one thing, so having A:B and B:C would be an internal conflict for B.

STConvert (Elasticsearch):
 * raw mappings: 11708
 * unique sorted mappings: 11591
 * internal conflicts: 192

ZhConversion (MediaWiki):
 * raw mappings: 20381
 * unique sorted mappings: 15890
 * internal conflicts: 2748

The duplicates and conflicts are not surprising. This kind of info is gathered from many sources, so duplicates and even conflicts are likely. "Conflicts" and duplicates are not necessarily incorrect, either, since some mappings are many-to-one. If A and B both map to C in one direction, but C preferentially maps back to A in the other direction, you get both a "conflict" and a sorted duplicate: {A:C, B:C, C:A} → {A:C, B:C, A:C}, so A:C is a dupe and C has a conflict.
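One way to count internal conflicts as defined above is to index every string by the set of things it maps to or from, and flag any string with more than one partner. This is a minimal sketch of that definition, not the script that produced the numbers here; internal_conflicts is a hypothetical helper name.

```python
from collections import defaultdict


def internal_conflicts(pairs):
    """Return the strings that map to or from more than one other string."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return sorted(s for s, others in partners.items() if len(others) > 1)


print(internal_conflicts([("A", "B"), ("B", "C")]))  # ['B']
print(internal_conflicts([("A", "C"), ("B", "C")]))  # ['C']
```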

I then compared the mappings more systematically against each other. They have a lot of overlap in the sorted mappings, though both have some the other does not. I also looked for mismatches of several types, where the existing mappings don't agree.
 * singleton mismatches occur when each has exactly one mapping for a character and they disagree; these are clear disagreements (though not necessarily errors, see below)
 * complex mismatches occur when there are internal conflicts, and they aren't the same between mappings. For example, {A:B, A:C, A:D, A:E, A:F} vs {A:C, A:D, A:E, A:F}; in this case there's lots of overlap, but they don't match exactly. I didn't dig into the details of these kinds of mismatches.

STConvert vs ZhConversion mappings:
 * merged unique sorted mappings: 17891
 * 9590 unique sorted mappings occur in both
 * STConvert has 2001 unique sorted mappings ZhConversion does not
 * ZhConversion has 4291 unique sorted mappings STConvert does not
 * Mismatches
   * singleton mismatches: 191
   * complex mismatches: 1692

Both singleton and complex mismatches might not be errors. These kinds of mappings are typically driven by examples, and will never be complete (there's always new vocabulary being created, and weird exceptions are the rule with human language), so one project may have had a request to add A:C, while another had a request to add B:C, but neither is wrong (and this is even more likely if A and B are Traditional and C is Simplified).
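The cross-comparison can be approximated with set operations over the two normalized lists. The sketch below reflects my reading of the singleton and complex mismatch definitions (a string present in both lists whose partner sets differ), which is an assumption and may not match the original script in every edge case; compare_mappings and partner_index are invented names.

```python
from collections import defaultdict


def partner_index(pairs):
    """Map each string to the set of strings it is paired with."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def compare_mappings(st_pairs, zh_pairs):
    st, zh = set(st_pairs), set(zh_pairs)
    st_idx, zh_idx = partner_index(st), partner_index(zh)
    singleton, complex_ = [], []
    for s in sorted(set(st_idx) & set(zh_idx)):
        if st_idx[s] == zh_idx[s]:
            continue  # both lists agree about this string
        if len(st_idx[s]) == 1 and len(zh_idx[s]) == 1:
            singleton.append(s)  # one mapping each, and they disagree
        else:
            complex_.append(s)   # internal conflicts that don't line up
    return {
        "merged": len(st | zh),
        "both": len(st & zh),
        "st_only": len(st - zh),
        "zh_only": len(zh - st),
        "singleton": singleton,
        "complex": complex_,
    }


st_pairs = [("A", "B"), ("X", "Y"), ("P", "Q")]
zh_pairs = [("A", "C"), ("X", "Y"), ("P", "Q"), ("P", "R")]
result = compare_mappings(st_pairs, zh_pairs)
print(result)
```

In this toy input, A is a singleton mismatch (A:B vs A:C) and P is a complex mismatch ({Q} vs {Q, R}).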

Another important thing to note is that the distribution of Chinese characters is far from even. English Wikipedia says that educated Chinese know about 4,000 characters. A BBC guide to Chinese says that while there are over 50,000 characters, comprehensive dictionaries list only 20,000, educated speakers know 8,000, but only 2,000-3,000 are needed to read a newspaper.

In general, this gives me confidence that both systems are on par with each other in terms of the most common characters.

There is still a potential problem, given the obscure completeness often found in Wikipedia, that for two underlying forms of an uncommon word, one Simplified and one Traditional, the display forms (driven by ZhConversion) could be the same, while the indexed forms (driven by STConvert) could be different. This would be particularly confusing for most users because the two forms would look the same on the screen, but searching for either would only find one of the forms. The reverse situation, where STConvert merges them in the index but ZhConversion fails to render them the same on the screen might actually look "smart", because while they look different, search can still find both!

Based on the numbers above, I think the likelihood of this ever happening is almost 100% (I'm pretty sure I could engineer an example by trawling through the tables), but the likelihood of it happening often in actual real-world use is very small. I will try to get a quantitative answer by getting some Chinese Wikipedia text and running both converters in both directions and seeing how often they disagree.

If there is a significant mismatch, we could either share data across the two platforms (I'm not 100% sure how that works, from a licensing standpoint) or we could fork STConvert and convert the ZhConversion data into its format. There's a small chance of some remaining incompatibilities based on implementation differences, but it would be easy to re-test, and I'd expect such differences to be significantly smaller than what we have now with independent parallel development.

STConvert vs ZhConversion: Real-World Performance
I extracted 10,000 random articles from Chinese Wikipedia, and deduped lines (to eliminate the equivalent of "See Also", "References", and other oft-repeated elements). The larger sample turned out to be somewhat unwieldy to work with, so I took a smaller one-tenth sample of the deduped data, which is approximately 1,000 articles' worth.
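The de-duping and down-sampling could look something like the sketch below. How the one-tenth sample was actually drawn isn't stated, so the random sampling of lines here (and the fixed seed) is purely an assumption for illustration.

```python
import random


def dedupe_lines(lines):
    """Keep the first occurrence of each line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def sample_tenth(lines, seed=0):
    """Draw roughly one tenth of the lines at random (assumed method)."""
    rng = random.Random(seed)
    k = max(1, len(lines) // 10)
    return rng.sample(lines, k)


lines = ["See Also", "text one", "See Also", "text two", "References"]
unique = dedupe_lines(lines)
print(unique)
print(sample_tenth(unique))
```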

The smaller sample had 577,127 characters in it (6,257 unique characters); of those 401,449 were CJK characters (5,722 unique CJK characters). I ran this sample through each of STConvert and ZhConversion to convert the text to Simplified (T2S) characters.

STConvert changed 72,555 (12.6%) of the characters and ZhConversion changed 74,518 (12.9%) of the characters.

Note that the sample is a mix of Simplified and Traditional characters, and that some characters are the same for both. I also ran the same process using STConvert to convert Simplified to Traditional characters (S2T), which resulted in 11.8% of characters being changed. Between the S2T and T2S outputs from STConvert, 24.3% changed (i.e., ~12.6% + ~11.8% with rounding error). So, roughly, it looks like ~3/4 of characters on Chinese Wikipedia are the same after either Simplified or Traditional conversion, and ~1/8 each are Simplified-only or Traditional-only characters that can be converted to the other.

I did a character-by-character diff of their outputs against each other, and 2,627 characters were changed from the ZhConversion output, and 2,841 were changed from the STConvert output (0.46%-0.49%). The differences cover around 80 distinct characters, but by far the largest source of differences is in quotation marks, with some problems for high surrogate characters.
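Counting how many characters a converter changed can be sketched as a position-by-position comparison. This minimal version assumes the input and output have the same length, which will not hold when a converter escapes or expands characters (as with the backslashes discussed below); in that case an alignment-based diff such as Python's difflib would be needed instead. The function name changed_chars is invented for the example.

```python
from collections import Counter


def changed_chars(before, after):
    """Count position-by-position character changes between two strings."""
    if len(before) != len(after):
        raise ValueError("this sketch assumes equal-length strings")
    changes = Counter((b, a) for b, a in zip(before, after) if b != a)
    return sum(changes.values()), changes


traditional = "歐洲冠軍聯賽決賽"
simplified = "欧洲冠军联赛决赛"
total, by_pair = changed_chars(traditional, simplified)
print(total, len(traditional))  # 6 8
print(by_pair.most_common(1))   # [(('賽', '赛'), 2)]
```

Dividing the total by the length of the input gives the percentage of characters changed.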

Of the 2627 diffs from ZhConversion:
 * 1920 are “ and ”
 * 164 are ‘ and ’
 * The rest are CJK characters.

Of the 2841 diffs from STConvert:
 * 1916 are 「 and 」
 * 164 are 『 and 』
 * 4 are ｢ and ｣
 * 159 are \
   * 145 of which precede "
   * 12 of which precede u (and are part of \u Unicode encodings, like \uD867, which is an invalid "high surrogate" character)
 * The rest are CJK characters.

Discounting the 2084 quotes from the 2627 ZhConversion diffs leaves 543 characters.

Discounting the 2084 quotes and 159 backslashes, and reducing the 60 characters of Unicode encodings to 12 individual characters, the 2841 STConvert diffs leave 550 characters. (I'm not sure what's caused the 543 vs 550 character difference.)

So, other than quotes, ZhConversion and STConvert disagree on only 0.094% - 0.095% (less than a tenth of a percent) of the more than half a million characters in this sample.
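The arithmetic behind these figures can be re-checked in a few lines, using only the counts reported above:

```python
total_chars = 577_127   # characters in the smaller sample
quotes = 2084           # quotation-mark diffs on each side

# ZhConversion: everything that is not a quote
zh_rest = 2627 - quotes
# STConvert: drop quotes and backslashes, collapse 60 encoding chars to 12
st_rest = 2841 - quotes - 159 - 60 + 12

print(zh_rest, st_rest)                  # 543 550
print(f"{zh_rest / total_chars:.3%}")    # 0.094%
print(f"{st_rest / total_chars:.3%}")    # 0.095%
```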

387 (~71%) of the character differences are accounted for by the 11 most common characters (those with >10 occurrences). It's easy to line up most of these characters, since the counts are the same. I've lined them up and looked at English Wiktionary to see which is the likely correct form.

So the bulk of the difference is quotation marks (which we should be able to fix if needed), leaving about 0.1% disagreement on the rest, with attributable token-level errors roughly evenly divided between the two converters.

That seems close enough to work with!

SIGHAN Segmentation Analysis Framework
I was able to download and run the SIGHAN Second International Chinese Word Segmentation Bakeoff framework and tests after a little bit of format munging (they were in a Windows UTF-8 format, I'm running OSX/Linux).

There are four corpora, two Traditional and two Simplified, each containing at least a million characters and a million words, with the largest at more than 8M characters and 5M words. The scoring script provided generates a lot of stats, but I'll be looking only at recall, precision, and F1 scores, primarily F1.
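For reference, word-level recall, precision, and F1 for a segmenter are usually computed by comparing character spans of the predicted words against the gold-standard words. The sketch below shows that standard definition; it is not the SIGHAN scoring script, which may differ in details, and the function names are invented.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def score(gold_words, predicted_words):
    """Return (recall, precision, F1) for one segmented string."""
    gold, pred = word_spans(gold_words), word_spans(predicted_words)
    correct = len(gold & pred)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1


gold = ["中华人民共和国", "国歌"]
predicted = ["中华", "人民", "共和国", "国歌"]
recall, precision, f1 = score(gold, predicted)
print(recall, precision, round(f1, 3))  # 0.5 0.25 0.333
```

Here only 国歌 matches a gold word, so one of two gold words is recovered (recall 0.5) and one of four predicted words is right (precision 0.25).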

Segmenters
I looked at the three segmenters/tokenizers I found. All three are available as Elasticsearch plugins. I'm also going to compare them to the current prod config (icu_normalizer and icu_tokenizer) and the "standard" Elasticsearch tokenizer.
 * SmartCN: supported by Elastic, previously considered not good enough. It is designed to work on Simplified characters, though that isn't well documented. It does not appear to be patchable/updatable. An old Lucene Issue explains why and nothing there seems to have changed. Since this is recommended and supported by Elasticsearch, I don't think it needs a code review.
 * IK: recommended by Elasticsearch, and up-to-date. I verified with the developer (Medcl) of the ES plugin/wrapper that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.
 * MMSEG: also recommended by Elasticsearch, and up-to-date. The same developer (Medcl) wrote the ES plugin/wrapper for this as for IK. He verified that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.

The plugin developer recommended IK over MMSEG, but I planned to test both. However, I couldn't get MMSEG to install. After the results of the performance analysis, I don't think it's worth spending too much effort to get it to work.

Max Word
Both IK and MMSEG have an interesting feature, "Max Word", which provides multiple overlapping segmentations for a given string. Given 中华人民共和国国歌 ("National Anthem of the People's Republic of China"), the ik_smart tokenizer splits it in two chunks: 中华人民共和国 ("People's Republic of China") and 国歌 ("National Anthem"). The ik_max_word tokenizer provides many additional overlapping segmentations. Rough translations (based on English Wiktionary) are provided, but don't rely on them being particularly accurate. Tokens are listed in order of their starting position in the input.
 * 中华人民共和国国歌: input, for reference
 * 中华人民共和国: People's Republic of China
 * 中华人民: Chinese People
 * 中华: China
 * 华人: Chinese
 * 人民共和国: People's Republic
 * 人民: people
 * 人: person
 * 民: people
 * 共和国: republic
 * 共和: republicanism
 * 和: and
 * 国国: country
 * 国歌: national anthem

The SIGHAN segmentation testing framework doesn't support this kind of multiple segmentation, so I can't easily evaluate it that way. It's an interesting idea that would increase recall, but might decrease precision. Given the generally better recall of SmartCN (see below), I don't think we need to worry about this, but it's an interesting idea to keep in mind.
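To make the idea of overlapping segmentation concrete, here is a toy sketch that emits every dictionary word found at every position of the input. It is not IK's actual algorithm, and the tiny dictionary is simply the set of tokens shown above, so it only illustrates why such output can raise recall while adding extra, possibly spurious, tokens.

```python
def max_word_tokens(text, dictionary):
    """Emit every dictionary word found in text, as (start, word) pairs.

    Toy illustration only: ordered by start offset, longest match first.
    """
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append((start, text[start:end]))
    return tokens


# Hypothetical dictionary built from the tokens listed above.
dictionary = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}
tokens = max_word_tokens("中华人民共和国国歌", dictionary)
for start, word in tokens:
    print(start, word)
```

A smart-style segmenter would instead pick one non-overlapping path through these candidates, such as 中华人民共和国 followed by 国歌.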

Segmenter Performance
I ran eight configs against the four SIGHAN segmentation test corpora. The eight configs are:
 * The prod Chinese Wikipedia config (ICU tokenizer and ICU Normalizer)
 * The ICU tokenizer with STConvert
 * The Elasticsearch "standard" tokenizer
 * The Elasticsearch "standard" tokenizer, with STConvert (T2S)
 * The SmartCN tokenizer
 * The SmartCN tokenizer, with STConvert (T2S)
 * The IK tokenizer
 * The IK tokenizer, with STConvert (T2S)

For comparison, the F1 scores for all eight configs on all four corpora are below (max scores per corpus are bolded):

The short version is that SmartCN+STConvert did the best on all the corpora, and IK did the worst. If IK is supposed to be better than MMSEG, then it doesn't matter much that we didn't test MMSEG.

It also turns out that "standard" and prod configs (with or without STConvert applied) were exactly the same—not just the recall, precision, and F1 scores, but also the raw tokenization output, so I'm dropping "standard" configs from the discussion.

The recall, precision, and F1 score details for the remaining six configs are below:

STConvert makes no difference for the Simplified corpora (MSR and PKU), which is a good sign, as it indicates that STConvert isn't doing anything to Simplified characters that it shouldn't.

STConvert makes a big difference on the Traditional corpora for the Simplified-only segmenters, improving IK's recall by a bit more than 15%, precision by a bit more than 30%, and F1 by about 24%. For SmartCN, recall improves 20-23%, precision about 30%, and F1 26-28%, explaining at least part of the local lore that SmartCN doesn't work so well; without T2S conversion, SmartCN is going to do particularly badly on the ~1/8 of Chinese Wikipedia text that is in strictly Traditional characters.

Oddly, STConvert actually hurts performance on the Traditional corpora for the ICU tokenizer, indicating that either ICU normalization is better than STConvert, or STConvert errors ripple into the ICU tokenizer. To test this, I ran SmartCN with the ICU normalizer, and the results were better by up to 0.2% for recall, precision, and F1—indicating that the Traditional tokenizer magic in prod is happening in the ICU Tokenizer, not the ICU Normalizer.

The IK tokenizer, even with STConvert, is clearly the worst of the bunch, so we can drop it from consideration.

SmartCN does significantly better on recall on all corpora, and significantly better on precision for the Simplified corpora (indicating that, not surprisingly, imperfections in T2S conversion have knock-on effects on tokenization). SmartCN+STConvert does a bit worse (~2%) than prod at precision for the AS corpus, but makes up for it with bigger gains in recall.

Unfortunately, neither SmartCN nor prod does nearly as well as the original participants in the SIGHAN Segmentation Bakeoff, where the top performer on each corpus had 94%-98% for recall, precision, and F1. SmartCN would rank near the bottom for each corpus.

Given the surprisingly decent performance of the prod/ICU tokenizer on the SIGHAN test set, I think a proper analysis of the effect on the tokens generated is in order. I expect to see a lot of conflation of Traditional and Simplified variants, but I'm also concerned about the treatment of non-CJK characters. For example, we may need to include both STConvert and ICU_Normalization so that wide characters and other non-ASCII Unicode characters continue to be handled properly.

Coming Up
Analysis analysis, highlight testing, quote testing, and summary, recommendations, & plans.