User:TJones (WMF)/Notes/Chinese Analyzer Analysis

February/March 2017 — See TJones_(WMF)/Notes for other projects. See also T158203.

Test/Analysis Plan
All of the Chinese segmentation candidates I've found to date (See T158202) expect Simplified Chinese characters as input. Chinese Wikipedia supports both Traditional and Simplified characters, and converts them at display time according to user preferences. There is an Elasticsearch plugin (STConvert) that converts Traditional to Simplified (T2S) or vice versa (S2T). I suggest trying to set up an analysis chain using that, and segmenting and indexing everything as Simplified.
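
As a rough sketch (expressed as a Python settings dict for the Elasticsearch client), such a chain might look like the following. The "stconvert" char_filter type and its "convert_type" parameter are taken from the STConvert plugin's README; "smartcn_tokenizer" is just a placeholder for whichever segmenter wins the evaluation, and all names should be verified against the installed plugin versions.

```python
# Hypothetical index settings for "convert to Simplified, then segment".
# The stconvert char_filter and convert_type parameter follow the STConvert
# plugin README; smartcn_tokenizer is a placeholder for whichever segmenter
# wins the evaluation.
settings = {
    "analysis": {
        "char_filter": {
            "t2s_convert": {
                "type": "stconvert",    # STConvert plugin char filter
                "convert_type": "t2s",  # Traditional -> Simplified
            }
        },
        "analyzer": {
            "chinese_t2s": {
                "type": "custom",
                "char_filter": ["t2s_convert"],  # T2S before tokenizing
                "tokenizer": "smartcn_tokenizer",
            }
        },
    }
}
```

The same analyzer would be applied at index and search time, so Traditional, Simplified, and mixed queries would all hit the same Simplified tokens.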

There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers. The developer of the wrappers for IK and MMSEG recommends IK for segmenting, but I plan to test both.

So, my evaluation plan is to first compare the content and output of STConvert with MediaWiki ZhConversion.php to make sure they do not differ wildly, and maybe offer some cross-pollination to bring them more in line with each other if that seems profitable. (I've pinged the legal team about licenses, data, etc.)

I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.

I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being easily patched might be more valuable than 0.02% better accuracy, for example.

It also makes sense to compare these segmenters to the baseline performance in prod (using icu_normalizer and icu_tokenizer), and the standard tokenizer.

We'll also test the highlighting for cross-character type (Traditional/Simplified/mixed) queries.

In parallel, I'll try to talk @dcausse into reviewing the code as needed to see if anything sticks out as particularly unmaintainable.

An Example
For reference in the discussion outline below, here's an example.

Right now (Feb 27, 2017), searching for Traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for Simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results—so they have some overlapping results.

Searching for the mixed T/S query (the last two characters, meaning "finals", are Traditional, the rest is Simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of Traditional and Simplified characters that are possible for this query.

Initial Draft Proposal
Once we have all the necessary tools available, we have to figure out how best to deploy them.

The current draft proposal, after discussing what's possible with the Elasticsearch inner workings with David & Erik, is: The working assumption here is that whether a typical searcher searches for Simplified 年欧洲冠军联赛决赛, Traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the underlying characters are Traditional or Simplified. When they use quotes, they want those words as a phrase.
 * Convert everything to Simplified characters for indexing and use a segmenter to break the text into words, in both the text and plain fields. Do the same for normal queries at search time.
 * Index the text as is in the source plain field, and use a unigram segmenter.

For advanced searchers or editors who want to find specific characters (e.g., to distinguish the simple, Traditional, and mixed examples above), insource: would provide that ability.

We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.

STConvert
STConvert is an Elasticsearch plugin that converts between Traditional and Simplified Chinese characters. It includes an analyzer, tokenizer, token-filter, and char-filter. It has releases that are compatible with many versions of ES, including ES 5.1.2 and ES 5.2.1 (and ES 2.3.5, which I'm currently running in vagrant).

STConvert vs ZhConversion: Mappings
ZhConversion.php is a MediaWiki PHP module that also converts between Traditional and Simplified Chinese characters. It is used to convert Chinese wiki projects at display time.

In order to get a rough estimate of the coverage and likely comparative accuracy of the two, I compared the data they use to do their respective conversions. The ZhConversion data is in a PHP array and the STConvert data file is a colon-separated text file. WARNING—both of those links go to large, slow-loading files that may crash your browser.

In each case, I normalized the list of mappings by removing the syntactic bits of the file and converting to a tab-separated format. For each mapping, I sorted the elements being mapped (so A:B and B:A would both become A:B), and sorted and de-duped the lists.
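
A minimal sketch of that normalization step in Python (file parsing elided; `raw_pairs` is a hypothetical stand-in for the pairs extracted from the PHP array or the colon-separated file):

```python
# Sketch of the mapping normalization described above: canonicalize pair
# order (so A:B and B:A compare equal), then sort and de-dupe.
def normalize(raw_pairs):
    canonical = set()
    for a, b in raw_pairs:
        # sorted() makes A:B and B:A the same pair; the set de-dupes
        canonical.add(tuple(sorted((a, b))))
    return sorted(canonical)

# Toy example: the first two pairs collapse into one sorted mapping
pairs = [("乾", "干"), ("干", "乾"), ("幹", "干")]
print(normalize(pairs))  # [('乾', '干'), ('干', '幹')]
```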

I did a diff of the files and took a look. There were no obvious wild inconsistencies. I also did a more careful automated review. The results are below.

For these first two sets of stats:
 * raw mappings are the number of mappings present in the original PHP or text file
 * unique sorted mappings are what's left after re-ordering and de-duping
 * internal conflicts are the number of character strings that map to or from more than one thing, so having A:B and B:C would be an internal conflict for B.

STConvert (Elasticsearch):
 * raw mappings: 11708
 * unique sorted mappings: 11591
 * internal conflicts: 192

ZhConversion (MediaWiki):
 * raw mappings: 20381
 * unique sorted mappings: 15890
 * internal conflicts: 2748

The duplicates and conflicts are not surprising. This kind of info is gathered from many sources, so duplicates and even conflicts are likely. "Conflicts" and duplicates are not necessarily incorrect, either, since some mappings are many-to-one. If A and B both map to C in one direction, but C preferentially maps back to A in the other direction, you get both a "conflict" and a sorted duplicate: {A:C, B:C, C:A} → {A:C, B:C, A:C}, so A:C is a dupe and C has a conflict.
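
As a sketch, the internal-conflict count can be computed by flagging any character string that participates in more than one distinct sorted mapping (`internal_conflicts` is my own illustrative helper, not part of either project):

```python
# Sketch of the "internal conflicts" count: a character string conflicts
# if it appears on either side of more than one distinct sorted mapping
# (e.g., A:B and B:C both involve B).
from collections import defaultdict

def internal_conflicts(sorted_mappings):
    seen = defaultdict(set)
    for a, b in sorted_mappings:
        seen[a].add((a, b))
        seen[b].add((a, b))
    return [c for c, maps in seen.items() if len(maps) > 1]

mappings = [("A", "B"), ("B", "C"), ("D", "E")]
print(internal_conflicts(mappings))  # ['B']
```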

I then compared the mappings more systematically against each other. They have a lot of overlap in the sorted mappings, though each has some mappings the other does not. I also looked for mismatches of several types, where the existing mappings don't agree.
 * singleton mismatches occur when each has exactly one mapping for a character and they disagree; these are clear disagreements (though not necessarily errors, see below)
 * complex mismatches occur when there are internal conflicts, and they aren't the same between mappings. For example, {A:B, A:C, A:D, A:E, A:F} vs {A:C, A:D, A:E, A:F}: there's lots of overlap, but they don't match exactly. I didn't dig into the details of these kinds of mismatches.
 * merged unique sorted mappings: 17891
 * 9590 unique sorted mappings occur in both
 * STConvert has 2001 unique sorted mappings ZhConversion does not
 * ZhConversion has 4291 unique sorted mappings STConvert does not
 * Mismatches
 * singleton mismatches: 191
 * complex mismatches: 1692

Both singleton and complex mismatches might not be errors. These kinds of mappings are typically driven by examples, and will never be complete (there's always new vocabulary being created, and weird exceptions are the rule with human language), so one project may have had a request to add A:C, while another had a request to add B:C, but neither is wrong (and this is even more likely if A and B are Traditional and C is Simplified).

Another important thing to note is that the distribution of Chinese characters is far from even. English Wikipedia says that educated Chinese know about 4,000 characters. A BBC guide to Chinese says that while there are over 50,000 characters, comprehensive dictionaries list only 20,000, educated speakers know 8,000, but only 2,000-3,000 are needed to read a newspaper.

In general, this gives me confidence that both systems are on par with each other in terms of the most common characters.

There is still a potential problem, given the obscure completeness often found in Wikipedia, that for two underlying forms of an uncommon word, one Simplified and one Traditional, the display forms (driven by ZhConversion) could be the same, while the indexed forms (driven by STConvert) could be different. This would be particularly confusing for most users because the two forms would look the same on the screen, but searching for either would only find one of the forms. The reverse situation, where STConvert merges them in the index but ZhConversion fails to render them the same on the screen might actually look "smart", because while they look different, search can still find both!

Based on the numbers above, I think the likelihood of this ever happening is almost 100% (I'm pretty sure I could engineer an example by trawling through the tables), but the likelihood of it happening often in actual real-world use is very small. I will try to get a quantitative answer by getting some Chinese Wikipedia text and running both converters in both directions and seeing how often they disagree.

If there is a significant mismatch, we could either share data across the two platforms (I'm not 100% sure how that works, from a licensing standpoint) or we could fork STConvert and convert the ZhConversion data into its format. There's a small chance of some remaining incompatibilities based on implementation differences, but it would be easy to re-test, and I'd expect such differences to be significantly smaller than what we have now with independent parallel development.

STConvert vs ZhConversion: Real-World Performance
I extracted 10,000 random articles from Chinese Wikipedia, and deduped lines (to eliminate the equivalent of "See Also", "References", and other oft-repeated elements). The larger sample turned out to be somewhat unwieldy to work with, so I took a smaller one-tenth sample of the deduped data, which is approximately 1,000 articles' worth.

The smaller sample had 577,127 characters in it (6,257 unique characters); of those 401,449 were CJK characters (5,722 unique CJK characters). I ran this sample through each of STConvert and ZhConversion to convert the text to Simplified (T2S) characters.

STConvert changed 72,555 (12.6%) of the characters and ZhConversion changed 74,518 (12.9%) of the characters. Note that the sample is a mix of Simplified and Traditional characters, and that some characters are the same in both. I also ran the same process using STConvert to convert Simplified to Traditional characters (S2T), which resulted in 11.8% of characters being changed. Between the S2T and T2S outputs from STConvert, 24.3% changed (i.e., ~12.6% + ~11.8% with rounding error). So, roughly, it looks like ~3/4 of characters on Chinese Wikipedia are the same after either Simplified or Traditional conversion, and ~1/8 are either Simplified or Traditional characters that can be converted to the other.

I did a character-by-character diff of their outputs against each other: 2,627 characters were changed from the ZhConversion output, and 2,841 were changed from the STConvert output (0.46%-0.49%). The differences cover around 80 distinct characters, but by far the largest source of differences is in quotation marks, with some problems for high surrogate characters.
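
The character-by-character diff can be sketched with Python's difflib, aligning the two converters' outputs and counting unmatched characters on each side (a toy example, not the actual comparison script):

```python
# Sketch of the character-by-character comparison of the two converters'
# outputs: align them with difflib and count characters that differ.
import difflib

def changed_chars(a, b):
    """Count characters of `a` not matched in `b`, and vice versa."""
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    from_a = from_b = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            from_a += i2 - i1  # chars in a not matched in b
            from_b += j2 - j1  # chars in b not matched in a
    return from_a, from_b

# Toy example: the two outputs disagree only on the quote style
print(changed_chars("他说“你好”", "他说「你好」"))  # (2, 2)
```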

Of the 2627 diffs from ZhConversion:
 * 1920 are “ and ”
 * 164 are ‘ and ’
 * The rest are CJK characters.

Of the 2841 diffs from STConvert:
 * 1916 are 「 and 」
 * 164 are 『 and 』
 * 4 are ｢ and ｣
 * 159 are \
 * 145 of which precede "
 * 12 of which precede u (and are part of \u Unicode encodings, like \uD867, which is an invalid "high surrogate" character)
 * The rest are CJK characters.

Discounting the 2084 quotes from the 2627 ZhConversion diffs leaves 543 characters.

Discounting the 2084 quotes and 159 slashes, and reducing the 60 characters of \u Unicode encodings to 12 individual characters, the 2841 STConvert diffs also leave 550 characters. (I'm not sure what caused the 543 vs 550 character difference.)

So, other than quotes, ZhConversion and STConvert disagree on only 0.094% - 0.095% (less than a tenth of a percent) of the more than half a million characters in this sample.

387 (~71%) of the character differences are accounted for by the 11 most common characters (those with >10 occurrences). It's easy to line up most of these characters, since the counts are the same. I've lined them up and looked at English Wiktionary to see which is the likely correct form.

So the bulk of the difference is quotation marks (which we should be able to fix if needed), leaving about 0.1% disagreement on the rest, with attributable token-level errors roughly evenly divided between the two converters.

That seems close enough to work with!

SIGHAN Segmentation Analysis Framework
I was able to download and run the SIGHAN Second International Chinese Word Segmentation Bakeoff framework and tests after a little bit of format munging (they were in a Windows UTF-8 format, I'm running OSX/Linux).

There are four corpora, two Traditional and two Simplified, each containing at least a million characters and a million words, with the largest at more than 8M characters and 5M words. The scoring script provided generates a lot of stats, but I'll be looking only at recall, precision, and F1 scores, primarily F1.
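
For reference, here is a sketch of how segmentation recall, precision, and F1 are typically computed (roughly what the SIGHAN scoring script does): each segmentation is treated as a set of word spans over the raw text and compared to the gold standard. The function names are my own.

```python
# Sketch of word-segmentation scoring: convert each segmentation into a
# set of (start, end) word spans over the unsegmented text, then compare
# against the gold-standard spans.
def spans(words):
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, test_words):
    gold, test = spans(gold_words), spans(test_words)
    correct = len(gold & test)
    p = correct / len(test)   # precision: correct / words proposed
    r = correct / len(gold)   # recall: correct / words in gold
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["中华人民共和国", "国歌"]          # gold standard: 2 words
test = ["中华", "人民", "共和国", "国歌"]  # segmenter output: 4 words
p, r, f1 = prf(gold, test)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.25 0.5 0.33
```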

Segmenters
I looked at the three segmenters/tokenizers I found. All three are available as Elasticsearch plugins. I'm also going to compare them to the current prod config (icu_normalizer and icu_tokenizer) and the "standard" Elasticsearch tokenizer.
 * SmartCN—supported by Elastic, previously considered not good enough. It is designed to work on Simplified characters, though that isn't well documented. It does not appear to be patchable/updatable. An old Lucene Issue explains why and nothing there seems to have changed. Since this is recommended and supported by Elasticsearch, I don't think it needs a code review.
 * IK—recommended by Elasticsearch, and up-to-date. I verified with the developer (Medcl) of the ES plugin/wrapper that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.
 * MMSEG—also recommended by Elasticsearch, and up-to-date. The same developer (Medcl) wrote the ES plugin/wrapper for this as for IK. He verified that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.

The plugin developer recommended IK over MMSEG, but I planned to test both. However, I couldn't get MMSEG to install. After the results of the performance analysis, I don't think it's worth spending too much effort to get it to work.

Max Word
Both IK and MMSEG have an interesting feature, "Max Word", which provides multiple overlapping segmentations for a given string. Given 中华人民共和国国歌 ("National Anthem of the People's Republic of China"), the ik_smart tokenizer splits it into two chunks: 中华人民共和国 ("People's Republic of China") and 国歌 ("National Anthem"). The ik_max_word tokenizer provides many additional overlapping segmentations. Rough translations (based on English Wiktionary) are provided, but don't rely on them being particularly accurate.
 * 中华人民共和国国歌 (input, for reference)
 * 中华人民共和国 ("People's Republic of China")
 * 中华人民 ("Chinese People")
 * 中华 ("China")
 * 华人 ("Chinese")
 * 人民共和国 ("People's Republic")
 * 人民 ("people")
 * 人 ("person")
 * 民 ("people")
 * 共和国 ("republic")
 * 共和 ("republicanism")
 * 和 ("and")
 * 国国 ("country")
 * 国歌 ("national anthem")

The SIGHAN segmentation testing framework doesn't support this kind of multiple segmentation, so I can't easily evaluate it that way. It's an interesting idea that would increase recall, but might decrease precision. Given the generally better recall of SmartCN (see below), I don't think we need to worry about this, but it's an interesting idea to keep in mind.

Segmenter Performance
I ran seven configs against the four SIGHAN segmentation test corpora. The seven configs are:
 * The prod Chinese Wikipedia config (ICU tokenizer and ICU Normalizer)
 * The ICU tokenizer with STConvert
 * The Elasticsearch "standard" analyzer (which tokenizes each CJK character separately)
 * The SmartCN tokenizer
 * The SmartCN tokenizer, with STConvert (T2S)
 * The IK tokenizer
 * The IK tokenizer, with STConvert (T2S)

For comparison, the F1 scores for all seven configs on all four corpora are below (max scores per corpus are bolded). The short version is that SmartCN+STConvert did the best on all the corpora, and IK did the worst. If IK is supposed to be better than MMSEG, then it doesn't matter much that we didn't test MMSEG.

(Note: I previously had eight configs. There is a "standard" analyzer, and a "standard" tokenizer. I confused the two and caused myself some problems, but it's all sorted now. We're looking at the standard analyzer here.)

The Standard analyzer just splits every CJK character into its own token. It's very interesting to see how badly that does—and it represents a sort of baseline for how bad performance can be. Anything worse than that is going out of its way to make mistakes!

The recall, precision, and F1 score details for the remaining six configs are below:

STConvert makes no difference for the Simplified corpora (MSR and PKU), which is a good sign, as it indicates that STConvert isn't doing anything to Simplified characters that it shouldn't.

STConvert makes a big difference on the Traditional corpora for the Simplified-only segmenters, improving IK's recall by a bit more than 15%, precision by a bit more than 30%, and F1 by about 24%. For SmartCN, recall improves 20-23%, precision about 30%, and F1 26-28%, explaining at least part of the local lore that SmartCN doesn't work so well; without T2S conversion, SmartCN is going to do particularly badly on the ~1/8 of Chinese Wikipedia text that is in strictly Traditional characters.

Oddly, STConvert actually hurts performance on the Traditional corpora for the ICU tokenizer, indicating that either ICU normalization is better than STConvert, or STConvert errors ripple into the ICU tokenizer. To test this, I ran SmartCN with the ICU normalizer, and the results were better by up to 0.2% for recall, precision, and F1—indicating that the Traditional tokenizer magic in prod is happening in the ICU Tokenizer, not the ICU Normalizer.

The IK tokenizer, even with STConvert, is clearly the worst of the bunch, so we can drop it from consideration.

SmartCN does significantly better on recall on all corpora, and significantly better on precision for the Simplified corpora (indicating that, not surprisingly, imperfections in T2S conversion have knock-on effects on tokenization). SmartCN+STConvert does a bit worse (~2%) than prod at precision for the AS corpus, but makes up for it with bigger gains in recall.

Unfortunately, neither SmartCN nor prod do nearly as well as the original participants in the SIGHAN Segmentation Bakeoff, where the top performer on each corpus had 94%-98% for recall, precision, and F1. SmartCN would rank near the bottom for each corpus.

Given the surprisingly decent performance of the prod/ICU tokenizer on the SIGHAN test set, I think a proper analysis of the effect on the tokens generated is in order. I expect to see a lot of conflation of Traditional and Simplified variants, but I'm also concerned about the treatment of non-CJK characters. For example, we may need to include both STConvert and ICU_Normalization so that wide characters and other non-ASCII Unicode characters continue to be handled properly.

Analysis Chain Analysis
Since the current production config includes the icu_normalizer, I was worried that SmartCN + STConvert would lose some folding that is currently happening, so I ran the normal analysis chain analysis that I do for smaller analysis chain tweaks. As expected, some folding was lost, but I also found some really weird behavior.

A Buggy Interlude
STConvert has an oddly formatted rule in its conversion chart. The rest are generally AB:CD, where A, B, C, and D are CJK characters. There's one rule, "恭弘=叶 恭弘:叶", that seems to get parsed funny, and the result is that "恭弘" gets converted to "叶 叶 恭弘:叶:叶". That's survivably bad, but what is much worse is that the token start/end annotations associated with it are incorrect, and it messes up the character counts for everything that comes after it. I caught it because "年7月2" was tokenized to "2011"—I don't know much Chinese (I know the characters for numbers up to three!), but clearly that looks wrong.

I found another problem: the rule mapping "儸" to "㑩" includes a zero-width no-break space, which is hard to see unless your editor happens to want to show it to you.

I've filed an issue on the project on GitHub, and the developer says he'll fix it. There are several ways to proceed. I like the character filter hack the best, and it's what I'm using for the rest of my testing.
 * "恭弘" only occurs 37 times in Chinese Wikipedia, and "儸" only 44—out of 900K+ articles, so just ignore the problem. Pros: easy; future compatible; will naturally resolve itself if a future version of STConvert fixes the problem. Cons: breaks indexing for everything after "恭弘" on a given line of text.
 * Don't deploy STConvert until it's fixed. Pros: relatively easy, future compatible. Cons: will not resolve itself—we have to remember to track it and deploy it later; substantially worse tokenization until we deploy it.
 * Patch or fork STConvert. (I did this on my local machine to make sure that the one weird rule was the source of the problem.) Pros: relatively easy, best possible tokenization. Cons: not future compatible and will not resolve itself—updates to STConvert without a fix will remove the fix (if patched) or future updates will require re-forking.
 * Hack a character filter to fix the problem. A character filter to map "恭弘" to "恭 弘" solves one problem, and explicitly mapping "儸" to "㑩" solves the other. SmartCN tokenizes "恭弘" as "恭" and "弘", so do it manually before STConvert gets a chance to do anything. Pros: easy; best possible tokenization; 99% future compatible (there's a slight chance that "恭弘" or a longer string containing it should be processed differently). Cons: will not resolve itself (even if it isn't necessary in the future, the code would still be there as legacy cruft—but at least we can comment it thoroughly).
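
For reference, the character filter hack can be sketched as a standard Elasticsearch "mapping" char_filter, written here as a Python settings fragment (the filter name and its placement before STConvert are my own; the "mapping"/"mappings" syntax is standard Elasticsearch):

```python
# Sketch of the character-filter hack: a standard Elasticsearch "mapping"
# char_filter, to be run before STConvert in the analysis chain.
stconvert_fixes = {
    "type": "mapping",
    "mappings": [
        "恭弘 => 恭 弘",  # pre-split so the malformed STConvert rule never fires
        "儸 => 㑩",       # bypass the rule with the hidden zero-width no-break space
    ],
}
```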

I Miss ICU, Like the Deserts Miss the Rain
Wide character variants (e.g., Ｆ or ８０) are still mapped correctly, but other Unicode variants, like ª, Ⓐ, ℎ, ﬁ, ß, and Ⅻ, are not mapped and case-folding of non-ASCII variants (Ů/ů, Ə/ə, ɛ/ℇ) doesn't happen. Another problem is that at least some non-ASCII characters cause tokenization splits: fußball → fu, ß, ball; enɔlɔʒi → en, ɔ, l, ɔ, ʒ, i.

So, I added back the icu_normalizer as another filter before STConvert and the patches for STConvert.

I ran into some additional problems with my analysis program. Some of the Simplified characters STConvert outputs are outside the Basic Multilingual Plane. Elasticsearch seems to do the right thing internally and at display time, but the token output is split into two separate UTF-16 characters (a surrogate pair). This screwed up the alignment in my analysis tool. Another issue is that icu_normalizer converts some single characters into multiple characters, so ⑴ becomes "(1)". Internally, Elasticsearch seems to do the right thing, but the indexes given by the output are into the modified token stream, so my analysis was misaligned. I put in a temporary char filter to map ⑴ to "1" to solve this problem. I found a few more cases—like ℠, ™, ℡, ℅, ½, etc.—but once I eventually found the cause of all the relevant problems, I was able to ignore them in my analysis.

Analysis Results
In order to check some of the new T2S conversions, I implemented a very simple T2S converter using the data from MediaWiki's ZhConversion.php's $zh2Hans array. I sorted the mappings by length and applied longer ones first. I took agreement between STConvert and ZhConversion to be approximately correct, since they were developed independently.
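
That converter can be sketched in a few lines of Python (the three sample mappings are illustrative, not pulled from $zh2Hans):

```python
# Sketch of the simple T2S converter described above: apply the mapping
# table longest-key-first, so multi-character mappings win over their
# single-character pieces.
def t2s(text, table):
    for trad, simp in sorted(table.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(trad, simp)
    return text

# Illustrative mappings (Europe, final(s), and the single character 賽)
table = {"歐洲": "欧洲", "決賽": "决赛", "賽": "赛"}
print(t2s("歐洲決賽", table))  # 欧洲决赛
```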

I took the 99 remaining unexpected CJK collisions and converted them using another online T2S tool. That left 19 unresolved. For those, I looked up the characters in English Wiktionary, and all were identified as related.

I honestly didn't expect it to be that clean!

Most non-CJK collisions were reasonable, or caused by uncaught off-by-one errors in my analysis tool.

One notable difference between the ICU Tokenizer and the SmartCN tokenizer is the way uncommon Unicode characters like ½ are treated. ICU converts it to a single token, "1/2", while SmartCN converts it to three: "1", "/", "2". On the other hand, ICU converts ℀ to "a" and "c", while SmartCN converts it to "a", "/", "c". There's not a ton of consistency and searching for reasonable alternatives often works. (e.g., Rs vs ₨, ℅ vs c/o, ℁ vs a/s—though 1/2 and ½ don't find each other.)

Some stats:
 * Only 3.5% of types in the prod config were merged after analysis (which includes Latin characters that are case folded, for example).
 * The new config has 17.4% fewer types (indicative of better stemming, since there are more multi-character types); 20% of them were merged after analysis (indicative of T2S merging lots of words).
 * About 15.8% of types (~20K) and 27.9% of tokens (~.8M) had new collisions (i.e., words were merged together after analysis)—and that's a ton! However, given the reputation of the current Chinese analysis for making lots of tokenization mistakes, it's good news.

Highlighting
Highlighting works across Traditional/Simplified text as you would expect. Yay!

Quotation Marks and Unicode Characters
All the quote characters (" " — ' ' — “ ” — ‘ ’ — 「 」 — 『 』 — ｢ ｣) behave as expected in both the text and in search.

fußball is indexed as fussball, and searching for either finds either. But it looks like ß is getting special treatment. Other more common accented ASCII is getting split up. lålélîlúløl gets divided into individual letters. A search for åéîúø finds lålélîlúløl. enɔlɔʒi is still split up as en, ɔ, l, ɔ, ʒ, i, though searching for it finds it. Searching for ʒiɔl en also finds it, since it gets split into the same tokens. Using quotes limits the results to the desired ones.

This is less than ideal, but probably something we can live with.

Interesting Examples from Native Speaker Review
Thanks to Chelsy, SFSQ2012, 路过围观的Sakamotosan, and Alan for examples and review!

An interesting problem with segmenting/tokenizing Chinese is that the context of a given word can affect its segmentation. Of course, higher frequency terms have more opportunities for rarer errors.

Another interesting wrinkle is that I set up the RelForge server with the new default (BM25, etc.) instead of the older config used by spaceless languages in production, so the analysis and comments below are comparing production to the new analysis chain, plus BM25 and changes to the way phrases are handled. (In particular, strings without spaces that get split into multiple tokens are treated as if they have quotes around them. In English this would be after-dinner mint being treated as "after dinner" AND mint—though we don't do this in English anymore. For Chinese, it's most queries!)

维基百科/維基百科 ("Wikipedia")
A nice example of a relatively high-frequency term is 维基百科/維基百科 ("Wikipedia"). In the labs index, 维基百科 has 128,189 results, while 維基百科 has 121,281 results. The discrepancy comes not from the segmentation of the queries, but the segmentation of article text. For better or worse, in isolation (e.g., as queries) both—after conversion to Simplified characters—are segmented as 维|基|百科.

Searching for 维基百科 NOT 維基百科 (7,034 results) or 維基百科 NOT 维基百科 (168 results) gives examples where the segmentation within the article goes "wrong". The skew, with the Simplified NOT Traditional query having far more results than the other way around, makes sense (to me), since recognizing Traditional characters requires an extra step of conversion to Simplified, which can also go wrong and screw up everything.

In the case of 维基百科 NOT 維基百科, one of the results is ChemSpider. The phrase containing 维基百科 is: 它通过WiChempedia收录维基百科中的化合物记录. When I run it through the proposed Chinese analysis chain, the segmentation for the relevant bit is 维|基|百|科. The Traditional query 維基百科 is converted and segmented as 维|基|百科, which doesn't match. The segmented Simplified query doesn't match either, but it's possible to get an exact match on "维基百科".

Multiple Languages/Character Sets
Queries with multiple languages/character sets (Chinese/English or Chinese/Japanese) perform differently with the different analyzers. Below are some examples, with some notes. Of note, the default analyzer makes type distinctions (2016 is "NUM", superbowl is "ALPHANUM", CJK characters are "IDEOGRAPHIC"), whereas the new analysis chain does not (everything is "word"). I don't think it matters internally to Elasticsearch.

Did You Mean Suggestions
While it's somewhat outside the scope of this analysis, one reviewer pointed out that the Chinese Did You Mean (DYM) results are not great.

2016 oscar winners gets the suggestion from prod of 2012 oscar winter, and from the new analyzer on RelForge: 2016 star winter.

Another search, "2017英国国会恐怖袭击" ("2017 British Parliament terrorist attacks") gets the suggestion from the new analyzer on RelForge: "2012 英国 国会 恐怖 袭击". Interestingly, the recommendation is tokenized! (We might want to consider removing spaces from between CJK characters in DYM recommendations on Chinese wikis, so the recommendations are more idiomatic.)
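
That cleanup could be sketched as a regex that removes a space only when both neighbors are CJK characters (the basic-plane CJK range shown is a simplification; a production version would need to cover the extension blocks too):

```python
# Sketch of de-spacing a DYM suggestion: strip spaces only where both
# neighbors are CJK characters, leaving Latin/digit tokens intact.
import re

CJK = r"[\u4e00-\u9fff]"  # CJK Unified Ideographs (BMP only; a simplification)

def despace_cjk(s):
    return re.sub(rf"(?<={CJK}) (?={CJK})", "", s)

print(despace_cjk("2012 英国 国会 恐怖 袭击"))  # 2012 英国国会恐怖袭击
```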

Differences in suggestions between prod and RelForge could be affected by the number of database shards in use (which affects frequency information), random differences in article distribution among shards, and differences in the overall frequency distribution (like maximum frequency) as a result of differences in segmentation/tokenizing, and collapsing Traditional and Simplified versions of the same word.

Summary

 * Chinese Wikis support Simplified and Traditional input and display. Text is stored as it is input, and converted at display time (by a PHP module called ZhConversion.php).
 * All Chinese-specific Elasticsearch tokenizers/segmenters/analyzers I can find only work on Simplified text. The SIGHAN 2005 Word Segmentation Bakeoff had participants that segmented Traditional and Simplified text, but mixed texts were not tested.
 * STConvert is an Elasticsearch plugin that does T2S conversion. It agrees with ZhConversion.php about 99.9% of the time on a sample of 1,000 Chinese Wiki articles.
 * It's good that they agree! We wouldn't want conversions for search/indexing and display to frequently disagree; that would be very confusing for users.
 * Using a Traditional to Simplified (T2S) converter greatly improves the performance of several potential Elasticsearch Chinese tokenizers/segmenters on Traditional text. (Tested on SIGHAN data.)
 * I uncovered two small bugs in the STConvert rules. I've filed a bug report and implemented a char_filter patch as a workaround.
 * SmartCN + STConvert is the best tokenizer combination (on the SIGHAN data). It performs a bit better than everything else on Traditional text and much better on Simplified text.
 * Our historically poor opinion of SmartCN may have been at least partly caused by the fact that it only really works on Simplified characters; and so it would perform poorly on mixed Traditional/Simplified text.
 * There are significantly fewer types of words (~16%) with SmartCN + STConvert compared to the current prod config, indicating more multi-character words are found. About 28% of tokens have new collisions (i.e., mostly Traditional and Simplified forms being indexed together).
 * Search with SmartCN + STConvert works as you would hope: Traditional and Simplified versions of the same text find each other, highlighting works regardless of underlying text type, and none of the myriad quotes (" " — ' ' — “ ” — ‘ ’ — 「 」 — 『 』 — ｢ ｣) in the text affect results. (Regular quotes in the query are "special syntax" and do the usual phrase searching.)
 * SmartCN + STConvert doesn't tokenize some non-CJK Unicode characters as well as one would like. Adding the icu_normalizer as a pre-filter fixes many problems, but not all. The remaining issues I still see are with some uncommon Unicode characters: IPA characters and fraction/slash characters like ½, ℀, and ℅. Searching for most works as you would expect (except for numerical fractions).

Recommendation

 * Deploy SmartCN + STConvert to production for Chinese wikis after an opportunity for community review (and after the ES5 upgrade is complete).

Next Steps: Plan & Status

 * ✓ Wait for RelForge to be downgraded to ES 5.1.2; when it is declared stable ("stable for RelForge") re-index Chinese Wikipedia there: Mostly done: A re-indexed copy of Chinese Wikipedia from mid-March using STConvert and SmartCN is available on RelForge. Because of my inexperience setting these things up, I set it up using the defaults—i.e., BM25 (the newer preferred term-weighting implementation) and a few other search tweaks not normally used on spaceless languages. See below for plans to deal with this.
 * ✓ Let people use it and give feedback: Done—still possibly more feedback coming. While there are some problems, impressions are generally positive.
 * ✓ Test that everything works in ES5 as expected: Done—it does.
 * Set up another copy of Chinese Wikipedia on the RelForge servers, using the "spaceless language" config
 * use RelForge to see how big of a difference BM25 vs spaceless config makes with the new analysis chain
 * get native speaker review of a sample of the changes
 * decide how to proceed: deploy as spaceless, deploy as BM25, or do an A/B test of some sort.
 * ✓ If/When vagrant update to ES5 is available, test there, in addition/instead: Done. Config is slightly different for the new version of STConvert, but everything works the same.
 * ✓ Update the plugin-conditional analyzer configuration to require two plugin dependencies (i.e., SmartCN and STConvert)—currently it seems it can only be conditioned on one. Done.
 * After ES5 is deployed and everything else checks out, deploy SmartCN and STConvert to production, enable the new analysis config (and possibly the search config to switch to BM25), and re-index the Chinese projects.
 * Update Vagrant/Puppet config to make SmartCN and STConvert available.