User:TJones (WMF)/Notes/Chinese Analyzer Analysis


February–April 2017 — See TJones_(WMF)/Notes for other projects. See also T158203. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Test/Analysis Plan

All of the Chinese segmentation candidates I've found to date (see T158202) expect Simplified Chinese characters as input. Chinese Wikipedia supports both Traditional and Simplified characters, and converts them at display time according to user preferences. There is an Elasticsearch plugin (STConvert) that converts Traditional to Simplified (T2S) or vice versa (S2T). I suggest trying to set up an analysis chain using that, and segment and index everything as Simplified.

There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers. The developer of the wrappers for IK and MMSEG recommends IK for segmenting, but I plan to test both.

So, my evaluation plan is to first compare the content and output of STConvert with MediaWiki's ZhConversion.php to make sure they do not differ wildly, and maybe offer some cross-pollination to bring them more in line with each other if that seems profitable. (I've pinged the legal team about licenses, data, etc.)

I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.

I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being easily patched might be more valuable than 0.02% better accuracy, for example.

It also makes sense to compare these segmenters to the baseline performance in prod (using icu_normalizer and icu_tokenizer), and the standard tokenizer.

We'll also test the highlighting for cross-character type (Traditional/Simplified/mixed) queries.

An Example

For reference in the discussion outline below, here's an example.

Right now (Feb 27, 2017), searching for Traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for Simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results—so they have some overlapping results.

Searching for the mixed T/S query (the last two characters, meaning "finals", are Traditional, the rest is Simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of Traditional and Simplified characters that are possible for this query.

Initial Draft Proposal

Once we have all the necessary tools available, we have to figure out how best to deploy them.

The current draft proposal, after discussing what's possible with the Elasticsearch inner workings with David & Erik, is:

  • Convert everything to Simplified characters for indexing and use a segmenter to break the text into words, in both the text and plain fields. Do the same for normal queries at search time.
  • Index the text as is in the source plain field, and use a unigram segmenter.

The working assumption here is that whether a typical searcher searches for Simplified 年欧洲冠军联赛决赛, Traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the underlying characters are Traditional or Simplified. When they use quotes, they want those words as a phrase.

For advanced searchers or editors who want to find specific characters (e.g., to distinguish the simple, Traditional, and mixed examples above), insource: would provide that ability.

We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.
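As a rough illustration of the two treatments in this draft, the corresponding Elasticsearch index settings might look something like the sketch below (in Python, via the REST API). This is not the final configuration: the index and analyzer names are made up, and it assumes the STConvert and SmartCN plugins expose a "stconvert" char_filter and a "smartcn_tokenizer" tokenizer, respectively.

 # Sketch only (not the final config). Assumes the STConvert plugin registers a
 # "stconvert" char_filter with a convert_type parameter, and the SmartCN plugin
 # registers a "smartcn_tokenizer" tokenizer; index and analyzer names are made up.
 import requests
 settings = {
     "settings": {
         "analysis": {
             "char_filter": {
                 "t2s_convert": {"type": "stconvert", "convert_type": "t2s"},
             },
             "analyzer": {
                 # text-style fields: convert to Simplified, then segment into words
                 "zh_text": {
                     "type": "custom",
                     "char_filter": ["t2s_convert"],
                     "tokenizer": "smartcn_tokenizer",
                 },
                 # source/plain-style field: index the text as is, as CJK unigrams
                 "zh_source_plain": {
                     "type": "custom",
                     "tokenizer": "standard",
                 },
             },
         },
     },
 }
 print(requests.put("http://localhost:9200/zhwiki_test", json=settings).json())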

Update: I suggested using the T2S converter on the plain field, but I’ve decided not to for now. I created a labs instance with that config, but a quick native-speaker review didn’t show either as obviously better. Since more review went into the typical plain field config, and changing the plain field would make it and the text field essentially the same, I’m going with the better tested option. This does mean that Traditional and Simplified queries can get somewhat different results, thanks to one having a perfect match on the plain field. However, that may potentially offset some errors in the T2S conversion in some cases. The current config with T2S in the text field should be a big improvement, and if the lack of T2S and SmartCN segmenting in the plain field causes problems, we can make iterative improvements in the future.

STConvert

STConvert is an Elasticsearch plugin that converts between Traditional and Simplified Chinese characters. It includes an analyzer, tokenizer, token filter, and char filter. It has releases that are compatible with many versions of ES, including ES 5.1.2 and ES 5.2.1 (and ES 2.3.5, which I'm currently running in vagrant).

STConvert vs ZhConversion: Mappings

ZhConversion.php is a MediaWiki PHP module that also converts between Traditional and Simplified Chinese characters. It is used to convert Chinese wiki projects at display time.

In order to get a rough estimate of the coverage and likely comparative accuracy of the two, I compared the data they use to do their respective conversions. The ZhConversion data is in a PHP array and the STConvert data file is a colon-separated text file. WARNING—both of those links go to large, slow loading files that may crash your browser.

In each case, I normalized the list of mappings by removing the syntactic bits of the file and converting to a tab-separated format. For each mapping, I sorted the elements being mapped (so A:B and B:A would both become A:B), and sorted and de-duped the lists.
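A minimal sketch of that normalization, and of one simple way to count internal conflicts (the file parsing is elided; assume each mapping has already been reduced to a pair of strings):

 # Sketch of the normalization described above: sort each mapping pair so that
 # A:B and B:A collapse to the same thing, de-dupe, and count "internal conflicts"
 # (strings that map to or from more than one thing).
 from collections import defaultdict
 
 def normalize(mappings):
     """mappings: iterable of (source, target) pairs already parsed from the
     PHP array or the colon-separated STConvert file."""
     return sorted({tuple(sorted(pair)) for pair in mappings})
 
 def internal_conflicts(mappings):
     seen = defaultdict(set)
     for src, tgt in mappings:
         seen[src].add(tgt)
         seen[tgt].add(src)
     return {s: others for s, others in seen.items() if len(others) > 1}
 
 pairs = [("A", "B"), ("B", "A"), ("B", "C")]   # toy example
 print(normalize(pairs))            # [('A', 'B'), ('B', 'C')]
 print(internal_conflicts(pairs))   # {'B': {'A', 'C'}}: B maps to/from two things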

I did a diff of the files and took a look. There were no obvious wild inconsistencies. I also did a more careful automated review. The results are below.

For these first two sets of stats:

  • raw mappings are the number of mappings present in the original PHP or text file
  • unique sorted mappings are what's left after re-ordering and de-duping
  • internal conflicts are the number of character strings that map to or from more than one thing, so having A:B and B:C would be an internal conflict for B.

STConvert (Elasticsearch)

  • raw mappings: 11708
  • unique sorted mappings: 11591
  • internal conflicts: 192

ZhConversion (Mediawiki)

  • raw mappings: 20381
  • unique sorted mappings: 15890
  • internal conflicts: 2748

The duplicates and conflicts are not surprising. This kind of info is gathered from many sources, so duplicates and even conflicts are likely. "Conflicts" and duplicates are not necessarily incorrect, either, since some mappings are many-to-one. If A and B both map to C in one direction, but C preferentially maps back to A in the other direction, you get both a "conflict" and a sorted duplicate: {A:C, B:C, C:A} → {A:C, B:C, A:C}, so A:C is a dupe and C has a conflict.

I then compared the mappings more systematically against each other. They have a lot of overlap in the sorted mappings, though each has some mappings the other does not. I also looked for mismatches of several types, where the existing mappings don't agree.

  • singleton mismatches occur when each has exactly one mapping for a character and they disagree; these are clear disagreements (though not necessarily errors, see below)
  • complex mismatches occur when there are internal conflicts, and they aren't the same between the two mapping sets. For example, {A:B, A:C, A:D, A:E, A:F} vs {A:C, A:D, A:E, A:F}: there's a lot of overlap, but they don't match exactly. I didn't dig into the details of these kinds of mismatches.

STConvert vs ZhConversion: Mappings

  • merged unique sorted mappings: 17891
    • 9590 unique sorted mappings occur in both
    • STConvert has 2001 unique sorted mappings ZhConversion does not
    • ZhConversion has 4291 unique sorted mappings STConvert does not
  • Mismatches
    • singleton mismatches: 191
    • complex mismatches: 1692

Both singleton and complex mismatches might not be errors. These kinds of mappings are typically driven by examples, and will never be complete (there's always new vocabulary being created, and weird exceptions are the rule with human language), so one project may have had a request to add A:C, while another had a request to add B:C, but neither is wrong (and this is even more likely if A and B are Traditional and C is Simplified).

Another important thing to note is that the distribution of Chinese characters is far from even. English Wikipedia says that educated Chinese know about 4,000 characters. A BBC guide to Chinese says that while there are over 50,000 characters, comprehensive dictionaries list only 20,000, educated speakers know 8,000, but only 2,000-3,000 are needed to read a newspaper.

In general, this gives me confidence that both systems are on par with each other in terms of the most common characters.

There is still a potential problem, given the obscure completeness often found in Wikipedia, that for two underlying forms of an uncommon word, one Simplified and one Traditional, the display forms (driven by ZhConversion) could be the same, while the indexed forms (driven by STConvert) could be different. This would be particularly confusing for most users because the two forms would look the same on the screen, but searching for either would only find one of the forms. The reverse situation, where STConvert merges them in the index but ZhConversion fails to render them the same on the screen might actually look "smart", because while they look different, search can still find both!

Based on the numbers above, I think the likelihood of this ever happening is almost 100% (I'm pretty sure I could engineer an example by trawling through the tables), but the likelihood of it happening often in actual real-world use is very small. I will try to get a quantitative answer by getting some Chinese Wikipedia text and running both converters in both directions and seeing how often they disagree.

If there is a significant mismatch, we could either share data across the two platforms (I'm not 100% sure how that works, from a licensing standpoint) or we could fork STConvert and convert the ZhConversion data into its format. There's a small chance of some remaining incompatibilities based on implementation differences, but it would be easy to re-test, and I'd expect such differences to be significantly smaller than what we have now with independent parallel development.

STConvert vs ZhConversion: Real-World Performance

I extracted 10,000 random articles from Chinese Wikipedia, and deduped lines (to eliminate the equivalent of "See Also", "References", and other oft-repeated elements). The larger sample turned out to be somewhat unwieldy to work with, so I took a smaller one-tenth sample of the deduped data, which is approximately 1,000 articles' worth.

The smaller sample had 577,127 characters in it (6,257 unique characters); of those 401,449 were CJK characters (5,722 unique CJK characters). I ran this sample through each of STConvert and ZhConversion to convert the text to Simplified (T2S) characters.

STConvert changed 72,555 (12.6%) of the characters and ZhConversion changed 74,518 (12.9%) of the characters.

Note that the sample is a mix of Simplified and Traditional characters, and that some characters are the same for both. I also ran the same process using STConvert to convert Simplified to Traditional characters (S2T), which resulted in 11.8% of characters being changed. Between the S2T and T2S outputs from STConvert, 24.3% changed (i.e., ~12.6% + ~11.8% with rounding error). So, roughly, it looks like ~3/4 of characters on Chinese Wikipedia are unchanged by either conversion, ~1/8 are Traditional characters that convert to Simplified, and ~1/8 are Simplified characters that convert to Traditional.

I did a character-by-character diff of their outputs against each other: 2,627 characters were changed from the ZhConversion output, and 2,841 were changed from the STConvert output (0.46%-0.49%). The differences cover around 80 distinct characters, but by far the largest source of differences is quotation marks, with some problems for high surrogate characters.
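For reference, the character-level counting can be approximated with something like the following (an illustration, not the exact tool used; it assumes both converted outputs fit in memory as strings):

 # Rough sketch: count the characters that differ between the two T2S outputs,
 # on each side of the diff.
 import difflib
 from collections import Counter
 
 def diff_chars(zhconv_out, stconv_out):
     changed_from_zh, changed_from_st = Counter(), Counter()
     sm = difflib.SequenceMatcher(None, zhconv_out, stconv_out, autojunk=False)
     for op, i1, i2, j1, j2 in sm.get_opcodes():
         if op != "equal":
             changed_from_zh.update(zhconv_out[i1:i2])
             changed_from_st.update(stconv_out[j1:j2])
     return changed_from_zh, changed_from_st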

Of the 2627 diffs from ZhConversion:

  • 1920 are “ and ”
  • 164 are ‘ and ’
  • The rest are CJK characters.

Of the 2841 diffs from STConvert:

  • 1916 are 「 and 」
  • 164 are 『 and 』
  • 4 are 「 and 」
  • 159 are \
    • 145 of which precede "
    • 12 of which precede u (and are part of \u Unicode encodings, like \uD867, which is an invalid "high surrogate" character)
  • The rest are CJK characters.

Discounting the 2084 quotes from the 2627 ZhConversion diffs leaves 543 characters.

Discounting the 2084 quotes, 159 slashes, and reducing the 60 Unicode encodings to 12 individual characters, from the 2841 STConvert diffs also leaves 550 characters. (I'm not sure what's caused the 543 vs 550 character difference.)

So, other than quotes, ZhConversion and STConvert disagree on only 0.094% - 0.095% (less than a tenth of a percent) of the more than half a million characters in this sample.

387 (~71%) of the character differences are accounted for by the 11 most common characters (those with >10 occurrences). It's easy to line up most of these characters, since the counts are the same. I've lined them up and looked at English Wiktionary to see which is the likely correct form.

Freq ZhConversion STConvert Wiktionary says...
151 STConvert
66 ZhConversion
47 both are listed as Simplified forms of 餘
24 ZhConversion
20 ZhConversion
18 ? (牠 is archaic, 它 is used now; it's like converting thou to you—right or wrong? depends...)
15 ZhConversion
13 ZhConversion
12 鐘->钟, 鍾->锺 —possible frequency mismatch; or at least one of them has an error
11 ZhConversion
10 ZhConversion; STConvert version is flagged as an "alternative form"

So the bulk of the difference is quotation marks (which we should be able to fix if needed), leaving about 0.1% disagreement on the rest, with attributable token-level errors roughly evenly divided between the two converters.

That seems close enough to work with!

SIGHAN Segmentation Analysis Framework

I was able to download and run the SIGHAN Second International Chinese Word Segmentation Bakeoff framework and tests after a little bit of format munging (they were in a Windows UTF-8 format; I'm running OS X/Linux).

There are four corpora, two Traditional and two Simplified, each containing at least a million characters and a million words, with the largest at more than 8M characters and 5M words. The scoring script provided generates a lot of stats, but I'll be looking only at recall, precision, and F1 scores, primarily F1.
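For reference, these are word-level scores: a predicted word counts as correct only if both of its boundaries match the gold segmentation. A minimal sketch of that computation (an illustration, not the SIGHAN score script itself):

 # Word-level precision/recall/F1 for a segmentation, scored against a gold
 # standard by matching (start, end) character spans.
 def spans(words):
     out, pos = set(), 0
     for w in words:
         out.add((pos, pos + len(w)))
         pos += len(w)
     return out
 
 def score(gold_words, predicted_words):
     gold, pred = spans(gold_words), spans(predicted_words)
     correct = len(gold & pred)
     precision = correct / len(pred) if pred else 0.0
     recall = correct / len(gold) if gold else 0.0
     f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
     return precision, recall, f1
 
 # toy example: gold 中华人民共和国 | 国歌 vs predicted 中华 | 人民共和国 | 国歌
 print(score(["中华人民共和国", "国歌"], ["中华", "人民共和国", "国歌"]))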

Segmenters

I looked at the three segmenters/tokenizers I found. All three are available as Elasticsearch plugins. I'm also going to compare them to the current prod config (icu_normalizer and icu_tokenizer) and the "standard" Elasticsearch tokenizer.

  • SmartCN—supported by Elastic, previously considered not good enough. It is designed to work on Simplified characters, though that isn't well documented. It does not appear to be patchable/updatable. An old Lucene Issue explains why and nothing there seems to have changed. Since this is recommended and supported by Elasticsearch, I don't think it needs a code review.
  • IK—recommended by Elasticsearch, and up-to-date. I verified with the developer (Medcl) of the ES plugin/wrapper that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.
  • MMSEG—also recommended by Elasticsearch, and up-to-date. The same developer (Medcl) wrote the ES plugin/wrapper for this as for IK. He verified that it works on Simplified characters. It is explicitly patchable, with a custom dictionary. Would need code review.

The plugin developer recommended IK over MMSEG, but I planned to test both. However, I couldn't get MMSEG to install. After the results of the performance analysis, I don't think it's worth spending too much effort to get it to work.

Max Word

Both IK and MMSEG have an interesting feature, "Max Word", which provides multiple overlapping segmentations for a given string. Given 中华人民共和国国歌 ("National Anthem of the People's Republic of China"), the ik_smart tokenizer splits it in two chunks: 中华人民共和国 ("People's Republic of China") and 国歌 ("National Anthem"). The ik_max_word tokenizer provides many additional overlapping segmentations. Rough translations (based on English Wiktionary) are provided, but don't rely on them being particularly accurate.

中华人民共和国国歌   input, for reference
中华人民共和国      People's Republic of China
中华人民           Chinese People
中华              China
  华人            Chinese
    人民共和国     People's Republic
    人民          people
    人            person
      民          people
       共和国     republic
       共和       republicanism
         和       and
          国国    country
            国歌  national anthem

The SIGHAN segmentation testing framework doesn't support this kind of multiple segmentation, so I can't easily evaluate it that way. It's an interesting idea that would increase recall, but might decrease precision. Given the generally better recall of SmartCN (see below), I don't think we need to worry about this, but it's an interesting idea to keep in mind.

Segmenter Performance

I ran seven configs against the four SIGHAN segmentation test corpora. The seven configs are:

  • The prod Chinese Wikipedia config (ICU tokenizer and ICU Normalizer)
  • The ICU tokenizer with STConvert
  • The Elasticsearch "standard" analyzer (which tokenizes each CJK character separately)
  • The SmartCN tokenizer
  • The SmartCN tokenizer, with STConvert (T2S)
  • The IK tokenizer
  • The IK tokenizer, with STConvert (T2S)
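Each config can be exercised by segmenting the test sentences through the Elasticsearch _analyze API and writing out space-separated tokens for the SIGHAN scorer; a rough sketch (the index and analyzer names are illustrative, and this is not the exact harness used):

 # Sketch: run a test file through an Elasticsearch analyzer via the _analyze API
 # and write out space-separated tokens, one line per input line, for the scorer.
 import requests
 
 ANALYZE_URL = "http://localhost:9200/zhwiki_test/_analyze"   # index name is illustrative
 
 def segment_file(infile, outfile, analyzer="zh_text"):
     with open(infile, encoding="utf-8") as fin, \
          open(outfile, "w", encoding="utf-8") as fout:
         for line in fin:
             resp = requests.post(ANALYZE_URL,
                                  json={"analyzer": analyzer, "text": line.strip()})
             tokens = [t["token"] for t in resp.json().get("tokens", [])]
             fout.write(" ".join(tokens) + "\n")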

For comparison, the F1 scores for all seven configs on all four corpora are below (the best score for each corpus is marked with an asterisk):

F1                                  | AS (trad) | CITYU (trad) | MSR (simp) | PKU (simp)
Prod (icu normalizer and tokenizer) | 78.5%     | 75.2%        | 78.0%      | 78.4%
icu_tokenizer + STConvert           | 76.2%     | 72.8%        | 78.0%      | 78.2%
Standard                            | 38.2%     | 35.1%        | 32.0%      | 32.8%
IK                                  | 49.4%     | 44.8%        | 72.3%      | 71.0%
IK + STConvert                      | 73.7%     | 69.2%        | 72.3%      | 71.0%
SmartCN                             | 54.3%     | 51.4%        | 86.4%*     | 90.4%*
SmartCN + STConvert                 | 80.5%*    | 79.9%*       | 86.4%*     | 90.4%*

The short version is that SmartCN+STConvert did the best on all the corpora, and IK did the worst. If IK is supposed to be better than MMSEG, then it doesn't matter much that we didn't test MMSEG.

(Note: I previously had eight configs. There is a "standard" analyzer, and a "standard" tokenizer. I confused the two and caused myself some problems, but it's all sorted now. We're looking at the standard analyzer here.)

The Standard analyzer just splits every CJK character into its own token. It's very interesting to see how badly that does—and it represents a sort of baseline for how bad performance can be. Anything worse than that is going out of its way to make mistakes!

The recall, precision, and F1 score details for the remaining six configs are below:

Prod (icu normalizer and tokenizer)
Recall Prec F1
AS (trad) 78.8% 78.1% 78.5%
CITYU (trad) 77.4% 73.2% 75.2%
MSR (simp) 82.2% 74.2% 78.0%
PKU (simp) 80.6% 75.9% 78.2%
icu_tokenizer + STConvert
Recall Prec F1
AS (trad) 78.3% 74.1% 76.2%
CITYU (trad) 76.5% 69.4% 72.8%
MSR (simp) 82.2% 74.2% 78.0%
PKU (simp) 80.6% 75.9% 78.2%
Standard
Recall Prec F1
AS (trad) 49.1% 31.3% 38.2%
CITYU (trad) 45.8% 28.5% 35.1%
MSR (simp) 43.0% 25.5% 32.0%
PKU (simp) 42.8% 26.6% 32.8%
IK
Recall Prec F1
AS (trad) 56.4% 44.0% 49.4%
CITYU (trad) 51.7% 39.5% 44.8%
MSR (simp) 69.0% 76.0% 72.3%
PKU (simp) 66.3% 76.4% 71.0%
IK+STConvert
Recall Prec F1
AS (trad) 71.8% 75.8% 73.7%
CITYU (trad) 67.5% 71.0% 69.2%
MSR (simp) 69.0% 76.0% 72.3%
PKU (simp) 66.3% 76.4% 71.0%
SmartCN
Recall Prec F1
AS (trad) 64.9% 46.7% 54.3%
CITYU (trad) 62.4% 43.8% 51.4%
MSR (simp) 90.3% 82.8% 86.4%
PKU (simp) 92.9% 88.0% 90.4%
SmartCN + STConvert
Recall Prec F1
AS (trad) 85.5% 76.0% 80.5%
CITYU (trad) 85.5% 75.0% 79.9%
MSR (simp) 90.3% 82.8% 86.4%
PKU (simp) 92.9% 88.0% 90.4%

STConvert makes no difference for the Simplified corpora (MSR and PKU), which is a good sign, as it indicates that STConvert isn't doing anything to Simplified characters that it shouldn't.

STConvert makes a big difference on the Traditional corpora for the Simplified-only segmenters, improving IK's recall by a bit more than 15%, precision by a bit more than 30%, and F1 by about 24%. For SmartCN, recall improves 20-23%, precision about 30%, and F1 26-28%, explaining at least part of the local lore that SmartCN doesn't work so well; without T2S conversion, SmartCN is going to do particularly badly on the ~1/8 of Chinese Wikipedia text that is in strictly Traditional characters.

Oddly, STConvert actually hurts performance on the Traditional corpora for the ICU tokenizer, indicating that either ICU normalization is better than STConvert, or STConvert errors ripple into the ICU tokenizer. To test this, I ran SmartCN with the ICU normalizer, and the results were better by up to 0.2% for recall, precision, and F1—indicating that the Traditional tokenizer magic in prod is happening in the ICU Tokenizer, not the ICU Normalizer.

The IK tokenizer, even with STConvert, is clearly the worst of the bunch, so we can drop it from consideration.

SmartCN does significantly better on recall on all corpora, and significantly better on precision for the Simplified corpora (indicating that, not surprisingly, imperfections in T2S conversion have knock-on effects on tokenization). SmartCN+STConvert does a bit worse (~2%) than prod at precision for the AS corpus, but makes up for it with bigger gains in recall.

Unfortunately, neither SmartCN nor prod do nearly as well as the original participants in the SIGHAN Segmentation Bakeoff, where the top performer on each corpus had 94%-98% for recall, precision, and F1. SmartCN would rank near the bottom for each corpus.

Given the surprisingly decent performance of the prod/ICU tokenizer on the SIGHAN test set, I think a proper analysis of the effect on the tokens generated is in order. I expect to see a lot of conflation of Traditional and Simplified variants, but I'm also concerned about the treatment of non-CJK characters. For example, we may need to include both STConvert and ICU_Normalization so that wide characters and other non-ASCII Unicode characters continue to be handled properly.

Analysis Chain Analysis

Since the current production config includes the icu_normalizer, I was worried that SmartCN + STConvert would lose some folding that is currently happening, so I ran the normal analysis chain analysis that I do for smaller analysis chain tweaks. As expected, some folding was lost, but I also found some really weird behavior.

A Buggy Interlude

STConvert has an oddly formatted rule in its conversion chart. The rest are generally AB:CD, where A, B, C, and D are CJK characters. There's one rule, "恭弘=叶 恭弘:叶", that seems to get parsed funny, and the result is that "恭弘" gets converted to "叶 叶 恭弘:叶:叶". That's survivably bad, but what is much worse is that the token start/end annotations associated with it are incorrect, and it messes up the character counts for everything that comes after it. I caught it because "年7月2" was tokenized to "2011"—I don't know much Chinese (I know the characters for numbers up to three!), but clearly that looks wrong.

I found another problem: the rule mapping "儸" to "㑩" includes a zero-width no-break space, which is hard to see unless your editor happens to want to show it to you.

I've filed an issue on the project on GitHub, and the developer says he'll fix it. There are several ways to proceed.

  • "恭弘" only occurs 37 times in Chinese Wikipedia, and "儸" only 44—out of 900K+ articles, so just ignore the problem. Pros: easy; future compatible; will naturally resolve itself if a future version of STConvert fixes the problem. Cons: breaks indexing for everything after "恭弘" on a given line of text.
  • Don't deploy STConvert until it's fixed. Pros: relatively easy, future compatible. Cons: will not resolve itself—we have to remember to track it and deploy it later; substantially worse tokenization until we deploy it.
  • Patch or fork STConvert. (I did this on my local machine to make sure that the one weird rule was the source of the problem.) Pros: relatively easy, best possible tokenization. Cons: not future compatible and will not resolve itself—updates to STConvert without a fix will remove the fix (if patched) or future updates will require re-forking.
  • Hack a character filter to fix the problem. A character filter that maps "恭弘" to "恭 弘" solves one problem, and explicitly mapping "儸" to "㑩" solves the other. SmartCN tokenizes "恭弘" as "恭" and "弘" anyway, so we just do that split manually before STConvert gets a chance to do anything. Pros: easy; best possible tokenization; 99% future compatible (there's a slight chance that "恭弘" or a longer string containing it should be processed differently). Cons: will not resolve itself (even if it isn't necessary in the future, the code would still be there as legacy cruft—but at least we can comment it thoroughly).

I like the character filter hack the best, and it's what I'm using for the rest of my testing.
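A sketch of what that workaround looks like: a standard Elasticsearch "mapping" char filter placed before STConvert in the chain (the filter name is illustrative):

 # Sketch of the workaround: a "mapping" char_filter that pre-splits 恭弘 and maps
 # 儸 directly to 㑩, placed before the stconvert char_filter in the chain.
 stconvert_bugfix = {
     "type": "mapping",
     "mappings": [
         "恭弘 => 恭 弘",   # sidestep the badly formatted "恭弘=叶 恭弘:叶" rule
         "儸 => 㑩",        # the stock rule for this pair contains a hidden zero-width no-break space
     ],
 }
 # ...and then, in the analyzer definition, something like:
 #     "char_filter": ["stconvert_bugfix", "t2s_convert"]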

I Miss ICU, Like the Deserts Miss the Rain

Wide character variants (e.g., Ｆ or ８０) are still mapped correctly, but other Unicode variants, like ª, Ⓐ, ℎ, ﬁ, ß, and Ⅻ, are not mapped, and case-folding of non-ASCII variants (Ů/ů, Ə/ə, ɛ/ℇ) doesn't happen. Another problem is that at least some non-ASCII characters cause tokenization splits: fußball → fu, ß, ball; enɔlɔʒi → en, ɔ, l, ɔ, ʒ, i.

So, I added back the icu_normalizer as another filter before STConvert and the patches for STConvert.

I ran into some additional problems with my analysis program. Some of the Simplified characters STConvert outputs are outside the Basic Multilingual Plane. Elasticsearch seems to do the right thing internally and at display time, but in the token output they are split into two separate UTF-16 surrogate characters, which screwed up the alignment in my analysis tool. Another issue is that the icu_normalizer converts some single characters into multiple characters, so ⑴ becomes "(1)". Internally, Elasticsearch seems to do the right thing, but the indexes given by the output are into the modified token stream, so my analysis was misaligned. I put in a temporary char filter to map ⑴ to "1" to solve this problem. I found a few more cases—like ℠, ™, ℡, ℅, ½, etc.—but once I eventually found the cause of all the relevant problems, I was able to ignore them in my analysis.

Analysis Results

In order to check some of the new T2S conversions, I implemented a very simple T2S converter using the data from MediaWiki's ZhConversion.php's $zh2Hans array. I sorted the mappings by length and applied longer ones first. I took agreement between STConvert and ZhConversion to be approximately correct, since they were developed independently.
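A minimal sketch of that converter (the parsing of the PHP array is elided, and the mappings shown are a toy subset for illustration):

 # Sketch of the "very simple" T2S converter: apply the zh2Hans mappings greedily,
 # longest source string first, so multi-character mappings win over single characters.
 import re
 
 def make_t2s(zh2hans):
     keys = sorted(zh2hans, key=len, reverse=True)
     pattern = re.compile("|".join(re.escape(k) for k in keys))
     return lambda text: pattern.sub(lambda m: zh2hans[m.group(0)], text)
 
 t2s = make_t2s({"歐洲": "欧洲", "決賽": "决赛", "歐": "欧"})   # toy mappings only
 print(t2s("歐洲冠軍聯賽決賽"))   # 欧洲冠軍聯賽决赛 (only the toy mappings apply)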

I took the 99 remaining unexpected CJK collisions and converted them using another online T2S tool. That left 19 unresolved. For those, I looked up the characters in English Wiktionary, and all were identified as related.

I honestly didn't expect it to be that clean!

The non-CJK collisions were mostly reasonable, or were caused by uncaught off-by-one errors in my analysis tool.

One notable difference between the ICU Tokenizer and the SmartCN tokenizer is the way uncommon Unicode characters like ½ are treated. ICU converts it to a single token, "1/2", while SmartCN converts it to three: "1", "/", "2". On the other hand, ICU converts ℀ to "a" and "c", while SmartCN converts it to "a", "/", "c". There's not a ton of consistency and searching for reasonable alternatives often works. (e.g., Rs vs ₨, ℅ vs c/o, ℁ vs a/s—though 1/2 and ½ don't find each other.)

Some stats:

  • Only 3.5% of types in the prod config were merged after analysis (which includes Latin characters that are case folded, for example).
  • The new config has 17.4% fewer types (indicative of better stemming, since there are more multi-character types); 20% of them were merged after analysis (indicative of T2S merging lots of words).
  • About 15.8% of types (~20K) and 27.9% of tokens (~.8M) had new collisions (i.e., words were merged together after analysis)—and that's a ton! However, given the reputation of the current Chinese analysis for making lots of tokenization mistakes, it's good news.

Highlighting

Highlighting works across Traditional/Simplified text as you would expect. Yay!

Quotation Marks and Unicode Characters

All the quote characters (" " — ' ' — “ ” — ‘ ’ — 「 」 — 『 』 — 「 」) behave as expected both in the text and in search.

fußball is indexed as fussball, and searching for either finds either. But it looks like ß is getting special treatment. Other, more common accented Latin characters get split up: lålélîlúløl gets divided into individual letters. A search for åéîúø finds lålélîlúløl. enɔlɔʒi is still split up as en, ɔ, l, ɔ, ʒ, i, though searching for it finds it. Searching for ʒiɔl en also finds it, since it gets split into the same tokens. Using quotes limits the results to the desired ones.

This is less than ideal, but probably something we can live with.

Interesting Examples from Native Speaker Review

Thanks to Chelsy, SFSQ2012, 路过围观的Sakamotosan, and Alan for examples and review!

An interesting problem with segmenting/tokenizing Chinese is that the context of a given word can affect its segmentation. Of course, higher frequency terms have more opportunities for rarer errors.

Another interesting wrinkle is that I set up the RelForge server with the new default (BM25, etc.) instead of the older config used by spaceless languages in production, so the analysis and comments below are comparing production to the new analysis chain, plus BM25 and changes to the way phrases are handled. (In particular, strings without spaces that get split into multiple tokens are treated as if they have quotes around them. In English this would be after-dinner mint being treated as "after dinner" AND mint—though we don't do this in English anymore. For Chinese, it's most queries!)

维基百科/維基百科 ("Wikipedia")

A nice example of a relatively high-frequency term is 维基百科/維基百科 ("Wikipedia"). In the labs index, 维基百科 has 128,189 results, while 維基百科 has 121,281 results. The discrepancy comes not from the segmentation of the queries, but the segmentation of article text. For better or worse, in isolation (e.g., as queries) both—after conversion to Simplified characters—are segmented as 维 | 基 | 百科.

Searching for 维基百科 NOT 維基百科 (7,034 results) or 維基百科 NOT 维基百科 (168 results) gives examples where the segmentation within the article goes "wrong". The skew in favor of <Simplified> NOT <Traditional> having more results than the other way around makes sense (to me), since recognizing Traditional characters requires an extra step of conversion to Simplified (both for the target terms and the surrounding text), which can also go wrong and screw up everything.

In the case of 维基百科 NOT 維基百科, one of the results is ChemSpider. The clause containing 维基百科 is: 它通过WiChempedia收录维基百科中的化合物记录。 When I run it through the proposed Chinese analysis chain, the segmentation for the relevant bit is 维 | 基 | 百 | 科. The Traditional query 維基百科 is converted and segmented as 维 | 基 | 百科, which doesn't match. The segmented Simplified query doesn't match either, but it's possible to get an exact match on "维基百科".

Multiple Languages/Character Sets

Queries with multiple languages/character sets (Chinese/English or Chinese/Japanese) perform differently with the different analyzers. Below are some examples, with some notes.

Each entry below gives the query, the production ("default") segmentation and its result count, the new (SmartCN+STConv) segmentation with result counts without and with BM25, and notes comparing prod to new + BM25.

  • 2017isscc会议: Prod 2017isscc | 会议 (0 results); New 2017 | isscc | 会议 (0 w/o BM25, 1 w/ BM25). Non-zero results based on better tokenizing of non-CJK characters and no phrase requirement.
  • austria领事馆: Prod austria | 领事 | 馆 (0 results); New austria | 领事馆 (0 w/o BM25, 29 w/ BM25). Non-zero results, because we aren't limited to the phrase “austria 领事 馆”.
  • austria 领事馆: Prod austria | 领事 | 馆 (9 results); New austria | 领事馆 (26 w/o BM25, 29 w/ BM25). More results, mostly from T2S conversion, and some from lack of phrase requirement.
  • league of legends 电竞: Prod league | of | legends | 电 | 竞 (3 results); New league | of | legends | 电 | 竞 (16 w/o BM25, 21 w/ BM25). Better & more results, because of T2S conversion and not being limited to the phrase “电 竞”.
  • 新东方gre: Prod 新 | 东方 | gre (0 results); New 新 | 东方 | gre (0 w/o BM25, 10 w/ BM25). Non-zero results, because we aren't limited to the phrase “新 东方 gre”.
  • 任天堂スーパーマリオ: Prod 任天堂 | スーパー | マリオ (0 results); New 任 | 天堂 | ス | ー | パ | ー | マ | リ | オ (1 w/o BM25, 430 w/ BM25). Better segmentation of Japanese from the production ICU tokenizer, but no results because of the phrase requirement.
  • 2017superbowl: Prod 2017superbowl (0 results); New 2017 | superbowl (0 w/o BM25, 4 w/ BM25). Non-zero results based on better tokenizing of non-CJK characters and no phrase requirement.
  • 2017 superbowl: Prod 2017 | superbowl (4 results); New 2017 | superbowl (4 w/o BM25, 4 w/ BM25). Same.
  • 2016 oscar winners: Prod 2016 | oscar | winners (68 results); New 2016 | oscar | winners (69 w/o BM25, 70 w/ BM25). Worse results based on relevance, presumably because of BM25.

Of note, the default analyzer makes type distinctions (2016 is "NUM", superbowl is "ALPHANUM", CJK characters are "IDEOGRAPHIC"), whereas the new analysis chain does not (everything is "word"). I don't think it matters internally to Elasticsearch.

New Analysis Chain without BM25

I've set up another instance of the Chinese Wikipedia index on RelForge with everything other than the analysis chain the same as prod. The tokenization of queries and article text is the same as the previous version, but searching doesn't use BM25, and does still have the phrase requirements of production. My initial hypothesis from this small set of examples is that the lack of the phrase requirement is making a lot of the improvements.

Did You Mean Suggestions

While it's somewhat outside the scope of this analysis, one reviewer pointed out that the Chinese Did You Mean (DYM) results are not great.

2016 oscar winners gets the suggestion from prod of 2012 oscar winter, and from the new analyzer on RelForge: 2016 star winter.

Another search, "2017英国国会恐怖袭击" ("2017 British Parliament terrorist attacks") gets the suggestion from the new analyzer on RelForge: "2012 英国 国会 恐怖 袭击". Interestingly, the recommendation is tokenized! (We might want to consider removing spaces from between CJK characters in DYM recommendations on Chinese wikis, so the recommendations are more idiomatic.)
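The de-spacing itself would be a small post-processing step; a sketch of the kind of thing that could do it (the CJK range used is a simplification):

 # Sketch: drop whitespace that sits between two CJK characters, so a tokenized
 # suggestion like "2012 英国 国会 恐怖 袭击" becomes "2012 英国国会恐怖袭击".
 import re
 
 CJK = r"[\u4e00-\u9fff]"   # main CJK Unified Ideographs block only; a simplification
 
 def despace_cjk(suggestion):
     return re.sub(rf"(?<={CJK})\s+(?={CJK})", "", suggestion)
 
 print(despace_cjk("2012 英国 国会 恐怖 袭击"))   # 2012 英国国会恐怖袭击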

Differences in suggestions between prod and RelForge could be affected by the number of database shards in use (which affects frequency information), random differences in article distribution among shards, and differences in the overall frequency distribution (like maximum frequency) as a result of differences in segmentation/tokenizing, and collapsing Traditional and Simplified versions of the same word.

Of Filters and Char_Filters—Not All ICU Normalization is the Same

David pointed out in a review of a patch that I had directly used the icu_normalizer, which isn't guaranteed to be installed. We have a bit of logic to replace the lowercase token filter with the icu_normalizer if it's available.

All well and good, except—dun dun dun!—I used the icu_normalizer char_filter, not the icu_normalizer filter, so it's not quite the same.

Elasticsearch analyzers are built from char_filters, a tokenizer, and filters (a.k.a. token filters).

A char_filter changes characters or sequences of characters. So, you could replace ①, ②, and ③ with 1, 2, and 3. These changes come before tokenization, and so can affect tokenization. You wouldn't want to change ™ to TM before tokenization, because then Widget™ would become WidgetTM and get indexed as widgettm. For Chinese, STConvert is a char_filter, and we use it to replace Traditional characters with Simplified characters, because that's what the SmartCN tokenizer works on. I also have another little character filter that goes before STConvert to fix two bugs I found in STConvert.

The tokenizer is supposed to break the string of characters into tokens (more or less "words"). It can, however, also transform the characters. It can do ascii folding and other transformations as well. SmartCN is the tokenizer we use to break Chinese text into words. It also transforms a lot of non-word characters into commas—parens, slashes, quotes, ①, grawlix components (e.g., *&%$#!), etc. It also splits words on any character that's not a CJK character, or a Basic Latin character—so it splits on accented Latin characters, like ß, é, å, etc, and Straße gets broken into three pieces.

A filter (or token filter) is similar to a char_filter in that it transforms characters, but it can only operate on tokens already identified by the tokenizer, and it can't affect their boundaries. So it could transform ①, ②, and ③ into 1, 2, and 3, or ™ into TM. But if Widget™ had already been broken into Widget and ™ by the tokenizer, the token filter can't change the token boundaries.
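To make the difference concrete, here is roughly how the two variants of the analyzer differ; these are fragments with illustrative names, not the exact production config:

 # Sketch: the same analyzer with ICU normalization as a char_filter (runs before
 # tokenization, so it can affect token boundaries) vs. as a token filter (runs
 # after SmartCN has already decided the boundaries). Names and the exact ordering
 # of the custom char_filters are illustrative.
 with_icu_char_filter = {
     "type": "custom",
     "char_filter": ["icu_normalizer", "stconvert_bugfix", "t2s_convert"],
     "tokenizer": "smartcn_tokenizer",
     "filter": ["lowercase"],
 }
 with_icu_token_filter = {
     "type": "custom",
     "char_filter": ["stconvert_bugfix", "t2s_convert"],
     "tokenizer": "smartcn_tokenizer",
     "filter": ["icu_normalizer"],   # stands in for the lowercase filter when ICU is available
 }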

So, if I move the icu_normalizer char_filter to be an icu_normalizer (token) filter, a few things happen.

  • Non-Latin characters that the icu_normalizer would convert to Latin characters are lost.
    • Good: Widget™ gets tokenized as widget and a comma instead of as widgettm.
    • Bad: SmartCN converts ① to a comma instead of a 1, which is unrecoverable.
  • Accented Latin characters are split by SmartCN.
    • Bad: straße is indexed as stra, ss, and e instead of strasse. is indexed as s and i instead of si. Paciﬁc (with an ﬁ ligature) becomes paci, ﬁ, and c.
    • One mitigating circumstance is that we do an okay job of ranking these tokens higher when they are together.
  • Zero-width spaces split words. The icu_normalizer just drops these, but SmartCN splits on them.
    • Bad: An example is 台儿庄 vs 台[ZWSP]儿庄. Without the zero-width space, it's tokenized as one word. With it, it is split into three: 台 + 儿 + 庄 (儿庄 by itself gets split in two). Looks like this is a technique sometimes used in applications that can't wrap Chinese text without breaks, or as a hint for better word breaks, but it's not common. Since one suggested usage is to put one between every character so the text wraps nicely in programs that don't process Chinese well, zero-width spaces can't reliably be interpreted as useful word breaks. It's easy to cut-n-paste them from another source without realizing it.
  • Left-to-right marks, right-to-left marks, and non-breaking spaces get tokenized separately.
    • Good: These don't get indexed. For example, 33[NBSP] gets indexed as 33, and so a query for 33 is an exact match, not a stemmed match.

The impact seems to be very small. Less than 1.5% of types and 0.1% of tokens seem to be affected.

I've re-purposed one of the Chinese indexes I set up in labs, and I've run RelForge on 1000 user queries to see what real-world effects this has.

One unfortunate aspect of this test is that the index snapshots used are from different times, meaning that some changes can be the result of actual differences in the index, rather than differences in the quality of the search. To deal with that, I re-built one of the indexes with a snapshot that matched the other. [Note: it made a big difference, even though the snapshots were only a week apart. Part of that is the small magnitude of the real search-related change here (< 5%), but it's bigger than I expected and thus important to keep in mind.]

Results

Categories of change below are sorted in terms of potential seriousness (i.e., assuming all changes are bad, how bad are they?). For example, swapping results #3 and #4 is a minor change (though the effect of position is important). Adding in a "new" result (i.e., not already in the top 20) in position #2 is a moderate change—all of the other results are still there, just moved down by one. Dropping one of the top 3 results is a big change, since a genuinely good result could have been lost. None of the changes are evaluated for quality (i.e., all the changes could be good! But we're treating them as if they are bad because it's less work and gives us a maximum potential bad impact. We could evaluate the quality of the changes if the results are sufficiently bad, but in this case we didn't need to.)

Changes that I consider to be "sufficiently bad" are in italics.

Zero Results Rate unchanged

Poorly Performing Percentage unchanged

Num TotalHits Changed: μ: 17.72; σ: 122.79; median: 0.00

...

#1 Result Differs: 0.6%

Num Top Result Changed: μ: 0.01; σ: 0.08; median: 0.00

6 samples—1 negation (Chinese), 1 Thai, 1 English (note that 0.6% of 1000 is 6—so this "sample" is all of them)

  • 1x small jump reordering (e.g., #1/#2 swap)
  • 1x new top result
  • 1x big jump reordering, dropping top result (e.g., #1 -> #12)
  • 1x top result missing
  • 1x 13 of top 20 new, slight reorder of others
  • 1x all top 20 results missing

Net result: ~ 0.4% of top results differ significantly (could be better or worse)

...

Top 3 Sorted Results Differ: 2.4%

Top 3 Unsorted Results Differ: 1.9%

Num Top 3 Results Changed: μ: 0.02; σ: 0.17; median: 0.00

19 samples—1 negation (Chinese), 1 Thai, 1 Arabic (note that 1.9% of 1000 is 19—so this "sample" is all of them; some of the results from above are included—though some are not; 5 of 19 are previously seen in the sample above)

  • 6x 1 new result in top 3
  • 4x small jump reordering (e.g., #4 or #5 -> #3 swap)
  • 1x big jump reordering, dropping top result (e.g., #1 -> #12) (same example as above)
  • 6x 1 of top 3 results missing, some slight reorder of others
  • 1x 13 of top 20 new, slight reorder of others (same example as above)
  • 1x all top 20 results missing (same example as above)

Net Result: ~ 0.9% of top 3 results differ significantly (could be better or worse)

...

Top 5 Sorted Results Differ: 4.6%

Top 5 Unsorted Results Differ: 2.9%

Num Top 5 Results Changed: μ: 0.04; σ: 0.26; median: 0.00

20 samples—1 negation (Chinese), 1 English, 1 Arabic (note that 2.9% of 1000 is 29, so this sample of 20 is more than two thirds of them, and some of the results from above are included—though some are not; 11 of 20 are previously seen in the samples above)

  • 3x small jump reordering (e.g., #6 -> #5 swap)
  • 6x 1 new result in top 5, slight reordering of others
  • 1x 2 new results in top 5, 3 other new results
  • 6x 1 of top 5 results missing, slight reorder of others
  • 2x 1 new result in top 5, 1 old result lost from top 5
  • 1x 13 of top 20 new, slight reorder of others (same example as above)
  • 1x all top 20 results missing (same example as above)

Net Result: ~ 1.5% of top 5 results differ significantly (could be better or worse)

Anecdota

  • A sample of 20 queries was tokenized with both configs and the tokens were the same. So differences in results sometimes come from the tokenization of the articles.
  • The one query where all 20 top results are different is a very unfocused query, and it starts with a ! (NOT). It has 930K results; the entire Chinese Wikipedia only has about 938K articles. I'm not even sure there is a right way to rank results when the only criterion is that they don't have certain search terms. Can one document be better at not having those terms than another? In the current zhwiki, the query returns all (or very nearly all) articles, because it is treated as the negation of a phrase.
  • The swaps ("small jump reordering" above) seem to be based on very small changes in scoring, which can be attributed to differences in term frequency stats. For example, if ① is/isn't indexed as 1, then there could be more/fewer 1s in some documents and more/fewer documents with 1 in them, thus changing one aspect of the score. The swap with #1/#2 above, for example, originally had scores of 739.45544 vs 738.3111 with the ICU char_filter, and 737.90393 vs 738.3442 with the ICU token filter. These are minute changes in score, which could also happen over time as articles are created or deleted.
  • In one case where 13 of 20 results changed, the original search was 不一样的美男子Ⅱ (不一样的美男子 is a TV show with two seasons, so presumably 不一样的美男子Ⅱ is referring to the second season). The article for 不一样的美男子 mentions 不一样的美男子2 (with a 2 rather than a pre-composed Roman numeral Ⅱ). With the char_filter, the Ⅱ is tokenized as ii and thus is a required match. With the token filter, the Ⅱ is discarded by SmartCN (i.e., converted to a comma) and so better results (including the article on 不一样的美男子) with a phrase match (which is no longer required, but still appreciated) are returned. This one is better, but kinda for the wrong reasons. Searching for 不一样的美男子2 does the right thing with both the char_filter and the token filter.

Char_Filter vs Token Filter Recommendations

While it would be possible to add another hack that would allow us to conditionally include the icu_normalizer char_filter when it is available, it would be even hackier than what we currently have. Right now we replace the lowercase token filter with the more general and more powerful icu_normalizer when available. For the char_filter hack, we'd need to put in a dummy char_filter that does nothing just so we could replace it when the ICU char_filter is available (or come up with some other mechanism to signal that it should be inserted in the right place), and we'd lose the benefit of the lowercase token filter (which is needed for Latin characters anyway).

I think it's arguable that the token filter is better, but it is clear that the impact is fairly small, and even if all of the changes turn out to be bad, it's still worth it to avoid the maintenance burden of another, more egregious hack to enable the ICU char_filter.

Follow-up: Punctuation and the Scourge of Commas

August 2017. See also T172653.

As noted above, lots of punctuation and other non-text characters get converted to commas and indexed. This is not ideal, so I've added a stopword filter to the Chinese config to drop commas. On a 10K-article corpus, dropping the punctuation reduced the number of tokens from 3,733,954 to 3,121,948, or about 16.4%! In that sample, 261 distinct characters were normalized to commas, including punctuation, currency symbols, quote-like symbols, precomposed Roman numerals, arrows, math and logic symbols, circled, parenthesized, and full-stop numbers and letters (Ⓐ, ⑴, ⒌), box-drawing symbols, geometric shapes, bidi and other hidden characters, paren-like symbols, and lots of other misc characters.
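The comma-dropping itself is just a standard "stop" token filter added to the chain; a sketch (the name and position are illustrative):

 # Sketch: a "stop" token filter that drops the comma tokens SmartCN produces for
 # punctuation and other non-text characters. Name and position in the filter list
 # are illustrative.
 drop_smartcn_commas = {
     "type": "stop",
     "stopwords": [","],
 }
 # ...added to the analyzer's token filters, e.g.:
 #     "filter": ["drop_smartcn_commas", "icu_normalizer"]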

Summary, Recommendation, & Plans

Summary

  • Chinese Wikis support Simplified and Traditional input and display. Text is stored as it is input, and converted at display time (by a PHP module called ZhConversion.php).
  • All Chinese-specific Elasticsearch tokenizers/segmenters/analyzers I can find only work on Simplified text. The SIGHAN 2005 Word Segmentation Bakeoff had participants that segmented Traditional and Simplified text, but mixed texts were not tested.
  • STConvert is an Elasticsearch plugin that does T2S conversion. It agrees with ZhConversion.php about 99.9% of the time on a sample of 1,000 Chinese Wiki articles.
    • It's good that they agree! We wouldn't want conversions for search/indexing and display to frequently disagree; that would be very confusing for users.
    • Using a Traditional to Simplified (T2S) converter greatly improves the performance of several potential Elasticsearch Chinese tokenizers/segmenters on Traditional text. (Tested on SIGHAN data.)
    • I uncovered two small bugs in the STConvert rules. I've filed a bug report and implemented a char_filter patch as a workaround.
  • SmartCN + STConvert is the best tokenizer combination (on the SIGHAN data). It performs a bit better than everything else on Traditional text and much better on Simplified text.
    • Our historically poor opinion of SmartCN may have been at least partly caused by the fact that it only really works on Simplified characters; and so it would perform poorly on mixed Traditional/Simplified text.
    • There are significantly fewer types of words (~16%) with SmartCN + STConvert compared to the current prod config, indicating more multi-character words are found. About 28% of tokens now have new words they are indexed with (i.e., mostly Traditional and Simplified forms being indexed together).
    • Search with SmartCN + STConvert works as you would hope: Traditional and Simplified versions of the same text find each other, highlighting works regardless of underlying text type, and none of the myriad quotes (" " — ' ' — “ ” — ‘ ’ — 「 」 — 『 』 — 「 」) in the text affect results. (Regular quotes in the query are "special syntax" and do the usual phrase searching.)
    • SmartCN + STConvert doesn't tokenize some non-CJK Unicode characters as well as one would like. Adding the icu_normalizer as a pre-filter fixes many problems, but not all. The remaining issues I still see are with some uncommon Unicode characters, such as IPA characters and slash-like characters (½, ℀, ℅). Searching for most of them works as you would expect (except for numerical fractions).

Recommendation

  • Deploy SmartCN + STConvert to production for Chinese wikis after an opportunity for community review (and after the ES5 upgrade is complete).

Next Steps: Plan & Status

  • ✓ Wait for RelForge to be downgraded to ES 5.1.2; when it is declared stable ("stable for RelForge") re-index Chinese Wikipedia there: Mostly done: A re-indexed copy of Chinese Wikipedia from mid-March using STConvert and SmartCN is available on RelForge. Because of my inexperience setting these things up, I set it up using the defaults—i.e., BM25 (the newer preferred term-weighting implementation) and a few other search tweaks not normally used on spaceless languages. See below for plans to deal with this.
    • ✓ Let people use it and give feedback: Done—still possibly more feedback coming. While there are some problems, impressions are generally positive.
    • ✓ Test that everything works in ES5 as expected: Done—it does.
  • ✓ Set up another copy of Chinese Wikipedia on the RelForge servers, using the "spaceless language" config. Done
    • ✓ use RelForge to see how big of a difference BM25 vs spaceless config makes with the new analysis chain. Done—the majority of the change in results comes from the BM25, et al., changes.
    • ✗ get native speaker review of a sample of the changes. Skipped because of the results of the BM25 comparison.
    • ✓ decide how to proceed: deploy as spaceless, deploy as BM25, or do an A/B test of some sort. Done—deploy with BM25!
  • ✓ If/When vagrant update to ES5 is available, test there, in addition/instead: Done. Config is slightly different for the new version of STConvert, but everything works the same.
  • ✓ Update the plugin-conditional analyzer configuration to require two plugin dependencies (i.e., SmartCN and STConvert)—currently it seems it can only be conditioned on one. Done.
  • ✓ After ES5 is deployed and everything else checks out:
    • ✓ deploy SmartCN and STConvert to production (T160948) Done, though it won't have any effect until we re-index.
    • ✓ enable the new analysis config (T158203) Done, though it won't have any effect until we re-index.
    • ✓ enable the search config to switch to BM25 (T163829) Done, and deployed!
    • ✓ re-index the Chinese projects (T163832). Done, and deployed!
  • Update Vagrant/Puppet config to make SmartCN and STConvert available. (waiting on updates to how we deploy plugins to prod and vagrant)
  • August 2017:
    • Deploy punctuation change config (T172653)
    • re-index Chinese projects (T173464)