User:TJones (WMF)/Notes/Stempel Analyzer Patch Filters
July 2018 — See TJones_(WMF)/Notes for other projects. See also T186046. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.
Back in February 2017 I did an analysis of the Stempel Polish analyzer before we deployed it to Polish-language wikis. One of the surprising things I found was that some of the stems generated by the stemmer, which is statistical in nature, are fairly absurd. (As before, if you aren't familiar with Polish, it's handy to know that most verb infinitives end in -ć, and that -ać and -wać are also very common endings. Take a look at the list of Polish verbs on English Wiktionary.)
In particular, a number of words, especially non-Polish words, end up being stemmed as ć, which I interpret as an "empty verb", with the typical infinitive verb ending applied. One of the words that gets stemmed as ć is English button, so we've been referring to this as the "button problem".
At the time, we set up a RelForge test instance of Polish Wikipedia with the new stemmer and generally it seemed fine, though there were a few specific instances of ridiculous results. This is particularly likely when you search for a very rare word that shares a stem with an unrelated very common word, in which case the common word swamps the rare word in the results. Generally, though, exact matches from the plain field overcome weird matches from the text field, at least for the top few results.
We looked at the problem again at a search team offsite after the machine learning ("learn to rank" or "LTR") scorer was enabled, and it seemed worse. Though, reading over my earlier notes again, maybe it isn't actually much worse—results for some absurdly stemmed words were originally fine, some iffy, and some terrible; that seems to still be the case.
Nonetheless, talking about it we realized that we could take some of the most ridiculous stems, like ć, and treat them as stop words. Then, only plain field (i.e., unstemmed) matches would be possible. So, that's what we're testing here.
As usual, I took a sample of 10,000 random articles from Polish Wikipedia, and 10,000 random entries from Polish Wiktionary. I deduplicated lines to minimize/normalize the frequency of "wiki words", such as the equivalents of "References", "See Also", "Noun", etc.
I also pulled 10,000 random queries from each of Polish Wikipedia and Wiktionary, so that I could get a sense of the impact of the Stempel errors on queries as well as article/entry text. Also, if I need to, I can use those queries as part of a general regression test for any updates, via RelForge.
Taking a Second Look at Stempel
My plan is to reanalyze the monolithic Stempel analyzer, unpack the Stempel analyzer, make sure everything looks good, and then introduce stop words or other filters to get rid of the worst stems. As I've noted before, our default tool config converts the "lowercase" filter to the "icu_normalizer" filter, which lowercases and also normalizes some variants of various Unicode characters. So, I was planning to do it as a multi-step process—unpack and keep "lowercase", unpack and use "icu_normalizer", add stop words and other filters.
Since Stempel/Polish was my first time trying to analyze a new stemmer, my analysis tools were not built for the job. I made some improvements that made it easier to see what was going on, but the tools were still pretty new. Since then, every time I found some weird situation, I updated my tools to make it easier to detect that particular problem. Re-running the tools on Stempel has revealed some surprising new errors! Yay!
- "The best-laid schemes o' mice an' men / Gang aft agley" —Robert Burns
When I unpacked the stemmer, using the generic Elasticsearch "lowercase" and Stempel "polish_stem" filters, I discovered that I had a whole bunch of new tokens that I didn't have before!
Going back and comparing the default analyzer to the baseline Stempel analyzer (which is in production), I discovered and then verified on the command line that all of the following words—from samples taken from Polish Wikipedia articles, Polish Wiktionary entries, and queries to both of those wikis—were dropped by Stempel!
- Aby • Albo • Ale • Ani • Aż • Bardzo • Będą • Będzie • Bez • Bo • Bowiem • BY • Być • Był • Była • Byli • Było • Były • bym • Chce • Choć • CO • Coraz • Coś • Często • Czy • Czyli • Dla • DO • DR • Gdy • Gdyby • gdyż • Gdzie • GO • Godz • Hab • I • ICH • II • III • Im • INNE • inż • IV • IX • Iż • JA • Ją • Jak • Jakie • Jako • Je • Jednak • Jednym • Jedynie • Jego • JEJ • Jeśli • JEST • Jeszcze • Jeżeli • Już • KIEDY • Kilku • KTO • Która • Które • Którego • której • Który • których • którym • Którzy • LAT • Lecz • Lub • MA • Mają • Mamy • MGR • MI • Miał • Mimo • Mnie • Mogą • Może • Można • Mu • Musi • NA • Nad • Nam • NAS • Nawet • Nic • nich • NIE • niej • Nim • Niż • NO • Nowe • NP • NR • O • OD • OK • ON • ONE • Oraz • PAN • PL • PO • POD • Ponad • Ponieważ • Poza • Prof • Przed • Przede • Przez • PRZY • Raz • razie • ROKU • Również • Są • Się • sobie • Sposób • Swoje • TA • TAK • Takich • Takie • Także • Tam • TE • Tę • Tego • Tej • Tel • Temu • TEN • Teraz • też • TO • Trzeba • TU • Tych • Tylko • TYM • tys • Tzw • U • UL • VI • VII • VIII • Vol • W • WE • Wie • Więc • Właśnie • Wśród • WSZYSTKO • WWW • XI • XII • XIII • XIV • XV • Z • ZA • Zaś • Ze • Że • Żeby • ZŁ
In the Wikipedia corpus, for example, these dropped words account for about 22% of all tokens. The word w (roughly "in" or "on") is 4% of all tokens by itself!
At first I didn't think those could be stop words because of the Roman numerals, but I popped the list into Google Translate, and looked up a few on English Wiktionary, and they are mostly plausible stop words.
Digging a little deeper, I found the Elasticsearch Stempel/Polish code on GitHub, and there it is invoking a Lucene analyzer with a stop word list. I tracked down the stop word list in Lucene, and discovered it comes from Carrot2. It is also available on GitHub from Carrot2 with a BSD-style license. They have other stop word lists, too. Some seem to be empty, but it's a nice resource to remember for future use.
Looking at the code and the behavior of the monolithic Stempel analyzer, stop words are filtered before the stemmer, as is typical with Elasticsearch analyzers. So, we can replicate that behavior using the same list from Carrot2.
An interesting note: our data is pretty comprehensive, in terms of finding all the stop words. The only one that didn't make my list was o.o. because we tokenize it as o.o and it doesn't match. It's part of an abbreviation, sp. z o.o., which means "limited liability company". I'll add both forms to my stop word list.
Enable Polish Stop Words
I added the list of Carrot2 stop words, plus 'o.o', to my unpacked analyzer and the only difference from the monolithic Stempel analyzer was that 'o.o' is being filtered now. Not too surprisingly, 'o.o(.)' only occurs in the Wikipedia article and query data, not in Wiktionary.
Okay, so we're back to the unpacked baseline!
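To make the unpacking concrete, here is a minimal sketch of what such a configuration might look like, written as a Python dict that serializes to Elasticsearch-style index settings. The analyzer name "text", the filter name "polish_stop", and the helper function are hypothetical, and the stop word list is only a short excerpt (lowercased words from the dropped-token list above) rather than the full Carrot2 list. The point it illustrates is the ordering: lowercase, then stop words, then the "polish_stem" filter.

```python
import json

# Short excerpt only: lowercased words taken from the dropped-token list above.
# The real Carrot2 list is much longer.
STOPWORDS_EXCERPT = ["aby", "albo", "ale", "i", "w", "z", "że"]
# Extra entries covering both tokenized forms of the abbreviation in "sp. z o.o."
EXTRA_STOPWORDS = ["o.o", "o.o."]


def unpacked_polish_settings(stopwords):
    """Build settings for a hypothetical unpacked Stempel-style analyzer."""
    return {
        "analysis": {
            "filter": {
                # "polish_stop" is an illustrative name, not one from the notes.
                "polish_stop": {"type": "stop", "stopwords": list(stopwords)},
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Stop words are removed before stemming, as in the
                    # monolithic analyzer.
                    "filter": ["lowercase", "polish_stop", "polish_stem"],
                },
            },
        }
    }


settings = unpacked_polish_settings(STOPWORDS_EXCERPT + EXTRA_STOPWORDS)
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Swapping "lowercase" for "icu_normalizer" in the filter list would be the next step in the multi-step plan described earlier.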
We routinely upgrade the lowercase filter to ICU normalization, though I'd disabled that upgrade previously so that I was comparing apples to apples when unpacking the Stempel analyzer.
I didn't expect to lose any tokens, but we did lose a few, which got normalized to stop words (Polish has a number of one-letter stop words):
- "Ｚ" (full width Z) gets normalized to z
- "𝕎" (double-struck W) gets normalized to w
- º (masculine ordinal indicator—though it also gets used as a degree sign) gets normalized to o
We see the usual good ICU normalizations:
- º and ª to o and a.
- µ (micro sign) to μ (Greek mu), which are indistinguishable in many fonts. I found a few Greek words with the micro sign instead of mu in them, too!
- Greek ς to σ
- IPA modifier letters, like ʰ, ʷ, ˠ, ʲ, ˢ converted to plain letters
- strip invisible characters like bidirectional text markers, soft hyphens, zero-width non-joiners
- lots of German ß to ss
- normalize typographical variants, like ﬁ to fi
- normalization of Thai characters
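Several of the normalizations listed above can be reproduced approximately with Python's standard library. The sketch below assumes the ICU normalizer behaves roughly like NFKC normalization plus case folding; it is not the ICU implementation, and it does not strip invisible characters such as soft hyphens the way the real filter does. The function name is made up for illustration.

```python
import unicodedata


def approx_nfkc_casefold(text):
    """Rough stand-in for ICU normalization: NFKC, case folding, NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


# Full-width Z, double-struck W, ordinal indicators, micro sign, final sigma,
# sharp s, the fi ligature, and an IPA modifier letter.
examples = ["Ｚ", "𝕎", "º", "ª", "µ", "ς", "ß", "ﬁ", "ʰ"]
for ch in examples:
    print(f"U+{ord(ch):04X} {ch!r} -> {approx_nfkc_casefold(ch)!r}")
```

Under this approximation the full-width Z, the double-struck W, and º come out as z, w, and o, which is how they end up matching one-letter Polish stop words.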
And the one recurring ICU normalization failure: İ (dotted I) is not normalized to I. It occurs in names like İstanbul.
The overall impact of the normalization is low, and generally positive:
In the Wikipedia corpus, 70 pre-analysis types (0.028% of pre-analysis types) / 119 tokens (0.007% of tokens) were added to 68 groups (0.045% of post-analysis types), affecting a total of 522 pre-analysis types (0.206% of pre-analysis types) in those groups.
Almost all the collisions are German ß to ss.
In the Wiktionary corpus, 10 pre-analysis types (0.014% of pre-analysis types) / 12 tokens (0.006% of tokens) were added to 10 groups (0.016% of post-analysis types), affecting a total of 38 pre-analysis types (0.053% of pre-analysis types) in those groups.
These collisions were generally good, though there were a few that interacted with the weird stemming results, so ließ got normalized to liess which then stemmed to ować—it's not really the normalization that's at fault here.
There were also some splits, but all were caused by dotted I, and I added a character filter to handle dotted I.
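One plausible way to see why dotted I causes splits is to look at what full Unicode lowercasing does to it. The sketch below uses Python's own lowercasing rather than ICU, so it is only an analogy: İ lowercases to i followed by a combining dot above (U+0307), which no longer matches a plain i. Mapping İ to I before lowercasing avoids that. In Elasticsearch this would presumably be a "mapping" character filter with a rule like "İ=>I", though the exact filter used here is an assumption; the names below are illustrative.

```python
# Map dotted capital I (U+0130) to plain I before any lowercasing happens.
DOTTED_I_MAP = str.maketrans({"\u0130": "I"})


def lowercase_with_dotted_i_fix(text):
    """Apply the dotted-I mapping, then lowercase."""
    return text.translate(DOTTED_I_MAP).lower()


plain = "\u0130stanbul".lower()
fixed = lowercase_with_dotted_i_fix("\u0130stanbul")

# Without the mapping, the result starts with 'i' plus a combining dot above.
print([f"U+{ord(c):04X}" for c in plain[:2]])
print(plain == "istanbul", fixed == "istanbul")
```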
In my previous analysis, I noticed that almost all one-letter stems and most two-letter stems were conflating completely unrelated words. The worst offender is clearly the stem ć, which conflates—among many, many others—0,31, 1231, Adrien, Baby, Chloe, Defoe, Espinho, Fitz, Girl, Hague, Issue, Judas, Klaas, Laws, Mammal, News, Otóż, Pains, Qaab, Right, Sears, Trask, Uniw, Value, Weich, XLIII, Yzaga, Zizinho, and Ładoś.
In my original quick survey, I noted that almost all the one- and two-character stems in my report were bad. I overgeneralized a bit, because plenty of one- and two-letter stems are fine, particularly numbers and characters in other writing systems. A more careful analysis of the one-letter stems reduces the set to a-z, ć, or ń. For the two-letter stems, they all start with a-z, ą, or ł and end with the same set as the one-letter stems. Three-letter stems ending in ć are pretty bad, too. The stemmer also doesn't like numbers that end in 1 or 31, so anything that starts with a number and ends in ć is also suspect.
The pattern-based filters can only replace the matched part of the token, which in our case is the whole token; I replace it with the empty string, but then I need a length-based filter to drop all the zero-length tokens.
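The two-stage mechanism (blank out the whole token, then drop zero-length tokens) can be simulated in a few lines of Python. This is a sketch rather than the actual filter chain: the patterns and stop words are the ones reported in the results that follow, but matching them against the whole token with fullmatch is an assumption about how the real filters are anchored, and all function and rule names are made up.

```python
import re

# Patterns as reported in this write-up; whole-token anchoring is assumed.
BAD_STEM_PATTERNS = {
    "short": re.compile(r"[a-zął]?[a-zćń]"),
    "three_letter_c": re.compile(r"..ć"),
    "digit_c": re.compile(r"\d.*ć"),
}
BAD_STEM_WORDS = {"ować", "iwać", "obić", "snąć", "ywać", "ium", "my", "um"}


def matching_rules(stem):
    """Return the names of every rule that would blank out this stem."""
    rules = [name for name, pat in BAD_STEM_PATTERNS.items() if pat.fullmatch(stem)]
    if stem in BAD_STEM_WORDS:
        rules.append("stop_word")
    return rules


def blank_bad_stem(stem):
    """Stage 1: replace a matched stem with the empty string."""
    return "" if matching_rules(stem) else stem


def filter_stems(stems):
    """Stage 2: a minimum-length filter drops the zero-length tokens."""
    blanked = (blank_bad_stem(s) for s in stems)
    return [s for s in blanked if len(s) >= 1]


stems = ["ć", "ur", "abć", "12ć", "ować", "warszawa", "123", "д"]
print(filter_stems(stems))
print(matching_rules("12ć"))
```

Numbers and single characters from other writing systems pass through untouched, which matches the narrower set of bad stems described above.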
I enabled the filters one-by-one to check that they did the right thing. The length filter by itself does nothing—as expected, since there aren't supposed to be any zero-length tokens before I enable the pattern-based filters.
I only tested the filters enabled one-by-one on the Wikipedia corpus, since I was mostly looking for correct function, though I did note the relative impact. Recall that the Wikipedia corpus has 253,092 pre-analysis types and 1,590,840 tokens.
- [a-zął]?[a-zćń] (one- and two-letter tokens): removed 2328 pre-analysis types (0.92%), 61753 tokens (3.88%)
- ..ć (three-letter tokens ending in ć): removed 1423 pre-analysis types (0.56%), 14625 tokens (0.92%)
- \d.*ć (tokens starting with numbers and ending in ć): removed 579 pre-analysis types (0.23%), 7808 tokens (0.49%). A small handful of these are sub-optimal, like 1998kolonia and 3delight, but the plain field will still match them exactly if anyone ever searches for them.
- ować, iwać, obić, snąć, ywać, ium, my, um (specific stop words): removed 72 pre-analysis types (0.03%), 666 tokens (0.04%)
There's definitely some overlap between categories; e.g., 12ć would be filtered by either the ..ć pattern or the \d.*ć pattern.
Next I enabled all of the Stempel-specific filters, and tested on both the Wikipedia and Wiktionary corpora. For comparison with the Wikipedia numbers above, the Wiktionary corpus has 72,345 pre-analysis types and 209,414 tokens.
- Wikipedia: all the filters together removed 4097 pre-analysis types (1.62%), 77127 tokens (4.85%)
- Wiktionary: all the filters together removed 1221 pre-analysis types (1.69%), 20321 tokens (9.70%)
I'm always a bit worried when there are too many regexes involved, but the not-very-precise timings on my laptop to run the Wikipedia data above showed no real difference between running any one of the filters vs all of the filters, so speed is probably not an issue.
There is, however, quite a big skew between the corpora, with a very similar percentage of pre-analysis types affected, but nearly double the percentage of tokens affected. I looked into the specifics in the Wiktionary data and the top five dropped tokens are: f (4090), m (3871), lp (1381), lm (1373), and n (1222), which total 11,937 tokens (58.7% of all dropped tokens). Those are the dictionary abbreviations for feminine, masculine, singular, plural, and neuter; ignoring those very dictionary-specific tokens, only 4.0% of tokens are dropped, which is in line with the numbers for Wikipedia.
And, just for the sake of interest, the top dropped tokens in the Wikipedia corpus were a (7591, "and"), ur (3798, abbrev. for "born"), r (3601, abbrev. for "year"), zm (2325, abbrev. for "died"), of (1896, English "of"), de (1893, "of" in several Romance languages), km (1847, mostly "kilometer"). That's 22,951 tokens, which is only 29.8% of the dropped tokens in Wikipedia, but still a significant portion that are clearly lower-value words.
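The percentages in the last two paragraphs can be rechecked from the counts given. The sketch below assumes the 4.0% figure was computed against the full Wiktionary token count (209,414) rather than against the count with the abbreviations removed; the variable and helper names are illustrative.

```python
def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Top dropped tokens and totals, as reported above.
wiktionary_top = {"f": 4090, "m": 3871, "lp": 1381, "lm": 1373, "n": 1222}
wiktionary_dropped = 20321
wiktionary_tokens = 209414

wikipedia_top = {"a": 7591, "ur": 3798, "r": 3601, "zm": 2325,
                 "of": 1896, "de": 1893, "km": 1847}
wikipedia_dropped = 77127

wikt_abbrevs = sum(wiktionary_top.values())
wiki_common = sum(wikipedia_top.values())

# Share of dropped Wiktionary tokens that are dictionary abbreviations.
print(wikt_abbrevs, pct(wikt_abbrevs, wiktionary_dropped))
# Dropped share of all Wiktionary tokens once those abbreviations are ignored
# (assumed denominator: the full token count).
print(pct(wiktionary_dropped - wikt_abbrevs, wiktionary_tokens))
# Share of dropped Wikipedia tokens covered by the top seven.
print(wiki_common, pct(wiki_common, wikipedia_dropped))
```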
Conclusions and Next Steps
I'd be a bit worried about dropping all of these words from the index if we didn't have the plain field. There's always a trade-off between precision (getting only the relevant results) and recall (getting all the relevant results). Stemming with Stempel greatly improves recall, but the stemming errors decrease precision. Dropping the worst stems improves precision, but greatly decreases recall. The plain field index saves a lot of the recall (and may even improve precision on some really short words). We're bouncing back and forth, but slowly converging on the best configuration.
Next steps include: