User:TJones (WMF)/Notes/Stempel Analyzer Analysis

February 2017 — See TJones_(WMF)/Notes for other projects. See also T154517.

Overview
One of our quarterly goals is to find and deploy analyzers for languages where we currently don't have them. Polish made a good choice for the first one to tackle, since the Stempel analyzer is recommended and maintained by Elasticsearch, and we have some Polish-language expertise within Discovery (not me—thanks Jan!—but I know a lot more about Polish morphology than I did before).

I've discovered some surprising bugs in the Stempel stemmer, detailed below. I've also contacted some of the developers of the stemmer, and there is some chance of fixes being made at some point—though nothing definite yet.

I've also been told by one of the implementers at Apache that the stemmer is statistical in nature, so some weird behavior, especially on non-Polish input (e.g., URLs or English words) is to be expected.

Also, if you aren't familiar with Polish, it's handy to know that most of the infinitives of verbs end in -ć, and -ać and -wać are also very common. Take a look at the list of Polish verbs on English Wiktionary.

A live version of the Polish Wikipedia Index with the Stempel analyzer enabled is available at pl-wp-stempel-relforge.wmflabs.org. (Note that this is only the index, so you can search and read snippets, but there are no content pages. All links are red links.)

Tool Update
Since this was my first foray into adding a new analyzer from scratch, it turns out that my analyzer analysis tool wasn't up to the job.

My previous analyzer analysis tool was built on the assumption that only small tweaks were being made to the analysis, and that most stems for most words would remain unchanged. This was appropriate for previous analyzer tweaks (unpacking the Elasticsearch analyzer, enabling ASCII folding or ICU folding), but didn't work for changing one stemmer for another (or introducing one where there was none before).

So, I updated the analyzer analysis tool to do an "in-place" analysis of the analyzer's performance, look for and account for regular morphology, and highlight potential problems.

As a future 10% project, I plan to clean up and release the analyzer analysis tool as part of RelForge.

For better or worse, our next likely language to investigate is Chinese, which doesn't need stemming, but rather word-splitting, so the analyzer analysis tool will be less helpful—though other tools to assess word splitting exist, fortunately.

Data & Initial Manual Analysis
I pulled samples of 1,000 and 10,000 articles from the Polish Wikipedia as examples of typical formal Polish text, with a decent sampling of non-Polish text mixed in (e.g., media titles). I de-duped the samples at the individual line level to prevent over-representation of Wikipedia-specific text (the equivalent of "See Also", "References", "External Links", etc., in the English Wikipedia), and any other highly repetitive text from individual articles.

I tokenized and stemmed the text, counted the number of original tokens of each type, and grouped types based on the common stem (i.e., grouping the words that would count as being "the same" at search time).
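The grouping step can be sketched roughly as below. This is a minimal illustration, not the actual analysis tool: `toy_stem` is a hypothetical stand-in for the real stemmer (the real tokens and stems came from the Elasticsearch analyzer), and the regex tokenizer is likewise an assumption.

```python
import re
from collections import Counter, defaultdict


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few endings."""
    word = token.lower()
    for ending in ("ami", "ach", "a", "y"):
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


def group_by_stem(text, stem=toy_stem):
    """Count each distinct token (type), then group the types by their stem."""
    tokens = re.findall(r"\w+", text)  # crude tokenizer, an assumption
    type_counts = Counter(tokens)
    groups = defaultdict(dict)
    for word_type, count in type_counts.items():
        groups[stem(word_type)][word_type] = count
    return dict(groups)


def format_group(stem_form, members):
    """Render a group as 'stem: Type (count), ...', sorted by code point."""
    parts = [f"{word} ({count})" for word, count in sorted(members.items())]
    return f"{stem_form}: " + ", ".join(parts)
```

With a real stemmer passed in as `stem`, each formatted group has the same shape as the example groups shown further down.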

On the one hand, 1,000 articles is enough, and 10,000 is more than enough, to get a decent sample of all the most common words in a language. On the other hand, Polish Wikipedia has 1.2 million articles, so rarer words (both Polish and foreign) and other strings (numbers, ISBN numbers, URLs, etc.) are not all represented in the sample.

So, in a larger sample, we would expect more collisions—i.e., more words that are grouped together inappropriately—but this smaller sample should give a good sense of the general magnitude of the problem.

From the 1K-article sample, I pulled 100 random groups (with more than one word in the group), and the 35 largest groups (those most likely to have something weird in them), which Jan kindly reviewed. Of the 100 random groups, all were judged to be accurate, and of the 35 large groups, all but four were judged to be accurate. Of those four, two were ones I had already picked out as clearly having errors.

One example of a group with the expected kind and number of errors:

"brać: Bierze (7), Biorą (2), Biorąc (6), Biorę (2), Braci (29), Brali (2), Brana (1), Brass (2), Brazeau (1), Braćmi (2), Brał (79), Brała (8), Brało (4), Brały (3), Brood (2), bierz (1), bierze (55), biorą (28), biorąc (38), biorący (7), biorę (1), braci (79), bracie (5), braciom (3), brali (34), braliby (1), brana (1), brane (6), brania (1), branie (5), brano (2), brać (10), braćmi (22), brał (163), brała (34), brało (21), brały (24)"

Numbers after each word indicate the number of times it appeared in the 1K-article sample.

brać is the stemmed form, and the infinitive of the verb "to take," and most words here are forms of brać. Words like brass and brassy sort of make sense; most words in the corpus that end in -ass are English, so I'd speculate that some sort of phonetic and/or orthographic similarity rules are converting -ss to -c to -ć—or something similar. braćmi is the plural instrumental case of brat ("brother"), but -ami is a common and apparently very regular suffix, so mis-stemming it as brać is understandable. These errors are not good, and some are kinda dumb, but they are understandable.

In contrast, consider the list below, all of which are stemmed to ć. I believe this is caused by some series of glitches in the statistical stemmer, which end up reducing these words to a kind of "null verb"—the infinitive ending ć and nothing else:

"ć: 0,31 (1), 0031 (1), 1031 (6), 1131 (9), 1231 (6), 1331 (5), 1431 (9), 1531 (11), 1631 (15), 1731 (13), 1831 (115), 1931 (261), 2,31 (2), 2031 (1), 2131 (5), 2331 (2), 2431 (1), 2631 (1), 3031 (2), 3431 (1), 3731 (1), 4,31 (1), 4031 (1), 4531 (2), 4631 (1), 6,31 (2), 8331 (5), Adrien (3), Alainem (1), Anderb (3), Audyt (1), Awiwie (8), Ayres (1), Baby (46), Badża (1), Bains (7), Barisza (2), Batorzu (1), Batton (1), Batz (1), Bazy (10), Beau (6), Benue (2), Beroe (1), Betinho (1), Bogue (1), Boruca (1), Botz (1), Bronksie (1), Brydż (2), Bugue (1), Buon (1), Button (3), Błaga (1), CIDOB (1), CLAAS (2), CSNY (1), Caan (1), Ch'oe (1), Chingle (2), Chloe (5), Claas (12), Coins (3), Conseil (2), Conso (5), Corll (1), Cotton (11), Cramp (1), Czto (1), D.III (3), Daan (1), Daws (1), DeLuca (4), Defoe (1), Demy (1), Deol (1), Detroit (38), Drag (4), Drau (4), Dryń (1), Dutton (4), Duty (13), Dziób (6), EIRO (1), Edyp (9), Espinho (1), FESPACO (1), FODŻ (1), Fitz (6), Frag (1), Frau (6), GRAU (3), Gaan (1), Gaon (1), Gatz (1), Gazdag (1), Girl (81), Glenda (1), Godinho (1), Gotz (1), Grasso (1), Grau (1), Grodnie (17), Grun (1), Gula (3), Gwent (1), Götz (1), Haab (1), Hague (2), Hatton (3), Heim (4), Hetz (1), Hogue (1), Horyń (3), Hosszu (1), Hutton (4), Iberii (2), Inoue (4), Ironi (1), Issue (2), Izmaile (2), JX31 (3), Jakusho (4), Jedyn (1), Jigoku (1), Jonae (1), Jozue (1), Judas (3), Kamp (6), KarolK (2), Kasei (2), Katz (5), Keiki (1), Kiuas (1), Klaas (1), Kmdr (3), Konon (1), Kotz (1), Krag (1), Kringle (1), Któż (1), Kutz (4), Lague (1), Laon (1), Laws (7), Letz (1), LiCl (1), Liban (9), Ligue (25), Litton (1), Logue (1), Loja (2), Londyn (91), Lotz (1), Luau (1), Lunae (1), Lutz (8), MATZ (1), Maeue (1), Mains (1), Mammal (1), Marh (1), Marinho (3), Masii (1), Matz (2), Maurcie (1), Mesrob (1), Metz (16), Michle (1), Midas (1), Miras (1), Miraś (1), Montjoie (1), Monty (4), NEWS (1), NSDAB (2), Nadab (1), Nadala (3), Netz (1), News (29), Nieszawa (5), Nikei (1), Nimue (10), Noam (1), Nuxalk (1), Nyam (1), Ochli (1), Olień (1), Oloś (1), Osioł (1), Otóż (2), POTZ (1), PZSzerm (1), Pains (1), Patton (8), Paws (1), Pearl (35), Pedoe (1), Pique (1), Pobrzeża (1), Pono (1), Powsinie (1), Prag (1), Praha (9), Pringle (1), Progi (2), Prosper (7), Prońko (11), Psioł (1), Pt31 (1), Putz (1), Pâque (5), Qaab (2), RODACY (1), ROWS (1), Raab (2), Raini (1), Ratz (1), Reich (9), Reim (1), Retz (1), Revue (20), Right (20), Roam (2), Roary (1), Rogiedle (1), Rogue (2), Roque (1), Ruins (1), Rutki (1), Ryan (41), SAAB (1), SHANIDZE (1), SOAR (1), Saab (23), Saien (1), Sanona (1), Sears (2), Seiki (1), Semo (5), Shingle (1), Siecsław (1), Sitz (3), Skeel (1), Skopje (16), Skou (1), Soar (1), Sperl (1), Steel (20), Strip (11), Suau (1), Sudża (1), Suez (4), Sutton (4), Słota (1), Tadż (2), Taira (3), Teiki (1), Tiras (1), Toluca (1), Traiana (1), Trask (1), Turinga (5), Tęczy (2), UKWZ (1), Uniw (2), Vadinho (2), Value (3), Vogue (2), Voor (1), Vows (1), WKiŁ (2), WSRH (1), Weich (1), Wenko (1), Westy (1), Wheel (1), Wiza (1), Wschowie (5), Wuyue (3), Wybrzeża (32), Wyraz (4), XLIII (1), XVIII (361), XXIII (33), Yzaga (1), Zakręcie (2), Zizinho (1), already (1), arag (2), atque (2), baby (3), bazy (81), beim (1), benue (1), brydż (2), błaga (3), celu (534), czcią (9), czto (1), dietą (2), drag (1), duque (1), dziób (9), dżingle (1), elegie (1), frag (1), gaan (1), geol (1), girl (2), heim (1), idque (1), jakei (1), kmdr (34), mahoe (1), mirrą (1), mmol (2), modą (2), nabrzeża (1), nalewki (1), news (5), oddala (5), okala (1), pains (1), poecie (7), posł (1), prag (1), progi (7), right (5), rodacy (5), seiki (6), sliw (1), stins (1), strip (1), szosą (4), tadż (1), togę (1), trag (1), tęczy (5), usque (3), venae (8), venue (1), vogue (1), voor (4), vrau (1), wabiący (1), wasei (1), widmem (1), worku (4), wsiach (11), wsiami (11), wybrzeża (63), wydala (1), wyraz (32), zakręcie (1), ć (1), Ładoś (1)"

Further Semi-Automatic Analysis
As part of my analysis, I looked for common beginning and ending substrings in words grouped together under a common stem. (I'm using "prefix" and "suffix" to refer to morphological affixes, and "beginning" and "ending" to refer to common substrings of words. Otherwise I may need to write sentences like "There's no common prefix for the grouping because several of the words have different prefixes," and none of us will make it out of here still sane.) I case-folded the strings (so uppercase doesn't matter). If case-folding failed to find common beginning or ending substrings, I applied generic ASCII-folding (which mostly only helped match a few cases of o/ó and s/ś alternations).
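A rough sketch of the common-substring step follows. The function names are hypothetical, the folding is a generic Unicode approximation rather than the folding the tool actually used, and falling back only when neither a beginning nor an ending is found is my reading of the description above.

```python
import os
import unicodedata


def ascii_fold(word):
    """Generic folding: decompose, then drop combining marks (ó -> o, ś -> s).

    Letters with no Unicode decomposition, such as ł, pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def common_edges(words):
    """Return the (beginning, ending) substrings shared by every word."""
    beginning = os.path.commonprefix(words)
    ending = os.path.commonprefix([w[::-1] for w in words])[::-1]
    return beginning, ending


def common_edges_folded(words):
    """Case-fold first; fall back to ASCII-folding only if nothing is shared."""
    folded = [w.casefold() for w in words]
    beginning, ending = common_edges(folded)
    if not beginning and not ending:
        beginning, ending = common_edges([ascii_fold(w) for w in folded])
    return beginning, ending
```

For example, `common_edges_folded(["Brał", "brała", "brało"])` finds the shared beginning `brał` and no shared ending.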

I used the identification of common beginning and ending substrings to automatically identify candidates for "easy" morphological prefixes and suffixes the stemmer knows about. Polish has a lot of variation in suffixes and many are only one letter and can alternate with nothing, so I ignored those. But there are only five obvious prefixes that get stripped: nie-, naj-, anty-, nad-, and bez- (roughly "un-", "-est", "anti-", "super-", and "-less"). Wiktionary has plenty more Polish prefixes, but these five are the easily identifiable ones that are being stripped in at least some cases.

More complex morphological processes, in particular those that change the spelling of the word root, won't be captured correctly, or even at all, but identifying the "obvious" ones makes it easier to find the weird ones, and to ignore "understandable" mistakes.

"A brief aside to explain what I mean by an 'understandable' mistake: if you give the German word Frühling ('spring', also a surname) to the English analyzer, it reasonably tries to treat it as an English word, and strips the -ing suffix and indexes it as Frühl. And if you search English Wikipedia for frühl, you'll get results for Frühling—generally titles in German and German surnames. This is wrong, but it is also understandable from the point of view of English."

So, in addition to case-folding and ASCII-folding, I also stripped morphological prefixes from the grouped words in an attempt to find common substrings. At this point, groups with no common substrings were likely candidates for errors, as were those with very short stems that ended in ć.
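The flagging heuristic described here could look something like the sketch below. The helper names, the minimum-remainder check when stripping a prefix, and the `max_short_stem=4` cutoff are all assumptions made for illustration; ASCII-folding is omitted for brevity.

```python
import os

# The five prefixes the stemmer was observed to strip.
PREFIXES = ("nie", "naj", "anty", "nad", "bez")


def strip_known_prefix(word):
    """Drop one known prefix if a plausible root (3+ letters) remains."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return word[len(prefix):]
    return word


def shared_edges(words):
    """Return the (beginning, ending) substrings shared by every word."""
    beginning = os.path.commonprefix(words)
    ending = os.path.commonprefix([w[::-1] for w in words])[::-1]
    return beginning, ending


def is_suspicious(stem, words, max_short_stem=4):
    """Flag a group for manual review; a flag is a hint, not a verdict."""
    folded = [w.casefold() for w in words]
    edges_found = any(shared_edges(folded))
    if not edges_found:
        # Retry with prefixes stripped, e.g. najlepsza -> lepsza.
        edges_found = any(shared_edges([strip_known_prefix(w) for w in folded]))
    short_c_stem = stem.endswith("ć") and len(stem) <= max_short_stem
    return (not edges_found) or short_c_stem
```

A check like this also flags groups that turn out to be fine on inspection, such as irregular forms that share no substring, so its output is only a list of candidates for review.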

A few common themes recur in the stemming errors. As noted before by others, strings ending in 1, especially when preceded by another number, tend to have the last two digits replaced by ć. This is very obvious in the buggy stemming results. A less obvious one I discovered while doing this write-up is that if a word ends in -aab, that string and the preceding letter get changed to ć. That's why Haab, Saab, Raab, and Qaab end up in the ć grouping—after converting /.aab$/ to ć, there's nothing left. Similarly, Naqaab gets stemmed to nać. These patterns don't explain the bulk of the weird results as in the ć group above, or most of the errors I found.
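The two patterns can be written down as regular expressions, as in the sketch below. It only mimics the observed symptoms and is in no way how Stempel works internally; the function name is hypothetical. The digit rule is a tendency rather than a law: it predicts 19ć for a string like 1921, whereas the four-digit strings ending in 31 in the sample collapsed all the way to a bare ć.

```python
import re

DIGIT_ONE = re.compile(r"\d1$")  # a digit followed by a final 1
AAB = re.compile(r".aab$")       # any character followed by a final "aab"


def mimic_observed_glitches(token):
    """Apply the two observed rewrite patterns to a lowercased token."""
    word = token.lower()
    for pattern in (DIGIT_ONE, AAB):
        if pattern.search(word):
            return pattern.sub("ć", word)
    return word


for example in ("Haab", "Naqaab", "1921", "monty"):
    print(example, "->", mimic_observed_glitches(example))
```

Note that monty passes through unchanged here even though the real stemmer reduces it to ć, which illustrates how little of the ć group these two patterns account for.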

Error Examples
The full list of potential error examples is on a sub-page.

Example tables show the stem (provided by the stemmer), the common substrings (as "beginning .. ending", with either possibly being empty), and the words stemmed to the shown stem. The original words are "types", and the number in parens after the type is the count of how many times the word appeared in the corpus (i.e., the number of "tokens").

Numbers
The sequence /\d1$/ tends to be stemmed to ć. Stems that start with a number and end in ć are uniformly bad. Full list here.

No Common Beginning or Ending Substring
I was really only looking for beginning and ending substrings, so if every word had an e in the middle I wouldn't have found that. Many of the types in each group are reasonable, but there's at least one that sticks out as wrong. I did find two cases where there was no common beginning or ending substring, but the results were still good, or at least reasonable.

It turns out that Polish is afflicted with silliness similar to good/better/best in English: dobrze, lepiej, and najlepiej all stem to dobrze.

Also, the a- prefix with the same "not" meaning as in English got stripped in one case: achromatyczność and chromatyczności ("achromatism" and "chromaticity") got stemmed together.

Full list here.

Other Two-Letter Stems Ending in ć
Groups of words all beginning with the same letter stemmed to that letter + ć. Despite starting with the same letter, it's pretty clear from the names and English words in the lists that these contain lots of errors. Full list here.

Other One-Letter Stems
As in the previous group, names and English words make it clear these groups all have errors. Full list here.

Other Two-Letter Stems
A handful of these seemed plausible, or at least "understandable", so I excluded them from my list. As before, names, acronyms, and English words make it clear there are a lot of errors here. I did not carefully review this whole list. Full list here.

Three-Letter Stems Ending in ć
Other three-letter stems seemed generally good, but the three-letter stems ending in ć were generally bad. I pulled out a few that seemed reasonable, but did not carefully review all of these. Full list here.

Four-Letter Stems Ending in ć
Again, other four-letter stems and other, longer stems ending in ć seemed generally reasonable, but these have plenty of errors, though I did not carefully review all of them. Full list here.
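The stem-shape categories used in the sections above can be expressed as a small bucketing function. This is a sketch with hypothetical labels; it sorts stems into review buckets by shape only and says nothing about whether a given group actually contains errors (brać, for example, lands in a bucket despite being a perfectly good stem).

```python
def classify_stem(stem):
    """Return the review bucket for a stem, or None if its shape is unremarkable."""
    ends_in_c = stem.endswith("ć")
    if stem[:1].isdigit() and ends_in_c:
        return "number + ć"
    if len(stem) == 1:
        return "one-letter stem"
    if len(stem) == 2:
        return "two-letter stem ending in ć" if ends_in_c else "other two-letter stem"
    if len(stem) in (3, 4) and ends_in_c:
        return f"{len(stem)}-letter stem ending in ć"
    return None
```

The "no common beginning or ending substring" category is not covered here because it depends on the words in the group rather than on the stem alone.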

Error Impact
The entire corpus contained 1,650,410 (~1.65M) tokens. The groups I identified as containing errors account for 69,783 (~70K) tokens, or around 4.2%.
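As a quick check of the arithmetic, using only the two token counts given above (the function name is hypothetical):

```python
def token_share(tokens_in_flagged_groups, total_tokens):
    """Fraction of all corpus tokens that fall in groups flagged as containing errors."""
    return tokens_in_flagged_groups / total_tokens


share = token_share(69_783, 1_650_410)
print(f"{share:.1%}")  # prints 4.2%
```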

In reality, the number is certainly higher than 4.2%. I surely missed other errors, and there are bound to be other, infrequent errors that simply didn't occur in my 10K-article sample. However, I think 4.2% is a good order-of-magnitude estimate. It's not really 1%, and it's probably not really 10%.

Also, CirrusSearch ameliorates some of the errors—but not others.

If we search for monty, which stems to ć, all of the top 20 results have monty as their highlighted keyword. In addition to the "stemmed" form, CirrusSearch more heavily weights exact matches, of which there are plenty in this case.

On the other hand, if we search for Ładoś, which also stems to ć, only the first four results have Ładoś as their highlighted keyword. Other keywords in the top 10 include XVIII, Brydż, Skopje, XXIII, bazy, Detroit, and Pearl. The rest of the top 20 are similarly irrelevant.

And in a disastrous case, if we search for śpiworów, which stems to the much more common r, most of the top-20 results have R as their highlighted keyword, and none have śpiworów. It doesn't show up in the top 200 results—though it does pop up between 200 and 220. Searching for it with quotes does get the exact form, of course.

These examples are not necessarily generally representative of how CirrusSearch behaves with Stempel—they just demonstrate a range of possible types of results.

Don't Forget the Recall Errors
This analysis focuses on precision errors—i.e., errors where words or other strings are inappropriately grouped together. I haven't really looked at recall errors, where different forms of the same word are not grouped together. Generally we've focused on precision errors in these reviews, since precision errors generally are more obvious to users and to developers.

I could undertake a more systematic review by stemming all the forms of a small group of common words, but I'm not sure how representative that is of real life, either. I don't know how often the plural instrumental of any given noun is used in Polish, or how often it is searched for on, say, Polish Wikipedia, or whether missing it is as terrible as missing the nominative singular form. (We poor English speakers just don't grok this case thing so well.)

For now, I'll stick with looking at the precision errors.

Query Analysis
Another option for next steps would be to take a smallish collection of randomly selected queries—say, 200—from Polish Wikipedia and run them in labs and see how often Stempel gives clearly bad results. If the answer is >10%, the error bars on that estimate won't matter; it's too big of a problem. On the other hand, if it's really low, we could keep looking at Stempel.

So, I pulled 10,000 random queries from Polish Wikipedia for the week of Jan 7-13, 2017. Only one query per IP was allowed, and no IP that issued 30 or more queries in a day was counted. Only "apparent human" web users were used. A random sub-sample of 200 queries was reviewed by hand.
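The filtering rules can be sketched as follows. The `(ip, day, query)` row format and the function name are hypothetical, the sketch collapses the 10,000-query pull and the 200-query sub-sample into one step, and it does not attempt the "apparent human" filtering.

```python
import random
from collections import defaultdict


def sample_queries(log_rows, max_per_day=30, sample_size=200, seed=0):
    """Pick at most one query per IP, skipping IPs with a high-volume day.

    log_rows: iterable of (ip, day, query) tuples (a hypothetical log format).
    """
    per_ip_day = defaultdict(int)
    by_ip = defaultdict(list)
    for ip, day, query in log_rows:
        per_ip_day[(ip, day)] += 1
        by_ip[ip].append(query)

    # An IP is excluded if it issued max_per_day or more queries on any one day.
    heavy_ips = {ip for (ip, _day), n in per_ip_day.items() if n >= max_per_day}

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    one_per_ip = [
        rng.choice(queries)
        for ip, queries in sorted(by_ip.items())
        if ip not in heavy_ips
    ]
    return rng.sample(one_per_ip, min(sample_size, len(one_per_ip)))
```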

The only repeated query in my samples was San Escobar, which came up three times. It is a country accidentally made up by the Polish Minister of Foreign Affairs in January 2017. It has an English Wikipedia page, but no Polish one yet.

Interestingly, this corpus has a significantly higher zero results rate (ZRR) than the ZRR dashboard indicates for Polish Wikipedia—about double! (42% vs 21-25%) My hypothesis is that filtering high-volume searchers (30 or more queries in a day) and only allowing one query per IP removes more successful than unsuccessful queries. Not counting API queries could be the culprit, but I would think that would be more likely to decrease the ZRR. One other thing I noticed: with Stempel, queries more often had keyword matches in titles. Not sure if that's good or bad.
 * Using the Stempel stemmer, 57 queries (28.5%) got "generally more" results (more than 1-2% difference).
 * 22 queries (11% of all queries, and 26% of ZRR queries) went from zero results to some results.
 * 2 queries got more than 10,000 additional results. 1 other query got more than 100,000 additional results: mapy, an inflected form of mapa ("map") went from ~5600 results to ~346K results, though the top results were similar.
 * 34 (17%) queries had "substantially different results". I generally ignored shuffling of the top 3 results (which could be in part caused by shard size, as well as term frequency counts, which can differ significantly when different forms of a word are stemmed and counted together).
 * Of the 34, I classified 19 as apparently the results of inflected forms or folded forms matching. For the other 15, there was nothing obvious from the snippets—though likely causes include differences in term frequency scoring because of folded matches.
 * I found 5 instances of apparently undesirable stemming—though these are not great, they aren't incomprehensible: działki and Dziaduski, benou and Betsy, beno and ben, persi and Persowie & Pers, and pokewars and pokolenie.
 * Another instance was pretty bad. VIII stems to nothing, so the query Henry VIII only matches on Henry. Exact matches in the plain field get some decent matches in the top 20, but Henry Cow is #1.
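The result-count comparison in the list above can be sketched as a small classifier. The labels and function name are hypothetical, and the 2% tolerance is an assumption taken from the upper end of the "1-2% difference" threshold mentioned there.

```python
def classify_change(before, after, tolerance=0.02):
    """Label how a query's result count changed once stemming was enabled."""
    if before == 0:
        return "zero to some" if after > 0 else "still zero"
    change = (after - before) / before
    if change > tolerance:
        return "generally more"
    if change < -tolerance:
        return "generally fewer"
    return "about the same"
```

Under these assumptions, the mapy query (roughly 5,600 results before and 346,000 after) is labeled "generally more", and any query that had no results before but some after is labeled "zero to some".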

In summary, out of 200 queries, many had more results, there were fewer queries with zero results, and no instances of bizarre matches, though VIII not matching anything is kind of annoying. I think this comes down more in favor of Stempel than against.

Recommendations
For better or worse, Stempel seems to be regarded as the best Polish stemmer/analyzer around—which means that we probably aren't going to find anything better. On the other hand, it does seem to be good enough to be regarded as the best Polish stemmer/analyzer around, so maybe it is generally good enough—and specifically better than nothing.

It also helps to keep in mind that my analysis is aimed at finding and highlighting the flaws in Stempel. Jan's analysis of 100 random stemming groups was that they were all quite good—so overall it probably is a net benefit. My problem is just that the errors are so random-seeming and not at all understandable to searchers.

Unfortunately, this is not a cut-and-dried case of Stempel being better than the status quo. Options include:
 * Letting the Polish Wikipedia community review the labs instance to see what they think. If it's a disaster, it's a disaster. If it's better than nothing, we could possibly also load other Polish wikis for review if needed.
 * Looking for another Polish stemmer/analyzer—maybe there is some undiscovered gem out there.
 * Punting on Polish—if this is worse than nothing, we can stick with nothing.
 * Waiting for some improvement to Stempel—either from the original developers (nothing yet happening there), or, if we can get the uncompiled rules and the licensing allows for it, we could try to patch some of the most egregious holes.