User:TJones (WMF)/Notes/Re-Ordering Stemming and Ascii-Folding on English Wikipedia
August 2016 — See TJones_(WMF)/Notes for other projects. (T142037) For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.
The current configuration of the Elasticsearch text analysis for English Wikipedia includes stemming and ascii-folding, in that order. The opposite order seems to make sense for English, and seems like it would improve search results. The point of this experiment is to make sure that no unexpected noise would be introduced by such a reconfiguration.
Stemming is the reduction of different forms of a word to a common form, for example, converting jumps, jumped, and jumping to jump. Stemming is not an exact science, and can be more or less aggressive. Somewhat unintuitive results are common, such as the names Cokely, Coker, Cokic, and Coking all being converted to coke. These mistakes are often caused by incorrectly treating parts of a word as regular morphology when it isn’t. Overall the impact is very positive.
Ascii-folding is the removal of diacritics (for example, converting å, á, â, ä, à, ã, and ā all to a), simplification of digraphs (such as converting æ to ae, or ß to ss), and more generally converting Latin characters to the “basic” Latin characters.
Stemming and ascii-folding interact. In particular, the stemmer takes no action on words with diacritics and other non-basic characters. As a result, plurals like clichés are unstemmed, and do not match the singular cliché. However, ascii-folding before stemming could merge unrelated terms by removing the diacritics. For example, the name Stüning, without the ü is reduced to the unrelated stun, along with stunning, stunned, and stuns.
Overall, my intuition was that the negative effects would be significantly less than the positive effects, and the total effect would be relatively small. I can’t think of any common words in English (excluding names) where diacritics are required. They are perhaps more common than not in a handful of French borrowings (fiancé(e), résumé, née), but variants without accents are relatively common even in these cases. Some accents seem to be beyond most English speakers’ grasp (me included!): tête à tête (“face to face”) appears in English Wikipedia as tete a tete, tète a tète, tète à tète, tête a tête, tête à tete, tête à tête, tete-a-tete, téte-a-téte, tête-a-tête, and probably others.
Rather than rely on my intuition, David and I decided it would be best to run an experiment. I extracted two corpora of articles from English Wikipedia—1,000 random articles, and 50,000 random articles (each without duplicates)—and tested the effects of re-ordering the stemming and ascii-folding steps of the Elasticsearch analysis chain.
Corpora Generation and Processing
I extracted two corpora of randomly selected articles from English Wikipedia—with 1K and 50K articles, respectively. The main point of the 1K corpus was to test my code, but it also has some relevance to extrapolating from the 50K corpus to the 5M articles in English Wikipedia. After extracting the corpora, I filtered all HTML tags, and other XML-like tags from the text.
The current configuration of Elasticsearch on English Wikipedia uses a setting called asciifolding_preserve, which not only indexes the ascii-folded version of a term, but also preserves and indexes the original unfolded version. David suggested not indexing the unfolded version in our new configuration, so I used the non-preserving version as the baseline for these tests.
For each corpus, I called the Elasticsearch analyzer in each configuration, and noted the results. In particular, we were looking for words that would now “collide” that hadn’t collided before. Some “collisions”, like clichés joining cliché, cliche, and cliches—are good. Others, like Stüning and stun above, are not.
A note on tokens and types: I’m looking at counts for both tokens and types, and it’s important to distinguish the two. Tokens refer to individual words, counted each time they appear. Types count all instances of a word (and related forms that are stemmed to be the same) as one thing. So, in the sentence, The brown dog jumped and the grey dogs are jumping., there are ten tokens (more or less what you would probably think of as normal words), but only seven types (the, brown, dog, jump, and, grey, are). For the most part, we care more about token counts, but I’d never hear the end of it from the corpus linguists if I didn’t count types, too.
Here are the counts for the two corpora, and the two analysis configurations:
|1K sample||50K sample|
|stem + fold||fold + stem||stem + fold||fold + stem|
|new collision types||42||0.128%||1,674||0.423%|
|new collision tokens||97||0.030%||6,969||0.045%|
- total tokens: basically the total number of words in the corpora (the same regardless of stemming/folding order)
- pre-analysis types: the number of unique different forms of words before stemming and folding (the same regardless of stemming/folding order)
- post-analysis types: the number of unique different forms of words after stemming and folding
- new collision types: the number of unique post-analysis forms of words that are bucketed together by the change (the % changed is in comparison to the post-analysis types)
- new collision tokens: the number of individual words that are bucketed together by the change (the % changed is in comparison to the total tokens)
As expected, the total impact is very small, with a tiny fraction (0.030%) of tokens in the 1K corpus being affected by only 42 new collisions. The 50K sample has, predictably, about 50x as many tokens, but only about 11x or 12x as many types before and after analysis.
The number of collisions is only 1.5x as big in the 50x sample, compared to the 1K sample. Given the typical Zipfian/power law distribution of word frequency, I’d expect this to extrapolate more or less logarithmically, so the number of new collisions for the full 5M articles would likely be less than 0.1%.
Since there are only 42 new type collisions in the 1K sample, I looked at all of them. They fell broadly into three categories:
- plurals (5 types/6 tokens), where an apparent singular and plural came together, such as Crónica/Crónicas.
- folded (21 types/65 tokens), where accented and unaccented forms came together, such as Eric/Éric, and Elias/Elías. This includes like Modeer/Modéer/mode, where Modeer and mode where already in a bucket together in a sub-optimal way, but it’s a good thing that Modéer is now in the same bucket as Modeer.
- folded_plurals (3 types, 3 tokens), got a match both folded and pluralized.
- others (9 types/23 tokens), where it wasn’t likely that the new bucketing was helpful, such as Gore/Göring, bile/bílý, sale/šálivé, and perhaps legends/légendes.
Since that turned out to be an easy and useful analysis, I automated a low fidelity version of it, where “plurals” match if either stripping an s or adding an s (if there isn’t one) to the end of the term makes a match—note that this is far from perfect because it won’t catch -es plurals, like match/matches, and it will match things that aren’t actually plurals, like André/Andrés, but it is simple to do automatically.
So, for the 50K sample, we can automatically count:
Note that "new collisions" are post-analysis types (final buckets), and all other are pre-analysis types (original forms). This is confusing—sorry. The number of types changes after analysis, but the number of tokens doesn't.
So, out of 15,397,805 (15 million) tokens, only 1,025 (0.0067%) were bucketed with other words, none of which look like an obvious plural/singular or ascii-folded version of the token.
Since the "other" category had only 516 items, I skimmed them. I didn't normalize case in my comparisons because there are a lot of names, and I wanted a conservative count. However, a quick glance through the "others" reveals that many that failed to match because of case: adiós/Adios, Harmônicos/harmônico, Réflexion/reflexion, trésors/Trésor. There are also some decent cognate matches: économistes/economist, expérimentales/experimental, légendes/legends, méthodes/methods, poèmes/poems. As well as plenty of not-so-great groupings.
I'm so used to working with queries, which need to be private because they include potential PII, that I forgot I can share the raw data! If anyone else
is a masochist wants to take a look, the raw output of my analysis program is available.
The overall impact of performing ascii-folding before stemming on English Wikipedia is small, and largely positive. We should do it!
Since this change affects how terms are indexed, it requires a full re-index of English Wikipedia. We’ll be doing that for the BM25 implementation within the next quarter or so, so it makes sense to do BM25 and stemming-before-indexing at the same time.