User:TJones (WMF)/Notes/Adding Ascii-Folding to French Wikipedia

August 2016 — See TJones_(WMF)/Notes for other projects. (T142620)

Summary
We want to add ascii-folding to French Wikipedia, so we thought we’d try it out “in the lab” and see how many extra indexing collisions it caused.

Highlights:
 * The default French analysis chain unexpectedly does some ascii-folding already, after stemming.
 * Unpacking the default French analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
 * English and Italian, which have been similarly unpacked to add ascii-folding in the past, include a bit of extra tokenizing help for periods and underscores, which we may want to also do for French—though it does violence to acronyms and may not work with BM25.
 * Ascii-folding affects significantly more tokens in French than it does in English, 50 times as many (as a percentage) for a 50K article corpus, which is not entirely a surprise, since many more accented characters are regularly used in French.

Introduction
The current configuration of the Elasticsearch text analysis for French Wikipedia uses the default French analysis chain, which includes handling elision (e.g., converting l’amour, d’amour, etc. to amour), stop words (usually small ignorable non-content words that are dropped, like the, a, an, it, etc. in English, and au, de, il, etc. in French), and stemming, but no separate ascii-folding step. Unlike in English, the French stemmer expects accents and handles them fine. See my recent English write up for a more detailed description of stemming and ascii-folding from an English-language point of view.

The lack of ascii-folding is causing problems for query terms like louÿs (see T141216). David suggested that we should enable asciifolding_preserve, which not only indexes the ascii-folded version of a term, but also preserves and indexes the original unfolded version. The point of this experiment is to make sure that not too much unexpected noise would be introduced by such a reconfiguration.

Corpora Generation and Processing
I extracted two corpora of randomly selected articles from French Wikipedia—with 1K and 50K articles, respectively. The main point of the 1K corpus is to test code and do exploratory analysis. After extracting the corpora, I filtered all HTML tags, and other XML-like tags from the text.

In order to add ascii-folding to the analysis chain, it was necessary to unpack the built-in French analyzer into its constituent parts so that the additional step could be added. (This had previously been done for English to add ascii-folding to the end, as we suggest doing here, and similarly for Italian.) The equivalent explicit analysis chain for French in Elasticsearch 2.3 is available in the Elasticsearch docs.
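As a rough sketch, the unpacked chain with the extra folding step might look like the settings below, written here as a Python dict so it can be serialized to JSON. The filter definitions are modeled on the documented rebuild of the french analyzer and are an assumption to verify against the docs for the Elasticsearch version in use (the documented keyword_marker step is omitted for brevity). The analyzer name french_folded is a hypothetical label; asciifolding_preserve is placed last, after the stemmer, which is the change under test.

```python
import json

# Sketch only: filter definitions are assumed from the documented "french"
# analyzer rebuild and should be checked against the docs for your version.
FRENCH_FOLDED_SETTINGS = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                             "jusqu", "quoiqu", "lorsqu", "puisqu"],
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
            # The added step: index the folded form and keep the original too.
            "asciifolding_preserve": {
                "type": "asciifolding",
                "preserve_original": True,
            },
        },
        "analyzer": {
            "french_folded": {  # hypothetical analyzer name
                "tokenizer": "standard",
                "filter": [
                    "french_elision",
                    "lowercase",
                    "french_stop",
                    "french_stemmer",
                    "asciifolding_preserve",  # runs after stemming
                ],
            }
        },
    }
}

if __name__ == "__main__":
    # Print the JSON body that would go into the index settings.
    print(json.dumps(FRENCH_FOLDED_SETTINGS, indent=2, ensure_ascii=False))
```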

For each corpus, I planned to call the Elasticsearch analyzer in each configuration, and note the results. In particular, we were looking for words that would now “collide”—that is, would be indexed under the same analyzed form—that hadn’t collided before. There were a few unexpected bumps along the way.
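The collision check can be sketched roughly as follows. This is a simplified illustration, not the script used for these numbers: it assumes each analyzer is a function that returns a single analyzed form per word (which ignores that asciifolding_preserve emits two tokens), and the function names and toy analyzers are hypothetical.

```python
import unicodedata
from collections import defaultdict


def buckets(types, analyze):
    """Group pre-analysis types by their analyzed (post-analysis) form."""
    grouped = defaultdict(set)
    for word in types:
        grouped[analyze(word)].add(word)
    return grouped


def new_collisions(types, old_analyze, new_analyze):
    """Return new buckets that merge words the old analyzer kept apart."""
    old_bucket_of = {}
    for members in buckets(types, old_analyze).values():
        for word in members:
            old_bucket_of[word] = frozenset(members)
    result = {}
    for form, members in buckets(types, new_analyze).items():
        # A bucket is a new collision if its members came from 2+ old buckets.
        if len({old_bucket_of[word] for word in members}) > 1:
            result[form] = sorted(members)
    return result


# Toy stand-ins for the real analysis chains (assumptions for illustration).
def toy_old(word):
    return word.lower()


def toy_new(word):
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(new_collisions(["Éric", "Eric", "eric", "amour"], toy_old, toy_new))
```

Here Eric and eric already shared a bucket under the toy baseline, so only the arrival of Éric makes the eric bucket count as a new collision.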

N.B.: See the notes in the English analysis on types (unique words) and tokens (all words) if you aren’t familiar with the terms or want more details and examples.

Unexpected Features of French Analysis Chain
While manually running analyses on the command line to make sure I’d properly switched my local config from English to French, I discovered that some ascii-folding is already going on, apparently after stemming. Some common French diacritics are folded to their unaccented variants, while some French diacritics and other more general diacritics are not. In particular:
 * folded: á â à é ê è î ô û ù ç
 * unfolded: ä å ã ë í ï ì ó ö ò ø õ ú ü ÿ ñ ß œ æ

Of special note, the characters that are folded are left unchanged if the word is less than five letters long, so âge is not folded to age. Also, deduplication doesn’t happen if the word is fewer than five characters: aaaa (4xa) is indexed as aaaa, but aaaaa (5xa) comes out as just a. (This turns out to be pretty important!)
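Manual checks like these can also be scripted against the _analyze API. The sketch below is an assumption-laden example rather than the commands actually used: the host, the index name frwiki_test, and the helper names are placeholders, and the JSON request body form should be checked against the docs for the Elasticsearch version in use.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="french",
                          base_url="http://localhost:9200",
                          index="frwiki_test"):
    """Build a POST request for _analyze (host and index are placeholders)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(payload):
    """Extract the token strings from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(payload)["tokens"]]


def analyze(text, **kwargs):
    """Send the request to a running cluster and return the analyzed tokens."""
    request = build_analyze_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return tokens_from_response(response.read().decode("utf-8"))
```

With a local cluster running, calling analyze on short and long variants of the same word (for example âge and élément) is a quick way to see the length threshold in action.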

I was also able to determine that folding happens after stemming based on the analyzed version of élément and element. The proper form, élément (“element”) is analyzed as element (without accents, as per the list above). Meanwhile, element was analyzed as ele, with the final -ment (roughly equivalent to the English adverbial ending -ly) stripped by the stemmer. (As a result, you can search French Wikipedia for ele and get results on element, which is often in accentless redirects. As a comparison in English, searching Bever Hills on English Wikipedia gives “exact matches” on Beverly Hills because Beverly is stemmed in English to bever because the -ly looks like an adverbial ending, even though it isn’t.)

Unexpected Differences in Unpacked Analysis Chain
After re-implementing the French analysis chain as its component steps, I re-ran my small 1K sample to make sure that the results were the same as the built-in analysis chain. It turns out that there are a few differences that don’t seem to be a product of my re-implementation. I re-ran the 50K sample, too, to get a better idea of the differences.

The differences seem to mostly be improvements:
 * Invisible Unicode characters are stripped; they would otherwise keep some words from being indexed properly. Examples:
   * bidirectional codes like U+202C and U+200E.
   * byte order mark like U+FEFF.
 * Better handling of Greek:
   * ς (the word-final form of σ) is folded to σ.
   * cursive forms are folded to non-cursive forms: ϐ to β (apparently this is a French thing!)
   * capitals are properly lowercased, like ∏ to π.
   * other variants are folded, such as ϑ to θ.
 * Letter-like symbols are folded into matching characters: the micro sign (µ) is folded to lowercase letter mu (μ), the aleph symbol (ℵ) is folded into the letter aleph (א); depending on your font, those pairs can be indistinguishable!
 * Double-struck letters (common in mathematics) are folded to their normal version: ℚ, ℝ, and ℤ become q, r, and z. (This isn’t always ideal: e.g., the mathematical nℤ is folded in with NZ, the abbreviation for New Zealand.)
 * German ß is folded to s; it is probably folded to ss first, but the French analysis chain is already known for deduping repeated characters.
 * Some phonetic characters are folded to their normal counterparts: ʲ and ʰ become j and h.
 * Other small raised letters, as in 1ᵉʳ (cf., English 1ˢᵗ) are folded to their normal counterparts, e and r.
 * Fullwidth and halfwidth CJK characters are folded to their more typical variants: ３ becomes 3, ｱｷ becomes アキ.
 * Soft hyphens are ignored.
 * Arabic “isolated” variants are folded with the normal character.
 * One obvious regression: Turkish dotted I (İ) is no longer folded to plain i.

I also noticed that there were slightly different versions of the unpacked French analysis chain in different versions of the docs. Whenever we update Elasticsearch, we should check the docs to see if the default analysis chains have changed. If they have, we might want to consider making similar changes to our unpacked analysis chains (English, Italian, and probably now French), even if the results of the unpacked chains are not identical to the built ins.
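Several of the foldings listed above look like standard Unicode compatibility normalization plus case folding. The sketch below uses Python's unicodedata to show the same mappings; it is only an analogy for exploring which characters would be affected, and it is an assumption (not something verified here) that the unpacked chain does anything equivalent internally. It does not account for every item in the list, such as soft hyphens or the invisible characters.

```python
import unicodedata


def compat_fold(text):
    """Apply NFKC compatibility normalization, then full Unicode case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


if __name__ == "__main__":
    # Micro sign, aleph symbol, double-struck Q, raised h, raised e and r,
    # fullwidth 3, halfwidth katakana, cursive beta, theta symbol, final sigma.
    samples = ["\u00b5", "\u2135", "\u211a", "\u02b0", "1\u1d49\u02b3",
               "\uff13", "\uff71\uff77", "\u03d0", "\u03d1", "\u03c2"]
    for original in samples:
        print(original, "->", compat_fold(original))
```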

Results—Built-in French vs Unpacked French
The size of the effect is very small, and generally positive. We have a very small net loss in tokens; the tokenizers appear to be slightly different between the built-in and unpacked French analysis chains. Note that "new collisions" are post-analysis types (final buckets), and all others are pre-analysis types (original forms). This is confusing, sorry. The number of types changes (it is reduced) after analysis, but the number of tokens doesn't. N.B.: I did not match if the capitalization was different. There are too many names out there, but there are more matches that could be made if case were not a factor (Œuvres / oeuvres, for example).
 * total tokens: basically the total number of words in the corpora
 * pre-analysis types: the number of unique different forms of words before stemming and folding
 * post-analysis types: the number of unique different forms of words after stemming and folding
 * new collision types: the number of unique post-analysis forms of words that are bucketed together by the change (the % changed is in comparison to the post-analysis types)
 * new collision tokens: the number of individual words that are bucketed together by the change (the % changed is in comparison to the original total tokens)
 * plurals: where an apparent singular and plural came together, such as Crónica/Crónicas.
 * folded: where accented and unaccented forms came together, such as Eric/Éric, and Elias/Elías.
 * folded_plurals: got a match both folded and pluralized.
 * others: where it wasn’t likely that the new bucketing was helpful, such as Gore/Göring.
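The automatic labels above could be assigned with a heuristic along these lines. This is a guess at the logic for illustration, not the actual script: it treats "plural" as a trailing s, strips accents via Unicode decomposition, and compares case-sensitively, as noted above. The function names are hypothetical.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _is_plural_pair(a, b):
    """True if one form is the other plus a trailing s."""
    return a + "s" == b or b + "s" == a


def classify_pair(a, b):
    """Label how two pre-analysis types relate (case-sensitive comparison)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if _is_plural_pair(a, b):
        return "plurals"
    folded_a, folded_b = strip_accents(a), strip_accents(b)
    if folded_a == folded_b:
        return "folded"
    if _is_plural_pair(folded_a, folded_b):
        return "folded_plurals"
    return "others"


if __name__ == "__main__":
    pairs = [("Crónica", "Crónicas"), ("Eric", "Éric"),
             ("edit", "édits"), ("Gore", "Göring")]
    for a, b in pairs:
        print(a, b, classify_pair(a, b))
```

Because Œ has no canonical decomposition and the comparison is case-sensitive, a pair like Œuvres / oeuvres falls into "others" under this heuristic.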

The overall impact is very small, but most of it is clearly positive.

Because of these changes, I had to re-analyze my larger sample with the unpacked French analysis chain to form a new baseline to isolate the effect of the ascii-folding. It looks like some of the ascii-folding job is done by unpacking the French analysis chain; however, our motivating character (ÿ), and umlauts/trémas in general, are not affected.

Notes on Italian, character filters and tokenizing, etc.
Looking at the config for English and Italian (which have also been similarly unpacked so that ascii-folding can be added), I noticed that both the English and Italian configurations—which may have been copied one from the other—include the word_break_helper character filter in the tokenizer. This is a custom filter that maps underscores, periods, and parens to spaces, to make sure those things are definitely counted as word breaks. (It looks like parens are already word boundaries for French at least.)

Among other effects, this splits up domain names like wikipedia.org and youtube.com into parts, so that queries like wikipedia and youtube, respectively, could match the domains.

Since it takes 6 hours to run the full 50K French corpus, I only ran a quick test on my 1K corpus to see what effect the word_break_helper has on tokenizing.
 * Dates and similar period-separated numbers like 01.06.1958 are broken up into parts (01, 06, and 1958).
 * The same applies to period-separated letters: acronyms (A.S.S.E.T.T. or A.D.) and web domains are split up.
 * Certain typos (d'années.Elle) are split up and processed correctly.

Mini Results—Unpacked French vs Unpacked French with word_break_helper
Looking at collisions and token counts: It was a fairly minor impact. I think it’s a net positive, though I don’t like the way acronyms are treated.

After talking to David, and looking at the impact of word_break_helper on acronyms in English and how it interacts with his BM25 work, I think maybe we shouldn’t implement it, and maybe we should turn it off for English, too.

Results—Unpacked French vs Unpacked and Ascii-folded French
These results focus on the effect of ascii-folding (preserving the original accented form as well). There is an increase in total tokens because we preserve the accented form and the ascii-folded form. The effect is relatively large (> 2%) and would be even larger for the full set of 1.7M French articles.

More than 80% of the new collisions are folded matches—the accentless form exists as its own pre-analysis type, generally indicative of a decent match.

Review
There were 361 new collisions in the 1K data. I reviewed them and here’s what jumped out at me:
 * lots of short word folding: Á, Å, and ä are now all folded in with a; âge and âgé are folded in with age. This was already happening with longer words, but now happens for the short ones, too.
 * names with diacritics are folded in with their accentless versions: Agnès & Agnés with Agnes; aïkido with aikido; Düsseldorf with Dusseldorf, Rodríguez with Rodriguez, Shâh with Shah, etc.
 * I also noticed a typo (Bajà for Spanish Bajá), which would now get indexed properly.
 * some of the short words that are folded together, especially when deduplication has happened, don’t strike me as great: Bâle with balle(s), bébé with Beebe
 * English contractions and possessives with smart quotes are correctly being indexed with straight quote variants: can’t, don’t, it’s, ain’t, King’s. This is also happening to a few French words, like aujourd’hui. Looks like the French stemmer can handle straight quotes or smart quotes for elision, but doesn’t fold them in general.
 * Long words with very short stems are folded together. education is stemmed to educ, éducation is stemmed to éduc; since the stems are only four letters long, there’s no ascii folding for é in the French stemmer, and these are indexed separately. Now they are one!
   * This happens with plurals, too, so that édits is stemmed to édit, which is too short to be ascii-folded by the French stemmer.
   * It seems that inside the French stemmer, some stemming happens before ascii-folding, some after. édité, éditée, éditées, édités, éditeur, and éditeurs all stem to edit, but édits does not. With explicit post-stemmer ascii-folding, they are all indexed together.
 * Ewww. The short word thing hits some masc/fem pairs. égal (the masculine, “equal”) is indexed as égal, but égale (the feminine) is 5 letters, and eligible for ascii-folding before the final e comes off. It comes out as egal. With post-stemmer ascii-folding, they all end up together under egal. Similarly for reçu / reçue.
 * Better handling of digraphs: œuvre with oeuvre, Phœnix with Phoenix, Schnæbelé with Schnaebelé, Cæsar with Caesar.
 * It’s not all great: thé with the is going to be the worst, I’m sure.
 * Even with the specific ascii-folding step, Turkish dotted I (İ) is no longer folded to plain i, so İstanbul and Istanbul are no longer indexed together. We could fix this by mapping İ to I before tokenization; dotted İ is not going to be distinctive very often in French.

I was less optimistic at first because of all the short words I was seeing, but looking through the whole list, I think it’s a net positive.

David kindly looked more closely at the 244 “other” collisions that didn’t fall into the folded or plurals categories and gave his native-speaker judgements on the quality of the merges. His judgements, plus the automatically assessed folded & plural counts, are below.

So, only 10.20% of tokens (417/4088) involved in new collisions in the 1K sample are demonstrably worse.

It looks like we have a winner!

Potential Hard to Explain Behavior
I haven’t come across a concrete example, but I’m going to write this down here because it’s going to confuse someone at some point. The French stemmer does some ascii-folding (as noted with élément and element), before it does deduplication (hhhhoooommmmmmmmeeeee goes to home). The additional, more universal ascii-folding step we’ve added comes after that, so the deduplication doesn’t always work out like you’d expect.

So, here we have an artificial example of 6 a’s in a row, with various accents. The French stemmer folds á to a, but not Scandinavian å. Deduplication happens on exact character matches (modulo case). Then the extra ascii-folding happens (preserving the “original”, which is the output of the French stemmer). None of the five tokens used to index the three originals are the same as each other, and only one is the same as its original. I don’t know when or where, but this will eventually come up. Mark my words!
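The ordering problem can be reproduced with a toy model. The sketch below is not the French stemmer; it only mimics the behaviors observed in these notes (partial folding and deduplication for tokens of five or more characters), followed by a simplified accent-stripping step standing in for asciifolding_preserve. All function names are hypothetical, and input tokens are assumed to be already lowercased.

```python
import re
import unicodedata

# Characters the stemmer was observed to fold (see the folded list earlier).
STEMMER_FOLDS = str.maketrans("áâàéêèîôûùç", "aaaeeeiouuc")


def toy_stemmer_cleanup(token):
    """Mimic the observed in-stemmer folding and dedup (5+ characters only)."""
    token = unicodedata.normalize("NFC", token)
    if len(token) < 5:
        return token
    folded = token.translate(STEMMER_FOLDS)
    # Collapse runs of the exact same character down to one.
    return re.sub(r"(.)\1+", r"\1", folded)


def toy_asciifold(token):
    """Simplified folding: strip combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_index_tokens(token):
    """Stemmer cleanup first, then folding that preserves the original."""
    stemmed = toy_stemmer_cleanup(token)
    folded = toy_asciifold(stemmed)
    return [stemmed] if folded == stemmed else [stemmed, folded]


if __name__ == "__main__":
    for sample in ["áááááá", "åååååå", "aaáååå"]:
        print(sample, "->", toy_index_tokens(sample))
```

In this toy model, a mixed run such as aaáååå is deduplicated to aå before the final folding, so the folded token is aa rather than a: the later folding step creates a repeated character that the earlier deduplication never gets to see.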

Conclusions

 * The overall impact of performing ascii-folding after stemming on French Wikipedia is largely positive. We should do it!
 * We should probably set up a character map from İ to I so as not to regress on Turkish names.
 * Adding the word_break_helper character filter is dubious.
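The İ to I mapping could be a one-entry mapping character filter. The settings below are a sketch of that idea, with a hypothetical filter name (dotted_I_fix), alongside a Python stand-in that shows the intended effect on the text before tokenization.

```python
# Hypothetical char filter settings: map U+0130 (İ) to plain I.
CHAR_FILTERS = {
    "dotted_I_fix": {  # hypothetical filter name
        "type": "mapping",
        "mappings": ["\u0130=>I"],
    }
}

# Python stand-in for the same mapping.
_DOTTED_I = str.maketrans({"\u0130": "I"})


def fix_dotted_i(text):
    """Replace Turkish dotted capital İ with plain I before analysis."""
    return text.translate(_DOTTED_I)


if __name__ == "__main__":
    print(fix_dotted_i("\u0130stanbul"))
```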

Deployment Notes
Since this change affects how terms are indexed, it requires a full re-index of French Wikipedia. We’ll be doing that for the BM25 implementation within the next quarter or so, so it makes sense to do BM25 and ascii-folding at the same time.