Wikimedia Discovery/Disabling Messaging Fallbacks for Language Analysis

Summary
The Search Platform team (formerly part of Discovery) is planning to fix a long-standing search bug on many wikis by disabling the code in CirrusSearch that re-uses the “fallback” languages (which are specified for user interface or system messages) for the language analysis modules (which are used to index words in search). Deployment is planned to start the week of October 9, 2017.

Background
Messaging fallbacks specify what language to show a message in when there is no message available in the language of a given wiki. A language analysis module is language-specific software that processes text to improve searching—so that, for example, searching for a given word will find related forms of that word, like hope, hopes, hoping, hoped or resume, resumé, résumé on English-language wikis.

Fallback languages for system messages make sense for historical and cultural reasons: a reader of the Chechen Wikipedia is more likely to understand a user interface or system message in Russian than in French, Greek, Hindi, Italian, or Japanese. But the fallbacks don't necessarily make any linguistic sense. For example, Chechen is a Northeast Caucasian language while Russian is an Indo-European language; while the languages have undoubtedly influenced one another, their grammars are completely different.

Even for languages that are in the same family—like French and Spanish, or English and German—trying to analyze one with the grammar of the other is an expensive way to get results that range from poor to actively harmful (see “Some Examples” below). There are a small number of language/fallback pairs that are possibly similar enough to warrant continued use of the fallback language analyzer, but even then, properly configuring them can lead to complex, brittle code that breaks unexpectedly and often silently when a change to the fallback language configuration affects indexing and search in numerous other languages.

I have done a write-up that lays out which fallbacks are enabled where, with a very rough measure of the linguistic relatedness of the languages involved. However, this configuration is a bug that should be fixed, and even mutual intelligibility only indicates that human speakers of one language are clever enough to understand the other, not that any given software is!

Solution
We plan to deploy the software change that stops using messaging fallbacks as language analysis fallbacks the week of October 9, 2017, with any cross-language analysis exceptions configured explicitly in a new way. The change will not take effect immediately on all affected wikis, because each wiki in each language will need to be re-indexed, which is a separate process that takes time. There may also be other delays caused by Elasticsearch upgrades or other changes that need immediate attention.
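
The difference between the old and new behavior can be sketched roughly as follows. This is an illustration only, not CirrusSearch's actual code or configuration: the dictionary names, the function names, the "default" analyzer name, and the empty exceptions table are all assumptions made for the example. The three fallback pairs are taken from this page.

```python
# Illustrative data only; not the real MediaWiki or CirrusSearch configuration.
MESSAGE_FALLBACKS = {"wo": "fr", "ce": "ru", "kl": "da"}  # UI message fallbacks
LANGUAGE_ANALYZERS = {"fr": "french", "ru": "russian", "da": "danish"}

# Hypothetical table of explicit cross-language analysis exceptions.
# Left empty here because this page does not say which pairs would be kept.
ANALYSIS_EXCEPTIONS = {}


def old_pick_analyzer(lang):
    """Old (buggy) behavior: borrow the analyzer of the messaging fallback."""
    if lang in LANGUAGE_ANALYZERS:
        return LANGUAGE_ANALYZERS[lang]
    fallback = MESSAGE_FALLBACKS.get(lang)
    return LANGUAGE_ANALYZERS.get(fallback, "default")


def new_pick_analyzer(lang):
    """New behavior: ignore messaging fallbacks; honor only explicit exceptions."""
    if lang in LANGUAGE_ANALYZERS:
        return LANGUAGE_ANALYZERS[lang]
    explicit = ANALYSIS_EXCEPTIONS.get(lang)
    return LANGUAGE_ANALYZERS.get(explicit, "default")


print(old_pick_analyzer("wo"))  # Wolof borrows the French analyzer
print(new_pick_analyzer("wo"))  # Wolof gets the language-neutral default
```

In this sketch a language with its own analyzer is unaffected; only languages that previously inherited an analyzer through the messaging fallback change behavior, and only after re-indexing.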

You can track progress of the task on Phabricator.

Some Examples
Wolof, a Niger–Congo language, has French, an Indo-European Romance language, as its fallback. The French language analyzer removes French inflections, like the final -s that marks plural nouns, and verb conjugation endings, including the final -r of verbs. It also folds some characters (particularly é to e, and many non-French letters in general) and reduces repeated letters to a single instance. Finally, it removes "stop words", which are common words that aren't always helpful in search, like le, la, les ("the"); stop words are still searchable, but removing them in the analyzer affects the scoring of matches.

As a result of processing Wolof text as French:
 * The following Wolof words are treated as stop words on Wolof-language projects: cet ("cleanliness"), des ("to remain"), du ("not"), en ("luggage", "to put a load on one's head"), et ("log"), les ("to sharpen"), ma ("I say"), ne ("to say"), pas ("amulet", "to tie a knot"), sa ("your").
 * These words are treated as essentially the same for search purposes:
   * Gana ("Ghana"), Gànnaar ("Mauritania"), ganaar ("chicken")
   * geéna ("to go out"), gena ("to be more than"), géna ("pipe")
   * geénee ("to subtract"), génee ("remove")
   * caa ("oh!", "stew"), caas ("fishing line", "muscle")
   * ban ("question", "mud", "which", "to please"), bañ ("to refuse"), baŋ ("a bench")
   * bon ("mean", "bad"), bóñ ("tooth")

Applying language processing for Russian (an Indo-European Slavic language) to Bashkir (a Turkic language) has an effect similar to that of applying French to Wolof, for example collapsing these words together: алым ("method"), ала ("mottled"), ал ("pink", "take"), Али ("Ali", a name), Алиев ("Aliyev", a name), Алей ("Alee", used in naming rivers).
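
To make the Wolof collisions above concrete, here is a toy approximation of the French-specific steps described earlier (stop word removal, diacritic folding, collapsing repeated letters, and stripping a final -s or -r). It is a simplified sketch written for this page, not the analyzer that actually runs in production; the function name and the abbreviated stop word list are assumptions.

```python
import re
import unicodedata

# Small subset of French stop words, limited to the ones mentioned on this page.
FRENCH_STOP_WORDS = {"cet", "des", "du", "en", "et", "le", "la", "les",
                     "ma", "ne", "pas", "sa"}


def toy_french_analyze(token):
    """Return a normalized form of token, or None if it is a French stop word."""
    token = token.lower()
    if token in FRENCH_STOP_WORDS:
        return None
    # Fold diacritics: decompose, then drop combining marks (é -> e, ñ -> n).
    decomposed = unicodedata.normalize("NFD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Reduce any run of a repeated letter to a single instance.
    token = re.sub(r"(.)\1+", r"\1", token)
    # Strip a final -s (plural marker) or -r (verb ending).
    if len(token) > 2 and token[-1] in "sr":
        token = token[:-1]
    return token


groups = [["Gana", "Gànnaar", "ganaar"], ["caa", "caas"], ["bon", "bóñ"]]
for group in groups:
    # Each group collapses to a single analyzed form.
    print(group, "->", {toy_french_analyze(word) for word in group})

print(toy_french_analyze("les"))  # Wolof "to sharpen" is dropped as a stop word
```

Note that this simple folding step does not handle ŋ, which has no Unicode decomposition, so baŋ is not collapsed with ban and bañ by the sketch.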

Applying Indonesian to Buginese has no effect, even though both are Austronesian languages. Out of a sample of about 6400 words, none were affected, so it's all just wasted CPU processing.

Similarly, applying Danish (an Indo-European Germanic language) to Greenlandic (an Eskimo–Aleut language) has little effect. Only 40 words out of a sample of over 10,000 were affected, and those 40 were all Indo-European words.
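
A count like "40 out of over 10,000" can be produced with a very simple comparison of each word before and after analysis. The sketch below shows one way to do it; the exact method used for the numbers on this page is not described here, so the definition of "affected" (the analyzed form differs from the lowercased word), the function names, the stand-in analyzer, and the tiny hand-picked sample are all assumptions.

```python
def count_affected(words, analyze):
    """Count distinct words whose analyzed form differs from the lowercased word."""
    distinct = {word.lower() for word in words}
    affected = sorted(word for word in distinct if analyze(word) != word)
    return len(affected), len(distinct), affected


def strip_final_e(token):
    # Stand-in analyzer for illustration only: removes one trailing "e".
    return token[:-1] if token.endswith("e") else token


# Tiny made-up sample; not the sample used for the numbers on this page.
sample = ["Nuuk", "inuit", "Kalaallit", "kommune", "illoqarfik"]
n_affected, n_total, examples = count_affected(sample, strip_final_e)
print(f"{n_affected} of {n_total} words affected: {examples}")
```

Listing the affected words, and not just counting them, is what makes it possible to notice patterns such as all of them being loanwords.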

For both Upper and Lower Sorbian (Indo-European Slavic languages), processing with German (an Indo-European Germanic language) does strip -e off the end of many words. This isn't incorrect for either variety of Sorbian, but it misses the rest of about 18 noun declension suffixes and about 30 verb conjugation suffixes. But, hey, even a broken clock is right twice a day.

Affected Languages
The list of affected languages, grouped by fallback language, is below. Languages marked with * are configured to use the fallback language shown but, for various historical technical reasons, do not currently use it in production; they would start using the fallback if re-indexed. Once these are configured properly, they are complete and will not require re-indexing.
 * Arabic: Egyptian Arabic
 * Catalan: Occitan
 * Czech: Slovak
 * Danish: Greenlandic
 * Dutch: Dutch Low Saxon, Limburgish, Sranan, West Flemish, Zeelandic
 * Finnish: Livvi-Karelian
 * French: Bambara, Breton, Franco-Provençal, Fula, Haitian, Lingala, Malagasy, Norman, Picard, Sango, Tahitian, Walloon, Wolof, *Atikamekw, *Kabiye
 * German: Alemannic, Bavarian, Low Saxon, Lower Sorbian, Luxembourgish, North Frisian, Palatinate German, Pennsylvania German, Ripuarian, Saterland Frisian, Upper Sorbian
 * Greek: Pontic
 * Hebrew: *Yiddish
 * Hindi: Maithili, Sanskrit
 * Indonesian: Acehnese, Banjar, Banyumasan, Buginese, Javanese, Minangkabau, Sundanese
 * Italian: Corsican, Emilian-Romagnol, Friulian, Ligurian, Lombard, Neapolitan, Piedmontese, Tarantino, Sicilian, Venetian
 * Latvian: Latgalian
 * Lithuanian: Samogitian
 * Persian: Gilaki, Mazandarani, Northern Luri, Southern Azerbaijani
 * Polish: Kashubian, Silesian
 * Portuguese: Mirandese
 * Romanian: Aromanian, Молдовеняскэ (Moldovan Cyrillic), Romani
 * Russian: Abkhazian, Avar, Bashkir, Buryat, Chechen, Chuvash, Erzya, Hill Mari, Kalmyk, Karachay-Balkar, Komi, Komi-Permyak, Lak, Lezgian, Meadow Mari, Moksha, Ossetian, Sakha, Tatar, Tuvan, Udmurt
 * Spanish: Aragonese, Asturian, Aymara, Chavacano, Extremaduran, Guarani, Ladino, Nahuatl, Quechua
 * Turkish: Gagauz
 * Ukrainian: Rusyn

You can follow the re-indexing on Phab ticket T177871.