User:TJones (WMF)/Notes/Fallback Redux

September/October 2017 — See TJones_(WMF)/Notes for other projects. See also T147959 and Disabling Messaging Fallbacks for Language Analysis.

Background
As noted in my write-up from last year, messaging fallback languages, which make sense geographically and historically but not necessarily linguistically, are also being used to enable language analyzers in places where they don't make a ton of sense.

Data
I did a quick-n-dirty analysis of languages as configured in code last time, but this time I pulled out actual live configuration for Wikipedias in every language, and "Other Wikimedia projects" listed in the Special:SiteMatrix page on mediawiki where possible. For private wikis, I used the info on the main page of the wiki and the config under  in  (very large link).

There are a few mismatches between config in code and the live config in production, probably caused by fallback languages being configured after the wikis were started; those wikis haven't been re-indexed yet, so the new fallback config hasn't had a chance to take effect.
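A minimal sketch of how such code-versus-live mismatches could be spotted, assuming both configurations have already been reduced to plain mappings from wiki name to analyzer name. The function name, the wiki names, and the analyzer values are all hypothetical placeholders, not taken from the real configuration.

```python
def find_mismatches(code_config, live_config):
    """Return {wiki: (analyzer_in_code, analyzer_live)} where the two differ."""
    mismatches = {}
    for wiki, in_code in code_config.items():
        live = live_config.get(wiki)
        # Skip wikis with no live config available (e.g., private wikis).
        if live is not None and live != in_code:
            mismatches[wiki] = (in_code, live)
    return mismatches


# Hypothetical sample data for illustration only.
code_config = {"wiki_a": "lang_a", "wiki_b": "lang_b", "wiki_c": "default"}
live_config = {"wiki_a": "lang_a", "wiki_b": "default"}

print(find_mismatches(code_config, live_config))
```

Here `wiki_b` is reported because the fallback analyzer appears in code but the live index still uses the default, which is the pattern expected for a wiki that has not been re-indexed yet.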

Analysis
The table below has all the wikis I looked at, grouped by language configured. For wikis with fallback language analyzers enabled, I also listed the number of articles on the wikis and the percentage of search traffic for each wiki. The numbers are snapshots so the links have changed, but they should give workable estimates.

The columns include:
 * Compatibility (and some other info), indicated by the following codes:
   * + = configured language matches content, or is ICU default
   * ! = plausible, as languages are listed as mutually intelligible in written form, but not guaranteed
   * ? = genetic relation to fallback; may be useful (but I'm extremely doubtful)
   * x = no genetic relation to fallback
   * # = should be in configured language, and would be if re-indexed, but is not currently
   * - = wiki is closed
 * WP Articles—count of articles in Wikipedia in that language
 * Search Volume—percentage of search volume from Discovery dashboards; 3% is high for anything other than English
 * Lg of Wiki—Language of the Wiki in question.
 * Lg Used—Language analyzer configured. Note that CJK is a generic processor for Chinese, Japanese, and Korean. ICU is an open-source library for Unicode processing.
 * Wiki domain—the domain of the wiki
 * Notes—Notes on mutual intelligibility, differences in code/live configuration, etc. An asterisk (*) indicates that I had to get the language used from code or from the main page of the wiki, because the live config is unavailable for a private wiki.

For each language group there is a summary row. The row lists totals for displayed article counts and search volumes (i.e., those from potentially incompatible wikis). Language families of the unrelated languages are also listed (e.g., Eskimo-Aleut is listed in the row for Danish because Greenlandic, which falls back to Danish, is an Eskimo-Aleut language, while Danish is not—it's Germanic).
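The summary rows can be sketched roughly as follows. This is an illustration under stated assumptions, not the method actually used to build the table: it assumes each table row is a dict with the hypothetical keys shown, and that "potentially incompatible" means the codes `!`, `?`, and `x` from the legend. The article counts and search percentages in the sample rows are made-up placeholders.

```python
from collections import defaultdict

# Assumption: these legend codes mark a potentially incompatible analyzer.
POTENTIALLY_INCOMPATIBLE = {"!", "?", "x"}


def summarize(rows):
    """Group rows by configured analyzer and total the flagged wikis."""
    summary = defaultdict(
        lambda: {"articles": 0, "search_pct": 0.0, "families": set()}
    )
    for row in rows:
        if row["code"] not in POTENTIALLY_INCOMPATIBLE:
            continue  # exact matches and ICU defaults are not totaled
        group = summary[row["lg_used"]]
        group["articles"] += row["articles"]
        group["search_pct"] += row["search_pct"]
        if row["code"] == "x":
            # Only unrelated languages contribute a language family.
            group["families"].add(row["family"])
    return dict(summary)


# Placeholder numbers; only the Greenlandic/Danish relationship is from the text.
rows = [
    {"lg_of_wiki": "Danish", "lg_used": "Danish", "code": "+",
     "articles": 0, "search_pct": 0.0, "family": "Germanic"},
    {"lg_of_wiki": "Greenlandic", "lg_used": "Danish", "code": "x",
     "articles": 1000, "search_pct": 0.25, "family": "Eskimo-Aleut"},
    {"lg_of_wiki": "Hypothetical relative", "lg_used": "Danish", "code": "?",
     "articles": 500, "search_pct": 0.5, "family": "Germanic"},
]

summary = summarize(rows)
print(summary["Danish"])
```

With this sample, the Danish summary totals only the two flagged wikis and lists Eskimo-Aleut as the single unrelated family.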

The language groups with potential problems are listed here alphabetically. The others are listed at the end of the page—provided for completeness, but not very interesting.

Next Steps
There are 102 wikis with non-exact language analysis configurations:
 * 47 are obvious linguistic mismatches.
 * 12 are configured with the analyzer for a reasonably mutually intelligible language and so have a reasonable potential to be doing more good than harm.
 * The middle 43 are genetically related, but not really very likely on average to benefit hugely from having the wrong-language analyzer.

I've done a more detailed but still rough analysis of the similarity of the potential keepers, and asked the community for feedback on the following:
 * Egyptian Arabic (Maṣri) as Arabic
 * Gagauz as Turkish
 * Limburgish as Dutch
 * Livvi-Karelian as Finnish
 * Mirandese as Portuguese
 * Occitan as Catalan
 * Slovak as Czech

We'll see what comes of those discussions. In the meantime, I've configured these as exceptions in the WIP patch I've submitted to Gerrit.

The rest are scheduled to be disabled in the code the week of October 9th, though the actual re-indexing may take a while after that. Re-indexing is tracked on Phab task T177871.

The outline of the plan has been laid out on another page: Disabling Messaging Fallbacks for Language Analysis, which is where community discussion will be directed, though there are also links back to here and to Phab.

The Rest of the Table
This is the rest of the table from above, where nothing terribly exciting is happening. Everything is either using the appropriate language or the ICU default.