User:TJones (WMF)/Notes/Fallback Redux

September 2017 — See TJones_(WMF)/Notes for other projects. See also T147959

Background
As noted in my write-up from last year, messaging fallback languages, which make sense geographically and historically but not necessarily linguistically, are also being used to enable language analyzers in places where they don't make a ton of sense.

Data
I did a quick-n-dirty analysis of languages as configured in code last time, but this time I pulled out actual live configuration for Wikipedias in every language, and "Other Wikimedia projects" listed in the Special:SiteMatrix page on mediawiki where possible. For private wikis, I used the info on the main page of the wiki and the config under  in  (very large link).

There are a few mismatches between config in code and the live config in production, probably caused by fallback languages being configured after the wikis were started; those wikis haven't been re-indexed yet, so the new fallback config hasn't had a chance to take effect.

Analysis
The table below has all the wikis I looked at, grouped by language configured. For wikis with fallback language analyzers enabled, I also listed the number of articles on the wikis and the percentage of search traffic for each wiki. The numbers are snapshots so the links have changed, but they should give workable estimates.

The columns include:
 * Compatibility (and some other info), indicated by the following codes:
   * + = configured language matches content, or is ICU default
   * ! = plausible, as languages are listed as mutually intelligible in written form, but not guaranteed
   * ? = genetic relation to fallback; may be useful (but I'm extremely doubtful)
   * x = no genetic relation to fallback
   * # = should be in configured language, and would be if re-indexed, but is not currently
   * - = wiki is closed
 * WP Articles—count of articles in Wikipedia in that language
 * Search Volume—percentage of search volume from Discovery dashboards; 3% is high for anything other than English
 * Lg of Wiki—Language of the Wiki in question.
 * Lg Used—Language analyzer configured. Note that CJK is a generic processor for Chinese, Japanese, and Korean. ICU is an open-source library for Unicode processing.
 * Wiki domain—the domain of the wiki
 * Notes—Notes on mutual intelligibility, differences in code/live configuration, etc. * indicates that I had to get the language used from code or the main page of wiki, since the live config was unavailable since it's a private wiki.

For each language group there is a summary row. The row lists totals for displayed article counts and search volumes (i.e., those from potentially incompatible wikis). Language families of the unrelated languages are also listed (e.g., Eskimo-Aleut is listed in the row for Danish because Greenlandic, which falls back to Danish, is an Eskimo-Aleut language, while Danish is not—it's Germanic).
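
As a rough illustration of how such a summary row could be computed, here is a minimal Python sketch. The `WikiRow` type, its field names, and the sample rows are all hypothetical placeholders, not figures from the table. It assumes that a row only carries an article count and search volume when those values are displayed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WikiRow:
    """One table row; all field names here are hypothetical."""
    domain: str
    code: str                    # compatibility code: + ! ? x # -
    articles: Optional[int]      # None when no count is displayed
    search_pct: Optional[float]  # None when no volume is displayed


def summary_row(rows):
    """Total the displayed article counts and search volumes for one group."""
    shown = [r for r in rows if r.articles is not None]
    total_articles = sum(r.articles for r in shown)
    total_search = sum(r.search_pct or 0.0 for r in shown)
    return total_articles, total_search


# Placeholder rows for illustration only; these are not real figures.
group = [
    WikiRow("exact.example.org", "+", None, None),
    WikiRow("related.example.org", "?", 1000, 0.25),
    WikiRow("unrelated.example.org", "x", 500, 0.5),
]
print(summary_row(group))
```

Rows without displayed numbers (the exact matches) simply drop out of the totals, which mirrors the "displayed counts only" rule described above.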

The language groups with potential problems are listed here alphabetically. The others are listed at the end of the page—provided for completeness, but not very interesting.

Next Steps
There are 102 wikis with non-exact language analysis configurations: That's a lot to sort through. Doing a moderately detailed analysis of each one and getting feedback from the communities would take several months at least.
 * 47 are obvious linguistic mismatches.
 * 12 are configured with the analyzer for a reasonably mutually intelligible language and so have a reasonable potential to be doing more good than harm.
 * The middle 43 are genetically related, but are unlikely, on average, to benefit much from having the wrong-language analyzer.


Some options for how to proceed:
 * We could turn off some of the most obviously poor linguistic matches and see if anyone complains.
 * We could invite comment (for all, or for only the less obviously bad ones) and see if anyone in the community wants to see some sort of analysis of what kind of difference it makes with and without the analyzer. If there is no objection or request for analysis, we could turn them off.
 * We could do some pre-emptive analysis for the ones most likely to be similar and start the conversation with the communities with that information.

Suggestions on how best to approach this huge undertaking are very welcome!

If there are any that should stay configured—such as Czech and Slovak or Catalan and Occitan—then it's easy enough to configure those as explicit desirable fallbacks in the AnalysisConfigBuilder code.
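
A minimal sketch of what an explicit allowlist could look like, written in Python for illustration rather than in the PHP of the real AnalysisConfigBuilder. The mapping name, the function, the `"icu"` label, and the set of available analyzers are all hypothetical. The two pairs come from the examples above, on the assumption that Slovak (`sk`) would use the Czech (`cs`) analyzer and Occitan (`oc`) the Catalan (`ca`) analyzer.

```python
# Hypothetical allowlist: wiki language -> analyzer language worth keeping.
DESIRED_ANALYZER_FALLBACKS = {
    "sk": "cs",  # Slovak wiki using the Czech analyzer (assumed direction)
    "oc": "ca",  # Occitan wiki using the Catalan analyzer (assumed direction)
}

DEFAULT_ANALYZER = "icu"  # stand-in label for the ICU default


def pick_analyzer(wiki_lang, available):
    """Return the analyzer for a wiki, ignoring messaging fallbacks.

    `available` is the set of languages that have their own analyzer.
    """
    if wiki_lang in available:
        return wiki_lang
    fallback = DESIRED_ANALYZER_FALLBACKS.get(wiki_lang)
    if fallback in available:
        return fallback
    return DEFAULT_ANALYZER


available = {"cs", "ca", "da"}
print(pick_analyzer("sk", available))  # explicit fallback applies
print(pick_analyzer("kl", available))  # Greenlandic: no entry, so default
```

The point of the design is that analyzer fallbacks are opted into one pair at a time, so a messaging fallback such as Greenlandic to Danish no longer enables an analyzer by accident.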

Based on the size and search volume of the wikis in question, this seems like a reasonable order in which to address the configurations, if it makes sense to work on one language group at a time:


 * Czech/Slovak
 * French
 * Indonesian
 * German
 * Russian
 * Italian
 * Spanish
 * Norwegian
 * Catalan
 * Hindi
 * Arabic
 * Persian
 * Polish
 * Dutch
 * Hebrew (note that this one hasn't been deployed yet, so it isn't broken yet!)

The Rest of the Table
This is the rest of the table from above, where nothing terribly exciting is happening. Everything is either using the appropriate language or the ICU default.