User:TJones (WMF)/Notes/Fallback Languages

October 2016 — See TJones_(WMF)/Notes for other projects. (related to T146358 and T147959)

Background
For languages that don't have their own language analyzer, we can specify a fallback to a different language's analyzer, if one is available. For example, there is no Ukrainian analyzer, so we fall back to the Russian analyzer. Obviously this is far from perfect, but it's better than nothing. It's also good to keep these fallbacks in mind when making changes to an analyzer that is a default for other languages.

In the Ukrainian/Russian case, this seems to mean that analysis changes made for Russian-language wikis will also take effect on Ukrainian-language wikis, and changes needed for Ukrainian-language wikis will have to be made for Russian-language wikis, too, unless we special-case them by language (which is doable, but hacky). That complicates matters.
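As a rough illustration of that coupling, here is a minimal Python sketch of a fallback lookup. It is not the actual search code: the table names, the analyzer names, and the two helper functions are all hypothetical, and only the Ukrainian-to-Russian pair is taken from the notes above. It also assumes a flat, single-level fallback list per language.

```python
# Hypothetical tables for illustration only; the real configuration lives elsewhere.
FALLBACKS = {"uk": ["ru"]}  # language code -> ordered list of fallback codes
ANALYZERS = {"ru": "russian", "fr": "french"}  # languages with their own analyzer


def resolve_analyzer(lang, default="default"):
    """Return the analyzer name for lang, trying lang itself and then its fallbacks."""
    for code in [lang, *FALLBACKS.get(lang, [])]:
        if code in ANALYZERS:
            return ANALYZERS[code]
    return default


def languages_affected_by(analyzer_lang):
    """List every known language that resolves to analyzer_lang's analyzer."""
    target = ANALYZERS[analyzer_lang]
    candidates = set(ANALYZERS) | set(FALLBACKS)
    return sorted(code for code in candidates if resolve_analyzer(code) == target)


print(resolve_analyzer("uk"))       # Ukrainian resolves to the Russian analyzer
print(languages_affected_by("ru"))  # languages touched by a change to that analyzer
```

Under these assumptions, a helper like `languages_affected_by` is a cheap way to check which wikis a change to one analyzer would touch before making it.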

I've done my best to expand the language and orthography codes (in parens) for ease of reading and searching the page, but if their expansion disagrees with the code, the code is correct.

This list of codes was extracted from the  variables in the source files. (Turns out that this fact is documented in Manual:Language; who knew‽) They were extracted on October 7, 2016, and have been getting stale since then. Nemo kindly points out that there's a very handy up-to-date list at Localisation statistics. (Thanks!)

Update: After looking into this some more, I now realize that these fallbacks, which are defined in Messages.php and reused elsewhere, have an obvious geographical and historical basis, which is not necessarily linguistic at all. I'd picked out three representative examples that made no linguistic sense: Guaraní / Spanish, Wolof / French, and Chechen / Russian. For each X / Y pair, it does make sense that a speaker of X is more likely to know Y than most other world languages, so Y is a reasonable fallback for messages, banners, etc., when no native option is available. However, the list should not have been extended to matters of language processing.

My favorite current example of nonsensical language processing is searching for lorsqu'ele on the Wolof Wikipedia. If you know a little French and a few secrets about the French analysis chain, it's no surprise that it returns matches on element, but I don't think that's the expected result in Wolof (though Wolof speakers are more likely to know some French, and thus may not be entirely surprised). We're going to try to address this issue over time. See T147959.

Graphical Representation
This may be even more outdated, but it is useful nonetheless to get a sense of the Big Picture.