User:TJones (WMF)/Notes/Fallback Redux

September/October 2017 — See TJones_(WMF)/Notes for other projects. See also T147959 and Disabling Messaging Fallbacks for Language Analysis.

Background

As noted in my write-up from last year, messaging fallback languages that make sense geographically and historically, but not necessarily linguistically, are also being used to enable language analyzers in places where they don't make much sense.

Data

I did a quick-n-dirty analysis of languages as configured in code last time, but this time I pulled the actual live configuration for Wikipedias in every language and, where possible, for the "Other Wikimedia projects" listed on the Special:SiteMatrix page on mediawiki. For private wikis, I used the info on the main page of the wiki and the config under wgLanguageCode in wmf-config/InitialiseSettings.php (very large link).

There are a few mismatches between config in code and the live config in production, probably caused by fallback languages being configured after the wikis were started; those wikis haven't been re-indexed yet, so the new fallback config hasn't had a chance to take effect.
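One way to spot such mismatches is to compare the analyzer each wiki is configured for in code against what is actually live. The sketch below is illustrative only: the dictionaries and the `find_mismatches` helper are hypothetical, and the sample values for atj and kbp are taken from the table on this page.

```python
# Hypothetical snapshots of analyzer settings, keyed by wiki domain.
# The atj and kbp values mirror the table on this page; the data
# structures and the helper name are illustrative assumptions.
ICU_DEFAULT = "ICU normalizer + Standard tokenizer"

configured_in_code = {
    "br.wikipedia.org": "French",
    "atj.wikipedia.org": "French",
    "kbp.wikipedia.org": "French",
}

live_in_production = {
    "br.wikipedia.org": "French",
    "atj.wikipedia.org": ICU_DEFAULT,
    "kbp.wikipedia.org": ICU_DEFAULT,
}


def find_mismatches(configured, live):
    """Return {wiki: (configured, live)} for wikis whose settings differ."""
    shared = configured.keys() & live.keys()
    return {
        wiki: (configured[wiki], live[wiki])
        for wiki in sorted(shared)
        if configured[wiki] != live[wiki]
    }


mismatches = find_mismatches(configured_in_code, live_in_production)
for wiki, (want, have) in mismatches.items():
    print(f"{wiki}: configured {want!r}, live {have!r}")
```

Wikis flagged this way are the ones marked # in the table: they would pick up the configured analyzer after a re-index.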

Analysis

The table below has all the wikis I looked at, grouped by language configured. For wikis with fallback language analyzers enabled, I also listed the number of articles on the wikis and the percentage of search traffic for each wiki. The numbers are snapshots so the links have changed, but they should give workable estimates.

The columns include:

  • Compatibility (and some other info), indicated by the following codes:
    • + = configured language matches content, or is ICU default
    • ! = plausible, as languages are listed as mutually intelligible in written form, but not guaranteed
    • ? = genetic relation to fallback; may be useful (but I'm extremely doubtful)
    • x = no genetic relation to fallback
    • # = should be in configured language, and would be if re-indexed, but is not currently
    • - = wiki is closed
  • WP Articles—count of articles in Wikipedia in that language
  • Search Volume—percentage of search volume from Discovery dashboards; 3% is high for anything other than English
  • Lg of Wiki—Language of the Wiki in question.
  • Lg Used—Language analyzer configured. Note that CJK is a generic processor for Chinese, Japanese, and Korean. ICU is an open-source library for Unicode processing.
  • Wiki domain—the domain of the wiki
  • Notes—Notes on mutual intelligibility, differences in code/live configuration, etc. * indicates that I had to get the language used from code or the main page of wiki, since the live config was unavailable since it's a private wiki.
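For readers unfamiliar with the values in the Lg Used column, here is a rough sketch of what the two ICU default chains might look like as Elasticsearch index settings, written as a Python dictionary. The analyzer names (`icu_default`, `icu_tokenized`) are made up for illustration, the ICU components assume the analysis-icu plugin is installed, and none of this is copied from the actual CirrusSearch configuration.

```python
import json

# Illustrative only: the analyzer names are hypothetical, and the real
# search configuration is more involved than this.
analysis_settings = {
    "analysis": {
        "analyzer": {
            # Corresponds to "ICU normalizer + Standard tokenizer"
            "icu_default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["icu_normalizer"],
            },
            # Corresponds to "ICU normalizer + ICU tokenizer"
            "icu_tokenized": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_normalizer"],
            },
        }
    }
}

print(json.dumps(analysis_settings, indent=2))
```

Neither chain does any language-specific stemming, which is why these defaults are safe for any language.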

For each language group there is a summary row. The row lists totals for displayed article counts and search volumes (i.e., those from potentially incompatible wikis). Language families of the unrelated languages are also listed (e.g., Eskimo-Aleut is listed in the row for Danish because Greenlandic, which falls back to Danish, is an Eskimo-Aleut language, while Danish is not—it's Germanic).
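As a worked example of how a summary row is derived, the sketch below recomputes the Dutch totals from the individual rows in the table. The tuple layout and the `summarize` helper are my own illustrative choices; the numbers are copied from the table.

```python
from decimal import Decimal

# Rows for the Dutch group, copied from the table on this page:
# (compatibility code, article count, search volume %, wiki domain).
# Wikis marked "+" have no displayed numbers, represented here as None.
dutch_rows = [
    ("+", None, None, "nl.wikipedia.org"),
    ("?", 6952, "0.001", "nds-nl.wikipedia.org"),
    ("?", 12043, "0.004", "li.wikipedia.org"),
    ("x", 1059, "0.000", "srn.wikipedia.org"),
    ("?", 6209, "0.004", "vls.wikipedia.org"),
    ("?", 4379, "0.000", "zea.wikipedia.org"),
    ("+", None, None, "nl.wikimedia.org"),
    ("+", None, None, "arbcom-nl.wikipedia.org"),
]


def summarize(rows):
    """Sum the displayed article counts and search volumes for a group."""
    articles = sum(count for _, count, _, _ in rows if count is not None)
    volume = sum(
        (Decimal(pct) for _, _, pct, _ in rows if pct is not None),
        Decimal("0"),
    )
    return articles, volume


articles, volume = summarize(dutch_rows)
print(f"Dutch {articles:,} {volume}%")  # prints: Dutch 30,642 0.009%
```

Search volumes are kept as `Decimal` values so that the three-decimal percentages add up exactly rather than picking up floating-point noise.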

The language groups with potential problems are listed here alphabetically. The others are listed at the end of the page—provided for completeness, but not very interesting.

Notes

  • Having a "genetic relation" to a fallback (indicated by ?) means the languages may be only as closely related as Spanish and Romanian or German and English—so they are not necessarily very similar.
  • Mutual intelligibility (indicated by !) between languages means that speakers are usually clever enough to figure out how to understand one another. It does not mean that software will necessarily have a similarly easy time.
    • As a hypothetical example, a simple variant of English replacing all plural -s with -z, the endings -ing and -ed with -in and -t, -able/-ible with -oble, -ly with -like, and all ws with vs vould be perfectlike understandoble (that is, very highlike mutuallike intelligoble) to most English speakerz, but our English language analysis softvare vould be confust by the changez and vould be makin many mistakez.
    • On the other hand, sometimes two very similar "dialects" are varieties of the same language separated for historical, political, or cultural reasons.
  • The French Wikipedia has the French language analyzer and ICU folding enabled, while all others with French only have the French language analyzer.
  • Atikamekw and Kabiye have French configured as their fallback, but are not using it in production.
  • A new Hebrew language analyzer has recently been deployed. The Yiddish Wikipedia is configured to use it, but it will not be enabled until the wiki is re-indexed.
  • I've given some examples of poor and pointless processing on the main community communication page.
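To make the hypothetical English variant from the notes above concrete, here is a toy illustration of why software struggles where people do not. The suffix stripper is a deliberately naive stand-in that I made up for this example; it is not the English analyzer used in production.

```python
# A deliberately naive suffix stripper, standing in for real English
# analysis. It is a hypothetical toy, not the production stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Standard English forms paired with the made-up variant spellings.
pairs = [
    ("changes", "changez"),
    ("mistakes", "mistakez"),
    ("making", "makin"),
    ("confused", "confust"),
]

for standard, variant in pairs:
    same = toy_stem(standard) == toy_stem(variant)
    print(f"{standard} -> {toy_stem(standard)}, "
          f"{variant} -> {toy_stem(variant)}, match: {same}")
```

None of the pairs end up with the same stem, so a search for the standard form would miss the variant form, even though a human reader would connect them immediately.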

The Table

Compatibility WP Articles Search Volume Lg of Wiki Lg Used Wiki domain Notes
Arabic 17,186 0.033%
+ Arabic Arabic ar.wikipedia.org
? 17,186 0.033% Egyptian Arabic Arabic arz.wikipedia.org
Catalan 83,546 0.027%
+ Catalan Catalan ca.wikipedia.org
! 83,546 0.027% Occitan Catalan oc.wikipedia.org (high mutual intelligibility)
Czech 222,068 0.276%
+ Czech Czech cs.wikipedia.org
! 222,068 0.276% Slovak Czech sk.wikipedia.org (significant mutual intelligibility)
+ Czech Czech arbcom-cs.wikipedia.org *
Danish 1,646 0.001% (Eskimo–Aleut)
+ Danish Danish da.wikipedia.org
x 1,646 0.001% Greenlandic Danish kl.wikipedia.org
+ Danish Danish dk.wikimedia.org
Dutch 30,642 0.009%
+ Dutch Dutch nl.wikipedia.org
? 6,952 0.001% Dutch Low Saxon Dutch nds-nl.wikipedia.org
? 12,043 0.004% Limburgish Dutch li.wikipedia.org
x 1,059 0.000% Sranan Dutch srn.wikipedia.org
? 6,209 0.004% West Flemish Dutch vls.wikipedia.org
? 4,379 0.000% Zeelandic Dutch zea.wikipedia.org
+ Dutch Dutch nl.wikimedia.org
+ Dutch Dutch arbcom-nl.wikipedia.org *
Finnish 2,308 0.000%
+ Finnish Finnish fi.wikipedia.org
!? 2,308 0.000% Livvi-Karelian Finnish olo.wikipedia.org (dialect of Karelian, which is highly mutually intelligible with Finnish)
+ Finnish Finnish fi.wikimedia.org
+ Finnish Finnish arbcom-fi.wikipedia.org *
French 233,042 0.070% (Niger-Congo, Austronesian, Algic, Celtic)
x 3,090 0.000% Bambara French bm.wikipedia.org
x 63,035 0.018% Breton French br.wikipedia.org
! 2,627 0.000% Franco-Provençal French frp.wikipedia.org
+ French French + ICU Folding fr.wikipedia.org
x 220 0.000% Fula French ff.wikipedia.org
? 51,518 0.002% Haitian French ht.wikipedia.org
x 2,915 0.001% Lingala French ln.wikipedia.org
x 84,634 0.010% Malagasy French mg.wikipedia.org
? 3,627 0.000% Norman French nrm.wikipedia.org
? 3,525 0.031% Picard French pcd.wikipedia.org
x 253 0.000% Sango French sg.wikipedia.org
x 1,191 0.000% Tahitian French ty.wikipedia.org
? 14,631 0.004% Walloon French wa.wikipedia.org
x 1,157 0.003% Wolof French wo.wikipedia.org
x# 79 0.000% Atikamekw French atj.wikipedia.org Using ICU normalizer + Standard tokenizer
x# 540 0.001% Kabiye French kbp.wikipedia.org Using ICU normalizer + Standard tokenizer
German 154,421 0.059% (Slavic)
? 23,330 0.017% Alemannic German als.wikipedia.org
? 23,087 0.009% Bavarian German bar.wikipedia.org
+ German German de.wikipedia.org
? 26,703 0.008% Low Saxon German nds.wikipedia.org
x 3,088 0.001% Lower Sorbian German dsb.wikipedia.org
! 50,146 0.011% Luxembourgish German lb.wikipedia.org (partial mutual intelligibility)
? 5,303 0.001% North Frisian German frr.wikipedia.org
? 2,071 0.001% Palatinate German German pfl.wikipedia.org
? 1,800 0.001% Pennsylvania German German pdc.wikipedia.org
? 2,836 0.000% Ripuarian German ksh.wikipedia.org
? 3,786 0.001% Saterland Frisian German stq.wikipedia.org
x 12,271 0.009% Upper Sorbian German hsb.wikipedia.org
+ German German arbcom-de.wikipedia.org *
Greek 453 0.000%
+ Greek Greek el.wikipedia.org
? 453 0.000% Pontic Greek Greek pnt.wikipedia.org ("at best" partial mutual intelligibility)
Hebrew 14,101 0.008% (Germanic)
+# Hebrew Hebrew he.wikipedia.org Hebrew analysis not yet deployed
x# 14,101 0.008% Yiddish Hebrew yi.wikipedia.org Hebrew analysis not yet deployed
+# Hebrew Hebrew il.wikimedia.org * / Hebrew analysis not yet deployed
Hindi 22,999 0.040%
+ Hindi Hindi hi.wikipedia.org
? 11,817 0.002% Maithili Hindi mai.wikipedia.org
? 11,182 0.038% Sanskrit Hindi sa.wikipedia.org
Indonesian 347,328 0.027%
? 7,228 0.000% Acehnese Indonesian ace.wikipedia.org
? 1,727 0.000% Banjar Indonesian bjn.wikipedia.org
? 13,285 0.001% Banyumasan Indonesian map-bms.wikipedia.org
x 14,120 0.000% Buginese Indonesian bug.wikipedia.org (partially in Lontara alphabet)
+ Indonesian Indonesian id.wikipedia.org
? 50,295 0.016% Javanese Indonesian jv.wikipedia.org
? 221,993 0.001% Minangkabau Indonesian min.wikipedia.org
? 38,680 0.009% Sundanese Indonesian su.wikipedia.org
Italian 181,597 0.041%
! 5,454 0.013% Corsican Italian co.wikipedia.org
? 9,034 0.005% Emilian-Romagnol Italian eml.wikipedia.org
? 3,186 0.000% Friulian Italian fur.wikipedia.org
+ Italian Italian it.wikipedia.org
? 3,281 0.000% Ligurian Italian lij.wikipedia.org
x 36,147 0.003% Lombard Italian lmo.wikipedia.org (explicitly listed as not mutually intelligible with Italian)
!x 14,466 0.002% Neapolitan Italian nap.wikipedia.org (conflicting info on mutual intelligibility with Italian)
? 64,183 0.001% Piedmontese Italian pms.wikipedia.org
!x 25,642 0.014% Sicilian Italian scn.wikipedia.org (conflicting info on mutual intelligibility with Italian)
? 9,234 0.000% Tarantino Italian roa-tara.wikipedia.org
x 10,970 0.003% Venetian Italian vec.wikipedia.org (explicitly listed as not mutually intelligible with Italian)
Latvian 801 0.000%
? 801 0.000% Latgalian Latvian ltg.wikipedia.org
+ Latvian Latvian lv.wikipedia.org
Lithuanian 16,128 0.000%
+ Lithuanian Lithuanian lt.wikipedia.org
? 16,128 0.000% Samogitian Lithuanian bat-smg.wikipedia.org
Norwegian 134,828 0.038%
+ Norwegian Bokmål Norwegian no.wikipedia.org
? 134,828 0.038% Norwegian Nynorsk Norwegian nn.wikipedia.org Elastic has a light_nynorsk stemmer
+ Norwegian Norwegian no.wikimedia.org
+ Norwegian Norwegian noboard-chapters.wikimedia.org *
Persian 70,131 0.006% (Turkic)
? 5,679 0.000% Gilaki Persian glk.wikipedia.org
? 12,539 0.002% Mazandarani Persian mzn.wikipedia.org
? 5,324 0.000% Northern Luri Persian lrc.wikipedia.org
+ Persian Persian fa.wikipedia.org
x 46,589 0.004% Southern Azerbaijani Persian azb.wikipedia.org
Polish 11,537 0.033%
x 5,205 0.029% Kashubian Polish csb.wikipedia.org (explicitly listed as not mutually intelligible with Polish)
+ Polish Polish pl.wikipedia.org
? 6,332 0.004% Silesian Polish szl.wikipedia.org
+ Polish Polish pl.wikimedia.org
Portuguese 3,517 0.001%
! 3,517 0.001% Mirandese Portuguese mwl.wikipedia.org
+ Portuguese Portuguese pt.wikipedia.org
Romanian 2,205 0.001% (Indo-Iranian)
? 1,210 0.000% Aromanian Romanian roa-rup.wikipedia.org
x- 394 0.000% Moldovan Cyrillic (Romanian) Romanian mo.wikipedia.org
x 601 0.001% Romani Romanian rmy.wikipedia.org
+ Romanian Romanian ro.wikipedia.org
Russian 394,996 0.022% (Turkic, Uralic, Mongolic)
x 3,220 0.001% Abkhazian Russian ab.wikipedia.org
x 2,312 0.000% Avar Russian av.wikipedia.org
x 39,808 0.007% Bashkir Russian ba.wikipedia.org
x 1,989 0.000% Buryat Russian bxr.wikipedia.org
x 164,314 0.002% Chechen Russian ce.wikipedia.org
x 40,620 0.001% Chuvash Russian cv.wikipedia.org
x 3,861 0.000% Erzya Russian myv.wikipedia.org
x 10,240 0.000% Hill Mari Russian mrj.wikipedia.org
x 2,074 0.000% Kalmyk Russian xal.wikipedia.org
x 2,019 0.001% Karachay-Balkar Russian krc.wikipedia.org
x 5,250 0.001% Komi Russian kv.wikipedia.org
x 3,448 0.000% Komi-Permyak Russian koi.wikipedia.org
x 1,213 0.000% Lak Russian lbe.wikipedia.org
x 3,846 0.000% Lezgian Russian lez.wikipedia.org
x 9,649 0.000% Meadow Mari Russian mhr.wikipedia.org
x 1,171 0.000% Moksha Russian mdf.wikipedia.org
x 10,529 0.001% Ossetian Russian os.wikipedia.org
+ Russian Russian ru.wikipedia.org
x 11,407 0.001% Sakha Russian sah.wikipedia.org
x 72,540 0.007% Tatar Russian tt.wikipedia.org
x 1,410 0.000% Tuvan Russian tyv.wikipedia.org
x 4,076 0.000% Udmurt Russian udm.wikipedia.org
+ Russian Russian ru.wikimedia.org
Spanish 128,134 0.057% (Aymara, Tupi–Guarani, Uto-Aztecan, Quechuan)
! 32,383 0.011% Aragonese Spanish an.wikipedia.org
! 50,499 0.018% Asturian Spanish ast.wikipedia.org
x 4,250 0.001% Aymara Spanish ay.wikipedia.org
? 3,004 0.000% Chavacano Spanish cbk-zam.wikipedia.org
? 2,910 0.000% Extremaduran Spanish ext.wikipedia.org
x 3,209 0.023% Guarani Spanish gn.wikipedia.org
? 4,498 0.003% Ladino Spanish lad.wikipedia.org
x 7,113 0.000% Nahuatl Spanish nah.wikipedia.org
x 20,268 0.001% Quechua Spanish qu.wikipedia.org
+ Spanish Spanish es.wikipedia.org
+ Spanish Spanish ar.wikimedia.org
+ Spanish Spanish co.wikimedia.org
+ Spanish Spanish mx.wikimedia.org
Turkish 2,757 0.003%
!? 2,757 0.003% Gagauz Turkish gag.wikipedia.org (partial mutual intelligibility)
+ Turkish Turkish tr.wikipedia.org
+ Turkish Turkish tr.wikimedia.org
Ukrainian 6,160 0.003%
? 6,160 0.003% Rusyn Ukrainian rue.wikipedia.org
+ Ukrainian Ukrainian uk.wikipedia.org
+ Ukrainian Ukrainian ua.wikimedia.org

Next Steps

There are 102 wikis with non-exact language analysis configurations:

  • 47 are obvious linguistic mismatches.
  • 12 are configured with the analyzer for a reasonably mutually intelligible language and so have a reasonable potential to be doing more good than harm.
  • The remaining 43 are genetically related to the fallback language, but on average are unlikely to benefit much from having the wrong-language analyzer.
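The three counts above can be reproduced from the compatibility codes in the table. The sketch below is a rough check, not the method originally used: the codes are transcribed by hand from the table, the `bucket` helper is hypothetical, and the precedence it applies to combined codes (! first, then x, then ?) is my assumption, chosen because it matches the totals given here.

```python
from collections import Counter

# Compatibility codes of the wikis without a plain "+" match,
# transcribed group by group from the table on this page.
codes_by_group = {
    "Arabic": "?",
    "Catalan": "!",
    "Czech": "!",
    "Danish": "x",
    "Dutch": "? ? x ? ?",
    "Finnish": "!?",
    "French": "x x ! x ? x x ? ? x x ? x x# x#",
    "German": "? ? ? x ! ? ? ? ? ? x",
    "Greek": "?",
    "Hebrew": "x#",
    "Hindi": "? ?",
    "Indonesian": "? ? ? x ? ? ?",
    "Italian": "! ? ? ? x !x ? !x ? x",
    "Latvian": "?",
    "Lithuanian": "?",
    "Norwegian": "?",
    "Persian": "? ? ? x",
    "Polish": "x ?",
    "Portuguese": "!",
    "Romanian": "? x- x",
    "Russian": " ".join(["x"] * 21),
    "Spanish": "! ! x ? ? x ? x x",
    "Turkish": "!?",
    "Ukrainian": "?",
}


def bucket(code):
    """Map a (possibly combined) code to one bucket.

    Assumed precedence: "!" wins over "x", which wins over "?".
    Suffixes such as "#" (not re-indexed) and "-" (closed) are ignored.
    """
    for marker in "!x?":
        if marker in code:
            return marker
    return "+"


tally = Counter(
    bucket(code)
    for codes in codes_by_group.values()
    for code in codes.split()
)
print(tally["x"], tally["!"], tally["?"], sum(tally.values()))
```

Under that precedence the tally comes to 47 mismatches, 12 mutually intelligible, and 43 genetically related, for 102 wikis in total.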

I've done a more detailed but still rough analysis of the similarity of the potential keepers, and asked the community for feedback on the following:

We'll see what comes of those discussions. In the meantime I've configured these as exceptions in the [WIP] patch I've submitted to Gerrit.

The rest are scheduled to be disabled in the code the week of October 9th, though the actual re-indexing may take a while after that. Re-indexing is tracked on Phab task T177871.

The outline of the plan has been laid out on another page: Disabling Messaging Fallbacks for Language Analysis, which is where community discussion will be directed, though there are also links back to here and to Phab.

The Rest of the Table

This is the rest of the table from above, where nothing terribly exciting is happening. Everything is either using the appropriate language or the ICU default.

Compatibility Lg of Wiki Lg Used Wiki domain Notes
Armenian
+ Armenian Armenian hy.wikipedia.org
Basque
+ Basque Basque eu.wikipedia.org
Brazilian Portuguese
+ Brazilian Portuguese Brazilian Portuguese br.wikimedia.org
Bulgarian
+ Bulgarian Bulgarian bg.wikipedia.org
Chinese
+ Chinese Chinese zh.wikipedia.org
+ Chinese Chinese cn.wikimedia.org
CJK
+ Japanese CJK ja.wikipedia.org
+ Korean CJK ko.wikipedia.org
English
+ English English en.wikipedia.org
+ English English simple.wikipedia.org
+ English English nostalgia.wikipedia.org
+ English English test.wikipedia.org
+ English English test2.wikipedia.org
+ English English be.wikimedia.org (yep, the Wikimedia Belgium site is in English)
+ English English beta.wikiversity.org
+ English English ca.wikimedia.org
+ English English commons.wikimedia.org
+ English English donate.wikimedia.org
+ English English incubator.wikimedia.org
+ English English labtestwikitech.wikimedia.org
+ English English login.wikimedia.org
+ English English meta.wikimedia.org
+ English English nyc.wikimedia.org
+ English English outreach.wikimedia.org
+ English English species.wikimedia.org
+ English English test.wikidata.org
+ English English vote.wikimedia.org
+ English English wikimania2017.wikimedia.org
+ English English wikimediafoundation.org
+ English English wikisource.org
+ English English wikitech.wikimedia.org
+ English English www.mediawiki.org
+ English English www.wikidata.org
+- English English ten.wikipedia.org
+- English English advisory.wikimedia.org
+- English English nz.wikimedia.org
+- English English pa-us.wikimedia.org
+- English English quality.wikimedia.org
+- English English strategy.wikimedia.org
+- English English usability.wikimedia.org
+- English English wikimania2005.wikimedia.org
+- English English wikimania2006.wikimedia.org
+- English English wikimania2007.wikimedia.org
+- English English wikimania2008.wikimedia.org
+- English English wikimania2009.wikimedia.org
+- English English wikimania2010.wikimedia.org
+- English English wikimania2011.wikimedia.org
+- English English wikimania2012.wikimedia.org
+- English English wikimania2013.wikimedia.org
+- English English wikimania2014.wikimedia.org
+- English English wikimania2015.wikimedia.org
+- English English wikimania2016.wikimedia.org
+ English English affcom.wikimedia.org *
+ English English arbcom-en.wikipedia.org *
+ English English auditcom.wikimedia.org *
+ English English chair.wikimedia.org *
+ English English checkuser.wikimedia.org *
+ English English collab.wikimedia.org *
+ English English ec.wikimedia.org *
+ English English exec.wikimedia.org *
+ English English fdc.wikimedia.org *
+ English English grants.wikimedia.org *
+ English English iegcom.wikimedia.org *
+ English English legalteam.wikimedia.org *
+ English English office.wikimedia.org *
+ English English ombudsmen.wikimedia.org *
+ English English otrs-wiki.wikimedia.org *
+ English English projectcom.wikimedia.org *
+ English English searchcom.wikimedia.org *
+ English English steward.wikimedia.org *
+ English English transitionteam.wikimedia.org *
+ English English wikimaniateam.wikimedia.org *
+ English English zero.wikimedia.org *
+ English English board.wikimedia.org *
+ English English boardgovcom.wikimedia.org *
+ English English internal.wikimedia.org *
+ English English movementroles.wikimedia.org *
+ English English spcom.wikimedia.org *
+ English English techconduct.wikimedia.org *
+ English English wg-en.wikipedia.org *
Galician
+ Galician Galician gl.wikipedia.org
Hungarian
+ Hungarian Hungarian hu.wikipedia.org
Irish
+ Irish Irish ga.wikipedia.org
Sorani
+ Sorani Sorani ckb.wikipedia.org
Swedish
+ Swedish Swedish sv.wikipedia.org
+ Swedish Swedish se.wikimedia.org
Thai
+ Thai Thai th.wikipedia.org
ICU normalizer + ICU tokenizer
+ Tibetan ICU normalizer + ICU tokenizer bo.wikipedia.org
+ Min Dong ICU normalizer + ICU tokenizer cdo.wikipedia.org
+ Cree ICU normalizer + ICU tokenizer cr.wikipedia.org
+ Dzongkha ICU normalizer + ICU tokenizer dz.wikipedia.org
+ Gan ICU normalizer + ICU tokenizer gan.wikipedia.org
+ Hakka ICU normalizer + ICU tokenizer hak.wikipedia.org
+ Khmer ICU normalizer + ICU tokenizer km.wikipedia.org
+ Lao ICU normalizer + ICU tokenizer lo.wikipedia.org
+ Burmese ICU normalizer + ICU tokenizer my.wikipedia.org
+ Wu ICU normalizer + ICU tokenizer wuu.wikipedia.org
+ Classical Chinese ICU normalizer + ICU tokenizer zh-classical.wikipedia.org
+ Min Nan ICU normalizer + ICU tokenizer zh-min-nan.wikipedia.org
+ Cantonese ICU normalizer + ICU tokenizer zh-yue.wikipedia.org
ICU normalizer + Standard tokenizer
+ Adyghe ICU normalizer + Standard tokenizer ady.wikipedia.org
+ Afrikaans ICU normalizer + Standard tokenizer af.wikipedia.org
+ Akan ICU normalizer + Standard tokenizer ak.wikipedia.org
+ Amharic ICU normalizer + Standard tokenizer am.wikipedia.org
+ Anglo-Saxon ICU normalizer + Standard tokenizer ang.wikipedia.org
+ Aramaic ICU normalizer + Standard tokenizer arc.wikipedia.org
+ Assamese ICU normalizer + Standard tokenizer as.wikipedia.org
+ Azerbaijani ICU normalizer + Standard tokenizer az.wikipedia.org
+ Central Bicolano ICU normalizer + Standard tokenizer bcl.wikipedia.org
+ Belarusian-Taraškievica ICU normalizer + Standard tokenizer be-tarask.wikipedia.org
+ Belarusian ICU normalizer + Standard tokenizer be.wikipedia.org
+ Bihari ICU normalizer + Standard tokenizer bh.wikipedia.org
+ Bislama ICU normalizer + Standard tokenizer bi.wikipedia.org
+ Bengali ICU normalizer + Standard tokenizer bn.wikipedia.org
+ Bishnupriya Manipuri ICU normalizer + Standard tokenizer bpy.wikipedia.org
+ Bosnian ICU normalizer + Standard tokenizer bs.wikipedia.org
+ Cebuano ICU normalizer + Standard tokenizer ceb.wikipedia.org
+ Chamorro ICU normalizer + Standard tokenizer ch.wikipedia.org
+ Cherokee ICU normalizer + Standard tokenizer chr.wikipedia.org
+ Cheyenne ICU normalizer + Standard tokenizer chy.wikipedia.org
+ Crimean Tatar ICU normalizer + Standard tokenizer crh.wikipedia.org
+ Old Church Slavonic ICU normalizer + Standard tokenizer cu.wikipedia.org
+ Welsh ICU normalizer + Standard tokenizer cy.wikipedia.org
+ Dinka ICU normalizer + Standard tokenizer din.wikipedia.org
+ Zazaki ICU normalizer + Standard tokenizer diq.wikipedia.org
+ Doteli ICU normalizer + Standard tokenizer dty.wikipedia.org
+ Divehi ICU normalizer + Standard tokenizer dv.wikipedia.org
+ Ewe ICU normalizer + Standard tokenizer ee.wikipedia.org
+ Esperanto ICU normalizer + Standard tokenizer eo.wikipedia.org
+ Estonian ICU normalizer + Standard tokenizer et.wikipedia.org
+ Võro ICU normalizer + Standard tokenizer fiu-vro.wikipedia.org
+ Fijian ICU normalizer + Standard tokenizer fj.wikipedia.org
+ Faroese ICU normalizer + Standard tokenizer fo.wikipedia.org
+ West Frisian ICU normalizer + Standard tokenizer fy.wikipedia.org
+ Scottish Gaelic ICU normalizer + Standard tokenizer gd.wikipedia.org
+ Goan Konkani ICU normalizer + Standard tokenizer gom.wikipedia.org
+ Gothic ICU normalizer + Standard tokenizer got.wikipedia.org
+ Gujarati ICU normalizer + Standard tokenizer gu.wikipedia.org
+ Manx ICU normalizer + Standard tokenizer gv.wikipedia.org
+ Hausa ICU normalizer + Standard tokenizer ha.wikipedia.org
+ Hawaiian ICU normalizer + Standard tokenizer haw.wikipedia.org
+ Fiji Hindi ICU normalizer + Standard tokenizer hif.wikipedia.org
+ Croatian ICU normalizer + Standard tokenizer hr.wikipedia.org
+ Interlingua ICU normalizer + Standard tokenizer ia.wikipedia.org
+ Interlingue ICU normalizer + Standard tokenizer ie.wikipedia.org
+ Igbo ICU normalizer + Standard tokenizer ig.wikipedia.org
+ Inupiak ICU normalizer + Standard tokenizer ik.wikipedia.org
+ Ilokano ICU normalizer + Standard tokenizer ilo.wikipedia.org
+ Ido ICU normalizer + Standard tokenizer io.wikipedia.org
+ Icelandic ICU normalizer + Standard tokenizer is.wikipedia.org
+ Inuktitut ICU normalizer + Standard tokenizer iu.wikipedia.org
+ Jamaican Patois ICU normalizer + Standard tokenizer jam.wikipedia.org
+ Lojban ICU normalizer + Standard tokenizer jbo.wikipedia.org
+ Georgian ICU normalizer + Standard tokenizer ka.wikipedia.org
+ Karakalpak ICU normalizer + Standard tokenizer kaa.wikipedia.org
+ Kabyle ICU normalizer + Standard tokenizer kab.wikipedia.org
+ Kabardian ICU normalizer + Standard tokenizer kbd.wikipedia.org
+ Kongo ICU normalizer + Standard tokenizer kg.wikipedia.org
+ Kikuyu ICU normalizer + Standard tokenizer ki.wikipedia.org
+ Kazakh ICU normalizer + Standard tokenizer kk.wikipedia.org
+ Kannada ICU normalizer + Standard tokenizer kn.wikipedia.org
+ Kashmiri ICU normalizer + Standard tokenizer ks.wikipedia.org
+ Kurdish ICU normalizer + Standard tokenizer ku.wikipedia.org
+ Cornish ICU normalizer + Standard tokenizer kw.wikipedia.org
+ Kirghiz ICU normalizer + Standard tokenizer ky.wikipedia.org
+ Latin ICU normalizer + Standard tokenizer la.wikipedia.org
+ Luganda ICU normalizer + Standard tokenizer lg.wikipedia.org
+ Maori ICU normalizer + Standard tokenizer mi.wikipedia.org
+ Macedonian ICU normalizer + Standard tokenizer mk.wikipedia.org
+ Malayalam ICU normalizer + Standard tokenizer ml.wikipedia.org
+ Mongolian ICU normalizer + Standard tokenizer mn.wikipedia.org
+ Marathi ICU normalizer + Standard tokenizer mr.wikipedia.org
+ Malay ICU normalizer + Standard tokenizer ms.wikipedia.org
+ Maltese ICU normalizer + Standard tokenizer mt.wikipedia.org
+ Nauruan ICU normalizer + Standard tokenizer na.wikipedia.org
+ Nepali ICU normalizer + Standard tokenizer ne.wikipedia.org
+ Newar ICU normalizer + Standard tokenizer new.wikipedia.org
+ Novial ICU normalizer + Standard tokenizer nov.wikipedia.org
+ Northern Sotho ICU normalizer + Standard tokenizer nso.wikipedia.org
+ Navajo ICU normalizer + Standard tokenizer nv.wikipedia.org
+ Chichewa ICU normalizer + Standard tokenizer ny.wikipedia.org
+ Oromo ICU normalizer + Standard tokenizer om.wikipedia.org
+ Oriya ICU normalizer + Standard tokenizer or.wikipedia.org
+ Punjabi ICU normalizer + Standard tokenizer pa.wikipedia.org
+ Pangasinan ICU normalizer + Standard tokenizer pag.wikipedia.org
+ Kapampangan ICU normalizer + Standard tokenizer pam.wikipedia.org
+ Papiamentu ICU normalizer + Standard tokenizer pap.wikipedia.org
+ Pali ICU normalizer + Standard tokenizer pi.wikipedia.org
+ Norfolk ICU normalizer + Standard tokenizer pih.wikipedia.org
+ Western Punjabi ICU normalizer + Standard tokenizer pnb.wikipedia.org
+ Pashto ICU normalizer + Standard tokenizer ps.wikipedia.org
+ Romansh ICU normalizer + Standard tokenizer rm.wikipedia.org
+ Kirundi ICU normalizer + Standard tokenizer rn.wikipedia.org
+ Kinyarwanda ICU normalizer + Standard tokenizer rw.wikipedia.org
+ Sardinian ICU normalizer + Standard tokenizer sc.wikipedia.org
+ Scots ICU normalizer + Standard tokenizer sco.wikipedia.org
+ Sindhi ICU normalizer + Standard tokenizer sd.wikipedia.org
+ Northern Sami ICU normalizer + Standard tokenizer se.wikipedia.org
+ Serbo-Croatian ICU normalizer + Standard tokenizer sh.wikipedia.org
+ Sinhalese ICU normalizer + Standard tokenizer si.wikipedia.org
+ Slovenian ICU normalizer + Standard tokenizer sl.wikipedia.org
+ Samoan ICU normalizer + Standard tokenizer sm.wikipedia.org
+ Shona ICU normalizer + Standard tokenizer sn.wikipedia.org
+ Somali ICU normalizer + Standard tokenizer so.wikipedia.org
+ Albanian ICU normalizer + Standard tokenizer sq.wikipedia.org
+ Serbian ICU normalizer + Standard tokenizer sr.wikipedia.org
+ Swati ICU normalizer + Standard tokenizer ss.wikipedia.org
+ Sesotho ICU normalizer + Standard tokenizer st.wikipedia.org
+ Swahili ICU normalizer + Standard tokenizer sw.wikipedia.org
+ Tamil ICU normalizer + Standard tokenizer ta.wikipedia.org
+ Tulu ICU normalizer + Standard tokenizer tcy.wikipedia.org
+ Telugu ICU normalizer + Standard tokenizer te.wikipedia.org
+ Tetum ICU normalizer + Standard tokenizer tet.wikipedia.org
+ Tajik ICU normalizer + Standard tokenizer tg.wikipedia.org
+ Tigrinya ICU normalizer + Standard tokenizer ti.wikipedia.org
+ Turkmen ICU normalizer + Standard tokenizer tk.wikipedia.org
+ Tagalog ICU normalizer + Standard tokenizer tl.wikipedia.org
+ Tswana ICU normalizer + Standard tokenizer tn.wikipedia.org
+ Tongan ICU normalizer + Standard tokenizer to.wikipedia.org
+ Tok Pisin ICU normalizer + Standard tokenizer tpi.wikipedia.org
+ Tsonga ICU normalizer + Standard tokenizer ts.wikipedia.org
+ Tumbuka ICU normalizer + Standard tokenizer tum.wikipedia.org
+ Twi ICU normalizer + Standard tokenizer tw.wikipedia.org
+ Uyghur ICU normalizer + Standard tokenizer ug.wikipedia.org
+ Urdu ICU normalizer + Standard tokenizer ur.wikipedia.org
+ Uzbek ICU normalizer + Standard tokenizer uz.wikipedia.org
+ Venda ICU normalizer + Standard tokenizer ve.wikipedia.org
+ Vepsian ICU normalizer + Standard tokenizer vep.wikipedia.org
+ Vietnamese ICU normalizer + Standard tokenizer vi.wikipedia.org
+ Volapük ICU normalizer + Standard tokenizer vo.wikipedia.org
+ Waray ICU normalizer + Standard tokenizer war.wikipedia.org
+ Xhosa ICU normalizer + Standard tokenizer xh.wikipedia.org
+ Mingrelian ICU normalizer + Standard tokenizer xmf.wikipedia.org
+ Yoruba ICU normalizer + Standard tokenizer yo.wikipedia.org
+ Zhuang ICU normalizer + Standard tokenizer za.wikipedia.org
+ Zulu ICU normalizer + Standard tokenizer zu.wikipedia.org
+- Afar ICU normalizer + Standard tokenizer aa.wikipedia.org
+- Choctaw ICU normalizer + Standard tokenizer cho.wikipedia.org
+- Hiri Motu ICU normalizer + Standard tokenizer ho.wikipedia.org
+- Herero ICU normalizer + Standard tokenizer hz.wikipedia.org
+- Nuosu ICU normalizer + Standard tokenizer ii.wikipedia.org
+- Kuanyama ICU normalizer + Standard tokenizer kj.wikipedia.org
+- Kanuri ICU normalizer + Standard tokenizer kr.wikipedia.org
+- Marshallese ICU normalizer + Standard tokenizer mh.wikipedia.org
+- Muscogee ICU normalizer + Standard tokenizer mus.wikipedia.org
+- Ndonga ICU normalizer + Standard tokenizer ng.wikipedia.org
+ ICU normalizer + Standard tokenizer bd.wikimedia.org
+ ICU normalizer + Standard tokenizer ee.wikimedia.org
+ ICU normalizer + Standard tokenizer mai.wikimedia.org
+ ICU normalizer + Standard tokenizer mk.wikimedia.org
+ ICU normalizer + Standard tokenizer pt.wikimedia.org
+ ICU normalizer + Standard tokenizer rs.wikimedia.org
+ ICU normalizer + Standard tokenizer wb.wikimedia.org
+ ICU normalizer + Standard tokenizer wikimania2018.wikimedia.org