User:TJones (WMF)/Notes/Swedish Analyzer Analysis

March 2017 — See TJones_(WMF)/Notes for other projects. See also T155822.

Background
For T155822 we want to enable ASCII folding for Swedish, in particular so that ä and a match. The purpose of this analysis is to see what effect ASCII folding (which gets automatically upgraded to ICU folding) has on representative Swedish text (i.e., a 10K article sample from Swedish Wikipedia).

Prod vs Unpacked
As with other analyzers, we can unpack the default Swedish analyzer according to the documentation. We've now come to expect there will be some generally positive changes from unpacking.
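For reference, an unpacked Swedish analyzer might look roughly like the sketch below, written as a Python dict that mirrors the JSON settings. It follows the "rebuilt" pattern in the Elasticsearch language analyzer documentation. The filter and analyzer names (`swedish_stop`, `swedish_stemmer`, `rebuilt_swedish`) are taken from that pattern, the optional keyword-marker filter is left out, and the configuration CirrusSearch actually generates may differ.

```python
import json

# Sketch of an unpacked ("rebuilt") Swedish analyzer, assuming the pattern
# from the Elasticsearch language analyzer docs. Names are illustrative and
# the optional keyword_marker filter is omitted.
UNPACKED_SWEDISH = {
    "analysis": {
        "filter": {
            "swedish_stop": {"type": "stop", "stopwords": "_swedish_"},
            "swedish_stemmer": {"type": "stemmer", "language": "swedish"},
        },
        "analyzer": {
            "rebuilt_swedish": {
                "tokenizer": "standard",
                # Order matters: lowercase, then stop words, then stemming.
                "filter": ["lowercase", "swedish_stop", "swedish_stemmer"],
            }
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(UNPACKED_SWEDISH, indent=2))
```

Once the chain is spelled out like this, extra filters (such as folding) can be added to the `filter` list, which is not possible with the monolithic built-in analyzer.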

Both prod and the unpacked analyzer have the same number of tokens (718,362) and pre-analysis types (97,123). After analysis (which includes lowercasing, stemming, and stop-word removal) there are fewer types, with prod having 78,543 and the unpacked config having 75,269.

The difference, 3,274 (4.168%), is the number of types with new collisions. Those types represent only 6,621 (0.922%) tokens. The vast majority of the new collision types are numbers followed by non-breaking spaces (with one each of the number with and without the non-breaking space). There are also about 1,600 each of V and Ö followed by a non-breaking space. I'm not sure why non-breaking spaces are common in Swedish Wikipedia, but the new collisions are likely good ones.
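The percentages above can be reproduced with a few lines of arithmetic. This sketch assumes the type percentage is taken against prod's post-analysis type count and the token percentage against the total token count. Those denominators are inferred from the figures, not stated explicitly.

```python
# Figures quoted in the text above.
PROD_TYPES = 78_543        # post-analysis types, prod analyzer
UNPACKED_TYPES = 75_269    # post-analysis types, unpacked analyzer
TOTAL_TOKENS = 718_362     # tokens in the sample
COLLISION_TOKENS = 6_621   # tokens belonging to types with new collisions

# Assumed denominators: prod types for types, total tokens for tokens.
merged_types = PROD_TYPES - UNPACKED_TYPES
pct_types = 100 * merged_types / PROD_TYPES
pct_tokens = 100 * COLLISION_TOKENS / TOTAL_TOKENS

if __name__ == "__main__":
    print(f"{merged_types:,} types ({pct_types:.3f}%)")
    print(f"{COLLISION_TOKENS:,} tokens ({pct_tokens:.3f}%)")
```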

The rest are the usual suspects: folded Unicode characters. There are only 6 types not included above. The list below shows the new form (with frequency count) before the arrow, and following the arrow, the group it will be joining (with frequency counts). So, groß will now be analyzed to be the same as Gross, Grosse, and Grosser.

[1 fortiﬁcation] -> [6 Fortification]
[2 rußberg] -> [3 Russberg]
[1 straße] -> [1 Strasse]
[1 weiß] -> [1 Weisse]
[5 groß] -> [3 Gross][2 Grosse][16 Grosser]
[1 µm] -> [13 μm]
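A listing like the one above can be produced by grouping pre-analysis types by their analyzed form under each analyzer and comparing the groups. The sketch below is a minimal illustration, not the actual analysis tool. The two toy analyzers (`old_analyze`, `new_analyze`) are hypothetical stand-ins, and unlike the listing it reports collisions in both directions.

```python
from collections import defaultdict


def group_by_analyzed(analyze, types):
    """Map each analyzed form to the set of original types that produce it."""
    groups = defaultdict(set)
    for t in types:
        groups[analyze(t)].add(t)
    return groups


def new_collisions(old_analyze, new_analyze, types):
    """Return {type: types it is grouped with only under the new analyzer}."""
    old_groups = group_by_analyzed(old_analyze, types)
    new_groups = group_by_analyzed(new_analyze, types)
    result = {}
    for t in types:
        joined = new_groups[new_analyze(t)] - old_groups[old_analyze(t)]
        if joined:
            result[t] = joined
    return result


# Hypothetical toy analyzers: lowercase only vs. lowercase plus ß -> ss.
def old_analyze(token):
    return token.lower()


def new_analyze(token):
    return token.lower().replace("ß", "ss")


SAMPLE_TYPES = ["groß", "Gross", "gross", "straße", "Strasse", "Berg"]
COLLISIONS = new_collisions(old_analyze, new_analyze, SAMPLE_TYPES)

if __name__ == "__main__":
    for new_form, group in sorted(COLLISIONS.items()):
        print(new_form, "->", sorted(group))
```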

Unpacked vs Folded
When we add ASCII folding to the unpacked analysis chain config, it's actually the ICU folding that is enabled. (See T137830.)

We get a lot more tokens with the folding enabled because we preserve the original and the folded form (i.e., both sydväst and sydvast are indexed in the same location).
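The "preserve the original" behavior can be sketched as follows. This is only an approximation: it strips combining marks after Unicode decomposition, which is a crude stand-in for ICU folding (it does not handle ß, ł, or œ, for example). The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Crude folding: decompose, then drop combining marks (not full ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserving_original(tokens):
    """Emit (position, token) pairs; a folded form shares its original's position."""
    out = []
    for position, token in enumerate(tokens):
        out.append((position, token))
        folded = strip_diacritics(token)
        if folded != token:
            # Extra token at the same position, so either spelling matches.
            out.append((position, folded))
    return out


if __name__ == "__main__":
    print(fold_preserving_original(["sydväst", "kust"]))
```

Only tokens that actually change under folding produce a second token, which is why the token count grows but does not double.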

The unpacked analysis chain creates 718,362 tokens; adding folding results in 863,236 tokens.

Unpacked has 97,123 types before analysis, 75,269 after.

Unpacked and folded has 116,220 types before analysis, 88,524 after—indicating that a large number of the new types either fold into old types, or fold into each other, or both.

There are only 1,041 (1.383%) types with collisions (several types have multiple new collisions). They represent a larger proportion of tokens, though: 39,654 (5.52%).

Collision Examples
There were just over 2,000 new types that collided with an old type (as above, several new types can collide with the same old type). A random sample of 100 is below, for review by a Swedish speaker.

The list shows the new form (with frequency count) before the arrow, and following the arrow, the group it will be joining (with frequency counts). So, álvarez will now be analyzed to be the same as Alvarez, and ändarna will be analyzed to be the same as And, Andas, ... Anderna, Andernas, andarna and others.

[2 â] -> [198 A][221 a]
[1 ágatha] -> [1 Agatha][1 Agathe]
[1 ägda] -> [10 Agder]
[19 alsó] -> [1 also]
[1 álvarez] -> [1 Alvarez]
[3 ändarna] -> [6 And][1 Andas][3 Ande][1 Anden][11 Anderna][1 Andernas][5 Andes][171 and][5 anda][1 andades][1 andar][1 andarna][2 andas][3 ande][4 andlig][9 andliga]
[1 äng] -> [1 Anger][2 ange][5 anger][23 anges]
[1 äpple] -> [5 Apple][1 Apples]
[2 äpplen] -> [5 Apple][1 Apples]
[10 åre] -> [10 Are][28 Area][3 are][98 area]
[11 armé] -> [1 Armee]
[3 bält] -> [3 Balta]
[1 bänden] -> [45 Band][8 Banda][3 Bandar][1 Banden][66 band][3 banden][1 bands]
[1 béal] -> [1 Beale]
[2 boû] -> [49 Bou][3 bou]
[1 bråkat] -> [1 Brake]
[21 bröderna] -> [4 Brod][1 Broder][1 Brodern][2 broder][11 brodern][1 broderns]
[2 bröst] -> [4 Brost]
[8 bū] -> [1 bu]
[5 būr] -> [8 Bur][1 Buren]
[1 byť] -> [2 Byte][4 byta][3 bytas][1 byte][8 byten][4 byter][2 bytes][2 byts]
[1 ç] -> [11174 C][180 c]
[13 chāh] -> [1 Chah][4 Chahar]
[6 cœur] -> [3 Coeur][1 coeur]
[3 čukar] -> [1 Cukor]
[1 dåd] -> [2 DADA][9 Dada][3 Dade][1 dada]
[1 dåre] -> [10 Dar][6 Dara][2 Dare]
[1 döva] -> [1 Dover][1 dov]
[1 eš] -> [1 ES][5 Es][23 es]
[2 européer] -> [1 Europea]
[1 fårade] -> [5 Far][5 Fara][7 Faras][2 Farlig][40 Fars][66 far][8 fara][2 farlig][5 farliga][1 farligaste][4 farligt][5 fars]
[8 färdigt] -> [2 Farda]
[1 föder] -> [2 foder]
[1 fontäner] -> [1 Fontana]
[1 fören] -> [14 For][6 Fors][50 for][22 fors]
[3 häftig] -> [8 Haft][66 haft]
[38 håll] -> [15 Hall][9 Halle][3 Halls][2 hall][1 hallar][2 hallen]
[3 härdar] -> [1 Hard][2 hard]
[2 höök] -> [3 Hook][13 Hooker]
[4 isländsk] -> [1 islandske]
[1 känsliga] -> [11 Kansas]
[22 kasaï] -> [25 Kasai]
[3 klöcker] -> [1 Klockorna][3 klockare][1 klockaren][2 klockor][1 klockorna]
[16 låga] -> [8 Lag][1 Laga][1 Lagar][3 Lagen][1 Lager][71 lag][5 laga][9 lagar][5 lagarna][21 lagen][1 lagens][7 lager][2 lagliga][1 lagligt][3 lags]
[1 låses] -> [152 Las][42 las]
[1 lönade] -> [1 Lon]
[4 lösas] -> [189 Los][2 Lose][42 los][2 losa]
[1 malé] -> [3 Maleen]
[1 märchen] -> [10 March][1 Marche]
[1 märkas] -> [28 Mark][20 Marks][30 mark][31 marken][3 marker]
[8 maysān] -> [1 Maysan]
[1 misstänkas] -> [1 Misstankarna][2 misstankar][1 misstanke]
[2 möjligheterna] -> [1 Moje][2 Mojen]
[1 musées] -> [3 museer][1 museernas]
[19 nämns] -> [5 Namn][2 Namnen][431 namn][7 namnen]
[3 o’neill] -> [2 O'Neill]
[3 ōrt] -> [23 Orten][535 ort][18 orten][1 ortens][7 orter][1 orterna]
[271 öst] -> [2 Osten][5 ost][1 ostligaste]
[8 päls] -> [1 Pal][4 Pale]
[8 pām] -> [2 pama]
[7 pérez] -> [2 Perez]
[7 priština] -> [1 Pristina][1 Pristinas]
[1 prövning] -> [1 Provnings]
[35 râs] -> [2 RAS][19 Ras][2 Rasen][8 ras][1 rasa][8 rasade][1 rasande][2 rasat][6 rasen][3 raser]
[1 rè] -> [1 Re][1 re]
[110 región] -> [60 Region][5 Regionen][83 region][703 regionen][2 regionens][11 regioner][3 regionerna][1 regions]
[7 réserve] -> [35 Reserve][1 reserv][9 reservat][2 reserven][1 reserverna]
[14 rincón] -> [1 Rincon]
[35 röda] -> [1 ROD][1 Rod][3 Rode][1 Rodes][1 roder][1 roderna]
[3 rörlighet] -> [1 ror]
[24 sångare] -> [10 Sang][7 Sangar][1 sang]
[1 sättare] -> [20 satt][29 satte][22 sattes][2 satts]
[4 sätter] -> [20 satt][29 satte][22 sattes][2 satts]
[2 skönheten] -> [1 skonar][1 skonare][1 skonaren]
[3 sköt] -> [1 skoter][1 skots]
[1 sołe] -> [1 SOL][3 Sol][5 Sola][1 Solar][2 Solen][4 Solor][4 sol][1 sole][11 solen][1 solens][1 soliga][1 sols]
[1 sötare] -> [2 Sot]
[2 söt] -> [2 Sot]
[1 spån] -> [1 SPAN][1 spana]
[37 städer] -> [2 Stad][33 Staden][3 Stadens][111 stad][160 staden][28 stadens][1 stadigt][13 stads]
[5 stånd] -> [4 Stand][2 stand]
[1 störar] -> [31 Stor][78 Stora][4 Store][264 stor][210 stora][13 store][3 stores][3 storhet]
[1 stormäns] -> [1 storman]
[75 stränder] -> [13 Strand][42 strand][8 stranden]
[3 strängade] -> [7 Strang][4 Strange][1 strange]
[10 ström] -> [1 strom]
[5 tá] -> [4 Ta][85 ta]
[16 täckt] -> [8 Tack][15 tack][6 tackade][2 tackar][2 tackas]
[1 tō] -> [18 To][69 to]
[1 tonsättarna] -> [1 tonsatt][1 tonsatta]
[1 törstande] -> [13 Torsten]
[1 trådar] -> [1 Trad][22 Trade][2 Trader]
[1 väcks] -> [7 vacker]
[4 väckt] -> [7 vacker]
[6 väcka] -> [7 vacker]
[2 väggar] -> [1 vagga]
[1 värvat] -> [15 varv][1 varvare][1 varvas]
[76 wādī] -> [6 Wadi][149 wadi]
[2 yéyé] -> [5 Yeye]
[1 zéro] -> [1 Zero][1 zero]

word_break_helper
We have word_break_helper (WBH) enabled for the unpacked and modified English and Italian analyzers (a setting that dates back to before my time). WBH changes underscores, periods, and parens into spaces, so that words can break on them. This has good and bad side-effects, but we decided not to enable it for French.
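The effect of WBH can be approximated in a few lines. This is a sketch of the character mapping as described above (underscores, periods, and parens become spaces), not the actual CirrusSearch character filter. The whitespace split stands in for the real tokenizer.

```python
# Characters that WBH turns into spaces, per the description above.
WORD_BREAK_TABLE = str.maketrans({"_": " ", ".": " ", "(": " ", ")": " "})


def word_break_helper(text: str) -> str:
    """Replace underscores, periods, and parens with spaces."""
    return text.translate(WORD_BREAK_TABLE)


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: apply the mapping, then split on whitespace."""
    return word_break_helper(text).split()


if __name__ == "__main__":
    for sample in ["wikipedia.org", "0.11907", "pa_svenska", "T.Jones"]:
        print(sample, "->", tokenize(sample))
```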

There's a weird thing about WBH that we keep forgetting: it doesn't do anything when it is enabled on a built-in analyzer. The cirrus-settings-dump will show that it is enabled, but it seems that you can't combine other stuff with the built-in analyzers (which is part of the reason we have to unpack them).

I also analyzed the folded vs folded+WBH analyzers, and the results were essentially the same as for the unpacked vs unpacked+WBH discussed here.

The short version is that it wasn't on before, even though it looked like it was, so we shouldn't bother.

The long version is that I've upgraded my analysis tool to show me what happens in this case, so I'm going to take a look anyway.

Unpacked tokens: 718,362. Unpacked+WBH tokens: 732,533, so more tokens, as we'd expect.

Unpacked pre-analysis types: 97,123
Unpacked post-analysis types: 75,269

Unpacked+WBH pre-analysis types: 95,042
Unpacked+WBH post-analysis types: 73,606

There are fewer types pre- and post-analysis, because tokens like wikipedia.org become wikipedia and org, both of which already exist.

There are a handful of new collisions: only 567 (0.753%) types, but they account for 15,972 (2.223%) tokens!

10,239 types disappeared ("lost"), and 8,158 new types appeared ("found"). The vast majority of the types in both cases are numbers. Tokens like 0.11907 become 0 and 11907.
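"Lost" and "found" types are just the two set differences between the pre-analysis type inventories. The sketch below uses a small made-up inventory to illustrate, and then checks that the counts quoted above are consistent with the pre-analysis type totals for unpacked (97,123) and unpacked+WBH (95,042).

```python
def lost_and_found(old_types, new_types):
    """Return (lost, found): types only in the old set, types only in the new set."""
    old_types, new_types = set(old_types), set(new_types)
    return old_types - new_types, new_types - old_types


# Made-up toy inventories, before and after enabling WBH.
OLD_TYPES = {"wikipedia", "org", "wikipedia.org", "0", "0.11907"}
NEW_TYPES = {"wikipedia", "org", "0", "11907"}
LOST, FOUND = lost_and_found(OLD_TYPES, NEW_TYPES)

# Consistency check on the reported figures: old - lost + found = new.
EXPECTED_NEW_TOTAL = 97_123 - 10_239 + 8_158

if __name__ == "__main__":
    print("lost:", sorted(LOST))
    print("found:", sorted(FOUND))
    print("expected unpacked+WBH pre-analysis types:", EXPECTED_NEW_TOTAL)
```

Note that wikipedia.org is lost without any type being found, because wikipedia and org already existed. That is why more types are lost than found.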

A handful of acronyms and acronym-like tokens are split up (bad), citation-form names (like T.Jones) get split up (probably good), numbers get split up (probably bad), web domains get split up (probably good), general word.word.words get split up (hard to say), and a handful of other stuff that gets split up is not easily categorized.

A sample of "lost" and "found" tokens of each type is provided below (I omitted the web domains and included all the "other" uncategorized examples).

Lost tokens by type
 * other: 33
 * 1.43,40 1.43,50 2.51,2 38.20,8 4.54,4 4.57,4 484162005B8D890FA34353A0365446DA.nbdigital3 G09.330.380 Helsingborg.Helsingör J318.5 K.Å Kautsky.Ännu Last.fm:s Lobo_ Malmö.Libris Mesh_No P.O'byrne Rv23.02 Rv25.01 Rv25.02 Rv25.05 Rv31.01 Valentina_Quinn beskrivning_av_finlands_bannat_2 d.ä heut_ist_des_herren_ruhetag kompassfel.Besättningsmannen pa_svenska pjäsen.Malone poppen_hoppar_av_assyriska s.å säkerhet.Religionerna ö.h


 * acronyms: 118
 * A.B A.D A.E A.H A.J.G.H A.L A.M A.T A.U.S.A A.V


 * acronyms-like: 14
 * T.o.m a.M c.a d.v.s d.y e.o f.d m.m n.s o.s.v


 * measurements: 3
 * 1.04k 2.0B 2.9x


 * namelike: 43
 * A.M.Lopes A.T.Oliveira A.Wharton B.Hylm B.Hylmö


 * numbers-decimals: 6611 / numbers-decimals+nbsp: 3259
 * 0.1 0.11907 0.13333 0.16667 0.18583 0.21312 0.21667


 * numbers-period-sep: 4
 * 1.8.1956 1.9.1985 15.1.1977 2011.11.004


 * web domains: 82


 * words-period-sep: 72 (includes other TLD domains, and misc other stuff)
 * Fil.dr Hb.Gey LL.M Lar.N Snapphanarna.Stockholm St.John

Found tokens by type
 * other: 1
 * nbdigital3


 * ID-like: 4
 * 0B 484162005B8D890FA34353A0365446DA G09 J318


 * Latin (Basic): 149
 * Achemenet Antipova Arnholm Backh Boissieu Egypt


 * Latin (Extended): 2
 * Besättningsmannen Hylmö


 * measurements: 2
 * 04k 9x


 * numbers lists-comma-sep: 5
 * 20,8 43,40 51,2 54,4 57,4


 * numbers-integers: 5192 / numbers-integers+nbsp: 2802
 * 00007 00029 0003 0005 00065 00084 00087


 * words-colon-sep: 1
 * fm:s

Recommendation
I'm waiting for feedback from some native speakers of Swedish on the 100 example collisions above, but if they approve, then we're good to push the change to production.