User:TJones (WMF)/Notes/Swedish Analyzer Analysis

From mediawiki.org

March 2017 — See TJones_(WMF)/Notes for other projects. See also T160562 and, to a lesser degree, T155822. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

For T155822 we want to enable ASCII folding for Swedish, in particular so that ä and a match. The purpose of this analysis is to see what effect ASCII folding (which gets automatically upgraded to ICU folding) has on representative Swedish text (i.e., a 10K article sample from Swedish Wikipedia).

Turns out that was a bad idea—thanks Johan, Peter, and Jessica for pointing that out! Since we've got everything set up to analyze and enable folding, we should do it for everything but å, ä, and ö. (T160562)

Prod vs Unpacked

As with other analyzers, we can unpack the default Swedish analyzer according to the documentation. We've now come to expect there will be some generally positive changes from unpacking.
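For reference, here is a rough sketch of what "unpacking" looks like, written as a Python dict that mirrors Elasticsearch index settings. It is modeled on the rebuilt Swedish analyzer in the Elasticsearch language-analyzer documentation (standard tokenizer, lowercase, Swedish stop words, Swedish stemmer). The analyzer name `text` is a placeholder, the optional keyword-marker filter is left out, and the actual CirrusSearch config may well differ.

```python
# Sketch only: an "unpacked" Swedish analyzer expressed as index settings.
# Modeled on the Elasticsearch documentation; not the actual CirrusSearch config.
unpacked_swedish = {
    "analysis": {
        "filter": {
            "swedish_stop": {"type": "stop", "stopwords": "_swedish_"},
            "swedish_stemmer": {"type": "stemmer", "language": "swedish"},
        },
        "analyzer": {
            "text": {  # placeholder analyzer name
                "type": "custom",
                "tokenizer": "standard",
                # Order matters: lowercase first, then stop words, then stemming.
                "filter": ["lowercase", "swedish_stop", "swedish_stemmer"],
            }
        },
    }
}

print(unpacked_swedish["analysis"]["analyzer"]["text"]["filter"])
```

Once the analyzer is spelled out like this, additional filters (such as folding) can be added to the `filter` list, which is not possible with the monolithic built-in analyzer.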

Both prod and the unpacked analyzer have the same number of tokens (718,362) and pre-analysis types (97,123). After analysis (which includes lowercasing, stemming, and stop-word removal) there are fewer types, with prod having 78,543 and the unpacked config having 75,269.

The difference, 3,274 (4.168%), is the number of types with new collisions. Those types represent only 6,621 (0.922%) of the tokens. The vast majority of the new collision types are numbers followed by non-breaking spaces (with one each of the number with and without the non-breaking space). There are also about 1,600 each of V and Ö followed by a non-breaking space. I'm not sure why non-breaking spaces are common in Swedish Wikipedia, but the new collisions are likely good ones.
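The arithmetic behind those figures can be checked quickly. In this sketch the type percentage is taken against prod's post-analysis type count and the token percentage against the total token count; those bases are my reading of the numbers, since they reproduce the reported values.

```python
prod_types = 78_543       # post-analysis types, prod
unpacked_types = 75_269   # post-analysis types, unpacked
total_tokens = 718_362
colliding_tokens = 6_621

new_collision_types = prod_types - unpacked_types
type_share = 100 * new_collision_types / prod_types
token_share = 100 * colliding_tokens / total_tokens

print(new_collision_types)     # 3274
print(f"{type_share:.3f}%")    # 4.168%
print(f"{token_share:.3f}%")   # 0.922%
```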

The rest are the usual suspects: folded Unicode characters. There are only 6 types not included above. The list below shows the new form (with frequency count) before the arrow, and following the arrow, the group it will be joining (with frequency counts). So, groß will now be analyzed to be the same as Gross, Grosse, and Grosser.

[1 fortification] -> [6 Fortification]
[2 rußberg] -> [3 Russberg]
[1 straße] -> [1 Strasse]
[1 weiß] -> [1 Weisse]
[5 groß] -> [3 Gross][2 Grosse][16 Grosser]
[1 µm] -> [13 μm]
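As a rough illustration of how a list like this can be produced, the sketch below compares two analyzers and reports each type whose analyzed form changed, along with the unchanged types it now groups with. The function name and the two toy analyzers are assumptions for illustration only; this is not the actual analysis tool, and the real analyzers also stem.

```python
from collections import defaultdict

def new_collisions(types, old_analyze, new_analyze):
    """Map each type whose analyzed form changed to the unchanged types it now joins."""
    stable = defaultdict(set)
    for t in types:
        if old_analyze(t) == new_analyze(t):
            stable[new_analyze(t)].add(t)
    return {
        t: stable[new_analyze(t)]
        for t in types
        if old_analyze(t) != new_analyze(t) and new_analyze(t) in stable
    }

# Toy analyzers (assumptions): lowercase only vs. lowercase plus ß -> ss.
def old(tok):
    return tok.lower()

def new(tok):
    return tok.lower().replace("ß", "ss")

types = {"straße", "Strasse", "weiß", "Haus"}
print(new_collisions(types, old, new))  # {'straße': {'Strasse'}}
```

In the toy data weiß changes form but has nothing to join, so it is not reported; only straße picks up a new group-mate.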

Unpacked vs Folded: Now With Too Much Folding

(This section shows the results of the original, incorrect plan to fold everything to resolve the inconsistency between the Go Feature and full text searching. It does bad things to Swedish, but is left here for reference.)

When we add ASCII folding to the unpacked analysis chain config, it's actually the ICU folding that is enabled. (See T137830.)

We get a lot more tokens with the folding enabled because we preserve the original and the folded form (i.e., both sydväst and sydvast are indexed in the same location).
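A minimal sketch of the "preserve the original" behavior is below. It uses Unicode decomposition and strips combining marks as a rough stand-in for ICU folding (real ICU folding covers far more cases), and splits on whitespace instead of using the real tokenizer. The function names are hypothetical.

```python
import unicodedata

def fold(token):
    """Rough stand-in for ICU folding: lowercase and strip combining marks."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def analyze_preserving(text):
    """Return (position, term) pairs, adding the folded form at the same position."""
    out = []
    for pos, tok in enumerate(text.split()):
        original = unicodedata.normalize("NFC", tok.lower())
        out.append((pos, original))
        folded = fold(tok)
        if folded != original:
            # Same position as the original, so either spelling matches there.
            out.append((pos, folded))
    return out

print(analyze_preserving("Sydväst om Malmö"))
# [(0, 'sydväst'), (0, 'sydvast'), (1, 'om'), (2, 'malmö'), (2, 'malmo')]
```

Three input words yield five indexed terms here, which is the same effect that inflates the token counts when everything is folded.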

The unpacked analysis chain creates 718,362 tokens; adding folding results in 863,236 tokens.

Unpacked has 97,123 types before analysis, 75,269 after.

Unpacked and folded has 116,220 types before analysis, 88,524 after—indicating that a large number of the new types either fold into old types, or fold into each other, or both.

There are only 1,041 (1.383%) types with collisions (several types have multiple new collisions). They represent a larger proportion of tokens, though: 39,654 (5.52%).

Collision Examples

There were just over 2,000 new types that collided with an old type (as above, several new types can collide with the same old type). A random sample of 100 is below, for review by a Swedish speaker.

The list shows the new form (with frequency count) before the arrow, and following the arrow, the group it will be joining (with frequency counts). So, álvarez will now be analyzed to be the same as Alvarez, and ändarna will be analyzed to be the same as And, Andas, ... Anderna, Andernas, andarna and others.

[2 â] -> [198 A][221 a]
[1 ágatha] -> [1 Agatha][1 Agathe]
[1 ägda] -> [10 Agder]
[19 alsó] -> [1 also]
[1 álvarez] -> [1 Alvarez]
[3 ändarna] -> [6 And][1 Andas][3 Ande][1 Anden][11 Anderna][1 Andernas]
               [5 Andes][171 and][5 anda][1 andades][1 andar][1 andarna]
               [2 andas][3 ande][4 andlig][9 andliga]
[1 äng] -> [1 Anger][2 ange][5 anger][23 anges]
[1 äpple] -> [5 Apple][1 Apples]
[2 äpplen] -> [5 Apple][1 Apples]
[10 åre] -> [10 Are][28 Area][3 are][98 area]
[11 armé] -> [1 Armee]
[3 bält] -> [3 Balta]
[1 bänden] -> [45 Band][8 Banda][3 Bandar][1 Banden][66 band][3 banden][1 bands]
[1 béal] -> [1 Beale]
[2 boû] -> [49 Bou][3 bou]
[1 bråkat] -> [1 Brake]
[21 bröderna] -> [4 Brod][1 Broder][1 Brodern][2 broder][11 brodern][1 broderns]
[2 bröst] -> [4 Brost]
[8 bū] -> [1 bu]
[5 būr] -> [8 Bur][1 Buren]
[1 byť] -> [2 Byte][4 byta][3 bytas][1 byte][8 byten][4 byter][2 bytes][2 byts]
[1 ç] -> [11174 C][180 c]
[13 chāh] -> [1 Chah][4 Chahar]
[6 cœur] -> [3 Coeur][1 coeur]
[3 čukar] -> [1 Cukor]
[1 dåd] -> [2 DADA][9 Dada][3 Dade][1 dada]
[1 dåre] -> [10 Dar][6 Dara][2 Dare]
[1 döva] -> [1 Dover][1 dov]
[1 eš] -> [1 ES][5 Es][23 es]
[2 européer] -> [1 Europea]
[1 fårade] -> [5 Far][5 Fara][7 Faras][2 Farlig][40 Fars]
              [66 far][8 fara][2 farlig][5 farliga][1 farligaste]
              [4 farligt][5 fars]
[8 färdigt] -> [2 Farda]
[1 föder] -> [2 foder]
[1 fontäner] -> [1 Fontana]
[1 fören] -> [14 For][6 Fors][50 for][22 fors]
[3 häftig] -> [8 Haft][66 haft]
[38 håll] -> [15 Hall][9 Halle][3 Halls][2 hall][1 hallar][2 hallen]
[3 härdar] -> [1 Hard][2 hard]
[2 höök] -> [3 Hook][13 Hooker]
[4 isländsk] -> [1 islandske]
[1 känsliga] -> [11 Kansas]
[22 kasaï] -> [25 Kasai]
[3 klöcker] -> [1 Klockorna][3 klockare][1 klockaren][2 klockor][1 klockorna]
[16 låga] -> [8 Lag][1 Laga][1 Lagar][3 Lagen][1 Lager][71 lag][5 laga]
             [9 lagar][5 lagarna][21 lagen][1 lagens][7 lager]
             [2 lagliga][1 lagligt][3 lags]
[1 låses] -> [152 Las][42 las]
[1 lönade] -> [1 Lon]
[4 lösas] -> [189 Los][2 Lose][42 los][2 losa]
[1 malé] -> [3 Maleen]
[1 märchen] -> [10 March][1 Marche]
[1 märkas] -> [28 Mark][20 Marks][30 mark][31 marken][3 marker]
[8 maysān] -> [1 Maysan]
[1 misstänkas] -> [1 Misstankarna][2 misstankar][1 misstanke]
[2 möjligheterna] -> [1 Moje][2 Mojen]
[1 musées] -> [3 museer][1 museernas]
[19 nämns] -> [5 Namn][2 Namnen][431 namn][7 namnen]
[3 o’neill] -> [2 O'Neill]
[3 ōrt] -> [23 Orten][535 ort][18 orten][1 ortens][7 orter][1 orterna]
[271 öst] -> [2 Osten][5 ost][1 ostligaste]
[8 päls] -> [1 Pal][4 Pale]
[8 pām] -> [2 pama]
[7 pérez] -> [2 Perez]
[7 priština] -> [1 Pristina][1 Pristinas]
[1 prövning] -> [1 Provnings]
[35 râs] -> [2 RAS][19 Ras][2 Rasen][8 ras][1 rasa][8 rasade]
            [1 rasande][2 rasat][6 rasen][3 raser]
[1 rè] -> [1 Re][1 re]
[110 región] -> [60 Region][5 Regionen][83 region][703 regionen]
                [2 regionens][11 regioner][3 regionerna][1 regions]
[7 réserve] -> [35 Reserve][1 reserv][9 reservat][2 reserven][1 reserverna]
[14 rincón] -> [1 Rincon]
[35 röda] -> [1 ROD][1 Rod][3 Rode][1 Rodes][1 roder][1 roderna]
[3 rörlighet] -> [1 ror]
[24 sångare] -> [10 Sang][7 Sangar][1 sang]
[1 sättare] -> [20 satt][29 satte][22 sattes][2 satts]
[4 sätter] -> [20 satt][29 satte][22 sattes][2 satts]
[2 skönheten] -> [1 skonar][1 skonare][1 skonaren]
[3 sköt] -> [1 skoter][1 skots]
[1 sołe] -> [1 SOL][3 Sol][5 Sola][1 Solar][2 Solen][4 Solor]
            [4 sol][1 sole][11 solen][1 solens][1 soliga][1 sols]
[1 sötare] -> [2 Sot]
[2 söt] -> [2 Sot]
[1 spån] -> [1 SPAN][1 spana]
[37 städer] -> [2 Stad][33 Staden][3 Stadens][111 stad]
               [160 staden][28 stadens][1 stadigt][13 stads]
[5 stånd] -> [4 Stand][2 stand]
[1 störar] -> [31 Stor][78 Stora][4 Store][264 stor][210 stora]
              [13 store][3 stores][3 storhet]
[1 stormäns] -> [1 storman]
[75 stränder] -> [13 Strand][42 strand][8 stranden]
[3 strängade] -> [7 Strang][4 Strange][1 strange]
[10 ström] -> [1 strom]
[5 tá] -> [4 Ta][85 ta]
[16 täckt] -> [8 Tack][15 tack][6 tackade][2 tackar][2 tackas]
[1 tō] -> [18 To][69 to]
[1 tonsättarna] -> [1 tonsatt][1 tonsatta]
[1 törstande] -> [13 Torsten]
[1 trådar] -> [1 Trad][22 Trade][2 Trader]
[1 väcks] -> [7 vacker]
[4 väckt] -> [7 vacker]
[6 väcka] -> [7 vacker]
[2 väggar] -> [1 vagga]
[1 värvat] -> [15 varv][1 varvare][1 varvas]
[76 wādī] -> [6 Wadi][149 wadi]
[2 yéyé] -> [5 Yeye]
[1 zéro] -> [1 Zero][1 zero]

word_break_helper

We have word_break_helper (WBH) enabled for the unpacked and modified English and Italian analyzers (a setting that dates back to before my time). WBH changes underscores, periods, and parens into spaces, so that words can break on them. This has good and bad side-effects, but we decided not to enable it for French.

There's a weird thing about WBH that we keep forgetting: it doesn't do anything when it is enabled on a built-in analyzer. The cirrus-settings-dump will show that it is enabled, but it seems that you can't combine other stuff with the built-in analyzers (which is part of the reason we have to unpack them).

I also analyzed the folded vs folded+WBH analyzers, and the results were essentially the same as for the unpacked vs unpacked+WBH discussed here.

The short version is that it wasn't on before, even though it looked like it was, so we shouldn't bother.

The long version is that I've upgraded my analysis tool to show me what happens in this case, so I'm going to take a look anyway.

Unpacked tokens: 718,362. Unpacked+WBH tokens: 732,533—so more tokens, as we'd expect.

Unpacked pre-analysis types : 97,123
Unpacked post-analysis types: 75,269
Unpacked+WBH pre-analysis types : 95,042
Unpacked+WBH post-analysis types: 73,606

There are fewer types pre- and post-analysis, because tokens like wikipedia.org become wikipedia and org, both of which already exist.

There are a handful of new collisions: only 567 (0.753%) types, but they account for 15,972 (2.223%) tokens!

10,239 types disappeared ("lost"), and 8,158 new types appeared ("found"). The vast majority of the types in both cases are numbers. Tokens like 0.11907 become 0 and 11907.
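To make the token and type changes concrete, here is a toy sketch. It assumes WBH simply maps underscores, periods, and parentheses to spaces, and it splits on whitespace as a stand-in for the real tokenizer; `word_break_helper` and `tokenize` are hypothetical names used only for illustration.

```python
# Characters that WBH turns into spaces, per the description above.
WBH_TABLE = str.maketrans({"_": " ", ".": " ", "(": " ", ")": " "})

def word_break_helper(text):
    """Toy version of the WBH character mapping."""
    return text.translate(WBH_TABLE)

def tokenize(text, use_wbh=False):
    """Whitespace tokenizer, a simplification of the real tokenizer."""
    if use_wbh:
        text = word_break_helper(text)
    return text.split()

sample = "wikipedia org wikipedia.org 0 0.11907"
before = tokenize(sample)
after = tokenize(sample, use_wbh=True)

lost = set(before) - set(after)    # types that no longer occur
found = set(after) - set(before)   # types that did not occur before

print(len(before), len(after))             # 5 7
print(len(set(before)), len(set(after)))   # 5 4
print(sorted(lost))                        # ['0.11907', 'wikipedia.org']
print(sorted(found))                       # ['11907']
```

The toy sample shows the same pattern as the real data: more tokens, fewer types, and more "lost" types than "found" ones, because many of the pieces already exist as types.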

A handful of acronyms and acronym-like tokens get split up (bad), citation-form names like T.Jones get split up (probably good), numbers get split up (probably bad), web domains get split up (probably good), and general word.word.words get split up (hard to say). A handful of other things that get split up are not easily categorized.

A sample of "lost" and "found" tokens of each type is provided below (I omitted the web domains and included all the "other" uncategorized examples).

Lost tokens by type

  • other: 33
    • 1.43,40 1.43,50 2.51,2 38.20,8 4.54,4 4.57,4 484162005B8D890FA34353A0365446DA.nbdigital3 G09.330.380 Helsingborg.Helsingör J318.5 K.Å Kautsky.Ännu Last.fm:s Lobo_ Malmö.Libris Mesh_No P.O'byrne Rv23.02 Rv25.01 Rv25.02 Rv25.05 Rv31.01 Valentina_Quinn beskrivning_av_finlands_bannat_2 d.ä heut_ist_des_herren_ruhetag kompassfel.Besättningsmannen pa_svenska pjäsen.Malone poppen_hoppar_av_assyriska s.å säkerhet.Religionerna ö.h
  • acronyms: 118
    • A.B A.D A.E A.H A.J.G.H A.L A.M A.T A.U.S.A A.V
  • acronyms-like: 14
    • T.o.m a.M c.a d.v.s d.y e.o f.d m.m n.s o.s.v
  • measurements: 3
    • 1.04k 2.0B 2.9x
  • namelike: 43
    • A.M.Lopes A.T.Oliveira A.Wharton B.Hylm B.Hylmö
  • numbers-decimals: 6611 / numbers-decimals+nbsp: 3259
    • 0.1 0.11907 0.13333 0.16667 0.18583 0.21312 0.21667
  • numbers-period-sep: 4
    • 1.8.1956 1.9.1985 15.1.1977 2011.11.004
  • web domains: 82
  • words-period-sep: 72 (includes other TLD domains, and misc other stuff)
    • Fil.dr Hb.Gey LL.M Lar.N Snapphanarna.Stockholm St.John

Found tokens by type

  • other: 1
    • nbdigital3
  • ID-like: 4
    • 0B 484162005B8D890FA34353A0365446DA G09 J318
  • Latin (Basic): 149
    • Achemenet Antipova Arnholm Backh Boissieu Egypt
  • Latin (Extended): 2
    • Besättningsmannen Hylmö
  • measurements: 2
    • 04k 9x
  • numbers lists-comma-sep: 5
    • 20,8 43,40 51,2 54,4 57,4
  • numbers-integers: 5192 / numbers-integers+nbsp: 2802
    • 00007 00029 0003 0005 00065 00084 00087
  • words-colon-sep: 1
    • fm:s
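As a sanity check, the per-category counts above can be summed and compared against the totals reported earlier. The sketch below just copies the numbers from the two lists; the net loss should also equal the drop in pre-analysis types from unpacked (97,123) to unpacked+WBH (95,042).

```python
lost_counts = {
    "other": 33, "acronyms": 118, "acronyms-like": 14, "measurements": 3,
    "namelike": 43, "numbers-decimals": 6611, "numbers-decimals+nbsp": 3259,
    "numbers-period-sep": 4, "web domains": 82, "words-period-sep": 72,
}
found_counts = {
    "other": 1, "ID-like": 4, "Latin (Basic)": 149, "Latin (Extended)": 2,
    "measurements": 2, "numbers lists-comma-sep": 5, "numbers-integers": 5192,
    "numbers-integers+nbsp": 2802, "words-colon-sep": 1,
}

total_lost = sum(lost_counts.values())
total_found = sum(found_counts.values())
net_change = total_lost - total_found

print(total_lost, total_found)   # 10239 8158
print(net_change)                # 2081
print(97_123 - 95_042)           # 2081
```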

Unpacked vs Folded: Folded Just Right

After feedback from three speakers of Swedish (thanks Peter, Johan, and Jessica!), it became clear that given the original goal of having more consistency in behavior between the Go feature/near match/upper right search box and full text search, folding å, ä, and ö is the wrong way to go about it. Since we're 90% of the way done with doing folding in Swedish the right way (folding everything but å, ä, and ö), we're going to go ahead and analyze that.

This is basically the same config as before, but with the CirrusSearchICUFoldingUnicodeSetFilter set to exclude å, ä, ö, Å, Ä, and Ö from folding. This was actually already configured, but the wrong language code was used. (Apologies to both Swedish (sv) and Swahili (sw) speakers from your English-speaking friends!) There's also some trouble in the dev environment, and this is configured for Russian by default in the FullyFeaturedConfig.php—turning that off may cause some problems in dev, so that'll require some extra testing, too.
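The idea of folding everything except the six Swedish letters can be sketched as follows. This is an approximation that strips combining marks after Unicode decomposition; it is not ICU folding itself, which also handles characters like ß, ł, and ə that have no decomposition. I'm assuming the exclusion amounts to a Unicode set filter along the lines of `[^åäöÅÄÖ]`; the function names are hypothetical.

```python
import unicodedata

# Letters exempted from folding, per the text above.
KEEP = set("åäöÅÄÖ")

def fold_char(ch):
    """Strip diacritics from one character unless it is an exempted letter."""
    if ch in KEEP:
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_non_swedish(text):
    """Fold diacritics everywhere except on å, ä, ö (either case)."""
    # Compose first so that a + combining diaeresis is seen as the single letter ä.
    composed = unicodedata.normalize("NFC", text)
    return "".join(fold_char(ch) for ch in composed)

for word in ["sydväst", "Malmö", "québec", "antónio", "bīl"]:
    print(word, "->", fold_non_swedish(word))
# sydväst -> sydväst
# Malmö -> Malmö
# québec -> quebec
# antónio -> antonio
# bīl -> bil
```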

Anyway, let's get to the results!

The total new tokens went way down—most diacritics in Swedish text are the Swedish diacritics, and they aren't being folded anymore.

The unpacked analysis chain creates 718,362 tokens; adding maximal folding before resulted in 863,236 tokens; adding only non-Swedish folding now results in only 731,882 tokens—more than just unpacked, but not a ton more.

Unpacked has 97,123 types before analysis, 75,269 after.

Unpacked and maximally folded has 116,220 types before analysis, 88,524 after.

Unpacked and non-Swedish folded has 100,607 types before analysis, 78,140 after—the impact is clearly much smaller.

Maximal folding resulted in 1,041 (1.383%) types with collisions, representing 39,654 (5.52%) tokens.

Non-Swedish folding resulted in only 478 (0.635%) types with collisions, representing 4,516 (0.629%) tokens. That is again a much smaller impact, and I think this is the first time I've seen the percentage of tokens be less than the percentage of types, indicating that collectively these are words of slightly below average frequency.
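The "below average frequency" reading can be checked by comparing tokens per colliding type with the overall tokens per type. This sketch assumes the percentages are taken against the unpacked post-analysis type count (75,269) and the total token count (718,362), which reproduces the reported figures.

```python
TOTAL_TOKENS = 718_362
TOTAL_TYPES = 75_269  # unpacked post-analysis types

def summarize(colliding_types, colliding_tokens):
    """Percent of types and tokens involved, plus average tokens per colliding type."""
    return {
        "type_pct": round(100 * colliding_types / TOTAL_TYPES, 3),
        "token_pct": round(100 * colliding_tokens / TOTAL_TOKENS, 3),
        "tokens_per_type": round(colliding_tokens / colliding_types, 2),
    }

maximal = summarize(1_041, 39_654)
non_swedish = summarize(478, 4_516)
overall_tokens_per_type = round(TOTAL_TOKENS / TOTAL_TYPES, 2)

print(maximal)       # {'type_pct': 1.383, 'token_pct': 5.52, 'tokens_per_type': 38.09}
print(non_swedish)   # {'type_pct': 0.635, 'token_pct': 0.629, 'tokens_per_type': 9.45}
print(overall_tokens_per_type)  # 9.54
```

With maximal folding the colliding types average about 38 tokens each, well above the overall average of about 9.5; with non-Swedish folding they average slightly under it.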

Collision Examples

There were just over 500 new types that collided with an old type (several new types can collide with the same old type). A random sample of 100 is below, for review by Swedish speakers.

The list shows the new form (with frequency count) before the arrow, and following the arrow, the group it will be joining (with frequency counts). So, antónio will now be analyzed to be the same as Antonio, and bīl will be analyzed to be the same as Bil, Bilar, Bilarna and others.

[1 á]  ->  [198 A][221 a]
[1 ə]  ->  [198 A][221 a]
[2 â]  ->  [198 A][221 a]
[8 ālī]  ->  [2 ALI][32 Ali][1 Alias][4 alias]
[1 alléer]  ->  [2 Allee]
[2 antonín]  ->  [3 Antonina]
[1 antónio]  ->  [32 Antonio]
[1 ascensión]  ->  [4 Ascension]
[1 āşş]  ->  [1 Ass][4 Assas][2 ass]
[1 baghlān]  ->  [10 Baghlan]
[2 baïse]  ->  [2 Baisas]
[1 banī]  ->  [2 Bani]
[1 béal]  ->  [1 Beale]
[1 beïda]  ->  [1 Beida]
[15 beïd]  ->  [1 Beida]
[9 bīl]  ->  [4 Bil][1 Bilar][1 Bilarna][5 Bilen][2 Bilens]
             [16 bil][9 bilar][12 bilen][1 bilens][1 billiga]
             [7 billigare][1 billigt]
[2 boû]  ->  [49 Bou][3 bou]
[1 brünn]  ->  [3 Brunner][1 brunnar]
[1 burhānuddin]  ->  [1 Burhanuddin]
[1 byť]  ->  [2 Byte][4 byta][3 bytas][1 byte][8 byten][4 byter]
             [2 bytes][2 byts]
[86 cañada]  ->  [4 Canada]
[2 cardeña]  ->  [2 Cardenas]
[1 chikmagalūr]  ->  [3 Chikmagalur]
[2 claës]  ->  [11 Claes]
[1 conférence]  ->  [1 Conference][1 conference]
[6 cristóbal]  ->  [1 Cristobal][1 cristobala]
[4 čuka]  ->  [1 Cukor]
[4 déserts]  ->  [3 Desert][1 deserta]
[1 dôn]  ->  [17 Don][3 Done][1 Donen][1 done]
[1 dūr]  ->  [1 Dur][27 dur]
[1 düring]  ->  [1 during]
[1 ē]  ->  [99 E][1389 e]
[1 elegía]  ->  [5 Elegi][3 Elegie][1 Elegier][5 elegi][4 elegier]
                [1 elegies]
[1 élisabeth]  ->  [23 Elisabeth]
[13 émilie]  ->  [2 Emilia]
[1 enríquez]  ->  [1 Enriquez]
[1 fallén]  ->  [1 fallenhet]
[1 fé]  ->  [5 Fe]
[16 félix]  ->  [7 Felix]
[1 gavilán]  ->  [1 Gavilan]
[1 geneviève]  ->  [3 Genevieve]
[2 gómez]  ->  [3 Gomez]
[2 heróica]  ->  [1 Heroic]
[1 hsü]  ->  [1 Hsu]
[3 hwè]  ->  [2 Hwe]
[35 île]  ->  [16 Ile][1 ile]
[3 iō]  ->  [3 Io]
[2 jāy]  ->  [7 Jay]
[2 jiří]  ->  [1 Jiri]
[1 jónsson]  ->  [7 Jonsson]
[1 julián]  ->  [10 Julian][2 Juliana][3 Juliane][2 Julianne]
[13 kāl]  ->  [2 Kal][1 Kala][3 Kalar][2 Kale][2 kala]
[4 kariaí]  ->  [1 Kariai]
[1 kātākhāli]  ->  [1 Katakhali]
[2 kébir]  ->  [8 Kebir][9 Kebira]
[1 kriša]  ->  [1 Krisens][7 kris][2 krisen][1 krisens][1 kriser]
[1 kyodō]  ->  [1 Kyodo]
[5 lāt]  ->  [4 Late][2089 lat][4 lata][1 late][1 latens][1 laterna]
[1 lîf]  ->  [2825 Life][3 life]
[3 lôa]  ->  [1 Loa]
[1 malé]  ->  [3 Maleen]
[1 mt’a]  ->  [15 Mt'a]
[6 müll]  ->  [2 Mull]
[1 nē]  ->  [3 Ne][3 ne]
[1 noêmia]  ->  [1 Noemi]
[1 ōkuma]  ->  [13 Okuma]
[37 ouâdi]  ->  [6 Ouadi]
[7 pérez]  ->  [2 Perez]
[4 piña]  ->  [2 Pin][2 Pinar][18 Pine][7 Pins]
[9 pīr]  ->  [3 Pira][1 Piras][1 pires]
[1 poémes]  ->  [4 poem][3 poems]
[26 potosí]  ->  [2 Potosi]
[355 québec]  ->  [5 Quebec]
[1 râ]  ->  [3 RA][1 Ra]
[6 reč]  ->  [3 Rec]
[1 régent]  ->  [1 Regenter][15 regent][3 regenten][1 regentens]
                [1 regenter][1 regenterna][1 regenternas]
[2 renée]  ->  [2 Renee]
[1 república]  ->  [1 Republic]
[169 río]  ->  [92 Rio]
[10 rīz]  ->  [6 Riz][1 Rize]
[1 rübsaamen]  ->  [1 Rubsaamen]
[1 ruíz]  ->  [5 Ruiz]
[16 shāh]  ->  [3 Shah][1 shah]
[1 sharīf]  ->  [1 Sharif]
[2 shīr]  ->  [6 Shire][1 shire]
[1 sì]  ->  [5 SI][9 Si][8 si]
[2 soûdâne]  ->  [3 Soudane]
[5 spīn]  ->  [2 spin][2 spina]
[1 spíritus]  ->  [1 spiritus]
[5 tá]  ->  [4 Ta][85 ta]
[7 tāl]  ->  [6 Tal][2 Tala][1 Talas][16 Talat][3 Tales]
             [1 Taliga][68 tal][11 tala][7 talade][6 talades][2 talande]
             [14 talar][3 talare][1 talarna][14 talas][1 talat][1 tale]
             [22 talen][1 talens][1 tals]
[2 tnîyé]  ->  [1 Tniye]
[1 tristán]  ->  [25 Tristan][2 Tristans]
[2 tūtī]  ->  [1 Tuti]
[4 ţūţī]  ->  [1 Tuti]
[1 valéry]  ->  [1 Valery]
[2 yéyé]  ->  [5 Yeye]
[5 żeligowski]  ->  [2 Zeligowski]
[1 žena]  ->  [2 Zen][2 zen]
[1 zéro]  ->  [1 Zero][1 zero]

Recommendation

I'm waiting for feedback from the friendly native speakers of Swedish on the 100 new example collisions above. I expect they'll like these better. If they approve, then we're good to push the change to production.