User:TJones (WMF)/Notes/Breton Analyzer Analysis

July 2021 — See TJones_(WMF)/Notes for other projects. See also T258094. For help with the technical jargon used in Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.

Background
At Celtic Knot 2020 I asked people to contact me if they were interested in improving search for a particular language, and I got a positive response and a pointer to a stopword list from VIGNERON.

The basic plan (copied from the phab ticket) was:


 * Create a Breton-specific language analysis configuration
 * Finalize a list of stop words (the linked-to list seems to be too aggressive, and is more a list of common words) and add them to the Breton config.
 * Enable elision support for d', n', and p'. Look further into including m' and z'.
 * Look at the impact of adding some support for the more common French elision (l', s', j', and qu') since there is a fair amount of French text on Breton Wikipedia. (Definitely do not include c', since c'h is a letter in Breton.)
 * Enable ICU folding. Very likely need an exception for ñ. Less likely for â, ê, î, ô, û, ù, ü (all used in Breton); watch for problems with ç (commonly used in French).
 * Make sure apostrophes are normalized (e.g., c’hoar & c'hoar should get the same results).
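
To make the plan above a bit more concrete, here is a minimal Python sketch of the intended token-level behavior. It is not the actual analyzer configuration. The function names are made up for illustration, the prefix set only covers d', n', and p', and the fold step is a rough stand-in for ICU folding that only lowercases and strips diacritics (real ICU folding does a lot more).

```python
import unicodedata

# Assumptions for illustration, taken from the plan above; final lists may differ.
BRETON_ELISION = {"d", "n", "p"}      # m' and z' still need review; never c'
APOSTROPHES = {"\u2019": "'"}         # curly apostrophe -> straight apostrophe
KEEP_UNFOLDED = {"ñ"}                 # likely folding exception


def normalize_apostrophes(token):
    """Map curly apostrophes to the ASCII apostrophe."""
    return "".join(APOSTROPHES.get(ch, ch) for ch in token)


def strip_elision(token, prefixes=BRETON_ELISION):
    """Drop a leading elided particle such as d' or n'; leave c'h alone."""
    head, sep, tail = token.partition("'")
    if sep and tail and head.lower() in prefixes:
        return tail
    return token


def fold(token):
    """Rough stand-in for ICU folding: lowercase, drop diacritics, keep ñ."""
    out = []
    for ch in unicodedata.normalize("NFC", token).lower():
        if ch in KEEP_UNFOLDED:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def analyze(token):
    """Apply the three steps in order: apostrophes, elision, folding."""
    return fold(strip_elision(normalize_apostrophes(token)))
```

With these assumptions, analyze("d’ar") and analyze("d'ar") both give ar, while c’hoar and c'hoar both come out as c'hoar, because c is deliberately not in the prefix set.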

However, making this a 10% project for me, plus some concerns about the initial stopword list, delayed things quite a bit (as documented in my lightning talk at Arctic Knot 2021).

Hopefully we are back on track!

Data
I ended up only pulling 5,000 documents each from Breton Wikipedia and Wiktionary.

Speaker Review: d', n', and p' Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for d', n', and p' seemed uncontroversial.

Random Sample
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of adding d', n', and p' elision support. (These are from the Wikipedia sample.)

Notes:


 * It appears likely that P'tite being stemmed to tite is an error; P'tite seems to be a contraction of French Petite. It's uncommon, and no language processing is perfect, plus exact matching will generally prefer matches to P'tite over matches to tite if the search term is p'tite.
 * Elision handling automatically takes care of curly apostrophes, like d’.
 * Examples of p' elision seem to be less common; none appear in the random group, though there are some in the high-impact groups.

Key:


 * alphonse >> 1
 * alphonse indicates that all of these words were stemmed to alphonse. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with d', n', and p' elision support
 * [11 Alphonse] — Alphonse occurs 11 times in our sample (of 5K articles)
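
For anyone who wants to check a group mechanically, the notation in the key can be read with a few lines of Python. This is only a sketch under the assumption that every entry has the form [count word]. The helper names parse_group and gained are hypothetical and not part of any analysis tool.

```python
import re

# One entry looks like "[11 Alphonse]": a count, a space, then the word.
ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")


def parse_group(text):
    """Turn "[11 Alphonse][1 d'Alphonse]" into {"Alphonse": 11, "d'Alphonse": 1}."""
    return {word: int(count) for count, word in ENTRY.findall(text)}


def gained(old, new):
    """Return the words present in the new group but absent from the old one."""
    old_words = parse_group(old)
    return [word for word in parse_group(new) if word not in old_words]


print(gained("[11 Alphonse]", "[11 Alphonse][1 d'Alphonse]"))  # ["d'Alphonse"]
```

The length of the list returned by gained should match the number after >> for that group.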

Newly gained group members are the entries listed under n: but not under o:.

alphonse >> 1
o: [11 Alphonse]
n: [11 Alphonse][1 d'Alphonse]

amiens >> 1
o: [11 Amiens]
n: [11 Amiens][1 d'Amiens]

amore >> 4
o: [1 Amore][2 amore]
n: [1 Amore][1 D'Amore][1 D'amore][2 amore][2 d'amore][1 d’amore]

après >> 1
o: [1 Après][5 après]
n: [1 Après][5 après][3 d'après]

arbres >> 1
o: [1 arbres]
n: [1 arbres][1 d'arbres]

arbrissel >> 1
o: [4 Arbrissel]
n: [4 Arbrissel][1 d’Arbrissel]

ardeiñ >> 1
o: [1 Ardeiñ][4 ardeiñ]
n: [1 Ardeiñ][4 ardeiñ][1 d'ardeiñ]

aura >> 1
o: [2 aura]
n: [2 aura][2 n'aura]

avranches >> 1
o: [5 Avranches]
n: [5 Avranches][1 d'Avranches]

eben >> 1
o: [1 Eben][29 eben]
n: [1 Eben][6 d'eben][29 eben]

éducation >> 1
o: [1 Éducation][1 éducation]
n: [2 d'éducation][1 Éducation][1 éducation]

emlazhañ >> 1
o: [2 emlazhañ]
n: [1 d'emlazhañ][2 emlazhañ]

enquête >> 1
o: [3 Enquête][3 enquête]
n: [3 Enquête][4 d'enquête][3 enquête]

entraigues >> 1
o: [4 Entraigues]
n: [4 Entraigues][2 d'Entraigues]

extermination >> 1
o: [1 extermination]
n: [1 d'extermination][1 extermination]

heller >> 3
o: [1 Heller]
n: [1 Heller][1 N’heller][4 n'heller][1 n’heller]

hon >> 3
o: [6 Hon][49 hon]
n: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]

occasion >> 1
o: [1 occasion]
n: [1 d'occasion][1 occasion]

offenbach >> 1
o: [2 Offenbach]
n: [2 Offenbach][1 d'Offenbach]

ont >> 1
o: [6 ont]
n: [1 n'ont][6 ont]

ötzi >> 1
o: [10 Ötzi][1 ötzi]
n: [2 d'Ötzi][10 Ötzi][1 ötzi]

ouzomp >> 1
o: [5 ouzomp]
n: [1 N'ouzomp][5 ouzomp]

st >> 1
o: [1 ST][111 St]
n: [1 ST][111 St][1 d'St]

tite >> 1
o: [1 Tite]
n: [1 P'tite][1 Tite]

ugent >> 1
o: [4 Ugent][103 ugent]
n: [4 Ugent][1 n'ugent][103 ugent]

High-Impact Groups
There was only one stemming group that gained 10 or more members, so I've included the 5 stemming groups that gained 5 or more members.

Newly gained group members are the entries listed under n: but not under o:.

ar >> 5
o: [3 AR][1644 Ar][21923 ar]
n: [3 AR][1644 Ar][180 D'ar][32 D’ar][21923 ar][1 d'Ar][2143 d'ar][164 d’ar]

en >> 8
o: [672 En][8989 en]
n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en][17 p'en][3 p’en]

eo >> 7
o: [3 Eo][5228 eo]
n: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo]

he >> 7
o: [2 HE][116 He][1581 he]
n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he][4 p'he]

o >> 11
o: [183 O][2957 o][1 º]
n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]

High-Frequency Words
There were 10 stemming groups with high-frequency tokens that gained tokens that are not covered above (all of the High-Impact Groups above also contain High-Frequency Words).

Newly gained group members are the entries listed under n: but not under o:.

a >> 1
o: [789 A][25073 a]
n: [789 A][25073 a][5 n'a]

al >> 3
o: [282 Al][2164 al]
n: [282 Al][1 D'al][2164 al][80 d'al][2 d’al]

an >> 4
o: [5 AN][1042 An][13605 an]
n: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]

e >> 4
o: [2553 E][33962 e]
n: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][2 n'e]

er >> 2
o: [3 ER][455 Er][3284 er]
n: [3 ER][455 Er][1 d’er][3284 er][1 n'er]

eus >> 5
o: [90 Eus][7121 eus]
n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]

ez >> 2
o: [2 EZ][10 Ez][1187 ez]
n: [2 EZ][10 Ez][2 N'ez][1187 ez][1 n'ez]

oa >> 2
o: [2 OA][1 Oa][6371 oa]
n: [1 N'oa][2 OA][1 Oa][2 n'oa][6371 oa]

un >> 3
o: [3 UN][555 Un][4214 un]
n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]

ur >> 3
o: [1075 Ur][7261 ur]
n: [2 D'ur][1075 Ur][128 d'ur][6 d’ur][7261 ur]

Speaker Review: m' and z' Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for m' and z' seemed less clear, so these have been separated out.

Notes:


 * These changes are being compared against a baseline that already assumes d', n', and p' elision support.

Random Sample
There were only 17 stemming groups (words that would all be indexed together) that gained members as a result of adding m' and z' elision support and that are not covered by the other groups below, so all 17 of them are here. (These are from the Wikipedia sample.)

Key:


 * am >> 1
 * am indicates that all of these words were stemmed to am. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with m' and z' elision support
 * [2 m'am] — m'am occurs 2 times in our sample (of 5K articles)

Newly gained group members are the entries listed under n: but not under o:.

am >> 1
o: [3 Am][29 am][3 d'am]
n: [3 Am][29 am][3 d'am][2 m'am]

as >> 1
o: [12 AS][12 As][18 as]
n: [12 AS][12 As][18 as][1 m'as]

bili >> 1
o: [2 Bili][13 bili]
n: [2 Bili][1 M'Bili][13 bili]

ec'h >> 1
o: [137 ec'h][1 n'ec'h]
n: [137 ec'h][1 m'ec'h][1 n'ec'h]

edo >> 1
o: [20 Edo][2 N'edo][7 P'edo][106 edo][13 p'edo]
n: [20 Edo][2 N'edo][7 P'edo][106 edo][14 m'edo][13 p'edo]

emaint >> 1
o: [1 Emaint][1 N'emaint][11 emaint][2 n'emaint][1 p'emaint][1 p’emaint]
n: [1 Emaint][1 N'emaint][11 emaint][1 m’emaint][2 n'emaint][1 p'emaint][1 p’emaint]

emañ >> 3
o: [164 Emañ][4 N'emañ][1 P'emañ][511 emañ][12 n'emañ][3 p'emañ]
n: [164 Emañ][8 M'emañ][4 N'emañ][1 P'emañ][511 emañ][49 m'emañ][6 m’emañ][12 n'emañ][3 p'emañ]

emaoc'h >> 1
o: [4 emaoc'h]
n: [1 M'emaoc'h][4 emaoc'h]

emeur >> 1
o: [2 Emeur][5 emeur]
n: [2 Emeur][5 emeur][3 m'emeur]

est >> 1
o: [15 Est][38 est][4 n'est][1 n’est]
n: [15 Est][38 est][1 m'est][4 n'est][1 n’est]

hai >> 1
o: [2 hai]
n: [1 M'hai][2 hai]

hen >> 1
o: [12 Hen][1 d'hen][31 hen][1 n'hen]
n: [12 Hen][1 d'hen][31 hen][1 m'hen][1 n'hen]

ho >> 1
o: [1 HO][6 Ho][1 d'ho][26 ho][1 n’ho]
n: [1 HO][6 Ho][1 d'ho][26 ho][1 m’ho][1 n’ho]

hoc'h >> 2
o: [2 Hoc'h][17 hoc'h]
n: [2 Hoc'h][2 M'hoc'h][17 hoc'h][1 m'hoc'h]

hon >> 1
o: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
n: [6 Hon][1 M'hon][2 d'hon][2 d’hon][49 hon][2 n'hon]

int >> 2
o: [2 Int][2 N'int][1 N’int][158 int][35 n'int][4 n’int]
n: [2 Int][2 N'int][1 N’int][158 int][1 m’int][35 n'int][4 n’int][2 z'int]

ont >> 1
o: [1 n'ont][6 ont]
n: [1 m'ont][1 n'ont][6 ont]

High-Impact Groups
There was only one stemming group that gained 5 or more members.

Newly gained group members are the entries listed under n: but not under o:.

eo >> 5
o: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo]
n: [3 Eo][2 M'eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][51 m'eo][13 m’eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo][4 z'eo][1 z’eo]

High-Frequency Words
There were only seven stemming groups with high-frequency tokens that gained tokens, one of which is the eo group above. The rest are presented below.

Newly gained group members are the entries listed under n: but not under o:.

a >> 1
o: [789 A][25073 a][5 n'a]
n: [789 A][25073 a][4 m'a][5 n'a]

en >> 2
o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en][17 p'en][3 p’en]
n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en]

er >> 1
o: [3 ER][455 Er][1 d’er][3284 er][1 n'er]
n: [3 ER][455 Er][1 d’er][3284 er][1 m'er][1 n'er]

eus >> 1
o: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]
n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus][3 z'eus]

he >> 2
o: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he][4 p'he]
n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][8 m'he][1 m’he][10 n'he][1 n’he][4 p'he]

o >> 2
o: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]
n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o][3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]

Speaker Review: Common French Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for the most commonly seen French elision items (l', s', j', and qu') seems like it might be reasonable, given the prevalence of French-language content in Breton Wikipedia and the fact that some of the Breton elision items are also used in French (d', n', and m', though with different meanings).

Notes:


 * These changes are being compared against a baseline that already assumes Breton elision support (d', n', p', m', and z').
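
The staged comparison used across these three review sections can be sketched in Python as follows. This is an illustration only. The stage names, the stem function (lowercasing plus elision stripping), and gains_by_stage are assumptions made for the sketch, not the real analysis tooling. The sample tokens are the members of the est group from the Wikipedia sample.

```python
from collections import defaultdict

# Each stage adds elision prefixes on top of the previous one (assumed sets).
STAGES = [
    ("baseline", set()),
    ("d/n/p", {"d", "n", "p"}),
    ("m/z", {"d", "n", "p", "m", "z"}),
    ("French", {"d", "n", "p", "m", "z", "l", "s", "j", "qu"}),
]


def stem(token, prefixes):
    """Normalize the apostrophe, strip a listed elision prefix, lowercase."""
    token = token.replace("\u2019", "'")
    head, sep, tail = token.partition("'")
    if sep and tail and head.lower() in prefixes:
        token = tail
    return token.lower()


def groups(tokens, prefixes):
    """Group the original tokens by their stem."""
    result = defaultdict(set)
    for tok in tokens:
        result[stem(tok, prefixes)].add(tok)
    return result


def gains_by_stage(tokens, key):
    """For one stem, list members gained at each stage versus the stage before."""
    gains = {}
    previous = set()
    for name, prefixes in STAGES:
        current = groups(tokens, prefixes).get(key, set())
        if name != "baseline":
            gains[name] = sorted(current - previous)
        previous = current
    return gains


tokens = ["Est", "est", "n'est", "n’est", "m'est", "Qu'est", "s’est"]
print(gains_by_stage(tokens, "est"))
```

Because each stage is diffed against the one before it, members gained from Breton elision do not show up again as gains in the French stage.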

Random Sample
Below is a semi-random selection of 25 stemming groups (words that would all be indexed together) that gained members as a result of adding common French elision support. (These are from the Wikipedia sample.)


 * The first 20 examples were chosen randomly. The last 5 examples were chosen from among those with s', j', and qu' elision, since the vast majority of examples only have l' elision.
 * To my untrained eye, more of these words look French, which is not a surprise. It makes sense to me—though I'm very happy to be corrected!—that dealing with common French elision would improve searching for names and French/Breton cognates or borrowings that appear in French-language contexts on Breton Wikipedia.

Key:


 * aigua >> 1
 * aigua indicates that all of these words were stemmed to aigua. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with common French elision support
 * [2 l'aigua] — l'aigua occurs 2 times in our sample (of 5K articles)

Newly gained group members are the entries listed under n: but not under o:.

aigua >> 1
o: [1 Aigua]
n: [1 Aigua][2 l'aigua]

âme >> 2
o: [3 âme]
n: [1 l'Âme][3 l'âme][3 âme]

association >> 1
o: [22 Association][3 association]
n: [22 Association][1 L'Association][3 association]

atelier >> 1
o: [2 atelier]
n: [1 L'Atelier][2 atelier]

aube >> 3
o: [5 Aube]
n: [5 Aube][1 L'Aube][1 L'aube][2 l'Aube]

enquête >> 2
o: [3 Enquête][4 d'enquête][3 enquête]
n: [3 Enquête][3 L'Enquête][4 d'enquête][3 enquête][1 l'enquête]

enseignement >> 2
o: [2 enseignement]
n: [1 L'enseignement][2 enseignement][4 l'enseignement]

escaut >> 1
o: [8 Escaut]
n: [8 Escaut][3 l'Escaut]

esperanto >> 1
o: [1 Esperanto]
n: [1 Esperanto][1 L'esperanto]

exposition >> 2
o: [1 Exposition][2 exposition]
n: [1 Exposition][1 L'Exposition][2 exposition][1 l'exposition]

hiver >> 2
o: [1 Hiver]
n: [1 Hiver][1 L'Hiver][1 l'hiver]

honneur >> 1
o: [1 Honneur][1 d'honneur][1 d’honneur]
n: [1 Honneur][1 d'honneur][1 d’honneur][1 l'honneur]

horizon >> 1
o: [1 Horizon]
n: [1 Horizon][1 l'Horizon]

index >> 1
o: [5 Index][2 index]
n: [5 Index][2 index][1 l'Index]

isle >> 2
o: [4 Isle]
n: [4 Isle][3 L'Isle][1 l'isle]

isola >> 2
o: [1 Isola]
n: [1 Isola][1 L'Isola][1 l'Isola]

oncle >> 1
o: [2 oncle]
n: [1 L'Oncle][2 oncle]

opera >> 1
o: [1 Opera][12 opera]
n: [1 L'opera][1 Opera][12 opera]

oz >> 2
o: [3 Oz]
n: [1 L'OZ][1 L'Oz][3 Oz]

universelle >> 1
o: [1 universelle]
n: [1 l'universelle][1 universelle]

aime >> 1
o: [1 Aime]
n: [1 Aime][2 j'aime]

ait >> 1
o: [1 Ait][1 ait]
n: [1 Ait][1 ait][1 qu'ait]

elle >> 2
o: [3 Elle][7 elle]
n: [3 Elle][7 elle][1 l'Elle][1 qu’elle]

est >> 2
o: [15 Est][38 est][1 m'est][4 n'est][1 n’est]
n: [15 Est][1 Qu'est][38 est][1 m'est][4 n'est][1 n’est][1 s’est]

obre >> 1
o: [1 Obre]
n: [1 Obre][1 s'obre]

High-Impact Groups
There were only five stemming groups that gained 5 or more members.


 * Again, these look more French to me, as expected.

Newly gained group members are the entries listed under n: but not under o:.

amour >> 5
o: [4 Amour][4 amour][1 d'amour][1 d’amour]
n: [4 Amour][2 L'Amour][2 L'amour][4 amour][1 d'amour][1 d’amour][1 l'Amour][3 l'amour][1 l’amour]

art >> 5
o: [1 ART][31 Art][9 art][4 d'Art][3 d'art][1 d’art]
n: [1 ART][31 Art][1 L'Art][1 L'art][9 art][4 d'Art][3 d'art][1 d’art][3 l'Art][4 l'art][2 l’Art]

histoire >> 5
o: [1 HISTOIRE][50 Histoire][10 d'Histoire][11 d'histoire][10 histoire]
n: [1 HISTOIRE][50 Histoire][1 L'Histoire][1 L'histoire][10 d'Histoire][11 d'histoire][10 histoire][11 l'histoire][1 l’Histoire][8 l’histoire]

homme >> 6
o: [2 Homme][1 d'Homme][3 d’homme][8 homme]
n: [2 Homme][6 L'Homme][1 L'homme][1 L’Homme][1 d'Homme][3 d’homme][8 homme][1 l'Homme][7 l'homme][1 l’homme]

île >> 5
o: [5 Île][1 île]
n: [3 L'Île][1 L'île][1 L’Île][3 l'Île][4 l'île][5 Île][1 île]

High-Frequency Words
There were only four stemming groups with high-frequency tokens that gained tokens.


 * Again, these look more French to me, as expected.

Newly gained group members are the entries listed under n: but not under o:.

a >> 1
o: [789 A][25073 a][4 m'a][5 n'a]
n: [789 A][25073 a][1 l'a][4 m'a][5 n'a]

an >> 2
o: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]
n: [5 AN][1042 An][178 D'an][25 D’an][1 L'An][13605 an][1680 d'an][138 d’an][3 l'an]

en >> 2
o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en]
n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][1 j'en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en][1 qu'en]

un >> 1
o: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]
n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][1 qu’un][4214 un]