User:TJones (WMF)/Notes/Breton Analyzer Analysis

July 2021 — See TJones_(WMF)/Notes for other projects. See also T258094. For help with the technical jargon used in Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.

Background
At Celtic Knot 2020 I asked people to contact me if they were interested in improving search for a particular language, and I got a positive response and a pointer to a stopword list from VIGNERON.

The basic plan (copied from the phab ticket) was:


 * Create a Breton-specific language analysis configuration
 * Finalize a list of stopwords (the linked-to list seems to be too aggressive, and is more a list of common words) and add them to the Breton config.
 * Enable elision support for d', n', and p'. Look further into including m' and z'.
 * Look at the impact of adding some support for the more common French elision (l', s', j', and qu') since there is a fair amount of French text on Breton Wikipedia. (Definitely do not include c', since c'h is a letter in Breton.)
 * Enable ICU folding. Very likely need an exception for ñ. Less likely for â, ê, î, ô, û, ù, ü (all used in Breton); watch for problems with ç (commonly used in French).
 * Make sure apostrophes are normalized (e.g., c’hoar & c'hoar should get the same results).
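As a rough illustration of how the plan above might map onto Elasticsearch analysis settings, here is a sketch written as a Python dictionary. The names (apostrophe_norm, breton_elision, breton_folding, breton_stop, breton_text) are made up for illustration, the stopword list is a three-word placeholder, and the filter order is my assumption. This is not the configuration that was actually deployed.

```python
# Hypothetical sketch of Breton analysis settings. All names below are
# illustrative; this is not the deployed configuration.
breton_analysis = {
    "char_filter": {
        # Straighten curly apostrophes so c’hoar and c'hoar match.
        "apostrophe_norm": {
            "type": "mapping",
            "mappings": ["\u2018=>'", "\u2019=>'"],
        },
    },
    "filter": {
        # Breton elision; c' is deliberately absent because c'h is a letter.
        "breton_elision": {
            "type": "elision",
            "articles": ["d", "n", "p"],
            "articles_case": True,
        },
        # ICU folding (needs the analysis-icu plugin), exempting ñ and Ñ.
        "breton_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^ñÑ]",
        },
        "breton_stop": {
            "type": "stop",
            "stopwords": ["a", "e", "o"],  # placeholder, not the real list
        },
    },
    "analyzer": {
        "breton_text": {
            "type": "custom",
            "char_filter": ["apostrophe_norm"],
            "tokenizer": "standard",
            # Assumed order: elision, lowercase, stopwords, then folding.
            "filter": ["breton_elision", "lowercase", "breton_stop", "breton_folding"],
        },
    },
}
```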

However, making this a 10% project for me, plus some concerns about the initial stopword list, delayed things quite a bit (as documented in my lightning talk at Arctic Knot 2021).

Hopefully we are back on track!

Data
I ended up only pulling 5,000 documents each from Breton Wikipedia and Wiktionary.

Speaker Review: d', n', and p' Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for d', n', and p' seemed uncontroversial.

Random Sample
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of adding d', n', and p' elision support. (These are from the Wikipedia sample.)

Notes:


 * It appears likely that P'tite being stemmed to tite is an error; P'tite seems to be a contraction of French Petite. It's uncommon, and no language processing is perfect. In any case, exact matching will generally prefer matches to P'tite over matches to tite if the search term is p'tite.
 * Elision handling automatically takes care of curly apostrophes, like d’.
 * Examples of p' elision seem to be less common; none appear in the random group, though there are some in the high-impact groups.
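To make the notes above concrete, here is a minimal Python sketch of what elision stripping does to a token. It is a simplified stand-in for the real elision filter, not its implementation; the function name strip_elision is my own.

```python
import re

# Leading d', n', or p' with a straight (') or curly (’) apostrophe.
# c' is excluded on purpose, because c'h is a letter in Breton.
_ELISION_RE = re.compile(r"^(?:d|n|p)['\u2019]", re.IGNORECASE)


def strip_elision(token: str) -> str:
    """Drop a leading d', n', or p' from a token, if present."""
    return _ELISION_RE.sub("", token, count=1)


for word in ["d'Alphonse", "d’amore", "N’heller", "P'tite", "c'hoar"]:
    print(word, "->", strip_elision(word))
```

The P'tite case shows the likely error discussed above: the sketch has no way to know that P'tite is a French contraction rather than p' plus tite.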

Key:


 * alphonse >> 1
 * alphonse indicates that all of these words were stemmed to alphonse. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with d', n', and p' elision support
 * [11 Alphonse] — Alphonse occurs 11 times in our sample (of 5K articles)
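For anyone who wants to see how a line in this format could be produced, here is a small Python sketch that builds "old" and "new" stemming groups from token counts and reports the gain. The two analyze functions are simplified stand-ins I made up (lowercasing, with or without d', n', p' stripping), not the real analysis chains.

```python
import re
from collections import defaultdict


def old_analyze(token):
    # Stand-in for the current behavior: lowercase only.
    return token.lower()


def new_analyze(token):
    # Stand-in for the new behavior: strip a leading d', n', or p'
    # (straight or curly apostrophe), then lowercase.
    return re.sub(r"^[dnp]['\u2019]", "", token, flags=re.IGNORECASE).lower()


def stemming_groups(token_counts, analyze):
    """Group surface forms (with counts) by the stem they are indexed under."""
    groups = defaultdict(dict)
    for token, count in token_counts.items():
        groups[analyze(token)][token] = count
    return groups


def format_group(group):
    """Render a group as [count token] pairs, sorted by token."""
    return "".join(f"[{count} {token}]" for token, count in sorted(group.items()))


sample = {"Alphonse": 11, "d'Alphonse": 1}
old = stemming_groups(sample, old_analyze)["alphonse"]
new = stemming_groups(sample, new_analyze)["alphonse"]
print(f"alphonse >> {len(new) - len(old)}")
print("o:", format_group(old))
print("n:", format_group(new))
```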

Newly gained group members are those that appear under n: but not under o:.

alphonse >> 1
    o: [11 Alphonse]
    n: [11 Alphonse][1 d'Alphonse]
amiens >> 1
    o: [11 Amiens]
    n: [11 Amiens][1 d'Amiens]
amore >> 4
    o: [1 Amore][2 amore]
    n: [1 Amore][1 D'Amore][1 D'amore][2 amore][2 d'amore][1 d’amore]
après >> 1
    o: [1 Après][5 après]
    n: [1 Après][5 après][3 d'après]
arbres >> 1
    o: [1 arbres]
    n: [1 arbres][1 d'arbres]
arbrissel >> 1
    o: [4 Arbrissel]
    n: [4 Arbrissel][1 d’Arbrissel]
ardeiñ >> 1
    o: [1 Ardeiñ][4 ardeiñ]
    n: [1 Ardeiñ][4 ardeiñ][1 d'ardeiñ]
aura >> 1
    o: [2 aura]
    n: [2 aura][2 n'aura]
avranches >> 1
    o: [5 Avranches]
    n: [5 Avranches][1 d'Avranches]
eben >> 1
    o: [1 Eben][29 eben]
    n: [1 Eben][6 d'eben][29 eben]
éducation >> 1
    o: [1 Éducation][1 éducation]
    n: [2 d'éducation][1 Éducation][1 éducation]
emlazhañ >> 1
    o: [2 emlazhañ]
    n: [1 d'emlazhañ][2 emlazhañ]
enquête >> 1
    o: [3 Enquête][3 enquête]
    n: [3 Enquête][4 d'enquête][3 enquête]
entraigues >> 1
    o: [4 Entraigues]
    n: [4 Entraigues][2 d'Entraigues]
extermination >> 1
    o: [1 extermination]
    n: [1 d'extermination][1 extermination]
heller >> 3
    o: [1 Heller]
    n: [1 Heller][1 N’heller][4 n'heller][1 n’heller]
hon >> 3
    o: [6 Hon][49 hon]
    n: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
occasion >> 1
    o: [1 occasion]
    n: [1 d'occasion][1 occasion]
offenbach >> 1
    o: [2 Offenbach]
    n: [2 Offenbach][1 d'Offenbach]
ont >> 1
    o: [6 ont]
    n: [1 n'ont][6 ont]
ötzi >> 1
    o: [10 Ötzi][1 ötzi]
    n: [2 d'Ötzi][10 Ötzi][1 ötzi]
ouzomp >> 1
    o: [5 ouzomp]
    n: [1 N'ouzomp][5 ouzomp]
st >> 1
    o: [1 ST][111 St]
    n: [1 ST][111 St][1 d'St]
tite >> 1
    o: [1 Tite]
    n: [1 P'tite][1 Tite]
ugent >> 1
    o: [4 Ugent][103 ugent]
    n: [4 Ugent][1 n'ugent][103 ugent]

High-Impact Groups
There was only one stemming group that gained 10 or more members, so I've included the 5 stemming groups that gained 5 or more members.

Newly gained group members are those that appear under n: but not under o:.

ar >> 5
    o: [3 AR][1644 Ar][21923 ar]
    n: [3 AR][1644 Ar][180 D'ar][32 D’ar][21923 ar][1 d'Ar][2143 d'ar][164 d’ar]
en >> 8
    o: [672 En][8989 en]
    n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en][17 p'en][3 p’en]
eo >> 7
    o: [3 Eo][5228 eo]
    n: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo]
he >> 7
    o: [2 HE][116 He][1581 he]
    n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he][4 p'he]
o >> 11
    o: [183 O][2957 o][1 º]
    n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]

High-Frequency Words
There were 10 stemming groups with high-frequency tokens that gained tokens that are not covered above (all of the High-Impact Groups above also contain High-Frequency Words).

Newly gained group members are those that appear under n: but not under o:.

a >> 1
    o: [789 A][25073 a]
    n: [789 A][25073 a][5 n'a]
al >> 3
    o: [282 Al][2164 al]
    n: [282 Al][1 D'al][2164 al][80 d'al][2 d’al]
an >> 4
    o: [5 AN][1042 An][13605 an]
    n: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]
e >> 4
    o: [2553 E][33962 e]
    n: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][2 n'e]
er >> 2
    o: [3 ER][455 Er][3284 er]
    n: [3 ER][455 Er][1 d’er][3284 er][1 n'er]
eus >> 5
    o: [90 Eus][7121 eus]
    n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]
ez >> 2
    o: [2 EZ][10 Ez][1187 ez]
    n: [2 EZ][10 Ez][2 N'ez][1187 ez][1 n'ez]
oa >> 2
    o: [2 OA][1 Oa][6371 oa]
    n: [1 N'oa][2 OA][1 Oa][2 n'oa][6371 oa]
un >> 3
    o: [3 UN][555 Un][4214 un]
    n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]
ur >> 3
    o: [1075 Ur][7261 ur]
    n: [2 D'ur][1075 Ur][128 d'ur][6 d’ur][7261 ur]

Speaker Review: m' and z' Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for m' and z' seemed less clear, so these have been separated out.

Notes:


 * These changes are being compared against a baseline that already assumes d', n', and p' elision support.

Random Sample
There were only 17 stemming groups (words that would all be indexed together) that gained members as a result of adding m' and z' elision support that were not covered by the other groups, below, so all 17 of them are here. (These are from the Wikipedia sample.)

Key:


 * am >> 1
 * am indicates that all of these words were stemmed to am. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with m' and z' elision support
 * [2 m'am] — m'am occurs 2 times in our sample (of 5K articles)

Newly gained group members are those that appear under n: but not under o:.

am >> 1
    o: [3 Am][29 am][3 d'am]
    n: [3 Am][29 am][3 d'am][2 m'am]
as >> 1
    o: [12 AS][12 As][18 as]
    n: [12 AS][12 As][18 as][1 m'as]
bili >> 1
    o: [2 Bili][13 bili]
    n: [2 Bili][1 M'Bili][13 bili]
ec'h >> 1
    o: [137 ec'h][1 n'ec'h]
    n: [137 ec'h][1 m'ec'h][1 n'ec'h]
edo >> 1
    o: [20 Edo][2 N'edo][7 P'edo][106 edo][13 p'edo]
    n: [20 Edo][2 N'edo][7 P'edo][106 edo][14 m'edo][13 p'edo]
emaint >> 1
    o: [1 Emaint][1 N'emaint][11 emaint][2 n'emaint][1 p'emaint][1 p’emaint]
    n: [1 Emaint][1 N'emaint][11 emaint][1 m’emaint][2 n'emaint][1 p'emaint][1 p’emaint]
emañ >> 3
    o: [164 Emañ][4 N'emañ][1 P'emañ][511 emañ][12 n'emañ][3 p'emañ]
    n: [164 Emañ][8 M'emañ][4 N'emañ][1 P'emañ][511 emañ][49 m'emañ][6 m’emañ][12 n'emañ][3 p'emañ]
emaoc'h >> 1
    o: [4 emaoc'h]
    n: [1 M'emaoc'h][4 emaoc'h]
emeur >> 1
    o: [2 Emeur][5 emeur]
    n: [2 Emeur][5 emeur][3 m'emeur]
est >> 1
    o: [15 Est][38 est][4 n'est][1 n’est]
    n: [15 Est][38 est][1 m'est][4 n'est][1 n’est]
hai >> 1
    o: [2 hai]
    n: [1 M'hai][2 hai]
hen >> 1
    o: [12 Hen][1 d'hen][31 hen][1 n'hen]
    n: [12 Hen][1 d'hen][31 hen][1 m'hen][1 n'hen]
ho >> 1
    o: [1 HO][6 Ho][1 d'ho][26 ho][1 n’ho]
    n: [1 HO][6 Ho][1 d'ho][26 ho][1 m’ho][1 n’ho]
hoc'h >> 2
    o: [2 Hoc'h][17 hoc'h]
    n: [2 Hoc'h][2 M'hoc'h][17 hoc'h][1 m'hoc'h]
hon >> 1
    o: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
    n: [6 Hon][1 M'hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
int >> 2
    o: [2 Int][2 N'int][1 N’int][158 int][35 n'int][4 n’int]
    n: [2 Int][2 N'int][1 N’int][158 int][1 m’int][35 n'int][4 n’int][2 z'int]
ont >> 1
    o: [1 n'ont][6 ont]
    n: [1 m'ont][1 n'ont][6 ont]

High-Impact Groups
There was only one stemming group that gained 5 or more members.

Newly gained group members are those that appear under n: but not under o:.

eo >> 5
    o: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo]
    n: [3 Eo][2 M'eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][51 m'eo][13 m’eo][183 n'eo][17 n’eo][7 p'eo][1 p’eo][4 z'eo][1 z’eo]

High-Frequency Words
There were only seven stemming groups with high-frequency tokens that gained tokens, one of which is the eo group above. The rest are presented below.

Newly gained group members are those that appear under n: but not under o:.

a >> 1
    o: [789 A][25073 a][5 n'a]
    n: [789 A][25073 a][4 m'a][5 n'a]
en >> 2
    o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en][17 p'en][3 p’en]
    n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en]
er >> 1
    o: [3 ER][455 Er][1 d’er][3284 er][1 n'er]
    n: [3 ER][455 Er][1 d’er][3284 er][1 m'er][1 n'er]
eus >> 1
    o: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]
    n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus][3 z'eus]
he >> 2
    o: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he][4 p'he]
    n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][8 m'he][1 m’he][10 n'he][1 n’he][4 p'he]
o >> 2
    o: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]
    n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o][3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]

Speaker Review: Common French Elision
The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for the most commonly seen French elision items (l', s', j', and qu') seems like it might be reasonable, given the prevalence of French-language content in Breton Wikipedia and the fact that some of the Breton elision items are also used in French (d', n', and m', though with different meanings).

Notes:


 * These changes are being compared against a baseline that already assumes Breton elision support (d', n', p', m', and z').

Random Sample
Below is a semi-random selection of 25 stemming groups (words that would all be indexed together) that gained members as a result of adding common French elision support. (These are from the Wikipedia sample.)


 * The first 20 examples were chosen randomly. The last 5 examples were chosen from among those with s', j', and qu' elision, because the vast majority of examples only have l' elision.
 * To my untrained eye, more of these words look French, which is not a surprise. It makes sense to me—though I'm very happy to be corrected!—that dealing with common French elision would improve searching for names and French/Breton cognates or borrowings that appear in French-language contexts on Breton Wikipedia.

Key:


 * aigua >> 1
 * aigua indicates that all of these words were stemmed to aigua. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with common French elision support
 * [2 l'aigua] — l'aigua occurs 2 times in our sample (of 5K articles)

Newly gained group members are those that appear under n: but not under o:.

aigua >> 1
    o: [1 Aigua]
    n: [1 Aigua][2 l'aigua]
âme >> 2
    o: [3 âme]
    n: [1 l'Âme][3 l'âme][3 âme]
association >> 1
    o: [22 Association][3 association]
    n: [22 Association][1 L'Association][3 association]
atelier >> 1
    o: [2 atelier]
    n: [1 L'Atelier][2 atelier]
aube >> 3
    o: [5 Aube]
    n: [5 Aube][1 L'Aube][1 L'aube][2 l'Aube]
enquête >> 2
    o: [3 Enquête][4 d'enquête][3 enquête]
    n: [3 Enquête][3 L'Enquête][4 d'enquête][3 enquête][1 l'enquête]
enseignement >> 2
    o: [2 enseignement]
    n: [1 L'enseignement][2 enseignement][4 l'enseignement]
escaut >> 1
    o: [8 Escaut]
    n: [8 Escaut][3 l'Escaut]
esperanto >> 1
    o: [1 Esperanto]
    n: [1 Esperanto][1 L'esperanto]
exposition >> 2
    o: [1 Exposition][2 exposition]
    n: [1 Exposition][1 L'Exposition][2 exposition][1 l'exposition]
hiver >> 2
    o: [1 Hiver]
    n: [1 Hiver][1 L'Hiver][1 l'hiver]
honneur >> 1
    o: [1 Honneur][1 d'honneur][1 d’honneur]
    n: [1 Honneur][1 d'honneur][1 d’honneur][1 l'honneur]
horizon >> 1
    o: [1 Horizon]
    n: [1 Horizon][1 l'Horizon]
index >> 1
    o: [5 Index][2 index]
    n: [5 Index][2 index][1 l'Index]
isle >> 2
    o: [4 Isle]
    n: [4 Isle][3 L'Isle][1 l'isle]
isola >> 2
    o: [1 Isola]
    n: [1 Isola][1 L'Isola][1 l'Isola]
oncle >> 1
    o: [2 oncle]
    n: [1 L'Oncle][2 oncle]
opera >> 1
    o: [1 Opera][12 opera]
    n: [1 L'opera][1 Opera][12 opera]
oz >> 2
    o: [3 Oz]
    n: [1 L'OZ][1 L'Oz][3 Oz]
universelle >> 1
    o: [1 universelle]
    n: [1 l'universelle][1 universelle]

aime >> 1
    o: [1 Aime]
    n: [1 Aime][2 j'aime]
ait >> 1
    o: [1 Ait][1 ait]
    n: [1 Ait][1 ait][1 qu'ait]
elle >> 2
    o: [3 Elle][7 elle]
    n: [3 Elle][7 elle][1 l'Elle][1 qu’elle]
est >> 2
    o: [15 Est][38 est][1 m'est][4 n'est][1 n’est]
    n: [15 Est][1 Qu'est][38 est][1 m'est][4 n'est][1 n’est][1 s’est]
obre >> 1
    o: [1 Obre]
    n: [1 Obre][1 s'obre]

High-Impact Groups
There were only five stemming groups that gained 5 or more members.


 * Again, these look more French to me, as expected.

Newly gained group members are those that appear under n: but not under o:.

amour >> 5
    o: [4 Amour][4 amour][1 d'amour][1 d’amour]
    n: [4 Amour][2 L'Amour][2 L'amour][4 amour][1 d'amour][1 d’amour][1 l'Amour][3 l'amour][1 l’amour]
art >> 5
    o: [1 ART][31 Art][9 art][4 d'Art][3 d'art][1 d’art]
    n: [1 ART][31 Art][1 L'Art][1 L'art][9 art][4 d'Art][3 d'art][1 d’art][3 l'Art][4 l'art][2 l’Art]
histoire >> 5
    o: [1 HISTOIRE][50 Histoire][10 d'Histoire][11 d'histoire][10 histoire]
    n: [1 HISTOIRE][50 Histoire][1 L'Histoire][1 L'histoire][10 d'Histoire][11 d'histoire][10 histoire][11 l'histoire][1 l’Histoire][8 l’histoire]
homme >> 6
    o: [2 Homme][1 d'Homme][3 d’homme][8 homme]
    n: [2 Homme][6 L'Homme][1 L'homme][1 L’Homme][1 d'Homme][3 d’homme][8 homme][1 l'Homme][7 l'homme][1 l’homme]
île >> 5
    o: [5 Île][1 île]
    n: [3 L'Île][1 L'île][1 L’Île][3 l'Île][4 l'île][5 Île][1 île]

High-Frequency Words
There were only four stemming groups with high-frequency tokens that gained tokens.


 * Again, these look more French to me, as expected.

Gained members are those that appear under n: but not under o:.

a >> 1
    o: [789 A][25073 a][4 m'a][5 n'a]
    n: [789 A][25073 a][1 l'a][4 m'a][5 n'a]
an >> 2
    o: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]
    n: [5 AN][1042 An][178 D'an][25 D’an][1 L'An][13605 an][1680 d'an][138 d’an][3 l'an]
en >> 2
    o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en]
    n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][1 j'en][46 m'en][8 m’en][52 n'en][6 n’en][17 p'en][3 p’en][1 qu'en]
un >> 1
    o: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]
    n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][1 qu’un][4214 un]

ICU Folding
As part of the default analysis chain, ICU normalization was already enabled. I enabled ICU folding with the "preserve" option enabled and a folding exception for Breton ñ. (The "preserve" option, parallel to the  filter, indexes both the unfolded and folded version of a token. So, Adélaïde would be indexed as both adélaïde and adelaide, which allows the term to be found regardless of accents (or with slightly different accents, such as adelaïde), but will boost exact matches between query and text when they occur.)

Elision processing doesn't change the number of tokens indexed, but "preserve" folding does, because two forms of a word can be indexed. For the Wikipedia sample, we had a 4.535% increase in the number of tokens, and for the Wiktionary sample, an increase of 5.641%.
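To illustrate the two paragraphs above, here is a rough Python sketch of "preserve"-style folding with an exception for ñ. It only strips combining diacritics via the standard unicodedata module, which is a small subset of what ICU folding does, and the function names are my own.

```python
import unicodedata

FOLDING_EXCEPTIONS = {"ñ", "Ñ"}


def fold(token):
    """Strip combining marks from every character except the exceptions.
    A rough stand-in for ICU folding, which does much more than this."""
    out = []
    for ch in unicodedata.normalize("NFC", token):
        if ch in FOLDING_EXCEPTIONS:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def fold_preserve(token):
    """Return the forms to index: the original, plus the folded form if it differs."""
    original = unicodedata.normalize("NFC", token).lower()
    folded = fold(original)
    return [original] if folded == original else [original, folded]


tokens = ["Adélaïde", "emañ", "eo"]
for token in tokens:
    print(token, "->", fold_preserve(token))
print("tokens indexed:", sum(len(fold_preserve(t)) for t in tokens))
```

Only tokens that actually change under folding are indexed twice, which is why the token count goes up by a few percent rather than doubling.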

I see the usual ICU folding patterns:


 * diacritic stripping (other than ñ) in Latin (Zurich/Zûrich/Zürich), Greek (ή/η), Cyrillic (Петро́вич/Петрович), Hebrew (כַּשְרוּת/כשרות), Arabic (جبل/جَبَل)
 * Normalization of variants in many scripts
 * Mapping of uncommon characters to related common characters (æ/ae, ɕ/c, ð/d, Ⅎ/f, ŋ/ɲ/ɳ/ɴ/n, ʐ/z, etc.)
 * Stripping of invisibles; in this case bi-directionality markers, which even appear in left-to-right tokens
 * Straightening of curly quotes (“/”/" and ‘/’/')

The largest number of mergers involve single-letter tokens (a, e, and o; see below), though most of the merged tokens are rare. Only à (presumably French) occurs more than 30 times in our 5K sample.

(Foreshadowing: since a, e, and o are all stopwords—see below—in these large groups the "old" tokens will all be removed as stopwords, and only the more uncommon characters will remain in these groups.)

a >> 19
    o: [789 A][25073 a][1 l'a][4 m'a][5 n'a]
    n: [789 A][25073 a][1 a͈][1 l'a][4 m'a][5 n'a][1 qu'à][1 qu’à][14 À][1 Â][2 Ä][2 Å][209 à][3 á][1 ä][1 å][1 Ā][2 ā][1 Ă][1 Ǟ][1 ǟ][1 ɑ][1 ɒ][1 ə]
e >> 11
    o: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][2 n'e]
    n: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][1 eː][2 n'e][1 É][1 Ê][1 Ë][9 è][30 é][1 ë][1 Ē][1 ē][2 ɛ][1 ˈɛ]
o >> 21
    o: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o][3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]
    n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o][3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º][10 Ó][1 Õ][2 Ö][1 ó][1 ô][1 õ][2 ö][2 ø][1 Ō][1 ō][1 Ơ][1 ơ][1 Ȫ][1 ȫ][1 Ȭ][1 ȭ][1 Ȯ][1 ȯ][1 Ȱ][1 ȱ][1 ɔ]

The straightening of curly apostrophes is a moderate source of extra tokens for Breton: c'h is a letter in Breton, so the curly variant c’h is fairly common, and even one c‘h appeared. These extra tokens aren't super helpful, because it is unlikely that anyone is going to expect c'hwec'h ("six"), c’hwec'h, c'hwec’h, and c’hwec’h to be different things, unlike Adélaïde (vs Adelaide), which could match a more specific spelling of a person's name.

Taking care of only c'h variants before ICU folding happens is tedious. There are eight obvious variants: c‘h, c’h, C‘h, C’h, C‘H, C’H, and the less likely but possible c‘H and c’H. That doesn't cover the implausible but possible Ｃ‘ｈｗｅｃ’Ｈ or moderately absurd ᶜ’ʰʷᵉᶜ‘ʰ. Might as well catch ’em all! So I added a character map to straighten single curly apostrophes (both ‘left‘ and ’right’).

Apostrophe Straightening
As noted above, there are a fair number of variants of c'h with curly apostrophes, and straightening the apostrophes everywhere is an easy way to fix them, so I added a character filter to do just that.
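A character filter like this amounts to a tiny character-for-character mapping. Here is a minimal Python sketch of the idea, assuming only the two single curly apostrophes (U+2018 and U+2019) are mapped; the names are mine, and the real filter runs inside the analysis chain rather than in Python.

```python
# Map left (U+2018) and right (U+2019) single curly quotes to a straight apostrophe.
APOSTROPHE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})


def straighten_apostrophes(text):
    """Replace single curly apostrophes with the straight apostrophe."""
    return text.translate(APOSTROPHE_MAP)


variants = ["c'hwec'h", "c\u2019hwec'h", "c'hwec\u2019h", "c\u2019hwec\u2019h"]
print({straighten_apostrophes(v) for v in variants})
```

Because the mapping applies to every apostrophe rather than just c'h, it also covers cases like aujourd’hui without any extra rules.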

The impact of the change was a bit less than I expected: only about 0.2% fewer tokens in the Wikipedia sample (4.1% of tokens added by ICU folding), and about 0.1% fewer tokens in the Wiktionary sample (2.2% of tokens added by ICU folding). However, this normalization will prevent weird edge cases where curly or non-curly apostrophes change ranking.

The vast majority of affected tokens are instances of c'h, though there are a few others, such as the French words aujourd’hui and jusqu’à, Italian dall’inizio and English (proper noun) Matasović’s.

Stopwords
I added the shorter and more focused list of Breton stopwords that VIGNERON provided.

It had a huge impact: 29.8% of tokens in the Wikipedia 5K sample and 23.9% of the tokens in the Wiktionary sample were filtered as stopwords. This isn't a bad thing! It means that searches in Breton will likely be ranked better, based on content words, rather than stopwords.

Notes:


 * These changes are being compared against a baseline that already assumes elision support, ICU folding, and apostrophe straightening.

Most of the tokens removed as stopwords are obvious matches to the words on the list, but there are some that interact with ICU normalization and folding, and elision. I want to do a quick review of those and make sure nothing looks wrong.


 * º—this gets normalized to o, which is on the stopword list; that's a common and generally acceptable occurrence, because the token will still be indexed in the plain field.

Speaker Review: Stopwords
The question for speakers of Breton reviewing this section is this: would it be bad if the words below were moderately discounted (like other stopwords) when searching?

The words below are a sample of words filtered as stopwords because, after elision handling, all that's left is in fact a stopword. The first four groups undergo Breton elision and the last group undergoes "common French" elision.

Keep in mind that stopwords are not completely ignored; they are stripped from the "text" field index, but are still present in the "plain" field index, so they aren't required for a match, but they can affect ranking.
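Here is a simplified Python sketch of the interaction described above: elision first, then stopword removal in the "text" field, while the "plain" field keeps everything. The stopword set is a tiny illustrative subset (not the real list), folding is left out, and both functions are my own stand-ins rather than the actual field analyzers.

```python
import re

# Illustrative subset only; the real Breton stopword list is longer.
STOPWORDS = {"a", "an", "ar", "e", "o"}

# Breton (d', n', p', m', z') and common French (l', s', j', qu') elision,
# with a straight or curly apostrophe.
_ELISION_RE = re.compile(r"^(?:[dnpmzlsj]|qu)['\u2019]")


def text_field_tokens(tokens):
    """Lowercase, strip elision, and drop whatever is left if it is a stopword."""
    kept = []
    for token in tokens:
        stem = _ELISION_RE.sub("", token.lower())
        if stem not in STOPWORDS:
            kept.append(stem)
    return kept


def plain_field_tokens(tokens):
    """Simplified 'plain' field: lowercase only, nothing removed."""
    return [token.lower() for token in tokens]


tokens = ["d'an", "Ötzi", "n'a", "ugent", "l'an"]
print("text: ", text_field_tokens(tokens))
print("plain:", plain_field_tokens(tokens))
```

In this sketch d'an, n'a, and l'an vanish from the text field but are all still there in the plain field, so they can still affect ranking.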

Only d'an and d'ar were "common" (more than 1,000 occurrences in our 5K sample).


 * d'al, d'an, d'ar, d'e, d'en, d'er, d'eus, d'he, d'hec'h, d'ho, d'hon, d'o, d'ul, d'un, d'ur
 * m'a, m'en, m'he, m'ho, m'hoc'h, m'hon, m'hor, m'o
 * n'a, n'an, n'e, n'eo, n'eus, n'he, n'ho, n'hoc'h, n'hon, n'hor, n'int, n'o, n'oa
 * p'en, p'eo


 * j'en, l'an, s'en

Overall Impact
Taking into account all the changes above, the Wikipedia sample had a net loss of 26.7% of its tokens, and the Wiktionary sample lost 19.7%. (These are mostly losses from stopwords, offset by gains from ICU folding with the "preserve" option.)
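As a sanity check on how these figures fit together, here is a small Python sketch that composes the per-step changes reported above. It assumes each percentage is relative to the token count left by the previous step, and it uses the rounded figures from the text, so it can only land near the reported totals, not exactly on them.

```python
def net_change(folding_gain, apostrophe_loss, stopword_loss):
    """Compose successive relative changes in token count (all given as fractions)."""
    return (1 + folding_gain) * (1 - apostrophe_loss) * (1 - stopword_loss) - 1


# Inputs are the rounded figures quoted in the sections above.
wikipedia = net_change(0.04535, 0.002, 0.298)
wiktionary = net_change(0.05641, 0.001, 0.239)

print(f"Wikipedia net change:  {wikipedia:.1%}")
print(f"Wiktionary net change: {wiktionary:.1%}")
```

Both results come out within a fraction of a percentage point of the 26.7% and 19.7% losses above, which is about as close as the rounded inputs allow.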

Wiktionary Notes
Other than differences in the number or percentage of examples of each sort above, the Wiktionary changes are very similar to the Wikipedia changes. The one standout difference is that there are a lot more International Phonetic Alphabet tokens in Wiktionary (where they provide pronunciation guidance). These are usually pronunciations of words being mapped onto the words themselves (such as Andrea/anˈdrɛːa/ãnˈdrea or Andreas/anˈdreːas/anˈdʀeːas), which should be fine.

Speaker Review Summary
TBD

Next Steps

 * Await speaker review and address any concerns that arise.
 * Put up a patch with the changes and get the code reviewed.
 * Deploy the changes and reindex Breton-language wikis.
 * Celebrate!