User:TJones (WMF)/Notes/Breton Analyzer Analysis

From mediawiki.org

July 2021 — See TJones_(WMF)/Notes for other projects. See also T258094. For help with the technical jargon used in Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.

Background

At Celtic Knot 2020 I asked people to contact me if they were interested in improving search for a particular language, and I got a positive response and a pointer to a stopword list from VIGNERON.

The basic plan (copied from the phab ticket) was:

  • Create a Breton-specific language analysis configuration
  • Finalize a list of stopwords (the linked-to list seems to be too aggressive, and is more a list of common words) and add them to the Breton config.
  • Enable elision support for d', n', and p'. Look further into including m' and z'.
    • Look at the impact of adding some support for the more common French elision (l', s', j', and qu') since there is a fair amount of French text on Breton Wikipedia. (Definitely do not include c', since c'h is a letter in Breton.)
  • Enable ICU folding. Very likely need an exception for ñ. Less likely for â, ê, î, ô, û, ù, ü (all used in Breton); watch for problems with ç (commonly used in French).
    • Make sure apostrophes are normalized (e.g., c’hoar & c'hoar should get the same results).
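As a rough illustration of how the plan above could map onto analysis settings, here is a sketch written as a Python dict in the shape of Elasticsearch index settings. The names breton_text, breton_elision, breton_folding, and straighten_apostrophes are invented for this example. The stopword filter is left out because the list was not yet final, and the "preserve" behaviour is not shown. This is an assumption-laden sketch, not the configuration that was actually written or deployed.

```python
import json

# Hypothetical analysis settings; all custom names below are made up for illustration.
BRETON_ANALYSIS = {
    "char_filter": {
        # Map curly single quotes to a straight apostrophe before tokenizing.
        "straighten_apostrophes": {
            "type": "mapping",
            "mappings": ["\u2018=>'", "\u2019=>'"],
        }
    },
    "filter": {
        # Only d', n', p' here; c' is deliberately absent because c'h is a letter.
        "breton_elision": {
            "type": "elision",
            "articles": ["d", "n", "p"],
            "articles_case": True,  # match D', N', P' as well
        },
        # Fold diacritics, but exempt ñ from folding.
        "breton_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^ñÑ]",
        },
    },
    "analyzer": {
        "breton_text": {
            "type": "custom",
            "char_filter": ["straighten_apostrophes"],
            "tokenizer": "standard",
            "filter": ["breton_elision", "lowercase", "breton_folding"],
        }
    },
}

print(json.dumps(BRETON_ANALYSIS, ensure_ascii=False, indent=2))
```

Putting the elision filter before lowercasing is one possible ordering; it relies on case-insensitive article matching.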

However, making this a 10% project for me, plus some concerns about the initial stopword list, delayed things quite a bit (as documented in my lightning talk at Arctic Knot 2021).

Hopefully we are back on track!

Data

The usual process for creating a sample of documents (for testing language analysis modifications) is to retrieve 10,000 Wikipedia articles and 10,000 Wiktionary entries for the language in question. Sometimes we get fewer than 10,000 if there aren’t that many articles available in a particular project. Wikipedia articles usually provide a good example of typical formal written text in the language, and Wiktionary usually provides a larger number of distinct forms of words, and some additional variety of foreign scripts and languages. Foreign scripts and languages are not always processed well by language-specific text processing.

I sanitize the documents by removing markup (mostly HTML tags) and leading white space, and deduplicating individual lines. Deduplication reduces the number of instances of wiki-specific words, such as the local equivalent of "References", "See also", "Noun", "Etymology", etc.
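The sanitizing step can be sketched roughly as follows. This is a minimal illustration, assuming the input is a list of text lines and that markup means simple HTML-style tags; the function name sanitize and the regular expression are assumptions, not the actual tooling used.

```python
import re

# Naive tag matcher: anything between < and > is treated as markup.
TAG_RE = re.compile(r"<[^>]+>")

def sanitize(lines):
    """Strip tags and leading whitespace, then drop empty and repeated lines."""
    seen = set()
    cleaned = []
    for line in lines:
        text = TAG_RE.sub("", line).rstrip("\n").lstrip()
        if not text or text in seen:
            continue  # skip blank lines and lines already emitted
        seen.add(text)
        cleaned.append(text)
    return cleaned

sample = ["  <b>References</b>", "References", "<p>Some article text</p>", ""]
print(sanitize(sample))
```

Deduplicating whole lines is what removes repeated section labels, since those tend to sit on lines of their own.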

I ended up only pulling 5,000 documents each from Breton Wikipedia and Wiktionary.

Speaker Review: Overview

The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)

In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common.

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

When we make less extreme modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes.

Old-vs-new groups are presented as follows:

hope >> 2
  o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
  n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 Ąợṕễ]

The first line shows the stem (hope), a pair of arrow heads (>>) indicating whether words were gained or lost by the group, and a number indicating how many gains and/or losses there were (2).

The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.

In terms of gains and losses:

  • >> indicates that words were gained by the group
  • << indicates that words were lost from the group
  • >< indicates that there were both losses and gains

The o: section (for “old”) shows all the words that shared a stem before the change. The n: section (for “new”) shows all the words that shared a stem after the change. Sharing a stem means that searching for any of the words will find all of the others. (Note that while searching for each word in a group will give the same results, the results could be in a very different order—in particular because exact matches are given more weight.)

The numbers with the word—e.g., [1208 hope] and [1 Hopē]—indicate how many times a given word appears in the text sample. In this case, hope is over a thousand times more common than Hopē. Rare words that are not great matches with the rest of a group are less of a problem because they don’t occur very often. When you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like [1208 hope][747 hop] would be worse, because these words don’t belong together, and both words are common.
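The old-vs-new layout described above can be produced with a few lines of code. The sketch below assumes each group is available as a mapping from word to count; the function name format_change is hypothetical, and words are sorted by plain code point order, which may differ from the ordering used in the listings here.

```python
def format_change(stem, old, new):
    """Render one stemming group in the o:/n: layout, or None if unchanged.

    old and new map each surface form to its count in the sample.
    """
    gained = set(new) - set(old)
    lost = set(old) - set(new)
    if not gained and not lost:
        return None
    if gained and lost:
        arrow = "><"   # both gains and losses
    elif gained:
        arrow = ">>"   # gains only
    else:
        arrow = "<<"   # losses only

    def show(group):
        return "".join(f"[{count} {word}]" for word, count in sorted(group.items()))

    return "\n".join([
        f"{stem} {arrow} {len(gained) + len(lost)}",
        f"  o: {show(old)}",
        f"  n: {show(new)}",
    ])

old = {"Hope": 152, "Hopes": 23, "hope": 1208}
new = {"Hope": 152, "Hopes": 23, "hope": 1208, "Hop\u0113": 1}
print(format_change("hope", old, new))
```

The number after the arrows counts distinct words gained or lost, not occurrences, so a single rare variant and a single very common word each count as one change.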

Speaker Review: d', n', and p' Elision

The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for d', n', and p' seemed uncontroversial.

Random Sample

Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
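Drawing such a sample is straightforward once the changed groups are known. A minimal sketch, assuming the groups are held in a dict from stem to an (old words, new words) pair; the function name, the default sample size, and the fixed seed are all illustrative assumptions.

```python
import random

def sample_changed_groups(changes, k=25, seed=0):
    """Return up to k stems, chosen at random, whose groups differ between old and new.

    changes maps stem -> (old_words, new_words).
    """
    changed = sorted(
        stem for stem, (old, new) in changes.items() if set(old) != set(new)
    )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return sorted(rng.sample(changed, min(k, len(changed))))

changes = {
    "alphonse": (["Alphonse"], ["Alphonse", "d'Alphonse"]),
    "amiens": (["Amiens"], ["Amiens", "d'Amiens"]),
    "ti": (["ti"], ["ti"]),  # unchanged, never sampled
}
print(sample_changed_groups(changes, k=5))
```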

Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of adding d', n', and p' elision support. (These are from the Wikipedia sample.)

Notes:

  • It appears likely that P'tite being stemmed to tite is an error; P'tite seems to be a contraction of French Petite. It's uncommon, and no language processing is perfect, plus exact matching will generally prefer matches to P'tite over matches to tite if the search term is p'tite.
  • Elision handling automatically takes care of curly apostrophes, like d’.
  • Examples of p' elision seem to be less common; the only one in the random sample is the probably erroneous P'tite, though there are genuine examples in the high-impact groups.
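The behaviour described in these notes can be imitated with a small prefix-stripping function. This is only a sketch of the idea, assuming elision means removing a leading d', n', or p' (straight or right-curly apostrophe, any case); the name strip_elision and the regular expression are assumptions, and the real filter works inside the analysis chain rather than on bare strings.

```python
import re

# Leading d', n', or p' with a straight or right-curly apostrophe, case-insensitive.
ELISION_RE = re.compile(r"^(?:d|n|p)['\u2019]", re.IGNORECASE)

def strip_elision(token):
    """Remove one elided prefix from the start of the token, if present."""
    return ELISION_RE.sub("", token, count=1)

for token in ["d'Alphonse", "d\u2019Arbrissel", "N\u2019heller", "P'tite", "c'hoar"]:
    print(token, "->", strip_elision(token))
```

Note that c'hoar is left alone because c is not in the prefix list, while P'tite is reduced to tite, which is the likely error mentioned above.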

Key:

  • alphonse >> 1
    • alphonse indicates that all of these words were stemmed to alphonse. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with d', n', and p' elision support
  • [11 Alphonse] — Alphonse occurs 11 times in our sample (of 5K articles)

Newly gained group members are bolded.

alphonse >> 1
  o: [11 Alphonse]
  n: [11 Alphonse][1 d'Alphonse]
amiens >> 1
  o: [11 Amiens]
  n: [11 Amiens][1 d'Amiens]
amore >> 4
  o: [1 Amore][2 amore]
  n: [1 Amore][1 D'Amore][1 D'amore][2 amore][2 d'amore][1 d’amore]
après >> 1
  o: [1 Après][5 après]
  n: [1 Après][5 après][3 d'après]
arbres >> 1
  o: [1 arbres]
  n: [1 arbres][1 d'arbres]
arbrissel >> 1
  o: [4 Arbrissel]
  n: [4 Arbrissel][1 d’Arbrissel]
ardeiñ >> 1
  o: [1 Ardeiñ][4 ardeiñ]
  n: [1 Ardeiñ][4 ardeiñ][1 d'ardeiñ]
aura >> 1
  o: [2 aura]
  n: [2 aura][2 n'aura]
avranches >> 1
  o: [5 Avranches]
  n: [5 Avranches][1 d'Avranches]
eben >> 1
  o: [1 Eben][29 eben]
  n: [1 Eben][6 d'eben][29 eben]
éducation >> 1
  o: [1 Éducation][1 éducation]
  n: [2 d'éducation][1 Éducation][1 éducation]
emlazhañ >> 1
  o: [2 emlazhañ]
  n: [1 d'emlazhañ][2 emlazhañ]
enquête >> 1
  o: [3 Enquête][3 enquête]
  n: [3 Enquête][4 d'enquête][3 enquête]
entraigues >> 1
  o: [4 Entraigues]
  n: [4 Entraigues][2 d'Entraigues]
extermination >> 1
  o: [1 extermination]
  n: [1 d'extermination][1 extermination]
heller >> 3
  o: [1 Heller]
  n: [1 Heller][1 N’heller][4 n'heller][1 n’heller]
hon >> 3
  o: [6 Hon][49 hon]
  n: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
occasion >> 1
  o: [1 occasion]
  n: [1 d'occasion][1 occasion]
offenbach >> 1
  o: [2 Offenbach]
  n: [2 Offenbach][1 d'Offenbach]
ont >> 1
  o: [6 ont]
  n: [1 n'ont][6 ont]
ötzi >> 1
  o: [10 Ötzi][1 ötzi]
  n: [2 d'Ötzi][10 Ötzi][1 ötzi]
ouzomp >> 1
  o: [5 ouzomp]
  n: [1 N'ouzomp][5 ouzomp]
st >> 1
  o: [1 ST][111 St]
  n: [1 ST][111 St][1 d'St]
tite >> 1
  o: [1 Tite]
  n: [1 P'tite][1 Tite]
ugent >> 1
  o: [4 Ugent][103 ugent]
  n: [4 Ugent][1 n'ugent][103 ugent]

High-Impact Groups

High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.

Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).

The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.
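The threshold test can be sketched as below, assuming each group is given as its old and new sets of distinct words. The function names and the default threshold argument are illustrative; only the "10 or more changes" rule comes from the text.

```python
def change_size(old_words, new_words):
    """Count distinct words gained plus distinct words lost."""
    old_words, new_words = set(old_words), set(new_words)
    return len(new_words - old_words) + len(old_words - new_words)

def is_high_impact(old_words, new_words, threshold=10):
    """True when the group changed by at least `threshold` distinct words."""
    return change_size(old_words, new_words) >= threshold

# A group of 2 that merges with a group of 10 looks like an outlier from the
# small group's side (10 gains) but not from the large group's side (2 gains).
small = {"w1", "w2"}
large = {f"v{i}" for i in range(10)}
merged = small | large
print(is_high_impact(small, merged), is_high_impact(large, merged))
```

This is why an apparent high-impact group deserves a second look: which side of a merge "owns" the resulting stem decides whether it crosses the threshold.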

There was only one stemming group that gained 10 or more members, so I've included the 5 stemming groups that gained 5 or more members.

Newly gained group members are bolded.

ar >> 5
  o: [3 AR][1644 Ar][21923 ar]
  n: [3 AR][1644 Ar][180 D'ar][32 D’ar][21923 ar][1 d'Ar][2143 d'ar][164 d’ar]
en >> 8
  o: [672 En][8989 en]
  n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en]
     [17 p'en][3 p’en]
eo >> 7
  o: [3 Eo][5228 eo]
  n: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo]
     [1 p’eo]
he >> 7
  o: [2 HE][116 He][1581 he]
  n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he]
     [4 p'he]
o >> 11
  o: [183 O][2957 o][1 º]
  n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o]
     [2 n’o][2957 o][3 p'o][1 p’o][1 º]

High-Frequency Words

High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.
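A minimal sketch of this check, assuming each group is a mapping from word to count before and after the change. The function name is hypothetical; the 1,000-occurrence cutoff is the one stated above.

```python
def high_frequency_words_in_changed_group(old, new, min_count=1000):
    """Return the high-frequency words in a group whose membership changed.

    old and new map word -> count; an unchanged group yields an empty list.
    """
    if set(old) == set(new):
        return []
    words = set(old) | set(new)
    return sorted(
        w for w in words if max(old.get(w, 0), new.get(w, 0)) >= min_count
    )

old = {"A": 789, "a": 25073}
new = {"A": 789, "a": 25073, "n'a": 5}
print(high_frequency_words_in_changed_group(old, new))
```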

There were 10 stemming groups with high-frequency tokens that gained tokens that are not covered above (all of the High-Impact Groups above also contain High-Frequency Words).

Newly gained group members are bolded.

a >> 1
  o: [789 A][25073 a]
  n: [789 A][25073 a][5 n'a]
al >> 3
  o: [282 Al][2164 al]
  n: [282 Al][1 D'al][2164 al][80 d'al][2 d’al]
an >> 4
  o: [5 AN][1042 An][13605 an]
  n: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]
e >> 4
  o: [2553 E][33962 e]
  n: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][2 n'e]
er >> 2
  o: [3 ER][455 Er][3284 er]
  n: [3 ER][455 Er][1 d’er][3284 er][1 n'er]
eus >> 5
  o: [90 Eus][7121 eus]
  n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]
ez >> 2
  o: [2 EZ][10 Ez][1187 ez]
  n: [2 EZ][10 Ez][2 N'ez][1187 ez][1 n'ez]
oa >> 2
  o: [2 OA][1 Oa][6371 oa]
  n: [1 N'oa][2 OA][1 Oa][2 n'oa][6371 oa]
un >> 3
  o: [3 UN][555 Un][4214 un]
  n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]
ur >> 3
  o: [1075 Ur][7261 ur]
  n: [2 D'ur][1075 Ur][128 d'ur][6 d’ur][7261 ur]

Speaker Review: m' and z' Elision

The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for m' and z' seemed less clear, so these have been separated out.

Notes:

  • These changes are being compared against a baseline that already assumes d', n', and p' elision support.

Random Sample

There were only 17 stemming groups (words that would all be indexed together) that gained members as a result of adding m' and z' elision support that were not covered by the other groups, below, so all 17 of them are here. (These are from the Wikipedia sample.)

Key:

  • am >> 1
    • am indicates that all of these words were stemmed to am. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with m' and z' elision support
  • [2 m'am] — m'am occurs 2 times in our sample (of 5K articles)

Newly gained group members are bolded.

am >> 1
  o: [3 Am][29 am][3 d'am]
  n: [3 Am][29 am][3 d'am][2 m'am]
as >> 1
  o: [12 AS][12 As][18 as]
  n: [12 AS][12 As][18 as][1 m'as]
bili >> 1
  o: [2 Bili][13 bili]
  n: [2 Bili][1 M'Bili][13 bili]
ec'h >> 1
  o: [137 ec'h][1 n'ec'h]
  n: [137 ec'h][1 m'ec'h][1 n'ec'h]
edo >> 1
  o: [20 Edo][2 N'edo][7 P'edo][106 edo][13 p'edo]
  n: [20 Edo][2 N'edo][7 P'edo][106 edo][14 m'edo][13 p'edo]
emaint >> 1
  o: [1 Emaint][1 N'emaint][11 emaint][2 n'emaint][1 p'emaint][1 p’emaint]
  n: [1 Emaint][1 N'emaint][11 emaint][1 m’emaint][2 n'emaint][1 p'emaint]
     [1 p’emaint]
emañ >> 3
  o: [164 Emañ][4 N'emañ][1 P'emañ][511 emañ][12 n'emañ][3 p'emañ]
  n: [164 Emañ][8 M'emañ][4 N'emañ][1 P'emañ][511 emañ][49 m'emañ]
     [6 m’emañ][12 n'emañ][3 p'emañ]
emaoc'h >> 1
  o: [4 emaoc'h]
  n: [1 M'emaoc'h][4 emaoc'h]
emeur >> 1
  o: [2 Emeur][5 emeur]
  n: [2 Emeur][5 emeur][3 m'emeur]
est >> 1
  o: [15 Est][38 est][4 n'est][1 n’est]
  n: [15 Est][38 est][1 m'est][4 n'est][1 n’est]
hai >> 1
  o: [2 hai]
  n: [1 M'hai][2 hai]
hen >> 1
  o: [12 Hen][1 d'hen][31 hen][1 n'hen]
  n: [12 Hen][1 d'hen][31 hen][1 m'hen][1 n'hen]
ho >> 1
  o: [1 HO][6 Ho][1 d'ho][26 ho][1 n’ho]
  n: [1 HO][6 Ho][1 d'ho][26 ho][1 m’ho][1 n’ho]
hoc'h >> 2
  o: [2 Hoc'h][17 hoc'h]
  n: [2 Hoc'h][2 M'hoc'h][17 hoc'h][1 m'hoc'h]
hon >> 1
  o: [6 Hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
  n: [6 Hon][1 M'hon][2 d'hon][2 d’hon][49 hon][2 n'hon]
int >> 2
  o: [2 Int][2 N'int][1 N’int][158 int][35 n'int][4 n’int]
  n: [2 Int][2 N'int][1 N’int][158 int][1 m’int][35 n'int][4 n’int][2 z'int]
ont >> 1
  o: [1 n'ont][6 ont]
  n: [1 m'ont][1 n'ont][6 ont]

High-Impact Groups

There was only one stemming group that gained 5 or more members.

Newly gained group members are bolded.

eo >> 5
  o: [3 Eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][183 n'eo][17 n’eo][7 p'eo]
     [1 p’eo]
  n: [3 Eo][2 M'eo][81 N'eo][10 N’eo][3 P'eo][5228 eo][51 m'eo][13 m’eo]
     [183 n'eo][17 n’eo][7 p'eo][1 p’eo][4 z'eo][1 z’eo]

High-Frequency Words

There were only seven stemming groups with high-frequency tokens that gained tokens, one of which is the eo group above. The rest are presented below.

Newly gained group members are bolded.

a >> 1
  o: [789 A][25073 a][5 n'a]
  n: [789 A][25073 a][4 m'a][5 n'a]
en >> 2
  o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][52 n'en][6 n’en]
     [17 p'en][3 p’en]
  n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en]
     [52 n'en][6 n’en][17 p'en][3 p’en]
er >> 1
  o: [3 ER][455 Er][1 d’er][3284 er][1 n'er]
  n: [3 ER][455 Er][1 d’er][3284 er][1 m'er][1 n'er]
eus >> 1
  o: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus]
  n: [90 Eus][57 N'eus][5 N’eus][9 d'eus][7121 eus][117 n'eus][6 n’eus][3 z'eus]
he >> 2
  o: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][10 n'he][1 n’he]
     [4 p'he]
  n: [3 D'he][2 HE][116 He][3 N'he][102 d'he][7 d’he][1581 he][8 m'he][1 m’he]
     [10 n'he][1 n’he][4 p'he]
o >> 2
  o: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][36 n'o]
     [2 n’o][2957 o][3 p'o][1 p’o][1 º]
  n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o]
     [3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]

Speaker Review: Common French Elision

The question for speakers of Breton reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Adding elision support for the most commonly seen French elision items (l', s', j', and qu') seems like it might be reasonable, given the prevalence of French-language content in Breton Wikipedia and the fact that some of the Breton elision items are also used in French (d', n', and m'—though with different meanings).

Notes:

  • These changes are being compared against a baseline that already assumes Breton elision support (d', n', p', m', and z').

Random Sample

Below is a semi-random selection of 25 stemming groups (words that would all be indexed together) that gained members as a result of adding common French elision support. (These are from the Wikipedia sample.)

  • The first 20 examples were chosen randomly. The last 5 examples were chosen from among those with s', j', and qu' elision—the vast majority of examples only have l' elision.
  • To my untrained eye, more of these words look French, which is not a surprise. It makes sense to me—though I'm very happy to be corrected!—that dealing with common French elision would improve searching for names and French/Breton cognates or borrowings that appear in French-language contexts on Breton Wikipedia.

Key:

  • aigua >> 1
    • aigua indicates that all of these words were stemmed to aigua. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • >> 1 indicates that from "old" to "new", this stemming group gained 1 member.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with common French elision support
  • [2 l'aigua] — l'aigua occurs 2 times in our sample (of 5K articles)

Newly gained group members are bolded.

aigua >> 1
  o: [1 Aigua]
  n: [1 Aigua][2 l'aigua]
âme >> 2
  o: [3 âme]
  n: [1 l'Âme][3 l'âme][3 âme]
association >> 1
  o: [22 Association][3 association]
  n: [22 Association][1 L'Association][3 association]
atelier >> 1
  o: [2 atelier]
  n: [1 L'Atelier][2 atelier]
aube >> 3
  o: [5 Aube]
  n: [5 Aube][1 L'Aube][1 L'aube][2 l'Aube]
enquête >> 2
  o: [3 Enquête][4 d'enquête][3 enquête]
  n: [3 Enquête][3 L'Enquête][4 d'enquête][3 enquête][1 l'enquête]
enseignement >> 2
  o: [2 enseignement]
  n: [1 L'enseignement][2 enseignement][4 l'enseignement]
escaut >> 1
  o: [8 Escaut]
  n: [8 Escaut][3 l'Escaut]
esperanto >> 1
  o: [1 Esperanto]
  n: [1 Esperanto][1 L'esperanto]
exposition >> 2
  o: [1 Exposition][2 exposition]
  n: [1 Exposition][1 L'Exposition][2 exposition][1 l'exposition]
hiver >> 2
  o: [1 Hiver]
  n: [1 Hiver][1 L'Hiver][1 l'hiver]
honneur >> 1
  o: [1 Honneur][1 d'honneur][1 d’honneur]
  n: [1 Honneur][1 d'honneur][1 d’honneur][1 l'honneur]
horizon >> 1
  o: [1 Horizon]
  n: [1 Horizon][1 l'Horizon]
index >> 1
  o: [5 Index][2 index]
  n: [5 Index][2 index][1 l'Index]
isle >> 2
  o: [4 Isle]
  n: [4 Isle][3 L'Isle][1 l'isle]
isola >> 2
  o: [1 Isola]
  n: [1 Isola][1 L'Isola][1 l'Isola]
oncle >> 1
  o: [2 oncle]
  n: [1 L'Oncle][2 oncle]
opera >> 1
  o: [1 Opera][12 opera]
  n: [1 L'opera][1 Opera][12 opera]
oz >> 2
  o: [3 Oz]
  n: [1 L'OZ][1 L'Oz][3 Oz]
universelle >> 1
  o: [1 universelle]
  n: [1 l'universelle][1 universelle]
aime >> 1
  o: [1 Aime]
  n: [1 Aime][2 j'aime]
ait >> 1
  o: [1 Ait][1 ait]
  n: [1 Ait][1 ait][1 qu'ait]
elle >> 2
  o: [3 Elle][7 elle]
  n: [3 Elle][7 elle][1 l'Elle][1 qu’elle]
est >> 2
  o: [15 Est][38 est][1 m'est][4 n'est][1 n’est]
  n: [15 Est][1 Qu'est][38 est][1 m'est][4 n'est][1 n’est][1 s’est]
obre >> 1
  o: [1 Obre]
  n: [1 Obre][1 s'obre]

High-Impact Groups

There were only five stemming groups that gained 5 or more members.

  • Again, these look more French to me, as expected.

Newly gained group members are bolded.

amour >> 5
  o: [4 Amour][4 amour][1 d'amour][1 d’amour]
  n: [4 Amour][2 L'Amour][2 L'amour][4 amour][1 d'amour][1 d’amour][1 l'Amour]
     [3 l'amour][1 l’amour]
art >> 5
  o: [1 ART][31 Art][9 art][4 d'Art][3 d'art][1 d’art]
  n: [1 ART][31 Art][1 L'Art][1 L'art][9 art][4 d'Art][3 d'art][1 d’art][3 l'Art]
     [4 l'art][2 l’Art]
histoire >> 5
  o: [1 HISTOIRE][50 Histoire][10 d'Histoire][11 d'histoire][10 histoire]
  n: [1 HISTOIRE][50 Histoire][1 L'Histoire][1 L'histoire][10 d'Histoire]
     [11 d'histoire][10 histoire][11 l'histoire][1 l’Histoire][8 l’histoire]
homme >> 6
  o: [2 Homme][1 d'Homme][3 d’homme][8 homme]
  n: [2 Homme][6 L'Homme][1 L'homme][1 L’Homme][1 d'Homme][3 d’homme][8 homme]
     [1 l'Homme][7 l'homme][1 l’homme]
île >> 5
  o: [5 Île][1 île]
  n: [3 L'Île][1 L'île][1 L’Île][3 l'Île][4 l'île][5 Île][1 île]

High-Frequency Words

There were only four stemming groups with high-frequency tokens that gained tokens.

  • Again, these look more French to me, as expected.

Gained members are bolded.

a >> 1
  o: [789 A][25073 a][4 m'a][5 n'a]
  n: [789 A][25073 a][1 l'a][4 m'a][5 n'a]
an >> 2
  o: [5 AN][1042 An][178 D'an][25 D’an][13605 an][1680 d'an][138 d’an]
  n: [5 AN][1042 An][178 D'an][25 D’an][1 L'An][13605 an][1680 d'an][138 d’an]
     [3 l'an]
en >> 2
  o: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][46 m'en][8 m’en]
     [52 n'en][6 n’en][17 p'en][3 p’en]
  n: [672 En][10 N'en][3 P'en][43 d'en][6 d’en][8989 en][1 j'en][46 m'en]
     [8 m’en][52 n'en][6 n’en][17 p'en][3 p’en][1 qu'en]
un >> 1
  o: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][4214 un]
  n: [1 D’un][3 UN][555 Un][113 d'un][15 d’un][1 qu’un][4214 un]

ICU Folding

As part of the default analysis chain, ICU normalization was already enabled. I turned on ICU folding with the "preserve" option and a folding exception for Breton ñ. (The "preserve" option, parallel to the asciifolding_preserve filter, indexes both the unfolded and folded version of a token. So, Adélaïde would be indexed as both adélaïde and adelaide, which allows the term to be found regardless of accents (or with slightly different accents, such as adelaïde), but will boost exact matches between query and text when they occur.)

Elision processing doesn't change the number of tokens indexed, but "preserve" folding does, because two forms of a word can be indexed. For the Wikipedia sample, we had a 4.535% increase in the number of tokens, and for the Wiktionary sample, an increase of 5.641%.
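To make the "preserve" idea and the resulting token growth concrete, here is a rough approximation in plain Python. It only strips combining marks via Unicode decomposition, which is far less than real ICU folding does, and the names fold, index_forms, and token_increase are invented for this sketch.

```python
import unicodedata

KEEP = {"\u00f1"}  # ñ is exempt from folding in this sketch

def fold(token):
    """Lowercase and strip combining marks, leaving ñ untouched."""
    out = []
    for ch in unicodedata.normalize("NFC", token.lower()):
        if ch in KEEP:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def index_forms(token):
    """Return the forms indexed for a token: the original, plus the folded form if different."""
    original = unicodedata.normalize("NFC", token.lower())
    folded = fold(original)
    return [original] if folded == original else [original, folded]

def token_increase(tokens):
    """Percentage growth in indexed tokens caused by keeping both forms."""
    indexed = sum(len(index_forms(t)) for t in tokens)
    return 100.0 * (indexed - len(tokens)) / len(tokens)

print(index_forms("Adélaïde"))
print(index_forms("emañ"))
```

Only tokens whose folded form differs from the original add a second entry, which is why the overall increase stays at a few percent rather than doubling the index.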

I see the usual ICU folding patterns:

  • Diacritic stripping (other than ñ) in Latin (Zurich/Zûrich/Zürich), Greek (ή/η), Cyrillic (Петро́вич/Петрович), Hebrew (כַּשְרוּת/כשרות), Arabic (جبل/جَبَل)
  • Normalization of variants in many scripts
  • Mapping of uncommon characters to related common characters (æ/ae, ɕ/c, ð/d, Ⅎ/f, ŋ/ɲ/ɳ/ɴ/n, ʐ/z, etc.)
  • Stripping of invisibles; in this case bi-directionality markers, which even appear in left-to-right tokens
  • Straightening of curly quotes (“/”/" and ‘/’/')

The largest number of mergers involves single-letter tokens (a, e, and o; see below), though most of the merged tokens are rare. Only à (presumably French) occurs more than 30 times in our 5K sample.

(Foreshadowing: since a, e, and o are all stopwords—see below—in these large groups the "old" tokens will all be removed as stopwords, and only the more uncommon characters will remain in these groups.)

a >> 19
  o: [789 A][25073 a][1 l'a][4 m'a][5 n'a]
  n: [789 A][25073 a][1 a͈][1 l'a][4 m'a][5 n'a][1 qu'à][1 qu’à][14 À][1 Â]
     [2 Ä][2 Å][209 à][3 á][1 ä][1 å][1 Ā][2 ā][1 Ă][1 Ǟ][1 ǟ][1 ɑ]
     [1 ɒ][1 ə]
e >> 11
  o: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][2 n'e]
  n: [8 D'e][2553 E][302 d'e][13 d’e][33962 e][1 eː][2 n'e][1 É][1 Ê][1 Ë]
     [9 è][30 é][1 ë][1 Ē][1 ē][2 ɛ][1 ˈɛ]
o >> 21
  o: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o]
     [3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º]
  n: [1 D'o][1 D’o][6 N'o][2 N’o][183 O][1 d'O][78 d'o][6 d’o][13 m'o]
     [3 m’o][36 n'o][2 n’o][2957 o][3 p'o][1 p’o][1 º][10 Ó][1 Õ][2 Ö]
     [1 ó][1 ô][1 õ][2 ö][2 ø][1 Ō][1 ō][1 Ơ][1 ơ][1 Ȫ][1 ȫ][1 Ȭ][1 ȭ]
     [1 Ȯ][1 ȯ][1 Ȱ][1 ȱ][1 ɔ]

The straightening of curly apostrophes is a moderate source of extra tokens for Breton: c'h is a letter in Breton, so the curly variant c’h is fairly common, and even one c‘h appeared. These extra tokens aren't very helpful, because it is unlikely that anyone expects c'hwec'h ("six"), c’hwec'h, c'hwec’h, and c’hwec’h to be different things. That is unlike Adélaïde (vs Adelaide), where the accented form could match a more specific spelling of a person's name.

Taking care of only c'h variants before ICU folding happens is tedious. There are eight obvious variants: c‘h, c’h, C‘h, C’h, C‘H, C’H, and the less likely but possible c‘H and c’H. That doesn't cover the implausible but possible Ｃ‘ｈｗｅｃ’Ｈ or the moderately absurd ᶜ’ʰʷᵉᶜ‘ʰ. Might as well catch ’em all! So I added a character map to straighten single curly apostrophes (both ‘left‘ and ’right’).

Apostrophe Straightening

As noted above, there are a fair number of variants of c'h with curly apostrophes, and straightening the apostrophes everywhere is an easy way to fix them, so I added a character filter to do just that.
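The effect of such a character filter can be shown with a simple translation table. This is a sketch of the idea only, assuming just the left and right single curly quotes are mapped; the names APOSTROPHE_MAP and straighten are invented here, and the real filter runs inside the analysis chain before tokenization.

```python
# Map left and right single curly quotes to the straight apostrophe.
APOSTROPHE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def straighten(text):
    """Replace curly single quotes with straight apostrophes everywhere in the text."""
    return text.translate(APOSTROPHE_MAP)

variants = ["c'hwec'h", "c\u2019hwec'h", "c'hwec\u2019h", "c\u2019hwec\u2019h", "c\u2018hwec'h"]
print({straighten(v) for v in variants})
```

Because the mapping applies to every apostrophe and not just to c'h, it also covers cases like aujourd’hui without listing each spelling variant separately. Fullwidth or superscript letters are untouched by this step.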

The impact of the change was a bit less than I expected: only about 0.2% fewer tokens in the Wikipedia sample (4.1% of tokens added by ICU folding), and about 0.1% fewer tokens in the Wiktionary sample (2.2% of tokens added by ICU folding). However, this normalization will prevent weird edge cases where curly or non-curly apostrophes change ranking.

The vast majority of affected tokens are instances of c'h, though there are a few others, such as the French words aujourd’hui and jusqu’à, Italian dall’inizio, and English (proper noun) Matasović’s.

Stopwords

I added the shorter and more focused list of Breton stopwords that VIGNERON provided.

It had a huge impact: 29.8% of tokens in the Wikipedia 5K sample and 23.9% of the tokens in the Wiktionary sample were filtered as stopwords. This isn't a bad thing! It means that searches in Breton will likely be ranked better, based on content words, rather than stopwords.

Notes:

  • These changes are being compared against a baseline that already assumes elision support, ICU folding, and apostrophe straightening.

Most of the tokens removed as stopwords are obvious matches to the words on the list, but there are some that interact with ICU normalization and folding, and elision. I want to do a quick review of those and make sure nothing looks wrong.

  • º: this gets normalized to o, which is on the stopword list; that's a common and generally acceptable occurrence, because the token will still be indexed in the plain field.

Speaker Review: Stopwords

The question for speakers of Breton reviewing this section is this: would it be bad if the words below were moderately discounted (like other stopwords) when searching?

The words below are a sample of words filtered as stopwords because, after elision handling, all that's left is in fact a stopword. The first four groups undergo Breton elision and the last group undergoes "common French" elision.

Keep in mind that stopwords are not completely ignored; they are stripped from the "text" field index, but are still present in the "plain" field index, so they aren't required for a match, but they can affect ranking.
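The interaction between elision, stopwords, and the two fields can be sketched as follows. This is a simplified model under several assumptions: the stopword set is a tiny subset inferred from the examples in these notes, the "plain" analysis is reduced to apostrophe straightening and lowercasing, and the function names are invented.

```python
import re

# Tiny illustrative subset; the real stopword list is longer.
STOPWORDS = {"an", "ar", "e", "o"}
ELISION_RE = re.compile(r"^(?:d|m|n|p|z)'", re.IGNORECASE)

def normalize(token):
    """Straighten the right curly apostrophe and lowercase."""
    return token.replace("\u2019", "'").lower()

def analyze_text(tokens):
    """'text' field sketch: strip elision, then drop stopwords."""
    stems = (ELISION_RE.sub("", normalize(t), count=1) for t in tokens)
    return [s for s in stems if s not in STOPWORDS]

def analyze_plain(tokens):
    """'plain' field sketch: keep every token, elided prefix included."""
    return [normalize(t) for t in tokens]

query = ["d'an", "ti", "D\u2019ar", "mor"]
print(analyze_text(query))
print(analyze_plain(query))
```

In this model d'an disappears from the text field because an is a stopword, yet it is still present in the plain field, where it can contribute to ranking.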

Only d'an and d'ar were "common" (more than 1,000 occurrences in our 5K sample).

  • d'al, d'an, d'ar, d'e, d'en, d'er, d'eus, d'he, d'hec'h, d'ho, d'hon, d'o, d'ul, d'un, d'ur
  • m'a, m'en, m'he, m'ho, m'hoc'h, m'hon, m'hor, m'o
  • n'a, n'an, n'e, n'eo, n'eus, n'he, n'ho, n'hoc'h, n'hon, n'hor, n'int, n'o, n'oa
  • p'en, p'eo
  • j'en, l'an, s'en

Overall Impact

Taking into account all the changes above, the Wikipedia sample had a net loss of 26.7% of its tokens, and the Wiktionary sample lost 19.7%. (These are mostly losses from stopwords, offset by gains from ICU folding with the "preserve" option.)

Wiktionary Notes

Other than differences in the number or percentage of examples of each sort above, the Wiktionary changes are very similar to the Wikipedia changes. The one standout difference is that there are a lot more International Phonetic Alphabet tokens in Wiktionary (where they provide pronunciation guidance). These are usually pronunciations of words being mapped onto the words themselves (such as Andrea/anˈdrɛːa/ãnˈdrea or Andreas/anˈdreːas/anˈdʀeːas), which should be fine.

Speaker Review Summary

TBD

Next Steps

  • Await speaker review and address any concerns that arise.
  • Put up a patch with the changes and get the code reviewed.
  • Deploy the changes and reindex Breton-language wikis.
  • Celebrate!