User:TJones (WMF)/Notes/Speaker Review Notes

Language Analysis Modifications—Speaker Review Notes
For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
When we make modifications to the way text is processed for on-wiki searching in a specific language—for example, by introducing a new stemmer—it can be difficult to decide whether the modifications are overall beneficial or not.

In the past, one approach we’ve tried was to set up a test instance of Wikipedia with the modifications enabled and let people try searching on it, but the results were not always useful. People sometimes had trouble coming up with representative queries, or they focused exclusively on very specific queries they were concerned about. People also got distracted by issues that were often valid, but not related to the modifications that were being tested. For example, someone might point out that a particular query had poor results, which is quite reasonable. However, if those specific results were exactly the same before and after the modification being tested, then the problem isn’t really relevant to the modification.

We also have RelForge, which allows us to re-run queries sampled from real user queries and compare results before and after the modification. However, concerns about privacy mean that we can’t easily share queries publicly. Also, it can be very difficult to figure out what people intend by their queries, it can be hard to determine if particular results are relevant, and all of it is much harder when you don’t know the language. (That said, RelForge is still very useful for lots of things!)

A/B tests give excellent results, but they are expensive and time-consuming in general. They are also very difficult for language analysis modifications because they require two separate indexes for a test, which for larger wikis is not something we have the resources for.

We believe we’ve finally found a way to focus the attention of fluent speakers of a language on the core differences made by modifications to language processing—have them look at what words are grouped together (or not) by that processing. For example, on English Wikipedia, searching for hope also finds hopes, hoped, hoping, and hope’s, along with hoper and hopers. A fluent speaker of English can readily see that those first four are great, and the last two are not great, but acceptable and understandable (i.e., interpreting hoper as a person who hopes, which is rare, but plausible).

The goal of this page is to explain and document this review process, for speakers doing the review, and for anyone who might want to run such a review. (And to provide encapsulated information about various parts of the process to transclude into other pages.)

Data
The usual process for creating corpora for testing language analysis modifications is to pull 10,000 Wikipedia articles and 10,000 Wiktionary entries for the language in question—though it can be fewer if there aren’t that many in a particular project. Wikipedia articles usually provide a good example of normal, formal written text in the language, and Wiktionary usually provides more distinct forms of words as well as some additional variety of foreign scripts and languages, which are not always processed well by language-specific text processing.

I sanitize the corpora by removing markup and leading white space, and by deduplicating individual lines (to reduce the number of instances of the wiki-specific words, such as the equivalent of "References", "See also", "Noun", etc.).
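
As a rough illustration of the line-level cleanup, here is a minimal sketch. It assumes markup has already been removed by an earlier step, and the function name sanitize_lines is hypothetical, not the actual tooling. Skipping blank lines is also an assumption.

```python
def sanitize_lines(lines):
    """Strip leading white space and drop duplicate lines, keeping the first copy."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.lstrip()      # remove leading white space only
        if not text.strip():      # assumption: blank lines are not worth keeping
            continue
        if text in seen:          # deduplicate individual lines
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = ["  References", "Hope is a town.", "References", "\tSee also", "See also"]
print(sanitize_lines(sample))
```

Repeated wiki-specific lines such as "References" survive only once, so they no longer inflate the word counts.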

The Speaker’s Core Task
The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others (though the results will be ranked differently because exact matches are preferred).

In addition to the specific words that are matched, we also note the frequency of the words in the text sample, so that the relative impact of potential errors can be determined. For example, if two words are improperly grouped together, but each occurs only once in a large sample, that’s not nearly as bad as if they each occur thousands of times.

Toy English Example
As an example, suppose English had no stemmer, and then we added one.

Before a stemmer was added, hope would only match the same word with case variation, like hope, Hope, HOPE, and HopE.

If we added a stemmer and it grouped hope with hopes, hoped, hoping, and hope’s, that would be a good stemmer!

If we added a stemmer and it grouped hope with Hopper, hopi, hopple, Hopkins, and hopscotch, that would be a terrible stemmer.

In reality, stemmers are rarely either all good or all bad; most produce a mix of desirable and undesirable groupings. Looking at frequency information can also help to determine how much negative impact undesirable groupings are likely to have. Looking at words grouped with hope by the good stemmer above, we might see a group like this:

hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper][1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]

Most of the words grouped together are expected variants of hope. Hoper, hoper, and Hopers might not be very good matches, but they are relatively rare, with only 3 instances total, compared to over 2,500 instances of the more usual forms of hope.
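
The arithmetic behind that comparison can be checked with a short sketch. The counts are copied from the example group; the helper name questionable_share and the choice of which words to flag are illustrative assumptions.

```python
# Counts copied from the example group for the stem "hope".
group = {
    "Hope": 152, "hope": 1208, "hope’s": 12, "hoped": 346,
    "Hoper": 1, "hoper": 1, "Hopers": 1, "Hopes": 23,
    "hopes": 488, "Hoping": 17, "hoping": 285,
}
# Words a reviewer might flag as questionable (a judgment call, not a rule).
flagged = {"Hoper", "hoper", "Hopers"}


def questionable_share(counts, flagged_words):
    """Return (flagged instances, total instances, flagged fraction)."""
    total = sum(counts.values())
    bad = sum(n for word, n in counts.items() if word in flagged_words)
    return bad, total, bad / total


bad, total, share = questionable_share(group, flagged)
print(f"{bad} of {total} instances are questionable ({share:.2%})")
```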

Random Group Samples
Looking at a random sample of groups is the best way to get a sense of what typical effects are going to be like overall. If the majority are good and any less desirable effects are understandable and acceptable, then overall the modification is likely to be for the better.

We look at random group samples for all language processing modifications.
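
Drawing such a sample could look something like the sketch below. The data layout (a dict mapping each stem to its word counts), the function name sample_groups, and the fixed seed are all assumptions made for illustration.

```python
import random


def sample_groups(groups, k, seed=0):
    """Pick up to k groups at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    stems = sorted(groups)                    # stable order before sampling
    chosen = rng.sample(stems, min(k, len(stems)))
    return {stem: groups[stem] for stem in chosen}


# Toy data: 50 one-word groups with made-up stems and counts.
toy_groups = {f"stem{i}": {f"word{i}": i + 1} for i in range(50)}
for stem, words in sample_groups(toy_groups, 5).items():
    print(stem, words)
```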

All-New Groups
When we make big modifications to the language processing done for search—like introducing a new stemmer or a whole new analysis chain—it is hard to meaningfully map word groupings from before to after the modification. Instead, we look at the groupings made by the new language analysis and assess them as they are.

All-new groups are presented as follows:

hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper][1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]

The word before the colon, in this case hope, is the stem that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer was trying to do.

The rest of the words are the words that all share the stem, meaning that searching for one will find all of the others. (Note that while searching for each word in a group would give the same results, the results could be in a very different order—in particular because exact matches are given more weight.)

The numbers with the word—e.g., [1208 hope] and [1 hoper]—indicate how many times a given word appears in the corpus. In this case, hope is over a thousand times more common than hoper. Rare words that are not great matches with the rest of a group are less of a problem because they just don’t occur very often, and if you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like “[1208 hope][747 hop]” would be more worrying, because they don’t go together, but both words are fairly common.

All-new groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.
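
For anyone who wants to process these lines mechanically, here is a minimal parsing sketch. It assumes the stem itself contains no colon and that words contain no closing bracket; parse_group is a hypothetical helper, not part of any existing tool.

```python
import re

# One bracketed entry: a count, a space, then the word.
ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")


def parse_group(line):
    """Split 'stem: [count word]...' into the stem and a word-to-count dict."""
    stem, _, rest = line.partition(":")
    counts = {word: int(n) for n, word in ENTRY.findall(rest)}
    return stem.strip(), counts


line = ("hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper]"
        "[1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]")
stem, counts = parse_group(line)
# List the words from most to least frequent.
for word, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(n, word)
```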

Large Groups
Large groups are not necessarily bad groups, but they are somewhat more likely to have problems because they are outliers.
 * One reason why a group might be large is that it contains many forms of a very common word; because the word is common, rarer forms of the word are more likely to show up in a given sample. For example, if stop words are not removed, then you would expect to see more forms of to be or to have in a given sample than forms of to transclude.
 * Another reason why a group might be large is that it is the result of a merger of two groups that ideally would be separate, but which overlap. Sometimes this is unavoidable, so it is not a fatal problem, but it should be noted. There aren’t any great examples in English, but cloth and clothes come close: they should be separate, yet a stemmer could strip the -es from clothes and merge the two.
 * Sometimes the groups are very large because the stemmer did something that makes no sense; those are the cases we are really trying to find!

Large groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.
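
One simple way to surface large groups is to rank groups by the number of distinct words they contain, as in this sketch. The text does not define a size cutoff, so the top_n parameter is an assumption, as are the function name and the toy data.

```python
def largest_groups(groups, top_n=20):
    """Return the top_n (stem, words) pairs, largest groups first."""
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:top_n]


# Toy data with made-up counts: stem -> {word: count}.
toy = {
    "be": {"be": 900, "am": 40, "is": 2000, "are": 1500, "was": 1200, "were": 600},
    "hope": {"hope": 1208, "hopes": 488, "hoped": 346},
    "transclud": {"transclude": 2},
}
for stem, words in largest_groups(toy, top_n=2):
    print(stem, len(words))
```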

Potential Problem Groups
Groups are “potential problems,” for lack of a better term, when the words in the group have no common beginning or ending letters, even after stripping known affixes (like Slavic naj-, which is roughly English -est), and doing appropriate folding (such as converting å, á, ä, â, à, and ã to a in English).
 * Sometimes this is perfectly reasonable, as with English good/better/best or be/am/is/are/were/was.
 * Sometimes this is because the language analysis does unexpected but reasonable folding, like folding ɐ to a, or stripping a leading modifier character, like converting ʿa to a. (These can often be manually removed before speaker review since they are usually easy enough for a non-speaker to identify.)
 * Sometimes the idea of “potential problem” group doesn’t make a lot of sense because of the writing system of a particular language.

Potential problem groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.
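
The check described in this section can be sketched as follows. This is an approximation under stated assumptions: folding is done by case folding plus removal of Unicode combining marks, the affix lists are supplied by the caller, and "common beginning or ending letters" is read as all words sharing a first letter or all sharing a last letter. The function names are hypothetical.

```python
import unicodedata


def fold(word):
    """Case-fold and drop combining marks, so that å, á, ä, etc. become a."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(word, prefixes=(), suffixes=()):
    """Fold a word, then strip at most one known prefix and one known suffix."""
    word = fold(word)
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) > len(prefix):
            word = word[len(prefix):]
            break
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[:-len(suffix)]
            break
    return word


def is_potential_problem(words, prefixes=(), suffixes=()):
    """True if the normalized words share neither a first nor a last letter."""
    forms = {normalize(w, prefixes, suffixes) for w in words}
    forms.discard("")                 # ignore anything that folded to nothing
    if len(forms) < 2:
        return False
    same_start = len({form[0] for form in forms}) == 1
    same_end = len({form[-1] for form in forms}) == 1
    return not (same_start or same_end)


print(is_potential_problem(["good", "better", "best"]))
print(is_potential_problem(["hope", "Hopes", "hoped"]))
```

A group like good/better/best is flagged even though it is perfectly reasonable, which is why these are only potential problems and still need a speaker's judgment.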

Old-vs-New Groups
When we make less radical modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at before-and-after groups to assess the impact of the grouping changes.

Old-vs-new groups are presented as follows:

hope >> 2
o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]

The first line shows the stem, a pair of arrow heads indicating whether words were gained or lost by the group, and a number indicating how many gains and/or losses there were.

The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer was trying to do.

In terms of gains and losses:
 * >> indicates that words were gained by the group
 * << indicates that words were lost from the group
 * >< indicates that there were both losses and gains

The o: section (for “old”) shows all the words that shared a stem before the change. The n: section (for “new”) shows all the words that shared a stem after the change. Sharing a stem means that searching for one will find all of the others. (Note that while searching for each word in a group would give the same results, the results could be in a very different order—in particular because exact matches are given more weight.)

The numbers with the word—e.g., [1208 hope] and [1 Hopē]—indicate how many times a given word appears in the corpus. In this case, hope is over a thousand times more common than Hopē. Rare words that are not great matches with the rest of a group are less of a problem because they just don’t occur very often, and if you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like “[1208 hope][747 hop]” would be more worrying, because they don’t go together, but both words are fairly common.

Old-vs-new groups are usually something we look at for smaller modifications that generally modify existing groups.
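
The gain/loss header can be derived from the old and new word sets, roughly as in this sketch. The function name compare_groups is hypothetical, and reporting the number as gains plus losses in the mixed case is an assumption.

```python
def compare_groups(stem, old_words, new_words):
    """Return (header, gained, lost) for one stem's old and new groups."""
    old, new = set(old_words), set(new_words)
    gained = sorted(new - old)
    lost = sorted(old - new)
    if gained and lost:
        marker = "><"        # both losses and gains
    elif gained:
        marker = ">>"        # words were gained by the group
    elif lost:
        marker = "<<"        # words were lost from the group
    else:
        return None, gained, lost    # unchanged groups get no header
    return f"{stem} {marker} {len(gained) + len(lost)}", gained, lost


old = ["Hope", "Hopes", "hope", "hoped", "hopes"]
new = ["Hope", "Hopē", "Hopes", "hope", "hoped", "hopes", "ĥợṕễ"]
header, gained, lost = compare_groups("hope", old, new)
print(header)
print("gained:", gained)
```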

High-Impact Groups
High-impact groups are those with 10 or more changes to the number of words in the group (gains, losses, or a mix). These groups are more likely to have something undesirable going on just because they are outliers.

Sometimes high-impact groups are not that interesting because a large group and small group have merged and end up having the stem of the smaller group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).

The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.

High-impact groups are usually something we look at for smaller modifications that generally modify existing groups.
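
A sketch of the high-impact check is below. It assumes groups are keyed by stem and that each new group is compared with the old group that had the same stem, which reproduces the merger effect described above. The threshold of 10 comes from the text; the function name and data layout are assumptions.

```python
def high_impact_stems(old_groups, new_groups, threshold=10):
    """List (stem, change count) for new groups with threshold or more changes."""
    flagged = []
    for stem in sorted(new_groups):
        old = set(old_groups.get(stem, ()))
        new = set(new_groups[stem])
        changes = len(old ^ new)          # gains plus losses
        if changes >= threshold:
            flagged.append((stem, changes))
    return flagged


# A group of 10 words and a group of 2 words merge (made-up words).
big = {f"w{i}" for i in range(10)}
small = {"a", "b"}
before = {"big": big, "small": small}

# Merged group keeps the smaller group's stem: looks like an outlier.
print(high_impact_stems(before, {"small": big | small}))
# Merged group keeps the larger group's stem: only 2 changes, not flagged.
print(high_impact_stems(before, {"big": big | small}))
```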

High-Frequency Word Groups
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where the high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.

High-frequency word groups are usually something we look at for smaller modifications that generally modify existing groups.
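
Finding such cases for a single group might look like this sketch. The threshold of 1,000 comes from the text; the function name, the dict layout (word to count), and the second toy example are assumptions for illustration.

```python
def high_frequency_changes(old_counts, new_counts, threshold=1000):
    """Return high-frequency words that were added to or removed from a group."""
    changed = set(old_counts) ^ set(new_counts)     # present on one side only
    counts = {**old_counts, **new_counts}
    return sorted(word for word in changed if counts[word] >= threshold)


# A rare word joins the group: nothing to flag.
print(high_frequency_changes({"hope": 1208, "hoped": 346},
                             {"hope": 1208, "hoped": 346, "Hopē": 1}))
# A common word joins the group of another fairly common word: flag it.
print(high_frequency_changes({"hop": 747}, {"hop": 747, "hope": 1208}))
```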