User:TJones (WMF)/Notes/Speaker Review Notes

Language Analysis Modifications—Speaker Review Notes
For help with the technical jargon used in the Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.

Background
When we make modifications to the way text is processed for searching on-wiki in a specific language, it can be difficult to decide whether the modifications are generally good or not. These kinds of modifications include introducing a new stemmer, changing diacritic folding, or re-ordering the steps in the language processing.

In the past, one approach we tried was setting up a test instance of Wikipedia with the modifications enabled and letting people search on it. However, the results were not always useful. People sometimes had trouble coming up with representative queries, or they focused only on very specific queries that caused problems. People got distracted by issues that were often valid, but not related to the modifications that were being tested. For example, someone might point out that a particular query had poor results, which is quite reasonable. However, if those specific results were exactly the same before and after the modification being tested, then the problem isn’t relevant to the modification.

We also have RelForge, which allows us to re-run queries sampled from real user queries and compare results before and after the modification to language processing. However, concerns about privacy mean that we can’t easily share queries publicly. Also, it can be very difficult to figure out what people intend by their queries, which makes it hard to determine which results are relevant. And everything is much harder when you don’t know the language. (That said, RelForge is still very useful for lots of things!)

A/B tests give excellent results, but they are expensive and time-consuming in general. They are also very difficult for language analysis modifications because they require two separate indexes for the test, which is not something we have the resources for on larger wikis.

We believe we’ve finally found a way to focus the attention of fluent speakers of a language on the core differences made by modifications to language processing: have them look at what words are grouped together (or not) by that processing. For example, on English Wikipedia, searching for hope also finds hopes, hoped, hoping, and hope’s, along with hoper and hopers. A fluent speaker of English can readily see that those first four are great, and the last two are not great, but acceptable and understandable (i.e., by interpreting hoper as a person who hopes, which is rare, but plausible).

The goal of this page is to explain and document this review process, particularly for speakers doing the review. It may also be useful for anyone who might want to run such a review. (And it provides self-contained information about various parts of the process to transclude into other pages.)

Data
The usual process for creating a sample of documents (for testing language analysis modifications) is to retrieve 10,000 Wikipedia articles and 10,000 Wiktionary entries for the language in question. Sometimes it is fewer than 10,000 if there aren’t that many articles available in a particular project. Wikipedia articles usually provide a good example of typical formal written text in the language, and Wiktionary usually provides a larger number of distinct forms of words, and some additional variety of foreign scripts and languages. Foreign scripts and languages are not always processed well by language-specific text processing.

I sanitize the documents by removing markup (mostly HTML tags) and leading white space, and deduplicating individual lines. Deduplication reduces the number of instances of wiki-specific words, such as the local equivalent of "References", "See also", "Noun", "Etymology", etc.
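The sanitizing step described above can be sketched roughly as follows. This is a minimal illustration, not the actual tool: the `sanitize` function and the crude `TAG_RE` pattern are hypothetical, and dropping empty lines is an added assumption.

```python
import re

# Crude HTML-tag matcher; an assumption for illustration, not the real markup stripper.
TAG_RE = re.compile(r"<[^>]+>")

def sanitize(text):
    """Strip tags and leading white space, then keep only the first copy of each line."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = TAG_RE.sub("", line).lstrip()
        # Skipping empty lines is an assumption; the notes only mention deduplication.
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)

sample = "<h2>References</h2>\n  <p>Hope is a feeling.</p>\nReferences\n"
print(sanitize(sample))
```

In this sample the second "References" line is dropped, which is how deduplication keeps wiki-specific section titles from dominating the word counts.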

The Speaker’s Core Task
The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)

In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common.
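As a rough sketch of how groups and counts like these could be assembled, the snippet below groups words by the stem an analyzer assigns and counts each surface form. The names `build_groups`, `format_group`, and `toy_stem` are hypothetical, and the lookup-table stemmer is only a stand-in for a real analysis chain.

```python
from collections import Counter, defaultdict

def build_groups(tokens, stem):
    """Map each stem to a Counter of the surface words that share it."""
    groups = defaultdict(Counter)
    for token in tokens:
        groups[stem(token)][token] += 1
    return groups

def format_group(stem_form, counts):
    """Render one group as 'stem: [count word][count word]...'."""
    words = sorted(counts, key=lambda w: (w.lower(), w))
    return f"{stem_form}: " + "".join(f"[{counts[w]} {w}]" for w in words)

# Hypothetical toy stemmer: a lookup table instead of real language analysis.
TOY_STEMS = {"hopes": "hope", "hoped": "hope", "hoping": "hope"}

def toy_stem(word):
    lowered = word.lower()
    return TOY_STEMS.get(lowered, lowered)

tokens = "Hope hope hopes hoped hoping hope".split()
groups = build_groups(tokens, toy_stem)
print(format_group("hope", groups["hope"]))
# hope: [1 Hope][2 hope][1 hoped][1 hopes][1 hoping]
```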

Toy English Example
As an example, suppose English had no stemmer, and then we added one.

Before a stemmer was added, hope would only match the same word with case variation, like hope, Hope, HOPE, and HopE.

If we added a stemmer and it grouped hope with hopes, hoped, hoping, and hope’s, that would be a good stemmer!

If we added a stemmer and it grouped hope with Hopper, hopi, hopple, Hopkins, and hopscotch, that would be a terrible stemmer.

In reality, stemmers are rarely either all good or all bad. They usually have a mix of desirable and undesirable groupings. Looking at frequency count information can also help to determine how bad an undesirable grouping is.

Looking at words grouped with hope by the good stemmer above, we might see a group like this:

hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper][1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]

Most of the words grouped together are expected variants of hope. The others—Hoper, hoper, and Hopers—might not be very good matches for hope. However, these bad matches are relatively rare, with only 3 instances total, compared to over 2,500 instances of the more usual forms of hope.
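The arithmetic behind that comparison can be checked with a short sketch. The `parse_group` helper and its regular expression are illustrative assumptions about the group format shown above, not part of any existing tool.

```python
import re

# Matches entries such as "[1208 hope]"; assumes words never contain "]".
ENTRY_RE = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(line):
    """Split 'stem: [count word]...' into the stem and a word -> count mapping."""
    stem, _, rest = line.partition(":")
    counts = {word: int(n) for n, word in ENTRY_RE.findall(rest)}
    return stem.strip(), counts

line = ("hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper]"
        "[1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]")
stem, counts = parse_group(line)

doubtful = {"Hoper", "hoper", "Hopers"}
doubtful_total = sum(counts[w] for w in doubtful)
expected_total = sum(n for w, n in counts.items() if w not in doubtful)
print(doubtful_total, expected_total)  # 3 2531
```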

Random Group Samples
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.

We look at random group samples for all language processing modifications.
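Drawing a random sample of groups might look like the sketch below. The `sample_groups` function, the sample size, and the fixed seed are all assumptions made for illustration; the notes do not say how the sample is actually drawn.

```python
import random

def sample_groups(groups, k, seed=0):
    """Pick up to k stems at random; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    stems = sorted(groups)  # sort first so the result does not depend on dict order
    return rng.sample(stems, min(k, len(stems)))

groups = {"hope": {}, "go": {}, "cloth": {}, "duck": {}}
picked = sample_groups(groups, 2)
print(picked)
```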

All-New Groups
When we make big modifications to the language processing done for search—like introducing a new stemmer or creating an entirely new analysis chain—it can be hard to meaningfully map word groupings from before and after the modification. Instead, we look at the groupings made by the new language analysis and assess them as they are.

All-new groups are presented as follows:

hope: [152 Hope][1208 hope][12 hope’s][346 hoped][1 Hoper][1 hoper][1 Hopers][23 Hopes][488 hopes][17 Hoping][285 hoping]

The word before the colon, in this case hope, is the stem, or common form, that all of the other words were transformed to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.

The rest of the words are the words that all share the stem, meaning that searching for any of them will find all of the others. (Note that searching for each word in a group will give the same results, but the results could be in a very different order. A big factor in re-ordering the results is that exact matches are given more weight.)

The numbers with the words (e.g., [1208 hope] and [1 hoper]) indicate how many times a given word appears in our text sample. In this case, hope is over a thousand times more common than hoper. Rare words that are not great matches with the rest of a group are less of a problem because they don’t occur very often. When you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like “[1208 hope][747 hop]” would be worse, because these words don’t belong together, and both words are common.

All-new groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.

Large Groups
Large groups are not necessarily bad groups. However, they are more likely to have problems because they are outliers.
 * One reason why a group might be large is that it contains many different forms of a very common word. English generally doesn't have very many inflected forms, but we can still give an example. It is much more likely to have all of the forms of go—go, goes, going, gone, and went—in a random sample of text than it is to have all the forms of transclude in the same sample. So the groups with common words—especially the most common verbs, like come, do, get, go, know, look, make, say, see, take, think, want, etc.—are more likely to be large groups.
 * A group might also be large because it is the result of a merger of two overlapping groups that ideally would be separate. Sometimes this is unavoidable. It is not a fatal problem, but it should be noted. There aren’t any great examples in English, but consider cloth and clothes: they should be separate, yet a stemmer could strip the -es from clothes (because it looks like a plural) and merge the two.
 * Sometimes the groups are very large because the stemmer did something that makes no sense. For example, grouping duck, leaf, and blanket in English would make no sense. Those are the cases we are really trying to find!

Large groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.
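One simple way to pull out large groups for review is sketched below. The notes do not state a size cutoff, so the `min_size` default is purely an assumption, as are the function name and the toy data.

```python
def large_groups(groups, min_size=10):
    """Return stems whose groups have at least min_size distinct words, largest first.

    min_size is a hypothetical threshold; the notes do not specify one.
    """
    big = [stem for stem, words in groups.items() if len(words) >= min_size]
    return sorted(big, key=lambda stem: (-len(groups[stem]), stem))

groups = {
    "go": {"go": 900, "goes": 120, "going": 300, "gone": 80, "went": 210},
    "transclude": {"transclude": 2},
}
print(large_groups(groups, min_size=5))  # ['go']
```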

Potential Problem Groups
Groups are “potential problems” (for lack of a better term) when the words in the group have no common beginning or ending letters. When we can, we also strip known affixes (like Slavic naj-, which is roughly English -est) and do appropriate folding (such as converting å, á, ä, â, à, and ã to a in English) before looking for common beginning or ending letters.
 * Sometimes, having no common beginning or ending letters is perfectly reasonable. For example, English good/better/best or be/am/is/are/were/was.
 * Sometimes groups are marked as "potential problems" because the language analysis does unexpected but reasonable character folding. For example, folding ɐ to a or stripping a leading modifier character, like converting ʿa to a. (These can often be manually removed before speaker review since they are usually easy for a non-speaker to identify.)
 * Sometimes the idea of a “potential problem” group doesn’t make a lot of sense because of the writing system of a particular language.

Potential problem groups are usually something we look at for modifications like adding a new stemmer—that is, big modifications that create a lot of new groups rather than modify existing groups.
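The check for potential problem groups could be sketched like this. It is an approximation under stated assumptions: `fold` only lowercases and strips combining diacritics (it would not handle folding such as ɐ to a), the `prefixes` argument is a hypothetical way to pass known affixes, and "no common beginning or ending letters" is read as "no shared first letter and no shared last letter".

```python
import os
import unicodedata

def fold(word):
    """Lowercase and strip combining marks (e.g. á -> a); a rough stand-in for real folding."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_potential_problem(words, prefixes=()):
    """True if, after folding and prefix stripping, the words share no start or end."""
    folded = []
    for word in words:
        word = fold(word)
        for prefix in prefixes:
            if word.startswith(prefix) and len(word) > len(prefix):
                word = word[len(prefix):]
                break
        folded.append(word)
    common_start = os.path.commonprefix(folded)
    common_end = os.path.commonprefix([w[::-1] for w in folded])
    return not common_start and not common_end

print(is_potential_problem(["good", "better", "best"]))               # True
print(is_potential_problem(["hope", "hopes", "hoped"]))               # False
print(is_potential_problem(["najveći", "veća"]))                      # True
print(is_potential_problem(["najveći", "veća"], prefixes=("naj",)))   # False
```

As the first example shows, a flagged group is only a candidate for review; good/better/best is flagged even though it is a perfectly reasonable group.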

Old-vs-New Groups
When we make less extreme modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes.

Old-vs-new groups are presented as follows:

hope >> 2
o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]

The first line shows the stem, a pair of arrow heads indicating whether words were gained or lost by the group, and a number indicating how many gains and/or losses there were.

The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.

In terms of gains and losses:
 * >> indicates that words were gained by the group
 * << indicates that words were lost from the group
 * a separate marker indicates that there were both losses and gains

The o: section (for “old”) shows all the words that shared a stem before the change. The n: section (for “new”) shows all the words that shared a stem after the change. Sharing a stem means that searching for any of the words will find all of the others. (Note that while searching for each word in a group would give the same results, the results could be in a very different order, in particular because exact matches are given more weight.)

The numbers with the word—e.g., [1208 hope] and [1 Hopē]—indicate how many times a given word appears in the text sample. In this case, hope is over a thousand times more common than Hopē. Rare words that are not great matches with the rest of a group are less of a problem because they just don’t occur very often, and if you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like “[1208 hope][747 hop]” would be more worrying, because they don’t go together, but both words are fairly common.

Old-vs-new groups are usually something we look at for smaller modifications that generally modify existing groups.
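The gains and losses for a single group can be computed with set differences, as in the sketch below. The `compare_groups` function and its return shape are hypothetical; the word counts are taken from the example above.

```python
def compare_groups(stem, old, new):
    """Report which words a group gained or lost between the old and new analysis."""
    gained = sorted(set(new) - set(old))
    lost = sorted(set(old) - set(new))
    return {"stem": stem, "gained": gained, "lost": lost,
            "changes": len(gained) + len(lost)}

old = {"Hope": 152, "Hopes": 23, "hope": 1208, "hoped": 346, "hopes": 488}
new = {**old, "Hopē": 1, "ĥợṕễ": 2}

result = compare_groups("hope", old, new)
print(result["gained"], result["lost"], result["changes"])
# ['Hopē', 'ĥợṕễ'] [] 2
```

The total of 2 changes corresponds to the number shown after the arrowheads in the example.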

High-Impact Groups
High-impact groups are those with 10 or more changes to the number of words in the group (gains, losses, or a mix). These groups are more likely to have something undesirable going on just because they are outliers.

Sometimes high-impact groups are not that interesting because a large group and small group have merged and end up having the stem of the smaller group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).

The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.

High-impact groups are usually something we look at for smaller modifications that generally modify existing groups.
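A minimal sketch of flagging high-impact groups, using the threshold of 10 changes given above. The `high_impact` function and the toy data are illustrative assumptions; the example reproduces the case of a 2-word group that appears to gain 10 members.

```python
def high_impact(old_groups, new_groups, min_changes=10):
    """Return (stem, changes) pairs for groups with at least min_changes gains plus losses."""
    flagged = []
    for stem in set(old_groups) | set(new_groups):
        old = set(old_groups.get(stem, ()))
        new = set(new_groups.get(stem, ()))
        changes = len(old ^ new)  # words present in exactly one of the two versions
        if changes >= min_changes:
            flagged.append((stem, changes))
    return sorted(flagged)

small = {"w1", "w2"}
large = {f"v{i}" for i in range(10)}
old_groups = {"a": small, "b": large}
new_groups = {"a": small | large}  # the merged group kept the smaller group's stem

print(high_impact(old_groups, new_groups))  # [('a', 10), ('b', 10)]
```

In this toy data the stem "b" is also flagged because its group disappears entirely; whether a real report would count that as 10 losses is an assumption of this sketch.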

High-Frequency Word Groups
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where the high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.

High-frequency word groups are usually something we look at for smaller modifications that generally modify existing groups.
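Finding high-frequency words that moved into or out of a group could be sketched as follows, using the 1,000-occurrence threshold given above. The function name and the toy counts (apart from [1208 hope] and [747 hop], which appear earlier) are assumptions for illustration.

```python
def high_frequency_changes(old, new, min_count=1000):
    """Return words added to or removed from a group that occur min_count times or more."""
    changed = set(old) ^ set(new)
    counts = {**old, **new}  # word -> count, whichever version the word appears in
    return sorted(w for w in changed if counts[w] >= min_count)

old = {"hop": 747, "hops": 60}
new = {"hop": 747, "hops": 60, "hope": 1208, "Hopē": 1}

print(high_frequency_changes(old, new))  # ['hope']
```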