User:TJones (WMF)/Notes/Relevance Lab

Relevance Lab
The primary purpose of the Relevance Lab is to allow us to experiment with proposed modifications to our search process and gauge their effectiveness and impact before releasing them into production, and even before doing any kind of user acceptance or A/B testing. Testing in the Relevance Lab also gives an additional benefit over A/B tests (esp. in the case of very targeted changes): with A/B tests we aren't necessarily able to test the behavior of the same query under two different configurations.

Because the search process has so many moving parts, different use cases can have significantly different infrastructure needs, and the interplay among the various use cases adds complexity that also needs to be handled.

At the highest level and for the simplest case of comparing a single change against a baseline, we need to be able to:
 * specify a set of baseline (A) queries to run
 * optionally specify a set of modified (B) queries to run
 * specify a baseline (A) search configuration
 * optionally specify a modified (B) search configuration
 * automatically compare the results of A and B and generate summary statistics
 * automatically identify and manually inspect a subset of the differences between A and B

Queries
Sometimes the proposed change would entail modifying queries (e.g., dropping stop words, dropping quotes, removing question components, eliminating inappropriate wildcards, dropping the word "quot", automatically translating queries into the target language of the wiki before searching, etc.), with the rest of the search configuration the same.

Query Sets
In the simplest case, we could have two sets of queries, in corresponding order. For example:

Set A:
 * "first man on the moon"
 * what is the "house of representatives"
 * "laverne AND shirley"

Set B:
 * first man on the moon
 * what is the house of representatives
 * laverne AND shirley

We run A, we run B, and we compare the differences in the results.

Query Mungers
Alternatively, we could have one set of queries and a query munger (say, a runnable script following a standard API—possibly as simple as read from STDIN, write to STDOUT) that modifies the query set, which we then compare against. For example:

Set A:
 * "first man on the moon"
 * what is the "house of representatives"
 * "laverne AND shirley"

Query Munger:
 * perl -pe 's/"//g;'

This has the advantage of testing the actual proposed munging method and potentially accounting for the runtime of the query munging. The disadvantage is that setting a runnable script loose is a potential security concern.
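As a sketch of what such a munger could look like (the one-query-per-line STDIN/STDOUT contract is just the assumed minimal API, and the quote-stripping mirrors the perl one-liner above):

```python
import sys

def munge(query: str) -> str:
    """Strip straight double quotes, mirroring perl -pe 's/"//g;'."""
    return query.replace('"', '')

if __name__ == "__main__" and not sys.stdin.isatty():
    # Assumed munger contract: one query per line on STDIN,
    # munged query per line on STDOUT. (The isatty() guard just
    # avoids blocking when run without piped input.)
    for line in sys.stdin:
        sys.stdout.write(munge(line.rstrip("\n")) + "\n")
```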

Targeted Query Set
During development, it may make sense to test a modification against a targeted query set so that more use cases are represented in a smaller, more manageable set that can be more quickly run many times.

For example, if testing the effects of removing quotes or modifying the slop parameter, a corpus of queries with quotes is better than a generic query corpus because most queries don't actually have any quotes in them. Such a corpus could be dozens to thousands of queries.

Regression Test Set
Conversely, it makes sense to have a generic, representative query set that can be run as a regression test, to make sure that changes don't have significant unexpected impact on queries outside their intended focus. Such a corpus could be hundreds to hundreds of thousands of queries.

Gold Standard Corpus
A gold standard corpus (i.e., one with known desired results) makes a good regression test set, but is expensive to create. Two (complementary) approaches have been suggested:


 * SME-created corpus: a human expert (e.g., a member of the Discovery team, a community volunteer, etc.) compares a query and the results it generates, and judges which results, if any, are, say, generally good matches (should be in the top-N), bad matches (should not be in the top-N), "meh" matches (can take it or leave it), or the one true preferred match. Difficulties include the level of effort for SMEs, and some ambiguity, esp. for "difficult" queries. The process could be managed ad hoc, where only the current top-N results are reviewed by an SME (or, if we care most about top-N, SMEs could preemptively review the top 2N, on the assumption that radical movement is uncommon). Then, any new results brought up by a change in the search process would need review before they could be used to score the changes. Further options include using only one SME to review each item (to save effort), or having 2+ SMEs review each item and then more carefully reviewing disagreements (to improve accuracy); given the nature of our work (low impact from a small number of errors or inconsistencies), one review is probably sufficient in almost all cases.

Unfortunately, such a corpus would require annotation tools, and it would be a small corpus compared to the scale of our A/B testing; there's also a risk of the corpus getting stale unless you continually add to it (and optionally age out older queries). However, it does give a more consistent evaluation/scoring from test to test, and could allow for automatic parameter setting over a largish parameter space if the set is large enough to be divided into training and testing subsets.


 * User-created corpus: we track which results are clicked on by the user for a given query, and assume, say, that a click implies relevance, and average across multiple users. Difficulties include modelling user actions/intentions (do we track time on clicked-to page?, for example), and finding multiple users with the same queries, which limits query diversity. Should we highlight queries that get few clicks as ones without good answers?

We may be able to extract this information from the User Satisfaction logs without any additional front end work. Another advantage of this approach is that it allows non-binary answers (i.e., if for query A, 95% of users click the second result, that's the best result; if for query B, 35% click the first result and 40% click the second, there isn't really a clear best result, but there is possibly a preferred order, and scoring can take that into account).
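A minimal sketch of how click logs might be aggregated into per-query click distributions, as in the query A/query B example above (the log format and the threshold for a "clear best" result are assumptions for illustration):

```python
from collections import Counter, defaultdict

def click_distribution(click_log):
    """click_log: iterable of (query, clicked_position) pairs.
    Returns {query: {position: fraction of that query's clicks}}."""
    counts = defaultdict(Counter)
    for query, pos in click_log:
        counts[query][pos] += 1
    return {
        q: {pos: n / sum(c.values()) for pos, n in c.items()}
        for q, c in counts.items()
    }

def clear_best(dist, threshold=0.75):
    """Position attracting at least `threshold` of clicks, else None
    (i.e., no clear best result, only a possible preferred order)."""
    pos, frac = max(dist.items(), key=lambda kv: kv[1])
    return pos if frac >= threshold else None
```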

Search configurations
This is a very broad umbrella that covers lots of different kinds of changes, which may have significantly different infrastructure requirements to run, and may have different analysis requirements as well. Four different use cases are presented below.

Parameter/configuration changes
There are simple configuration changes that don't require any additional code to run (e.g., adjusting the slop parameter, or enabling an option). These could be run against the same index, potentially at the same time if the cluster hosting the index is up to it. (This also applies to modified queries, above.)

Modified code
Sometimes a change requires modifying code, rather than just changing configuration. There are a few ways to handle this, including:


 * Modify the code being run, e.g., by specifying a Gerrit patch, which has to be downloaded, merged, and run against the appropriate index. We need to insulate such changes from each other, either by giving each change its own sandbox to work in (possibly on the tester's own laptop), or by scheduling jobs so that they are run in sequence.


 * Merging the modified code into, say, a branch specific to the Relevance Lab, controlling it with a parameter/config setting, and keeping the default behavior the same, thereby reducing this to the previous case (a config change). This adds complexity in terms of maintaining the branch, and has the potential for some changes to leak into others. However, it may require fewer resources.

We don't want dealing with modified code to cause problems for simple query or config modifications, so we need sandboxing, job scheduling, or some other mechanism to keep different test scenarios separate.

Clearly there's lots to be learned here from Gerrit (good and bad), and I need help fleshing out the technical details (esp. limitations and hardware requirements). (?Erik)

Modified indexes
(?Erik) It's not clear to me how to handle changes that require modifying or adding indexes—like using a reverse index so we can catch typos in the first two letters of a word. Assuming we have the hardware to support it, can we specify which indexes to use via configuration, in such a way that the default behavior isn't changed? For reverse indexing, this might be an additional index (along with some code), but in the case of a replacement index (say, a partially case-sensitive index), we need to be able to specify a non-default index as well.

Scope of indexes
Because of hardware limitations, we may not be able to host all the wikis at once. We may also need to be able to swap out indexes (e.g., drop English and add Finnish to test a Finnish morphological analyzer).

We should be able to specify which index queries should be run against (perhaps with a default of enwiki), at the level of the comparison (i.e., for both A and B), for each query set (one for A, one for B—I've actually done this), or at the level of individual queries (possibly with a query-set specification acting as a default; e.g., based on language detection results). I think at the level of each query set with individual query overrides is a good option.

Results munging, etc.
There may be future use cases that aren't covered here, especially if we change the kinds of results we're giving. Such changes and changes to the Relevance Lab need to go hand in hand.

As an example, consider a method of results munging that includes highlights from infoboxes of top results, so that the actors of a film might be listed after the film itself, particularly if the query includes a film title and the word "actor". In this case, perhaps we'd want to give the actors the same data-serp-pos value (the rank within the search engine results), or a related one (e.g., 2.1, 2.2, 2.3, etc. for actors listed along with a film that ranked #2).
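A tiny sketch of how such sub-positions might be generated (the string format is just the 2.1/2.2/2.3 scheme suggested above; the function name is hypothetical):

```python
def highlight_positions(parent_rank, highlights):
    """Assign data-serp-pos-style sub-ranks to highlights (e.g., infobox
    actors) listed under the result at parent_rank (1-based)."""
    return [f"{parent_rank}.{i}" for i, _ in enumerate(highlights, start=1)]
```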

Generating diffs and stats
When we've run two different versions of our query/search config combination, we want to see what changed.

Effectiveness
If we have a gold standard corpus, we can automatically compute whether the net change is better or worse—more specifically, we can compute recall, precision, F1 (and/or maybe F2, since recall is probably more important), etc.

See also "Inspecting changes" below.

Impact
Even without gold standard results, we can measure many useful changes automatically:


 * the number of queries with zero results
 * the number of queries with changes in order in the top-N (5?, 10?, 20?) results
 * the number of queries with new results in the top-N results
 * the number of queries with changes in total results
 * a heatmap of the overall shift in ranks (e.g., how many #1s fell to #5, how many #37s rose to #2, etc.)
 * etc.
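A sketch of how these impact metrics might be computed from two runs (the dictionary format mapping each query to its ranked result list is an assumption about how run output would be stored):

```python
from collections import Counter

def impact_stats(runs_a, runs_b, n=5):
    """runs_a/runs_b: {query: [result IDs in rank order]} for the A and B runs.
    Returns (stats, heatmap): counts of the impact metrics above, plus a
    (old_rank, new_rank) -> count heatmap of rank shifts (1-based ranks).
    Iterates over the queries in the A run."""
    stats = Counter()
    heatmap = Counter()
    for q in runs_a:
        a, b = runs_a[q], runs_b.get(q, [])
        if not a and b:
            stats["zero_to_some"] += 1
        if a and not b:
            stats["some_to_zero"] += 1
        if a[:n] != b[:n]:
            stats["top_n_changed"] += 1   # order or membership changed in top-N
        if set(b[:n]) - set(a[:n]):
            stats["new_in_top_n"] += 1
        if len(a) != len(b):
            stats["total_changed"] += 1
        pos_b = {r: i + 1 for i, r in enumerate(b)}
        for i, r in enumerate(a):
            if r in pos_b:
                heatmap[(i + 1, pos_b[r])] += 1
    return stats, heatmap
```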

Obviously, changes that have almost no effect on a targeted set of queries probably aren't worth deploying without improvement, while changes that affect 94% of queries in a regression test set need to be very carefully vetted.

Performance
We can also get summary performance statistics for each run. For example:
 * A: 2871 queries ran in 3.5s, 0.00122s/query
 * B: 2871 queries ran in 350.0s, 0.12191s/query

Big jumps in performance would need to be taken with a grain of salt—competing jobs on the same cluster, non-production-like limitations of memory size or disk speed, or the phases of the moon could affect performance—but serve as a useful warning flag.

Search config diffs
Search config diffs should be noted (e.g., specifying which index a query file was run against, which Gerrit patches were used, and the names of the LocalSettings.php variants used) and, where possible, provided as actual diffs (e.g., diffs between LocalSettings.php variants).

Inspecting changes
For any category of change, in whatever direction, it should be possible to inspect examples of the change. For example, if we see that 137 queries went from zero results to some results, 10 went from some results to zero results, and 17 had results move in or out of the top 5, we should be able to click on examples to get a side-by-side list of results from the A case and the B case with diffs highlighted.

It isn't always possible (or necessary) to inspect all changes. If 20 out of 20 random examples are terrible, it's probable that most of the changes are not good, and you should rethink what you are trying to do.

One query set / one config
It's also possible that our "A" and "B" query sets and configs are the same, so that no diff is needed.

One reason for this could be that we want to see how a particular set of queries behaves in the default case and gather statistics on that set; for example, we could run a bunch of non-English queries against enwiki, or a collection of queries with "quot" in them, just to see how many get any results.

Another case would be generating a standard baseline to be used to compare other variations against (see below); here we don't care about the results of this run in themselves, but rather want them handily pre-computed so we can later compare some change against them.

The system should be smart enough to run one batch of queries and generate stats without requiring a second batch to compare against at the moment.

Running against an existing baseline
In addition to being able to specify a second set of queries/configs to run and compute diffs against, it's useful (and efficient) to be able to compare against an existing run.

For example, suppose we want to do a multiway comparison of the performance of various slop values: say, 0 (the default), 1, 2, and 3 (reasonable cases), and 100 (a limiting case to gauge the maximum possible impact). We can run A=Slop0 against B=Slop1, then reuse A=Slop0 against B=Slop2, B=Slop3, and B=Slop100, all without re-running Slop0 (meaning 5 sets of queries run, rather than 8).

Later, when deciding between Slop1 and Slop2, we should be able to just run the diff process on Slop1 results and Slop2 results, which would be very quick, and would highlight the cases where the difference in slop value actually mattered.

In another example, we are testing the effects of removing quotation marks, so A=Baseline and B=WithoutQuotes. The results look good, but we realize we forgot to remove “smart quotes”, so we can run another diff, but with A=WithoutQuotes and B=WithoutQuotes2, so the only difference we see is the effect of removing smart quotes in B, since all the regular quotes will already have been removed in A.

So, we should be able to kick off a diff for some A and B, and be smart enough not to run A or B if it already exists, and use stored intermediate results to prepare a diff summary.
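One way to sketch the "don't re-run what already exists" logic is to key stored runs on a content hash of the query file and search config, so identical runs are found and reused automatically (the file layout, naming, and JSON storage here are all hypothetical):

```python
import hashlib
import json
import os

def run_key(query_file, config_file):
    """Content hash identifying a query run, so a run with the same
    queries and config is recognized as already done."""
    h = hashlib.sha256()
    for path in (query_file, config_file):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()[:16]

def get_or_run(query_file, config_file, run_queries, cache_dir="queryruns"):
    """Return stored results for this (queries, config) pair,
    actually running the queries only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, run_key(query_file, config_file) + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    results = run_queries(query_file, config_file)
    with open(path, "w") as f:
        json.dump(results, f)
    return results
```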

Security
The more open the Relevance Lab is (and we want it to be open!) the more we have to consider issues that broadly fall under the heading of security.

Queries as personally identifiable information
Unsanitized user queries are considered to potentially be personally identifiable information (PII). It's clear from search logs that people accidentally search Wikipedia all the time, and may very well search for their own or other people's personal information (names, usernames, email addresses, physical addresses, and phone numbers all show up in the logs).

There are ways of automatically sanitizing queries, but none are 100%, and too much sanitization can skew a query sample (e.g., by removing all names, including those of celebrities that people search for all the time, which are not PII). Manually sanitizing queries can be more accurate, but is also more time consuming.

We need to come up with some way to safely handle queries, especially if the Relevance Lab is running on the labs cluster and is not considered safe for PII.

Vagrant
One option would be to disable certain functions in labs that display queries (such as inspecting results), and still leave basic functions available, like summary statistics. To enable full functionality against the full wiki indexes, we could disable standard logging and pipe queries over SSH from an appropriately secure server or laptop.

A properly configured Vagrant role could set all this up in such a way that the only shared portion of the Relevance Lab in the wmflabs cluster is the indexes (and even allow one to point to other indexes). This would not allow ready sharing of certain resources (like recent regression test baseline runs), but would deal with lots of other issues, like sandboxing query mungers and PII. It's not clear to me what level of code customization would be possible with such a configuration.

This seems like the most reasonable path to supporting community involvement in using the Relevance Lab.

Community involvement
There are lots of ways that people outside the Discovery team can use the Relevance Lab to help improve our search process relevance. The community as a whole has a significantly larger pool of resources than the Discovery team alone—including everything from specific language skills to just having lots more eyeballs on search results.


 * Being able to generate and run a targeted set of queries to demonstrate an issue. For example, to show that queries with quotes do worse than queries in general, anyone could run a set of queries with quotes and generate stats. Or better, if someone has an idea on how to improve the situation (e.g., remove the quotes): run the same queries without quotes and generate diff stats.


 * Helping with languages outside our area of expertise. I don't think we have any particularly fluent German speakers on the Discovery team (and if we do, there are plenty of languages we don't have, and I know enough to give examples of possible problems in German). We could have someone look at results from German queries and point out why they think certain queries aren't improving (e.g., because we don't handle certain compound nouns well).

Tasks
Everything here is a straw man proposal, and happily subject to revision, adding of details, removal of MVP features, and other random improvements and modifications.

Minimum Viable Product (MVP)
Start with a simple web-based interface that allows the user to specify a comparison run. (I'm voting against a command line interface because of the amount of text involved (esp. descriptions, which are very helpful). It doesn't need to be fancy at all. See "Crude mockups" below.)

A comparison run consists of two query runs, a baseline and a delta, the resulting stats and diffs on them, a label/name, optional description text, and the current date.

A query run consists of a specified query file, a search config, a label/name, optional description text, and the current date.

A search config consists of a specified LocalSettings.php, which indexes to run against, identifiers for any Gerrit patch to apply (future feature), and the baseline version of the code being used, a label/name, optional description text, and the current date.


 * UI: Specify the comparison run: give it a name, optional description, and specify two query runs.
 * UI: Specify a query run: give it a name, and optional description, and:
 * Queries: specify a query file (for query run B, any automatic munger will have been run manually)
 * Search config: specify variant LocalSettings.php (could be just unchanged current LocalSettings.php)
 * UI: Similarly specify second query run.
 * See "Crude mockups" below.


 * Indexes: default index is enwiki (just to start)


 * Run:
 * 1) backup LocalSettings.php
 * 2) replace with LocalSettingsA.php, connect to index server via SSH, run query file A, record results in local directory (including time to run)
 * 3) replace with LocalSettingsB.php, connect to index server via SSH, run query file B, record results in local directory (including time to run)
 * (?Erik) What's the best way to connect to the index server via SSH to run queries? Does it require additional configuration or set up?
 * 4) restore LocalSettings.php
 * 5) generate diffs and stats


 * Diffs and Stats: need a tool to run on two sets of results
 * Compute zero results rate for each set of results
 * Note diffs in impact stats (right now, just zero results rates) and get a list of examples of change in any direction.
 * Gather info needed to display diffs in results for any given query (details TBD)


 * Internals: locally have a designated relevancelab/ directory
 * relevancelab/queryruns/ has a subdirectory for each query run, e.g., relevancelab/queryruns/baseline151012/, with:
 * info.txt (or info.xml) file with description, and date, name of default index, Gerrit patch ID (future), and baseline code version
 * a copy of the LocalSettings.php file
 * query results stored to file system (details TBD—really large query sets (10K+) are a potential problem: one file with all query results could get very big and be slow to pull out a specific query from; one file per query could run into trouble with the max number of files allowed by the OS in a directory (or even on a partition))
 * relevancelab/comparisons/ has a subdirectory for each comparison run, e.g., relevancelab/comparisons/quotetest151012/, with:
 * info.txt (info.xml) file with description, date, and name/directory of query runs being compared
 * summary stats files (TBD: one summary stats file for everything, or snippets for each, or what?)
 * diff files: includes links to examples of changes for each stat, and the data needed to display diffs between the two query runs (could be pre-computed, or could be generated on the fly; performance isn't super important).
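For illustration, an info.txt might look something like this (all field names and values are hypothetical placeholders):

```ini
; relevancelab/queryruns/baseline151012/info.txt (hypothetical layout)
name = Quotes Baseline 151012
description = Baseline run for the quote-stripping test
date = 2015-10-12
index = enwiki
gerrit_patch =
; future feature, empty for now
code_version = <baseline code version>
```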


 * UI: view comparison summary (appropriate URL will fetch and display details from relevancelab/ directory):
 * comparison details (name/label, maybe description, names of query runs)
 * summary stats (time to run, time per query, query run metrics (i.e., zero results rate))
 * search config details (link to diff of LocalSettings.php variants)
 * diff examples (examples of queries with changes in query run metrics)
 * UI: diff viewer (given a query, look up results from both query runs, and display in useful diff format)
 * See "Crude mockups" below.


 * Corpora: create targeted query sets manually if needed, create relevant regression test sets manually as needed, maybe share on stats1002 or fluorine for now

Crude mockups
These mockups are pretty crude, and show a few features that might not be included in the MVP. A picture is worth a thousand words, even if some of them are slightly inaccurate!

In this scenario, we're testing the effect of removing quotes from queries. We've already run a comparison, but afterwards we noticed errors in the delta query file, so we need to re-run the comparison. Since the baseline query run is the same, we re-use it, rather than re-run it. We choose the existing baseline "Quotes Baseline 151012", and its info is populated in the form, and is not editable. We choose a "[new]" delta and give it a name "Quotes Test 151012b" (...b because we already ran one today and it didn't work out). The query file is different in the delta because here the only difference is the query strings, which have been pre-munged. LocalSettings.php isn't affected, so it's the default in both cases.

After everything has run, this is the summary page for the comparison run. It shows the stats for the baseline and delta (note that the delta took 10x as long to run!), and the metrics (at this point, just the zero results rate, which went down by 2.2%).

Diffs are available.
 * The Diff Viewer lets you browse all queries, including those with no diffs. Detailed UI TBD.
 * LocalSettings.php Diff shows the diffs between the LocalSettings.php files (in this case there are none).
 * Then there are specific diffs for each metric (still just zero results rate). Down arrows indicate ZRR decreased (i.e., we got some hits when there were none); up arrows indicate ZRR increased (i.e., we used to get hits, now we get none). What's the right icon? Arrows or +/- to indicate increase/decrease? Thumbs up/down to indicate better/worse (since sometimes an increase is better, sometimes worse)?

(These examples are unrealistic, since ZRR will almost always go down when quotes are removed. Hence "crude mockups".)

Now we're in the diff viewer. Either browsing, or having clicked on a specific example from the metrics section.

(Note that this example does not correspond to the previous summary mockup, since removing quotes has, more realistically, given more results. The results are fudged, too, to show more interesting diffs. There should probably be more info on top, too, showing what comparison run we're looking at. Diffs may not include snippets of results or bolding of search terms. Just trying to present a flavor of the diff view. Hence "crude mockups".)

More improvements!
In priority order, based on a mysterious mix of complexity, impact, desirability, and dependencies, subject to much re-ordering. Big things are tagged [Project], and may (probably) require more planning.


 * Tools: a Relevance Lab portal page should have a link to launch, a link to the list of available comparison runs, and a link to the list of available query runs
 * Tools: allow the selection of an existing query run for either baseline or delta
 * Run: 2) & 3) skip as applicable


 * Indexes: specify index for each query file
 * Run: 2) & 3) ... connect to appropriate index ...


 * Diffs and Stats: support more metrics: the number of queries with changes in order in the top-N (5?, 10?, 20?) results
 * Diffs and Stats: think about additional useful metrics, describe and prioritize them. Some random brainstorming:
 * No change: In some cases, it might be good to look specifically at queries that show no change, esp. in a targeted set, to figure out what we are failing to account for.


 * Tools: check for existing run names and suggest improved names (i.e., add a letter on the end)


 * Queries: specify query file A and query munger to generate query file B
 * Run: 0) run query munger on file A with output to file B


 * Indexes: query index server in labs for available indexes, provide list for file-level specification


 * Diffs and Stats: support more metrics: the number of queries with new results in the top-N results
 * Diffs and Stats: support more metrics: the number of queries with changes in total results


 * Tools: tools should be smart enough to check that a query run exists and do the right thing (esp. when A=B); if the diff tools are quick enough, running them when A=B to get stats (speed and ZRR, for example) probably doesn't matter; but we shouldn't re-run all the queries against the index in that case.


 * Search config: adapt as needed to allow results munging
 * Diffs and Stats: may need to modify diffs because the same result may be a miss in one case, and a hit in another
 * (This might be higher than it deserves because I really want to work on this!)


 * Indexes: support for query-by-query index specification
 * (?Erik) Should these be batched by index or can we easily change indexes on the fly?


 * Search config: support Gerrit patch specification [Project]
 * lots of difficulty with dealing with conflicts, sandboxing, etc. Details TBD.


 * Vagrant support: add the necessary bits to a vagrant role to allow anyone to use the index server [Project]


 * Diffs and Stats: support more metrics: a heatmap of the overall shift in ranks (e.g., how many #1s fell to #5, how many #37s rose to #2, etc.)


 * Tools: pre-populate forms when choosing an existing query run (or hide it until pre-populating is possible)
 * Tools: auto-add the date part of name


 * Search config: add support for using different indexes [Project]
 * lots of details TBD.


 * Corpora: Regression test set—either
 * Create one or more standard regression test sets (e.g., enwiki, multi-wiki, etc), and a canonical place to keep them (with appropriate PII protection), or,
 * Create a method for creating a regression test set on the fly (e.g., specify a list of wikis, the number of queries to sample from each, and a date range)


 * Corpora: Gold Standard Corpora [Project]
 * develop schema for recording annotations
 * query
 * set of results
 * result ID
 * SME quality: {good,bad,meh,onetrue}
 * User click quality: % (of clicks in some sample)
 * etc.
 * SME-based annotations
 * create annotation tools to run query and allow results to be tagged
 * allow additional (not returned) results to be added (e.g., "ggeorge clooney" returns nothing, but clearly "George Clooney" is the desired result).
 * User-based annotations
 * build tool to extract click info from logs
 * Gold standard scoring tools
 * Given a set of search results for a corpus and the annotated results, generate various scores:
 * recall, precision, F2, etc, for top result, top 3, top 5, etc.
 * special processing for SME scores (e.g., good = +1, bad = -1, meh = 0; onetrue?)
 * partial credit for user % scores, order dependent
 * multi-objective optimization/scoring might give more weight to getting a onetrue result into the top 5 for one query than to improving the order of good results in the top 5 for any number of queries, for example.
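A sketch of the "special processing for SME scores" mentioned above (the numeric weights, including the extra credit for onetrue, are placeholders to be tuned, not settled values):

```python
# Assumed mapping from SME annotation labels to numeric scores;
# the onetrue bonus is a placeholder pending multi-objective scoring.
SME_SCORES = {"good": 1.0, "bad": -1.0, "meh": 0.0, "onetrue": 2.0}

def score_results(results, annotations, n=5):
    """Average SME score of the top-n results for one query.
    Unannotated results default to "meh" (i.e., 0)."""
    top = results[:n]
    if not top:
        return 0.0
    return sum(SME_SCORES.get(annotations.get(r, "meh"), 0.0) for r in top) / len(top)
```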


 * Diffs and Stats: if we ever work on changing the article snippets we show, we might want to highlight snippet differences, too.