User:TJones (WMF)/Notes

This is an index for the reports I've written up on various search- & discovery-related topics.

Language Detection Evaluation—TextCat
Language Detection with TextCat (December 2015)—An evaluation of TextCat (an n-gram–based language identifier) on the enwiki zero-results queries. Includes updates to TextCat, re-training on query data, and limiting language identification to "useful" languages. Offers an improvement over the ES Plugin.

Language Detection Evaluation—Update: Thresholds by Language
Language Detection Evaluation—Update: Thresholds by Language (October 2015)—Evaluated adding a language-specific threshold (e.g., "it's never Romanian" on enwiki!) to the ElasticSearch language detection plugin. The results are overfitted because of the small data set available, but they indicate a significant improvement in language detection precision.
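As a rough illustration of the idea of a per-language threshold, here is a minimal sketch. It is not the plugin's API or the method used in the report: the function name `accept_detection`, the `(language, score)` input shape, and all threshold values are hypothetical, chosen only to show how a threshold above the maximum score makes a language "never" win.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical thresholds, for illustration only (not values from the report).
# A threshold above 1.0 can never be met, so "ro" is effectively never accepted.
THRESHOLDS: Dict[str, float] = {"ro": 1.01, "fr": 0.80, "es": 0.85}
DEFAULT_THRESHOLD = 0.50  # assumed fallback for languages without an entry


def accept_detection(
    candidates: List[Tuple[str, float]],
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> Optional[str]:
    """Return the top-scoring language if it clears its own threshold, else None."""
    if not candidates:
        return None
    # Pick the candidate with the highest score.
    lang, score = max(candidates, key=lambda c: c[1])
    # Accept it only if the score meets that language's threshold.
    return lang if score >= thresholds.get(lang, default) else None
```

In this sketch a rejected detection simply returns `None`, so the caller would fall back to the wiki's own language; whether to try the second-best candidate instead is a separate design choice not covered here.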

Relevance Lab!
Relevance Lab (October 2015)—High-level description and design of a Relevance Lab for Discovery, which would allow us (and others!) to experiment with proposed modifications to our search process and gauge their effectiveness and impact before deploying them.

Why People Use Search Engines
Why People Use Search Engines (September 2015)—An overview of how well English Wikipedia Search performs on a sample of ~4K queries that came from Google, with analysis of categories of unsuccessful queries and lots of ideas (not all necessarily practical) for Wikimedia search improvements.

Cross Language Wiki Searching
Cross Language Wiki Searching (September 2015)—An attempt to estimate the impact on enwiki's zero-results rate given "perfect" (or at least human-level) language identification.

Language Detection Evaluation
Language Detection Evaluation (September 2015)—A test of language detection against a representative sample of hand-coded zero-results queries from enwiki.
 * ElasticSearch language detection plugin—A language detection plugin available for ElasticSearch; also evaluated with initial and final spaces added (which gives better results, probably because of better recognition of letters at the edges of words).
 * Always "English" detector—Baseline against the current de facto default; also demonstrates that F-score is not necessarily the only relevant measure for search purposes.
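To make the point about the always-"English" baseline concrete, here is a small sketch of how per-language precision, recall, and F-score could be computed for such a detector. The labels below are invented for illustration and are not data from the report; `precision_recall_f1` is a hypothetical helper, not part of any evaluation tool mentioned here.

```python
from typing import List, Tuple


def precision_recall_f1(
    gold: List[str], predicted: List[str], target: str
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one target language."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Invented hand-coded labels for eight queries (illustrative only).
gold = ["en", "en", "en", "es", "fr", "en", "de", "en"]
# The baseline detector answers "en" for every query.
baseline = ["en"] * len(gold)

for lang in ("en", "es"):
    p, r, f = precision_recall_f1(gold, baseline, lang)
    print(f"{lang}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

On this toy sample the baseline has perfect recall for English while never identifying any other language, which is one way to see why a single F-score may not capture what matters for search.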

Phrase Slop Pre-Test
Phrase Slop Pre-Test (August 2015)—An in vitro test of the ElasticSearch phrase slop parameter against ptwiki and dewiki before the in vivo A/B test. The final report, prepared by Mikhail, is here.

Survey of Zero-Results Queries
Survey of Zero-Results Queries (July 2015)—A survey of the readily identifiable patterns in full-text zero-results queries. Lots of potential bots and bugs identified.
 * One Month Followup (August 2015)—Overview of day-by-day changes in full-text traffic for known bots and bugs one month later, and monthly changes in zero-results rate for top wikis by volume.
 * Full manual review of a 1K enwiki sample (August 2015)—Hand coding and categorization of a 1K sample of full-text zero-results queries.