User:TJones (WMF)/Notes

This is an index for the reports I've written up on various search- & discovery-related topics.

Favoring Recall in Language Identification
Favoring Recall in Language Identification (May 2016) Analysis of recall-favoring options for language detection (rather than precision-favoring), using the same data from frwiki, eswiki, itwiki, and dewiki as below.

TextCat Optimization for frwiki, eswiki, itwiki, and dewiki
TextCat Optimization for frwiki, eswiki, itwiki, and dewiki (April 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.

Balanced Language Identification Evaluation Set for Queries
Balanced Language Identification Evaluation Set for Queries (February 2016) Creation of a 21-language balanced query corpus, and the evaluation of TextCat against that corpus.

TextCat with Additional Non-Word Characters
TextCat with Additional Non-Word Characters (January 2016) A follow-up on an idea from Stas about modifying the set of non-word characters in TextCat. Ignoring parentheses helps a wee bit.

How Wrong Would Using Out of Date Page View Data Be?
How Wrong Would Using Out of Date Page View Data Be? (January 2016) We want to integrate page view information into the scoring algorithms we use for both the completion suggester and our regular search results. Our initial idea is to update this page view information only during the normal document update that follows a page edit (for technical reasons, page view data is available when a page is edited). We need to analyze whether page view data refreshed this way will be "good enough" or whether we need to do something more.

ElasticSearch Plugin—Limiting Languages & Retraining
ES Plugin, Limiting Language Options and Retraining on Query Data (December 2015) David retrained the ES Plugin models using the data from the TextCat evaluations, and figured out how to limit the plugin to the "useful" languages. The results are much improved and on par with TextCat.

Language Detection Evaluation—TextCat
Language Detection with TextCat (December 2015)—An evaluation of TextCat (an n-gram–based language identifier) on the enwiki zero-results queries. Includes updates to TextCat, re-training on query data, and limiting language identification to "useful" languages. Offers an improvement over the ES Plugin.
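As a rough illustration of what an n-gram-based language identifier does, here is a minimal sketch of rank-order n-gram profiling with an "out-of-place" distance, the general technique TextCat is usually described as implementing. It is not TextCat's actual code: the function names, the `_` padding character, the profile size of 300, and the tiny training strings are all assumptions made for the example.

```python
from collections import Counter


def ngrams(text, n_max=5):
    """Yield character n-grams (length 1..n_max) from each word, padded with '_'."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def profile(text, size=300):
    """Map each of the `size` most frequent n-grams to its frequency rank."""
    counts = Counter(ngrams(text))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}


def out_of_place(query_profile, model_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the model cost max_penalty."""
    return sum(
        abs(rank - model_profile[gram]) if gram in model_profile else max_penalty
        for gram, rank in query_profile.items()
    )


def identify(text, models, size=300):
    """Return the language whose model profile is closest to the text's profile."""
    query = profile(text, size)
    return min(models, key=lambda lang: out_of_place(query, models[lang], size))


# Hypothetical toy models; real models are trained on far more text.
models = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "de": profile("der schnelle braune fuchs springt über den faulen hund"),
}
```

Limiting identification to "useful" languages corresponds, in this sketch, to simply leaving the other languages out of `models`.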

Language Detection Evaluation—Update: Thresholds by Language
Language Detection Evaluation—Update: Thresholds by Language (October 2015)—Evaluated adding a language-specific threshold (e.g., "it's never Romanian" on enwiki!) to the ElasticSearch language detection plugin. Results are overfitted because of the small available data set, but they indicate a significant improvement in the precision of language detection.
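The idea of a language-specific threshold can be sketched as follows. This is only an illustration under stated assumptions: `pick_language`, the candidate list format, the default threshold, and every threshold value are hypothetical, and none of it is the plugin's real API or the thresholds from the report.

```python
def pick_language(candidates, thresholds, default=0.5, fallback="en"):
    """Return the first candidate whose score clears its language's threshold.

    candidates: list of (language, score) pairs, best guess first.
    thresholds: per-language minimum scores; a value above 1.0 means "never".
    """
    for lang, score in candidates:
        if score >= thresholds.get(lang, default):
            return lang
    return fallback  # nothing was confident enough; keep the wiki's own language


# Hypothetical thresholds: Romanian is never accepted, Spanish needs 0.6.
thresholds = {"ro": 1.01, "es": 0.6}

print(pick_language([("ro", 0.99), ("es", 0.7)], thresholds))
print(pick_language([("ro", 0.99)], thresholds))
```

Raising a language's threshold trades recall for precision on that language, which is why tuning thresholds on a small sample can overfit.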

Relevance Lab!
Relevance Lab (October 2015)—High-level description and design of a Relevance Lab for Discovery, which would allow us (and others!) to experiment with proposed modifications to our search process and gauge their effectiveness and impact before deploying them.

Why People Use Search Engines
Why People Use Search Engines (September 2015)—An overview of how well English Wikipedia Search performs on a sample of ~4K queries that came from Google, with analysis of categories of unsuccessful queries and lots of ideas (not all necessarily practical) for Wikimedia search improvements.

Cross Language Wiki Searching
Cross Language Wiki Searching (September 2015)—An attempt to estimate the impact on enwiki's zero-results rate given "perfect" (or at least human-level) language identification.

Language Detection Evaluation
Language Detection Evaluation (September 2015)—A test of language detection against a representative sample of hand-coded zero-results queries from enwiki.
 * ElasticSearch language detection plugin—A language detection plugin available for ElasticSearch; also evaluated with initial and final spaces added (which gives better results, probably because of better recognition of letters at the edges of words).
 * Always "English" detector—Baseline against the current de facto default; also demonstrates that F-score is not necessarily the only relevant measure for search purposes.
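To make the baseline comparison concrete, here is a small sketch of per-language precision, recall, and F-score for an always-"English" detector. The ten-query sample is invented for illustration and is not data from the report; `per_language_scores` is a hypothetical helper.

```python
def per_language_scores(gold, predicted):
    """Return {language: (precision, recall, f1)} for every language seen."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for lang in sorted(set(gold) | set(predicted)):
        tp = sum(g == lang and p == lang for g, p in pairs)
        fp = sum(g != lang and p == lang for g, p in pairs)
        fn = sum(g == lang and p != lang for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Hypothetical hand-coded sample: mostly English, a few other languages.
gold = ["en"] * 8 + ["es", "fr"]
always_english = ["en"] * len(gold)

scores = per_language_scores(gold, always_english)
for lang, (p, r, f) in scores.items():
    print(f"{lang}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy sample the baseline scores well for English yet never identifies a single non-English query, so it could never route a query to another wiki. That is one way to see why F-score alone may not be the relevant measure for search.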

Phrase Slop Pre-Test
Phrase Slop Pre-Test (August 2015)—An in vitro test of the ElasticSearch phrase slop parameter against ptwiki and dewiki before the in vivo A/B test. The final report, prepared by Mikhail, is here.

Survey of Zero-Results Queries
Survey of Zero-Results Queries (July 2015)—A survey of the readily identifiable patterns in full-text zero-results queries. Lots of potential bots and bugs identified.
 * One Month Followup (August 2015)—Overview of day-by-day changes in full-text traffic for known bots and bugs one month later, and monthly changes in zero-results rate for top wikis by volume.
 * Full manual review of a 1K enwiki sample (August 2015)—Hand coding and categorization of a 1K sample of full-text zero-results queries.