This is an index for the reports I've written up on various search- & discovery-related topics.
- 1 Elasticsearch Analysis Chain Analysis
- 1.1 Chinese Analyzer Analysis
- 1.2 Vietnamese Analyzer Analysis
- 1.3 Analysis Analysis Tools
- 1.4 Kuromoji Analyzer Analysis
- 1.5 HebMorph Analyzer Analysis
- 1.6 Ukrainian Morfologik Analysis
- 1.7 Swedish Analyzer Analysis
- 1.8 Stempel Analyzer Analysis
- 1.9 On Generic ICU Folding
- 1.10 Upgrading ASCII Folding to ICU Folding for French and English
- 1.11 Removing Stress Accents and Folding Ё to Е for Russian Wikis
- 1.12 On Merging Apostrophes and Other Unicode Characters
- 1.13 Adding Ascii-Folding to French Wikipedia
- 1.14 Re-Ordering Stemming and Ascii-Folding on English Wikipedia
- 2 Crimean Tatar Transliteration
- 3 Accents, Dead Keys, and Suggestions
- 4 Some Thoughts on the Math of Scoring
- 5 So Many Search Options
- 6 TextCat, Language ID, Etc.
- 6.1 TextCat Improvements
- 6.2 TextCat Released into Production!
- 6.3 TextCat and Confidence
- 6.4 Typing on the Wrong Keyboard / Russian and English
- 6.5 Favoring Recall in Language Identification
- 6.6 Balanced Language Identification Evaluation Set for Queries
- 6.7 TextCat with Additional Non-Word Characters
- 6.8 ElasticSearch Plugin—Limiting Languages & Retraining
- 6.9 Language Detection Evaluation—TextCat
- 6.10 Language Detection Evaluation—Update: Thresholds by Language
- 6.11 Language Detection Evaluation
- 7 TextCat Optimizations
- 8 Spaceless Writing Systems and Wiki-Projects
- 9 Fallback Languages
- 10 Top Unsuccessful Search Queries
- 11 Dropping Final Question Marks in the Top 10 Wikipedias
- 12 Quotes and Questions
- 13 How Wrong Would Using Out of Date Page View Data Be?
- 14 Relevance Lab!
- 15 Why People Use Search Engines
- 16 Cross Language Wiki Searching
- 17 Phrase Slop Pre-Test
- 18 Survey of Zero-Results Queries
Elasticsearch Analysis Chain Analysis
Chinese Analyzer Analysis
Chinese Analyzer Analysis (February–April 2017) Analysis of several Chinese Elasticsearch plugins for traditional-to-simplified character conversion and for word segmenting.
Punctuation config update (August 2017) About 16% of tokens are punctuation, all indexed as commas, which is silly.
Vietnamese Analyzer Analysis
Vietnamese Analyzer Analysis Analysis of the Vietnamese language analyzer.
Analysis Analysis Tools
Analysis Analysis Tools (July 2017) The first draft of the README file for my Language Analysis Analysis tools, which are being added to the RelForge repo.
Kuromoji Analyzer Analysis
Kuromoji Analyzer Analysis (June–July 2017) Analysis of the Kuromoji language analyzer for Japanese.
HebMorph Analyzer Analysis
HebMorph Analyzer Analysis (May 2017) Analysis of the HebMorph language analyzer for Hebrew.
Ukrainian Morfologik Analysis
Ukrainian Morfologik Analysis (March 2017) Analysis of the Elasticsearch plugin for the Ukrainian Morfologik analyzer, recommended by Elasticsearch. It looks good, but because we were originally using the Russian analyzer, the situation is complicated.
Swedish Analyzer Analysis
Swedish Analyzer Analysis (March 2017) Quick analysis of the impact of folding on Swedish.
Stempel Analyzer Analysis
Stempel Analyzer Analysis (February 2017) Analysis of the Stempel Polish analyzer from Elasticsearch, which we'd like to deploy for Polish wiki projects. Generally it works well, but it has some interesting bugs.
On Generic ICU Folding
On Generic ICU Folding (December 2016) Copy of a quick discussion on Phab about the goals of generic ICU folding and how to apply it to specific language wikis, hopefully with input from the wiki/language communities.
Upgrading ASCII Folding to ICU Folding for French and English
Upgrading ASCII Folding to ICU Folding for French and English (September 2016) A quick analysis of the effects of enabling ICU folding instead of simple ASCII folding for French and English.
Removing Stress Accents and Folding Ё to Е for Russian Wikis
Removing Stress Accents and Folding Ё to Е for Russian Wikis (September 2016) A quick-ish test on the effects of adding stress-accent-stripping and ё-folding to Russian wikis.
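For readers who want a concrete picture of the two transformations named here, the following is a minimal Python sketch, not the analysis chain configuration that was actually tested. It assumes stress is marked with the combining acute accent (U+0301), and the function name `fold_russian` is made up for illustration.

```python
import unicodedata

# Assumption: stress accents appear as the combining acute accent, U+0301.
COMBINING_ACUTE = "\u0301"

# Fold Ё/ё to Е/е; all other letters (including й) are left alone.
YO_FOLD = str.maketrans({"ё": "е", "Ё": "Е"})


def fold_russian(text: str) -> str:
    """Strip stress accents and fold ё to е (hypothetical helper)."""
    # NFC first, so a decomposed ё (е + combining diaeresis) becomes one character.
    composed = unicodedata.normalize("NFC", text)
    without_stress = composed.replace(COMBINING_ACUTE, "")
    return without_stress.translate(YO_FOLD)


if __name__ == "__main__":
    print(fold_russian("Москва\u0301"))  # stress mark removed
    print(fold_russian("ёлка"))          # ё folded to е
```

Handling only U+0301 and ё keeps the sketch from touching й, which a blanket "strip all combining marks" approach would turn into и.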
On Merging Apostrophes and Other Unicode Characters
On Merging Apostrophes and Other Unicode Characters (August 2016) Copied from a quick analysis in a Phab ticket on merging Unicode characters so I can easily find it later.
Adding Ascii-Folding to French Wikipedia
Adding Ascii-Folding to French Wikipedia (August 2016) A not-so-quick test on the effects of adding ASCII folding to French Wikipedia. Many unexpected twists and surprises!
Re-Ordering Stemming and Ascii-Folding on English Wikipedia
Re-Ordering Stemming and Ascii-Folding on English Wikipedia (August 2016) A quick test of the effects of moving ASCII folding before stemming on English Wikipedia.
Crimean Tatar Transliteration
Crimean Tatar Transliteration (May–July 2017) An analysis of a work-in-progress transliteration module, adapted from previous work from 2010.
Accents, Dead Keys, and Suggestions
Accents, Dead Keys, and Suggestions (July 2017) Copy of a Phabricator discussion about accented characters failing to generate completion suggester suggestions.
Some Thoughts on the Math of Scoring
Some Thoughts on the Math of Scoring (April 2017) A cleaned-up and slightly expanded version of a discussion I had with David about the math of scoring functions. Use your hyperoperations, kids!
So Many Search Options
So Many Search Options (December 2016): An initial proposal to encourage thinking about how to handle all the additional ways of searching when a query doesn't give great results ("Did you mean" suggestions, language detection, quote stripping, wrong keyboard detection, etc.).
January 2017: Lots of updates and refinements, and the first draft of a proposal to update the API. Now moved out of my Notes to a more generic page.
TextCat, Language ID, Etc.
November 2016—These are all on one wiki page if you want to browse them all, or jump to a specific section.
- Optimization Framework: I slapped together a grid-search optimizer around my existing tools.
- Multiple Language Model Directories: Initial analysis of effects of allowing wiki-text models in addition to query-based models.
- Maximum Returned Languages and Results Ratio: Initial analysis of the effects of optimizing what counts as "ambiguous"; also, turns out that model size is interesting and important.
- Minimum Input Length: Initial analysis of the effects of introducing a minimum input length.
- Max Proportion of Max Score: scores that are too close to the worst possible score are likely junk!
- Optimization Framework updates (Dec/Jan): now with coordinate descent!
- Bucketing and Bonuses (Dec/Jan): give the most likely languages—esp. the "host" language—a boost so that ambiguity or near ambiguity comes out in their favor. Also, re-evaluate whether we've made enough progress to warrant putting back some languages we had to exclude (spoiler: we have!)
- Unknown n-gram Penalty: Maybe an extra penalty for unknown n-grams will reduce ambiguity; or maybe the penalty is too high and we're throwing out the baby with the bathwater; or maybe it's just right.
- Final Summary & Recommendations: stick a fork in it; it's done! A review of the overall findings, and the general improvement in F0.5 accuracy across the nine corpora we currently have.
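Since the summary is reported in terms of F0.5, here is a small Python sketch of the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 0.5 weighting precision more heavily than recall. The function name `f_beta` and the sample numbers are illustrative only and are not taken from the reports.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was identified correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up numbers: the same two values swapped between precision and recall.
    print(round(f_beta(0.9, 0.6), 3))  # high precision, lower recall
    print(round(f_beta(0.6, 0.9), 3))  # lower precision, high recall scores worse
```

With beta = 0.5, a language identifier that is usually right when it does answer scores better than one that answers more often but is wrong more often.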
TextCat Released into Production!
July 2016—There's a blog post on the Wikimedia blog that Deb and I worked on, announcing TextCat/language ID being in production for five wikis, and a PDF of a longer first draft I wrote on Commons. And while I'm here, I'll suggest the online demo if you want to play around with language identification directly.
TextCat and Confidence
TextCat and Confidence (July 2016) Quick summary of concerns and ideas for assigning a confidence score to TextCat's language identification.
Typing on the Wrong Keyboard / Russian and English
Typing on the Wrong Keyboard / Russian and English (June 2016) A quick attempt to identify and convert queries typed on the wrong keyboard on the English and Russian Wikipedias.
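To illustrate the conversion step, here is a rough Python sketch that maps characters between the US QWERTY and Russian ЙЦУКЕН layouts by physical key position. It covers lowercase letter keys only, and the function names are hypothetical. It does not attempt the harder part, which is deciding whether a query was typed on the wrong keyboard in the first place.

```python
# Keys in the same physical positions on the two layouts
# (lowercase letter rows only; an illustrative subset, not a complete layout).
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def latin_keys_to_cyrillic(query: str) -> str:
    """What the same keystrokes would have produced on a Russian layout."""
    return query.lower().translate(EN_TO_RU)


def cyrillic_keys_to_latin(query: str) -> str:
    """What the same keystrokes would have produced on a US English layout."""
    return query.lower().translate(RU_TO_EN)


if __name__ == "__main__":
    print(latin_keys_to_cyrillic("ghbdtn"))  # привет
    print(cyrillic_keys_to_latin("руддщ"))   # hello
```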
Favoring Recall in Language Identification
Favoring Recall in Language Identification (May 2016) Analysis of recall-favoring options for language detection (rather than precision-favoring), using the same data from frwiki, eswiki, itwiki, and dewiki as below.
Balanced Language Identification Evaluation Set for Queries
Balanced Language Identification Evaluation Set for Queries (February 2016) Creation of a 21-language balanced query corpus, and the evaluation of TextCat against that corpus.
TextCat with Additional Non-Word Characters
TextCat with Additional Non-Word Characters (January 2016) A follow-up on an idea from Stas about modifying the non-word characters in TextCat. Ignoring parens helps a wee bit.
ElasticSearch Plugin—Limiting Languages & Retraining
ES Plugin, Limiting Language Options and Retraining on Query Data (December 2015) David retrained the ES Plugin models using the data from the TextCat evaluations, and figured out how to limit the plugin to the "useful" languages. The results are much improved and on par with TextCat.
Language Detection Evaluation—TextCat
Language Detection with TextCat (December 2015)—An evaluation of TextCat (an n-gram–based language identifier) on the enwiki zero-results queries. Includes updates to TextCat, re-training on query data, and limiting language identification to "useful" languages. Offers an improvement over the ES Plugin.
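As a rough idea of what "n-gram-based language identifier" means, the sketch below follows the classic rank-order approach: build a ranked profile of frequent character n-grams per language, then pick the language whose profile is least "out of place" relative to the query's profile. This is a toy illustration under those assumptions, not TextCat's actual code. The function names, the tiny training strings, and the parameter values are all made up.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "  # spaces mark the edges of each word
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip():  # skip n-grams made only of spaces
                    counts[gram] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences; n-grams unknown to the model get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(sample_profile)
    )


def identify(text, models):
    """Pick the language whose model profile is closest to the text's profile."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))


if __name__ == "__main__":
    # Toy training data; real models are trained on far more text.
    models = {
        "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
        "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
    }
    print(identify("the dog", models))
```

Short queries give the profile very few n-grams to work with, which is one reason retraining on query data and limiting the candidate languages matter so much in practice.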
Language Detection Evaluation—Update: Thresholds by Language
Language Detection Evaluation—Update: Thresholds by Language (October 2015) Evaluated adding a language-specific threshold (i.e., "it's never Romanian" on enwiki!) to the ElasticSearch language detection plugin. Results are overfitted because of the small available data set, but they indicate a significant improvement in language detection precision.
Language Detection Evaluation
Language Detection Evaluation (September 2015)—A test of language detection against a representative sample of hand-coded zero-results queries from enwiki.
- ElasticSearch language detection plugin—A language detection plugin available for ElasticSearch;
- also evaluated with initial and final spaces added (which gives better results, probably because of better recognition of letters at the edges of words)
- Always "English" detector—Baseline against the current de facto default; also demonstrates that F-score is not necessarily the only relevant measure for search purposes.
TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki
TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki (September 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.
TextCat Optimization for ptwiki, ruwiki, and jawiki
TextCat Optimization for ptwiki, ruwiki, jawiki (July 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.
TextCat Re-optimization for enwiki
TextCat Re-optimization for enwiki (June 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection, plus a comparison to a similar earlier enwiki corpus from 2015 based on zero-results queries.
TextCat Optimization for frwiki, eswiki, itwiki, and dewiki
TextCat Optimization for frwiki, eswiki, itwiki, and dewiki (April 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.
Spaceless Writing Systems and Wiki-Projects
Spaceless Writing Systems and Wiki-Projects (November 2016) A quick review of languages/projects that don't use spaces between most words in their writing systems.
Fallback Languages
Fallback Languages (October 2016) A list of languages that are potentially used as fallbacks for other languages in language analysis.
Top Unsuccessful Search Queries
Top Unsuccessful Search Queries (July 2016) Analysis of the 100 most frequent zero-results queries on enwiki in May 2016, to help determine whether mining such queries is worthwhile.
Dropping Final Question Marks in the Top 10 Wikipedias
Dropping Final Question Marks in the Top 10 Wikipedias (June 2016) More detailed look at the effects on search results (especially Zero Results Rate and Poorly Performing Queries) of dropping final question marks from queries on the top 10 Wikipedias.
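The query rewrite being evaluated can be pictured with a short Python sketch. This is an assumption about the general shape of the transformation, not the production implementation: the helper name is hypothetical, and both the fullwidth question mark and the rule for queries made up only of question marks are choices made for this example.

```python
import re

# One or more final question marks (ASCII or fullwidth), plus surrounding whitespace.
_FINAL_QUESTION_MARKS = re.compile(r"\s*[?？]+\s*$")


def drop_final_question_marks(query: str) -> str:
    """Remove question marks at the very end of a query (hypothetical helper)."""
    stripped = _FINAL_QUESTION_MARKS.sub("", query)
    # Assumption: if nothing else is left, keep the original query as typed.
    return stripped if stripped else query


if __name__ == "__main__":
    print(drop_final_question_marks("who was ada lovelace?"))  # who was ada lovelace
    print(drop_final_question_marks("what? why"))              # unchanged: not final
```

Question marks in the middle of a query are left alone, since only the final ones are dropped.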
Quotes and Questions
Quotes and Questions (May 2016) Quick write-up of the effects of removing quotation marks and question marks from poorly performing queries.
How Wrong Would Using Out of Date Page View Data Be?
How Wrong Would Using Out of Date Page View Data Be? (January 2016) We want to integrate page view information into the scoring algorithms we use for both the completion suggester and our regular search results. Our initial idea is to update this page view information only during the normal document updates that follow a page edit (for technical reasons, page view data is available when a page is edited). We need to analyze whether this page view data will be "good enough" or whether we need to do something more.
Relevance Lab!
Relevance Lab (October 2015) High-level description and design of a Relevance Lab for Discovery, which would allow us (and others!) to experiment with proposed modifications to our search process and gauge their effectiveness and impact before deploying them.
Why People Use Search Engines
Why People Use Search Engines (September 2015)—An overview of how well English Wikipedia Search performs on a sample of ~4K queries that came from Google, with analysis of categories of unsuccessful queries and lots of ideas (not all necessarily practical) for Wikimedia search improvements.
Cross Language Wiki Searching
Cross Language Wiki Searching (September 2015)—An attempt to estimate the impact on enwiki's zero-results rate given "perfect" (or at least human-level) language identification.
Phrase Slop Pre-Test
Survey of Zero-Results Queries
Survey of Zero-Results Queries (July 2015)—A survey of the readily identifiable patterns in full-text zero-results queries. Lots of potential bots and bugs identified.
- One Month Followup (August 2015)—Overview of day-by-day changes in full-text traffic for known bots and bugs one month later, and monthly changes in zero-results rate for top wikis by volume.
- Full manual review of a 1K enwiki sample (August 2015)—Hand coding and categorization of a 1K sample of full-text zero-results queries.