Search/Old

This page describes the Wikimedia Foundation's activities surrounding our sites' search functionality.

Rationale
The Wikimedia search infrastructure hasn't had significant development work for many years. The current system is based on a homegrown layer on top of Lucene (lsearchd), solving a problem that has since been tackled by much larger projects such as Solr and Elasticsearch. The search system frequently breaks in ways that are difficult to diagnose, and generally makes our Operations staff sad.

Goals for our current effort:
 * Make our existing tools more robust
 * Improve logging in our existing tools to make problems easier to diagnose
 * Migrate away from lsearchd to Solr (or something similar)

Our current search infrastructure is highly outdated and difficult to manage due to tons of custom code. We'd like to replace it with Solr (also based on Lucene), as it's very stable, contains many of the features we need, and doesn't require nearly as much custom code to support.

Timeline
This section is a timeline for future deployments of CirrusSearch (as secondary or primary backend) to wikis. Our general goal is to deploy CirrusSearch (backed by Elasticsearch) by the end of 2013.

Primary backend

 * test2wiki - Aug 15
 * mediawikiwiki - Sep 11
 * itwiktionary to primary - Sep 23
 * All disabled wikis - Sep 24
 * cawiki
 * enwikisource - Oct 15

Secondary backend

 * itwiktionary - Sep 16
 * enwikisource to secondary - Sep 23
 * cawiki - Sep 23
 * Wikidata - October 29th
 * bnwiki - October 23
 * wikivoyages - October 23
 * se.wikimedia.org - November 5
 * ast.wikipedia.org - November 5
 * gu.wikipedia.org - November 5
 * el.wikipedia.org - November 5
 * fr.wikisource.org - November 5
 * nl.wikipedia.org or it.wikipedia.org - Maybe November 12 (if we have the resources on the test machines)

Other

 * Upgrade Elasticsearch to 0.90.4 - Sep 18
 * Install new Elasticsearch servers and decommission testsearch100X - November 12ish

Near term to be scheduled

 * to be determined

Solr vs Elasticsearch
We spent some time looking at search systems we could use, and it became pretty apparent that the thought leaders in the open source world for search are Solr and Elasticsearch. We spent a few weeks with each and decided to build on Elasticsearch because of its wonderful suggester, easily composable queries, and good documentation. We are also happy with the process of submitting changes upstream to Elasticsearch.

Core Search
We're working feverishly on CirrusSearch and we're deploying it to wikis on a volunteer basis. We're actively looking for new volunteers, so if you want in, email neverett@wikimedia.org.

GeoData
We've just started looking at how to move GeoData to Elasticsearch. For now, it'll remain in Solr with plans to migrate it to Elasticsearch when time permits. Some considerations:

 * The index is relatively small (so no need to make it distributed), but requires a lot of computational power to work with.
 * Full-text search is not currently used.
 * Currently, data from all the wikis is stored in the same core; in the future we will need to split the data across many cores (the puppet changes for using multiple cores with a shared configuration/schema need more work).
 * Load expectations: unclear, but will be high if we start using it heavily e.g. for maps display.
 * Backups: not really needed - if master is down just switch to a slave. If all servers are down, reindexing from scratch is quick.
 * Note: because GeoData's schema is very stripped-down, /admin/ping doesn't work. Keep this in mind if someone rewrites the current monitoring.
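
Since /admin/ping is unavailable for GeoData, one possible monitoring approach is to issue a cheap match-all query that returns zero rows and check the status in the response header. The following is only a sketch under assumptions: the base URL and the core name "geodata" are hypothetical, and the check relies on Solr's standard select handler returning a JSON body with responseHeader.status set to 0 on success. It is not the monitoring currently in place.

```python
import json
from urllib.parse import urlencode


def build_check_url(base_url: str, core: str) -> str:
    """Build a match-all query URL that asks for zero rows, so it is cheap to run.

    base_url and core are assumptions for illustration; substitute real values.
    """
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def is_healthy(response_body: str) -> bool:
    """Treat a parseable JSON response with responseHeader.status == 0 as healthy."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Not JSON at all (for example an HTML error page from a proxy).
        return False
    if not isinstance(payload, dict):
        return False
    header = payload.get("responseHeader", {})
    return isinstance(header, dict) and header.get("status") == 0
```

A monitoring script would fetch the URL from build_check_url with whatever HTTP client it already uses and pass the body to is_healthy. Keeping URL construction and response parsing separate from the network call makes both easy to test offline.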

Nice to haves

 * A pony

Search Weighting Ideas
Some things that could be factored into search results, in rough proposed grouping order:

(+) positive impact on default ranking (-) negative impact on default ranking

Relationship

 * In-article searches (when I'm reading an article, I want to search within it) (+)
 * GeoLoc of user when searching (+)
 * Articles nearness (geographically) to current article (+)
 * Articles linked from current page (+)
 * Wikidata related items (+)
 * Categories on Articles (+)

Relevance

 * Article "meshiness" (number of articles that link to the article, number of articles that the article links to) (+)
 * External search referrer terms saved into article metadata (Not a thing, yet) (+)
 * Article importance (per wikiprojects) (+) (-)
 * Recency of last edit (+) (-)
 * Article recency (creation) (+)
 * Matches in Title, Headings, Body Text, Alt-text, References (weighted differently) (+) (-)
 * Other wiki search results
 * Is this a weighting thing, or a content type thing (e.g. if it has related results on other wikis, it's ranked higher)?
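
The idea of weighting matches in the title, headings and body text differently can be sketched as an Elasticsearch multi_match query with per-field boosts. This is an illustration only: the field names ("title", "heading", "text") and the boost values are hypothetical and do not reflect any actual CirrusSearch mapping.

```python
from typing import Dict


def build_weighted_query(terms: str, field_weights: Dict[str, float]) -> dict:
    """Return a multi_match query body that boosts each field by its weight.

    Elasticsearch reads "field^N" as "multiply matches in this field by N".
    """
    fields = [f"{name}^{weight:g}" for name, weight in field_weights.items()]
    return {"query": {"multi_match": {"query": terms, "fields": fields}}}


# Hypothetical weights: a title match counts three times as much as a body match.
query_body = build_weighted_query(
    "pony", {"title": 3, "heading": 2, "text": 1}
)
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request. Tuning then becomes a matter of changing the numbers rather than the query structure.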

Quality

 * Notability of an article (i.e. featured) (+)
 * Article quality (per wikiprojects) (+) (-)
 * Content with article issue templates (-)
 * Calls to action (number of article issues, missing images, etc.) (-)
 * Stub status (-)

Personalization*

 * User's watched pages (+)
 * User's contribution history (+)
 * User's offline saved pages (mobile app) (+)
 * Ponies (+++)
 * Active user's recent searches (mobile app) (+)
 * We don't track users' search history, just as we don't track users' pageviews
 * Sure, we don't track individual users' search history, but do/can we track search history in aggregate or for generalized cohorts?
 * User page view history (mobile app) (+)
 * Exists in the legacy mobile app and will be in the new mobile app; currently stored in the app, not the profile; this may change(?)

Aggregate User behavior

 * Top searches in the last hour/day/week (+)
 * Search terms that were followed vs unfollowed (+) (-)
 * Content which received many thanks (+)
 * Recent pages arrived at via external search (+)
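
One way to read the (+) and (-) markers in the lists above is as signed weights applied on top of a base text-relevance score. The sketch below assumes a simple additive model; the signal names and weight values are invented for illustration, and nothing here describes how ranking is actually implemented.

```python
from typing import Dict

# Hypothetical signed weights: positive values correspond to (+) items,
# negative values to (-) items in the lists above.
WEIGHTS: Dict[str, float] = {
    "featured": 2.0,
    "incoming_links": 0.5,
    "watched_by_user": 1.0,
    "stub": -1.0,
    "issue_templates": -0.5,
}


def adjust_score(base_score: float, signals: Dict[str, float]) -> float:
    """Add each signal's value times its weight to the base score.

    Signals without a configured weight are ignored rather than raising,
    so new ideas can be logged before anyone decides how much they count.
    """
    adjustment = sum(
        WEIGHTS.get(name, 0.0) * value for name, value in signals.items()
    )
    return base_score + adjustment
```

For example, adjust_score(10.0, {"featured": 1, "stub": 1}) adds 2.0 and subtracts 1.0 from the base score. Items marked both (+) and (-) would need a signal whose value can itself be positive or negative.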



Existing

 * Articles / Talk
 * User / Talk
 * File / Talk
 * Category / Talk
 * Lists / Talk
 * Portal / Talk
 * Help / Talk
 * Template / Talk
 * Module / Talk
 * Wikipedia / Talk
 * Education Program / Talk
 * TimedText (captions/subtitles) / Talk
 * MediaWiki / Talk
 * Book / Talk

New/Modified

 * Places


 * Media
 * Images
 * Raster/bitmap
 * Vector
 * Video
 * PDFs
 * Audio
 * Other media types(?)


 * Wikiprojects


 * Article text
 * Article sections


 * Preference sections
 * Preferences


 * Tea house
 * Reference desk
 * Village Pump

Documents

 * Search documentation on Wikitech: Search
 * Ram's setup instructions: wikitech:User:Ram/Search
 * Some notes from Brion in 2008
 * The MWSearch extension provides a SearchEngine subclass which contacts Wikimedia's Lucene-based search server. This replaces the older LuceneSearch extension which reimplemented the entire Special:Search page.
 * /2013-02 discussion - Discussion with Rainman about how the current system works

Links

 * Bugzilla