Search

From MediaWiki.org

This page describes the Wikimedia Foundation's activities surrounding our sites' search functionality. Our current project is to replace our legacy lsearchd system with a new system based on Elasticsearch (using a new extension called CirrusSearch). This project started in June 2013, with the migration slated to last until early 2014.

Status

2014-03-monthly:

In March we upgraded to the newest version of Elasticsearch and expanded onto more wikis. We also began a performance assessment, which is showing us the work required to use Cirrus as the primary search back-end for the larger wikis, and we have started on that work.

Rationale

The Wikimedia search infrastructure hasn't had significant development work for many years. The current system is based on a homegrown layer (named "lsearchd") on top of Lucene. The problem lsearchd solves has since been tackled by much larger projects such as Solr and Elasticsearch. The lsearchd system frequently breaks in ways that are difficult to diagnose, and generally makes our Operations staff sad.

Goals for our current effort:

  • Make our existing tools more robust
  • Improve logging in our existing tools to make problems easier to diagnose
  • Migrate away from lsearchd to Solr or something similar

Our current search infrastructure is highly outdated and difficult to manage due to tons of custom code. We are now replacing lsearchd with Elasticsearch (which is also a layer on Lucene), as it's very stable, contains many of the features we need, and doesn't require nearly as much custom code to support. What custom code we write will be incorporated in a MediaWiki extension called CirrusSearch.

Timeline

This is a timeline for future deployments of Cirrus as a secondary or primary backend on wikis. Our general goal is to deploy CirrusSearch (backed by Elasticsearch) by the end of 2013.

Wikis

Wiki | Secondary Date | BetaFeature Date | Primary Date
test2wiki | August 15 | Never! | August 15
mediawikiwiki | December 3 (first ???) | Never! | December 9 (first September 11)
itwiktionary | December 3 (first September 16) | December 9 | December 18 (first September 23)
disabled | December 3 (first September 23) | December 9 | December 18 (first September 24)
cawiki | December 3 (first October XX) | December 9 | December 18 (first September 23)
enwikisource | December 3 (first October 15) | December 9 | December 18 (first September 23)
bnwiki | December 3 (first October 23) | December 9 | Early March to May (broad estimate)
wikivoyages | December 3 (first October 23) | December 9 | January 6
sewikimedia | December 3 (first November 5) | December 9 | Early March to May (broad estimate)
astwiki | December 3 (first November 5) | December 9 | Early March to May (broad estimate)
guwiki | December 3 (first November 5) | December 9 | Early March to May (broad estimate)
elwiki | December 3 (first November 5) | December 9 | Early March to May (broad estimate)
frwikisource | December 3 (first November 5) | December 9 | January 6
nlwiki | December 3 (first November 12) | December 9 | Early March to May (broad estimate)
wikidata | December 4 (first November XX) | December 9 | January 6
itwiki | December 3 | December 9 | January 6
plwiktionary | December 3 | December 9 | Early March to May (broad estimate)
wikimedias | December 11 | December 16 | Early March to May (broad estimate)
wikimanias | December 11 | December 16 | Early March to May (broad estimate)
wiktionaries | December 11 | December 16 | Early March to May (broad estimate)
wikisources | December 16 | December 18 | Early March to May (broad estimate)
commonswiki | December 16 | December 30 | Early March to May (broad estimate)
wikinewses | December 18 | December 30 | Early March to May (broad estimate)
specieswiki | December 18 | December 30 | Early March to May (broad estimate)
frwiki | December 30 | January 6 | Early March to May (broad estimate)
eswiki | December 30 | January 6 | Early March to May (broad estimate)
ruwiki | December 30 | January 6 | Early March to May (broad estimate)
ptwiki | December 30 | January 6 | Early March to May (broad estimate)
wikibooks | January 6 | January 8 | Early March to May (broad estimate)
dewiki | January 6 | January 8 | Early March to May (broad estimate)
small ones | January 8 | January 8 |
enwiki | January 13 | February 3 | Late March to May (broad estimate)
huwiki | January 22 | February 3 | Late March to May (broad estimate)
itwikiquote | February 19 | February 19 | Late March to May (broad estimate)
wikiversities | February 19 | February 19 | Late March to May (broad estimate)
Everything but commons, wikipedia, incubator, and meta | already done above | April 2 | Late March to May (broad estimate)
zhwiki | As soon as we have more machines | | Late March to May (broad estimate)
everything else (in stages) | Marchish | Marchish | Late March to May (broad estimate)

*: proposed

Other

  • Upgrade Elasticsearch to 0.90.4 - September 18
  • Upgrade Elasticsearch to 0.90.7 - November 18
  • Install new Elasticsearch servers and decommission testsearch100X - November 18
  • Upgrade Elasticsearch to 0.90.9 - January 2
  • Upgrade Elasticsearch to 0.90.10 - January 14
  • Upgrade Elasticsearch to 1.x - March 6
  • Upgrade Elasticsearch to 1.1.0 - April 8

Near term to be scheduled

  • zhwiki as secondary
  • more primaries

Solr vs Elasticsearch

We spent some time looking at search systems we could use and it became pretty apparent that the thought leaders in the open source world for search are Solr and Elasticsearch. We spent a few weeks with each and decided to build on Elasticsearch because of its wonderful suggester, easily composable queries and good documentation. We are also happy with the process of submitting changes upstream to Elasticsearch.

Elasticsearch implementation plan

Core Search

We're working feverishly on CirrusSearch and we're deploying it to wikis on a volunteer basis. We're actively looking for new volunteers, so if you want in, email neverett@wikimedia.org.

For other WMF applications

GeoData

We've just started looking at how to move GeoData to Elasticsearch. For now, it'll remain in Solr with plans to migrate it to Elasticsearch when time permits. Some considerations:

The index is relatively small (so there is no need to make it distributed), but it requires a lot of computational power to work with. Full-text search is not currently used. Currently, data from all the wikis is stored in the same core; in the future we will need to split the data across many cores (Puppet changes for using multiple cores with a shared configuration/schema exist, but need more work).

  • Load expectations: unclear, but will be high if we start using it heavily e.g. for maps display.
  • Backups: not really needed - if master is down just switch to a slave. If all servers are down, reindexing from scratch is quick.
  • Note: because GeoData's schema is very stripped-down, /admin/ping doesn't work - should be remembered if someone wants to rewrite the current monitoring.

Translation Memory

  • Niklas wants to work with Chad & Nik to figure out what is needed
  • "This spring/summer"

Search Weighting Ideas

Some things that could be factored into search results, in rough proposed grouping order:

(+) positive impact on default ranking (-) negative impact on default ranking

Relationship

  • In-article searches (when I'm reading an article, I want to search within it) (+)
  • GeoLoc of user when searching (+)
  • Articles nearness (geographically) to current article (+)
  • Articles linked from current page (+)
  • Wikidata related items (+)
  • Categories on articles (+)
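
To make the geographic "nearness" idea above concrete, here is a minimal sketch that computes the great-circle distance between two coordinates with the haversine formula and turns it into a ranking boost. The function names, the `scale_km` parameter and the decay shape are assumptions made for illustration; nothing here reflects how GeoData or CirrusSearch actually scores results.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearness_boost(distance_km, scale_km=50.0):
    """Hypothetical decay: 1.0 at zero distance, 0.5 at scale_km, tending to 0."""
    return 1.0 / (1.0 + distance_km / scale_km)


# Example: boost for an article about London while reading one about Paris.
distance = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(round(distance), nearness_boost(distance))
```

The same boost could be applied to the user's own location instead of the current article's coordinates.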

Relevance

  • Article "meshiness" (number of articles that link to article, number of articles that article links to) (+)
  • External search referrer terms saved into article metadata (not a thing yet) (+)
  • Article importance (per wikiprojects) (+) (-)
  • Recency of last edit (+) (-)
  • Article recency (creation) (+)
  • Matches in title, headings, body text, alternate text, references (weighted differently) (+) (-)
  • Other wiki search results
    • Is this a weighting thing, or a content type thing (e.g. if it has related results on other wikis it's ranked higher)?
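
As a rough illustration of weighting matches differently by field, the sketch below counts query-term occurrences in each field and multiplies the count by a per-field weight. The field names and the weight values are hypothetical, picked only for the example; this is not how Elasticsearch or CirrusSearch computes relevance.

```python
# Hypothetical per-field weights; the values are illustrative assumptions only.
FIELD_WEIGHTS = {
    "title": 5.0,
    "headings": 3.0,
    "body": 1.0,
    "alt_text": 0.5,
    "references": 0.25,
}


def field_weighted_score(query, page):
    """Sum query-term occurrences per field, scaled by that field's weight."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        hits = sum(words.count(term) for term in terms)
        score += weight * hits
    return score


page = {
    "title": "Lucene",
    "headings": "History Features",
    "body": "Lucene is a search library and Elasticsearch builds on Lucene",
}
print(field_weighted_score("lucene", page))  # prints 7.0 (1 title hit * 5.0 + 2 body hits * 1.0)
```

Negative signals from the lists in this section could be folded in the same way, using weights below zero.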

Quality

  • Notability of an article (i.e. featured) (+)
  • Article quality (per wikiprojects) (+) (-)
  • Content with article issue templates (-)
  • Calls to action (number of article issues, missing images, etc.) (-)
  • Stub status (-)

Aggregate User behavior

  • Top searches in the last hour/day/week (+)
  • Search terms that were followed vs unfollowed (+) (-)
  • Recent pages arrived at via external search (+)

Items in search results

Existing

  • Articles / Talk
  • User / Talk
  • File / Talk
  • Category / Talk
  • Lists / Talk
  • Portal / Talk
  • Help / Talk
  • Template / Talk
  • Module / Talk
  • Wikipedia / Talk
  • Education Program / Talk
  • TimedText (captions/subtitles) / Talk
  • MediaWiki / Talk
  • Book / Talk

New/Modified

  • Places
  • Media
    • Images
      • Raster/bitmap
      • Vector
    • Video
    • PDFs
    • Audio
    • Other media types(?)
  • Wikiprojects
  • Article text
  • Article sections
  • Preference sections
  • Preferences
  • Tea house
  • Reference desk
  • Village Pump
Sister Project Results
  • Wiktionary results
  • Wikivoyage results (desktop only)
  • Wikidata items


Contextual Search

Flow

General Flow search
  • Topic titles
  • Board titles (user names/preferred names)
  • Full text search
  • Board Descriptions
User Mention search
  • Users participating in the current topic
  • Users mentioned in the current topic
  • Users participating on current board
  • Users mentioned in current board
  • Users mentioned by me in the last X days
  • Users whom I follow (Gasp! This doesn't exist yet!)
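
One way to read the user mention list above is as a set of candidate sources consulted in priority order. The sketch below assumes exactly that (the source does not say the order is a ranking) and merges prefix matches from each source, dropping duplicates. The function name and the sample data are hypothetical.

```python
def rank_mention_candidates(prefix, sources):
    """Return user names starting with prefix, in source-priority order.

    sources is a list of name lists, highest priority first (an assumption),
    e.g. topic participants, then users mentioned in the topic, and so on.
    """
    wanted = prefix.lower()
    seen = set()
    ranked = []
    for names in sources:
        for name in names:
            if name.lower().startswith(wanted) and name not in seen:
                seen.add(name)
                ranked.append(name)
    return ranked


topic_participants = ["Alice", "Bob"]
topic_mentions = ["Alan", "Alice"]
board_participants = ["Albert"]
print(rank_mention_candidates(
    "al", [topic_participants, topic_mentions, board_participants]))
```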

Category/Tags

  • Recently applied tags


Template/Transclusion Search

  • Recently Used
  • Most common templates inserted from current context
  • Most common template inserted after last inserted

General Search Behaviors

  • When an exact match returns zero results but a "did you mean" suggestion is available, show a special header with the "did you mean" note and a create-article note, and show the results for the suggestion instead.
  • In full-text search, show exact page title matches in an alternate visual appearance
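
The zero-results behavior described above can be sketched as a small decision function. The function name, argument names and the returned dictionary shape are all hypothetical, and the final branch (no results and no suggestion) is an assumption the list does not cover.

```python
def plan_results_page(query, exact_results, suggestion, suggestion_results):
    """Decide which header and result list a search results page should show."""
    if exact_results:
        # Normal case: results exist, no special header needed.
        return {"header": None, "results": exact_results}
    if suggestion:
        # Zero exact results but a suggestion exists: show both notes in a
        # special header and list the suggestion's results instead.
        header = {"did_you_mean": suggestion, "create_article": query}
        return {"header": header, "results": suggestion_results}
    # Assumption: with nothing to suggest, only offer to create the article.
    return {"header": {"did_you_mean": None, "create_article": query},
            "results": []}


print(plan_results_page("elastcsearch", [], "elasticsearch", ["Elasticsearch"]))
```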


Research

  • Show floating right side microsurvey widget to rate search results
  • Allow users to upvote and downvote results as relevant, to inform weighting hypotheses
  • Analyze search/followed link pairs
    • Run microsurvey "is this what you were looking for" on results page?
    • Can we determine if it was what they were looking for without asking?
      • amount of page viewed
      • back to results rather than following link on page or bounce


Documents

Links