User:Rainman/search internals

This page describes the scoring scheme and other algorithms used in Extension:lucene-search as implemented in version 2.1.

Introduction
Lucene-search is a Java wrapper of roughly 40,000 lines around Lucene's low-level search API. It extends basic search functionality to provide a scalable architecture, link analysis, advanced spell checking, custom scoring schemes, and other features. It was written for use with MediaWiki, with the needs of Wikimedia projects in mind.

Scoring
We divide scoring into three classes: exact article, relevant article and matched article. These are gradients of relevance, put into separate classes because they are calculated in distinct ways. Every article returned to the user must fit the third class, i.e. match the query, while the first two give additional score boosts.

Exact article
An article falls into this class when the query matches (from higher to lower score):
 * exactly an article title
 * exactly a redirect to an article
 * complete title with words out of order
 * complete title by ignoring stop words, in/out of order (e.g. q:france history -> history of france)
 * complete title when both query and title nouns are converted into singular form
 * complete title or redirect when expanded with WordNet context-free synonyms (e.g. q:surname -> family name)
 * complete title or redirect by omitting accents and non-letters (e.g. q:test -> .test)

Scores are additionally ordered by article rank, thus some of the lower subclasses might dominate the higher ones if the article has a very high rank.

This class contains the most straightforward search results, the ones a user would certainly expect to see (sorted according to how general the article is). The WordNet synonym database gives some capacity to rephrase queries into equivalent terms, but it is incomplete. To complete the "exact article" class, linguistic rephrasing should be implemented, e.g. a transformation of queries like q:population of africa into q:african population.
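The tiers listed above can be illustrated with a minimal Python sketch. It is not the lucene-search implementation: the function names, the tier labels and the stop word list are assumptions made for this example, and the singular-form and WordNet tiers as well as the ordering by article rank are left out.

```python
import unicodedata

# Assumed stop word list; the real list is not given on this page.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and"})


def words(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def fold(text):
    """Drop accents and every character that is not a letter or a space."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if ch.isalpha() or ch.isspace())


def without_stop_words(tokens):
    """Sorted non-stop-words, so that word order is ignored."""
    return sorted(tok for tok in tokens if tok not in STOP_WORDS)


def exact_tier(query, title, redirects=()):
    """Return a label for the highest matching tier, or None if none matches."""
    q, t = words(query), words(title)
    if q == t:
        return "title"
    if any(q == words(r) for r in redirects):
        return "redirect"
    if sorted(q) == sorted(t):
        return "title, words out of order"
    if without_stop_words(q) == without_stop_words(t):
        return "title, ignoring stop words"
    if any(words(fold(query)) == words(fold(name)) for name in (title, *redirects)):
        return "accents and non-letters omitted"
    return None
```

For example, `exact_tier("france history", "History of France")` falls into the stop word tier, and `exact_tier("test", ".test")` into the last one.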

Relevant article
The relevance of an article is measured as the product of:
 * 1) score based on positional information of query terms in the article
 * 2) relevance assessment based on context estimated via related articles and article title and redirects.

Thus, we use two independent measures: one local, concerning only the article, and the other global, inferred from link analysis. Their product can be thought of as a covariance with zero expected value.

Positional information of terms within article is evaluated as follows (from higher to lower score):
 * terms are found in exact order at the very beginning of the article
 * terms are found out of order at the beginning of the article
 * terms are found on first page of the article (first 500 words / until first section heading), either out of order, or in order
 * terms are found in section headings
 * terms are found in order somewhere in the text
 * terms are found out of order, but close to each other somewhere in the text

The exact-order matches typically require all words from the query, while the sloppy ones require only the words that are not stop words. The above scheme is repeated for queries rewritten with WordNet synonyms. Matches at the beginning of the article are additionally scaled by article rank. The scheme assumes that the most relevant information about the article is mentioned first. This seems to be true for encyclopedias, although it may not hold for arbitrary data. Infoboxes do not count as beginning-of-the-article text.

Link-analysis relevance is based on link analysis data saved in the index. For each article we calculate how related it is to all articles that link to it, i.e. a list of the most related articles that link to it is attached to every article. For instance, for Douglas Adams, the most related article is The Hitchhiker's Guide to the Galaxy. The reverse also holds here, although relatedness is often not symmetrical. This means that a query for the hitchhikers guide to the galaxy will yield extra score for the article Douglas Adams (and in this case vice versa). The extra score depends on how strong the relatedness is.

The other global measure, in addition to related article names, is article rank (if the title or part of the title matches) or redirect rank (if a redirect or part of it matches). In both cases, scores are calculated from how well the query matches the title names (single words, in-order phrases, whole query, with or without stop words, synonyms...).

Matched article
All hits need to match the terms in the query. The match clause ensures this is the case, but it contributes very little to the overall score and is used to order results only if the above two classes yield no additional ranking. It uses a somewhat customized Lucene formula.

Additional scores
Depending on configuration, some wikis receive additional boosts based on the last edit of the article (e.g. mediawiki.org, metawiki and wikinews). This helps filter out very old and no longer interesting content, although more sophisticated age and trust algorithms could also be used. Different namespaces receive different score boosts, e.g. talk namespaces typically get 50% lower scores, as do subpages.

Spell checking
Spell checking is done on four levels: whole titles, phrases, context and words. It proved hard to implement spell checking efficiently, especially for (groups of) shorter words, since their edit distance neighborhood is huge. The spell check engine relies on actual search results and highlighted text to help disambiguate various cases, as explained below.

Spell checking words
Words are suggested based on an n-gram index of words. All indexed words are split into single characters, and into groups of two and three adjacent letters. This makes the word index quite minimal and capable of checking very short words, but also very slow. Typically, a query is made on this n-gram index, then the top few hundred hits are evaluated using a modified edit distance and metaphones. Metaphones are used only to reject words: since metaphone calculation can be noisy and is not always correct, we only require that the spell-checked word doesn't have a completely different metaphone from the typed-in one. The words are extracted from full-text content for the main namespace, and from titles for other namespaces. This enables namespace-specific suggestions.

The word at the lowest edit distance and with the highest frequency is suggested as the spell-checked word.
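The word-level procedure can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the extension's code: an in-memory dictionary stands in for the Lucene n-gram index, plain Levenshtein distance stands in for the modified edit distance, the metaphone filter is omitted, and frequency is treated as a tie-breaker among words at equal distance. All function names are made up for the example.

```python
from collections import Counter, defaultdict


def ngrams(word, sizes=(1, 2, 3)):
    """Split a word into single characters, bigrams and trigrams."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def build_index(indexed_words):
    """Map every n-gram to the set of indexed words that contain it."""
    index = defaultdict(set)
    for word in indexed_words:
        for gram in ngrams(word):
            index[gram].add(word)
    return index


def candidates(index, query, limit=200):
    """Indexed words sharing the most n-grams with the query (top few hundred)."""
    overlap = Counter()
    for gram in set(ngrams(query)):
        for word in index.get(gram, ()):
            overlap[word] += 1
    return [word for word, _ in overlap.most_common(limit)]


def levenshtein(a, b):
    """Plain edit distance, used here only as a stand-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(index, frequency, query):
    """Pick the candidate at the lowest distance, preferring frequent words."""
    pool = candidates(index, query)
    if not pool:
        return None
    return min(pool, key=lambda w: (levenshtein(query, w), -frequency[w]))
```

With `frequency = {"cat": 50, "cot": 10}` and `index = build_index(frequency)`, the call `suggest(index, frequency, "cit")` picks `"cat"`: both words are at distance 1, and the more frequent one wins.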

Spell checking phrases
Sometimes two words can each be correct alone, but together form a misspelling, e.g. los angles (los angeles) or noble prize (nobel prize). We break down all content (for the main namespace) and titles (for other namespaces) into phrases anchored in non-stop-words. For instance, the text "together form a misspelling" is broken into the phrases together_form and form_a_misspelling. The same is done for the query the user entered. For every word, similar words are generated (using single-word spell checking), and from these the correct phrases are inferred. An alternative implementation would be to use n-grams for phrases (i.e. the same as for words), but this generates a huge index and generally works very slowly.
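The phrase splitting can be sketched in Python as below. The sketch assumes that a phrase runs from one non-stop-word to the next, keeping any stop words in between; the stop word list and the function name are assumptions made for the example.

```python
# Assumed stop word list; the list used by lucene-search is not given here.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and", "to"})


def anchored_phrases(text, stop_words=STOP_WORDS):
    """Break text into phrases that start and end on a non-stop-word."""
    tokens = text.lower().split()
    # Positions of the words that may anchor a phrase.
    anchors = [i for i, tok in enumerate(tokens) if tok not in stop_words]
    # Each phrase spans two consecutive anchors, stop words in between included.
    return ["_".join(tokens[start:end + 1])
            for start, end in zip(anchors, anchors[1:])]
```

For example, `anchored_phrases("nobel prize in physics")` gives `["nobel_prize", "prize_in_physics"]`.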

Checking based on phrases does a decent job of finding common misspellings (when one form at distance one is 100 times more frequent than the other), and also helps spell-check common phrases and simple linguistic constructs.

Spell checking context
In many cases, however, adjacent words don't form phrases, but are instead keywords. For every word in a title we generate a list of contextual words: all other words in the article titles where it is found, those articles' redirects, section headings and words from links within the first page of the article. As with phrases, we use single-word suggestions to generate similar words, and then check whether they are in the context of each other. The best suggestion from both phrases and context is then used as the final suggestion.
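The context check can be sketched in Python as below. This is only an illustration: the data structures (`similar`, listing candidate spellings per word with the typed form first, and `context`, mapping a word to its contextual words) and the rule of accepting the first combination in which every adjacent pair is in context are assumptions, not the actual implementation.

```python
from itertools import product


def correct_in_context(query_words, similar, context):
    """Return the first combination of candidate spellings whose adjacent
    words appear in each other's context, or None if there is none."""
    options = [similar.get(word, [word]) for word in query_words]
    for combo in product(*options):
        in_context = all(b in context.get(a, ()) for a, b in zip(combo, combo[1:]))
        if in_context:
            return list(combo)
    return None
```

A single-word query has no adjacent pairs, so the typed form is returned unchanged.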

This helps spell check queries in-context, where the spelling of one word influences the way the others are spelled. For instance, "dasain" is a 15-day festival in Nepal, but "dasain heidegger" gets spell checked into "dasein heidegger", as dasein is Heidegger's concept for existence.

Spell checking titles
Finally, the whole query is also matched against whole titles. This is only done if the query is long enough (more than 7 letters), and for performance reasons it has a limited capacity for correcting errors. One fast check is usually done at the beginning of the spell-check process to see if there is a title at distance 1; if so, it is returned if it has a larger rank than the first hit for the current query. If this is not the case, spell checking of words, phrases and context is performed, the best suggestions are then again matched against titles, and finally, if that fails, the best suggestions are returned to the user.

Spell checking based on whole titles is a fast way to correct long queries, and is also a way to check the whole query instead of checking individual words or adjacent words (as with phrases and context).

Prefix suggestions
The prefix index is used to provide as-you-type suggestions using AJAX. The idea mainly comes from Julien Lemoine. His engine has been reimplemented using a Lucene index instead of a trie, with a couple of modifications to avoid multiple redirects to the same article and to provide suggestions for an arbitrary subset of namespaces.
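The two modifications can be illustrated with a small Python sketch. Note that it uses a sorted list with binary search, which is neither the trie of the original engine nor the Lucene index used here; the entry layout (title, namespace, target article, rank) and the ordering of suggestions by rank are assumptions made for the example.

```python
from bisect import bisect_left


def build_prefix_index(entries):
    """entries: iterable of (title, namespace, target article, rank).
    Redirects are entries whose target differs from their own title."""
    return sorted((title.lower(), ns, target, rank)
                  for title, ns, target, rank in entries)


def prefix_suggestions(index, prefix, namespaces, limit=10):
    """Suggest target articles for a typed prefix within the given namespaces."""
    prefix = prefix.lower()
    best = {}
    for key, ns, target, rank in index[bisect_left(index, (prefix,)):]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        if ns in namespaces:
            # Several redirects to one article collapse into a single suggestion.
            best[target] = max(rank, best.get(target, rank))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [target for target, _ in ranked[:limit]]
```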

Article rank
The rank of an article is the logarithm of the number of links that the article and all of its redirects receive. This is a measure of how general an article is. A comparison with PageRank is in order: while the structure of the WWW is almost completely free-form, the structure of links within an encyclopedia is dictated by how general a certain article is. This is why very general articles, like years and country names, have very high rank. However, they are by no means the most popular or relevant articles for the content they hold (e.g. names of people who died in a certain year).
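As a minimal Python sketch of this definition: the input dictionaries, the natural logarithm and the added 1 (which keeps the rank defined for articles with no incoming links) are assumptions of the example, since the page does not specify them.

```python
import math


def article_rank(inlinks, redirects, title):
    """Logarithm of the links received by an article and all of its redirects.

    inlinks:   page title -> number of incoming links
    redirects: article title -> titles of the redirects pointing to it
    """
    total = inlinks.get(title, 0)
    total += sum(inlinks.get(name, 0) for name in redirects.get(title, ()))
    return math.log(1 + total)  # the +1 is an assumption, avoiding log(0)
```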

Context-free synonyms
WordNet defines a dictionary of synsets: sets of synonyms that can be used interchangeably in some context. From this database, I extracted only those words and phrases that belong to exactly one synset, and can thus be interchanged in any context. There are about 17000 such synsets. Here are a few examples: (nineteen 19) (airstrip flight_strip landing_strip) (nonsweet sugarless) (uk u.k. united_kingdom britain)...
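The extraction can be sketched in Python as follows, assuming the synsets are available as plain lists of words. Dropping synsets that are left with fewer than two unambiguous words is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def context_free_synsets(synsets):
    """Keep only the words that occur in exactly one synset."""
    membership = Counter(word for synset in synsets for word in set(synset))
    result = []
    for synset in synsets:
        unambiguous = sorted(w for w in set(synset) if membership[w] == 1)
        if len(unambiguous) >= 2:  # a single word has nothing to be swapped with
            result.append(unambiguous)
    return result
```

In a toy input where "bank" appears in two synsets, both of those synsets lose "bank" and are then dropped, while (nineteen 19) survives intact.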

Related articles
Article A is said to be related to article B if there is an article C that links to both A and B. The number and relative weight of links, together with their closeness within the referring articles, are used to infer how related A and B are. To put it simply, related articles are basically frequently co-occurring links.
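A rough Python sketch of co-occurrence counting is given below. The weighting (each referring article contributes the inverse of the distance between the two links) is an assumption chosen for illustration; the actual weights are not described here, and unlike the real measure this sketch is symmetric.

```python
from collections import defaultdict
from itertools import combinations


def relatedness(outlinks):
    """outlinks: referring article -> ordered list of the articles it links to.
    Returns a score for every pair of articles linked from a common article."""
    scores = defaultdict(float)
    for targets in outlinks.values():
        first_pos = {}
        for pos, target in enumerate(targets):
            first_pos.setdefault(target, pos)  # keep the first occurrence only
        for a, b in combinations(sorted(first_pos), 2):
            gap = abs(first_pos[a] - first_pos[b])  # at least 1 for distinct links
            scores[(a, b)] += 1.0 / gap  # closer links count more
    return scores
```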

Edit distance
We use a variation of the standard edit distance with the following modifications:
 * swapping two adjacent chars is of distance 1 (e.g. teh -> the )
 * repeating a character is of distance 0 (e.g. thhee -> the )
 * deleting/changing the first letter is of distance 2 (e.g. cat -> rat )
 * other (addition/change/deletion) are of distance 1
 * additional flag for exact match (e.g. to distinguish met and meet, etc)
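These rules can be sketched in Python as a dynamic-programming edit distance with custom costs. This is an interpretation, not the extension's code: treating insertions like deletions (so the measure is symmetric), applying the first-letter penalty to either word, and returning the exact-match flag as a second value are all assumptions of the sketch.

```python
def gap_cost(s, k):
    """Cost of deleting (or inserting) the k-th character of s, 1-based."""
    if k >= 2 and s[k - 1] == s[k - 2]:
        return 0  # a repeated character is free
    return 2 if k == 1 else 1  # touching the first letter costs 2


def modified_distance(a, b):
    """Return (distance, exact_match) under the rules listed above."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + gap_cost(a, i)
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + gap_cost(b, j)
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                change = 0
            else:
                change = 2 if (i == 1 or j == 1) else 1
            best = min(d[i - 1][j] + gap_cost(a, i),   # delete from a
                       d[i][j - 1] + gap_cost(b, j),   # insert from b
                       d[i - 1][j - 1] + change)       # keep or change
            # Swap of two adjacent characters.
            if i >= 2 and j >= 2 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            d[i][j] = best
    return d[-1][-1], a == b
```

Here `modified_distance("met", "meet")` has distance 0 but is not flagged as exact, which is how the two words are told apart.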