User:Rainman/search internals

This page describes the scoring scheme and other algorithms used in Extension:lucene-search, as implemented in version 2.1.

Introduction
Lucene-search is a Java wrapper of about 40,000 lines around the low-level Lucene search API. It extends basic search functionality with a scalable architecture, link analysis, advanced spell checking, custom scoring schemes, and other features. It was written for use with MediaWiki, with the needs of Wikimedia projects in mind.

Scoring
Scoring is divided into three classes: correct article, relevant article, and matched article. These are gradations of relevance, put into separate classes because they are calculated in distinct ways. Every article returned to the user must fit into the third class, i.e. match the query, while the first two classes give additional score boosts.

Correct article
When the query matches (from higher to lower score):
 * exactly an article title
 * exactly a redirect to an article
 * complete title with words out of order
 * complete title by ignoring stop words, in/out of order (e.g. q:france history -> history of france)
 * complete title when both query and title nouns are converted into singular form
 * complete title or redirect when expanded with WordNet context-free synonyms (e.g. q:surname -> family name)
 * complete title or redirect by omitting accents and non-letters (e.g. q:test -> .test)

Scores are additionally ordered by article rank, thus some of the lower subclasses might dominate the higher ones if the article has a very high rank.
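
The tiers listed above can be pictured as a cascade of progressively looser comparisons. The following is only a minimal sketch, not the extension's actual code: the function names, the tier labels and the tiny stop-word list are made up for illustration, and the redirect, singular-form and WordNet tiers are left out.

```python
import unicodedata

# Illustrative stop-word list; the real list used by lucene-search is not given here.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def words(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def strip_accents_and_nonletters(text):
    """Drop accents and every character that is neither a letter nor a space."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    kept = "".join(ch for ch in decomposed if ch.isalpha() or ch.isspace())
    return kept.split()


def title_match_tier(query, title):
    """Return a label for the first (strongest) tier that matches, or None."""
    q, t = words(query), words(title)
    if q == t:
        return "exact"
    if sorted(q) == sorted(t):
        return "reordered"
    q_content = [w for w in q if w not in STOP_WORDS]
    t_content = [w for w in t if w not in STOP_WORDS]
    if q_content and sorted(q_content) == sorted(t_content):
        return "stop_words_ignored"
    if sorted(strip_accents_and_nonletters(query)) == sorted(strip_accents_and_nonletters(title)):
        return "accents_nonletters_ignored"
    return None


print(title_match_tier("france history", "History of France"))
print(title_match_tier("test", ".test"))
```

In a real ranking the tier would only set a base score, which is then combined with article rank as described above.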

This class contains the most straightforward search results, the ones a user would certainly expect to see (sorted according to how general the article is). The WordNet synonym database gives some capacity to rephrase queries into equivalent terms, but it is incomplete. To complete the "correct article" class, linguistic rephrasing should be implemented, e.g. a transformation of queries like q:population of africa into q:african population.

Relevant article
The relevance of an article is measured as a product of:
 * 1) score based on positional information of query terms in the article
 * 2) relevance assessment based on context estimated via related articles and article title and redirects.

Thus, we use two independent measures: one local, concerning only the article, and the other global, inferred from link analysis. Their product can be thought of as a covariance with zero expected value.

Positional information of terms within the article is evaluated as follows (from higher to lower score):
 * terms are found in exact order at the very beginning of the article
 * terms are found out of order at the beginning of the article
 * terms are found on the first page of the article (first 500 words / until the first section heading), either in order or out of order
 * terms are found in section headings
 * terms are found in order somewhere in the text
 * terms are found out of order, but close to each other somewhere in the text

The exact-order matches typically require all words from the query, while the sloppy ones require only the words that are not stop words. The above scheme is repeated for queries rewritten with WordNet synonyms. Matches at the beginning of the article are additionally scaled by article rank. The scheme assumes that the most relevant information about an article is mentioned first. This seems to be true for encyclopedias, although it might not hold for other kinds of data. Infoboxes do not count as beginning-of-the-article text.
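
A rough way to picture the positional tiers is a function that checks where the query terms first occur, in order or within a small window. This is a sketch under stated assumptions, not the actual Lucene queries used: the helper names, the tier labels and the `slop` window size are hypothetical, section headings and stop-word handling are ignored, and only the 500-word first-page limit is taken from the text.

```python
def find_exact(tokens, terms):
    """Index of the first contiguous, in-order occurrence of terms, or -1."""
    n = len(terms)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == terms:
            return i
    return -1


def find_sloppy(tokens, terms, slop):
    """Index of the first window of len(terms) + slop tokens holding every term, or -1."""
    needed = set(terms)
    width = len(terms) + slop
    for i in range(len(tokens)):
        if needed <= set(tokens[i:i + width]):
            return i
    return -1


def positional_tier(tokens, terms, first_page=500, slop=3):
    """Return a label for the strongest positional tier, or None if nothing matches."""
    exact = find_exact(tokens, terms)
    sloppy = find_sloppy(tokens, terms, slop)
    if exact == 0:
        return "in order at the very beginning"
    if sloppy == 0:
        return "out of order at the beginning"
    if 0 <= sloppy < first_page:
        return "on the first page"
    if exact >= 0:
        return "in order in the text"
    if sloppy >= 0:
        return "out of order but close in the text"
    return None


article = "douglas adams was an english author".split()
print(positional_tier(article, ["adams", "douglas"]))
```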

Link-analysis relevance is recalculated periodically. For each article, we calculate how related it is to every article that links to it, which yields a list of the most related articles linking to it. For instance, for Douglas Adams, the most relevant article is The Hitchhiker's Guide to the Galaxy. The reverse is also true in this case, although relatedness is often not symmetrical. This means that a query for the hitchhikers guide to the galaxy will yield extra score for the article Douglas Adams (and in this case vice versa). The extra score depends on how strong the relatedness is.

The other global measure, in addition to related article names, is article rank (if the title or part of the title matches) or redirect rank (if a redirect or part of it matches). In both cases, the score depends on how well the query matches the title (single words, in-order phrases, whole query, with or without stop words, synonyms...).

Matched article
All hits need to match the terms in the query. The match clause ensures this is the case, but it contributes very little to the overall score and is used to order results only if the above two classes yield no additional ranking. It uses a somewhat customized version of the Lucene formula.

Additional scores
Depending on configuration, some wikis receive additional boosts based on the last edit of the article (e.g. mediawiki.org, metawiki and wikinews). This helps filter out very old and no longer interesting content, although more sophisticated age and trust algorithms could also be used. Different namespaces receive different score boosts, e.g. talk namespaces typically get 50% lower scores, as do subpages.

Spell checking
Spell checking is done on several levels: whole titles, phrases, context and words. It proved hard to implement spell checking efficiently, especially for (groups of) shorter words, since their edit distance neighborhood is huge. The spell check engine relies on actual search results and highlighted text to help disambiguate various cases, as explained below.

Spell checking words
Words are suggested based on an n-gram index of words. All indexed words are split into single characters and into groups of two and three adjacent letters. This keeps the word index quite minimal and makes it possible to check very short words, but it is also very slow. Typically, a query is made on this n-gram index, and the top few hundred hits are then evaluated using a modified edit distance and metaphones. Metaphones are used only to reject words: since metaphone calculation can be noisy and is not always correct, we only require that the spell-checked word doesn't have a completely different metaphone from the typed-in one. The words are extracted from full-text content for the main namespace, and from titles for other namespaces. This enables namespace-specific suggestions.

The word with the lowest edit distance and the highest frequency is suggested as the spell-checked word.
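
To illustrate the n-gram index, here is a minimal sketch of how a word might be split into 1-, 2- and 3-grams and how candidate words could be ranked by shared n-grams before the edit-distance and metaphone checks. The function names and the overlap-count ranking are assumptions made for this example; the real index is a Lucene index with its own scoring.

```python
from collections import Counter


def word_ngrams(word, max_n=3):
    """Split a word into single characters, then 2-grams, then 3-grams."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


def ngram_candidates(query, vocabulary, top=3):
    """Rank vocabulary words by the number of n-grams they share with the query."""
    query_grams = Counter(word_ngrams(query))

    def overlap(word):
        return sum((query_grams & Counter(word_ngrams(word))).values())

    return sorted(vocabulary, key=lambda w: (-overlap(w), w))[:top]


print(word_ngrams("the"))
print(ngram_candidates("angles", ["angeles", "angels", "noble"], top=2))
```

The candidates returned this way would still have to pass the edit-distance, metaphone and frequency checks described above.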

Spell checking phrases
Sometimes two words can each be correct alone, but together form a misspelling, e.g. los angles (los angeles) or noble prize (nobel prize). We break down all content (for the main namespace) and titles (for other namespaces) into phrases anchored in non-stop-words. For instance, the text "together form a misspell" is broken into the phrases "together form" and "form a misspell". The same is done for the query the user entered. For every word, similar words are found (using single-word spell checking), and then phrases containing these words are searched for. This works much faster than using an n-gram index for all phrases, since there is a huge number of different phrases.
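
The phrase splitting can be sketched as follows. This is a minimal illustration that reproduces the example from the paragraph above; the function name and the small stop-word list are assumptions, not the ones used by the extension.

```python
# Illustrative stop-word list, assumed for this example only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def anchored_phrases(text):
    """Return phrases running from each non-stop-word to the next non-stop-word."""
    tokens = text.lower().split()
    anchors = [i for i, tok in enumerate(tokens) if tok not in STOP_WORDS]
    return [" ".join(tokens[start:end + 1]) for start, end in zip(anchors, anchors[1:])]


print(anchored_phrases("together form a misspell"))
```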

Checking based on phrases does a decent job of finding common misspellings (when one form at distance=1 is 100 times more frequent than the other), and of spell-checking common phrases and simple linguistic constructs.

Prefix suggestions
todo

Highlighting && interwiki stuff - (maybe) todo - this is more technical...

Article rank
The rank of an article is the logarithm of the number of links that the article and all of its redirects receive. This is a measure of how general an article is. A comparison with PageRank is in order: while the structure of the WWW is almost completely free-form, the structure of links within an encyclopedia is dictated by how general a certain article is. This is why very general articles, like years and country names, have very high rank. However, they are by no means the most popular or relevant articles for the content they hold (e.g. names of people who died in a certain year).
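
As a small sketch of the rank definition: the function name, the natural logarithm and the +1 inside it (which keeps articles without inbound links at rank 0) are assumptions made here, since the text does not specify the base or how zero links are handled.

```python
import math


def article_rank(links_to_article, links_to_redirects):
    """Logarithm of all inbound links to the article and to each of its redirects."""
    total = links_to_article + sum(links_to_redirects)
    return math.log(1 + total)  # the +1 is an assumption to avoid log(0)


# Hypothetical link counts: a year article versus a niche article.
print(article_rank(25000, [300, 40]))
print(article_rank(12, []))
```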

Context-free synonyms
WordNet defines a dictionary of synsets, i.e. sets of synonyms that can be used interchangeably in some context. From this database, I extracted only those words and phrases that belong to exactly one synset, and thus can be interchanged in any context. There are about 17000 such synsets. Here are a few examples: (nineteen 19) (airstrip flight_strip landing_strip) (nonsweet sugarless) (uk u.k. united_kingdom britain)...
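
The extraction can be sketched as a filter over a list of synsets. This is one possible reading of the rule, not the original extraction script: the function name is hypothetical, the input is toy data rather than the real WordNet database, and dropping synsets left with fewer than two words is an assumption.

```python
from collections import Counter


def context_free_synsets(synsets):
    """Keep only words that occur in exactly one synset; drop synsets left with < 2 words."""
    membership = Counter(word for synset in synsets for word in set(synset))
    result = []
    for synset in synsets:
        kept = [word for word in synset if membership[word] == 1]
        if len(kept) >= 2:
            result.append(kept)
    return result


# Toy input: "bank" appears in two synsets, so it is context dependent and gets removed.
toy_synsets = [
    ["nineteen", "19"],
    ["bank", "riverbank"],
    ["bank", "depository"],
    ["nonsweet", "sugarless"],
]
print(context_free_synsets(toy_synsets))
```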

Related articles
Article A is said to be related to article B if there is an article C that links to both A and B. The number and relative weight of such links, together with their closeness within the referring articles, are used to infer how related A and B are. To put it simply, related articles are frequently co-occurring links.
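
The co-occurrence idea can be sketched with plain counting. This is a deliberately simplified illustration: the function names and the link data are invented, link weights and closeness within the referring article are ignored, and the resulting count is symmetric, unlike the actual relatedness measure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(outlinks):
    """Count, for every pair of articles, how many articles link to both."""
    counts = Counter()
    for targets in outlinks.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


def most_related(article, counts, top=5):
    """Articles that most often co-occur with the given article as link targets."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == article:
            scores[b] += n
        elif b == article:
            scores[a] += n
    return [title for title, _ in scores.most_common(top)]


# Hypothetical referring articles C1..C3 and their outgoing links.
links = {
    "C1": ["Douglas Adams", "The Hitchhiker's Guide to the Galaxy"],
    "C2": ["Douglas Adams", "The Hitchhiker's Guide to the Galaxy", "Towel"],
    "C3": ["The Hitchhiker's Guide to the Galaxy", "Towel"],
}
print(most_related("Douglas Adams", cooccurrence_counts(links)))
```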

Edit distance
We use a variation of the standard [[w:Edit distance|edit distance]] with the following modifications:
 * swapping two adjacent characters has distance 1 (e.g. distance=1 for teh -> the )
 * repeating a character has distance 0 (e.g. distance=0 for thhee -> the )
 * deleting or changing the first letter has distance 2 (e.g. distance=2 for cat -> rat )
 * any other addition, change or deletion has distance 1
 * an additional flag marks exact matches (e.g. to distinguish met from meet)
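
One way to realize these rules is a restricted Damerau-Levenshtein computation with adjusted costs. The sketch below is an assumption-laden illustration, not the extension's implementation: the function names are made up, repeated characters are handled by collapsing runs before comparing, inserting a first letter is charged like deleting one, and a swap always costs 1 even when it involves the first letter.

```python
def collapse_repeats(word):
    """Collapse runs of the same character, so repeats cost nothing (thhee -> the)."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def modified_edit_distance(a, b):
    """Return (distance, exact) where exact is True only for identical input words."""
    exact = a == b
    a, b = collapse_repeats(a), collapse_repeats(b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + (2 if i == 1 else 1)
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + (2 if j == 1 else 1)
    for i in range(1, rows):
        for j in range(1, cols):
            # Edits touching the first letter of either word cost 2, others cost 1.
            delete = d[i - 1][j] + (2 if i == 1 else 1)
            insert = d[i][j - 1] + (2 if j == 1 else 1)
            if a[i - 1] == b[j - 1]:
                change = d[i - 1][j - 1]
            else:
                change = d[i - 1][j - 1] + (2 if i == 1 or j == 1 else 1)
            best = min(delete, insert, change)
            # Swap of two adjacent characters costs 1.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            d[i][j] = best
    return d[-1][-1], exact


for pair in [("teh", "the"), ("thhee", "the"), ("cat", "rat"), ("met", "meet")]:
    print(pair, modified_edit_distance(*pair))
```

Collapsing repeats makes met and meet indistinguishable by distance alone, which is why the exact-match flag is returned alongside the distance.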