User:Rainman/search internals

This page describes the scoring scheme and other algorithms used in Extension:lucene-search, as implemented in version 2.1.

Introduction
Lucene-search is a Java wrapper of about 40,000 lines around the low-level Lucene search API. It extends basic search functionality with a scalable architecture, link analysis, spell-checking, custom scoring schemes, and more. It was written for use with MediaWiki, with the needs of Wikimedia projects in mind.

Scoring
We divided scoring into three classes: correct article, relevant article, and matches query. These are gradients of relevance, put into separate classes because they are calculated in distinct ways. Every article returned to the user must fit into the third class and match the query, while the first two classes give additional score boosts.

 * correct article

When the query matches (from higher to lower score):
 * exactly an article title
 * exactly a redirect to an article
 * complete title with words out of order
 * complete title by ignoring stop words, in/out of order (e.g. q:france history -> history of france)
 * complete title when both query and title nouns are converted into singular form
 * complete title or redirect when expanded with WordNet context-free synonyms (e.g. q:surname -> family name)
 * complete title or redirect by omitting accents and non-letters (e.g. q:test -> .test)

Scores are additionally ordered by article rank, so some of the lower subclasses may dominate the higher ones if the article has a very high rank.

This class contains the most straightforward search results, the ones a user would certainly expect to see (sorted according to how general the article is). The WordNet synonym database gives some capacity to rephrase queries into equivalent terms, but it is incomplete. To complete the "correct article" class, linguistic rephrasing should be implemented, e.g. a transformation of queries like q:population of africa into q:african population.
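To make the ordering of subclasses concrete, here is a minimal Python sketch of a few of the title checks listed above. It is only an illustration, not the code used in lucene-search: the function names, the stop-word list, and the class labels are made up for this example, and redirects, singular forms, WordNet synonyms, and article rank are left out.

```python
import unicodedata

# Hypothetical stop-word list; the real list is not given on this page.
STOP_WORDS = {"of", "the", "in", "a", "an", "and"}

def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def content_words(words):
    """Sorted words with stop words removed (order is ignored)."""
    return sorted(w for w in words if w not in STOP_WORDS)

def fold(text):
    """Drop accents and non-letter characters, keeping spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if c.isalpha() or c.isspace())

def title_match_class(query, title):
    """Return the first (highest-scoring) subclass that matches, or None."""
    q, t = tokens(query), tokens(title)
    if q == t:
        return "exact"
    if sorted(q) == sorted(t):
        return "out_of_order"
    if content_words(q) and content_words(q) == content_words(t):
        return "stop_words_ignored"
    if tokens(fold(query)) == tokens(fold(title)):
        return "accents_and_non_letters_ignored"
    return None
```

With this sketch, `title_match_class("france history", "History of France")` falls into the stop-word subclass, and `title_match_class("test", ".test")` into the last one, matching the two examples in the list.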


 * relevant article

The relevance of an article is measured as the product of:
 * 1) a score based on the positional information of query terms in the article
 * 2) a relevance assessment based on context, estimated from related articles, the article title, and redirects.

Thus, we have two independent measures: a local one, which concerns only the article, and a global one, which is inferred from link analysis. Their product can be thought of as a covariance with zero expected value.

Positional information of terms within article is evaluated as follows (from higher to lower score):
 * terms are found in exact order at the very beginning of the article
 * terms are found out of order at the beginning of the article
 * terms are found on the first page of the article (first 500 words / until first section heading), either out of order, or in order
 * terms are found in section headings
 * terms are found in order somewhere in the text
 * terms are found out of order, but close to each other somewhere in the text

The exact-order matches typically require all words from the query, while the sloppy ones require only the words that are not stop words. The above scheme is repeated for queries rewritten with WordNet synonyms. Matches at the beginning of the article are additionally scaled by article rank. The scheme assumes that the most relevant information about an article's subject is mentioned first. This seems to be true for encyclopedias, though it might not hold for other kinds of data. Infoboxes do not count as beginning-of-the-article text.
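The positional tiers above can be sketched in Python as follows. This is a simplified illustration under several assumptions of mine, not the actual scorer: "the very beginning" is taken to mean the first words of the body, the first page is cut at 500 words only, the `WINDOW` size for "close to each other" is invented, and stop words, synonyms, and rank scaling are ignored. All names are hypothetical.

```python
FIRST_PAGE_WORDS = 500  # "first 500 words" from the list above
WINDOW = 10             # hypothetical: how close counts as "close to each other"

def contains_phrase(words, phrase):
    """True if `phrase` occurs in `words` as a contiguous, in-order run."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def contains_nearby(words, terms, window):
    """True if some span of `window` consecutive words holds every term."""
    needed = set(terms)
    return any(needed <= set(words[i:i + window]) for i in range(len(words)))

def positional_tier(query, body, headings):
    """Return 1 (best) to 6 for the first tier that matches, or None."""
    q = query.lower().split()
    words = body.lower().split()
    if not q:
        return None
    start = words[:len(q)]
    if start == q:
        return 1  # exact order at the very beginning
    if sorted(start) == sorted(q):
        return 2  # out of order at the beginning
    if set(q) <= set(words[:FIRST_PAGE_WORDS]):
        return 3  # all terms on the first page, in any order
    if any(set(q) <= set(h.lower().split()) for h in headings):
        return 4  # all terms in one section heading
    if contains_phrase(words, q):
        return 5  # in order somewhere in the text
    if contains_nearby(words, q, WINDOW):
        return 6  # out of order, but close together
    return None
```

A real scorer would turn each tier into a numeric boost and combine it with the link-analysis score; the sketch only shows which tier a query falls into.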

Link-analysis relevance is recalculated periodically.

Article rank
The rank of an article is the logarithm of the number of links that the article and all of its redirects receive. This is a measure of how general an article is. A comparison with PageRank is in order: while the structure of the WWW is almost completely free-form, the structure of links within an encyclopedia is dictated by how general a certain article is. This is why very general articles, such as years and country names, have a very high rank. However, they are by no means the most popular or relevant articles for the content they hold (e.g. names of people who died in a certain year).
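As a rough sketch of this definition in Python: the natural logarithm and the `1 +` offset (which keeps unlinked articles at rank 0 instead of taking the logarithm of zero) are my assumptions, since the page does not state the base or how zero links are handled. The function and argument names are hypothetical.

```python
import math

def article_rank(title, inlinks, redirects):
    """Logarithm of the links received by an article and its redirects.

    inlinks:   dict mapping a page title to its number of incoming links
    redirects: dict mapping an article title to the titles redirecting to it
    """
    total = inlinks.get(title, 0)
    total += sum(inlinks.get(r, 0) for r in redirects.get(title, []))
    # Assumption: natural log with a +1 offset, so zero links gives rank 0.
    return math.log(1 + total)
```

Because of the logarithm, an article needs roughly ten times as many links to gain a fixed step in rank, which keeps very heavily linked pages from swamping every other signal.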

Context-free synonyms
WordNet defines a dictionary of synsets: sets of synonyms that can be used interchangeably in some context. From this database, I extracted only those words and phrases that belong to exactly one synset and thus can be interchanged in any context; there are about 17000 such synsets. Here are a few examples: (nineteen 19) (airstrip flight_strip landing_strip) (nonsweet sugarless) (uk u.k. united_kingdom britain)...
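The extraction step can be sketched in Python like this. It works on plain lists of words rather than on the WordNet database, and it reflects my reading of the rule: a word is kept only if it appears in exactly one synset, and a synset is kept only if at least two such words remain. The function name is hypothetical.

```python
from collections import defaultdict

def context_free_synsets(synsets):
    """Keep words that belong to exactly one synset.

    synsets: list of lists of words. Synsets left with fewer than two
    words are dropped, since they no longer offer a synonym.
    """
    membership = defaultdict(int)
    for synset in synsets:
        for word in set(synset):
            membership[word] += 1

    result = []
    for synset in synsets:
        unambiguous = [w for w in synset if membership[w] == 1]
        if len(unambiguous) >= 2:
            result.append(unambiguous)
    return result
```

For instance, if a word such as "bank" appeared in two different synsets (a made-up input, not taken from WordNet), it would be removed from both, while a synset like (nineteen 19) would pass through unchanged.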

Related articles
Article A is said to be related to article B if there is an article C that links to both A and B. The number and relative weight of the links, together with their closeness within the referring articles, are used to infer how related A and B are. To put it simply, related articles are frequently co-occurring links.
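A minimal Python sketch of the co-occurrence idea follows. It only counts how many referring articles link to both A and B; the relative weight of links and their closeness within the referring article, which the real measure also uses, are deliberately left out. The function names and the input format are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for each pair of articles, how many pages link to both.

    outlinks: dict mapping a referring article to the list of its link targets
    """
    counts = Counter()
    for targets in outlinks.values():
        # Each pair is counted once per referring article, in sorted order.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

def related(counts, title):
    """Articles co-linked with `title`, most frequently co-linked first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == title:
            scores[b] += n
        elif b == title:
            scores[a] += n
    return [other for other, _ in scores.most_common()]
```

For example, if two referring articles each link to both "France" and "Paris", the pair gets a count of 2, and "Paris" is listed ahead of any article that shares only one referring article with "France".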