User:OrenBochman/Search/Spec

=Introduction=

Before proposing the NG Search, I decided to review the current search engine, Lucene-search, developed by Rainman.

=Lucene Search 'Spec'=

This section attempts to outline the existing search engine as an informal spec, with criticism in the body as comments or questions. Please provide additional information or corrections as you are able.


 * MWSearch is the gateway between MediaWiki and Lucene-search.
 * Listens on port 8123 for search requests.
 * Listens on port 8321 for index updates.

User Guide

 * http://en.wikipedia.org/wiki/Help:Searching#Search_engine_features

Lucene Search Bugs

 * Bugzilla

Features

 * Distributed index - due to size the index is distributed on multiple machines.
 * Offline Indexing - starts by indexing an XML dump and produces:
 * a Search index
 * Q. with what fields, boosting?
 * a Highlight index
 * Q. is this necessary with document term vectors now available?
 * Q. with what fields, boosting?
 * Spellcheck indexes - support for did you mean
 * Fields
 * 2-grams of Wikipedia full text, kept only between a minimum and a maximum frequency threshold
 * All titles
 * Boosting for
 * Titles
 * Section Headers
 * First Paragraph
 * Redirects
 * Q. In which source file is the query cooking formula?


 * Online indexing & search - use lsearchd.
 * Ranking algorithms:
 * Did you mean:
 * Updated front end.

MediaWiki Cluster Configuration
based on:, and

Summary

 * Search Machines
 * Did You Mean
 * Highlighting Configuration

Raw Data

 * The English Wikipedia search index is split into 2 parts.

Questions:
 * 1) Is this a machine-readable file?
 * 2) If so, what reads it?

 [Database]
 {file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
 enwiki : (nssplit,2)
 enwiki : (nspart1,[0],true,20,500,2)
 enwiki : (nspart2,[],true,20,500)
 enwiki : (spell,40,10) (warmup,500)
 mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
 commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
 dewiki, frwiki : (spell,20,5)
 dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[])
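On the two questions above: the layout looks regular enough to be read by a program. The following is a minimal Python sketch, assuming each entry has the form `names : (option,args...) (option,args...)`. The helper names `split_args` and `parse_db_entry` are hypothetical, and this is not the parser lsearchd itself uses.

```python
import re

# Matches one parenthesised option group, e.g. "(nspart1,[0],true,20,500,2)".
OPTION_RE = re.compile(r"\(([^()]*)\)")

def split_args(body):
    """Split on commas that are not inside [...] brackets."""
    args, depth, current = [], 0, ""
    for ch in body:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args

def parse_db_entry(line):
    """Hypothetical helper: return (wiki names, [(option, args), ...]) for one entry."""
    names_part, _, options_part = line.partition(" : ")
    names = [n.strip() for n in names_part.split(",")]
    options = []
    for match in OPTION_RE.finditer(options_part):
        name, *args = split_args(match.group(1))
        options.append((name, args))
    return names, options

names, options = parse_db_entry(
    "dewiki, frwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[])"
)
print(names)
print(options)
```

Arguments are kept as strings here; what each position means (for example the numbers in `(spell,40,10)`) would still need to be confirmed against the Lucene-search source.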

The Ranking Algorithm

 * Ranking system :
 * PageRank-like algorithm, in the sense that it counts references to an article.
 * It may not work well when only a single Wikipedia is indexed, since
 * the link graph is too sparse for specialist pages.
 * a few pages are link hogs (e.g. the year 1945).
 * an effective PageRank also needs a good MapReduce implementation to run fast.
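The reference-counting idea above can be sketched as follows. This is a minimal Python illustration, assuming links arrive as (source, target) title pairs. The function names and the logarithmic damping are assumptions made for illustration, not the formula Lucene-search uses.

```python
import math
from collections import Counter

def inlink_counts(links):
    """Count distinct referring pages per target; links are (source, target) pairs."""
    unique = set(links)  # a page linking twice to the same target counts once
    return Counter(target for source, target in unique if source != target)

def rank_boost(count):
    """Hypothetical damping: log scale so link hogs do not swamp text relevance."""
    return 1.0 + math.log1p(count)

links = [
    ("World War II", "1945"),
    ("Yalta Conference", "1945"),
    ("Harry S. Truman", "1945"),
    ("World War II", "Yalta Conference"),
    ("World War II", "1945"),  # duplicate link, counted once
]
counts = inlink_counts(links)
for title, n in counts.most_common():
    print(title, n, round(rank_boost(n), 2))
```

Damping of this kind is one way to limit the effect of link hogs such as year pages; it does nothing for the sparse-graph problem on specialist pages.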

Did you mean Algorithm

 * Did you mean - query correction (phrases and words)
 * Q. What information is important or representative of article? (often more informative than PageRank)
 * beginning of articles,
 * redirects,
 * words used to refer to article,
 * section captions


 * Q. What disambiguates an article from related terms? Is it its context?
 * Frequently co-occurring article titles were extracted from all of Wikipedia to derive article associations.
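The co-occurrence extraction mentioned above might look roughly like this. It is a minimal Python sketch, assuming each article is represented by the list of article titles it mentions. `cooccurrence_counts` and the `min_count` threshold are hypothetical names, not taken from Lucene-search.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(mentions_per_article, min_count=2):
    """Count unordered title pairs that are mentioned together in the same article."""
    pairs = Counter()
    for titles in mentions_per_article:
        # sorted(set(...)) makes each pair unordered and counted once per article
        for a, b in combinations(sorted(set(titles)), 2):
            pairs[(a, b)] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

articles = [
    ["Java", "Indonesia", "Jakarta"],
    ["Java", "Programming language", "Sun Microsystems"],
    ["Jakarta", "Indonesia", "Java"],
]
related = cooccurrence_counts(articles)
print(related)
```

Pairs seen in only one article are dropped by the threshold, so "Java" stays associated with "Indonesia" and "Jakarta" in this toy data but not with "Sun Microsystems".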


 * There was no open source "Did you mean..." engine at that time (there are now).
 * There are programs like aspell, but all of them spell-check only single words.
 * the algorithm uses 2-grams of all words in the language, with frequency thresholds (min and max).
 * it would be improved by a language model (morphology + semantics).
 * can fix some simple errors, but is not powerful enough.
 * added scoring via heuristics.
 * added special score to boost 2-grams that are in titles,
 * added whole titles,
 * added "fuzzy" 2-grams of words that might provide context for a title's words, built from all words in redirects and in links in the first paragraph of the article. (PLEASE CLARIFY)
 * uses the search results to see whether the rare spelling a user entered is significant.
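The 2-gram approach with frequency thresholds and a title boost could be sketched as below. This is a minimal Python illustration that reads "2-gram" as a pair of adjacent words and uses `difflib` from the standard library for fuzzy matching. The function names, the threshold defaults, and the `title_boost` weight are all assumptions, not values from Lucene-search.

```python
from collections import Counter
import difflib

def build_bigram_index(texts, titles, min_freq=2, max_freq=1000, title_boost=5):
    """Score adjacent word pairs; keep those within the frequency thresholds."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    index = {" ".join(bg): n for bg, n in counts.items() if min_freq <= n <= max_freq}
    for title in titles:
        words = title.lower().split()
        for bg in zip(words, words[1:]):
            key = " ".join(bg)
            # Assumption: 2-grams from titles are always kept and get extra weight.
            index[key] = index.get(key, 0) + title_boost
    return index

def did_you_mean(query, index, cutoff=0.8):
    """Return the best-scoring indexed phrase close to the query, or None."""
    query = query.lower()
    if query in index:
        return None  # already a known phrase, nothing to suggest
    candidates = difflib.get_close_matches(query, list(index), n=5, cutoff=cutoff)
    if not candidates:
        return None
    return max(candidates, key=lambda c: index[c])

texts = [
    "the search engine indexes pages",
    "a search engine ranks pages",
    "search engine",
]
titles = ["Search engine", "Web search"]
index = build_bigram_index(texts, titles)
print(did_you_mean("serch engine", index))
```

Unlike a single-word checker such as aspell, the lookup here matches the whole two-word phrase, which is the point of indexing 2-grams. A real implementation would also need the context and search-result checks listed above.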

Solr may make it possible to drop the code for:
 * configuration
 * maintain consistent copies of split indexes
 * smooth updates from indexer to searchers


 * Contact Rainman (Robert Stojnić, rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
 * Rainman/search_internals
 * (Consult his thesis)


 * Consult the unit tests
 * Consult the API
 * Consult search-related bugs
 * Write a spec