User:OrenBochman/Search/Spec

=Lucene Search 'Spec'=

This section will attempt to outline the existing search engine as an informal spec, with criticism in the body as comment, or questions. Please provide additional information/corrections as you are able.


 * MWSearch is the gateway between mediaWiki and Lucen-search
 * Listens on port 8123 for search requestd
 * Listens on port 8321 for index updates

Features

 * Distributed index - due to size the index is distributed on multiple machines.
 * Offline Indexing - starts by indexing a XML_dump and produces:
 * a Search index
 * Q. with what fields, boosting?
 * a Highlight index
 * Q. is this necessary with document term vectors now available?
 * Q. with what fields, boosting?
 * Spellcheck indexes - support for did you mean
 * Fields
 * 2-Grams of wikipedia fulltext of minimum and maximum
 * All titles
 * Boosting for
 * Titles
 * Section Headers
 * First Paragraph
 * Redirects
 * In which source file is the queay cooking formula at?


 * Online indexing & search - use lsearchd.
 * Ranking Algoritmss :
 * Did you mean :
 * Updated front end.

Search In the Cluster
base on:, and

The Ranking Algoritm

 * Ranking system :
 * PageRank-like algorithm in the sense of reference-to-article counting.
 * it may not be so great if one indexes only a wikipedia since
 * the links graph is too sparse for specialist pages.
 * few page are link hogs (e.g. year 1945)
 * an effective pagerank also needs a good map reduce to work fast.

Did you mean Algorithm

 * Did you mean - queary correction (phrase and words)
 * Q. What information is important or representative of article? (often more informative than PageRank)
 * beginning of articles,
 * redirects,
 * words used to refer to article,
 * section captions


 * Q. what disambiguates the article from related terms is its context?
 * extracted frequently co-occuring article titles in all of wikipedia to extract article association.


 * no open source "Did you mean..." engine at that time. (there are now)
 * There are programs like aspell, but all of them spell-check only single words.
 * the algorithm is 2-gram of all words in the language, with frquency thresholds (min and max).
 * would be improved by a a language model (morphology + semantics)
 * can fix some simple errors, but is not powerful enough.
 * added scoring via heuristics.
 * added special score to boost 2-grams that are in titles,
 * added whole titles,
 * "fuzzy" 2-grams of words that might provide context for a title words, by taking all words from redirects and links in first paragraph of article. (PLEASE CLARIFY)
 * the search results to see if the rare spelling a user entered is significant

Solr may enable to dump code for:
 * configuration
 * maintain consistent copies of split indexes
 * smooth updates from indexer to searchers


 * Contact rainman aka Robert Stojnić rainman-sr who Developed Extension:Lucene-search. and Maintained the search servers.
 * Rainman/search_internals
 * (Consult his thesis)


 * Consult the unit test
 * Consult the API
 * Consult search related bus
 * Write a spec