User:OrenBochman/Search/Spec

=Introduction=

Before proposing the NG Search, I decided to review the current search engine, Lucene-search, developed by Rainman.

=Lucene Search 'Spec'=

This section outlines the existing search engine as an informal spec, with criticism included inline as comments or questions. Please add information or corrections as you are able.


 * MWSearch is the gateway between MediaWiki and Lucene-search.
 * Listens on port 8123 for search requests.
 * Listens on port 8321 for index updates.

User Guide

 * Search Engine Features

Lucene Search Bugs

 * Bugzilla

Features

 * Distributed index - due to its size, the index is distributed across multiple machines.
 * Offline indexing - starts by indexing an XML dump and produces:
 ** a search index
 *** Q. With what fields and boosting?
 ** a highlight index
 *** Q. Is this necessary now that document term vectors are available?
 *** Q. With what fields and boosting?
 ** spellcheck indexes - support for "did you mean"
 *** Fields
 **** 2-grams of the Wikipedia full text, with minimum and maximum frequency thresholds
 **** All titles
 *** Boosting for
 **** Titles
 **** Section headers
 **** First paragraph
 **** Redirects
 * Q. In which source file is the query cooking formula?


 * Online indexing & search - uses lsearchd.
 * Ranking algorithms:
 * Did you mean:
 * Updated front end.

=MediaWiki Cluster Configuration=

Based on:
 * Global Configuration Spec
 * The Global Configuration
 * lucene.php
 * luceneSearch

Summary

 * Search Machines
 * Did You Mean
 * Highlighting Configuration

[Database] Section Configuration Format

 * 1)  - declares that   is a single (nondistributed) index and that it should use English as its default language, and thus use English stemming.
 * 2)  imports a database list using a file
 * 3) The optional  instructs Lucene to apply 100 queries to the index when an updated version is fetched to warm it up. This enables smooth transition in performance and ensures indexes are always well cached and buffered.
 * 4) Make sure there are no spaces in the arguments (e.g. ). Spaces can cause the search, snapshot or index folders to fail to be created when building the index.
 * 5) wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[]) - declares that the database should be split (distributed) into 3 indexes according to namespaces, where shard 1, called nspart1, holds namespace 0; shard 2, called nspart2, holds namespaces 4, 5, 12 and 13; and shard 3, called nspart3, holds all other namespaces.
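
As a rough illustration of the entry format described above, the following sketch splits one [Database] line into its database name(s) and its parenthesised options. The class and method names are hypothetical and are not taken from lucene-search; the sketch also assumes the name and the options are separated by " : " with spaces on both sides, as in the examples on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of lucene-search. */
final class DbConfigLineSketch {
    // Matches one "(name,arg,...)" option; bracketed lists such as [4,5,12,13] contain no ')'.
    private static final Pattern OPTION = Pattern.compile("\\(([^)]*)\\)");

    /** Returns the database name(s) to the left of the " : " separator. */
    static String databases(String line) {
        return line.substring(0, separator(line)).trim();
    }

    /** Returns the option bodies to the right of the separator, without parentheses. */
    static List<String> options(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = OPTION.matcher(line.substring(separator(line) + 3));
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    // Looks for " : " rather than ":" so that {file:///...} entries are not split at the URL scheme.
    private static int separator(String line) {
        int at = line.indexOf(" : ");
        if (at < 0) {
            throw new IllegalArgumentException("no ' : ' separator in: " + line);
        }
        return at;
    }

    public static void main(String[] args) {
        String line = "wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])";
        System.out.println(databases(line));
        for (String option : options(line)) {
            System.out.println("  " + option);
        }
    }
}
```

Each returned option still has to be interpreted (split type, shard count, namespace list and so on); that step is left out because the exact semantics of every argument are not documented here.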

[Database] Raw Data
 [Database]
 # wikilucene : (single) (language,en) (warmup,0)
 # wikidev : (single) (language,sr)
 # wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])
 # wikilucene : (language,en) (warmup,10)
 # format:
 # database_name (, database_name)+ :([single|mainsplit|nssplit],[SHARD-COUNT],[TRUE|FALSE],[IDX_BUFFER_DOCS],[IDX_MERGE_FACTOR])
 (language,en) (warmup,10)
 {file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
 enwiki : (nssplit,2)
 enwiki : (nspart1,[0],true,20,500,2)
 enwiki : (nspart2,[],true,20,500)
 enwiki : (spell,40,10) (warmup,500)
 mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
 commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
 dewiki, frwiki : (spell,20,5)
 dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[])

[Database-Group] Configuration Format

 * TODO: research and document

[Database-Group] Raw Data
 sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
 mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])

[Search-Group] Configuration Format

 * TODO: research and document

[Search-Group] Raw Data
 [Search-Group]
 # Search hosts layout
 # search 1 (enwiki)
 search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2
 search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
 search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
 search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
 search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
 search8: enwiki.prefix #enwiki.spell
 search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
 search12: enwiki.spell
 search13: enwiki.nspart2*
 search13x: en-titles*
 search14: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
 search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
 search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl

 # search 2 (de,fr,jawiki)
 search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
 search6: itwiki.nspart1.hl
 search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
 search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
 search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl

 # search 3 (it,nl,ru,sv,pl,pt,es,zhwiki)
 search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
 search7: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 eswiki ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
 search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
 search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
 search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
 search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
 search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
 search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
 search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
 search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
 search17: ptwiki.nspart1.hl ptwiki.nspart2.hl

 # search 2-3 interwiki/spellchecks
 search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
 search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
 search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell
 search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell

 # search 4
 search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
 search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
 search11: commonswiki.nspart2
 search11: *?
 search11x: *tspart1 *tspart2
 search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
 search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl

 # prefix stuffs
 search18: *.prefix

 # stuffs to deploy in future
 searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
 searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
 # Indexers
 [Index]
 searchidx2: *

 # Rsync path where indexes are on hosts, after default value put
 # hosts where the location differs
 # Syntax: host :
 [Index-Path]
  : /search
 [OAI]
 simplewiki : http://simple.wikipedia.org/w/index.php
 rswikimedia : http://rs.wikimedia.org/w/index.php
 ilwikimedia : http://il.wikimedia.org/w/index.php
 nzwikimedia : http://nz.wikimedia.org/w/index.php
 sewikimedia : http://se.wikimedia.org/w/index.php
 alswiki : http://als.wikipedia.org/w/index.php
 alswikibooks : http://als.wikibooks.org/w/index.php
 alswikiquote : http://als.wikibooks.org/w/index.php
 alswiktionary : http://als.wiktionary.org/w/index.php
 chwikimedia : http://www.wikimedia.ch/w/index.php
 crhwiki : http://chr.wikipedia.org/w/index.php
 roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
 roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
 be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
 ukwikimedia : http://uk.wikimedia.org/w/index.php
 brwikimedia : http://br.wikimedia.org/w/index.php
 dkwikimedia : http://dk.wikimedia.org/w/index.php
 trwikimedia : http://tr.wikimedia.org/w/index.php
 arwikimedia : http://ar.wikimedia.org/w/index.php
 mxwikimedia : http://mx.wikimedia.org/w/index.php
 [Namespace-Boost]
 commonswiki : (0, 1) (6, 4)
  : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)
 # Global properties

 [Properties]
 # suffixes to database name, the rest is assumed to be language code
 Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia

 # Allow only up to 500 results per page
 Search.maxlimit=501

 # Age scaling based on last edit, default is no scaling
 # Below are suffixes (or whole names) with various scaling strength
 AgeScaling.strong=wikinews
 AgeScaling.medium=mediawikiwiki metawiki
 #AgeScaling.weak=wiki

 # Use additional per-article ranking data, more suitable for non-encyclopedias
 AdditionalRank.suffix=mediawikiwiki metawiki

 # suffix for databases that should also have exact-case index built
 # note: this will also turn off stemming!
 ExactCase.suffix=wiktionary jbowiki

 # wmf-style init file, attempt to read OAI and lang info from it
 # for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
 #WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
 #WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
 WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php

 # Where common images are
 Commons.wiki=commonswiki.nspart1
 # Syntax:  : 
 #  is a special keyword meaning all namespaces
 # E.g. all_talk : 1,3,5,7,9,11,13,15
 [Namespace-Prefix]
 all :
 [0] : 0
 [1] : 1
 [2] : 2
 [3] : 3
 [4] : 4
 [5] : 5
 [6] : 6
 [7] : 7
 [8] : 8
 [9] : 9
 [10] : 10
 [11] : 11
 [12] : 12
 [13] : 13
 [14] : 14
 [15] : 15
 [100] : 100
 [101] : 101
 [104] : 104
 [105] : 105
 [106] : 106
 [0,6,12,14,100,106]: 0,6,12,14,100,106
 [0,100,104] : 0,100,104
 [0,2,4,12,14] : 0,2,4,12,14
 [0,14] : 0,14
 [4,12] : 4,12

=The Algorithms=

The Ranking Algorithm

 * Ranking system:
 * A PageRank-like algorithm, in the sense that it counts references to an article.
 * It may not work so well when only a single Wikipedia is indexed, since
 * the link graph is too sparse for specialist pages;
 * a few pages are link hogs (e.g. the year 1945).
 * An effective PageRank also needs a good map-reduce implementation to run fast.
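
The "reference-to-article counting" mentioned above can be sketched as a simple inbound-link count. This is only an illustration under stated assumptions: the class name, the method name and the in-memory map of outgoing links are hypothetical, and the actual lucene-search ranking code may weight or normalise links differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of reference counting; not the lucene-search implementation. */
final class InlinkCountSketch {
    /** Counts, for every link target, how many distinct pages link to it. */
    static Map<String, Integer> countInlinks(Map<String, List<String>> outlinks) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<String>> page : outlinks.entrySet()) {
            page.getValue().stream()
                .distinct()                                       // repeated links from one page count once
                .filter(target -> !target.equals(page.getKey()))  // ignore self-links
                .forEach(target -> counts.merge(target, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("World War II", List.of("1945", "Europe"));
        outlinks.put("Europe", List.of("1945"));
        System.out.println(countInlinks(outlinks));
    }
}
```

A count like this makes the two weaknesses listed above visible: pages such as "1945" collect links from everywhere, while specialist pages receive almost none.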

Did You Mean? Algorithm

 * Did you mean - query correction (phrases and words)
 * Q. What information is important or representative of an article? (often more informative than PageRank)
 * beginning of articles,
 * redirects,
 * words used to refer to the article,
 * section captions


 * Q. Is an article disambiguated from related terms by its context?
 * extracted frequently co-occurring article titles across all of Wikipedia to derive article associations.


 * There was no open source "Did you mean..." engine at that time (there are now).
 * There are programs like aspell, but all of them spell-check only single words.
 * The algorithm uses 2-grams of all words in the language, with frequency thresholds (min and max).
 * It would be improved by a language model (morphology + semantics).
 * It can fix some simple errors, but is not powerful enough.
 * added scoring via heuristics.
 * added a special score to boost 2-grams that are in titles,
 * added whole titles,
 * "fuzzy" 2-grams of words that might provide context for title words, built by taking all words from redirects and links in the first paragraph of an article. (PLEASE CLARIFY)
 * checks the search results to see whether the rare spelling a user entered is significant
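
To make the "2-grams with frequency thresholds" idea concrete, here is a minimal sketch. It assumes that a 2-gram means a pair of adjacent words and that the thresholds are inclusive counts; neither assumption is confirmed by the notes above, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; assumes 2-grams are adjacent word pairs. */
final class BigramThresholdSketch {
    /** Counts adjacent word pairs in lower-cased, whitespace-separated text. */
    static Map<String, Integer> countBigrams(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps only the 2-grams whose frequency lies in [min, max]. */
    static Map<String, Integer> filter(Map<String, Integer> counts, int min, int max) {
        Map<String, Integer> kept = new TreeMap<>();
        counts.forEach((gram, n) -> {
            if (n >= min && n <= max) {
                kept.put(gram, n);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countBigrams("New York City is in New York");
        System.out.println(filter(counts, 2, 10));
    }
}
```

The minimum threshold drops pairs that are too rare to be trusted as suggestions, and the maximum drops pairs so common that they carry little information.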

Solr may make it possible to drop the custom code for:
 * configuration
 * maintain consistent copies of split indexes
 * smooth updates from indexer to searchers


 * Contact rainman, aka Robert Stojnić (rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
 * Rainman/search_internals
 * (Consult his thesis)


 * Consult the unit tests
 * Consult the API
 * Consult search-related bugs
 * Write a spec

Porting Filters
Filters should be ported from the Lucene 2.4.x API to the Lucene 2.9.x API. This involves:
 * 1) writing unit tests for the old filter and checking that they still pass with the new input.
 * 2) Token next() and Token next(Token) have been deprecated.
 * 3) incrementToken() needs to be called on the input token stream, not on the filter itself (which would cause a stack overflow).
 * 4) to process the token, add these properties to the filter:

 // In Lucene 2.9.x the term attribute is TermAttribute; CharTermAttribute only appears in 3.1.
 private TermAttribute termAttr;
 private TypeAttribute typeAttr;

which should be initialised in the constructor via:

 // filterXtor stands for the constructor of the filter class
 public filterXtor(TokenStream input) {
     super(input);
     termAttr = (TermAttribute) addAttribute(TermAttribute.class);
     typeAttr = (TypeAttribute) addAttribute(TypeAttribute.class);
 }

 * 1) boolean incrementToken() is now required.
 * 2) it moves the token stream one step forward.
 * 3) it returns true if there are more tokens, false otherwise.

 public boolean incrementToken() throws IOException {
     if (!input.incrementToken()) return false;
     // process token via termAttr.term()
     // next update buffers
     // setTermBuffer copies the contents of the given buffer into the termBuffer array
     termAttr.setTermBuffer(modifiedToken);
     termAttr.setTermLength(this.parseBuffer(termAttr.termBuffer(), termAttr.termLength()));
     typeAttr.setType(TOKEN_TYPE_NAME);
     return true;
 }

We also decided to move the buffer handling into the parseBuffer function. It must be given the length of the “live” part of the buffer (the buffer may be larger, but only the content up to termLength is valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.
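
The in-place contract described above can be sketched without any Lucene dependency. The transformation itself (dropping apostrophes and lower-casing) is a hypothetical stand-in chosen for illustration; only the signature style, taking a buffer plus live length and returning the new live length, follows the text.

```java
/** Hypothetical stand-in for a filter's parseBuffer; plain Java, no Lucene classes. */
final class ParseBufferSketch {
    /**
     * Rewrites buffer[0..length) in place, dropping apostrophes and lower-casing
     * the remaining characters, and returns the number of valid characters left.
     */
    static int parseBuffer(char[] buffer, int length) {
        int out = 0; // write position; never overtakes the read position
        for (int in = 0; in < length; in++) {
            char c = buffer[in];
            if (c != '\'') {
                buffer[out++] = Character.toLowerCase(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];           // larger than the live content
        "O'Neil".getChars(0, 6, buffer, 0);     // live part is the first 6 chars
        int live = parseBuffer(buffer, 6);
        System.out.println(new String(buffer, 0, live));
    }
}
```

Because the write position never passes the read position, no second array is needed, which is the point of modifying the term buffer in place.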

org.apache.lucene.search

 * changes to and extensions of lucene.search objects

org.apache.lucene.search/ArticleInfo.java
Note: the only implementation wraps methods of ArticleMetaSource, so it could be refactored away.
 * (limited) interface for metadata on an article
 * isSubpage - whether it is a subpage
 * daysOld - age in the index
 * namespace - the article's namespace
 * the interface is implemented in org.wikimedia.lsearch.search/ArticleInfoImpl.java

org.apache.lucene.search/ArticleNamespaceScaling.java

 * boosts an article according to its namespace.
 * is used in:
 * ArticleQueryWrap.customExplain,
 * ArticleQueryWrap.customScore
 * SearchEngine.PrefixMatch
 * tested in
 * testComplex
 * testDefault
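
As a rough picture of namespace-based boosting, the sketch below multiplies a score by a per-namespace factor, using values in the style of the [Namespace-Boost] entries in the cluster configuration (for example (0, 1) (6, 4) for commonswiki). The class, its methods and the fallback factor for unlisted namespaces are assumptions; the real ArticleNamespaceScaling class may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-namespace score scaling. */
final class NamespaceScalingSketch {
    private final Map<Integer, Float> boostByNamespace = new HashMap<>();
    private final float fallback; // assumed factor for namespaces without an entry

    NamespaceScalingSketch(float fallback) {
        this.fallback = fallback;
    }

    /** Registers the boost for one namespace; returns this for chaining. */
    NamespaceScalingSketch set(int namespace, float boost) {
        boostByNamespace.put(namespace, boost);
        return this;
    }

    /** Multiplies the raw score by the boost configured for the namespace. */
    float scale(float score, int namespace) {
        return score * boostByNamespace.getOrDefault(namespace, fallback);
    }

    public static void main(String[] args) {
        // Values mirror the commonswiki entry: (0, 1) (6, 4)
        NamespaceScalingSketch commons = new NamespaceScalingSketch(1f).set(0, 1f).set(6, 4f);
        System.out.println(commons.scale(2f, 6));
    }
}
```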

org.apache.lucene.search/ConstMinScore

 * provides a boost query with a minimum score
 * used by
 * CustomScorer

org.apache.lucene.search/CustomBoostQuery

 * Query that sets document score as a programmatic function of (up to) two (sub) scores.
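
The idea of a score computed "as a programmatic function" of sub-scores can be illustrated with a small callback. This is a sketch only: the interface, the method names and the choice of 1 as the neutral value for a missing sub-score are hypothetical and are not taken from CustomBoostQuery.

```java
/** Hypothetical sketch of combining a query score with up to two sub-scores. */
final class TwoScoreCombineSketch {
    /** Hypothetical callback: maps the query score and two sub-scores to the final score. */
    interface Combiner {
        float combine(float queryScore, float first, float second);
    }

    /** A missing sub-score (null) is replaced by the assumed neutral value 1. */
    static float score(Combiner combiner, float queryScore, Float first, Float second) {
        float a = (first == null) ? 1f : first;
        float b = (second == null) ? 1f : second;
        return combiner.combine(queryScore, a, b);
    }

    public static void main(String[] args) {
        Combiner product = (q, a, b) -> q * a * b;
        // Only one sub-score is supplied here; the second is absent.
        System.out.println(score(product, 2f, 3f, null));
    }
}
```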