User:OrenBochman/Search/Spec

From mediawiki.org

Introduction

Before proposing the NG Search, I decided to review the current search engine, Lucene-search, developed by Rainman.

Lucene Search 'Spec'

This section attempts to outline the existing search engine as an informal spec, with criticism given inline as comments or questions. Please provide additional information or corrections as you are able.

  • Listens on port 8123 for search requests.
  • Listens on port 8321 for index updates.
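
As a quick sanity check of the two ports listed above, the following minimal sketch tests whether each one accepts TCP connections. Only the port numbers come from this page; the host name and the function names are assumptions made for illustration, and nothing here speaks the daemon's actual protocol.

```python
import socket

# Port numbers are taken from the list above.
SEARCH_PORT = 8123
INDEX_UPDATE_PORT = 8321


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def daemon_status(host="localhost"):
    """Report which of the two documented ports accept connections.

    The default host is an assumption; pass the real search host instead.
    """
    return {
        "search": port_is_open(host, SEARCH_PORT),
        "index_updates": port_is_open(host, INDEX_UPDATE_PORT),
    }
```

Calling daemon_status() on a machine that runs the daemon should report both entries as True; a False value only says the port is closed or unreachable, not why.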

User Guide

Lucene Search Bugs

Features

  • Distributed index - because of its size, the index is distributed across multiple machines.[1]
  • Offline indexing - starts by indexing an XML dump[2] and produces:
  • a search index
  • Q. With what fields and boosting?
  • a highlight index
  • Q. Is this necessary now that document term vectors are available?
  • Q. With what fields and boosting?
  • Spellcheck indexes - support for "did you mean"
  • Fields
  • 2-grams of the Wikipedia full text, with minimum and maximum frequency thresholds
  • All titles
  • Boosting for
  • Titles
  • Section headers
  • First paragraph
  • Redirects
  • Q. In which source file is the query-cooking formula?
  • Online indexing & search - use lsearchd.
  • Ranking algorithms[1]
  • Did you mean[1]
  • Updated front end.

MediaWiki Cluster Configuration

Based on:

  • Global Configuration Spec[3]
  • The Global Configuration[4]
  • lucene.php [5]
  • luceneSearch [6]

Summary

  • Search Machines
  • Did You Mean
  • Highlighting Configuration

[Database] Section Configuration Format

  1. # starts a single-line comment.
  2. wikidb : (single) (language,en) - declares that wikidb is a single (non-distributed) index whose default language is English, so English stemming is used.
  3. {file:///home/wikipedia/common/pmtpa.dblist} - imports a list of databases from a file.
  4. The optional (warmup,100) instructs Lucene to run 100 queries against the index whenever an updated version is fetched, to warm it up. This smooths the performance transition and keeps indexes well cached and buffered.
  5. Make sure there are no spaces inside the arguments; write them as in (warmup,10). Spaces can cause failures when creating the search, snapshot or index folders while building the index.
  6. wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[]) - declares that wikilucene is split (distributed) into 3 indexes by namespace: shard 1, called nspart1, holds namespace 0; shard 2, called nspart2, holds namespaces 4, 5, 12 and 13; and shard 3, called nspart3, holds all other namespaces.
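
The line format described above can be read mechanically. The sketch below is a minimal illustration in Python, not the parser Lucene-search itself uses; the function names and the returned dictionary layout are assumptions. It handles comments, comma-separated database names, and parenthesised options whose arguments may include a bracketed namespace list.

```python
import re

# One "(...)" option group, e.g. "(nspart2,[4,5,12,13])".
GROUP_RE = re.compile(r"\(([^()]*)\)")
# Split on commas that are not inside a [...] list.
ARG_SPLIT_RE = re.compile(r",(?![^\[]*\])")


def parse_arg(text):
    """Turn "[4,5]" into [4, 5]; leave every other argument as a string."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        return [int(n) for n in inner.split(",")] if inner else []
    return text


def parse_database_line(line):
    """Parse one [Database] line; return None for comments and blank lines."""
    line = line.split("#", 1)[0].strip()  # "#" starts a comment
    if "(" not in line:
        return None
    first = line.index("(")
    head, tail = line[:first], line[first:]
    names = [n.strip() for n in head.strip().rstrip(":").split(",") if n.strip()]
    options = {}
    for group in GROUP_RE.findall(tail):
        key, *args = ARG_SPLIT_RE.split(group)
        options[key.strip()] = [parse_arg(a) for a in args]
    return {"names": names, "options": options}
```

For the wikilucene example in item 6, this yields the name wikilucene with options nssplit, nspart1, nspart2 and nspart3, where nspart3 carries an empty namespace list.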

[Database] Raw Data

[Database]
#wikilucene : (single) (language,en) (warmup,0)
#wikidev : (single) (language,sr)
#wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])
#wikilucene : (language,en) (warmup,10)
#format:
#database_name (, database_name)+ :([single|mainsplit|nssplit],[SHARD-COUNT],[TRUE|FALSE],[IDX_BUFFER_DOCS],[IDX_MERGE_FACTOR])
                                   (language,en)
                                   (warmup,10)
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2) 
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
dewiki, frwiki : (spell,20,5)
dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[])

[Database-Group] Configuration Format

  • TODO: research and document

[Database-Group] Raw Data

<all> : (titles_by_suffix,2) (tspart1,[ wiki|w ]) (tspart2,[ wiktionary|wikt, wikibooks|b, wikinews|n, wikiquote|q, wikisource|s, wikiversity|v])
sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])

[Search-Group] Configuration Format

  • TODO: research and document
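
Until the format is researched, the raw data in the next section suggests a working reading: each line maps a search host to a space-separated list of index-name patterns, and a host may appear on several lines. The sketch below encodes that reading only. It assumes, without confirmation from the Lucene-search source, that "." is a literal dot, that "*" matches any run of characters, and that anything else (such as the (?!...) groups) is ordinary regular-expression syntax. All function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a host pattern into a regex (assumed semantics, see above)."""
    # Assumption: '.' is literal and '*' is a wildcard; the rest is raw regex.
    return re.compile(pattern.replace(".", r"\.").replace("*", ".*"))


def parse_search_group(lines):
    """Map host -> list of index patterns, merging repeated host lines."""
    hosts = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("[") or ":" not in line:
            continue  # skip blanks and section headers such as [Search-Group]
        host, _, patterns = line.partition(":")
        hosts.setdefault(host.strip(), []).extend(patterns.split())
    return hosts


def hosts_for(index_name, hosts):
    """Return the hosts whose patterns match the whole index name."""
    return [
        host
        for host, patterns in hosts.items()
        if any(pattern_to_regex(p).fullmatch(index_name) for p in patterns)
    ]
```

Under these assumptions, a line such as search18: *.prefix would serve every prefix index, while the negative-lookahead lines would serve the spell or highlight indexes of every wiki not named in the lookahead.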

[Search-Group] Raw Data

# Search hosts layout
[Search-Group]
# search 1 (enwiki) 
search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2 
search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
search8: enwiki.prefix #enwiki.spell
search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search12: enwiki.spell
search13: enwiki.nspart2*
search13x: en-titles*
search14: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl

# search 2 (de,fr,jawiki) 
search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl 
search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl 

# search 3 (it,nl,ru,sv,pl,pt,es,zhwiki) 
search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
search7: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 eswiki ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl 
search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl 
search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl 
search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search17: ptwiki.nspart1.hl ptwiki.nspart2.hl

# search 2-3 interwiki/spellchecks
search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell 
search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell

# search 4
search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
search11: commonswiki.nspart2
search11: *?
search11x: *tspart1 *tspart2
search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl

# prefix stuffs
search18: *.prefix

# stuffs to deploy in future
searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# Indexers
[Index]
searchidx2: *

# Rsync path where indexes are on hosts, after default value put 
# hosts where the location differs
# Syntax: host : <path>
[Index-Path]
<default> : /search
[OAI]
simplewiki : http://simple.wikipedia.org/w/index.php
rswikimedia : http://rs.wikimedia.org/w/index.php
ilwikimedia : http://il.wikimedia.org/w/index.php
nzwikimedia : http://nz.wikimedia.org/w/index.php
sewikimedia : http://se.wikimedia.org/w/index.php
alswiki : http://als.wikipedia.org/w/index.php
alswikibooks : http://als.wikibooks.org/w/index.php
alswikiquote : http://als.wikibooks.org/w/index.php
alswiktionary : http://als.wiktionary.org/w/index.php
chwikimedia : http://www.wikimedia.ch/w/index.php
crhwiki : http://chr.wikipedia.org/w/index.php
roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
ukwikimedia : http://uk.wikimedia.org/w/index.php
brwikimedia : http://br.wikimedia.org/w/index.php
dkwikimedia : http://dk.wikimedia.org/w/index.php
trwikimedia : http://tr.wikimedia.org/w/index.php
arwikimedia : http://ar.wikimedia.org/w/index.php
mxwikimedia : http://mx.wikimedia.org/w/index.php
[Namespace-Boost]
commonswiki : (0, 1) (6, 4)
<default> : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)
# Global properties

[Properties]
# suffixes to database name, the rest is assumed to be language code
Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia

# Allow only up to 500 results per page
Search.maxlimit=501

# Age scaling based on last edit, default is no scaling
# Below are suffixes (or whole names) with various scaling strength
AgeScaling.strong=wikinews
AgeScaling.medium=mediawikiwiki metawiki
#AgeScaling.weak=wiki

# Use additional per-article ranking data, more suitable for non-encyclopedias
AdditionalRank.suffix=mediawikiwiki metawiki

# suffix for databases that should also have exact-case index built
# note: this will also turn off stemming!
ExactCase.suffix=wiktionary jbowiki

# wmf-style init file, attempt to read OAI and lang info from it
# for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
#WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
#WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php

# Where common images are
Commons.wiki=commonswiki.nspart1
# Syntax: <prefix_name> : <comma separated list of namespaces>
# <all> is a special keyword meaning all namespaces
# E.g. all_talk : 1,3,5,7,9,11,13,15
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
[100] : 100
[101] : 101
[104] : 104
[105] : 105
[106] : 106
[0,6,12,14,100,106]: 0,6,12,14,100,106
[0,100,104] : 0,100,104
[0,2,4,12,14] : 0,2,4,12,14
[0,14] : 0,14
[4,12] : 4,12
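
The [Namespace-Boost] block in the raw data above pairs namespace numbers with boost factors, per wiki or under <default>. The following sketch shows one way to read those lines and look up a factor. How Lucene-search actually applies the factor to a score is not documented here, and the fallback value of 1.0 for unlisted namespaces is an assumption, as are the function names.

```python
import re

# One "(namespace, boost)" pair, e.g. "(1, 0.0005)".
PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*([0-9.]+)\s*\)")


def parse_namespace_boost(lines):
    """Map database name (or "<default>") -> {namespace: boost}."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name, _, pairs = line.partition(":")
        table[name.strip()] = {
            int(ns): float(boost) for ns, boost in PAIR_RE.findall(pairs)
        }
    return table


def boost_for(table, dbname, namespace, fallback=1.0):
    """Look up the boost for a namespace.

    A wiki with its own entry does not inherit from <default>, and unlisted
    namespaces get `fallback`; both choices are assumptions of this sketch.
    """
    boosts = table.get(dbname, table.get("<default>", {}))
    return boosts.get(namespace, fallback)
```

With the values listed above, commonswiki file pages (namespace 6) would get a factor of 4, while talk pages (namespace 1) on any other wiki would get 0.0005.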

The Algorithms

The Ranking Algorithm

  • Ranking system[1]:
  • PageRank-like algorithm, in the sense that it counts references to an article.
  • This may not work well when only a single Wikipedia is indexed, since
  • the link graph is too sparse for specialist pages;
  • a few pages are link hogs (e.g. the year 1945).
  • An effective PageRank also needs a good MapReduce implementation to run fast.
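
To make "reference-to-article counting" concrete, here is a minimal sketch that ranks pages by the number of distinct pages linking to them. It is an illustration of the idea only, not the ranking code in Lucene-search; the input shape (a dictionary of outgoing links) and the function names are assumptions.

```python
from collections import Counter


def reference_counts(links):
    """Count inbound references per page.

    `links` maps a page title to the titles it links to (assumed input shape).
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking page at most once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


def rank(links):
    """Order pages by inbound reference count, ties broken alphabetically."""
    counts = reference_counts(links)
    pages = set(links) | set(counts)
    return sorted(pages, key=lambda page: (-counts[page], page))
```

On a small graph the weakness noted above shows up quickly: a page such as 1945 that many articles mention outranks everything else, while specialist pages with no inbound links all tie at zero.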

Did You Mean? Algorithm

  • Did you mean - query correction (phrases and words)
  • Q. What information is important or representative of an article? (Often more informative than PageRank.)
  • beginning of articles,
  • redirects,
  • words used to refer to the article,
  • section captions
  • Q. Is it the article's context that disambiguates it from related terms?
  • Frequently co-occurring article titles were extracted from all of Wikipedia to derive article associations.


  • There was no open-source "Did you mean..." engine at that time (there are now).
  • There are programs like aspell, but all of them spell-check only single words.
  • the algorithm is based on 2-grams of all words in the language, with frequency thresholds (min and max).
  • would be improved by a language model (morphology + semantics)
  • can fix some simple errors, but is not powerful enough.
  • added scoring via heuristics.
  • added a special score to boost 2-grams that are in titles,
  • added whole titles,
  • "fuzzy" 2-grams of words that might provide context for a title's words, built from all words in redirects and in links in the first paragraph of the article. (PLEASE CLARIFY)
  • checks the search results to see whether a rare spelling the user entered is significant
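
The 2-gram idea with frequency thresholds can be sketched as follows. This is one possible reading (word pairs counted over the text, kept only between a minimum and maximum frequency), not the Lucene-search implementation; the threshold values, the function names and the use of Python's difflib for fuzzy matching are all assumptions, and none of the title or redirect heuristics listed above are included.

```python
import difflib
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def word_bigrams(text):
    """Yield consecutive lower-cased word pairs from a text."""
    words = WORD_RE.findall(text.lower())
    return zip(words, words[1:])


def build_bigram_index(documents, min_freq=2, max_freq=1000):
    """Count word 2-grams and keep those within the frequency thresholds."""
    counts = Counter()
    for doc in documents:
        counts.update(word_bigrams(doc))
    return {bg: n for bg, n in counts.items() if min_freq <= n <= max_freq}


def suggest(query, index, cutoff=0.8):
    """Suggest the most frequent indexed phrase that is close to the query.

    Returns None when the query is already an indexed phrase or nothing is
    close enough.
    """
    phrases = {" ".join(bg): n for bg, n in index.items()}
    q = query.lower()
    if q in phrases:
        return None
    close = difflib.get_close_matches(q, list(phrases), n=5, cutoff=cutoff)
    return max(close, key=phrases.get) if close else None
```

The minimum threshold drops rare pairs that are likely typos themselves, and the maximum drops very common pairs that carry little information; the right values would depend on the corpus.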


Solr may make it possible to drop the code for:

  • configuration
  • maintain consistent copies of split indexes
  • smooth updates from indexer to searchers
  • Contact Rainman (Robert Stojnić, rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
  • Consult the unit tests
  • Consult the API
  • Consult search-related bugs
  • Write a spec


Under The Hood

References