User:OrenBochman/Search

=Overview=

MediaWiki software, when installed without extensions and with the default settings, does not offer search capabilities. Installing the MWSearch extension does not provide search either, since this extension only provides integration with the Lucene-search extension.

However, the Lucene-search extension is part of the MediaWiki cluster deployment. Many of its design parameters and long-standing bugs are a result of this privileged relationship.

Besides Lucene-search, many other search solutions exist. For many projects, standalone solutions exist which produce better results than the built-in search; however, these do not get integrated into the cloud.

=Lucene Search 'Spec'=

This section will attempt to outline the existing search engine as an informal spec, with criticism in the body as comments or questions. Please provide additional information/corrections as you are able.


 * MWSearch is the gateway between MediaWiki and Lucene-search
 * Listens on port 8123 for search requests
 * Listens on port 8321 for index updates

Features

 * Distributed index - due to size the index is distributed on multiple machines.
 * Offline indexing - starts by indexing an XML dump and produces:
 * a Search index
 * Q. with what fields, boosting?
 * a Highlight index
 * Q. is this necessary with document term vectors now available?
 * Q. with what fields, boosting?
 * Spellcheck indexes - support for "Did you mean"
 * Fields
 * 2-grams of Wikipedia full text, with minimum and maximum frequency thresholds
 * All titles
 * Boosting for
 * Titles
 * Section Headers
 * First Paragraph
 * Redirects
 * Q. In which source file is the query cooking formula?


 * Online indexing & search - use lsearchd.
 * Ranking algorithms:
 * Did you mean:
 * Updated front end.

Search In the Cluster
base on:, and

The Ranking Algorithm

 * Ranking system:
 * PageRank-like algorithm in the sense of reference-to-article counting.
 * It may not be so great if one indexes only a single Wikipedia, since
 * the link graph is too sparse for specialist pages.
 * a few pages are link hogs (e.g. the year 1945).
 * An effective PageRank also needs a good MapReduce implementation to work fast.
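
The "reference-to-article counting" idea can be sketched as a simple inbound-link count. This is only an illustration under assumptions: the `(source, target)` link pairs and the function names are hypothetical, and the actual Lucene-search ranking code may weight links differently.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target article."""
    seen = set()
    counts = Counter()
    for source, target in links:
        # Ignore self-links and repeated links from the same source page.
        if source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        counts[target] += 1
    return counts


def rank_by_references(links):
    """Return (article, inbound count) pairs, most referenced first."""
    counts = inbound_link_counts(links)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: the year page collects most of the references.
LINKS = [
    ("World War II", "1945"),
    ("Yalta Conference", "1945"),
    ("United Nations", "1945"),
    ("World War II", "Yalta Conference"),
    ("World War II", "1945"),  # duplicate link, counted once
]
```

In this toy graph the year page outranks the specialist page, which is the "link hog" effect noted above.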

Did you mean Algorithm

 * Did you mean - query correction (phrases and words)
 * Q. What information is important or representative of an article? (often more informative than PageRank)
 * beginning of articles,
 * redirects,
 * words used to refer to the article,
 * section captions


 * Q. What disambiguates an article from related terms? Its context.
 * extracted frequently co-occurring article titles in all of Wikipedia to extract article associations.


 * there was no open source "Did you mean..." engine at that time (there are now).
 * There are programs like aspell, but all of them spell-check only single words.
 * the algorithm uses 2-grams of all words in the language, with frequency thresholds (min and max).
 * would be improved by a language model (morphology + semantics)
 * can fix some simple errors, but is not powerful enough.
 * added scoring via heuristics.
 * added a special score to boost 2-grams that are in titles,
 * added whole titles,
 * "fuzzy" 2-grams of words that might provide context for title words, by taking all words from redirects and links in the first paragraph of the article. (PLEASE CLARIFY)
 * checks the search results to see if the rare spelling a user entered is significant
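
A minimal sketch of the 2-gram approach described above, assuming the 2-grams are pairs of adjacent words and using Python's `difflib` as a stand-in for fuzzy matching. The function names, thresholds and corpus are hypothetical, and none of the title or redirect boosting heuristics are modelled.

```python
from collections import Counter
from difflib import get_close_matches
from itertools import product


def build_bigram_index(sentences, min_count=1, max_count=10**6):
    """Count adjacent word pairs and keep those within the frequency thresholds."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if min_count <= n <= max_count}


def did_you_mean(query, bigrams, cutoff=0.6):
    """Suggest a phrase whose adjacent word pairs are most frequent in the index."""
    vocabulary = sorted({word for pair in bigrams for word in pair})
    options = []
    for word in query.lower().split():
        if word in vocabulary:
            options.append([word])
        else:
            # Fall back to the original word when nothing is close enough.
            options.append(get_close_matches(word, vocabulary, n=3, cutoff=cutoff) or [word])

    def score(phrase):
        return sum(bigrams.get(pair, 0) for pair in zip(phrase, phrase[1:]))

    # Enumerate candidate phrases; fine for short queries, too slow for long ones.
    return " ".join(max(product(*options), key=score))


CORPUS = ["the art of war", "the art of war", "art of noise", "state of the art"]
```

Scoring whole phrases rather than single words is what lets context decide between candidates, which single-word spell checkers such as aspell cannot do.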

Solr may make it possible to drop code for:
 * configuration
 * maintain consistent copies of split indexes
 * smooth updates from indexer to searchers


 * Contact rainman, aka Robert Stojnić (rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
 * Rainman/search_internals
 * (Consult his thesis)


 * Consult the unit tests
 * Consult the API
 * Consult search-related bugs
 * Write a spec

=Brainstorm Some Search Problems=

LSDEAMON vs Apache Solr
As search evolves it might be prudent to migrate to Apache Solr as a standalone search server instead of the LSDEAMON.

Disadvantages

 * Typical development risks.
 * Untested in the cloud.
 * MWSearch would need to be modified to send XML queries on a single port.
 * Developers and deployers are less familiar with it.

Advantages

 * Should cut the code-base of the back-end (distributed indexes)
 * Generally integrated with the latest Lucene version
 * Supports extra features such as:
 * Highlighting (out of the box).
 * Support for faceting
 * Support for spell checking ("Did you mean..." queries)
 * Support for clustering, including third-party tools
 * A text field can hold aggregate copies of multiple fields just for searching, to reduce queries (similar to the OAIRepository)
 * Tested/Supported on large user base


 * Scalability
 * Sharding (splitting the index) by hash of title
 * Shard replication
 * Monitoring via JMX


 * Can communicate directly with PHP via JSON.
 * Clustering of search results
 * via Carrot2 for top 1000 results.
 * via Mahout for millions of results.
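
To make the JSON, highlighting, spell checking and faceting points concrete, here is a hedged sketch that only builds a Solr select URL and performs no network call. The base URL, the core name `wiki` and the field names `text` and `category` are assumptions; spell checking also depends on a spellcheck component being configured on the Solr side.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query_url(base_url, user_query, rows=20):
    """Build a Solr /select URL asking for JSON, highlights, spelling and facets."""
    params = [
        ("q", user_query),
        ("wt", "json"),              # JSON response, easy to decode from PHP
        ("rows", rows),
        ("hl", "true"),              # highlighting
        ("hl.fl", "text"),           # assumed name of the full-text field
        ("spellcheck", "true"),      # "Did you mean..." suggestions
        ("facet", "true"),           # faceting
        ("facet.field", "category"), # assumed name of the category field
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr core named "wiki".
url = build_solr_query_url("http://localhost:8983/solr/wiki", "art of war")
query_params = parse_qs(urlsplit(url).query)
```

A gateway such as MWSearch would then fetch this URL and decode the JSON body instead of speaking a custom protocol on two ports.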

What is the approach to Wikipedia ranking?

 * 1) My ideas currently do not involve changing the ranking algorithm.

Problem: Lucene search processes the wiki source text and not the output HTML.
Solution:
 * 1) index the output HTML (placed into cache)
 * 2) strip unwanted tags (while)
 * 3) boosting things like
 * Headers
 * Interwikis
 * External Links

Problem: HTML also contains CSS, HTML, Script, Comments
Either index these too, or run a filter to remove them. Some strategies are (interesting if one also wants to compress output for integrating into DB or cache):
 * Solutions:
 * 1) Discard all markup.
 * A markup_filter/tokenizer could be used to discard markup.
 * The Lucene Tika project can do this.
 * Other ready-made solutions.
 * 2) Keep all markup.
 * Write a markup-analyzer that would be used to compress the page to reduce storage requirements.
 * 3) Selective processing.
 * A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
 * This is the most promising, since it can detect/filter out unapproved markup (JavaScript, CSS, broken XHTML).
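
The "discard all markup" strategy can be sketched with Python's standard `html.parser`. This is an illustrative filter, not the markup_filter/tokenizer mentioned above or Tika; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Collect visible text, skipping script/style content and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    parser = MarkupStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining chunks with a space keeps the sketch short but would split a word that contains inline tags; a real tokenizer would track block versus inline elements.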

Problem: Indexing offline and online

 * 1) real-time "only" - slowly build the index in the background
 * 2) offline "only" - use a dedicated machine/cloud to dump and index offline.
 * 3) dual - each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade search. How this would be done would depend much on the architecture of the analyzer. One possible approach would be:
 * 4) production of linguistic/entity data or a new software milestone.
 * 5) offline analysis from a dump (XML or HTML)
 * 6) online processing of updates, newest to oldest, with a Poisson wait time prediction model
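
One possible reading of the Poisson wait time idea in item 6, offered as an assumption rather than the intended design: treat edits to a page as a Poisson process, estimate its rate from the edit history, and use the expected wait until the next edit to decide how urgently a page needs reprocessing. All function names are hypothetical.

```python
import math


def edit_rate(edit_count, observed_days):
    """Maximum-likelihood Poisson rate, in edits per day."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return edit_count / observed_days


def expected_wait_days(rate):
    """Mean time until the next edit; infinite for pages that are never edited."""
    return math.inf if rate == 0 else 1.0 / rate


def probability_of_edit_within(rate, horizon_days):
    """P(at least one edit within the horizon) = 1 - exp(-rate * horizon)."""
    return 1.0 - math.exp(-rate * horizon_days)
```

Under this reading, pages with a long expected wait can be reprocessed first, since the result is less likely to be invalidated by a fresh edit.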

Problem: Lucene's best analyzers are language specific

 * 1) An N-gram analyzer is language independent.
 * 2) A new multilingual analyzer with a language detector can be produced by:
 * 3) extracting features from the query and checking them against a model prepared offline.
 * 4) the model would contain lexical features such as:
 * 5) alphabet
 * 6) bi/trigram distribution.
 * 7) stop lists; collections of common word/POS/language sets (or lemma/language)
 * 8) normalized frequency statistics based on sampling full text from different languages.
 * 9) a light model would be glyph based.
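
A sketch of what the "light", glyph-based model might look like: it reads the script off the Unicode character names in Python's `unicodedata` module. It identifies the alphabet only, not the language (Latin covers many languages), so it would be at best a first stage before the n-gram or stop-list features. The function name is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters of text, or None."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode names start with the script, e.g. "HEBREW LETTER ALEF".
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None
```

For example, `dominant_script("Art of War")` returns `"LATIN"`, which narrows the choice of analyzer without any trained model.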

Problem: Search is not aware of morphological language variation

 * 1) In languages with rich morphology this reduces the effectiveness of search (e.g. Hebrew, Arabic, Hungarian, Swahili).
 * 2) Text-mine en.Wiktionary and xx.Wiktionary for the data of a "lemma analyzer". (Store it in a table based on the Apertium morphological dictionary format.)
 * 3) Index xx.Wikipedia for frequency data and use a row/column algorithm to fill in the gaps of the morphological dictionary table.
 * 4) dumb lemma (bag with a representative)
 * 5) smart lemma (list ordered by frequency)
 * 6) quantum lemma (organized by morphological state and frequency)
 * 7) lemma based indexing.
 * 8) run a semantic disambiguation algorithm (tag) to disambiguate
 * other benefits:
 * 1) lemma-based compression (arithmetic coding based on smart lemma)
 * 2) indexing all lemmas
 * 3) smart resolution of disambiguation pages.
 * 4) an algorithm to translate English to Simple English.
 * 5) excellent language detection for search.
 * 5) excellent language detection for search.
 * metrics:
 * 1) extract the amount of information contributed by a user
 * 2) since inception.
 * 3) in final version.
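
The "dumb lemma" (a bag of forms with a representative) and "smart lemma" (forms ordered by frequency) ideas above could be modelled roughly as follows. This is a sketch under assumptions: the class and function names are hypothetical, the forms are supplied by hand rather than mined from Wiktionary, and the "quantum lemma" variant is not covered.

```python
from collections import Counter


class SmartLemma:
    """A lemma as a bag of word forms with a representative and usage counts."""

    def __init__(self, representative, forms=()):
        self.representative = representative
        self.counts = Counter({form: 0 for form in forms})

    def observe(self, form):
        """Record one occurrence of a form seen in the corpus."""
        self.counts[form] += 1

    def forms_by_frequency(self):
        """Forms ordered by descending frequency, ties broken alphabetically."""
        return [f for f, _ in sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def build_lemma_index(lemmas):
    """Map every known form to the representative of its lemma."""
    return {form: lemma.representative for lemma in lemmas for form in lemma.counts}


def lemmatize(tokens, index):
    """Replace each token by its lemma representative; unknown tokens pass through."""
    return [index.get(token.lower(), token.lower()) for token in tokens]
```

Indexing the output of `lemmatize` instead of the raw tokens is the lemma-based indexing of item 7: a query for one inflected form then matches documents containing any other form.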

How can search be made more interactive via Facets?

 * 1) Solr instead of Lucene could provide faceted search involving categories.
 * 2) The single most impressive change to search could be via facets.
 * 3) Facets can be generated via categories (Though they work best in multiple shallow hierarchies).
 * 4) Facets can be generated via template analysis.
 * 5) Facets can be generated via semantic extensions. (explore)
 * 6) A focus on culture (local, wiki), sentiment, importance, popularity (edit, view, revert) may be refreshing.
 * 7) Facets can also be generated using named entity and relational analysis.
 * 8) Facets may have substantial processing cost if done wrong.
 * 9) A Cluster map interface might be popular.
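
As a small illustration of category-based facets, the sketch below counts categories over a result set and filters by a chosen facet. The result structure (dicts with `title` and `categories`) and the function names are assumptions; Solr would compute such counts server-side.

```python
from collections import Counter


def category_facets(results, top_n=5):
    """Count how many results fall in each category, most common first."""
    counts = Counter()
    for page in results:
        counts.update(set(page["categories"]))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_n]


def filter_by_facet(results, category):
    """Narrow the result set to pages in the chosen category."""
    return [page for page in results if category in page["categories"]]


# Hypothetical search results with their categories.
RESULTS = [
    {"title": "The Art of War", "categories": ["Military strategy books", "Chinese classic texts"]},
    {"title": "On War", "categories": ["Military strategy books"]},
    {"title": "Tao Te Ching", "categories": ["Chinese classic texts", "Taoist texts"]},
    {"title": "The Book of Five Rings", "categories": ["Military strategy books"]},
]
```

Counting over the full result set rather than the first page is where the processing cost noted in item 8 comes from.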

How can data be used to make search resolve ambiguity

 * The Art of War prescribes the following advice: "know the enemy and know yourself and you shall emerge victorious in 1000 searches". (Italics are mine.)
 * Google calls it "I'm feeling lucky".

Ambiguity can come from the lexical form of the query or from the result domain, for example when the top result of a search is an exact match that is a disambiguation page. In either case the search engine should be able to make a good (measured) guess as to what the user meant.

Instrumenting Links
 * If we wanted to collect intelligence we could instrument all links to jump to a redirect page which logs  and then fetches the required page.
 * It would be interesting to have these stats for all pages.
 * It would be really interesting to have these stats for disambiguation/redirect pages.


 * Some of this may be available from the site logs (are there any?)

Use case 1. General browsing history stats available for disambiguation pages
Here is a resolution heuristic:
 * 1) use the intelligence vector of  to jump to the most popular target (80% solution) - call it the "I hate disambiguation" preference.
 * 2) use the intelligence vector  to produce document term vector projections of source vs target to match the most related source and target pages (should index source).
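
The first heuristic can be sketched as "jump to the target most readers chose from this disambiguation page". The click counts, the function name and the optional `min_share` guard are assumptions added for illustration; the source only says to jump to the most popular target.

```python
def resolve_disambiguation(click_counts, min_share=0.0):
    """Pick the most-clicked target of a disambiguation page.

    click_counts maps target titles to logged click counts. Returns None when
    there is no data or the winner's share of clicks is below min_share.
    """
    total = sum(click_counts.values())
    if total == 0:
        return None
    target, clicks = max(click_counts.items(), key=lambda kv: kv[1])
    return target if clicks / total >= min_share else None


# Hypothetical click log for one disambiguation page.
MERCURY_CLICKS = {
    "Mercury (planet)": 700,
    "Mercury (element)": 200,
    "Mercury (mythology)": 100,
}
```

With `min_share` left at zero this always jumps; raising it keeps the disambiguation page whenever no single target clearly dominates.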

Use case 2. Crowd-source local interest
Search patterns are often affected by television etc. This calls for analyzing search data and producing the following intelligence vector . This would be produced every N<=15 minutes.
 * 1) use the intelligence vector   together with  if significant on the search term to steer to the current interest.

Use case 3. User-specific browsing history also available

 * 1) use  and as above but with a memory  weighted by time to fetch personalised search results.

How can search be made more relevant via Intelligence?

 * 1) Use current page (AKA referrer)
 * 2) Use browsing history
 * 3) Use search history
 * 4) Use Profile
 * 5) API for serving ads/fundraising

How Can Search Be Made More Relevant via metadata extraction?
While a semantic wiki is one approach to metadata collection, Apache UIMA offers the possibility of extracting metadata from free text as well as from templates.


 * entity detection.

How To Test Quality of Search Results?
Ideally one would like to have a list of queries + top result, highlight, etc. for different wikis and test the various algorithms. Since data can change, one would like to use something that is stable over time.


 * 1) generated Q&A corpus.
 * 2) snapshot corpus.
 * 3) real-world Q&A (less robust, since a real-world wiki's test results will change over time).
 * 4) some queries are easy targets (unique article) while others are harder to find (many results).
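
A list of queries plus expected top results can be scored with two common measures, precision at rank 1 and mean reciprocal rank. The sketch below assumes a `search` callable that returns an ordered list of titles for a query; the names and the choice of metrics are illustrative, not something the existing test suite is known to use.

```python
def evaluate(search, expected, k=10):
    """Score a search function against a query -> expected top title mapping."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits_at_1 = 0
    reciprocal_ranks = 0.0
    for query, wanted in expected.items():
        results = list(search(query))[:k]
        if results and results[0] == wanted:
            hits_at_1 += 1
        if wanted in results:
            # Rank is 1-based, so the top hit contributes 1.0.
            reciprocal_ranks += 1.0 / (results.index(wanted) + 1)
    n = len(expected)
    return {"precision_at_1": hits_at_1 / n, "mrr": reciprocal_ranks / n}
```

Running the same expected set against a frozen snapshot corpus keeps the scores comparable between algorithm versions, which is the stability concern raised above.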

Developer/Admin Information

 * MediaWiki manual
 * Extensions

Search Options
highlights:
 * Search Extensions
 * Extension MWSearch
 * Lucene Search
 * Extension:EzMwLucene
 * Extension:SphinxSearch

=More Info Search Tools=

=Prototype=

TODO: Write a Solr Prototype in under 1000 lines of code with

Phase I

 * Read Pages off cache
 * Wikisource/HTML Analyzer
 * Highlighting
 * Did you mean

Phase II

 * Clustering
 * Sharding
 * Replication
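
For the sharding item, splitting the index by a hash of the title could look like the sketch below. The function name, the title normalization and the choice of SHA-256 are assumptions; a stable digest is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for_title(title, num_shards):
    """Map a page title to a shard number in the range [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Treat underscores and spaces alike, as wiki titles do.
    normalized = title.strip().replace("_", " ")
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Both the indexer and the searchers must use the same function and shard count, so changing the number of shards means redistributing the index.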

=References=