User:OrenBochman/Search

= Overview =
A quick review of the options is summarized as follows:

MediaWiki does not appear to have strong native search capabilities. The wiki can instead be indexed and then searched by external components via three extensions:
 * 1) SphinxSearch - Sphinx-based search for small sites (last updated 2010)
 * 2) Lucene-search - Lucene-based search for large sites
 * 3) EzMwLucene - easy Lucene search; an unadapted package

MWSearch does not perform searches itself; rather, it provides the MediaWiki integration for Lucene-search.

= Brainstorm: Some Search Problems =

Problem: How does search currently work?
Solution:
 * Contact rainman (Robert Stojnić, rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
 * Consult his thesis.
 * Consult the unit tests.
 * Consult the API.
 * Consult search-related bugs.
 * Write a spec.

Problem: What is the approach to Wikipedia ranking?

 * My ideas currently do not involve changing the ranking algorithm.

Problem: Lucene-search processes the wikitext source, not the rendered HTML output.
Solution:
 * 1) Index the rendered HTML (as placed into the parser cache).
 * 2) Strip unwanted tags while indexing.
 * 3) Boost fields such as:
 * headers
 * interwikis
 * external links
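The field-boosting idea above can be sketched with the standard library alone. The field names and boost weights below are illustrative assumptions, not Lucene-search's actual configuration:

```python
# Sketch: pull boost-worthy fields (headings, external links) out of rendered
# HTML. FIELD_BOOSTS is an assumed weighting, not a real Lucene-search setting.
from html.parser import HTMLParser

FIELD_BOOSTS = {"heading": 5.0, "external_link": 2.0, "body": 1.0}  # assumed

class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []        # text destined for the boosted "heading" field
        self.external_links = []  # hrefs destined for "external_link"

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith(("http://", "https://")):
                self.external_links.append(href)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.headings.append(data.strip())

parser = FieldExtractor()
parser.feed('<h2>History</h2><p>See <a href="https://example.org">site</a>.</p>')
print(parser.headings)        # ['History']
print(parser.external_links)  # ['https://example.org']
```

Each extracted field would then be added to the Lucene document with its boost applied.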

Problem: The HTML output also contains CSS, scripts, and comments.
Either index these too, or run a filter to remove them. Some strategies (also interesting if one wants to compress output for integration into the DB or cache):
 * 1) Discard all markup:
 * a markup filter/tokenizer could be used to discard markup;
 * the Apache Tika project (which began under Lucene) can do this;
 * other ready-made solutions exist.
 * 2) Keep all markup:
 * write a markup analyzer that compresses the page to reduce storage requirements.
 * 3) Selective processing:
 * a table/template map extension could be used to identify structured information for deeper indexing;
 * this is the most promising, since it can also detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
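Strategy 1 (discard all markup) can be sketched with the standard-library HTML parser; a production system might use Apache Tika instead, as noted above:

```python
# Sketch: keep only visible text, dropping tags, <script>/<style> bodies,
# and comments (comments hit handle_comment, which we never emit).
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

html = '<p>Hello <b>world</b></p><script>alert(1)</script><!-- note -->'
p = TextOnly()
p.feed(html)
print("".join(p.chunks).strip())  # Hello world
```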

Problem: Indexing offline vs. online

 * 1) Real-time only - slowly build the index in the background.
 * 2) Offline only - use a dedicated machine/cloud to dump and index offline.
 * 3) Dual - each time the linguistic component becomes significantly better (or there is a bug fix), it would be desirable to upgrade the index. How this would be done depends on the architecture of the analyzer. One possible approach:
 * trigger on production of new linguistic/entity data or on a new software milestone;
 * run offline analysis from a dump (XML or HTML);
 * process updates online, newest to oldest, with a Poisson wait-time prediction model.
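The Poisson wait-time idea can be made concrete: if page edits arrive as a Poisson process, the probability of at least one edit in the next t hours is 1 - exp(-rate * t), and pages with imminent edits can be re-indexed first. The rate estimate (recent edits over a recent window) and the page data are illustrative assumptions:

```python
# Sketch: prioritise re-indexing by probability of an imminent edit under a
# Poisson model. Page names and edit counts below are invented for illustration.
import math

def edit_probability(edit_count, window_hours, horizon_hours):
    """P(at least one edit within the horizon), given edits seen in a window."""
    rate = edit_count / window_hours  # estimated edits per hour
    return 1.0 - math.exp(-rate * horizon_hours)

pages = {"Main_Page": (240, 24.0), "Obscure_Stub": (1, 720.0)}  # (edits, window)
scores = {title: edit_probability(e, w, 1.0) for title, (e, w) in pages.items()}
queue = sorted(scores, key=scores.get, reverse=True)
print(queue)  # ['Main_Page', 'Obscure_Stub']
```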

Problem: Lucene's best analyzers are language-specific.

 * 1) An N-gram analyzer is language-independent.
 * 2) A new multilingual analyzer with a language detector could be produced by extracting features from the query and checking them against a model prepared offline.
 * The model would contain lexical features such as:
 * the alphabet
 * bigram/trigram distributions
 * stop lists: collections of common word/POS/language sets (or lemma/language pairs)
 * normalized frequency statistics based on sampling full text from different languages
 * A light model would be glyph-based.
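The trigram-distribution feature above can be sketched as a toy language detector. The sample sentences and two-language model are invented for illustration; a real model would be trained offline on full-text dumps, as the notes suggest:

```python
# Sketch: character-trigram profiles per language, scored by trigram overlap.
from collections import Counter

def trigrams(text):
    text = f"  {text.lower()}  "  # pad so word boundaries become features
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {  # toy profiles; real ones would come from sampled full text
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigrams("der schnelle braune fuchs springt über den faulen hund"),
}

def detect(query):
    q = trigrams(query)
    # score = number of shared trigram occurrences with each profile
    scores = {lang: sum((q & prof).values()) for lang, prof in PROFILES.items()}
    return max(scores, key=scores.get)

print(detect("the brown dog"))    # en
print(detect("der braune hund"))  # de
```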

Problem: Search is not aware of morphological language variation.

 * 1) In languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
 * 2) Index Wiktionary so as to produce data for a "lemma analyzer":
 * dumb lemma (a bag of forms with one representative)
 * smart lemma (a list ordered by frequency)
 * quantum lemma (organized by morphological state and frequency)
 * 3) Lemma-based indexing.
 * 4) Run a semantic disambiguation algorithm to tag ambiguous forms.
 * Other benefits:
 * 1) lemma-based compression (arithmetic coding based on smart lemmas)
 * 2) indexing all lemmas
 * 3) smart resolution of disambiguation pages
 * 4) an algorithm to translate English to Simple English
 * 5) excellent language detection for search
 * Metrics:
 * 1) extract the amount of information contributed by a user
 * since inception
 * in the final version
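The "dumb lemma" variant above can be sketched as follows. The lemma table here is a toy, hand-made one; the notes propose extracting such tables from Wiktionary. Both documents and queries are lemmatised, so morphological variants match each other:

```python
# Sketch: lemma-based indexing with a toy surface-form -> lemma table.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}

def lemmatise(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def index(docs):
    inverted = {}
    for doc_id, text in docs.items():
        for term in lemmatise(text.lower().split()):
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

idx = index({1: "the mice ran", 2: "a cat runs"})
# A query for "running" matches both documents via the shared lemma "run".
hits = idx.get(lemmatise(["running"])[0], set())
print(sorted(hits))  # [1, 2]
```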

Problem: Can search be made more interactive via facets?

 * 1) Solr instead of raw Lucene could provide faceted search involving categories.
 * 2) The single most impressive change to search could come via facets.
 * 3) Facets can be generated from categories (though they work best as multiple shallow hierarchies).
 * 4) Facets can be generated via template analysis.
 * 5) Facets can be generated via semantic extensions (explore).
 * 6) A focus on culture (local, wiki), sentiment, importance, and popularity (edits, views, reverts) may be refreshing.
 * 7) Facets can also be generated using named-entity and relational analysis.
 * 8) Facets may have substantial processing cost if done wrong.
 * 9) A cluster-map interface might be popular.
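Category-derived facets amount to counting, for each category, how many result pages fall under it and showing those counts beside the result list. The page/category data below is invented; in MediaWiki it would come from the category links of each result page:

```python
# Sketch: facet counts over the categories of a result set (toy data).
from collections import Counter

results = {  # page title -> its categories (illustrative)
    "Lion":  ["Mammals", "Africa"],
    "Tiger": ["Mammals", "Asia"],
    "Kiwi":  ["Birds", "Oceania"],
}

facets = Counter(cat for cats in results.values() for cat in cats)
for cat, n in facets.most_common():
    print(f"{cat} ({n})")
# "Mammals (2)" first, then the four single-member categories
```

Clicking a facet would re-run the query filtered to that category, which is exactly what Solr's built-in faceting automates.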

= Developer/Admin Information =

 * MediaWiki manual
 * Extensions

Search options highlights:
 * Search extensions
 * Extension:MWSearch
 * Lucene-search
 * Extension:EzMwLucene
 * Extension:SphinxSearch