User:OrenBochman/Search

Problem: Lucene search processes the Wikimedia source text (wikitext), not the rendered HTML output.

 * 1) Index the rendered HTML output as well?

Problem: The HTML output also contains CSS, scripts, and comments.
Either index these too, or run a filter to remove them. Some strategies (also interesting if one wants to compress the output for integration into a DB or cache):
 * 1) Discard all markup.
  * A markup filter/tokenizer could be used to strip the markup.
  * The Apache Tika project can do this.
  * Other ready-made solutions exist.
 * 2) Keep all markup.
  * Write a markup analyzer that compresses the page to reduce storage requirements.
 * 3) Selective processing.
  * A table/template map extension could be used to identify structured information for deeper indexing.
  * This is the most promising option: it can also detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
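The "discard all markup" strategy can be sketched as a stream filter. The following is a minimal illustrative Python sketch, not the actual Lucene/Tika pipeline: it drops tags, `<script>`/`<style>` bodies, and comments, keeping only indexable text.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Extract indexable text; discard tags, script/style bodies, comments."""

    SKIP = {"script", "style"}  # elements whose content is never indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is suppressed; comments never
        # reach handle_data (handle_comment is not overridden).
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse whitespace left behind by removed markup.
        return " ".join(" ".join(self.parts).split())


def strip_markup(html):
    parser = MarkupStripper()
    parser.feed(html)
    return parser.text()
```

A real deployment would more likely delegate this to Apache Tika, but the sketch shows the shape of the filter.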

Problem: Indexing offline and online

 * 1) Real-time only - slowly build the index in the background.
 * 2) Offline only - use a dedicated machine/cloud to dump and index offline.
 * 3) Dual - each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade the search index. How this is done depends greatly on the analyzer architecture. One possible approach:
  * Trigger a rebuild on production of new linguistic/entity data or a new software milestone.
  * Run offline analysis from a dump (XML or HTML).
  * Process updates online, newest to oldest, using a Poisson wait-time prediction model.
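The Poisson wait-time idea can be sketched as follows; the function names and the scheduling policy are assumptions for illustration, not existing MediaWiki code. Each page's edit rate λ is estimated from its history, and pages are re-indexed in order of the probability 1 − e^(−λt) that they change within the horizon t:

```python
import math


def edit_rate(edit_timestamps, now):
    """Maximum-likelihood Poisson rate: edits per unit time over the span."""
    if not edit_timestamps:
        return 0.0
    span = max(now - min(edit_timestamps), 1e-9)
    return len(edit_timestamps) / span


def p_edit_within(lam, horizon):
    """P(at least one edit in the next `horizon`) under a Poisson process."""
    return 1.0 - math.exp(-lam * horizon)


def reindex_priority(pages, now, horizon):
    """Order page ids by probability of becoming stale, highest first.

    `pages` maps page_id -> list of past edit timestamps.
    """
    scored = [(p_edit_within(edit_rate(ts, now), horizon), pid)
              for pid, ts in pages.items()]
    return [pid for _, pid in sorted(scored, reverse=True)]
```

Frequently edited pages thus get refreshed first, while near-static pages can wait for the next offline rebuild.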

Problem: Lucene's best analyzers are language-specific

 * 1) The N-gram analyzer is language independent.
 * 2) A new multilingual analyzer with a language detector could be produced by extracting features from the query and checking them against a model prepared offline.
 * 3) The model would contain lexical features such as:
  * the alphabet
  * bigram/trigram distributions
  * stop lists: collections of common word/POS/language sets (or lemma/language pairs)
  * normalized frequency statistics based on sampling full text from different languages
 * 4) A lightweight model would be glyph based.
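One well-known way to turn trigram distributions into a detector is Cavnar-Trenkle out-of-place ranking; here is a minimal sketch under the assumption that per-language profiles are prepared offline from much larger samples than shown:

```python
from collections import Counter


def trigram_profile(text, top=300):
    """Ranked list of the most frequent character trigrams in `text`."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]


def out_of_place(profile, query_profile):
    """Cavnar-Trenkle distance: sum of rank displacements between profiles."""
    pos = {g: i for i, g in enumerate(profile)}
    penalty = len(profile)  # cost for a trigram absent from the model
    return sum(abs(pos.get(g, penalty) - i)
               for i, g in enumerate(query_profile))


def detect(query, models):
    """`models` maps language code -> ranked trigram profile built offline."""
    qp = trigram_profile(query)
    return min(models, key=lambda lang: out_of_place(models[lang], qp))
```

For short queries the profiles are noisy, which is why the notes above also suggest alphabet/glyph checks and stop lists as complementary features.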

Problem: Search is not aware of morphological language variation

 * 1) In languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
 * 2) Index Wiktionary so as to produce data for a "lemma analyzer":
  * dumb lemma (a bag of forms with one representative)
  * smart lemma (a list of forms ordered by frequency)
  * quantum lemma (organized by morphological state and frequency)
 * 3) Lemma-based indexing.
 * 4) Run a semantic disambiguation algorithm to tag ambiguous forms.
 * Other benefits:
  * 1) lemma-based compression (arithmetic coding based on the smart lemma)
  * 2) indexing all lemmas
  * 3) smart resolution of disambiguation pages
  * 4) an algorithm to translate English to Simple English
  * 5) excellent language detection for search
 * Metrics:
  * 1) amount of information contributed by a user since inception
  * 2) amount of that information surviving in the final version
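The dumb vs. smart lemma distinction can be sketched at index and query time as follows; the lemma tables here are tiny illustrative stand-ins for Wiktionary-derived data, and all names are hypothetical:

```python
# Hypothetical Wiktionary-derived lemma table (illustrative data only):
# every surface form in the bag maps to one representative lemma.
LEMMAS = {
    "running": "run", "ran": "run", "runs": "run",
    "mice": "mouse", "better": "good",
}

# "Smart lemma": surface forms ordered by (assumed) corpus frequency.
FORMS_BY_FREQ = {
    "run": ["run", "runs", "running", "ran"],
    "mouse": ["mouse", "mice"],
}


def lemmatize(token):
    """Dumb lemma: collapse every form to its representative."""
    return LEMMAS.get(token, token)


def index_terms(text):
    """Emit lemma terms so 'ran' and 'running' hit the same index entry."""
    return [lemmatize(t) for t in text.lower().split()]


def expand_query(lemma, limit=2):
    """Smart lemma: expand a query to its most frequent surface forms."""
    return FORMS_BY_FREQ.get(lemma, [lemma])[:limit]
```

Indexing on lemmas shrinks the term dictionary and makes morphological variants match; query expansion by frequency keeps recall without flooding the query with rare forms.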

Question: What types of query cooking, analysis, and ranking are currently implemented?

 * 1) My ideas currently do not involve changing the ranking algorithm.

How to translate metadata into a better search experience (Facets)?

 * 1) Solr, instead of raw Lucene, could provide faceted search involving categories.
 * 2) The single most impressive change to search could be via facets.
 * 3) Facets can be generated via categories (though they work best in multiple shallow hierarchies).
 * 4) Facets can be generated via template analysis.
 * 5) Facets can be generated via semantic extensions. (explore)
 * 6) A focus on culture (local, wiki), sentiment, importance, or popularity (edits, views, reverts) may be refreshing.
 * 7) Facets can also be generated using named entity and relational analysis.
 * 8) Facets may have substantial processing cost if done wrong.
 * 9) A Cluster map interface might be popular.
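Category-driven facets can be sketched as an inverted map from category to documents, intersected with the current result set to produce drill-down counts. The names here are illustrative; Solr provides this natively:

```python
from collections import Counter, defaultdict


def build_facets(docs):
    """`docs` maps doc_id -> set of category names.

    Returns an inverted map: category -> set of doc ids (built offline).
    """
    facets = defaultdict(set)
    for doc_id, cats in docs.items():
        for c in cats:
            facets[c].add(doc_id)
    return facets


def facet_counts(facets, result_ids):
    """Count how many current search results fall under each facet,
    so the UI can render 'Category (n)' drill-down links."""
    hits = set(result_ids)
    counts = Counter({c: len(ids & hits) for c, ids in facets.items()})
    return [(c, n) for c, n in counts.most_common() if n]
```

Precomputing the inverted map keeps the per-query cost to set intersections, which is where a naive per-result category scan would become the "substantial processing cost" noted above.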

Problem: Search is monolithic

 * Is it?
 * Allow adding/configuring new components into search via configuration.
 * A Wikisource analyzer.
 * On-demand template expansion via PHP?
 * Update the index on a section-by-section basis (tricky due to the non-linear tree nature of pages).
 * Pluggable architecture.
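A configuration-driven pluggable analysis pipeline might look like the following sketch; all component names are hypothetical and the real system would register Lucene analyzers rather than toy token filters:

```python
# Registry of analysis components, keyed by the name used in configuration.
ANALYZERS = {}


def register(name):
    """Decorator: make a component available under `name` in config files."""
    def deco(cls):
        ANALYZERS[name] = cls
        return cls
    return deco


@register("lowercase")
class Lowercase:
    def process(self, tokens):
        return [t.lower() for t in tokens]


@register("stopwords")
class StopWords:
    STOP = {"the", "a", "of"}

    def process(self, tokens):
        return [t for t in tokens if t not in self.STOP]


def build_pipeline(config):
    """`config`: ordered list of component names, e.g. per-wiki settings."""
    return [ANALYZERS[name]() for name in config]


def analyze(text, pipeline):
    tokens = text.split()
    for stage in pipeline:
        tokens = stage.process(tokens)
    return tokens
```

Adding a new component (say, a Wikisource analyzer) then means registering one class and listing its name in the configuration, with no change to the core search code.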

Developer/Admin Information

 * | MediaWiki manual
 * | extensions

Search Options
highlights:
 * | Search Extensions
 * | Extension:MWSearch
 * | Lucene Search
 * | Extension:EzMwLucene
 * | Extension:SphinxSearch

Potential Contact People
| commit-capable developers, | irc:#mediawiki

Screened

 * Brion Vibber - lead dev
 * Multichill
 * Andrew Garrett - active paid developer
 * Roan Kattouw - usability initiative, previously lead developer and maintainer of the MediaWiki API.
 * Siebrand Mazeland

Unscreened

 * Seb35
 * Aryeh Gregor
 * David Richfield
 * Niklas Laxström - experienced MediaWiki developer

Misc

 * | ZIM offline format