User:OrenBochman/Search/NGSpec

From MediaWiki.org


  • The ultimate goal is to make searching simple and satisfactory.

Secondary goals are:

  • Improve precision and recall.
  • Evaluate components by the knowledge & intelligence they can expose.
  • Use infrastructure effectively.
  • Low edit-to-index time.


Features

Standard Features

  • Result Highlighting
  • Did you mean?
    • Spell checker
    • Auto-Complete Suggestions
  • Automatic Query Expansion ReSearcher
  • More Like This[1]
  • Faceting

Ranking

  • links - page rank
  • confidence - author rank
  • external links

Wiki Specific Features

  • Wiki Code search
  • Edit History Search
  • Index both source and output
  • Links & Anchor text
  • Images
  • Index tables
  • Index disambiguation pages

Media Support

  • Indexing of uploaded documents[2]
    • PDF
    • Excel
    • Word
  • Indexing of image metadata from Commons
  • Geographical search based on GPS data & geographical named entities
  • Timeline search based on time mereology

Performance & Scalability

  • Based on Apache Solr
  • Modify the index for BitTorrent protocol updates (via an edit Zipf distribution field)
  • Easy management - installation and administration leverage Solr
  • BitTorrent index distribution

UI & Admin UI

  • User front end
  • MediaWiki admin front end
  • Sysop front end - JMX
  • Search admin front end

Search Analytics

  • Search analytics
    • Search CTR (click-through rate)
    • Zero-hit queries
    • Top Queries
    • Slow Queries
    • User Click Ranking
    • Paging Depth
    • Top Facets
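
A minimal sketch of how a few of the metrics above could be derived from a query log. The record layout (`QueryEvent` and its fields), the slow-query threshold and the sample data are all assumptions made for illustration; they do not describe an existing MediaWiki or Solr log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QueryEvent:
    # Hypothetical log record; field names are illustrative only.
    query: str
    hits: int       # number of results returned
    clicked: bool   # did the user click any result?
    page: int       # deepest result page viewed (1-based)
    millis: int     # server-side latency in milliseconds


def summarize(events, slow_ms=500, top_n=3):
    """Aggregate CTR, zero-hit rate, top/slow queries and paging depth."""
    total = len(events)
    if total == 0:
        return {}
    return {
        "ctr": sum(e.clicked for e in events) / total,
        "zero_hit_rate": sum(e.hits == 0 for e in events) / total,
        "top_queries": Counter(e.query for e in events).most_common(top_n),
        "slow_queries": sorted({e.query for e in events if e.millis > slow_ms}),
        "mean_paging_depth": sum(e.page for e in events) / total,
    }


LOG = [
    QueryEvent("solr", 120, True, 1, 40),
    QueryEvent("solr", 120, False, 3, 45),
    QueryEvent("lucen", 0, False, 1, 20),
    QueryEvent("edit history", 9, True, 1, 900),
]
report = summarize(LOG)
```

User click ranking and top facets would need extra fields (clicked result position, selected facet) that this sketch leaves out.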

Crowdsourceable Components

  • Wiki Sourcing
    • Lexical Repository
      • <Ngram,Language> Distribution
      • <Lexeme,Language> Distribution
      • <Lemma<Lexeme>,Language> Distribution
      • <CrossLangLemma<Lexeme>> Distribution
      • <Shingle,Language> Distribution
      • <Collocation,Language> Distribution
      • <ProperNoun,Language> Distribution
      • Transliterator_XX_IPA
    • Semantic Repository
      • <Title,Language> Distribution
      • <TitleThs,Language> Distribution
      • <Title,IW> Distribution
    • Ontological Repository
      • Import external
      • Use Categories
      • Use Interwiki
      • External databases
        • CIA Factbook
        • world-gazetteer.com
  • Data Based Learning
    • Learning Morphology, Grammar, WordSense, CrossLanguage
    • Search Analytics
    • Data Mining


Indexing

[Figure: Filter Chains.svg]


Cache Analytics

  • Rank Pages/Links by Cache Hits (Hadoop)
  • Score Links in Disambiguation Pages
  • Score Redirect Pages
  • Normalize on Interwiki links

Reputations

A document's reputation is derived from intrinsic factors (I) and extrinsic factors (E).

Intrinsic reputation

  • Sum of author-rank per token

Extrinsic reputation

  • Stability of external references
  • Time between edits
  • Vandalism, reversions, edit wars, and locking
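
As a rough illustration of the intrinsic part only, the sketch below sums the rank of the author credited with each token. The per-token authorship list, the rank values and the default rank for unknown authors are hypothetical. The spec gives no formula for the extrinsic factors or for combining I and E, so none is attempted here.

```python
def intrinsic_reputation(token_authors, author_rank, default_rank=0.0):
    """Sum the rank of the author credited with each token.

    token_authors: iterable of (token, author) pairs (assumed input shape).
    author_rank:   mapping author -> rank (assumed to be precomputed).
    """
    return sum(author_rank.get(author, default_rank)
               for _token, author in token_authors)


# Hypothetical data: made-up authors and ranks.
AUTHOR_RANK = {"alice": 0.9, "bob": 0.4}
TOKENS = [("search", "alice"), ("is", "alice"), ("hard", "bob"), ("!", "anon")]

score = intrinsic_reputation(TOKENS, AUTHOR_RANK)
per_token = score / len(TOKENS)  # optional length normalization
```

Dividing by the token count is one possible way to keep long documents from dominating; that choice is not stated in the spec.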

PreProcessing

Lexical

Semantic

Cross Language

Brainstorm Some Search Problems

LSDEAMON vs Apache Solr

As search evolves, it might be prudent to migrate to Apache Solr[3] as a standalone search server instead of the LSDEAMON.


Pros

  • Reduces the code base to MediaWiki-specific features.
  • Free feature upgrades, since Solr is integrated with Lucene releases.
  • Tested and supported by a large user base.
  • Monitoring via JMX.
  • Can communicate directly with PHP via JSON.
  • Existing features supported:
    • Fast Vector Highlighting[4]
    • Spell checking[5]
    • More Like This[6]
    • Two-word phrase indexing via shingles
    • The text field can hold aggregate copies of multiple fields just for searching, to reduce queries (similar to the OAIRepository)
    • Sharding (splitting the index) by a hash of the title
    • Shard replication

Can support many more features from the above matrix:

  • Clustering of search results
    • via Carrot2[7] for the top 1000 results.
    • via Mahout[8] for millions of results.
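
Sharding by a hash of the title, listed among the supported features above, can be sketched as follows. This is not Solr's own routing code; the function name and the shard count are assumptions. A stable digest is used because Python's built-in `hash()` is salted per process and would send the same title to different shards on different machines.

```python
import hashlib


def shard_for_title(title: str, num_shards: int) -> int:
    """Map a page title to a shard number in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.md5(title.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


titles = ["Main Page", "The Art of War", "Apache Solr", "Lucene"]
assignment = {t: shard_for_title(t, 4) for t in titles}
```

Changing the number of shards remaps most titles, so a deployment that expects to grow would need a different scheme or a full reindex.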

Cons

  • Typical development risks.
  • Untested in the cloud.
  • May require a new MWSearch front end.
  • Less familiarity with its development and deployment.


Query Expansion

Indexing Source as opposed to HTML

Problem: Lucene search processes Wikimedia source text and not the outputted HTML.

Solution:

  1. Index the output HTML (placed into the cache).
  2. Strip unwanted tags.
  3. Boost things like:
    • Headers
    • Interwikis
    • External links

Problem: HTML also contains CSS, scripts, and comments

Solution: either index these too, or run a filter to remove them. Some strategies are:

  1. Discard all markup.
    1. A markup_filter/tokenizer could be used to discard markup.
    2. The Tika project can do this.
    3. Other ready-made solutions.
  2. Keep all markup.
    1. Write a markup-analyzer that would be used to compress the page to reduce storage requirements (interesting if one also wants to compress output for integrating into the DB or cache).
  3. Selective processing.
    1. A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
    2. This is the most promising strategy: it can detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
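
A minimal sketch of the filtering idea using Python's standard `html.parser`, standing in for whatever filter or tokenizer would actually sit in the analysis chain. It drops script, style and comment content and keeps header text in a separate field so that it could be boosted at query time. The class name, the two field names and the sample page are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> content and comments."""

    SKIP = {"script", "style"}
    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.header_depth = 0
        self.body = []
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADERS:
            self.header_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADERS and self.header_depth:
            self.header_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        (self.headers if self.header_depth else self.body).append(text)

    # handle_comment is deliberately not overridden: comments are dropped.


def extract(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"headers": " ".join(parser.headers), "body": " ".join(parser.body)}


PAGE = """<html><head><style>p {color: red}</style></head><body>
<h2>History</h2><!-- editor note --><p>Lucene <b>indexes</b> text.</p>
<script>alert('x');</script></body></html>"""
fields = extract(PAGE)
```

Keeping headers in their own field leaves the boost factor as a query-time decision rather than baking it into the index.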

Problem: Indexing offline and online

  1. Solr can access the DB directly...?
  2. Real-time "only" - slowly build the index in the background.
  3. Offline "only" - use a dedicated machine/cloud to dump and index offline.
  4. Dual - each time the linguistic component becomes significantly better (or there is a bug fix), it would be desirable to upgrade search. How this would be done depends largely on the architecture of the analyzer. One possible approach would be:
    1. Production of linguistic/entity data or a new software milestone.
    2. Offline analysis from a dump (XML or HTML).
    3. Online processing of updates, newest to oldest, with a Poisson wait-time prediction model.
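
The spec does not say how the Poisson wait-time model would be used, so the following is only one reading: treat the edits to a page as a Poisson process, estimate its rate from past edit times, and derive the expected wait and the chance of another edit within a given horizon. Function names, time units and the sample data are assumptions.

```python
import math


def edit_rate(edit_times, window_start, window_end):
    """Maximum-likelihood Poisson rate: edits per unit time in the window."""
    span = window_end - window_start
    if span <= 0:
        raise ValueError("window must have positive length")
    count = sum(window_start <= t <= window_end for t in edit_times)
    return count / span


def prob_edit_within(rate, horizon):
    """P(at least one edit in the next `horizon` time units)."""
    return 1.0 - math.exp(-rate * horizon)


# Hypothetical page edited 4 times over a 48-hour window (times in hours).
rate = edit_rate([3, 10, 30, 47], 0, 48)
expected_wait_hours = 1 / rate
p_next_6h = prob_edit_within(rate, 6)
```

A scheduler could, for example, postpone re-analysis of pages that are very likely to change again soon; that policy is a suggestion, not something the spec states.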

NG Search Features

Problem: Analysis And Language

  1. An N-gram analyzer is language independent.
  2. A new multilingual analyzer with a language detector can be produced by extracting features from the query and checking them against a model prepared offline.
  3. The model would contain lexical features such as:
    1. alphabet
    2. bi/trigram distribution
    3. stop lists; collections of common word/POS/language sets (or lemma/language)
    4. normalized frequency statistics based on sampling full text from different languages
    5. a light model would be glyph based
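
The trigram part of such a model can be sketched as below: build a character-trigram frequency profile per language offline, then pick the language whose profile is closest (by cosine similarity) to the query's profile. The toy models are built from a single sentence each purely for illustration; real models would come from large text samples, and the function names are assumptions.

```python
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of character trigrams, with space padding."""
    padded = f"  {text.lower()} "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = (sum(v * v for v in p.values()) * sum(v * v for v in q.values())) ** 0.5
    return dot / norm if norm else 0.0


def detect(query, models):
    """Return the language code whose model best matches the query."""
    profile = trigram_profile(query)
    return max(models, key=lambda lang: similarity(profile, models[lang]))


# Toy models; a real system would prepare these offline from large samples.
MODELS = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
guess_en = detect("the dog and the fox", MODELS)
guess_de = detect("der hund und die katze", MODELS)
```

Short queries give sparse profiles, which is one reason the list above also mentions alphabets and stop lists as additional features.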

Problem: Search is not aware of morphological language variation

  1. In languages with rich morphology (e.g. Hebrew, Arabic, Hungarian, Swahili), this reduces the effectiveness of search.
  2. Text-mine en.Wiktionary and xx.Wiktionary for the data of a "lemma analyzer". (Store it in a table based on the Apertium morphological dictionary format.)
  3. Index xx.Wikipedia for frequency data, and use a row/column algorithm to fill in the gaps of the morphological dictionary table:
    1. dumb lemma (bag with a representative)
    2. smart lemma (list ordered by frequency)
    3. quantum lemma (organized by morphological state and frequency)
  4. Lemma-based indexing.
  5. Run a semantic disambiguation algorithm (tagging) to disambiguate.
  • Other benefits:
  1. Lemma-based compression (arithmetic coding based on the smart lemma).
    1. indexing all lemmas
  2. Smart resolution of disambiguation pages.
  3. An algorithm to translate English to Simple English.
  4. Excellent language detection for search.
  • Metrics:
  1. Exact amount of information contributed by a user
    1. since inception.
    2. in the final version.

  1. Phonetic Compiler
    1. Index the sound of proper names
    2. Transliteration plugin
      1. IPA-NGRAM mapping transliterator (data-based)
      2. Allow domain experts to write rule-based transliteration from IPA to their script/language.
      3. Allow exceptions (say, old Hungarian names)
    3. Search for "Yasser Arafat" or "Marwan Bargutti" and match the original (Arabic script)


  1. Lexical Compiler - compiles machine-readable lexicons/thesauri for the lexical analysis chain.
  2. Ontological Compiler - compiles machine-readable lexicons for the ontological analysis chain.
  3. Human intervention layer - allows a human to override the lexicon.
  4. Wiki Compressor Utility (build a compression utility for a wiki).

Lexical Chain

  1. Language Detection
    • Document - Apache Tika (extend to all wiki languages)
    • Query - Apache Tika
    • Lexeme -
    1. Produce machine lexicons (consumable by analyzers, machine translation and spellers).
    2. Produce a thesaurus (semantic interface), bootstrapped with WordNet (a pTaylor diSeme expansion)
  2. Disambiguators
    1. Probabilistic POS tagger (morphological ambiguity).
    2. Semantic (word sense ambiguity).
    3. Cross-language disambiguator (disambiguate by looking across languages)
    4. Disambiguator simplifier (replace poor word choices)
    5. Disambiguator

Semantic Chain[9]

  1. Titles/Disambiguation/Redirect "Proper Nouns"
  2. Category/Clustering
  3. Link (Detect|Annotate role)
  4. Named Entity detection
  5. Annotation for Ontological Indexing[10].
  6. Mereology.
    • Equal>>DirectPartOf>>PartOf.
    • Disjoint/OverLap.
  7. Time Ontology (Partial).
    1. Instant.
    2. Interval>>ProperInterval>>DateTimeInterval
    3. DateTime
    4. Interval Before/After/Contains/OverLaps.
    5. Instant Before/After
  • Lexical semantic interface (cross back and forth to disambiguate based on new knowledge)
    • POS of W1 based on recognising a ProperNoun is TIME_ONT/INSTANT.
    • Recognise that ALON is Name and not Tree based on verb...


Solution 2 - Specialized Language Support

Integrate new resources for language analysis as they become available.

  1. contrib location for
    1. Lucene
    2. Solr
  2. external resources

language | resource | status | comments
Arabic | Stemmer - algorithmic | | TestArabicNormalizationFilter.java at https://issues.apache.org/jira/secure/attachment/12391029/LUCENE-1406.patch
Arabic | Stemmer - data based | | http://savannah.nongnu.org/projects/aramorph
Chinese | | | SmartChineseSentenceTokenizerFactory.java and SmartChineseWordTokenFilterFactory.java
Hungarian | morphology | identified |
Finnish | morphology | | http://gna.org/projects/omorfi
Hebrew | morphology | identified |
Japanese | morphology | identified |
Polish | | | StempelPolishStemFilterFactory.java

  1. Benchmarking
  2. TestSuite (check resource against N-Gram)
  3. Acceptance Test
    1. Ranking suite based on "did you know..." glosses and their articles

How can search be made more interactive via Facets?

  1. Solr, unlike plain Lucene, could provide faceted search involving categories.
  2. The single most impressive change to search could be via facets.
  3. Facets can be generated via categories (though they work best in multiple shallow hierarchies).
  4. Facets can be generated via template analysis.
  5. Facets can be generated via semantic extensions. (explore)
  6. A focus on culture (local, wiki), sentiment, importance, and popularity (edit, view, revert) may be refreshing.
  7. Facets can also be generated using named entity and relational analysis.
  8. Facets may have substantial processing cost if done wrong.
  9. A cluster map interface might be popular.
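
For category-based facets, a request to Solr's select handler might be assembled as in the sketch below. `q`, `rows`, `wt`, `facet`, `facet.field` and `facet.mincount` are standard Solr request parameters; the host, the core name `wiki` and the field names `category` and `template` are hypothetical and would depend on the actual schema.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, text, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on some fields."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),          # JSON response, easy to consume from PHP
        ("facet", "true"),
        ("facet.mincount", 1),   # hide facet values with no matches
    ]
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field names.
url = facet_query_url(
    "http://localhost:8983/solr/wiki/select",
    "art of war",
    ["category", "template"],
)
```

Each `facet.field` adds counting work per query, which relates to the processing-cost caveat in the list above.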

How Can Search Resolve Unexpected Title Ambiguity

  • The Art of War prescribes the following advice: "know the enemy and know yourself and you shall emerge victorious in 1000 searches". (Italics are mine).
  • Google called it "I'm feeling lucky".

Ambiguity can come from:

  • The lexical form of the query (bank - river, money).
  • The result domain - the top search result is an exact match of a disambiguation page.

In either case the search engine should be able to make a good (measured) guess as to what the user meant and give them the desired result.

The following data is available:

  • Squid cache access is sampled at 1 in 1000.
  • All edits are logged too.

Instrumenting Links

  • If we wanted to collect intelligence, we could instrument all links to jump to a redirect page which logs <source,target,user/ip-cookie,timestamp> and then fetches the required page.
  • It would be interesting to have these stats for all pages.
  • It would be really interesting to have these stats for disambiguation/redirect pages.
  • Some of this may be available from the site logs (are there any?).

Use case 1. General browsing history stats available for disambiguation pages

Here is a resolution heuristic:

  1. Use the intelligence vector <target,frequency> to jump to the most popular target (an 80% solution) - call it the "I hate disambiguation" preference.
  2. Use the intelligence vector <source,target,frequency> to produce document term vector projections of source vs target, to match the most related source and target pages (the source should be indexed).
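
The first heuristic can be sketched from an instrumented click log. The tuple layout follows the logged record described under Instrumenting Links; the confidence threshold is an added assumption (the spec only says to jump to the most popular target), as are the function names and the sample data.

```python
from collections import Counter


def target_frequencies(click_log, disambig_page):
    """Count where users went after landing on a disambiguation page.

    click_log: iterable of (source, target, user, timestamp) tuples.
    """
    return Counter(target for source, target, _user, _ts in click_log
                   if source == disambig_page)


def resolve(target_counts, threshold=0.8):
    """Return the dominant target, or None to show the disambiguation page."""
    total = sum(target_counts.values())
    if total == 0:
        return None
    target, count = max(target_counts.items(), key=lambda kv: kv[1])
    return target if count / total >= threshold else None


# Hypothetical log: 9 of 10 visitors chose the planet.
LOG = [("Mercury", "Mercury (planet)", f"u{i}", i) for i in range(9)]
LOG.append(("Mercury", "Mercury (element)", "u9", 9))
choice = resolve(target_frequencies(LOG, "Mercury"))
```

Returning `None` when no target dominates keeps the disambiguation page as the fallback for evenly split cases.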

Use case 2. Crowdsource local interest

Search patterns are often affected by television etc. This calls for analyzing search data and producing the following intelligence vector: <top memes, geo location>. This would be produced every N<=15 minutes.

  1. Use the intelligence vector <source,target,target freshness,frequency> together with <top memes, geo location>, if significant for the search term, to steer to the current interest.

Use case 3. User-specific browsing history also available

  1. Use <source,target,frequency> as above, but with a memory <my top memes + edit history> weighted by time, to fetch personalised search results.

How can search be made more relevant via Intelligence?

  1. Use current page (AKA referer)
  2. Use browsing history
  3. Use search history
  4. Use Profile
  5. API for serving ads/fundraising

How Can Search Be Made More Relevant via Metadata Extraction?

While semantic wiki is one approach to metadata collection, Apache UIMA offers the possibility of extracting metadata from free text as well (as from templates).

  • Entity detection.

How To Test Quality of Search Results?

Ideally one would like to have a list of queries plus top result, highlight, etc. for different wikis, and use it to test the various algorithms. Since data can change, one would like to use something that is stable over time.

  1. Generated Q&A corpus.
  2. Snapshot corpus.
  3. Real-world Q&A (less robust, since the test results of a real-world wiki will change over time).
  4. Some queries are easy targets (unique article) while others are harder to find (many results).
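
Given a corpus of queries paired with the article that should come first, one common way to score a ranking algorithm is mean reciprocal rank (MRR). The spec does not name a metric, so MRR is a suggestion here. The `search` callable, the corpus and the canned results are stand-ins for a real engine and a real snapshot.

```python
def reciprocal_rank(results, expected):
    """1/rank of the expected title in the result list, or 0 if absent."""
    for rank, title in enumerate(results, start=1):
        if title == expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(corpus, search):
    """corpus: list of (query, expected_title); search: query -> list of titles."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    return sum(reciprocal_rank(search(q), want) for q, want in corpus) / len(corpus)


# Hypothetical snapshot corpus and a canned "search engine".
CORPUS = [("q1", "A"), ("q2", "B"), ("q3", "C")]
CANNED = {"q1": ["A", "X"], "q2": ["X", "B"], "q3": ["X", "Y"]}
mrr = mean_reciprocal_rank(CORPUS, lambda q: CANNED.get(q, []))
```

Running the same corpus against two analyzers or ranking configurations gives directly comparable numbers, provided the index snapshot is the same.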

Personalised Results via ResponseTrackingFilter

  • Users' post-search actions should be tracked anonymously to test and evaluate how well the ranking meets their needs.
  • Users should be able to opt in to personalised tracking based on their view/edit history.
  • This information should be integrated into the tracking algorithm as a component that can filter search.

External Links Checker

External links should be scanned once they are added. This will facilitate:

  • testing if a link is alive.
  • testing if the content has changed.

The links should be doctored for frequency count.
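
A minimal sketch of the two checks, using a content fingerprint to detect change. The `fetch` callable is injected so the example runs without network access; in practice it would wrap an HTTP client. The function names, the status labels and the in-memory `known` store are assumptions.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable digest of a page body, used to detect content changes."""
    return hashlib.sha256(content).hexdigest()


def check_link(url, fetch, known):
    """Classify a link as dead, new, unchanged or changed.

    fetch(url) returns the body as bytes, or None if the link is dead.
    known maps url -> last seen fingerprint and is updated in place.
    """
    body = fetch(url)
    if body is None:
        return "dead"
    digest = fingerprint(body)
    previous = known.get(url)
    known[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "changed"


# Fake web for illustration only.
web = {"http://example.org/a": b"version 1"}
known = {}
first = check_link("http://example.org/a", web.get, known)
second = check_link("http://example.org/a", web.get, known)
web["http://example.org/a"] = b"version 2"
third = check_link("http://example.org/a", web.get, known)
missing = check_link("http://example.org/gone", web.get, known)
```

Pages with dynamic content (dates, ads) would look "changed" on every scan, so a real checker would probably fingerprint extracted text rather than raw bytes.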

PLSI Field for cross language search

  • Index a cross-language field with N=200 words from each language version of Wikipedia in it.
  • Then run the PLSI algorithm on it.
  • This will produce a matrix that associates phrases with cross-language meaning.
  • It should then be possible to use the output of this index to do cross-language search.

Payloads

  • Payloads allow storing and retrieving arbitrary data (a byte array) for each token.
  • Payloads can be used to boost at the term level (using function queries).

What might go into payloads?

  1. HTML (logical) markup info that is stripped, e.g.
    1. isHeader
    2. isEmphasized
    3. isCode
  2. WikiMarkUp
    1. isLinkText
    2. isImageDesc
    3. TemplateNestingLevel
  3. Linguistic data
    1. LangId
    2. LemmaId - Id for base form
    3. MorphState - Lemma's morphological state
    4. ProbPosNN - probability it is a noun
    5. ProbPosVB - probability it is a verb
    6. ProbPosADJ - probability it is an adjective
    7. ProbPosADV - probability it is an adverb
    8. ProbPosPROP - probability it is a proper noun
    9. ProbPosUNKNOWN - probability it is Other/Unknown
  4. Semantic data
    1. ContextBasedSeme (if disambiguated)
  5. LanguageIndependentSemeId
  6. isWikiTitle
  7. Reputation
    1. Owner(ID,Rank)
    2. TokenReputation
  • Some can be used for ranking.
  • Some can be used for cross-language search.
  • Some can be used to improve precision.
  • Some can be used to increase recall.
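
Since a payload is just a small byte array attached to a token, some of the fields above could be packed as in the sketch below. The 7-byte layout, the bit assignments and the 8-bit quantization of the probability are assumptions chosen for illustration; they are not a Lucene-defined format.

```python
import struct

# Hypothetical bit assignments for boolean markup flags.
FLAGS = {"isHeader": 1, "isCode": 2, "isLinkText": 4, "isImageDesc": 8, "isWikiTitle": 16}

# Layout: flags (1 byte), LangId (1 byte), LemmaId (4 bytes), ProbPosNN (1 byte).
LAYOUT = ">BBIB"


def encode_payload(flags, lang_id, lemma_id, prob_noun):
    """Pack per-token data into a compact big-endian byte string."""
    bits = 0
    for name in flags:
        bits |= FLAGS[name]
    return struct.pack(LAYOUT, bits, lang_id, lemma_id, round(prob_noun * 255))


def decode_payload(payload):
    """Inverse of encode_payload (the probability is quantized to 1/255)."""
    bits, lang_id, lemma_id, quantized = struct.unpack(LAYOUT, payload)
    flags = {name for name, bit in FLAGS.items() if bits & bit}
    return flags, lang_id, lemma_id, quantized / 255


payload = encode_payload({"isHeader", "isLinkText"}, lang_id=3, lemma_id=70000, prob_noun=0.8)
flags, lang_id, lemma_id, prob_noun = decode_payload(payload)
```

Payloads are stored per token position, so every byte is multiplied by the number of tokens in the index; that is the reason for quantizing rather than storing floats.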

References

  1. Via document term vector cosines
  2. Apache Tika
  3. http://lucene.apache.org/solr/
  4. Lucene in Action, 2nd Edition, p. 275
  5. Lucene in Action, 2nd Edition, p. 277
  6. Lucene in Action, 2nd Edition, p. 283
  7. http://project.carrot2.org/release-3.5.0-notes.html
  8. http://project.carrot2.org/release-3.5.0-notes.html
  9. http://www.lirmm.fr/~croitoru/kcap07-onto-2.pdf
  10. http://delicias.dia.fi.upm.es/wiki/images/a/a5/GeneralOntologies.pdf