User:OrenBochman/Search/NGSpec
- The ultimate goal is to make searching simple and satisfactory.
Secondary goals are:
- Improve precision and recall.
- Evaluate components by the knowledge and intelligence they can expose.
- Use infrastructure effectively.
- Keep edit-to-index time low.
Features
Standard Features
Ranking
Wiki Specific Features
Media Support
Performance & Scalability
UI & Admin UI
Search Analytics
Crowdsourceable Components
Indexing
Cache Analytics
Reputations
A document's reputation is derived from I intrinsic factors and E extrinsic factors.
- Intrinsic reputation
- Extrinsic reputation
PreProcessing
Lexical
Semantic
Cross Language
Brainstorm Some Search Problems
LSDEAMON vs Apache Solr
As search evolves, it might be prudent to migrate to Apache Solr[3] as a standalone search server instead of the LSDEAMON.
Pros
Can support many more features from the above matrix.
Cons
May require a new front end MWSearch.
Query Expansion
Indexing Source as opposed to HTML
Problem: Lucene search processes Wikimedia source text and not the rendered HTML.
Solution:
- Index the output HTML (placed into cache).
- Strip unwanted tags, while boosting things like:
  - Headers
  - Interwikis
  - External Links
Problem: HTML also contains CSS, scripts, and comments
- Solution: either index these too, or run a filter to remove them. Some strategies are:
- Discard all markup.
  - A markup_filter/tokenizer could be used to discard markup.
  - The Tika project can do this.
  - Other ready-made solutions.
- Keep all markup.
  - Write a markup-analyzer that would be used to compress the page to reduce storage requirements (interesting if one also wants to compress output for integrating into the DB or cache).
- Selective processing.
  - A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
  - This is the most promising, as it can detect/filter out unapproved markup (JavaScript, CSS, broken XHTML).
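The "discard markup but keep what can be boosted" strategy can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the class name `MarkupFilter` and the function `strip_markup` are hypothetical, and a production filter would more likely live in the Lucene/Solr analysis chain or use Tika.

```python
from html.parser import HTMLParser


class MarkupFilter(HTMLParser):
    """Collects visible text; drops <script>/<style> content and comments."""

    SKIP = {"script", "style"}
    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.body = []       # ordinary text fragments
        self.headers = []    # header text, kept apart so it can be boosted
        self._skip_depth = 0
        self._header_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADERS:
            self._header_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADERS and self._header_depth:
            self._header_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        (self.headers if self._header_depth else self.body).append(text)

    # Comments reach handle_comment(), which HTMLParser ignores by default.


def strip_markup(html):
    """Return header text and body text separately, with markup removed."""
    parser = MarkupFilter()
    parser.feed(html)
    parser.close()
    return {"headers": parser.headers, "body": parser.body}
```

Keeping header text in its own list means an indexer could write it to a separate, higher-weighted field instead of losing the distinction when tags are stripped.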
Problem: Indexing offline and online
- Solr can access the DB directly...?
- real-time "only" - slowly build the index in the background.
- offline "only" - use a dedicated machine/cloud to dump and index offline.
- dual - each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade search. How this would be done would depend much on the architecture of the analyzer. One possible approach would be:
  - production of linguistic/entity data or a new software milestone.
  - offline analysis from a dump (XML or HTML).
  - online processing of updates, newest to oldest, with a Poisson wait time prediction model.
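One possible reading of the Poisson wait time model is sketched below: treat edits to a page as a Poisson process, estimate its rate from the edit log, and re-analyse first the pages least likely to change again soon. The function names, the one-hour horizon and the "most stable first" ordering are assumptions made for illustration, not part of the original plan.

```python
import math


def edit_rate(edit_count, window_hours):
    """Maximum-likelihood Poisson rate, in edits per hour."""
    return edit_count / window_hours


def prob_edit_within(rate, horizon_hours):
    """P(at least one edit in the next horizon_hours) for a Poisson process."""
    return 1.0 - math.exp(-rate * horizon_hours)


def expected_wait_hours(rate):
    """Mean waiting time until the next edit (exponential distribution)."""
    return math.inf if rate == 0 else 1.0 / rate


def reindex_order(pages, horizon_hours=1.0):
    """pages: dict title -> (edit_count, window_hours). Most stable pages first."""
    return sorted(
        pages,
        key=lambda title: prob_edit_within(edit_rate(*pages[title]), horizon_hours),
    )
```

With this ordering, work spent on a page is less likely to be invalidated by an edit that arrives before the background pass finishes.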
NG Search Features
Problem: Analysis And Language
- The N-Gram analyzer is language independent.
- A new multilingual analyzer with a language detector can be produced by:
- extracting features from the query and checking them against a model prepared offline.
- the model would contain lexical features such as:
- alphabet
- bi/trigram distribution.
- stop lists; collection of common word/POS/language sets (or lemma/language)
- normalized frequency statistics based on sampling full text from different languages.
- a light model would be glyph based.
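The trigram-distribution part of such a language model can be sketched as follows. It is a minimal illustration, assuming that each language model is simply a character-trigram frequency profile built offline from sample text and that the query is matched by cosine similarity; the function names are hypothetical.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of character trigrams, with space padding."""
    padded = "  " + " ".join(text.lower().split()) + " "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def detect_language(query, models):
    """models: dict language code -> profile prepared offline."""
    profile = trigram_profile(query)
    return max(models, key=lambda lang: cosine(profile, models[lang]))
```

Short queries carry few trigrams, so in practice the alphabet and stop-list features listed above would be needed as additional evidence.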
Problem: Search is not aware of morphological language variation
- In languages with rich morphology (e.g. Hebrew, Arabic, Hungarian, Swahili) this reduces the effectiveness of search.
- Text mine en.Wiktionary and xx.Wiktionary for the data of a "lemma analyzer" (store it in a table based on the Apertium morphological dictionary format).
- Index xx.Wikipedia for frequency data and use a row/column algorithm to fill in the gaps of the morphological dictionary table.
- dumb lemma (bag with a representative)
- smart lemma (list ordered by frequency)
- quantum lemma (organized by morphological state and frequency)
- lemma based indexing.
- run a semantic disambiguation algorithm (tag) to disambiguate.
- other benefits:
- lemma based compression. (arithmetic coding based on smart lemma)
- indexing all lemmas
- smart resolution of disambiguation page.
- an algorithm to translate English to Simple English.
- excellent language detection for search.
- metrics:
- exact amount of information contributed by a user
- since inception.
- in final version.
- Phonetic Compiler
- Index the sound of proper names
- Transliteration plugin
- IPA-NGRAM mapping transliterator (data based)
- Allow domain experts to write rule-based transliteration from IPA to their script/language.
- Allow exceptions (say, old Hungarian names)
- Search for "Yasser Arafat" or "Marwan Bargutti" and match the original (Arabic script)
- Lexical Compiler - compiles machine-readable lexicons/thesauri for the lexical analysis chain.
- Ontological Compiler - compiles machine-readable lexicons for the ontological analysis chain.
- Human intervention layer - allow a human to override the lexicon.
- Wiki Compressor Utility (build a compression utility for a wiki).
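The "smart lemma" and lemma-based indexing ideas from the list above can be sketched as below. This assumes the Wiktionary mining step has already produced (surface form, lemma) pairs; the function names and the whitespace tokenizer are placeholders for illustration only.

```python
from collections import Counter, defaultdict


def build_lemma_table(entries):
    """entries: iterable of (surface_form, lemma) pairs, e.g. mined from Wiktionary."""
    return {form.lower(): lemma.lower() for form, lemma in entries}


def smart_lemma(table, corpus_tokens):
    """Group surface forms per lemma, most frequent form in the corpus first."""
    counts = Counter(token.lower() for token in corpus_tokens)
    forms = defaultdict(list)
    for form, lemma in table.items():
        forms[lemma].append(form)
    return {
        lemma: sorted(group, key=lambda form: (-counts[form], form))
        for lemma, group in forms.items()
    }


def lemma_index(docs, table):
    """Inverted index keyed by lemma; unknown tokens are indexed as themselves."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[table.get(token, token)].add(doc_id)
    return index
```

A query for any inflected form then resolves to the same posting list, which is the recall gain the list above aims for in morphologically rich languages.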
Lexical Chain
- Language Detection
- Document - Apache Tika (Extend to all wiki languages)
- Query - Apache Tika
- Lexeme -
- Produce Machine Lexicons (consumable by analyzers, machine translation and spellers).
- Produce Thesaurus (Semantic Interface), bootstrapped with WordNet (a pTaylor diSeme expansion)
- Disambiguators.
- probabilistic POS Tagger (Morphological ambiguity).
- Semantic (Word Sense ambiguity).
- Xlanguage Disambiguator (disambiguate by looking across languages)
- Disambiguator Simplifier (replace poor word choices)
Semantic Chain[9]
- Titles/Disambiguation/Redirect "Proper Nouns"
- Category/Clustering
- Link (Detect|Annotate role)
- Named Entity detection
- Annotation for Ontological Indexing [10].
- Mereology.
- Equal>>DirectPartOf>>PartOf.
- Disjoint/OverLap.
- Time Ontology (Partial).
- Instant.
- Interval>>ProperInterval>>DateTimeInterval
- DateTime
- Interval Before/After/Contains/OverLaps.
- Instant Before/After
- Lexical semantic interface (cross back and forth to disambiguate based on new knowledge)
- POS of W1 based on recognising a ProperNoun is TIME_ONT/INSTANT.
- Recognise that ALON is Name and not Tree based on verb...
Solution 2 - Specialized Language Support
Integrate new resources for language analysis as they become available.
- contrib location for
- lucene
- https://svn.apache.org/repos/asf/lucene/dev/tags/
- https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_5_0/lucene/contrib/
- https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/; and for branch_3x (to be released next as v3.6), see
- https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/
- solr
- https://svn.apache.org/repos/asf/lucene/solr/dev/tags/
- https://svn.apache.org/repos/asf/lucene/solr/dev/tags/lucene_solr_3_5_0/lucene/contrib/
- https://svn.apache.org/repos/asf/lucene/solr/dev/trunk/lucene/contrib/; and for branch_3x (to be released next as v3.6), see
- https://svn.apache.org/repos/asf/lucene/solr/dev/branches/branch_3x/lucene/contrib/
- external resources
| Language | Resource | Status | Comments |
|---|---|---|---|
| Arabic Stemmer - algorithmic | TestArabicNormalizationFilter.java at https://issues.apache.org/jira/secure/attachment/12391029/LUCENE-1406.patch | | |
| Arabic Stemmer - data based | http://savannah.nongnu.org/projects/aramorph | | |
| Chinese | SmartChineseSentenceTokenizerFactory.java and SmartChineseWordTokenFilterFactory.java | | |
| Hungarian | morphology | identified | |
| Finnish | morphology: http://gna.org/projects/omorfi | | |
| Hebrew | morphology | identified | |
| Japanese | morphology | identified | |
| Polish | StempelPolishStemFilterFactory.java | | |
- Benchmarking
- TestSuite (check resource against N-Gram)
- Acceptance Test
- Ranking Suite based on "did you know..." glosses and their articles
How can search be made more interactive via Facets?
- Solr instead of Lucene could provide faceted search involving categories.
- The single most impressive change to search could be via facets.
- Facets can be generated via categories (Though they work best in multiple shallow hierarchies).
- Facets can be generated via template analysis.
- Facets can be generated via semantic extensions. (explore)
- Focusing on culture (local, wiki), sentiment, importance, and popularity (edit, view, revert) may be refreshing.
- Facets can also be generated using named entity and relational analysis.
- Facets may have substantial processing cost if done wrong.
- A Cluster map interface might be popular.
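Category-based facets amount to counting, for a given result set, how many hits fall under each category and letting the user narrow by one of them. The sketch below shows only that counting step in plain Python; it is not Solr's faceting API, and the names `category_facets` and `narrow` are hypothetical.

```python
from collections import Counter


def category_facets(results, page_categories, top_n=5):
    """Count how many result pages fall into each category."""
    counts = Counter()
    for title in results:
        counts.update(set(page_categories.get(title, ())))
    return counts.most_common(top_n)


def narrow(results, page_categories, category):
    """Keep only the results that belong to the chosen facet value."""
    return [title for title in results if category in page_categories.get(title, ())]
```

Doing this count per query over deep category trees is where the processing cost mentioned above would come from, which is why shallow hierarchies suit facets better.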
How Can Search Resolve Unexpected Title Ambiguity
- The Art of War prescribes the following advice: "know the enemy and know yourself and you shall emerge victorious in 1000 searches" (italics are mine).
- Google called it "I'm feeling lucky".
Ambiguity can come from:
- The Lexical form of the query (bank - river, money)
- From the result domain - the top search result is an exact match of a disambiguation page.
In either case the search engine should be able to make a good (measured) guess as to what the user meant and give them the desired result.
The following data is available:
- Squid cache access is sampled at 1 in 1000.
- All edits are logged too.
Instrumenting Links
- If we wanted to collect intelligence, we could instrument all links to jump to a redirect page which logs
<source,target,user/ip-cookie,timestamp> then fetches the required page.
- It would be interesting to have these stats for all pages.
- It would be really interesting to have these stats for disambiguation/redirect pages.
- Some of this may be available from the site logs (are there any?)
Use case 1. General browsing history stats available for disambiguation pages
Here is a resolution heuristic:
- use intelligence vector of <target,frequency> to jump to the most popular (80% solution) - call it "I hate disambiguation" preference.
- use intelligence vector <source,target,frequency> to produce document term vector projections of source vs target to match most related source and target pages. (should index source).
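The first heuristic can be sketched as a click counter over (source, target) pairs. Reading "80%" as a minimum share of clicks before jumping straight to a target is an assumption made here for illustration, as are the function names.

```python
from collections import Counter


def record_click(log, source, target):
    """log: Counter keyed by (source, target); one entry per followed link."""
    log[(source, target)] += 1


def resolve(log, disambiguation_page, threshold=0.8):
    """Return the dominant target if it gets at least `threshold` of the clicks."""
    targets = Counter(
        {t: n for (s, t), n in log.items() if s == disambiguation_page}
    )
    total = sum(targets.values())
    if not total:
        return None
    target, hits = targets.most_common(1)[0]
    return target if hits / total >= threshold else None
```

When no target dominates, the function returns `None` and the disambiguation page would be shown as usual.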
Use case 2. Crowdsource local interest
Search patterns are often affected by television etc. This calls for analyzing search data and producing the following intelligence vector: <top memes, geo location>. This would be produced every N<=15 minutes.
- use intelligence vector <source,target,target freshness,frequency> together with <top memes, geo location>, if significant on the search term, to steer to the current interest.
Use case 3. User-specific browsing history also available
- use <source,target,frequency> as above, but with a memory <my top memes + edit history> weighted by time to fetch personalised search results.
How can search be made more relevant via Intelligence?
- Use current page (AKA referer)
- Use browsing history
- Use search history
- Use Profile
- API for serving ads/fundraising
How Can Search Be Made More Relevant via metadata extraction?
While a semantic wiki is one approach to metadata collection, Apache UIMA offers the possibility of extracting metadata from free text as well as from templates.
- entity detection.
How To Test Quality of Search Results?
Ideally one would like to have a list of queries with their top result, highlight, etc. for different wikis, and test the various algorithms against it. Since data can change, one would like to use something that is stable over time.
- generated Q&A corpus.
- snapshot corpus.
- real world Q&A (less robust since a real world wiki test results will change over time).
- some queries are easy targets (unique article) while others are harder to find (many results).
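A fixed list of queries with their expected top result can be scored with standard ranking metrics. The sketch below computes mean reciprocal rank and the share of queries whose expected article comes first; the `search` callable and the `gold` mapping are placeholders for whichever engine and corpus are under test.

```python
def reciprocal_rank(ranked, expected):
    """1/position of the expected title in the ranked list, or 0.0 if absent."""
    for position, title in enumerate(ranked, start=1):
        if title == expected:
            return 1.0 / position
    return 0.0


def evaluate(search, gold):
    """gold: dict query -> expected top article. Returns (MRR, share ranked first)."""
    ranks = [reciprocal_rank(search(query), expected) for query, expected in gold.items()]
    mrr = sum(ranks) / len(ranks)
    top1 = sum(1 for r in ranks if r == 1.0) / len(ranks)
    return mrr, top1
```

Running the same `gold` set against a snapshot corpus before and after an analyzer change gives a comparable number for each variant.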
Personalised Results via ResponseTrackingFilter
- Users' post-search actions should be tracked anonymously to test and evaluate how well the ranking meets their needs.
- Users should be able to opt in for personalised tracking based on their view/edit history.
- This information should be integrated into the tracking algorithm as a component that can filter search.
External Links Checker
External links should be scanned once they are added. This will facilitate:
- testing if a link is alive.
- testing if the content has changed.
The links should be doctored for frequency count.
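Both checks can be sketched with a content digest stored per link. The `fetch` callable is a hypothetical stand-in for the HTTP client (for example a wrapper around `urllib.request.urlopen(url).read()`, whose `URLError` derives from `OSError`), so the example makes no network calls.

```python
import hashlib


def check_link(url, fetch, previous_digest=None):
    """fetch(url) -> bytes, raising OSError when the link is unreachable."""
    try:
        content = fetch(url)
    except OSError:
        return {"alive": False, "changed": None, "digest": previous_digest}
    digest = hashlib.sha256(content).hexdigest()
    # A first scan has nothing to compare against, so it reports no change.
    changed = previous_digest is not None and digest != previous_digest
    return {"alive": True, "changed": changed, "digest": digest}
```

Storing only the digest keeps the per-link state small; pages with dynamic content would report a change on every scan, so a real checker would probably hash extracted text rather than raw bytes.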
PLSI Field for cross language search
- index a cross language field with N=200 words from each language version of Wikipedia in it.
- then run the PLSI algorithm on it.
- this will produce a matrix that associates phrases with cross language meaning.
- so it should then be possible to use the output of this index to do cross language search.
Payloads
- payloads allow storing and retrieving arbitrary data for each token.
- payloads can be used to boost at the term level (using function queries)
What might go into payloads?
- Html (Logical) Markup Info that is stripped (e.g.)
- isHeader
- isEmphasized
- isCode
- WikiMarkUp
- isLinkText
- isImageDesc
- TemplateNestingLevel
- Linguistic data
- LangId
- LemmaId - Id for base form
- MorphState - Lemma's Morphological State
- ProbPosNN - probability it is a noun
- ProbPosVB - probability it is a verb
- ProbPosADJ - probability it is an adjective
- ProbPosADV - probability it is an adverb
- ProbPosPROP - probability it is a proper noun
- ProbPosUNKNOWN - probability it is Other/Unknown
- Semantic data
- ContextBasedSeme (if disambiguated)
- LanguageIndependentSemeId
- isWikiTitle
- Reputation
- Owner(ID,Rank)
- TokenReputation
- some can be used for ranking.
- some can be used for cross language search.
- some can be used to improve precision.
- some can be used to increase recall.
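Since Lucene stores a payload as a small byte array per token position, the fields above would need a compact binary layout. The sketch below packs a few of them into 7 bytes with Python's `struct`; the bit assignments, field widths and function names are assumptions chosen for illustration, not a proposed format.

```python
import struct

# Hypothetical bit assignments for boolean markup flags.
FLAGS = {"isHeader": 1, "isCode": 2, "isLinkText": 4, "isImageDesc": 8, "isWikiTitle": 16}

# Big-endian, no padding: flags (1 byte), LangId (1), LemmaId (4), ProbPosNN (1).
_LAYOUT = struct.Struct(">BBIB")


def encode_payload(flags, lang_id, lemma_id, prob_noun):
    """Pack the flags and linguistic fields; the probability is scaled to 0..255."""
    bits = 0
    for name in flags:
        bits |= FLAGS[name]
    return _LAYOUT.pack(bits, lang_id, lemma_id, round(prob_noun * 255))


def decode_payload(payload):
    """Inverse of encode_payload (the probability loses precision)."""
    bits, lang_id, lemma_id, prob = _LAYOUT.unpack(payload)
    return {
        "flags": {name for name, bit in FLAGS.items() if bits & bit},
        "lang_id": lang_id,
        "lemma_id": lemma_id,
        "prob_noun": prob / 255,
    }


def boost(payload, header_boost=2.0):
    """Term-level boost: tokens that came from a header score higher."""
    return header_boost if payload[0] & FLAGS["isHeader"] else 1.0
```

Payload size is multiplied by the number of token positions in the index, so a fixed, narrow layout like this one matters more than it would for per-document fields.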
References
1. via document term vector cosines
2. Apache Tika
3. http://lucene.apache.org/solr/
4. Lucene in Action, 2nd Edition, p. 275
5. Lucene in Action, 2nd Edition, p. 277
6. Lucene in Action, 2nd Edition, p. 283
7. http://project.carrot2.org/release-3.5.0-notes.html
8. http://project.carrot2.org/release-3.5.0-notes.html
9. http://www.lirmm.fr/~croitoru/kcap07-onto-2.pdf
10. http://delicias.dia.fi.upm.es/wiki/images/a/a5/GeneralOntologies.pdf