User:OrenBochman/Search/Tools

=Scope=

Search is viewed by most as a necessary evil. You go to a search engine and type something. The engine reads your text, reads your mind for what you really mean, and takes you there. No thought is involved and it is great; anything else is a disaster...

Except that search in a microcosm such as Wikipedia could serve as an oracle: an oracle in the sense of an individual, perhaps touched by a madness, who can see further afield and possibly even sense the will of the gods. Going down this path, one has to reconsider search as a disruptive technology.

As such, it should commit to the same core set of values used by Wikipedia, and its goal should be to further the goals of Wikipedia. What does this mean in practice?

=Markets=

 * 1) Consider information outside of Wikipedia (free or not): grade external links and offer them in search results.
 * 2) Consider information across a wiki's boundaries: promote sharing resources between projects.
 * 3) Consider information across languages: promote results in other languages. (Registration advantage!)
 * 4) Opt in to personal search, based on the browsing/editing history of logged-in users.
 * 5) Newbie mode.
 * 6) Social search results.
 * 7) Trust: show confidence in search results
 ** survey results
 ** wikitrust

=Tools=

Some background on words: words in a language or corpus follow a long-tail distribution, and so do keywords within the documents available on the internet. This is also called the Zipf distribution. In English, if we cluster words by frequency, we observe an interesting phenomenon.

There are basically three clusters.


 * 1) Core syntax - words with no or minimal semantic significance, required to give each sentence its structure.
 * 2) Simple English - the core vocabulary required to communicate without specialised terms. These words express concrete or generic concepts with a low degree of abstraction, are non-technical, and can have multiple word senses.
 * 3) Specialised words - limited to specialised fields; they tend to be nouns unfamiliar to the general reader.
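The three bands above can be sketched by binning a vocabulary on relative frequency. This is a minimal illustration on a toy corpus; the threshold values are my own assumptions, chosen only to make the bands visible:

```python
from collections import Counter

def frequency_clusters(text, high, low):
    """Split a corpus vocabulary into three rough frequency bands:
    core syntax (very frequent), simple vocabulary, specialised words."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    core, simple, specialised = [], [], []
    for word, n in counts.most_common():
        f = n / total  # relative frequency of this word
        if f >= high:
            core.append(word)
        elif f >= low:
            simple.append(word)
        else:
            specialised.append(word)
    return core, simple, specialised

# Toy corpus: "the" dominates, "sat"/"on" are mid-band, the rest are rare.
core, simple, spec = frequency_clusters(
    "the cat sat on the mat the dog sat on the log", high=0.25, low=0.1)
```

On a real corpus the thresholds would be set from the observed rank-frequency curve rather than hard-coded.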

=Notability tool=

This [pre|stub|article] contains [N] units of unique data. Suggested external links, references, images, and terms via inter-wiki.

=Relatedness tool=

These two documents are within radius N/1000; these N documents are within radius N/1000,

where relatedness considers unique terms and medium-frequency synonyms.
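A minimal sketch of such a radius query, using cosine distance over plain bag-of-words term vectors (the distance function and radius convention here are illustrative assumptions, not a fixed design; a real tool would restrict the vectors to unique and medium-frequency terms as described above):

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def within_radius(docs, query, radius):
    """Return the documents whose distance to `query` is at most `radius`."""
    q = term_vector(query)
    return [d for d in docs if cosine_distance(term_vector(d), q) <= radius]
```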

=In and out of Context Scoring Systems=

It is commonly assumed that adding links between Wikipedia articles helps the wiki grow and is therefore a useful activity. But how many terms should be wikified? None is bad: the article is an island. Linking everything is worse: the result is a mess.

A metaphor: if one considers each article as a cistern of information, one could imagine that links, in the right number, create a network of connections that lets the information flow within the system. But the information does not appear to move on its own. (It does move as memes, with the heads of editors as the transfer vectors.) In this view, links should provide optimal assistance in comprehending ideas.

A heuristic:

Let there be a collection of documents D1...DN.

One can define a distance between two documents via:
 * the cosine of their Term Vectors.
 * the cosine of their Entity Vectors.
 * the cosine of their Relationship Vectors.

Terms are lexical units of meaning within a discipline; they can be words, phrases, or technical terms. The angular distances derived from these cosines satisfy the triangle inequality, so the spaces of Terms, Entities, and Relationships are metric spaces. (Note that 1 minus the cosine similarity is not itself a metric; the angular distance, the arccos of the similarity, is.)
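One caveat worth making explicit: 1 minus cosine similarity does not satisfy the triangle inequality in general, but the angle between the vectors does, which is what makes these spaces metric. A small check, assuming simple dense vectors:

```python
import math

def angular_distance(a, b):
    """Angle between two vectors: a true metric on directions."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # Clamp for floating-point safety before acos.
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.acos(cos)

u, v, w = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
# Triangle inequality: d(u, w) <= d(u, v) + d(v, w)
assert angular_distance(u, w) <= angular_distance(u, v) + angular_distance(v, w) + 1e-9
```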

The relatedness of each pair of documents can then be read off these distances.

A "context" scoring mechanism: this is intended as a quick yet robust way to qualify the salience of links between two documents. Q: which words in sentence A of article B should be wikified?
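One simple proxy for that question (my own illustration, not the scoring mechanism the page has in mind): rank a sentence's words by inverse document frequency over the corpus, since rare terms tend to be better link anchors than function words.

```python
import math

def idf(term, corpus):
    """Inverse document frequency over a list of token lists."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def wikify_candidates(sentence, corpus, top_k=2):
    """Rank the sentence's words by rarity; rarer terms rank higher."""
    tokens = sentence.lower().split()
    scored = sorted(tokens, key=lambda t: idf(t, corpus), reverse=True)
    return scored[:top_k]
```

A fuller context score would also weigh the relatedness of the candidate target article to the current one, per the distance measures above.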

=Good Link/Bad Link=

Some wikifying strategies. Score links as relevant: should one wikify the following?


 * 1) New York Times reports a sex scandal involving President Bill Clinton
 * 2) New York Times reports a sex scandal involving President Bill Clinton
 * 3) New York Times reports a sex scandal involving president Bill Clinton
 * 4) New York Times reports a sex scandal involving president Bill Clinton
 * 5) New York Times reports a sex scandal involving President Bill Clinton
 * 6) New York Times reports a sex scandal involving President Bill Clinton
 * 7) New York Times reports a sex scandal involving President Bill Clinton

Each is to be judged from a search/information point of view.

Link in context:
score links by their local/global contribution.

How to build a full-featured analysis chain for a field in Lucene:

 * Analyzer (probably a per-field version)
 * CharReader (usually not needed)
 * CharFilter - pre-tokenizer processing (e.g. normalize ZWNJs to spaces in Farsi, or convert)
 * Tokenizer - split char sequences into Tokens
 * TokenFilter - process Tokens
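Lucene's chain is Java, but the shape of the pipeline above can be mirrored in a few lines of Python (a sketch of the stages, not of Lucene's actual API; the stopword set is an arbitrary example):

```python
import re

STOPWORDS = {"the", "a", "an"}

def char_filter(text):
    """CharFilter stage: pre-tokenizer processing, e.g. normalize
    ZWNJ (U+200C) to a plain space."""
    return text.replace("\u200c", " ")

def tokenize(text):
    """Tokenizer stage: split a character sequence into tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """TokenFilter stage: lowercase, then drop stopwords."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    """Full chain: CharFilter -> Tokenizer -> TokenFilter, as the
    Analyzer would wire them together for one field."""
    return token_filters(tokenize(char_filter(text)))
```

In Lucene proper, a per-field Analyzer builds this same composition once per field and reuses it across documents.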