User:Gautham shankar/Gsoc

Identity

 * Name : Gautham Shankar G
 * Email : gautham.shankar@hiveusers.com
 * Project Title: Lucene Automatic Query Expansion From Wikipedia Text

Contact / Working Info

 * Timezone: UTC/GMT +5:30 hours
 * Typical working hours: 9 AM to 9 PM (3:30 AM to 3:30 PM UTC) (flexible)
 * IRC handle: gauthamshankar
 * Skype: gautham.shankar3
 * Phone: +919884837794

Project Summary
Query expansion is a method used to improve the performance of information retrieval systems. The following problems may arise when a user submits a search query:

 * Users typically formulate very short queries and are not likely to take the trouble of constructing long and carefully stated queries.
 * The words used to describe a concept in the query are different from the words authors use to describe the same concept in their documents.

Hence the search results obtained are often unsatisfactory and might not contain the relevant information that the user is looking for. To solve this problem, the input query is expanded with relevant terms in order to cover a wider range of contexts for a successful search. To obtain relevant terms for expansion, we first need to create a database of words grouped into sets of synonyms called 'synsets'. The search query can then be compared against the mined word net to obtain relevant expansion terms.

My proposal is to:

 * mine Wiktionary to construct a word net
 * write a Lucene filter that will use the word net to expand queries in order to provide improved search results.

Completing this project would greatly enhance the quality of the search results obtained by the user.

Deliverables

 * 1) A framework to mine Wiktionary to create a word net
 * 2) Parse through the word list and obtain the synonyms, hyponyms, derived terms and other available data for a particular word
 * 3) Organize them into a hierarchical dataset with an IS-A relationship
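
The mining steps above can be illustrated with a minimal sketch. It assumes a simplified Wiktionary-style entry in which relation sections (Synonyms, Hyponyms, and so on) are headings followed by bulleted `[[links]]`; real entries vary and often use templates instead. The sample entry, `parse_relations` and the `WordNet` class are hypothetical names introduced only for this illustration, not part of any existing library.

```python
import re
from collections import defaultdict

# Hypothetical, simplified sample of Wiktionary-style wikitext for "dog".
SAMPLE_ENTRY = """\
==English==
===Noun===
# A domesticated animal.
====Synonyms====
* [[hound]]
* [[pooch]]
====Hyponyms====
* [[terrier]]
* [[poodle]]
====Derived terms====
* [[dogsled]]
"""

HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")   # e.g. ====Synonyms====
LINK = re.compile(r"\[\[([^\]|#]+)")               # target of [[word]] links
RELATIONS = {"synonyms", "hyponyms", "hypernyms", "derived terms"}


def parse_relations(wikitext):
    """Collect linked words under each recognised relation heading."""
    relations = defaultdict(list)
    current = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            name = match.group(2).lower()
            current = name if name in RELATIONS else None
        elif current and line.startswith("*"):
            relations[current].extend(LINK.findall(line))
    return dict(relations)


class WordNet:
    """Toy word net: symmetric synonym sets plus IS-A (child -> parents) edges."""

    def __init__(self):
        self.synsets = defaultdict(set)
        self.is_a = defaultdict(set)

    def add_entry(self, word, relations):
        for syn in relations.get("synonyms", []):
            self.synsets[word].add(syn)
            self.synsets[syn].add(word)
        for hypo in relations.get("hyponyms", []):
            self.is_a[hypo].add(word)       # a terrier IS A dog
        for hyper in relations.get("hypernyms", []):
            self.is_a[word].add(hyper)


net = WordNet()
net.add_entry("dog", parse_relations(SAMPLE_ENTRY))
```

Here a hyponym listed under "dog" becomes a child of "dog" in the IS-A hierarchy, while synonyms are stored in both directions so that either word can be used to look up the other.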


 * 1) Search engine filter
 * 2) First run
 * 3) A search is run with the query given by the user. The pages obtained are used to perform word-sense disambiguation (WSD).
 * 4) A reranking of search results can be performed in case the query does not yield relevant results. (optional)
 * 5) Second run
 * 6) Using the results from WSD, the required expansion terms are obtained from the word net.
 * 7) The expansion terms are appended to the search query.
 * 8) The query terms are given a rank based on their relevance to the search context.
 * 9) The expanded query is then given as input to obtain the final results.
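
The two-run flow above can be sketched roughly as follows. This is a toy in-memory illustration, not a Lucene filter: the word net, the documents, the cue-word overlap used for disambiguation and the fixed `expansion_weight` are all assumptions made for the example. In particular, step 8 is simplified to giving original terms a weight of 1.0 and expansion terms a lower fixed weight.

```python
from collections import Counter

# Hypothetical toy word net: each ambiguous word maps to senses; each sense
# has context cue words and the expansion terms (synonyms) for that sense.
WORD_NET = {
    "jaguar": {
        "animal": {"cues": {"cat", "jungle", "prey"}, "synonyms": ["panther"]},
        "car": {"cues": {"engine", "sedan", "dealer"}, "synonyms": ["automobile"]},
    }
}

DOCS = [
    "the jaguar is a big cat that hunts prey in the jungle",
    "a panther was seen near the river in the jungle",
    "the dealer sold a sedan with a new engine",
]


def tokenize(text):
    return text.lower().split()


def search(weighted_query, docs):
    """Rank docs by the summed weights of the query terms they contain."""
    scored = []
    for doc in docs:
        tokens = set(tokenize(doc))
        score = sum(w for term, w in weighted_query.items() if term in tokens)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def disambiguate(word, context_docs):
    """Pick the sense whose cue words occur most often in the first-run results."""
    senses = WORD_NET.get(word, {})
    if not senses:
        return None
    context = Counter(t for doc in context_docs for t in tokenize(doc))
    return max(senses, key=lambda s: sum(context[c] for c in senses[s]["cues"]))


def expand_and_search(query, docs, expansion_weight=0.5):
    terms = tokenize(query)
    weighted = {t: 1.0 for t in terms}
    first_run = search(weighted, docs)             # first run: original query
    for term in terms:
        sense = disambiguate(term, first_run)      # WSD on first-run pages
        if sense is not None:
            for syn in WORD_NET[term][sense]["synonyms"]:
                weighted.setdefault(syn, expansion_weight)
    return weighted, search(weighted, docs)        # second run: expanded query


expanded_query, results = expand_and_search("jaguar", DOCS)
```

With this toy data, the first run for "jaguar" matches only the first document; its vocabulary selects the "animal" sense, so "panther" is appended with a lower weight and the second run also retrieves the second document.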