User:Gautham shankar/Gsoc

Identity

 * Name : Gautham Shankar G
 * Email : gautham.shankar@hiveusers.com
 * Project Title: Lucene Automatic Query Expansion From Wikipedia Text

Contact / Working Info

 * Timezone: UTC/GMT +5:30 hours
 * Typical working hours: 9 AM to 9 PM (3:30 AM to 3:30 PM UTC) (flexible)
 * IRC handle: gauthamshankar
 * Skype: gautham.shankar3
 * Phone: +919884837794

Project Summary
Query expansion is a method used to improve the performance of information retrieval systems. The following problems may exist when a user gives a search query:

 * Users typically formulate very short queries and are not likely to take the trouble of constructing long and carefully stated queries.
 * The words used to describe a concept in the query are different from the words authors use to describe the same concept in their documents.

Hence the search results obtained are often unsatisfactory and might not contain the relevant information that the user is looking for. To solve this problem, the input query is expanded with relevant terms so that it covers a wider range of context. To obtain relevant terms for expansion, we first need to create a database of words grouped into sets of synonyms called 'synsets'. The search query can then be compared against the mined word net to obtain relevant expansion terms.

My proposal is to:

 * Mine Wiktionary to construct a word net.
 * Write a Lucene filter that uses the word net to expand queries in order to provide improved search results.

Completing this project would greatly enhance the quality of the search results obtained by the user.

Deliverables
Required deliverables


 * A framework to mine Wiktionary to create a word net
   * Parse through the word list and obtain the synonyms, hyponyms, derived terms and other available data for a particular word.
   * Organize them into a hierarchical dataset with an IS-A relationship.

Once the word net is completed, we can obtain the synonyms of a word, the lemma of a word, and so on.
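As a rough illustration of what such a word net could look like in memory, here is a minimal Python sketch. The `WordNet` class, its method names and the sample words are all assumptions made for this example; the actual data structure is still to be decided with the mentors, and the final implementation would target Lucene in Java.

```python
from collections import defaultdict


class WordNet:
    """Toy word net (hypothetical): synonym sets plus IS-A links between words."""

    def __init__(self):
        self.synonyms = defaultdict(set)   # word -> set of synonyms
        self.hypernyms = defaultdict(set)  # word -> direct parents (IS-A)

    def add_synonyms(self, *words):
        # Every word in the group becomes a synonym of every other word.
        for word in words:
            self.synonyms[word].update(w for w in words if w != word)

    def add_is_a(self, child, parent):
        # Record that `child` IS A `parent`, e.g. "car" IS A "vehicle".
        self.hypernyms[child].add(parent)

    def get_synonyms(self, word):
        return sorted(self.synonyms.get(word, ()))

    def ancestors(self, word):
        # Walk the IS-A links upwards and collect every reachable parent.
        seen, stack = set(), [word]
        while stack:
            for parent in self.hypernyms.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return sorted(seen)


# Sample data invented for the example, not taken from Wiktionary.
net = WordNet()
net.add_synonyms("car", "automobile", "motorcar")
net.add_is_a("car", "vehicle")
net.add_is_a("vehicle", "machine")

print(net.get_synonyms("car"))  # ['automobile', 'motorcar']
print(net.ancestors("car"))     # ['machine', 'vehicle']
```

Keeping synonyms and IS-A links in separate maps makes both lookups cheap; hyponyms and derived terms could be stored in further maps of the same shape.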


 * Search engine filter (ad hoc feedback system)
   * First search (obtain feedback)
     * A search is run with the query given by the user. The pages obtained are used to perform word-sense disambiguation (WSD).
     * A re-ranking of the search results can be performed in case the query does not yield relevant results (optional).
   * Second search (use feedback to improve search)
     * Using the results from WSD, the required expansion terms are obtained from the word net.
     * The expansion terms are appended to the search query.
     * The query terms are given a rank based on their relevance to the search context.
     * The expanded query is then given as input to obtain the final results.

Research has shown that directly expanding the query can at times degrade the results because the appended terms are not relevant. Hence the feedback system is used to improve the quality of the expansion terms.
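The two-pass flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SENSES` table, the `DOCS` collection, the term-count scoring and the 0.5 weight for expansion terms are all invented for the example, and the disambiguation step is a simple clue-word overlap count rather than any particular WSD algorithm (choosing one is part of the project). The optional re-ranking step is left out.

```python
from collections import Counter

# Hypothetical word net: each ambiguous word maps to senses; every sense has
# clue words for disambiguation and the expansion terms it contributes.
SENSES = {
    "bank": [
        {"clues": {"money", "loan", "account"}, "expansion": ["lender", "depository"]},
        {"clues": {"river", "water", "shore"}, "expansion": ["riverside", "embankment"]},
    ],
}

# Hypothetical document collection standing in for the search index.
DOCS = [
    "the bank approved the loan and opened a new account",
    "money deposited in the bank account earns interest",
    "we walked along the river bank near the water",
    "the lender raised interest rates for customers",
]


def search(terms, docs, top_k=None):
    # terms: dict of term -> weight; score = weighted count of occurrences.
    scored = []
    for doc in docs:
        counts = Counter(doc.split())
        score = sum(weight * counts[t] for t, weight in terms.items())
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def disambiguate(word, feedback_docs):
    # Pick the sense whose clue words occur most often in the feedback pages.
    context = Counter(w for doc in feedback_docs for w in doc.split())
    best, best_score = None, 0
    for sense in SENSES.get(word, []):
        score = sum(context[clue] for clue in sense["clues"])
        if score > best_score:
            best, best_score = sense, score
    return best


def expand(query_terms, feedback_docs, expansion_weight=0.5):
    # Original terms keep weight 1.0; appended terms get a lower weight.
    expanded = {term: 1.0 for term in query_terms}
    for term in query_terms:
        sense = disambiguate(term, feedback_docs)
        if sense is not None:
            for extra in sense["expansion"]:
                expanded.setdefault(extra, expansion_weight)
    return expanded


query = ["bank", "loan"]
feedback = search({t: 1.0 for t in query}, DOCS, top_k=2)  # first search
expanded = expand(query, feedback)                         # WSD + expansion
results = search(expanded, DOCS)                           # second search
print(expanded)
print(results)
```

In this toy run the feedback pages select the financial sense of "bank", so the second search also retrieves the document about the lender, which the original query does not match.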

Community bonding Period

 * Interact with the mentors and the community.
 * Discuss the deliverables with the mentor and finalize the approach to be taken to solve the problem.
 * Familiarize myself with the required algorithms and compare the performance of various algorithms to choose the best.

Coding Period
I have my university exams until 31st May and can start coding from 1st June.

Schedule for the first leg :


 * 1st June to 16th June (Milestone 1, 2.2 weeks)
   * Download the required Wiktionary dumps.
   * Decide on the data structure to store the word net.
   * Write a script to parse through the data and organize it into the required data structure.
 * 17th June to 30th June (Milestone 2, 2 weeks)
   * Run the first search to obtain the feedback.
   * Perform a re-ranking of the results.
 * 1st July to 7th July (Milestone 3, 1 week)
   * Complete coding for the first leg of GSoC.
   * Prepare documentation for the mid-term evaluation.

Schedule for the second leg :
 * 8th July to 14th July (Milestone 4, 1 week)
   * Get feedback on performance from mentors.
   * Parse the documents from the search results for word-sense disambiguation.
 * 15th July to 4th August (Milestone 5, 3 weeks)
   * Implement different WSD algorithms to find the optimal algorithm.
 * 5th August to 11th August (Milestone 6, 1 week)
   * Use the WSD results to get relevant expansion terms from the word net.
 * 12th August to 18th August (Milestone 7, 1 week)
   * Obtain the final search results.
   * Complete coding for the second leg.
   * Prepare documentation for the evaluation.
 * 19th August to 29th August (Milestone 8, 1.5 weeks)
   * Make final changes, if any, to make it presentable.