User:Gautham shankar/Gsoc

Identity

 * Name : Gautham Shankar G
 * Email : gautham.shankar@hiveusers.com
 * Project Title: Lucene Automatic Query Expansion From Wikipedia Text

Contact / Working Info

 * Timezone: UTC/GMT +5:30 hours
 * Typical working hours: 9 AM to 10 PM (3:30 AM to 4:30 PM UTC) (flexible)
 * IRC handle: gauthamshankar
 * Skype: gautham.shankar3
 * Phone: +919884837794

Project Summary
Query expansion is a method used to improve the performance of information retrieval systems. The following problems may exist when a user gives a search query:

 * Users typically formulate very short queries and are not likely to take the trouble of constructing long and carefully stated queries.
 * The words used to describe a concept in the query are different from the words authors use to describe the same concept in their documents.

Hence the search results obtained are not satisfactory and might not contain the relevant information that the user is looking for. This project aims to solve the problem in two stages:


 * 1) Creation of a multilingual wordnet
   * Wordnet is a lexical database that has become a standard resource in NLP research.
   * In wordnet, nouns, verbs, adjectives and adverbs are organized in sets of synonyms, called synsets, each of which conveys a concept.
   * These synsets are connected to other synsets by semantic relations (hyponymy, antonymy etc.).
   * The wordnet can be built using the vast multilingual data in Wiktionary.
 * 2) Query expansion
   * The input query is expanded with relevant terms in order to encompass a wider range of context for a successful search.
   * The search query is mapped with wordnet to obtain relevant expansion terms.
   * Integrating wordnet and search will provide data on the effectiveness of the wordnet.
   * The query expansion is added as a Lucene filter.
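The mapping from query terms to expansion terms can be illustrated with a minimal sketch. It assumes a tiny in-memory synset table; the synset ids, the sample words and the function names `build_index` and `expand_query` are hypothetical and are not part of the proposal or of Lucene.

```python
from typing import Dict, List, Set

# Hypothetical toy wordnet: each synset id maps to the words sharing one concept.
SYNSETS: Dict[str, Set[str]] = {
    "car.n.01": {"car", "automobile", "motorcar"},
    "movie.n.01": {"movie", "film", "picture"},
}


def build_index(synsets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert the synset table: word -> ids of the synsets that contain it."""
    index: Dict[str, Set[str]] = {}
    for synset_id, words in synsets.items():
        for word in words:
            index.setdefault(word, set()).add(synset_id)
    return index


def expand_query(query: str,
                 synsets: Dict[str, Set[str]],
                 index: Dict[str, Set[str]]) -> List[str]:
    """Return the original query terms followed by their synonyms, no duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        # Sorting keeps the output order deterministic.
        for synset_id in sorted(index.get(term, ())):
            for synonym in sorted(synsets[synset_id]):
                if synonym not in expanded:
                    expanded.append(synonym)
    return expanded


if __name__ == "__main__":
    index = build_index(SYNSETS)
    print(expand_query("car rental", SYNSETS, index))
```

Terms that are missing from the wordnet are passed through unchanged, so the expanded query is always a superset of the original one.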

Both stages are very large, and most of the time during GSoC will go into building the wordnet. If time permits, the Lucene filter can be added. Completing the wordnet would provide a vast lexical database in machine-readable format for future NLP projects. Completing the entire project would greatly enhance the quality of the search results obtained by the user.

Deliverables
A framework to mine Wiktionary to create a wordnet

There are two approaches to create a multilingual wordnet:
 * 1) Merge Model
   * A new ontology is constructed for the target language, and relations between an existing wordnet in English and this local wordnet are generated.
 * 2) Expand Model
   * The synsets in the English wordnet are translated (using bilingual dictionaries) into equivalent synsets in the other language.

For this project I find the Expand model more suitable, since translations of English words are available directly from Wiktionary. As the synsets for the English words are being generated, their counterparts in other languages can be generated simultaneously. To reduce the complexity involved, the wordnet will be built only on the noun and verb forms of a word, and the synsets will be semantically linked using hypernymy/hyponymy. For example, hypernymy: {tree, tree diagram} is a kind of {plane figure, two-dimensional figure}; hyponymy: {tree} can, for example, be {chestnut, chestnut tree}. The wordnet generation can first be tested for two languages, English being one of them.

Words in some languages are polysemic, in which case more than one synset is assigned to the word. In some of these cases, the word is monosemic in at least one language, with a unique synset assigned. Thus the wordnet will contain two types of semantic links: language-independent links (between languages) and language-dependent links (within languages).
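The Expand model and the polysemy case can be sketched as follows. This is only an illustration under stated assumptions: the concept ids, the English/German sample data and the helper names `expand_synsets` and `senses_of` are hypothetical, and keying translations by sense assumes the bilingual data is grouped per sense.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical English synsets: concept id -> words expressing that concept.
ENGLISH_SYNSETS: Dict[str, Set[str]] = {
    "tree.plant": {"tree"},
    "tree.diagram": {"tree", "tree diagram"},
}

# Hypothetical bilingual dictionary: (English word, concept id) -> German words.
TRANSLATIONS: Dict[Tuple[str, str], Set[str]] = {
    ("tree", "tree.plant"): {"Baum"},
    ("tree", "tree.diagram"): {"Baumdiagramm"},
    ("tree diagram", "tree.diagram"): {"Baumdiagramm"},
}


def expand_synsets(english: Dict[str, Set[str]],
                   translations: Dict[Tuple[str, str], Set[str]]) -> Dict[str, Set[str]]:
    """Translate each English synset into a target-language synset.

    The shared concept id plays the role of the language-independent link.
    """
    target: Dict[str, Set[str]] = {}
    for concept_id, words in english.items():
        for word in words:
            for translated in translations.get((word, concept_id), set()):
                target.setdefault(concept_id, set()).add(translated)
    return target


def senses_of(word: str, synsets: Dict[str, Set[str]]) -> List[str]:
    """List the concept ids a word belongs to; more than one means polysemy."""
    return sorted(cid for cid, words in synsets.items() if word in words)


if __name__ == "__main__":
    german = expand_synsets(ENGLISH_SYNSETS, TRANSLATIONS)
    print(senses_of("tree", ENGLISH_SYNSETS))   # polysemic in English
    print(senses_of("Baumdiagramm", german))    # monosemic in German
```

In this toy data the English word "tree" belongs to two synsets, while the German word "Baumdiagramm" belongs to only one, which is the situation the paragraph above describes.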

Required deliverables

 * 1) Page Collection Module
   * Wiktionary dumps are downloaded
   * The data is parsed to remove noise
   * An effective storage mechanism is created for future retrieval
 * 2) Dbase Module
   * The final wordnet data storage
   * A data structure is created to store the wordnet
   * The wordnet will be in RDF/OWL format
 * 3) Extraction Module
   * This extracts information for a particular word
   * Synonyms are extracted to generate synsets
   * Hypernyms/hyponyms are extracted to generate links
   * Translations for the word are extracted and added to the queue
 * 4) Mapping Module
   * This takes care of establishing the two semantic links in the wordnet
   * Data generated from extractors are used
 * 5) Extraction Manager
   * This module coordinates the Extractors and Mapping
   * It writes the final output into Dbase
   * Consistency checks are put into place in this module
 * 6) Process Manager
   * The process manager automates the task of fetching new words from extractors and adding them to the queue
   * It regulates the entire automatic wordnet creation process
   * Consistency checks are written in this module
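How the Extraction Module and the Process Manager could interact is sketched below as a queue-driven loop. The sketch assumes the Page Collection Module has already parsed the dump into per-word entries; the `ENTRIES` data, the field names and the functions `extract` and `build_wordnet` are hypothetical. Queuing hypernyms in addition to translations is also an assumption made for illustration.

```python
from collections import deque
from typing import Dict, List

Entry = Dict[str, List[str]]

# Hypothetical pre-parsed entries standing in for Page Collection Module output.
ENTRIES: Dict[str, Entry] = {
    "tree": {"synonyms": ["tree diagram"], "hypernyms": ["plant"], "translations": ["Baum"]},
    "plant": {"synonyms": ["flora"], "hypernyms": ["organism"], "translations": []},
    "Baum": {"synonyms": [], "hypernyms": ["Pflanze"], "translations": ["tree"]},
}


def extract(word: str, entries: Dict[str, Entry]) -> Entry:
    """Extraction Module stand-in: fields for one word, empty if the word is unknown."""
    empty: Entry = {"synonyms": [], "hypernyms": [], "translations": []}
    return entries.get(word, empty)


def build_wordnet(seed: str, entries: Dict[str, Entry], max_words: int = 100) -> Dict[str, Entry]:
    """Process Manager stand-in: breadth-first crawl starting from a seed word."""
    queue = deque([seed])
    seen = {seed}  # consistency check: never process a word twice
    wordnet: Dict[str, Entry] = {}
    while queue and len(wordnet) < max_words:
        word = queue.popleft()
        info = extract(word, entries)
        wordnet[word] = {
            "synset": sorted({word, *info["synonyms"]}),
            "hypernyms": list(info["hypernyms"]),
        }
        # Newly discovered words are queued for later extraction.
        for new_word in info["hypernyms"] + info["translations"]:
            if new_word not in seen:
                seen.add(new_word)
                queue.append(new_word)
    return wordnet


if __name__ == "__main__":
    for word, data in build_wordnet("tree", ENTRIES).items():
        print(word, data)
```

The `seen` set and the `max_words` limit keep the loop from cycling through mutual translations (for example "tree" and "Baum") or running without bound.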

If time permits

 * 1) A Lucene filter for query expansion
   * The filter will use the wordnet to generate expansion terms
   * The expansion terms can be filtered by creating semantic maps of the query
   * The modified query is passed to obtain results
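One possible way to filter expansion terms is sketched below. It is not the proposal's method: the "semantic map" is approximated here by a simplified Lesk-style overlap, where a sense of a query term is kept only if its related words overlap with the rest of the query. The `SENSES` data and the functions `pick_sense` and `filtered_expansion` are hypothetical, and the sketch is plain Python rather than an actual Lucene filter.

```python
from typing import Dict, List, Optional, Set

# Hypothetical senses: sense id -> synonyms and words related to that sense.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "bank.finance": {"synonyms": {"bank", "depository"},
                     "related": {"money", "loan", "account"}},
    "bank.river": {"synonyms": {"bank", "riverbank"},
                   "related": {"river", "water", "shore"}},
}


def pick_sense(term: str, context: Set[str],
               senses: Dict[str, Dict[str, Set[str]]]) -> Optional[str]:
    """Pick the sense of `term` whose related words overlap most with the context.

    Returns None when no sense has any overlap, so ambiguous terms are not expanded.
    """
    best_id, best_overlap = None, 0
    for sense_id in sorted(senses):
        sense = senses[sense_id]
        if term not in sense["synonyms"]:
            continue
        overlap = len(sense["related"] & context)
        if overlap > best_overlap:
            best_id, best_overlap = sense_id, overlap
    return best_id


def filtered_expansion(query: str,
                       senses: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Return only the expansion terms supported by the other query terms."""
    terms = query.lower().split()
    expansion: List[str] = []
    for term in terms:
        context = set(terms) - {term}
        sense_id = pick_sense(term, context, senses)
        if sense_id is None:
            continue
        for synonym in sorted(senses[sense_id]["synonyms"] - set(terms)):
            if synonym not in expansion:
                expansion.append(synonym)
    return expansion


if __name__ == "__main__":
    print(filtered_expansion("bank loan", SENSES))
    print(filtered_expansion("river bank", SENSES))
```

With a single-word query such as "bank" there is no context to disambiguate with, so this sketch returns no expansion terms rather than mixing both senses.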

Community bonding Period

 * Interact with the mentors and the community.
 * Discuss the deliverables with the mentor and finalize the approach to be taken to solve the problem.
 * Familiarize myself with the required algorithms and data structures for the project.

Coding Period
I have my university exams until 31st May and will start coding on 1st June.

Schedule for the first leg :


 * 1) 1st June to 16th June (Milestone 1, 2.2 weeks)
   * Page Collection Module
 * 2) 17th June to 23rd June (Milestone 2, 1 week)
   * Dbase Module
 * 3) 24th June to 30th June (Milestone 3, 1 week)
   * Extraction Module
 * 4) 1st July to 7th July (Milestone 4, 1 week)
   * Completion of coding and testing for first leg of GSoC
   * Prepare documentation for Mid-term Evaluation

Schedule for the second leg :
 * 1) 8th July to 14th July (Milestone 5, 1 week)
   * Feedback on performance from mentors
   * Mapping Module
 * 2) 15th July to 28th July (Milestone 6, 2 weeks)
   * Extraction Manager
 * 3) 29th July to 11th August (Milestone 7, 2 weeks)
   * Process Manager
 * 4) 12th August to 18th August (Milestone 8, 1 week)
   * Obtain the final wordnet results
   * Complete coding and testing for second leg
   * Prepare documentation for Evaluation
 * 5) 19th August to 29th August (Milestone 9, 1.5 weeks)
   * Make final changes, if any, to make it presentable

About Me
I'm Gautham Shankar, a final-year engineering student pursuing my B.E. in computer science and engineering. I have a great passion for programming and problem solving. I used to code primarily in C/C++ and have created a basic MS Paint replica in C++. On moving into college I got interested in the web, and especially in search. I have developed websites using PHP, MySQL and JavaScript, and built a recommendation framework for search query suggestion. I have used open source technologies extensively, and I would like to contribute back to this community by being a part of GSoC 2012.

Participation
Coding has never felt like work. I would love to spend the summer doing what I truly love. I'm willing to work from 9 AM to 10 PM. I'm usually online on Skype or GTalk and respond to mails without much delay. I work very hard at solving problems and use forums and blogs when I am stuck with a problem.

Past open source experience
I have not contributed to any open source software, but would like to use this as an opportunity to do so. I have used a lot of open source software in my projects, such as the Lucene search engine and phpBB.