User:Gautham shankar/Gsoc


Identity

Name : Gautham Shankar G
Email : gautham.shankar@hiveusers.com
Project Title: Lucene Automatic Query Expansion From Wikipedia Text

Contact / Working Info

Timezone: UTC/GMT +5:30 hours
Typical working hours: 9 AM to 10 PM (3:30 AM to 4:30 PM UTC) (flexible)
IRC handle: gauthamshankar
Skype: gautham.shankar3
Phone: +919884837794

Project Summary

Query expansion is a method used to improve the performance of information retrieval systems.
The following problems may exist when a user gives a search query:

  • Users typically formulate very short queries and are not likely to take the trouble of constructing long and carefully stated queries.
  • The words used to describe a concept in the query are different from the words authors use to describe the same concept in their documents.


Hence the search results obtained are not satisfactory and might not contain the relevant information the user is looking for. This project aims to solve the problem in two stages:

  1. Creation of a multilingual wordnet
    • A wordnet is a lexical database that has become a standard resource in NLP research.
    • In a wordnet, nouns, verbs, adjectives and adverbs are organized into sets of synonyms, called synsets, each of which conveys a concept.
    • These synsets are connected to other synsets by semantic relations (hyponymy, antonymy etc.).
    • The wordnet can be built using the vast multilingual data in Wiktionary.
  2. Query expansion
    • The input query is expanded with relevant terms in order to encompass a wider range of context for a successful search.
    • The search query is mapped against the wordnet to obtain relevant expansion terms.
    • Integrating wordnet and search will provide data on the effectiveness of the wordnet.
    • The query expansion is added as a Lucene filter.


Both stages are substantial, and most of the time during GSoC will go into building the wordnet. If time permits, the Lucene filter can be added.

While there are other multilingual resources such as DBpedia and EuroWordNet, EuroWordNet requires a license for use, and DBpedia and other Wikipedia-based wordnets have been generated from wiki articles, mined from the categories and content of the articles rather than built with a linguistic approach. In the future, the wordnet can also be updated automatically as new words are added to Wiktionary or existing entries change. Completing the wordnet would provide a vast lexical database in machine-readable format for future NLP projects, and completing the entire project would greatly enhance the quality of the search results presented to the user.

Deliverables

A framework to mine Wiktionary to create a wordnet

There are two approaches to creating a multilingual wordnet:

  1. Merge Model
    • A new ontology is constructed for the target language.
    • Relations between an existing wordnet in English and this local wordnet are generated.
  2. Expand Model
    The synsets in the English wordnet are translated (using bilingual dictionaries) into equivalent synsets in the other language.


For this project I find the Expand model more suitable, since translations of English words are automatically available in Wiktionary. As the synsets for English words are generated, their counterparts in other languages can be generated simultaneously. To reduce the complexity involved, the wordnet will be built only on the noun and verb forms of words, and the synsets will be semantically linked using hypernymy/hyponymy (a minimal data-structure sketch follows the examples below). For example,

  • hypernymy – {tree, tree diagram} is a kind of {plane figure, two-dimensional figure}
  • hyponymy – {tree} can be a {chestnut, chestnut tree}.
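To make the synset structure concrete, below is a minimal Java sketch of one possible in-memory representation, assuming Java as the implementation language; the class and member names (Synset, addHypernym) are illustrative only and not existing code.

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  /** Illustrative sketch only: a synset holding synonymous words plus hypernym/hyponym links. */
  public class Synset {
      final String id;        // e.g. "en:tree:1" (identifier scheme is an assumption)
      final String language;  // ISO code, e.g. "en"
      final Set<String> words = new LinkedHashSet<String>();   // the synonyms forming this synset
      final List<Synset> hypernyms = new ArrayList<Synset>();  // more general concepts
      final List<Synset> hyponyms  = new ArrayList<Synset>();  // more specific concepts

      Synset(String id, String language) {
          this.id = id;
          this.language = language;
      }

      /** Links this synset to a more general synset and records the inverse hyponym link. */
      void addHypernym(Synset parent) {
          hypernyms.add(parent);
          parent.hyponyms.add(this);
      }
  }

Recording the inverse hyponym link whenever a hypernym link is added keeps the hierarchy traversable in both directions.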


Semantic language links

The wordnet generation will first be tested on two languages, English being one of them. Words in some languages are polysemic, so more than one synset is assigned; in some of these cases a word is monosemic in at least one language, with a unique synset assigned. Thus the wordnet will contain two types of semantic links: language-independent links (between languages) and language-dependent links (within a language). A small sketch of a cross-language index follows.
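One possible shape for the language-independent links is an inter-lingual index keyed by a shared concept id; the sketch below is illustrative only, and the class name InterlingualIndex and the id scheme are assumptions.

  import java.util.HashMap;
  import java.util.Map;

  /** Illustrative sketch only: language-independent links between equivalent synsets. */
  public class InterlingualIndex {
      // concept id -> (language -> synset id), e.g. "tree#1" -> {en=en:tree:1, de=de:Baum:1}
      private final Map<String, Map<String, String>> links = new HashMap<String, Map<String, String>>();

      /** Registers that the given synset expresses the given concept in the given language. */
      public void link(String conceptId, String language, String synsetId) {
          Map<String, String> byLanguage = links.get(conceptId);
          if (byLanguage == null) {
              byLanguage = new HashMap<String, String>();
              links.put(conceptId, byLanguage);
          }
          byLanguage.put(language, synsetId);
      }

      /** Returns the synset id for a concept in a language, or null if no link exists yet. */
      public String lookup(String conceptId, String language) {
          Map<String, String> byLanguage = links.get(conceptId);
          return byLanguage == null ? null : byLanguage.get(language);
      }
  }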

Required deliverables

Basic wordnet creation flow diagram
  1. Page Collection Module (an illustrative dump-parsing sketch follows this list)
    • Wiktionary dumps are downloaded
    • The data is parsed to remove noise
    • An effective storage mechanism is created for future retrieval
  2. Dbase Module (an illustrative RDF output sketch follows this list)
    • Handles the final wordnet data storage
    • A data structure is created to store the wordnet
    • The wordnet will be in RDF/OWL format
  3. Extraction Module (an illustrative synonym-extraction sketch follows this list)
    • This extracts information for a particular word
    • Synonyms are extracted to generate synsets
    • Hypernymy/hyponymy are extracted to generate links
    • Gets the translations for the word and adds them to the queue
  4. Mapping Module
    • This takes care of establishing the two semantic links in the wordnet
    • Data generated by the extractors is used
  5. Extraction Manager
    • This module coordinates the Extractors and Mapping
    • It writes the final output into Dbase
    • Consistency checks are put into place in this module
  6. Process Manager (an illustrative queue-loop sketch follows this list)
    • The process manager automates the task of fetching new words from the extractors and adding them to the queue
    • It regulates the entire automatic wordnet creation process
    • Consistency checks are written in this module
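To illustrate the Page Collection Module, the following sketch streams an already decompressed Wiktionary XML dump with the standard Java StAX API and hands each (title, wikitext) pair downstream; the class DumpReader and the println hand-off are placeholders for the eventual storage mechanism.

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  /** Illustrative sketch only: stream a decompressed Wiktionary XML dump with StAX. */
  public class DumpReader {

      public static void read(Path dumpFile) throws Exception {
          XMLInputFactory factory = XMLInputFactory.newInstance();
          InputStream in = Files.newInputStream(dumpFile);
          try {
              XMLStreamReader reader = factory.createXMLStreamReader(in);
              String title = null;
              while (reader.hasNext()) {
                  if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                      continue;
                  }
                  String element = reader.getLocalName();
                  if ("title".equals(element)) {
                      title = reader.getElementText();            // page title, e.g. "tree"
                  } else if ("text".equals(element)) {
                      String wikitext = reader.getElementText();  // raw wikitext of the entry
                      // Placeholder hand-off: the real module would clean the wikitext and store it.
                      System.out.println(title + " -> " + wikitext.length() + " characters");
                  }
              }
              reader.close();
          } finally {
              in.close();
          }
      }
  }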
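For the Dbase Module, one plausible way to produce the RDF output is to serialize each synset as N-Triples; the namespace, predicate names and class SynsetRdfWriter below are assumptions, not a finalized schema.

  import java.util.List;

  /** Illustrative sketch only: serialize a synset and one relation as N-Triples. */
  public class SynsetRdfWriter {
      // Placeholder namespace; the final RDF/OWL vocabulary is yet to be decided.
      private static final String NS = "http://example.org/wiktionary-wordnet/";

      public static String toNTriples(String synsetId, List<String> words, String hypernymId) {
          StringBuilder sb = new StringBuilder();
          String subject = "<" + NS + "synset/" + synsetId + ">";
          for (String word : words) {
              // One triple per member word of the synset.
              sb.append(subject).append(" <").append(NS).append("containsWord> ")
                .append('"').append(word).append("\" .\n");
          }
          if (hypernymId != null) {
              // Language-dependent link to the more general synset.
              sb.append(subject).append(" <").append(NS).append("hypernym> <")
                .append(NS).append("synset/").append(hypernymId).append("> .\n");
          }
          return sb.toString();
      }
  }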
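For the Extraction Module, a first-cut synonym extractor could locate the Synonyms section of an entry's wikitext and collect the wiki links inside it. Wiktionary layouts vary, so the heading marker used below and the class name SynonymExtractor are assumptions.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /** Illustrative sketch only: collect wiki links from the Synonyms section of an entry's wikitext. */
  public class SynonymExtractor {
      // Matches wiki links such as [[timber]] or [[tree|trees]] and captures the link target.
      private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)(?:[|#][^\\]]*)?\\]\\]");
      private static final String HEADING = "====Synonyms===="; // heading depth varies; this is an assumption

      public static List<String> extractSynonyms(String wikitext) {
          List<String> synonyms = new ArrayList<String>();
          int start = wikitext.indexOf(HEADING);
          if (start < 0) {
              return synonyms;                       // no synonyms section in this entry
          }
          int end = wikitext.indexOf("====", start + HEADING.length());
          String section = end > start ? wikitext.substring(start, end) : wikitext.substring(start);
          Matcher m = LINK.matcher(section);
          while (m.find()) {
              synonyms.add(m.group(1).trim());
          }
          return synonyms;
      }
  }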
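The Process Manager can be organized as a queue-driven loop: seed words are enqueued, each word is handed to the extraction, mapping and storage modules, and newly discovered translations are fed back into the queue. The sketch below is illustrative; the class ProcessManager and the lookupTranslations placeholder are not existing code.

  import java.util.ArrayDeque;
  import java.util.Collections;
  import java.util.Deque;
  import java.util.HashSet;
  import java.util.Set;

  /** Illustrative sketch only: queue-driven loop that walks from seed words through their translations. */
  public class ProcessManager {
      private final Deque<String> queue = new ArrayDeque<String>();
      private final Set<String> seen = new HashSet<String>();

      public void run(Iterable<String> seedWords) {
          for (String word : seedWords) {
              enqueue(word);
          }
          while (!queue.isEmpty()) {
              String word = queue.poll();
              // 1. Extraction module: synonyms, hypernyms/hyponyms and translations for this word.
              // 2. Mapping module: link the resulting synsets within and across languages.
              // 3. Dbase module: persist the synsets and links (with consistency checks).
              for (String translation : lookupTranslations(word)) {
                  enqueue(translation);   // newly discovered words feed back into the queue
              }
          }
      }

      private void enqueue(String word) {
          if (seen.add(word)) {          // process each word at most once
              queue.add(word);
          }
      }

      private Iterable<String> lookupTranslations(String word) {
          // Placeholder: would call the extraction module's translation output.
          return Collections.<String>emptyList();
      }
  }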

If time permits

  1. A Lucene filter for query expansion (a minimal sketch follows)
    • The filter will use the wordnet to generate expansion terms
    • The expansion terms can be filtered by creating semantic maps of the query
    • The modified query is passed to Lucene to obtain results
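As a rough illustration, and assuming the wordnet expansions are available as an in-memory map from a term to its expansion terms, a custom Lucene TokenFilter could inject each expansion term at the same position as the original token (position increment 0). The class WordnetExpansionFilter is hypothetical; Lucene's built-in synonym filtering could also be reused once the wordnet is exported to a suitable synonym format.

  import java.io.IOException;
  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.List;
  import java.util.Map;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.util.AttributeSource;

  /** Illustrative sketch only: inject wordnet expansion terms at the position of the original token. */
  public final class WordnetExpansionFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
      private final Map<String, List<String>> expansions;        // term -> expansion terms from the wordnet
      private final Deque<String> pending = new ArrayDeque<String>();
      private AttributeSource.State savedState;

      public WordnetExpansionFilter(TokenStream input, Map<String, List<String>> expansions) {
          super(input);
          this.expansions = expansions;
      }

      @Override
      public boolean incrementToken() throws IOException {
          if (!pending.isEmpty()) {
              // Emit a queued expansion term at the same position as the original token.
              restoreState(savedState);
              termAtt.setEmpty().append(pending.poll());
              posIncAtt.setPositionIncrement(0);
              return true;
          }
          if (!input.incrementToken()) {
              return false;
          }
          List<String> terms = expansions.get(termAtt.toString());
          if (terms != null && !terms.isEmpty()) {
              pending.addAll(terms);
              savedState = captureState();
          }
          return true;
      }

      @Override
      public void reset() throws IOException {
          super.reset();
          pending.clear();
          savedState = null;
      }
  }

The filter would be added to the analyzer chain used for query parsing, so that documents indexed with a plain analyzer still match the expanded query.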

Future Project Maintenance

  1. Synchronizing wordnet data with Wiktionary
    • As of now this project will build the wordnet using Wiktionary dumps.
    • Over time, Wiktionary data is revised, which makes the data in the wordnet outdated.
    • A live wordnet extractor module needs to be written to keep the Wiktionary and wordnet data continuously in sync.
    • Until then a monthly wordnet update needs to be released to include the revised content.
  2. Spam Control
    • The addition of noise to the data will greatly affect the results produced by the wordnet.
    • Spam must be identified and removed from the Wiktionary dumps in order to build an effective wordnet.
  3. Algorithm Updates
    • The field of computational linguistics is actively researched, with frequent updates to the techniques and algorithms used.
    • New algorithms and techniques need to be followed, tested against the wordnet for performance, and the code updated if necessary.

Project Schedule

Community Bonding Period

  • Interact with the mentors and the community.
  • Discuss the deliverables with the mentor and finalize the approach to be taken to solve the problem.
  • Familiarize myself with the required algorithms and data structures for the project.


Coding Period

I have my university exams until 31st May and will start coding from 1st June.

Schedule for the first leg:

  1. 1st June to 16th June (Milestone 1 , 2.2 weeks)
    Page Collection Module
  2. 17th June to 23rd June (Milestone 2, 1 week)
    Dbase Module
  3. 24th June to 30th June (Milestone 3, 1 week)
    Extraction Module
  4. 1st July to 7th July (Milestone 4, 1 week)
    • Completion of coding and testing for the first leg of GSoC
    • Prepare documentation for Mid-term Evaluation

Schedule for the second leg:

  1. 8th July to 14th July (Milestone 5, 1 week)
    • Feedback on performance from mentors
    • Mapping Module
  2. 15th July to 28th July (Milestone 6, 2 weeks)
    Extraction Manager
  3. 29th July to 11th August (Milestone 7, 2 weeks)
    Process Manager
  4. 12th August to 18th August (Milestone 8, 1 week)
    • Obtain the final wordnet results
    • Complete coding and testing for second leg
    • Prepare documentation for Evaluation
  5. 19th August to 29th August (Milestone 9, 1.5 weeks)
    Make final changes, if any, to make the project presentable

About Me

I'm Gautham Shankar, pursuing the fourth year of my B.E. in Computer Science and Engineering. I have a great passion for programming and problem solving. My first exposure to programming was with C++, and I created a basic MS Paint replica using C++ in high school. The thrill of watching my friends draw shapes and fill colors in something that I 'created' has kept me hooked on programming ever since. What drives me is the joy of creation using a language, just like a painting created by an artist using a brush. Moving into college I got interested in the World Wide Web and have ever since been fascinated with the huge volumes of data available and its potential when it is structured. This motivated me to take courses in Artificial Intelligence and Data Mining in order to create better art.

I'm fluent in C, C++ and Java. I'm also familiar with PHP and have built a product, hive, using PHP, MySQL and JavaScript; the search engine used in hive is Lucene. My interest in data mining led me to build a recommendation framework in Java using the heat diffusion principle. The project has been implemented on the AOL dataset and gives effective query recommendations. I have been exposed to the concepts of Computational Linguistics and WordNet but have not worked on practical implementations. This project will be my first effort in that direction.

Search is the gateway to harnessing the wealth of the Internet and any improvement in search would greatly affect the average Internet user. I believe this project will provide me the opportunity to do so.

Participation

  • I generally work from 9 AM to 10 PM
  • I use email to communicate updates and progress
  • The project can be hosted on GitHub so that my mentor can review code
  • To discuss project issues in detail I use Skype or Google Talk
  • When I'm convinced that I need help on a certain problem, I look into forums and blogs for people who have faced similar issues; if that does not yield results, I contact relevant people/mentors in the community through mailing lists or IRC to address my problems.

Past open source experience

I have experience working in data mining and have built a Recommendation Framework Using Webgraphs that implements the heat diffusion algorithm (a simplified sketch of a single diffusion step is shown after the figure captions below). The framework currently uses the AOL search dataset to recommend better queries for a given input query, and it has been implemented in Java. Since it is a framework, it can be used to recommend different types of data; for example, the same framework can be used to recommend movies as well as music. I'm currently working on an extension of this project that adds social network graphs so as to map similar people based on the content they search for. The AOL dataset is stored using the Webgraph and Graphlabeler Java libraries. I have uploaded the project code on GitHub under the title webgraphs.

Figure captions (screenshots not reproduced here):
  • The initial heat data for the search query 'java'
  • The final sorted heat data after applying the heat diffusion algorithm for the query 'java'; only the query nodes are given as recommendations
  • The final sorted heat data after applying the heat diffusion algorithm for the query 'nike'; only the query nodes are given as recommendations
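For context, a single discrete diffusion step of the kind the framework performs might look like the following simplified Java sketch; the actual model in the base paper differs in its details, so this is only an approximation for illustration.

  /** Illustrative sketch only: one discrete heat-diffusion step over a graph given as adjacency lists. */
  public class HeatDiffusion {

      /**
       * @param adjacency adjacency[i] lists the neighbours of node i
       * @param heat      current heat per node (query nodes seeded with heat 1.0, others 0.0)
       * @param alpha     thermal conductivity / step size
       * @return          heat values after one diffusion step
       */
      public static double[] step(int[][] adjacency, double[] heat, double alpha) {
          double[] next = new double[heat.length];
          for (int i = 0; i < heat.length; i++) {
              double inflow = 0.0;
              for (int j : adjacency[i]) {
                  inflow += heat[j] / adjacency[j].length;   // each neighbour spreads its heat over its degree
              }
              next[i] = heat[i] + alpha * (inflow - heat[i]);
          }
          return next;
      }
  }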


I have also built a web-based product, "hive", which is a networking platform for members of the power generation industry. It is an open forum where members can share their experiences and interact with one another to effectively run their machines and solve common problems, similar to an open source community. The product has been implemented using PHP (Zend Framework), MySQL and JavaScript (including AJAX). Lucene is the search engine and is used to index and retrieve large volumes of machine history, and phpBB is used for the forums. The code is available on GitHub under the title hive. The website is currently live at http://www.hiveusers.com.


My GitHub link is https://github.com/gauthamshankar

I have extensively used open source technologies for all my projects, and given the opportunity I would like to work on and contribute back to the community whose work I rely on so extensively for my needs.

Other Info

GSoC 2012: Lucene Automatic Query Expansion from Wikipedia Text

1. DBpedia - A Crystallization Point for the Web of Data

http://mx1.websemanticsjournal.org/index.php/ps/article/download/164/162

2. Introduction to EuroWordNet

http://www.springer.com/computer/ai/book/978-0-7923-5295-2

3. Web Query Expansion by WordNet

http://www.sftw.umac.mo/~fstzgg/dexa2005.pdf


Recommendation Framework Using Webgraphs

1. Base Paper - Mining Web Graphs for Recommendations

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5680907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F4358933%2F05680907.pdf%3Farnumber%3D5680907

2. AOL dataset in webgraph and graphlabeler format

http://zola.di.unipi.it/smalltext/datasets.html

