User:Karthikprasad/GSOC 2012 proposal



Karthik R Prasad
Project Title
Wikipedia Corpus Tools
Oren Bochman

Contact/working info

Bangalore, India (UTC/GMT +5:30 hours)
Typical Working hours
11 AM to 11 PM (5:30 AM to 5:30 PM UTC)
IRC or IM networks/handle(s)
irc :: karthikprasad
Skype :: prasadkarthik

Project Summary

A corpus, in a broad sense, is a collection of texts in electronic form (for spoken language, a transcription of speech) used for linguistic research. A search engine makes working with a corpus easier: it helps users find words and collocations in context and determine their frequency in the corpus and their original source text. It also enables further processing of the retrieved data (alphabetical sorting, etc.). Some corpora can also be searched by part of speech (POS).

Wikipedia text is, without doubt, a vast repository of articles and information on almost any topic, written in everyday natural language.

"This project aims to automate the process of extracting the Wikipedia dumps and cleaning them up into a suitable corpus format."

In this project, at least three languages will be covered: English, Hebrew and Hindi. I also propose to work with Kannada (my native language). QA and input from speakers of other languages can also be incorporated into the workflow, if and when such speakers become available.

The main idea for the NLP (Natural Language Processing) components, with the POS tagger as an example, is:

1. A fall back system that does unsupervised POS tagging.
2. The ability to plug in an existing POS tagger as these become available for specific languages.
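
The two points above can be pictured as a small registry that hands out a language-specific tagger when one has been plugged in and a fallback otherwise. This is only a minimal sketch: the names `TaggerRegistry` and `fallback_tagger` are hypothetical, the fallback simply labels every token `UNK` as a stand-in for a real unsupervised tagger, and the "en" tagger is a toy, not an actual POS tagger.

```python
from typing import Callable, Dict, List, Tuple

# A tagger maps a list of tokens to (token, tag) pairs.
Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def fallback_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Placeholder for the unsupervised tagger: labels every token 'UNK'."""
    return [(tok, "UNK") for tok in tokens]


class TaggerRegistry:
    """Maps language codes to taggers; uses the fallback when none is registered."""

    def __init__(self, fallback: Tagger) -> None:
        self._fallback = fallback
        self._taggers: Dict[str, Tagger] = {}

    def register(self, lang: str, tagger: Tagger) -> None:
        """Plug in an existing tagger for one language."""
        self._taggers[lang] = tagger

    def tag(self, lang: str, tokens: List[str]) -> List[Tuple[str, str]]:
        """Tag tokens with the language's tagger, or with the fallback."""
        return self._taggers.get(lang, self._fallback)(tokens)


registry = TaggerRegistry(fallback_tagger)
# Toy stand-in for a real English tagger (assumption, for illustration only).
registry.register("en", lambda toks: [(t, "NOUN" if t.istitle() else "X") for t in toks])

print(registry.tag("en", ["Bangalore", "is"]))  # uses the plugged-in tagger
print(registry.tag("kn", ["a", "b"]))           # no Kannada tagger yet: fallback
```

With this shape, adding support for a new language is a single `register` call, and languages without a dedicated tagger still get tagged output.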


a) POS tagging will provide 80%-96% accuracy in lexical disambiguation.
b) Most WMF languages lack corpora for use in Natural Language Processing.
c) The corpus delivered will be very helpful to the many researchers and companies who do not have the resources for such CPU-intensive tasks, and it can be easily picked up.
d) It would be consumed by downstream projects: search engine, grammar and spell checking, machine translation, language detection, etc.


Required Deliverables

  1. A framework for handling different languages.
    1. Cleanup dumps
      1. Discard redirects, templates.
      2. Split the corpus according to mainspace, and talk pages. (optional)
      3. Develop a heuristic to handle spam revisions - edits that get reverted should be placed into their own corpus.
    2. Train/integrate sentence chunkers. (find sentence boundary)
    3. Integrate part of speech tagging
      1. Build or integrate a general-purpose unsupervised POS tagger.
      2. Add integration for existing POS taggers.
    4. Integration of Named Entities tagger (optional)
  2. POS/tagged Wikipedia dumps. [1]
  3. N-Gram datasets derived from the corpus.
  • Detailed break up plan mentioned in the Schedule.

If time permits

  1. Handling spam revisions is something I find very interesting and challenging, and I am willing to work further on improving the heuristic developed to handle them.
  2. Working on integrating the obtained corpus with a search engine.

I intend to keep working on these two axes even after the GSoC period ends in August.

Project Schedule

Before Coding Period Starts...

  • Getting familiar with the mentors
  • Discuss the deliverables with the mentors and outline a precise requirements document.
  • Discuss the general approach to be taken with the mentors.
  • Thoroughly understand the tools and libraries that will be used in this project.

Coding Period

My university examinations end on May 26th.

Schedule for the first leg:-
1) 1 Week (27th May to 2nd June) :-

  • Extract the wiki dumps.
  • Clean the dumps and make them suitable for further processing.
  • Splitting the corpus into main-space and talk pages.
  • Write a script for automating this process of extraction and cleaning.
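
As a rough illustration of the extraction and cleaning step, the sketch below streams a MediaWiki XML export, drops redirects, and keeps only main-space and talk pages (so templates fall away as well). It assumes the dump carries an `<ns>` element per page, as in export schema 0.6, with namespace 0 for articles and 1 for talk; the function name `iter_pages` and the tiny `SAMPLE` dump are made up for the example, and wiki markup inside the page text is not cleaned here.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, IO, Iterator, Tuple

MAIN_NS, TALK_NS = "0", "1"  # namespace ids for articles and article talk pages


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump: IO[bytes]) -> Iterator[Tuple[str, str, str]]:
    """Yield (namespace, title, text) for non-redirect main and talk pages."""
    for _, elem in ET.iterparse(dump, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields: Dict[str, str] = {}
        is_redirect = False
        for child in elem.iter():
            name = local(child.tag)
            if name == "redirect":
                is_redirect = True
            elif name in ("ns", "title", "text") and name not in fields:
                fields[name] = child.text or ""
        # Templates (namespace 10) and all other namespaces are skipped here.
        if not is_redirect and fields.get("ns") in (MAIN_NS, TALK_NS):
            yield fields["ns"], fields.get("title", ""), fields.get("text", "")
        elem.clear()  # release the page's subtree; real dumps are very large


# Made-up miniature dump used only to exercise the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
<page><title>Corpus</title><ns>0</ns><revision><text>A corpus is a collection of texts.</text></revision></page>
<page><title>Corpora</title><ns>0</ns><redirect title="Corpus"/><revision><text>#REDIRECT [[Corpus]]</text></revision></page>
<page><title>Template:Stub</title><ns>10</ns><revision><text>stub</text></revision></page>
<page><title>Talk:Corpus</title><ns>1</ns><revision><text>Discussion.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for ns, title, _ in pages:
    print(ns, title)
```

Streaming with `iterparse` and clearing each page after use keeps memory flat, which matters because a full dump does not fit in memory.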

2) 2 weeks (3rd June to 16th June) :-

  • Training sentence chunkers.
  • Understanding the best way to do this.
  • Documenting the process simultaneously.
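
Before any chunker is trained, a rule-based splitter can serve as a baseline to compare against. The sketch below is such a baseline, not the trained chunker the schedule refers to: it treats `.`, `!`, `?` and the Devanagari danda as candidate boundaries and refuses to split after a small, hypothetical abbreviation list. A trained model would learn these exceptions from data instead.

```python
import re
from typing import List, Set

# Hypothetical abbreviation list; a trained chunker would learn these from data.
ABBREVIATIONS: Set[str] = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

# Candidate boundary: whitespace preceded by '.', '!', '?' or the danda (U+0964).
_CANDIDATE = re.compile(r"(?<=[.!?\u0964])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text at candidate boundaries, except right after an abbreviation."""
    sentences: List[str] = []
    current = ""
    for piece in _CANDIDATE.split(text.strip()):
        if not piece:
            continue
        current = f"{current} {piece}".strip()
        # Only close the sentence if its last word is not a known abbreviation.
        if current.rsplit(None, 1)[-1].lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    return sentences


for sentence in split_sentences(
    "Dr. Rao edits Wikipedia. He likes corpora, e.g. n-gram sets! Does it work?"
):
    print(sentence)
```

A baseline like this also gives a simple yardstick: the trained chunker should at least beat it on text containing abbreviations, initials and numbers.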

3) 2 weeks (17th June to 30th June) :-

  • Part of Speech Tagging of the wiki dumps.
  • Building N-Gram Datasets.
  • Analysing the approach to take for building an unsupervised POS tagger.
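
Building an N-gram dataset amounts to sliding a window of N tokens over each sentence and counting what is seen. A minimal sketch follows, assuming the corpus has already been split into tokenized sentences; the function names and the two-sentence `corpus` are illustrative only.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams per sentence, so no n-gram spans a sentence boundary."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


# Illustrative, already-tokenized corpus.
corpus = [["the", "free", "encyclopedia"], ["the", "free", "corpus"]]
bigrams = count_ngrams(corpus, 2)
for gram, freq in bigrams.most_common():
    print(" ".join(gram), freq)
```

Counting sentence by sentence is the reason sentence chunking comes first in the schedule: without boundaries, the last word of one sentence and the first word of the next would be counted as a bigram.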

4) 1 week (1st July to 7th July) :-

  • Wrapping up of the first leg of GSoC.
  • Fine tuning of the work done till date.
  • Compiling a report for submission for Mid-term Evaluation

Schedule for the second leg:-
5) 1 week (8th July to 14th July) :-

  • Continuing the work on a script to build the POS tagger.
  • Documenting and testing the script extensively.
  • Discussion with the mentors about the road ahead.

6) 4 weeks (15th July to 11th August) :-

  • Exploring the various measures that can detect spam revisions.
  • Coming up with a reasonable heuristic for spam revisions.
>>As per my mentor's recommendation, I intend to look into a deletion/reversion-based heuristic. This would let us build a spam corpus, which in turn would improve the accuracy of the main corpora.
  • Documenting the approach and technique used by citing assumptions and rationale.
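
One simple form a reversion-based heuristic could take is identity-revert detection: if a revision's text is identical to an earlier revision of the same page, every revision in between was undone. The sketch below is an assumption about how this might look, not the heuristic finally adopted; `reverted_indices` is a hypothetical name and the page history is invented.

```python
import hashlib
from typing import Dict, List, Set


def reverted_indices(revisions: List[str]) -> Set[int]:
    """Indices of revisions undone by a later revision restoring earlier text."""
    last_seen: Dict[str, int] = {}  # content hash -> latest index with that text
    reverted: Set[int] = set()
    for i, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in last_seen:
            # Revision i restores an earlier state: all edits in between were undone.
            reverted.update(range(last_seen[digest] + 1, i))
        last_seen[digest] = i
    return reverted


# Invented page history, oldest revision first.
history = ["good text", "BUY CHEAP PILLS", "good text", "good text, expanded"]
spam_candidates = reverted_indices(history)
print(sorted(spam_candidates))
```

Reverted edits are only candidates for the spam corpus: good-faith edits also get reverted, and in an edit war the restoring revision can itself be flagged, so a further filtering step would still be needed.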

7) 1 week (12th August to 18th August) :-

  • Wrapping up of the second leg of GSoC.
  • Fine tuning the work done till date.
  • Compiling the final report for submission.

8) 1.5 weeks (19th August to 29th August) :-

  • Polishing the deliverables for submission to Google.

--The timeline outlined above is tentative, though I feel it is workable. I am, however, open to suggestions from my mentor.
--The implementation plan in the timeline does not strictly follow the order of the deliverables section. I have split the work into modules that I think fit together well.
--My intention is to finish the deliverables well ahead of time and then work on the areas mentioned in the 'If time permits' section above.

About Me

A student of Computer Science and Engineering at PESIT, Bangalore, I am a Tech Enthusiast and an avid Nature Lover.

I have taken courses in 'Introduction to Artificial Intelligence' and 'Machine Learning'. My special interests are Text Processing and Data Mining. I have taken a course in 'Natural Language Processing' and have chosen 'Data Mining' as an elective. I have done a couple of projects based on Natural Language Processing and am therefore well acquainted with the prerequisites for this project. I am very comfortable with Java, C, C++ and Python. Thanks to my projects and my interests, I know the principles of Corpus Linguistics and WordNet.

Computers and coding have fascinated me for a very long time. The fact that machines could be made to "think" by instructing them to do tasks in a logical sequence simply blew my mind. Apart from a natural interest in Logic & Computing, my specific interests are: Natural Language Processing, Data Mining, Artificial Intelligence, Machine Learning, Algorithms and Data Structures.

I have always been attracted to languages. While geometry is credited with the birth of modern science, I believe that ancient science was born from languages, the study of their structures and grammars. My dream is, therefore, to rediscover the core principles behind languages and gather insight into the working of nature and humans.

I try to extract an idea from one field, extend it, and apply it in another (especially computing), which is why I keep up with "Ideas worth Spreading" on TED. On a more philosophical note, I believe that the Human Mind is the Most Powerful creation of Nature and that its thoughts can "Define/Redefine Dimensions". I intend to shift the day-to-day activities of Human Beings, through technology, onto a "virtual platform", thus curbing pollution and corruption and facilitating the harmonious co-existence of the various life forms to bring about a balance in the ingredients of nature.

Why this project?

Being a language lover, I am interested in the linguistic aspect of this project, which gives me an opportunity to combine my love for language with my love for coding. The project is a new learning opportunity in development that will help me upgrade my technical knowledge and skills. It will also help me learn and understand how computer science and engineering are applied to problem solving in real-world scenarios in industry. I wish to contribute to the best of my ability as an open source developer and gain insightful experience in the field of computer applications. I assure full commitment to the GSoC project and will work the required amount of time on it during the program period.


'Interesting work' is not work; it is fun. Languages, data and computers coming together with GSoC and Wikimedia make this project even more interesting and fun. I am willing to work from 11 AM to 11 PM IST with short refreshing breaks in between.

I wish to discuss the progress of my project with my mentor at least once a week. I use email as the primary channel for formal communication. I would also suggest chat and Skype sessions, as they will help in understanding the problem at hand more clearly. The outcome of the project can be shared over git or in any other way the mentor recommends.

I love facing challenges, and when faced with one, I do not rest until I come up with a solution. When stuck, I turn to forums, IRC channels and technical articles for solutions. I also try to discuss the matter with my friends, teachers and mentors.

Past Open Source Experience

The projects I have worked on to date are either independent projects done out of personal interest or university projects. They are summarised below:

1) Parallelizing the Common Motifs Problem on CUDA

The project aimed at parallelizing the Sagot algorithm for the "Common Motifs Problem" and the Ukkonen algorithm for building suffix trees, and at implementing them on the CUDA architecture.
The project was guided by the Dean of Research, PESIT, and was carried out in collaboration with Old Dominion University, USA.

2) Sentiment Analysis on WEB

Extraction of reviews about a gadget from tech-review forums and analysis of the sentiments in those reviews, in order to predict the overall sentiment/opinion associated with the gadget and generate an appropriate rating on a scale of 10.

3) Automated Essay Grader (AEG), the project I am currently working on

A system that automatically grades English essays using Spelling, Grammar and Structure, Coherence, Frequent phrases and Vocabulary as weighted parameters. It is realized by implementing a self-designed algorithm that studies the ‘relation graph’ of the words of the essay. Implemented in Python.

Having said this, open source is not new to me. I have borrowed heavily from the open source community for my projects. The wonderful operating system I use is also the fruit of this community's efforts. This project gives me a wonderful opportunity to contribute to the community and will be the starting point of my contributions to this world, which will hopefully be of some help to others.

References and Other Info