User:Shaohong

Incremental Data Dumps

 * Public URL: http://www.mediawiki.org/wiki/User:Shaohong
 * Bugzilla report: https://bugzilla.wikimedia.org/show_bug.cgi?id=28956
 * Announcement: http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069062.html

Name and contact information
Name: Shao Hong, Peh

Email: shaohong86@gmail.com

IRC or IM networks/handle(s): shh @ freenode

Location: Singapore

Typical working hours: 7am to 12pm (GMT +8)

Synopsis
Mailing list thread

This project proposal addresses Bug 28956 - “Dumps should be incremental”. The objective is to make the monthly dump incremental, in contrast to the current practice of crawling the whole wiki and producing a full dump each month.

The new implementation will cut the time and resources needed to create a whole data dump each month. Only a small subset of the data changes from month to month (new pages, new revisions, or deleted revisions), so an incremental dump only needs to add or replace those changes relative to the last dump.

This will benefit both Wikipedia and MediaWiki: Wikipedia will be able to use the freed resources for other projects, and administrators of other MediaWiki installations will be able to use the feature to back up their own wikis.

Deliverables
The project is divided into two parts. The first part tackles the bottleneck in the current process and develops a new process that produces an incremental dump. The second part integrates the new incremental dump into the last full dump, which would normally have been generated the previous month.

Done by mid-term review
Currently, the bottleneck in creating a dump is the step that generates the whole Wikipedia history and merges it into each page stub, which makes the whole dumping process slow. For the first part of the system, I will look at the page stub dump and pick out the pages whose revision dates fall within the time range from the last dump to the current time. The next step is to go through the history and pull out the revisions that match those page_ids. This should be faster because it only has to match a smaller set of pages than the full set used in the current process. The result is a new set of dumps that is “incremental” because it contains only pages with changes: new revisions, and added, moved, or deleted pages. More design and planning will go into this stage to define a new XML markup for the “incremental” dump.
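The page-selection step described above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the final design: the sample XML is a simplified, namespace-free stand-in for a real stub dump, the function names (`changed_pages`, `parse_ts`, `local`) are hypothetical, and deleted or moved pages are not handled because they would need extra information that the stub file does not carry.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Strip an XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def parse_ts(text):
    """Parse a UTC timestamp such as 2013-05-01T12:00:00Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def changed_pages(stub_stream, last_dump):
    """Yield (page_id, [revision ids]) for pages with revisions newer than last_dump."""
    for _event, elem in ET.iterparse(stub_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id = None
        new_revs = []
        for child in elem:
            name = local(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "revision":
                rev_id = ts = None
                for field in child:
                    if local(field.tag) == "id":
                        rev_id = int(field.text)
                    elif local(field.tag) == "timestamp":
                        ts = parse_ts(field.text)
                if ts is not None and ts > last_dump:
                    new_revs.append(rev_id)
        if new_revs:
            yield page_id, new_revs
        elem.clear()  # drop the finished page's children to keep memory low


# Simplified sample data (an assumption, not the real stub dump schema).
SAMPLE = b"""<mediawiki>
  <page><title>A</title><id>1</id>
    <revision><id>10</id><timestamp>2013-04-01T00:00:00Z</timestamp></revision>
    <revision><id>11</id><timestamp>2013-05-20T08:30:00Z</timestamp></revision>
  </page>
  <page><title>B</title><id>2</id>
    <revision><id>20</id><timestamp>2013-03-15T00:00:00Z</timestamp></revision>
  </page>
</mediawiki>"""

last_dump = datetime(2013, 5, 1, tzinfo=timezone.utc)
result = list(changed_pages(io.BytesIO(SAMPLE), last_dump))
print(result)  # [(1, [11])]
```

Streaming the stub file page by page keeps memory use flat, and only the selected page and revision ids then need to be matched against the full history.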

Done by final evaluation
After the new “incremental” dump is created, the next part of the project is to design a tool that imports it into the last full dump. The challenge here is changing records in a large XML file as efficiently as possible. Currently, using SAX is one of the methods I have in mind, but more research and planning are needed. Besides this internal tool, I will also write a tool for users who wish to apply the “incremental” dump to an SQL database: it will convert the dump into SQL queries that users can run against their MySQL database.
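To illustrate the SQL conversion idea, here is a minimal sketch that uses a SAX handler to turn an incremental dump into parameterized SQL statements. Several things here are assumptions made only for illustration: the `<incremental>` XML layout (the real markup is still to be designed), the class name `IncrementalToSql`, and the two-table schema, which is a heavily simplified version of MediaWiki's `page` and `revision` tables. SQLite stands in for MySQL so the example runs on its own; a MySQL driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3
import xml.sax


class IncrementalToSql(xml.sax.ContentHandler):
    """Collect (sql, params) pairs from a simplified incremental dump."""

    def __init__(self):
        super().__init__()
        self.statements = []
        self.path = []   # stack of currently open element names
        self.text = []   # character data of the current element
        self.page = {}
        self.rev = {}
        self.revs = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "page":
            self.page, self.revs = {}, []
        elif name == "revision":
            self.rev = {}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        self.path.pop()
        parent = self.path[-1] if self.path else None
        if parent == "page" and name in ("id", "title"):
            self.page[name] = value
        elif parent == "revision" and name in ("id", "timestamp"):
            self.rev[name] = value
        elif name == "revision":
            self.revs.append(self.rev)
        elif name == "page":
            page_id = int(self.page["id"])
            # REPLACE adds a new row or overwrites the row with the same key.
            self.statements.append((
                "REPLACE INTO page (page_id, page_title) VALUES (?, ?)",
                (page_id, self.page["title"]),
            ))
            for rev in self.revs:
                self.statements.append((
                    "REPLACE INTO revision (rev_id, rev_page, rev_timestamp) "
                    "VALUES (?, ?, ?)",
                    (int(rev["id"]), page_id, rev["timestamp"]),
                ))


# Hypothetical incremental dump containing one changed page.
INCREMENT = b"""<incremental>
  <page><title>A</title><id>1</id>
    <revision><id>11</id><timestamp>2013-05-20T08:30:00Z</timestamp></revision>
  </page>
</incremental>"""

handler = IncrementalToSql()
xml.sax.parseString(INCREMENT, handler)

# Apply the statements to a database that already holds the last full dump.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, "
           "rev_page INTEGER, rev_timestamp TEXT)")
db.execute("INSERT INTO page VALUES (1, 'Old title')")
for sql, params in handler.statements:
    db.execute(sql, params)
rows = db.execute("SELECT page_title FROM page WHERE page_id = 1").fetchall()
print(rows)  # [('A',)]
```

Because SAX processes the file as a stream of events, the same approach should scale to dumps that are too large to load into memory at once.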

Timeline breakdown
This is a very general timeline for now, but it will take shape as time goes on.

May 27–June 17


 * Besides spending time getting to know the community, I will ask my mentor for access to a simple wiki so that I can go through the whole process of running a full dump and better understand the bottleneck
 * Discuss my research with my mentor and propose different methods of handling it

Week 1–7: June 17–August 5
Week 1: June 17–June 23


 * Start working on the plan discussed with my mentor

Week 2–4: June 24–July 14


 * Continue working on and testing the design

Week 5–6: July 15–July 28


 * Perform testing to ensure that the incremental dump is generated properly

Week 7: July 29–August 4

Aug 2 - Mid-term review


 * Test for bugs before Mid-term review.

Week 8–15: August 5–September 27
Week 8–11½: August 5–September 4


 * Start writing tools for applying the incremental dump to the last full dump

Week 11½–Week 15: September 5–September 27

Sept 16 - Soft Pencil Down

Sept 16 - Final Pencil Down

Sept 27 Final evaluation


 * By now, the main project should be deployed. This period will be dedicated to further bug fixing and, most importantly, touching up the documentation.

About you
Currently I am an undergraduate at the National University of Singapore, majoring in Computer Science with a focus on Software Engineering and Information Retrieval. I have had internships at companies such as IBM, Apple and Defense Science Org in Singapore. Most of my projects involved writing applications that crunch large amounts of data and present the output to users.

As a developer trained in software engineering, one of my strongest points is that I understand the whole flow, from gathering requirements from the client to maintaining the application. At IBM and Apple, I was exposed to the full development cycle: I met with users to learn their requirements and followed through to delivering the system itself. Besides the experience from these two big companies, I have also worked on various freelance projects, which were less systematic, but I still strongly believe in the development cycle.

As mentioned, I have worked on applications that crunch large amounts of data and give users nicely formatted output. One example is from Apple, where I was with the Business Intelligence team covering the Asia Pacific region. As a software developer, I designed and developed a web application that processed customers’ orders and information every hour. Its output went to the logistics planning team to give them a better picture when forecasting demand. The application was written in Java, with Google Web Toolkit as the front-end framework.

For the whole of last year, I was involved in a school project that used Wikipedia and Twitter. I was in charge of using the Wikipedia data dump for the backend. Through that project, I got a better picture of how data dumps work, and updating the database from the dump each month was a painful process for me: I had to ensure that my script did not take up too much memory and did not hit any exception that would cause the whole script to break down.

Therefore, I hope to help solve this problem by making data dumps easy to use for system administrators, so that they do not have to spend time and resources on a full update of their existing collection of dumps. I am not trying to save the whole world, but I believe this new feature will allow more people to have more time for fun!

Participation
I plan to write a weekly update on my MediaWiki.org user page where everyone can see my progress, and to communicate with my mentor at least twice a week via email, IRC or Skype. I will use MediaWiki's internal Git repo to host my code if possible; otherwise I will host it on GitHub, where everyone can see the progress.

I will also be working closely with Ariel on this project. I understand that she is currently handling this, and if there is any problem or I need advice, I will approach her. Besides working on this project for GSoC, I intend to keep contributing my time to Wikipedia after GSoC, because I have gained knowledge from Wikipedia and now it is my turn to contribute back and help create an even better system for everyone.

Past open source experience
This will be my third year in GSoC, as I took part in GSoC 2011 and GSoC 2012. Last year, I was with New Vision Public School from New York City, which is in charge of quite a number of public schools in New York. They were exploring the idea of using Google App Script to help schools automate some administrative processes and to help teachers with classroom management. I helped them explore connecting databases such as MySQL, MSSQL and Oracle Database to their Google Spreadsheets through Google App Script, and I created a front end that lets non-SQL users select data using simple operators such as And and Or, which are then converted into SQL statements. This project was also showcased at the Google NYC office.

I am also helping Common Crawl design tutorials and a new tutorial web site for new users. The site is still in progress, but you can take a look at https://github.com/pshken/cc-wordcount-py for the simple tutorial I wrote. I also helped rewrite a feature of the open-source project Goose during one of my internships a few years back: I added the ability for users to input HTML directly instead of using a URL.

My GitHub profile: https://github.com/pshken