User:Wywin/mediaConvert

This will be changed shortly!

Identity
 * Name: Wyatt Winters
 * Email: mediawikiGSoC at wyattwinters.com
 * Project title: Incremental XML Dumps

Contact/working info
 * Timezone: UTC -5 (CDT)
 * Typical working hours: Super flexible
 * IRC or IM networks/handle(s): irc://irc.freenode.net - wywin

Project summary
English Wikipedia is absolutely massive. At a certain point, it will become prohibitively expensive, both in time and in CPU cycles, to dump its database every two weeks. However, full dumps are not needed, because not every article changes in that two-week period. By dumping only the changed articles and then updating the previous "full" dump, significant work can be saved.

If this is done well, every MediaWiki project that runs scheduled or regular database dumps would benefit, especially small projects for which computing power is a significant portion of the budget and which do not have as much horsepower as Wikimedia.

Required deliverables

 * Script to compare previous dump file against current state of database (via API), noting where revision IDs have changed, and writing list of updated article IDs to disk
 * Modified Export.php that allows for dumping specific article IDs read in from a file. The current Export.php allows for ranges; the modified version could let the user specify a file of article IDs that already exists on the server, or upload such a file from their local machine through the web browser.
 * Script to merge old full XML dump with new, updated XML dump.
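
The compare and merge steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses a simplified, namespace-free XML layout (`<mediawiki>` containing `<page>` elements with `<id>` and `<revision>` children) rather than the real export schema, it loads whole dumps into memory, it takes the current revision IDs as an already-fetched mapping instead of calling the API, and the function names are hypothetical. Deleted pages are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def latest_revisions(dump_xml: str) -> Dict[int, int]:
    """Map page ID -> highest revision ID found in a (simplified) dump."""
    root = ET.fromstring(dump_xml)
    result = {}
    for page in root.iter("page"):
        page_id = int(page.findtext("id"))
        # Assumes every page carries at least one <revision>.
        rev_ids = [int(rev.findtext("id")) for rev in page.iter("revision")]
        result[page_id] = max(rev_ids)
    return result


def changed_page_ids(old: Dict[int, int], current: Dict[int, int]) -> List[int]:
    """Page IDs that are new, or whose latest revision ID differs."""
    return sorted(pid for pid, rev in current.items() if old.get(pid) != rev)


def merge_dumps(old_xml: str, update_xml: str) -> str:
    """Replace updated pages in the old dump and append brand-new pages."""
    old_root = ET.fromstring(old_xml)
    update_root = ET.fromstring(update_xml)
    updated = {int(p.findtext("id")): p for p in update_root.iter("page")}

    merged = ET.Element(old_root.tag, old_root.attrib)
    for page in old_root.iter("page"):
        page_id = int(page.findtext("id"))
        # Use the updated copy when one exists, otherwise keep the old page.
        merged.append(updated.pop(page_id, page))
    # Whatever is left in `updated` did not exist in the old dump.
    for page_id in sorted(updated):
        merged.append(updated[page_id])
    return ET.tostring(merged, encoding="unicode")
```

The list returned by `changed_page_ids` is what would be written to disk and handed to the modified Export.php. A real implementation would most likely need a streaming parser, since a full dump will not fit in memory.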

If time permits

 * Improve how merging works. Move from the initial (easy) implementation of simply replacing old article text with new, to the more efficient but more difficult approach of replacing only changed words, lines, or sections, making the update far smaller
 * Improve performance and lower memory footprints
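
As a rough idea of what the line-level version of the improved merge could look like, here is a minimal sketch using Python's standard `difflib`. The delta format (a list of `(start, end, replacement_lines)` tuples indexed against the old text) and the names `make_delta` and `apply_delta` are assumptions made up for this illustration, not an existing dump format.

```python
import difflib


def make_delta(old_text, new_text):
    """Record only the line ranges of old_text that changed."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Old lines i1..i2 are replaced by new lines j1..j2.
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta


def apply_delta(old_text, delta):
    """Rebuild the new text from the old text plus the recorded changes."""
    old_lines = old_text.splitlines(keepends=True)
    out = []
    pos = 0
    for i1, i2, replacement in delta:
        out.extend(old_lines[pos:i1])  # unchanged lines before this change
        out.extend(replacement)        # the new lines for the changed range
        pos = i2
    out.extend(old_lines[pos:])        # unchanged tail
    return "".join(out)
```

An unchanged article produces an empty delta, so only the articles (and only the lines) that actually changed would need to be shipped in the update.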

Project schedule

 * 1) Check for article updates script: ~3 days
 * 2) Modified Export.php: ~2 weeks
 * 3) Merge script: ~2 weeks
 * 4) Review, merging, all that fun stuff: ~4 weeks
 * 5) Leftovers, for when something goes kablooie: ~2 weeks, 4 days

About you
I am wrapping up my second year of study at Rochester Institute of Technology, majoring in Information Security and Forensics / Computing Security (they're changing my major name halfway through!), and am having a blast. I have grown up on computers, which has been both a blessing and somewhat of a curse. According to parental reports, I could operate a Windows 95 environment at the age of two.

I love Wikipedia. In my mind, it is the most obvious, shining example of the open source ideal, except the barrier to contribution is even lower than in most code-based projects! As every language edition grows, the cost and time consumed by database dumps (which don't strictly need to be offered, but it's great they are!) grow as well. Being able to make offline versions more accessible (by lowering both disk and bandwidth usage for the end user) while reducing the strain on the Wikimedia servers excites me. Wikipedia is a fantastic resource, and making it easier and cheaper to distribute will extend its use into areas where bandwidth and computing power are costly.

Aside from Wikimedia projects, I hope to write this functionality in such a way that all MediaWiki-based projects will have easier access to portable formats of their data, making MediaWiki even more appealing to wiki-based projects that do not currently use it.

Participation
I always have my email and IRC open. I would ideally condense all mentor-directed communication into a daily "digest" mail to reduce my disruption, with urgent or time-sensitive issues addressed via IRC or another mentor-preferred real-time communication method.

I am a fan of GitHub, and would likely push there so I don't taint the official repo with my in-production code. I would push commits at least daily, likely much more frequently.

Should I run into any bugs (which is unlikely, considering I am hacking on a single file from core MediaWiki), reporting them with detailed, logical, and thorough bug reports is priority one. We should fix existing functionality before adding new functionality!

Past open source experience
I have been playing with computers since I was born, and have been learning programming (I started in Visual Basic... that was a mistake) and using Linux and other open source projects since 6th grade. Until recently, my attempts to contribute to open source projects ran into various obstacles (git is complicated!) that kept me from making any formal commits. I filed a bug or two here and there, but my breakthrough moment was taking the Humanitarian Open Source Software course at my university.

I had a fantastic professor (Justin Sherril of DragonflyBSD) and teaching assistants from the "FOSSBox" who guided me through the minefield that is git, taught me best practices for coding with others in mind, and guided my group and me through our work on a mail application for the Sugar OLPC environment: Sweetermail.

While I don't have any code contributions to MediaWiki, I am a semi-active editor at the English Wikipedia.

Any other info
Nothing here yet, I might try and do some proof-of-concept stuff, but finals... :(

Thanks!
Thanks for reading my proposal, and please give me any and all feedback you have! My talk page here, my talk page on en.wikipedia, email, or IRC are all fine.