Kiwix/ZIM incremental updates

Incremental updates for the Kiwix reader (offline Wikipedia)

Public URL: https://www.mediawiki.org/wiki/User:Kiran_mathew_1993

Bugzilla report: https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Announcement: http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069006.html

Mentor 1: Emmanuel 'Kelson' Engelhart

IRC: Kelson on #kiwix

Mentor 2: Tommi 'tntnet' Mäkitalo

IRC: tntnet on #kiwix

Name and contact information

Name: Kiran Mathew Koshy

Email: kiranmathewkoshy@gmail.com ; kiran.ee11@iitp.ac.in

IRC or IM networks/handle(s): Kiran_

Location: Kerala, India

Typical working hours: 11:00 A.M. to 6:00 P.M. and 9:00 P.M. to 3:00 A.M.

Synopsis

Wikipedia has played a tremendous role in making the world's information available for free, but reading its latest content still requires an active internet connection. This project aims to make up-to-date Wikipedia content available in remote places that lack a reliable connection.

Kiwix already makes it possible to keep a local copy of Wikipedia. What it lacks is a proper update mechanism that refreshes this data periodically. At present, users must download the full database every time they want to update, which is cumbersome or outright impractical on a slow internet connection.

Once finished, the project would greatly benefit schools and other institutions in developing regions of the world by letting them keep a local cache of the data that updates itself automatically.

Deliverables:

1. Two tools, zimdiff and zimpatch, will be implemented in C++. Their details are given below:

a. zimdiff: This tool computes the difference between two ZIM files. It will run on the server every time a new release is available and will compute the changes made to the ZIM file. The result is a ZIM diff file, which the client then downloads.

b. zimpatch: This tool runs on the client and patches an existing ZIM file using the ZIM diff file as input. There are two ways to implement zimpatch, and both will be implemented:

Method 1: Simple merge of the files and rewriting of the index (fast, requires more storage).

Method 2: Recompute a new file (slow, requires less storage).

c. Integrating zimpatch and zimdiff into the existing Kiwix code. The server will generate the ZIM diff file automatically, and once the diff file has been downloaded, the client-side Kiwix code will apply it to the existing ZIM file automatically.

Note that two update modes will be implemented: the program will either download the diff file automatically or update from a diff file provided by the user.

d. Email notification: users who opt for manual updates can additionally be notified by email when a new diff file is available.
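The relationship between zimdiff and zimpatch can be sketched as follows. This is only an illustration under stated assumptions: a ZIM file is modelled here as an in-memory map from article URL to content, and the names Archive, ArchiveDiff, computeDiff and applyPatch are hypothetical. The actual tools would read and write ZIM files through zimlib, and the real diff file format is still to be designed.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a ZIM file: article URL -> article content.
// The real tools would access entries through zimlib instead.
using Archive = std::map<std::string, std::string>;

// Hypothetical diff record: the information a ZIM diff file would carry.
struct ArchiveDiff {
    std::set<std::string> removed;  // URLs present only in the old release
    Archive upserted;               // URLs that are new or whose content changed
};

// zimdiff side: compare two releases and record only the changes.
ArchiveDiff computeDiff(const Archive& oldArchive, const Archive& newArchive) {
    ArchiveDiff diff;
    for (const auto& entry : oldArchive) {
        if (newArchive.find(entry.first) == newArchive.end()) {
            diff.removed.insert(entry.first);
        }
    }
    for (const auto& entry : newArchive) {
        auto it = oldArchive.find(entry.first);
        if (it == oldArchive.end() || it->second != entry.second) {
            diff.upserted[entry.first] = entry.second;
        }
    }
    return diff;
}

// zimpatch side: rebuild the new release from the old release plus the diff.
Archive applyPatch(const Archive& oldArchive, const ArchiveDiff& diff) {
    Archive result = oldArchive;
    for (const auto& url : diff.removed) {
        result.erase(url);
    }
    for (const auto& entry : diff.upserted) {
        result[entry.first] = entry.second;
    }
    return result;
}
```

In this model, unchanged articles never appear in the diff, which is what keeps the download small. The round-trip property, applyPatch(old, computeDiff(old, new)) == new, is a natural check for the testing phases of the timeline.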

Timeline:

Total duration: 3 months/13 weeks (excluding the community bonding period).

Community Bonding period: Study the existing ZIM file format, the zimlib library and the Kiwix source code.

Phase 1 (coding): Implement zimdiff, the server-side tool. The code will be developed as a separate C++ program. Duration: 1.5-2 weeks

Phase 2 (coding): Implement zimpatch, the client-side tool. This will also be implemented as a separate C++ program and will not yet be integrated into the Kiwix source code. Duration: 1.5-2 weeks

Phase 3 (bug hunt): Tests, bug fixes and optimizations for the above tools. Duration: 2 weeks

Phase 4 (coding): Integrate zimdiff and zimpatch with the server-side and client-side code. Duration: 1-1.5 weeks

Phase 5 (testing, bug fixes): Extensive testing and bug fixes, if any. A full set of tests will be run on a ZIM copy of Wikipedia. Documentation will be written. Duration: 3 weeks

Phase 6 (email notification feature): This should be simple to implement. Duration: 1 week

Phase 7 (deployment): The final code is deployed to the Kiwix project. Duration: 1.5 weeks

About Me:

By the time you evaluate this application, I will have completed 2 years of my undergraduate studies at IIT Patna, India. Programming has been my passion for the last 6 years.

Languages: C/C++, Python, PHP, etc.

Hobbies: CUDA programming, robotics, etc.

I'm a big fan of FOSS and have participated in a few FOSS activities on our campus. I came up with this idea because completing this project would let me help bring information to less privileged people around the globe. The amount of knowledge in Wikipedia is so vast that I'm sure this project would help a lot of people.