Talk:Kiwix/ZIM incremental updates

Google Summer of Code
You're running out of time to apply for Google Summer of Code 2013 with Wikimedia! Please see Where to start. I won't be watching this talkpage for response -- if you need help, please ask the chat channel (IRC) or the mailing list. Sharihareswara (WMF) (talk) 16:55, 26 April 2013 (UTC)

My exams are going on; I'll finish it in a couple of days.

Just completed my application. My exams will end on Saturday, so you can expect to see more activity from me from then on.

Support

 * Thanks for this proposal, I hope we can get this done. I've followed Kiwix for a few years now and this is the feature that everyone keeps asking for when it comes to ZIM and offline Wikimedia projects. --Nemo 12:32, 4 May 2013 (UTC)

Are you already in the Indian Wikimedia community?
Please join the Wikimedia India mailing list so you can keep up with what's happening in the Indian Wikimedia community! Sharihareswara (WMF) (talk) 15:15, 8 May 2013 (UTC)

Congratulations and welcome!
Your project is very exciting! Congratulations, and I am looking forward to seeing the work that comes out of it. Jwild (talk) 15:49, 28 May 2013 (UTC)


 * Thanks, I have started the work, and I expect a very exciting summer. Kiran mathew 1993 (talk) 11:38, 4 June 2013 (UTC)

Longer intro, please?
Hi, I'm drafting a blog post about GSoC and it would be helpful to know a bit more about you through your user profile, if you don't mind. Thank you!--Qgil (talk) 20:48, 19 June 2013 (UTC)


 * Hi, I just came across this today, but I had already edited the draft page for the post a few days ago. Is that fine? Kiran mathew 1993 (talk) 12:09, 23 June 2013 (UTC)

GSoC / OPW IRC AllHands this week
Hi, you are invited to the GSoC / OPW IRC AllHands meeting on Wednesday, June 26, 2013 at 15:00 UTC (8:30pm IST, 8am PDT). We have done our best to find a time that works decently in as many timezones as possible. Please confirm at qgil@wikimedia.org so I can add you to the calendar invitation and have your preferred email for other occasions. If you can't make it, that's fine, but let me know as well. Thank you!--Qgil (talk) 18:02, 24 June 2013 (UTC)

Tech feedback after 3 weeks

 * zimcheck
 * Command-line error handling is not perfect
 * Error messages in case of bad arguments are unclear (what does "Unknown option `115'" mean?)
 * Launching without arguments should print the usage
 * The usage text is not the man page. Explanations are always welcome, but usage should be short. Maybe we need to create a man page?
 * Multiple small typography issues (for example, never put a space before a comma)
 * Indentation is good, but the code lacks spaces, in particular around the operators.
 * The notion of check 1, 2, ... should be abandoned everywhere (code and output). Checks do not depend on each other.
 * Methods are too long; the code should be factored into smaller pieces.
 * zimcheck.cpp is too long and should be split.
 * Method names are poorly chosen ("get_links2" is unclear). Same for variable names; "arr" is really a bad name for an array.
 * Method names should be, for example, getLinks instead of get_links
 * Argument parsing seems to be good, but the way the result is stored is not efficient; use a bit array and constants to store the list of checks to do.
 * Strange "unknown mime type code 65535" error code on the output
 * Remove the progress bar "################################################################", or, if you really want to show progress, print a new [INFO] line every X %.
 * Error messages should be more explicit, for example "Favivon not found in ZIM file" should be "Favivon not found in ZIM file, "/-/favicon" must exist."
 * Output an info message for every successful check
 * Remove useless variables like "arr"
 * Compile with -W, and remove warnings
 * In redundancy check
 * Why not use a hash of hashes (instead of a hash of lists) to store the hash keys? It would be a lot easier to detect duplicates (no sort, no loop for the duplicate detection).
 * You should avoid using "i" during the hash computation and store "it" instead.
 * The error message is not useful for fixing the ZIM file; the redundant article URLs should be given.
 * Internal url check
 * HTML parsing should be based either on an existing HTML parser or on regex.
 * is_internal_url should be improved to exclude all URLs starting with a protocol.
 * zim already implements a binary search for titles, so it does not make sense to cache anything there, except maybe at the article level (don't check the same URL twice within one article).
 * Does not work with relative URLs
 * Merge is_external_url and is_internal_url; we should not have two different logics here.
 * The external dependency check does not work properly: a simple link pointing to an external page generates an error. Only external dependencies (img/src, scripts, stylesheets) should be forbidden.
 * Verifying MIME Types does nothing
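
The "bit array and constants" remark above could look roughly like the following. This is only a sketch: the constant names, the `CheckMask` type and the helper functions are hypothetical and are not taken from the zimcheck source; the five checks are the ones mentioned in this review.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical constants: one bit per check, so any combination of checks
// fits in a single integer.
enum Check : std::uint32_t {
    CHECK_NONE         = 0,
    CHECK_FAVICON      = 1u << 0,
    CHECK_REDUNDANCY   = 1u << 1,
    CHECK_INTERNAL_URL = 1u << 2,
    CHECK_EXTERNAL_DEP = 1u << 3,
    CHECK_MIME         = 1u << 4,
    CHECK_ALL          = (1u << 5) - 1
};

using CheckMask = std::uint32_t;

// Returns a copy of the mask with the given check switched on.
inline CheckMask enableCheck(CheckMask mask, Check check) {
    return mask | check;
}

// True if the given check is switched on in the mask.
inline bool isEnabled(CheckMask mask, Check check) {
    return (mask & check) != 0;
}

// Lists the names of the enabled checks, in declaration order.
std::vector<std::string> enabledCheckNames(CheckMask mask) {
    struct CheckInfo { Check flag; const char* name; };
    static const CheckInfo kChecks[] = {
        {CHECK_FAVICON,      "favicon"},
        {CHECK_REDUNDANCY,   "redundancy"},
        {CHECK_INTERNAL_URL, "internal URLs"},
        {CHECK_EXTERNAL_DEP, "external dependencies"},
        {CHECK_MIME,         "MIME types"},
    };
    std::vector<std::string> names;
    for (const CheckInfo& info : kChecks) {
        if (isEnabled(mask, info.flag)) {
            names.push_back(info.name);
        }
    }
    return names;
}
```

The argument parser would start from `CHECK_NONE` (or `CHECK_ALL` when no option is given) and call `enableCheck` once per option; each check then runs only if `isEnabled` says so, which also drops the "check 1, 2, ..." numbering.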

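One possible reading of the "hash of hashes" suggestion for the redundancy check, as a sketch only. The `Article` struct and `findRedundantArticles` are hypothetical names, and the outer map is keyed by the full article content for simplicity; a real tool would more likely key it by a digest of the content to save memory.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an article read from a ZIM file.
struct Article {
    std::string url;
    std::string content;
};

// Groups article URLs by identical content. Every returned group holds at
// least two URLs, so it can be printed directly in the error message.
// No sorting and no pairwise comparison loop is needed.
std::vector<std::unordered_set<std::string>> findRedundantArticles(
        const std::vector<Article>& articles) {
    // Outer hash: content -> inner hash (set) of URLs with that content.
    std::unordered_map<std::string, std::unordered_set<std::string>> urlsByContent;
    for (const Article& article : articles) {
        urlsByContent[article.content].insert(article.url);
    }

    std::vector<std::unordered_set<std::string>> groups;
    for (const auto& entry : urlsByContent) {
        if (entry.second.size() > 1) {
            groups.push_back(entry.second);
        }
    }
    return groups;
}
```

Because each group carries the URLs themselves, the report can name the redundant articles instead of only saying that duplicates exist. The order of the groups is unspecified, since it follows the hash map's iteration order.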

 * zimdiff
 * Almost the same remarks as for zimcheck regarding cosmetics.
 * Command line argument parsing should be based on getopt - no custom arg parsing.
 * The approach seems wrong to me: http://www.openzim.org/wiki/Talk:Zimdiff#Remarks
 * Seems to work in the sense that only the articles which should be rewritten by zimpatch are in the diff_file.
 * Too slow
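
For the getopt remark above, a minimal sketch of what getopt-based parsing could look like. The `-v` and `-o` options, the `Options` struct and the assumption that the tool takes exactly two input files are all hypothetical; they are not the actual zimdiff command line.

```cpp
#include <unistd.h>  // getopt, optarg, optind, opterr (POSIX)

#include <string>
#include <vector>

// Hypothetical result of parsing; not the real zimdiff option set.
struct Options {
    bool valid = true;                // false on any parsing error
    bool verbose = false;             // -v
    std::string output;               // -o <file>
    std::vector<std::string> inputs;  // remaining positional arguments
};

Options parseArgs(int argc, char* argv[]) {
    Options opts;
    optind = 1;  // restart scanning at argv[1] (matters only on repeated calls)
    opterr = 0;  // let the caller print its own error message and usage

    int c;
    while ((c = getopt(argc, argv, "vo:")) != -1) {
        switch (c) {
            case 'v':
                opts.verbose = true;
                break;
            case 'o':
                opts.output = optarg;
                break;
            default:  // '?': unknown option or missing option argument
                opts.valid = false;
                break;
        }
    }

    // Everything from optind onwards is a positional argument.
    for (int i = optind; i < argc; ++i) {
        opts.inputs.push_back(argv[i]);
    }
    if (opts.inputs.size() != 2) {
        opts.valid = false;  // assumed: exactly two input files are expected
    }
    return opts;
}
```

When `valid` is false, the caller can print the short usage text and exit with a non-zero status, which also covers the "launching without arguments should print the usage" remark made for zimcheck.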


 * zimpatch
 * Seems to fail on a simple case