Talk:Kiwix/ZIM incremental updates

From mediawiki.org

Google Summer of Code

You're running out of time to apply for Google Summer of Code 2013 with Wikimedia! Please see Where to start. I won't be watching this talkpage for response -- if you need help, please ask the chat channel (IRC) or the mailing list. Sharihareswara (WMF) (talk) 16:55, 26 April 2013 (UTC)


My exams are going on; I'll finish it in a couple of days.

Just completed my application. My exams will end on Saturday, so you can expect to see more activity from me from then on.

Support

  • Thanks for this proposal, I hope we can get this done. I've followed Kiwix for a few years now and this is the feature that everyone keeps asking for when it comes to ZIM and Wikimedia projects offline. --Nemo 12:32, 4 May 2013 (UTC)

Are you already in the Indian Wikimedia community?

Please join the Wikimedia India mailing list so you can keep up with what's happening in the Indian Wikimedia community! Sharihareswara (WMF) (talk) 15:15, 8 May 2013 (UTC)

Congratulations and welcome!

Your project is very exciting! Congratulations, and I am looking forward to seeing the work that comes out of it. Jwild (talk) 15:49, 28 May 2013 (UTC)

Thanks, I have started the work, and I expect a very exciting summer. Kiran mathew 1993 (talk) 11:38, 4 June 2013 (UTC)

Longer intro, please?

Hi, I'm drafting a blog post about GSoC and it would be helpful to know a bit more about you through your user profile, if you don't mind. Thank you!--Qgil (talk) 20:48, 19 June 2013 (UTC)

Hi, I just came across this today, but I had edited the draft page for the post days ago. Is that fine? Kiran mathew 1993 (talk) 12:09, 23 June 2013 (UTC)

GSoC / OPW IRC AllHands this week

Hi, you are invited to the GSoC / OPW IRC AllHands meeting on Wednesday, June 26, 2013 at 15:00 UTC (8:30pm IST, 8am PDT). We have done our best to find a time that works decently in as many timezones as possible. Please confirm at qgil@wikimedia.org so I can add you to the calendar invitation and have your preferred email for other occasions. If you can't make it, that's fine, but let me know as well. Thank you!--Qgil (talk) 18:02, 24 June 2013 (UTC)

Tech feedback after 3 weeks

zimcheck
  • Imperfect command-line error handling:
Fixed
    • Error messages in case of bad arguments are unclear (what does "Unknown option `115'" mean?)
Fixed. BTW, 115 was the ASCII code of the bad argument.
    • Launch without argument should print the usage()
Done
    • usage() is not the man page. Explanations are always welcome, but usage() should be short. Maybe we need to create a man page?
Will create a man page. I trimmed the usage() too.
    • Multiple small typography issues (for example, never put a space before a comma)
Fixed.
  • Indentation is good, but the code lacks spaces, in particular around operators.
Fixed
  • Notion of check 1, 2, ... should be abandoned everywhere (code & output). Checks do not depend on each other.
This notion exists only in code comments, and even there the names of the tests are given alongside the numbers.
  • Methods are too long; code should be factored into smaller functions.
  • zimcheck.cpp is too long, should be split.
zimcheck.cpp was split into 3 different files.
  • Method names are not well chosen ("get_links2" is unclear). Same for variable names: "arr" is really a bad name for an array.
Done, please mention if you find anything else.
  • Method names should be, for example, getLinks() instead of get_links()
Done
  • argument parsing seems to be good, but way to store the result is not efficient; use a bit array and constants to store the list of checks to do.
Do we need this? I mean, the current system is easier to understand.
  • Strange "unknown mime type code 65535" error code on the output
I couldn't reproduce this error. Which ZIM file did you use?
  • Remove progress bar "################################################################", or if you really want to see a progression print a new [INFO] line each X %.
The progress bar is now switched off by default. It can be switched on by the --progress flag.
  • Error messages should be more explicit, for example "Favivon not found in ZIM file" should be "Favivon not found in ZIM file, "/-/favicon" must exist."
  • Output an info message to all successful checks
Done
  • Remove useless variable like "arr"
Done
  • Compile with -W, and remove warnings
Done
  • In redundancy check
    • Why not use a hash of hashes (instead of a hash of lists) to store the hash keys? It would be a lot easier to detect duplicates (no sort, no loop for the duplicate detection).
I don't get this. What do you mean by hash of list? (Did you mean list of hashes?) Same for hash of hash. Oh, and the hash algorithm I used is optimized for speed, so it has quite a few collisions.
    • You should avoid using "i" during the hash computation and store "it" instead.
Will do.
    • Error message is not useful for fixing the ZIM file; the redundant article URLs should be given
The details are usually quite long, so they are controlled by the --details flag.
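For illustration, a minimal sketch of the "hash of hash" idea discussed above; it is not the actual zimcheck code. Articles are grouped first by a cheap content hash and then by the exact content, so hash collisions never produce false duplicates and no sort or pairwise loop is needed. The `Article` struct, the `findDuplicates` name and the use of `std::hash` are assumptions made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a ZIM article: only URL and content matter here.
struct Article {
    std::string url;
    std::string content;
};

// Returns groups of URLs whose contents are byte-for-byte identical.
// Outer key: cheap hash of the content (collisions are allowed).
// Inner key: the content itself, which resolves those collisions.
std::vector<std::vector<std::string>> findDuplicates(const std::vector<Article>& articles) {
    std::unordered_map<std::size_t,
                       std::map<std::string, std::vector<std::string>>> buckets;
    for (const Article& article : articles) {
        const std::size_t key = std::hash<std::string>{}(article.content);
        buckets[key][article.content].push_back(article.url);
    }

    std::vector<std::vector<std::string>> duplicates;
    for (const auto& bucket : buckets) {
        for (const auto& sameContent : bucket.second) {
            // More than one URL under the same content means redundancy.
            if (sameContent.second.size() > 1) {
                duplicates.push_back(sameContent.second);
            }
        }
    }
    return duplicates;
}
```

Each returned group lists the redundant article URLs, which is the information asked for in the error message. Keeping full contents as inner keys costs memory; a real tool would more likely keep article indexes and re-read the content only when two hashes match.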
  • Internal url check
    • HTML parsing should be based either on a pre-existent html parser or on regex.
Will do.
    • is_internal_url() should be improved to exclude all URLs starting with a protocol.
is_internal_url() just checks whether the URL has a namespace and a title following it. It already excludes protocols.
    • zim already implements a binary search for titles, so it does not make sense to cache anything there, except maybe at the article level (don't check the same URL twice within one article).
Will do.
    • does not work with relative urls
Please explain a bit more on this.
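As a rough sketch of the protocol and relative-URL points above (not the actual is_internal_url() code): a link can only point inside the ZIM file if it has neither a URI scheme nor a host, so relative links such as "../A/Paris" pass this first filter while "http://..." and "//host/..." do not. The function names here are hypothetical; the scheme syntax follows RFC 3986.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if url starts with a URI scheme ("http:", "mailto:", ...): a letter
// followed by letters, digits, '+', '-' or '.', and then ':' (RFC 3986).
bool hasScheme(const std::string& url) {
    if (url.empty() || !std::isalpha(static_cast<unsigned char>(url[0]))) {
        return false;
    }
    for (std::size_t i = 1; i < url.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(url[i]);
        if (c == ':') {
            return true;
        }
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') {
            return false;
        }
    }
    return false;
}

// True for protocol-relative URLs such as "//example.org/x".
bool isProtocolRelative(const std::string& url) {
    return url.size() >= 2 && url[0] == '/' && url[1] == '/';
}

// First filter only: the target still has to be resolved against the
// article's own URL and looked up in the ZIM file.
bool mayBeInternal(const std::string& url) {
    return !url.empty() && !hasScheme(url) && !isProtocolRelative(url);
}
```

Note that a relative link whose first path segment contains a colon (for example "Wikipedia:About") is read as having a scheme, which is also how browsers interpret it.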
  • merge is_external_url() and is_internal_url(), we should not have here two different logics.
Actually, is_external_url() and is_internal_url() are not complementary. is_external refers to non-wikipedia URLs, while is_internal refers to URLs in the ZIM file. For example, www.wikipedia.org is neither. If this approach is wrong, please comment.
  • External dependency checks do not work properly: a simple link pointing to an external page generates an error. Only external dependencies (img/src, scripts, stylesheets) should be forbidden.
Fixed.
  • Verifying MIME Types does nothing
I'm thinking of modifying zimlib to add the functions required for these too in the next week. After that, I'll add it.
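On the earlier question of whether a bit array for the list of checks is worth it, here is a small sketch of what it could look like, assuming one bit per check. The constant names and single-letter options are made up for the example and are not zimcheck's real ones.

```cpp
#include <cstdint>
#include <string>

// Hypothetical check identifiers: one bit per check, so any combination
// of checks fits in a single integer.
enum Check : std::uint32_t {
    CHECK_FAVICON      = 1u << 0,
    CHECK_REDUNDANCY   = 1u << 1,
    CHECK_INTERNAL_URL = 1u << 2,
    CHECK_EXTERNAL_DEP = 1u << 3,
    CHECK_MIME         = 1u << 4,
    CHECK_ALL          = (1u << 5) - 1
};

// Maps a (hypothetical) single-letter option to its bit; 0 means "unknown".
std::uint32_t checkForOption(char option) {
    switch (option) {
        case 'f': return CHECK_FAVICON;
        case 'r': return CHECK_REDUNDANCY;
        case 'u': return CHECK_INTERNAL_URL;
        case 'x': return CHECK_EXTERNAL_DEP;
        case 'm': return CHECK_MIME;
        case 'a': return CHECK_ALL;
        default:  return 0;
    }
}

// Builds the set of enabled checks from a string of option letters.
// On failure, badOption holds the offending letter itself.
bool parseChecks(const std::string& options, std::uint32_t& enabled, char& badOption) {
    enabled = 0;
    for (char option : options) {
        const std::uint32_t bit = checkForOption(option);
        if (bit == 0) {
            badOption = option;
            return false;
        }
        enabled |= bit;
    }
    return true;
}

// True if every bit of the given check is part of the enabled set.
bool isEnabled(std::uint32_t enabled, Check check) {
    return (enabled & check) == check;
}
```

With this layout each check becomes an independent `if (isEnabled(enabled, CHECK_...))` block, which fits the point that checks do not depend on each other. Returning the offending letter as a `char` also lets the error message print the letter rather than its ASCII code.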
zimdiff
  • Almost the same remarks as for zimcheck regarding cosmetics.
  • Command line argument parsing should be based on getopt - no custom arg parsing.
In zimdiff, there are only 3 arguments and no flags: start_file, end_file, and the name for diff_file. Same in zimpatch.
I re-wrote the main code in zimdiff according to the new approach.
  • Seems to work in the sense that only the articles which should be rewritten by zimpatch are in the diff_file.
No, the added articles, the modified articles, and the additional metadata are present.
  • Too slow
The new version is an improvement, but is still slow.
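For what it's worth, getopt also handles the purely positional case: after the option loop, `optind` points at the first non-option argument. The sketch below assumes a POSIX system; the `-h` flag, the `DiffArgs` struct and the function name are hypothetical and are not existing zimdiff features.

```cpp
#include <unistd.h>  // POSIX getopt(), optind, opterr
#include <string>

// Hypothetical container for the three positional arguments of zimdiff.
struct DiffArgs {
    std::string startFile;
    std::string endFile;
    std::string diffFile;
    bool showHelp = false;
};

// Returns true if the command line is valid. The caller is expected to
// print usage() when this returns false or when showHelp is set.
bool parseDiffArgs(int argc, char* const argv[], DiffArgs& out) {
    optind = 1;  // restart scanning (sufficient for glibc when called again)
    opterr = 0;  // suppress getopt's own messages; the caller reports errors

    int opt;
    while ((opt = getopt(argc, argv, "h")) != -1) {
        if (opt != 'h') {
            return false;  // unknown option: getopt returned '?'
        }
        out.showHelp = true;
    }
    if (out.showHelp) {
        return true;
    }

    // Exactly three positional arguments must remain after the options.
    if (argc - optind != 3) {
        return false;
    }
    out.startFile = argv[optind];
    out.endFile   = argv[optind + 1];
    out.diffFile  = argv[optind + 2];
    return true;
}
```

The benefit is mostly consistency: zimcheck, zimdiff and zimpatch would then reject unknown options and wrong argument counts in the same way.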
zimpatch
  • Seems to fail with a simple case
$ ./zimdiff ICD10-fr.zim ICD10-fr.zim diff.zim ; echo "==================" ; ./zimpatch ICD10-fr.zim diff.zim new.zim ; md5sum ICD10-fr.zim new.zim
zimdiff
[INFO] Parsing through input ZIM files..
[INFO] Parsing through ICD10-fr.zim
[INFO] Parsing through ICD10-fr.zim
[INFO] Comparing articles..
[INFO] 0 articles deleted in ICD10-fr.zim
[INFO] 0 articles updated in ICD10-fr.zim
[INFO] 0 articles added in ICD10-fr.zim
[INFO] 290 articles remained unchanged

[INFO] 100% Match between articles in ICD10-fr.zim and ICD10-fr.zim
create directory entries
collect articles
sort 3 directory entries (aid)
remove invalid redirects from 3 directory entries
sort 3 directory entries (url)
set index
translate redirect aid to index
3 directory entries created
create title index
3 title index created
create clusters
0% ready
10% ready
20% ready
30% ready
40% ready
50% ready
60% ready
1 clusters created
fill header
write zimfile
ready
==================
zimpatch
Diff File=diff.zimsize= 0
CoMM checkcreate directory entries
collect articles
article index out of range
184e69e44cc36afff06b68eba76c0b58  ICD10-fr.zim
md5sum: new.zim: No such file or directory
Works now. Kiran mathew 1993 (talk) 08:45, 11 July 2013 (UTC)

Reply

I have fixed most of the errors mentioned above; some still remain in zimcompare, though. Kiran mathew 1993 (talk) 12:41, 10 July 2013 (UTC)

Wrapping up GSoC

Congratulations on your PASS! Now please wrap up your GSoC project properly:

  • Update the related Bugzilla report(s) accordingly, filing reports for known bugs when appropriate.
  • Publish your wrap-up post at wikitech-l (as an email or a blog post) and then add the URL to Mentorship_programs/status#2013-09-monthly.

Take a break and celebrate. You deserve it! We hope to see you sticking around, extending your project or joining new tasks. If you need advice please check with your mentors or myself. I will be happy to help you in whatever I can!--Qgil (talk) 21:13, 1 October 2013 (UTC)