Google Books, Internet Archive, Commons upload cycle/Progress

Shahzad Khan

(Automation Tool) Google Books > Internet Archive > Commons upload cycle

 * Public URL: https://www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
 * Bugzilla report: Bug 57813
 * Hosted on tools-lab: http://tools.wmflabs.org/bub/
 * Maintained on github: https://github.com/rohit-dua/bub

Goals for the first half of the internship

 * Create the front end of the web tool, to be hosted on the tools-lab server.
 * Develop a bot that handles database queries (time-out deletion, queue handling, IPC messages).
 * Extract metadata from Google Books and introduce a system to check whether a book is already present in the Internet Archive (IA).
 * Create a script to download books from Google Books.
 * This will be done by extracting the individual page images and then converting them to PDF.

Communication plan

 * I find IRC a quick way to contact my mentors.
 * Email will be used when mentors are not available.
 * I plan to involve interested parties for testing and suggestions.
 * For this, an announcement will be made on wikitech-l, wikisource-l, and commons-l.

Lessons learned since 21st April

 * Every task becomes a piece of cake if you love doing it.
 * For queries, Google cannot be as good as a real-time chat or email with someone experienced.
 * Before the core coding, the set-up work takes a lot of time and edits.
 * Discussions and feedback make things better.

Before Week 1

 * Started the front-end development of the tool (the web face).
 * Using Bootstrap.
 * Shifted the workspace to tools-lab.
 * Linked the GitHub repo to the tools folder.
 * Examined the code base.

Week 1: May 19 to May 25

 * University Examinations
 * Familiarized myself with tools-lab.

Week 2: May 26 to June 1

 * Worked on the back-end Python script.
 * Added a script to verify the Commons name and the Google Books ID.
 * Cookie/session handling.
 * Linked the DB to the tool.
 * Set up a cron job to delete unconfirmed requests.
 * The tool can now be tested (front end only) here.
 * Understood the redis-queue implementation.
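
The time-out deletion done by the cron job can be sketched roughly as follows. This is only an illustration under assumptions: the table name, the columns (`confirmed`, `created_at`) and the one-hour limit are hypothetical, and SQLite stands in for whatever database the tool actually uses.

```python
import sqlite3
import time

# Hypothetical schema: one row per upload request, with a confirmation flag.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    book_id TEXT NOT NULL,
    confirmed INTEGER NOT NULL DEFAULT 0,
    created_at REAL NOT NULL
)
"""


def delete_stale_requests(conn, max_age_seconds, now=None):
    """Delete unconfirmed requests older than max_age_seconds; return the count."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    cur = conn.execute(
        "DELETE FROM requests WHERE confirmed = 0 AND created_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


def demo():
    """Fill an in-memory database and run one clean-up pass."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    now = 10000.0  # fixed clock so the example is deterministic
    rows = [
        ("old-unconfirmed", 0, now - 7200),
        ("old-confirmed", 1, now - 7200),
        ("new-unconfirmed", 0, now - 60),
    ]
    conn.executemany(
        "INSERT INTO requests (book_id, confirmed, created_at) VALUES (?, ?, ?)",
        rows,
    )
    deleted = delete_stale_requests(conn, max_age_seconds=3600, now=now)
    remaining = [r[0] for r in conn.execute("SELECT book_id FROM requests ORDER BY id")]
    return deleted, remaining
```

A cron entry would then only need to call `delete_stale_requests` at a fixed interval; confirmed requests are never touched, whatever their age.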

Week 3: June 2 to June 8

 * Understood and worked on the implementation of Redis lists and interprocess locks.
 * Worked on scripts (not yet deployed) to:
 * verify whether an upload is already present in the archive.
 * using http://archive.org/advancedsearch.php
 * download books from Google Books as images and convert them into one PDF.
 * upload the book PDF with metadata to the Internet Archive.
 * using http://pypi.python.org/pypi/internetarchive/0.6.5
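
As a rough illustration of the presence check against advancedsearch.php, the sketch below only builds the query URL and does not contact the archive. The choice of fields (`title`, `creator`, `mediatype`) and the helper names are assumptions made for this example, not necessarily what the tool sends.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ADVANCED_SEARCH = "http://archive.org/advancedsearch.php"


def build_search_url(title, author=None, rows=10):
    """Build an advancedsearch.php URL that asks for JSON results."""
    # Quote each value; embedded double quotes are replaced so the
    # query string stays well-formed.
    clauses = ['title:"%s"' % title.replace('"', " ")]
    if author:
        clauses.append('creator:"%s"' % author.replace('"', " "))
    clauses.append("mediatype:texts")
    params = [
        ("q", " AND ".join(clauses)),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return ADVANCED_SEARCH + "?" + urlencode(params)


def query_params(url):
    """Decode the query string of a URL back into a dict of lists."""
    return parse_qs(urlsplit(url).query)
```

Fetching the resulting URL and reading the returned JSON would give the candidate items to compare with the book about to be uploaded.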

Week 4: 9 June to 15 June

 * Deployed the scripts described above to tools-lab:
 * IA upload verification.
 * Used a score-value system (with thresholds) to check whether a book is already present in IA.
 * Google Books download and PDF conversion.
 * IA upload with metadata.
 * IA upload using the internetarchive Python module.
 * Improved the database management.
 * Migrated from two databases (sessions + requests) to a single main database.
 * Understood grid usage and deployed two continuous jobs to the grid (worker.py and upload-checker.py).
 * Worked on the email notification (using exim).
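
The score-value check with thresholds could look something like the sketch below. The field weights, the two thresholds and the use of `difflib` for string similarity are all assumptions chosen for illustration; the deployed scoring may differ.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each metadata field contributes to the score.
DEFAULT_WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(candidate, existing, weights=None):
    """Weighted sum of per-field similarities; missing fields add nothing."""
    weights = weights or DEFAULT_WEIGHTS
    score = 0.0
    for field, weight in weights.items():
        left, right = candidate.get(field, ""), existing.get(field, "")
        if left and right:
            score += weight * similarity(str(left), str(right))
    return score


def classify(score, duplicate_at=0.85, review_at=0.6):
    """Map a score onto a decision using two (assumed) thresholds."""
    if score >= duplicate_at:
        return "duplicate"
    if score >= review_at:
        return "review"
    return "new"
```

With two thresholds, clear matches can be rejected automatically while borderline scores are left for the uploader to confirm.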

Week 5: 16 June to 22 June

 * Improved the code structure.
 * Removed all global variables and moved to classes.
 * Resolved a bug in the internetarchive Python module relating to metadata overwrite.
 * Resolved the google-Id and commonsName parsing bug.
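
For context, parsing of the two user inputs might be handled along these lines. This is a hedged sketch, not the tool's code: the 12-character ID pattern is an assumption based on typical Google Books volume IDs, and both function names are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: volume IDs are 12 characters drawn from letters, digits, "_" and "-".
ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{12}$")

# Characters that MediaWiki does not allow in page titles.
FORBIDDEN_TITLE_CHARS = set("#<>[]|{}")


def extract_google_books_id(value):
    """Accept either a bare ID or a Google Books URL with an id= parameter."""
    value = value.strip()
    if ID_PATTERN.match(value):
        return value
    ids = parse_qs(urlsplit(value).query).get("id", [])
    if ids and ID_PATTERN.match(ids[0]):
        return ids[0]
    return None


def clean_commons_name(name):
    """Normalize underscores and whitespace; reject names with forbidden characters."""
    name = " ".join(name.replace("_", " ").split())
    if not name or any(ch in FORBIDDEN_TITLE_CHARS for ch in name):
        return None
    return name
```

Returning `None` for anything that fails validation lets the front end show a single "invalid input" message instead of passing bad values to the back end.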

23 June to August 7

 * Major changes:
 * Converted the web app from CGI to flask-cgi (using a virtual environment).
 * Added scripts for mass upload to the Internet Archive, with separate jobs for regular uploads and mass uploads.
 * Built queue pages for regular uploads and mass uploads.
 * Built a progress page for all uploads.
 * Improved the code that finds similar books on the archive before uploading (to avoid duplicates).
 * Added support for other libraries (HathiTrust, Brasiliana USP, and general DSpace-based libraries).


 * Some minor changes:
 * Added retry/off-line checker wrappers to the code.
 * Added an admin login page for administrative tasks such as mass uploads.
 * Improved the error/job logging mechanism.
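
A retry wrapper of the kind mentioned above can be sketched as a decorator. The parameter names, the default values and the exponential back-off are assumptions for illustration; the wrappers in the repository may be structured differently.

```python
import functools
import time


def retry(attempts=3, delay=1.0, backoff=2.0, exceptions=(IOError,), sleep=time.sleep):
    """Retry the wrapped function when it raises one of the given exceptions.

    The wait between attempts starts at `delay` seconds and is multiplied
    by `backoff` after each failure. The last failure is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

The `sleep` argument exists only so the waiting can be replaced in tests; a download or upload function would normally be decorated with just `@retry(attempts=..., exceptions=...)`.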