Google Books, Internet Archive, Commons upload cycle/Progress
(Automation Tool) Google Books > Internet Archive > Commons upload cycle
- Public URL: //www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
- Bugzilla report: Bug - 57813
- Hosted on tools-lab: http://tools.wmflabs.org/bub/
- Maintained on github: https://github.com/rohit-dua/bub
Community Bonding Report
Goals for the first half of the internship
- Create the front-end for the web tool to be hosted on the tools-lab server.
- Develop a bot that handles queries in the database (time-out deletion, queue handling, IPC messages).
- Extract metadata from Google Books and introduce a system to check whether a book is already present in the Internet Archive (IA).
- Create a script to download books from Google Books.
- This will be done by extracting the individual page images and then converting them to a PDF.
- I find IRC a quick way to contact my mentors.
- Email will be used when mentors are not available.
- I plan to involve interested parties for testing and suggestions.
- For this, announcements will be made on wikitech-l, wikisource-l, and commons-l.
Lessons learned since 21st April
- Every task becomes a piece of cake if you love doing it.
- For queries, Google cannot be as good as a real-time chat or email with someone experienced.
- Before the core coding starts, the set-up work takes a lot of time and many edits.
- Discussions and feedback make things better.
Before Week 1
- Started the front-end development of the tool (the web interface).
- Using Bootstrap.
- Shifted the workspace to tools-lab.
- Linked the GitHub repository to the tools folder.
- Examined the code-base.
Week 1: May 19 to May 25
- University Examinations
- Familiarized myself with tools-lab.
Week 2: May 26 to June 1
- Worked on the back-end Python script.
- Added script to verify Commons Name and the Google-books ID.
- Cookie/session handling
- Linked the DB to the tool.
- Set up a cron-job to delete unconfirmed requests.
- The tool can now be tested (front-end only) here.
- Understood the Redis queue implementation.
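
The cron-driven cleanup of unconfirmed requests mentioned above could look roughly like the following. This is a minimal sketch, not the tool's actual code: it uses SQLite only to stay self-contained, and the `requests` table, its `confirmed` and `created_at` columns, and the 30-minute time-out are all hypothetical.

```python
import sqlite3
import time

TIMEOUT_SECONDS = 30 * 60  # hypothetical time-out for unconfirmed requests


def delete_unconfirmed(conn, now=None, timeout=TIMEOUT_SECONDS):
    """Delete unconfirmed requests older than `timeout` seconds.

    Assumes a hypothetical table `requests(confirmed, created_at)` where
    `created_at` is a Unix timestamp. Returns the number of deleted rows.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM requests WHERE confirmed = 0 AND created_at < ?",
        (now - timeout,),
    )
    conn.commit()
    return cur.rowcount
```

A cron entry would then run a small script that opens the database and calls `delete_unconfirmed(conn)` at a fixed interval.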
Week 3: June 2 to June 8
- Understood and worked on the implementation of Redis lists and interprocess locks.
- Worked on scripts (not yet deployed) to:
- verify whether a book is already present in the Internet Archive.
- download books from Google Books as images and convert them into a single PDF.
- upload the book PDF with metadata to the Internet Archive.
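
To illustrate the idea of an interprocess lock from this week's work, here is a minimal sketch based on an advisory file lock. It is an assumption, not the tool's implementation: the report does not say how its locks were built, `interprocess_lock` is a hypothetical helper, and `fcntl.flock` is available on Unix-like systems only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def interprocess_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    Other processes that call this function with the same path wait until
    the lock is released.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)  # release the lock before closing
        os.close(fd)
```

A worker would wrap its critical section, for example taking the next job from the queue, in `with interprocess_lock("/tmp/worker.lock"):` so that two jobs never process the same request at once. The lock path here is only an example.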
Week 4: 9 June to 15 June
- Deployed the scripts described above to tools-lab:
- IA upload verification.
- Used a score-based system (with thresholds) to check whether a book is already present in IA.
- Google Books download and PDF conversion.
- IA upload with metadata.
- IA upload using the internetarchive Python module.
- IA upload verification.
- Improved the database management.
- Migrated from two databases (sessions and requests) to a single main database.
- Understood the grid usage and deployed two continuous jobs to the grid (worker.py and upload-checker.py).
- Worked on the email notification (using exim).
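
The score-with-thresholds check for existing books can be sketched as a weighted similarity over a few metadata fields. This is only an illustration under stated assumptions: the field names, the weights, the 0.8 threshold, and the use of `difflib` are hypothetical and not taken from the tool.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; the tool's real values are not stated.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}
THRESHOLD = 0.8


def field_similarity(a, b):
    """Return a similarity ratio in [0, 1]; an empty field never matches."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(candidate, existing):
    """Weighted similarity of two metadata dicts over the fields in WEIGHTS."""
    return sum(
        weight * field_similarity(candidate.get(field, ""), existing.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def probably_duplicate(candidate, existing, threshold=THRESHOLD):
    """True when the weighted score reaches the threshold."""
    return match_score(candidate, existing) >= threshold
```

Raising the threshold makes the check stricter (fewer false duplicates, more repeated uploads); lowering it does the opposite.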
Week 5: 16 June to 22 June
- Improved the code structure.
- Removed all global variables and moved to classes.
- Resolved a bug in the internetarchive Python module relating to metadata overwrite.
- Resolved the Google ID and commonsName parsing bug.
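
As an illustration of the kind of Google ID parsing involved, the sketch below accepts either a bare ID or a Google Books URL carrying an `id` query parameter. The function name and the allowed character set are assumptions made for this example, not the tool's actual rules.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: IDs consist only of letters, digits, '-' and '_'.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def extract_google_books_id(value):
    """Return the ID from a Google Books URL or a bare ID, else None."""
    value = value.strip()
    if ID_PATTERN.match(value):
        return value  # already a bare ID
    # Otherwise treat the input as a URL and read its `id` query parameter.
    query = parse_qs(urlparse(value).query)
    candidate = query.get("id", [""])[0]
    return candidate if ID_PATTERN.match(candidate) else None
```

Handling both input forms in one place keeps the rest of the code working with a single normalized ID.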
23 June to August 7
- Major Changes:
- Converted the web app from plain CGI to Flask running over CGI (using a virtual environment).
- Added scripts for mass upload to the Internet Archive, with separate jobs for regular uploads and mass uploads.
- Built queue pages for regular uploads and mass uploads.
- Built a progress page for all uploads.
- Improved the code that finds similar books on the Internet Archive before uploading (to avoid duplicates).
- Added support for other libraries (HathiTrust, Brasiliana USP, and generic DSpace-based libraries).
- Some Minor Changes:
- Added retry/off-line checker wrappers to the code.
- Added admin-login page for administrative tasks like mass-uploads.
- Improved the error/job logging mechanism.
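
A retry wrapper of the kind listed under the minor changes might look like the following decorator. It is a generic sketch, not the tool's code: the name `retry`, its parameters, and the linear back-off are assumptions.

```python
import functools
import time


def retry(attempts=3, delay=1.0, exceptions=(Exception,)):
    """Retry the wrapped function up to `attempts` times, sleeping between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay * attempt)  # simple linear back-off
        return wrapper
    return decorator
```

Network-bound steps such as downloads or uploads could then be decorated with, for example, `@retry(attempts=5, delay=2.0, exceptions=(OSError,))`, so that a temporary outage does not fail the whole job.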