Google Books, Internet Archive, Commons upload cycle/Progress

From mediawiki.org

(Automation Tool) Google Books > Internet Archive > Commons upload cycle

Community Bonding Report

Goals for the first half of the internship

  • Create the front end for the web tool, to be hosted on the Tools Lab server.
  • Develop a bot that handles database queries (time-out deletion, queue handling, IPC messages).
  • Extract metadata from Google Books and introduce a system to check whether a book is already present in the Internet Archive (IA).
  • Create a script to download books from Google Books.
    • This will be done by extracting the individual page images and then converting them to PDF.

Communication plan

  • I find IRC a quick way to contact my mentors.
  • Email will be used when mentors are not available.
  • I plan to involve interested parties for testing and suggestions.
    • For this, announcements will be made on wikitech-l, wikisource-l, and commons-l.

Lessons learned since 21st April

  • Every task becomes a piece of cake if you love doing it.
  • For queries, Google cannot be as good as a real-time chat or email with someone experienced.
  • Before the core coding, the set-up work takes a lot of time and edits.
  • Discussions and feedback make things better.

Weekly Reports

Before Week 1

  • Started the front-end development of the tool (the web face).
    • Using Bootstrap.
  • Shifted the workspace to Tools Lab.
  • Linked the GitHub repo to the tools folder.
  • Examined the code-base.

Week 1: May 19 to May 25

  • University Examinations
  • Familiarized myself with Tools Lab.

Week 2: May 26 to June 1

  • Worked on the back-end Python script.
    • Added a script to verify the Commons name and the Google Books ID.
    • Cookie/session handling.
  • Linked the DB to the tool.
  • Set up a cron job to delete unconfirmed requests.
  • The tool can now be tested (front end only) here
  • Understood the Redis queue implementation.
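
The cron job for deleting unconfirmed requests could look roughly like the sketch below. This is only an illustration: the table name `requests`, its columns, the time-out value, and the use of sqlite3 are all assumptions made for the example, since the actual schema and database are not described in this report.

```python
import sqlite3
import time

# Hypothetical schema, for illustration only; the real tool's tables are not shown here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    book_id TEXT NOT NULL,
    confirmed INTEGER NOT NULL DEFAULT 0,
    created_at REAL NOT NULL
)
"""


def delete_unconfirmed(conn, max_age_seconds, now=None):
    """Delete unconfirmed requests older than max_age_seconds.

    Returns the number of rows removed. `now` can be injected for testing.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_seconds
    cur = conn.execute(
        "DELETE FROM requests WHERE confirmed = 0 AND created_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

A cron entry would then run a small script that opens the database and calls `delete_unconfirmed(conn, 3600)` (the one-hour time-out is just an example value).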

Week 3: June 2 to June 8

Week 4: 9 June to 15 June

  • Deployed the scripts worked on above to Tools Lab:
    • IA upload verification.
      • Used a score-based system (with thresholds) to check whether a book is already present in IA.
    • Google Books download and PDF conversion.
    • IA upload with metadata.
      • Upload to IA using the internetarchive Python module.
  • Improved the database management.
    • Migrated from two databases (sessions + requests) to a single main database.
  • Understood the grid usage and deployed two continuous jobs to the grid (worker.py and upload-checker.py).
  • Worked on email notification (using Exim).
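
To give an idea of the score-with-threshold check, here is a minimal sketch. The fields compared, the weights, the threshold, and the use of `difflib` for string similarity are assumptions chosen for the example; the values and matching method used in the tool itself are not stated in this report.

```python
from difflib import SequenceMatcher

# Illustrative weights and threshold (assumptions, not the tool's real values).
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}
THRESHOLD = 0.8


def _similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(candidate, existing):
    """Weighted similarity between two metadata dicts; missing fields add nothing."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a = candidate.get(field, "")
        b = existing.get(field, "")
        if a and b:
            score += weight * _similarity(str(a), str(b))
    return score


def is_probable_duplicate(candidate, existing_items, threshold=THRESHOLD):
    """True if any existing item scores at or above the threshold."""
    return any(match_score(candidate, item) >= threshold for item in existing_items)
```

Here `existing_items` would be the metadata of the IA search results for the book. A score at or above the threshold marks the book as probably already present, so the upload can be skipped or flagged for confirmation.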

Week 5: 16 June to 22 June

  • Improved the code structure.
    • Removed all global variables and moved to classes.
  • Resolved a bug in the internetarchive Python module relating to metadata overwrite.
  • Resolved the Google Books ID and Commons name parsing bug.
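
For illustration, parsing of the Google Books ID and the Commons name could be sketched as below. The function names are hypothetical, the ID pattern is an assumption (letters, digits, `-` and `_`), and the forbidden-character set follows the characters MediaWiki rejects in page titles by default; this is not the tool's actual code.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: volume IDs consist only of letters, digits, '-' and '_'.
_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")
# Characters that MediaWiki does not allow in page titles by default.
_FORBIDDEN_IN_TITLE = set("#<>[]|{}")


def extract_google_books_id(text):
    """Accept either a bare ID or a Google Books URL with an `id` parameter."""
    text = text.strip()
    if _ID_PATTERN.match(text):
        return text
    query = parse_qs(urlparse(text).query)
    candidate = query.get("id", [""])[0]
    return candidate if _ID_PATTERN.match(candidate) else None


def normalize_commons_name(name):
    """Strip an optional 'File:' prefix, turn underscores into spaces, reject bad titles."""
    name = name.strip().replace("_", " ")
    if name.lower().startswith("file:"):
        name = name[5:].strip()
    if not name or any(ch in _FORBIDDEN_IN_TITLE for ch in name):
        return None
    return name
```

Both helpers return `None` for input they cannot interpret, so the front end can show an error instead of queuing a broken request.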


23 June to August 7

  • Major changes:
    • Converted the web app from plain CGI to Flask served over CGI (using a virtual environment).
    • Added scripts for mass upload to the Internet Archive, with separate jobs for regular uploads and mass uploads.
    • Built queue pages for regular uploads and mass uploads.
    • Built a progress page for all uploads.
    • Improved the code for finding similar books on the Internet Archive before uploading (to avoid duplicates).
    • Added support for other libraries (HathiTrust, Brasiliana USP, and general DSpace-based libraries).
  • Some minor changes:
    • Added retry/off-line checker wrappers to the code.
    • Added an admin login page for administrative tasks like mass uploads.
    • Improved the error/job logging mechanism.
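
A retry wrapper of the kind mentioned above can be written as a decorator. The sketch below is a generic version with exponential back-off; the parameter names and default values are assumptions for the example and do not reproduce the tool's own wrapper.

```python
import functools
import time


def retry(attempts=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Retry the wrapped function on the given exceptions.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. The last failure is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

Network-bound steps such as a page download or an IA upload could then be decorated with, for example, `@retry(attempts=5, exceptions=(IOError,))`, so that a temporary outage does not fail the whole job.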