Google Books, Internet Archive, Commons upload cycle

(Automation Tool) Google Books > Internet Archive > Commons upload cycle

 * Public URL: https://www.mediawiki.org/wiki/User:8ohit.dua/GSoC_proposal_2014
 * Bugzilla report: Bug - 57813
 * Announcement: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Google_Books_.3E_Internet_Archive_.3E_Commons_upload_cycle

Name and contact information
 * Name: Rohit Dua
 * Email: 8ohit.dua@gmail.com
 * IRC or IM networks/handle(s): rohit-dua
 * Location: New Delhi, India
 * Time zone: UTC+5:30
 * Typical working hours: 12:00 pm to 5:00 pm, 8:00 pm to 3:00 am (IST)

Synopsis
Wikisources all around the world rely heavily on Google Books digitizations for transcription and proofreading. Books often disappear from the GB database. Currently, users have to manually download a book from GB, then either upload it to IA (if they want to preserve it) or upload it directly to Wikimedia Commons (again a manual task) with the appropriate metadata.

This project focuses on automating all three steps. The user only has to give the appropriate URL (or identifier) of the book(s) they wish to upload; everything else is automated, and the user is notified only when their intervention is needed.
[[File:Flowchart-gb2commons.png|thumbnail|right|310x310px|alt=|Flowchart for the project]]
Core libraries/tools used:
 * internetarchive
 * smtplib
 * urllib2 python-requests
 * IA-Upload
 * Google-Books API
 * JSON API (IA)

Deliverables
Goals of this project :

Required Goals:
This will take as input:
 GOOGLE_BOOK_URL OR ID    // the ID/URL of the book that will be uploaded to IA and Commons
 FILE_NAME_FOR_COMMONS    // the user-defined name for the DjVu file (will be passed to IA-Upload)
 * Tool hosted on Tool-Labs with a JavaScript front-end and python core.
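As an illustration of the first input, the sketch below normalizes a GOOGLE_BOOK_URL OR ID value to a bare book ID. It assumes that Google Books URLs carry the ID in an `id` query parameter; the function name `extract_book_id` is hypothetical and not part of the proposal's code.

```python
from urllib.parse import urlparse, parse_qs


def extract_book_id(url_or_id):
    """Return the Google Books ID from a URL or from a bare ID (hypothetical helper)."""
    text = url_or_id.strip()
    if "://" not in text:
        # No scheme present, so treat the input as a bare identifier.
        return text
    # Assumption: the book ID is carried in the "id" query parameter.
    ids = parse_qs(urlparse(text).query).get("id")
    if not ids:
        raise ValueError("no 'id' parameter found in URL: %s" % text)
    return ids[0]
```

Accepting both forms in one place means the rest of the pipeline only ever has to deal with the ID.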

Google provides the Google Books API. This will be used to extract all the details about the book (metadata) and to check whether it is in the public domain.
 * Extract meta-data from GB and check if it is Public Domain
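A minimal sketch of this step, assuming the public `volumes` endpoint of the Google Books API and its `volumeInfo` and `accessInfo.publicDomain` response fields. It uses the Python 3 standard library rather than python-requests to stay self-contained; the function names and the keys of the returned dictionary are illustrative choices.

```python
import json
from urllib.request import urlopen

# Endpoint of the Google Books API "volumes" resource.
API_URL = "https://www.googleapis.com/books/v1/volumes/{book_id}"


def fetch_volume(book_id):
    """Download the JSON description of one volume (needs network access)."""
    with urlopen(API_URL.format(book_id=book_id)) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_volume(volume):
    """Pick out the metadata needed later and the public-domain flag."""
    info = volume.get("volumeInfo", {})
    access = volume.get("accessInfo", {})
    return {
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "published_date": info.get("publishedDate"),
        "publisher": info.get("publisher"),
        "language": info.get("language"),
        # Treat a missing flag as "not public domain" to stay on the safe side.
        "public_domain": bool(access.get("publicDomain", False)),
    }
```

The tool would stop and notify the user when `public_domain` is false, instead of attempting the download.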

The Internet Archive provides a JSON API for advanced searching. ''This will be used to check whether the book is already available on IA.''
 * Check if a book is available on IA     
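The availability check could look roughly like the sketch below. It assumes the `advancedsearch.php` endpoint with `output=json` and a response of the shape `{"response": {"docs": [...]}}`. Searching by title alone is a simplification made for this example, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(title, rows=5):
    """Build an advanced-search URL that looks for text items with this title."""
    # Double quotes inside the title would break the quoted query, so drop them.
    safe_title = title.replace('"', " ")
    params = [
        ("q", 'title:"%s" AND mediatype:texts' % safe_title),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def matching_identifiers(search_result):
    """Extract the item identifiers from a decoded search response."""
    docs = search_result.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


def find_on_ia(title):
    """Run the search (needs network access) and return matching identifiers."""
    with urlopen(build_search_url(title)) as response:
        return matching_identifiers(json.loads(response.read().decode("utf-8")))
```

An empty list means no candidate was found and the upload can proceed; a non-empty list is a case where the user's intervention is needed to decide whether one of the matches is the same book.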

The required book will be downloaded from Google Books page by page: each page will first be downloaded as a PNG/JPG image, and the images will then be converted to PDF format for easy upload to IA. A link to the proof-of-concept code for the book download is given at the bottom.
 * Download all its pages from GB and convert to PDF/ZIP
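The packaging half of this step can be sketched with the standard library alone. The example below covers only the ZIP variant and assumes the page images have already been downloaded as bytes; PDF conversion would need an imaging library and is left out. The name `pack_pages` and the `page_0001.png` naming scheme are illustrative.

```python
import io
import zipfile


def pack_pages(pages):
    """Pack page images into a ZIP archive and return the archive bytes.

    pages: iterable of (page_number, image_bytes, extension) tuples.
    """
    buffer = io.BytesIO()
    # Images are already compressed, so store them without recompression.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for number, data, extension in sorted(pages, key=lambda page: page[0]):
            # Zero-padded names keep the pages in reading order.
            archive.writestr("page_%04d.%s" % (number, extension), data)
    return buffer.getvalue()
```

Sorting by page number makes the result independent of the order in which the pages were downloaded.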

The Python library internetarchive will be used for this step. Each book uploaded to IA will carry its metadata (taken from GB), which should also help avoid duplicate uploads in the long run. Files uploaded to IA are OCR'ed so that their text is searchable. ''This takes time, so at this step the user will be asked for their email address,'' so that they can be notified as soon as the OCR is complete. The user's email, the corresponding URL identifiers, and the entered FILE_NAME_FOR_COMMONS will be stored. A web crawler will periodically visit the URLs of the stored identifiers to check whether the OCR has completed.
 * Upload to IA with appropriate meta-data
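A rough sketch of the upload step, assuming the `upload(identifier, files=..., metadata=...)` function of recent versions of the internetarchive library. The input dictionary with keys such as `title` and `authors` is a hypothetical intermediate format, not something defined by the proposal.

```python
def to_ia_metadata(book):
    """Map book details taken from GB to Internet Archive metadata fields."""
    metadata = {"mediatype": "texts", "title": book["title"]}
    # Only add the fields that GB actually provided.
    if book.get("authors"):
        metadata["creator"] = book["authors"]
    if book.get("published_date"):
        metadata["date"] = book["published_date"]
    if book.get("publisher"):
        metadata["publisher"] = book["publisher"]
    if book.get("language"):
        metadata["language"] = book["language"]
    return metadata


def upload_book(identifier, file_path, book):
    """Upload one file to IA under the given identifier (needs IA credentials)."""
    # Imported here so the mapping above can be used without the library installed.
    from internetarchive import upload
    return upload(identifier, files=[file_path], metadata=to_ia_metadata(book))
```

Keeping the mapping separate from the upload call makes it easy to check the metadata before anything is sent to IA.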

''Once the OCR process is completed, the user will be notified via email. The Python library smtplib will be used to send the emails.''
 * Wait for its OCR, when completed notify user via email
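The waiting and notification logic might be sketched as follows. It assumes the crawler reads the item description from the IA metadata endpoint (`https://archive.org/metadata/<identifier>`, which returns a `files` list) and treats the presence of a `_djvu.txt` file as the sign that OCR has finished; that marker, the function names, and the default SMTP host are assumptions made for this example.

```python
import smtplib
from email.message import EmailMessage


def ocr_finished(item_metadata):
    """Return True if the item's file list contains an OCR text file.

    Assumption: a file whose name ends in "_djvu.txt" marks finished OCR.
    """
    files = item_metadata.get("files", [])
    return any(entry.get("name", "").endswith("_djvu.txt") for entry in files)


def build_notification(sender, recipient, identifier, link):
    """Compose the email that tells the user the book is ready for Commons."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "OCR finished for %s" % identifier
    message.set_content(
        "The OCR of your book is complete.\n"
        "Continue the upload to Commons here:\n%s\n" % link
    )
    return message


def send_notification(message, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)
```

The periodic job would call `ocr_finished` for every stored identifier and send one notification per finished book.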

The email will contain a link of the form http://tools.wmflabs.org/ia-upload/commons/fill?iaId=ID&commonsName=FILENAME, where ID is the identifier stored previously and FILENAME is the FILE_NAME_FOR_COMMONS taken as input at the beginning. This lets the user skip the otherwise unnecessary front page of IA-Upload.
 * Upload to Commons using IA-Upload tool.
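Building the link is a small URL-encoding exercise, sketched below with the base URL and parameter names quoted above. The function name is hypothetical.

```python
from urllib.parse import urlencode

IA_UPLOAD_FILL_URL = "http://tools.wmflabs.org/ia-upload/commons/fill"


def build_ia_upload_link(ia_id, commons_name):
    """Return the IA-Upload link with the identifier and file name pre-filled."""
    # urlencode escapes spaces and other special characters in the file name.
    query = urlencode([("iaId", ia_id), ("commonsName", commons_name)])
    return IA_UPLOAD_FILL_URL + "?" + query
```

For example, `build_ia_upload_link("abc", "My Book.djvu")` yields a link ending in `?iaId=abc&commonsName=My+Book.djvu`.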

Optional Goals:
If a user wants to use the Commons file immediately, they might want to skip the upload to IA, since it takes time. The wikitools library and the MediaWiki API will be used to connect and upload to Commons.
 * Direct upload to Commons.
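For orientation, the sketch below builds the form fields of a MediaWiki API `action=upload` request. It does not use wikitools and does not send anything: the file itself would go in a multipart `file` field, and the token would come from a logged-in session. The function name is hypothetical.

```python
def build_upload_params(filename, description, comment, token):
    """Return the form fields for a MediaWiki API upload request."""
    return {
        "action": "upload",
        "format": "json",
        "filename": filename,   # target name on Commons
        "text": description,    # initial wikitext of the file description page
        "comment": comment,     # upload summary shown in the log
        "token": token,         # CSRF token from a logged-in session
    }
```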

Project schedule
* The above plan may go as expected, or the time may be redistributed among the tasks.

Participation
During my work hours, I will always be logged in to IRC (channels: #mediawiki, #wikimedia-dev, #mediawiki-labs) and can also always be reached by email. I'm a computer addict and have a hard time staying off it. All source code I write will be published to my GitHub repo, although my tool will be hosted on Tool-Labs. At each stage of development I would like to discuss implementation details with the mentors so that there are no delays or issues later on. If I face other doubts or need feedback, I will head over to the mailing list (wikitech-l).

About you
My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My home town is New Delhi, India. I code in Python/JavaScript/C/C++. I'm passionate about computer security and automation, and coding gets me high! I am new to the world of open source and its community bonding. When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there's no such thing as free bread, but it turned out there always was free knowledge. Before this I never used to go to anyone with my programming issues or bugs (online or offline), but now I feel I can grow and learn much faster through community bonding in the open-source universe. This project is my first opportunity to bond with an open-source organization, and GSoC will be my bridge to the open-source community. Google Summer of Code will be my top priority, and I will happily treat it as a full-time job.

Past open source experience
GitHub profile: rohit-dua

Proof of concept code
For the sake of demonstration, I have written a script that downloads any public domain book from GB: https://github.com/rohit-dua/gb-download (Python)