Google Books, Internet Archive, Commons upload cycle

Google Books > Internet Archive > Commons upload cycle

 * Public URL: https://www.mediawiki.org/wiki/User:8ohit.dua/GSoC_proposal_2014
 * Bugzilla report: Bug - 57813
 * Announcement: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Google_Books_.3E_Internet_Archive_.3E_Commons_upload_cycle

Name and contact information
 * Name: Rohit Dua
 * Email: 8ohit.dua@gmail.com
 * IRC or IM networks/handle(s): rohit-dua
 * Location: New Delhi, India
 * Timezone: UTC+5:30
 * Typical working hours: 12:00 pm to 5:00 pm, 8:00 pm to 3:00 am (IST)

Synopsis
Wikisources all around the world rely heavily on Google Books (GB) digitizations for transcription and proofreading. These books often disappear from the GB database. This project focuses on automatically uploading GB books to the Internet Archive (IA) with appropriate metadata, and then to Wikimedia Commons. The user only has to give an appropriate description (or identifiers) for the book(s) they wish to upload.

Deliverables
Required goals:
 * Tool hosted on Tool Labs with a JavaScript frontend and a Python core.
 * Check whether a book is available on IA.
 * If not, search for it on GB and check whether it is in the public domain.
 * Download all its pages and convert them to PDF/ZIP.
 * Upload to IA with appropriate metadata.
 * Wait for the OCR to finish, then notify the user by email.
 * Upload to Commons using the IA-Upload tool.
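
The first steps of this list (check IA, check GB, prepare metadata) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's final design: it uses Python 3's `urllib.request` (the successor of `urllib2`), the IA advanced-search endpoint, and the `accessInfo.publicDomain` and `volumeInfo` fields of a Google Books volume record. The function names, the title-based search query and the metadata mapping are hypothetical choices made for the example.

```python
import json
import urllib.parse
import urllib.request

IA_SEARCH_URL = "https://archive.org/advancedsearch.php"
GB_VOLUME_URL = "https://www.googleapis.com/books/v1/volumes/"


def ia_search_url(title, rows=5):
    """Build an IA advanced-search URL for text items matching a title."""
    safe_title = title.replace('"', " ")  # keep the quoted phrase well-formed
    params = [
        ("q", 'title:"%s" AND mediatype:texts' % safe_title),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return IA_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def ia_has_book(search_response):
    """True if the decoded search response reports at least one hit."""
    return search_response.get("response", {}).get("numFound", 0) > 0


def gb_is_public_domain(volume):
    """True if the GB volume record is flagged as public domain."""
    return bool(volume.get("accessInfo", {}).get("publicDomain", False))


def ia_metadata_from_volume(volume):
    """Map a GB volume record to IA metadata (assumed, simplified mapping)."""
    info = volume.get("volumeInfo", {})
    metadata = {"mediatype": "texts", "title": info.get("title", "")}
    if info.get("authors"):
        metadata["creator"] = info["authors"]
    if info.get("publisher"):
        metadata["publisher"] = info["publisher"]
    if info.get("publishedDate"):
        metadata["date"] = info["publishedDate"]
    return metadata


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def plan_upload(title, volume_id, fetch=fetch_json):
    """Decide what to do with a book; returns (action, metadata or None)."""
    if ia_has_book(fetch(ia_search_url(title))):
        return "already-on-ia", None
    volume = fetch(GB_VOLUME_URL + urllib.parse.quote(volume_id))
    if not gb_is_public_domain(volume):
        return "not-public-domain", None
    return "upload", ia_metadata_from_volume(volume)
```

Passing `fetch` as a parameter keeps the decision logic separate from the network calls, so it can be exercised with canned responses. A title match is a weak availability test; a real check would probably compare identifiers or several fields.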

Optional goals:
 * Remove the 'Digitized by Google' watermark from downloaded pages.

Project schedule
The plan above may proceed as scheduled, or time may be redistributed among the tasks as needed.

Core libraries/tools used:
 * internetarchive
 * smtplib
 * urllib2
 * IA-Upload
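
As an illustration of how `smtplib` might be used for the "notify the user after OCR" step, here is a minimal sketch. It assumes that OCR completion can be detected by the presence of a derived `_djvu.txt` full-text file in the item's file list (as returned by `https://archive.org/metadata/<identifier>`), and that an SMTP server is reachable on `localhost`. The function names, addresses and message wording are hypothetical.

```python
import smtplib
from email.message import EmailMessage


def ocr_finished(item_metadata):
    """True if the item's file list contains a derived full-text file.

    Assumption: a file ending in "_djvu.txt" means IA has finished OCR.
    """
    files = item_metadata.get("files", [])
    return any(f.get("name", "").endswith("_djvu.txt") for f in files)


def build_notification(sender, recipient, identifier):
    """Create the email telling the user that the book is ready."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "OCR finished for %s" % identifier
    msg.set_content(
        "The book is ready: https://archive.org/details/%s\n" % identifier
    )
    return msg


def send_notification(msg, host="localhost", port=25):
    """Send the message through an SMTP server (assumed to be local)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

In the tool, a periodic job could fetch the item metadata, call `ocr_finished`, and send the message once it returns true.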

Participation
During my working hours, I will always be logged in to IRC (channels: #mediawiki, #wikimedia-dev, #mediawiki-labs) and can always be reached by email. I'm a computer addict and have a hard time staying away from it. All source code I write will be published to my GitHub repo. At each stage of development I would like to discuss implementation details with the mentor so that there are no delays or issues later on.

About you
My name is Rohit Dua, and I'm currently studying for a BTech in Electronics and Communication at the Jaypee Institute of Information Technology, Noida, India. My hometown is New Delhi, India. I code in Python, JavaScript, C and C++, and I'm passionate about computer security and automation. Coding gets me high. I am new to the world of open source and its community bonding. When I first heard about open source, I was crazy about it, as I had always thought there was no such thing as a free lunch. Before this I was a lone coder, i.e. I never went to anyone with my programming issues or bugs (online or offline). But now I feel I can grow and learn much faster through community bonding in the FOSS universe. This project is my first opportunity to bond with an open-source organization. GSoC will be my bridge to the open-source community.

Past open source experience
GitHub profile: rohit-dua

Proof of concept code
For the sake of demonstration, since I don't have much past open-source experience (being new!), I have written a Python script that downloads any public-domain book from GB.