
Google Books, Internet Archive, Commons upload cycle

From mediawiki.org

(Automation Tool) Google Books > Internet Archive > Commons upload cycle


BUB: Book Uploader Bot


Public URL: //www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle


Name and contact information


Name: Rohit Dua
Email: 8ohit.dua@gmail.com
IRC or IM networks/handle(s): rohit-dua
Location: New Delhi, India
Time-zone: UTC+5:30
Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August.


Synopsis


Wikisources around the world rely heavily on Google Books digitizations for transcription and proofreading, but books often disappear from the GB database. Currently, users have to manually download a book from GB and then either upload it to IA (if they want to preserve it) or upload it directly to Wikimedia Commons (again a manual task) with appropriate meta-data.

This project focuses on automating all three steps together. The user will only have to give the appropriate URL (or identifier) for the book(s) they wish to upload; everything else is automated, and the user is notified only when their intervention is needed.

Flowchart for the project
Direct Link

Core Libraries/tools used:

Deliverables


Goals of this project :

Required Goals:
  • Tool hosted on Tool-Labs with a JavaScript front-end and Python core.
    This will take as input:
    LIBRARY_TO_CHOOSE        // The source library, e.g. Google-Books. More libraries can be added in the future.
    GOOGLE_BOOK_URL or ID    // The ID/URL of the book that will be uploaded to IA and Commons.
    FILE_NAME_FOR_COMMONS    // The user-defined name for the DjVu file (will be passed to IA-Upload).
    EMAIL_ID
  • Extract meta-data from GB and check if it is Public Domain
    Google provides Google-Books API:
    This will be used to extract all the details about the book (meta-data) and check if it is public domain or not.
  • Check if a book is available on IA
    Internet Archive provides JSON API for advanced searching.
    This will be used to check whether the book is already available in IA or not. 
  • Download all its pages from GB and convert to PDF/ZIP
    The required book will be downloaded from Google Books page by page, each page first being saved as a PNG/JPG image,
    and the images will then be converted to PDF format for easy upload to IA.
    A link to the proof-of-concept code for the book download is given at the bottom of this page.
  • Upload to IA with appropriate meta-data
    The Python library internetarchive will be used for this step.
    For each book uploaded to IA, its meta-data (taken from GB) will be added.
    This will be a better means to avoid duplicated uploads in the long run.
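
The checks described in the steps above could be sketched roughly as follows. This is only an illustration, not the tool's actual code: the helper names are hypothetical, matching by title is an assumed (and crude) duplicate test, and the functions work on already-parsed JSON so that the HTTP client is left open. The `accessInfo.publicDomain` and `volumeInfo` fields come from the Google Books API volume resource, and the search URL targets the Internet Archive advanced search endpoint.

```python
from urllib.parse import urlparse, parse_qs, urlencode

IA_SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def extract_volume_id(url_or_id):
    """Return the Google Books volume ID from a URL, or the input if it is already an ID."""
    text = url_or_id.strip()
    if "://" not in text:
        return text
    ids = parse_qs(urlparse(text).query).get("id")
    if not ids:
        raise ValueError("no 'id' parameter in URL: %s" % text)
    return ids[0]


def is_public_domain(volume):
    """Read the accessInfo.publicDomain flag of a parsed volume resource."""
    return bool(volume.get("accessInfo", {}).get("publicDomain", False))


def to_ia_metadata(volume):
    """Map volumeInfo fields onto common IA metadata keys (assumed mapping)."""
    info = volume.get("volumeInfo", {})
    metadata = {
        "mediatype": "texts",
        "title": info.get("title"),
        "creator": info.get("authors"),
        "publisher": info.get("publisher"),
        "date": info.get("publishedDate"),
        "language": info.get("language"),
    }
    # Drop fields that Google Books did not provide.
    return {key: value for key, value in metadata.items() if value}


def ia_search_url(title, rows=5):
    """Build an advanced-search URL asking IA for text items with this title, as JSON."""
    params = [
        ("q", 'title:"%s" AND mediatype:texts' % title.replace('"', " ")),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return IA_SEARCH_ENDPOINT + "?" + urlencode(params)


def ia_has_match(search_result):
    """True if a parsed advanced-search response reports at least one hit."""
    return search_result.get("response", {}).get("numFound", 0) > 0
```

The dictionary returned by `to_ia_metadata` is shaped so that it could be passed as the `metadata` argument of `internetarchive.upload`; a real duplicate check would probably need something stricter than a title match.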

Files uploaded to IA are OCR'ed so that their text is searchable, which takes time, so users will be notified via email as soon as the OCR is complete. Each user's email address, the corresponding IA identifier, and the entered FILE_NAME_FOR_COMMONS will be stored (SQLite). A web crawler will periodically visit the URLs of the stored identifiers to check for OCR completion.
  • Wait for its OCR, when completed notify user via email
    Once the OCR process is completed, the user will be notified via email. The Python library smtplib will be used to send emails.
    The emails will contain a link of the form: http://tools.wmflabs.org/ia-upload/commons/fill?iaId=ID&commonsName=FILENAME,
    where ID is the identifier stored previously and FILENAME is the FILE_NAME_FOR_COMMONS taken as input at the beginning.
    This lets users skip the front page of IA-Upload, since they
    will not have to manually enter the identifier of the uploaded file.
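
A minimal sketch of the storage and notification step described above, assuming a single SQLite table. The table layout, function names, and message wording are hypothetical; only the link format is taken from this proposal. The message is built but not sent, so the example runs without a mail server.

```python
import sqlite3
from email.message import EmailMessage
from urllib.parse import urlencode

IA_UPLOAD_FILL_URL = "http://tools.wmflabs.org/ia-upload/commons/fill"


def open_store(path=":memory:"):
    """Open (or create) the request database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        " ia_id TEXT PRIMARY KEY,"
        " commons_name TEXT NOT NULL,"
        " email TEXT NOT NULL,"
        " notified INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_request(conn, ia_id, commons_name, email):
    """Remember who asked for which book, to notify them once OCR is done."""
    with conn:
        conn.execute(
            "INSERT INTO requests (ia_id, commons_name, email) VALUES (?, ?, ?)",
            (ia_id, commons_name, email),
        )


def pending_requests(conn):
    """Return the requests whose users have not been notified yet."""
    rows = conn.execute(
        "SELECT ia_id, commons_name, email FROM requests"
        " WHERE notified = 0 ORDER BY ia_id"
    )
    return rows.fetchall()


def mark_notified(conn, ia_id):
    with conn:
        conn.execute("UPDATE requests SET notified = 1 WHERE ia_id = ?", (ia_id,))


def fill_link(ia_id, commons_name):
    """Build the IA-Upload link that pre-fills the identifier and file name."""
    query = urlencode({"iaId": ia_id, "commonsName": commons_name})
    return IA_UPLOAD_FILL_URL + "?" + query


def build_notification(sender, recipient, ia_id, commons_name):
    """Compose (but do not send) the OCR-complete email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "OCR finished for %s" % ia_id
    msg.set_content(
        "The OCR of your book is complete. Continue the upload to Commons here:\n"
        + fill_link(ia_id, commons_name) + "\n"
    )
    return msg
```

The crawler would loop over `pending_requests`, and for each finished item hand the message to something like `smtplib.SMTP(host).send_message(msg)` before calling `mark_notified`.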

Optional Goals:

  • Direct upload to Commons.
    If a user wants to use the Commons file immediately, they might want to skip the step of
    uploading to IA (as it takes time).
    The wikitools library and the MediaWiki API will be used to connect and upload to Commons.
  • Add support for other popular Public Library Networks
    Support for public libraries like Digital Library of India (Archived 2013-08-06 at the Wayback Machine) and West Bengal Public Library Network
    will be added, which will work in a similar fashion to Google-Books.

* The code will be designed so that support for more libraries (like Digital Library of India (Archived 2013-08-06 at the Wayback Machine)) can be added easily.
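
One way such an extensible design might look is a small registry of library back-ends behind a common interface. This is a sketch under assumptions, not the project's actual structure: the class names, the `gb` key, and the two methods are all hypothetical.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse, parse_qs

# Maps a short library key (the LIBRARY_TO_CHOOSE input) to its back-end class.
SOURCES = {}


def register(cls):
    """Class decorator that adds a back-end to the registry."""
    SOURCES[cls.key] = cls
    return cls


class BookSource(ABC):
    """Interface that every supported library would implement."""

    key = ""

    @abstractmethod
    def normalize_id(self, url_or_id):
        """Return the library-specific book identifier."""

    @abstractmethod
    def metadata_url(self, book_id):
        """Return the URL from which the book's meta-data can be fetched."""


@register
class GoogleBooks(BookSource):
    key = "gb"

    def normalize_id(self, url_or_id):
        text = url_or_id.strip()
        if "://" not in text:
            return text
        ids = parse_qs(urlparse(text).query).get("id")
        if not ids:
            raise ValueError("no 'id' parameter in URL: %s" % text)
        return ids[0]

    def metadata_url(self, book_id):
        return "https://www.googleapis.com/books/v1/volumes/" + book_id


def get_source(key):
    """Instantiate the back-end registered under this key."""
    try:
        return SOURCES[key]()
    except KeyError:
        raise ValueError("unsupported library: %s" % key)
```

With this layout, adding another library would mean writing one more `BookSource` subclass and decorating it with `register`; the rest of the pipeline would only call `get_source`.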


Project schedule

Timeline          Task
Apr 21 - May 19   Get familiar with code base, move local environment to Labs, bond with community
May 19 - May 26   University examinations
May 26 - May 30   Add feature to extract meta-data from GB and check if it is public domain (proof of code)
May 30 - Jun 05   Download from GB and convert to PDF
May 30 - Jun 05   Code to properly upload to IA using the internetarchive library
Jun 05 - Jun 10   Code to check if the book is available in IA
Jun 10 - Jun 22   Database and its Python connector for email/identifier storage
Jun 23            Mid-term evaluation
Jun 24 - Jul 05   Spider bot to check for updates
Jun 05 - Jul 15   Automatic notification email using smtplib and link with the IA-Upload tool
Jul 15 - Jul 25   UI polishing, bug fixing
Jul 25 - Aug 18   Code clean-up, documentation, and buffer time for unforeseen delays

* The plan above may proceed as scheduled, or time may be redistributed among the tasks as needed.


Participation


During my work hours, I will always be logged in to IRC (channels: #mediawiki, #wikimedia-dev, #mediawiki-labs) and can always be reached by email. I'm a computer addict and have a hard time staying away from it. All source code I write will be published to my GitHub repo, although my tool will be hosted on Tool-Labs.
At each stage of development I would like to discuss implementation details with the mentors so that there are no delays or issues later on. If I have other doubts or need feedback, I will head over to the talk page at Talk:Google Books, Internet Archive, Commons upload cycle or the mailing list (wikitech-l).


About you


My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My home town is New Delhi, India.
I code in Python/JavaScript/C/C++.
I'm passionate about computer security and automation, and coding gets me high! I am new to the world of open source and its community bonding.
When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there was no such thing as free bread, but there had always been free knowledge. Before this I never went to anyone with my programming issues or bugs (online or offline). Now I feel I can grow and learn much faster through community bonding in the open-source universe.
This project is my first opportunity to bond with an open-source organization, and GSoC will be my bridge to the open-source community. Google Summer of Code will be my top priority, and I will happily treat it as a full-time job.


Past open source experience


GitHub profile: rohit-dua

Proof of concept code


For the sake of demonstration, I have written a script to download any public domain book from GB: https://github.com/rohit-dua/gb-download (Python)

* UI and some verification code (project named BUB: Book Uploader Bot): https://github.com/rohit-dua/BUB


UI Mockup
