User:Puntonim/Gsoc 2014 Google Books to Internet Archive to Commons Upload Cycle

UPDATE
I've finally decided to work on three ideas for GSoC: one with the Django Software Foundation, one with Bookie, and one with Neurostars. My first preference is definitely the Django Software Foundation, because working with the Django core developers is a kind of dream for me, but I haven't received any feedback from them. My second choices are Bookie and Neurostars, and I got good feedback from both.

So I guess I have no time to put together a good proposal for Wikimedia. Don't get me wrong: contributing to Wikimedia would be an honour for me, given all the great products that the Wikimedia Foundation offers to all of us! I'm afraid I'll give priority to other projects for GSoC 2014, but please count on me for the future!

Introduction
I'm Paolo Coffetti, a software engineer living in Amsterdam, the Netherlands. I'm very close to a master's degree at the University of Bergamo, Italy: I've finished all the courses and am currently working on my thesis, which I'll be defending on June 9. The thesis is about a project which I have temporarily named Moogle (My Own Google). Moogle is a website with a full-text search engine for private data: it connects to a user's accounts on Facebook, Twitter, Google (Drive, Gmail, Google Plus) and Dropbox, indexes all her data (textual information only) and provides a private full-text search (available only to her). The project is not complete yet, but I'm working hard on it.

I've been working in Amsterdam as a web developer at a small startup named United Academics since 2010. Officially I'm not working for them anymore because I'm 100% focused on my thesis, but from time to time I still help their IT team. United Academics currently aims to provide an Open Access repository to scientists, but unfortunately it has lately been facing some financial difficulties. In past years United Academics was something slightly different, and I worked on projects like a print-on-demand website (www.print2book.com), a job portal (now offline), a bookstore (now offline), a couple of WordPress websites (http://www.united-academics.org/magazine/) and, more recently, an Open Access repository (not online yet). All those projects (apart from the WordPress ones) were made in Python and Django, so I consider myself an expert with those technologies. I'm definitely not a senior, because I still have a lot to learn, but I'm also definitely more than a junior.

I spent some hours today on Wikimedia's proposals for Google Summer of Code. I'm particularly interested in: Google Books > Internet Archive > Commons upload cycle.

I got a good impression from reading the idea and I would like to ask you for more details. This is not my official proposal: I haven't checked out and studied the code yet, nor made a detailed plan. It is only a first approach, to get a clearer idea of what the aims are and to see if I am on the right track.

Search for the book "Alice in Wonderland" in Google Books
Google Books has a very well designed API, so this should be an easy task: https://developers.google.com/books/docs/v1/getting_started
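A minimal sketch of the search step in Python, assuming the public `volumes` endpoint of the Books API v1 and its `q` and `maxResults` parameters. The helper names `build_search_url` and `summarize_volume` are hypothetical, and the sample dictionary is only a trimmed-down volume in the shape the API returns. No request is sent here; the URL could be fetched with any HTTP client.

```python
from urllib.parse import urlencode

# Public search endpoint of the Google Books API v1.
BOOKS_API = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(query, max_results=10):
    """Return the search URL for a free-text query (no network access here)."""
    return BOOKS_API + "?" + urlencode({"q": query, "maxResults": max_results})


def summarize_volume(item):
    """Pick the fields this workflow cares about out of one entry of "items"."""
    info = item.get("volumeInfo", {})
    access = item.get("accessInfo", {})
    return {
        "id": item.get("id"),
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "year": info.get("publishedDate", "")[:4],
        "public_domain": access.get("publicDomain", False),
        "pdf_available": access.get("pdf", {}).get("isAvailable", False),
    }


# Trimmed-down volume, for illustration only.
sample = {
    "id": "KsARckM-mG0C",
    "volumeInfo": {
        "title": "Alice in Wonderland / druk 1",
        "authors": ["Lewis Carroll"],
        "publishedDate": "2005",
    },
    "accessInfo": {"publicDomain": False, "pdf": {"isAvailable": False}},
}

print(build_search_url("Alice in Wonderland"))
print(summarize_volume(sample))
```

The `public_domain` and `pdf_available` flags are kept separate on purpose, so that later steps can decide what to do with volumes that cannot be downloaded directly.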

This is a response for the query "Alice in Wonderland":

 {
   "kind": "books#volumes",
   "totalItems": 491,
   "items": [
     {
       "kind": "books#volume",
       "id": "KsARckM-mG0C",
       "etag": "9OxBVibAetI",
       "selfLink": "https://content.googleapis.com/books/v1/volumes/KsARckM-mG0C",
       "volumeInfo": {
         "title": "Alice in Wonderland / druk 1",
         "subtitle": "Alice had er genoeg van niets te doen te hebben",
         "authors": [ "Lewis Carroll" ],
         "publisher": "Kemper Conseil Publishing",
         "publishedDate": "2005",
         "description": "Een klein meisje beleeft in haar slaap de wonderlijkste avonturen.",
         "industryIdentifiers": [
           { "type": "ISBN_10", "identifier": "9076542120" },
           { "type": "ISBN_13", "identifier": "9789076542126" }
         ],
         "pageCount": 192,
         "printType": "BOOK",
         "categories": [ "Fiction" ],
         "averageRating": 5,
         "ratingsCount": 1,
         "contentVersion": "1.1.0.0.preview.1",
         "imageLinks": {
           "smallThumbnail": "http://bks7.books.google.nl/books?id=KsARckM-mG0C&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api",
           "thumbnail": "http://bks7.books.google.nl/books?id=KsARckM-mG0C&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"
         },
         "language": "nl",
         "previewLink": "http://books.google.nl/books?id=KsARckM-mG0C&pg=PA61&dq=alice+in+wonderland&hl=&cd=1&source=gbs_api",
         "infoLink": "http://books.google.nl/books?id=KsARckM-mG0C&dq=alice+in+wonderland&hl=&source=gbs_api",
         "canonicalVolumeLink": "http://books.google.nl/books/about/Alice_in_Wonderland_druk_1.html?hl=&id=KsARckM-mG0C"
       },
       "saleInfo": { "country": "NL", "saleability": "NOT_FOR_SALE", "isEbook": false },
       "accessInfo": {
         "country": "NL",
         "viewability": "PARTIAL",
         "embeddable": true,
         "publicDomain": false,
         "textToSpeechPermission": "ALLOWED",
         "epub": { "isAvailable": false },
         "pdf": { "isAvailable": false },
         "webReaderLink": "http://books.google.nl/books/reader?id=KsARckM-mG0C&hl=&printsec=frontcover&output=reader&source=gbs_api",
         "accessViewStatus": "SAMPLE",
         "quoteSharingAllowed": false
       },
       "searchInfo": {
         "textSnippet": "e Rups en Alice keken elkaar enige tijd zwij- &#39;gend aan. Ten slotte nam de Rups \nde waterpijp uit zijn mond en sprak haar aan met een lome, slaperige stem. &#39;Wie \nben jij?&#39; vroeg de Rups. Dit was geen erg bemoedigend begin voor een gesprek \n ..."
       }
     },
     ...

In Google Books some books can be downloaded, like this one:

http://books.google.nl/books?id=hWByX5-c5SIC&lpg=PP1&dq=alice%20in%20wonderland&hl=it&pg=PP1#v=onepage&q=alice%20in%20wonderland&f=false

and some cannot:

http://books.google.nl/books?id=3CWNgZnD-V4C&lpg=PP1&dq=alice%20in%20wonderland&hl=it&pg=PP1#v=onepage&q=alice%20in%20wonderland&f=false

Are we going to download only the ones we can actually download?

''Nemo says: We want to download both.''

Also, Google adds a cover page and a watermark to those books. I think we can remove them; do you already have a solution for that?

''Nemo says: Usually just pdftk to drop some pages + pdfimages for the rest is enough.''
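A rough sketch of how those two tools could be driven from Python. It only builds the command lines and does not run them. The assumptions here are that the Google cover is exactly the first page, that the standard `pdftk ... cat ... output ...` and `pdfimages -j` invocations are enough, and that the file names are placeholders. The function name `strip_cover_commands` is hypothetical. Deciding which extracted images are watermarks is not handled.

```python
import shlex


def strip_cover_commands(src_pdf, cleaned_pdf, image_prefix, first_page=2):
    """Build the two command lines; nothing is executed here."""
    # pdftk: keep everything from first_page to the end, dropping the cover.
    drop_pages = ["pdftk", src_pdf, "cat", f"{first_page}-end", "output", cleaned_pdf]
    # pdfimages: extract the embedded page images (JPEG where possible).
    extract_images = ["pdfimages", "-j", cleaned_pdf, image_prefix]
    return [drop_pages, extract_images]


for cmd in strip_cover_commands("alice.pdf", "alice-nocover.pdf", "alice-page"):
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning argument lists rather than shell strings keeps file names with spaces safe when they are later passed to `subprocess.run`.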

We could even extend this and search not only in Google Books but also in other repositories, like www.gutenberg.org. Do you already have a list of such websites? If those websites have no public APIs, we can use Scrapy (a Python project that is also taking part in GSoC!) to scrape them.

''Nemo says: We're interested in scans. There are many digital libraries, but extracting metadata from them all is tough; it's better to focus on one thing at a time, while planning for extensibility.''

Check if the book "Alice in Wonderland" has already been uploaded to Internet Archive
Internet Archive offers an API which, at first look, is not very well designed: https://archive.org/help/json.php

For instance, this is the list of books in Internet Archive when searching for "alice in wonderland": https://archive.org/advancedsearch.php?q=alice+in+wonderland+AND+mediatype%3Atexts&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes#raw
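The same query can be assembled programmatically. This is a sketch, assuming the `advancedsearch.php` parameters visible in the URL above (without the `callback` parameter, so that plain JSON rather than JSONP is requested) and assuming that the matching records sit under `response.docs` in the returned JSON. The helper names and the sample payload are hypothetical.

```python
from urllib.parse import urlencode

IA_SEARCH = "https://archive.org/advancedsearch.php"


def build_ia_search_url(title, rows=50):
    """Return the advanced-search URL for texts matching a title."""
    params = [
        ("q", f"{title} AND mediatype:texts"),
        ("fl[]", "identifier"),  # only ask for the item identifier
        ("rows", rows),
        ("page", 1),
        ("output", "json"),
    ]
    return IA_SEARCH + "?" + urlencode(params)


def extract_identifiers(payload):
    """Pull the identifiers out of a decoded JSON response (assumed layout)."""
    docs = payload.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


# Made-up payload in the assumed layout, for illustration only.
fake_payload = {"response": {"numFound": 2, "docs": [
    {"identifier": "example-item-1"},
    {"identifier": "example-item-2"},
]}}

print(build_ia_search_url("alice in wonderland"))
print(extract_identifiers(fake_payload))
```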

We can easily parse this JSON dictionary and get the right information, but how can we make sure that one of the books in the list is exactly the book coming from Google Books, or at least the book the user is interested in? Do we need user interaction here?

''Nemo says: Probably the user is expected to search on archive.org first, but we're probably not going to have many duplicates. We could check the essential book info and, if title, author and year match, ask.''
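One possible reading of that check, as a sketch: normalise title and authors, compare them together with the year, and only then ask the user to confirm. The dictionary keys (`title`, `authors`, `year`) and the function names are assumptions made for this example, not an existing schema. The comparison is deliberately strict, so an author string with extra details such as life dates would not match.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def author_key(name):
    """Order-insensitive key, so "Carroll, Lewis" equals "Lewis Carroll"."""
    return " ".join(sorted(normalize(name).split()))


def looks_like_duplicate(book_a, book_b):
    """True when title and year match and at least one author is shared."""
    same_title = normalize(book_a["title"]) == normalize(book_b["title"])
    same_year = book_a["year"] == book_b["year"]
    authors_a = {author_key(a) for a in book_a["authors"]}
    authors_b = {author_key(a) for a in book_b["authors"]}
    return same_title and same_year and bool(authors_a & authors_b)


# Made-up records, for illustration only.
from_google = {"title": "Alice in Wonderland", "authors": ["Lewis Carroll"], "year": "1900"}
from_ia = {"title": "Alice in wonderland.", "authors": ["Carroll, Lewis"], "year": "1900"}

if looks_like_duplicate(from_google, from_ia):
    print("Possible duplicate: ask the user before uploading.")
```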

PS: Internet Archive offers a service named OpenLibrary (https://openlibrary.org/developers/api). Its API seems to have a much better design than the previous one, but the set of books in OpenLibrary may not be the same as the one in Internet Archive.

In case "Alice in Wonderland" has not been already uploaded to Internet Archive, we need to do that
You link the internetarchive library (https://pypi.python.org/pypi/internetarchive), which seems to be actively maintained (last commit on 2014-01-31), so it should work painlessly.
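A sketch of what the upload step might look like with that library. It assumes a recent release that exposes a top-level `upload(identifier, files=..., metadata=...)` function and assumes that Internet Archive credentials are already configured; older releases may expose a different interface. The helper names, the identifier and all metadata values are placeholders, and the choice of metadata fields is a guess at a sensible minimum.

```python
def build_ia_metadata(title, authors, year, language, source_url):
    """Map basic book info onto common Internet Archive metadata fields."""
    return {
        "mediatype": "texts",
        "title": title,
        "creator": authors,
        "date": year,
        "language": language,
        "source": source_url,
    }


def upload_scan(identifier, pdf_path, metadata):
    """Upload one file; needs the internetarchive package and IA credentials."""
    # Imported lazily so the rest of the sketch runs without the package.
    from internetarchive import upload
    return upload(identifier, files=[pdf_path], metadata=metadata)


# Placeholder values, for illustration only.
metadata = build_ia_metadata(
    title="Example title",
    authors=["Example Author"],
    year="1900",
    language="eng",
    source_url="https://books.google.com/books?id=EXAMPLE_ID",
)
print(metadata)
# upload_scan("example-identifier", "example.pdf", metadata)  # not run here
```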

Get the djvu from Internet Archive and upload to Commons
You mention that there is a tool named IA-Upload for this task. I have tested the tool a bit and it seems to work correctly. If any maintenance is needed here I can help out with that, but my knowledge of PHP is really basic. If we want to rewrite this tool in Python, I am up for that as well. If we want to create a tool for the entire process, I would love to do that in Python.

''Nemo says: For now we don't seek to rewrite IA-Upload from scratch, but it would be nice to trigger it automatically.''

I believe we should try to estimate the effort for each task and give each one a priority.

''Nemo says: Yes! Very important.''

I've never worked on an open source project, but I've wanted to for many years, so I'm excited to finally have the chance to do so.

Also, please consider that I'd really love to take part in a Google Summer of Code project, so I will apply for more than one project (mainly Python projects). Could you please tell me whether you already have many students interested in this task, and whether my chances of getting it are high or low, so I can focus on the right projects?

''Nemo says: We have another person interested in this project. We can have multiple persons applying for the same project, and your proposal looks more structured. I'd like you to publish it with the tweaks from the answers above (otherwise I have to summarise them myself in the project idea :P). We also have closely related projects which are instead in PHP, so there may be room for everyone anyway.''

Paolo