User:Puntonim/Gsoc 2014 Google Books to Internet Archive to Commons Upload Cycle

From mediawiki.org

UPDATE

I've finally decided to work on three ideas for GSoC: one with the Django Software Foundation, one with Bookie, and one with Neurostars. My first preference is definitely the Django Software Foundation, because working with the Django core developers is something of a dream for me, but I haven't received any feedback from them. My second choices are Bookie and Neurostars, and I got good feedback from both.

So I guess I won't have time to formalize a good proposal for Wikimedia. Don't get me wrong: contributing to Wikimedia would be an honour, given all the great products that the Wikimedia Foundation offers to all of us! I'm afraid I'll give priority to other projects for GSoC 2014, but please count on me for the future!

Steps

Search for the book "Alice in Wonderland" in Google Books

Google Books has a very well designed API service, so this should be an easy task: https://developers.google.com/books/docs/v1/getting_started

This is a response for the query "Alice in Wonderland" (truncated):

{
"kind": "books#volumes",
"totalItems": 491,
"items": [
 {
  "kind": "books#volume",
  "id": "KsARckM-mG0C",
  "etag": "9OxBVibAetI",
  "selfLink": "https://content.googleapis.com/books/v1/volumes/KsARckM-mG0C",
  "volumeInfo": {
   "title": "Alice in Wonderland / druk 1",
   "subtitle": "Alice had er genoeg van niets te doen te hebben",
   "authors": [
    "Lewis Carroll"
   ],
   "publisher": "Kemper Conseil Publishing",
   "publishedDate": "2005",
   "description": "Een klein meisje beleeft in haar slaap de wonderlijkste avonturen.",
   "industryIdentifiers": [
    {
     "type": "ISBN_10",
     "identifier": "9076542120"
    },
    {
     "type": "ISBN_13",
     "identifier": "9789076542126"
    }
   ],
   "pageCount": 192,
   "printType": "BOOK",
   "categories": [
    "Fiction"
   ],
   "averageRating": 5,
   "ratingsCount": 1,
   "contentVersion": "1.1.0.0.preview.1",
   "imageLinks": {
    "smallThumbnail": "http://bks7.books.google.nl/books?id=KsARckM-mG0C&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api",
    "thumbnail": "http://bks7.books.google.nl/books?id=KsARckM-mG0C&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"
   },
   "language": "nl",
   "previewLink": "http://books.google.nl/books?id=KsARckM-mG0C&pg=PA61&dq=alice+in+wonderland&hl=&cd=1&source=gbs_api",
   "infoLink": "http://books.google.nl/books?id=KsARckM-mG0C&dq=alice+in+wonderland&hl=&source=gbs_api",
   "canonicalVolumeLink": "http://books.google.nl/books/about/Alice_in_Wonderland_druk_1.html?hl=&id=KsARckM-mG0C"
  },
  "saleInfo": {
   "country": "NL",
   "saleability": "NOT_FOR_SALE",
   "isEbook": false
  },
  "accessInfo": {
   "country": "NL",
   "viewability": "PARTIAL",
   "embeddable": true,
   "publicDomain": false,
   "textToSpeechPermission": "ALLOWED",
   "epub": {
    "isAvailable": false
   },
   "pdf": {
    "isAvailable": false
   },
   "webReaderLink": "http://books.google.nl/books/reader?id=KsARckM-mG0C&hl=&printsec=frontcover&output=reader&source=gbs_api",
   "accessViewStatus": "SAMPLE",
   "quoteSharingAllowed": false
  },
  "searchInfo": {
   "textSnippet": "e Rups en Alice keken elkaar enige tijd zwij- 'gend aan. Ten slotte nam de Rups \nde waterpijp uit zijn mond en sprak haar aan met een lome, slaperige stem. 'Wie\nben jij?' vroeg de Rups. Dit was geen erg bemoedigend begin voor een gesprek\n ..."
  }
 },
 ...
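As a minimal sketch of this step, the search URL can be built and the response reduced to the few fields needed later (title, authors, year, ISBN). The endpoint and the `maxResults` parameter come from the Google Books API v1; the function names and the shape of the summary dictionary are assumptions made for illustration, and the actual HTTP request is left out.

```python
from urllib.parse import urlencode

# Public volume-search endpoint of the Google Books API v1.
GOOGLE_BOOKS_SEARCH = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(query, max_results=10):
    """Return the URL for a volume search (no request is sent here)."""
    params = {"q": query, "maxResults": max_results}
    return GOOGLE_BOOKS_SEARCH + "?" + urlencode(params)


def summarize_volumes(response):
    """Reduce a parsed search response to the fields used in later steps.

    The keys of the returned dictionaries are hypothetical names chosen
    for this sketch; the keys read from `response` are the ones visible
    in the sample response above.
    """
    summaries = []
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        isbns = {
            ident["type"]: ident["identifier"]
            for ident in info.get("industryIdentifiers", [])
        }
        summaries.append({
            "id": item["id"],
            "title": info.get("title", ""),
            "authors": info.get("authors", []),
            # publishedDate may be "2005" or a full date; keep the year.
            "year": info.get("publishedDate", "")[:4],
            "isbn_13": isbns.get("ISBN_13"),
        })
    return summaries
```

The parsed JSON of a real response (for example from `json.load` on the HTTP response body) can be passed straight to `summarize_volumes`.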

In Google Books some books can be downloaded, like this one:

http://books.google.nl/books?id=hWByX5-c5SIC&lpg=PP1&dq=alice%20in%20wonderland&hl=it&pg=PP1#v=onepage&q=alice%20in%20wonderland&f=false

and some cannot:

http://books.google.nl/books?id=3CWNgZnD-V4C&lpg=PP1&dq=alice%20in%20wonderland&hl=it&pg=PP1#v=onepage&q=alice%20in%20wonderland&f=false

Are we going to download only the ones we can actually download?

Nemo says: We want to download both
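Since both kinds of books are wanted, the tool would first need to tell them apart. A rough sketch, assuming that the `accessInfo.pdf.isAvailable` and `accessInfo.epub.isAvailable` flags from the sample response are a usable signal (whether they always imply a free download is something to verify); the function names are hypothetical:

```python
def has_direct_download(volume):
    """True if the volume reports a PDF or EPUB file as available."""
    access = volume.get("accessInfo", {})
    return any(
        access.get(fmt, {}).get("isAvailable", False)
        for fmt in ("pdf", "epub")
    )


def split_by_download(volumes):
    """Split volume ids into (directly downloadable, everything else)."""
    direct, other = [], []
    for volume in volumes:
        target = direct if has_direct_download(volume) else other
        target.append(volume["id"])
    return direct, other
```

The second list is where a different retrieval route would be needed; this sketch only does the bookkeeping.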

Also, Google adds a cover page and a watermark to those books. I think we can remove them; do you already have a solution?

Nemo says: Usually just pdftk to drop some pages + pdfimages for the rest is enough.
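A sketch of how the two tools could be driven from Python, assuming the Google cover is only the first page (the `first_page` default is an assumption) and that `pdftk` and `pdfimages` are installed. The file names and function names are placeholders; whether image extraction alone gets rid of the watermark on every book would still need checking.

```python
import subprocess


def cleanup_commands(src_pdf, trimmed_pdf, image_prefix, first_page=2):
    """Build the command lines; nothing is executed here."""
    # pdftk: keep pages first_page..end, i.e. drop the leading cover page(s).
    drop_cover = ["pdftk", src_pdf, "cat", f"{first_page}-end",
                  "output", trimmed_pdf]
    # pdfimages: dump the embedded page images; -j writes JPEG data as .jpg.
    extract_images = ["pdfimages", "-j", trimmed_pdf, image_prefix]
    return [drop_cover, extract_images]


def run_cleanup(src_pdf, trimmed_pdf, image_prefix, first_page=2):
    """Run both commands in order, stopping at the first failure."""
    for command in cleanup_commands(src_pdf, trimmed_pdf, image_prefix,
                                    first_page):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes it possible to log or test the exact command lines without touching any files.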

We could even extend this and search not only Google Books but also other repositories, such as www.gutenberg.org. Do you already have a list of such websites? Where those websites have no public API, we can use Scrapy (a Python project that is also taking part in GSoC!) to scrape them.

Nemo says: We're interested in scans. There are many digital libraries, but extracting metadata from all of them is tough; better to focus on one thing at a time, though planned for extensibility.

Check if the book "Alice in Wonderland" has already been uploaded to Internet Archive

Internet Archive offers an API service, which at first look is not very well designed: https://archive.org/help/json.php

For instance this is the list of books present in Internet Archive when searching for "alice in wonderland": https://archive.org/advancedsearch.php?q=alice+in+wonderland+AND+mediatype%3Atexts&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes#raw
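A small sketch of the same query built programmatically. It reuses the parameters of the URL above but leaves out `callback`, so that the server returns plain JSON rather than JSONP. The function names are hypothetical, and the `response`/`docs` layout read by `extract_identifiers` is the one this search endpoint returns for `output=json`.

```python
from urllib.parse import urlencode

IA_SEARCH = "https://archive.org/advancedsearch.php"


def build_ia_search_url(title, rows=50):
    """Return the advanced-search URL listing text items that match title."""
    params = [
        ("q", f"{title} AND mediatype:texts"),
        ("fl[]", "identifier"),   # only the identifier field is requested
        ("rows", rows),
        ("page", 1),
        ("output", "json"),
    ]
    return IA_SEARCH + "?" + urlencode(params)


def extract_identifiers(result):
    """Pull the item identifiers out of a parsed search result."""
    docs = result.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs]
```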

We can easily parse this JSON dictionary and get the right information, but how can we make sure that one of the books in the list is exactly the book coming from Google Books, or at least the book the user is interested in? Do we need user interaction here?

Nemo says: Probably the user is expected to search on archive.org first, but we're probably not going to have many duplicates. We could check the essential book info and, if title, author and year match, ask.
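The title/author/year check could look roughly like this. It is only a sketch: the record layout (`title`, `author`, `year` keys) is an assumption, and exact matching after normalization is the simplest possible rule, not necessarily the one the final tool should use.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and collapse punctuation to single spaces."""
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    return re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).strip()


def is_probable_duplicate(record_a, record_b):
    """True if title, author and year all match after normalization.

    A True result means "ask the user", not "skip the upload".
    """
    same_title = normalize(record_a["title"]) == normalize(record_b["title"])
    same_author = normalize(record_a["author"]) == normalize(record_b["author"])
    same_year = str(record_a["year"])[:4] == str(record_b["year"])[:4]
    return same_title and same_author and same_year
```

Only when this returns True would the user be prompted to confirm whether the Internet Archive item is the same book.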

PS: Internet Archive offers a service named OpenLibrary (https://openlibrary.org/developers/api). That API seems to have a much better design than the previous one, but the set of books in OpenLibrary may not be the same as the one in Internet Archive.

In case "Alice in Wonderland" has not been already uploaded to Internet Archive, we need to do that

You link the following library: https://pypi.python.org/pypi/internetarchive which seems to be actively maintained (last commit on 2014-01-31), so it should work painlessly.
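A hedged sketch of what the upload step might look like. The top-level `upload` function is what recent releases of the `internetarchive` library expose; the release available in early 2014 may have a different interface, so this is an assumption to check against the installed version. The metadata keys are common Internet Archive fields, while the helper names, the identifier and the file path are placeholders. Credentials are assumed to be configured separately.

```python
def build_ia_metadata(title, authors, year):
    """Build the metadata dictionary for a scanned book."""
    return {
        "mediatype": "texts",      # books and other text items
        "title": title,
        "creator": list(authors),
        "date": str(year),
    }


def upload_scan(identifier, pdf_path, metadata):
    """Upload one file to a new or existing item (performs network I/O)."""
    # Imported here so the metadata helper works without the library.
    from internetarchive import upload
    return upload(identifier, files=[pdf_path], metadata=metadata)
```

The metadata can be filled directly from the fields extracted from the Google Books response.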

Get the DjVu from Internet Archive and upload to Commons

You mention that there is a tool named IA-Upload for this task. I have tested the tool a bit and it seems to work correctly. If any maintenance is needed here I can help out with that, but my knowledge of PHP is really basic. If we want to rewrite this tool in Python, I am up for that as well. If we want to create a tool for the entire process, I would love to do that in Python.

Nemo says: For now we don't seek to rewrite IA-Upload from scratch, but it would be nice to trigger it automatically.

I believe we should try to estimate the effort for each task and give them a priority.

Nemo says: Yes! Very important.

I've never worked on an open source project, but I've wanted to for many years, so I'm excited to finally have the chance to do so.

Also, please consider that I'd really love to take part in a Google Summer of Code project, so I will apply for more than one project (mainly Python projects). Could you please tell me whether you already have many students interested in this task, and whether my chances of getting it are high or low, so I can focus on the right projects?

Nemo says: We have another person interested in this project. We can have multiple people applying for the same project, and your proposal looks more structured. I'd like you to publish it with the tweaks from the answers above (otherwise I have to summarise them myself in the project idea :P). We also have closely related projects which are instead in PHP, so there may be room for everyone anyway.