User:NeilK/Multimedia2011/ChunkedUploads

Scratch notes from a discussion between Michael Dale, User:NeilK, and Jan Gerber (j@thing.net) on making MediaWiki friendly to large files, started at the MediaWiki Hackathon 2011 in Berlin (May 12-14, 2011).

The story so far
As already discussed, the Firefogg protocol was, in Tim Starling's opinion, not adequate for MediaWiki.

Neil proposed using a protocol similar to Google's Resumable Media Uploads.

Michael Dale's employer (Kaltura) is interested in fostering open source video.

Meanwhile, Google is hiring Jan Gerber to also get WebM support into Wikipedia, and large file upload in MediaWiki is important for that. The RMU protocol is also becoming a standard thing that Google uses.

Things on the agenda
 * Large file uploads, i.e. chunked uploads, in MediaWiki
 * Large file "transfers", where we obtain a file from somewhere else like Flickr or YouTube and republish it on MediaWiki
 * We didn't discuss this in detail, but this is called "UploadByUrl" in the MediaWiki world, and Bryan Tong Minh has had plans to do this for a while; see bug 20512

What we can do today
I'm still unsure about what, if anything, you wanted me (NeilK) to help with, since Michael wanted to write his own client & server code. Michael has had some problems getting his code accepted, but I don't have any special powers either.

I'm open to replicating the Google Resumable Media Upload protocol in its entirety (Brion seems to think it's a good idea too), or to using a similar protocol that we can show is just as good.

Google Resumable Media Upload protocol
I scanned http://code.google.com/apis/gdata/docs/resumable_upload.html ...

The full version of RMU works like this, in short:

Advertising we can handle RMU uploads

 * server publishes a link to a magic URL which can be used to start RMU uploads.

Starting the upload

 * client posts to the magic URL, with a Gdata message, which at minimum includes
 * header, GData version number
 * authorization header (but we would rely on cookies)
 * header, X-Upload-Content-Type
 * header, X-Upload-Content-Length
 * optional: body with xml describing the upload


 * server responds with 200 OK, Location: (upload uri)

Sending data

 * client does a PUT to upload URI, with Content-Length & Content-Type, and *perhaps* with byte-range,
 * in body is the data


 * if all complete, server responds with 200 OK


 * if not complete, then server responds with 308 Resume Incomplete, and asks for a Range: in its response headers
 * server may update the upload URI with a Location: response header.


 * repeat until done
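The client side of the loop above can be sketched as a small decision function. This is only an illustration of the summary in these notes, not Google's reference client. The name `next_action` is made up, and it assumes my reading that the Range: header on a 308 reports the bytes the server already holds (for example `bytes=0-99999`), so the client resumes at the next byte.

```python
import re


def next_action(status, headers, upload_uri):
    """Decide what an RMU-style client does after one PUT response.

    Returns ("done", None, uri) or ("send_from", offset, uri).
    """
    # The server may move the upload; follow Location if it is present.
    uri = headers.get("Location", upload_uri)
    if status == 200:
        return ("done", None, uri)
    if status == 308:
        # Assumption: Range carries "first-last" for the bytes stored so far.
        match = re.search(r"(\d+)-(\d+)", headers.get("Range", ""))
        # With no Range header we assume nothing was stored yet.
        offset = int(match.group(2)) + 1 if match else 0
        return ("send_from", offset, uri)
    raise RuntimeError(f"unexpected status {status}")
```

The caller would keep PUTting data starting at the returned offset until it gets "done".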

There are other complications for resuming uploaded files, but let's ignore that for now.

Simpler protocol for MediaWiki?
There are some concerns about replicating this in the PHP world:
 * using PUT as a verb
 * all kinds of custom behaviour in reading headers and responding with headers

Brion thinks it can all be done. We agreed that there isn't anything really stopping us from implementing at least the basics of that protocol.

However, at the last minute I think Michael started to propose a simplified, non-GData version of the above (?), without specifics.

I tried to imagine how that would work, so please correct me where I've got it wrong.

Presumably we should use the API where we can.

Quick question though: I believe Michael thought it would be onerous to send the chunk as a POST argument. I'm not sure why, as there is no need to escape the data if there are simple boundaries in the POST body. This requires the client to construct its own multipart POST body, but that's not too difficult.

https://developer.mozilla.org/en/Using_files_from_web_applications#Handling_the_upload_process_for_a_file.2c_asynchronously
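To show that building the multipart body by hand is not much work, here is a minimal sketch in Python (a browser client would do the equivalent in JavaScript). The function name `build_multipart` and the form field name `chunk` are assumptions for illustration only; nothing here is an existing MediaWiki interface.

```python
import uuid


def build_multipart(fields, chunk, filename="chunk.bin"):
    """Return (content_type, body) for a multipart/form-data POST.

    `fields` holds the ordinary API parameters; `chunk` is raw bytes.
    """
    # A random boundary is very unlikely to occur inside the chunk data.
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8"))
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="chunk"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8"))
    parts.append(chunk)  # raw bytes go in as they are, no escaping
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```

The returned content type goes in the Content-Type request header and the body is sent unchanged.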

So perhaps we could do this, with simple MediaWiki API methods.

Initiate upload

 * client authenticates in usual ways (or doesn't, depending on local wiki policies)


 * client POSTs API call to MediaWiki, like action=resumableUpload & start=1 & contentType=video/ogg & contentLength=2000000 & chunkSize = 100000


 * server sets up how it's going to store chunks, and relate them back to the user if required
 * server-side, each chunk has a timestamp for when it was received


 * server response returns an upload identifier, 'a1b2c3', and an initial range of bytes. The client said that it could post chunks of 100K in size, so the server should only request ranges that are at most 100K long. Byte ranges are inclusive, so in this case the server will respond with a parameter "bytes" equal to "0-99999".


 * if the server thinks the combination of parameters is unwanted or insane, it returns 200 OK with Error parameter, the typical way that API errors are returned in MediaWiki. Errors include:
 * disallowed contentType
 * contentLength too big
 * contentLength is zero
 * chunkSize too big / small
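The checks above could look roughly like this on the server. It is written in Python to keep the sketch short, although the real thing would be PHP inside the MediaWiki API. The allowed types, size limits, and the function name `start_upload` are all invented placeholders, and byte ranges are treated as inclusive, so a 100000-byte first chunk is "0-99999".

```python
import secrets

ALLOWED_TYPES = {"video/ogg", "video/webm"}  # hypothetical whitelist
MAX_LENGTH = 100 * 1024 * 1024               # hypothetical cap on file size
MIN_CHUNK, MAX_CHUNK = 1024, 1024 * 1024     # hypothetical chunk size bounds


def start_upload(content_type, content_length, chunk_size):
    """Validate a start request and return an API-style response dict."""
    if content_type not in ALLOWED_TYPES:
        return {"error": "disallowed contentType"}
    if content_length <= 0:
        return {"error": "contentLength is zero"}
    if content_length > MAX_LENGTH:
        return {"error": "contentLength too big"}
    if not MIN_CHUNK <= chunk_size <= MAX_CHUNK:
        return {"error": "chunkSize too big / small"}
    # First requested range never exceeds the chunk size or the file itself.
    last = min(chunk_size, content_length) - 1
    return {"upload": secrets.token_hex(8), "bytes": f"0-{last}"}
```

Either dict would be wrapped in the usual 200 OK API response.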

Sending data

 * client POSTs API call to MediaWiki, action=resumableUpload & upload=a1b2c3 & bytes=0-99999, with the chunk itself in the POST body


 * server response is 200 OK
 * server checks its list of chunks to see if it has everything
 * if it doesn't, it requests the next chunk it needs, and adds that to the response
 * bytes = '100000-199999'


 * repeat until server believes it has the entire file
 * at which point it concats the resumableUpload chunks into a real file and "uploads" it internally
 * (this may be a "stash" upload if we want to delay publication until user supplies further metadata)
 * response is simply 200 OK, without any bytes parameter in the response.
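The "which chunk do I need next" step above can be sketched as one function over the ranges the server has stored so far. The name `next_range` and the data layout are assumptions for illustration, and ranges are inclusive. Returning None corresponds to the final response with no bytes parameter.

```python
def next_range(received, content_length, chunk_size):
    """Return the next inclusive "start-end" range the server needs, or None.

    `received` is a list of inclusive (start, end) byte ranges already stored.
    """
    offset = 0
    for start, end in sorted(received):
        if start > offset:
            break  # there is a gap beginning at `offset`
        offset = max(offset, end + 1)
    if offset >= content_length:
        return None  # every byte is present, the file can be assembled
    end = min(offset + chunk_size, content_length) - 1
    return f"{offset}-{end}"
```

Because the function always asks for the first missing byte, a chunk that got lost in transit is simply requested again, even if later chunks have already arrived.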

QUESTION: at this point the server now needs to reassemble the file and check for malware. This could take a long time so it should be done asynchronously. How to indicate this back to the user?

If it all goes wrong
The above protocol should resist any temporary problems with bandwidth & loss of connection.

The client gives up whenever it wants to. It's conceivable that there could be some way to resume the upload (if they can find the file again), since it is tied to the user, but let's leave that aside for now.

On the server, a cronjob runs looking for resumableUploads that are not complete and where no chunk has been uploaded in the past 24 hours, and removes them & their associated chunk files.

If the client gets a final 200 OK with no bytes parameter (in other words, the server says "we're done") but the client doesn't think it has actually uploaded the whole file to the server, it will have to indicate this to the user. There will be no recourse, other than deleting the uploaded file.
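The selection step of that cronjob might look like the following sketch. The record layout (`complete`, `started`, `chunk_times`) and the name `stale_uploads` are invented for illustration; the 24 hour window is the one from the note above. Deleting the chunk files is left to the caller.

```python
import time

STALE_AFTER = 24 * 60 * 60  # 24 hours, in seconds


def stale_uploads(uploads, now=None):
    """Return ids of incomplete uploads with no chunk in the last 24 hours.

    `uploads` maps an upload id to a dict with "complete" (bool),
    "started" (timestamp) and "chunk_times" (list of timestamps).
    """
    now = time.time() if now is None else now
    stale = []
    for upload_id, info in uploads.items():
        if info["complete"]:
            continue
        # An upload with no chunks yet is aged from its start time.
        last_activity = max(info["chunk_times"], default=info["started"])
        if now - last_activity > STALE_AFTER:
            stale.append(upload_id)
    return stale
```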