UploadWizard/Software design

Some documentation for developers and reviewers interested in how UploadWizard works.

Incomplete Frontend docs

PHP (server side files)


 * New
   * includes/upload/UploadStash.php
   * includes/specials/SpecialUploadStash.php
   * extensions/UploadWizard/ApiQueryStashImageInfo.php
   * extensions/UploadWizard/SpecialUploadWizard.php
   * extensions/UploadWizard/UploadWizard.alias.php
   * extensions/UploadWizard/UploadWizard.i18n.php
   * extensions/UploadWizard/UploadWizardMessages.php (should be obsoleted by ResourceLoader in the near future)
   * extensions/UploadWizard/UploadWizardPage.php (should be obsoleted by ResourceLoader in the near future)

 * Modified
   * includes/upload/UploadBase.php
   * includes/api/ApiUpload.php
   * includes/filerepo/File.php

 * Changed config
   * includes/AutoLoader.php
   * includes/SpecialPage.php
   * languages/messages/MessagesEn.php


 * JavaScript
   * To be documented

Overview
UploadWizard is:
 * a multiple file uploader, with some "batch" capabilities
 * with a "wizard", step-by-step interface
 * with improved metadata and licensing entry
 * designed to be deployed on Wikimedia Commons, although it should be useful for many other wikis

To achieve this, we've changed a lot about how uploads are accomplished.

The standard MediaWiki way


This is how media uploads have worked for a long time with MediaWiki -- very simply.

The file is uploaded with an HTML form, along with wikitext for the File: page that will surround the image.

Each wiki page could be very different; there's little standard formatting.

However, we still use the base operation here -- to upload a media file with accompanying wikitext.

The Commons way


This is how Wikimedia Commons works in late 2010.

Nothing fundamental has changed here -- users are still uploading a media file with some associated wikitext. But it's being done just a little differently. There is more bureaucracy up front to try to categorize various media types. (At left we see only one example of many.) The user fills out a form, and some JavaScript on the page creates equivalent wikitext and sends that with the media file to the server.

There is much more preamble, as they feel they need to warn uploaders about Commons' licensing and interface requirements in very scary text.

The form page is very complicated, and has more structure and required fields, but ultimately it's just creating wikitext.
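To make "it's just creating wikitext" concrete, here is a minimal sketch of the kind of transformation such a form performs. It is written in Python purely for illustration (the real form does this in JavaScript in the browser). The function name, the choice of fields, and the use of an {{Information}} template followed by a license template are assumptions for the example, not a description of the actual Commons form.

```python
def build_wikitext(fields, license_template):
    """Turn structured form fields into wikitext for a File: page.

    Hypothetical helper: the field names and template layout are
    illustrative assumptions, not the actual Commons upload form.
    """
    lines = ["{{Information"]
    # Emit every expected parameter, leaving missing ones blank.
    for name in ("description", "date", "source", "author"):
        value = fields.get(name, "").strip()
        lines.append(f"|{name}={value}")
    lines.append("}}")
    # The license is expressed as one more template after the description.
    lines.append("{{" + license_template + "}}")
    return "\n".join(lines)


if __name__ == "__main__":
    form = {
        "description": "A cat asleep on a windowsill",
        "date": "2010-11-05",
        "source": "Own work",
        "author": "Example User",
    }
    print(build_wikitext(form, "cc-by-sa-3.0"))
```

However elaborate the form becomes, the server still receives only a file and a block of text like this.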

While this is an improvement over the previous version, usability is still very poor. The page spends half its time warning you about bad things that can happen.


 * This is largely because they are still tied to an interaction model where it all comes down to just one click, which will add all the information to the database and publish the file almost immediately. Over time, Commons administrators have become very fed up with people who publish files which need to be taken down, and have piled on warning after warning.


 * Also, the page cannot provide sensible defaults for many of the fields since it has no way of analyzing the file itself.


 * The page doesn't have any structured flow to it -- it's just trying to amass as much information as possible in one go.


 * The page is just generally poorly organized, with questions about authorship and licensing scattered all over the page.


 * Mistakes or errors usually cause the user to lose work, and it's possible to make *many* mistakes since the form fields all inter-relate.


 * The interface elements can be somewhat bizarre and non-standard.


 * Much screen real estate is given over to rarely-needed UI.

The UploadWizard way


UploadWizard at heart uses the same system -- associate a media file with wikitext. But it adds two new layers to the entire interaction.

The most obvious change is that we are shepherding multiple files through this process, at the same time.

On the client side, in the user's browser, we now have a "wizard" style interface flow. Information that is related is gathered at the same time, and then the user proceeds to the next step. For example, there's exactly one screen about licensing, and for the most part everything is handled there.

On the server side, we have a new way of storing data and media files that stops just short of publishing them to the wiki.

This is important for us mostly due to a quirk of how web browsers have traditionally worked. Web browsers cannot analyze the files they are uploading or provide any information about them, not even a thumbnail -- they need help. The Firefogg extension is one kind of help, but the one that works with all browsers is to upload the file to the server and then ask it for what it can determine about these files. So UploadWizard first uploads the files to the server, and then it gets:
 * Thumbnails for each image (helpful for identifying multiple files!)
 * An analysis of the metadata in each file. For instance, many photos have information embedded within them that tells us when they were taken. We can use that information to prefill many form fields. (The user can still change them.)
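The prefilling step can be sketched as follows. This is an illustrative Python sketch, not UploadWizard's client code: the function name and the shape of the metadata dictionary are assumptions. The tag names DateTimeOriginal and Artist, and the "YYYY:MM:DD HH:MM:SS" timestamp layout, are standard EXIF.

```python
from datetime import datetime


def prefill_fields(metadata):
    """Suggest default form values from metadata the server extracted.

    Hypothetical helper: `metadata` is assumed to be a flat dict of
    EXIF tag names to string values.
    """
    defaults = {}

    raw_date = metadata.get("DateTimeOriginal")
    if raw_date:
        try:
            # EXIF timestamps use colons in the date part as well.
            taken = datetime.strptime(raw_date, "%Y:%m:%d %H:%M:%S")
            defaults["date"] = taken.date().isoformat()
        except ValueError:
            pass  # Unparseable date: leave the field blank for the user.

    artist = metadata.get("Artist")
    if artist and artist.strip():
        defaults["author"] = artist.strip()

    return defaults


if __name__ == "__main__":
    print(prefill_fields({"DateTimeOriginal": "2010:11:05 14:30:00"}))
```

The result is only a set of suggestions. Anything missing or malformed is left blank, and the user remains free to overwrite every value.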

So the user can complete filling out all this information in relative peace, focusing on one thing at a time, not worrying if they've accidentally released an unlicensed file into the public sphere.

And then when they're ready, they can publish it to the wiki.

UploadStash
To make the above design for UploadWizard work, we needed to store files with the following constraints:


 * files must be temporary
 * files must not be public, and must only be writable/accessible by the uploader
 * we must be able to obtain icons and thumbnails (with the same security as views)
 * we must be able to obtain metadata about the files, such as EXIF or IPTC tags (with the same security as views)
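The first two constraints (temporary, and private to the uploader) can be illustrated with a small in-memory model. This is a Python sketch for discussion only; the class name, the six-hour lifetime, and the single generic error are assumptions, and the real implementation stores files in the repository's temp area rather than in memory.

```python
import time


class StashAccessError(Exception):
    """Raised for missing, expired, or foreign stash entries alike."""


class PrivateStash:
    """Toy model of a stash whose entries are temporary and owner-only."""

    def __init__(self, ttl_seconds=6 * 3600):  # lifetime is an assumed value
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (owner, stored_at, data)

    def put(self, key, owner, data, now=None):
        stored_at = time.time() if now is None else now
        self._entries[key] = (owner, stored_at, data)

    def get(self, key, user, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            raise StashAccessError("no such file")
        owner, stored_at, data = entry
        if now - stored_at > self.ttl_seconds:
            # Expired entries are dropped on access.
            del self._entries[key]
            raise StashAccessError("no such file")
        if user != owner:
            # Same message as above, so other users cannot probe for keys.
            raise StashAccessError("no such file")
        return data
```

Thumbnails and metadata requests would go through the same `get` check, which is what "with the same security as views" amounts to in this model.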

The new UploadStash module and Special:UploadStash page answer the need for such a file area within MediaWiki.

This is not a radically new concept, as we've already been using temporary stashes for uploads. If an uploaded file was found to have some problem which the user could fix before it was committed to the database (typically, a naming conflict), the file would be placed into the repository's "temp" area, its location saved in the user's "session", and a session "key" returned to the user so they could refer to this stashed file later.

However, aside from storing the file with a fixed set of metadata, there wasn't anything else one could do with the file.

We add a few new features with UploadStash:


 * The ability for the application to associate arbitrary data with the stash.
   * The implementation is straightforward; a key-value PHP object is serialized into the stash.

 * Content hashes as default keys.
   * But other keys can be used.

 * The ability for the user to read the entire file out again.
   * By our current design, the temp area does not have to be web-accessible. Furthermore, even if it were, MediaWiki (and Wikimedia projects especially) has zero security for media files. So, to keep from inadvertently "publishing" files, we simply create special URLs under Special:UploadStash that, when invoked, look up a "session key" in the user's current session and read the file directly back to the user with appropriate HTTP headers. In other words, for this limited purpose, PHP takes on the role that Apache normally plays in serving media files. See below for security and other implications. Incidentally, this is not the first time we've done this with MediaWiki; Tim Starling's WebStore module uses a similar strategy, although there the reason isn't security.

 * The ability to transform the file and for the viewer to see transformations.
   * This is to get thumbnails. This uses the standard facilities for transforming files in MediaWiki. Sound files and other non-visual media should be assigned icons of the appropriate size. These icons and other files will be stored in the temporary area. Since they are stored under their content hash, identical icons are only stored once. These thumbnails are then "stashed" themselves and thus become accessible in the way noted above.

 * The ability to get metadata about files.
   * A new module, ApiQueryStashImageInfo, a subclass of ApiQueryImageInfo, is being added.
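Two of the features above, content hashes as default keys and serialized application data, can be sketched together. This is a Python model for illustration only: the class and method names are hypothetical, SHA-1 is an assumed choice of hash, and JSON stands in for the PHP serialization the real code uses.

```python
import hashlib
import json
from typing import Optional


class ContentStash:
    """Toy stash keyed by content hash, with attached application data."""

    def __init__(self):
        self._files = {}  # key -> file bytes
        self._data = {}   # key -> serialized key-value data

    def stash(self, content: bytes, app_data: Optional[dict] = None,
              key: Optional[str] = None) -> str:
        # Default key is the content hash, so identical bytes share one slot.
        if key is None:
            key = hashlib.sha1(content).hexdigest()
        self._files[key] = content
        # Arbitrary key-value data is serialized alongside the file.
        # Re-stashing the same content replaces the data stored with it.
        self._data[key] = json.dumps(app_data or {})
        return key

    def read(self, key: str) -> bytes:
        return self._files[key]

    def app_data(self, key: str) -> dict:
        return json.loads(self._data[key])


if __name__ == "__main__":
    stash = ContentStash()
    first = stash.stash(b"generic audio icon")
    second = stash.stash(b"generic audio icon")
    print(first == second)  # identical content maps to the same key
```

This is why identical icons cost no extra storage: the second call writes to the slot the first one already filled.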

All of the above has been carefully designed to be 100% compatible with the previous methods of stashing files (in fact, from a data perspective, identical).

Security and other implications
Since UploadStash allows one to read temp files off the MediaWiki server in a new way, it has to be checked very carefully that it does not open any new security holes. Here is what is in place:


 * The actual temporary path is not revealed to the user. The user only uses an opaque session key, or a related Special:UploadStash URL.
 * UploadStash tries to check that the path it is reading from really is in the repository's temp area.
 * Since UploadStash is using PHP to serve the file, which is inherently less efficient, it will refuse to serve files that are larger than a preset limit (currently 250K).
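The last two checks can be sketched like this. It is an illustrative Python version, not the PHP in UploadStash: the function and exception names are hypothetical, and "250K" is read as 250 KiB, which is an assumption.

```python
import os

MAX_SERVE_BYTES = 250 * 1024  # "250K" read as 250 KiB (assumption)


class RefuseToServe(Exception):
    """Raised when a stashed file must not be served."""


def read_stashed_file(temp_dir, relative_path):
    """Return the file's bytes only if it is inside temp_dir and small."""
    # Resolve symlinks and ".." on both sides before comparing.
    root = os.path.realpath(temp_dir)
    target = os.path.realpath(os.path.join(root, relative_path))

    # The resolved target must still live under the temp area.
    if os.path.commonpath([root, target]) != root:
        raise RefuseToServe("path is outside the temp area")

    # Serving through application code is slow, so cap the size.
    if os.path.getsize(target) > MAX_SERVE_BYTES:
        raise RefuseToServe("file is too large to serve")

    with open(target, "rb") as handle:
        return handle.read()
```

Resolving the path before the comparison matters: a purely textual prefix check would accept something like `../secret.txt` joined onto the temp directory.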

Even so, it is conceivable that if it were ever used for "upload by URL", the user could turn MediaWiki into a sort of silent, private, slow, inefficient web proxy.

There is also an opportunity for a denial of service attack, by uploading files and requesting transformations ad infinitum.

Opportunities for rationalizing other parts of the codebase
Incidentally, this "stashing" functionality has existed for a while in our base class UploadBase, but extremely similar code is also to be found in the extension FirefoggChunkedUpload, as well as in other extensions in various states of upkeep (SpecialUploadMogile, MultiUpload, SemanticForms, SocialProfile, and others). UploadStash aims to encompass all the use cases noted above and in most cases should be a drop-in replacement. It should also make other forms of asynchronous uploading (such as Upload By URL) simpler to manage.