UploadWizard/Operations assessment

I was asked about the potential effects of Extension:UploadWizard on Wikimedia operations. Here's what I (User:NeilK) know, which amounts to a lot of educated guesses. It's easy to describe how uploading a single file differs between the old way and the new way. It's harder to guess how this is going to affect uploading patterns overall.

Summary
 * 1) We do not face any new kinds of threats. This tool just uses the existing API. We are not any more vulnerable now than we were before.
 * 2) Per uploaded file, Special:UploadWizard uses fewer resources than the old upload page on Commons, with the exception of thumbnailing, where it uses more.
 * 3) Due to increased usability, it will probably greatly increase the number of uploads per day.
 * 4) Ops must pay a little more attention to keeping temporary storage "clean".

Changes in serving the upload page
Table of typical bandwidth usage of Commons' enhanced Special:Upload versus UploadWizard

When it comes to serving the page itself, Special:UploadWizard is somewhat more efficient than the old Special:Upload. This is because more functionality is concentrated in jQuery and delivered more efficiently through ResourceLoader. So, simply loading the application should not show any major differences.

Both pages also make lots of API calls to validate various fields, but it turns out to be about the same in terms of efficiency.

The major difference is that Special:UploadWizard displays thumbnails from stashed files. This does not add very much to the page in terms of downloaded bytes (maybe 100K at most) but it will require more processing power to make these thumbnails.

Increased volume/burstiness in thumbnailing
When UploadWizard uploads a file, it immediately requests a thumbnail, and may request a larger one if the user clicks on the small thumbnail. Previously, it was unlikely that any thumbnail size other than the "standard" one would be requested.

(N.B. if this is a problem, we can defer some of these so they occur over a longer period of time.)

Recommendation for ops: monitor CPU, cache, and network utilization on the thumbnailing servers; expect increased usage

Potential for some kinds of resource exhaustion
This is not a new problem, but UploadWizard may make it easier for users to upload zillions of small files, and leave them in the stash (the FileRepo temporary zone). See 26063. Note: this potential problem exists in the API, and has existed since late fall 2010; it is not actually UploadWizard-specific.

The proposed solution, to be implemented by UploadWizard and/or MediaWiki developers, is to cap the number of files a user can have in the stash at any given time (say, 100 or so). When a user uploads more than that, some of their stashed files are removed, even if those uploads are incomplete.
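To make the proposal concrete, here is a minimal in-memory sketch of a per-user cap. It is not MediaWiki code: the class name `StashLimiter`, its methods, and the choice to evict the oldest files first are all assumptions made for illustration. The proposal above only says that "some" files are removed.

```python
from collections import defaultdict, deque


class StashLimiter:
    """Hypothetical model of a per-user cap on stashed files."""

    def __init__(self, max_files_per_user=100):
        self.max_files_per_user = max_files_per_user
        # Maps user -> stashed file keys, oldest first.
        self._stash = defaultdict(deque)

    def add(self, user, file_key):
        """Stash a file and return the keys evicted to stay under the cap."""
        files = self._stash[user]
        files.append(file_key)
        evicted = []
        # Assumption: evict oldest first. The proposal does not say which
        # files should go when the cap is exceeded.
        while len(files) > self.max_files_per_user:
            evicted.append(files.popleft())
        return evicted

    def count(self, user):
        """Number of files this user currently has in the stash."""
        return len(self._stash[user])
```

A real implementation would also have to delete the evicted files from the temporary zone; the sketch only tracks which keys would be affected.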

Recommendation for ops: the per-user check mentioned above can be implemented in MediaWiki software, so no ops action is needed there. But for extra security against this class of problem, cronjobs should be employed to sweep that area clean of files older than 6 hours or so, and also to keep the total number of stashed files below some threshold.

Increased volume of uploads
We expect that the increased ease of use will, over time, accelerate the number of uploads and the need for permanent storage. We have no idea how much.

Recommendation: monitor growth in the number of files and in average file size, and see whether the increased volume accelerates storage growth needs

Increased 'burstiness' of uploads from a single user
UploadWizard allows users to upload multiple files relatively easily. This will cause numerous files to be uploaded within a few seconds of each other.

The tool is currently limited to ten uploads in total per invocation. However, this limit is configured on the client and thus can be changed by the client. Alternatively, the user can open the tool in multiple browser tabs.

We can block abuse server-side with the measures discussed above under "Potential for some kinds of resource exhaustion".

Recommendation: monitor daily, hourly, and minute-by-minute variation in the number of uploaded files, and see whether increased burstiness needs extra capacity
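One rough way to put a number on minute-by-minute burstiness is to compare the busiest minute with the average minute. The sketch below assumes upload times are available as epoch seconds (for example, pulled from logs); the function names and the peak-to-mean measure are illustrative choices, not an existing monitoring tool.

```python
from collections import Counter


def uploads_per_minute(timestamps):
    """Count uploads in each one-minute bucket (input: epoch seconds)."""
    return Counter(int(ts) // 60 for ts in timestamps)


def burstiness(timestamps):
    """Ratio of the busiest minute to the mean over the observed span.

    A value near 1.0 means uploads are spread evenly; larger values
    mean uploads are concentrated in a few minutes.
    """
    buckets = uploads_per_minute(timestamps)
    if not buckets:
        return 0.0
    # Include empty minutes between the first and last upload in the mean.
    span_minutes = max(buckets) - min(buckets) + 1
    mean = sum(buckets.values()) / span_minutes
    return max(buckets.values()) / mean
```

The same bucketing works for hourly or daily variation by changing the divisor.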

More simultaneous uploads from the same user
The tool is designed to allow simultaneous uploads. Currently, these are turned off (see 26179 for the exciting (not) details).

When simultaneous uploads are turned on, the tool will allow some configured number of simultaneous transactions. For example, if the user is trying to upload seven items, it will at first start three of them. When one finishes, the tool will start another upload, and so on, until the full list of seven has been uploaded.
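The scheduling described above is a bounded-concurrency queue. UploadWizard itself is JavaScript, so the following Python sketch only models the behaviour. `upload_all`, `demo`, and `fake_upload` are hypothetical names, and the limit of three is taken from the example above.

```python
import asyncio


async def upload_all(items, upload_one, max_simultaneous=3):
    """Run upload_one over items with at most max_simultaneous in flight."""
    semaphore = asyncio.Semaphore(max_simultaneous)

    async def guarded(item):
        # A new upload starts only when one of the slots is free.
        async with semaphore:
            return await upload_one(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    """Upload seven fake files and record the peak number in flight."""
    in_flight = 0
    peak = 0

    async def fake_upload(name):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for the network transfer
        in_flight -= 1
        return name

    items = [f"file{i}.jpg" for i in range(7)]
    results = await upload_all(items, fake_upload, max_simultaneous=3)
    return results, peak
```

Running `asyncio.run(demo())` uploads all seven items while never having more than three in flight at once.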

This can be circumvented client-side by hacking the configuration (it's just JavaScript) or opening multiple browser windows.

We can block abuse server-side with the measures discussed above under "Potential for some kinds of resource exhaustion". However, in general, we are not aware of any design limitations in MediaWiki or its backend file store that cause problems when the same user makes simultaneous accesses.

Recommendation for ops: none, other than cronjobs already mentioned

Increased number of files going to temporary zone (and/or being abandoned there)
Currently, the temporary stash area is only used when a file has a problem that can be corrected by some user action. For UploadWizard, it is the first step on the journey to publishing the file. It is possible for the user to abandon the file in the stash if they don't complete the process.

Recommendation for ops: none, other than cronjobs already mentioned

Leftover records from abandoned uploads
Following Raindrift's changes to UploadWizard, temporary files will have a database record. This will need to be periodically cleaned.

The correct solution is probably a cronjob that does the following:


 * Find rows in the uploadstash table with a us_timestamp that's more than n hours (6? 30? I don't know) in the past. Delete the disk files and remove the rows.
 * Find files with a ctime more than a few hours old, which have no corresponding uploadstash row, and delete those as well (there may be some rare cases where a file can exist without a database record).
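The two steps above could be sketched roughly as follows. This is only an illustration under several assumptions: SQLite stands in for the real database, the table is reduced to two columns (`us_path` and `us_timestamp`), timestamps are stored as epoch seconds, and all stashed files sit directly in one directory. The function name `sweep_stash` is hypothetical, and a production job would also need locking and error handling.

```python
import os
import sqlite3
import time


def sweep_stash(conn, stash_dir, max_age_seconds, now=None):
    """Delete expired stash entries, then old files with no database row.

    Returns (expired_paths, orphaned_paths).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds

    # Step 1: rows older than the cutoff. Remove the file, then the row.
    expired = [row[0] for row in conn.execute(
        "SELECT us_path FROM uploadstash WHERE us_timestamp < ?", (cutoff,))]
    for path in expired:
        if os.path.exists(path):
            os.remove(path)
    conn.execute("DELETE FROM uploadstash WHERE us_timestamp < ?", (cutoff,))
    conn.commit()

    # Step 2: old files on disk that no remaining row refers to.
    known = {row[0] for row in conn.execute("SELECT us_path FROM uploadstash")}
    orphaned = []
    for name in os.listdir(stash_dir):
        path = os.path.join(stash_dir, name)
        if (os.path.isfile(path) and path not in known
                and os.stat(path).st_ctime < cutoff):
            os.remove(path)
            orphaned.append(path)
    return expired, orphaned
```

The `now` parameter exists only so the sweep can be exercised with a fixed clock; a cronjob would leave it unset. Both steps use the same age threshold for simplicity, although the list above allows them to differ.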

Recommendation for ops: none, other than cronjobs already mentioned