Extension:SwiftMedia

OpenStack (http://OpenStack.org) has an object store system called Swift. This code allows you to use a Swift repository to store MediaWiki media files. There are two parts to this code. The first is middleware for Swift's proxy server, which converts MediaWiki image URLs into the URL format needed by Swift. The second is an extension to MediaWiki.

Swift middleware
Swift hands its files out to users via a proxy. You can actually access the cluster directly, but you need to know as much about the system as the proxy knows, so unless you want to go to that effort, you should use the proxy, and we do. The proxy requires a URL with three parts: an account name, a container name, and an object name. The account name is a function of the authentication system, and is a long hex string; effectively a UUID. The container is simply an opaque string which doesn't have slashes. The object name may have anything in it.

Our media store URLs, on the other hand, start with the name of the host (or possibly a separate host), the string 'images' (by default), possibly several hashed subdirectory levels, and the name of the object. In the case of Wikipedia, the host is 'upload.wikimedia.org', followed by 'wikipedia/commons' instead of 'images', followed by two levels of hashing, and the name of the file. Thumbnails, archived files, and deleted files have a prefix on the hashing. These URLs have been published, and people will link directly to them. We have decided to preserve these links, hence the middleware to rewrite the URLs.
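As an illustration of the two levels of hashing, here is a minimal Python 3 sketch. It assumes the usual MediaWiki scheme, in which the directory levels are leading hex digits of the MD5 of the file name with spaces replaced by underscores. The function names hash_path and media_path are made up for this example, and percent-encoding of unusual file names is left out.

```python
import hashlib


def hash_path(filename, levels=2):
    """Return the hashed subdirectory prefix, e.g. 'x/xy/' for two levels."""
    # Assumption: names are stored with underscores in place of spaces.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Level i contributes the first i hex digits of the digest.
    return "".join(digest[:i] + "/" for i in range(1, levels + 1))


def media_path(base, filename, levels=2):
    """Build the path part of a media URL: /<base>/<hash levels>/<name>."""
    name = filename.replace(" ", "_")
    return "/%s/%s%s" % (base, hash_path(name, levels), name)


if __name__ == "__main__":
    print(media_path("wikipedia/commons", "Foo.jpg"))
```

With levels set to 0 the prefix is empty, which corresponds to a wiki that does not use hashed upload directories.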

The middleware inserts the account name into the URL, converts the "wikipedia/commons" section into a Swift container name by replacing slash with %2F, adds "%2Fthumb" or "%2Farchived" or "%2Fdeleted" to the container name and adds the rest of the hashing and filename as the object name. Swift doesn't need the hashing since it does its own hashing; it can take or leave our hashing. For backwards compatibility and ease of finding files, we leave it there. Once the URL has been rewritten, it gets handed to the remainder of the Swift proxy, which then hands the file back.
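The rewrite described above can be sketched in a few lines of Python 3. This is not the code in wmf/rewrite.py, only an illustration under stated assumptions: the zone segments in the incoming path are taken to be spelled exactly like the container suffixes named above (thumb, archived, deleted), the site part is assumed to be exactly two path segments, and the output uses Swift's /v1/account/container/object layout. The names ZONES and rewrite_path are hypothetical.

```python
import re

# Assumption: zone names in the URL match the container suffixes in the text.
ZONES = ("thumb", "archived", "deleted")

_PATH_RE = re.compile(
    r"^/(?P<site>[^/]+)/(?P<proj>[^/]+)/(?:(?P<zone>%s)/)?(?P<obj>.+)$"
    % "|".join(ZONES)
)


def rewrite_path(path, account):
    """Map a media path onto /v1/<account>/<container>/<object>, or None."""
    m = _PATH_RE.match(path)
    if m is None:
        return None
    # "wikipedia/commons" becomes the container "wikipedia%2Fcommons".
    container = "%s%%2F%s" % (m.group("site"), m.group("proj"))
    if m.group("zone"):
        container += "%2F" + m.group("zone")
    # The hashing and the file name are kept as the object name.
    return "/v1/%s/%s/%s" % (account, container, m.group("obj"))


if __name__ == "__main__":
    print(rewrite_path("/wikipedia/commons/a/a9/Example.jpg", "AUTH_example"))
```

Because the hash directories consist of hex digits only, they cannot be mistaken for a zone name.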

So yes, Swift's proxy is serving up image files to our caching front-ends. Usually a token is needed to access files, but we've marked some containers as "public", meaning that no token is needed.

404 handler
The middleware intercepts the return value from Swift, and looks at the result. If it's a 404 error, the 404 handler is invoked. Currently it contacts the existing thumbnail server, fetches the file, and writes it into Swift. Effectively it causes Swift to act as a cache.
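The general shape of such a handler, as a Python 3 WSGI sketch, might look like the following. It is not the production middleware: the class name NotFoundFallback and the two callables fetch and store are hypothetical stand-ins for "ask the thumbnail server" and "write into Swift", and the whole fallback body is held in memory for simplicity.

```python
from itertools import chain


class NotFoundFallback:
    """WSGI middleware that answers upstream 404s from a fallback source."""

    def __init__(self, app, fetch, store=None):
        self.app = app
        self.fetch = fetch  # hypothetical: path -> (content_type, bytes) or None
        self.store = store  # hypothetical: (path, bytes) -> None; None disables write-back

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            # Hold the upstream status back until we have looked at it.
            captured["status"] = status
            captured["headers"] = headers
            return lambda data: None  # legacy write() callable, unused here

        result = self.app(environ, capture)
        body = iter(result)
        # Pull one chunk so that generator-style apps call start_response.
        first = next(body, b"")
        if not captured["status"].startswith("404"):
            start_response(captured["status"], captured["headers"])
            # Sketch only: close() of the upstream iterable is not forwarded.
            return chain([first], body)

        if hasattr(result, "close"):
            result.close()
        path = environ.get("PATH_INFO", "")
        found = self.fetch(path)
        if found is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found\n"]
        content_type, data = found
        if self.store is not None:
            self.store(path, data)  # this is what makes the store act as a cache
        start_response("200 OK", [("Content-Type", content_type),
                                  ("Content-Length", str(len(data)))])
        return [data]
```

Passing store=None gives the "fetch but do not write back" behaviour, so the write-back decision stays a configuration matter rather than a code change.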

The production version of Swift will need to do something similar, because we allow people to defer creation of a thumbnail until it's time to fetch it from the media server. If a thumbnail isn't in Swift, we request the file from the scaler cluster, which renders it and hands it back. There are two ways we could put this new thumbnail file back into Swift. The existing 404 handler in the middleware will write it into Swift. Or the existing thumbnail creation code in MW will write it back into the filerepo. Right now, we don't have sufficient information to decide which is better. Hence the creation of this list of characteristics to see if I can create a decision point. We must make a decision because both systems will be attempting to write the thumbnail into the same location.

Just a word of explanation. The machines in the scaler cluster are not particularly special. Each is a MediaWiki install using Wikimedia extensions including SwiftMedia. It may also have specially-compiled packages running on it, or a later version of Linux, or more memory, or even a DSP hardware assist (I wouldn't rule it out). The point is that the machine is more capable of scaling files, but that's the only thing special about it. It scales images when it receives a request for a thumb that doesn't exist: the request generates a 404 call to thumb-handler, which causes the image to be scaled.

The only magic is that the front-end MW servers are configured to forward 404 requests to the scaler cluster, whereas the scaler cluster's 404 handler is configured to render.

Middleware writes

 * The existing code already writes out the image that gets returned to it.
 * This may not be a good thing, because the scaler's existing code also writes it out.
 * The code is a bit of a hack, because instead of simply streaming back the result from the scaler (which is easy) it has to intercept that stream and make two copies. That's a foreign concept to WSGI and so took a bit of code.
 * It's new code. There may be reliability and/or security flaws in it.
 * If we take it out, we have to fill in any holes, patch, and sand it.
 * It's the trickiest part of the middleware because of the need to read from one stream and write to two streams. Pulling it would be pulling the least trustworthy part.
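For readers unfamiliar with the "read from one stream, write to two" problem, one way to express it in Python 3 is a generator that hands each chunk to a second consumer before yielding it to the client. This is a simplified sketch and not the middleware's actual code; tee_chunks and sink_write are hypothetical names.

```python
import io


def tee_chunks(source, sink_write):
    """Yield each chunk from source after also passing it to sink_write."""
    for chunk in source:
        sink_write(chunk)  # second copy, e.g. an upload in progress
        yield chunk        # first copy, streamed on to the client


if __name__ == "__main__":
    sink = io.BytesIO()
    served = b"".join(tee_chunks(iter([b"thumb", b"nail"]), sink.write))
    print(served, sink.getvalue())
```

The sketch ignores the hard parts: what to do when the second write fails or stalls while the client is still reading, and how to finish the upload when the client disconnects early.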

Scaler writes

 * The standard MW scaling process will write out the file to the filerepo. Since we have to *fetch* via Swift, we'll also be able to write out via Swift.
 * This is well-tested code (at least, the MW part of it -- but the SwiftMedia extension passes its unit tests).
 * If there are any security problems, the security problems are everywhere and there are more people likely to find and/or fix them.

client.py mods
As of version 1.2, Swift's client.py expects you either to have the entire object in memory or to supply an object with a 'read' method, which it calls to fetch data. I expect that later versions have removed this limitation. I have also coded around it, perhaps incompatibly, by writing a new class called Put_object_chunked. Unfortunately, the standard put_object doesn't work if you have your own loop which generates the data, or if you're writing the data from an iterator (think "writing the same data to two locations").
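To make the mismatch concrete: a function that wants to call 'read' cannot consume an iterator directly. The Python 3 sketch below shows a different technique from Put_object_chunked, namely a small adapter that presents an iterator of byte chunks as an object with a read method. The class name IterReader is hypothetical, and whether client.py 1.2 would accept such an object has not been verified here.

```python
class IterReader:
    """File-like wrapper exposing read(size) over an iterator of byte chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b""

    def read(self, size=-1):
        if size is None or size < 0:
            # Drain everything that is left.
            data = self._buf + b"".join(self._chunks)
            self._buf = b""
            return data
        # Pull chunks until the buffer can satisfy the request or input ends.
        while len(self._buf) < size:
            chunk = next(self._chunks, None)
            if chunk is None:
                break
            self._buf += chunk
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


if __name__ == "__main__":
    reader = IterReader([b"abc", b"de", b"f"])
    print(reader.read(2), reader.read(3), reader.read(10))
```

An adapter like this only helps when the data comes from an iterator that can be pulled on demand. It does not help when your own loop pushes the data, which is the case a chunked put class is meant for.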

We only need this because of the need to fetch files from the thumbnail server and write them into the object store AND return them back to the client. It's quite possible that we can use standard functions solely if we drop the need to write into the object store. Current design specifications call for leaving the code in place, but disabling it with a run-time option. We may find ourselves wanting to review that decision.

MediaWiki Extension
Swift provides no access to a filesystem; it is an object server, not a file server. In order to allow our media handlers to do their work, the extension pulls files in from Swift, runs the media handler, and writes the resulting file out to the object store in the appropriate location. When a file is uploaded, rather than being stored in the filesystem, it gets uploaded as a Swift object.
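The pull, transform, and write-back cycle can be illustrated with a short Python 3 sketch. The extension itself is PHP, so this only shows the pattern: a dict-like mapping stands in for a Swift container, and transform_object and handler are hypothetical names.

```python
import os
import tempfile


def transform_object(store, src_name, dst_name, handler):
    """Copy an object to local disk, run a file-based handler, store the result."""
    data = store[src_name]  # pull the object out of the store
    with tempfile.TemporaryDirectory() as workdir:
        src_path = os.path.join(workdir, "src")
        dst_path = os.path.join(workdir, "dst")
        with open(src_path, "wb") as f:
            f.write(data)
        # The handler only ever sees ordinary files, as a media handler expects.
        handler(src_path, dst_path)
        with open(dst_path, "rb") as f:
            store[dst_name] = f.read()  # write the result back as a new object
    # The temporary copies are removed when the with block exits.
```

The price of this approach is an extra copy in each direction for every transformation, which is one reason deferred thumbnail creation matters.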

Several configuration variables are needed for LocalSettings.php in the $wgLocalFileRepo array.

 // You must let MW know that the class of the repo is SwiftRepo:
 'class' => 'SwiftRepo',
 // MUST be 'local':
 'name' => 'local',
 // Your swift username.
 'user' => 'system:media',
 // Your swift password
 'key' => 'secret',
 // A URL pointing to a proxy which is also running the auth server.
 'authurl' => 'http://alsted.wikimedia.org/auth/v1.0',
 // This wiki's base container name. Must not contain a forward slash.
 // Other container names will be generated by appending %2Fthumb, %2Ftemp, and %2Fdeleted.
 'container' => 'images%2Fswift',
 // The URL pointing to scripts.
 'scriptDirUrl' => $wgScriptPath,
 'scriptExtension' => $wgScriptExtension,
 // The URL containing the container name, so "http://alsted.wikimedia.org/images/swift"
 'url' => $wgUploadBaseUrl ? $wgUploadBaseUrl . $wgUploadPath : $wgUploadPath,
 'hashLevels' => $wgHashedUploadDirectory ? 2 : 0,
 'transformVia404' => !$wgGenerateThumbnailOnParse,
 'deletedHashLevels' => 3

MediaWiki install
Install php-cloudfiles/ somewhere where PHP looks for files. Currently, it's in /usr/share/php-cloudfiles on ersch. Php-cloudfiles requires php5-curl (and its dependencies). Restart apache2 after installing php5-curl.

Install the SwiftMedia extension into MediaWiki in the usual manner.

Swift install
Install Swift according to their instructions. Follow their recommendations for options.

Put the contents of wmf somewhere where Python looks for files. Currently, it's in /usr/local/lib/python2.6/dist-packages/wmf/ on alsted.

Add this section to /etc/swift/proxy-server.conf:

 [filter:rewrite]
 # the auth system turns our login and key into an account / token pair.
 # the account remains valid forever, but the token times out.
 account = your_account_here
 url = http://127.0.0.1/auth/v1.0
 login = yourloginhere
 key = yourpasswordhere
 # the name of the scaler cluster.
 thumbhost = yourscalerhere
 # upload doesn't like our User-agent (Python-urllib/2.6), otherwise we could call it using urllib2.urlopen
 user_agent = Mozilla/5.0
 paste.filter_factory = wmf.rewrite:filter_factory
 # uncomment this if we want to write the 404 handler's output into Swift.
 #writethumb=yes

Also in /etc/swift/proxy-server.conf, modify [pipeline:main] so that it starts with rewrite, like this:

 pipeline = rewrite healthcheck cache swauth proxy-server

Use Cases
Since the plan is to switch Wikipedia over to this media storage system, we're trying to be as conservative and not-break-it as possible. If you are in the habit of manipulating files on your MediaWiki, or on Wikipedia itself, could you take a few minutes to document your particular combination of operations? Obviously, we've got test cases for "upload a file", "delete a file", "upload another file", and "revert to an older file". Those are the simple things to test. We're looking for your "idioms" or "use cases", where you do things we don't expect. Please add four tildes to the end of your description in case we need clarification.

Your help is appreciated. I'll prime the pump with two entries:
 * I will adjust the brightness of an image if it doesn't look good with the other images on a page, and then upload the edited file under the same name. RussNelson 01:15, 10 August 2011 (UTC)
 * Sometimes upload one version of an image, decide I don't like it, upload another version, decide that I don't like that, change my mind and revert back to the original image. RussNelson 01:15, 10 August 2011 (UTC)
 * And then deletes the old version. -- Bryan ( talk|commons ) 13:16, 12 August 2011 (UTC)
 * The only other thing, although more of a MediaWiki side of things, is protecting files from being (re)uploaded or touched (eg: reverted). KPeachey
 * Sometimes we use a file to keep track of something that continuously changes, such as a chapter map or organization chart. We upload many different versions under the same file name over a long period of time.  Cbrown1023  talk  01:44, 10 August 2011 (UTC)
 * Embed a file using [[File:Foo.jpg]] syntax
 * Link to a file:
 * Using File:Foo.jpg
 * As well as [[Media:Foo.jpg]]
 * Also make sure {{filepath}} works
 *   tags Happy ‑ melon 09:57, 10 August 2011 (UTC)
 * Force rethumbnailing by purging the file page. -- Bryan ( talk|commons ) 13:16, 12 August 2011 (UTC)
 * Undeleting one or several old versions. Platonides 19:23, 18 October 2011 (UTC)
 * Delete thumbs so they are recreated, and only the presently necessary sizes are recreated (many more may have been needed before). Vigilius 23:27, 22 October 2011 (UTC)
 * Flush the metadata cache inside mediawiki, because some kind soul has written a much better media-metadata extraction toolkit. Force all metadata to be newly extracted from the media. Vigilius 23:27, 22 October 2011 (UTC)