User:HighInBC/Integration with S3

The purpose of this article is to describe how to migrate your existing mediawiki setup so that it serves its files from the Amazon Simple Storage Service (called S3 from here on out).

This tutorial assumes that your mediawiki is installed on a Debian Lenny server and that you have root shell access to that server. The ideas put forth here should work fine on any Linux/Apache based mediawiki setup, though the details may be different. If you do not have root shell access to your server then this technique may not be workable for you.

Unless otherwise stated, all of these commands are run as root.

I use the editor "joe" in this tutorial, but you can use any text editor you prefer. If you don't have joe and want to use it, just run:

apt-get install joe

Step 1

 * Create an Amazon AWS account and enable S3

Go to the Amazon S3 sign up page and click "Sign up for Amazon S3". Follow the instructions. S3 is a pay service and does require a credit card, and it may take a day or so for them to activate the account. This setup only needs to be done once.

Step 2

 * Get your access key and your secret key

Once your S3 account is set up, log in and under the "Your Account" menu select "Security Credentials". On that page you should see a table showing your current "Access Key ID" and "Secret Access Key". You will need to click the "Show" button to reveal your secret key. Record both of these keys in a text editor to use later, and be sure to keep track of which one is which.

These are the keys that allow you to access your file storage area. Keep them secret, or people can abuse your account and create charges that you will be responsible for.
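If you would rather not leave the keys sitting in an open editor window, one option is to store them in a file that only root can read. The following is a minimal sketch; the function name and the default path /root/s3-keys.txt are hypothetical choices made for this example, not something the S3 tools require.

```shell
# Save the two keys to a file only the current user (root here) can read.
# The default path is a hypothetical name chosen for this sketch.
save_s3_keys() {
    access_key="$1"
    secret_key="$2"
    key_file="${3:-/root/s3-keys.txt}"
    # The subshell keeps the restrictive umask from leaking into the caller.
    (
        umask 077
        printf 'access_key=%s\nsecret_key=%s\n' "$access_key" "$secret_key" > "$key_file"
    )
    # Tighten permissions in case the file already existed with looser ones.
    chmod 600 "$key_file"
}

# Usage (placeholder values):
#   save_s3_keys YOUR_ACCESS_KEY_ID YOUR_SECRET_ACCESS_KEY
```

Labelling each value in the file also takes care of keeping track of which key is which.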

Step 3

 * Install tools and create a bucket

As root on your mediawiki server do the following:

apt-get update
apt-get install s3cmd

Now configure the tool with:

s3cmd --configure

Give it your access key and secret key when it asks you for them. Just hit enter for the other questions to use the default settings.

Now that you can access your s3 account with the s3cmd tool you can create a bucket:

s3cmd mb s3://static.mydomain.com

Where "mydomain.com" is a domain that you have DNS control over.

Set the CNAME of static.mydomain.com to:

static.mydomain.com.s3.amazonaws.com.

Notice the "." at the end. If you do not know how to create a subdomain and set its CNAME record then you need to ask your hosting provider or just google for the information as this goes beyond the scope of this article.
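Once the record has had time to propagate, the CNAME can be checked from the shell. This is a rough sketch that assumes the dig tool is available (on Debian it comes from the dnsutils package); the function names are made up for this example.

```shell
# Expected CNAME target for a bucket named after the subdomain.
# The trailing dot matches what the DNS record should contain.
expected_cname() {
    printf '%s.s3.amazonaws.com.\n' "$1"
}

# Compare the live DNS answer with the expected target.
# Assumes dig is installed and the server has working DNS resolution.
check_cname() {
    bucket="$1"
    actual="$(dig +short CNAME "$bucket")"
    if [ "$actual" = "$(expected_cname "$bucket")" ]; then
        echo "CNAME OK: $bucket -> $actual"
    else
        echo "CNAME mismatch for $bucket: got '$actual'" >&2
        return 1
    fi
}

# Usage:
#   check_cname static.mydomain.com
```

A mismatch right after creating the record may only mean that the change has not propagated yet.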

Step 4

 * Set up and install s3fs-fuse

Make sure fuse-utils is installed:

apt-get install fuse-utils

Go to the S3FS download page and download the latest version of s3fs to your mediawiki server.

Unpack the archive (the name of the file may be different):

tar -xvf s3fs-r177-source.tar.gz

Enter the new path:

cd s3fs

Make sure you have the needed tools to build s3fs:

sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev

Then run make:

make

Now install it:

make install

Step 5

 * Create an S3 linked folder and migrate existing data to it.

First stop your apache server so mediawiki does not try to change the path while you are working on it:

/etc/init.d/apache2 stop

Now move to the mediawiki path:

cd /var/lib/mediawiki

Move the old images folder out of the way:

mv images images.bak

Now create a path for the s3fs cache and a path to link with s3fs:

mkdir /s3fscache /var/lib/mediawiki/images

I had permission issues so I opened them up on the folder:

chown www-data:www-data /var/lib/mediawiki/images
chmod a+rwx /var/lib/mediawiki/images

Now edit your fstab file:

joe /etc/fstab

Add the following line to the bottom of fstab, making sure you replace the placeholders with your access and secret key:

s3fs#static.mydomain.com /var/lib/mediawiki/images fuse allow_other,default_acl=public-read,use_cache=/s3fscache,retries=5,accessKeyId=YOUR_ACCESS_KEY_ID,secretAccessKey=YOUR_SECRET_ACCESS_KEY 0 0

Make sure there is a newline after this line in fstab. Now try it out:

mount -a

and look at the free space for your images folder:

df -h /var/lib/mediawiki/images

You should see something like:

Filesystem           Size  Used Avail Use% Mounted on
s3fs                  256T     0  256T   0% /var/lib/mediawiki/images

Filesystem "s3fs", available space 256 terabytes, not bad. If you see something else then something has probably gone wrong.
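The same check can be scripted, which is handy as a guard before copying any data: if the mount failed, the images folder is just an empty local directory and a copy would quietly fill the local disk instead of S3. A minimal sketch, with function names invented for this example:

```shell
# Print the filesystem name that df reports for a path
# (first column of the second line of POSIX-format output).
fs_name_for() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeed only if the path is backed by a filesystem named "s3fs",
# as in the df output shown in this tutorial.
is_s3fs_mount() {
    [ "$(fs_name_for "$1")" = "s3fs" ]
}

# Usage:
#   is_s3fs_mount /var/lib/mediawiki/images || echo "images is NOT on s3fs"
```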

Now you need to migrate your old image data into the S3 bucket:

cp --preserve=all -u -r -v /var/lib/mediawiki/images.bak/* /var/lib/mediawiki/images/

If you have a lot of images, or large images, this can take a while. Once this is done copying, your images should be publicly hosted on S3.
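Before relying on the copy, it is worth confirming that nothing was skipped. The sketch below only compares the number of regular files in the two trees, so it is a quick sanity check rather than a byte-for-byte comparison; the function names are made up for this example, and file names containing newlines would throw the count off.

```shell
# Count regular files below a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Compare file counts between the backup tree and the S3-backed tree.
verify_copy() {
    src_count="$(count_files "$1")"
    dst_count="$(count_files "$2")"
    if [ "$src_count" -eq "$dst_count" ]; then
        echo "OK: $src_count files in both trees"
    else
        echo "Mismatch: $src_count source files, $dst_count copied" >&2
        return 1
    fi
}

# Usage:
#   verify_copy /var/lib/mediawiki/images.bak /var/lib/mediawiki/images
```

Walking the S3-backed tree can be slow on a large wiki, since every directory listing goes through s3fs.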

Step 6

 * Tell apache to rewrite image urls

Enable mod_rewrite for apache:

ln -s /etc/apache2/mods-available/rewrite.load /etc/apache2/mods-enabled/

Then open mediawiki's apache.conf file:

joe /etc/mediawiki/apache.conf

Inside of the stanza that starts with "  " add this new rewrite rule:

Step 7

 * A bug

I don't know if this is a mediawiki bug or an s3fs bug, but when I first tried this it worked great except when creating small images. When I rendered an image small it would say "Thumbnail creation error: ". I added the following line to my LocalSettings.php file:

$wgDebugLogFile = "/tmp/wiki.log";

I looked at the debug output and found the line that was complaining about the failure. It said: "Removing bad 0-byte thumbnail". I ran the same imagemagick command that mediawiki showed in the log file and I saw that it worked just fine. I searched the code for the phrase "Removing bad" and found this: It appears that after it created the thumbnail it looked at it and saw it was 0 bytes long. I am assuming this is caused by the latency that s3fs can experience. I do have the local cache option on, so I don't know why s3fs would show it as 0 bytes. This does not happen on larger files.
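To find out whether your own setup hits the same problem, you can count how often that message shows up in the debug log. A small sketch, assuming the log path configured above; the function name is invented for this example.

```shell
# Count how many times the 0-byte thumbnail message appears in the debug log.
count_zero_byte_thumbs() {
    log_file="${1:-/tmp/wiki.log}"
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps that from aborting scripts that run with "set -e".
    grep -c 'Removing bad 0-byte thumbnail' "$log_file" || true
}

# Usage:
#   count_zero_byte_thumbs            # reads /tmp/wiki.log
#   count_zero_byte_thumbs /path/to/other.log
```

A count that grows each time you request a small thumbnail suggests you are seeing the behaviour described here.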

My solution was to bypass the subroutine altogether. The result of this could be that a real 0 byte file is not deleted, I can live with that.

I really wanted to implement this s3 integration without performing code changes to mediawiki itself, but to make it reliable I had to.

Just find the file "/var/lib/mediawiki/includes/media/Generic.php" and replace the above section with (in my copy it starts on line 260):

Step 8

 * Restart apache and test

Start apache:

/etc/init.d/apache2 start

Look at your wiki, you should be able to see all of the images you had before. You should be able to create thumbnails from those images in different sizes. You should be able to upload new images and render them at different sizes. Test making a small image, about 25px.

Now right click on an image and click "properties". Look at the image url; it should be pointing at your mediawiki server. Now put that url in a new window and load it. You should see the url switch to your s3 domain and the image should be speedily presented to you. Your mediawiki server's apache logs should record the request.
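The redirect can also be checked from the command line instead of a browser. This sketch assumes curl is installed; the function names and the image url in the usage line are hypothetical.

```shell
# Extract the value of the Location header from HTTP response
# headers supplied on standard input.
location_header() {
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Request only the headers for a url and print where it redirects to.
# Prints nothing if the server did not answer with a redirect.
redirect_target() {
    curl -s -I "$1" | location_header
}

# Usage (hypothetical image url on the wiki server):
#   redirect_target http://wiki.mydomain.com/images/a/ab/Example.png
```

If the rewrite rule is working, the printed url should be on your static S3 subdomain rather than on the wiki server.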

If everything works well and you feel it is safe, you may now delete the "images.bak" folder to recover your disk space.

Advantages

 * Unlimited media storage
 * High speed reliable hosting for images
 * Transparent to the mediawiki software
 * Your server still logs image requests, but does not actually serve the images

Disadvantages

 * When uploading you must send the file to the mediawiki server and then the server sends it to S3, instead of the user sending to S3 directly.
 * Some latency may be seen with uploads.
 * Some latency will be seen when creating a new size thumbnail.
 * If the size of the images stored exceeds available local storage then some files will not be cached locally. This means that new thumbnail creation involves re-downloading the original file.

Room for improvement

 * Local cache limits

If you truly want to store more media than you have available space then some sort of tool needs to be made to remove files from the /s3fscache path when disk space becomes low. As it is, the cache will hold a copy of every file that is sent to S3 and increase in size until the disk space is used. Any file in the local cache can be safely deleted at any time; if s3fs needs the file (for the creation of a new sized thumbnail, for example) it will download it from S3 again.

If you create a good solution for this please let me know.
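One possible starting point is a small script run from cron that deletes the least recently accessed cache files until the cache fits under a size limit. This is an untested-in-production sketch, not a finished tool: the function name and the limit are made up for this example, it relies on GNU find's -printf (standard on Debian), it assumes access times are being recorded on the cache filesystem, and it does not cope with file names containing newlines.

```shell
# Delete least-recently-accessed files under cache_dir until the total
# size of the remaining files is at or below max_bytes.
prune_cache() {
    cache_dir="$1"
    max_bytes="$2"
    # Total size in bytes of all regular files currently in the cache.
    total="$(find "$cache_dir" -type f -printf '%s\n' |
        awk '{ s += $1 } END { printf "%.0f\n", s }')"
    # One line per file: "access-time size path", oldest access first.
    find "$cache_dir" -type f -printf '%A@ %s %p\n' | sort -n |
    while read -r _atime size path; do
        # Stop as soon as the cache fits under the limit.
        [ "$total" -le "$max_bytes" ] && break
        rm -f -- "$path"
        total=$((total - size))
    done
}

# Usage: keep the cache under roughly 10 GB.
#   prune_cache /s3fscache 10000000000
```

Because any cached file can be fetched from S3 again, deleting too much only costs a re-download.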


 * Direct upload

With this current setup, when one uploads a file to the mediawiki it will go to the mediawiki server, which will stream the contents of the file to the S3 bucket. When the transfer is complete the file is in S3, but it does use your server's bandwidth.

It is possible to craft a special upload form that points to the S3 bucket directly and contains a signature. This signature authorises a user to send a file with a very specific file name, and under a maximum size directly to your S3 bucket. The mediawiki server would then see the file in its s3fs folder and can work on creating thumbnails.

The mediawiki server would still need to download the image from S3FS at least once for the creation of thumbnails, but it would not have to receive it once and then send it.

I have not delved into this as it almost certainly involves significantly modifying the mediawiki source code. Specifically, it is essential that the file name is known at the time that the form is created, so it may have to be a 2 stage process instead of the current 1 stage process.

If you create a good solution for this please let me know.
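As a rough illustration of the signature part only, the sketch below builds a policy that allows one exact file name up to a maximum size, then signs it with the older HMAC-SHA1 scheme for S3 browser uploads (the signature is the Base64 of HMAC-SHA1 over the Base64-encoded policy). It assumes openssl and base64 are installed; the function names and sample values are hypothetical. Amazon has since introduced Signature Version 4, which newer regions require, so treat this as a sketch of the idea rather than something to deploy as is. Nothing here touches mediawiki itself.

```shell
# Base64-encode standard input onto a single line.
b64() {
    base64 | tr -d '\n'
}

# HMAC-SHA1 of a message ($2) with a key ($1), Base64-encoded.
# Note: the key is visible in the process list while openssl runs.
hmac_sha1_b64() {
    printf '%s' "$2" | openssl dgst -sha1 -hmac "$1" -binary | b64
}

# Build a policy document allowing exactly one key (file name),
# public-read access, and uploads of at most max_bytes.
make_policy() {
    bucket="$1"; key="$2"; max_bytes="$3"; expires="$4"
    printf '{"expiration":"%s","conditions":[{"bucket":"%s"},{"key":"%s"},{"acl":"public-read"},["content-length-range",0,%s]]}' \
        "$expires" "$bucket" "$key" "$max_bytes"
}

# Sign a policy document ($2) with the secret key ($1).
sign_policy() {
    policy_b64="$(printf '%s' "$2" | b64)"
    hmac_sha1_b64 "$1" "$policy_b64"
}

# Usage (placeholder values):
#   policy="$(make_policy static.mydomain.com a/ab/Example.png 1048576 2030-01-01T00:00:00.000Z)"
#   printf '%s' "$policy" | b64                       # value for the form's "policy" field
#   sign_policy YOUR_SECRET_ACCESS_KEY "$policy"      # value for the form's "signature" field
```

The need to put the exact key into the policy is the reason the file name has to be known before the form is generated.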