1.6 image storage

From mediawiki.org

Old scheme

Directory: $wgUploadDirectory/1/12/File_name.jpg

URL: $wgUploadPath/1/12/File_name.jpg

  • Current files have a privileged location and get replaced on modification
  • MD5 hash of the name used in directory generation
  • Files on disk named for their in-wiki name
  • URL maps directly to the filesystem
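As a rough illustration of the layout above, the directory levels can be derived from the MD5 of the file name (first hex digit, then first two hex digits). This is a minimal sketch, not MediaWiki's actual code; `old_scheme_path` and the `/srv/upload` default standing in for $wgUploadDirectory are hypothetical.

```python
import hashlib


def old_scheme_path(name: str, upload_dir: str = "/srv/upload") -> str:
    """Build <upload_dir>/<h[0]>/<h[0:2]>/<name> for the old scheme.

    upload_dir is a hypothetical stand-in for $wgUploadDirectory.
    """
    # The hash is taken over the in-wiki file name, not over the content,
    # so a changed file keeps the same path and overwrites the old one.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{upload_dir}/{digest[0]}/{digest[:2]}/{name}"


print(old_scheme_path("File_name.jpg"))
```

Because the path depends only on the name, two differently named copies of the same content are stored twice, which is the disk-space problem listed below.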

Problems:

  • Hard to mirror and cache
  • Filenames break on non-*nix servers
  • Duplicate files waste disk space

Proposed new

Directory: $wgUploadDirectory/1/2/3/456789abcdef

URL: $wgUploadPath/1/2/3/456789abcdef/File_name.jpg

  • Files will not move on modification
  • New files will get new URLs
  • Files that are the same content can share storage
  • Use a content hash (first 64 bits of MD5?) for identification
  • URL won't map directly to filesystem, requires intermediary script?
    • (If we include extensions on the files, we can use the direct-map URLs but they'll be ugly filenames if you try to save to disk.)
    • The script should just chop out the pretty part. This is easy to do fast: no modification is needed on smart webservers, and only a trivial modification or a CGI script elsewhere. There is some potential for silly behavior.
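The proposed mapping could look roughly like the sketch below: a content ID from the first 64 bits of MD5, a storage path split into three one-digit levels, and an intermediary step that chops the pretty name off the URL. All function names and the `/upload` and `/srv/upload` defaults (standing in for $wgUploadPath and $wgUploadDirectory) are assumptions made for illustration.

```python
import hashlib
import re

# Accept only the hashed part of the path, so "../" and similar are rejected.
_HASHED = re.compile(r"[0-9a-f]/[0-9a-f]/[0-9a-f]/[0-9a-f]+")


def content_id(data: bytes) -> str:
    # First 64 bits of the MD5 digest = first 16 hex digits.
    return hashlib.md5(data).hexdigest()[:16]


def storage_path(cid: str, upload_dir: str = "/srv/upload") -> str:
    # Three single-digit directory levels, then the remaining digits.
    return f"{upload_dir}/{cid[0]}/{cid[1]}/{cid[2]}/{cid[3:]}"


def url_for(cid: str, pretty_name: str, upload_path: str = "/upload") -> str:
    # Same layout as the storage path, plus the pretty name for saving to disk.
    return f"{upload_path}/{cid[0]}/{cid[1]}/{cid[2]}/{cid[3:]}/{pretty_name}"


def url_to_storage_path(url: str, upload_path: str = "/upload",
                        upload_dir: str = "/srv/upload") -> str:
    """Chop the pretty name off a URL and map the rest onto the filesystem."""
    prefix = upload_path + "/"
    if not url.startswith(prefix):
        raise ValueError("not an upload URL")
    hashed, _, _pretty = url[len(prefix):].rpartition("/")
    if not _HASHED.fullmatch(hashed):
        raise ValueError("malformed hashed path")
    return f"{upload_dir}/{hashed}"


cid = "123456789abcdef0"
print(url_to_storage_path(url_for(cid, "File_name.jpg")))
```

The pretty name is ignored on lookup, so any trailing name resolves to the same stored object; that is the "silly behavior" trade-off mentioned above.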

Problems:

  • URLs may be longer
    • Raw image URLs are not the correct locator for an image in any case, since they do not provide the required copyright information.
  • Hash collisions? (unlikely, and you just can't upload the second file as we already have the first stored)
    • For one million objects, 1 - e^(-1000000^2/(2*2^64)) = 2.7105e-08, which is sufficiently unlikely. However, if we have one object for each person on earth, a collision becomes fairly likely. When we cross 10 million objects we should move to a larger hash. Using the first 64 bits of a longer hash would make this very easy.
  • May be slower, using an intermediary script to make pretty filenames. (Or can we skip them?)
    • Leave them but only to facilitate file saving?
  • Changing extensions could be a problem
    • Store by content. Object can be loaded as any acceptable extension.
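The collision estimate in the list above is the usual birthday approximation, p ≈ 1 - e^(-n^2 / (2 * 2^bits)). A small sketch for checking it at different object counts follows; the function name and the world-population figure of 6.5 billion are assumptions made for this example.

```python
import math


def collision_probability(n_objects: int, hash_bits: int = 64) -> float:
    """Birthday approximation: 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = -(n_objects ** 2) / (2 * 2 ** hash_bits)
    # expm1 keeps precision when the exponent is very close to zero.
    return -math.expm1(exponent)


# 6_500_000_000 is an assumed "one object per person on earth" figure.
for n in (1_000_000, 10_000_000, 6_500_000_000):
    print(n, collision_probability(n), collision_probability(n, hash_bits=128))
```

For one million objects and 64 bits this reproduces the 2.7105e-08 figure quoted above.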

Thoughts on hash size:

  • MD5 is already used in MediaWiki; sticking to one hash would be good.
  • The security problems with MD5 won't impact our expected use, since the first upload wins.
  • There are security concerns with SHA-1 too. If we were really concerned with security we should use SHA-256, but portability becomes a pain.
  • Further problems with MD5, or later applications of the hash, might require changing it, but that's not too hard.
  • 64 bits is enough for our current need. Quite possibly not enough for our future need. If we truncate a hash of 128 bits or more to 64 bits, growing to a larger size later is easy.
  • A 64-bit type is probably more efficient in the database.
  • Need to weigh the cost of changing in the future (how far?) against longer URLs (do we really care?) and less efficient storage in the DB to decide between 64 and 128 (or 160) bits.
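One reason growing later is easy: a 64-bit ID cut from a longer hash is a prefix of the wider ID, so existing IDs stay valid as prefixes after the switch. A minimal sketch, using MD5 as the example 128-bit hash; `truncated_id` is a hypothetical helper.

```python
import hashlib


def truncated_id(data: bytes, bits: int = 64) -> str:
    """Hex ID made from the first `bits` bits of the MD5 digest.

    bits must be a multiple of 4 (one hex digit) and at most 128.
    """
    if bits % 4 or not 0 < bits <= 128:
        raise ValueError("bits must be a multiple of 4 in 4..128")
    return hashlib.md5(data).hexdigest()[: bits // 4]


short_id = truncated_id(b"example content", 64)
long_id = truncated_id(b"example content", 128)
# The old ID is a prefix of the new one, so old records can be widened
# by rehashing the stored content, and prefix lookups keep working.
print(long_id.startswith(short_id))
```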

New thumbs

Possibilities...

  • $wgUploadDirectory/1/2/3/456789abcdef-500px.jpg
  • $wgUploadDirectory/1/2/3/456789abcdef-thumb/500px.jpg
  • $wgUploadDirectory/thumb/1/2/3/456789abcdef-500px.jpg
    • Advantage: separate thumb dir makes it easy to copy/backup 'non-thumbs'
  • ????
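For comparison, the three candidate layouts listed above can be generated from the same content ID. This is only a sketch; the function name, the dictionary keys, and the `/srv/upload` default standing in for $wgUploadDirectory are made up for illustration.

```python
def thumb_candidates(cid: str, width: int, ext: str = "jpg",
                     upload_dir: str = "/srv/upload") -> dict:
    """Return the three candidate thumbnail paths for one content ID."""
    base = f"{cid[0]}/{cid[1]}/{cid[2]}/{cid[3:]}"
    name = f"{width}px.{ext}"
    return {
        # Thumb stored next to the original file.
        "alongside": f"{upload_dir}/{base}-{name}",
        # One thumb directory per original file.
        "per_file_dir": f"{upload_dir}/{base}-thumb/{name}",
        # Separate top-level tree, easy to exclude from backups.
        "separate_tree": f"{upload_dir}/thumb/{base}-{name}",
    }


for label, path in thumb_candidates("123456789abcdef", 500).items():
    print(label, path)
```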

Database tables

file

file: Refers to a particular instance of a file

  • file_hash
  • file_size
  • file_width (pixels, for images)
  • file_height (pixels, for images)
  • file_bits
  • file_playtime (milliseconds, for audio/video... possibly unused for now)
  • file_media_type
  • file_mime_major
  • file_mime_minor
  • file_metadata
  • file_refcount <- is something like this necessary?
  • <- what about a marker for public / deleted-archive storage state?

upload: Refers to an upload / file manipulation event

  • upload_id
  • upload_name_id (key to filename.fn_id) <- using an id here could make it easier to rename images
  • upload_hash (key to file_hash)
  • upload_timestamp
  • upload_user
  • upload_user_text
  • upload_description
  • upload_deleted <- should we have a deleted flag?

filename:

  • fn_id
  • fn_name (equiv to page_title in NS_IMAGE)
  • fn_latest (key to upload_id for convenient joins)
  • fn_deleted <- necessary to cleanly handle deletion/undeletions?

... needs some work. There probably needs to be a 'page'-equivalent with shorthand on the current version of a given filename.
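To make the relationships concrete, here is one possible rendering of the three tables, using SQLite through Python purely for illustration. The column types, constraints, and the `current_hash` helper are assumptions; the fields still marked as open questions above (file_refcount, a storage-state marker) are left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE file (
    file_hash       TEXT PRIMARY KEY,   -- truncated content hash
    file_size       INTEGER NOT NULL,
    file_width      INTEGER,            -- pixels, for images
    file_height     INTEGER,            -- pixels, for images
    file_bits       INTEGER,
    file_playtime   INTEGER,            -- milliseconds, for audio/video
    file_media_type TEXT,
    file_mime_major TEXT,
    file_mime_minor TEXT,
    file_metadata   TEXT
);
CREATE TABLE filename (
    fn_id      INTEGER PRIMARY KEY,
    fn_name    TEXT NOT NULL UNIQUE,    -- equivalent to page_title in NS_IMAGE
    fn_latest  INTEGER,                 -- key to upload.upload_id
    fn_deleted INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE upload (
    upload_id          INTEGER PRIMARY KEY,
    upload_name_id     INTEGER NOT NULL REFERENCES filename(fn_id),
    upload_hash        TEXT NOT NULL REFERENCES file(file_hash),
    upload_timestamp   TEXT NOT NULL,
    upload_user        INTEGER,
    upload_user_text   TEXT,
    upload_description TEXT,
    upload_deleted     INTEGER NOT NULL DEFAULT 0
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def current_hash(conn: sqlite3.Connection, name: str):
    """Look up the content hash of the latest upload for a file name."""
    row = conn.execute(
        "SELECT upload_hash FROM filename "
        "JOIN upload ON upload_id = fn_latest WHERE fn_name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Here fn_latest plays the role of the shorthand for the current version: finding the current file for a name takes a single join.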

Processing

Upload new file:

  1. Generate cropped content hash (eg 123456789abcdef0)
  2. Check for hash collisions in upload table
    • Collision? Already have this file; can discard the uploaded one.
    • Otherwise, move the file into place: $wgUploadDirectory/1/2/3/456789abcdef0
  3. Check the filename table for an existing record with the given name ('Puppy.jpg')
    • None? Insert a new null record for the filename
  4. Insert a new upload record for filename 'Puppy.jpg', file 123456789abcdef0
  5. Update the filename record to point at this latest upload
  6. Purge affected page caches

Revert file:

  1. Insert a new upload record referring to the prior file
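The upload and revert steps above can be sketched together with an in-memory stand-in for the three tables. The `Store` class and its dictionaries are hypothetical; real code would use the database and move files on disk, and the cache purge (step 6) is left out.

```python
import hashlib
import itertools


class Store:
    """In-memory stand-in for the file, filename and upload tables."""

    def __init__(self):
        self.files = {}      # file_hash -> content
        self.filenames = {}  # fn_name -> {"fn_latest": upload_id or None}
        self.uploads = {}    # upload_id -> {"name": ..., "hash": ...}
        self._ids = itertools.count(1)

    def _record_upload(self, name: str, file_hash: str) -> int:
        # Steps 4 and 5: insert an upload record, then point the
        # filename record at it.
        upload_id = next(self._ids)
        self.uploads[upload_id] = {"name": name, "hash": file_hash}
        self.filenames[name]["fn_latest"] = upload_id
        return upload_id

    def upload(self, name: str, data: bytes) -> int:
        # Step 1: truncated content hash (first 64 bits of MD5).
        file_hash = hashlib.md5(data).hexdigest()[:16]
        # Step 2: a matching hash is treated as "already have this file".
        if file_hash not in self.files:
            self.files[file_hash] = data
        # Step 3: create a null filename record if the name is new.
        self.filenames.setdefault(name, {"fn_latest": None})
        return self._record_upload(name, file_hash)

    def revert(self, name: str, to_upload_id: int) -> int:
        # A revert is just a new upload record referring to the prior file.
        prior = self.uploads[to_upload_id]
        if prior["name"] != name:
            raise ValueError("upload belongs to a different file name")
        return self._record_upload(name, prior["hash"])


store = Store()
first = store.upload("Puppy.jpg", b"version 1")
store.upload("Puppy.jpg", b"version 2")
store.revert("Puppy.jpg", first)
```

No content is copied on revert: the history grows by one record while the stored objects stay the same.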

Upgrading

It should be possible to upgrade files to the new system on the fly to minimize service disruption:

  • When an image can't be found in the upload table, check the leftover image table:
    • If found, then for each matching record in 'image' and 'oldimage':
      1. Checksum the file.
      2. Check for collisions. If none:
        1. Rename the file to its new location
        2. Leave a compatibility symlink behind
        3. Add a file record, copying data from image/oldimage
      3. Iterate over the thumbs subdirectory if it exists:
        1. Rename each file to its new location
        2. Leave a compatibility symlink behind
      4. Add an upload record
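The file-moving part of the procedure above (checksum, rename, compatibility symlink) might look like the following sketch. It assumes a POSIX filesystem with the old and new locations on the same volume, omits the database records and the concurrency guards, and leaves the old file untouched on a collision, since the procedure does not say what to do in that case. `migrate_file` is a hypothetical name.

```python
import hashlib
import os


def migrate_file(old_path: str, upload_dir: str) -> str:
    """Move one old-scheme file to its content-hash location.

    Returns the new path. No locking is done here; real code would need
    the concurrency guards the procedure calls for.
    """
    # 1. Checksum the file (first 64 bits of MD5 over the content).
    with open(old_path, "rb") as fh:
        cid = hashlib.md5(fh.read()).hexdigest()[:16]
    new_path = os.path.join(upload_dir, cid[0], cid[1], cid[2], cid[3:])

    # 2. Check for collisions; only move the file if nothing is stored yet.
    if not os.path.exists(new_path):
        os.makedirs(os.path.dirname(new_path), exist_ok=True)
        os.rename(old_path, new_path)      # rename to the new location
        os.symlink(new_path, old_path)     # leave a compatibility symlink
    return new_path
```

Running it twice on the same path is harmless in this sketch: the second call reads through the symlink, finds the object already stored, and changes nothing.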

There should be appropriate concurrency guards on the above.

A background process can be run to convert everything, to ensure that rarely- and unused images also get updated.

Future

One day we may move backend storage of upload files into an object server rather than using filesystem & NFS as we do now. The content-hash could fit well with that.