Requests for comment/Refactor on File-FileRepo-MediaHandler

Proposal
This is a requests for comment about refactoring File/FileRepo/MediaHandler classes. Please leave your comments below.

Ideas in progress -- not ready for implementation.

Background
MediaWiki's current uploaded file-handling system is broken roughly into three parts:
 * FileRepo -- manages storage/data management of files (local or remote)
 * File -- wraps management of a single file from a FileRepo
 * MediaHandler -- takes a file that is/will be managed by a File object and does manipulation that depends on knowing the file type

Files can be shared between sites using ForeignDBRepo (how our local sites use Commons, diving into its database, filesystem, and memcache directly) or ForeignAPIRepo (InstantCommons, how other sites use Commons, fetching data and thumbnails over MediaWiki API).

We're also in the process of developing an alternate Swift file storage backend, which has shown up some areas where File/FileRepo are poorly factored.

MediaHandler mixed high-level and low-level use
MediaHandler classes currently mix low-level logic with high-level logic:

Low-level:
 * extracting basic metadata (width/height/pages/length) from a file
 * extracting further metadata (exif/etc) from a file
 * performing thumbnailing, resizing operations

High-level:
 * determining whether it's possible to show thumbnail images (mustRender + canRender)
 * producing HTML output via MediaTransformOutput
 * formatting width/height/length/page count/etc into a human-readable message

This can clash; see for several examples.

Possible split lines
MediaHandler mostly feels like it should be doing low-level stuff -- the data extraction, thumbnail generation, etc. This should be stuff you can only do if you have the original file.

When dealing with remote files over API you can't do any of these things because the local file isn't present! Having explicit separation here also means that in 'fancy file repo' situations like Swift, we'll be less inclined to accidentally perform a low-level file manipulation (requiring checkout of the file to a temporary directory) when we don't really need to.

Generation of inline view/player HTML also happens in the handlers these days; but we can broadly chop that up into two kinds:
 * feeding a thumbnail URL through the standard bitmap handling, probably with no filetype-specific code
 * wildly alternate rendering such as OggHandler's chunks of JS and stuff to load various things

If the HTML bits for video etc do need to be tied to the MediaHandler, they should at least only be called for local files again -- for remote files we could be more portable by exporting an iframe source / width / height. This is conceptually identical to the thumbnail image case, and lets the client site not have to worry about being able to generate the same local HTML player on remote data it may not know anything about.

Remote TIFF thumbnails

 * 31282 InstantCommons tif thumbnails should work without $wgTiffThumbnailType set on local wiki

The stock TiffHandler, on a default configuration, disables thumbnailing of TIFF images unless configuration is tweaked. This sounds fine until you try to load what happens to be a TIFF image via InstantCommons -- the thumbnail is generated by Commons' server, so why does the local site even need to check whether its TiffHandler can create local TIFF thumbnails?

Worked around in r98812 on trunk.

Remote multipage TIFFs
Similar to the bug above, if the local handler doesn't understand features like multiple pages, this overrides whatever the original server supported.

The default TiffHandler will show the first page of this document no matter what you ask for:


 * Abundance - Le Testament de Carmentrant à VIII personnaiges.tif
 * Abundance - Le Testament de Carmentrant à VIII personnaiges.tif
 * Abundance - Le Testament de Carmentrant à VIII personnaiges.tif

Not yet worked around.

Remote video

 * 19043 Instant commons video can't be played by cortado

OggHandler (and I assume TimedMediaHandler!) jump through a lot of hoops to add video player logic to local output. When using the Cortado Java applet as the player base, it's restricted to showing files hosted on the same domain as the applet's .jar.

This prevents loading of both local and remote files -- you could perhaps load remote files by forcing the entire thing to use upload.wikimedia.org's copy of the .jar, but then you couldn't load any local files.

If instead the local handler didn't have to do anything, and we just grabbed an embedding iframe URL from Commons, we could perhaps short-circuit around the entire thing.

Not yet worked around.

Related issues

 * 25854 Expose image thumbs, embedded video players via oEmbed (API + discovery )
 * 29242 Extension to embed offsite media via oEmbed (video, photos, rich media)

I've got a couple feature reqs in for using [w:oEmbed oEmbed] protocol for media embedding discovery & export. oEmbed's photo mode is similar to exporting a thumbnail in ForeignAPIRepo -- the client requests a size, and the server returns a thumbnail URL that's meant to more or less fit.

oEmbed's video/rich media handling is just to pass a chunk of embeddable HTML and a width/height; if we can expose that as an embeddable iframe URL that should work well for InstantCommons-like scenarios.

Having the same photo or embeddable iframe passed through with the API info for ForeignAPIRepo should then let us handle a much wider array of file/media types without having to have explicit support in the client.

Comments
Sure, we can split presentation and backend operations in MediaHandler if that makes sense for InstantCommons.

There should definitely be a transformation interface which operates on files on the local filesystem and doesn't care about any remote stuff. The idea of making all paths potentially remote and requiring remote repo awareness in the transformation and metadata extraction code has occasionally been raised. I'm not at all keen on that idea since almost all transformations rely on shell apps which need the file to be locally accessible anyway. Keeping the creation and destruction of temporary copies in File/FileRepo makes MediaHandler subclasses easy to write. -- Tim Starling 05:04, 4 October 2011 (UTC)


 * *nod* I think keeping MediaHandler classes dedicated just to working with actual local files should be a good fit here, with a more generic interface taking over on the 'hand in some parameters, give back a MediaTransform' end. Then local files can run on through the MediaHandler to produce direct output, and remote files can grab a thumbnail or iframe wrapper from their upstream data source.


 * Perhaps the simplest way to start migrating is to introduce a 'RemoteMediaHandler' and have files moderated by ForeignAPIRepo use that instead of a filetype-based one. That should also show up which methods are generic-ish and that we might to move if doing further refactoring. --brion 18:43, 4 October 2011 (UTC)