Manual:File metadata handling

When a user uploads a file to the wiki, MediaWiki extracts metadata from that file (for some formats) and stores it. This metadata includes information embedded in the file, such as who the author of the file is, or what type of camera took a certain photo (e.g. Exif metadata). It can also include technical information needed to render the file, such as width and height. The text layer of a document format (like PDF or DjVu) can also be extracted, which is used by some extensions like ProofreadPage.

This page is meant to document how MediaWiki handles such metadata. Note that it does not cover MediaWiki-level metadata, such as which user uploaded the photo, whether a particular version is revision-deleted, etc. It is only about metadata that functionally depends on the file in question. (A consequence of this is that the metadata can be re-extracted at any time, which happens whenever a user does an ?action=purge of the image description page.)

Where metadata is stored
Metadata is stored in the img_metadata field of the image table (or oi_metadata/fa_metadata for the oldimage and filearchive tables). This field is a blob which should contain a serialized PHP array, using PHP's native serialization format (exception: DjVu does something different, which breaks things like InstantCommons).

This has the obvious downside of not being searchable at all. Storing all of it together also means you have to retrieve all of it at once. This can be a problem, as some formats like DjVu store the entire OCR contents of the file, together with the width and height of each page. The width and height are needed to render the thumbnail, but the text content isn't. This has led to OOM errors, due to deserializing so much unneeded info.

Some very important metadata properties (width, height, size (of file in bytes), etc) get their own field in the DB. It should be noted though that for multipage documents, where the width and height of each page can vary, the width and height tend to be stored in the img_metadata blob.

The last thing to note about this blob is that its structure is entirely dependent on the media handler class being used. You cannot simply look for a key named "Artist" to find who the author of the file is; you have to know where the media handler put that data. This makes it harder for third parties to retrieve such information, since the structure varies.

How metadata is obtained
If a media handler subclass supports extracting metadata, it overrides the getMetadata method. This method is passed the path to the file as its second argument, and allows the media handler to extract whatever information it deems interesting.

Human interest metadata
Broadly speaking, there are three types of metadata: technical (needed to render the file), human-interest, and text layer. The human interest type fields are interesting to people. They answer the Who, what, when, where, why and how of the file (e.g. Who created the file, when was the file made, what type of camera made the file, where (GPS) was the picture taken). For the purposes of this discussion, we're considering the technical context the file was taken in to be of human interest (Things like f-number, shutter speed, etc).

While some of these human interest fields vary depending on format (a PDF file is not likely to have an f-number), they are generally applicable across a wide range of file formats. Pretty much all formats support some notion of specifying author or copyright. You can even have things like f-number show up in unexpected places (for example, in a PNG file; in theory one could also have a PDF of a photo, and it is possible to include an f-number field in the XMP block of a PDF file).

Additionally XMP can be used in a wide variety of file formats to encode this type of data. Thus with all this in mind, MediaWiki centralizes the processing of human interest metadata to a common part of the code. (Specifically ).

Merging of metadata
In many image formats, there is more than one way to specify metadata (new formats replace old formats and so on). For example, in JPEG files, one can specify the author of the file in three different ways: via Exif, IPTC-IIM or XMP. Since this is all the same data, MediaWiki will try to merge the fields together, so that only one author is found. The method we use for doing this is set out in the Metadata Working Group Specification v1.01 (They've since come out with a new version. I haven't read the new version of their standard yet, so I'm not sure what the difference is). The tl;dr version is that native format metadata (e.g. Exif) takes priority over add-on metadata formats (XMP or IPTC-IIM).
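As a rough illustration of the priority rule, here is a greatly simplified Python sketch. It is not MediaWiki's actual merging code: the function name, the source labels and the relative ranking of XMP above IPTC-IIM are all assumptions made for the example (the text above only says that native metadata such as Exif wins over add-on formats), and the real specification involves more than a simple overwrite.

```python
# Hypothetical priority order, highest first. Only "Exif beats the add-on
# formats" comes from the text; ranking XMP above IPTC-IIM is an assumption.
SOURCE_PRIORITY = ["exif", "xmp", "iptc"]


def merge_sources(by_source):
    """Merge per-source field dicts so that higher-priority sources win."""
    merged = {}
    # Walk from lowest to highest priority so later updates overwrite earlier ones.
    for source in reversed(SOURCE_PRIORITY):
        merged.update(by_source.get(source, {}))
    return merged


example = merge_sources({
    "exif": {"Artist": "Exif Author"},
    "xmp": {"Artist": "XMP Author", "Copyright": "CC BY-SA"},
    "iptc": {"Artist": "IPTC Author", "ObjectName": "Title"},
})
```

Here the three competing Artist values collapse into the single Exif one, while fields that only one source provides are carried over unchanged.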

As a result of this, we combine all the metadata fields into a set of common field names. (The field names in question are essentially all the MediaWiki messages starting with the word "exif-". For historical reasons we use the exif prefix, even though some of these fields aren't really Exif fields). It's expected that if extensions add new fields of this type, they will first try to reuse an existing field if appropriate, and otherwise choose an unambiguous name (PdfHandler does this, for example).

Once all the fields are merged, they are turned into a common format. Many handlers use this common format for their human readable metadata. However, where they put it in their img_metadata serialized PHP blob varies. For JPEG files, it is the entirety of the metadata blob. In GIF files it is the value of a key named "metadata". PDF files have a key named "mergedMetadata" containing this structure. Other formats do other things.
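For a third party reading the blob, the three cases named above could be handled along these lines. This is only a sketch in Python: the function name and the format labels ("jpeg", "gif", "pdf") are hypothetical, the blob is assumed to have already been unserialized into a dict, and other formats would need their own rules.

```python
def find_merged_metadata(blob, file_format):
    """Return the merged human-interest metadata from an unserialized blob.

    `file_format` is a hypothetical label chosen for this sketch, not a
    MediaWiki identifier.
    """
    if file_format == "jpeg":
        # For JPEG the whole blob is the merged metadata.
        return blob
    if file_format == "gif":
        return blob.get("metadata", {})
    if file_format == "pdf":
        return blob.get("mergedMetadata", {})
    raise ValueError(f"no rule known for format: {file_format}")


gif_blob = {"frameCount": 1, "metadata": {"Artist": "Someone"}}
gif_merged = find_merged_metadata(gif_blob, "gif")
```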

Last of all, it should be noted that, due to the historical accident of other metadata formats converging with our Exif support (which originally just dumped everything), these common metadata fields contain some fields from Exif that are very technical, and probably inappropriate to include (Component configuration and the like).

One notable exception to the rule of always trying to merge together metadata with same semantics into one key is file comments. File comments generally get their own field. The rationale being that the user may be interested in what type of file comment something is, and that file comments don't really have very much semantic meaning, so a "GIF file comment" and a "JPEG file comment" aren't really the same thing (for that matter, one "JPEG file comment" isn't really the same as another "JPEG file comment").

Format of this merged metadata
The format of the merged metadata array is as follows:

 array(
     'NameOfMetadataFieldKey' => MetadataValue,
     'NameOfMetadataFieldKey2' => MetadataValue2,
     ...
 );

Where NameOfMetadataFieldKey is the name of the metadata field (which has a corresponding message named exif-nameofmetadatafieldkey; note the message name is lowercased).

MetadataValue is one of the following things:

An unordered list of values:

 array(
     'SomeMetadataField' => array(
         'A Value',
         'Another Value',
         'Can never have too much value',
         '_type' => 'ul'
     )
 )

Note, that if for some reason a "_type" field is missing, it is assumed to be an unordered list type.

Note also, that if the field has a single value, it will also be presented as _type = 'ul'. i.e.

 array(
     'SomeMetadataField' => array( 'Only Value', '_type' => 'ul' )
 )

Which is quite common. Often this string has some sort of format based on what field it is (For example LightSource has a bunch of constants defined in the Exif standard. Dates are formatted as YYYY:MM:DD HH:MM:SS - also known as TS_EXIF to MediaWiki). It is somewhat of a failing of this data model that we don't really know what type of data (A date, a human readable string, a number, etc) is present from the data alone, and formatting is solely based on field name.

A single string:

 array(
     'SomeMetadataField' => 'Only Value',
 )

This is considered short for the '_type' => 'ul' field above. (It may make sense to normalize all these to the array type above)

An ordered list type:

 array(
     'SomeMetadataField' => array(
         'First value',
         'Second value',
         'Third value',
         '_type' => 'ol'
     )
 )

This is just presented to the user as an ordered list. Not all that common.
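The three shapes described so far (plain string, unordered list, ordered list) can be reduced to one form before display. The Python sketch below is only an illustration: the function name is hypothetical, and a PHP array is modelled as a dict whose positional elements have integer keys.

```python
def normalize_value(value):
    """Return (list_type, items) for a string, 'ul' or 'ol' metadata value."""
    if isinstance(value, str):
        # A bare string is shorthand for a one-element unordered list.
        return "ul", [value]
    # A missing '_type' key is treated as an unordered list.
    list_type = value.get("_type", "ul")
    items = [item for key, item in value.items() if key != "_type"]
    return list_type, items


ordered = normalize_value({0: "First value", 1: "Second value", "_type": "ol"})
single = normalize_value("Only Value")
```

Normalizing this way means display code only ever has to deal with a list type and a list of items.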

A language switch:

 array(
     'SomeMetadataField' => array(
         'x-default' => 'English',
         'en' => 'English',
         'en-gb' => 'The Queen\'s English',
         'fr' => 'French',
         'tlh' => 'Klingon',
         '_type' => 'lang'
     )
 )

This presents a number of alternatives, based on language. It is basically a mirror of how this type of data is encoded in XMP (See page 28 of http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf ) Some things to note:
 * If 'x-default' appears, it represents the default value. If it is unknown what language the default value is in, then the x-default key is different from the other keys. If the value of the x-default key is identical to some other key, then that indicates that the key it is identical to is the default value, and that default value is in whatever language the other key is.
 * The value of language keys can only be strings. They cannot be nested arrays.
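A consumer of a language switch has to decide which alternative to show. The Python sketch below shows one possible approach; the function names and the fallback order (exact language match, then x-default, then any remaining entry) are assumptions for illustration, not a description of MediaWiki's own selection logic.

```python
def pick_language(value, user_lang):
    """Pick the text to show for `user_lang` from a 'lang' type value."""
    candidates = {k: v for k, v in value.items() if k != "_type"}
    if user_lang in candidates:
        return candidates[user_lang]
    if "x-default" in candidates:
        return candidates["x-default"]
    # Assumed last resort: whatever entry comes first, or None if empty.
    return next(iter(candidates.values()), None)


def default_language(value):
    """Return the language code whose text is identical to x-default, if any."""
    default = value.get("x-default")
    if default is None:
        return None
    for lang, text in value.items():
        if lang not in ("x-default", "_type") and text == default:
            return lang
    return None


switch = {
    "x-default": "English",
    "en": "English",
    "en-gb": "The Queen's English",
    "fr": "French",
    "tlh": "Klingon",
    "_type": "lang",
}
```

With this value, a French reader gets 'French', a German reader falls back to the x-default text, and default_language reports 'en' because the x-default text is identical to the 'en' entry.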

There are two exceptions to this format (in retrospect, it's probably a bad thing that these exceptions exist):

The Software field:

Most of the time, Software looks like a normal value. However, if it comes from IPTC it may look like:

 array(
     'Software' => array(
         array( 'Software Name', 'Software version number' ),
     )
 )

The contact info field:

Contact info may be an associative array (it's an associative array if it comes from XMP, which is the most common case if specified):

 array(
     'Contact' => array(
         'CiAdrCity' => 'Oceanside',
         'CiAdrRegion' => 'CA',
         'CiAdrPcode' => '92054',
         'CiEmailWork' => 'ttorb@mac.com',
     ),
 )

The full list of possible subkeys of Contact is: 'CiAdrExtadr', 'CiAdrCity', 'CiAdrCtry', 'CiEmailWork', 'CiTelWork', 'CiAdrPcode', 'CiAdrRegion', 'CiUrlWork'.

Again it should be noted that these special cases apply only some of the time. If the contact info does not come from XMP, it looks like a normal value, and if Software does not come from IPTC (or Software version is not specified in IPTC), it looks like a normal value. (In retrospect, this special casing, especially the software one, seems like a rather bad idea).
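Because the Software field can arrive in either shape, code that reads it has to check which one it got. The Python sketch below illustrates that check; the function names and the "name version" display string are hypothetical, and PHP arrays are modelled as Python lists and dicts.

```python
def is_iptc_software(value):
    """True if the value has the IPTC shape: a list of (name, version) pairs."""
    return isinstance(value, list) and all(
        isinstance(pair, (list, tuple)) and len(pair) == 2 for pair in value
    )


def software_lines(value):
    """Return display strings for a Software value in any of its shapes."""
    if is_iptc_software(value):
        return [f"{name} {version}".strip() for name, version in value]
    if isinstance(value, str):
        return [value]
    # Otherwise assume a normal list-type value with an optional '_type' key.
    return [item for key, item in value.items() if key != "_type"]


iptc_style = software_lines([["Software Name", "Software version number"]])
plain_style = software_lines("GIMP 2.2.10")
```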

Displaying of metadata
The img_metadata field is often used internally for various purposes. However it is also displayed to the user in the form of a table at the bottom of the image description page.

How/what to display for metadata is up to the handler (the interface for doing this is a bit restrictive in formatting though, and could probably be refactored). In practise, almost all handlers that display metadata display the same type of "merged" data talked about above, and so also use the same code to format it. [I personally would like all media handlers that do human interest metadata to use this format. SVG is a current exception, but that's mostly because we have limited support for extracting metadata from SVGs].

The table is broken up into hidden and non-hidden fields. Hidden fields are not shown unless show all fields is clicked. Normal fields are always shown. This is controlled by mediawiki:metadata-fields (Putting this entirely onwiki introduces some problems with not being able to update it when new fields are introduced, or if an extension introduces new fields. A better system would be to have this specified in PHP, and use the wiki page as an override of the defaults).

The captions for the fields come from the exif-fieldname MediaWiki messages (if a message is missing, the key name is just shown). Some wikis modify these to, say, link the f-number caption to an article about f-number.

The actual values of these fields are also formatted in various ways. Numeric fields are usually prettified (e.g. 15000 -> 15,000 ). Dates are formatted. Certain fields that come from a fixed set are replaced by messages. For example if LightSource has a value of three, it is replaced with the exif-lightsource-3 message, which is "". Some values are passed through a message for the sole purpose of allowing on-wiki customization. On some wikis, mediawiki:exif-software-value is changed so that the software is linked to a specific Wikipedia article (using a {{#switch:.. to choose which article if the precise value of the field doesn't go to the appropriate place).
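The naming pattern for these messages, and the numeric prettification, can be sketched in Python as follows. The function names are hypothetical, and the comma as thousands separator is an assumption matching the English example above; a real wiki would format numbers according to its own language.

```python
def caption_message(field):
    """Message name holding the caption for a field, e.g. exif-lightsource."""
    return "exif-" + field.lower()


def value_message(field, value):
    """Message name for one value of a fixed-set field, e.g. exif-lightsource-3."""
    return f"exif-{field.lower()}-{value}"


def prettify_number(number):
    """Insert thousands separators (English-style commas assumed)."""
    return f"{number:,}"


light_key = value_message("LightSource", 3)
pretty = prettify_number(15000)
```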

A consequence of this, is that it is difficult to divorce formatting of the values into something human readable from per-wiki customization.

The fact that each entry in the merged metadata array gets its own row in the table can also be a bit cumbersome at times. For example, Latitude and Longitude get individual lines, where it might make sense to put them in one line marked location. (The rule that each entry in the array gets its own row isn't strictly followed. XResolution is combined, for example. However, it's largely true. Some of the really obnoxious data storage choices are modified in the extraction stage. For example, latitude (GPSLatitude) in Exif is stored as an array of three values (degrees, minutes, seconds) plus a second field named GPSLatitudeRef which specifies whether it is North or South. We recombine these at extraction time to get a single positive or negative decimal number. However, there are still some icky bits in places where we don't do this, like DateTime vs DateTimeSubsecond.)
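The latitude recombination mentioned above is simple arithmetic, sketched here in Python. The function name is hypothetical, the three components are assumed to have already been converted from Exif rationals to plain numbers, and treating "W" like "S" (for longitude) is an addition for the sake of the example.

```python
def gps_to_decimal(dms, ref):
    """Combine (degrees, minutes, seconds) and a hemisphere ref into one number."""
    degrees, minutes, seconds = dms
    decimal = degrees + minutes / 60 + seconds / 3600
    # South (and, for longitude, West) is expressed as a negative value.
    return -decimal if ref in ("S", "W") else decimal


north = gps_to_decimal((33, 11, 45), "N")
south = gps_to_decimal((10, 30, 0), "S")
```

For instance, 10 degrees 30 minutes South becomes -10.5.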

The API
The API can retrieve the img_metadata blob from the database (Which it expects to be PHP serialized data). It then converts it to the API's serialization format in a somewhat less than straightforward way. Basically every element of the array gets turned into an array named "metadata" with a key "name" for the actual key name, and "value" for the value of the real array key.
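The conversion described above can be pictured with the following Python sketch. The function name is hypothetical, and applying the same conversion recursively to nested arrays is an assumption made for illustration rather than something stated on this page.

```python
def to_api_format(metadata):
    """Turn a key => value array into a list of name/value entries."""
    result = []
    for key, value in metadata.items():
        if isinstance(value, dict):
            # Assumed: nested arrays are converted the same way.
            value = to_api_format(value)
        result.append({"name": key, "value": value})
    return result


converted = to_api_format({
    "bitDepth": 8,
    "metadata": {"XResolution": "10000/100"},
})
```

This is why a client cannot index the API result by field name directly: it has to scan the list for the entry whose "name" matches.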

As an example, consider File:RGBA_Logo_Circle-Variable_Transparency-Large.png. It has the following metadata:

* This is possibly not the best example, as there is a bug in the PNG extraction code, where timestamps aren't converted to EXIF format like they are supposed to be (56064)

This gets turned into the following in the XML serialization by the API

Or in JSON as :

For reference, if you are looking at a database dump, you would see the serialized PHP values, which look like:

 a:6:{s:10:"frameCount";i:0;s:9:"loopCount";i:1;s:8:"duration";d:0;s:8:"bitDepth";i:8;s:9:"colorType";s:16:"truecolour-alpha";s:8:"metadata";a:11:{s:11:"XResolution";s:9:"10000/100";s:11:"YResolution";s:9:"10000/100";s:14:"ResolutionUnit";i:3;s:10:"ObjectName";a:2:{s:9:"x-default";s:9:"RGBA Logo";s:5:"_type";s:4:"lang";}s:6:"Artist";a:2:{s:9:"x-default";s:10:"Shlomi Tal";s:5:"_type";s:4:"lang";}s:16:"ImageDescription";a:2:{s:9:"x-default";s:102:"Image demonstrating the use of an alpha channel for anti-aliasing of transparency and for translucency";s:5:"_type";s:4:"lang";}s:14:"PNGFileComment";a:2:{s:9:"x-default";s:43:"Colours are from the Gretag-Macbeth palette";s:5:"_type";s:4:"lang";}s:9:"Copyright";a:2:{s:9:"x-default";s:53:"Creative Commons Attribution-ShareAlike 3.0 and older";s:5:"_type";s:4:"lang";}s:8:"Software";a:2:{s:9:"x-default";s:11:"GIMP 2.2.10";s:5:"_type";s:4:"lang";}s:17:"DateTimeDigitized";a:2:{s:9:"x-default";s:27:"Tue 18 Mar 2008 16:03 +0200";s:5:"_type";s:4:"lang";}s:15:"_MW_PNG_VERSION";i:1;}}

Metadata versions
There is also a notion of metadata versions. The idea is that if a handler wants to update its metadata format without breaking existing clients (in particular InstantCommons, where the client is another MediaWiki instance, possibly running a different version), a version parameter in the API request can be used to differentiate the responses. As of this writing, the only format to use this feature is JPEG. In essence, not setting a version for JPEGs makes it output a single array, where all the values are simple strings (so no nested arrays). Compare vs  of File:Tian_Tan_Buddha_by_Beria.jpg. Note how the Copyright field is displayed (it comes from XMP, and is hence a language switch, even though only a single language is defined).