User:NeilK/Multimedia2011/Titles


Proposal: Titles for Commons

Current situation

In its simplest configuration, MediaWiki assigns every file a "File page" with a unique URL based on its filename, e.g.

http://wiki.sample.org/wiki/File:Tomcat_76a.jpg

This page links to the original file, or to a preview of it, at a URL something like

http://wiki.sample.com/w/images/a/ab/Tomcat_76a.jpg

Here /a/ab is a hashed directory structure based on the first characters of the MD5 hash of the filename.
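As a rough illustration of how such a path can be derived, here is a minimal Python sketch. It assumes the layout used by MediaWiki's default hashed upload directories, as far as I know: the first hex digit, then the first two hex digits, of the MD5 hash of the underscore-normalized filename. The function name `hashed_path` and the `/w/images` base path are made up for this example and are not MediaWiki API.

```python
import hashlib


def hashed_path(filename: str, base: str = "/w/images") -> str:
    """Return a two-level hashed path for a filename (illustrative sketch)."""
    # Assumption: spaces are normalized to underscores before hashing.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First directory: first hex digit. Second directory: first two hex digits.
    return f"{base}/{digest[0]}/{digest[:2]}/{name}"


print(hashed_path("Tomcat 76a.jpg"))
```

The point is only that the directory is a pure function of the name, so any layer that knows the name can find the file without a lookup.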

Obviously, in the Commons world, these services are not provided by one simple flat directory, but we can imagine that they are, for all practical purposes.

Problems

The title must be unique -- among millions of files

This often presents a vexing problem to the uploader. It is hard to make choosing a unique title usable, although we have tried with UploadWizard, and it is hard to manage in code across API or web interactions.

The user often ends up with a title that is not particularly descriptive; it just satisfies the need for uniqueness.

Bleeding constraints

  1. The title is immutable (because the URL has to be immutable). Think of a better title later? Tough.
  2. The title has to have the extension, because the filename has to have the extension. This is just ugly, and when looking at the file, who cares if it's a .jpg or .png? Consumers care about the content.
  3. The title can't contain a slash, and it inherits any other limitation that the underlying file store places on filenames, such as length.

The title is not internationalized

It's kind of silly when you think about it -- we have internationalized descriptions, but not titles?

One of the best images of a cat on Commons is commons:File:Olhos_de_um_gato-3.jpg but you could never even find that if you searched titles for "cat" in English. There is an English description with "cat" in it, but nothing for "chat" in French.

Early files get to colonize the namespace

Is commons:File:Cat.jpg really the best picture of a cat ever? Why does everyone else get "Cat 1a" or "Cat 2b"?

Proposed solution

We decouple title from URL from filename. This means:

  • Description page URLs get to be immutable and unique
  • Titles get to be descriptive, although mutable and non-unique
  • The linked media file can be whatever it wants to be

We use the fact that Commons is internationalized to create a new URL scheme, 100% compatible with the old system, but giving us what we want anyway. Here's how it works:

The File: page's URL should be arbitrarily assigned on upload, from a service which provides unique ids. This page will show titles and descriptions in all languages, much as the current File: page works.

http://wiki.sample.org/wiki/File:923873298

The URL for the media file can be anything, and we don't have to change the systems that depend on it being directory-hashed by the MD5 hash of the filename. We just make sure that the name being hashed is the File: id, because that's still unique.

http://wiki.sample.com/w/images/f/f4/923873298.jpg

We are also able to resolve subpages for each language. These show the appropriate titles and descriptions for their languages. When a title or description does not exist in a language, the uploader's title and description are the fallback.

http://wiki.sample.org/wiki/File:923873298/en/
http://wiki.sample.org/wiki/File:923873298/fr/

For higher Google rankings, as well as a better filename for saved pages, we can also optionally append the title to the URL. When creating thumbnails and links on the Wikipedias, we pick the default for the language of that site. However (important!), the URL path after the /language/ segment plays no part in resolution.

http://wiki.sample.org/wiki/File:923873298/en/Tomcat
http://wiki.sample.org/wiki/File:923873298/fr/Matou
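The resolution rule above can be sketched as a small Python function. This is only an illustration of the idea, not MediaWiki code: the names `FileRef` and `resolve_file_path` and the `/wiki/File:` prefix handling are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class FileRef(NamedTuple):
    file_id: str
    lang: Optional[str]  # None means the all-languages File: page


def resolve_file_path(path: str) -> FileRef:
    """Resolve '/wiki/File:<id>[/<lang>[/<anything>]]' to (id, language)."""
    prefix = "/wiki/File:"
    if not path.startswith(prefix):
        raise ValueError(f"not a File: page path: {path!r}")
    segments = path[len(prefix):].split("/")
    file_id = segments[0]
    if not file_id:
        raise ValueError("missing file id")
    # The second segment, if present and non-empty, is the language code.
    lang = segments[1] if len(segments) > 1 and segments[1] else None
    # Everything after the language segment is decorative and is ignored.
    return FileRef(file_id, lang)


print(resolve_file_path("/wiki/File:923873298/en/Tomcat"))
print(resolve_file_path("/wiki/File:923873298/fr/"))
print(resolve_file_path("/wiki/File:923873298"))
```

Because only the id and language segments are read, an old-style name such as File:Olhos_de_um_gato-3.jpg would resolve the same way, with the filename acting as the id.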

The title, in the HTML title tag and in the h1 tag, should be obtained from a database field (and internationalized) the same way the description is.

Tomcat (on English page)
Matou (on French page)
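A minimal Python sketch of the title lookup, including the fallback to the uploader's title when a language has none. The function name `display_title`, the dictionary shape, and the idea of recording the uploader's language separately are assumptions for illustration; the proposal does not fix a storage format.

```python
def display_title(titles: dict, lang: str, uploader_lang: str) -> str:
    """Pick the title for `lang`, falling back to the uploader's title."""
    # An empty or missing entry for `lang` triggers the fallback.
    return titles.get(lang) or titles[uploader_lang]


titles = {"en": "Tomcat", "fr": "Matou"}
print(display_title(titles, "fr", uploader_lang="en"))
print(display_title(titles, "de", uploader_lang="en"))
```

The same function would feed both the HTML title tag and the h1, so the two cannot drift apart.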

Why this is better

Super easy uploads

We just don't care what your filename was.

Mutable titles

We can allow you to upload, and even publish, files with crappy titles. They'll get fixed by the community, without breaking anything. If someone thinks that your file, which you called "My Kitty Cat", should really be "Adult male Persian Blue in profile on windowsill", then that is perfectly fine. No URLs will break, and the link to the underlying media file is not affected.

Prettier titles

No extension! No need for arbitrary "unique" numbers appended to your filename or other stupid workarounds!

And incidentally, this also clears the way for the extension to be normalized.

Better internationalization

I think there is no contest here. It would be easy to "translate" a page right on Commons, perhaps using Extension:Translate with a few modifications. Or we could involve TranslateWiki somehow.

This is compatible with our old URL naming system, and file stores

Nuff said. No other layer needs to break.

Why this is worse

Mischief

It is possible to be mischievous with a link by appending nonsense, e.g. imagine an image of Barack Obama which normally has the URL:

http://wiki.sample.org/wiki/File:38798231/en/Barack_Obama

A bad person could make a link like this:

http://wiki.sample.org/wiki/File:38798231/en/Retard

Because the part after /en/ doesn't count, that will resolve. However, the displayed title will still be taken from whatever the current title in 'en' is supposed to be, so it will still display the 'Barack Obama' title.

Other web publishing systems have similar weaknesses, and this has not proven to be fatal.

Complexity (or not)

Now there are languages**2 possible versions of each page (the number of languages, squared), since the interface language and the localized content language can differ independently.

This may not be a problem -- see implementation.

How to implement

It's relatively trivial, for such a large change.

  • Create a service to provide unique IDs, that can handle many upload servers.
    • We could probably get away with MySQL autoincrement for a while, as long as we started from a number larger than any other File: id.
  • Add title field to the database, or (?) to the wikitext. Create constraints in PHP code such that one must always enter it in an internationalized manner. Perhaps it is really a serialized PHP data structure such as array( 'en' => 'English Description', 'fr' => '...' ), or wikitext similar to the internationalized Description in Infobox.
  • Create updater script for old database entries to have a title field (this is easy; we just copy directly from the filename field). Note that we do not change the filename field for old files. Their old File: URLs still work and are permanent. The URLs to their linked files still work. Their description pages just get extra URLs now, like File:Olhos_de_um_gato-3.jpg/en/Eyes_of_a_cat.
  • Be able to serve File: pages for each language
    • Suppress other languages when showing a /language/localizedTitle page
      • The easy way to do this is to do nothing to the File: page. It still serves all the languages together. But for /language/localizedTitle pages, JavaScript on the page notices the URL and suppresses every language other than the one selected. This feature is already available on Commons, or almost: check out commons:Category:Karen_Allen. The page is multilingual, but certain templates have associated JavaScript to show one language at a time.
      • OR
      • In PHP, change URL resolution for Files & somehow parse out internationalized title and description when generating the subpages for each language. But this blows the cache up considerably.
    • Create interface to link to other language versions, and to create them if missing. Extension:Translate may help
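The first and third steps above can be sketched in Python as follows. This is a toy, in-process stand-in under stated assumptions: a real deployment would need a shared service or a database sequence so that many upload servers see the same counter. The names `IdAllocator` and `backfill_title`, the row shape, and the `default_lang` parameter (which language an old filename-derived title is filed under) are all hypothetical.

```python
import itertools
import threading


class IdAllocator:
    """Hands out unique, increasing integer ids (stand-in for a DB sequence)."""

    def __init__(self, start: int):
        # `start` should be larger than any existing File: id.
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            return next(self._counter)


def backfill_title(row: dict, default_lang: str = "en") -> dict:
    """Copy an old row's filename into a new titles field; filename is untouched."""
    updated = dict(row)
    # Assumption: old titles are filed under one default language.
    updated.setdefault("titles", {default_lang: row["filename"]})
    return updated


allocator = IdAllocator(start=923873298)
print(allocator.next_id(), allocator.next_id())
print(backfill_title({"filename": "Olhos_de_um_gato-3.jpg"}))
```

Rows that already carry a titles field are left alone, so the backfill can be re-run safely.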

Comments on this proposal

  • Don't use the word "title" for this concept. We should call it "short description" or something like that, so as not to confuse it with the MediaWiki notion of a unique Title used throughout the API and the MediaWiki codebase.
  • Some thought would need to be given to InstantCommons, so that description pages there also reflect the updated "short descriptions".
  • Right now we have a one-to-one relationship between the title and the file extension, and there are quite a few places in the code that depend on that correlation. It would be easier (maybe for the first phase) to use File:923873298.jpg instead of File:923873298.
  • All the search tools, tools that access titles via the API, the recent changes feeds, the page-move special page tools, etc. would all have to be updated to reflect this new "short description" property that is not the traditional "title" property associated with a wiki page. --Mdale 19:15, 12 April 2011 (UTC)

Since this was just mentioned on bugzilla:4421, I thought I'd leave a few comments too :). Mostly, I disagree that the problems are actual problems. Titles are not actually immutable: you can move images. Files can even have multiple file names, given that redirects in the file namespace now work. (On the other hand, the whole file-moving system is a bit delicate from what I understand, and you do have a point about the image URL, but I wouldn't consider that a huge issue.) As for unique human-readable identifiers being difficult, you're going to have to uniquely identify the image somehow in order to link to or embed it. I don't think [[file:<some autoincrementing id>]] will make users happy; that's probably my main concern, since requiring users to embed files in a page by typing an opaque code is a major disadvantage, in my opinion. As for limitations of the underlying file system, I suppose that depends on the file system, but on most Unixes file names can be up to 255 bytes long and cannot contain '/' or null. MediaWiki would already enforce the 255-byte limit and the null rule [along with a big pile of other constraints], and it's probably not a bad thing to disallow '/', given the potential confusion with the subpage system. I18n-ized titles are a much larger problem in MediaWiki, and something that all pages need, not just images (for example, category pages on Commons). Bawolff 06:53, 7 June 2011 (UTC)