Wikimedia Apps/Commons/Category editing

From mediawiki.org

The basics

Fetching a list of categories is easy. Changing the live categories is harder, because you have to pick apart and modify the page markup.
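For the easy half, a minimal sketch of building the request with the standard `prop=categories` query module. The endpoint constant and the helper name `categories_query_url` are illustrative, and a real client would also need to handle continuation and errors:

```python
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def categories_query_url(file_title: str) -> str:
    """Build a query URL listing the categories of one file page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": file_title,
        "cllimit": "max",   # ask for as many categories as one request allows
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

Note that this list includes categories added by templates as well as those written directly in the page text.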

A file description page looks something like this:

== {{int:filedesc}} ==
{{Information|Description=Backside of iPod Touch fifth-generation. Taken at wikimedia offices by original uploader.|source={{own}}|author=[[User:Brooke Vibber]]|date={{According to EXIF data|2013-04-09}}}}
== {{int:license-header}} ==
{{self|cc-by-sa-3.0}}

{{Uploaded from Mobile|platform=iOS|version=0.17}}
{{Uncategorized|year=2013|month=April|day=9}}

If categories have been included, you won't have that "Uncategorized" template, and you'll have some lines like this:

[[Category:Something]]
[[Category:Another thing]]

Note that wikitext formatting is extremely variable. Pages may have been edited manually and are not guaranteed to conform to the samples above. Several category links might appear on one line, or have variant whitespace, or differ in all kinds of other interesting ways.

Things the Commons app has to worry about

When adding categories:

  • remove the "Uncategorized" template if it's present
  • preferably add them on their own line
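The two steps for adding could be sketched roughly as below. The function name `add_categories` is made up for illustration, and the pattern assumes the "Uncategorized" template contains no nested templates in its parameters:

```python
import re

# Matches {{Uncategorized}} with any simple parameters, plus the rest of its line.
# IGNORECASE is looser than MediaWiki's own rule (only the first letter of a
# template name is case-insensitive); that is a deliberate simplification.
UNCATEGORIZED_RE = re.compile(
    r"\{\{\s*uncategorized\s*(\|[^{}]*)?\}\}[ \t]*\n?", re.IGNORECASE
)


def add_categories(wikitext: str, names: list[str]) -> str:
    """Drop the Uncategorized template and append one category link per line."""
    if not names:
        return wikitext
    text = UNCATEGORIZED_RE.sub("", wikitext).rstrip("\n")
    links = "\n".join(f"[[Category:{name}]]" for name in names)
    return f"{text}\n{links}\n"
```

This sketch does not check whether a category is already present; a real implementation would want to skip duplicates.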

When removing categories:

  • beware that some categories might actually be implied via templates, and don't appear in the page text as links! Ideally we can distinguish these before we offer a "remove" option and disable it.
  • The first character of a category's name is always normalized to uppercase, but it may be lowercase in the link (e.g. 'Category:foo' really means 'Category:Foo').
  • "Category:" can be in any darn case! Yes, it's inconsistent with the rest of the title. 'cAtEgOrY:foo' is legal, and equivalent to 'Category:Foo'
  • if you remove all the categories, you should put back the "Uncategorized" template.
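The case rules above, and the check for template-implied categories, might look something like this. Both helper names are hypothetical. The sketch only recognizes the English "Category" prefix, and it also treats underscores as spaces and collapses runs of whitespace, which is standard MediaWiki title handling rather than something stated above:

```python
def normalize_category_title(raw: str) -> str:
    """Return the canonical 'Category:Name' form of a category link target."""
    prefix, sep, name = raw.partition(":")
    if not sep or prefix.strip().lower() != "category":
        raise ValueError(f"not a category title: {raw!r}")
    # Underscores and spaces are equivalent in titles; collapse extra whitespace.
    name = " ".join(name.replace("_", " ").split())
    # Only the first character is forced to uppercase; the rest is case-sensitive.
    return "Category:" + name[:1].upper() + name[1:]


def directly_linked(api_titles: list[str], link_targets: list[str]) -> list[str]:
    """Keep only the categories that appear as explicit links in the page text.

    api_titles: category titles reported for the page by the API.
    link_targets: raw targets of the category links found in the wikitext.
    Anything reported by the API but missing here is presumably template-implied,
    so a "remove" option should not be offered for it.
    """
    linked = {normalize_category_title(t) for t in link_targets}
    return [t for t in api_titles if normalize_category_title(t) in linked]
```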

Editing markup

There are basically two ways you can deal with the markup: throwing regexes at it and hoping nothing breaks, or using the XML preprocessor parse tree.

Regex

Beware of several things:

  • multiple links may appear on a single line
  • case issues mentioned above
  • category links can include a sort key override as the second component of the link's pipe-separated parameters. Yeah, fun. Example: [[Category:Foobar|Blah blah]]. If you're just removing links you don't need to care what the sort key is, but your regex does need to match it.
  • Uncategorized template could include more or different parameters than are shown in the above examples, in theory.
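A rough sketch of a regex approach that copes with the points above: several links per line, arbitrary case in the prefix, a lowercase first letter, and an optional sort key. The names `find_categories` and `remove_category` are invented for this example, and the pattern assumes a link never spans multiple lines or contains nested brackets:

```python
import re

# [[ category : Name | optional sort key ]] with arbitrary case and whitespace.
CATEGORY_LINK_RE = re.compile(
    r"\[\[\s*category\s*:\s*([^\[\]|]+?)\s*(?:\|[^\[\]]*)?\]\]",
    re.IGNORECASE,
)


def _canonical(name: str) -> str:
    """Collapse whitespace/underscores and uppercase the first character."""
    name = " ".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def find_categories(wikitext: str) -> list[str]:
    """Return canonical names of every category link written in the text."""
    return [_canonical(m.group(1)) for m in CATEGORY_LINK_RE.finditer(wikitext)]


def remove_category(wikitext: str, name: str) -> str:
    """Remove every link to one category, dropping lines left empty by it."""
    target = _canonical(name)

    def drop_if_target(match: re.Match) -> str:
        return "" if _canonical(match.group(1)) == target else match.group(0)

    kept = []
    for line in wikitext.splitlines(keepends=True):
        new_line = CATEGORY_LINK_RE.sub(drop_if_target, line)
        if new_line != line and not new_line.strip():
            continue  # the line held nothing but the removed link(s)
        kept.append(new_line)
    return "".join(kept)
```

Working line by line keeps other links on a shared line intact, at the cost of sometimes leaving stray whitespace where a link used to be.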

XML

There's an option on prop=revisions for getting an XML parse tree of the content in addition to the markup source. This is an XML document based on a parse of the basic heading, template, and link portions of the wiki syntax [1]. This uses the actual MediaWiki parser's preprocessor, so you know it divides up the document in the "right" way.

You could pass this into an XML parser and walk through the structure, looking for the template and link nodes with the matching names and modifying them.
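As a very rough illustration of that walk, the sketch below parses a hand-written sample rather than real API output. The element names (`root`, `template`, `title`, `part`, `name`, `value`) are assumptions about the tree's shape and should be checked against an actual response before relying on them:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a parse tree; NOT captured from the API.
SAMPLE_TREE = (
    "<root>"
    '<template><title>self</title>'
    '<part><name index="1"/><value>cc-by-sa-3.0</value></part></template>\n'
    "<template><title>Uncategorized</title>"
    "<part><name>year</name>=<value>2013</value></part></template>\n"
    "</root>"
)


def template_titles(tree_xml: str) -> list[str]:
    """List the names of all template nodes, first letter uppercased."""
    root = ET.fromstring(tree_xml)
    titles = []
    for node in root.iter("template"):
        title = (node.findtext("title") or "").strip()
        titles.append(title[:1].upper() + title[1:])
    return titles


def has_uncategorized(tree_xml: str) -> bool:
    """Report whether the tree contains an Uncategorized template node."""
    return "Uncategorized" in template_titles(tree_xml)
```

Here `has_uncategorized(SAMPLE_TREE)` is true, whatever parameters the template carries, which is the case the regex approach has to work harder for.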

You can flatten the document back to source code simply by concatenating all the text nodes together.