Requests for comment/Clean up URLs

Wikimedia projects have traditionally used the /wiki/ prefix for article URLs, and a /w/{index,api}.php entry point for edit views. The downside of this setup is that URLs are longer than necessary, and edit URLs in particular are fairly ugly and hard to derive from regular view URLs.

Part 1: Remove /wiki/ prefix (http://en.wikipedia.org/Foo)
Removing the /wiki/ prefix would shorten and clean up read-only URLs a fair bit. The risk, however, is that we could run into name conflicts. Private resources we control can be prefixed with an underscore ('_images', '_skins'), which is not a valid title and is a convention often used in REST APIs for private sub-resources. There are, however, some top-level resources with fixed names:


 * favicon.ico, robots.txt: The articles about these tend to be capitalized (en:Favicon.ico/en:Robots.txt), so there should not be a conflict here.
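The underscore rule above can be sketched as a simple top-level dispatch. This is illustrative Python only; the names `RESERVED` and `classify` are hypothetical and do not correspond to actual MediaWiki routing code:

```python
# Sketch of top-level path dispatch under the /wiki/-less scheme.
# RESERVED and classify are hypothetical names for illustration.

RESERVED = {"favicon.ico", "robots.txt"}  # fixed-name top-level resources

def classify(path: str) -> str:
    """Classify a top-level path under the /wiki/-less URL scheme."""
    name = path.lstrip("/")
    if name in RESERVED:
        return "static"
    if name.startswith("_"):
        # A leading underscore is not a valid MediaWiki title, so
        # private resources like _images or _skins cannot collide
        # with article names.
        return "private"
    return "article"

print(classify("/favicon.ico"))       # static resource, fixed name
print(classify("/_images/logo.png"))  # private, underscore-prefixed
print(classify("/Favicon.ico"))       # article (capitalized, no conflict)
```

Note that the capitalized article title falls through to the article case, which is exactly why favicon.ico and robots.txt do not conflict.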

Part 2: Remove /w/index.php?title=.. entry point (http://en.wikipedia.org/Foo?action=edit)
This seems to be fairly straightforward.

Titles with an embedded question mark ('Foo?') are not an issue, as the '?' is already percent-encoded as %3F in the path. IIRC, in the past this setup had some advantages around disabling caching for the /w/ prefix and not counting hits as page views in webalizer. Since our tools and MediaWiki's cache-header handling have improved since 2003, that no longer seems to be a factor.
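Percent-encoding is what keeps the path and the query string unambiguous. A quick sketch using Python's standard urllib (not MediaWiki's own title encoder):

```python
from urllib.parse import quote, unquote

# A literal '?' in a title is percent-encoded in the path, so
# '/Foo%3F' is the article "Foo?" while '/Foo?action=edit' is the
# article "Foo" with a query string.
title = "Foo?"
path = "/" + quote(title, safe="")
print(path)             # /Foo%3F
print(unquote(path[1:]))  # Foo?
```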

Migration
Before making any changes, we should make sure that all important consumers of the PHP parser's output (which contains the /wiki/ prefix) also support the /wiki/-less format. We should also reach out to search engines to make sure that search rankings are not affected. Since Wikipedia is such an important site to them, they usually already have custom handling, so asking them to support this in advance should be feasible.

Old URLs can continue to work with some simple rewrite rules in Varnish:


 * /wiki/Foo to /Foo
 * /w/index.php?title=Foo&action=edit to /Foo?action=edit

By doing this rewriting in Varnish we can avoid cache fragmentation. Only URLs following the new scheme need to be purged.
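The rewrite rules above can be sketched in plain Python as a stand-in for the Varnish VCL (which this proposal does not spell out); the `rewrite` helper is hypothetical:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def rewrite(old_url: str) -> str:
    """Map a legacy URL onto the new scheme. Illustrative stand-in
    for the Varnish rewrite rules, not production configuration."""
    parts = urlsplit(old_url)
    if parts.path.startswith("/wiki/"):
        # /wiki/Foo -> /Foo
        return old_url.replace("/wiki/", "/", 1)
    if parts.path == "/w/index.php":
        # /w/index.php?title=Foo&action=edit -> /Foo?action=edit
        # (simplified: ignores re-encoding of special title characters)
        params = dict(parse_qsl(parts.query))
        title = params.pop("title", "")
        query = urlencode(params)
        return "/" + title + ("?" + query if query else "")
    return old_url  # already in the new scheme

print(rewrite("http://en.wikipedia.org/wiki/Foo"))
print(rewrite("/w/index.php?title=Foo&action=edit"))
```

Because the rewrite happens at the edge, both old and new URLs resolve to the same cache object under the new scheme, which is what avoids cache fragmentation.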

To prevent search engines from indexing action pages, we can add a glob rule to robots.txt; glob rules are supported by the major search engines, including Google, MSN, Slurp (Yahoo) and Yandex:

Disallow: /*?

These glob rules are used in the robots.txt of sites like Twitter, Yahoo, Bing and Google.
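To illustrate what the rule matches: in the robots.txt glob extension only '*' and '$' are metacharacters, so "Disallow: /*?" blocks any path containing a literal '?'. A minimal sketch (the translation function is hypothetical and deliberately simplified):

```python
import re

def robots_glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt glob into an anchored regex.
    Simplified sketch: only '*' (any sequence) and a trailing '$'
    (end of URL) are treated as special, per the common extension."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")
        elif ch == "$":
            out.append("$")
        else:
            out.append(re.escape(ch))  # '?' is a literal character here
    return re.compile("^" + "".join(out))

rule = robots_glob_to_regex("/*?")
print(bool(rule.match("/Foo?action=edit")))  # True: action URL, blocked
print(bool(rule.match("/Foo")))              # False: plain view, crawlable
```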