Requests for comment/Clean up URLs

Wikimedia projects have traditionally used a /w/{index,api}.php entry point for all action views. The downside of this setup is that URLs are longer than necessary and hard to derive from regular view URLs (see for example bug 17981 and bug 16659). Additionally, multiple query parameters in a URL complicate caching. Both /w/index.php?title=Foo&action=history and /w/index.php?action=history&title=Foo are equivalent queries, both of which would need to be purged.
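The cache-fragmentation point can be illustrated with a short sketch (the URLs are illustrative). The two orderings parse to the same parameters, but a URL-keyed cache stores them as distinct objects:

```python
from urllib.parse import urlsplit, parse_qs

a = "/w/index.php?title=Foo&action=history"
b = "/w/index.php?action=history&title=Foo"

# The query strings parse to the same parameters...
assert parse_qs(urlsplit(a).query) == parse_qs(urlsplit(b).query)
# ...but a cache keyed on the raw URL sees two distinct objects,
# so a purge has to cover both.
assert a != b
```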

Proposal: Use /wiki/Foo?action=history instead of /w/index.php?title=Foo&action=history
Besides being shorter and easier to derive from the view URL, this form also fixes the position of the title parameter in the URL. This makes URLs with a single additional query parameter deterministic and thus potentially cacheable. This property can be used in a RESTful content API, which will be discussed in a separate RFC.

Potential issues
Titles with an embedded question mark ('Foo?') are not an issue, as they are already encoded as %3F.
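The encoding behaviour can be checked with a quick sketch: a literal '?' in a title is percent-encoded in the path, so it never starts the query string and round-trips cleanly.

```python
from urllib.parse import quote, unquote

# '?' is not safe in a path segment, so it is percent-encoded;
# only the unencoded '?' separating path and query has special meaning.
encoded = quote("Foo?")
assert encoded == "Foo%3F"
assert unquote(encoded) == "Foo?"
```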

IIRC, in the past this setup had some advantages around disabling caching for the /w/ prefix and not counting hits as page views in webalizer. Since our tools and MediaWiki's cache-header handling have slightly improved since 2003, that no longer seems to be a factor.

There was an issue in IE <= 3.0 (!) that caused it to disregard the Content-type header and instead guess the content based on the file extension. This would be a problem if we served unsanitized content (wikitext for example) to logged-in IE3 users through ?action=raw. As IE3 is no longer in active use we should probably simply disable logins for it, and maybe also blacklist it from potentially dangerous content API access.

Migration
Old URLs can continue to work with a simple rewrite rule in Varnish:


 * /w/index.php?title=Foo&action=x to /wiki/Foo?action=x
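A minimal VCL sketch of such a rule (assuming Varnish 4 syntax; the regex is illustrative and would need testing against production traffic, and does not handle every parameter ordering):

```vcl
sub vcl_recv {
    # Rewrite legacy /w/index.php?title=Foo&action=x requests
    # to /wiki/Foo?action=x before the cache lookup.
    if (req.url ~ "^/w/index\.php\?title=([^&]+)&(.+)$") {
        set req.url = regsub(req.url,
            "^/w/index\.php\?title=([^&]+)&(.+)$",
            "/wiki/\1?\2");
    }
}
```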

By doing this rewriting in Varnish we can avoid cache fragmentation. Only URLs following the new scheme need to be purged.

To prevent search engines from indexing action pages, we can add a glob rule to robots.txt. This syntax is supported by the major search engines, including Google, MSN, Slurp (Yahoo) and Yandex:

Disallow: /*?

These glob rules are used in the robots.txt of sites like Twitter, Yahoo, Bing and Google.
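The matching semantics of that rule can be sketched in Python (a simplified model of the glob extension, not a full robots.txt parser): '*' matches any character sequence, and the rule matches as a prefix, so any URL containing a query string is excluded from crawling while plain view URLs remain indexable.

```python
import re

def disallowed(path, rule="/*?"):
    # Translate the robots.txt glob into a regex: '*' becomes '.*',
    # everything else is matched literally, anchored at the start.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

assert disallowed("/wiki/Foo?action=history")  # action view: blocked
assert not disallowed("/wiki/Foo")             # plain view: crawlable
```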

Dropped proposal to remove /wiki/
Removing the /wiki/ prefix would shorten and clean up read-only URLs a fair bit. The risk, however, is that we could run into name conflicts. Private resources we control can be prefixed with an underscore ('_images', '_skins'), which is not a valid title prefix and is a convention often used in REST APIs for private sub-resources. There are however some top-level resources that have fixed names:


 * favicon.ico, robots.txt: The articles about these tend to be capitalized (en:Favicon.ico/en:Robots.txt), so there should not be a conflict here.
 * articles prefixed with the existing entry point /w/ exist on enwiktionary (w/r/t and w/e), which would make backwards compatibility with the current /w/index.php entry point impossible.

The second issue sinks /wiki/ removal for now, as switching off the /w/ entry point is not really feasible in the short term. We should however consider picking non-title entry points ('/_w/' for example) in the future, so that we keep the option of moving towards cleaner read-only URLs later.