Requests for comment/Clean up URLs

Wikimedia projects have traditionally used a /w/{index,api}.php entry point for all action views. The downside of this setup is that the URLs are longer than necessary and hard to derive from regular view URLs (see for example bug 17981 and bug 16659). Additionally, multiple query parameters in a URL complicate caching: for example, /w/index.php?title=Foo&action=history and /w/index.php?action=history&title=Foo are equivalent queries, both of which would need to be purged.

Proposal: Use /wiki/Foo?action=history instead of /w/index.php?title=Foo&action=history
Besides being shorter and easier to derive from the view URL, this form also fixes the position of the title parameter in the URL. This makes URLs with a single additional query parameter deterministic and thus potentially cacheable. This could also be used by a RESTful content API, which will be discussed in a separate RFC.
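As a sketch of why the fixed title position helps, here is a small Python helper (the function name is ours, not MediaWiki code) that builds the proposed URL form; every combination of a title plus one action parameter then has exactly one spelling:

```python
from urllib.parse import quote, urlencode

def action_url(title, **params):
    # Title goes in the path, percent-encoded; spaces become underscores
    # per MediaWiki convention. Parameters are sorted so that a given
    # set of parameters always produces the same URL string.
    path = "/wiki/" + quote(title.replace(" ", "_"), safe="")
    if params:
        path += "?" + urlencode(sorted(params.items()))
    return path

print(action_url("Main Page", action="history"))
# -> /wiki/Main_Page?action=history
print(action_url("Main Page"))
# -> /wiki/Main_Page
```

Because the title can no longer trade places with other query parameters, a cache can treat each such URL as a single object to store and purge.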

Potential issues
Titles with embedded question marks ('Foo?') are not an issue, as those are already encoded as %3F.
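A quick check of that claim, using Python's standard percent-encoding as a stand-in for MediaWiki's own encoder:

```python
from urllib.parse import quote

# An embedded '?' in a title is percent-encoded, so the only literal '?'
# left in the URL is the real query-string separator.
encoded = quote("Foo?", safe="")
print(encoded)  # Foo%3F

url = "/wiki/" + encoded + "?action=history"
path, _, query = url.partition("?")
print(path)   # /wiki/Foo%3F
print(query)  # action=history
```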

IIRC in the past there were some advantages to this setup: it made it easy to disable caching for everything under the /w/ prefix, and to avoid counting those hits as page views in webalizer. Since our tools and MediaWiki's cache-header handling have slightly improved since 2003, that does not seem to be a factor any more.

There is an issue in IE6, and possibly IE7, that causes it to disregard the Content-Type header and instead guess the content type based on the URL. This is a problem when we serve unsanitized content (wikitext, for example) to logged-in IE6 users through something like ?action=raw. Tim developed a solution based on the Content-Disposition header in bug 28235 that should protect IE{6,7} from potentially dangerous content in action URLs. From reading the source, the only code still setting Content-Disposition headers seems to be the thumbnail handling in StreamFile, so there should be no conflicts over this header that would prevent it from being used here. The share of logged-in requests from IE6 and IE7 has also shrunk to 0.00479% (IE6) and 0.768% (IE7).

Migration
Old URLs can continue to work with a simple rewrite rule in Varnish:


 * /w/index.php?title=<title>&<query> to /wiki/<title>?<query>

By doing this rewriting in Varnish we can avoid cache fragmentation: only URLs following the new scheme need to be purged.
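The rewrite itself can be sketched in Python (the real rule would live in Varnish VCL; the regex below assumes title= is the first query parameter, as in the rule above):

```python
import re

def rewrite_old_url(url):
    # Move the title value from the query string into the path and
    # keep any remaining query parameters.
    m = re.match(r"^/w/index\.php\?title=([^&]+)(?:&(.*))?$", url)
    if not m:
        return url  # not an old-style action URL; leave it untouched
    title, rest = m.group(1), m.group(2)
    return "/wiki/" + title + ("?" + rest if rest else "")

print(rewrite_old_url("/w/index.php?title=Foo&action=history"))
# -> /wiki/Foo?action=history
print(rewrite_old_url("/w/index.php?title=Foo"))
# -> /wiki/Foo
```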

To prevent search engines from indexing action pages, we can add a glob rule to robots.txt, which is supported by the major search engines including Google, MSN, Slurp (Yahoo), and Yandex:

Disallow: /wiki/*?

Such glob rules are used in the robots.txt of sites like Twitter, Yahoo, Bing, and [//www.google.com/robots.txt Google].
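To illustrate the matching semantics, here is a sketch that translates Google-style glob rules ('*' as a wildcard, a trailing '$' as an end anchor) into regular expressions; the helper name is ours:

```python
import re

def glob_rule_to_regex(rule):
    # robots.txt rules are prefix matches; '*' matches any run of
    # characters and a trailing '$' anchors the end of the URL.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

disallow = glob_rule_to_regex("/wiki/*?")
print(bool(disallow.match("/wiki/Foo?action=history")))  # True: action URL blocked
print(bool(disallow.match("/wiki/Foo")))                 # False: plain view still crawled
```

So the single rule blocks crawling of all action URLs while leaving regular /wiki/ view URLs indexable.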

Dropped proposal to remove /wiki/
Removing the /wiki/ prefix would shorten and clean up read-only URLs a fair bit. The risk, however, is that we could run into name conflicts. Private resources we control can be prefixed with an underscore ('_images', '_skins'); a leading underscore is not valid in a title and is commonly used in REST APIs for private sub-resources. There are however some top-level resources that have fixed names:


 * favicon.ico, robots.txt: The articles about these tend to be capitalized (en:Favicon.ico/en:Robots.txt), so there should not be a conflict here.
 * articles prefixed with the existing /w/ entry point exist on enwiktionary (w/r/t and w/e), which would make backwards compatibility with the current /w/ entry point impossible.

The second issue sinks /wiki/ removal for now, as switching off the /w/ entry point is not really feasible in the short term. We should however consider picking non-title entry points ('/_w/' for example) in the future, so that we keep the option of moving towards cleaner read-only URLs later.