Requests for comment/Clean up URLs

Request for comment (RFC)
Clean up URLs
Component General
Creation date
Author(s) Gabriel Wicke
Document status in discussion
See Phabricator.

Wikimedia projects have traditionally used a /w/{index,api}.php entry point for all action views. The downside of this setup is that action URLs are longer than necessary and hard to derive from regular view URLs (see for example bug 17981 and bug 16659). Additionally, multiple query parameters in a URL complicate caching: index.php?title=foo&action=bar and index.php?action=bar&title=foo are equivalent queries but distinct URLs, so both would need to be purged.
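
The caching point can be illustrated with a minimal Python sketch. The helper name `same_query` is hypothetical and not part of MediaWiki; it only shows that the two URLs above carry the same parameters while differing as strings, which is what matters to a cache keyed on the raw URL.

```python
from urllib.parse import urlsplit, parse_qsl

def same_query(url_a: str, url_b: str) -> bool:
    """Return True if both URLs share a path and the same query parameters, in any order."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return a.path == b.path and sorted(parse_qsl(a.query)) == sorted(parse_qsl(b.query))

u1 = "/w/index.php?title=foo&action=bar"
u2 = "/w/index.php?action=bar&title=foo"

# Same request as far as the application is concerned ...
equivalent = same_query(u1, u2)
# ... but two different cache keys if the cache uses the URL string as-is.
distinct_keys = u1 != u2
```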

Proposal: Use /wiki/Foo?action=history instead of /w/index.php?title=Foo&action=history

Besides being shorter and easier to derive from the view URL, this form also fixes the position of the title parameter in the URL. URLs with a single additional query parameter thus become deterministic and potentially cacheable. This can be used in a RESTful content API, which will be discussed in a separate RFC.
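
As a rough illustration of how an action URL follows from the view URL under this scheme, consider the Python sketch below. The function names `view_url` and `action_url` are hypothetical, and the set of characters left unescaped is an assumption rather than MediaWiki's exact title encoding.

```python
from urllib.parse import quote

def view_url(title: str) -> str:
    """Build the read-only URL for a title (spaces become underscores)."""
    # Assumption: keep '/' and ':' literal; everything else unsafe is percent-encoded.
    return "/wiki/" + quote(title.replace(" ", "_"), safe="/:")

def action_url(title: str, action: str) -> str:
    """Derive the action URL by appending a single query parameter to the view URL."""
    return view_url(title) + "?action=" + quote(action, safe="")
```

With a single extra parameter there is only one way to write the URL: the title always sits in the path and the action always follows the question mark.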

Potential issues

Titles with embedded question marks ('Foo?') are not an issue, as those are already encoded as %3F.

IIRC, the /w/ prefix had some advantages in the past: caching could be disabled for the whole prefix, and its hits were not counted as page views in webalizer. Since our tools and MediaWiki's cache header handling have improved since 2003, that does not seem to be a factor any more.

An issue in IE6 and possibly IE7 causes the browser to disregard the Content-Type header and instead guess the content type based on the URL. This is a problem when we serve unsanitized content (wikitext for example) to logged-in IE6 users through something like ?action=raw. Tim developed a solution based on the Content-Disposition header in bug 28235 that should protect IE{6,7} from potentially dangerous content in action URLs. From reading the source, the only code still setting Content-Disposition headers seems to be the thumb handling in StreamFile, so there should no longer be any of the conflicts around this header that prevented it from being used earlier. The share of logged-in requests from IE 6 and 7 has also shrunk to 0.00479% (IE6) and 0.768% (IE7).

Migration

Rewrite URLs

Old URLs can continue to work with a simple rewrite rule in Varnish:

  • /w/index.php?title=<title>&<more query parameters> to /wiki/<title>?<more query parameters>

By doing this rewriting in Varnish we can avoid cache fragmentation. Only URLs following the new scheme need to be purged.
(From IRC RFC discussion, "a permanent redirect should be considered over a varnish rewrite.")
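
The mapping in the rule above can be sketched in Python to make its behaviour concrete. This is not VCL and not the rule that would actually be deployed; the function name `rewrite_old_url`, the handling of repeated title parameters, and the choice of unescaped characters are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, quote

def rewrite_old_url(url: str) -> str:
    """Map /w/index.php?title=<title>&<rest> to /wiki/<title>?<rest>; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.path != "/w/index.php":
        return url
    params = parse_qsl(parts.query, keep_blank_values=True)
    titles = [value for key, value in params if key == "title"]
    if not titles:
        return url  # nothing to move into the path
    # Assumption: if title is repeated, the last occurrence wins.
    title = titles[-1]
    rest = [(key, value) for key, value in params if key != "title"]
    new_url = "/wiki/" + quote(title.replace(" ", "_"), safe="/:")
    if rest:
        new_url += "?" + urlencode(rest)
    return new_url
```

Both parameter orders of an old-style URL map to the same new-style URL, which is why only URLs following the new scheme would need to be purged.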

robots.txt

To prevent search engines from indexing action pages, we can add a glob rule to robots.txt, which is supported by the major search engines including Google, MSN, Slurp (Yahoo), and Yandex:

Disallow: /wiki/*?
This rule is too simplistic; see phab:T95625.

Such glob rules are used in the robots.txt of sites like Twitter, Yahoo, Bing, and Google.
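
To see what the glob rule above would and would not match, here is a small Python sketch of wildcard matching as commonly described for robots.txt ('*' matches any run of characters, an optional trailing '$' anchors the end, and rules otherwise match as prefixes). The helper names are hypothetical, and real crawlers may differ in details.

```python
import re

def robots_pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern with '*' wildcards into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path: str, pattern: str) -> bool:
    """Return True if the path is matched by the Disallow pattern (prefix match)."""
    return robots_pattern_to_regex(pattern).match(path) is not None

RULE = "/wiki/*?"
```

Under these assumptions the rule matches every /wiki/ URL that carries a query string, while plain view URLs and titles with an encoded %3F stay crawlable.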

See also

Dropped proposal to remove /wiki/

Removing the /wiki/ prefix would shorten and clean up read-only URLs a fair bit. The risk, however, is that we could run into name conflicts. Private resources we control can be prefixed with an underscore ('_images', '_skins'); such names are not valid titles, and the convention is often used in REST APIs for private sub-resources. There are, however, some top-level resources that have fixed names:

  • favicon.ico, robots.txt: The articles about these tend to be capitalized (en:Favicon.ico/en:Robots.txt), so there should not be a conflict here.
  • /w/: Articles whose titles begin with the existing entry point prefix w/ exist on enwiktionary (w/r/t and w/e), which would make backwards compatibility with the current /w/ entry point impossible.

The second issue sinks /wiki/ removal for now, as switching off the /w/ entry point is not really feasible in the short term. We should however consider picking non-title entry points ('/_w/' for example) in the future so that we get the option to move towards cleaner read-only URLs later.