Requests for comment/Content API

Problem statement
With the growing popularity of mobile apps, JavaScript in the browser, and moves towards fragment caching (ESI), MediaWiki's content is increasingly accessed through web APIs. The existing MediaWiki API is not optimized for high-volume content access. Per-request overhead is relatively high (20-30ms), and caching and URL rewriting are not generally possible because the URL schema is not deterministic and many endpoints are POST-based.

The storage service RFC proposes a REST-style content interface for internal use. A part of this internal interface can also be used as a public content API. To make this work well, issues important for an external content API need to be considered in the design of the storage service.

Goals

 * Support high request volumes -- provide an efficient API to retrieve content for mobile apps, ESI, bots etc.
 * Caching support -- no random query parameter URLs that cannot be purged.
 * Support rewriting -- use URL patterns that support URL-based rewriting in something like Varnish.
 * API versioning -- enable evolution of APIs without breaking users unnecessarily.
 * Consistency -- use essentially the same URL scheme externally and internally. Return the same content internally and externally, and make links in content work in both contexts without extensive rewriting.

API entry points
Our page names have established URIs. Page-related sub-resources (versions) in the content API can be conveniently and intuitively exposed as subresources of the canonical article resource. Example:

Other public content that is not page-related will need other entry points. Candidates:
 * : separate entry point, does not work well for wikis without a  style prefix.
 * : Stay within the wiki namespace, but don't collide with articles as those can't start with an underscore or colon. Works well with or without  style prefix. This looks like the best option so far. Minor variations:

Deterministic URIs for caching
Deterministic URIs let us cache and purge cached requests. REST-style paths are generally deterministic, but using them also for page-related sub-resources is complicated by literal slashes in page names and public URIs. We can instead use a query string to mark up the sub-resource. The easiest method to make those deterministic is to use only a single key-value pair, so that ordering of query parameters cannot introduce non-determinism.

Options for sub-resource encoding in query strings considered are:
 * : Sounds odd, as the key does not really match the sub-resource on the right.
 * : Would require query string key order normalization (alphabetic ordering) in caches as many client libraries don't let users control the order of parameters. Requests with missing mandatory parameters or invalid combinations are rejected. Unclear how listings would be modeled in a pure key-value model. With paths those naturally fall out of incomplete paths and the trailing slash. Harder to discover valid parameter combinations; with a path any path prefix is valid.
 * : Looks more path-y, but is longer and more noisy.
 * : Short, and does not suggest a misleading key=value meaning. The path is a bit more broken up than in the second option, but looks natural and less noisy to people used to query strings. Current favorite.
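
The ordering problem with multi-parameter query strings can be illustrated with a minimal sketch. The parameter names `rev` and `format` are hypothetical and only serve the illustration; they are not part of any proposed scheme. A cache keyed on the raw URL sees the two key-value spellings as different objects unless it sorts the parameters first, while a single path-like query string has only one spelling to begin with.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_query_order(url: str) -> str:
    """Sort key-value query parameters alphabetically.

    This is the extra step a cache would need so that equivalent
    multi-parameter URLs map to a single cache key.
    """
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Hypothetical key-value spellings of the same request.
a = "/wiki/Main_Page?rev=latest&format=html"
b = "/wiki/Main_Page?format=html&rev=latest"

# Without normalization these are two distinct cache keys.
raw_keys_match = a == b

# After sorting, both collapse to the same key.
normalized_keys_match = normalize_query_order(a) == normalize_query_order(b)

# A single path-like query string needs no normalization step at all.
single = "/wiki/Main_Page?v1/rev/latest/html"
```

Purging has the same requirement: the purge request must name exactly the URL that was cached, which is only straightforward when each resource has a single spelling.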

Relative links in content
We would like to use relative links in stored content wherever possible. Page names containing slashes complicate this a bit, as normal browser behavior is to interpret relative links relative to the page name.

The current solution used by Parsoid is to prefix relative links in a page called  with ../../. Sadly this does not work so well when content fragments from several pages are combined in one output page, for example in Flow timelines. All links in the content would need to be rewritten so that they work with a different page name. Similar issues occur when pages are renamed.

A promising alternative is to make all links relative to the wiki root, and make this work even for pages containing slashes by setting  in the skin. This also avoids issues with alternate path-less entry points like. Setting base href is much cheaper than rewriting all hrefs in content, and allows the combination of content fragments even where that is not easily possible (ESI).
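
The difference between the two approaches can be sketched with Python's standard `urljoin`, which follows the same relative-reference resolution rules as browsers. The host `example.org` and the page names are hypothetical, and `resolve` is a helper introduced only for this illustration.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve(link: str, page_url: str, base_href: Optional[str] = None) -> str:
    """Resolve a link as a browser would, honoring an optional base href."""
    # The base href itself is resolved against the document URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)


slash_page = "https://example.org/wiki/Foo/Bar"   # page name contains a slash
plain_page = "https://example.org/wiki/Main_Page"

# A plain relative link is interpreted relative to the page name.
naive = resolve("Baz", slash_page)

# Prefixing with ../ fixes the link on the page it was written for ...
prefixed_ok = resolve("../Baz", slash_page)

# ... but the same fragment breaks when embedded in a page at another depth.
prefixed_broken = resolve("../Baz", plain_page)

# With a base href at the wiki root, the unprefixed link works on any page.
with_base_a = resolve("Baz", slash_page, base_href="/wiki/")
with_base_b = resolve("Baz", plain_page, base_href="/wiki/")
```

Here `naive` ends up under `/wiki/Foo/`, `prefixed_broken` escapes the `/wiki/` prefix entirely, and both `with_base_*` results point at `/wiki/Baz`.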

Strawman page-related API tunneled to Rashomon backend
Following the goal of using the same URL schema internally and externally, the page-related subresources can be made publicly available as:

 GET /wiki/Main_Page?v1/rev/latest/html -- returns latest html, purged on new revision / re-render
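
As a rough sketch of how a front end might take such a URL apart, the following splits it into page title, API version and sub-resource path. The function name, the `PageResource` type and the `/wiki/` default prefix are assumptions made for illustration; the RFC does not prescribe an implementation.

```python
from typing import NamedTuple, Tuple
from urllib.parse import unquote, urlsplit


class PageResource(NamedTuple):
    title: str                    # page name, may contain slashes
    api_version: str              # first query segment, e.g. "v1"
    subresource: Tuple[str, ...]  # remaining query segments


def parse_page_resource(url: str, wiki_prefix: str = "/wiki/") -> PageResource:
    """Split /wiki/<title>?<version>/<sub-resource path> into its parts."""
    parts = urlsplit(url)
    if not parts.path.startswith(wiki_prefix):
        raise ValueError(f"not a page URL: {url!r}")
    if not parts.query:
        raise ValueError(f"no sub-resource requested: {url!r}")
    # Everything after the prefix is the title, slashes included.
    title = unquote(parts.path[len(wiki_prefix):])
    # The query string is one path-like value, not key=value pairs.
    version, _, rest = parts.query.partition("/")
    segments = tuple(s for s in rest.split("/") if s)
    return PageResource(title, version, segments)


latest_html = parse_page_resource("/wiki/Main_Page?v1/rev/latest/html")
nested = parse_page_resource("/wiki/Foo/Bar?v1/rev/latest/html")
```

Because the sub-resource lives entirely in the query string, slashes in the page name (as in the hypothetical `Foo/Bar`) never need to be told apart from sub-resource separators.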

See the storage service RFC for more example URLs following the same pattern.

Strawman general content API handled by storage service backend
An example request to a public key-value bucket as mentioned in the storage service RFC:

 GET /wiki/_api/v1/pages/Main_Page?rev/latest/html
 GET /wiki/_api/v1/math-png/96d719730559f4399cf1ddc2ba973bbd.png
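
A minimal sketch of how the two example requests could be decomposed, assuming the path after `_api/` is laid out as version, bucket, then key. That layout is inferred from the examples above; the helper and type names are hypothetical.

```python
from typing import NamedTuple, Tuple
from urllib.parse import unquote, urlsplit


class BucketRequest(NamedTuple):
    api_version: str
    bucket: str
    key: str                      # may contain slashes
    subresource: Tuple[str, ...]  # path-like query string, possibly empty


def parse_bucket_request(url: str, api_prefix: str = "/wiki/_api/") -> BucketRequest:
    """Split /wiki/_api/<version>/<bucket>/<key>?<sub-resource path>."""
    parts = urlsplit(url)
    if not parts.path.startswith(api_prefix):
        raise ValueError(f"not a content API URL: {url!r}")
    # Split at most twice so slashes inside the key are preserved.
    pieces = parts.path[len(api_prefix):].split("/", 2)
    if len(pieces) != 3 or not all(pieces):
        raise ValueError(f"expected version/bucket/key in {url!r}")
    version, bucket, key = pieces
    segments = tuple(s for s in parts.query.split("/") if s)
    return BucketRequest(version, bucket, unquote(key), segments)


page = parse_bucket_request("/wiki/_api/v1/pages/Main_Page?rev/latest/html")
image = parse_bucket_request(
    "/wiki/_api/v1/math-png/96d719730559f4399cf1ddc2ba973bbd.png"
)
```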