Requests for comment/Content API

Problem statement
With the growing popularity of mobile apps, JavaScript in the browser, and moves towards fragment caching (ESI), MediaWiki's content is increasingly accessed through web APIs. The existing MediaWiki API is not optimized for high-volume content access. Per-request overhead is relatively high (20-30ms), and caching and URL rewriting are difficult because the URL schema is not deterministic and many end points are POST-based.

The storage service RFC proposes a REST-style content interface for internal use. A part of this internal interface can also be used as a public content API. To make this work well, issues important for an external content API need to be considered in the design of the storage service.

Goals

 * Support high request volumes -- provide an efficient API to retrieve content from mobile apps, ESI, bots, etc.
 * Caching support -- no random query parameter URLs that cannot be purged.
 * Support rewriting -- use URL patterns that support URL-based rewriting in something like Varnish.
 * API versioning -- enable evolution of APIs without breaking users unnecessarily.
 * Consistency -- use essentially the same URL scheme externally and internally. Return the same content internally and externally, and make links in content work in both contexts without extensive rewriting.

Resource / URI layout considerations
The design of a URI layout involves a lot of trade-offs, which are discussed in more detail in these notes. Your feedback on this is more than welcome. This is a summary of the current thinking:

API entry point
Goals:
 * Support top-level wikis (no /wiki/ prefix)
 * Don't conflict with wiki pages
 * Be compact

Current favorite, using the fact that articles and namespaces can't start with an underscore:

Full example (latest HTML of Main Page):

See the notes for more options and detail.

Page sub-resources
Page-related information like revisions or metadata is most naturally represented as sub-resources. The main issue here is that page names can contain slashes. Another issue is that URIs should be deterministic so that they can be cached.

Main options:
 * Slashes in page title percent-encoded
    * Regular REST path for sub-resources
    * Disadvantage: inconsistency with normal read URIs
 * Slashes in page title not encoded
    * Query string for sub-resource path
    * Disadvantage: ugly and somewhat atypical query string use

See the notes for details and more options.
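
To make the two options concrete, the following sketch builds a sub-resource URI both ways. The `/wiki/_api/v1/pages/` prefix is borrowed from the storage service example further down this page; the `/wiki/` read prefix, the bare query-string layout, and both function names are assumptions made purely for illustration, not the layout settled on in the notes.

```python
from urllib.parse import quote

API_PREFIX = "/wiki/_api/v1/pages/"  # borrowed from the storage service example (assumed here)
READ_PREFIX = "/wiki/"               # conventional read path (assumed)


def encoded_title_uri(title: str, sub_resource: str) -> str:
    """Option 1: percent-encode slashes in the title, then append a REST path."""
    # safe="" forces "/" to become %2F, so the title stays a single path segment.
    return API_PREFIX + quote(title, safe="") + "/" + sub_resource


def query_string_uri(title: str, sub_resource: str) -> str:
    """Option 2: keep the title as in read URIs; carry the sub-resource in the query string."""
    # safe="/" leaves slashes untouched; this query string layout is hypothetical.
    return READ_PREFIX + quote(title, safe="/") + "?" + quote(sub_resource, safe="/")


print(encoded_title_uri("Foo/Bar", "rev/latest/html"))
print(query_string_uri("Foo/Bar", "rev/latest/html"))
```

Both functions are deterministic: the same title and sub-resource always produce the same URI, which is what makes the result cacheable and purgeable.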

Relative links in stored and rendered content vs. URIs
We would like to use relative links in stored content wherever possible. Page names containing slashes complicate this a bit, as normal browser behavior is to resolve relative links against the URL of the current page, which includes the page name.

The current solution used by Parsoid is to prefix relative links in a page called  with ../../. Sadly this does not work so well when content fragments from several pages are combined in one output page, for example in Flow timelines. All links in the content would need to be rewritten so that they work with a different page name. Similar issues occur when pages are renamed.

A promising alternative is to make all links relative to the wiki root, and make this work even for pages containing slashes by setting a base href in the skin. This also avoids issues with alternate path-less entry points. Setting base href is much cheaper than rewriting all hrefs in content, and allows the combination of content fragments even where rewriting is not easily possible (ESI).
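
The effect of a base href can be sketched with standard URL resolution, which is what browsers apply to relative links. The host name, the `/wiki/` root and the page title `Foo/Bar` below are placeholders chosen for the example, not values from this RFC.

```python
from urllib.parse import urljoin

SITE = "https://example.org"       # placeholder host
WIKI_ROOT = SITE + "/wiki/"        # value a skin might set as base href (assumed)
PAGE_URL = SITE + "/wiki/Foo/Bar"  # hypothetical page whose title contains a slash
LINK = "Main_Page"                 # link stored relative to the wiki root

# Without a base href the link is resolved against the page URL,
# so the slash in the title changes where the link points.
without_base = urljoin(PAGE_URL, LINK)

# With base href set to the wiki root the page title no longer matters,
# so the same fragment can be embedded in any page unchanged.
with_base = urljoin(WIKI_ROOT, LINK)

print(without_base)
print(with_base)
```

The first result lands under `/wiki/Foo/`, which is the mismatch that the `../../` prefixing works around; the second is independent of the embedding page.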

Content API handled by storage service backend
An example request to a public key-value bucket as mentioned in the storage service RFC:

GET /wiki/_api/v1/pages/Main_Page/rev/latest/html
GET /wiki/_api/v1/math-png/96d719730559f4399cf1ddc2ba973bbd.png

See the storage service RFC for more example URLs following the same pattern.
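
Because the path is regular, a cache or router can pick a page request apart with a single pattern. The sketch below is modelled only on the page example above; the group names (`version`, `title`, `revision`, `format`) and the function `parse_page_uri` are illustrative and not part of the storage service RFC.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Pattern modelled on the page example above; group names are illustrative.
PAGE_URI = re.compile(
    r"^/wiki/_api/(?P<version>v\d+)/pages/(?P<title>[^/]+)"
    r"/rev/(?P<revision>[^/]+)/(?P<format>[^/]+)$"
)


def parse_page_uri(path: str) -> Optional[dict]:
    """Split a page content URI into its components, or return None if it does not match."""
    match = PAGE_URI.match(path)
    if match is None:
        return None
    parts = match.groupdict()
    # Titles are assumed to arrive with slashes percent-encoded; decode them here.
    parts["title"] = unquote(parts["title"])
    return parts


print(parse_page_uri("/wiki/_api/v1/pages/Main_Page/rev/latest/html"))
print(parse_page_uri("/wiki/_api/v1/math-png/96d719730559f4399cf1ddc2ba973bbd.png"))
```

The second call returns `None` because the math-png bucket follows a different path shape; a real router would carry one pattern per bucket type.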

Structured API specs

 * machine-readable
 * provide rich & auto-generated documentation / sandboxes and mocks
 * direct integration with API end points ensures that docs are always up to date (see prototype)
 * Overview articles 1, 2
 * Swagger
    * popular & fairly straightforward to use per API end point
    * out of band
 * JSON schema hypermedia: Google API discovery service, Heroku
    * very powerful, but less convenient to use; perhaps a good output format for Swagger
    * provides URL discovery out of band in schemas
 * JSON Hypertext Application Language RFC: Standard for linking in JSON responses (in-band)
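
As a rough illustration of what a machine-readable, per-end-point description could look like, here is a minimal sketch of the page HTML end point written in the Swagger 2.0 layout. The choice of Swagger 2.0, the title, the descriptions and the parameter names are all assumptions for the example; no actual spec for this API is implied.

```python
import json
import re

PATH = "/pages/{title}/rev/{revision}/html"  # template derived from the example URL; hypothetical

spec = {
    "swagger": "2.0",
    "info": {"title": "Content API (illustrative)", "version": "1"},
    "basePath": "/wiki/_api/v1",
    "paths": {
        PATH: {
            "get": {
                "summary": "Fetch the HTML of a page revision",
                "produces": ["text/html"],
                "parameters": [
                    {"name": "title", "in": "path", "required": True, "type": "string"},
                    {"name": "revision", "in": "path", "required": True, "type": "string"},
                ],
                "responses": {"200": {"description": "HTML content of the revision"}},
            }
        }
    },
}

# Sanity check: every placeholder in the path template has a declared parameter.
declared = {p["name"] for p in spec["paths"][PATH]["get"]["parameters"]}
placeholders = set(re.findall(r"\{(\w+)\}", PATH))

print(json.dumps(spec, indent=2))
```

Keeping a description like this next to the end point code is one way documentation, sandboxes and mocks could be generated from a single source.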