Requests for comment/Content API

https://github.com/gwicke/restbase

Problem statement
With the growing popularity of mobile apps, JavaScript in the browser and moves towards fragment caching (ESI) MediaWiki's content is increasingly accessed through web APIs. The existing MediaWiki API is not optimized for high-volume content access. Per-request overhead is relatively high (20-30ms) and caching and URL rewriting is difficult as the URL schema is not deterministic and many end points are POST-based.

The storage service RFC proposes a REST-style content interface for internal use. A part of this internal interface can also be used as a public content API. To make this work well, issues important for an external content API need to be considered in the design of the storage service.

Goals

 * Support high request volumes -- provide an efficient API to retrieve content from the mobile apps, ESI, bots etc..
 * Caching support -- no random query parameter URLs that cannot be purged.
 * Support rewriting -- use URL patterns that support URL-based rewriting in something like Varnish.
 * API versioning -- enable evolution of APIs without breaking users unnecessarily
 * Consistency -- use essentially the same URL scheme externally and internally. Return the same content internally and externally, and make links in content work in both contexts without extensive rewriting.

Resource / URI layout considerations
The design of a URI layout involves a lot of trade-offs, which are discussed in more detail in these notes. Your feedback on this is more than welcome. This is a summary of the current thinking:

API entry point
With user accounts, discussions & media now working across wikis it seems to make sense to use a central domain for the API in the WMF setup, e.g. api.wikimedia.org. In smaller setups, the /v1/ path prefix can be mapped to the API.

Page sub-resources
Page-related information like revisions or metadata are most naturally represented as sub-resources. The main issue here is that page names can contain slashes. Another issue is that URIs should be deterministic so that they can be cached.

We have decided to encode slashes in path components, so that we can use natural REST paths throughout:
 * Slashes in page title percent-encoded
 * Regular REST path for sub-resources
 * Disadvantage: inconsistency with normal read URIs
 * Disadvantage: inconsistency with normal read URIs

See the notes for other options considered.

Relative links in stored and rendered content vs. URIs
We would like to use relative links in stored content wherever possible. Links should continue to work even if the API is moved to a different path prefix. Page names containing slashes complicate this a bit, as normal browser behavior is to interpret relative links relative to the page name.

The current solution used by Parsoid is to prefix relative links in a page called  with ../../. Sadly this does not work so well when content fragments from several pages are combined in one output page, for example in Flow timelines. All links in the content would need to be rewritten so that they work with a different page name. Similar issues occur when pages are renamed.

A promising alternative for HTML content is to make all links relative to the wiki root, and make this work even for pages containing slashes by setting  in the skin. This also avoids issues with alternate path-less entry points like. Setting base href is much cheaper than rewriting all hrefs in content, and allows the combination of content fragments even where that is not easily possible (ESI).

This solution won't work for JSON responses though. More research is needed there.

Content API handled by storage service backend
An example request to a public key-value bucket as mentioned in the storage service RFC:

GET /v1/enwiki/pages/Main_Page/rev/latest/html GET /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png

See the storage service RFC for more example URLs following the same pattern.

Structured API specs

 * machine-readable
 * provide rich & auto-generated documentation / sandboxes and mocks
 * ensure that docs reflect the deployed software version & configuration by integrating them with the routing mechanism
 * Overview articles 1, 2
 * Swagger
 * popular & fairly straightforward to use per API end point
 * out of band
 * JSON schema hypermedia: Google API discovery service, Heroku
 * very powerful, but less convenient to use; perhaps a good output format for Swagger
 * supports URL discovery out of band in schemas: good for efficiency
 * Recommends linking the schema using a mime type with a profile link:
 * JSON Hypertext Application Language RFC: Standard for linking in JSON responses (in-band)