Page Previews/API Specification

Up until now, we've mostly gotten away with using the MediaWiki API provided by TextExtracts and RESTBase, which has allowed us to scale out Page Previews to a couple of large Wikipedias without issue.

However, the requirement that certain classes of pages should be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the  API and the new Page Preview API rather than integrating them, but this is not a goal of this work.

To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.

Intros
The API returns well-formed HTML5 representing the introductory elements of a page, referred to herein as an "intro". These elements are defined as follows:
 * The first paragraph from the introductory section.
 * The first ordered, unordered, or definition list that is the next sibling of the first paragraph.
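The two rules above can be sketched as follows. This is a minimal illustration, not the API's implementation: it assumes the introductory section is available as a well-formed XML fragment, that only direct children of the section are considered, and that "next sibling" means the next element sibling. The function name `extract_intro` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Ordered, unordered, and definition lists.
LIST_TAGS = {"ol", "ul", "dl"}


def extract_intro(section_html):
    """Return the first paragraph, plus the list that immediately follows it.

    Assumes section_html is a well-formed fragment with a single root element
    (an assumption of this sketch). Returns None if there is no paragraph.
    """
    root = ET.fromstring(section_html)
    children = list(root)
    for index, child in enumerate(children):
        if child.tag != "p":
            continue
        parts = [child]
        # Only a list that is the *next* sibling of the paragraph qualifies.
        if index + 1 < len(children) and children[index + 1].tag in LIST_TAGS:
            parts.append(children[index + 1])
        for element in parts:
            # Drop text that trails the element so it is not serialised.
            element.tail = None
        return "".join(
            ET.tostring(element, encoding="unicode", method="html")
            for element in parts
        )
    return None
```

A list that is separated from the first paragraph by another element is deliberately ignored, since it is not the paragraph's next sibling.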

Plaintext intros
Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.
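One way a plaintext representation could be derived from an HTML intro is to discard all tags and keep only the text. The sketch below uses Python's standard `html.parser` module; the name `to_plaintext` and the whitespace-collapsing behaviour are assumptions made for illustration, not part of this specification.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_plaintext(intro_html):
    """Strip all markup from an HTML intro and collapse runs of whitespace."""
    collector = _TextCollector()
    collector.feed(intro_html)
    collector.close()  # flush any buffered text
    return " ".join("".join(collector.chunks).split())
```

Note that this sketch inserts no separator between block-level elements, so adjacent list items would run together; a real implementation would need to decide how lists are rendered in plaintext.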

Generic intros
The notion of a "generic" preview was introduced early on in the rewrite of Page Previews (T151054).

A generic preview should be shown when the API indicates that it cannot generate a meaningful intro for a page, even though the page may have meaningful content.

Markup allowed in an intro
By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.

Emphasis
The API must retain any bolded or italicised text in the intro, i.e. the Page Preview API must not remove,  , and   tags.

Formulae/MathML
In order to support browsers that don't support MathML, the API:
 * 1) Must remove   tags; and
 * 2) Must not remove either the inline or block layout fallback images generated by Math while parsing the page.

Super- and subscript
The API must retain all  and   tags.

Stripping of parenthetical statements
The API must remove all content enclosed within balanced parentheses.
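As a rough illustration of this rule, the sketch below removes every balanced pair of parentheses together with its content, including nested pairs, and leaves unbalanced parentheses untouched. It assumes that the parentheses themselves are removed along with their content, and it operates on plain text only: applying it to HTML would need extra care so that parentheses inside attribute values are not affected. The function name is hypothetical.

```python
def strip_parentheticals(text):
    """Remove all content enclosed in balanced parentheses.

    Unmatched "(" or ")" characters are left in place.
    """
    open_positions = []  # indices of "(" still waiting for a match
    spans = []           # (start, end) of outermost balanced pairs
    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)
        elif char == ")" and open_positions:
            start = open_positions.pop()
            # Discard spans nested inside the pair that was just closed.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, index))

    pieces = []
    position = 0
    for start, end in spans:
        pieces.append(text[position:start])
        position = end + 1
    pieces.append(text[position:])
    return "".join(pieces)
```

The whitespace surrounding a removed parenthetical is kept as-is, so a caller may want to collapse the resulting double spaces.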

Flattening inline elements
The API must replace all  and   tags with their text content, e.g.   should be flattened to   and   would be flattened to.
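Flattening can be sketched as re-emitting the markup while dropping the start and end tags of the elements to be flattened. In the sketch below the set of tag names is a parameter, and the `a` and `span` names used in the usage example are purely hypothetical stand-ins chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser


class _Flattener(HTMLParser):
    """Re-emits markup, replacing selected elements with their text content."""

    def __init__(self, flatten_tags):
        super().__init__(convert_charrefs=True)
        self.flatten_tags = set(flatten_tags)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.flatten_tags:
            # Reuse the original source text of the start tag.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit once, with no separate end tag.
        if tag not in self.flatten_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.flatten_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text was entity-decoded by the parser, so escape it again.
        self.out.append(escape(data, quote=False))


def flatten(html_text, flatten_tags):
    """Replace every element named in flatten_tags with its text content."""
    parser = _Flattener(flatten_tags)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `flatten('<p>See <a href="/wiki/X">X</a>.</p>', {"a"})` yields `<p>See X.</p>`.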

and
The API must remove any element with the  class to replicate the current behaviour of the NavPopups gadget. Additionally, the API must remove any element with the  class for compatibility with the current behaviour of TextExtracts.

Responses
A successful response from the Page Preview API must have the following properties: Where an  type property must have the following properties:

For a page in the wiki's content namespace(s)
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

For a page outside of the wiki's content namespaces
The Page Preview API must respond with 204 No Content.

The response body must be empty.

For a disambiguation page
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property of the response should be set to the intro of the page so that the client may display it if appropriate.

For a page that doesn't exist
The Page Preview API must respond with 404 Not Found.

The response body must be empty.

For a page that redirects to another page
The Page Preview API must respond with 302 Found.

The  HTTP header must be set to the URL that will get the intro for the target page.

The response body must be empty.
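From the client's point of view, the status codes described so far could be handled along the following lines. This is a sketch only: the action names and the function name are hypothetical, and it assumes that the redirect URL is carried in the standard HTTP `Location` header.

```python
def next_action(status, headers):
    """Map a Page Preview API response to a (hypothetical) client action.

    Returns a tuple of (action, redirect_url); redirect_url is only set
    for redirects.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}

    if status == 200:
        return ("render", None)       # parse the body and show a preview
    if status == 204:
        return ("no_preview", None)   # page is outside the content namespaces
    if status == 404:
        return ("missing", None)      # page doesn't exist
    if status == 302:
        # Assumption: the target URL is in the Location header.
        return ("follow", normalised.get("location"))
    return ("error", None)            # anything else is unexpected
```

Because the 204, 302, and 404 responses have empty bodies, the client only needs to parse the body in the 200 case.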

For a page that doesn't have an intro section
The Page Preview API must respond with 200 OK.

The  property of the response must be set to.

The  property of the response must be set to.

For a Wikidata item
This overrides the "For a page in the wiki's content namespace" case above.

The  property of the response must be set to "wikidata_preview".

The  property of the response must be set to the item's label.

The  property of the response must be set to the item's description.

If the item has the image property set (to I):
 * The  property of the response must be set to the   object that represents the Wikimedia Commons file referenced by I.
 * The  property of the response must be set to the   object that represents the corresponding thumbnail.

For a Wikidata item with no description
The response should be the same as in the "For a Wikidata item" case, apart from the following:

The  property of the response must be set to.