Requests for comment/API roadmap

From MediaWiki.org
Request for comment: API roadmap
Component: General
Creation date: 2013-01-14
Author(s): Yurik, Anomie
Document status: redrafting in progress
Implementation status: partial

Background

The MediaWiki API has been steadily growing and adding features. Although it provides most of the desired functionality, there are several areas in which it could be improved.

The sections below lay out several sub-proposals, many of which could be implemented independently.

Proposals

Remove the XML format

Removing the XML format would allow for simplifying the remaining modules, and would reduce bugs caused by forgetting to jump through the hoops needed for the XML format to function properly. In particular:

  • The XML format limits the keys used in data structures throughout the API, as they must conform to XML's limitations on element or attribute names (e.g. bug 43221).
  • The XML format requires that every array with numeric keys have an '_element' pseudo-key added (e.g. bug 50676).
  • The XML format encourages the use of "*" as a key, as in XML that is interpreted as the text content of an element.
    • While the specific character could be changed, it still encourages use of a generic key of some sort rather than something more context-appropriate.
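
The constraints listed above can be illustrated with a small sketch. This is not MediaWiki's XML formatter; it is a hypothetical Python serializer (the function name to_xml and the sample data are assumptions) that mimics a PHP array by using a dict with integer keys. It shows why numeric keys need an '_element' name and why '*' ends up carrying the text content.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, data):
    """Serialize a dict into an XML element (hypothetical sketch, not MediaWiki code)."""
    node = ET.Element(tag)
    has_numeric_keys = any(isinstance(key, int) for key in data)
    if has_numeric_keys and '_element' not in data:
        # Numeric keys are not valid XML names, so a tag name must be supplied.
        raise ValueError(f"numeric keys under <{tag}> need an '_element' name")
    for key, value in data.items():
        if key == '_element':
            continue  # pseudo-key: only names the children, never output itself
        if key == '*':
            node.text = str(value)  # '*' becomes the element's text content
        elif isinstance(key, int):
            child = value if isinstance(value, dict) else {'*': value}
            node.append(to_xml(data['_element'], child))
        elif isinstance(value, dict):
            node.append(to_xml(key, value))
        else:
            node.set(key, str(value))  # scalar values become attributes
    return node


links = {'_element': 'pl', 0: {'ns': 0, '*': 'Foo'}, 1: {'ns': 0, '*': 'Bar'}}
print(ET.tostring(to_xml('links', links), encoding='unicode'))
```

A module that forgets the '_element' key produces data that serializes fine as JSON but fails in a formatter like this one, which is the class of bug the proposal wants to eliminate.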

Opposing this change is the fact that a significant percentage of clients (8–25%, depending on who you ask) still use the XML format.

Comments

  • Strongly oppose -- Agreed that XML is a terrible, terrible format to use with an API. However, for the sake of backwards compatibility, there is no way I could endorse this. -- MarkAHershberger (talk) 18:01, 21 October 2013 (UTC)
  • Strongly oppose -- I agree with what Mark said. I think a good number of bots still use XML to read API pages, and technologies such as .NET's LINQ to XML even encourage it. However many problems it creates, XML is an industry standard. Moving away from it seems like shooting our bots in the foot. Perhaps, rather than eliminating it completely, the older APIs could be frozen or at least deprecated and limited to basic maintenance patches, while API 2.0 could be requested specifically under api2.php, allowing bots that are JSON-compliant to use API 2.0 if they can. If full backwards compatibility is maintained, however, then I have no issues at all with 2.0 going JSON only, and my vote should instead be considered as Support. – RobinHood70 talk 05:55, 11 January 2014 (UTC)
  • Support. Removing XML as a format is completely justified going forward. Many major APIs on the web are already JSON-only and I don't believe there can be reasonable use cases for clients that can't deal with JSON. If a consumer can't deal with JSON, it's a reasonable requirement for them to upgrade to an environment that does. However, backwards compatibility for existing users and clients is fair and should be a high priority. There was already movement towards versioning our API. This would be a good example of a major change that we can only introduce in version 2. The base classes for the v1 API can be kept (could even further enforce XML compliance by asserting _element is always set etc – last I checked it is still quite easy to accidentally feed data to the Api result that breaks when formatted as XML). The v2 API base classes would require that result data is always in the form of lists or key/value pairs – never as objects that can have content. These would translate to arrays and objects in JSON, and we'd only support other formats if those formats support the concept of arrays and objects. PHP serialisation could be kept for example, though there are some security/performance concerns with that and it is of little benefit (if any) to future consumers, so we might want to just drop all non-JSON formats while at it (in the v2 API). Krinkle (talk) 23:07, 16 January 2014 (UTC)
  • Support (Obviously backwards compatibility, but I'm sure this can be deprecated and made obvious that new authors should only use JSON.) As far as I have seen, the XML API is a lot harder to use than the JSON one. For example, I am the author of Extension:MediaWikiChat, and for a reason unknown to me, when using my API with format=xml (as is the default), I get PHP fatals, but it works fine in JSON. (You could argue that maybe I'm writing the code wrongly, but there's got to be a problem with the software if it works in one format but not the other.) UltrasonicNXT (talk)
    • If the text of the exception message isn't indicating a problem in your code (e.g. forgetting to call ApiResult::setIndexedTagName on some array in the result), file a bug please. Anomie (talk) 14:04, 9 April 2014 (UTC)

JSON format cleanup

The existing JSON format suffers from a number of shortcomings that make it more difficult to use than necessary. Many of these are inherited from the underlying data structure being designed for the XML format.

  • '*' is used as a key, requiring syntax such as foo['*'] where something like foo.text or even foo._ would be more natural.
  • Page lists should be returned as arrays of page objects rather than objects mapping page_id to the page objects. This would make it easier for clients to iterate over the results.
    • This would allow for eliminating the indexpageids parameter, as its purpose is to facilitate iteration over the page list object.
    • Note that empty page lists are sometimes returned as empty arrays rather than empty objects.
  • Boolean data is indicated by returning a key with the empty string as a value (which JavaScript considers false!) for "true" and by omitting the key entirely for "false", rather than using actual booleans. This means clients have to know to test existence rather than value for these keys.
    • The API framework in the latest stable MediaWiki core has already been improved to support this. When attributes are set to boolean true/false, they remain proper true/false booleans in JSON, and still become attributes with an empty string (or absent attributes) in XML. Modules just need to update their code to make use of this (by setting $vals['foo'] = $bool; instead of if ( $bool ) { $vals['foo'] = ''; }).
      • Not exactly. The code has been updated so the XML format outputs existence properties when given a boolean (gerrit:108315), but changing all the existing modules would be a breaking change and so needs more thought than just changing things.
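
As a rough illustration of what clients currently have to do, the sketch below converts a query result in the existing JSON shape into the shape proposed above. The helper names and the BOOLEAN_KEYS tuple are assumptions made for this example; a real client would need to know which keys are existence-style booleans for each module it uses.

```python
# Keys assumed (for illustration only) to be existence-style booleans.
BOOLEAN_KEYS = ('missing', 'redirect', 'new')


def normalize_page(page):
    """Turn empty-string/absent flags into real booleans."""
    out = dict(page)
    for key in BOOLEAN_KEYS:
        out[key] = key in page  # test existence, not value: '' is falsy
    return out


def normalize_pages(query):
    """Return the pages as a list, whatever container the API used."""
    pages = query.get('pages', {})
    if isinstance(pages, dict):
        # Existing format: an object mapping page_id to page objects.
        pages = list(pages.values())
    # An empty page list may already arrive as [] rather than {}.
    return [normalize_page(page) for page in pages]


legacy = {
    'pages': {
        '42': {'pageid': 42, 'title': 'Page1', 'redirect': ''},
        '-1': {'title': 'Nope', 'missing': ''},
    }
}
for page in normalize_pages(legacy):
    print(page['title'], page['missing'], page['redirect'])
```

If the API returned arrays and real booleans directly, none of this client-side code would be needed.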

Comments

  • Are these changes worth breaking client compatibility, or worth introducing some sort of version into the json formatter? Anomie (talk) 04:31, 12 September 2013 (UTC)
  • Please do item 2 in the above list! Iterating over page objects instead of arrays is really annoying. Cgtdk (talk) 15:29, 10 February 2014 (UTC)

Other cleanup

Other miscellaneous cleanup has also been proposed:

  1. Rename various modules, parameters, and output keys, as detailed at /Naming Cleanup. This would regularize some aspects, at the cost of breaking compatibility (or requiring module aliases).
    • For example, modules, parameters, and return values with "image" in the name would be renamed to "file"
    • Some modules currently use mixed-case keys in the result, which should be in lowercase.
    • Parameters/output keys such as "comment", "reason", and "summary" are all referring to basically the same thing. Same for "length" versus "size".
  2. Remove deprecated params like watch/unwatch for action=edit, aliases like dimensions/size. Details are on /Naming Cleanup. This reduces technical debt.
  3. Core modules should use two-letter prefixes and extension modules should use three-letter prefixes (with 'g' prohibited as the first character). The intent here is to avoid collisions between extensions and new core modules.
  4. All query submodules currently place their data under a 'query' node in the result. It is suggested that this is unnecessary, and that 'pages', 'normalized', 'continue', and the like should be children of the root element. This simplifies traversal.
  5. Simplified continuation should be the default. This encourages usage of this easier-to-use method of continuation.
  6. The "prop" parameter would be required (or would default to the empty string) for normal formats, while remaining optional (or with the current defaults) for human-readable formats such as 'jsonfm'. This reduces the resource usage by clients that don't bother to limit the "prop" value to only those items they actually require.
  7. When using aplimit=max, the limits section should use parameter name, not module name: {'limits': {'aplimit': 500}} instead of {'allpages': 500}. This would make it easier for clients to match the limits result with the

Comments

  • While numbers 4 and 6 would be nice, I don't think they're worth it unless we're having a major BC break anyway. Number 7 seems entirely pointless, there is zero reason for a non-human to pay any attention to the 'limits' section at all instead of just continuing to supply "max" as the limit. Anomie (talk) 04:31, 12 September 2013 (UTC)
  • Number 3 seems unduly limiting; some core modules already use longer prefixes, and it does nothing to prevent collisions between extensions. Anomie (talk) 04:31, 12 September 2013 (UTC)

Allow paging the "titles" parameter

If too many titles/pageids/revids are given to the query module (or generator), it should page through them rather than erroring out or issuing a warning and ignoring some. This way the client does not need to worry about passing too many titles; the query will simply treat the list just like a generator, returning an appropriate continuation value.
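
For comparison, this is roughly the workaround a client needs without server-side paging: split the title list into batches and issue one request per batch. The batch size of 50 is an assumption for this sketch (the actual limit depends on the user's rights), and the function name is hypothetical.

```python
def title_batches(titles, size=50):
    """Yield successive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


titles = [f'Page{n}' for n in range(120)]
for batch in title_batches(titles):
    # Each batch becomes the value of the "titles" parameter in one request.
    params = {'action': 'query', 'prop': 'info', 'titles': '|'.join(batch)}
    print(len(batch), params['titles'][:20])
```

With the proposal implemented, the client could pass all 120 titles at once and simply follow the continuation values it gets back.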

Comments

  • With the simplified continuation, this could be done without even breaking BC. With the old-style continuation, though, the client would need to know what this new continuation parameter belonged to. Anomie (talk) 04:31, 12 September 2013 (UTC)

Embed the action in the URL

To facilitate directing particular actions to different API processing clusters, it would be advantageous to include the action in the URL even for POST requests. Embedding it in the PATH_INFO may make it easier to do this,[citation needed] but may not be possible on all hosts. As an alternative, the API could simply require that action be present in $_GET rather than $_POST.
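
A minimal sketch of the second alternative from the client's side: the action travels in the query string while everything else stays in the POST body. The endpoint URL is a placeholder and the function name is an assumption for this example.

```python
from urllib.parse import urlencode


def split_request(endpoint, params):
    """Return (url, body) with 'action' in the URL and the rest in the POST body."""
    url = endpoint + '?' + urlencode({'action': params['action']})
    body_params = {key: value for key, value in params.items() if key != 'action'}
    return url, urlencode(body_params).encode('utf-8')


url, body = split_request(
    'https://example.org/w/api.php',  # placeholder endpoint
    {'action': 'edit', 'title': 'Sandbox', 'text': 'hi'},
)
print(url)
print(body)
```

A front-end proxy could then route on the URL alone, without having to parse the request body.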

Comments

Extension:SiteMatrix should create a query submodule

The action added by Extension:SiteMatrix, action=sitematrix, should really be a query submodule meta=sitematrix. In addition, its output structure could be improved.

Further, this action seems to serve much the same purpose as meta=siteinfo&siprop=interwikimap. They could be merged somehow.

Comments

  • Actually replacing meta=siteinfo&siprop=interwikimap isn't really feasible unless we can make the output entirely compatible. And doing so would be facilitated by the following proposal. Anomie (talk) 04:31, 12 September 2013 (UTC)

meta=siteinfo should be split up

Many of the options available to meta=siteinfo's siprop should be split into their own meta submodules. This would be an interface cleanliness issue.

Comments

  • Support -- as long as there is some sort of versioning or backwards compatibility. -- MarkAHershberger (talk) 18:15, 21 October 2013 (UTC)

The distinction between 'list' and 'meta' submodules is unclear

action=query has three types of submodules: 'prop', 'list', and 'meta'. The distinction between 'prop' and non-'prop' is clear: prop modules take an ApiPageSet as input, and therefore can be fed by a generator.

But the distinction between 'meta' and 'list' is less clear. One possibility is that "list" is for things that ordinary users can alter and "meta" is for things that require highly advanced permissions or changes to the MediaWiki configuration.

Comments

Token handling improvement

API modules that perform changes must use tokens for CSRF protection. Currently there are multiple ways to retrieve a token: action=tokens, action=query&prop=info&intoken=..., action=query&prop=revisions&rvtoken=..., action=query&list=users&ustoken=..., action=query&list=recentchanges&rctoken=.... Formerly some modules would implement their own "gettoken" parameter, although now only action=login does anything like this. Further, some modules have their own "type" of token and others use the generic "edit" token type, and which is required for a particular module is not always clear.

Ideally, all types of tokens would be available from action=tokens, so that client authors need not be concerned with so many different ways of fetching tokens, even if the other options remain for certain types.

It has also been suggested that we eliminate all the different types of CSRF tokens and just use the "edit" token for everything. Opposing this suggestion is that it is advantageous for the API and the web UI to use the same tokens for the equivalent actions (e.g. if a user script is generating HTML that interacts with the web UI).

In addition, all token-using modules should explicitly specify the type of token they require in the automated help.

In the code, the fact that a module needs a token must be indicated in three ways:

  • By returning non-false from ->getTokenSalt()
  • By returning true from ->needsToken()
  • By including 'token' in the return from ->getFinalParams()

Ideally, only the first of these should be necessary.

Comments

  • Documentation update is easy. If we don't eliminate the different types of CSRF tokens, action=tokens would need the ability to specify the needed salt for certain types (apparently rollback and userrights). Anomie (talk) 04:31, 12 September 2013 (UTC)

Generator support for other actions

Note: the infrastructure for this is complete; module conversion is still needed.

Actions like delete, undelete, protect, rollback, etc. take a title or a list of titles to operate on, but in many cases they do not support being fed a list of titles from a generator as action=query does. For improved usability, many of these should support generators.

Comments

  • purge and setnotificationtimestamp have already been converted in this manner. A patch for watch exists, but is stalled. Anomie (talk) 04:31, 12 September 2013 (UTC)

Help screen cleanup

The main help screen is very long and hard to read.

  • action=help (the default when no parameters are given) should output just the list of modules with their descriptions.
  • Clicking on a module name should bring up that module's full page (action=help&modules=name).
  • The main page should have a link to show the unified screen the way it is now (good for some text searches).

Comments

  • Strong support -- It is more difficult than it should be to discover what a wiki's API allows. The page should make it possible to skim it visually to find out if the functionality needed is available on that wiki. -- MarkAHershberger (talk) 18:15, 21 October 2013 (UTC)

Errors and Warnings Localization

MediaWiki relies on translatewiki.net, a very good tool, for all translation needs. But messages returned by the API are largely unlocalized. In addition, errors and warnings are returned as flat strings that need to be parsed.

Instead, errors and warnings should be returned as arrays with both a code and a localized message. The API should accept a lang= parameter that would specify what language the client wants messages in. Values could include:

  • 'none', which would return the message key and parameters.
  • 'user', which would use the $wgLang language.
  • A language code, which would use that language.
  • If omitted, the current flat string would be returned for backwards compatibility.

Additionally, the client should be able to select among the following options (not applicable for lang=none or lang omitted):

  • Parsed HTML ($msg->parse())
  • Wikitext ($msg->text())
  • Non-customized wikitext ($msg->useDatabase( false )->text())

Comments

Query incomplete pages

The API should notify the client when not all properties have finished populating a 'page' element, so that the client knows to merge it with the result of the subsequent API call. For example, action=query~2&titles=Page1|Page2&prop=links could return the result below, in which 'Page2' does not yet contain all of its links and should be merged with the result of the next call.

'pages': [
    {
        'id': 42,
        'title': 'Page1',
        'links': [...]
    },
    {
        'id': 84,
        'title': 'Page2',
        'links': [...],
        'incomplete': '',
    },
]
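
A client-side merge along these lines might look like the sketch below. It assumes the proposed result shape shown above (a list of page objects keyed by 'id', with an existence-style 'incomplete' flag); the function name and the rule that list-valued properties are concatenated are assumptions for this example.

```python
def merge_pages(accumulated, batch):
    """Fold one response's page list into a dict keyed by page id."""
    for page in batch:
        merged = accumulated.setdefault(page['id'], {})
        for key, value in page.items():
            if key == 'incomplete':
                continue  # handled separately below
            if isinstance(value, list):
                # List-valued properties (e.g. links) grow across calls.
                merged.setdefault(key, []).extend(value)
            else:
                merged[key] = value
        if 'incomplete' in page:
            merged['incomplete'] = True
        else:
            # The latest response carried no flag, so clear any earlier one.
            merged.pop('incomplete', None)
    return accumulated


pages = {}
merge_pages(pages, [
    {'id': 42, 'title': 'Page1', 'links': ['A']},
    {'id': 84, 'title': 'Page2', 'links': ['B'], 'incomplete': ''},
])
merge_pages(pages, [{'id': 84, 'title': 'Page2', 'links': ['C']}])
print(pages[84])
```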

Comments

  • I fail to see how this is at all useful. When using multiple prop modules it can easily be the case that all pages are incomplete until the final query. It is also likely that the pages would be completed in an order that appears random to the client. And this requires that the prop module's PHP code know and reproduce the page processing order used by the SQL query. Clients should instead just run the prop queries to completion before assuming any of the page objects is complete, and if this would take an excessive amount of memory they should use smaller limits on the generator or supply fewer titles in the batch. Anomie (talk) 04:31, 12 September 2013 (UTC)

Multi-writing support

This proposal is incomplete and needs development
  • Many wikis use multiple writing systems and can auto-convert from one to another. The current boolean flag converttitles uses the following logic, and might need to be changed to allow for variant requests (normalize title to variant=X) or possibly other methods.
if ($numberOfVariant > 1 && !$titleObj->exists()) $wgContLang->findVariantLink( $title, $titleObj );

Comments

Query item count

People sometimes request a count(*) functionality for various modules, and even though there is plenty of justification to get it, the fundamental database limitation has always stopped us - counting all items is an O(N) table traversal. As a result, the clients could only do a full client-side iteration of all the data and count it locally. This wastes both the server resources and bandwidth.

It would be relatively simple to allow modules to return an integer from 0 to the relevant limit. For example, if foolimit=100 then the result in "count" mode would be a number from 0 to 100, or "101+" if more items exist.
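
The capped count can be sketched as follows: read at most limit + 1 rows and report "limit + 1 or more" if the extra row exists. This is only an illustration of the idea in Python over an arbitrary iterator; the function name is hypothetical, and a real module would apply the cap in its database query instead.

```python
from itertools import islice


def capped_count(rows, limit):
    """Count rows up to `limit`; return '<limit+1>+' if there are more."""
    # Never consume more than limit + 1 rows, so the cost stays bounded.
    seen = sum(1 for _ in islice(rows, limit + 1))
    if seen > limit:
        return f'{limit + 1}+'
    return seen


print(capped_count(iter(range(37)), 100))
print(capped_count(iter(range(100000)), 100))
```

The work done is bounded by the limit rather than by the table size, which avoids the O(N) traversal that has blocked a true count(*).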

Comments

Client Libraries

WMF should maintain simple libraries in several popular languages to illustrate basic API usage. The libraries should implement those features that are either the absolute minimum, or unlikely to be implemented by library writers but important to the servers.

  • Request throttling
  • Token management / login
  • Agent strings
  • Error & warning handling
  • Rudimentary request/response: {'action':'query', 'list':'allpages'} → {'allpages':[...]}
  • Query continue and changed revisions detection
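
To give a sense of how small such a library core could be, here is a sketch of a query loop that covers simplified continuation, basic error handling, and crude throttling. The function name, the injected fetch callable (which stands in for the HTTP layer), and the fixed delay are all assumptions for this example. The initial empty 'continue' parameter is included on the assumption that simplified continuation must be opted into where it is not the default.

```python
import time


def query_all(fetch, params, delay=0.0):
    """Yield the 'query' part of each response until continuation ends.

    `fetch` is any callable that takes a dict of request parameters and
    returns the decoded JSON response as a dict.
    """
    request = dict(params, action='query', format='json')
    last_continue = {'continue': ''}  # opt in to simplified continuation
    while True:
        result = fetch({**request, **last_continue})
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'query' in result:
            yield result['query']
        if 'continue' not in result:
            break
        # Send back every continuation value exactly as received.
        last_continue = result['continue']
        time.sleep(delay)  # crude throttling between requests
```

A caller would pass something like {'list': 'allpages'} together with a fetch function built on its HTTP library of choice, then iterate over the yielded chunks.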

Less is more - the less functionality we define as "must have", the less we will have to maintain. There should not be too much action-specific functionality, possibly with the exception of query and edit, and even there - bare minimum.

A proposed set of languages is: Python, JavaScript, .NET, Java.

Comments

  • I don't think that the WMF should take on responsibility for this; I for one don't want to be trying to maintain libraries (and fending off featuritis) for API libraries in various languages. IMO this is better left to the community. Anomie (talk) 04:31, 12 September 2013 (UTC)
    • Alternatively, the MW Release Team could take it on since this is something third parties would find very useful. -- MarkAHershberger(talk) 18:15, 21 October 2013 (UTC)
  • I think it makes more sense to have proper documentation with many examples, like the Twitter API Docs. When that is available I'm sure there will be multiple implementations available. Husky (talk) 13:10, 7 November 2013 (UTC)

Should versioning be introduced?

Currently, if an otherwise non-backwards-compatible change is to be done in a way that maintains backwards compatibility, we generally introduce a flag parameter to the module that selects the new behavior. While this works well for its purpose, it doesn't allow us to change the default to the new behavior (without breaking backwards compatibility) and it could lead to a proliferation of feature flags.

Introducing version numbers (e.g. per module) would solve the latter issue by collapsing these multiple feature flags into a single (opaque) integer, and would mitigate the former by including the idea of "default behavior" in that opaque integer.

Comments

  • In most cases, proliferation of feature flags isn't really an issue. And a deprecation process with defined timeframes could mitigate the "changing defaults" problem (e.g. "in one year, this feature will become the default and the flag will be removed"). And providing these version numbers makes a false promise that the old versions will never break BC (even for security issues or underlying core changes that make continued support impossible) and may encourage frivolous BC breaks and a proliferation of old versions that must be maintained. Anomie (talk) 04:31, 12 September 2013 (UTC)
    • Many of the other suggestions here are only possible if we have some way to allow people to continue to use their current tools. Agreed that no promises should be made to support the security issues, but there should be some sort of known deprecation period where both the new and the old API are supported where we make backward incompatible changes. -- MarkAHershberger(talk) 18:15, 21 October 2013 (UTC)

General discussion