Requests for comment/API roadmap

Background
MediaWiki API has been steadily growing and adding features, and even though it provides most of the desired functionality, it has some areas in which it could be improved.

This RFC serves as an announcement of proposed breaking changes and a request for feedback on major new features. It's also Brad's todo list for after Wikimania.

Deprecation process
Discussion at the Architecture Summit in January 2014 was generally favorable to deprecations of major features, as long as we give people enough time to update. Minor changes will continue to be announced to the mediawiki-api-announce mailing list.

When it is possible for the new version of the feature to coexist with the old (e.g. prop=imageinfo and prop=fileinfo):
 * 1) The new feature will be implemented.
 * 2) The deprecation will be announced:
 * 3) * A message will be sent to the mediawiki-api-announce mailing list.
 * 4) * The old feature will report deprecation warnings.
 * 5) * Uses of the deprecated feature will be logged on the server (currently in WMF's case this is on fluorine where deployers have access).
 * 6) After a suitable timeframe (e.g. if the deprecation was in MediaWiki 1.24, during the 1.25 development cycle), usage of the deprecated feature on WMF wikis will be evaluated and the deprecated feature may be removed.

When it is not possible for the new version to coexist with the old (e.g. changing format=json):
 * 1) The new feature will be implemented, but must be explicitly requested by clients via a query parameter.
 * 2) The deprecation will be announced:
 * 3) * A message will be sent to the mediawiki-api-announce mailing list.
 * 4) * Deprecation warnings will be output when the parameter to request the new version is not given.
 * 5) * Uses of the deprecated feature will be logged privately.
 * 6) After a suitable timeframe, the new version will become the default and the old removed. The "request the new version" parameter will be silently ignored.
 * 7) The "request the new version" parameter will at some point be removed, leading to "unrecognized parameter" warnings.

When the default for a behavior is to be changed but the old behavior is not being removed (e.g. changing the default continuation to be the new easy-to-use style rather than the current query-continue):
 * 1) If not already present, a request parameter will be added to specifically request the old behavior.
 * 2) The change will be announced:
 * 3) * A message will be sent to the mediawiki-api-announce mailing list.
 * 4) * Deprecation warnings will be output when neither the select-new-version nor the select-old-version flags are used. Logs will also be made.
 * 5) After a suitable timeframe, the new version will become the default.
 * 6) Any flag to select the new version explicitly may at some point be removed, leading to "unrecognized parameter" warnings.

Remove obsolete output formats
The following output formats will be deprecated and removed:
 * wddx / wddxfm
 * yaml / yamlfm - it's identical to json anyway
 * txt / txtfm
 * dbg / dbgfm
 * dump / dumpfm

The following output formats will remain: json / jsonfm, xml / xmlfm, php / phpfm, rawfm, none.

JSON will be the preferred output format.

Comments

 * +1 Legoktm (talk) 22:29, 16 July 2014 (UTC)
 * +1 Addshore (talk) 10:59, 20 July 2014 (UTC)

JSON output as default
The API currently defaults to xmlfm when no format parameter is given. This will be changed to jsonfm.

Note this will not affect modules that use their own custom output formatters. Also, action=help will be getting its own custom output formatter (see below).

As no client should be trying to parse the *fm formats, this probably won't follow a deprecation process. It'll just be done once action=help is rewritten.

Comments

 * No brainer Legoktm (talk) 22:34, 16 July 2014 (UTC)
 * Strong support on this one Ladsgroup (talk) 08:21, 17 July 2014 (UTC)
 * +1 Addshore (talk) 10:59, 20 July 2014 (UTC)
 * +1 Protonk (talk) 22:04, 22 July 2014 (UTC)

Changes to JSON output format
The existing JSON format suffers from a number of shortcomings that make it more difficult to use than necessary. Many of these are inherited from the underlying data structure being designed for the XML format. Thus, the following changes are proposed.
 * The existing 'utf8' option will become the default.
 * A new 'ascii' request parameter will be introduced for clients who need all non-ASCII codepoints escaped.
 * Anything using '*' as a key will be renamed to something more natural. In some cases this may result in something like "query.page[1].foo['*']" becoming simply "query.page[1].foo", and in others something like "query.page[1].foo.content".
 * During the transition period, a 'sane=1' request parameter will request the new behavior.
 * Boolean result properties will use boolean true as the value, rather than the empty string. Whether a property will be present with a boolean false value or will continue to be entirely absent from the result when false will be determined on a case-by-case basis.
 * During the transition period, a 'sane=1' option will request the new behavior.
 * During the transition period, result parameters that are already being returned as booleans may accidentally change to the empty-string style in the non-sane output.
 * Page lists will be returned as arrays rather than objects with page_ids as keys. This will make it easier for clients to iterate over the results.
 * The 'indexpageids' parameter will be removed.
 * During the transition period, a 'pagesaslist=1' request parameter will request the new behavior.
 * The JSON formatter currently has a tendency to return values that are normally objects as arrays when empty (bug 10887). This will be fixable.

On the MediaWiki code side, developers will see the following changes:
 * If anything is currently returning boolean values as actual booleans rather than the API standard empty-string, code will need to be changed to preserve this behavior in the non-sane output. The exact code change is yet to be decided, but will be something along the lines of having to pass such boolean values through some method on ApiResult.
 * There will be a way to explicitly tag a PHP array in the result as "array" or "object", much like how ApiResult::setIndexedTagName is used for the XML format.
 * Ambiguous cases, such as empty arrays or some kinds of arrays with integer keys, might throw an error if not explicitly tagged. If this would bother you, comment!

Comments

 * This would seem to make the assumption that everyone can and will update their wikis and clients to the latest version within a reasonable time frame. I can still find 1.15 wikis out there, and even 1.18 or 1.19 aren't that uncommon. How do bots and such handle the varying format when they may have to deal with wikis that are on different versions? Much like the newer continue style, I think there always needs to be a parameter to indicate that the requester supports the new format. Even if you remove the old format entirely after the transition period, a bot should get an error that indicates that the old format is no longer supported, rather than having an unpredictable data format thrown at it. Rather than using "sane=1" on a temporary basis, would it not make more sense to use something like "format=json2" on a permanent basis to indicate which output format is actually in use? – RobinHood70 talk 16:14, 17 July 2014 (UTC)
 * The thing I don't like about that suggestion is that it requires up-to-date clients on up-to-date wikis to be specifying useless parameters forever. Anomie (talk) 10:41, 19 July 2014 (UTC)
 * You already have to specify the format parameter anyway. What's the difference between specifying "format=json" and "format=json2"? It allows transitional wikis/clients to support both while past and future wikis can fail or fallback gracefully. – RobinHood70 talk 15:12, 19 July 2014 (UTC)
 * Another benefit is that it makes the coding easier both during and after transition. Instead of having a bunch of "if sane" checks that later have to be removed, you're using a whole new ApiFormatJson2.php module that doesn't need to be checked for backwards compatibility at all because clients only get it if they ask for it. – RobinHood70 talk 18:06, 19 July 2014 (UTC)
 * Looking at the code, I suspect something like what I suggested wouldn't work, but I'm only passingly familiar with the MW code, so I'll leave that to you. My concern here is that a bot that works across multiple versions should know how to behave without having to do any guesswork. While the wiki version number should never be changed, it conceivably could be. Even a well-intentioned change, like "$wgVersion = MyWiki running on MW 1.23" would throw many version parsers for a loop. To that end, what about adding two new outputs to the siteinfo data:  (which I would see being a plain integer to indicate major changes only) and  ? A bot would then easily know what output format to expect and whether a sane check is required to get the new functionality. This assumes, of course, that API version and JSON changes go hand-in-hand. If we assume those might change independently, then other outputs could be added to indicate JSON version and sane requirements. – RobinHood70 talk 15:32, 23 July 2014 (UTC)
 * Just a thought on the topic of boolean values: while I agree that case-by-case is the way to go, as a general guideline, I'd like to suggest that false boolean values only be emitted if false isn't the default state. – RobinHood70 talk 14:12, 20 July 2014 (UTC)

Changes to PHP output format
The changes described above for the JSON output format will also be applied to the PHP output format, where applicable.

Comments

 * From a code perspective, are we just going to have one class that handles preparing an array for formatting, and then the subclasses just do something like  or  ? Legoktm (talk) 22:36, 16 July 2014 (UTC)

Changes to XML output format
Changes here will mostly be on the back-end; the actual data output to clients is intended to remain the same wherever possible. However, clients should be prepared for the following:
 * Result structure may no longer match the JSON format.
 * Tag and attribute names may be encoded when not conforming to XML requirements.
 * Result structure may change depending on the specific query. For example, passing both rvprop=content and rvdiffto=prev to prop=revisions will currently omit the diff from the result (bug 55371) (it should be throwing an error, but that's another bug). In the future, it's likely that this will return the content as the value of the node when rvdiffto is not supplied and as the value of a subnode of the node when it is.

For example, bug 43221 was fixed by changing the names of attributes such as "4::foo" to fit XML's restrictions. In the future, this would be fixed by either encoding the name (e.g. "_4.3A..3A.foo") or by changing the structure of output in only the XML format (e.g.  ).

On the MediaWiki code side, developers will see the following changes:
 * The XML formatter will no longer die if ApiResult::setIndexedTagName is forgotten. Instead, it will act as if that were called with something generic (e.g. ApiResult::setIndexedTagName( $array, 'item' )).
 * The XML formatter will no longer (be supposed to) raise an error when a node has both node content (ApiResult::setContent) and non-scalar attributes. Instead, it will simply shove the intended node content into a subnode.
 * Anything that's hard-coding '*' instead of using ApiResult::setContent is going to break.
 * There may be additional ApiResult calls required in some cases.

Changes to pretty-printed HTML formats
The pretty-printed HTML formats (jsonfm, xmlfm, phpfm, rawfm) will likely lose the automatic linking of links and various other bits of fanciness. They will gain a hook to allow for syntax highlighting via extensions such as Extension:SyntaxHighlight_GeSHi.

Comments
The geshi extension uses CSS provided by ResourceLoader to style highlighted syntax. Are you thinking of adding ResourceLoader support to api.php? TBH, I don't really see the point of adding syntax highlighting... Legoktm (talk) 23:04, 16 July 2014 (UTC)
 * Just as an alternative idea.... Add a index.php Special page, where you can post the api.php output to, then just add a link in the api.php introduction paragraph to this "syntax highlighted" output. Avoids mixing api.php and index.php more than that is desired. TheDJ (talk) 13:36, 17 July 2014 (UTC)

Why to remove the auto-linking feature? -- Ricordi  samoa  23:37, 25 July 2014 (UTC)

Removal of certain data from action=paraminfo
The data returned by action=paraminfo includes two items that appear to be at best incomplete and seem to have almost no possible uses:
 * 'props' is supposed to contain some sort of data structure indicating which result properties correspond to which request parameters. But the format of this data isn't even specified and the existing examples seem to be ad-hoc without any real consistency.
 * The intended use of this data appears to be for automatically generating objects with property accessors to wrap access to the MediaWiki API. But given the lack of any specification as to the data structure, I expect this has at most one actual user.
 * 'errors' is supposed to contain a list of possible errors that the API module can return. But the lists are incomplete, and in some cases cannot ever be complete since additional errors can be raised by extension hooks in code far removed from anything related to the API.
 * I imagine the intended use of this data is again for automatically generating strongly-typed errors in some library trying to wrap the MediaWiki API. But since the data is complete, any such library is already going to have to have a generic fallback. It would probably be best for it to use that fallback in all cases.

Comments

 * As an API user, seeing a list of possible errors that might occur is nice, so I can think about what might go wrong in a program flow. As a developer, if something as a hook, it's completely impossible to document what might happen. I think something like "This is a list of more common errors that might occur, but not a complete list" would be nice. Result properties are useless. Legoktm (talk) 23:07, 16 July 2014 (UTC)
 * OTOH, as an API user myself I don't much see the point to an incomplete list of errors unless there's something the program can do about the error automatically besides logging it for human attention and/or moving on to the next thing. At which point it's probably better done in the human-curated documentation (improvement of which is also planned for next quarter). But I agree that explicitly marking it as an incomplete list would be better than what we have now, which probably encourages some people to try to handle every one individually. Anomie (talk) 10:53, 19 July 2014 (UTC)

HTMLizing action=help
The output from action=help is currently a plain-text document wrapped in the usual API output formatting, with a few links and bolding added in post-processing when viewed via xmlfm. This will be changed to output an HTML document intended for viewing in a browser.

The default view of api.php will provide only general information and documentation of the main module (i.e. the bits of the current page above the "*** Modules ***" line and the credits at the bottom), with the various module names in the documentation for the 'format' and 'action' parameters linking to documentation for those modules. The documentation for action=query will similarly document only the query module itself, with links to documentation for the various 'prop', 'list', and 'meta' modules. There will be an option to output documentation for all modules on one page, likely 'all=1'.

At the same time, the various bits of text in the API help should be made localizable.

The possibility of including a version of Special:ApiSandbox on the help pages is also under consideration, although that may be left for a later iteration.

If anyone is actually using action=help from a client, please comment about your use cases if they wouldn't be satisfied by this proposal.

Comments

 * Why don't we just turn the help into a special page? Legoktm (talk) 22:34, 16 July 2014 (UTC)

Internationalizing API warnings and errors
API warnings and errors are currently returned in English, and further multiple warnings are concatenated into a single text string.

The error codes will generally not change, this will only control the human-readable messages.

The plan is for an error-language option with the following possibilities:
 * 'none' returns the message key and parameters, no human-readable message
 * 'user' uses the language in $wgLang to generate a human-readable message
 * A language code uses the specified language

The non-'none' options would have an additional option to specify whether the message should be returned as HTML, wikitext , or wikitext ignoring site-local customizations.

Errors and warnings will both be returned as arrays of objects, each object having a code, the source module (maybe not for errors?), and message data as above.

During the transition period, omitting the error-language option will produce backwards-compatible output. After, 'user' will likely be the default.

On the code side, this will entail a major reworking of the various error and warning methods in ApiBase.

Comments

 * Is there a reason we can't just provide both html and wikitext error messages, and let the user pick whichever one they want? Legoktm (talk) 23:12, 16 July 2014 (UTC)
 * I'd rather not clutter the response with useless repetition of every message in three different formats, when the client already knows which one it wants when making the request. Anomie (talk) 10:59, 19 July 2014 (UTC)

Support uselang
Some bits of the API support 'uselang' since the underlying MediaWiki methods support it. Once errors and help can be localized, it would make sense for it to offically be listed.

Comments

 * I'm not sure I agree with this. What's an example use case aside from localized error messages which was covered above? Legoktm (talk) 23:13, 16 July 2014 (UTC)
 * The parsing-related actions, mostly. And silly things like action=watch returning a UI message instead of letting the UI handle it (which reminds me, I should put that on the "to be deprecated" list). Anomie (talk) 11:02, 19 July 2014 (UTC)

Token handling
API modules that perform changes must use tokens for CSRF protection. Currently there are multiple ways to retrieve a token: action=tokens, action=query&prop=info&intoken=..., action=query&prop=revisions&rvtoken=..., action=query&list=users&ustoken=..., action=query&list=recentchanges&rctoken=.... Formerly some modules would implement their own "gettoken" parameter, although now only action=login does anything like this. Further, some modules have their own "type" of token and others use the generic "edit" token type, and which is required for a particular module is not always clear. And it's not possible to fetch both the token and the data of the page to be acted on at the same time.

The following changes will be made to token handling:
 * All existing methods of retrieving tokens will be deprecated.
 * A new meta=tokens will be added to action=query. It will work just like action=tokens does, but by virtue of being a submodule of action=query you can combine it with e.g. prop=revisions to fetch both the edit token and the page content being edited.
 * The help for every 'token' parameter will clearly indicate which token type is needed. The type will also be included in action=paraminfo.
 * Many of the existing token types will probably be merged into a single 'csrf' type, as they're already all the same token.
 * All tokens will be static, not varying based on the target of the action.
 * All tokens in core and WMF-deployed extensions are already static except for action=rollback, which depends on the title and user being rolled back. API action=rollback will accept both the existing title-and-user-based token and a new static rollback token. index.php action=rollback will continue to accept only the existing title-and-user-based token.

On the code side, the token-related methods in ApiBase will be changing. Details to come later.

Comments

 * Yes please. Also just noting that action=createaccount has it's own token handling logic similar to action=login. Legoktm (talk) 22:28, 16 July 2014 (UTC)
 * Both login and account creation will continue to need a special token to avoid login csrf. Otherwise I think this is all good.
 * Love it! Addshore (talk) 11:03, 20 July 2014 (UTC)
 * What will happen with the  currently emitted by  ? Since this needs to be updated per edit to check for page deletion since the edit started, I'd recommend just moving it to the general section of the info. – RobinHood70 talk 02:10, 27 July 2014 (UTC)

Removal of long-deprecated parameters
Analysis will be done to determine whether anyone is still using the following:


 * "watch" and "unwatch" parameters that have been replaced with "watchlist".
 * "sessionkey" parameter to action=upload and prop=stashimageinfo that was replaced with "filekey".
 * "toponly" parameter to list=usercontributions.
 * "querymodules" parameter to action=help, replaced with extended syntax for the "modules" parameter.
 * action=paraminfo will get the same treatment.
 * "title" parameter to action=watch.
 * "url" parameter to prop=langlinks

Additional deprecated parameters may also be considered.

Simplified continuation as default for action=query
Currently, this must be requested by passing an empty 'continue' parameter in the initial request. This will be changed to be the default, and the current query-continue may be requested with a 'rawcontinue' parameter.

Simplified continuation should indicate "pause points"
With the hard-to-understand query-continue continuation, it's easy for the client to know when it has a full batch of results for the current batch from the generator, so it can pause and process that batch before continuing the continuation.

The simplified continuation should support this sort of batching without having to parse the 'continue' parameter; a 'batchcomplete' boolean property in the result should suffice.

Comments

 * +1 Protonk (talk) 22:06, 22 July 2014 (UTC)

Query item count
People sometimes request a count(*) functionality for various modules, and even though there is plenty of justification to get it, the fundamental database limitation has always stopped us - counting all items is an O(N) table traversal. As a result, the clients could only do a full client-side iteration of all the data and count it locally. This wastes both the server resources and bandwidth.

It would be relatively simple to allow modules to return an integer from 0 to the relevant limit. For example, if  then the result in "count" mode would be a number 0 to 100 or "101+".

Comments

 * -- great idea! ☠ MarkAHershberger ☢ (talk) ☣ 18:15, 21 October 2013 (UTC)

Rewrite prop=imageinfo from scratch as prop=fileinfo
The code is a mess, the limit semantics make no sense, and we have several other options that don't really fit non-images.

The best thing to do here is probably to just write a prop=fileinfo module from scratch so we don't have to worry about backwards compatibility, and then deprecate prop=imageinfo.

Change defaults for "prop" parameters
Many query modules take a 'prop' parameter to specify which bits of information the client actually wants. Defaults for these parameters may be cut back or eliminated entirely. Or the prop parameter may be made required with no default.

Comments

 * I'm not sure if this one is really worth the trouble. Anomie (talk) 19:13, 16 July 2014 (UTC)

Allow paging the "titles" parameter
If too many titles/pageids/revids are given to the query module (or generator), it should page through them rather than erroring out or issuing a warning and ignoring some. This way client does not need to worry about passing too many titles; the query will simply treat it just like a generator, returning an appropriate continuation value.

Comments

 * Do we really want the client to be passing us 10000 titles just for us to tell them to retry 9950 of them? The client can as easily handle that on the client side, and save bandwidth in the process. Anomie (talk) 19:13, 16 July 2014 (UTC)
 * Anomie has a point here. I don't see a good way of submitting large numbers of titles and having them continue without having to resubmit those titles with each request, which is a waste of bandwidth. That said, it would be really nice at the client side to not have to worry about splitting page collections into smaller groupings or submitting requests every nth page or whatever approach you want to take. If a way can be found to submit a title list once and then page through it, that'd be great. Otherwise, yeah, I agree with dropping this idea. – RobinHood70 talk 16:45, 17 July 2014 (UTC)
 * Despite the API not being anywhere near "level 3 REST", I'd like to preserve the REST principle of avoiding server-stored request state (i.e. a remembered list of titles to be processed). Anomie (talk) 11:07, 19 July 2014 (UTC)
 * Agreed. Something that preserves state could leave the doors wide open to a DoS attack that would bring servers to their knees, so not a good idea. – RobinHood70 talk 17:34, 19 July 2014 (UTC)

Extension:SiteMatrix should create a query submodule
The action added by Extension:SiteMatrix, action=sitematrix, should really be a query submodule. In addition, it's output structure could be improved.

Further, this action seems to serve much the same purpose as. They could be merged somehow.

Comments

 * Actually replacing  isn't really feasible unless we can make the output entirely compatible. And doing so would be facilitated by the following proposal. Anomie (talk) 04:31, 12 September 2013 (UTC)

should be split up
Many of the options available to 's   should be split into their own meta submodules. This would be an interface cleanliness issue.

Comments

 * -- as long as there is some sort of versioning or backwards compatibility. -- ☠ MarkAHershberger ☢ (talk) ☣ 18:15, 21 October 2013 (UTC)

Module prefix limiting
Core modules should use two-letter prefixes and extension modules should use three-letter prefixes (with 'g' prohibited as the first character). The intent here is to avoid collisions between extensions and new core modules.

Comments

 * Seems unduly limiting; some core modules already use longer prefixes, and it does nothing to prevent collisions between extensions. Anomie (talk) 04:31, 12 September 2013 (UTC)

Embed the action in the URL
To facilitate directing particular actions to different API processing clusters, it would be advantageous to include the action in the URL even for POST requests. Embedding it in the PATH_INFO may make it easier to do this, but may not be possible on all hosts. As an alternative, the API could simply require that action be present in $_GET rather than $_POST.

General discussion

 * I basically everything in this. Please drop the 'may be dropped' ones. -- Krenair  (talk &bull; contribs) 20:56, 16 July 2014 (UTC)
 * I'm excited to see this move forward. I think it would be helpful if this list was prioritized, or at least ordered. Some of these are extremely easy to implement (like dropping deprecated parameters), and could be done independently of this RfC IMO. Legoktm (talk) 23:17, 16 July 2014 (UTC)
 * I feel good that I actually got things written down and will get to work on them in a focused way! ;) This RFC serves as "notify people of upcoming changes", "chance for people to object to upcoming changes", and "todo list" all at once, which is my excuse for the inclusion of easy stuff on the list. But to me it's not "seeking approval before beginning work on any of it". If you look back at the history, you'll see some of the stuff on earlier versions of this page actually has already been done (e.g. adding generator support to various actions); other easy stuff is welcome to be done similarly.Anomie (talk) 11:23, 19 July 2014 (UTC)
 * As for the "dropping deprecated parameters", the main blocker there is checking how many people still use them to determine how far in the future $DATE should be in the "These long-deprecated parameters are finally going away on $DATE" announcement. Anomie (talk) 11:23, 19 July 2014 (UTC)
 * Another idea that just occurred to me is that pretty much anything that outputs a page set gives you both the namespace ID and the full title, including the namespace. Unless there's a good reason for this that I'm not thinking of, I think you could get rid of that redundancy. – RobinHood70 talk 22:33, 19 July 2014 (UTC)