Requests for comment/API roadmap

Background
The MediaWiki API has been steadily growing and adding features. Even though it provides most of the desired functionality, I (Yurik) feel it is necessary to discuss our plans for growth, versioning, and overall development strategy.

Justification

 * Allows clients to avoid updating every time the API changes.
 * Reduces the cost of making a breaking change.
 * Organizes feature changes - if the client asks for version X, the API guarantees the capabilities of X and a result in format X.
 * Recommended API usage is shown as the latest version. If a default behavior in v1 becomes optional in v2, new developers will work with the new default (recommended) way from the start.
 * No clutter from an ever-expanding list of additional parameters - with versioning, new parameters could replace old ones, change their meaning, or be removed completely without breaking any clients.
 * Ability to obsolete capabilities in a structured way: MW supports API requests with version X and above, but gives a standard warning for anything below the latest version Y. There is no need to parse warning messages to see whether a specific feature change applies.

Requirements
API versioning must solve these real-life scenarios:
 * The client must identify itself to the host so that we can notify its developer of incorrect or suboptimal usage.
 * The client relies on a specific output format, and needs to always get that same format.
 * The client wants to use feature X. How does it check whether the feature is available?
 * The client has to be notified that feature Y is obsolete (unsure of this).
 * Updates to the core must not change API output or behavior, except for the obsolete notification.
 * An extension may add functionality to the API, and might be updated independently from the core.
 * Minimalism: All API capabilities should return only the data requested to minimize bandwidth and improve speed.

General API Proposals
api2.php ? agent=MyProgramVer42 & action=query~2 & ...

api2.php
Setting up a new versioned entry point allows us to:
 * change the overall output structure
 * reduce tolerance for incorrect requests
 * require the new 'agent' parameter when the HTTP User-Agent header is missing or begins with 'mozilla' or 'opera', to help us contact the author of a broken client. This parameter must be part of the URL's query string even for POST requests.
 * allow a new structure for warnings (see Errors and Warnings Localization below)

action~2 versioning
In addition to the versioned entry point, each action module could add its own versioning, which would allow us to:
 * remove a previously added feature, parameter, or behavior
 * change parameter naming
 * change parameter defaults
 * change the default output format
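
As a minimal sketch of how the <tt>action=query~2</tt> notation could be read, the following splits a value into a module name and a version number. The helper name <tt>parse_versioned</tt> is hypothetical, and treating an unversioned name as version 1 is an assumption based on the "v1" wording above, not a decided rule.

```python
def parse_versioned(value):
    """Split 'query~2' into ('query', 2).

    Assumption: a name without '~N' means version 1.
    """
    name, sep, version = value.partition("~")
    if not name:
        raise ValueError(f"missing module name in {value!r}")
    if not sep:
        return name, 1
    # Accept only plain ASCII digits, and no version 0.
    if not (version.isascii() and version.isdigit()) or int(version) < 1:
        raise ValueError(f"bad version in {value!r}")
    return name, int(version)
```

For example, <tt>parse_versioned("query~2")</tt> yields the pair <tt>("query", 2)</tt>, while <tt>parse_versioned("query~x")</tt> raises <tt>ValueError</tt>.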

Cleanup

 * Request rewriting (aliasing) facility for renaming modules and parameters, or any other parameter manipulations.
 * Individual module name and parameter changes are here.
 * action=watch should perhaps not return UI messages that vary with the user language.
 * JSON formatter - replace {'*': 'text'} with {'_': 'text'}
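
The JSON formatter change in the last item can be illustrated from the client side. This is only a sketch of the renaming applied to an already decoded result; the function name <tt>rename_text_key</tt> and the sample data in the usage note are made up for illustration.

```python
def rename_text_key(node, old="*", new="_"):
    """Return a copy of a decoded JSON value with every `old` key renamed."""
    if isinstance(node, dict):
        return {
            (new if key == old else key): rename_text_key(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_text_key(item, old, new) for item in node]
    return node  # strings, numbers, booleans and None are left as they are
```

Calling it on <tt>{'text': {'*': 'hello'}}</tt> gives <tt>{'text': {'_': 'hello'}}</tt>.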

Modules refactoring
Here are modules that might duplicate functionality, that appear closely related enough to be merged into one, or whose features should be moved out into a different or new module.
 * action=sitematrix (extension) should be made into a query submodule, meta=sitematrix.
 * action=sitematrix seems to partially duplicate meta=siteinfo & siprop=interwikimap.
 * meta=siteinfo should be broken up into many <tt>meta=A|B|C</tt> submodules. They don't seem to share much in common, and this approach would be cleaner to use, as well as more modular and extensible. Example: <tt>meta=namespaces|usergroups</tt>. The deciding factor between a module being meta or list could be the ability of common users to influence it. For example, a new user can be added to the wiki, so users is a list, whereas user groups are set up by the administrators, so usergroups is a meta.

Tokens
This section needs improvement. It will describe the API token infrastructure, both client usage and internal practices.
 * remove base::getToken - possibly replace by getTokenSalt
 * main:setupModule - can $gettoken be false?

generator support for other actions
It seems that in some cases, actions like <tt>watch, delete, undelete, purge, rollback, patrol</tt>, etc. require some of the <tt>action=query</tt> functionality for creating a list of pages to work on. I think it would make sense to provide ApiPageSet functionality for titles, pageids, revids, redirects, normalization, and most importantly all generators to all other relevant actions besides query. This will reduce the load on the master (one DB commit instead of multiple), reduce the number of individual module command parameters because they will reuse generator or pageset features, and allow much greater flexibility in generating a list of pages, since most modules currently do not support any of the generator options. This feature has already been submitted and is being reviewed.

Help screen cleanup
The main help screen is very long and hard to read, and is in dire need of cleanup.
 * <tt>action=help</tt> (the default when no parameters are given) should output just the list of modules with their descriptions.
 * Clicking on a module name should bring up the module's full page - <tt>action=help & modules=name</tt>.
 * The main page should have a link to show the unified screen the way it is now (good for some text searches).

Errors and Warnings Localization
MediaWiki employs the very good translatewiki.net tool for all translation needs, and I think we should use that instead of each module providing its own list of warning and error messages. We could introduce a global <tt>lang=code</tt> parameter that specifies the language the user needs the message in. In case of an error or a warning, the API will translate the message into the requested language, or into the wiki's default language if <tt>lang=(nothing)</tt>. Optionally we might decide to use the HTTP Accept-Language header. In case the <tt>lang</tt> parameter is not provided (or a magic keyword 'none'? TBD), the results are returned as arrays.

According to Manual:Messages API, message parsing could happen at both the PHP and the JavaScript level. We need an expert opinion on whether it is possible to structure this so that the final string generation (more work) is done in the browser when available, rather than on the server.

Message translation sequence for error/warning code=blbadcontinue, with params=['A','B'], and <tt>lang=ru</tt> (python style). This could be done as a method in the ApiBase, with optional override by the module subclass. The goal is to have one common repository of error messages that extensions can use, yet allow extensions to provide custom translation tables. TBD.
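
One possible shape for such a lookup, as a rough sketch only. Everything here is an assumption made for illustration: the function names, the table layout, the lookup order (module table before the common table, requested language before the default one), and the message texts, which are placeholders rather than real MediaWiki messages. The only thing taken from MediaWiki itself is the <tt>$1</tt>, <tt>$2</tt> placeholder style. When no <tt>lang</tt> is given, the sketch returns the untranslated code and parameters, which is one reading of "returned as arrays".

```python
import re

# Hypothetical common message table, {language: {code: template}}.
# The texts are placeholders invented for this sketch.
CORE_MESSAGES = {
    "en": {"blbadcontinue": "[en] bad continue value: $1, $2"},
    "ru": {"blbadcontinue": "[ru] bad continue value: $1, $2"},
}


def substitute(template, params):
    """Replace $1, $2, ... with the corresponding entries of params."""
    def repl(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(params):
            return str(params[index])
        return match.group(0)  # leave unknown placeholders untouched
    return re.sub(r"\$(\d+)", repl, template)


def render_message(code, params, lang=None, module_messages=None,
                   default_lang="en"):
    """Return translated text, or the raw code and params if lang is None."""
    raw = {"code": code, "params": list(params)}
    if lang is None:
        return raw
    # An extension's own table (if any) is consulted before the common one.
    tables = [module_messages or {}, CORE_MESSAGES]
    for candidate in (lang, default_lang):
        for table in tables:
            template = table.get(candidate, {}).get(code)
            if template is not None:
                return substitute(template, params)
    return raw  # nothing found anywhere: fall back to the untranslated form
```

With these tables, <tt>render_message("blbadcontinue", ["A", "B"], "ru")</tt> picks the <tt>ru</tt> template from the common table, and a module can override it by passing its own <tt>module_messages</tt> table.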

query~2 submodule versioning
The modular nature of <tt>action=query</tt> allows us to version the individual prop, list, and meta submodules. The proposed rules are:
 * query~2 supports all unversioned submodules that do not override the default output (like watchlistraw)
 * query~2 will not allow any extension (non-core) submodules with prefixes shorter than 3 letters or beginning with the letter 'g' (reserved for generator use).
 * a single query may not combine multiple versions of the same submodule: <tt>list=allpages|allpages~2</tt> is invalid
 * each submodule may declare the minimum query version it requires: list~2 may only work under query~2 or higher, but not under query.
 * the output of each submodule~n is placed under the root without the version number: <tt>{'submodule':..., 'pages':..., 'normalized':..., 'continue':...}</tt>
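
Two of the rules above (no mixing of versions of one submodule, and a minimum query version per submodule) could be checked along these lines. This is a sketch under assumptions: the function name, the <tt>min_query_version</tt> table, and the convention that an unversioned name means version 1 are all hypothetical.

```python
def parse_submodules(value, query_version, min_query_version=None):
    """Parse a value such as 'allpages~2|backlinks' into {name: version}.

    min_query_version maps (name, version) to the lowest query version
    that the submodule version accepts; missing entries default to 1.
    """
    min_query_version = min_query_version or {}
    chosen = {}
    for item in value.split("|"):
        name, _, ver = item.partition("~")
        version = int(ver) if ver else 1  # assumption: no suffix means v1
        if chosen.get(name, version) != version:
            raise ValueError(
                f"{name}: cannot combine versions {chosen[name]} and {version}")
        required = min_query_version.get((name, version), 1)
        if query_version < required:
            raise ValueError(
                f"{name}~{version} requires query~{required} or higher")
        chosen[name] = version
    return chosen
```

Under these assumptions <tt>list=allpages|allpages~2</tt> is rejected, while <tt>list=allpages~2|backlinks</tt> under query~2 is accepted.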

Easy continue
Easy continue allows significant client simplification. It is an API guarantee that, by simply adding all items in the 'continue' section to the next query, the client will receive all available data without accidentally skipping values because of 'limit' parameters or generator paging. This change can be made available in all query versions, and made the default in action=query~2.

A client library could have this code (uses python requests lib):
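
A minimal sketch of such a loop. To keep it runnable without network access, the HTTP call is passed in as <tt>fetch</tt>, a callable that takes the request parameters and returns the decoded result; the function name <tt>query_all</tt> is made up. The sketch assumes the proposed layout in which the 'continue' section sits at the root of the result.

```python
def query_all(params, fetch):
    """Yield every result of a query, following 'continue' until it is gone.

    params: the original request parameters (never modified).
    fetch:  callable taking a parameter dict and returning the decoded result.
    """
    request = dict(params)
    while True:
        result = fetch(request)
        yield result
        cont = result.get("continue")
        if not cont:
            return
        # Original parameters plus everything from the latest 'continue'.
        request = {**params, **cont}
```

With the requests library, <tt>fetch</tt> could be something like <tt>lambda p: requests.get(url, params=p).json()</tt>, where <tt>url</tt> is the wiki's API entry point. The client never needs to know which continuation parameters a module uses; it only copies them over.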

Query incomplete pages
Notify the client if not all properties have finished populating the 'page' element, so that the client knows to merge it with the result of the subsequent API call. E.g. <tt>action=query~2&titles=Page1|Page2&prop=links</tt> could return a result in which 'Page2' does not yet list all of its links and should be merged with the result of the next call. This change can be made for all query versions.
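
The client-side merge could look roughly like this. It is a sketch under assumptions that are not part of the proposal: pages arrive as a list of dictionaries, each page is identified by its 'title', list-valued properties accumulate across calls, and links are shown as plain strings for brevity. How the API would flag an incomplete page is left open here.

```python
def merge_pages(merged, pages, key="title"):
    """Fold one batch of page dicts into `merged`, a dict keyed by title."""
    for page in pages:
        target = merged.setdefault(page[key], {})
        for name, value in page.items():
            if isinstance(value, list):
                # List properties (e.g. links) accumulate across calls.
                target.setdefault(name, []).extend(value)
            else:
                target[name] = value
    return merged
```

A client would call <tt>merge_pages</tt> once per API result, so that the links of 'Page2' from the first and the second call end up in a single list.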

Cleanup

 * Individual module cleanup is here.
 * Move all items from 'query' to root in the result. The 'pages', 'normalized', 'continue', and all list/meta elements will be under the root element.
 * Rename 'query-continue' to 'continue' because other actions may also be continuable, or might use pageset generator.
 * All query extensions should use 3+ letter prefixes to avoid conflicts with the core
 * query~2 would always use the easy continue unless the client sets 'legacycontinue' parameter
 * Remove <tt>indexpageids=''</tt> - it won't be needed after the JSON formatter change
 * Any query module that has a <tt>prop</tt> parameter will always require it when used in production (not format=xmlfm/jsonfm/...)
 * Add the MediaWiki version with possible Git URL(s) to <tt>meta=siteinfo</tt> (same as the Special:Version page)
 * When using <tt>aplimit=max</tt>, the limits section should use parameter name, not module name: <tt>{'limits': {'aplimit': 500}}</tt> instead of <tt>{'allpages': 500}</tt>
 * Allow query to page if too many titles/pageids/revids are given. This way client does not need to worry about passing too many titles (currently api only sets a textual warning) - the query will simply treat it just like a generator, processing first N pages, and specifying that in the next query the first N values should be ignored.
 * <tt>format=json</tt> will output pages as a list, not as a dictionary: <tt>'pages': [ {}, {}, {} ]</tt> instead of <tt>'pages': { '1':{}, '2':{}, '3':{} }</tt>
 * Replace <tt>'image'</tt> with <tt>'file'</tt> in module names, parameters, and output. See this list. Modules with image as output: <tt>action=upload|parse</tt>, <tt>meta=siteinfo</tt>. Modules with params: <tt>action=delete</tt> and maybe more. Error message cleanup is harder to do, because the messages would then differ from core/GUI.
 * Attributes should always be in lower case; e.g. <tt>action=block</tt> currently gives <tt>userID</tt> back. See this list.
 * Consistency of property names: some modules have a reason, others a comment or description. Some modules have a pagesize, others a pagelen. See this list.
 * Remove deprecated params like watch/unwatch for action=edit, and aliases like dimensions/size. See this list.

Multi-writing support (Seeking comments)

 * Many wikis use multiple writing systems, and can auto-convert from one to another. Current boolean flag <tt>converttitles</tt> uses the following logic, and might need to be changed to allow for variant requests (normalize title to variant=X) or possibly other methods.

Continuation

 * Fix all modules to use <tt>continue</tt> instead of overwriting one of the original parameters
 * Both <tt>start</tt> and <tt>continue</tt> are used by <tt>AllImages</tt>, <tt>CategoryMembers</tt>, <tt>Deletedrevs</tt>, <tt>ImageInfo</tt>, <tt>UserContributions</tt>
 * <tt>from</tt> is used by <tt>AllMessages</tt>, <tt>AllUsers</tt>
 * <tt>offset</tt> is used by <tt>ExternalLinks</tt>, <tt>ExtLinksUsage</tt>, <tt>QueryPage</tt>, <tt>Search</tt>
 * <tt>prop</tt> is used by <tt>Siteinfo</tt>
 * <tt>start</tt> is used by <tt>Blocks</tt>, <tt>LogEvents</tt>, <tt>ProtectedTitles</tt>, <tt>RecentChanges</tt>, <tt>Watchlist</tt>
 * <tt>users</tt> is used by <tt>Users</tt>

Query item count
We get lots of requests to implement count(*) functionality for various modules, and even though there is plenty of justification to get it, the fundamental database limitation has always stopped us - counting all items is an O(N) table traversal. As a result, the clients could only do a full client-side iteration of all the data and count it locally. This wastes both the server resources and bandwidth.

Now, correct me if I am wrong, but it seems that frequently the client just needs to know whether the count is above a certain threshold, e.g. has a user made more than 10 edits in the last year, or does this page have more than one page linking to it. We could easily implement this with a <tt>count</tt> parameter:

? action=query~2 & ... & count=backlinks|links & bllimit=100 & pllimit=100

If the module name is listed in the <tt>count</tt> parameter, the resulting element is replaced with the count. I believe API users would be happy with this compromise, and clients that really do need the exact count could still iterate over all items while saving a lot of bandwidth. The implementation is fairly straightforward - ApiQuery.php would replace the result of any listed list or prop module with its count, stored under the module's name, and would also allow the module to optimize its internal SQL.
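
The idea of a bounded count can be sketched independently of the API. The helper names are hypothetical and <tt>rows</tt> stands in for whatever the module would iterate over; the point is that no more than <tt>limit</tt> items are ever read, so the cost is bounded by the limit rather than by the table size.

```python
from itertools import islice


def capped_count(rows, limit):
    """Count rows, but stop reading after `limit` of them."""
    return sum(1 for _ in islice(rows, limit))


def exceeds(rows, threshold):
    """True if there are more than `threshold` rows; reads at most threshold + 1."""
    return capped_count(rows, threshold + 1) > threshold
```

For the "more than 10 edits" example, a client would ask for a limit of 11 and compare the returned count with 10.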

Changes under discussion

 * per module <tt>flags</tt> parameter should replace all the boolean flags of that module:
 * query - <tt>redirects=&export=&indexpageids=''</tt> should be replaced with <tt>flags=redirects|export|indexpageids</tt>
 * imageinfo - <tt>iilocalonly</tt> should be replaced with <tt>iiflags=localonly|...</tt>, etc
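
A small sketch of what the rewrite from individual boolean parameters to a single <tt>flags</tt> value might look like on the request side. The function name is made up, and the sketch relies on API boolean parameters counting as set whenever they are present, whatever their value.

```python
def to_flags(params, boolean_names, flag_param="flags"):
    """Collapse the listed boolean parameters into one pipe-separated value."""
    # A boolean parameter counts as set if it is present at all.
    flags = [name for name in boolean_names if name in params]
    rest = {k: v for k, v in params.items() if k not in boolean_names}
    if flags:
        rest[flag_param] = "|".join(flags)
    return rest
```

For example, <tt>{'action': 'query', 'redirects': '', 'export': ''}</tt> becomes <tt>{'action': 'query', 'flags': 'redirects|export'}</tt>.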

Client Libraries
WMF should maintain simple default libraries in several popular languages for basic API usage. The libraries should implement those features that are either absolute minimum or unlikely to be implemented by the library writers, but important to the servers. Less is more - the less functionality we define as "must have", the less we will have to maintain. There should not be too much action-specific functionality, possibly with the exception of <tt>query</tt> and <tt>edit</tt>, and even there - bare minimum.
 * Request throttling
 * Token management / login
 * Agent strings
 * Error & warning handling
 * Rudimentary request/response: <tt>{'action':'query', 'list':'allpages'}</tt> → <tt>{'allpages':[...]}</tt>
 * Query continue and changed revisions detection


 * Initial language support
 * Python (ver 3)
 * JavaScript
 * .NET
 * Java

Content API
This section is for proposals related to a purely content-requesting API. Because this part has much heavier caching requirements than the rest of the API, I think it would be highly beneficial to diverge from the overall API model here in order to take full advantage of Squid or other types of caching.


 * REST-style
 * JSON-only output
 * Minimum number of parameters
 * URL rewriting

Use Cases (PLEASE EXPAND)
''' Please help us plan it by adding your usage scenario. '''
 * (mobile) Get HTML of one page/header/section/TOC
 * Embed page header/section into another site (iframe or extension)

Other use cases / wish list

 * (Parsoid) Expand a batch of templates and extension hooks with complete isolation between each member of the batch. The actions would be very similar to a combination of action=expandtemplates and action=parse, but without the parser involvement. For this, we need 1) ideally generic batching support, 2) a dedicated end point for template expansion and 3) an end point to call a tag extension hook directly. Batching is needed to amortize the per-request overheads (HTTP connection setup, PHP startup costs). Parsoid expands all templates in a page in parallel, which is not very efficient when done with one API request per template.