Talk:Requests for comment/API roadmap
This page used the LiquidThreads extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Deprecating watchlistraw
I agree that the output of list=watchlistraw is broken and it should be fixed if we're going to have a new version of the API. But I disagree that “it does not seem to have any benefit compared with list=watchlist”. watchlist returns changes to pages on the user's watchlist (equivalent to Special:Watchlist), while watchlistraw returns the pages themselves (equivalent to Special:EditWatchlist). So the two modules do completely different things, and I think watchlistraw shouldn't be simply deprecated. (But it should be fixed.) Svick (talk) 20:57, 7 January 2013 (UTC)
- Svick, thanks for the clarification. There are two things about watchlistraw:
- it does not behave like a query module (doesn't play nice with other query modules), so if we are to keep it, it should either be a standalone action module, or merge its output with the output of the other modules (like lists or meta do).
- if it is possible to get exact same info by another query, something like generator=watchlist & prop=revisions & rvprop=content, it should not be there just for the sake of being a shortcut. I am not saying it is possible (need to review this in depth), I am saying that it appears to be very similar with other modules. Yurik (talk) 03:18, 8 January 2013 (UTC)
- I believe you can't get the information watchlistraw provides some other way. I think you still don't understand what the difference is between watchlistraw and watchlist; it certainly doesn't have anything to do with rvprop=content. For example, if I have pages A (that was changed recently by revision 40) and B (that wasn't changed recently) on my watchlist, then watchlist will return revision 40, while watchlistraw will return the pages A and B. And there is no way to get to page B from watchlist, so watchlistraw is no shortcut. Svick (talk) 12:51, 8 January 2013 (UTC)
- I think I got it now :) watchlistraw shows a list of all pages monitored by the current user, and watchlist shows the recent changes to the pages on that list. watchlist could show the same page title more than once, whereas in watchlistraw each page is always unique. We ought to do something with the naming! I guess there is no point in merging them with an extra parameter, as that would confuse everyone. Any renaming suggestions? Yurik (talk) 02:16, 9 January 2013 (UTC)
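The A/B example above can be written out as a toy model. This is only a sketch in Python with made-up data; the function names mirror the module names but are not the real MediaWiki implementation.

```python
# Toy model of the two modules. The data and function bodies are illustrative
# assumptions, not MediaWiki code.
watched_pages = ["A", "B"]                       # pages on the user's watchlist
recent_changes = [{"title": "A", "revid": 40}]   # recent edits on the wiki


def watchlist(watched, changes):
    """Recent changes to watched pages (the Special:Watchlist view)."""
    return [change for change in changes if change["title"] in watched]


def watchlistraw(watched):
    """The watched pages themselves (the Special:EditWatchlist view)."""
    return sorted(watched)


print(watchlist(watched_pages, recent_changes))  # only the change to A
print(watchlistraw(watched_pages))               # both A and B
```

Page B never shows up in the first result because it has no recent change, which is why the second module cannot be treated as a shortcut for the first.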
- In what way, specifically, does watchlistraw not behave like a query module? I see no evidence of your assertion in the page that it changes the output formatter. Anomie (talk) 14:13, 8 January 2013 (UTC)
- Sorry, looked too fast - watchlistraw does not append its result to the 'query', but creates a new root, that's why I thought it creates a new printer.
- https://en.wikipedia.org/w/api.php?action=query&list=watchlistraw%7Cwatchlist - both should be under 'query' to be consistent. Yurik (talk) 01:34, 9 January 2013 (UTC)
- Oh, I missed that bit. Yes, watchlistraw should have been putting its results under the standard 'query' node. Anomie (talk) 14:43, 9 January 2013 (UTC)
"agent" parameter
It seems to me that this is redundant to the HTTP User-Agent in all cases except XMLHttpRequest-based clients, and to be useful for most cases it would have to be required to be in $_GET (so it would show up in the webserver access logs) even when the rest of the request is posted.
In other words, I'm not sure that making it required for all clients is really a good idea. Anomie (talk) 14:16, 8 January 2013 (UTC)
- How do you propose we ensure that the client code sends their identity? I agree that it is redundant with the HTTP useragent. We could have this logic: if 'agent' parameter is given, use that, otherwise use HTTP useragent, or die() if it is not set, or the useragent appears to be a browser. Not perfect, ideas welcome. Yurik (talk) 01:28, 9 January 2013 (UTC)
- The problem is that most cases of abuse that I can think of would be detected by looking at the webserver access logs, rather than logging every successful API request. And these access logs typically record the URI accessed (including the query string) and User-Agent header, but not the request body or any other headers.
- A reasonable approximation for detecting browser-like User-Agent headers is to look for "Mozilla/". For hysterical reasons, every major browser since something like IE2 includes that string in the User-Agent. Anomie (talk) 14:41, 9 January 2013 (UTC)
$ua = !isset($_SERVER['HTTP_USER_AGENT']) ? '' : $_SERVER['HTTP_USER_AGENT']; if( !$internal && !isset($_GET['agent']) && ($ua === '' || $ua.startsWith('Mozilla') || $ua.startsWith('Opera') ) die()
- Note that only $_GET is checked, forcing the client to specify agent as part of the query string. A good error message will be needed to help devs if they accidentally POST the agent value. Opera is another browser big enough that a dev might use it for development.
- "hysterical"[sic] reasons )) Yurik (talk) 23:49, 9 January 2013 (UTC)
- With Opera's built-in user agent spoofing support, I missed that one of their many options doesn't have "Mozilla" when I was searching for example user agent strings. Good catch.
- Odd mix of PHP and non-PHP syntax. Did that come from Java, C#, or JavaScript? ;) Anomie (talk) 14:08, 10 January 2013 (UTC)
- it's a php-like pseudocode! Now, could you please review and merge the modules & badcontinue!!! =)) Yurik (talk) 14:20, 10 January 2013 (UTC)
- We're not going to be able to force users to identify themselves properly at the end of the day. Adding an agent parameter for ajax sounds like a good idea, but can be done with the current api.
- Note: Wikimedia blocks requests that don't have a user-agent header. Bawolff (talk) 22:41, 10 January 2013 (UTC)
- I will add it to the current version, but I cannot make api1 stop working if there is no agent param and the useragent is mozilla or opera. That can only be done in api2.php. Yurik (talk) 06:05, 11 January 2013 (UTC)
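For readers who want the check discussed in this thread in runnable form, here is a minimal sketch in Python rather than PHP. The function name, the error text, and the `internal` argument are assumptions made for illustration; only the rule itself (agent must be in the query string, otherwise fall back to a non-browser User-Agent) comes from the comments above.

```python
from urllib.parse import urlsplit, parse_qs

# Prefixes treated as "looks like a browser", per the discussion above.
BROWSER_PREFIXES = ("Mozilla", "Opera")


def client_identity(url, headers, internal=False):
    """Return the string identifying the client, or raise if there is none.

    Only the query string is inspected for 'agent', so the value would also
    appear in web server access logs even when the request body is POSTed.
    """
    query = parse_qs(urlsplit(url).query)
    if "agent" in query:
        return query["agent"][0]
    user_agent = headers.get("User-Agent", "")
    if internal:
        return user_agent
    if user_agent == "" or user_agent.startswith(BROWSER_PREFIXES):
        raise ValueError(
            "No client identity: pass agent=... in the query string "
            "(not in the POST body) or send a descriptive User-Agent."
        )
    return user_agent
```

A script sending `User-Agent: MyBot/1.0` passes unchanged, while a browser-based client has to add `agent=...` to the URL.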
"query~2 will not allow any extension (non-core) submodules with less than 3-letter prefixes"
And how exactly will it tell the difference? A hard-coded list of modules "allowed" to use 2-letter prefixes?
Better, perhaps, to just recommend that extensions use 3-or-more-letter prefixes. Anomie (talk) 14:18, 8 January 2013 (UTC)
- ApiQuery already has a list of all core query modules. Anything added through the global variable is an extension, and falls under the 3-letter rule. Obviously this is not a "must have" thing. Yurik (talk) 01:15, 9 January 2013 (UTC)
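A rough sketch of the rule described above, in Python. The core list here is a three-entry stand-in and the function name is hypothetical; the real list lives in ApiQuery.

```python
# Stand-in for the list of core query modules and their prefixes.
# Only a few entries are shown; this is not the full ApiQuery list.
CORE_QUERY_MODULES = {"revisions": "rv", "links": "pl", "imageinfo": "ii"}


def prefix_allowed(name, prefix, core=CORE_QUERY_MODULES, min_len=3):
    """Core modules keep their prefixes; anything else needs min_len letters."""
    if name in core:
        return True
    return len(prefix) >= min_len


print(prefix_allowed("revisions", "rv"))       # core module, short prefix is fine
print(prefix_allowed("someextension", "se"))   # extension, too short
print(prefix_allowed("someextension", "sex"))  # extension, long enough
```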
Query incomplete pages
Personally, I think this sounds not particularly useful, but probably won't hurt anything beyond making things more complicated as modules have to duplicate their SQL ORDER BY logic in PHP (but most are probably by either page_id or title, for which utility functions can be provided).
Implementation note: Easiest would be for each prop module to mark pages "incomplete" as necessary; if it goes by marking "complete" instead, then something has to keep track of which modules have marked as "complete" and set "incomplete" unless all have so marked. Also, it should not be prohibited (but may be discouraged) for a module to list everything as "incomplete" until it doesn't need to continue, in case it is not possible for some reason to sort by page. Anomie (talk) 14:31, 8 January 2013 (UTC)
- Usefulness: frameworks like pywiki let the user work with page object properties, so yielding an incomplete object could cause issues. True, this is not always the case, but frequent enough. For example, interwiki bot should analyze all interwiki links as a set, not partially.
- Implementation: yes, ORDER BY is how I thought of addressing this -- depending on how the module sorts its items, any page with the title or pageid greater or equal to the 'continue' value is marked as incomplete. Yurik (talk) 01:45, 9 January 2013 (UTC)
- As I said in the mailing list discussion, it's as easy for the client to assume everything is incomplete until it finishes with the prop continues. Your "easy continue" could return an explicit flag for this.
- For sanity's sake, we should probably provide methods like markTitlesIncompletePast( $dir, $ns, $title ), markPageIdsIncompletePast( $dir, $page_id ), and markAllPagesIncomplete(). And if for some reason the module returns having called setContinueEnumParameter() and without having called one of those methods, either assume markAllIncomplete() or throw an error. Open issues include making sure that ORDER BY and markTitlesIncompletePast() will always use the same collation and whether we need variations besides $ns then $title. Anomie (talk) 14:32, 9 January 2013 (UTC)
- Good idea about marking & detection, but I don't like the explicit flag that generator is now on the next page - it exposes the inner working of the easy continue. The implementation details should be separated from usage, and if we ever decide to do things differently internally (like continuous paging after complete items are done), we won't be able to do it. Yurik (talk) 22:53, 9 January 2013 (UTC)
- There's not a whole lot of difference between one global "incomplete" flag and one per page, IMO.
- As for "continuous paging", remember the issues I pointed out when you raised that in the mailing list discussion. We can't force every module (generators and prop modules) to process things in the same order, especially when we explicitly allow for using different orders for things like generator=categorymembers, so trying to sanely continue the props when the generated page set has changed seems to be basically impossible without having to maintain far too much state. Anomie (talk) 14:01, 10 January 2013 (UTC)
- Yes, I remember, but my point is that if the query can do it in certain cases, and without too much development cost, it just might, without breaking the interface. Or we might come up with a more brilliant way of iterating, etc. One global "incomplete" makes an assumption about the internal state - generator must be paging through results, which we won't be able to avoid once it is set. Per page gives us needed flexibility without costing much. Yurik (talk) 14:19, 10 January 2013 (UTC)
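The per-page marking discussed in this thread could look roughly like the following Python sketch. It assumes a prop module that sorts by page ID ascending; the function name and the use of an empty-string "incomplete" key are illustrative assumptions, not an existing API.

```python
def mark_page_ids_incomplete_past(pages, continue_from):
    """Flag every page whose ID is at or past the module's continue point.

    pages: dict mapping integer page ID to that page's result dict.
    continue_from: page ID the module will resume from, or None when the
                   module has nothing left to continue.
    """
    if continue_from is None:
        return pages
    for page_id, page in pages.items():
        if page_id >= continue_from:
            page["incomplete"] = ""
    return pages


pages = {42: {"title": "Page1"}, 84: {"title": "Page2"}}
mark_page_ids_incomplete_past(pages, continue_from=84)
print(pages)  # Page1 is complete; Page2 carries the 'incomplete' key
```

A module that cannot sort by page would instead flag every page until it has nothing left to continue, as suggested earlier in the thread.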
"'flags' parameter should replace all the boolean flags"
Why? It makes things more difficult on the client as it has to add and filter items from an array instead of just setting/deleting a hash key, and it has zero benefit on the server side. It also falsely groups together things that have little-to-no relationship to each other simply because they are true/false values.
OTOH, one flag-related improvement that actually would be helpful would be if 'presence' items in the result had a truthy value instead of the empty string. Anomie (talk) 14:38, 8 January 2013 (UTC)
- Not sure what you meant by the 'presence' items. Could you give an example?
- Why would it be more difficult on the client? Client forms a request like flags=redirects|converttitles the same way it forms titles=... or props=... - by passing an array instead of a string value for the parameter. { titles=['A', 'B', 'C'] } turns into the A|B|C in the request. Server does have a tiny bit more work to do, but it already does it with the other parameters.
- The benefit (and I must admit, not very substantial) is cleanliness of the interface. I don't insist on this, so if you think it's not worth it, let's drop it. Yurik (talk) 02:07, 9 January 2013 (UTC)
- Personally, I think the interface is cleaner with boolean parameters rather than cramming unrelated flags into a "flags" parameter just because they're flags. It's different for something like xxprop, where you're selecting which properties to return so there is a relationship.
- By "presence" items, I mean things like the "missing" in https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=DoesNotExist, where the page is missing if the "missing" key is present and not missing if the "missing" key is not present. Most clients are going to have to use something like isset() or an explicit comparison to the empty string, because in most languages both undefined/null and the empty string are considered "false".
- It would also be nice if flags in the request would recognize "0" as false rather than true. But since everywhere else in MediaWiki &foo=0 is interpreted as "foo is set" and since we'd probably want to keep interpreting no-value-specified (i.e. empty string) as true, I'm not sure whether that would really be the least astonishing behavior. Anomie (talk) 14:10, 9 January 2013 (UTC)
- Nah, changing to flags just for one 'missing' attribute is not worth it IMO. Flags in the server request kinda group together because one could think of one prop/list as a function call, with all extra flag parameters, but you are right that they don't relate to each other as well as xxprop=a|b|c. Let's drop it unless someone else expresses desire for this. Yurik (talk) 22:49, 9 January 2013 (UTC)
- I don't like a flag parameter because it could make using multiple modules much more complicated. At the moment I can use different modules, create each part of the URL separately and only join these URL parts. With a flag parameter I would have to collect all boolean parameters first and combine them. It also makes it harder to reuse query strings. Sometimes you only need to add some additional parameter for api.php. Having a flag parameter forces you to always recreate the URL. Merlissimo (talk) 01:42, 10 January 2013 (UTC)
- Merlissimo, I did not mean one global flags parameter, but one per module: query - flags, imageinfo -- iiflags, etc Yurik (talk) 04:05, 10 January 2013 (UTC)
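The "presence item" pitfall raised in this thread is easy to reproduce. The sketch below uses Python and a hand-written page entry shaped like the result for a nonexistent title; the two helper names are made up for the illustration.

```python
# Hand-written entry shaped like the API result for a nonexistent title:
# the "missing" key is present and its value is the empty string.
page = {"ns": 0, "title": "DoesNotExist", "missing": ""}


def is_missing_naive(page):
    """Truthiness test: gives the wrong answer because "" is falsy."""
    return bool(page.get("missing"))


def is_missing(page):
    """Key-presence test: what clients currently have to write."""
    return "missing" in page


print(is_missing_naive(page))  # False, although the page is missing
print(is_missing(page))        # True
```

If the value were truthy instead of the empty string, both tests would agree.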
What is converttitles?
btongminh over a year ago added this feature. I couldn't find any explanation at API:Query, and only this example in the mailing list, which I suspect is broken - 龙门飞甲 is shown as 龍門飛甲 in the result's convert section. Is this a one-to-one translation? Should we always enable it and add it to the <normalized> section? zh.wiki seems to be auto-redirecting without notification, making it a perfect normalization suspect. Yurik (talk) 02:25, 9 January 2013 (UTC)
- It's for the language converter, which automatically converts between language variants (used mainly in Chinese and Serbian; Serbian is a lot easier for testing as the variants are Latin and Cyrillic, which are much more identifiable to an English speaker than simplified Chinese vs traditional Chinese).
- Note: it's not exactly a "normalization" as users can create pages in both variants, and I don't think one variant is considered more "correct" than the other (don't quote me on that). I think there are cases where you wouldn't want to convert titles, but am unsure. Bawolff (talk) 22:44, 10 January 2013 (UTC)
- We could have one of these default behaviors, assuming the original title A that could be converted to AA:
- return <page title="A"> if exists, else return <page title="AA"> even if AA does not exist.
- return <page title="A"> if exists, else return <page title="AA"> if exists, else return <page title="A" missing="">
- I think this should be by-default behavior, assuming (?) that each wiki has a default alphabet variant. Plus we could override it with a langvariant=XXX parameter, or langvariant= without a value to disable conversion.
- Another issue is how to deal with the mix of normalized & converted, e.g. [[A_B]] → [[AA B]]
- normalized: [[A_B]] → [[A B]]
- converted: [[A B]] → [[AA B]] or [[A_B]] → [[AA B]]
- Filed relevant bug bug 43852. Yurik (talk) 07:11, 11 January 2013 (UTC)
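The second default behavior listed above (use A if it exists, else AA if it exists, else report A as missing) can be sketched as follows in Python. The `convert` callable and the set of existing titles are placeholders for the language converter and the page table; nothing here reflects how MediaWiki actually implements it.

```python
def resolve_title(title, existing_titles, convert):
    """Resolve a requested title under the second proposed default.

    existing_titles: set of titles that exist on the wiki (placeholder).
    convert: callable mapping a title to its converted variant (placeholder
             for the language converter).
    """
    if title in existing_titles:
        return {"title": title}
    converted = convert(title)
    if converted in existing_titles:
        return {"title": converted}
    return {"title": title, "missing": ""}


def double(title):
    """Toy converter matching the A -> AA notation used above."""
    return title * 2


print(resolve_title("A", {"A", "AA"}, double))  # A exists, so A wins
print(resolve_title("A", {"AA"}, double))       # falls back to AA
print(resolve_title("A", set(), double))        # neither exists: A, missing
```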
Make API responses more javascript friendly
As a javascript developer I personally hate a lot of the responses given by most queries.
For example:
 'pages': {
     '42': {
         'title': 'Page1',
         'links': [...]
     },
     '84': {
         'title': 'Page2',
         'links': [...],
         'incomplete': ''
     }
 }
I end up creating horrible functions like this:
https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/MobileFrontend.git;a=blob;f=javascripts/modules/mf-photo.js;h=f9c93ea3d1c2e5f245c9d07bffa653f9c1a15be7;hb=HEAD#l50
and
https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/MobileFrontend.git;a=blob;f=javascripts/specials/watchlist.js;h=4cbd13ec0bc6992e0e0f5d242698ab66f0417dd5;hb=HEAD#l4
What would be preferable for me (and I'm guessing other js developers) is something like this:
 'pages': [
     {
         'title': 'Page1',
         'links': [...],
         'id': 42
     },
     {
         'title': 'Page2',
         'links': [...],
         'incomplete': '',
         'id': 84
     }
 ]
Jdlrobson (talk) 01:09, 10 January 2013 (UTC)
- I have heard lots of complaints about this, guess the json formatter could be improved this way, and we could also remove indexpageids as I suspect it was added specifically to address this issue. Yurik (talk) 04:15, 10 January 2013 (UTC)
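Until the formatter changes, clients can do the reshaping themselves. A minimal sketch in Python (the same idea carries over to JavaScript), assuming a 'pages' object keyed by page ID as in the first example; the helper name is made up.

```python
def pages_as_list(pages):
    """Turn {'42': {...}, '84': {...}} into a list of dicts with an 'id' key."""
    return [dict(page, id=int(page_id)) for page_id, page in pages.items()]


pages = {
    "42": {"title": "Page1", "links": []},
    "84": {"title": "Page2", "links": [], "incomplete": ""},
}
for page in pages_as_list(pages):
    print(page["id"], page["title"])
```

The original dicts are left untouched; each list entry is a shallow copy with the ID added.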
Sensible defaults
I would prefer a new API to provide more in the default results and remove the need for various additional parameters.
For instance, when requesting images I would prefer the API to provide as much information about that image as possible (e.g. a list of URLs to standard thumbnail sizes along with the dimensions), and to return that response without the need for an aiprop.
The current API is so inconsistent and there are so many parameters that can be used that it is overwhelming to someone unfamiliar with it.
For example, take this case where I expected a certain behaviour and didn't get it, only to find the functionality I wanted was elsewhere: https://gerrit.wikimedia.org/r/#/c/39570/
(FWIW I think if we are doing an API 2 we would benefit most from drafting the API from scratch starting with a design document preferably implementing in a REST style - not a trivial task.) Jdlrobson (talk) 01:18, 10 January 2013 (UTC)
- This one is a bit complex, will try to break it down.
- API inconsistencies - agree, that's why I started this process, please propose individual cleanups (listed at API Cleanup; scroll up a bit, collapsible sections confuse browsers).
- Documentation - totally agree, must be cleaner and easier to use, plus we should have a cookbook with all the common recipes. Also, I think I will make it so that the api.php page shows just the list of actions & sub-actions, and one could click on them to get a full page info about each.
- Starting from scratch - possibly, but I think we should get the versioning and minor cleanup/reorg done first, since, as you can see, I already have a fairly good basic idea of where to take api2. The REST-style is a much more fundamental change, that would require a lot of discussion and a separate proposal, which might take considerable time, and would cause a lot more polemics. I suspect we won't be able to get it in until either api3 or a separate action=restquery module.
- Defaults - strongly disagree. Just like SQL teaches us, SELECT * is great for development, but horrible for production. Once you know exactly what you want for the job, you should specify exactly those properties. This lowers bandwidth and frequently allows the server to do significantly less work. Moreover, I would propose we make 'props' parameters *required* unless an "fm" format is used - this way queries will return a warning with format=xmlfm or jsonfm, but will give an error when format=xml or json is given. Yurik (talk) 04:45, 10 January 2013 (UTC)
- Sorry, I should have been clearer - I wasn't suggesting every single thing under the sun but was trying to explain that sensible defaults are important. It would be good to review frequently used parameters and provide these by default (by the way, any new API design should take existing usage into account).
- To take an example, if I'm getting an image at the very least I should be able to expect enough information to 1) create a thumbnail of that image 2) in the size of my choice which 3) links to the file page. It might also be important to serve the license to encourage fair reuse. This seems like a very common task that other API users would want to make use of. We should start any new API design from how our developers want to use it and how they are using it. Sometimes there might be clever ways to provide these changes without adding additional processing. For example to avoid thumbnail generation one could imagine placing a $1 in the url of the thumbnail to allow the size of an image to be chosen by the user via a simple string substitution.
- I've been using the API for quite some time now for mobile apps and the mobile site and have found myself frustrated about the amount of things I have to additionally request that seem obvious to me. I will try and point these out now I know this page exists when I encounter them.
- We must also remember that if you do not provide sensible defaults we are making our api difficult to learn and turning developers away (the other day I spoke to a very experienced developer in London who said they were looking at the api but found it so confusing they decided to use Flickr instead). Jdlrobson (talk) 18:29, 10 January 2013 (UTC)
- No, let's not make format=json and format=jsonfm behave differently. The only difference should be the formatting of the output, and that should boil down to "fm" adds insignificant whitespace and HTML formatting for viewing in the browser. Otherwise debugging becomes a pain because you have to remember all the things that behave differently depending on whether you're looking at the "fm" or raw versions.
- But I do agree on the defaults question: if you want something, ask for it. Maybe Jdlrobson wants dimensions and a thumbnail and buckets of reuse information, while maybe I just want to know whether it's local or from Commons. The default should strike a balance, returning information that is generally useful, cheap to query, and small enough that it's not a huge waste to send it when it's not actually wanted.
- As for REST, I've yet to see any specific examples of how such a thing might look. Looking at generic REST and ROA descriptions, though, it looks to me like it would result in the replacement of &titles=..500-titles-here.. with 500 individual queries with one title each. If that's true, no thanks. Anomie (talk) 15:38, 11 January 2013 (UTC)
- We could encourage prop to always be there by generating a warning when it's not present. This way it will work in samples and debugging, but will always tell user that the list of returned values might not be what they need or could potentially change unless they explicitly ask for it. Yurik (talk) 21:02, 11 January 2013 (UTC)
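The warn-when-prop-is-absent idea could be sketched like this in Python. The function name, the warning wording, and the default list are all assumptions for illustration; only the behavior (same output either way, plus a warning when the defaults are used) comes from the comment above.

```python
def resolve_props(params, prefix, default_props):
    """Return (props to use, warnings) for a module whose parameter is <prefix>prop.

    The request works with or without the parameter; omitting it only adds
    a warning that the default set may not be what the client needs.
    """
    key = prefix + "prop"
    if key in params:
        return params[key].split("|"), []
    warning = (
        f"{key} was not specified; returning the defaults "
        f"({'|'.join(default_props)}), which may change."
    )
    return list(default_props), [warning]


# Illustrative defaults only.
print(resolve_props({"iiprop": "url|size"}, "ii", ["timestamp", "user"]))
print(resolve_props({}, "ii", ["timestamp", "user"]))
```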
Versioning and removed features
More often than not, breaking changes that we make to the API are because we have to, and thus it would be impossible to have a previous version that supports old ways. For example, when action=login was changed, it was for security reasons - hence we wouldn't be able to have a versioned api with the old version as we can't be insecure in a previous version of the api. I'm sure quite a few other breaking changes were performance related (schema changes and what not) which we also would not be able to expose in an "old" version of the API.
There are certainly a few things that would be nice to change. Personally I don't like how action=watch returns a UI msg, there's a bunch of deprecated parameters to action=edit and a few other modules, etc. However, these are generally few and far between (and also not overly hurting people) as far as I can tell (Although I'm not all that familiar with deprecation and the API so may be wrong). Bawolff (talk) 22:49, 10 January 2013 (UTC)
- Since there is currently no way to do versioning, almost all breaking changes come out of necessity - just like your login example. With versioning, we achieve one key advantage - we can gracefully change default behavior, remove or rename parameters and modules. No one in their right mind would want to do that now because that gives marginal improvement at the expense of breaking clients.
- Hence, whatever pet peeves you have with the API, now is the good time to voice them :) Yurik (talk) 00:24, 11 January 2013 (UTC)
- On the other hand, with versioning we still have all the old, deprecated parameters to support whenever something on the backend changes. They're just "hidden", and more liable to code rot. Anomie (talk) 15:06, 11 January 2013 (UTC)
- True, but we will be able to declare end of life policy for them (a year or two), and monitor how much use the old api gets. Yurik (talk) 20:49, 11 January 2013 (UTC)
Why should all modules use "continue"?
For most, it makes sense: there is a degree of state that needs to be passed in that cannot be simply replicated by adjusting the other input parameters. Or in some cases (such as list=alllinks), normalization done on the "from" parameter may break things when naïvely applied to continuation.
But for something like list=users or meta=siteinfo, there's no need to tell the client to send ususers=A|B|C|D|E&uscontinue=C|D|E when we can as easily and correctly tell it to send ususers=C|D|E. And for something where the only thing needed to continue is the new "start from" value, why require the client to send both the old and the new?
This seems like a situation where trying to make things "cleaner" in one aspect makes it uglier in another. Anomie (talk) 15:58, 11 January 2013 (UTC)
- Possibly, although in your case I would say the continue should be continue=2 to skip two items, but the more important thing is that given an arbitrary query, there should never be a case of different continues. For example, there shouldn't be this kind of sequence required for iteration: original, original+start, original+start+continue, original+start2+continue, where start, start2, and continue are continuation values returned in query-continue with each subsequent call. Yurik (talk) 02:25, 13 January 2013 (UTC)
- I still think it's as easy to just return the remaining usernames rather than a count of how many usernames to skip. Also, don't you keep claiming that we shouldn't restrict the server to one implementation? But here you're restricting the server to process the names in the order the client gave them, instead of processing them in some order more convenient for the server.
- As for "original, original+start, original+start+continue, original+start2+continue", do you have an example of a module that actually does this? The closest I can find is prop=imageinfo when used with a generator that returns gxxlimit*X+1 results, where it would use 'start' rather than 'continue' on the last iteration of the generator. But that's still a far cry from the confusion you're claiming. Anomie (talk) 16:32, 14 January 2013 (UTC)
- I think imageinfo is where I saw it originally. My worry is that if we adopt your proposal of each next request being "original req"+"continue", instead of "current"+"continue", if the module adds a different value to "continue" (like 'xxstart' instead of 'xxcontinue'), would it iterate from the same spot? Example sequence:
- param1 & param2 (original request)
- returned continue: param3
- param1 & param2 & param3
- returned continue: param4
- param1 & param2 & param4
- Will module continue properly without having param3? Yurik (talk) 01:34, 18 January 2013 (UTC)
- If the module won't continue properly without having param3, then it should be returning param3 and param4. Right now, there's no module that does that so it's hypothetical anyway. Anomie (talk) 14:06, 18 January 2013 (UTC)
- Well, not exactly - if the param3 is "start", module does not know that it was the "start" it returned last time in continue section, so it can only assume that it would get it next time as well. But if you think this won't bite us later, sure, I will implement it as request = "original request" + "continue section" on each iteration. Generator param names will be passed to the next iteration, so that query will either copy those params from current request to the continue if the generated pageset is not done, or include the new ones from the generator's continue, and remember which params generator gave this time. Still thinking about the stupid random generator case. Yurik (talk) 15:04, 18 January 2013 (UTC)
- If a module sometimes returns "start" as its continue and sometimes returns something else when the only difference is the value of "start", then IMO it's broken. It should be returning both "start" and "somethingelse" every time in that case.
- On the other hand, if the module wants to return "start" or "somethingelse" depending on what's passed for "titles" or "dir" or something, there's no risk of confusion there.
- As for random, it can't really work as a generator anyway. Even if we add some way for it to almost work, what happens when there are multiple pages with the same random value? Say A and B both have page_random = 0.19361839. If we ask the database SELECT page_id, page_namespace, page_title FROM page WHERE page_random = 0.19361839 LIMIT 1, there's no guarantee that we'll get the same page each time. Even if MySQL will return the same one each time (as it probably will with InnoDB and its clustered indexes), it's not guaranteed by SQL. Anomie (talk) 22:05, 18 January 2013 (UTC)
- I'm debating if random should be removed from generators. If we can't make it stable and consistent, we shouldn't have it. Alternatively, the random continue can be the random value plus pageid, with some logic handling proper continue. Agree re regular continue. Yurik (talk) 13:23, 19 January 2013 (UTC)
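The rule settled on in this thread, request = "original request" + "continue section" on each iteration, can be sketched as a client loop. This is Python with a fake module standing in for the server; the "continue" key name and the fake module's behavior (returning the remaining usernames, as suggested for list=users) are assumptions for illustration.

```python
def iterate(fetch, original):
    """Yield each response, re-sending the original request plus the latest
    continue section (never the previous request plus continue)."""
    params = dict(original)
    while True:
        response = fetch(params)
        yield response
        continuation = response.get("continue")
        if not continuation:
            break
        params = {**original, **continuation}


def fake_users_module(params, limit=2):
    """Stand-in for the server: handles `limit` names, returns the rest."""
    names = params["ususers"].split("|")
    result = {"users": names[:limit]}
    remaining = names[limit:]
    if remaining:
        result["continue"] = {"ususers": "|".join(remaining)}
    return result


request = {"action": "query", "list": "users", "ususers": "A|B|C|D|E"}
for response in iterate(fake_users_module, request):
    print(response["users"])
```

Because the continue section is merged over the original request, a module is free to return either a replacement value for an existing parameter or an extra parameter, and the client loop stays the same.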
- Well, not exactly - if the param3 is "start", module does not know that it was the "start" it returned last time in continue section, so it can only assume that it would get it next time as well. But if you think this won't bite us later, sure, I will implement it as request = "original request" + "continue section" on each iteration. Generator param names will be passed to the next iteration, so that query will either copy those params from current request to the continue if the generated pageset is not done, or include the new ones from the generator's continue, and remember which params generator gave this time. Still thinking about the stupid random generator case. Yurik (talk) 15:04, 18 January 2013 (UTC)
- I think imageinfo is where I saw it originally. My worry is that if we adopt your proposal of each next request being "original req"+"continue", instead of "current"+"continue", if the module adds a different value to "continue" (like 'xxstart' instead of 'xxcontinue'), would it iterate from the same spot? Example sequence:
Removing the help output?
Current default output of api.php (Help Message) is, IMO, 'weird' behavior. Can't we just move Extension:ApiSandbox into core and be done with it? (And have an action=docs, because *someone* will ask 'what if you do not have javascript') Yuvipanda (talk) 15:08, 14 January 2013 (UTC)
- As part of the action=help (default action) rewrite, the main output should shrink considerably. The sandbox link is already there at the top (we could make it a bit more highlighted visually). Lastly, it might be a good idea to make all examples auto-redirect to the api sandbox, so that users have a better time playing with the queries. Still thinking about this one. Yurik (talk) 15:31, 14 January 2013 (UTC)
action=sitematrix
Note that this action is WMF-specific, added by Extension:SiteMatrix. It could probably have been put into $wgAPIMetaModules instead of $wgAPIModules easily enough, but it can't really be merged with meta=siteinfo&siprop=interwikimap unless we put in a hook just for that. And then meta=siteinfo&siprop=interwikimap would most likely wind up behaving differently on WMF wikis versus other MediaWiki installations, which would probably not be a good thing. Anomie (talk) 15:26, 14 January 2013 (UTC)
- It just looked very much the same, hence the confusion. Adding it to meta might be a good idea, although it got me thinking about how metas can be continued if they overfill the output. Yurik (talk) 15:33, 14 January 2013 (UTC)
- Related to action=sitematrix, I had to pull all Wikimedia wiki domains from the sitematrix API module a few hours ago in a script and it was kind of painful. There's all sorts of crazy shit in there and it has a really weird index/structure, at least as JSON. I'm not sure if others have encountered issues with it, but I think its overall structure needs further thought. MZMcBride (talk) 12:12, 2 March 2013 (UTC)
Token reform
Please! The token handling is a mess.
The problem is that some tokens need non-constant salt (e.g. a rollback token needs a page title and a username) and some don't. And while we have action=tokens, it's not complete and cannot be complete since it lacks the ability to handle tokens that need non-constant salt.
And meanwhile, we have modules saying they need a token, but not specifying what kind of token they need (e.g. action=upload), and others that provide a "gettoken" parameter to get the token (but see bug 35993).
We could use some sanity here.
But I don't understand the comment "remove base::getToken()" in the proposal page. There doesn't seem to be any such function. Anomie (talk) 16:02, 14 January 2013 (UTC)
- Ok, you seem to know much more about the topic. Could you write up how exactly the tokens should be done? I will double check on the base::getToken. Yurik (talk) 20:51, 14 January 2013 (UTC)
- It seems like action=tokens is generally the way we want to go; as for implementation, at the moment I'm leaning towards either &rollbacktitle=Namespace:Page title&rollbackuser=Anomie or &title=Namespace:Page title&user=Anomie.
- If I get time, I may tackle this in the next week or so, since it should be doable without breaking changes (at least to the client). Anomie (talk) 14:38, 16 January 2013 (UTC)
- A bit unclear (I need to see the whole idea), so write it up on the main page first, so we can criticize it! :) Yurik (talk) 01:27, 18 January 2013 (UTC)
- I don't think there's really a good reason for the extra salt in some tokens like rollbacks.... IIRC I added it to them in the web UI because they were done with GET requests via links rather than form submissions, so link sharing could accidentally share a token.
- There's not really such an issue with the API, so rollbacks via API shouldn't require any salt. If they do, that's probably something that should be fixed. brion (talk) 00:15, 26 January 2013 (UTC)
- Could you list any actions that do need extra salt? How do you see the best overall flow with regards to tokens? Yurik (talk) 17:28, 29 January 2013 (UTC)
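To make the two parameter styles floated above concrete, here is a small Python sketch that builds both query strings. Neither style is an implemented interface; the function name token_query and the use of type=rollback alongside the salt parameters are assumptions for illustration only.

```python
from urllib.parse import urlencode


def token_query(prefixed: bool) -> str:
    """Build the query string for one of the two parameter styles under discussion."""
    params = {"action": "tokens", "type": "rollback", "format": "json"}
    if prefixed:
        # Style 1: salt parameters carry the token type as a prefix.
        params["rollbacktitle"] = "Namespace:Page title"
        params["rollbackuser"] = "Anomie"
    else:
        # Style 2: generic salt parameter names shared by all token types.
        params["title"] = "Namespace:Page title"
        params["user"] = "Anomie"
    return urlencode(params)
```

The prefixed style would let one request ask for several salted tokens at once without the salt parameters colliding, while the generic style keeps the parameter list shorter.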
action=execute is not needed
Any module that wants to can easily enough use ApiPageSet. See for example ApiPurge and ApiSetNotificationTimestamp. The only thing lacking is generator support, but it looks like that would be easy enough to support too.
And IMO, "action=execute&command=foo" is redundant, that's basically what "action=foo" means. Anomie (talk) 16:14, 14 January 2013 (UTC)
- I might agree that it is redundant, but it certainly doesn't look easy to implement - there is a considerable amount of code needed to run the generator. I could think of how to restructure the code to make PageSet handle the generator= parameter, or maybe a derived class. In a way having this "apply" action would mostly be a logical grouping on the help page - all actions that work with pagesets, but that's not critical. Yurik (talk) 20:50, 14 January 2013 (UTC)
- Looks like you managed Anomie (talk) 14:04, 18 January 2013 (UTC)
- Yeah, but it wasn't straightforward, and it's buggy at the moment. Will get it done after the weekend. Yurik (talk) 14:54, 18 January 2013 (UTC)
HTTP errors
You may want to review bug 38716, where this was proposed and closed as WONTFIX.
I still agree with what I wrote there in comment 11: The API does not speak HTTP, it uses HTTP as a transport to return API results. Asking for the API to return an HTTP 400 message for an API-level error is like asking a webserver to return an ICMP error when there is an error generating the response webpage. Anomie (talk) 14:59, 16 January 2013 (UTC)
- Yes, it was proposed on IRC, and for a while I thought it would be a good idea, but even for the case of errors, this might be limiting, i.e. error due to servers too busy (lag) should not mark this URL as invalid in various caches. Unless we want to find an HTTP error code for each error api may produce. Most would be 400, non-logged in/tokens - 401, 409 - edit conflict, ... Probably all this is not worth it. Removing. Yurik (talk) 01:23, 18 January 2013 (UTC)
{ '*': 'text' } → 'text'
The one thing to note here is that we don't want to change { '*': 'text' } to 'text' if sometimes it could be { '*': 'text', 'prop': 'value' }, since that would make things more difficult for clients rather than less.
In the context of breaking everything anyway, we should probably get rid of '*' as a key entirely and use an appropriate actual name for the parameter wherever we can't do { '*': 'text' } to 'text'. Anomie (talk) 18:35, 23 January 2013 (UTC)
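The concern can be seen with a small Python sketch of the collapsing rule. The function collapse_star is hypothetical and only models the proposal; it is not how the API formats results.

```python
from typing import Any


def collapse_star(value: Any) -> Any:
    """Recursively replace {'*': text} with text, but only when '*' is the sole key."""
    if isinstance(value, dict):
        if set(value) == {"*"}:
            return value["*"]
        return {key: collapse_star(item) for key, item in value.items()}
    if isinstance(value, list):
        return [collapse_star(item) for item in value]
    return value


result = collapse_star({
    "a": {"*": "text"},                   # collapses to a plain string
    "b": {"*": "text", "prop": "value"},  # stays an object
})
```

If the same element can come back in either shape depending on the data, a client has to check the type of every value before using it, which is the "more difficult for clients" case.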
- I think this should apply only to the elements that don't have any other properties. Good question is if there are any like that. As for renaming '*' into 'text' or 'value' or 'content' - sure. Will have to go through all actions and review all that too :) Yurik (talk) 18:40, 23 January 2013 (UTC)
- In XML, I think we don't want to change this. That is, we still want to output such text as element text (e.g. <sometag>text</sometag>), not an attribute (e.g. <sometag content="text" />). Because of that, I think if * was being changed to something else, it should be a unique name that won't be used by normal (attribute-based) properties.
- Which means that the name shouldn't be value, because that's already used by the querypage module (unless querypage was changed in the new API too). The other two proposed names (text and content) should be fine. (Note: I only looked at modules that provide a machine-readable description of their output through the paraminfo module.) Svick (talk) 19:18, 23 January 2013 (UTC)
- Sounds really like we are changing the '*' for the sake of changing :)... Maybe we should think of getting rid of XML altogether... But then the question is how much extra value we are gaining from a JSON-only format? Also, don't forget that we get > 25% of all requests in XML. Yurik (talk) 19:35, 23 January 2013 (UTC)
- Well, '*' as a name rather than something in alphanumerics makes things more difficult for some people using the json format.
- Where'd you get the stat that over 25% of requests are for XML? Anomie (talk) 13:24, 24 January 2013 (UTC)
- Ok, the obvious replacement for '*' is '_' in that case :) As for 25%, that's what I saw when I looked at the performance counters -- formatters. Yurik (talk) 14:06, 24 January 2013 (UTC)
- Looks closer to 15% to me. And I suspect AWB is a big part of that. Anomie (talk) 02:10, 29 January 2013 (UTC)
- Sorry, not XML - all non-JSON, and there are about 2100 JSON (just double checked). 200 PHP + 500 XML / 2800 total is about 25% :) But all this doesn't matter - this would be a major breaking change without significant benefit. If we want to improve something, let's work on that. Do you agree with '*' to '_' conversion? Yurik (talk) 05:21, 29 January 2013 (UTC)
- I agree with brion that we should kill the current XML format. If we have to keep some sort of XML, make it something that doesn't need the rest of the API to do special things just for it.
- I still think it would be better to use logical names such as 'text' or 'content' rather than some symbol, but '_' is better than '*'. Anomie (talk) 20:24, 29 January 2013 (UTC)
- Agree with Anomie. 'content' would be my first choice, but '_' is better than '*' (since _ is not an operator in JS). FWIW, the Flickr API uses '_content' for this situation. Kaldari (talk) 21:27, 30 January 2013 (UTC)
- Some XML implementations use '#text' as the name for the text node, but that is also invalid JavaScript syntax and does not make things better.
- If you want to change * to something better, then please do not change query-continue to continue elsewhere, because that is a reserved word in many languages and can make other things difficult. Using '-' in the name is also bad style. Duplicatebug (talk) 11:25, 3 February 2013 (UTC)
- Could you give an example of the language that will have a problem with "continue"? Being a keyword by itself should not cause the languages to break when used as a property of an object - obj.continue is not the same as continue alone. Yurik (talk) 18:50, 3 February 2013 (UTC)
- It is not only a problem with 'continue' as the name of an object property (which is disallowed in ECMAScript < 5).
- It is also a problem as a variable name. When someone saves the property in a local variable, the name of the property is often used, so writing
var continue = data.continue;
does not work. db (talk) 11:33, 10 February 2013 (UTC)
- Yes, but if you use another variable name, like
var qc = data.continue;
would that work in all of the modern browsers? If so, I really don't see the problem. We will have sample code in every language to handle it, and I really like the simplicity of the one-word unambiguous tag name. Yurik (talk) 15:40, 10 February 2013 (UTC)
- Then the question is what the q stands for ;-)
- I see your point. My comment was only meant to make you think about this name, due to possible confusion on the other side. The API is not for JavaScript only; I hope you want to support bots with "API Future", too. They are also written in "compiled languages" (like Java, C, C#), not only in "script languages" (like JavaScript, Python). db (talk) 20:30, 10 February 2013 (UTC)
- Yes, I obviously want to continue supporting non-javascript languages. Moreover, API was specifically designed for bots at first (pywiki framework), and javascript usecases showed up slightly later. What I do want is to have the most concise tags for all the data, without verbosity, and with adequate support by languages. Since I don't think any of the languages would have a problem with the "continue" being a keyword, I would like to keep it as 'continue':
- C#: string @continue = data["continue"];
- C# with dynamics lib: string @continue = data.@continue; (See corner cases, although I suspect it might work without the @ too).
- Java: String queryContinue = data.getString("continue");
- etc. Yurik (talk) 10:17, 11 February 2013 (UTC)
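The same pattern in Python, where continue is also a reserved word. This is only an illustrative sketch: the response body and the xxcontinue key are made up, and the local name query_continue is an arbitrary choice.

```python
import json
import keyword

# A made-up response body with a top-level "continue" member.
response = json.loads('{"continue": {"xxcontinue": "42"}, "items": []}')

# "continue" is a reserved word in Python, so it cannot be a variable name...
is_reserved = keyword.iskeyword("continue")

# ...but as a string key it is unremarkable; only the local name has to differ.
query_continue = response["continue"]
print(is_reserved, query_continue["xxcontinue"])
```

As with the C# and Java lines above, the key is read as a string, so the keyword only constrains what the local variable is called.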
Client libraries
I'd vote "no" on this one, WMF has enough trouble without trying to maintain client libraries in numerous languages, particularly when it seems most people prefer a high-level pywikipedia-style framework rather than the bare-bones framework being proposed here.
OTOH, if someone who already maintains a client library wants to host it in WMF git rather than on github, sourceforge, or whatever, I wouldn't be opposed to that. Anomie (talk) 18:39, 23 January 2013 (UTC)
- Maybe we should do at least a reference implementation in one language (python), so that our main API concepts are fleshed out for the other API library devs. Plus it will give us good feedback by allowing us to look at the API from the user's perspective, instead of waiting for other people to tell us how it should be changed. Yurik (talk) 20:47, 27 January 2013 (UTC)
- A user's perspective? I think things along those lines would be better served by getting input from existing client developers, rather than making a toy implementation that isn't likely to see real use. Anomie (talk) 02:09, 29 January 2013 (UTC)
- Anomie, I never doubt that you are a capable bot author. My concern is with the less experienced devs overloading wiki servers. As for the sample - we could adapt some core part of the pywiki framework as "prototype", and grow the rest of pywiki on top of it. These are just thoughts with one goal in mind - a stable sample for other, non-python pywiki-based authors playing with frameworks. In any case, this shouldn't be discussed as part of the API I guess. Yurik (talk) 05:11, 29 January 2013 (UTC)
Clean up formats
yaml format can be removed, since it's now identical to json. format=txt and format=dump seem entirely pointless, and format=dbg seems redundant to format=php for real use and format=rawfm for debugging. Now for the controversial part: format=xml seems to be a major source of problems, since it needs special handling all over the place. If we keep it at all, it would be very nice to change it to something that doesn't need magic "_element" and "*" members and won't cause bugs like bug 43221 (for the last, if nothing else define some sort of reversible encoding for invalid names). This would also allow us to get rid of format=rawfm, since we won't have any more magic elements. Anomie (talk) 19:08, 23 January 2013 (UTC)
- YES YES YES. Kill XML, let us just export an associative array and have it go straight to a JSON object. brion (talk) 00:11, 26 January 2013 (UTC)
- For everyone's enjoyment, I present to you the formatting usage stats. XML gets about 500 reqs/min (drop from ~1000 3 months ago), JSON ~2100, PHP has been growing to about 200 now, YAML dropped from 1.3/min to sporadic, DBG (?!?) is consistently used at about 1.3/min, RAW frequently spikes to up to 30!!!, TXT averages 3, but the real kicker - 50 reqs per minute is the xmlfm... FML!!! Need to track and kill it with vengeance. Yurik (talk) 19:30, 26 January 2013 (UTC)
- While the numbers are interesting, they may not tell the complete tale. The xml is a good example: there are three very sharp downward steps, suggesting three very high volume but specific tools have stopped using that format. Contrariwise, there's an informal but general increasing trend in PHP, suggesting a diversity of tools are using that format. Translated, this suggests a wider range of projects might be broken by removal of php as a format, while a smaller number of projects might be broken by removal of json as a format.
- Yes, I know you're not suggesting eliminating php as a format.
- But you are suggesting the content API should shut out the more diverse community of projects which are already using the API. Amgine (talk) 19:14, 27 January 2013 (UTC)
- Amgine, I think we definitely should keep the current multi-format API model for query/action modules, with possible drop of WDDX and YAML, but on the content side we should make it uniform to take advantage of caching. If the difference between using PHP and JSON is simply replacing one built-in method with another, it shouldn't be that big of a deal.
- And yes, we can make the content data model HTTP-error-coded and possibly even non-structured blob based, removing the need for the JSON vs PHP vs XML debate altogether :)
- In other words - keep using the current API, figure out what content (e.g. html) you need for some task, and then download the blob with a different content-api call. There shouldn't even be a need for an API library. At most there will be a simple json structure to separate TOC entries/sections - depending on the call. Yurik (talk) 20:38, 27 January 2013 (UTC)
- <mind essplodes> Amgine (talk) 21:24, 28 January 2013 (UTC)
- xmlfm is the default, so testing or building new queries in the browser will use this format, or the help page is used here.
- In my opinion you should not drop XML, because not all programming languages have native support for the JSON or PHP formats, for example Java (at least in 1.6). Adding a new jar can be a blocker for this. Duplicatebug (talk) 11:22, 3 February 2013 (UTC)
- How would you feel about changing the XML format to something that would be less likely to cause issues? Something closer to an XML-format property list or WDDX. Or maybe just keeping WDDX as the XML format? Anomie (talk) 14:57, 4 February 2013 (UTC)
- In which places does the XML format cause problems? All XML-related things should be in ApiFormatXml, where nobody sees them.
- bug 43221: property names with :: are also bad in JSON.
- With an attribute name for content in JSON (like text or _continue), the XML wrapper can produce a text node out of that; then nobody needs ApiResult::setContent, if that is the problem. db (talk) 11:43, 10 February 2013 (UTC)
- I think the “special handling” Anomie was talking about is that you need to call setIndexedTagName every time you want to return a (numerical) array from the API. (There could be other situations that require special handling for XML that I didn't encounter yet.) Svick (talk) 13:39, 10 February 2013 (UTC)
- There's also how other formats have to deal with a key named "*" so XML can do its "text content with properties" thing. Anomie (talk) 14:13, 11 February 2013 (UTC)
- Property names with "::" are fine in json, any string is allowed in a key (see RFC 4627). In JavaScript they can't be accessed with the foo.bar notation, but foo['bar'] still works. Anomie (talk) 14:23, 11 February 2013 (UTC)
- Yes, * is also fine in JSON, but you must write foo['*']. In another thread some people did not want to write the string notation and wanted the object notation. So it makes no sense to have other params in string notation and break this. Then you can keep * as well. db (talk) 17:32, 3 March 2013 (UTC)
- The use of "*" is gratuitous, we can easily pick something more sensible. The use of "::" as keys in the API for things that use "::" as keys in MediaWiki core is not gratuitous. Anomie (talk) 17:10, 4 March 2013 (UTC)
Multiple formats support is awkward and, especially with XML, is just plain weird. I'd strongly like to kill all formats except for JSON. JSON is widely supported, simple, doesn't have weird-ass attributes and text contents, and generally should be a good default. Kill XML with fire, please please please! The other formats are basically equivalent to JSON (YAML was actually replaced with JSON because valid JSON is valid YAML!) and there's not much benefit to their existence. brion (talk) 00:10, 26 January 2013 (UTC)
- Serialized php is faster in php, and easier for those of us coding in php, and only slightly less efficient in bandwidth. There is a rather large code base of tools using php serialize. Amgine (talk) 17:41, 27 January 2013 (UTC)
- Note that software using PHP serialization now should be able to update to JSON by simply changing the format parameter and switching from 'unserialize' to 'json_decode'. There _shouldn't_ be differences in the decoded data format, that I know of.
- I threw together a quick benchmark: [1]
- On 2000-items of RecentChanges data, file size:
138K rc.json
134K rc.xml
187K rc.phpser
- and speed per iteration:
$ php test.php
Benchmarking xml... 4.436 ms
Benchmarking json-objects... 4.846 ms
Benchmarking json-assoc... 4.312 ms
Benchmarking php... 2.776 ms
- So yes, on ~140-190KB of tightly-packed RC data you might save 2 milliseconds of low-level parse time. I'm not convinced this is a significant savings. brion (talk) 19:46, 27 January 2013 (UTC)
- @brion: generation comparison? Amgine (talk) 19:58, 27 January 2013 (UTC)
- That's actually a much stronger argument for JSON, alas...
- So, I'm wrong on the speed, and I apologize for that one. Amgine (talk) 20:25, 27 January 2013 (UTC)
$ php bench.php
Benchmarking json-objects... 7.774 ms
Benchmarking json-assoc... 7.720 ms
Benchmarking php... 12.301 ms
Mysterious api base path
Another grievance worth fixing: Interwiki site info can be used to create an article URL, but not to access the remote site's API. MediaWiki's suggested Apache rewrite rules yield unguessably different paths for index.php and api.php. My first recommendation would be to fix our suggested rewrite rules. But as a fallback, it would be smart if index.php were able to delegate to api.php by some mechanism, for example: https://en.wikipedia.org/wiki/?api=v2&action=query&generator=allpages&gaplimit=5&prop=images&imlimit=10 Adamw (talk) 07:44, 24 January 2013 (UTC)
- Which rewrite rules? For everything I've seen, if you can find index.php you should be able to find api.php just by replacing "index.php" with "api.php". Anomie (talk) 13:39, 24 January 2013 (UTC)
- The canonical base url is a pseudo-path which redirects to index.php, see Special:Interwiki. Try adding "api.php" to those urls... Adamw (talk) 17:16, 24 January 2013 (UTC)
Cacheable requests, URLs, and cache purging?
With more usage of the API going into our desktop web, mobile web, and native mobile user interfaces, cacheability is a concern.
As is, if we tried something like this:
/w/api.php?action=mobileview&titles=San_Francisco&maxage=86400&smaxage=86400
there'd be no way for the server-side code to know that that URL needs to be cleared when the 'San Francisco' article changes. You'd get stale results for up to a full day.
The URL isn't very predictable as someone might fudge that maxage around, or decide to use slightly different parameters to change 1% of the data return format, or the parameters might get re-ordered by an HTTP query library and appear in different order in different apps.
If we had a more traditionally structured object-fetch API, we could trivially make things like this HTTP-cacheable:
/w/api/v2/page/parse/San_Francisco
/w/api/v2/page/mobileview/San_Francisco
at least for simple data-fetch operations like this.
This wouldn't work for every operation, but we're going to be doing this sort of thing a lot in mobile-land: fetching individual pages' data and then displaying it. brion (talk) 00:28, 26 January 2013 (UTC)
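To illustrate why query-string URLs are hard to purge, here is a Python sketch that reduces equivalent request URLs to one normalized key. The function name cache_key and the choice of maxage and smaxage as the only cache-control parameters are assumptions for the example; nothing like this is claimed to exist on the server side.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed set of parameters that only affect cache headers, not the content.
CACHE_ONLY_PARAMS = {"maxage", "smaxage"}


def cache_key(url: str) -> str:
    """Normalize an API URL: drop cache-only parameters and sort the rest."""
    parts = urlsplit(url)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in CACHE_ONLY_PARAMS
    ]
    return parts.path + "?" + urlencode(sorted(params))


a = cache_key("/w/api.php?action=mobileview&titles=San_Francisco&maxage=86400&smaxage=86400")
b = cache_key("/w/api.php?titles=San_Francisco&smaxage=3600&action=mobileview")
```

Both URLs reduce to the same key here, but a cache in front of the servers sees them as two unrelated objects, which is what a fixed path such as /w/api/v2/page/mobileview/San_Francisco would avoid.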
- Sounds good, I started a section for content-oriented REST API at the bottom, could you add all types of calls as you see needed for the first release? Thx. Yurik (talk) 05:16, 29 January 2013 (UTC)
- I think a REST API can be useful for most or all operations by requiring all (sub)module options be included in a specific order, by using "/" to separate options, by placing page name or prefix as the last option for modules wanting one, and using a placeholder (such as "*") when an option isn't relevant. The page name or prefix may possibly be omitted just as with the current API.
- Examples of what this might look like:
api.php ? action=query & list=allcategories & acmin=1 & acmax=1 & aclimit=10 →
/w/api/v2/categories/ascending/!hidden/10/1/1/
api.php ? action=query & list=allcategories & acmin=1 & acmax=1 & aclimit=10 & acprefix=Requests_for_comment/ →
/w/api/v2/categories/ascending/!hidden/10/1/1/Category:Requests_for_comment/
api.php ? action=query & list=categorymembers & cmtitle=Category:Requests_for_comment & cmlimit=10 →
/w/api/v2/categories/ascending/!hidden/10/*/*/Category:Requests_for_comment
api.php ? action=query & titles=Requests_for_comment/API_Future & prop=categories & cllimit=10 →
/w/api/v2/categories/ascending/!hidden/10/*/*/Requests_for_comment/API_Future
Darklama (talk) 20:46, 16 February 2013 (UTC)
- Do you realize that some modules have up to 30 parameters? I think the query should be human readable (and writable), even if that's not the primary usage. If the REST approach was used for everything, spotting an error in a query would be extremely hard, especially since /w/api/v2/categories/!hidden/ascending/ would be an error. Svick (talk) 20:55, 16 February 2013 (UTC)
- Not a good approach, especially considering that multiple props=langlinks|categories can be used. The URL rewrite rules are to make things simple, not more complex. Plus they don't solve the most important problem - cachability. A request for the HTML of a specific page or a section of that page is highly cachable, hence it should have a simple URL. The complex SQL-like joined request to the server needs to be cached on a different level (either memcache or delegating parts of the request to dedicated machines for processing in parallel, or both) Yurik (talk) 21:29, 16 February 2013 (UTC)
- While a request for the HTML of a specific page or a section of that page is highly cachable, the cache must be frequently purged/updated to reflect edits made to the page. I think other requests are also highly cachable with less need for the cache to be purged/updated because changes happen less often for other things.
- Representational state transfer (REST) can reduce server load by allowing clients, proxies, and anything else sitting between clients and servers to participate in caching results too, possibly reducing or eliminating the need for any response from the servers. Using REST with the "If-Modified-Since" header could allow servers to send a "304 Not Modified" response. I think this also fits in well with the goal of only sending what the client requests.
- As for multiple props, either don't support it to keep responses short or maybe use:
/w/api/v2/pages/categories|langlinks/Page
- Support for multiple props could require props be listed in alphabetical order, or 301/302 HTTP redirection could be used to force alphabetical ordering.
- No URL rewrite rules could be required if v2 is the script and the web server is relied on to set the environment variable PATH_INFO ("/pages/categories|langlinks/Page" in the above example). Darklama (talk) 19:04, 17 February 2013 (UTC)
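A sketch of the "If-Modified-Since" / "304 Not Modified" exchange mentioned above, in Python. The function respond and its return shape (status, headers, body) are hypothetical; a real server would take the last-modified time from the page's latest revision.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def respond(
    last_modified: datetime,
    if_modified_since: Optional[str],
    body: str,
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body), answering 304 when the client copy is current.

    `last_modified` must be a timezone-aware UTC datetime.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, ""
        except (TypeError, ValueError):
            pass  # unparseable or naive date: fall through to a full response
    return 200, headers, body
```

A client that resends the Last-Modified value it received gets an empty 304 until the page changes, so only the headers cross the wire.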
- The existing (and the planned v2) API is not really amenable to a REST request structure: too many options, and there's not really a good way for a REST request to express a query for multiple objects at once.
- REST may have its niche for fetching single page and single section content, but trying to shove the entire API into that model is IMO trying to cram a hypercube into a round hole. Anomie (talk) 01:03, 18 February 2013 (UTC)
- Agree with Anomie here - no REST for API that wraps SQL queries, but I do plan to introduce a section of the API dedicated to content that will have URL rewrite rules. Yurik (talk) 08:25, 18 February 2013 (UTC)
CORS and third-party web apps
Currently, web applications using client-side JavaScript can only access our API via JSONP (wrapped in a function call and run through a <script> tag). This is a bit nasty for several reasons:
- different URLs -> breaks potential shared caching with other apps that use the same queries over JSON
- harder to get progress feedback or detect errors
- can't do POST requests at all
- authentication is disabled to prevent CSRF stealing a web user's credentials
This has practical limitations for some mobile platforms as well -- for instance our Wikipedia app for Firefox OS is a web app hosted on bits.wikimedia.org. Since an XHR can't access *.wikipedia.org/w/api.php from there, it has to use either JSONP or a server-side proxy (icky, hides IPs, no load balancing, etc). Since a proxy is icky and hard to scale, we're using JSONP for now... but this won't work once we try to add login and editing features, since auth isn't available.
If we had CORS headers set up to allow non-authenticated (no cookies) access via XHR from all third-party domains, and we could auth without cookies (can we use a token for this? I .... think so) that would be helpful.
Not sure if that's doable on the current API or not. :D brion (talk) 00:38, 26 January 2013 (UTC)
- There is some sort of CORS handling in the API, but I will need to look further into it to get a better understanding of how it is set up. Yurik (talk) 04:25, 30 January 2013 (UTC)
- Basically, it's three parts:
- The client adds an "origin" parameter to the request to indicate the origin and explicitly request CORS.
- The browser adds an "Origin" HTTP header, to also indicate the origin.
- The MediaWiki configuration has $wgCrossSiteAJAXdomains and $wgCrossSiteAJAXdomainExceptions to determine whether to allow the cross-domain request.
- First, the "origin" parameter must match one of the values in the "Origin" header, or the request fails.
- Second, the "origin" parameter must match one of the patterns in $wgCrossSiteAJAXdomains and not match any pattern in $wgCrossSiteAJAXdomainExceptions. These are currently set to allow various WMF wikis (but bits.wikimedia.org is not in the list).
- If both checks pass, then the appropriate CORS headers are returned to instruct the browser to allow the request, including cookies.
- I guess the basic idea behind this proposed non-cookie authentication method would be that it works just like cookies except that it's handled by the client code rather than the browser? Anomie (talk) 16:04, 30 January 2013 (UTC)
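The two checks described above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki's actual code: the function name, the choice to match patterns against the host part of the origin, the use of shell-style wildcards, and the example pattern lists are all assumptions made for the sketch.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def cors_request_allowed(origin_param, origin_header, allowed, exceptions):
    """Return True only if both checks from the discussion pass."""
    # Check 1: the "origin" request parameter must equal one of the
    # values in the browser-supplied Origin header.
    if origin_param not in origin_header.split():
        return False

    # Check 2: the origin must match an allowed pattern and must not
    # match any exception pattern. Matching on the host with
    # shell-style wildcards is an assumption of this sketch.
    host = urlsplit(origin_param).hostname or ""
    if any(fnmatchcase(host, pattern) for pattern in exceptions):
        return False
    return any(fnmatchcase(host, pattern) for pattern in allowed)


# Illustrative values only; the real configuration is not shown in this thread.
ALLOWED = ["*.wikipedia.org"]
EXCEPTIONS = []
```

With these illustrative lists, a request from https://en.wikipedia.org passes, while one from https://bits.wikimedia.org fails the second check because no allowed pattern covers it.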
- Yeah, it would be nice to drop JSONP for CORS. We'll have to disable anonymous editing over CORS though so that the API can't be turned into a kind of mass spam attack that could come from absolutely any innocent IP in the world without people's knowledge.
- For auth via tokens. This would basically be where OAuth (or ;) something like OAuth) would fit in. Daniel Friesen (Dantman) (talk) 21:31, 11 March 2013 (UTC)
- Technically speaking, I think anonymous editing via action=edit already allows that kind of attack. *cough* brion (talk) 23:43, 14 March 2013 (UTC)
- *facepalm* right it can, and I had a private bug about fixing that. Daniel Friesen (Dantman) (talk) 07:57, 15 March 2013 (UTC)
REST API vs. the rest of API
It seems like currently, the proposal is to split the API into two parts: most of the API will stay mostly the same as it is right now, while the “Content API” won't support XML and will probably have completely different URL schema (more REST-like).
I don't think splitting the API like that is a good idea. We should strive to make the API more consistent, not less. I understand that the intention is that the Content API is meant for different uses than the rest of the API, but in the end, I think many applications will want to use both parts of the API. And dealing with two different APIs will be a headache for the authors of applications, I think.
I understand that there are real issues that part of the proposal is trying to solve (caching), but I think (and hope) there are ways to solve that without making the API much harder to use.
What do you think? Svick (talk) 16:19, 26 January 2013 (UTC)
- Svick, thanks for your comment. I don't want to completely split them up. Rather, I want to have a separate action that is documented to only return JSON, has fewer parameters, is RESTful in nature, and has some URL rewrites that simplify beginner usage as well as URL-based caching. This solves the use-case and caching issues, and at the same time reuses most of the internal API infrastructure. Advanced uses will still require the api.php?... approach, simply because it is not as easily cacheable and has a substantially larger set of capabilities. I understand that this is not ideal in terms of full consistency, but given the radically different use cases and goals, this seems like a good compromise. Yurik (talk) 16:30, 26 January 2013 (UTC)
Need help with ApiEx template
I would like Template:ApiEx to hide the result of the execution, and it seems wikipedia:en:template:Hidden begin is exactly what would work (with the label on the left), but I couldn't get it to work properly. There seems to be a JavaScript bug on this site that shows the "hide" label twice, and it only semi-works in Chrome, not Firefox. Thanks!
Yurik (talk) 08:56, 18 February 2013 (UTC)
Title is better
I like "API roadmap" much more than "API Future". Nice move. :-) MZMcBride (talk) 12:29, 2 March 2013 (UTC)
PATH_INFO
It seems to me that using PATH_INFO is going to make things more complicated for clients, as instead of at a low level taking an assoc/dict/hash/etc of query parameters, they have to also take a PATH_INFO value. And while that's not much of a complication (if nothing else, a magic key could be extracted from the assoc/dict/hash/etc), what is the benefit? Anomie (talk) 13:20, 25 April 2013 (UTC)
- I agree that it will have to make "action" a special parameter (or could be extracted from the dict), but there are several benefits:
- Ability to more easily partition the server farm to create a cluster dedicated to certain actions - like parsing (requested by the Parsoid team)
- webserver access log files will contain the action even for post requests
- No need to introduce api2.php just yet - we can determine new version by request style
- Future core version changes can be done in the style api.php/action/2?...
- Shorter URL Yurik (talk) 13:40, 25 April 2013 (UTC)
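The "magic key extracted from the dict" idea mentioned above could look something like this on the client side. It is a minimal sketch in Python: the helper name, the example.org base URL, and the sorting of the remaining parameters are assumptions for illustration, and the api.php/action?... layout follows the style proposed in this thread rather than any deployed behavior.

```python
from urllib.parse import urlencode


def build_api_url(base, params):
    """Build a PATH_INFO-style URL from an ordinary dict of parameters."""
    params = dict(params)  # copy so the caller's dict is left untouched
    # "action" is the special key: it moves from the query string
    # into the path, giving api.php/<action>?...
    action = params.pop("action", None)
    path = base if action is None else f"{base}/{action}"
    query = urlencode(sorted(params.items()))
    return f"{path}?{query}" if query else path


url = build_api_url(
    "https://example.org/w/api.php",  # hypothetical wiki
    {"action": "query", "titles": "A", "format": "json"},
)
```

The caller still passes a single dict, so the extra complexity for clients stays inside one helper.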
- Partitioning, ok. Versioning could as well be done with action=foo/2. api.log already contains the action for post requests; I guess you're talking about the webserver access.log? Shorter URL and "api2.php", meh. Anomie (talk) 14:50, 25 April 2013 (UTC)
- Anomie, the core value is #1 - everything else is a side benefit :)
- As for logs, unless this is a very very recent change, I don't see action in the post req in the logs. Yurik (talk) 15:29, 25 April 2013 (UTC)
- Are you looking in the api.log (on fluorine), or in webserver access logs? Anomie (talk) 13:23, 26 April 2013 (UTC)
- I'm looking at the api log files that are rsynced to stats1. Yurik (talk) 20:05, 26 April 2013 (UTC)
- I don't know what's in that one. Anomie (talk) 13:22, 29 April 2013 (UTC)
Meeting notes on etherpad
Last week we met in the office and discussed this RFC. The discussion notes are on the etherpad (etherpad:apiv2). Gabriel Wicke (GWicke) (talk) 18:19, 17 September 2013 (UTC)
Follow-up action items meeting September 11
The action items from the September 11 meeting are:
- Wikia makes their RFC public, ASAP :) - Federico
- Separate RfC re RESTful API?
- Prototype Parsoid REST API - Gabriel
Done
- Find motivating use case re flags versus versions - Yuri
- Restructure current RFC - Brad/Yuri ?
- Sumana to post this etherpad onwiki, email mediawiki-api & wikitech-l
I have added Done based on my understanding of the current status, please feel free to edit. Drdee (talk) 21:44, 17 December 2013 (UTC)
- The REST storage service and public content API are now discussed in these two closely related RFCs: Storage service and Content API.
- Wikia has released a REST API that covers their immediate needs: [2][3]. They also have an API team that might work on a more general REST API. I hope that we can collaborate with them on the REST API. Gabriel Wicke (GWicke) (talk) 23:57, 17 December 2013 (UTC)
- Here is a full copy of the etherpad before it disappears: Gabriel Wicke (GWicke) (talk) 22:18, 6 March 2014 (UTC)
API roadmap conversation, Sept 11 2013 at WMF office
* Attendees: Yuri, Max, Yuvi, Erik B, Brad, Sumana, Subbu, Gabriel, RobLa, Roan, Federico, Tim

== REASONS / JUSTIFICATIONS ==
Current proposal: https://www.mediawiki.org/wiki/Requests_for_comment/API_roadmap
* Change output format - structured warnings / errors, localization
** Kill XML specifically :( (it's 25% of non-OpenSearch traffic but it's a mess and needs to die)
* Split traffic between server pools depending on action
** Change URL to e.g. api.php/query?...
*** Why make the URL longer?

== Discussion ==
* Module refactoring - https://www.mediawiki.org/wiki/Requests_for_comment/API_roadmap#Modules_refactoring
Drawbacks to versioning modules, versus individual flags:
* Making promises we can't keep: we say action=foo~3 isn't going to change, but then some security issue or core change comes along and we have to break it anyway.
* Code rot: "foo~3" implies an entirely separate module, the code for which will easily rot.
** Yes, the version *could* be treated as a feature flag within the module. Then you have this vaguely-named flag that doesn't indicate what it does besides "version".
* Say we make "foo~3", then "foo~4". If a client wants something introduced in ~4, they have to accept ~3 as well.
** Encouraging people to upgrade to the latest version is often a benefit
*** But forcing them to upgrade many features for one feature?
URL change won't help with caching yet -> REST content API
* query param order random, cannot be purged
* don't want to wrap HTML in JSON https://www.mediawiki.org/wiki/Talk:Requests_for_comment/API_roadmap#Clean_up_formats_23045
Wikia's requirements: work with SDKs their ecology likes
* REST
** What kind of REST? it means something different to everyone!
** Cacheable as much as possible -> no query params, deterministic URL so purgeable
** Representations? State Transformations?
*** Content types, not everything should be wrapped in JSON
**** +1
** Discoverability - API results include URLs to possible state transformations, related resources, etc.
Yuri: How will we proceed in changing the API?
* Sumana advises: consult existing API usability research, just as we consult users & MW developers
How do we change defaults?
Star versus underscore is so JS can do "foo._" instead of "foo['*']"
> Avoid underscores in js identifiers per conventions, maybe use "content" instead of "*" / "_" (also more descriptive)
Idealist vs Pragmatism - Do you want something beautiful? Or something that continues to work? Why can't it do both?
* The argument is to find specific use cases for each individual change, an overall beautiful API is not definable as individual little pieces but as an overarching design methodology

==NEXT==
* Wikia makes their RFC public, ASAP :) - Federico
** Separate RfC re RESTful API?
** Prototype Parsoid REST API - Gabriel
* Find motivating use case re flags versus versions - Yuri
* Restructure current RFC - Brad/Yuri ?
* Sumana to post this etherpad onwiki, email mediawiki-api & wikitech-l
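The caching point in these notes ("query param order random, cannot be purged" versus a "deterministic URL") can be illustrated with a small sketch. The Python below is not something proposed at the meeting: it only shows why two requests that differ in parameter order look like different cache entries, and how sorting the parameters would make them identical. The function name and the example.org URLs are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Return the URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Same request, different parameter order: distinct strings to a URL-keyed cache.
a = "https://example.org/w/api.php?titles=A&action=query"
b = "https://example.org/w/api.php?action=query&titles=A"
same_after_sorting = canonical_url(a) == canonical_url(b)
```

A path-only REST URL avoids the problem entirely, which is the direction the notes point to.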
Architecture Summit notes
Please see Talk:Architecture Summit 2014/Storage services#API versioning and additional notes on that page. Sharihareswara (WMF) (talk) 04:36, 14 March 2014 (UTC)
Errors should use reasonable HTTP response codes
It would be great if API errors used the HTTP error response codes rather than returning 200. Sharihareswara (WMF) (talk) 03:50, 18 March 2014 (UTC)
- I see that we previously WONTFIXed this request but now that we're overhauling the API I think we should give it another look. Sharihareswara (WMF) (talk) 03:54, 18 March 2014 (UTC)
- My reasoning in that bug still applies: an HTTP error indicates that something went wrong with the HTTP request, for example that the target resource wasn't found or couldn't be executed. As far as the API is concerned, that's the transport layer. If the API request is able to be processed but the result is an API error, that's reported at the application layer instead.
- Say the API did return an HTTP 400 or 500 response code for an API error. How does the client determine that this is an API error rather than a varnish timeout or the like? I don't much like "blindly try to parse the body, if it succeeds it's an API error".
- Also, say the API did return an HTTP 4xx response code for an API error. People would probably expect that action=delete would return a 404 if the target page isn't found to be deleted. But then what happens with action=query, when there may be multiple titles and some might be found and others not? Or look at action=watch: before gerrit:53964 you could have made the case for it to return 404, but now it's like action=query. Anomie (talk) 13:16, 18 March 2014 (UTC)
- I agree with Anomie's reasoning. An API error is not an HTTP error, and should not be reported as one. – RobinHood70 talk 20:13, 8 April 2014 (UTC)
- If the error is related to application layer data, HTTP error codes are wrong, of course.
- However, IIRC the MWAPI emits server errors with HTTP 200 and a response that includes an error code like internal_api_error_ExceptionFooBar. Those are server errors, and should/could have an HTTP 50x code because the application failed while attempting to complete processing of the request, and all bets are off on what parts of the request were performed and committed to the database.
- The current approach isn't _wrong_, as 50x are optional, but it is worth reconsidering using them for the cases they actually apply to. John Vandenberg (talk) 13:14, 24 October 2014 (UTC)
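To make the three cases in this thread concrete, here is a rough sketch of how a client might tell them apart under the current behavior (HTTP 200 plus an "error" object in the JSON body). It is an illustration in Python, not part of any client library: the function name and the category labels are made up for the sketch, and only the internal_api_error_ prefix comes from the comment above.

```python
import json


def classify_response(status, body):
    """Classify a response as ok, transport, application, or server error."""
    # Anything other than 200 is treated as a transport-level problem
    # (a proxy timeout, a missing script, and so on).
    if status != 200:
        return ("transport", str(status))
    try:
        data = json.loads(body)
    except ValueError:
        return ("transport", "unparseable body")

    error = data.get("error") if isinstance(data, dict) else None
    if error is None:
        return ("ok", None)

    code = error.get("code", "")
    # Internal errors are reported with HTTP 200 today; these are the
    # cases for which a 50x status was suggested.
    if code.startswith("internal_api_error_"):
        return ("server", code)
    return ("application", code)
```

The "server" branch is the only one where the thread suggests an HTTP status change might be justified; ordinary API errors stay at the application layer.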
Tuju> i would recommend like twitter does it, they version their protocols into urls and keep each url working, regardless that their db layout changes. hence they wont break applications. Tuju on #mediawiki 12:02, 19 August 2014 (UTC)
- +1, sounds great! ·addshore· talk to me! 21:13, 12 October 2015 (UTC)
Deprecation codes
It would be really handy if the API deprecation messages used an identifier of some sort that clients can use to 'understand' what these messages are. For example, this one doesn't even use the word 'deprecate'/'deprecation':
- Formatting of continuation data will be changing soon. To continue using the current formatting, use the 'rawcontinue' parameter. To begin using the new format, pass an empty string for 'continue' in the initial query.
Using i18n codes for API warnings should be a high priority, as not everyone can understand English, and clients do not want to show English uncoded warnings to non-English users. John Vandenberg (talk) 13:41, 24 October 2014 (UTC)
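A short sketch of why an identifier would help. The warning text is the one quoted above; the "code" field, its value, and the "deprecation-" prefix are purely hypothetical and only show what a client could key on if such an identifier existed.

```python
# The English warning text quoted in this thread.
WARNING_TEXT = (
    "Formatting of continuation data will be changing soon. "
    "To continue using the current formatting, use the 'rawcontinue' parameter. "
    "To begin using the new format, pass an empty string for 'continue' "
    "in the initial query."
)


def is_deprecation_by_text(text):
    """Fragile approach: guess from the English wording."""
    return "deprecat" in text.lower()


def is_deprecation_by_code(warning):
    """Approach with an identifier; the 'code' field is hypothetical."""
    return warning.get("code", "").startswith("deprecation-")


# Hypothetical structured warning carrying both a code and the text.
warning = {"code": "deprecation-rawcontinue", "text": WARNING_TEXT}
```

Text matching misses this warning entirely, while a code would be recognizable regardless of the wording or the language of the message.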