Talk:Analytics/Hypercube

Phrasing
I suggest rephrasing the first four bullet points about what the tool should eventually be able to do as:
 * Top 10 articles by pageviews (optional: by project(s))
 * Top 10 countries by pageviews (optional: by project(s), by monthly timeseries from 2012-01-01 to 2012-12-31)
Right now you may get the impression that you may never get the Top 10 articles by pageviews from e.g. eswiki. Ainali (talk) 21:16, 9 October 2013 (UTC)


 * Your first use case is supported by the API:

GET https://hypercube.wikimedia.org/v1/json/ ?metric=pageviews &timeseries=daily &order=desc &filters=project:enwiki &limit=10 &start-timestamp=2012-04-23T00:00:00.000Z &end-timestamp=2012-09-30T00:00:00.000Z
 * The second use case will depend on adding the 'country' dimension to the datastream. Once that has happened, and that will not happen right away, then we can add support for the country dimension as a possible filter in the API. Hope this answers your questions. Drdee (talk) 19:00, 10 October 2013 (UTC)
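For readers who want to try the first use case against another project, here is a minimal sketch of how the request above could be assembled. The endpoint and parameter names are copied from the example request; the helper name `build_pageview_query` is hypothetical, and since the API is still a proposal the sketch only builds the URL and does not send it.

```python
from urllib.parse import urlencode

# Endpoint and parameter names come from the example request in this thread.
# The service is a proposal, so this sketch never contacts the network.
BASE_URL = "https://hypercube.wikimedia.org/v1/json/"


def build_pageview_query(project, start, end, limit=10, timeseries="daily"):
    """Return the query URL for the top `limit` articles of one project.

    `build_pageview_query` is a hypothetical helper, not part of any API.
    """
    params = [
        ("metric", "pageviews"),
        ("timeseries", timeseries),
        ("order", "desc"),
        ("filters", f"project:{project}"),
        ("limit", limit),
        ("start-timestamp", start),
        ("end-timestamp", end),
    ]
    # urlencode percent-encodes reserved characters such as ':'.
    return BASE_URL + "?" + urlencode(params)


url = build_pageview_query(
    "eswiki", "2012-04-23T00:00:00.000Z", "2012-09-30T00:00:00.000Z"
)
print(url)
```

Swapping `enwiki` for `eswiki` in the `filters` value is all that the "Top 10 articles by project" case appears to need, assuming the filter syntax stays as shown.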

What questions are we trying to answer ...
I think the key question is indeed the questions. What questions do people really want answered about WP (or other projects)? In my observation, while the "whole of wiki" questions e.g. "what are the top 10 articles on en.WP" might be interesting, they are usually not "mission-critical". I think what we often want to know is whether some strategy is working. Some examples might be:


 * For a GLAM partnership, there will be a set of pages that have some "connection" to the GLAM, e.g. use a photo provided by the GLAM, have edits done by editors associated with the GLAM (might be staff or volunteers), contain external links or citations back to the GLAM's own web site. Metrics around these pages "with connection to GLAM XYZ" are important to show the GLAM that this is an effective strategy for them. Such metrics might be number of page views, number of click-throughs on images/media and on links (external or references) provided by the GLAM. They may be interested in knowing how their "performance" in the world of WMF compares with other GLAMs. Clearly if one GLAM adds 1000 photos and gets more hits as a result than another that adds 1000 external links or whatever, then the GLAM wants to know this.


 * For a chapter or thematic organisation, they may be interested in knowing how content about their country or topic is "doing". By "doing", we might be interested in number of articles, length of articles, quality of articles, number of links into those articles, etc, and the rate of change of these. For example, does quality of articles (as assessed, or measured by some citation-to-length ratio) in any way relate to the number of page views it gets, or is page views inherently related to the popularity of the topic irrespective of the merits of the WP article?

My point is don't build a tool to ask "imagined questions". Do it in true "agile" style with some customers in the loop who have *real* questions they want to ask about.

One of the things that I suspect people will generally want is metrics over categories. One area where our tools seem very weak is working with categories recursively. If WMAU want to know about what's happening in Australia, it wants to know about articles that are directly or indirectly in Category:Australia, not just those few that are directly in Category:Australia. Kerry Raymond (talk) 21:52, 9 October 2013 (UTC)
 * Hear hear. I completely agree with using 'real customers' aka 'real questions' instead of hypothetical use cases. I could definitely give a few use cases for questions (although I think most are already covered by Kerry and Magnus beneath), but please do contact me if you need a sparring partner or help with defining a few user stories. Husky (talk) 13:57, 10 October 2013 (UTC)


 * I do not think these use cases are imagined -- I have collected them from Bugzilla tickets, mailing list threads and conversations. We are building it in true agile style, but in true agile there is a single customer and that does not apply to the Wikimedia community -- it's not a homogeneous group. In addition, it's hard to define a true MVP that makes sense from a customer POV and is also technically viable in the medium/long term. The fact that we open this RFC to you shows that we are committed to talking to 'real customers'. If you can give more real use cases then please do so -- we really need them to help us prioritize the features. Drdee (talk) 19:07, 10 October 2013 (UTC)

I'm unsure about the ramp-up in 'questions we want to answer' (also I agree completely with Kerry Raymond above). We have a number of very broad/shallow questions (e.g. "Top 10 articles per project") and then there is the n-dimensional namesake query. I think we can get some more realistic, middle-ground questions if we think about it a little. I may want to see pageviews for an article and every article which it links to. Perhaps I'm best served by parsing the article for links and then providing those in a query, but perhaps it is a good complement to querying across categories and the API should expose it directly. We may not get to answer these questions if we're focused on the high and low end. Protonk (talk) 16:54, 15 October 2013 (UTC)

GLAM
So here are my GLAM-related needs:
 * Get page views for a large number of pages (say, 100 sets, 1K-10K pages per set, sometimes more) in many projects
 * Need monthly views
 * Would be nice-to-have daily views as an option
 * Because of the large number of queries, SQL access would be nice-to-have
 * Page names need to be exactly as the page_title in the page table
 * Page names should not be prefixed by namespace, rather have numerical namespace ID as separate property
 * nice-to-have: Web API having cross-domain exception for Tools Labs, to allow POST queries from JavaScript (GET with lots of page titles is a problem...)
I'm so glad and excited that this finally takes shape! --Magnus Manske (talk) 09:02, 10 October 2013 (UTC)
 * "Get page views for a large number of pages (say, 100 sets, 1K-10K pages per set, sometimes more) in many projects" I'd say a terse, powerful mechanism to do this (even accepting just an array of names or ids) is the absolute minimum we need from a project like this. Protonk (talk) 17:05, 15 October 2013 (UTC)
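To make the "array of names or ids" idea concrete, here is a rough sketch of what a POST body for one set of pages might look like. Every field name below (`project`, `metric`, `timeseries`, `pages`, `namespace`, `title`) is an assumption made for illustration, as is the helper `build_page_set_request`; nothing in the RFC fixes this shape. It does follow the two naming requests above: titles match `page_title` and the namespace travels as a separate numeric ID.

```python
import json


def build_page_set_request(project, pages, granularity="monthly"):
    """Serialize one set of pages into a JSON request body.

    `pages` is an iterable of (namespace_id, page_title) tuples, where
    page_title is written exactly as in the page table (no namespace prefix).
    All field names are hypothetical.
    """
    payload = {
        "project": project,
        "metric": "pageviews",
        "timeseries": granularity,
        "pages": [{"namespace": ns, "title": title} for ns, title in pages],
    }
    return json.dumps(payload)


# Namespace 0 is the article namespace; 6 is the File namespace.
body = build_page_set_request(
    "enwiki", [(0, "Mona_Lisa"), (6, "Mona_Lisa.jpg")]
)
print(body)
```

Sending the list in a request body rather than in the URL sidesteps the length problem with GET that Magnus mentions.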

Categories
I use the existing grok stats a lot. It would be great to have categories and category trees. Easier long-term stats also. Also the ability to combine two or more articles in a graph (especially given the effect of a name change). Some simple analysis would be useful, given the annual rhythms most articles have. Ideally the origin of views, if that's possible. Johnbod (talk) 16:45, 10 October 2013 (UTC)
 * Do you have any specific thoughts about how to do this? One of the challenges is that category trees are not real trees but graphs, because cycles can exist. So we need a way to think about flattening the tree and about how to handle child categories vs parent categories. For example, should every category be treated as a parent category? Drdee (talk) 19:10, 10 October 2013 (UTC)
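One possible way to flatten a category graph despite cycles is a breadth-first walk that remembers which categories it has already visited, optionally with a depth limit. The sketch below is only an illustration of that idea: the function `pages_under`, the two dictionaries and the sample data are all made up, and in practice the membership data would have to come from the categorylinks table or the MediaWiki API.

```python
from collections import deque


def pages_under(root, subcats, members, max_depth=None):
    """Collect pages in `root` and its descendant categories.

    subcats: dict mapping a category to its direct subcategories.
    members: dict mapping a category to its direct member pages.
    max_depth: None for unlimited, 0 for direct members only.
    The `seen` set guarantees termination even if the graph has cycles.
    """
    seen = {root}
    queue = deque([(root, 0)])
    pages = set()
    while queue:
        category, depth = queue.popleft()
        pages.update(members.get(category, ()))
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend below the requested depth
        for child in subcats.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return pages


# Made-up data with a deliberate cycle: Brisbane links back to Australia.
subcats = {
    "Australia": ["Queensland"],
    "Queensland": ["Brisbane"],
    "Brisbane": ["Australia"],
}
members = {
    "Australia": ["Australia"],
    "Queensland": ["Queensland"],
    "Brisbane": ["Brisbane", "Story_Bridge"],
}
print(sorted(pages_under("Australia", subcats, members)))
```

A depth limit is one crude answer to the false-positive concern: the deeper the walk goes, the more loosely related the pages tend to become.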

Needs for medicine and other high-traffic projects
For some community spaces Wikipedia is a compelling communication platform for the traffic it gets. In these efforts, the community would benefit from
 * Sum of page views in a monthly or yearly time range for all pages in a category, including pages in sub categories
Will this project grant this functionality? Blue Rasberry (talk) 20:58, 10 October 2013 (UTC)

Output
A suggestion for a more compact format based on the following assumptions: --DarTar (talk) 17:22, 11 October 2013 (UTC)
 * I believe the main data object will always be a pure timeseries of the form 'timestamp' => 'count', regardless if the query is for a single article, a set of articles, an entire namespace or an entire project. This will make it very easy to build functionality/visualizations of the timeseries object while moving all the relevant metadata to the header.
 * I don't see there's a real use-case for mix-project or mix-namespace requests. If that's the case they can both be moved outside of the timeseries object
 * I don't think we are planning to define a general-purpose format for metrics other than pageviews, so we can probably get rid of an intermediate node
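As a rough illustration of these assumptions (not DarTar's actual proposal), a response could keep the scope of the query in a small header and leave the data as a bare timestamp-to-count mapping. The field names, the helper names and the counts below are all invented for the sketch.

```python
def compact_response(project, namespace, articles, timeseries):
    """Wrap a pure timeseries with header metadata describing its scope.

    articles: list of titles, or None when the whole namespace or project
    is meant. namespace: numeric ID, or None for the whole project.
    All field names are hypothetical.
    """
    return {
        "project": project,
        "namespace": namespace,
        "articles": articles,
        "timeseries": timeseries,  # 'timestamp' => 'count', nothing else
    }


def total_views(response):
    """Consumers only touch the timeseries, whatever the query scope was."""
    return sum(response["timeseries"].values())


# Illustrative counts only; these are not real pageview figures.
single_article = compact_response(
    "enwiki", 0, ["Hypercube"], {"2013-10-01": 120, "2013-10-02": 95}
)
whole_project = compact_response(
    "enwiki", None, None, {"2013-10-01": 1000, "2013-10-02": 1100}
)
print(total_views(single_article), total_views(whole_project))
```

Because the data object has the same shape in every case, the same plotting or summing code would work for a single article and for an entire project.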

Single article

Whole namespace

Whole project

Multiple articles

See also the format of the JSON response used by the TempoDB API. --DarTar (talk) 17:49, 11 October 2013 (UTC)

The trouble with categories
 * As Drdee already pointed out Wikimedia projects don't have category trees, so retrieving data for anything other than the immediate children of a category is going to be challenging and create tons of false positives.
 * Categories and wikiprojects change over time, articles get added, deleted or moved across categories. An aggregate response per category or per wikiproject not specifying the full list of articles included in each data point will give inaccurate results: historical data will be based on the most recent definition of what articles are included in a category or wikiproject, generating useless data.
 * Note that the same problem applies to aggregation by namespace or project, but it's easy to map a given project or namespace to unique articles that belong to it at any given point in time
For these reasons I'm tempted to suggest that the minimum viable product should only support individual articles and not bother with aggregation. --DarTar (talk) 17:35, 11 October 2013 (UTC)
 * As handy as it may be, recursive conversion of categories to trees (by whatever measure) is outside the scope of this project. I think there could be room for a separate API for that task alone (accepting a category name and some options and returning an array or list object of articles contained in the category). Protonk (talk) 16:59, 15 October 2013 (UTC)

Emphasize robust api for specific article(s) requests
Our group has used article-level page view information from Wikipedia on multiple occasions. If you can ensure robust support for the core query of article+timespan -> views, I think external developers could do a lot of what they need to do. I'm not saying don't do all the other proposals, they sound great, just that single article metrics are the most flexible and important for our group's purposes. Here is an example of work we've done that used this information: http://nar.oxfordjournals.org/content/40/D1/D1255
 * Hi, when talking about robust, do you refer to certain performance / availability criteria? If so, can you be more specific about these criteria or if that's not what you mean can you elaborate on what robust means in this context? ty! Drdee (talk) 20:43, 14 October 2013 (UTC)
 * Robust just means consistent availability. I'd like to be able to get view data for 10,000+ articles quickly and reliably. Sorry for not signing last time. I9606 (talk) 21:01, 15 October 2013 (UTC)
 * Hi I9606, it's great to know what kind of performance we should aim for, but I do not think that we will be able to provide such robustness in the first few releases. Regarding 10k+ article queries: do you have any thoughts on how to submit that? I am pretty sure that a single GET request will not be possible. Drdee (talk) 21:59, 15 October 2013 (UTC)
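One client-side option for the 10k+ case, whatever the API ends up accepting, is to split the title list into fixed-size batches and submit one request per batch. The sketch below shows only the splitting step; the helper `chunked`, the batch size of 500 and the placeholder titles are assumptions, since no request limit has been specified.

```python
def chunked(titles, size):
    """Yield consecutive slices of `titles`, each with at most `size` items."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


# Placeholder titles standing in for a real 10,000-article list.
titles = [f"Article_{i}" for i in range(10_000)]

# 500 per batch is an arbitrary choice; the real limit is unknown.
batches = list(chunked(titles, 500))
print(len(batches), len(batches[0]))
```

Each batch could then be sent as its own request, which also makes it easy to retry a single failed batch instead of the whole query.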

Including information about featured content (main page)
Previous studies indicate that articles featured on the main page of each Wikipedia experience a significant rise in the number of visits. So far, this has been checked for FAs on the main page but not for content highlighted in the Did you know? section. A nice-to-have feature would be an additional attribute indicating whether the article has been displayed in the Featured Article section of the main page or in the Did you know? section, and the date(s) on which it was shown, to track any changes in the pattern of visits. --GlimmerPhoenix (talk) 19:27, 14 October 2013 (UTC)


 * GlimmerPhoenix, I believe this is outside the scope of the pageview API. Tracking links that get added or removed to a specific page and the timing of these changes is something that can be done via the MediaWiki API or the DB replicas. It's also going to be really hard to build infrastructure based on templates that vary across projects. I don't think this should be included in the requirements for the PV API. --DarTar (talk) 01:13, 16 October 2013 (UTC)