Analytics/Wikistats/TrafficReports/Future

Introduction
Wikistats broadly comes in two (nearly) independent parts:
 * A: Wikistats reports on database content and activity
 * These are reports per wiki, plus many comparison reports. They consist of HTML tables and charts, and are based on the XML dumps.
 * These reports are outside the scope of this page.


 * B: Wikistats traffic reports
 * These are HTML tables, based on two sets of sources (see below).
 * This page discusses the future of B: Wikistats traffic reports.

For technical details, see Analytics/Wikistats/TrafficReports. For an overview of the Wikistats ecology, see this diagram. Below, read 'reports' as 'Wikistats reports' unless specified otherwise.

Two broad categories of traffic reports
Traffic reports are generated from two sources:
 * B1 Reports based on hourly pageview counts per wiki
These counts (aka Domas' pageview counts) are further aggregated by Wikistats scripts into monthly totals per project per wiki, and broken down further into mobile and non-mobile traffic. The reports exist in two variations: normalized and non-normalized. All in all, a bewildering 60 static reports exist, which are updated each day. See this site map. Example: TablesPageViewsMonthlyCombined.htm

 * B2 Reports based on so called 'squid logs'
These squid logs are now actually generated via Hadoop, in a backward-compatible format.

These 'squid logs' are used for two types of reports (with some hybrids)

 * B2a Breakdowns of traffic by geographic criteria (country, continent, global North/South)
For an overview, see sitemap SquidReportsCountriesLanguagesVisitsEdits.htm

 * B2b Breakdowns of traffic by non-geographic criteria (OS, browser, MIME type, target wiki, referer, etc.)
For an overview, see Wikistats search 'breakdown of traffic'

There are around 15 different squid-based reports. Some come in several varieties (views/edits), which brings the total to around 20.
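To make the B1 aggregation step concrete, here is a minimal sketch of rolling hourly counts up into monthly totals per project per wiki, split into mobile and non-mobile traffic. The row layout (timestamp, project, wiki, mobile flag, views) and all sample numbers are assumptions made for illustration. They do not reflect the actual pageview file format or the Wikistats scripts, and normalization is not covered.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(hourly_rows):
    """Sum hourly view counts per (month, project, wiki, platform)."""
    totals = defaultdict(int)
    for timestamp, project, wiki, is_mobile, views in hourly_rows:
        month = timestamp.strftime("%Y-%m")
        platform = "mobile" if is_mobile else "non-mobile"
        totals[(month, project, wiki, platform)] += views
    return dict(totals)


# Hypothetical hourly rows; the counts are made up for the example.
rows = [
    (datetime(2014, 1, 1, 0), "wikipedia", "en", False, 100),
    (datetime(2014, 1, 1, 1), "wikipedia", "en", False, 50),
    (datetime(2014, 1, 1, 1), "wikipedia", "en", True, 30),
    (datetime(2014, 2, 1, 0), "wikipedia", "en", False, 70),
]
totals = monthly_totals(rows)
for key in sorted(totals):
    print(key, totals[key])
```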

Future: general ideas
For those reports that should live on in some form or another, here are general ideas for improvement (if budget allows).

More generic processing
Currently each report is generated by its own separate code. This could probably be improved.

Machine readability
Currently the reports are generated from many intermediate CSV files, which are not readily available online (and which sometimes use cryptic codes). To facilitate further processing, the ability to produce a machine-readable, self-documenting public format could help a lot (compare the JSON files from stats.grok.se).
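As an illustration of what 'self-documenting' could mean, the sketch below converts a CSV file with cryptic column codes into JSON that carries its own field descriptions. The column codes (prj, lng, cnt), the field names and the sample row are all hypothetical. They are not taken from the real intermediate files or from stats.grok.se.

```python
import csv
import io
import json

# Hypothetical mapping from cryptic CSV column codes to documented fields.
FIELD_DOCS = {
    "prj": {"name": "project", "description": "Wikimedia project name"},
    "lng": {"name": "wiki", "description": "Language code of the wiki"},
    "cnt": {"name": "views", "description": "Page views in the period"},
}

# Made-up sample input, for illustration only.
SAMPLE_CSV = "prj,lng,cnt\nwikipedia,en,1234\n"


def csv_to_documented_json(csv_text, field_docs):
    """Return a JSON string holding field documentation plus renamed rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {field_docs[code]["name"]: value for code, value in record.items()}
        for record in reader
    ]
    # Values stay strings here; a fuller version could also declare types.
    return json.dumps({"fields": list(field_docs.values()), "rows": rows}, indent=2)


print(csv_to_documented_json(SAMPLE_CSV, FIELD_DOCS))
```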

On demand generation
Current reports are all in static HTML, generated daily (B1) or once a month (B2a/B2b). This is inflexible, and leads to long reports full of details that may only matter to a few users. Flexible reports, generated ad hoc from client preferences and filters, could be beneficial. This could also ease the burden of downloading huge files (B1).
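A rough sketch of the on-demand idea: instead of rendering every breakdown into one long static page, a report function could take the client's filters and return only the matching rows. The function name, its parameters and the sample rows are hypothetical, and how the result would be rendered or served is left open.

```python
def build_report(rows, project=None, platform=None, top=None):
    """Return rows matching the given filters, largest view counts first."""
    selected = [
        row for row in rows
        if (project is None or row["project"] == project)
        and (platform is None or row["platform"] == platform)
    ]
    selected.sort(key=lambda row: row["views"], reverse=True)
    return selected if top is None else selected[:top]


# Made-up sample data, for illustration only.
sample_rows = [
    {"project": "wikipedia", "wiki": "en", "platform": "mobile", "views": 30},
    {"project": "wikipedia", "wiki": "de", "platform": "mobile", "views": 45},
    {"project": "wikibooks", "wiki": "en", "platform": "mobile", "views": 5},
    {"project": "wikipedia", "wiki": "en", "platform": "non-mobile", "views": 150},
]

report = build_report(sample_rows, project="wikipedia", platform="mobile", top=1)
print(report)
```

A client that only cares about mobile Wikipedia traffic then receives a single row rather than the full table.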

Language independence
Ideally all reports could be generated in multiple languages.

Support for all Wikimedia projects
Some of the traffic reports focus only on Wikipedia, as the 1:1000 sampled logs do not allow enough precision for other projects. With unsampled Hadoop input, further extension to all projects would be nice.
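The precision problem can be quantified with a back-of-the-envelope calculation. Assuming each request independently has a 1 in 1000 chance of being logged (a simplifying binomial model, not necessarily how the sampling is implemented), the relative standard error of a scaled-up count is roughly one over the square root of the number of sampled requests. The request volumes below are made up to contrast a large and a small project.

```python
import math

SAMPLING_RATE = 1 / 1000


def relative_standard_error(true_requests, rate=SAMPLING_RATE):
    """Relative standard error of the scaled-up estimate, binomial model."""
    expected_sampled = true_requests * rate
    return math.sqrt((1 - rate) / expected_sampled)


# Hypothetical monthly request volumes for a large and a small wiki.
for requests in (1_000_000_000, 10_000):
    error = relative_standard_error(requests)
    print(f"{requests:>13,} requests: about {error:.1%} relative error")
```

Under these assumptions a wiki with a billion requests is estimated to within about 0.1%, while a breakdown cell with only ten thousand requests carries an error of roughly 30%.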

Better demographical data
Some of the reports use demographic data harvested from several Wikipedia pages (example here). However, this information can be quite out of date. The World Bank is a reputable source with yearly updated figures and a sophisticated API. We could collect those figures (maybe store them in Wikidata first).
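A minimal sketch of how such figures might be collected. It assumes the World Bank v2 indicators API, with URLs of the form `/country/{code}/indicator/{indicator}` and a JSON response that is a two-element list of paging metadata and records. Both should be checked against the current API documentation. The indicator code SP.POP.TOTL (total population) is used as an example, and the sample payload is hand-written with a placeholder value rather than real data. No network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://api.worldbank.org/v2"  # assumed API root


def indicator_url(country_code, indicator, year):
    """Build the request URL for one country, indicator and year."""
    query = urlencode({"format": "json", "date": str(year)})
    return f"{API_ROOT}/country/{country_code}/indicator/{indicator}?{query}"


def parse_indicator_response(payload_text):
    """Return {(country_id, year): value} from a [metadata, records] payload."""
    _metadata, records = json.loads(payload_text)
    result = {}
    for record in records or []:
        if record["value"] is not None:  # skip years without a figure
            result[(record["country"]["id"], int(record["date"]))] = record["value"]
    return result


# Hand-written payload in the assumed shape; 123 is a placeholder, not real data.
SAMPLE_PAYLOAD = json.dumps([
    {"page": 1, "pages": 1, "per_page": 50, "total": 1},
    [{"country": {"id": "BR", "value": "Brazil"}, "date": "2012", "value": 123}],
])

print(indicator_url("BR", "SP.POP.TOTL", 2012))
print(parse_indicator_response(SAMPLE_PAYLOAD))
```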

Future: per report

 * Please add your feedback for 'B1 Reports based on hourly pageview counts per wiki' at this page (to do)
 * Please add your feedback for 'B2a/B2b Reports based on squid logs' at this page