Analytics/Wikistats/TrafficReports/Future

From mediawiki.org

Introduction

Wikistats broadly comes in two (nearly) independent parts:

  • A: Wikistats reports on database content and activity
These are reports per wiki, plus many comparison reports. They consist of HTML tables and charts, and are based on the XML dumps.
These reports are outside the scope of this page.
  • B: Wikistats traffic reports
These are HTML tables, based on two sets of sources (see below).
This page seeks to discuss the future of B: Wikistats traffic reports, and your input is needed!

For technical details, see Analytics/Wikistats/TrafficReports
For an overview of the Wikistats ecology, see this diagram
Below, read 'reports' as 'Wikistats reports' unless specified otherwise.

Two broad categories of traffic reports

Traffic reports are generated from two sources:

B1 Reports based on hourly pageview counts per wiki

These counts (also known as Domas' pageview counts) are further aggregated by Wikistats scripts into monthly totals per project per wiki, and broken down further into mobile and non-mobile traffic. The totals exist in two variations: normalized and non-normalized. All in all, a bewildering 60 static reports exist, which are updated each day. See this site map. Example: TablesPageViewsMonthlyCombined.htm
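The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual Wikistats code: the record layout (timestamp, project, wiki, mobile flag, view count) and the function name `monthly_totals` are assumptions made for the example, and the sample figures are placeholders. Normalization is not covered here.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(hourly_records):
    """Sum hourly view counts into totals per (month, project, wiki, platform).

    Each record is a hypothetical tuple:
    (timestamp, project, wiki, is_mobile, views).
    """
    totals = defaultdict(int)
    for timestamp, project, wiki, is_mobile, views in hourly_records:
        month = timestamp.strftime("%Y-%m")
        platform = "mobile" if is_mobile else "non-mobile"
        totals[(month, project, wiki, platform)] += views
    return dict(totals)


# Placeholder hourly counts, for illustration only.
sample = [
    (datetime(2014, 1, 1, 0), "wikipedia", "en", False, 1200),
    (datetime(2014, 1, 1, 1), "wikipedia", "en", False, 800),
    (datetime(2014, 1, 1, 1), "wikipedia", "en", True, 300),
    (datetime(2014, 2, 1, 0), "wikipedia", "en", False, 500),
]
totals = monthly_totals(sample)
```

Keying the totals on a single tuple keeps the mobile and non-mobile breakdown next to the monthly sum, so a combined figure is just the sum of the two platform entries.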

B2 Reports based on so-called 'squid logs'

These 'squid logs' are in fact now generated via Hadoop, in a backward-compatible format.

These 'squid logs' are used for two types of reports (with some hybrids):

B2a Breakdowns of traffic by geographic criteria (country, continent, global North/South)

For an overview see sitemap SquidReportsCountriesLanguagesVisitsEdits.htm

B2b Breakdowns of traffic by non-geographic criteria (OS, browser, MIME type, target wiki, referer, etc.)

For an overview see Wikistats search 'breakdown of traffic'

There are around 15 different squid-based reports. Some come in several varieties (views/edits), which brings the total to around 20.

Future: general ideas

For those reports that should live on in some form or another, here are some general ideas for improvement (if budget allows).

More generic processing

Currently each report is generated by its own separate code. That can probably be improved upon.

Machine readability

Currently the reports are generated from many intermediate CSV files, which are not readily available online and sometimes use cryptic codes. To facilitate further processing, the ability to produce a public, machine-readable, self-documenting format could help a lot (compare the JSON files from stats.grok.se).
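As a rough illustration of what 'self-documenting' could mean here, the sketch below wraps a CSV file with cryptic column codes in a JSON document that carries a description of every field. The column codes (`cc`, `pv`), the `FIELD_DOCS` mapping, the function name, and the sample values are all hypothetical, and the output layout is only one possible choice.

```python
import csv
import io
import json

# Hypothetical documentation for the cryptic column codes of one CSV file.
FIELD_DOCS = {
    "cc": {"name": "country", "description": "ISO 3166-1 alpha-2 country code"},
    "pv": {"name": "page_views", "description": "Page views in the reporting month"},
}


def csv_to_documented_json(csv_text, field_docs):
    """Convert CSV text into JSON that lists field descriptions next to the rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Rename each cryptic column code to its documented name.
    rows = [
        {field_docs[code]["name"]: value for code, value in row.items()}
        for row in reader
    ]
    document = {
        "fields": [field_docs[code] for code in reader.fieldnames],
        "rows": rows,
    }
    return json.dumps(document, indent=2)


# Placeholder data, for illustration only. Values stay strings, as in the CSV.
sample_csv = "cc,pv\nBR,1500\nIN,2300\n"
documented = json.loads(csv_to_documented_json(sample_csv, FIELD_DOCS))
```

A consumer can then read `documented["fields"]` to learn what each column means without consulting separate documentation.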

On demand generation

Current reports are all static HTML, generated daily (B1) or once a month (B2a/B2b). This is inflexible and leads to long reports full of details that may matter to only a few users. Flexible reports, generated ad hoc from client preferences and filters, could be beneficial. This could also ease the burden of downloading huge files (B1).
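A minimal sketch of the idea, assuming report data is available as a list of records: the client supplies filters and a column selection, and only the matching subset is returned. The function name `build_report`, the field names, and the sample rows are hypothetical.

```python
def build_report(rows, filters=None, columns=None):
    """Return the rows matching every filter, reduced to the requested columns."""
    filters = filters or {}
    selected = [
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    ]
    if columns is None:
        return selected
    return [{column: row[column] for column in columns} for row in selected]


# Placeholder records, for illustration only.
rows = [
    {"wiki": "en", "platform": "mobile", "month": "2014-01", "views": 300},
    {"wiki": "en", "platform": "non-mobile", "month": "2014-01", "views": 2000},
    {"wiki": "de", "platform": "mobile", "month": "2014-01", "views": 120},
]
report = build_report(rows, filters={"platform": "mobile"}, columns=["wiki", "views"])
```

With this shape, a user interested only in mobile traffic receives two short rows instead of a full static table.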

Language independence

Ideally all reports could be generated in multiple languages.

Support for all Wikimedia projects

Some of the traffic reports focus only on Wikipedia, as the 1:1000 sampled logs do not offer enough precision for other projects. With unsampled Hadoop input, extending these reports to all projects would be nice.
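The precision problem can be made concrete with a small calculation. The sketch below assumes each request is sampled independently with probability 1/1000 (a simplification of however the sampling is actually implemented), so the sampled count is binomially distributed and the relative standard error of the scaled-up estimate is approximately sqrt((1 - p) / k) for k sampled hits. The function name is hypothetical.

```python
import math


def estimate_with_error(sampled_hits, sample_one_in=1000):
    """Scale a sampled hit count to an estimated total and give its relative standard error.

    Assumes independent sampling of each request with probability 1/sample_one_in.
    """
    rate = 1 / sample_one_in
    estimate = sampled_hits * sample_one_in
    relative_error = math.sqrt((1 - rate) / sampled_hits)
    return estimate, relative_error


# A large wiki: 10,000 sampled hits stand for about 10 million requests.
large_estimate, large_error = estimate_with_error(10_000)

# A small wiki: 25 sampled hits stand for about 25,000 requests.
small_estimate, small_error = estimate_with_error(25)
```

Under these assumptions the large wiki's estimate carries a relative standard error of roughly 1%, while the small wiki's is roughly 20%, which is why breakdowns for smaller projects become unreliable.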

Better demographical data

Some of the reports use demographic data harvested from several Wikipedia pages (example here). However, this information can be quite out of date. The World Bank is a reputable resource with yearly updated figures and a sophisticated API. We could collect those figures (perhaps storing them in Wikidata first).
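As a hedged sketch of how such figures might be collected, the code below builds a request URL for the World Bank's public indicator API and extracts the most recent population value from a decoded JSON response. The URL pattern, the `SP.POP.TOTL` indicator code, and the two-element response layout reflect the World Bank API as assumed here and should be checked against its documentation; the function names are hypothetical, and the sample payload is hand-written with placeholder values, not real data. No network request is made.

```python
from urllib.parse import urlencode

# Assumed base URL and indicator code for total population.
WORLD_BANK_BASE = "https://api.worldbank.org/v2"
POPULATION_INDICATOR = "SP.POP.TOTL"


def population_url(country_code, year):
    """Build the request URL for one country's total population in one year."""
    query = urlencode({"format": "json", "date": str(year)})
    return (
        f"{WORLD_BANK_BASE}/country/{country_code}"
        f"/indicator/{POPULATION_INDICATOR}?{query}"
    )


def latest_population(payload):
    """Return (year, value) for the most recent non-null observation, or None.

    Assumes the decoded response is a two-element list:
    [paging metadata, list of observations].
    """
    _metadata, observations = payload
    dated = [
        (int(obs["date"]), obs["value"])
        for obs in observations
        if obs["value"] is not None
    ]
    return max(dated) if dated else None


# Hand-written payload in the assumed layout; the values are placeholders.
sample_payload = [
    {"page": 1, "pages": 1, "total": 3},
    [
        {"date": "2013", "value": None},
        {"date": "2012", "value": 1100},
        {"date": "2011", "value": 1000},
    ],
]
url = population_url("BR", 2012)
latest = latest_population(sample_payload)
```

Skipping null values matters because the most recent year is often not yet published for every country.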

Future: per report

  • Please add your feedback for 'B1 Reports based on hourly pageview counts per wiki' at this page (to do)
  • Please add your requests for migration of 'B2a/B2b Reports based on so-called squid logs' at this page

See also