User:Jeblad/Temporal statistics

From mediawiki.org

Temporal statistics are an adaption of the mw-core to allow calculation of temporal statistics for the last hours, days or weeks.

In the basic configuration it is intended for low traffic sites without Squid servers in its configuration without caching of traffic data, that is $wgHitcounterUpdateFreq less than or equal to 1. If the traffic is large there will be internal caching to reduce the load impact on the database server if $wgHitcounterUpdateFreq is larger than 1.

In the basic configuration without squids no external infrastructure is necessary to manage external logfiles. If there are squids in front of the web servers a maintenance script must be used to populate the database tables with statistics data. The data from the squids will then be integrated into the same presentations as usual traffic.

The adaption uses ring buffers to hold temporal statistics, as default either no ringbuffer or a ring buffer with one slot with collected statistics (that is a ring buffer with two slots). One slot in the ring buffer accumulates statistics, then will move on to the next one when the unix time epoch goes from one hour to the next, from one day to the next or from one week to the next. The length of the ring buffers, less one, will give the maximum length of the statistics with the given resolution. In the default configuration that means the previous whole hour is collected in a single slot, the previous whole day or the previous whole week.

Only real pages gets statistics, that is no special pages will have statistics. This should have few practical consequences.

Special pages[edit]

It is possible to build several special pages for various use. Actually they will be one and the same underlaying system but with some additional filtering mechanisms. Each of the pages can produce statistics from a single column in each ring buffer, accumulated statistics backwards in time from the last completely filled slot, difference trends, etc.

Single page statistics[edit]

This will be a special page to get statistics for a single page. The call syntax is [[Special:Pagestatistics/pagename]] and it will display the collected statistics for both global statistics and temporal statistics for this page. Presumbly the page will show all forms of statistics on a single page.

This page is most likely the only one that can't use a common layout.

General page statistics[edit]

This will be a special page to get comparative statistics for the most viewed pages. The call syntax is [[Special:Pagestatistics]] and it will show the number of page views as a number and as a bar for each page. The user will have to choose one of several pages, given the timespan for the statistics, and it will be possible to add several filtering mechanisms like those on [[Special:Recentchanges]].

Watched pages statistics[edit]

This will be a special page to get comparative statistics for those pages that a user has on the watched pages list. The call syntax is [[Special:Pagestatistics,type=watchlist]] and it will show the number of page views as a number and as a bar for each page. The user will have to choose one of several pages, given the timespan for the statistics.

User created pages statistics[edit]

This will be a special page to get comparative statistics for those pages that a user has created. The call syntax is [[Special:Pagestatistics,type=newpages,user=user]] and it will show the number of page views as a number and as a bar for each page. The user will have to choose one of several pages, given the timespan for the statistics.

Extensions[edit]

Several extensions could possibly use temporal statistics to allow for adaptive changes. Especially Extension:Intersection and Extension:DynamicPageList. By using temporal statistics the lock in effect of global statistics can be avoided and any listings will dynamically change over time.

Note that those extensions has to implement the same ring buffers, and can't assume that any specific column in the database are set and available at any given moment.

Update script[edit]

To import data from the Squid logs, there will be a script to filter and preprocess them. This script will typically run each hour.

Maintenance script[edit]

There will be a maintenance script to ease adjustments of the database scheme. If possible this script will try to rearrange data in the tables so to keep as much as possible of the collected statistics. This is done by altering the table to add temporary columns, then moving data into those columns, and then finally altering the table again to remove the original columns and renaming the new columns as the original ones.

In this process the tables will either loose slots, then the statistics will be truncated, or will gain additional slots, then the statistics will have empty trailing slots.

Configuration[edit]

Additional configuration in the DefaultSettings.php, with adaptions in LocalSettings.php

/**
 * wgHitcounterHours, wgHitcounterDays and wgHitcounterWeeks is the number of
 * slots in the ring buffers used for temporal statistics. One single slot should
 * not be used as it will not build complete statistics for any slot. A value of
 * zero will therefore turn off temporal statistics. The values should not be set
 * to a higher value than the number of slots in the database table as this will
 * lead to a sql error.
 * Note that this is the number of slots for collected data, that is one less than
 * the actual number of slots. The number of slots will not change the actual
 * database scema.
*/
$wgHitcounterHours = 1;
$wgHitcounterDays = 1;
$wgHitcounterWeeks = 1;

The configuration are used for modulus operations for calculating the present slot. That means any changes of the configuration will trash previous collected statistics. A maintenance script can be made to recalculate slot indexes, and then reorganize the database accordingly.

If the length of the ring buffers are changed, the table has to be kept temporarilly, the new table created and the statistics reorganized and moved back. This is the only way to generally keep the data in the slots. If the ring buffer is shorted, then old data are forgotten but the ring will be filled. If the the ring buffer is lengthened the ring will only be partially filled.

Typical configuration[edit]

There are a few typical configurations which are especially interesting.

One configuration of the hourly ringbuffer is to use a ring buffer with three slots, allowing two slots to be filled at a time. This is the smallest ring which allows differences and therefore can produce trends.

Another configuration uses 25 hourly slots, allowing one to compare a day from a sliding window with a fixed day.

Altering of page table[edit]

The following shows how the database table page are changed to accommodate the new functionality. If it is necessary to use longer time series the number of slots for page_hourX, page_dayY and page_weekZ are increased accordingly.

Note that some code should be added to verify that the number of slots stays in accordance with the definitions of $wgHitcounterHours, $wgHitcounterDays and $wgHitcounterWeeks.

ALTER TABLE mwpage
  ADD COLUMN page_tsview INT NOT NULL DEFAULT 0 AFTER page_counter,
  ADD COLUMN page_hour0 INT NOT NULL DEFAULT 0 AFTER page_tsview,
  ADD COLUMN page_hour1 INT NOT NULL DEFAULT 0 AFTER page_hour0,
  ADD COLUMN page_day0 INT NOT NULL DEFAULT 0 AFTER page_hour1,
  ADD COLUMN page_day1 INT NOT NULL DEFAULT 0 AFTER page_day0,
  ADD COLUMN page_week0 INT NOT NULL DEFAULT 0 AFTER page_day1,
  ADD COLUMN page_week1 INT NOT NULL DEFAULT 0 AFTER page_week0;

Note that the previous reflects the database scema used for $wgHitcounterHours = $wgHitcounterDays = $wgHitcounterWeeks = 1. If the numbers are changed the database scema must be changed accordingly.

Altering the site_stats table[edit]

The following shows how the database table site_stats are changed to accommodate the new functionality. If it is necessary to use longer time series the number of slots for page_hourX, page_dayY and page_weekZ are increased accordingly.

Note that some code should be added to verify that the number of slots stays in accordance with the definitions of $wgHitcounterHours, $wgHitcounterDays and $wgHitcounterWeeks.

ALTER TABLE mwsite_stats
  ADD COLUMN ss_tsview INT NOT NULL DEFAULT 0 AFTER ss_views,
  ADD COLUMN ss_hour0 INT NOT NULL DEFAULT 0 AFTER ss_tsview,
  ADD COLUMN ss_hour1 INT NOT NULL DEFAULT 0 AFTER ss_hour0,
  ADD COLUMN ss_day0 INT NOT NULL DEFAULT 0 AFTER ss_hour1,
  ADD COLUMN ss_day1 INT NOT NULL DEFAULT 0 AFTER ss_day0,
  ADD COLUMN ss_week0 INT NOT NULL DEFAULT 0 AFTER ss_day1,
  ADD COLUMN ss_week1 INT NOT NULL DEFAULT 0 AFTER ss_week0;

Note that the previous reflects the database scema used for $wgHitcounterHours = $wgHitcounterDays = $wgHitcounterWeeks = 1. If the numbers are changed the database scema must be changed accordingly.

Altering the hitcounter table[edit]

The following shows how the database table site_stats are changed to accommodate the new functionality.

ALTER TABLE mwsite_stats
  ADD COLUMN hc_tsview INT NOT NULL DEFAULT 0 AFTER hc_id;

Patch for Article.php[edit]

The patch is an adaption of incViewCount() to use additional columns in the database table for the page, typically named mw_page.

Note the incViewCount -function does not include the code for bulk updates.

	/**
	 * Used to increment the view counter
	 *
	 * @static
	 * @param integer $id article id
	 */
	function incViewCount( $id ) {
		$id = intval( $id );
		global $wgHitcounterUpdateFreq, $wgDBtype;
		global $wgHitcounterHours, $wgHitcounterDays, $wgHitcounterWeeks;

		$dbw = wfGetDB( DB_MASTER );
		$pageTable = $dbw->tableName( 'page' );
		$hitcounterTable = $dbw->tableName( 'hitcounter' );
		$acchitsTable = $dbw->tableName( 'acchits' );

		$result = null;
		wfRunHooks( 'incViewCount', array( &$this, &$id, &$dbw, &$result ) );
		if ( $result !== null ) {
			return $result;
		}

		if( $wgHitcounterUpdateFreq <= 1 ) {
			# first we needs our unix epoch type of time stamp
			if ($wgHitcounterHours || $wgHitcounterDays || $wgHitcounterWeeks)
				$s = $dbw->selectRow( 'page', array( 'page_tsview' ),  array( 'page_id' => $id ) );
			# this fails if we havent tried to get the timne stamp, which means we should not make any
			# calculation of temporal statistics
			if ($s) {
				# get now-time
				$now = time();
				# common initial part of query
				$query = "UPDATE $pageTable SET page_counter = page_counter + 1";
				$query .= ", page_tsview = $now";
				# calculate our new slots
				$hour = sprintf('%d', floor(($now%(3600*$wgHitcounterHours))/3600));
				$day = sprintf('%d', floor(($now%(86400*$wgHitcounterDays))/86400));
				$week = sprintf('%d', floor(($now%(604800*$wgHitcounterWeeks))/604800));
				# process hour, day and week slots
				# each tick is normalized to a slot in the time domain, then all slots from
				# after the last one and up to and including now-slot are zeroed out
				# then the final slot are incremented
				if ($wgHitcounterHours) {
					$then = $s->page_tsview - $s->page_tsview%3600;
					$i = 0;
					while ($then + 3600*($i+1) < $now && $i < $wgHitcounterHours) {
						if (!isset($h)) $h = floor(($then%3600)/3600);
						$query .= ", page_hour" . ($h+$i++)%$wgHitcounterHours . " = 0";
					}
					$query .= ", page_hour$hour = page_hour$hour + 1";
				}
				if ($wgHitcounterDays) {
					$then = $s->page_tsview - $s->page_tsview%86400;
					$i = 0;
					while ($then + 86400*($i+1) < $now && $i < $wgHitcounterDays) {
						if (!isset($d)) $d = floor(($then%86400)/86400);
						$query .= ", page_day" . ($d+$i++)%$wgHitcounterDays . " = 0";
					}
					$query .= ", page_day$day = page_day$day + 1";
				}
				if ($wgHitcounterWeeks) {
					$then = $s->page_tsview - $s->page_tsview%604800;
					$i = 0;
					while ($then + 604800*($i+1) < $now && $i < $wgHitcounterDays) {
						if (!isset($w)) $w = floor(($then%604800)/604800);
						$query .= ", page_week" . ($w+$i++)%$wgHitcounterWeeks . " = 0";
					}
					$query .= ", page_week$week = page_week$week + 1";
				}
				# common final part of query
				$query .= " WHERE page_id = $id";
//echo $query;
				$dbw->query( $query );
			}
			# we failed so only do global statistics
			else {
				$dbw->query( "UPDATE $pageTable SET page_counter = page_counter + 1"
					. " WHERE page_id = $id" );
			}
			return;
		}

		# Not important enough to warrant an error page in case of failure
		$oldignore = $dbw->ignoreErrors( true );

		$dbw->query( "INSERT INTO $hitcounterTable (hc_id) VALUES ({$id})" );

		$checkfreq = intval( $wgHitcounterUpdateFreq/25 + 1 );
		if( (rand() % $checkfreq != 0) or ($dbw->lastErrno() != 0) ){
			# Most of the time (or on SQL errors), skip row count check
			$dbw->ignoreErrors( $oldignore );
			return;
		}

		$res = $dbw->query("SELECT COUNT(*) as n FROM $hitcounterTable");
		$row = $dbw->fetchObject( $res );
		$rown = intval( $row->n );
		if( $rown >= $wgHitcounterUpdateFreq ){
			wfProfileIn( 'Article::incViewCount-collect' );
			$old_user_abort = ignore_user_abort( true );

			$result = null;
			wfRunHooks( 'incViewCollect', array( &$this, &$id, &$dbw, &$result ) );
			if ( $result !== null ) {
				return $result;
			}

			if ($wgDBtype == 'mysql')
				$dbw->query("LOCK TABLES $hitcounterTable WRITE");
			$tabletype = $wgDBtype == 'mysql' ? "ENGINE=HEAP " : '';
			$dbw->query("CREATE TEMPORARY TABLE $acchitsTable $tabletype AS ".
				"SELECT hc_id,COUNT(*) AS hc_n FROM $hitcounterTable ".
				'GROUP BY hc_id');
			$dbw->query("DELETE FROM $hitcounterTable");
			if ($wgDBtype == 'mysql') {
				$dbw->query('UNLOCK TABLES');
				$dbw->query("UPDATE $pageTable,$acchitsTable SET page_counter=page_counter + hc_n ".
					'WHERE page_id = hc_id');
			}
			else {
				$dbw->query("UPDATE $pageTable SET page_counter=page_counter + hc_n ".
					"FROM $acchitsTable WHERE page_id = hc_id");
			}
			$dbw->query("DROP TABLE $acchitsTable");

			ignore_user_abort( $old_user_abort );
			wfProfileOut( 'Article::incViewCount-collect' );
		}
		$dbw->ignoreErrors( $oldignore );
	}