Analytics/Wikistats/New mobile pageviews documentation

= Overview =

Usage
To use this software package, you just have to run:

 ./pageviews.pl conf/stat1.json

Provided you have written a correct configuration file (see the next section), this runs a report.

The running times for this may vary, depending on:


 * how many days you're processing
 * the logic in the version you're running
 * the restrictions in your configuration file

In general, though, a full run on stat1 over 7 months of data should take at most 7 hours.

Configuration
We start with an example configuration and then go through each attribute and its values to explain what they mean.

 {
   "model"                : "parallel",
   "max-children"         : 8,
   "input-path"           : "/home/user/wikidata/raw_gzips",
   "children-output-path" : "/tmp/pageviews/map",
   "output-path"          : "/tmp/pageviews",
   "output-formats"       : ["web","json","wikireport"],
   "logs-prefix"          : "sampled",
   "restrictions"         : {
     "days-of-each-month" : [2,3],
     "lines-for-each-day" : 100000
   },
   "start"                : { "year" : 2012, "month" : 8 },
   "end"                  : { "year" : 2013, "month" : 2 }
 }

What this says is that we want to process only data between August 2012 and February 2013, that we restrict ourselves to the first 100,000 lines of each .gz file, and that we only take days 2 and 3 of each month into consideration.

We're also saying we want files to be processed in parallel (more details in the section about PageViews::Model::Parallel) and that we want at most 8 workers (i.e. 8 files) to be processed in parallel.

The output-formats attribute lists the formats in which the report is written (here web, json and wikireport).
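As a rough illustration of how such a configuration could be read and checked before a run, the sketch below parses a trimmed-down copy of the example with the core JSON::PP module. The helper names load_config and months_covered, and the choice of which keys are mandatory, are assumptions made for this example; they are not taken from pageviews.pl.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: parse the JSON text and check a few attributes.
sub load_config {
    my ($json_text) = @_;
    my $conf = JSON::PP->new->decode($json_text);

    # Assumed set of mandatory keys, based on the example configuration.
    for my $key (qw(model input-path output-path start end)) {
        die "missing configuration key: $key\n" unless exists $conf->{$key};
    }
    die "start must not be after end\n" if months_covered($conf) < 1;
    return $conf;
}

# Number of months between start and end, both ends inclusive.
sub months_covered {
    my ($conf) = @_;
    my $start = $conf->{start}{year} * 12 + $conf->{start}{month};
    my $end   = $conf->{end}{year}   * 12 + $conf->{end}{month};
    return $end - $start + 1;
}

my $conf = load_config(<<'JSON');
{
  "model"        : "parallel",
  "max-children" : 8,
  "input-path"   : "/home/user/wikidata/raw_gzips",
  "output-path"  : "/tmp/pageviews",
  "start"        : { "year" : 2012, "month" : 8 },
  "end"          : { "year" : 2013, "month" : 2 }
}
JSON

print "months covered: ", months_covered($conf), "\n";
```

For the example interval (August 2012 to February 2013) this reports 7 months, which matches the 7 months of data mentioned above.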

PageView definition
It's not trivial to pinpoint exactly what a pageview is so we believe it's important to document what it means.

To this end, we've made a flowchart of the criteria through which we decide if a request which is found in the squid logs is or isn't a pageview.

This is subject to change.



General workflow
The following diagram describes how the components are connected together. The computation times for reports are also present.



Documentation
This documentation was generated using pandoc.

= PageViews::BotDetector =

NAME
PageViews::BotDetector -- Web Bot detection based on ip and ua

DESCRIPTION
It uses a Net::Patricia object (Patricia tree) to do a fast lookup on a list of pre-defined IP ranges for bots. This list of pre-defined IP ranges was taken from the wikistats project.

match_ip($self,$ip)
Does a fast lookup (using the Patricia tree) to decide if the IP is in one of the ranges that bots are known to use.

match_ua($self,$ip)
Does pattern matching on the user-agent with a given list of keywords that bots are known to have.
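The two checks can be sketched in plain Perl as follows. This is only an illustration: the keyword list and the IP ranges are made up (the ranges are documentation addresses), the function names are hypothetical, and the IP check uses a simple linear scan over CIDR ranges in place of the Patricia tree that the module actually uses for speed.

```perl
use strict;
use warnings;

# Hypothetical keyword list; the real list lives in the module.
my @bot_keywords = qw(bot crawler spider);
my $bot_ua_re    = do {
    my $alternation = join '|', map { quotemeta($_) } @bot_keywords;
    qr/$alternation/i;
};

# True when the user-agent contains one of the keywords.
sub looks_like_bot_ua {
    my ($ua) = @_;
    return $ua =~ $bot_ua_re ? 1 : 0;
}

# Convert a dotted-quad IPv4 address to a 32-bit integer.
sub ip_to_int {
    my ($ip) = @_;
    return unpack 'N', pack 'C4', split /\./, $ip;
}

# Hypothetical ranges, stored as [network, mask] pairs.
my @bot_ranges = map {
    my ($net, $bits) = split m{/};
    my $mask = $bits == 0 ? 0 : (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF;
    [ ip_to_int($net) & $mask, $mask ];
} qw(192.0.2.0/24 198.51.100.0/25);

# Linear scan over the ranges (a Patricia tree avoids this scan).
sub looks_like_bot_ip {
    my ($ip) = @_;
    my $n = ip_to_int($ip);
    for my $range (@bot_ranges) {
        my ($net, $mask) = @$range;
        return 1 if ($n & $mask) == $net;
    }
    return 0;
}

print looks_like_bot_ua('Googlebot/2.1') ? "bot\n" : "not a bot\n";
print looks_like_bot_ip('192.0.2.77')    ? "bot\n" : "not a bot\n";
```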

= PageViews::Generate1 =

= PageViews::Util =

NAME
PageViews::Util - role with reusable methods

DESCRIPTION
This is a role consumed by the classes which need methods from it. Methods are imported into the classes selectively, as needed.

how_many_days_month_has($y,$m)
This is a function which returns how many days the month has. It receives as parameters the year and the month.

compact_days_to_months($self)
Because the data is output in multiple formats, the processed counts are kept as granular as needed (for now): we have counts for each day. Some of the views require rendering in terms of months. This method goes over all hashes and adds up the counts from each day. It then replaces the daily counts inside the class with the monthly counts.
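A minimal sketch of this compaction, assuming the daily counts are kept in a hash of the shape { "YYYY-MM-DD" => { language => count } }. That shape is an assumption made for the example, and the sketch returns a new hash rather than replacing the counts inside an object.

```perl
use strict;
use warnings;

# Sum daily counts into monthly counts.
sub compact_days_to_months {
    my ($daily) = @_;
    my %monthly;
    for my $day (keys %$daily) {
        my ($month) = $day =~ /^(\d{4}-\d{2})-\d{2}$/
            or die "unexpected day key: $day\n";
        for my $lang (keys %{ $daily->{$day} }) {
            $monthly{$month}{$lang} += $daily->{$day}{$lang};
        }
    }
    return \%monthly;
}

my $monthly = compact_days_to_months({
    '2012-08-02' => { en => 10, de => 4 },
    '2012-08-03' => { en => 5 },
    '2012-09-02' => { en => 1 },
});
print "en in 2012-08: $monthly->{'2012-08'}{en}\n";
```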

sorted_languages_in_counts($self)
This method finds a list of all the languages present in the counts. The languages returned are sorted alphabetically.

sorted_months_in_counts($self)
This method finds all the months present in the counts. We sort the months chronologically.
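One detail worth spelling out: if month keys are not zero-padded (the chart data further down uses "2012-9" and "2012-10"), a plain string sort puts "2012-10" before "2012-9". The sketch below sorts by numeric year and month instead. The function name sort_months and the "YYYY-M" key format are assumptions for this example.

```perl
use strict;
use warnings;

# Sort "YYYY-M" or "YYYY-MM" keys chronologically by comparing the
# year and the month as numbers rather than as strings.
sub sort_months {
    my @months = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
           map  { [ $_, split /-/ ] } @months;
}

my @sorted = sort_months('2012-10', '2013-1', '2012-9');
print join(', ', @sorted), "\n";
```

Languages need no such care: a plain sort on the language codes already gives the alphabetical order.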

extrapolate($self,$factor)
To be implemented. This method would theoretically multiply all the pageviews by a certain factor. It would be used in order to scale the sampled input back to the original magnitude.

get_data_for_csv($self)
The csv file for wikistats (PageViewsPerMonthAll.csv) must have the following columns:


 * language
 * date
 * pageview count

The date column must be the last day of the month in Y/M/D format.

This method produces the necessary csv and returns it as a string.
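The sketch below shows one way such a csv could be produced from monthly counts. It assumes a hash of the shape { "YYYY-MM" => { language => count } }, a comma as separator and zero-padded dates; all three are assumptions, as are the function names. The last day of a month is found as the first day of the following month minus one day.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Last day of a month in Y/M/D form.
sub last_day_of_month {
    my ($year, $month) = @_;
    my ($ny, $nm) = $month == 12 ? ($year + 1, 1) : ($year, $month + 1);
    my $t = timegm(0, 0, 0, 1, $nm - 1, $ny) - 86_400;
    return strftime('%Y/%m/%d', gmtime($t));
}

# One row per month and language: language, date, pageview count.
sub monthly_counts_to_csv {
    my ($monthly) = @_;
    my @rows;
    for my $month (sort keys %$monthly) {
        my ($y, $m) = split /-/, $month;
        for my $lang (sort keys %{ $monthly->{$month} }) {
            push @rows, join ',', $lang, last_day_of_month($y, $m),
                                  $monthly->{$month}{$lang};
        }
    }
    return join '', map { "$_\n" } @rows;
}

my $csv = monthly_counts_to_csv({
    '2012-02' => { en => 5 },
    '2012-08' => { de => 3, en => 7 },
});
print $csv;
```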

= PageViews::View::Web =

NAME
PageViews::View::Web -- Formatting the report in HTML format

DESCRIPTION
This module creates the pageviews.html file. The data is taken from a model and normalized, and then totals, rankings and deltas are calculated. All the data in this form is passed to a templating engine (Template Toolkit) and the reports are rendered.

The templates are in the templates/ directory.

Preparing the data involves:


 * compacting daily pageviews into monthly pageviews
 * scaling each month to a 30-day period so that months with more/less days can be compared
 * finding out rankings for each month
 * calculating totals

All the data mentioned above is then put into data structures, which are passed to the templating engine (Template Toolkit) to be turned into HTML.

safe_division($self,$numerator,$denominator)
This method performs safe division between two numbers.

format($self,$val)
This method formats a percentage so that it can be displayed in the table cells.

format_rank($self,$rank)
This method formats a rank name.

get_totals_sorted_months_present_in_data($self,$languages_present_uniq,$language_totals)
Returns a list of languages sorted by absolute total(total over all months).

get_time_sorted_months_present_in_data($self)
This method returns the months in YYYY-MM format, sorted chronologically.

third_pass_chart_data($self,$languages,$months)
This method takes as parameter the languages present and the months.

It returns a structure of the following form that can be used by the charts inside the html page (we currently use d3.js to render those).

 {
   "en": {
     "counts" : [ 40      , 50       , 60        ],
     "months" : [ "2012-9", "2012-10", "2012-11" ]
   }
 }

first_pass_languages_totals($self)
Calculates the totals for each month and each language, and returns a hash with the following keys:


 * months_present
 * languages_present_uniq
 * month_totals
 * language_totals
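A first pass of this kind could look roughly like the sketch below. The four keys come from the list above; the input shape { month => { language => count } }, the function name and the sorting of the two lists are assumptions made for the example.

```perl
use strict;
use warnings;

# Accumulate per-month and per-language totals in a single pass.
sub languages_totals {
    my ($monthly) = @_;
    my (%month_totals, %language_totals);
    for my $month (keys %$monthly) {
        for my $lang (keys %{ $monthly->{$month} }) {
            my $count = $monthly->{$month}{$lang};
            $month_totals{$month}   += $count;
            $language_totals{$lang} += $count;
        }
    }
    return {
        months_present         => [ sort keys %month_totals ],
        languages_present_uniq => [ sort keys %language_totals ],
        month_totals           => \%month_totals,
        language_totals        => \%language_totals,
    };
}

my $totals = languages_totals({
    '2012-08' => { en => 10, de => 4 },
    '2012-09' => { en => 6 },
});
print "en total: $totals->{language_totals}{en}\n";
```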

second_pass_rankings($self,$languages,$months)
Returns the month rankings in a hash of the form:

 {
   "2012-08": {
     "en": 1,
     "ja": 2,
     "de": 3,
     ...
   },
   ...
 }
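As a sketch of how such rankings can be derived, the code below ranks the languages of each month by descending count. The input shape { month => { language => count } } and the function name are assumptions; so is the tie-breaking rule (alphabetical), which the description above does not specify.

```perl
use strict;
use warnings;

# Rank languages within each month: rank 1 is the highest count.
sub month_rankings {
    my ($monthly) = @_;
    my %rankings;
    for my $month (keys %$monthly) {
        my $counts   = $monthly->{$month};
        my @by_count = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                       keys %$counts;
        my $rank = 0;
        $rankings{$month}{$_} = ++$rank for @by_count;
    }
    return \%rankings;
}

my $rankings = month_rankings({
    '2012-08' => { de => 20, en => 100, ja => 50 },
});
print "ja is ranked $rankings->{'2012-08'}{ja}\n";
```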

scale_m_to_30($month,$value)
Receives a year and month and a value.

Returns the value scaled to a 30-day month.
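The scaling itself is a simple proportion: value * 30 / days in the month. The sketch below assumes the month is passed as a "YYYY-MM" string and uses a local helper, days_in_month, standing in for how_many_days_month_has from PageViews::Util. Both the string format and the helper are assumptions for this example.

```perl
use strict;
use warnings;

# Number of days in a month, with the Gregorian leap-year rule.
sub days_in_month {
    my ($y, $m) = @_;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    return $m == 2 && $leap ? 29 : $days[$m - 1];
}

# Scale a monthly value to what it would be in a 30-day month.
sub scale_m_to_30 {
    my ($month, $value) = @_;
    my ($y, $m) = split /-/, $month;
    return $value * 30 / days_in_month($y, $m);
}

print scale_m_to_30('2012-08', 310), "\n";   # 31-day month, scaled down
print scale_m_to_30('2012-02', 29),  "\n";   # leap February, scaled up
```

With this scaling, a 31-day month with 310 pageviews and a 29-day month with 290 pageviews both come out at 300, so they can be compared directly.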

scale_months_to_30($self)
Replaces all the pageview counts in the current object with scaled values of themselves.

get_data_for_template
Calculates deltas, rankings, totals. Provides data about discarded lines, how many they are, and the reasons for which they were discarded.

Formats the data from the model in the way that the templates expect it to be.

get_data_from_model($self,$model)
Copies all the needed data from the model in order to proceed with the rendering.

render($self,$params)
Receives as parameter a hash of parameters (which is read from the configuration).

Uses Template::Toolkit to render the template for this view.

= PageViews::View::Limn =

NAME
PageViews::View::Limn -- Renders the data in Limn format.

DESCRIPTION
Unimplemented (low-priority).

= PageViews::View::JSON =

NAME
PageViews::View::JSON -- renders the pageview counts in a JSON file

DESCRIPTION
This module formats the data in JSON format and writes it to disk.

The Git SHA1 is also included in this JSON file as well as the actual configuration which was used for that particular run.

This is useful when further changes need to be made to the way the data is rendered: the data can be rendered again through PageViews::Model::JSON without having to re-run the counting.

The SHA1 is useful in order to compare different revisions of the code to see what changes produced differences in the final counts.

get_data_from_model($self,$model)
Gets as parameter a model and collects what data is needed from it in order to produce the json.

render($self)
This method renders the pageview counts in JSON format. It also includes the config.json with which the run was made, stored under the __config key inside data.json. There is also a git-sha1-for-this-run key in the JSON, which shows the commit in the git history of the code that produced this file.
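A sketch of what assembling such a file could look like with the core JSON::PP module. The keys __config and git-sha1-for-this-run come from the description above; the function name, the counts key and the placeholder SHA1 are assumptions. In a real run the SHA1 could be obtained with a command such as git rev-parse HEAD, which is likewise an assumption about how the module does it.

```perl
use strict;
use warnings;
use JSON::PP;

# Bundle counts, configuration and code revision into one JSON document.
sub render_json {
    my ($counts, $config, $sha1) = @_;
    my $data = {
        'counts'                => $counts,   # hypothetical key name
        '__config'              => $config,
        'git-sha1-for-this-run' => $sha1,
    };
    # canonical() sorts the keys so that two runs can be diffed easily.
    return JSON::PP->new->canonical->pretty->encode($data);
}

my $json = render_json(
    { '2012-08' => { en => 15 } },
    { model => 'parallel', 'max-children' => 8 },
    '0000000000000000000000000000000000000000',   # placeholder SHA1
);
print $json;
```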

= PageViews::Model::Sequential =

NAME
PageViews::Model::Sequential - Processing squid log lines one file at a time

build_accepted_url_regex1
This is not a method. It is just a function. It creates a (rather big) regex that has 8 captures.



accept_case1($self,$u,$r)
This method takes url_info and referer_info in the format returned by accept_rule_url.

It treats the case where both the url and the referer have the same title, in which case it discards the request.

Otherwise it accepts because this means the request was not caused by the same page as the url.

accept_case2($self,$u,$r)
This method takes url_info and referer_info in the format returned by accept_rule_url.

This is a case which stemmed from feedback from the Mobile Team.

It treats the case where the title should be different although the referer and url are /wiki/ and /w/api.php links.

accept_rule_url_and_referer
This is one of the main parts of the logic in the pageview definition.

It uses accept_case1 and accept_case2 to deal with some of the cases.

All other edge-cases will be put in methods with the name accept_caseX where X will be a number.

So we are currently treating cases where the url and referer are api urls, and also the case where the url is a wiki url and the referer is a wiki url.

process_line($self,$line)
This method takes a log line as argument. It splits it by space and tab into fields. Afterwards a series of filters is applied to the line. These filters are:


 * minimum field count constraint
 * accept_rule_time
 * accept_rule_status_code
 * accept_rule_method
 * accept_rule_mimetype
 * accept_rule_url
 * accept_rule_url_and_referer
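The general shape of such a filter chain is sketched below: the line is split into fields, and the first filter that fails discards the line and records the reason. Everything specific in this sketch is hypothetical: the field positions, the minimum field count, the accepted status, method and mime type, and the sample log line. Only three of the filters are shown.

```perl
use strict;
use warnings;

my $MIN_FIELDS = 14;    # hypothetical minimum field count

# Each filter is [name, check]; the check receives the fields array ref.
# Field positions and accepted values below are assumptions.
my @filters = (
    [ status_code => sub { my ($f) = @_; $f->[5]  =~ m{/200$} } ],
    [ method      => sub { my ($f) = @_; $f->[7]  eq 'GET' } ],
    [ mimetype    => sub { my ($f) = @_; $f->[10] =~ m{^text/html} } ],
);

# Returns 1 if the line passes all filters, 0 otherwise.
# Discard reasons are counted in the %$discarded hash.
sub process_line {
    my ($line, $discarded) = @_;
    my @fields = split /[ \t]+/, $line;
    if (@fields < $MIN_FIELDS) {
        $discarded->{field_count}++;
        return 0;
    }
    for my $filter (@filters) {
        my ($name, $check) = @$filter;
        unless ($check->(\@fields)) {
            $discarded->{$name}++;
            return 0;
        }
    }
    return 1;
}

my %discarded;
my $line = join ' ', 'cp1.example', 1, '2012-08-02T00:00:01', '0.001',
    '192.0.2.1', 'TCP_MISS/200', 1234, 'GET',
    'http://en.m.wikipedia.org/wiki/Perl', 'NONE/-', 'text/html',
    '-', '-', 'Mozilla/5.0';
print process_line($line, \%discarded) ? "accepted\n" : "discarded\n";
```

Stopping at the first failing filter keeps the cheap checks (field count, status code) in front of the more expensive url and referer rules.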

get_files_in_interval($self,$params)
This method reads from the hash it is passed (the configuration file parsed from JSON into Perl data structures).

It reads the start and end dates, selects the files in the input-path that match the logs-prefix, sorts them and returns them as a list.
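A sketch of the selection step, working on a list of file names instead of reading a directory. The file name layout (prefix, then a dash and a YYYYMMDD date before .gz) is an assumption made for this example, as are the function name and the sample names.

```perl
use strict;
use warnings;

# Keep the names that match the prefix and fall inside [start, end].
sub files_in_interval {
    my ($params, @names) = @_;
    my $prefix = quotemeta $params->{'logs-prefix'};
    my $from = sprintf '%04d%02d', @{ $params->{start} }{qw(year month)};
    my $to   = sprintf '%04d%02d', @{ $params->{end}   }{qw(year month)};

    my @selected;
    for my $name (@names) {
        # Assumed name layout: <prefix>...-YYYYMMDD.gz
        my ($ym) = $name =~ /^$prefix.*-(\d{6})\d{2}\.gz$/ or next;
        push @selected, $name if $ym ge $from && $ym le $to;
    }
    # With a YYYYMMDD date in the name, a string sort is chronological.
    return sort @selected;
}

my @files = files_in_interval(
    {
        'logs-prefix' => 'sampled',
        start         => { year => 2012, month => 8 },
        end           => { year => 2013, month => 2 },
    },
    qw(
        sampled-1000.log-20130228.gz
        sampled-1000.log-20120731.gz
        sampled-1000.log-20120802.gz
        sampled-1000.log-20130301.gz
        other-1000.log-20120901.gz
    ),
);
print "$_\n" for @files;
```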

process_files($params)
The files which need to be processed are determined through get_files_in_interval and then processing commences, one file at a time.

= PageViews::Model::Parallel =

NAME
PageViews::Model::Parallel - Model for parallel processing of Squid logfiles.

DESCRIPTION
This module inherits from PageViews::Model::Sequential.

The main difference is in the way it processes the files. While ::Sequential processes files one by one, ::Parallel has a loop where it forks up to max-children worker processes, each working on a different squid log file.

reduce($self,$json_path)
After all worker processes have finished, reduce adds up all the counts from each worker process.

Because there are multiple structures, each different in terms of keys, these are separated into some categories, to_reduce1, to_reduce2, and these are reduced separately.

Finally, the reduced counts are replaced inside the model.
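The adding-up step can be pictured as a recursive merge of nested hashes in which the numeric leaves are summed. The sketch below shows that idea only; the function name and the shape of the per-worker structures are assumptions, and it does not reproduce the to_reduce1 / to_reduce2 grouping.

```perl
use strict;
use warnings;

# Add the counts in $part into $total, descending into nested hashes.
sub add_counts {
    my ($total, $part) = @_;
    for my $key (keys %$part) {
        if (ref $part->{$key} eq 'HASH') {
            $total->{$key} ||= {};
            add_counts($total->{$key}, $part->{$key});
        }
        else {
            $total->{$key} += $part->{$key};
        }
    }
    return $total;
}

# Hypothetical outputs of two workers.
my @worker_outputs = (
    { '2012-08-02' => { en => 10, de => 4 } },
    { '2012-08-02' => { en => 5 }, '2012-08-03' => { ja => 7 } },
);

my $reduced = {};
add_counts($reduced, $_) for @worker_outputs;
print "en on 2012-08-02: $reduced->{'2012-08-02'}{en}\n";
```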

update_child_slots
This method checks to see which of the PIDs of the worker processes in the attribute active_children_pids are still active. It then updates the array.

This method returns the number of workers still allowed to be started (up to the max-children limit).

write_child_output_to_disk($output_path)
This method writes to disk the counts a worker has processed.

process_files($params)
The process_files method receives as parameter the config hash read from the configuration file.

It then forks up to max-children worker processes. A check is made to see how many children are active; if there are fewer than max-children, more are started until the limit is reached.

The initial process that calls process_files effectively waits for the children to finish. After they have finished, it reduces (adds up) all the counts and stores them inside the class.
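The fork-and-wait pattern described above can be sketched as follows. This is a simplified stand-in, not the module's code: the function name run_in_parallel is hypothetical, each worker just writes a small file to a temporary directory instead of counting pageviews, and the parent blocks in waitpid rather than polling the list of active PIDs.

```perl
use strict;
use warnings;
use POSIX ();
use File::Temp qw(tempdir);

# Run $work->($file) for every file, with at most $max_children
# worker processes alive at any time.
sub run_in_parallel {
    my ($max_children, $work, @files) = @_;
    my %active;          # pid => file being processed
    my $finished = 0;

    while (@files || %active) {
        # Start workers while there are free slots.
        while (@files && keys(%active) < $max_children) {
            my $file = shift @files;
            my $pid  = fork();
            die "fork failed: $!\n" unless defined $pid;
            if ($pid == 0) {
                $work->($file);
                POSIX::_exit(0);    # leave the child without parent cleanup
            }
            $active{$pid} = $file;
        }
        # Block until some child exits, which frees a slot.
        my $pid = waitpid(-1, 0);
        last if $pid == -1;
        $finished++ if defined delete $active{$pid};
    }
    return $finished;
}

my $dir      = tempdir(CLEANUP => 1);
my $finished = run_in_parallel(
    2,
    sub {
        my ($file) = @_;
        open my $out, '>', "$dir/$file.out" or die "cannot write: $!\n";
        print {$out} "processed $file\n";
        close $out;
    },
    qw(a.gz b.gz c.gz d.gz e.gz),
);
my @outputs = glob "$dir/*.out";
print "$finished workers finished, ", scalar(@outputs), " output files\n";
```

Because the workers are separate processes, they cannot update the parent's counts directly, which is why each worker writes its results to disk and the parent reduces them afterwards.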

= PageViews::Model::JSON =

NAME
PageViews::Model::JSON - Model which has a json file as data source

DESCRIPTION
Because a run on 7 months currently takes around 6 hours, it's best to have the results of all computations stored on disk, so that if some tweaks need to be made to the rendering of the data, they can be made afterwards without rerunning the counting.

This module is mainly intended for reusing the data.json produced by a previous run.

process_files($hash)
The parameter to this method is a hash. This hash contains the configuration with which this module is run.

Multiple such configurations can be found in the conf/ sub-directory of this project.

This method finds the data.json in the input-path, parses the json file and stores the needed keys in the class.