Analytics/udp-filter

=Wikimedia's generic UDP filtering system=

Wikimedia Foundation (c) 2012 // Diederik van Liere. This code has been released under the GPL2.

=Dependencies=

libgeoip-dev is a dependency that needs to be installed if you are going to compile manually.

=Installation=

Compiling
You can run the following sequence to install udp-filter:
 * ./configure
 * make install

Debian package
You can create a package yourself using the following steps: (replace strings where necessary)


 * 1) git clone git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
 * 2) mv udp-filters udp-filters-0.1+git
 * 3) tar -cvzf udp-filters-0.1+git.tar.gz udp-filters-0.1+git/
 * 4) dh_make -c gpl2 -e  -f ../udp-filters-0.1+git.tar.gz
 * 5) make changes (if necessary) to control-sample and copy it to debian/control
 * 6) make changes (if necessary) to copyright-sample and copy it to debian/copyright
 * 7) dpkg-depcheck -d ./configure (the libraries reported by dpkg-depcheck must be added to debian/control; the following four packages need to be added to the control file):
   * libgeoip-dev
   * mime-support
   * mawk
   * autotools-dev
 * 8) dpkg-buildpackage -rfakeroot -sgpg (if you want to sign the package, make sure that the name and email address of the maintainer in the control file exactly match the name and email address of your GPG key)
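
Steps 2 and 3 can be wrapped in a small helper so that the directory name and the tarball name always stay in sync. This is only a sketch: the function name `make_source_tarball` and its arguments are hypothetical and not part of the udp-filters repository.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with udp-filters) wrapping steps 2 and 3:
# rename the checkout to <name>-<version> and pack it into
# <name>-<version>.tar.gz in the same parent directory.
make_source_tarball() (
    checkout=$1   # path to the cloned directory, e.g. udp-filters
    name=$2       # package name, e.g. udp-filters
    version=$3    # version string, e.g. 0.1+git

    parent=$(dirname "$checkout")
    current=$(basename "$checkout")
    target="$name-$version"

    cd "$parent" || exit 1

    # Step 2: rename unless the directory already carries the version.
    if [ "$current" != "$target" ]; then
        mv "$current" "$target" || exit 1
    fi

    # Step 3: create the tarball next to the renamed directory.
    tar -czf "$target.tar.gz" "$target" || exit 1

    # Print the path of the tarball for use in later steps.
    echo "$parent/$target.tar.gz"
)
```

For example, `make_source_tarball ./udp-filters udp-filters 0.1+git` would rename the checkout and print the path of the resulting tarball, which can then be passed to dh_make in step 4.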

Example control file
 Source: udp-filters
 Section: utils
 Priority: extra
 Maintainer: Diederik van Liere (Wikimedia Foundation)
 Build-Depends: debhelper (>= 7.0.50~), autotools-dev, libgeoip-dev, mime-support, mawk
 Standards-Version: 3.8.4
 Homepage:
 Vcs-Git: git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
 Vcs-Browser: https://gerrit.wikimedia.org/r/gitweb?p=analytics/udp-filters.git

 Package: udp-filters
 Architecture: any
 Depends: libc6 (>= 2.4), libgeoip1 (>= 1.4.6)
 Description:

=Background=

This new filter system replaces the old collection of filters written in C.

=Command line arguments=

The following is a list of valid command line parameters. Either --url or --project is mandatory (you can also use both); the other command line parameters are optional:

-u or --url:         the string or multiple strings separated by a comma that indicate what you want to match.

-p or --project:     the part of the domain name that you want to match. For example, 'en.m.' would match all English mobile Wikimedia projects.

-g or --geocode:     flag to geocode the log; turned off by default.

-a or --anonymize:   flag to anonymize the log; turned off by default.

-d or --database:    specify an alternative path to the MaxMind database. Please use the Country database; the City database is not yet supported.

-c or --country_list: limit the log to particular countries. This should be a comma-separated list of country codes. Valid country codes are the ISO 3166 country codes (see http://www.maxmind.com/app/iso3166).

-f or --force:       do not match on either the domain or the url part; this basically turns filtering off. Can be useful when filtering for a specific country.

-v or --verbose:     output detailed debug information, not recommended in production. Output is sent to stderr.

-h or --help:        show this menu with all command line options.

For options -u, -p and -c you can enter multiple values if you separate them by a comma.

=Examples=

./udp_filter -p en.wikipedia           this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia

./udp_filter -u SOPA                   this will log all pageviews (depending on the sampling rate) when the url (excluding the domain name) contains SOPA. So this will collect SOPA *across* projects.

./udp_filter -p en.wikipedia -u SOPA	this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains SOPA.

./udp_filter -p en.wikipedia -u SOPA,PIPA  this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains either SOPA or PIPA.

./udp_filter -p en.wikipedia -a	       this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with 0.0.0.0

./udp_filter -p en.wikipedia -g        this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with the country code. For a list of all valid country codes, see http://www.maxmind.com/app/iso3166

./udp_filter -p en.wikipedia -g -c BR  this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with the country code. In addition, only hits from Brazil (BR) will be logged.

./udp_filter -p en.wikipedia -d GeoIP.dat  this specifies an alternative path for the MaxMind database and this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia.

./udp_filter -p en.wikipedia -u SOPA -c US this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains SOPA and the visitor comes from the US.

./udp_filter -p en.wikipedia -v        this turns on verbose logging and can help in debugging and verifying that the appropriate hits are being logged. This setting is not recommended in production.
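
The matching rules used in the examples above can be made concrete with a small emulation in shell and awk. This is a sketch under stated assumptions, not udp-filter's implementation: the function name `emulate_filter` is hypothetical, and the input is a simplified, made-up two-column format (`IP URL`) rather than the real log format. It shows three behaviours described above: the project is matched against the domain only, the url string is matched against the part after the domain, and comma-separated values are treated as alternatives. Anonymizing replaces the ip address with 0.0.0.0.

```shell
#!/bin/sh
# Hypothetical emulation of `-p PROJECTS -u URLS [-a]` on simplified
# "IP URL" lines read from stdin. NOT the real udp-filter log format.
#   $1: comma-separated project strings (empty = do not filter on domain)
#   $2: comma-separated url strings     (empty = do not filter on url)
#   $3: 1 to replace the ip address with 0.0.0.0, anything else to keep it
emulate_filter() {
    awk -v projects="$1" -v urls="$2" -v anonymize="$3" '
    # Return 1 when text contains at least one comma-separated item of list.
    function contains_any(text, list,    n, parts, i) {
        n = split(list, parts, ",")
        for (i = 1; i <= n; i++)
            if (parts[i] != "" && index(text, parts[i]) > 0)
                return 1
        return 0
    }
    {
        # Split "scheme://domain/path" into the domain and the rest.
        url = $2
        sub(/^[a-z]+:\/\//, "", url)
        slash = index(url, "/")
        if (slash > 0) {
            domain = substr(url, 1, slash - 1)
            path = substr(url, slash)
        } else {
            domain = url
            path = ""
        }

        # Both conditions must hold when both options are given.
        if (projects != "" && !contains_any(domain, projects)) next
        if (urls != "" && !contains_any(path, urls)) next

        if (anonymize == "1") $1 = "0.0.0.0"
        print
    }'
}
```

For instance, `emulate_filter en.wikipedia SOPA,PIPA 1 < sample.txt` (with a hypothetical `sample.txt` in the two-column format) mirrors `./udp_filter -p en.wikipedia -u SOPA,PIPA -a`.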

=Multiplexer=

The multiplexer was written by Munagala Ramanath (xyzram on IRC).

This is a diagram of how the multiplexer works: