Analytics/udp-filter

Wikimedia's generic UDP filtering system

Wikimedia Foundation (c) 2012 // Diederik van Liere. This code has been released under the GPL2.

Dependencies

libgeoip-dev must be installed if you are going to compile manually.

Installation

Debian packages are available at http://garage-coding.com/releases/udp-filters/

Compiling

You can run the following sequence to install udp-filter:

  • ./configure
  • make install

Debian package

Gerrit repo: https://gerrit.wikimedia.org/g/analytics/udp-filters

You can create a package yourself using the following steps (replace strings where necessary):

  1. git clone git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
  2. mv udp-filters udp-filters-0.1+git<yyyymmdd>
  3. tar -cvzf udp-filters-0.1+git<yyyymmdd>.tar.gz udp-filters-0.1+git<yyyymmdd>/
  4. dh_make -c gpl2 -e <your_email@wikimedia.org> -f ../udp-filters-0.1+git<yyyymmdd>.tar.gz
  5. make changes (if necessary) to control-sample and copy it to debian/control
  6. make changes (if necessary) to copyright-sample and copy it to debian/copyright
  7. dpkg-depcheck -d ./configure (the libraries in the output of dpkg-depcheck must be added to debian/control; the following four packages need to be added to the control file):
    1. libgeoip-dev
    2. mime-support
    3. mawk
    4. autotools-dev
  8. dpkg-buildpackage -rfakeroot -sgpg (if you want to sign the package, make sure that the name and email address of the maintainer in the control file exactly match the name and email address of your GPG key)
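
Steps 2 to 4 reuse the same dated version string. As a small illustration, the following Python sketch builds the directory and tarball names from a date and prints the matching commands. The helper name package_names is hypothetical and not part of udp-filters.

```python
from datetime import date

BASE = "udp-filters-0.1+git"  # prefix used in steps 2 to 4 above


def package_names(day):
    """Return (directory, tarball) names for a snapshot taken on `day`."""
    stamp = day.strftime("%Y%m%d")  # fills the <yyyymmdd> placeholder
    directory = f"{BASE}{stamp}"
    return directory, f"{directory}.tar.gz"


if __name__ == "__main__":
    directory, tarball = package_names(date.today())
    # Print the commands for steps 2 and 3 with the placeholder filled in.
    print(f"mv udp-filters {directory}")
    print(f"tar -cvzf {tarball} {directory}/")
```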

Example control file

Source: udp-filters
Section: utils
Priority: extra
Maintainer: Diederik van Liere (Wikimedia Foundation) <dvanliere@wikimedia.org>
Build-Depends: debhelper (>= 7.0.50~), autotools-dev, libgeoip-dev, mime-support, mawk
Standards-Version: 3.8.4
Homepage: <http://www.mediawiki.org/wiki/Analytics/UDP-filters>
Vcs-Git: git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
Vcs-Browser: https://gerrit.wikimedia.org/r/gitweb?p=analytics/udp-filters.git

Package: udp-filters
Architecture: any
Depends: libc6 (>= 2.4), libgeoip1 (>= 1.4.6)
Description: <Wikimedia's udp-filter system.>
 <Wikimedia has a udp-logger that sends packets from the squid servers containing pageviews. UDP-filters allows you to configure a filter and write particular pageviews, based on a combination of domain and url matching, to a logfile. It also offers geocoding and anonymization of IP addresses.>

Background

This new filter system replaces the old collection of filters written in C.

Command line arguments

The following is a list of valid command line parameters.

At least one of -url or -project is mandatory (you can also use both); the other command line parameters are optional:

Option Description
-p paths, --path=paths

Path portions of the request URI to match. Comma separated.

-d domains, --domain=domain

Parts of domain names to match. Comma separated.

-r referers, --referers=domain

Parts of the referer domain to match. Comma separated.

-i addresses, --ip=addresses

IP address(es) to match. Comma separated. Accepts IPv4 and IPv6 addresses and CIDR ranges.

-c countries, --country-list=countries

Filter for countries. This should be a comma separated list of country codes. Valid country codes are the ISO 3166 country codes (see http://www.maxmind.com/app/iso3166).

-s status, --http-status=status

Filter for HTTP response status code(s).

-r pattern, --regex=pattern

The parameters -p, -u and -s are interpreted as regular expressions. Regular expression searching is probably slower, so substring matching is recommended.

-g, --geocode

Turns on geocoding of IP addresses. Must also specify --bird.

-b bird, --bird=bird

Mandatory when specifying --geocode. Valid choices are <country>, <region>, <city>, <latlon> and <everything>.

-a, --anonymize[=salt-key]

Turns on IP address anonymization. If salt-key is given, then libanon will be used to do prefix-preserving anonymization. salt-key may be 'random' or a string at least 32 characters long. If 'random' is given, then a random salt-key will be chosen.

-n count, --min-field-count=count

Minimum number of fields that a log line must contain. Default is 14. If a line has fewer than this number of fields, the line will be discarded.

-m path, --maxmind=path

Alternative path to the MaxMind database. Default: /usr/share/GeoIP

-F delimiter, --field-delimiter=delimiter

Sets the delimiter used to separate fields. '\t' will be translated to a tab character. Default: ' ' (space).

-v, --verbose

Output detailed debug information to stderr, not recommended in production.

-h, --help

Show this help message.

-v, --version

Show version info.
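
To make the behaviour of -F, -n and -i more concrete, here is a minimal Python sketch. It is an illustration only and not udp-filter's actual C implementation. The function names are invented, and IP_FIELD = 4 is a hypothetical position for the client IP address, since the log line layout is not described on this page.

```python
import ipaddress

IP_FIELD = 4  # hypothetical index of the client IP field (an assumption)


def parse_delimiter(arg):
    # -F: the two-character string backslash-t stands for a real tab character.
    return "\t" if arg == "\\t" else arg


def parse_networks(arg):
    """-i: parse a comma-separated list of IPv4/IPv6 addresses and CIDR ranges."""
    # A bare address such as 10.0.0.1 becomes a single-host network.
    return [ipaddress.ip_network(part.strip(), strict=False)
            for part in arg.split(",")]


def keep_line(line, delimiter=" ", min_fields=14, networks=None,
              ip_field=IP_FIELD):
    """Return True if the line passes the field-count and IP filters."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) < min_fields:
        return False  # -n: too few fields, discard the line
    if networks is None:
        return True  # no -i filter requested
    try:
        address = ipaddress.ip_address(fields[ip_field])
    except (ValueError, IndexError):
        return False  # missing or unparsable IP field
    # An IPv4 address never matches an IPv6 network, and vice versa.
    return any(address in network for network in networks)


if __name__ == "__main__":
    nets = parse_networks("10.0.0.0/8,2001:db8::/32")
    sample = " ".join(["x"] * 4 + ["10.1.2.3"] + ["x"] * 9)  # 14 fields
    print(keep_line(sample, networks=nets))
```

The sketch checks the field count before looking at the IP address, so a truncated line is dropped regardless of which other filters are active.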

Examples

./udp_filter -p en.wikipedia this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia

./udp_filter -u SOPA this will log all pageviews (depending on the sampling rate) when the url (excluding the domain name) contains SOPA. So this will collect SOPA *across* projects.

./udp_filter -p en.wikipedia -u SOPA this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains SOPA.

./udp_filter -p en.wikipedia -u SOPA,PIPA this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains either SOPA or PIPA.

./udp_filter -p en.wikipedia -a this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with 0.0.0.0

./udp_filter -p en.wikipedia -g this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with the country code. For a list of all valid country codes, see http://www.maxmind.com/app/iso3166

./udp_filter -p en.wikipedia -g -c BR this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia and replace the ip address of the visitor with the country code. In addition, only hits from Brazil (BR) will be logged.

./udp_filter -p en.wikipedia -d GeoIP.dat this specifies an alternative path for the MaxMind database and this will log all pageviews (depending on the sampling rate) if the domain contains en.wikipedia.

./udp_filter -p en.wikipedia -u SOPA -c US this will log all pageviews (depending on the sampling rate) where the domain contains en.wikipedia and the url contains SOPA and the visitor comes from the US.

./udp_filter -p en.wikipedia -v this turns on verbose logging and can help in debugging and verifying that the appropriate hits are being logged. This setting is not recommended in production.
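
The examples above combine filters in a consistent way: values within one comma-separated list are alternatives, while different options must all match. The Python sketch below illustrates that logic with plain substring matching, plus the 0.0.0.0 replacement described for -a without a salt-key. It is an illustration only, not udp-filter's C code; the function names and the idea of passing the domain and url as separate arguments are assumptions.

```python
def matches_any(value, patterns):
    """True if `value` contains at least one of the comma-separated substrings."""
    return any(p in value for p in patterns.split(",") if p)


def keep_hit(domain, url, domain_patterns=None, url_patterns=None):
    """AND across options, OR within each comma-separated list."""
    if domain_patterns is not None and not matches_any(domain, domain_patterns):
        return False
    if url_patterns is not None and not matches_any(url, url_patterns):
        return False
    return True


def anonymize(ip):
    """Mimic plain -a (no salt-key): the visitor address is replaced entirely."""
    return "0.0.0.0"


if __name__ == "__main__":
    # Corresponds to: ./udp_filter -p en.wikipedia -u SOPA,PIPA
    print(keep_hit("en.wikipedia.org", "/wiki/PIPA", "en.wikipedia", "SOPA,PIPA"))
    print(keep_hit("de.wikipedia.org", "/wiki/SOPA", "en.wikipedia", "SOPA,PIPA"))
    print(anonymize("198.51.100.7"))
```

Leaving an option unset (None) means that option places no restriction on the hit, which mirrors the example that collects SOPA across projects by giving only the url filter.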


Multiplexer

The multiplexer was written by Munagala Ramanath (xyzram on IRC).

This is a diagram of how the multiplexer works: