Analytics/Archive/Infrastructure/Access

= How to access Kraken and crunch your wildest numbers =

As of December 2012, Hadoop is up and running on 10 fresh and clean Analytics nodes. Come on over and start counting beans!

At the moment, all Hadoop users must be given unix shell accounts. To get one, ask [mailto:otto@wikimedia.org Andrew Otto] (ottomata on IRC) for help.

You can now ssh into analytics1001.wikimedia.org, and use the hadoop CLI. But, did you know? There is a web interface!

== Hadoop Web UI ==
Head on over to http://analytics1001.wikimedia.org/. If you are not in the WMF office, you will be prompted for an HTTP auth password; ask Otto for it. Follow the instructions at the bottom of that page, specifically the bit about adding the host aliases to your /etc/hosts file. We don't yet have real DNS configured for these domains, so you need to add them yourself.
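If you want to check that your /etc/hosts entries took effect, you can try resolving the host names. Here is a minimal Python sketch, not an official tool: `unresolved_hosts` is a hypothetical helper, and the host list only contains names mentioned on this page, so extend it with whatever aliases the instructions page lists.

```python
import socket

# Host names mentioned on this page (an assumption about which aliases you need;
# the authoritative list is at the bottom of http://analytics1001.wikimedia.org/).
HOSTS = [
    "hue.analytics.wikimedia.org",
    "jobs.analytics.wikimedia.org",
]


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the host names that `resolve` cannot look up.

    `resolve` defaults to the system resolver, which normally consults
    /etc/hosts; it can be swapped out for testing.
    """
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

Calling `unresolved_hosts(HOSTS)` returns the names that still do not resolve; an empty list suggests your aliases are in place.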

== Hue ==
Hue is a general purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.

Navigate to http://hue.analytics.wikimedia.org/. You'll need a Hue login account. Otto should have created one for you and given you a password if you also asked him for a shell account earlier. (This will soon be hooked into LDAP, and you will be able to use your usual WMF password.)

= Cool, Hadoop! What can I do? =

== Pig ==
Pig is a data processing language designed to run on top of a Hadoop cluster, which makes it easy to run MapReduce jobs. It comes with an interactive REPL, Grunt, and there is a growing number of mature open source libraries for common tasks.

=== Resources ===
To get started writing Pig scripts, there is an excellent (though outdated) O'Reilly Media publication, Programming Pig, freely available online, and a nice IBM developerWorks introduction. There is also a thorough official documentation wiki from the Apache Software Foundation.

=== Examples ===
Perhaps the best place to start, though, is with some example scripts to play with. The analytics team has a repository hosted on GitHub containing a variety of Pig scripts and user-defined functions (UDFs) that you can check out.

As a disclaimer, none of this code is very mature and a lot can depend on your setup, so it may take a while to get things actually working. For example, in order to use the piggybank UDFs or the Wikipedia-specific UDFs, you'll need to reference the shared library directory 'hdfs:///libs'. Our geocoding solution is also a bit hacky at the moment, requiring you to place the geocoding database (usually called 'GeoIP.dat') in your HDFS home directory.

Here is a slightly modified and lightly annotated basic script, count.pig, which counts web requests:

    REGISTER 'hdfs:///libs/piggybank.jar'; -- need to load jar files in order to use them
    REGISTER 'hdfs:///libs/kraken.jar';
    REGISTER 'hdfs:///libs/geoip-1.2.3.jar'; -- this needs to be located in your home directory
    -- as far as we know, it is also necessary to place the geocoding database,
    -- GeoIP.dat, in your home directory in order for geocoding to work

    -- DEFINE statements shorten UDF references
    DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract;
    DEFINE PARSE org.wikimedia.analytics.kraken.pig.ParseWikiUrl;
    -- or allow constructor arguments to be passed
    DEFINE GEO org.wikimedia.analytics.kraken.pig.GetCountryCode('GeoIP.dat');
    DEFINE TO_MONTH org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss', 'yyyy-MM');
    DEFINE TO_MONTH_MS org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss.SSS', 'yyyy-MM');
    DEFINE HAS_MS org.wikimedia.analytics.kraken.pig.RegexMatch('.*\\.[0-9]{3}');

    -- LOAD just takes a directory name and allows you
    -- to specify a schema with the AS command
    LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (
        hostname,
        udplog_sequence,
        timestamp:chararray,
        request_time,
        remote_addr:chararray,
        http_status,
        bytes_sent,
        request_method:chararray,
        uri:chararray,
        proxy_host,
        content_type:chararray,
        referer,
        x_forwarded_for,
        user_agent
    );

    -- keep only GET requests for HTML pages (or requests with no content type)
    LOG_FIELDS = FILTER LOG_FIELDS BY (request_method MATCHES '(GET|get)');
    LOG_FIELDS = FILTER LOG_FIELDS BY content_type == 'text/html' OR (content_type == '-');

    PARSED = FOREACH LOG_FIELDS GENERATE
        FLATTEN(PARSE(uri)) AS (language_code:chararray, isMobile:chararray, domain:chararray),
        GEO(remote_addr) AS country,
        (HAS_MS(timestamp) ? TO_MONTH_MS(timestamp) : TO_MONTH(timestamp)) AS month;

    FILTERED = FILTER PARSED BY (domain == 'wikipedia.org');
    GROUPED = GROUP FILTERED BY (month, language_code, isMobile, country);

    COUNT = FOREACH GROUPED GENERATE
        FLATTEN(group) AS (month, language_code, isMobile, country),
        COUNT_STAR(FILTERED);

    -- grunt / pig won't actually do anything until they see a STORE or DUMP command
    STORE COUNT INTO '$output';

To invoke it from inside Grunt, assuming the script is in your home directory on HDFS:

grunt> exec -param input=/traffic/zero -param output=zero_counts hdfs:///user//count.pig

This should create a collection of files on HDFS in a directory named 'zero_counts' within your home directory.
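If the data flow in count.pig is hard to follow, it may help to see the same filter, parse, group, and count steps in plain Python on a handful of log lines. This is only a rough sketch under stated assumptions: `parse_wiki_url` is a crude, hypothetical stand-in for the ParseWikiUrl UDF (whose real behavior may differ), the month is taken as the first seven characters of the timestamp rather than through ConvertDateFormat, and geocoding is left out because it needs the GeoIP database.

```python
from collections import Counter
from urllib.parse import urlsplit

# Field order follows the LOAD schema in count.pig.
FIELDS = [
    "hostname", "udplog_sequence", "timestamp", "request_time",
    "remote_addr", "http_status", "bytes_sent", "request_method",
    "uri", "proxy_host", "content_type", "referer",
    "x_forwarded_for", "user_agent",
]


def parse_wiki_url(uri):
    """Hypothetical stand-in for ParseWikiUrl: (language_code, is_mobile, domain)."""
    host = urlsplit(uri).hostname or ""
    parts = host.split(".")
    if len(parts) < 3:
        return None  # not of the form <lang>[.m].<project>.<tld>
    return parts[0], "m" in parts[1:-2], ".".join(parts[-2:])


def count_requests(lines):
    """Count Wikipedia page requests per (month, language_code, is_mobile)."""
    counts = Counter()
    for line in lines:
        values = line.rstrip("\n").split(" ")
        if len(values) < len(FIELDS):
            continue  # skip malformed lines
        row = dict(zip(FIELDS, values))
        # Same conditions as the two FILTER statements on LOG_FIELDS.
        if row["request_method"] not in ("GET", "get"):
            continue
        if row["content_type"] not in ("text/html", "-"):
            continue
        parsed = parse_wiki_url(row["uri"])
        if parsed is None or parsed[2] != "wikipedia.org":
            continue
        language_code, is_mobile, _domain = parsed
        month = row["timestamp"][:7]  # 'yyyy-MM', with or without milliseconds
        counts[(month, language_code, is_mobile)] += 1
    return counts
```

On the cluster, Pig spreads the equivalent of this loop across mappers and does the grouping and counting in reducers, which is why the output lands in several files.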

=== Monitoring ===
To monitor one of your jobs once it has started, go to http://jobs.analytics.wikimedia.org/cluster/apps/RUNNING, where you can see how many mappers and reducers have finished.
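If you would rather check progress from a script, the YARN ResourceManager documents a REST endpoint (`/ws/v1/cluster/apps`) that returns application status as JSON. Whether that endpoint is reachable on this cluster is an assumption, so the sketch below only shows how such a response could be summarized. `summarize_running_apps` is a hypothetical helper, and the sample payload is hand-written in the documented shape, not real cluster output.

```python
import json


def summarize_running_apps(payload):
    """Return (id, name, progress) for each RUNNING app in a parsed response."""
    # The ResourceManager reports "apps": null when there are no applications.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (app["id"], app["name"], app["progress"])
        for app in apps
        if app.get("state") == "RUNNING"
    ]


# Hand-written sample in the documented response shape (not real output).
SAMPLE = json.loads("""
{"apps": {"app": [
  {"id": "application_0001", "name": "count.pig", "state": "RUNNING", "progress": 42.0},
  {"id": "application_0002", "name": "other", "state": "FINISHED", "progress": 100.0}
]}}
""")

for app_id, name, progress in summarize_running_apps(SAMPLE):
    print(f"{app_id} {name}: {progress:.0f}% complete")
```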

=== Known Issues / Annoyances / Feature Requests ===

 * Grunt periodically loses the ability to find files on the local file system. So when running a script which lives on an01, you might get the error "ERROR 1000: Error during parsing. File not found: src/kraken/src/main/pig/topk.pig".  If this happens, just execute the script directly from the an01 shell with "$ pig ...".
 * The MaxMind GeoIP library requires that the actual database file (not the jar) live in your home directory so that it can be replicated across all of the nodes.