Analytics/Archive/Infrastructure/Access

= How to access Kraken and crunch your wildest numbers =

As of December 2012, Hadoop is up and running on 10 fresh and clean Analytics nodes. Come on over and start counting beans!

At the moment, all Hadoop users must be given Unix shell accounts. To get one, ask [mailto:otto@wikimedia.org Andrew Otto] (ottomata on IRC) for help.

You can now ssh into analytics1001.wikimedia.org and use the Hadoop CLI. But did you know? There is a web interface!
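A first session might look like the following (the directory name is just an example; the `hadoop fs` subcommands are standard):

 $ ssh analytics1001.wikimedia.org
 $ hadoop fs -ls /         # list the root of HDFS
 $ hadoop fs -ls           # list your HDFS home directory
 $ hadoop fs -mkdir sandbox  # make a directory to play in

Note that `hadoop fs` paths live in HDFS, not on the local filesystem of the node you ssh'd into.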

== Hadoop Web UI ==
Head on over to http://analytics1001.wikimedia.org/. If you are not in the WMF office, you will be prompted for an HTTP auth password. Ask otto for the password. Follow the instructions at the bottom of that page, specifically the bit about adding the host aliases to your /etc/hosts file. We don't yet have real DNS configured for these domains, so you need to add them manually.
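The /etc/hosts entries will look something like this; the IP address and the exact alias list are placeholders here, so copy the real values from the instructions on that page:

 # /etc/hosts
 <ip-of-analytics1001>   analytics1001.wikimedia.org   <aliases listed on the status page>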

== Hue ==
Hue is a general purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.

Navigate to http://hue.analytics.wikimedia.org/. You'll need a Hue login account. Otto should have created one for you and given you a password if you also asked him for a shell account earlier. (This will soon be hooked into LDAP, and you will be able to use your usual WMF password.)

= Cool, Hadoop! What can I do? =

== Pig ==
Pig is a data processing language designed to sit on top of a Hadoop cluster and streamline the process of running MapReduce jobs. It comes with an interactive REPL, Grunt, and there is a growing number of mature open source libraries for common tasks.
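To get a feel for the language, here is the classic word count written interactively in Grunt; the input path is hypothetical, but `TOKENIZE`, `FLATTEN`, and `COUNT` are Pig builtins:

 grunt> LINES = LOAD '/tmp/example.txt' AS (line:chararray);
 grunt> WORDS = FOREACH LINES GENERATE FLATTEN(TOKENIZE(line)) AS word;
 grunt> GROUPED = GROUP WORDS BY word;
 grunt> COUNTS = FOREACH GROUPED GENERATE group, COUNT(WORDS);
 grunt> DUMP COUNTS;

Each statement just builds up a logical plan; nothing runs on the cluster until the DUMP (or a STORE) at the end.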

== Resources ==
To get started writing Pig scripts, there is an excellent (though outdated) O'Reilly Media publication, Programming Pig, freely available online, and a nice IBM DeveloperWorks introduction. There is also thorough official documentation from the Apache Software Foundation.

== Examples ==
But perhaps the best way to start is to play with some example scripts. The Analytics team has a repository hosted on GitHub which contains a variety of Pig scripts and user-defined functions (UDFs) that you can check out.

There are also some system-specific details that you'll very likely need to know. Pig UDFs live in jar files, which you'll need to REGISTER at the beginning of a script or Grunt session. You can find piggybank.jar and kraken.jar at the HDFS paths 'hdfs:///libs/piggybank.jar' and 'hdfs:///libs/kraken.jar'.

Here is a lightly annotated basic script, count.pig, which counts web requests:

 REGISTER 'hdfs:///libs/piggybank.jar';  -- need to load jar files in order to use them
 REGISTER 'hdfs:///libs/kraken.jar';
 REGISTER 'geoip-1.2.5.jar';             -- this needs to be located in your home directory
 
 -- DEFINE statements shorten UDF references...
 DEFINE EXTRACT     org.apache.pig.piggybank.evaluation.string.RegexExtract;
 DEFINE PARSE       org.wikimedia.analytics.kraken.pig.ParseWikiUrl;
 -- ...or allow constructor arguments to be passed
 DEFINE GEO         org.wikimedia.analytics.kraken.pig.GetCountryCode('GeoIP.dat');
 DEFINE TO_MONTH    org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss', 'yyyy-MM');
 DEFINE TO_MONTH_MS org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss.SSS', 'yyyy-MM');
 DEFINE HAS_MS      org.wikimedia.analytics.kraken.pig.RegexMatch('.*\\.[0-9]{3}');
 
 -- LOAD just takes a directory name and allows you
 -- to specify a schema with the AS clause
 LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (
     hostname,
     udplog_sequence,
     timestamp:chararray,
     request_time,
     remote_addr:chararray,
     http_status,
     bytes_sent,
     request_method:chararray,
     uri:chararray,
     proxy_host,
     content_type:chararray,
     referer,
     x_forwarded_for,
     user_agent );
 
 LOG_FIELDS = FILTER LOG_FIELDS BY (request_method MATCHES '(GET|get)');
 LOG_FIELDS = FILTER LOG_FIELDS BY (content_type == 'text/html') OR (content_type == '-');
 
 PARSED = FOREACH LOG_FIELDS GENERATE
     FLATTEN(PARSE(uri)) AS (language_code:chararray, isMobile:chararray, domain:chararray),
     GEO(remote_addr) AS country,
     (HAS_MS(timestamp) ? TO_MONTH_MS(timestamp) : TO_MONTH(timestamp)) AS month;
 
 FILTERED = FILTER PARSED BY (domain == 'wikipedia.org');
 GROUPED  = GROUP FILTERED BY (month, language_code, isMobile, country);
 
 COUNT = FOREACH GROUPED GENERATE
     FLATTEN(group) AS (month, language_code, isMobile, country),
     COUNT_STAR(FILTERED);
 
 -- Grunt / Pig won't actually do anything until they see a STORE or DUMP command
 STORE COUNT INTO '$output';

And here is the call to invoke it from inside Grunt, assuming the script is in your home directory on HDFS:

grunt> exec -param input=/traffic/logs/zero -param output=counts count.pig

which should create a collection of files in a directory named 'counts' within your home directory.
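You can inspect the results without leaving Grunt, since it passes `fs` commands through to HDFS (the part-file names below follow Hadoop's default `part-*` convention; the exact names depend on your Hadoop version):

 grunt> fs -ls counts
 grunt> fs -cat counts/part-*

Each line of output will hold one (month, language_code, isMobile, country) group and its request count.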