Analytics/Archive/Infrastructure/Access

= How to access Kraken and crunch your wildest numbers =
As of December 2012, Hadoop is up and running on 10 fresh and clean Analytics nodes. Come on over and start counting beans!

If you have a shell account, you can ssh into analytics1001.wikimedia.org and use the Hadoop CLI. But did you know? There is also a web interface!

Hadoop Web UI
All of the Kraken web interfaces are hosted on internally accessible hosts. analytics1001 is set up as a reverse proxy to allow access to these hosts. There are currently two different proxies you can use, depending on your preference. If you use the Name Based Proxy, you will have to modify your /etc/hosts file so that you can address each of the services by name (we don't have any public DNS set up yet). Alternatively, you can configure your browser to use the proxy on port 8085. If you use this method, you won't have to modify /etc/hosts, but all of your browser's traffic will go through analytics1001 (unless you use something fancy like FoxyProxy).

I personally think the Name Based Proxy is easier to use than the Browser Configured Proxy. Hopefully, we will eventually have DNS set up for these services, and you won't need to edit /etc/hosts. However, the disadvantage to the Name Based Proxy is that some of the web services generate absolute URIs that explicitly link to the internal hosts and ports. These links won't work unless you use the Browser Configured Proxy method.

Both methods will prompt you for HTTP authentication credentials if you are not in the WMF office. Ask [mailto:otto@wikimedia.org otto] if you need access and don't have this information.

NOTE: The following access instructions are subject to change at any time.

Method 1: Browser Configured Proxy
Open up your browser preferences and configure these HTTP proxy settings:

Host: analytics1001.wikimedia.org

Port: 8085
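If you would rather poke at the web services from a script than from a browser, the same host and port can be handed to an HTTP client. Here is a minimal sketch using only the Python standard library. It assumes the proxy on port 8085 behaves like an ordinary HTTP forward proxy; the function names are made up for illustration, and nothing is sent over the network until you call `open()` on the returned opener.

```python
import urllib.request

# Proxy settings from the instructions above.
PROXY_HOST = "analytics1001.wikimedia.org"
PROXY_PORT = 8085


def proxy_settings(host=PROXY_HOST, port=PROXY_PORT):
    """Return a urllib-style proxy mapping for plain-HTTP traffic."""
    return {"http": f"http://{host}:{port}"}


def build_proxy_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Build an opener that sends http:// requests through the proxy.

    Hypothetical helper: no connection is made here, only configuration.
    """
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)
```

Calling `build_proxy_opener().open(url)` would then fetch `url` via analytics1001. If you are outside the office, you would still need to supply the HTTP authentication credentials yourself.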

If you are using FoxyProxy, this set of whitelist regexes will treat you nicely:

^https?://analytics.*\.eqiad\.wmnet.*

^https?://analytics10\d\d(:\d+)?(/.+)?$
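To see which URLs those two patterns would actually send through the proxy, you can try them out in Python. This is only a sketch: the sample URLs (hostnames, ports, paths) are invented for illustration, and Python's `re` module stands in for FoxyProxy's own matcher, which may differ in small ways.

```python
import re

# The two whitelist patterns from above, copied verbatim.
WHITELIST = [
    re.compile(r"^https?://analytics.*\.eqiad\.wmnet.*"),
    re.compile(r"^https?://analytics10\d\d(:\d+)?(/.+)?$"),
]


def uses_proxy(url):
    """Return True if any whitelist pattern matches the URL."""
    return any(pattern.match(url) for pattern in WHITELIST)


# Hypothetical sample URLs, not taken from the real cluster.
SAMPLES = [
    "http://analytics1010.eqiad.wmnet:8088/cluster",  # internal FQDN
    "http://analytics1001:8888/jobs",                 # short machine name
    "http://analytics1001.wikimedia.org/",            # public landing page
    "https://www.wikipedia.org/",                     # unrelated site
]

for url in SAMPLES:
    print(url, uses_proxy(url))
```

With these patterns, the first two samples match and the last two do not, so only traffic addressed to the internal machine names is diverted.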

Once that's done, navigate on over to http://analytics1001.wikimedia.org. The links there should guide you to the web services you are looking for. Hue will probably be most useful for you at first.

Oh, and use the links marked (internal) on that page, not the main ones. The (internal) links point directly to the machine names.

Method 2: Name Based Proxy
Open up your /etc/hosts file and add this line:

208.80.154.154 analytics.wikimedia.org namenode.analytics.wikimedia.org jobs.analytics.wikimedia.org history.analytics.wikimedia.org oozie.analytics.wikimedia.org hue.analytics.wikimedia.org storm.analytics.wikimedia.org

This aliases analytics1001 to a bunch of hostnames so that the internal proxy rules can figure out which host and port you are actually trying to access.
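If you want to double-check that your /etc/hosts edit took, something like the following sketch can help. It parses hosts-file text and reports which of the names from the line above are not yet mapped to 208.80.154.154. The function names are hypothetical, and the parser only handles the simple "IP name name ..." form with optional `#` comments.

```python
# IP address and hostnames taken from the /etc/hosts line above.
REQUIRED_IP = "208.80.154.154"
REQUIRED_NAMES = [
    "analytics.wikimedia.org",
    "namenode.analytics.wikimedia.org",
    "jobs.analytics.wikimedia.org",
    "history.analytics.wikimedia.org",
    "oozie.analytics.wikimedia.org",
    "hue.analytics.wikimedia.org",
    "storm.analytics.wikimedia.org",
]


def mapped_names(hosts_text, ip):
    """Collect every hostname that the given hosts-file text maps to ip."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            names.update(fields[1:])
    return names


def missing_names(hosts_text, ip=REQUIRED_IP, required=REQUIRED_NAMES):
    """Return the required hostnames not mapped to ip, in their listed order."""
    present = mapped_names(hosts_text, ip)
    return [name for name in required if name not in present]
```

Reading your real file with `open("/etc/hosts").read()` and passing the text to `missing_names` should give an empty list once the line is in place.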

Now head on over to http://analytics1001.wikimedia.org/. If you are not in the WMF office, you will be prompted for an HTTP auth password. Ask [mailto:otto@wikimedia.org otto] for the password. The links there should guide you to the web services you are looking for. Hue will probably be most useful for you at first.

Hue
Hue is a general purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.

Hue is currently configured to use the Labs LDAP instance. You should be able to log in with your LabsConsole credentials.

= Tutorial =
There's a great Pig starter tutorial over at Analytics/Kraken/Tutorial. That's a good place to start if you want to try your hand at crunching data using Kraken. We'll add more tutorials there as we gain more experience.