Analytics/Archive/Infrastructure/Access

= How to access Kraken and crunch your wildest numbers =

As of December 2012, Hadoop is up and running on 10 fresh and clean Analytics nodes. Come on over and start counting beans!

If you have a shell account, you can ssh into analytics1001.wikimedia.org and use the Hadoop CLI. But did you know? There is a web interface!

== Hadoop Web UI ==
All of the Kraken web interfaces are hosted on internal hosts that are not reachable from the public internet. analytics1001 is set up as a reverse proxy to allow access to them. You will have to modify your /etc/hosts file so that you can address each of the services by name (we don't have any public DNS set up for them yet).

You will be prompted for HTTP authentication credentials if you are not in the WMF office. Ask [mailto:otto@wikimedia.org otto] if you need access and don't have this information.

NOTE: The following access instructions are subject to change at any time.

== Name-Based Proxy ==
Open up your /etc/hosts file and add this line:

 208.80.154.154 analytics.wikimedia.org namenode.analytics.wikimedia.org jobs.analytics.wikimedia.org history.analytics.wikimedia.org oozie.analytics.wikimedia.org hue.analytics.wikimedia.org storm.analytics.wikimedia.org

This aliases analytics1001 to a bunch of hostnames so that the internal proxy rules can figure out which host and port you are actually trying to access.
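If you'd rather script the edit, here's a minimal sketch. It writes to a temporary file purely for illustration (the real target is /etc/hosts, which needs sudo), and the "append only if missing" check is our own addition, not something the page requires:

```shell
# Sketch: add the Kraken alias line to a hosts file idempotently.
# For illustration this targets a temp file; on your machine the target
# would be /etc/hosts (which requires sudo to modify).
HOSTS_FILE=$(mktemp)
ALIAS_LINE='208.80.154.154 analytics.wikimedia.org namenode.analytics.wikimedia.org jobs.analytics.wikimedia.org history.analytics.wikimedia.org oozie.analytics.wikimedia.org hue.analytics.wikimedia.org storm.analytics.wikimedia.org'
# Append only if the aliases are not already present.
grep -q 'analytics\.wikimedia\.org' "$HOSTS_FILE" || printf '%s\n' "$ALIAS_LINE" >> "$HOSTS_FILE"
# Running the same line again is a no-op, so re-running the script is safe:
grep -q 'analytics\.wikimedia\.org' "$HOSTS_FILE" || printf '%s\n' "$ALIAS_LINE" >> "$HOSTS_FILE"
grep -c 'namenode' "$HOSTS_FILE"
```

To apply it for real, replace the `mktemp` line with `HOSTS_FILE=/etc/hosts` and run the append through `sudo tee -a` instead of the shell redirect.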

Now head on over to http://analytics1001.wikimedia.org/. If you are not in the WMF office, you will be prompted for the HTTP auth password mentioned above. The links there should guide you to the web services you are looking for; Hue will probably be the most useful at first.

== Browser-Configured Proxy ==
NOTE: This method is disabled due to security concerns.

Open up your browser preferences and configure these HTTP proxy settings:

* Host: analytics1001.wikimedia.org
* Port: 8085

If you are using FoxyProxy, this set of whitelist regexes will treat you nicely:

 ^https?://analytics.*\.eqiad\.wmnet.*
 ^https?://analytics10\d\d(:\d+)?(/.+)?$
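If you want to see what those patterns actually cover, here's a quick sketch using <code>grep -E</code> as a stand-in for FoxyProxy's matcher. POSIX extended regexes have no <code>\d</code>, so the second pattern is rewritten with <code>[0-9]</code>; the sample URLs are made-up examples, not real endpoints:

```shell
# Sketch: exercise the two whitelist regexes against sample URLs.
# grep -E stands in for FoxyProxy's regex engine; \d is rewritten as
# [0-9] because POSIX ERE has no \d. The URLs are hypothetical.
R1='^https?://analytics.*\.eqiad\.wmnet.*'
R2='^https?://analytics10[0-9][0-9](:[0-9]+)?(/.+)?$'
matches() { printf '%s\n' "$2" | grep -Eq "$1" && echo "proxied: $2" || echo "direct:  $2"; }
matches "$R1" 'http://analytics1010.eqiad.wmnet/jmx'   # internal eqiad name -> proxied
matches "$R2" 'http://analytics1001:8088/cluster'      # bare node name + port -> proxied
matches "$R2" 'http://example.org/'                    # anything else -> direct
```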

Once that's done, navigate on over to http://analytics1001.wikimedia.org. As with the name-based proxy, the links there should guide you to the web services you are looking for.

Oh, and use the ''(internal)'' links on that page, not the main ones; the main links point directly at the internal machine names, which won't resolve without the proxy.

== Hue ==
Hue is a general purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.

Hue is currently configured to use the Labs LDAP instance, so you should be able to log in with your LabsConsole credentials.

= Tutorial =

There's a great Pig starter tutorial over at [[Analytics/Kraken/Tutorial]]. That's a good place to start if you want to try your hand at crunching data using Kraken. We'll add more tutorials there as we gain more experience.