- Presentation Slides
- Much more information about our projects is available at https://www.mediawiki.org/wiki/Analytics
Erik M, Patrick, Dario, David, Diederik, Howie, Robla, Tomasz, Jessie, Gayle, Ori, CT, Terry, Asher, Erik Z, Andrew Otto, Dan Andreescu
Special thanks to Erik Zachte who took most of these notes (I believe)!
- Project: https://www.mediawiki.org/wiki/Analytics/Kraken
- Kraken Dataflow Diagram: http://upload.wikimedia.org/wikipedia/mediawiki/3/38/Kraken_flow_diagram.png
- CDH4 — the world's leading Apache Hadoop Distribution. http://www.cloudera.com/content/cloudera/en/products/cdh.html
- The Hadoop Distributed File System ( HDFS ) is a distributed file system designed to run on commodity hardware. http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html
- Hue is a general purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.
- Navigate to http://hue.analytics.wikimedia.org/. You'll need a Hue login account. Otto should have created one for you and given you a password if you also asked him for a shell account earlier. (This will soon be hooked into LDAP, and you will be able to use your usual WMF password.)
- On Storm:
- We brought in Nathan Marz, creator of Storm, as part of the WMF Analytics Day. He provided useful feedback during the research phase, which encouraged us to examine Storm as a solution for the ETL/Stream Processing phase.
- Patrick: What is the status of Kraken as a prototype? -- (dsc) Coming to it in the Kraken section (slide 12) ✔
- On Wikistats: the traffic scripts (aka squid scripts) have now been improved by a contractor, and the dumps scripts are stable. All scripts are in git.