Analytics/Kraken/Query Service

= Research =

Projects that might make useful components in our query service.

Impala (Cloudera)
Open-source clone of Google Dremel, aiming to be "mostly compatible" with HiveQL.


 * Source: https://github.com/cloudera/impala
 * Project: https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation

Tradeoffs

 * Pro: Performs both querying and caching.
 * Pro: Reuses Hadoop job scheduling infrastructure for realtime queries.
 * Pro: Reuses the HiveQL skillset, which is itself largely compatible with SQL.
 * Pro: Supported by Cloudera, which is generally very good about bringing components to maturity.
 * Con: Early beta.
 * Con: May require Red Hat (unconfirmed).

About

 * Blog post, with high-level overview.
 * Introducing Impala
 * FAQ
 * Screencast (I haven't watched this)

Docs

 * Usage Guide
 * Tutorial
 * Query Language -- even in beta, largely compatible with HiveQL.
 * Security Features -- using Kerberos would let internal applications initially bypass the need for an HTTP gateway, but we will ultimately need one so that (at least) Limn and friends can query it.

Hadoop HttpFS

 * Project: http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-hdfs-httpfs/index.html
 * Installation: https://ccp.cloudera.com/display/CDH4DOC/HttpFS+Installation
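
HttpFS exposes HDFS over HTTP using the same REST URL layout as WebHDFS, so a client only needs to build a URL and issue a plain HTTP request. The following is a minimal sketch of that URL construction, not a tested client. The hostname, user name, and file path are hypothetical placeholders, `httpfs_url` is a helper invented for this illustration, and port 14000 is assumed to be the HttpFS default.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style REST URL for an HttpFS gateway.

    Hypothetical helper for illustration only. `op` is a WebHDFS
    operation name such as OPEN or LISTSTATUS; `user` is sent as the
    `user.name` query parameter (pseudo authentication, i.e. assuming
    Kerberos is not enabled on the gateway).
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # Operation first, then the user, then any operation-specific parameters.
    query = urlencode({"op": op, "user.name": user, **params})
    # quote() percent-encodes unsafe characters but leaves "/" intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder host, path, and user; substitute real values before use.
print(httpfs_url("httpfs.example.org", "/user/kraken/pageviews.tsv", "OPEN", "kraken"))
```

The resulting URL could be fetched with any HTTP client, which is what makes HttpFS a candidate for serving files to tools that cannot speak the native HDFS protocol.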

TODO

 * ElephantDB (Nathan Marz) -- Distributed database specialized in exporting key/value data from Hadoop. (Key/value only, so not ideal for analytics or slicing.)
 * Elephant Twin (Twitter) -- Framework for creating indexes in Hadoop.
 * OpenTSDB (StumbleUpon) -- Distributed, scalable time series database (TSDB) built on top of HBase. Seems aimed at instrumentation-style data (a la RRD) rather than analytics.