Analytics/Archive/Infrastructure/Meetings/SecurityReview

This is an outline of information and topics that might be of interest to discuss. It's not intended to be comprehensive, just a starting point. Feel free to add/edit.

= Architecture Overview =



There's been some discussion of swapping this meeting with the Arch Review meeting, so we can work out any questions with the components before discussing their security implications. Either way, we'll probably want a brief overview of the architecture before moving forward.

= Cluster Security =

Public Surfaces

 * ACL as outlined in the RT ticket.
 * Cross-datacenter connectivity (esams) requires a solution (public IP, or perhaps the existing bridge?)

Services and Access Needs
Many internal service dashboards and control panels:


 * Hue: HDFS web access; Hadoop job scheduling (via Oozie); Hive query dashboard (Beeswax)
   * Hue WebUI login authenticates against LDAP
   * Privileges can be controlled within the dashboard to granularly restrict access to particular services
   * Limited control can be exerted on resource use
   * Analysts need dashboard access for job monitoring/control and for data access
 * Hadoop Admin pages
   * NameNode: provides HDFS logs and system health overview. Cluster administrator access only.
   * JobTracker, DataNode: provide logs and debugging output for Hadoop jobs. Access needed by analysts to debug.
 * Storm's Nimbus: Storm job monitoring and scheduling. Cluster administrator access only.
 * Graphite: application and host monitoring for the cluster. Cluster administrator access only.

HDFS

 * http://blog.cloudera.com/blog/2012/03/authorization-and-authentication-in-hadoop/
 * The HDFS superuser is the Unix account that runs the NameNode process. Hadoop piggybacks on Unix users/groups, but uses its own protocols for communication, not the shell (ssh). This means being logged in as that Unix account is sufficient to take actions using the Hadoop shell tools as the superuser. Hadoop provides an optional Kerberos layer for authentication, but there are other solutions, such as firewalling off the NameNode using iptables so that only whitelisted nodes can connect to the HDFS cluster.
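
As a rough illustration of the iptables option, the sketch below only builds the rule strings for a whitelist; it does not run them. The port number (8020, assumed here to be the NameNode RPC port), the function name and the example addresses are all assumptions for discussion, not values taken from the cluster.

```python
import ipaddress

# Assumption: 8020 is the NameNode RPC port. Check the cluster's own
# configuration before relying on this value.
NAMENODE_RPC_PORT = 8020


def namenode_whitelist_rules(allowed_hosts, port=NAMENODE_RPC_PORT):
    """Return iptables commands that accept `port` only from `allowed_hosts`.

    Hypothetical helper: it only produces strings and never touches the firewall.
    """
    rules = []
    for host in allowed_hosts:
        # Parse each entry so a typo cannot produce a malformed rule.
        # A bare address becomes a /32 network.
        network = ipaddress.IPv4Network(host, strict=False)
        rules.append(
            f"iptables -A INPUT -p tcp --dport {port} -s {network} -j ACCEPT"
        )
    # Traffic to the port that matched no ACCEPT rule above is dropped.
    rules.append(f"iptables -A INPUT -p tcp --dport {port} -j DROP")
    return rules


if __name__ == "__main__":
    # Example addresses are placeholders, not real cluster nodes.
    for rule in namenode_whitelist_rules(["10.0.0.11", "10.0.1.0/24"]):
        print(rule)
```

The order matters: the ACCEPT rules must be appended before the final DROP, since iptables evaluates a chain top to bottom.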

= Data Retention =

What Potentially Private or Sensitive Data is there?

 * IP addresses
 * Browsing activity (including edits) correlated with Usernames & IPs
 * Data leakage via search queries / referrers
 * Low-entropy Data:
 * Precise geographic information about all activity
 * Login/Logout by IP across all web properties

Raw Logs

 * First persistently stored in Kafka brokers; the buffer window is 7 days (log.retention.hours=168), after which logs are deleted automatically
 * ETL anonymization pipeline sees raw data and replaces IPs with a salted hash after the geo lookup
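
To make the anonymization step concrete, here is a minimal sketch of "geo lookup first, then hash". It is not the actual pipeline: `lookup_geo` is a hypothetical stand-in for a real GeoIP lookup, the record layout is invented, and HMAC-SHA256 keyed with the salt is used as one possible form of salted hash.

```python
import hashlib
import hmac


def lookup_geo(ip):
    """Hypothetical stand-in for a real GeoIP lookup; always returns 'XX'."""
    return "XX"


def anonymize_record(record, salt):
    """Return a copy of `record` with the raw IP replaced by a keyed hash.

    `record` is assumed to be a dict with an "ip" key; `salt` must be bytes.
    """
    out = dict(record)
    ip = out.pop("ip")
    # The geo lookup has to happen first, since it needs the raw address.
    out["country"] = lookup_geo(ip)
    # The same IP maps to the same token only while the salt is unchanged.
    out["ip_hash"] = hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return out


if __name__ == "__main__":
    print(anonymize_record({"ip": "192.0.2.1", "url": "/wiki/Main_Page"}, b"example-salt"))
```

One point for discussion: the IPv4 space is small enough to enumerate, so the hash protects the address only as long as the salt itself stays secret (or is rotated and discarded).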

Policies

 * Never publish raw logs or unaggregated datasets
 * Raw logs only accessible by those who have signed the Data NDA.
 * HDFS file permissions can be controlled via LDAP groups, allowing segregation of NDA/non-NDA data access, private data import, etc.

= Legacy =

This pertains only to the current (legacy) system: raw archival logs containing unsanitized IPs currently exist back to 2011.


 * Something must be done about this! GeoIP then hash the IPs, and replace them in the logs! Offsite the files if we really need backups?
 * wikistats processes all of time at once, and frequently has errors that require recomputation
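
For the archived text logs, the "GeoIP then hash" rewrite could look roughly like the sketch below. It assumes line-oriented logs containing dotted-quad IPv4 addresses; the `geo_lookup` callable, the `<geo>:<hash>` output format and the use of HMAC-SHA256 are all illustrative choices, not an agreed design.

```python
import hashlib
import hmac
import re

# Matches dotted-quad IPv4 candidates only; IPv6 addresses would need
# separate handling.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize_line(line, salt, geo_lookup):
    """Replace each IPv4 address in `line` with '<geo>:<keyed hash>'.

    `geo_lookup` is a caller-supplied function mapping an IP string to a
    short location code; `salt` must be bytes.
    """
    def replace(match):
        ip = match.group(0)
        digest = hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()
        return f"{geo_lookup(ip)}:{digest}"

    return IPV4_RE.sub(replace, line)


if __name__ == "__main__":
    sample = "192.0.2.7 - - GET /wiki/Foo"
    print(sanitize_line(sample, b"example-salt", lambda ip: "XX"))
```

Because the replacement is deterministic for a given salt, repeated recomputation over the sanitized files would still see consistent per-address tokens.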