Original Suggested Agenda

This is an outline of information and topics that might be of interest to discuss. It's not intended as comprehensive, just a starting point. Feel free to add/edit.

Architecture Overview

There's been some discussion of swapping this meeting with the Arch Review meeting, so we can work out any questions with the components before discussing their security implications. Either way, we'll probably want a brief overview of the architecture before moving forward.

Cluster Security

Public Surfaces

ACL as outlined in the RT ticket.
Cross-datacenter connectivity (esams) requires a solution (public IP, or perhaps the existing bridge?)

Services and Access Needs

Many internal service dashboards and control panels:

Hue: HDFS web access; Hadoop job scheduling (via Oozie); Hive query dashboard (Beeswax)
- Hue WebUI Login authentication uses LDAP
- Can control privileges within the dashboard to granularly restrict access to particular services
- Limited control can be exerted on resource use
- Analysts need dashboard access for job monitoring/control and for data access
Hadoop Admin pages
- NameNode: Provides HDFS logs and system health overview. Cluster administrator access only.
- JobTracker, DataNode: Provide logs and debugging output for Hadoop jobs. Access needed by analysts to debug.
Storm's Nimbus: Storm job monitoring and scheduling. Cluster administrator access only.
Graphite: Application and host monitoring for the cluster. Cluster administrator access only.

HDFS

http://blog.cloudera.com/blog/2012/03/authorization-and-authentication-in-hadoop/
hdfs is the superuser within the HDFS file system. Hadoop piggybacks on Unix users/groups, but uses its own protocols for communication, not the shell (ssh). This means that being logged in as the hdfs Unix account is sufficient to take actions as hdfs, the superuser, using the Hadoop shell tools. Hadoop provides an optional Kerberos layer for authentication, but there are other solutions, such as firewalling off the NameNode with iptables so that only whitelisted nodes can connect to the HDFS cluster.
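As a rough illustration of the iptables option, here is a minimal Python sketch that generates whitelist rules for a single port. The port number (8020, a common NameNode RPC default) and the IP addresses are assumptions, not values from our cluster, and the helper name is hypothetical. It only covers one NameNode port; DataNode and web UI ports would need the same treatment.

```python
from typing import List

# Assumption: 8020 is a common NameNode RPC port; check the cluster's
# fs.default.name / fs.defaultFS setting for the real value.
NAMENODE_RPC_PORT = 8020


def namenode_whitelist_rules(allowed_ips: List[str],
                             port: int = NAMENODE_RPC_PORT) -> List[str]:
    """Build iptables commands that accept listed sources and drop the rest."""
    rules = [
        f"iptables -A INPUT -p tcp --dport {port} -s {ip} -j ACCEPT"
        for ip in allowed_ips
    ]
    # Final rule: any other source hitting this port is dropped.
    rules.append(f"iptables -A INPUT -p tcp --dport {port} -j DROP")
    return rules


if __name__ == "__main__":
    # Hypothetical cluster node addresses, for illustration only.
    for rule in namenode_whitelist_rules(["10.0.0.11", "10.0.0.12"]):
        print(rule)
```

Rule order matters here: the ACCEPT rules have to be appended before the DROP rule, since iptables evaluates a chain top to bottom. Note that this restricts which hosts can connect; it does not authenticate users on those hosts.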

Data Retention

What Potentially Private or Sensitive Data is there?

IP addresses
Browsing activity (including edits) correlated with Usernames & IPs
Data leakage via search queries / referrers
Low-entropy Data:
- Precise geographic information about all activity
- Login/Logout by IP across all web properties

Raw Logs

First persistently stored in Kafka brokers; the buffer window is 7 days (log.retention.hours=168), after which data is deleted automatically
ETL anonymization pipeline sees the raw data and replaces IPs with a salted hash after the geo lookup
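The anonymization step could be sketched roughly as follows. This is not the actual ETL code: the record layout, the `fake_geo_lookup` stand-in, and the choice of HMAC-SHA256 as the salted hash are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Callable, Dict


def anonymize_record(record: Dict[str, str], salt: bytes,
                     geo_lookup: Callable[[str], str]) -> Dict[str, str]:
    """Return a copy of the record with 'ip' replaced by a keyed hash."""
    ip = record["ip"]
    out = dict(record)
    # The geo lookup has to run first, because it needs the raw IP.
    out["country"] = geo_lookup(ip)
    # HMAC-SHA256 keyed with the salt; the raw IP is not kept in the output.
    out["ip"] = hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return out


def fake_geo_lookup(ip: str) -> str:
    """Hypothetical stand-in for a real GeoIP database query."""
    return "NL" if ip.startswith("192.0.2.") else "ZZ"


if __name__ == "__main__":
    raw = {"ip": "192.0.2.1", "uri": "/wiki/Main_Page"}
    print(anonymize_record(raw, b"secret-salt", fake_geo_lookup))
```

The same IP and salt always yield the same hash, so per-visitor aggregation still works. The IPv4 space is small enough to brute-force if the salt leaks, so the salt needs the same protection as the raw logs.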

Policies

Never publish raw logs or unaggregated datasets
Raw logs only accessible by those who have signed the Data NDA.
- Can control HDFS file permissions via LDAP groups, allowing segregation of NDA/non-NDA data access, private data import, etc.
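One way the group-based segregation might be applied is with the standard `hadoop fs -chgrp` and `hadoop fs -chmod` commands. The sketch below only builds the command lines; the path `/data/raw` and the group name `nda-readers` are hypothetical, and the helper is not part of any existing tooling.

```python
from typing import List


def restrict_to_group(path: str, group: str) -> List[List[str]]:
    """Build the hadoop fs commands that limit a directory tree to one group."""
    return [
        # Hand the tree to the given group.
        ["hadoop", "fs", "-chgrp", "-R", group, path],
        # 750: owner rwx, group r-x, everyone else no access.
        ["hadoop", "fs", "-chmod", "-R", "750", path],
    ]


if __name__ == "__main__":
    # Hypothetical path and group name, for illustration only.
    for cmd in restrict_to_group("/data/raw", "nda-readers"):
        print(" ".join(cmd))
```

This relies on the group membership Hadoop resolves for each user matching the LDAP groups, which depends on how group mapping is configured on the cluster.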

Legacy

This pertains only to the current, legacy system, but raw archival logs currently exist back to 2011 containing unsanitized IPs.

Something must be done about this! GeoIP then hash the IPs, and replace them in the logs! Offsite the files if we really need backups?
wikistats processes all of time at once, and frequently has errors that require recomputation

Meeting Notes

Summary

Quick summary of notes / take-aways from the Analytics (Kraken) security review meeting.

analytics1001 has been wiped and reimaged (restoring /home from backup)
All proxies and externally-facing services have been disabled.
Work is under way to bring everything that was puppetized under the analytics1001 puppetmaster into operations-puppet after proper review. Andrew is working closely with a number of people in ops to make this happen.
All future deployments to the cluster will be puppetized and go through normal code review. Other than performance testing, these puppet confs will be tested in labs.
The rest of the cluster will be wiped and reimaged out of puppet; data in HDFS will be preserved. This can be a rolling process, allowing work to proceed while it's under way.
Schedule an Architectural Review meeting sometime during the SF Ops hackathon, including a look at additional services and auth methods that provide access to internal dashboards like Hue and such.
Ensure all current "application" code (stuff written by WMF) gets reviewed:
- Cron doing HDFS import from Kafka
- Pig UDFs and other data tools used in processing
- Future: Storm ETL layer

We all agreed the overall goal is to get to an acceptable security state. During that process, the Analytics team still needs to meet stakeholder needs and deliver on promises. We decided to keep running a "minimum viable cluster" while reimaging boxes and civilizing cluster configuration:

Wall off some portion of the boxes to continue receiving data and running jobs; all other boxes can be wiped (preserving HDFS partitions). Boxes would be incrementally removed from the "unsanitary" cluster, reimaged, and then added to the "sanitary" cluster. Stupid bathroom-related jokes to be avoided.
Team Analytics to enumerate data processing jobs that will be running in the intermediate period; their configurations and tooling will be reviewed.
Analytics and Ops engineers continue to have shell access. Jobs can be submitted and managed using the CLI tools; internal dashboards can be accessed via SSH tunnelling. Analysts working on the cluster will be approved for shell access on a case-by-case basis (afaik, just Evan Rosen (full-time analyst for Grantmaking & Programs), and Stefan Petrea (contractor for Analytics)).
No public, external access to any box in either zone (including proxied dashboards like Hue, or even static files) that hasn't gone through review.
Analytics and Ops will work together to find a simple, acceptable mechanism for data export.
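For the SSH tunnelling mentioned above, a minimal sketch of the command involved, built in Python. The user, gateway host, dashboard host, and port are placeholders; 8888 is assumed here as Hue's default port and should be checked against the actual configuration.

```python
from typing import List


def tunnel_command(user: str, gateway: str, target_host: str,
                   target_port: int, local_port: int) -> List[str]:
    """Build an ssh command forwarding a local port to an internal dashboard."""
    return [
        "ssh",
        "-N",  # no remote command, forwarding only
        "-L", f"{local_port}:{target_host}:{target_port}",
        f"{user}@{gateway}",
    ]


if __name__ == "__main__":
    # Hypothetical hosts; 8888 is an assumed Hue port.
    cmd = tunnel_command("alice", "gateway.example.org",
                         "hue.internal.example.org", 8888, 8888)
    print(" ".join(cmd))
```

With the tunnel open, the dashboard would be reachable at localhost on the chosen local port, and nothing is exposed publicly.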

Next Steps

Analytics puppet manifests fully reviewed and merged into master operations-puppet repository
- Andrew to come pow-wow before the SF Ops hackathon and buddy it up with ops to plow through some of this.
Schedule Architecture Review
Rolling reimaging of all analytics boxes (including Hadoop data nodes, but preserving data), implementing this "minimum viable cluster" plan.