Analytics/Archive/Infrastructure/Meetings/SecurityReview

This is an outline of information and topics that might be of interest to discuss. It's not intended to be comprehensive, just a starting point. Feel free to add/edit.

= Cluster Security =

Public Surfaces

 * ACL as outlined in the RT ticket.
 * Cross-datacenter connectivity (esams) requires a solution (public IP, or perhaps the existing bridge?)

Services and Access Needs
Many internal service dashboards and control panels:


 * Hue: HDFS web access; Hadoop job scheduling (via Oozie); Hive query dashboard (Beeswax)
 * Hue WebUI Login authentication uses LDAP
 * Can control privileges within the dashboard to granularly restrict access to particular services
 * Limited control can be exerted on resource use
 * Dashboard access is needed by analysts for job monitoring/control and data access
 * Hadoop Admin pages
 * NameNode: Provides HDFS logs and system health overview. Cluster administrator access only.
 * JobTracker, DataNode: Provide logs and debugging output for Hadoop jobs. Access needed by analysts to debug.
 * Storm's Nimbus: Storm job monitoring and scheduling. Cluster administrator access only.
 * Graphite: Application and host monitoring for the cluster. Cluster administrator access only.

= Data Retention =

What Potentially Private or Sensitive Data is there?

 * IP addresses
 * Browsing activity (including edits) correlated with Usernames & IPs
 * Data leakage via search queries / referrers
 * Low-entropy Data:
 * Precise geographic information about all activity
 * Login/Logout by IP across all web properties

Raw Logs

 * First persistently stored in Kafka brokers; the buffer window is 7 days (log.retention.hours=168), after which logs are automatically deleted
 * ETL anonymization pipeline sees raw data and replaces IPs with a salted hash after geo lookup
 * Raw archival logs currently exist back to 2011 containing unsanitized IPs
 * Something must be done about this! GeoIP then hash the IPs, and replace them in the logs! Offsite the files if we really need backups?
 * wikistats processes the entire log history at once, and frequently hits errors that require recomputation
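
The anonymization step listed above (geo lookup first, then swap the IP for a salted hash) could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: `lookup_geo` is a hypothetical stand-in for a real GeoIP lookup, the record layout (`ip`, `uri`) is assumed, and using HMAC-SHA256 with the salt as the key is one assumed way to do the "salted hash".

```python
import hashlib
import hmac
import ipaddress


def lookup_geo(ip: str) -> str:
    """Hypothetical placeholder for a GeoIP lookup; returns a coarse label."""
    addr = ipaddress.ip_address(ip)
    return "private" if addr.is_private else "unknown"


def anonymize_ip(ip: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of the IP (HMAC, salt used as the key)."""
    canonical = ipaddress.ip_address(ip).compressed
    return hmac.new(salt, canonical.encode("ascii"), hashlib.sha256).hexdigest()


def sanitize_record(record: dict, salt: bytes) -> dict:
    """Geo-locate first, then drop the raw IP and keep only its salted hash."""
    out = dict(record)              # leave the caller's record untouched
    ip = out.pop("ip")              # the raw IP must not survive in the output
    out["geo"] = lookup_geo(ip)     # lookup needs the raw IP, so it runs first
    out["ip_hash"] = anonymize_ip(ip, salt)
    return out
```

The geo lookup has to run before hashing because the hash cannot be reversed to recover a location. Since the IPv4 address space is small enough to enumerate, the salt needs to stay secret for the hash to offer any protection.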

Policies

 * Never publish raw logs or unaggregated datasets
 * Raw logs are accessible only to those who have signed the Data NDA.
 * Can control HDFS file permissions via LDAP groups, allowing segregation of NDA/non-NDA data access, private data import, etc
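
To make the NDA/non-NDA segregation above concrete, here is a small model of how group-based read access could be evaluated. HDFS uses a POSIX-like owner/group/other permission model; everything else here is assumed for illustration: the group names (`analytics-nda`, `analytics-users`), the paths, and the modes are hypothetical, and the sketch ignores the HDFS superuser and any ACL extensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsEntry:
    path: str
    owner: str
    group: str
    mode: int  # permission bits as an octal integer, e.g. 0o750


def can_read(user: str, groups: set, entry: HdfsEntry) -> bool:
    """POSIX-style check: exactly one class (owner, group, other) applies."""
    if user == entry.owner:
        return bool(entry.mode & 0o400)
    if entry.group in groups:
        return bool(entry.mode & 0o040)
    return bool(entry.mode & 0o004)


# Hypothetical layout: raw logs readable only by the NDA group,
# aggregated data readable by everyone.
raw = HdfsEntry("/example/raw", "hdfs", "analytics-nda", 0o750)
aggregated = HdfsEntry("/example/aggregated", "hdfs", "analytics-users", 0o755)
```

Under this layout, a user whose LDAP groups include `analytics-nda` can read both directories, while a user in only `analytics-users` can read the aggregated data but not the raw logs.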