Requests for comment/Service split along public vs private line

Request for comment (RFC)
Service split along public vs private line
Component	General
Creation date	16 February 2015
Author(s)	RobLa-WMF
Document status	in draft

This RFC calls for a split of backend services between public and private data.

Background

There has been much talk in 2014-2015 about moving toward a service-oriented architecture. One aspect of that strategy should be defining where we want the fissures in our architecture to be, and then defining a strategy for creating a cleaner separation of code along that fissure.

Problem

Currently, a vulnerability in our PHP code that allows arbitrary database access exposes all of our data, including things like password hashes and CheckUser data. Furthermore, we rely on complicated and fragile techniques to filter our databases for public replication.

Proposal

Split our backend code that deals with data storage into two areas, public and private, and provision different hardware for each half. For the public side, we would optimize for replication, doing everything we can to boost speed and volume of delivery (e.g. moving data from MySQL to Cassandra). For the private side, we can make more conservative technology choices, optimizing for security and simplicity.

Taken to a (possibly useful) extreme, this means that even things like deleted revisions would be moved from the public cluster to the private cluster. This would complicate certain activities (like deleting revisions), but replication would be greatly simplified, and with the right architecture, the private data could be kept much more securely.
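
To make the trade-off concrete, here is a minimal Python sketch of the idea, not a proposed implementation. The names PublicStore, PrivateStore and delete_revision are hypothetical, and in-memory dictionaries stand in for what would be separate clusters on separate hardware. It shows that deletion becomes a move across the boundary, while replication of the public side needs no filtering step.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    rev_id: int
    page: str
    text: str


@dataclass
class PublicStore:
    """Hypothetical store holding only data that is safe to replicate."""
    revisions: dict = field(default_factory=dict)

    def replicate(self):
        # No filtering needed: everything in this store is public by construction.
        return list(self.revisions.values())


@dataclass
class PrivateStore:
    """Hypothetical store for deleted revisions and other restricted data."""
    deleted_revisions: dict = field(default_factory=dict)


def delete_revision(rev_id, public, private):
    """Move a revision from the public store to the private store."""
    rev = public.revisions.pop(rev_id)  # raises KeyError for an unknown id
    private.deleted_revisions[rev_id] = rev
    return rev


public = PublicStore()
private = PrivateStore()
public.revisions[1] = Revision(1, "Main Page", "first draft")
public.revisions[2] = Revision(2, "Main Page", "text to be suppressed")

delete_revision(2, public, private)
replicated_ids = [rev.rev_id for rev in public.replicate()]
```

In this sketch the extra cost lands on delete_revision, which now has to touch both stores, while replicate() stays trivial. A real implementation would also need to handle a failure between the two writes, which the sketch ignores.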

This proposal was inspired by conversations surrounding SOA Authentication. In particular, an explicit design goal that came out of those conversations is for a new authentication service to narrow access to password hashes to a single specialized service, so that a compromise of our general application does not necessarily imply a compromise of our password hashes. If that is a worthwhile goal, then access to other private data could probably benefit from similar separation.
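
As an illustration of that design goal, the following Python sketch shows a service whose only externally useful answer is "does this password match?". The class PasswordService, its method names, and the iteration count are assumptions made for the example and are not taken from the SOA Authentication discussions. In practice the boundary would be a separate process or host rather than a Python object, since nothing in a single process prevents other code from reading the stored records.

```python
import hashlib
import hmac
import os


class PasswordService:
    """Hypothetical service: the only component that ever handles password hashes."""

    _ITERATIONS = 100_000  # illustrative value, not a recommendation

    def __init__(self):
        # username -> (salt, derived key); never returned to callers
        self._records = {}

    def _derive(self, password, salt):
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), salt, self._ITERATIONS
        )

    def set_password(self, username, password):
        salt = os.urandom(16)
        self._records[username] = (salt, self._derive(password, salt))

    def verify(self, username, password):
        """Return only True or False; the hash itself never leaves the service."""
        record = self._records.get(username)
        if record is None:
            return False
        salt, expected = record
        # Constant-time comparison of the stored and freshly derived keys.
        return hmac.compare_digest(expected, self._derive(password, salt))


service = PasswordService()
service.set_password("ExampleUser", "correct horse")
ok = service.verify("ExampleUser", "correct horse")
bad = service.verify("ExampleUser", "wrong guess")
```

The general application would call only verify(), so a compromise of that application yields yes/no answers rather than the hashes themselves. The same pattern of a narrow interface in front of restricted data is what the proposal suggests extending to other private data.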