Requests for comment/Encrypting DB fields

This is more of a random idea page at this point than a solid proposal. It is definitely not ready for prime time yet, although I would of course love any thoughts and feedback.

Problem
Right now MediaWiki has very little defense in depth against data breaches. If a malicious person gets access to the DB servers or to any MediaWiki application server [as the MW user], they know basically everything that MediaWiki has collected.

Idea
We should consider encrypting sensitive fields in the database (i.e. anything non-public, whether that's emails, checkuser data, etc.). The encryption key should be stored in a central point (or a few central points for redundancy) that exposes a decryption service. The decryption service never exposes the key directly, but is a small trusted base responsible for decrypting strings, logging (and alerting), and maybe rate limiting. Direct access to this service would be extremely restricted (if we really want to go hardcore, I guess we could even use TPMs). This service might also be responsible for calculating brute-force resistant hashes, where that makes sense.
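
To make the "brute-force resistant hashes" part a bit more concrete, here is a minimal sketch of one way the service could do it: a slow salted hash followed by a keyed HMAC whose key never leaves the service. Everything here is an assumption for illustration (the names `SERVICE_KEY`, `service_keyed_hash` and `verify`, the choice of PBKDF2 plus HMAC-SHA256, and the iteration count); it is not a description of how MediaWiki hashes passwords today.

```python
import hashlib
import hmac
import os

# Hypothetical key that would exist only inside the key service.
SERVICE_KEY = os.urandom(32)


def service_keyed_hash(password: str, salt: bytes,
                       key: bytes = SERVICE_KEY,
                       iterations: int = 100_000) -> bytes:
    """Stretch the password with PBKDF2, then bind the result to the service key."""
    stretched = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    # Without `key`, a stolen database row cannot be tested against guesses offline.
    return hmac.new(key, stretched, hashlib.sha256).digest()


def verify(password: str, salt: bytes, expected: bytes,
           key: bytes = SERVICE_KEY) -> bool:
    """Constant-time comparison of a candidate password against a stored hash."""
    return hmac.compare_digest(service_keyed_hash(password, salt, key), expected)
```

With something like this, an attacker holding only a database dump has to send every password guess through the service, which is where the logging and rate limiting would live.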

I believe this approach is sometimes referred to as "crypto-anchoring".

Threats
There are a number of threats to consider. This approach would prevent a couple of them outright, but the bigger benefit is that it would convert many attacks from offline attacks into online attacks, increasing the likelihood of detection, perhaps even mid-attack. It would also ensure more fool-proof logging, allowing better auditability and reconstruction after the fact. There are of course many attacks that this doesn't prevent.

Prevents

 * SQL-injection to dump sensitive database fields (since now they are encrypted). This is a major one, as although we haven't suffered from all that many, it is a super common vulnerability in web apps, and pretty low skill to exploit.
 * Malicious actor gains access to a private DB backup
 * Malicious actor gains access to just the DB server (unlikely)

Mitigates
Mitigate here means that although it doesn't prevent the attack, we now would have an audit trail, maybe alerting. Additionally the attack would have to be online and much slower, possibly allowing detection mid-attack.
 * Malicious insider attempts to use shell access to extract stuff from the DB [malicious insider does not have access to key server]
 * Although the first thing that comes to mind would be e.g. a massive dump of passwords, other possibilities might be trying to secretly run an unauthorized checkuser without it showing up in the on-wiki log.
 * Malicious actor hacks a MediaWiki application server, and wants to extract sensitive data
 * There are a lot of potential vulnerabilities where this is probably the end result: unserialization, RCE in MediaWiki, phishing someone with access, somehow getting a malicious ssh key authorized, stealing someone's open laptop, etc.

Does not prevent
Nothing fixes everything
 * Physical access to all the servers [perhaps TPMs would have some effect on this, but I don't think it's realistic that we can do much against this threat]
 * Live capturing data as it comes in
 * Lots of other stuff.

Implementation concerns
The main implementation idea is that there is a central server (+ backups) with a crypto service. This service can decrypt and encrypt strings, but should never expose the underlying key. It should record every operation it does, who is requesting it, and from which server. It should have some sanity cutoffs and rate limits. For example, if 2 million password hashes need to be decrypted in a 5 minute period, it should just stop. It should have alerting for unusual activity. All this should hopefully slow down attacks and make them noticeable, and also allow for better reconstruction of sophisticated attacks.
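
As a rough sketch of the logging and cutoff behaviour described above, something like the following wrapper could sit in front of the real decrypt operation. The class and parameter names (`AuditedCryptoGate`, `max_ops`, `window_seconds`) are hypothetical, the actual cipher is passed in as a callable rather than implemented here, and a real service would write the audit trail somewhere tamper-resistant and raise alerts instead of keeping a list in memory.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class RateLimitExceeded(Exception):
    """Raised when the gate refuses further work in the current window."""


class AuditedCryptoGate:
    """Wraps a decrypt callable with an audit log and a sliding-window cutoff."""

    def __init__(self, decrypt: Callable[[bytes], bytes], max_ops: int,
                 window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._decrypt = decrypt          # the key stays behind this callable
        self._max_ops = max_ops
        self._window = window_seconds
        self._clock = clock
        self._recent: Deque[float] = deque()
        # Entries are (timestamp, requester, host, outcome).
        self.audit_log: List[Tuple[float, str, str, str]] = []

    def decrypt(self, ciphertext: bytes, requester: str, host: str) -> bytes:
        now = self._clock()
        # Forget operations that have fallen out of the window.
        while self._recent and now - self._recent[0] >= self._window:
            self._recent.popleft()
        if len(self._recent) >= self._max_ops:
            # Refusals are logged too, since they are what alerting would key on.
            self.audit_log.append((now, requester, host, "refused"))
            raise RateLimitExceeded(f"more than {self._max_ops} operations in window")
        self._recent.append(now)
        self.audit_log.append((now, requester, host, "decrypt"))
        return self._decrypt(ciphertext)
```

This version only refuses requests until the window has passed. "It should just stop" could also mean staying stopped until a human intervenes, which would be a small change to the refusal branch.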

The actual data is still stored in the database (and not in the crypto service; we want that to be as minimal as possible to reduce attack surface).

There are basically three types of data:
 * 1) Data that needs to be retrieved but not queried for. Example: a password hash. This can use standard encryption like AES-GCM or whatever.
 * 2) Data which needs exact equality queries but not ranges. For example, emails need this to do password resets. Encrypting this does expose some information about the data set. If the histogram of the data is flat (like email addresses, mostly), that's probably ok. If it's not, the security properties would require careful analysis, but it's probably better than nothing, with the understanding that for data sets with certain distributions it may very well be equivalent to nothing. There are two possible constructions for this: prefixing the ciphertext with a hash of the plaintext, or using a specialized construction like AES-CMC. The downside of this is that it is dangerously close to inventing our own crypto.
 * 3) Data with range queries (e.g. checkuser IP range lookups). This is really hard, and mostly the realm of theoretical systems. There are things like order-preserving encryption, but they are generally very insecure, and the more secure variants involve re-encrypting old ciphertexts when new ones are added. But at the least we can encrypt the other fields, and maybe encrypt the high bits of the IP, as we only do range queries up to a /16. More research is needed on this, but probably there are no perfect solutions here.
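
To illustrate types 2 and 3, here is a minimal sketch of deterministic lookup tokens that could be stored next to the (separately encrypted) value. Note the assumptions: it uses a keyed hash (HMAC-SHA256) rather than a plain hash of the plaintext, the function names and `INDEX_KEY` are made up for the example, only IPv4 is handled, and in the real design the key would sit inside the crypto service rather than in application code.

```python
import hashlib
import hmac
import ipaddress

# Placeholder key for illustration only; a real key would be random and
# held by the crypto service.
INDEX_KEY = b"\x00" * 32


def equality_token(value: str, key: bytes = INDEX_KEY) -> bytes:
    """Type 2: equal plaintexts give equal tokens, so `WHERE token = ?` works."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()


def ipv4_prefix16_token(ip: str, key: bytes = INDEX_KEY) -> bytes:
    """Type 3 (partial): token for the high 16 bits of an IPv4 address.

    Every address in the same /16 maps to the same token, so a /16 range
    lookup becomes an equality query on this column.
    """
    high_bits = int(ipaddress.IPv4Address(ip)) >> 16
    return hmac.new(key, high_bits.to_bytes(2, "big"), hashlib.sha256).digest()
```

Both tokens leak exactly what was described above: which rows share a value (or a /16), and therefore the histogram. The /16 token also does nothing for narrower ranges on its own, so the low bits would still need some other treatment.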