Talk:Requests for comment/Service split along public vs private line

Another advantage of this could be, I imagine, that replicating a database to non-US data centers would be less of a legal headache if it were known to contain only public data. --Tgr (WMF) (talk) 05:34, 16 February 2015 (UTC)

On claims and use-cases

This proposal basically seems to be advocating for security through segregation. This approach, like any, has good and bad applications. It's not particularly novel and it's not my issue with the proposal, per se.

My issue here is similar to my issue with Service split along presentation vs data manipulation line (talk), which basically boils down to there being some bold claims that currently lack substantiation. Wikimedia has an unusual setup and sometimes has some peculiar requirements. Real-time database replication to Wikimedia Labs/Tool Labs is a pretty good example of this. In a worryingly short problem section, you call this system "complicated and fragile." This may be, but I don't see evidence of a problem in the proposal. If Marc or Sean or someone else directly involved in database replication is calling the system complicated and fragile, then I'm definitely interested. But none of them have commented here yet, none of their input is cited or referenced in the proposal, and they're not drafting or making this proposal, as far as I know. In recent memory, we had one very bad incident with leaked data involving database replication, to be sure, which this proposal somewhat strangely also currently omits. This seems like exactly the type of evidence you'd want to present when trying to demonstrate a problem.

This proposal is also strongly focused on data security, without Chris S.'s or any other security engineer's perspective. In terms of threat models and security priorities, I value Chris' opinion a lot more than yours, to be frank. This kind of ties in with my concern about Sean, Marc, et al. weighing in on database replication. In other words, you can have authority with facts or authority with reputation. You don't come with the latter in these areas, so I'm much more heavily relying on the former, which I'm not finding here right now.

In a zero, one, or infinity model, password hashes in MediaWiki seem to currently fall pretty neatly within one. I agree that a general application compromise shouldn't expose password hashes or e-mail addresses or other private information. Actually segregating that data elsewhere may have direct benefit and may be an option to consider, but extending that to other parts of the database seems almost like a solution in search of a problem right now.

With regard to revisions, my understanding is that moving table rows as we do currently between archive and revision is pretty expensive and not a system we want to replicate or expand. Your proposal seems to suggest an expansion of this model in the name of increased security. In both the case of the revision table and the page table, a bitfield approach is arguably better than what's being vaguely proposed here. For me, this guts the second and currently only other use-case presented in this request for comments.
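To make the bitfield idea concrete, here is a minimal sketch of per-field visibility flags stored on the row itself, so that hiding or restoring something is a single column update rather than a row move between tables. The flag names, values, and the dict-shaped row are all illustrative assumptions, not MediaWiki's actual schema or API.

```python
# Hypothetical visibility flags; names and values are illustrative assumptions.
HIDE_TEXT = 1 << 0
HIDE_COMMENT = 1 << 1
HIDE_USER = 1 << 2


def hide(flags: int, what: int) -> int:
    """Set the given visibility bits."""
    return flags | what


def unhide(flags: int, what: int) -> int:
    """Clear the given visibility bits."""
    return flags & ~what


def is_hidden(flags: int, what: int) -> bool:
    """True if any of the given bits are set."""
    return bool(flags & what)


def public_view(row: dict) -> dict:
    """Return a copy of the row with hidden fields blanked out."""
    view = dict(row)
    if is_hidden(row["flags"], HIDE_TEXT):
        view["text"] = None
    if is_hidden(row["flags"], HIDE_COMMENT):
        view["comment"] = None
    if is_hidden(row["flags"], HIDE_USER):
        view["user"] = None
    return view


# A hypothetical revision row; the row stays where it is when it is hidden.
row = {"id": 1, "text": "hello", "comment": "fix typo", "user": "Example", "flags": 0}
row["flags"] = hide(row["flags"], HIDE_TEXT | HIDE_USER)
```

In this sketch the underlying data never moves; only the `flags` value changes, and readers of public data go through something like `public_view`.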

In my mind, the problem statement here needs to be expanded to include referenced facts. If the problem statement more narrowly focuses on the user table, then the proposal section needs to be rewritten accordingly, of course, specifically addressing how this data could be securely segregated. If the problem statement does not narrow focus and instead remains higher-level, the proposal section needs to do a much better job of explaining what the greater architectural/design pattern is here with at least three or four specific examples of future use-cases for a more segregated model. --MZMcBride (talk) 08:51, 16 February 2015 (UTC)

Agreed. Also, when rewriting the page to include use cases and clear problem statements, make sure to define "we" and "our". --Nemo 09:02, 16 February 2015 (UTC)