Wikimedia Developer Summit/2017/Scaling the Wikimedia database schema

SESSION OVERVIEW
 * Title: Scaling the Wikimedia database schema
 * Day & Time: Tuesday, January 10, 2017
 * Room: Ventana
 * Phabricator Task Link: https://phabricator.wikimedia.org/T149534
 * Facilitator(s): Daniel Kinzler
 * Note-Taker(s): Brion Vibber
 * Remote Moderator:
 * Advocate:

Chronology: [Capture the gist of who said what, in what order. A transcript isn't necessary, but it's useful to capture the important points made by speakers as they happen]

todo: wikify this

Daniel: This session is about scaling the mediawiki database schema. Don't have any set solutions yet; want to discuss some issues and potential mitigations. And some possible goals and solutions... Jaime will be helping, as our DB administrator he knows where the problems are. :)

Problems: * Tall tables -- many rows * Wide fields -- lots of large text fields * Redundancy -- sometimes duplicative or wastes space * Complex queries * Inefficient updates

Pain points: * revision table * oldimage table * rev_user_text / rc_user_text repeated over and over in many tables, makes renaming users slow and wastes space * _to titles duplicates many titles that appear over in page

(Jaime)

The pain points are most severe on these three wikis:

* enwiki * commonswiki * wikidata

Common factors on these: * large nuber of revisions * high rate of revision creation * large # of images/image usages * large number of template usages (license templates on Commons, for instance) * large number of internal and external links counters * some accounts have large number of edits (bots)

Problem 1: basic model over the years (very long time ago these were distinct tables, had to move data) (pretty good, solved the basic problem of having lots of wikitext in same table... by moving it out) server-side hacks for separating some reads etc. nasty.
 * "pages"
 * cur + old
 * page + revision + text + external storage
 * revision special partitioning hack for 'contributions' group

Revision table is very tall.... that's not an inherent problem, mysql can handle that. Do a lot of range scans on the revision table; not always as efficient as want to... Also lots of complex queries on rev table, user filtering, etc. Very hard to optimize for these multiple separate use cases... 1/3 of all queries on enwiki use the revision table. Yikes!

Problem 1: it's getting worse
 * mcr, wikidata, structured content, smart cat editing, etc.... revisions will increase
 * need a different model to accomodate 10x the number of revisions!
 * forget about storage, think how we are going to query metadata efficiently

(Daniel to Jaime) don't run away, we may need you later :)

Daniel: a few more slides for solutions...

General strategies: * can split tables (eg in the multi-content revision model separate revision metadata from content metadata, making the revision table narrower) * alternative source of truth -- things like content model can be pulled from configuration instead of storing it, if it's always fixed to some other variable like namespace * "interning" strings -- titles of pages are frequently reused in link tables, etc. we could use an id to a string table in place of a string in many places that don't require index sorting. * hashing -- for titles, summaries etc can sometimes be useful to use a derived hash as id instead of repeating the string. (bigger than int, but smaller than many strings) * partitioning/sharding strategies -- can be much harder. even if done under th ehood within the database system, need to think about how to shard and on what. * special slave groups: reserve some servers for certain use to optimize cache/index usage and partition CPU/disk usage. This is actively done today for dumps, RC, contribs, etc.

Partitioning...

Internally splitting tables by some criterion. For instance: * by timestamp -- recent items in fast path, old items elsewhere. not being in the fast path can slow things down. * current vs archive --similar but keeping current-but-old data around in fast path. (shadows of 'cur' vs 'old' tables! :D) we still use this for image, or did until recently (image/oldimage) and it's a pain because must move data, which is slow and awkward. A lot of these decisions depend on knowing the usage patterns, whihc sometimes are different by site. Hard to bake in something that works for evertyhing. * by page id -- would be nice for page history to pull everything from one page in one place, but for contribs it would be awful! then we might need a summary table or some other efficient way for batch queries.... Already we push some of these queries to distinct servers, so is it so much harder to use another table? hmm :)   * by user? that'd be great for contribs, and not for others. :) could do side by side with the above. * partition by type of content -- harder because we have a common page area for many types of content. things like MCR would put different content types mixing in a little more in the metadata, potentially.

Remember we're gathering feedback from you all!

On string interning There was an RFC a while ago for changing storage of content model / content format from strings to interned strings. We think it works reasonably well. That mapping can also be aggressively cached if it's small like content model set -- can just cache a blob and have it always in memory. On cache miss, just pull the data set again and recache.

For larger data areas like titles and edit summaries, consider hashing instead of a numbered model.
 * Can store a 64-bit hash in a BIGINT field, still fairly efficient; collisions unlikely (but maybe not enough for a full UUID)
 * Hashes can be computed without a lookup

(Question): why not use page_id? Daniel: redlinks -- link targets may not exist (or may change, as when the target is renamed)

Jaime: (didn't hear, no mic) Daniel: yes we could use incrememntal IDs too; would hae to look at the table more often. Might or might not be as efficient. I like the idea of ids being the same across wikis (as with hashes), could be an advantage for some usages. Generating an ID without looking at the table is nice. :) But not completely married to the hash-as-pure-id idea.

Gabiel: question: which queries are most problematic, and what are constraints on what we can do? And Would it help to move edit comments out of the revision table?

Jaime: Yes! (moving edit comments) Daniel: yes, can definitely do that, though less certain about the redundancy aspect compared to titles in link tables. Jaime: summary of the slowest queries: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance/coreproduction-20151020 I dont' have time to go over every query :) Most prominent ones are those that cover ranges -- "give me latest 500 revs for this page", "latest contributions for this user" etc. UI allows large limits, API allows even wackier queries. Either you have to index every possible combination, or limit what you can ask for.... eg "give me all from this date to that date, but order by .... random or something" much less pleasant than if you asked for them in index order. Makes query plans much uglier, and can require expensive index scans. Almost impossible to handle ranges with many conditions efficiently! Gabriel: Should we reign in the long tail by reducing functionality, or moving some to another backend? Jaime: perhaps; offloading some queries to services can help -- but sometimes those services (say elastic) want to move some data back to DB. :) Jaime: Problem with revision table is that you usually want to join *to* the revision table, and now it's easiest to have everything in DB, so... general content stores. :)

Roan: about partitioning... is that something that would be done manually in MW or internally? Daniel: "yes, it would be one or the other" ;) Jaime: can go both ways; just don't do them undocumented server-side! Should we go with transparent MySQL partitioning or do it manually in MW.... can't answer that, each has ups and downs. If on MediaWiki side, you have a lot of flexibility -- if you know you only need to query one table you can simplify etc. But it's more work for MW devs and users of the DB. If done transparently through MySQL.... it's not as transparent as you want, things can get complicated or destroy performance. Combination of both? Maybe... need to test and make some decisions...

Daniel: What's a place we could experiment with this? Can we poke into revision table directly or safer to go elsewhere? Jaime: there's that hack already [was that wikidata? check me] :D Only request as a DBA... all slaves should be equal. Here's why: if a server goes down, it needs to be replaced. Combinatorial explosion on redundancy etc if need to manage/repartition them all seperately! Make it easy to reassign machines for new usages. Roan: that actually shoots down what I was going to ask about possibility of partitioning some servers are partitioned one way and others partitioned another way (to optimize queries based on certain usage) Jaime: with a bit of MediaWiki magic.... smaller tables good. More smaller tables with different 'partitioning' -- summary tables? They could all be replicated together, with possibility of assigning usage-based sharding for cache optimization. Roan: that's interesting Aaron: do we run that slow query check more recently? Jaime: yes we have more realtime stuff, just don't have it handy on wiki (on grafana etc) (lost some details here sorry) (?) question about special slaves -- why does every slave need all those shard divisions? Jaime: it's a pain to maintain special things much simpler if everything is common. Repartitioning them would be expensive; if already has all shards present on disk, it's easier. (trading disk space vs live disk usage) (?) like many developers i don't really know what's going on _in_ the database -- having a publicized list of slow queries would be great. Saw some queries recently where a small change was a huge improvement, would love ot have a frequently updated list to do this in future. Jaime: yes we want to give you more data! Want to make more stuff public, as much as possible (security/legal/etc) (?): current best place is .... tendril? Jaime: that's for long-running queries only. new system will have 100% monitoring of all queries (which should include those executed a lot that are individually medium-usage)

(David?): some other potential scenarios... when a page is moved, log stores some php blobs with the source/dest/etc. This changes from time to time and is a huge headache. Going through history requires acrobatics... need to reconstruct data from old formats. Would be great to repopulate the old data with consistent format? Happy to hear about partitioning ideas.... some usages are more analytics-like; for instance denormalized queries that hit a lot of different things. Different space from application issues; maybe separate these more, treat batch analytics separately? Daniel: sending derived data to specialized databases would make sense for this; more like having separate databases than separate slave machines as such. Could be good model. (David?): thinking of say contribs query like those analytics maybe good, kind of time-based sharding where some data comes from hot path and others from archive.

Tim: I'm leaning towards more sharding in MediaWiki; we have it for External Storage, usually some concepts of sharding in some extensions... have read up a bit about partitioning since this came up. Various use cases where it helps, hurts, etc. Could still have everything in a single server instance, even if sharded... Lots of code reads from revision, but only a few places write to it. Should be able to do some denormalization changes without totally confusing things. We don't need to optimize for server count or size, but by usage. Daniel: yes denormalization could help for some of our sharding... Tim: we as MediaWiki developers, as we're refactoring storage systems we should try to incorporate sharding. Daniel: this could mean picking a different database table/server to write to based on page id/model, etc? Tim: yeah.... revision may not be the simplest use case, might want to start on something else. Just want developers to start thinking about splitting up data. Daniel: i have a question myself... normalizing strings, denormalizing to summary tables.... One option: single db Another option: separate dbs based on id, need to join across servers, complex! Third option: hashes. Simpler when consistent across databases? How important is it to have consistent queries across wikis? (middling answer :) Tim: I'm reluctant to put by hand up one way or the other cause it's about numbers -- hard to know without good modeling. Jaime: I think consistency in general is more important, and by that I mean how many times we have to read DB to write (?) Daniel: Also a performance aspect...

(?): i'm used to more distributed systems; lots of sharding and denormalization in this world. storage gets cheaper and cheaper over time though. is this right tool? doesn't solve all issues; but with enough storage and denormalization it's doable. Daniel: implementing sharding in app logic will be tricky, hard to figure out a generic way to do it. Brad: one comment wrt sharding -- if you have a query that doesn't fall into the criteria sharded, you have to query everything and combine -- super inefficient. :( Jaime: partitioning vs sharding? Actually I'm totally against sharding. :) Tim: I mentioned sharding. :) What in particular will be worse? Server count, admin? Jaime: sharding gives small tables, but needs separate servers... not large enough to be worth it? [I think we've got distinction between _servers_ with separate data _subsections_ and _tables_ with separate data _usages_] aaron: on hashing/uuids - one thing i like is imports wont conflict. just merge data sets! vs updating foreign keys. it's a numbers game though. daniel: we'll probobly keep them as internal detail but yeah see the benefit aaron: depending on disk issues, denormalizing a table should help read indexes? every replica has them all, can use existing query grouping to optimize query index cache usage. same schema on every server so less sharding pain. aaron: what do you see as future needs? jaime: main problem is monolithic things dont scale... mostly write traffic. range scans very expensive. daniel: maybe rfc to split rev table jaime: also splitting rc is good! gabriel: how much gain do we get from non sharding optimixations jaime: moving out edit summaries, user names will give us years! daniel: gives us time to implement sharding... gabriel: sharding will be hard with the long tail. improvements will give us a big jump forward brion: agree! combination of normalizing revision & denormalizing summary tables should give us a huge benefit with relatively small effort. Still will touch a lot fo things to, for instance, split revision metadata out, but many queries will require only minor (or no!) changes probably. Let's start. Daniel: anybody want to get some phab tasks or RFCs going? brion: i'll start some rfcs... (we probably won't get to it at tomorrow's session!) -end-

Action items:
 * Brion: prepare some normalization & denormalization RFCs for archcom