User:ASarabadani (WMF)/Database for developers toolkit/Concepts/Normalization

What is Database normalization?
Database normalization is a general term that is used to improve design of tables in databases but in here I mean something more specific that is part of the general term. By normalization I mean avoiding repeating strings in databases.

For example, this is a table that's not normalized:

This table is saying edit 123 was made by User:Foo_Long_User_Name. But what if User:Foo_Long_User_Name makes millions of edits? Then the table grows too large and we will repeat "Foo_Long_User_Name" millions of times. By normalizing it, we split this table into two:

In order to get the new data, you basically need to join these two tables on "user_id = rev_user_id".

Why it's being done?
Tables across MediaWiki and its extensions are being normalized, creation of "actor" table and "comment" table are being done to normalize "revision" table. "change_tag" table has been already normalized and as the most recent work, "wb_terms" is now normalized and replaced with six tables (all starting with wbt_). This won't stop here, more tables are going to change. Tables like "templatelinks", "categorylinks", etc. also need normalization.

This brings the question that why there is so much investment in changing tables and breaking workflows and tools, while the resource can be used on something else. The answer lies in two parts: DRY (Don't repeat yourself) is a principle in software engineering that's proven essential in database design. Why? First of all, look at the example above, What if the user_name changes (renaming users)? Then you need to change millions of lines to change "Foo_Long_User_Name" to "Bar" but if you do it with the normalized version, you're only going to change one row which is going to be literally millions of times faster. Normalizing actor table allowed us to be able to rename users in blink of an eye now. Strings are larger than numbers so they take more space in disk, memory, network I/O and so on. To understand the impact, the normalized version of wb_terms is ten times smaller than wb_terms. Normalized version of "change_tag" is half of its original version. How improving storage helps? Let's break it down:
 * DRY
 * Storage
 * Disk: Databases in Wikipedia are running out of space. Currently, the new host for master of Wikidata's database is at 70% (the old hosts were around 85%) and when it reaches 100%, we can't add anything to Wikidata anymore. The whole things breaks but with replacing wb_terms we will be at 40% disk usage. Also keep in mind backups that have the whole database in different points in time also will run out of disk space sooner.
 * Memory: For databases to be able to respond to queries in a timely manner, they load as much as possible of the database rows into memory. In our infrastructure, it's called "InnoDB buffer pool" (technical detail: Our storage engine is InnoDB) and it's 350GB. wb_terms table is 1TB while the new term store is around 100GB. InnoDB buffer pool efficiency (i.e. the ratio of how much data it reads from the memory instead of disk) is at 99.99% currently. One outage in Wikipedia happened because this number for Wikidata's database dropped to 99.9% (one nine drop) because reading from disk is ten times slower than reading them from memory. If we don't normalize the tables, eventually we exhaust this cache and everything will go down. Buying disk is cheap but buying bigger memory is not (if not impossible). Our database hosts have around 512GB for memory and we basically can't go higher than that. To put it simply, we have no choice.
 * Network I/O: In database, we have a "master" database that is the source of truth which replicates to lots of replicas (See https://dbtree.wikimedia.org), People can read from replicas but write has to happen only on master. The replication of changes from master to replicas happen using network. The flow of changes happening on Wikidata is so big that this replication takes a lot of network data flow. Making this tables smaller reduces this flow size allowing us to have more replicas (yay more people being able to read Wikidata and Wikipedia) and also reduces the replication lag (less outdated data).
 * Maintenance: If a table is smaller, schema changes take way less time and finish earlier so we can bring back replicas to rotation quicker (and in total a schema change across a full section finishes faster). On top of that, with higher replication capacity, when we stop replication to do maintenance, it recovers faster when we are done with it.

My queries are slow, What should I do?
Normalized tables are by design slower to read specially since using indexes on several tables is not possible but if you apply these hints here, it might even gets faster than non-normalized version: That seems counter-intuitive but it's proven to work if done right. Its technical term is "join decomposition" and it basically means do multiple queries instead of one with joins.
 * Avoid joins if possible

The rule of thumb to do join decomposition is that if the query with join(s) returns drastically lower number of rows than the several queries that are not joined, you shouldn't do it otherwise it's better to do multiple queries instead of join.

For example, if you want to query revision table to get the user name of the person who did the edit id 123 and you get the user id instead of user name and you want to fix this, instead of join two huge tables of revision and user, query revision table for rev_id = 123 and get the user_id value (let's say it's 1), then query user table with user_id = 1 and get the username. You can also sometimes cache the mapping (mapping of user_id to user_name is not going to change much) which brings us to the next point. If you're doing a query several times and you know it's not going to change much, just put the result of the query in a cache, if you can use key value stores like redis, great but if it's not possible, you can just put in a file on disk and read it from the file than making request to the database. This will be very useful if you combine it with join decomposition (the above hint) and hold the mapping. Remember, when the table is not used by people, it'll get kicked out of "InnoDB buffer pool" (the cache in database, see above) but once it's being used frequently, it'll reside in memory and it'll be faster. Actually, because normalized versions are smaller, you will have higher chance of hitting the cache for really large tables.
 * Cache mappings
 * It will get better once people start to use it

Let's normalize everything!
Hold on a second! Normalization means more joins and can mean more computations. It's easy to fall into trap of over-normalization and it can cause lots of issues as too many joins could cause the database to pick up a wrong join order and bring everything down. For example, avoid normalization if the benefit is small (e.g. language codes). Sometimes you need to have a denormalization table on top of your normalized data. As an example: recentchanges table in core is a denormalization table and has been doing its job relatively well because it stores the data for a fixed amount of time so it won't grow too large over the years. Common types of denormalization tables include cache tables, summary tables and counter tables. For more information look at related material in "High Performance MySQL" book.