User:ASarabadani (WMF)/Database for developers toolkit/Concepts/Normalization

From mediawiki.org

What is Database normalization?[edit]

Database normalization is a general term that is used to improve design of tables in databases but in here I mean something more specific that is part of the general term. By normalization I mean avoiding repeating strings in databases.

For example, this is a table that's not normalized:

revision
rev_id rev_user
123 Foo_Long_User_Name

This table is saying edit 123 was made by User:Foo_Long_User_Name. But what if User:Foo_Long_User_Name makes millions of edits? Then the table grows too large and we will repeat "Foo_Long_User_Name" millions of times. By normalizing it, we split this table into two:

revision_normalized
rev_id rev_user_id
123 1
user
user_id user_name
1 Foo_Long_User_Name

In order to get the new data, you basically need to join these two tables on "user_id = rev_user_id". Integers are much smaller in storage than strings.

Why is it being done?[edit]

Tables across MediaWiki and its extensions are being normalized, creation of "actor" table and "comment" table are being done to normalize "revision" table. The "change_tag" table has been already normalized and as the most recent work, "wb_terms" is now normalized and replaced with six tables (all starting with wbt_). This won't stop here, more tables are going to change. Tables like "templatelinks", "categorylinks", etc. also need normalization.

This brings the question of why there is so much investment in changing tables and breaking workflows and tools, while the resource can be used on something else? The answer lies in two parts:

DRY

DRY (Don't repeat yourself) is a principle in software engineering that's proven essential in database design. Why? First of all, look at the example above, What if the user_name changes (renaming users)? Then you need to change millions of lines to change "Foo_Long_User_Name" to "Bar" but if you do it with the normalized version, you're only going to change one row which is going to be literally millions of times faster. Normalizing the actor table allowed us to be able to rename users in the blink of an eye now.

Storage

Strings are larger than numbers so they take more space in disk, memory, network I/O, and so on. To understand the impact, the normalized version of wb_terms is ten times smaller than the older wb_terms. The normalized version of "change_tag" is half the size of its original version. How does improving storage help? Let's break it down:

  • Disk: Databases in Wikipedia are running out of space. Currently, the new host for the primary of Wikidata's database is at 70% (the old hosts were around 85%) and when it reaches 100%, we can't add anything to Wikidata anymore, and the whole things breaks. But after replacing wb_terms we will be at 40% disk usage. Also keep in mind that the backups, which have multiple copies of the whole database at different points in time, would also run out of disk space sooner.
  • Memory: For databases to be able to respond to queries in a timely manner, they load as much as possible of the database rows into memory. In our infrastructure, it's called "InnoDB buffer pool" (technical detail: Our storage engine is InnoDB) and it's 350GB. The old wb_terms table is 1TB while the new term store is around 100GB. InnoDB buffer pool efficiency (i.e. the ratio of how much data it reads from the memory instead of disk) is at 99.99% currently. One outage in Wikipedia happened because this number for Wikidata's database dropped to 99.9% (one nine drop) because reading from disk is ten times slower than reading them from memory. If we don't normalize the tables, eventually we exhaust this cache and everything will go down. Buying disk is cheap but buying bigger memory is not (if not impossible). Our database hosts have around 512GB for memory and we basically can't go higher than that anytime soon (we have 250 database hosts currently). To put it simply, we have no choice.
  • Network I/O: For each database, we have a primary database that is the source of truth which replicates to lots of replicas (See https://dbtree.wikimedia.org). People can "read" from the replicas but "write" has to happen only on primary. The replication of changes from primary to replicas happens using the network. The flow of changes happening on Wikidata is so big that this replication takes a lot of network data flow. Making these tables smaller reduces this flow size, thus allowing us to have more replicas (yay more people being able to read Wikidata and Wikipedia) and also reduces the replication lag (less outdated data).
  • Maintenance: If a table is smaller, any schema changes take way less time and finishes earlier so we can bring back replicas to rotation quicker (and in total, a schema change across a full section finishes faster). On top of that, with higher replication capacity, when we stop replication to do maintenance, and it recovers faster when we are done with it.

My queries are slow. What should I do?[edit]

Normalized tables are by design slower to read especially since using indexes on several tables is not possible; but if you apply these hints here, it might even get faster than the non-normalized version:

Avoid joins if possible

That seems counter-intuitive but it's proven to work if done right. Its technical term is "join decomposition" and it basically means doing multiple queries instead of one with joins.

The rule of thumb to do join decomposition, is that if the query with join(s) returns drastically lower number of rows than the several queries that are not joined, you shouldn't do it, otherwise it's better to do multiple queries instead of join.

For example, if you want to query revision table to get the user name of the person who did the edit id 123 and you get the user id instead of user name and you want to fix this. Instead of a join on the two huge tables of revision and user, you can query revision table for rev_id = 123 and get the user_id value (let's say it's 1), then query the user table with user_id = 1 and get the username. You can also sometimes cache the mapping (e.g. the mapping of user_id to user_name is not going to change often) which brings us to the next point.

Cache mappings

If you're doing a query several times and you know it's not going to change much, just put the result of the query in a cache. If you can use key value stores like redis, great, but if it's not possible, you can just put it in a file on disk and read it from the file instead of making requests to the database. This will be very useful if you combine it with join decomposition (the above hint) and hold the mapping.

It will get better once people start to use it

Remember, when the table is not used by people, it'll get kicked out of "InnoDB buffer pool" (the cache in database, see above) but once it's being used frequently, it'll reside in memory and it'll be faster. Actually, because normalized versions are smaller, you will have a higher chance of hitting the cache for really large tables.

Let's normalize everything![edit]

Hold on a second! Normalization means more joins and can mean more computations and more complicated write queries. It's easy to fall into the trap of over-normalization, and it can cause lots of issues as too many joins could cause the database to pick up a wrong join order and bring everything down. For example, avoid normalization if the benefit is small (e.g. language codes).

Sometimes you need to have a denormalization table on top of your normalized data. As an example: recentchanges table in core is a denormalization table and has been doing its job relatively well because it stores the data for a fixed amount of time so it won't grow too large over the years. Common types of denormalization tables include cache tables, summary tables, and counter tables. For more information look at related material in the "High Performance MySQL" book (see Library/en#Databases).