User:ASarabadani (WMF)/Database for developers toolkit/Concepts/Choosing storage technology

From mediawiki.org

For any new project, choosing the storage technology is a architectural decision that usually has several right answers with different trade-offs and many many wrong answers. It is usually hard to change later. So spending enough time to come up with the best technology is crucial to avoid forcing a square peg into a round hole later. Always get data persistence team involved in early stages of design.

In order to make it easier for third parties to install, always leave an easy option as default even though it won't be scalable. For example, default search backend in MediaWiki is relational databases (MySQL, etc.) which clearly is not feasible for large-scale systems but it makes it easier to install and use MediaWiki.

Current storage technologies used in production[edit]

  • MariaDB: Relational database storage
    • Core clusters (s1-s8)
    • Extension clusters (x1)
    • Object stash (x2)
    • ExternalStorage (es1-es6)
    • Misc clusters (m1-m5)
    • ParserCache (pc1-pc3)
  • Cassandra: NoSQL data storage used for sessions and more
  • ElasticSearch/OpenSearch: Distributed NoSQL storage with a powerful REST API. Used at WMF for wiki searches and log storage.
  • Swift: Object storage with an S3-compatible API.
  • HDFS: Backend for Hadoop and many other data engineering big data.
  • Memcached: Not really a storage but a large central cache.
  • Redis: Not used much
  • Custom designs (extremely rare): For example, dumps
  • Blazegraph: Open-source graph database. Deprecated; new projects should not use this.

Aspects to consider when choosing the technology[edit]

Does it store canonical data?

If you need to store canonical data that can't be regenerated and their loss would impact users trust in our infrastructure, the only viable options are MariaDB (for data) and Swift (for objects). They get backed up regularly and get replicated to many hosts, every storage is supported by RAID, and relational database have strong consistency and data integrity baked into them by design (Replication, Semi-sync, ARIES recovery, log ahead journaling, transaction support, MVCC, etc.).

Size and amount of read and write

MariaDB/MySQL are pretty good at small to medium size of data (let's say up to hundreds of Gigabytes per cluster) that don't get a lot of writes but get a lot reads and need a fast response because of being in a critical path (you can use cache for hot data to absorb significant read pressure as well).

If you have hundreds of GBs that don't get that much reads but get completely rewritten every day and you need to flush hundreds of millions of rows, MariaDB simply is not a good option for you. If you have to use MariaDB, to mitigate that, you can either find a way to shard the data into multiple clusters without hurting queries and make sure you reduce writes by for example diffing between old data and new. That being said, MariaDB works well in append-only databases and in some cases for heavy-write databases that don't hold canonical data, we can tune the cluster to make it work (reducing binlog time, making the cluster store binlog in memory only, etc.) but it's non-trivial amount of work.

MediaWiki integration

Direct access to storage is only available to MariaDB, Swift, Elastic and memcached. Integration to Cassandra currently only works through a service called Kask that adds latency and complexity to usage. There is no direct access to HDFS and only it's possible via AQS.

Also, different clusters in MariaDB give different access as well. For example core databases allow you to join directly to revision or page tables of Wikipedia but space for them is extremely limited (and no, we can't just buy more disks, we are bound by memory) while extension clusters (x1, etc.) are less bound by storage but you can't join with core tables.

Limitations in each specific technology

For example, for Swift holding a large amount of data is fine but number of files shouldn't grow too much. As mentioned, MariaDB is not good with large amount of writes (whether constant or spiked).

Cost

MariaDB databases are expensive, they use expansive and large SSDs with RAID and pretty large memories as well. Making DBs by far the most expensive servers (plus the fact that even smallest cluster possible needs at least six database servers). But there is no way around this, we need large storage and fast responses, anything in critical path of user request must be as fast as possible otherwise we end up with constant outages.

On the other end of the spectrum: If you need a massive tanker of data that doesn't get read much and fast response is not a priority, for example dumps, picking a MariaDB database for that would be a major waste of resources. Using cheap and large HDD simple storage is good enough and saves a lot of money.

Simplifying the infrastructure

PosgtresSQL might have some benefits compared to MariaDB in specific cases for your tool but having a PG cluster adds a lot of maintenance cost. It's simply better to use MariaDB instead and mitigate the downsides.

Extra points to consider[edit]

  • Try requesting and designing for new storage before start of Q4 of any fiscal year. We need to budget and purchase hardware if necessary. Generally, get the DBA team involved as early as possible. They might have some input on the design that would be needed and hard to change later.
  • Everything at the scale of Wikipedia is complex. Even mature technologies such as MariaDB struggle at our scale sometimes and we constantly report issues upstream. Lots of new technologies can't really work in our scale and many new tools have to do a lot of extra work to be able to withstand the flood of read and writes.
  • Cellular architecture: To avoid "noisy neighbor" problem and reducing coupling between services, it's recommended to have a dedicated storage cluster for each service (for example, ores has its own redis instances so an issue with ores or that redis couldn't affect other services using redis). This is called cellular architecture. We strive to have that but in many cases it's not possible or it would be extremely expensive, specially in case of MariaDB/MySQL.