Tugela Cache

From mediawiki.org

Intro

Large MediaWiki deployments can gain performance by using Memcached, but at some scale the cost of the RAM needed to store all objects becomes too high. To balance resource usage and make better use of the disks in our Apache servers, we built Tugela, a distributed, cached, on-disk hash database.

Design

Tugela Cache is derived from Memcached. Much of the code remains the same, with these notable changes:

  • The internal slab allocator is replaced by a BerkeleyDB B-Tree database.
  • Expiry policy management is moved to an external program, tugela-expire.
  • Much of the statistics code is obsolete.
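
Because Tugela is derived from Memcached, an existing client should in principle be able to talk to it. The sketch below encodes and decodes the standard memcached text protocol (`set` and `get`). It assumes Tugela kept that protocol unchanged, which this page implies but does not state outright. The function names are illustrative and not part of Tugela.

```python
import socket


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a memcached text-protocol 'set' request."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a memcached text-protocol 'get' request for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get(response: bytes):
    """Return the value from a single-key 'get' reply, or None on a miss."""
    if response.startswith(b"END"):
        return None
    header, _, rest = response.partition(b"\r\n")
    # Header looks like: VALUE <key> <flags> <bytes>
    parts = header.split()
    if len(parts) < 4 or parts[0] != b"VALUE":
        raise ValueError(f"unexpected reply: {header!r}")
    size = int(parts[3])
    return rest[:size]


def roundtrip(host: str, port: int, request: bytes) -> bytes:
    """Send one request and read one reply chunk (enough for small values)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        return sock.recv(65536)
```

Against stock memcached, `roundtrip("localhost", 11211, build_set("page:1", b"hello"))` returns `b"STORED\r\n"`. Since expiry is handled by tugela-expire, how Tugela treats the `exptime` field is not documented here and should be checked against the source.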

Build

Make sure libevent (also a memcached dependency) is installed, then try running make. If the build fails, look at the DBLIB variable in the Makefile and adjust it for your system (RedHat-like systems may need just 'db').

Usage

Tugela

The command-line parameters are much the same as memcached's and can be listed with the -h switch. The ones that differ are:

-m mbytes  - specifies not total store size, 
             but in-memory cache, used by BerkeleyDB
-f file    - specifies a database file
-s secs    - force database sync this often
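
As a rough illustration of how the three switches fit together, the sketch below builds a daemon command line in Python. The executable name `tugela`, the database path and the numeric values are all assumptions made for the example. Only the `-m`, `-f` and `-s` switches come from the list above.

```python
import shlex


def tugela_argv(db_file: str, cache_mbytes: int, sync_secs: int,
                binary: str = "tugela") -> list:
    """Build an argv list using only the -m, -f and -s switches listed above.

    'binary' is a hypothetical executable name; adjust it to your build.
    """
    return [
        binary,
        "-m", str(cache_mbytes),  # BerkeleyDB in-memory cache, not store size
        "-f", db_file,            # on-disk database file
        "-s", str(sync_secs),     # force a database sync this often (seconds)
    ]


# Example values only: 256 MB of BerkeleyDB cache, sync every 60 seconds.
argv = tugela_argv("/var/cache/tugela.db", cache_mbytes=256, sync_secs=60)
print(shlex.join(argv))
```

The point to remember is that `-m` sizes only the in-memory cache, so the file named by `-f` can be far larger than that figure.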

Tugela-expire

The cache expiration program does not have a network interface yet, but its output can be sent to the cache daemon's listening socket with external tools such as telnet or netcat. Telnet may need an extra trick, such as:

(tugela-expire;sleep 1) | telnet localhost 11211

Available parameters are:

-f file    - database file
-o days    - purge all entries older than specified days
-p prefix  - touch only keys starting with prefix
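
Where telnet or netcat is inconvenient, the same piping can be sketched in Python. This is a minimal example: it runs a command, forwards whatever it writes to stdout to the daemon's TCP socket, and pauses briefly before closing, which is presumably what the `sleep 1` in the telnet trick is for. The helper name and its `linger` parameter are made up for this example.

```python
import socket
import subprocess
import time


def pipe_to_daemon(argv, host="localhost", port=11211, linger=1.0):
    """Run argv and stream its stdout to the cache daemon's TCP socket.

    Returns the number of bytes forwarded.
    """
    sent = 0
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc, \
            socket.create_connection((host, port), timeout=10) as sock:
        # Forward the program's output in chunks until it closes stdout.
        for chunk in iter(lambda: proc.stdout.read(4096), b""):
            sock.sendall(chunk)
            sent += len(chunk)
        # Presumably like 'sleep 1' in the telnet trick: give the daemon a
        # moment to process the commands before the connection is closed.
        time.sleep(linger)
    return sent
```

A hypothetical call would be `pipe_to_daemon(["tugela-expire", "-f", "/var/cache/tugela.db", "-o", "30", "-p", "enwiki:"])`, which would forward the purge output for entries older than 30 days whose keys start with the example prefix `enwiki:`. The path and prefix are placeholders.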

Database management

As the on-disk file is a regular BerkeleyDB database, the standard BerkeleyDB utilities can be used for data management, statistics and analysis:

  • db_stat
  • db_verify
  • db_dump
  • db_load
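
For scripted checks, the read-only utilities can be wrapped from Python. The sketch below assumes the tools are on the PATH under their plain names (some distributions install version-suffixed names instead) and uses `db_stat -d` to report statistics for a single database file. The helper names are illustrative.

```python
import subprocess


def db_tool_argv(tool: str, db_file: str) -> list:
    """Build an argv for a read-only BerkeleyDB utility run against db_file."""
    if tool == "db_stat":
        return ["db_stat", "-d", db_file]  # -d: statistics for one database file
    if tool in ("db_verify", "db_dump"):
        return [tool, db_file]
    raise ValueError(f"unsupported tool: {tool}")


def run_db_tool(tool: str, db_file: str) -> subprocess.CompletedProcess:
    """Run the utility and capture its output; a non-zero return code signals failure."""
    return subprocess.run(db_tool_argv(tool, db_file),
                          capture_output=True, text=True, check=False)
```

For example, `run_db_tool("db_verify", "/var/cache/tugela.db").returncode == 0` would indicate that the (hypothetical) database file passed verification.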


Questions & Answers

Q: Isn't there currently a MySQL cluster backing the existing memcached system? It seems that whether you hit the MySQL cluster or the Tugela file store, you would essentially be doing a search on a B[+*]-tree index.

A: We can still afford to lose data on Tugelas, which means that every node can hold a different portion of the data. There is no replication, so the protocol is much more lightweight. The MySQL query cache can't be efficient at our update rates and with our use of transactions.

Q: Another option for having a file-backed Memcached is to simply set the Memcached server processes to a size *larger* than physical ram. The OS will swap out the infrequently used pages. How does this approach compare to BDB?

A: Bad, bad, bad approach. Swap overhead is too high compared to what a proper library can achieve. The issue is that memcached is not designed to be swapped out; its memory access patterns are quite different from BDB access patterns. [This doesn't seem true. A few memcached articles suggest 2G caches on 1G machines, and hash lookups are actually quite friendly to swapping (O(1) lookups that directly find the disk block to be swapped in). In contrast, walking a B-tree (as BDB does) requires O(log(n)) pages to be swapped in, so memcached's current slab+hash design is probably more swap-friendly than BDB.]

Q: What about using an approach like Varnish where it "allocate[s] some virtual memory, it tells the operating system to back this memory with space from a disk file."?

A: BDB actually uses mmap, but since files can grow beyond 32-bit limits, it becomes slightly problematic. It usually doesn't add much performance, and having one's own LRU policies often works well too. At least it works for BDB. -- domas

Q: Above, you state that much of the statistics code that came in Memcached is made obsolete. Which statistics are obsolete and, more importantly, which statistics are still applicable?

A: There are no slabs in Tugela, so anything relating to item counts in slabs doesn't work. General statistics might still work, but we don't have anything about the data itself. -- domas

Q: Does placing Tugela's DB file in a Linux tmpfs partition improve, damage or make no difference to cache performance? (added on 20071120)

A: This will not make any difference to performance (I am assuming that you want to put the DB on the /dev/shm partition). Linux tries to keep the tmpfs partition in memory when possible, but it will be swapped out if memory is needed for other uses. If you put the DB file on a normal partition, Linux will cache its data in the file cache. Either way, data will be in memory when memory is available and will be removed from memory when it is needed for other things. My recommendation would be to put the DB on a normal partition and avoid using the machine for other things, so that Linux can maximize the use of memory for file caching. Another thing to note is that the /dev/shm partition is size constrained and needs to be smaller than the swap space configured on the machine.

Q: Does the -m switch affect the maximum size of the DB file? (added on 20071120)

A: It affects only the memory cache. The file can grow indefinitely. -- domas

Q: Are you OK? Hey, guys, check if he has a MedAlert bracelet.

A: We are. We are not using Tugela ourselves, though, so it is in an abandonware state, where other people can pick it up and improve it (like the memcachedb folks) or simply find it useful and use it as is. Generally, most of the magic happens in BDB, which is in active development. -- domas

Q: Is this project still alive? MediaWiki shows 2-year-old code. (added on 20071207)

A: You can still access the source. Wikipedia itself, due to extreme performance requirements, uses memory-only stores. -- domas

Q: Which version of memcached was forked, and is the code base synchronized? The current memcached release is 1.2.4 (20071205). (added on 20071207)

A: It is not. It was a one-time fork. 1.2 was a major rewrite of the storage code (a different slab allocator), which doesn't affect Tugela, as it has its own storage backend. Here again, I left it in its current state (and it kind of works, probably). -- domas

Contact

All questions can be directed to Domas Mituzas or sent through the standard developer contact methods.