Fundraising tech/Message queues

This page gives an overview of the message queues used to decouple fundraising subsystems. For a description of the message format, see "Normalized donation messages". See also the article on WMF-specific configuration.

Message Queue
Queues are used to decouple the payments frontend from the CiviCRM server. This is important for several reasons: it allows us to continue accepting donations even if the backend servers are down, it keeps our private database more secure, and it enforces write-only communication from the payments cluster.

The main data flow is over the donations queue. Completed payment transactions are encoded as JSON and sent over the wire, to be consumed by the queue2civicrm Drupal module and recorded in the CiviCRM database.

Another important queue is the limbo queue, which is used both as a key-value store and as a FIFO queue. Before we pass control to any hosted page or iframe, we record the donor's personal information collected so far in the limbo queue, indexed by the gateway and transaction ID. We store the information in this temporary fashion so that a) it does not leave the payments cluster, and b) we aren't storing any data about people who aren't donors, as mandated by our privacy policies. When (and if) control is returned to the payments server, the PHP session is used to build the key and search for a corresponding limbo message. We delete the message and merge its information into the completed donation message sent to the regular queue.
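The key-value use of the limbo queue can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the LimboStore class, the key format, and the field names are hypothetical, and an in-memory dict stands in for the real store.

```python
import json


class LimboStore:
    """In-memory stand-in for the limbo store (hypothetical)."""

    def __init__(self):
        self._messages = {}

    @staticmethod
    def make_key(gateway, transaction_id):
        # The key format is an assumption; the page only says messages are
        # indexed by the gateway and transaction ID.
        return f"{gateway}-{transaction_id}"

    def put(self, gateway, transaction_id, donor_info):
        """Record donor information before handing control to a hosted page."""
        key = self.make_key(gateway, transaction_id)
        self._messages[key] = json.dumps(donor_info)

    def take(self, gateway, transaction_id):
        """Return and delete the stored message, or None if there is none."""
        raw = self._messages.pop(self.make_key(gateway, transaction_id), None)
        return None if raw is None else json.loads(raw)


def complete_donation(store, gateway, transaction_id, payment_result):
    """Merge the limbo message into the completed donation message."""
    donor_info = store.take(gateway, transaction_id)
    if donor_info is None:
        return None
    return {**donor_info, **payment_result}
```

Taking the message in one step (read and delete together) mirrors the description above: once control returns, the limbo copy is removed and only the merged donation message continues on to the regular queue.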

However, if control is never returned, then limbo queue messages will sit around for some time. After about 20 minutes, they become eligible for "orphan slaying", which is currently only performed for GlobalCollect credit card transactions. We attempt to complete settlement on these orders, and if successful, the completed message is sent to the donations queue. If unsuccessful, the personal information should be purged.
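The eligibility rule for orphan slaying might be sketched as follows. This is only an illustration under stated assumptions: the message fields (gateway, stored_at), the gateway identifier string, and the try_settle callback are hypothetical, and the 20-minute threshold is taken from the approximate figure above.

```python
ORPHAN_AGE_SECONDS = 20 * 60           # "about 20 minutes", per the text above
SLAYABLE_GATEWAYS = {"globalcollect"}  # identifier string is an assumption


def is_orphan(message, now):
    """A limbo message is eligible once it is old enough and on a slayable gateway."""
    return (
        message["gateway"] in SLAYABLE_GATEWAYS
        and now - message["stored_at"] >= ORPHAN_AGE_SECONDS
    )


def slay_orphans(limbo_messages, now, try_settle):
    """Split limbo messages into completed donations and messages left in limbo.

    try_settle is a hypothetical callback that attempts settlement and
    returns True on success. Orphans that fail settlement appear in neither
    list, which models purging their personal information.
    """
    completed, remaining = [], []
    for message in limbo_messages:
        if not is_orphan(message, now):
            remaining.append(message)
        elif try_settle(message):
            completed.append(message)
    return completed, remaining
```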

At Wikimedia, we are currently using the ActiveMQ (http://activemq.apache.org/) message broker as the queue backend for everything but the limbo queue. Messages go over the wire using the aging STOMP protocol. The limbo queue, on the other hand, is stored in Redis on the payments cluster.

Replace ActiveMQ
Motivation: ActiveMQ is a single point of failure. When it's unavailable, we have to take campaigns down, disable the frontend and stop all jobs. The communication protocol is flawed with no remedy in sight, and queue disk storage is prone to bloat rot.

Implementation: We would like to have a layer to buffer the low-latency frontend and protect it from CRM downtime, and better secure sensitive backend pieces. The fact that ActiveMQ acted as a FIFO queue, an indexed store and a buffer was just an unhappy bonus. Going forward, a slimmer buffer abstraction like Kafka may be preferable.

The buffer layer has the following API:
 * push - Add to the store. This will look like the back of a FIFO.
 * pop - Get the oldest element in the store, and begin a transaction to mark it as consumed.
 * commitOffset - Advance the pop offset, so that previously popped messages disappear. If the client connection is closed before committing, the next time a process pops from this queue it will receive the same message as the last process to pop.
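The semantics of these three operations can be modelled with a small in-memory sketch. It assumes a single consumer, and the disconnect method is a hypothetical addition used only to simulate a client connection closing before it commits. It is not the actual implementation.

```python
class Buffer:
    """Single-consumer, in-memory model of the push / pop / commitOffset API."""

    def __init__(self):
        self._log = []       # messages in arrival order
        self._committed = 0  # offset of the oldest message not yet consumed
        self._cursor = 0     # read position of the current connection

    def push(self, message):
        """Add to the back of the FIFO."""
        self._log.append(message)

    def pop(self):
        """Return the oldest unread message, or None if nothing is waiting."""
        if self._cursor >= len(self._log):
            return None
        message = self._log[self._cursor]
        self._cursor += 1
        return message

    def commit_offset(self):
        """Mark everything popped so far on this connection as consumed."""
        self._committed = self._cursor

    def disconnect(self):
        """Hypothetical: drop uncommitted progress, as a closed connection would."""
        self._cursor = self._committed


# Usage: a message popped but never committed is delivered again.
buffer = Buffer()
buffer.push("donation-1")
buffer.push("donation-2")
first = buffer.pop()    # "donation-1"
buffer.disconnect()     # connection closed before committing
again = buffer.pop()    # "donation-1" once more
buffer.commit_offset()  # now "donation-1" is gone for good
```

Because a message can be delivered more than once under this contract, consumers would need to tolerate duplicates.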

I'm planning to share a single partition for each topic in this first iteration, with no sharding. More partitions would only be helpful for distributing load, and for parallel consumption which we don't support yet.

Buffer contribution tracking
Motivation: Contribution tracking is another single point of failure. It has to come down for any type of database maintenance, although there are no stability concerns. The risk is that we are bottlenecked on a single table, which has to be accessed by the donation frontend in real time. Almost all components must be disabled during a contribution tracking outage.

Putting a queue in front of contribution tracking makes it safer to increase the demand on contribution tracking data, and to keep a more sophisticated schema such as a proper event log.

Implementation: Rewrite all tracking events so that they are produced to a new queue. Its consumer keeps the contribution tracking table up to date, and also populates an improved schema.

Consolidate pending message handling
Motivation: There are eight variations on this topic, spanning all four storage backends, and some implementations are buggy.

Implementation: Use one or a small number of topics, and a single FIFO queue abstraction.

The pending consumer imports the messages to a database, where we can do retry and expiry operations, and have lots of indexes.

Pending jobs each have a single responsibility, and pick through the pending database to find eligible records. Other components have access to this data, and can grab or delete records.
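As an illustration of this arrangement, the sketch below imports pending messages into a SQLite database and lets a job select eligible records or expire old ones. The table layout, column names, and function names are all assumptions made for the example, not the actual schema.

```python
import json
import sqlite3


def open_pending_db():
    """Create an in-memory pending database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE pending (
               id INTEGER PRIMARY KEY,
               gateway TEXT NOT NULL,
               order_id TEXT NOT NULL,
               received_at INTEGER NOT NULL,
               message TEXT NOT NULL
           )"""
    )
    # Indexes let each job find its records without scanning the whole table.
    db.execute("CREATE INDEX pending_gateway_order ON pending (gateway, order_id)")
    db.execute("CREATE INDEX pending_received ON pending (received_at)")
    return db


def import_message(db, message, received_at):
    """What the pending consumer does with each message it pops."""
    db.execute(
        "INSERT INTO pending (gateway, order_id, received_at, message)"
        " VALUES (?, ?, ?, ?)",
        (message["gateway"], message["order_id"], received_at, json.dumps(message)),
    )


def fetch_eligible(db, gateway, now, min_age):
    """Records for one gateway that are at least min_age seconds old, oldest first."""
    rows = db.execute(
        "SELECT message FROM pending"
        " WHERE gateway = ? AND received_at <= ?"
        " ORDER BY received_at",
        (gateway, now - min_age),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]


def expire_older_than(db, now, max_age):
    """Delete records older than max_age seconds and return how many were removed."""
    cursor = db.execute("DELETE FROM pending WHERE received_at < ?", (now - max_age,))
    return cursor.rowcount
```

Keeping selection and expiry as separate functions follows the single-responsibility idea above: each job would call only the query it needs.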

Rewrite banner impressions loader
Motivation: The legacy impressions loader is fragile and bloated. We require a one-of-a-kind kafkatee shim to simulate udp2log.

Implementation: Consume the Kafka impression stream directly and aggregate into summary tables.
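A minimal sketch of the aggregation step, assuming each impression event has already been decoded into a dict. The field names (timestamp, banner, campaign) and the one-minute bucket are hypothetical; reading from the Kafka stream itself is left out.

```python
from collections import Counter


def aggregate_impressions(events, bucket_seconds=60):
    """Count impressions per (time bucket, banner, campaign).

    events is any iterable of dicts with hypothetical keys "timestamp"
    (Unix seconds), "banner" and "campaign".
    """
    summary = Counter()
    for event in events:
        # Round the timestamp down to the start of its bucket.
        bucket = event["timestamp"] - event["timestamp"] % bucket_seconds
        summary[(bucket, event["banner"], event["campaign"])] += 1
    return summary
```

Each key of the resulting counter would correspond to one row of a summary table.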

Save to the existing schema as a first step, or provide a view with columns compatible with existing usages.

We can extend this later with a next generation stream processor and decide on its schema.