Fundraising tech/Message queues/Overhaul

This is the home page of the 2016 Fundraising Queue Overhaul.

Replace ActiveMQ
Motivation: ActiveMQ is a single point of failure: when it crashes we have to take campaigns down, disable the frontend, and stop all jobs. The communication protocol is flawed with no remedy in sight, and the application's disk storage is prone to bloat and rot.

What we are getting out of ActiveMQ is a very effective way to isolate the low-latency public frontend from a highly sensitive and rarely realtime backend, and this is the functionality we are interested in preserving during any rewrite. Generally, this layer can be seen as a set of one-directional stream buffers.

Implementation:

We're replacing any key-value uses of ActiveMQ with a new data store, the pending database. See the "consolidate pending" section below.

We've decided to use Redis as the transport for this project, although a slimmer buffer abstraction like Kafka may be preferable.

The buffer layer has the following API:
 * push - Add to the store. This behaves like the input end of a FIFO. An exception is thrown if anything goes wrong.
 * popAtomic - Get the oldest element in the store, and run a client callback within a transaction, only marking the record as consumed if the message processor callback exits successfully. We require the atomicity so that consumers can guarantee that each message is processed "at least once".
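As a rough illustration of the two calls above, here is a minimal sketch. It is not the project's implementation: it uses SQLite as a stand-in for the real transport so that the transaction boundary is easy to see, and the class name FifoBuffer, the method name pop_atomic, and the table layout are all assumptions made for this example.

```python
import json
import sqlite3


class FifoBuffer:
    """Hypothetical stand-in for the buffer layer, backed by SQLite rather than Redis."""

    def __init__(self, path=":memory:"):
        # isolation_level=None means autocommit, so transactions are
        # controlled explicitly with BEGIN / COMMIT / ROLLBACK below.
        self.db = sqlite3.connect(path, isolation_level=None)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS buffer ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )

    def push(self, message):
        # Any failure (serialization, disk, locking) surfaces as an exception.
        self.db.execute(
            "INSERT INTO buffer (body) VALUES (?)", (json.dumps(message),)
        )

    def pop_atomic(self, callback):
        """Run callback on the oldest message; consume it only on success.

        Returns the message, or None when the buffer is empty.
        """
        self.db.execute("BEGIN IMMEDIATE")
        try:
            row = self.db.execute(
                "SELECT id, body FROM buffer ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                self.db.execute("COMMIT")
                return None
            message = json.loads(row[1])
            callback(message)  # may raise; the message then stays queued
            self.db.execute("DELETE FROM buffer WHERE id = ?", (row[0],))
            self.db.execute("COMMIT")
            return message
        except BaseException:
            self.db.execute("ROLLBACK")
            raise
```

If the callback succeeds but the final commit fails, the message stays in the buffer and is processed again later, which is why the guarantee is "at least once" rather than "exactly once". Consumers written against this kind of API should therefore be idempotent.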

No sharding is planned for this iteration. Ordinary peak load hovers around 10 messages per second, so Redis performance is not a consideration. Data size is low. However, more partitions will eventually allow us to support parallel consumption.

Buffer contribution tracking
Motivation: Contribution tracking is another single point of failure: almost all components must be disabled during a contribution tracking outage. Worse yet, it couples the frontend directly to a core internal system, and does so non-transactionally.

Putting a buffer in front of contribution tracking will allow us to split it into low-latency, operational components, and an analytic backend. It will free us to migrate to a more sophisticated schema, and reuse a proper event log for the frontend.

This can be done after the ActiveMQ work, or independently.

Implementation: Rewrite all tracking events as producers to a new queue. Incidental tracking data could go to EventLogging. The tracking consumer would keep the internal contribution tracking table up to date. Let's keep all the data points for each donation, as well as a summary with the current state.
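One way to picture "all the data points plus a summary" is a consumer that appends every event to a log table and refreshes a one-row-per-donation summary in the same transaction. The sketch below is only an illustration under assumptions: the table names tracking_event and tracking_summary, the event fields donation_id and status, and the use of SQLite are all invented for the example and are not the real schema.

```python
import json
import sqlite3


def open_tracking_db(path=":memory:"):
    """Create the two hypothetical tables: a full event log and a summary."""
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS tracking_event (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            donation_id TEXT NOT NULL,
            status TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tracking_summary (
            donation_id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            event_count INTEGER NOT NULL
        );
        """
    )
    return db


def consume_tracking_event(db, event):
    """Append the event to the log and update the summary atomically."""
    donation_id = event["donation_id"]
    status = event["status"]
    with db:  # commits on success, rolls back if anything raises
        db.execute(
            "INSERT INTO tracking_event (donation_id, status, payload) "
            "VALUES (?, ?, ?)",
            (donation_id, status, json.dumps(event)),
        )
        count = db.execute(
            "SELECT COUNT(*) FROM tracking_event WHERE donation_id = ?",
            (donation_id,),
        ).fetchone()[0]
        # The summary row always reflects the most recently consumed event.
        db.execute(
            "INSERT OR REPLACE INTO tracking_summary "
            "(donation_id, status, event_count) VALUES (?, ?, ?)",
            (donation_id, status, count),
        )
```

Because the log keeps every event, the summary can be rebuilt from scratch at any time, which leaves room for migrating the summary schema later.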

Consolidate pending message handling
Motivation: There are already nine distinct variations on the "pending" channel (pending, limbo [payments@memcache], limbo [SmashPig@activemq], cc-limbo, globalcollect-cc-limbo, inflight@filesystem, pending_globalcollect, pending_paypal, pending_paypal_recurring), spanning all four storage backends. Usage is especially hard to understand because all nine pending channels do subtly different things with messages.

Implementation: Rewrite all of these modules to route through a single topic (or a small number of topics). Collapse queues with identical purpose. Coupling to queues should rely on a simple FIFO abstraction (push and popAtomic, as described above).

The pending consumer imports the messages to a database, where we can do retry and expiry operations, and have lots of indexes.

Pending jobs each have a single responsibility, and pick through the pending database to find eligible records. Other components have access to this data, and can grab or delete records.
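To make the idea of single-responsibility jobs picking through the pending database more concrete, here is a small sketch. Everything in it is an assumption for illustration: the pending table layout, the column names, the function names, and the use of SQLite. Timestamps are plain integers in seconds and are passed in explicitly so the behaviour is easy to follow.

```python
import json
import sqlite3


def open_pending_db(path=":memory:"):
    """Create a hypothetical pending table with an index for age-based scans."""
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS pending (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            gateway TEXT NOT NULL,
            order_id TEXT NOT NULL,
            created_at INTEGER NOT NULL,
            message TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS pending_gateway_created
            ON pending (gateway, created_at);
        """
    )
    return db


def store_pending(db, gateway, order_id, message, now):
    """What the pending consumer would do with each message it pops."""
    with db:
        db.execute(
            "INSERT INTO pending (gateway, order_id, created_at, message) "
            "VALUES (?, ?, ?, ?)",
            (gateway, order_id, now, json.dumps(message)),
        )


def fetch_eligible(db, gateway, now, min_age):
    """Return order IDs for one gateway that are at least min_age seconds old."""
    rows = db.execute(
        "SELECT order_id FROM pending "
        "WHERE gateway = ? AND created_at <= ? ORDER BY created_at",
        (gateway, now - min_age),
    ).fetchall()
    return [row[0] for row in rows]


def delete_expired(db, now, max_age):
    """Drop records older than max_age seconds; returns how many were removed."""
    with db:
        cursor = db.execute(
            "DELETE FROM pending WHERE created_at < ?", (now - max_age,)
        )
    return cursor.rowcount
```

A retry job and an expiry job would each call one of these functions on its own schedule, so neither needs to know about the other.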

One major piece of this work is rewriting the orphan rectifier. It should be moved out of the payments cluster and generalized to all payment processors.

Documentation on current consumers of pending queues.

Rewrite banner impressions loader
Motivation: The legacy impressions loader is fragile and bloated. We require a one-of-a-kind kafkatee shim to simulate udp2log.

Implementation: Consume the Kafka impression stream directly and aggregate into summary tables.
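The aggregation step could look roughly like the sketch below. It deliberately leaves out the Kafka client and takes any iterable of already-decoded events, so only the roll-up logic is shown. The event fields (timestamp, banner, campaign) and the one-minute bucket size are assumptions made for this example, not the real stream format or summary schema.

```python
from collections import Counter


def aggregate_impressions(events, bucket_seconds=60):
    """Count impressions per (time bucket, banner, campaign).

    events is any iterable of dicts with integer "timestamp" (seconds),
    "banner", and "campaign" keys. These field names are hypothetical.
    """
    counts = Counter()
    for event in events:
        # Round the timestamp down to the start of its bucket.
        bucket = event["timestamp"] - event["timestamp"] % bucket_seconds
        counts[(bucket, event["banner"], event["campaign"])] += 1
    return counts


def to_summary_rows(counts):
    """Flatten the counter into sorted rows ready to insert into a summary table."""
    return sorted(
        (bucket, banner, campaign, count)
        for (bucket, banner, campaign), count in counts.items()
    )
```

Since the counts are keyed by bucket, batches consumed at different times can be merged by simple addition before writing to the summary tables.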

Whether to preserve backward compatibility with the existing schema should be carefully weighed, and the usages should be flagged with a conspicuous uppercase marker while we're evaluating the size of this problem. We might be able to write a SQL view that transforms enough fields to satisfy the usages...