PoolCounter

PoolCounter is a network daemon which provides mutex-like functionality, with a limited wait queue length. If too many servers try to do the same thing at the same time, the wait queue overflows and some configurable action might be taken by subsequent clients, such as displaying an error message or using a stale cache entry.

It was created to avoid massive waste of CPU due to parallel parsing when the cache of a popular article is invalidated (the "Michael Jackson problem"), but it has since been put to other uses as well, such as limiting thumbnail scaling requests.

MediaWiki uses PoolCounter via an abstract interface (see ) which allows alternative implementations.

Source
The implementation is located in multiple places:
 * The server source is in the  repository.
 * The client source is in the  directory of MediaWiki core.

There is also a Redis-based default implementation in MediaWiki core, and an experimental Python client for the daemon in Thumbor.

As of Debian Buster (10) and Ubuntu Disco (19.04), the poolcounter server can be installed from the official package repositories.

Architecture
The server is a single-threaded C program based on libevent. It does not use autoconf; it has a plain makefile suitable for a normal Linux environment. It currently has no daemonization code, so it is backgrounded by systemd.

In MediaWiki, the client must be a subclass of  and the class holding the application-specific logic must be a subclass of. See Manual:$wgPoolCounterConf for details.

Protocol
The network protocol is line-based, with parameters separated by spaces (spaces in parameters are percent-encoded). The client opens a connection, sends a lock acquire command, does the work, sends a lock release command, then closes the connection. The following commands are defined:


 * ACQ4ANY   : This is used to acquire a lock when the client is capable of using the cache entry generated by another process. If the active pool worker limit is exceeded, the server will give a delayed response to this command. When a client completes its work, all processes which are waiting with ACQ4ANY will immediately be woken so that they can read the new cache entry.
 * ACQ4ME   : This is used to acquire a lock when cache sharing is not possible or not applicable, for example when an article rendering request involves a non-default . When a lock of this kind is released, only one waiting process will be woken, so as to keep the worker population the same.
 * RELEASE: releases the lock that the client most recently acquired
 * STATS [FULL|UPTIME]: show statistics

The possible responses for ACQ4ANY/ACQ4ME:
 * LOCKED: successfully acquired a lock. Client is expected to do the work, then send RELEASE.
 * DONE: sent to wake up a waiting client
 * QUEUE_FULL: the worker limit is exceeded and the wait queue is already full
 * TIMEOUT: the worker limit is exceeded and no slot was freed up within the wait timeout
 * LOCK_HELD: trying to get a lock when one is already held

For RELEASE:
 * NOT_LOCKED: client does not currently hold any locks
 * RELEASED: lock successfully released

For any command:
 * ERROR
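As a concrete illustration, the acquire/work/release cycle described above can be sketched as a small Python client. This is a sketch, not MediaWiki's client: the ACQ4ME parameter order (key, worker limit, maximum queue length, timeout) is assumed from the daemon's documentation rather than shown on this page, and the example key and pool sizes are made up.

```python
import socket

POOLCOUNTER = ("localhost", 7531)  # port 7531, as used elsewhere on this page

def command(*params: str) -> bytes:
    """Build one protocol line; spaces inside parameters are percent-encoded."""
    return " ".join(p.replace(" ", "%20") for p in params).encode() + b"\n"

def do_work() -> None:
    pass  # application-specific work (e.g. reparse the article) goes here

def acquire_and_release(key: str, workers: int, maxqueue: int, timeout: int) -> None:
    """One lock cycle: ACQ4ME <key> <workers> <maxqueue> <timeout>, work, RELEASE."""
    with socket.create_connection(POOLCOUNTER) as sock:
        f = sock.makefile("rwb")
        f.write(command("ACQ4ME", key, str(workers), str(maxqueue), str(timeout)))
        f.flush()
        response = f.readline().strip()
        if response == b"LOCKED":
            do_work()
            f.write(command("RELEASE"))
            f.flush()
            print(f.readline().strip())  # expect RELEASED
        else:
            print("no lock:", response)  # e.g. QUEUE_FULL or TIMEOUT
```

For example, `command("ACQ4ME", "enwiki:Michael Jackson", "4", "50", "10")` yields the single line `ACQ4ME enwiki:Michael%20Jackson 4 50 10`.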

Configuration
The server does not require configuration. The client in MediaWiki can dynamically specify the needed pool sizes, wait timeouts, etc.

The client is enabled via  and then further configured via  with the following keys:

 * servers: an array of server IP addresses. Adding multiple servers causes locks to be distributed on the client side using a consistent hashing algorithm.
 * timeout: the connect timeout in seconds.
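MediaWiki's actual client-side distribution code isn't shown here; a generic consistent-hash ring of the kind described might be sketched as follows (the server addresses and replica count are illustrative):

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Stable hash so every client maps a given key to the same server.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: each server gets many virtual points."""

    def __init__(self, servers, replicas=100):
        points = [(_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)]
        points.sort()
        self._hashes = [h for h, _ in points]
        self._servers = [s for _, s in points]

    def get(self, key: str) -> str:
        # Pick the first virtual point clockwise from the key's hash.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._servers[idx]
```

With a ring like this, adding or removing one server only remaps the keys that hashed to that server's points, so most locks keep going to the same server.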

Testing
$ echo 'STATS FULL' | nc -w1 localhost 7531
uptime: 633 days, 15209h 42m 26s
total processing time: 85809 days 2059430h 0m 24.000000s
average processing time: 0.957994s
gained time: 1867 days 44820h 50m 24.000000s
waiting time: 390 days 9365h 18m 24.000000s
waiting time for me: 389 days 9343h 3m 28.000000s
waiting time for anyone: 22h 14m 53.898438s
waiting time for good: 520 days 12503h 48m 24.000000s
wasted timeout time: 473 days 11375h 2m 44.000000s
total_acquired: 7739031655
total_releases: 7736374042
hashtable_entries: 119
processing_workers: 119
waiting_workers: 216
connect_errors: 0
failed_sends: 1
full_queues: 10294544
lock_mismatch: 227
release_mismatch: 0
processed_count: 7739031536
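The STATS FULL output is easy to consume programmatically, since it is one `name: value` pair per line. A small (hypothetical) parser:

```python
def parse_stats(text: str) -> dict:
    """Parse `STATS FULL` output: one `name: value` pair per line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:  # skip anything without the "name: value" shape
            stats[name.strip()] = value.strip()
    return stats
```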

Quickly inspect traffic in production
On a MediaWiki appserver you can do:

sudo tcpdump -A 'port 7531 and tcp[tcpflags] & tcp-push != 0'

Trivial Wireshark support for the protocol
The following Lua script is a trivial 'dissector' for Wireshark that simply stringifies the payloads of PoolCounter network packets, so you can then add the payload as a column displayed in Wireshark's UI.

On modern Linux systems you should be able to save this as  and then it will work automatically in either  or .
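A minimal dissector of this kind might look like the following sketch (the protocol and field names are illustrative; it assumes the standard Wireshark Lua API):

```lua
-- Minimal PoolCounter "dissector": exposes the raw payload as a string field.
local poolcounter = Proto("poolcounter", "PoolCounter")
local f_payload = ProtoField.string("poolcounter.payload", "Payload")
poolcounter.fields = { f_payload }

function poolcounter.dissector(tvb, pinfo, tree)
    pinfo.cols.protocol = "PoolCounter"
    -- Stringify the whole TCP payload so it can be shown as a column.
    tree:add(f_payload, tvb(), tvb():string())
end

-- PoolCounter listens on TCP port 7531.
DissectorTable.get("tcp.port"):add(7531, poolcounter)
```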

Tracing the execution of certain flavors of requests
Imagine that you cared about seeing the full conversational 'flow' between PoolCounter and its clients for a certain part of the keyspace.

Since the PoolCounter server's responses (e.g. LOCKED or DONE) don't include the key they refer to, this isn't trivial to do.

Begin with a packet capture from the timespan you're interested in. You might generate this with tcpdump on a poolcounter host (or on an appserver host you're using for testing).

Then, we'll ask Wireshark to extract the list of its internal TCP stream ID numbers for all requests that match that keyspace.

Once we have that list of IDs, we'll transform it into a Wireshark display filter, and then use that filter to select all PoolCounter protocol traffic from just those streams in the original packet capture.
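The exact commands for these steps were not preserved above. As an illustration using standard Wireshark tooling, the stream IDs could be extracted with something like `tshark -r capture.pcap -Y '<your key filter>' -T fields -e tcp.stream | sort -un`, and a small helper can then build the display filter from them:

```python
def stream_filter(stream_ids) -> str:
    """Build a Wireshark display filter selecting the given TCP streams."""
    return " || ".join(f"tcp.stream == {i}" for i in sorted(set(stream_ids)))

# For example, streams 5 and 9 yield:
#   tcp.stream == 5 || tcp.stream == 9
```

The resulting filter can be passed back to Wireshark or tshark to show every packet, in both directions, from just those conversations.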