User:Daniel Kinzler (WMDE)/Job Queue

This page gives a brief overview of current issues with the JobQueue, to the author's best but incomplete knowledge. The purpose of this page is to provide a starting point for a discussion about how to improve the job queue infrastructure.

Observations, issues, and considerations:
 * Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710
 * With 600k RefreshLinks jobs in the backlog of commonswiki, only 7k RefreshLinks jobs got processed in a day. (Not including those run manually from terbium.)
 * For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue.
 * Jobs that were triggered months ago were found to still be failing and retrying.
 * Selection of the target wiki in the job runner is random: it does not depend on which wikis have the most jobs pending. Wikis with no jobs pending get selected, causing overhead (and possibly even an explicit delay; or does the job runner idle for a while, waiting for jobs?).
 * There is one queue per target wiki.
 * We have lots of jobs that do nothing (e.g. RefreshLinks for a page that doesn't have links), but we don't know that in advance.
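The random target-wiki selection noted above means runner cycles are wasted on wikis with empty queues. A minimal sketch in Python (not the actual job runner code; all names and numbers are illustrative) of what skipping empty queues could look like:

```python
import random

# Illustrative pending-job counts per wiki; names and numbers are made up.
backlog = {"commonswiki": 600_000, "dewiki": 40_000, "smallwiki": 0}

def pick_wiki_uniform(queues):
    """Selection as described above: uniform over all wikis, so a wiki
    with no pending jobs can be picked and cost a runner cycle."""
    return random.choice(list(queues))

def pick_wiki_nonempty(queues):
    """Alternative sketch: only consider wikis with pending jobs, so no
    runner cycle is spent on an empty queue."""
    candidates = [wiki for wiki, count in queues.items() if count > 0]
    return random.choice(candidates) if candidates else None
```

With `pick_wiki_nonempty`, a runner that gets `None` back knows the whole system is idle and can sleep, instead of discovering an empty queue after selecting it.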

 * Jobs re-try indefinitely.
 * Deduplication:
  * The mechanism is obscure/undocumented. Some job types rely on rootJob parameters, some use custom logic.
  * Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break batches up into smaller jobs?
  * Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
  * Custom coalescing could improve the chance of deduplication.
 * The scope and purpose of some jobs is unclear. E.g. UpdateHtmlCache invalidates the parser cache, and RefreshLinks re-parses the page but does not trigger an UpdateHtmlCache, which it probably should.
 * The throttling mechanism does not take into account the nature and run-time of different job types.
 * Scaling is achieved by running more cron jobs.
 * A Kafka-based JQ is being tested by Services. It is generally saner, and should improve the ability to track causality (which job was triggered by which other job). T157088
 * No support for recurrent jobs. Should we keep using cron?
 * Having a single queue for all wikis would mean wikis with a large backlog get more job runner time. But maybe too much, starving small wikis?
 * For Wikibase change dispatching (pre job queue), we look at which wikis have the most changes pending, pick a set of the most-lagged ones, and then randomly pick one from that set. This makes it more likely for heavily lagged targets to be processed, without starving others.
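The Wikibase dispatching strategy described above can be sketched in a few lines of Python. This is not the actual dispatcher code; `set_size` is a made-up knob for the size of the candidate set:

```python
import random

def pick_dispatch_target(pending, set_size=3):
    """Sketch of the strategy described above: rank wikis by number of
    pending changes, take the `set_size` most-lagged ones, then pick one
    of those at random. Heavily lagged wikis are likely to be chosen,
    while other wikis in the top set are not starved."""
    ranked = sorted(pending, key=pending.get, reverse=True)
    top = ranked[:set_size]
    return random.choice(top) if top else None
```

The random pick within the top set also keeps multiple concurrent dispatchers from all converging on the single most-lagged wiki.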

Documentation:
 * Manual:Job queue
 * JobQueue (Doxygen)
 * Aaron's slides (DropBox)
 * Job Queue Health (Grafana)
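The deduplication notes above say the mechanism is obscure. Conceptually, a queue can drop a newly pushed job when an equivalent one is already pending. A toy Python sketch of signature-based deduplication (this is not MediaWiki's implementation, which also involves rootJob parameters and per-job-type logic):

```python
import json

class DedupQueue:
    """Toy queue that drops a push if an equivalent job is already pending."""

    def __init__(self):
        self._pending = []
        self._signatures = set()

    @staticmethod
    def signature(job_type, params):
        # Identical type + params => identical signature. Note that a
        # batched job (many page IDs in one params dict) gets a unique
        # signature, which is why batching defeats this kind of dedup.
        return job_type + ":" + json.dumps(params, sort_keys=True)

    def push(self, job_type, params):
        sig = self.signature(job_type, params)
        if sig in self._signatures:
            return False  # duplicate of a pending job, dropped
        self._signatures.add(sig)
        self._pending.append((job_type, params))
        return True

    def pop(self):
        job = self._pending.pop(0)
        self._signatures.discard(self.signature(*job))
        return job
```

This also illustrates why delaying jobs can help: the longer a job sits in `_pending`, the longer the window during which duplicate pushes are absorbed.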