User:Daniel Kinzler (WMDE)/Job Queue

This page gives a brief overview of current issues with the JobQueue, to the author's best but incomplete knowledge. The purpose of this page is to provide a starting point for a discussion about

Observations:
 * Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710
 * With 600k refreshLinks jobs in the backlog of commonswiki, only 7k refreshLinks jobs got processed in a day (not including those run manually from terbium).
 * For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue.
 * Jobs that were triggered months ago were found to still be failing and retrying.
 * Selection of the target wiki in the job runner is random. It does not depend on which wikis have the most jobs pending.
 * There is one queue per target wiki.
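
To make the point about random target selection concrete, here is a minimal sketch contrasting uniform selection with backlog-weighted selection. The function names, the `pending` map, and all backlog figures except the 600k mentioned above are hypothetical; this only illustrates the idea and is not how the job runner is actually implemented.

```python
import random


def pick_wiki_uniform(pending, rng):
    """Pick a target wiki uniformly at random, ignoring backlog size."""
    return rng.choice(sorted(pending))


def pick_wiki_weighted(pending, rng):
    """Pick a target wiki with probability proportional to its backlog."""
    wikis = sorted(w for w, n in pending.items() if n > 0)
    if not wikis:
        return None  # nothing to do anywhere
    weights = [pending[w] for w in wikis]
    return rng.choices(wikis, weights=weights, k=1)[0]


# Hypothetical backlog sizes per wiki (one queue per target wiki).
pending = {"commonswiki": 600_000, "smallwiki": 50, "emptywiki": 0}
rng = random.Random(0)
picks = [pick_wiki_weighted(pending, rng) for _ in range(1000)]
```

With uniform selection, each of the three wikis is picked about a third of the time, including the empty one. With weighting, runner capacity goes almost entirely to the wiki with the large backlog. A real scheduler would probably also need a floor so that small wikis are not starved.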

Issues and considerations:
 * Jobs re-trying indefinitely
 * Deduplication
   * The mechanism is obscure/undocumented. Some jobs rely on rootJob parameters, some use custom logic.
   * Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break up small batches?
   * Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
   * Custom coalescing could improve the chance for deduplication.
 * The scope and purpose of some jobs are unclear. E.g. UpdateHtmlCache invalidates the parser cache, and RefreshLinks re-parses the page but does not trigger an UpdateHtmlCache, which it probably should.
 * The throttling mechanism does not take into account the nature and run-time of different job types.
 * Scaling is achieved by running more cron jobs.
 * A Kafka-based JQ is being tested by Services. It is generally saner and should improve the ability to track causality (which job got triggered by which other job). See T157088.
 * No support for recurrent jobs. Should we keep using cron?

Documentation:
 * Manual:Job queue
 * JobQueue (Doxygen)
 * Aaron's slides (DropBox)
 * Job Queue Health (Grafana)
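
The deduplication points raised above (batching prevents deduplication, delaying jobs may improve it) can be illustrated with a toy queue. This is a minimal sketch, assuming a hypothetical `DedupQueue` that splits batches into single-title jobs, keys each job on its type and title, and holds it for a delay window. It does not reflect the actual JobQueue classes or the rootJob mechanism.

```python
class DedupQueue:
    """Toy queue: one pending job per (type, title), released after a delay."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = {}  # (job_type, title) -> time the job becomes runnable
        self.dropped = 0   # number of duplicates that were discarded

    def push(self, job_type, titles, now):
        # Split the batch into single-title jobs so each can be deduplicated.
        for title in titles:
            key = (job_type, title)
            if key in self.pending:
                self.dropped += 1  # an identical job is already waiting
            else:
                self.pending[key] = now + self.delay

    def pop_ready(self, now):
        # Return (and remove) every job whose delay window has elapsed.
        ready = sorted(k for k, t in self.pending.items() if t <= now)
        for key in ready:
            del self.pending[key]
        return ready


q = DedupQueue(delay=10)
q.push("refreshLinks", ["A", "B"], now=0)
q.push("refreshLinks", ["B", "C"], now=5)  # "B" is a duplicate and is dropped
```

Had the two batches been kept as opaque units, the overlap on "B" would not be detected, because the batches as a whole differ. A longer delay widens the window in which duplicates can be caught, at the cost of higher latency for every job.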