User:Daniel Kinzler (WMDE)/Job Queue

This page gives a brief overview of current issues with the JobQueue, to the author's best but incomplete knowledge, and was updated based on the IRC discussion on September 13. The purpose of this page is to provide a starting point for a discussion about the future of the JobQueue mechanism.

Observations:
 * Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710


 * With 600k refreshLinks jobs in the backlog of commonswiki, only 7k refreshLinks jobs got processed in a day (not including those run manually from terbium).


 * For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue.


 * Jobs that were triggered months ago were found to still be failing and retrying.


 * Selection of the target wiki in the job runner is random; it does not depend on which wikis have the most jobs pending. Wikis with no jobs pending get selected, causing overhead (and even an explicit delay? Or is the job runner idling, waiting for jobs for a bit?)


 * There is one queue per target wiki.

Issues and considerations:
 * We have lots of jobs that do nothing (e.g. RefreshLinks for a page that doesn't have links - but we don't know that in advance)
 * Jobs re-trying indefinitely: https://phabricator.wikimedia.org/T73853


 * Deduplication
 * The mechanism is obscure/undocumented. Some job types rely on rootJob parameters, others use custom logic.
 * Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break up small batches?
 * Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
 * Custom coalescing could improve the chance for deduplication.
 * Kafka / changeprop WIP lib for dedup, delay, rate limiting & retry processing: https://github.com/wikimedia/budgeteer
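A rough sketch of how rootJob-style deduplication works in principle (illustrative names and structure only, not MediaWiki's actual API): a job carries a signature identifying its root cause plus the root's timestamp; once a job for that signature has been processed, later duplicates whose root is not newer are no-ops and can be dropped.

```python
import hashlib
import json


class JobDeduplicator:
    """Illustrative sketch of rootJob-based deduplication (not MediaWiki's real API).

    Each job carries a "root job" signature (the original triggering event)
    and that root's timestamp. After a job runs, we record the newest root
    timestamp seen for the signature; a later duplicate whose root is not
    newer than the recorded one would do redundant work and can be skipped.
    """

    def __init__(self):
        # root signature -> latest root timestamp already processed
        self._last_done = {}

    @staticmethod
    def root_signature(job_type, root_params):
        # Hash the job type together with the parameters identifying the root cause.
        payload = json.dumps([job_type, root_params], sort_keys=True)
        return hashlib.sha1(payload.encode()).hexdigest()

    def mark_done(self, sig, root_timestamp):
        self._last_done[sig] = max(self._last_done.get(sig, 0), root_timestamp)

    def is_duplicate(self, sig, root_timestamp):
        return root_timestamp <= self._last_done.get(sig, 0)
```

Note how this illustrates the batching problem: a batched job covering many pages has no single root signature per page, so a later per-page duplicate cannot be matched against it.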


 * The scope and purpose of some jobs are unclear. E.g. UpdateHtmlCache invalidates the parser cache, and RefreshLinks re-parses the page - but does not trigger an UpdateHtmlCache, which it probably should.


 * The throttling mechanism does not take into account the nature and run-time of different job types.
 * Video transcode jobs for TimedMediaHandler vary widely in run time, from a couple of minutes to many hours. They can be divided into smaller chunks, but how best to handle a large influx of small chunks from one (or many) upload(s)?
 * Alternative: make the throttling mechanism understand job length variability and use it to balance longer vs. shorter jobs.
 * ex: https://github.com/wikimedia/budgeteer, part of changeprop / kafka effort.
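One possible shape for a run-time-aware throttle, as a sketch only (it assumes we can estimate per-job run time, which the current throttling mechanism does not do): jobs are admitted against a shared budget of estimated seconds, so one multi-hour transcode and many small chunks compete for the same capacity instead of counting equally.

```python
class RuntimeAwareThrottle:
    """Sketch of a throttle that weighs jobs by estimated run time.

    Hypothetical design, not the existing mechanism: each admitted job
    consumes budget proportional to its expected duration, so a single
    multi-hour transcode and a flood of short chunks are balanced against
    the same capacity rather than counted as equal units.
    """

    def __init__(self, budget_seconds):
        self.budget = budget_seconds
        self.in_flight = {}  # job id -> estimated cost in seconds

    def try_admit(self, job_id, estimated_seconds):
        # Reject if admitting this job would exceed the shared time budget.
        if sum(self.in_flight.values()) + estimated_seconds > self.budget:
            return False
        self.in_flight[job_id] = estimated_seconds
        return True

    def release(self, job_id):
        # Free the job's share of the budget when it finishes (or fails).
        self.in_flight.pop(job_id, None)
```

Under such a scheme, a rejected job would need to be scheduled for delayed execution rather than dropped, which ties back into the delayed-jobs support discussed above.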


 * Scheduling
 * Issue: Random per-wiki processing can run up large backlogs in individual wikis.
 * Scaling is achieved by running more cron jobs.
 * Having a single queue for all wikis would mean wikis with a large backlog get more job runner time. But maybe too much, starving small wikis?
 * This is roughly the plan for the new Kafka-based system.
 * Can add rate limiting per job type & wiki if needed to prevent abuse. (What happens to rate-limited jobs? Are they retried after an interval, or requeued at the end? -- They are scheduled for delayed execution.)
 * For Wikibase, in change dispatching (pre job queue), we look at which wikis have the most changes pending, pick a set of the most-lagged ones, and then randomly pick one from that set. This makes it more likely for heavily lagged targets to be processed, without starving others.
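The Wikibase-style selection described above can be sketched roughly as follows (function name and pool size are illustrative; the real dispatcher has more state, such as locking and lag thresholds):

```python
import random


def pick_target_wiki(pending_counts, candidate_pool=5, rng=random):
    """Sketch of Wikibase-style dispatch target selection.

    Rank wikis by pending change count, keep the most-lagged few, then pick
    one of those at random. This biases processing toward heavily lagged
    wikis without deterministically starving everything outside the top spot.

    pending_counts: dict mapping wiki id -> number of pending changes.
    """
    if not pending_counts:
        return None
    ranked = sorted(pending_counts, key=pending_counts.get, reverse=True)
    pool = ranked[:candidate_pool]
    return rng.choice(pool)
```

Contrast this with the current job runner, where the target wiki is chosen uniformly at random regardless of backlog size, so wikis with empty queues are selected just as often as commonswiki with 600k pending jobs.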


 * Kafka-based JQ is being tested by Services. Generally saner. Should improve ability to track causality (which job got triggered by which other job). T157088

 * No support for recurrent jobs. Should we keep using cron?
 * Limited visibility into job queue contents; difficult to inspect.

Documentation:
 * Manual:Job queue
 * JobQueue (Doxygen)
 * Aaron's slides (DropBox)
 * Job Queue Health (Grafana)
 * IRC discussion notes 2017-09-14