Requests for comment/Parallel maintenance scripts

From mediawiki.org
Request for comment (RFC): Parallel maintenance scripts
Component: General
Creation date:
Author(s): Brion Vibber (WMF)
Document status: See Phabricator.

See additional details in task T201970.

Proposal

  • Add a MediaWiki\Parallel namespace with helper classes, living in includes/parallel.
  • Add a ParallelMaintenance class that wraps the helpers and provides a --threads option to farm sub-tasks out to multiple workers.
  • Port the various Maintenance classes that can make use of it to ParallelMaintenance.

Implementation

Details

In MediaWiki\Parallel are some interfaces:

  • IController -- implemented by the StreamController classes; allows enqueueing items into the work stream and syncing the state
  • IWorker -- implemented by the StreamWorker and ForkStreamController classes; allows binding work callbacks to event names.
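The split between the two interfaces can be sketched as follows. This is a minimal illustration in Python rather than the PHP of MediaWiki\Parallel, and every name in it (IControllerSketch, IWorkerSketch, ImmediateSketch, and the method names enqueue, sync and bind) is an assumption made for the example, not the actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class IControllerSketch(ABC):
    """Hypothetical controller role: queue work items and sync state."""

    @abstractmethod
    def enqueue(self, event: str, data: Any) -> None:
        """Add one item to the work stream under an event name."""

    @abstractmethod
    def sync(self) -> None:
        """Block until all queued items have been processed."""


class IWorkerSketch(ABC):
    """Hypothetical worker role: bind work callbacks to event names."""

    @abstractmethod
    def bind(self, event: str, callback: Callable[[Any], Any]) -> None:
        """Register the callback that handles items for this event."""


class ImmediateSketch(IControllerSketch, IWorkerSketch):
    """Toy class filling both roles in one process, for illustration only."""

    def __init__(self) -> None:
        self.bindings: Dict[str, Callable[[Any], Any]] = {}
        self.results: List[Any] = []

    def bind(self, event: str, callback: Callable[[Any], Any]) -> None:
        self.bindings[event] = callback

    def enqueue(self, event: str, data: Any) -> None:
        # Run the bound callback right away instead of handing it to a child.
        self.results.append(self.bindings[event](data))

    def sync(self) -> None:
        # Nothing is ever outstanding, so there is nothing to wait for.
        pass


if __name__ == "__main__":
    demo = ImmediateSketch()
    demo.bind("double", lambda n: n * 2)
    demo.enqueue("double", 21)
    demo.sync()
    print(demo.results)
```

Keeping the two roles separate is what would let one side run in the parent and the other in a child process.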

Concrete implementations:

  • InProcessController - just passes queued items directly to a consumer for immediate execution
  • StreamController - base class that sends JSON-serialized data to a number of child processes over I/O streams
    • ForkStreamController - pcntl_fork()-based implementation, needs callbacks bound to event names.
    • ExecStreamController - proc_open()-based implementation, takes a command line; it is the caller's responsibility to route the command to something that runs a StreamWorker with suitable bindings.
  • StreamWorker - child process side for StreamController. Takes bindings from event names to callbacks.
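To make the stream-based design more concrete, here is a rough Python analogue of the proc_open() approach: a parent starts worker commands and exchanges JSON-serialized messages with them over stdin and stdout. The class name ExecStreamSketch, the one-JSON-object-per-line framing, the message fields (id, event, data, result) and the round-robin assignment are all assumptions made for this sketch; the RFC does not specify the wire format.

```python
import json
import subprocess
import sys

# Hypothetical worker program (the child-process side). It binds one event
# name to a callback and answers each JSON line read from stdin.
WORKER_SOURCE = r'''
import json
import sys

bindings = {"square": lambda n: n * n}

for line in sys.stdin:
    message = json.loads(line)
    result = bindings[message["event"]](message["data"])
    sys.stdout.write(json.dumps({"id": message["id"], "result": result}) + "\n")
    sys.stdout.flush()
'''


class ExecStreamSketch:
    """Parent side: sends JSON lines to child processes over pipes."""

    def __init__(self, command, workers=2):
        self.procs = [
            subprocess.Popen(command, stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, text=True)
            for _ in range(workers)
        ]
        self.pending = [0] * workers  # unanswered items per child
        self.next_id = 0
        self.results = {}

    def enqueue(self, event, data):
        # Round-robin assignment; a real controller could balance by load.
        index = self.next_id % len(self.procs)
        message = {"id": self.next_id, "event": event, "data": data}
        self.procs[index].stdin.write(json.dumps(message) + "\n")
        self.procs[index].stdin.flush()
        self.pending[index] += 1
        self.next_id += 1

    def sync(self):
        # Read one reply per outstanding item. This simple version assumes
        # the queued data is small enough not to fill the pipe buffers.
        for index, proc in enumerate(self.procs):
            while self.pending[index] > 0:
                reply = json.loads(proc.stdout.readline())
                self.results[reply["id"]] = reply["result"]
                self.pending[index] -= 1
        return self.results

    def close(self):
        # Closing stdin signals end of input, so each worker loop exits.
        for proc in self.procs:
            proc.stdin.close()
            proc.wait()
            proc.stdout.close()


if __name__ == "__main__":
    controller = ExecStreamSketch([sys.executable, "-c", WORKER_SOURCE])
    for number in range(5):
        controller.enqueue("square", number)
    print(controller.sync())
    controller.close()
```

In this sketch the command line is the only thing the parent knows about the worker, which mirrors the point that routing the command to something with suitable bindings is left to the caller.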

\ParallelMaintenance extends \Maintenance and implements a dispatch() method that sets up a controller instance and lets you dispatch items to it. Several maintenance scripts have been ported to use this, and it seems to work OK.
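A script built on such a base class might look roughly like the following Python sketch. It is not the PHP ParallelMaintenance class: the names ParallelMaintenanceSketch and CountWordsSketch, the bind() and sync() helpers, and the use of a thread pool as a stand-in for child worker processes are assumptions chosen to keep the example short and runnable. Only the --threads option and the idea of a dispatch() method that sets up the controller come from the text above.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


class ParallelMaintenanceSketch:
    """Hypothetical base class: parses --threads and dispatches work."""

    def __init__(self, argv):
        parser = argparse.ArgumentParser()
        parser.add_argument("--threads", type=int, default=1)
        self.options = parser.parse_args(argv)
        self.bindings = {}
        self.pool = None      # created lazily on first parallel dispatch
        self.futures = []
        self.results = []

    def bind(self, event, callback):
        self.bindings[event] = callback

    def dispatch(self, event, data):
        callback = self.bindings[event]
        if self.options.threads <= 1:
            # Single worker requested: run the item immediately in-process.
            self.results.append(callback(data))
            return
        if self.pool is None:
            # Stand-in for setting up a controller with child workers.
            self.pool = ThreadPoolExecutor(max_workers=self.options.threads)
        self.futures.append(self.pool.submit(callback, data))

    def sync(self):
        # Collect outstanding results in submission order, then tear down.
        for future in self.futures:
            self.results.append(future.result())
        self.futures = []
        if self.pool is not None:
            self.pool.shutdown()
            self.pool = None
        return self.results


class CountWordsSketch(ParallelMaintenanceSketch):
    """Hypothetical ported script: counts words in each text it is given."""

    def execute(self, texts):
        self.bind("count", lambda text: len(text.split()))
        for text in texts:
            self.dispatch("count", text)
        return self.sync()


if __name__ == "__main__":
    script = CountWordsSketch(["--threads", "3"])
    print(script.execute(["one two", "three", "four five six"]))
```

The point of the shape is that the script body is the same whether one worker or several are requested; only the option value changes how dispatch() routes the items.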

This is the second iteration of the internal API, based on that porting work and feedback from the RfC IRC meeting.

Open questions

  • Increased use of pcntl_fork()-based maintenance scripts might be fragile. Although they close connections, it is likely that some new feature or extension feature will be overlooked, and weird teardown issues are also visible if child processes aren't carefully killed in unusual contexts such as unit testing. Would it be better to move ParallelMaintenance to the proc_open()-based controller?
  • What should be considered for error handling in worker contexts?
  • Should ForkController be fully merged with MediaWiki\Parallel?
  • What should be considered for possible web-server-side use of parallelism, where pcntl_fork() isn't available and proc_open() could be expensive?