Change propagation distributes changes between services using the EventBus infrastructure. Its rules subscribe to specific EventBus topics and execute an action (typically a templated HTTP request or a CDN purge) in response to each event.
Monitoring change propagation
A Grafana dashboard monitors the EventBus and change propagation services. For EventBus, it shows generic information about the system's current throughput, response timing, and load. For change propagation, it shows the execution rate and the backlog of each rule, with normal processing and retries shown separately.
Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation can't keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.
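The backlog definition above can be sketched in a few lines of Python. This is only an illustration, not how the dashboard computes its graphs: the function names, the sample window, and the "every sample higher than the previous one" test for steady growth are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def backlog_seconds(event_created: datetime, processing_started: datetime) -> float:
    """Rule backlog: time between event creation and the start of its processing."""
    return (processing_started - event_created).total_seconds()


def is_growing_steadily(samples: list, window: int = 5) -> bool:
    """Return True if each of the last `window` samples exceeds the one before it.

    A single spike followed by recovery does not count as steady growth.
    The window size is an arbitrary choice for this sketch.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Example: an event created at noon and picked up 42 seconds later.
created = datetime(2024, 1, 1, 12, 0, 0)
started = created + timedelta(seconds=42)
backlog = backlog_seconds(created, started)
```

A series such as `[1, 50, 2, 3, 2, 3]` contains a spike but is not flagged, while `[1, 2, 3, 4, 5, 6]` is.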
Advantages over the JobQueue
- Guaranteed delivery. The EventBus infrastructure is based on the Apache Kafka message broker, which provides at-least-once delivery semantics: once an event is in Kafka, we can be sure the reaction will follow. This lets us build very long and complex sequences of dependencies without fear that something will be lost.
- Automatic retries with exponential delays, large job deduplication, and a persistent error topic in Kafka.
- Mostly config-based. Most simple update rules can be added with just a few lines of YAML, without writing any code.
- Better monitoring. A fine-grained monitoring dashboard lets us track rates and delays for individual topics, rates of event production, and much more. The monitoring is informative enough that we can find bugs in other parts of the infrastructure just by looking at the Change Propagation graphs.
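To make "retries with exponential delays" concrete, here is a minimal Python sketch. It is not the service's implementation: an in-process loop is swapped in for the real retry mechanism, a plain list named `errors` stands in for the persistent error topic, and the names `retry_delays` and `execute_with_retries` and their parameters are hypothetical.

```python
import time
from typing import Callable, List


def retry_delays(base_delay: float, factor: float, limit: int) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_delay * factor ** attempt for attempt in range(limit)]


def execute_with_retries(
    action: Callable[[dict], None],
    event: dict,
    delays: List[float],
    errors: List[dict],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `action` once, then once more after each delay until it succeeds.

    If every attempt fails, the event is appended to `errors`
    (a stand-in for a persistent error topic) and False is returned.
    """
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            action(event)
            return True
        except Exception:
            continue  # fall through to the next, longer delay
    errors.append(event)
    return False
```

With `retry_delays(1.0, 2.0, 3)` the waits between attempts are 1, 2 and 4 seconds, so a failing rule backs off quickly instead of hammering the target service.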
Tutorials for the most common use-cases
Setting up a new rule
The most common task on the change propagation service is setting up a new rule. Currently, all rules are static and stored in the config file. On startup, the rules are read from the config and Kafka consumers are created for each rule (as well as for the corresponding retry topic).
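The startup behaviour described above amounts to deriving two subscriptions per rule. The following Python sketch shows only that bookkeeping. The `retry.` prefix and the shape of the `rules` dictionary are assumptions for illustration; the real retry topic naming should be checked in the service's configuration.

```python
from typing import Dict, List, Tuple


def subscriptions(rules: Dict[str, dict], retry_prefix: str = "retry.") -> List[Tuple[str, str]]:
    """Return (rule_name, topic) pairs: the rule's own topic plus its retry topic.

    `retry_prefix` is a hypothetical naming scheme used only in this sketch.
    """
    pairs = []
    for name, rule in rules.items():
        pairs.append((name, rule["topic"]))
        pairs.append((name, retry_prefix + rule["topic"]))
    return pairs


# Hypothetical rule set with a single rule.
example_rules = {"purge_cache": {"topic": "resource_change"}}
example_subscriptions = subscriptions(example_rules)
```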
A rule consists of several pieces:
- The topic property configures which Kafka topic the rule listens to.
- General rule configuration properties configure features such as retries, ignoring errors, and delays.
- The match and match_not fields limit rule execution to a specific subset of the events in the topic. For example, if you need to do a domain-specific action you would need to add a regex match for a
- The exec property configures what should be done: you can make a set of HTTP requests, call a change-prop module, do a Varnish purge, or emit a new event as a reaction.
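How match and match_not narrow a rule down can be illustrated with a small Python sketch. It is a simplified model, not the service's matcher: the event fields `domain` and `page_title`, the rule contents, and the function names are hypothetical, and compiled Python regexes stand in for whatever pattern syntax the real config uses.

```python
import re


def field_matches(pattern, value) -> bool:
    """Dict patterns are matched key by key, regexes are searched, other values compared."""
    if isinstance(pattern, dict):
        return isinstance(value, dict) and all(
            key in value and field_matches(sub, value[key])
            for key, sub in pattern.items()
        )
    if isinstance(pattern, re.Pattern):
        return isinstance(value, str) and pattern.search(value) is not None
    return pattern == value


def rule_applies(rule: dict, event: dict) -> bool:
    """A rule applies if `match` accepts the event and `match_not` does not."""
    if "match" in rule and not field_matches(rule["match"], event):
        return False
    if "match_not" in rule and field_matches(rule["match_not"], event):
        return False
    return True


# Hypothetical rule: act on Wikipedia domains, but skip user pages.
rule = {
    "topic": "resource_change",
    "match": {"domain": re.compile(r"\.wikipedia\.org$")},
    "match_not": {"page_title": re.compile(r"^User:")},
}
applies = rule_applies(rule, {"domain": "en.wikipedia.org", "page_title": "Main_Page"})
```

Keeping the filter declarative like this is what lets simple rules be expressed in config rather than code.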
A switch rule is also supported. It mimics the semantics of the switch operator in programming languages, but without fall-through. A switch rule groups together several rules that listen to the same topic and have the same semantics, but have mutually exclusive match parts. For more detailed information about rule configuration, see the docs available in the repository.
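The "switch without fall-through" behaviour can be sketched as follows. This is an illustrative model only: cases are represented as (predicate, action name) pairs, and `select_case` is a hypothetical helper, not part of change-prop.

```python
from typing import Callable, List, Optional, Tuple

Case = Tuple[Callable[[dict], bool], str]


def select_case(cases: List[Case], event: dict) -> Optional[str]:
    """Return the action of the first case whose predicate accepts the event.

    Evaluation stops at that case, so no later case runs (no fall-through).
    Returns None when no case matches.
    """
    for predicate, action in cases:
        if predicate(event):
            return action
    return None


# Hypothetical cases with mutually exclusive match parts.
cases = [
    (lambda e: e.get("domain", "").endswith(".wikipedia.org"), "rerender_wikipedia"),
    (lambda e: e.get("domain", "").endswith(".wiktionary.org"), "rerender_wiktionary"),
]
chosen = select_case(cases, {"domain": "en.wiktionary.org"})
```

Because the match parts are meant to be mutually exclusive, at most one action is ever chosen for a given event.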
Once the new rule configuration is written, follow this process to deploy it:
- Create a GitHub pull request for the config.example.wikimedia.yaml file in the change-propagation repository. Tips:
  - Use publicly available URIs where possible.
  - Consider adding a unit test.
  - Configure the retry policy and error ignoring.
- Wait for the services team to review.
- When the pull request is merged, make the same change in Gerrit for the Puppet config file. The Puppet config change might differ from your GitHub PR, since it needs to include some templating for the service hosts. The reviewer list should include a person from the services team and at least one person from the operations team.
- After the Puppet change is merged and deployed, ask the services team to restart the change propagation service so that it picks up the new config.
Checking what templates are being processed now
The highest load generated by the ChangeProp service comes from template expansion: all pages transcluding a template must be re-rendered after the template is edited. To check which large templates are being processed right now, go to kafka1001.eqiad.wmnet and run the /home/ppchelko/check_templates.sh script. The script outputs all the templates that are being processed right now or are in the backlog. Note that duplicates in the output are OK: the change-prop processor is concurrent and commits the smallest processed offset, so the output contains not only the templates in the backlog, but also templates that have already been processed but not yet committed.
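Why already-processed templates can still show up is easier to see with a small model of offset tracking under concurrency. The Python sketch below shows one common way to do this; the `OffsetTracker` class is hypothetical and is not confirmed to mirror the service's code.

```python
class OffsetTracker:
    """Tracks offsets of one partition that are processed concurrently."""

    def __init__(self, start: int):
        self.committed = start   # every offset below this one is done
        self.finished = set()    # done, but blocked behind an unfinished offset

    def finish(self, offset: int) -> int:
        """Mark `offset` as processed and return the new commit position."""
        self.finished.add(offset)
        # Advance only across a contiguous run of finished offsets.
        while self.committed in self.finished:
            self.finished.remove(self.committed)
            self.committed += 1
        return self.committed

    def processed_but_uncommitted(self) -> list:
        """Offsets that would still appear when reading from the commit position."""
        return sorted(self.finished)


tracker = OffsetTracker(start=0)
tracker.finish(2)   # finished out of order: nothing can be committed yet
tracker.finish(1)
pending = tracker.processed_but_uncommitted()
tracker.finish(0)   # the slowest message completes, the commit position jumps ahead
```

While offset 0 is still in flight, offsets 1 and 2 are finished but sit above the commit position, which is exactly the situation that produces extra entries in the script output.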
- Wikimedia Services Team: Owners of this service
- Requests for comment/Requirements for change propagation (T102476) - RFC that describes the different approaches being explored (publish / subscribe event bus, dependency tracking, and change propagation)