Platform Engineering Team/Event Platform Value Stream/Pyflink Enrichment Service Deployment

This page documents the deployment process of https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment

Prerequisites:

- SSH access to production systems

- Gitlab access

- Gerrit access

mediawiki-event-enrichment
mediawiki-event-enrichment is a repository of Flink jobs for streaming enrichment of MediaWiki event streams.

The mediawiki-page-content-change-enrichment service consumes the mediawiki.page_change stream and emits the mediawiki.page_content_change stream. It is currently (2023-02) deployed as a flink-app service on the dse-k8s-eqiad cluster, as a POC in the stream-enrichment-poc namespace.

Deployment
A new deployment typically involves the following steps:


 * 1) Gitlab MR: modify the application via a Gitlab MR.
 * 2) Tag gmodena, ottomata, tchin as reviewers.
 * 3) Make a new mediawiki-event-enrichment release by pushing a git tag: `git tag -a vX.Y.Z -m 'Version X.Y.Z. .'` followed by `git push --tags`.
 * 4) Upon release, CI will build and push a new mediawiki-event-enrichment docker image to https://docker-registry.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/tags/
 * 5) deployment-charts CR: update the helm chart values in deployment-charts to point to the new docker image tag: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment/values.yaml#3
 * 6) Tag gmodena, ottomata, tchin as reviewers.
 * 7) After the deployment-charts CR has been merged, the service is ready to deploy.
 * 8) Service Deployment (generic instructions here)
 * 9) Examine the diff and, if it is as expected, confirm the deployment.
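
The release steps above (tag, then push) can be scripted. The following is a minimal sketch, not the team's tooling: the helper name `release_commands` and the tag message format are assumptions made for illustration, and the function only builds the git command lines so they can be reviewed before anyone runs them.

```python
import re

# Assumption: releases follow the vX.Y.Z tag scheme shown in the steps above,
# so only plain "X.Y.Z" version strings are accepted.
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def release_commands(version: str) -> list[list[str]]:
    """Return the git commands that tag and publish a release.

    Hypothetical helper: nothing is executed here; the caller decides
    whether and how to run the returned commands.
    """
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    tag = f"v{version}"
    return [
        # Annotated tag; the message text is a placeholder.
        ["git", "tag", "-a", tag, "-m", f"Version {version}"],
        # Pushing the tag is what triggers the CI image build.
        ["git", "push", "--tags"],
    ]


if __name__ == "__main__":
    for cmd in release_commands("1.2.3"):
        print(" ".join(cmd))
```

Returning argument lists instead of shell strings keeps quoting unambiguous if the commands are later passed to `subprocess.run`.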

Post deployment checks
Check metrics? SSH into the Flink Job Manager?


 * Flink Cluster Grafana Dashboard

Application restarts
This section is a WIP (https://phabricator.wikimedia.org/T328563). Demo code: https://gitlab.wikimedia.org/-/snippets/58

Application lifecycle is currently managed by the k8s operator's High Availability strategy. The `mediawiki-event-enrichment` values file declares cluster-specific restart and upgrade strategies.

There are three scenarios that require an application restart:


 * 1) To recover from failure
 * 2) Following an application upgrade
 * 3) Following a k8s cluster upgrade

References


 * Flink checkpointing and restart strategies https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
 * The difference between checkpoints and savepoints https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/

Recovery from failure
The operator will try to recover job and task managers that enter a FAILED state. Recovery will resume from the last recorded checkpoint, following exactly-once semantics. This restart is managed by k8s and should not normally require user intervention. TODO: what to do if intervention is required?
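
Resuming from the last checkpoint presupposes that checkpointing is enabled in exactly-once mode and that a restart strategy is configured. As a rough sketch of what that implies, the snippet below assembles the relevant Flink configuration options as a plain dictionary. The option keys are standard Flink settings (newer Flink releases spell `restart-strategy` as `restart-strategy.type`), but the interval, attempts, delay, and checkpoint directory are placeholders, not the values deployed for this service.

```python
def checkpoint_settings(checkpoint_dir: str, interval: str = "30s") -> dict[str, str]:
    """Build Flink options for exactly-once checkpointing with restarts.

    All values are illustrative placeholders (assumptions), not the
    settings used by mediawiki-page-content-change-enrichment.
    """
    return {
        # How often a checkpoint is triggered.
        "execution.checkpointing.interval": interval,
        # Exactly-once state consistency on recovery.
        "execution.checkpointing.mode": "EXACTLY_ONCE",
        # Retry a failed job a bounded number of times.
        "restart-strategy": "fixed-delay",
        "restart-strategy.fixed-delay.attempts": "3",
        "restart-strategy.fixed-delay.delay": "10s",
        # Durable location that recovery reads checkpoints from.
        "state.checkpoints.dir": checkpoint_dir,
    }


def render(settings: dict[str, str]) -> str:
    """Render options as 'key: value' lines, one per option."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Hypothetical storage location, for illustration only.
    print(render(checkpoint_settings("s3://example-bucket/checkpoints")))
```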

Application upgrade
Occasionally we'll need to roll out a new version of mediawiki-event-enrichment. By controlling the  field of the   users can define the desired state of the application and trigger a restart.


 * 1) Change the chart   to suspended: helm apply ...
 * 2) Deploy application changes
 * 3) Change the chart   to running: helm apply ...

Application upgrades will resume streaming from the latest recorded savepoint. A savepoint will be recorded when the application  changes.
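
The suspend, deploy, resume sequence above amounts to a small state transition on the job. The sketch below models it in Python under a loud assumption: that the chart exposes something like the upstream Flink Kubernetes Operator's `job.state` setting (values `running` and `suspended`) and `job.upgradeMode`. The actual value names used by this chart are not confirmed here, and `image_tag`, `with_job_state`, and `upgrade_plan` are hypothetical names chosen for illustration.

```python
from copy import deepcopy

VALID_STATES = ("running", "suspended")


def with_job_state(values: dict, state: str) -> dict:
    """Return a copy of the chart values with the job state changed."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown job state: {state!r}")
    updated = deepcopy(values)
    updated.setdefault("job", {})["state"] = state
    return updated


def upgrade_plan(values: dict, new_image_tag: str) -> list[dict]:
    """Return the values to apply at each of the three upgrade steps."""
    # Step 1: suspend the job (per the text above, a savepoint is recorded).
    suspended = with_job_state(values, "suspended")
    # Step 2: deploy the application change while the job is suspended.
    upgraded = deepcopy(suspended)
    upgraded["image_tag"] = new_image_tag  # hypothetical key
    # Step 3: set the job back to running so streaming resumes.
    running = with_job_state(upgraded, "running")
    return [suspended, upgraded, running]


if __name__ == "__main__":
    current = {
        "image_tag": "v1.0.0",
        "job": {"state": "running", "upgradeMode": "savepoint"},
    }
    for step, values in enumerate(upgrade_plan(current, "v1.1.0"), start=1):
        print(step, values["job"]["state"], values["image_tag"])
```

Each step returns a fresh copy of the values, so the plan can be reviewed or diffed before anything is applied to the cluster.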

K8S cluster upgrade
After a k8s cluster upgrade, the application must be restarted manually. Cluster upgrades purge ConfigMaps, so a manual recovery is required.

TBD: should we store HA state in zookeeper?

Rollback
TBD. Will require manual intervention.