Platform Engineering Team/Event Platform Value Stream/Pyflink Enrichment Service Deployment

This page documents the deployment process of https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment

Prerequisites:

- SSH access to production systems

- GitLab access

- Gerrit access

mediawiki-event-enrichment
mediawiki-event-enrichment is a repository of Flink jobs for streaming enrichment of MediaWiki event streams.

The mediawiki-page-content-change-enrichment service consumes the mediawiki.page_change stream and emits the mediawiki.page_content_change stream. mediawiki-page-content-change-enrichment is currently (2023-02) deployed as a flink-app service on the dse-k8s-eqiad cluster, as a POC, in the stream-enrichment-poc namespace.

Application Upgrades and Deployment
A new deployment typically involves:


 * 1) GitLab MR: modify the application via a GitLab MR. Make sure to follow our contribution guidelines, and tag gmodena, ottomata, tchin as reviewers.
 * 2) Make a new mediawiki-event-enrichment release by pushing a git tag. Upon release, CI will build and push a new mediawiki-event-enrichment docker image to https://docker-registry.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/tags/
 * 3) deployment-charts CR: point the helm chart in deployment-charts to the new docker image tag (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment/values.yaml#3). Tag gmodena, ottomata, tchin as reviewers.
 * 4) After the deployment-charts CR has been merged, the service is ready to deploy: follow the generic service deployment instructions (here), examine the diff, and if it looks as expected, confirm the deployment.
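The release-tagging step (pushing a git tag so that CI builds and publishes the image) can be sketched as follows. The version number below is hypothetical; pick the next release according to the repository's versioning scheme.

```shell
# Hypothetical version number; choose the next release per the repo's scheme.
VERSION="1.2.3"
TAG="v${VERSION}"

# Create an annotated tag and push it; on tag push, CI builds and publishes
# a new mediawiki-event-enrichment docker image to the Wikimedia registry.
git tag -a "${TAG}" -m "Release ${TAG}"
git push origin "${TAG}"
```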

Post deployment checks
After deployment, review operational status in the Grafana dashboards and JobManager metrics:


 * Flink Cluster Grafana Dashboard
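In addition to Grafana, pod health can be checked from the deployment server. The commands below are a sketch: `kube_env` is the WMF helper for selecting a service's kubeconfig, and the namespace is the one named on this page.

```shell
# Namespace of the POC deployment (see above).
NAMESPACE="stream-enrichment-poc"

# Select credentials for the dse-k8s-eqiad cluster (WMF deployment-server helper).
kube_env "${NAMESPACE}" dse-k8s-eqiad

# The jobmanager and taskmanager pods should all be in the Running state.
kubectl get pods -n "${NAMESPACE}"
```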

Application restarts
''This section is a WIP (https://phabricator.wikimedia.org/T328563). Demo code'' https://gitlab.wikimedia.org/-/snippets/58

Application lifecycle is currently managed by the k8s operator's High Availability strategy. The `mediawiki-event-enrichment` values file declares cluster-specific restart and upgrade strategies.

There are three scenarios that will require an application restart:


 * 1) To recover from failure
 * 2) Following an application upgrade
 * 3) Following a k8s cluster upgrade

References


 * Flink checkpointing and restart strategies https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
 * The difference between checkpoints and savepoints https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/

Recovery from failure
The operator will try to recover job and task managers that enter a FAILED state. Recovery resumes from the last recorded checkpoint, following exactly-once semantics. This restart is managed by k8s and normally should not require user intervention. TODO: what to do if intervention is required?
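If intervention does turn out to be necessary, a first diagnostic step is to inspect the operator's view of the deployment. This is a sketch assuming the flink-kubernetes-operator's FlinkDeployment CRD; the resource name below is an assumption based on the service name.

```shell
NAMESPACE="stream-enrichment-poc"

# List FlinkDeployment resources managed by the operator; the status stanza
# records the job state and any reconciliation errors.
kubectl -n "${NAMESPACE}" get flinkdeployments

# Hypothetical resource name, assumed to match the service name.
kubectl -n "${NAMESPACE}" describe flinkdeployment mediawiki-page-content-change-enrichment

# Tail jobmanager logs for the failure cause.
kubectl -n "${NAMESPACE}" logs -l component=jobmanager --tail=100
```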

Application upgrade
Occasionally we'll need to upgrade to a new version of mediawiki-event-enrichment. By changing the application's declared state in the helmfile deployment and re-deploying, users can define the desired state of the application and trigger a restart.


 * 1) Change the chart's declared state to suspended and re-deploy the service
 * 2) Change the chart's declared state to running and re-deploy the service

Application upgrades will resume streaming from the latest recorded savepoint. A savepoint is recorded when the application changes.
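As a sketch, and assuming the flink-app chart passes a job state field through to the operator (the exact key may differ in the chart's values schema), the suspend/resume toggle looks like:

```yaml
# helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment/values.yaml
# Hypothetical fragment: toggle between "running" and "suspended",
# then re-deploy the service to apply the change.
app:
  job:
    state: suspended   # set back to "running" to resume from the savepoint
```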

K8S cluster upgrade
After a k8s cluster upgrade, the application will need to be manually restarted. Cluster upgrades purge ConfigMaps, so a manual recovery is required.

TBD: should we store HA state in zookeeper?

Rollback
To roll back a deployment:


 * Revert to the latest stable helmfile revision
 * Once the change is merged with master, re-deploy the service
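Once the revert is merged, the re-deploy follows the same helmfile flow as a normal deployment. The sketch below assumes the standard deployment-server layout for dse-k8s services.

```shell
# On the deployment server; path and environment per the dse-k8s setup.
SERVICE="mediawiki-page-content-change-enrichment"
cd "/srv/deployment-charts/helmfile.d/dse-k8s-services/${SERVICE}"

# Show the diff against the live release, then apply interactively.
helmfile -e dse-k8s-eqiad diff
helmfile -e dse-k8s-eqiad -i apply
```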

General k8s rollback instructions can be found at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_changes