
This document is an RFC.

Published: 2022-09-19

Internal discussion draft: https://docs.google.com/document/d/1hhQAh-iWI2bWDYBZsvJwPE8e93jN0wI3XG-FLkSmcF4/edit#

Use case: compute needs for streaming pipelines
Author: Gabriele Modena (gmodena@wikimedia.org), 2022-07-12.

The data platform team has developed a PoC stateless streaming pipeline to enrich MediaWiki page change events. We leveraged the experience of the Search team and adopted Apache Flink as our reference streaming technology. The PoC implements an application that bootstraps a Flink cluster atop YARN. We chose YARN because it was the only compute platform available at the time. We are aware that production support, though not yet specified in terms of SLOs, will require a move to Kubernetes.

Access to the DSE cluster would allow us to:

 * Acquire experience with running Flink-on-k8s compute loads at scale.
 * Acquire experience with integrating k8s loads with other parts of our infrastructure.
 * Bridge the gap between development and production targets.

Key facts & metrics
The business logic is stateless, but pipeline (Flink) restarts require checkpointing Kafka offsets to a filesystem. On YARN, we use HDFS to store checkpoints. I/O boundaries are defined by Kafka topics. Consumers will access the enriched data from a snapshot loaded from HDFS (a process managed by DE, orthogonal to us).
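The checkpointing setup described above could be expressed with a configuration fragment along these lines (a sketch using standard Flink configuration keys; the HDFS path and interval are placeholders, not our actual values):

```yaml
# Hypothetical flink-conf.yaml fragment.
# state.checkpoints.dir below is a placeholder path, not our real one.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
execution.checkpointing.interval: 30 s
```

With this in place, Kafka source offsets are persisted as part of each checkpoint, so a restarted job resumes from the last committed position rather than reprocessing the topic.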

Dependencies

 * Kafka (consumption/production).
 * Mediawiki Action API (HTTP requests for revision data, targeting all production wikis).
 * Access to schema.discovery.wmnet.
 * Prometheus (metrics).

Resource allocation
We deploy a single-node Flink cluster on a YARN container with the following spec:

 * Cores: 2 (one reserved for our application, one for the Flink Job Manager)
 * Memory: 3200 MB (1600 MB for our application, 1600 MB for the Flink Job Manager)
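For reference, a YARN submission matching this spec would look roughly like the following (a sketch using Flink's YARN CLI flags; the jar name is a placeholder):

```shell
# Sketch of a per-job YARN submission; enrichment-pipeline.jar is a placeholder.
flink run -m yarn-cluster \
  -yjm 1600m \
  -ytm 1600m \
  -ys 1 \
  enrichment-pipeline.jar
```

Here `-yjm` and `-ytm` set Job Manager and Task Manager memory respectively, and `-ys 1` allocates a single task slot per Task Manager.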

On k8s we would like to scale up (horizontally and vertically) and experiment with resource allocation strategies and HA requirements.
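On Kubernetes, the equivalent deployment could be declared via the flink-kubernetes-operator, which would also give us a natural place to experiment with resource allocation and HA settings. A hypothetical manifest (names, image, and sizes are illustrative, mirroring the YARN spec above):

```yaml
# Hypothetical FlinkDeployment (flink-kubernetes-operator CRD).
# Image, jar URI, and resource values are placeholders.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: mw-page-change-enrichment
spec:
  image: <flink-image>
  flinkVersion: v1_15
  jobManager:
    resource:
      cpu: 1
      memory: "1600m"
  taskManager:
    resource:
      cpu: 1
      memory: "1600m"
  job:
    jarURI: local:///opt/flink/usrlib/enrichment-pipeline.jar
    parallelism: 2
```

Scaling up would then amount to adjusting `parallelism` and the per-component `resource` blocks, rather than resizing a YARN container.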

Load
The expected load (lower bound) should follow the throughput trend of the mediawiki-revision-create Kafka topic. As an upper bound / extreme case we expect:

 * 200 events/second consumed from Kafka.
 * 200 requests/second to the Action API (proxied by discovery, read-only).
 * 200 events/second produced to a Kafka topic.
 * Backfilling will take into account the need to throttle and back off consumption.
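The throttling and back-off behaviour we have in mind for backfills can be sketched as follows (illustrative Python, not the pipeline's actual code; the 200 events/second cap comes from the load estimates above, while the back-off constants are assumptions):

```python
import time


class ThrottledConsumer:
    """Illustrative rate limiter for backfill consumption.

    Caps consumption at `max_rate` events/second and backs off
    exponentially when the downstream (e.g. the Action API) signals
    overload. A sketch, not the pipeline's implementation.
    """

    def __init__(self, max_rate=200, base_backoff=0.5, max_backoff=30.0):
        self.max_rate = max_rate          # events/second ceiling
        self.base_backoff = base_backoff  # seconds, first retry delay
        self.max_backoff = max_backoff    # seconds, delay cap
        self.failures = 0

    def backoff_delay(self):
        # Exponential back-off, capped: 0.5s, 1s, 2s, ... up to 30s.
        return min(self.base_backoff * (2 ** self.failures), self.max_backoff)

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0

    def consume(self, events, process):
        """Process `events`, sleeping as needed to stay under max_rate."""
        min_interval = 1.0 / self.max_rate
        for event in events:
            start = time.monotonic()
            while True:
                try:
                    process(event)
                    self.record_success()
                    break
                except RuntimeError:  # stand-in for a 429/5xx from the API
                    self.record_failure()
                    time.sleep(self.backoff_delay())
            elapsed = time.monotonic() - start
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
```

The same pattern (a rate cap plus exponential back-off on downstream errors) maps directly onto a Flink source or async I/O operator once we settle on the production implementation.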