Platform Engineering Team/Event Platform Value Stream/Stream Processing Framework Evaluation

=Shared Event Platform Project=

What is it?
Engineers from across Technology (Platform, Data Engineering, Search and Enterprise) will collaborate on a shared event streaming platform capability that is beneficial to each group and the overall foundation.

Existing event streams signal a change of state but lack many of the details required to make sense of that change (see T291120). The event platform will enable us to build enriched data streams that allow the foundation and community to build and share better knowledge experiences.

What do we aim to achieve?

 * Evaluation of event streaming platforms
 * Implementation of the chosen event streaming solution as a proof of concept (no SLOs)
 * Implementation of the following services/stream processors:
   * Simple Enrichment - transform a single stream by enriching with calls to MediaWiki APIs
   * Research Use Case - transform a single stream to provide data for a Research Use Case
   * Data Integration - integrating streams and databases
 * Understanding the pathway and considerations to take the chosen solution to production
 * Creating tooling and pathways for other engineering groups to build streaming services/processors

How does this benefit the movement?

 * Knowledge as a service - Publishing enriched event streams to the world will allow anyone to build on that to create new knowledge experiences
 * Knowledge equity - By publishing enriched streams we break down technical barriers in navigating and accessing data that could be used to build new knowledge experiences

Phase 1: Evaluating Solutions - T306797
The analysis for Flink and Kafka Streams has been supplemented with the evaluation conducted by the Search team. Knative Eventing evaluation details can be found in Phabricator.

General Remarks
It would be nice if we chose the same technology for Data Connectors, Stream Processors, Batch Processors, and event driven feature service development, but there is no requirement to do so. We are focusing on platform level choices, so we are likely to favor technology that allows us to implement a more comprehensive data platform rather than ease of use for event driven feature services.

Multi DC
There is no built in support for 'multi DC' (AKA multi region) in any of these frameworks, and as such, the multi-DC-ness is an application level concern. Generally, the more state that the stream processor is responsible for, the more difficult it is to architect. For simple event transformation/enrichment processors, the only state we will need is in the Kafka consumer offsets, which should be fully managed by Kafka itself.

See also: Multi DC for Streaming Apps

Multi DC streaming would be much easier to accomplish by default if Apache Kafka had support for multi region clusters. There is support for this in the Confluent Enterprise edition.
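To illustrate why a simple transformation/enrichment processor is the easy case, here is a minimal sketch in plain Python (not Kafka client code). A list stands in for a single topic partition, and `run_processor` and `add_label` are hypothetical names made up for this example. It only shows that the committed offset is the sole thing a replacement instance needs; it does not model how offsets map between separate Kafka clusters.

```python
from typing import Callable, List, Tuple


def run_processor(
    log: List[dict],
    committed_offset: int,
    transform: Callable[[dict], dict],
    max_events: int,
) -> Tuple[List[dict], int]:
    """Transform up to max_events events, starting at committed_offset.

    Returns the outputs and the next offset to commit. The processor keeps
    no other state, so any instance can resume from that offset.
    """
    end = min(len(log), committed_offset + max_events)
    outputs = [transform(event) for event in log[committed_offset:end]]
    return outputs, end


def add_label(event: dict) -> dict:
    # Hypothetical stateless enrichment: output depends only on the input event.
    return {**event, "label": f"{event['wiki']}:{event['page_id']}"}


# A list stands in for one topic partition (assumption for illustration).
log = [{"wiki": "enwiki", "page_id": i} for i in range(5)]

# The first instance handles three events, then goes away.
first, offset = run_processor(log, 0, add_label, max_events=3)

# A replacement instance only needs the committed offset to continue.
second, offset = run_processor(log, offset, add_label, max_events=10)
```

A processor that also kept aggregates or join state would have to move that state along with the offset, which is where the architecture gets harder.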

Flink
A general purpose stream and batch processing framework and scheduler, supporting any input datasource (not just Kafka streams).
 * Java API is somewhat limited, because of type erasure (doc). Because of this, Scala seems a better choice.
 * Testing API enables both stateless and stateful testing. The same applies to timely UDFs (user defined functions) (doc)
 * There is a script to launch a Scala Flink REPL, which seems useful
 * There are a few different levels of API here, ranging from SQL analytics to low level stateful stream processing (1.10 Documentation: Dataflow Programming Model)
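The stateless/stateful distinction in the notes above can be shown with a plain-Python analogy. This is not the Flink API; `normalize_title` and `EditCounter` are hypothetical names used only for illustration. The point is that a stateless function can be tested with one input, while a stateful operator needs a sequence of inputs.

```python
from collections import defaultdict
from typing import Dict, Tuple


def normalize_title(event: dict) -> dict:
    """Stateless: the output depends only on the input event."""
    return {**event, "title": event["title"].strip().replace(" ", "_")}


class EditCounter:
    """Stateful: keeps a running count per key (here, the page title)."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = defaultdict(int)

    def process(self, event: dict) -> Tuple[str, int]:
        self.counts[event["title"]] += 1
        return event["title"], self.counts[event["title"]]


# Testing the stateless function needs a single input and output.
assert normalize_title({"title": " Main Page "}) == {"title": "Main_Page"}

# Testing the stateful operator needs a sequence of inputs, because each
# result depends on what the operator has already seen.
counter = EditCounter()
events = [{"title": "A"}, {"title": "B"}, {"title": "A"}]
results = [counter.process(e) for e in events]
assert results == [("A", 1), ("B", 1), ("A", 2)]
```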

Kafka Streams
A library for developing stream applications using Kafka.
 * It leans more heavily on a SQL-like approach (called KSQL) when it comes to data wrangling
 * It looks cool for simple operations on Kafka topics, but the philosophy here is to augment existing applications (Kafka Streams API is a library) with a dash of data processing, rather than to create standalone processing applications. They basically say so in the first introductory video (1. Intro to Streams | Apache Kafka® Streams API)
 * It’s difficult to find code examples in their documentation - Apache Flink’s is much better in that regard.

Knative Eventing
'Event routing' k8s primitives to trigger service API calls based on events.
 * NOT a comprehensive stream processing framework.
 * More focused on abstraction of event streams, so application developers only have to develop HTTP request based services.
 * Has Kafka integration, but the Eventing system fully abstracts it away. The Eventing KafkaBroker uses a single topic, and filters all events in order to forward requests to subscribing services.
 * Uses CloudEvents as the event data envelope.
 * Looks very nice if you are building a self contained event driven application. Not so great for any kind of CEP (complex event processing).
 * Requires newer versions of Kubernetes than we currently have at WMF (as of 2022-05).
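The routing model described above can be sketched in plain Python. This is a toy model, not Knative code: `ToyBroker`, `make_event`, and the event type names are hypothetical. A callable stands in for the HTTP request a subscribing service would receive, and the filter is an exact match on event attributes. The four required attributes are the ones the CloudEvents 1.0 specification marks as required.

```python
from typing import Callable, Dict, List, Tuple

# Attributes that the CloudEvents 1.0 specification marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")

Event = Dict[str, object]


def make_event(event_id: str, source: str, event_type: str, data: dict) -> Event:
    """Wrap a payload in a minimal CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": event_id,
        "source": source,
        "type": event_type,
        "data": data,
    }


class ToyBroker:
    """Hypothetical stand-in for a broker: one event log, many filtered subscribers."""

    def __init__(self) -> None:
        self.triggers: List[Tuple[Dict[str, str], Callable[[Event], None]]] = []

    def subscribe(self, attributes: Dict[str, str], subscriber: Callable[[Event], None]) -> None:
        # Each subscription pairs an exact-match attribute filter with a subscriber.
        self.triggers.append((attributes, subscriber))

    def publish(self, event: Event) -> int:
        """Deliver the event to every matching subscriber; return the delivery count."""
        missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
        if missing:
            raise ValueError(f"missing required attributes: {missing}")
        delivered = 0
        for attributes, subscriber in self.triggers:
            if all(event.get(key) == value for key, value in attributes.items()):
                subscriber(event)  # stands in for an HTTP request to the service
                delivered += 1
        return delivered


received: List[Event] = []
broker = ToyBroker()
broker.subscribe({"type": "example.page.edit"}, received.append)
broker.publish(make_event("1", "/example/enwiki", "example.page.edit", {"page_id": 42}))
broker.publish(make_event("2", "/example/enwiki", "example.page.delete", {"page_id": 7}))
```

The subscriber only ever sees whole events that matched its filter. Nothing here correlates or aggregates across events, which is why this model is a poor fit for CEP.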

Use Case Considerations
These technologies are on a spectrum of more complex and featureful, to simple and less featureful, with Flink being the most complex and Knative Eventing the simplest.

Given the use cases we are considering, at times we will need complex stream processors (e.g. WDQS updater, diff calculation, database connectors), and at others, a simpler and language agnostic event driven application framework (event enrichment, change prop, job queue). We'd like to make a 'platform' that makes it easy for engineers to build stream based applications. Sometimes those applications will be about complex data state sourcing and transformation problems, and other times they will be for triggering actions based on events. Attempting to support those different types of uses with the same technology may not be the right decision.

We should keep this in mind, and try to place incoming use cases into one of 2 categories: simple event driven applications, and complex stream processing.

Decision Record
Kafka Streams is easier than Flink for developing event driven applications, but less flexible than Knative Eventing, and less powerful and featureful than Flink. We eliminate it from further consideration based on this.

For the initial capabilities development and experiment, we choose Flink. This will allow us to get our hands dirty and investigate how we can use it to build platform capabilities to support our initial use cases, while considering future ones.

In the future, we may want to also support something like Knative Eventing for event driven feature products.

Supplementary Research
During analysis of the various solutions, specifically how each of them works in a multi-DC environment, Kafka Stretch was discovered as a potential solution to allow a single Kafka cluster to span multiple DCs.

Details of this additional evaluation can be found here

Phase 2: Creating the first service - T307959
Now that we are moving forward with Flink as a solution, the first service will consolidate existing streams, enrich messages with page content (wikitext, json, etc) and output to a new topic.

More details can be found here
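As a rough sketch of the shape of this first service, in plain Python rather than Flink: consolidate several input streams, enrich each event with page content, and write to an output. All names here (`consolidate`, `enrich`, `run`, the `dt` and `page_id` fields, and `FAKE_CONTENT`) are hypothetical. The sketch also assumes that "consolidate" means a time-ordered merge of streams that are each already ordered, and it uses a dictionary lookup in place of a real content fetch.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, Optional


def consolidate(*streams: Iterable[dict]) -> Iterator[dict]:
    """Merge streams that are each ordered by 'dt' into one ordered stream."""
    return heapq.merge(*streams, key=lambda event: event["dt"])


def enrich(
    events: Iterable[dict],
    fetch_content: Callable[[int], Optional[str]],
) -> Iterator[dict]:
    """Attach page content to each event without mutating the input."""
    for event in events:
        yield {**event, "content": fetch_content(event["page_id"])}


def run(
    streams: List[Iterable[dict]],
    fetch_content: Callable[[int], Optional[str]],
    output_topic: List[dict],
) -> None:
    """Consolidate, enrich, and append to a list that stands in for the new topic."""
    for event in enrich(consolidate(*streams), fetch_content):
        output_topic.append(event)


# Stand-in for a content lookup (assumption: a real service would call an API).
FAKE_CONTENT = {1: "wikitext for page 1", 2: "wikitext for page 2"}

creates = [
    {"dt": "2022-05-01T00:00:01Z", "page_id": 1, "kind": "create"},
    {"dt": "2022-05-01T00:00:04Z", "page_id": 2, "kind": "create"},
]
edits = [{"dt": "2022-05-01T00:00:02Z", "page_id": 1, "kind": "edit"}]

output: List[dict] = []
run([creates, edits], FAKE_CONTENT.get, output)
```

Keeping the content fetch behind a function argument makes the enrichment step easy to test with a fake, and easy to replace when the real lookup changes.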

As part of the POC work we also worked on tooling to make consuming existing event platform streams easy, see here.

MILESTONE: Demo (see video here)

Phase 3: Building on Flink Learnings and the POC Service
To be groomed and defined: