Platform Engineering Team/Event Platform Value Stream/Build simple stateless service using Flink SQL

This page summarizes the learnings of https://phabricator.wikimedia.org/T318856

[SPIKE] Build simple stateless service using Flink SQL
Author: Gabriele Modena 

Bug: https://phabricator.wikimedia.org/T318856

To simplify the process of creating stateless streaming applications on Event Platform, this SPIKE investigated using Flink SQL to implement a near real-time enrichment data pipeline. I implemented a SQL service that:

 * Listens to mediawiki.revision-create or another existing Kafka topic
 * Makes a call to the MW Action API and extracts the wikitext associated with a revision id
 * Produces output that combines the data

The whole logic is contained in a single SQL script (https://gitlab.wikimedia.org/-/snippets/37) that can be executed via Flink's sql client with:

sql-client.sh -f flink-http-action-connector.sql

A demo can be viewed at flink-http-action-connector.

Evaluation
+ A platform engineer can implement simple enrichment applications as SQL transformations.

+ No knowledge of Flink is required.

+ Rich SQL semantics (e.g. windowing) https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/overview/

+ Flink SQL is interoperable with Python and JVM UDFs. We can extend the simple use case by embedding logic in Python and Java/Scala.
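As a sketch, a Python function can be registered and invoked directly from SQL. The module, function, and table names below are hypothetical:

```sql
-- Register a Python UDF defined in a (hypothetical) module shipped with the job:
CREATE TEMPORARY FUNCTION normalize_wikitext
  AS 'enrichment.udfs.normalize_wikitext'
  LANGUAGE PYTHON;

-- Use it like any built-in function:
SELECT rev_id, normalize_wikitext(wikitext) FROM enriched_revisions;
```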

- SQL can be hard to maintain (e.g. test)

- The custom connector cannot be queried directly in SELECT statements, only in JOINs. This is inherent to how lookup semantics work.
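To illustrate the restriction (table and column names here are hypothetical), the connector participates in a processing-time lookup join rather than a plain scan:

```sql
-- A plain scan is NOT supported by a lookup connector:
-- SELECT * FROM action_api;   -- fails
--
-- Instead the table is queried per input row via a lookup join
-- (assumes revision_create declares a processing-time attribute proc_time):
SELECT r.rev_id, a.response
FROM revision_create AS r
JOIN action_api FOR SYSTEM_TIME AS OF r.proc_time AS a
  ON CAST(r.rev_id AS STRING) = a.revids;
```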

- While SQL is simple for a user to write, our team should still operate a Flink cluster.

- Without a Catalog, SQL applications contain a lot of redundancy: we need to declare schemas for every topic/endpoint we want to work with. For SQL applications to be viable, we must implement such a Catalog atop eventstreams and eventutils. Initial work suggests this is feasible, but more grooming and scoping is required.
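For example, without a Catalog every topic needs a hand-maintained DDL declaration along these lines (field names, broker address, and options are illustrative, not the actual event schema):

```sql
CREATE TABLE revision_create (
  rev_id        BIGINT,
  rev_timestamp STRING,
  domain        STRING,
  proc_time AS PROCTIME()  -- processing-time attribute, needed for lookup joins
) WITH (
  'connector' = 'kafka',
  'topic' = 'eqiad.mediawiki.revision-create',
  'properties.bootstrap.servers' = 'kafka-broker:9092',
  'format' = 'json'
);
```

Repeating this boilerplate for every topic and endpoint is exactly the redundancy a Catalog would eliminate.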

Considerations and follow up work
I think there is merit in exploring SQL further, especially in combination with python UDFs. Follow up work should address:


 * 1) How hard will it be to operate a Flink cluster? Onboarding on DSE k8s should help us better scope this concern.
 * 2) Implement a Catalog to automatically expose known kafka topics to SQL applications.
 * 3) Can we have an openapi spec to automatically generate json schema for our endpoints? Can we decorate such endpoints to facilitate lookup join semantics?

mediawiki-http flink connector
Flink has an interface that implements lookup join semantics. Join semantics require a set of keys that are present in all relations. Some response content (e.g. from the Action API) might violate join semantics: it might not contain a required key. Existing http connectors assume that a table schema matches the content of a REST response, and were not suitable for our use case. For demo purposes I implemented a connector that asynchronously queries http endpoints, and used it in an enrichment pipeline that queries the Action API to retrieve wikitext for a given revision.

A table using this connector must follow these semantics:

1. If present, a domain field is used to set the Host header field in a request.

2. A response content field must be present (we need it to store the API response as a string).

All other schema fields will be used as parameters to the query string.

Example: when used in a JOIN ON revids statement, the table will: 1. send a query to https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&formatversion=2&rvprop=content&rvslots=main&revids= 2. store the response content in the response field.
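Putting the rules together, a table backed by this connector might be declared as follows. The connector name and options are illustrative; only the domain/response conventions are fixed by the semantics above:

```sql
CREATE TABLE action_api (
  domain   STRING,   -- rule 1: sets the Host header of the request
  revids   STRING,   -- any other field becomes a query string parameter
  response STRING    -- rule 2: holds the raw API response body
) WITH (
  'connector' = 'mediawiki-http',   -- demo connector name (illustrative)
  'url' = 'https://en.wikipedia.org/w/api.php'
);
```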

A simple stateless SQL application
An enrichment pipeline can be expressed as a single query: here we use Flink's JSON function capabilities (JSON_VALUE) to parse the response content and extract the wikitext.
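A sketch of such a pipeline, assuming hypothetical tables revision_create (a Kafka source with a processing-time attribute proc_time) and action_api (a table backed by the lookup connector); all names are illustrative, and the actual script is the gitlab snippet linked above:

```sql
INSERT INTO enriched_revision_create
SELECT
  r.rev_id,
  r.domain,
  -- Parse the Action API JSON response and extract the wikitext.
  JSON_VALUE(
    a.response,
    '$.query.pages[0].revisions[0].slots.main.content'
  ) AS wikitext
FROM revision_create AS r
JOIN action_api FOR SYSTEM_TIME AS OF r.proc_time AS a
  ON r.domain = a.domain
 AND CAST(r.rev_id AS STRING) = a.revids;
```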