Platform Engineering Team/Event Platform Value Stream/Build simple stateless service using PyFlink

This page summarizes the learnings of https://phabricator.wikimedia.org/T318859

User Story
As a platform engineer, I need to build a simple stateless service that takes an input stream, transforms and enriches it, and produces an output using PyFlink

The service should:


 * Listen to  or another existing Kafka topic
 * Make a call to the MW Action API
 * Produce some output that combines the data

Is this a good abstraction for event-driven data producers to create similar services easily?

TL;DR

 * PyFlink by itself is in some areas more burdensome to use than regular Flink
 * Because PyFlink is just a thin wrapper around Flink, using our existing codebase basically means writing Java without type hints, with the added overhead of converting Java types into Python types
 * However, if we make wrappers for our existing codebase, it becomes much more bearable... for the users. *If* we make wrappers
 * The one major advantage is its ability to easily implement UDFs that can be used in both PyFlink and Flink SQL

Pros

 * Python is more familiar to developers
 * Easier to get started; just pip install apache-flink
 * Development is easier. No need to rebuild jars. No need to submit jobs to a cluster just to test it out
 * Can interop with our existing Java codebase
 * Supports Pandas

Cons

 * Need to know Flink
 * There is not a Python equivalent for every Java function, and it's unclear whether the missing pieces are intentional omissions or simply still in development
 * Gets more complicated the second you want to interop with Java
 * No type hints
 * Need to convert between Java/Python types

If we want the easiest developer experience, it might involve making a library of custom UDFs and Flink SQL connectors so people don't have to touch Flink at all.

Datastream API
Developers can define UDFs by extending one of PyFlink's  classes. Developers can also use third-party Python libraries within their UDFs, but must specify the dependencies when executing the jobs on a remote cluster.

Here's an example of a map function in PyFlink that takes a  and returns the images on that page:
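A minimal sketch of such a map function, assuming the input elements are plain page titles and the enrichment call targets the English Wikipedia Action API endpoint. The helper names, the endpoint, and the in-memory source (standing in for the Kafka topic) are all illustrative; call run() to execute the pipeline locally.

```python
import requests  # third-party dependency; must be shipped with the job on a remote cluster

MW_API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for this sketch


def parse_image_titles(api_response):
    """Pull the image file names out of an Action API prop=images response."""
    pages = api_response.get("query", {}).get("pages", {})
    return [img["title"]
            for page in pages.values()
            for img in page.get("images", [])]


def fetch_images(page_title):
    """Map function: enrich a page title with the images on that page."""
    resp = requests.get(MW_API, params={
        "action": "query",
        "titles": page_title,
        "prop": "images",
        "format": "json",
    })
    return page_title, parse_image_titles(resp.json())


def run():
    # Local pipeline: a small collection stands in for the Kafka source.
    from pyflink.common import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.from_collection(["PyFlink", "Apache Flink"], type_info=Types.STRING()) \
        .map(fetch_images) \
        .print()
    env.execute("stateless_datastream_example")
```

Because fetch_images is a plain Python callable, it can be passed straight to DataStream.map without subclassing, which keeps the HTTP logic testable outside Flink.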

Sample Output:

Full Example

Table API
The Table API allows you to create UDFs that mimic the datastream UDFs. However, the return type has to be one of  since it integrates with SQL.

Here's an example of a UDF that does the equivalent of the datastream example:
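A sketch of the Table API equivalent, again assuming page titles as input and the English Wikipedia Action API; the function and table names are illustrative. The plain Python function is wrapped with udf() and given a SQL-compatible result type (ARRAY&lt;STRING&gt;); call run() to execute it locally.

```python
import requests  # imported at module level so the UDF can use it on workers


def parse_image_titles(api_response):
    """Pull the image file names out of an Action API prop=images response."""
    pages = api_response.get("query", {}).get("pages", {})
    return [img["title"]
            for page in pages.values()
            for img in page.get("images", [])]


def image_titles(page_title):
    """Same Action API lookup as the datastream example, as a plain function."""
    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "titles": page_title,
        "prop": "images",
        "format": "json",
    })
    return parse_image_titles(resp.json())


def run():
    from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
    from pyflink.table.expressions import col
    from pyflink.table.udf import udf

    # The result type must be a SQL DataType, hence ARRAY<STRING> rather than
    # an arbitrary Python object.
    get_images = udf(image_titles,
                     result_type=DataTypes.ARRAY(DataTypes.STRING()))

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    # Registering the UDF also makes it callable from SQL as get_images(...)
    t_env.create_temporary_function("get_images", get_images)

    pages = t_env.from_elements([("PyFlink",)], ["page_title"])
    pages.select(col("page_title"), get_images(col("page_title"))) \
         .execute().print()
```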

Sample Output:

Full Example

Flink SQL + Python UDF
You can take the UDFs created for the Table API and load them into Flink SQL. You can also continue to use external Python libraries by importing them within the UDF. However, unlike running a PyFlink job locally, running the UDF requires submitting it to a remote cluster, so you need to ship a virtual environment with all the dependencies the UDF needs to run.

Packaging a Virtual Env
1. Create the virtual environment

python3 -m venv pyflink-venv

We don't use conda because it's heavy and has a very long cold-start time. Stacked environments can also cause confusion, since all dependencies must be packaged together. Even then, there *will* be some lag every time you execute a query with a UDF, since Flink has to unzip and initialize the virtual environment.

2. Activate the environment

source pyflink-venv/bin/activate

3. Install the required dependencies

pip3 install wheel
pip3 install apache-flink==1.15.2
# Other dependencies go here

4. Zip up the files

cd pyflink-venv
zip -r pyflink-venv.zip ./*

5. (Optional) Move the archive to a more convenient location

mv pyflink-venv.zip ../pyflink-venv.zip

6. Export the Hadoop config

export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
export HBASE_CONF_DIR=/etc/hbase/conf

7. Start the Flink cluster (assuming you already have Flink downloaded and extracted)

cd flink-1.15.2
./bin/start-cluster.sh

8. Start the Flink SQL client and point it at the packaged virtual env for UDFs

./bin/sql-client.sh \
  -pyarch file:///home/path/to/pyflink-venv.zip \
  -pyexec pyflink-venv.zip/bin/python3 \
  -pyclientexec pyflink-venv.zip/bin/python3 \
  -pyfs ../stateless_table.py

9. Remember to stop the cluster when done

./bin/stop-cluster.sh

If you get an error executing the Python UDF, you might need to manually link the Flink Python jar:

wget https://repo1.maven.org/maven2/org/apache/flink/flink-python_2.12/1.15.2/flink-python_2.12-1.15.2.jar

And then add this flag to the command in step 8: -j ../flink-python_2.12-1.15.2.jar

Example
Here's an example that uses the UDF from the above example:
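A sketch of what the SQL side could look like, assuming stateless_table.py (shipped via -pyfs in step 8) defines a get_images UDF as in the Table API example, and that a page_events table with a page_title column already exists; both names are illustrative.

```sql
-- Register the Python UDF from the file passed with -pyfs
CREATE TEMPORARY FUNCTION get_images
  AS 'stateless_table.get_images' LANGUAGE PYTHON;

-- Enrich each event with the images on the page
SELECT page_title, get_images(page_title) AS images
FROM page_events;
```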

Sample Output:

WIP Example