Platform Engineering Team/Event Platform Value Stream/Event Catalog

This page documents the initial prototype of the Event Catalog for Apache Flink.

The Event Catalog in Wikimedia Event Utilities provides an easy way to query Wikimedia's Kafka topics with SQL-like syntax for stream and batch processing. It validates schemas and automatically normalizes  and   fields.

Getting Started
(Assuming you already have Apache Flink installed)
 * Package versions in the examples here may change

1. Build Event Utilities from this patch (or from main, if the patch has been merged) to get

2. Download

3. Download

4. Start Flink's SQL client with these libraries. In this example they're all in a  folder.

4a. If you're inserting, also start the Flink cluster beforehand.

5. Create the catalog

6. Use the catalog

7. Check that you can query the Kafka topics

Catalog Options
To create the catalog, you need to provide it with some default options.

Table Options
Tables within the catalog require some custom options in addition to the ones needed for the connector and format.

Limitations

 * When you create a table from scratch, you must use a  column (see examples)
 * When inserting, all columns (besides $schema and meta) must be present for it to succeed. (See T328211)
 * You cannot directly insert into a catalog-provided table.
 * You cannot alter the schema or its version after a table is created.
 * To use a table with a schema version other than latest, you must create the entire table from scratch.

= Internals =

Validation
Validation happens in three layers. First, the  validates catalog options. Next, the * validates table options, and finally the   validates format options.

Because of this cascading validation, some invalid options are not caught when a table is declared and are only caught at runtime when it is queried. The tests rely on this behavior, so DDL statements in the tests should not be treated as usable code.
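The deferred failure described above can be illustrated with a small model. This is only a sketch, not the catalog's actual code: it assumes table and format options are checked when a table is first queried, it leaves out the catalog-level layer, and the option names and the `example-format` identifier are hypothetical.

```python
class OptionValidationError(ValueError):
    """Raised when a validation layer sees an option it does not recognise."""


def check_keys(layer, options, allowed):
    """Reject any option key that the given layer does not know about."""
    unknown = sorted(set(options) - set(allowed))
    if unknown:
        raise OptionValidationError(f"{layer}: unknown options {unknown}")


# Hypothetical option names, for illustration only.
FORMAT_PREFIX = "example-format."
TABLE_KEYS = {"connector", "format"}
FORMAT_KEYS = {"example-format.some-flag"}


def declare_table(name, options):
    """Declaring a table only stores its options; nothing is validated yet."""
    return {"name": name, "options": dict(options)}


def query_table(table):
    """Querying runs the table-level check first, then the format-level check."""
    options = table["options"]
    format_options = {k: v for k, v in options.items() if k.startswith(FORMAT_PREFIX)}
    table_options = {k: v for k, v in options.items() if k not in format_options}
    check_keys("table", table_options, TABLE_KEYS)
    check_keys("format", format_options, FORMAT_KEYS)
    return f"rows from {table['name']}"


# The misspelled format option is accepted at declaration time...
bad_table = declare_table(
    "t",
    {"connector": "kafka", "format": "example-format", "example-format.typo": "1"},
)

# ...and only rejected once the table is queried.
try:
    query_table(bad_table)
except OptionValidationError as err:
    print(err)
```

In this model the declaration succeeds and the error surfaces only on the first query, which is why DDL that merely parses is not evidence of a usable table.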


 * refers to both  and

Meta-Definitions
Because the catalog manages options for both the table and the format, additional meta-definitions are needed to describe certain options. These are conceptual definitions only; they are not currently defined explicitly in code.

Pseudo-Table Options
Some options are provided when declaring a table, but used in the catalog. These options are not passed down when creating the table.
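As a rough sketch of this behavior, the catalog can be thought of as splitting the declared options into those it consumes itself and those it passes down. The key `example.pseudo-option` and the function name are hypothetical and stand in for the real pseudo-table options, which are not reproduced here.

```python
# Hypothetical pseudo-table option name, for illustration only.
PSEUDO_TABLE_KEYS = {"example.pseudo-option"}


def split_pseudo_options(declared_options, pseudo_keys=PSEUDO_TABLE_KEYS):
    """Return (options consumed by the catalog, options passed down to the table)."""
    consumed = {k: v for k, v in declared_options.items() if k in pseudo_keys}
    passed_down = {k: v for k, v in declared_options.items() if k not in pseudo_keys}
    return consumed, passed_down


consumed, passed_down = split_pseudo_options(
    {"connector": "kafka", "example.pseudo-option": "value"}
)
print(consumed)     # options the catalog uses itself
print(passed_down)  # options that reach the created table
```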

Some pseudo-table options are only applicable for  and not.

Pseudo-table options include:

Override Options
Options that are set by the catalog can be overridden by providing them as table options. However, this does not mean that the options are necessarily passed down to the table.

Some override options are only applicable for  and not.

These options must be resolved with two fallbacks: first check whether the option is set on the table, then whether it is set on the catalog, and finally fall back to the default value defined in its. This means that these options do not perform their expected behavior within a. These defaults are currently handled on a case-by-case basis; however, it might be worth creating a dedicated  to handle them.
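The lookup order described above (table, then catalog, then default) can be sketched as follows. This is a minimal illustration, not the catalog's implementation; the function name and the key `example.option` are hypothetical.

```python
from typing import Mapping, Optional


def resolve_override_option(
    key: str,
    table_options: Mapping[str, str],
    catalog_options: Mapping[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Resolve an override option: table first, then catalog, then the default."""
    if key in table_options:
        return table_options[key]
    if key in catalog_options:
        return catalog_options[key]
    return default


# Hypothetical option values, for illustration only.
catalog_options = {"example.option": "from-catalog"}

# Set on the table: the table value wins.
print(resolve_override_option("example.option", {"example.option": "from-table"}, catalog_options))
# Not set on the table: the catalog value is used.
print(resolve_override_option("example.option", {}, catalog_options))
# Set nowhere: the option's own default is used.
print(resolve_override_option("example.option", {}, {}, default="built-in"))
```

A single helper of this kind is one way to avoid repeating the fallback logic for each option.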

Override options include:

Shared Options
The catalog and our custom format factory are strongly coupled, but we allow any connector and therefore any. This means that the catalog has to bypass the validation done by an unknown table factory so that the options can reach the format factory.

Flink handles this by prefixing options with the identifier of the format factory, so the catalog does this automatically for options applicable to our  format. As a result, the saved table contains only the prefixed option.
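The prefixing step can be sketched as below. The identifier `example-format` and the option `some-flag` are hypothetical placeholders, and the function is an illustration of the idea rather than the catalog's code.

```python
def prefix_shared_options(options, format_identifier, shared_keys):
    """Prefix shared options with the format identifier; leave the rest untouched.

    The unprefixed key is dropped, so only the prefixed form remains
    in the options that would be saved with the table.
    """
    processed = {}
    for key, value in options.items():
        if key in shared_keys:
            processed[f"{format_identifier}.{key}"] = value
        else:
            processed[key] = value
    return processed


# Hypothetical names, for illustration only.
declared = {"connector": "kafka", "format": "example-format", "some-flag": "true"}
processed = prefix_shared_options(declared, "example-format", {"some-flag"})
print(processed)
```

Because the option now carries the format's prefix, a table factory that does not know the option can hand it on to the format factory instead of rejecting it.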


 * Input:


 * Processed Options:

Shared options include: