Project:Sandbox

About
On a regular cadence, the API Platform Value Stream team will post demos of our developments and works in progress here to provide transparency and gather feedback.

Demo Sessions
 Note:  You'll need to be signed in with your WMF account to view these videos.

2021-08-05
Recording Link

2021-06-09
Recording Link

Notes/Q&A

 * Next steps/followup items:
 * Setting up time with Product Analytics to start walking through this, giving them access to test it and generate feedback. This is not a "hard commit" but we do need something we can poke holes in, versus talking about intangibles.
 * Talking about sanitization by fields with data engineering & adding to our backlog.
 * Drafting documentation & communications.
 * Looking at potential other events we would want to add as samples. We have a short list that Mikhail has helped inform, and will also touch base with Maya on this.


 * With bespoke data, is it possible to add new content to JSON blobs or bespoke dimensions map without changing the schema?
 * Yes - that's the goal. Each column must exist and have a type, and that type can't change, but we want the values to be able to change while staying structured. So in the string case, it's really a piece of structured data that's been serialized into a string. To work on it as a piece of structured data, you would un-serialize the string.
 * In cases with dimensions and measures, where you've chosen a type in advance, there's flexibility with adding new keys and changing properties. There's an essential amount of freedom to vary the content. With bespoke data we would have standard fields, and a "flexibility area" which allows us to do things engineering-wise that we can't do today. All of Metrics Platform's events fundamentally have the same shape/same schema, even though they're structured differently and we have different quality controls/processes in place. That allows us to do nice things with data integration, have different types of events in the same table, and allows for a sandbox for "non-standardized" data. We can also make changes as often as we want, and have backwards compatibility.
 * In other words: changes to what bespoke data instruments collect will not require event schema changes.
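To make the "serialized string" idea concrete, here is a minimal Python sketch. It is an illustration only, not the Metrics Platform implementation: the column name `custom_data`, the event name, and the helper functions are all hypothetical. It shows that adding a new bespoke key changes the content of the string column but not the set of columns or their types.

```python
import json

# Hypothetical event shape: a few standard fields plus one string column
# ("custom_data" is an assumed name) holding serialized structured data.
def make_event(name, bespoke):
    return {
        "name": name,
        "dt": "2021-06-09T00:00:00Z",
        "custom_data": json.dumps(bespoke, sort_keys=True),
    }

def read_bespoke(event):
    # Un-serialize the string to work on it as structured data again.
    return json.loads(event["custom_data"])

# The instrument starts collecting an extra key ("editor") later on.
old_event = make_event("button_click", {"button": "save"})
new_event = make_event("button_click", {"button": "save", "editor": "visual"})

# Same columns and same column types: no event schema change is needed.
same_columns = set(old_event) == set(new_event)
same_types = all(type(old_event[k]) is type(new_event[k]) for k in old_event)
print(same_columns, same_types, read_bespoke(new_event)["editor"])
```

The tradeoff raised later in these notes is visible here too: every consumer has to call something like `read_bespoke` before it can use the values.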


 * Could a bespoke dimension be, for example, a string/enumeration of possible values, and could we add validation on top of that? Also, can we still use an allow list for sanitizing fields that can contain any field inside of them? Should we be careful with that?
 * Depending on what we define as the granularity level, you could sanitize at the field level (e.g., if we have a bespoke dimension, can I tell it which bespoke dimension to sanitize?), and you would probably want to. We're unsure whether that would present problems - we would need to look and see.
 * In terms of validation, enumerated values are probably the ones that will come up the most. Will it be really annoying to query these because of the way they're packed up? The answer seems to be "no," so the next question would be how do we build systems to accommodate validations/quality checks for bespoke data that is not necessarily structured?
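As a rough sketch of what the two ideas above could look like, the Python below validates an enumerated bespoke dimension and applies a field-level allow list. None of this is an existing tool: the dimension names, the allowed values, and both helper functions are assumptions made for illustration.

```python
# Hypothetical enumerations: bespoke dimension name -> permitted values.
ALLOWED_VALUES = {
    "editor_interface": {"visualeditor", "wikitext"},
}

# Hypothetical allow list: only these bespoke dimensions survive sanitization.
SANITIZE_ALLOWLIST = {"editor_interface", "action"}

def validate(dimensions):
    """Return a list of problems found in enumerated dimensions."""
    errors = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in dimensions and dimensions[key] not in allowed:
            errors.append(f"{key}: unexpected value {dimensions[key]!r}")
    return errors

def sanitize(dimensions):
    """Keep only the bespoke dimensions named in the allow list."""
    return {k: v for k, v in dimensions.items() if k in SANITIZE_ALLOWLIST}

dims = {"editor_interface": "wikitext", "action": "save", "free_text": "hello"}
print(validate(dims))
print(sanitize(dims))
```

An allow list errs on the side of dropping unknown keys, which fits the caution voiced in the question: a container field that can hold anything keeps only what has been explicitly approved.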


 * Writing queries and assessing whether the data is easy to work with is still untested. One possible concern is having to write queries that are mostly JSON extract function calls, or mostly casts from string to various data types, whereas now we can read top-level fields directly.
 * We can come in with many different levels of flexibility. Right now we know there are a lot of issues with creating new instrumentations & updating them. One school of thought is that the more flexibility you have in how you capture data, the harder it is to query it and make sense of it, versus making it easy to use downstream. We could come to a point in the middle, where we define bespoke data more precisely and iterate on tools that enforce certain fields and flatten them out so you can query them like you can today. The tradeoff is there's less flexibility (e.g., if you rename a field, it's a backward-incompatible change which results in a new schema and you're starting fresh on the data side).
 * These are the kinds of "gives and takes" in terms of flexibility in consumption vs production - this tool will let us experiment and start getting some feedback from the team so we can figure out where we draw the line on flexibility. It's a big spectrum.
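The "point in the middle" described above could be sketched as a small flattening step that enforces a few fields and promotes them to top-level columns. This is a hypothetical illustration in Python, not a planned tool: the `custom_data` column name, the required fields, and the `flatten` helper are all assumed.

```python
import json

# Hypothetical contract: bespoke fields that must be present, with their types.
REQUIRED_FIELDS = {"action": str, "duration_ms": int}

def flatten(event):
    """Promote enforced bespoke fields to top-level columns.

    Bespoke keys outside REQUIRED_FIELDS are dropped from the flat row.
    """
    bespoke = json.loads(event["custom_data"])
    row = {k: v for k, v in event.items() if k != "custom_data"}
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in bespoke:
            raise KeyError(f"missing required bespoke field: {field}")
        if not isinstance(bespoke[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
        row[field] = bespoke[field]
    return row

event = {
    "name": "edit_attempt",
    "custom_data": json.dumps({"action": "save", "duration_ms": 120, "note": "x"}),
}
flat_row = flatten(event)
print(flat_row)
```

Once rows look like `flat_row`, queries can read `action` and `duration_ms` directly instead of calling JSON extract functions. The cost is the one noted above: renaming a required field breaks the contract for existing data.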


 * As an example, the Visual Editor uses another schema in combination with EditAttemptStep - is the vision to have a standardized schema with all the data in one place?
 * We aren't colocating different instrumentations in the same tables. The goal for migration is to keep having the data propagate into their own separate tables - we wouldn't merge them into the same table.
 * Part of the advantage of structuring it this way is that it makes data integration easier. Visual Editor feature use is also quite a busy schema. We can certainly write a query to see what it would be like to work with those.


 * Are we working on getting the PMs on Product Engineering on board?
 * Yes, PM Directors have been informed - there are some other migrations happening (e.g., Vue.js) and we're negotiating timelines. Documentation around roles & responsibilities will come.


 * This migration would come later in the fiscal year, correct?
 * Once we release the MVP of the Metrics Platform in Q1, we'll want to do a "steady trickle" in terms of migrating instrumentations - do one, see how it goes, and see what kinks there are to work out. Then do another, etc. This is where we need to coordinate time across Product Analytics to update queries, analytics, and visualizations.


 * What about documentation? If we are expecting schemas to change in terms of structure, it would be good to call that out so analysts can refer to new definitions & how to query them. Will that still be in the Modern Event Platform like we have now on GitHub, or will that change?
 * We are thinking about the right way to surface this - GitHub is probably not the best place for non-technical consumers of this information. All of the schemas and systems for writing them will probably still live in a repository, but we can also imagine some nice APIs/visualizations on top of that. Maybe there's an integration with whatever data governance/data lineage software we use that is able to find the schemas using StreamConfig, or we move StreamConfig into the data lineage system.
 * We'll dive more into this user experience & discoverability piece once we have the foundation in place.