Extension:WikiLambda/Jupyter kernel proposal

''I've been doing some research on interpreters and debuggers which brought me right back around to Wikifunctions and how to implement them using multiple languages in a reasonable way that also lets function developers introspect their code as it runs. --Brion Vibber (WMF) (talk) 02:09, 26 February 2021 (UTC)''

There's a few pieces I'm looking at for the Wikifunctions backend and frontend story:
 * sandboxed function evaluator that supports shoving arbitrary sources into popular programming languages and returning output
 * preferably that also supports single-step debugging where the runtime supports it
 * a debugger UI that integrates well into MediaWiki frontend
 * being able to use the same debugger UI for client-side interpreters (for other projects like interactive diagrams)

Jupyter
I'm inclined to look further at Jupyter for the backend and the debug protocol; it's a project with years of experience sandboxing scripting languages for interactive execution from web pages.

A few things Jupyter does well:
 * already handles languages like Python (currently Python and C++ are the only languages with debugging support, which is new)
 * already deals with isolation/safety issues because it's often exposed to the web
 * has a pub-sub protocol between the kernels and the frontends, which we can probably adapt to allow code to invoke additional functions in a new kernel

How it generally works at the runtime/execution level:
 * "kernel" manages the programming language runtime, creating an interpreter context for each invocation
 * sends messages back and forth to management & frontends over the Jupyter protocol
 * to support debugging, kernel interfaces with the runtime's internal debugging API and exposes events, sources, and data
 * somewhere in all this there should be a container or chroot boundary and various safeties. need to confirm we understand how the pieces work.

The frontend in a Jupyter notebook would be replaced by the service layer we expose to MediaWiki, allowing functions to be invoked for use in a wiki page rendering or for an interactive debugging session. For the latter, we'd want a debugger frontend that integrates well with MediaWiki, showing source code and variables and letting you set breakpoints and such.

Performance
Runtime performance concerns include:
 * there will likely be some constant latency to spinning up a new execution environment for each function invocation.
 * reusing a running kernel for multiple invocations would leak global state across calls with different parameters, so isn't safe. ;_;
 * source parsing/compilation may be slow on larger modules. some language runtimes may be able to cache compiled bytecode, which might reduce this cost somewhat if we can invoke that path by storing source on the filesystem and then importing it from injected source
 * Sending large data sets between functions will be slow, incurring serialization/deserialization time. Best practices for function authors should be to send references to data sources around when possible instead of raw data blobs.
 * Specific support for passing large buffers without a copy would be neat, but might be complex and probably won't work with complex data types well.

Languages like JS that use JIT compilation and do on-stack replacement will still have a chance to optimize code that runs long loops -- a Mandelbrot fractal generator would run reasonably fast, for instance, but this compilation would happen every time the fractal function was invoked. So better to call once and return a buffer than call once per pixel!

Debugging
If we wanted, we could potentially rework Scribunto on this framework and allow interactive debugging of Lua modules. Something to consider.

The debugger UI would also be useful for client-side widgets using a sandboxed interpreter, which I'm researching for a future project. The interpreter would need to expose the low-level Jupyter debugging API, and the debugger would just connect to a virtual event bus instead of a WebSocket.

Async code
It occurs to me that this system would allow calls to other functions to proceed as non-blocking async RPC calls rather than blocking for a reply.

For JS code, this would allow using  functions and the   operator, or  s with callbacks; other languages have similar structures. This might be useful for getting some parallelism, but that's also dangerous -- if you run ten thousand invocations to a sync function in a loop you've only fired up one kernel which runs one function in sequence, but if you fire off ten thousand async calls in a row and only later wait on them, you're going to instantly fill the function execution engine queue.

For this reason I would recommend either using synchronous function invocation only, or very carefully defining how much processor time and how many simultaneously active subprocesses can be spawned to reduce the impact of fork-bombs.

Wikidata queries and query building
Let's say we have two functions, one which returns a list of Q references based on a CSV file in Commons, and another which takes that list and filters it manually based on presence of some property:

This at least avoids round-tripping for every item in the filter, but if we were going to use this list and either pop it back into Wikidata to fetch some properties for display, or add some more filters to the query to avoid transferring data we didn't need, it might be nice to do a single round-trip and avoid having to send the long list of IDs to and from the query server multiple times. Especially if one were to refactor the data-file provider to drive out of Wikidata or another database.

Might be worth thinking about good interfaces for reading portions of large files, making queries of large CSV or RDF data sets, and being able to compose the filtering to minimize roundtrips while remaining both ergonomic and performant in a multi-language, multi-process RPC scenario!

Idempotency and non-determinism
The general idea of Wikifunctions is to have idempotent, deterministic functions that are based on their input state. But in practice, there are likely to be many sources of non-determinism which need to either be shrugged at and left as-is, plugged/neutered, or taken into proper account with a suitable cache invalidation system.

Sources of non-determinism:
 * language features that return information about the outside world, like
 * the state of the language runtime itself, like checking for a feature that isn't available yet in the version of the kernel that's running when the function is deployed
 * reading a sandboxed filesystem, if the language allows it, might vary over time depending on the container details
 * reading data from a service, such as loading a file from Wikimedia Commons or making a query to Wikidata

The last (using services that we provide) are the easiest to plan for, because we'll control the API for accessing them and can treat the state of the world as an input to the function for caching purposes. A function invocation that caches its output after reading a Commons file could register a cache-invalidation hook for when that particular filename gets re-uploaded or deleted, which invalidates the cached function results and bubbles up to anything that cached data based on it.

World state might be avoidable by patching runtimes (eg to replace the JS Date constructor and Date.now etc with functions that return a constant stub value, or with functional versions that register a timeout-based cache invalidation hook) but this could be error-prone in that it's easy to miss something and end up with bad cached data.

Wikidata generally though is tricky to cache on, since many things could change your result set. I don't know how hard it would be to devise a generic system for creating cache invalidation hook requests from a query, but it sounds hard.

Pure function kernel and optimizations
Early planning thoughts on functions have centered on the pure application of functions as lists of calls with arguments to other functions, which essentially defines a very simple programming language based on call expressions, with no variables (but the promise of caching/memoization speeding multiple calls). Many functions may be implementable this way, especially those that mainly call other functions and filter their data.

Because there's no global state that can be modified, this avoids some of the complications of general programming language implementations for the special case of calling another function that's implemented in the same way -- the same running interpreter process can simply load up an additional function definition and execute it in-process, without having to transfer arguments and results through the Jupyter protocol.

Potentially the language runtime extensions that expose the Wikifunctions service could also implement the pure-function interpreter in-process within a JavaScript or Python or Lua function, avoiding RPC overhead. Something to consider.