Extension:WikiLambda/Jupyter kernel proposal


 * I've been doing some research on interpreters and debuggers which brought me right back around to Wikifunctions and how to implement them using multiple languages in a reasonable way that also lets function developers introspect their code as it runs. --Brion Vibber (WMF) (talk) 02:09, 26 February 2021 (UTC)

Wikifunctions is part of the Abstract Wikipedia family of projects, intended to create a programming library of functions, primarily aimed at processing and formatting data for display as raw values, tabular layout, or natural language description. There are a few pieces I'm looking at for the Wikifunctions backend and frontend story:
 * sandboxed function evaluator that supports shoving arbitrary sources into popular programming languages and returning output
 * that we can extend to launch calls to additional functions in other language runtimes, marshaling arguments
 * support stack traces with source, single-step debugging, and breakpoints where the runtime supports it
 * a debugger UI that integrates well into MediaWiki frontend
 * being able to use the same debugger UI for client-side interpreters (for other projects like interactive diagrams)

Jupyter
Jupyter is best known for the interactive "notebooks" system, where code fragments embedded in a web page are brought to life by executing them on the server and passing data to and from the web page, allowing interactive execution in a live programming environment using real programming languages and visualization of the resulting data.

I'm inclined to look at Jupyter specifically for the backend and the debug protocol; it's a project with years of experience sandboxing scripting languages for interactive execution from web pages.

A few things Jupyter does well:
 * already handles languages like Lua, Python, and JS (currently Python and C++ are the only languages with debugging support, which is new)
 * already deals with scaling and isolation/safety issues because it's often exposed to the web
 * has a pub-sub protocol between the kernels and the frontends, which we can probably adapt to allow code to invoke additional functions in a new kernel

How it generally works at the runtime/execution level:
 * "kernel" manages the programming language runtime, creating an interpreter context which can have source code fragments executed in it
 * sends messages back and forth to management & frontends over the Jupyter protocol to signal invocations, results, and errors
 * to support debugging, the kernel interfaces with the runtime's internal debugging API and exposes events, sources, and data
 * there's a lot of support for scaling via kubernetes clusters etc, so we should be able to create a reasonable scaling and isolation model

The frontend in a regular Jupyter notebook would be replaced by the service layer we expose to MediaWiki, allowing functions to be invoked for use in a wiki page rendering or for an interactive debugging session. For the latter, we'd want to make our own debugger frontend that integrates well with MediaWiki, showing source code and variables and letting you set breakpoints and such.

Alternately we could try to integrate Jupyter's debugger frontend into MediaWiki, which will likely require helping make some changes to that codebase to support our house style, localization interfaces, etc.

Two things Jupyter might not handle well:
 * Startup time of kernels might not be a current priority, since sessions are usually interactive and long-running
 * This can be mitigated by reusing kernels when the language allows doing this safely without mutable global state!
 * We already prefer avoiding mutable state to aid in both fault isolation and automatic caching when possible.
 * For JS with a restricted set of frozen globals, all functions for a given render session could run in one realm in one VM safely.
 * The message bus abstraction, and process and container isolation, mean we may not be able to have multiple kernels call into each other in-process
 * It's easiest to reuse existing language kernel implementations if we let each be its own docker image etc
 * Cross-language calls will require an RPC with a context switch and data passed through a socket
 * This can be mitigated by at least avoiding network round-trips by running all kernels for a synchronous call session on one server
 * Co-location on one server allows using shared memory segments to perform data transfer, reducing copy overhead and providing a caching abstraction to reduce re-transfers of the same data. This would be a side-channel from the Jupyter messaging bus, with a connection to the function service over a socket that allows transferring file descriptors along with the shared memory segments.
 * We could potentially create a simplified language that only invokes other functions, which could be interpreted in-process by our runtime from any language, with no RPC; however this creates additional complications as it's essentially another language kernel glued onto JS, Lua etc ones.

Synchronous RPC calls
For efficiency, language kernels might use direct socket calls for cross-language RPC calls rather than the Jupyter message bus, while the message bus is used to set up the individual kernels, pass code into them to run, and point them at each other.

It would be necessary to connect kernels directly or through a function service over a direct socket in order to send file descriptors with shared memory segments, which minimizes copy overhead versus pushing data through the socket.

When making a call into another Wikifunction across language kernels:
 * Caller's runtime sends an event over socket to the other kernel's runtime, with the target function (by export name or by closure reference) and a reference to the arguments in its serialization buffer (see the "Marshaling and serialization" section below)
 * Caller blocks waiting for a response...
 * In addition to return or error events, this may process incoming call requests from the other language kernels in the same session -- for callbacks or calls into other functions implemented in the same language.
 * Callee runs through to completion
 * Caller receives the return event with serialized result ref from the callee kernel
 * Tally up whether the cacheability state passed up the chain is intact
 * If cacheable, function service stores any cached data with suitable invalidation hooks registered
 * Caller unblocks, translates the return value to a native programming language value, and returns.
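A minimal sketch of that blocking loop, with hypothetical message-type names standing in for whatever the real protocol events are called:

```python
def call_remote(send, recv, handlers, func_name, args):
    # Blocking cross-kernel call: send a call request, then service
    # incoming call requests (callbacks into our own kernel) until
    # our own return or error arrives.
    send({"type": "call", "target": func_name, "args": args})
    while True:
        msg = recv()
        if msg["type"] == "return":
            return msg["value"]
        if msg["type"] == "error":
            raise RuntimeError(msg["message"])
        if msg["type"] == "call":
            # Re-entrant callback: run the requested local function
            # and answer before resuming our wait.
            value = handlers[msg["target"]](*msg["args"])
            send({"type": "return", "value": value})
```

Note the loop never returns control to the caller's code while waiting, which is what keeps the model synchronous even when callbacks bounce between kernels.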

Async code
It occurs to me that this system would allow calls to other functions to proceed as non-blocking async RPC calls rather than blocking for a reply.

For JS code, this would allow using async functions and the await operator, or Promises with callbacks; other languages have similar structures. This might be useful for getting some parallelism, but it's also dangerous -- if you run ten thousand invocations of a database query function in a loop you're only running one at a time, but if you fire off ten thousand async calls in a row and only later wait on them, you're going to have some performance problems!

For cross-process invocations, there's also the possibility of each invoked language kernel running at full CPU thus using more CPU than any single-language function environment would -- and if it's possible to fire up additional kernels at runtime, this could balloon to a large number of processes.

For this reason I would recommend either using synchronous function invocation only, or very carefully defining how much processor time and how many simultaneously active subprocesses can be spawned to reduce the impact of fork-bombs.

At a minimum, limiting each function to one event loop would be wise, but this still allows consuming one CPU core and a bunch of memory for each distinct function invoked; throttling how many new functions can be invoked at a time might also be worthwhile, though enforcing that in a consistent way could be hard.
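If async invocation were allowed at all, one mitigation is a hard cap on in-flight calls. A sketch using Python's asyncio (the `invoke` callable and the limit are placeholders):

```python
import asyncio

async def invoke_all(invoke, calls, max_concurrent=4):
    # Fire off many function invocations, but cap how many run at
    # once so an async fan-out can't balloon into thousands of
    # simultaneously active kernels.
    sem = asyncio.Semaphore(max_concurrent)

    async def limited(name, args):
        async with sem:
            return await invoke(name, args)

    return await asyncio.gather(*(limited(n, a) for n, a in calls))
```

Results come back in submission order, so callers don't need to re-sort; only the execution overlap is bounded.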

Performance
Runtime performance concerns include:
 * there will likely be some constant latency to spinning up a new execution environment
 * running all functions in a given language in the same kernel is best, to reduce RPC latency and memory and processor usage
 * it may be best to disallow any language that couldn't be implemented safely, or else mark functions in those languages as sharing mutable state and thus uncacheable & impure
 * if we have a custom simple language for invoking other functions with minimal syntax & operators, we could invoke those from any language in-process by implementing it within the support runtime.
 * source parsing/compilation may be slow on larger modules.
 * some language runtimes may be able to cache compiled bytecode, which might reduce this cost somewhat if we can invoke that path by storing source on the filesystem and then importing it from injected source
 * where global state is immutable, one could potentially reuse kernels for multiple render sessions, so functions that were previously loaded don't have to be recompiled
 * Sending large data sets between language kernels will be slow, incurring serialization/deserialization time
 * Best practices for function authors should be to send references to data sources around when possible instead of large strings, arrays, or binary buffers
 * Specific support for passing large buffers as args/return values without a copy could be useful, but would not interoperate transparently with common data types like strings, arrays, and dictionaries in that you have to decide at the API layer whether it's possible to pass a reference or whether you want a native object that must be serialized and deserialized at function invocation boundaries.
 * If language kernels that call each other are colocated on one server or in one process, large serialized arguments and return values can be transferred through a shared memory segment instead of passing them through a socket over the Jupyter bus. This may reduce copy overhead.

Languages like JS that use JIT compilation and do on-stack replacement will have a chance to optimize code that runs long loops even if kernels are short-lived in a render session -- a Mandelbrot fractal generator would run reasonably fast.

However, cross-language call latency could be large, so it's better to invoke one function that returns a frame buffer's worth of fractal imagery than to invoke an iteration-counter function once for each pixel, in case you have to make an RPC call.

Pure functions and determinism
This section is being revised to clarify terminology and classification of the key issues.

The general idea of Wikifunctions is to have pure, deterministic functions whose output depends only on their input state. But in practice, there are likely to be many sources of non-determinism which need to either be shrugged at and left as-is, plugged/neutered, or taken into proper account with a suitable consistency model and implicit/explicit cache invalidation system.

(Note that "pure functions" matters primarily at the level of how it affects determinism, and whether functions can alter the state of other functions or another invocation of the same function. Mutation of variables and data structures within an invocation is probably fine, as long as those changes can't "escape" to a second invocation -- or to other functions. More on this in later sections.)

Sources of non-determinism:
 * language features that are deliberately non-deterministic, like Math.random
 * language features that return changing information about the world, like Date.now or the Date constructor
 * the state of the language runtime itself, like checking for a feature that isn't available yet in the version of the kernel that's running when the function is deployed
 * reading a sandboxed filesystem, if the language allows it, might vary over time depending on the container details
 * reading data from a service, such as loading a file from Wikimedia Commons or making a query to Wikidata

The last of these (using services that we provide) is the easiest to plan for, because we'll control the API for accessing them and can treat the state of the world as an input to the function for caching purposes. A function invocation that caches its output after reading a Commons file could register a cache-invalidation hook for when that particular filename gets re-uploaded or deleted, which invalidates the cached function results and bubbles up to anything that cached data based on it.
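A toy sketch of that hook mechanism (class and key names are invented), showing how invalidating one resource bubbles up through dependent cached results:

```python
class InvalidationRegistry:
    # Cached function results register the external resources they
    # read; a re-upload/delete event invalidates them, and the
    # invalidation cascades to results cached on top of them.
    def __init__(self):
        self.cache = {}     # call key -> cached result
        self.watchers = {}  # resource key -> call keys that read it
        self.parents = {}   # call key -> call keys built on top of it

    def store(self, call_key, result, resources=(), depends_on=()):
        self.cache[call_key] = result
        for r in resources:
            self.watchers.setdefault(r, set()).add(call_key)
        for d in depends_on:
            self.parents.setdefault(d, set()).add(call_key)

    def invalidate_resource(self, resource):
        # e.g. a Commons file was re-uploaded or deleted
        for call_key in self.watchers.pop(resource, set()):
            self._invalidate(call_key)

    def _invalidate(self, call_key):
        if call_key in self.cache:
            del self.cache[call_key]
            # bubble up to anything that cached data based on this
            for parent in self.parents.pop(call_key, set()):
                self._invalidate(parent)
```

The real system would need durable storage and event plumbing from the services; this only shows the dependency-graph shape of the bookkeeping.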

World state might be avoidable by patching runtimes (eg to make Math.random use a consistent seed number, or replace the JS Date constructor and Date.now etc with functions that return a constant stub value, or with functional versions that register a timeout-based cache invalidation hook) but this could be error-prone in that it's easy to miss something and end up with bad cached data.
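As an illustration of the patching approach in Python terms (the JS version would swap out Math.random and the Date constructor), here's a hedged sketch that hands sandboxed code a seeded generator and a clock stub instead of the real modules:

```python
import random

def deterministic_environment(seed=0, fixed_time=1_600_000_000.0):
    # Plug the obvious non-determinism: a random generator with a
    # fixed seed, and a clock that returns a constant stub value.
    # The names exposed to the sandbox are invented for this sketch.
    rng = random.Random(seed)
    return {
        "random": rng.random,       # seeded, so runs are reproducible
        "now": lambda: fixed_time,  # constant stub value
    }

def run_function(source, env):
    # Execute a function body with only the stubbed globals visible,
    # so the code can't reach the real random/time machinery.
    scope = {"__builtins__": {}, **env}
    exec(source, scope)
    return scope["main"]()
```

As the text warns, the hard part is completeness: miss one source of world state and you can still end up with bad cached data.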

Wikidata, though, is generally tricky to cache on, since many things could change your result set. I don't know how hard it would be to devise a generic system for creating cache invalidation hook requests from a query, but it sounds hard.

Kernel implementations can optimize pure functions, if guaranteed statically or at runtime, by running them in the same interpreter context. Some languages may be easier to do this with than others. This allows reusing the same kernel process and execution context for every invocation of a particular function used in a large wiki page rendering.

This will improve speed on four fronts:
 * Not having to create an interpreter context and parse/instantiate new function source code on every invocation
 * For JIT languages, more opportunity to run an optimized version of the function and less relative time spent compiling vs executing
 * For functions known to be deterministic on input, automatic caching could be introduced between cross-module calls.
 * Strings could be passed directly by reference as they are immutable, while objects and arrays would have to be copied and function references wrapped by the marshaling layer that introduces caching and structured cloning. Some languages may have specific runtime support for copying objects across contexts efficiently like this, such as is done in JS engines when posting data between frames or workers

Non-deterministic functions that alter their global state might need their own context:
 * Some languages may be able to create an isolated global state which allows running in the same process, but without polluting each other's globals. This may be possible in JS using separate realms, as is done for frames and workers on a complex web page
 * Others may require creating a separate interpreter context, which might mean firing up a new Jupyter kernel for each such function, slowing startup

Alternately, if we say it's non-deterministic already, then we can let it change its global state -- which might allow it to manage an in-process cache with more precision based on its knowledge of the meaning of its inputs.

This allows data leakage across invocations within a page, which essentially makes execution order and past inputs during a session into inputs to each invocation for purity purposes. There are only a few good ways around this:
 * only use programs that can be validated statically (is it safe? takes lots of work)
 * -or- modify the runtime to allow locking down global state so this doesn't come up
 * -or- run a separate context for each running invocation, and reinitialize global state on every invocation (very slow)
 * you could potentially run multiple contexts of the same language in one process/thread, and call synchronously between them, so startup cost can be reduced with a suitable multi-context-capable kernel -- BUT per-function-invocation cost will be high because of global initialization, and probably recompilation of code if it's a JIT language (though if the runtime saves bytecode, parsing source and compiling to bytecode may be reduced if cached)

Marshaling and serialization
To implement cross-language calls, arguments must be serialized to an intermediate representation, then deserialized in another process into objects native to another language. (If multiple language runtimes can be squooshed into one process successfully, then when reading the section below consider that the calls in/out would be direct calls instead of RPC calls, and the shared segments would simply be allocations in the same process; otherwise roughly the same considerations apply with data transfer and caching to avoid unnecessary re-copies.)

For cross-Wikifunction calls in the same language, we should for consistency enforce that arguments and return values are validated and coerced to standard immutable data types just as they would be across languages, allowing a function to change implementations without unexpected surprises in its interface and reducing the leakage of state through exposing mutable objects. This would entail wrapping imports/exports across module contexts with marshaling functions that perform a "structured clone" operation on anything that can't be passed on as-is, such as copying arrays and dictionary objects -- equivalent to serializing and deserializing without the intermediate step.
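A rough Python sketch of that boundary wrapper, coercing values to immutable standard types on the way in and out. MappingProxyType stands in for a frozen dictionary, and the transferable type set is illustrative:

```python
from types import MappingProxyType

def structured_clone(value):
    # Boundary coercion sketch: immutable primitives pass through,
    # arrays and dictionaries become fresh immutable copies, and
    # anything outside the standard types is rejected.
    if value is None or isinstance(value, (bool, int, float, str, bytes)):
        return value  # immutable: safe to share across the boundary
    if isinstance(value, (list, tuple)):
        return tuple(structured_clone(v) for v in value)
    if isinstance(value, (dict, MappingProxyType)):
        return MappingProxyType(
            {str(k): structured_clone(v) for k, v in value.items()}
        )
    raise TypeError(f"not a transferable type: {type(value).__name__}")

def marshaled(func):
    # Wrap an export so inputs and outputs always cross as clones,
    # preventing mutable state from leaking between invocations.
    def wrapper(*args):
        return structured_clone(func(*(structured_clone(a) for a in args)))
    return wrapper
```

The clone happens on both sides of the call, so even a same-language caller can't observe or mutate the callee's internal data structures.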

One way to implement the marshaling would be with shared memory segments; each language kernel in a render session would get its own segment and manage a ring buffer of serialized objects. The segments could be passed as file descriptors over a socket via the language server management layer; by mmap'ing the segments into memory, each language server would have a read-only view of the other languages' segments, and is free to read the record objects out of the buffer.

When making a call, a kernel would append new records to the buffer containing any objects not yet serialized, then send just the record sequence IDs with the call request over a socket. (Numbers might be sent directly in place of a reference, depending on what's most efficient to implement.) Along with sequence IDs to use as arguments, the message should indicate the most recent sequence ID and byte position as well as the oldest position that's still considered active for purposes of back-references.

The callee's runtime would check the caller's buffer for new records it hasn't seen yet, and update a map of record IDs to converted values. Any deserialized cache entries older than the oldest active position can be discarded unless needed for other references, and any unprocessed records between the last update and that same point can be skipped because they won't be used. This allows multiple language servers to coexist without having to read the objects sent between other pairs of language servers that they don't use, while still allowing them to make back-references to each other's transfers when passing data through.

After processing some number of records, the callee writes a record to its own buffer marking how far it's read in the caller's buffer and how far back it still has old objects cached, so it will know how far back it can send backreferences on the next call.
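The bookkeeping described above might look roughly like this, with a plain dict standing in for the shared memory segment and all the names invented for illustration:

```python
class RecordBuffer:
    # One kernel's outgoing record log: each serialized value gets a
    # monotonically increasing sequence ID that peers can use for
    # back-references.
    def __init__(self):
        self.next_id = 0
        self.records = {}

    def append(self, payload):
        seq = self.next_id
        self.next_id += 1
        self.records[seq] = payload
        return seq

    def trim(self, oldest_active):
        # Drop records older than the oldest position still
        # considered active for back-references.
        for seq in [s for s in self.records if s < oldest_active]:
            del self.records[seq]

class PeerView:
    # Callee-side view of another kernel's buffer: deserialize each
    # record at most once, and forget cached entries that fall out
    # of the active window.
    def __init__(self, buffer):
        self.buffer = buffer
        self.cache = {}

    def resolve(self, seq):
        if seq not in self.cache:
            self.cache[seq] = self.buffer.records[seq]  # "deserialize"
        return self.cache[seq]

    def discard(self, oldest_active):
        self.cache = {s: v for s, v in self.cache.items() if s >= oldest_active}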

Note the serialization buffers must be considered potentially hostile for fault isolation purposes -- if deserialization is written in a language like C or C++ it must be very careful not to read beyond the end of the buffer, or beyond its start!

To handle object transfers larger than the allocated buffer, it would be possible to break up large strings, arrays, and object property dictionaries into smaller chunks; a single call-return RPC can be broken up with multiple callbacks confirming that the data has been read and asking for more to be sent. This requires holding the entire data set in memory in the target process, but doesn't require the serialized representation to fit in any particular size (as long as it's big enough to hold the necessary record structures and enough data to hold a chunk of one element).

Avoiding re-sending of previously transferred objects which are no longer in the buffer but whose immutable deserializations remain in memory could be done in a couple of ways. First, in addition to the items in the buffer, a longer list of in-process cache items could be retained, with objects expiring out of the cache in least-recently-used order instead of original transfer order. When expiring an object, write a tombstone reference to your own buffer which lets other kernels know they'll have to send that object to you fresh.

Second, integration with the language runtime's garbage collector would allow keeping objects alive if they remain live in memory, even if they haven't been sent by reference recently. When the GC notifies that the value is destroyed, write the tombstone. Depending on details of the GC this may or may not be possible for strings and arrays, and objects might need to be a custom subtype under the hood. Will need to research this.

Types could roughly cover JSON types plus function closures:
 * null
 * probably no need to have a separate undefined for JS, and just collapse them both into null
 * boolean
 * number
 * at least have 64-bit floating point
 * potentially have explicit integer or bignum types as well
 * string
 * UTF-8 string? Warning: JS strings are *not* always valid UTF-16 and may include invalid surrogate pairs, meaning they wouldn't round-trip cleanly. One possibility is to use CESU-8 or another encoding variant that preserves those sequences, then expose the final result to MediaWiki as UTF-8 with invalid sequences stubbed out with a replacement character.
 * binary buffer
 * Note that in JavaScript there is no way to make an ArrayBuffer immutable. You can freeze the JS object and prevent addition of properties, but you can still pass it into a TypedArray and alter the contained data. Likewise, freezing a TypedArray doesn't freeze the array-index properties that access the backing buffer.
 * As such it's probably necessary to create a new ArrayBuffer out of the source buffer every time it's deserialized.
 * If doing a multi-part transfer due to limited ring buffer size, this would allow copying each segment only once if the final size is sent ahead of the list of segments to allocate a full-size ArrayBuffer target to copy into.
 * Back-references could still be done if the ArrayBuffer contents can be verified not to have changed, which would require a byte-by-byte comparison against the ring buffer contents, but this is cheaper than copying and avoids pushing other objects out of the ring buffer with a fresh copy.
 * Knowing which record to back-reference could be done with a WeakMap in JS, or a similar structure, mapping the ArrayBuffer to a record ID by its identity, or by adding a user-visible marker property. The comparison could exit on the first difference.
 * Would have to immediately write tombstones when a binary blob falls out of the buffer, as we'd no longer have the source to compare it to. Copying the ArrayBuffer in-process to a cache and copying it through the ring buffer again would likely incur similar amounts of overhead.
 * array
 * a length followed by a sequence of back-references (potentially primitive elements could be transferred in place of a reference, depending on the representation details)
 * object/dictionary
 * ordered map of string keys to other values
 * a property count followed by one string back-reference for each key, & each value as a primitive or back-reference, in sequence
 * back-references
 * re-referencing a previously transferred object from another language kernel's buffer that's known to still be accessible in buffer or cache
 * function references
 * first-class function values, via an index reference
 * requires some GC integration in runtimes to avoid leaking memory during an execution session

The primitive types are immutable in most languages, and usually fit in a machine word and have no heap storage so are cheap to copy and 'create'. In some languages like JS and Java strings are immutable, which allows for the runtime layer reusing cached string transfers easily.

Arrays and objects in most languages are not always immutable, though in some cases they can be made so, such as with Object.freeze in JS. This allows you to skip creating a new serialized version if it's still in the buffer when making another transfer; for JS we can apply freezing to any deserialized arrays and objects, meaning we can freely issue back-references when passing them on.

One thing that worries me a little is object identity on these -- if you transfer an object through serialization and deserialization and back it could have the same object identity if it was still in the buffer, but not if enough other data was transferred in between that it's fallen out of the buffer and the in-process cache. One solution to this is to do this at the runtime level (rather than the scripting language level) so you can integrate with the garbage collector. However there may need to be more thought put into this to produce adequate determinism.

Fast cross-context iteration
A worrying performance scenario is when processing a loop over a large data set where every iteration involves a call across functions, especially across language kernels.

Serialization and deserialization of data is required to enforce typing and consistency at the Wikifunction level, and this is also applied on any function closures that get passed as an argument or within an array or dictionary object. Within a given language kernel, this can be optimized into a structured clone operation that, for instance, can avoid having to duplicate strings in JavaScript because they are immutable.

Additionally for cross-language calls there is RPC latency -- you must fully serialize arguments, send them over the socket, and then wait for a reply; on the other side you deserialize any result before you can continue. It may be possible to optimize "streaming" of a large number of calls by interleaving the serialization and call-message sending of one iteration with the preparation for the next, but this has implications for determinism if the input data is not guaranteed to be immutable in the source language.

In general, it's optimal to reduce the number of round-trips in a scenario where cross-language latency may hit your important loops. The simplest way to do this is to accept input data as arrays as well as single items if you expect bulk use, and do the iteration within your function. This might not be pretty, but it will perform fairly consistently, and it allows manual tuning of the list size for memory management if the full data set is too large to keep in memory at once.
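The simplest form of that advice, sketched as a wrapper that lets one call carry a whole array instead of one item per round-trip:

```python
def make_batch_friendly(func):
    # Accept either a single item or a list of items, so callers can
    # amortize cross-language round trips by sending a whole batch
    # in one invocation.
    def wrapper(arg):
        if isinstance(arg, list):
            return [func(item) for item in arg]  # one call, many items
        return func(arg)
    return wrapper
```

A ten-thousand-element list then costs one serialization/RPC round-trip instead of ten thousand.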

Pure function kernel and optimizations
Early planning thoughts on functions have centered on the pure application of functions as calls with lists of arguments, which may be further calls; this essentially defines a very simple programming language based on call expressions, with no variables (but the promise of caching/memoization speeding multiple calls). Many functions may be implementable this way, or with a small extension to bind and reuse values, especially those that mainly call other functions and filter their data.

Because there's no global state that can be modified, this avoids some of the complications of general programming language implementations for the special case of calling another function that's implemented in the same way -- the same running interpreter process can simply load up an additional function definition and execute it in-process, without having to transfer arguments and results through the Jupyter protocol.

The language runtime extensions that expose the Wikifunctions service could also implement the pure-function interpreter in-process within a JavaScript or Python or Lua function, avoiding RPC overhead. This could be a large performance boost for common simple filter functions, for instance.

However this might complicate debugging; at the least it probably requires the intermediary that stitches debug info together into a coherent session to speak separately to the simple function interpreters and to the JS, Lua, and Python debuggers.

Debugging
The Jupyter debugger protocol is based on the language server / debugger protocol used by VS Code and other projects, extended a bit to cover Jupyter's model where there's a web service intermediary between the execution kernels and the web front-end. We could use this protocol with the existing Python and C++ kernels, and could help add debugging support to Node JS and Lua kernels perhaps, to get our primary work languages going.

We would have the additional complication that a virtual call stack may include functions running in multiple kernels, so we need to be able to stitch that into a cohesive debugging view of multiple sources and languages.

A MediaWiki-friendly debugger frontend could use our standard widget sets, pull sources from the wiki, etc, and manage the function invocation and debugging protocol via a persistent WebSockets connection. I'm not sure if we have anything else using WebSockets right now in primary production; this'll have a long-running connection to the client during the debug session, and similarly the kernel for the running function may stay open for minutes at a time or more as execution is paused & single-stepped, variables inspected, breakpoints set and cleared, sources and stack traces perused. So this has some additional resource usage of memory for the execution kernels over a much longer period of time than a normal function invocation, even if it's not using CPU continuously.

If we wanted, we could potentially rework Scribunto on this framework and allow interactive debugging of Lua modules. Something to consider.

The debugger UI would also be useful for client-side widgets using a sandboxed interpreter, which I'm researching for a future project. The interpreter would need to expose the low-level Jupyter debugging API, and the debugger would just connect to a virtual event bus instead of a WebSocket.

Profiling
It may be possible to use profiling tools within the language runtimes to create profiles listing the time spent on every function call path; however, this might require additional support from the kernels and additions to the debugging protocol -- I don't know yet and will have to research this further.

Another possibility is to collect data only at the Wikifunction boundary layer, which could be done by explicitly tracking start/end times and updating a data structure designed to have the stacks stitched together. This would not cover calls to internal functions local to a Wikifunction definition, and it would have a small per-call cost which could add up in tight loops, but we already have a lot of overhead at the Wikifunction call boundary with input/output validation and cloning or serialization/deserialization, and syscalls and context switches for cross-language calls.
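A sketch of that boundary-layer collection, keyed by the stitched call path; an injectable clock keeps the sketch testable (the class and method names are invented):

```python
import time
from contextlib import contextmanager

class BoundaryProfiler:
    # Collect timings only at the Wikifunction call boundary: each
    # invocation records its full call path, so per-path totals can
    # later be stitched into a cross-kernel profile view.
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []
        self.totals = {}

    @contextmanager
    def invocation(self, name):
        self.stack.append(name)
        path = "/".join(self.stack)  # e.g. "render/format_date"
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.totals[path] = self.totals.get(path, 0.0) + elapsed
            self.stack.pop()
```

Each kernel would keep its own profiler, with the intermediary merging the path-keyed totals across languages.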

Callbacks and first-class functions
Another possibility to consider is whether to support sending first-class function objects as arguments and return values, which enables callbacks and user-defined control structures like foreach loops over custom data structures.

This would be a useful addition to the model, and conforms well with functional programming ideals. It would certainly make callbacks much easier; a common pattern in today's MediaWiki templates is to split code over multiple templates, with the "callback"'s name and any necessary closure state manually passed in as parameters to the control template. It works but it's not pleasant to write and debug -- being able to just pop a closure inline in your main function would be super nice.

Different from the async model, where the runtime returns to a low layer of the stack and then waits on additional data for its event loop, the callback model would keep the stack where it is when making a call, but would accept either a return event or a call event. A call would be handled by the runtime invoking the registered callback function object, then passing its value back to the other side of the connection and waiting for the next message. Unlike async, there's no danger of using more than your share of allocated CPU time, since calls block on one side of the connection or the other.

Kernels passing function references around may need to keep some sort of liveness count in order to garbage-collect function references across languages. If it's possible to integrate with the language runtime's GC and send a message when a reference is released, the caller runtime can free the function and any closure state -- otherwise closure references will just accumulate until the entire render session is done, potentially eating a lot of memory if something racks up a lot of calls creating closures in a loop.
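A minimal sketch of what the exporting side of such a reference registry might look like (the function names and the message handling here are hypothetical):

```javascript
// Hypothetical per-kernel registry of exported function references. The
// remote side addresses functions by id; a "release" message lets this side
// drop the closure so the GC can reclaim it, instead of holding every
// callback alive until the render session ends.
let nextId = 0;
const exported = new Map(); // id -> function (plus its closure state)

function exportFunction(fn) {
  const id = nextId++;
  exported.set(id, fn);
  return id; // the id is sent over the wire instead of the function itself
}

function invokeExported(id, args) {
  const fn = exported.get(id);
  if (!fn) throw new Error(`released or unknown function ref ${id}`);
  return fn(...args);
}

function releaseExported(id) {
  exported.delete(id); // remote side says it no longer holds the reference
}

const ref = exportFunction((a, b) => a + b);
```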

Code sharing
Note that first-class callbacks would allow building object-oriented patterns on top of standard dictionaries: a function could export a constructor factory function (or multiple such constructors in a dictionary), which would return instances as dictionaries containing a map of method closures wrapping the private state object.
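For example, the constructor-factory pattern in plain JS closures might look like this (names are purely illustrative):

```javascript
// A "constructor factory" function returns instances as plain dictionaries
// of method closures wrapping private state -- object-oriented patterns
// built on standard dictionaries, with no class machinery required.
function makeCounter(start) {
  let count = start; // private state, reachable only through the closures
  return {
    increment: () => { count += 1; return count; },
    value: () => count,
  };
}

const counter = makeCounter(10);
counter.increment();
```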

There's no way to avoid this once you allow callbacks as first-class values, so if we accept the one we should make sure the other is ergonomically supported!

Alternately, or in addition, we could make it easier to share code libraries in a particular language, so that multiple functions using a common JS or Python library with a complex API using native objects could import common sources without having to bridge it over our Wikifunction data marshaling boundary. This also eliminates RPC overhead when calling to and from that library versus running it in a separate kernel.

Folks could already hack this up by querying a library's source code from a wiki page, or returning source code from a function as a string and then evaluating it, but these aren't pretty:
 * no static dependency tracking
 * not very ergonomic
 * won't take advantage of bytecode caching based on filesystem etc

I recommend having both language-agnostic Wikifunctions and language-specific code libraries as first-class but distinct types in the Wikifunction system. Function implementations (exports?) are isolated and serialize/clone all data structures on input/output, rather than exposing mutable native object and array types. Code libraries are specific to one programming language, run in-process, and can modify your function's global state when you import or call them.

In JS, the two could be distinguished with different import path prefixes; exposing the language-specific file extension on library import names also might be a nice ergonomic aid to knowing which libraries you're able to use, whereas we might hide it as an implementation detail for exported Wikifunctions.
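Purely as an illustration -- the wikifunction: and wikilibrary: prefixes and the visible .js extension here are hypothetical, not a real scheme:

```javascript
// A language-agnostic Wikifunction: no file extension, data is marshaled.
import { formatDate } from 'wikifunction:FormatDate';
// A JS-specific shared library: the extension signals it runs in-process.
import { parseTemplate } from 'wikilibrary:TemplateUtils.js';
```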

Some code libraries could be exposed both ways, as native sources that can be imported to run in-process with full language API, and as a Wikifunction wrapper API built around it which can be called cross-language at higher overhead.

It's important to note that any same-language code library you import can modify your global language state (for instance, sneakily replacing all the default constructor objects in JavaScript!), as can any of its dependencies -- but that doesn't extend to libraries loaded and run in the context of other Wikifunctions invoked on the same page render, since they live in separate kernel instances. So the possibility for a runaway global-state-spamming attack or accidental screwup is limited.

In addition to the convenience of working with richer language-specific APIs, the performance benefit of avoiding RPC calls could be huge. This may encourage more development to be done in the language-specific areas, but it's straightforward enough to add a wrapper API for many things, especially with callbacks, so I think it's not too much additional friction.

Alternatives: in-process calls, rolling our own
I think it would duplicate a lot of existing effort to try building our own similar language kernel and messaging bus from scratch, but here are a couple of things one might do differently from Jupyter:

One could run multiple language kernels in the same process and thread, and have them invoke calls to each other directly via their respective kernels. This would still require spinning up fresh interpreter states for each non-cached function call, but there would be no RPC overhead to talk to another kernel, apart from marshalling arguments.

In debug mode, all kernels running in the process would be mediated by a common control layer, exposing a consistent view.

This might turn out to be a big overhead savings, or it might not be that much. We might want to model and test this before investing a lot of effort.

A direct-call meta-kernel might be able to piggyback on existing Jupyter kernel work, or it might not.

Note that this model requires all kernels to execute code in the same process and thread, which means all supported runtime libraries would be linked together. This makes calls faster than communicating over a socket, but all runtimes must be built into one filesystem image and executable, so they can't be deployed as separate Docker images. Also, a security vulnerability in one runtime could give access to memory belonging to other functions in the call stack.

A middle ground is to use the Jupyter protocol between the kernels and the function server, with a topology that keeps all kernels within a session talking to each other (presumably over sockets) on the same server, while using Kubernetes or other scaling to farm separate sessions out to separate CPU and server nodes. This involves serialization overhead and waiting on socket transfer for all calls, but avoids adding network round-trips unless the call has to go fetch something from a database. It implies a two-level routing system: Jupyter language kernels managed directly by the function server, plus a higher-level meta-kernel that proxies invocations and debug calls between the MediaWiki code and the Jupyter web service.

I'm a bit unclear on whether one language kernel process would handle multiple instances through threads or by launching new processes, but if we have only one process per language then the per-function cost is only a new thread and interpreter state; the runtime itself is already in memory and ready to roll.

Wikidata queries and query building
Let's say we have two functions, one which returns a list of Q references based on a CSV page in Commons, and another which takes that list and filters it manually based on presence of some property:
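A sketch of what those two functions might look like -- commons.fetchCsv and wikidata.itemsWithProperty are invented stand-ins for real data-access interfaces, stubbed here with fixed data:

```javascript
// Sketch only: these "data providers" are hypothetical, stubbed for the demo.
const commons = {
  fetchCsv: async (page) => [{ item: 'Q1' }, { item: 'Q42' }, { item: 'Q64' }],
};
const wikidata = {
  // One query for the whole list, rather than a round-trip per item.
  itemsWithProperty: async (items, propertyId) =>
    items.filter((q) => q !== 'Q64'), // pretend Q64 lacks the property
};

// Function 1: return a list of Q references from a CSV page on Commons.
async function itemsFromCsv(pageName) {
  const rows = await commons.fetchCsv(pageName);
  return rows.map((row) => row.item);
}

// Function 2: filter that list by presence of some property.
async function filterByProperty(items, propertyId) {
  return wikidata.itemsWithProperty(items, propertyId);
}
```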

This at least avoids round-tripping for every item in the filter, but if we were going to use this list and either pop it back into Wikidata to fetch some properties for display, or add some more filters to the query to avoid transferring data we didn't need, it might be nice to do a single round-trip and avoid having to send the long list of IDs to and from the query server multiple times. Especially if one were to refactor the data-file provider to drive out of Wikidata or another database.

Might be worth thinking about good interfaces for reading portions of large files, making queries of large CSV or RDF data sets, and being able to compose the filtering to minimize roundtrips while remaining both ergonomic and performant in a multi-language, multi-process RPC scenario!

There are a few semi-specified, not fully standard ways of representing structured SPARQL queries as JSON-like structures that could be friendly to our situation and can cross language boundaries safely. Builder functions could handle generating the objects if literals are too verbose.
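A sketch of the idea, with an invented (not one of the existing semi-standard) JSON shape and a builder function:

```javascript
// Represent a SPARQL-like query as a plain JSON structure that can cross
// language boundaries safely. The shape here is made up for illustration.
function selectItems(propertyId, valueId) {
  return {
    type: 'select',
    vars: ['item'],
    where: [
      { subject: '?item', predicate: propertyId, object: valueId },
    ],
  };
}

const query = selectItems('P31', 'Q5'); // e.g. "instance of: human"
```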

A SQLite provider that works from tabular data table sets would want a similar SQL builder solution.

Cache updates and getting behind
For MediaWiki templates, we went with a system that re-renders all templates on demand in case of page cache invalidation, but the way we invalidate pages' caches is to go through the list of all the pages registered as using the template and update their cache-invalidation time. For very large numbers of affected pages this can be done in many batches and can take quite a bit of time.

I don't know if we want to plan a more general improvement to this sort of invalidation technique, but we might want to specifically plan for what happens when a large invalidation happens of pages using a certain function. For a common function this could be millions of pages, and if we have to re-invoke every function on every page as they get visited and re-rendered there might be a glut of new invocation calls.

Some questions to consider:
 * can we continue to use invalid data under certain circumstances?
 * can we partition a particular function or set of functions to use a "fair" amount of resources and avoid starving other work?
 * would throttling distinguish between top-level functions and the functions they call?
 * or invocations from cache invalidations versus invocations with no cached data?

Things to consider. I do kind of fear a giant out of control spike of usage from some rogue function that's spamming a bunch of loops -- or worse, a simple legit tweak to the source code of a fundamental function that's used on low-level templates everywhere, causing a re-render of every function and every page of Wikipedia. :D

Cache invalidation hook registration
One model to think about is the explicit management of data caches by complex functions, with their own opportunity to specify keys and register hooks for when a cached result is invalidated.

This might be good, or might be complex; I'm less certain about this so far. I don't like that it's complex to have to specify things -- but it may be a good under-the-hood model if we can derive invalidation hook points and suitable values automatically during execution.

Linking and explicit imports
The JS code examples above reference other Wikifunctions via import statements, rather than using an API call with a string to look up the function at runtime. This allows static analysis of the source code to determine which functions can be invoked, and linked function invocations can be stored as metadata.

This allows usage tracking: a Whatlinkshere for function definitions!

It also allows pre-fetching the definitions of any simple functions that can be implemented in-process without making an RPC call to another language kernel. All of the simple functions that could be reached directly via other simple functions could be preloaded to reduce round-trips (unless there's a combinatorial explosion in which case you might stop preloading at some point).

However, this particular use of import may not work well for us unless we also statically verify that any global state is immutable. This seems feasible, but we need to do some more work in this area.

Usage tracking and resource limits
As mentioned above, providing async APIs would immensely increase the danger of overcommitting resources, since at a minimum the kernels for each separate function used in a session could all be using a full CPU and lots of memory, multiplying the impact of a single function that invokes many other functions.

Even if we limit calls to synchronous execution, so that all kernels but one are blocked, the CPU time spent needs to be tracked so that when we start killing long-running processes and memory hogs we can produce automated reports of which pages, templates, and functions were on the call stack or loaded up.

At a minimum, capturing which function died and what the rest of the call stack looked like would be of immense help in tracking down problems.

Ideally there should be easily accessible built-in monitoring and profiling that allows function authors to estimate the performance impact of their code.

Case study: JavaScript isolation
Some notes on how to potentially optimize JavaScript execution for pure, deterministic functions while remaining reasonably performant in the slower cases that require more isolation.

Note: the terminology is not precise here, and should be aligned with more common computer science usage.

Isolation levels
These are important properties for a function to have, or not have, in order for us to handle it properly, along two axes.

Pure functions vs Dirty functions
 * Strictly speaking, a "pure functional" language doesn't support mutable state at all, but what we care about is preventing state mutations from escaping a function invocation -- within a function we can think of mutations as a special case of replacing a value with a new value.
 * We must avoid changes to global or module-level state to avoid leaking information across invocations.
 * For safety, prevent mutation of global objects with Object.freeze() and perhaps a thorough filtering of the whole global object graph.
 * Pure functions may be run in the same scripting language context, as separate modules, and pass values to each other directly with no RPC or serialization overhead
 * To avoid mutable state escaping across modules, all Wikifunction imports will be wrapped with a marshaling function, which normalizes input, rejects invalid data, and copies things if necessary
 * Strings are immutable and may be freely sent between modules
 * Arrays and Objects, if frozen, could be sent directly between modules using the same Realm, but we may not want to do so -- even passing a frozen Array would be surprising if someone has added a custom method to it before freezing. (Even if you can't change Array.prototype, you can add an own property to an individual array object and it will shadow the prototype method unless the caller explicitly uses the prototype method.) Copying arrays and objects would require copying their storage, but any strings they contain are stored by reference, so it's just the pointers that are copied.
 * Binary data buffers would have to be copied or kept in an immutable custom data type that can be passed by reference and queried when bytes are needed; Typed Arrays cannot be frozen, and freezing an ArrayBuffer doesn't protect its contents.
 * Traditional code -- can mutate global state, like modifying built-in prototypes or global variables
 * Has to at least run in a separate Realm so it can have its own globals, though that could stay in the same process
 * Realms would allow passing frozen data across modules too, but the created objects need to have the module-appropriate prototype attached
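A quick demonstration of the prototype-shadowing hazard mentioned in the frozen-Array bullet above:

```javascript
// Freezing Array.prototype would not stop an own property on an individual
// array from shadowing a method -- property lookup finds the own property
// first, unless the caller explicitly goes through the prototype.
const arr = [3, 1, 2];
Object.defineProperty(arr, 'join', { value: () => 'surprise!' });

const shadowed = arr.join(',');                     // hits the own property
const honest = Array.prototype.join.call(arr, ','); // explicit prototype call
```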

and:

Deterministic functions vs Non-deterministic functions
 * Always returns the same results for the same inputs, until its source code changes
 * Deterministic functions may be cached aggressively by the data marshalling layer in-process, and by the function server out of process
 * All pure functions are deterministic unless they call through to a non-deterministic function
 * Anything that fetches from an external data source like Wikimedia Commons or Wikidata is non-deterministic
 * Anything that calls another non-deterministic function is non-deterministic
 * Note callbacks provided as input may be non-deterministic, so whether a particular call can be cached may depend on whether the callback passed to it is deterministic or not!
 * Determining whether an arbitrary callback provided by JS code is deterministic requires static analysis of its source, and could be fallible if there's a hole in the runtime or the validator. Plan for how to recover from caching failures due to data leakage between functions in a production environment.
 * For non-deterministic functions, automatic caching must be disabled unless it's manually handled with invalidation hooks and/or a timeout.
 * Functions should have a clean way to declare their caching behavior through code or metadata or both, and this should be tracked with data flow analysis if possible.

Potentially this can all be managed by one big Jupyter kernel based around NodeJS or V8 directly, managing the multiple realms either from JS hosted code or from the C++ side. Hopefully this could be integrated with existing work, and not require us to maintain our own JS function server at the low level!

Enforcement: static analysis
Static analysis of JavaScript source code can determine a lot of things, but not as much as you'd like, thanks to it being a dynamically-typed language.

It should be possible to confirm that a program definitely follows a set of rules about preventing data mutation, though this requires statically verifying all call operations as well as preventing explicit mutations like the assignment and deletion operators.

For example, you must prevent calls to forbidden runtime functions, such as eval or the Function constructor, and reject any program that could potentially end up calling them.

This means even if they get accessed by name or passed over a Wikifunction boundary as a callback!

One thing that should help is to either freeze or copy all transferable arrays and objects crossing Wikifunction boundaries, even in the hyper-optimized path, meaning that object identity is not preserved across calls. Then even if, for instance, you found a way to call a forbidden function, you couldn't pass in a malleable object for the other end to send a reference back to call with.
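A sketch of that boundary rule, using structuredClone (available in modern Node) to show that object identity isn't preserved:

```javascript
// Values are copied (and frozen) when they cross a Wikifunction call, so the
// callee never holds a live reference to the caller's object. The function
// name crossBoundary is illustrative.
function crossBoundary(value) {
  const copy = structuredClone(value); // deep copy: no shared references
  return Object.freeze(copy);          // shallow freeze of the copy's top level
}

const original = { ids: ['Q1', 'Q42'] };
const received = crossBoundary(original);
```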

Another possibility is to "poison-pill" known-bad runtime functions by listing them in a map of forbidden objects for each Realm, and refusing to pass them across module boundaries as callbacks.

More aggressive is to replace the offending globals and properties altogether so they can't be accessed (unless they creep in from another context). In this case we can either remove known-bad properties, or -- better -- keep only a list of known-good ones.

Combining the two might be wise:
 * run all pure functions in one Realm
   * add a poison-pill property to all objects on the reject list (such as eval or Function)
   * filter globals and their properties (transitively) based on our explicit allow-list
   * freeze all globals and their properties to prevent modification
   * filter input/output between modules with intermediary adapter functions
   * prevent transfer of poison-pill objects
   * copy and/or freeze objects moving across modules
   * apply wrapper functions to any unwrapped callbacks
 * run impure functions all in one second Realm
   * filter globals, but with a wider allow list that allows general mutation
   * freeze all globals and their properties to prevent modification
   * filter input/output between modules with intermediary adapter functions
   * always copy objects moving across modules
   * apply wrapper functions to any unwrapped callbacks
 * run dirty functions in separate Realms
   * filter globals, but with a wider allow list
   * freeze all globals and their properties to prevent modification
   * filter input/output between modules with intermediary adapter functions
   * always copy objects moving across modules
   * apply wrapper functions to any unwrapped callbacks

Enforcement: validation
If each immutable function is loaded as a separate JS module in the same scripting engine context, they will be able to directly call each other -- this promises an immense speedup versus serialization, remote procedure call, and deserialization.

This optimization could be built on the back of the static validator, ensuring that functions cannot break the sandbox we've defined -- modifying an array or dictionary object you've received as a result could turn out to be a problem if someone else was relying on that object to be immutable!

As an example, to build a filtered array in immutable JS you'd do something like this:

instead of this:

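For instance, filtering even numbers in the two styles:

```javascript
const items = [1, 2, 3, 4, 5, 6];

// Immutable style: build the filtered array with a pure expression.
const evens = items.filter((n) => n % 2 === 0);

// Mutating style (what the validator would reject): push into an accumulator.
const evensMutating = [];
for (const n of items) {
  if (n % 2 === 0) {
    evensMutating.push(n);
  }
}
```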
Note that because the standard JS library doesn't include a lot of support functions for iterators, you'd need to write your own filter helper function if your data source is a JS iterable, or import one from another module. It might be worth examining a cross-module iterable data type that allows collections to be exported in a way that's friendly to step-by-step iteration protocols without transferring the entire data set at once in memory like an array.

This would require either enforcing that all operations are pure functional, without changing any state, or accepting that a validation failure could allow internal function state to be modified. A middle ground may be to ensure that mutated state doesn't leave the function via globals.

To enforce the no-mutation rule in the validator for that second example, we'd know the accumulator is an Array, so its push is the standard array push method (because there's no way Array, Array.prototype, or Array.prototype.push could have been changed -- in addition to the validator, we can freeze them; more on that later), and we know that's not an allowed runtime method.

We'd then want to enforce it at runtime as well, in case the validator failed and somehow let through code that managed to reach it...

Setup code run before loading any user-supplied modules should "neuter" the global state to a specific allow list, then freeze the objects so they can't be changed.

For Array.prototype.push, this means:
 * replace Array.prototype.push with a stub that throws an Error
 * freeze Array.prototype, so the stub can't itself be replaced
 * freeze Array, so the name Array.prototype can't be reassigned

This, however, should be tested to ensure that it doesn't cause problems in runtimes that might rely on the mutable methods and their public names.

I can't yet guarantee that this would work!
 * TODO: list out the exact globals and their properties that would be allowed, and find any problematic things

One limitation is that you can't prevent calls to a particular function object, and it would likely be difficult to reliably hide constructors (there are several runtime ways to get them, and I'm not sure they can all be removed sensibly).

This means that preventing creation of runtime code with the Function constructor, in case of a validation failure, has to be done from outside JavaScript, by the language server setting up the scripting context. Disabling code-generation permissions on the context prevents both eval and the Function constructor from working.

Enforcement: freezing and copying
Constant bindings with const can be used instead of mutable let or var to prevent use of assignment operators on variables, or we could allow them for local mutable state that does not escape the function.

Note that function arguments are bound as mutable variables. They could be rewritten to copy into a const shadow variable if necessary, or we could allow local mutable state.

Strings and numeric/boolean primitives are immutable themselves, which makes them easy and safe to transfer between modules.

Objects, including arrays, can be "frozen" with Object.freeze(), preventing further modifications to their properties -- this includes array indexes.

Rewriting the source could wrap all object creations in a call to Object.freeze() to prevent any mutations, but this would be difficult to apply consistently, since there are several ways in the runtime to create new objects.

Doing a recursive Object.freeze() on arrays/objects passed as arguments or return values would be wise, preventing modification of objects once they've crossed a Wikifunction boundary. This could be done within wrapper functions generated by the function server loader, and wouldn't require rewriting any source code.
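A minimal deep-freeze helper of the kind such a wrapper might use (deepFreeze is an illustrative name; typed arrays would need special handling, as noted below):

```javascript
// Recursively freeze an object graph. Freezing before recursing also makes
// this safe on cyclic structures: a revisited object is already frozen.
function deepFreeze(value) {
  if (value !== null && typeof value === 'object' && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const key of Object.getOwnPropertyNames(value)) {
      deepFreeze(value[key]);
    }
  }
  return value;
}

const result = deepFreeze({ list: [1, 2, { nested: true }] });
```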

Code shouldn't behave any differently whether or not objects are frozen, depending on what code path they've taken, because the static validation prevents us from loading any code that requires state mutation into this server. Code should only see the difference if it's escaped the immutability validation sandbox, but at least then the damage will be limited to objects that haven't passed through another function yet.

Note also that typed arrays cannot be frozen, since there's no way to guarantee that another view doesn't exist that modifies the underlying ArrayBuffer -- you can freeze the buffer object itself, but that doesn't stop you from changing the bytes in the buffer through another view!
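A short demonstration of the problem:

```javascript
// Freezing an ArrayBuffer object does not protect its bytes, which remain
// writable through any typed-array view; and freezing a non-empty typed
// array itself throws a TypeError.
const buffer = new ArrayBuffer(4);
const view1 = new Uint8Array(buffer);
const view2 = new Uint8Array(buffer);

Object.freeze(buffer);   // freezing the buffer object itself is allowed...
view2[0] = 42;           // ...but another view can still rewrite the bytes
const leaked = view1[0];

let freezeFailed = false;
try {
  Object.freeze(view1);  // typed-array elements can't be made non-configurable
} catch (e) {
  freezeFailed = true;
}
```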

This is somewhat unfortunate as typed arrays are quite fast and would be great for binary data processing functions.

This may require duplicating binary buffers, or backing them with an immutable data source and only copying them in-process on demand (perhaps partially).

Data exposure on functions
If functions are exposed across JS modules directly, they will leak their source code, because the toString() method on functions returns the source. This could be stubbed out if it's important for determinism reasons, but callers are already tied to the specific source version anyway, so it may not matter.

Enforcement: JavaScript to WebAssembly compilation
In addition to all the in-JS-world work to enforce and isolate immutable modules, there are some possible optimizations that can be done with a custom runtime, which might be a lot of work but could give us additional safety guarantees by simply not implementing things we have disallowed.

A possibility is to implement a JavaScript to WebAssembly compiler, which produces functions that can be called synchronously from a JavaScript or other host environment, in the same process, without exposing private state to each other.

As with running JS directly through Node/V8, immutable functions can be precompiled and then linked together sharing a runtime and heap -- this allows calls to be very cheap, and data structures don't have to be serialized or copied unless they call out to another language.

Mutable-global-state functions could run in the same process and runtime, but would have to have their own separate contexts within the WebAssembly JS engine implementation. This would allow for sharing strings within the WebAssembly memory, but they'd each have their own instances of the JS global state context.

The runtime could either share a single reference namespace (like in V8 and other regular JS implementations) or could make them distinct to guarantee references can't escape. At the least, this gives us more chance to control the inter-module call marshaling while avoiding copies of strings until we get up to an outside-JS-world-boundary. It's likely though that runtime performance would take a hit for long-running loops compared to just running Node/V8 directly, which can produce heavily optimized code based on runtime type specialization.

It would be possible to optimize this by recompiling individual functions back to Wasm when type speculation says it can be done more optimally, but "on-stack replacement" of a running function is not possible at the JavaScript/Wasm API level.

If a function decided to recompile in the middle of a loop, it would have to call into the runtime support for a replacement, transfer its internal state into the replacement function, have the replacement _start partway through the loop with half its variables initialized_, and then pass through its return value.

This means function specialization compilations would probably produce two functions:
 * one for later use, which starts at the beginning
 * another for use at the optimization-check point, which accepts args to initialize local variable state and then starts at that point, finishing out the rest of the function body

The partial function could be saved for re-use on calls to the same function farther up the stack, which could themselves see there's a compiled version they could switch to at the next loop iteration etc.

This might be an additional complication, though, and can be planned as a later optimization strategy on top of the initial runtime.

Debugging would require some runtime support; debug-mode functions would have to be able to expose their data structures to the debugger and support breakpoints and single-stepping.

I've done some similar work in the other direction with a WebAssembly recompiler that produces asynchronous JavaScript code:

The generated functions include both a slow path that does an explicit breakpoint check before every instruction, and fast paths of runs of instructions that definitely cannot be interrupted by a breakpoint. A bitmap of all code locations with breakpoints marked is maintained to make the breakpoint check cheap, and a bitmap is also kept for each sequence of adjacent optimized instructions.
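Simplified to a byte-per-location bitmap and placeholder "instructions", the breakpoint check in the generated code might look like:

```javascript
// One flag per code location (a real implementation would pack these into
// bits); the per-instruction check is a cheap indexed load and test.
const breakpoints = new Uint8Array(1024);

function setBreakpoint(location) {
  breakpoints[location] = 1;
}

// Slow path: check the bitmap before each "instruction" (placeholder
// closures here), handing control to the debugger on a hit.
function runSequence(startLocation, ops, onBreak) {
  for (let i = 0; i < ops.length; i++) {
    if (breakpoints[startLocation + i]) {
      onBreak(startLocation + i);
    }
    ops[i](); // execute the instruction
  }
}

let hits = [];
setBreakpoint(2);
runSequence(0, [() => {}, () => {}, () => {}, () => {}], (loc) => hits.push(loc));
```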

To single-step, both bitfields are set to all 1s. To use breakpoints, set 1 only for the individual breakpoints and the sequences that contain their opcodes.

WebAssembly doesn't allow asynchronous calls yet, so this won't work for our debugger client-side in the browser, but it will work perfectly in a server environment bridged to the client over a WebSocket: the debugger running in the host process (probably JS running in Node) will be able to query all the WebAssembly modules for their current stack state and present a view of the current state for debugging.