A Lambda Calculus With Coroutines and Heapless, Directly-Called Closures

In this cc, I'd like to present a small language based on the lambda calculus extended with stackful coroutines. I'll discuss a virtual machine designed for the language's execution, and an efficient compilation scheme.

We'll see that the compilation scheme

  • naturally supports tail-call optimization
  • eliminates all heap allocation of closures
  • eliminates the need for indirect calls

without restriction of the language's expressiveness - which may be interesting in its own right.

This post is intended for anyone broadly interested in implementing coroutines, compilers, or virtual machines. I am not an expert in any of those, but I hope that something in this post may be useful to you. The post will assume basic familiarity with unification-based type inference and code generation schemes.

First, let you let me show you a playground exemplifying the language and its compiler/VM. You can hover over values in the input buffer of the playground to see inferred types. The default program comes from Cyber's playground, computing the 20th fibonacci number, and how many times fib was called.


Recently, I've been studying the efficient implementation of effect handlers, which I currently believe are the most promising avenue for the future of managed effects in programming languages.

Effect handlers have a lot to do with coroutines - in fact, one way to implement effect handlers is as stackful coroutines, like OCaml does.

In this cc, I'll use the terms "coroutine", "green thread", and "fiber" all to refer to the specific idea of first-class, stackful, non-presumptive coroutines.


  • Coroutines are segments of a program whose execution can be paused and resumed at-will. They can be thought of threads of execution controlled in userspace, rather than in the kernel.
    • If you are familiar with goroutines, coroutines are like goroutines, except that
      • The point at which coroutines are preempted must be explicitly defined by the programmer; in the language we'll look at below, via the yield keyword.
      • Dually, coroutines do not imply a concurrent runtime; it is up the programmer whether to, and how to, execute coroutines in a concurrent fashion. In the language we'll look at below, this is done the spawn and resume keywords.
  • A first-class coroutine can be used like any other value in a language, unrestricted in its behavior; think of first-class functions.
  • Stackful coroutines can arbitrarily pause their execution, including within nested function calls. Stackful coroutines require a preserved call stack while they are preserved, in contrast to stackless coroutines. C++ (via co_await/co_yield), Rust (via async/await), and JavaScript (via async/await) all implement stackless coroutines - with the restriction that you cannot suspend in nested, non-co-routine call in those languages. The latter behavior induces the often-named function color problem for async/await.
  • Non-presumptive coroutines are those that cannot be resumed more than once. In the language we'll discuss below, the resume keyword resumes execution of a coroutines, and returns a new handle to the coroutine. It is a runtime error to call resume on the same handle to a coroutine more than once.
  • Coroutines are cooperative - unlike threads, which may be preempted arbitrarily by an operating systems, coroutines require programmer intervention to specify where execution should be paused and resumed (think async/await in JavaScript/Python/Rust). Moreoever, coroutines do not imply a concurrent runtime, and it is up the programmer to define how to execute coroutines.
    • Again, contrast this to goroutines, where the runtime of goroutines is well-defined and suspension points are largely implicit.

Why am I talking about coroutines here? I've seen a few good articles discussing implementing coroutines against a machine-language target (1, 2), and recent projects like Cyber and Co. I'd like to present a small model and implementation of coroutines against a virtual machine target, which may be a nice base for launching more experiments without designing a programming language intended for general use.

Co implements coroutines as continuations - which I believe is also the most promising technique for an efficient implementation of effect handlers, but I won't expand on that here. Continuations often go hand-in-hand with closures, which are typically heap-allocated and called indirectly. To show off something interesting, we'll also have our implementation compile closures unboxed, and eliminate all indirect calls via defunctionalization.

Finally, we'll see that tail-call optimization falls out naturally with other aspects of the implementation.

Okay, let's get started!

Source language

Our source language will be the simply-typed lambda calculus extended with tuples and fibers. Fibers will be our source-language-level representation of coroutines, and can be used like any other value. Fibers can be spawned, yielded, and resumed by the author to enable cooperative multitasking.

As an example, let's consider a program (taken from Cyber's examples) that calculates the 20th fibonacci number in a fiber that yields to its parent fiber on every iteration. This usage of yield and resume lets us compute how many times fib recursively calls itself. I hope you can see how you could run more complicated tasks concurrently in this style.

let rec fib = \n ->
if n < 2
then n
else (fib (n - 1)) + (fib (n - 2))
let rec exec = \state ->
stat state.0
| `Pending ->
let fib1 = resume state.0 in
exec {fib1, 0, state.2 + 1}
| `Done n -> {state.0, n, state.2}
let runFib = spawn (fib 20) in
let result = exec {runFib, 0, 0} in
{result.1, result.2}

This language's syntax is very similar to OCaml's. I'll point out a few key features of its semantics:

  • let rec defines a recursive binding; by default, let bindings are non-recursive.

  • {e1, e2, ..., en} defines a n-ary tuple. Tuple elements are accessed by 0-based indices; for example, t.0.

  • Every program starts on a unique main fiber. A program terminates when the main fiber terminates, even if other fibers have not terminatedcontrast this to non-joined kernel threads in e.g. C. It's up to the programmer to decide whether or not all child fibers should terminatewhether this is a good design design is a different question.

  • spawn (fib 20) executes the call fib 20 on a new fiber, which can yield to the parent fiber (in this case the main fiber) arbitrarily. spawn returns to the calling fiber a handle to the child fiber. These handles have type Fiber. Fibers have two states - they are either in the pending state, or have terminated with a value.

  • Fibers can be passed around arbitrarily, including to other fibers. Fibers are first-class values.

  • yield suspends a fiber's execution and returns execution to the current parent fiber. Note that since fibers can be passed around arbitrarily, the parent of a fiber may change during a program's execution. yield is a no-op on the main fiberI believe you could detect yields that are only reached by the main fiber with a full-program static analysis, but it's not clear to me whether a more localized analysis would be suitable..

  • resume returns execution to a child fiber, which executes until its next yield point or completion. Like spawn, resume returns a Fiber. It is a runtime to resume the same Fiber twice - in our example above, for example, it would not be legal to define

    - let fib1 = resume state.0 in
    - exec {fib1, 0, state.2 + 1}
    + let _ = resume state.0 in
    + exec {state.0, 0, state.2 + 1}

    A affine type system could enforce this requirement at compile-time, but we won't discuss that here.

    Resuming a completed Fiber is effectively a no-op; the Fiber is immediately returned as-is.

  • A stat expression queries the state of a Fiber, and matches on a `Pending branch if the fiber has not yet terminated, or a `Done branch if it has. The `Done binds the return value of the fiber to a variable name.

The source language has four types:

  • Integers
  • Booleans
  • Tuples composed of the language types, inductively
  • Fibers composed of the language types, inductively

Virtual machine strategy

There are a lot of ways you could design a virtual machine for the language described above. Here, I'll show one design for a stack-based machine.

First off, let's define the stack part of the stack machine. Suppose that we can represent all our types as integerstuples are just blocks of integers next to each other, and I'll explain the representation of fibers in a bit; then our stack is just a dynamically-growing array of integers. For the sake of simplicity we'll have the stack deal with 64-bit integers, rather than bytes. This is wasteful, but that's okay for this discussion.

We'll allow a program to arbitrarily push integers onto the stack, pop integers off the stack, and store integers into arbitrary indices in the stack.

Representing function frames

Since a function in our source language might have local variables referenced multiple times, it'd be nice to support some idea of local variables in the runtime representation of a function call frame.

The standard approach here is to keep track of a frame pointer in the runtime, which points to the place that was the top of the stack when a function call was enteredTypically, when targeting machine code, frame pointers are stored in CPU registers.. When a function is entered, it can ask the runtime to reserve some space on the stack that it will use for storing local variables. Then, those local values can be accessed relative to an offset from the frame pointer.

Example of compiling a function
let f = \x ->
n = 1
m = 2
n + m

can compile to something like

f: sp-add 2 # reserve two cells on the stack, for n and m. # n will live at fp[0], i.e. at the index of the frame pointer # m will live at fp[1] push 1 store-into fp[0] # push, store n=1 push 2 store-into fp[1] # push, store m=2 push fp[0] push fp[1] add ret

Representing function calls

Before calling a function, the caller needs to

  • allocate space for the callee's return value
  • push on the callee's argumentsthis includes any captured values, which will be discussed later
  • store the program counter of the caller (since this is global)
  • store the frame pointer of the caller (since this is global, and will be modified in the new function frame)

We can have the virtual machine runtime take care of storing the program counter and frame pointer, and restoring them when a function is returned, but our compiler will need to take care of the first two line items.

In all, the procedure for calling functions is:

  • Before call entry:

    • allocate space for the return value

    • push on the callee's arguments

    • push on the caller's program counter and frame pointer
      <return value> arg1 ... argn @caller_pc @caller_fp --- < stack top
  • During call:

    • Set the new program counter and frame pointer for the callee
      <return value> arg1 ... argn @caller_pc @caller_fp --- < stack top = fp
  • Upon return:

    • Reset the top of the stack to the callee's frame pointer. That means the caller's frame pointer is now the top value on the stack.
      <return value> arg1 ... argn @caller_pc @caller_fp --- < fp <-------\ <local args> | <other stuff> | --- < reset to --/
    • Pop and restore the caller's frame pointer and program counter.
      <return value> arg1 ... argn < ----------\ @caller_pc | @caller_fp | --- < reset to -/
    • The caller deallocates the provided arguments, leaving the return value on the top of the stack.
      <return value> <-------\ arg1 | ... | argn < reset to -/
Here's how a full function call would be represented in the VM.

Representing fibers

In this model, each fiber is identified by a unique stack. Spawning a fiber allocates a new call stack. The virtual machine needs to bookkeep

  • each unique fiber's call stack
  • what the current stack of fibers is, relative to the main fiber

Spawning or resuming a fiber pushes onto the stack of active fibers. Yielding pops the current fiber off the top of the stack, leaving the yielded fiber's representation on the parent fiber's stack, as the return value of the matching spawn or resume.

Because our fibers cannot be resumed multiple times, we also need to keep track of a dirty bit in both the runtime representation of a fiber's stack, and the Fiber value we return to the parent fiber. When a fiber is resumed, we check that its dirty bit lines up with the fiber's dirty bit state; if these values are out-of-line, we know the fiber has already been resumed and can't be resumed again. Note that the dirty bit would not needed if the type system were to enforce that a fiber can be resumed at most once. It also would not be needed if multi-shot resumptions were supported; one strategy to enable that is to create a new fiber on each resumption, and reference-count fiber stacks.

Put all together, we can represent a Fiber that terminates with type T as the runtime tuple

{ bit: int, value: T, fibidx: int, fibdirty: int }

Where bit=1 when the fiber is done, and 0 otherwise; value is the value the fiber terminated with, or else undefined; fibidx is the unique reference to the fiber's call stack; fibdirty is the fiber's dirty bit.

The values in the runtime tuple cannot be accessed directly from our source language. In fact, fibidx and fibdirty are purely implementation details. bit and value can be accessed conditionally, via the stat keyword.

Representing tuples

Speaking of stat, it's worth noting how tuples (including Fibers) are represented on the stack. The directionality is somewhat arbitrary, but here we'll have lower-indexed elements in a tuple stored at the top of the stack. For example, a Fiber<{int, int}> will be stored as

fiber.fibdirty fiber.fibidx fiber.value.1 fiber.value.0 fiber.bit < stack top

This has the advantage of making stat's job easier - once a fiber is loaded onto the top of the stack, the pending/done bit can be immediately popped off, and if needed, the terminating value is right on top of the stack.

Instruction set

All that described, we can construct an instruction set for our virtual machine.
type op =
| Eq
| Lt
| Sub
| Add
| Mul
| Yield
| Spawn of { proc : label; args_size : int; ret_size : int }
| Resume of int (** size of return value *)
| Push of locator
| Store of int (** store-into fp[offset] *)
| SpAdd of int
| SpSub of int
| SpRestoreFp (** restore to the frame pointer. Used for tail calls *)
| Jmp of label
| Jmpz of label
| Jmprel1 (** jump relative to 1 + the integer on the top of the stack. *)
| Call of label
| Ret
type basic_block = label * op list
type proc = {
name : label;
blocks : basic_block list;
debug_frame : debug_frame;

Instructions are organized into basic blocks, which are identified by a label and contain zero or more instructions that are executed in a linear fashion. A basic block only branches out via a jump or return at its exit. Procedures consist of a name and one or more basic blocks.

Most of the bytecode instructions are standard, though there are a few interesting ones:

  • Along with taking the procedure to call, Spawn requires the size of the procedure's arguments. That's because the arguments will need to be copied over from the parent stack into the stack allocated for the child fiber.
  • Both Spawn and Resume require the size of the fiber's termination value, because upon yield or termination, the runtime will need to know how large to make the Fiber value representation.
  • SpAdd allocates space on the stack (for local arguments and return values), SpSub deallocates space on the stack (so callers can reclaim space used for call argument), and SpRestoreFp restores the top of the stack to the current frame pointer. We'll see that the latter instruction is useful for implementing tail call elimination; it is otherwise unnecessary, since such restoration happens by the runtime during a return.
  • Call requires a named procedure, and there is no support for indirect calls. This is not a joke, and there is no restiction here - we'll see below how to compile co_lc, unrestricted in the kinds of closures you can create, with only direct calls.

Compilation strategy

Type inference

There are a few type systems co_lc could use; here we use the type system of the simply-typed lambda calculus, with complete type inference via unification. Here's one implementation of co_lc inference algorithm; I won't go into depth about it here, but the approach is pretty standard.

A tip for inference of tuples

One thing that may be interesting, which I don't see typically covered in discussions of unification-based inference, is how to deal with the inference of tuples. The question has to do with an expression like

let f = \t ->
t.1 == 10

Clearly, t is a tuple with an arity of at least two, where the second element is an int. But the precise arity of the tuple, and its other elements' types, cannot be known at this point.

One solution is to have both dense representations of tuple types (for when a tuple literal is seen), and a sparse representation for cases like the above. That is, our type definition might look like

type ty =
| TupleDense of ty list (* list of element types *)
| TupleSparse of (int * ty) list (* list of known indices' types *)

When a TupleSparse unifies with a TupleDense, its respective elements (and arities) must unify, and then the TupleSparse must become a TupleDense, since that is the point of the concrete type's instantiation.

Eliminating indirect calls and heap-allocated closures via defunctionalization

I did promise no indirect function calls, and I must deliver. We'll use type-directed defunctionalization, via lambda sets, to guarantee that all calls can be compiled directlyDefunctionalization (and devirtualization) are typically done as optimizations in later stages of a compiler; however, by embedding call information in the type system, we can guarantee the optimization is applied., with conditional dispatch via a jump table rather than a function pointer.

Lambda sets are due to William Brandon, et.al., unpublished manuscript. Lambda sets are also used in Roc to compile closures without indirect calls or heap allocation.

Why does eliminating indirect calls matter?

An indirect call occurs when a machine has to load the address to call at runtime, rather than it being statically known.

Function pointers

In our source language, it's reasonable to write something like

let caller = \f -> f 1 in
let id = \x -> x in
caller id

A direct compilation of this code might produce bytecode like

id: ... caller: # suppose the argument is at fp[-3] push 1 push fp[-3] call-indirect ret main: push &id call caller ret

where the address of the function passed to caller has to be loaded, popped, and dereferenced at runtime in order for call-indirect to determine what function should ultimately be called. It's clear that in a program like the one above, a smarter analysis could reduce the compiled code to

id: ... caller: push 1 call id ret main: call caller ret

The advantages of the latter include

  • The call site is statically known, enabling further compiler analyses and optimizations like inlining id in caller.
  • If this were compiled to target a CPU, there would be no reliance on the CPU's branch target predictor at runtime.

The biggest downside of this approach I'm aware of is that if caller is called with multiple times with unique functions,

  • a unique copy of caller must be stamped out for each unique argument (monomorphization, increasing code size)
  • or, direct calls can be used, but must be guarded by a jump table

Speaking of jump tables -

Conditionally-determined functions

A second case has to do with the representation of conditionally-determined functions. Let's extend our previous example to

let caller = \f -> f 1 in
let id = \x -> x in
let mul2 = \x -> x * 2
caller (if 0 == 0 then id else mul2)

Again, a direct compilation of this code might produce bytecode like

id: ... caller: # suppose the argument is at fp[-3] push 1 push fp[-3] call-indirect main: push 0 ifz jmp else then: push &id call caller ret else: push &mul call caller ret

As before, this compilation determines a function pointer to dispatch to, and passes that into caller. An alternative is to push down the conditional into the body of caller, enabling a direct call within each branch of the conditional.

id: ... caller: push 1 # suppose the argument is at fp[-3] push fp[-3] ifz jmp call_mul_ call_id_: call id ret call_mul_: call mul ret main: push 0 call caller ret

The advantage here is that we've eliminated one source of indirection - we still need a conditional, but we've gotten rid of function pointers. This may make inline analyses more fruitful, and places less pressure on a CPU's branch target predictor.

One thing I'm not sure about is the efficacy of the jump-table-based caller implementation relative to the function-pointer-based one, when caller's argument is static. In that case, the main performance consideration would be

  • for the function-pointer-based caller, how long a dereference takes, and how well branch target prediction works on a CPU
  • for the jump-table-based caller, how long a conditional and jump takes, and how well branch prediction works on a CPU

I am not aware of any empirical studies on the efficacy of CPU's branch predictors vs their branch target predictorsThough, there is this nice analysis by Cloudflare. - please let me know if you know of any!

Eliminating heap-allocated captures

Lambda sets also enable a natural way to eliminate heap allocation of closure captures. This is pretty nice, especially in functional paradigms where closures are common.

Let's consider the program

where f returns a function that either captures x, x and y, or x and y and z (notice that f itself must capture all three). At runtime, the captures must live alongside the representation of the function to call - which means that the runtime representation of the captures must also be the same size, no matter what function is returnedBecause the return value of f can be used indiscriminately, including being passed around to arbitrary locations, it must have a statically-known size..

The typical way this requirement of uniformityOf course, one alternative is to disallow the conditional selection of a closure. C++ and Rust enforce this restriction by giving each closure a unique type, which is incompatible with every other closure type. is achieved is by compiling closures to a representation like the following C struct

struct closure {
void* fn_addr;
void* captures;

where each capturing function takes an opaque captures pointer as an argument, which it can then unpack to the concrete type of captures it knows.

For the function f returns, this is needlessly generic - f's returned closure can have the stricter representation

struct f_closure {
int tag; // 1, 2, or 3
struct f_closure_captures captures;
union f_closure_captures {
struct captures1;
struct captures2;
struct captures3;
struct captures1 {
int x;
struct captures2 {
int x; int y;
struct captures3 {
int x; int y; int z;

That is, we can keep a tag of three states telling us which function to dispatch to (this is the elimination of function pointers discussed in previous sections). And, we can reduce the representation of captures to a three-state union - which can all be stored on the stack, at the expense of paying for the largest captures size!

As before, we are making tradeoffs here. While this helps avoid heap-allocation, it means that

  • The smallest captures in a representation like the above take as much space on the stack as the largest captures, which can be exceptionally wasteful if the size discrepancy is large.
  • This scheme makes incremental and separate compilation more challenging, as the addition of a new function can change the memory representation of values far away.

Inference and compilation semantics of lambda sets

Actually, the f_closure struct representation I gave above is exactly where lambda sets take us! If you already see how to get there, feel free to skip this section.

I must again say that the original idea of lambda sets in this type-directed form is due to William Brandon, et.al. in an unpublished manuscript. We've also used them in Roc, and learned a lot about their behavior (many potential blog posts about that!).

The idea is

  • Each syntactic function (a lambda \x -> ...) gets a unique name

  • When inferring the type of a function, include the name of that function and its set of captures in a set adjacent to the function type; this is the lambda set. For example,

    x = 1
    y = 1
    f = \n -> n + x + y

    may be given type int -[f {x: int, y: int}]-> int, where [f {x: int, y: int}] is the lambda set.

  • When two functions unify, their lambda sets union.

Take a look at the inferred types of our example from above. Feel free to hover over f, or look at the elaborated output.

Notice that the inferred type is saying that f returns a function of type int -> int whose lambda set is [lam {x} | lam1 {x, y} | lam2 {x, y, z}] - exactly the three functions that we might dispatch to, and what their captures are! This is all we need in order to compile the return value of f to the f_closure-like representation described previously.

Compiling functions, including lambda sets

Put all together, our compilation scheme for functions consists of
  • When compiling a function (e.g. lam2 of the example above)
    • Determine the stack size of its captures based on the largest captures of any lambda in the lambda set it's involved in.
      • Compile the function to unpack the passed captures appropriately.
      • For the runtime representation of the function-to-call, store the captures on the stack.
    • If the lambda set of the function is non-unary, the function to call must be determined conditionally; store a tag for this function on the top of the stack (e.g. 2, since lam2 appears in index 2 of the set).
  • When calling a function
    • Load the argument, then the function. That way, the stack consists of the argument, followed by any captures, followed by optionally the function tag on the top of the stack.
      <arg> <captures> <tag if non-unary lambda set> < stack top
    • If the lambda set is unary, there is only one function that can be called - compile the direct call.

    • If the lambda set is non-unary, build a jump table (one implementation) using the integer tag of the lambda on the top of the stack. Each branch of the jump table can perform the appropriate direct call to the matched function.

Tip: deciding where to store values

I don't know if this is well-known, but one thing I find helpful during compilation is to have the procedure that compiles an expression take a parameter indicating where the expression should be compiled, rather than (in this case) always compiling to the stack and storing the result elsewhere later. This eliminates a lot of trivially-reducable load and stores.

For example, my implementation has the recursive bytecode compiler take an optional target destination, which looks like

type opt_target =
[ `Any
| `FpOffset of int (** store as a local offset from the frame pointer *)
| `Stack (** store on the top of the stack *) ]

Sometimes, the target truly can be arbitrary. For example, when compiling a tuple access, we don't really care if the tuple is loaded onto the top of stack or is in an offset from the frame pointer - in the former case we can pop values off until we get to the field we're looking for, and in the latter case, the access can be made relative to where the tuple lives.

To facilitate the arbitrary-target case, the procedure also returns where exactly the expression ended up being stored, as a type that's opt_target without the Any variant.

Tail-call optimization

Passing an optional target makes tail-call optimization very natural - when we're compiling a function, we set up the target of the function body to a special return local, which is the location where the caller allocated space for the return value.

During the compilation of a function, if we see a call to the same function, and whose target is the return local, we know that we can perform a tail-call optimization! Thanks to lambda sets, the destination of a function call is in the function type, so only the type of the function being applied needs to be examined to understand whether the optimization can be made.

To actually perform the optimization, compile the call argument into the location of the argument in the current function frame, reset the stack pointer to the frame pointer, and jump back up to the top of the function. See, I promised a use of SpRestoreFp!

Unexplored extensions

Thanks for making it this far! You must be pretty interested in this kind of stuff. There are various natural extensions to this project, which may make for interesting projects:

  • Make these fibers do something interesting. Our language and compiler is enough to show off that the implementation of fibers works, but doesn't actually demonstrate its use in an interesting one. One idea would be to add support for making non-blocking network requests and show an example of cooperative multitasking where multiple fibers fetch many webpages.
  • Implement an affine type system to enforce the one-shot behavior of our fibers at compile-time.
  • Design and implement an analysis to detect what spawned expressions will never yield. Those expressions can be optimized to happen as direct evaluations, rather than needing to happen on a child fiber.
  • Design an efficient scheme for multi-resumable fibers.
  • Design a system for implicitly-yielding userland threads, and a preempting runtime scheduler of userland threads. Modify the VM and compiler to support this system.
  • Implement additional optimizations over the virtual machine, and optimizations of the bytecode in the compiler.
  • My virtual machine is implemented in OCaml (and compiled to JavaScript via JSOO). Using a tighter language, like C/Rust/Zig/whatever, for the VM (and compiling to WASM) may result in a much faster machine.

Playground & Implementation

I hope you enjoyed this post, and learned something. You can find my implementation of the language compiler and VM on GitHub. As you saw previously in this post, I've also made a playground for the language, embdded again below.

Ad: Are you interested in compiling structural subtypes?

If you know of ways, or are interested in, compiling languages based on structural subtyping to non-uniform representations, please email me! This has been an ongoing project of mine and I hope to write about it soon, and would love to chat through with others interested.

In particular, the idea here is to combine

  • the programming flexibility of structural subtyping
  • (non-principal, possibly incomplete) type inference, a-la MLsub
  • row polymorphism and polymorphic variants

to design subtyping-based source languages with compilation to target languages that

  • do not support runtime object polymorphism (i.e. all procedures are monomorphized)
  • have non-uniform, unboxed representations
  • do not require introducing implicit conversions at re-binding or usage sites
Analytics By visiting this site, you agree to its use of Cloudflare Analytics. No identifiable information is transmitted to Cloudflare. See Cloudflare Analytics user privacy.