A Simple Language With Flow Typing

Introduction

The full source code for the language described here is available at gh:ayazhafiz/lang_narrow.

You can try out the language in this playground.

In this cc, we will formulate lang_narrow, a language that supports flow typing with record and union typesUnion types are also known as sum types.. Flow typing empowers our language's type system to structurally narrow values' types at branches in a program's control flow. To gain an intuition for what this means, let's take a look at a simple example:If you are familiar with the TypeScript type system, our language's semantics will feel very similar.

fn defaultNat(): nat { 1729 } fn readNat(n: nat|string): nat { if n is nat then n // n is nat else defaultNat() // n is string } fn narrowBToNat(p: nat|{a: bool, b: nat}|{b: string, c: nat}): nat { if b in p then readNat(p.b) // p is {a: bool, b: nat}|{b: string, c: nat} else p // p is nat } // narrowBToNat(10) // => 10 // narrowBToNat({a: true, b: 20}) // => 20 // narrowBToNat({b: "not a nat" , c: 9}) // => 1729

The union type T|U denotes a type that is either T or U, and the is and in expressions introduce a condition that constrains a union type to a subtype in either branch. Hopefully it's easy to get an idea of what is going; later, we'll define the language and its constructs more formally.

Even superficially, it is clear that our use of "union types" here is somewhat different from the tagged union algebraic data types (ADTs) present in Haskell (via data), OCaml (via type), or Rust (via enum). Our union types are structural rather than nominal, and the variants of our union types are types in of themselves, rather than simply tag names (or constructors). However, they are still tagged unions in the sense that the the variants of the union are named by their types. As mentioned, we'll give a proper definition our union types later on.

Motivation

Before we delve into formalizations, it may be helpful give some motivation for why we might want to explore lang_narrow. After all, why not use ADTs in place of the kinds of unions we have here, and why would flow typing be useful when we can pattern match on data constructors? I will give two reasons, though there may be more.

The merits of flow typing

In languages with ADTs, but without flow typing, we cannot employ the type system to constrain the variants of an ADT based on invariants in a program control flow. For example, consider some code like

rust
fn send_recv(req: Request): Data {
const resp = send(req);
if let Some(Status::Ok) = resp.status {
return resp.data;
}
let req = expensive_refresh_request(req);
let req = expensive_validate_request(req);
match resp.status {
Status::RateLimited => send_recv_with_backoff(req),
Status::Timeout => send_recv_with_proxy(req),
Status::Ok => unreachable!(),
}
}

This is a toy, but I hope it inspires a realistic situation. You have some happy path you want to return from quickly, and a more expensive path to go down otherwise. However, even though Status::Ok is not a variant that could constitute resp.status in the last match expression, the type system doesn't know that; we have to inform it that this is so, somewhat unergonomically.

With flow typing, the type system could remove Status::Ok from the possible variants of resp.status after the first branch.

The merits of our union types compared to ADTs

ADTs are closed. This means that whenever we add a new variant to an ADT, we have to handle that variant everywhere the ADT is used.This is not entirely true in Haskell, as Haskell permits partial functions. But supposing that we would like our functions to be total, this holds.

rust
enum Animal = { Cat(Cat), Dog(Dog) }
fn get_sound(a: Animal): String { /* impl */ }
fn get_name(a: Animal): String { /* impl */ }

If I want to handle getting just the name for another animal, say Zebra, I need to add Zebra to the Animal ADT and handle the case of Zebra when unpacking Animal in both getSound and getName. Of course, there are ways around this: we could split up the Animal ADT into separate types to be used for getSound and getName, but then there is no easy way to convert between the separate types without writing even more code (at least if the ADTs are treated nominally, as they generally are).

On the other hand, with union types this is easy:

fn getSound(a: Cat|Dog|Zebra): string { /* impl */ } fn getSound(a: Cat|Dog): string { /* impl */ }

Since the variants of the union are types in themselves, we can include and exclude them freely. Furthermore, Cat|Dog can be subsumed by Cat|Dog|Zebra immediatelyAs long as union types are treated structurally, the subsumption would hold whether Cat and Dog are treated nominally or structurally.; we will formalize this subtyping relationship when developing the language in the sections below.

The disadvantages of our union types compared to ADTs

I certainly don't wish to claim life is greener on one side. There are several disadvantages of using union types in the manner we will develop in this language, including

  • A lack of uniform tag names: as we will see later, the is expression requires that we attach to every value used in such an expression a runtime type tag. This means that the compiler must either employ some non-trivial machinery to determine all reachable values to which it should attach type tags, or attach a type tag to all runtime values. The cost of this work only increases with the extends narrowing expression we introduce in lang_extends, which effectively requires a typechecker in the program runtime. With ADTs, the costs associated with the addition of type tags can be made local and minimal -- tags are constructed and read at syntax-directed sites, making it obvious to a compiler where they should be inserted.
  • Recursive types are more difficult to formalize: given that our type system is structural rather than nominal, it is more difficult to typecheck recursive type definitionsIt actually wouldn't be terrible with the use of equirecursive types, but that's beyond the scope of this cc.. To keep things simple, our language has neither a notion of type variables nor type inference, so there is no way to introduce recursive types.

All this is to say, if I were implementing flow typing in a production language, I would likely use ADTs and perform narrowing on data constructors. But open unions are easier to bookkeep than ADTs, so we prefer them for the purposes of our small language. It should be pretty straightforward to translate our approach to a language with closed, nominal ADTs.

Notation

signifies a typing context, with some number of variable-to-type bindings. For example, is a typing context wherein the variable $\text{t}$ has type $\text{T}$.

means that under the context $\Gamma$, the expression $\text{t}$ has type $\text{T}$. For example,

is sound.

A type derivation rule $\text{R}$ has the form and can be read as, "when the conditions of $\text{R}$ hold, the result holds." As an example, here is a derivation rule stipulating that the addition of two expressions whose types under are nats yields a nat:

$\text{S} <: \text{T} % >$ states that $\text{S}$ is a subtype of $\text{T}$. $\require{cancel}\text{S} \cancel{<:} \text{T} % >$ states that $\text{S}$ is not a subtype of $\text{T}$.

lang_narrow

Having discussed some motivation and notation, we will now begin to formalize lang_narrow. We will derive its type system, discuss what is needed to provide for flow typing, and go over a runtime for lang_narrow when compiled to C.

Although we will present most of the interesting typing rules necessary to glue the system together, I don't attempt to cover all that would be necessary to prove the correctness of this language's type systemIn fact, I am not sure the type system is sound and complete, though "intuitively" it may be seen to be..

Basic types and their rules

Our language contains the primitive types nat (a natural number), string, bool, unknown (the top typeThe top type is inhabited by all values; it is the supertype of all types.), and never (the bottom typeThe bottom type is inhabited by no values; it is the subtype of all types.).

lang_narrow supports functions and types them; for example,

fn add(a: nat, b: nat): nat { /* impl */ }

has the type (nat, nat): nat. However, this syntactic representation of function types is not admitted by the language front-end; that is, no value can be explicitly typed as a function. Since all parameters must be typed by the user, this restriction effectively means that there are no higher-order functions in our language. Adding language support for higher-order functions would be straightforward (support for type influencing less soHistorically, type inference in the presence of subtypes has presented serious implementation difficulties. In general, an efficient implementation of such systems requires careful design of a language with syntax-directed typing rules. For efficient implementations, see Pottier's 1998 thesis and Dolan's 2017 thesis. Dolan's thesis solves the even larger problem of type inference in the presence of subtyping and polymorphism by treating types foremost as an algebra. A simpler, yet just-as-powerful version of Dolan's work is described in Parreaux 2020. Thanks to Ben Siraphob for introducing me to the latter two publications.).

Record types

Our language supports records; {a: 1, b: true} is a record with fields a and b, and has the type {a: nat, b: bool}. Records can be projected by field name; {a: 1, b: true}.b yields true. Here are the typing rules for record construction and projectionAfter reading about union types, you may notice that neither $\ref{Record-Proj}$ nor any other rule permits the projection of a value of type {a: nat, b: nat}|{b: string} on the field b. As an exercise for you, suppose we would like such a projection to yield the type nat|string (in fact, the full lang_narrow type system does so). What modifications to $\ref{Record-Proj}$ must be made to support this? (You can also add additional rules, but doing so is not necessary).:

We also define a subtyping rule for records:

Notice that in the subtyping relation, the order of record fields does not matter; holds.

Union types

A union type (also known as a sum type) is a type that may be inhabited by values of several different types. As an example, in our language union types are written like {a: bool}|nat|string, of which {a: true}, 1, and "hi" are all valid inhabitants. In lang_narrow, variants of a union type U are subtypes of U and are tagged by their type name.

All values of a type T are also of the singleton union type consisting of T; we make no distinction between these two notions. Unions types containing no type variants are equivalent to the never type; we make no distinction between these two either.

Notice that it is impractical for us to deduce that a value of type T is of a union type that is a strict supertype of T without explicit annotations. For example, we would certainly prefer to type the literal 1 as a nat, rather than nat|string or any other number of union types - indeed, the only two reasonable choices here are nat or top. Nevertheless, we would like to express the fact that a nat is a nat|string; for example

fn id(a: nat|string): nat|string { a } id(1)

should be well-typed. Thus, we have the following subtyping rule for unions:

As with , the order of union variants does not matter; holds.

There are also a few convenience rules we may want to introduceThere are more than the two we present here; for example we could also have an "unfold" rule that makes string|(bool|nat) equivalent to string|bool|nat.. For one, notice that the type string|nat|string is equivalent to string|nat.

Furthermore, notice that string|never is equivalent to string, as never is inhabited by no values and therefore contributes nothing to the domain of the union type.

The Narrowed type

Next, let's formalize the semantics of flow typing in lang_narrow and discuss what data structures we need to represent these semantics.

As we saw in our initial example, the main idea of flow typing is to introduce a condition at branch sites that narrows a value of type T to a type U in one branch (let's call this the left branch), and to the type of T "without" U in the other branch (let's call this the right branch). In our language, we only provide if expressions for branching, for which this idea might look something like

// suppose Γ = {a: T} if <some check to see if a is a U> then ... // a is a subtype of U in this branch else ... // a is a subtype of (T "without" U) in this branch

We'll use a type constructor called Narrowed to encapsulate the relationship between an expression and its narrowed types in left and right branches. In the example above, the <some check> condition would yield the type Narrowed⟨a, U, T-U⟩, where T-U signifies a type inhabited by values that inhabit T but do not inhabit UThis is one interpretation of what we described as T "without" U above, but there are many others. We have to be careful though, because only a few interpretations can be proven sound!.

We have two things to figure out regarding Narrowed types:

  • figure out how to construct them
  • figure out how they influence the types of branching expressions

Let's do the latter first.

Interaction with branching expressions

Take a look at the derivation rule for an if expression with a boolean condition:

For those unfamiliar, is called the join of $\text{T}$ and $\text{U}$, and represents their least upper bound (or least common supertype). As a derivation rule,

The only modification we need to make for the derivation rule of if with a Narrowed condition is how the narrowed expression should be typed in the left and right branches.

First, we verify that the narrowed term t is in the typing context. If it is not (for example, if it is a literal) the expression would not be referenceable in either branch anyway; we assume this is a programming error. Next, we validate that neither narrowed type of t is never; if it were, the branch associated with that narrowed type would never be taken, and hence the branch is vacuous; we assume this also is a programming error.

Then, we remove the binding for t from the typing context, determine the type of each branch with the addition of the relevant narrowed type of t to its context, and yield the type of the if expression as the join of both branches' types. Notice that the context of the resulting type contains the unnarrowed type of t.

The restrictions we have imposed, namely that t must be a variable and that neither narrowed type can evaluate to never, are not actually necessary for the soundness of the derivation rule. However, their inclusion helps catch programming mistakes, and the restriction against never-narrowed branches effectively makes applicable only when T is a union type. This can simplify a typechecking implementation significantly.

And that's all for Narrowed types in branches. Now, onto how to construct Narrowed types!

The is narrowing expression

In our language, the is expression constructs a Narrowed type from an expression and a subtype of the expression type. For example, the expression

// suppose Γ = {a: string|nat|bool} a is string|bool

has the type Narrowed⟨a, string|bool, nat⟩. We define the derivation rule for is expressions the following way:

where the subtraction rule $\text{T} - \text{U}$ is given by

Notice that elides from T only those types that equivalent to a variant of U (up to permutation of record fields). For example, {a: nat, b: bool} - {a: nat} yields {a: nat, b: bool}, not never.

If the restrictions of seem a bit limiting or unnecessary, we will discuss a language that avoids them at the expense of greater runtime costs in lang_extends. However, I suspect these restrictions are reasonable for most general-purpose programming, as it is always possible to refine a union type annotation with a type variant that one may later want to narrow.

The in narrowing expression

Consider the function

fn readBOrC(obj: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat}): nat { if obj is {a: nat, b: nat}|{b: nat, d: nat} then obj.b else obj.c }

This works, but is a bit bulky; we don't care that obj is a union of records possibly containing a or d, we just care whether or not it's a subtype of {b: nat}. Because excludes types by equivalence rather than subtyping relation, if we were to try obj is {b: nat}, the left narrowing type would be never. So to elegantly handle narrowing of records, we can reformulate accordingly, as we do for lang_extends, or introduce a special case for records, as we will do here with the in expression.

in takes a field name f and an expression of type T, constructing a Narrowed instance with subtypes of T that are records containing f on the left and all other subtypes of T on the right. For example,

// suppose: obj :: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat} b in obj

has the type

Narrowed⟨obj, {a: nat, b: nat}|{b: nat, d: nat}, {a: nat, c: nat}|{c: nat, d: nat}⟩

We define the derivation rule for in expressions the following way:

Notice that the extraction of record types containing the field $\text{f}$ distributes over the union type $\text{T}$, rather than operating on the type as a whole. This is similar to the behavior of TypeScript's distributed conditional types.

A runtime implementation

A typechecking implementation of lang_narrow is a pretty straightforward encoding of the presented typing rules. What is more interesting is the implementation of expressions like is in the evaluation of a lang_narrow program, as these require auxiliary type information.

Actually, if the language is interpreted, evaluating type-dependent expressions is straightforward -- we reduce the expression-to-be-checked to a value, invoke the typechecker, and get our result.

But if the language is compiled, we have a bit more work to do in encoding type tags and measuring type compatibility. In the rest of this section, we will discuss this work for lang_narrow programs compiled to C.

A sufficiently advanced compiler could determine precisely which values are used in expressions that read type tags, and attach tags only to those. For the ease of implementation, we generate C code with all values having type tags.

In our code generator, we encode every value as an instance of a tagged_any struct. The struct includes a type tag that discerns a value's true type to runtime functions, and val field that contains the raw, type-safe value.

c
typedef struct tagged_any tagged_any;
typedef struct record {
int numFields;
const char** fields;
tagged_any* values;
} record;
union any {
int nat;
const char* string;
int bool;
record record;
} any;
struct tagged_any {
const char* tag;
union any val;
};

The most interesting runtime function using tagged_anys is the pseudo-polymorphic print:_print writes values without a newline, and is useful for recurring when printing a record. print is a user-facing entry point that terminates a printed value with a newline.

c
void _print(tagged_any any) {
if (is(any, NAT)) {
printf("%d", any.val.nat);
} else if (is(any, STRING)) {
printf("%s", any.val.string);
} else if (is(any, BOOL)) {
printf("%s", any.val.bool == 1 ? "true" : "false");
} else if (is(any, RECORD)) {
record r = any.val.record;
printf("{");
for (int i = 0; i < r.numFields; ++i) {
printf("%s: ", r.fields[i]);
_print(r.values[i]);
if (i != r.numFields - 1) {
printf(", ");
}
}
printf("}");
} else {
fprintf(stderr, "Runtime error: no matching tag %s\n", any.tag);
exit(1);
}
}
void print(tagged_any any) {
_print(any);
printf("\n");
}

Having encoded all values as tagged_anys, performing is and in checks becomes trivial. In the case of is, we take an array of type tags (constituting a union) and check if a value's type tag is in that array. If we need to check a record, we verify that all field names coincide and recurse on checking corresponding field values. In the case of in, if a value is not a record the check is immediately false; otherwise, we walk the fields of the record and check if one corresponds with the predicated field.

Performance concerns, and what to do about them

The introduction of type tags and runtime typecheckers introduces costs that may not be immediately obvious to users of the language, and are something that we would like to avoid. Type tags on their own aren't so bad to deal with; since they are constant, one can inline them and the downstream C compiler will likely replace identical tags with a common reference to a static; if not, an interner can be added without much difficulty.

On the other hand, the costs incurred by evaluating is expressions are harder to deal with, primarily because of the explosive cost of checking deeply nested records for equivalence.

To avoid this, perhaps the best strategy is to employ flow typing in a language with nominal types, rather than the structural ones we have presented in this language. This way, checks for type equivalence are simply the comparison of two type names (or even better, two memory addresses).

lang_extends

As mentioned earlier, it may appear that the restrictions imposed by and are superfluous. In some sense they are; we can generalize over both with a rule that constructs a Narrowed type by testing against an arbitrary subtype of a union. Let's direct such a rule with an expression called extends:

We'll define the exclusion the following way:

Now we can write the well-typed function

fn go(p: nat|bool|{a: nat, b: nat}|{b: nat, c: nat}): nat { if p extends {b: nat} then p.b else if p extends bool then if p then 1 else 0 else p }

using a single extends expression where we would have had to use a combination of in and is expressions in lang_narrow.

Performance concerns, and what to do about them

Of course, the flexibility of does not come without its costs. Like lang_narrow, a lang_extends runtime needs to check type tags and read record fields, but the conditions of also suggests that the runtime needs to carry around a subtyping checker!

Obviously, having both a compile-time and runtime subtyping checker is not ideal, and the runtime penalties can be severe especially in the presence of large union, record, or recursive types (if we had introduced them).

There are several techniques to mitigate the runtime penalties; one approach may be to populate, at compile time, a table of type relations amongst all known types in the program, and perform subtyping queries by searching the table. Costs of searches can be further reduced by storing the type relations in a tree and making queries union-finds.

My impression is that, in general, the power of a typing rule like is not worth the additional runtime complexity, cost, and code size its implementation must introduce.

Going further: an algorithm for narrowed contexts in imperative programs

The implementation of typing contexts in lang_narrow (and lang_extends) is straightforward; since each program is one large continuation, we never end up "jumping out" of the control flow (and the typing context) we are currently in. Contrast this to an imperative version of lang_narrow, wherein we may write a program like

fn is_number_like(p: bool|{numberish: nat|string}|nat|string): bool { if p is bool then { return false; } var n: nat|string = if numberish in p then { // (1) p.numberish } else { p }; return if p is number then { true // (2) } else { String.regex_matches(RE_NUMBER, p) // (3) }; }

At (1) we would like the typing context to associate p: {numberish: nat|string}|nat|string, at (2) associate p: number, and at (3) associate p: string.

It may not be obvious how to design an efficient algorithm to update the typing context in a control flow with the presence of jumps (like that of the first branch, which immediately returns). But actually, it's not so bad; I will give a quick sketch of a couple things you may want to consider, and leave you to fill out the algorithm if you are interested.

  • First, define what it means for the typing context when the program hits a local termination point (like a return statement, which jumps out of the local function); one idea is to then mark everything in the local context as uninhabited.
  • Keep track of the typing context after the completion of each statement in the program. For example, if a statement I is of the form if Cond then L else R, and the typing contexts of L and R as induced by Cond are $\Gamma_{\tt L}$ and $\Gamma_{\tt R}$, respectively, what are $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$? What is the the typing context $\Gamma_{\text{post}({\tt I})}$ in relation to $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$?

Summary

In this cc, we derived a simple language with support for structural type narrowing of record and union types via flow typing. Our language and its type system are small and intuitive enough to design and implement in an afternoon, but additional work not presented here is needed to formulate soundness and completeness of the type system.

To learn more about flow typing, you may be interested in the Wikipedia entry and Pearce's 2012 paper on a flow calculus with proofs of correctness.

As always, thanks for reading. I hope you had fun, and that this cc inspires you to design more interesting type systems.

Analytics By visiting this site, you agree to its use of Cloudflare Analytics. No identifiable information is transmitted to Cloudflare. See Cloudflare Analytics user privacy.