- Introduction
- Motivation
- Notation
- lang_narrow
- lang_extends
- Going further: an algorithm for narrowed contexts in imperative programs
- Summary

## Introduction

The full source code for the language described here is available at gh:ayazhafiz/lang_narrow.

You can try out the language in this playground.

In this cc, we will formulate `lang_narrow`

, a language that supports flow typing
with record and union typesUnion types are also known as sum types.. Flow
typing empowers our language's type system to structurally narrow values' types
at branches in a program's control flow. To gain an intuition for what this means, let's take a look at a
simple example:If you are familiar with the TypeScript type system, our language's semantics will feel very similar.

fn defaultNat(): nat { 1729 } fn readNat(n: nat|string): nat { if n is nat then n // n is nat else defaultNat() // n is string } fn narrowBToNat(p: nat|{a: bool, b: nat}|{b: string, c: nat}): nat { if b in p then readNat(p.b) // p is {a: bool, b: nat}|{b: string, c: nat} else p // p is nat } // narrowBToNat(10) // => 10 // narrowBToNat({a: true, b: 20}) // => 20 // narrowBToNat({b: "not a nat" , c: 9}) // => 1729

The union type `T|U`

denotes a type that is either `T`

or `U`

, and the
`is`

and `in`

expressions introduce a condition that constrains a union type to
a subtype in either branch. Hopefully it's easy to get an idea of what is
going; later, we'll define the language and its constructs more formally.

Even superficially, it is clear that our use of "union types" here is somewhat different from the
tagged union algebraic data types
(ADTs) present in Haskell (via `data`

), OCaml (via
`type`

), or Rust (via `enum`

). Our union types are structural rather than
nominal, and the variants of our union types are types in of themselves, rather
than simply tag names (or constructors). However, they are still tagged unions
in the sense that the the variants of the union are named by their types. As
mentioned, we'll give a proper definition our union types later on.

## Motivation

Before we delve into formalizations, it may be helpful give some motivation for why we might want to
explore `lang_narrow`

. After all, why not use ADTs in place of the kinds
of unions we have here, and why would flow typing be useful when
we can pattern match on data constructors? I will give two reasons, though
there may be more.

### The merits of flow typing

In languages with ADTs, but without flow typing, we cannot employ the type system to constrain the variants of an ADT based on invariants in a program control flow. For example, consider some code like

rust

fn send_recv(req: Request): Data {const resp = send(req);if let Some(Status::Ok) = resp.status {return resp.data;}let req = expensive_refresh_request(req);let req = expensive_validate_request(req);match resp.status {Status::RateLimited => send_recv_with_backoff(req),Status::Timeout => send_recv_with_proxy(req),Status::Ok => unreachable!(),}}

This is a toy, but I hope it inspires a realistic situation. You have
some happy path you want to return from quickly, and a more expensive path to
go down otherwise. However, even though `Status::Ok`

is not a variant that could
constitute `resp.status`

in the last `match`

expression, the type system doesn't
know that; we have to inform it that this is so, somewhat unergonomically.

With flow typing, the type system could remove `Status::Ok`

from the possible
variants of `resp.status`

after the first branch.

### The merits of our union types compared to ADTs

ADTs are closed. This means that whenever we add a new variant to an ADT, we have to handle that variant everywhere the ADT is used.This is not entirely true in Haskell, as Haskell permits partial functions. But supposing that we would like our functions to be total, this holds.

rust

enum Animal = { Cat(Cat), Dog(Dog) }fn get_sound(a: Animal): String { /* impl */ }fn get_name(a: Animal): String { /* impl */ }

If I want to handle getting just the name for another animal, say `Zebra`

, I
need to add `Zebra`

to the `Animal`

ADT and handle the case of `Zebra`

when
unpacking `Animal`

in both `getSound`

and `getName`

. Of course, there are
ways around this: we could split up the `Animal`

ADT into separate types to
be used for `getSound`

and `getName`

, but then there is no easy way to
convert between the separate types without writing even more code (at least
if the ADTs are treated nominally, as they generally are).

On the other hand, with union types this is easy:

fn getSound(a: Cat|Dog|Zebra): string { /* impl */ } fn getSound(a: Cat|Dog): string { /* impl */ }

Since the variants of the union are types in themselves, we can
include and exclude them freely. Furthermore, `Cat|Dog`

can be subsumed by
`Cat|Dog|Zebra`

immediatelyAs long as union types
are treated structurally, the subsumption would hold whether `Cat`

and `Dog`

are treated nominally or structurally.;
we will formalize this subtyping relationship when developing the language in
the sections below.

#### The disadvantages of our union types compared to ADTs

I certainly don't wish to claim life is greener on one side. There are several disadvantages of using union types in the manner we will develop in this language, including

**A lack of uniform tag names:**as we will see later, the`is`

expression requires that we attach to every value used in such an expression a runtime type tag. This means that the compiler must either employ some non-trivial machinery to determine all reachable values to which it should attach type tags, or attach a type tag to all runtime values. The cost of this work only increases with the`extends`

narrowing expression we introduce in`lang_extends`

, which effectively requires a typechecker in the program runtime. With ADTs, the costs associated with the addition of type tags can be made local and minimal -- tags are constructed and read at syntax-directed sites, making it obvious to a compiler where they should be inserted.**Recursive types are more difficult to formalize:**given that our type system is structural rather than nominal, it is more difficult to typecheck recursive type definitionsIt actually wouldn't be terrible with the use of equirecursive types, but that's beyond the scope of this cc.. To keep things simple, our language has neither a notion of type variables nor type inference, so there is no way to introduce recursive types.

All this is to say, if I were implementing flow typing in a production language, I would likely use ADTs and perform narrowing on data constructors. But open unions are easier to bookkeep than ADTs, so we prefer them for the purposes of our small language. It should be pretty straightforward to translate our approach to a language with closed, nominal ADTs.

## Notation

signifies a typing context, with some number of variable-to-type bindings. For example, is a typing context wherein the variable $\text{t}$ has type $\text{T}$.

means that under the context $\Gamma$, the expression $\text{t}$ has type $\text{T}$. For example,

is sound.A type derivation rule $\text{R}$ has the form
and can be read as, "when the conditions of $\text{R}$ hold, the result holds."
As an example, here is a derivation rule stipulating that the
addition of two expressions whose types under are `nat`

s
yields a `nat`

:

$\text{S} <: \text{T} % >$ states that $\text{S}$ is a subtype of $\text{T}$. $\require{cancel}\text{S} \cancel{<:} \text{T} % >$ states that $\text{S}$ is not a subtype of $\text{T}$.

`lang_narrow`

Having discussed some motivation and notation, we will now begin to formalize
`lang_narrow`

. We will derive its type system, discuss what is needed to
provide for flow typing, and go over a runtime for `lang_narrow`

when compiled to C.

Although we will present most of the interesting typing rules necessary to glue the system together, I don't attempt to cover all that would be necessary to prove the correctness of this language's type systemIn fact, I am not sure the type system is sound and complete, though "intuitively" it may be seen to be..

### Basic types and their rules

Our language contains the primitive types `nat`

(a natural number), `string`

,
`bool`

, `unknown`

(the top typeThe top type is
inhabited by all values; it is the supertype of all types.), and `never`

(the bottom typeThe bottom type is inhabited by no values; it is the
subtype of all types.).

`lang_narrow`

supports functions and types them; for example,

fn add(a: nat, b: nat): nat { /* impl */ }

has the type `(nat, nat): nat`

. However, this syntactic representation
of function types is not admitted by the language front-end;
that is, no value
can be explicitly typed as a function. Since all parameters must be typed by the user,
this restriction effectively means that there are no higher-order functions in
our language. Adding language support for higher-order functions would be
straightforward (support for type influencing less soHistorically, type inference in the presence of subtypes has presented serious
implementation difficulties. In general, an efficient implementation of such
systems requires careful design of a language with syntax-directed typing rules.
For efficient implementations, see Pottier's 1998 thesis
and Dolan's 2017 thesis. Dolan's
thesis solves the even larger problem of type inference in the presence of subtyping and
polymorphism by treating types foremost as an algebra. A simpler, yet just-as-powerful version of Dolan's work is
described in Parreaux 2020.
Thanks to Ben Siraphob for introducing me to the latter two publications.).

#### Record types

Our language supports records; `{a: 1, b: true}`

is a record with fields `a`

and `b`

, and has the type
`{a: nat, b: bool}`

. Records can be projected by field name; `{a: 1, b: true}.b`

yields `true`

. Here are the typing rules for record construction and
projectionAfter reading about union types, you may notice that neither $\ref{Record-Proj}$ nor any other rule
permits the projection of a value of type `{a: nat, b: nat}|{b: string}`

on the field `b`

. As an exercise for you,
suppose we would like such a projection to yield the type `nat|string`

(in fact, the full `lang_narrow`

type system does so). What
modifications to $\ref{Record-Proj}$ must be made to support this? (You can also
add additional rules, but doing so is not necessary).:

We also define a subtyping rule for records:

Notice that in the subtyping relation, the order of record fields does not matter; holds.

#### Union types

A union type (also known as a sum type) is a type that may be inhabited by
values of several different types. As an example, in our language union types
are written like `{a: bool}|nat|string`

, of which `{a: true}`

, `1`

, and `"hi"`

are all valid inhabitants. In `lang_narrow`

, variants of a union type `U`

are
subtypes of `U`

and are tagged by their type name.

All values of a type `T`

are also of the singleton union type consisting of `T`

;
we make no distinction between these two notions. Unions types containing no
type variants are equivalent to the `never`

type; we make no distinction between
these two either.

Notice that it is
impractical for us to deduce that a value of type `T`

is of a union type that is a
strict supertype of `T`

without explicit annotations. For example, we would
certainly prefer to type the literal `1`

as a `nat`

, rather than `nat|string`

or any other
number of union types - indeed, the only two reasonable choices here are `nat`

or `top`

.
Nevertheless, we would like to express the fact that a `nat`

is a `nat|string`

; for example

fn id(a: nat|string): nat|string { a } id(1)

should be well-typed. Thus, we have the following subtyping rule for unions:

As with , the order of union variants does not matter; holds.

There are also a few convenience rules we may want to introduceThere are more than the two we present here; for example we could also have an
"unfold" rule that makes `string|(bool|nat)`

equivalent to `string|bool|nat`

.. For one,
notice that the type `string|nat|string`

is equivalent to `string|nat`

.

Furthermore, notice that `string|never`

is equivalent to `string`

, as `never`

is
inhabited by no values and therefore contributes nothing to the domain of the
union type.

`Narrowed`

type

The Next, let's formalize the semantics of flow typing in `lang_narrow`

and discuss
what data structures we need to represent these semantics.

As we saw in our initial example, the main idea of flow typing is to introduce a
condition at branch sites that narrows a value of type `T`

to a type `U`

in
one branch (let's call this the left branch), and to the type of `T`

"without"
`U`

in the other branch (let's call this the right branch).
In our language, we only provide `if`

expressions for branching, for which this
idea might look something like

// suppose Γ = {a: T} if <some check to see if a is a U> then ... // a is a subtype of U in this branch else ... // a is a subtype of (T "without" U) in this branch

We'll use a type constructor called `Narrowed`

to encapsulate the relationship
between an expression and its narrowed types in left and right branches. In the
example above, the `<some check>`

condition would yield the type
`Narrowed⟨a, U, T-U⟩`

, where `T-U`

signifies a type inhabited by values that
inhabit `T`

but do not inhabit `U`

This is one
interpretation of what we described as `T`

"without" `U`

above, but there are many
others. We have to be careful though, because only a few interpretations can be
proven sound!.

We have two things to figure out regarding `Narrowed`

types:

- figure out how to construct them
- figure out how they influence the types of branching expressions

Let's do the latter first.

#### Interaction with branching expressions

Take a look at the derivation rule for an `if`

expression with a boolean
condition:

For those unfamiliar, is called the **join** of $\text{T}$ and $\text{U}$,
and represents their least upper bound (or least common supertype). As a derivation
rule,

The only modification we need to make for the derivation rule of `if`

with a
`Narrowed`

condition is how the narrowed expression should be typed in the left
and right branches.

First, we verify that the narrowed term `t`

is in the typing context. If it
is not (for example, if it is a literal) the expression would not be
referenceable in either branch anyway; we assume this is a programming error.
Next, we validate that neither narrowed type of `t`

is `never`

; if it were, the
branch associated with that narrowed type would never be taken, and hence the
branch is vacuous; we assume this also is a programming error.

Then, we remove the binding for `t`

from
the typing context, determine the type of each branch with the addition of
the relevant narrowed type of `t`

to its context, and yield the type of the `if`

expression as the
join of both branches' types. Notice that the context of the resulting type
contains the unnarrowed type of `t`

.

The restrictions we have imposed, namely that `t`

must be a variable
and that neither narrowed type can evaluate to `never`

, are not actually
necessary for the soundness of the derivation rule. However, their inclusion
helps catch programming mistakes, and the restriction against
`never`

-narrowed branches effectively makes
applicable only when `T`

is a union type. This can simplify a typechecking
implementation significantly.

And that's all for `Narrowed`

types in branches. Now, onto how to construct
`Narrowed`

types!

`is`

narrowing expression

The In our language, the `is`

expression constructs a `Narrowed`

type from an
expression and a subtype of the expression type. For example, the expression

// suppose Γ = {a: string|nat|bool} a is string|bool

has the type `Narrowed⟨a, string|bool, nat⟩`

. We define the derivation rule for
`is`

expressions the following way:

where the subtraction rule $\text{T} - \text{U}$ is given by

Notice that elides from `T`

only those
types that equivalent to a variant of `U`

(up to permutation of record
fields). For example, `{a: nat, b: bool} - {a: nat}`

yields `{a: nat, b: bool}`

,
not `never`

.

If the restrictions of seem a bit limiting or unnecessary, we will discuss a
language that avoids them at the expense of greater runtime costs in `lang_extends`

.
However, I suspect these restrictions are reasonable for most general-purpose programming,
as it is always possible to refine a union type annotation with a type variant
that one may later want to narrow.

`in`

narrowing expression

The Consider the function

fn readBOrC(obj: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat}): nat { if obj is {a: nat, b: nat}|{b: nat, d: nat} then obj.b else obj.c }

This works, but is a bit bulky; we don't care that `obj`

is a union of records
possibly containing `a`

or `d`

, we just care whether or not it's a subtype of
`{b: nat}`

. Because excludes types by
equivalence rather than subtyping relation, if we were to try `obj is {b: nat}`

,
the left narrowing type would be `never`

. So to elegantly handle narrowing of
records, we can reformulate accordingly, as we do for `lang_extends`

, or introduce
a special case for records, as we will do here with the `in`

expression.

`in`

takes a field name `f`

and an expression of type `T`

,
constructing a `Narrowed`

instance with subtypes of `T`

that are records containing `f`

on the left
and all other subtypes of `T`

on the right. For example,

// suppose: obj :: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat} b in obj

has the type

Narrowed⟨obj, {a: nat, b: nat}|{b: nat, d: nat}, {a: nat, c: nat}|{c: nat, d: nat}⟩

We define the derivation rule for `in`

expressions the following way:

Notice that the extraction of record types containing the field $\text{f}$ distributes over the union type $\text{T}$, rather than operating on the type as a whole. This is similar to the behavior of TypeScript's distributed conditional types.

### A runtime implementation

A typechecking implementation
of `lang_narrow`

is a pretty straightforward encoding of the presented
typing rules. What is more interesting is the implementation of expressions like
`is`

in the evaluation of a `lang_narrow`

program, as these require auxiliary
type information.

Actually, if the language is interpreted, evaluating type-dependent expressions is straightforward -- we reduce the expression-to-be-checked to a value, invoke the typechecker, and get our result.

But if the language is compiled,
we have a bit more work to do in encoding type tags and measuring type compatibility.
In the rest of this section, we will discuss this work for `lang_narrow`

programs compiled to C.

A sufficiently advanced compiler could determine precisely which values are used in expressions that read type tags, and attach tags only to those. For the ease of implementation, we generate C code with all values having type tags.

In our code generator,
we encode every value as an instance of a `tagged_any`

struct.
The struct includes a type tag that discerns a value's true type to
runtime functions, and `val`

field that contains the raw, type-safe value.

c

typedef struct tagged_any tagged_any;typedef struct record {int numFields;const char** fields;tagged_any* values;} record;union any {int nat;const char* string;int bool;record record;} any;struct tagged_any {const char* tag;union any val;};

The most interesting runtime function using `tagged_any`

s is the pseudo-polymorphic
`print`

:`_print`

writes values without a newline, and is useful for recurring when
printing a record. `print`

is a user-facing entry point that terminates a printed
value with a newline.

c

void _print(tagged_any any) {if (is(any, NAT)) {printf("%d", any.val.nat);} else if (is(any, STRING)) {printf("%s", any.val.string);} else if (is(any, BOOL)) {printf("%s", any.val.bool == 1 ? "true" : "false");} else if (is(any, RECORD)) {record r = any.val.record;printf("{");for (int i = 0; i < r.numFields; ++i) {printf("%s: ", r.fields[i]);_print(r.values[i]);if (i != r.numFields - 1) {printf(", ");}}printf("}");} else {fprintf(stderr, "Runtime error: no matching tag %s\n", any.tag);exit(1);}}void print(tagged_any any) {_print(any);printf("\n");}

Having encoded all values as `tagged_any`

s, performing `is`

and `in`

checks
becomes trivial. In the case of `is`

, we take an array of type tags
(constituting a union) and check if a value's type tag is in that array. If we
need to check a record, we verify that all field names coincide and
recurse on checking corresponding field values. In the case of `in`

, if a value is not a record the check is
immediately false; otherwise, we walk the fields of the record and check if one
corresponds with the predicated field.

### Performance concerns, and what to do about them

The introduction of type tags and runtime typecheckers introduces costs that may not be immediately obvious to users of the language, and are something that we would like to avoid. Type tags on their own aren't so bad to deal with; since they are constant, one can inline them and the downstream C compiler will likely replace identical tags with a common reference to a static; if not, an interner can be added without much difficulty.

On the other hand, the costs incurred by evaluating `is`

expressions are harder
to deal with, primarily because of the explosive cost of checking deeply nested
records for equivalence.

To avoid this, perhaps the best strategy is to employ flow typing in a language with nominal types, rather than the structural ones we have presented in this language. This way, checks for type equivalence are simply the comparison of two type names (or even better, two memory addresses).

`lang_extends`

As mentioned earlier, it may appear that the restrictions imposed by and
are superfluous. In some sense they are; we can generalize over
both with a rule that constructs a `Narrowed`

type by testing against an
arbitrary subtype of a union.
Let's direct such a rule with an expression called `extends`

:

We'll define the exclusion the following way:

Now we can write the well-typed function

fn go(p: nat|bool|{a: nat, b: nat}|{b: nat, c: nat}): nat { if p extends {b: nat} then p.b else if p extends bool then if p then 1 else 0 else p }

using a single `extends`

expression where we would have had to use a combination
of `in`

and `is`

expressions in `lang_narrow`

.

### Performance concerns, and what to do about them

Of course, the flexibility of does
not come without its costs. Like `lang_narrow`

, a `lang_extends`

runtime needs
to check type tags and read record fields, but the conditions of
also suggests that the runtime needs to carry around a subtyping checker!

Obviously, having both a compile-time and runtime subtyping checker is not ideal, and the runtime penalties can be severe especially in the presence of large union, record, or recursive types (if we had introduced them).

There are several techniques to mitigate the runtime penalties; one approach may be to populate, at compile time, a table of type relations amongst all known types in the program, and perform subtyping queries by searching the table. Costs of searches can be further reduced by storing the type relations in a tree and making queries union-finds.

My impression is that, in general, the power of a typing rule like is not worth the additional runtime complexity, cost, and code size its implementation must introduce.

## Going further: an algorithm for narrowed contexts in imperative programs

The implementation of typing contexts in `lang_narrow`

(and `lang_extends`

) is
straightforward; since each program is one large
continuation, we never end up "jumping out" of the control flow (and the typing
context) we are currently in. Contrast this to an imperative version of
`lang_narrow`

, wherein we may write a program like

fn is_number_like(p: bool|{numberish: nat|string}|nat|string): bool { if p is bool then { return false; } var n: nat|string = if numberish in p then { // (1) p.numberish } else { p }; return if p is number then { true // (2) } else { String.regex_matches(RE_NUMBER, p) // (3) }; }

At `(1)`

we would like the typing context to associate `p: {numberish: nat|string}|nat|string`

,
at `(2)`

associate `p: number`

, and at `(3)`

associate `p: string`

.

It may not be obvious how to design an efficient algorithm to update the typing context in a control flow with the presence of jumps (like that of the first branch, which immediately returns). But actually, it's not so bad; I will give a quick sketch of a couple things you may want to consider, and leave you to fill out the algorithm if you are interested.

- First, define what it means for the typing context when the program hits a
local termination point (like a
`return`

statement, which jumps out of the local function); one idea is to then mark everything in the local context as uninhabited. - Keep track of the typing context after the completion of each statement in the
program. For example, if a statement
`I`

is of the form`if Cond then L else R`

, and the typing contexts of`L`

and`R`

as induced by`Cond`

are $\Gamma_{\tt L}$ and $\Gamma_{\tt R}$, respectively, what are $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$? What is the the typing context $\Gamma_{\text{post}({\tt I})}$ in relation to $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$?

## Summary

In this cc, we derived a simple language with support for structural type narrowing of record and union types via flow typing. Our language and its type system are small and intuitive enough to design and implement in an afternoon, but additional work not presented here is needed to formulate soundness and completeness of the type system.

To learn more about flow typing, you may be interested in the Wikipedia entry and Pearce's 2012 paper on a flow calculus with proofs of correctness.

As always, thanks for reading. I hope you had fun, and that this cc inspires you to design more interesting type systems.