- Introduction
- Motivation
- Notation
- lang_narrow
- lang_extends
- Going further: an algorithm for narrowed contexts in imperative programs
- Summary
Introduction
The full source code for the language described here is available at gh:ayazhafiz/lang_narrow.
You can try out the language in this playground.
In this cc, we will formulate lang_narrow
, a language that supports flow typing
with record and union typesUnion types are also known as sum types.. Flow
typing empowers our language's type system to structurally narrow values' types
at branches in a program's control flow. To gain an intuition for what this means, let's take a look at a
simple example:If you are familiar with the TypeScript type system, our language's semantics will feel very similar.
fn defaultNat(): nat { 1729 } fn readNat(n: nat|string): nat { if n is nat then n // n is nat else defaultNat() // n is string } fn narrowBToNat(p: nat|{a: bool, b: nat}|{b: string, c: nat}): nat { if b in p then readNat(p.b) // p is {a: bool, b: nat}|{b: string, c: nat} else p // p is nat } // narrowBToNat(10) // => 10 // narrowBToNat({a: true, b: 20}) // => 20 // narrowBToNat({b: "not a nat" , c: 9}) // => 1729
The union type T|U
denotes a type that is either T
or U
, and the
is
and in
expressions introduce a condition that constrains a union type to
a subtype in either branch. Hopefully it's easy to get an idea of what is
going; later, we'll define the language and its constructs more formally.
Even superficially, it is clear that our use of "union types" here is somewhat different from the
tagged union algebraic data types
(ADTs) present in Haskell (via data
), OCaml (via
type
), or Rust (via enum
). Our union types are structural rather than
nominal, and the variants of our union types are types in of themselves, rather
than simply tag names (or constructors). However, they are still tagged unions
in the sense that the the variants of the union are named by their types. As
mentioned, we'll give a proper definition our union types later on.
Motivation
Before we delve into formalizations, it may be helpful give some motivation for why we might want to
explore lang_narrow
. After all, why not use ADTs in place of the kinds
of unions we have here, and why would flow typing be useful when
we can pattern match on data constructors? I will give two reasons, though
there may be more.
The merits of flow typing
In languages with ADTs, but without flow typing, we cannot employ the type system to constrain the variants of an ADT based on invariants in a program control flow. For example, consider some code like
rust
fn send_recv(req: Request): Data {const resp = send(req);if let Some(Status::Ok) = resp.status {return resp.data;}let req = expensive_refresh_request(req);let req = expensive_validate_request(req);match resp.status {Status::RateLimited => send_recv_with_backoff(req),Status::Timeout => send_recv_with_proxy(req),Status::Ok => unreachable!(),}}
This is a toy, but I hope it inspires a realistic situation. You have
some happy path you want to return from quickly, and a more expensive path to
go down otherwise. However, even though Status::Ok
is not a variant that could
constitute resp.status
in the last match
expression, the type system doesn't
know that; we have to inform it that this is so, somewhat unergonomically.
With flow typing, the type system could remove Status::Ok
from the possible
variants of resp.status
after the first branch.
The merits of our union types compared to ADTs
ADTs are closed. This means that whenever we add a new variant to an ADT, we have to handle that variant everywhere the ADT is used.This is not entirely true in Haskell, as Haskell permits partial functions. But supposing that we would like our functions to be total, this holds.
rust
enum Animal = { Cat(Cat), Dog(Dog) }fn get_sound(a: Animal): String { /* impl */ }fn get_name(a: Animal): String { /* impl */ }
If I want to handle getting just the name for another animal, say Zebra
, I
need to add Zebra
to the Animal
ADT and handle the case of Zebra
when
unpacking Animal
in both getSound
and getName
. Of course, there are
ways around this: we could split up the Animal
ADT into separate types to
be used for getSound
and getName
, but then there is no easy way to
convert between the separate types without writing even more code (at least
if the ADTs are treated nominally, as they generally are).
On the other hand, with union types this is easy:
fn getSound(a: Cat|Dog|Zebra): string { /* impl */ } fn getSound(a: Cat|Dog): string { /* impl */ }
Since the variants of the union are types in themselves, we can
include and exclude them freely. Furthermore, Cat|Dog
can be subsumed by
Cat|Dog|Zebra
immediatelyAs long as union types
are treated structurally, the subsumption would hold whether Cat
and Dog
are treated nominally or structurally.;
we will formalize this subtyping relationship when developing the language in
the sections below.
The disadvantages of our union types compared to ADTs
I certainly don't wish to claim life is greener on one side. There are several disadvantages of using union types in the manner we will develop in this language, including
- A lack of uniform tag names: as we will see later, the
is
expression requires that we attach to every value used in such an expression a runtime type tag. This means that the compiler must either employ some non-trivial machinery to determine all reachable values to which it should attach type tags, or attach a type tag to all runtime values. The cost of this work only increases with theextends
narrowing expression we introduce inlang_extends
, which effectively requires a typechecker in the program runtime. With ADTs, the costs associated with the addition of type tags can be made local and minimal -- tags are constructed and read at syntax-directed sites, making it obvious to a compiler where they should be inserted. - Recursive types are more difficult to formalize: given that our type system is structural rather than nominal, it is more difficult to typecheck recursive type definitionsIt actually wouldn't be terrible with the use of equirecursive types, but that's beyond the scope of this cc.. To keep things simple, our language has neither a notion of type variables nor type inference, so there is no way to introduce recursive types.
All this is to say, if I were implementing flow typing in a production language, I would likely use ADTs and perform narrowing on data constructors. But open unions are easier to bookkeep than ADTs, so we prefer them for the purposes of our small language. It should be pretty straightforward to translate our approach to a language with closed, nominal ADTs.
Notation
signifies a typing context, with some number of variable-to-type bindings. For example, is a typing context wherein the variable $\text{t}$ has type $\text{T}$.
means that under the context $\Gamma$, the expression $\text{t}$ has type $\text{T}$. For example,
is sound.A type derivation rule $\text{R}$ has the form
and can be read as, "when the conditions of $\text{R}$ hold, the result holds."
As an example, here is a derivation rule stipulating that the
addition of two expressions whose types under are nat
s
yields a nat
:
$\text{S} <: \text{T} % >$ states that $\text{S}$ is a subtype of $\text{T}$. $\require{cancel}\text{S} \cancel{<:} \text{T} % >$ states that $\text{S}$ is not a subtype of $\text{T}$.
lang_narrow
Having discussed some motivation and notation, we will now begin to formalize
lang_narrow
. We will derive its type system, discuss what is needed to
provide for flow typing, and go over a runtime for lang_narrow
when compiled to C.
Although we will present most of the interesting typing rules necessary to glue the system together, I don't attempt to cover all that would be necessary to prove the correctness of this language's type systemIn fact, I am not sure the type system is sound and complete, though "intuitively" it may be seen to be..
Basic types and their rules
Our language contains the primitive types nat
(a natural number), string
,
bool
, unknown
(the top typeThe top type is
inhabited by all values; it is the supertype of all types.), and never
(the bottom typeThe bottom type is inhabited by no values; it is the
subtype of all types.).
lang_narrow
supports functions and types them; for example,
fn add(a: nat, b: nat): nat { /* impl */ }
has the type (nat, nat): nat
. However, this syntactic representation
of function types is not admitted by the language front-end;
that is, no value
can be explicitly typed as a function. Since all parameters must be typed by the user,
this restriction effectively means that there are no higher-order functions in
our language. Adding language support for higher-order functions would be
straightforward (support for type influencing less soHistorically, type inference in the presence of subtypes has presented serious
implementation difficulties. In general, an efficient implementation of such
systems requires careful design of a language with syntax-directed typing rules.
For efficient implementations, see Pottier's 1998 thesis
and Dolan's 2017 thesis. Dolan's
thesis solves the even larger problem of type inference in the presence of subtyping and
polymorphism by treating types foremost as an algebra. A simpler, yet just-as-powerful version of Dolan's work is
described in Parreaux 2020.
Thanks to Ben Siraphob for introducing me to the latter two publications.).
Record types
Our language supports records; {a: 1, b: true}
is a record with fields a
and b
, and has the type
{a: nat, b: bool}
. Records can be projected by field name; {a: 1, b: true}.b
yields true
. Here are the typing rules for record construction and
projectionAfter reading about union types, you may notice that neither $\ref{Record-Proj}$ nor any other rule
permits the projection of a value of type {a: nat, b: nat}|{b: string}
on the field b
. As an exercise for you,
suppose we would like such a projection to yield the type nat|string
(in fact, the full lang_narrow
type system does so). What
modifications to $\ref{Record-Proj}$ must be made to support this? (You can also
add additional rules, but doing so is not necessary).:
We also define a subtyping rule for records:
Notice that in the subtyping relation, the order of record fields does not matter; holds.
Union types
A union type (also known as a sum type) is a type that may be inhabited by
values of several different types. As an example, in our language union types
are written like {a: bool}|nat|string
, of which {a: true}
, 1
, and "hi"
are all valid inhabitants. In lang_narrow
, variants of a union type U
are
subtypes of U
and are tagged by their type name.
All values of a type T
are also of the singleton union type consisting of T
;
we make no distinction between these two notions. Unions types containing no
type variants are equivalent to the never
type; we make no distinction between
these two either.
Notice that it is
impractical for us to deduce that a value of type T
is of a union type that is a
strict supertype of T
without explicit annotations. For example, we would
certainly prefer to type the literal 1
as a nat
, rather than nat|string
or any other
number of union types - indeed, the only two reasonable choices here are nat
or top
.
Nevertheless, we would like to express the fact that a nat
is a nat|string
; for example
fn id(a: nat|string): nat|string { a } id(1)
should be well-typed. Thus, we have the following subtyping rule for unions:
As with , the order of union variants does not matter; holds.
There are also a few convenience rules we may want to introduceThere are more than the two we present here; for example we could also have an
"unfold" rule that makes string|(bool|nat)
equivalent to string|bool|nat
.. For one,
notice that the type string|nat|string
is equivalent to string|nat
.
Furthermore, notice that string|never
is equivalent to string
, as never
is
inhabited by no values and therefore contributes nothing to the domain of the
union type.
Narrowed
type
The Next, let's formalize the semantics of flow typing in lang_narrow
and discuss
what data structures we need to represent these semantics.
As we saw in our initial example, the main idea of flow typing is to introduce a
condition at branch sites that narrows a value of type T
to a type U
in
one branch (let's call this the left branch), and to the type of T
"without"
U
in the other branch (let's call this the right branch).
In our language, we only provide if
expressions for branching, for which this
idea might look something like
// suppose Γ = {a: T} if <some check to see if a is a U> then ... // a is a subtype of U in this branch else ... // a is a subtype of (T "without" U) in this branch
We'll use a type constructor called Narrowed
to encapsulate the relationship
between an expression and its narrowed types in left and right branches. In the
example above, the <some check>
condition would yield the type
Narrowed⟨a, U, T-U⟩
, where T-U
signifies a type inhabited by values that
inhabit T
but do not inhabit U
This is one
interpretation of what we described as T
"without" U
above, but there are many
others. We have to be careful though, because only a few interpretations can be
proven sound!.
We have two things to figure out regarding Narrowed
types:
- figure out how to construct them
- figure out how they influence the types of branching expressions
Let's do the latter first.
Interaction with branching expressions
Take a look at the derivation rule for an if
expression with a boolean
condition:
For those unfamiliar, is called the join of $\text{T}$ and $\text{U}$, and represents their least upper bound (or least common supertype). As a derivation rule,
The only modification we need to make for the derivation rule of if
with a
Narrowed
condition is how the narrowed expression should be typed in the left
and right branches.
First, we verify that the narrowed term t
is in the typing context. If it
is not (for example, if it is a literal) the expression would not be
referenceable in either branch anyway; we assume this is a programming error.
Next, we validate that neither narrowed type of t
is never
; if it were, the
branch associated with that narrowed type would never be taken, and hence the
branch is vacuous; we assume this also is a programming error.
Then, we remove the binding for t
from
the typing context, determine the type of each branch with the addition of
the relevant narrowed type of t
to its context, and yield the type of the if
expression as the
join of both branches' types. Notice that the context of the resulting type
contains the unnarrowed type of t
.
The restrictions we have imposed, namely that t
must be a variable
and that neither narrowed type can evaluate to never
, are not actually
necessary for the soundness of the derivation rule. However, their inclusion
helps catch programming mistakes, and the restriction against
never
-narrowed branches effectively makes
applicable only when T
is a union type. This can simplify a typechecking
implementation significantly.
And that's all for Narrowed
types in branches. Now, onto how to construct
Narrowed
types!
is
narrowing expression
The In our language, the is
expression constructs a Narrowed
type from an
expression and a subtype of the expression type. For example, the expression
// suppose Γ = {a: string|nat|bool} a is string|bool
has the type Narrowed⟨a, string|bool, nat⟩
. We define the derivation rule for
is
expressions the following way:
where the subtraction rule $\text{T} - \text{U}$ is given by
Notice that elides from T
only those
types that equivalent to a variant of U
(up to permutation of record
fields). For example, {a: nat, b: bool} - {a: nat}
yields {a: nat, b: bool}
,
not never
.
If the restrictions of seem a bit limiting or unnecessary, we will discuss a
language that avoids them at the expense of greater runtime costs in lang_extends
.
However, I suspect these restrictions are reasonable for most general-purpose programming,
as it is always possible to refine a union type annotation with a type variant
that one may later want to narrow.
in
narrowing expression
The Consider the function
fn readBOrC(obj: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat}): nat { if obj is {a: nat, b: nat}|{b: nat, d: nat} then obj.b else obj.c }
This works, but is a bit bulky; we don't care that obj
is a union of records
possibly containing a
or d
, we just care whether or not it's a subtype of
{b: nat}
. Because excludes types by
equivalence rather than subtyping relation, if we were to try obj is {b: nat}
,
the left narrowing type would be never
. So to elegantly handle narrowing of
records, we can reformulate accordingly, as we do for lang_extends
, or introduce
a special case for records, as we will do here with the in
expression.
in
takes a field name f
and an expression of type T
,
constructing a Narrowed
instance with subtypes of T
that are records containing f
on the left
and all other subtypes of T
on the right. For example,
// suppose: obj :: {a: nat, b: nat}|{a: nat, c: nat}|{b: nat, d: nat}|{c: nat, d: nat} b in obj
has the type
Narrowed⟨obj, {a: nat, b: nat}|{b: nat, d: nat}, {a: nat, c: nat}|{c: nat, d: nat}⟩
We define the derivation rule for in
expressions the following way:
Notice that the extraction of record types containing the field $\text{f}$ distributes over the union type $\text{T}$, rather than operating on the type as a whole. This is similar to the behavior of TypeScript's distributed conditional types.
A runtime implementation
A typechecking implementation
of lang_narrow
is a pretty straightforward encoding of the presented
typing rules. What is more interesting is the implementation of expressions like
is
in the evaluation of a lang_narrow
program, as these require auxiliary
type information.
Actually, if the language is interpreted, evaluating type-dependent expressions is straightforward -- we reduce the expression-to-be-checked to a value, invoke the typechecker, and get our result.
But if the language is compiled,
we have a bit more work to do in encoding type tags and measuring type compatibility.
In the rest of this section, we will discuss this work for lang_narrow
programs compiled to C.
A sufficiently advanced compiler could determine precisely which values are used in expressions that read type tags, and attach tags only to those. For the ease of implementation, we generate C code with all values having type tags.
In our code generator,
we encode every value as an instance of a tagged_any
struct.
The struct includes a type tag that discerns a value's true type to
runtime functions, and val
field that contains the raw, type-safe value.
c
typedef struct tagged_any tagged_any;typedef struct record {int numFields;const char** fields;tagged_any* values;} record;union any {int nat;const char* string;int bool;record record;} any;struct tagged_any {const char* tag;union any val;};
The most interesting runtime function using tagged_any
s is the pseudo-polymorphic
print
:_print
writes values without a newline, and is useful for recurring when
printing a record. print
is a user-facing entry point that terminates a printed
value with a newline.
c
void _print(tagged_any any) {if (is(any, NAT)) {printf("%d", any.val.nat);} else if (is(any, STRING)) {printf("%s", any.val.string);} else if (is(any, BOOL)) {printf("%s", any.val.bool == 1 ? "true" : "false");} else if (is(any, RECORD)) {record r = any.val.record;printf("{");for (int i = 0; i < r.numFields; ++i) {printf("%s: ", r.fields[i]);_print(r.values[i]);if (i != r.numFields - 1) {printf(", ");}}printf("}");} else {fprintf(stderr, "Runtime error: no matching tag %s\n", any.tag);exit(1);}}void print(tagged_any any) {_print(any);printf("\n");}
Having encoded all values as tagged_any
s, performing is
and in
checks
becomes trivial. In the case of is
, we take an array of type tags
(constituting a union) and check if a value's type tag is in that array. If we
need to check a record, we verify that all field names coincide and
recurse on checking corresponding field values. In the case of in
, if a value is not a record the check is
immediately false; otherwise, we walk the fields of the record and check if one
corresponds with the predicated field.
Performance concerns, and what to do about them
The introduction of type tags and runtime typecheckers introduces costs that may not be immediately obvious to users of the language, and are something that we would like to avoid. Type tags on their own aren't so bad to deal with; since they are constant, one can inline them and the downstream C compiler will likely replace identical tags with a common reference to a static; if not, an interner can be added without much difficulty.
On the other hand, the costs incurred by evaluating is
expressions are harder
to deal with, primarily because of the explosive cost of checking deeply nested
records for equivalence.
To avoid this, perhaps the best strategy is to employ flow typing in a language with nominal types, rather than the structural ones we have presented in this language. This way, checks for type equivalence are simply the comparison of two type names (or even better, two memory addresses).
lang_extends
As mentioned earlier, it may appear that the restrictions imposed by and
are superfluous. In some sense they are; we can generalize over
both with a rule that constructs a Narrowed
type by testing against an
arbitrary subtype of a union.
Let's direct such a rule with an expression called extends
:
We'll define the exclusion the following way:
Now we can write the well-typed function
fn go(p: nat|bool|{a: nat, b: nat}|{b: nat, c: nat}): nat { if p extends {b: nat} then p.b else if p extends bool then if p then 1 else 0 else p }
using a single extends
expression where we would have had to use a combination
of in
and is
expressions in lang_narrow
.
Performance concerns, and what to do about them
Of course, the flexibility of does
not come without its costs. Like lang_narrow
, a lang_extends
runtime needs
to check type tags and read record fields, but the conditions of
also suggests that the runtime needs to carry around a subtyping checker!
Obviously, having both a compile-time and runtime subtyping checker is not ideal, and the runtime penalties can be severe especially in the presence of large union, record, or recursive types (if we had introduced them).
There are several techniques to mitigate the runtime penalties; one approach may be to populate, at compile time, a table of type relations amongst all known types in the program, and perform subtyping queries by searching the table. Costs of searches can be further reduced by storing the type relations in a tree and making queries union-finds.
My impression is that, in general, the power of a typing rule like is not worth the additional runtime complexity, cost, and code size its implementation must introduce.
Going further: an algorithm for narrowed contexts in imperative programs
The implementation of typing contexts in lang_narrow
(and lang_extends
) is
straightforward; since each program is one large
continuation, we never end up "jumping out" of the control flow (and the typing
context) we are currently in. Contrast this to an imperative version of
lang_narrow
, wherein we may write a program like
fn is_number_like(p: bool|{numberish: nat|string}|nat|string): bool { if p is bool then { return false; } var n: nat|string = if numberish in p then { // (1) p.numberish } else { p }; return if p is number then { true // (2) } else { String.regex_matches(RE_NUMBER, p) // (3) }; }
At (1)
we would like the typing context to associate p: {numberish: nat|string}|nat|string
,
at (2)
associate p: number
, and at (3)
associate p: string
.
It may not be obvious how to design an efficient algorithm to update the typing context in a control flow with the presence of jumps (like that of the first branch, which immediately returns). But actually, it's not so bad; I will give a quick sketch of a couple things you may want to consider, and leave you to fill out the algorithm if you are interested.
- First, define what it means for the typing context when the program hits a
local termination point (like a
return
statement, which jumps out of the local function); one idea is to then mark everything in the local context as uninhabited. - Keep track of the typing context after the completion of each statement in the
program. For example, if a statement
I
is of the formif Cond then L else R
, and the typing contexts ofL
andR
as induced byCond
are $\Gamma_{\tt L}$ and $\Gamma_{\tt R}$, respectively, what are $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$? What is the the typing context $\Gamma_{\text{post}({\tt I})}$ in relation to $\Gamma_{\text{post}({\tt L})}$ and $\Gamma_{\text{post}({\tt R})}$?
Summary
In this cc, we derived a simple language with support for structural type narrowing of record and union types via flow typing. Our language and its type system are small and intuitive enough to design and implement in an afternoon, but additional work not presented here is needed to formulate soundness and completeness of the type system.
To learn more about flow typing, you may be interested in the Wikipedia entry and Pearce's 2012 paper on a flow calculus with proofs of correctness.
As always, thanks for reading. I hope you had fun, and that this cc inspires you to design more interesting type systems.