- Premise
- Definitions
- Our source language
- Inference rules for patterns and match expressions
- Compilation model
- Adding support for type refinement
- Playground & Implementation
- Ad: Are you interested in compiling structural subtypes?
Premise
Today I want to share a simple method to achieve type refinement, a-la flow typing, in a language with anonymous sum types and pattern matching. (If you follow my posts, I know the existing set suggests that I've been on a flow typing kick for the past couple years. I promise I've been doing other stuff too - and there will be more about it soon!) The method I'll share
- relies on straightforward, syntax-directed rules
- supports compilation to either uniform representations of sum types, or non-uniform representations
- requires only a zero-cost, mechanical value transformation when compiling the type refinements to non-uniform representations of sum types (zero-cost in the sense that a user-written transformation in the source language would be no more efficient)
You may want to take a quick look at the playground to see what we'll end up with.
In this post, I won't elaborate on why you might want control-flow based typing in a language, or what you can do with it. For more on that, consider reading my last post on this topic.
I'm going to keep formalizations light and introduction of definitions heavy, without really attempting to state or prove certain semantic properties you might expect me to prove here - I hope you'll agree with me that the construction is not all that complicated, and such statements would clutter the discussion. (I mean no disappointment - as always, this is left to the reader :))
The method is derived from OCaml's refining `as` bindings of polymorphic variants, and differs only in that it supports compilation of anonymous sums to non-uniform memory layouts, which OCaml's standard compilation of polymorphic variants does not.
Okay, hopefully I've whetted your appetite. Let's get started!
Definitions
To clarify exactly what we're going to do, let's pin down a few definitions.
- By anonymous sum types, I mean sum types that do not have to be defined via nominal type declarations, and can be composed on-the-fly based on value usages. OCaml's polymorphic variants and TypeScript's union types fit this premise. In our language,

  ```haskell
  let operation : [Add, Sub, Mul, Div] = Mul
  ```

  defines a variable whose type consists of any of the four arithmetic operations `Add`, `Sub`, `Mul`, or `Div`, and whose concrete value is `Mul`. The "composed on-the-fly" bit comes from the idea that values without explicit annotations are inferred based on usages, for example

  ```nim
  let op = Add in # op : [Add]

  let x: [A, B] = A in
  let y = match x with # y : [M, N]
  | A -> M
  | B -> N
  ```

  Importantly, we will be discussing a language with unification-based inference and without subtyping semantics, so it is possible for the same variant literal to have different types. For example, in the program

  ```nim
  let x1: [A, B] = A in
  let x2: [A, B, C] = A in
  x2
  ```

  the first `A` is given type `[A, B]`, while the second is given `[A, B, C]`.

- Type refinement refers to the narrowing of a value's type based on the control flow of a program. In our language,

  ```nim
  let x: [A, B, C] = A in
  match x with
  | C -> A
  | A | B as y -> y
  ```

  the second branch in the match binds a variable `y` that refines `x` from the type `[A, B, C]` to `[A, B]`. As we'll see later, our system also allows type expansion, so you can bind a variable in a pattern that has a more general type than the matched value.

- A uniform representation of anonymous sum types means that all values of differently-shaped sum types have the same memory layout. For example, in OCaml, polymorphic variants are compiled in part by hashing the variant name, which in our language would mean that a `B` of type `[A, B]` would have the same runtime representation as a `B` of type `[B, C, D]`. Moreover, if a variant has arguments, then its representation of the arguments is also uniform (perhaps via boxing).

- A non-uniform representation of anonymous sum types means that differently-shaped sum types are compiled to different memory layouts. For example, the runtime representation of `A` in `[A, B]` need only be expressed via 1 bit to differentiate the variant, while `A` in `[A, B, C, D]` is differentiated via 2 bits. Compiling to non-uniform representations can generate more efficient and memory-compact code than uniform representations, since the memory layout generated for a type need only consider the data of that type. It does also mean that conversion of values between different types (for example `A` in the example above) may incur non-zero runtime cost when the memory layouts are materially different.
Our source language
The language we'll work with is first-order with only let bindings, ground anonymous variants, and match expressions. Let-bindings are immutable, and all constructs are defined in the standard way, as exemplified in the definitions above.
Variants
A variant expression is of the form $\tt{V}\ \tt{arg_1}\ \dots\ \tt{arg_n}$,
where zero or more arguments to the variant constitute the payload of the
variant. For example, `A` is a variant with no arguments, typed as `[A]`.
`Mul One Two` is a variant with two arguments, each of which is a zero-argument
variant, typed as `[Mul [One] [Two]]`. Application of arguments is
disambiguated with parentheses; for example, `Mul (Add One Two) (Sub One Zero)`
is typed as `[Mul [Add [One] [Two]] [Sub [One] [Zero]]]`. As such, the
type inference rule for a variant expression is the natural one.
Match expressions
A match expression pattern-matches a value over its variants, dispatching computation to the appropriately-matched branch. Match expressions must be exhaustive. For example,

```nim
let x : [A, B] = A in
match x with
| A -> C
| B -> D
```

would evaluate to `C`, since the runtime value of `x` would match the first branch.
Top-level patterns in a branch are defined by the grammar

$$\textit{branch-pattern} ::= \tt{p} ~\mid~ \tt{p}~\msf{as}~\tt{x}$$

where the second rule introduces a binding to a variable `x` that refines the
value being matched. (To keep things simple, we won't support nested refinement
patterns here. It's doable, and may be enjoyable to do so!) This is of course
the key of our discussion today, and we'll talk about it more in the next
section.
The grammar for $\tt p$ is defined by

$$\tt{p} ::= \tt{x} ~\mid~ \tt{\_} ~\mid~ \tt{V}~\tt{p_1}~\dots~\tt{p_n} ~\mid~ \tt{p_1}~\texttt{|}~\tt{p_2}$$

Here are some examples:

- `x` - matches any value and binds it to the variable `x`
- `_` - matches any value without binding it to a variable
- `A` - matches a zero-argument variant `A`
- `A (B C) (D E)` - matches a two-argument variant `A` with the arguments matching `B C` and `D E`, respectively
- `A B | B C` - an or-pattern that matches either `A B` or `B C`
- `A B | B C as x` - matches either `A B` or `B C`, and binds the refined type constituted in part by `[A [B], B [C]]` to `x`
Stuff left out
Function abstractions, type polymorphism, recursive types, and many other common language features have been left out, because they're not key to the idea of refinement discussed here.
Since our refinement is very localized to pattern matching, it is possible to both extend existing languages containing those features with the refinement described here, and extend our working language with those other features.
Inference rules for patterns and match expressions
The most interesting parts of match expressions are the inference rules for both branch patterns and the resulting value type of the branches, since this is the only place the "sum" aspect of our sum types plays a role.
Type inference for both is standard - to get the resulting type of a match
expression, we take the union of the sum types each branch returns. In
short, if
branches $\tt{b_1},\ \dots,\ \tt{b_n}$ yield values of type $\tt{t_1},\ \dots,
\tt{t_n}$, then the resulting type of the match expression is the unification
$\tt{t_1} ~\sim~ \dots ~\sim~ \tt{t_n}$, and moreover the type of each branch is
determined to be this unification, rather than each component type $\tt{t_i}$.
For example,
```nim
let x : [A, B] = A in
match x with
| A -> C # C is typed as [C, D]
| B -> D # D is typed as [C, D]
# => the result of this expression has type [C, D]
```
Don't worry, I would never leave you without the definition for unification:

$$
\begin{array}{l}
\tt{T_1} \sim \tt{T_2} ~=~ \big[~ \tt{V}~(\tt{a_1} \sim \tt{b_1})~\dots~(\tt{a_n} \sim \tt{b_n}) ~\big|~ \tt{V}~\tt{a_1}~\dots~\tt{a_n} \in \tt{T_1},~ \tt{V}~\tt{b_1}~\dots~\tt{b_m} \in \tt{T_2},~ \msf{assert}~n = m ~\big] \\
\qquad\qquad \cup~ \big[~ \tt{V}~\tt{a_1}~\dots~\tt{a_n} ~\big|~ \tt{V}~\tt{a_1}~\dots~\tt{a_n} \in \tt{T_1},~ \tt{V} \notin \tt{T_2} ~\big] \\
\qquad\qquad \cup~ \big[~ \tt{V}~\tt{b_1}~\dots~\tt{b_m} ~\big|~ \tt{V}~\tt{b_1}~\dots~\tt{b_m} \in \tt{T_2},~ \tt{V} \notin \tt{T_1} ~\big]
\end{array}
$$
Alright, that's a monster, I know. Something something a thousand words right? Anyway, the idea is that when we unify two types, we recursively unify the arguments of shared variants, where a variant is shared if it has the same name. For variants that differ in name between the two types, simply add them to the resulting unified type without modification.
At this point you might either totally get this or totally be sick of the notation, so let me show you a few examples.
```
[A] ~ [A] => [A]
[A, B] ~ [A, C] => [A, B, C]
[A [B]] ~ [A [C]] => [A [B, C]]
[A [B], D [E]] ~ [A [C], G [H]] => [A [B, C], D [E], G [H]]
```
Our unification is not total - for example, we cannot unify `[A [B]]` with
`[A [B] [C]]`, since the number of arguments to `A` differs. For
simplicity's sake we'll ignore this case in the rest of our discussion, and the
$\msf{assert}$ clause in the definition above pretends that type errors
cannot happen.
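To make this concrete, here's a small Python sketch of the unification, modeling a sum type as a dict from variant names to lists of argument types. (This encoding and the function name are my own for illustration, not part of the playground's implementation.)

```python
def unify(t1, t2):
    """Unify two sum types, each encoded as {variant_name: [arg_types]}.

    Shared variants (same name) have their arguments unified pointwise;
    variants appearing in only one type are carried over unchanged.
    """
    result = {}
    for name in {**t1, **t2}:
        if name in t1 and name in t2:
            # Mirrors the assert clause: shared variants must agree in arity.
            assert len(t1[name]) == len(t2[name]), f"arity mismatch on {name}"
            result[name] = [unify(a, b) for a, b in zip(t1[name], t2[name])]
        else:
            result[name] = t1.get(name, t2.get(name))
    return result

# [A [B], D [E]] ~ [A [C], G [H]] => [A [B, C], D [E], G [H]]
print(unify({"A": [{"B": []}], "D": [{"E": []}]},
            {"A": [{"C": []}], "G": [{"H": []}]}))
```

Running the last line reproduces the fourth example above; the other examples work the same way.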
This unification is also what we'll use in $\textit{or-patterns}$ to determine
the type of a variable bound in a $\textit{refinement-pattern}$, and can be used
to determine the type matched by all the branches in a match expression. (As an
aside, this brings up an interesting idea of "bidirectional exhaustiveness
checking". Typically, exhaustiveness checking makes sure that there are enough
branches to cover all the possible constructors of the value being matched on.
This is good enough when the constructors belong to a nominal type, but in the
presence of anonymous sum types, we might also want to make sure not to match
too much - this is a common pitfall with such sum types. To this end, we can
infer the type that the branches in a match expression expect, and then check
whether the value being matched on is exhaustive relative to the branch types!)
Compilation model
Feel free to choose your own adventure for compiling match expressions. I'm
a fan of flowcharts so I always turn to compiling match
expressions via switch-based
decision trees a-la Maranget. There
are many good resources for compiling pattern matches to decision trees,
foremost Maranget's resources, so I'll just talk about the compilation model
for sum-typed values.
As promised, we'll compile our anonymous sum types to compact, non-uniform, unboxed memory layouts on a per-type basis. A sum type can be represented as a struct consisting of
- a minimally-sized integer to distinguish its variants
- a C-style union of structs representing each variant's arguments' memory layouts
For a sum type with multiple variants, each of which has at least one argument, this scheme minimizes the required runtime representation. As some examples,
- `[A, B]` can be compiled to a one-bit representation. One bit is used to distinguish the variants, and the variants' arguments are zero-sized.
- `[A, B, C, D]` can be compiled to a two-bit representation. Two bits are used to distinguish the variants, and the variants' arguments are zero-sized.
- `[A [B, C] [B, C], D [E, F, G, H] [I, J]]` can be compiled to a four-bit representation, of the shape

  ```
  { tag_id: 1 bit, argument_union: union { { 1 bit, 1 bit }, { 2 bit, 1 bit } } }
  ```

  where the `argument_union` consists of the compiled representations of `[B, C] [B, C]` and `[E, F, G, H] [I, J]`, respectively. So, one bit is used to distinguish the variants, and the largest component of the union is three bits.
Of course, in practice, you'd likely want to compile all of these things sized as at least one byte, but the general idea still holds and scales for values that aren't as small-sized as the variants in this language.
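As a rough sketch of this sizing scheme, here's how one might compute the bit width of a type's compact layout in Python, using the same dict encoding of sum types as in the unification sketch and ignoring byte rounding and alignment. (The function name and encoding are mine, not the playground's.)

```python
def layout_bits(sum_type):
    """Bit width of the compact layout of a sum type {variant: [arg_types]}.

    The layout is a minimally-sized tag plus a union of the variants'
    payloads, where a variant's payload is the concatenation of its
    arguments' layouts.
    """
    # ceil(log2(n)) bits distinguish n variants; a singleton needs no tag.
    tag = (len(sum_type) - 1).bit_length()
    payload = max((sum(layout_bits(arg) for arg in args)
                   for args in sum_type.values()), default=0)
    return tag + payload

# [A [B, C] [B, C], D [E, F, G, H] [I, J]]: 1 tag bit + max(1+1, 2+1) = 4 bits
bc = {"B": [], "C": []}
print(layout_bits({"A": [bc, bc],
                   "D": [{"E": [], "F": [], "G": [], "H": []},
                         {"I": [], "J": []}]}))  # 4
```

Note that `layout_bits({"A": []})` comes out to 0, matching the observation below that the singleton `[A]` can be compiled to a zero-sized struct.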
There are many further optimizations you could imagine here! For example, one is
that the singleton sum type `[A]` needs no material memory representation at
runtime, and can be compiled to a zero-sized struct. There are others; feel free
to come up with some! (If you are interested in this, check out Rust's and
Roc's optimizations for sum types.)
Adding support for type refinement
To add support for type refinement, we need to do two things:
- Add type inference rules to determine what type a refined variable should take on
- When compiling to non-uniform memory representations, generate code to transform the value being matched on into the refined value, since they may have different memory layouts. (If we're compiling to a uniform memory representation, this step isn't necessary, and we just need to emit code to bind the refined variable to the matched value. But that's not interesting, so we'll talk about the non-uniform case.)
Inference rules
When we see a $\textit{refinement-pattern}$ $|~\tt p \msf{\ as\ }\tt x$, we introduce the type binding $\tt{x} \colon \msf{typeof}\left( \tt p \right)$ to our type environment, where $\msf{typeof}$ infers the type of the (possibly-or-patterned) pattern $\tt p$ following the type inference rules we described previously.
Usages of $\tt x$ in the branch body add additional unification constraints as standard, which means that the refined type may end up larger than $\msf{typeof}\left(\tt p\right)$. This means that we can perform "type expansion", not only type refinement - more on that below.
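Here's a small Python sketch of $\msf{typeof}$ for constructor and or-patterns, reusing the dict encoding of sum types and the pointwise unification from earlier sketches. Variable and wildcard patterns are omitted, since their types come from the value being matched; the pattern encoding and names are mine for illustration.

```python
def unify(t1, t2):
    """Merge two sum types, unifying the arguments of shared variants."""
    out = {}
    for name in {**t1, **t2}:
        if name in t1 and name in t2:
            assert len(t1[name]) == len(t2[name])
            out[name] = [unify(a, b) for a, b in zip(t1[name], t2[name])]
        else:
            out[name] = t1.get(name, t2.get(name))
    return out

def pattern_type(p):
    """typeof(p), with patterns encoded as
    ("ctor", name, [sub_patterns]) or ("or", p1, p2)."""
    if p[0] == "ctor":
        _, name, subs = p
        return {name: [pattern_type(s) for s in subs]}
    if p[0] == "or":
        _, p1, p2 = p
        return unify(pattern_type(p1), pattern_type(p2))
    raise ValueError(f"unsupported pattern kind: {p[0]}")

# typeof(A B | B C) => [A [B], B [C]], as in the pattern examples earlier
p = ("or", ("ctor", "A", [("ctor", "B", [])]),
           ("ctor", "B", [("ctor", "C", [])]))
print(pattern_type(p))
```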
That's it for typing. I know, maybe not so interesting, right? But it composes with other language features quite nicely, allowing you to refine types without needing control-flow-based markers or variables like flow-typing schemes use. As mentioned in the beginning, OCaml supports this feature to refine polymorphic variants, and the standard compiler uses the same inference scheme.
Type expansion
Type expansion means we can write a program
```nim
let x : [A, B, C] = A in
match x with
| A | B as y ->
  let z : [D, E] = D in
  match z with
  | D -> y
  | E -> F
| C -> C
```

where the inferred type of `y` (and, in turn, of the whole expression) will be resolved to `[A, B, C, F]`.
When is this useful? Well, here's a real example I've encountered in Roc.
Suppose I have the following program, in some fictional language that hopefully
resembles one you're familiar with:
```nim
parseJsonItem : String -> { result: [Ok Json, Error [TooShort]], rest: String }

parseStringToJsonCompletely : String -> [Ok Json, Error [TooShort, Leftover String]]
parseStringToJsonCompletely(bytes, parser) =
  {result, rest} = parseJsonItem bytes parser
  if rest.isEmpty then
    match result with                    # <<
    Ok json -> Ok json                   # <<
    Error TooShort -> Error TooShort     # <<
  else
    Error (Leftover rest)
```

The lines annotated with `# <<` might feel a bit redundant, right? In a language
with no subtyping semantics, implicit row polymorphism, or non-uniform memory
layout coercions, we'd need to do this unwrapping and re-wrapping because the sum type of
`Error`'s argument returned by `parseJsonItem`, `[TooShort]`, is not the same as
the sum type expected as `Error`'s argument from `parseStringToJsonCompletely`.
Wouldn't it be nice if we could emulate subtyping-like semantics here, and have the compiler do the mechanical transformation instead? Type expansion provides one solution, allowing us to write instead
```diff
 parseStringToJsonCompletely : String -> [Ok Json, Error [TooShort, Leftover String]]
 parseStringToJsonCompletely(bytes, parser) =
   {result, rest} = parseJsonItem bytes parser
   if rest.isEmpty then
     match result with
-    Ok json -> Ok json
-    Error TooShort -> Error TooShort
+    _ as expandedResult -> expandedResult
   else
     Error (Leftover rest)
```
Of course, this has its own set of tradeoffs - for one, it might not be immediately obvious exactly what this does. There are many other solutions, including implicit conversions or using a subtyping semantics instead. Nevertheless, I think this idea of type expansion is a useful tool to consider for languages that have anonymous sum types and whose values are frequently translated into superset sums.
Generating transformations
As a running example, we'll consider the program
```nim
let x : [A1 [B, C], A2 [B], A3 [B, C, D]] = A1 B in
let z = match x with
  # Keep only the values that have `B` as the argument
  | A1 B | A2 B | A3 B as y -> y
  | A1 C | A3 C | A3 D -> NoB
in z
```

where `y` is transformed to the partially-refined `[A1 [B], A2 [B], A3 [B], NoB]`.
Let's say we refine or expand a matched value of type $\msf{T}$ to a variable of type $\msf{R}$. Assuming that the program has no type errors, I claim that the set of the values of type $\msf{T}$ that we need to transform into values of type $\msf{R}$ is determined entirely by the set of values that inhabit the type $\msf{T} \cap \msf{R}$, where the intersection of the two types forms a sum type recursively keeping only shared variants.
I hope that you can see, perhaps through considering the example above or drawing some others, that this holds for both the case of refinement and expansion. If you are interested in doing so, this would be a good point to formalize and prove this property.
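For illustration, here's what the intersection looks like under the dict encoding of sum types used in the earlier sketches (again, the encoding and names are mine):

```python
def intersect(t1, t2):
    """T ∩ R: keep only the variants present in both sum types,
    recursively intersecting their arguments."""
    out = {}
    for name in t1:
        if name in t2:
            assert len(t1[name]) == len(t2[name])
            out[name] = [intersect(a, b)
                         for a, b in zip(t1[name], t2[name])]
    return out

# With T and R from the running example, T ∩ R = [A1 [B], A2 [B], A3 [B]]:
T = {"A1": [{"B": [], "C": []}], "A2": [{"B": []}],
     "A3": [{"B": [], "C": [], "D": []}]}
R = {"A1": [{"B": []}], "A2": [{"B": []}], "A3": [{"B": []}], "NoB": []}
print(intersect(T, R))
```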
Looking at the running example, there are a few ways one could perform the target transformation in the source language directly. I'll show one very explicit one that performs the transformation under the matched branch.
```nim
let x : [A1 [B, C], A2 [B], A3 [B, C, D]] = A1 B in
let z = match x with
  | A1 B | A2 B | A3 B ->
    let y : [A1 [B], A2 [B], A3 [B], NoB] =
      match x with
      | A1 arg -> match arg with
        | B -> A1 B
        | C -> @unreachable
      | A2 arg -> A2 arg
      | A3 arg -> match arg with
        | B -> A3 B
        | C -> @unreachable
        | D -> @unreachable
    in y
  | A1 C | A3 C | A3 D -> NoB
in z
```
The key part of the transformation is that we need to nestedly unpack and repack all the sum types of the matched type $\msf{T}$ that are not equal to the corresponding sum type in $\msf{R}$. Said another way, the only nested sums that we do not need to unpack and repack are those whose shape in $\msf{T} \cap \msf R$ is equal to the shape in both $\msf T$ and $\msf R$. That's because those sum types are the only ones whose values will have the same memory layouts in both the matched and refined value.
In the example transformation above, the only value we do not need to unpack and
repack is the `arg` matched in the innermost `A2` branch, since it has type
`[B]` in both the source and target type.
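The unpack-and-repack can be mimicked as a runtime function over symbolic values. Below is a Python sketch, with values encoded as `(variant_name, [arg_values])` pairs and sum types as dicts, as in the earlier sketches; it's an interpretive stand-in for the code a compiler would generate, and all names are mine.

```python
def transmute(value, src, dst):
    """Repack a value of sum type `src` into the layout of sum type `dst`.

    Arguments whose source and destination sum types are identical are
    kept as-is (like `A2`'s argument above); all others are recursively
    unpacked and repacked. Branches the generated code would mark
    @unreachable cannot fire for well-typed inputs.
    """
    name, args = value
    assert name in dst, "unreachable: variant ruled out by the match"
    new_args = [a if s == d else transmute(a, s, d)
                for a, s, d in zip(args, src[name], dst[name])]
    return (name, new_args)

T = {"A1": [{"B": [], "C": []}], "A2": [{"B": []}],
     "A3": [{"B": [], "C": [], "D": []}]}
R = {"A1": [{"B": []}], "A2": [{"B": []}], "A3": [{"B": []}], "NoB": []}
# A1 B of type T is repacked as A1 B of type R:
print(transmute(("A1", [("B", [])]), T, R))
```

A real compiler would emit this logic per type pair at compile time, as concrete memory moves between the two layouts, rather than interpreting it at runtime.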
This transformation is straightforward to generate for our source language. And,
since the transformation can be entirely compiler-directed, we do not
need to expose a marker like `@unreachable` in the source language.
Here's a sample implementation of the transformation. Oh yeah, let me let you play with the language!
Playground & Implementation
I've made a playground for a concrete implementation of this language, which you can also find embedded below! The playground links to the source code and includes
- a type checker
- a compiler to the intermediate representation (which should be natural to translate to machine code for your favorite ISA)
- an evaluator of the IR emulating a heap with the compact memory representation we described above
You can hover over types in the input buffer of the playground to see inferred types.
I hope you have fun with it, and that this post has been enjoyable!
Ad: Are you interested in compiling structural subtypes?
If you know of ways, or are interested in, compiling languages based on structural subtyping to non-uniform representations, please email me! This has been an ongoing project of mine and I hope to write about it soon, and would love to chat with others interested.
In particular, the idea here is to combine
- the programming flexibility of structural subtyping
- (possibly-non-principal, possibly-incomplete) type inference, a-la unification or supremum-based inference (maybe even MLsub)
- row polymorphism and polymorphic variants
to design subtyping-based source languages with compilation to target languages that
- have non-uniform, unboxed representations
- do not require introducing implicit conversions at re-binding or usage sites