Simple Refinement of Anonymous Sum Types in Pattern Matches

Premise
Definitions
Our source language
Inference rules for patterns and match expressions
Compilation model
Adding support for type refinement
- Inference rules
  - Type expansion
- Generating transformations
Playground & Implementation
Ad: Are you interested in compiling structural subtypes?

Premise

Today I want to share a simple method to achieve type refinement, a-la flow typing, in a language with anonymous sum types and pattern matching.⊕If you follow my posts, I know the existing set suggests that I've been on a flow typing kick for the past couple years. I promise I've been doing other stuff too - and there will be more about it soon! The method I'll share

relies on straightforward, syntax-directed rules
supports compilation to either uniform representations of sum types, or non-uniform representations
compiling the type refinements to non-uniform representations of sum types requires a zero-costin the sense that a user-written transformation in the source language would be no more efficient, mechanical value transformation.

You may want to take a quick look at the playground to see what we'll end up with.

In this cc, I won't elaborate on why you might want control-flow based typing in a language, or what you can do with it. For more on that, consider reading my last post on this topic.

I'm going to keep formalizations light and introduction of definitions heavy, without really attempting to state or prove certain semantic properties you might expect me to prove here - I hope you'll agree with me that the construction is not all that complicated, and such statements would clutter the discussion.I mean no disappointment - as always, this is left to the reader :)

The method is derived from OCaml's conversion bindings of polymorphic variants, and differs only in that it supports compilation of anonymous sums to non-uniform memory layouts, which differs from OCaml's standard compilation of polymorphic variants.

Okay, hopefully I've whet your appetite. Let's get started!

Definitions

To clarify exactly what we're going to do,

By anonymous sum types, I mean sum types that do not have to be defined via nominal type declarations, and can be composed on-the-fly based on value usages. OCaml's polymorphic variants and TypeScript's union types fit this premise. In our language,
```
haskell
let operation : [Add, Sub, Mul, Div] = Mul
```
defines a variable whose type consists of any of the four arithmetic operations Add, Sub, Mul, or Div, and its concrete value is Mul.

The "composed on-the-fly" bit comes from the fact idea values without explicit annotations are inferred based on usages, for example
```
nim
let op = Add in # op : [Add]
let x: [A, B] = A in
let y = match x with # y : [M, N]
  | A -> M
  | B -> N
```
Importantly, we will be discussing a language with unification-based inference and without subtyping semantics, so it is possible for the same variant literal to have different types. For example, in the program
```
nim
let x1: [A, B] = A in
let x2: [A, B, C] = A in
x2
```
the first A is given type [A, B], while the second is given [A, B, C].
type refinement refers to the narrowing of a value's type based on the control flow of a program. In our language,
```
nim
let x: [A, B, C] = A in
match x with
| C -> A
| A | B as y -> y
```
the second branch in the match binds a variable y that refines x from the type [A, B, C] to [A, B].

As we'll see later, our system also allows type expansion, so you can bind a variable in a pattern that has a more general type than the matched value.
A uniform representation of anonymous sum types means that all values of differently-shaped sum types have the same memory layout. For example, in OCaml, polymorphic variants are compiled in part by hashing the variant name, which in our language would mean that a B of type [A, B] would have the same runtime representation as a B of type [B, C, D]. Moreover, if a variants has arguments, then its representation of the arguments is also uniform (perhaps via boxing).
A non-uniform representation of anonymous sum types means that differently-shaped sum types are compiled to different memory layouts. For example, the runtime representation of A in [A, B] need only be expressed via 1 bit to differentiate the variant, while A in [A, B, C, D] is differentiated via 2 bits. Compiling to non-uniform representations can generate more efficient and memory-compact code than uniform representations, since the memory layout generated for a type need only consider the data of that type. It does also mean that conversion of values between different types (for example A in the example above) may incur non-zero runtime cost, when the memory layouts are materially different.

Our source language

The language we'll work with is first-order with only let bindings, ground anonymous variants, and match expressions. Let-bindings are immutable, and all constructs are defined in the standard way, as exemplified in the definitions above.

Variants

A variant expression is of the form $\tt{V}\ \tt{arg_1}\ \dots\ \tt{arg_n}$, where zero or more arguments to the variant constitute the payload of the argument. For example, A is a variant with no arguments, typed as [A]. Mul One Two is a variant with two arguments, each of which is a zero-argument variant, typed as [Mul [One] [Two]]. Application of arguments are disambiguated with parentheses; for example, Mul (Add One Two) (Sub One Zero) is typed as [Mul [Add [One] [Two]] [Sub [One] [Zero]]]. As such, the type inference rule for a variant expression is the natural one.

Match expressions

A match expression pattern-matches a value over its variants, dispatching computation to the appropriately-matched branch. Match expressions must be exhaustive. For example,

nim
let x : [A, B] = A in
match x with
| A -> C
| B -> D

would evaluate to A, since the runtime value of x would match the first branch. Top-level patterns in a branch are defined by the grammar

$\renewcommand{\sfdefault}{lmtt} \begin{align*} \texttt{top-pattern} ::&= &&\textbf{top-level pattern}\\ &|~\tt p &&\textit{pattern}\\ &|~\tt p \msf{\ as\ }\tt x &&\textit{refinement-pattern} \end{align*}$

where the second rule introduces a binding to a variable x that refines the value being matchedTo keep things simple, we won't support nested refinement patterns here. It's doable, and may be enjoyable to do so!. This is of course the key of our discussion today, and we'll talk about it more in the next section.

The grammar for $\tt p$ is defined by

$\renewcommand{\sfdefault}{lmtt} \begin{align*} \tt p ::&= &&\textbf{pattern}\\ &|~\tt \_ &&\textit{any-discard-pattern}\\ &|~\tt x &&\textit{any-bind-pattern}\\ &|~\tt{V}\ \tt{p_1}\ \dots\ \tt{p_n} &&\textit{variant-pattern}\\ &|~\texttt{p_1} \msf{~|~} \texttt{p_2} &&\textit{or-pattern}\\ &|~\msf{(}\texttt{p}\msf{)} &&\textit{paren-pattern} \end{align*}$

Here are some examples:

x - matches any pattern and binds it to the variable x
_ - matches any pattern without binding to a variable
A - matches a zero-argument variant A
A (B C) (D E) - matches a two-argument variant A with the arguments matching B C and D E, respectively
A B | B C - an or-pattern that matches either A B or B C
A B | B C as x - matches either A B or B C, and binds the refined type constituted in part by [A [B], B [C]] to x.

Stuff left out

Function abstractions, type polymorphism, recursive types, and many other common language features have been left out, because they're not key to the idea of refinement discussed here.

Since our refinement is very localized to pattern matching, it is possible to both extend existing languages containing those features with the refinement described here, and extend our working language with those other features.

Inference rules for patterns and match expressions

The most interesting parts of match expressions are the inference rules for both branch patterns and the resulting value type of the branches, since this is the only place the "sum" aspect of our sum types play a role.

Type inference for both is standard - to get the resulting type of a match expression, we take the union of the sum types each branch returns. In short, if branches $\tt{b_1},\ \dots,\ \tt{b_n}$ yield values of type $\tt{t_1},\ \dots,
\tt{t_n}$, then the resulting type of the match expression is the unification $\tt{t_1} ~\sim~ \dots ~\sim~ \tt{t_n}$, and moreover the type of each branch is determined to be this unification, rather than each component type $\tt{t_i}$.

For example,

nim
let x : [A, B] = A in
match x with
  | A -> C  # C is typed as [C, D]
  | B -> D  # D is typed as [C, D]
# => the result of this expression has type [C, D]

Don't worry, I would never leave you without the definition for unification:

$\begin{align*} &\left[\tt{V_1} \tt{t_{11}}\ \dots\ \tt{t_{1{l_1}}},\ \dots,\ \tt{V_n} \tt{t_{n1}}\ \dots\ \tt{t_{n{l_n}}}\right] \sim \left[\tt{Q_1} \tt{u_{11}}\ \dots\ \tt{u_{1{k_1}}},\ \dots,\ \tt{Q_m} \tt{u_{m1}}\ \dots\ \tt{u_{m{k_m}}}\right]\\ \Rightarrow&\\ &\left[ \tt{V_i} (\tt{t_{i1}} \sim \tt{u_{j1}})\ \dots\ (\tt{t_{i{l_i}}} \sim \tt{u_{j{l_j}}}) ~|~ \tt{V_i} = \tt{Q_j} \land \msf{assert(}\tt{l}_i \msf{=} \tt{k}_j\msf{)} ~|~ (i,j) \in (1\dots n, 1\dots m)\right]\\ ++&\left[ \tt{V_i} \tt{t_{i1}}\ \dots\ \tt{t_{i{l_i}}} ~|~ \forall j \in 1\dots m. V_i \neq Q_j ~|~ i \in 1\dots n\right]\\ ++&\left[ \tt{Q_j} \tt{t_{j1}}\ \dots\ \tt{t_{j{k_j}}} ~|~ \forall j \in 1\dots n. Q_j \neq V_i ~|~ j \in 1\dots m\right]\\ \end{align*}$

Alright, that's a monster, I know. Something something a thousand words right? Anyway, the idea is that when we unify two types, we recursively unify the arguments of shared variants, where a variant is shared if it has the same name. For variants that differ in name between the two types, simply add them to the resulting unified type without modification.

At this point you might either totally get this or totally be sick of the notation, so let me show you a few examples.

[A] ~ [A]                       => [A]
[A, B] ~ [A, C]                 => [A, B, C]
[A [B]] ~ [A [C]]               => [A [B, C]]
[A [B], D [E]] ~ [A [C], G [H]] => [A [B, C], D [E], G [H]]

Our unification is not total - for example, we cannot unify [A [B]] with [A [B] [C]], since the number of arguments to A differ. For simplicity's sake we'll ignore this case in the rest of our discussion, and the $\msf{assert}$ clause in the definition above pretends that type errors cannot happen.

This unification is also what we'll use in $\textit{or-patterns}$ to determine the type of a variable bound in $\textit{refinement-pattern}$, and can be used to determine the type matched by all the branches in a match expressionAs an aside, this brings up an interesting idea of "bidirectional exhaustiveness checking". Typically, exhaustiveness checking makes sure that there enough branches to cover all the possible constructors of the value being matched on. This is good enough when the constructors belong to a nominal type, but in the presence of anonymous sum types, we might also want to make sure not to match too much - this is a common pitfall with such sum types. To this end, we can infer the type that the branches in a match expression expect, and then check whether the value being matched on is exhaustive relative to the branch types!.

Compilation model

Feel free to choose your own adventure for compiling match expressions. I'm a fan of flowcharts so I always turn to compiling match expressions via switch-based decision trees a-la Maranget. There are many good resources for compiling pattern matches to decision trees, foremost Maranget's resources, so I'll just talk about the compilation model for sum-typed values.

As promised, we'll compile our anonymous sum types to compact, non-uniform, unboxed memory layouts on a per-type basis. A sum type can be represented as a struct consisting of

a minimally-sized integer to distinguish its variants
a C-style union of structs representing each variant's arguments' memory layouts

For a sum type with multiple variants, each of which has at least one argument, this scheme minimizes the required runtime representation. As some examples,

[A, B] can be compiled to a one-bit representation. One bit is used to distinguish the variants, and the variants' arguments are zero-sized.
[A, B, C, D] can be compiled to a two-bit representation. Two bits are used to distinguish the variants, and the variants' arguments are zero-sized.
[A [B, C] [B, C], D [E, F, G, H] [I, J]] can be compiled to a four-bit representation, of the shape
```
{ tag_id: 1 bit,
  argument_union:
    union { { 1 bit, 1 bit },
            { 2 bit, 1 bit } }
}
```
where the argument_union consists of the compiled representations of [B, C] [B, C] and [E, F, G, H] [I, J], respectively. So, one bit is used to distinguish the variants, and the largest component of the union is three bits.

Of course, in practice, you'd likely want to compile all of these things sized as at least one byte, but the general idea still holds and scales for values that aren't as small-sized as the variants in this language.

There are many further optimizations you could imagine here! For example, one is that the singleton sum type [A] need not any material memory representation at runtime, and can be compiled to a zero-sized struct. There are others; feel free to come up with some!⊕If you are interested in this, check out Rust's and Roc's optimizations for sum types.

To add support for type refinement, we need to do two things:

Add type inference rules to determine what type a refined variable should take on
When compiling to non-uniform memory representations, the compiler needs to generate code to transform the value being matched on into the refined value, since they may have different memory layouts.
- If we're compiling to a uniform memory representation, this step isn't necessary, and we just need to emit code to bind the refined variable to the matched value. But that's not interesting so we'll talk about the non-uniform case.

Inference rules

When we see a $\textit{refinement-pattern}$ $|~\tt p \msf{\ as\ }\tt x$, we introduce the type binding $\tt{x} \colon \msf{typeof}\left( \tt p \right)$ to our type environment, where $\msf{typeof}$ infers the type of the (possibly-or-patterned) pattern $\tt p$ following the type inference rules we described previously.

Usages of $\tt x$ in the branch body add additional unification constraints as standard, which means that the refined type may end up larger than $\msf{typeof}\left(\tt p\right)$. This means that we can perform "type expansion", not only type refinement - more on that below.

That's it for typing. I know, maybe not so interesting right? But it composes with other language features quite nicely, allowing you to refine types without needing control-flow-based markers or variables like flow-typing schemes use. As mentioned in the beginning, Ocaml supports this feature to refine polymorphic variants, and the standard compiler uses the same inference scheme.

Type expansion

Type expansion means we can write a program

nim
let x : [A, B, C] = A in
match x with
| A | B as y ->
  let z : [D, E] = D in
  match z with
  | D -> y
  | E -> F
| C -> C

where the inferred type of y (and then z) will be resolved to [A, B, C, F]. When is this useful? Well, here's a real example I've encountered in Roc. Suppose I have the following program, in some fictional language that hopefully resembles one you're familiar with:

nim
parseJsonItem : String -> { result: [Ok Json, Error [TooShort]], rest: String }
parseStringToJsonCompletely : String -> [Ok Json, Error [TooShort, Leftover String]]
parseStringToJsonCompletely(bytes, parser) =
  {result, rest} = parseJsonItem bytes parser
  if rest.isEmpty then
    match result with                  # <<
      Ok json -> Ok json               # <<
      Error TooShort -> Error TooShort # <<
  else
    Error (Leftover rest)

The lines annotated with # << might feel a bit redundant, right? In a language with no subtyping semantics, implicit row polymorphism, or non-uniform memory layout coercions, we'd need to do this unwrapping and re-wrapping because the sum type of Error's argument returned by parseJsonItem, [TooShort], is not the same as the sum type expected to be returned by as Error's argument from parseStringToJsonCompletely.

Wouldn't it be nice if we could emulate subtyping-like semantics here, and have the compiler do the mechanical transformation instead? Type expansion provides one solution, allowing us to write instead

diff
  parseStringToJsonCompletely : String -> [Ok Json, Error [TooShort, Leftover String]]
  parseStringToJsonCompletely(bytes, parser) =
    {result, rest} = parseJsonItem bytes parser
    if rest.isEmpty then
      match result with
-      Ok json -> Ok json
-      Error TooShort -> Error TooShort
+      _ as expandedResult -> expandedResult
    else
      Error (Leftover rest)

Of course, this has its own set of tradeoffs - for one, it might not be immediately obvious exactly what this does. There are many other solutions, including implicit conversions or using a subtyping semantics instead. Nevertheless, I think this idea of type expansion is a useful tool to consider for languages that have anonymous sum types and whose values are frequently translated into superset sums.

Generating transformations

As a running example, we'll consider the program

nim
let x : [A1 [B, C], A2 [B], A3 [B, C, D]] = A1 B in
let z = match x with
  # Keep only the values that have `B` as the argument
  | A1 B | A2 B | A3 B as y -> y
  | A1 C | A3 C | A3 D -> NoB
in z

where y is transformed to the partially-refined [A1 [B], A2 [B], A3 [B], NoB].

Let's say we refine or expand a matched value of type $\msf{T}$ to a variable of type $\msf{R}$. Assuming that the program has no type errors, I claim that the set of the values of type $\msf{T}$ that we need to transform into values of type $\msf{R}$ is determined entirely by the set of values that inhabit the type $\msf{T} \cap \msf{R}$, where the intersection of the two types forms a sum type recursively keeping only shared variants.

I hope that you can see, perhaps through considering the example above or drawing some others, that this holds for both the case of refinement and expansion. If you are interested in doing so, this would be a good point to formalize and prove this property.

Looking at the running example, there are a few ways one could perform the target transformation in the source language directly. I'll show one very explicit one that performs the transformation under the matched branch.

nim
let x : [A1 [B, C], A2 [B], A3 [B, C, D]] = A1 B in
let z = match x with
  | A1 B | A2 B | A3 B ->
    let y : [A1 [B], A2 [B], A3 [B], NoB] = match x with
      | A1 arg -> match arg with
        | B -> A1 B
        | C -> @unreachable
      | A2 arg -> A2 arg
      
      | A3 arg -> match arg with
        | B -> A3 B
        | C -> @unreachable
        | D -> @unreachable
    y
  | A1 C | A3 C | A3 D -> NoB
in z

The key part of the transformation is that we need to nestedly unpack and repack all the sum types of the matched type $\msf{T}$ that are not equal to the corresponding sum type in $\msf{R}$. Said another way, the only nested sums that we do not need to unpack and repack are those whose shape in $\msf{T} \cap \msf R$ is equal to the shape in both $\msf T$ and $\msf R$. That's because those sum types are the only ones whose values will have the same memory layouts in both the matched and refined value.

In the example transformation above, the only value we do not need to unpack and repack is the arg matched in the innermost A2 branch, since it has type [B] in both the source and target type.

This transformation is straightforward to generate for our source language. And, since the transformation can be entirely compiler-directed, we do not need to expose a marker like @unreachable in the source language.

Here's a sample implementation of the transformation. Oh yeah, let me let you play with the language!

Playground & Implementation

I've made a playground for a concrete implementation of this language, which you can also find embedded below! The playground links to the source code and includes

a type checker
a compiler to the intermediate representation (which should be natural to translate to machine code for your favorite ISA)
an evaluator of the IR emulating a heap with the compact memory representation we described above

You can hover over types in the input buffer of the playground to see inferred types.

I hope you have fun with it, and that this cc has been enjoyable!

Ad: Are you interested in compiling structural subtypes?

If you know of ways, or are interested in, compiling languages based on structural subtyping to non-uniform representations, please email me! This has been an ongoing project of mine and I hope to write about it soon, and would love to chat through with others interested.

In particular, the idea here is to combine

the programming flexibility of structural subtyping
(possibly-non-principal, possibly-incomplete) type inference, a-la unification or supremum-based inference (maybe even MLsub)
row polymorphism and polymorphic variants

to design subtyping-based source languages with compilation to target languages that

have non-uniform, unboxed representations
does not require introducing implicit conversions at re-binding or usage sites