Are tagged unions overrated?
Recently, Nelson Elhage published a newsletter post Tagged unions are overrated. In the post, he describes some of the problems with tagged unions and pattern matching in existing languages, and how you can do a good enough job at having language ASTs and IRs without using tagged unions.
(I recommend reading the linked post before proceeding.)
Now, Nelson is an experienced engineer who has worked on many different things; in particular, he has worked on Sorbet, a type-checker for Ruby written in C++ that is pretty fast.
While I do not have nearly as much experience as Nelson, I’ve been working on a compiler written in C++ for the past ~1.5 years, and my personal experience with C++ has been quite different from Nelson’s. For some rough calibration on my personal preferences: I’d even go so far as to say that language support for pattern matching with binding (that is, binding payloads to names while doing the match) and exhaustivity checking is THE thing that I miss the most when writing C++, and that feeling has not substantially changed over time.
So I figured it would be interesting to dig deeper into why there is a sharp disconnect between my perspective and Nelson’s perspective.
ASTs and IRs
At the beginning of the article, Nelson points out
At this point I’ve worked on a number of compilers, both toy and production, in a wide range of languages, including Go, C++, Java, Rust, and Haskell. With that experience, my considered opinion is this: Tagged unions and pattern-matching on them are vastly overrated for implementing language ASTs and IRs. They are nice to have, but they rarely make a substantial difference in the development of a compiler, typechecker, or similar.
(Nelson also reiterates this later “I want to pause again to emphasize that this opinion is scoped: I think tagged unions are overrated for implementing IRs.”)
Perhaps surprisingly, I agree with this point! (With some caveats. 😉) For the specific case of ASTs and IRs, here is what I’ve observed:
After an initial bring-up period, new cases are added very infrequently to an AST or an IR. So yes, whoever adds the new case might hit more issues than they would have hit in the presence of tagged unions, but this is a cost that is not incurred very often during regular maintenance.
Doing instance checks can feel clunky, but once you get over that, you recognize that it’s not a significant barrier but a minor source of friction.
A class hierarchy emulating tagged unions provides some affordances that tagged unions are generally not very good at providing: gradual narrowing of types and the ability to traverse multiple levels quickly.
What do I mean by that?
With tagged unions, the path of least resistance typically has a flat hierarchy. This is because as the hierarchy gets deeper, both deconstruction (= matching) and construction become tedious quickly. The latter can be overcome using small helper functions, but most languages lack any meaningful way of combining patterns.
From a correctness perspective, a flatter hierarchy is problematic as it leads to types that are "too large", so you sprinkle `assert` and `fatal_error` around in your code in places where you know some case can't be reached, since writing a dedicated type (with back-and-forth conversions!) for a couple of functions might be too onerous.

In contrast, with classes, casting away multiple levels of the class hierarchy is syntactically lightweight, so having a more fine-grained hierarchy is not a constant burden paid at every matching site. I'd argue that this encourages good design; narrower types mean fewer code paths in your code which are basically saying "trust me, this code path won't be hit at runtime."
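To make the contrast concrete, here is a minimal sketch of that multi-level narrowing with a class hierarchy. The node names are hypothetical, not taken from any real compiler:

```cpp
#include <cassert>

// A small three-level expression hierarchy (illustrative names only).
struct Expr {
  virtual ~Expr() = default;
};
struct LiteralExpr : Expr {};          // one level below Expr
struct IntLiteralExpr : LiteralExpr {  // two levels below Expr
  int value;
  explicit IntLiteralExpr(int v) : value(v) {}
};

// A single cast skips straight past the intermediate LiteralExpr level;
// with nested tagged unions, you'd typically match once per level.
int evalIntLiteral(const Expr &e) {
  auto *lit = dynamic_cast<const IntLiteralExpr *>(&e);
  assert(lit && "caller guarantees this is an int literal");
  return lit->value;
}
```

Production compilers usually swap `dynamic_cast` for hand-rolled `isa`/`cast` helpers (as LLVM and Swift do), but the matching-site syntax stays similarly lightweight.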
So far, it might seem like I’m almost entirely in agreement with Nelson. What’s the catch?
Caveat 1: There is more to an AST than the node hierarchy
I think one common simplification people make when talking about ASTs is to think of a tree structure with nodes of different kinds, along with an is-a relationship between nodes, and call that the AST.
Which is true, in some sense, but an AST is more than that! Arguably, most of the tricky code is inside the nodes themselves. Tricky code is a recipe for disaster in at least two ways:
- It is easy to miss an edge case.
- It is difficult for both newcomers and experienced people to understand and change the code.
To give you a concrete example, the Swift compiler had the following types for the longest time: (source)
```cpp
enum class FunctionTypeRepresentation : uint8_t {
  Swift = 0,
  Block,
  Thin,
  CFunctionPointer,
  Last = CFunctionPointer
};

/// A class which abstracts out some details necessary for
/// making a call.
class ExtInfo {
  // If bits are added or removed, then TypeBase::AnyFunctionTypeBits
  // and NumMaskBits must be updated, and they must match.
  //
  //   |representation|noEscape|throws|
  //   |    0 .. 3    |    4   |   5  |
  //
  enum : unsigned {
    RepresentationMask = 0xF << 0,
    NoEscapeMask       = 1 << 4,
    ThrowsMask         = 1 << 5,
    NumMaskBits        = 6
  };
  unsigned Bits; // Naturally sized for speed.
};
```
(In some ways, it’s even more complicated now. 😅.)
In case you’re not familiar with C++ or the Swift compiler, let me give you a quick summary of what is going on here.
- The `ExtInfo` type captures information about the calling convention of a function type, one aspect of which is the representation (that's the `FunctionTypeRepresentation`).
- The `Bits` field has 6 bits worth of information. It stores 3 things: one "representation" and 2 bits marking whether a function is `@escaping` and `throws`.
  - The lower 4 bits store a `FunctionTypeRepresentation` most of the time. However, sometimes they end up storing a different type, which is why the mask is 4 bits instead of 2 (`FunctionTypeRepresentation` has 4 cases, so 2 bits should be enough otherwise).
Apart from the complexity around sometimes storing one thing and transiently storing another, there’s even more complexity here that’s not manifest in the code I’ve shown: some combinations are illegal.
This brings us to a couple of operational questions:
- Are those illegal `ExtInfo` values actually created in practice? If so, what compiler passes create these illegal values?
- Are those illegal `ExtInfo` values stored in the AST in practice (or are they created only transiently)?
If you’re trying to change this structure in a meaningful way, as I was at a certain point, this is (to put it politely) a bit of a headache.
Exacerbating the issue, the API on this type involved changing one "field" at a time, which meant that sometimes you would be forced to transiently create an illegal `ExtInfo` value when you wanted to go from one legal value to another (or you could use a less ergonomic API). This meant that when I added a centralized assertion to check for illegal values, it would often be tripped in a spurious way only when compiling very particular funky code!
Now, I concede that one can write confusing code in any language. Language features do not magically stop anyone from writing confusing code. I also concede that this is a somewhat extreme example.
That said, I suspect that if C++ had native support for tagged unions with layout optimizations (as Swift and Rust do), this would probably have been written in a much more understandable manner from the outset. Or, even if you really need to hand-optimize the internals as in this example, the lower-level type itself doesn't need to be part of the APIs: one can use a higher-level sum type with a worse layout in the APIs, with conversions between the two in the most low-level functions.
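A rough sketch of that last idea: expose a plainly-typed view in APIs and confine the bit-packing to a pair of conversion functions. The names below mirror the Swift example but the code is illustrative, not the actual compiler API:

```cpp
#include <cassert>
#include <cstdint>

enum class Representation : uint8_t { Swift, Block, Thin, CFunctionPointer };

// Higher-level type: each "field" is a real field, so invariants can be
// read off the declaration instead of decoded from mask arithmetic.
struct ExtInfoView {
  Representation representation;
  bool noEscape;
  bool isThrowing;
};

// Only the lowest-level functions touch the packed layout; everything
// else passes ExtInfoView around and never sees raw bits.
constexpr unsigned pack(ExtInfoView v) {
  return static_cast<unsigned>(v.representation)
       | (v.noEscape ? 1u << 4 : 0u)
       | (v.isThrowing ? 1u << 5 : 0u);
}

constexpr ExtInfoView unpack(unsigned bits) {
  return {static_cast<Representation>(bits & 0xFu),
          (bits & (1u << 4)) != 0,
          (bits & (1u << 5)) != 0};
}
```

With this split, "change one field" APIs operate on `ExtInfoView`, so no transiently-illegal packed value ever needs to exist.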
The point I’m trying to make here is that, not unlike other domains, one often wants to be able to model small parts of an AST or an IR (say 1 field of 1 particular AST node) using tagged unions. Being unable to define sums and use pattern-matching imposes a small recurring tax on whoever is reading the code or trying to change it. Depending on the complexity of invariants, this tax can be surprisingly large for otherwise innocuous-looking data types.
There are ways around this to some extent – for example, you can have more asserts which clarify what invariants hold – but the benefits aren't quite the same because the barrier to writing understandable code is now higher.
If types are like gravity for programs (paraphrasing Conor McBride), a language's type system is like gravity for the types that you're actually writing. Having tagged unions and pattern-matching reduces the friction for one of the most common cases of data representation: "this or that". Does that mean they should always be the go-to choice when you want to represent different cases of something? No. But they're very often a good starting point.
Caveat 2: It is okay to want to have nice things
Later in the post, Nelson points out that pattern-matching isn’t nearly as powerful as one would like it to be:
My next comment is on the flip side — the limits of pattern-matching in languages that do support it. In my experience, it often ends up being less powerful and convenient than you’d like.
…
To me, this is a compelling argument that, even if C++ had language-level pattern-matching, it would be necessary to invent our own pattern-matching system to support these complex and non-type-specific matchers, anyways. So what would we really be gaining from that language support?
I agree with Nelson’s point about ML-style pattern-matching (or its variations in popular languages) typically not being very powerful. However, I don’t agree with the argument in the second paragraph, for a few reasons:
- I think that ML-style pattern-matching is valuable enough to cover a significant fraction of day-to-day use cases.
- Library solutions often don’t end up providing sufficiently good ergonomics.
- Having something as core as pattern matching be a language feature makes it easier for different libraries to interoperate.
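On the ergonomics point, here is roughly what hand-rolled matching looks like in C++17 today: `std::visit` over a `std::variant` with an ad-hoc overload set (the `Shape` type is invented for illustration):

```cpp
#include <variant>

// The standard "overloaded" idiom for building a visitor from lambdas.
template <class... Fs> struct overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> overloaded(Fs...) -> overloaded<Fs...>;

struct Circle { int radius; };
struct Rect { int width, height; };
using Shape = std::variant<Circle, Rect>;

// std::visit does give a form of exhaustivity (forgetting a case is a
// compile error), but there are no nested patterns, no guards, and no
// binding syntax, so the ergonomics fall short of language-level matching.
int area(const Shape &s) {
  return std::visit(overloaded{
      [](const Circle &c) { return 3 * c.radius * c.radius; },  // crude pi
      [](const Rect &r) { return r.width * r.height; },
  }, s);
}
```

Even this, the best-known library-level approach, needs a template idiom that every codebase re-declares and that produces famously gnarly errors when a case is missed.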
At the same time, I agree that there is a need for more powerful solutions in case you do need the ability to do complex matches. Thankfully, there is already a bunch of cool work in this area which one can use for inspiration:
Haskell has pattern synonyms which can be used to compose patterns. These are not quite first-class patterns, but they’re still quite useful. Exhaustivity relations on user-defined patterns (as opposed to compiler-synthesized patterns based on data type declarations) can be specified using a particular compiler pragma.
Work on optics: There has been a bunch of work on optics in the context of functional programming languages, most prominently the lens library in Haskell. Prisms are the kind of optic which represent pattern-matching; they offer a first-class way to capture patterns, as well as compose well with other forms of selection. However, one loses out on exhaustivity checking.
The idea of optics has been popularized and has spread to other languages. For example, the Ramda library (20.3k GitHub stars) for JavaScript has some limited support for lenses.
The Perls (including Raku) have powerful facilities for regex matching and for defining custom grammars.
Structural sum types with row polymorphism: OCaml has language support for structural sum types that can be extended and narrowed (these are called "polymorphic variants" in OCaml terminology). PureScript has a library purescript-variant implementing this feature.
Egison has support for complex, extensible patterns.
Mathematica has some really nice support for using patterns and replacement rules.
I’m sure there’s a bunch of related work that I’m missing here.
Based on these, I think there is quite a bit of headroom for making pattern-matching in mainstream languages more powerful, so that everyone can benefit from it, not just people willing to hand-roll their own pattern-matching libraries (which are often accompanied by gnarly type errors).
Closing thoughts
I think that tagged unions and pattern-matching are powerful tools when working "in the small". At the moment, they don't work equally well "in the large". The node hierarchy in an AST or IR is one example where using classes provides some tangible benefits over tagged unions, even though tagged unions might seem like a perfect fit at first glance.
When thinking about the power of language features, I often like to think in terms of a “yield point”. In case you’re unfamiliar with the idea, here’s the description from Wikipedia:
In materials science and engineering, the yield point is the point on a stress-strain curve that indicates the limit of elastic behavior and the beginning of plastic behavior. Below the yield point, a material will deform elastically and will return to its original shape when the applied stress is removed. Once the yield point is passed, some fraction of the deformation will be permanent and non-reversible and is known as plastic deformation.
A yield point captures the idea of a point where gradual degradation starts. It’s not an immediate catastrophe, but you’re stretching the system in a way that you perhaps didn’t intend.
Today, given the pattern-matching facilities (or lack thereof) in mainstream languages, the yield point for tagged unions and pattern-matching is relatively low. But that's not something to be lamented; rather, it's a great opportunity. We can push the yield point further by synthesizing the pattern-matching facilities found in less popular languages and coming up with a design that extends ML-style pattern-matching in a way that provides excellent composability and ergonomics for users of mainstream languages.