A compiler’s users (part 1)
One idea that has become increasingly common in software development is having one or more people on the team doing user research. This could be a product manager, a product designer, or someone else. The key bit is that this person’s role is to study the different (current and potential) users of the product under consideration, understanding their workflows and needs.
In contrast, I think that programming language development in industry hasn’t quite caught up with this. There are definitely some pockets of interesting development in the area. For example, Rust has a yearly “call for blogs” soliciting thoughts for language direction, as well as the recent async Rust user stories. However, it’s far from common practice.
Let’s fill that gap here to some extent. Since the topic I’m most familiar with is compilers, let’s consider the question, “Who uses a compiler?” There’s certainly an obvious answer: programmers use compilers, duh! But there’s a richer answer hiding underneath.
In the following sections, I’ll discuss who the different potential users of a compiler are, what they need from a compiler, and how those needs affect compiler design. The list of users here isn’t meant to be exhaustive, but I hope it is a useful starting point if you want to write an industrial-strength compiler.
People
Yes, people use compilers! I’m deliberately using “people” instead of “programmers” for two reasons:
- “Programmers” potentially (incorrectly) reads as “Software Engineers,” whereas many people use compilers even if their job title is not Software Engineer. For example, it could be a student at a university, a system administrator at a government office, or a data journalist at a news agency.
- “People” captures the fact that these folks have much richer lives and emotional states than “programmer” does. Maybe it’s a parent trying to attend to a child in-between coding, maybe it’s a person whose vision is slowly deteriorating, maybe it’s a scientist who wants to share an awesome discovery with the rest of the community.
All these different people will have different needs and wants, and they likely have different needs and wants at different stages in their proficiency journey, so it’s almost impossible to give a general prescription on “things worth focusing on.” That said, some potential things people may care about are: error messages, compiler speed, speed of generated code, and ease of integration into existing workflows (for example, do your users prefer coding in notebooks?). The list could probably be 50+ items long, so I’m stopping there.
If you’re working on a new compiler, it is worth pausing and writing down some examples of people who are using it (and who you’d like to use it), and what their needs are. Then talk to those people! There’s no better way of having your assumptions challenged.
For error messages specifically, one way to get better messages is to add special cases in the compiler based on user reports of where someone got confused. You can have fast happy paths and slow error paths, performing additional checks on the latter, such as suggesting typo fixes.
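Here’s a minimal sketch of that pattern in Rust; the names are hypothetical, but the shape carries over to most compilers:

```rust
use std::collections::HashMap;

// Happy path: a single hash-map lookup. The expensive work below only
// runs when compilation is already going to fail anyway.
fn lookup(scope: &HashMap<String, u32>, name: &str) -> Result<u32, String> {
    if let Some(&id) = scope.get(name) {
        return Ok(id);
    }
    // Slow error path: scan every name in scope for close matches.
    let suggestion = scope
        .keys()
        .map(|candidate| (edit_distance(name, candidate), candidate))
        .filter(|&(dist, _)| dist <= 2) // only suggest plausible typos
        .min()
        .map(|(_, candidate)| format!("; did you mean `{candidate}`?"))
        .unwrap_or_default();
    Err(format!("unresolved name `{name}`{suggestion}"))
}

// Textbook single-row Levenshtein edit distance.
fn edit_distance(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut prev = row[0];
        row[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cur = row[j + 1];
            row[j + 1] = if ca == cb {
                prev
            } else {
                1 + prev.min(row[j]).min(cur)
            };
            prev = cur;
        }
    }
    row[b.len()]
}
```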
For parse errors specifically, it is a good idea to have a concrete syntax tree representation between source code and the abstract syntax tree, to be able to report multiple parse errors at the same time. (A concrete syntax tree, a.k.a. parse tree, is a lossless representation of source code that sits between a token stream, which has no tree structure, and an abstract syntax tree, which represents language semantics. A concrete syntax tree needs to preserve whitespace and comments; even though these are not useful from a semantic perspective, they are needed to be able to write tooling that operates on source code.)
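To make the idea concrete, here’s a tiny sketch of what a concrete syntax tree might look like, loosely inspired by designs like Rust’s rowan library (the names below are illustrative, not rowan’s actual API):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Identifier,
    FnKeyword,
    FnDecl,
    Error, // wraps tokens that could not be parsed, enabling recovery
    // ...
}

#[derive(Debug)]
enum SyntaxElement {
    // Leaf: a token together with its exact source text,
    // including whitespace and comments ("trivia").
    Token { kind: SyntaxKind, text: String },
    // Interior node: a syntactic construct grouping its children.
    Node { kind: SyntaxKind, children: Vec<SyntaxElement> },
}

impl SyntaxElement {
    // Because the tree is lossless, the original source can be
    // reconstructed exactly; a useful invariant to test.
    fn source_text(&self, out: &mut String) {
        match self {
            SyntaxElement::Token { text, .. } => out.push_str(text),
            SyntaxElement::Node { children, .. } => {
                for child in children {
                    child.source_text(out);
                }
            }
        }
    }
}
```

The `Error` kind is what makes reporting multiple parse errors possible: the parser wraps the confusing tokens in an error node and keeps going, instead of bailing out at the first problem.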
Code generation tools
People sometimes create programs! That was bad enough. Worse, they sometimes create programs that create programs! Eek! Some common examples are:
- Compilers for serialization schemas: For example, if you have a binary serialization format, you might specify the schema in a declarative language and have the schema compiler generate bindings for different programming languages so they can all talk seamlessly without mixing up which byte is where.
- Parser generators: Again, you have a more declarative language, which can be used to generate a parser in the target programming language.
For code generation tools, the primary need is support for custom source locations: a declaration can be marked with the location of the code that generated it, so that diagnostics are reported more accurately and an IDE can jump to the generating code instead of the generated code. If such a language feature is supported, the compiler implementation may want additional flags for printing source locations, to make debugging and writing tests easier.
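Here’s a rough sketch of how a compiler might model such locations; the types and the flag are hypothetical:

```rust
#[derive(Debug, Clone)]
struct Span {
    file: String,
    line: u32,
    column: u32,
}

#[derive(Debug, Clone)]
enum SourceLoc {
    // Written directly by a person.
    Real(Span),
    // Produced by a code generation tool: `declared_at` points into
    // the schema/grammar file the tool consumed, so diagnostics and
    // go-to-definition can target the true origin.
    Generated {
        in_generated_file: Span,
        declared_at: Span,
    },
}

// Diagnostics prefer the origin location by default; a debugging
// flag can flip this to show the generated file instead.
fn primary_location(loc: &SourceLoc, show_generated: bool) -> &Span {
    match loc {
        SourceLoc::Real(span) => span,
        SourceLoc::Generated { in_generated_file, declared_at } => {
            if show_generated { in_generated_file } else { declared_at }
        }
    }
}
```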
Editors
More often than not, people use specialized text editors or IDEs to read and modify their programs. Historically, editors have been mostly native applications, although an increasing number of them are web applications, sometimes even used from smaller screens such as mobile devices.
Editors differ wildly from one another, which has historically made it difficult to get language intelligence working smoothly across multiple editors; thankfully, the Language Server Protocol (LSP) makes this easier than ever before.
Given the increasing support for LSP across editors, you probably want to implement an LSP server for your language.
IDE integration also encourages a couple of patterns:
Using immutable ASTs: Using an AST whose nodes are mutated as more information is gained seems like a reasonable idea in the abstract. For example, before type-checking, one might leave empty slots for types (say, for expressions) which are filled in during type-checking.
In an IDE, since the changes between successive re-compiles are typically small, computing things incrementally can avoid doing a lot of redundant work. Part of this is reusing ASTs instead of creating new ones from scratch.
However, combining incrementality/reuse with mutation can become difficult to reason about. Using an immutable AST turns the problem from a correctness hazard into a performance hazard, which is typically the right trade-off in the context of a compiler.
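One common way to achieve this is to keep the AST immutable and record inferred types in a side table keyed by node IDs. A minimal sketch, with made-up names:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ExprId(u32);

#[derive(Debug)]
enum Expr {
    IntLiteral { id: ExprId, value: i64 },
    Add { id: ExprId, lhs: Box<Expr>, rhs: Box<Expr> },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Type {
    Int,
}

// The checker writes into a side table; the AST itself is never
// mutated, so subtrees can be shared and reused across incremental
// re-compiles without worrying about stale in-tree state.
fn check(expr: &Expr, types: &mut HashMap<ExprId, Type>) -> Type {
    let (id, ty) = match expr {
        Expr::IntLiteral { id, .. } => (*id, Type::Int),
        Expr::Add { id, lhs, rhs } => {
            check(lhs, types);
            check(rhs, types);
            (*id, Type::Int) // assume Int-only arithmetic, for brevity
        }
    };
    types.insert(id, ty);
    ty
}
```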
Using a virtual file system (VFS): The idea of a VFS is to pass the “file system” as a first-class value, instead of indiscriminately calling global functions to do file I/O. This is helpful for multiple reasons:
- If it is shared between the compiler and the IDE, it can help ensure that the files being included in the compilation are in sync between the two.[^1]
- It provides a central place to measure statistics (such as the overhead of file I/O) and prevents code duplication. It can also make debugging easier, as you can easily add extra checks in the VFS code to understand file system usage.
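A minimal sketch of what a VFS interface might look like, with an overlay for unsaved editor buffers (all names here are hypothetical):

```rust
use std::collections::HashMap;
use std::io;
use std::path::{Path, PathBuf};

trait Vfs {
    fn read_to_string(&self, path: &Path) -> io::Result<String>;
}

// Production implementation: defer to the real file system.
struct RealFs;

impl Vfs for RealFs {
    fn read_to_string(&self, path: &Path) -> io::Result<String> {
        std::fs::read_to_string(path)
    }
}

// IDE/test implementation: unsaved editor buffers take priority,
// and every read can be counted or logged in one central place.
struct OverlayFs {
    overlay: HashMap<PathBuf, String>,
    fallback: RealFs,
}

impl Vfs for OverlayFs {
    fn read_to_string(&self, path: &Path) -> io::Result<String> {
        match self.overlay.get(path) {
            Some(contents) => Ok(contents.clone()),
            None => self.fallback.read_to_string(path),
        }
    }
}

// The compiler never calls std::fs directly; it only sees `&dyn Vfs`.
fn compile_file(vfs: &dyn Vfs, path: &Path) -> io::Result<()> {
    let _source = vfs.read_to_string(path)?;
    // ... tokenize, parse, check ...
    Ok(())
}
```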
Syntax highlighters
Traditionally, syntax highlighters have been written in a compiler-independent manner, ranging from hacky regex matching to more complicated grammars. This can lead to skew when language changes take place, making the highlighting either outdated (and requiring re-implementation in a separate codebase) or even erroneous.
On the other hand, if a syntax highlighter is compiler-derived, it can benefit from avoiding re-implementation as well as being up-to-date with the latest and greatest syntax changes.
There are a few ways to implement compiler-based syntax highlighting:
1. The compiler provides an API for obtaining the “kind” of tokens based on source code (or similar). This has the downside of requiring the syntax highlighter to be able to talk to the compiler’s implementation language. Often, when highlighting on the web (such as in a static site generator), you want the highlighter to be available in different languages, so this isn’t the best solution.
2. The compiler emits a breakdown of token kinds when given a particular flag; you probably want to have such a flag anyway for testing the tokenization! This is more flexible than option 1 (and can potentially be built on top of an API), as other languages can consume a standard format (say JSON) relatively easily; see the sketch after this list.
   The drawback of this approach is that it may not work for web-based editing environments. In a web-based editor, you may want client-side syntax highlighting for fast feedback. This requires the compiler (or at least the tokenizer, and potentially the parser) to run in the browser! Not only does the compiler implementation language need to be compilable to JavaScript or WebAssembly, the code size needs to be reasonably small. (Although, with the average web page growing increasingly heavier, and network speeds around the world becoming faster, this is potentially less of an issue.)
3. Provide a reference parser generator grammar: Since parser generators are often easily re-targetable, the generator can create code in a different language, and it becomes easier to consume the generated parser.

   Whether the compiler should use a parser generated from this grammar for its implementation, or use a separate implementation, is a tricky question. See @matklad’s Challenging LR parsing for a more thorough discussion of the challenges involved.
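To make option 2 concrete, here’s a sketch of what a hypothetical `--dump-tokens` flag might emit; the flag name and the output format are made up:

```rust
struct Token {
    kind: &'static str, // e.g. "keyword", "identifier", "comment"
    start: usize,       // byte offset into the source
    len: usize,
}

fn dump_tokens_json(tokens: &[Token]) {
    // One JSON object per line keeps the output easy to stream and
    // diff, which is also convenient for tokenizer tests.
    for tok in tokens {
        println!(
            r#"{{"kind":"{}","start":{},"len":{}}}"#,
            tok.kind, tok.start, tok.len
        );
    }
}

fn main() {
    let tokens = [
        Token { kind: "keyword", start: 0, len: 2 },
        Token { kind: "identifier", start: 3, len: 4 },
    ];
    dump_tokens_json(&tokens);
}
```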
Code formatters
Whether you like them or not, code formatters are all the rage and they’re not going away anytime soon.
At first glance, you might think of a code formatter as a parser and a pretty-printer stapled together. That’s not the full picture though. For a better developer experience, one typically wants a code formatter to be able to format syntactically invalid code, at least on a best-effort basis. Additionally, if the code formatter is able to format diffs of code, then it can be introduced incrementally on a large codebase that it hasn’t been used on before.
To avoid re-implementing a parser in the formatter, a compiler can expose an API for its parser that uses a concrete syntax tree. This API needs to be sufficiently flexible that a formatter can use it to parse snippets of code, not just entire source files or individual declarations. If parser error recovery involves adding fake syntax nodes, the formatter needs to be able to tell the fake nodes apart from user-written ones, to avoid accidentally printing synthesized syntax nodes. Decoupling the recording and display of parse errors is also a good idea; this makes the parser a pure function, and it allows a formatter using the parser APIs to suppress errors as applicable.
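Putting those requirements together, a formatter-friendly parser API might look roughly like this (hypothetical names; the actual parsing is elided):

```rust
#[derive(Debug)]
struct ParseError {
    offset: usize,
    message: String,
}

#[derive(Debug)]
enum CstNode {
    // A real token, backed by source text.
    Token { text: String },
    // Inserted during error recovery; occupies no source text.
    // A formatter must skip these instead of printing them.
    Synthesized { description: &'static str },
    Node { children: Vec<CstNode> },
}

// Parsing is a pure function: errors come back as plain data, so a
// formatter can decide for itself whether to display or suppress them.
#[derive(Debug)]
struct ParseResult {
    tree: CstNode,
    errors: Vec<ParseError>,
}

enum Mode {
    File,
    Declaration,
    Expression,
}

// Entry points at several granularities, so a formatter can parse a
// snippet (say, a single expression from a diff), not just whole files.
fn parse_file(source: &str) -> ParseResult { parse(source, Mode::File) }
fn parse_declaration(source: &str) -> ParseResult { parse(source, Mode::Declaration) }
fn parse_expression(source: &str) -> ParseResult { parse(source, Mode::Expression) }

fn parse(source: &str, mode: Mode) -> ParseResult {
    let _ = (source, mode); // the actual parsing is elided in this sketch
    ParseResult { tree: CstNode::Node { children: vec![] }, errors: vec![] }
}
```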
One tricky aspect of code formatting is handling operator precedence in languages which allow custom operator precedences. Generally, arriving at a correct parse requires at least going through name resolution, followed by rebalancing expression trees that contain operators. However, name resolution happens much later in the compiler pipeline than the point where the concrete syntax tree is exposed. Unfortunately, I am not aware of any good solutions to this problem.
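To at least make the rebalancing idea concrete: one option is for the parser to record operator chains flat, deferring the precedence decision. This is a sketch of the representation, not a full solution:

```rust
#[derive(Debug)]
enum Expr {
    Name(String),
    // `a + b * c` initially parses as
    // OpChain { operands: [a, b, c], operators: ["+", "*"] }.
    OpChain { operands: Vec<Expr>, operators: Vec<String> },
    // After name resolution determines each operator's precedence,
    // a later pass rewrites OpChain into nested BinOp nodes.
    BinOp { op: String, lhs: Box<Expr>, rhs: Box<Expr> },
}
```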
On a somewhat tangential note, I recommend reading Bob Nystrom’s blog post The Hardest Program I’ve Ever Written if you’re interested in writing a code formatter.
Build systems
Sitting behind either the IDE or the terminal, the build system is frequently the one part of the puzzle that programmers want to avoid touching if they can get away with it.
From the perspective of a typical build system, a compiler is an opaque box which takes some inputs and produces some outputs. If the inputs don’t change, the build system can skip invoking the compiler.
However, in many languages, the set of files read by a compiler is determined when the compiler actually runs, not beforehand. At first glance, this looks like a chicken-and-egg problem: the build system wants to check that the inputs are unchanged, so that it can avoid invoking the compiler, but determining the full set of inputs requires invoking the compiler!
The typical way out of this is to break the set of inputs into two: base inputs and derived inputs. When doing a re-build:
- If the base inputs haven’t changed, the set of derived inputs from the previous build is still valid.
  - If the derived inputs haven’t changed either, exit (this is a “null build”).
  - If any of the derived inputs have changed, rebuild the transitive dependents of the changed input(s).
- If the base inputs have changed, the set of derived inputs is potentially invalid, so it is recomputed. Then rebuild the transitive dependents of the changed input(s).
For this scheme to work, the build system needs to support dynamic dependencies (PDF).
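Here’s a sketch of this rebuild decision from the build system’s side; `Fingerprint` stands in for a content hash, and all the names are made up:

```rust
use std::collections::HashMap;

type Fingerprint = u64; // e.g. a content hash
type Snapshot = HashMap<String, Fingerprint>; // path -> fingerprint

enum BuildAction {
    NullBuild,
    RebuildDependentsOf(Vec<String>),
}

fn plan_rebuild(
    base_changed: bool,
    old_derived: &Snapshot,
    current_fingerprint: impl Fn(&str) -> Fingerprint,
    rescan_derived: impl Fn() -> Snapshot, // re-runs input discovery
) -> BuildAction {
    if !base_changed {
        // Base inputs unchanged: the previous build's derived input
        // set is still valid, so only re-fingerprint those files.
        let changed: Vec<String> = old_derived
            .iter()
            .filter(|&(path, &old_fp)| current_fingerprint(path) != old_fp)
            .map(|(path, _)| path.clone())
            .collect();
        if changed.is_empty() {
            BuildAction::NullBuild
        } else {
            BuildAction::RebuildDependentsOf(changed)
        }
    } else {
        // Base inputs changed: the derived input set may be stale, so
        // recompute it (which requires running part of the compiler).
        let new_derived = rescan_derived();
        BuildAction::RebuildDependentsOf(new_derived.keys().cloned().collect())
    }
}
```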
Breaking the build into two stages allows the second-stage build to be “fully explicit”: there are no files that are implicitly read. In the context of a distributed build, this kind of smart planning can cut down on compile times by reducing the size of uploads. However, the planning itself needs to be fast; as bandwidths of 1 Gbps and higher become increasingly common, just uploading everything instead of trying to be clever may be a reasonable strategy, and is likely to be less error-prone.
So what does a build system need from a compiler?
First, build system engineers need good documentation on compiler flags, especially the ones related to inputs and outputs. They also need documentation on environment variables read by the compiler.
Second, the compiler should provide some kind of API to break up one large compilation job into a DAG (directed acyclic graph) of smaller compilation jobs. This allows for two things:
- Preventing process explosion: The build system (which can be a daemon) can be better at scheduling jobs across multiple concurrent builds. Otherwise, hundreds of spawned compiler processes all trying to go full throttle may not be handled very well by the host OS.
- Better resource utilization: As the build system has more information about the build, it can pipeline builds (i.e. start doing some work for downstream compilation units before an upstream compilation unit has finished compiling) and better exploit parallelism.
Specifically, jobs should be broken up whenever there is some caching going on as a “side-channel”, which is not part of the compiler process’s main outputs. One of the main jobs of the build system is to handle caching; duplicating that inside the compiler increases the chances of multiple compiler processes trying to sneakily populate the cache, trampling over each other. Not coincidentally, this is a fun (read: nightmare) way to run into file system weirdness. Instead, let the build system de-duplicate cache-filling jobs.
Put differently, if your compiler is checking “is this file up-to-date” and skipping work based on that, you probably want to lift that check up into the build system. (There are some cases where you still want this in the compiler, such as someone editing a file that is being type-checked.)
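To make the job-DAG idea concrete, here’s a sketch of what such an API’s data might look like (all names hypothetical):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct JobId(u32);

#[derive(Debug)]
struct Job {
    id: JobId,
    description: String,    // e.g. "parse foo.src" or "codegen bar.o"
    depends_on: Vec<JobId>, // edges of the DAG
    // A stable cache key lets the build system (not the compiler)
    // decide whether a job's outputs can be reused, and lets it
    // de-duplicate identical cache-filling jobs across builds.
    cache_key: u64,
}

// Instead of "compile everything" being one opaque process, the
// compiler returns a plan; the build system runs the jobs in
// topological order, with whatever scheduling policy it prefers.
fn plan_compilation(compilation_unit: &str) -> Vec<Job> {
    let _ = compilation_unit; // planning elided in this sketch
    Vec::new()
}
```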
And there’s more!
That’s not all! In a subsequent post, I will cover a bunch of other tools which would use a compiler. Stay tuned!
(No, there is no newsletter where you can sign up. If you’d like to read a newsletter about compilers and programming languages, feel free to email me at hello@cutcul.us with what you’d like to read more about.)
[^1]: Maintaining this invariant can be tricky if:
    - globs are allowed in paths and one is relying on different implementations of glob expansion to be identical in behavior
    - symlinks are permitted
    - the IDE has additional notions layered on top of “these files should be compiled together”