Our optimizers need a rethink

Intended audience: People working on, interested in, or adjacent to compiler and/or database development, and people who’ve been bitten by compiler/database optimizer failures in the past.

Based on my experience in compiler development, reading about new optimizations as they come out, and more recently trying to debug slow queries in Postgres, it seems like optimizers tend to have the following characteristics in common:

  1. Contingent criticality: The optimizer’s behavior is in some contexts load-bearing and has little margin for error (e.g. missing an optimization).

    For example, in GCed languages, there are situations where you do not want any allocations to be triggered. If the language boxes ~everything by default, you need to be confident that the optimizer eliminates the corresponding boxing operations on those paths.

    In the context of more low-level languages like C++, depending on the code in question, one might care about the compiler unrolling loops properly, using (or avoiding) branchless instructions, emitting the right set of SIMD instructions, and so on.

    For languages using ref-counting combined with copy-on-write container types, lifetime contraction can even lead to changes in algorithmic complexity, not just constant factors.

  2. Limited control: The knobs to steer the optimizer are limited.

    Usually, these take the form of compiler pragmas or specialized builtins/intrinsics, and occasionally language features. Some examples:

    • Rust has a black_box intrinsic to conveniently insert optimizer barriers for benchmarking (a short example follows this list).
    • Many languages have ways of marking inlining information. For example, the Swift stdlib currently has 890 uses of @inline(__always) and 3.1K uses of @inlinable.
    • D has first-class support for marking functions as “no GC”.
    • MySQL has optimizer hints which are written as magic comments. Hints that cannot be applied are silently ignored without warning.
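    To make the first bullet concrete, here is a minimal Rust sketch combining two such knobs: black_box to stop the optimizer from deleting the benchmarked work, and #[inline(always)] as an example of an inlining marker. The checksum function and the hand-rolled timing loop are made up for illustration.

```rust
use std::hint::black_box;
use std::time::Instant;

// A small function whose cost we want to measure. #[inline(always)] is one of
// the few knobs the language offers for steering the inliner.
#[inline(always)]
fn checksum(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

fn main() {
    let data = vec![0xAB_u8; 1 << 20];
    let start = Instant::now();
    for _ in 0..100 {
        // Without black_box, the optimizer may notice that the result is unused
        // (and that `data` never changes) and hoist or delete the work entirely,
        // turning the benchmark into a measurement of nothing.
        black_box(checksum(black_box(&data)));
    }
    println!("elapsed: {:?}", start.elapsed());
}
```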
  3. Limited testability: The facilities to check the optimizer’s output are meant for debugging only, and are not amenable to standard testing techniques.

    Typically, this means eyeballing the compiler output: assembly for low-level languages, some intermediate representation (IR) for higher-level ones. However, there is no way to feed this output back into the tool to check that a new version of the code (or the same code compiled with a newer tool version) has the ~same optimized output.

    One interesting counter-example here is tail call elimination, which has support for explicit checking in many compilers: Clang and GCC have musttail, Dotty (Scala 3) has @tailrec, and the OCaml compiler has @tailcall.

  4. Limited visibility: Sometimes, the optimizer seems to “have a mind of its own”.

    For example, in the Postgres case recently, we have been baffled by the index choice being made by Postgres: it prefers a very low-selectivity index over the primary-key index. Debugging the situation is made trickier by the fact that the index choice made in prod doesn’t match the index choice made in dev. Yes, we have checked that the index statistics are up-to-date in prod.

    If you go through compiler bug trackers, it’s not hard to come across situations where optimizations do not trigger when users expect them to. Sometimes this is down to incorrect user expectations, sometimes due to optimizer bugs, and sometimes due to “insufficient smarts” (more on that shortly).

One last point worth mentioning which is more about optimizations and not about optimizers per se: optimizations tend to interfere with debugging, which encourages the use of different optimization levels based on context (local development vs production) for ahead-of-time (AOT) compilation. A counter-example to this is Go, which does not conventionally have separate dev and release builds.

For high-level languages, at least some level of optimization is required for any kind of reasonable performance, which, depending on the language, means that at least some debugging information must be erased. Tail-call elimination is a common example of this, and is sometimes mandatory for languages like Scheme. I believe the Functional-But-In-Place technique used in Roc, Lean and Koka is likely to have the same problem.

What’s wrong with the status quo

If you look at the points in the previous section, there seems to be a contradiction between the first point on contingent criticality and the other three. If some system is critical to program behavior, surely we need good mechanisms to be able to inspect, understand, control and check it, right?

Software engineering is programming integrated over time

– Titus Winters

In a similar vibe, here’s a longer quote in the context of DB query planners from Nelson Elhage’s Some opinionated thoughts on SQL databases:

I dislike query planners

[..] For an online application with consistent data access patterns and high throughput, performance is part of the database’s interface; if a database continues serving queries but with substantially worse latency, that’s an incident!

The query planner is the antithesis of predictability; most of the time it will choose right, but it’s hard to know when it won’t or what the impact will be if it doesn’t. [..] It’s not a question of whether or not their planner is smart enough [..] At some point, transparency, explicitness, and predictability are really important values.

Connecting the two quotes is the theme of maintainability: “transparency, explicitness and predictability” are key to maintainable software, and hence it’s critical that our tools support us in embedding these values into the software we write.

Applying the point to optimizers and optimizations, if an optimization is critical to overall system function, and if the system needs to be maintained over a long time, the programmer needs to be able to:

  1. Have a good mental model of what the optimizer can and cannot do.

  2. Easily debug situations when an optimization is missed.

  3. Easily steer the optimizer in the desired direction if it cannot figure out something by itself. If the optimizer fails to follow the programmer’s guidance, it must issue an error. Optionally, the optimizer may allow the programmer to issue “weak guidance” which downgrades optimization misses to warnings instead of errors.

  4. Easily write regression tests over the expected behavior of the optimizer. These tests should have the other properties typically expected of good tests. A non-exhaustive list of properties the tests should satisfy: (1) correctness - the test should fail if the necessary optimization doesn’t fire; (2) robustness - the test shouldn’t fail due to irrelevant factors; (3) debuggability - failures should be easy to reproduce and debug; (4) automation - the test should be easy to maintain in version control and runnable in typical CI environments; (5) performance - the test should run reasonably quickly.

Where do we go from here

Optimizers, including their tooling and documentation, need to evolve to support the 4 properties outlined above. (Since this is the internet, perhaps I should add the obligatory disclaimer: I’m not asking people to do this work for free.)

Helping users build a solid, grounded mental model

Optimizers should have clear docs on the optimizations they perform, as well as the order the optimizations may be applied in.

Positive examples of optimizations kicking in should be present in the docs and machine-checked as doctests to avoid regressions.

Perhaps even more important are negative examples of optimizations not being triggered automatically. These should explain why an optimization does not trigger in a specific situation, and should ideally be drawn from real user misunderstandings reflected in the issue tracker. Finally, negative examples should ideally come with recommendations on how the user can guide the optimizer to do what they want, where that makes sense in the given context (e.g. the optimization must not introduce unsoundness in case some assumptions are not upheld).
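To make the idea of negative examples concrete, here is the kind of annotated snippet a Rust-facing optimizer’s docs could contain. The specific claim (that the in-loop bounds checks of the first version often block auto-vectorization, while the zipped rewrite removes them) depends on compiler version and target, which is exactly why such examples should be machine-checked:

```rust
// Negative example: the indexed form keeps a bounds check (and potential
// panic) inside the loop, which often prevents auto-vectorization.
pub fn scale_indexed(dst: &mut [f32], src: &[f32], k: f32) {
    for i in 0..src.len() {
        dst[i] = src[i] * k; // `dst[i]` may panic if dst is shorter than src
    }
}

// Recommended rewrite: iterating over the zipped slices removes the bounds
// checks, leaving a straight-line loop the optimizer can vectorize.
pub fn scale_zipped(dst: &mut [f32], src: &[f32], k: f32) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s * k;
    }
}
```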

Evolving tooling for debugging optimization misses

LLVM supports an interesting feature called Optimization Remarks – these remarks track whether an optimization was performed or missed. Clang supports recording remarks using -fsave-optimization-record, and rustc supports -Zremark-dir=<blah>. There are also some tools (opt-viewer.py, optview2) to help view and understand the output.
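To sketch what consuming this data could look like in CI (treat the details as approximate: the flags are the ones mentioned above, -Zremark-dir requires a nightly toolchain, and the !Missed YAML tag and the checksum function name are assumptions about LLVM’s serialized remark format and the project under test, respectively):

```rust
use std::process::Command;

// Build with optimization remarks enabled, then flag missed optimizations
// mentioning a function we care about. Intended to run from the crate root.
fn main() -> std::io::Result<()> {
    let status = Command::new("cargo")
        .args(["+nightly", "build", "--release"])
        .env("RUSTFLAGS", "-Cremark=all -Zremark-dir=target/remarks")
        .status()?;
    assert!(status.success(), "build failed");

    for entry in std::fs::read_dir("target/remarks")? {
        let text = std::fs::read_to_string(entry?.path())?;
        // LLVM serializes each remark as a YAML document tagged
        // !Passed, !Missed or !Analysis.
        for doc in text.split("--- ") {
            if doc.starts_with("!Missed") && doc.contains("checksum") {
                eprintln!("missed optimization near `checksum`:\n{doc}");
            }
        }
    }
    Ok(())
}
```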

Other optimizers can potentially take inspiration from this approach to do something similar.

As a baseline, compilers using LLVM should expose a way to get the same data out. Compilers doing optimizations on their own IRs could expose their own optimization records. For example, Swift has SIL optimization remarks for its own intermediate language.

In the context of databases, I think a similar approach can be used, perhaps at the level of relational operators, with the added caveat that any run-time information that fed into the optimization decisions ought to be recorded as well.

It would be valuable to integrate this tooling with editor extensions such as LSPs for ease of use. Increased use will likely necessitate tooling improvements, further driving use, potentially creating a virtuous cycle if maintainer bandwidth can be effectively managed.

Lastly, users need to be able to easily share missed-optimization reports with optimizer developers. In my experience, the biggest challenge here is that it can be quite time-consuming for less savvy users to minimize and anonymize code, especially when multiple functions are involved. It would be valuable to integrate code anonymization into bug-reporting tooling to make this easier for users. (By “code anonymization”, I mean things like bulk renaming identifiers, stripping comments and garbling string/byte/character literals, so that essentially ~only the control flow structure is preserved. Yes, I recognize there are language features, such as reflection, which potentially make this more complex. However, anonymization for optimization purposes doesn’t need to support 100% of the language’s feature set.)
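To give a flavor of what even a crude anonymizer could do, here is a toy Rust sketch using the regex crate. It is purely illustrative (no handling of strings, comments, macros, or anything beyond a tiny hard-coded keyword list); real tooling would work on a parsed AST instead:

```rust
use regex::{Captures, Regex};
use std::collections::HashMap;

/// Toy identifier scrambler: every non-keyword identifier is consistently
/// renamed to v0, v1, ..., hiding names while preserving the code's shape.
fn anonymize(src: &str, keywords: &[&str]) -> String {
    let ident = Regex::new(r"[A-Za-z_][A-Za-z0-9_]*").unwrap();
    let mut names: HashMap<String, String> = HashMap::new();
    ident
        .replace_all(src, |caps: &Captures| {
            let word = &caps[0];
            if keywords.contains(&word) {
                return word.to_string();
            }
            let fresh = format!("v{}", names.len());
            names.entry(word.to_string()).or_insert(fresh).clone()
        })
        .into_owned()
}

fn main() {
    let keywords = ["fn", "let", "mut", "for", "in", "if", "else", "return"];
    let src = "fn total(prices: &[u32]) -> u32 { let mut s = 0; for p in prices { s += p; } s }";
    // Prints something like: fn v0(v1: &[v2]) -> v2 { let mut v3 = 0; ... }
    println!("{}", anonymize(src, &keywords));
}
```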

Guidance over hints

If you’ve gone through the effort of debugging a performance issue and figuring out how to assist the optimizer, chances are you do not want the best tool at your disposal to be a mere “hint” which the optimizer is free to silently ignore.

Users need to be able to provide guidance to the optimizer which the optimizer cannot ignore without loudly signalling that it cannot follow the guidance. This may be an error or a warning depending on what makes sense in a given context.

One might counter: wouldn’t this make code more brittle as new versions of the optimizer are released? For example, let’s say your dependency X is using some optimization guidance. You upgrade your toolchain. Suddenly, X stops compiling because an optimization didn’t fire.

There are a few answers to this.

  1. You should be auditing your dependencies for optimization guidance. For example, packages using guidance could be required to declare an "optimization-guidance" key-value pair in their manifest. This could be surfaced in auto-generated docs so that downstream consumers better understand the risk of regressions on toolchain upgrades.
  2. For open source dependencies, you can always fork the dependency to remove the guidance if you think the guidance is not important. Optionally, the package manifest could allow you to ignore guidance violations in your dependencies (or downgrade them to warnings).
  3. Missed optimization bugs are also bugs; making it easy to uncover and report them in a timely manner is a worthwhile endeavor. Like other bugs, their severity needs to be analyzed based on the surrounding context instead of being automatically assumed to be lower than that of other kinds of bugs just because the program’s dynamic semantics are not affected. As your compiler/database evolves, there may be regressions. Silently ignoring regressions is not a solid strategy for improving tooling.

In the context of databases specifically, it might make sense to add support for a query language more explicit than SQL instead of simply adding support for more magic comments. (Loosely related but a good read: Jamie Brandon’s Against SQL.)

Optimizer regression testing infrastructure

You encounter a bug in production where the throughput of some performance sensitive code is much slower than it ought to be. You go read the docs and understand how the optimizer works. You debug the issue and narrow it down to a missed optimization bug. You report the issue and it is fixed by some engineer; they add a regression test with a minimal example. You verify the fix works for your not-so-minimal code, upgrade the tool version, post your victory in the team’s Slack channel and relax back in your chair feeling smug.

It would be a real shame if you had to go through the same loop after a few releases again, right? The smugness would change to annoyance very quickly.

In situations where the optimization is done purely based on compile time information, adding guidance is sufficient as a test; if the guidance is violated, the build will fail. However, this approach has the downside of being inflexible: every new kind of guidance requires a toolchain change. This might make sense for more common things such as avoiding allocations and loop unrolling. However, for more specialized use cases, such as checking for specific instructions, it would be valuable to be able to make assertions on the optimized code (could be IR or assembly) from test code. (Yes, I recognize this introduces new challenges around IR design, API design and maintaining compatibility across optimizer versions. I don’t have solutions to all of them right now; my point is that this seems like a problem worth tackling.)
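As a sketch of what such a test could look like with today’s tools, the following shells out to rustc and string-matches the generated assembly. The probe function, the temporary file names, and the expectation that this particular loop vectorizes into a paddd on x86_64 are all illustrative assumptions; a first-class testing API would make this far less brittle:

```rust
use std::process::Command;

// Compile a tiny probe crate with optimizations and assert that the generated
// assembly contains a packed 32-bit integer add, i.e. the sum was vectorized.
#[cfg(target_arch = "x86_64")]
#[test]
fn sum_is_vectorized() {
    let src = r#"
        #[no_mangle]
        pub fn sum(xs: &[i32]) -> i32 { xs.iter().sum() }
    "#;
    let dir = std::env::temp_dir();
    let src_path = dir.join("sum_probe.rs");
    let asm_path = dir.join("sum_probe.s");
    std::fs::write(&src_path, src).unwrap();

    let status = Command::new("rustc")
        .args(["--crate-type=lib", "-Copt-level=3", "--emit=asm", "-o"])
        .arg(&asm_path)
        .arg(&src_path)
        .status()
        .expect("failed to run rustc");
    assert!(status.success());

    let asm = std::fs::read_to_string(&asm_path).unwrap();
    // `paddd` (or `vpaddd`) is the packed i32 add we expect the
    // auto-vectorizer to emit for this reduction loop.
    assert!(asm.contains("paddd"), "expected a vectorized sum, got:\n{asm}");
}
```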

EDIT(2024/10/24): Brendan Zabarauskas reminded me that there is a Glasgow Haskell Compiler (GHC) plugin for inspection testing which allows making high-level assertions around allocations, checking usages of a particular function, etc. (usage example) Thanks Brendan!

There are also situations where optimization is not based purely on compile time information: interpreters and just-in-time (JIT) compilers.

For those cases, one needs to be able to write tests that assert properties of the generated code across different classes of inputs.

For example, let’s say I want to add optimization guidance to a database query stating that a specific index Y should be used for an index-only scan for a given table T. The underlying assumption is that using the index Y is always going to be better than using other indexes for T. However, this assumption may be violated based on index statistics, or when my colleague (or future self) adds another index Z which is actually a better fit.

There are two ways of capturing this assumption in tests, depending on the strength of the guidance:

  1. If I’m using strong guidance where I force the query planner to use Y, I can:
    • Add a test across different index statistics which shows that query plans using Y are cheaper than the alternate choices (e.g. the plan obtained by leaving the choice to the optimizer). This test should fail once Z is added. (These would likely have to be property-based tests for effectiveness, or perhaps even abstract interpretation.)
  2. If I’m using weak guidance where the query planner issues a warning if it doesn’t use Y, I can:
    • Add a test over the relevant set of input statistics that asserts the query planner always picks Y when the stats fall in the expected operating range (a sketch of such a test follows this list).
    • Wire up weak guidance violations to my alerting infra.
    • Set up an alert for checking that the actual statistics at run time fall in the operating range being used in the test.
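As a minimal sketch of the first weak-guidance test above, the following assumes a test database named testdb reachable via psql, a table t standing in for T, an index y_idx standing in for Y, and a representative range query; all of these names are hypothetical:

```rust
use std::process::Command;

// Ask Postgres for the query plan and assert that, with the statistics loaded
// by the test fixture, the planner picks an index-only scan over y_idx.
#[test]
fn planner_prefers_y_idx_for_range_lookup() {
    let explain = "EXPLAIN (FORMAT JSON) SELECT y FROM t WHERE y BETWEEN 10 AND 20;";
    let output = Command::new("psql")
        .args(["--no-psqlrc", "--tuples-only", "--dbname", "testdb", "-c", explain])
        .output()
        .expect("failed to run psql");
    assert!(
        output.status.success(),
        "{}",
        String::from_utf8_lossy(&output.stderr)
    );

    let plan = String::from_utf8_lossy(&output.stdout);
    // Crude string checks on the JSON plan; a real test would parse it and
    // also compare estimated costs against the plan chosen without guidance.
    assert!(plan.contains("Index Only Scan"), "plan was:\n{plan}");
    assert!(plan.contains("y_idx"), "plan was:\n{plan}");
}
```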

Overall, while I think this is a hard problem to solve, and there’s unlikely to be a one-size-fits-all solution in terms of testing APIs, it seems like a problem worth solving.

Closing thoughts

You might have heard of Proebsting’s Law:

compiler optimization advances double computing power every 18 years

Turns out, things are actually worse than that - depending on what exactly you look at, that number might as well be 36 years or worse. Here are just some of the articles and papers that I found:

At the end, Proebsting concludes:

Perhaps this means Programming Language Research should be concentrating on something other than optimizations. Perhaps programmer productivity is a more fruitful arena.

I think there’s at least some amount of opportunity in exploring the intersection of programmer productivity and optimizations – by making optimizers more understandable, reliable and predictable, and by designing them to work together with programmers instead of only silently behind the scenes.