How should we learn from bugs?

I recently fixed a bug at work. When I realized what the fix was, I let out an audible sigh of disappointment, because it felt like a “stupid” bug.

I was a little bit angry at the different people whose code had contributed to the bug. My immediate thought was one of indignation (“ugh, this could’ve been prevented by construction if they’d used a newtype wrapper here!”) followed by some self-righteousness (“well, the code I wrote in this other codebase is not going to have this kind of issue because I used more newtypes there”) and some speculation in my own favor (“some of the code would not have passed code review had I been the code reviewer”).

As I paused to go drink some water, while I was standing at the sink, another thought crossed my mind, “Uh, what did I just do here? I somehow managed to convince myself that I’m better than the programmers whose code was involved. But I know that the people who were involved in writing the code are not stupid. It’s not like the bug was added intentionally… So have I actually understood how the bug came about?”

After that, I spent some time reflecting on the bug, in a calmer state of mind. I want to share those reflections with you.

Background context

A brief history of the codebase

I work at Sourcegraph, and the team I’m on is responsible for developing and maintaining language-specific indexers for supporting compiler-accurate code navigation actions – such as Go to definition and Find references – inside Sourcegraph.

The indexers typically use a compiler or a type checker as a library. If that’s not possible, we fork the compiler/type checker codebase, and add indexing functionality there. For reference, adding indexing functionality to an existing codebase typically takes ~3K SLOC, whereas an industrial strength type checker (not to mention a full compiler) is typically between 30K~100K SLOC.

Python is a popular language, so several customers were interested in precise code navigation support for it in Sourcegraph. Several years back, there were two major type checkers for Python - MyPy (implemented in Python) and Pyright (implemented in TypeScript) - and of the two, Pyright looked like the better starting point for us.

However, Pyright was not usable as a library. So we decided to fork it and build an indexer, scip-python (the ‘scip’ is pronounced as ‘skip’), on top of it.

The forking decision was made about 3 years back.

Due to various factors, the scope of the code our team owns has grown, while the team size has shrunk. The person who did the initial development of scip-python is no longer at the company.

We have limited maintenance capacity for the indexers themselves. We attempt to fix bugs – particularly the ones prospects and customers run into – as capacity permits.

How scip-python is tested

Our indexers generally use snapshot testing (aka golden testing), as compilers and similar tools are very amenable to this kind of testing.

For example, part of a snapshot may have something like:

class InitClass:
#     ^^^^^^^^^ definition `src.foo.bar`/InitClass#

If there are mentions of InitClass elsewhere which refer to the same semantic entity, they’ll have a reference marker with the same symbol `src.foo.bar`/InitClass#. (This output is slightly simplified.)

This makes it easy to understand the behavior of code navigation without coupling the indexer to the backend which actually implements code navigation support.
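
To give a rough idea of how such tests work, a snapshot test boils down to comparing freshly generated output against a committed golden file, with an escape hatch to re-record the golden file when an output change is intentional. Here is an illustrative sketch (not our actual test harness; the names and layout are made up):

import * as fs from "fs";
import * as path from "path";

// Hypothetical snapshot test helper; names are illustrative.
function checkSnapshot(testName: string, actualOutput: string): void {
  const snapshotPath = path.join("snapshots", `${testName}.snap`);
  const updateMode = process.env["UPDATE_SNAPSHOTS"] === "1";

  if (updateMode || !fs.existsSync(snapshotPath)) {
    // Re-record the golden file when the change in output is intentional.
    fs.writeFileSync(snapshotPath, actualOutput);
    return;
  }

  const expectedOutput = fs.readFileSync(snapshotPath, "utf8");
  if (expectedOutput !== actualOutput) {
    throw new Error(`Snapshot mismatch for ${testName}; re-run with UPDATE_SNAPSHOTS=1 if the change is intentional`);
  }
}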

The situation

Recently, we got a bug report of a crash a customer encountered when attempting to run scip-python on their (private) codebase.

The issue did not come with a minimal reproducer, so it was up to us to try to come up with one.

First, I just attempted to run the tests on my work MacBook. Exactly one snapshot test was failing, with a mismatch in the output.

 from foo.bar import InitClass
 #    ^^^^^^^ reference `foo.bar`/__init__:
-#                   ^^^^^^^^^ reference `src.foo.bar`/InitClass#
+#                   ^^^^^^^^^ reference /InitClass#

I’m showing a subset of the output; the actual diff was similar but spanned about 35 lines in total.

What is happening here is that at the reference range, the symbol is missing the module prefix `src.foo.bar`. This means that code navigation will not work from the reference site to the definition site we saw earlier, because the symbols need to match exactly.

At Sourcegraph, employees generally use MacBooks. The CI pipeline for scip-python was using Ubuntu, and it was passing. There was no CI pipeline for macOS.

When I saw this failure, I vaguely remembered having hit this bug earlier, and not having been able to figure out a fix within a day or two of debugging.

I thought that, before fixing the crash the customer hit, it’d be good to make sure that all tests passed on macOS for ease of local development (instead of adding a hack to skip the failing test on macOS), and to set up CI for macOS. I gave myself a time budget of 3-4 days.

In the past, I had not done much work on scip-python’s codebase, so I had little context. And what little context I’d once had, I had already lost by the time I started debugging this issue, because I hadn’t worked on the codebase recently.

What I did

I used an AI coding assistant to onboard myself, to help with running commands, to add print debugging statements in lots of places, and to compare the logs on macOS vs Linux to identify the places of divergence.

While doing other work in parallel, I was able to figure out a one-line fix in a couple of days.

The fix involved normalizing the paths being added to a Program object (essentially a kind of god object/global state/context) using an existing helper function. This helper function would lowercase the paths on case-insensitive file systems (such as macOS’s APFS) and leave them as-is on case-sensitive file systems (such as ext4 on Linux).

This was a one-line fix because the path normalization function took an input path of type string and returned a normalized path of type string.
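
To make the shape of the fix concrete, here is a minimal sketch in TypeScript. The identifiers (Program, normalizePathCase, addSourceFile, and so on) are hypothetical stand-ins rather than the real Pyright/scip-python APIs; the point is only that the fix amounts to routing paths through the existing normalization helper before they are stored:

// Illustrative sketch only; identifiers are hypothetical stand-ins.

interface SourceFileInfo {
  moduleName: string;
}

// Stand-in for the existing helper: lowercase paths on case-insensitive file
// systems (such as APFS on macOS), leave them as-is on case-sensitive ones (such as ext4).
function normalizePathCase(filePath: string, isCaseInsensitiveFs: boolean): string {
  return isCaseInsensitiveFs ? filePath.toLowerCase() : filePath;
}

class Program {
  private sourceFiles = new Map<string, SourceFileInfo>();

  constructor(private readonly isCaseInsensitiveFs: boolean) {}

  addSourceFile(filePath: string, info: SourceFileInfo): void {
    // The conceptual one-line fix: normalize before storing, so that later
    // lookups keyed on normalized paths find this entry.
    this.sourceFiles.set(normalizePathCase(filePath, this.isCaseInsensitiveFs), info);
  }

  getSourceFile(filePath: string): SourceFileInfo | undefined {
    return this.sourceFiles.get(normalizePathCase(filePath, this.isCaseInsensitiveFs));
  }
}

With both insertion and lookup going through the same helper, the map keys agree regardless of the file system’s case sensitivity.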

Aside: Counterfactual reasoning

Per Wikipedia:

Counterfactual thinking is a concept in psychology that involves the human tendency to create possible alternatives to life events that have already occurred; something that is contrary to what actually happened. Counterfactual thinking is, as it states: “counter to the facts”. These thoughts consist of the “What if?” and the “If only…” that occur when thinking of how things could have turned out differently. Counterfactual thoughts include things that – in the present – could not have happened because they are dependent on events that did not occur in the past

Counterfactual reasoning is dangerous when attempting to understand why things happened the way they happened, and why people made the decisions they made, because it assumes an omniscient God-like view of not only the past, but also of potential futures that were not taken. Sidney Dekker has an evocative image about this:

Counterfactual reasoning involves taking a view from “outside” the tunnel, with a full map, without necessarily understanding what people in the tunnel were experiencing.

In the rest of this blog post, I will highlight examples of counterfactual statements and counterfactual reasoning with the ⚠️ emoji.

Three potential analyses of the bug

OK, so now you know a bunch of background about the bug, as well as the fix.

Let’s try looking at it in a few different ways.

  1. A simple “cause”
  2. An attempt at identifying the “root cause”
  3. An attempt at identifying some contributing factors to the bug

Analysis 1: What “caused” the bug?

This bug was “caused” by the code not (⚠️) correctly normalizing the paths before adding them to the Program instance.

Analysis 2: ‘Five Whys’ and the “root cause”

Five Whys is a technique purportedly used to identify the “root cause” of a problem by repeatedly asking ‘Why?’ questions. As the name suggests, the canonical number of ‘Why?’ questions is five, but depending on who you ask, this number can be tweaked. I believe it is somewhat popular, in that I’ve met people who know about this, but I’m not sure.

Let’s try it here.

Initial statement: The snapshot output was different on macOS vs Linux.

  1. Why was it different?

    A: On macOS, the code was setting a property moduleName to an empty value, leading to this bug. On Linux, the moduleName was set to a non-empty value.

  2. Why was the module name set to empty on macOS?

    A: On macOS, when the code ran, it used normalized paths from one code path, and non-normalized paths from another code path. These paths were used as hash map keys, leading to certain lookup operations returning undefined during import resolution. This was later propagated through as moduleName: ''. (There’s a small sketch of this mechanism right after this list.)

  3. Why was the code passing normalized paths in some cases and not others?

    A: These code paths were written by different people. The code where the paths were passed as-is was written by the person who forked Pyright and was the original scip-python maintainer. The lead (solo) maintainer of Pyright wrote the other code where the paths were first normalized.

    The original scip-python maintainer perhaps did not have sufficient context (⚠️) of the unstated but important (⚠️) pre-conditions of different APIs. It’s hard to know for sure.

  4. Why did the original scip-python maintainer not notice (⚠️) and fix the test failures on macOS?

    A: They were using a Linux machine for development. I believe that they got this machine before the company-wide MacBooks-only policy was announced, or maybe they got an exception. The CI was Ubuntu-only, so there was no environment where they would notice this, unless someone else was also running the tests locally.

  5. Why was macOS not tested (⚠️) in CI if developers use MacBooks locally?

    A: I’m not sure, but if I were guessing, it’s probably because the person didn’t anticipate (⚠️) that an indexer for Python written in TypeScript would have major differences between OSes (unlike, say, an indexer for C++, potentially written in a compile-to-native language), so they must’ve thought that testing on Ubuntu would be sufficient. Additionally, in production, the indexers generally run on Linux.
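
To make the mechanism from answer 2 concrete, here is a small illustrative sketch (again with hypothetical names, not the real codebase) of how mixing normalized and non-normalized map keys turns into an empty module name:

// Hypothetical sketch of the pre-fix behavior; names do not match the real codebase.
const moduleNameByPath = new Map<string, string>();

// One code path inserts entries keyed by normalized (lowercased) paths, as on macOS.
moduleNameByPath.set("/users/alice/project/src/foo/bar.py", "src.foo.bar");

// Another code path looks up with the path as-is, without normalizing it first.
const moduleName = moduleNameByPath.get("/Users/alice/project/src/foo/bar.py") ?? "";

// On a case-insensitive file system both strings name the same file, but as map
// keys they differ, so the lookup misses and moduleName ends up as ''.
console.log(JSON.stringify(moduleName)); // prints ""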

OK… So is the “root cause” of the bug the lack of anticipation when setting up CI?


Maybe you think that’s not a very good attempt at Five Whys.

Below, I’ve attempted to formulate some more questions – such as those people might ask in forum comments – and answer them faithfully.


OK… So what is the “root cause” in light of the above?

Analysis 3: Contributing factors

Let’s try listing the various factors which contributed to this bug.

For this list, I’ve taken care to avoid counterfactuals (this was difficult!). The list is, by definition, incomplete.

You might be wondering: so what do we do with this list now? Shouldn’t we at least discard some points for simplicity? For example, lots of applications run fine on macOS, and lots of applications written in TypeScript don’t necessarily have the same kinds of bugs.

To that, my response is: yes, we can discard points. But our choice to discard individual points doesn’t mean that those factors did not contribute.

If we consider the point about case-insensitivity of APFS, a quick web search will reveal that case-insensitivity on macOS (or Windows, for that matter) often bites programmers used to Linux. For example, Case-insensitive filesystems considered harmful (to me) talks about a build failure when using the Kotlin compiler along with a specific code pattern that only triggers on case-insensitive file systems, such as those on macOS and Windows.
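
If you want to see the difference for yourself, a tiny Node script along these lines behaves differently depending on the file system it runs on:

import * as fs from "fs";

// Create a file with one casing, then check for it with a different casing.
fs.writeFileSync("CaseTest.txt", "hello");

// On a case-insensitive file system (APFS on macOS by default, NTFS on Windows)
// this prints true; on a case-sensitive one (ext4 on Linux) it prints false.
console.log(fs.existsSync("casetest.txt"));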

As to the question of what do we do with this list, I’ll quote Lorin Hochstein’s The Gamma Knife Model of Incidents:

Gamma knife is a system that surgeons use for treating brain tumors by focusing multiple beams of gamma radiation on a small volume inside of the brain.

[Illustration of a gamma knife, with multiple beams of radiation converging.]

Each individual beam is of low enough intensity that it doesn’t affect brain tissue. It is only when multiple beams intersect at one point that the combined intensity of the radiation has an impact.

Every day inside of your system, there are things that are happening (or not happening(!)) that could potentially enable an incident. You can think of each of these as a low-level beam of gamma radiation going off in a random direction. Somebody pushes a change to production, zap! Somebody makes a configuration change with a typo, zap! Somebody goes on vacation, zap! There’s an on-call shift change, zap! A particular service hasn’t been deployed in weeks, zap!

Most of these zaps are harmless, they have no observable impact on the health of the overall system. Sometimes, though, many of these zaps will happen to go off at the same time and all point to the same location. When that happens, boom, you have an incident on your hands.

Alas, there’s no way to get rid of all of those little beams of radiation that go off. You can eliminate some of them, but in the process, you’ll invariably create new ones.

To this, I can imagine someone reasonably bringing up a few different objections:

  1. The metaphor sounds interesting, but what I’ve said is not very actionable. Is my point that we should just give up on trying to improve things, such as by using alternate/“superior” coding patterns?
  2. There is a big difference between the bug I described and the kinds of incidents that Hochstein is talking about.

Let me cover the second point first. Consider this one hypothetical gamma ray (this is meant as a creative example, not a reflection of day-to-day operations):

The CEO is going to be talking to a big prospect (Python shop) in an hour, and they are interested in precise code navigation for Python. This meeting was set up on short notice, so there was no heads up. The CEO indexed a popular Python codebase using his MacBook, but code navigation is not working correctly. The CEO needs urgent support.

So perhaps the difference between a bug and an incident is not as big as it might seem at first glance?

Coming back to the first question: surely, we need some actionable takeaways from a bug analysis/retro, right? Surely, I must agree that defining a dedicated type for normalized paths, by virtue of being a more systemic fix, is the “best” solution, right?

Path handling scar tissue

At my first job at Apple, the first slightly complex PR that I created against the Swift compiler involved modifications to a so-called module trace file emitted by the compiler for an Apple-internal build system. One of the key bits of information recorded in this module trace are paths to various files that are consulted by the compiler.

When creating this PR, I recall spending a fair bit of time trying to understand the various cases of when certain paths would be relative vs absolute, and if they were relative, then relative to what. I distinctly remember asking my mentor at the time as to why we weren’t using dedicated types for different kinds of paths. I don’t remember the exact answer, but I think it was something along the lines of nobody having suggested it earlier (at least, based on what my mentor knew).

In retrospect, one big contributing factor is at least somewhat clearer. The Swift compiler codebase is best viewed as a close cousin of Clang; multiple people who originally worked on Swiftc previously worked on Clang. There are a lot of similarities in coding patterns between the two. And the Clang codebase does not distinguish various kinds of paths; it generally uses the string type or a type representing a reference to a string.

In the end, I was still not super confident about the module trace PR when I submitted it (and we still landed it), and I ended up creating some follow-up PRs later to make the logic more robust, and to handle more cases.

That experience left a somewhat deep impression on me.

So when I joined Sourcegraph, and I was working on a new indexer for C++, I ended up defining several distinct types for the different kinds of paths involved.

Is that the “best” way to do things? Probably not. Looking at the code again, there are definitely parts which are complex, and it’d take me some time to make sense of them.

If building a theory of a program is like climbing, then the finer-grained types feel like additional ledges and footholds I can use, instead of having to face a scarier, sheer wall.


I chose my current handle on various sites (typesanitizer) in part because I held and continue to hold the belief that most existing programs which need to be maintained over several years or decades, particularly with a varying set of maintainers, would probably benefit from defining more distinct types to capture different sets of invariants.
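
In TypeScript specifically, one lightweight way to get such a distinct type is a ‘branded’ string type, which costs nothing at runtime. Here is a minimal sketch (not code from scip-python) applied to the normalized-path idea from earlier:

// Minimal sketch of a branded/newtype-style path type; not actual scip-python code.
type NormalizedPath = string & { readonly __brand: "NormalizedPath" };

function normalizePath(rawPath: string, isCaseInsensitiveFs: boolean): NormalizedPath {
  const normalized = isCaseInsensitiveFs ? rawPath.toLowerCase() : rawPath;
  return normalized as NormalizedPath;
}

// Maps and functions that require normalized paths can now say so in their
// signatures, so passing a raw string is a compile-time error instead of a
// silently missed lookup at runtime.
const moduleNameByPath = new Map<NormalizedPath, string>();

function lookupModuleName(p: NormalizedPath): string | undefined {
  return moduleNameByPath.get(p);
}

// lookupModuleName("/Users/alice/src/foo/bar.py");                      // type error: plain string
// lookupModuleName(normalizePath("/Users/alice/src/foo/bar.py", true)); // OK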


Coming back to the question I posed earlier:

Surely, I must agree that defining a dedicated type for normalized paths, by virtue of being a more systemic fix, is the “best” solution, right?

In the situation where it makes business sense for our team to spend more time on indexer maintenance, maybe yes.

In the current situation, the “best” solution is probably to do nothing. Why? Because time is the biggest constraint.

If there is a bit of time, perhaps we can tweak the code paths which access the moduleName field in scip-python to handle the empty case, and just skip emitting a definition or reference. We could potentially log a warning message for the user to file a bug, and potentially include some debugging information in the warning, but I’m not sure how easy it would be to reproduce the issue without access to the code itself.
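
As a rough sketch of what that tweak might look like (the identifiers below are hypothetical, not the real emitter code):

// Hypothetical sketch of the mitigation; names are illustrative.
function emitOccurrence(moduleName: string, symbolSuffix: string): void {
  if (moduleName === "") {
    // Skip emitting a definition/reference whose symbol would be missing its
    // module prefix, and nudge the user towards filing a bug report.
    console.warn(
      `scip-python: could not determine the module name for ${symbolSuffix}; ` +
        `skipping this occurrence. Please consider filing a bug report.`
    );
    return;
  }
  emitSymbol(`\`${moduleName}\`/${symbolSuffix}`);
}

function emitSymbol(symbol: string): void {
  // Stand-in for the real SCIP emission logic.
  console.log(`emit ${symbol}`);
}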

Some observations and beliefs

Coming back to the titular question:

How should we learn from bugs?

This question is well-intentioned, but unfortunately, it omits a key detail: specifically, what learning are we talking about here? Do we mean:

Or maybe we mean something else.

Notice how all of these questions bake certain assumptions into them. For example, “How can we learn (why the bugs came about) from (old/existing) bugs?” is potentially silently accompanied by the assumption that future bugs are likely to resemble past bugs.

Depending on context, these assumptions may or may not be true. For example, if your project heavily uses property-based testing (or its more supercharged version deterministic simulation testing), it’s possible that bugs at later times, more often than not, do not resemble bugs in the past.

In any case, given the large variety of programs that people write, the large variety of contexts programmers operate in, and the large variety in bugs that programs have, I do not feel confident enough to actually provide an answer to the titular question.

That said, here are my observations and beliefs:

  1. Taking bugs seriously has two sides to it.
  2. Avoiding counterfactuals is worth the effort.
  3. It is worth paying attention to contexts.
  4. Depth is a property of analyses of bugs, not of bugs themselves.

Let’s walk through each of these.

The two sides of taking bugs seriously

When I say “taking bugs seriously”, I mean that as shorthand for operating with a core belief that it is possible to eliminate certain (kinds of) bugs to a significant extent from programs that one is yet to write in the future by doing something in the present, and that striving to reduce bugs is worth the effort.

On one hand, it seems highly likely that if you have this kind of mindset, then you will probably try to find new techniques, experiment with them, and reduce or eliminate certain (kinds of) bugs from the programs you write. Depending on the context, this may be helpful for you.

On the other hand, I think there is a mental cost when actually encountering certain bugs. For example, maybe you thought you eliminated the possibility of a certain kind of bug, but actually, someone (maybe yourself!) managed to write code a certain way to trigger it again.

Or maybe you’re working on a codebase written by other people, and you run into bugs which you know can be avoided by writing code a certain way.

And sometimes, giving up / not doing anything is the best option, and that can be frustrating. Earlier, when I wrote the words that doing nothing was probably the “best” solution in the current situation for the path handling in scip-python, there was an agitated voice in my brain going “do you really believe this??” as well as a more sarcastic voice going “so you’re saying we must imagine Sisyphus happy?”

As much as I’d like to be equanimous when encountering bugs or debugging, sometimes, feelings do get in the way.

Avoiding counterfactuals is worth the effort

Counterfactuals come up very naturally in thinking. While you were reading the earlier sections, particularly the “root cause” analysis, you might’ve been able to relate the example to some situations you’ve encountered at work.

You’ve probably seen people frame bugs as “If X were the case, this bug wouldn’t have happened.” – it probably happens all the time.

A few months back, I wrote an ~8K word incident report at work. As part of the process, I interviewed the people who were involved in the incident, some live, and some async. By far the hardest thing about this process was avoiding counterfactual questions, both in live interviews as well as during async interviews.

The problem with counterfactuals is that they generally assume an omniscient view of the history of the situation. But well-meaning people generally make the decisions they make because those decisions make sense to them, given their knowledge, skills and information about the situation they are in, at the time they made the decision.

So to understand why things happened the way they happened, it is generally more useful to frame questions in terms of what the people involved knew, saw, and were trying to accomplish at the time they made their decisions, rather than asking “why didn’t they do X?”

The recent discussion on an OpenZFS bug is a good example of this. Some commenters were much more focused on how the bug could’ve been avoided if XYZ were true (where XYZ is known to not have been true), rather than how the bug came to be, why the codebase uses the naming conventions that it does, and so on.

Paying attention to contexts

If you’re a programmer, your day-to-day work involves taking things from one context and using them in another context.

A new username typed on a website becomes part of a new row in a database. A function name written in a text file becomes a debug symbol in an object file. A Set type defined in the standard library becomes a way to model a business constraint of no duplicates.

So it’s entirely unsurprising when we attempt to take ideas from one context and try to apply them in other contexts – that’s literally part of the job.

However, when we decontextualize bugs, or approaches for avoiding bugs, I believe we reduce the probability of the conversation having a constructive outcome.

Approaches for avoiding certain kinds of bugs that work in some contexts may not be workable in other contexts because of business constraints, limitations in the specific technologies, human factors such as inertia, and so on. So it’s worthwhile to pay closer attention to context when having discussions on particularly complex topics, such as bug prevention.

Depth is a property of analyses of bugs; not of bugs themselves

Generally, I’ve seen that the time spent on discussing or analyzing a bug is correlated with the impact it had, either internally or on a customer. For example, incidents receive more eyeballs than bugs which didn’t turn into incidents, and bugs hit by customers are discussed more than bugs which are not.

This might lead one to think (or maybe it is a reflection of the implicit assumption) that bugs which had more impact are “deeper”, i.e., they warrant a deeper analysis because there is likely more to learn from them, and spending effort deeply analyzing high-impact bugs will have higher expected utility (as compared to deeply analyzing low-impact bugs).

I suspect that this implicit assumption is false, but I do not have compelling evidence either way.

At some point, I’d like to run an experiment of deeply analyzing low-impact bugs as well as high-impact bugs, and comparing the impact of the analyses. Measuring impact is hard, though; that’s just one of many potential complications when trying to test this in a manner where the results are usable for other people.

Aside: Depth ≠ fitness-for-purpose

Perhaps there’s one additional bit that’s worth clarifying: I’m not implying that deeper analyses are necessarily more fit-for-purpose. Whether an analysis is fit-for-purpose depends on the purpose it is being used for, and the constraints you have when you’re doing the analysis.

For example, if you’re working in an environment where there’s pressure on you to provide concrete action items to prevent bugs, and yet business constraints mean that you only have limited time to actually follow up on the said action items, then a “root cause” analysis process would likely be more fit-for-purpose than the kind of contributing factors analysis I presented earlier.

Closing thoughts

I don’t know how we should learn from bugs.

I think there are at least some situations in which we cannot do anything about certain kinds of bugs until the situation/context changes, and it is important to come to terms with that for the sake of one’s mental health, instead of being indignant in the face of bugs. Depending on the person, accepting this can be hard.

At the same time, there may be situations in which we can improve things, such as subsystems where one has greater influence, or greenfield projects, and it makes sense to focus one’s energy on trying out new bug prevention techniques in those contexts.

When we’re discussing bugs with others, as well as when we’re debugging by ourselves, I think, on average, we could probably do with extending more grace, and starting with questions instead of solutions.

P.S. If you’re reading this via a forum, and you’ve spent a fair bit of time thinking about the meta question ‘How should we learn from bugs?’, or maybe even just trying different things (such as running different kinds of retro processes), I’d love to hear from you in a forum comment or via email (contact info on homepage).