TFW UBSan flags a dynamic type error inside Clang

Intended audience: Programmers familiar with C++ to some extent. If you’re not a C++ programmer, this blog post primarily revolves around C++ sanitizers, which are basically compiler-inserted dynamic checks (+ a small runtime) that detect various classes of errors that arise in C and C++. Address Sanitizer (ASan) catches memory-related errors, like use-after-free, and Undefined Behavior Sanitizer (UBSan) catches undefined-behavior-related errors, like signed integer overflow.
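
To make "compiler-inserted dynamic checks" a bit more concrete, here is a hand-written analogue of one such check. This is only a sketch: addChecked is a made-up helper, __builtin_add_overflow is a GCC/Clang builtin (so this assumes one of those compilers), and the real UBSan instrumentation is generated by the compiler rather than written by hand.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: adds two ints, reporting overflow instead of
// invoking undefined behavior. Returns true when the sum fits in an int.
bool addChecked(int a, int b, int &result) {
    // __builtin_add_overflow returns true when the addition overflowed.
    return !__builtin_add_overflow(a, b, &result);
}
```

Roughly speaking, with -fsanitize=undefined a plain a + b on ints gets a similar overflow test wrapped around it, except that a failure is reported by the sanitizer runtime instead of being returned to the caller.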

Meta: This post deliberately doesn’t have a TLDR, because I wanted to convey the journey more so than an elevator-pitch-friendly summary. Scroll to the end if you only want to read my take-aways.

Background

At work, lately I’ve been hacking on a new C++ project. Normally, I’m a big fan of using sanitizers if possible to catch bugs. In my experience, ASan has typically been much more helpful than UBSan, but just last week, it took me a few hours to track down a bug with ASan which would’ve been much quicker had I also turned on UBSan. So that was at the back of my mind.

Also, I should mention that my work project builds with Bazel, utilizing LLVM’s unofficial Bazel support (it’s also used by TensorFlow and other projects) rather than the official CMake support.

A problem comes up…

Earlier last week, I’d accidentally landed a feature without checking that the code was not triggering any assertions when running against certain large inputs (why not check this in CI? well, it’s complicated to answer, and not terribly interesting, but the overall gist is that the project is a very early prototype, and I don’t want to slow down the already not-too-speedy CI too much).

On Thursday, while I was fixing some bugs that I ran into after landing the feature, I figured I’d make sure the build was ASan and UBSan clean as well (not just assertion clean). That way, I’d have greater confidence that I didn’t introduce more bugs while refactoring the code.

I just remembered what the flags were, so I added them to the .bazelrc file.

build:dev --copt="-fsanitize=address" --copt="-fsanitize=undefined"

That way, I could build the code with bazel build <blah> --config=dev and have it pass the flags without having to repeat them.

Seems simple enough, right?

I ran the code, and immediately hit a UBSan failure:

external/llvm-project/clang/include/clang/AST/Redeclarable.h:200:15: runtime error: downcast of address 0x000120b1b908 which does not point to an object of type 'clang::TranslationUnitDecl'
0x000120b1b908: note: object is of type 'clang::Decl'
 00 00 00 00  f0 77 c7 0d 01 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'clang::Decl'

For context, my work project accesses Clang internals. The failure you’re seeing is a dynamic type error where something has gone wrong inside Clang (specifically in Redeclarable.h).
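
For readers unfamiliar with the term, a dynamic type error roughly means treating an object as a class it isn't actually an instance of. Here is a minimal sketch with made-up types (Animal and Dog are hypothetical and have nothing to do with Clang). dynamic_cast checks the object's real type at run time, whereas a static_cast downcast performs no check at all, and an unchecked downcast is what the UBSan diagnostic above is complaining about.

```cpp
#include <cassert>

// Hypothetical two-class hierarchy, purely for illustration.
struct Animal {
    virtual ~Animal() = default;
};

struct Dog : Animal {
    int tricks = 3;
};

// Returns true only when `a` really points at a Dog (or something derived
// from Dog). dynamic_cast consults the language-provided RTTI to decide.
bool pointsToDog(Animal *a) {
    return dynamic_cast<Dog *>(a) != nullptr;
}
```

Writing static_cast<Dog *>(a) for an object that is only an Animal compiles fine, but using the result is undefined behavior. That situation is what the vptr check is designed to flag.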

Flailing and failing

OK, this was weird. I was like… hmm, that’s pointing inside Clang code. But at the time, my code wasn’t doing anything non-trivial at all with Clang APIs. All it was doing was asking Clang to type-check a translation unit. So I had a couple of different hypotheses:

  1. I’m making some stupid mistake in invoking Clang’s internal APIs.
  2. Clang itself has some undefined behavior (close to HEAD; I was working off a 1 month old commit) that nobody has caught yet.

Since there wasn’t a stack trace, it wasn’t super clear to me how the particular API in Redeclarable.h was being invoked. So I figured, let’s just try some brief hacks first. I tried commenting out some of the code calling Clang APIs. Didn’t help. OK. I spent a bit of time poring over my code to see if anything stood out. Nothing. I briefly looked at the code in Kythe, which was similar to what I had. Nothing stood out.

I figured… maybe it’s actually a bug in Clang. Would be strange, but I’ve seen enough compiler bugs that it didn’t seem like an impossibility.

A Clang bug, mayhaps?

I looked at the LLVM issue tracker, and found a GitHub issue with a similar crash signature:

/opt/venv/llvm-project/clang/include/clang/AST/Redeclarable.h:200:15: runtime error: downcast of address 0x621000017d08 which does not point to an object of type 'clang::TranslationUnitDecl'
0x621000017d08: note: object is of type 'clang::Decl'
00 00 00 00  10 90 78 08 b6 55 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'clang::Decl'

That GitHub issue states:

When LLVM/Clang are built with -fvirtual-function-elimination, and any optimization above -O0, most of the LLVM binaries all SEGV almost immediately (e.g. lli, llc, clang, opt); --help* or no command are the only commands that will not cause a crash.

OK! So I tried building my code with -fno-virtual-function-elimination. That should hopefully fix it! Except it didn’t. 💩

OK… Maybe some other optimization also triggers this? I did a rebuild with -O0, as the bug report claimed that higher optimization levels triggered the error. Still getting UB…

At this point, I figured that maybe my issue had a different cause (but same symptoms) as that GitHub issue.

I had a checkout of LLVM lying around so that I could easily browse Clang’s code. I skimmed the docs to see if there was an easy way to turn on sanitizers. I saw a CMake variable LLVM_USE_SANITIZER which seemed to do just what I wanted. Nice! I built Clang with ASan and UBSan, and invoked it against the translation unit that was triggering UB.

And it ran just fine. No UB. No ASan error. As if everything was fine and dandy.

The mystery deepens

Given that a UBSan build of Clang itself was not triggering UB, I went back to my first hypothesis. Maybe something was wrong with my code, which was causing the linked-in Clang to have UB. So I figured, let’s try to reduce this translation unit first, so that I have a minimal test case.

For that, I downloaded creduce. (Aside: creduce also depends on LLVM, so this incurred yet another build of LLVM via Homebrew.)

After a bit of cajoling, creduce ran against the translation unit and helpfully reported that the empty translation unit (i.e. an empty file) was still triggering UB.

I verified that I hadn’t messed up my invocation of creduce. Yes, the Clang I was linking in was hitting UB when type-checking an empty file. This seemed even more strange… Maybe something is going wrong when initializing the built-in state of Clang, but only when I link it in?

A wild stack trace appears

At this point you’re probably wondering, why have I not yet mentioned attaching a debugger and poking around my code? The problem was that the UB was getting triggered in a child process, and a brief look at the LLDB docs didn’t give any insight on how/if LLDB worked properly in a multiprocess setting. I found some discussion on the LLVM discourse and some mention of a contract for multiprocess support, which didn’t seem terribly promising. There was also IPC going on with timeouts in my code. So I didn’t jump on using a debugger right away.

At this point, I figured, let me look at the UBSan docs to see if I can get more useful information. I found a way to turn on stack traces when UB is triggered, whoo! (Aside: This requires a llvm-symbolizer binary, which is not bundled alongside Xcode. But no worries; I’d already built Clang from source, so I already had llvm-symbolizer available without building LLVM yet again.)
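
The UBSan docs describe a print_stacktrace=1 setting, usually passed via the UBSAN_OPTIONS environment variable. Assuming that is the option in play here, one way to avoid having to remember the environment variable is to bake the default into the binary through the runtime's __ubsan_default_options hook. A sketch (stackTracesRequested is a hypothetical helper, only there so that the setting is easy to check from a test):

```cpp
#include <cassert>
#include <cstring>

// Picked up by the UBSan runtime at startup when the binary is built with
// -fsanitize=undefined; in a non-sanitized build it is just an ordinary function.
extern "C" const char *__ubsan_default_options() {
    return "print_stacktrace=1";
}

// Hypothetical helper: reports whether the baked-in defaults ask for stack traces.
bool stackTracesRequested() {
    const char *opts = __ubsan_default_options();
    return opts != nullptr && std::strstr(opts, "print_stacktrace=1") != nullptr;
}
```

Symbolized frames still depend on llvm-symbolizer being discoverable, as mentioned above.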

This was the stack trace:

external/llvm-project/clang/include/clang/AST/Redeclarable.h:200:15: runtime error: downcast of address 0x000120e16908 which does not point to an object of type 'clang::TranslationUnitDecl'
0x000120e16908: note: object is of type 'clang::Decl'
 00 00 00 00  30 b7 ff 0d 01 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'clang::Decl'
    #0 0x10533ad48 in clang::Redeclarable<clang::TranslationUnitDecl>::Redeclarable(clang::ASTContext const&)+0xf8 (scip-clang:arm64+0x104f02d48)
    #1 0x10533a8d0 in clang::TranslationUnitDecl::TranslationUnitDecl(clang::ASTContext&)+0x13c (scip-clang:arm64+0x104f028d0)
    #2 0x10533ad88 in clang::TranslationUnitDecl::TranslationUnitDecl(clang::ASTContext&)+0x20 (scip-clang:arm64+0x104f02d88)
    #3 0x105388068 in clang::TranslationUnitDecl::Create(clang::ASTContext&)+0x3c (scip-clang:arm64+0x104f50068)
    #4 0x104d0ca24 in clang::ASTContext::addTranslationUnitDecl()+0x110 (scip-clang:arm64+0x1048d4a24)
    #5 0x104d0a4d0 in clang::ASTContext::ASTContext(clang::LangOptions&, clang::SourceManager&, clang::IdentifierTable&, clang::SelectorTable&, clang::Builtin::Context&, clang::TranslationUnitKind)+0x1760 (scip-clang:arm64+0x1048d24d0)
    #6 0x104d0e630 in clang::ASTContext::ASTContext(clang::LangOptions&, clang::SourceManager&, clang::IdentifierTable&, clang::SelectorTable&, clang::Builtin::Context&, clang::TranslationUnitKind)+0x28 (scip-clang:arm64+0x1048d6630)
    #7 0x100beef08 in clang::CompilerInstance::createASTContext()+0x15c (scip-clang:arm64+0x1007b6f08)
    #8 0x100e6f894 in clang::FrontendAction::BeginSourceFile(clang::CompilerInstance&, clang::FrontendInputFile const&)+0x53dc (scip-clang:arm64+0x100a37894)
    #9 0x100bf8dbc in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&)+0xadc (scip-clang:arm64+0x1007c0dbc)
    #10 0x100919b98 in clang::tooling::FrontendActionFactory::runInvocation(std::__1::shared_ptr<clang::CompilerInvocation>, clang::FileManager*, std::__1::shared_ptr<clang::PCHContainerOperations>, clang::DiagnosticConsumer*)+0x3d4 (scip-clang:arm64+0x1004e1b98)
    #11 0x10091948c in clang::tooling::ToolInvocation::runInvocation(char const*, clang::driver::Compilation*, std::__1::shared_ptr<clang::CompilerInvocation>, std::__1::shared_ptr<clang::PCHContainerOperations>)+0x338 (scip-clang:arm64+0x1004e148c)
    #12 0x10091482c in clang::tooling::ToolInvocation::run()+0x6a0 (scip-clang:arm64+0x1004dc82c)
    #13 0x1006e7ef4 in scip_clang::Worker::processTranslationUnit(scip_clang::SemanticAnalysisJobDetails

That’s a lot of frames between my code (at the bottom with scip_clang) and the actual UB being triggered. It seemed strange that something I was doing could be causing that UB.

I looked at the code involving the relevant functions, and it seemed like routine constructor calls/type initialization. Nothing stood out as particularly funky.

At this point, I started wondering… could it be a UBSan bug, where it was misdiagnosing UB when there wasn’t UB? It seemed unlikely, but again, not impossible. When I’d run the code without UBSan, it had run just fine (although, with undefined behavior, it’s very much possible that code works fine despite undefined behavior happening).

LLDB to the rescue?

Remember the bit about multi-processing/IPC that complicated attaching a debugger? I figured, I might as well refactor the code so that I can run the process standalone (instead of spawning it as a child) and without IPC. At least for debugging. That way, in the future, if I ran into a similar ASan/UBSan issue, being able to attach a debugger easily would be very helpful.

So I spent a bit of time refactoring the code. Then I ran it under LLDB.

When the UB was triggered, I went through the different stack frames and checked the type of the this variable, which was triggering the UB. LLDB said this had type Decl *, not TranslationUnitDecl *, just as UBSan was claiming (and unhappy with).

So it was not a UBSan bug. The type was actually wrong. (Or both UBSan and LLDB were simultaneously wrong…)

I spent a bit of time racking my brain as to how that was possible. Surely, in a constructor, the dynamic type of this has to be the same as the type whose constructor is being invoked, right? Was I running into some weirdness around multiple inheritance? Was there some memory corruption going on…?

I spent a bit of time staring at the code, walking up and down the call stack in the debugger. I noticed one of the frames was initializing the type via operator new. Huh… Does operator new have funky rules around how the dynamic type was handled?

What are types, anyways?

यह सब मोह माया है (lit. these are all infatuations/illusions)

(old Hindi spiritual saying)

So TranslationUnitDecl inherits from Decl. It was also inheriting operator new from Decl. Could it be that since the operator new is called in the parent class, the memory has the dynamic type of Decl, and it’s somehow not being downcast to TranslationUnitDecl?

As I was looking at TranslationUnitDecl’s constructor, I had a realization.

TranslationUnitDecl::TranslationUnitDecl(ASTContext &ctx)
    : Decl(TranslationUnit, nullptr, SourceLocation()),
      DeclContext(TranslationUnit), redeclarable_base(ctx), Ctx(ctx) {}

It may not be obvious looking at the code here, but this is actually utilizing a custom run-time type information (RTTI) system. LLVM has its own RTTI system, which is essentially implemented as ML-style enums (or Scala style case classes) with a tag + payloads. The TranslationUnit in the parent class initialization Decl(...) is setting the enum tag (which is used by LLVM’s type-casting machinery), and the rest of the data is the payload.
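
For anyone who hasn't seen this pattern, here is a rough miniature of a tag-based RTTI scheme. It is a sketch in the spirit of LLVM's approach, not its actual code: Node, UnitNode, VarNode and tagCast are made-up names, and LLVM's real isa/cast/dyn_cast templates are more elaborate.

```cpp
#include <cassert>

// Hypothetical base class carrying the type tag.
class Node {
public:
    enum Kind { UnitKind, VarKind };
    Kind getKind() const { return kind; }

protected:
    // Each subclass passes its own tag up to the base, much like
    // Decl(TranslationUnit, ...) in the constructor above.
    explicit Node(Kind k) : kind(k) {}

private:
    Kind kind;
};

class UnitNode : public Node {
public:
    UnitNode() : Node(UnitKind) {}
    static bool classof(const Node *n) { return n->getKind() == UnitKind; }
};

class VarNode : public Node {
public:
    VarNode() : Node(VarKind) {}
    static bool classof(const Node *n) { return n->getKind() == VarKind; }
};

// Checked downcast driven purely by the tag; no language RTTI or vptr involved.
template <typename To>
const To *tagCast(const Node *n) {
    return (n && To::classof(n)) ? static_cast<const To *>(n) : nullptr;
}
```

Note that nothing here needs a vtable or typeid, which is why a codebase built this way can be compiled without the language-provided RTTI.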

OTOH, the failure was coming from a static_cast (not shown here) inside the redeclarable_base(ctx) call.
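
A side note on why a vptr could say Decl at that point. This is my reading rather than something the stack trace proves, so treat it as an assumption. Standard C++ says that while a base-class constructor is running, the object's dynamic type is that base class, not the most-derived class. So a static_cast to the derived type that runs while the bases are still being set up sees an object whose vptr hasn't been switched to the derived class yet. The sketch below shows the underlying language rule with made-up types (Node and Unit are hypothetical, not Clang's classes), using a virtual call instead of a cast so that everything stays well-defined.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class that records what the virtual dispatch
// resolves to while its own constructor is running.
struct Node {
    std::string nameSeenInCtor;

    Node() {
        // Resolves to Node::name(), even when a Unit is being constructed:
        // during this constructor the dynamic type is still Node.
        nameSeenInCtor = name();
    }
    virtual ~Node() = default;
    virtual std::string name() const { return "Node"; }
};

struct Unit : Node {
    std::string name() const override { return "Unit"; }
};

// What the object looked like while its base was being constructed.
std::string nameDuringBaseConstruction() {
    Unit u;
    return u.nameSeenInCtor;
}

// What the same kind of object looks like once construction has finished.
std::string nameAfterConstruction() {
    Unit u;
    return u.name();
}
```

The first helper yields "Node" and the second "Unit", even though both construct a Unit.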

I had a new hypothesis! I figured, maybe the static_cast did not actually understand the type tags. It’s not unusual for memory allocators to have to add special code so that they work properly under ASan. So it didn’t seem impossible that special code was needed to handle type-casting for homegrown RTTI systems with UBSan. Maybe some special code was present somewhere and getting compiled out when I was linking Clang in (and hence UB was getting triggered), but it was being compiled in when Clang was being built by itself (and hence UB was not triggered).

I spent some time peeking inside the LLVM RTTI related code, trying to find the special sauce interacting with UBSan. Nothing.

Read the docs, Luke

I went back to the UBSan docs to see if something was helpful. I noticed this line:

-fsanitize=undefined: All of the checks listed above other than float-divide-by-zero, unsigned-integer-overflow, implicit-conversion, local-bounds and the nullability-* group of checks.

Ah. So -fsanitize=undefined is actually a group of separate checks, not a monolithic check… Could it be that the check which was catching this UB is just turned off when I used LLVM_USE_SANITIZER earlier?

I quickly searched the LLVM CMake code for use of LLVM_USE_SANITIZER and landed at the following: (source)

set(LLVM_UBSAN_FLAGS
    "-fsanitize=undefined -fno-sanitize=vptr,function -fno-sanitize-recover=all"
    CACHE STRING
    "Compile flags set to enable UBSan. Only used if LLVM_USE_SANITIZER contains 'Undefined'.")

Bingo!!! If you’ll recall the UBSan error, it also had vptr in it. Turns out, the check is just turned off! Well, no surprise that it didn’t fire, duh.

I rebuilt my code (and hence Clang/LLVM too) with -fno-sanitize=vptr,function, and it ran without any UBSan errors. Success! It was a little dissatisfying to just turn off a check, but you sometimes have to lose something to win something. Like how I lost some hours of my life to be able to get the juicy (?) material to write this blog post.

Anyways.

As one final bit of confirmation, I figured I’d check that forcibly turning on the vptr check with a standalone build of Clang triggered the same error I was seeing in my project earlier.

Sir, this is C++

I built Clang again, overriding the LLVM_UBSAN_FLAGS variable to be -fsanitize=undefined -fno-sanitize-recover=all, so that no checks were skipped. Then I tried to type-check a translation unit. And it succeeded. No UB. Erm…

I checked the build.ninja file to see that the code was indeed getting compiled without the -fno-sanitize=vptr,function arguments from earlier. Nope, that flag wasn’t present. And yet, the vptr check wasn’t firing when building Clang standalone.

At this point, I was honestly ready to throw in the towel. I’d spent a couple of days (including a few hours on a Sunday, because I couldn’t get the damned problem out of my head) chasing what seemed to have been a build misconfiguration on my part. I felt like I’d understood 90% of the problem, and I had a workaround where I was able to use the UBSan checks in my project, barring two. Surely, that was good enough?

At this point, I reasoned that similar to before, the problem was likely something due to a flag mismatch. If the same code is built with the same compiler and with the same flags, it has to behave the same, right? The compiler and code being compiled were surely the same. Maybe the flags were a little different because I was building LLVM via the Bazel setup in my project, and maybe the default CMake build was passing different flags.

I looked at the flags being passed by CMake, and none of them really stood out. Except one. It was -fno-rtti.

Remember I mentioned that LLVM uses a homegrown RTTI system? Since it has that, it doesn’t really need the language-provided RTTI system for its own code. The other thing that RTTI enables is C++ exceptions. However, C++ exceptions are notorious for inhibiting compiler optimizations. So the default CMake configuration for LLVM disables exceptions too. (LLVM can get away with this in part because it reimplements a non-trivial amount of functionality that would normally be part of the standard library. The C++ standard library uses exceptions in several places, but LLVM does not use those bits.)

Thanks bud, whoever wrote the UBSan docs is crying

So I went back to the UBSan docs to see if it had any explicit mention of RTTI. Turns out, the UBSan docs had been silently screaming at me all this time:

-fsanitize=vptr: Use of an object whose vptr indicates that it is of the wrong dynamic type, or that its lifetime has not begun or has ended. Incompatible with -fno-rtti

Welp. That explains it. -fsanitize=undefined when combined with -fno-rtti was silently implying -fno-sanitize=vptr. That would explain why the check was still not firing for Clang despite overriding LLVM_UBSAN_FLAGS.
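
Since the combination fails silently, it can help to have the binary report how it was built. The following is a sketch under a few assumptions. __cpp_rtti, __GXX_RTTI and _CPPRTTI are the macros I know compilers to define when RTTI is on. __has_feature(undefined_behavior_sanitizer) is, as far as I know, Clang's way of detecting a UBSan build (the fallback makes it read as "off" elsewhere). vptrCheckPossible and describeBuild are made-up helpers. The code cannot see an explicit -fno-sanitize=vptr, so "possible" is the strongest claim it can make.

```cpp
#include <cassert>
#include <string>

#ifndef __has_feature
#define __has_feature(x) 0  // compilers without __has_feature: treat as "off"
#endif

// True when the language-provided RTTI is enabled for this translation unit.
#if defined(__cpp_rtti) || defined(__GXX_RTTI) || defined(_CPPRTTI)
constexpr bool kRttiEnabled = true;
#else
constexpr bool kRttiEnabled = false;
#endif

// True when Clang reports that UBSan instrumentation is enabled.
constexpr bool kUbsanEnabled = __has_feature(undefined_behavior_sanitizer);

// The vptr check needs both UBSan and RTTI; either one missing rules it out.
constexpr bool vptrCheckPossible(bool ubsan, bool rtti) {
    return ubsan && rtti;
}

// Hypothetical helper: one-line summary suitable for a --version style dump.
std::string describeBuild() {
    std::string s = "ubsan=";
    s += kUbsanEnabled ? "on" : "off";
    s += " rtti=";
    s += kRttiEnabled ? "on" : "off";
    s += " vptr-check-possible=";
    s += vptrCheckPossible(kUbsanEnabled, kRttiEnabled) ? "yes" : "no";
    return s;
}
```

Comparing this summary between the Bazel build and the CMake build would have surfaced the rtti=off difference right away.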

I looked at the LLVM CMake docs again and found a variable LLVM_ENABLE_RTTI. Another build of LLVM later (hope you’re not counting, surely that cannot be healthy), I ran Clang again on the translation unit. Well, except I didn’t. During the build, another binary (llvm-tblgen) had been built with RTTI and UBSan and already crashed with the sweet, sweet vptr failure I’d been craving.

So indeed, the problem was not in my code. Or in Clang. It had been in my build configuration. And in the part of my brain which irrationally assumed that other people’s C++ code would work fine with compiler flags of my choosing.

Problem reproduced. Case closed.

Fertilizer for my potato brain

Looking back on the whole investigation, it’s hard to not see several missteps along the way. It took me about 20 hours of investigation. It could’ve gone much faster. To be more specific:

However, it wasn’t all a waste of time. For one, I did tweak the code so that we can attach a debugger easily in the future. And the code is now ASan and UBSan clean, so I can sleep a little bit more peacefully. I’m also now more familiar with the LLVM CMake configuration, and am less wary of digging into it than I was earlier. And I ended up documenting the UBSan stack trace information more prominently in our Development docs.

So yeah, I’m going to try to look on the bright side. It was better than in my first year as a SWE where I had a stack variable use-after-free that took me nearly a week to debug, without ASan, as the ASan-ified build stack-overflowed before the problem was triggered.

Progress. Developing a spidey sense for footguns, one reified trauma at a time.