Tom Lee

Software development geekery.

3(+1) Things I Learned by Writing Chimp’s Garbage Collector

UPDATE: some code samples here were mangled pretty badly during my migration to Octopress/Jekyll – please excuse the mess.

I’ve been messing with toy compilers & even some aspects of a language runtime for years now, but I’ve never really tried my hand at writing a garbage collector for some reason. When I did eventually get around to it a few months back, I accidentally an entire programming language and runtime.

I want to be clear that both the language and the runtime are very young and the garbage collector isn’t anything special: because of certain language & runtime semantics/assumptions/constraints, GC & VM execution can run in isolation on each thread. There’s no GIL/locking/whatever related to garbage collection, nor is there a need for a tricky concurrent collector. It’s conceptually pretty simple.

What was interesting to me was how tricky things got — even in the simple, conservative, mark/sweep collector found in chimp. So for the benefit of folks who might be interested, here’s some interesting stuff I learned while writing the garbage collector for chimp. :)

#1. Writing a garbage collector is hard

In other words: it’s amazing how much we’ve come to rely on this thing that is really, really easy to fuck up in subtle and explosive ways. Garbage collection is almost a “must-have” for modern languages, but keeping it error-free is really pretty difficult.

To be clear, implementing mark/sweep is easy enough that you can hack out a naive implementation in an afternoon once you know what you’re doing. Even crazy stack-scanning voodoo isn’t all that “out there”.

What’s not so easy is subtle/insidious GC bugs that can be introduced by compiler optimizations. Joe Damato’s rant from a few years back about how MRI & REE are fundamentally broken touches on the nitty gritty details of problems like this.

(Be warned: his post is a bit emotional. The Ruby sky isn’t falling, at least as far as I’m aware. If you can see past that to the technical details of the bug, it’s pretty interesting stuff.)

Boehm’s collector also has an interesting hack for avoiding this type of bug: the GC_reachable_here macro:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* Explicitly tell the collector that an object is reachable    */
/* at a particular program point.  This prevents the argument   */
/* pointer from being optimized away, even it is otherwise no   */
/* longer needed.  It should have no visible effect in the  */
/* absence of finalizers or disappearing links.  But it may be  */
/* needed to prevent finalizers from running while the      */
/* associated external resource is still in use.        */
/* The function is sometimes called keep_alive in other     */
/* settings.                            */
# if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#   define GC_reachable_here(ptr) \n    __asm__ volatile(" " : : "X"(ptr) : "memory");
# else
    GC_API void GC_noop1(GC_word x);
#   define GC_reachable_here(ptr) GC_noop1((GC_word)(ptr));
#endif

Fascinating stuff.

Another big potential source of bugs in the GC is any assumptions you make about the underlying architecture. In fact, that’s a great segue into #2 …

#2. It’s really important to know your compiler & architecture

Unless you’re doing a really good job of explicitly tracking GC references, you need to know every nook & cranny in which your compiler might try to stash pointers. For example, on x86 & x86-64 this means registers and the machine stack.

When I first wrote the GC, I knew I had to take special care to scan the C stack — you can see in libchimp/gc.c. However, I neglected to account for the possibility that local variables never make it to the stack. This has interesting implications, best understood by an example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* GC object references */
ChimpRef *ref1;
ChimpRef *ref2;
 
/* allocate a `str` object using the GC */
ref1 = chimp_str_new ("foo");
 
/* run a mark/sweep */
chimp_gc_collect ();
 
/* allocate another `str` object using the GC */
ref2 = chimp_str_new ("bar");
 
/* output the contents of ref1 */
printf ("%s\n", CHIMP_STR_DATA(ref1));

What output would we expect from this simple little program? I would assume correct execution would cause it to display “foo”.

The real answer is: it depends on your C compiler and the optimizations it applies.

It’s entirely possible — even likely — that the values of ref1 & ref2 are always kept in registers & thus never stored on the machine stack. In that case, you will actually most likely see “bar”, because the second chimp_str_new allocation will recycle the GC slot you allocated for ref1.

You can see Ruby uses setjmp to write the registers to the stack so that they get picked up when we go walking the stack. Personally, I took the manual route because I didn’t know any better at the time.

So yeah — every nook and cranny. :)

(Note that nudging machine registers onto the stack in this manner won’t save you from the apocalyptic GC bug Joe Damato was ranting about. In that case the GC reference is stored in a register that is subsequently overwritten prior to the GC being invoked.)

#3. Assembly isn’t dead just yet

When you’re messing with low-level details of your GC — like, say, tracking machine registers — on occasion things do go wrong in weird in wonderful ways. When that happens, you need to be pretty comfortable with the assembly language & a debugger on the platform at issue.

I initially fucked up the assembly code for saving x86/x86-64 registers to the stack. Twice. Badly. Hell, it’s probably still broken in some subtle way because my x86/x86-64 foo is rusty. Do as I say, not as I do & “know your architecture” :P .

The bug manifested itself here, setting the symbols member of the symtable entry to NULL. I could see in gdb that the chimp_hash_new() call wasn’t failing, but somehow that field was still set to NULL.

Sigh.

Fortunately, once I got some sleep I was able to figure out what was going on:

  1. Immediately before the call to chimp_hash_new, the symtable entry ChimpRef pointer is copied from %rax to %rbx to keep it from getting smashed (the return value of chimp_hash_new will be stored in %rax). After the call, we save the hash:
    
       0x0000000000419a25 :   mov    %rax,%rbx
       0x0000000000419a28 :   callq  0x414ed1 
       0x0000000000419a2d :   mov    %rax,0x20(%rbx)
    

    At this point I suspected %rbx was somehow being overwritten.

  2. Inside chimp_hash_new, an object allocation triggers a call to chimp_gc_collect.
  3. One of the very first things you see in the disassembly of chimp_gc_collect is a push %rbx and a pop %rbx before we return.
  4. Setting a breakpoint on the push and the pop, I could see the value popped was different to the value pushed.

It turned out my crappy assembly code in chimp_gc_collect was smashing the stack — most likely because I was using offsets from %ebp/%esp & getting ‘em horribly wrong. Argh.

Admittedly my code was horribly wrong, but without knowing the architecture well enough to understand why things were failing, it would’ve been painful for me to track this bug down by the observable symptoms.

Ultimately, this was fixed using offsets relative to a known location rather than making big assumptions about the layout of the stack. (See “Know your architecture/compiler” ;) )

Maybe I should take a hint from the Ruby guys & stick to setjmp …

Bonus: Copying collector surprises

The first GC I wrote for what eventually became chimp was actually a copying collector. That introduced problems of its own. Namely, pointers can change under you in ways that aren’t intuitive.

I ultimately opted to move away from the copying collector after a few too many bugs — not because the collector itself was buggy, but because of seemingly-benign-but-broken code like this:

1
CHIMP_SYMTABLE_ENTRY(entry)->symbols = chimp_hash_new ();

What could possibly go wrong? Well if the chimp_hash_new call happens to trigger a GC, the underlying value of ‘entry’ that we extract using the CHIMP_SYMTABLE_ENTRY macro would move from one semi-space to another and — again, depending on the output of your C compiler — the value could be written to the wrong semi-space.

To work around this, you’d have to do things like this:

1
2
3
4
ChimpRef *temp;
temp = chimp_hash_new ();
/* safe, because no allocation (& ergo no GC) is possible here */
CHIMP_SYMTABLE_ENTRY(entry)->symbols = temp;

In retrospect, I could have probably worked around it with macros … but by the time I ripped it out & replaced it with a naive mark/sweep, the rest of the language was starting to take shape — and that was more fun to play with.

Help!

In retrospect, I guess all of these points boil down to the fact that you need a solid understanding of what’s happening under the hood of your C compiler & target architecture(s) to build something resembling a “correct” GC. Needless to say I’ve now got a tremendous, new-found respect for all the work that goes into things like the industrial-strength collectors in the JVM. This stuff is hard to get right & I can’t imagine what it must be like fighting with this stuff across a vast number of different platforms and compiler toolchains.

If GC stuff interests you and/or you’re some sort of garbage collection guru or even just some sort of angry person with strong opinions, I’d love to hear your thoughts on the current state of chimp’s garbage collector. It’s definitely still a work in progress (e.g. we don’t even try to shrink the heap at the moment), but the basics are probably ripe for criticism.

Better yet, go and hack on the chimp code base & point out where I’m being stupid. :)