Tom Lee

Software development geekery.

A More Detailed Tour of the Rust Compiler

Last week I gave a (poorly prepared) talk outlining how the Rust compiler works internally. Though the slides might be helpful & there was some interesting discussion after the talk, I thought a full post might do this topic more justice.

Note that as of writing this the current release of Rust is 0.10. Some of what I cover in this article may well change by the time you get around to reading this.

Also note that although I’ve made some minor contributions to Rust, I don’t pretend to know everything there is to know about the compiler or the language. There’s a lot going on here!

Compilers 101

Before we talk about Rust specifically, let’s cover some compiler fundamentals. If you’re already familiar with this stuff, feel free to skip ahead to the Rust-specific stuff.

Internally, most modern compilers tend to follow a fairly predictable pattern:

  1. The scanner reads raw program source code and emits a stream of tokens.
  2. A parser takes the stream of tokens from the scanner and constructs an intermediate representation of the program by applying grammar rules specific to the language being parsed. The resulting intermediate representation is usually an abstract syntax tree (aka AST), an in-memory data structure representing the parsed program.
  3. At this point semantic analysis occurs, where the compiler traverses the AST to perform various language-specific safety & sanity checks on the parsed program. Statically typed languages typically do their type checking at this point too. Certain optimizations may also be applied to the AST before or after the semantic analysis phase.
  4. The AST itself is then traversed by a code generator to emit the target language. The target language might be native code, or it might be some other language (for example, you might be targeting JavaScript). In the case of languages like C and C++, intermediate object files emitted during the code generation phase may then be linked together to produce a native binary.
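
To make all this a little more concrete, here’s a toy scanner for a made-up language of integer additions (my own illustrative example, written in roughly 0.10-era Rust – treat it as a sketch rather than something guaranteed to compile against the syntax du jour):

enum Token {
    Num(int),
    Plus
}

fn scan(src: &str) -> ~[Token] {
    let mut tokens = ~[];
    for c in src.chars() {
        match c {
            '+' => tokens.push(Plus),
            '0'..'9' => tokens.push(Num((c as int) - ('0' as int))),
            _ => () // a real scanner would track positions & report errors
        }
    }
    tokens
}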

You can find these components in most popular programming languages including Python, Ruby and PHP, though the latter omits the syntax tree & instead blurs the line between parsing & code generation, last I checked. I’ll leave it as an exercise to the reader to track these down – let’s move on to the Rust compiler. :)

An overview of the Rust Compiler

At a high level, there are no surprises in the structure of the Rust compiler. In Rust, you will find:

  • A scanner
  • A parser
  • A semantic analysis phase
  • A code generation phase
  • A link step

In fact, these steps are actually pretty easy to identify with a little digging if you track down compile_input in src/librustc/driver/driver.rs. compile_input is called from main_args in src/librustc/lib.rs, which is in turn called directly from the rustc entry point in the same file. It’s the function that really drives the compilation process.

Anyway, within compile_input you’ll find calls to the following functions:

  • phase_1_parse_input
  • phase_2_configure_and_expand
  • phase_3_run_analysis_passes
  • phase_4_translate_to_llvm
  • phase_5_run_llvm_passes
  • phase_6_link_output

If you peruse these functions a little, you’ll quickly get a feel for how everything fits together. Let’s look at each phase in a little more detail. I’ll focus as much as I can on the aspects that make the Rust compiler unique compared to compilers for other languages that I’m familiar with.

phase_1_parse_input

First, you’ll usually hit syntax::parse::parse_crate_from_file in src/libsyntax/parse/mod.rs. Stepping through a few levels of indirection you’ll come to syntax::parse::file_to_filemap, which slurps the source file from disk then calls syntax::parse::string_to_filemap to construct a FileMap – which seems to be a glorified container for the file name and source code (plus some extra metadata & helper methods).

The FileMap is transformed into an ast::TokenTree via a quick pass through a lexer built from the source string, which in turn is read by syntax::parse::parser::Parser.parse_all_token_trees(). All of this happens in syntax::parse::filemap_to_tts. Note that although we’re making a pass through the parser we’re not actually building the AST at this point – i.e. the TokenTree is not the AST, though it does appear to be an intermediate representation between raw source code and the AST. (You’ll have to forgive the hand-waving here, I’m not especially familiar with this part of the code!)

Before we look at the AST pass, it’s interesting to note that the pass to construct the TokenTree uses an impl of syntax::lexer::Reader called syntax::lexer::StringReader, which emits tokens from a raw string. This is a “traditional” scanner that converts raw source code into tokens. The TokenTree is then used to create a new lexer in syntax::parse::tts_to_parser that emits tokens built from the TokenTree, and it’s this TokenTree-based Reader impl that feeds the AST pass. I find this a little “weird” in that compilers for simpler languages usually go straight from raw source code to tokens to an AST, but from what I can tell this probably helps drive Rust’s syntax extension / macro system.
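
To illustrate the distinction (my own sketch – the variant names are from memory, so don’t take this as actual compiler output):

// Source:       foo(a + b);
// Token stream: IDENT(foo) '(' IDENT(a) '+' IDENT(b) ')' ';'
// Token trees:  TTTok(foo),
//               TTDelim[ '(' TTTok(a) TTTok(+) TTTok(b) ')' ],
//               TTTok(;)

In other words, the token tree pass just groups the token stream by its delimiters – which is roughly the granularity that macros operate on.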

Anyway, TokenTree weirdness aside, by the end of all the indirection in parse_crate_from_file we’ve got an interface to a fairly traditional-looking scanner in the form of a syntax::lexer::Reader wrapped in a fairly boring syntax::parse::parser::Parser. We call syntax::parse::parser::Parser.parse_crate_mod to parse the token stream from the lexer::Reader and return our “real” AST in the form of a syntax::ast::Crate via what appears to be a fairly standard recursive descent parse.
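
If the term “recursive descent” is new to you: conceptually it’s just a family of functions that consume tokens & build AST nodes as they go. Continuing my toy scanner example from earlier (again, an illustrative sketch in roughly 0.10-era Rust, not compiler code):

enum Expr {
    Lit(int),
    Add(~Expr, ~Expr)
}

// Parse `Num (Plus Num)*` into a left-associative Expr tree.
fn parse(tokens: &[Token]) -> ~Expr {
    let mut iter = tokens.iter();
    let mut expr = match iter.next() {
        Some(&Num(n)) => ~Lit(n),
        _ => fail!("expected a number")
    };
    loop {
        match iter.next() {
            None => return expr,
            Some(&Plus) => match iter.next() {
                Some(&Num(n)) => { expr = ~Add(expr, ~Lit(n)); }
                _ => fail!("expected a number after '+'")
            },
            _ => fail!("expected '+'")
        }
    }
}

Parser.parse_crate_mod is doing the same sort of thing, just with a much bigger grammar.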

phase_2_configure_and_expand

From a quick pass through this method:

  1. The crate’s feature attributes are checked. This is where we check the crate attributes for #[feature(…)] attributes specifying gated features required by the crate being compiled. If a requested feature has been removed, the compiler will complain. A similar error will appear if an unknown feature is requested, or if the feature name is malformed.

  2. Standard libraries are injected, unless #[no_std] is specified. Typically, to link to external crates you need to refer to the crate using the extern keyword, e.g. extern crate pcre;

    Unlike most crates, the std crate will be available to you by default without such an extern declaration. If you specify #[no_std] in your crate attributes, the std crate will be excluded. Similar voodoo is performed on #[no_start] as part of this pass.

  3. There is a pass to remove “unconfigured” items. This is basically applying the special rules implied by conditional compilation requested using #[cfg(foo)] – see the example after this list. Note that we haven’t yet expanded macros / syntax extensions.

  4. Syntax extension expansion occurs. Exactly as it sounds, but if you’re unfamiliar with Rust’s facilities for syntax extensions check out the manual.

  5. We remove “unconfigured” items again. Think about it: we just expanded macros, so we may have emitted new items with cfg attributes!

  6. If we’re compiling tests, analyze & emit AST nodes describing a test harness. The details can be found in rustc::front::test::modify_for_testing.

  7. Prelude (std::prelude::*) imports are injected, unless the #[no_implicit_prelude] attribute is present at the crate level. The prelude includes stuff that is automatically & implicitly imported into your Rust code.

  8. We assign node IDs to AST nodes & build a mapping from those IDs to AST nodes. The AST map is useful for later phases of the compilation process.
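
As promised, here’s the sort of thing the “unconfigured item” passes deal with (my own illustrative example):

#[cfg(target_os = "linux")]
fn platform_thing() {
    // only compiled when targeting linux ...
}

#[cfg(not(target_os = "linux"))]
fn platform_thing() {
    // ... otherwise this one is compiled instead
}

Items whose cfg conditions don’t hold for the current build are simply stripped from the AST. And since a macro can expand to new items carrying cfg attributes of their own, the stripping has to happen again after expansion – hence step 5.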

Finally, the compiler supports optionally dumping out the full AST after all of these steps have occurred. This is useful for developers hacking on the compiler.

phase_3_run_analysis_passes

This is essentially the semantic analysis phase. At this point we make multiple passes over the AST to perform various different tasks. In most cases you can find the details of the semantic analysis steps in rustc::middle::*.

A lot happens here and trying to cover all the detail isn’t really possible, but fortunately most of the top-level functions speak for themselves:

  1. Resolve external crates: an AST pass that tracks used crates & externs required for later resolve steps. Handled by rustc::metadata::creader::read_crates.
  2. Name resolution: what does a name represent in a given context? Function? Local variable? Type? Rust has two namespaces: one for types and one for values. This allows you to refer to e.g. both the str type and the str module at the same time.
  3. Resolve lifetimes: name resolution for lifetimes (I could be misunderstanding, but this is presumably for lifetime variables, e.g. 'f).
  4. Find the program entry point: tries to locate the main function, or indicates no entry point if #[no_main] is set.
  5. Find the macro registrar: not entirely sure what this does, at a glance.
  6. Find freevars: annotates for loops and functions with the freevars within them.
  7. Region resolution: described in more detail in the source code.
  8. Perform type checking: documented in a fair bit of detail in the source code.
  9. Static, constant and privacy checking.
  10. Effect checking: this basically checks that the source program is not trying to do unsafe operations outside of unsafe contexts (i.e. outside of an ‘unsafe’ block/function). If it discovers a violation, compilation fails.
  11. Loop checking: make sure the source program isn’t using break and continue outside of a loop, amongst other things.
  12. A compute moves step: this checks whether an operation will result in a move & enforces rules on non-copyable (aka “linear”) types.
  13. Match checking: ensure that pattern matches are legal & exhaustive. Complain about unreachable patterns, etc.
  14. Liveness checking: lots of details about this process are documented in the source code.
  15. Borrow checking: enforces Rust’s memory safety rules. Again, lots of details documented in the source code.
  16. Kind checking: enforces special rules for built-in traits like Send and Drop.
  17. Reachability checking: a brief summary of the details can be found in the source code.
  18. Dead code checking: warns if the compiler can determine that certain code will never be executed.
  19. Lint checking: short but sweet docs in the source code.

Phew. I think I missed some parts, but hopefully that’s enough to help you understand the sort of stuff that’s happening here.
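
To make a couple of these passes concrete, here are two tiny functions that would be rejected (my own examples – loop checking catches the first, effect checking the second):

fn broken() {
    break; // loop checking: `break` outside of a loop
}

fn deref(p: *int) -> int {
    *p // effect checking: raw pointer deref outside an `unsafe` block/fn
}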

phase_4_translate_to_llvm

I consider this our transition from the semantic analysis phase into the code generation phase.

With the amount of stuff that’s going on in phase_3, you might be relieved to know that phase_4 is actually fairly simple & cohesive: essentially this phase calls out to rustc::middle::trans::base::trans_crate to transform Rust’s internal ast::Crate into an llvm::Module. Rust leans pretty heavily on LLVM for its code generation backend.

phase_5_run_llvm_passes

Another relatively simple step, but its name doesn’t really explain what’s going on. At this point we’re well into code generation territory: the “LLVM passes” alluded to here are the calls to the LLVM API that emit native code or assembly or LLVM bitcode and write it out – typically to disk.

The code generation happens in rustc::back::link::write::run_passes, where the output formats are selected, assembly and/or bitcode is emitted, and object files are optionally written to disk.

The LLVM pass to write to disk is actually wired up via a C++ wrapper function with C linkage called LLVMRustWriteOutputFile which in turn calls llvm::TargetMachine::addPassesToEmitFile. You can see the Rust side of this call in rustc::back::link::WriteOutputFile.

phase_6_link_output

If we generated object files in the previous step & the compiler is configured to produce a native library or executable, the final pass is a link step. Ultimately this occurs in rustc::back::link::link_natively wherein Rust literally goes hunting for your platform’s cc program (the system C compiler) and uses it to invoke the linker to combine the object files together into an executable for the target platform. The arguments to the linker are built up in link_args.
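
In other words, the final step is conceptually something like the following (illustrative only – the real arguments are assembled in link_args):

cc -o myprog myprog.o <flags, paths to the Rust libraries, etc.>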

By the time this function returns, compilation is complete & you can run your freshly compiled Rust binary!

What’s next?

To the source code, friend. Track down some issues to work on & get hacking.

If you made it this far – well done! Feel free to send me 140 characters or less of hate via Twitter @tglee or come along to a PDX-Rust meetup sometime.

Unable to Generate Spec: Read File Info /usr/lib/erlang/man/man1/python3.1 Failed

Would-be Erlang geeks on Debian may have been bitten by this bug recently:

tom@desktop:~/Source/xxxx $ ./rebar compile generate -v
==> ranch (compile)
==> cowboy (compile)
==> rel (compile)
==> xxxx (compile)
WARN:  'generate' command does not apply to directory /home/tom/Source/xxxx/deps/ranch
WARN:  'generate' command does not apply to directory /home/tom/Source/xxxx/deps/cowboy
==> rel (generate)
ERROR: Unable to generate spec: read file info /usr/lib/erlang/man/man1/python3.1 failed
ERROR: Unexpected error: rebar_abort
ERROR: generate failed while processing /home/tom/Source/xxxx/rel: rebar_abort

See how it’s bitching about /usr/lib/erlang/man/man1/python3.1? Let’s take a closer look:

tom@desktop:~/Source/xxxx $ ls -alh /usr/lib/erlang/man/man1/python3.1
lrwxrwxrwx 1 root root 11 May  8 14:14 /usr/lib/erlang/man/man1/python3.1 -> python3.2.1
tom@desktop:~/Source/xxxx $ ls -alh /usr/lib/erlang/man/man1/python3.2.1
ls: cannot access /usr/lib/erlang/man/man1/python3.2.1: No such file or directory

Ah. /usr/lib/erlang/man/man1/python3.2.1 is missing. Let’s delete the broken symlink at /usr/lib/erlang/man/man1/python3.1 and try rebuilding our Erlang app again:

tom@desktop:~/Source/xxxx $ sudo rm -f /usr/lib/erlang/man/man1/python3.1
tom@desktop:~/Source/xxxx $ ./rebar compile generate
==> ranch (compile)
==> cowboy (compile)
==> rel (compile)
==> xxxx (compile)
tom@desktop:~/Source/xxxx $

Fixed!

Check out python3 bug #707606 in the Debian bug tracker if you’d like to keep a closer eye on this.

Mozilla’s Tim Chevalier Talks Rust in PDX (June 17th, 2013)

So I’ve mentioned Rust is awesome, right?

Well, Mozilla’s Tim Chevalier is in town mid-June to give his talk Rust: A Friendly Introduction at OSBridge. Even better, he’s offered to give some lucky folks the chance to preview his talk a few days in advance!

Check out the PDX Rust Google Group for details. Since we’re ordering food & drink, I’d appreciate an RSVP if you plan on coming along. I’d sure hate for folks to starve ;)

Traits, Structs and Impls in Rust

I love Rust. There, I said it. We’re going to be seeing more of each other – though through no fault of her own, I doubt we’ll ever be an exclusive item.

The language is still evolving at a rapid (but slowing) pace, so it can be a little hard to figure out how to do things sometimes. Because of this, I think the best way to learn the Rust language is to literally go to the source.

On that topic, now is a brilliant time to get involved because everything mostly kinda-sorta works, but there’s still so much that you can do to help out.

So yeah, get hacking.

Anyway, as I figure things out, I like to write little tutorials like this one – largely for myself, because I have a terrible memory & an attention span to match. :) This time around, I’m going to explore traits, structs and impls in Rust. It took me a while to work out how these three features related to one another (mostly because I was too lazy to RTFM) & figured others might benefit from a summary.

If you’re wondering what all the ~ sigils & ampersands mean in the code samples below, you might want to take a detour & read my article on managed & owned boxes.

Composite types with structs

At their simplest, structs in Rust are much like structs in languages like C & C++ – composite/record types consisting of zero or more fields:

struct MyStruct {
    uint_field: uint
}

It’s possible to also declare unit-like structs by omitting the body of your struct:

struct MyUnitLikeStruct;

Structs may also be exported from modules using the pub keyword:

pub struct MyStruct {
    // ...
}

You can then initialize “instances” of a struct like so:

let x = MyStruct{field1: value1, field2: value2};

Unit-like structs are essentially singletons & are thus used without the curly braces:

let x = MyUnitLikeStruct;

Adding methods to your structs with impl

So now we know how to use structs, but as it stands they’re kinda boring. They can’t really do anything. We can define functions to operate on our structs:

fn get_x(mystruct: &MyStruct) -> uint {
    mystruct.x
}

(Note that, like Ruby, Rust will return the last expression evaluated by a function – so long as it’s not followed by a semi-colon – so we don’t need a return here.)

Now we can call get_x, passing a borrowed pointer to an instance of our struct:

let mystruct = MyStruct{x: 20};
get_x(&mystruct) // returns 20

While this is perfectly acceptable, it’s more typical to define methods on a struct using an implementation:

pub struct MyStruct {
    x: uint
}

//
// implementation for MyStruct
//
impl MyStruct {
    pub fn get_x(&self) -> uint {
        self.x
    }
}

Note that because impls are associated with a type we don’t need to use the pub keyword here to export them from a module: if the underlying type is exported, so too will the impls associated with that type.

(UPDATE: I got this wrong. If you want to export your impls, pub is necessary unless you’re implementing a trait – see the “impl … for …” syntax below.)

Now we can call methods on instances of our MyStruct struct:

let mystruct = MyStruct{x: 50};
mystruct.get_x() // returns 50

Note that you can only define impls for types declared in the same crate.

Generalizing struct impls with traits

Traits can be used in a manner similar to the way interfaces are used in a language like Java.

For example, say we have two structs:

pub struct Person {
    name: ~str
}

pub struct Dog {
    name: ~str
}

And we decide we want to display the name of a Person & a Dog. We could write two separate functions:

pub fn show_person_name(person: &Person) {
    println(person.name)
}

pub fn show_dog_name(dog: &Dog) {
    println(dog.name)
}

But then we need to know what type of struct we’re dealing with – a Person or a Dog – & then figure out which function we should call to display its name. Adding a name method to your struct using an impl won’t help you either, because Persons and Dogs aren’t really related in our program.

So instead, we can define a trait:

trait HasName {
    pub fn name(&self) -> ~str;
}

(Note that like structs, traits can be exported using the pub keyword).

And instead of show_person_name & show_dog_name, we can define a function that accepts (a borrowed pointer to) a HasName:

pub fn show_name(has_name: &HasName) {
    println(has_name.name())
}

Now we can implement the HasName interface for the Person and Dog structs:

impl HasName for Person {
    pub fn name(&self) -> ~str {
        copy self.name
    }
}

impl HasName for Dog {
    pub fn name(&self) -> ~str {
        copy self.name
    }
}

(Again, note that we don’t need to explicitly export our impls.)
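
Incidentally, traits can also be used as bounds on generic type parameters, which gets you static dispatch rather than the trait-object style used by show_name above. Something like this should do the trick (an untested sketch, but this is the general shape):

pub fn show_name_generic<T: HasName>(has_name: &T) {
    println(has_name.name())
}

Here the compiler generates a specialized version of show_name_generic for each concrete type it’s called with, rather than dispatching through a vtable at runtime.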

Bonus: static methods on traits

Lastly, it’s interesting to note that traits can be used to generalize static methods too. For example:

pub trait Square {
    pub fn square(x: Self) -> Self;
}

Here Self refers to the real, underlying type of Square. Using this trait, we can provide implementations for a number of different types:

impl Square for int {
    pub fn square(x: int) -> int { x * x }
}

impl Square for float {
    pub fn square(x: float) -> float { x * x }
}

See how we’ve replaced references to Self with concrete types int and float?

So now we can call Square::square(...) passing either an int or a float & Rust will know how to perform the appropriate operation for us:

Square::square(10) // -> 100
Square::square(10.0f) // -> 100.0f

(NOTE as of Rust 0.6, static methods on traits were not being properly exported from modules. Fixing this was actually my first “real” patch/bugfix for the project: you can see it working using the latest code on the incoming branch).

Until next time…

Congratulations on making it this far! Rust is totally worth the effort to learn. I’m very excited about the direction it’s headed.

Oh, and if you like this sort of nerdiness and/or would like to correct the error of my ways, you should totally ping me on Twitter.

Nim Is the New Chimp

I just renamed the language I was formerly calling ‘chimp’ to ‘nim’ (as in Nim Chimpsky, who in turn was named after Noam Chomsky — and yes, this does make my geeky bits all kinds of tingly).

This was done largely to prevent any confusion with the CHIMP meta-programming tool concocted by John L. Kenyon for his Master’s thesis at the University of Nevada. Maybe a little premature, but I’ve been planning to do this for a while and there’s been a bit of a lull in activity in recent weeks. I figured now was as good a time as any.

So if you were previously tracking chimp, you might want to turn your attention to nim. All new development will be happening there.

You can get the source code for nim from GitHub.

3(+1) Things I Learned by Writing Chimp’s Garbage Collector

UPDATE: some code samples here were mangled pretty badly during my migration to Octopress/Jekyll – please excuse the mess.

I’ve been messing with toy compilers & even some aspects of a language runtime for years now, but I’ve never really tried my hand at writing a garbage collector for some reason. When I did eventually get around to it a few months back, I accidentally an entire programming language and runtime.

I want to be clear that both the language and the runtime are very young and the garbage collector isn’t anything special: because of certain language & runtime semantics/assumptions/constraints, GC & VM execution can run in isolation on each thread. There’s no GIL/locking/whatever related to garbage collection, nor is there a need for a tricky concurrent collector. It’s conceptually pretty simple.

What was interesting to me was how tricky things got — even in the simple, conservative, mark/sweep collector found in chimp. So for the benefit of folks who might be interested, here’s some interesting stuff I learned while writing the garbage collector for chimp. :)

#1. Writing a garbage collector is hard

In other words: it’s amazing how much we’ve come to rely on this thing that is really, really easy to fuck up in subtle and explosive ways. Garbage collection is almost a “must-have” for modern languages, but keeping it error-free is really pretty difficult.

To be clear, implementing mark/sweep is easy enough that you can hack out a naive implementation in an afternoon once you know what you’re doing. Even crazy stack-scanning voodoo isn’t all that “out there”.

What’s not so easy is subtle/insidious GC bugs that can be introduced by compiler optimizations. Joe Damato’s rant from a few years back about how MRI & REE are fundamentally broken touches on the nitty gritty details of problems like this.

(Be warned: his post is a bit emotional. The Ruby sky isn’t falling, at least as far as I’m aware. If you can see past that to the technical details of the bug, it’s pretty interesting stuff.)

Boehm’s collector also has an interesting hack for avoiding this type of bug: the GC_reachable_here macro:

/* Explicitly tell the collector that an object is reachable    */
/* at a particular program point.  This prevents the argument   */
/* pointer from being optimized away, even if it is otherwise no */
/* longer needed.  It should have no visible effect in the  */
/* absence of finalizers or disappearing links.  But it may be  */
/* needed to prevent finalizers from running while the      */
/* associated external resource is still in use.        */
/* The function is sometimes called keep_alive in other     */
/* settings.                            */
# if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#   define GC_reachable_here(ptr) \
    __asm__ volatile(" " : : "X"(ptr) : "memory");
# else
    GC_API void GC_noop1(GC_word x);
#   define GC_reachable_here(ptr) GC_noop1((GC_word)(ptr));
#endif

Fascinating stuff.

Another big potential source of bugs in the GC is any assumptions you make about the underlying architecture. In fact, that’s a great segue into #2 …

#2. It’s really important to know your compiler & architecture

Unless you’re doing a really good job of explicitly tracking GC references, you need to know every nook & cranny in which your compiler might try to stash pointers. For example, on x86 & x86-64 this means registers and the machine stack.

When I first wrote the GC, I knew I had to take special care to scan the C stack — you can see this in libchimp/gc.c. However, I neglected to account for the possibility that local variables never make it to the stack. This has interesting implications, best understood by an example:

/* GC object references */
ChimpRef *ref1;
ChimpRef *ref2;
 
/* allocate a `str` object using the GC */
ref1 = chimp_str_new ("foo");
 
/* run a mark/sweep */
chimp_gc_collect ();
 
/* allocate another `str` object using the GC */
ref2 = chimp_str_new ("bar");
 
/* output the contents of ref1 */
printf ("%s\n", CHIMP_STR_DATA(ref1));

What output would we expect from this simple little program? I would assume correct execution would cause it to display “foo”.

The real answer is: it depends on your C compiler and the optimizations it applies.

It’s entirely possible — even likely — that the values of ref1 & ref2 are always kept in registers & thus never stored on the machine stack. In that case, you will actually most likely see “bar”, because the second chimp_str_new allocation will recycle the GC slot you allocated for ref1.

You can see Ruby uses setjmp to write the registers to the stack so that they get picked up when we go walking the stack. Personally, I took the manual route because I didn’t know any better at the time.

So yeah — every nook and cranny. :)

(Note that nudging machine registers onto the stack in this manner won’t save you from the apocalyptic GC bug Joe Damato was ranting about. In that case the GC reference is stored in a register that is subsequently overwritten prior to the GC being invoked.)

#3. Assembly isn’t dead just yet

When you’re messing with low-level details of your GC — like, say, tracking machine registers — on occasion things do go wrong in weird and wonderful ways. When that happens, you need to be pretty comfortable with the assembly language & a debugger on the platform at issue.

I initially fucked up the assembly code for saving x86/x86-64 registers to the stack. Twice. Badly. Hell, it’s probably still broken in some subtle way because my x86/x86-64 foo is rusty. Do as I say, not as I do & “know your architecture” :P .

The bug manifested itself in chimp’s symtable code, setting the symbols member of the symtable entry to NULL. I could see in gdb that the chimp_hash_new() call wasn’t failing, but somehow that field was still set to NULL.

Sigh.

Fortunately, once I got some sleep I was able to figure out what was going on:

  1. Immediately before the call to chimp_hash_new, the symtable entry ChimpRef pointer is copied from %rax to %rbx to keep it from getting smashed (the return value of chimp_hash_new will be stored in %rax). After the call, we save the hash:
    
       0x0000000000419a25 :   mov    %rax,%rbx
       0x0000000000419a28 :   callq  0x414ed1 
       0x0000000000419a2d :   mov    %rax,0x20(%rbx)
    

    At this point I suspected %rbx was somehow being overwritten.

  2. Inside chimp_hash_new, an object allocation triggers a call to chimp_gc_collect.
  3. One of the very first things you see in the disassembly of chimp_gc_collect is a push %rbx and a pop %rbx before we return.
  4. Setting a breakpoint on the push and the pop, I could see the value popped was different to the value pushed.

It turned out my crappy assembly code in chimp_gc_collect was smashing the stack — most likely because I was using offsets from %ebp/%esp & getting ‘em horribly wrong. Argh.

Admittedly my code was horribly wrong, but without knowing the architecture well enough to understand why things were failing, it would’ve been painful for me to track this bug down by the observable symptoms.

Ultimately, this was fixed using offsets relative to a known location rather than making big assumptions about the layout of the stack. (See “Know your architecture/compiler” ;) )

Maybe I should take a hint from the Ruby guys & stick to setjmp …

Bonus: Copying collector surprises

The first GC I wrote for what eventually became chimp was actually a copying collector. That introduced problems of its own. Namely, pointers can change under you in ways that aren’t intuitive.

I ultimately opted to move away from the copying collector after a few too many bugs — not because the collector itself was buggy, but because of seemingly-benign-but-broken code like this:

CHIMP_SYMTABLE_ENTRY(entry)->symbols = chimp_hash_new ();

What could possibly go wrong? Well if the chimp_hash_new call happens to trigger a GC, the underlying value of ‘entry’ that we extract using the CHIMP_SYMTABLE_ENTRY macro would move from one semi-space to another and — again, depending on the output of your C compiler — the value could be written to the wrong semi-space.

To work around this, you’d have to do things like this:

ChimpRef *temp;
temp = chimp_hash_new ();
/* safe, because no allocation (& ergo no GC) is possible here */
CHIMP_SYMTABLE_ENTRY(entry)->symbols = temp;

In retrospect, I could have probably worked around it with macros … but by the time I ripped it out & replaced it with a naive mark/sweep, the rest of the language was starting to take shape — and that was more fun to play with.

Help!

In retrospect, I guess all of these points boil down to the fact that you need a solid understanding of what’s happening under the hood of your C compiler & target architecture(s) to build something resembling a “correct” GC. Needless to say I’ve now got a tremendous, new-found respect for all the work that goes into things like the industrial-strength collectors in the JVM. This stuff is hard to get right & I can’t imagine what it must be like fighting with this stuff across a vast number of different platforms and compiler toolchains.

If GC stuff interests you and/or you’re some sort of garbage collection guru or even just some sort of angry person with strong opinions, I’d love to hear your thoughts on the current state of chimp’s garbage collector. It’s definitely still a work in progress (e.g. we don’t even try to shrink the heap at the moment), but the basics are probably ripe for criticism.

Better yet, go and hack on the chimp code base & point out where I’m being stupid. :)

The Chimp Programming Language

Disclaimer: chimp is still very young and incomplete. It’s by no means ready for prime time, merely getting to a point where I figure other language nerds might be interested in tinkering with it. You’ve been warned!

Here’s one for my “Doing It Wrong” bucket. :)

Over the last few months I’ve been hacking away — with some help from Amjith (thanks Amjith!) — on chimp: a little dynamic, strongly typed programming language experiment written in C. Each task/thread in chimp gets its own heap, garbage collector and virtual machine for execution of bytecode. As a result, chimp tasks are “shared nothing” & communicate via message passing.

The language itself kind of grew out of a naive copying collector I wrote after prodding around in the MRI/Ruby code base (I eventually dropped the copying collector in favour of a naive mark/sweep). A Python-y object model was slapped on top of the copying collector & from there it kind of grew legs, arms & fur.

The language that has evolved is kinda interesting. If you can imagine Rust or Erlang carrying Ruby & Python’s bastard surrogate child … well, yeah. It’s kinda like that.

My imagination sucks. What does it actually look like?

Though the concurrency / message passing stuff is the part of chimp that perhaps best distinguishes it from Ruby/Python, here’s an example without any concurrency voodoo:

use io;

main argv {
  io.print("What's your name?");
  var name = io.readline();
  io.print(str("Hello, ", name));
}

(Yep. My blog totally has chimp syntax highlighting support. :) )

Inside the chimp programming language

The syntax of chimp is probably best described as an odd blend of Python, Ruby, JavaScript and maybe even Go, but it has some pretty significant differences when it comes to the language runtime.

To reiterate what I mentioned earlier, chimp has a per-task heap, GC & VM. As a result, tasks run in isolation from one another. Currently this “isolation” is enforced by the runtime itself. The compiler doesn’t yet do a good job of enforcing the rules, so programs can “compile” but then bork hard when the VM hits something strange.

Tasks (currently mapped 1:1 with OS threads) communicate via message passing. Because each task gets its own GC, collections can occur on one task without blocking others. Since each VM runs in isolation, there’s no need for a GIL — or any other special locking at the VM level, for that matter. In theory tasks should be largely “shared nothing”, though there’s still much to do to enforce that. All this means the potential for Real Concurrency without forking off new heavyweight processes.

FP geeks will hate me for this one. :) Since heaps etc. aren’t shared between tasks, there’s only ever one task fiddling with any given bit of data. As a result, it’s “safe” to mutate data. I anticipate the language itself will encourage a functional style, but it would be best to expect something more like Ruby/Python than Haskell/Erlang.

I’m currently restricting (at runtime) the types that can be passed as messages between tasks — only strings, integers, arrays and tasks are permitted. Messages are serialized & deserialized between tasks in the current implementation. There’s no notion of anything like Rust’s exchange heap here.

Message Passing Example

Here’s a (relatively ugly) chimp port of an example from an Erlang tutorial I stumbled across the other day:

use io;

pong me {
  var pinger = me.recv();
  var msg = me.recv();
  while msg != "finished" {
    io.print("Pong received ping");
    pinger.send("pong");
    msg = me.recv();
  }
  io.print("Pong finished");
}

ping me {
  var ponger = me.recv();
  var i = 0;
  while i < 3 {
    ponger.send("ping");
    var m = me.recv();
    if m == "pong" {
      io.print("Ping received pong");
    }
    i = i + 1;
  }
  ponger.send("finished");
  io.print("Ping finished");
}

main argv {
  var ponger = spawn { |me| pong(me); };
  var pinger = spawn { |me| ping(me); };
  ponger.send(pinger);
  pinger.send(ponger);
  pinger.join();
  ponger.join();
}

The spawn keyword, seen in main above, spins up a new task & returns a pipe/handle to the new task. This pipe can be used to communicate with the new task.

The explicit joins are kind of ugly & may be something to eventually do away with. Likewise explicitly passing the task handles via send/recv. But y’know, you get the idea.

Check out some of the other examples, if you’re still interested. sandbox.chimp is probably the best example of most of the syntax in the language to date.

What still needs to be done?

*sigh* So very much.

I’m keeping a TODO list of the immediate things I want to improve, but it’s not really comprehensive. Tasks deserve a rewrite. Tasks are too “heavy” thanks to aggressive GC memory grabs. Adding new data types isn’t as easy as I’d like. The garbage collector could be a lot better. There are no structured data types. Tests would be nice.

It’s not yet possible to import external chimp source files. Some basic arithmetic operators are unimplemented (modulo, for example). We need more modules. The bytecode instruction set kind of sucks. The VM is inefficient. There’s no optimization pass at all. Code enforcing certain scoping rules is non-existent … the list goes on!

All that said, if you’re a {,would-be} compiler/runtime geek eager to play with a small, dynamic language with some neat little features despite its infancy … well, I’ll take any code I can get. :)

Get involved!

It’s kind of exhausting hammering away on this all by my lonesome. :)

Get in touch or send me pull requests. Happy to answer any questions about where to begin, what needs doing, etc.

I’m also open to the (ideally friendly!) criticism & thoughts of folks who know better. Right now I’m kinda fumbling my way through in places & I’m sure it shows.

Last of all, you can take a page out of my buddy Glenn’s book and actively work to break it using contrived examples. ;)

Managed & Owned Boxes in the Rust Programming Language

*This was written at the time Rust 0.4 was the most recent release of the language. I’ll try to keep it up to date / revise it for later versions, but please be warned that Rust moves quickly & this may be out of date by the time you read it! :)*

I’ve been spending a little time with the Rust programming language sporadically over the last 6 months or so (and even landed some trivial patches). I like what I’ve seen a whole lot, but one of the things that has really tripped me up as a newbie is the difference between owned boxes vs. managed boxes vs. unboxed values and borrowed pointers, and all the associated syntactic baggage.

It’s pretty difficult to read Rust code without at least somewhat understanding what it all means. So for my own benefit — and hopefully yours! — I’ve tried to document my current understanding of this stuff. Please be sure to leave abuse in the comments if I’m getting anything horribly wrong.

The Stack and Unboxed Values

To get it out of the way, unboxed values are relatively easy to grok. Straight up, if you’ve worked with the stack in C/C++, you’ll be right at home with Rust’s stack. The stack in Rust is conceptually identical. Unboxed values are simply those values which are allocated on the stack. The end.

//
// Integer allocated on the stack.
//
let x = 10;

//
// String allocated on the stack.
//
let y = "Hello World";

Easy, right? What about those horrible managed/owned boxes and borrowed pointer things, though? Well, first we need to understand a little more about memory management in Rust — and for reasons that will become clear soon, the best place to start is the basic unit of execution: the task.


UPDATE [2012-12-06] minor correction from the geeks of proggit:

“Un-sigiled and &-sigiled string literals are actually allocated in the static memory region, not on the stack (I presume this is for the sake of efficiency). However this is mostly pedantry, as I don’t think it actually makes any semantic difference compared to something that’s truly stack-allocated.”

kibwen via /r/programming


What is a task?

Tasks are the basic unit of isolation and concurrency in Rust. Tasks do not share memory and communicate only through message passing. By default, tasks are scheduled in an M:N fashion onto operating system threads. The normal mode of operation will be to launch one thread per logical core. The programmer, however, will have the option of assigning certain tasks to a dedicated thread or running the scheduler in 1:1 mode.

Note: tasks and communication

(Note this source is nearly a year old and may contain inaccuracies. Further, the “design” may not reflect the current state of master.)

Thus, tasks …

  • … run concurrently (like threads!) in isolation from one another (not like threads!)
  • … do not share memory & communicate via message passing
  • … are scheduled over a finite number of OS threads (default: one per core)
  • … can be optionally assigned to a dedicated OS thread by the programmer

So tasks are kinda similar to the light-weight processes found in Erlang.

Above I emphasized that tasks do not share memory. This is important for understanding managed vs. owned boxes in Rust. Keep it in mind!

Task-local Heaps and Managed Boxes

So tasks don’t share memory at all & run isolated from one another. What does this mean in practice?

Rust starts from the position that memory cannot be shared between tasks. Experience in other languages has proven that isolating each task’s heap from the others is a reliable strategy and one that is easy for programmers to reason about. Heap isolation has the additional benefit that garbage collection must only be done per-heap. Rust never “stops the world” to reclaim memory.

The Rust Language Tutorial

This implies that every task in Rust has its own private heap that can be garbage collected independently of heaps owned by other tasks.

This is the key to understanding managed boxes: a managed box is a value allocated on the task-local heap. Since heaps are private, garbage collection can occur on a per-task basis, so GC on one task need not block the execution of other tasks — i.e. tasks can continue executing (assuming OS threads are available to do so) while other tasks are mid-GC.

//
// An integer managed box, allocated on the local heap.
//
// Note that `x` itself is a *pointer* to the managed box on the heap.
//
let x = @10;

//
// A string managed box, allocated on the local heap.
//
let y = @"Hello World"

The Exchange Heap and Owned Boxes

So we’ve established that Rust tasks have their own private heap. We also know that tasks can communicate via message passing — but given that tasks don’t share any memory, how is this message passing implemented?

A naive implementation of a message passing system might simply copy values from the heap of one task to another, perhaps using some sort of thread-safe queue or by locking the tasks involved. However, Rust is being built with performance in mind and the designers determined that copying larger data structures has the potential to be expensive. This is where the exchange heap and owned boxes come in.

Complete isolation of heaps between tasks would, however, mean that any data transferred between tasks must be copied. While this is a fine and useful way to implement communication between tasks, it is also very inefficient for large data structures. […] Objects allocated in the exchange heap have ownership semantics, meaning that there is only a single variable that refers to them. For this reason, they are referred to as owned boxes. All tasks may allocate objects on the exchange heap, then transfer ownership of those objects to other tasks, avoiding expensive copies.

The Rust Language Tutorial

So the exchange heap is a global heap visible to all Rust tasks. “Liar!” I hear you scream. “I thought you said that tasks don’t share any memory!”.

Well, the gotcha with the exchange heap is that values on the heap can only be referenced by one task at a time. In fact, values on the exchange heap can only be referenced by a single variable at a time. It’s this property that ensures that tasks won’t stomp on each other when using the exchange heap: if only one variable can reference a value on the exchange heap, that means only one task is able to manipulate it at any given time. Neat eh?

//
// An integer owned box, allocated on the exchange heap.
//
// Note that `x` itself is a *pointer* to the owned box on the heap.
//
let x = ~10;

//
// A string owned box, allocated on the exchange heap.
//
let y = ~"Hello World"

Although the documentation might lead you to believe otherwise, you unfortunately can’t simply transfer ownership of a value directly from one task to another (I wrongly assumed that move and copy could be used for this purpose). Instead, you will want to use the core::pipes module which uses the exchange heap under the hood for efficient message passing between Rust tasks. Also see the std::comm module, which builds on core::pipes to provide a higher-level, two-way communication stream: DuplexStream.
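
Here’s a rough sketch of what pipes-based message passing looks like (from memory against 0.4-era APIs, so double-check the signatures against the docs before copying this anywhere):

// Create a stream; `chan` is the sending end, `port` the receiving end.
let (port, chan) = pipes::stream();

do task::spawn |move chan| {
    // The owned box is transferred to the other task via the
    // exchange heap rather than being deep-copied.
    chan.send(~"hello from another task");
}

io::println(port.recv());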

If you need to transfer ownership of owned boxes, you can use the copy keyword to duplicate a value, or the move keyword to change ownership. For example:

//
// Initialize an owned box on the exchange heap.
//
let x = ~10;

//
// Copy the owned box. This allocates a *new* owned box
// and copies the value of the owned box referenced by `x`.
// `x` remains unchanged.
//
let y = copy x;

//
// This makes `z` the *owner* of the owned box referenced by
// `x`. `x` is no longer valid.
//
let z = move x;

//
// Fails at compile time -- "error: use of moved variable: `x`"
//
io::println(int::str(*x));

There’s actually a pretty accessible tutorial on communication between tasks on the Rust web site. I won’t bother reproducing those perfectly good examples here, but you should check it out if you’re interested in seeing how Rust tasks communicate. :)

Borrowed Pointers

With all this under your belt, it’s much easier to understand borrowed pointers:

Rust supports several types of pointers. The safe pointer types are @T for managed boxes allocated on the local heap, ~T, for uniquely-owned boxes allocated on the exchange heap, and &T, for borrowed pointers, which may point to any memory, and whose lifetimes are governed by the call stack.

The Rust Language Tutorial

Perhaps the most intuitive way to see borrowed pointers in action is with a function that takes a borrowed pointer as an argument:

fn foo(x : &int) {
    io::println(int::str(*x))
}

fn main() {
    let a = ~10;
    let b = @20;
    let c = 30;
    foo(a);
    foo(b);
    foo(&c);
}

Here, x is a borrowed pointer that is in scope for the duration of a call to foo. Note the variables a and b are both pointers here, and we’re taking the address of the unboxed value c on the stack before passing it to the foo function. This is what the language tutorial means by borrowed references being able to “point to any memory” — it doesn’t matter whether the memory is located on the local heap, the exchange heap or the stack.

There’s a lot more to say about borrowed pointers & regions, but I think I’ve written enough for one blog post. In any case, Niko Matsakis has already written a great tutorial on borrowed pointers and you should totally check it out if this stuff has you interested.

So … what should I use? Managed boxes? Owned boxes? Borrowed references?

This is unfortunately a more difficult question to answer. Although the intent of the different memory regions is pretty clear, it’s not unusual to find type signatures in the standard library that call for the use of an owned box when it seems like a borrowed reference would do just fine (for example, brson pointed out on IRC that much of the vec module comes from a time when Rust only had owned vectors!). As such, my experience with Rust’s evolving API has led me to believe that — where you can’t use the type that “makes sense” — you often wind up transforming values to whatever type is required by the library functions you are using (e.g. from a managed to an owned box).

More type trivia from brson (if you’re into that sort of thing): ~T and ~[T] are different types of owned pointer, so something like:

let x = ~10;
let y = @*x;

will compile, but:

let x = ~[1, 2, 3];
let y = @*x;

will not, because the compiler complains that x can’t be dereferenced. I don’t fully understand what prevents the latter pointer “type” from being dereferenced, but there you go. Something to be wary of. :)


UPDATE [2012-12-06]:

Great article. Regarding why ~[T] and ~T behave somewhat differently, the reason is that the size of [T] is not known to the compiler! It’s kind of like how in C, if you have T[3], that’s a perfectly fine type, but T[] is sort of a degenerate type that gets converted to T* willy-nilly. The reason is that the compiler can’t really manipulate an “unknown number of T’s in a row”, which is what [T] (Rust) and T[] (C) represent. As an example, it can’t put such a value on the stack, since it doesn’t know how much stack space to allocate (well, it can’t put that on the stack without using alloca() or something similar).

@nikomatsakis via the Rust mailing list


These might seem like pretty major holes, but Rust is young and still evolving. And it really has come a long way — even in the last year or so. I look forward to seeing this language evolve & improve well into the future & hopefully prove to be a solid foundation for Mozilla Servo.

You’re Doing It Wrong (and That’s Okay)

Oh god damn it.

I’m absolutely appalled by this thread on Hacker News. Dozens of folks spewing bile because some guy on the Internet wants to build his own Operating System and has some strange ideas. Shit like this is everything that’s wrong with geek culture today.

The prevailing argument seems to be that he’s “doing it wrong”, that he’s ignoring years of research, that he shouldn’t have submitted it to HN (hot tip: he didn’t) and blah, blah, blah. Here’s the thing: succeed or fail, this guy is going to learn so freakin’ much in the process. He’s going to be a better developer for trying, even if you feel his ideas are bunk.

Even better, he has already started documenting his efforts. This may prove to be a handy resource for other would-be Operating Systems geeks. Nothing but good things can come from an undertaking like this. Unless you heap shit on the poor guy, ridiculing him and telling him his project is doomed and he shouldn’t bother.

I guess my beef with this sort of behaviour is thus: I’m a compiler nerd wannabe who has done it wrong in the past, is actively doing it wrong in the present and will continue doing it wrong well into the future if I feel like it’s a good way to learn stuff. I love messing around with compilers and I’m well aware I’m still only scratching the surface.

But despite my continued inexperience and naivety in the field, my continued tinkering has led to some crazy awesome opportunities — like speaking at a bunch of conferences, and working for awesome companies in a foreign country.

I don’t think any of that would have happened had I been “Hacker News-ed” the way this guy was. Fortunately I had Python-Dev as something of a steward in the early days. Instead of pointing out how stupid I was & the wealth of things I didn’t (and still don’t) know, they patiently helped me limp along to accomplish what I was trying to do.

This is what a “community” should be all about.

To reiterate: good things generally come from the geekery of others, even when they’re “doing it wrong”. So stop proving the Greater Internet Fuckwad Theory, start showing some encouragement and do what you can to help people like Gusts on their way.

RubyConf 2012: Rapid Programming Language Prototypes With Ruby and Racc

Interested in how you might go about building a simple compiler?

The video of my RubyConf 2012 presentation “Rapid Programming Language Prototypes with Ruby and Racc” has just been uploaded to YouTube for your viewing pleasure by the awesome folks at Confreaks.

Big chunk of live coding around 12:13 and great questions from the audience at around 37:50.

The source code for the live-coded “compiler” has been pushed to this github repository. The slides are up on github too, with some minor changes — during the presentation I misattributed the author of Racc: Minero Aoki (@mineroaoki) is the original author, not Aaron Patterson (@tenderlove). (Huge apologies Minero!)