Tom Lee

Largely irrelephant.

Traits, Structs and Impls in Rust

I love Rust. There, I said it. We’re going to be seeing more of each other – though through no fault of her own, I doubt we’ll ever be an exclusive item.

The language is still evolving at a rapid (but slowing) pace, so it can be a little hard to figure out how to do things sometimes. Because of this, I think the best way to learn the Rust language is to literally go to the source.

On that topic, now is a brilliant time to get involved because everything mostly kinda-sorta works, but there’s still so much that you can do to help out.

So yeah, get hacking.

Anyway, as I figure things out, I like to write little tutorials like this one – largely for myself, because I have a terrible memory & an attention span to match. :) This time around, I’m going to explore traits, structs and impls in Rust. It took me a while to work out how these three features related to one another (mostly because I was too lazy to RTFM) & figured others might benefit from a summary.

If you’re wondering what all the ~ sigils & ampersands mean in the code samples below, you might want to take a detour & read my article on managed & owned boxes.

Composite types with structs

At their simplest, structs in Rust are much like structs in languages like C & C++ – composite/record types consisting of zero or more fields:

struct MyStruct {
    uint_field: uint
}

It’s possible to also declare unit-like structs by omitting the body of your struct:

struct MyUnitLikeStruct;

Structs may also be exported from modules using the pub keyword:

pub struct MyStruct {
    // ...
}

You can then initialize “instances” of a struct like so:

let x = MyStruct{field1: value1, field2: value2}

Unit-like structs are essentially singletons & are thus used without the curly braces:

let x = MyUnitLikeStruct

Adding methods to your structs with impl

So now we know how to use structs, but as it stands they’re kinda boring. They can’t really do anything. We can define functions to operate on our structs:

fn get_x(mystruct: &MyStruct) -> uint {
    mystruct.x
}

(Note that, like Ruby, Rust will return the last expression evaluated by a function – so long as it’s not followed by a semi-colon – so we don’t need a return here.)

Now we can call get_x, passing a borrowed pointer to an instance of our struct:

let mystruct = MyStruct{x: 20};
get_x(&mystruct) // returns 20

While this is perfectly acceptable, it’s more typical to define methods on a struct using an implementation:

pub struct MyStruct {
    x: uint
}

//
// implementation for MyStruct
//
impl MyStruct {
    pub fn get_x(&self) -> uint {
        self.x
    }
}

Note that because impls are associated with a type we don’t need to use the pub keyword here to export them from a module: if the underlying type is exported, so too will the impls associated with that type.

(UPDATE: I got this wrong. If you want to export your impls, pub is necessary unless you’re implementing a trait – see the “impl … for …” syntax below.)

Now we can call methods on instances of our Foo struct:

let mystruct = MyStruct{x: 50};
mystruct.get_x() // returns 50

Note that you can only define impls for types declared in the same crate.

Generalizing struct impls with traits

Traits can be used in a manner similar to the way interfaces are used in a language like Java.

For example, say we have two structs:

pub struct Person {
    name: ~str
}

pub struct Dog {
    name: ~str
}

And we decide we want to display the name of a Person & a Dog. We could write two separate functions:

pub fn show_person_name(person: &Person) {
    println(person.name)
}

pub fn show_dog_name(dog: &Dog) {
    println(dog.name)
}

But then we need to know what type of struct we’re dealing with – a Person or a Dog – & then figure out which function we should call to display its name. Adding a name method to your struct using an impl won’t help you either, because Persons and Dogs aren’t really related in our program.

So instead, we can define a trait:

trait HasName {
    pub fn name(&self) -> ~str;
}

(Note that like structs, traits can be exported using the pub keyword).

And instead of show_person_name & show_dog_name, we can define a function that accepts (a borrowed pointer to) a HasName:

pub fn show_name(has_name: &HasName) {
    println(has_name.name())
}

Now we can implement the HasName interface for the Person and Dog structs:

impl HasName for Person {
    pub fn name(&self) -> ~str {
        copy self.name
    }
}

impl HasName for Dog {
    pub fn name(&self) -> ~str {
        copy self.name
    }
}

(Again, note that we don’t need to explicitly export our impls.)

Bonus: static methods on traits

Lastly, it’s interesting to note that traits can be used to generalize static methods too. For example:

pub trait Square {
    pub fn square(x: Self) -> Self;
}

Here Self refers to the real, underlying type of Square. Using this trait, we can provide implementations for a number of different types:

impl Square for int {
    pub fn square(x: int) -> int { x * x }
}

impl Square for float {
    pub fn square(x: float) -> float { x * x }
}

See how we’ve replaced references to Self with concrete types int and float?

So now we can call Square::square(...) passing either an int or a float & Rust will know how to perform the appropriate operation for us:

Square::square(10) // -> 100
Square::square(10.0f) // -> 100.0f

(NOTE as of Rust 0.6, static methods on traits were not being properly exported from modules. Fixing this was actually my first “real” patch/bugfix for the project: you can see it working using the latest code on the incoming branch).

Until next time…

Congratulations on making it this far! Rust is totally worth the effort to learn. I’m very excited about the direction it’s headed.

Oh, and if you like this sort of nerdiness and/or would like to correct the error of my ways, you should totally ping me on Twitter.

Nim Is the New Chimp

I just renamed the language I was formerly calling ‘chimp’ to ‘nim’ (as in Nim Chimpsky, who in turn was named after Noam Chomsky — and yes, this does makes my geeky bits all kinds of tingly).

This was done largely to prevent any confusion with the CHIMP meta-programming tool concocted by John L. Kenyon for his Master’s thesis at the University of Nevada. Maybe a little premature, but I’ve been planning to do this for a while and there’s been a bit of a lull in activity in recent weeks. I figure now was as good a time as any.

So if you were previously tracking chimp, you might want to turn your attention to nim. All new development will be happening there.

You can get the source code for nim from GitHub.

3(+1) Things I Learned by Writing Chimp’s Garbage Collector

UPDATE: some code samples here were mangled pretty badly during my migration to Octopress/Jekyll – please excuse the mess.

I’ve been messing with toy compilers & even some aspects of a language runtime for years now, but I’ve never really tried my hand at writing a garbage collector for some reason. When I did eventually get around to it a few months back, I accidentally an entire programming language and runtime.

I want to be clear that both the language and the runtime are very young and the garbage collector isn’t anything special: because of certain language & runtime semantics/assumptions/constraints, GC & VM execution can run in isolation on each thread. There’s no GIL/locking/whatever related to garbage collection, nor is there a need for a tricky concurrent collector. It’s conceptually pretty simple.

What was interesting to me was how tricky things got — even in the simple, conservative, mark/sweep collector found in chimp. So for the benefit of folks who might be interested, here’s some interesting stuff I learned while writing the garbage collector for chimp. :)

#1. Writing a garbage collector is hard

In other words: it’s amazing how much we’ve come to rely on this thing that is really, really easy to fuck up in subtle and explosive ways. Garbage collection is almost a “must-have” for modern languages, but keeping it error-free is really pretty difficult.

To be clear, implementing mark/sweep is easy enough that you can hack out a naive implementation in an afternoon once you know what you’re doing. Even crazy stack-scanning voodoo isn’t all that “out there”.

What’s not so easy is subtle/insidious GC bugs that can be introduced by compiler optimizations. Joe Damato’s rant from a few years back about how MRI & REE are fundamentally broken touches on the nitty gritty details of problems like this.

(Be warned: his post is a bit emotional. The Ruby sky isn’t falling, at least as far as I’m aware. If you can see past that to the technical details of the bug, it’s pretty interesting stuff.)

Boehm’s collector also has an interesting hack for avoiding this type of bug: the GC_reachable_here macro:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* Explicitly tell the collector that an object is reachable    */
/* at a particular program point.  This prevents the argument   */
/* pointer from being optimized away, even it is otherwise no   */
/* longer needed.  It should have no visible effect in the  */
/* absence of finalizers or disappearing links.  But it may be  */
/* needed to prevent finalizers from running while the      */
/* associated external resource is still in use.        */
/* The function is sometimes called keep_alive in other     */
/* settings.                            */
# if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#   define GC_reachable_here(ptr) \n    __asm__ volatile(" " : : "X"(ptr) : "memory");
# else
    GC_API void GC_noop1(GC_word x);
#   define GC_reachable_here(ptr) GC_noop1((GC_word)(ptr));
#endif

Fascinating stuff.

Another big potential source of bugs in the GC is any assumptions you make about the underlying architecture. In fact, that’s a great segue into #2 …

#2. It’s really important to know your compiler & architecture

Unless you’re doing a really good job of explicitly tracking GC references, you need to know every nook & cranny in which your compiler might try to stash pointers. For example, on x86 & x86-64 this means registers and the machine stack.

When I first wrote the GC, I knew I had to take special care to scan the C stack — you can see in libchimp/gc.c. However, I neglected to account for the possibility that local variables never make it to the stack. This has interesting implications, best understood by an example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* GC object references */
ChimpRef *ref1;
ChimpRef *ref2;
 
/* allocate a `str` object using the GC */
ref1 = chimp_str_new ("foo");
 
/* run a mark/sweep */
chimp_gc_collect ();
 
/* allocate another `str` object using the GC */
ref2 = chimp_str_new ("bar");
 
/* output the contents of ref1 */
printf ("%s\n", CHIMP_STR_DATA(ref1));

What output would we expect from this simple little program? I would assume correct execution would cause it to display “foo”.

The real answer is: it depends on your C compiler and the optimizations it applies.

It’s entirely possible — even likely — that the values of ref1 & ref2 are always kept in registers & thus never stored on the machine stack. In that case, you will actually most likely see “bar”, because the second chimp_str_new allocation will recycle the GC slot you allocated for ref1.

You can see Ruby uses setjmp to write the registers to the stack so that they get picked up when we go walking the stack. Personally, I took the manual route because I didn’t know any better at the time.

So yeah — every nook and cranny. :)

(Note that nudging machine registers onto the stack in this manner won’t save you from the apocalyptic GC bug Joe Damato was ranting about. In that case the GC reference is stored in a register that is subsequently overwritten prior to the GC being invoked.)

#3. Assembly isn’t dead just yet

When you’re messing with low-level details of your GC — like, say, tracking machine registers — on occasion things do go wrong in weird in wonderful ways. When that happens, you need to be pretty comfortable with the assembly language & a debugger on the platform at issue.

I initially fucked up the assembly code for saving x86/x86-64 registers to the stack. Twice. Badly. Hell, it’s probably still broken in some subtle way because my x86/x86-64 foo is rusty. Do as I say, not as I do & “know your architecture” :P .

The bug manifested itself here, setting the symbols member of the symtable entry to NULL. I could see in gdb that the chimp_hash_new() call wasn’t failing, but somehow that field was still set to NULL.

Sigh.

Fortunately, once I got some sleep I was able to figure out what was going on:

  1. Immediately before the call to chimp_hash_new, the symtable entry ChimpRef pointer is copied from %rax to %rbx to keep it from getting smashed (the return value of chimp_hash_new will be stored in %rax). After the call, we save the hash:
    
       0x0000000000419a25 :   mov    %rax,%rbx
       0x0000000000419a28 :   callq  0x414ed1 
       0x0000000000419a2d :   mov    %rax,0x20(%rbx)
    

    At this point I suspected %rbx was somehow being overwritten.

  2. Inside chimp_hash_new, an object allocation triggers a call to chimp_gc_collect.
  3. One of the very first things you see in the disassembly of chimp_gc_collect is a push %rbx and a pop %rbx before we return.
  4. Setting a breakpoint on the push and the pop, I could see the value popped was different to the value pushed.

It turned out my crappy assembly code in chimp_gc_collect was smashing the stack — most likely because I was using offsets from %ebp/%esp & getting ‘em horribly wrong. Argh.

Admittedly my code was horribly wrong, but without knowing the architecture well enough to understand why things were failing, it would’ve been painful for me to track this bug down by the observable symptoms.

Ultimately, this was fixed using offsets relative to a known location rather than making big assumptions about the layout of the stack. (See “Know your architecture/compiler” ;) )

Maybe I should take a hint from the Ruby guys & stick to setjmp …

Bonus: Copying collector surprises

The first GC I wrote for what eventually became chimp was actually a copying collector. That introduced problems of its own. Namely, pointers can change under you in ways that aren’t intuitive.

I ultimately opted to move away from the copying collector after a few too many bugs — not because the collector itself was buggy, but because of seemingly-benign-but-broken code like this:

1
CHIMP_SYMTABLE_ENTRY(entry)->symbols = chimp_hash_new ();

What could possibly go wrong? Well if the chimp_hash_new call happens to trigger a GC, the underlying value of ‘entry’ that we extract using the CHIMP_SYMTABLE_ENTRY macro would move from one semi-space to another and — again, depending on the output of your C compiler — the value could be written to the wrong semi-space.

To work around this, you’d have to do things like this:

1
2
3
4
ChimpRef *temp;
temp = chimp_hash_new ();
/* safe, because no allocation (& ergo no GC) is possible here */
CHIMP_SYMTABLE_ENTRY(entry)->symbols = temp;

In retrospect, I could have probably worked around it with macros … but by the time I ripped it out & replaced it with a naive mark/sweep, the rest of the language was starting to take shape — and that was more fun to play with.

Help!

In retrospect, I guess all of these points boil down to the fact that you need a solid understanding of what’s happening under the hood of your C compiler & target architecture(s) to build something resembling a “correct” GC. Needless to say I’ve now got a tremendous, new-found respect for all the work that goes into things like the industrial-strength collectors in the JVM. This stuff is hard to get right & I can’t imagine what it must be like fighting with this stuff across a vast number of different platforms and compiler toolchains.

If GC stuff interests you and/or you’re some sort of garbage collection guru or even just some sort of angry person with strong opinions, I’d love to hear your thoughts on the current state of chimp’s garbage collector. It’s definitely still a work in progress (e.g. we don’t even try to shrink the heap at the moment), but the basics are probably ripe for criticism.

Better yet, go and hack on the chimp code base & point out where I’m being stupid. :)

The Chimp Programming Language

Disclaimer: chimp is still very young and incomplete. It’s by no means ready for prime time, merely getting to a point where I figure other language nerds might be interested in tinkering with it. You’ve been warned!

Here’s one for my “Doing It Wrong” bucket. :)

Over the last few of months I’ve been hacking away — with some help from Amjith (thanks Amjith!)– on chimp: a little dynamic, strongly typed programming language experiment written in C. Each task/thread in chimp gets its own heap, garbage collector and virtual machine for execution of bytecode. As a result, chimp tasks are “shared nothing” & communicate via message passing.

The language itself kind of grew out of a naive copying collector I wrote after prodding around in the MRI/Ruby code base (I eventually dropped the copying collector in favour of a naive mark/sweep). A Python-y object model was slapped on top of the copying collector & from there it kind of grew legs, arms & fur.

The language that has evolved is kinda interesting. If you can imagine Rust or Erlang carrying Ruby & Python’s bastard surrogate child … well, yeah. It’s kinda like that.

My imagination sucks. What does it actually look like?

Though the concurrency / message passing stuff is the part of chimp that perhaps best distinguishes it from Ruby/Python, here’s an example without any concurrency voodoo:

use io;

main argv {
  io.print("What's your name?");
  var name = io.readline();
  io.print(str("Hello, ", name));
}

(Yep. My blog totally has chimp syntax highlighting support. :) )

Inside the chimp programming language

The syntax of chimp is probably best described as an odd blend of Python, Ruby, JavaScript and maybe even Go, but it has some pretty significant differences when it comes to the language runtime.

To reiterate what I mentioned earlier, chimp has per-task heap, GC & VM. As a result, tasks run in isolation from one another. Currently this “isolation” is enforced by the runtime itself. The compiler doesn’t yet do a good job of enforcing the rules, so programs can “compile” but then bork hard when the VM hits something strange.

Tasks (currently mapped 1:1 with OS threads) communicate via message passing. Because each task gets its own GC, collections can occur on one task without blocking others. Since each VM runs in isolation, there’s no need for a GIL — or any other special locking at the VM level, for that matter. In theory tasks should be largely “shared nothing”, though there’s still much to do to enforce that. All this means the potential for Real Concurrency without forking off new heavyweight processes.

FP geeks will hate me for this one. :) Since heaps etc. aren’t shared between tasks, there’s only ever one task fiddling with any given bit of data. As a result, it’s “safe” to mutate data. I anticipate the language itself will encourage a functional style, but it would be best to expect something more like Ruby/Python than Haskell/Erlang.

I’m currently restricting (at runtime) the types that can be passed as messages between tasks — only strings, integers, arrays and tasks are permitted. Messages are serialized & deserialized between tasks in the current implementation. There’s no notion of anything like Rust’s exchange heap here.

Message Passing Example

Here’s a (relatively ugly) chimp port of an example from an Erlang tutorial I stumbled across the other day:

use io;

pong me {
  var pinger = me.recv();
  var msg = me.recv();
  while msg != "finished" {
    io.print("Pong received ping");
    pinger.send("pong");
    msg = me.recv();
  }
  io.print("Pong finished");
}

ping me {
  var ponger = me.recv();
  var i = 0;
  while i < 3 {
    ponger.send("ping");
    var m = me.recv();
    if m == "pong" {
      io.print("Ping received pong");
    }
    i = i   1;
  }
  ponger.send("finished");
  io.print("Ping finished");
}

main argv {
  var ponger = spawn { |me| pong(me); };
  var pinger = spawn { |me| ping(me); };
  ponger.send(pinger);
  pinger.send(ponger);
  pinger.join();
  ponger.join();
}

The spawn keyword, seen on lines 30 & 31 spins up a new task & returns a pipe/handle to the new task. This pipe can be used to communicate with the new task.

The explicit joins are kind of ugly & may be something to eventually do away with. Likewise explicitly passing the task handles via send/recv. But y’know, you get the idea.

Check out some of the other examples, if you’re still interested. sandbox.chimp is probably the best example of most of the syntax in the language to date.

What still needs to be done?

*sigh* So very much.

I’m keeping a TODO list of the list of immediate things I want to improve, but it’s not really comprehensive. Tasks deserve a rewrite. Tasks are too “heavy” thanks to aggressive GC memory grabs. Adding new data types isn’t as easy as I’d like. The garbage collector could be a lot better. There are no structured data types. Tests would be nice.

It’s not yet possible to import external chimp source files. Some basic arithmetic operators are unimplemented (modulo, for example). We need more modules. The bytecode instruction set kind of sucks. The VM is inefficient. There’s no optimization pass at all. Code enforcing certain scoping rules is non-existent … the list goes on!

All that said, if you’re a {,would-be} compiler/runtime geek eager to play with a small, dynamic language with some neat little features despite its infancy … well, I’ll take any code I can get. :)

Get involved!

It’s kind of exhausting hammering away on this all by my lonesome. :)

Get in touch or send me pull requests. Happy to answer any questions about where to begin, what needs doing, etc.

I’m also open to the (ideally friendly!) criticism & thoughts of folks who know better. Right now I’m kinda fumbling my way through in places & I’m sure it shows.

Last of all, you can take a page out of my buddy Glenn’s book and actively work to break it using contrived examples. ;)

Managed & Owned Boxes in the Rust Programming Language

*This was written at the time Rust 0.4 was the most recent release of the language. I’ll try to keep it up to date / revise it for later versions, but please be warned that Rust moves quickly & this may be out of date by the time you read it! :) *

I’ve been spending a little time with the Rust programming language sporadically over the last 6 months or so (and even landed some trivial patches.). I like what I’ve seen a whole lot, but one of the things that has really tripped me up as a newbie is the difference between owned boxed vs. managed boxed vs. unboxed values and borrowed pointers, and all the associated syntactic baggage.

It’s pretty difficult to read Rust code without at least somewhat understanding what it all means. So for my own benefit — and hopefully yours! — I’ve tried to document my current understanding of this stuff. Please be sure to leave abuse in the comments if I’m getting anything horribly wrong.

The Stack and Unboxed Values

To get it out of the way, unboxed values are relatively easy to grok. Straight up, if you’ve worked with the stack in C/C , you’ll be right at home with Rust’s stack. The stack in Rust is conceptually identical. Unboxed values are simply those values which are allocated on the stack. The end.

//
// Integer allocated on the stack.
//
let x = 10;

//
// String allocated on the stack.
//
let y = "Hello World";

Easy, right? What about those horrible managed/owned boxes and borrowed pointer things, though? Well, first we need to understand a little more about memory management in Rust — and for reasons that will become clear soon, the best place to start is the basic unit of execution: the task.


UPDATE [2012-12-06] minor correction from the geeks of proggit:

“Un-sigiled and &-sigiled string literals are actually allocated in the static memory region, not on the stack (I presume this is for the sake of efficiency). However this is mostly pedantry, as I don’t think it actually makes any semantic difference compared to something that’s truly stack-allocated.”

kibwen via /r/programming


What is a task?

Tasks are the basic unit of isolation and concurrency in Rust. Tasks do not share memory and communicate only through message passing. By default, tasks are scheduled in an M:N fashion onto operating system threads. The normal mode of operation will be to launch one thread per logical core. The programmer, however, will have the option of assigning certain tasks to a dedicated thread or running the scheduler in 1:1 mode.

Note: tasks and communication

(Note this source is nearly a year old and may contain inaccuracies. Further, the “design” may not reflect the current state of master.)

Thus, tasks …

  • … run concurrently (like threads!) in isolation from one another (not like threads!)
  • do not share memory & communicate via message passing
  • … are scheduled over a finite number of OS threads (default: one per core)
  • … can be optionally assigned to a dedicated OS thread by the programmer

So tasks are kinda similar to the light-weight processes found in Erlang.

Above I emphasized that tasks do not share memory. This is important for understanding managed vs. owned boxes in Rust. Keep it in mind!

Task-local Heaps and Managed Boxes

So tasks don’t share memory at all & run isolated from one another. What does this mean in practice?

Rust starts from the position that memory cannot be shared between tasks. Experience in other languages has proven that isolating each task’s heap from the others is a reliable strategy and one that is easy for programmers to reason about. Heap isolation has the additional benefit that garbage collection must only be done per-heap. Rust never “stops the world” to reclaim memory.

The Rust Language Tutorial

This implies that every task in Rust has its own private heap that can be garbage collected independently of heaps owned by other tasks.

This is the key to understanding managed boxes: a managed box is a value allocated on the task-local heap. Since heaps are private, garbage collection can occur on a per-task basis, so GC on one task need not block the execution of other tasks — i.e. tasks can continue executing (assuming OS threads are available to do so) while other tasks are mid-GC.

//
// An integer managed box, allocated on the local heap.
//
// Note that `x` itself is a *pointer* to the managed box on the heap.
//
let x = @10;

//
// A string managed box, allocated on the local heap.
//
let y = @"Hello World"

The Exchange Heap and Owned Boxes

So we’ve established that Rust tasks have their own private heap. We also know that tasks can communicate via message passing — but given that tasks don’t share any memory, how is this message passing implemented?

A naive implementation of a message passing system might simply copy values from the heap of one task to another, perhaps using some sort of thread-safe queue or by locking the tasks involved. However, Rust is being built with performance in mind and the designers determined that copying larger data structures has the potential to be expensive. This is where the exchange heap and owned boxes come in.

Complete isolation of heaps between tasks would, however, mean that any data transferred between tasks must be copied. While this is a fine and useful way to implement communication between tasks, it is also very inefficient for large data structures. […] Objects allocated in the exchange heap have ownership semantics, meaning that there is only a single variable that refers to them. For this reason, they are referred to as owned boxes. All tasks may allocate objects on the exchange heap, then transfer ownership of those objects to other tasks, avoiding expensive copies.

The Rust Language Tutorial

So the exchange heap is a global heap visible to all Rust tasks. “Liar!” I hear you scream. “I thought you said that tasks don’t share any memory!”.

Well, the gotcha with the exchange heap is that values on the heap can only be referenced by one task at a time. In fact, values on the exchange heap can only be referenced by a single variable at a time. It’s this property that ensures that tasks won’t stomp on each other when using the exchange heap: if only one variable can reference a value on the exchange heap, that means only one task is able to manipulate it at any given time. Neat eh?

//
// An integer owned box, allocated on the exchange heap.
//
// Note that `x` itself is a *pointer* to the owned box on the heap.
//
let x = ~10;

//
// A string owned box, allocated on the exchange heap.
//
let y = ~"Hello World"

Although the documentation might lead you to believe otherwise, you unfortunately can’t simply transfer ownership of a value directly from one task to another (I wrongly assumed that move and copy could be used for this purpose). Instead, you will want to use the core::pipes module which uses the exchange heap under the hood for efficient message passing between Rust tasks. Also see the std::comm module, which builds on core::pipes to provide a higher-level, two-way communication stream: DuplexStream.

If you need to transfer ownership of owned boxes, you can use the copy keyword to duplicate a value, or the move keyword to change ownership. For example:

//
// Initialize an owned box on the exchange heap.
//
let x = ~10;

//
// Copy the owned box. This allocates a *new* owned box
// and copies the value of the owned box referenced by `x`.
// `x` remains unchanged.
//
let y = copy x;

//
// This makes `z` the *owner* of the owned box referenced by
// `x`. `x` is no longer valid.
//
let z = move x;

//
// Fails at compile time -- "error: use of moved variable: `x`"
//
io::println(int::str(*x));

There’s actually a pretty accessible tutorial on communication between tasks on the Rust web site. I won’t bother reproducing those perfectly good examples here, but you should check it out if you’re interested in seeing how Rust tasks communicate. :)

Borrowed Pointers

With all this under your belt, it’s much easier to understand borrowed pointers:

Rust supports several types of pointers. The safe pointer types are @T for managed boxes allocated on the local heap, ~T, for uniquely-owned boxes allocated on the exchange heap, and &T, for borrowed pointers, which may point to any memory, and whose lifetimes are governed by the call stack.

The Rust Language Tutorial

Perhaps the most intuitive way to see borrowed pointers in action is with a function that takes a borrowed pointer as an argument:

fn foo(x : &int) {
    io::println(int::str(*x))
}

fn main() {
    let a = ~10;
    let b = @20;
    let c = 30;
    foo(a);
    foo(b);
    foo(&c);
}

Here, x is a borrowed pointer that is in scope for the duration of a call to foo. Note the variables a and b are both pointers here, and we’re taking the address of the unboxed value c on the stack before passing it to the foo function. This is what the language tutorial means by borrowed references being able to “point to any memory” — it doesn’t matter whether the memory is located on the local heap, the exchange heap or the stack.

There’s a lot more to say about borrowed pointers & regions, but I think I’ve written enough for one blog post. In any case, Niko Matsakis has already written a great tutorial on borrowed pointers and you should totally check it out if this stuff has you interested.

So … what should I use? Managed boxes? Owned boxes? Borrowed references?

This is unfortunately a more difficult question to answer. Although the intent of the different memory regions is pretty clear, it’s not unusual to find type signatures in the standard library that call for the use of an owned box when it seems like a borrowed reference would do just fine (for example, brson pointed out on IRC that much of the vec module comes from a time when Rust only had owned vectors!). As such, my experience with Rust’s evolving API has led me to believe that — where you can’t use the type that “makes sense” — you often wind up using transforming values to whatever type is required for by the library functions you are using (e.g. from a managed to an owned box).

More type trivia from brson (if you’re into that sort of thing): ~T and ~[T] are different types of owned pointer, so something like:

let x = ~10;
let y = @*x;

will compile, but:

let x = ~[1, 2, 3];
let y = @*x;

will not, because the compiler complains that x can’t be dereferenced. I don’t fully understand what prevents the latter pointer “type” from being dereferenced, but there you go. Something to be wary of. :)


UPDATE [2012-12-06]:

Great article. Regarding why ~[T] and ~T behave somewhat differently, the reason is that the size of [T] is not known to the compiler! It’s kind of like how in C, if you have T3, that’s a perfectly fine type, but T[] is sort of a degenerate type that gets converted to T* willy-nilly. The reason is that the compiler can’t really manipulate an “unknown number of T’s in a row”, which is what [T] (Rust) and T[] (C) represent. As an example, it can’t put such a value on the stack, since it doesn’t know how much stack space to allocate (well, it can’t put that on the stack without using alloca() or something similar).

@nikomatsakis via the Rust mailing list


These might seem like pretty major holes, but Rust is young and still evolving. And it really has come a long way — even in the last year or so. I look forward to seeing this language evolve & improve well into the future & hopefully prove to be a solid foundation for Mozilla Servo.

You’re Doing It Wrong (and That’s Okay)

Oh god damn it.

I’m absolutely appalled by this thread on Hacker News. Dozens of folks spewing bile because some guy on the Internet wants to build his own Operating System and has some strange ideas. Shit like this is everything that’s wrong with geek culture today.

The prevailing argument seems to be that he’s “doing it wrong”, that he’s ignoring years of research, that he shouldn’t have submitted it to HN (hot tip: he didn’t) and blah, blah, blah. Here’s the thing: succeed or fail, this guy is going to learn so freakin’ much in the process. He’s going to be a better developer for trying, even if you feel his ideas are bunk.

Even better, he has already started documenting his efforts. This may prove to be a handy resource for other would-be Operating Systems geeks. Nothing but good things can come from an undertaking like this. Unless you heap shit on the poor guy, ridiculing him and telling him his project is doomed and he shouldn’t bother.

I guess my beef with this sort of behaviour is thus: I’m a compiler nerd wannabe who has done it wrong in the past, is actively doing it wrong in the present and will continue doing it wrong well into the future if I feel like it’s a good way to learn stuff. I love messing around with compilers and I’m well aware I’m still only scratching the surface.

But despite my continued inexperience and naivety in the field, my continued tinkering has led to some crazy awesome opportunities — like speaking at a bunch of conferences, and working for awesome companies in a foreign country.

I don’t think any of that would have happened should I have been “Hacker News-ed” the way this guy was. Fortunately I had Python-Dev as something of a steward in the early days. Instead of pointing out how stupid I was & the dearth of things I didn’t (and still don’t) know, they patiently helped me limp along to accomplish what I was trying to do.

This is what a “community” should be all about.

To reiterate: good things generally come from the geekery of others, even when they’re “doing it wrong”. So stop proving the Greater Internet Fuckwad Theory, start showing some encouragement and do what you can to help people like Gusts on their way.

RubyConf 2012: Rapid Programming Language Prototypes With Ruby and Racc

Interested in how you might go about building a simple compiler?

The video of my RubyConf 2012 presentation “Rapid Programming Language Prototypes with Ruby and Racc” has just been uploaded to YouTube for your viewing pleasure by the awesome folks at Confreaks.

Big chunk of live coding around 12:13 and great questions from the audience at around 37:50.

The source code for the live-coded “compiler” has been pushed to this github repository. The slides are up on github too, with some minor changes — during the presentation I misattributed the author of Racc: Minero Aoki (@mineroaoki) is the original author, not Aaron Patterson (@tenderlove). (Huge apologies Minero!)

Coding Standards: Consistency Is King

I just read Richard Rodger’s article “Why I have given up on coding standards“. I loved this:

The truly evil thing about coding standards is what they do to your heart, your team’s heart. They are a little message that you are not good enough. You cannot quite be trusted. Without adult supervision, you’ll mess up.

(Emphasis mine.)

Awesome stuff, and I think Richard’s right in that coding standards are a demoralizing lost cause. However, I felt the conclusion he ultimately came to stopped just short of saying something really important:

I expect everyone to write good clean code. You decide what that means. You decide if you can sleep at night with random code layouts and inconsistent variable names. But you know what, maybe they just don’t matter for a 100 line node.js mini-server that only does one thing. You decide.

(Emphasis mine again.)

I understand the sentiment, and I agree with it in so much as that you need to trust smart people to do smart things. But for all their flaws, coding standards are often introduced to address a legitimate problem: it’s a pain in the arse to read code that is formatted inconsistently. Code bases that are inconsistent* are generally harder to understand.

So how do we maintain consistency given that we’ve established that formalized coding standards suck?

Write your code such that it’s consistent with the style of the code around it.

For example:

In the file you’re editing, are curly braces on the same line as a function declaration?
Put your curly braces on the same line, too.

Whitespace before brackets?
Put spaces before your brackets.

Tabs instead of spaces?
You use tabs too, even if you object to tabs on principle.

Seeing a pattern here?

You improve the consistency of your code base (a major goal of coding standards!) by trying to be consistent. It’s not rocket science.

The part where you need to trust developers to be “smart” comes along when there are obvious inconsistencies in the code:

Well, this source file uses a mix of tabs & spaces. That sucks. I’m going update the file to use spaces because that’s what we use in the rest of the project.

Now, even after reading Richard’s article (and this one) you might be thinking “but without rules, code will never be perfectly consistent!”. You’re absolutely right. But odds are folks were fucking up trying to follow whatever coding convention you put in place anyway.

Formalized coding standards aside, making an active effort to be consistent will generally lead to code that’s easier to follow.

Of course, you could always save yourself the trouble and start writing everything in Go. :)

  • All of this is true of architectural decisions too. If you’re going to make a better architectural “wheel”, you’d better be prepared to update the existing wheels. Otherwise, I’d wager 9 times out of 10 you’re actually making things worse… but maybe that’s a post for another day!

Patching Pidgin

Here’s a long-winded story about how I came to find myself patching Pidgin over the weekend. I’m not really sure how or why I turned this into a huge blog post, but it’s much too late for that now. :)

Setting the scene

I recently installed the awesome i3 window manager. One of my favourite parts of i3 is the i3bar, which is a dumping ground for various stats about your system: disk & memory usage, network info, current time, etc. You can also pretty easily write some easy scripts to add your own arbitrary stats & notifications to your i3bar.

After getting everything set up, I had the idea to integrate i3bar with Pidgin so that I get a notification whenever I receive a new IM. Little did I know this innocent little idea would lead me down a rabbit hole that ultimately resulted in me submitting a patch to the Pidgin project.

Investigating Pidgin’s D-Bus integration

Having decided to integrate Pidgin with i3bar, my first stop was Pidgin’s D-Bus How-To. This document basically describes how you can use D-Bus to communicate with an active Pidgin instance.

I quickly found that I could listen for a signal when a new IM message is received, but what wasn’t obvious was how I could determine that these messages had been read. Running Pidgin under a traditional window manager like GNOME, the IM window title bar flashes to let you know you have unread messages, then when you focus the IM window it will stop flashing. Could I determine this programmatically using the D-Bus API?

The conversation-updated signal looked interesting, particularly because of its type parameter — a PurpleConvUpdateType enum that includes the value PURPLE_CONV_UPDATE_UNSEEN. This would seem to be the event type fired when the number of “unseen” messages (i.e. messages received while you weren’t looking at the IM window) changes.

After a bit of experimentation it seemed that this was indeed the signal I was looking for, but it gave me no indication of how I could determine whether the number of unseen messages was increasing or decreasing. I figured I would need to again use the D-Bus API again to ask for more details about the IM conversation. Again, I couldn’t see a way to determine unread message counts.

Discovering unseen-count

The only thing that stood out to me was purple_conversation_get_data in the C API (which the D-Bus API seemed to be based upon), but didn’t seem to appear in the D-Bus API. Almost as an afterthought, I googled for “pidgin PurpleConversationGetData” (the name I would expect the D-Bus equivalent to have) to see if I could find a D-Bus interface to this function and to see if it might possibly have the unread message count I was seeking.

One of the first results of my Google search for PurpleConversationGetData was a ticket from the Pidgin bug tracker titled expose function via dbus to get ‘unseen-count’. Bingo! Some more Googling for details of “unseen-count” confirmed it: this looked exactly like the API call I was after.

The bad news?

  1. The ticket mentions that the D-Bus interface to PurpleConversationGetData was explicitly removed as a result of another change.
  2. The ticket itself was nearly two years old, and hadn’t been touched in over a year.

After a brief bit of soul-searching pondering whether or not I could really be bothered doing this just to know whether I had an IM pending my attention, I thought “fuck it, why not”. Note that the ticket included a discussion suggesting that the proposed PurpleConversationGetUnseen call be added as a D-Bus wrapper function. It wasn’t entirely clear to me at the time what this meant, so I decided it was time to look at some code.

A hg clone later, and I was digging through Pidgin’s source code.

Exploring the Pidgin code base

My first thought was to poke around for anything relating to D-Bus at the file system level (I often try this before a grep when I’m new to a code base — I find it more informative than the noise generated by grep output in a code base I know almost nothing about):

find . -name '*dbus*'

This was still a little overwhelming, so I narrowed it down to C source files:

find . -name '*dbus*.c'

One of the more interesting files that popped up was libpurple/dbus-server.c. It became pretty clear that this is how Pidgin exposed its D-Bus interface using some XML format I wasn’t familiar with at the time (but it was clearly describing the interface of Pidgin’s D-Bus service).

Okay, so this was probably where I needed to make my change. It looked like it should be relatively easy.

At some point I noticed that dbus-server.c was referencing bindings to generate the interface XML. Where did these come from? I could see a function called purple_dbus_register_bindings, which was declared in dbus-bindings.h and implemented in dbus-server.c. It was pretty likely this was being called somewhere to register our D-Bus bindings. But where?

$ grep -R "purple_dbus_register_bindings"
dbus-bindings.h:void purple_dbus_register_bindings(void *handle, PurpleDBusBinding *bindings);
dbus-server.c:purple_dbus_register_bindings(void *handle, PurpleDBusBinding *bindings)
dbus-analyze-functions.py:        print "#define PURPLE_DBUS_REGISTER_BINDINGS(handle) purple_dbus_register_bindings(handle, bindings_DBUS)"

Ah. There’s code generation involved: dbus-analyze-functions.py. I looked through it for long enough to confirm this looked like a code generator of some sort, scanning header files piped in via stdin. Next I opened up Makefile.am and searched for dbus-analyze-functions.py to see how it’s being used:

dbus-bindings.c: dbus-analyze-functions.py $(dbus_exported)
    $(AM_V_GEN)cat $(dbus_build_exported) | $(PYTHON) $(srcdir)/dbus-analyze-functions.py > $@

At this point it was pretty clear that it was used to generate dbus-bindings.c. Looking at the generated code confirmed that this was where D-Bus calls were marshaled into C API calls and the results were marshaled back again.

From what I understood of the code so far, it would be cleaner to add a C wrapper function called something like purple_conversation_get_unseen to libpurple (the communications API underlying Pidgin) & simply leave the magic to dbus-analyze-functions.py to expose my wrapper than to implement a D-Bus-specific method.

Back to the bug report

At this point I dropped a comment on the unseen-count ticket indicating that I intended to implement it, along with my thoughts about adding purple_conversation_get_unseen to libpurple. I left it for a few days to wait for feedback.

When I came back to the ticket a few days later, I got to thinking about the earlier discussion on the ticket about implementing a wrapper function rather than simply re-exposing purple_conversation_get_data and got to thinking about purple_conversation_get_unseen.

I grepped through the code for unseen-count, and found no references at all to it within libpurple. Ah. This is why they were talking about a D-Bus wrapper function: unseen-count is a Pidgin-specific implementation detail & they don’t want to pollute libpurple’s C API with something that really isn’t natively supported by libpurple.

I commented on the ticket again indicating that I now understood why they were looking for a D-Bus wrapper function rather than adding the call directly to libpurple. Suddenly it was clear to me how to proceed with my patch.

Hacking libpurple

Thinking for a while, I decided I didn’t actually want to modify the code generator at all & would prefer to simply hook in my code explicitly in dbus-server.c.

Looking at the documentation for the D-Bus introspection format, it wasn’t immediately clear what certain aspects of the output should look like. For example, it was obvious to me that the return value should probably be stored in an arg with a direction of out, but what do I need to call that argument?

I dug around in libpurple/dbus-analyze-functions.py, but nothing leaped out at me & I didn’t want to invest a huge amount of time trying to understand the code to piece together the expected output. What I really wanted to see is some example output from D-Bus itself. So I took this neat little intro to the dbus-send command line tool and, using information from Pidgin’s D-Bus documentation, came up with this:

dbus-send --session --type=method_call --print-reply 
  --dest=im.pidgin.purple.PurpleService 
  /im/pidgin/purple/PurpleObject 
  org.freedesktop.DBus.Introspectable.Introspect

(N.B. I later learned that there’s the much-easier-to-use purple-send command, too :) )

The resulting output looks something like this:

method return sender=:1.217 -> dest=:1.338 reply_serial=2
   string "

Further down in in this output, I found an example of a method that operates on a PurpleConversation and returns a single (string) value:

So looks like return values are stored in the RESULT argument. dbus-analyze-functions.py seems to confirm this:

class Binding:
    # ... snip ...
    def process(self):
        for param in self.params:
            self.processinput(param.type, param.name)

        self.processoutput(self.function.type, "RESULT")

Using this information and by butchering some of the marshaling code from the generated dbus-bindings.c, I was able to pretty easily write this patch implementing support for a D-Bus PurpleConversationGetUnseen method.

The Grand Finale

Once I felt I had the code right, I installed my modified build of Pidgin & triumphantly updated the Python script that I’m using to drive my i3bar:

#!/usr/bin/env python

import sys
import json
import socket
import dbus
from dbus.mainloop.glib import DBusGMainLoop
import gobject

messages = 0
def pidgin_conversation_updated(conv, type):
    global messages
    # yes, I should probably be checking the message type. I am a bad person.
    messages = purple.PurpleConversationGetUnseen(conv)

def println(s):
    sys.stdout.write("%sn" % s)
    sys.stdout.flush()

def readline():
    try:
        line = sys.stdin.readline().strip()
        if not line:
            sys.exit(3)
        return line
    except KeyboardInterrupt:
        sys.exit()

if __name__ == '__main__':
    DBusGMainLoop(set_as_default=True)
    bus = dbus.SessionBus()
    obj = bus.get_object("im.pidgin.purple.PurpleService", "/im/pidgin/purple/PurpleObject")
    purple = dbus.Interface(obj, "im.pidgin.purple.PurpleInterface")

    bus.add_signal_receiver(
        pidgin_conversation_updated,
        dbus_interface='im.pidgin.purple.PurpleInterface',
        signal_name='ConversationUpdated'
    )

    println(readline())
    println(readline())

    loop = gobject.MainLoop()
    ctx = loop.get_context()

    while True:
        ctx.iteration(False)

        line, prefix = readline(), ''
        if line.startswith(','):
            line, prefix = line[1:], ','

        j = json.loads(line)
        # j.insert(0, {'full_text': '%s.local' % socket.gethostname(), 'name': 'host', 'color': '#0088CC'})
        if messages > 0:
            pidgin_color = '#FF0000'
            pidgin_text = 'pidgin*'
            j.insert(0, {'full_text': pidgin_text, 'name': 'pidgin', 'color': pidgin_color})
        println("%s%s" % (prefix, json.dumps(j)))

I restarted i3wm and asked a buddy to drop me an IM so that I could test something, then jumped back to my bash shell before he could respond (you can see it in red down near the bottom :) ):

A screenshot of Pidgin IM notifications in i3bar

Bam. Unread IM notifications. Feels good, man.

Status of the Patch

The patch hasn’t been accepted yet (I only posted it yesterday), though indications are good that it will be. I really don’t mind either way since I can happily run my patched build locally. In any case, I always enjoy any chance I get to peak under the hood of the software I use every day.

And my big take away from this? I totally need to get out more. :)

Changes to apt.tomlee.co

Not sure if anybody out there is using my apt repository, but I’m no longer separate distributions for wheezy & squeeze (I’m now using a single “tomlee” distribution) and my GPG key has unfortunately changed.

New instructions, keys & sources.list files are available on apt.tomlee.co.