Managed & owned boxes in the Rust programming language

*This was written at the time Rust 0.4 was the most recent release of the language. I’ll try to keep it up to date / revise it for later versions, but please be warned that Rust moves quickly & this may be out of date by the time you read it! :) *

I’ve been spending a little time with the Rust programming language sporadically over the last 6 months or so (and even landed some trivial patches.). I like what I’ve seen a whole lot, but one of the things that has really tripped me up as a newbie is the difference between owned boxed vs. managed boxed vs. unboxed values and borrowed pointers, and all the associated syntactic baggage.

It’s pretty difficult to read Rust code without at least somewhat understanding what it all means. So for my own benefit — and hopefully yours! — I’ve tried to document my current understanding of this stuff. Please be sure to leave abuse in the comments if I’m getting anything horribly wrong.

The Stack and Unboxed Values

To get it out of the way, unboxed values are relatively easy to grok. Straight up, if you’ve worked with the stack in C/C , you’ll be right at home with Rust’s stack. The stack in Rust is conceptually identical. Unboxed values are simply those values which are allocated on the stack. The end.

//
// Integer allocated on the stack.
//
let x = 10;

//
// String allocated on the stack.
//
let y = "Hello World";

Easy, right? What about those horrible managed/owned boxes and borrowed pointer things, though? Well, first we need to understand a little more about memory management in Rust — and for reasons that will become clear soon, the best place to start is the basic unit of execution: the task.

UPDATE [2012-12-06] minor correction from the geeks of proggit:

“Un-sigiled and &-sigiled string literals are actually allocated in the static memory region, not on the stack (I presume this is for the sake of efficiency). However this is mostly pedantry, as I don’t think it actually makes any semantic difference compared to something that’s truly stack-allocated.”

kibwen via /r/programming

What is a task?

Tasks are the basic unit of isolation and concurrency in Rust. Tasks do not share memory and communicate only through message passing. By default, tasks are scheduled in an M:N fashion onto operating system threads. The normal mode of operation will be to launch one thread per logical core. The programmer, however, will have the option of assigning certain tasks to a dedicated thread or running the scheduler in 1:1 mode.

Note: tasks and communication

(Note this source is nearly a year old and may contain inaccuracies. Further, the “design” may not reflect the current state of master.)

Thus, tasks …

… run concurrently (like threads!) in isolation from one another (not like threads!)
… do not share memory & communicate via message passing
… are scheduled over a finite number of OS threads (default: one per core)
… can be optionally assigned to a dedicated OS thread by the programmer

So tasks are kinda similar to the light-weight processes found in Erlang.

Above I emphasized that tasks do not share memory. This is important for understanding managed vs. owned boxes in Rust. Keep it in mind!

Task-local Heaps and Managed Boxes

So tasks don’t share memory at all & run isolated from one another. What does this mean in practice?

Rust starts from the position that memory cannot be shared between tasks. Experience in other languages has proven that isolating each task’s heap from the others is a reliable strategy and one that is easy for programmers to reason about. Heap isolation has the additional benefit that garbage collection must only be done per-heap. Rust never “stops the world” to reclaim memory.

The Rust Language Tutorial

This implies that every task in Rust has its own private heap that can be garbage collected independently of heaps owned by other tasks.

This is the key to understanding managed boxes: a managed box is a value allocated on the task-local heap. Since heaps are private, garbage collection can occur on a per-task basis, so GC on one task need not block the execution of other tasks — i.e. tasks can continue executing (assuming OS threads are available to do so) while other tasks are mid-GC.

//
// An integer managed box, allocated on the local heap.
//
// Note that `x` itself is a *pointer* to the managed box on the heap.
//
let x = @10;

//
// A string managed box, allocated on the local heap.
//
let y = @"Hello World"

The Exchange Heap and Owned Boxes

So we’ve established that Rust tasks have their own private heap. We also know that tasks can communicate via message passing — but given that tasks don’t share any memory, how is this message passing implemented?

A naive implementation of a message passing system might simply copy values from the heap of one task to another, perhaps using some sort of thread-safe queue or by locking the tasks involved. However, Rust is being built with performance in mind and the designers determined that copying larger data structures has the potential to be expensive. This is where the exchange heap and owned boxes come in.

Complete isolation of heaps between tasks would, however, mean that any data transferred between tasks must be copied. While this is a fine and useful way to implement communication between tasks, it is also very inefficient for large data structures. […] Objects allocated in the exchange heap have ownership semantics, meaning that there is only a single variable that refers to them. For this reason, they are referred to as owned boxes. All tasks may allocate objects on the exchange heap, then transfer ownership of those objects to other tasks, avoiding expensive copies.

The Rust Language Tutorial

So the exchange heap is a global heap visible to all Rust tasks. “Liar!” I hear you scream. “I thought you said that tasks don’t share any memory!”.

Well, the gotcha with the exchange heap is that values on the heap can only be referenced by one task at a time. In fact, values on the exchange heap can only be referenced by a single variable at a time. It’s this property that ensures that tasks won’t stomp on each other when using the exchange heap: if only one variable can reference a value on the exchange heap, that means only one task is able to manipulate it at any given time. Neat eh?

//
// An integer owned box, allocated on the exchange heap.
//
// Note that `x` itself is a *pointer* to the owned box on the heap.
//
let x = ~10;

//
// A string owned box, allocated on the exchange heap.
//
let y = ~"Hello World"

Although the documentation might lead you to believe otherwise, you unfortunately can’t simply transfer ownership of a value directly from one task to another (I wrongly assumed that move and copy could be used for this purpose). Instead, you will want to use the core::pipes module which uses the exchange heap under the hood for efficient message passing between Rust tasks. Also see the std::comm module, which builds on core::pipes to provide a higher-level, two-way communication stream: DuplexStream.

If you need to transfer ownership of owned boxes, you can use the copy keyword to duplicate a value, or the move keyword to change ownership. For example:

//
// Initialize an owned box on the exchange heap.
//
let x = ~10;

//
// Copy the owned box. This allocates a *new* owned box
// and copies the value of the owned box referenced by `x`.
// `x` remains unchanged.
//
let y = copy x;

//
// This makes `z` the *owner* of the owned box referenced by
// `x`. `x` is no longer valid.
//
let z = move x;

//
// Fails at compile time -- "error: use of moved variable: `x`"
//
io::println(int::str(*x));

There’s actually a pretty accessible tutorial on communication between tasks on the Rust web site. I won’t bother reproducing those perfectly good examples here, but you should check it out if you’re interested in seeing how Rust tasks communicate. :)

Borrowed Pointers

With all this under your belt, it’s much easier to understand borrowed pointers:

Rust supports several types of pointers. The safe pointer types are @T for managed boxes allocated on the local heap, ~T, for uniquely-owned boxes allocated on the exchange heap, and &T, for borrowed pointers, which may point to any memory, and whose lifetimes are governed by the call stack.

The Rust Language Tutorial

Perhaps the most intuitive way to see borrowed pointers in action is with a function that takes a borrowed pointer as an argument:

fn foo(x : &int) {
    io::println(int::str(*x))
}

fn main() {
    let a = ~10;
    let b = @20;
    let c = 30;
    foo(a);
    foo(b);
    foo(&c);
}

Here, x is a borrowed pointer that is in scope for the duration of a call to foo. Note the variables a and b are both pointers here, and we’re taking the address of the unboxed value c on the stack before passing it to the foo function. This is what the language tutorial means by borrowed references being able to “point to any memory” — it doesn’t matter whether the memory is located on the local heap, the exchange heap or the stack.

There’s a lot more to say about borrowed pointers & regions, but I think I’ve written enough for one blog post. In any case, Niko Matsakis has already written a great tutorial on borrowed pointers and you should totally check it out if this stuff has you interested.

So … what should I use? Managed boxes? Owned boxes? Borrowed references?

This is unfortunately a more difficult question to answer. Although the intent of the different memory regions is pretty clear, it’s not unusual to find type signatures in the standard library that call for the use of an owned box when it seems like a borrowed reference would do just fine (for example, brson pointed out on IRC that much of the vec module comes from a time when Rust only had owned vectors!). As such, my experience with Rust’s evolving API has led me to believe that — where you can’t use the type that “makes sense” — you often wind up using transforming values to whatever type is required for by the library functions you are using (e.g. from a managed to an owned box).

More type trivia from brson (if you’re into that sort of thing): ~T and ~[T] are different types of owned pointer, so something like:

let x = ~10;
let y = @*x;

will compile, but:

let x = ~[1, 2, 3];
let y = @*x;

will not, because the compiler complains that x can’t be dereferenced. I don’t fully understand what prevents the latter pointer “type” from being dereferenced, but there you go. Something to be wary of. :)

UPDATE [2012-12-06]:

Great article. Regarding why ~[T] and ~T behave somewhat differently, the reason is that the size of [T] is not known to the compiler! It’s kind of like how in C, if you have T3, that’s a perfectly fine type, but T[] is sort of a degenerate type that gets converted to T* willy-nilly. The reason is that the compiler can’t really manipulate an “unknown number of T’s in a row”, which is what [T] (Rust) and T[] (C) represent. As an example, it can’t put such a value on the stack, since it doesn’t know how much stack space to allocate (well, it can’t put that on the stack without using alloca() or something similar).

@nikomatsakis via the Rust mailing list

These might seem like pretty major holes, but Rust is young and still evolving. And it really has come a long way — even in the last year or so. I look forward to seeing this language evolve & improve well into the future & hopefully prove to be a solid foundation for Mozilla Servo.