A primer on code generation in Cranelift
Cranelift is a code generator written in the Rust programming language that aims to be a fast code generator, which outputs machine code that runs at reasonable speeds.
The Cranelift compilation model consists in compiling functions one by one, holding extra information about external entities, like external functions, memory addresses, and so on. This model allows for concurrent and parallel compilation of individual functions, which supports the goal of fast compilation. It was designed this way to allow for just-in-time (JIT) compilation of WebAssembly binary code in Firefox, although its scope has broadened a bit. Nowadays it is used in a few different WebAssembly runtimes, including Wasmtime and Wasmer, but also as an alternative backend for Rust debug compilation, thanks to cg_clif.
A classic compiler design usually includes running a parser to translate the source to some form of intermediate representations, then run optimization passes onto them, then feeds this to the machine code generator.
This blog post focuses on the final step, namely the concepts that are involved in code generation, and what they map to in Cranelift. To make things more concrete, we’ll take a specific instruction, and see how it’s translated, from its creation down to code generation. At each step of the process, I’ll provide a short (ahem) high-level explanation of the concepts involved, and I’ll show what they map to in Cranelift, using the example instruction. While this is not a tutorial detailing how to add new instructions in Cranelift, this should be an interesting read for anyone who’s interested in compilers, and this could be an entry point if you’re interested in hacking on the Cranelift codegen
crate.
This is our plan for this blog post: each squared box represents data, each rounded box is a process. We’re going to go through each of them below.
Intermediate representations
Compilers use intermediate representations (IR) to represent source code. Here we’re interested in representations of the data flow, that is instructions themselves and only that. The IRs contain information about the instructions themselves, their operands, type specialization information, and any additional metadata that might be useful. IRs usually map to a certain level of abstraction, and as such, they are useful for solving different problems that require different levels of abstraction. Their shape (which data structures) and numbers often have a huge impact on the performance of the compiler itself (that is, how fast it is at compiling).
In general, most programming languages use IRs internally, and yet, these are invisible to the programmers. The reason is that source code is usually first parsed (tokenized, verified) and then translated into an IR. The abstract syntax tree, aka AST, is one such IR representing the source code itself, in a format that’s very close to the source code itself. Since the raison d’être of Cranelift is to be a code generator, having a text format is secondary, and only useful for testing and debugging purposes. That’s why embedders directly create and manipulate Cranelift’s IR.
At the time of writing, Cranelift has two IRs to represent the function’s code:
- one external, high-level intermediate representation, called CLIF (for Cranelift IR format),
- one internal, low-level intermediate representation called VCode (for virtual-registerized code).
CLIF IR
CLIF is the IR that Cranelift embedders create and manipulate. It consists of high-level typed operations that are convenient to use and/or can be simply translated to machine code. It is in static single assignment (SSA) form: each value referenced by an operation (SSA value) is defined only once, and may have as many uses as desired. CLIF is practical to use and manipulate for classic compilers optimization passes (e.g. LICM), as it is generic over the target architecture which we’re compiling to.
let x = builder.ins.iconst;
let y = builder.ins.iconst;
let sum = builder.ins.iadd;
An example of Rust code that would generate CLIF IR: using an IR builder, two constant 64-bits integer SSA values x and y are created, and then added together. The result is stored into the sum
SSA value, which can then be consumed by other instructions.
The code for the IR builder we’re manipulating above is automatically generated by the cranelift-codegen
build script. The build script uses a domain specific meta language (DSL)[1] that defines the instructions, their input and output operands, which input types are allowed, how the output type is inferred, etc. We won’t take a look at this today: this is a bit too far from code generation, but this could be material for another blog post.
As an example of a full-blown CLIF generator, there is a crate in the Cranelift project that allows translating from the WebAssembly binary format to CLIF. The Cranelift backend for Rustc uses its own CLIF generator that translates from one of the Rust compiler’s IRs.
Finally, it’s time to reveal what’s going to be our running example! The Chosen One is the iadd
CLIF operation, which allows to add two integers of any length together, with wrapping semantics. It is both simple to understand what it does, and exhibits interesting behaviors on the two architectures we’re interested in. So, let’s continue down the pipeline!
VCode IR
Later on, the CLIF intermediate representation is lowered, i.e. transformed from a high-level one into a lower-level one. Here lower level means a form more specialized for a machine architecture. This lower IR is called VCode in Cranelift. The values it references are called virtual registers (more on the virtual bit below). They’re not in SSA form anymore: each virtual register may be redefined as many times as we want. This IR is used to encode register allocation constraints and it guides machine code generation. As a matter of fact, since this information is tied to the machine code’s representation itself, this IR is also target-specific: there’s one flavor of VCode per each CPU architecture we’re compiling to.
Let’s get back to our example, that we’re going to compile on two instruction set architectures:
- ARM 64-bits (aka aarch64), which is used in most mobile devices but start to become mainstream on laptops (Apple’s Mac M1, some Chromebooks)
- Intel’s x86 64-bits (aka x86_64, also abbreviated x64), which is used in most desktop and laptop machines).
An integer addition machine instruction on aarch64 will take three operands: two input operands (one of which must be a register), and another third output register operand. While on the x86_64 architecture, the equivalent instruction involves a total of two registers: one that is a read-only source register, and another that is an in-out modified register, containing both the second source and the destination register. We’ll get back to this.
So considering iadd
, let’s look at (one of[2]) the VCode instruction that’s used to represent integer additions on aarch64 (as defined in cranelift/codegen/src/isa/aarch64/inst/mod.rs
):
/// An ALU operation with two register sources and a register destination.
AluRRR ,
Some details here:
alu_op
defines the sub-opcode used in the ALU (Arithmetic Logic Unit). It will beAluOp::Add64
for a 64-bits integer addition.rn
andrm
are the conventional aarch64 names for the two input registers.rd
is the destination register. See how it’s marked asWritable
, while the two others are not?Writable
is a plain Rust wrapper that makes sure that we can statically differentiate read-only registers from writable registers; a neat trick that allows us to catch more issues at compile-time.
All this information is directly tied to the machine code representation of an addition instruction on aarch64: each field is later used to select some bytes that will be generated during code generation.
As said before, the VCode is specific to each architecture, so x86_64 has a different VCode representation for the same instruction (as defined in cranelift/codegen/src/isa/x64/inst/mod.rs
):
/// Integer arithmetic/bit-twiddling: (add sub and or xor mul adc? sbb?) (32 64) (reg addr imm) reg
AluRmiR ,
Here, the sub-opcode is defined as part of the AluRmiROpcode
enum (the comment hints at which other x86 machine instructions are generated by this same VCode). See how there’s only one src
(source) register (or memory or immediate operand), while the instruction conceptually takes two inputs? That’s because it’s expected that the dst
(destination) register is modified, that is, both read (so it’s the second input operand) and written to (so it’s the result register). In equivalent C code, the x86’s add instruction doesn’t actually do a = b + c
. What it does is a += b
, that is, one of the sources is consumed by the instruction. This is an artifact inherited from the design of older x86 machines in the 1970’s, when instructions were designed around an accumulator model (and representing efficiently three operands in a CISC architecture would make the encoding larger and harder than it is).
Instruction selection (lowering)
As said before, converting from the high-level IR (CLIF) to the low-level IR (VCode) is called lowering. Since VCode is target-dependent, this process is also target-dependent. That’s where we consider which machine instructions get eventually used for a given CLIF opcode. There are many ways to achieve the same machine state results for given semantics, but some of these ways are faster than other, and/or require fewer code bytes to achieve. The problem can be summed up like this: given some CLIF, which VCode can we create to generate the fastest and/or smallest machine code that carries out the desired semantics? This is called instruction selection, because we’re selecting the VCode instructions among a set of different possible instructions.
How do these IR map to each other? A given CLIF node may be lowered into 1 to N VCode instructions. A given VCode instruction may lead to the code generation of 1 to M machine instructions. There are no rules governing the maximum of entities mapped. For instance, the integer addition CLIF opcode iadd
on 64-bits inputs maps to a single VCode instruction on aarch64. The VCode instruction then causes a single code instruction to be generated.
Other CLIF opcodes may generate more than a single machine instruction eventually. Consider the CLIF opcode for signed integer division idiv
. Its semantics define that it traps for zero inputs and in case of integer overflow[3]. On aarch64, this is lowered into:
- one VCode instruction that checks if the input is zero and trap otherwise
- two VCode instructions for comparing the input values against the minimal integer value and -1
- one VCode instruction to trap if the two input values match what we checked against
- and one VCode instruction that does the actual division operation.
Each of these VCode instruction then generates one or more machine code instructions, resulting in a bit of a longer sequence.
Let’s look at the lowering of iadd
on aarch64 (in cranelift/codegen/src/isa/aarch64/lower_inst.rs
), edited and simplified for clarity. I’ve added comments in the code, explaining what each line does:
=> Iadd
In fact, the alu_inst_imm12
wrapper can create one VCode instruction among a set of possible ones (since we’re trying to select the best one). For the sake of simplicity, we’ll assume that AluRRR
is going to be generated, i.e. the selected instruction is the one using only register encodings for the input values.
Register allocation
VCode, registers and stack slots
Hey, ever wondered what the V in VCode meant? Back to the drawing board. While a program may reference a theoretically unlimited number of instructions, each referencing a theoretically unlimited number of values as inputs and outputs, the physical machine only has a fixed set of containers for those values:
- either they must live in machine registers: very fast to access in the CPU, take some CPU real estate, thus are costly, so there are usually few of them.
- or they must live in the process’ stack memory: it’s slower to access, but we can have virtually any amount of stack slots.
mov %edi,-0x4(%rbp)
mov %rsi,-0x10(%rbp)
mov -0x4(%rbp),%eax
In this example of x86 machine code, %edi, %rsi, %rbp, %eax are all registers; stack slots are memory addresses computed as the frame pointer (%rbp) plus an offset value (which happens to be negative here). Note that stack slots may be referred to by the stack pointer (%rsp) in general.
Defining the register allocation problem
The problem of mapping the IR values (in VCode these are the Reg
) to machine “containers” is called register allocation (aka regalloc). Inputs to register allocation can be as numerous as we want them, and map to “virtual” values, hence we call them virtual registers. And… that’s where the V from VCode comes from: the instructions in VCode reference values that are virtual registers before register allocation, so we say the code is in virtualized register form. The output of register allocation is a set of new instructions, where the virtual registers have been replaced by real registers (the physical ones, limited in quantity) or stack slots references (and other additional metadata).
// Before register allocation, with unlimited virtual registers:
v2 = v0 + v1
v3 = v2 * 2
v4 = v2 + 1
v5 = v4 + v3
return v5
// One possible register allocation, on a machine that has 2 registers %r0, %r1:
%r0 = %r0 + %r1
%r1 = %r0 * 2
%r0 = %r0 + 1
%r1 = %r0 + %r1
return %r1
When all is well, the virtual registers don’t conceptually live at the same time, and they can be put into physical registers. Issues arise when there’s not enough physical registers to contain all the virtual registers that live at the same time, which is the case for… a very large majority of programs. Then, register allocation must decide which registers continue to live in registers at a given program point, and which should be spilled into a stack slot, effectively storing them onto the stack for later use. This later reuse will imply to reload them from the stack slot, using a load machine instruction. The complexity resides in choosing which registers should be spilled, at which program point they should be spilled, and at which program points we should reload them, if we need to do so. Making good choices there will have a large impact on the speed of the generated code, since memory accesses to the stack imply an additional runtime cost. For instance, a variable that’s frequently used in a hot loop should live in a register for the whole loop’s lifetime, and not be spilled/reloaded in the middle of the loop.
// Before register allocation, with unlimited virtual registers:
v2 = v0 + v1
v3 = v0 + v2
v4 = v3 + v1
return v4
// One possible register allocation, on a machine that has 2 registers %r0, %r1.
// We need to spill one value, because there's a point where 3 values are live at the same time!
spill %r1 --> stack_slot(0)
%r1 = %r0 + %r1
%r1 = %r0 + %r1
reload stack_slot(0) --> %r0
%r1 = %r1 + %r0
return %r1
And, since we like to have our cake and eat it too, the register allocator itself should be fast: it should not take an unbounded amount of time to make these allocation decisions. Register allocation has the good taste to be a NP-complete problem. Concretely, this means that implementations cannot find the best solutions given arbitrary inputs, but they’ll estimate good solutions based on heuristics, in worst-case quadratic time over the size of the input. All of this makes it so that register allocation has its own whole research field, and has been extensively studied for some time now. It is a fascinating problem.
Register allocation in Cranelift
Back to Cranelift. The register allocation contract is that if a value must live in a real register at a given program point, then it does live where it should (unless register allocation is impossible). At the start of code generation for a VCode instruction, we are guaranteed that the input values live in real registers, and that the output real register is available before the next VCode instruction.
You might have noticed that the VCode instructions only refer to registers, and not stack slots. But where are the stack slots, then? The trick is that the stack slots are invisible to VCode. Register allocation may create an arbitrary number of spills, reloads, and register moves[4] around VCode instructions, to ensure that their register allocation constraints are met. This is why the output of register allocation is a new list of instructions, that includes not only the initial instructions filled with the actual registers, but also additional spill, reload and move (VCode) instructions added by regalloc.
As said before, this problem is so sufficiently complex, involved and independent from the rest of the code (assuming the right set of interfaces!) that its code lives in a separate crate, regalloc.rs
, with its own fuzzing and testing infrastructure. I hope to shed some light on it at some point too.
What’s interesting to us today is the register allocation constraints. Consider the aarch64 integer add instruction add rd, rn, rm
: rd
is the output virtual register that’s written to, while rn
and rm
are the inputs, thus read from. We need to inform the register allocation algorithm about these constraints. In regalloc jargon, “read to” is known as used, while “written to” is known as defined. Here, the aarch64 VCode instruction AluRRR
does use rn
and rm
, and it defines rd
. This usage information is collected in the aarch64_get_regs
function (cranelift/codegen/src/isa/aarch64/inst/mod.rs
):
Then, after register allocation has assigned the physical registers, we need to instruct it how to replace virtual register mentions by physical register mentions. This is done in the aarch64_map_regs
function (same file as above):
Note this is reflecting quite precisely what the usage collector did: we’re replacing the virtual register mention for the defined register rd
with the information (which real register) provided by the RegUsageMapper
. These two functions must stay in sync, otherwise here be dragons! (and bugs very hard to debug!)
Register allocation on x86
On Intel’s x86, register allocation may be a bit trickier: in some cases, the lowering needs to be carefully written so it satisfies some register allocation constraints that are very specific to this architecture. In particular, x86 has fixed register constraints as well as tied operands.
For this specific part, we’ll look at the integer shift-left instruction, which is equivalent to C’s x << y
. Why this particular instruction? It exhibits both properties that we’re interested in studying here. The lowering of iadd
is similar, albeit slightly simpler, as it only involves tied operands.
Fixed register constraints
On the one hand, some instructions expect their inputs to be in fixed registers, that is, specific registers arbitrarily predefined by the architecture manual. For the example of the shift instruction, if the count is not statically known at compile time (it’s not a shift by a constant value), then the amount by which we’re shifting must be in the rcx
register[5].
Now, how do we make sure that the input value actually is in rcx
? We can mark rcx
as used in the get_regs
function so regalloc knows about this, but nothing ensures that the input resides in it at the beginning of the instruction. To resolve this, we’ll introduce a move instruction during lowering, that is going to copy the input value into rcx
. Then we’re sure it lives there, and register allocation knows it’s used: we’re good to go!
In a nutshell, this shows how lowering and register allocation play together:
- during lowering, we introduce a move from a dynamic shift input value to
rcx
before the actual shift - in the register usage function, we mark
rcx
as used - (nothing to do in the register mapping function:
rcx
is a real register already)
Tied operands
On the other hand, some instructions have operands that are both read and written at the same time: we call them modified in Cranelift and regalloc.rs, but they’re also known as tied operands in the compiler literature. It’s not just that there’s a register that must be read, and a register that must be written to: they must be the same register. How do we model this, then?
Consider a naive solution. We take the input virtual register, and decide it’s allocated to the same register as the output (modified) register. Unfortunately, if the chosen virtual register was going to be reused by another later VCode instruction, then its value would be overwritten (clobbered) by the current instruction. This would result in incorrect code being generated, so this is not acceptable. In general we can’t clobber the value that was in an input value during lowering, because that’s the role of regalloc to make this kind of decisions.
// Before register allocation, with virtual registers:
v2 = v0 + v1
v3 = v0 + 42
// After register allocation, on a machine with two registers %r0 and %r1:
// assign v0 to %r0, v1 to %r1, v2 to %r0
%r0 += v1
... = %r0 + 42 // ohnoes! the value in %r0 is v2, not v0 anymore!
The right solution is, again, to copy this input virtual register into the output virtual register, right before the instruction. This way, we can still reuse the untouched input register in other instructions without modifying it: only the copy is written to.
Pfew! We can now look at the entire lowering for the shift left instruction, edited and commented for clarity:
// Read the instruction operand size from the output's type.
let size = dst_ty.bytes as u8;
// Put the left hand side into a virtual register.
let lhs = put_input_in_reg;
// Put the right hand side (shift amount) into either an immediate (if it's
// statically known at compile time), or into a virtual register.
let =
if let Some = ctx.get_input_as_source_or_const.constant else ;
// Get the destination virtual register.
let dst = get_output_reg.only_reg.unwrap;
// Copy the left hand side into the (modified) output operand, to satisfy the
// mod constraint.
ctx.emit;
// If the shift count is statically known: nothing particular to do. Otherwise,
// we need to put it in the RCX register.
if count.is_none
// Generate the actual shift instruction.
ctx.emit;
And this is how we tell the register usage collector about our constraints:
=> ShiftR
Only the modified register needs to be mapped to its allocated physical register:
=> ShiftR
Virtual registers copies and performance
Do these virtual register copies sound costly to you? In theory, they could lead to the code generation of a move instructions, increasing the size of the code generated and causing a small runtime cost. In practice, register allocation, through its interface, knows how to identify move instructions, their source and their destination. By analyzing them, it can see when a source isn’t used after a given move instruction, and thus allocate the same register for the source and the destination of the move. Then, when Cranelift generates the code, it will avoid generating a move from a physical register to the same one[6]. As a matter of fact, creating a VCode copy doesn’t necessarily mean that it will generate a machine code move instruction later: it is present just in case regalloc needs it, but it can be avoided when it’s spurious.
Code generation
Oh my, we’re getting closer to actually being able to run the code! Once register allocation has run, we can generate the actual machine code for the VCode instructions. Cool kids call this step of the pipeline codegen, for code generation. This is the part where we decipher the architecture manuals provided by the CPU vendors, and generate the raw machine bytes for our machine instructions. In Cranelift, this means filling a code buffer (there’s a MachBuffer
sink interface for this!), returned along some internal relocations[7] and additional metadata. Let’s see what happens for our integer addition, when the times come to generate the code for its VCode equivalent AluRRR
on aarch64
(in cranelift/codegen/src/isa/aarch64/inst/emit.rs
):
// We match on the VCode's identity here:
& AluRRR =>
And what’s this enc_arith_rrr
doing, then?
Encoding the instruction parts (operands, register mentions) is a lot of bit twiddling and fun. We do so for each VCode instruction, until we’ve generated the whole function’s body. If you remember correctly, at this point register allocation may have added some spills/reloads/move instructions. From the codegen’s point of view, these are just regular instructions with precomputed operands (either real registers, or memory operands involving the stack pointer), so they’re not treated particularly and they’re just generated the same way other VCode instructions are.
More work is done by the codegen backend then, to optimize blocks placement, compute final branch offsets, etc. If you’re interested by this, I strongly encourage you to go read this blog post by Chris Fallin. After this, we’re finally done: we’ve produced a code buffer, as well as external relocations (to other functions, memory addresses, etc.) for a single function. The code generator’s task is complete: the final steps consist in linking and, optionally, producing an executable binary.
Mission accomplished!
So, we’re done for today! Thanks for reading this far, hope it has been a useful and pleasant read to you! Feel free to reach out to me on the twitterz if you have additional remarks/questions, and to go contribute on Wasmtime/Cranelift if this sort of things is interesting to you 😇. Until next time, take care of yourselves!
Thanks to Chris Fallin for reading and suggesting improvements to this blog post.
-
Really, Rust is the DSL. It was Python code before, that had the advantage to be faster to update. Yet it was doing a lot of magic behind the curtain, which wasn’t very friendly for new people trying to learn and use Cranelift. Despite a statically typed language helping for exploration through tooling, this meta-language is to partially disappear in the long run, see Chris’ blog post on this topic. ↩
-
Aarch64 connoisseurs may notice that there are other ways to encode an addition. Say, if one of the input operands was the result of a bit shift instruction by an immediate value, then it’s possible to embed the shift within the add, so we end up with fewer machine instructions (and lower the register pressure). This other possible encoding is sufficiently different in terms of register allocation and code generation that it justifies having its own VCode instruction.
AluRRR
is simpler in the sense that it’s only concerned with register inputs and outputs, thus a perfect example for this post. ↩ -
What’s an integer overflow for signed integer division? Consider an integer value represented on
N
bits. If you try to divide the smallest integer value-2**N
by-1
, it should return2**N
, but this is out of range, since the biggest signed integer value we can represent onN
bits is(2**N) - 1
! So this will overflow and be set to-2**N
, which is the initial value, but not the correct result. Good luck debugging this without a software trap! ↩ -
Register moves may be introduced because a successor block (in the control flow graph) expects a given virtual register to live in a particular real register, or because a particular instruction requires a virtual register to be allocated to a fixed real register that’s busy: regalloc can then temporarily divert the busy register into another unused register. ↩
-
The
c
inrcx
actually stands forcount
; this is a property inherited from former CPU designs. ↩ -
Unless this move carries sign- or zero-extending semantics, which is the case for e.g. x86’s 32-bits
mov
instructions on a 64-bits architecture. ↩ -
Relocations are placeholders for information we don’t have yet access to. For instance, when we’re generating jump instructions, the jump targets offsets are not determined yet. So we record where the jump instruction is in the code stream, as well as which control flow block it should jump into, so we can patch it later when the final offsets are known: that’s the content of our relocation. ↩