Boba Devlog #0: Compiling Pseudo-Rust to x86 Assembly

2023-Jun-30

Boba is a compiler I’m writing to learn how high-level language constructs can be expressed in assembly. It compiles Rust-like source code to 64-bit x86 instruction set. This post will describe the current status of the project.

1 Language Overview

The language’s syntax looks like Rust, but semantically its a lot like C with one big difference: here, the compiler does not care about declaration order in the global scope, so functions can be called before they are defined in the source code.

// Example program that can be compiled with Boba
fn main() -> i64 {
    println(factorial(5));
    return 0;
}

fn factorial(num: i64) -> i64 {
    if num <= 1 {
        return 1;
    } else {
        return num * factorial(num - 1);
    }
}

2 Compiler Architecture

The compiler follows a traditional multi-pass architecture:

  1. Scanner walks through every character in the source code and builds a list of tokens.
  2. Parser builds a high-level AST by parsing the tokens with a recursive descent LL(1) parser. This step is very similar to parsing Lox in Crafting Interpreters.
  3. Analyzer performs a bunch of static checks to verify that all the variables and functions in use are defined. Also, local variable names are resolved to their offset from the base register pointer %rbp. This step returns a low-level AST iff there are no semantic errors in the source code.
  4. Code generator iterates over every node in the low-level AST and translates it to assembly.

The high-level AST has token information which can be used to point an error message to a particular line and column number in the source code but the low-level AST gets rid of all token information and only knows about labels and offset from base pointer. The generated assembly is linked with the C standard library and converted to an ELF executable with the help of gcc. This allowed me to implement println() as a built-in function that calls printf(). Also, in theory, the generated code can call any C function because it follows the System V calling convention but I have not tested it yet.

3 Future Work

There aren’t any optimization passes, so the generated code is verbose and inefficient. For example, a simple local variable declaration let x = 10 gets translated into:

movq  $10, %rbx
movq  %rbx, -8(%rbp)

This is because the code generator generates code for every node in the low-level AST. It has no way to know that those two instructions are related and can be combined into one. Also, there is no type system as 64-bit signed number is the only data type supported by the compiler.

4 Reference

Chapters 10 - 12 in Introduction to Compilers and Language Design by Douglas Thain gave me a good idea about assembly language, code generation and optimization.