Boba Devlog #0: Compiling Pseudo-Rust to x86 Assembly
2023-Jun-30
Boba is a compiler I’m writing to learn how high-level language constructs can be expressed in assembly. It compiles Rust-like source code to 64-bit x86 instruction set. This post will describe the current status of the project.
1 Language Overview
The language’s syntax looks like Rust, but semantically its a lot like C with one big difference: here, the compiler does not care about declaration order in the global scope, so functions can be called before they are defined in the source code.
// Example program that can be compiled with Boba fn main() -> i64 { println(factorial(5)); return 0; } fn factorial(num: i64) -> i64 { if num <= 1 { return 1; } else { return num * factorial(num - 1); } }
2 Compiler Architecture
The compiler follows a traditional multi-pass architecture:
- Scanner walks through every character in the source code and builds a list of tokens.
- Parser builds a high-level AST by parsing the tokens with a recursive descent LL(1) parser. This step is very similar to parsing Lox in Crafting Interpreters.
- Analyzer performs a bunch of static checks to verify that all the variables and functions in use are defined. Also, local variable names are resolved to their offset from the base register pointer
%rbp
. This step returns a low-level AST iff there are no semantic errors in the source code. - Code generator iterates over every node in the low-level AST and translates it to assembly.
The high-level AST has token information which can be used to point an error message to a particular line and column number in the source code but the low-level AST gets rid of all token information and only knows about labels and offset from base pointer. The generated assembly is linked with the C standard library and converted to an ELF executable with the help of gcc
. This allowed me to implement println()
as a built-in function that calls printf()
. Also, in theory, the generated code can call any C function because it follows the System V calling convention but I have not tested it yet.
3 Future Work
There aren’t any optimization passes, so the generated code is verbose and inefficient. For example, a simple local variable declaration let x = 10
gets translated into:
movq $10, %rbx movq %rbx, -8(%rbp)
This is because the code generator generates code for every node in the low-level AST. It has no way to know that those two instructions are related and can be combined into one. Also, there is no type system as 64-bit signed number is the only data type supported by the compiler.
4 Reference
Chapters 10 - 12 in Introduction to Compilers and Language Design by Douglas Thain gave me a good idea about assembly language, code generation and optimization.