Part 1: Lexing

Published on Apr 13th, 2021, in build a programming language, rust. 3 minute read.

This post is part of a series about learning Rust and building a small programming language.

The first part of the language I’ve built is the lexer. It takes the program text as input and produces a vector of tokens. Tokens are the individual units that the parser will work with, rather than it having to work directly with characters. A token could be a bunch of different things. It could be a literal value (like a number or string), or it could be an identifier, or a specific symbol (like a plus sign).

I’ve decided to represent tokens themselves as an enum because there are a bunch of cases without any data (e.g., a plus sign is always just a plus sign) and some with (e.g., a number literal token has a value).

When I was reading the Rust book, I was excited to see Rust enums have associated values. Enums with associated data are one of my favorite features of Swift, and I’m glad to see Rust has them too.

#[derive(Debug)]
enum Token {
	Integer(i64),
	Plus,
}

For now, I’m only starting with integer literals and the plus sign. More tokens will come once I’ve got basic lexing and parsing in place.
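To illustrate how the two kinds of cases differ in use, here is a minimal sketch of pattern matching over the enum. The describe function is a hypothetical helper that is not part of the lexer, and the Integer payload is assumed to be an i64 so that it matches what parse_number returns later in this post.

```rust
#[derive(Debug)]
enum Token {
	Integer(i64),
	Plus,
}

// Hypothetical helper, not part of the lexer: describes a token in words.
fn describe(token: &Token) -> String {
	match token {
		// The associated value is bound to `n` by the pattern.
		Token::Integer(n) => format!("integer literal {}", n),
		// Cases without data match on the case name alone.
		Token::Plus => String::from("plus sign"),
	}
}

fn main() {
	println!("{}", describe(&Token::Integer(12))); // prints "integer literal 12"
	println!("{}", describe(&Token::Plus)); // prints "plus sign"
}
```

Because match is exhaustive, adding a new token case later makes the compiler point at every match that needs updating.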

Most of the work of turning a string into tokens is done in the (drumroll please…) tokenize function.

It creates an initially empty vector of tokens, and starts iterating through characters in the string.

Single character tokens like plus are the easiest to handle. If the current character matches that of a token, consume the character (by advancing the iterator) and add the appropriate token.

fn tokenize(s: &str) -> Vec<Token> {
	let mut tokens: Vec<Token> = Vec::new();

	let mut it = s.chars().peekable();
	while let Some(c) = it.peek() {
		if *c == '+' {
			it.next();
			tokens.push(Token::Plus);
		}
		// Any other character would loop forever here; more branches are added below.
	}

	tokens
}

Already I’ve encountered a Rust thing. Inside the while loop, c is a reference to a char, so in order to check its value, you have to dereference it. I had expected that you’d be able to compare a value of some type to a reference of that same type (with the language deref’ing the reference automatically), but I can see how forcing the programmer to be explicit about it makes sense.
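As a small sketch of the options here: peek() returns an Option<&char>, and there are a few equivalent ways to compare the peeked value against a char. The starts_with_plus function is a hypothetical helper written only for this illustration.

```rust
// Hypothetical helper: checks the first character three equivalent ways.
fn starts_with_plus(s: &str) -> bool {
	let mut it = s.chars().peekable();

	// peek() returns Option<&char>, so `c` is a reference: dereference it...
	let by_deref = matches!(it.peek(), Some(c) if *c == '+');
	// ...or compare a reference with a reference...
	let by_ref = matches!(it.peek(), Some(c) if c == &'+');
	// ...or use a `&c` pattern to copy the char out of the reference.
	let by_pattern = matches!(it.peek(), Some(&c) if c == '+');

	by_deref && by_ref && by_pattern
}

fn main() {
	println!("{}", starts_with_plus("+ 1")); // prints "true"
	println!("{}", starts_with_plus("1 +")); // prints "false"
}
```

All three forms agree; which one reads best is mostly a matter of taste.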

Next, to parse number literals, I check if the current character is in the digit range:

use std::ops::RangeInclusive;

const DIGITS: RangeInclusive<char> = '0'..='9';

fn tokenize(s: &str) -> Vec<Token> {
	// ...
	while let Some(c) = it.peek() {
		// ...
		} else if DIGITS.contains(c) {
			let n = parse_number(&mut it).unwrap();
			tokens.push(Token::Integer(n));
		}
	}
}

You may note that even though the integer token takes a signed integer, I’m totally ignoring the possibility of negative number literals. That’s because they’ll be implemented in the parser along with the unary minus operator.

If the character is indeed a digit, a separate function is called to parse the entire number. This is the first thing that I’ve encountered for which mutable borrows are quite nice. The parse_number function operates on the same data as the tokenize function: it needs to start wherever tokenize left off, and it needs to tell tokenize how much it advanced by. Mutably borrowing the iterator has exactly these properties.

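fn parse_number<T: Iterator<Item = char>>(it: &mut T) -> Option<i64> {
	// Pull characters from the shared iterator for as long as they are digits.
	// Note: take_while also consumes the first non-digit it tests.
	let digits = it.take_while(|c| DIGITS.contains(c));
	let s: String = digits.collect();
	// An empty or out-of-range digit string yields None.
	s.parse().ok()
}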

Writing the declaration of parse_number was a bit rough, I’ll admit. I haven’t read the generics chapter of the book yet, so I stumbled through a number of compiler error messages (which are very detailed!) until I arrived at this. Having emerged victorious, what I ended up with makes sense though. The specific type of the iterator doesn’t matter, but it still needs to be known at the call site during compilation.

The actual implementation takes as many digits as there are from the iterator, turns them into a string, and parses it into an integer. It returns an optional because the parse method could fail if there were no digit characters at the beginning of the string (this will never happen in the one case I’m calling it, but in case it’s reused in the future and I forget…).
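One thing worth knowing about take_while is that it has to pull a character out of the iterator before it can test it, so the first non-digit is consumed along with the digits. The sketch below compares the version above with a hypothetical parse_number_peeking variant (not from the original code) that assumes the caller passes a Peekable and uses Peekable::next_if to consume a character only once it is known to be a digit.

```rust
use std::iter::Peekable;
use std::ops::RangeInclusive;

const DIGITS: RangeInclusive<char> = '0'..='9';

// The take_while version from the post, for comparison.
fn parse_number<T: Iterator<Item = char>>(it: &mut T) -> Option<i64> {
	let digits = it.take_while(|c| DIGITS.contains(c));
	let s: String = digits.collect();
	s.parse().ok()
}

// Hypothetical variant: next_if only advances when the predicate holds,
// so the character that ends the number stays in the iterator.
fn parse_number_peeking<T: Iterator<Item = char>>(it: &mut Peekable<T>) -> Option<i64> {
	let mut s = String::new();
	while let Some(c) = it.next_if(|c| DIGITS.contains(c)) {
		s.push(c);
	}
	s.parse().ok()
}

fn main() {
	let mut a = "12+34".chars().peekable();
	// prints "Some(12), then Some('3')": the '+' was swallowed
	println!("{:?}, then {:?}", parse_number(&mut a), a.next());

	let mut b = "12+34".chars().peekable();
	// prints "Some(12), then Some('+')": the '+' is still available
	println!("{:?}, then {:?}", parse_number_peeking(&mut b), b.next());
}
```

With input like "12 + 34" the swallowed character is a space, so the difference is invisible; it only shows up when a token directly follows the digits.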

Lastly, the tokenize function also ignores whitespace just by calling it.next() whenever it encounters a whitespace char.
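Putting the pieces together, the whole lexer might look roughly like the sketch below. The enum, DIGITS, and parse_number are taken from the snippets above, with the Integer payload assumed to be an i64. The whitespace branch uses char::is_whitespace, and the final else branch that panics on an unrecognized character is purely an assumption, since the post does not describe that case.

```rust
use std::ops::RangeInclusive;

#[derive(Debug)]
enum Token {
	Integer(i64),
	Plus,
}

const DIGITS: RangeInclusive<char> = '0'..='9';

// Copied from the post unchanged.
fn parse_number<T: Iterator<Item = char>>(it: &mut T) -> Option<i64> {
	let digits = it.take_while(|c| DIGITS.contains(c));
	let s: String = digits.collect();
	s.parse().ok()
}

fn tokenize(s: &str) -> Vec<Token> {
	let mut tokens: Vec<Token> = Vec::new();

	let mut it = s.chars().peekable();
	while let Some(c) = it.peek() {
		if *c == '+' {
			it.next();
			tokens.push(Token::Plus);
		} else if DIGITS.contains(c) {
			let n = parse_number(&mut it).unwrap();
			tokens.push(Token::Integer(n));
		} else if c.is_whitespace() {
			// Whitespace produces no token; just advance past it.
			it.next();
		} else {
			// Assumption: the post does not say what happens here.
			panic!("unexpected character: {}", c);
		}
	}

	tokens
}

fn main() {
	println!("{:?}", tokenize("12 + 34")); // prints "[Integer(12), Plus, Integer(34)]"
}
```

Every branch either advances the iterator or panics, so the loop always makes progress.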

And with that, we can tokenize simple inputs:

fn main() {
	println!("tokens: {:?}", tokenize("12 + 34"));
}

$ cargo run
tokens: [Integer(12), Plus, Integer(34)]