Compiler Architecture

Interpreters are execution engines that translate and run source code on the fly. Unlike compilers, which translate entire source files into machine code or bytecode before execution, interpreters process code line by line or statement by statement, executing it immediately. This makes interpreters particularly suitable for dynamically typed languages and for settings that demand rapid feedback, such as scripting and interactive development.

# Traditional Compiler Architecture

The architecture of a compiler typically comprises several key components, each responsible for a distinct stage of the translation process. Let's explore these components in detail:

## 1. Lexer (Tokenizer)

The first step is scanning, also known as lexing.

At this initial stage, the source code is broken down into smaller units called tokens. This process, performed by the lexer or tokenizer, involves scanning the source code character by character and grouping characters into meaningful tokens such as identifiers, keywords, literals, and punctuation symbols. These tokens serve as the basic building blocks for the subsequent stages of the pipeline.

Input (Source Code):

```c
#include <stdio.h>

int main() {
    int x = 10;
    int y = 20;
    int result = x + y;
    printf("Result: %d\n", result);
    return 0;
}
```

Output (Tokens):

```
Token: KEYWORD, Lexeme: 'int'
Token: IDENTIFIER, Lexeme: 'main'
Token: LPAREN, Lexeme: '('
Token: RPAREN, Lexeme: ')'
Token: LBRACE, Lexeme: '{'
Token: KEYWORD, Lexeme: 'int'
Token: IDENTIFIER, Lexeme: 'x'
Token: ASSIGN, Lexeme: '='
Token: INTEGER, Lexeme: '10'
Token: SEMICOLON, Lexeme: ';'
Token: KEYWORD, Lexeme: 'int'
Token: IDENTIFIER, Lexeme: 'y'
Token: ASSIGN, Lexeme: '='
Token: INTEGER, Lexeme: '20'
Token: SEMICOLON, Lexeme: ';'
Token: KEYWORD, Lexeme: 'int'
Token: IDENTIFIER, Lexeme: 'result'
Token: ASSIGN, Lexeme: '='
Token: IDENTIFIER, Lexeme: 'x'
Token: PLUS, Lexeme: '+'
Token: IDENTIFIER, Lexeme: 'y'
Token: SEMICOLON, Lexeme: ';'
Token: IDENTIFIER, Lexeme: 'printf'
Token: LPAREN, Lexeme: '('
Token: STRING_LITERAL, Lexeme: '"Result: %d\n"'
Token: COMMA, Lexeme: ','
Token: IDENTIFIER, Lexeme: 'result'
Token: RPAREN, Lexeme: ')'
Token: SEMICOLON, Lexeme: ';'
Token: KEYWORD, Lexeme: 'return'
Token: INTEGER, Lexeme: '0'
Token: SEMICOLON, Lexeme: ';'
Token: RBRACE, Lexeme: '}'
```
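
(The `#include <stdio.h>` line produces no tokens here because it is consumed by the C preprocessor before lexical analysis.) As a rough illustration of how such a scanner works, here is a minimal sketch in C that tokenizes only a tiny subset of the language: identifiers, keywords, integer literals, and single-character punctuation. The token names and the `emit` helper are invented for this sketch, not taken from any particular compiler.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
typedef enum { TK_KEYWORD, TK_IDENTIFIER, TK_INTEGER, TK_PUNCT, TK_EOF } TokenKind;

static const char *keywords[] = { "int", "return", NULL };

static int is_keyword(const char *start, int length) {
    for (int i = 0; keywords[i]; i++)
        if ((int)strlen(keywords[i]) == length && strncmp(start, keywords[i], length) == 0)
            return 1;
    return 0;
}

static void emit(TokenKind kind, const char *start, int length) {
    static const char *names[] = { "KEYWORD", "IDENTIFIER", "INTEGER", "PUNCT", "EOF" };
    printf("Token: %s, Lexeme: '%.*s'\n", names[kind], length, start);
}

/* Scan a source string and print one token per lexeme. */
static void scan(const char *src) {
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {      /* identifier or keyword */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            int len = (int)(p - start);
            emit(is_keyword(start, len) ? TK_KEYWORD : TK_IDENTIFIER, start, len);
        } else if (isdigit((unsigned char)*p)) {             /* integer literal */
            while (isdigit((unsigned char)*p)) p++;
            emit(TK_INTEGER, start, (int)(p - start));
        } else {                                             /* single-character punctuation */
            p++;
            emit(TK_PUNCT, start, 1);
        }
    }
    emit(TK_EOF, "", 0);
}

int main(void) {
    scan("int x = 10; int y = 20; int result = x + y;");
    return 0;
}
```

A real lexer would distinguish `=`, `;`, and `+` with dedicated token kinds, handle string literals and comments, and track line numbers for error reporting, but the loop structure is the same: skip whitespace, look at the next character, and consume a whole lexeme.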

## 2. Parser

Once the source code has been tokenized, the parser analyzes the structure of the code according to the grammar rules of the programming language. It constructs a hierarchical representation of the code known as an abstract syntax tree (AST). The AST captures the syntactic structure of the program, organizing tokens into nodes that represent expressions, statements, and other language constructs.

A parser takes the flat sequence of tokens and builds a tree structure that mirrors the nested nature of the grammar. These trees have a couple of different names—parse tree or abstract syntax tree—depending on how close to the bare syntactic structure of the source language they are. In practice, language hackers usually call them syntax trees, ASTs, or often just trees.

Input:

```
"3 + 4 * (5 - 2)"
```

Output (Abstract Syntax Tree):

```
      +
     / \
    3   *
       / \
      4   -
         / \
        5   2
```

In this example, the input is the mathematical expression "3 + 4 * (5 - 2)": the addition of 3 and the result of multiplying 4 by the parenthesized subtraction of 2 from 5. The shape of the tree reflects both operator precedence and the grouping imposed by the parentheses.
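
To make this concrete, here is a minimal recursive-descent parser in C for a small grammar of numbers, `+`, `-`, `*`, `/`, and parentheses. It is only a sketch: the `Node` structure and the function names are invented for this example, error handling is omitted, and the tree is printed in fully parenthesized form rather than drawn.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* A tiny AST: leaves hold a number, interior nodes hold an operator. */
typedef struct Node {
    char op;              /* '+', '-', '*', '/' or 0 for a number leaf */
    double value;
    struct Node *left, *right;
} Node;

static const char *src;   /* cursor into the expression being parsed */

static Node *new_node(char op, double value, Node *l, Node *r) {
    Node *n = malloc(sizeof(Node));
    n->op = op; n->value = value; n->left = l; n->right = r;
    return n;
}

static void skip_spaces(void) { while (*src == ' ') src++; }

static Node *parse_expr(void);   /* forward declaration */

/* primary -> NUMBER | "(" expr ")" */
static Node *parse_primary(void) {
    skip_spaces();
    if (*src == '(') {
        src++;                       /* consume '(' */
        Node *n = parse_expr();
        skip_spaces();
        src++;                       /* consume ')' */
        return n;
    }
    return new_node(0, strtod(src, (char **)&src), NULL, NULL);
}

/* term -> primary ( ("*" | "/") primary )* */
static Node *parse_term(void) {
    Node *n = parse_primary();
    for (skip_spaces(); *src == '*' || *src == '/'; skip_spaces()) {
        char op = *src++;
        n = new_node(op, 0, n, parse_primary());
    }
    return n;
}

/* expr -> term ( ("+" | "-") term )* */
static Node *parse_expr(void) {
    Node *n = parse_term();
    for (skip_spaces(); *src == '+' || *src == '-'; skip_spaces()) {
        char op = *src++;
        n = new_node(op, 0, n, parse_term());
    }
    return n;
}

/* Print the tree in fully parenthesized form to show its structure. */
static void print_tree(const Node *n) {
    if (n->op == 0) { printf("%g", n->value); return; }
    printf("(");
    print_tree(n->left);
    printf(" %c ", n->op);
    print_tree(n->right);
    printf(")");
}

int main(void) {
    src = "3 + 4 * (5 - 2)";
    Node *tree = parse_expr();
    print_tree(tree);              /* prints: (3 + (4 * (5 - 2))) */
    printf("\n");
    return 0;
}
```

Notice how the grammar's precedence levels become function calls: `parse_expr` defers to `parse_term`, which defers to `parse_primary`, so multiplication binds tighter than addition and parentheses override both.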

## 3. Semantic Analyzer

The first two stages are pretty similar across all implementations. Now, the individual characteristics of each language start coming into play. At this point, we know the syntactic structure of the code—things like which expressions are nested in which—but we don’t know much more than that.

In an expression like a + b, we know we are adding a and b, but we don’t know what those names refer to. Are they local variables? Global? Where are they defined?

After parsing, the compiler or interpreter performs semantic analysis to validate the code and enforce language-specific rules. This stage checks the code for correctness and consistency, such as verifying variable declarations, type compatibility, scoping rules, and other semantic constraints. Any errors or inconsistencies detected during this stage are reported to the user as semantic errors or warnings.

The first bit of analysis that most languages do is called binding or resolution. For each identifier, we find out where that name is defined and wire the two together. This is where scope comes into play—the region of source code where a certain name can be used to refer to a certain declaration.

If the language is statically typed, this is when we type check. Once we know where a and b are declared, we can also figure out their types. Then if those types don’t support being added to each other, we report a type error.
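
Here is a sketch in C of what resolution and type checking might look like for `a + b`, assuming a toy one-scope symbol table; the `Symbol` structure, the type names, and the error messages are all invented for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* A toy symbol table: each declared name is bound to a type. */
typedef enum { TYPE_INT, TYPE_STRING } Type;

typedef struct {
    const char *name;
    Type type;
} Symbol;

static Symbol symbols[] = {
    { "a", TYPE_INT },
    { "b", TYPE_STRING },
};
static const int symbol_count = 2;

/* Resolution: find the declaration an identifier refers to. */
static Symbol *resolve(const char *name) {
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(symbols[i].name, name) == 0) return &symbols[i];
    return NULL;   /* an "undefined variable" error in a real analyzer */
}

/* Type check an expression of the form "left + right". */
static void check_add(const char *left, const char *right) {
    Symbol *l = resolve(left), *r = resolve(right);
    if (!l || !r) { printf("error: undefined variable\n"); return; }
    if (l->type != TYPE_INT || r->type != TYPE_INT)
        printf("type error: cannot add '%s' and '%s'\n", left, right);
    else
        printf("ok: '%s + %s' is well typed\n", left, right);
}

int main(void) {
    check_add("a", "b");   /* reports a type error: a is an int, b is a string */
    return 0;
}
```

A real analyzer walks the whole AST, maintains a stack of nested scopes rather than one flat table, and records the resolved declaration and type on each node for later stages to use.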

Everything up to this point is considered the front end of the implementation. You might guess everything after this is the back end, but no. Back in the days of yore when “front end” and “back end” were coined, compilers were much simpler. Later researchers invented new phases to stuff between the two halves. Rather than discard the old terms, William Wulf and company lumped those new phases into the charming but spatially paradoxical name middle end.

## 4. Intermediate Representation (IR)

You can think of the compiler as a pipeline where each stage’s job is to organize the data representing the user’s code in a way that makes the next stage simpler to implement. The front end of the pipeline is specific to the source language the program is written in. The back end is concerned with the final architecture where the program will run.

In the middle, the code may be stored in some intermediate representation (IR) that isn’t tightly tied to either the source or destination forms (hence “intermediate”). Instead, the IR acts as an interface between these two languages.

There are a few well-established styles of IRs out there. Hit your search engine of choice and look for “control flow graph”, “static single-assignment”, “continuation-passing style”, and “three-address code”.

This lets you support multiple source languages and target platforms with less effort. Say you want to implement Pascal, C, and Fortran compilers, and you want to target x86, ARM, and, I dunno, SPARC. Normally, that means you’re signing up to write nine full compilers: Pascal→x86, C→ARM, and every other combination.

A shared intermediate representation reduces that dramatically. You write one front end for each source language that produces the IR. Then one back end for each target architecture. Now you can mix and match those to get every combination.

In some interpreter architectures, an intermediate representation (IR) is generated from the AST. The IR is a simplified, platform-independent representation of the code that facilitates optimization and code generation. It typically consists of a data structure or set of instructions that capture the essence of the code's semantics without being tied to the specifics of any particular hardware platform.
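
As a small illustration, here is one way the earlier expression "3 + 4 * (5 - 2)" could be lowered to three-address code, using an instruction structure and temporary names invented for this sketch.

```c
#include <stdio.h>

/* One three-address instruction: dest = lhs op rhs. */
typedef struct {
    const char *dest;
    const char *lhs;
    char op;          /* '+', '-', or '*' */
    const char *rhs;
} TacInstr;

int main(void) {
    /* "3 + 4 * (5 - 2)" lowered by hand into temporaries t1..t3. */
    TacInstr program[] = {
        { "t1", "5", '-', "2"  },   /* t1 = 5 - 2 */
        { "t2", "4", '*', "t1" },   /* t2 = 4 * t1 */
        { "t3", "3", '+', "t2" },   /* t3 = 3 + t2 */
    };
    for (int i = 0; i < 3; i++)
        printf("%s = %s %c %s\n", program[i].dest, program[i].lhs,
               program[i].op, program[i].rhs);
    return 0;
}
```

Each instruction has at most one operator and one destination, which is what makes this form easy for later optimization and code generation passes to analyze.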

## 5. Optimization

Once we understand what the user’s program means, we are free to swap it out with a different program that has the same semantics but implements them more efficiently—we can optimize it.

A simple example is constant folding: if some expression always evaluates to the exact same value, we can do the evaluation at compile time and replace the code for the expression with its result. If the user typed in this:

```
pennyArea = 3.14159 * (0.75 / 2) * (0.75 / 2);
```

we could do all of that arithmetic in the compiler and change the code to:

```
pennyArea = 0.4417860938;
```
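
A minimal constant folder over a tree of constant-only arithmetic nodes might look like the following sketch in C; the `Expr` structure is invented for the example, and the hand-built tree mirrors the pennyArea expression above.

```c
#include <stdio.h>

/* A constant-only expression node: either a literal or op(left, right). */
typedef struct Expr {
    char op;                   /* 0 for a literal */
    double value;
    struct Expr *left, *right;
} Expr;

/* Fold a constant expression down to a single number at compile time. */
static double fold(const Expr *e) {
    if (e->op == 0) return e->value;
    double l = fold(e->left), r = fold(e->right);
    switch (e->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;   /* '/' */
    }
}

int main(void) {
    /* 3.14159 * (0.75 / 2) * (0.75 / 2), built by hand for the sketch. */
    Expr pi     = { 0, 3.14159, NULL, NULL };
    Expr p75    = { 0, 0.75,    NULL, NULL };
    Expr two    = { 0, 2.0,     NULL, NULL };
    Expr half   = { '/', 0, &p75, &two };
    Expr first  = { '*', 0, &pi,  &half };
    Expr folded = { '*', 0, &first, &half };
    printf("pennyArea = %.10f\n", fold(&folded));   /* 0.4417860938 */
    return 0;
}
```

A real optimizer would fold only subtrees whose operands are all constants and leave the rest of the tree alone, but the recursive evaluate-and-replace idea is the same.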

Optimization is a huge part of the programming language business. Many language hackers spend their entire careers here, squeezing every drop of performance they can out of their compilers to get their benchmarks a fraction of a percent faster. It can become a sort of obsession.

## 6. Code Generation

We have applied all of the optimizations we can think of to the user’s program. The last step is converting it to a form the machine can actually run. In other words, generating code (or code gen), where “code” here usually refers to the kind of primitive assembly-like instructions a CPU runs and not the kind of “source code” a human might want to read.

Finally, we are in the back end, descending the other side of the mountain. From here on out, our representation of the code becomes more and more primitive, like evolution run in reverse, as we get closer to something our simple-minded machine can understand.

We have a decision to make. Do we generate instructions for a real CPU or a virtual one? If we generate real machine code, we get an executable that the OS can load directly onto the chip. Native code is lightning fast, but generating it is a lot of work. Today’s architectures have piles of instructions, complex pipelines, and enough historical baggage to fill a 747’s luggage bay.

Speaking the chip’s language also means your compiler is tied to a specific architecture. If your compiler targets x86 machine code, it’s not going to run on an ARM device. All the way back in the ’60s, during the Cambrian explosion of computer architectures, that lack of portability was a real obstacle.

To get around that, hackers like Martin Richards and Niklaus Wirth, of BCPL and Pascal fame, respectively, made their compilers produce virtual machine code. Instead of instructions for some real chip, they produced code for a hypothetical, idealized machine. Wirth called this p-code for portable, but today, we generally call it bytecode because each instruction is often a single byte long.
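
To give a feel for what bytecode generation looks like, here is a sketch in C that emits instructions for "3 + 4 * (5 - 2)" into a byte array for a hypothetical little stack machine; the opcode names and encoding are invented for this example.

```c
#include <stdio.h>
#include <stdint.h>

/* Opcodes for a hypothetical stack machine (invented for this sketch). */
enum { OP_CONST, OP_ADD, OP_SUB, OP_MUL, OP_RETURN };

static uint8_t code[64];
static int count = 0;

static void emit(uint8_t byte)      { code[count++] = byte; }
static void emit_const(uint8_t val) { emit(OP_CONST); emit(val); }

int main(void) {
    /* Code generation for "3 + 4 * (5 - 2)": operands first, then operators. */
    emit_const(3);
    emit_const(4);
    emit_const(5);
    emit_const(2);
    emit(OP_SUB);     /* 5 - 2 */
    emit(OP_MUL);     /* 4 * (5 - 2) */
    emit(OP_ADD);     /* 3 + ... */
    emit(OP_RETURN);

    printf("emitted %d bytes of bytecode:", count);
    for (int i = 0; i < count; i++) printf(" %d", code[i]);
    printf("\n");
    return 0;
}
```

A real code generator would walk the AST (or IR) and emit these instructions in a post-order traversal rather than by hand, but the output has this flat, assembly-like shape.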

## 7. Virtual Machine

If your compiler produces bytecode, your work isn’t over once that’s done. Since there is no chip that speaks that bytecode, it’s your job to translate. Again, you have two options. You can write a little mini-compiler for each target architecture that converts the bytecode to native code for that machine. You still have to do work for each chip you support, but this last stage is pretty simple and you get to reuse the rest of the compiler pipeline across all of the machines you support. You’re basically using your bytecode as an intermediate representation.

The basic principle here is that the farther down the pipeline you push the architecture-specific work, the more of the earlier phases you can share across architectures.

There is a tension, though. Many optimizations, like register allocation and instruction selection, work best when they know the strengths and capabilities of a specific chip. Figuring out which parts of your compiler can be shared and which should be target-specific is an art.

Or you can write a virtual machine (VM), a program that emulates a hypothetical chip supporting your virtual architecture at runtime. Running bytecode in a VM is slower than translating it to native code ahead of time because every instruction must be simulated at runtime each time it executes. In return, you get simplicity and portability. Implement your VM in, say, C, and you can run your language on any platform that has a C compiler. This is how the second interpreter we build in this book works.
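
Here is a sketch of such a dispatch loop in C, simulating the hypothetical stack machine from the code generation example above; the opcodes and layout are invented for this illustration, not taken from any real VM.

```c
#include <stdio.h>
#include <stdint.h>

/* Same hypothetical opcodes as in the code generation sketch. */
enum { OP_CONST, OP_ADD, OP_SUB, OP_MUL, OP_RETURN };

/* A tiny stack-based VM: fetch an opcode, simulate it, repeat. */
static int run(const uint8_t *code) {
    int stack[64];
    int top = 0;                       /* number of values on the stack */
    for (const uint8_t *ip = code; ; ) {
        switch (*ip++) {
            case OP_CONST:  stack[top++] = *ip++; break;
            case OP_ADD:    top--; stack[top - 1] += stack[top]; break;
            case OP_SUB:    top--; stack[top - 1] -= stack[top]; break;
            case OP_MUL:    top--; stack[top - 1] *= stack[top]; break;
            case OP_RETURN: return stack[top - 1];
        }
    }
}

int main(void) {
    /* Bytecode for "3 + 4 * (5 - 2)", as emitted in the previous sketch. */
    uint8_t code[] = { OP_CONST, 3, OP_CONST, 4, OP_CONST, 5, OP_CONST, 2,
                       OP_SUB, OP_MUL, OP_ADD, OP_RETURN };
    printf("result = %d\n", run(code));   /* prints: result = 15 */
    return 0;
}
```

The fetch-decode-execute loop is why a VM is slower than native code: every bytecode instruction costs several real machine instructions of dispatch overhead on top of the work itself.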

## 8. Runtime

We have finally hammered the user’s program into a form that we can execute. The last step is running it. If we compiled it to machine code, we simply tell the operating system to load the executable and off it goes. If we compiled it to bytecode, we need to start up the VM and load the program into that.

In both cases, for all but the basest of low-level languages, we usually need some services that our language provides while the program is running. For example, if the language automatically manages memory, we need a garbage collector going in order to reclaim unused bits. If our language supports “instance of” tests so you can see what kind of object you have, then we need some representation to keep track of the type of each object during execution.

All of this stuff is going at runtime, so it’s called, appropriately, the runtime. In a fully compiled language, the code implementing the runtime gets inserted directly into the resulting executable. In, say, Go, each compiled application has its own copy of Go’s runtime directly embedded in it. If the language is run inside an interpreter or VM, then the runtime lives there. This is how most implementations of languages like Java, Python, and JavaScript work.
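
As one small illustration of runtime support, here is a sketch in C of the kind of per-object type tag that makes an "instance of" test possible; the `Obj` structure and type names are invented for this example.

```c
#include <stdio.h>

/* Every heap object carries a type tag so the runtime can answer
   "what kind of object is this?" while the program is running. */
typedef enum { OBJ_STRING, OBJ_LIST } ObjType;

typedef struct {
    ObjType type;   /* the runtime type tag */
    /* ...payload and, in a garbage-collected runtime, GC bookkeeping... */
} Obj;

/* An "instance of" style test, driven entirely by the tag. */
static int is_string(const Obj *obj) { return obj->type == OBJ_STRING; }

int main(void) {
    Obj s = { OBJ_STRING };
    Obj l = { OBJ_LIST };
    printf("s is a string? %s\n", is_string(&s) ? "yes" : "no");
    printf("l is a string? %s\n", is_string(&l) ? "yes" : "no");
    return 0;
}
```

A garbage collector leans on the same kind of per-object header: it needs to know each object's type (or at least its size and which fields are pointers) to trace the live objects and reclaim the rest.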