
#1 Parsing Expressions

In the previous chapter, we discussed lexical analysis and tokenization, which break the source code into tokens. Now we'll dive into the next step: parsing expressions. Parsing is the process of analyzing a sequence of tokens to determine its grammatical structure according to a given grammar.

#2 Context-Free Grammars

To parse expressions effectively, we need to understand context-free grammars (CFGs). These grammars define the rules for constructing valid expressions in a language. Unlike regular languages, which cannot express arbitrarily deep nesting, CFGs handle nested structures naturally, making them the right tool for parsing expressions.
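
To see why nesting matters, consider a toy pair of rules (invented for this example, not part of FarmScript) in which a rule refers to itself:

group → "(" group ")" ;
group → NUMBER ;

These two productions generate NUMBER, (NUMBER), ((NUMBER)), and so on to any depth — something no regular expression can capture, because the depth is unbounded.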

#2.1 Rules for Grammars

A formal grammar specifies which sequences of tokens are valid in the language. We use production rules to define these sequences. Each production in a context-free grammar has a head (its name) and a body (the symbols it generates). Terminals represent literal values, while non-terminals reference other rules in the grammar.
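
As a small illustration (these rules are invented for this example, not part of FarmScript), here is a pair of productions:

sum     → operand "+" operand ;
operand → NUMBER ;

Here sum and operand are the heads of their rules; NUMBER and "+" are terminals, while operand appearing in the body of sum is a nonterminal referring to the second rule.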

#2.2 Enhancing Our Notation

We can improve our grammar notation by allowing additional conveniences such as the use of pipes (|) to separate alternative productions, parentheses for grouping, and postfix operators (*) for repetition.
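
For example, without these conveniences a comma-separated list (again an illustrative rule, not part of FarmScript) needs recursion just to express repetition:

list → NUMBER ;
list → NUMBER "," list ;

With pipes and the postfix star, it collapses to a single production:

list → NUMBER ( "," NUMBER )* ;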

#2.3 A Grammar for FarmScript Expressions

We'll focus on a subset of the FarmScript language for parsing expressions. This subset includes literals (numbers, strings, Booleans, and nil), unary expressions (logical not and negation), binary expressions (arithmetic and logical operations), and parentheses for grouping.

Example:

Grammar for FarmScript expressions:

expression -> literal
           | unary
           | binary
           | grouping ;
literal    -> NUMBER | STRING | "true" | "false" | "nil" ;
grouping   -> "(" expression ")" ;
unary      -> ( "-" | "!" ) expression ;
binary     -> expression operator expression ;
operator   -> "==" | "!=" | "<" | "<=" | ">" | ">="
           | "+"  | "-"  | "*" | "/" ;

This grammar ensures that expressions like 1 + (2 * 3) are parsed correctly according to the rules defined.
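
We can check this by deriving 1 + (2 * 3) from the grammar, expanding one rule at a time (one of several possible derivations):

expression
→ binary
→ expression operator expression
→ literal "+" expression
→ 1 "+" grouping
→ 1 "+" "(" expression ")"
→ 1 "+" "(" binary ")"
→ 1 "+" "(" expression operator expression ")"
→ 1 "+" "(" 2 "*" 3 ")"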

#3 Code for the Parser

#include <iostream>
#include <vector>
#include <stdexcept>

// Token types
enum class TokenType {
    NUMBER,
    PLUS,
    MINUS,
    END_OF_FILE
};

// Token structure
struct Token {
    TokenType type;
    double value; // Only used for numbers
    size_t position; // Position of token in input
};

class Parser {
public:
    Parser(const std::vector<Token>& tokens) : tokens(tokens), currentPos(0), currentToken({ TokenType::END_OF_FILE, 0, 0 }) {}

    // Public method to start parsing
    double parse() {
        currentToken = getNextToken(); // Initialize current token
        return parseExpression(); // Parse expression
    }

private:
    // Private member variables
    const std::vector<Token>& tokens;
    size_t currentPos;
    Token currentToken;

    // Helper method to get the next token from input
    Token getNextToken() {
        if (currentPos >= tokens.size()) {
            return { TokenType::END_OF_FILE, 0, currentPos };
        }
        return tokens[currentPos++]; // Get current token
    }

    // Recursive descent parsing for expressions
    double parseExpression() {
        double result = parseTerm(); // Parse the first term

        while (currentToken.type == TokenType::PLUS || currentToken.type == TokenType::MINUS) {
            Token op = currentToken;
            if (op.type == TokenType::PLUS) {
                eat(TokenType::PLUS);
                result += parseTerm(); // Perform addition
            } else {
                eat(TokenType::MINUS);
                result -= parseTerm(); // Perform subtraction
            }
        }

        return result;
    }

    // Parse individual terms (numbers)
    double parseTerm() {
        Token token = currentToken;
        eat(TokenType::NUMBER); // Expect a number
        return token.value; // Return the value of the number
    }

    // Helper method to consume expected token type
    void eat(TokenType expectedType) {
        if (currentToken.type == expectedType) {
            currentToken = getNextToken(); // Consume current token and move to next
        } else {
            throw std::runtime_error("Unexpected token encountered at position " + std::to_string(currentToken.position));
        }
    }
};

int main() {
    std::vector<Token> tokens = {
        { TokenType::NUMBER, 3, 0 },
        { TokenType::PLUS, 0, 1 },
        { TokenType::NUMBER, 2, 2 },
        { TokenType::MINUS, 0, 3 },
        { TokenType::NUMBER, 1, 4 }
    };

    Parser parser(tokens);

    try {
        double result = parser.parse();
        std::cout << "Result: " << result << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

#3.1 Explanation

// Token types
enum class TokenType {
    NUMBER,
    PLUS,
    MINUS,
    END_OF_FILE
};
  • We define an enumeration TokenType to represent different types of tokens. In this example, we have NUMBER, PLUS, MINUS, and END_OF_FILE.
// Token structure
struct Token {
    TokenType type;
    double value; // Only used for numbers
    size_t position; // Position of token in input
};
  • Here, we define a structure Token to represent a token. It has three fields: type to store the type of token, value to store the numerical value (only used for NUMBER tokens), and position to store the position of the token in the input string.
class Parser {
public:
    Parser(const std::vector<Token>& tokens) : tokens(tokens), currentPos(0), currentToken({ TokenType::END_OF_FILE, 0, 0 }) {}

    // Public method to start parsing
    double parse() {
        currentToken = getNextToken(); // Initialize current token
        return parseExpression(); // Parse expression
    }
  • We define a class Parser to parse the input tokens. In the constructor, we initialize the tokens vector with the input tokens, currentPos to track the current position in the vector, and currentToken to store the current token being processed (initialized with END_OF_FILE token).
  • The parse() method initializes the currentToken and starts parsing the expression.
private:
    // Private member variables
    const std::vector<Token>& tokens;
    size_t currentPos;
    Token currentToken;
  • We declare private member variables for the Parser class: tokens to store the input tokens, currentPos to track the current position in the token vector, and currentToken to store the current token being processed.
    // Helper method to get the next token from input
    Token getNextToken() {
        if (currentPos >= tokens.size()) {
            return { TokenType::END_OF_FILE, 0, currentPos };
        }
        return tokens[currentPos++]; // Get current token
    }
  • We define a helper method getNextToken() to retrieve the next token from the input vector. It increments the currentPos to move to the next token in the vector.
    // Recursive descent parsing for expressions
    double parseExpression() {
        double result = parseTerm(); // Parse the first term

        while (currentToken.type == TokenType::PLUS || currentToken.type == TokenType::MINUS) {
            Token op = currentToken;
            if (op.type == TokenType::PLUS) {
                eat(TokenType::PLUS);
                result += parseTerm(); // Perform addition
            } else {
                eat(TokenType::MINUS);
                result -= parseTerm(); // Perform subtraction
            }
        }

        return result;
    }
  • We define a method parseExpression() to parse arithmetic expressions. It parses the first term and then iteratively parses additional terms while encountering PLUS or MINUS tokens. It performs addition or subtraction based on the operator encountered.
    // Parse individual terms (numbers)
    double parseTerm() {
        Token token = currentToken;
        eat(TokenType::NUMBER); // Expect a number
        return token.value; // Return the value of the number
    }
  • We define a method parseTerm() to parse individual terms (numbers). It expects a NUMBER token, consumes it using the eat() method, and returns the value of the token.
    // Helper method to consume expected token type
    void eat(TokenType expectedType) {
        if (currentToken.type == expectedType) {
            currentToken = getNextToken(); // Consume current token and move to next
        } else {
            throw std::runtime_error("Unexpected token encountered at position " + std::to_string(currentToken.position));
        }
    }
  • We define a helper method eat() to consume the current token if it matches the expected token type. If the current token does not match the expected type, it throws a runtime error with the position information of the token.
};

int main() {
    std::vector<Token> tokens = {
        { TokenType::NUMBER, 3, 0 },
        { TokenType::PLUS, 0, 1 },
        { TokenType::NUMBER, 2, 2 },
        { TokenType::MINUS, 0, 3 },
        { TokenType::NUMBER, 1, 4 }
    };

    Parser parser(tokens);

    try {
        double result = parser.parse();
        std::cout << "Result: " << result << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}
  • In the main() function, we create a vector of tokens representing the expression 3 + 2 - 1.
  • We instantiate a Parser object with the token vector and call the parse() method to parse the expression.
  • We catch any exceptions thrown during parsing and print an error message if parsing fails.

For the time being, our code handles only simple arithmetic: addition and subtraction, with no parentheses.

#3.2 Modifying the Code to Support Parentheses, Multiplication, and Division

#include <iostream>
#include <stdexcept>
#include <vector>

// Token types
enum class TokenType {
    NUMBER,
    PLUS,
    MINUS,
    MULTIPLY,
    DIVIDE,
    LPAREN,
    RPAREN,
    END_OF_FILE
};

// Token structure
struct Token {
    TokenType type;
    double value; // Only used for numbers
};

class Parser {
public:
    Parser(const std::vector<Token>& tokens) : tokens(tokens), currentPos(0), currentToken({ TokenType::END_OF_FILE, 0 }) {}

    // Public method to start parsing
    double parse() {
        currentToken = getNextToken(); // Initialize current token
        return parseExpression(); // Parse expression
    }

private:
    // Private member variables
    const std::vector<Token>& tokens;
    size_t currentPos;
    Token currentToken;

    // Helper method to get the next token from tokens
    Token getNextToken() {
        if (currentPos >= tokens.size()) {
            return { TokenType::END_OF_FILE, 0 };
        }

        return tokens[currentPos++];
    }

    // Recursive descent parsing for expressions
    double parseExpression() {
        double result = parseTerm(); // Parse the first term

        while (currentToken.type == TokenType::PLUS || currentToken.type == TokenType::MINUS) {
            Token op = currentToken;
            if (op.type == TokenType::PLUS) {
                eat(TokenType::PLUS);
                result += parseTerm(); // Perform addition
            } else {
                eat(TokenType::MINUS);
                result -= parseTerm(); // Perform subtraction
            }
        }

        return result;
    }

    // Parse terms: sequences of factors joined by '*' and '/'
    double parseTerm() {
        double result = parseFactor(); // Parse the first factor

        while (currentToken.type == TokenType::MULTIPLY || currentToken.type == TokenType::DIVIDE) {
            Token op = currentToken;
            if (op.type == TokenType::MULTIPLY) {
                eat(TokenType::MULTIPLY);
                result *= parseFactor(); // Perform multiplication
            } else {
                eat(TokenType::DIVIDE);
                double divisor = parseFactor();
                if (divisor == 0) {
                    throw std::runtime_error("Division by zero");
                }
                result /= divisor; // Perform division
            }
        }

        return result;
    }

    // Parse factors (numbers, unary expressions, or parentheses)
    double parseFactor() {
        Token token = currentToken;
        if (token.type == TokenType::NUMBER) {
            eat(TokenType::NUMBER); // Expect a number
            return token.value; // Return the value of the number
        } else if (token.type == TokenType::PLUS) {
            eat(TokenType::PLUS);
            return parseFactor(); // Unary plus (no effect)
        } else if (token.type == TokenType::MINUS) {
            eat(TokenType::MINUS);
            return -parseFactor(); // Unary minus
        } else if (token.type == TokenType::LPAREN) {
            eat(TokenType::LPAREN);
            double result = parseExpression(); // Parse expression within parentheses
            eat(TokenType::RPAREN);
            return result;
        } else {
            throw std::runtime_error("Invalid token encountered");
        }
    }

    // Helper method to consume expected token type
    void eat(TokenType expectedType) {
        if (currentToken.type == expectedType) {
            currentToken = getNextToken(); // Consume current token and move to next
        } else {
            throw std::runtime_error("Unexpected token encountered");
        }
    }
};

int main() {
    std::vector<Token> tokens = {
        { TokenType::LPAREN, 0 },
        { TokenType::NUMBER, 3 },
        { TokenType::PLUS, 0 },
        { TokenType::NUMBER, 2 },
        { TokenType::RPAREN, 0 },
        { TokenType::MULTIPLY, 0 },
        { TokenType::NUMBER, 4 },
        { TokenType::MINUS, 0 },
        { TokenType::LPAREN, 0 },
        { TokenType::MINUS, 0 },
        { TokenType::NUMBER, 5 },
        { TokenType::RPAREN, 0 }
    };
    Parser parser(tokens);

    try {
        double result = parser.parse();
        std::cout << "Result: " << result << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

#4 Recursive Descent Parsing

Recursive Descent Parsing is a top-down parsing technique where the parser starts from the root of the grammar (typically the start symbol) and recursively applies grammar rules to parse the input.

Recursive descent parsing is classified as a top-down parsing technique because it begins parsing from the top-level grammar rule, such as the expression rule, and then proceeds to recursively parse nested subexpressions before reaching the smallest components of the syntax tree. This stands in contrast to bottom-up parsers like LR, which start with the smallest building blocks of the language and gradually combine them into larger constructs.

The term "recursive descent" reflects the method's approach of traversing the grammar. Interestingly, there's a metaphorical use of directionality in discussing parsing techniques. While we refer to "high" and "low" precedence, indicating the importance of operations, the orientation of parsing is reversed. In a top-down parser like recursive descent, we encounter the lowest-precedence expressions first, as they may contain subexpressions of higher precedence within them.

#4.1 Theory

A recursive descent parser is essentially a direct translation of the grammar's rules into imperative code. Each rule in the grammar corresponds to a function in the code. The structure of the grammar is mirrored in the code representation:

  • Terminals in the grammar are represented by code that matches and consumes a token from the input stream.
  • Nonterminals in the grammar are represented by function calls to the corresponding rule's function.
  • The '|' symbol in the grammar translates to an 'if' or 'switch' statement in the code, allowing the parser to choose between alternative productions.
  • The '*' or '+' symbols in the grammar translate to 'while' or 'for' loops in the code, enabling the parser to handle repetitions of certain constructs.
  • The '?' symbol in the grammar translates to an 'if' statement in the code, allowing the parser to handle optional elements.

The term "recursive" in "recursive descent" comes from the fact that when a grammar rule refers to itself—either directly or indirectly—that results in a recursive function call. This recursion allows the parser to handle nested and recursive structures within the input language.

#4.2 Real-Life Example: Recipe Instructions

Imagine you have a set of instructions for making a sandwich:

  1. Tokenization:
    1. You start by breaking down the instructions into individual steps or tokens. For example: "Take two slices of bread", "Spread peanut butter on one slice", "Spread jelly on the other slice", etc.
  2. Parsing:
    1. You want to understand the structure of the instructions and how each step relates to the others. So, you start reading the instructions from the beginning.
    2. At each step, you decide what to do next based on what you've read so far.
      For example, if the instruction starts with "Take two slices of bread", you know that you'll need to perform actions related to handling bread next.
  3. Recursion:
    1. Some instructions might have sub-steps. For example, "Spread peanut butter on one slice" can be broken down into smaller steps: "Open the jar of peanut butter", "Use a knife to scoop peanut butter", "Spread peanut butter on the slice".
      Similarly, you recursively follow the instructions for each sub-step until you reach a base case.
  4. Base Cases:
    1. Eventually, you'll reach steps that are simple and don't require further breakdown. For example, "Spread peanut butter on the slice" is a base case because it's a simple action that can be performed directly.
  5. Error Handling:
    1. If you encounter an instruction that doesn't make sense or can't be followed, you might need to backtrack or report an error. For example, if the instruction says "Spread peanut butter on the slice" but you realize you don't have peanut butter, you need to handle this situation.
  6. Building Instructions Tree:
    1. As you go through the instructions, you mentally build a hierarchical structure of actions, similar to how you might draw a flowchart to represent the steps involved in making a sandwich.
  7. Execution:
    1. Once you've understood the structure of the instructions, you can start executing them step by step to make the sandwich.

#4.3 The parser class

Each grammar rule becomes a method inside this class:

class Parser
{
private:
    unsigned int current = 0;
    unsigned int loopDepth = 0;
    std::vector<Token> tokens;

public:
    Parser(const std::vector<Token> &tokens) : tokens(tokens)
    {
    }
};

Similar to the scanner, the parser consumes a flat input sequence, only now we are reading tokens instead of characters. We store the list of tokens and use current to point to the next token eagerly waiting to be parsed.

We are going to run straight through the expression grammar now and translate each rule to C++ code. The first rule, expression, simply expands to the equality rule, so that's straightforward:

Expr * Parser::expression()
{
    Expr *expr = equality();
    return expr;
}
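
Taken together, the methods we are about to write mirror one grammar rule per precedence level. Sketched as a grammar, from lowest precedence to highest:

expression → equality ;
equality   → comparison ( ( "!=" | "==" ) comparison )* ;
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term       → factor ( ( "-" | "+" ) factor )* ;
factor     → unary ( ( "/" | "*" ) unary )* ;
unary      → ( "!" | "-" ) unary | primary ;
primary    → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")" ;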

#4.3.1 Equality

The rule for equality is a little more complex:

equality       → comparison ( ( "!=" | "==" ) comparison )* ;
Expr * Parser::equality()
{
    Expr *expr = comparison();

    while (match(BANG_EQUAL) || match(EQUAL_EQUAL))
    {
        Token *oper = new Token(previous());
        Expr *right = comparison();
        expr = new Binary(expr, oper, right);
    }

    return expr;
}

In the above snippet, the equality() method begins by parsing the comparison() nonterminal:

Expr *expr = comparison();

Then, the ( … )* loop in the rule maps to a while loop. We need to know when to exit that loop. We can see that inside the rule, we must first find either a != or == token. So, if we don't see one of those, we must be done with the sequence of equality operators. We express that check using a handy match() method:

bool Parser::match(TokenType type)
{
    if (check(type))
    {
        advance();
        return true;
    }
    return false;
}

This checks to see if the current token is of the given type. If so, it consumes the token and returns true. Otherwise, it returns false and leaves the current token alone. The match() method is defined in terms of two more fundamental operations.

The check() method returns true if the current token is of the given type. Unlike match(), it never consumes the token; it only looks at it.

bool Parser::check(TokenType type)
{
    if (isAtEnd())
    {
        return false;
    }
    return peek().type == type;
}

The advance() method consumes the current token and returns it, similar to how our scanner's corresponding method crawled through characters.

Token Parser::advance()
{
    if (!isAtEnd())
    {
        return tokens[current++];
    }
    return previous();
}

These methods are built on top of a handful of primitive operations:

Token Parser::peek()
{
    return tokens[current];
}
Token Parser::previous()
{
    return tokens[current - 1];
}

bool Parser::isAtEnd()
{
    return peek().type == EOF_TOKEN;
}

isAtEnd() checks if we have run out of tokens to parse.

peek() returns the current token we have yet to consume.

previous() returns the most recently consumed token.

That’s most of the parsing infrastructure we need. Where were we? Right, so if we are inside the while loop in equality(), then we know we have found a != or == operator and must be parsing an equality expression.

The parser falls out of the loop once it hits a token that’s not an equality operator. Finally, it returns the expression. Note that if the parser never encounters an equality operator, then it never enters the loop. In that case, the equality() method effectively calls and returns comparison(). In that way, this method matches an equality operator or anything of higher precedence.
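
Because each pass through the loop wraps everything parsed so far as the left operand of a new Binary node, the rule is left-associative. Parsing a == b == c, for example, builds:

Binary(Binary(a, ==, b), ==, c)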

#4.3.2 Comparison

Moving on to the next rule…

comparison     → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
Expr * Parser::comparison()
{
    Expr *expr = term();

    while (match(LESS) || match(GREATER) || match(LESS_EQUAL) || match(GREATER_EQUAL))
    {        
        Token *oper = new Token(previous());
        Expr *right = term();
        expr = new Binary(expr, oper, right);
    }

    return expr;
}

The grammar rule is virtually identical to equality and so is the corresponding code. The only differences are the token types for the operators we match, and the method we call for the operands—now term() instead of comparison(). The remaining two binary operator rules follow the same pattern.

In order of precedence, first addition and subtraction:

Expr * Parser::term()
{
    Expr *expr = factor();

    while (match(MINUS) || match(PLUS))
    {
        Token *oper = new Token(previous());
        Expr *right = factor();
        expr = new Binary(expr, oper, right);
    }

    return expr;
}

And finally, multiplication and division:

Expr * Parser::factor()
{
    Expr *expr = unary();

    while (match(STAR) || match(SLASH))
    {
        Token *oper = new Token(previous());
        Expr *right = unary();
        expr = new Binary(expr, oper, right);
    }

    return expr;
}

That's all of the binary operators, parsed with the correct precedence and associativity. We are crawling up the precedence hierarchy and now we have reached the unary operators.

#4.3.3 Unary

unary          → ( "!" | "-" ) unary
               | primary ;

The code for this is a little different.

Expr * Parser::unary()
{
    if (match(BANG) || match(MINUS))
    {
        Token *oper = new Token(previous());
        Expr *right = unary();
        return new Unary(oper, right);
    }

    return primary();
}

Again, we look at the current token to see how to parse. If it’s a ! or -, we must have a unary expression. In that case, we grab the token and then recursively call unary() again to parse the operand. Wrap that all up in a unary expression syntax tree and we’re done.
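
Since unary() recursively calls itself for its operand, stacked unary operators nest to the right. Parsing !!ready, for example, builds:

Unary(!, Unary(!, ready))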

#4.3.4 Primary expression

With the unary operators out of the way, we must have reached the highest level of precedence, primary expressions.

primary        → NUMBER | STRING | "true" | "false" | "nil"
               | "(" expression ")" ;

Most of the cases for the rule are single terminals, so parsing is straightforward.

Expr * Parser::primary()
{
    if (match(FALSE))
        return new Literal(false);
    if (match(TRUE))
        return new Literal(true);
    if (match(NIL))
        return new Literal(nullptr);

    if (match(NUMBER) || match(STRING))
    {
        return new Literal(previous().literal);
    }

    if (match(LEFT_PAREN))
    {
        Expr *expr = expression();
        consume(RIGHT_PAREN, "Expect ')' after expression.");
        return new Grouping(expr);
    }

    // None of the cases matched: we are sitting on a token that can't start an expression.
    error(peek(), "Expect expression.");
    return nullptr;
}

The interesting branch is the one for handling parentheses. After we match an opening ( and parse the expression inside it, we must find a ) token. If we don’t, that’s an error.

#5 Syntax errors

A parser really has two jobs:

  1. Given a valid sequence of tokens, produce a corresponding syntax tree.
  2. Given an invalid sequence of tokens, detect any errors and tell the user about their mistakes.

Don’t underestimate how important the second job is! In modern IDEs and editors, the parser is constantly reparsing code—often while the user is still editing it—in order to syntax highlight and support things like auto-complete. That means it will encounter code in incomplete, half-wrong states all the time.

When the user doesn’t realize the syntax is wrong, it is up to the parser to help guide them back onto the right path. The way it reports errors is a large part of your language’s user interface. Good syntax error handling is hard. By definition, the code isn’t in a well-defined state, so there’s no infallible way to know what the user meant to write. The parser can’t read your mind.

There are a couple of hard requirements for when the parser runs into a syntax error. A parser must:

  • Detect and report the error. If it doesn’t detect the error and passes the resulting malformed syntax tree on to the interpreter, all manner of horrors may be summoned.
  • Avoid crashing or hanging. Syntax errors are a fact of life, and language tools have to be robust in the face of them. Segfaulting or getting stuck in an infinite loop isn’t allowed. While the source may not be valid code, it’s still a valid input to the parser because users use the parser to learn what syntax is allowed.
  • Be fast. Computers are thousands of times faster than they were when parser technology was first invented. The days of needing to optimize your parser so that it could get through an entire source file during a coffee break are over. But programmer expectations have risen as quickly, if not faster. They expect their editors to reparse files in milliseconds after every keystroke.
  • Report as many distinct errors as there are. Aborting after the first error is easy to implement, but it’s annoying for users if every time they fix what they think is the one error in a file, a new one appears. They want to see them all.
  • Minimize cascaded errors. Once a single error is found, the parser no longer really knows what’s going on. It tries to get itself back on track and keep going, but if it gets confused, it may report a slew of ghost errors that don’t indicate other real problems in the code. When the first error is fixed, those phantoms disappear, because they reflect only the parser’s own confusion. Cascaded errors are annoying because they can scare the user into thinking their code is in a worse state than it is.

#5.1 Entering panic mode

bool Parser::consume(TokenType type, std::string message)
{
    if (check(type))
    {
        advance();
        return true;
    }
    error(peek(), message);
    return false;
}

It is similar to match() in that it checks to see if the next token is of the expected type. If so, it consumes the token and everything works smoothly. If some other token is there, then we have hit an error. We report it by calling this:

void Parser::error(Token token, std::string message)
{
    FarmScript::error(token, message);
}

First, that shows the error to the user by calling:

void FarmScript::error(Token token, std::string message)
{
    if (token.type == EOF_TOKEN)
    {
        report(token.line, " at end", message);
    }
    else
    {
        std::stringstream ss;
        ss << " at '" << token.lexeme << "'";
        report(token.line, ss.str(), message);
    }
}

This reports an error at a given token. It shows the token’s location and the token itself. This will come in handy later since we use tokens throughout the interpreter to track locations in code.