Parsing is the process of analyzing a sequence of tokens according to a grammar, producing a structured representation of the input. It is used in compilers, interpreters, query processors, configuration parsers, and IDEs. The parser consumes tokens generated by a lexer and determines whether they conform to a valid syntax, often constructing a parse tree or an abstract syntax tree (AST).
Relationship Between Lexing and Parsing
Lexing, also known as tokenization, is the process of scanning raw input and grouping characters into tokens. Parsing, in contrast, is concerned with structuring these tokens into a hierarchical syntax. A lexer can be implemented manually or generated using Flex, while a parser can be built using a parser generator like Bison or handwritten using recursive descent or other techniques.
Parse Trees, Derivations, and ASTs
A parse tree represents the syntactic structure of an input string according to a grammar. It explicitly captures every step in the derivation, including all intermediate non-terminals. The process of generating a parse tree follows a derivation, which describes how the start symbol expands into the full input sequence by applying grammar rules step by step.
There are two main types of derivations:
- Leftmost derivation: Expands the leftmost non-terminal first.
- Rightmost derivation: Expands the rightmost non-terminal first.
For the grammar:
expr → expr + term | term
term → NUMBER
and input 3 + 4, a leftmost derivation produces:
expr → expr + term
→ term + term
→ NUMBER + NUMBER
which corresponds to the parse tree:
expr
/ | \
expr + term
| |
term term
| |
3 4
A rightmost derivation expands the rightmost non-terminal first:
expr → expr + term
→ expr + NUMBER
→ NUMBER + NUMBER
This corresponds to the same parse tree, but the order in which rules are applied differs.
While parse trees capture every grammatical detail, many applications prefer an abstract syntax tree (AST), which simplifies structure by omitting unnecessary non-terminals. The AST for 3 + 4 is:
+
/ \
3 4
ASTs are used in compilers and interpreters because they are more compact and better suited for further processing like type checking and code generation.
Parser Strategies
Different parsing strategies determine how input is processed and converted into a structured representation. The two main approaches are top-down and bottom-up parsing. Top-down parsers start from the highest-level rule and expand productions recursively. Bottom-up parsers start with tokens and apply reductions until reaching the start symbol.
Top-Down Parsing
Top-down parsing begins at the start symbol of the grammar and applies production rules recursively to match the input tokens. Recursive descent parsers are a common approach, where each non-terminal is implemented as a function that calls itself recursively.
LL Parsing
LL parsers process input left to right and produce a leftmost derivation. They are predictive parsers, meaning they determine the correct production without backtracking.
LL(1) parsers use one token of lookahead to decide the next rule to apply. If multiple productions share the same prefix, left-factoring is required to rewrite the grammar into a form suitable for LL parsing.
Example of a recursive descent parser for arithmetic expressions:
void expr() {
term();
while (lookahead == '+' || lookahead == '-') {
match(lookahead);
term();
}
}Recursive descent works well for LL(1) grammars but fails on left-recursive rules, requiring grammar transformation.
Bottom-Up Parsing
Bottom-up parsers construct parse trees from leaves to root, using shift-reduce parsing. These parsers maintain a stack where tokens are shifted onto the stack until a valid rule is recognized, at which point they are reduced.
LR and LALR Parsing
LR parsers process input left to right, but they construct a rightmost derivation in reverse. The LR(1) algorithm requires a large parsing table, making it impractical for many applications. LALR(1) (Lookahead LR) reduces the table size while preserving most of LR(1)‘s power. This makes LALR(1) the choice for tools like Bison.
Example of an LALR(1) grammar used in Bison:
Bison generates a C-based parser that processes input using the shift-reduce method, resolving ambiguities using precedence rules.
LL vs. LALR: When to Use Each
LL parsers are commonly used for DSLs, configuration file parsers, and interpreters, while LALR parsers are favored for complex programming languages and SQL. ANTLR’s LL(*) parsing mitigates many limitations of traditional LL parsers by allowing arbitrary lookahead and better handling of ambiguity, making it highly practical for real-world use. LALR(1) parsing, as used in Bison, remains efficient for highly structured and performance-sensitive applications.
Handling Left Recursion
LL parsers cannot handle left-recursive rules, such as:
expr → expr + term | term
Rewriting the grammar to remove left recursion:
expr → term expr'
expr' → '+' term expr' | ε
LR parsers naturally handle left-recursive rules without requiring transformation, making them more powerful for complex languages.
Error Handling in Parsing
Parsing errors must be handled gracefully to avoid abrupt failures. The two common approaches are:
- Panic Mode: Skipping input until a synchronization point (e.g., semicolons in C).
- Error Productions: Defining special rules to recognize common syntax errors and recover from them.
LL parsers typically provide better error messages, while LR parsers focus on efficiency and correctness.
Practical Applications of Parsing
Parsing is fundamental in many domains:
- Compilers and Interpreters: Convert high-level code into executable instructions.
- SQL Parsing: PostgreSQL and SQLite use Bison for query parsing.
- Configuration Parsing: JSON, YAML, and TOML use parsing techniques to validate structured data.
- IDEs and Linters: Code analysis tools use parsing for syntax highlighting, refactoring, and static analysis.
Parsing techniques directly influence the performance, robustness, and maintainability of these systems.
Example parsers implementation
# LL(1) Parser (Recursive Descent)
# Example: Simple arithmetic expression parser (supports + and *)
# Uses an LL(1) predictive parsing approach
class LLParser:
def __init__(self, tokens):
self.tokens = tokens
self.current_token = self.tokens.pop(0) if self.tokens else None
def match(self, expected_token):
if self.current_token == expected_token:
self.current_token = self.tokens.pop(0) if self.tokens else None
else:
raise SyntaxError(f"Unexpected token: expected {expected_token}, got {self.current_token}")
def expr(self):
"""expr → term ('+' term)*"""
value = self.term()
while self.current_token == '+':
self.match('+')
value += self.term()
return value
def term(self):
"""term → factor ('*' factor)*"""
value = self.factor()
while self.current_token == '*':
self.match('*')
value *= self.factor()
return value
def factor(self):
"""factor → NUMBER"""
if self.current_token.isdigit():
value = int(self.current_token)
self.match(self.current_token)
return value
else:
raise SyntaxError(f"Unexpected token: {self.current_token}")
# Example usage
ll_parser = LLParser("3 + 4 * 2".split())
print(ll_parser.expr()) # Output: 11 (incorrect precedence, LL struggles without explicit rules)
# LALR(1) Parser using PLY (Python Lex-Yacc)
# Example: Arithmetic expression parser using shift-reduce parsing
from ply import yacc, lex
# Lexer for arithmetic expressions
tokens = ('NUMBER', 'PLUS', 'TIMES')
t_literals = {'+', '*'}
def t_NUMBER(t):
r'\d+'
t.value = int(t.value)
return t
t_ignore = ' \t'
def t_error(t):
raise SyntaxError(f"Illegal character '{t.value[0]}'")
lexer = lex.lex()
# LALR(1) grammar rules
def p_expr_plus(p):
"""expr : expr PLUS term"""
p[0] = p[1] + p[3]
def p_expr_term(p):
"""expr : term"""
p[0] = p[1]
def p_term_times(p):
"""term : term TIMES factor"""
p[0] = p[1] * p[3]
def p_term_factor(p):
"""term : factor"""
p[0] = p[1]
def p_factor_number(p):
"""factor : NUMBER"""
p[0] = p[1]
def p_error(p):
raise SyntaxError("Syntax error in input!")
parser = yacc.yacc()
# Example usage
print(parser.parse("3 + 4 * 2")) # Output: 11 (correct precedence, LALR handles it properly)
"""
Comparison of LL(1) vs. LALR(1):
1. **LL(1) (Recursive Descent)**:
- Simpler implementation, easier to understand.
- Struggles with left recursion.
- Requires manual precedence handling.
- Used in **ANTLR (LL(*))** but extended to avoid major LL(1) limitations.
2. **LALR(1) (Shift-Reduce, via YACC/Ply)**:
- Handles operator precedence naturally.
- More powerful and used in compilers, SQL parsers.
- Automatically resolves shift/reduce conflicts via precedence rules.
- Used in **Bison, PostgreSQL, GCC parsers**.
This demonstrates that LL(1) works well for simple, structured grammars but lacks the power to handle more complex constructs without additional transformations. LALR(1) is preferred for larger, more flexible parsing tasks.
"""