A grammar is a formal system that defines the structure of a language. In computing, grammars are used to specify:

  • programming languages
  • data formats
  • query languages
  • protocol structures.

A grammar is a set of rules that dictates how sequences of symbols (tokens) form valid constructs in a language. Without a grammar, a computer cannot determine if an input is meaningful or well-formed.

Components of a Grammar

A formal grammar consists of four essential components:

  1. Terminals: The smallest indivisible units of a language, often corresponding to tokens from a lexer. Examples include keywords (if, while), symbols (+, *), and literals (42, "hello").
  2. Non-terminals: Abstract categories representing higher-level language constructs. For instance, expr may represent an expression, while stmt might represent a statement.
  3. Production rules: Transformations that define how non-terminals expand into sequences of terminals and other non-terminals. These rules dictate the syntactic structure.
  4. Start symbol: A special non-terminal from which parsing begins, representing the highest-level structure.

Chomsky Hierarchy of Grammars

Formal grammars are classified into four categories, based on their expressive power:

Type 3: Regular Grammars

Regular grammars generate regular languages, which are recognized by finite automata. They are equivalent to regular expressions and cannot express nested or recursive structures.

A regular grammar for simple identifiers might look like:

id → letter id | letter
letter → a | b | c | ... | z

Regular expression example

Regular expressions cannot match arbitrarily nested structures. A simple example is:

L = { aⁿbⁿ | n ≥ 1 }

A regex like a*b* could match sequences of as followed by bs, but it does not enforce that the number of as must equal the number of bs. This limitation exists because regular expressions and finite automata lack memory beyond their finite states; they cannot count occurrences and compare them across different symbols. This makes it impossible for regular languages to express balanced constructs like aⁿbⁿ.

Type 2: Context-Free Grammars (CFGs)

Context-Free Grammars (CFGs) allow recursion, making them significantly more powerful. CFGs are used in most programming languages, enabling parsing of nested structures such as parentheses, blocks, and expressions.

An example of a CFG for balanced parentheses:

S → ( S ) S | ε

However, CFGs cannot express context-sensitive constraints. Consider a language where variable declarations must be followed by their correct usage:

decl → let id = expr
usage → id

A CFG cannot enforce that a declared id must match a later usage. This is why programming languages require additional semantic analysis beyond CFG parsing.

Backus-Naur Form (BNF)

BNF is a notation used to formally specify context-free grammars (CFGs). It consists of production rules, where each rule defines how a non-terminal expands into other non-terminals or terminals. The syntax follows the pattern:

non-terminal → expansion

A simple arithmetic expression grammar in BNF:

expr   → expr + term | expr - term | term
term   → term * factor | term / factor | factor
factor → NUMBER | ( expr )

Extended Backus-Naur Form (EBNF)

EBNF extends BNF with additional constructs that make grammars more expressive and compact:

  • { ... }0 or more repetitions (digit { digit } means a sequence of digits, like 1234).
  • [ ... ]Optional elements (if_stmt → "if" condition "then" statement [ "else" statement ]).
  • ( ... )Grouping (expr → ( term ("+" | "-") term )).
  • +1 or more repetitions (letter + means one or more letters, like abc).

Example of an arithmetic grammar in EBNF:

expr   → term { ("+" | "-") term }
term   → factor { ("*" | "/") factor }
factor → NUMBER | "(" expr ")"

This eliminates explicit recursion, making it more readable.

Syntax Diagrams

Syntax diagrams are graphical representations of grammars that illustrate valid constructs using arrows and loops. They are useful for visual learners and commonly appear in programming language specifications.

For example, a syntax diagram for an arithmetic expression might look like:

 [expr] →---> ( term ) ---> { ( + | - ) term } --->

This visually represents the same structure as EBNF but in an intuitive flowchart.

]

By Gregg-Irwin - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=99294579

Type 1: Context-Sensitive Grammars (CSGs)

CSGs impose rules where the left-hand side of a production may contain multiple symbols, allowing context-sensitive transformations. Example:

A B → B A  (Swap A and B only when adjacent)

CSGs can define languages that CFGs cannot, such as:

L = { aⁿbⁿcⁿ | n ≥ 1 }

This requires ensuring an equal number of as, bs, and cs, which a CFG cannot track.

Type 0: Unrestricted Grammars

The most expressive class, equivalent to Turing machines, capable of defining any computable language. However, they are undecidable in the general case, meaning no algorithm can always determine if a given input belongs to the language.

Ambiguity in Grammars

A grammar is ambiguous if at least one input has multiple valid parse trees.

Consider this ambiguous grammar for the dangling else problem:

stmt → if expr then stmt else stmt | if expr then stmt | other

The input:

if x then if y then S1 else S2

has two possible interpretations:

  1. else S2 belongs to the first if
  2. else S2 belongs to the second if

This is resolved by explicitly associating else with the closest unmatched if in languages like C and Java.

Abstract Syntax Trees (ASTs) vs. Concrete Syntax Trees (CSTs)

A Concrete Syntax Tree (CST) represents the full grammar structure, including all syntactic details. It adheres strictly to the grammar’s rules, while the Abstract Syntax Tree (AST) removes unnecessary nodes, simplifying the representation for semantic analysis and code generation.

Example: Parsing 3 + 4 * 5

CST:

       expr
      /    \
   expr     term
    |      /    \
  term   term   factor
   |       |      |
 factor    4      5
   |
   3

AST:

      +
     / \
    3   *
       / \
      4   5

The AST eliminates redundant non-terminals, reducing complexity.

Why Programming Languages Are Not Strictly CFGs

CFGs define syntax, but real programming languages impose additional constraints requiring context-sensitive analysis:

  1. Variable scoping: A variable must be declared before it is used.
  2. Type checking: Operations like int + string are syntactically valid but semantically incorrect.
  3. Indentation rules (Python): Indentation is not expressible in CFGs, requiring lexer hacks or context tracking.

CFGs handle basic structure, while semantic rules and type checking go beyond them.