03 Syntax Analysis
Also called as Parsing
Regular expressions cannot be used, due to nested structure.
Hence, we need a context-free grammar and PDA
Syntax/Parse Tree¶
Each interior node represents an operation and the children of the node represent the arguments of the operation. It shows the order of operations.
Yacc¶
Yet another compiler compiler
This is covered in practicals
CFG¶
Context-Free Grammar
- Set of terminals \(T\)
- Set of non-terminals \(N\)
- Start symbol \(S\) (non-terminal)
- Set of productions
where
- \(X \in N\)
- \(Y_i \in T \cup N \cup \{ \epsilon \}\)
LHS of any production/rule can only be a single non-terminal
If \(S\) is the start symbol and \(L(G)\) is the language of \(G\), then \(L(G)\) is the set of strings that can be derived from \(S\)
Derivation¶
A derivation defines a parse tree. One parse tree may have multiple derivations.
We have 2 different derivation types, which allows for different parser implementations.
Types¶
Type | Parser Implementation | Single-step Derivation |
---|---|---|
Leftmost | Top-Down | Leftmost non-terminal replaced with corr RHS of non-terminal |
Rightmost | Bottom-Up | Rightmost non-terminal replaced with corr RHS of non-terminal |
Example¶
Leftmost
E
ā E * E
ā (E) * E
ā (E+E) * E
ā (id+E) * E
ā (id+num) * E
ā (id+num) * num
Rightmost
E
ā E * E
ā E * num
ā (E) * num
ā (E+E) * num
ā (E+num) * num
ā (id+num) * num
Parse tree is same for both
Grammar¶
Types¶
Type | Production Form | Parser |
---|---|---|
Left-Recursive | \(X \to Xa\) | Bottom-Up |
Right-Recursive | \(X \to aX\) | Bottom-Up/ Top-Down |
where
- \(X\) is a non-terminal
- \(a\) is a string of terminals, it is called left recursive production
Top-down parser cannot work with Left recursive grammar, but both parsing works with right recursive grammar
Ambiguous Grammar¶
Grammar where the same string has multiple
- parse trees
- leftmost derivations
- rightmost derivations
It gives incorrect results.
Example¶
In this case, leftmost derivation is incorrect, as +
is left-associative, and should be treated as such.
// leftmost approach 1
E
ā E+E
āļæ¼ id + E
ā id + E + E
ā id + id + E
ā id + id + id
// leftmost approach 2
E
ā E + E
ā E + E + E
ā id + E + E
ā id + id + E
ā id + id + id
if
statement¶
This has two leftmost derivations for
Disambiguation¶
It is not possible to automatically convert ambiguous grammar into an unambiguous one. Hence, we need to use one of the following to eliminate ambiguity
Rewrite Grammar¶
if grammar previously
can be converted to
stmt ā m_stmt
| o_stmt
| other
m_stmt ā if expr then m_stmt else m_stmt
| other
o_stmt ā if expr then stmt
| if expr then m_stmt else o_stmt
where
- matched statement: with
else
- open statement: without
else
Tool-Provided Disambiguation¶
Yacc provides disambiguation declarations
This is covered in detail in practicals
Example Grammar for prog lang¶
Consider a language consisting of semicolon (;) separated list of statements (except the last statement), where
- A statement can be
- id := expr
- print(expression list)
- expr can be expr + expr/num/id/ ( statement list, expr)
- expression list is comma-separated list of expressions
Trees¶
Parse Tree | Syntax Tree | |
---|---|---|
Alternate Name | Concrete Syntax Tree | Abstract Syntax Tree |
Grammar symbols? | ā | ā (only terminals) |
Example |
Parsing¶
Parser decides
- which production rule is to be used, when required
- what is the next token
- Reserved word
if
, open paranthesis - what is the structure to be built
if
statement, expression
Types of Parsers¶
Bottom-Up | Top-Down | |
---|---|---|
Alternate Names | LR Shift-Reduction | LL Derivation |
Finds rightmost derivation in reverse order | ||
Parse tree construction | leaves \(\to\) root | root \(\to\) leaves |
Start | Input string | Start symbol |
End | Start symbol | Input string |
Steps | Shift & Reduction | Replace leftmost nonterminal w/ production rule |
Automatic Tools | Handwritten parsers Predictive parsers | |
Size of Grammar class | Larger | Smaller |
Handle¶
Substring matching right side of a production rule, which gets reduced in a manner that is reverse of rightmost derivation
Not every substring that matches the right side of a production rule is a handle; only the one that gets chosen for reduction
Sentential Form¶
Any string derivable from the start symbol
A derivation is a series of rewrite steps
Non-terminals in \(\gamma_i\) | \(\gamma_i\) is |
---|---|
\(0\) | Sentence in \(L(G)\) |
\(\ge 1\) | Sentential Form |
Right sentential form occurs in rightmost derivation. If the grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle
Handle Pruning¶
A right-most derivation in reverse can be obtained by handle-pruning.
- Start from \(\gamma_n\), find a handle \(A_n \to \beta_n\) in \(\gamma_n\)
- Replace \(\beta_n\) by \(A_n\) to get \(\gamma_{n-1}\)
- Repeat the same until we reach \(S\)
Shift-Reduce Parsing¶
- Initial stack only contains end-marker
$
- End of input string is marked by end-marker
$
Shift input symbols into stack until reduction can be applied
Parser has to find the right handles
If a shift-reduce parser cannot be used for a grammar, that grammar is called as non-LR(k) grammar; ambiguous grammar can never be LR grammar.
Steps¶
Step | Action |
---|---|
Shift | New input symbol pushed to stack |
Reduction | Replace handle at top of stack by non-terminal |
Accept | Successful completion of parsing |
Error | Syntax error discovered Parser calls error recovery routine |
Types¶
-
Operator-Precedence Parser
-
LR Parser
There are 3 sub-types; only their parsing tables are different
- Simple LR
- LookAhead LR (intermediate)
- Canonical LR (most general)
Yacc creates LALR
Conflicts¶
Type | |
---|---|
Shift-Reduce | - Associativity & precedence not ensured - Default action: prefer shift |
Reduce-Reduce |
Resolving conflicts¶
Easy? | |
---|---|
Rewrite grammar | ā |
Yacc_Directives.md | ā |
Solving stack questions¶
Stack | Input | Action |
---|---|---|
$ | somethign |
LR(\(k\)) Parsing¶
Meaning
- Left-right scanning
- Rightmost derivation
- Lookahead of \(k\) i/p symbols at each step; if \(k\) not mentioned, \(k=1\)
Class of grammars parsable by LR is proper superset of class of grammars parsable with predictive parsers
Can detect syntactic error as soon it performs left-to-right scan of input
Why?¶
It is the most __ shift-reducing parser
- Efficient
- General
- Non-backtacking
LR Parsing Algorithm¶
Stack | Remaining Input |
---|---|
\(S_0 X_1 S_1 \dots X_m S_m\) | \(a_i a_{i+1} \dots a_n \$\) |
where
- \(X_i\) is a terminal/non-terminal
- \(S_i\) is a state
The parser action is determined by \(S_m\), \(a_i\), and parsing action table
Actions¶
action[\(S_m, a_i\)] | Meaning |
---|---|
Shift s | Shift new input symbol and next state \(s_i\) into stack |
Reduce \(A \to \beta\) (or) \(r_j\) | 1. Reduce by production no \(j\) 2. Pop 2 \(\vert \beta \vert\) items from stack (grammar symbol & state symbol) 3. Use goto table 4. Push \(A\) and \(s\) where \(s=\text{goto} [s_{m-r}, A]\) (current input symbol not affected) |
acc | Accept |
blank | Error |
Example¶
Parse id*id+id$
using