1
CS 1622 Lecture 6 1
CS1622
Lecture 6Introduction to Parsing
CS 1622 Lecture 6 2
2
The Front End
Parser
• Checks the stream of words and their parts of speech (producedby the scanner) for grammatical correctness
• Determines if the input is syntactically well formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code
Think of this as the mathematics of diagramming sentences
S o u r c e
c o d eS c a n n e r
I RP a r s e r
E r r o r s
tokens
CS 1622 Lecture 6 3
The Functionality of theParser Input: sequence of tokens from lexer Output: parse tree of the program
parse tree is generated if the input is alegal program
Instead of parse tree, some parsersproduce directly:
abstract syntax tree (AST) + symbol table, or intermediate code, or object code
2
CS 1622 Lecture 6 4
Comparison with LexicalAnalysis
Parse treeString oftokens
Parser
String oftokens
String ofcharacters
Scanner
OutputInputPhase
CS 1622 Lecture 6 5
Example
E
E
E E
E+
id*
idid
• The program:x * y + zInput to parser:
ID TIMES ID PLUS IDwe’ll write tokens as follows:
id * id + id
• Output of parser:the parse tree
CS 1622 Lecture 6 6
What must parser do? Not all strings of tokens are valid programs
parser must distinguish between valid andinvalid strings of tokens
Parser must expose program structure(associativity and precedence)
• parser must return the parse treeWe need:
A language for describing valid strings of tokens A method for distinguishing valid from invalid
strings of tokens (and for building the parse tree)
3
CS 1622 Lecture 6 7
Parser
lexicalanalyzer
symboltable
parsersourcetoken
get nexttoken
parsetree
rest offrontend
IR
Parsing = determining whether a string of tokenscan be generated by a grammar
CS 1622 Lecture 6 8
Grammars Precise, easy-to understand description of
syntax Context-free grammars -> efficient parsers
(automatically!) Help in translation and error detection
Eg. Attribute grammars Easier language evolution
Can add new constructs systematically
CS 1622 Lecture 6 9
Syntax Errors Many errors are syntactic or exposed
by parsing eg. Unbalanced ()
Error handling goals: Report errors quickly & accurately Recover quickly (continue parsing after
error) Little overhead on parse time
4
CS 1622 Lecture 6 10
Error Recovery Panic mode
Discard tokens until synchronization token found (often ‘;’)
Phrase level Local correction: replace a token by another and continue
Error productions Encode commonly expected errors in grammar
Global correction Find closest input string that is in L(G)
Too costly in practice
CS 1622 Lecture 6 11
Context-free Grammars Precise and easy way to specify the
syntactical structure of a programminglanguage
Efficient recognition methods exist Natural specification of many
“recursive” constructs: expr -> expr + expr | term
CS 1622 Lecture 6 12
Context-free GrammarDefinition Terminals T
Symbols which form strings of L(G), G a CFG (= tokens in thescanner), e.g. if, else, id
Nonterminals N Syntactic variables denoting sets of strings of L(G) Impose hierarchical structure (e.g., precedence rules)
Start symbol S (∈ N) Denotes the set of strings of L(G)
Productions P Rules that determine how strings are formed N -> (N|T)*
5
CS 1622 Lecture 6 13
Why are regular expressionsnot enough?
What programs are generated by?digit+ ( ( “+” | “-” | “*” | “/” ) digit+ )*
no structure! Generates a list rather than a tree
What important properties this regularexpression fails to express?
CS 1622 Lecture 6 14
Why are regular expressionsnot enough?
Write an automaton that accepts strings “a”, “(a)”, “((a))”, and “(((a)))”
cannot do - regular expressions cannot count.
“a”, “(a)”, “((a))”, “(((a)))”, … “(ka)k”
CS 1622 Lecture 6 15
Example: ExpressionGrammarexpr -> expr op exprexpr -> (expr)expr -> - exprexpr -> idop -> +op -> -op -> *op -> /op -> ^
Terminals: {id, +, -, *, /, ^}
Nonterminals {expr, op,}
Start symbol Expr
6
CS 1622 Lecture 6 16
Notational Conventions Terminals
a,b,c.. +,-,.. ‘,’.’;’ etc 0..9 expr or <expr>
Nonterminals A, B, C .. S start symbol (if present)
or first nonterminal inproduction list
Terminal strings u,v,..
Grammar symbol strings α,β
Productions A -> α
CS 1622 Lecture 6 17
Shorthands & DerivationsE -> E + E | E * E |
(E) | - E | <id> E => - E “E derives -
E” => derives in 1 step =>* derive in n (0..)
steps
CS 1622 Lecture 6 18
More Definitions L(G) language generated by G = set of strings
derived from S S =>+ w : w sentence of G (w string of terminals) S =>+ α : α sentential form of G (string can contain nonterminals) G and G’ are equivalent :⇔ L(G) = L(G’) A language generated by a grammar (of the form
shown) is called a context-free language
7
CS 1622 Lecture 6 19
ExampleG = ({-,*,(,),<id>}, {E},
E, {E -> E + E, E->E * E , E -> (E) , E-> - E, E -> <id>})
Sentence: -(<id> + <id>)Derivation:E => -E => -(E) =>
-(E+E)=>-(<id>+E) => -(<id> + <id>)• Leftmost derivation i.e.
always replace leftmostnonterminal
• Rightmost derivationanalogously
• Left /right sentential form
CS 1622 Lecture 6 20
Parse Trees
E
- E
( E )
E + E
<id> <id>
E => -E =>-(E) =>-(E+E)=>-(<id>+E) => -(<id> +<id>)
Parse tree =graphical representation
of a derivationignoring replacement
order
CS 1622 Lecture 6 21
Expressive Power CFGs are more powerful than REs
Can express matching () with CFGs Can express most properties desired for programming
languages CFGs cannot express:
Identifiers declared before used L = {wcw|w is in (a|b)*} Parameter checking (#formals = #actuals)
L ={anbmcndm|n ≥ 1, m ≥ 1}
8
CS 1622 Lecture 6 22
Parsing = determining whether a string of
tokens can be generated by a grammar Two classes based on order in which
parse tree is constructed: Top-down parsing
Start construction at root of parse tree
Bottom-up parsing Start at leaves and proceed to root
CS 1622 Lecture 6 23
Derivations and Parse TreesA derivation is a sequence of productionsA derivation can be drawn as a tree
Start symbol is the tree’s rootFor a production N -> X1…Xn add
X1 … Xn as children to node N
S!!!LLL
1 nXYY!L
X
1 nYYL
CS 1622 Lecture 6 24
Derivation Example S -> E + E | E * E E -> id | (E) Derivation for: id * id + id Derivation for: id + id + id
See board
9
CS 1622 Lecture 6 25
Notes on Derivations A parse tree has
Terminals at the leaves Non-terminals at the interior nodes
An in-order traversal of the leaves is theoriginal input
The parse tree shows the association ofoperations, the input string does not
CS 1622 Lecture 6 26
Terminals Terminals are called because there are no
rules for replacing them
Once generated, terminals are permanent
Terminals are the tokens of the languagerepresented by the grammar
CS 1622 Lecture 6 27
Left-most and Right-mostDerivations The example is a left-
most derivation At each step, replace the
left-most non-terminal
There is an equivalentnotion of a right-mostderivation
EE*Eid *Eid *E+E id *id+Eid*id+ id
®
®
®
®
®
10
CS 1622 Lecture 6 28
Right-most Derivation inDetail (1)
EEE+EE+ idE * E + idE * id + idid * id + id
®
®*®*®*®*
CS 1622 Lecture 6 29
Derivations and Parse Trees Note that right-most and left-most
derivations have the same parse tree
The difference is the order in whichbranches are added
CS 1622 Lecture 6 30
Summary of Derivations We are not just interested in whether
s e L(G) We need a parse tree for s
A derivation defines a parse tree But one parse tree may have many derivations
Left-most and right-most derivations areimportant in parser implementation
11
CS 1622 Lecture 6 31
Ambiguity (Cont.)This string has two parse trees
E
E
E E
E*
id +
idid
E
E
E E
E+
id*
idid
CS 1622 Lecture 6 32
Parse treesQuestion 1:
for each of the two parse trees, find thecorresponding left-most derivation
Question 2: for each of the two parse trees, find the
corresponding right-most derivation
CS 1622 Lecture 6 33
Ambiguity (Cont.) A grammar is ambiguous if for some string
(the following three conditions are equivalent)
it has more than one parse tree if there is more than one right-most derivation if there is more than one left-most derivation
Ambiguity is BAD Leaves meaning of some programs ill-defined
12
CS 1622 Lecture 6 34
Dealing with Ambiguity There are several ways to handle ambiguity
Most direct method is to rewrite grammarunambiguousl
Enforces precedence of * over + Separate (external of grammar) conflict-resolution
rules E.g., precedence or associativity rules (e.g. via parser
tool declarations) Same idea we saw with scanning: instead of
complicated RE’s use symbol table to recognizekeywords
'''EEE | EEid E' | id | (E)!+!"
CS 1622 Lecture 6 35
Expression Grammars(precedence) Rewrite the grammar
use a different nonterminal for each precedence level start with the lowest precedence (MINUS)
E E - E | E / E | ( E ) | id
rewrite to
E E - E | TT T / T | FF id | ( E )
CS 1622 Lecture 6 36
Exampleparse tree for id – id / id
E E - E | TT T / T | FF id | ( E )
E
E
F F
T
-
id
/
idid
T
FT T
E
13
CS 1622 Lecture 6 37
More than one parse tree? Attempt to construct a parse tree for id-
id/id that shows the wrong precedence.
Question: Why do you fail to construct this parse
tree?
CS 1622 Lecture 6 38
Associativity The grammar captures operator precedence,
but it is still ambiguous! fails to express that both subtraction and division
are left associative;
e.g., 5-3-2 is equivalent to: ((5-3)-2) and not to:(5-(3-2)).
CS 1622 Lecture 6 39
Recursion A grammar is recursive in nonterminal X if:
X + … X … recall that + means “in one or more steps, X derives a
sequence of symbols that includes an X”
A grammar is left recursive in X if: X + X …
in one or more steps, X derives a sequence of symbolsthat starts with an X
A grammar is right recursive in X if: X + … X
in one or more steps, X derives a sequence of symbolsthat ends with an X
14
CS 1622 Lecture 6 40
How to fix associativity The grammar given above is both left and
right recursive in nonterminals exp and term try at home: write the derivation steps that show
this. To correctly expresses operator associativity:
For left associativity, use left recursion. For right associativity, use right recursion.
Here's the correct grammar:E E – T | TT T / F | FF id | ( E )