Syntax Analysis

Introduction to parsers
Context-free grammars
Push-down automata
Top-down parsing
LL grammars and parsers
Bottom-up parsing
LR grammars and parsers
Bison/Yacc - parser generators
Error Handling: Detection & Recovery

Introduction to parsers

[Figure: block diagram with the labels: source, Lexical Analyzer, token / next token, Parser (CFG), syntax tree, Semantic Analyzer, code, Symbol Table]

Context Free Grammar

CFG & Terminology
Rewrite vs. Reduce
Derivation
Language and CFL
Equivalence & CNF
Parsing vs. Derivation
lm/rm derivation & parse tree
Ambiguity & resolution
Expressive power

Derivation is the reverse of parsing. If we know how sentences are derived, we may find a parsing method in the reverse direction.

CFG: An Example

Terminals: id, ‘+’, ‘-’, ‘*’, ‘/’, ‘(’, ‘)’
Nonterminals: expr, op
Productions:
expr → expr op expr
expr → ‘(’ expr ‘)’
expr → ‘-’ expr
expr → id
op → ‘+’ | ‘-’ | ‘*’ | ‘/’
The start symbol: expr

Notational Conventions in CFG

• a, b, c, …, [+-0-9], id: symbols in Σ
• A, B, C, …, S, expr, stmt: symbols in N
• U, V, W, …, X, Y, Z: grammar symbols in (Σ ∪ N)
• α, β, γ, … denote strings in (Σ ∪ N)*
• u, v, w, … denote strings in Σ*
• A → α | β | γ is an abbreviation of A → α, A → β, A → γ
• Alternatives: α, β, γ at RHS

Context-Free Grammars

A set of terminals: basic symbols from which sentences are formed

A set of nonterminals: syntactic variables denoting sets of strings

A set of productions: rules specifying how the terminals and nonterminals can be combined to form sentences

The start symbol: a distinguished nonterminal denoting the language

CFG: Components
Specification for Structures & Constituency

• CFG: formal specification of structure (parse trees)
– G = {Σ, N, P, S}
– Σ: terminal symbols
– N: non-terminal symbols
– P: production rules
– S: start symbol

CFG: Components

• Σ: terminal symbols
– the input symbols of the language

• programming language: tokens (reserved words, variables, operators, …)

• natural languages: words or parts of speech

– pre-terminal: parts of speech (when words are regarded as terminals)

• N: non-terminal symbols– groups of terminals and/or other non-terminals

• S: start symbol: the largest constituent of a parse tree

CFG: Components

• P: production (re-writing) rules
– form: A → β (A: non-terminal, β: string of terminals and non-terminals)
– meaning: A re-writes to (“consists of”, “derives into”) β, or β reduces to A
– start with “S-productions” (S → β)

Derivations

A derivation step is an application of a production as a rewriting rule:

E ⇒ - E

A sequence of derivation steps

E ⇒ - E ⇒ - ( E ) ⇒ - ( id )

is called a derivation of “- ( id )” from E.

The symbol ⇒* denotes “derives in zero or more steps”; the symbol ⇒+ denotes “derives in one or more steps”.

CFG: Accepted Languages

• Context-Free Language
– Language accepted by a CFG
  • L(G) = { ω | S ⇒+ ω } (strings of terminals that can be derived from the start symbol)
– Proof of acceptance: by induction
  • On the number of derivation steps
  • On the length of input string

Context-Free Languages

A context-free language L(G) is the language defined by a context-free grammar G.

A string of terminals ω is in L(G) if and only if S ⇒+ ω; ω is called a sentence of G.

If S ⇒* α, where α may contain nonterminals, then we call α a sentential form of G.

E ⇒ - E ⇒ - ( E ) ⇒ - ( id )

G1 is equivalent to G2 if L(G1) = L(G2).

CFG: Equivalence

• Chomsky Normal Form (CNF) (Chomsky, 1963):
– ε-free, and
– Every production rule is in either of the following forms:
  • A → A1 A2 [two non-terminals: A1, A2], or
  • A → a [a terminal: a]
– i.e., two non-terminals or one terminal at the RHS
• Properties:
– Generates binary parse trees
– Good simplification for some algorithms
  • e.g., grammar training with the inside-outside algorithm (Baker 1979)
– Good tool for theoretical proofs
  • e.g., time complexity

CFG: Equivalence

• Every CFG can be converted into a weakly equivalent CNF
– equivalence: L(G1) = L(G2)
• strong equivalence: the grammars assign the same phrase structure to each sentence (except for renaming non-terminals)
• weak equivalence: the grammars generate the same language but do not assign the same phrase structure to each sentence
– e.g., A → B C D == {A → B X, X → C D}

CFG: An Example

Terminals: id, ‘+’, ‘-’, ‘*’, ‘/’, ‘(’, ‘)’
Nonterminals: E, op
Productions:
E → E op E …[R1]
E → ‘(’ E ‘)’ …[R2]
E → ‘-’ E …[R3]
E → id …[R4]
op → ‘+’ | ‘-’ | ‘*’ | ‘/’

The start symbol: E

Left- & Right-most Derivations

Each derivation step needs to choose
– a nonterminal to rewrite
– an alternative to apply

A leftmost derivation always chooses the leftmost nonterminal to rewrite:

E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )

A rightmost (canonical) derivation always chooses the rightmost nonterminal to rewrite:

E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )

Left- & Right-most Derivations

Representation of leftmost/rightmost derivations:
Use the sequence of productions (or production numbers) to represent a derivation sequence.

Example:
E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )
=> [3], [2], [1], [4], [4] (~ R3, R2, R1, R4, R4)

Advantage: a compact representation for a parse tree (data compression)
Each parse tree has a unique leftmost/rightmost derivation

[Figure: parse tree annotated with the productions R3, R2, R1]

Parse Trees

A parse tree is a graphical representation for a derivation that filters out the order of choosing nonterminals for rewriting

[Figure: parse-tree fragment over the words “girl in the park”, with NP and PP nodes]

Context Free Grammar (CFG): Specification for Structures & Constituency

• Parse Tree: graphical representation of structure
– Root node (S): a sentential-level structure
– Internal nodes: constituents of the sentence
– Arcs: relationship between parent nodes and their children (constituents)
– Terminal nodes: surface forms of the input symbols (e.g., words)
• Bracketed notation: alternative representation
– e.g., [I saw [the [girl [in [the park]]]]]

Parse Tree: “I saw the girl in the park”

[Figure: 1st parse tree of “I saw the girl in the park”; part-of-speech tags: pron v det n p det n; constituents: S, VP, NP, PP]

Parse Tree: “I saw the girl in the park”

[Figure: 2nd parse tree of “I saw the girl in the park”; part-of-speech tags: pron v det n p det n; constituents: S, VP, NP, PP]

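LM & RM: An Example

[Figure: parse tree for - ( id + id ): the root E has children - and E; that E expands to ( E ); the inner E expands to E + E; each of those E's expands to id]

E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )

E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )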

Parse Trees & Derivations

Many derivations may correspond to the same parse tree, but every parse tree has associated with it a unique leftmost and a unique rightmost derivation

Ambiguous Grammar

A grammar is ambiguous if it produces more than one parse tree for some sentence, i.e., more than one leftmost (or more than one rightmost) derivation

E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
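The ambiguity can be checked mechanically. The sketch below is only an illustration, assuming the simplified grammar E → E + E | E * E | id used in the two derivations above; the function name count_parse_trees is made up for this example. It counts the distinct parse trees of a token sequence by trying every operator as the root.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E + E | E * E | id."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways to derive toks[i:j] from E.
        if j - i == 1:
            return 1 if toks[i] == "id" else 0
        total = 0
        # Try every operator position k as the root of the tree.
        for k in range(i + 1, j - 1):
            if toks[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))


# Two trees: (id + (id * id)) and ((id + id) * id).
print(count_parse_trees("id + id * id".split()))
```

A count greater than 1 for any sentence is enough to show that the grammar is ambiguous.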

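Ambiguous Grammar

[Figure: the two parse trees for id + id * id: one with + at the root and id * id as its right subtree, the other with * at the root and id + id as its left subtree]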

Resolving Ambiguity

Use disambiguating rules to throw away undesirable parse trees

Rewrite grammars into unambiguous grammars by incorporating the disambiguating rules

An Example

The dangling-else grammar

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

Two parse trees for
if E1 then if E2 then S1 else S2

[Figure: the two parse trees for “if E1 then if E2 then S1 else S2”: in one the else belongs to the outer if, in the other to the inner if]

Preferred parse: closest then

Disambiguating Rules

Rule: match each else with the closest previous unmatched then

Remove undesired state transitions in the pushdown automaton (parser)
– shift/reduce conflict on “else”
– 1st parse: reduce
– 2nd parse: shift

Grammar Rewriting

stmt → m_stmt | unm_stmt
m_stmt → if expr then m_stmt else m_stmt | other
unm_stmt → if expr then stmt | if expr then m_stmt else unm_stmt

– m_stmt: a statement with only paired then-else
– The statement between then and else must be an m_stmt, so it cannot contain an unmatched then; this is the then-else pair we want matched

RE vs. CFG

Every language described by a RE can also be described by a CFG

Example: (a|b)*abb
A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε

(1) Right branching
(2) Starts with a terminal symbol

RE vs. CFG

Regular Grammar:
• Right branching
• Starts with a terminal symbol

[Figure: right-branching parse tree for a string in (a|b)* abb: A0 repeatedly expands to a (or b) followed by A0, then to a A1, b A2, b A3]

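RE vs. CFG

RE: (a | b)*abb

[Figure: NFA with states 0 to 3, labeled A0 to A3; start state 0 loops on a and b; 0 goes to 1 on a, 1 to 2 on b, 2 to 3 on b]

A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε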

RE vs. CFG

A DFA for (a | b)*abb

[Figure: DFA with states 0 to 3, labeled A0 to A3; start state 0]

A0 → b A0 | a A1
A1 → a A1 | b A2
A2 → a A1 | b A3
A3 → a A1 | b A0 | ε

CFG: Expressive Power (cont.)

• Writing a CFG for a FSA (RE)
– define a non-terminal Ni for a state with state number i
– start symbol S = N0 (assuming that state 0 is the initial state)
– for each transition δ(i,a)=j (from state i to state j on input symbol a), add a new production Ni → a Nj to P (if a == ε, add Ni → Nj)
– for each final state i, add a new production Ni → ε to P
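The construction above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name fsa_to_grammar, the tuple encoding of transitions, and the use of the empty string for an ε-move are all assumptions made for the example. The sample automaton is the NFA for (a|b)*abb shown earlier.

```python
def fsa_to_grammar(transitions, finals, start=0):
    """Build right-linear productions from an FSA.

    transitions: iterable of (i, a, j); a == "" stands for an epsilon move.
    Returns (start_symbol, productions), where each production is
    (lhs, rhs_tuple) and the empty tuple is the epsilon right-hand side.
    """
    productions = []
    for i, a, j in transitions:
        # delta(i, a) = j  gives  N_i -> a N_j   (or N_i -> N_j for epsilon)
        rhs = (f"N{j}",) if a == "" else (a, f"N{j}")
        productions.append((f"N{i}", rhs))
    for i in sorted(finals):
        productions.append((f"N{i}", ()))  # final state: N_i -> epsilon
    return f"N{start}", productions


# NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b lead to state 3.
nfa = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
start, prods = fsa_to_grammar(nfa, finals={3})
for lhs, rhs in prods:
    print(lhs, "->", " ".join(rhs) or "ε")
```

The printed productions match the grammar A0 … A3 given for the NFA, with Ni in place of Ai.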

CFG: Expressive Power

• CFG vs. Regular Expression (R.E.)
– Every R.E. can be recognized by a FSA
– Every FSA can be represented by a CFG with production rules of the form: A → a B | ε
– (known as a “Regular Grammar”)
• Therefore, L(RE) ⊂ L(CFG)

CFG: Expressive Power (cont.)

• Chomsky Hierarchy:
– R.E.: Regular set (recognized by FSAs)
– CFG: Context-free (Pushdown automata)
– CSG: Context-sensitive (Linear bounded automata)
– Unrestricted: Recursively enumerable (Turing Machine)

Push-Down Automata

[Figure: a push-down automaton drawn as a Finite Automata control unit connected to Input, Output and a Stack]

RE vs. CFG

Why use REs for lexical syntax?
– do not need a notation as powerful as CFGs
– are more concise and easier to understand than CFGs
– More efficient lexical analyzers can be constructed from REs than from CFGs
– Provide a way for modularizing the front end into two manageable-sized components

CFG vs. Finite-State Machine

• Inappropriateness of FSA– Constituents: only terminals

– Recursion: do not allow A => … B … => … A …

• RTN (Recursive Transition Network)– FSA with augmentation of recursion

– arc: terminal or non-terminal

– if arc is non-terminal: call to a sub-transition network & return upon traversal

Nonregular Constructs

REs can denote only a fixed number of repetitions or an unspecified number of repetitions of one given construct, e.g. a*b*

A nonregular construct:
– L = {a^n b^n | n ≥ 1}

Non-Context-Free Constructs

CFGs can denote only a fixed number of repetitions or an unspecified number of repetitions of one or two (paired) given constructs, e.g. a^n b^n

Some non-context-free constructs:
– L1 = {wcw | w is in (a | b)*}
  • declaration/use of identifiers
– L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  • #formal arguments/#actual arguments
– L3 = {a^n b^n c^n | n ≥ 0}
  • e.g., b: backspace, c: underscore

Context-Free Constructs

FA (RE) cannot keep counts
CFGs can keep count of two items but not three

Similar context-free constructs:
– L’1 = {w c w^R | w is in (a | b)*, R: reverse order}
– L’2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
– L’’2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
– L’3 = {a^n b^n | n ≥ 1}

CFG Parsers

Types of CFG Parsers

Universal: can parse any CFG (CYK, Earley)
– CYK: exhaustively matches sub-ranges of input tokens against grammar rules, from smaller ranges to larger ranges
– Earley: exhaustively enumerates possible expectations from left to right, according to the current input token and the grammar

Non-universal: not all CFGs can be parsed (e.g., recursive descent parser)

Universal (applicable to all grammars) is NOT always efficient

Types of CFG Parsers

Practical Parsers: [“what is a good parser?”]

Simple: simple program structure
– Left-to-right (or right-to-left) scan
  • middle-out or island-driven is often not preferred
– Top-down or bottom-up matching

Efficient: efficient for good/bad inputs
– Parse normal syntax quickly
– Detect errors immediately on next token

Deterministic: no alternative choices during parsing given next token
– Small lookahead buffer (also contributes to efficiency)

Types of CFG Parsers

Top Down: matching from the start symbol down to terminal tokens

Bottom Up: matching input tokens with reducible rules, from terminals up to the start symbol

Efficient CFG Parsers

Top Down: LL Parsers
– Matching from the start symbol down to terminal tokens, left-to-right, according to a leftmost derivation sequence

Bottom Up: LR Parsers
– Matching input tokens with reducible rules, left-to-right, from terminals up to the start symbol, in the reverse order of a rightmost derivation sequence

Efficient CFG Parsers

Efficient & Deterministic Parsing: only possible for some subclasses of grammars with special parsing algorithms

Top Down: parsing LL Grammars with LL Parsers

Bottom Up: parsing LR Grammars with LR Parsers
– LR grammars are a larger class of grammars than LL grammars

Parsing Table Construction for Efficient Parsers

Parsing Table: a pre-computed table (according to the grammar), indicating the appropriate action(s) to take in any predefined state when some input token(s) is/are under examination

Lookahead symbol(s): the input symbol(s) under examination for determining next action(s)

[Table: rows State-0 (action-1, action-3), State-1 (action-2, action-5), State-2 (action-4); columns are the lookahead symbols id, +, *, num]

Good parsers do not change their code when the grammar is revised. Table driven.

Parsing Table Construction:
– Decide a pre-defined number of lookaheads to use for predicting the next state
– Define and enumerate all the unique states for the parsing method
– Decide the actions to take in all states with all possible lookahead(s)

X-Parser: you can invent any parser and call it the X-Parser
– But its parsing algorithm may not handle all grammars deterministically, and thus efficiently.

X-Grammar: any grammar whose parsing table for the X-parsing method/X-Parser has no conflicting actions in any state

Non-X Grammar: has more than one action to take in some state

k: the number of lookahead symbols used by a parser to determine the next action
– A larger number of lookahead symbols tends to make conflicting actions less likely
– But may result in a much larger table that grows exponentially with the number of lookaheads
– Does not guarantee freedom from ambiguity for some grammars (inherently ambiguous), even with infinite lookahead

X(k) Parser: an X Parser that uses k lookahead symbols to determine the next action

X(k) Grammar: any grammar deterministically parsable with an X(k) Parser

Types of Grammars Capable of Efficient Parsing

LL(k) Grammars
– Grammars that can be deterministically parsed using an LL(k) parsing algorithm
– e.g., LL(1) grammar

LR(k) Grammars
– Grammars that can be deterministically parsed using an LR(k) parsing algorithm
– e.g., SLR(1) grammar, LR(1) grammar, LALR(1) grammar

Top-Down CFG Parsers

Recursive Descent Parser

vs.

Non-Recursive LL(1) Parser

Top-Down Parsing

Construct a parse tree from the root to the leaves using leftmost derivation

S → c A B
A → a b | a
B → d
input: cad

[Figure: four snapshots of the parse tree under S → c A B, with the leaves below A and B shown as: none; a; a b; a d]

Predictive Parsing

A top-down parsing without backtracking
– there is only one alternative production to choose at each derivation step

stmt → if expr then stmt else stmt
     | while expr do stmt
     | begin stmt_list end

LL(k) Parsing

The first L stands for scanning the input from left to right

The second L stands for producing a leftmost derivation

The k stands for the number of input symbols for lookahead used to choose alternative productions at each derivation step

LL(1) Parsing

Use one input symbol of lookahead
– Same as recursive-descent parsing
– But, non-recursive predictive parsing

Recursive Descent Parsing (more)

The parser consists of a set of (possibly recursive) procedures

Each procedure is associated with a nonterminal of the grammar

The calling sequence of procedures in processing the input implicitly defines a parse tree for the input

type → simple | id | array [ simple ] of type

simple → integer | char | num dotdot num

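[Figure: parse tree for the input below: type expands to array [ simple ] of type; the first simple expands to num dotdot num; the final type expands to simple, which expands to integer]

array [ num dotdot num ] of integer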

An Example

procedure type;
begin
  if lookahead is in { integer, char, num } then
    simple
  else if lookahead = id then
    match(id)
  else if lookahead = array then begin
    match(array); match('['); simple; match(']'); match(of); type
  end
  else error
end;

procedure match(t : token);
begin
  if lookahead = t then
    lookahead := nexttoken
  else error
end;

procedure simple;
begin
  if lookahead = integer then
    match(integer)
  else if lookahead = char then
    match(char)
  else if lookahead = num then begin
    match(num); match(dotdot); match(num)
  end
  else error
end;
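The three procedures above translate almost line for line into a runnable program. The following is a sketch under a few assumptions that are not in the slides: tokens arrive as a list of strings, "$" is appended as an end marker, errors are reported by raising a made-up ParseError exception, and the procedure for type is named type_ to avoid shadowing a Python builtin.

```python
class ParseError(Exception):
    """Raised when the lookahead does not fit any alternative."""


class TypeParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input (assumed)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        # Consume the lookahead if it is the expected token t.
        if self.lookahead != t:
            raise ParseError(f"expected {t!r}, got {self.lookahead!r}")
        self.pos += 1

    def type_(self):
        # type -> simple | id | array [ simple ] of type
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "id":
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.lookahead!r} in type")

    def simple(self):
        # simple -> integer | char | num dotdot num
        if self.lookahead in ("integer", "char"):
            self.match(self.lookahead)
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.lookahead!r} in simple")


def parse_type(tokens):
    """Return True if tokens form a complete `type`; raise ParseError otherwise."""
    parser = TypeParser(tokens)
    parser.type_()
    parser.match("$")
    return True


print(parse_type("array [ num dotdot num ] of integer".split()))
```

The call sequence type_ → simple → type_ → simple mirrors the parse tree of the input, which is the sense in which the procedure calls implicitly define the tree.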

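LL(k) Constraint: Left Recursion

A grammar is left recursive if it has a nonterminal A such that A ⇒+ A α

A → A α | β   is equivalent to   A → β R, R → α R | ε

[Figure: left-branching parse tree for A → A α | β next to the right-branching tree for A → β R, R → α R | ε; both derive β α*]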

Direct/Immediate Left Recursion

A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn

is equivalent to …

A → β1 A' | β2 A' | ... | βn A'
A' → α1 A' | α2 A' | ... | αm A' | ε

Both generate (β1 | β2 | ... | βn ) (α1 | α2 | ... | αm )*

A → A αi | βj (i=1,m ; j=1,n)
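The rewriting above is mechanical. The sketch below is one way to express it, with assumptions made for the example: a right-hand side is a tuple of symbol names, the empty tuple stands for ε, the new nonterminal is named by appending a prime, and the function name is invented here. It also assumes the grammar has no cycle A → A, as the general algorithm requires.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn without left recursion.

    alternatives: list of tuples of symbols; () is the epsilon alternative.
    Returns a dict mapping each nonterminal to its new list of alternatives.
    """
    # The alpha_i: what follows the leading A in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    # The beta_j: every alternative that does not start with A.
    betas = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not alphas:
        return {nt: list(alternatives)}  # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + (new_nt,) for beta in betas],                  # A  -> beta_j A'
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],    # A' -> alpha_i A' | eps
    }


# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | eps
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```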

E → E + T | T
T → T * F | F
F → ( E ) | id

becomes

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

Indirect Left Recursion

G0:
S → A a | b
A → A c | S d | ε

Problem: Indirect Left-Recursion: S ⇒ A a ⇒ S d a

Solution-Step1: Indirect to Direct Left-Recursion:
A → A c | A a d | b d | ε

Solution-Step2: Direct Left-Recursion to Right-Recursion:
S → A a | b
A → b d A' | A'
A' → c A' | a d A' | ε

• Scan rules top-down
• Do not start with symbols defined earlier (=> substitute them if any)
• Resolve direct recursion

Indirect Left Recursion

Input. Grammar G with no cycles or ε-productions.
Output. An equivalent grammar with no left recursion.

1. Arrange the nonterminals in some order A1, A2, ..., An
2. for i := 1 to n do begin
     // Step1: substitute leading symbols of Ai that are previous Aj's
     for j := 1 to i - 1 do begin
       replace each production of the form Ai → Aj γ ( j < i )
       by the productions Ai → δ1 γ | δ2 γ | ... | δk γ,
       where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions;
     end
     // Step2
     eliminate direct left recursion among the Ai-productions
   end

Left Factoring

Two alternatives of a nonterminal A have a nontrivial common prefix α if α ≠ ε and

A → α β1 | α β2

Left factoring rewrites them as

A → α A'
A' → β1 | β2

S → i E t S | i E t S e S | a
E → b

becomes

S → i E t S S' | a
S' → e S | ε
E → b

Top-Down Parsing: as Stack Matching

Construct a parse tree from the root to the leaves using leftmost derivation

S → c A B
A → a b | a
B → d
input: cad

[Figure: four snapshots of the parse tree under S → c A B, with the leaves below A and B shown as: none; a; a b; a d]

Nonrecursive Predictive Parsing – General State

[Figure: table-driven predictive parser. A parsing program (parser/driver) reads the input a b c … x y z, consults the parsing table, maintains a stack with X on top, and writes the output.]

Predictive: pre-computed parsing actions, M[X,a] = {X -> Y1 Y2 … Yk}

Non-Recursive: “Stack + Driver Program” (instead of recursive procedures)

Nonrecursive Predictive Parsing – Expand Non-terminal

[Figure: the same parser after applying M[X,a] = {X -> Y1 Y2 … Yk}: X is replaced on the stack by Y1 Y2 … Yk, with Y1 on top; the input is still a b c … x y z]

Nonrecursive Predictive Parsing – Match Terminal

[Figure: the same parser with Y1 Y2 … Yk on the stack; the top of the stack is marked “=a”, i.e. it equals the current input symbol a]

Nonrecursive Predictive Parsing - Error Recovery

[Figure: the same parser with Y1 Y2 … Yk on the stack and input a b c … x y z; the figure carries the labels “=a” and “=c”]


Stack Operations

Match
– when the top stack symbol is a terminal and it matches the input symbol, pop the top stack symbol and advance the input pointer

Expand
– when the top stack symbol is a nonterminal, replace this symbol by the right hand side of one of its productions
  • Leftmost RHS symbol at Top-of-Stack

type → simple | id | array [ simple ] of type

simple → integer | char | num dotdot num

An Example

(E = expand, M = match; the top of the stack is on the right)

Action  Stack                           Input
E       type                            array [ num dotdot num ] of integer
M       type of ] simple [ array        array [ num dotdot num ] of integer
M       type of ] simple [              [ num dotdot num ] of integer
E       type of ] simple                num dotdot num ] of integer
M       type of ] num dotdot num        num dotdot num ] of integer
M       type of ] num dotdot            dotdot num ] of integer
M       type of ] num                   num ] of integer
M       type of ]                       ] of integer
M       type of                         of integer
E       type                            integer
E       simple                          integer
M       integer                         integer

Parsing program

push $S onto the stack, where S is the start symbol
set ip to point to the first symbol of w$;   // try to match S$ with w$
repeat
  let X be the top stack symbol and a the symbol pointed to by ip;
  if X is a terminal or $ then
    if X = a then
      pop X from the stack and advance ip
    else error   // or error_recovery()
  else   // X is a nonterminal
    if M[X, a] = X → Y1 Y2 ... Yk then
      pop X from the stack and push Yk ... Y2 Y1 onto the stack
    else error   // or error_recovery()
until X = $
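The driver loop above can be written out as a small runnable program. This is a sketch, not the only possible encoding: the table is hard-coded for the expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id, a missing dictionary key plays the role of an error entry, and the names TABLE, NONTERMINALS, ParseError and ll1_parse are invented for this example.

```python
EPS = ()  # empty right-hand side

# M[X, a] for the expression grammar; absent keys are error entries.
TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): EPS, ("E'", "$"): EPS,
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "+"): EPS, ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): EPS, ("T'", "$"): EPS,
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


class ParseError(Exception):
    pass


def ll1_parse(tokens, start="E"):
    """Return the productions applied, in order (a leftmost derivation)."""
    stack = ["$", start]            # top of the stack is the end of the list
    inp = list(tokens) + ["$"]
    ip = 0
    output = []
    while True:
        X, a = stack[-1], inp[ip]
        if X not in NONTERMINALS:   # X is a terminal or $
            if X != a:
                raise ParseError(f"expected {X!r}, got {a!r}")
            stack.pop()             # match: pop X and advance ip
            ip += 1
            if X == "$":
                return output
        else:
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise ParseError(f"no entry M[{X}, {a}]")
            stack.pop()             # expand: replace X by Yk ... Y1
            stack.extend(reversed(rhs))
            output.append((X, rhs))


for lhs, rhs in ll1_parse("id + id * id".split()):
    print(lhs, "->", " ".join(rhs) or "ε")
```

Changing the grammar only changes TABLE and NONTERMINALS; the driver itself stays the same, which is the point of a table-driven parser.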

Parser Driven by a Parsing Table: Non-recursive Descent

X() { // WITHOUT ε-production: X→ε
  if (LA=‘a’) then
    Y1(); Y2(); … Yk();
  else if (LA=‘b’)
    Z1(); Z2(); …; Zm();
  else ERROR(); // no X→ε
  // else RETURN; if X→ε exists
} // Recursive descent procedure for matching X

      a                 b                 c    d
X     X → Y1 Y2 … Yk    X → Z1 Z2 … Zm
Y1    Y1 → α1           Y1 → α2
Z1    Z1 → β1           Z1 → β2

‘a’ in FirstSet( Y1 Y2 … Yk )
‘b’ in FirstSet( Z1 Z2 … Zm )

Parser Driven by a Parsing Table: Non-recursive Descent

X() { // WITH ε-production: X→ε
  if (LA=‘a’) then
    Y1(); Y2(); … Yk();
  else if (LA=‘b’)
    Z1(); Z2(); …; Zm();
  // else ERROR(); // no X→ε
  else if (LA=??) RETURN; // if X→ε exists
} // Recursive descent procedure for matching X

      a                 b                 c    d
X     X → Y1 Y2 … Yk    X → Z1 Z2 … Zm         X → ε
Y1    Y1 → α1           Y1 → α2
Z1    Z1 → β1           Z1 → β2

‘a’ in FirstSet( Y1 Y2 … Yk )
‘b’ in FirstSet( Z1 Z2 … Zm )
‘d’ in FollowSet(X) (S ⇒* … X d …)

First Sets: Predictive Parsing

The first set of a string α is the set of terminals that begin the strings derived from α. If α ⇒* ε, then ε is also in the first set of α.
– ε is used simply to flag whether α can be null when computing the First Set
– It is not used for matching any real input when parsing

FIRST(α) = {a | α ⇒* a β} ∪ {ε, if α ⇒* ε}
FIRST(α) includes {ε}: means that α ⇒* ε

Compute First Sets

– If X is a terminal, then FIRST(X) is {X}
– If X is a nonterminal and X → ε is a production, then add ε to FIRST(X)
– If X is a nonterminal and X → Y1 Y2 ... Yk is a production, then add a to FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1)
– If ε is in FIRST(Yj) for all j, then add ε to FIRST(X)

Follow Sets: Matching Empty

What to do with matching null: A → ε?
– TD Recursive Descent Parsing: “assumes” success
– LL: more predictive => Follow Set of ‘A’

The follow set of a nonterminal A is the set of terminals that can appear immediately to the right of A in some sentential form; namely, if

S ⇒* α A a β

then a is in the follow set of A.

Compute Follow Sets

– Initialization: place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
– If there is a production A → α B β, then everything in FIRST(β) except for ε is placed in FOLLOW(B)
  • ε is not considered a visible input to follow any symbol
– If there is a production A → α B, or A → α B β where FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B)
  • S ⇒* … A a … implies S ⇒* … α B a …
  • YES: “every symbol that can follow A will also follow B”
  • NO!: “every symbol that can follow B will also follow A”

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
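These sets can be reproduced with a fixed-point computation that applies the FIRST and FOLLOW rules until nothing changes. The sketch below is one way to do it; the dictionary encoding of the grammar, the use of the string "ε" as a marker, and all function names are assumptions made for the example.

```python
EPS = "ε"

# Grammar as: nonterminal -> list of alternatives; [] is the epsilon alternative.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a symbol string, given the FIRST sets of the nonterminals."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every symbol was nullable, or the string is empty
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat until no FIRST set grows
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:  # repeat until no FOLLOW set grows
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue  # FOLLOW is defined for nonterminals only
                    tail = first_of_string(alt[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:  # the rest can vanish: inherit FOLLOW(nt)
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
for nt in GRAMMAR:
    print(nt, sorted(FIRST[nt]), sorted(FOLLOW[nt]))
```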

Constructing Parsing Table

Input. Grammar G.

Output. Parsing Table M.

Method.

1. For each production A → α of the grammar, do steps 2 and 3.

2. For each terminal a in FIRST(α), add A → α to M[A, a].

3. If ε is in FIRST(α) [A ⇒* ε], add A → α to M[A, b] for each terminal b [including ‘$’] in FOLLOW(A).

- If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].

4. Make each undefined entry of M be error.
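The method above is short enough to sketch directly. In this illustration the FIRST set of each right-hand side and the FOLLOW sets are supplied by hand for the expression grammar (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id) rather than computed; the function name build_ll1_table and the data layout are assumptions. Each table cell holds a list of production indexes, so a cell with more than one index is a conflict and a missing key is an error entry.

```python
EPS = "ε"

# (A, right-hand side); the empty tuple is the epsilon production.
PRODUCTIONS = [
    ("E", ("T", "E'")),        # 0
    ("E'", ("+", "T", "E'")),  # 1
    ("E'", ()),                # 2
    ("T", ("F", "T'")),        # 3
    ("T'", ("*", "F", "T'")),  # 4
    ("T'", ()),                # 5
    ("F", ("(", "E", ")")),    # 6
    ("F", ("id",)),            # 7
]
# FIRST of each right-hand side, in the same order (given, not computed here).
FIRST_OF_RHS = [{"(", "id"}, {"+"}, {EPS}, {"(", "id"}, {"*"}, {EPS}, {"("}, {"id"}]
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def build_ll1_table(productions, first_of_rhs, follow):
    """Return {(A, a): [production indexes]}; absent keys are error entries."""
    table = {}
    for idx, (lhs, _rhs) in enumerate(productions):
        lookaheads = first_of_rhs[idx] - {EPS}      # step 2
        if EPS in first_of_rhs[idx]:                # step 3 (covers $ as well)
            lookaheads |= follow[lhs]
        for a in lookaheads:
            table.setdefault((lhs, a), []).append(idx)
    return table


M = build_ll1_table(PRODUCTIONS, FIRST_OF_RHS, FOLLOW)
conflicts = {cell: prods for cell, prods in M.items() if len(prods) > 1}
print(len(M), "entries,", len(conflicts), "conflicts")
```

A grammar is LL(1) exactly when the conflicts dictionary comes out empty.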

LL(1) Parsing Table Construction

A() { // WITH/WITHOUT ε-productions: A → β (β ⇒* ε)
  if (LA=‘a’ in First(Y1 Y2 … Yk)) then
    Y1(); Y2(); … Yk();
  else if (LA=‘b’ in Follow(A) & ε in First(Z1 Z2 … Zm))
    Z1(); Z2(); …; Zm(); // Nullable
  else ERROR();
} // Recursive version of LL(1) parser

     a in First(α)    b in Follow(A)      c not in First(α) or Follow(A)
A    A → α            A → β (β ⇒* ε)      error
B
C

When to apply A → β (β ⇒* ε), including A → ε?

      id         +            *            (          )         $
E     E → TE'                              E → TE'
E'               E' → +TE'                            E' → ε    E' → ε
T     T → FT'                              T → FT'
T'               T' → ε       T' → *FT'               T' → ε    T' → ε
F     F → id                               F → (E)

An Example

Stack        Input             Output
$E           id + id * id$
$E'T         id + id * id$     E → TE'
$E'T'F       id + id * id$     T → FT'
$E'T'id      id + id * id$     F → id
$E'T'        + id * id$
$E'          + id * id$        T' → ε
$E'T+        + id * id$        E' → +TE'
$E'T         id * id$
$E'T'F       id * id$          T → FT'
$E'T'id      id * id$          F → id
$E'T'        * id$
$E'T'F*      * id$             T' → *FT'
$E'T'F       id$
$E'T'id      id$               F → id
$E'T'        $
$E'          $                 T' → ε
$            $                 E' → ε

LL(1) Grammars

A grammar is an LL(1) grammar if its predictive parsing table has no multiply-defined entries

A Counter Example

S → i E t S S' | a
S' → e S | ε
E → b

      a        b        e                    i                 t    $
S     S → a                                  S → i E t S S'
S'                      S' → ε, S' → e S                            S' → ε
E              E → b

e ∈ FOLLOW(S’)
e ∈ FIRST(e S)
Disambiguation: matching closest “then”

LL(1) Grammars or Not ??

A grammar G is LL(1) iff whenever A → α | β are two distinct productions of G, the following conditions hold:
– For no terminal a do both α and β derive strings beginning with a.
  • or… M[A, first(α)&first(β)] entries will have conflicting actions
– At most one of α and β can derive the empty string
  • or… M[A, follow(A)] entries have conflicting actions
– If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
  • or… M[A, first(α)&follow(A)] entries have conflicting actions

Non-LL(1) Grammar: Ambiguous According to LL(1) Parsing Table Construction

     a in First(α) & First(β)    b in Follow(A)     a in First(α) & Follow(A)
A    A → α                       A → α (α ⇒* ε)     A → α (α does not derive ε, but α ⇒* a …)
     A → β                       A → β (β ⇒* ε)     A → β (β ⇒* ε)
B
C

When will A → α & A → β appear in the same table cell ??

S' → e S | ε
X → X a | b

LL(1) Grammars or Not??

If G is left-recursive or ambiguous, then M will have at least one multiply-defined entry => non-LL(1)

E.g., X → X a | b
=> FIRST(X) = {b} (and, of course, FIRST(b) = {b})
=> M[X,b] includes both {X → X a} and {X → b}

i.e., ambiguous grammars and grammars with left-recursive productions cannot be LL(1).

No LL(1) grammar can be ambiguous

Error Recovery for LL Parsers

Syntactic Errors

• Empty entries in a parsing table:
– A syntactic error is encountered when the lookahead symbol corresponding to this entry is in the input buffer
– Error recovery information can be encoded in such entries to take appropriate actions upon error
• Error Detection:
– (1) Stacktop = x && x != input (a)
– (2) Stacktop = A && M[A, a] = empty (error)

Error Recovery Strategies

Panic mode: skip tokens until a token in a set of synchronizing tokens appears
– INS (insertion) type of errors
– sync at delimiters, keywords, …, that have clear functions

Phrase Level Recovery
– local INS (insertion), DEL (deletion), SUB (substitution) types of errors

Error Production
– define error patterns (“error productions”) in grammar

Global Correction [Grammar Correction]
– minimum distance correction

Error Recovery – Panic Mode

Panic mode: skip tokens until a token in a set of synchronizing tokens appears

Commonly used synchronizing tokens:
– SUB(A,ip): use FOLLOW(A) as sync set for A (pop A)
– use the FIRST set of a higher construct as sync set for a lower construct
– INS(ip): use FIRST(A) as sync set for A
– *ip = ε: use the production deriving ε as the default
– DEL(ip): if a terminal on the stack cannot be matched, pop the terminal

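Error Recovery – Panic Mode

[Figure: stack and input snapshots (columns Action, Stack, Input) for the three panic-mode actions.
SUB(A,ip): A is on top of the stack; input is skipped from *ip up to a token in Follow(A), then A is popped.
INS(ip): A is on top of the stack; input is skipped from *ip up to a token in First(A), and A stays on the stack.
DEL(ip): terminal x is on top of the stack and does not match *ip; x is popped.]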

Error Recovery Actions Using Follow & First Sets to Sync

Expanding non-terminal A:
– M[A,a] = error (blank): skip “a” in input = delete all such “a” (until sync with a sync symbol, b) /* panic */
– M[A,b] = sync (at FOLLOW(A)): pop “A” from stack = “b” is a sync symbol following A
– M[A,b] = A → α (== sync at FIRST(A)): expand A as α (same as normal parsing action)

Matching terminal “x”:
– (*sp=“x”) != “a”: pop(x) from stack = missing input token “x”

      id         +            *            (          )         $
E     E → TE'                              E → TE'    sync      sync
E'               E' → +TE'                            E' → ε    E' → ε
T     T → FT'    sync                      T → FT'    sync      sync
T'               T' → ε       T' → *FT'               T' → ε    T' → ε
F     F → id     sync         sync         F → (E)    sync      sync

FOLLOW(F)={+,*,),$}

FOLLOW(E)=FOLLOW(E’)={),$}

FIRST(X) is used to Expand non-ε-productions or Sync (on errors)

FOLLOW(X) is used to Expand ε-productions or Sync (on errors)

An Example

Stack        Input             Output
$E           ) id * + id$      error, skip )
$E           id * + id$        id is in FIRST(E)
$E'T         id * + id$        E → TE'
$E'T'F       id * + id$        T → FT'
$E'T'id      id * + id$        F → id
$E'T'        * + id$
$E'T'F*      * + id$           T' → *FT'
$E'T'F       + id$             error, M[F,+]=synch / FOLLOW(F)
$E'T'        + id$             F popped
$E'          + id$             T' → ε
$E'T+        + id$             E' → +TE'
$E'T         id$
$E'T'F       id$               T → FT'
$E'T'id      id$               F → id
$E'T'        $
$E'          $                 T' → ε
$            $                 E' → ε
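The trace above can be reproduced with a small panic-mode driver. This is a sketch under stated assumptions, not the course's implementation: the table is hard-coded from the sync table for the expression grammar, and all names (TABLE, SYNC, parse_with_recovery) are invented here. One rule is added as an assumption to match the first line of the trace: when the nonterminal is the only symbol left above $ and input remains, a sync entry skips the token instead of popping, so the stack is not emptied early.

```python
EPS = ()
SYNC = "sync"
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E", ")"): SYNC, ("E", "$"): SYNC,
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): EPS, ("E'", "$"): EPS,
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T", "+"): SYNC, ("T", ")"): SYNC, ("T", "$"): SYNC,
    ("T'", "+"): EPS, ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): EPS, ("T'", "$"): EPS,
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
    ("F", "+"): SYNC, ("F", "*"): SYNC, ("F", ")"): SYNC, ("F", "$"): SYNC,
}


def parse_with_recovery(tokens, start="E"):
    """Parse with panic-mode recovery; return the list of error messages."""
    stack, inp, ip, errors = ["$", start], list(tokens) + ["$"], 0, []
    while True:
        X, a = stack[-1], inp[ip]
        if X == "$":
            if a == "$":
                return errors
            errors.append(f"skip extra token {a!r}")
            ip += 1
        elif X not in NONTERMINALS:          # terminal on top of the stack
            if X == a:
                ip += 1                      # normal match
            else:
                errors.append(f"missing {X!r} before {a!r}")
            stack.pop()                      # pop in both cases
        else:
            entry = TABLE.get((X, a))
            only_symbol = len(stack) == 2    # assumption: see lead-in
            if a != "$" and (entry is None or (entry == SYNC and only_symbol)):
                errors.append(f"skip {a!r}")  # blank entry: skip input
                ip += 1
            elif entry is None or entry == SYNC:
                errors.append(f"pop {X}")     # sync entry: pop the nonterminal
                stack.pop()
            else:
                stack.pop()                   # normal expansion
                stack.extend(reversed(entry))


print(parse_with_recovery(") id * + id".split()))
```

On the input ) id * + id the driver reports the same two recovery steps as the trace: the leading ) is skipped and F is popped when + arrives.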

Parse Tree - Error Recovered

[Figure: parse tree built during the recovered parse, over the nodes E, E', T, T', F, id, ε and the tokens ), +, *]

) id * + id => id * F + id
