Topic 2 – Syntax Analysis
Dr. William A. Maniatty, Assistant Prof.
Dept. of Computer Science, University at Albany
CSI 511 Programming Languages and Systems Concepts
Fall 2002
Monday/Wednesday 2:30-3:50, LI 99
What is Language?
Language - A systematic means of communicating by the use of sounds or conventional symbols (WordNet 1.7)
Natural Language - Used by people
Programming Language - A formal language in which computer programs are written (FOLDOC)
Languages are defined by their:
Syntax - Rules for how symbols are combined
Semantics - What the constructs mean
Why is Syntax Important?
A compiler must recognize a program's syntax and then do semantic analysis.
A good programming language syntax:
Is precise (avoids ambiguity)
Can be recognized efficiently (i.e., in O(|Program|) time)
Is easy to scan/parse (unlike Fortran)
Is succinct (unlike COBOL)
Is readable (unlike APL)
Some Notation
ε denotes the empty string
ab denotes a followed by b
a* denotes 0 or more repetitions of a Called the Kleene star (after Stephen Kleene)
a+ denotes 1 or more repetitions of a (aa*)
a|b denotes either a or b (but not both)
[a] denotes a or the empty string
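This notation maps directly onto regular expressions; a quick sketch using Python's re module (my own illustration, not from the slides — note that the slides' [a] is written a? in re syntax):

```python
import re

# a* : zero or more repetitions of a (Kleene star)
assert re.fullmatch(r"a*", "") is not None
assert re.fullmatch(r"a*", "aaa") is not None

# a+ : one or more repetitions (equivalent to aa*)
assert re.fullmatch(r"a+", "") is None
assert re.fullmatch(r"a+", "aa") is not None

# a|b : either a or b (but not both at once)
assert re.fullmatch(r"a|b", "b") is not None

# [a] in the slides' notation means "a or the empty string"; in re that is a?
assert re.fullmatch(r"a?", "") is not None
assert re.fullmatch(r"a?", "a") is not None

print("all notation examples hold")
```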
Grammar and Syntax
A formal grammar of a language L is defined as G(L) = (N, T, S, P), where:
N is the set of nonterminals (metasymbols)
T is the set of terminals
N and T are disjoint: N ∩ T = ∅
A = N ∪ T denotes the alphabet of the language
P is the set of productions
S ∈ N is the start symbol of the language
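The 4-tuple can be written down directly; a minimal Python sketch (the tiny expression grammar shown is illustrative, not from the slides):

```python
# G = (N, T, S, P) for a tiny expression grammar (illustrative only)
grammar = {
    "N": {"E", "Op"},              # nonterminals
    "T": {"id", "+", "*"},         # terminals
    "S": "E",                      # start symbol
    "P": [                         # productions as (LHS, RHS) pairs
        ("E", ["E", "Op", "E"]),
        ("E", ["id"]),
        ("Op", ["+"]),
        ("Op", ["*"]),
    ],
}

# Sanity checks from the definition: N and T are disjoint, S is a nonterminal
assert grammar["N"].isdisjoint(grammar["T"])
assert grammar["S"] in grammar["N"]
```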
Productions
A production is a rule describing allowed substitutions for nonterminals.
More formally, each production p ∈ P is an ordered pair (α, β), denoted α → β, where:
α = φAψ with φ, ψ ∈ (N ∪ T)* and A ∈ N
β = φωψ with ω ∈ (N ∪ T)*
Derivations (Some notation)
A derivation is a sequence of substitutions (applications of productions).
Each string of symbols at a stage in a derivation is a sentential form.
The yield is the final sentential form composed of only terminals.
We use ⇒ to show a replacement. The transitive closure of replacement is shown as ⇒*.
Languages, Automata, and Grammars
Joint work done by:
Noam Chomsky (Linguistics)
Alonzo Church (Math)
Alan Turing (Math/Computer Science)
Side note: Turing was Church's Ph.D. student
Church demonstrated equivalence classes of languages, automata and grammar
More powerful classes contain the less powerful ones
Chomsky Hierarchy
Chomsky defined equivalence classes; Type i includes Type j if i < j.

Language/Grammar                 Automaton                      Productions/Notes
Type 0  Recursively Enumerable   Turing Machine                 Contracting allowed
Type 1  Context Sensitive        Linear Bounded Automata        Non-contracting
Type 2  Context Free (CFG)       Push Down Automata             Stack based
Type 3  Regular Expressions      Finite State Automata (FSA)    Bounded memory
Programming Language Grammars
Most programming languages have CFGs. Type 0 and Type 1 languages are hard to parse, but some CFGs also cannot be efficiently parsed.
Push Down Automata (PDA) are used. Nondeterministic PDA (NPDA) are more powerful than deterministic PDA (DPDA).
Regular expressions are used in scanning. Deterministic and nondeterministic finite automata have equivalent power.
Recognizing a Language
Recognizing a program's syntax involves:
Scanning - Recognizes terminals (DFA)
Reads characters, emits tokens. Prefers longer strings.
Parsing - Recognizes terminals and nonterminals (DPDA)
Reads tokens, constructs a parse tree. Automated techniques are well studied.
Parsing is harder than scanning.
Scanner Automation
Lex - Mike Lesk's lexical analysis tool:
Takes a regular grammar
Creates an NFA
Converts the NFA to a DFA
Does code generation (table driven)
GNU flex (by Vern Paxson) - a lex clone
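A scanner in the lex/flex style can be sketched with Python's re module (token names are my own; real lex compiles the patterns into a single DFA rather than trying alternatives at runtime, and unmatched characters here are silently skipped rather than reported):

```python
import re

# Hypothetical token specification; lex/flex would compile this to a DFA.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z]+"),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),        # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(text):
    """Read characters, emit (kind, lexeme) tokens."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(scan("slope * x + 42"))
```

Each regex quantifier is greedy, so a run like "slope" comes out as one IDENT token rather than five, matching the "prefer longer strings" rule above.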
Parsing
Parsing recognizes strings in a grammar by constructing a tree (derivation).
The root is the start symbol. The terminals appear at the leaves. Each interior node represents a production:
The nonterminal on the LHS is the root; the symbols on the RHS are its children.
BNF
Extended Backus-Naur Form (EBNF) is a popular notation for CFGs.
Instead of →, use "::=". The start symbol is the LHS of the first production. An example:
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
BNF: An Example
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
Consider the above grammar (EBNF) and the program below.
Show the derivation (use ⇒ as a separator) and draw the parse tree.
Slope * x + intercept
The Derivation
Consider the following grammar (EBNF) and program; draw the parse tree.
expression ⇒ expression operator expression
⇒ expression operator identifier(intercept)
⇒ expression + identifier(intercept)
⇒ expression operator expression + identifier(intercept)
⇒ expression operator identifier(x) + identifier(intercept)
⇒ expression * identifier(x) + identifier(intercept)
⇒ identifier(Slope) * identifier(x) + identifier(intercept)
Is this a leftmost or rightmost derivation?
The Derivation and an Exercise
The prior example was rightmost. Try a leftmost derivation at home.
Is the given rightmost derivation unique?
Are there multiple parse trees?
Precedence and Associativity
Some techniques used to disambiguate:
Want to avoid adding useless symbols.
Must determine the order in which productions are applied in a derivation.
Precedence - High precedence operators' productions are applied later (closer to a leaf). Add an intermediate production (change the grammar).
Associativity - For repeated applications of the same operator, pick a left or right derivation.
Left, right, unary - done by grammar annotation.
Precedence: An Exercise
Consider our example language, except now we want * and / to have higher precedence than + and -.
So make high precedence nodes lower in the tree.
Can you fix this grammar?
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
Precedence: Solution
Consider our example language. We want * and / to always be beneath + and -.
expression ::= term rterm
rterm ::= add_op term rterm | ε
term ::= factor rfactor
rfactor ::= mult_op factor rfactor | ε
factor ::= "(" expression ")" | identifier
add_op ::= "+" | "-"
mult_op ::= "*" | "/"
identifier ::= (A-Z, a-z)+
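The disambiguated grammar maps naturally onto a recursive descent parser, one function per nonterminal. A minimal Python sketch (my own; tokens are pre-split strings, identifiers are any non-operator token, and the ε alternatives of rterm/rfactor become the fall-through returns — the accumulator argument makes the resulting trees left associative):

```python
def parse(tokens):
    """Recursive descent parser for the precedence grammar; returns a tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$$"   # $$ marks EOF

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok}, got {peek()}"
        pos += 1

    def expression():                 # expression ::= term rterm
        return rterm(term())

    def rterm(left):                  # rterm ::= add_op term rterm | eps
        if peek() in ("+", "-"):
            op = peek(); eat(op)
            return rterm((op, left, term()))
        return left                   # epsilon alternative

    def term():                       # term ::= factor rfactor
        return rfactor(factor())

    def rfactor(left):                # rfactor ::= mult_op factor rfactor | eps
        if peek() in ("*", "/"):
            op = peek(); eat(op)
            return rfactor((op, left, factor()))
        return left                   # epsilon alternative

    def factor():                     # factor ::= "(" expression ")" | identifier
        if peek() == "(":
            eat("("); e = expression(); eat(")")
            return e
        ident = peek(); eat(ident)
        return ident

    tree = expression()
    assert peek() == "$$", "trailing input"
    return tree

print(parse(["Slope", "*", "x", "+", "intercept"]))
```

On the slide's example the * is reduced first, so the + node ends up at the root, exactly as the precedence rewrite intends.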
Parser Taxonomy
How can parse trees be constructed?
Top down - easier to implement.
Called a predictive or recursive descent parser. Left-to-right scan, leftmost derivation with k symbols of lookahead: LL(k).
Bottom up - more general, harder to implement.
Left-to-right scan, rightmost derivation with k symbols of lookahead: LR(k).
Usually k = 1 (reduces complexity).
Flow Control
To manage the flow of control:
Write instructions to do it.
The algorithmic complexity is in the instructions. Mostly used for LL(k) approaches.
Use a table driven approach.
The complexity is pushed into the table. Useful for automation; algorithms are used to:
Interpret the table
Populate (initialize) the table (parser generators)
LL Parser Construction
Uses a leftmost derivation. Scans left to right, maintaining the state of the current production (initially the start symbol).
Try to match expected symbols.
Recurse down to terminals: use the language's recursion stack, or maintain your own stack.
How to handle end of input? Add a special token $$ to indicate EOF.
LL Grammars
Predictive parsers examine the current parent production and the lookahead symbol(s) to pick a production.
One recursive subroutine per nonterminal.
Left recursive grammars have productions of the form A ⇒* Aα, which breaks predictive parsers.
Change A → Aα | B to A → BA' and A' → αA' | ε
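That rewrite can be mechanized; a Python sketch of the transformation for immediate left recursion (function name and representation are my own; ε is encoded as the empty RHS list):

```python
def remove_left_recursion(nt, productions):
    """Rewrite A -> A alpha | beta ... as A -> beta A' and A' -> alpha A' | eps.

    nt: nonterminal name; productions: list of RHS lists for nt.
    Returns a dict mapping nonterminals to their new RHS lists.
    """
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]  # the alphas
    base = [rhs for rhs in productions if not rhs or rhs[0] != nt]        # the betas
    if not recursive:
        return {nt: productions}      # nothing to do
    prime = nt + "'"                  # fresh nonterminal A'
    return {
        nt: [rhs + [prime] for rhs in base],               # A  -> beta A'
        prime: [a + [prime] for a in recursive] + [[]],    # A' -> alpha A' | eps
    }

print(remove_left_recursion("A", [["A", "a"], ["B"]]))
```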
Tables For Predictive Parsers
Convert the LL(k) grammar to disjunction free form:
Change A → α | β to A → α and A → β
Construct a |N| × |T|^k table to match nonterminals:
Each entry is a production identifier, or indicates an error (unexpected input).
Use a stack of symbols in the alphabet (N ∪ T):
The parser matches the top of stack; initially the start symbol is on top.
Table Driven Predictive Parsers
Read a terminal and check the stack:
If the top of stack is a terminal, match it.
Otherwise use the nonterminal on top of stack and the lookahead to select a production; pop the nonterminal and push the production's RHS on the stack in reverse order.
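The driver loop just described fits in a few lines of Python (this sketch and its table for the grammar S → aSb | c are my own illustration, not from the slides):

```python
# Table-driven LL(1) driver. The table maps (nonterminal, lookahead) to the
# production's RHS (a list of grammar symbols). Grammar assumed: S -> a S b | c.
TABLE = {
    ("S", "a"): ["a", "S", "b"],
    ("S", "c"): ["c"],
}
NONTERMINALS = {"S"}

def ll1_parse(tokens, start="S"):
    """Return True iff tokens is a sentence of the tabled grammar."""
    stack = ["$$", start]              # start symbol on top, EOF marker beneath
    stream = tokens + ["$$"]
    i = 0
    while stack:
        top = stack.pop()
        look = stream[i]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False           # empty table entry: syntax error
            stack.extend(reversed(rhs))  # push the RHS in reverse order
        else:
            if top != look:
                return False           # terminal mismatch
            i += 1                     # matched: consume the token
    return i == len(stream)

print(ll1_parse(["a", "c", "b"]))
```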
Populating Predictive Parsing Tables
Want to develop a function for parsing that, given the nonterminal (top of stack) and a k symbol lookahead, selects the production:
Predict: N × T^k → P, for an LL(k) parser
Use helper functions to manage complexity:
First: (N ∪ T)* → 2^(T ∪ {ε}) (book assumes k = 1)
Follow: N → 2^T
First and Follow Functions
Predict is designed using helper functions.
The book assumes LL(1); we will follow suit.
First - The set of token(s) that can begin a production's RHS (a string in (N ∪ T)*):
First: (N ∪ T)* → 2^(T ∪ {ε})
Follow - The expected lookaheads, i.e., the set of tokens allowed to come after a nonterminal:
Follow: N → 2^T
Formal Definition of Predict, First and Follow
Want to predict using lookahead. Define a function Predict(A → α), where A ∈ N:
Predict(A → α) = (First(α) \ {ε}) ∪ (if α ⇒* ε then Follow(A) else ∅)
First(α) = {a : α ⇒* aβ} ∪ (if α ⇒* ε then {ε} else ∅), where α, β ∈ (N ∪ T)*
Follow(A) = {a : S ⇒+ αAaβ} ∪ (if S ⇒* αA then {$$} else ∅)
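The First definition is naturally computed as a fixed point over the productions; a Python sketch for k = 1 (names and the "eps" marker are my own):

```python
EPS = "eps"   # stands for the empty string epsilon

def first_sets(productions, nonterminals):
    """Fixed-point computation of First(A) for every nonterminal A.

    productions: list of (lhs, rhs) pairs, rhs a list of symbols
    (an empty rhs encodes an epsilon production).
    """
    first = {a: set() for a in nonterminals}

    def first_of(seq):
        """First of a symbol string, using the current approximation."""
        out = set()
        for sym in seq:
            if sym not in nonterminals:       # terminal: it starts everything
                out.add(sym)
                return out
            out |= first[sym] - {EPS}
            if EPS not in first[sym]:         # sym cannot vanish: stop here
                return out
        out.add(EPS)                          # the whole string can vanish
        return out

    changed = True
    while changed:                            # iterate until no set grows
        changed = False
        for lhs, rhs in productions:
            f = first_of(rhs)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
    return first

prods = [("E", ["T", "Ep"]), ("Ep", ["+", "T", "Ep"]), ("Ep", []), ("T", ["id"])]
print(first_sets(prods, {"E", "Ep", "T"}))
```

Follow can be computed by a similar fixed point that propagates First of what comes after each nonterminal occurrence, seeding Follow of the start symbol with $$.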
Error Recovery in LL(k) Parsers (1 of 3)
Humans make mistakes (GIGO).
Want to detect errors, but don't want to trigger a deluge of messages. Either stop, or "repair" the error and continue.
Wirth proposed phrase level recovery:
An error means an unexpected token was scanned.
Delete all tokens until a token that could begin the next statement is scanned.
Error Recovery in LL(k) Parsers (2 of 3)
To improve the quality of error detection:
Use context (helps with productions)
Add error productions to detect common mistakes
To make the detection easier to program:
Use exception handling features (needs language support)
Least Cost Error Recovery in LL(k) Parsers (1 of 2)
Error Recovery in Table Driven LL(k) parsers
Fischer, Milton and Quiring developed a least cost error recovery method (FMQ for short).
Goals:
Should not need semantic information
Should not require reprocessing input
Do not change the spelling of tokens or separators
Least Cost Error Recovery in LL(k) Parsers (2 of 2)
The FMQ algorithm has compiler-writer determined parameters for manipulating the input stream:
C(t) - Cost of inserting token t into the input
D(t) - Cost of deleting token t from the input
C($$) = D($$) = ∞ (the EOF token is never inserted or deleted)
Find the optimal repair:
Find the optimal insert/delete combination, using the finding of an optimal insertion as a helper.
Bottom Up Parsing (1/2)
Bottom up parsers use an LR(k) grammar.
Typically table driven: uses lookahead and a stack to pick an action.
Common LR(k) parsers use an FSA, the Characteristic Finite State Machine (CFSM). Transitions use the top of stack and the lookahead.
Variants include LR(0), SLR(1), LALR(1).
Bottom Up Parsing (2/2)
Parser actions are:
Shift a token onto the stack
Reduce a recognized sequence of symbols on the top of stack
Accept the input
Detect errors
State in LR Parsers
LR parsers keep track of where they are.
Initial state (stack to the left of |, remaining input to the right):
(S0 | a1, a2, ..., an, $)
After parsing the ith token, the stack may have m state/symbol pairs pushed (m ≤ i):
(S0, S1, X1, S2, X2, ..., Sm, Xm | ai+1, ai+2, ..., an, $)
Shift Reduce Parsing 1
Basic idea - Build the parse tree from the leaves up, discovering a rightmost derivation in reverse.
If the topmost few symbols on the stack match the RHS of some production:
Pop them (the RHS) off the stack.
Push the LHS of the production and a state. This is called a reduction step (reduce).
Else push (shift) the token and a state onto the stack.
Shift Reduce Parsing 2
Consider a language with the rule E → E + E.
The RHS E + E is called a handle. When a handle is recognized, reduce; otherwise (not a handle) shift a token. But what if we also have a rule, say S → E?
The fundamental shift reduce challenges are:
Knowing when to shift and/or reduce.
If reducing, knowing which rule to apply.
Shift Reduce Parsing 3
Table driven parsers use a DFA to process the input.
State transitions are stored in tables indexed by the top of stack and the lookahead; entries describe:
Which action to take (Action table)
What the next state is (GoTo table)
Right recursion is costly for shift reduce parsers (symbols pile up on the stack before any reduction), much as left recursion breaks LL parsers.
Use of left recursion leads to left associative trees (which is often a good thing).
Shift/Reduce Parsers Remember What Matched
Each production yields a set of "dot productions" (aka items), showing how much of the RHS has been matched.
So S → E+E has the dot productions:
S → .E+E (matched nothing)
S → E.+E (matched the left E)
S → E+.E (matched the left E and +)
S → E+E. (matched the entire production)
Add the production S' → S$ and its items.
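Enumerating the items of a production is just placing the dot at every position; a one-function Python sketch (string representation is my own):

```python
def items(lhs, rhs):
    """All dot positions ('items') for one production, as strings."""
    return [lhs + " ::= " + " ".join(rhs[:k] + ["."] + rhs[k:])
            for k in range(len(rhs) + 1)]

for it in items("S", ["E", "+", "E"]):
    print(it)
```

A production with n RHS symbols has n + 1 items, one per dot position.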
Dot Productions of Example
1  S' ::= .S$
2  S' ::= S.$
3  S' ::= S$.
4  S ::= .BB
5  S ::= B.B
6  S ::= BB.
7  B ::= .aB
8  B ::= a.B
9  B ::= aB.
10 B ::= .c
11 B ::= c.
Item Sets: An Informal Introduction
We could directly derive a DFA, with transitions guided by the matched symbol and one state per item (dot production).
But this induces too many states. Sometimes it would not be possible to put the parser in a unique state given the scanned input, especially if lookahead is required.
Partition into item sets (equivalence classes). Partitioning is done using Closure.
LR(0) Parsing and Closure
Recall LR(0) parsing; by definition:
Uses no lookahead in parser construction.
However, the generated parser may use lookahead.
Closure(I), where I is a set of items in a grammar G, constructs item sets as follows:
Initially every item in I is added to Closure(I).
For each item A → α.Bβ in Closure(I) and each production B → γ, insert B → .γ into Closure(I), until Closure(I) no longer increases in size.
LR(0) Parsing Example Revisited
Recall LR(0) parsing; by definition:
Uses no lookahead in parser construction.
However, the generated parser may use lookahead.
In-class exercise: recall the example. The given grammar is LR(0).
What items can be applied before any tokens are scanned?
What items are feasible after scanning 'c'?
Closure Example Part 1
What items are feasible before any tokens are scanned?
1  S' ::= .S$
4  S ::= .BB
7  B ::= .aB
10 B ::= .c
Of these, S' ::= .S$ is the kernel item (kernel items are the initial item and any item whose dot is not at the left end).
Nonkernel items have the . at the left end.
Closure Defined
Closure takes an item, i, as input and returns the equivalence class containing i.

Itemset Closure(Item i) /* Algorithm for LR(0) grammars */
begin
  Itemset I := {i}, OldI;
  repeat
    OldI := I;
    foreach item A ::= α.Mβ in I do
      foreach production M ::= γ such that M ::= .γ is not in I do
        I := I ∪ {M ::= .γ};
  until I = OldI;
  return I;
end Closure;
Constructing all Itemsets
We need to build ALL item sets.

Procedure BuildAllItemsets(Grammar G) /* LR(0) grammars */
begin
  G.C := Closure(S' ::= .S);
  repeat
    OldC := G.C;
    foreach ItemSet I ∈ G.C and each X ∈ (N ∪ T) such that
        GoTo(I, X) ≠ ∅ and GoTo(I, X) ∉ G.C do
      G.C := G.C ∪ {GoTo(I, X)};
  until G.C = OldC;
end BuildAllItemsets;
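The pseudocode above translates almost line for line into Python; a sketch (my own representation: items are (lhs, rhs, dot) triples, and the grammar is the slides' S' → S$, S → BB, B → aB | c):

```python
# Grammar from the running example: S' -> S$, S -> BB, B -> aB | c
G = {"S'": [("S", "$")], "S": [("B", "B")], "B": [("a", "B"), ("c",)]}

def closure(items):
    """LR(0) closure: whenever the dot precedes a nonterminal M,
    add all of M's items with the dot at the left end."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in G:      # dot before a nonterminal
                for prod in G[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

def goto(items, x):
    """GoTo: advance the dot over symbol x, then take the closure."""
    moved = {(l, r, d + 1) for (l, r, d) in items
             if d < len(r) and r[d] == x}
    return closure(moved)

c0 = closure({("S'", ("S", "$"), 0)})
print(sorted(c0))
```

The initial set c0 comes out as the four items listed on the "Closure Example Part 1" slide; iterating goto over all grammar symbols, as in BuildAllItemsets, yields the full canonical collection.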
LALR Parsing
Most grammars in use are LALR(1), because of popular tools:
Stephen Johnson's YACC
Donnelly, Stallman et al.'s GNU Bison
Why LALR(1)? Space/time efficiency:
LR(0) is simple but limited.
SLR(k) is more restrictive than LALR(k).
LR(k), k > 0, is more complex.
LALR(1) Error Handling
YACC and Bison use a special error token. When an error occurs:
The yyerror() function is invoked.
The parser pops states until it can shift the error token.
It inserts and shifts the error token.
It deletes tokens from the input until it finds a post-error input token (or terminates).
It (temporarily) disables error reporting.
Parsing resumes.