Topic 2 – Syntax Analysis
Dr. William A. Maniatty, Assistant Prof.
Dept. of Computer Science, University at Albany
CSI 511 Programming Languages and Systems Concepts
Fall 2002
Monday/Wednesday 2:30-3:50, LI 99
What is Language?
Language - A systematic means of communicating by the use of sounds or conventional symbols (WordNet 1.7)
Natural Language - Used by people
Programming Language - A formal language in which computer programs are written (FOLDOC)
Languages are defined by their:
Syntax - Rules for how symbols are combined
Semantics - What the constructs mean
Why is Syntax Important?
A compiler must recognize a program's syntax and then do semantic analysis.
A good programming language syntax:
Is precise (avoids ambiguity)
Can be recognized efficiently (i.e., in O(|Program|) time)
Is easy to scan/parse (unlike Fortran)
Is succinct (unlike COBOL)
Is readable (unlike APL)
Some Notation
ε denotes the empty string
ab denotes a followed by b
a* denotes 0 or more repetitions of a Called the Kleene star (after Stephen Kleene)
a+ denotes 1 or more repetitions of a (aa*)
a|b denotes either a or b (but not both)
[a] denotes a or the empty string
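This notation maps directly onto regular expressions; a quick sketch using Python's re module (my own illustration, not from the slides — note that the slides' [a] is written a? in re syntax):

```python
import re

# a* : zero or more repetitions of a (Kleene star)
assert re.fullmatch(r"a*", "") is not None
assert re.fullmatch(r"a*", "aaa") is not None

# a+ : one or more repetitions (equivalent to aa*)
assert re.fullmatch(r"a+", "") is None
assert re.fullmatch(r"a+", "aa") is not None

# a|b : either a or b (but not both at once)
assert re.fullmatch(r"a|b", "b") is not None

# [a] in the slides' notation means "a or the empty string"; in re that is a?
assert re.fullmatch(r"a?", "") is not None
assert re.fullmatch(r"a?", "a") is not None

print("all notation examples hold")
```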
Grammar and Syntax
A formal grammar of a language L is defined as G(L) = (N, T, S, P), where:
N is the set of nonterminals (metasymbols)
T is the set of terminals
N and T are disjoint: N ∩ T = ∅
A = N ∪ T denotes the alphabet of the language
P is the set of productions
S ∈ N is the start symbol of the language
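The 4-tuple can be written down directly; a minimal Python sketch (the tiny expression grammar shown is illustrative, not from the slides):

```python
# G = (N, T, S, P) for a tiny expression grammar (illustrative only)
grammar = {
    "N": {"E", "Op"},              # nonterminals
    "T": {"id", "+", "*"},         # terminals
    "S": "E",                      # start symbol
    "P": [                         # productions as (LHS, RHS) pairs
        ("E", ["E", "Op", "E"]),
        ("E", ["id"]),
        ("Op", ["+"]),
        ("Op", ["*"]),
    ],
}

# Sanity checks from the definition: N and T are disjoint, S is a nonterminal
assert grammar["N"].isdisjoint(grammar["T"])
assert grammar["S"] in grammar["N"]
```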
Productions
A production is a rule describing allowed substitutions for nonterminals.
More formally, each production p ∈ P is an ordered pair (α, β), denoted α → β, where:
α = φAψ with φ, ψ ∈ (N ∪ T)* and A ∈ N
β = φωψ with ω ∈ (N ∪ T)*
Derivations (Some notation)
A derivation is a sequence of substitutions (applications of productions).
Each string of symbols at a stage in a derivation is a sentential form.
The yield is the final sentential form composed of only terminals.
We use ⇒ to show a replacement. The transitive closure of replacement is shown as ⇒*.
Languages, Automata, and Grammars
Joint work done by:
Noam Chomsky (Linguistics)
Alonzo Church (Math)
Alan Turing (Math/Computer Science)
Side note: Turing was Church's Ph.D. student
Church demonstrated equivalence classes of languages, automata and grammar
More powerful classes contain the less powerful ones
Chomsky Hierarchy
Chomsky defined equivalence classes; Type i includes Type j if i < j.

Language/Grammar                 Automaton                      Productions/Notes
Type 0  Recursively Enumerable   Turing Machine                 Contracting allowed
Type 1  Context Sensitive        Linear Bounded Automata        Non-contracting
Type 2  Context Free (CFG)       Push Down Automata             Stack based
Type 3  Regular Expressions      Finite State Automata (FSA)    Bounded memory
Programming Language Grammars
Most programming languages have CFGs. Type 0 and Type 1 languages are hard to parse, but some CFGs also cannot be efficiently parsed.
Push Down Automata (PDA) are used. Nondeterministic PDA (NPDA) are more powerful than deterministic PDA (DPDA).
Regular expressions are used in scanning. Deterministic and nondeterministic finite automata have equivalent power.
Recognizing a Language
Recognizing a program's syntax involves:
Scanning - Recognizes terminals (DFA)
Reads characters, emits tokens. Prefers longer strings.
Parsing - Recognizes terminals and nonterminals (DPDA)
Reads tokens, constructs a parse tree. Automated techniques are well studied.
Parsing is harder than scanning.
Scanner Automation
Lex - Mike Lesk's lexical analysis tool:
Takes a regular grammar
Creates an NFA
Converts the NFA to a DFA
Does code generation (table driven)
GNU flex (by Vern Paxson) - a lex clone
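A scanner in the lex/flex style can be sketched with Python's re module (token names are my own; real lex compiles the patterns into a single DFA rather than trying alternatives at runtime, and unmatched characters here are silently skipped rather than reported):

```python
import re

# Hypothetical token specification; lex/flex would compile this to a DFA.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z]+"),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),        # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(text):
    """Read characters, emit (kind, lexeme) tokens."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(scan("slope * x + 42"))
```

Each regex quantifier is greedy, so a run like "slope" comes out as one IDENT token rather than five, matching the "prefer longer strings" rule above.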
Parsing
Parsing recognizes strings in a grammar by constructing a tree (derivation).
The root is the start symbol. The terminals appear at the leaves. Each interior node represents a production:
The nonterminal on the LHS is the root; the symbols on the RHS are its children.
BNF
Extended Backus-Naur Form (EBNF) is a popular notation for CFGs.
Instead of →, use "::=". The start symbol is the LHS of the first production. An example:
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
BNF: An Example
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
Consider the above grammar (EBNF) and the program below.
Show the derivation (use ⇒ as a separator) and draw the parse tree.
Slope * x + intercept
The Derivation
Consider the following grammar (EBNF) and program; draw the parse tree.
expression ⇒ expression operator expression
⇒ expression operator identifier(intercept)
⇒ expression + identifier(intercept)
⇒ expression operator expression + identifier(intercept)
⇒ expression operator identifier(x) + identifier(intercept)
⇒ expression * identifier(x) + identifier(intercept)
⇒ identifier(Slope) * identifier(x) + identifier(intercept)
Is this a leftmost or rightmost derivation?
The Derivation and an Exercise
The prior example was rightmost. Try a leftmost derivation at home.
Is the given rightmost derivation unique?
Are there multiple parse trees?
Precedence and Associativity
Some techniques used to disambiguate:
Want to avoid adding useless symbols.
Must determine the order in which productions are applied in a derivation.
Precedence - High precedence operators' productions are applied later (closer to a leaf). Add an intermediate production (change the grammar).
Associativity - For repeated applications of the same operator, pick a left or right derivation.
Left, right, unary - done by grammar annotation.
Precedence: An Exercise
Consider our example language, except now we want * and / to have higher precedence than + and -.
So make high precedence nodes lower in the tree.
Can you fix this grammar?
expression ::= identifier | number | - expression | "(" expression ")" | expression operator expression
operator ::= + | - | "*" | /
identifier ::= (A-Z, a-z)+
Precedence: Solution
Consider our example language. We want * and / to always be beneath + and -.
expression ::= term rterm
rterm ::= add_op term rterm | ε
term ::= factor rfactor
rfactor ::= mult_op factor rfactor | ε
factor ::= "(" expression ")" | identifier
add_op ::= "+" | "-"
mult_op ::= "*" | "/"
identifier ::= (A-Z, a-z)+
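The disambiguated grammar maps naturally onto a recursive descent parser, one function per nonterminal. A minimal Python sketch (my own; tokens are pre-split strings, identifiers are any non-operator token, and the ε alternatives of rterm/rfactor become the fall-through returns — the accumulator argument makes the resulting trees left associative):

```python
def parse(tokens):
    """Recursive descent parser for the precedence grammar; returns a tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$$"   # $$ marks EOF

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok}, got {peek()}"
        pos += 1

    def expression():                 # expression ::= term rterm
        return rterm(term())

    def rterm(left):                  # rterm ::= add_op term rterm | eps
        if peek() in ("+", "-"):
            op = peek(); eat(op)
            return rterm((op, left, term()))
        return left                   # epsilon alternative

    def term():                       # term ::= factor rfactor
        return rfactor(factor())

    def rfactor(left):                # rfactor ::= mult_op factor rfactor | eps
        if peek() in ("*", "/"):
            op = peek(); eat(op)
            return rfactor((op, left, factor()))
        return left                   # epsilon alternative

    def factor():                     # factor ::= "(" expression ")" | identifier
        if peek() == "(":
            eat("("); e = expression(); eat(")")
            return e
        ident = peek(); eat(ident)
        return ident

    tree = expression()
    assert peek() == "$$", "trailing input"
    return tree

print(parse(["Slope", "*", "x", "+", "intercept"]))
```

On the slide's example the * is reduced first, so the + node ends up at the root, exactly as the precedence rewrite intends.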
Parser Taxonomy
How can parse trees be constructed?
Top down - easier to implement.
Called a predictive or recursive descent parser. Left-to-right scan, leftmost derivation with k symbols of lookahead: LL(k).
Bottom up - more general, harder to implement.
Left-to-right scan, rightmost derivation with k symbols of lookahead: LR(k).
Usually k = 1 (reduces complexity).
Flow Control
To manage the flow of control:
Write instructions to do it.
The algorithmic complexity is in the instructions. Mostly used for LL(k) approaches.
Use a table driven approach.
The complexity is pushed into the table. Useful for automation; algorithms are used to:
Interpret the table
Populate (initialize) the table (parser generators)
LL Parser Construction
Uses a leftmost derivation. Scans left to right, maintaining the state of the current production (initially the start symbol).
Try to match expected symbols.
Recurse down to terminals: use the language's recursion stack, or maintain your own stack.
How to handle end of input? Add a special token $$ to indicate EOF.
LL Grammars
Predictive parsers examine the current parent production and the lookahead symbol(s) to pick a production.
One recursive subroutine per nonterminal.
Left recursive grammars have productions of the form A ⇒* Aα, which breaks predictive parsers.
Change A → Aα | B to A → BA' and A' → αA' | ε
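That rewrite can be mechanized; a Python sketch of the transformation for immediate left recursion (function name and representation are my own; ε is encoded as the empty RHS list):

```python
def remove_left_recursion(nt, productions):
    """Rewrite A -> A alpha | beta ... as A -> beta A' and A' -> alpha A' | eps.

    nt: nonterminal name; productions: list of RHS lists for nt.
    Returns a dict mapping nonterminals to their new RHS lists.
    """
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]  # the alphas
    base = [rhs for rhs in productions if not rhs or rhs[0] != nt]        # the betas
    if not recursive:
        return {nt: productions}      # nothing to do
    prime = nt + "'"                  # fresh nonterminal A'
    return {
        nt: [rhs + [prime] for rhs in base],               # A  -> beta A'
        prime: [a + [prime] for a in recursive] + [[]],    # A' -> alpha A' | eps
    }

print(remove_left_recursion("A", [["A", "a"], ["B"]]))
```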
Tables For Predictive Parsers
Convert the LL(k) grammar to disjunction free form:
Change A → α | β to A → α and A → β
Construct a |N| × |T|^k table to match nonterminals:
Each entry is a production identifier, or indicates an error (unexpected input).
Use a stack of symbols in the alphabet (N ∪ T):
The parser matches the top of stack; initially the start symbol is on top.
Table Driven Predictive Parsers
Read a terminal and check the stack:
If the top of stack is a terminal, match it.
Otherwise use the nonterminal on top of stack and the lookahead to select a production; pop the nonterminal and push the production's RHS on the stack in reverse order.
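The driver loop just described fits in a few lines of Python (this sketch and its table for the grammar S → aSb | c are my own illustration, not from the slides):

```python
# Table-driven LL(1) driver. The table maps (nonterminal, lookahead) to the
# production's RHS (a list of grammar symbols). Grammar assumed: S -> a S b | c.
TABLE = {
    ("S", "a"): ["a", "S", "b"],
    ("S", "c"): ["c"],
}
NONTERMINALS = {"S"}

def ll1_parse(tokens, start="S"):
    """Return True iff tokens is a sentence of the tabled grammar."""
    stack = ["$$", start]              # start symbol on top, EOF marker beneath
    stream = tokens + ["$$"]
    i = 0
    while stack:
        top = stack.pop()
        look = stream[i]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False           # empty table entry: syntax error
            stack.extend(reversed(rhs))  # push the RHS in reverse order
        else:
            if top != look:
                return False           # terminal mismatch
            i += 1                     # matched: consume the token
    return i == len(stream)

print(ll1_parse(["a", "c", "b"]))
```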
Populating Predictive Parsing Tables
Want to develop a function for parsing that, given the nonterminal (top of stack) and a k symbol lookahead, selects the production:
Predict: N × T^k → P, for an LL(k) parser
Use helper functions to manage complexity:
First: (N ∪ T)* → 2^(T ∪ {ε}) (book assumes k = 1)
Follow: N → 2^T
First and Follow Functions
Predict is designed using helper functions.
The book assumes LL(1); we will follow suit.
First - The set of token(s) that can begin a production's RHS (a string in (N ∪ T)*):
First: (N ∪ T)* → 2^(T ∪ {ε})
Follow - The expected lookaheads, i.e., the set of tokens allowed to come after a nonterminal:
Follow: N → 2^T
Formal Definition of Predict, First and Follow
Want to predict using lookahead. Define a function Predict(A → α), where A ∈ N:
Predict(A → α) = (First(α) \ {ε}) ∪ (if α ⇒* ε then Follow(A) else ∅)
First(α) = {a : α ⇒* aβ} ∪ (if α ⇒* ε then {ε} else ∅), where α, β ∈ (N ∪ T)*
Follow(A) = {a : S ⇒+ αAaβ} ∪ (if S ⇒* αA then {$$} else ∅)
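The First definition is naturally computed as a fixed point over the productions; a Python sketch for k = 1 (names and the "eps" marker are my own):

```python
EPS = "eps"   # stands for the empty string epsilon

def first_sets(productions, nonterminals):
    """Fixed-point computation of First(A) for every nonterminal A.

    productions: list of (lhs, rhs) pairs, rhs a list of symbols
    (an empty rhs encodes an epsilon production).
    """
    first = {a: set() for a in nonterminals}

    def first_of(seq):
        """First of a symbol string, using the current approximation."""
        out = set()
        for sym in seq:
            if sym not in nonterminals:       # terminal: it starts everything
                out.add(sym)
                return out
            out |= first[sym] - {EPS}
            if EPS not in first[sym]:         # sym cannot vanish: stop here
                return out
        out.add(EPS)                          # the whole string can vanish
        return out

    changed = True
    while changed:                            # iterate until no set grows
        changed = False
        for lhs, rhs in productions:
            f = first_of(rhs)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
    return first

prods = [("E", ["T", "Ep"]), ("Ep", ["+", "T", "Ep"]), ("Ep", []), ("T", ["id"])]
print(first_sets(prods, {"E", "Ep", "T"}))
```

Follow can be computed by a similar fixed point that propagates First of what comes after each nonterminal occurrence, seeding Follow of the start symbol with $$.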
Error Recovery in LL(k) Parsers (1 of 3)
Humans make mistakes (GIGO).
Want to detect errors, but don't want to trigger a deluge of messages. Either stop, or "repair" the error and continue.
Wirth proposed phrase level recovery:
An error means an unexpected token was scanned.
Delete all tokens until a token that could begin the next statement is scanned.
Error Recovery in LL(k) Parsers (2 of 3)
To improve the quality of error detection:
Use context (helps with productions)
Add error productions to detect common mistakes
To make the detection easier to program:
Use exception handling features (needs language support)
Least Cost Error Recovery in LL(k) Parsers (1 of 2)
Error Recovery in Table Driven LL(k) parsers
Fischer, Milton and Quiring developed a least cost error recovery method (FMQ for short).
Goals:
Should not need semantic information
Should not require reprocessing input
Do not change the spelling of tokens or separators
Least Cost Error Recovery in LL(k) Parsers (2 of 2)
The FMQ algorithm has compiler-writer determined parameters for manipulating the input stream:
C(t) - Cost of inserting token t into the input
D(t) - Cost of deleting token t from the input
C($$) = D($$) = ∞ (the EOF token is never inserted or deleted)
Find the optimal repair:
Find the optimal insert/delete combination, using the finding of an optimal insertion as a helper.
Bottom Up Parsing (1/2)
Bottom up parsers use an LR(k) grammar.
Typically table driven: uses lookahead and a stack to pick an action.
Common LR(k) parsers use an FSA, the Characteristic Finite State Machine (CFSM). Transitions use the top of stack and the lookahead.
Variants include LR(0), SLR(1), LALR(1).
Bottom Up Parsing (2/2)
Parser actions are:
Shift a token onto the stack
Reduce a recognized sequence of symbols on the top of stack
Accept the input
Detect errors
State in LR Parsers
LR parsers keep track of where they are.
Initial state (stack to the left of |, remaining input to the right):
(S0 | a1, a2, ..., an, $)
After parsing the ith token, the stack may have m state/symbol pairs pushed (m ≤ i):
(S0, S1, X1, S2, X2, ..., Sm, Xm | ai+1, ai+2, ..., an, $)
Shift Reduce Parsing 1
Basic idea - Build the parse tree from the leaves up, discovering a rightmost derivation in reverse.
If the topmost few symbols on the stack match the RHS of some production:
Pop them (the RHS) off the stack.
Push the LHS of the production and a state. This is called a reduction step (reduce).
Else push (shift) the token and a state onto the stack.
Shift Reduce Parsing 2
Consider a language with the rule E → E + E.
The RHS E + E is called a handle. When a handle is recognized, reduce; otherwise (not a handle) shift a token. But what if we also have a rule, say S → E?
The fundamental shift reduce challenges are:
Knowing when to shift and/or reduce.
If reducing, knowing which rule to apply.
Shift Reduce Parsing 3
Table driven parsers use a DFA to process the input.
State transitions are stored in tables indexed by the top of stack and the lookahead; entries describe:
Which action to take (Action table)
What the next state is (GoTo table)
Right recursion is costly for shift reduce parsers (symbols pile up on the stack before any reduction), much as left recursion breaks LL parsers.
Use of left recursion leads to left associative trees (which is often a good thing).
Shift/Reduce Parsers Remember What Matched
Each production yields a set of "dot productions" (aka items), showing how much of the RHS has been matched.
So S → E+E has the dot productions:
S → .E+E (matched nothing)
S → E.+E (matched the left E)
S → E+.E (matched the left E and +)
S → E+E. (matched the entire production)
Add the production S' → S$ and its items.
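Enumerating the items of a production is just placing the dot at every position; a one-function Python sketch (string representation is my own):

```python
def items(lhs, rhs):
    """All dot positions ('items') for one production, as strings."""
    return [lhs + " ::= " + " ".join(rhs[:k] + ["."] + rhs[k:])
            for k in range(len(rhs) + 1)]

for it in items("S", ["E", "+", "E"]):
    print(it)
```

A production with n RHS symbols has n + 1 items, one per dot position.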
Dot Productions of Example
1  S' ::= .S$
2  S' ::= S.$
3  S' ::= S$.
4  S ::= .BB
5  S ::= B.B
6  S ::= BB.
7  B ::= .aB
8  B ::= a.B
9  B ::= aB.
10 B ::= .c
11 B ::= c.
Item Sets: An Informal Introduction
We could directly derive a DFA, with transitions guided by the matched symbol and one state per item (dot production).
But this induces too many states. Sometimes it would not be possible to put the parser in a unique state given the scanned input, especially if lookahead is required.
Partition into item sets (equivalence classes). Partitioning is done using Closure.
LR(0) Parsing and Closure
Recall LR(0) parsing; by definition:
Uses no lookahead in parser construction.
However, the generated parser may use lookahead.
Closure(I), where I is a set of items in a grammar G, constructs item sets as follows:
Initially every item in I is added to Closure(I).
For each item A → α.Bβ in Closure(I) and each production B → γ, insert B → .γ into Closure(I), until Closure(I) no longer increases in size.
LR(0) Parsing Example Revisited
Recall LR(0) parsing; by definition:
Uses no lookahead in parser construction.
However, the generated parser may use lookahead.
In-class exercise: recall the example. The given grammar is LR(0).
What items can be applied before any tokens are scanned?
What items are feasible after scanning 'c'?
Closure Example Part 1
What items are feasible before any tokens are scanned?
1  S' ::= .S$
4  S ::= .BB
7  B ::= .aB
10 B ::= .c
Of these, S' ::= .S$ is the kernel item (kernel items are the initial item and any item whose dot is not at the left end).
Nonkernel items have the . at the left end.
Closure Defined
Closure takes an item, i, as input and returns the equivalence class containing i.

Itemset Closure(Item i) /* Algorithm for LR(0) grammars */
begin
  Itemset I := {i}, OldI;
  repeat
    OldI := I;
    foreach item A ::= α.Mβ in I do
      foreach production M ::= γ such that M ::= .γ is not in I do
        I := I ∪ {M ::= .γ};
  until I = OldI;
  return I;
end Closure;
Constructing all Itemsets
We need to build ALL item sets.

Procedure BuildAllItemsets(Grammar G) /* LR(0) grammars */
begin
  G.C := Closure(S' ::= .S);
  repeat
    OldC := G.C;
    foreach ItemSet I ∈ G.C and each X ∈ (N ∪ T) such that
        GoTo(I, X) ≠ ∅ and GoTo(I, X) ∉ G.C do
      G.C := G.C ∪ {GoTo(I, X)};
  until G.C = OldC;
end BuildAllItemsets;
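The pseudocode above translates almost line for line into Python; a sketch (my own representation: items are (lhs, rhs, dot) triples, and the grammar is the slides' S' → S$, S → BB, B → aB | c):

```python
# Grammar from the running example: S' -> S$, S -> BB, B -> aB | c
G = {"S'": [("S", "$")], "S": [("B", "B")], "B": [("a", "B"), ("c",)]}

def closure(items):
    """LR(0) closure: whenever the dot precedes a nonterminal M,
    add all of M's items with the dot at the left end."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in G:      # dot before a nonterminal
                for prod in G[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

def goto(items, x):
    """GoTo: advance the dot over symbol x, then take the closure."""
    moved = {(l, r, d + 1) for (l, r, d) in items
             if d < len(r) and r[d] == x}
    return closure(moved)

c0 = closure({("S'", ("S", "$"), 0)})
print(sorted(c0))
```

The initial set c0 comes out as the four items listed on the "Closure Example Part 1" slide; iterating goto over all grammar symbols, as in BuildAllItemsets, yields the full canonical collection.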
LALR Parsing
Most grammars in use are LALR(1), because of popular tools:
Stephen Johnson's YACC
Donnelly, Stallman et al.'s GNU Bison
Why LALR(1)? Space/time efficiency:
LR(0) is simple but limited.
SLR(k) is more restrictive than LALR(k).
LR(k), k > 0, is more complex.
LALR(1) Error Handling
YACC and Bison use a special error token. When an error occurs:
The yyerror() function is invoked.
The parser pops states until it can shift the error token.
It inserts and shifts the error token.
It deletes tokens from the input until it finds a post-error input token (or terminates).
It (temporarily) disables error reporting.
Parsing resumes.