8/6/2019 Color Lecture Notes
1/237
CS 352 Compilers: Principles and Practice
1. Introduction
2. Lexical analysis
3. LL parsing
4. LR parsing
5. JavaCC and JTB
6. Semantic analysis
7. Translation and simplification
8. Liveness analysis and register allocation
9. Activation Records
Chapter 1: Introduction
Things to do
make sure you have a working mentor account
start brushing up on Java
review Java development tools; see http://www.cs.purdue.edu/homes/palsberg/cs352/F00/index.html
add yourself to the course mailing list by writing (on a CS computer)
mailer add me to cs352
Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].
Compilers
What is a compiler?
a program that translates an executable program in one language into an executable program in another language
we expect the program produced by the compiler to be better, in some way, than the original
What is an interpreter?
a program that reads an executable program and produces the results of running that program
usually, this involves executing the source program in some fashion
This course deals mainly with compilers
Many of the same issues arise in interpreters
Motivation
Why study compiler construction?
Why build compilers?
Why attend class?
Interest
Compiler construction is a microcosm of computer science
artificial intelligence: greedy algorithms, learning algorithms
algorithms: graph algorithms, union-find, dynamic programming
theory: DFAs for scanning, parser generators, lattice theory for analysis
systems: allocation and naming, locality, synchronization
architecture: pipeline management, hierarchy management, instruction set use

Inside a compiler, all these things come together
Isn't it a solved problem?
Machines are constantly changing
Changes in architecture ⇒ changes in compilers
new features pose new problems
changing costs lead to different concerns
old solutions need re-engineering
Changes in compilers should prompt changes in architecture
New languages and features
Intrinsic Merit
Compiler construction is challenging and fun
interesting problems
primary responsibility for performance (blame)
new architectures ⇒ new challenges
real results
extremely complex interactions
Compilers have an impact on how computers are used
Compiler construction poses some of the most interesting problems in computing
Experience
You have used several compilers
What qualities are important in a compiler?
1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Support for separate compilation
6. Good diagnostics for syntax errors
7. Works well with the debugger
8. Good diagnostics for flow anomalies
9. Cross language calls
10. Consistent, predictable optimization
Each of these shapes your feelings about the correct contents of this course
Abstract view
source code → compiler → machine code (errors reported along the way)
Implications:
recognize legal (and illegal) programs
generate correct code
manage storage of all variables and code
agreement on format for object (or assembly) code
Big step up from assembler: higher level notations
Traditional two pass compiler
source code → front end → IR → back end → machine code (errors reported along the way)
Implications:
intermediate representation (IR)
front end maps legal code into IR
back end maps IR onto target machine
simplify retargeting
allows multiple front ends
multiple passes ⇒ better code
A fallacy
[diagram: front ends for FORTRAN, C++, CLU, and Smalltalk code all producing a shared IR, which feeds back ends for target1, target2, and target3]
Can we build n × m compilers with n + m components?
must encode all the knowledge in each front end
must represent all the features in one IR
must handle all the features in each back end
Limited success with low-level IRs
Front end
source code → scanner → tokens → parser → IR (errors reported along the way)
Scanner:
maps characters into tokens, the basic unit of syntax
  x = x + y  becomes  <id,x> = <id,x> + <id,y>
character string value for a token is a lexeme
typical tokens: number, id, +, -, *, /
eliminates white space (tabs, blanks, comments)
a key issue is speed
  use specialized recognizer (as opposed to lex)
Front end
source code → scanner → tokens → parser → IR (errors reported along the way)
Parser:
recognize context-free syntax
guide context-sensitive analysis
construct IR(s)
produce meaningful error messages
attempt error correction
Parser generators mechanize much of the work
Front end
Context-free syntax is specified with a grammar
<sheep noise> ::= baa <sheep noise>
               | baa

This grammar defines the set of noises that a sheep makes under normal circumstances
The format is called Backus-Naur form (BNF)
Formally, a grammar G = (S, N, T, P)
S is the start symbol
N is a set of non-terminal symbols
T is a set of terminal symbols
P is a set of productions or rewrite rules (P : N → (N ∪ T)*)
Front end
Context-free syntax can be put to better use
1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <term>
3           | <term>
4 <term> ::= num
5           | id
6 <op>   ::= +
7           | -

This grammar defines simple expressions with addition and subtraction over the tokens num and id

S = <goal>
T = { num, id, +, - }
N = { <goal>, <expr>, <term>, <op> }
P = { 1, 2, 3, 4, 5, 6, 7 }
Front end
Given a grammar, valid sentences can be derived by repeated substitution.
Prod'n  Result
        <goal>
1       <expr>
2       <expr> <op> <term>
5       <expr> <op> y
7       <expr> - y
2       <expr> <op> <term> - y
4       <expr> <op> 2 - y
6       <expr> + 2 - y
3       <term> + 2 - y
5       x + 2 - y

To recognize a valid sentence in some CFG, we reverse this process and build up a parse
Front end
A parse can be represented by a tree called a parse or syntax tree
[figure: the parse tree for x + 2 - y]
Front end
So, compilers often use an abstract syntax tree
[figure: the abstract syntax tree for x + 2 - y]
Back end
IR → instruction selection → register allocation → machine code (errors reported along the way)
Responsibilities
translate IR into target machine code
choose instructions for each IR operation
decide what to keep in registers at each point
ensure conformance with system interfaces
Automation has been less successful here
Back end
IR → instruction selection → register allocation → machine code (errors reported along the way)
Instruction selection:
produce compact, fast code
use available addressing modes
pattern matching problem
ad hoc techniques
tree pattern matching
string pattern matching
dynamic programming
Back end
IR → instruction selection → register allocation → machine code (errors reported along the way)
Register Allocation:
have value in a register when used
limited resources
changes instruction choices
can move loads and stores
optimal allocation is difficult
Modern allocators often use an analogy to graph coloring
Traditional three pass compiler
source code → front end → IR → middle end → IR → back end → machine code (errors reported along the way)
Code Improvement
analyzes and changes IR
goal is to reduce runtime
must preserve values
The Tiger compiler
[figure: the Tiger compiler phases in sequence: Lex, Parse, Parsing Actions, Semantic Analysis, Frame Layout, Translate, Canonicalize, Instruction Selection, Control Flow Analysis, Data Flow Analysis, Register Allocation, Code Emission, Assembler, Linker. The program passes through the forms: source program, tokens, reductions, abstract syntax, IR trees (via frames, tables, and environments), Assem, flow graph, interference graph, register assignment, assembly language, relocatable object code, machine language]
The Tiger compiler phases
Lex: Break the source file into individual words, or tokens
Parse: Analyse the phrase structure of the program
Parsing Actions: Build a piece of abstract syntax tree for each phrase
Semantic Analysis: Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase
Frame Layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way
Translate: Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target machine
Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for convenience of later phases
Instruction Selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions
Control Flow Analysis: Analyse the sequence of instructions into a control flow graph showing all possible flows of control the program might follow when it runs
Data Flow Analysis: Gather information about the flow of data through variables of the program; e.g., liveness analysis calculates the places where each variable holds a still-needed (live) value
Register Allocation: Choose registers for variables and temporary values; variables not simultaneously live can share the same register
Code Emission: Replace the temporary names in each machine instruction with registers

A straight-line programming language
A straight-line programming language (no loops or conditionals):
<Stm> → <Stm> ; <Stm>            (CompoundStm)
<Stm> → id := <Exp>              (AssignStm)
<Stm> → print ( <ExpList> )      (PrintStm)
<Exp> → id                       (IdExp)
<Exp> → num                      (NumExp)
<Exp> → <Exp> <Binop> <Exp>      (OpExp)
<Exp> → ( <Stm> , <Exp> )        (EseqExp)
<ExpList> → <Exp> , <ExpList>    (PairExpList)
<ExpList> → <Exp>                (LastExpList)
<Binop> → +                      (Plus)
<Binop> → -                      (Minus)
<Binop> → *                      (Times)
<Binop> → /                      (Div)

e.g., a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

prints:
8 7
80
Tree representation
a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

[figure: the corresponding tree: a CompoundStm whose left child is the AssignStm for a (an OpExp adding NumExp 5 and NumExp 3) and whose right child is a CompoundStm holding the AssignStm for b (an EseqExp containing the PrintStm over a and a - 1, then 10 * a) and the final PrintStm over b]

This is a convenient internal representation for a compiler to use.
Java classes for trees
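The code for this slide did not survive conversion; the sketch below reconstructs the straight-line-program tree classes in the style of Appel's Modern Compiler Implementation in Java, with one class per grammar rule (constructor shapes follow the grammar above; exact field names are an assumption):

```java
abstract class Stm {}
class CompoundStm extends Stm {           // Stm ; Stm
    Stm stm1, stm2;
    CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; }
}
class AssignStm extends Stm {             // id := Exp
    String id; Exp exp;
    AssignStm(String i, Exp e) { id = i; exp = e; }
}
class PrintStm extends Stm {              // print ( ExpList )
    ExpList exps;
    PrintStm(ExpList e) { exps = e; }
}

abstract class Exp {}
class IdExp extends Exp {
    String id;
    IdExp(String i) { id = i; }
}
class NumExp extends Exp {
    int num;
    NumExp(int n) { num = n; }
}
class OpExp extends Exp {                 // Exp Binop Exp
    static final int Plus = 1, Minus = 2, Times = 3, Div = 4;
    Exp left, right; int oper;
    OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
}
class EseqExp extends Exp {               // ( Stm , Exp )
    Stm stm; Exp exp;
    EseqExp(Stm s, Exp e) { stm = s; exp = e; }
}

abstract class ExpList {}
class PairExpList extends ExpList {       // Exp , ExpList
    Exp head; ExpList tail;
    PairExpList(Exp h, ExpList t) { head = h; tail = t; }
}
class LastExpList extends ExpList {       // Exp
    Exp head;
    LastExpList(Exp h) { head = h; }
}
```

The example program above is then one nested constructor expression rooted at a CompoundStm.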
Chapter 2: Lexical Analysis
Scanner
source code → scanner → tokens → parser → IR (errors reported along the way)

maps characters into tokens, the basic unit of syntax
  x = x + y  becomes  <id,x> = <id,x> + <id,y>
character string value for a token is a lexeme
typical tokens: number, id, +, -, *, /
eliminates white space (tabs, blanks, comments)
a key issue is speed: use specialized recognizer (as opposed to lex)
Specifying patterns
A scanner must recognize various parts of the language's syntax
Some parts are easy:

white space
  <ws> ::= <ws> ' ' | <ws> '\t' | ' ' | '\t'
keywords and operators
  specified as literal patterns: do, end
comments
  opening and closing delimiters: /* ... */
Specifying patterns
A scanner must recognize various parts of the language's syntax
Other parts are much harder:

identifiers
  alphabetic followed by up to k alphanumerics (_, $, &, ...)
numbers
  integers: 0 or a digit from 1-9 followed by digits from 0-9
  decimals: integer '.' digits from 0-9
  reals: (integer or decimal) 'E' (+ or -) digits from 0-9
  complex: '(' real ',' real ')'

We need a powerful notation to specify these patterns
Operations on languages
Operation                              Definition
union of L and M, written L ∪ M        L ∪ M = { s | s ∈ L or s ∈ M }
concatenation of L and M, written LM   LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, written L*        L* = ∪_{i ≥ 0} L^i
positive closure of L, written L+      L+ = ∪_{i ≥ 1} L^i
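For finite languages these operations can be played with directly. A small sketch (the helper names are my own), with Kleene closure truncated to strings of at most n factors since true L* is infinite:

```java
import java.util.*;

class Lang {
    static Set<String> union(Set<String> L, Set<String> M) {
        Set<String> r = new HashSet<>(L);
        r.addAll(M);
        return r;
    }
    static Set<String> concat(Set<String> L, Set<String> M) {
        Set<String> r = new HashSet<>();
        for (String s : L)
            for (String t : M)
                r.add(s + t);            // { st | s in L, t in M }
        return r;
    }
    // L^0 ∪ L^1 ∪ ... ∪ L^n, a finite prefix of the Kleene closure L*
    static Set<String> star(Set<String> L, int n) {
        Set<String> r = new HashSet<>();
        r.add("");                       // L^0 = { ε }
        Set<String> pow = new HashSet<>();
        pow.add("");
        for (int i = 1; i <= n; i++) {
            pow = concat(pow, L);        // L^i = L^{i-1} L
            r.addAll(pow);
        }
        return r;
    }
}
```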
Regular expressions
Patterns are often specified as regular languages
Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars

Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs denoting L(r) and L(s), then:
   (r) is a RE denoting L(r)
   (r) | (s) is a RE denoting L(r) ∪ L(s)
   (r)(s) is a RE denoting L(r)L(s)
   (r)* is a RE denoting (L(r))*

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.
Examples
identifier
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id → letter (letter | digit)*

numbers
  integer → (+ | - | ε)(0 | (1 | 2 | 3 | ... | 9) digit*)
  decimal → integer . digit*
  real → (integer | decimal) E (+ | - | ε) digit*
  complex → ( real , real )

Numbers can get much more complicated
Most programming language tokens can be described with REs
We can use REs to build scanners automatically
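Because these token patterns are regular, they transcribe directly into a regex library. A sketch using java.util.regex (the class and method names are mine), covering just the id and signed-integer definitions above:

```java
import java.util.regex.Pattern;

class TokenClass {
    // id → letter (letter | digit)*
    static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9]*");
    // integer → (+ | - | ε)(0 | (1..9) digit*)
    static final Pattern INT = Pattern.compile("[+-]?(0|[1-9][0-9]*)");

    static String classify(String lexeme) {
        if (INT.matcher(lexeme).matches()) return "number";
        if (ID.matcher(lexeme).matches()) return "id";
        return "other";
    }
}
```

Note that "007" is rejected: the integer pattern, like the slide's definition, forbids leading zeros.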
Algebraic properties of REs
Axiom                        Description
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(rs)t = r(st)                concatenation is associative
r(s | t) = rs | rt           concatenation distributes over |
(s | t)r = sr | tr
εr = r,  rε = r              ε is the identity for concatenation
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
Examples
Let Σ = { a, b }
1. a | b denotes { a, b }
2. (a | b)(a | b) denotes { aa, ab, ba, bb }
   i.e., (a | b)(a | b) = aa | ab | ba | bb
3. a* denotes { ε, a, aa, aaa, ... }
4. (a | b)* denotes the set of all strings of a's and b's (including ε)
   i.e., (a | b)* = (a* b*)*
5. a | a* b denotes { a, b, ab, aab, aaab, aaaab, ... }
Recognizers
From a regular expression we can construct a deterministic finite automaton (DFA)
Recognizer for identifier:

[DFA: start state 0; 0 --letter--> 1; 1 --letter, digit--> 1; 1 --other--> 2 (accept); 0 --digit, other--> 3 (error)]

identifier
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id → letter (letter | digit)*
Code for the recognizer
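The code on this slide was lost in conversion; here is a sketch of a table-driven recognizer matching the tables on the next slide (the names are mine). State 2 is accept and state 3 is error; reaching end of input in state 1 plays the role of the 'other' transition into the accept state:

```java
class IdRecognizer {
    // character class: 0 = letter, 1 = digit, 2 = other
    static int charClass(char c) {
        if (Character.isLetter(c)) return 0;
        if (Character.isDigit(c)) return 1;
        return 2;
    }
    //                          letter digit other
    static final int[][] next = {{1,    3,    3},   // state 0: start
                                 {1,    1,    2}};  // state 1: in identifier
    static boolean isIdentifier(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            if (state > 1) break;                    // reached accept/error state
            state = next[state][charClass(s.charAt(i))];
        }
        return state == 1;                           // whole string was an identifier
    }
}
```

To change languages, we would change only the two tables, as the next slide notes.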
Tables for the recognizer
Two tables control the recognizer

char class:
  a..z, A..Z → letter
  0..9 → digit
  anything else → other

state  letter  digit  other
0      1       3      3
1      1       1      2
(state 2 accepts, state 3 is the error state)

To change languages, we can just change tables
Automatic construction
Scanner generators automatically construct code from regular expression-like descriptions
construct a DFA
use state minimization techniques
emit code for the scanner (table driven or direct code)
A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX
  emits C code for scanner
  provides macro definitions for each token (used in the parser)
Grammars for regular languages
Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact:
  For any RE r, there is a grammar g such that L(r) = L(g).
The grammars that generate regular sets are called regular grammars
Definition:
In a regular grammar, all productions have one of two forms:
1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol
These are also called type 3 grammars (Chomsky)
More regular languages
Example: the set of strings containing an even number of zeros and an
even number of ones
[DFA: states s0 (start, accepting), s1, s2, s3; 1-transitions pair s0↔s1 and s2↔s3; 0-transitions pair s0↔s2 and s1↔s3]

The RE is (00 | 11)* ((01 | 10)(00 | 11)* (01 | 10)(00 | 11)*)*
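The four-state DFA can be simulated directly with a transition table. A sketch, with my own state encoding s0 = 0 through s3 = 3 (s0 is the start state and the only accepting state):

```java
class EvenEven {
    // delta[state][c]: c = 0 for input '0', c = 1 for input '1'
    static final int[][] delta = {
        {2, 1},  // s0: even zeros, even ones (accepting)
        {3, 0},  // s1: even zeros, odd ones
        {0, 3},  // s2: odd zeros, even ones
        {1, 2},  // s3: odd zeros, odd ones
    };
    static boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            if (c != '0' && c != '1') return false;  // not in the alphabet
            state = delta[state][c - '0'];
        }
        return state == 0;
    }
}
```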
More regular expressions
What about the RE (a | b)* abb?

[automaton: s0 --a, b--> s0; s0 --a--> s1; s1 --b--> s2; s2 --b--> s3 (accepting)]

State s0 has multiple transitions on a!
⇒ nondeterministic finite automaton

        a           b
s0   {s0, s1}    {s0}
s1      -        {s2}
s2      -        {s3}
Finite automata
A non-deterministic finite automaton (NFA) consists of:
1. a set of states S = { s0, ..., sn }
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:
1. no state has a ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled a leaving s.

A DFA accepts x iff there exists a unique path through the transition graph from s0 to an accepting state such that the labels along the edges spell x.
DFAs and NFAs are equivalent
1. DFAs are clearly a subset of NFAs
2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states:
  each DFA state corresponds to a set of NFA states
  possible exponential blowup
NFA to DFA using the subset construction: example 1
[the NFA for (a | b)* abb, as above]

subset construction:

            a          b
{s0}      {s0,s1}    {s0}
{s0,s1}   {s0,s1}    {s0,s2}
{s0,s2}   {s0,s1}    {s0,s3}
{s0,s3}   {s0,s1}    {s0}

[resulting DFA over the states {s0}, {s0,s1}, {s0,s2}, {s0,s3}]
Constructing a DFA from a regular expression
RE → NFA with ε-moves
  build an NFA for each term; connect them with ε-moves

NFA with ε-moves → DFA
  construct the simulation: the subset construction

DFA → minimized DFA
  merge compatible states

DFA → RE
  construct R^k_ij = R^{k-1}_ik (R^{k-1}_kk)* R^{k-1}_kj | R^{k-1}_ij
RE to NFA
[Thompson constructions: N(ε) and N(a) as two-state automata; N(A | B) wraps N(A) and N(B) with new start and final states joined by ε-moves; N(AB) chains N(A) into N(B); N(A*) adds ε-moves to pass through N(A) zero or more times]
RE to NFA: example
(a | b)* abb:

[figure: NFA for a | b (states 1-6); wrapped with new states 0 and 7 and ε-moves for (a | b)*; then states 7-10 appended with transitions a, b, b for abb]
NFA to DFA: the subset construction
Input: an NFA N
Output: a DFA D with states Dstates and transitions Dtrans, such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and use the following operations:

Operation       Definition
ε-closure(s)    set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)    set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)      set of NFA states to which there is a transition on input symbol a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates
while ∃ unmarked state T in Dstates
    mark T
    for each input symbol a
        U = ε-closure(move(T, a))
        if U ∉ Dstates then add U to Dstates unmarked
        Dtrans[T, a] = U
    endfor
endwhile

ε-closure(s0) is the start state of D
A state of D is accepting if it contains at least one accepting state in N
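The pseudocode above maps almost line for line onto a worklist implementation. A sketch with my own representation (an NFA stored as a map from state and symbol to successor sets, with the character 'E' standing in for ε):

```java
import java.util.*;

class SubsetConstruction {
    // nfa.get(s).get(a) = set of states reachable from s on symbol a ('E' = ε)
    Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();

    void edge(int s, char a, int t) {
        nfa.computeIfAbsent(s, k -> new HashMap<>())
           .computeIfAbsent(a, k -> new HashSet<>()).add(t);
    }
    // ε-closure(T): states reachable from T on ε-transitions alone
    Set<Integer> closure(Set<Integer> T) {
        Deque<Integer> work = new ArrayDeque<>(T);
        Set<Integer> r = new HashSet<>(T);
        while (!work.isEmpty())
            for (int t : nfa.getOrDefault(work.pop(), Map.of()).getOrDefault('E', Set.of()))
                if (r.add(t)) work.push(t);
        return r;
    }
    // move(T, a): states reachable from some state in T on symbol a
    Set<Integer> move(Set<Integer> T, char a) {
        Set<Integer> r = new HashSet<>();
        for (int s : T) r.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        return r;
    }
    // The subset construction: each DFA state is a set of NFA states.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int s0, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtrans = new HashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        unmarked.push(closure(Set.of(s0)));
        while (!unmarked.isEmpty()) {
            Set<Integer> T = unmarked.pop();
            if (dtrans.containsKey(T)) continue;        // already marked
            Map<Character, Set<Integer>> row = new HashMap<>();
            dtrans.put(T, row);                         // mark T
            for (char a : alphabet.toCharArray()) {
                Set<Integer> U = closure(move(T, a));
                row.put(a, U);                          // Dtrans[T, a] = U
                if (!dtrans.containsKey(U)) unmarked.push(U);
            }
        }
        return dtrans;
    }
}
```

Fed the simple NFA for (a | b)* abb from example 1 (states s0..s3, no ε-moves), it produces the four DFA states {s0}, {s0,s1}, {s0,s2}, {s0,s3}.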
NFA to DFA using subset construction: example 2
[the ε-NFA for (a | b)* abb from the previous slide, states 0-10]

A = {0,1,2,4,7}        D = {1,2,4,5,6,7,9}
B = {1,2,3,4,6,7,8}    E = {1,2,4,5,6,7,10}
C = {1,2,4,5,6,7}

     a  b
A    B  C
B    B  D
C    B  C
D    B  E
E    B  C
Limits of regular languages
Not all languages are regular
One cannot construct DFAs to recognize these languages:
  L = { p^k q^k }
  L = { wcw^r | w ∈ Σ* }
Note: neither of these is a regular expression!
(DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:
  alternating 0's and 1's: (1 | ε)(01)*(0 | ε)
  sets of pairs of 0's and 1's: (01 | 10)+
So what is hard?
Language features that can cause problems:
reserved words
  PL/I had no reserved words
significant blanks
  FORTRAN and Algol68 ignore blanks
string constants
  special characters in strings: newline, tab, quote, comment delimiters
finite closures
  some languages limit identifier lengths; adds states to count length
  FORTRAN 66: 6 characters
These can be swept under the rug in the language design
How bad can it get?
Chapter 3: LL Parsing
The role of the parser
source code → scanner → tokens → parser → IR (errors reported along the way)
Parser
performs context-free syntax analysis
guides context-sensitive analysis
constructs an intermediate representation
produces meaningful error messages
attempts error correction
For the next few weeks, we will look at parser construction
Syntax analysis
Context-free syntax is specified with a context-free grammar.
Formally, a CFG G is a 4-tuple G = (Vt, Vn, S, P), where:

Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.

Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called a goal symbol.

P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
Notation and terminology
a, b, c, ... ∈ Vt
A, B, C, ... ∈ Vn
U, V, W, ... ∈ V
α, β, γ, ... ∈ V*
u, v, w, ... ∈ Vt*

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ
Similarly, ⇒* and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps
If S ⇒* β then β is said to be a sentential form of G
L(G) = { w ∈ Vt* | S ⇒+ w }; w ∈ L(G) is called a sentence of G
Note, L(G) = { β ∈ V* | S ⇒* β } ∩ Vt*
Syntax analysis
Grammars are often written in Backus-Naur form (BNF).
Example:
1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <expr>
3           | num
4           | id
5 <op>   ::= +
6           | -
7           | *
8           | /

This describes simple expressions over numbers and identifiers.
In a BNF for a grammar, we represent:
1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example
Scanning vs. parsing
Where do we draw the line?
<term> ::= [0-9]+ | [a-zA-Z]([a-zA-Z] | [0-9])*
<op>   ::= + | - | * | /
<expr> ::= (<term> <op>)* <term>

Regular expressions are used to classify:
  identifiers, numbers, keywords
  REs are more concise and simpler for tokens than a grammar
  more efficient scanners can be built from REs (DFAs) than grammars

Context-free grammars are used to count:
  brackets: ( ... ), begin ... end
  imparting structure: expressions

Syntactic analysis is complicated enough: the grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes the compiler more manageable.
Derivations
We can view the productions of a CFG as rewriting rules.
Using our example CFG:
<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <id,x> <op> <expr>
       ⇒ <id,x> + <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>

We have derived the sentence x + 2 * y.
We denote this: <goal> ⇒* id + num * id.
Such a sequence of rewrites is a derivation or a parse.
The process of discovering a derivation is called parsing.
Derivations
At each step, we chose a non-terminal to replace.
This choice can lead to different derivations.
Two are of particular interest:
leftmost derivationthe leftmost non-terminal is replaced at each step
rightmost derivationthe rightmost non-terminal is replaced at each step
The previous example was a leftmost derivation.
Rightmost derivation
For the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <id,y>
       ⇒ <expr> * <id,y>
       ⇒ <expr> <op> <expr> * <id,y>
       ⇒ <expr> <op> <num,2> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id.
Precedence
[parse tree for the rightmost derivation: * at the root, with the subtree for x + 2 on the left and y on the right]

Treewalk evaluation computes (x + 2) * y: the wrong answer!
Should be x + (2 * y)
Precedence
These two derivations point out a problem with the grammar.
It has no notion of precedence, or implied order of evaluation.
To add precedence takes additional machinery:
1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3             | <expr> - <term>
4             | <term>
5 <term>   ::= <term> * <factor>
6             | <term> / <factor>
7             | <factor>
8 <factor> ::= num
9             | id

This grammar enforces a precedence on the derivation:
  terms must be derived from expressions
  forces the correct tree
Precedence
Now, for the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id, but this time, we build the desired tree.
Precedence
[parse tree: under <goal>, <expr> + <term>, where the <expr> derives x and the <term> derives 2 * y via <term> * <factor>]

Treewalk evaluation computes x + (2 * y)
Ambiguity
If a grammar has more than one derivation for a single sentential form, then it is ambiguous

Example:
  <stmt> ::= if <expr> then <stmt>
           | if <expr> then <stmt> else <stmt>
           | other stmts

Consider deriving the sentential form:
  if E1 then if E2 then S1 else S2
It has two derivations.
This ambiguity is purely grammatical.
It is a context-free ambiguity.
Ambiguity
May be able to eliminate ambiguities by rearranging the grammar:

  <stmt>      ::= <matched>
                | <unmatched>
  <matched>   ::= if <expr> then <matched> else <matched>
                | other stmts
  <unmatched> ::= if <expr> then <stmt>
                | if <expr> then <matched> else <unmatched>

This generates the same language as the ambiguous grammar, but applies the common sense rule:
  match each else with the closest unmatched then
This is most likely the language designer's intent.
Ambiguity
Ambiguity is often due to confusion in the context-free specification.
Context-sensitive confusions can arise from overloading.
Example:
  In many Algol-like languages, an expression like a(i, j) could be a function call or a subscripted variable reference.

Disambiguating this statement requires context:
  need values of declarations
  not context-free
  really an issue of type

Rather than complicate parsing, we will handle this separately.
Parsing: the big picture
[diagram: grammar → parser generator → parser code; tokens → parser → IR]

Our goal is a flexible parser generator system
Top-down versus bottom-up
Top-down parsers
start at the root of derivation tree and fill in
picks a production and tries to match the input
may require backtracking
some grammars are backtrack-free (predictive)
Bottom-up parsers
start at the leaves and fill in
start in a state valid for legal first tokens
as input is consumed, change state to encode possibilities
(recognize valid prefixes)
use a stack to store both state and sentential forms
Top-down parsing
A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.
To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:
1. At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α
2. When a terminal is added to the fringe that doesn't match the input string, backtrack
3. Find the next node to be expanded (must have a label in Vn)
The key is selecting the right production in step 1:
  should be guided by the input string
Simple expression grammar
Recall our grammar for simple expressions:
1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3             | <expr> - <term>
4             | <term>
5 <term>   ::= <term> * <factor>
6             | <term> / <factor>
7             | <factor>
8 <factor> ::= num
9             | id

Consider the input string x - 2 * y
Example
Another possible parse for x - 2 * y:

Prod'n  Sentential form
-       <goal>
1       <expr>
2       <expr> + <term>
2       <expr> + <term> + <term>
2       <expr> + <term> + <term> + <term>
2       ...

If the parser makes the wrong choices, expansion doesn't terminate. This isn't a good property for a parser to have.
(Parsers should terminate!)
Left-recursion
Top-down parsers cannot handle left-recursion in a grammar
Formally, a grammar is left-recursive if
  ∃ A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive
Eliminating left-recursion
To remove left-recursion, we can transform the grammar
Consider the grammar fragment:
  <foo> ::= <foo> α
          | β
where α and β do not start with <foo>

We can rewrite this as:
  <foo> ::= β <bar>
  <bar> ::= α <bar>
          | ε
where <bar> is a new non-terminal

This fragment contains no left-recursion
Example
Our expression grammar contains two cases of left-recursion:

  <expr> ::= <expr> + <term>
           | <expr> - <term>
           | <term>
  <term> ::= <term> * <factor>
           | <term> / <factor>
           | <factor>

Applying the transformation gives:

  <expr>  ::= <term> <expr'>
  <expr'> ::= + <term> <expr'>
            | - <term> <expr'>
            | ε
  <term>  ::= <factor> <term'>
  <term'> ::= * <factor> <term'>
            | / <factor> <term'>
            | ε

With this grammar, a top-down parser will
  terminate
  backtrack on some inputs
Example
This cleaner grammar defines the same language:

1 <goal>   ::= <expr>
2 <expr>   ::= <term> + <expr>
3             | <term> - <expr>
4             | <term>
5 <term>   ::= <factor> * <term>
6             | <factor> / <term>
7             | <factor>
8 <factor> ::= num
9             | id

It is
  right-recursive
  free of ε productions
Unfortunately, it generates a different associativity
Same syntax, different meaning
Example
Our long-suffering expression grammar:

1 <goal>    ::= <expr>
2 <expr>    ::= <term> <expr'>
3 <expr'>   ::= + <term> <expr'>
4              | - <term> <expr'>
5              | ε
6 <term>    ::= <factor> <term'>
7 <term'>   ::= * <factor> <term'>
8              | / <factor> <term'>
9              | ε
10 <factor> ::= num
11             | id

Recall, we factored out left-recursion
How much lookahead is needed?
We saw that top-down parsers may need to backtrack when they select the wrong production.
Do we need arbitrary lookahead to parse CFGs?
in general, yes
  use the Earley or Cocke-Younger-Kasami algorithms
  (Aho, Hopcroft, and Ullman, Problem 2.34; Parsing, Translation and Compiling, Chapter 4)
Fortunately
large subclasses of CFGs can be parsed with limited lookahead
most programming language constructs can be expressed in a grammar that falls in these subclasses
Among the interesting subclasses are:
LL(1): left to right scan, left-most derivation, 1-token lookahead; and
LR(1): left to right scan, right-most derivation, 1-token lookahead
Predictive parsing
Basic idea:
For any two productions A ::= α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt*, w ∈ FIRST(α) iff α ⇒* wγ.

Key property:
Whenever two productions A ::= α and A ::= β both appear in the grammar, we would like
  FIRST(α) ∩ FIRST(β) = ∅

This would allow the parser to make a correct choice with a lookahead of only one symbol!

The example grammar has this property!
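FIRST sets can be computed by iterating to a fixed point. A sketch with my own representation (a grammar as a map from nonterminal to alternatives, each alternative a list of symbols; "" stands for ε, and any symbol that is not a key of the map is treated as a terminal):

```java
import java.util.*;

class First {
    // Compute FIRST for every nonterminal by fixed-point iteration.
    static Map<String, Set<String>> first(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> F = new HashMap<>();
        for (String a : g.keySet()) F.put(a, new HashSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (var e : g.entrySet())
                for (List<String> rhs : e.getValue())
                    changed |= F.get(e.getKey()).addAll(firstOf(rhs, F));
        }
        return F;
    }
    // FIRST of a right-hand side, given the FIRST sets computed so far.
    static Set<String> firstOf(List<String> rhs, Map<String, Set<String>> F) {
        Set<String> r = new HashSet<>();
        for (String sym : rhs) {
            Set<String> f = F.containsKey(sym) ? F.get(sym) : Set.of(sym); // terminal: itself
            r.addAll(f);
            r.remove("");
            if (!f.contains("")) return r;   // sym cannot derive ε: stop here
        }
        r.add("");                           // every symbol nullable, so ε ∈ FIRST
        return r;
    }
}
```

Running it on the left-factored expression grammar gives FIRST(<expr'>) = { +, -, ε } and FIRST(<expr>) = { num, id }.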
Left factoring
What if a grammar does not have this property?
Sometimes, we can transform a grammar to have this property.

For each non-terminal A, find the longest prefix α common to two or more of its alternatives.
if α ≠ ε then replace all of the A productions
  A ::= α β1 | α β2 | ... | α βn
with
  A ::= α A'
  A' ::= β1 | β2 | ... | βn
where A' is a new non-terminal.

Repeat until no two alternatives for a single non-terminal have a common prefix.
Example
Consider a right-recursive version of the expression grammar:

1 <goal>   ::= <expr>
2 <expr>   ::= <term> + <expr>
3             | <term> - <expr>
4             | <term>
5 <term>   ::= <factor> * <term>
6             | <factor> / <term>
7             | <factor>
8 <factor> ::= num
9             | id

To choose between productions 2, 3, and 4, the parser must see past the num or id and look at the +, -, *, or /.

FIRST(2) ∩ FIRST(3) ∩ FIRST(4) ≠ ∅

This grammar fails the test.
Note: This grammar is right-associative.
Example
There are two nonterminals that must be left factored:

  <expr> ::= <term> + <expr>
           | <term> - <expr>
           | <term>
  <term> ::= <factor> * <term>
           | <factor> / <term>
           | <factor>

Applying the transformation gives us:

  <expr>  ::= <term> <expr'>
  <expr'> ::= + <expr>
            | - <expr>
            | ε
  <term>  ::= <factor> <term'>
  <term'> ::= * <term>
            | / <term>
            | ε
Example
Substituting back into the grammar yields
1 goal ::= expr
2 expr ::= term expr′
3 expr′ ::= + expr
4        | - expr
5        | ε
6 term ::= factor term′
7 term′ ::= * term
8        | / term
9        | ε
10 factor ::= num
11         | id

Now, selection requires only a single token lookahead.
Note: This grammar is still right-associative.
90
Example
Prod'n  Sentential form                 Input
–       goal                            ↑x - 2 * y
1       expr                            ↑x - 2 * y
2       term expr′                      ↑x - 2 * y
6       factor term′ expr′              ↑x - 2 * y
11      id term′ expr′                  ↑x - 2 * y
–       id term′ expr′                  x ↑- 2 * y
9       id expr′                        x ↑- 2 * y
4       id - expr                       x ↑- 2 * y
–       id - expr                       x - ↑2 * y
2       id - term expr′                 x - ↑2 * y
6       id - factor term′ expr′         x - ↑2 * y
10      id - num term′ expr′            x - ↑2 * y
–       id - num term′ expr′            x - 2 ↑* y
7       id - num * term expr′           x - 2 ↑* y
–       id - num * term expr′           x - 2 * ↑y
6       id - num * factor term′ expr′   x - 2 * ↑y
11      id - num * id term′ expr′       x - 2 * ↑y
–       id - num * id term′ expr′       x - 2 * y↑
9       id - num * id expr′             x - 2 * y↑
5       id - num * id                   x - 2 * y↑
The next symbol determined each choice correctly.
91
Back to left-recursion elimination
Given a left-factored CFG, to eliminate left-recursion:
if A → Aα | β then replace all of the A productions
A → Aα | β | γ
with
A → NA′
N → β | γ
A′ → αA′ | ε
where N and A′ are new non-terminals.
Repeat until there are no left-recursive productions.
92
Generality
Question:
By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer:
Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:
{aⁿ0bⁿ | n ≥ 1} ∪ {aⁿ1b²ⁿ | n ≥ 1}
Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.
93
Recursive descent parsing
Now, we can produce a simple recursive descent parser from the (right-associative) grammar.
94
Recursive descent parsing
95
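The parser code on these two slides did not survive the conversion. As a sketch only (not the original slide code; the token representation and method names are assumptions), a recursive descent parser for the left-factored expression grammar might look like this:

```java
// Sketch of a recursive descent parser for the left-factored grammar:
//   expr ::= term expr'  ;  expr' ::= + expr | - expr | ε
//   term ::= factor term' ; term' ::= * term | / term | ε
//   factor ::= num | id
public class RDParser {
    private final String[] toks;   // token stream, e.g. {"id", "-", "num", "*", "id"}
    private int pos = 0;

    public RDParser(String[] toks) { this.toks = toks; }

    private String peek() { return pos < toks.length ? toks[pos] : "$"; }
    private void eat(String t) {
        if (!peek().equals(t))
            throw new RuntimeException("expected " + t + ", got " + peek());
        pos++;
    }

    public boolean goal() { expr(); return peek().equals("$"); } // accept iff input consumed

    private void expr() { term(); exprPrime(); }
    private void exprPrime() {                 // expr' ::= + expr | - expr | ε
        if (peek().equals("+") || peek().equals("-")) { eat(peek()); expr(); }
        // else ε: the choice is made on one-token lookahead alone
    }
    private void term() { factor(); termPrime(); }
    private void termPrime() {                 // term' ::= * term | / term | ε
        if (peek().equals("*") || peek().equals("/")) { eat(peek()); term(); }
    }
    private void factor() {                    // factor ::= num | id
        if (peek().equals("num") || peek().equals("id")) eat(peek());
        else throw new RuntimeException("expected num or id, got " + peek());
    }

    public static void main(String[] args) {
        System.out.println(new RDParser(new String[]{"id", "-", "num", "*", "id"}).goal());
    }
}
```

Each non-terminal becomes one procedure; the ε-alternatives are selected on lookahead alone, which is exactly the FIRST-set property established above.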
Building the tree
One of the key jobs of the parser is to build an intermediate representation
of the source code.
To build an abstract syntax tree, we can simply insert code at the appropriate points:
factor can stack nodes id, num
term′ can stack nodes *, /; can pop 3, build and push subtree
expr′ can stack nodes +, -; can pop 3, build and push subtree
goal can pop and return tree
96
Non-recursive predictive parsing
Observation:
Our recursive descent parser encodes state information in its run-time stack, or call stack.
Using recursive procedure calls to implement a stack abstraction may not
be particularly efficient. This suggests other implementation methods:
explicit stack, hand-coded parser
stack-based, table-driven parser
97
Non-recursive predictive parsing
Now, a predictive parser looks like:
[Diagram: source code → scanner → tokens → table-driven parser → IR; the parser consults the parsing tables and an explicit stack]
Rather than writing code, we build tables.
Building tables can be automated!
98
Table-driven parsers
A parser generator system often looks like:
[Diagram: grammar → parser generator → parsing tables; source code → scanner → tokens → table-driven parser (with stack and parsing tables) → IR]
This is true for both top-down (LL) and bottom-up (LR) parsers
99
Non-recursive predictive parsing
Input: a string w and a parsing table M for G
push $ and then the Start Symbol onto the stack
repeat
  let X be the top-of-stack symbol and a the next input token
  if X is a terminal or $ then
    if X = a then pop X and advance the input
    else error()
  else   // X is a non-terminal
    if M[X, a] = X → Y1 Y2 … Yk then
      pop X; push Yk, Yk−1, …, Y1 (so Y1 is on top)
    else error()
until X = $
100
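The driver code on this slide was lost in conversion. The standard skeleton above can be sketched in Java; the table here is hand-coded for the left-factored expression grammar (table encoding and symbol names are illustrative, not the slide's original):

```java
import java.util.*;

// Sketch of a non-recursive predictive parser. The table M maps a
// non-terminal and a lookahead token to a RHS; an empty RHS encodes ε.
public class LLDriver {
    static final Map<String, Map<String, String[]>> M = new HashMap<>();
    static void put(String nt, String tok, String... rhs) {
        M.computeIfAbsent(nt, k -> new HashMap<>()).put(tok, rhs);
    }
    static {
        for (String t : new String[]{"num", "id"}) {
            put("goal", t, "expr");
            put("expr", t, "term", "expr'");
            put("term", t, "factor", "term'");
            put("factor", t, t);
        }
        put("expr'", "+", "+", "expr"); put("expr'", "-", "-", "expr");
        put("expr'", "$");                                        // expr' ::= ε
        put("term'", "*", "*", "term"); put("term'", "/", "/", "term");
        put("term'", "+"); put("term'", "-"); put("term'", "$");  // term' ::= ε
    }

    static boolean parse(List<String> input) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("goal");
        int i = 0;
        while (true) {
            String X = stack.pop();
            String a = i < input.size() ? input.get(i) : "$";
            if (X.equals("$")) return a.equals("$");   // accept iff input exhausted
            if (!M.containsKey(X)) {                   // X is a terminal: match it
                if (!X.equals(a)) return false;
                i++;
            } else {                                   // consult M[X, a]
                String[] rhs = M.get(X).get(a);
                if (rhs == null) return false;         // error entry
                for (int j = rhs.length - 1; j >= 0; j--)
                    stack.push(rhs[j]);                // push Yk, ..., Y1
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("id", "-", "num", "*", "id")));
    }
}
```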
FIRST
For a string of grammar symbols α, define FIRST(α) as:
the set of terminal symbols that begin strings derived from α:
{a ∈ Vt | α ⇒* aβ}

If α ⇒* ε then ε ∈ FIRST(α)
FIRST(α) contains the set of tokens valid in the initial position in α

To build FIRST(X):
1. If X ∈ Vt then FIRST(X) is {X}
2. If X → ε then add ε to FIRST(X).
3. If X → Y1 Y2 … Yk:
(a) Put FIRST(Y1) − {ε} in FIRST(X)
(b) ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ … ∩ FIRST(Yi−1)
    (i.e., Y1 … Yi−1 ⇒* ε)
    then put FIRST(Yi) − {ε} in FIRST(X)
(c) If ε ∈ FIRST(Y1) ∩ … ∩ FIRST(Yk) then put ε in FIRST(X)
Repeat until no more additions can be made.
102
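The construction above runs to a fixed point. A sketch in Java (not the original slide code; the encoding of productions as String arrays is an assumption):

```java
import java.util.*;

// Fixed-point computation of FIRST sets. A production is a String[]:
// element 0 is the LHS, the rest is the RHS (a one-element array encodes
// A ::= ε). Terminals are the symbols that never appear as a LHS;
// ε is represented literally as "ε".
public class FirstSets {
    public static Map<String, Set<String>> first(List<String[]> prods) {
        Set<String> nonterms = new HashSet<>();
        for (String[] p : prods) nonterms.add(p[0]);
        Map<String, Set<String>> FIRST = new HashMap<>();
        for (String nt : nonterms) FIRST.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                          // repeat until no more additions
            changed = false;
            for (String[] p : prods) {
                Set<String> f = FIRST.get(p[0]);
                boolean allNullable = true;        // were Y1..Yi-1 all nullable?
                for (int i = 1; i < p.length && allNullable; i++) {
                    String y = p[i];
                    if (!nonterms.contains(y)) {   // terminal: FIRST(y) = {y}
                        changed |= f.add(y);
                        allNullable = false;
                    } else {                       // copy FIRST(y) - {ε}
                        Set<String> fy = FIRST.get(y);
                        for (String t : new ArrayList<>(fy))
                            if (!t.equals("ε")) changed |= f.add(t);
                        allNullable = fy.contains("ε");
                    }
                }
                if (allNullable) changed |= f.add("ε"); // ε-production or all Yi nullable
            }
        }
        return FIRST;
    }

    public static void main(String[] args) {
        List<String[]> g = Arrays.asList(
            new String[]{"factor", "num"}, new String[]{"factor", "id"},
            new String[]{"term'", "*", "term"}, new String[]{"term'"},
            new String[]{"term", "factor", "term'"});
        System.out.println(first(g).get("term"));   // the set {num, id}
    }
}
```

The FOLLOW construction on the next slide iterates to a fixed point in exactly the same style.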
FOLLOW
For a non-terminal A, define FOLLOW(A) as
the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it.
A terminal symbol has no FOLLOW set.
To build FOLLOW(A):
1. Put $ in FOLLOW(goal)
2. If A → αBβ:
(a) Put FIRST(β) − {ε} in FOLLOW(B)
(b) If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒* ε) then put FOLLOW(A) in FOLLOW(B)
Repeat until no more additions can be made
103
LL(1) grammars
Previous definition
A grammar G is LL(1) iff. for all non-terminals A, each distinct pair of productions A → β and A → γ satisfy the condition FIRST(β) ∩ FIRST(γ) = ∅.

What if A ⇒* ε?

Revised definition
A grammar G is LL(1) iff. for each set of productions A → α1 | α2 | … | αn:
1. FIRST(α1), FIRST(α2), …, FIRST(αn) are all pairwise disjoint
2. If αi ⇒* ε then FIRST(αj) ∩ FOLLOW(A) = ∅, ∀ 1 ≤ j ≤ n, i ≠ j.
If G is ε-free, condition 1 is sufficient.
104
LL(1) grammars
Provable facts about LL(1) grammars:
1. No left-recursive grammar is LL(1)
2. No ambiguous grammar is LL(1)
3. Some languages have no LL(1) grammar
4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example
S → aS | a
is not LL(1) because FIRST(aS) = FIRST(a) = {a}
S → aS′
S′ → aS′ | ε
accepts the same language and is LL(1)
105
LL(1) parse table construction
Input: Grammar G
Output: Parsing table M
Method:
1. ∀ productions A → α:
(a) ∀ a ∈ FIRST(α), add A → α to M[A, a]
(b) If ε ∈ FIRST(α):
i. ∀ b ∈ FOLLOW(A), add A → α to M[A, b]
ii. If $ ∈ FOLLOW(A) then add A → α to M[A, $]
2. Set each undefined entry of M to error
If ∃ M[A, a] with multiple entries then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε
106
Example
Our long-suffering expression grammar:
S  → E
E  → TE′
E′ → +E
   | -E
   | ε
T  → FT′
T′ → *T
   | /T
   | ε
F  → num
   | id

     FIRST           FOLLOW
S    {num, id}       {$}
E    {num, id}       {$}
E′   {+, -, ε}       {$}
T    {num, id}       {+, -, $}
T′   {*, /, ε}       {+, -, $}
F    {num, id}       {+, -, *, /, $}

     num        id         +          -          *         /         $
S    S → E      S → E      –          –          –         –         –
E    E → TE′    E → TE′    –          –          –         –         –
E′   –          –          E′ → +E    E′ → -E    –         –         E′ → ε
T    T → FT′    T → FT′    –          –          –         –         –
T′   –          –          T′ → ε     T′ → ε     T′ → *T   T′ → /T   T′ → ε
F    F → num    F → id     –          –          –         –         –
107
A grammar that is not LL(1)
stmt ::= if expr then stmt
       | if expr then stmt else stmt

Left-factored:
stmt ::= if expr then stmt stmt′
stmt′ ::= else stmt
        | ε

Now, FIRST(stmt′) = {ε, else}
Also, FOLLOW(stmt′) = {else, $}
But, FIRST(stmt′) ∩ FOLLOW(stmt′) = {else} ≠ ∅

On seeing else, there is a conflict between choosing
stmt′ ::= else stmt and stmt′ ::= ε
⇒ the grammar is not LL(1)!

The fix:
Put priority on stmt′ ::= else stmt to associate else with the closest previous then.
108
Error recovery
Key notion:
For each non-terminal, construct a set of terminals on which the parser
can synchronize
When an error occurs looking for A, scan until an element of SYNCH(A) is found

Building SYNCH:
1. a ∈ FOLLOW(A) ⇒ a ∈ SYNCH(A)
2. place keywords that start statements in SYNCH(A)
3. add symbols in FIRST(A) to SYNCH(A)

If we can't match a terminal on top of stack:
1. pop the terminal
2. print a message saying the terminal was inserted
3. continue the parse
(i.e., SYNCH(a) = Vt − {a})
109
Chapter 4: LR Parsing
110
Some definitions
Recall
For a grammar G, with start symbol S, any string α such that S ⇒* α is called a sentential form
If α ∈ Vt*, then α is called a sentence in L(G)
Otherwise it is just a sentential form (not a sentence in L(G))
A left-sentential form is a sentential form that occurs in the leftmost deriva-
tion of some sentence.
A right-sentential form is a sentential form that occurs in the rightmost
derivation of some sentence.
111
Bottom-up parsing
Goal:
Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.
The parser repeatedly matches a right-sentential form from the language against the tree's upper frontier.
At each match, it applies a reduction to build on the frontier:
each reduction matches an upper frontier of the partially built tree to
the RHS of some production
each reduction adds a node on top of the frontier
The final result is a rightmost derivation, in reverse.
112
Example
Consider the grammar
1 S ::= aABe
2 A ::= Abc
3     | b
4 B ::= d

and the input string abbcde

Prod'n  Sentential Form
–       abbcde
3       aAbcde
2       aAde
4       aABe
1       S
The trick appears to be scanning the input and finding valid sentential
forms.
113
Handles
What are we trying to find?
A substring of the tree's upper frontier that matches some production A → β where reducing β to A is one step in the reverse of a rightmost derivation

We call such a string a handle.

Formally:
a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ

i.e., if S ⇒*rm αAw ⇒rm αβw then A → β in the position following α is a handle of αβw

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.
114
Handles
[Figure: parse tree for αβw with S at the root; the handle A → β covers β, with w to its right]
The handle A → β in the parse tree for αβw
115
Handles
Theorem:
If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)
1. G is unambiguous ⇒ rightmost derivation is unique
2. ⇒ a unique production A → β applied to take γi−1 to γi
3. ⇒ a unique position k at which A → β is applied
4. ⇒ a unique handle A → β
116
Handle-pruning
The process to construct a bottom-up parse is called handle-pruning.
To construct a rightmost derivation
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn−1 ⇒ γn = w
we set i to n and apply the following simple algorithm
for i = n downto 1
1. find the handle Ai → βi in γi
2. replace βi with Ai to generate γi−1
This takes 2n steps, where n is the length of the derivation
118
Stack implementation
One scheme to implement a handle-pruning, bottom-up parser is called a
shift-reduce parser.
Shift-reduce parsers use a stack and an input buffer

1. initialize stack with $
2. Repeat until the top of the stack is the goal symbol and the input token is $
a) find the handle
   if we don't have a handle on top of the stack, shift an input symbol onto the stack
b) prune the handle
   if we have a handle A → β on the stack, reduce
   i) pop |β| symbols off the stack
   ii) push A onto the stack
119
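The two steps above can be sketched in Java. Recognizing handles is the hard part (as the slides note later, it is what LR table construction solves); in this sketch it is abstracted as an "oracle" that inspects the stack, and TOY hard-codes the handles for the grammar S ::= aABe, A ::= Abc | b, B ::= d from the earlier example. All names here are assumptions, not the original slide code:

```java
import java.util.*;

// Shift-reduce skeleton: shift until a handle appears on top of the stack,
// then reduce it. The oracle returns {LHS, rhs...} for a handle, or null
// meaning "no handle: shift".
public class ShiftReduce {
    public interface Oracle { String[] handle(List<String> stack); }

    // Hand-written handle recognizer for S ::= aABe, A ::= Abc | b, B ::= d.
    public static final Oracle TOY = stack -> {
        if (ends(stack, "a", "b")) return new String[]{"A", "b"};   // A ::= b (after a)
        if (ends(stack, "A", "b", "c")) return new String[]{"A", "A", "b", "c"};
        if (ends(stack, "A", "d")) return new String[]{"B", "d"};
        if (ends(stack, "a", "A", "B", "e")) return new String[]{"S", "a", "A", "B", "e"};
        return null;
    };

    public static boolean parse(List<String> input, String goal, Oracle oracle) {
        List<String> stack = new ArrayList<>();
        stack.add("$");                                        // 1. initialize stack with $
        int i = 0;
        while (true) {
            String[] h = oracle.handle(stack);
            if (h != null) {                                   // prune the handle: reduce
                int n = h.length - 1;                          // |β| symbols to pop
                stack.subList(stack.size() - n, stack.size()).clear();
                stack.add(h[0]);                               // push the LHS A
            } else if (i < input.size()) {
                stack.add(input.get(i++));                     // no handle on top: shift
            } else {
                return stack.equals(Arrays.asList("$", goal)); // 2. accept or reject
            }
        }
    }

    static boolean ends(List<String> st, String... suffix) {
        if (st.size() < suffix.length) return false;
        for (int k = 0; k < suffix.length; k++)
            if (!st.get(st.size() - suffix.length + k).equals(suffix[k])) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("a","b","b","c","d","e"), "S", TOY));
    }
}
```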
Example: back to x - 2 * y

1 goal ::= expr
2 expr ::= expr + term
3       | expr - term
4       | term
5 term ::= term * factor
6       | term / factor
7       | factor
8 factor ::= num
9        | id

Stack                   Input        Action
$                       x - 2 * y    shift
$ id                    - 2 * y      reduce 9
$ factor                - 2 * y      reduce 7
$ term                  - 2 * y      reduce 4
$ expr                  - 2 * y      shift
$ expr -                2 * y        shift
$ expr - num            * y          reduce 8
$ expr - factor         * y          reduce 7
$ expr - term           * y          shift
$ expr - term *         y            shift
$ expr - term * id                   reduce 9
$ expr - term * factor               reduce 5
$ expr - term                        reduce 3
$ expr                               reduce 1
$ goal                               accept

1. Shift until top of stack is the right end of a handle
2. Find the left end of the handle and reduce
5 shifts + 9 reduces + 1 accept
120
Shift-reduce parsing
Shift-reduce parsers are simple to understand
A shift-reduce parser has just four canonical actions:
1. shift next input symbol is shifted onto the top of the stack
2. reduce right end of handle is on top of stack;
locate left end of handle within the stack;
pop handle off stack and push appropriate non-terminal LHS
3. accept terminate parsing and signal success
4. error call an error recovery routine
The key problem: to recognize handles (not covered in this course).
121
LR(k) grammars
Informally, we say that a grammar G is LR(k) if, given a rightmost derivation
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn = w
we can, for each right-sentential form in the derivation,
1. isolate the handle of each right-sentential form, and
2. determine the production by which to reduce
by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.
122
LR(k) grammars
Formally, a grammar G is LR(k) iff.:
1. S ⇒*rm αAw ⇒rm αβw, and
2. S ⇒*rm γBx ⇒rm αβy, and
3. FIRSTk(w) = FIRSTk(y)
⇒ αAy = γBx

i.e., Assume sentential forms αβw and αβy, with common prefix αβ and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that αβw reduces to αAw and αβy reduces to γBx.
But, the common prefix means αβy also reduces to αAy, for the same result.
Thus αAy = γBx.
123
Why study LR grammars?
LR(1) grammars are often used to construct parsers.
We call these parsers LR(1) parsers.
everyone's favorite parser
virtually all context-free programming language constructs can be expressed in an LR(1) form
LR grammars are the most general grammars parsable by a determin-
istic, bottom-up parser
efficient parsers can be implemented for LR(1) grammars
LR parsers detect an error as soon as possible in a left-to-right scan of the input
LR grammars describe a proper superset of the languages recognized
by predictive (i.e., LL) parsers
LL(k): recognize use of a production A → β seeing first k symbols of β
LR(k): recognize occurrence of β (the handle) having seen all of what is derived from β plus k symbols of lookahead
124
Left versus right recursion
Right Recursion:
needed for termination in predictive parsers
requires more stack space
right associative operators
Left Recursion:
works fine in bottom-up parsers
limits required stack space
left associative operators
Rule of thumb:
right recursion for top-down parsers
left recursion for bottom-up parsers
125
Parsing review
Recursive descent
A hand-coded recursive descent parser directly encodes a grammar (typically an LL(1) grammar) into a series of mutually recursive procedures. It has most of the linguistic limitations of LL(1).
LL(k)
An LL(k) parser must be able to recognize the use of a production after seeing only the first k symbols of its right hand side.
LR(k)
An LR(k) parser must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side with k symbols of lookahead.
126
Chapter 5: JavaCC and JTB
127
The Java Compiler Compiler
Can be thought of as Lex and Yacc for Java.
It is based on LL(k) rather than LALR(1).
Grammars are written in EBNF.
The Java Compiler Compiler transforms an EBNF grammar into anLL(k) parser.
The JavaCC grammar can have embedded action code written in Java,
just like a Yacc grammar can have embedded action code written in C.
The lookahead can be changed by writing LOOKAHEAD(k).
The whole input is given in just one file (not two).
128
The JavaCC input format
One file:
header
token specifications for lexical analysis
grammar
129
The JavaCC input format
Example of a token specification:
Example of a production:
130
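The token specification and production examples on this slide were lost in conversion. A typical JavaCC token specification and production, in JavaCC's actual syntax (the names INTEGER_LITERAL, StatementList, Statement, and Expression are illustrative, not necessarily the slide's originals), look like:

```
TOKEN :
{
  < INTEGER_LITERAL: ( ["1"-"9"] (["0"-"9"])* | "0" ) >
}

void StatementList() :
{}
{
  ( Statement() )* "return" Expression() ";"
}
```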
Generating a parser with JavaCC
131
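The commands on this slide were also lost. A typical build sequence looks like the following (the file names are placeholders, not the slide's originals):

```
javacc Parser.jj     # generate the parser classes from the grammar
javac Main.java      # Main.java contains a call of the generated parser
java Main < input    # run the parser on an input program
```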
The Visitor Pattern
For object-oriented programming,
the Visitor pattern enables
the definition of a new operation
on an object structure
without changing the classes
of the objects.
Gamma, Helm, Johnson, Vlissides:
Design Patterns, 1995.
132
Sneak Preview
When using the Visitor pattern,
the set of classes must be fixed in advance, and
each class must have an accept method.
133
First Approach: Instanceof and Type Casts
The running Java example: summing an integer list.
134
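The example code on this slide was lost in conversion. A sketch of the first approach, summing an integer list with instanceof and type casts (the class names are assumptions, not the original slide code):

```java
// A simple integer-list object structure.
interface IntList {}
class Nil implements IntList {}
class Cons implements IntList {
    final int head;
    final IntList tail;
    Cons(int head, IntList tail) { this.head = head; this.tail = tail; }
}

public class SumList {
    // Walks the list, testing the class of each node and casting it.
    static int sum(IntList l) {
        int total = 0;
        while (l instanceof Cons) {   // type test ...
            Cons c = (Cons) l;        // ... followed by a type cast
            total += c.head;
            l = c.tail;
        }
        return total;                 // reaching Nil ends the walk
    }
    public static void main(String[] args) {
        System.out.println(sum(new Cons(1, new Cons(2, new Cons(3, new Nil())))));  // prints 6
    }
}
```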
First Approach: Instanceof and Type Casts
Advantage: The code is written without touching the list classes themselves.
Drawback: The code constantly uses type casts and instanceof to determine what class of object it is considering.
135
Second Approach: Dedicated Methods
The first approach is not object-oriented!
To access parts of an object, the classical approach is to use dedicated methods which both access and act on the subobjects.
methods which both access and act on the subobjects.
We can now compute the sum of all components of a given list object by invoking a single dedicated method on it.
136
Second Approach: Dedicated Methods
Advantage: The type casts and instanceof operations have disappeared, and the code can be written in a systematic way.
Disadvantage: For each new operation on list objects, new dedicated methods have to be written, and all classes must be recompiled.
137
Third Approach: The Visitor Pattern
The Idea:
Divide the code into an object structure and a Visitor (akin to Functional Programming!)
Insert an accept method in each class. Each accept method takes a Visitor as argument.
A Visitor contains a visit method for each class (overloading!). A visit method for a class C takes an argument of type C.
138
Third Approach: The Visitor Pattern
The purpose of the accept methods is to invoke the visit method in the Visitor which can handle the current object.
139
Third Approach: The Visitor Pattern
The control flow goes back and forth between the visit methods in the Visitor and the accept methods in the object structure.
Notice: The visit methods describe both 1) actions, and 2) access of subobjects.
140
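The Visitor code on these slides was lost in conversion. A sketch of the pattern on the same integer-list structure (class and method names are assumptions, not the original slide code):

```java
// Visitor: one visit method per class in the object structure (overloading!).
interface Visitor {
    void visit(VNil n);
    void visit(VCons c);
}
// The object structure: each class has an accept method taking a Visitor.
interface VList { void accept(Visitor v); }
class VNil implements VList {
    public void accept(Visitor v) { v.visit(this); }  // dispatches to visit(VNil)
}
class VCons implements VList {
    final int head;
    final VList tail;
    VCons(int head, VList tail) { this.head = head; this.tail = tail; }
    public void accept(Visitor v) { v.visit(this); }  // dispatches to visit(VCons)
}
class SumVisitor implements Visitor {
    int sum = 0;                      // visitors can accumulate state
    public void visit(VNil n) {}      // nothing to add at the end of the list
    public void visit(VCons c) { sum += c.head; c.tail.accept(this); }
}
public class VisitorDemo {
    public static void main(String[] args) {
        VList l = new VCons(1, new VCons(2, new VCons(3, new VNil())));
        SumVisitor sv = new SumVisitor();
        l.accept(sv);                 // control bounces between accept and visit
        System.out.println(sv.sum);   // prints 6
    }
}
```

Adding a new operation, say computing the length, now only requires writing a new Visitor implementation; the list classes stay untouched.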
Comparison
The Visitor pattern combines the advantages of the two other approaches.
                            Frequent      Frequent
                            type casts?   recompilation?
Instanceof and type casts   Yes           No
Dedicated methods           No            Yes
The Visitor pattern         No            No
The advantage of Visitors: New methods without recompilation!
Requirement for using Visitors: All classes must have an accept method.
Tools that use the Visitor pattern:
JJTree (from Sun Microsystems) and the Java Tree Builder (from Purdue University), both frontends for The Java Compiler Compiler from Sun Microsystems.
141
Visitors: Summary
Visitor makes adding new operations easy. Simply write a new
visitor.
A visitor gathers related operations. It also separates unrelated ones.

Adding new classes to the object structure is hard. Key consideration: are you most likely to change the algorithm applied over an object structure, or are you most likely to change the classes of objects that make up the structure?
Visitors can accumulate state.
Visitor can break encapsulation. The Visitor approach assumes that the interface of the data structure classes is powerful enough to let visitors do their job. As a result, the pattern often forces you to provide public operations that access internal state, which may compromise its encapsulation.
142
The Java Tree Builder
The Java Tree Builder (JTB) has been developed here at Purdue in my
group.
JTB is a frontend for The Java Compiler Compiler.
JTB supports the building of syntax trees which can be traversed using visitors.
JTB transforms a bare JavaCC grammar into three components:
a JavaCC grammar with embedded Java code for building a syntax
tree;
one class for every form of syntax tree node; and
a default visitor which can do a depth-first traversal of a syntax tree.
143
The Java Tree Builder
The produced JavaCC grammar can then be processed by the Java Compiler Compiler to give a parser which produces syntax trees.
The produced syntax trees can now be traversed by a Java program by writing subclasses of the default visitor.
[Diagram: a JavaCC grammar is fed to JTB, which produces a JavaCC grammar with embedded Java code, syntax-tree-node classes, and a default visitor; the Java Compiler Compiler turns that grammar into a parser, which builds syntax trees with accept methods that the program traverses]
144
Using JTB
145
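The commands on this slide were lost in conversion. A typical JTB build sequence looks like the following (file names are placeholders; JTB writes its transformed grammar to jtb.out.jj by default, which is an assumption about the tool's default here):

```
jtb Parser.jj        # produce jtb.out.jj, syntax-tree classes, default visitor
javacc jtb.out.jj    # generate a tree-building parser from the transformed grammar
javac Main.java      # compile, including any visitor subclasses
java Main < input    # parse and traverse
```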
Example (simplified)
For example, consider the Java 1.1 production
JTB produces:
Notice that the production returns a syntax tree represented as an
object.
146
Example (simplified)
JTB produces a syntax-tree-node class for the non-terminal:
Notice the accept method; it invokes the visit method for this class in the default visitor.
147
Example (simplified)
The default visitor looks like this:
Notice the body of the visit method, which visits each of the three subtrees of the node.
148
Example (simplified)
Here is an example of a program which operates on syntax trees for Java 1.1 programs. The program prints the right-hand side of every assignment. The entire program is six lines:
When this visitor is passed to the root of the syntax tree, the depth-first traversal will begin, and when assignment nodes are reached, the overriding visit method is executed.
Notice the use of the pretty-printing visitor: it is a visitor which pretty prints Java 1.1 programs.
JTB is bootstrapped.
149
Chapter 6: Semantic Analysis
150
Semantic Analysis
The compilation process is driven by the syntactic structure of the program
as discovered by the parser
Semantic routines:
interpret meaning of the program based on its syntactic structure
two purposes:
finish analysis by deriving context-sensitive information
begin synthesis by generating the IR or target code
associated with individual productions of a context-free grammar or subtrees of a syntax tree
151
Context-sensitive analysis
What context-sensitive questions might the compiler ask?
1. Is a name scalar, an array, or a function?
2. Is it declared before it is used?
3. Are any names declared but not used?
4. Which declaration of a name does each reference use?
5. Is an expression type-consistent?
6. Does the dimension of a reference match the declaration?
7. Where can a value be stored? (heap, stack, …)
8. Does a pointer reference the result of a malloc()?
9. Is a variable defined before it is used?
10. Is an array reference in bounds?
11. Does a function produce a constant value?
12. Can it be implemented as a memo-function?
These cannot be answered with a context-free grammar
152
Context-sensitive analysis
Why is context-sensitive analysis hard?
answers depend on values, not syntax
questions and answers involve non-local information
answers may involve computation
Several alternatives:
attribute grammars (over the abstract syntax tree): specify non-local computations; automatic evaluators
symbol tables: central store for facts; express checking code
language design: simplify language; avoid problems
153
Symbol tables
For compile-time efficiency, compilers often use a symbol table:
associates lexical names (symbols) with their attributes
What items should be entered?
variable names
defined constants
procedure and function names
literal constants and strings
source text labels
compiler-generated temporaries (we'll get there)
Separate table for structure layouts (types) (field offsets and lengths)
A symbol table is a compile-time structure
154
Symbol table information
What kind of information might the compiler need?
textual name
data type
dimension information (for aggregates)
declaring procedure
lexical level of declaration
storage class (base address)
offset in storage
if record, pointer to structure table
if parameter, by-reference or by-value?
can it be aliased? to what other names?
number and type of arguments to functions
155
Nested scopes: block-structured symbol tables
What information is needed?
when we ask about a name, we want the most recent declaration
the declaration may be from the current scope or some enclosing
scope
innermost scope overrides declarations from outer scopes
Key point: new declarations (usually) occur only in current scope
What operations do we need?
insert(name, value): binds name to value
lookup(name): returns value bound to name
beginScope(): remembers current state of table
endScope(): restores table to state at most recent scope that has not been ended
May need to preserve list of locals for the debugger
156
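The four operations above can be sketched as a block-structured symbol table in Java: a hash table of binding stacks plus an undo stack so that endScope() can restore the table. The representation and names are assumptions (one classic alternative is a purely functional table):

```java
import java.util.*;

// Block-structured symbol table: innermost bindings shadow outer ones,
// and endScope() undoes everything bound since the matching beginScope().
public class SymbolTable<V> {
    private final Map<String, Deque<V>> table = new HashMap<>();
    private final Deque<String> undo = new ArrayDeque<>(); // names bound since beginScope
    private static final String MARK = "<mark>";           // scope boundary marker

    public void insert(String name, V value) {             // bind name to value
        table.computeIfAbsent(name, k -> new ArrayDeque<>()).push(value);
        undo.push(name);
    }
    public V lookup(String name) {                         // most recent binding wins
        Deque<V> d = table.get(name);
        return (d == null || d.isEmpty()) ? null : d.peek();
    }
    public void beginScope() { undo.push(MARK); }          // remember current state
    public void endScope() {                               // restore to matching mark
        while (!undo.isEmpty()) {
            String name = undo.pop();
            if (name.equals(MARK)) break;
            table.get(name).pop();                         // discard this scope's binding
        }
    }

    public static void main(String[] args) {
        SymbolTable<Integer> st = new SymbolTable<>();
        st.insert("x", 1);
        st.beginScope();
        st.insert("x", 2);                   // inner declaration shadows outer
        System.out.println(st.lookup("x"));  // prints 2
        st.endScope();
        System.out.println(st.lookup("x"));  // prints 1
    }
}
```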
Attribute information
Attributes are internal representation of declarations
Symbol table associates names with attributes
Names may have different attributes depending on their meaning:
variables: type, procedure level, frame offset
types: type descriptor, data size/alignment
constants: type, value
procedures: formals (names/types), result type, block information (lo-
cal decls.), frame size
157
Type expressions
Type expressions are a textual representation for types:
1. basic types: boolean, char, integer, real, etc.
2. type names
3. constructed types (constructors applied to type expressions):
(a) array(I, T) denotes array of elements of type T, index type I
    e.g., array(1..10, integer)
(b) T1 × T2 denotes Cartesian product of type expressions T1 and T2
(c) records: fields have names
    e.g., record((a × integer), (b × real))
(d) pointer(T) denotes the type "pointer to object of type T"
(e) D → R denotes type of function mapping domain D to range R
    e.g., integer × integer → integer
158
Type descriptors
Type descriptors are compile-time structures representing type expressions
e.g., char × char → pointer(integer)

[Tree diagram: → with children (× of char, char) and pointer(integer)]

or, with shared substructure (a DAG):

[DAG diagram: → with children (× whose two edges point to a single char node) and pointer(integer)]
159
Type compatibility
Type checking needs to determine type equivalence
Two approaches:
Name equivalence: each type name is a distinct type
Structural equivalence: two types are equivalent iff. they have the same
structure (after substituting type expressions for type names)
s ≡ t iff. s and t are the same basic types
array(s1, s2) ≡ array(t1, t2) iff. s1 ≡ t1 and s2 ≡ t2
s1 × s2 ≡ t1 × t2 iff. s1 ≡ t1 and s2 ≡ t2
pointer(s) ≡ pointer(t) iff. s ≡ t
s1 → s2 ≡ t1 → t2 iff. s1 ≡ t1 and s2 ≡ t2
160
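The rules above translate directly into a recursive comparison over type descriptors. A sketch in Java (the descriptor classes are assumptions; named and recursive types, which the following slides address, are not handled here):

```java
// Type descriptors: one class per type constructor.
abstract class Type {}
class Basic extends Type {
    final String name;
    Basic(String name) { this.name = name; }
}
class Pointer extends Type {
    final Type to;
    Pointer(Type to) { this.to = to; }
}
class Product extends Type {          // T1 × T2
    final Type left, right;
    Product(Type left, Type right) { this.left = left; this.right = right; }
}
class Arrow extends Type {            // D → R
    final Type dom, range;
    Arrow(Type dom, Type range) { this.dom = dom; this.range = range; }
}

public class TypeEquiv {
    // Two types are structurally equivalent iff they have the same shape.
    static boolean equiv(Type s, Type t) {
        if (s instanceof Basic && t instanceof Basic)
            return ((Basic) s).name.equals(((Basic) t).name);
        if (s instanceof Pointer && t instanceof Pointer)
            return equiv(((Pointer) s).to, ((Pointer) t).to);
        if (s instanceof Product && t instanceof Product)
            return equiv(((Product) s).left, ((Product) t).left)
                && equiv(((Product) s).right, ((Product) t).right);
        if (s instanceof Arrow && t instanceof Arrow)
            return equiv(((Arrow) s).dom, ((Arrow) t).dom)
                && equiv(((Arrow) s).range, ((Arrow) t).range);
        return false;                 // different constructors: not equivalent
    }
    public static void main(String[] args) {
        Type p = new Pointer(new Basic("integer"));
        Type q = new Pointer(new Basic("integer"));
        System.out.println(equiv(p, q));                              // prints true
        System.out.println(equiv(p, new Pointer(new Basic("real")))); // prints false
    }
}
```

Name equivalence, by contrast, would simply compare descriptor identity (the same node in the type graph), as the following slides illustrate.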
Type compatibility: example
Consider:
type link = ^cell;
var next : link;
    last : link;
    p    : ^cell;
    q, r : ^cell;

Under name equivalence:
next and last have the same type
p, q and r have the same type
p and next have different type
Under structural equivalence all variables have the same type
Ada/Pascal/Modula-2/Tiger are somewhat confusing: they treat distinct type definitions as distinct types, so p has different type from q and r
161
Type compatibility: Pascal-style name equivalence
Build compile-time structure called a type graph:
each constructor or basic type creates a node
each name creates a leaf (associated with the types descriptor)
[Type graph: next and last are leaves sharing the node link = pointer(cell); p has its own pointer(cell) node; q and r share another pointer(cell) node]
Type expressions are equivalent if they are represented by the same nodein the graph
162
Type compatibility: recursive types
Consider:
We may want to eliminate the names from the type graph
Eliminating the name from the type graph for the record:

[Type graph: cell = record of (info × integer) and (next × pointer(cell))]
163
Type compatibility: recursive types
Allowing cycles in the type graph eliminates the remaining name:
[Type graph with a cycle: cell = record of (info × integer) and (next × pointer back to cell)]
164
Chapter 7: Translation and Simplification
165
Tiger IR trees: Expressions
CONST(i)            Integer constant i
NAME(n)             Symbolic constant n [a code label]
TEMP(t)             Temporary t [one of any number of "registers"]
BINOP(o, e1, e2)    Application of binary operator o:
                      PLUS, MINUS, MUL, DIV
                      AND, OR, XOR [bitwise logical]
                      LSHIFT, RSHIFT [logical shifts]
                      ARSHIFT [arithmetic right-shift]
                    to integer operands e1 (evaluated first) and e2 (evaluated second)
MEM(e)              Contents of a word of memory starting at address e
CALL(f, e1, …, en)  Procedure call; expression f is evaluated before arguments e1, …, en
ESEQ(s, e)          Expression sequence; evaluate s for side-effects, then e for result
166
Tiger IR trees: Statements
MOVE(TEMP t, e)         Evaluate e into temporary t
MOVE(MEM(e1), e2)       Evaluate e1 yielding address a, then e2 into the word at address a
EXP(e)                  Evaluate e and discard the result
JUMP(e, l1, …, ln)      Transfer control to address e; l1, …, ln are all possible values for e
CJUMP(o, e1, e2, t, f)  Evaluate e1 then e2, yielding a and b, respectively; compare a with b using relational operator o:
                          EQ, NE [signed and unsigned integers]
                          LT, GT, LE, GE [signed]
                          ULT, ULE, UGT, UGE [unsigned]
                        jump to t if true, f if false
SEQ(s1, s2)             Statement s1 followed by s2
LABEL(n)                Define constant value of name n as current code address; NAME(n) can be used as target of jumps, calls, etc.
167
Kinds of expressions
Expression kinds indicate how expression might be used
Ex(exp) expressions that compute a value
Nx(stm) statements: expressions that compute no value
Cx conditionals (jump to true and false destinations)
RelCx(op, left, right)
IfThenElseExp expression/statement depending on use
Conversion operators allow use of one form in context of another:
unEx convert to tree expression that computes value of inner tree
unNx convert to tree statement that computes inner tree but returns no
value
unCx(t, f) convert to statement that evaluates inner tree and branches to true destination if non-zero, false destination otherwise
168
Translating Tiger
Simple variables: fetch with a MEM:
Ex(MEM(+(TEMP fp, CONST k)))
where fp is home frame of variable, found by following static links; k is offset of variable in that level

Tiger array variables: Tiger arrays are pointers to array base, so fetch with a MEM like any other variable:
Ex(MEM(+(TEMP fp, CONST k)))
Thus, for e[i]:
Ex(MEM(+(e.unEx, *(i.unEx, CONST w))))
i is index expression and w is word size; all values are word-sized (scalar) in Tiger
Note: must first check array index 0 ≤ i < size(e); runtime will put size in word preceding array base
169
Translating Tiger
Tiger record variables: Again, records are pointers to record base, so fetch like other variables. For e.f:
Ex(MEM(+(e.unEx, CONST o)))
where o is the byte offset of the field f in the record
Note: must check record pointer is non-nil (i.e., non-zero)

String literals: Statically allocated, so just use the string's label:
Ex(NAME(label))
where the literal will be emitted as initialized data at that label

Record creation: {f1 = e1, f2 = e2, …, fn = en} in the (preferably GC'd) heap: first allocate the space, then initialize it:
Ex( ESEQ(SEQ(MOVE(TEMP r, externalCall("allocRecord", [CONST n])),
         SEQ(MOVE(MEM(TEMP r), e1.unEx),
         SEQ(…,
             MOVE(MEM(+(TEMP r, CONST (n−1)w)), en.unEx)))),
    TEMP r))
where w is the word size

Array creation: t[e1] of e2:
Ex(externalCall("initArray", [e1.unEx, e2.unEx]))
170
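The field-initializing MOVEs in the record-creation sequence follow a regular pattern: field j goes to byte offset j×w from the base register r. A sketch (helper name is illustrative; the first field is written uniformly as offset 0 rather than the slide's bare MEM(TEMP r)):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the MOVEs that initialize a freshly allocated record {f1=e1,...,fn=en}.
// Field j (0-based) lands at byte offset j*w from the base register r.
class RecordInit {
    static List<String> moves(String[] fieldTrees, int w) {
        List<String> out = new ArrayList<>();
        for (int j = 0; j < fieldTrees.length; j++)
            out.add("MOVE(MEM(+(TEMP r, CONST " + (j * w) + ")), " + fieldTrees[j] + ")");
        return out;
    }
}
```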
Control structures
Basic blocks:
a sequence of straight-line code
if one instruction executes then they all execute
a maximal sequence of instructions without branches
a label starts a new basic block
Overview of control structure translation:
control flow links up the basic blocks
ideas are simple
implementation requires bookkeeping
some care is needed for good code
171
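The basic-block rules above can be sketched as a splitter over a list of instruction strings (the instruction syntax, e.g. labels ending in ':' and branches starting with "jump"/"cjump", is an assumption of this sketch):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: splitting instructions into basic blocks.
class Blocks {
    static List<List<String>> split(List<String> instrs) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> cur = new ArrayList<>();
        for (String ins : instrs) {
            boolean isLabel = ins.endsWith(":");
            boolean isJump = ins.startsWith("jump") || ins.startsWith("cjump");
            // a label starts a new basic block
            if (isLabel && !cur.isEmpty()) { blocks.add(cur); cur = new ArrayList<>(); }
            cur.add(ins);
            // a branch ends the current block
            if (isJump) { blocks.add(cur); cur = new ArrayList<>(); }
        }
        if (!cur.isEmpty()) blocks.add(cur);
        return blocks;
    }
}
```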
while loops
while c do s:
1. evaluate c
2. if false jump to next statement after loop
3. if true fall into loop body
4. branch to top of loop
e.g.,
test:
    if not(c) jump done
    s
    jump test
done:
Nx( SEQ(SEQ(SEQ(LABEL test, c.unCx(body, done)),
            SEQ(SEQ(LABEL body, s.unNx), JUMP(NAME test))),
        LABEL done))
repeat e1 until e2: evaluate/compare/branch at bottom of loop
172
for loops
for i := e1 to e2 do s
1. evaluate lower bound into index variable
2. evaluate upper bound into limit variable
3. if index > limit jump to next statement after loop
4. fall through to loop body
5. increment index
6. if index ≤ limit jump to top of loop body
t1 ← e1
t2 ← e2
if t1 > t2 jump done
body: s
      t1 ← t1 + 1
      if t1 ≤ t2 jump body
done:
For break statements:
when translating a loop, push the done label on some stack
break simply jumps to label on top of stack
when done translating loop and its body, pop the label
173
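The break bookkeeping can be sketched with a stack of done labels (class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: while translating a loop, push its "done" label; a break statement
// jumps to whatever label is on top of the stack; pop when the loop is done.
class BreakStack {
    private final Deque<String> doneLabels = new ArrayDeque<>();
    void enterLoop(String doneLabel) { doneLabels.push(doneLabel); }
    void exitLoop() { doneLabels.pop(); }
    String translateBreak() { return "JUMP(NAME " + doneLabels.peek() + ")"; }
}
```

The stack makes nested loops work for free: an inner break sees the inner done label, and popping restores the outer one.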
Function calls
f(e1, ..., en):
Ex(CALL(NAME labelf, [sl, e1, ..., en]))
where sl is the static link for the callee f, found by following n static links from the caller, n being the difference between the levels of the caller and the callee
174
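Following n static links means wrapping the frame pointer in n MEM fetches of the static-link slot. A sketch, assuming (hypothetically) that the static link lives at a fixed offset kSL in every frame:

```java
// Sketch: building the static-link argument sl for a call, by following n
// links from the caller's frame pointer. kSL is an assumed per-frame offset
// of the static-link slot.
class StaticLink {
    static String sl(int n, int kSL) {
        String t = "TEMP fp";
        for (int j = 0; j < n; j++)
            t = "MEM(+(" + t + ", CONST " + kSL + "))";
        return t;
    }
}
```

With n = 0 (calling a function at the same level's child), sl is just the caller's fp.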
Comparisons
Translate a op b as:
RelCx(op, a.unEx, b.unEx)
When used as a conditional, unCx(t, f) yields:
CJUMP(op, a.unEx, b.unEx, t, f)
where t and f are labels.
When used as a value unEx yields:
ESEQ(SEQ(MOVE(TEMP r, CONST 1),
SEQ(unCx(t, f),
SEQ(LABEL f,
SEQ(MOVE(TEMP r, CONST 0), LABEL t)))),
TEMP r)
175
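Executed directly, the value-producing ESEQ pattern above behaves like this (LT chosen as a concrete op; a hypothetical sketch of the semantics, not generated code):

```java
// Sketch: the semantics of a comparison used as a value, yielding 1 or 0.
class CmpValue {
    static int asValue(int a, int b) {
        int r = 1;            // MOVE(TEMP r, CONST 1)
        if (!(a < b)) {       // unCx(t, f): branch to f when the test fails
            r = 0;            // LABEL f: MOVE(TEMP r, CONST 0)
        }
        return r;             // LABEL t ... TEMP r
    }
}
```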
Conditionals
The short-circuiting Boolean operators have already been transformed into
if-expressions in Tiger abstract syntax:
e.g., x < 5 & a > b turns into if x < 5 then a > b else 0
Translate if e1 then e2 else e3 into: IfThenElseExp(e1, e2, e3)
When used as a value unEx yields:
ESEQ(SEQ(SEQ(e1.unCx(t, f),
             SEQ(SEQ(LABEL t,
                     SEQ(MOVE(TEMP r, e2.unEx),
                         JUMP join)),
                 SEQ(LABEL f,
                     SEQ(MOVE(TEMP r, e3.unEx),
                         JUMP join)))),
         LABEL join),
     TEMP r)
As a conditional, unCx(t, f) yields:
SEQ(e1.unCx(tt, ff),
    SEQ(SEQ(LABEL tt, e2.unCx(t, f)),
        SEQ(LABEL ff, e3.unCx(t, f))))
176
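The rewrite's semantics can be executed directly in Java (returning 1/0 as Tiger does; the constant 5 and the comparisons mirror the running example):

```java
// Sketch: "x < 5 & a > b" rewritten as "if x < 5 then a > b else 0",
// evaluated directly. The second operand is only tested when the first holds.
class ShortCircuit {
    static int and(int x, int a, int b) {
        return (x < 5) ? ((a > b) ? 1 : 0) : 0;
    }
}
```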
Conditionals: Example
Applying unCx(t, f) to if x < 5 then a > b else 0:
SEQ(CJUMP(LT, x.unEx, CONST 5, tt, ff),
SEQ(SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)),
SEQ(LABEL ff, JUMP f)))
or more optimally:
SEQ(CJUMP(LT, x.unEx, CONST 5, tt, f),
    SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)))
177
One-dimensional fixed arrays
translates to:
MEM(+(TEMP fp, +(CONST k - 2w, ×(CONST w, e.unEx))))
where k is offset of static array from fp, w is word size
In Pascal, multidimensional arrays are treated as arrays of arrays, so
A[i,j] is equivalent to A[i][j], so can translate as above.
178
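The address arithmetic above, with the lower bound 2 generalized to a parameter lo, can be checked directly (function name and parameters are illustrative):

```java
// Sketch: byte offset from fp of element e of a fixed array whose first
// element sits at frame offset k, with lower bound lo and word size w.
// k - lo*w + w*e is the slide's "CONST k - 2w" term plus the scaled index.
class FixedArray {
    static int elementOffset(int k, int lo, int w, int e) {
        return k - lo * w + w * e;   // equivalently k + w*(e - lo)
    }
}
```

The element at the lower bound lands exactly at offset k, as expected.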
Multidimensional arrays
Array allocation:
constant bounds
allocate in static area, stack, or heap
no run-time descriptor is needed
dynamic arrays: bounds fixed at run-time
allocate in stack or heap
descriptor is needed
dynamic arrays: bounds can change at run-time
allocate in heap
descriptor is needed
179
Multidimensional arrays
Array layout:
Contiguous:
1. Row major
Rightmost subscript varies most quickly:
Used in PL/1, Algol, Pascal, C, Ada, Modula-3
2. Column major
Leftmost subscript varies most quickly:
Used in FORTRAN
By vectors:
Contiguous vector of pointers to (non-contiguous) subarrays
180
Multi-dimensional arrays: row-major layout
no. of elts in dimension j:
    D_j = U_j - L_j + 1
position of A[i_1, ..., i_n]:
    (i_n - L_n)
  + (i_{n-1} - L_{n-1}) D_n
  + (i_{n-2} - L_{n-2}) D_n D_{n-1}
  + ...
  + (i_1 - L_1) D_n ... D_2
which can be rewritten as
  variable part: i_1 D_2 ... D_n + i_2 D_3 ... D_n + ... + i_{n-1} D_n + i_n
  constant part: L_1 D_2 ... D_n + L_2 D_3 ... D_n + ... + L_{n-1} D_n + L_n
address of A[i_1, ..., i_n]:
    address(A) + ((variable part - constant part) × element size)
181
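The position formula can be computed in Horner form, multiplying by each D_j as we go (a sketch; the bounds arrays are 0-indexed Java arrays, with dimension 0 being the leftmost, slowest-varying subscript):

```java
// Sketch: row-major element position from per-dimension bounds L[j]..U[j]
// and subscripts i[j]. Multiply the result by element size for a byte offset.
class RowMajor {
    static int position(int[] L, int[] U, int[] i) {
        int pos = 0;
        for (int j = 0; j < L.length; j++) {
            int Dj = U[j] - L[j] + 1;        // no. of elts in dimension j
            pos = pos * Dj + (i[j] - L[j]);  // Horner form of the sum above
        }
        return pos;
    }
}
```

For a 3×4 array with bounds [1..3, 1..4], element A[2,3] lands at position (2-1)·4 + (3-1) = 6.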
ca