Compiler Design – 15CS301
Instructor : Mr. R. Rajkumar, Assistant Professor | CSE
Venue : TP-606, Tech park,
SRM Institute of Science and Technology,
Kattankulathur, India.
UNIT 1 – Introduction to Compiler and Automata
R. Rajkumar AP | CSE
Q: Why Compiler Design?
Programming languages are the primary tools of every computer programmer.
Although many software engineers claim to know several languages well enough to work with them, in practice most stay within their comfort zones.
Many sophisticated features of programming languages remain out of reach for the majority of programmers.
Compilers give you the theoretical and practical knowledge needed to implement a programming language. Once you have built a compiler, you pretty much know the innards of many programming languages.
Compilers contain a plethora of sophisticated algorithms and data structures. If algorithms and data structures fascinate you, you will find several of them at work in a compiler.
Q: Why Compiler Design? (2)
Compilers are complex software systems. If you can
truthfully claim that you have written a compiler with your
own hands, it is likely that there will be no questions asked
after that in any interview. A person who has made a
compiler can do anything.
Q: Why Compiler Design? (3)
The software architecture of a compiler is quite general. A
large variety of applications can be modelled after a
compiler (or some part thereof). Simulators, debuggers,
program analysis tools, editors, IDEs, RDBMSs, browsers,
OS shells, … have some significant elements of language
processing (read compiling) in them.
Q: Why Compiler Design? (4)
Let us start with a little history.
Overview and History (1)
Cause: software for early computers was written in assembly language, and the benefit of reusing software on different CPUs started to become significantly greater than the cost of writing a compiler.
The first real compilers: the FORTRAN compilers of the late 1950s, which took 18 person-years to build.
Overview and History (2)
Compiler technology is more broadly applicable and has been employed in
rather unexpected areas. Text-formatting languages,
like nroff and troff; preprocessor packages like eqn, tbl, pic
Silicon compiler for the creation of VLSI circuits
Command languages of OS
Query languages of Database systems
What Do Compilers Do (1)
A compiler acts as a translator, transforming human-oriented programming languages into computer-oriented machine languages, hiding machine-dependent details from the programmer.

Programming Language (Source) → Compiler → Machine Language (Target)
What Do Compilers Do (2)
Compilers may generate three types of code:
Pure Machine Code
Uses the machine instruction set directly, without assuming the existence of any operating system or library.
Mostly OS kernels or embedded applications.
Augmented Machine Code
Code that calls OS routines and runtime support routines.
The most common case.
Virtual Machine Code
Virtual instructions that can be run on any architecture with a virtual machine interpreter or a just-in-time compiler.
Ex. Java
What Do Compilers Do (3)
Another way that compilers
differ from one another is in the format of the target
machine code they generate:
Assembly or other source format
Relocatable binary
Relative address
A linkage step is required
Absolute binary
Absolute address
Can be executed directly
Any compiler must perform two major tasks
Analysis of the source program
Synthesis of a machine-language program
The Structure of a Compiler (1)
Compiler
Analysis Synthesis
The Structure of a Compiler (2)
Source Program (Character Stream) → Scanner → Tokens → Parser → Syntactic Structure → Semantic Routines → Intermediate Representation → Optimizer → Code Generator → Target Machine Code

Symbol and Attribute Tables (used by all phases of the compiler)
The Structure of a Compiler (3)
Scanner
The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens).
RE ( Regular Expression )
NFA ( Non-deterministic Finite Automaton )
DFA ( Deterministic Finite Automaton )
LEX
The Structure of a Compiler (4)
Parser
Given a formal syntax specification (typically a context-free grammar, CFG), the parser reads tokens and groups them into units as specified by the productions of the CFG being used.
As syntactic structure is recognized, the parser either calls corresponding semantic routines directly or builds a syntax tree.
CFG ( Context-Free Grammar )
BNF ( Backus-Naur Form )
GAA ( Grammar Analysis Algorithms )
LL, LR, SLR, LALR Parsers
YACC
The Structure of a Compiler (5)
Semantic Routines
Perform two functions:
Check the static semantics of each construct
Do the actual translation
The heart of a compiler.
Syntax-Directed Translation
Semantic Processing Techniques
IR ( Intermediate Representation )
The Structure of a Compiler (6)
Optimizer
The IR code generated by the semantic routines is analyzed and transformed into functionally equivalent but improved IR code.
This phase can be very complex and slow.
Loop optimization, register allocation, code scheduling
Register and Temporary Management
Peephole Optimization
The Structure of a Compiler (7)
Code Generator
Interpretive Code Generation
Generating Code from Trees/DAGs
Grammar-Based Code Generation
The Structure of a Compiler (8)
Source Program → Scanner [Lexical Analyzer] → Tokens → Parser [Syntax Analyzer] → Parse Tree → Semantic Process [Semantic Analyzer] → Abstract Syntax Tree w/ Attributes → Code Generator [Intermediate Code Generator] → Non-optimized Intermediate Code → Code Optimizer → Optimized Intermediate Code → Code Generator → Target Machine Code
The Structure of a Compiler (9)
Compiler writing tools:
Compiler generators or compiler-compilers
E.g. scanner and parser generators
Examples: Lex, Yacc
The Syntax and Semantics of
Programming Language (1)
A programming language must include the specification of
syntax (structure) and semantics (meaning).
Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs).
Ex.
a = b + c is syntactically legal
b + c = a is illegal
The Syntax and Semantics of
Programming Language (2)
The semantics of a programming language are commonly
divided into two classes:
Static semantics
Semantic rules that can be checked at compile time.
Ex. the type and number of a function's arguments
Runtime semantics
Semantic rules that can be checked only at run time.
Compiler Design and Programming
Language Design
An interesting aspect is how programming language
design and compiler design influence one another.
Programming languages that are easy to compile
have many advantages
Computer Architecture and Compiler
Design
Compilers should exploit hardware-specific features and computing capability to optimize code.
The problems encountered in modern computing
platforms:
Instruction sets for some popular architectures are highly
nonuniform.
High-level programming language operations are not always
easy to support.
Ex. exceptions, threads, dynamic heap access …
Exploiting architectural features such as cache, distributed
processors and memory
Effective use of a large number of processors
Compiler Design Considerations
Debugging Compilers
Designed to aid in the development and debugging of
programs.
Optimizing Compilers
Designed to produce efficient target code
Retargetable Compilers
A compiler whose target architecture can be changed without
its machine-independent components having to be rewritten.
Compiler Construction Tools
Contents
Defining Compiler Construction Tools (aka CCTs)
Uses for CCTs
CCTs in the Compiler Structure
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator
Code Optimizer
Code Generator
Defining CCTs
Programs or environments that assist in the creation of an entire compiler or its parts.
Uses for CCTs
CCTs can generate:
lexical analyzers,
syntax analyzers,
semantic analyzers,
intermediate code,
optimized target code
CCTs in the Compiler Structure
Lexical Analyzer
scanner generators
input: source program
output: lexical analyzer
task of reading characters from source program and
recognizing tokens or basic syntactic components
maintains a list of reserved words
Lexical Analyzer
Flex (fast lexical analyzer generator)
Example: a scanner specification that replaces the string "username" with the user's login name:
%%
username    printf("%s", getlogin());
Syntax Analyzer
parser generators
input: context-free grammar
output: syntax analyzer
the task of the syntax analyzer is to produce a representation of the source program in a form directly representing its syntax structure. This representation is usually in the form of a binary tree or similar data structure
Semantic Analyzer
syntax-directed translators
input: parse tree
output: routines to generate I-code
"The role of the semantic analyzer is to derive methods by which the structures constructed by the syntax analyzer may be evaluated or executed."
type checker
two common tactics:
~ flatten the semantic analyzer's parse tree
~ embed the semantic analyzer within the syntax analyzer
(syntax-directed translation)
Intermediate Code Generator
Automatic code generators
input: I-code rules
output: crude target machine program
“The task of the code generator is to traverse this tree, producing functionally equivalent object code.” [3]
three address code is one type
Intermediate Code Generator
Example 7 + (8 * y) / 2
a := 8
b := y
c := a * b
a := c
b := 2
c := a / b
a := 7
b := c
c := a + b
(Expression tree for 7 + (8 * y) / 2: the subtree 8 * y feeds the division by 2, whose result is added to 7)
Code Optimizer
Data flow engines
input: I-code
output: transformed code
“This improvement is achieved by program transformations that are traditionally called optimizations, although the term ‘optimization’ is a misnomer because there is rarely a guarantee that the resulting code is the best possible.”
Code Optimizer
Peephole Optimization
Machine or assembly code, together with knowledge of the target machine's instruction set, is used to replace I-code instructions with shorter or faster instruction sequences; this is repeated as often as necessary.
Code Optimizer
Common Optimizing Transformations
Optimization Name    Required Analysis     Transformation
constant folding     simulated exec.       elimination
dead code elim.      simulated exec.       elimination
loop unrolling       loop struct., stats   motion (replication)
linearizing arrays   loop structure        elimination
load/store optim.    DFA                   motion
branch chaining      statistics            selection (dec)
math identities      none                  selection, elimination
common subexp.       simulated exec.       elimination
Code Optimizer
Example 7 + (8 * y) / 2
a := y
a := a * 8
a := a / 2
a := a + 7
(Expression tree for 7 + (8 * y) / 2, as on the previous slide)
Code Generator (Assembly Level)
Automatic code generators
input: optimized (transformed) I-code
output: target machine program
Example 7 + (8 * y) / 2
Load a, y
Mult a, 8
Div a, 2
Add a, 7
Review: Compiler Phases
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer (Front End) → Intermediate code generator → Code optimizer → Code generator (Back End)
The symbol table manager and the error handler interact with all phases.
The role of lexical analyzer
The parser repeatedly calls getNextToken; the lexical analyzer reads the source program and returns the next token to the parser, which passes its output on to semantic analysis. Both the lexical analyzer and the parser consult the symbol table.
Lexical Analysis
Lexical analyzer: reads input characters and produces a sequence of tokens as output (nexttoken()), trying to understand each element in the program.
Token: a group of characters having a collective meaning.
const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, 'pi')
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (;, -)
Some terminology:
Token: a group of characters having a collective meaning.
Lexeme: a particular instance of a token.
E.g. token: identifier, lexeme: pi, etc.
Pattern: the rule describing how a token can be formed.
E.g. identifier: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*
The lexical analyzer does not have to be an individual phase, but having a separate phase simplifies the design and improves efficiency and portability.
Two issues in lexical analysis:
How to specify tokens (patterns)?
How to recognize the tokens given a token specification (how to implement the nexttoken() routine)?
How to specify tokens:
All the basic elements in a language must be tokens so that they can be recognized.
Token types: constant, identifier, reserved word, operator and misc. symbol.
Tokens are specified by regular expressions.
main() {
    int i, j;
    for (i = 0; i < 50; i++) {
        printf("i = %d", i);
    }
}
Why to separate Lexical analysis and
parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair a token name and an optional token
value
A pattern is a description of the form that the lexemes of
a token may take
A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
Token        Informal description                       Sample lexemes
if           Characters i, f                            if
else         Characters e, l, s, e                      else
comparison   < or > or <= or >= or == or !=             <=, !=
id           Letter followed by letters and digits      pi, score, D2
number       Any numeric constant                       3.14159, 0, 6.02e23
literal      Anything but " surrounded by "             "core dumped"

printf("total = %d\n", score);
Lexical errors
Some errors are beyond the power of the lexical analyzer to recognize:
fi (a == f(x)) …
However, it may be able to recognize errors like:
d = 2r
Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
Panic mode: successive characters are ignored until we reach a well-formed token.
Other possible recovery actions:
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
Input buffering
Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return.
In C: we need to look past -, = or < to decide what token to return.
In Fortran: DO 5 I = 1.25
We need a two-buffer scheme to handle large look-aheads safely.
E = M * C * * 2 eof
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}

E = M eof * C * * 2 eof eof
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Regular expressions are a means for specifying regular languages.
Example:
letter_ (letter_ | digit)*
Each regular expression is a pattern specifying the form of strings.
Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
Recognition of tokens
Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
| Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
Architecture of a transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …
        …
        case 8: retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
Lexical Analyzer Generator - Lex
Lex source program lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Structure of Lex programs
declarations
%%
translation rules
%%
auxiliary functions
Pattern {Action}
Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
…

int installID() {/* function to install the lexeme, whose first character is
                    pointed to by yytext, and whose length is yyleng, into the
                    symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants into a
                     separate table */
}
Finite Automata
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions: state →input state
Finite Automata
Transition
s1 →a s2
is read: in state s1 on input "a" go to state s2.
If at end of input:
if in an accepting state => accept, otherwise => reject
If no transition is possible => reject
Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transition (labeled a)
A Simple Example
A finite automaton that accepts only "1"
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state.
Another Simple Example
A finite automaton accepting any number of 1's followed by a single 0
Alphabet: {0, 1}
Check that "1110" is accepted but "110…" is not
And Another Example
Alphabet {0,1}
What language does this recognize?
(State graph with transitions on 0 and 1 omitted)
And Another Example
Alphabet still { 0, 1 }
The operation of the automaton is not completely defined by the input:
on input "11" the automaton could be in either state.
Epsilon Moves
Another kind of transition: ε-moves
• The machine can move from state A to state B without reading input.
Deterministic and Nondeterministic
Automata
Deterministic Finite Automata (DFA)
One transition per input per state
No ε-moves
Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input in a given state
Can have ε-moves
Finite automata have finite memory
Need only to encode the current state
Execution of Finite Automata
A DFA can take only one path through the state graph
Completely determined by the input
NFAs can choose
Whether to make ε-moves
Which of multiple transitions for a single input to take
Acceptance of NFAs
An NFA can get into multiple states at once
• Input: 1 0 1
• Rule: an NFA accepts if it can get into a final state
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages
(regular languages)
DFAs are easier to implement
There are no choices to consider
NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA
(NFA and DFA state graphs omitted)
• The DFA can be exponentially larger than the NFA
Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification → Regular Expressions → NFA → DFA → Table-driven Implementation of DFA
Regular Expressions to NFA (1)
Thompson Construction: for each kind of regular expression, define an NFA
Notation: NFA for regular expression A
• For ε
• For input a
Regular Expressions to NFA (2)
For AB: connect the NFA for A to the NFA for B in sequence
• For A | B: branch into the NFA for A or the NFA for B
Regular Expressions to NFA (3)
For A*: loop around the NFA for A
Relationship between NFAs and DFAs
DFA is a special case of an NFA
DFA has no ε-transitions
DFA's transition function is single-valued
Same rules will work
DFA can be simulated with an NFA
Obviously
NFA can be simulated with a DFA (less obvious)
Simulate sets of possible states
Possible exponential blowup in the state space
Still, one state transition per character in the input stream
(Rabin & Scott, 1959)
Automating Scanner Construction
To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Scanner generators
Lex and Flex work along these lines
Algorithms are well-known and well-understood
Key issue is the interface to the parser (define all parts of speech)
You could build one in a weekend!
Where are we? Why are we doing this?
RE → NFA (Thompson's construction)
Build an NFA for each term
Combine them with ε-moves
NFA → DFA (subset construction)
Build the simulation
DFA → minimal DFA
Hopcroft's algorithm
DFA → RE
All-pairs, all-paths problem
Union together paths from s0 to a final state
RE → NFA → DFA → minimal DFA: the Cycle of Constructions
RE → NFA using Thompson's Construction
Key idea:
An NFA pattern for each symbol and each operator
Join them with ε-moves in precedence order
(NFA fragments for a, ab, a | b, and a*, each with a single start state and a single accepting state, are shown in the figure, here omitted)
Ken Thompson, CACM, 1968
Example of Thompson's Construction
Let's try a ( b | c )*
1. NFAs for a, b, and c
2. NFA for b | c
3. NFA for ( b | c )*
(Intermediate NFAs omitted; each step joins the previous fragments with ε-moves)
Example of Thompson's Construction (cont'd)
4. a ( b | c )*
Of course, a human would design something simpler: a single a-transition followed by a loop on b | c.
But we can automate production of the more complex NFA version.
Example of RegExp -> NFA conversion
Consider the regular expression
(1 | 0)*1
(The resulting NFA, with states A through J joined by ε-moves, transitions on 0 and 1 inside the loop, and a final 1 into the accepting state, is omitted here)
Next
Lexical Specification → Regular Expressions → NFA → DFA → Table-driven Implementation of DFA
91
Constructing Efficient Finite Automata
First we’ll see how to transform an NFA into a DFA.
Then we’ll see how to transform a
DFA into a minimum-state DFA.
Transforming an NFA into a DFA
The l-closure of a state s, denoted l(s), is the set consisting of s together with all states that
can be reached from s by traversing l-edges. The l-closure of a set S of states, denoted
l(S), is the union of the l-closures of the states in S.
Example. Given the following NFA as a graph and as a transition table.
S 0
2
1
b
b
La
La
Some sample l-closures for the NFA are as follows:
l(0) = {0, 1, 2}
l(1) = {1, 2}
l(2) = {2}
l() =
l({1, 2}) = {1, 2}
l({0, 1, 2}) = {0, 1, 2}.
S
F
TN a b L
0 {1, 2} {1}
1 {1, 2} {2}
2
R. Rajkumar AP | CSE
Algorithm: Transform an NFA into a DFA
Construct a DFA table TD from an NFA table TN as follows:
1. The start state of the DFA is λ(s), where s is the start state of the NFA.
2. If {s1, …, sn} is a DFA state and a ∈ A, then
   TD({s1, …, sn}, a) = λ(TN(s1, a) ∪ … ∪ TN(sn, a)).
3. A DFA state is final if one of its elements is an NFA final state.

Example. Given the following NFA (start state 0, final states 2 and 3; the graph is omitted):

TN     a      b        λ
0           {0, 1}    {3}
1    {2}
2           {2}
3    {3}

The algorithm constructs the following DFA transition table TD, where it is also written in simplified form after a renumbering of the states (S = start, F = final; blank entries denote the empty set, which becomes the dead state 5 after renumbering):

TD                a         b
S, F  {0, 3}      {3}       {0, 1, 3}
F     {3}         {3}
F     {0, 1, 3}   {2, 3}    {0, 1, 3}
F     {2, 3}      {3}       {2}
F     {2}                   {2}

TD        a   b
S, F  0   1   2
F     1   1   5
F     2   3   2
F     3   1   4
F     4   5   4
      5   5   5
Quiz. Use the algorithm to transform the following NFA into a DFA (start state 0, final state 2; the graph is omitted, and blank entries denote the empty set):

TN     a        b     λ
0    {1}             {3}
1            {2}
2
3    {2, 3}

Solution: The algorithm constructs the following DFA transition table TD, where it is also written in simplified form after a renumbering of the states (S = start, F = final; the empty set becomes the dead state 4 after renumbering):

TD               a           b
S   {0, 3}       {1, 2, 3}
F   {1, 2, 3}    {2, 3}      {2}
F   {2, 3}       {2, 3}
F   {2}

TD       a   b
S   0    1   4
F   1    2   3
F   2    2   4
F   3    4   4
    4    4   4
Transforming a DFA into a minimum-state DFA
Let S be the set of states that can be reached from the start state of a DFA over A. For states s, t ∈ S let s ~ t mean that for all strings w ∈ A*, either T(s, w) and T(t, w) are both final or both nonfinal. Observe that ~ is an equivalence relation on S, so it partitions S into equivalence classes.
Observe also that the number of equivalence classes is the minimum number of states needed by a DFA to recognize the language of the given DFA.

Algorithm: Transform a DFA to a minimum-state DFA
1. Construct the following sequence of sets of possible equivalent pairs of distinct states:
   E0 ⊇ E1 ⊇ … ⊇ Ek = Ek+1,
   where
   E0 = {{s, t} | s and t are either both final or both nonfinal}
   and
   Ei+1 = {{s, t} ∈ Ei | {T(s, a), T(t, a)} ∈ Ei or T(s, a) = T(t, a), for every a ∈ A}.
   Ek represents the distinct pairs of equivalent states from which ~ can be generated.
2. The equivalence classes form the states of the minimum-state DFA, with transition table Tmin defined by
   Tmin([s], a) = [T(s, a)].
3. The start state is the class containing the start state of the given DFA.
4. A final state is any class containing a final state of the given DFA.
Example. Use the algorithm to transform the following DFA into a minimum-state DFA (start state 0, final states 1, 2, 3; the graph is omitted):

T       a   b
S  0    1   4
F  1    2   3
F  2    3   3
F  3    3   3
   4    4   4

Solution: The set of states is S = {0, 1, 2, 3, 4}. To find the equivalent states calculate:
E0 = {{0, 4}, {1, 2}, {1, 3}, {2, 3}}
E1 = {{1, 2}, {1, 3}, {2, 3}}
E2 = {{1, 2}, {1, 3}, {2, 3}} = E1.
So 1 ~ 2, 1 ~ 3, 2 ~ 3. This tells us that S is partitioned by {0}, {1, 2, 3}, {4}, which we name [0], [1], [4], respectively. So the minimum-state DFA has three states.

Min-state Table
TMin       a    b
S  [0]    [1]  [4]
F  [1]    [1]  [1]
   [4]    [4]  [4]

Renamed Table
TMin    a   b
S  0    1   2
F  1    1   1
   2    2   2

(Min-state DFA graph omitted)

Quiz: What regular expression equality arises from the two DFAs?
Answer: a + aa + (aaa + aab + ab)(a + b)* = a(a + b)*.
DFA Minimization
DFA
Deterministic Finite Automaton (DFA)
(Q, Σ, δ, q0, F)
Q – (finite) set of states
Σ – alphabet – (finite) set of input symbols
δ – transition function
q0 – start state
F – set of final / accepting states
DFA
Often represented as a diagram (omitted here).
DFA Minimization
Some states can be redundant:
The following DFA accepts (a|b)+ (diagram omitted)
State s1 is not necessary
DFA Minimization
So these two DFAs are equivalent (diagrams omitted):
DFA Minimization
This is a state-minimized (or just minimized) DFA
Every remaining state is necessary
DFA Minimization
The task of DFA minimization, then, is to automatically
transform a given DFA into a state-minimized DFA
Several algorithms and variants are known
Note that this also in effect can minimize an NFA (since we
know algorithm to convert NFA to DFA)
DFA Minimization Algorithm
Recall that a DFA is M = (Q, Σ, δ, q0, F)
Two states p and q are distinct if
p is in F and q is not, or vice versa, or
for some α in Σ, δ(p, α) and δ(q, α) are distinct
Using this inductive definition, we can calculate which states are distinct.
DFA Minimization Algorithm
Create a lower-triangular table DISTINCT, initially blank.
For every pair of states (p, q): if p is final and q is not, or vice versa,
    DISTINCT(p, q) = ε
Loop until no change for an iteration: for every pair of states (p, q) and each symbol α,
    if DISTINCT(p, q) is blank and DISTINCT(δ(p, α), δ(q, α)) is not blank,
        DISTINCT(p, q) = α
Combine all states that are not distinct.
Very Simple Example
The DISTINCT table has rows s1, s2 and columns s0, s1; it is initially blank.
Label pairs with ε where one state is final and the other is not:
      s0   s1
s1    ε
s2    ε
The main loop makes no changes.
DISTINCT(s1, s2) is empty, so s1 and s2 are equivalent states: merge s1 and s2.
More Complex Example
More Complex Example
Check for pairs with one state final and one not:
More Complex Example
First iteration of main loop:
More Complex Example
Second iteration of main loop:
More Complex Example
Third iteration makes no changes
Blank cells are equivalent pairs of states
More Complex Example
Combine equivalent states for minimized DFA:
Conclusion
DFA minimization is a fairly understandable process, and
is useful in several areas
Regular-expression matching implementations
A very similar algorithm is used in compiler optimization to
eliminate duplicate computations
The algorithm described is O(kn²)
John Hopcroft describes another, more complex algorithm that
is O(kn log n)
Parse Trees
Definitions
Relationship to Left- and Rightmost Derivations
Ambiguity in Grammars
Parse Trees
Parse trees are trees labeled by symbols of a particular
CFG.
Leaves: labeled by a terminal or ε.
Interior nodes: labeled by a variable.
Children are labeled by the right side of a production for
the parent.
Root: must be labeled by the start symbol.
Example: Parse Tree
S -> SS | (S) | ()
S
+-- S
|   +-- (
|   +-- S
|   |   +-- (
|   |   +-- )
|   +-- )
+-- S
    +-- (
    +-- )
Yield of a Parse Tree
The concatenation of the labels of the leaves in left-to-
right order
That is, in the order of a preorder traversal.
is called the yield of the parse tree.
Example: yield of is (())()
S
SS
S )(
( )
( )
R. Rajkumar AP | CSE
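The yield computation can be sketched in Python, assuming a simple nested-tuple encoding of parse trees (the encoding is our own, for illustration: each node is a `(label, children)` pair, and a leaf has an empty child list).

```python
def tree_yield(node):
    """Concatenate leaf labels left to right (preorder over the leaves)."""
    label, children = node
    if not children:                  # leaf: a terminal, or '' for ε
        return label
    return ''.join(tree_yield(c) for c in children)

# The parse tree above, built under S -> SS | (S) | ():
inner = ('S', [('(', []), (')', [])])         # S -> ()
left  = ('S', [('(', []), inner, (')', [])])  # S -> (S)
right = ('S', [('(', []), (')', [])])         # S -> ()
tree  = ('S', [left, right])                  # S -> SS
print(tree_yield(tree))                       # (())()
```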
Parse Trees, Left- and Rightmost Derivations
For every parse tree, there is a unique leftmost, and a
unique rightmost derivation.
We’ll prove:
1. If there is a parse tree with root labeled A and yield w, then
A =>*lm w.
2. If A =>*lm w, then there is a parse tree with root A and
yield w.
Proof: Part 2
Given a leftmost derivation of a terminal string, we need
to prove the existence of a parse tree.
The proof is an induction on the length of the derivation.
Part 2 – Basis
If A =>*lm a1…an by a one-step derivation, then there
must be a parse tree
       A
     / | \
   a1  …  an
Part 2 – Induction
Assume (2) for derivations of fewer than k > 1 steps,
and let A =>*lm w be a k-step derivation.
First step is A =>lm X1…Xn.
Key point: w can be divided so the first portion is
derived from X1, the next is derived from X2, and so
on.
If Xi is a terminal, then wi = Xi.
Induction – (2)
That is, Xi =>*lm wi for all i such that Xi is a variable.
And the derivation takes fewer than k steps.
By the IH, if Xi is a variable, then there is a parse tree
with root Xi and yield wi.
Thus, there is a parse tree
       A
     / | \
   X1  …  Xn
   |        |
   w1       wn
Parse Trees and Rightmost Derivations
The ideas are essentially the mirror image of the
proof for leftmost derivations.
Left to the imagination.
Parse Trees and Any Derivation
The proof that you can obtain a parse tree from a
leftmost derivation doesn’t really depend on
“leftmost.”
First step still has to be A => X1…Xn.
And w still can be divided so the first portion is
derived from X1, the next is derived from X2, and so
on.
Ambiguous Grammars
A CFG is ambiguous if there is a string in the language
that is the yield of two or more parse trees.
Example: S -> SS | (S) | ()
Two parse trees for ()()() on next slide.
Example – Continued
Tree 1 (groups the first two pairs):

S
+-- S
|   +-- S
|   |   +-- (
|   |   +-- )
|   +-- S
|       +-- (
|       +-- )
+-- S
    +-- (
    +-- )

Tree 2 (groups the last two pairs):

S
+-- S
|   +-- (
|   +-- )
+-- S
    +-- S
    |   +-- (
    |   +-- )
    +-- S
        +-- (
        +-- )
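A small sketch, using an assumed nested-tuple encoding of parse trees (our own, for illustration), confirms that the two trees are different while their yields agree:

```python
def tree_yield(node):
    """Concatenate leaf labels left to right."""
    label, children = node
    return label if not children else ''.join(tree_yield(c) for c in children)

unit = ('S', [('(', []), (')', [])])       # S -> ()
t1 = ('S', [('S', [unit, unit]), unit])    # groups the first two pairs
t2 = ('S', [unit, ('S', [unit, unit])])    # groups the last two pairs
print(tree_yield(t1), tree_yield(t2))      # ()()() ()()()
print(t1 == t2)                            # False: same string, two parse trees
```

Two distinct parse trees with the same yield is exactly the definition of an ambiguous grammar.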
Ambiguity, Left- and Rightmost Derivations
If there are two different parse trees, they must
produce two different leftmost derivations by the
construction given in the proof.
Conversely, two different leftmost derivations produce
different parse trees by the other part of the proof.
Likewise for rightmost derivations.
Ambiguity, etc. – (2)
Thus, equivalent definitions of “ambiguous grammar” are:
1. There is a string in the language that has two different
leftmost derivations.
2. There is a string in the language that has two different
rightmost derivations.
Ambiguity is a Property of Grammars, not Languages
For the balanced-parentheses language, here is
another CFG, which is unambiguous.
B -> (RB | ε
R -> ) | (RR

B, the start symbol, derives balanced strings.
R generates strings that have one more right paren than left.
Example: Unambiguous Grammar
B -> (RB | ε R -> ) | (RR
Construct a unique leftmost derivation for a given
balanced string of parentheses by scanning the string from
left to right.
If we need to expand B, then use B -> (RB if the next symbol is “(” and ε if at the end.
If we need to expand R, use R -> ) if the next symbol is “)” and
(RR if it is “(”.
The Parsing Process
Remaining Input: (())()
Steps of leftmost derivation:
  B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ())()
Steps of leftmost derivation:
  B
  (RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ))()
Steps of leftmost derivation:
  B
  (RB
  ((RRB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: )()
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ()
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: )
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input:
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB
  (())()B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input:
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB
  (())()B
  (())()

B -> (RB | ε    R -> ) | (RR
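The scan traced in these slides can be sketched as a recursive-descent recognizer in Python; the function name and structure are our own, and it decides membership using one symbol of lookahead rather than printing each derivation step.

```python
def parse_balanced(s):
    """Recognizer for B -> (RB | ε, R -> ) | (RR (balanced parentheses)."""
    pos = 0

    def peek():
        # One symbol of lookahead; None at end of input.
        return s[pos] if pos < len(s) else None

    def B():
        nonlocal pos
        if peek() == '(':        # choose B -> (RB on lookahead "("
            pos += 1
            return R() and B()
        return True              # otherwise B -> ε

    def R():
        nonlocal pos
        if peek() == ')':        # choose R -> ) on lookahead ")"
            pos += 1
            return True
        if peek() == '(':        # choose R -> (RR on lookahead "("
            pos += 1
            return R() and R()
        return False             # no production applies

    return B() and pos == len(s)

print(parse_balanced('(())()'))  # True
```

Because each choice of production is forced by the next input symbol, the parser never backtracks; this is precisely what makes the grammar LL(1).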
LL(1) Grammars
As an aside, a grammar such as B -> (RB | ε, R -> ) | (RR,
where you can always figure out which production to use in a
leftmost derivation by scanning the given string left-to-right
and looking only at the next one symbol, is called LL(1).
“Leftmost derivation, left-to-right scan, one symbol of
lookahead.”
LL(1) Grammars – (2)
Most programming languages have LL(1) grammars.
LL(1) grammars are never ambiguous.
References
Aho, Alfred V., Sethi, Ravi and Ullman, Jeffrey D. Compilers: principles, techniques, and tools. (1986). Reading: Addison-Wesley.
Peters, James, Pittman, Thomas. The art of compiler design: theory and practice. (1992). Englewood Cliffs: Prentice Hall.
Watson, Des. High-level languages and their compilers. (1989). Wokingham: Addison-Wesley.
References
Aho, A. V., Hopcroft, J. E. and Ullman, J. D. (1974) The Design and Analysis of
Computer Algorithms. Addison-Wesley.
Hopcroft, J. (1971) An N Log N Algorithm for Minimizing States in a Finite Automaton.
Stanford University.
Parthasarathy, M. and Fleck, M. (2007) DFA Minimization. University of Illinois at
Urbana-Champaign. http://www.cs.uiuc.edu/class/fa07/cs273/Handouts/minimization/minimization.pdf
References
Heng, Christopher. Free Compiler Construction Tools. http://www.thefreecountry.com/programming/compilercontructiontools
The Lex & Yacc Page. http://dinosaur.compilertools.net
Compiler Construction Kits. http://catalog.compilertools.net
The Cocktail Compiler Toolbox. http://www.first.gmd.de/cocktail/
Prepared by
www.gameofcompilers.weebly.com
Instructor : Mr. R. Rajkumar, Assistant Professor | CSE
Staff room: TP-612, Tech park,
SRM Institute of Science and Technology,
Kattankulathur, India.