Compiler Design – 15CS301
Instructor : Mr. R. Rajkumar, Assistant Professor | CSE
Venue : TP-606, Tech park,
SRM Institute of Science and Technology,
Kattankulathur, India.
UNIT 1 – Introduction to Compiler and Automata
R. Rajkumar AP | CSE
Q: Why Compiler Design?
Programming languages are the primary tools of every computer programmer.
Although many software engineers claim to know several languages well enough to work with them, in practice most stay within their comfort zones.
Many sophisticated features of programming languages remain out of reach for the majority of programmers.
Compilers give you the theoretical and practical knowledge needed to implement a programming language. Once you have built a compiler, you pretty much know the innards of many programming languages.
Compilers contain a plethora of sophisticated algorithms and data structures. If algorithms and data structures fascinate you, you will find several of them at work in a compiler.
Q: Why Compiler Design? (2)
Compilers are complex software systems. If you can
truthfully claim that you have written a compiler with your
own hands, it is likely that there will be no questions asked
after that in any interview. A person who has made a
compiler can do anything.
Q: Why Compiler Design? (3)
The software architecture of a compiler is quite general. A
large variety of applications can be modelled after a
compiler (or some part thereof). Simulators, debuggers,
program analysis tools, editors, IDEs, RDBMSs, browsers,
OS shells, … have some significant elements of language
processing (read compiling) in them.
Q: Why Compiler Design? (4)
Let us start with a little history.
Overview and History (1)
Cause: software for early computers was written in assembly language, and the benefit of reusing software on different CPUs started to become significantly greater than the cost of writing a compiler.
The first real compilers: the FORTRAN compilers of the late 1950s, which took 18 person-years to build.
Overview and History (2)
Compiler technology is more broadly applicable and has been employed in
rather unexpected areas. Text-formatting languages,
like nroff and troff; preprocessor packages like eqn, tbl, pic
Silicon compiler for the creation of VLSI circuits
Command languages of OS
Query languages of Database systems
What Do Compilers Do (1)
A compiler acts as a translator, transforming human-oriented programming languages into computer-oriented machine languages, hiding machine-dependent details from the programmer.

Programming Language (Source) → Compiler → Machine Language (Target)
What Do Compilers Do (2)
Compilers may generate three types of code:
Pure Machine Code
Uses the machine instruction set directly, without assuming the existence of any operating system or library.
Mostly OS kernels or embedded applications.
Augmented Machine Code
Code that calls OS routines and runtime support routines.
The most common case.
Virtual Machine Code
Virtual instructions that can be run on any architecture with a virtual machine interpreter or a just-in-time compiler.
Ex. Java
What Do Compilers Do (3)
Another way that compilers
differ from one another is in the format of the target
machine code they generate:
Assembly or other source format
Relocatable binary
Relative address
A linkage step is required
Absolute binary
Absolute address
Can be executed directly
Any compiler must perform two major tasks
Analysis of the source program
Synthesis of a machine-language program
The Structure of a Compiler (1)
Compiler
Analysis Synthesis
The Structure of a Compiler (2)
Source Program (Character Stream) → Scanner → Tokens → Parser → Syntactic Structure → Semantic Routines → Intermediate Representation → Optimizer → Code Generator → Target Machine Code

Symbol and Attribute Tables (used by all phases of the compiler)
The Structure of a Compiler (3)
Scanner
The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens).
RE ( Regular Expression )
NFA ( Non-deterministic Finite Automaton )
DFA ( Deterministic Finite Automaton )
LEX
The Structure of a Compiler (4)
Parser
Given a formal syntax specification (typically a context-free grammar, CFG), the parser reads tokens and groups them into units as specified by the productions of the CFG being used.
As syntactic structure is recognized, the parser either calls corresponding semantic routines directly or builds a syntax tree.
CFG ( Context-Free Grammar )
BNF ( Backus-Naur Form )
GAA ( Grammar Analysis Algorithms )
LL, LR, SLR, LALR Parsers
YACC
The Structure of a Compiler (5)
Semantic Routines
Perform two functions:
Check the static semantics of each construct
Do the actual translation
The heart of a compiler.
Syntax-Directed Translation
Semantic Processing Techniques
IR ( Intermediate Representation )
The Structure of a Compiler (6)
Optimizer
The IR code generated by the semantic routines is analyzed and transformed into functionally equivalent but improved IR code.
This phase can be very complex and slow.
Loop optimization, register allocation, code scheduling
Register and Temporary Management
Peephole Optimization
The Structure of a Compiler (7)
Code Generator
Interpretive Code Generation
Generating Code from Trees/DAGs
Grammar-Based Code Generation
The Structure of a Compiler (8)
Source Program → Scanner [Lexical Analyzer] → Tokens → Parser [Syntax Analyzer] → Parse Tree → Semantic Process [Semantic Analyzer] → Abstract Syntax Tree w/ Attributes → Code Generator [Intermediate Code Generator] → Non-optimized Intermediate Code → Code Optimizer → Optimized Intermediate Code → Code Generator → Target Machine Code
The Structure of a Compiler (9)
Compiler writing tools:
Compiler generators or compiler-compilers
E.g. scanner and parser generators
Examples: Lex, Yacc
The Syntax and Semantics of
Programming Language (1)
A programming language must include the specification of
syntax (structure) and semantics (meaning).
Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs).
Ex.
a = b + c is syntactically legal
b + c = a is illegal
The Syntax and Semantics of
Programming Language (2)
The semantics of a programming language are commonly
divided into two classes:
Static semantics
Semantic rules that can be checked at compile time.
Ex. the type and number of a function's arguments
Runtime semantics
Semantic rules that can be checked only at run time.
Compiler Design and Programming
Language Design
An interesting aspect is how programming language
design and compiler design influence one another.
Programming languages that are easy to compile
have many advantages
Computer Architecture and Compiler
Design
Compilers should exploit hardware-specific features and computing capability to optimize code.
The problems encountered in modern computing
platforms:
Instruction sets for some popular architectures are highly
nonuniform.
High-level programming language operations are not always
easy to support.
Ex. exceptions, threads, dynamic heap access …
Exploiting architectural features such as cache, distributed
processors and memory
Effective use of a large number of processors
Compiler Design Considerations
Debugging Compilers
Designed to aid in the development and debugging of
programs.
Optimizing Compilers
Designed to produce efficient target code
Retargetable Compilers
A compiler whose target architecture can be changed without
its machine-independent components having to be rewritten.
Compiler Construction Tools
Contents
Defining Compiler Construction Tools (aka CCTs)
Uses for CCTs
CCTs in the Compiler Structure
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator
Code Optimizer
Code Generator
Defining CCTs
Programs or environments that assist in the creation of an entire compiler or its parts.
Uses for CCTs
CCTs can generate:
lexical analyzers,
syntax analyzers,
semantic analyzers,
intermediate code,
optimized target code
CCTs in the Compiler Structure
Lexical Analyzer
scanner generators
input: source program
output: lexical analyzer
task of reading characters from source program and
recognizing tokens or basic syntactic components
maintains a list of reserved words
Lexical Analyzer
Flex (fast lexical analyzer generator)
Example: a scanner specification that replaces the string "username" with the user's login name:
%%
username    printf("%s", getlogin());
Syntax Analyzer
parser generators
input: context-free grammar
output: syntax analyzer
the task of the syntax analyzer is to produce a representation of the source program in a form directly representing its syntax structure. This representation is usually in the form of a binary tree or similar data structure
Semantic Analyzer
syntax-directed translators
input: parse tree
output: routines to generate I-code
"The role of the semantic analyzer is to derive methods by which the structures constructed by the syntax analyzer may be evaluated or executed."
type checker
two common tactics:
~ flatten the semantic analyzer's parse tree
~ embed the semantic analyzer within the syntax analyzer
(syntax-directed translation)
Intermediate Code Generator
Automatic code generators
input: I-code rules
output: crude target machine program
“The task of the code generator is to traverse this tree, producing functionally equivalent object code.” [3]
three address code is one type
Intermediate Code Generator
Example 7 + (8 * y) / 2
a := 8
b := y
c := a * b
a := c
b := 2
c := a / b
a := 7
b := c
c := a + b
(Expression tree for 7 + (8 * y) / 2: the subtree 8 * y feeds the division by 2, whose result is added to 7)
Code Optimizer
Data flow engines
input: I-code
output: transformed code
“This improvement is achieved by program transformations that are traditionally called optimizations, although the term ‘optimization’ is a misnomer because there is rarely a guarantee that the resulting code is the best possible.”
Code Optimizer
Peephole Optimization
Machine or assembly code, together with knowledge of the target machine's instruction set, is used to replace I-code instructions with shorter or faster instruction sequences; this is repeated as often as necessary.
Code Optimizer
Common Optimizing Transformations
Optimization Name    Required Analysis     Transformation
constant folding     simulated exec.       elimination
dead code elim.      simulated exec.       elimination
loop unrolling       loop struct., stats   motion (replication)
linearizing arrays   loop structure        elimination
load/store optim.    DFA                   motion
branch chaining      statistics            selection (dec)
math identities      none                  selection, elimination
common subexp.       simulated exec.       elimination
Code Optimizer
Example 7 + (8 * y) / 2
a := y
a := a * 8
a := a / 2
a := a + 7
(Expression tree for 7 + (8 * y) / 2, as on the previous slide)
Code Generator (Assembly Level)
Automatic code generators
input: optimized (transformed) I-code
output: target machine program
Example 7 + (8 * y) / 2
Load a, y
Mult a, 8
Div a, 2
Add a, 7
Review: Compiler Phases
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer (Front End) → Intermediate code generator → Code optimizer → Code generator (Back End)
The symbol table manager and the error handler interact with all phases.
The role of lexical analyzer
The parser repeatedly calls getNextToken; the lexical analyzer reads the source program and returns the next token to the parser, which passes its output on to semantic analysis. Both the lexical analyzer and the parser consult the symbol table.
Lexical Analysis
Lexical analyzer: reads input characters and produces a sequence of tokens as output (nexttoken()), trying to understand each element in the program.
Token: a group of characters having a collective meaning.
const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, 'pi')
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (;, -)
Some terminology:
Token: a group of characters having a collective meaning.
Lexeme: a particular instance of a token.
E.g. token: identifier, lexeme: pi, etc.
Pattern: the rule describing how a token can be formed.
E.g. identifier: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*
The lexical analyzer does not have to be an individual phase, but having a separate phase simplifies the design and improves efficiency and portability.
Two issues in lexical analysis:
How to specify tokens (patterns)?
How to recognize the tokens given a token specification (how to implement the nexttoken() routine)?
How to specify tokens:
All the basic elements in a language must be tokens so that they can be recognized.
Token types: constant, identifier, reserved word, operator and misc. symbol.
Tokens are specified by regular expressions.
main() {
    int i, j;
    for (i = 0; i < 50; i++) {
        printf("i = %d", i);
    }
}
Why to separate Lexical analysis and
parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair a token name and an optional token
value
A pattern is a description of the form that the lexemes of
a token may take
A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
Token        Informal description                       Sample lexemes
if           Characters i, f                            if
else         Characters e, l, s, e                      else
comparison   < or > or <= or >= or == or !=             <=, !=
id           Letter followed by letters and digits      pi, score, D2
number       Any numeric constant                       3.14159, 0, 6.02e23
literal      Anything but " surrounded by "             "core dumped"

printf("total = %d\n", score);
Lexical errors
Some errors are beyond the power of the lexical analyzer to recognize:
fi (a == f(x)) …
However, it may be able to recognize errors like:
d = 2r
Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
Panic mode: successive characters are ignored until we reach a well-formed token.
Other possible recovery actions:
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
Input buffering
Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return.
In C: we need to look past -, = or < to decide what token to return.
In Fortran: DO 5 I = 1.25
We need a two-buffer scheme to handle large look-aheads safely.
E = M * C * * 2 eof
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}

E = M eof * C * * 2 eof eof
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Regular expressions are a means for specifying regular languages.
Example:
letter_ (letter_ | digit)*
Each regular expression is a pattern specifying the form of strings.
Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
Recognition of tokens
Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
| Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
Architecture of a transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …
        …
        case 8: retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
Lexical Analyzer Generator - Lex
Lex source program lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Structure of Lex programs
declarations
%%
translation rules
%%
auxiliary functions
Pattern {Action}
Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
…

int installID() {/* function to install the lexeme, whose first character is
                    pointed to by yytext, and whose length is yyleng, into the
                    symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants into a
                     separate table */
}
Finite Automata
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions: state →input state
Finite Automata
Transition
s1 →a s2
is read: in state s1 on input "a" go to state s2.
If at end of input:
if in an accepting state => accept, otherwise => reject
If no transition is possible => reject
Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transition (labeled a)
A Simple Example
A finite automaton that accepts only "1"
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state.
Another Simple Example
A finite automaton accepting any number of 1's followed by a single 0
Alphabet: {0, 1}
Check that "1110" is accepted but "110…" is not
And Another Example
Alphabet {0,1}
What language does this recognize?
(State graph with transitions on 0 and 1 omitted)
And Another Example
Alphabet still { 0, 1 }
The operation of the automaton is not completely defined by the input:
on input "11" the automaton could be in either state.
Epsilon Moves
Another kind of transition: ε-moves
• The machine can move from state A to state B without reading input.
Deterministic and Nondeterministic
Automata
Deterministic Finite Automata (DFA)
One transition per input per state
No ε-moves
Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input in a given state
Can have ε-moves
Finite automata have finite memory
Need only to encode the current state
Execution of Finite Automata
A DFA can take only one path through the state graph
Completely determined by the input
NFAs can choose
Whether to make ε-moves
Which of multiple transitions for a single input to take
Acceptance of NFAs
An NFA can get into multiple states at once
• Input: 1 0 1
• Rule: an NFA accepts if it can get into a final state
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages
(regular languages)
DFAs are easier to implement
There are no choices to consider
NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA
(NFA and DFA state graphs omitted)
• The DFA can be exponentially larger than the NFA
Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification → Regular Expressions → NFA → DFA → Table-driven Implementation of DFA
Regular Expressions to NFA (1)
Thompson Construction: for each kind of regular expression, define an NFA
Notation: NFA for regular expression A
• For ε
• For input a
Regular Expressions to NFA (2)
For AB: connect the NFA for A to the NFA for B in sequence
• For A | B: branch into the NFA for A or the NFA for B
Regular Expressions to NFA (3)
For A*: loop around the NFA for A
Relationship between NFAs and DFAs
DFA is a special case of an NFA
DFA has no ε-transitions
DFA's transition function is single-valued
Same rules will work
DFA can be simulated with an NFA
Obviously
NFA can be simulated with a DFA (less obvious)
Simulate sets of possible states
Possible exponential blowup in the state space
Still, one state transition per character in the input stream
(Rabin & Scott, 1959)
Automating Scanner Construction
To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Scanner generators
Lex and Flex work along these lines
Algorithms are well-known and well-understood
Key issue is the interface to the parser (define all parts of speech)
You could build one in a weekend!
Where are we? Why are we doing this?
RE → NFA (Thompson's construction)
Build an NFA for each term
Combine them with ε-moves
NFA → DFA (subset construction)
Build the simulation
DFA → minimal DFA
Hopcroft's algorithm
DFA → RE
All-pairs, all-paths problem
Union together paths from s0 to a final state
RE → NFA → DFA → minimal DFA: the Cycle of Constructions
RE → NFA using Thompson's Construction
Key idea:
An NFA pattern for each symbol and each operator
Join them with ε-moves in precedence order
(NFA fragments for a, ab, a | b, and a*, each with a single start state and a single accepting state, are shown in the figure, here omitted)
Ken Thompson, CACM, 1968
Example of Thompson's Construction
Let's try a ( b | c )*
1. NFAs for a, b, and c
2. NFA for b | c
3. NFA for ( b | c )*
(Intermediate NFAs omitted; each step joins the previous fragments with ε-moves)
Example of Thompson's Construction (cont'd)
4. a ( b | c )*
Of course, a human would design something simpler: a single a-transition followed by a loop on b | c.
But we can automate production of the more complex NFA version.
Example of RegExp -> NFA conversion
Consider the regular expression
(1 | 0)*1
(The resulting NFA, with states A through J joined by ε-moves, transitions on 0 and 1 inside the loop, and a final 1 into the accepting state, is omitted here)
Next
Lexical Specification → Regular Expressions → NFA → DFA → Table-driven Implementation of DFA
91
Constructing Efficient Finite Automata
First we’ll see how to transform an NFA into a DFA.
Then we’ll see how to transform a
DFA into a minimum-state DFA.
Transforming an NFA into a DFA
The l-closure of a state s, denoted l(s), is the set consisting of s together with all states that
can be reached from s by traversing l-edges. The l-closure of a set S of states, denoted
l(S), is the union of the l-closures of the states in S.
Example. Given the following NFA as a graph and as a transition table.
S 0
2
1
b
b
La
La
Some sample l-closures for the NFA are as follows:
l(0) = {0, 1, 2}
l(1) = {1, 2}
l(2) = {2}
l() =
l({1, 2}) = {1, 2}
l({0, 1, 2}) = {0, 1, 2}.
S
F
TN a b L
0 {1, 2} {1}
1 {1, 2} {2}
2
R. Rajkumar AP | CSE
Algorithm: Transform an NFA into a DFA
Construct a DFA table TD from an NFA table TN as follows:
1. The start state of the DFA is λ(s), where s is the start state of the NFA.
2. If {s1, …, sn} is a DFA state and a ∈ A, then
   TD({s1, …, sn}, a) = λ(TN(s1, a) ∪ … ∪ TN(sn, a)).
3. A DFA state is final if one of its elements is an NFA final state.

Example. Given the following NFA (start state 0, final states 2 and 3; the graph is omitted):

TN     a      b        λ
0           {0, 1}    {3}
1    {2}
2           {2}
3    {3}

The algorithm constructs the following DFA transition table TD, where it is also written in simplified form after a renumbering of the states (S = start, F = final; blank entries denote the empty set, which becomes the dead state 5 after renumbering):

TD                a         b
S, F  {0, 3}      {3}       {0, 1, 3}
F     {3}         {3}
F     {0, 1, 3}   {2, 3}    {0, 1, 3}
F     {2, 3}      {3}       {2}
F     {2}                   {2}

TD        a   b
S, F  0   1   2
F     1   1   5
F     2   3   2
F     3   1   4
F     4   5   4
      5   5   5
Quiz. Use the algorithm to transform the following NFA into a DFA (start state 0, final state 2; the graph is omitted, and blank entries denote the empty set):

TN     a        b     λ
0    {1}             {3}
1            {2}
2
3    {2, 3}

Solution: The algorithm constructs the following DFA transition table TD, where it is also written in simplified form after a renumbering of the states (S = start, F = final; the empty set becomes the dead state 4 after renumbering):

TD               a           b
S   {0, 3}       {1, 2, 3}
F   {1, 2, 3}    {2, 3}      {2}
F   {2, 3}       {2, 3}
F   {2}

TD       a   b
S   0    1   4
F   1    2   3
F   2    2   4
F   3    4   4
    4    4   4
Transforming a DFA into a minimum-state DFA
Let S be the set of states that can be reached from the start state of a DFA over A. For states s, t ∈ S let s ~ t mean that for all strings w ∈ A*, either T(s, w) and T(t, w) are both final or both nonfinal. Observe that ~ is an equivalence relation on S, so it partitions S into equivalence classes.
Observe also that the number of equivalence classes is the minimum number of states needed by a DFA to recognize the language of the given DFA.

Algorithm: Transform a DFA to a minimum-state DFA
1. Construct the following sequence of sets of possible equivalent pairs of distinct states:
   E0 ⊇ E1 ⊇ … ⊇ Ek = Ek+1,
   where
   E0 = {{s, t} | s and t are either both final or both nonfinal}
   and
   Ei+1 = {{s, t} ∈ Ei | {T(s, a), T(t, a)} ∈ Ei or T(s, a) = T(t, a), for every a ∈ A}.
   Ek represents the distinct pairs of equivalent states from which ~ can be generated.
2. The equivalence classes form the states of the minimum-state DFA, with transition table Tmin defined by
   Tmin([s], a) = [T(s, a)].
3. The start state is the class containing the start state of the given DFA.
4. A final state is any class containing a final state of the given DFA.
Example. Use the algorithm to transform the following DFA into a minimum-state DFA (start state 0, final states 1, 2, 3; the graph is omitted):

T       a   b
S  0    1   4
F  1    2   3
F  2    3   3
F  3    3   3
   4    4   4

Solution: The set of states is S = {0, 1, 2, 3, 4}. To find the equivalent states calculate:
E0 = {{0, 4}, {1, 2}, {1, 3}, {2, 3}}
E1 = {{1, 2}, {1, 3}, {2, 3}}
E2 = {{1, 2}, {1, 3}, {2, 3}} = E1.
So 1 ~ 2, 1 ~ 3, 2 ~ 3. This tells us that S is partitioned by {0}, {1, 2, 3}, {4}, which we name [0], [1], [4], respectively. So the minimum-state DFA has three states.

Min-state Table
TMin       a    b
S  [0]    [1]  [4]
F  [1]    [1]  [1]
   [4]    [4]  [4]

Renamed Table
TMin    a   b
S  0    1   2
F  1    1   1
   2    2   2

(Min-state DFA graph omitted)

Quiz: What regular expression equality arises from the two DFAs?
Answer: a + aa + (aaa + aab + ab)(a + b)* = a(a + b)*.
DFA Minimization
DFA
Deterministic Finite Automaton (DFA)
(Q, Σ, δ, q0, F)
Q – (finite) set of states
Σ – alphabet – (finite) set of input symbols
δ – transition function
q0 – start state
F – set of final / accepting states
DFA
Often represented as a diagram (omitted here).
DFA Minimization
Some states can be redundant:
The following DFA accepts (a|b)+ (diagram omitted)
State s1 is not necessary
DFA Minimization
So these two DFAs are equivalent (diagrams omitted):
DFA Minimization
This is a state-minimized (or just minimized) DFA
Every remaining state is necessary
DFA Minimization
The task of DFA minimization, then, is to automatically
transform a given DFA into a state-minimized DFA
Several algorithms and variants are known
Note that this also in effect can minimize an NFA (since we
know algorithm to convert NFA to DFA)
DFA Minimization Algorithm
Recall that a DFA is M = (Q, Σ, δ, q0, F)
Two states p and q are distinct if
p is in F and q is not, or vice versa, or
for some α in Σ, δ(p, α) and δ(q, α) are distinct
Using this inductive definition, we can calculate which states are distinct.
DFA Minimization Algorithm
Create a lower-triangular table DISTINCT, initially blank.
For every pair of states (p, q): if p is final and q is not, or vice versa,
    DISTINCT(p, q) = ε
Loop until no change for an iteration: for every pair of states (p, q) and each symbol α,
    if DISTINCT(p, q) is blank and DISTINCT(δ(p, α), δ(q, α)) is not blank,
        DISTINCT(p, q) = α
Combine all states that are not distinct.
Very Simple Example
The DISTINCT table has rows s1, s2 and columns s0, s1; it is initially blank.
Label pairs with ε where one state is final and the other is not:
      s0   s1
s1    ε
s2    ε
The main loop makes no changes.
DISTINCT(s1, s2) is empty, so s1 and s2 are equivalent states: merge s1 and s2.
More Complex Example
More Complex Example
Check for pairs with one state final and one not:
More Complex Example
First iteration of main loop:
More Complex Example
Second iteration of main loop:
More Complex Example
Third iteration makes no changes
Blank cells are equivalent pairs of states
More Complex Example
Combine equivalent states for minimized DFA:
Conclusion
DFA minimization is a fairly understandable process, and
is useful in several areas
Regular-expression matching implementations
A very similar algorithm is used in compiler optimization to
eliminate duplicate computations
The algorithm described is O(kn²)
John Hopcroft describes another, more complex algorithm that
is O(kn log n)
Parse Trees
Definitions
Relationship to Left- and Rightmost Derivations
Ambiguity in Grammars
Parse Trees
Parse trees are trees labeled by symbols of a particular
CFG.
Leaves: labeled by a terminal or ε.
Interior nodes: labeled by a variable.
Children are labeled by the right side of a production for
the parent.
Root: must be labeled by the start symbol.
Example: Parse Tree
S -> SS | (S) | ()
S
+-- S
|   +-- (
|   +-- S
|   |   +-- (
|   |   +-- )
|   +-- )
+-- S
    +-- (
    +-- )
Yield of a Parse Tree
The concatenation of the labels of the leaves in left-to-
right order
That is, in the order of a preorder traversal.
is called the yield of the parse tree.
Example: yield of is (())()
S
SS
S )(
( )
( )
R. Rajkumar AP | CSE
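The yield computation can be sketched in Python, assuming a simple nested-tuple encoding of parse trees (the encoding is our own, for illustration: each node is a `(label, children)` pair, and a leaf has an empty child list).

```python
def tree_yield(node):
    """Concatenate leaf labels left to right (preorder over the leaves)."""
    label, children = node
    if not children:                  # leaf: a terminal, or '' for ε
        return label
    return ''.join(tree_yield(c) for c in children)

# The parse tree above, built under S -> SS | (S) | ():
inner = ('S', [('(', []), (')', [])])         # S -> ()
left  = ('S', [('(', []), inner, (')', [])])  # S -> (S)
right = ('S', [('(', []), (')', [])])         # S -> ()
tree  = ('S', [left, right])                  # S -> SS
print(tree_yield(tree))                       # (())()
```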
Parse Trees, Left- and Rightmost Derivations
For every parse tree, there is a unique leftmost, and a
unique rightmost derivation.
We’ll prove:
1. If there is a parse tree with root labeled A and yield w, then
A =>*lm w.
2. If A =>*lm w, then there is a parse tree with root A and
yield w.
Proof: Part 2
Given a leftmost derivation of a terminal string, we need
to prove the existence of a parse tree.
The proof is an induction on the length of the derivation.
Part 2 – Basis
If A =>*lm a1…an by a one-step derivation, then there
must be a parse tree
       A
     / | \
   a1  …  an
Part 2 – Induction
Assume (2) for derivations of fewer than k > 1 steps,
and let A =>*lm w be a k-step derivation.
First step is A =>lm X1…Xn.
Key point: w can be divided so the first portion is
derived from X1, the next is derived from X2, and so
on.
If Xi is a terminal, then wi = Xi.
Induction – (2)
That is, Xi =>*lm wi for all i such that Xi is a variable.
And the derivation takes fewer than k steps.
By the IH, if Xi is a variable, then there is a parse tree
with root Xi and yield wi.
Thus, there is a parse tree
       A
     / | \
   X1  …  Xn
   |        |
   w1       wn
Parse Trees and Rightmost Derivations
The ideas are essentially the mirror image of the
proof for leftmost derivations.
Left to the imagination.
Parse Trees and Any Derivation
The proof that you can obtain a parse tree from a
leftmost derivation doesn’t really depend on
“leftmost.”
First step still has to be A => X1…Xn.
And w still can be divided so the first portion is
derived from X1, the next is derived from X2, and so
on.
Ambiguous Grammars
A CFG is ambiguous if there is a string in the language
that is the yield of two or more parse trees.
Example: S -> SS | (S) | ()
Two parse trees for ()()() on next slide.
Example – Continued
Tree 1 (groups the first two pairs):

S
+-- S
|   +-- S
|   |   +-- (
|   |   +-- )
|   +-- S
|       +-- (
|       +-- )
+-- S
    +-- (
    +-- )

Tree 2 (groups the last two pairs):

S
+-- S
|   +-- (
|   +-- )
+-- S
    +-- S
    |   +-- (
    |   +-- )
    +-- S
        +-- (
        +-- )
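A small sketch, using an assumed nested-tuple encoding of parse trees (our own, for illustration), confirms that the two trees are different while their yields agree:

```python
def tree_yield(node):
    """Concatenate leaf labels left to right."""
    label, children = node
    return label if not children else ''.join(tree_yield(c) for c in children)

unit = ('S', [('(', []), (')', [])])       # S -> ()
t1 = ('S', [('S', [unit, unit]), unit])    # groups the first two pairs
t2 = ('S', [unit, ('S', [unit, unit])])    # groups the last two pairs
print(tree_yield(t1), tree_yield(t2))      # ()()() ()()()
print(t1 == t2)                            # False: same string, two parse trees
```

Two distinct parse trees with the same yield is exactly the definition of an ambiguous grammar.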
Ambiguity, Left- and Rightmost Derivations
If there are two different parse trees, they must
produce two different leftmost derivations by the
construction given in the proof.
Conversely, two different leftmost derivations produce
different parse trees by the other part of the proof.
Likewise for rightmost derivations.
Ambiguity, etc. – (2)
Thus, equivalent definitions of “ambiguous grammar” are:
1. There is a string in the language that has two different
leftmost derivations.
2. There is a string in the language that has two different
rightmost derivations.
Ambiguity is a Property of Grammars, not Languages
For the balanced-parentheses language, here is
another CFG, which is unambiguous.
B -> (RB | ε
R -> ) | (RR

B, the start symbol, derives balanced strings.
R generates strings that have one more right paren than left.
Example: Unambiguous Grammar
B -> (RB | ε R -> ) | (RR
Construct a unique leftmost derivation for a given
balanced string of parentheses by scanning the string from
left to right.
If we need to expand B, then use B -> (RB if the next symbol is “(” and ε if at the end.
If we need to expand R, use R -> ) if the next symbol is “)” and
(RR if it is “(”.
The Parsing Process
Remaining Input: (())()
Steps of leftmost derivation:
  B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ())()
Steps of leftmost derivation:
  B
  (RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ))()
Steps of leftmost derivation:
  B
  (RB
  ((RRB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: )()
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: ()
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input: )
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input:
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB
  (())()B

B -> (RB | ε    R -> ) | (RR
The Parsing Process
Remaining Input:
Steps of leftmost derivation:
  B
  (RB
  ((RRB
  (()RB
  (())B
  (())(RB
  (())()B
  (())()

B -> (RB | ε    R -> ) | (RR
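The scan traced in these slides can be sketched as a recursive-descent recognizer in Python; the function name and structure are our own, and it decides membership using one symbol of lookahead rather than printing each derivation step.

```python
def parse_balanced(s):
    """Recognizer for B -> (RB | ε, R -> ) | (RR (balanced parentheses)."""
    pos = 0

    def peek():
        # One symbol of lookahead; None at end of input.
        return s[pos] if pos < len(s) else None

    def B():
        nonlocal pos
        if peek() == '(':        # choose B -> (RB on lookahead "("
            pos += 1
            return R() and B()
        return True              # otherwise B -> ε

    def R():
        nonlocal pos
        if peek() == ')':        # choose R -> ) on lookahead ")"
            pos += 1
            return True
        if peek() == '(':        # choose R -> (RR on lookahead "("
            pos += 1
            return R() and R()
        return False             # no production applies

    return B() and pos == len(s)

print(parse_balanced('(())()'))  # True
```

Because each choice of production is forced by the next input symbol, the parser never backtracks; this is precisely what makes the grammar LL(1).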
LL(1) Grammars
As an aside, a grammar such as B -> (RB | ε, R -> ) | (RR,
where you can always figure out which production to use in a
leftmost derivation by scanning the given string left-to-right
and looking only at the next one symbol, is called LL(1).
“Leftmost derivation, left-to-right scan, one symbol of
lookahead.”
LL(1) Grammars – (2)
Most programming languages have LL(1) grammars.
LL(1) grammars are never ambiguous.
References
Aho, Alfred V., Sethi, Ravi and Ullman, Jeffrey D. Compilers: principles, techniques, and tools. (1986). Reading: Addison-Wesley.
Peters, James, Pittman, Thomas. The art of compiler design: theory and practice. (1992). Englewood Cliffs: Prentice Hall.
Watson, Des. High-level languages and their compilers. (1989). Wokingham: Addison-Wesley.
References
Aho, A. V., Hopcroft, J. E. and Ullman, J. D. (1974) The Design and Analysis of
Computer Algorithms. Addison-Wesley.
Hopcroft, J. (1971) An N Log N Algorithm for Minimizing States in a Finite Automaton.
Stanford University.
Parthasarathy, M. and Fleck, M. (2007) DFA Minimization. University of Illinois at
Urbana-Champaign. http://www.cs.uiuc.edu/class/fa07/cs273/Handouts/minimization/minimization.pdf
References
Heng, Christopher. Free Compiler Construction Tools. http://www.thefreecountry.com/programming/compilercontructiontools
The Lex & Yacc Page. http://dinosaur.compilertools.net
Compiler Construction Kits. http://catalog.compilertools.net
The Cocktail Compiler Toolbox. http://www.first.gmd.de/cocktail/
Prepared by
www.gameofcompilers.weebly.com
Instructor : Mr. R. Rajkumar, Assistant Professor | CSE
Staff room: TP-612, Tech park,
SRM Institute of Science and Technology,
Kattankulathur, India.