150
Lexical Analysis Dec 6, 2021

Lexical Analysis - maxsnew.com

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lexical Analysis - maxsnew.com

Lexical AnalysisDec 6, 2021

Page 2: Lexical Analysis - maxsnew.com

Previously on EECS 483...

The Structure of a Modern Compiler

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

SourceCode

Machine

Code

Structure of a modern compiler

Page 3: Lexical Analysis - maxsnew.com

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

while (y < z) {

int x = a + b;

y += x;

}

Page 4: Lexical Analysis - maxsnew.com

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

while (y < z) {

int x = a + b;

y += x;

}

T_WhileT_LeftParenT_Identifier yT_LessT_Identifier zT_RightParenT_OpenBraceT_IntT_Identifier xT_AssignT_Identifier aT_PlusT_Identifier bT_SemicolonT_Identifier yT_PlusAssignT_Identifier xT_SemicolonT_CloseBrace

Lexical analysis (Scanning): Group sequence of characters into lexemes – smallest meaningful entity in a language (keywords, identifiers, constants)

Page 5: Lexical Analysis - maxsnew.com

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

while (y < z) {

int x = a + b;

y += x;

}While

<

Sequence

=

x +

a b

=

y +

y x

y z

Syntax analysis (Parsing): Convert a linear structure – sequence of tokens – to a hierarchical tree-like structure - abstract syntax tree (AST)

Page 6: Lexical Analysis - maxsnew.com

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

Input: code (character stream)

Goal of Lexical AnalysisBreaking the program down into words or “tokens”

Page 7: Lexical Analysis - maxsnew.com

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

Goal of Lexical AnalysisOutput: Token Stream

Page 8: Lexical Analysis - maxsnew.com

What’s a token?

• What’s a lexical unit of code?

Page 9: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

What is my name ?

Page 10: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Token Type

• Keyword: for int if else while

• Punctuation: ( ) { } ;

• Operand: + - ++

• Relation: < > =

• Identifier: (variable name, function name) foo foo_2

• Integer, float point, string: 2345 2.0 “hello world”

• Whitespace, comment /* this code is awesome */

Page 11: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 12: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 13: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 14: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 15: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 16: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 17: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 18: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Token

Lexeme: the piece of the original program from which we made the token

Page 19: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

( T_IntConst

137

Page 20: Lexical Analysis - maxsnew.com

Scanning a Source File

w h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

( T_IntConst

137

Some tokens can have

attributes that store

extra information about

the token. Here we

store which integer is

represented.

Some tokens can have

attributes that store

extra information about

the token. Here we

store which integer is

represented.

Page 21: Lexical Analysis - maxsnew.com

Lexical Analyzer

• Recognize substrings that correspond to tokens: lexemes

• Lexeme: actual text of the token

• For each lexeme, identify token type

• < Token type, attribute>

• attribute: optional, extra information, often numeric value

Page 22: Lexical Analysis - maxsnew.com

Challenges for Lexical Analyzer• How do we determine which lexemes are

associated with each token?

• When there are multiple ways we could scan the input, how do we know which one to pick?

• if

• ifc

• How do we address these concerns efficiently?

Page 23: Lexical Analysis - maxsnew.com

Associate Lexemes to Tokens

• Tokens: categorize lexemes by what information they provide.

• Associate lexemes to token: Pattern matching

• How to describe patterns??

Page 24: Lexical Analysis - maxsnew.com

Token: Lexemes

• Keyword: for int if else while

• Punctuation: ( ) { } ;

• Operand: + - ++

• Relation: < > =

• Identifier: (variable name,function name) foo foo_2

• Integer, float point, string: 2345 2.0 “hello world”

• Whitespace, comment /* this code is awesome */

Finite possible lexemes

Infinite possible lexemes

Page 25: Lexical Analysis - maxsnew.com

• How do we describe which (potentially infinite) set of lexemes is associated with each token type?

Page 26: Lexical Analysis - maxsnew.com

Formal Languages

● A formal language is a set of strings.

● Many infinite languages have finite descriptions:

● Define the language using an automaton.

● Define the language using a grammar.

● Define the language using a regular expression.

● We can use these compact descriptions of the language to define sets of strings.

● Over the course of this class, we will use all of these approaches.

Page 27: Lexical Analysis - maxsnew.com

• What type of formal language should we use to describe tokens?

Page 28: Lexical Analysis - maxsnew.com

Regular Expressions

● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).

● Often provide a compact and human-readable description of the language.

● Used as the basis for numerous software systems, including the flex tool we will use in this course.

Page 29: Lexical Analysis - maxsnew.com

Atomic Regular Expressions

● The regular expressions we will use in this course begin with two simple building blocks.

● The symbol ε is a regular expression matches the empty string.

● For any symbol a, the symbol a is a regular expression that just matches a.

Page 30: Lexical Analysis - maxsnew.com

Compound Regular Expressions

● If R1 and R

2 are regular expressions, R

1R

2 is a regular

expression represents the concatenation of the languages of R

1 and R

2.

● If R1 and R

2 are regular expressions, R

1 | R

2 is a regular

expression representing the union of R1 and R

2.

● If R is a regular expression, R* is a regular expression for the Kleene closure of R.

● If R is a regular expression, (R) is a regular expression with the same meaning as R.

Page 31: Lexical Analysis - maxsnew.com

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

Page 32: Lexical Analysis - maxsnew.com

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

Page 33: Lexical Analysis - maxsnew.com

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 34: Lexical Analysis - maxsnew.com

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 35: Lexical Analysis - maxsnew.com

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)?

Page 36: Lexical Analysis - maxsnew.com

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

42+1370-3248

-9999912

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

Page 37: Lexical Analysis - maxsnew.com

• More examples

• Whitespace: [ \t\n]+

• Integers: [+\-]?[0-9]+

• Hex numbers: 0x[0-9a-f]+

• identifier

• [A-Za-z]([A-Za-z]|[0-9])*

Page 38: Lexical Analysis - maxsnew.com

• Use regular expressions to describe token types

• How do we match regular expressions?

Page 39: Lexical Analysis - maxsnew.com

Recognizing Regular Language

• Finite Automata

• DFA (Deterministic Finite Automata)

• NFA (Non-deterministic Finite Automata)

What is the machine that recognize regular language??

Page 40: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

Page 41: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

Each circle is a state of the

automaton. The automaton's

configuration is determined

by what state(s) it is in.

Each circle is a state of the

automaton. The automaton's

configuration is determined

by what state(s) it is in.

A Simple Automaton

Page 42: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

These arrows are called

transitions. The automaton

changes which state(s) it is in

by following transitions.

These arrows are called

transitions. The automaton

changes which state(s) it is in

by following transitions.

A Simple Automaton

Page 43: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Finite Automata: Takes an input string and determines whether it’s a valid sentence of a language

accept or reject

Page 44: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 45: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 46: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 47: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 48: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 49: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 50: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 51: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 52: Lexical Analysis - maxsnew.com

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

The double circle indicates that this

state is an accepting state. The

automaton accepts the string if it

ends in an accepting state.

The double circle indicates that this

state is an accepting state. The

automaton accepts the string if it

ends in an accepting state.

Page 53: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

Page 54: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

These are called -transitionsε . These

transitions are followed automatically and

without consuming any input.

These are called -transitionsε . These

transitions are followed automatically and

without consuming any input.

Page 55: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 56: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 57: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 58: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 59: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 60: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 61: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 62: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 63: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 64: Lexical Analysis - maxsnew.com

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 65: Lexical Analysis - maxsnew.com

Lexer Generator

• Given regular expressions to describe the language (token types),

• Step 1: Generates NFA that can recognize the regular language defined

• existing algorithms

• Step 2: Transforms NFA to DFA

• existing algorithms

• Tools: lex, flex

Page 66: Lexical Analysis - maxsnew.com

Challenges for Lexical Analyzer

• How do we determine which lexemes are associated with each token?

• Regular expression to describe token type

• When there are multiple ways we could scan the input, how do we know which one to pick?

• How do we address these concerns efficiently?

Page 67: Lexical Analysis - maxsnew.com

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 68: Lexical Analysis - maxsnew.com

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

Page 69: Lexical Analysis - maxsnew.com

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o tr

f o tr

f o tr

f o tr

f o tr

f o tr

f o tr

f o tr

f o tr

Page 70: Lexical Analysis - maxsnew.com

Conflict Resolution

● Assume all tokens are specified as regular expressions.

● Algorithm: Left-to-right scan.

● Tiebreaking rule one: Maximal munch.

● Always match the longest possible prefix of the remaining text.

Page 71: Lexical Analysis - maxsnew.com

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o tr

Page 72: Lexical Analysis - maxsnew.com

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximum munch?

● Idea:

● Convert expressions to NFAs.

● Run all NFAs in parallel, keeping track of the last match.

● When all automata get stuck, report the last match and restart the search at that point.

Page 73: Lexical Analysis - maxsnew.com

• Example

Page 74: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

Page 75: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

Page 76: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 77: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 78: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 79: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 80: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 81: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 82: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 83: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 84: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 85: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 86: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 87: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 88: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 89: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 90: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 91: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 92: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 93: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 94: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 95: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 96: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 97: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 98: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 99: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 100: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 101: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 102: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 103: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 104: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 105: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 106: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 107: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 108: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 109: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 110: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 111: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 112: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 113: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 114: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 115: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 116: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 117: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 118: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 119: Lexical Analysis - maxsnew.com

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 120: Lexical Analysis - maxsnew.com

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 121: Lexical Analysis - maxsnew.com

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

Page 122: Lexical Analysis - maxsnew.com

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”

● Simple priority system: pick the rule that was defined first.

Page 123: Lexical Analysis - maxsnew.com

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu el

d o bu el

Page 124: Lexical Analysis - maxsnew.com

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu el

Page 125: Lexical Analysis - maxsnew.com

Implement a lexical analyzer• Step 1: Use regular expressions to describe token types (keyword,

identifier, integer constant..)

Number = digit + …Keyword = ‘if ’ + ‘else’ + …Identifier = letter (letter + digit)*OpenPar = ‘(‘

… Then construct Regular language R, matching all lexemes for all tokens

R = Keyword + Identifier + Number + …

= R1 + R2 + …

• Step 2: Use DFA/NFA to recognize the regular language

• But...good news. you don’t need to implement the algorithms to transform your regular expressions to DFA/NFA to recognize it

• flex: given regular expressions -> output c code that does lexical analysis (it internally generates DFA)

Page 126: Lexical Analysis - maxsnew.com

Lexical analyzer

REs + priorities + longest matching token rule

= definition of a lexical analyzer

Page 127: Lexical Analysis - maxsnew.com

DFA vs. NFA•  NFAs and DFAs recognize the same set of

languages (regular languages) –  For a given NFA, there exists a DFA, and vice versa

•  DFAs are faster to execute –  There are no choices to consider –  Tradeoff: simplicity

•  For a given language DFA can be exponentially larger than NFA.

Page 128: Lexical Analysis - maxsnew.com

Automating Lexical Analyzer (scanner) Construction

To convert a specification into code:

1  Write down the RE for the input language

2  Build a big NFA

3  Build the DFA that simulates the NFA

4  Systematically shrink the DFA

5  Turn it into code

Scanner generators

•  Lex and Flex work along these lines

•  Algorithms are well-known and well-understood

Page 129: Lexical Analysis - maxsnew.com

Automating Lexical Analyzer (scanner) Construction

RE→ NFA (Thompson’s construction)

•  Build an NFA for each term

•  Combine them with ε-moves

NFA → DFA (subset construction)

•  Build the simulation

DFA → Minimal DFA

•  Hopcroft’s algorithm

DFA →RE (Not part of the scanner construction)

•  All pairs, all paths problem

•  Take the union of all paths from s0 to an accepting state

minimal DFA

RE NFA DFA

The Cycle of Constructions

Lexical Spec Scanner

Page 130: Lexical Analysis - maxsnew.com

minimal DFA

RE NFA DFA

The Cycle of Constructions

Lexical Spec Scanner

Page 131: Lexical Analysis - maxsnew.com

Key idea •  NFA pattern for each symbol & each operator •  Join them with ε moves in precedence order

RE →NFA using Thompson’s Construction

S0 S1 a

NFA for a

S0 S1 a

S3 S4 b

NFA for ab

ε

NFA for a | b

S0

S1 S2 a

S3 S4 b

S5 ε

ε ε

ε

S0 S1 ε S3 S4

ε

NFA for a*

a

ε

ε

Ken Thompson, CACM, 1968

Page 132: Lexical Analysis - maxsnew.com

Example of Thompson’s Construction

Let’s try a ( b | c )*

1. a, b, & c

2. b | c

3. ( b | c )*

S0 S1 a

S0 S1 b

S0 S1 c

S2 S3 b

S4 S5 c

S1 S6 S0 S7

ε

ε

ε ε

ε ε

ε ε

S1 S2 b

S3 S4 c

S0 S5 ε

ε

ε

ε

Page 133: Lexical Analysis - maxsnew.com

Example of Thompson’s Construction (con’t)

4.  a ( b | c )*

Of course, a human would design something simpler ...

S0 S1 a

b | c But, we can automate production of the more complex one ...

S0 S1 a ε

S4 S5 b

S6 S7 c

S3 S8 S2 S9

ε

ε

ε ε

ε ε

ε ε

Page 134: Lexical Analysis - maxsnew.com

minimal DFA

RE NFA DFA

The Cycle of Constructions

Lexical Spec Scanner

Page 135: Lexical Analysis - maxsnew.com

NFA to DFA : Trick

•  Simulate the NFA

•  Each state of DFA

= a non-empty subset of states of the NFA

•  Start state

= the set of NFA states reachable through e-moves from NFA start state

•  Add a transition S !a S’ to DFA iff –  S’ is the set of NFA states reachable from any state in S after

seeing the input a, considering ε-moves as well

Page 136: Lexical Analysis - maxsnew.com

NFA to DFA : cont..

•  An NFA may be in many states at any time

•  How many different states ?

•  If there are N states, the NFA must be in some subset of those N states

•  How many subsets are there?

2^N - 1 = finitely many

Page 137: Lexical Analysis - maxsnew.com

NFA to DFA

•  Remove the non-determinism –  States with multiple outgoing edges due to same input –  ε transitions

2

4

a

c

start 1

3

b ε

ε ε

ε (a*| b*) c*

Page 138: Lexical Analysis - maxsnew.com

NFA to DFA (2)

•  Multiple transitions –  Solve by subset construction –  Build new DFA based upon the set of states each

representing a unique subset of states in NFA

1 2 a

a b R= a+ b*

ε-closure(1) = {1} include state “1” (1,a) ! {1,2} include state “1/2” (1,b) ! ERROR (1/2,a) !1/2 (1/2,b) ! 2 include state “2”

start 1 2

a

a 1/2 start b

b

(2,a) ! ERROR (2,b) ! 2 Any state with “2” in name is a final state

Page 139: Lexical Analysis - maxsnew.com

NFA to DFA (3)

•  ε transitions –  Any state reachable by an ε transition is “part of the state” –  ε-closure - Any state reachable from S by ε transitions is in

the ε-closure; treat ε-closure as 1 big state, always include ε-closure as part of the state

2 3

a b

start 1

ε ε

1.  ε-closure(1) = {1,2,3}; include1/2/3 2.  Move(1/2/3, a) = {2, 3} + ε-closure(2,3) = {2,3} ; include 2/3 3.  Move(1/2/3, b) = {3} + ε-closure(3) = {3} ; include state 3 4.  Move(2/3, a) = {2} + ε-closure(2) = {2,3} 5.  Move(2/3, b) = {3} + ε-closure(3) = {3} 6.  Move(3, b) = {3} + ε-closure(3) = {3}

a*b*

Page 140: Lexical Analysis - maxsnew.com

NFA to DFA (3)

•  ε transitions –  Any state reachable by an ε transition is “part of the state” –  ε-closure - Any state reachable from S by ε transitions is in

the ε-closure; treat ε-closure as 1 big state, always include ε-closure as part of the state

2 3

a b

start 1

ε ε

1.  ε-closure(1) = {1,2,3}; include1/2/3 2.  Move(1/2/3, a) = {2, 3} + ε-closure(2,3) = {2,3} ; include 2/3 3.  Move(1/2/3, b) = {3} + ε-closure(3) = {3} ; include state 3 4.  Move(2/3, a) = {2} + ε-closure(2) = {2,3} 5.  Move(2/3, b) = {3} + ε-closure(3) = {3} 6.  Move(3, b) = {3} + ε-closure(3) = {3}

2/3 3

a b

start 1/2/3

a b

a*b*

b

Page 141: Lexical Analysis - maxsnew.com

NFA to DFA - Example

1

2

3 start

a

a b

a

4

6 5

ε

ε

ε

Page 142: Lexical Analysis - maxsnew.com

B

NFA to DFA - Example

1

2

3 start

a

a b

a

4

6 5

ε

ε

ε

a

b

b

A

4

6 a a start

ε-closure(1) = {1, 2, 3, 5}

Create a new state A = {1, 2, 3, 5}

move(A, a) = {3, 6} + ε-closure(3,6) = {3,6}

Create B = {3,6}

move(A, b) = {4} + ε-closure(4) = {4}

move(B, a) = {6} + ε-closure(6) = {6}

move(B, b) = {4} + ε-closure(4) = {4}

move(6, a) = {6} + ε-closure(6) = {6}

move(6, b) ! ERROR

move(4, a|b) ! ERROR

Page 143: Lexical Analysis - maxsnew.com

Class Problem

0 1

4

2

6

3

5

9 7 ε ε

ε

ε

εε

ε

ε a

a

b

8 b

Convert this NFA to a DFA

Page 144: Lexical Analysis - maxsnew.com

minimal DFA

RE NFA DFA

The Cycle of Constructions

Lexical Spec Scanner

Page 145: Lexical Analysis - maxsnew.com

State Minimization

•  Resulting DFA can be quite large –  Contains redundant or equivalent states

2

5

b

start 1

3

b

a b

a a

4

b

a

1 2 3 start

a a

b b

Both DFAs accept b*ab*a

Page 146: Lexical Analysis - maxsnew.com

State Minimization (2)

•  Idea – find groups of equivalent states and merge them –  All transitions from states in group G1 go to states in

another group G2 –  Construct minimized DFA such that there is 1 state for

each group of states

2

5

b

start 1

3

b

a b

a a

4

b

a

Basic strategy: identify distinguishing transitions

Page 147: Lexical Analysis - maxsnew.com

minimal DFA

RE NFA DFA

The Cycle of Constructions

Lexical Spec Scanner

Page 148: Lexical Analysis - maxsnew.com

DFA Implementation

•  A DFA can be implemented by a 2D table T –  One dimension is “states” –  Other dimension is “input symbol” –  For every transition Si !a Sk define T[i,a] = k

•  DFA “execution” –  If in state Si and input a, read T[i,a] = k and skip to

state Sk –  Very efficient

Page 149: Lexical Analysis - maxsnew.com

DFA Table Implementation : Example

Page 150: Lexical Analysis - maxsnew.com

Implementation Cont ..

•  NFA -> DFA conversion is at the heart of tools such as flex

•  But, DFAs can be huge

•  In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations