Lexical Analysis (2 Lectures). CSE244 Compilers 2 Overview Basic Concepts Regular Expressions...

Preview:

Citation preview

Lexical Analysis(2 Lectures)

2CSE244 Compilers

Overview• Basic Concepts

• Regular Expressions– Language

• Lexical analysis by hand

• Regular Languages Tools– NFA

– DFA

• Scanning tools– Lex / Flex / JFlex / ANTLR

3CSE244 Compilers

Scanning Perspective• Purpose

– Transform a stream of symbols

– Into a stream of tokens

4CSE244 Compilers

Lexical Analyzer Responsibilities

• Lexical analyzer [Scanner]– Scan input

– Remove white spaces

– Remove comments

– Manufacture tokens

– Generate lexical errors

– Pass token to parser

5CSE244 Compilers

Modular design• Rationale

– Separate the two analysis• High cohesion / Low coupling

– Improve efficiency

– Improve portability / maintainability

– Enable integration of third-party lexers • [lexer = lexical analysis tool]

6CSE244 Compilers

Terminology• Token

– A classification for a common set of strings

– Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,....

• Pattern– The rules that characterize the set of strings for a token

– Examples: [0-9]+

• Lexeme– Actual sequence of characters that matches a pattern

and has a given Token class.

– Examples: • Identifier: Name,Data,x

• Integer: 345,2,0,629,....

7CSE244 Compilers

Examples

“ ” “ ”

8CSE244 Compilers

Lexical Errors• Error Handling is very localized, w.r.t. Input

Source

• Example: fi(a==f(x)) …generates no lexical error in C

• In what situations do errors occur?• Prefix of remaining input doesn’t match any defined

token

• Possible error recovery actions:• Deleting or Inserting Input Characters

• Replacing or Transposing Characters

• Or, skip over to next separator to ignore problem

9CSE244 Compilers

Basic Scanning technique• Use 1 character of look-ahead

– Obtain char with getc()

• Do a case analysis– Based on lookahead char

– Based on current lexeme

• Outcome– If char can extend lexeme, all is well, go on.

– If char cannot extend lexeme:• Figure out what the complete lexeme is and return its

token

• Put the lookahead back into the symbol stream

10

CSE244 Compilers

Language Concepts

Alphabet Language

{0,1} {0,10,100,1000,10000,…}

{0,1,100,000,111,…}

{a,b,c} {abc,aabbcc,aaabbbccc,…}

{A…Z} {TEE,FORE,BALL…}

{FOR,WHILE,GOTO…}

{A…Z,a…z,0…9, {All legal PASCAL progs}

+,-,…,<,>,…} {All grammatically correct English Sentences}

Special Languages: Φ – EMPTY LANGUAGE

ε – contains empty string ε only

•A language, L, is simply any set of strings over a fixed alphabet.

11

CSE244 Compilers

Formal Language Operations

12

CSE244 Compilers

Examples

L = {A, B, C, D } D = {1, 2, 3}

L D = {A, B, C, D, 1, 2, 3 }

LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }

L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, É DD}

L4 = L2 L2 = ??

L* = { All possible strings of L plus }

L+ = L* -

L (L D ) = ??

L (L D )* = ??

13

CSE244 Compilers

Regular Languages• All examples above are

– Quite expressive

– Simple languages

• But also...– Belong to a special class: regular languages

• A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet.

• Let Σ Be an Alphabet, r a Regular Expression Then L(r) is the Language That is Characterized by the Rules of r

14

CSE244 Compilers

Rules• fix alphabet Σ

• ε is a regular expression denoting {ε}

• If a is in Σ , a is a regular expression that denotes {a}

• Let r and s be R.E. for L(r) and L(s). Then

• (a) (r) | (s) is a regular expression L(r) ∪ L(s)

• (b) (r)(s) is a regular expression L(r) L(s)

• (c) (r)* is a regular expression (L(r))*

• (d) (r) is a regular expression L(r)

• All are Left-Associative.

• Parentheses are dropped as allowed by precedences.

Pre

ced

ee

nce

Pre

ced

ee

nce

15

CSE244 Compilers

Example revisitedL = {A, B, C, D } D = {1, 2, 3}

A | B | C | D = L

(A | B | C | D ) (A | B | C | D ) = L2

(A | B | C | D )* = L*

(A | B | C | D ) ((A | B | C | D ) | ( 1 | 2 | 3 )) = L (L D)

16

CSE244 Compilers

Algebraic Properties

AXIOM DESCRIPTION

r | s = s | r

r | (s | t) = (r | s) | t

(r s) t = r (s t)

r = rr = r

r* = ( r | )*

r ( s | t ) = r s | r t( s | t ) r = s r | t r

r** = r*

| is commutative

| is associative

concatenation is associative

concatenation distributes over |

relation between * and

Is the identity element for concatenation

* is idempotent

17

CSE244 Compilers

More Examples

• All Strings that start with “tab” or end with “bat”:

tab{A,…,Z,a,...,z}*|{A,…,Z,a,....,z}*bat

• All Strings in Which {1,2,3} exist in ascending

order:

{A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*

18

CSE244 Compilers

Tokens as R.E.

… …

“+”

“?”

19

CSE244 Compilers

Tokens as Patterns• Patterns are ???

• Tokens are ???Assume Following Tokens:

if, then, else, relop, id, num

What language construct are they used for ?

Given Tokens, What are Patterns ?

if if

then then

else else

relop < | <= | > | >= | = | <>

id letter ( letter | digit )*

num digit + (. digit + ) ? ( E(+ | -) ? digit + ) ?

What does this represent ? What is ?

20

CSE244 Compilers

Throw Away Tokens• Fact

– Some languages define tokens as useless

– Example: C• whitespace, tabulations, carriage return, and

comments can be discarded without affecting the program’s meaning.

blank b

tab ^T

newline ^M

delim blank | tab | newline

ws delim +

21

CSE244 Compilers

Automaton• A tool to specify a token

start

other

=>0 6 7

8 * RTN(G)

RTN(GE)> =

WeÕve accepted Ņ>Ó and have read other char thatmust be unread.

22

CSE244 Compilers

A More Complex Automatonstart <

0

other

=6 7

8

return(relop, LE)

5

4

>

=1 2

3

other

>

=

*

*

return(relop, NE)

return(relop, LT)

return(relop, EQ)

return(relop, GE)

return(relop, GT)

23

CSE244 Compilers

Two More...id :

delim :

start delim28

other3029

delim

*

start letter9

other1110

letter or digit

*

24

CSE244 Compilers

What about keywords ?• Easy!

– Use the “Identifier” token

– After a match, lookup the keyword table• If found, return a token for the matched keyword

• If not, return a token for the true identifier

25

CSE244 Compilers

Yes... But how to scan?• Remember the algorithm?

– Acquire 1 character of lookahead

– Case analysis based• On lookahead

• On state of automaton

26

CSE244 Compilers

Scanner codeclass Scanner {

InputStream _in; char _la; // The lookahead characterchar[] _window; // lexeme windowToken nextToken() {

startLexeme(); // reset window at startwhile(true) {

switch(_state) {case 0: {

_la = getChar(); if (_la == ‘<’) _state = 1;

else if (_la == ‘=’) _state = 5;else if (_la == ‘>’) _state = 6;else failure(state);

}break;case 6: {

_la = getChar();if (_la == ‘=’) _state = 7;else _state = 8;

}break;}

}}

}

case 7: {return new Token(GEQUAL);

}break;

case 8: {pushBack(_la);return new Token(GREATER);

}

start <0

other

=6 7

8

5

4

>

=1 2

3

other

>

=

*

*

27

CSE244 Compilers

Handling Failures• Meaning

– The automaton for this token failed

• solution– If another automaton is available

• “rewind” the input to the beginning of last lexeme

• Jump to start state of next automaton

• Start recognizing again

– If no other automaton• This is a true lexical error.

• Discard lexeme (or at least first char of lexeme)

• Start from state 0 again

28

CSE244 Compilers

Overview• Basic Concepts

• Regular Expressions– Language

• Lexical analysis by hand

• Regular Languages Tools– NFA / DFA

• Scanning with DFAs

• Scanning tools– Lex / Flex / JFlex

29

CSE244 Compilers

Automata & Language Theory

• Terminology– FSA

• A recognizer that takes an input string and determines whether it’s a valid string of the language.

– Non-Deterministic FSA (NFA)• Has several alternative actions for the same input

symbol

– Deterministic FSA (DFA)• Has at most 1 action for any given input symbol

• Bottom Line– expressive power(NFA) == expressive power(DFA)

– Conversion can be automated

30

CSE244 Compilers

NFA

An NFA is a mathematical model that consists of :

• S, a set of states

• , the symbols of the input alphabet

• move, a transition function.

• move(state, symbol) → set of states

• move : S ∪{∈} → Pow(S)

• A state, s0 ∈ S, the start state

• F ⊆ S, a set of final or accepting states.

31

CSE244 Compilers

Representing NFA

Transition Diagrams :

Transition Tables:

Number states (circles), arcs, final states, …

More suitable to representation within a computer

We’ll see examples of both !

32

CSE244 Compilers

Example NFA

S = { 0, 1, 2, 3 }

s0 = 0

F = { 3 }

Σ = { a, b }

start0 3b21 ba

a

b

What Language is defined ?

What is the Transition Table ?

state

i n p u t

0

1

2

a b

{ 0, 1 }-- { 2 }

-- { 3 }

{ 0 }

∈(null) moves possible

ji ∈

Switch state but do not use any input symbol

33

CSE244 Compilers

Epsilon-Transitions• Given the regular expression : (a (b*c)) | (a (b | c+)?)

• Find a transition diagram NFA that recognizes it.

• Solution ?

34

CSE244 Compilers

NFA Construction• Automatic construction example

• a(b*c)

• a(b|c+)?

Build a DisjunctionBuild a Disjunction

35

CSE244 Compilers

Resulting NFA

36

CSE244 Compilers

Working NFA

start0 3b21 ba

a

b • Given an input string, we trace moves

• If no more input & in final state, ACCEPT EXAMPLE:

Input: ababb

move(0, a) = 1

move(1, b) = 2

move(2, a) = ? (undefined)

REJECT !

move(0, a) = 0

move(0, b) = 0

move(0, a) = 1

move(1, b) = 2

move(2, b) = 3ACCEPT !

-OR-

37

CSE244 Compilers

Handling Undefined Transitions

• We can handle undefined transitions by defining one more state, a “death” state, and transitioning all previously undefined transition to this death state.

start0 3b21 ba

a

b

4

a, b

aa

38

CSE244 Compilers

Worse still...• Not all path result in acceptance!

start0 3b21 ba

a

b

aabb is accepted along path :

0 → 0 → 1 → 2 → 3

BUT… it is not accepted along the valid path:

0 → 0 → 0 → 0 → 0

39

CSE244 Compilers

The NFA “Problem”• Two problems

– Valid input may not be accepted

– Non-deterministic behavior from run to run...

• Solution?

40

CSE244 Compilers

The DFA Save The Day• A DFA is an NFA with a few restrictions

– No epsilon transitions

– For every state s, there is only one transition (s,x) from s for any symbol x in Σ

• Corollaries

– Easy to implement a DFA with an algorithm!

– Deterministic behavior

41

CSE244 Compilers

NFA vs. DFA• NFA

– smaller number of states Qnfa

– In order to simulate it requires a |Qnfa| computation for each input symbol.

• DFA– larger number of states Qdfa

– In order to simulate it requires a constant computation for each input symbol.

• caveat - generic NFA=>DFA construction: Qdfa

~ 2^{Qnfa}

• but: DFA’s are perfectly optimizable! (i.e., you can find smallest possible Qdfa )

42

CSE244 Compilers

One catch...• NFA-DFA comparison

43

CSE244 Compilers

NFA to DFA Conversion• Idea

– Look at the state reachable without consuming any input

– Aggregate them in macro states

44

CSE244 Compilers

Final Result• A state is final

– IFF one of the NFA state was final

45

CSE244 Compilers

Preliminary Definitions• NFA N = ( S, Σ, s0, F, MOVE )• ε-Closure(s) : s ε S

– set of states in S that are reachable from s via ε-moves of N that originate from s.

• ε-Closure(T) : T ⊆S• NFA states reachable from all t ε T on ε-moves

only.

• move(T,a) : T ⊆S, a ε Σ• Set of states to which there is a transition on

input a from some t ε T

46

CSE244 Compilers

Algorithm

forall(t in T) push(t);

initialize ε-closure(T) to T;

while stack is not empty do begin

t = pop();

for each u ε S with edge t→u labeled ε

if u is not in ε-closure(T)

add u to ε-closure(T) ;

push u onto stack

computing the ε-closurecomputing the ε-closure

47

CSE244 Compilers

DFA constructioncomputing the

The set of states

The transitions

computing the

The set of states

The transitionslet Q = ε-closure(s0) ;

D = { Q };

enQueue(Q)

while queue not empty do

X = deQueue();

for each a ε Σ do

Y := ε-closure(move(X,a));

T[X,a] := Y

if Y is not in D

D = D U { Y }

enQueue(Y);

end

end

48

CSE244 Compilers

Summary• We can

– Specify tokens with R.E.

– Use DFA to scan an input and recognize token

– Transform an NFA into a DFA automatically

• What we are missing– A way to transform an R.E. into an NFA

• Then, we will have a complete solution– Build a big R.E.

– Turn the R.E. into an NFA

– Turn the NFA into a DFA

– Scan with the obtained DFA

49

CSE244 Compilers

R.E. To NFA• Process

– Inductive definition• Use the structure of the R.E.

• Use atomic automata for atomic R.E.

• Use composition rules for each R.E. expression

• Recall– RE ::= ε

::= s in Σ::= rs

::= r | s

::= r*

50

CSE244 Compilers

Epsilon Construction• RE ::= ε

51

CSE244 Compilers

Symbol Construction• RE ::= x in Σ

52

CSE244 Compilers

Chaining Construction• RE ::= rs

53

CSE244 Compilers

Branching Construction• RE ::= r | s

54

CSE244 Compilers

Kleene-Closure Construction• RE ::= r*

55

CSE244 Compilers

NFA Construction Example• R.E.

– (ab*c) | (a(b|c*))

• Parse Tree: r13

r12r5

r3 r11r4

r9

r10

r8r7

r6

r0

r1 r2

b

*c

a a

|

( )

b

|

*

c

56

CSE244 Compilers

NFA Construction Example 2

r3: a

r0: b

r2: c

b ∈

∈r1:

r4 : r1 r2b ∈

∈ c

r5 : r3 r4

b ∈

∈a c

57

CSE244 Compilers

NFA Construction Example 3

r11: a

r7: b

r6: c

c ∈

∈r9 : r7 | r8

∈∈

b

c ∈

∈r8:

c ∈

∈r12 : r11 r10

∈∈

b

a

r10 : r9

58

CSE244 Compilers

NFA Construction Example 4r13 : r5 | r12

b ∈

∈a c

c ∈

∈∈

b

a

∈∈

1

6543

8

2

10

9 12 13 14

11

15

7

16

17

59

CSE244 Compilers

Overall Summary• How does this all fit together ?

– Reg. Expr. → NFA construction

– NFA → DFA conversion

– DFA simulation for lexical analyzer

• Recall Lex Structure– Pattern Action

– Pattern Action

– … …• Each pattern recognizes lexemes

• Each pattern described by regular expression

etc.

(abc)*ab

(a | b)*abb

Recognizer!

60

CSE244 Compilers

Morale?• All of this can be automated with a tool!

– LEX The first lexical analyzer tool for C

– FLEXA newer/faster implementation C / C++ friendly

– JFLEX A lexer for Java. Based on same principles.

– JavaCC

– ANTLR

61

CSE244 Compilers

Ahead...• Grammars

• Parsing– Bottom Up

– Top Down

Recommended