Lexical Analysis (2 Lectures)
CSE244 Compilers

Overview
• Basic Concepts
• Regular Expressions
  – Language
• Lexical analysis by hand
• Regular Languages Tools
  – NFA
  – DFA
• Scanning tools
  – Lex / Flex / JFlex / ANTLR

Scanning Perspective
• Purpose
  – Transform a stream of symbols
  – Into a stream of tokens

Lexical Analyzer Responsibilities
• Lexical analyzer [Scanner]
  – Scan input
  – Remove white space
  – Remove comments
  – Manufacture tokens
  – Generate lexical errors
  – Pass tokens to parser

Modular design
• Rationale
  – Separate the two analyses (lexical and syntactic)
    • High cohesion / Low coupling
  – Improve efficiency
  – Improve portability / maintainability
  – Enable integration of third-party lexers
    • [lexer = lexical analysis tool]

Terminology
• Token
  – A classification for a common set of strings
  – Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, …
• Pattern
  – The rules that characterize the set of strings for a token
  – Example: [0-9]+
• Lexeme
  – Actual sequence of characters that matches a pattern and has a given Token class
  – Examples:
    • Identifier: Name, Data, x
    • Integer: 345, 2, 0, 629, …

Examples
[Figure not recovered: only quotation marks survive from this slide.]

Lexical Errors
• Error handling is very localized with respect to the input source
• Example: fi(a==f(x)) … generates no lexical error in C, because fi is a valid identifier
• In what situations do errors occur?
  – Prefix of remaining input doesn't match any defined token
• Possible error recovery actions:
  – Deleting or inserting input characters
  – Replacing or transposing characters
  – Or, skip over to next separator to ignore problem

Basic Scanning technique
• Use 1 character of look-ahead
  – Obtain char with getc()
• Do a case analysis
  – Based on lookahead char
  – Based on current lexeme
• Outcome
  – If char can extend lexeme, all is well, go on.
  – If char cannot extend lexeme:
    • Figure out what the complete lexeme is and return its token
    • Put the lookahead back into the symbol stream
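The one-character lookahead technique above can be sketched in Java roughly as follows. This is a minimal illustration, not the course's scanner: the class name LookaheadDemo and its lexemes method are hypothetical, java.io.PushbackReader stands in for getc() plus "put back", and only identifiers, integers, and single-character lexemes are handled.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the slides): splits input into lexemes
// using exactly one character of lookahead.
final class LookaheadDemo {
    static List<String> lexemes(String input) {
        List<String> out = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(new StringReader(input))) {
            int la;                                        // the lookahead character
            while ((la = in.read()) != -1) {
                if (Character.isWhitespace(la)) continue;  // remove white space
                StringBuilder lexeme = new StringBuilder().append((char) la);
                if (Character.isLetter(la)) {
                    // identifier: extend while lookahead is a letter or digit
                    while ((la = in.read()) != -1 && Character.isLetterOrDigit(la)) {
                        lexeme.append((char) la);
                    }
                } else if (Character.isDigit(la)) {
                    // integer: extend while lookahead is a digit
                    while ((la = in.read()) != -1 && Character.isDigit(la)) {
                        lexeme.append((char) la);
                    }
                } else {
                    la = -1;                               // single-char lexeme, nothing was read ahead
                }
                if (la != -1) in.unread(la);               // put the lookahead back
                out.add(lexeme.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lexemes("x1 = 42+y"));
    }
}
```

The character that ends a lexeme is never lost: it is unread and becomes the first character examined for the next lexeme.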

Language Concepts

Alphabet                     Language
{0,1}                        {0,10,100,1000,10000,…}
                             {0,1,100,000,111,…}
{a,b,c}                      {abc,aabbcc,aaabbbccc,…}
{A…Z}                        {TEE,FORE,BALL,…}
                             {FOR,WHILE,GOTO,…}
{A…Z,a…z,0…9,+,-,…,<,>,…}    {All legal PASCAL progs}
                             {All grammatically correct English sentences}

Special Languages:
  ∅ – the EMPTY LANGUAGE
  {ε} – contains the empty string ε only
• A language, L, is simply any set of strings over a fixed alphabet.
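
Formal Language Operations
[Table not recovered. The examples on the next slide use union (L ∪ D), concatenation (LD), exponentiation (L²), Kleene closure (L*), and positive closure (L⁺).]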

Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L² = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
L⁴ = L² L² = ??
L* = {all possible strings over L, plus ε}
L⁺ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
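For finite languages, union and concatenation can be computed directly. The sketch below is only an illustration: the class LanguageOps and its method names are hypothetical, and languages are modelled as sets of Java strings. Closures such as L* are infinite and are not enumerated here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finite languages as sets of strings.
final class LanguageOps {
    // L ∪ D: every string that is in either language
    static Set<String> union(Set<String> x, Set<String> y) {
        Set<String> result = new TreeSet<>(x);
        result.addAll(y);
        return result;
    }

    // LD: every string formed by a string of x followed by a string of y
    static Set<String> concat(Set<String> x, Set<String> y) {
        Set<String> result = new TreeSet<>();
        for (String a : x) {
            for (String b : y) {
                result.add(a + b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> L = Set.of("A", "B", "C", "D");
        Set<String> D = Set.of("1", "2", "3");
        System.out.println(union(L, D));
        System.out.println(concat(L, D));
        System.out.println(concat(L, L).size());            // number of strings in L²
        System.out.println(concat(L, union(L, D)).size());  // number of strings in L (L ∪ D)
    }
}
```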

Regular Languages
• All examples above are
  – Quite expressive
  – Simple languages
• But also...
  – Belong to a special class: regular languages
• A Regular Expression is a set of rules / techniques for constructing sequences of symbols (strings) from an alphabet.
• Let Σ be an alphabet and r a regular expression. Then L(r) is the language that is characterized by the rules of r.

Rules
• Fix an alphabet Σ
• ε is a regular expression denoting {ε}
• If a is in Σ, a is a regular expression that denotes {a}
• Let r and s be R.E. for L(r) and L(s). Then
  (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
  (b) (r)(s) is a regular expression denoting L(r) L(s)
  (c) (r)* is a regular expression denoting (L(r))*
  (d) (r) is a regular expression denoting L(r)
• All are left-associative.
• Parentheses are dropped as allowed by precedences.
• Precedence, from highest to lowest: *, concatenation, |

Example revisited
L = {A, B, C, D}    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D) (A | B | C | D) = L²
(A | B | C | D)* = L*
(A | B | C | D) ((A | B | C | D) | (1 | 2 | 3)) = L (L ∪ D)
CSE244 Compilers
Algebraic Properties
AXIOM DESCRIPTION
r | s = s | r
r | (s | t) = (r | s) | t
(r s) t = r (s t)
r = rr = r
r* = ( r | )*
r ( s | t ) = r s | r t( s | t ) r = s r | t r
r** = r*
| is commutative
| is associative
concatenation is associative
concatenation distributes over |
relation between * and
Is the identity element for concatenation
* is idempotent

More Examples
• All strings that start with “tab” or end with “bat”:
  tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat
• All strings in which 1, 2, 3 appear in ascending order:
  {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
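These two expressions can be tried out with java.util.regex. The sketch assumes the set notation {A,…,Z} translates to the character class [A-Z]; the class name RegexExamples is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the two slide examples written as Java regexes.
final class RegexExamples {
    // "tab" followed by letters, or letters followed by "bat"
    static final Pattern TAB_OR_BAT = Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");
    // 1, 2, 3 in ascending order, separated by upper-case letters
    static final Pattern ASCENDING = Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    // True when the whole string (not just a substring) matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(TAB_OR_BAT, "table"));
        System.out.println(matches(TAB_OR_BAT, "combat"));
        System.out.println(matches(ASCENDING, "A1B2C3"));
    }
}
```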

Tokens as R.E.
[Figure not recovered; the only surviving labels are the operators “+” and “?”.]

Tokens as Patterns
• Patterns are ???
• Tokens are ???
Assume Following Tokens: if, then, else, relop, id, num
What language construct are they used for ?
Given Tokens, What are Patterns ?

if      →  if
then    →  then
else    →  else
relop   →  < | <= | > | >= | = | <>
id      →  letter ( letter | digit )*
num     →  digit+ (. digit+)? ( E (+ | -)? digit+ )?

What does this represent ? What is ?
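As a rough check of the patterns above, they can be transcribed into java.util.regex. This is a sketch under assumptions: letter is taken to mean [A-Za-z], digit to mean [0-9], and the class TokenPatterns with its classify method is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: classifies one complete lexeme by the slide's patterns.
final class TokenPatterns {
    static final Pattern RELOP = Pattern.compile("<=|<>|<|>=|>|=");
    static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9]*");
    // digit+ (. digit+)? (E (+|-)? digit+)?
    static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

    static String classify(String lexeme) {
        // Keywords are checked first because they also match the id pattern.
        if (lexeme.equals("if") || lexeme.equals("then") || lexeme.equals("else")) {
            return lexeme;
        }
        if (RELOP.matcher(lexeme).matches()) return "relop";
        if (ID.matcher(lexeme).matches()) return "id";
        if (NUM.matcher(lexeme).matches()) return "num";
        return "error";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"then", "<>", "x1", "6.02E+23", "6."}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

The optional groups show what the ? operator does: both the fraction and the exponent may be absent, but a bare trailing dot is rejected.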

Throw Away Tokens
• Fact
  – Some languages define tokens as useless
  – Example: C
    • Whitespace, tabs, carriage returns, and comments can be discarded without affecting the program's meaning.

blank    →  b
tab      →  ^T
newline  →  ^M
delim    →  blank | tab | newline
ws       →  delim+

Automaton
• A tool to specify a token

state   input   next   action
0       >       6
6       =       7      RTN(GE)
6       other   8 *    RTN(G)

* We've accepted “>” and have read another char that must be unread.

A More Complex Automaton

state   input   next   action
0       <       1
0       =       5      return(relop, EQ)
0       >       6
1       =       2      return(relop, LE)
1       >       3      return(relop, NE)
1       other   4 *    return(relop, LT)
6       =       7      return(relop, GE)
6       other   8 *    return(relop, GT)

* The lookahead character must be put back.

Two More...
id:
  start → (9) –letter→ (10) –other→ ((11))*, with a loop on state 10 labeled “letter or digit”
delim:
  start → (28) –delim→ (29) –other→ ((30))*, with a loop on state 29 labeled “delim”

What about keywords ?
• Easy!
  – Use the “Identifier” token
  – After a match, look up the keyword table
    • If found, return a token for the matched keyword
    • If not, return a token for the true identifier
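The keyword lookup described above might look like the following sketch. The class KeywordTable, the Kind enum, and the three keywords are illustrative assumptions; a real scanner would list the language's full keyword set.

```java
import java.util.Map;

// Hypothetical helper: decides whether a scanned identifier is a keyword.
final class KeywordTable {
    enum Kind { IF, THEN, ELSE, ID }

    // The keyword table: lexeme -> token kind
    static final Map<String, Kind> KEYWORDS = Map.of(
            "if", Kind.IF,
            "then", Kind.THEN,
            "else", Kind.ELSE);

    // Called after the identifier automaton has matched a lexeme.
    static Kind classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, Kind.ID);
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));
        System.out.println(classify("fi"));
    }
}
```

This keeps the automaton small: keywords need no states of their own.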

Yes... But how to scan?
• Remember the algorithm?
  – Acquire 1 character of lookahead
  – Case analysis based
    • On lookahead
    • On state of automaton

Scanner code
// Fragment: Token, GEQUAL, GREATER, getChar(), pushBack(), startLexeme()
// and failure() are assumed to be defined elsewhere.
class Scanner {
    InputStream _in;
    char _la;         // The lookahead character
    char[] _window;   // lexeme window
    int _state;       // current state of the automaton

    Token nextToken() {
        startLexeme();    // reset window at start
        _state = 0;
        while (true) {
            switch (_state) {
            case 0:
                _la = getChar();
                if (_la == '<') _state = 1;
                else if (_la == '=') _state = 5;
                else if (_la == '>') _state = 6;
                else failure(_state);
                break;
            case 6:
                _la = getChar();
                if (_la == '=') _state = 7;
                else _state = 8;
                break;
            case 7:
                return new Token(GEQUAL);
            case 8:
                pushBack(_la);   // the lookahead is not part of ">"
                return new Token(GREATER);
            // states 1 to 5 are handled in the same way
            }
        }
    }
}
[Figure: the relop transition diagram from “A More Complex Automaton”, repeated beside the code]

Handling Failures
• Meaning
  – The automaton for this token failed
• Solution
  – If another automaton is available
    • “Rewind” the input to the beginning of the last lexeme
    • Jump to the start state of the next automaton
    • Start recognizing again
  – If no other automaton is available
    • This is a true lexical error.
    • Discard the lexeme (or at least the first char of the lexeme)
    • Start from state 0 again

Overview
• Basic Concepts
• Regular Expressions
  – Language
• Lexical analysis by hand
• Regular Languages Tools
  – NFA / DFA
• Scanning with DFAs
• Scanning tools
  – Lex / Flex / JFlex

Automata & Language Theory
• Terminology
  – Finite State Automaton (FSA)
    • A recognizer that takes an input string and determines whether it's a valid string of the language.
  – Non-Deterministic FSA (NFA)
    • May have several alternative actions for the same input symbol
  – Deterministic FSA (DFA)
    • Has at most 1 action for any given input symbol
• Bottom Line
  – expressive power(NFA) == expressive power(DFA)
  – Conversion can be automated

NFA
An NFA is a mathematical model that consists of:
• S, a set of states
• Σ, the symbols of the input alphabet
• move, a transition function
  – move(state, symbol) → set of states
  – move : S × (Σ ∪ {ε}) → Pow(S)
• A state, s0 ∈ S, the start state
• F ⊆ S, a set of final or accepting states

Representing NFA
• Transition Diagrams: number states (circles), arcs, final states, …
• Transition Tables: more suitable to representation within a computer
We'll see examples of both !

Example NFA
S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }
start → (0) –a→ (1) –b→ (2) –b→ ((3)), with a loop on state 0 labeled a, b
What Language is defined ?
What is the Transition Table ?

         input
state    a          b
0        { 0, 1 }   { 0 }
1        --         { 2 }
2        --         { 3 }

ε (null) moves possible: i –ε→ j
Switch state but do not use any input symbol

Epsilon-Transitions
• Given the regular expression: (a (b*c)) | (a (b | c+)?)
• Find a transition diagram NFA that recognizes it.
• Solution ?

NFA Construction
• Automatic construction example
  – a(b*c)
  – a(b|c+)?
• Build a Disjunction

Resulting NFA
[Figure not recovered]

Working NFA
start → (0) –a→ (1) –b→ (2) –b→ ((3)), with a loop on state 0 labeled a, b
• Given an input string, we trace moves
• If no more input & in final state, ACCEPT
EXAMPLE: Input: ababb
  move(0, a) = 1
  move(1, b) = 2
  move(2, a) = ? (undefined)
  REJECT !
-OR-
  move(0, a) = 0
  move(0, b) = 0
  move(0, a) = 1
  move(1, b) = 2
  move(2, b) = 3
  ACCEPT !
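One way to avoid guessing which trace to follow is to track every state the NFA could be in at once. The sketch below does this for the example NFA; it is a different technique from the single-path traces above, and the class NfaDemo and the array encoding of the transition table are assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: simulates the example NFA by tracking a set of states.
final class NfaDemo {
    // MOVE[state][symbol] lists the target states; column 0 is 'a', column 1 is 'b'.
    // An empty array means the transition is undefined.
    static final int[][][] MOVE = {
        { {0, 1}, {0} },   // state 0
        { {},     {2} },   // state 1
        { {},     {3} },   // state 2
        { {},     {}  },   // state 3 (final)
    };

    static boolean accepts(String input) {
        Set<Integer> current = new TreeSet<>();
        current.add(0);                                   // start state s0 = 0
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;       // symbol outside Σ
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                for (int t : MOVE[s][c - 'a']) {
                    next.add(t);                          // follow every alternative
                }
            }
            current = next;
        }
        return current.contains(3);                       // F = { 3 }
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("abab"));
    }
}
```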
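
Handling Undefined Transitions
• We can handle undefined transitions by defining one more state, a “death” state, and sending all previously undefined transitions to this death state.
[Diagram: the example NFA, start → (0) –a→ (1) –b→ (2) –b→ ((3)) with a loop on state 0 labeled a, b, extended with a death state 4 that loops on a, b and receives the previously undefined transitions]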

Worse still...
• Not all paths result in acceptance!
start → (0) –a→ (1) –b→ (2) –b→ ((3)), with a loop on state 0 labeled a, b
aabb is accepted along the path:
  0 → 0 → 1 → 2 → 3
BUT… it is not accepted along the equally valid path:
  0 → 0 → 0 → 0 → 0

The NFA “Problem”
• Two problems
  – Valid input may not be accepted
  – Non-deterministic behavior from run to run...
• Solution?

The DFA Saves the Day
• A DFA is an NFA with a few restrictions
  – No epsilon transitions
  – For every state s, there is only one transition (s,x) from s for any symbol x in Σ
• Corollaries
  – Easy to implement a DFA with an algorithm!
  – Deterministic behavior
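A table-driven loop is enough to run a DFA. The sketch below uses a four-state DFA for strings over {a, b} that end in abb; this table was derived by hand for illustration and does not come from the slides, and the class name DfaDemo is hypothetical.

```java
// Hypothetical helper: table-driven DFA simulation.
final class DfaDemo {
    // NEXT[state][symbol]: column 0 is 'a', column 1 is 'b'.
    // State k means "the last k characters read match the first k of abb".
    static final int[][] NEXT = {
        {1, 0},   // state 0
        {1, 2},   // state 1: seen "a"
        {1, 3},   // state 2: seen "ab"
        {1, 0},   // state 3: seen "abb" (final)
    };
    static final boolean[] FINAL = {false, false, false, true};

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;   // symbol outside Σ
            state = NEXT[state][c - 'a'];             // exactly one move per symbol
        }
        return FINAL[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("abba"));
    }
}
```

Each input symbol costs one array lookup, and the same input always follows the same path.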

NFA vs. DFA
• NFA
  – Smaller number of states |Qnfa|
  – Simulation requires a computation on the order of |Qnfa| for each input symbol.
• DFA
  – Larger number of states |Qdfa|
  – Simulation requires a constant computation for each input symbol.
• Caveat: the generic NFA ⇒ DFA construction can yield |Qdfa| ~ 2^{|Qnfa|}
• But: DFAs are perfectly optimizable! (i.e., you can find the smallest possible Qdfa)

One catch...
• NFA-DFA comparison

NFA to DFA Conversion
• Idea
  – Look at the states reachable without consuming any input
  – Aggregate them into macro states

Final Result
• A macro state is final
  – IFF one of the NFA states it aggregates is final

Preliminary Definitions
• NFA N = ( S, Σ, s0, F, MOVE )
• ε-Closure(s) : s ∈ S
  – Set of states in S that are reachable from s via ε-moves of N that originate from s.
• ε-Closure(T) : T ⊆ S
  – NFA states reachable from some t ∈ T on ε-moves only.
• move(T,a) : T ⊆ S, a ∈ Σ
  – Set of states to which there is a transition on input a from some t ∈ T

Algorithm: computing the ε-closure
forall (t in T) push(t);
initialize ε-closure(T) to T;
while stack is not empty do begin
    t = pop();
    for each u ∈ S with edge t→u labeled ε
        if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
        end
end
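The stack-based ε-closure computation can be written in Java roughly as follows. The class EpsilonClosure and the representation of ε-edges as a map from a state to its ε-successors are assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: ε-closure of a set of NFA states.
final class EpsilonClosure {
    // epsEdges maps a state t to the states u with an edge t→u labeled ε.
    static Set<Integer> closure(Set<Integer> T, Map<Integer, List<Integer>> epsEdges) {
        Deque<Integer> stack = new ArrayDeque<>(T);     // push every t in T
        Set<Integer> result = new TreeSet<>(T);         // initialize ε-closure(T) to T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : epsEdges.getOrDefault(t, List.of())) {
                if (result.add(u)) {                    // add() is true only for new states
                    stack.push(u);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> eps = Map.of(0, List.of(1, 2), 1, List.of(3), 3, List.of(0));
        System.out.println(closure(Set.of(0), eps));
        System.out.println(closure(Set.of(2), eps));
    }
}
```

Because a state is pushed only when it is first added, the loop terminates even when the ε-edges form a cycle.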

DFA construction: computing the set of states and the transitions
let Q = ε-closure(s0);
D = { Q };
enQueue(Q)
while queue not empty do
    X = deQueue();
    for each a ∈ Σ do
        Y := ε-closure(move(X,a));
        T[X,a] := Y
        if Y is not in D
            D = D ∪ { Y }
            enQueue(Y);
    end
end
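The queue-based construction above might be sketched in Java as follows. The NFA encoding is an assumption of this sketch: nfa[state][symbol] lists target states, with column 0 reserved for ε-edges and columns 1..n for the input symbols. The class name SubsetConstruction is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: NFA to DFA by aggregating NFA states into macro states.
final class SubsetConstruction {
    static final int EPS = 0;   // assumption: column 0 of each row holds the ε-edges

    static Set<Integer> closure(Set<Integer> T, int[][][] nfa) {
        Set<Integer> result = new TreeSet<>(T);
        Deque<Integer> stack = new ArrayDeque<>(T);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : nfa[t][EPS]) if (result.add(u)) stack.push(u);
        }
        return result;
    }

    static Set<Integer> move(Set<Integer> T, int symbol, int[][][] nfa) {
        Set<Integer> result = new TreeSet<>();
        for (int t : T) for (int u : nfa[t][symbol]) result.add(u);
        return result;
    }

    // Returns T: each DFA state X maps to its successors, one per symbol 1..nSymbols.
    static Map<Set<Integer>, List<Set<Integer>>> build(int[][][] nfa, int start, int nSymbols) {
        Set<Integer> q = closure(Set.of(start), nfa);
        Map<Set<Integer>, List<Set<Integer>>> table = new LinkedHashMap<>();  // keys form D
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        table.put(q, new ArrayList<>());
        queue.add(q);
        while (!queue.isEmpty()) {
            Set<Integer> x = queue.remove();
            for (int a = 1; a <= nSymbols; a++) {
                Set<Integer> y = closure(move(x, a, nfa), nfa);
                table.get(x).add(y);                  // T[X,a] := Y
                if (!table.containsKey(y)) {          // Y is not in D yet
                    table.put(y, new ArrayList<>());
                    queue.add(y);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // The example NFA: columns are ε, a, b.
        int[][][] nfa = {
            { {}, {0, 1}, {0} }, { {}, {}, {2} }, { {}, {}, {3} }, { {}, {}, {} },
        };
        build(nfa, 0, 2).forEach((state, next) -> System.out.println(state + " -> " + next));
    }
}
```

In this sketch an empty macro state, when it arises, plays the role of the “death” state.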

Summary
• We can
  – Specify tokens with R.E.
  – Use a DFA to scan an input and recognize tokens
  – Transform an NFA into a DFA automatically
• What we are missing
  – A way to transform an R.E. into an NFA
• Then, we will have a complete solution
  – Build a big R.E.
  – Turn the R.E. into an NFA
  – Turn the NFA into a DFA
  – Scan with the obtained DFA

R.E. To NFA
• Process
  – Inductive definition
    • Use the structure of the R.E.
    • Use atomic automata for atomic R.E.
    • Use composition rules for each R.E. operator
• Recall
  RE ::= ε
     ::= s in Σ
     ::= rs
     ::= r | s
     ::= r*

Epsilon Construction
• RE ::= ε

Symbol Construction
• RE ::= x in Σ

Chaining Construction
• RE ::= rs

Branching Construction
• RE ::= r | s

Kleene-Closure Construction
• RE ::= r*

NFA Construction Example
• R.E.
  – (ab*c) | (a(b|c*))
• Parse Tree:
  r13 = r5 | r12
    r5 = r3 r4
      r3 = a
      r4 = r1 r2
        r1 = r0*
          r0 = b
        r2 = c
    r12 = r11 r10
      r11 = a
      r10 = ( r9 )
        r9 = r7 | r8
          r7 = b
          r8 = r6*
            r6 = c

NFA Construction Example 2
r3: a
r0: b
r2: c
r1: r0*     [NFA for b*: the b edge wrapped in ε-transitions]
r4: r1 r2   [NFA for b*c]
r5: r3 r4   [NFA for ab*c]

NFA Construction Example 3
r11: a
r7: b
r6: c
r8: r6*        [NFA for c*: the c edge wrapped in ε-transitions]
r9: r7 | r8    [NFA for b | c*]
r10: r9
r12: r11 r10   [NFA for a(b | c*)]

NFA Construction Example 4
r13: r5 | r12
[Figure: the combined NFA for (ab*c) | (a(b | c*)), with states numbered 1 to 17]
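The inductive construction used in this example can be sketched in Java as follows. Several choices here are assumptions of the sketch rather than the slides' method: the class name Thompson, the int[] {start, accept} fragments, the use of '\0' as the ε label, and chaining through an extra ε-edge instead of merging two states (so the state count differs from the 17 states in the figure).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: builds an NFA from R.E. pieces and simulates it.
final class Thompson {
    static final char EPS = 0;   // assumption: '\0' labels ε-edges and never occurs in input
    final List<List<int[]>> edges = new ArrayList<>();   // edges.get(s) holds {label, target}

    int newState() { edges.add(new ArrayList<>()); return edges.size() - 1; }
    void edge(int from, char label, int to) { edges.get(from).add(new int[] {label, to}); }

    int[] symbol(char x) {              // RE ::= x in Σ (pass EPS for RE ::= ε)
        int s = newState(), f = newState();
        edge(s, x, f);
        return new int[] {s, f};
    }
    int[] chain(int[] r, int[] s) {     // RE ::= rs
        edge(r[1], EPS, s[0]);
        return new int[] {r[0], s[1]};
    }
    int[] branch(int[] r, int[] s) {    // RE ::= r | s
        int st = newState(), f = newState();
        edge(st, EPS, r[0]); edge(st, EPS, s[0]);
        edge(r[1], EPS, f);  edge(s[1], EPS, f);
        return new int[] {st, f};
    }
    int[] star(int[] r) {               // RE ::= r*
        int st = newState(), f = newState();
        edge(st, EPS, r[0]);   edge(st, EPS, f);     // enter r, or skip it entirely
        edge(r[1], EPS, r[0]); edge(r[1], EPS, f);   // repeat r, or leave
        return new int[] {st, f};
    }

    Set<Integer> closure(Set<Integer> T) {
        Set<Integer> result = new TreeSet<>(T);
        Deque<Integer> stack = new ArrayDeque<>(T);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int[] e : edges.get(t)) {
                if (e[0] == EPS && result.add(e[1])) stack.push(e[1]);
            }
        }
        return result;
    }
    boolean accepts(int[] nfa, String input) {
        Set<Integer> current = closure(Set.of(nfa[0]));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                for (int[] e : edges.get(s)) if (e[0] == c) next.add(e[1]);
            }
            current = closure(next);
        }
        return current.contains(nfa[1]);
    }

    public static void main(String[] args) {
        Thompson t = new Thompson();
        int[] r5 = t.chain(t.symbol('a'), t.chain(t.star(t.symbol('b')), t.symbol('c')));
        int[] r12 = t.chain(t.symbol('a'), t.branch(t.symbol('b'), t.star(t.symbol('c'))));
        int[] r13 = t.branch(r5, r12);                  // (ab*c) | (a(b|c*))
        System.out.println(t.accepts(r13, "abbc"));
        System.out.println(t.accepts(r13, "abb"));
    }
}
```

Each method mirrors one grammar rule, so the nesting of calls in main follows the parse tree r0 … r13.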

Overall Summary
• How does this all fit together ?
  – Reg. Expr. → NFA construction
  – NFA → DFA conversion
  – DFA simulation for lexical analyzer
• Recall Lex Structure
  – Pattern Action
  – Pattern Action
  – … …
• Each pattern recognizes lexemes
• Each pattern described by regular expression
[Figure: patterns such as (abc)*ab and (a | b)*abb, etc., joined through ε-transitions into a single recognizer]

Moral?
• All of this can be automated with a tool!
  – LEX: the first lexical analyzer tool for C
  – FLEX: a newer/faster implementation, C / C++ friendly
  – JFLEX: a lexer for Java, based on the same principles
  – JavaCC
  – ANTLR

Ahead...
• Grammars
• Parsing
  – Bottom Up
  – Top Down