Upload
justaperson3157
View
238
Download
0
Embed Size (px)
Citation preview
8/6/2019 Lexical Analysis2 Modefied
1/33
Lexical AnalysisLexical Analysis
Course 338- Compilers
Techniques and Tools
Chapter 3
8/6/2019 Lexical Analysis2 Modefied
2/33
2
Lexical AnalysisLexical Analysis
Lexical analysis recognizes the vocabulary of
the programming language and transforms a
string ofcharacters into a string ofwords or
tokens
Lexical analysis discards white spaces andcomments between the tokens
Lexical analyzer(orscanner) is the program
that performs lexical analysis
8/6/2019 Lexical Analysis2 Modefied
3/33
3
ContentsContents
Scanners
Tokens
Regular expressions
Finite automata
8/6/2019 Lexical Analysis2 Modefied
4/33
4
ScannersScanners
Scanner Parser
Symbol
Table
token
next token
characters
8/6/2019 Lexical Analysis2 Modefied
5/33
5
TokensTokens
A token is a sequence of characters that can
be treated as a unit in the grammarof aprogramming language
A programming language classifies tokens
into a finite set oftoken types
Type ExamplesID foo i n
NUM 73 13
IF if
COMMA ,
8/6/2019 Lexical Analysis2 Modefied
6/33
6
Semantic Values of TokensSemantic Values of Tokens
Semantic values are used to distinguish
different tokens in a token type < ID, foo>, < ID, i >, < ID, n >
< NUM, 73>, < NUM, 13 >
< IF, >
< COMMA, >
Token types affect syntax analysis and
semantic values affect semantic analysis
8/6/2019 Lexical Analysis2 Modefied
7/33
7
Scanner GeneratorsScanner Generators
Scanner
Generator
Scannerdefinition in
Mata LanguageScanner
ScannerProgram in
programming
language
Token types &
semantic values
8/6/2019 Lexical Analysis2 Modefied
8/33
8
LanguagesLanguages
A language is a set ofstrings
A string is a finite sequence ofsymbols taken
from a finite alphabet
The C language is the (infinite) set of all strings that
constitute legal C programs
The language of C reserved words is the (finite) set
of all alphabetic strings that cannot be used as
identifiers in the C programs
Each token type is a language
8/6/2019 Lexical Analysis2 Modefied
9/33
9
Regular Expressions (RE)Regular Expressions (RE)
Language allows us to use finite descriptions
to specify (possibly infinite) sets
RE is the metalanguage used to define the
token types of a programming language
8/6/2019 Lexical Analysis2 Modefied
10/33
10
Regular ExpressionsRegular Expressions
I is a RE denoting L = {I}
Ifa alphabet, then a is a RE denoting L = {a}
Suppose r and s are RE denoting L(r) and L(s)
alternation: (r) | (s) is a RE denoting L(r) L(s)
concatenation: (r) (s) is a RE denoting L(r)L(s)
repetition: (r)* is a RE denoting (L(r))*
(r) is a RE denoting L(r)
8/6/2019 Lexical Analysis2 Modefied
11/33
11
ExamplesExamples
a | b {a, b}
(a | b)(a | b) {aa, ab, ba, bb}
a* {I, a, aa, aaa, ...}
(a | b)* the set of all strings ofas and bs
a | a*b the set containing the string a andall strings consisting of zero or more
as followed by a b
8/6/2019 Lexical Analysis2 Modefied
12/33
12
Regular DefinitionsRegular Definitions
Names for regular expressions
d1 p r1d2 p r2
..
dn p rn
where ri over alphabet {d1, d2, ..., di-1} Examples:
letterp A | B | ... | Z | a | b | ... | z
digit p 0 | 1 | ... | 9
identifier p letter ( letter | digit )*
8/6/2019 Lexical Analysis2 Modefied
13/33
13
Notational AbbreviationsNotational Abbreviations
One or more instances
(r)+ denoting (L(r))+
r* = r+ | I r+ = r r*
Zero or one instancer? = r | I
Character classes[abc] = a | b | c [a-z] = a | b | ... | z
[^abc] = any character except a | b | c
Any character except newline
.
8/6/2019 Lexical Analysis2 Modefied
14/33
14
ExamplesExamples
if {return IF;}
[a-z][a-z0-9]* {return ID;}
[0-9]+ {return NUM;}
([0-9]+.
[0-9]*)|([0-9]*
.
[0-9]+) {return
REAL;}
(--[a-z]*\n)|( | \n | \t)+
{/*do nothing for white spaces and*
8/6/2019 Lexical Analysis2 Modefied
15/33
15
Finite AutomataFinite Automata
A finite automaton is a finite-state transition
diagram that can be used to model therecognition of a token type specified by a
regular expression
A finite automaton can be a nondeterministicfinite automaton or a deterministic finite
automaton
8/6/2019 Lexical Analysis2 Modefied
16/33
16
Nondeterministic Finite AutomataNondeterministic Finite Automata(NFA)(NFA)
An NFA consists of
A finite set ofstates
A finite set ofinput symbols
A transition function that maps (state, symbol)
pairs to sets of states
A state distinguished as start state
A set of states distinguished as final states
8/6/2019 Lexical Analysis2 Modefied
17/33
17
An ExampleAn Example
RE: (a | b)*abb
States: {1, 2, 3, 4}
Input symbols: {a, b}
Transition function:
(1,a) = {1,2}, (1,b) = {1}(2,b) = {3}, (3,b) = {4}
Start state: 1
Final state: {4}
1
2
3
4
a
b
b
a,b
start
8/6/2019 Lexical Analysis2 Modefied
18/33
18
Acceptance of NFAAcceptance of NFA
An NFA accepts an input string s iff there issome path in the finite-state transition diagram
from the start state to some final state such
that the edge labels along this path spell out s
The language recognized by an NFA
automaton is the set of strings it accepts
8/6/2019 Lexical Analysis2 Modefied
19/33
19
An ExampleAn Example
0 31 2
a b b
a
b
start
(a | b)
*
abb aabb
8/6/2019 Lexical Analysis2 Modefied
20/33
20
aaba
An ExampleAn Example
0 31 2
a b b
a
b
start
(a | b)
*
abb
a
8/6/2019 Lexical Analysis2 Modefied
21/33
21
Another ExampleAnother Example
RE: aa* | bb*
States: {1, 2, 3, 4, 5}
Input symbols: {a, b}
Transition function:
(1, I) = {2, 4}, (2, a) = {3}, (3, a) = {3},(4, b) = {5}, (5, b) = {5}
Start state: 1
Final states: {3, 5}
8/6/2019 Lexical Analysis2 Modefied
22/33
22
FiniteFinite--State Transition DiagramState Transition Diagram
start
aa* | bb*
1
4
2 3a
b
a
b
5
I
I
8/6/2019 Lexical Analysis2 Modefied
23/33
23
Deterministic Finite Automata (DFA)Deterministic Finite Automata (DFA)
A DFA is a special case of an NFA in which
no state has an I-transition
for each state s and input symbol a, there is at
most one edge labeled a leaving s
8/6/2019 Lexical Analysis2 Modefied
24/33
24
An ExampleAn Example
RE: (a | b)*abb
States: {1, 2, 3, 4} Input symbols: {a, b}
Transition function:
(1,a) = {2}, (2,a) = {2}, (3,a) = {2}, (4,a) = {2}
(1,b) = {1}, (2,b) = {3}, (3,b) = {4}, (4,b) = {1}
Start state: 1
Final state: {4}
8/6/2019 Lexical Analysis2 Modefied
25/33
25
Finite-State Transition Diagram
(a | b)*abb
1 42 3a
b b
a
b
start
a
b
a
8/6/2019 Lexical Analysis2 Modefied
26/33
26
Acceptance of DFAAcceptance of DFA
A DFA accepts an input string s iff there is one
path in the finite-state transition diagram fromthe start state to some final state such that the
edge labels along this path spell out s
The language recognized by a DFA
automaton is the set of strings it accepts
8/6/2019 Lexical Analysis2 Modefied
27/33
27
An ExampleAn Example
(a | b)*abb
1 42 3a
b b
a
b
start
a
b
a
aabb
8/6/2019 Lexical Analysis2 Modefied
28/33
28
An ExampleAn Example
(a | b)*abb
1 42 3a
b b
a
b
start
a
b
a
aaba
8/6/2019 Lexical Analysis2 Modefied
29/33
29
Combined Finite AutomataCombined Finite Automata
1
32
4.
0-9
start0-9
1 2
i fstart
3
1a-zstart
2 a-z,0-9
50-9
.0-9
0-9
[a-z][a-z0-9]*
([0-9]+.[0-9]*)
|
([0-9]*.[0-9]+)
if IFID
REAL
REAL
8/6/2019 Lexical Analysis2 Modefied
30/33
30
Combined Finite AutomataCombined Finite Automata
7
98
10.
0-9I0-9
2 3i f
I
4
5a-z
I6 a-z,0-9
110-9
.0-9
0-9
1start
IFID
REAL
REAL
NFA
8/6/2019 Lexical Analysis2 Modefied
31/33
31
Combined Finite AutomataCombined Finite Automata
65
7
.
0-90-9
2
i
f3
j-z4 a-z,0-9
80-9
.0-9
0-9
1start
IF
ID
REAL
REALDFA
a-z,0-9
a-h
a-eg-z
8/6/2019 Lexical Analysis2 Modefied
32/33
32
Recognizing the Longest MatchRecognizing the Longest Match
The automaton must keep track ofthe longest
match seen so farand the position of thatmatch until a dead state is reached
Use two variables Last-Final (the state
number of the most recent final state
encountered) and Input-Position-at-Last-Finalto remember the last time the automaton was
in a final state
8/6/2019 Lexical Analysis2 Modefied
33/33
33
An ExampleAn Example
65
7
.
0-9
0-9
2
i
f3
j-z4 a-z,0-9
80-9
.0-9
0-9
1
start
IF
ID
REAL
REAL
DFA
a-z,0-9
a-h
a-e g-ziffail+ S C L P1 0 0
i 2 0 0
f 3 3 2
f 4 4 3
a 4 4 4
i 4 4 5
l 4 4 6
+ ?