59
Lexical Analysis

compiler-lexical analysis

Embed Size (px)

DESCRIPTION

lexical analysisi

Citation preview

Page 1: compiler-lexical analysis

Lexical Analysis

Page 2: compiler-lexical analysis

OutlineRole of lexical analyzerSpecification of tokensRecognition of tokensLexical analyzer generatorFinite automataDesign of lexical analyzer generator

Page 3: compiler-lexical analysis

Compiler Construction

Lexicalanalyzer

ParserSource program

read char

put back char

pass tokenand attribute value

get next

Symbol TableRead entireprogram into

memory

id

Page 4: compiler-lexical analysis

The role of lexical analyzer

Lexical Analyzer ParserSource

program

token

getNextToken

Symboltable

To semanticanalysis

Page 5: compiler-lexical analysis

Other tasks…Stripping out from the source program

comment and white space in the form of blank, tab and newline characters .

Correlate error messages with source program (e.g., line number of error).

Page 6: compiler-lexical analysis

Lexical analyzer scanning (simple

operations)

lexical analysis(complex)

Page 7: compiler-lexical analysis

Why to separate Lexical analysis and parsing1. Simplicity of design

1. Improving compiler efficiency

1. Enhancing compiler portability

Eliminate the white space and comment lines before parsing

A large amount of time is being spent on reading the source program and generating the tokens

The representation of special or non standard symbols can be isolated in the lexical analyzer

Page 8: compiler-lexical analysis

Tokens, Patterns and LexemesPattern: A rule that describes a set of stringsToken: A set of strings in the same patternLexeme: The sequence of characters of a token

Page 9: compiler-lexical analysis

ExampleToken Informal description Sample lexemes

ifelse

comparison

idnumberliteral

Characters i, fCharacters e, l, s, e

< or > or <= or >= or == or !=

Letter followed by letter and digits

Any numeric constant

Anything but “ sorrounded by “

if

else<=, !=

pi, score, D2

3.14159, 0, 6.02e23

“core dumped”

Page 10: compiler-lexical analysis

Compiler Construction

E = C1 * 10

Token Attribute

ID Index to symbol table entry E

=

ID Index to symbol table entry C1

*

NUM 10

Page 11: compiler-lexical analysis

Lexical Error and RecoveryA character sequence that can’t be scanned into any valid

token is a lexical error.Lexical errors are uncommon, but they still must be

handled by a scanner. We won’t stop compilation because of minor error.

fi(a==f(x))... //mistyped if lex will give valid token Delete one character from the remaining input. Insert a missing character into the remaining input. Replace a character by another character. Transpose two adjacent characters from the

remaining input

Page 12: compiler-lexical analysis

Input bufferingSometimes lexical analyzer needs to look

ahead some symbols to decide about the token to returnIn C language: we need to look after -, = or <

to decide what token to returnWe need to introduce a two buffer scheme to

handle large look-aheads safely

E = M * C * * 2 eof

Page 13: compiler-lexical analysis

Compiler Construction

Specification of TokensRegular expressions are an important notation for

specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying

those types of patterns that we actually need for tokens.

Page 14: compiler-lexical analysis

Specification of tokensIn theory of compilation regular expressions

are used to formalize the specification of tokens

Regular expressions are means for specifying regular languages

Example: Letter(letter| digit)*

Each regular expression is a pattern specifying the form of strings

Page 15: compiler-lexical analysis

Regular expressionsƐ is a regular expression, L(Ɛ) = {Ɛ}If a is a symbol in ∑then a is a regular

expression, L(a) = {a}(r) | (s) is a regular expression denoting the

language L(r) ∪ L(s) (r)(s) is a regular expression denoting the

language L(r)L(s)(r)* is a regular expression denoting (L(r))*

R* = R concatenated with itself 0 or more times

= {ε} ∪ R ∪ RR ∪ RRR ∪(r) is a regular expression denoting L(r)

Page 16: compiler-lexical analysis

ExtensionsOne or more instances: (r)+Zero of one instances: r?Character classes: [ abc ]

Example:letter_ -> [A-Z , a-z_]digit -> [0-9]id -> letter(letter | digit)*

Page 17: compiler-lexical analysis

Regular definitionsd1 -> r1

d2 -> r2

…d n -> r n

Example:letter -> A | B | … | Z | a | b | … | z | _digit -> 0 | 1 | … | 9id -> letter(letter| digit)*

Page 18: compiler-lexical analysis

Operations on Languages

Example: Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z ) and let D

be the set of digits {0,1,.. .9). L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. other languages can be constructed from L and D, using the operators illustrated above

Page 19: compiler-lexical analysis

Compiler Construction

Terms for Parts of Strings

Page 20: compiler-lexical analysis

Specification of TokensAlgebraic laws of regular expressions 1) |= |

2) |(|)=(|)| () =( ) 3) (| )= | (|)= | 4) = = 5)(*)*=*

6) *= + | + = *7) (|)*= (* | *)*= (* *)*

Page 21: compiler-lexical analysis

Recognition of TokensTask of recognition of token in a lexical

analyzer Isolate the lexeme for the next token in the

input buffer Produce as output a pair consisting of the

appropriate token and attribute-value, such as <id,pointer to table entry> , using the translation table given in the Fig in next page

Page 22: compiler-lexical analysis

Recognition of TokensTask of recognition of token in a lexical

analyzer

Regular expression

Token Attribute-value

if if -id id Pointer to

table entry< relop LT

Page 23: compiler-lexical analysis

Recognition of TokensMethods to recognition of token

Use Transition Diagram

Page 24: compiler-lexical analysis

Recognition of tokensStarting point is the language grammar to

understand the tokens:Grammar for branching statement

stmt -> if expr then stmt | if expr then stmt else stmt | Ɛexpr -> term relop term | termterm -> id | number

Page 25: compiler-lexical analysis

Recognition of tokens (cont.)The next step is to formalize the patterns:

digit -> [0-9]digits -> digit+number -> digits(.digits)? (E[+|-]? digits)?letter -> [A-Z a-z_]id -> letter (letter | digit)*If -> ifThen -> thenElse -> elseRelop -> < | > | <= | >= | = | <>

We also need to handle whitespaces:ws -> (blank | tab | newline)+

Page 26: compiler-lexical analysis

Compiler Construction

Ex :RELOP = < | <= | = | <> | > | >=

0

1

5

6

2

3

4

7

8

start<

=

=

=

>

>

other

other

return(relop,LE)

return(relop,NE)

return(relop,LT)

return(relop,GE)

return(relop,GT)

return(relop,EQ)

#

# # indicates input retraction

Page 27: compiler-lexical analysis

Compiler Construction

Ex2: ID = letter(letter | digit) *

9 10 11start letter

return(id)

# indicates input retraction

other #

letter or digitTransition Diagram:

Page 28: compiler-lexical analysis

Transition diagrams (cont.)Transition diagram for whitespace

Page 29: compiler-lexical analysis

9 10 11start letter

return(id)other

letter or digit

switch (state) {case 9:

if (isletter( c) ) state = 10; else state = failure();break;

case 10: c = nextchar(); if (isletter( c) || isdigit( c) ) state = 10; else state 11case 11: retract(1); insert(id); return;

Page 30: compiler-lexical analysis

Architecture of a transition-diagram-based lexical analyzerTOKEN getRelop(){

TOKEN retToken = new (RELOP)while (1) { /* repeat character processing until a

return or failure occurs */switch(state) {

case 0: c= nextchar(); if (c == ‘<‘) state = 1; else if (c == ‘=‘) state = 5; else if (c == ‘>’) state = 6; else fail(); /* lexeme is not a relop */ break;

case 1: ……case 8: retract();

retToken.attribute = GT; return(retToken);

}

Page 31: compiler-lexical analysis

Lexical Analyzer Generator - Lex

Lexical Compiler

Lex Source programlex.l

lex.yy.c

Ccompiler

lex.yy.c a.out

a.outInput stream Sequence of tokens

Page 32: compiler-lexical analysis

Structure of Lex programs

declarations%%translation rules%%auxiliary functions

Pattern {Action}

Page 33: compiler-lexical analysis

Example%{

/* definitions of manifest constantsLT, LE, EQ, NE, GT, GE,IF, THEN, ELSE, ID, NUMBER, RELOP */

%}

/* regular definitionsdelim [ \t\n]ws {delim}+letter [A-Za-z]digit [0-9]id {letter}({letter}|{digit})*number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%{ws} {/* no action and no return */}if {return(IF);}then {return(THEN);}else {return(ELSE);}{id} {yylval = (int) installID(); return(ID); }{number} {yylval = (int) installNum();

return(NUMBER);}…

Int installID() {/* funtion to install the lexeme, whose first character is pointed to by yytext, and whose length is yyleng, into the symbol table and return a pointer thereto */

}

Int installNum() { /* similar to installID, but puts numerical constants into a separate table */

}

Page 34: compiler-lexical analysis

35

Finite AutomataRegular expressions = specificationFinite automata = implementation

A finite automaton consists ofAn input alphabet A set of states SA start state nA set of accepting states F SA set of transitions state input state

Page 35: compiler-lexical analysis

36

Finite AutomataTransition

s1 a s2

Is readIn state s1 on input “a” go to state s2

If end of inputIf in accepting state => accept, othewise =>

rejectIf no transition possible => reject

Page 36: compiler-lexical analysis

37

Finite Automata State GraphsA state

• The start state

• An accepting state

• A transitiona

Page 37: compiler-lexical analysis

38

A Simple ExampleA finite automaton that accepts only “1”

A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state

1

Page 38: compiler-lexical analysis

39

Another Simple ExampleA finite automaton accepting any number of

1’s followed by a single 0Alphabet: {0,1}

Check that “1110” is accepted but “110…” is not

0

1

Page 39: compiler-lexical analysis

40

And Another ExampleAlphabet {0,1}What language does this recognize?

0

1

0

1

0

1

Page 40: compiler-lexical analysis

41

And Another ExampleAlphabet still { 0, 1 }

The operation of the automaton is not completely defined by the inputOn input “11” the automaton could be in either

state

1

1

Page 41: compiler-lexical analysis

42

Epsilon MovesAnother kind of transition: -moves

• Machine can move from state A to state B without reading input

A B

Page 42: compiler-lexical analysis

43

Deterministic and Nondeterministic AutomataDeterministic Finite Automata (DFA)

One transition per input per state No -moves

Nondeterministic Finite Automata (NFA)Can have multiple transitions for one input in a

given stateCan have -moves

Finite automata have finite memoryNeed only to encode the current state

Page 43: compiler-lexical analysis

44

Execution of Finite AutomataA DFA can take only one path through the

state graphCompletely determined by input

NFAs can chooseWhether to make -movesWhich of multiple transitions for a single input

to take

Page 44: compiler-lexical analysis

45

Acceptance of NFAsAn NFA can get into multiple states

• Input:

0

1

1

0

1 0 1

• Rule: NFA accepts if it can get in a final state

Page 45: compiler-lexical analysis

46

NFA vs. DFA (1)NFAs and DFAs recognize the same set of

languages (regular languages)

DFAs are easier to implementThere are no choices to consider

Page 46: compiler-lexical analysis

47

NFA vs. DFA (2)For a given language the NFA can be simpler

than the DFA0

10

0

01

0

1

0

1

NFA

DFA

• DFA can be exponentially larger than NFA

Page 47: compiler-lexical analysis

48

Regular Expressions to Finite AutomataHigh-level sketch

Regularexpressions

NFA

DFA

LexicalSpecification

Table-driven Implementation of DFA

Page 48: compiler-lexical analysis

49

Regular Expressions to NFA (1)For each kind of rexp, define an NFA

Notation: NFA for rexp A

A

• For

• For input aa

Page 49: compiler-lexical analysis

50

Regular Expressions to NFA (2)For AB

A B

• For A | B

A

B

Page 50: compiler-lexical analysis

51

Regular Expressions to NFA (3)For A*

A

Page 51: compiler-lexical analysis

52

Example of RegExp -> NFA conversionConsider the regular expression

(1 | 0)*1The NFA is

1C E0D F

B

G

A H 1I J

Page 52: compiler-lexical analysis

53

Next

Regularexpressions

NFA

DFA

LexicalSpecification

Table-driven Implementation of DFA

Page 53: compiler-lexical analysis

54

NFA to DFA. The TrickSimulate the NFAEach state of resulting DFA

= a non-empty subset of states of the NFAStart state

= the set of NFA states reachable through -moves from NFA start state

Add a transition S a S’ to DFA iffS’ is the set of NFA states reachable from the

states in S after seeing the input a considering -moves as well

Page 54: compiler-lexical analysis

55

NFA -> DFA Example10 1

A BC

D

E

FG H I J

ABCDHI

FGABCDHI

EJGABCDHI

0

1

0

10 1

Page 55: compiler-lexical analysis

56

NFA to DFA. RemarkAn NFA may be in many states at any time

How many different states ?

If there are N states, the NFA must be in some subset of those N states

How many non-empty subsets are there?2N - 1 = finitely many, but exponentially many

Page 56: compiler-lexical analysis

57

ImplementationA DFA can be implemented by a 2D table T

One dimension is “states”Other dimension is “input symbols”For every transition Si a Sk define T[i,a] = k

DFA “execution”If in state Si and input a, read T[i,a] = k and

skip to state Sk

Very efficient

Page 57: compiler-lexical analysis

58

Table Implementation of a DFA

S

T

U

0

1

0

10 1

0 1S T UT T UU T U

Page 58: compiler-lexical analysis

59

Implementation (Cont.)NFA -> DFA conversion is at the heart of

tools such as flex or jflex

But, DFAs can be huge

In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Page 59: compiler-lexical analysis

ReadingsChapter 3 of the book