
Lecture 2 Lexical Analysis


Topics
  Sample Simple Compiler
  Operations on strings
  Regular expressions
  Finite Automata

Readings:

January 11, 2006

CSCE 531 Compiler Construction


– 2 – CSCE 531 Spring 2006

Overview

Last Time
  A little History
  Compilers vs Interpreters
  Data-Flow View of Compilers
  Regular Languages
  Course Pragmatics

Today’s Lecture
  Why Study Compilers?
  xx

References
  Chapter 2, Chapter 3

Assignment Due Wednesday Jan 18
  3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

A Simple Compiler for Expressions

Chapter Two Overview
  Structure of the simple compiler, really just a translator for infix expressions → postfix
  Grammars
  Parse Trees
  Syntax directed Translation
  Predictive Parsing
  Translator for Simple Expressions
    Grammar
    Rewritten grammar (equivalent one better for pred. parsing)
    Parsing modules fig 2.24
    Specification of Translator fig 2.35
    Structure of translator fig 2.36

Grammars

A grammar (or, more correctly, a context free grammar) has
  A set of tokens, also known as terminals
  A set of nonterminals
  A set of productions of the form
    nonterminal → sequence of tokens and/or nonterminals
  A special nonterminal, the start symbol

Example
  E → E + E
  E → E * E
  E → digit

Derivations

A derivation is a sequence of rewritings of a string of grammar symbols using the productions in a grammar.

We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production.

X ⇒ Y if there is a production N → β where
  The nonterminal N occurs in the sequence X of grammar symbols
  And Y is the same as X except that β replaces one occurrence of N

Example
  E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

Parse Trees

A graphical presentation of a derivation, satisfying
  Root is the start symbol
  Each leaf is a token or ε (note different font from text)
  Each interior node is a nonterminal
  If A is a parent with children X_1, X_2, … X_n then A → X_1 X_2 … X_n is a production

Syntax directed Translation

Frequently the rewriting by a production will be called a reduction, or reducing by the particular production.

Syntax directed translation attaches actions (code) that are executed when the reductions are performed.

Example
  E → E + T   {print('+');}
  E → E - T   {print('-');}
  E → T
  T → 0       {print('0');}
  T → 1       {print('1');}
  …
  T → 9       {print('9');}
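To make the print actions concrete, here is a minimal sketch in C of a translator for this grammar, restricted to single digits joined by + and -. It is not the code from the text: the function name to_postfix and its parameters are assumptions made for illustration, and the left-recursive productions are handled with a loop rather than by recursion.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Translate an infix expression of single digits joined by '+' or '-'
 * (e.g. "9-5+2") into postfix ("95-2+").
 * Returns 0 on success, -1 on a syntax error or if out is too small.
 * The names to_postfix, in, out and cap are illustrative only. */
int to_postfix(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);

    /* T -> digit { print(digit) } */
    if (!isdigit((unsigned char)*in) || n + 1 >= cap) return -1;
    out[n++] = *in++;

    /* Repeated E -> E op T { print(op) }, applied left to right. */
    while (*in == '+' || *in == '-') {
        char op = *in++;
        if (!isdigit((unsigned char)*in) || n + 2 >= cap) return -1;
        out[n++] = *in++;   /* action for T: emit the digit */
        out[n++] = op;      /* action for E: emit the operator last */
    }
    if (*in != '\0') return -1;   /* unexpected trailing input */
    out[n] = '\0';
    return 0;
}
```

Calling to_postfix("9-5+2", buf, sizeof buf) with a 16-byte buffer leaves "95-2+" in buf. Each operator is emitted after both of its operands, which is exactly where the print action sits in the production.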


Equivalent Grammars

Specification of the translator (figure 2.38)

  S  → L eof
  L  → E ; L
  L  → ε
  E  → T E'
  E' → + T { print('+'); } E'
  E' → - T { print('-'); } E'
  E' → ε
  T  → F T'
  T' → * F { print('*'); } T'
  T' → / F { print('/'); } T'
  T' → ε
  F  → ( E )
  F  → id  { print(id.lexeme); }
  F  → num { print(num.value); }

Translating to code

  E  → T E'
  E' → + T { print('+'); } E'
  E' → - T { print('-'); } E'
  E' → ε

  Expr()
  {
      int t;
      term();
      while (1)
          switch (lookahead) {
          case '+': case '-':
              t = lookahead;
              match(lookahead);
              term();
              emit(t, NONE);
              continue;
          …

Overview of the Code Figure 2.36

/class/csce531-001

Operations on Strings

A language over an alphabet is a set of strings of characters from the alphabet.

Operations on strings: let x = x_1 x_2 … x_n and t = t_1 t_2 … t_m, then
  Concatenation: xt = x_1 x_2 … x_n t_1 t_2 … t_m
  Alternation: x | t = either x_1 x_2 … x_n or t_1 t_2 … t_m

Operations on Sets of Strings

Operations on sets of strings:
For these let S = {s_1, s_2, … s_m} and T = {t_1, t_2, … t_n}
  Alternation: S | T = S ∪ T = {s_1, s_2, … s_m, t_1, t_2, … t_n}
  Concatenation: ST = { st | s ∈ S and t ∈ T }
                    = { s_1 t_1, s_1 t_2, … s_1 t_n, s_2 t_1, … s_2 t_n, … s_m t_1, … s_m t_n }
  Power: S^2 = S S, S^3 = S^2 S, S^n = S^(n-1) S
    What is S^0?
  Kleene Closure: S* = ∪_{i=0}^{∞} S^i, note S^0 = {ε} is contained in S*

Operations cont. Kleene Closure

  Powers: S^2 = S S, S^3 = S^2 S, …, S^n = S^(n-1) S
    What is S^0?
  Kleene Closure: S* = ∪_{i=0}^{∞} S^i, note S^0 = {ε} is contained in S*
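One way to get a feel for the Kleene closure is to test membership: a string w is in S* exactly when it can be cut into zero or more pieces that all come from S. The sketch below is an illustration only; the function name in_kleene_closure, the bound MAX_W and the prefix-marking approach are assumptions, not part of the lecture.

```c
#include <assert.h>
#include <string.h>

#define MAX_W 128   /* longest input string this sketch handles */

/* Returns 1 if w is a concatenation of zero or more strings from the
 * finite set S (i.e. w is in S*), else 0.
 * ok[i] records whether the prefix of w of length i is in S*. */
int in_kleene_closure(const char *w, const char *const S[], size_t ns)
{
    size_t n = strlen(w);
    char ok[MAX_W + 1] = { 0 };
    assert(n <= MAX_W);

    ok[0] = 1;   /* the empty string is in S^0, hence in S* */
    for (size_t i = 0; i < n; i++) {
        if (!ok[i]) continue;
        for (size_t j = 0; j < ns; j++) {
            size_t len = strlen(S[j]);
            /* Extend a good prefix by one more element of S. */
            if (len > 0 && i + len <= n && strncmp(w + i, S[j], len) == 0)
                ok[i + len] = 1;
        }
    }
    return ok[n];
}
```

With S = {"ab", "c"}, the strings "abcab", "cc" and the empty string are in S*, while "abb" is not.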

Examples of Operations on Sets of Strings

Operations on sets of strings:
For these let S = {a,b,c} and T = {t,u}
  Alternation: S | T = S ∪ T = {a,b,c,t,u}
  Concatenation: ST = { st | s ∈ S and t ∈ T } = { at, au, bt, bu, ct, cu }
  Power: S^2 = { aa, ab, ac, ba, bb, bc, ca, cb, cc }
         S^3 = { aaa, aab, aac, … ccc }, 27 elements
  Kleene closure: S* = { any string of any length of a’s, b’s and c’s, including the empty string }
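The concatenation and power examples can be reproduced mechanically. The following is a small sketch in C, assuming finite sets stored as arrays of strings; the name concat_sets and the bound MAX_LEN are hypothetical. It does not remove duplicates, which cannot occur for the sets used here but can in general.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEN 16   /* longest string stored, including the NUL */

/* Concatenation of two finite sets of strings: writes every st with
 * s in S and t in T into out, in S-major order.
 * The caller must supply room for ns * nt strings.
 * Returns the number of strings written. */
size_t concat_sets(const char *const S[], size_t ns,
                   const char *const T[], size_t nt,
                   char out[][MAX_LEN])
{
    size_t k = 0;
    for (size_t i = 0; i < ns; i++) {
        for (size_t j = 0; j < nt; j++) {
            assert(strlen(S[i]) + strlen(T[j]) < MAX_LEN);
            snprintf(out[k], MAX_LEN, "%s%s", S[i], T[j]);
            k++;
        }
    }
    return k;
}
```

With S = {a,b,c} and T = {t,u}, concat_sets(S, 3, T, 2, out) fills out with at, au, bt, bu, ct, cu, and concat_sets(S, 3, S, 3, out) yields the nine strings of S^2.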


Regular Expressions

For a given alphabet Σ the following are regular expressions:
  If a ∈ Σ then a is a regular expression and L(a) = { a }
  ε is a regular expression and L(ε) = { ε }
  Φ is a regular expression and L(Φ) = Φ, the empty language

And if s and t are regular expressions denoting languages L(s) and L(t) respectively then
  st is a regular expression and L(st) = L(s) L(t)
  s | t is a regular expression and L(s | t) = L(s) ∪ L(t)
  s* is a regular expression and L(s*) = L(s)*

Why Regular Expressions?

We use regular expressions to describe the tokens.

Examples:
  Reg expr for C identifiers
    C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore
    ID reg expr = (letter | underscore) (letter | underscore | digit)*
    Or more explicitly
    ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
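The identifier pattern translates almost directly into a hand-written check. This is a minimal sketch, assuming the default "C" locale so that isalpha and isalnum cover exactly the ASCII letters and digits; the function name is_identifier is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s matches (letter | _)(letter | _ | digit)*, else 0. */
int is_identifier(const char *s)
{
    assert(s != NULL);

    /* First character: letter or underscore (this also rejects ""). */
    if (!(isalpha((unsigned char)*s) || *s == '_')) return 0;

    /* Remaining characters: letter, underscore or digit, zero or more times. */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_')) return 0;
    return 1;
}
```

Note that keywords such as for also match this pattern, so a scanner still has to tell keywords and identifiers apart after matching.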

Pop Quiz

Given r and s are regular expressions then
  What is rε ? r | ε ?
  Describe the language denoted by 0*110*
  Describe the language denoted by (0|1)*110*
  Give a regular expression for the language of strings of 0’s and 1’s that end in a 1
  Give a regular expression for the language of strings of 0’s and 1’s such that every 0 is followed by a 1

Recognizers of Regular Languages

To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs.

The construction of a lexical analyzer will then proceed as:
  1. Identify all tokens
  2. Develop regular expressions for each
  3. Convert the regular expressions to finite automata
  4. Use the transition table for the finite automata as the basis for the scanner

We will actually use the tools lex and/or flex for steps 3 and 4.

Transition Diagram for a DFA

Start in state s_0, then if the input is “f” make a transition to state s_1.
Then from state s_1, if the input is “o” make a transition to state s_2.
And from state s_2, if the input is “r” make a transition to state s_3.
The double circle denotes an “accepting state”, which means we recognized the token.
Actually there is a missing state and transition.

  s_0 --f--> s_1 --o--> s_2 --r--> ((s_3))

Now what about “fort”

The string “fort” is an identifier, not the keyword “for” followed by “t”.

Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ]).
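A common way to handle this is to read the longest run of identifier characters first and only then decide whether the lexeme is the keyword. The sketch below illustrates that idea; the enum token, the function scan_word and its parameters are hypothetical names, and only the single keyword for is considered.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token { TOK_FOR, TOK_ID, TOK_NONE };

/* Scan the longest prefix of s made of letters, digits and underscores
 * (starting with a letter or underscore), store its length in *len,
 * and classify it as the keyword "for" or as an identifier. */
enum token scan_word(const char *s, size_t *len)
{
    size_t n = 0;
    assert(s != NULL && len != NULL);

    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        /* Keep going until a terminator (anything that cannot be
         * part of an identifier) is seen. */
        while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
    }
    *len = n;
    if (n == 0) return TOK_NONE;
    if (n == 3 && strncmp(s, "for", 3) == 0) return TOK_FOR;
    return TOK_ID;
}
```

On the input "for(i" this returns TOK_FOR with length 3, while on "fort " it returns TOK_ID with length 4, because the decision is postponed until the terminator is reached.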

Deterministic Finite Automata

A deterministic finite automaton (DFA) is a mathematical model that consists of
  1. a set of states S
  2. a set of input symbols Σ, the input alphabet
  3. a transition function δ: S × Σ → S that maps each state and each input symbol to the next state
  4. a state s_0 that is distinguished as the start state
  5. a set of states F distinguished as accepting (or final) states

DFA to recognize keyword “for”

Σ = {a, b, c, … z, A, B, … Z, 0, … 9, ',', ';', …}
S = {s_0, s_1, s_2, s_3, s_dead}
s_0 is the start state
S_F = {s_3}
δ given by the table below

          f        o        r        Others
  s_0     s_1      s_dead   s_dead   s_dead
  s_1     s_dead   s_2      s_dead   s_dead
  s_2     s_dead   s_dead   s_3      s_dead
  s_3     s_dead   s_dead   s_dead   s_dead
  s_dead  s_dead   s_dead   s_dead   s_dead
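A transition table like this maps naturally onto a two-dimensional array. The sketch below assumes the transitions s_0 to s_1 on f, s_1 to s_2 on o and s_2 to s_3 on r, with every other entry going to the dead state; the enum names, char_class and accepts_for are illustrative choices, not names from the course code.

```c
#include <assert.h>
#include <stddef.h>

enum { S0, S1, S2, S3, SDEAD, NSTATES };     /* states */
enum { C_F, C_O, C_R, C_OTHER, NCLASSES };   /* table columns */

/* Map an input character to its column in the table. */
static int char_class(int c)
{
    switch (c) {
    case 'f': return C_F;
    case 'o': return C_O;
    case 'r': return C_R;
    default:  return C_OTHER;
    }
}

static const int delta_table[NSTATES][NCLASSES] = {
    /*            f      o      r      other */
    /* S0    */ { S1,    SDEAD, SDEAD, SDEAD },
    /* S1    */ { SDEAD, S2,    SDEAD, SDEAD },
    /* S2    */ { SDEAD, SDEAD, S3,    SDEAD },
    /* S3    */ { SDEAD, SDEAD, SDEAD, SDEAD },
    /* SDEAD */ { SDEAD, SDEAD, SDEAD, SDEAD },
};

/* Run the DFA over the whole string; accept iff it ends in S3. */
int accepts_for(const char *s)
{
    int state = S0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        state = delta_table[state][char_class((unsigned char)*s)];
    return state == S3;
}
```

This DFA accepts exactly the string "for"; inputs such as "fo", "fort" and the empty string end in a non-accepting state.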

Language Accepted by a DFA

A string x_0 x_1 … x_n is accepted by a DFA M = (Σ, S, s_0, δ, S_F) if s_{i+1} = δ(s_i, x_i) for i = 0, 1, … n and s_{n+1} ∈ S_F,
i.e. if x_0 x_1 … x_n determines a path through the state diagram for the DFA that ends in an accepting state.

Then the language accepted by the DFA M = (Σ, S, s_0, δ, S_F), denoted L(M), is the set of all strings accepted by M.


What is the Language Accepted by…

DFA1.c

/*
 * Deterministic Finite Automata Simulation
 *
 * One line of input is read and then processed character by character.
 * Thus '\n' (EOL) is treated as the end of input.
 * The major functions are:
 *   delta(s,c) - that implements the transition function, and
 *   accept(s)  - that tells whether state s is an accepting state or not.
 * The particular DFA recognizes strings of digits that end in 000.
 * The DFA has:
 *   S = {0, 1, 2, 3, DEAD_STATE}
 *   Transitions on 0: S0=>S1, S1=>S2, S2=>S3, S3=>S3
 *   Transitions on non-zero digits: S0=>S0, S1=>S0, S2=>S0, S3=>S0
 *   Transitions on non-digits: Si=> DEAD_STATE
 */

#include <stdio.h>

#define DEAD_STATE -1
#define ACCEPT 1
#define DO_NOT 0
#define EOL '\n'

/* Prototypes, so that main() can call the functions defined after it. */
int delta(int s, int c);
int accept(int state);

int main(void)
{
    int c;
    int state;

    state = 0;
    /* Stop at end of line, at end of file, or once the DFA is dead. */
    while ((c = getchar()) != EOL && c != EOF && state != DEAD_STATE) {
        state = delta(state, c);
    }
    if (accept(state)) {
        printf("Accept!\n");
    } else {
        printf("Do not accept!\n");
    }
    return 0;
}

/* DFA Transition function delta */
/* delta(s,c) = transition from state s on input c */
int delta(int s, int c)
{
    switch (s) {
    case 0:
        if (c == '0') return 1;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 1:
        if (c == '0') return 2;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 2:
        if (c == '0') return 3;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 3:
        if (c == '0') return 3;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case DEAD_STATE:
        return DEAD_STATE;
    default:
        printf("Bad State\n");
        return DEAD_STATE;
    }
}

/* accept(state) = ACCEPT if state is the accepting state 3, else DO_NOT */
int accept(int state)
{
    if (state == 3) return ACCEPT;
    else return DO_NOT;
}

Non-Deterministic Finite Automata

What does deterministic mean?

In a Non-Deterministic Finite Automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S.

An NFA can:
  Have multiple transitions from a state for the same input
  Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character
  Not have transitions defined for every state and every input

Note for NFAs δ: S × Σ → 2^S, where 2^S is the power set of S (with ε transitions the domain becomes S × (Σ ∪ {ε})).

Language Accepted by an NFA

A string x_0 x_1 … x_n is accepted by an NFA M = (Σ, S, s_0, δ, S_F) if s_{i+1} ∈ δ(s_i, x_i) for i = 0, 1, … n and s_{n+1} ∈ S_F,
i.e. if x_0 x_1 … x_n can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary.

Then the language accepted by the NFA M = (Σ, S, s_0, δ, S_F), denoted L(M), is the set of all strings accepted by M.
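Because an NFA may be in several states at once, a simulation tracks the set of states reachable so far instead of a single state. The following sketch does this with bit masks. The NFA itself is a made-up example for (0|1)*11 with one artificial ε edge added to exercise the closure step, and all names (set_t, closure, step, nfa_accepts) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NSTATES 4
typedef unsigned set_t;   /* bit i set <=> state i is in the set */

/* Hypothetical NFA for (0|1)*11, with an artificial ε edge 2 -> 3:
 *   state 0 loops on 0 and 1, 0 -1-> 1, 1 -1-> 2, 2 -ε-> 3 (accepting). */
static const set_t on0[NSTATES] = { 1u << 0, 0, 0, 0 };
static const set_t on1[NSTATES] = { (1u << 0) | (1u << 1), 1u << 2, 0, 0 };
static const set_t eps[NSTATES] = { 0, 0, 1u << 3, 0 };
static const set_t accepting = 1u << 3;

/* ε-closure: keep adding ε-successors until nothing changes. */
static set_t closure(set_t s)
{
    set_t prev;
    do {
        prev = s;
        for (int i = 0; i < NSTATES; i++)
            if (s & (1u << i)) s |= eps[i];
    } while (s != prev);
    return s;
}

/* One input step: union of delta(q, c) over all q in s, then ε-closure. */
static set_t step(set_t s, char c)
{
    set_t next = 0;
    for (int i = 0; i < NSTATES; i++)
        if (s & (1u << i)) {
            if (c == '0') next |= on0[i];
            else if (c == '1') next |= on1[i];
        }
    return closure(next);
}

/* Accept iff some path ends in an accepting state. */
int nfa_accepts(const char *x)
{
    set_t cur = closure(1u << 0);   /* start state is state 0 */
    assert(x != NULL);
    for (; *x != '\0'; x++)
        cur = step(cur, *x);
    return (cur & accepting) != 0;
}
```

The string "0011" is accepted because one of the tracked paths reaches state 3, even though other paths for the same input stay in state 0.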


Thompson Construction

For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
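The slide states only the goal, so the sketch below follows the usual textbook form of the construction: each regular-expression operator builds a small NFA fragment with one start and one accepting state, glued together with ε edges. Everything named here (frag, f_sym, f_cat, f_alt, f_star, matches, the 64-state limit and the global arrays) is an assumption chosen to keep the example short.

```c
#include <assert.h>

#define MAXS 64
typedef unsigned long long set_t;        /* bit i <=> state i */
typedef struct { int start, accept; } frag;

/* Each state has at most one symbol edge and at most two ε edges. */
static int sym[MAXS], nxt[MAXS], e1[MAXS], e2[MAXS], nstates;

static int new_state(void)
{
    assert(nstates < MAXS);
    sym[nstates] = 0;
    nxt[nstates] = e1[nstates] = e2[nstates] = -1;
    return nstates++;
}

static void add_eps(int from, int to)
{
    if (e1[from] < 0) e1[from] = to;
    else { assert(e2[from] < 0); e2[from] = to; }
}

frag f_sym(char c)                       /* NFA for a single symbol c */
{
    frag f;
    f.start = new_state();
    f.accept = new_state();
    sym[f.start] = c;
    nxt[f.start] = f.accept;
    return f;
}

frag f_cat(frag a, frag b)               /* NFA for ab */
{
    add_eps(a.accept, b.start);
    return (frag){ a.start, b.accept };
}

frag f_alt(frag a, frag b)               /* NFA for a | b */
{
    frag f;
    f.start = new_state();
    f.accept = new_state();
    add_eps(f.start, a.start);   add_eps(f.start, b.start);
    add_eps(a.accept, f.accept); add_eps(b.accept, f.accept);
    return f;
}

frag f_star(frag a)                      /* NFA for a* */
{
    frag f;
    f.start = new_state();
    f.accept = new_state();
    add_eps(f.start, a.start);  add_eps(f.start, f.accept);
    add_eps(a.accept, a.start); add_eps(a.accept, f.accept);
    return f;
}

static set_t closure(set_t s)            /* ε-closure of a state set */
{
    set_t prev;
    do {
        prev = s;
        for (int i = 0; i < nstates; i++)
            if (s >> i & 1) {
                if (e1[i] >= 0) s |= 1ULL << e1[i];
                if (e2[i] >= 0) s |= 1ULL << e2[i];
            }
    } while (s != prev);
    return s;
}

int matches(frag f, const char *x)       /* simulate the built NFA on x */
{
    set_t cur = closure(1ULL << f.start);
    for (; *x != '\0'; x++) {
        set_t next = 0;
        for (int i = 0; i < nstates; i++)
            if ((cur >> i & 1) && nxt[i] >= 0 && sym[i] == *x)
                next |= 1ULL << nxt[i];
        cur = closure(next);
    }
    return (cur >> f.accept & 1) != 0;
}
```

For example, (0|1)*11 is built as f_cat(f_cat(f_star(f_alt(f_sym('0'), f_sym('1'))), f_sym('1')), f_sym('1')), which allocates 12 states. Each operator adds at most two new states, so the NFA grows linearly with the size of the regular expression.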
