Lecture 2 Lexical Analysis


Topics
- Sample simple compiler
- Operations on strings
- Regular expressions
- Finite automata

Readings:

January 11, 2006

CSCE 531 Compiler Construction

– 2 – CSCE 531 Spring 2006

Overview

Last Time
- A little history
- Compilers vs. interpreters
- Data-flow view of compilers
- Regular languages
- Course pragmatics

Today's Lecture
- Why study compilers?
- xx

References
- Chapter 2, Chapter 3

Assignment Due Wednesday Jan 18
- 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

A Simple Compiler for Expressions

Chapter Two Overview
- Structure of the simple compiler (really just a translator) for infix expressions → postfix
- Grammars
- Parse trees
- Syntax directed translation
- Predictive parsing
- Translator for simple expressions
  - Grammar
  - Rewritten grammar (an equivalent one better suited to predictive parsing)
  - Parsing modules, fig 2.24
  - Specification of translator, fig 2.35
  - Structure of translator, fig 2.36

Grammars

A grammar (or, more correctly, a context free grammar) has
- A set of tokens, also known as terminals
- A set of nonterminals
- A set of productions of the form
      nonterminal → sequence of tokens and/or nonterminals
- A special nonterminal, the start symbol

Example

E → E + E
E → E * E
E → digit

Derivations

A derivation is a sequence of rewritings of a string of grammar symbols using the productions in a grammar.

We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production.

X ⇒ Y if there is a production N → β where
- The nonterminal N occurs in the sequence X of grammar symbols
- And Y is the same as X except that β replaces one occurrence of N

Example

E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

Parse Trees

A graphical presentation of a derivation, satisfying
- Root is the start symbol
- Each leaf is a token or ε (note different font from text)
- Each interior node is a nonterminal
- If A is a parent with children X1, X2, …, Xn then A → X1 X2 … Xn is a production

Syntax directed Translation

Frequently the rewriting by a production will be called a reduction, or reducing by the particular production.

Syntax directed translation attaches actions (code) that are done when the reductions are performed.

Example

E → E + T   {print('+');}
E → E - T   {print('-');}
E → T
T → 0       {print('0');}
T → 1       {print('1');}
…
T → 9       {print('9');}


Equivalent Grammars

Specification of the translator (figure 2.38)

S  → L eof
L  → E ; L
L  → ε
E  → T E'
E' → + T { print('+'); } E'
E' → - T { print('-'); } E'
E' → ε
T  → F T'
T' → * F { print('*'); } T'
T' → / F { print('/'); } T'
T' → ε
F  → ( E )
F  → id  { print(id.lexeme); }
F  → num { print(num.value); }

Translating to code

E  → T E'
E' → + T { print('+'); } E'
E' → - T { print('-'); } E'
E' → ε

    void Expr(void)
    {
        int t;

        term();                     /* E → T E': parse the first term */
        while (1)                   /* the loop plays the role of E' */
            switch (lookahead) {
            case '+': case '-':
                t = lookahead;      /* remember the operator */
                match(lookahead);
                term();
                emit(t, NONE);      /* operator is printed after both operands */
                continue;
            …
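The fragment above can be filled out into something that compiles. What follows is only a sketch under stated assumptions: operands are single digits, the input is a C string rather than a token stream, and the names infix_to_postfix, advance, and syntax_error, as well as the one-argument emit, are illustrative and do not come from the course code.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical scaffolding: none of these globals come from the slides. */
static const char *input;   /* next unread character */
static int lookahead;       /* current token: one character, 0 at end of input */
static char output[128];    /* postfix translation */
static size_t out_len;

static void syntax_error(void)
{
    fprintf(stderr, "syntax error\n");
    exit(EXIT_FAILURE);
}

/* Load the next character into lookahead. */
static void advance(void)
{
    lookahead = (unsigned char)*input;
    if (lookahead != '\0')
        input++;
}

/* Check that the current token is t, then move on. */
static void match(int t)
{
    if (lookahead != t)
        syntax_error();
    advance();
}

/* Append one character to the output buffer (silently drops overflow). */
static void emit(int c)
{
    if (out_len + 1 < sizeof output) {
        output[out_len++] = (char)c;
        output[out_len] = '\0';
    }
}

/* T → digit { print(digit); } */
static void term(void)
{
    if (!isdigit(lookahead))
        syntax_error();
    emit(lookahead);
    match(lookahead);
}

/* E → T E', with E' → + T {print('+')} E' | - T {print('-')} E' | ε
   written as a loop instead of a recursive call. */
static void expr(void)
{
    term();
    while (lookahead == '+' || lookahead == '-') {
        int t = lookahead;
        match(t);
        term();
        emit(t);
    }
}

/* Translate a whole string; the result lives in a static buffer. */
const char *infix_to_postfix(const char *s)
{
    input = s;
    out_len = 0;
    output[0] = '\0';
    advance();
    expr();
    if (lookahead != '\0')
        syntax_error();
    return output;
}
```

With these definitions, infix_to_postfix("9-5+2") yields 95-2+. The operator is emitted only after the second operand has been parsed, which is exactly where the print action sits in the production.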

Overview of the Code (Figure 2.36)

/class/csce531-001

Operations on Strings

A language over an alphabet is a set of strings of characters from the alphabet.

Operations on strings: let x = x1x2…xn and t = t1t2…tm, then
- Concatenation: xt = x1x2…xn t1t2…tm
- Alternation: x|t = either x1x2…xn or t1t2…tm

Operations on Sets of Strings

For these let S = {s1, s2, …, sm} and R = {r1, r2, …, rn}
- Alternation: S | R = S ∪ R = {s1, s2, …, sm, r1, r2, …, rn}
- Concatenation: SR = {sr | s ∈ S and r ∈ R}
                    = {s1r1, s1r2, …, s1rn, s2r1, …, s2rn, …, smr1, …, smrn}
- Power: S^2 = S S, S^3 = S^2 S, …, S^n = S^(n-1) S
  What is S^0?
- Kleene closure: S* = ∪_{i=0}^{∞} S^i; note S^0 = {ε}, so ε is in S*

Operations cont. Kleene Closure

- Powers: S^2 = S S, S^3 = S^2 S, …, S^n = S^(n-1) S
  What is S^0?
- Kleene closure: S* = ∪_{i=0}^{∞} S^i; note S^0 = {ε}, so ε is in S*

Examples of Operations on Sets of Strings

For these let S = {a,b,c} and R = {t,u}
- Alternation: S | R = S ∪ R = {a,b,c,t,u}
- Concatenation: SR = {sr | s ∈ S and r ∈ R} = {at, au, bt, bu, ct, cu}
- Power: S^2 = {aa, ab, ac, ba, bb, bc, ca, cb, cc}
         S^3 = {aaa, aab, aac, …, ccc}, 27 elements
- Kleene closure: S* = {any string of any length of a's, b's and c's}, including the empty string
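The concatenation and power examples can be checked mechanically. The sketch below is illustrative only: the StringSet type, its fixed capacities, and the function names are assumptions made for this example, not part of the lecture.

```c
#include <stdio.h>
#include <string.h>

#define MAX_STRINGS 64   /* capacity chosen so that S^3 (27 strings) fits */
#define MAX_LEN     16   /* longest string, including the terminating NUL */

/* A small fixed-capacity set of strings (hypothetical helper type). */
typedef struct {
    char items[MAX_STRINGS][MAX_LEN];
    int count;
} StringSet;

/* Add s unless it is already present, too long, or the set is full. */
void set_add(StringSet *set, const char *s)
{
    if (strlen(s) >= MAX_LEN)
        return;
    for (int i = 0; i < set->count; i++)
        if (strcmp(set->items[i], s) == 0)
            return;
    if (set->count < MAX_STRINGS)
        strcpy(set->items[set->count++], s);
}

/* Concatenation: out = { xy | x in a, y in b }. out must not alias a or b. */
void set_concat(const StringSet *a, const StringSet *b, StringSet *out)
{
    char buf[2 * MAX_LEN];

    out->count = 0;
    for (int i = 0; i < a->count; i++)
        for (int j = 0; j < b->count; j++) {
            snprintf(buf, sizeof buf, "%s%s", a->items[i], b->items[j]);
            set_add(out, buf);
        }
}

/* Power: out = S^n, starting from S^0 = { "" } (the empty string only). */
void set_power(const StringSet *s, int n, StringSet *out)
{
    StringSet tmp;

    out->count = 0;
    set_add(out, "");
    for (int k = 0; k < n; k++) {
        set_concat(out, s, &tmp);
        *out = tmp;
    }
}
```

Filling S with a, b, c and R with t, u, then calling set_concat(&S, &R, &out), produces the six strings listed above, and set_power(&S, 3, &out) leaves out.count at 27. The Kleene closure is infinite, so it cannot be enumerated this way; only a finite prefix S^0 ∪ S^1 ∪ … ∪ S^k can.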


Regular Expressions

For a given alphabet Σ the following are regular expressions:
- If a ∈ Σ then a is a regular expression and L(a) = {a}
- ε is a regular expression and L(ε) = {ε}
- Φ is a regular expression and L(Φ) = Φ (the empty set)

And if s and t are regular expressions denoting languages L(s) and L(t) respectively, then
- st is a regular expression and L(st) = L(s) L(t)
- s | t is a regular expression and L(s | t) = L(s) ∪ L(t)
- s* is a regular expression and L(s*) = L(s)*
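As a worked illustration of how these rules compose (the expressions (a|b)c and a* are chosen here for the example, they are not from the slides), the language of a compound expression is computed from the inside out:

```latex
\begin{align*}
L\big((a \mid b)\,c\big) &= L(a \mid b)\,L(c) \\
                         &= \big(L(a) \cup L(b)\big)\,L(c) \\
                         &= \{a, b\}\,\{c\} \\
                         &= \{ac,\ bc\} \\[4pt]
L(a^{*}) &= L(a)^{*} = \{a\}^{*} = \{\varepsilon,\ a,\ aa,\ aaa,\ \dots\}
\end{align*}
```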

Why Regular Expressions?

We use regular expressions to describe the tokens.

Example: reg expr for C identifiers

C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore

ID reg expr = (letter | underscore) (letter | underscore | digit)*

Or more explicitly

ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
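The identifier expression translates almost symbol for symbol into a hand-written check: one test for the first character, then a loop for the starred part. This is a minimal sketch; the function name is_identifier is illustrative, and it assumes the default "C" locale so that isalpha and isalnum accept only ASCII letters and digits.

```c
#include <ctype.h>

/* Returns 1 if s matches (letter | underscore)(letter | underscore | digit)*
   and 0 otherwise. The empty string does not match. */
int is_identifier(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    /* First symbol: letter | underscore */
    if (!(isalpha(*p) || *p == '_'))
        return 0;

    /* Remaining symbols: (letter | underscore | digit)* */
    for (p++; *p != '\0'; p++)
        if (!(isalnum(*p) || *p == '_'))
            return 0;

    return 1;
}
```

For instance, is_identifier("_x1") returns 1, while is_identifier("9lives") and is_identifier("a-b") return 0.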

Pop Quiz

Given r and s are regular expressions then
- What is rε? r | ε?
- Describe the language denoted by 0*110*
- Describe the language denoted by (0|1)*110*
- Give a regular expression for the language of strings of 0's and 1's that end in a 1
- Give a regular expression for the language of strings of 0's and 1's such that every 0 is followed by a 1

Recognizers of Regular Languages

To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs.

The construction of a lexical analyzer will then proceed as:
1. Identify all tokens
2. Develop regular expressions for each
3. Convert the regular expressions to finite automata
4. Use the transition table for the finite automata as the basis for the scanner

We will actually use the tools lex and/or flex for steps 3 and 4.

Transition Diagram for a DFA

Start in state s0; then if the input is "f" make a transition to state s1.

Then from state s1, if the input is "o" make a transition to state s2.

And from state s2, if the input is "r" make a transition to state s3.

The double circle denotes an "accepting state," which means we recognized the token.

Actually there is a missing state and transition.

    s0 --f--> s1 --o--> s2 --r--> ((s3))

Now what about "fort"

The string "fort" is an identifier, not the keyword "for" followed by "t."

Thus we can't really recognize the token until we see a terminator: whitespace or a special symbol (one of ,;(){}[] )
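One common way to handle this is to scan the longest possible word first and only afterwards decide whether the lexeme is a keyword. The sketch below illustrates that idea; the token codes and the function name scan_word are hypothetical, and real scanners generated by lex or flex arrive at the same result through their own longest-match rule.

```c
#include <ctype.h>
#include <string.h>

enum token { TOK_FOR, TOK_ID, TOK_NONE };   /* hypothetical token codes */

/* Scan the longest run of letters, digits, and underscores at the start of s.
   The run must begin with a letter or underscore. Its length is stored in
   *len, and only then is the lexeme compared against the keyword "for". */
enum token scan_word(const char *s, size_t *len)
{
    size_t n = 0;

    if (!(isalpha((unsigned char)s[0]) || s[0] == '_')) {
        *len = 0;
        return TOK_NONE;
    }
    /* Keep going until a terminator (whitespace, punctuation, or NUL). */
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;

    *len = n;
    return (n == 3 && strncmp(s, "for", 3) == 0) ? TOK_FOR : TOK_ID;
}
```

On the input "for(i" this returns TOK_FOR with a length of 3, because "(" terminates the word; on "fort " it returns TOK_ID with a length of 4.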

Deterministic Finite Automata

A deterministic finite automaton (DFA) is a mathematical model that consists of

1. a set of states S

2. a set of input symbols Σ, the input alphabet

3. a transition function δ: S x Σ → S that maps each state and each input symbol to the next state

4. a state s0 that is distinguished as the start state

5. a set of states F distinguished as accepting (or final) states

DFA to recognize keyword "for"

Σ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ',', ';', …}

S = {s0, s1, s2, s3, s_dead}

s0 is the start state

S_F = {s3}

δ given by the table below

    State     f         o         r         Others
    s0        s1        s_dead    s_dead    s_dead
    s1        s_dead    s2        s_dead    s_dead
    s2        s_dead    s_dead    s3        s_dead
    s3        s_dead    s_dead    s_dead    s_dead
    s_dead    s_dead    s_dead    s_dead    s_dead
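A transition table like this one can drive a recognizer directly: map each character to a column, look up the next state, and test the final state. The following is a minimal sketch under the assumption that every transition not shown in the diagram goes to the dead state; the enum names, char_class, and accepts_for are illustrative.

```c
/* States and character classes (column indices); names are illustrative. */
enum { S0, S1, S2, S3, S_DEAD, NUM_STATES };
enum { C_F, C_O, C_R, C_OTHER, NUM_CLASSES };

/* Map an input character to its column in the transition table. */
static int char_class(int c)
{
    switch (c) {
    case 'f': return C_F;
    case 'o': return C_O;
    case 'r': return C_R;
    default:  return C_OTHER;
    }
}

/* delta[state][class] = next state */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /*            f       o       r       other  */
    /* S0     */ { S1,     S_DEAD, S_DEAD, S_DEAD },
    /* S1     */ { S_DEAD, S2,     S_DEAD, S_DEAD },
    /* S2     */ { S_DEAD, S_DEAD, S3,     S_DEAD },
    /* S3     */ { S_DEAD, S_DEAD, S_DEAD, S_DEAD },
    /* S_DEAD */ { S_DEAD, S_DEAD, S_DEAD, S_DEAD },
};

/* Returns 1 if the whole string s is exactly the keyword "for". */
int accepts_for(const char *s)
{
    int state = S0;

    for (; *s != '\0'; s++)
        state = delta[state][char_class((unsigned char)*s)];
    return state == S3;
}
```

Changing the recognized language then means changing only the table, not the driver loop, which is the property that scanner generators rely on.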

Language Accepted by a DFA

A string x0x1…xn is accepted by a DFA M = (Σ, S, s0, δ, S_F) if s_{i+1} = δ(s_i, x_i) for i = 0, 1, …, n and s_{n+1} ∈ S_F,

i.e. if x0x1…xn determines a path through the state diagram for the DFA that ends in an accepting state.

Then the language accepted by the DFA M = (Σ, S, s0, δ, S_F), denoted L(M), is the set of all strings accepted by M.
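The same definition is often written with an extended transition function that consumes a whole string at once. The symbol \hat{\delta} does not appear in the slides; it is introduced here only as a compact restatement of the path condition above.

```latex
\begin{align*}
\hat{\delta}(s, \varepsilon) &= s \\
\hat{\delta}(s, wa) &= \delta\big(\hat{\delta}(s, w),\ a\big)
    \qquad \text{for } w \in \Sigma^{*},\ a \in \Sigma \\[4pt]
L(M) &= \{\, w \in \Sigma^{*} \mid \hat{\delta}(s_0, w) \in S_F \,\}
\end{align*}
```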


What is the Language Accepted by…

DFA1.c

    /*
     * Deterministic Finite Automaton Simulation
     *
     * One line of input is read and then processed character by character.
     * Thus '\n' (EOL) is treated as the end of input.
     * The major functions are:
     *   delta(s,c) - implements the transition function, and
     *   accept(s)  - tells whether state s is an accepting state or not.
     * This particular DFA recognizes strings of digits that end in 000.
     * The DFA has:
     *   S = {0, 1, 2, 3, DEAD_STATE}
     *   Transitions on 0: S0=>S1, S1=>S2, S2=>S3, S3=>S3
     *   Transitions on non-zero digits: S0=>S0, S1=>S0, S2=>S0, S3=>S0
     *   Transitions on non-digits: Si=>DEAD_STATE
     */
    #include <stdio.h>

    #define DEAD_STATE -1
    #define ACCEPT 1
    #define DO_NOT 0
    #define EOL '\n'

    /* Prototypes, so that main can call the functions defined below. */
    int delta(int s, int c);
    int accept(int state);

    int main(void)
    {
        int c;
        int state = 0;

        /* Stop at end of line, at end of file, or once the DFA is dead. */
        while ((c = getchar()) != EOL && c != EOF && state != DEAD_STATE) {
            state = delta(state, c);
        }

        if (accept(state)) {
            printf("Accept!\n");
        } else {
            printf("Do not accept!\n");
        }
        return 0;
    }

    /* DFA transition function delta */
    /* delta(s,c) = transition from state s on input c */
    int delta(int s, int c)
    {
        switch (s) {
        case 0:
            if (c == '0') return 1;
            else if ((c > '0') && (c <= '9')) return 0;
            else return DEAD_STATE;
        case 1:
            if (c == '0') return 2;
            else if ((c > '0') && (c <= '9')) return 0;
            else return DEAD_STATE;
        case 2:
            if (c == '0') return 3;
            else if ((c > '0') && (c <= '9')) return 0;
            else return DEAD_STATE;
        case 3:
            if (c == '0') return 3;
            else if ((c > '0') && (c <= '9')) return 0;
            else return DEAD_STATE;
        case DEAD_STATE:
            return DEAD_STATE;
        default:
            printf("Bad State\n");
            return DEAD_STATE;
        }
    }

    /* accept(state) = ACCEPT if state is the accepting state 3, else DO_NOT */
    int accept(int state)
    {
        if (state == 3) return ACCEPT;
        else return DO_NOT;
    }

Non-Deterministic Finite Automata

What does deterministic mean?

In a non-deterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S x Σ → S.

An NFA can:
- Have multiple transitions from a state for the same input
- Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character
- Not have transitions defined for every state and every input

Note for NFAs δ: S x (Σ ∪ {ε}) → 2^S, where 2^S is the power set of S.

Language Accepted by an NFA

A string x0x1…xn is accepted by an NFA M = (Σ, S, s0, δ, S_F) if there are states s1, …, s_{n+1} with s_{i+1} ∈ δ(s_i, x_i) for i = 0, 1, …, n and s_{n+1} ∈ S_F,

i.e. if x0x1…xn can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary.

Then the language accepted by the NFA M = (Σ, S, s0, δ, S_F), denoted L(M), is the set of all strings accepted by M.
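Acceptance by an NFA can be tested by tracking the set of all states the machine could be in, closing that set under ε transitions after every step. The sketch below does this with a bitmask. The example NFA for a*b* is an assumption made for illustration (state 0 loops on a, has an ε transition to state 1, and state 1 loops on b and accepts); it is not taken from the slides, and the names are hypothetical.

```c
enum { NUM_STATES = 2 };
typedef unsigned int StateSet;   /* bit i set <=> state i is in the set */

/* Hypothetical example NFA for a*b*:
     0 --a--> 0,   0 --ε--> 1,   1 --b--> 1,   accepting state: 1 */
static const StateSet on_a[NUM_STATES]   = { 1u << 0, 0 };
static const StateSet on_b[NUM_STATES]   = { 0, 1u << 1 };
static const StateSet on_eps[NUM_STATES] = { 1u << 1, 0 };
static const StateSet accepting = 1u << 1;

/* Add every state reachable through ε transitions alone. */
static StateSet eps_closure(StateSet set)
{
    StateSet prev;

    do {
        prev = set;
        for (int s = 0; s < NUM_STATES; s++)
            if (set & (1u << s))
                set |= on_eps[s];
    } while (set != prev);       /* repeat until nothing new is added */
    return set;
}

/* All states reachable from the set by consuming the character c. */
static StateSet step(StateSet set, int c)
{
    StateSet next = 0;

    for (int s = 0; s < NUM_STATES; s++)
        if (set & (1u << s)) {
            if (c == 'a')
                next |= on_a[s];
            else if (c == 'b')
                next |= on_b[s];
            /* any other character has no transition */
        }
    return next;
}

/* Returns 1 if some path for the input ends in an accepting state. */
int nfa_accepts(const char *input)
{
    StateSet current = eps_closure(1u << 0);   /* start in state 0 */

    for (; *input != '\0'; input++)
        current = eps_closure(step(current, (unsigned char)*input));
    return (current & accepting) != 0;
}
```

Here nfa_accepts("aab") and nfa_accepts("") return 1, while nfa_accepts("ba") returns 0 because the set of possible states becomes empty after the final a.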


Thompson Construction

For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
