
Yu-Chen Kuo

Chapter 3

Lexical Analysis


3.1 The Role of The Lexical Analyzer

• Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis

• It also performs certain secondary tasks such as stripping out comments and white space and correlating error messages with the source program


3.1 The Role of The Lexical Analyzer


Tokens, Patterns, Lexemes

• In general, a token stands for a set of strings in the input for which the same token is produced as output.

• This set of strings is described by a rule called a pattern associated with the token.

• A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
– const pi=3.14156; pi is a lexeme for token id


Examples of Tokens

• In most programming languages, the following constructs are treated as tokens: keywords, identifiers, constants, literal strings, operators, and punctuation symbols.

(Table: example tokens with sample lexemes and the regular-expression patterns that describe them.)


Attributes for Tokens

• When more than one lexeme matches a pattern, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.

• The lexical analyzer collects information about tokens into their associated attributes.


Attributes for Tokens (Cont.)

• The token influences parsing decisions; the attributes influence the translation of tokens.

• A token usually has only a single attribute: a pointer (index) to the symbol-table entry in which the information about the token is kept.


Lexical Errors

• Few errors are detected at the lexical level alone, because a lexical analyzer has a very localized view of the source program.

• For example, suppose the string fi is encountered in a C program for the first time in the context
– fi ( a == f(x)) …
– The lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
– Since fi is a valid identifier, the lexical analyzer must return the token for an identifier and let a later phase handle any error.


Lexical Errors (Cont.)

• A lexical analyzer finds an error when it is unable to proceed because none of the patterns matches a prefix of the remaining input.

• The simplest recovery strategy is “panic mode”, to delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
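As a rough illustration only (the helper matches_some_pattern and the function name are hypothetical, not from these slides), panic-mode deletion might look like this in C:

    /* Hypothetical helper: does some token pattern match a prefix starting at p? */
    extern int matches_some_pattern(const char *p);

    /* Panic mode: delete successive characters from the remaining input
     * until a well-formed token can start at the scan position again. */
    const char *panic_mode_recover(const char *forward)
    {
        while (*forward != '\0' && !matches_some_pattern(forward))
            forward++;                 /* discard one offending character */
        return forward;                /* scanning resumes here */
    }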


Lexical Errors (Cont.)

• Other possible error-recovery actions are:
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct character
– Transposing two adjacent characters


Lexical Errors (Cont.)

• Error transformation attempts to repair the input.

• The simplest strategy is to see if a prefix of the remaining input can be transformed into a valid lexeme by a single error transformation.

• This strategy assumes most lexical errors are the result of a single transformation.


Input Buffering

• There are times when a lexical analyzer needs to look ahead several characters beyond the lexeme for a token before a match can be announced.

• Buffering techniques can be used to reduce the overhead required to process input characters.

• The buffer is divided into two N-character halves.


Input Buffering (Cont.)

• N input characters are read into each half of the buffer with one read command.

• If fewer than N characters remain in the input then a special character eof is read into the buffer.

• Two pointers are maintained:
– Initially, both pointers point to the first character of the next lexeme.
– The forward pointer scans ahead until a match for a pattern is found.
– After the lexeme is processed, both pointers are set to the character immediately past the lexeme.


Input Buffering (Cont.)

• If the forward pointer is about to move past the halfway mark, the right half is filled with N new characters.

• If the forward pointer is about to move past the right end of the buffer, the left half is filled with N new characters.

• Lookahead is limited by the length of the buffer.
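As a minimal sketch (the buffer size, pointer names, and the fillbuf helper are assumptions, not code from these slides), the forward-pointer advance without sentinels needs two end-of-buffer tests on every step:

    #define N 4096                       /* number of characters per buffer half */

    static char buf[2 * N];              /* the two N-character halves */
    static char *lexeme_beginning;       /* marks the start of the current lexeme */
    static char *forward;                /* scans ahead until a pattern matches */

    extern void fillbuf(char *half);     /* assumed: read up to N chars, add eof if short */

    char advance(void)                   /* advance forward and return the new character */
    {
        forward++;
        if (forward == buf + N) {        /* about to move past the halfway mark */
            fillbuf(buf + N);            /* fill the right half */
        } else if (forward == buf + 2 * N) {
            fillbuf(buf);                /* fill the left half ...              */
            forward = buf;               /* ... and wrap to its first character */
        }
        return *forward;
    }

Note the two position tests per advance; the sentinel technique on the following slides removes one of them in the common case.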


Input Buffering (Cont.)


Sentinels to Improve Input Buffering

• Except at the ends of buffer halves, we need two tests for each advance of the forward pointer. We can reduce this to one test if we extend each buffer half to hold the special character eof at its end.


Sentinels to Improve Input Buffering (Cont.)


Sentinels to Improve Input Buffering (Cont.)

• Most of the time only one test is needed: checking whether the forward pointer points to an eof.

• The average number of tests per input character is very close to 1.
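A matching sketch with sentinels (the same assumed names; each half is now followed by one slot that always holds the eof sentinel), where the common path costs a single test:

    #define N 4096
    #define EOF_CH '\0'                      /* assumed sentinel character */

    static char buf[2 * (N + 1)];            /* each half ends with an eof sentinel */
    static char *forward;

    extern void fillbuf(char *half);         /* assumed: read N chars, then write the sentinel */

    char advance(void)
    {
        forward++;
        if (*forward == EOF_CH) {                    /* the only test on the common path */
            if (forward == buf + N) {                /* sentinel ending the left half */
                fillbuf(buf + N + 1);                /* reload the right half */
                forward++;                           /* step over the sentinel */
            } else if (forward == buf + 2 * N + 1) { /* sentinel ending the right half */
                fillbuf(buf);                        /* reload the left half */
                forward = buf;                       /* wrap to its first character */
            }
            /* otherwise the eof lies inside a half: the real end of the input */
        }
        return *forward;
    }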


Specification of Tokens

• Regular expressions are an important notation for specifying patterns.


Strings and Languages

• An alphabet denotes any finite set of symbols.
– {0,1}: the binary alphabet
– ASCII code: a computer alphabet

• A string over some alphabet is a finite sequence of symbols drawn from the alphabet.

• A language denotes a set of strings over some fixed alphabet.

• The string exponentiation operation is defined as
s^0 = ε (the empty string)
s^i = s^(i-1) s, for i > 0 (string concatenation)
For example, with s = ab we get s^3 = ababab.


Operations on Languages

• The language exponentiation operation is defined as L^0 = {ε} and L^i = L^(i-1) L


Operations on Languages (Cont.)

• Let L={A,…,Z, a,…,z} and D = {0,…,9}

1. L ∪ D is the set of letters and digits.

2. LD is the set of strings consisting of a letter followed by a digit.

3. L^4 is the set of four-letter strings.

4. L* is the set of all strings of letters, including ε.

5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.

6. D+ is the set of all strings of one or more digits.


Regular Expressions

• A regular expression r is a formalism for defining a language L(r).

• A language that can be defined by a regular expression is called a regular set.

• A language that can be defined by a context-free grammar is called a context-free language.

• the set of regular sets ⊂ the set of context-free languages


Rules for Regular Expressions

• The rules that define the regular expressions over an alphabet Σ are as follows.

1. ε is a regular expression, denoting {ε}

2. If a is a symbol in Σ, then a is a regular expression denoting {a}


Rules for Regular Expressions (Cont.)

3. Suppose r and s are regular expressions for the languages L(r) and L(s), then,

a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*

• Unnecessary parentheses can be avoided in regular expressions if we adopt the following conventions:

1. The unary operator * has the highest precedence and is left associative.

2. Concatenation has the second highest precedence and is left associative.

3. | has the lowest precedence and is left associative


Rules for Regular Expressions (Example)

• Let Σ = {a, b}
1. a | b denotes {a, b}
2. (a | b)(a | b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two.
3. a* denotes {ε, a, aa, aaa, …}, the set of all strings of zero or more a's.
4. (a | b)* denotes the set of all strings containing zero or more instances of a or b.
5. a | a*b denotes the set containing the string a or the strings consisting of zero or more a's followed by a b.


Algebraic Properties of Regular Expressions
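The table from this slide is not preserved in this capture; as a sketch of the laws usually listed here, for regular expressions r, s and t:

    \[
    \begin{array}{ll}
    r \mid s = s \mid r                                    & \text{| is commutative} \\
    r \mid (s \mid t) = (r \mid s) \mid t                  & \text{| is associative} \\
    (rs)t = r(st)                                          & \text{concatenation is associative} \\
    r(s \mid t) = rs \mid rt,\; (s \mid t)r = sr \mid tr   & \text{concatenation distributes over |} \\
    \varepsilon r = r,\; r\varepsilon = r                  & \varepsilon\text{ is the identity for concatenation} \\
    r^{*} = (r \mid \varepsilon)^{*}                       & \text{relation between * and } \varepsilon \\
    r^{**} = r^{*}                                         & \text{* is idempotent}
    \end{array}
    \]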


Regular Definition

• Let Σ be an alphabet; a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, di-1}


Regular Definition (Example)

• The set of Pascal identifiers is the set of strings of letters and digits beginning with a letter.

• A regular definition for this set is as follows.

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
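A minimal C sketch (not from the slides) that checks a string against this definition, using the standard ctype classifications in place of letter and digit:

    #include <ctype.h>

    /* Returns 1 if s matches  id -> letter ( letter | digit )* , else 0. */
    int is_identifier(const char *s)
    {
        if (!isalpha((unsigned char)*s))         /* must start with a letter */
            return 0;
        for (s++; *s != '\0'; s++)
            if (!isalnum((unsigned char)*s))     /* then letters or digits only */
                return 0;
        return 1;
    }

For example, is_identifier("count2") yields 1, while is_identifier("2count") yields 0.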


Regular Definition (Example)

• Unsigned numbers in Pascal are strings such as 5280, 39.37, 6.33E4, or 1.894E-4.

• A regular definition for this set is as follows.

digit → 0 | 1 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent


Notational Shorthands

1. One or more instances: +
– a+ : the set of all strings of one or more a's
– r+ = r r*,  r* = r+ | ε

2. Zero or one instance: ?
– r? = r | ε

• With these shorthands, the regular definition for unsigned numbers becomes:

digit → 0 | 1 | … | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
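A minimal C sketch (not from the slides; the digits helper is an assumption) of a checker for this num definition:

    #include <ctype.h>

    /* Consume digit+ at *p; on success return 1 and advance *p, else return 0. */
    static int digits(const char **p)
    {
        if (!isdigit((unsigned char)**p))
            return 0;
        while (isdigit((unsigned char)**p))
            (*p)++;
        return 1;
    }

    /* Returns 1 if s matches  num -> digits ( . digits )? ( E ( + | - )? digits )? */
    int is_unsigned_num(const char *s)
    {
        if (!digits(&s))
            return 0;
        if (*s == '.') {                         /* optional_fraction */
            s++;
            if (!digits(&s))
                return 0;
        }
        if (*s == 'E') {                         /* optional_exponent */
            s++;
            if (*s == '+' || *s == '-')
                s++;
            if (!digits(&s))
                return 0;
        }
        return *s == '\0';                       /* the whole string must be consumed */
    }

With this sketch, is_unsigned_num("6.33E4") and is_unsigned_num("1.894E-4") yield 1, while is_unsigned_num("6.") yields 0.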


Notational Shorthands (Cont.)

3. Character classes:
– [abc] = a | b | c
– [a-z] = a | b | … | z
– id → [A-Za-z][A-Za-z0-9]*


Nonregular Sets

• Some languages cannot be described by any regular expression.

• Regular expressions cannot describe balanced or nested constructs.

• Regular expressions cannot describe the set of all strings of balanced parentheses, but that set can be specified by a context-free grammar.

• Repeated strings cannot be described by regular expressions or by context-free grammars.

– {wcw| w is a string of a’s and b’s}


Nonregular Sets (Cont.)

• Regular expressions can be used to denote only a fixed number of repetitions or an unspecified number of repetitions. Two arbitrary numbers cannot be compared to see whether they are the same.

– nHa1a2…an (Hollerith strings: the integer n must equal the number of characters following H)


3.4 Recognition of Tokens

• Consider the following grammar fragment:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε

expr → term relop term
     | term

term → id
     | num


Recognition of Tokens (Cont.)

• The regular definitions for the tokens are as follows:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
delim → blank | tab | newline
ws → delim+


Regular-expression Patterns for Tokens


Transition Diagrams

• The lexical analyzer uses transition diagrams to keep track of information about the characters that are seen as the forward pointer scans the input.

• Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. A double circle indicates an accepting state, a state in which a token has been found. A * next to an accepting state indicates that input retraction must take place.


Transition Diagrams for >=

• Start state: state 0 in the above example.
• If the input character is >, go to state 6.
• other refers to any character that is not indicated by any of the other edges leaving the state.


Transition Diagrams for Relational Operators

(Diagram: transition diagram for the relational operators, showing the token and attribute value returned at each accepting state.)
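A minimal C sketch of this diagram (the attribute names LT, LE, EQ, NE, GT, GE and the helpers nextchar( ) and retract( ) are assumptions in the spirit of the figure, not code from the slides):

    enum relop_attr { LT, LE, EQ, NE, GT, GE, NONE };

    extern char nextchar(void);     /* assumed: return the character at forward, then advance */
    extern void retract(void);      /* assumed: move the forward pointer back one character   */

    /* Walk the relational-operator diagram; return the attribute value
     * for the token relop, or NONE if no relational operator starts here. */
    enum relop_attr relop(void)
    {
        char c = nextchar();
        if (c == '<') {
            c = nextchar();
            if (c == '=') return LE;
            if (c == '>') return NE;
            retract();              /* starred accepting state: retract the input */
            return LT;
        }
        if (c == '=') return EQ;
        if (c == '>') {
            c = nextchar();
            if (c == '=') return GE;
            retract();
            return GT;
        }
        retract();                  /* this diagram does not match */
        return NONE;
    }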


Transition Diagrams for Identifiers and Keywords

• gettoken( ): returns the token (id, if, then, …) found by looking the lexeme up in the symbol table
• install_id( ): returns 0 if the lexeme is a keyword, or a pointer to the symbol-table entry if it is an identifier
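A minimal C sketch of the identifier/keyword diagram built on these two routines (nextchar( ), retract( ), and the failure value -1 are assumptions):

    #include <ctype.h>

    extern char nextchar(void);
    extern void retract(void);
    extern int  gettoken(void);      /* as above: id, if, then, ... from the symbol table      */
    extern int  install_id(void);    /* as above: 0 for a keyword, else a symbol-table pointer */

    int lexical_value;               /* attribute handed on to later compiler phases */

    /* Recognize  letter ( letter | digit )*  starting at the forward pointer. */
    int id_or_keyword(void)
    {
        char c = nextchar();
        if (!isalpha((unsigned char)c)) {
            retract();
            return -1;                           /* this diagram does not match */
        }
        while (isalnum((unsigned char)(c = nextchar())))
            ;                                    /* stay in the looping state */
        retract();                               /* starred accepting state */
        lexical_value = install_id();            /* 0 if keyword, else pointer to the entry */
        return gettoken();                       /* the token: id, if, then, ... */
    }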


Transition Diagrams for Unsigned Numbers

(Diagrams: transition diagrams for unsigned numbers, each accepting state invoking install_num( ); the order in which the diagrams are tried matters, e.g. for the input 12.3E4.)


Transition Diagrams for White Space


Following Transition Diagrams

• Transition diagrams are followed one by one in trying to determine the next token to be returned.

• If failure occurs while we are following one transition diagram, we retract the forward pointer to where it was in the start state of this diagram, and activate the next transition diagram.


Following Transition Diagrams (Cont.)

• If failure occurs in all transition diagrams, then a lexical error has been detected and we invoke an error-recovery routine.

• It is better to look for frequently occurring tokens before less frequently occurring ones, because a transition diagram is reached only after we fail on all earlier transition diagrams.

• Since white space is expected to occur frequently, we should put the transition diagram for white space near the beginning.


Implementing Transition Diagrams

• A sequence of transition diagrams can be converted into a program to look for tokens.

• Each state gets a segment of code.


Implementing Transition Diagrams (Cont.)

• state and start record the current state and the start state of the current transition diagram.

• lexical_value is assigned the pointer returned by install_id( ) and install_num( ) when an identifier or number is found.

• When a diagram fails, the function fail( ) is used to retract the forward pointer to the position of the lexeme-beginning pointer and to return the start state of the next diagram. If all diagrams fail, fail( ) calls an error-recovery routine.
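A minimal C sketch of fail( ) and of the case-per-state structure (the diagram ordering 0, 9, 12 and the token codes GT_TOK and GE_TOK are assumptions; the scheme follows the description above):

    extern char *lexeme_beginning, *forward;   /* the two buffer pointers */
    extern char nextchar(void);
    extern void error_recovery(void);          /* assumed to discard input and resynchronize */

    enum { GT_TOK, GE_TOK /* , ... */ };

    int state = 0, start = 0;    /* current state / start state of the current diagram */

    /* Retract forward to the lexeme beginning and switch to the start
     * state of the next diagram; recover if every diagram has failed. */
    int fail(void)
    {
        forward = lexeme_beginning;
        switch (start) {
        case 0:  start = 9;  break;            /* e.g. try identifiers/keywords next */
        case 9:  start = 12; break;            /* then unsigned numbers, and so on   */
        default: error_recovery();             /* all diagrams have failed           */
                 start = 0;  break;
        }
        return start;
    }

    /* Each state becomes one case; shown here only for the > and >= diagram. */
    int nexttoken(void)
    {
        char c;
        state = start = 0;
        for (;;) {
            switch (state) {
            case 0:                             /* start state of the relop diagram */
                c = nextchar();
                if (c == '>') state = 6;        /* as on the earlier slide          */
                else          state = fail();
                break;
            case 6:
                c = nextchar();
                if (c == '=') return GE_TOK;    /* accept >=              */
                forward--;                      /* starred state: retract */
                return GT_TOK;                  /* accept >               */
            default:
                state = fail();                 /* cases for the other states go here */
            }
        }
    }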


Implementing Transition Diagrams (Cont.)


Implementing Transition Diagrams (Cont.)

• The helper routine shown in the figure returns the character pointed to by the forward pointer and then advances the forward pointer.


Implementing Transition Diagrams (Cont.)

(Code: the states of the identifier diagram, returning the token id.)


Implementing Transition Diagrams (Cont.)