Lexical Analysis S. M. Farhad

Input Buffering

Input buffering speeds up the reading of the source program.

The lexical analyzer often has to look one or more characters beyond the next lexeme.

There are many situations where we need to look at least one additional character ahead.

Input Buffering

For instance, we cannot be sure we’ve seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.

In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.
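A minimal sketch of this one-character lookahead, written in Python for brevity. The function name `scan_operator` and the small operator table are assumptions made for illustration; they cover only the three operator pairs mentioned above.

```python
# Hypothetical table: the two-character operators named in the text.
TWO_CHAR_OPS = {"->", "==", "<="}

def scan_operator(text, pos):
    """Return (lexeme, next_pos) for the operator that starts at text[pos]."""
    # Look one character beyond the first one. Slicing past the end of the
    # string is safe in Python and simply yields a shorter candidate.
    candidate = text[pos:pos + 2]
    if candidate in TWO_CHAR_OPS:
        return candidate, pos + 2
    # The lookahead character is not part of this lexeme.
    return text[pos], pos + 1

print(scan_operator("a<=b", 1))
print(scan_operator("a<b", 1))
```

The decision between `<` and `<=` cannot be made until the character after `<` has been inspected, which is exactly why the input must be buffered.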

We first introduce a two-buffer scheme that handles large lookaheads safely.

We then consider an improvement involving “sentinels” that saves time checking for the ends of buffers.

Buffer Pairs

Processing the characters of a large source program takes a significant amount of time.

Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.

An important scheme involves two buffers that are alternately reloaded, as suggested in the figure.

Buffer Pairs

Each buffer is of the same size N.

N is usually the size of a disk block, e.g., 4096 bytes.

Using one system read command we can read N characters into a buffer, rather than using one system call per character.
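The saving can be illustrated by counting read calls. This is a sketch only: `count_reads` is a hypothetical helper, and an in-memory `io.BytesIO` stream stands in for a real source file, so the calls counted here are Python-level reads rather than actual system calls.

```python
import io

N = 4096  # buffer size from the text

def count_reads(stream, chunk_size):
    """Consume the stream and return how many read calls were needed."""
    calls = 0
    while True:
        chunk = stream.read(chunk_size)
        calls += 1
        if not chunk:  # an empty result signals the end of the input
            return calls

source = b"x" * 10000  # stand-in for a 10000-character source program
print(count_reads(io.BytesIO(source), N))  # block-at-a-time reading
print(count_reads(io.BytesIO(source), 1))  # character-at-a-time reading
```

For a 10000-character input the block version needs 4 reads (three that return data and one that detects the end), while the per-character version needs 10001.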

Buffer Pairs

If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.

This eof is different from any possible character of the source program.

Buffer Pairs

Two pointers to the input are maintained:

Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.

Pointer forward scans ahead until a pattern match is found.

Buffer Pairs

Once the next lexeme is determined, forward is set to the character at its right end.

Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.

In the figure, we see that forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
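The interplay of the two pointers, including the retraction step, might look as follows. This is a sketch under assumptions: the input is a plain Python string rather than a buffer pair, the lexeme is an identifier, `text[lexeme_begin]` is assumed to start one, and `next_identifier` is a hypothetical name.

```python
def next_identifier(text, lexeme_begin):
    """Scan an identifier that starts at lexeme_begin.

    Returns (lexeme, new_lexeme_begin).
    """
    forward = lexeme_begin
    # Scan ahead until we see a character that cannot belong to the lexeme.
    while forward < len(text) and (text[forward].isalnum() or text[forward] == "_"):
        forward += 1
    # forward has now passed the end of the lexeme: retract it one position
    # so that it rests on the character at the lexeme's right end.
    forward -= 1
    lexeme = text[lexeme_begin:forward + 1]
    # lexemeBegin for the next token is the character just after this lexeme.
    return lexeme, forward + 1

print(next_identifier("count = 1", 0))
```

Here the space after `count` is the lookahead character that ends the identifier; it is examined but not consumed.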

Buffer Pairs

Advancing forward requires that we first test whether we have reached the end of one of the buffers.

If so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
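A sketch of this scheme without sentinels is given below. The class `BufferPair` and its method names are assumptions made for illustration, and a short or empty read is used to detect the end of the input in place of an explicit eof character.

```python
import io

class BufferPair:
    """Two buffers of size n that are reloaded alternately (no sentinels)."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.buffers = [stream.read(n), ""]  # load the first buffer up front
        self.active = 0    # index of the buffer that forward is in
        self.forward = 0   # offset of forward inside the active buffer

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        current = self.buffers[self.active]
        # Test 1: have we reached the end of the active buffer?
        if self.forward == len(current):
            if len(current) < self.n:
                return None  # a short buffer means the input is exhausted
            other = 1 - self.active
            self.buffers[other] = self.stream.read(self.n)  # reload the other buffer
            self.active, self.forward = other, 0  # move to its beginning
            if not self.buffers[other]:
                return None
        # Test 2 (what the character is) is left to the caller.
        ch = self.buffers[self.active][self.forward]
        self.forward += 1
        return ch

reader = BufferPair(io.StringIO("abcdefg"), n=3)
chars = []
ch = reader.next_char()
while ch is not None:
    chars.append(ch)
    ch = reader.next_char()
print("".join(chars))
```

A tiny buffer size of 3 is used in the usage lines only so that both reloads are exercised on a short input.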

Sentinels

If we use the previous scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers.

If we do, then we must also reload the other buffer.

Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch).

Sentinels

The two tests can be combined into one by adding sentinels.

We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.

The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

Sentinels

The figure shows the same arrangement as before, but with the sentinels added.

Note that eof retains its use as a marker for the end of the entire input.

Any eof that appears other than at the end of a buffer means that the input is at an end.

Implementing Multiway Branches

The algorithm for advancing forward.

The test is simplified.
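The slide's figure is not present in the extracted text, so the following is only a sketch of how advancing forward with sentinels could be written. The class `SentinelBuffers`, its layout (one array holding both halves), and the choice of `"\0"` as the eof character are assumptions; the source program is assumed never to contain that character.

```python
import io

EOF = "\0"  # assumed sentinel: must never occur in the source program

class SentinelBuffers:
    """Buffer pair in which each half is followed by an EOF sentinel."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        # First half: slots 0..n-1, sentinel at n.
        # Second half: slots n+1..2n, sentinel at 2n+1.
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        """Fill the half that begins at start; EOF follows the last character."""
        data = self.stream.read(self.n)
        self.buf[start:start + len(data)] = data
        self.buf[start + len(data)] = EOF

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch  # common case: one test per character
        if self.forward == self.n + 1:        # sentinel after the first half
            self._load(self.n + 1)            # reload the second half
            return self.next_char()
        if self.forward == 2 * self.n + 2:    # sentinel after the second half
            self._load(0)                     # reload the first half
            self.forward = 0
            return self.next_char()
        self.forward -= 1  # eof inside a buffer: the input is at an end
        return None

reader = SentinelBuffers(io.StringIO("abcdefg"), n=3)
chars = []
ch = reader.next_char()
while ch is not None:
    chars.append(ch)
    ch = reader.next_char()
print("".join(chars))
```

The buffer-end checks run only when the character read is eof, so an ordinary character costs a single comparison.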

String and Languages

An alphabet is any finite set of symbols.

The set {0,1} is the binary alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

The empty string, denoted Ɛ, is the string of length zero.

A language is any countable set of strings over some fixed alphabet.

Operations on Languages

Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and D be the set of digits {0, 1, ..., 9}.

L ∪ D is the set of letters and digits; strictly speaking, it is the language with 62 strings of length one, each of which is either one letter or one digit.

LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.

L^4 is the set of all 4-letter strings.

L* is the set of all strings of letters, including Ɛ, the empty string.

L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.

D+ is the set of all strings of one or more digits.
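The finite cases can be checked directly with Python sets of strings. This is an illustrative sketch: L is taken to be the 52 upper- and lowercase ASCII letters and D the 10 decimal digits, and the helper names `concat`, `power`, and `in_letter_then_alnum` are assumptions.

```python
import string

L = set(string.ascii_letters)  # 52 strings of length one
D = set(string.digits)         # 10 strings of length one

def concat(a, b):
    """Concatenation: every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def power(lang, k):
    """k-fold concatenation; the 0th power contains only the empty string."""
    result = {""}
    for _ in range(k):
        result = concat(result, lang)
    return result

def in_letter_then_alnum(s):
    """Membership test for the infinite language L(L U D)*."""
    return len(s) >= 1 and s[0] in L and all(c in L | D for c in s[1:])

print(len(L | D))         # union
print(len(concat(L, D)))  # concatenation LD
print(52 ** 4)            # size of the 4th power of L, computed without enumerating it
print(in_letter_then_alnum("x27"), in_letter_then_alnum("2x"))
```

Closures such as L* and D+ are infinite, so they are represented here by a membership test rather than by an explicit set.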

Regular Expression

Specification of Tokens

A regular expression is a pattern that provides a concise and flexible means to "match" (specify and recognize) strings of text.

The C Identifiers

What is the regular definition for C identifiers?
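One common textbook answer is a letter or underscore followed by any number of letters, digits, or underscores. The sketch below expresses that definition with Python's `re` module; the names `C_IDENTIFIER` and `is_c_identifier` are hypothetical, and the pattern deliberately ignores details of real C such as keywords and extended characters.

```python
import re

# letter_ -> A | ... | Z | a | ... | z | _
# digit   -> 0 | ... | 9
# id      -> letter_ ( letter_ | digit )*
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_c_identifier(s):
    """True if the whole string matches the identifier pattern."""
    return C_IDENTIFIER.fullmatch(s) is not None

for candidate in ["count", "_tmp1", "x", "2fast", "a-b", ""]:
    print(repr(candidate), is_c_identifier(candidate))
```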

Unsigned Numbers

What is the regular definition for C unsigned numbers?

Examples: 2380, 0.0123, 6.34E34, 12.3E-12
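A possible answer, sketched with Python's `re` module, is a digit string with an optional fraction and an optional exponent. The names below are hypothetical, and the pattern follows the usual textbook definition rather than the full C grammar, which also accepts forms such as `1.`, `.5`, a lowercase `e`, and type suffixes.

```python
import re

# digits           -> digit+
# optionalFraction -> ( . digits )?
# optionalExponent -> ( E ( + | - )? digits )?
# number           -> digits optionalFraction optionalExponent
UNSIGNED_NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

def is_unsigned_number(s):
    """True if the whole string matches the unsigned-number pattern."""
    return UNSIGNED_NUMBER.fullmatch(s) is not None

for candidate in ["2380", "0.0123", "6.34E34", "12.3E-12", "1.", "-3"]:
    print(repr(candidate), is_unsigned_number(candidate))
```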

Extensions of Regular Expressions

One or more instances: the operator + denotes positive closure, which is related to the Kleene closure * by r* = r+ | Ɛ and r+ = rr* = r*r.

Zero or one instance: r? is equivalent to r | Ɛ.

Character classes: [a-z] is shorthand for a|b|...|z.
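These identities can be spot-checked with Python's `re` module, in which an empty alternative plays the role of Ɛ. The helper `same_language` is a hypothetical name, and agreement on a handful of sample strings is only evidence for the equivalences, not a proof.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab"]

print(same_language(r"a+", r"aa*", samples))   # r+ = rr*
print(same_language(r"a+", r"a*a", samples))   # r+ = r*r
print(same_language(r"a*", r"a+|", samples))   # r* = r+ | empty string
print(same_language(r"a?", r"a|", samples))    # r? = r | empty string
print(same_language(r"[a-c]", r"a|b|c", ["a", "b", "c", "d", "", "ab"]))
```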

Using the Extension

Recognition of Tokens

Our discussion will make use of the following running example.

Recognition of Tokens

For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals,” because it presents an interesting structure of lexemes.

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.

Example

The patterns for these tokens are described using regular definitions.

Example

To simplify matters, we make the common assumption that keywords are also reserved words.

They are not identifiers, even though their lexemes match the pattern for identifiers.
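One simple way to honor this assumption is to match the identifier pattern first and then consult a table of reserved words. The sketch below uses hypothetical names (`KEYWORDS`, `ID_PATTERN`, `classify_word`), and the identifier pattern, a letter followed by letters or digits, is itself an assumption because the slide's regular definitions are not in the extracted text.

```python
import re

KEYWORDS = {"if", "then", "else"}                  # keywords of the running example
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # assumed: letter ( letter | digit )*

def classify_word(lexeme):
    """Return the token name for a lexeme that matches the identifier pattern."""
    if ID_PATTERN.fullmatch(lexeme) is None:
        raise ValueError(f"not a word: {lexeme!r}")
    # A reserved word is its own token, even though it also matches id.
    return lexeme if lexeme in KEYWORDS else "id"

for word in ["if", "iffy", "then", "x1"]:
    print(word, "->", classify_word(word))
```

The lookup happens only after the whole lexeme is known, so `iffy` is an identifier even though it begins with `if`.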

Example

The lexical analyzer strips out whitespace by recognizing the “token” ws, defined by:

ws → (blank | tab | newline)+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names.

ws is not returned to the parser.

Instead, we restart the lexical analysis from the character that follows the whitespace.
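A small tokenizer for the running example shows ws being matched and then discarded. It is a sketch under assumptions: the patterns for number, relop, and words are guesses at the slide's regular definitions (which are not in the extracted text), and names such as `TOKEN_SPEC` and `tokenize` are hypothetical.

```python
import re

# Assumed patterns; longer relops are listed before their one-character prefixes.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                       # (blank | tab | newline)+
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("relop",  r"<=|<>|>=|<|>|="),
    ("word",   r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}

def tokenize(text):
    """Return (token, lexeme) pairs; whitespace is matched but never returned."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = m.end()  # restart from the character that follows this lexeme
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue   # ws is not returned to the parser
        if kind == "word":
            kind = lexeme if lexeme in KEYWORDS else "id"
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("if x1 <= 10 then y"))
```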

Our goal for the lexical analyzer is summarized in the figure.

Question?