Page 1

COMP 3438 – Part II - Lecture 2: Lexical Analysis (I)

Dr. Zili Shao

Department of Computing

The Hong Kong Polytechnic Univ.

Page 2

Overview of the Subject (COMP 3438)

Part I: Unix System Programming (Device Driver Development)

Overview of Unix Sys. Prog. (Process, File System)

Overview of Device Driver Development

Character Device Driver Development

Introduction to Block Device Driver

Part II: Compiler Design

Overview of Compiler Design

Lexical Analysis (HW #3)

Syntax Analysis (HW #4)

Course Organization (this lecture, Lexical Analysis, is highlighted in red on the original slide)

Page 3

The Outline

Part I: Introduction to Lexical Analysis
1. Input (a source program) and output (tokens)
2. How to specify tokens? Regular expressions
3. How to recognize tokens?
   Regular expression -> Lex (software tool)
   Regular expression -> finite automaton (write our own)

Part II: Regular Expressions

Part III: Finite Automata (write your own: Homework #3)

Page 4

Part I: Introduction to Lexical Analysis

Why do we need lexical analysis? Its input & output.
How to specify tokens: regular expressions.
How to recognize tokens (two methods):
   Regular expression -> software tool: Lex
   Regular expression -> finite automata (write your own program)

Page 5

Given a program, how do we group the characters into meaningful "words"?

Example: a C program segment, i.e., a string of characters stored in a file

if (i==j) z = 0;

else z = 1;

Why do we need lexical analysis?

if (i==j) \n\t\tz=0;\telse\n\t\tz=1;\n

How do we identify, from this string of characters, that "if" and "else" are keywords, "i", "j", "z" are variables, and so on?

(Similarly, in English, in order to understand "I love you", you first have to identify the words "I", "love", "you".)

Page 6

Lexical Analysis (Input & Output)

In lexical analysis, a source program is read from left to right and grouped into tokens: sequences of characters with a collective meaning.

INPUT: Source Program -> Lexical Analyzer -> OUTPUT: Tokens

if (i==j) \n\t\tz=0;\telse\n\t\tz=1;\n

Token      Lexeme (value)
keyword    if
ID         i
Operator   ==
ID         j
…          …
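To make the input/output concrete, here is a minimal hand-written scanner sketch in C for this fragment. It is only an illustration, not the course's required design; the token names (keyword, ID, NUM, Operator) and the next_token() interface are assumptions of this sketch, and operators and punctuation are lumped together for brevity.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Keyword list for this sketch; real scanners recognize many more. */
static const char *keywords[] = { "if", "else", "while", NULL };

/* The character stream from the slide. */
static const char *src = "if (i==j)\n\t\tz=0;\telse\n\t\tz=1;\n";
static int pos = 0;

/* Scan one token starting at pos and print its category and lexeme.
   Returns 0 at end of input, 1 otherwise. */
static int next_token(void)
{
    char lexeme[64];
    int n = 0;

    while (src[pos] == ' ' || src[pos] == '\t' || src[pos] == '\n')
        pos++;                                   /* skip whitespace */
    if (src[pos] == '\0')
        return 0;

    if (isalpha((unsigned char)src[pos])) {      /* identifier or keyword */
        while (isalnum((unsigned char)src[pos]))
            lexeme[n++] = src[pos++];
        lexeme[n] = '\0';
        int is_kw = 0;
        for (int i = 0; keywords[i]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                is_kw = 1;
        printf("%-9s %s\n", is_kw ? "keyword" : "ID", lexeme);
    } else if (isdigit((unsigned char)src[pos])) {   /* integer constant */
        while (isdigit((unsigned char)src[pos]))
            lexeme[n++] = src[pos++];
        lexeme[n] = '\0';
        printf("%-9s %s\n", "NUM", lexeme);
    } else {                 /* operators and punctuation, one or two characters */
        lexeme[n++] = src[pos++];
        if (lexeme[0] == '=' && src[pos] == '=')
            lexeme[n++] = src[pos++];             /* the "==" operator */
        lexeme[n] = '\0';
        printf("%-9s %s\n", "Operator", lexeme);
    }
    return 1;
}

int main(void)
{
    while (next_token())
        ;                     /* one Token / Lexeme pair per line */
    return 0;
}

Compiled and run, it prints one Token/Lexeme pair per line, in the spirit of the table above.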

Page 7

What is a Token?

A syntactic category
  In English: noun, verb, adjective, …
  In a programming language: Identifier, Integer, Keyword, Whitespace

Tokens correspond to sets of strings with a collective meaning
  Identifier: strings of letters and digits, starting with a letter
  Integer: a non-empty string of digits
  Keyword: "else", "if", "while", …

Page 8

Example – Expression (Input & Output)

((48*10) div 12)**3

LEXICAL ANALYSIS

TokenName: LP   value: (
TokenName: LP   value: (
TokenName: NUM  value: 48
TokenName: MPY  value: *
TokenName: NUM  value: 10
TokenName: RP   value: )
TokenName: DIV  value: /
TokenName: NUM  value: 12
TokenName: RP   value: )
TokenName: EXP  value: ^
TokenName: NUM  value: 3
TokenName: END  value: $

LEXICAL ANALYSIS FINISH

Page 9

Example – Mini Java Program (Input & Output)

program xyz;
class hellow {
  method void main( ) {
    System.println('hellow\n');
  }
}

Page 10

What are Tokens For?

Classify program substrings according to their syntactic role.

As the output of lexical analysis, tokens are the input to the parser (syntax analysis). The parser relies on token distinctions,

e.g., a keyword is treated differently from an ID.

Page 11

How to Recognize Tokens (Lexical Analyzer)?

First, specify tokens using regular expressions (patterns).

Second, based on the regular expressions, there are two ways to implement a lexical analyzer:
Method 1: use Lex, a software tool.
Method 2: use finite automata (write your own program). (Homework #3)

Page 12

Part II. Regular Expression

Alphabet, Strings, Languages
Regular Expressions
Regular Sets (Regular Languages)

Page 13

A token is specified by a pattern. A pattern is a set of rules that describes the formation of the token.

The lexical analyzer uses the pattern to identify the lexeme: a sequence of characters in the input that matches the pattern. Once matched, the corresponding token is recognized.

Example:

The rule (pattern) for ID (Identifier): a letter followed by letters and digits.

abc1 and A1 match the rule (pattern), so they are ID tokens; 1A does not match the rule (pattern), so it is not an ID token. (A minimal check of this rule is sketched below.)

Specifying tokens
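As a quick illustration (mine, not part of the slides), the ID rule above can be checked with the POSIX regex library in C; the pattern string is simply a transcription of "a letter followed by letters and digits".

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* "a letter followed by letters and digits" as a POSIX extended RE */
    const char *id_pattern = "^[A-Za-z][A-Za-z0-9]*$";
    const char *tests[] = { "abc1", "A1", "1A" };
    regex_t re;

    regcomp(&re, id_pattern, REG_EXTENDED | REG_NOSUB);
    for (int i = 0; i < 3; i++)
        printf("%-5s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0
                   ? "matches the pattern: ID token"
                   : "does not match: not an ID token");
    regfree(&re);
    return 0;
}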

Page 14

The rules for specifying token patterns are called regular expressions. A regular set (regular language) is the set of strings generated by a regular expression over an alphabet.

What are Alphabet, Language, Regular Expression, Regular Set?

Example: Rules, Tokens, Regular Expressions

Page 15

Alphabet and Strings

An alphabet (Σ) is a finite set of symbols. e.g., {0,1} is the binary alphabet;

{a,b,…,z, A,B,…,Z} is the English alphabet;

A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

e.g., 01001 is a string over Σ = {0,1};

wxyzabc is a string over Σ = {a,b,c,…,z};

ε denotes the empty string (containing no symbols).

The length of a string w is denoted |w|. e.g., |ε| = 0; |101| = 3; |abcdef| = 6.

Page 16

Σ* and Languages

Σ* denotes the set of all strings over an alphabet Σ, including ε (the empty string).

e.g., Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, …} over Σ = {0,1};

Languages:

Any set of strings over an alphabet Σ, that is, any subset of Σ*, is called a language.

e.g., Ø, {ε}, Σ*, and Σ are all languages;

{abc, def, d, z} is a language over Σ = {a,b,…,z};

Page 17

Remember that a language is a set, so all operations on sets can be applied to languages.

We are interested in: union, concatenation, and closures.

Given two languages L and M (operations as defined in Fig. 3.8 of the textbook):
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ … (zero or more concatenations of L)
Positive closure: L+ = L¹ ∪ L² ∪ … (one or more concatenations of L)

Operations on Languages

Page 18

Precedence of Operators

Precedence: Kleene closure > concatenation > union
(analogous to arithmetic: exponentiation > multiplication > addition)

e.g., the arithmetic expression 1 + 2 × 3² is read as 1 + (2 × (3²));
similarly, the regular expression 1 | 2 3* is read as 1 | (2 (3*)).

Page 19

Examples for Operations on Languages

Given: L = {a, b}, M = {a, bb}

L ∪ M = {a, b, bb}

LM = {aa, abb, ba, bbb}

L* = L⁰ ∪ L¹ ∪ L² ∪ … = {ε} ∪ {a, b} ∪ {aa, ab, ba, bb} ∪ … = {ε, a, b, aa, ab, ba, bb, aaa, …}

L+ = L¹ ∪ L² ∪ … = {a, b} ∪ {aa, ab, ba, bb} ∪ … = {a, b, aa, ab, ba, bb, aaa, …}
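A small C sketch (my own, not from the textbook) that computes these operations for L = {a, b} and M = {a, bb}; representing a language as a plain array of strings, and printing only the strings of L* up to length two, keeps the example short.

#include <stdio.h>
#include <string.h>

static const char *L[] = { "a", "b" };    /* L = {a, b}  */
static const char *M[] = { "a", "bb" };   /* M = {a, bb} */

int main(void)
{
    int i, j;

    /* Union: strings of L, then strings of M not already in L. */
    printf("L U M = {");
    for (i = 0; i < 2; i++)
        printf(" %s", L[i]);
    for (i = 0; i < 2; i++) {
        int dup = 0;
        for (j = 0; j < 2; j++)
            if (strcmp(M[i], L[j]) == 0)
                dup = 1;
        if (!dup)
            printf(" %s", M[i]);
    }
    printf(" }\n");

    /* Concatenation: every string of L followed by every string of M. */
    printf("LM    = {");
    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++)
            printf(" %s%s", L[i], M[j]);
    printf(" }\n");

    /* Kleene closure L* = L^0 U L^1 U L^2 U ...; only lengths 0..2 printed. */
    printf("L*    = { e");                /* e stands for the empty string */
    for (i = 0; i < 2; i++)
        printf(" %s", L[i]);              /* L^1 */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++)
            printf(" %s%s", L[i], L[j]);  /* L^2 */
    printf(" ... }\n");
    return 0;
}

Its output matches the sets listed above: { a b bb }, { aa abb ba bbb }, and { e a b aa ab ba bb ... }.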

Page 20

Example 3.2 (Page 93 in textbook): Let L be the set {A, B, ..., Z, a, b, ..., z} and D be the set {0, 1, ..., 9}. They are both languages. Here are some examples of new languages created from L and D by applying the operators defined in Fig. 3.8.

1. L ∪ D is the set of letters and digits;

2. LD is the set of strings consisting of a letter followed by a digit;

3. L⁴ is the set of all four-letter strings;

4. L* is the set of all strings of letters, including ε, the empty string;

5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter;

6. D+ is the set of all strings of one or more digits.

Another Example for Operations on languages

Page 21

The rules defining regular expressions over an alphabet Σ:

1. The empty string ε is a regular expression that denotes {ε}.

2. A single symbol a in Σ is a regular expression that denotes {a}.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,

a) (r) | (s) is a regular expression denoting L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).

Defining regular expressions

Page 22

Example 3.3 (Page 95 in textbook)

Let Σ = {a, b}.

1. The regular expression a | b denotes the set {a, b}.

2. The regular expression (a | b) (a | b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two.

Another regular expression for this same set is aa | ab | ba | bb.

3. The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ε, a, aa, aaa, ...}.

Examples of regular expressions

Page 23

Example 3.3 (Page 95 in textbook): Let Σ = {a, b}.

4. The regular expression (a | b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's. Another regular expression for this set is (a*b*)*.

5. The regular expression a | a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.

Examples of regular expressions

Page 24

Regular Set, Regular Expression and Regular Definition

Regular Set (Regular Language)

Each regular expression r denotes a language L(r), called a regular set.

e.g., let Σ = {a, b}; a|b denotes the set {a, b}.

Regular definition: give distinct names (d1, d2, …) to regular expressions (r1, r2, …), like

d1 → r1
d2 → r2
d3 → r3
…

Page 25

Example 3.4 (pp. 96)

Pascal identifier: a string of letters and digits beginning with a letter.

Regular Expression:

LETTER → A | B | … | Z | a | b | … | z

DIGIT → 0 | 1 | … | 9

ID → LETTER ( LETTER | DIGIT )*

Example – Identifier in Pascal

Page 26

Example – Unsigned Numbers in Pascal

Example 3.5 (pp. 96)

Unsigned numbers in Pascal are strings such as 5230, 39.37, 6.336E4, or 1.89E-4.

Regular Expression:

DIGIT → 0 | 1 | … | 9

DIGITS → DIGIT DIGIT*

OPTIONAL_FRAC → . DIGITS | ε

OPTIONAL_EXP → ( E ( + | - | ε ) DIGITS ) | ε

NUM → DIGITS OPTIONAL_FRAC OPTIONAL_EXP

Page 27

Notation Shorthands

1. One or more instances: r+ = r r*

2. Zero or one instance: r? = r | ε

3. Character classes: [a-z] = a | b | c | … | z

e.g.

Original regular expression for unsigned numbers:
DIGIT → 0 | 1 | … | 9
DIGITS → DIGIT DIGIT*
OPTIONAL_FRAC → . DIGITS | ε
OPTIONAL_EXP → ( E ( + | - | ε ) DIGITS ) | ε
NUM → DIGITS OPTIONAL_FRAC OPTIONAL_EXP

Regular expression for unsigned numbers with notation shorthands:
DIGITS → [0-9]+
OPTIONAL_FRAC → ( . DIGITS )?
OPTIONAL_EXP → ( E ( + | - )? DIGITS )?
NUM → DIGITS OPTIONAL_FRAC OPTIONAL_EXP
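A quick sanity check of the shorthand version (my own sketch, using POSIX extended regular expressions in C rather than the regular-definition notation above): the pattern accepts the unsigned-number examples from the previous page and rejects a malformed string.

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* NUM = DIGITS ( . DIGITS )? ( E ( + | - )? DIGITS )? as a POSIX ERE */
    const char *num_pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "5230", "39.37", "6.336E4", "1.89E-4", "12.E" };
    regex_t re;

    regcomp(&re, num_pattern, REG_EXTENDED | REG_NOSUB);
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "NUM" : "not NUM");
    regfree(&re);
    return 0;
}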

Page 28

Recognize tokens

Given a string s and a regular expression r, is s ∈ L(r)?

e.g., let Σ = {a, b}, and let a | b be the given regular expression.

Then the string aa ∉ L(a|b),

and the string a ∈ L(a|b).

Page 29

Implementation of Lexical Analysis

After regular expressions are obtained, we have two methods to implement a lexical analyzer:

Use tools: lex (for C), flex (for C/C++), jlex (for Java)

Specify tokens using regular expressions; the tool generates source code for the lexical analyzer.

Use regular expressions and finite automata:
Write code to express the recognition of tokens (table driven); a small sketch follows below.
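Here is a minimal table-driven sketch of Method 2 in C (my own illustration, not the Homework #3 solution): a three-state finite automaton whose transition table encodes ID = LETTER ( LETTER | DIGIT )*.

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = in identifier (accepting), 2 = dead (reject). */
/* Input classes: 0 = letter, 1 = digit, 2 = anything else.             */
static const int delta[3][3] = {
    /* letter  digit  other */
    {    1,      2,     2 },   /* state 0 */
    {    1,      1,     2 },   /* state 1 */
    {    2,      2,     2 },   /* state 2 */
};

static int char_class(char c)
{
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Run the automaton over the whole string; accept iff we end in state 1. */
static int is_id(const char *s)
{
    int state = 0;
    for (; *s; s++)
        state = delta[state][char_class(*s)];
    return state == 1;
}

int main(void)
{
    const char *tests[] = { "abc1", "A1", "1A", "z" };
    for (int i = 0; i < 4; i++)
        printf("%-5s %s\n", tests[i], is_id(tests[i]) ? "ID" : "not ID");
    return 0;
}

A full scanner would run such automata over the input stream and report the longest match for each token, but the table-driven structure is the same.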

Page 30

LEX: a lexical analyzer generator

Lex is a UNIX software tool (developed by M.E. Lesk and E. Schmidt at Bell Labs in 1972) that automatically constructs a lexical analyzer.

Input: a specification containing regular expressions written in the Lex language (pp.107-113 in textbook and LEX documentation on Blackboard)

It is assumed that each token matches a regular expression; an action is also needed for each expression.

Output: Produces a C program

Especially useful when coupled with a parser generator (e.g., yacc)

Page 31

Given the input file lex.l, which contains regular expressions specifying the tokens, LEX produces a C file lex.yy.c.

lex.yy.c contains a tabular representation of the state transition graph of a finite automaton constructed from the regular expressions, plus a routine yylex() that uses the table to recognize tokens.

yylex() can be called as a subroutine, e.g., it can be called by a syntax analyzer generated by Yacc

Or compile lex.yy.c and run it independently

Lex specification (lex.l) -> LEX -> lex.yy.c (contains the lexical analyzer, called yylex)

How does LEX work
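For concreteness, a very small Lex specification sketch (mine, not from the course notes; see pp. 107-113 of the textbook and the LEX documentation on Blackboard for the full Lex language). The token names and printing format are illustrative assumptions. Saved as lex.l, it could be processed with lex (or flex) and compiled as shown on the next page.

%{
#include <stdio.h>
%}

%%
if|else|while          { printf("keyword  %s\n", yytext); }
[A-Za-z][A-Za-z0-9]*   { printf("ID       %s\n", yytext); }
[0-9]+                 { printf("NUM      %s\n", yytext); }
"=="                   { printf("Operator %s\n", yytext); }
[=;(){}]               { printf("Operator %s\n", yytext); }
[ \t\n]+               { /* skip whitespace */ }
.                      { printf("other    %s\n", yytext); }
%%

int main(void) { yylex(); return 0; }

Because each token is given as a regular expression with an action, the longest-match behavior (e.g., "==" versus "=", "if" as a keyword rather than an ID) comes for free from the generated automaton.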

Page 32

How does LEX work

foo.l -> lex -> foolex.c -> cc -> foolex

input -> foolex -> tokens

> flex -o foolex.c foo.l
> cc -o foolex foolex.c -lfl

> more input
begin
if size>10 then size * -3.1415
end

> foolex < input
Keyword: begin
Keyword: if
Identifier: size
Operator: >
Integer: 10 (10)
Keyword: then
Identifier: size
Operator: *
Operator: -
Float: 3.1415 (3.1415)
Keyword: end

Page 33

About LEX

Some materials related to LEX can be found on Learn@PolyU.
