Lexical Analysis, Mooly Sagiv, msagiv@post.tau.ac.il, Schreiber 317, 03-640-7606, Wed 10:00-12:00, http://www.math.tau.ac.il/~msagiv/courses/wcc.html, Textbook: Modern Compiler Implementation in C

Page 1: Lexical Analysis

Mooly Sagiv
msagiv@post.tau.ac.il

Schreiber 317
03-640-7606

Wed 10:00-12:00
http://www.math.tau.ac.il/~msagiv/courses/wcc.html

Textbook: Modern Compiler Implementation in C
Chapter 2

Page 2: A motivating example

• Create a program that counts the number of lines in a given input file

Page 3: A motivating example: solution

%{
#include <stdio.h>
int num_lines = 0;   /* number of newlines seen so far */
%}
%option noyywrap
%%
\n      ++num_lines;
.       ;            /* ignore every other character */
%%
int main(void) { yylex(); printf("# of lines = %d\n", num_lines); return 0; }
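For comparison, the same counting can be written by hand in plain C. This is only a sketch: the function name `count_lines` and the choice to scan a NUL-terminated string rather than a file are assumptions made here, not part of the slides.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): counts '\n' characters in a
   NUL-terminated buffer, mirroring the two flex rules of the solution. */
size_t count_lines(const char *text)
{
    size_t num_lines = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n')
            ++num_lines;   /* first rule: a newline increments the counter */
        /* second rule: any other character is ignored */
    }
    return num_lines;
}
```

The flex version stays two lines long as more patterns are added, while a hand-written loop like this one grows a new branch per pattern.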

Page 4: Subjects

• Roles of lexical analysis

• The straightforward solution: a manual scanner for C

• Regular Expressions

• Finite automata

• From regular languages into finite automata

• Flex

Page 5: Basic Compiler Phases

Source program (string)
→ lexical analysis (finite automata) → Tokens
→ syntax analysis (pushdown automata) → Abstract syntax tree
→ semantic analysis
→ Translate (memory organization) → Intermediate representation
→ Instruction selection (dynamic programming) → Assembly
→ Register Allocation (graph algorithms) → Final assembly

Page 6: Example

• Input string

a\b := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)

• Tokens

id("a") assign num(5) + num(3) ;
id("b") assign ( print ( id("a") , id("a") - num(1) ) , num(10) * id("a") ) ;
print ( id("b") )

Page 7: Lexical Analysis (Scanning)

• Functionality
– input
  • program text (file)
– output
  • sequence of tokens

– Read input file
– Identify language keywords and standard identifiers
– Handle include files and macros
– Count line numbers
– Remove whitespace
– Report illegal symbols
– Produce symbol table

Page 8: A simplified scanner for C

Token nextToken()
{
    int c;                            /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;              /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();                /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin);   /* push the lookahead back, do not print it */
                  return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}
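A self-contained variant of the same idea is sketched below, restricted to the tokens `;`, `+`, `++` and `+=`. It reads from a string cursor instead of `stdin` so that it can be exercised without input files; the names `next_token`, `TokEnd` and `TokError` are assumptions introduced for illustration.

```c
#include <stddef.h>

/* Token kinds; the first four names follow the slide, TokEnd and TokError
   are hypothetical additions. */
typedef enum { SemiColumn, Plus, PlusPlus, PlusEqual, TokEnd, TokError } Token;

/* Returns the token starting at *cursor and advances *cursor past it. */
Token next_token(const char **cursor)
{
    const char *p = *cursor;
    Token tok;

    while (*p == ' ')              /* skip blanks, like `goto loop` */
        ++p;

    switch (*p) {
    case '\0': tok = TokEnd; break;            /* end of input: do not advance */
    case ';':  ++p; tok = SemiColumn; break;
    case '+':
        ++p;
        if (*p == '+')      { ++p; tok = PlusPlus; }
        else if (*p == '=') { ++p; tok = PlusEqual; }
        else                { tok = Plus; }    /* lookahead is left unconsumed */
        break;
    default:   ++p; tok = TokError; break;     /* illegal symbol */
    }
    *cursor = p;
    return tok;
}
```

Because the lookahead character is only inspected and never consumed in the `Plus` case, no push-back call is needed here.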

Page 9: Automatic Generation of Lexical Analysis

• The matching of input strings can be performed by a finite automaton

• Examples:
– An automaton for while
– An automaton for C identifier
– An automaton for C comment

• The program for the automaton is automatically generated from regular expressions
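As an illustration of the second example, a hand-built automaton for C identifiers might look like the sketch below. The state names and the function `is_c_identifier` are assumptions; a generated scanner would use tables rather than a `switch`, and this sketch does not distinguish keywords from identifiers.

```c
#include <stdbool.h>

/* States of a small DFA for the pattern [a-zA-Z_][a-zA-Z_0-9]* */
enum { START, IN_ID, DEAD };

static bool is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Transition function: next state for (state, input character). */
static int step(int state, char c)
{
    switch (state) {
    case START: return is_letter(c) ? IN_ID : DEAD;
    case IN_ID: return (is_letter(c) || is_digit(c)) ? IN_ID : DEAD;
    default:    return DEAD;       /* the dead state has no way out */
    }
}

/* True if the whole string is accepted, i.e. the run ends in IN_ID. */
bool is_c_identifier(const char *s)
{
    int state = START;
    for (; *s != '\0'; ++s)
        state = step(state, *s);
    return state == IN_ID;
}
```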

Page 10: Flex

• Input
– regular expressions and actions (C code)

• Output
– A scanner program that reads the input and applies actions when an input regular expression is matched

Diagram: regular expressions → flex → scanner; input program → scanner → tokens

Page 11: Regular Expression Notations

a          An ordinary character stands for itself
M|N        M or N
MN         M followed by N
M*         Zero or more occurrences of M
M+         One or more occurrences of M
M?         Zero or one occurrence of M
[a-zA-Z]   Character set alternation (single character)
.          Any (single) character but newline
"a.+"      Quotation
\          Convert an operator into text
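Most of these notations can be tried out with the POSIX `regcomp`/`regexec` functions, as sketched below. The helper `full_match` is a hypothetical name. Two differences from flex should be kept in mind: POSIX extended regular expressions have no `"..."` quotation, so operators are escaped with `\` instead, and `.` also matches a newline unless `REG_NEWLINE` is given.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if `text` as a whole matches the POSIX extended
   regular expression `pattern`. */
bool full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    bool ok;

    /* Wrap the pattern so that it must cover the entire string. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* pattern did not compile */
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For example, `full_match("ab+", "abbb")` holds while `full_match("ab+", "a")` does not, and the quoted flex pattern `"a.+"` corresponds to `full_match("a\\.\\+", "a.+")`.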

Page 12: Ambiguity Resolving

• Find the longest matching token

• Between two tokens with the same length use the one declared first
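The two rules can be sketched in C as follows, using a keyword rule `if` declared before an identifier rule `[a-z][a-z0-9]*`. The matcher functions and the `classify` name are assumptions; a generated scanner reaches the same result with a single automaton rather than by trying each rule in turn.

```c
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NONE, TOK_IF, TOK_ID } TokenKind;

/* Length of the prefix of s matched by the keyword rule `if` (0 if none). */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Length of the prefix of s matched by [a-z][a-z0-9]* (0 if none). */
static size_t match_id(const char *s)
{
    size_t n;
    if (s[0] < 'a' || s[0] > 'z')
        return 0;
    for (n = 1; (s[n] >= 'a' && s[n] <= 'z') || (s[n] >= '0' && s[n] <= '9'); ++n)
        ;
    return n;
}

/* Rules in declaration order: earlier entries win ties. */
static const struct {
    TokenKind kind;
    size_t (*match)(const char *);
} rules[] = {
    { TOK_IF, match_if },
    { TOK_ID, match_id },
};

/* Picks the rule with the longest match; stores the match length in *length. */
TokenKind classify(const char *s, size_t *length)
{
    TokenKind best = TOK_NONE;
    *length = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t n = rules[i].match(s);
        if (n > *length) {     /* strictly longer only: a later rule never wins a tie */
            *length = n;
            best = rules[i].kind;
        }
    }
    return best;
}
```

With these rules, `if` is classified as the keyword (both rules match two characters, and the keyword is declared first), while `if8` is an identifier because the identifier rule matches three characters.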

Page 13: A Flex specification of C Scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]   { ; }
[\n]    { line_count++; }
";"     { return SemiColumn; }
"++"    { return PlusPlus; }
"+="    { return PlusEqual; }
"+"     { return Plus; }
"while" { return While; }
{Letter}({Letter}|{Digit})*  { return Id; }
"<="    { return LessOrEqual; }
"<"     { return LessThen; }

Page 14: Running Example

if                                    { return IF; }
[a-z][a-z0-9]*                        { return ID; }
[0-9]+                                { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+       { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)            { ; }
.                                     { error(); }

Page 15:

int edges[][256] = {
/*                ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */  {0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
/* state 1 */  {13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13},
/* state 2 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0},
/* state 3 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
/* state 4 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
/* state 5 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 6 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 7 */
...
/* state 13 */ {0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0}
};

Page 16: Pseudo Code for Scanner

Token nextToken()
{
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][*currentPosition];   /* index by the current character */
        advance currentPosition;
        if (isFinal(nextState)) {
            lastFinal = nextState;
            inputPositionAtLastFinal = currentPosition;      /* position just past the accepted text */
        }
        currentState = nextState;
    }
    input = inputPositionAtLastFinal;
    return action[lastFinal];
}
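A runnable sketch of this loop is given below for a deliberately tiny automaton with two rules, NUM = `[0-9]+` and REAL = `[0-9]+\.[0-9]+`. The state numbering, the use of three character classes instead of 256 columns, and all identifiers are assumptions chosen to keep the example short; only the "remember the last final state, then back up" logic is taken from the pseudo code.

```c
#include <stddef.h>

enum { DEAD = 0, START = 1, S_NUM = 2, S_DOT = 3, S_REAL = 4 };
typedef enum { T_NONE, T_NUM, T_REAL } TokenKind;

/* Character classes, used instead of a 256-column table. */
enum { C_DIGIT, C_DOT, C_OTHER };

static int char_class(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* edges[state][class] for NUM = [0-9]+ and REAL = [0-9]+\.[0-9]+ */
static const int edges[5][3] = {
    /* DEAD   */ { DEAD,   DEAD,  DEAD },
    /* START  */ { S_NUM,  DEAD,  DEAD },
    /* S_NUM  */ { S_NUM,  S_DOT, DEAD },
    /* S_DOT  */ { S_REAL, DEAD,  DEAD },
    /* S_REAL */ { S_REAL, DEAD,  DEAD },
};

/* action[state]: token reported when `state` was the last final state. */
static const TokenKind action[5] = { T_NONE, T_NONE, T_NUM, T_NONE, T_REAL };

static int is_final(int state) { return action[state] != T_NONE; }

/* Scans the longest token at *input and advances *input past it.
   If nothing matches, returns T_NONE and leaves *input unchanged. */
TokenKind next_token(const char **input)
{
    int last_final = DEAD, state = START;
    const char *pos = *input, *pos_at_last_final = *input;

    while (state != DEAD && *pos != '\0') {
        state = edges[state][char_class(*pos)];
        ++pos;
        if (is_final(state)) {
            last_final = state;
            pos_at_last_final = pos;   /* just past the accepted text */
        }
    }
    *input = pos_at_last_final;        /* back up over any overshoot */
    return action[last_final];
}
```

On the input `12.x` the automaton reads as far as `x` before dying, then backs up and reports a NUM of length two, leaving the cursor on the `.` character.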

Page 17: Example

Input: "if --not-a-com"

Page 18: Efficient Scanners

• Efficient state representation

• Input buffering

• Using switch and goto instead of tables

Page 19: Constructing Automaton from Specification

• Create a non-deterministic automaton (NDFA) from every regular expression

• Merge all the automata using epsilon moves (like the | construction)

• Construct a deterministic finite automaton (DFA)

• Minimize the automaton starting with separate accepting states
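The first three steps can be sketched with bit sets for a very small specification: rule 1 is the keyword `if`, rule 2 is `[a-z]+`. The NFA below is written by hand (state 0 has epsilon moves to the two sub-automata), and the subset construction then builds the DFA. All names, the 26-letter alphabet, and the state numbering are assumptions; minimization is not shown.

```c
#include <stdint.h>

#define NFA_STATES 6
#define ALPHABET   26            /* lowercase letters only, to keep the sketch small */
#define MAX_DFA    64            /* 2^NFA_STATES subsets at most */

typedef uint32_t StateSet;       /* bit i set <=> NFA state i is in the set */

/* Merged NFA: 0 -eps-> 1, 0 -eps-> 4; 1 -i-> 2 -f-> 3 (IF);
   4 -[a-z]-> 5 -[a-z]-> 5 (ID). */
static const StateSet eps[NFA_STATES] = { (1u << 1) | (1u << 4), 0, 0, 0, 0, 0 };

static StateSet nfa_move(int state, char c)
{
    switch (state) {
    case 1:  return c == 'i' ? 1u << 2 : 0;
    case 2:  return c == 'f' ? 1u << 3 : 0;
    case 4:
    case 5:  return 1u << 5;     /* any letter of the alphabet */
    default: return 0;
    }
}

/* Adds everything reachable through epsilon moves. */
static StateSet closure(StateSet set)
{
    StateSet prev;
    do {
        prev = set;
        for (int s = 0; s < NFA_STATES; ++s)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != prev);
    return set;
}

static StateSet dfa_sets[MAX_DFA];
static int dfa_edges[MAX_DFA][ALPHABET];
static int dfa_count;

/* Returns the DFA state number of `set`, creating it if it is new. */
static int intern(StateSet set)
{
    for (int i = 0; i < dfa_count; ++i)
        if (dfa_sets[i] == set)
            return i;
    dfa_sets[dfa_count] = set;
    return dfa_count++;
}

void build_dfa(void)
{
    dfa_count = 0;
    intern(0);                               /* DFA state 0: empty set, the dead state */
    intern(closure(1u << 0));                /* DFA state 1: the start state */
    for (int d = 0; d < dfa_count; ++d) {    /* dfa_count grows while we iterate */
        for (int a = 0; a < ALPHABET; ++a) {
            StateSet target = 0;
            for (int s = 0; s < NFA_STATES; ++s)
                if (dfa_sets[d] & (1u << s))
                    target |= nfa_move(s, (char)('a' + a));
            dfa_edges[d][a] = intern(closure(target));
        }
    }
}

/* 1 = IF, 2 = ID, 0 = not accepting; the rule declared first wins. */
int dfa_accepts(int d)
{
    if (dfa_sets[d] & (1u << 3)) return 1;
    if (dfa_sets[d] & (1u << 5)) return 2;
    return 0;
}

/* Runs the DFA over a whole string of lowercase letters. */
int run_dfa(const char *s)
{
    int d = 1;
    for (; *s != '\0'; ++s) {
        if (*s < 'a' || *s > 'z')
            return 0;
        d = dfa_edges[d][*s - 'a'];
    }
    return dfa_accepts(d);
}
```

After `build_dfa()`, the DFA state reached on `if` contains both NFA accepting states, and `dfa_accepts` resolves the clash in favour of the rule declared first. This is why the accepting states of different rules are kept separate when minimization starts.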

Page 20: NDFA Construction

if                                    { return IF; }
[a-z][a-z0-9]*                        { return ID; }
[0-9]+                                { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+       { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)            { ; }
.                                     { error(); }

Page 21: DFA Construction

Page 22: Minimization

Page 23:

%{
/* C declarations */
#include "tokens.h"    /* Mapping of tokens into integers */
#include "errormsg.h"  /* Shared by all the phases */
union {int ival; string sval; double fval;} yylval;
int charPos = 1;
#define ADJ (EM_tokPos=charPos, charPos+=yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                  { ADJ; return IF; }
[a-z][a-z0-9]*      { ADJ; yylval.sval=String(yytext); return ID; }
{digits}            { ADJ; yylval.ival=atoi(yytext); return NUM; }
({digits}\.{digits}?)|({digits}?\.{digits})  { ADJ; yylval.fval=atof(yytext); return REAL; }
(\-\-[a-z]*\n)|([\n\t]|" ")+  { ADJ; }
.                   { ADJ; EM_error("illegal character"); }

Page 24: Start States

• Regular expressions may be more complicated than automata
– C comments

• Solutions
– Conversion of automata into regular expressions
– Start States

%start s1 s2
%%
<INITIAL>r1 { action0; BEGIN s1; }
<s1>r1      { action1; BEGIN s2; }
<s2>r2      { action2; BEGIN INITIAL; }

Page 25: Realistic Example

%start Comment
%%
<INITIAL>"/*"   { BEGIN Comment; }
<INITIAL>r1     { Usual actions; }
<INITIAL>r2     { Usual actions; }
...
<INITIAL>rk     { Usual actions; }
<Comment>"*/"   { BEGIN INITIAL; }
<Comment>.|\n   ;
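The effect of the two start states can be imitated by hand with an explicit state variable, as in the sketch below. The function `strip_comments` and its buffer convention are assumptions, and the "usual actions" are reduced to copying the character. Like the specification on the slide, the sketch does not treat string literals specially.

```c
#include <stddef.h>

typedef enum { INITIAL, COMMENT } StartState;

/* Copies src to dst without C comments. dst must have room for
   strlen(src) + 1 bytes. Returns the number of bytes written,
   not counting the terminating NUL. */
size_t strip_comments(const char *src, char *dst)
{
    StartState state = INITIAL;
    size_t out = 0;

    while (*src != '\0') {
        if (state == INITIAL) {
            if (src[0] == '/' && src[1] == '*') {
                state = COMMENT;          /* comment opener: BEGIN Comment */
                src += 2;
            } else {
                dst[out++] = *src++;      /* stands in for the usual actions */
            }
        } else {
            if (src[0] == '*' && src[1] == '/') {
                state = INITIAL;          /* comment closer: BEGIN INITIAL */
                src += 2;
            } else {
                ++src;                    /* any character or newline: skip */
            }
        }
    }
    dst[out] = '\0';
    return out;
}
```

The state variable plays the role of the start state: the same two-character sequence is interpreted differently depending on whether the scanner is currently inside a comment.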

Page 26: Summary

• For most programming languages lexical analyzers can be easily constructed

• Exceptions:
– Fortran
– PL/1

• Flex is a useful tool beyond compilers