Lexical Analysis with lex(1) and flex(1) © 2011 Clinton Jeffery

Reading

• Read Sections 3-5 of Lexical Analysis with Flex
• Check out the class lecture notes
• Ask questions from either source
  – Preferred venues: in-class, or in CS Forums

Traits of Scanners

• Function: convert from chars to tokens
• Identify and categorize kinds of tokens
• Detect boundaries between tokens
• Discard comments and whitespace
• Remember line/col #’s for error reporting
• Report lexical errors
• Run as fast as possible
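
The traits above can be sketched as a small hand-written scanner in C. This is only an illustration, not what lex generates: the names `struct scanner`, `scanner_init`, `scanner_next` and the `TOK_*` categories are hypothetical, only identifiers and decimal numbers are recognized, and comments and column tracking are left out.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token categories for this sketch. */
enum { TOK_EOF = 0, TOK_IDENT, TOK_NUMBER, TOK_ERROR };

struct scanner {
    const char *src;   /* NUL-terminated input */
    size_t pos;        /* index of the next unread char */
    int line;          /* current line number, 1-based */
    size_t start;      /* offset where the last token begins */
    size_t len;        /* length of the last token */
};

void scanner_init(struct scanner *s, const char *src)
{
    s->src = src;
    s->pos = 0;
    s->line = 1;
    s->start = 0;
    s->len = 0;
}

/* Return the category of the next token.
   Its lexeme is the len chars starting at src[start]. */
int scanner_next(struct scanner *s)
{
    int category;
    unsigned char c;

    /* Discard whitespace, counting newlines for error reporting. */
    while (isspace((unsigned char)s->src[s->pos])) {
        if (s->src[s->pos] == '\n')
            s->line++;
        s->pos++;
    }

    s->start = s->pos;
    c = (unsigned char)s->src[s->pos];

    if (c == '\0') {
        category = TOK_EOF;
    } else if (isalpha(c) || c == '_') {
        /* The token boundary is the first char that cannot continue it. */
        while (isalnum((unsigned char)s->src[s->pos]) || s->src[s->pos] == '_')
            s->pos++;
        category = TOK_IDENT;
    } else if (isdigit(c)) {
        while (isdigit((unsigned char)s->src[s->pos]))
            s->pos++;
        category = TOK_NUMBER;
    } else {
        s->pos++;              /* consume the offending char */
        category = TOK_ERROR;  /* caller reports it using s->line */
    }

    s->len = s->pos - s->start;
    return category;
}
```

A caller would loop on `scanner_next()` until it returns `TOK_EOF`, much as a parser loops on `yylex()`.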

Regular Expressions

• ε is a r.e.
• Any char in the alphabet is a r.e.
• If r and s are r.e.’s then r | s is a r.e.
• If r and s are r.e.’s then r s is a r.e.
• If r is a r.e. then r* is a r.e.
• If r is a r.e. then (r) is a r.e.

Common extensions to regular expression notation

• r+ is equivalent to rr*
• r? is equivalent to r|ε
• [abc] is equivalent to a|b|c
• [a-z] is equivalent to a|b|…|z
• [^abc] is equivalent to any char except a, b, or c
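
Some of these equivalences can be spot-checked from C with the POSIX `<regex.h>` functions. The sketch below uses POSIX extended regular expressions rather than lex itself, and the helpers `full_match` and `agree_on` are hypothetical names. Agreement on a handful of samples is evidence, not a proof of equivalence. The r? case is omitted because ε has no portable spelling in POSIX syntax.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if all of s matches the POSIX extended regex re,
   0 if it does not, -1 if the pattern cannot be compiled. */
int full_match(const char *re, const char *s)
{
    char anchored[256];
    regex_t compiled;
    int n, rc;

    /* Anchor the pattern so that only whole-string matches count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", re);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&compiled, s, 0, NULL, 0);
    regfree(&compiled);
    return rc == 0;
}

/* Return 1 if re1 and re2 give the same verdict on every sample string. */
int agree_on(const char *re1, const char *re2,
             const char *const samples[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (full_match(re1, samples[i]) != full_match(re2, samples[i]))
            return 0;
    return 1;
}
```

For example, `agree_on("a+", "aa*", samples, count)` should report agreement on any sample set, while `"a+"` and `"a*"` disagree as soon as the empty string is among the samples.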

Lex’s extended regular expressions

• \c      escapes for most operators
• "s"     match C string as-is (superescape)
• r{m,n}  match r between m and n times
• r/s     match r when s follows
• ^r      match r when at beginning of line
• r$      match r when at end of line

Lexical Attributes

• A lexical attribute is a piece of information about a token
• Compiler writer can define as needed
• Typically:
  – Category: integer code, used in parsing
  – Lexeme: actual string as it appears in source
  – Line, column: location in source code
  – Value: for literals, the binary value they represent

Meanings of the word “token”

• A single word from the source code
• An integer code that categorizes a word
• A set of lexical attributes that are computed from a single word of input
• An instance of a class (given by category)

Lex public interface

• FILE *yyin;      /* set before calling yylex() */
• int yylex();     /* call once per token */
• char yytext[];   /* chars matched by yylex() */
• int yywrap();    /* end-of-file handler */

.l file format

header
%%
body
%%
helper functions

Lex header

• C code inside %{ … %}
  – prototypes for helper functions
  – #include’s that #define integer token categories
• Macro definitions, e.g.
    letter  [a-zA-Z]
    digit   [0-9]
    ident   {letter}({letter}|{digit})*
• Warning: macros are fraught with peril

Lex body

• Regular expressions with semantic actions
    " "       { /* discard */ }
    {ident}   { return IDENT; }
    "*"       { return ASTERISK; }
    "."       { return PERIOD; }
• Match the longest r.e. possible
• Break ties with whichever appears first
• If it fails to match: copy unmatched to stdout
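
The two disambiguation rules (longest match, then earliest rule) can be sketched in plain C. This is not how lex implements them, since lex compiles all rules into one automaton. Here each rule is a hypothetical hand-written matcher function, and `pick_rule` applies the same policy by brute force. All names are assumptions made for the illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical categories for this sketch. */
enum { NO_MATCH = -1, KW_IF, IDENT, NUMBER };

/* A matcher returns how many leading chars of s it matches (0 = none). */
typedef size_t (*matcher)(const char *s);

static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)
{
    size_t n = 0;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_number(const char *s)
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

struct rule {
    matcher match;
    int category;
};

/* Order matters: the keyword rule is listed before the identifier rule. */
static const struct rule rules[] = {
    { match_if,     KW_IF  },
    { match_ident,  IDENT  },
    { match_number, NUMBER },
};

/* Pick the rule with the longest match at s; on a tie the earlier rule
   wins. Stores the match length in *len, returns the category or NO_MATCH. */
int pick_rule(const char *s, size_t *len)
{
    int best = NO_MATCH;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > best_len) {   /* strict '>' keeps the earlier rule on ties */
            best_len = n;
            best = rules[i].category;
        }
    }
    *len = best_len;
    return best;
}
```

With this rule order, the input `if` is a keyword (a tie, so the first rule wins), while `iffy` is an identifier because the identifier rule matches more characters.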

Lex helper functions

• Follows rules of ordinary C code
• Compute lexical attributes
• Do stuff the regular expressions can’t do
• Write a yywrap() to switch files on EOF
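
One example of a helper that computes a lexical attribute: turning the lexeme of a string literal into the value it denotes. The function name `string_value` is hypothetical, and the sketch assumes only the escapes \n, \t, \\ and \" need translating. A real C scanner would handle more.

```c
#include <stdlib.h>
#include <string.h>

/* Given a string-literal lexeme including its surrounding double quotes,
   return a newly allocated copy of the value it denotes, with the escapes
   \n, \t, \\ and \" translated. Returns NULL if the lexeme is not quoted
   or memory runs out. The caller frees the result. */
char *string_value(const char *lexeme)
{
    size_t n = strlen(lexeme);
    size_t i, j = 0;
    char *out;

    if (n < 2 || lexeme[0] != '"' || lexeme[n - 1] != '"')
        return NULL;

    out = malloc(n - 1);   /* at most n-2 chars plus the terminating NUL */
    if (out == NULL)
        return NULL;

    for (i = 1; i < n - 1; i++) {
        char c = lexeme[i];
        if (c == '\\' && i + 1 < n - 1) {
            switch (lexeme[++i]) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            default:   c = lexeme[i]; break;  /* unknown escape: keep the char */
            }
        }
        out[j++] = c;
    }
    out[j] = '\0';
    return out;
}
```

In a lex specification, a rule for string literals could call such a helper on `yytext` from its action and store the result in the token's value attribute.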

struct token – typical compiler

struct token {
    int category;
    char *text;
    int linenumber;
    int column;
    char *filename;
    union literal value;
};
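
A minimal sketch of how such a token might be allocated and filled in. The members of `union literal`, the category codes, and the functions `token_new` and `token_free` are not given in the slides. They are assumptions chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical category codes; real ones usually come from the parser. */
enum { INTLIT = 258, REALLIT, IDENT };

/* Assumed layout: the slides do not define union literal. */
union literal {
    long ival;
    double dval;
};

struct token {
    int category;
    char *text;
    int linenumber;
    int column;
    char *filename;
    union literal value;
};

/* Portable stand-in for strdup(). */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

void token_free(struct token *t)
{
    if (t == NULL)
        return;
    free(t->text);
    free(t->filename);
    free(t);
}

/* Build a token from a lexeme and its source position.
   For numeric literals, also compute the binary value. */
struct token *token_new(int category, const char *text,
                        int line, int column, const char *filename)
{
    struct token *t = calloc(1, sizeof *t);

    if (t == NULL)
        return NULL;
    t->category = category;
    t->linenumber = line;
    t->column = column;
    t->text = copy_string(text);
    t->filename = copy_string(filename);
    if (t->text == NULL || t->filename == NULL) {
        token_free(t);
        return NULL;
    }
    if (category == INTLIT)
        t->value.ival = strtol(text, NULL, 10);
    else if (category == REALLIT)
        t->value.dval = strtod(text, NULL);
    return t;
}
```

A lex action could call something like `token_new(INTLIT, yytext, lineno, col, fname)` before returning the category, where `lineno`, `col` and `fname` stand for whatever position variables the scanner maintains.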

“string removal tool”

%%
"zap me"

whitespace trimmer

%%
[ \t]+      putchar(' ');
[ \t]+$     ;   /* trailing whitespace (the $ is assumed here): drop entirely */

string replacement

%{
#include <stdio.h>
#include <unistd.h>   /* getlogin() */
%}
%%
username    printf("%s", getlogin());

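Line/character counter

%{
#include <stdio.h>
int lines = 0, chars = 0;
%}
%%
\n      { ++lines; ++chars; }
.       ++chars;
%%
/* link with -ll (lex) or -lfl (flex) for the default yywrap() */
int main(void)
{
    yylex();
    printf("lines: %d chars: %d\n", lines, chars);
    return 0;
}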

Example: C reals

• Is it: [0-9]*.[0-9]*
• Is it: ([0-9]+.[0-9]* | [0-9]*.[0-9]+)