
JLex Examples The JLex specification file is: A JLex …pages.cs.wisc.edu/~fischer/cs536.s05/lectures/Lecture10.4up.pdf · AnyLet=[A-Za-z] Others=[0-9’&.] WhiteSp=[\040\n]


126CS 536 Spring 2005©

JLex Examples

A JLex scanner that looks for five-letter words that begin with “P” and end with “T”.
This example is in ~cs536-1/public/jlex

The JLex specification file is:

    class Token {
        String text;
        Token(String t){text = t;}
    }
    %%
    Digit=[0-9]
    AnyLet=[A-Za-z]
    Others=[0-9'&.]
    WhiteSp=[\040\n]
    // Tell JLex to have yylex() return a Token
    %type Token
    // Tell JLex what to return when eof of file is hit
    %eofval{
    return new Token(null);
    %eofval}
    %%
    [Pp]{AnyLet}{AnyLet}{AnyLet}[Tt]{WhiteSp}+
            {return new Token(yytext());}
    ({AnyLet}|{Others})+{WhiteSp}+ {/*skip*/}

The Java program that uses the scanner is:

    import java.io.*;

    class Main {
        public static void main(String args[])
                throws java.io.IOException {
            Yylex lex = new Yylex(System.in);
            Token token = lex.yylex();
            while ( token.text != null ) {
                System.out.print("\t"+token.text);
                token = lex.yylex(); //get next token
            }
        }
    }

In case you care, the words that are matched include:

Pabst

paint

petit

pilot

pivot

plant

pleat

point

posit

Pratt

print
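
The shape of the first rule can be cross-checked outside JLex with java.util.regex. This is only a sketch: the class name FiveLetterCheck and the method isMatch are made up for illustration, and the trailing {WhiteSp}+ that the scanner rule also requires is left out.

```java
import java.util.regex.Pattern;

class FiveLetterCheck {
    // Same shape as the JLex rule [Pp]{AnyLet}{AnyLet}{AnyLet}[Tt],
    // without the trailing {WhiteSp}+ that the scanner rule also requires.
    static final Pattern P_WORD = Pattern.compile("[Pp][A-Za-z]{3}[Tt]");

    // True if the whole word has the five-letter P...T shape.
    static boolean isMatch(String word) {
        return P_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"Pabst", "paint", "pilot", "Pratt", "plan", "paints"};
        for (String w : words) {
            System.out.println(w + " -> " + isMatch(w));
        }
    }
}
```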

An example of CSX token specifications. This example is in ~cs536-1/public/proj2/startup

The JLex specification file is:

    import java_cup.runtime.*;

    /* Expand this into your solution for project 2 */

    class CSXToken {
        int linenum;
        int colnum;
        CSXToken(int line,int col){
            linenum=line;colnum=col;};
    }

    class CSXIntLitToken extends CSXToken {
        int intValue;
        CSXIntLitToken(int val,int line,int col){
            super(line,col);intValue=val;};
    }

    class CSXIdentifierToken extends CSXToken {
        String identifierText;
        CSXIdentifierToken(String text,int line,int col){
            super(line,col);identifierText=text;};
    }

    class CSXCharLitToken extends CSXToken {
        char charValue;
        CSXCharLitToken(char val,int line,int col){
            super(line,col);charValue=val;};
    }

    class CSXStringLitToken extends CSXToken {
        String stringText;
        CSXStringLitToken(String text,int line,int col){
            super(line,col);stringText=text; };
    }

    // This class is used to track line and column numbers
    // Feel free to change to extend it
    class Pos {
        static int linenum = 1;
            /* maintain this as line number current token was scanned on */
        static int colnum = 1;
            /* maintain this as column number current token began at */
        static int line = 1;
            /* maintain this as line number after scanning current token */
        static int col = 1;
            /* maintain this as column number after scanning current token */
        static void setpos() {
            //set starting pos for current token
            linenum = line;
            colnum = col;
        }
    }

    %%
    Digit=[0-9]

    // Tell JLex to have yylex() return a Symbol, as JavaCUP will require
    %type Symbol

    // Tell JLex what to return when eof of file is hit
    %eofval{
    return new Symbol(sym.EOF, new CSXToken(0,0));
    %eofval}

    %%
    "+"     {Pos.setpos(); Pos.col +=1;
             return new Symbol(sym.PLUS,
                 new CSXToken(Pos.linenum,Pos.colnum));}

    "!="    {Pos.setpos(); Pos.col +=2;
             return new Symbol(sym.NOTEQ,
                 new CSXToken(Pos.linenum,Pos.colnum));}

    ";"     {Pos.setpos(); Pos.col +=1;
             return new Symbol(sym.SEMI,
                 new CSXToken(Pos.linenum,Pos.colnum));}

    {Digit}+ {// This def doesn’t check
             // for overflow
             Pos.setpos();
             Pos.col += yytext().length();
             return new Symbol(sym.INTLIT,
                 new CSXIntLitToken(
                     new Integer(yytext()).intValue(),
                     Pos.linenum,Pos.colnum));}

    \n      {Pos.line +=1; Pos.col = 1;}
    " "     {Pos.col +=1;}

The Java program that uses this scanner (P2) is:

    import java.io.*;
    import java_cup.runtime.*;

    class P2 {
        public static void main(String args[])
                throws java.io.IOException {
            if (args.length != 1) {
                System.out.println(
                    "Error: Input file must be named on command line." );
                System.exit(-1);
            }
            java.io.FileInputStream yyin = null;
            try {
                yyin = new java.io.FileInputStream(args[0]);
            } catch (FileNotFoundException notFound){
                System.out.println(
                    "Error: unable to open input file.");
                System.exit(-1);
            }
            // lex is a JLex-generated scanner that
            // will read from yyin
            Yylex lex = new Yylex(yyin);

            System.out.println("Begin test of CSX scanner.");

            /**********************************
             You should enter code here that
             thoroughly tests your scanner.

             Be sure to test extreme cases,
             like very long symbols or lines,
             illegal tokens, unrepresentable
             integers, illegal strings, etc.
             The following is only a starting point.
            ***********************************/
            Symbol token = lex.yylex();

            while ( token.sym != sym.EOF ) {
                System.out.print(
                    ((CSXToken) token.value).linenum + ":"
                    + ((CSXToken) token.value).colnum + " ");

                switch (token.sym) {
                    case sym.INTLIT:
                        System.out.println("\tinteger literal(" +
                            ((CSXIntLitToken) token.value).intValue + ")");
                        break;
                    case sym.PLUS:
                        System.out.println("\t+");
                        break;
                    case sym.NOTEQ:
                        System.out.println("\t!=");
                        break;
                    default:
                        throw new RuntimeException();
                }

                token = lex.yylex(); // get next token
            }

            System.out.println("End test of CSX scanner.");
        }
    }

Other Scanner Issues

We will consider other practical issues in building real scanners for real programming languages.
Our finite automaton model sometimes needs to be augmented. Moreover, error handling must be incorporated into any practical scanner.

Identifiers vs. Reserved Words

Most programming languages contain reserved words like if, while, switch, etc. These tokens look like ordinary identifiers, but aren’t.
It is up to the scanner to decide if what looks like an identifier is really a reserved word. This distinction is vital as reserved words have different token codes than identifiers and are parsed differently.
How can a scanner decide which tokens are identifiers and which are reserved words?

• We can scan identifiers and reserved words using the same pattern, and then look up the token in a special “reserved word” table.

• It is known that any regular expression may be complemented to obtain all strings not in the original regular expression. Thus Not(A), the complement of A, is regular if A is. Using complementation we can write a regular expression for nonreserved identifiers:

    Not( Not(ident) | if | while | … )

  Since scanner generators don’t usually support complementation of regular expressions, this approach is more of theoretical than practical interest.

• We can give distinct regular expression definitions for each reserved word, and for identifiers. Since the definitions overlap (if will match a reserved word and the general identifier pattern), we give priority to reserved words. Thus a token is scanned as an identifier if it matches the identifier pattern and does not match any reserved word pattern. This approach is commonly used in scanner generators like Lex and JLex.
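
The table-lookup approach from the first bullet might be sketched as follows. The class name ReservedWordTable, the method classify, and the three-word table are all assumptions made for illustration; a real language supplies its own reserved word list and token codes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ReservedWordTable {
    // Hypothetical reserved word list; a real language defines its own.
    static final Set<String> RESERVED =
        new HashSet<>(Arrays.asList("if", "while", "switch"));

    // Classify text that has already matched the general identifier pattern.
    static String classify(String lexeme) {
        return RESERVED.contains(lexeme) ? "RESERVED" : "IDENTIFIER";
    }

    public static void main(String[] args) {
        System.out.println("while  -> " + classify("while"));
        System.out.println("whilst -> " + classify("whilst"));
    }
}
```

The scanner needs only one pattern for identifier-like text; the table decides the token code afterwards.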

Converting Token Values

For some tokens, we may need to convert from string form into numeric or binary form.
For example, for integers, we need to transform a string of digits into the internal (binary) form of integers.
We know the format of the token is valid (the scanner checked this), but:

• The string may represent an integer too large to represent in 32 or 64 bit form.

• Languages like CSX and ML use a non-standard representation for negative values (~123 instead of -123)

We can safely convert from string to integer form by first converting the string to double form, checking against max and min int, and then converting to int form if the value is representable.
Thus d = new Double(str) will create an object d containing the value of str in double form. If str is too large or too small to be represented as a double, plus or minus infinity is automatically substituted.
d.doubleValue() will give d’s value as a Java double, which can be compared against Integer.MAX_VALUE or Integer.MIN_VALUE.

If d.doubleValue() represents a valid integer, (int) d.doubleValue() will create the appropriate integer value.
If a string representation of an integer begins with a “~” we can strip the “~”, convert to a double and then negate the resulting value.
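
The conversion steps above can be sketched in Java as follows. The class name IntLitConverter and the method toInt are hypothetical, and returning null for an unrepresentable value is just one possible convention. The sketch uses Double.parseDouble rather than new Double(str), since that constructor is deprecated in recent Java versions; both yield infinity for values too large for a double.

```java
class IntLitConverter {
    // Converts a digit string, optionally prefixed by '~' for negation
    // (as in CSX), to an Integer. Returns null if the value does not fit
    // in an int. Assumes the scanner has already checked the format.
    static Integer toInt(String str) {
        boolean negative = str.startsWith("~");
        String digits = negative ? str.substring(1) : str;

        // Yields Infinity if the value is too large for a double.
        double d = Double.parseDouble(digits);
        if (negative) {
            d = -d;
        }

        if (d > Integer.MAX_VALUE || d < Integer.MIN_VALUE) {
            return null; // not representable as an int
        }
        return (int) d;
    }

    public static void main(String[] args) {
        System.out.println(toInt("123"));
        System.out.println(toInt("~123"));
        System.out.println(toInt("99999999999"));
    }
}
```

Every int value is exactly representable as a double, so the comparison against Integer.MAX_VALUE and Integer.MIN_VALUE loses no precision.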

Scanner Termination

A scanner reads input characters and partitions them into tokens.
What happens when the end of the input file is reached? It may be useful to create an Eof pseudo-character when this occurs. In Java, for example, InputStream.read(), which reads a single byte, returns -1 when end of file is reached. A constant, EOF, defined as -1 can be treated as an “extended” ASCII character. This character then allows the definition of an Eof token that can be passed back to the parser.
An Eof token is useful because it allows the parser to verify that the logical end of a program corresponds to its physical end. Most parsers require an end of file token.
Lex and JLex automatically create an Eof token when the scanner they build tries to scan an EOF character (or tries to scan when eof() is true).
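
A small sketch of the Eof pseudo-character idea, assuming a toy token set in which tokens are separated by blanks or newlines. The class name EofDemo, the method tokens, and the "<Eof>" marker are illustrative only. It reads from a ByteArrayInputStream, whose read() returns -1 at end of input just as InputStream.read() does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class EofDemo {
    static final int EOF = -1; // value read() returns at end of input

    // Splits input into blank-separated tokens and appends an Eof token.
    static List<String> tokens(String input) {
        ByteArrayInputStream in =
            new ByteArrayInputStream(input.getBytes(StandardCharsets.US_ASCII));
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        do {
            c = in.read();
            if (c == EOF || c == ' ' || c == '\n') {
                // A separator or Eof ends the token in progress, if any.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        } while (c != EOF);
        out.add("<Eof>"); // explicit end-of-file token for the parser
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a bc\nd"));
    }
}
```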

Multi Character Lookahead

We may allow finite automata to look beyond the next input character.
This feature is necessary to implement a scanner for FORTRAN.
In FORTRAN, the statement

    DO 10 J = 1,100

specifies a loop, with index J ranging from 1 to 100.
The statement

    DO 10 J = 1.100

is an assignment to the variable DO10J. (Blanks are not significant except in strings.)
A FORTRAN scanner decides whether the O is the last character of a DO token only after reading as far as the comma (or period).

A milder form of extended lookahead problem occurs in Pascal and Ada.
The token 10.50 is a real literal, whereas 10..50 is three different tokens.
We need two-character lookahead after the 10 prefix to decide whether we are to return 10 (an integer literal) or 10.50 (a real literal).

Suppose we use the following FA.

[Figure: finite automaton with transitions labeled D and .]

Given 10..100 we scan three characters and stop in a non-accepting state.
Whenever we stop reading in a non-accepting state, we back up along accepted characters until an accepting state is found.
Characters we back up over are rescanned to form later tokens. If no accepting state is reached during backup, we have a lexical error.
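
The backup rule can be sketched with a small hand-coded automaton. This is an illustration only: the state names, the class BackupScanner, and the three-token language (integer literals, real literals, and "..") are assumptions, not the FA from the slide.

```java
import java.util.ArrayList;
import java.util.List;

class BackupScanner {
    // INT, REAL and DOTDOT are accepting; INT_DOT and DOT are not.
    enum State { START, INT, INT_DOT, REAL, DOT, DOTDOT, DEAD }

    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:   return digit ? State.INT : (c == '.' ? State.DOT : State.DEAD);
            case INT:     return digit ? State.INT : (c == '.' ? State.INT_DOT : State.DEAD);
            case INT_DOT: return digit ? State.REAL : State.DEAD;
            case REAL:    return digit ? State.REAL : State.DEAD;
            case DOT:     return c == '.' ? State.DOTDOT : State.DEAD;
            default:      return State.DEAD;
        }
    }

    static boolean accepting(State s) {
        return s == State.INT || s == State.REAL || s == State.DOTDOT;
    }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            State s = State.START;
            int i = pos;
            int lastAccept = -1; // end of the longest accepted prefix so far
            while (i < input.length()) {
                s = step(s, input.charAt(i));
                if (s == State.DEAD) break;
                i++;
                if (accepting(s)) lastAccept = i;
            }
            if (lastAccept < 0) {
                // No accepting state was reached, even after backing up.
                throw new IllegalArgumentException("lexical error at index " + pos);
            }
            tokens.add(input.substring(pos, lastAccept));
            pos = lastAccept; // characters backed up over are rescanned
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("10..100"));
        System.out.println(scan("10.50"));
    }
}
```

On 10..100 the inner loop reads "10." and stops in the non-accepting INT_DOT state; the scanner backs up to the end of "10", and the first period is rescanned as the start of "..".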

Performance Considerations

Because scanners do so much character-level processing, they can be a real performance bottleneck in production compilers.
Speed is not a concern in our project, but let’s see why scanning speed can be a concern in production compilers.
Let’s assume we want to compile at a rate of 1000 lines/sec. (so that most programs compile in just a few seconds).
Assuming 30 characters/line (on average), we need to scan 30,000 char/sec.

On a 30 SPECmark machine (30 million instructions/sec.), we have 1000 instructions per character to spend on all compiling steps.
If we allow 25% of compiling to be scanning (a compiler has a lot more to do than just scan!), that’s just 250 instructions per character.
A key to efficient scanning is to group character-level operations whenever possible. It is better to do one operation on n characters rather than n operations on single characters.
In our examples we’ve read input one character at a time. A subroutine call can cost hundreds or thousands of instructions to execute, far too much to spend on a single character.

We prefer routines that do block reads, putting an entire block of characters directly into a buffer.
Specialized scanner generators can produce particularly fast scanners.
The GLA scanner generator claims that the scanners it produces run as fast as:

    while(c != Eof) {
        c = getchar();
    }
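
A minimal Java sketch of the block-read idea, using the standard Reader.read(char[], int, int) call to fetch many characters per call. The class name BlockReadDemo, the method countLines, and the line-counting task are assumptions chosen to keep the example small.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BlockReadDemo {
    // Counts newline characters, fetching input one block at a time
    // instead of issuing one read() call per character.
    static int countLines(Reader in, int blockSize) {
        char[] buffer = new char[blockSize];
        int lines = 0;
        try {
            int n;
            // read() fills up to blockSize characters and returns the
            // number actually read, or -1 at end of input.
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') lines++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countLines(new StringReader("a\nb\nc\n"), 4));
    }
}
```

The inner loop works directly on the buffer, so the per-character cost is an array access rather than a method call.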

Lexical Error Recovery

A character sequence that can’t be scanned into any valid token is a lexical error.
Lexical errors are uncommon, but they still must be handled by a scanner. We won’t stop compilation because of so minor an error.
Approaches to lexical error handling include:

• Delete the characters read so far and restart scanning at the next unread character.

• Delete the first character read by the scanner and resume scanning at the character following it.

Both of these approaches are reasonable.
The first is easy to do. We just reset the scanner and begin scanning anew.
The second is a bit harder but also is a bit safer (less is immediately deleted). It can be implemented using scanner backup.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
(Why at the beginning?)
In these cases, the two approaches are equivalent.

The effects of lexical error recovery might well create a later syntax error, handled by the parser.
Consider

    ...for$tnight...

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier. In effect we get

    ...for tnight...

which will cause a syntax error. Such “false errors” are unavoidable, though a syntactic error-repair may help.
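
The for$tnight example can be reproduced with a toy scanner that deletes an illegal character and resumes. The token set here (identifiers made only of letters, blanks as separators) and the class name RecoveringScanner are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class RecoveringScanner {
    final List<String> tokens = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    // Toy token set: identifiers made of letters; blanks separate tokens;
    // every other character is treated as illegal.
    void scan(String input) {
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isLetter(c)) {
                int start = pos;
                while (pos < input.length()
                        && Character.isLetter(input.charAt(pos))) {
                    pos++;
                }
                tokens.add(input.substring(start, pos));
            } else if (c == ' ') {
                pos++;
            } else {
                // No valid token begins with c: report it, delete it,
                // and resume scanning at the following character.
                errors.add("illegal character '" + c + "' at index " + pos);
                pos++;
            }
        }
    }

    public static void main(String[] args) {
        RecoveringScanner s = new RecoveringScanner();
        s.scan("for$tnight");
        System.out.println(s.tokens);
        System.out.println(s.errors);
    }
}
```

The parser then sees for followed by the identifier tnight, which is the source of the later syntax error.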

Error Tokens

Certain lexical errors require special care. In particular, runaway strings and runaway comments ought to receive special error messages.
In Java strings may not cross line boundaries, so a runaway string is detected when an end of a line is read within the string body. Ordinary recovery rules are inappropriate for this error. In particular, deleting the first character (the double quote character) and restarting scanning is a bad decision.
It will almost certainly lead to a cascade of “false” errors as the string text is inappropriately scanned as ordinary input.

One way to handle runaway strings is to define an error token.
An error token is not a valid token; it is never returned to the parser. Rather, it is a pattern for an error condition that needs special handling.
We can define an error token that represents a string terminated by an end of line rather than a double quote character.
For a valid string, in which internal double quotes and back slashes are escaped (and no other escaped characters are allowed), we can use

    " ( Not( " | Eol | \ ) | \" | \\ )* "

For a runaway string we use

    " ( Not( " | Eol | \ ) | \" | \\ )* Eol

(Eol is the end of line character.)

When a runaway string token is recognized, a special error message should be issued.
Further, the string may be “repaired” into a correct string by returning an ordinary string token with the closing Eol replaced by a double quote.
This repair may or may not be “correct.” If the closing double quote is truly missing, the repair will be good; if it is present on a succeeding line, a cascade of inappropriate lexical and syntactic errors will follow.
Still, we have told the programmer exactly what is wrong, and that is our primary goal.
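
The valid-string and runaway-string definitions, and the repair step, can be tried out with java.util.regex. This is a sketch under assumptions: Eol is taken to be '\n', and the names StringTokenRules, classify and repair are hypothetical. A JLex specification would express the same two patterns as rules instead.

```java
import java.util.regex.Pattern;

class StringTokenRules {
    // Pieces of the string body, written as Java regex fragments.
    static final String NORMAL = "[^\"\\n\\\\]";        // Not( " | Eol | \ )
    static final String ESC_QUOTE = "\\\\\"";           // \"
    static final String ESC_BACKSLASH = "\\\\\\\\";     // \\
    static final String BODY =
        "(" + NORMAL + "|" + ESC_QUOTE + "|" + ESC_BACKSLASH + ")*";

    static final Pattern VALID = Pattern.compile("\"" + BODY + "\"");
    static final Pattern RUNAWAY = Pattern.compile("\"" + BODY + "\\n");

    // Classifies a complete candidate lexeme.
    static String classify(String text) {
        if (VALID.matcher(text).matches()) return "STRING";
        if (RUNAWAY.matcher(text).matches()) return "RUNAWAY_STRING";
        return "NONE";
    }

    // Repairs a runaway string by replacing the closing Eol with a quote.
    static String repair(String runaway) {
        return runaway.substring(0, runaway.length() - 1) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(classify("\"abc\""));
        System.out.println(classify("\"abc\n"));
        System.out.println(repair("\"abc\n"));
    }
}
```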

In languages like C, C++, Java and CSX, which allow multiline comments, improperly terminated (runaway) comments present a similar problem.
A runaway comment is not detected until the scanner finds a close comment symbol (possibly belonging to some other comment) or until the end of file is reached. Clearly a special, detailed error message is required.
Let’s look at Pascal-style comments that begin with a { and end with a }.
Comments that begin and end with a pair of characters, like /* and */ in Java, C and C++, are a bit trickier.

Correct Pascal comments are defined quite simply:

    { Not( } )* }

To handle comments terminated by Eof, this error token can be used:

    { Not( } )* Eof

We want to handle comments unexpectedly closed by a close comment belonging to another comment:

    {... missing close comment
    ... { normal comment }...

We will issue a warning (this form of comment is lexically legal).
Any comment containing an open comment symbol in its body is most probably a missing } error.

We split our legal comment definition into two token definitions.
The definition that accepts an open comment in its body causes a warning message ("Possible unclosed comment") to be printed.
We now use:

    { Not( { | } )* } and
    { (Not( { | } )* { Not( { | } )* )+ }

The first definition matches correct comments that do not contain an open comment in their body.
The second definition matches correct, but suspect, comments that contain at least one open comment in their body.
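
The comment definitions can be tried out with java.util.regex. In this sketch the first definition is taken as { Not( { | } )* }, so that it excludes open comment symbols as the text describes, and the Eof error token is approximated by matching to the end of the input string. The names CommentRules and classify are hypothetical.

```java
import java.util.regex.Pattern;

class CommentRules {
    // { Not( { | } )* }
    static final Pattern CORRECT = Pattern.compile("\\{[^{}]*\\}");
    // { (Not( { | } )* { Not( { | } )* )+ }
    static final Pattern SUSPECT =
        Pattern.compile("\\{([^{}]*\\{[^{}]*)+\\}");
    // { Not( } )* Eof, with Eof taken as the end of the string
    static final Pattern RUNAWAY = Pattern.compile("\\{[^}]*");

    // Classifies a complete candidate lexeme.
    static String classify(String text) {
        if (CORRECT.matcher(text).matches()) return "COMMENT";
        if (SUSPECT.matcher(text).matches()) {
            return "COMMENT (warning: Possible unclosed comment)";
        }
        if (RUNAWAY.matcher(text).matches()) return "RUNAWAY_COMMENT";
        return "NONE";
    }

    public static void main(String[] args) {
        System.out.println(classify("{ ok }"));
        System.out.println(classify("{ missing close { normal comment }"));
        System.out.println(classify("{ never closed"));
    }
}
```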

Single line comments, found in Java, CSX and C++, are terminated by Eol.
They can fall prey to a more subtle error: what if the last line has no Eol at its end?
The solution?
Another error token for single line comments:

    // Not(Eol)*

This rule will only be used for comments that don’t end with an Eol, since scanners always match the longest rule possible.
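
The longest-match argument can be checked with a short sketch. Eol is assumed to be '\n', and the names LineCommentRules, matchLength and longestRule are made up for illustration; a generated scanner makes this choice internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineCommentRules {
    static final Pattern WITH_EOL = Pattern.compile("//[^\\n]*\\n"); // normal rule
    static final Pattern NO_EOL = Pattern.compile("//[^\\n]*");      // error token

    // Length of the match at the start of input, or -1 if there is none.
    static int matchLength(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    // Returns the name of the rule with the longest match at the start of input.
    static String longestRule(String input) {
        int withEol = matchLength(WITH_EOL, input);
        int noEol = matchLength(NO_EOL, input);
        if (withEol < 0 && noEol < 0) return "NONE";
        return withEol > noEol ? "COMMENT" : "COMMENT_WITHOUT_EOL";
    }

    public static void main(String[] args) {
        System.out.println(longestRule("// hi\nx = 1"));
        System.out.println(longestRule("// last line, no Eol"));
    }
}
```

When an Eol is present the normal rule matches one character more than the error token, so the error token wins only on a final line that lacks an Eol.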