Lecture 3 Introduction to JLex: a lexical analyzer generator for Java


2

The role of JLex

[Figure: Lscanner.lex -> JLex -> Lscanner.java -> javac -> Lscanner.class; input.L (char stream) -> Lscanner.class -> L tokens … (token stream)]

3

JLex Specifications

user code
%% // must be at the beginning of a line
JLex directives
%% // must be at the beginning of a line
lexical rules

Each spec file consists of 3 sections, separated by %%:

» user code: copied to the output file.

» directives: include macro and state definitions, among others.

» the 3rd section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action.
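The three-section layout can be illustrated with a small Java sketch that cuts a specification text at every line beginning with %%. The class SpecSections and its split method are hypothetical helpers written for this illustration; they are not part of JLex, and the real tool does considerably more than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of JLex): splits a spec text into its sections.
class SpecSections {
    // A separator is any line that starts with "%%"; the separator line itself is dropped.
    static List<String> split(String spec) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : spec.split("\n")) {
            if (line.startsWith("%%")) {
                sections.add(current.toString());
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        sections.add(current.toString());
        return sections;
    }

    public static void main(String[] args) {
        String spec = "import java.io.*;\n%%\n%line\nDIGIT=[0-9]\n%%\n{DIGIT}+ { }\n";
        List<String> parts = split(spec);
        System.out.println("user code:  " + parts.get(0));
        System.out.println("directives: " + parts.get(1));
        System.out.println("rules:      " + parts.get(2));
    }
}
```

A %% that does not start its line is treated as ordinary text, which is why the slide stresses that the separator must be at the beginning of a line.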

4

The layout of the generated file

%userCode // from 1st section: package, import decls + utility classes
%public class %class [implements %implements] {
  %internalCode // from %{ … %} directive
  // 2 constructors
  %public %class(InputStream is) [throws %initthrow] {
    [%initCode] // from %init{ … %init} directive
    …
  }
  // and
  %public %class(Reader is) throws … { … }
  // main method for requesting the next token
  %public %type %function() [throws %yylexthrow] {
    … // if eof => return (%eofValue) …
  }
  // method to be called after eof is encountered
  private void yy_do_eof() [throws %eofthrow] { ... %eofCode ... }
  …
}

5

JLex Directives

1 Internal Code to Lexical Analyzer Class
2 Initialization Code for Lexical Analyzer Class
3 End-of-File Code for Lexical Analyzer Class
4 Macro Definitions
5 State Declarations
6 Character Counting
7 Line Counting
8 Java CUP Compatibility
9 Lexical Analyzer Component Titles
10 Default Token Type: int
11 Default Token Type II: Wrapped Integer
12 YYEOF on End-of-File
13 Newlines and Operating System Compatibility
14 Character Sets
15 Character Format To and From File
16 Exceptions Generated by Lexical Actions
17 Specifying the Return Value on End-of-File
18 Specifying an interface to implement
19 Making the Generated Class Public

6

Directives for determining the names of various components of the lexer.

The name of the generated class (as well as the file name):
%class className // default is Yylex

The interface the lexer class would implement:
%implements interfaceName

The name and return type of the method to get the next token:
%function methodName // default is yylex
%type typeName // default is Yytoken

Make the lexer class public:
%public

7

Directives for position information

Enabling the counting of character position:

%char // private int yychar declared

Enabling the counting of line information:

%line // private int yyline declared

Notes:

1. yychar and yyline are zero-based.

2. yychar is used to record the position of the beginning of the current token in the input stream.

3. yylength (always enabled) is used to record the length of the text the current token consumes.
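How the three counters relate can be sketched in plain Java. The class below is a hypothetical stand-in for the bookkeeping of a generated lexer (the names PositionDemo and advance are assumptions made for this example). It also assumes that yyline, like yychar, refers to the start of the current token.

```java
// Hypothetical stand-in for the position bookkeeping enabled by %char and %line.
class PositionDemo {
    int yychar = 0;           // zero-based offset of the current token's first character
    int yyline = 0;           // zero-based line number at the start of the current token
    private int length = 0;   // length of the current token's text
    private int nextChar = 0; // offset just past the previous token
    private int nextLine = 0; // line number just past the previous token

    int yylength() { return length; }

    // Records the position of a token whose matched text is `text`.
    void advance(String text) {
        yychar = nextChar;
        yyline = nextLine;
        length = text.length();
        nextChar += text.length();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') nextLine++;
        }
    }

    public static void main(String[] args) {
        PositionDemo p = new PositionDemo();
        for (String tok : new String[] {"ab", "\n", "cde"}) {
            p.advance(tok);
            System.out.println("yychar=" + p.yychar + " yyline=" + p.yyline
                    + " yylength=" + p.yylength());
        }
    }
}
```

For the token sequence "ab", newline, "cde", the last token starts at offset 3 on line 1, which shows the zero-based convention: the first line is line 0.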

8

Java code to be put in various parts of the generated file

» user code to be put outside the lexer class: [all text from 1st section] // before first %%

» user code to be put inside the lexer class

» user code to be put inside the constructors of the lexer class

» user code to be put inside the body of the yy_do_eof() method

» value to be returned when eof is encountered

9

User code to be put inside the lexer class

format:

%{ // at the beginning of line

<internal code>

%} // at the beginning of line

Permits the declaration of variables and methods inside the generated lexer class.

Corresponds to the %internalCode region.

10

User Code to be put inside all constructors of the lexer class

format:

%init{ // at the beginning of line

<initCode>

%init} // at the beginning of line

Corresponds to the %initCode region.

Exceptions thrown should be declared by the directive:

%initthrow{

Exception0 , …, ExceptionN

%initthrow} // corresponds to %initthrow region

11

Directives for Specifying the input alphabet

%full

%unicode

» default alphabet is ASCII (0~127)

» %full => 0~255; %unicode => 0~65535.

%ignorecase

» upper case and lower case letters are regarded as the same.

12

Directives related to eof processing

Specifying the Return Value on End-of-File

%eofval{

eofValue

%eofval}

YYEOF on End-of-File

%yyeof

notes:

» Enables the decl: public final int YYEOF=-1; in lexer

» implied by the dir: %integer

13

User Code to be executed when end_of_file is encountered

format:

%eof{ // at the beginning of line

<eofCode>

%eof} // at the beginning of line

Corresponds to the %eofCode region.

Exceptions thrown should be declared by the directive:

%eofthrow{

Exception0 , …, ExceptionN

%eofthrow} // corresponds to %eofthrow region

14

Specifying the type of the returned token

%type typeName

%integer // equ to %type int

%intwrap // equ to %type java.lang.Integer

Notes:

1. Default type is Yytoken (which needs to be declared elsewhere, say, in user code)

2. null will be returned for eof token if the returned type is not primitive.

3. YYEOF (-1) will be returned for %integer.

15

Java CUP Compatibility

%cup

This directive makes the generated scanner conform to the java_cup.runtime.Scanner interface.

It has the same effect as the following three directives:

%implements java_cup.runtime.Scanner

%function next_token

%type java_cup.runtime.Symbol

16

Newlines and Operating System Compatibility

A new line is represented differently in UNIX and DOS based OSs.

unix => \n

dos => \r\n

The directive %notunix causes the lexer to recognize either \r or \n as a new line.

17

Exceptions Generated by Lexical Actions

Format:

%yylexthrow{

Exception0,…,ExceptionN

%yylexthrow}

Notes:

1. mapped to the %yylexthrow region.

2. these are the Exceptions that may be thrown from within the action code of lexical rules.

18

State Declarations

Format:

%state state0,…, stateN

Notes:

1. state0,..stateN must be on the same line.

2. A spec can have more than one %state declaration.

3. State names should be valid identifiers.

4. Each stateK will be declared as an int constant in the lexer class.

5. A special state YYINITIAL is implicitly declared and the lexer begins its analysis in this state.

19

Macro Definitions

Used to name and define sets of strings for later use in lexical rules.

format:

MacroName = MacroDefinition

Notes:

1. Each macro definition is contained on a single line.

2. MacroName should be a valid id: (letter|_)(letter|digit|_)*

3. MacroDefinition should be a valid regular expression (to be defined later).

4. MacroDefinition may contain other macro expansions in the form {otherMacroName}, but recursion is not permitted.
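The expansion of {otherMacroName} references can be sketched in Java. MacroExpander is a hypothetical class written for this illustration, and it assumes plain textual substitution; JLex's own implementation may differ. Because each definition is expanded against the macros defined before it, a macro can never refer to itself, which mirrors the no-recursion rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of {name} expansion; not JLex code.
class MacroExpander {
    // A macro use is a valid identifier wrapped in curly braces.
    private static final Pattern USE = Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");
    private final Map<String, String> macros = new LinkedHashMap<>();

    // Defines a macro; references to earlier macros are expanded immediately.
    void define(String name, String definition) {
        macros.put(name, expand(definition));
    }

    // Replaces every {name} in `regex` with the text of the macro called name.
    String expand(String regex) {
        Matcher m = USE.matcher(regex);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            out.append(regex, last, m.start()).append(body);
            last = m.end();
        }
        return out.append(regex.substring(last)).toString();
    }

    public static void main(String[] args) {
        MacroExpander e = new MacroExpander();
        e.define("ALPHA", "[A-Za-z]");
        e.define("DIGIT", "[0-9]");
        System.out.println(e.expand("{ALPHA}({ALPHA}|{DIGIT}|_)*"));
    }
}
```

This sketch ignores quoting and escaping, so a {name} inside double quotes would also be expanded here, which a real implementation would need to avoid.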

20

Lexical Rules

Format:

[<state1,…,stateN>] expression { actionCode }

Notes:

1. All stateKs must have been declared by %state.

2. The rule will be activated only when the lexer is in one of the states listed in the state list.

» if the state list is omitted, the rule is always activated.

3. The intuitive meaning of the rule is as follows:

» if the lexer is in one of the states in the list and the substring from the current position matches the expression, then execute the actionCode.

21

Conflict resolution

What happens if more than one rule matches strings from its input?

1. Choose the rule that matches the longest string.

2. If more than one rule matches strings of the same length, then choose the rule that is given first in the JLex specification.

Therefore, rules appearing earlier in the specification are given a higher priority by the generated lexer.
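The two tie-breaking steps can be sketched in Java. RuleChooser is a hypothetical class for this illustration, and java.util.regex stands in for the automaton that JLex would generate. Java regexes are not guaranteed to return the longest match for every pattern, so the sketch is only faithful for simple rules such as the ones used here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "longest match first, then earliest rule".
class RuleChooser {
    // Returns the index of the winning rule at position `pos`, or -1 if no rule matches.
    static int choose(String[] rules, String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = Pattern.compile(rules[i]).matcher(input);
            m.region(pos, input.length());
            // A strictly longer match replaces the current winner; on a tie the
            // earlier rule is kept because '>' rejects equal lengths.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                best = i;
                bestLen = m.end() - pos;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] rules = {"if", "[a-z]+", "<", "<="};
        System.out.println(choose(rules, "if", 0));  // rules 0 and 1 tie; rule 0 is first
        System.out.println(choose(rules, "ifx", 0)); // rule 1 matches the longer string
        System.out.println(choose(rules, "<=", 0));  // rule 3 matches the longer string
    }
}
```

This is why a keyword rule is normally written before the general identifier rule: both match a keyword with the same length, and the earlier rule wins.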

22

Regular Expressions

The alphabet for JLex is the ASCII character set, meaning character codes between 0 and 127 inclusive.

Non-newline white space in expressions is not allowed unless within double quotes " … " or immediately after \.

Metacharacters are chars with special meanings in JLex regular expressions:

? * + | ( ) ^ $ . [ ] { } " \

Other chars represent themselves.

23

Escape sequences for characters

\ddd    The character with number (ddd)8
\xdd    The character with number (dd)16
\udddd  The Unicode character with number (dddd)16
\b      Backspace
\n      newline
\t      Tab
\f      Formfeed
\r      Carriage return
\^C     Control character (0~31: \^@, \^A,…Z,[,\,],^,_)
\c      A backslash followed by any other character c matches itself. Ex: \\, \a, \B, \", \', etc.

$ denotes the end of a line.

. matches any character except the newline, equ to [^\n].
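The escape sequences can be made concrete with a small decoder. EscapeDecoder is a hypothetical class for this illustration, not JLex code. It assumes that \ddd always carries exactly three octal digits, \xdd two hex digits, and \udddd four hex digits.

```java
// Hypothetical decoder for the escape sequences of JLex regular expressions.
class EscapeDecoder {
    // Decodes one escape sequence; `s` must start with a backslash.
    static char decode(String s) {
        if (s.length() < 2 || s.charAt(0) != '\\') {
            throw new IllegalArgumentException("not an escape: " + s);
        }
        char c = s.charAt(1);
        switch (c) {
            case 'b': return '\b';
            case 'n': return '\n';
            case 't': return '\t';
            case 'f': return '\f';
            case 'r': return '\r';
            case 'x': return (char) Integer.parseInt(s.substring(2, 4), 16);
            case 'u': return (char) Integer.parseInt(s.substring(2, 6), 16);
            case '^': return (char) (s.charAt(2) - '@'); // \^@ is 0, \^A is 1, ...
            default:
                if (c >= '0' && c <= '7') {
                    return (char) Integer.parseInt(s.substring(1, 4), 8);
                }
                return c; // any other character stands for itself
        }
    }

    public static void main(String[] args) {
        System.out.println((int) decode("\\012"));   // octal 012
        System.out.println(decode("\\x41"));         // hex 41
        System.out.println((int) decode("\\^A"));    // control character
        System.out.println(decode("\\\""));          // escaped double quote
    }
}
```

Note that \012 and \n denote the same character, which is why a definition such as [\n\ \t\b\012] lists the newline twice.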

24

More on regular expressions

"…aString…" denotes aString.

» Metacharacters in aString lose their meaning and represent themselves.

» The sequence \" which represents " is the only exception.

» Ex: "ab d\\\"" stands for ab d\\"

{name} denotes a macro expansion

E1E2 : concatenation

E1|E2 : choice

E+ or (E)+ : one or more repetitions of E

E* or (E)* : zero or more repetitions of E

E? or (E)? : zero or one repetition of E

(E) : (..) is used for grouping.

25

More on regular expressions

[...]

» Square brackets denote a class of characters and match any one character enclosed in the brackets.

Substrings inside with special meaning:

» {name} : macro expansion

» a-b : range of characters from a to b.

» "String" means String with metachars losing their special meaning.

» \c means c, where c is any character.

» [^Rest] means the complement of [Rest], i.e. any character not in [Rest].

26

More on regular expressions

Ex:

» [a-z] matches a,b,…,z.

» [^0-9] matches any char but 0,1,…,9.

» [\"\\] matches " or \.

» ["a-z"] matches a, - and z.

» [-0-9] matches -,0,..,9.

» how about [\b\f"\r\t"] ?

27

Lexical Actions

format: { action }

notes:

All curly braces contained in action that are not part of strings or comments should be balanced.

Actions and Recursion:

If no value is returned in an action, the lexical analyzer will search for the next match from the input stream and return the value associated with that match.

The lexical analyzer can be made to recur explicitly with a call to yylex(), as in the following code fragment.

{ ... return yylex(); ... }
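The behavior of actions that do not return can be sketched with a hand-written loop. SkipDemo is a hypothetical class for this illustration; it is not generated by JLex, and its three hard-coded "rules" (white space, digit runs, any other single character) are assumptions chosen to keep the example short.

```java
// Hypothetical hand-written analogue of a lexer whose white-space action returns nothing.
class SkipDemo {
    private final String input;
    private int pos = 0;

    SkipDemo(String input) { this.input = input; }

    // Returns the text of the next token, or null at end of input.
    String yylex() {
        while (pos < input.length()) {
            int start = pos;
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;    // action without a return: keep searching for the next match
                continue;
            }
            if (Character.isDigit(c)) {
                while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                    pos++;
                }
            } else {
                pos++;    // any other single character is a token of its own
            }
            return input.substring(start, pos); // action with a return ends the call
        }
        return null; // end of input for a non-primitive token type
    }

    public static void main(String[] args) {
        SkipDemo d = new SkipDemo("  12 +3");
        for (String t = d.yylex(); t != null; t = d.yylex()) {
            System.out.println(t);
        }
    }
}
```

The caller never sees the white space: each call to yylex() keeps matching until some action produces a value.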

28

More on lexical actions

State transitions are made by the function call:

yybegin(state);

Available lexical methods / vars:

String yytext()

Matched portion of the character input stream

int yylength()

length of yytext()

int yychar;

int yyline;
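State switching with yybegin can be sketched by a hand-written analogue that strips nested /* ... */ comments. StateDemo is a hypothetical class for this illustration. In a JLex-generated lexer the current state is kept internally and yybegin is provided by the generated class; here both are written out by hand.

```java
// Hypothetical hand-written analogue of a two-state lexer (YYINITIAL and COMMENT).
class StateDemo {
    static final int YYINITIAL = 0;
    static final int COMMENT = 1;

    private int state = YYINITIAL; // analysis begins in YYINITIAL
    private int commentCount = 0;  // nesting depth of open comments

    void yybegin(int newState) { state = newState; }

    // Returns the input with (possibly nested) comments removed.
    String strip(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("/*", i)) {
                commentCount++;
                yybegin(COMMENT);
                i += 2;
            } else if (state == COMMENT && input.startsWith("*/", i)) {
                commentCount--;
                if (commentCount == 0) yybegin(YYINITIAL);
                i += 2;
            } else {
                // Ordinary text is kept only outside comments.
                if (state == YYINITIAL) out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new StateDemo().strip("a/*b/*c*/d*/e"));
    }
}
```

The counter is what allows nesting: the lexer returns to YYINITIAL only when the outermost comment is closed.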

29

Performance

Size of Source File    JLex-generated Lexer Execution Time    Hand-Written Lexer Execution Time
177 lines              0.42 seconds                           0.53 seconds
897 lines              0.98 seconds                           1.28 seconds

The JLex lexical analyzer soundly outperformed the hand-written lexer!!

30

Example

import java.lang.System;

class Sample {
  public static void main(String argv[]) throws java.io.IOException {
    Yylex yy = new Yylex(System.in);
    Yytoken t;
    // yylex() returns null at end of file because Yytoken is not a primitive type
    while ((t = yy.yylex()) != null)
      System.out.println(t);
  }
}

31

class Utility {
  // Note: assert has been a reserved word since Java 1.4; with a newer compiler
  // this method and its call sites need a different name.
  public static void assert(boolean expr) {
    if (false == expr) {
      throw (new Error("Error: Assertion failed."));
    }
  }

  private static final String errorMsg[] = {
    "Error: Unmatched end-of-comment punctuation.",
    "Error: Unmatched start-of-comment punctuation.",
    "Error: Unclosed string.",
    "Error: Illegal character."
  };

  public static final int E_ENDCOMMENT = 0;
  public static final int E_STARTCOMMENT = 1;
  public static final int E_UNCLOSEDSTR = 2;
  public static final int E_UNMATCHED = 3;

  public static void error(int code) {
    System.out.println(errorMsg[code]);
  }
}

32

class Yytoken {
  Yytoken(int index, String text, int line, int charBegin, int charEnd) {
    m_index = index;
    m_text = new String(text);
    m_line = line;
    m_charBegin = charBegin;
    m_charEnd = charEnd;
  }

  public int m_index;
  public String m_text;
  public int m_line;
  public int m_charBegin;
  public int m_charEnd;

  public String toString() {
    return "Token #"+m_index+": "+m_text+" (line "+m_line+")";
  }
}

33

%%
%{
  private int comment_count = 0;
%}
%line
%char
%state COMMENT
ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
%%

34

<YYINITIAL> "," { return (new Yytoken(0,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":" { return (new Yytoken(1,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ";" { return (new Yytoken(2,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "(" { return (new Yytoken(3,yytext(),yyline,yychar,yychar+1)); }
…
<YYINITIAL> "<>" { return (new Yytoken(15,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "<" { return (new Yytoken(16,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "<=" { return (new Yytoken(17,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "|" { return (new Yytoken(21,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":=" { return (new Yytoken(22,yytext(),yyline,yychar,yychar+2)); }

35

<YYINITIAL> {NONNEWLINE_WHITE_SPACE_CHAR}+ { }

<YYINITIAL,COMMENT> \n { }

<YYINITIAL> "/*" { yybegin(COMMENT);

comment_count = comment_count + 1; }

<COMMENT> "/*" { comment_count = comment_count + 1; }

<COMMENT> "*/" {

comment_count = comment_count - 1;

Utility.assert(comment_count >= 0);

if (comment_count == 0) {yybegin(YYINITIAL);}}

<COMMENT> {COMMENT_TEXT} { }

<YYINITIAL> \"{STRING_TEXT}\" {

String str = yytext().substring(1,yytext().length() - 1);

Utility.assert(str.length() == yytext().length() - 2);

return (new Yytoken(40,str,yyline,yychar,yychar + str.length())); }

36

<YYINITIAL> \"{STRING_TEXT} {

String str = yytext().substring(1,yytext().length());

Utility.error(Utility.E_UNCLOSEDSTR);

Utility.assert(str.length() == yytext().length() - 1);

return (new Yytoken(41,str,yyline,yychar,yychar + str.length()));}

<YYINITIAL> {DIGIT}+ {

return (new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length()));}

<YYINITIAL> {ALPHA}({ALPHA}|{DIGIT}|_)* {

return (new Yytoken(43,yytext(),yyline,yychar,yychar + yytext().length())); }

<YYINITIAL,COMMENT> . {

System.out.println("Illegal character: <" + yytext() + ">");

Utility.error(Utility.E_UNMATCHED);}
