1
Lecture 3
Introduction to JLex: a lexical analyzer generator for Java
2
The role of JLex
[Figure: Lscanner.lex -> JLex -> Lscanner.java -> javac -> Lscanner.class; the generated scanner reads input.L as a character stream and emits a token stream of L tokens.]
3
JLex Specifications
user code
%% // must be at the beginning of a line
JLex directives
%% // must be at the beginning of a line
lexical rules
Each spec file consists of 3 sections, separated by %%:
» The user code is copied to the output file.
» The directives include macro and state definitions, among others.
» The 3rd section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action.
4
The layout of the generated file
%userCode // from 1st section: package, import decls + utility classes
%public class %class [implements %implements] {
    %internalCode // from %{ … %} directive
    // 2 constructors
    %public %class(InputStream is) [throws %initthrow] {
        %initCode // from %init{ … %init} directive
        …
    }
    // and
    %public %class(Reader is) throws … { … }
    // main method for requesting the next token
    %public %type %function() [throws %yylexthrow] {
        … // if eof => return (%eofValue) …
    }
    // method to be called after eof is encountered
    private void yy_do_eof() [throws %eofthrow] { ... %eofCode ... }
    …
}
5
JLex Directives
1 Internal Code to Lexical Analyzer Class
2 Initialization Code for Lexical Analyzer Class
3 End-of-File Code for Lexical Analyzer Class
4 Macro Definitions
5 State Declarations
6 Character Counting
7 Line Counting
8 Java CUP Compatibility
9 Lexical Analyzer Component Titles
10 Default Token Type: int
11 Default Token Type II: Wrapped Integer
12 YYEOF on End-of-File
13 Newlines and Operating System Compatibility
14 Character Sets
15 Character Format To and From File
16 Exceptions Generated by Lexical Actions
17 Specifying the Return Value on End-of-File
18 Specifying an interface to implement
19 Making the Generated Class Public
6
Directives for determining the names of various components of the lexer.
The name of the generated class (as well as the file name):
%class className // default is Yylex
The interface the lexer class implements:
%implements interfaceName
The name and return type of the method that gets the next token:
%function methodName // default is yylex
%type typeName // default is Yytoken
Make the lexer class public:
%public
7
Directives for position information
Enabling the counting of character position:
%char // private int yychar declared
Enabling the counting of line information:
%line // private int yyline declared
Notes:
1. yychar and yyline are zero-based.
2. yychar is used to record the position of the beginning of the current token in the input stream.
3. yylength (always enabled) is used to record the length of the text the current token consumes.
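The bookkeeping described in these notes can be sketched in plain Java. This is a minimal hand-written stand-in, not JLex output: the class name PositionSketch, the method scan, and the whitespace-separated token rule are all assumptions made for illustration. It only shows how a zero-based yychar (offset of the token start in the input stream), a zero-based yyline, and a token length relate to each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated lexer; only the position bookkeeping is shown.
class PositionSketch {
    /** Returns one "text@yyline:yychar+yylength" entry per whitespace-separated token. */
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int yyline = 0;   // zero-based line of the current token
        int pos = 0;      // current scan position in the input
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\n') { yyline++; pos++; continue; }
            if (Character.isWhitespace(c)) { pos++; continue; }
            int yychar = pos;   // zero-based offset where the token begins
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            int yylength = pos - yychar;   // number of characters the token consumes
            out.add(input.substring(yychar, pos) + "@" + yyline + ":" + yychar + "+" + yylength);
        }
        return out;
    }
}
```

For the input "ab cd\nefg" the sketch reports ab@0:0+2, cd@0:3+2 and efg@1:6+3. Note that yychar keeps counting across the newline; it is an offset into the stream, not a column.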
8
Java code to be placed in various parts of the generated file
» User code to be put outside the lexer class: all text from the 1st section (before the first %%)
» User code to be put inside the lexer class: %{ … %}
» User code to be put inside the constructors of the lexer class: %init{ … %init}
» User code to be put inside the body of the yy_do_eof() method: %eof{ … %eof}
» Value to be returned when eof is encountered: %eofval{ … %eofval}
9
User code to be put inside the lexer class
format:
%{ // at the beginning of line
<internal code>
%} // at the beginning of line
Permits the declaration of variables and methods inside the generated lexer class.
Corresponds to the %internalCode region.
10
User Code to be put inside all constructors of the lexer class
format:
%init{ // at the beginning of line
<initCode>
%init} // at the beginning of line
Corresponds to the %initCode region.
Exceptions thrown should be declared by the directive:
%initthrow{
Exception0 , …, ExceptionN
%initthrow} // corresponds to %initthrow region
11
Directives for Specifying the input alphabet
%full
%unicode
The default alphabet is ASCII (0~127); %full => 0~255; %unicode => 0~65535.
%ignorecase
Upper case and lower case letters are regarded as the same.
12
Directives related to eof processing
Specifying the Return Value on End-of-File
%eofval{
eofValue
%eofval}
YYEOF on End-of-File
%yyeof
Notes:
» Enables the declaration: public final int YYEOF = -1; in the lexer
» Implied by the directive: %integer
13
User Code to be executed when end_of_file is encountered
format:
%eof{ // at the beginning of line
<eofCode>
%eof} // at the beginning of line
Corresponds to the %eofCode region.
Exceptions thrown should be declared by the directive:
%eofthrow{
Exception0 , …, ExceptionN
%eofthrow} // corresponds to %eofthrow region
14
Specifying the type of the returned token
%type typeName
%integer // equ to %type int
%intwrap // equ to %type java.lang.Integer
Notes:
1. Default type is Yytoken (needs to be declared elsewhere, say, in user code)
2. null will be returned for eof token if the returned type is not primitive.
3. YYEOF (-1) will be returned for %integer.
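The choice of token type changes how a caller detects end of file. The sketch below uses two hand-written stand-ins rather than generated code: IntScannerSketch, ObjScannerSketch and EofLoopSketch are hypothetical names, and the scanners simply replay a prepared token list. It is only meant to contrast the two driver loops implied by notes 2 and 3.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a scanner generated with %integer: tokens are ints, eof is YYEOF.
class IntScannerSketch {
    static final int YYEOF = -1;
    private final Iterator<Integer> it;
    IntScannerSketch(List<Integer> tokens) { it = tokens.iterator(); }
    int yylex() { return it.hasNext() ? it.next() : YYEOF; }
}

// Hypothetical stand-in for a scanner with a non-primitive token type: eof is null.
class ObjScannerSketch {
    private final Iterator<String> it;
    ObjScannerSketch(List<String> tokens) { it = tokens.iterator(); }
    String yylex() { return it.hasNext() ? it.next() : null; }
}

class EofLoopSketch {
    /** Driver loop for an int-typed scanner: stop on YYEOF. */
    static int countInts(IntScannerSketch s) {
        int n = 0;
        while (s.yylex() != IntScannerSketch.YYEOF) n++;
        return n;
    }

    /** Driver loop for an object-typed scanner: stop on null. */
    static int countObjects(ObjScannerSketch s) {
        int n = 0;
        while (s.yylex() != null) n++;
        return n;
    }
}
```

With an int-typed scanner, token codes have to avoid -1 so that they cannot be confused with YYEOF.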
15
Java CUP Compatibility
%cup
This directive makes the generated scanner conform to the java_cup.runtime.Scanner interface.
It has the same effect as the following three directives:
%implements java_cup.runtime.Scanner
%function next_token
%type java_cup.runtime.Symbol
16
Newlines and Operating System Compatibility
A new line is represented differently in UNIX and DOS-based OSs.
unix => \n
dos => \r\n
The directive %notunix causes the lexer to recognize either \r or \n as a new line.
17
Exceptions Generated by Lexical Actions
Format:
%yylexthrow{
Exception0,…,ExceptionN
%yylexthrow}
Notes:
1. Mapped to the %yylexthrow region.
2. These are the exceptions that may be thrown from within the action code of lexical rules.
18
State Declarations
Format:
%state state0,…, stateN
Notes:
1. state0,…,stateN must be on the same line.
2. A specification can have more than one %state declaration.
3. State names should be valid identifiers.
4. Each stateK will be declared as an int constant in the lexer class.
5. A special state YYINITIAL is implicitly declared and the lexer begins its analysis in this state.
19
Macro Definitions
Used to name and define sets of strings for later use in lexical rules.
Format:
MacroName = MacroDefinition
Notes:
1. Each macro definition is contained on a single line.
2. MacroName should be a valid identifier: (letter|_)(letter|digit|_)*
3. MacroDefinition should be a valid regular expression (to be defined later).
4. MacroDefinition may contain other macro expansions in the form {otherMacroName}, but recursion is not permitted.
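One way to picture macro expansion is as textual substitution of {name} by a previously defined expression. The sketch below is an illustration under that assumption, not a description of how JLex implements macros internally. MacroSketch, expand and sampleMacros are hypothetical names, the macro IDENT is invented for the example, and the expanded text is fed to java.util.regex, whose syntax only partly overlaps with JLex's.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroSketch {
    // A macro use: {name}, where name is (letter|_)(letter|digit|_)*
    private static final Pattern USE = Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces each {name} with its parenthesized definition; undefined names are an error. */
    static String expand(String regex, Map<String, String> macros) {
        Matcher m = USE.matcher(regex);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String def = macros.get(m.group(1));
            if (def == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            // quoteReplacement keeps backslashes and $ in the definition literal.
            m.appendReplacement(sb, Matcher.quoteReplacement("(" + def + ")"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    static Map<String, String> sampleMacros() {
        Map<String, String> macros = new LinkedHashMap<>();
        macros.put("ALPHA", "[A-Za-z]");
        macros.put("DIGIT", "[0-9]");
        // A definition may use earlier macros; expanding at definition time rules out recursion.
        macros.put("IDENT", expand("{ALPHA}({ALPHA}|{DIGIT}|_)*", macros));
        return macros;
    }
}
```

Here sampleMacros().get("IDENT") is the string ([A-Za-z])(([A-Za-z])|([0-9])|_)*, which accepts x_1 and rejects 1x. Because every definition is fully expanded before it is stored, a macro can never refer to itself.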
20
Lexical Rules
Format:
[<state1,…,stateN>] expression { actionCode }
Notes:
1. All stateKs must have been declared by %state.
2. The rule is activated only when the lexer is in one of the states listed in the state list.
» If the state list is omitted, the rule is always activated.
3. The intuitive meaning of the rule is as follows:
» If the lexer is in one of the states in the list and the substring from the current position matches the expression, then execute the actionCode.
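The activation test in note 2 is small enough to write down directly. The following sketch models only that test, with state names as strings for readability (the generated lexer uses int constants). StateRuleSketch and isActive are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StateRuleSketch {
    final String name;
    final Set<String> states;   // an empty set models an omitted state list

    StateRuleSketch(String name, String... states) {
        this.name = name;
        this.states = new HashSet<>(Arrays.asList(states));
    }

    /** A rule takes part in matching only if its state list is omitted or names the current state. */
    boolean isActive(String currentState) {
        return states.isEmpty() || states.contains(currentState);
    }
}
```

A rule declared for YYINITIAL and COMMENT is active in COMMENT, a rule declared only for YYINITIAL is not, and a rule with no state list is active everywhere.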
21
Conflict resolution
What happens if more than one rule matches strings from the input?
1. Choose the rule that matches the longest string.
2. If more than one rule matches strings of the same length, then choose the rule that is given first in the JLex specification.
Therefore, rules appearing earlier in the specification are given a higher priority by the generated lexer.
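The two resolution steps can be sketched with java.util.regex. This is an approximation for illustration only: a generated lexer uses a DFA, whereas the sketch tries each pattern in turn and compares match lengths, which gives the same answer for simple patterns like the ones in the usage note. ConflictSketch and pickRule are hypothetical names, and empty matches are ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConflictSketch {
    /** Returns the index of the winning rule at position pos, or -1 if no rule matches. */
    static int pickRule(Pattern[] rules, String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = rules[i].matcher(input);
            m.region(pos, input.length());
            // A strictly longer match wins; on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                best = i;
                bestLen = m.end() - pos;
            }
        }
        return best;
    }
}
```

With the rules `<`, `<=`, `if` and `[a-z]+` in that order, the input `<=` selects rule 1 (longest match), `if` selects rule 2 (same length as `[a-z]+`, but listed first), and `iffy` selects rule 3.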
22
Regular Expressions
The alphabet for JLex is by default the ASCII character set, meaning character codes between 0 and 127 inclusive.
Non-newline white space is not allowed in expressions unless it is within double quotes " … " or immediately after \.
Metacharacters are characters with special meanings in JLex regular expressions:
? * + | ( ) ^ $ . [ ] { } " \
Other characters represent themselves.
23
Escape sequences for characters
\ddd The character with number ddd (base 8)
\xdd The character with number dd (base 16)
\udddd The Unicode character with number dddd (base 16)
\b Backspace
\n Newline
\t Tab
\f Formfeed
\r Carriage return
\^C Control character (0~31: \^@, \^A, …, \^Z, \^[, \^\, \^], \^^, \^_)
\c A backslash followed by any other character c matches itself. Ex: \\, \a, \B, \", \', etc.
$ denotes the end of a line.
. matches any character except the newline; equivalent to [^\n].
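The numeric escapes are easiest to check by computing the character codes. The helper below is a small sketch with hypothetical names (EscapeSketch, octal, hex, control). The rule used for \^C, "code of C minus 64", is inferred from the range 0~31 running from \^@ to \^_ rather than quoted from the JLex documentation.

```java
class EscapeSketch {
    /** \ddd : the character whose code is ddd read as an octal number. */
    static char octal(String ddd) {
        return (char) Integer.parseInt(ddd, 8);
    }

    /** \xdd and \udddd : the character whose code is the digits read as hexadecimal. */
    static char hex(String digits) {
        return (char) Integer.parseInt(digits, 16);
    }

    /** \^C : control character with code (C - 64), so \^@ is 0 and \^_ is 31. */
    static char control(char c) {
        return (char) (c - 64);
    }
}
```

For instance, octal("012") is character 10 (line feed), hex("41") is 'A', and control('I') is character 9 (tab).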
24
More on regular expressions
"…aString…" denotes aString.
» Metacharacters in aString lose their meaning and represent themselves.
» The sequence \" which represents " is the only exception.
» Ex: "ab d\\\"" stands for ab d\\"
{name} denotes a macro expansion.
E1E2 : concatenation
E1|E2 : choice
E+ or (E)+ : one or more repetitions of E
E* or (E)* : zero or more repetitions of E
E? or (E)? : zero or one repetition of E
(E) : ( … ) is used for grouping.
25
More on regular expressions
[...]
» Square brackets denote a class of characters and match any one character enclosed in the brackets.
Substrings inside with special meaning:
» {name} : macro expansion
» a-b : range of characters from a to b
» "String" means String, with metacharacters losing their special meaning
» \c means c, where c is any character
» [^Rest] means the complement of [Rest], that is, any character not in [Rest]
26
More on regular expressions
Ex:
» [a-z] matches a,b,…,z.
» [^0-9] matches any char but 0,1,…,9.
» [\"\\] matches " or \.
» ["a-z"] matches a, - and z.
» [-0-9] matches -,0,…,9.
» How about [\b\f"\r\t"] ?
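For readers who know java.util.regex better than JLex, the first five examples can be restated as Java patterns that accept the same characters. This is a comparison sketch, not JLex syntax: Java has no quoted strings inside a class, so the JLex class ["a-z"] is written with an escaped dash. CharClassSketch and its constants are hypothetical names.

```java
import java.util.regex.Pattern;

class CharClassSketch {
    // Java regex spellings chosen to accept the same characters as the JLex classes.
    static final Pattern LOWER       = Pattern.compile("[a-z]");      // JLex [a-z]
    static final Pattern NON_DIGIT   = Pattern.compile("[^0-9]");     // JLex [^0-9]
    static final Pattern QUOTE_SLASH = Pattern.compile("[\"\\\\]");   // JLex [\"\\]
    static final Pattern A_DASH_Z    = Pattern.compile("[a\\-z]");    // JLex ["a-z"]
    static final Pattern DASH_DIGIT  = Pattern.compile("[-0-9]");     // JLex [-0-9]

    /** True if the single character c belongs to the class p. */
    static boolean matches(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }
}
```

The A_DASH_Z class accepts only a, - and z, so matches(A_DASH_Z, 'm') is false, while matches(LOWER, 'm') is true.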
27
Lexical Actions
Format:
{ action }
Notes:
All curly braces contained in the action that are not part of strings or comments should be balanced.
Actions and Recursion:
If no value is returned in an action, the lexical analyzer will search for the next match from the input stream and return the value associated with that match.
The lexical analyzer can be made to recur explicitly with a call to yylex(), as in the following code fragment:
{ ... return yylex(); ... }
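The "no return means keep scanning" behaviour can be sketched as a loop. The class below is a hand-written stand-in with two hypothetical rules, one for blanks with an empty action and one for numbers with a returning action; SkipSketch is an assumed name and the code is not what JLex generates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipSketch {
    private static final Pattern SPACE = Pattern.compile("[ \\t]+");
    private static final Pattern NUMBER = Pattern.compile("[0-9]+");
    private final String input;
    private int pos = 0;

    SkipSketch(String input) {
        this.input = input;
    }

    /** Returns the next number as text, or null at end of input. */
    String yylex() {
        while (pos < input.length()) {
            Matcher m = SPACE.matcher(input).region(pos, input.length());
            if (m.lookingAt()) {
                pos = m.end();
                continue;           // action { } returns nothing: search for the next match
            }
            m = NUMBER.matcher(input).region(pos, input.length());
            if (m.lookingAt()) {
                pos = m.end();
                return m.group();   // action with a return: hand the token to the caller
            }
            throw new IllegalStateException("illegal character at offset " + pos);
        }
        return null;
    }
}
```

On the input "  12 \t 7", successive calls to yylex() return "12", then "7", then null; the blanks never reach the caller. Looping like this has the same effect as the explicit `return yylex();` form without growing the call stack.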
28
More on lexical actions
State transitions are made by the function call:
yybegin(state);
Available lexical methods / vars:
String yytext()
Matched portion of the character input stream
int yylength()
length of yytext()
int yychar;
int yyline;
29
Performance
Size of Source File   JLex-generated Lexer Execution Time   Hand-Written Lexer Execution Time
177 lines             0.42 seconds                          0.53 seconds
897 lines             0.98 seconds                          1.28 seconds
On both inputs the JLex-generated lexical analyzer outperformed the hand-written lexer, taking roughly 21% to 23% less execution time.
30
Example
import java.lang.System;

class Sample {
    public static void main(String argv[]) throws java.io.IOException {
        Yylex yy = new Yylex(System.in);
        Yytoken t;
        while ((t = yy.yylex()) != null)
            System.out.println(t);
    }
}
31
class Utility {
    // Note: assert has been a reserved word since Java 1.4, so this method
    // name compiles only at older source levels.
    public static void assert(boolean expr) {
        if (false == expr) {
            throw (new Error("Error: Assertion failed."));
        }
    }

    private static final String errorMsg[] = {
        "Error: Unmatched end-of-comment punctuation.",
        "Error: Unmatched start-of-comment punctuation.",
        "Error: Unclosed string.",
        "Error: Illegal character."
    };

    // Indexes into errorMsg
    public static final int E_ENDCOMMENT = 0;
    public static final int E_STARTCOMMENT = 1;
    public static final int E_UNCLOSEDSTR = 2;
    public static final int E_UNMATCHED = 3;

    public static void error(int code) {
        System.out.println(errorMsg[code]);
    }
}
32
class Yytoken {
    Yytoken(int index, String text, int line, int charBegin, int charEnd) {
        m_index = index;
        m_text = new String(text);
        m_line = line;
        m_charBegin = charBegin;
        m_charEnd = charEnd;
    }

    public int m_index;
    public String m_text;
    public int m_line;
    public int m_charBegin;
    public int m_charEnd;

    public String toString() {
        return "Token #" + m_index + ": " + m_text + " (line " + m_line + ")";
    }
}
33
%%
%{
    private int comment_count = 0;
%}
%line
%char
%state COMMENT
ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
%%
34
<YYINITIAL> "," { return (new Yytoken(0,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":" { return (new Yytoken(1,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ";" { return (new Yytoken(2,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "(" { return (new Yytoken(3,yytext(),yyline,yychar,yychar+1)); }
…
<YYINITIAL> "<>" { return (new Yytoken(15,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "<" { return (new Yytoken(16,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "<=" { return (new Yytoken(17,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "|" { return (new Yytoken(21,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":=" { return (new Yytoken(22,yytext(),yyline,yychar,yychar+2)); }
35
<YYINITIAL> {NONNEWLINE_WHITE_SPACE_CHAR}+ { }
<YYINITIAL,COMMENT> \n { }
<YYINITIAL> "/*" { yybegin(COMMENT);
comment_count = comment_count + 1; }
<COMMENT> "/*" { comment_count = comment_count + 1; }
<COMMENT> "*/" {
comment_count = comment_count - 1;
Utility.assert(comment_count >= 0);
if (comment_count == 0) {yybegin(YYINITIAL);}}
<COMMENT> {COMMENT_TEXT} { }
<YYINITIAL> \"{STRING_TEXT}\" {
String str = yytext().substring(1,yytext().length() - 1);
Utility.assert(str.length() == yytext().length() - 2);
return (new Yytoken(40,str,yyline,yychar,yychar + str.length())); }
36
<YYINITIAL> \"{STRING_TEXT} {
String str = yytext().substring(1,yytext().length());
Utility.error(Utility.E_UNCLOSEDSTR);
Utility.assert(str.length() == yytext().length() - 1);
return (new Yytoken(41,str,yyline,yychar,yychar + str.length()));}
<YYINITIAL> {DIGIT}+ {
return (new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length()));}
<YYINITIAL> {ALPHA}({ALPHA}|{DIGIT}|_)* {
return (new Yytoken(43,yytext(),yyline,yychar,yychar + yytext().length())); }
<YYINITIAL,COMMENT> . {
System.out.println("Illegal character: <" + yytext() + ">");
Utility.error(Utility.E_UNMATCHED);}
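The COMMENT state and the comment_count counter in the rules above cooperate to handle nested comments. The following plain-Java sketch imitates that logic by hand so it can be run without JLex. NestedCommentSketch and strip are hypothetical names, the state is a local variable instead of a yybegin() call, and text outside comments is copied through unchanged rather than tokenized.

```java
class NestedCommentSketch {
    static final int YYINITIAL = 0;
    static final int COMMENT = 1;

    /** Returns the input with (possibly nested) comments removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder();
        int state = YYINITIAL;
        int commentCount = 0;   // nesting depth, like comment_count in the spec
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("/*", i)) {
                commentCount++;
                state = COMMENT;                  // corresponds to yybegin(COMMENT)
                i += 2;
            } else if (state == COMMENT && input.startsWith("*/", i)) {
                commentCount--;
                if (commentCount == 0) {
                    state = YYINITIAL;            // corresponds to yybegin(YYINITIAL)
                }
                i += 2;
            } else {
                if (state == YYINITIAL) {
                    out.append(input.charAt(i));  // ordinary text outside comments
                }
                i++;                              // inside a comment: skip the character
            }
        }
        // An unclosed comment is silently dropped here; reporting it is left out of the sketch.
        return out.toString();
    }
}
```

For example, strip("a /* x /* y */ z */ b") gives "a  b" (two spaces): the inner `*/` only lowers the depth from 2 to 1, so ` z ` is still treated as comment text.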