Writing a Lexer
CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Wednesday, February 8, 2017
Glenn G. Chappell
Department of Computer Science
University of Alaska [email protected]
© 2017 Glenn G. Chappell
Review
Overview of Lexing & Parsing
Two phases:
- Lexical analysis (lexing)
- Syntax analysis (parsing)
The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary.
8 Feb 2017 CS F331 / CSCE A331 Spring 2017
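[Figure: the character stream "cout << ff(12.6);" enters the lexer, which outputs a lexeme stream with each lexeme labeled by category (id, op, lit, punct). The parser takes the lexeme stream and outputs an AST or an error. The AST shown is an expr that is binOp: << with two expr children: id: cout, and a funcCall whose expr children are id: ff and numLit: 12.6. The pipeline as a whole is labeled Parsing.]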
Review
Introduction to Lexical Analysis: Lexeme Categories, Reserved Words
Five common lexeme categories:
- Identifier. A name a program gives to some entity: variable, function, type, namespace, etc.
- Keyword. An identifier-like lexeme that has special meaning within a PL. Lua examples: function if end do elseif
- Operator. A word that gives an alternate method for making what is essentially a function call. Arguments are operands. Lua examples: + * <=
- Literal. A bare value. Lua examples: 34.2 (number) "abc" (string) [=[abc]=] (string) true (boolean) { 1, 2 } (table; not a single lexeme)
- Punctuation. Other stuff. Lua examples: { } ,
A reserved word is a word that fits the general specification of an identifier, but is not allowed as an identifier. Reserved word is not a lexeme category.
In many PLs, the keywords and the reserved words are the same.
Review
Introduction to Lexical Analysis: Lexer Operation
There are essentially three ways to write a lexer.
- Automatically generated, based on regular grammars or regular expressions for each lexeme category.
- Hand-coded state machine using a table.
- Entirely hand-coded state machine.
We will write a lexer using the last method.
Review
Writing a Lexer: Introduction, Design Decisions
We are writing a lexer in Lua, based on the In-Class Lexeme Specification. This describes lexemes for a roughly C-like PL.
Our lexer is implemented as Lua module lexer. Function lexer.lex provides an iterator. A for-in loop gives each lexeme as a pair: text (string) and category (number). The latter is an index for lexer.catnames. Code to use our lexer:
    lexer = require "lexer"

    -- program is a string holding the source text to be lexed
    for lexstr, cat in lexer.lex(program) do
        catstr = lexer.catnames[cat]    -- category number to category name
        io.write(string.format("%-10s %s\n", lexstr, catstr))
    end
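To make the interface concrete, here is a minimal sketch of how a module with this shape might be laid out. It is not the posted lexer.lua: only the names lex, catnames, and getLexeme come from these slides, while the three-category list, the category numbering, and the whitespace-splitting logic are simplifying assumptions.

```lua
-- Toy stand-in for the lexer module (NOT the posted lexer.lua).
-- Assumption: lexemes are whitespace-separated words, and there are
-- only three categories, numbered as below.
local lexer = {}

lexer.catnames = { "Identifier", "NumericLiteral", "Malformed" }
local ID, NUMLIT, MAL = 1, 2, 3

-- lexer.lex returns an iterator function for use in a for-in loop.
function lexer.lex(program)
    local pos = 1   -- index of the next character to examine

    local function getLexeme()
        -- Find the next run of non-whitespace characters.
        local first, last = program:find("%S+", pos)
        if first == nil then
            return nil          -- no more lexemes: ends the for-in loop
        end
        pos = last + 1          -- pos is now just past the lexeme
        local text = program:sub(first, last)
        if text:find("^[%a_][%w_]*$") then
            return text, ID
        elseif text:find("^%d+$") then
            return text, NUMLIT
        else
            return text, MAL
        end
    end

    return getLexeme
end

-- Same loop shape as the usage code above.
local out = {}
for lexstr, cat in lexer.lex("abc 42 $") do
    out[#out + 1] = lexstr .. ":" .. lexer.catnames[cat]
end
print(table.concat(out, " "))  --> abc:Identifier 42:NumericLiteral $:Malformed
```

The iterator is a closure: pos lives in lexer.lex and is updated by each call to getLexeme, so every for-in loop gets its own position.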
Review
Writing a Lexer: Coding a State Machine I [1/5]
Internally, our lexer runs as a state machine.
- A state machine has a current state, stored in variable state.
- It proceeds in a series of steps. At each step, it looks at the current character in the input and the current state. It then decides what state to go to next.
Review
Writing a Lexer: Coding a State Machine I [2/5]
We need to be careful about invariants: statements that are always true at a particular point in a program.
What should we expect to be true about variables (pos, in particular) when our iterator function is called? Whatever we decide, we need to ensure that it is true when this function returns.
Review
Writing a Lexer: Coding a State Machine I [3/5]
We need to be clear about what happens when we read past the end of the input.
We use the string function sub to get single characters out of the input string. This function returns the empty string when it is asked to read past the end. And an empty string will always result in false when passed to a character-testing function, or when equality-compared with any single character. So anything we attempt to check about a past-the-end character will be false.
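The past-the-end behavior can be checked directly. In this small sketch the helper names charAt and isLetter are hypothetical and chosen only for illustration; string.sub is the standard library function referred to above.

```lua
local input = "ab"

-- Hypothetical helper: the character at index i, or "" past the end.
local function charAt(s, i)
    return s:sub(i, i)
end

-- Hypothetical character test: true only for a single letter.
local function isLetter(c)
    if c:len() ~= 1 then
        return false        -- covers the empty string
    end
    return (c >= "A" and c <= "Z") or (c >= "a" and c <= "z")
end

assert(charAt(input, 2) == "b")
assert(charAt(input, 3) == "")           -- past the end: empty string
assert(not isLetter(charAt(input, 3)))   -- character test is false
assert(charAt(input, 3) ~= "+")          -- never equal to a single character
print("all past-the-end checks passed")
```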
Review
Writing a Lexer: Coding a State Machine I [5/5]
We expanded lexer.lua to handle some of the lexeme categories. See lexer.lua.
I have also posted a simple main program for the lexer. See uselexer.lua.
Writing a Lexer
Coding a State Machine II: CODE
TO DO
- Finish lexer.lua. Done. See lexer.lua.
Writing a Lexer
Notes: Lookahead [1/2]
With our lexeme specification, it is tricky to handle “+.” and “-.”.
- For example, “+.3” is a single lexeme (NumericLiteral), while “+.x” is three lexemes: (Operator, Operator, Identifier).
There are several ways this could be handled.
- Backtracking. Add a state for “+.”, perhaps called PLUSDOT. In this state, if the next character is a digit, then add it to the lexeme and go to DIGDOT; otherwise, remove the dot from the current lexeme, back up the pos pointer, and spit out the + operator.
- Backtracking with saved state. Do as above, but before spitting out the + operator, begin construction of the next lexeme, which would begin with a dot. Then save this partial lexeme and state for the next lexeme request.
- Lookahead. The strategy used. If we see “+.”, then peek ahead at the next character. If it is a digit, then add the dot to the lexeme, and go to DIGDOT, even though the lexeme contains no digit yet. Otherwise, do not add the dot to the current lexeme, and spit out the + operator.
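The lookahead strategy might be sketched as follows. This is not the code in lexer.lua: the function lexPlus, its return values, and the exact numeric format (digits with at most one dot) are assumptions made for illustration. The point is the two-character test in the middle, which decides what to do with the dot before reading it.

```lua
local function isDigit(c)
    return c:len() == 1 and c >= "0" and c <= "9"
end

-- Hypothetical helper: lex one lexeme that starts with "+" at index
-- pos of string s. Returns the lexeme text and a category name.
local function lexPlus(s, pos)
    local function ch(i) return s:sub(i, i) end   -- "" when past the end

    assert(ch(pos) == "+")
    local lexeme = "+"
    pos = pos + 1

    if isDigit(ch(pos)) then
        -- "+3...": continue with the digit loop below
    elseif ch(pos) == "." and isDigit(ch(pos + 1)) then
        -- Lookahead: a digit follows the dot, so the dot joins the
        -- lexeme (the DIGDOT situation, with no digit read yet).
        lexeme = lexeme .. "."
        pos = pos + 1
    else
        -- The dot, if any, is left unread for the next lexeme.
        return lexeme, "Operator"
    end

    -- Assumed numeric format: digits with at most one dot.
    local seenDot = lexeme:find(".", 1, true) ~= nil
    while true do
        local c = ch(pos)
        if isDigit(c) then
            lexeme = lexeme .. c
        elseif c == "." and not seenDot then
            seenDot = true
            lexeme = lexeme .. c
        else
            break
        end
        pos = pos + 1
    end
    return lexeme, "NumericLiteral"
end

print(lexPlus("+.3", 1))
print(lexPlus("+.x", 1))
```

No backing up of pos is ever needed: a character is added to the lexeme only after we know it belongs there.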
Writing a Lexer
Notes: Lookahead [2/2]
Our lexer used the lookahead strategy. This is both fast and simple to implement. Our state machine has thus morphed into a somewhat more general construction that considers two input characters, not just one.
Lookahead is a common technique at all phases of parsing.
While lookahead is useful, in a simple state machine it does not increase our computational power. Each lexeme category must still form a regular language.
On the other hand, when we do syntax analysis involving more general context-free languages, lookahead may actually increase our capabilities. For some standard parsing algorithms, there are CFLs that cannot be recognized unless lookahead is used.
- CFGs are commonly classified according to the number of lexemes of lookahead required. Thus we talk about LL(1) grammars, LL(2) grammars, etc. More about this on another day.
Writing a Lexer
Notes: Error Handling [1/6]
The lexeme specification tells us to handle illegal characters by forming a single-character Malformed lexeme.
But suppose there were no Malformed category. How else could we handle this error?
There are three places where a possible error condition in a function might be handled.
1. Before the function. The caller can prevent the error, so that it never happens.
2. In the function. If the function encounters an error, then it can fix it, so that the outside world never knows; it may display a message to the user.
3. After the function. The function can signal the caller that an error has occurred, leaving it to the caller to deal with.
We look at these three in turn.
Writing a Lexer
Notes: Error Handling [2/6]
A possible error condition in a function can be handled before the function: the caller can prevent the error, so that it never happens.
A lexer generally reads text straight from a source file. To prevent the occurrence of illegal characters would require a preprocessing step.
This goes against the intent behind our design. It would also make our lexer more difficult to use. ☹
Writing a Lexer
Notes: Error Handling [3/6]
A possible error condition in a function can be handled in the function. If the function encounters an error, then it can fix it, so that the outside world never knows; it may display a message to the user.
The only way a lexer might “fix” illegal characters would be to skip them, but that contradicts the lexeme specification.
Secondly, a lexer generally exists to provide input to a parser. So a lexer is not user-facing code. Displaying a message to the user is poor practice for a lexer. ☹
Review
Writing a Lexer: Coding a State Machine I [4/5]
I follow the convention that each state is named after a short string that will put me in that state. If we have read “a”, then we are in state LETTER, since we have read a single letter.
As we write a state machine, an important question is when to add a new state. A good guiding principle:
Two situations can be handled by the same state if they would react identically to all future input.
Continuing from above, we have read “a” and are in state LETTER. Suppose the next character is “3”. Are we still in state LETTER?
Applying the above principle: yes. Because whatever follows “a3”, we handle it the same as we would if it followed “a”. For example, “a3_xq6” is an identifier; and so is “a_xq6”.
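A small sketch shows the principle at work. The state names START and LETTER follow the naming convention above; DONE, the function readIdentifier, and the rule that an identifier may begin with an underscore are assumptions for this illustration. Note that LETTER treats letters, digits, and underscores identically, which is why reading “3” after “a” does not call for a new state.

```lua
local START, LETTER, DONE = 1, 2, 3

local function isLetter(c) return c:match("^%a$") ~= nil end
local function isDigit(c)  return c:match("^%d$") ~= nil end

-- Hypothetical helper: read one identifier starting at index pos of s.
-- Returns the identifier, or nil if none starts there.
local function readIdentifier(s, pos)
    local state = START
    local lexeme = ""
    while state ~= DONE do
        local c = s:sub(pos, pos)      -- "" when past the end
        if state == START then
            if isLetter(c) or c == "_" then
                lexeme = lexeme .. c
                pos = pos + 1
                state = LETTER
            else
                state = DONE
            end
        elseif state == LETTER then
            if isLetter(c) or isDigit(c) or c == "_" then
                lexeme = lexeme .. c
                pos = pos + 1          -- stay in LETTER
            else
                state = DONE
            end
        end
    end
    if lexeme == "" then
        return nil
    end
    return lexeme
end

print(readIdentifier("a3_xq6 + 1", 1))  --> a3_xq6
print(readIdentifier("a_xq6;", 1))      --> a_xq6
```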
Writing a Lexer
Notes: Error Handling [4/6]
A possible error condition in a function can be handled after the function. The function can signal the caller that an error has occurred, leaving it to the caller to deal with.
The caller here would generally be the parser. How could the lexer signal the parser that an illegal character has been encountered?
One option would be to raise an exception; Lua supports this through its error and pcall functions. This would require extra exception-handling code in the parser.
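For comparison, the exception route might look roughly like this. The function nextLexemeOrError and the shape of the error value are hypothetical; error and pcall are the standard Lua functions. The caller has to wrap the call in pcall and inspect the result, which is the extra code referred to above.

```lua
-- Hypothetical lexer step that raises an error on an illegal character.
-- Here "$" stands in for any illegal character.
local function nextLexemeOrError(c)
    if c == "$" then
        error({ kind = "IllegalCharacter", char = c })
    end
    return c
end

-- The caller must wrap the call to keep control when the error is raised.
local ok, result = pcall(nextLexemeOrError, "$")
if ok then
    print("got lexeme: " .. result)
else
    print("illegal character: " .. result.char)
end
```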
The option chosen was to extend the return values of the lexer with an extra category: Malformed. We signal the parser that an illegal character has occurred by returning a Malformed lexeme. This method has an important advantage …
Writing a Lexer
Notes: Error Handling [5/6]
A parser must always check whether each lexeme is what it wants. There must be code to deal with the possibility of an unwanted lexeme. A Malformed lexeme will always be unwanted.
    if [lexeme is what we want] then
        [Yay!]
    else
        [Uh oh, unwanted lexeme.]    -- A Malformed lexeme always results in this branch being executed.
    end
Result: our error signaling method requires no additional code in the parser. ☺
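A runnable version of the same check, as a sketch: the category numbers, the short catnames list, and the function wantIdentifier are assumed for illustration, and only the category names come from the lexeme specification. The else branch was written to catch any unwanted lexeme, and a Malformed lexeme lands there without special treatment.

```lua
-- Category numbering is an assumption made for this sketch.
local catnames = { "Identifier", "Operator", "Malformed" }
local ID, OP, MAL = 1, 2, 3

-- Hypothetical parser step: at this point we want an Identifier.
local function wantIdentifier(lexstr, cat)
    if cat == ID then
        return true                    -- [Yay!]
    else
        -- Operator, Malformed, or anything else: one branch for all.
        return false, "expected Identifier, got " .. catnames[cat]
            .. " '" .. lexstr .. "'"
    end
end

print(wantIdentifier("abc", ID))
print(wantIdentifier("+", OP))
print(wantIdentifier("$", MAL))
```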
Writing a Lexer
Notes: Error Handling [6/6]
Our lexer, with its Malformed category, is a robust package. This means that it deals gracefully with anything that gets thrown at it—in particular, it returns a sequence of lexemes for all possible inputs.
Conclusion. This method of dealing with illegal characters allows the calling code to handle them however and whenever it wants, with minimal extra effort, and it reduces the likelihood of painful situations for the user.
Writing a Lexer
Notes: Correct Lexeme Specification? [1/4]
By its nature, lexical analysis is the servant of syntax analysis. The output of a lexer is almost never needed for its own sake; lexing is typically just the first step in the construction of an AST, or something similar, perhaps followed by the generation of executable code.
Thus, we cannot really look at a lexeme specification in isolation and call it correct or incorrect.
However, it is true that our lexeme specification does not quite match the envisioned PL. (This was intentional, but it harkens back to an actual mistake I made when writing a lexeme specification some years ago.)
Writing a Lexer
Notes: Correct Lexeme Specification? [2/4]
Say our lexer is to be part of a parser for arithmetic expressions with syntax like C++, Java, and Lua. Consider the following.
    Input: k - 4               Input: k-4
    Output:                    Output:
      k   Identifier             k    Identifier
      -   Operator               -4   NumericLiteral
      4   NumericLiteral
Note that the above is correct, according to our lexeme specification. However, the specification does not really match the envisioned PL.
What can we do about this?
Writing a Lexer
Notes: Correct Lexeme Specification? [3/4]
Problem: k - 4 vs. k-4.
Possible Solutions
- Leave the lexeme specification alone. Require programmers to insert space sometimes.
- Change the lexeme specification so that a NumericLiteral always begins with a digit. Then “-4” is an Operator and a NumericLiteral.
- Change the longest-lexeme rule. Allow the caller to set a flag during lexing, indicating that the next lexeme, if it begins with + or -, should always be interpreted as an Operator.
Writing a Lexer
Notes: Correct Lexeme Specification? [4/4]
The last option is my favorite.
- Change the longest-lexeme rule. Allow the caller to set a flag during lexing, indicating that the next lexeme, if it begins with + or -, should always be interpreted as an Operator.
Note. This is not implemented in the posted lexer code.
We might implement it as follows.
- In the lexer module, make a variable preferOpFlag, not exported.
- Make function preferOp, exported. This sets preferOpFlag to true.
- Set preferOpFlag to false: in its declaration, and just before each of the two return statements in function getLexeme.
- In the lexing code, when a lexeme begins with either + or -, read preferOpFlag to decide what to do.
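The four steps above can be sketched on a toy lexer. This is an illustration under stated assumptions, not the posted lexer.lua: the toy knows only identifiers, integer literals with an optional sign, and the operators + and -, and the driver lexAll is a hypothetical stand-in for a parser. The names preferOpFlag, preferOp, and getLexeme are the ones proposed above.

```lua
-- Toy lexer (hypothetical; not the posted lexer.lua).
local lexer = {}

local preferOpFlag = false          -- not exported; false in its declaration

function lexer.preferOp()           -- exported
    preferOpFlag = true
end

function lexer.lex(program)
    local pos = 1                   -- index of the next character to read

    local function getLexeme()
        -- Skip whitespace.
        pos = program:find("%S", pos) or (#program + 1)
        if pos > #program then
            preferOpFlag = false    -- reset before the first return
            return nil
        end

        local c = program:sub(pos, pos)
        local signed = (c == "+" or c == "-")
        local text, cat
        if signed and not preferOpFlag and program:match("^%d", pos + 1) then
            text, cat = program:match("^[+%-]%d+", pos), "NumericLiteral"
        elseif signed then
            text, cat = c, "Operator"
        elseif c:match("%d") then
            text, cat = program:match("^%d+", pos), "NumericLiteral"
        elseif c:match("[%a_]") then
            text, cat = program:match("^[%a_][%w_]*", pos), "Identifier"
        else
            text, cat = c, "Malformed"
        end

        pos = pos + #text
        preferOpFlag = false        -- reset before the second return
        return text, cat
    end

    return getLexeme
end

-- Stand-in for a parser: after reading an operand, ask for an operator.
local function lexAll(program, guide)
    local out = {}
    for text, cat in lexer.lex(program) do
        out[#out + 1] = text .. ":" .. cat
        if guide and (cat == "Identifier" or cat == "NumericLiteral") then
            lexer.preferOp()
        end
    end
    return table.concat(out, " ")
end

print(lexAll("k-4", false))  --> k:Identifier -4:NumericLiteral
print(lexAll("k-4", true))   --> k:Identifier -:Operator 4:NumericLiteral
```

The flag applies to exactly one lexeme because it is reset on every path out of getLexeme, including the end-of-input path.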
An important point: sometimes the parser may need to guide the lexer. This can affect the design of a lexer.
Be careful with the third point! It’s easy to get wrong.