
Writing a Lexer

CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Wednesday, February 8, 2017

Glenn G. Chappell
Department of Computer Science
University of Alaska [email protected]

© 2017 Glenn G. Chappell


Review: Overview of Lexing & Parsing

Two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary.

8 Feb 2017 CS F331 / CSCE A331 Spring 2017

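[Figure: the character stream "cout << ff(12.6);" is fed to the Lexer, which produces a lexeme stream (lexemes labeled id, op, lit, punct) for the Parser. Parsing yields an AST or an error. The AST shown has root expr with binOp: <<, whose operands are expr id: cout and expr funcCall; the funcCall has id: ff and argument expr numLit: 12.6.]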

Review: Introduction to Lexical Analysis - Lexeme Categories, Reserved Words

Five common lexeme categories:
§ Identifier. A name a program gives to some entity: variable, function, type, namespace, etc.
§ Keyword. An identifier-like lexeme that has special meaning within a PL. Lua examples: function if end do elseif
§ Operator. A word that gives an alternate method for making what is essentially a function call. Arguments are operands. Lua examples: + * <=
§ Literal. A bare value. Lua examples: 34.2 (number) "abc" (string) [=[abc]=] (string) true (boolean) { 1, 2 } (table; not a single lexeme)
§ Punctuation. Other stuff. Lua examples: { } ,

A reserved word is a word that fits the general specification of an identifier, but is not allowed as an identifier. Reserved word is not a lexeme category.

In many PLs, the keywords and the reserved words are the same.

Review: Introduction to Lexical Analysis - Lexer Operation

There are essentially three ways to write a lexer.
§ Automatically generated, based on regular grammars or regular expressions for each lexeme category.
§ Hand-coded state machine using a table.
§ Entirely hand-coded state machine.

We will write a lexer using the last method.


Review: Writing a Lexer - Introduction, Design Decisions

We are writing a lexer in Lua, based on the In-Class Lexeme Specification. This describes lexemes for a roughly C-like PL.

Our lexer is implemented as Lua module lexer.

Function lexer.lex provides an iterator. A for-in loop gives each lexeme as a pair: text (string) and category (number). The latter is an index for lexer.catnames. Code to use our lexer:

    lexer = require "lexer"

    -- Print each lexeme with the name of its category.
    for lexstr, cat in lexer.lex(program) do
        catstr = lexer.catnames[cat]   -- catnames is a table indexed by category
        io.write(string.format("%-10s %s\n", lexstr, catstr))
    end
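To show how a function like lexer.lex can drive a for-in loop, here is a minimal sketch of the iterator protocol. It is not the posted lexer: the module name toy, its two category names, and the rule "a run of letters is a Word, anything else is a one-character Other" are all assumptions made for illustration.

```lua
-- Sketch only: a toy stand-in for lexer.lex, to show the iterator protocol.
local toy = {}
toy.catnames = { "Word", "Other" }   -- hypothetical category names

function toy.lex(program)
    local pos = 1   -- index of the next character to examine

    -- The for-in loop calls this function repeatedly until it returns nil.
    return function()
        local start = program:find("%S", pos)   -- skip whitespace
        if start == nil then
            pos = program:len() + 1
            return nil                           -- end of input: stop the loop
        end
        local lexstr = program:match("^%a+", start)
        local cat = 1
        if lexstr == nil then
            lexstr = program:sub(start, start)   -- any other single character
            cat = 2
        end
        pos = start + lexstr:len()
        return lexstr, cat
    end
end

for lexstr, cat in toy.lex("ab + c") do
    io.write(string.format("%-10s %s\n", lexstr, toy.catnames[cat]))
end
```

The closure keeps pos between calls, which is why the invariant on pos at each call and return matters.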


Review: Writing a Lexer - Coding a State Machine I [1/5]

Internally, our lexer runs as a state machine.
§ A state machine has a current state, stored in variable state.
§ It proceeds in a series of steps. At each step, it looks at the current character in the input and the current state. It then decides what state to go to next.



We need to be careful about invariants: statements that are always true at a particular point in a program.

What should we expect to be true about variables (pos, in particular) when our iterator function is called? Whatever we decide, we need to ensure that it is true when this function returns.



We need to be clear about what happens when we read past the end of the input.

We use the string function sub to get single characters out of the input string. This function returns the empty string when it is asked to read past the end. And an empty string will always result in false when passed to a character-testing function, or when equality-compared with any single character. So anything we attempt to check about a past-the-end character will be false.
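The past-the-end behavior can be checked directly. This is a minimal sketch; currChar and isDigit are hypothetical helper names chosen for illustration and may differ from whatever lexer.lua uses.

```lua
-- Sketch only: helper names are assumptions, not taken from lexer.lua.
local program = "ab"

-- Return the single character at index pos, or "" if pos is past the end.
local function currChar(pos)
    return program:sub(pos, pos)
end

-- Character test: true only for a one-character string holding a digit.
local function isDigit(c)
    if c:len() ~= 1 then
        return false
    end
    return c >= "0" and c <= "9"
end

print(currChar(2) == "b")     -- true
print(currChar(3) == "")      -- true: past the end gives the empty string
print(isDigit(currChar(3)))   -- false
print(currChar(3) == "x")     -- false
```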



We expanded lexer.lua to handle some of the lexeme categories.

I have also posted a simple main program for the lexer.


See lexer.lua.

See uselexer.lua.


Writing a Lexer: Coding a State Machine II - CODE

TO DO
§ Finish lexer.lua.



Done. See lexer.lua.

Writing a Lexer: Notes - Lookahead [1/2]

With our lexeme specification, it is tricky to handle “+.” and “-.”.
§ For example, “+.3” is a single lexeme (NumericLiteral), while “+.x” is three lexemes: (Operator, Operator, Identifier).

There are several ways this could be handled.

§ Backtracking. Add a state for “+.”—perhaps called PLUSDOT. In this state, if the next character is a digit, then add it to the lexeme and go to DIGDOT; otherwise, remove the dot from the current lexeme, back up the pos pointer, and spit out the + operator.

§ Backtracking with saved state. Do as above, but before spitting out the + operator, begin construction of the next lexeme, which would begin with a dot. Then save this partial lexeme and state for the next lexeme request.

§ Lookahead. The strategy used. If we see “+.”, then peek ahead at the next character. If it is a digit, then add the dot to the lexeme, and go to DIGDOT—even though the lexeme contains no digit yet. Otherwise, do not add the dot to the current lexeme, and spit out the + operator.
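The lookahead strategy might be sketched as follows. This is not the code in lexer.lua: the function lexPlus, the state name DIGIT, and the simplified numeric-literal format (optional digits, at most one dot, more digits) are assumptions made for illustration. Only DIGDOT and the peek at the character after the dot come from the description above.

```lua
-- Sketch only: DIGDOT follows the slides; everything else is assumed.
local function isDigit(c)
    return c:len() == 1 and c >= "0" and c <= "9"
end

-- Lex one lexeme that starts with "+" at index pos of string s.
-- Returns the lexeme text and a category name.
local function lexPlus(s, pos)
    local function charAt(i) return s:sub(i, i) end

    local lexstr = "+"          -- we have read "+"
    pos = pos + 1
    local state

    if isDigit(charAt(pos)) then
        state = "DIGIT"
    elseif charAt(pos) == "." and isDigit(charAt(pos + 1)) then
        -- Lookahead: the character after the dot is a digit, so the dot
        -- belongs to this lexeme. Go to DIGDOT with no digit read yet.
        lexstr = lexstr .. "."
        pos = pos + 1
        state = "DIGDOT"
    else
        -- Do not add the dot; spit out the + operator.
        return lexstr, "Operator"
    end

    while true do
        local ch = charAt(pos)
        if isDigit(ch) then
            lexstr = lexstr .. ch
            pos = pos + 1
        elseif ch == "." and state == "DIGIT" then
            lexstr = lexstr .. ch
            pos = pos + 1
            state = "DIGDOT"
        else
            return lexstr, "NumericLiteral"
        end
    end
end

print(lexPlus("+.3", 1))   -- +.3   NumericLiteral
print(lexPlus("+.x", 1))   -- +     Operator
```

No backing up of pos is ever needed, because the dot is only consumed once the peek has confirmed that it belongs to the lexeme.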


Writing a Lexer: Notes - Lookahead [2/2]

Our lexer used the lookahead strategy. This is both fast and simple to implement. Our state machine has thus morphed into a somewhat more general construction that considers two input characters, not just one.

Lookahead is a common technique at all phases of parsing.

While lookahead is useful, in a simple state machine it does not increase our computational power. Each lexeme category must still form a regular language.

On the other hand, when we do syntax analysis involving more general context-free languages, lookahead may actually increase our capabilities. For some standard parsing algorithms, there are CFLs that cannot be recognized unless lookahead is used.
§ CFGs are commonly classified according to the number of lexemes of lookahead required. Thus we talk about LL(1) grammars, LL(2) grammars, etc. More about this on another day.


Writing a Lexer: Notes - Error Handling [1/6]

The lexeme specification tells us to handle illegal characters by forming a single-character Malformed lexeme.

But suppose there were no Malformed category. How else could we handle this error?

There are three places where a possible error condition in a function might be handled.
1. Before the function. The caller can prevent the error, so that it never happens.
2. In the function. If the function encounters an error, then it can fix it, so that the outside world never knows; it may display a message to the user.
3. After the function. The function can signal the caller that an error has occurred, leaving it to the caller to deal with.

We look at these three in turn.



A possible error condition in a function can be handled before the function: the caller can prevent the error, so that it never happens.

A lexer generally reads text straight from a source file. To prevent the occurrence of illegal characters would require a preprocessing step.

This goes against the intent behind our design. It would also make our lexer more difficult to use. :-(



A possible error condition in a function can be handled in the function. If the function encounters an error, then it can fix it, so that the outside world never knows; it may display a message to the user.

The only way a lexer might “fix” illegal characters would be to skip them, but that contradicts the lexeme specification.

Secondly, a lexer generally exists to provide input to a parser. So a lexer is not user-facing code. Displaying a message to the user is poor practice for a lexer. :-(



I follow the convention that each state is named after a short string that will put me in that state. If we have read “a”, then we are in state LETTER, since we have read a single letter.

As we write a state machine, an important question is when do we add a new state? A good guiding principle:

Two situations can be handled by the same state if they would react identically to all future input.

Continuing from above, we have read “a” and are in state LETTER. Suppose the next character is “3”. Are we still in state LETTER?

Applying the above principle: yes. Because whatever follows “a3”, we handle it the same as we would if it followed “a”. For example, “a3_xq6” is an identifier; and so is “a_xq6”.
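The guiding principle can be seen in a small sketch of a state machine that reads one identifier. The function readIdentifier, the DONE state, and the rule that an identifier may also begin with an underscore are assumptions made for illustration; only the state name LETTER comes from the text above.

```lua
-- Sketch only: a small state machine that reads one identifier.
local START, LETTER, DONE = 1, 2, 3

local function isLetter(c)
    return c:len() == 1 and
        ((c >= "a" and c <= "z") or (c >= "A" and c <= "Z"))
end

local function isDigit(c)
    return c:len() == 1 and c >= "0" and c <= "9"
end

-- Read an identifier from the start of s; return it, or "" if there is none.
local function readIdentifier(s)
    local pos, state, lexstr = 1, START, ""
    while state ~= DONE do
        local ch = s:sub(pos, pos)
        if state == START then
            if isLetter(ch) or ch == "_" then
                lexstr = lexstr .. ch
                pos = pos + 1
                state = LETTER
            else
                state = DONE
            end
        elseif state == LETTER then
            -- Letters, digits, and underscores all keep us in LETTER:
            -- "a" and "a3" react identically to all future input.
            if isLetter(ch) or isDigit(ch) or ch == "_" then
                lexstr = lexstr .. ch
                pos = pos + 1
            else
                state = DONE
            end
        end
    end
    return lexstr
end

print(readIdentifier("a3_xq6 + 1"))   -- a3_xq6
print(readIdentifier("a_xq6"))        -- a_xq6
```

A separate state is needed for START because a digit is handled differently there than in LETTER.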



A possible error condition in a function can be handled after the function. The function can signal the caller that an error has occurred, leaving it to the caller to deal with.

The caller here would generally be the parser. How could the lexer signal the parser that an illegal character has been encountered?

One option would be to raise an exception; Lua supports this through its error and pcall functions. This would require extra exception-handling code in the parser.
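For comparison, the exception option might look like the sketch below. This is the design that was not chosen. The function nextChar and the choice of "$" as the illegal character are assumptions made for illustration; error and pcall are the standard Lua functions.

```lua
-- Sketch only: an alternative design, not the one used in lexer.lua.
-- Hypothetical lexer step that raises an error on an illegal character.
local function nextChar(s, pos)
    local ch = s:sub(pos, pos)
    if ch == "$" then   -- assume "$" is illegal, for illustration
        error({ msg = "illegal character", char = ch, pos = pos })
    end
    return ch
end

-- The caller (the parser) must wrap the call in pcall to catch the error.
local ok, result = pcall(nextChar, "a$b", 2)
if ok then
    print("got character: " .. result)
else
    print(result.msg .. " '" .. result.char .. "' at position " .. result.pos)
end
```

Every call site in the parser would need a wrapper like this, which is the extra code the exception approach costs.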

The option chosen was to extend the return values of the lexer with an extra category: Malformed. We signal the parser that an illegal character has occurred by returning a Malformed lexeme. This method has an important advantage …



A parser must always check whether each lexeme is what it wants. There must be code to deal with the possibility of an unwanted lexeme. A Malformed lexeme will always be unwanted.

    if [lexeme is what we want] then
        [Yay!]
    else
        [Uh oh, unwanted lexeme.]
    end

A Malformed lexeme always results in the else branch being executed.

Result: our error signaling method requires no additional code in the parser. :-)
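A parser-side check of this kind might be sketched as follows. The category numbers and the function expectIdentifier are assumptions made for illustration; the real parser and category values may differ.

```lua
-- Sketch only: category numbers and names are assumptions for illustration.
local ID, OP, MAL = 1, 2, 3

-- Parser-side check: is the current lexeme an identifier?
local function expectIdentifier(lexstr, cat)
    if cat == ID then
        return true                                   -- the lexeme we wanted
    else
        return false, "unexpected lexeme: " .. lexstr -- any other category
    end
end

print(expectIdentifier("abc", ID))   -- true
print(expectIdentifier("+", OP))     -- false   unexpected lexeme: +
print(expectIdentifier("$", MAL))    -- false   unexpected lexeme: $
```

The function never mentions MAL: a Malformed lexeme is rejected by the same branch that rejects any other unwanted lexeme.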



Our lexer, with its Malformed category, is a robust package. This means that it deals gracefully with anything that gets thrown at it—in particular, it returns a sequence of lexemes for all possible inputs.

Conclusion. This method of dealing with illegal characters allows the calling code to handle them however and whenever it wants, with minimal extra effort, and it reduces the likelihood of painful situations for the user.


Writing a Lexer: Notes - Correct Lexeme Specification? [1/4]

By its nature, lexical analysis is the servant of syntax analysis. The output of a lexer is almost never needed for its own sake; lexing is typically just the first step in the construction of an AST, or something similar, perhaps followed by the generation of executable code.

Thus, we cannot really look at a lexeme specification in isolation and call it correct or incorrect.

However, it is true that our lexeme specification does not quite match the envisioned PL. (This was intentional, but it harkens back to an actual mistake I made when writing a lexeme specification some years ago.)



Say our lexer is to be part of a parser for arithmetic expressions with syntax like C++, Java, and Lua. Consider the following.

Input: k - 4                Input: k-4
Output:                     Output:
  k    Identifier             k    Identifier
  -    Operator               -4   NumericLiteral
  4    NumericLiteral

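Note that the above is correct, according to our lexeme specification. However, the specification does not really match the envisioned PL: in C++, Java, and Lua, k-4 means k minus 4, so the parser needs an Operator between the two operands.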

What can we do about this?



Problem: k - 4 vs. k-4.

Possible Solutions
§ Leave the lexeme specification alone. Require programmers to insert space sometimes.
§ Change the lexeme specification so that a NumericLiteral always begins with a digit. Then “-4” is an Operator and a NumericLiteral.
§ Change the longest-lexeme rule. Allow the caller to set a flag during lexing, indicating that the next lexeme, if it begins with + or -, should always be interpreted as an Operator.



The last option is my favorite.
§ Change the longest-lexeme rule. Allow the caller to set a flag during lexing, indicating that the next lexeme, if it begins with + or -, should always be interpreted as an Operator.

Note. This is not implemented in the posted lexer code.

We might implement it as follows.
§ In the lexer module, make a variable preferOpFlag, not exported.
§ Make function preferOp, exported. This sets preferOpFlag to true.
§ Set preferOpFlag to false: in its declaration, and just before each of the two return statements in function getLexeme. (Be careful with this point! It’s easy to get wrong.)
§ In the lexing code, when a lexeme begins with either + or -, read preferOpFlag to decide what to do.

An important point: sometimes the parser may need to guide the lexer. This can affect the design of a lexer.
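The preferOp steps might be sketched as follows. This is not the posted lexer code. The getLexeme here is a heavily simplified stand-in that takes the input string and a position as arguments, recognizes only optionally signed integers, and treats every other character as a one-character Operator; those simplifications and its return values are assumptions made for illustration.

```lua
-- Sketch only: not the posted lexer code. Just enough to show the flag.
local lexer = {}

local preferOpFlag = false     -- not exported; false in its declaration

-- Exported: ask that the next lexeme, if it begins with + or -,
-- be interpreted as an Operator.
function lexer.preferOp()
    preferOpFlag = true
end

local function isDigit(c)
    return c:len() == 1 and c >= "0" and c <= "9"
end

-- Simplified stand-in for getLexeme: lex one lexeme of s starting at pos.
-- Returns lexeme text, category name, and the position after the lexeme.
function lexer.getLexeme(s, pos)
    local ch = s:sub(pos, pos)
    local nextIsDigit = isDigit(s:sub(pos + 1, pos + 1))
    local isSign = (ch == "+" or ch == "-")

    if isDigit(ch) or (isSign and nextIsDigit and not preferOpFlag) then
        -- Numeric literal, possibly with a leading sign.
        local start = pos
        pos = pos + 1
        while isDigit(s:sub(pos, pos)) do
            pos = pos + 1
        end
        preferOpFlag = false   -- reset just before returning
        return s:sub(start, pos - 1), "NumericLiteral", pos
    end

    -- Anything else is treated here as a one-character Operator.
    preferOpFlag = false       -- reset just before returning
    return ch, "Operator", pos + 1
end

print(lexer.getLexeme("-4", 1))   -- -4   NumericLiteral   3
lexer.preferOp()
print(lexer.getLexeme("-4", 1))   -- -    Operator         2
print(lexer.getLexeme("-4", 1))   -- -4   NumericLiteral   3 (flag was reset)
```

A parser that has just read an operand such as k would call lexer.preferOp() before requesting the next lexeme, so that k-4 comes out as three lexemes. Resetting the flag before every return keeps a single preferOp call from affecting more than one lexeme.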
