lex & yacc

CIS*2750 Winter 2013

D. McCaughan, CIS*2750 (W13)

Scanners

A “scanner” turns an input stream in the source language into token codes
– in principle: takes some action when it recognizes a token in the input
– discards non-semantic content (e.g. whitespace, comments)
– may do other small jobs, like converting numeric constants

this is the wrong scanners

if (a == 0) { /* increase b */ b++; }

IF LPAREN ID EQ CONSTANT RPAREN LBRACE ID INCR SEMI RBRACE


Scanners: Lexical Analysis

Analyze the structural components of input
– scanner: groups input characters into tokens

What is a token?
– a sequence of characters that can be treated as an atomic grammatical unit
– a language specifies a finite set of token types (the lexical units of the language), e.g.
• ID (“foo”, “bar”, “abc123”, …), IF (“if”), INTEGER, REAL, COMMA, NEQ, LPAREN, RPAREN, …
– some tokens carry additional semantic values
• e.g. identifiers, string literals, numbers


Scanners: example

program:

/* find a zero */
float match0(char *s)
{
    if (!strncmp(s, "0.0", 3)) return(0.0);
}

scanner tokenizes:

FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA INTEGER(3) RPAREN RPAREN RETURN LPAREN REAL(0.0) RPAREN SEMI RBRACE EOF


Specifying tokens

Structure of tokens can be complex
– problem defining complex tokens ad hoc
– e.g. string literals
– e.g. floating point format

Need a formal language to specify token types without ambiguity
– permit review of design and validation of input

Regular expressions
– succinct, precise
– capable of representing infinite sets of strings
• CAUTION: cannot describe all sets of strings with regular expressions
• consider writing a regex for “strings containing an equal number of a and b characters”


Finite automata

Need a formalism that can be implemented in code

Finite Automaton: a simple idealized “computer” that recognizes strings belonging to regular sets:
– a finite set of states S
– a finite alphabet Σ
– a set of transitions between states based on the input read in a given state, T: (S × Σ) → S
– a specific start state s ∈ S
– a set of final (accepting) states F ⊆ S
– the set of all strings accepted by a given FA is the language it defines

Compare the above with what you understand about regular expressions… they are equivalent


Finite automata (cont.)

Can represent a FA using transition graphs
– directed graph
– each state is a vertex
• accepting states are marked as such
– each transition is a directed edge between states
• edges are labeled with a symbol from the alphabet
• a symbol can appear on only 1 outgoing edge from a given state
• an unlabelled edge is directed into the start state

[Figure: transition graph with edges labelled a-zA-Z_, a-zA-Z_ and 0-9]


Finite automata (cont.)

Deterministic finite automata
– no two edges leaving the same state are labelled with the same input symbol

Processing
– begin in start state
– for each character in input do
• follow edge labelled with this character to next state
– after n transitions (for input of length n), if current state is a final state: ACCEPT string, else: REJECT string

Easily implemented
– CASE-based processing given global current state
– matrix-based transition table (table lookup)
• newstate = matrix[current_state][input]
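The table-lookup processing loop above can be sketched in C. This is a minimal illustration only, assuming a DFA for the identifier pattern [a-zA-Z_][a-zA-Z_0-9]*; the state names, the classify() helper and the explicit DEAD state are choices made for this sketch, not something lex itself exposes.

```c
#include <stddef.h>

/* States of a DFA for the identifier pattern [a-zA-Z_][a-zA-Z_0-9]*.
   DEAD is an explicit trap state for inputs with no valid edge. */
enum state { START, IN_ID, DEAD, NUM_STATES };

/* Input classes: the table is indexed by class rather than raw character. */
enum input_class { LETTER, DIGIT, OTHER, NUM_CLASSES };

static enum input_class classify(char c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
        return LETTER;
    if (c >= '0' && c <= '9')
        return DIGIT;
    return OTHER;
}

/* matrix[current_state][input] gives the next state. */
static const enum state matrix[NUM_STATES][NUM_CLASSES] = {
    /* START */ { IN_ID, DEAD,  DEAD },
    /* IN_ID */ { IN_ID, IN_ID, DEAD },
    /* DEAD  */ { DEAD,  DEAD,  DEAD },
};

/* Returns 1 (ACCEPT) if s is an identifier, 0 (REJECT) otherwise. */
int is_identifier(const char *s)
{
    enum state current_state = START;
    size_t i;

    for (i = 0; s[i] != '\0'; i++)
        current_state = matrix[current_state][classify(s[i])];

    return current_state == IN_ID;   /* IN_ID is the only final state */
}
```

Grouping characters into classes keeps the table small; a generated scanner would typically build a much larger table covering all patterns at once.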


Scanner generators

Writing scanners is a common requirement
– parsing is a ubiquitous activity

Process is repetitive, resulting code is similar in structure

Process is not difficult to automate

Scanner generators receive a specification file
– definitions of the tokens to be scanned
– non-procedural programming
• not “how”, but “what”

e.g. lex


lex

What is lex?
– a lexical analyzer (scanner) generator
– INPUT:
• a description file that uses regular expressions to specify patterns to be tokenized
– OUTPUT:
• source code that implements the scanner
– by default, input is taken from stdin and output is sent to stdout (this can be changed)

Specific variants
– flex for creating C/C++ lexers
– JFlex for creating Java lexers
– etc.


Structure of a lex file

Definitions section
– small building blocks of regular expressions to simplify the scanner specification
• declared outside of %{ %}
• special directives to change lex’s behaviour also appear here
– anything inside of %{ %} is copied verbatim into the final program (so should be C code)
• comments, #include, #define, variables (e.g. line counter), etc.

DEFINITIONS SECTION
%%
RULES SECTION
%%
USER CODE SECTION


Structure of a lex file (cont.)

Rules section
– a pattern (regex) and an action (program code) to execute when that pattern is found
• the action starts on the same line as the pattern
• patterns only match a given input string once
• the longest possible match is always used
– “island” would be matched by [a-zA-Z]+ rather than by is

User code section
– any legal program code, not enclosed in %{ %}
– copied verbatim into final program
– main(), other subroutines used (or expected) by actions from the rules section
– NOTE: comments outside of %{ %} must be indented!


Example

The simplest lex script:
– simply copies standard input to standard output
– ECHO is a special lex directive, not a C command

%%

.|\n { ECHO; }

%%


Running lex

Executing lex:

lex <lexfile>

e.g.

% lex example.l (produces lex.yy.c)

– outputs C source code for the lexer; by default this file is called lex.yy.c, which can be compiled normally
– some systems may require you to link in the lex library (-ll for lex, -lfl for flex)

e.g.

% gcc -Wall -ansi lex.yy.c -o scanner -lfl


Running lex (cont.)

Key points:
– automatically generates a function yylex() which, when called, begins scanning the input (stdin) for patterns and executing actions
• if actions have no return statements, yylex() won’t return until EOF
– internal variables are always available in actions
• yytext - text that matched the pattern
• yyleng - length of string in yytext
• etc. (some implementations will have built-in support for lineno)
– if a main() routine is not explicitly provided, the lex library (-ll or -lfl) supplies a default one that simply calls yylex()


Example

Things to note here:
– variables declared in the definitions section (global to the generated scanner); #define and #include would also belong there
– special internal variables (yyleng) and functions (yylex)
– the three parts, separated by %%, are the DEFINITIONS, RULES and USER CODE sections

%{
/* a word counting program */
unsigned char_count = 0, word_count = 0, line_count = 0;
%}

word [^ \t\n]+
eol \n

%%

{word} { word_count++; char_count += yyleng; }
{eol} { char_count++; line_count++; }
. { char_count++; }

%%

int main()
{
    yylex();
    printf("l: %u - w: %u - c: %u\n", line_count, word_count, char_count);
    return 0;
}


Example

%{
/* crude verb recognition program */
%}

%%

[\t ]+ { /* ignore whitespace */ }

is |
are |
was |
being |
do |
did |
would |
can |
have |
go { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n { ECHO; /* default catch-all */ }

%%

int main()
{
    yylex();
    return 0;
}


Example (cont.)

Compiled & run:

% ./verb
did I have fun?
did: is a verb
I: is not a verb
have: is a verb
fun: is not a verb
?
^D


Hints and tips

Error reporting
– you’ll want to be able to report (at least) a line number for unrecognized tokens (and other error conditions related to the parser to follow)
– consider using %option yylineno in flex
• you can easily implement this functionality yourself (how?)
– it can be useful to have special actions apply inside a comment (for example) or other semantic construct
• have a look at “start conditions” (lex manual) and <<EOF>> rules

Recall that tokens often have associated semantic values that must be recorded over time
– symbol table: a look-up table (typically a hash table) that permits storage and retrieval of data to be associated with a symbol
– consider how this would be integrated with lex
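One possible shape for such a symbol table is a chained hash table keyed by the symbol's name. The following C sketch is only an illustration: the names sym_lookup and sym_install, the struct layout, the bucket count and the hash function are all assumptions made here, not part of lex.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211   /* hypothetical bucket count */

struct symbol {
    char *name;
    int value;              /* data associated with the symbol */
    struct symbol *next;    /* chain of symbols in the same bucket */
};

static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns the entry for name, or NULL if it has not been installed. */
struct symbol *sym_lookup(const char *name)
{
    struct symbol *sp;
    for (sp = table[hash(name)]; sp != NULL; sp = sp->next)
        if (strcmp(sp->name, name) == 0)
            return sp;
    return NULL;
}

/* Returns the existing entry for name, or creates one with value 0.
   Returns NULL only if memory allocation fails. */
struct symbol *sym_install(const char *name)
{
    struct symbol *sp = sym_lookup(name);
    unsigned h;
    size_t len;

    if (sp != NULL)
        return sp;
    sp = malloc(sizeof *sp);
    if (sp == NULL)
        return NULL;
    len = strlen(name) + 1;
    sp->name = malloc(len);
    if (sp->name == NULL) {
        free(sp);
        return NULL;
    }
    memcpy(sp->name, name, len);   /* private copy of the name */
    sp->value = 0;
    h = hash(name);
    sp->next = table[h];           /* push onto the front of the chain */
    table[h] = sp;
    return sp;
}
```

In a lex specification, an action for an identifier pattern could call something like sym_install(yytext) so that every occurrence of the same name maps to the same entry. The table keeps its own copy of the name because yytext is overwritten on the next match.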


Parsers

What we’ve seen to this point is lexical analysis
– only concerned with identifying the structural components (tokens) of the input

Typically the sequences of tokens are also significant: this is syntax analysis
– recognize sequences of tokens (or classes of tokens) and perform appropriate actions

“Parsers” validate the phrase structure of input
– specific sequences of tokens
– recognizer
– determine the semantics of the input
• consider parse trees (abstract syntax)


Parsers (cont.)

A language is defined by the phrase structure of its component expressions. e.g.:

addition expression = ID ADDOP ID

e.g.

a + b

decl = TYPE decls SEMICOLON
decls = decls COMMA decls | ID

e.g.

int a, b, c;


Specification of languages

Consider defining phrases with regex’s
– e.g. addition expressions

digits = [0-9]+
sum = (digits "+")* digits

• e.g. 28 + 301 + 9

– what about parentheses?

digits = [0-9]+
sum = expr "+" expr
expr = "(" sum ")" | digits

• e.g. (109 + 23) … 61 … (1 + (250 + 3))


Specification of languages (cont.)

BUT… it is impossible for a DFA to recognize balanced parentheses (can’t count to arbitrary N)
– sum and expr are thus not regular expressions

Recall abbreviations in lex
– what does lex do with such abbreviations?
– RHS is substituted for LHS prior to generation of DFA
– try substituting abbreviations in prev. example
• explosion of abbreviations
– abbreviations do not increase expressive power

What we need is recursive abbreviations
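To see why balanced parentheses are out of reach for a DFA, it can help to look at what a recognizer needs. The C sketch below (the function name is_balanced is made up for this illustration) relies on a counter that may grow without bound, which is exactly the resource a fixed, finite set of states cannot provide.

```c
/* Returns 1 if every '(' in s is closed by a later ')', 0 otherwise.
   Characters other than parentheses are ignored. */
int is_balanced(const char *s)
{
    long depth = 0;   /* nesting depth: unbounded in principle */

    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return 0;    /* ')' with no matching '(' */
            depth--;
        }
    }
    return depth == 0;       /* all opened parentheses were closed */
}
```

A DFA with k states can distinguish at most k different depths, so some sufficiently deep nesting is always misjudged.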


Context Free Grammars (CFGs)

A precise method of specifying context free languages

Incorporate recursion into definitions
– counting
• e.g. balanced parentheses
– arbitrary repetition
• e.g. mathematical expressions


CFG Terminology

Non-terminals: variables that represent a language (UPPER CASE)

Terminals: atomic symbols in the language (lower case)

Productions: rules relating variables (→)
– the language associated with a given non-terminal contains strings formed by concatenating strings from the languages of other non-terminals, and possibly terminals

Start symbol: a special symbol that starts all derivations (S)


Backus-Naur Form (BNF)

Describing natural language (from Hopcroft & Ullman, 1979):

<sentence> → <noun phrase> <verb phrase>
<noun phrase> → <adjective> <noun phrase>
<noun phrase> → <noun>
<noun> → boy
<adjective> → little

Generally not adequate for describing natural language (no accommodation of context)

Ideal for most programming languages
– Backus-Naur Form (BNF)


Productions

Example
– arithmetic expressions with + and - operators, id-class operands and balanced parentheses

S → EXPR
EXPR → EXPR + EXPR
EXPR → EXPR - EXPR
EXPR → ( EXPR )
EXPR → id

or, equivalently:

S → EXPR
EXPR → EXPR + EXPR | EXPR - EXPR | ( EXPR ) | id


Derivations

To show a sentence is in the language defined by a grammar, we can perform a derivation: start with the start symbol and repeatedly replace any non-terminal by one of its RHSs

S ⇒ EXPR ⇒ EXPR + EXPR ⇒ EXPR + id ⇒ id + id

S ⇒ EXPR ⇒ EXPR - EXPR ⇒ ( EXPR ) - EXPR ⇒ ( EXPR ) - id ⇒ ( EXPR + EXPR ) - id ⇒ ( EXPR + id ) - id ⇒ ( id + id ) - id


Parse trees

A tree in which each symbol in a derivation is connected to the one from which it was derived
– several derivations can have the same parse tree

[Figure: parse tree rooted at S for the sentence ( id + id ) - id]


Derivation sequence

Many different possible derivations of the same sentence
– if more than one non-terminal appears in the RHS of productions, we can choose which to expand first

Two obvious conventions:
– leftmost derivation
• choose leftmost non-terminal to expand
• top down (recursive descent) parsing
• easiest to write by hand
– rightmost derivation
• choose rightmost non-terminal to expand
• “canonical” derivation
• bottom up parsers (e.g. yacc)


Repetition and recursion

Two ways to specify recursion

Left recursion
– non-terminal appears as the first symbol on RHS of production (NOTE: for yacc it is better to use left recursion where possible - minimizes stack size)
– e.g. A → Az | z

Right recursion
– non-terminal appears as the last symbol on RHS of production
– e.g. A → zA | z

Either produces the same language
– which we use can have a significant effect depending on the parsing algorithm used


Example

Specifying a programming language (Pascal-like)

PROGRAM → HEADER VARS BODY

HEADER → program string '(' IO ')' ';'
IO → input | output | inpout | none

VARS → DECLS | void ';'
DECLS → DECLS DECL | DECL
DECL → TYPE IDS ';'
TYPE → integer | real
IDS → IDS ',' id | id

BODY → begin STMTS end

STMTS → STMTS STMT | STMT
STMT → EXPR ';'

EXPR → EXPR '+' EXPR | EXPR '=' EXPR | '(' EXPR ')' | id | number


Errors in grammars

Ambiguity: effect on semantics
– consider 2 - 1 - 3
– (2 - 1) - 3 != 2 - (1 - 3)
– checking whether a general CFG is ambiguous is impossible (the problem is undecidable)
• algorithms exist for certain classes of grammar (such as those for which we can generate parsers)

Recall: grammar used to define a language
– errors in grammar: wrong language defined
– comparison for identity (equality) between pairs of grammars in the general case is also impossible
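The effect of the two groupings of 2 - 1 - 3 can be checked with a small C sketch. The helper names sub_left and sub_right are made up for this illustration; each evaluates a chain of subtractions under one of the two possible parse trees.

```c
/* Evaluates v[0] - v[1] - ... - v[n-1] grouped to the left:
   ((v[0] - v[1]) - v[2]) - ...   Requires n >= 1. */
int sub_left(const int *v, int n)
{
    int acc = v[0];
    int i;

    for (i = 1; i < n; i++)
        acc = acc - v[i];
    return acc;
}

/* Evaluates the same chain grouped to the right:
   v[0] - (v[1] - (v[2] - ...))   Requires n >= 1. */
int sub_right(const int *v, int n)
{
    int acc = v[n - 1];
    int i;

    for (i = n - 2; i >= 0; i--)
        acc = v[i] - acc;
    return acc;
}
```

For the operands 2, 1, 3 the left grouping gives (2 - 1) - 3 = -2 and the right grouping gives 2 - (1 - 3) = 4, so an ambiguous grammar leaves the meaning of the sentence undetermined.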


Ambiguous grammars

A grammar is ambiguous if we can derive a sentence with two different parse trees
– semantics are no longer necessarily clear; e.g.

S → EXPR
EXPR → EXPR + EXPR | EXPR - EXPR | id

– NOTE: multiple ways to derive id + id - id

Leftmost derivation:

S ⇒ EXPR ⇒ EXPR + EXPR ⇒ id + EXPR ⇒ id + EXPR - EXPR ⇒ id + id - EXPR ⇒ id + id - id

Rightmost derivation:

S ⇒ EXPR ⇒ EXPR - EXPR ⇒ EXPR - id ⇒ EXPR + EXPR - id ⇒ EXPR + id - id ⇒ id + id - id


Ambiguous grammars (cont.)

[Figure: two parse trees for id + id - id, one with + at the top (its right operand derives id - id) and one with - at the top (its left operand derives id + id)]


Resolving ambiguity

Disambiguating rules
– explicitly state which parse tree is correct
– no change required to grammar

Precedence
– stated order of derivations based on operator
– recall: subtrees will be evaluated before expressions represented by root node, so order of derivations is opposite to order of evaluation

Associativity
– stated order of derivations based on location
– left associative: derivation from first choice
– right associative: derivation from last choice


Resolving ambiguity (cont.)

Rewrite the grammar
– accommodate concepts of precedence and associativity in statement of grammar
• write rules that have phrases to be evaluated first deriving later in production sequence

Precedence

EXPR → EXPR + EXPR | MEXPR
MEXPR → MEXPR * MEXPR | AEXPR
AEXPR → ( EXPR ) | number

Associativity

EXPR → EXPR + MEXPR | MEXPR
MEXPR → MEXPR * AEXPR | AEXPR
AEXPR → ( EXPR ) | number
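As a rough illustration of how the rewritten grammar fixes precedence and associativity, here is a small hand-written recursive descent evaluator in C with one function per non-terminal. This is a sketch under stated assumptions, not what yacc generates: the left-recursive rules are written as loops (a top-down parser cannot use left recursion directly), and the names eval, parse_expr, parse_mexpr and parse_aexpr are hypothetical.

```c
#include <ctype.h>

static const char *p;   /* current position in the input */
static int ok;          /* cleared when a syntax error is found */

static int parse_expr(void);

static void skip_spaces(void)
{
    while (*p == ' ')
        p++;
}

/* AEXPR -> ( EXPR ) | number */
static int parse_aexpr(void)
{
    int value = 0;

    skip_spaces();
    if (*p == '(') {
        p++;
        value = parse_expr();
        skip_spaces();
        if (*p == ')')
            p++;
        else
            ok = 0;                      /* missing ')' */
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            value = value * 10 + (*p++ - '0');
    } else {
        ok = 0;                          /* expected '(' or a number */
    }
    return value;
}

/* MEXPR -> MEXPR * AEXPR | AEXPR, written as a left-to-right loop */
static int parse_mexpr(void)
{
    int value = parse_aexpr();

    skip_spaces();
    while (ok && *p == '*') {
        p++;
        value *= parse_aexpr();
        skip_spaces();
    }
    return value;
}

/* EXPR -> EXPR + MEXPR | MEXPR, written as a left-to-right loop */
static int parse_expr(void)
{
    int value = parse_mexpr();

    skip_spaces();
    while (ok && *p == '+') {
        p++;
        value += parse_mexpr();
        skip_spaces();
    }
    return value;
}

/* Returns 1 and stores the value in *result if s is a valid EXPR. */
int eval(const char *s, int *result)
{
    p = s;
    ok = 1;
    *result = parse_expr();
    skip_spaces();
    return ok && *p == '\0';
}
```

Because * is handled one level below +, eval("2 + 3 * 4", &r) groups the multiplication first, while parentheses route back up to EXPR and override that grouping.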


Common ambiguities

Mathematical expressions
– if parentheses are not required, operators that are not associative by nature may be ambiguous

Conditional expressions
– dangling else: the else may attach to the inner if

if condition
    if condition
        statements
    else
        statements

or to the outer if

if condition
    if condition
        statements
else
    statements


Notes

Classes of grammars
– regular grammar (regex)
• A → zB | z OR A → Bz | z (i.e. not both)
– context free grammar (CFG)
• A → B (A is any non-terminal, B is any string)
– context sensitive grammar
• xAz → xBz (A is any non-terminal, B is any string)
– unrestricted grammar
• also called recursively enumerable

Example: context issues in programming languages
– symbols defined prior to use
– cannot specify with CFGs


yacc

What is yacc?
– a parser generator
– INPUT:
• a description file that uses a BNF-like notation to specify sequences of tokens to be recognized as a semantic unit
– OUTPUT:
• source code that implements the parser
– yacc operates on tokens rather than the input directly
• requires a source of tokens (like lex!)

Specific variants
– bison, byacc for creating C/C++ parsers
– CUP for creating Java parsers
– etc.


Using lex & yacc together

The parser is the higher level routine
– it calls the lexer when it needs a token from the input
– the scanner sends tokens to the parser as codes
– not all input is of interest to the parser (whitespace, comments) so the lexer does not return these

What are the token codes?
– scanner and parser must agree
• solution: let yacc define the token codes
• tokens defined in the parser will automatically be defined as a small integer value using #define macros in a header file generated automatically by yacc


yacc and parsing

Shift/reduce parsing
– a yacc parser looks for rules that might match the tokens seen so far
– has a set of states: each reflects a possible position in one or more partially matched rules
– when it reads a token that doesn’t complete a rule, it pushes the token onto a stack and switches to a new state
• this is a shift
– when it reads a token that completes a rule, it pops the RHS symbols off the stack, pushes the LHS symbol onto the stack and switches to a new state
• this is a reduce
– whenever a rule is reduced, user code associated with the rule is executed


Shift/reduce parsing

e.g.

statement → NAME = expression
expression → NUMBER + NUMBER | NUMBER - NUMBER

Parse: A = 12 + 13

stack:
A (shift A)
A = (shift =)
A = 12 (shift 12)
A = 12 + (shift +)
A = 12 + 13 (shift 13)

This matches the rule expression → NUMBER + NUMBER, so reduce: pop 13, +, 12 and push expression

stack:
A = expression (reduce)

This matches the rule statement → NAME = expression, so reduce: pop expression, =, A and push statement

End of input. Stack has been reduced to the start symbol, so the input was valid according to the grammar
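The trace above can be mimicked with a toy stack machine in C. This sketch is deliberately naive and is not how yacc works internally: yacc drives the stack from a generated state table with one token of look-ahead, whereas the code below simply compares the top of the stack against the two right-hand sides of this particular grammar. The enum and function names are invented for the illustration.

```c
enum symbol { NAME, EQUALS, NUMBER, PLUS, MINUS, EXPRESSION, STATEMENT };

#define MAX_STACK 64

/* Returns 1 if the token sequence reduces to a single STATEMENT. */
int parse(const enum symbol *tokens, int n)
{
    enum symbol stack[MAX_STACK];
    int top = 0;        /* number of symbols currently on the stack */
    int i, reduced;

    for (i = 0; i < n; i++) {
        if (top == MAX_STACK)
            return 0;                        /* input too long for this sketch */
        stack[top++] = tokens[i];            /* shift */

        do {                                 /* reduce while a RHS is on top */
            reduced = 0;
            if (top >= 3 && stack[top - 3] == NUMBER &&
                (stack[top - 2] == PLUS || stack[top - 2] == MINUS) &&
                stack[top - 1] == NUMBER) {
                /* expression -> NUMBER + NUMBER | NUMBER - NUMBER */
                top -= 3;
                stack[top++] = EXPRESSION;
                reduced = 1;
            } else if (top >= 3 && stack[top - 3] == NAME &&
                       stack[top - 2] == EQUALS &&
                       stack[top - 1] == EXPRESSION) {
                /* statement -> NAME = expression */
                top -= 3;
                stack[top++] = STATEMENT;
                reduced = 1;
            }
        } while (reduced);
    }
    /* valid only if the whole input reduced to the start symbol */
    return top == 1 && stack[0] == STATEMENT;
}
```

For the tokens of A = 12 + 13 (NAME EQUALS NUMBER PLUS NUMBER) the stack goes through the same shifts and the same two reductions as the trace, ending with just STATEMENT.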


Structure of a yacc file

Definitions section
– specify tokens and types for symbols, precedence and associativity rules
• declared outside of %{ %}
• tokens and types for symbols in the grammar (with %token and %type respectively)
• we can specify a non-integer token type (as a union) with %union
• a start symbol can be explicitly specified with %start
– anything inside of %{ %} is copied verbatim into the final program (so should be C code)
• comments, #include, #define, variables (e.g. symbol table)

DEFINITIONS SECTION
%%
RULES SECTION
%%
USER CODE SECTION


Structure of a yacc file (cont.)

Rules section
– a grammar rule and an action (program code) to execute when that rule is matched
• default start symbol is the LHS of the first rule
• NOTE: yacc cannot parse ambiguous grammars!
– a rule consists of a LHS non-terminal, a “:” (instead of “->”) and one or more alternative RHSs separated by “|”, each optionally followed by an action consisting of program code, with a semi-colon terminating the rule
– parser generated will execute any action present when it reduces a rule

User code section
– any legal program code, not enclosed in %{ %}
– copied verbatim into final program
– main(), other subroutines used (or expected) by actions from the rules section
• caution: there should only be one main() between lex and yacc (obviously)


Symbols

NOTE: yacc reverses the BNF conventions with respect to terminals and non-terminals
– non-terminals are lower-case; terminals are upper-case

Every symbol in a yacc grammar has a value
– symbols can be of different types by using the %union and %type directives
– the LHS is referred to as $$; the symbols on the RHS are referred to by position, as $1, $2, $3, …
– these shorthand notations are replaced in the generated code by the actual variable containing the value


Running yacc

Executing yacc

yacc -d -y <yaccfile>

– outputs C source, by default named y.tab.c, and an include file for use by a scanner, named y.tab.h
– must also produce/compile a scanner and link it all together

% yacc -d -y example.y (produces y.tab.[ch])

Key points
– automatically provides a function yyparse()
• scans the input, shifting/reducing until the scanner reports the end of input (subsequent calls will reset the state and continue)
– internal variables are available to both lexer and parser (yyin - input stream; yylval - value of lexer token, etc.)


Example: an expression parser

The three parts, separated by %%, are the DEFINITIONS, RULES and USER CODE sections.

%{
#include <stdlib.h>
#include <stdio.h>

int yylex(void);          /* provided by the lex-generated scanner */
int yyerror(char *s);
%}

%union {
    int ival;
    char *sval;
}

%token PLUS MINUS EQUALS
%token <sval> NAME
%token <ival> NUMBER
%type <ival> expression

%%

statement : NAME EQUALS expression { printf("%s = %d\n", $1, $3); }
          | expression { printf("= %d\n", $1); }
          ;

expression : expression PLUS NUMBER { $$ = $1 + $3; }
           | expression MINUS NUMBER { $$ = $1 - $3; }
           | NUMBER { $$ = $1; }
           ;

%%

extern FILE *yyin;

int yyerror(char *s)
{
    fprintf(stderr, "%s\n", s);
    return 0;
}

int main()
{
    if (yyin == NULL) yyin = stdin;
    while (!feof(yyin)) yyparse();
    return 0;
}


Example: expression parser’s scanner

Things to note here:
– control source of input by setting yyin (in yacc)
– yyerror() is called by yacc on parse errors (and can be freely used in actions otherwise), and should be provided
– y.tab.h is the include file generated by yacc that contains the token definitions
– yacc parsers contain an internal variable called yylval that the lexer should set to contain any value associated with a token (the token itself is always returned as an integer - as defined by yacc)
– note: o.k. to return yytext[0], but not yytext - why?
• careful managing memory when copying strings (this coupling can’t be avoided with lex/yacc)

%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"
%}

%%

[a-zA-Z_][a-zA-Z_0-9]* { yylval.sval = strdup(yytext); return(NAME); }
[0-9]+ { yylval.ival = atoi(yytext); return(NUMBER); }
"=" { return(EQUALS); }
"+" { return(PLUS); }
"-" { return(MINUS); }
[ \t] { /* ignore whitespace */ }
\n { return(0); /* logical EOF */ }

%%


Understanding conflicts

Pointer model
– you can think of yacc processing as a “pointer” which moves through the yacc grammar as each token is read
– at first there is only 1 pointer; there may be >1 to represent partially recognized rules
– e.g. (the pointer is shown as ↑)

start : ↑ A B C ;

• reads A and B

start : A B ↑ C ;

This material is drawn from “lex & yacc (2e)”, Levine, Mason and Brown


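Understanding conflicts (cont.)

Recall: a rule is reduced when a pointer reaches the end of the rule

e.g.,

start : x | y ;
x : A B z R ;
y : A B z S ;
z : C D ;

• reads A and B: two pointers, one in x and one in y, each just after B
• reads C: one pointer, in z just after C
• reads D: z is reduced, leaving two pointers, one in x and one in y, each just after z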

Understanding conflicts (cont.)

Reduce/reduce conflict
– rule is reduced while there is more than one pointer

start : x | y ;
x : A ;
y : A ;

• reads A: both x and y are complete

reduce rule x? reduce rule y?


Understanding conflicts (cont.)

Shift/reduce conflict
– one pointer reaches the end of a rule (reduce) while another pointer still expects more tokens (shift)

start : x | y R ;
x : A R ;
y : A ;

• reads A: the next token is R

shift R in rule x? reduce rule y?


Understanding look-ahead issues

Keep in mind that the implementation of a parser algorithm is a separate issue from CFGs

yacc parsers use 1 token look-ahead
– the following is not a reduce/reduce error, as yacc makes decisions based on the next token as well

start : x B | y C ;
x : A ;
y : A ;

– the following grammar is not ambiguous, however it requires 2 tokens of look-ahead
• yacc cannot do this, so: reduce/reduce error

start : x B C | y B D ;
x : A ;
y : A ;


Understanding token typing

Default token type is int

%union - identifies all possible C types that tokens can have

e.g.

%union {
    char *str;
    double real;
    int integer;
}

Permits symbols to be of type <str>, <real> or <integer>, with the type corresponding to the C type in the %union

Note: most of this is handled automatically for you - the declaration is what is important


Understanding token typing (cont.)

Now:

%token <type> TOKEN1, TOKEN2, …

– declares all listed tokens to be of the stated type

e.g.

%token <str> NAME

– the NAME (terminal) token has an associated semantic value that corresponds to the type associated with the identifier str in the %union directive

What about non-terminals?

%type <type> nonterm1, nonterm2, …


Issues

We are ignoring much in this overview:

– redefining input() and output() routines to work on sources other than streams (FILE *)

– default main() routines in yacc

– incorporating lexers and parsers as modules in a larger system

– changing the default names of files/internal functions/internal variables (necessary if you want more than one parser in a program)

– many internal variables/functions (yywrap, etc.)

– we are probably ignoring issues in covering the ignored issues


Additional resources

Online manual
http://www.gnu.org/software/bison/manual/index.html

“lex & yacc (2e)”, John Levine, Tony Mason & Doug Brown, O’Reilly, 1992

The lex & yacc primer/HOWTO
http://ds9a.nl/lex-yacc/

Google remains your friend (so I’m told)
