Compiler Tools Lex/Yacc – Flex & Bison


Page 1: Compiler Tools

Compiler Tools

Lex/Yacc – Flex & Bison

Page 2: Compiler Tools

Compiler Front End (from Engineering a Compiler)

Scanner (Lexical Analyzer)

• Maps stream of characters into words, the basic unit of syntax

x = x + y ; becomes <id,x> <eq,=> <id,x> <plus_op,+> <id,y> <sc,;>

• The actual text of a word is its lexeme
• Its part of speech (or syntactic category) is called its token type
• Scanner discards white space & (often) comments

[Diagram: Source code → Scanner → tokens → Parser → Intermediate Representation; Scanner and Parser both report Errors]

Speed is an issue in scanning, so use a specialized recognizer
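The mapping from characters to <type, lexeme> pairs can be sketched by hand in C. This is only an illustration of what a scanner does, not what flex generates; the names Token, TokenType and next_token are made up for this sketch, and it only knows the four token types from the slide's example.

```c
#include <ctype.h>
#include <stddef.h>

/* Token types (syntactic categories); names follow the slide's example. */
typedef enum { TOK_ID, TOK_EQ, TOK_PLUS_OP, TOK_SC, TOK_UNKNOWN, TOK_EOF } TokenType;

typedef struct {
    TokenType type;    /* part of speech */
    char lexeme[32];   /* the actual text of the word */
} Token;

/* Hypothetical helper: read one token starting at *p and advance *p past it. */
Token next_token(const char **p)
{
    Token t = { TOK_EOF, "" };
    size_t n = 0;

    while (isspace((unsigned char)**p))   /* scanner discards white space */
        ++*p;
    if (**p == '\0')
        return t;                         /* end of input */

    if (isalpha((unsigned char)**p)) {    /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)**p)) {
            if (n < sizeof t.lexeme - 1)  /* overlong names are truncated */
                t.lexeme[n++] = **p;
            ++*p;
        }
        t.type = TOK_ID;
    } else {                              /* single-character tokens */
        char c = *(*p)++;
        t.lexeme[n++] = c;
        t.type = c == '=' ? TOK_EQ
               : c == '+' ? TOK_PLUS_OP
               : c == ';' ? TOK_SC : TOK_UNKNOWN;
    }
    t.lexeme[n] = '\0';
    return t;
}
```

Calling next_token repeatedly on "x = x + y ;" yields the six tokens listed on the slide and then TOK_EOF.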

Page 3: Compiler Tools

The Front End (from Engineering a Compiler)

Parser

• Checks stream of classified words (parts of speech) for grammatical correctness
• Determines if code is syntactically well-formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code

Parsing is harder than scanning, so it is better to put more rules in the scanner (whitespace etc.).

[Diagram: Source code → Scanner → tokens → Parser → IR; Scanner and Parser both report Errors]

Page 4: Compiler Tools

Flex – Fast Lexical Analyzer

Flex is a scanner generator: it produces a scanner (a program to recognize patterns in text).

[Diagram: regular expressions & C-code rules → FLEX → lex.yy.c (contains yylex()) → compile → executable that analyzes and executes input]

Here’s where we’ll put the regular expressions to good use!

Page 5: Compiler Tools

Flex input file

3 sections:

definitions

%%

rules

%%

user code

Page 6: Compiler Tools

Definition Section Examples

Each definition has the form: name definition

DIGIT [0-9]

ID [a-z][a-z0-9]*

A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to: ([0-9])+"."([0-9])*
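To try out the expanded pattern outside flex, the sketch below feeds an equivalent expression to the POSIX regex library. It assumes a POSIX system (regex.h is not part of ISO C). The function name matches_float is made up for this sketch. The ^ and $ anchors are also an addition: flex matches a prefix of the remaining input, whereas this helper tests a whole string.

```c
#include <regex.h>    /* POSIX regular expressions; not part of ISO C */
#include <stddef.h>

/* Hypothetical helper: return 1 if s as a whole matches ([0-9])+"."([0-9])*, else 0. */
int matches_float(const char *s)
{
    regex_t re;
    int ok;

    /* "\\." is a literal dot, the same as the quoted "." in the flex pattern. */
    if (regcomp(&re, "^([0-9])+\\.([0-9])*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                          /* pattern failed to compile */
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this pattern "3.14" and "3." match, while ".5" and "42" do not, because at least one digit and the dot are required.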

Page 7: Compiler Tools

C Code

Can include C-code in definitions:

%{

/* This is a comment inside the definition

*/

#include <math.h> // may need headers

#include <stdio.h> // for printf in BB

#include <stdlib.h> // for exit(0) in BB

%}

Page 8: Compiler Tools

Rules

The rules section of the flex input contains a series of rules of the form: pattern action

In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.

Page 9: Compiler Tools

Example: Simple Pascal-like recognizer

Definitions section:

/* scanner for a toy Pascal-like language */
%{
/* needed for the calls to atoi() and atof() below */
#include <stdlib.h>
#include <math.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*

• Remember that %{ and %} are on a line by themselves, unindented!
• The lines between %{ and %} are inserted as-is into the resulting code
• DIGIT and ID are definitions that can be used in the rules section

Page 10: Compiler Tools

Example continued

Rules section:

%%

{DIGIT}+ { printf("An integer: %s (%d)\n", yytext, atoi(yytext ));}

{DIGIT}+"."{DIGIT}* {printf("A float: %s (%g)\n", yytext, atof(yytext));}

if|then|begin|end|procedure|function {printf("A keyword: %s\n", yytext);}

{ID} { printf( "An identifier: %s\n", yytext ); }

"+"|"-"|"*"|"/" { printf( "An operator: %s\n", yytext ); }

"{"[^}\n]*"}" /* eat up one-line comments */

[ \t\n]+ /* eat up whitespace */

. { printf( "Unrecognized character: %s\n", yytext ); }

Each rule is a pattern followed by an action. In an action, yytext is the text that matched the pattern (a char*).

Page 11: Compiler Tools

Example continued

User code (required for flex, in library for lex)

%%

int yywrap(void) { return 1; } /* needed to link, unless libfl.a is available */
/* OR put %option noyywrap at the top of a flex file. */

int main(int argc, char **argv) {
++argv, --argc; /* skip over program name */
if (argc > 0) yyin = fopen(argv[0], "r"); /* yyin is the lex input file */
else yyin = stdin;
yylex(); /* lexer function produced by lex */
return 0;
}

Page 12: Compiler Tools

Lex techniques

Hardcoding lists (such as the keyword list in the rules) is not very effective. Scanners often use a symbol table instead. There is an example in lex & yacc, not covered in class, but see me if you’re interested.
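The slide does not show the table-based technique. One common approach, sketched here as an assumption rather than as the course's method, is to match every word with a single identifier rule and then look the lexeme up in a table. The names classify_word, TOK_KEYWORD and TOK_IDENTIFIER are hypothetical; the keyword list is the one from the earlier Pascal-like example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum { TOK_IDENTIFIER = 0, TOK_KEYWORD = 1 };

/* Keywords from the toy Pascal-like scanner, kept in one table. */
static const char *const keywords[] = {
    "if", "then", "begin", "end", "procedure", "function"
};

/* Decide whether a matched word is a keyword or an ordinary identifier. */
int classify_word(const char *lexeme)
{
    size_t i;

    for (i = 0; i < sizeof keywords / sizeof keywords[0]; ++i)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;
}
```

In a flex file the action of the {ID} rule could call such a function with yytext, so adding a keyword means adding a table entry instead of editing a pattern. A real symbol table would usually use hashing rather than this linear search.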

Page 13: Compiler Tools

Bison – like Yacc (yet another compiler compiler)

[Diagram: Context-free grammar in BNF form, LALR(1)* → Bison → Bison parser (C program) that groups tokens according to grammar rules]

Bison parser provides yyparse

You must provide:
• the lexical analyzer (e.g., flex)
• an error-handling routine named yyerror
• a main routine that calls yyparse

* LALR = Look-Ahead LR: the input is scanned Left to right and the parser builds a Rightmost derivation in reverse, using 1 token of lookahead

Page 14: Compiler Tools

Bison Parser

Same sections as flex (yacc came first): definitions, rules, C-Code

We’ll discuss rules first, then definitions and C-Code

Page 15: Compiler Tools

Bison Parser – Rule Section

Consider CFG <statement> -> ID = <expression>

Would be written in bison “rules” section as:

statement: NAME '=' expression
| expression { printf("= %d\n", $1); }
;

expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
| NUMBER '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;

Use : between lhs and rhs, separate the parts with white space, and place ; at the end. What are $$, $1 and $3? Next slide…

NOTE: The first rule in statement won’t be operational yet…

Page 16: Compiler Tools

More on bison Rules and Actions

• $1, $3 refer to RHS values. $$ sets value of LHS.
• In expression, $$ = $1 + $3 means it sets the value of lhs (expression) to NUMBER ($1) + NUMBER ($3)
• A rule action is executed when the parser reduces that rule (will have recognized both NUMBER symbols)
• lexer should have returned a value via yylval (next slide)

statement: NAME '=' expression
| expression { printf("= %d\n", $1); }
;

expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
| NUMBER '-' NUMBER { $$ = $1 - $3; }
;

In the first expression rule, expression is $$, the first NUMBER is $1, '+' is $2 and the second NUMBER is $3.

When is an action executed? When the parser reduces that rule.
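One way to picture $$, $1 and $3 is as slots on a stack of semantic values that the parser maintains. The sketch below is a simplified stand-in, not bison's generated code; ValueStack, push_value and reduce_add are made-up names, and bounds checks are left out for brevity.

```c
/* A tiny stand-in for the parser's semantic value stack (no bounds checks). */
typedef struct {
    int values[16];
    int top;          /* number of values currently on the stack */
} ValueStack;

/* Shifting a symbol pushes its value. */
void push_value(ValueStack *s, int v)
{
    s->values[s->top++] = v;
}

/* Reduce by: expression: NUMBER '+' NUMBER { $$ = $1 + $3; } */
void reduce_add(ValueStack *s)
{
    int *rhs = &s->values[s->top - 3];  /* rhs[0] is $1, rhs[1] is $2, rhs[2] is $3 */
    int result = rhs[0] + rhs[2];       /* the action: $$ = $1 + $3 */

    s->top -= 3;                        /* pop the three RHS symbols */
    push_value(s, result);              /* push $$ as the value of expression */
}
```

Pushing 3, '+' and 4 and then calling reduce_add leaves a single value, 7, on the stack. That value is what a later rule sees as its own $1.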

Page 17: Compiler Tools

Coordinating flex and bison

Example to return int value:

[0-9]+ { yylval = atoi(yytext); return NUMBER; }

• return NUMBER returns the recognized token
• yylval = atoi(yytext) sets the value for use in actions. This one just stores the numeric value of the string held in yytext
• atoi is C function to convert string to integer

In prior flex examples we just returned tokens, not values

Also need to skip whitespace, return symbols

[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
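The four rules above can be mirrored by a hand-written C function, which makes the two channels visible: the return value is the token, and a separate variable carries the token's value. This is a sketch, not flex output. scan_token is a made-up name, value_out plays the role of yylval, and treating the end of the string as EOF is an assumption of the sketch. The code 259 for NUMBER is the one shown in the generated header later in these slides.

```c
#include <ctype.h>

enum { NUMBER = 259 };   /* token code; bison normally generates this in the header */

/* Hand-written counterpart of the four rules; *value_out stands in for yylval. */
int scan_token(const char **p, int *value_out)
{
    while (**p == ' ' || **p == '\t')      /* [ \t]  : ignore white space */
        ++*p;
    if (**p == '\0' || **p == '\n')        /* \n     : logical EOF */
        return 0;
    if (isdigit((unsigned char)**p)) {     /* [0-9]+ : NUMBER plus its value */
        int v = 0;
        while (isdigit((unsigned char)**p))
            v = v * 10 + (*(*p)++ - '0');
        *value_out = v;
        return NUMBER;
    }
    return (unsigned char)*(*p)++;         /* .      : return the character itself */
}
```

For the input "12 +3" followed by a newline, successive calls return NUMBER (value 12), '+', NUMBER (value 3) and finally 0.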

Page 18: Compiler Tools

Bison Rule Details

Unlike flex, bison doesn’t care about line boundaries, so add white space for readability

Symbol on lhs of first rule is start symbol, can override with %start declaration in definition section

Symbols in bison have values, must be “declared” as some type
• YYSTYPE determines type
• Default for all values is int
• We’ll be using different types for YYSTYPE in the SimpleCalc exercises

Page 19: Compiler Tools

Bison Parser – Definition Section

Tokens used in grammar should be defined.

Example rule: expression: NUMBER '+' NUMBER { $$ = $1 + $3; }

• The token NUMBER should be defined. Later we’ll see cases where expression should also be defined, and how to define tokens with other data types.
• %token must be lowercase, e.g.: %token NUMBER
• From the tokens that are defined, Bison will create an appropriate header file (when run with the -d option)
• Single quoted characters can be used as tokens without declaring them, e.g., '+', '=' etc.

Page 20: Compiler Tools

Lex - Definition Section

Must include the header created by bison

Must declare yylval as extern

%{

#include "simpleCalc.tab.h"

extern int yylval;

#include <math.h>

%}

Page 21: Compiler Tools

Bison Parser – C Section

At a minimum, provide yyerror and main routines

void yyerror(char *errmsg) { fprintf(stderr, "%s\n", errmsg); }

int main(void) {
yyparse();
return 0;
}

Page 22: Compiler Tools

Bison Intro Exercise

Download SimpleCalc.y, SimpleCalc.l and mbison.bat

Create calculator executable: mbison simpleCalc

FYI, mbison includes these steps:
bison -d simpleCalc.y
flex -L -osimpleCalc.c simpleCalc.l
gcc -c simpleCalc.c
gcc -c simpleCalc.tab.c
gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly

Test with valid sentences (e.g., 3+6-4) and invalid sentences.

Page 23: Compiler Tools

Understanding simpleCalc

simpleCalc.l:

%{
#include "simpleCalc.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
%%
/*---------------------------------------*/
/* 5. Other C code that we need. */
void yyerror(char *errmsg) { fprintf(stderr, "%s\n", errmsg); }

int main(void) {
yyparse();
return 0;
}

simpleCalc.tab.h:

#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
enum yytokentype { NAME = 258, NUMBER = 259 };
#endif
/* Tokens. */
#define NAME 258
#define NUMBER 259

Explanation:
• When the lexer recognizes a number ([0-9]+) it returns the token NUMBER and sets yylval to the corresponding integer value.
• When the lexer sees a newline it returns 0.
• If it sees a space or tab it ignores it.
• When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it – good! Otherwise the parser can generate an error.

Page 24: Compiler Tools

Understanding simpleCalc, continued

%token NAME NUMBER
%%
statement: NAME '=' expression
| expression { printf("= %d\n", $1); }
;

expression: expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;

Explanation

Execute simpleCalc and enter expression 1+2

1. main program calls yyparse.
2. This calls lex to recognize 1 as a NUMBER (puts 1 in yylval), sets $$ = $1
3. Calls lex which returns +, matches '+' in first expression rhs
4. Calls lex to recognize 2 as a NUMBER (puts 2 in yylval)
5. Recognize expression + NUMBER and “reduce” this rule, does action { $$ = $1 + $3 }.
6. Recognizes expression as a statement, so it does the printf action.
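The effect of the left-recursive expression rules can be imitated in plain C with a loop that keeps a running value. This is a hand-written sketch of what the grammar computes, not an LALR parser and not what bison generates. eval_statement and read_number are made-up names, and the NAME '=' expression alternative is left out.

```c
#include <ctype.h>

/* Skip blanks, then read one NUMBER; return 1 on success, 0 otherwise. */
static int read_number(const char **p, int *value)
{
    while (**p == ' ' || **p == '\t')
        ++*p;
    if (!isdigit((unsigned char)**p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)**p))
        *value = *value * 10 + (*(*p)++ - '0');
    return 1;
}

/* Evaluate one line; return 1 and store the result, or 0 on a syntax error. */
int eval_statement(const char *line, int *result)
{
    int acc, rhs;

    if (!read_number(&line, &acc))              /* expression: NUMBER */
        return 0;
    for (;;) {
        char op;
        while (*line == ' ' || *line == '\t')
            ++line;
        op = *line;
        if (op == '\0' || op == '\n')           /* logical EOF: statement is complete */
            break;
        if (op != '+' && op != '-')             /* no rule accepts this token */
            return 0;
        ++line;
        if (!read_number(&line, &rhs))          /* operator must be followed by NUMBER */
            return 0;
        acc = (op == '+') ? acc + rhs : acc - rhs;  /* $$ = $1 + $3 or $$ = $1 - $3 */
    }
    *result = acc;
    return 1;
}
```

Each pass through the loop corresponds to one reduction of expression '+' NUMBER or expression '-' NUMBER, with acc holding the value of the expression recognized so far. For 3+6-4 the running value goes 3, 9, 5.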

Page 25: Compiler Tools

Adding other variable types*

YYSTYPE determines the data type of the values returned by the lexer.

If lexer returns different types depending on what is read, include a union:

%union { // C feature, allows one memory area to
char cval; // be interpreted in different ways.
char *sval; // For bison, will be used with yylval
int ival;
}

• The union will be placed at the top of your .y file (in the definitions section)
• Tokens and non-terminals should be defined using the union

* relates to SimpleCalc exercise 2
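In plain C terms, the %union above corresponds to an ordinary union type, and the token code tells the parser which member is currently meaningful. The sketch below assumes that reading; SemanticValue, int_value_of and the TOK_ codes are hypothetical names invented for illustration.

```c
/* Plain-C counterpart of the %union on the slide. */
typedef union {
    char  cval;
    char *sval;
    int   ival;
} SemanticValue;

/* Hypothetical token codes, for illustration only. */
enum { TOK_CHAR = 300, TOK_STRING = 301, TOK_INT = 302 };

/* Return the integer carried by a token, or fallback if the token
   carries a different member of the union. */
int int_value_of(int token, SemanticValue v, int fallback)
{
    return token == TOK_INT ? v.ival : fallback;
}
```

All members share the same storage, so only the member that was last written should be read. Declaring a type for each token and non-terminal lets bison pick the right member for $1, $$ and so on.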

Page 26: Compiler Tools

Adding other variable types - Example

Definitions in simpleCalc.y:

%union {
float fval;
int ival;
}
%token <ival> NUMBER
%token <fval> FNUMBER
%type <fval> expression

Use union in rules in simpleCalc.l:

{DIGIT}+ { yylval.ival = atoi(yytext); return NUMBER; }

Page 27: Compiler Tools

Processing lexemes in flex*

Sometimes you want to modify a lexeme before it is passed to bison. This can be done by putting a function call in the flex rules.

Example: to convert input to lower case
• put a prototype for your function in the definition section (above first %%)
• write the function definition in the C-code section (bottom of file)
• call your function when the token is recognized. Use strdup to pass the value to bison.

* relates to SimpleCalc exercise 3

Page 28: Compiler Tools

Example continued

%{
#include <ctype.h> /* tolower */
#include <string.h> /* strlen, strdup */
#include "example.tab.h"
void make_lower(char *text_in); /* need prototype here */
%}
%%
[a-zA-Z]+ { make_lower(yytext); /* function call to process text */
yylval.sval = strdup(yytext); /* make duplicate using strdup */
return KEYWORD; } /* return token type */
%%
/* function code in C section */
void make_lower(char *text_in) {
size_t i;
for (i = 0; i < strlen(text_in); ++i)
text_in[i] = tolower((unsigned char)text_in[i]);
}
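The duplicate is needed because yytext points into the scanner's own buffer, which is overwritten when the next token is matched. The sketch below combines the lower-casing and the copy in one standalone C function. lower_copy is a made-up name, and it uses malloc instead of strdup only so that it compiles under strict ISO C.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated lower-case copy of text, or NULL if out of memory.
   The caller owns the copy and must free() it. */
char *lower_copy(const char *text)
{
    size_t len = strlen(text);
    char *copy = malloc(len + 1);
    size_t i;

    if (copy == NULL)
        return NULL;
    for (i = 0; i < len; ++i)
        copy[i] = (char)tolower((unsigned char)text[i]);
    copy[len] = '\0';
    return copy;
}
```

A flex action could assign the result of such a function to yylval.sval. The parser's actions then decide when the string is no longer needed and free it.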

Page 29: Compiler Tools

Adding actions to rules *

For more complex processing, functions can be added to bison.

Remember to add a prototype at the top, and the function at the bottom

* relates to SimpleCalc exercise 4

Page 30: Compiler Tools

Processing more than one line *

To process more than one line, ensure the \n is simply ignored

Use a recursive rule to allow multiple inputs

* relates to SimpleCalc exercise 4

Page 31: Compiler Tools

Summary of steps (from online manual)

The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts:

1. Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements.

2. Write a lexical analyzer to process input and pass tokens to the parser.

3. Write a controlling function (main) that calls the Bison-produced parser.

4. Write error-reporting routines.