Intro to Best Practices (RUP)

One-Pass Compiler

Compiler Design

CSC532

Symbol Table

• Stores the symbols of the source program as the compiler encounters them.

• Each entry contains the symbol name plus a number of parameters describing what is known about the symbol

• Reserved words (if, then, else, etc.) may be stored in the symbol table as well.

Symbol Table

• As a minimum we must be able to
– INSERT a new symbol into the table,
– RETRIEVE a symbol so that its parameters may be retrieved and/or modified,
– QUERY to find out if a symbol is already in the table.

• Each entry can be implemented as a record. Records can have different formats (variant records in Pascal).
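The three minimum operations can be sketched as follows. This is only an illustration: the class name `SymbolTable`, its method names, and the parameter fields (`kind`, `type`, `offset`) are assumptions, not part of the slides, and a Python dict stands in for whatever table organization the compiler actually uses.

```python
class SymbolTable:
    """Toy symbol table: maps a symbol name to a record of parameters."""

    def __init__(self):
        self._entries = {}  # name -> dict of parameters (the "record")

    def insert(self, name, **params):
        """INSERT a new symbol; refuse to overwrite an existing one."""
        if name in self._entries:
            raise KeyError(f"symbol already defined: {name}")
        self._entries[name] = dict(params)

    def retrieve(self, name):
        """RETRIEVE the record so its parameters can be read or modified."""
        return self._entries[name]

    def contains(self, name):
        """QUERY whether the symbol is already in the table."""
        return name in self._entries


table = SymbolTable()
table.insert("rate", kind="variable", type="real")
table.retrieve("rate")["offset"] = 8  # modify a parameter in place
```

Because each record here is a plain dict, different symbols can carry different fields, which plays the role of the variant record mentioned above.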

Storing characters

• Method 1: A fixed-size space within each entry, large enough to hold the largest possible name. Most names will be much shorter than this, so there will be a lot of wasted storage.

• Method 2: Store all symbols in one large separate array. Each symbol is terminated with an end-of-symbol mark (EOS). Each symbol table record contains a pointer to the first character of the symbol.

• Method n: Modern languages (e.g. Java, C++ standard components) have efficient data structures, e.g. string or vector.
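Method 2 can be sketched as below. The names `NamePool`, `add`, and `get`, and the choice of the NUL character as the EOS mark, are assumptions made for the illustration; an integer index plays the role of the pointer.

```python
EOS = "\0"  # assumed end-of-symbol mark


class NamePool:
    """All symbol names live in one shared character array."""

    def __init__(self):
        self.chars = []

    def add(self, name):
        """Append the name plus EOS; return the index of its first character."""
        start = len(self.chars)
        self.chars.extend(name)
        self.chars.append(EOS)
        return start

    def get(self, start):
        """Read characters from `start` up to (not including) the next EOS."""
        end = self.chars.index(EOS, start)
        return "".join(self.chars[start:end])


pool = NamePool()
p_position = pool.add("position")  # symbol table record would store this index
p_rate = pool.add("rate")
```

Each name costs its own length plus one mark, so short names no longer pay for the longest possible one.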

Symbol Table Data Structure

• One linear list:
– Easy to implement.
– Search time will be very long if the source has many symbols.

• Hash table:
– Run the symbol name through a hash function to create an index in a table.
– If some other symbol has already claimed the space, then rehash with another hash function to get another index, etc.
– The hash table must be large enough to accommodate the largest number of symbols.

• Open hash:
– Store the entries in a number of linear lists (called buckets).
– Use a hash function on the symbol name to determine which list to use.
– A good hash function will spread the symbols across the buckets, so each linear list will be short.
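A minimal sketch of the open-hash organization, assuming a hypothetical `BucketTable` class and a deliberately simple placeholder hash (sum of character codes). Each bucket is a short linear list that is searched sequentially.

```python
class BucketTable:
    """Open hashing: an array of buckets, each a linear list of entries."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, name):
        # Placeholder hash function: sum of character codes modulo table size.
        return self.buckets[sum(map(ord, name)) % len(self.buckets)]

    def insert(self, name, info):
        bucket = self._bucket(name)
        for entry in bucket:
            if entry[0] == name:   # symbol already present: update it
                entry[1] = info
                return
        bucket.append([name, info])

    def lookup(self, name):
        for entry_name, info in self._bucket(name):
            if entry_name == name:
                return info
        return None                # symbol not in the table
```

Unlike the rehashing scheme, the table never fills up: a collision just lengthens one list.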

Hash Functions

• Goal is to get a hash function that generates a different index for each symbol name in the source.

Index = f (string)

• Some programmers use symbols like tmp1, tmp2, tmp3, ... so the hash function should use the last character of the name.

Hash Functions (continued)

• Other programmers use symbols like xvel, yvel, zvel, ... so the hash function should use the first character of the name.

• Best if all characters in the name are used.

• Characters should be given different weights so x2y2z, y2x2z, z2y2x, ... are hashed differently.

• Modern languages have hash functions/objects.
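The effect of weighting can be illustrated with two small functions. Both are assumptions for the sketch: `sum_hash` adds the character codes with equal weight, while `weighted_hash` uses a common polynomial scheme (multiplier 31, table size 211, both chosen arbitrarily).

```python
def sum_hash(name, table_size=211):
    """Every character has the same weight, so order does not matter."""
    return sum(ord(ch) for ch in name) % table_size


def weighted_hash(name, table_size=211):
    """Polynomial hash: each position contributes with a different weight."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % table_size
    return h


names = ["x2y2z", "y2x2z", "z2y2x"]
unweighted = {sum_hash(n) for n in names}      # all three names collide
weighted = {weighted_hash(n) for n in names}   # three distinct indices
```

The three names contain the same characters in a different order, so only the position-sensitive function separates them.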

Phases of A Compiler

CONTD.

• Example source statement:
position := initial + rate * 60

• After lexical analysis:
id1 := id2 + id3 * 60
and three symbols are entered in the symbol table:
1 position
2 initial
3 rate

CONTD.

• After syntax analysis:

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   60

CONTD.

• After semantic analysis:

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   inttoreal
                   |
                  60

CONTD.

• After intermediate code generation:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

• After code optimization:
temp1 := id3 * 60.0
id1 := id2 + temp1

• After final code generation:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

Some Definitions

• Lexeme: The character sequence forming a token. Examples: :=, *, +, rate, 60

• Syntax: What programs look like.

• Semantics: What programs mean.

Context Free Grammar

• Used to specify the syntax of a language.

• Commonly written in Backus-Naur Form (BNF).

• A single production such as
list → list + digit
is not, by itself, a CFG.

Context Free Grammar

• Example: In C an if-else statement looks like:
if (expression) statement else statement

The statement is a concatenation of 7 elements:
1. the keyword if,
2. an opening parenthesis,
3. an expression,
4. a closing parenthesis,
5. a statement,
6. the keyword else,
7. a statement.

CFG

• We write this as a production:
stmt → if (expr) stmt else stmt
where
– stmt denotes a statement,
– expr denotes an expression,
– the arrow "→" is read as "can have the form".

• The tokens (terminals) in this production are: if, else, (, )

• The variables are stmt and expr. Variables stand for sequences of tokens and are called non-terminals.

CFG: Notation

• A context free grammar has 4 components:
1. A set of tokens, known as terminal symbols.
2. A set of non-terminals.
3. A set of productions.
4. A non-terminal designated as the start symbol.

CFG: Example

• The productions are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9

CFG: Example

• The vertical lines in the last production mean "or". A digit can have the form of 0 or 1 or 2, etc. The first three productions can be combined:
list → list + digit | list - digit | digit

• The tokens (terminals) of this grammar are: + - 0 1 2 3 4 5 6 7 8 9

• The non-terminals are list and digit, with list being the starting non-terminal because its productions are written first.

• What is 9-5+2 ?

• This is the parse tree for 9-5+2:

                list
             /   |   \
          list   +   digit
        /  |  \        |
     list  -  digit    2
       |        |
     digit      5
       |
       9

CFG: Another example

block → begin opt_stmts end
opt_stmts → stmts_list | ε
stmts_list → stmts_list ; stmt | stmt

where ε is the empty string of symbols.

CFG: Another example

• Ambiguity.

• Consider a grammar with a single production:
string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9

• A string like 9-5+2 will have two parse trees:

9-5+2 will have two parse trees

Left tree:

              string
           /    |    \
      string    +    string
     /   |   \          |
 string  -  string      2
    |          |
    9          5

Right tree:

         string
       /    |    \
  string    -    string
     |          /   |   \
     9     string   +   string
              |           |
              5           2

Ambiguity

• The left parse tree parses the expression as though it were written (9-5) +2 which equals 6.

• The right parse tree parses the expression as though it were written 9- (5+2) which equals 2.

• It is important to have only one parse tree for any string of symbols. The grammar should be unambiguous.
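The consequence of the two parse trees can be checked directly. In this sketch each tree is written as a nested tuple `(left, operator, right)`; that representation and the function name `evaluate_tree` are assumptions made for the illustration.

```python
def evaluate_tree(tree):
    """Evaluate a parse tree given as an int leaf or (left, op, right)."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate_tree(left), evaluate_tree(right)
    return a + b if op == "+" else a - b


left_tree = ((9, "-", 5), "+", 2)    # parses 9-5+2 as (9-5)+2
right_tree = (9, "-", (5, "+", 2))   # parses 9-5+2 as 9-(5+2)

left_value = evaluate_tree(left_tree)    # 6
right_value = evaluate_tree(right_tree)  # 2
```

The same token string yields two different values, which is why the grammar has to be made unambiguous.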

Ambiguity Reduction

• Associativity of operators.

• Precedence of operators.

• Syntax for arithmetic expressions: Assume the basic units are digits and parenthesized expressions.
factor → digit | (expr)

Associativity of operators:

• In most languages addition, subtraction, multiplication and division are left associative.

• Exponentiation is usually right associative.

• In C the assignment operator, =, is right associative: a = b = c is treated like a = (b = c).

Precedence of operators:

• Usually multiplication and division have higher precedence than addition and subtraction.

• An expression like 9+5*2 is treated as 9+(5*2), not (9+5)*2.

Syntax for arithmetic expressions

• The binary operators * and / have highest precedence. They are left associative.

term → term * factor | term / factor | factor

• Terms are combined with + and -. Therefore the resultant grammar is:
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | (expr)
digit → 0|1|2|3|4|5|6|7|8|9
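How the layered grammar enforces precedence and left associativity can be seen in a small evaluator with one function per non-terminal. This is a sketch under assumptions: the function names are invented, input is a string of single digits and operators without spaces, and the left-recursive productions are realized as loops rather than as direct recursion.

```python
def parse_expr(s, i=0):
    """expr: terms combined left to right with + and -."""
    value, i = parse_term(s, i)
    while i < len(s) and s[i] in "+-":
        op = s[i]
        rhs, i = parse_term(s, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i


def parse_term(s, i):
    """term: factors combined left to right with * and /."""
    value, i = parse_factor(s, i)
    while i < len(s) and s[i] in "*/":
        op = s[i]
        rhs, i = parse_factor(s, i + 1)
        value = value * rhs if op == "*" else value / rhs
    return value, i


def parse_factor(s, i):
    """factor: a single digit or a parenthesized expr."""
    if i < len(s) and s[i].isdigit():
        return int(s[i]), i + 1
    if i < len(s) and s[i] == "(":
        value, i = parse_expr(s, i + 1)
        if i >= len(s) or s[i] != ")":
            raise SyntaxError("expected ')'")
        return value, i + 1
    raise SyntaxError(f"unexpected input at position {i}")


def evaluate(s):
    value, i = parse_expr(s)
    if i != len(s):
        raise SyntaxError(f"unexpected input at position {i}")
    return value
```

Because `parse_expr` can only reach `*` and `/` through `parse_term`, a call such as `evaluate("9+5*2")` multiplies before it adds, and `evaluate("9-5+2")` groups from the left.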


Syntax of our Source Language

• program → program id ( identifier_list ) ; declarations subprogram_declarations compound_statement
• identifier_list → id | identifier_list , id
• declarations → declarations var identifier_list : type ; | ε
• type → standard_type | array [ num .. num ] of standard_type
• standard_type → integer | real

• subprogram_declarations → subprogram_declarations subprogram_declaration ; | ε
• subprogram_declaration → subprogram_head declarations compound_statement
• subprogram_head → function id arguments : standard_type ; | procedure id arguments ;
• arguments → ( parameter_list ) | ε
• parameter_list → identifier_list : type | parameter_list ; identifier_list : type
• compound_statement → begin optional_statements end
• optional_statements → statement_list | ε
• statement_list → statement | statement_list ; statement

• statement → variable assignop expression | procedure_statement | compound_statement | if expression then statement else statement | while expression do statement
• variable → id | id [ expression ]
• procedure_statement → id | id ( expression_list )
• expression_list → expression | expression_list , expression
• expression → simple_expression | simple_expression relop simple_expression

• simple_expression → term | sign term | simple_expression addop term
• term → factor | term mulop factor
• factor → id | id ( expression_list ) | num | ( expression ) | not factor
• sign → + | -

Here ε denotes the empty string.

Syntax-Directed Translation

• Associate a set of attributes with each grammar symbol. With each production associate a set of semantic rules for computing values of the attributes.

• Synthesized attribute: The value of the attribute at any node of a parse tree can be computed from the attribute values of the children of that node.

• Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.

SDT (continued)

• Example : Translating infix notation to postfix notation.

If a node in the parse tree is labeled with X, then let X.t be a string-valued attribute associated with the node.

X.t || Y.t means concatenate X.t with Y.t.

Syntax Directed Definition

PRODUCTION                 SEMANTIC RULE
expr → expr1 + term        expr.t := expr1.t || term.t || '+'
expr → expr1 - term        expr.t := expr1.t || term.t || '-'
expr → term                expr.t := term.t
term → 0                   term.t := '0'
term → 1                   term.t := '1'
...                        ...
term → 9                   term.t := '9'

Attribute Values at Nodes in Parse Tree

                    expr.t = 95-2+
                  /        |        \
         expr.t = 95-      +      term.t = 2
        /      |      \               |
  expr.t = 9   -   term.t = 5         2
      |                |
  term.t = 9           5
      |
      9
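The bottom-up computation of the attribute t can be sketched as a recursive function over the parse tree. The tuple encoding `(expr1, operator, term)` and the function name `postfix` are assumptions for the illustration; a string leaf stands for a term that derives a single digit.

```python
def postfix(node):
    """Compute the synthesized attribute t for a node of the parse tree."""
    if isinstance(node, str):
        # term -> digit:  term.t := digit  (and expr -> term copies it up)
        return node
    expr1, op, term = node
    # expr -> expr1 op term:  expr.t := expr1.t || term.t || op
    return postfix(expr1) + postfix(term) + op


tree = (("9", "-", "5"), "+", "2")   # parse tree for 9-5+2
result = postfix(tree)               # "95-2+"
```

Each call needs only the values returned for its children, which is exactly what makes the attribute synthesized.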

Example: Robot

PRODUCTION            SEMANTIC RULES
seq → begin           seq.x := 0
                      seq.y := 0
seq → seq1 instr      seq.x := seq1.x + instr.dx
                      seq.y := seq1.y + instr.dy
instr → east          instr.dx := 1
                      instr.dy := 0
instr → north         instr.dx := 0
                      instr.dy := 1
instr → west          instr.dx := -1
                      instr.dy := 0
instr → south         instr.dx := 0
                      instr.dy := -1

Annotated parse tree for the input "begin west south":

                     seq.x = -1
                     seq.y = -1
                   /            \
          seq.x = -1          instr.dx = 0
          seq.y = 0           instr.dy = -1
         /          \               |
   seq.x = 0     instr.dx = -1    south
   seq.y = 0     instr.dy = 0
       |              |
     begin          west
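The robot's semantic rules amount to accumulating a position from left to right. The sketch below assumes the program arrives as a space-separated string; the names `MOVES` and `robot_position` are invented for the example.

```python
# instr.dx and instr.dy for each instruction, as given by the semantic rules
MOVES = {"east": (1, 0), "north": (0, 1), "west": (-1, 0), "south": (0, -1)}


def robot_position(program):
    """Return (seq.x, seq.y) at the root for a program like 'begin west south'."""
    words = program.split()
    if not words or words[0] != "begin":
        raise ValueError("program must start with 'begin'")
    x, y = 0, 0                       # seq -> begin
    for word in words[1:]:            # seq -> seq1 instr, applied repeatedly
        dx, dy = MOVES[word]
        x, y = x + dx, y + dy
    return x, y


final = robot_position("begin west south")   # (-1, -1)
```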

Translation Schemes

• Translation scheme: A context-free grammar with semantic actions embedded within the right sides of the productions.

• Example: rest → + term {print ('+')} rest1

• The semantic action is enclosed within braces. The production itself is: rest → + term rest1

• Parse tree: Do a post-order traversal of the tree. After the + and term leaves are traversed, the {print ('+')} leaf is traversed and the semantic action is performed, then the rest1 leaf is traversed, and finally the root, rest, is visited.

• In a simple syntax-directed definition the translation order of the non-terminals on the right sides is the same as their order in the productions. These definitions can be implemented with translation schemes.

                    rest
       /      |        |           \
      +     term   {print ('+')}   rest1

Example: Translating into Post-fix Form

• expr → expr + term {print ('+')}
• expr → expr - term {print ('-')}
• expr → term
• term → 0 {print ('0')}
• term → 1 {print ('1')}
• ...
• term → 9 {print ('9')}

Parsing

• Determines if a string of tokens can be generated by a grammar.

• A parser can be constructed for any grammar.

• For any context-free grammar there is a parser that takes at most O(n^3) time to parse a string of n tokens.

• Almost all programming languages that arise in practice can be parsed in O(n) time, making a single left-to-right scan of the input and looking ahead one token at a time.

• Two classes of parsing methods:
– Top-down: Construct the parse tree starting at the root and working down towards the leaves.
– Bottom-up: Construct the parse tree starting at the leaves and working up toward the root.

• Efficient top-down parsers are easier to construct.

• Bottom-up parsers handle a larger class of grammars and translation schemes.

Top-Down Parsing

• Recursive-descent parsing is a top-down method where we execute a set of recursive procedures to process the input.

• Predictive parsing is a special case of recursive-descent parsing. It can be used if the scanned input symbol unambiguously determines the production selected for each non-terminal.

• Example grammar:
type → simple | ↑ id | array [ simple ] of type
simple → integer | char | num .. num

Pseudo Code for Predictive Parser

procedure match (t: token);
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end;

procedure type;
begin
    if lookahead is in {integer, char, num} then
        simple
    else if lookahead = '↑' then begin
        match ('↑'); match (id)
    end
    else if lookahead = array then begin
        match (array); match ('['); simple; match (']'); match (of); type
    end
    else error
end;

procedure simple;
begin
    if lookahead = integer then
        match (integer)
    else if lookahead = char then
        match (char)
    else if lookahead = num then begin
        match (num); match (..); match (num)
    end
    else error
end;
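The pseudo code translates almost line for line into a runnable sketch. Several details are assumptions: tokens are plain strings in a list, "$" is an invented end-of-input marker, "^" is an assumed ASCII spelling of the pointer marker that precedes id, and the method is called `type_` only to avoid shadowing Python's built-in `type`.

```python
class Parser:
    """Predictive parser for the type/simple grammar, one token of lookahead."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead != t:
            raise SyntaxError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1                        # lookahead := nexttoken

    def type_(self):
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")

    def simple(self):
        if self.lookahead in ("integer", "char"):
            self.match(self.lookahead)
        elif self.lookahead == "num":
            self.match("num")
            self.match("..")
            self.match("num")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")


def accepts(tokens):
    """True if the whole token list is derivable from `type`."""
    parser = Parser(tokens)
    try:
        parser.type_()
        parser.match("$")
    except SyntaxError:
        return False
    return True
```

For instance, `accepts("array [ num .. num ] of integer".split())` follows the array branch, then `simple`, then recurses into `type_` for the element type.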

• No need to backtrack as long as the first tokens on the right sides of the productions are disjoint.

• ε-productions: If any non-terminal has an ε-production, then treat the ε-production last. There is no "else error" at the end of the procedure.

• Left recursion requires special handling. A production like
expr → expr + term
is left-recursive. If the expr procedure calls itself at the beginning, the parser will loop forever. Usually the production can be rewritten to make it right-recursive.

• Example: expr → expr + term | term produces sequences like:
term
term + term
term + term + term
...

• The same sequences can be produced with the following grammar:
expr → term rest
rest → + term rest | ε

Translator for Simple Expressions

• Grammar for translating infix expressions to post-fix:
expr → expr + term {print ('+')}
expr → expr - term {print ('-')}
expr → term
term → 0 {print ('0')}
term → 1 {print ('1')}
...
term → 9 {print ('9')}

• The left-recursive productions have to be modified.

• Modified grammar is:
expr → term rest
rest → + term {print ('+')} rest
rest → - term {print ('-')} rest
rest → ε
term → 0 {print ('0')}
term → 1 {print ('1')}
...
term → 9 {print ('9')}

• In pseudo code the non-terminal procedures are:

procedure expr;
begin
    term; rest
end;

procedure rest;
begin
    if lookahead = '+' then begin
        match ('+'); term; print ('+'); rest
    end
    else if lookahead = '-' then begin
        match ('-'); term; print ('-'); rest
    end
    else
end;

procedure term;
begin
    if isdigit (lookahead) then begin
        print (lookahead); match (lookahead)
    end
    else error
end;

• Note that isdigit is a boolean-valued function that returns TRUE if the argument is a digit. The match procedure was described before.
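The three procedures can be tried out with the following sketch. It assumes the input is a string of single digits and +/- signs without spaces, and it collects the printed characters in a list instead of writing them out; the function name `infix_to_postfix` is invented for the example.

```python
def infix_to_postfix(text):
    """Translate e.g. '9-5+2' to postfix using the expr/rest/term procedures."""
    pos = 0
    out = []                      # stands in for print

    def lookahead():
        return text[pos] if pos < len(text) else ""   # "" means end of input

    def match(ch):
        nonlocal pos
        if lookahead() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def term():
        ch = lookahead()
        if ch.isdigit():
            out.append(ch)        # print(lookahead)
            match(ch)
        else:
            raise SyntaxError(f"digit expected at position {pos}")

    def rest():
        op = lookahead()
        if op in ("+", "-"):
            match(op)
            term()
            out.append(op)        # the embedded semantic action
            rest()
        # otherwise: the empty production, nothing to do and no error

    term()                        # expr -> term rest
    rest()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return "".join(out)


translated = infix_to_postfix("9-5+2")   # "95-2+"
```

The operator is emitted only after the term that follows it has been processed, which is what places it behind both operands.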

Section 2.6 – Lexical Analysis

• A lexical analyzer converts the input stream of characters into a stream of tokens to be analyzed by the parser.

• Removal of white space and comments. Most languages allow blanks, tabs, and new lines to be inserted between tokens. Comments are usually allowed as well. The lexical analyzer removes these characters.

• Constants. The lexical analyzer collects the sequence of digits for a constant and passes a single token to the parser. An attribute of the token contains the value of the constant.

Example

• The input stream 31 + 28 + 59 is transformed into five tokens with attributes:
<num, 31> <+, > <num, 28> <+, > <num, 59>

• Identifiers are names of variables, arrays, functions, etc. The parser wants to see a token like id for each identifier. Example:
count = count + increment
is converted to:
id = id + id

• We need to know if the same name has been seen before. We use a symbol table. The lexical analyzer adds a pointer to the symbol table entry as an attribute of each id token.

• Keywords: Many languages use fixed character strings like begin, end, if, ... for certain constructs. These keywords usually satisfy the rules for identifiers. We need a mechanism to distinguish between keywords and identifiers. The problem is easier if the keywords are reserved; no keyword can be used as an identifier.

• Lexemes like <, <=, and <> in Pascal need special treatment. When the lexical analyzer sees the character < it has to read the next character to see what token to pass on.

Figure 2.25: Input → (read character / push back character) → Lexical Analyzer → (token and attributes) → Parser

• Could put a buffer between the lexical analyzer and the parser to hold a number of tokens and their attributes. Usually, the buffer only holds one token; the lexical analyzer is a procedure called by the parser and returns one token and its attributes whenever called.

• The interface between the input and the lexical analyzer is complicated by characters being pushed back. When a Pascal compiler reads < it reads the next character (to see if the lexeme is <= or <>); if the source is "x<bound" then the "b" character must be pushed back.
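Reading one character ahead and pushing it back can be sketched as below. The `CharReader` class, the `tokenize` function, and the token names `num`, `id`, and `relop` are assumptions for the illustration; for simplicity an id token carries its lexeme as the attribute rather than a symbol table pointer.

```python
class CharReader:
    """Input with a one-character push-back, as between input and lexer."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self):
        ch = self.text[self.pos:self.pos + 1]   # "" at end of input
        self.pos += len(ch)
        return ch

    def push_back(self, ch):
        if ch:                                   # nothing to undo at end of input
            self.pos -= 1


def tokenize(text):
    reader = CharReader(text)
    tokens = []
    while (ch := reader.read()) != "":
        if ch.isspace():
            continue                             # white space is removed
        if ch.isdigit():
            digits = ch
            while (ch := reader.read()).isdigit():
                digits += ch
            reader.push_back(ch)                 # first non-digit belongs to the next token
            tokens.append(("num", int(digits)))
        elif ch.isalpha():
            name = ch
            while (ch := reader.read()).isalnum():
                name += ch
            reader.push_back(ch)
            tokens.append(("id", name))
        elif ch == "<":
            nxt = reader.read()
            if nxt in ("=", ">"):
                tokens.append(("relop", "<" + nxt))
            else:
                reader.push_back(nxt)            # e.g. the "b" in x<bound
                tokens.append(("relop", "<"))
        else:
            tokens.append((ch, None))            # single-character token, no attribute
    return tokens
```

The parser would call such a lexer one token at a time; collecting all tokens in a list here just keeps the example short.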
