One pass compiler Compiler Design CSC532

Intro to Best Practices (RUP)

Symbol Table

• Stores the symbols of the source program as the compiler encounters them.

• Each entry contains the symbol name plus a number of parameters describing what is known about the symbol.

• Reserved words (if, then, else, etc.) may be stored in the symbol table as well.

Symbol Table

• As a minimum we must be able to:
  – INSERT a new symbol into the table,
  – RETRIEVE a symbol so that its parameters may be read and/or modified,
  – QUERY to find out if a symbol is already in the table.

• Each entry can be implemented as a record. Records can have different formats (variant records in Pascal).
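The three minimum operations can be sketched in Python as follows. This is only an illustration: the class names Entry and SymbolTable, the attrs dictionary, and the choice to reject duplicate inserts are assumptions, not part of the slides, and a built-in dict stands in for whatever table organization is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One symbol table record: the name plus what is known about it."""
    name: str
    attrs: dict = field(default_factory=dict)  # hypothetical parameter bag


class SymbolTable:
    def __init__(self):
        self._entries = {}  # a dict stands in for the real table organization

    def query(self, name):
        """Return True if the symbol is already in the table."""
        return name in self._entries

    def insert(self, name, **attrs):
        """Add a new symbol; duplicates are rejected (an assumed policy)."""
        if name in self._entries:
            raise KeyError(f"duplicate symbol: {name}")
        entry = Entry(name, dict(attrs))
        self._entries[name] = entry
        return entry

    def retrieve(self, name):
        """Return the record so its parameters can be read or modified."""
        return self._entries[name]


table = SymbolTable()
table.insert("rate", kind="variable")
table.retrieve("rate").attrs["type"] = "real"  # modify a parameter in place
```
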

Storing characters

• Method 1: A fixed-size space within each entry, large enough to hold the longest possible name. Most names will be much shorter than this, so a lot of storage is wasted.

• Method 2: Store all symbols in one large separate array. Each symbol is terminated with an end-of-symbol mark (EOS). Each symbol table record contains a pointer to the first character of the symbol.

• Method n: Modern languages (e.g. Java, C++ standard library components) have efficient data structures for this, e.g. string or vector.
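Method 2 can be sketched as below. The class name StringPool and the use of the NUL character as the EOS mark are assumptions made for illustration; the "pointer" is simply an index into the character array.

```python
EOS = "\0"  # assumed end-of-symbol mark


class StringPool:
    def __init__(self):
        self.chars = []  # the one large array of characters

    def add(self, name):
        """Append name plus EOS; return the index of its first character."""
        start = len(self.chars)
        self.chars.extend(name)
        self.chars.append(EOS)
        return start

    def get(self, start):
        """Read characters from start up to (not including) the EOS mark."""
        end = self.chars.index(EOS, start)
        return "".join(self.chars[start:end])


pool = StringPool()
p_position = pool.add("position")  # a symbol table record would store this index
p_rate = pool.add("rate")
```

No space is wasted on short names, at the cost of scanning for the EOS mark whenever a name is read back.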

Symbol Table Data Structure

• One linear list:
  – Easy to implement.
  – Search time will be very long if the source has many symbols.

Symbol Table Data Structure

• Hash table:
  – Run the symbol name through a hash function to create an index into a table.
  – If some other symbol has already claimed the space, rehash with another hash function to get another index, and so on.
  – The hash table must be large enough to accommodate the largest expected number of symbols.
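One way to realize the rehashing idea is double hashing, sketched below. The class name RehashTable, both hash functions, and the prime table size of 11 are assumptions chosen to keep the example small; the slides do not prescribe a particular rehash scheme.

```python
class RehashTable:
    def __init__(self, size=11):      # a prime size lets every step reach all slots
        self.size = size
        self.slots = [None] * size    # each slot holds (name, info) or None

    def _h1(self, name):
        return sum(ord(c) for c in name) % self.size

    def _h2(self, name):
        # Second hash function; never 0, so each rehash moves to a new slot.
        return 1 + (len(name) % (self.size - 1))

    def _probe(self, name):
        """Yield the first index, then each rehashed index, covering every slot once."""
        index, step = self._h1(name), self._h2(name)
        for _ in range(self.size):
            yield index
            index = (index + step) % self.size

    def insert(self, name, info=None):
        for i in self._probe(name):
            if self.slots[i] is None or self.slots[i][0] == name:
                self.slots[i] = (name, info)
                return i
        raise OverflowError("hash table is full")

    def lookup(self, name):
        for i in self._probe(name):
            if self.slots[i] is None:
                return None           # an empty slot ends the search
            if self.slots[i][0] == name:
                return self.slots[i][1]
        return None
```

The OverflowError branch is the reason the table must be sized for the largest number of symbols: once every slot is claimed, no amount of rehashing finds room.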

Symbol Table Data Structure

• Open hash:
  – Store the entries in a number of linear lists (called buckets).
  – Use a hash function on the symbol name to determine which list to use.
  – A good hash function will spread the symbols across the buckets, so each linear list will be short.
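A minimal sketch of open hashing follows. The class name BucketTable, the bucket count, and the placeholder hash (sum of character codes) are assumptions for illustration only.

```python
class BucketTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]  # one linear list per bucket

    def _bucket(self, name):
        # Placeholder hash: the sum of character codes picks the list.
        return self.buckets[sum(ord(c) for c in name) % len(self.buckets)]

    def insert(self, name, info=None):
        bucket = self._bucket(name)
        for entry in bucket:
            if entry[0] == name:      # already present: update in place
                entry[1] = info
                return
        bucket.append([name, info])

    def lookup(self, name):
        for entry_name, info in self._bucket(name):
            if entry_name == name:
                return info
        return None
```

Unlike the fixed table with rehashing, this structure never fills up; a poor hash function only makes individual lists longer.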

Hash Functions

• Goal is to get a hash function that generates a different index for each symbol name in the source.

Index = f (string)

• Some programmers use symbols like tmp1, tmp2, tmp3, ... so the hash function should use the last character of the name.

Hash Functions (continued)

• Other programmers use symbols like xvel, yvel, zvel, ... so the hash function should use the first character of the name.

• Best if all characters in the name are used.

• Characters should be given different weights so that x2y2z, y2x2z, z2y2x, ... are hashed differently.

• Modern languages have hash functions/objects.
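The effect of weighting can be seen by comparing an unweighted sum with a polynomial hash. Both function names, the table size 211, and the base 31 are assumptions picked for this sketch.

```python
def sum_hash(name, table_size=211):
    """Unweighted: names that are permutations of each other always collide."""
    return sum(ord(ch) for ch in name) % table_size


def weighted_hash(name, table_size=211, base=31):
    """Polynomial hash: every character contributes, weighted by its position."""
    h = 0
    for ch in name:
        h = (h * base + ord(ch)) % table_size
    return h


names = ["x2y2z", "y2x2z", "z2y2x"]
unweighted = {sum_hash(n) for n in names}      # a single shared index
weighted = {weighted_hash(n) for n in names}   # three distinct indices
```
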

Phases of A Compiler

CONTD.

• Example source statement:
    position := initial + rate * 60
• After lexical analysis:
    id1 := id2 + id3 * 60
  and three symbols are entered in the symbol table:
    1 position
    2 initial
    3 rate

CONTD.

• After syntax analysis:

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   60

CONTD.

• After semantic analysis:

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   inttoreal
                 |
                 60

CONTD.

• After intermediate code generation:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
• After code optimization:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
• After final code generation:
    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

Some Definitions

• Lexeme: The character sequence forming a token. Examples: :=, *, +, rate, 60

• Syntax: What programs look like.

• Semantics: What programs mean.

Context Free Grammar

• Specifies the syntax of a language.

• Also known as Backus-Naur Form or BNF.

• A single production such as
    list → list + digit
  is not, by itself, a CFG.

Context Free Grammar

• Example: In C an if-else statement looks like:
    if (expression) statement else statement
  The statement is a concatenation of 7 elements:
  1. the keyword if,
  2. an opening parenthesis,
  3. an expression,
  4. a closing parenthesis,
  5. a statement,
  6. the keyword else,
  7. a statement.

CFG

• We write this as a production:
    stmt → if (expr) stmt else stmt
  where
  – stmt denotes a statement,
  – expr denotes an expression,
  – the arrow “→” is read as “can have the form”.

• The tokens in this production are: if, else, (, ).

• The variables are stmt and expr.

• Variables represent sequences of tokens and are called non-terminals.

CFG: Notation

• A context free grammar has 4 components:
  1. A set of tokens, known as terminal symbols.
  2. A set of non-terminals.
  3. A set of productions.
  4. A non-terminal designated as the start symbol.

CFG: Example

• The productions are:
    list → list + digit
    list → list - digit
    list → digit
    digit → 0|1|2|3|4|5|6|7|8|9

CFG: Example

• The vertical lines in the last production mean “or”. A digit can have the form of 0 or 1 or 2, etc. The first three productions can be combined:
    list → list + digit | list - digit | digit

• The tokens (terminals) of this grammar are: + - 0 1 2 3 4 5 6 7 8 9

• The non-terminals are list and digit, with list being the starting non-terminal because its productions are written first.

• What is 9-5+2?

• This is the parse tree for 9-5+2:

              list
            /  |   \
        list   +   digit
      /  |  \        |
  list   -  digit    2
   |          |
 digit        5
   |
   9

CFG: Another example

    block → begin opt_stmts end
    opt_stmts → stmts_list | ε
    stmts_list → stmts_list ; stmt | stmt

Where ε = the empty string of symbols.

CFG: Another example

• Ambiguity.

• Consider a grammar with a single production:
    string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9

• A string like 9-5+2 will have two parse trees:

9-5+2 will have two parse trees

          string                         string
        /   |   \                      /   |   \
   string   +   string            string   -   string
  /  |  \         |                 |         /  |  \
string - string   2                 9    string + string
  |        |                               |        |
  9        5                               5        2

Ambiguity

• The left parse tree parses the expression as though it were written (9-5) +2 which equals 6.

• The right parse tree parses the expression as though it were written 9- (5+2) which equals 2.

• It is important to have only one parse tree for any string of symbols. The grammar should be unambiguous.
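The two readings can be checked with a small sketch that represents each parse tree as a nested tuple. The tuple encoding and the helper names evaluate and leaves are assumptions made for this illustration.

```python
# Parse trees as nested tuples: (operator, left, right); leaves are ints.
left_tree = ("+", ("-", 9, 5), 2)    # groups as (9-5)+2
right_tree = ("-", 9, ("+", 5, 2))   # groups as 9-(5+2)


def evaluate(tree):
    """Evaluate a tree bottom-up."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b


def leaves(tree):
    """Read the digits and operators off the tree from left to right."""
    if isinstance(tree, int):
        return str(tree)
    op, left, right = tree
    return leaves(left) + op + leaves(right)
```

Both trees yield the same string of leaves, 9-5+2, yet they evaluate to different values, which is exactly why an ambiguous grammar is a problem.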

Ambiguity Reduction

• Associativity of operators.

• Precedence of operators.

• Syntax for arithmetic expressions: Assume the basic units are digits and parenthesized expressions.
    factor → digit | (expr)

Associativity of operators:

• In most languages addition, subtraction, multiplication and division are left associative.

• Exponentiation is usually right associative.

• In C the assignment operator, =, is right associative: a = b = c is treated like a = (b = c).

Precedence of operators:

• Usually multiplication and division have higher precedence than addition and subtraction.

• An expression like 9+5*2 is treated as 9+(5*2), not (9+5)*2.

Syntax for arithmetic expressions

• The binary operators * and / have the highest precedence. They are left associative.
    term → term * factor | term / factor | factor

• Terms are combined with + and -. Therefore the resultant grammar is:
    expr → expr + term | expr - term | term
    term → term * factor | term / factor | factor
    factor → digit | (expr)
    digit → 0|1|2|3|4|5|6|7|8|9
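How this grammar fixes precedence and associativity can be sketched with one function per non-terminal. This is an illustrative evaluator, not the course's implementation: the function name evaluate is an assumption, and each left-recursive production is realized as a loop that folds operands from left to right.

```python
def evaluate(text):
    """Evaluate single-digit arithmetic following the expr/term/factor grammar."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():                      # factor -> digit | ( expr )
        tok = peek()
        if tok is not None and tok in "0123456789":
            advance()
            return int(tok)
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        raise SyntaxError(f"unexpected token: {tok!r}")

    def term():                        # term -> term * factor | term / factor | factor
        value = factor()
        while peek() in ("*", "/"):    # the loop gives left associativity
            op = peek()
            advance()
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def expr():                        # expr -> expr + term | expr - term | term
        value = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            right = term()
            value = value + right if op == "+" else value - right
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return result
```

Because expr only ever combines whole terms, a multiplication is always finished before the surrounding addition sees it.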

STOP here

Syntax of our Source Language

• program → program id (identifier_list) ; declarations subprogram_declarations compound_statement
• identifier_list → id | identifier_list , id
• declarations → declarations var identifier_list : type ; | ε
• type → standard_type | array [num..num] of standard_type
• standard_type → integer | real

• subprogram_declarations → subprogram_declarations subprogram_declaration ; | ε
• subprogram_declaration → subprogram_head declarations compound_statement
• subprogram_head → function id arguments : standard_type ; | procedure id arguments ;
• arguments → (parameter_list) | ε
• parameter_list → identifier_list : type | parameter_list ; identifier_list : type
• compound_statement → begin optional_statements end
• optional_statements → statement_list | ε
• statement_list → statement | statement_list ; statement

• statement → variable assignop expression | procedure_statement | compound_statement | if expression then statement else statement | while expression do statement
• variable → id | id [expression]
• procedure_statement → id | id (expression_list)
• expression_list → expression | expression_list , expression
• expression → simple_expression | simple_expression relop simple_expression

• simple_expression → term | sign term | simple_expression addop term
• term → factor | term mulop factor
• factor → id | id (expression_list) | num | (expression) | not factor
• sign → + | -

Syntax-Directed Translation

• Associate a set of attributes with each grammar symbol. With each production, associate a set of semantic rules for computing the values of the attributes.

• Synthesized attribute: The value of the attribute at any node of a parse tree can be computed from the attribute values of the children of the node.

• Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.

SDT (continued)

• Example: Translating infix notation to postfix notation.

• If a node in the parse tree is labeled with X, then let X.t be a string-valued attribute associated with the node.

• X.t || Y.t means concatenate X.t with Y.t.

Syntax Directed Definition

PRODUCTION               SEMANTIC RULE
expr → expr1 + term      expr.t := expr1.t || term.t || ‘+’
expr → expr1 - term      expr.t := expr1.t || term.t || ‘-’
expr → term              expr.t := term.t
term → 0                 term.t := ‘0’
term → 1                 term.t := ‘1’
……                       ……
term → 9                 term.t := ‘9’
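The semantic rules can be applied by a single bottom-up traversal, sketched below for the parse tree of 9-5+2. The Node class and the helper names annotate and make_term are assumptions introduced for this example.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label            # grammar symbol or token at this node
        self.children = list(children)
        self.t = None                 # the synthesized attribute X.t


def annotate(node):
    """Compute node.t bottom-up from the children's t values."""
    for child in node.children:
        annotate(child)
    if node.label == "term":                      # term -> digit
        node.t = node.children[0].label
    elif node.label == "expr":
        if len(node.children) == 1:               # expr -> term
            node.t = node.children[0].t
        else:                                     # expr -> expr1 op term
            expr1, op, term = node.children
            node.t = expr1.t + term.t + op.label  # expr1.t || term.t || op
    return node.t


def make_term(digit):
    return Node("term", [Node(digit)])


# Parse tree for 9-5+2
nine = Node("expr", [make_term("9")])
nine_minus_five = Node("expr", [nine, Node("-"), make_term("5")])
root = Node("expr", [nine_minus_five, Node("+"), make_term("2")])
postfix = annotate(root)
```
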

Attribute Values at Nodes in Parse Tree

                    expr.t = 95-2+
                  /       |        \
         expr.t = 95-     +     term.t = 2
        /     |      \              |
 expr.t = 9   -   term.t = 5        2
     |                |
 term.t = 9           5
     |
     9

Example: Robot

PRODUCTION            SEMANTIC RULES
seq → begin           seq.x := 0                    seq.y := 0
seq → seq1 instr      seq.x := seq1.x + instr.dx    seq.y := seq1.y + instr.dy
instr → east          instr.dx := 1                 instr.dy := 0
instr → north         instr.dx := 0                 instr.dy := 1
instr → west          instr.dx := -1                instr.dy := 0
instr → south         instr.dx := 0                 instr.dy := -1

• Annotated parse tree for the input begin west south:

                 seq.x = -1
                 seq.y = -1
                /          \
        seq.x = -1        instr.dx = 0
        seq.y = 0         instr.dy = -1
       /         \             |
 seq.x = 0     instr.dx = -1  south
 seq.y = 0     instr.dy = 0
     |              |
   begin           west
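The robot rules amount to accumulating one displacement per instruction. The sketch below follows the two seq productions directly; the names DELTAS and robot_position, and the choice to return the position after every prefix, are assumptions for illustration.

```python
DELTAS = {               # instr.dx, instr.dy for each instruction
    "east": (1, 0),
    "north": (0, 1),
    "west": (-1, 0),
    "south": (0, -1),
}


def robot_position(tokens):
    """Return (seq.x, seq.y) after each prefix of the input."""
    if not tokens or tokens[0] != "begin":
        raise SyntaxError("sequence must start with 'begin'")
    x, y = 0, 0                      # seq -> begin
    trace = [(x, y)]
    for instr in tokens[1:]:         # seq -> seq1 instr
        dx, dy = DELTAS[instr]
        x, y = x + dx, y + dy
        trace.append((x, y))
    return trace


trace = robot_position(["begin", "west", "south"])
```

Each element of trace corresponds to one seq node of the annotated tree, read from the bottom up.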

Translation Schemes

• Translation scheme: A context-free grammar with semantic actions embedded within the right sides of the productions.

• Example:
    rest → + term {print (‘+’)} rest1

• The semantic action is enclosed within braces. The production itself is:
    rest → + term rest1

• Parse tree: Do a post-order traversal of the tree. After the + and term leaves are traversed, the {print (‘+’)} leaf is traversed and the semantic action is performed, then the rest1 leaf is traversed, and then the root, rest, is visited.

• In a simple syntax-directed definition the translation order of the non terminals on the right sides is the same as their order in the productions. These definitions can be implemented with translation schemes.

                rest
       /     |        |          \
      +    term  {print (‘+’)}   rest1

Example: Translating into Post-fix Form

• expr → expr + term {print (‘+’)}
• expr → expr - term {print (‘-’)}
• expr → term
• term → 0 {print (‘0’)}
• term → 1 {print (‘1’)}
• ……
• term → 9 {print (‘9’)}

Parsing

• Determines if a string of tokens can be generated by a grammar.
• A parser can be constructed for any grammar.
• For any context-free grammar there is a parser that takes at most O(n³) time to parse a string of n tokens.
• Almost all programming languages that arise in practice can be parsed in O(n) time, making a single left-to-right scan of the input and looking ahead one token at a time.
• Two classes of parsing methods:
  – Top-down: Construct the parse tree starting at the root and working down towards the leaves.
  – Bottom-up: Construct the parse tree starting at the leaves and working up towards the root.
• Efficient top-down parsers are easier to construct.
• Bottom-up parsers handle a larger class of grammars and translation schemes.

Top-Down Parsing

• Recursive-descent parsing is a top-down method where we execute a set of recursive procedures to process the input.

• Predictive parsing is a special case of recursive-descent parsing. It can be used if the scanned input symbol unambiguously determines the production selected for each non-terminal.

• Example grammar:
    type → simple | ↑ id | array [simple] of type
    simple → integer | char | num .. num

Pseudo Code for Predictive Parser

procedure match (t: token);
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end;

procedure type;
begin
    if lookahead is in {integer, char, num} then
        simple
    else if lookahead = ‘↑’ then begin
        match (‘↑’); match (id)
    end
    else if lookahead = array then begin
        match (array); match (‘[’); simple; match (‘]’); match (of); type
    end
    else error
end;

procedure simple;
begin
    if lookahead = integer then
        match (integer)
    else if lookahead = char then
        match (char)
    else if lookahead = num then begin
        match (num); match (..); match (num)
    end
    else error
end;
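The pseudo code translates almost line for line into a runnable sketch. The assumptions here are loud ones: tokens arrive as a pre-split list of strings, "$" is an invented end marker, the pointer alternative is written ↑ id as in the usual textbook form of this grammar, and the names PredictiveParser, type_ and accepts are hypothetical.

```python
class PredictiveParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" is an assumed end marker
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead == t:
            self.pos += 1                    # lookahead := nexttoken
        else:
            raise SyntaxError(f"expected {t!r}, found {self.lookahead!r}")

    def type_(self):                         # "type" is a Python built-in name
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "↑":
            self.match("↑")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")

    def simple(self):
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("..")
            self.match("num")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")


def accepts(tokens):
    """Return True if the whole token list is derivable from type."""
    parser = PredictiveParser(tokens)
    try:
        parser.type_()
        parser.match("$")
        return True
    except SyntaxError:
        return False
```

For example, accepts("array [ num .. num ] of integer".split()) succeeds without any backtracking, because each branch is chosen from the single lookahead token.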

• No need to backtrack as long as the first tokens on the right sides of the productions for a non-terminal are disjoint.

• ε-productions: If a non-terminal has an ε-production, treat the ε-production last. There is no “else error” at the end of the procedure.

• Left recursion requires special handling. A production like expr → expr + term is left-recursive. If the expr procedure calls itself at the beginning, the parser will loop forever. Usually the production can be rewritten to make it right-recursive.

• Example: expr → expr + term | term produces sequences like:

    term
    term + term
    term + term + term
    …

• The same sequences can be produced with the following grammar:
    expr → term rest
    rest → + term rest | ε

Translator for Simple Expressions

• Grammar for translating infix expressions to post-fix:
    expr → expr + term {print (‘+’)}
    expr → expr - term {print (‘-’)}
    expr → term
    term → 0 {print (‘0’)}
    term → 1 {print (‘1’)}
    ……
    term → 9 {print (‘9’)}

• The left-recursive productions have to be modified.

• Modified grammar is:

    expr → term rest
    rest → + term {print (‘+’)} rest
    rest → - term {print (‘-’)} rest
    rest → ε
    term → 0 {print (‘0’)}
    term → 1 {print (‘1’)}
    ……
    term → 9 {print (‘9’)}

• In pseudo code the non-terminal procedures are:

procedure expr;
begin
    term; rest
end;

procedure rest;
begin
    if lookahead = ‘+’ then begin
        match (‘+’); term; print (‘+’); rest
    end
    else if lookahead = ‘-’ then begin
        match (‘-’); term; print (‘-’); rest
    end
    else { do nothing: rest → ε }
end;

procedure term;
begin
    if isdigit (lookahead) then begin
        print (lookahead); match (lookahead)
    end
    else error
end;

• Note that isdigit is a boolean-valued function that returns TRUE if the argument is a digit. The match procedure was described before.
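Put together, the procedures expr, rest, term and match form a complete infix-to-postfix translator. The sketch below keeps the recursive structure of the pseudo code; the class name Translator, the helper to_postfix, the "$" end marker, and collecting output in a list instead of printing are all assumptions made so the example is easy to run and check.

```python
class Translator:
    def __init__(self, text):
        self.text = text.replace(" ", "") + "$"   # "$" is an assumed end marker
        self.pos = 0
        self.out = []                             # collects what print would emit

    @property
    def lookahead(self):
        return self.text[self.pos]

    def match(self, t):
        if self.lookahead != t:
            raise SyntaxError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1

    def expr(self):                  # expr -> term rest
        self.term()
        self.rest()

    def rest(self):                  # rest -> + term {print} rest | - term {print} rest | ε
        if self.lookahead in ("+", "-"):
            op = self.lookahead
            self.match(op)
            self.term()
            self.out.append(op)      # the embedded action {print (op)}
            self.rest()
        # otherwise rest -> ε: consume nothing, and it is not an error

    def term(self):                  # term -> digit {print (digit)}
        if self.lookahead.isdigit():
            self.out.append(self.lookahead)
            self.match(self.lookahead)
        else:
            raise SyntaxError(f"digit expected, found {self.lookahead!r}")


def to_postfix(text):
    translator = Translator(text)
    translator.expr()
    translator.match("$")            # the whole input must be consumed
    return "".join(translator.out)
```

Calling to_postfix("9-5+2") produces the same string as the attribute value at the root of the annotated parse tree shown earlier.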

Section 2.6 – Lexical Analysis

• A lexical analyzer converts the input stream of characters into a stream of tokens to be analyzed by the parser.

• Removal of white space and comments. Most languages allow blanks, tabs, and new lines to be inserted between tokens. Comments are usually allowed as well. The lexical analyzer removes these characters.

• Constants. The lexical analyzer collects the sequence of digits for a constant and passes a single token to the parser. An attribute of the token contains the value of the constant.

Example

• The input stream 31 + 28 + 59 is transformed into five tokens with attributes:
    <num, 31> <+, > <num, 28> <+, > <num, 59>

• Identifiers are names of variables, arrays, functions, etc. The parser wants to see a token like id for each identifier. Example:
    count = count + increment
  is converted to:
    id = id + id

• We need to know if the same name has been seen before. We use a symbol table. The lexical analyzer adds a pointer to the symbol table entry as an attribute of each identifier token.
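A small lexical analyzer covering white space, constants and identifiers might look like the sketch below. It is deliberately simplified: the function name tokenize is an assumption, the symbol table is a plain list whose index plays the role of the pointer, comments are not handled, and every other character is passed through as a one-character token with no attribute.

```python
def tokenize(text, symbol_table):
    """Return (token, attribute) pairs; identifiers are entered in symbol_table."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                       # strip white space
            i += 1
        elif ch.isdigit():                     # collect the digits of a constant
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("num", int(text[start:i])))
        elif ch.isalpha():                     # collect an identifier
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            name = text[start:i]
            if name not in symbol_table:       # first time this name is seen
                symbol_table.append(name)
            tokens.append(("id", symbol_table.index(name)))
        else:                                  # single-character token, no attribute
            tokens.append((ch, None))
            i += 1
    return tokens


symbols = []
stream = tokenize("count = count + increment", symbols)
```

Both occurrences of count receive the same attribute, which is how later phases know they refer to the same symbol.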

• Keywords: Many languages use fixed character strings like begin, end, if, … for certain constructs. These keywords usually satisfy the rules for identifiers. We need a mechanism to distinguish between keywords and identifiers. The problem is easier if the keywords are reserved; no keyword can be used as an identifier.

• Lexemes like <, <=, and <> in Pascal need special treatment. When the lexical analyzer sees the character <, it has to read the next character to see what token to pass on.

Figure 2.25: The lexical analyzer reads characters from the input (and can push a character back) and passes tokens and their attributes to the parser.

• Could put a buffer between the lexical analyzer and the parser to hold a number of tokens and their attributes. Usually the buffer holds only one token; the lexical analyzer is a procedure that is called by the parser and returns one token and its attributes each time it is called.

• The interface between the input and the lexical analyzer is complicated by characters being pushed back. When a Pascal compiler reads <, it reads the next character (to see if the lexeme is <= or <>); if the source is “x<bound” then the “b” character must be pushed back.
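The push-back mechanism can be sketched with a reader that lets the lexical analyzer undo its most recent read. The names PushbackReader and scan_relop are assumptions, and an empty string is used here to signal end of input.

```python
class PushbackReader:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self):
        """Return the next character, or '' at end of input."""
        if self.pos >= len(self.text):
            return ""
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def push_back(self, ch):
        """Undo the most recent read of a real character."""
        if ch:                       # nothing to undo at end of input
            self.pos -= 1


def scan_relop(reader):
    """Called after '<' was read: decide between <, <= and <>."""
    nxt = reader.read()
    if nxt == "=":
        return "<="
    if nxt == ">":
        return "<>"
    reader.push_back(nxt)            # e.g. the 'b' in x<bound
    return "<"
```

Only one character of push-back is ever needed for these lexemes, which is why a simple position decrement is enough in this sketch.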