Lecture 10

Introduction to Code Generation and Intermediate Representations

Joey Paquet, 2000, 2002


Introduction to Code Generation

• Front end:
  – Lexical Analysis
  – Syntactic Analysis
  – Intermediate Code Generation

• Back end:
  – Intermediate Code Optimization
  – Object Code Generation

• The front end is machine-independent, i.e. it can be reused to build compilers for different architectures

• The back end is machine-dependent, i.e. these steps are related to the nature of the assembly or machine language of the target architecture


Introduction to Code Generation

• After syntactic analysis, we have a number of options to choose from:
  – generate object code directly from the parse
  – generate intermediate code, and then generate object code from it
  – generate an intermediate abstract representation, and then generate code directly from it
  – generate an intermediate abstract representation, generate intermediate code, and then the object code

• All these options have one thing in common: they are all based on the syntactic information gathered during the parse


Introduction to Code Generation

[Figure: the four options as pipelines, front end on the left and back end on the right]
  – Lexical Analyzer → Syntactic Analyzer → Object Code
  – Lexical Analyzer → Syntactic Analyzer → Intermediate Code → Object Code
  – Lexical Analyzer → Syntactic Analyzer → Intermediate Representation → Object Code
  – Lexical Analyzer → Syntactic Analyzer → Intermediate Representation → Intermediate Code → Object Code


Intermediate Representations & Code

• Intermediate representations synthesize the syntactic information gathered during the parse, generally in the form of a tree or directed graph.

• Intermediate representations enable high-level code optimization.

• Intermediate code is a low-level coded (text) representation of the program, directly translatable to object code.

• Intermediate code enables low-level, architecture-dependent optimizations.


Part I

Intermediate Representations


Abstract Syntax Trees

• Each node represents the application of a rule in the grammar
• A subtree is created only after the complete parsing of a right-hand side
• Pointers to subtrees are sent up and grafted as upper subtrees are completed
• Parse trees (concrete syntax trees) emphasize the grammatical structure of the program
• Abstract syntax trees emphasize the actual computations to be performed. They do not refer to the actual non-terminals defined in the grammar, hence their name.


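Parse vs Abstract Syntax Trees

[Figure: parse tree (left) and abstract syntax tree (right) for x = a*b+a*b. The parse tree includes the non-terminal nodes A and E; the abstract syntax tree keeps only the operators =, + and * and the operands x, a and b.]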
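As a rough illustration, the abstract syntax tree for x = a*b+a*b could be held in memory as shown in the following Python sketch. It builds the tree bottom-up, grafting each subtree into its parent once it is complete. The class names `Leaf` and `Node` and the helper `count_nodes` are assumptions made for this example, not part of the course material.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    name: str          # an identifier such as "a" or "x"


@dataclass
class Node:
    op: str            # an operator such as "+", "*" or "="
    left: object
    right: object


def count_nodes(tree):
    """Count every node in the tree, leaves included."""
    if isinstance(tree, Leaf):
        return 1
    return 1 + count_nodes(tree.left) + count_nodes(tree.right)


# Subtrees are completed first, then grafted into their parent.
left_product = Node("*", Leaf("a"), Leaf("b"))
right_product = Node("*", Leaf("a"), Leaf("b"))
total = Node("+", left_product, right_product)
ast = Node("=", Leaf("x"), total)

node_count = count_nodes(ast)   # five leaves plus four operator nodes
```

Note that the two `a*b` subtrees are equal in structure but are separate objects; a tree never shares nodes.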


Directed Acyclic Graphs (DAG)

• Directed acyclic graphs (DAG) are a relative of syntax trees: they are used to show the syntactic structure of valid programs in the form of a “tree”.

• In DAGs, the nodes for repeated variables and expressions are merged into a single node.

• DAGs are more complicated to build than syntax trees, but they directly implement many code optimizations by avoiding redundant operations.


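AST vs DAG

[Figure: abstract syntax tree (left) and directed acyclic graph (right) for x = a*b+a*b. In the tree, the subexpression a*b appears twice; in the DAG, both operands of + point to a single shared a*b node.]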
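One common way to merge repeated variables and expressions into single nodes is to look each node up in a table before creating it. The Python sketch below assumes this table-lookup technique; the slides do not prescribe a construction method, and the name `DagBuilder` is hypothetical.

```python
class DagBuilder:
    """Creates each distinct leaf or operation only once."""

    def __init__(self):
        self.nodes = {}                      # structural key -> node id

    def leaf(self, name):
        return self._intern(("leaf", name))

    def op(self, operator, left, right):
        # left and right are node ids, so equal subexpressions give equal keys
        return self._intern((operator, left, right))

    def _intern(self, key):
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]


dag = DagBuilder()
first = dag.op("*", dag.leaf("a"), dag.leaf("b"))
second = dag.op("*", dag.leaf("a"), dag.leaf("b"))   # reuses the existing node
total = dag.op("+", first, second)
root = dag.op("=", dag.leaf("x"), total)

dag_size = len(dag.nodes)   # x, a, b, *, + and = are each stored once
```

The same statement needs nine nodes as a tree; the shared a*b node is what lets later phases compute the product only once.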


Postfix Notation

• Every expression is rewritten with its operators at the end, e.g.:

  a+b                                   ab+
  a+b*c                                 abc*+
  if A then B else C                    ABC?
  if A then if B then C else D else E   ABCD?E?
  x=a*b+a*b                             ab*ab*+x=

• Easy to generate from a bottom-up parse
• Can be generated from a syntax tree using postorder traversal
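The postorder traversal can be sketched in a few lines of Python. The tree encoding used here (a string for an operand, a tuple `(op, left, right)` for an operation) and the function name `to_postfix` are assumptions chosen for brevity.

```python
def to_postfix(tree):
    """Return the postfix string for a tree of nested (op, left, right) tuples."""
    if isinstance(tree, str):          # leaf: an operand
        return tree
    op, left, right = tree
    # Postorder: left subtree, right subtree, then the operator itself.
    return to_postfix(left) + to_postfix(right) + op


sum_tree = ("+", "a", ("*", "b", "c"))                  # a+b*c
product_sum = ("+", ("*", "a", "b"), ("*", "a", "b"))   # a*b+a*b

postfix_1 = to_postfix(sum_tree)
postfix_2 = to_postfix(product_sum)
```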


Postfix Notation

• Its nature allows it to be naturally evaluated with the use of a stack
• Operands are pushed onto the stack; operators pop the right number of operands from the stack, do the operation, then push the result back onto the stack.
• However, this notation is restricted to simple expressions such as in arithmetic, where every rule conveys an operation
• It cannot be used to express most programming language constructs
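The stack-based evaluation described above can be sketched as follows. The example assumes single-character operands whose values come from a dictionary, and it supports only a few binary operators; `eval_postfix` and `BINARY_OPS` are names invented for the illustration.

```python
import operator

BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_postfix(tokens, env):
    """Evaluate a postfix token sequence, reading operand values from env."""
    stack = []
    for tok in tokens:
        if tok in BINARY_OPS:
            right = stack.pop()            # operands come off in reverse order
            left = stack.pop()
            stack.append(BINARY_OPS[tok](left, right))
        else:
            stack.append(env[tok])         # operand: push its value
    return stack.pop()


result = eval_postfix("abc*+", {"a": 2, "b": 3, "c": 4})   # 2 + 3*4
```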


Three-Address Code

• Three-address code (3AC) is an intermediate language that maps directly to assembly code, but that is not architecture-dependent

• It breaks the program into short statements requiring no more than three variables and no more than one operator, e.g.:

  source          3AC
  x = a+b*c       t := b*c
                  x := a+t
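A minimal sketch of how such statements might be produced from an expression tree is given below. It assumes the same tuple encoding of trees as a convenience (a string for an operand, `(op, left, right)` for an operation) and numbers its temporaries t1, t2, and so on; `to_3ac` and `place` are hypothetical names.

```python
import itertools


def to_3ac(target, tree):
    """Return the 3AC statements that compute `tree` and store it in `target`."""
    code = []
    temp_ids = itertools.count(1)

    def place(node, dest=None):
        """Return the name holding node's value, emitting code when needed."""
        if isinstance(node, str):
            return node
        op, left, right = node
        left_name = place(left)
        right_name = place(right)
        name = dest or f"t{next(temp_ids)}"     # new temporary unless a target is given
        code.append(f"{name} := {left_name}{op}{right_name}")
        return name

    result = place(tree, dest=target)
    if result != target:                        # tree was a bare operand, e.g. x = a
        code.append(f"{target} := {result}")
    return code


three_address = to_3ac("x", ("+", "a", ("*", "b", "c")))   # x = a+b*c
```

Each emitted statement has at most one operator and three names, which is the property that gives the representation its name.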


Three-Address Code

• The temporary variables are generated at compile time and added to the symbol table
• In the generated code, the variables will refer to actual memory cells. Their addresses are also stored in the symbol table
• 3AC can also be represented as quadruples, which are even more closely related to assembly languages

  3AC           ASM
  t := b*c      L  3,b
                M  3,c
                ST 3,t
  x := a+t      L  3,a
                A  3,t
                ST 3,x

  3AC           Quadruples
  t := b*c      MULT t,b,c
  x := a+t      ADD x,a,t
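To show how closely quadruples relate to assembly, the sketch below stores each quadruple as a named tuple and expands it into the load, operate, store pattern of the ASM column. The `Quad` type, the `ASM_OP` mapping and the fixed register 3 are assumptions drawn from the example above, not a description of a real instruction set.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    op: str        # MULT, ADD, ...
    result: str
    arg1: str
    arg2: str


# Mnemonics copied from the ASM column of the example (assumed, not a real ISA).
ASM_OP = {"MULT": "M", "ADD": "A"}


def quad_to_asm(quad, reg=3):
    """Expand one quadruple into load / operate / store instructions."""
    return [f"L {reg},{quad.arg1}",
            f"{ASM_OP[quad.op]} {reg},{quad.arg2}",
            f"ST {reg},{quad.result}"]


quads = [Quad("MULT", "t", "b", "c"), Quad("ADD", "x", "a", "t")]
asm = [line for quad in quads for line in quad_to_asm(quad)]
```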


Intermediate Languages

• In this case, we generate code in a language for which we already have a compiler or interpreter
• Such languages are generally very low-level and dedicated to the compiler construction task
• Such a language provides the compiler writer with a “virtual machine”
• Various compilers can be built using the same virtual machine
• The virtual machine compiler can be compiled on different machines to provide a translator to various architectures.
• For the project, we have the Moon compiler, which provides a virtual assembly language and a compiler.


Project Overview

[Figure: project pipeline. Source Code → Lexical Analyzer → Token Stream → Syntactic Analyzer → Moon Code → Moon Compiler → Object Code]

• Your compiler generates Moon code
• The Moon compiler (virtual machine) is used to generate an executable for your program
• Your compiler is thus retargetable by recompilation of the Moon compiler on a new processor


Part II

Semantic Actions and Code Generation


Semantic Actions

• Semantics is about giving a meaning to the compiled program.
• Semantic actions have two parts:
  – Semantic checking: check whether the compiled program has a meaning, e.g. variables are declared, and operators and functions are called with the right number and types of parameters
  – Semantic translation: translate declarations, statements and expressions to machine code
• Semantic translation is conditional on semantic checking


Semantic Actions

• Semantic actions are inserted in the grammar (thus transforming it into an attribute grammar)
  – In recursive descent parsers, they are represented by function calls embedded in the parsing functions
  – In table-driven top-down parsers, they are represented by functions pushed on the stack along with the right-hand sides they belong to
• Most semantic actions use attributes for their resolution:
  – In recursive descent parsers, they are migrated using reference parameter passing
  – In table-driven top-down parsers, they are migrated using a semantic stack
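For the recursive descent case, the sketch below embeds one semantic action, `gen_binary`, in the parsing functions of a tiny expression grammar. The grammar, the class and all method names are assumptions made for the example. Python has no reference parameters, so the migrated attribute (the place where a subexpression's value lives) is carried by return values instead.

```python
import re


class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"[A-Za-z_]\w*|[+*]", text)
        self.pos = 0
        self.code = []            # three-address statements emitted so far
        self.temp_count = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next_token(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def gen_binary(self, op, left, right):
        """Semantic action: emit code for one operator, return the result's place."""
        self.temp_count += 1
        temp = f"t{self.temp_count}"
        self.code.append(f"{temp} := {left}{op}{right}")
        return temp

    def expr(self):                       # E -> T { + T }
        place = self.term()
        while self.peek() == "+":
            self.next_token()
            place = self.gen_binary("+", place, self.term())
        return place

    def term(self):                       # T -> F { * F }
        place = self.factor()
        while self.peek() == "*":
            self.next_token()
            place = self.gen_binary("*", place, self.factor())
        return place

    def factor(self):                     # F -> id
        tok = self.next_token()
        if tok is None or tok in ("+", "*"):
            raise SyntaxError(f"identifier expected, got {tok!r}")
        return tok


parser = Parser("a + b * c")
final_place = parser.expr()
```

A table-driven parser would perform the same action when it pops the action symbol from the parsing stack, taking `left` and `right` from a semantic stack rather than from local variables.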


Semantic Actions

• There are semantic actions associated with:
  – Declarations:
    • variable declarations
    • type declarations
    • function declarations
  – Control structures:
    • conditional statements
    • loop statements
  – Expressions:
    • assignment operations
    • arithmetic and logical expressions


Processing Declarations

• In processing declarations, the only semantic check to perform is to ensure that every object (e.g. variable, type, class, function, etc.) is declared once and only once
• This restriction is tested using the symbol table entries
• Symbol table entries are generated as declarations are encountered
• Afterwards, every time an identifier is encountered, a check is made in the symbol table to ensure that it has been properly defined
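Both checks can be sketched with a dictionary-backed symbol table. The example assumes a single scope and invents the names `SymbolTable`, `declare`, `lookup` and `SemanticError`; a real compiler would typically keep one table per scope.

```python
class SemanticError(Exception):
    pass


class SymbolTable:
    def __init__(self):
        self.entries = {}

    def declare(self, name, kind, type_name):
        """Create an entry; a second declaration of the same name is an error."""
        if name in self.entries:
            raise SemanticError(f"'{name}' is declared more than once")
        self.entries[name] = {"kind": kind, "type": type_name}

    def lookup(self, name):
        """Called whenever an identifier is used after the declarations."""
        if name not in self.entries:
            raise SemanticError(f"'{name}' is used but never declared")
        return self.entries[name]


table = SymbolTable()
table.declare("x", "variable", "integer")

errors = []
for action in (lambda: table.declare("x", "variable", "real"),   # duplicate
               lambda: table.lookup("y")):                        # undeclared
    try:
        action()
    except SemanticError as exc:
        errors.append(str(exc))
```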


Processing Declarations

• Code generation in declarations comes in the form of memory allocation for the objects defined
• Every object defined, no matter its type, will eventually have to be stored in the computer’s memory
• Memory allocation must be done according to the size of the objects defined, which depends on the target machine

• For each identifier declared, you must generate a label that will be used to refer to that variable in the ASM code and store it in the location field of its entry in the symbol table

• See the Moon machine description documentation for more explanations specific to the project


Processing Variable Declarations

• <varDecl> → <type><id>; {varDeclSem}
  – An entry is created in the corresponding symbol table. Memory space is reserved for the variable according to the size of its type and linked to a label in the ASM code
  – The starting address (or its label) is stored in the symbol table entry of the variable. In the case of arrays, the offset (the size of the elements) is often stored in the symbol table record as well
• <varDecl> → <type><idList>; {varDeclSem}
  – To generate each entry (one for each element in the list), the compiler must keep track of the type of the declaration. This is an attribute that is migrated using a technique appropriate to the parsing method used
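The varDeclSem action for an identifier list might look like the sketch below. The type sizes, the `var_` label prefix and the `reserve` pseudo-directive are placeholders invented for the illustration; they are not actual Moon syntax, which is described in the Moon documentation.

```python
TYPE_SIZES = {"integer": 4, "real": 8}   # assumed sizes in bytes; target-dependent


class DeclarationProcessor:
    def __init__(self):
        self.symbols = {}       # name -> symbol table entry
        self.directives = []    # pseudo-ASM storage reservations
        self.next_offset = 0    # next free byte in the data area

    def var_decl(self, type_name, id_list):
        """Semantic action for <varDecl> -> <type><idList>;"""
        size = TYPE_SIZES[type_name]          # the migrated type attribute
        for name in id_list:
            if name in self.symbols:
                raise ValueError(f"'{name}' is declared more than once")
            label = f"var_{name}"             # hypothetical labelling scheme
            self.symbols[name] = {"type": type_name, "size": size,
                                  "offset": self.next_offset, "location": label}
            self.directives.append(f"{label} reserve {size}")
            self.next_offset += size


alloc = DeclarationProcessor()
alloc.var_decl("integer", ["a", "b"])
alloc.var_decl("real", ["r"])
```

The type is looked up once and applied to every identifier in the list, which is the attribute migration the slide refers to.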


Processing Type Declarations

• Most programming languages allow the definition of types that are aggregates of the basic types defined in the language
• These are typically arrays or record types, or even abstract data types (or classes) in object-oriented programming languages
• <typeDecl> → <type><id> is <typeDef>; {typeDeclSem}
  – An entry is created in the symbol table for the new type defined. It contains a definition (e.g. size) of all the elements of the new type
  – This information is used when new objects of that type are declared in the program, and to compute the offset when arrays of elements of that type are created


Type Compatibility

• In Pascal:
  – A,B: array (1..10) of integer;
  – C,D: array (1..10) of integer;
• This implicitly defines two data types:
  – type Type1 is array (1..10) of integer;
  – A,B: Type1
  – type Type2 is array (1..10) of integer;
  – C,D: Type2

• Here, Type1 and Type2 are clearly two distinct data types, e.g. A := D is not permitted in the program


Type Compatibility

• Some compilers instead use rules to decide when two separately declared types are equivalent (structural equivalence). One of the first compilers to implement this was the Algol68 compiler
• Advantage:
  – gives more flexibility to the language
• Drawbacks:
  – the compiler is much more complicated to implement
  – the compiler can no longer distinguish between types that are structurally equivalent but were meant to be distinct
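The difference between the two policies can be sketched by comparing Type1 and Type2 from the Pascal example under each rule. The `ArrayType` record and both function names are assumptions for the illustration, and the structural check handles only arrays of basic types; nested types would need a recursive comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    name: str
    low: int
    high: int
    element: str       # name of a basic element type


def name_equivalent(t1, t2):
    """Types are the same only if they come from the same declaration."""
    return t1.name == t2.name


def structurally_equivalent(t1, t2):
    """Types are the same if bounds and element type match; names are ignored."""
    return (t1.low, t1.high, t1.element) == (t2.low, t2.high, t2.element)


type1 = ArrayType("Type1", 1, 10, "integer")
type2 = ArrayType("Type2", 1, 10, "integer")

by_name = name_equivalent(type1, type2)              # A := D rejected
by_structure = structurally_equivalent(type1, type2)  # A := D accepted
```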


Processing Arrays

• Static arrays are arrays with static size defined at compile time
• Most programming languages allow only integer literals for the array size, or constant integer variables when available in the language
• Pascal: A: array (1..10) of integer
• C: int A[10];
  or, in C++:
  const int size=10;
  int A[size];


Processing Arrays

• This restriction comes from the fact that the memory allocated to the array has to be set at compile time, and is fixed throughout the execution of the program

• When processing an array declaration, a sufficient amount of memory is allocated to the variable depending on the size of the elements and the cardinality of the array

• Only the starting address (or a label) is stored in the symbol table. The offset (the size of the array in memory) is also sometimes stored in the symbol table record to avoid referring to elements outside the bounds

• Dynamic arrays are generally implemented using pointers, dynamic memory allocation functions and an execution stack or heap
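For a static array, the arithmetic behind the allocation and the bounds check can be sketched as follows. The 4-byte element size is an assumption standing in for whatever the target machine uses, and the function names are hypothetical.

```python
def array_size(low, high, element_size):
    """Bytes to reserve: cardinality of the array times the element size."""
    return (high - low + 1) * element_size


def element_offset(index, low, high, element_size):
    """Byte offset of A[index] from the array's starting address."""
    if not low <= index <= high:
        raise IndexError(f"index {index} is outside {low}..{high}")
    return (index - low) * element_size


# A: array (1..10) of integer, assuming 4-byte integers
size_a = array_size(1, 10, 4)
offset_a3 = element_offset(3, 1, 10, 4)
```

The address of an element is then the starting address stored in the symbol table plus this offset.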


Processing Expressions

• Semantic records contain the type and location for variables (normally labels in the ASM code) or the type and value for constant factors

• Semantic records are created at the leaves of the tree when factors (F) are recognized, and then passed upwards in the tree

• These semantic records contain the attributes that are migrated within the tree to find a global result for the symbol on top of the tree for that expression


Processing Expressions

• As new nodes (or subtrees) are created going up in the tree, intermediate results are stored in temporary semantic records containing subresults for subexpressions

• Each time an operator node is resolved, its corresponding semantic checking and translation are done and its subresult is stored in a temporary variable for which you have to allocate some memory and generate a label

• You can even put an entry in the symbol table for each intermediate result you generate. You can then use these for further reference, e.g. for debugging


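Processing Expressions

• Doing so, the code is generated sequentially as the tree is traversed:

[Figure: expression tree for x = a+b*c, with = at the root, x and + as its children, a and * as the children of +, and b and c as the children of *]

  subtree       ASM
  t1 = b*c      L  3,b
                M  3,c
                ST 3,t1
  t2 = a+t1     L  3,a
                A  3,t1
                ST 3,t2
  x = t2        L  3,t2
                ST 3,x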
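The traversal can be sketched with semantic records passed up the tree. The example assumes a tuple encoding of the tree, a symbol table reduced to a name-to-type dictionary, and the register 3 and mnemonics of the listing above; `SemRec` and `ExprGen` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SemRec:
    type: str
    location: str      # label of the variable or temporary


MNEMONIC = {"+": "A", "*": "M"}   # taken from the listing above


class ExprGen:
    def __init__(self, symbols):
        self.symbols = symbols    # name -> type; unknown names raise KeyError
        self.asm = []
        self.temps = 0

    @staticmethod
    def check(a, b):
        """Semantic checking: both operands must have the same type."""
        if a.type != b.type:
            raise TypeError(f"type mismatch: {a.type} vs {b.type}")

    def gen(self, node):
        if isinstance(node, str):                 # leaf: record for a variable
            return SemRec(self.symbols[node], node)
        op, left, right = node
        lhs = self.gen(left)
        rhs = self.gen(right)
        self.check(lhs, rhs)
        if op == "=":                             # assignment: copy, no temporary
            self.asm += [f"L 3,{rhs.location}", f"ST 3,{lhs.location}"]
            return lhs
        self.temps += 1
        temp = SemRec(lhs.type, f"t{self.temps}")
        self.symbols[temp.location] = temp.type   # temporaries get entries too
        self.asm += [f"L 3,{lhs.location}",
                     f"{MNEMONIC[op]} 3,{rhs.location}",
                     f"ST 3,{temp.location}"]
        return temp


names = {"x": "integer", "a": "integer", "b": "integer", "c": "integer"}
generator = ExprGen(names)
generator.gen(("=", "x", ("+", "a", ("*", "b", "c"))))
```

Translation happens only after `check` succeeds, so a type error stops code generation for that node.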


Conclusions

• Most compilers build an intermediate representation of the parsed program, normally as an abstract syntax tree.

• Such a representation allows high-level optimizations to occur before the code is generated.

• In the project, we are outputting MOON code, which is an intermediate language.

• MOON code could be the subject of low-level optimizations.


Conclusions

• Semantic actions are composed of a semantic checking part and a semantic translation part.

• Semantic actions are inserted at appropriate places in the grammar to achieve the semantic checking and translation phase.

• Semantic translation is conditional on semantic checking.


Conclusions

• There are semantic actions for:
  – Declarations (variables, functions, types, etc.)
  – Expressions (arithmetic, logic, etc.)
  – Control structures (loops, conditions, etc.)
