Compilers and Code Optimization
wpage.unina.it/edoardo.fusella/cco/downloads/lezione3.pdf

EDOARDO FUSELLA

Front-end

Contents

Lexical Analysis

Syntax Analysis

Semantic Analysis

The role of the front end

The front end of the compiler performs analysis.

The analysis is usually broken up into

Lexical Analysis: breaking the input into individual words or “tokens”

Syntax Analysis (or parsing): analyzing the phrase structure of the program

Semantic Analysis: calculating the program’s meaning

Lexical Analysis

Lexical Analyzer Goals

The lexical analyzer takes a stream of characters and produces a stream of names, keywords, and punctuation marks

It discards white space and comments between the tokens

It would unduly complicate the parser to have to account for possible white space and comments at every possible point

Lexical Analyzer Phases

Lexical analyzers are sometimes divided into a cascade of two phases:

Scanning

Consists of the simple processes that do not require tokenization of the input.

Deletion of comments.

Compaction of consecutive whitespace characters into one.

Lexical analysis

Encode constants as tokens

Recognize Keywords and Identifiers

Store identifier names in a symbol table

Lexical Analyzer Input/Output

Input

A sequence of characters

Character set:

ASCII

ISO 8859-1 (Latin-1)

ISO 10646 (Unicode; originally a 16-bit code)

Others (EBCDIC, JIS, etc)

Output

A series of tokens:

Punctuation ( ) ; , [ ]

Operators + - ** :=

Keywords begin end if while try catch

Identifiers Square_Root

String literals "press Enter to continue"

Character literals 'x'

Numeric literals

Integer: 123

Floating_point: 4_5.23e+2

Based representation: 16#ac#
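The input/output relation above can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the token kind names, the `--` comment syntax, and the regular expressions are hypothetical choices (loosely Ada-like, to match the literals listed above), not the definition of any particular language.

```python
import re

# Illustrative assumptions: keyword set, token kinds, and patterns are made up
# for this sketch and only cover the examples listed on the slide.
KEYWORDS = {"begin", "end", "if", "while", "try", "catch"}
TOKEN_SPEC = [
    ("COMMENT", r"--[^\n]*"),                 # assumed comment syntax: -- to end of line
    ("SPACE",   r"[ \t\r\n]+"),
    ("BASED",   r"[0-9]+#[0-9a-fA-F]+#"),     # e.g. 16#ac#
    ("FLOAT",   r"[0-9][0-9_]*\.[0-9][0-9_]*(?:[eE][+-]?[0-9]+)?"),
    ("INT",     r"[0-9][0-9_]*"),
    ("STRING",  r'"[^"\n]*"'),
    ("CHAR",    r"'[^'\n]'"),
    ("NAME",    r"[A-Za-z][A-Za-z0-9_]*"),
    ("OP",      r"\*\*|:=|[+\-]"),
    ("PUNCT",   r"[()\[\];,]"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))

def tokenize(text):
    """Turn a character sequence into a list of (kind, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("COMMENT", "SPACE"):      # discarded between tokens
            continue
        if kind == "NAME":                    # keywords are reserved identifiers
            kind = "KEYWORD" if lexeme.lower() in KEYWORDS else "IDENT"
        tokens.append((kind, lexeme))
    return tokens
```

For example, `tokenize("Square_Root := 123 ;")` yields an identifier, an operator, an integer literal, and a punctuation token, with the blanks dropped.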

Lexical Analyzer Free form vs Fixed form

Free form languages (all modern ones)

White space does not matter. Ignore these:

Tabs, spaces, new lines, carriage returns

Only the ordering of tokens is important

Fixed format languages (historical)

Layout is critical: the lexical analyzer must know about layout to find tokens

Fortran Fixed Format:

80 columns per line

Columns 1-5 for the statement number (label)

Column 6 for the continuation mark

Columns 7-72 for the program statements

Columns 73-80 ignored (used for other purposes)

Letter C in Column 1 meant the current line is a comment
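To make the column layout concrete, the following sketch splits one fixed-form line into its fields. The function name and the returned dictionary keys are hypothetical. Treating a blank or `0` in column 6 as "no continuation" follows the usual Fortran 77 convention and is an assumption beyond what the slide states.

```python
def split_fixed_form(line):
    """Split one fixed-form source line into its column fields."""
    line = line.rstrip("\n").ljust(80)[:80]       # pad or cut to 80 columns
    if line[0] in "Cc":                           # letter C in column 1: comment line
        return {"comment": line[1:].strip()}
    return {
        "label": line[0:5].strip(),               # columns 1-5
        "continuation": line[5] not in " 0",      # column 6 (assumed: blank or 0 means none)
        "statement": line[6:72].rstrip(),         # columns 7-72
        "ignored": line[72:80].rstrip(),          # columns 73-80
    }
```

A lexical analyzer for such a language would run a step like this on each line before looking for tokens in the statement field.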

Lexical Analyzer Punctuation/Operators

Punctuation

Separators

Typically individual special characters such as ( { } : …

Sometimes double characters; the lexical scanner looks for the longest token: (*, /* and -- are comment openers in various languages

Returns token kind

And perhaps location for error messages and debugging purposes

Operators

Like punctuation

No real difference for lexical analyzer

Typically single or double special chars ( +, -, ==, <= …)

Returns token kind

And perhaps location

Lexical Analyzer Keywords/Identifiers

Keywords

Reserved identifiers

E.g. BEGIN END in Pascal, if in C, catch in C++

Returns token kind

And perhaps location

Identifiers

Rules differ: Length, allowed characters, separators

Need to build a names table

Single entry for all occurrences of Var1

Language may be case insensitive: same entry for VAR1, vAr1, Var1

Typical structure: hash table

Returns token kind

And key (index) to table entry

Table entry includes location information

Lexical Analyzer Organization of names table

Most common structure is hash table

Chain according to hash code

Serial search on one chain

Hash code computed from characters (e.g. sum mod table size).

No hash code is perfect! Expect collisions.

Avoid any arbitrary limits on table or chain size.
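A minimal sketch of such a names table, assuming a chained hash table whose hash code is the sum of the character codes modulo the table size. The class name, method names, and entry layout are hypothetical. Chains and the entry list are ordinary Python lists, so they grow without a fixed limit.

```python
class NamesTable:
    """Chained hash table with a single entry for all occurrences of a name."""

    def __init__(self, size=211, case_insensitive=False):
        self.buckets = [[] for _ in range(size)]   # one chain per hash code
        self.case_insensitive = case_insensitive
        self.entries = []                          # the index into this list is the key

    def _hash(self, name):
        return sum(map(ord, name)) % len(self.buckets)

    def intern(self, name, location):
        """Return the key of the entry for `name`, creating the entry if needed."""
        text = name.lower() if self.case_insensitive else name
        chain = self.buckets[self._hash(text)]
        for index in chain:                        # serial search on one chain
            if self.entries[index]["name"] == text:
                self.entries[index]["locations"].append(location)
                return index
        self.entries.append({"name": text, "spelling": name, "locations": [location]})
        chain.append(len(self.entries) - 1)
        return len(self.entries) - 1
```

With `case_insensitive=True`, `VAR1`, `vAr1`, and `Var1` all map to the same key, while the first spelling seen is kept for diagnostics.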

Lexical Analyzer String and Character Literals

String Literals

Text must be stored

Actual characters are important

Not like identifiers: must preserve casing

Character set issues: uniform internal representation

Table needed

Lexical analyzer returns key into table

May or may not be worth hashing to avoid duplicates

Character Literals

Similar issues to string literals

Lexical Analyzer returns token kind and identity of character

Lexical Analyzer Numeric Literals

Need a table to store numeric value

E.g. 123 = 0123 = 01_23 (Ada uses underscores to separate groups of digits)

But cannot use a predefined host type for the values, because the target type may have different bounds

Floating point representations much more complex

Denormals, correct rounding

Very delicate to compute correct value

Host / target issues

Lexical Analyzer Handling Comments

Comments have no effect on program

Can be eliminated by scanner

But may need to be retrieved by tools

Error detection issues

E.g. unclosed comments

Scanner skips over comments and returns next meaningful token

Lexical Analyzer Case Equivalence

Some languages are case-insensitive

Pascal, Ada

Some are not

C, Java

Lexical analyzer ignores case if needed

This_Routine = THIS_RouTine

Error analysis may need exact casing

Friendly diagnostics follow user’s conventions

Lexical Analyzer Performance Issues

Lexical analysis can become bottleneck

Minimize processing per character

Skip blanks fast

I/O is also an issue (read large blocks)

We compile frequently

Compilation time is important, especially during development

Communicate with parser through global variables

Lexical Analyzer Interface to Lexical Analyzer

Either: Convert entire file to a file of tokens

Lexical analyzer is separate phase

Or: Parser calls lexical analyzer to supply next token

This approach avoids extra I/O

Parser builds tree incrementally, using successive tokens as tree nodes

Lexical Analyzer Formalism: Regular grammar

Non-terminals (arbitrary names)

Terminals (characters)

Productions limited to the following:

Non-terminal ::= terminal

Non-terminal ::= terminal Non-terminal

Treat character class (e.g. digit) as terminal

Regular grammars cannot count:

Cannot conveniently express size limits on identifiers, literals (a fixed limit needs a separate non-terminal per position)

Cannot express proper nesting (parentheses)

Lexical Analyzer Formalism: Regular grammar

Grammar for real literals with no exponent

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

REAL ::= digit REAL1

REAL1 ::= digit REAL1 (arbitrary size)

REAL1 ::= . INTEGER

INTEGER ::= digit INTEGER (arbitrary size)

INTEGER ::= digit

Start symbol is REAL

Lexical Analyzer Formalism: Regular Expressions

Regular expressions (RE) are defined by an alphabet (terminal symbols) and three operations:

Alternation (union) RE1 | RE2

Concatenation RE1 RE2

Repetition RE* (zero or more RE’s)
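As a small illustration, the real literals without exponent described by the earlier regular grammar can be written with only these three operations. The sketch uses Python's `re` module. The names `DIGIT`, `REAL`, and `is_real` are hypothetical.

```python
import re

DIGIT = "(0|1|2|3|4|5|6|7|8|9)"                  # alternation
REAL = f"{DIGIT}{DIGIT}*\\.{DIGIT}{DIGIT}*"      # concatenation and repetition

def is_real(text):
    """True if the whole text is one or more digits, a dot, and one or more digits."""
    return re.fullmatch(REAL, text) is not None
```

In practice a character class such as `[0-9]` would replace the long alternation. It is spelled out here only to show that nothing beyond the three operations is needed.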

Languages defined by REs = languages defined by regular grammars

Regular expressions are more convenient for some applications

Lexical Analyzer Finite State Machines

A language defined by a grammar is a (possibly infinite) set of strings

An automaton is a computation that determines whether a given string belongs to a specified language

A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)

Lexical Analyzer Specifying an FSM

A set of labeled states

Directed arcs between states labeled with character

One or more states may be terminal

A distinguished state is start

Automaton makes transition from state S1 to S2

If and only if arc from S1 to S2 is labeled with next character in input

Token is legal if automaton stops on terminal state

Lexical Analyzer Building FSM from Grammar

One state for each non-terminal

A rule of the form

Non-terminal-1 ::= terminal

Generates transition from S1 to final state

A rule of the form

Non-terminal-1 ::= terminal Non-terminal-2

Generates transition from S1 to S2 on an arc labeled by the terminal
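The two construction rules can be sketched in code, applied to the grammar for real literals given earlier. The tuple encoding of productions, the state name `FINAL`, and the function names are assumptions made for this illustration. Because INTEGER has two rules on `digit`, the resulting machine is non-deterministic, so the simulation tracks a set of current states.

```python
DIGITS = "0123456789"
FINAL = "FINAL"     # hypothetical name for the single final state

# (head, terminal, successor non-terminal or None), from the real-literal grammar.
PRODUCTIONS = [
    ("REAL", "digit", "REAL1"),
    ("REAL1", "digit", "REAL1"),
    ("REAL1", ".", "INTEGER"),
    ("INTEGER", "digit", "INTEGER"),
    ("INTEGER", "digit", None),      # Non-terminal ::= terminal
]

def build_fsm(productions):
    """Return the arcs as a map (state, symbol) -> set of successor states."""
    arcs = {}
    for head, terminal, successor in productions:
        arcs.setdefault((head, terminal), set()).add(successor or FINAL)
    return arcs

def accepts(arcs, start, text):
    """Simulate the machine; the token is legal if it can stop on the final state."""
    states = {start}
    for ch in text:
        symbol = "digit" if ch in DIGITS else ch    # character class treated as a terminal
        states = set().union(*(arcs.get((s, symbol), set()) for s in states))
        if not states:
            return False
    return FINAL in states
```

For instance, `accepts(build_fsm(PRODUCTIONS), "REAL", "12.5")` holds, while `"12."` is rejected because the machine stops in INTEGER rather than in the final state.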

Lexical Analyzer Graphic representation

[Figure: state diagram of the FSM for real literals, with start state Real and arcs labeled digit, digit, ".", digit]

Lexical Analyzer Building FSMs from REs

Every RE corresponds to a grammar

For all regular expressions

A natural translation to FSM exists

Alternation often leads to non-deterministic machines

Syntax Analysis

Syntax Analysis Parsing

Parsing is the process of determining whether a string of tokens can be generated by a grammar.

Most parsing methods fall into one of two classes, called the top-down and bottom-up methods.

In top-down parsing, construction of the parse tree starts at the root and proceeds to the leaves. In bottom-up parsing, construction starts at the leaves and proceeds towards the root.

Efficient top-down parsers are easy to build by hand.

Bottom-up parsing, however, can handle a larger class of grammars.

Bottom-up parsers are not as easy to build, but tools for generating them directly from a grammar are available.

Syntax Analysis Context-free Grammars

Context-free grammar (CFG)

Language is a set of strings; each string is a finite sequence of symbols taken from a finite alphabet

For parsing

the strings are source programs

the symbols are lexical tokens

the alphabet is the set of token-types returned by the lexical analyzer

A grammar has a set of productions of the form

𝑠𝑦𝑚𝑏𝑜𝑙 → 𝑠𝑦𝑚𝑏𝑜𝑙 𝑠𝑦𝑚𝑏𝑜𝑙 … 𝑠𝑦𝑚𝑏𝑜𝑙

Where

there are zero or more symbols on the right-hand side

Each symbol is either terminal, meaning that it is a token from the alphabet of strings in the language, or nonterminal, meaning that it appears on the left-hand side of some production.

Syntax Analysis Derivations

Productions are treated as rewriting rules to generate a string

We can perform a derivation to show that a certain sentence is in the language of the grammar

Start with the start symbol, then repeatedly replace any nonterminal by one of its right-hand sides

Rightmost and leftmost derivations

A rightmost/leftmost derivation is one in which the rightmost/leftmost nonterminal symbol is always the one expanded

𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | − 𝐸 | (𝐸) | 𝑖𝑑

Leftmost derivation for −(𝑖𝑑 + 𝑖𝑑)

𝐸 => −𝐸 => −(𝐸) => −(𝐸 + 𝐸) => −(𝑖𝑑 + 𝐸) => −(𝑖𝑑 + 𝑖𝑑)

Syntax Analysis Derivations: example

Grammar 1

𝐸 → 𝑖𝑑

𝐸 → 𝑛𝑢𝑚

𝐸 → 𝐸 ∗ 𝐸

𝐸 → 𝐸/𝐸

𝐸 → 𝐸 + 𝐸

𝐸 → 𝐸 − 𝐸

𝐸 → (𝐸)

Derivation for 1-2-3 (rightmost)

𝐸 => 𝐸 − 𝐸 => 𝐸 − 3 => 𝐸 − 𝐸 − 3 => 𝐸 − 2 − 3 => 1 − 2 − 3

[Figure: parse tree for 1 − 2 − 3]

Syntax Analysis Ambiguous Grammars

A grammar is ambiguous if it can derive a sentence with different parse trees.

Grammar 1 is ambiguous

Parse trees for the sentence 1-2-3

(1 − 2) − 3 = −4 versus 1 − (2 − 3) = 2

Syntax Analysis Ambiguous Grammars

Similarly

(1 + 2) ∗ 3 = 9 versus 1 + (2 ∗ 3) = 7
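The practical consequence of ambiguity is that the two parse trees of one sentence evaluate differently. A minimal sketch, assuming a parse tree is encoded as either a number or a hypothetical `(left, op, right)` tuple:

```python
def evaluate(tree):
    """Evaluate a tree given as an int or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return {"+": a + b, "-": a - b, "*": a * b}[op]

# The two parse trees Grammar 1 allows for the sentence 1-2-3.
left_grouped = ((1, "-", 2), "-", 3)
right_grouped = (1, "-", (2, "-", 3))

# The two parse trees it allows for 1+2*3.
plus_first = ((1, "+", 2), "*", 3)
times_first = (1, "+", (2, "*", 3))
```

Here `evaluate(left_grouped)` is -4 and `evaluate(right_grouped)` is 2, so a compiler working from an ambiguous grammar could assign either meaning to the same source text.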

Syntax Analysis Elimination of ambiguity

Ambiguous grammars are problematic for compiling

Unambiguous grammars preferred

Often ambiguous grammars can be transformed into unambiguous grammars.

Considering previous example

∗ has higher precedence than +

each operator associates to the left, so that we get (1 − 2) − 3 instead of 1 − (2 − 3)

Syntax Analysis Elimination of ambiguity: example

Grammar 2

𝐸 → 𝐸 + 𝑇

𝐸 → 𝐸 − 𝑇

𝐸 → 𝑇

𝑇 → 𝑇 ∗ 𝐹

𝑇 → 𝑇/𝐹

𝑇 → 𝐹

𝐹 → 𝑖𝑑

𝐹 → 𝑛𝑢𝑚

𝐹 → (𝐸)

Derivation for 1-2-3 (rightmost)

𝐸 => 𝐸 − 𝑇 => 𝐸 − 𝐹 => 𝐸 − 3 => 𝐸 − 𝑇 − 3 => 𝐸 − 𝐹 − 3 => 𝐸 − 2 − 3 => 𝑇 − 2 − 3 => 𝐹 − 2 − 3 => 1 − 2 − 3

• Same set of sentences as the ambiguous grammar

• Each sentence has exactly one parse tree

• The symbols 𝐸, 𝑇, and 𝐹 stand for expression, term, and factor

• factors are things you multiply

• terms are things you add
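The effect of Grammar 2 can be sketched with one function per nonterminal. Two assumptions are made here that are not in the slides: the left-recursive rules (such as E → E + T) are realized as loops, a common hand-coding technique that still groups operators to the left, and only numeric operands are handled (no identifiers). All names are hypothetical.

```python
import re

def evaluate(text):
    """Evaluate an arithmetic expression following the E / T / F layering."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def factor():                              # F -> num | ( E )
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    def term():                                # T -> T * F | T / F | F
        value = factor()
        while peek() in ("*", "/"):
            op, rhs = advance(), factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():                                # E -> E + T | E - T | T
        value = term()
        while peek() in ("+", "-"):
            op, rhs = advance(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

Because terms are built from factors before expressions are built from terms, `evaluate("1+2*3")` multiplies first, and `evaluate("1-2-3")` groups as (1 − 2) − 3.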

Top Down Parsing

A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right

It finds a leftmost derivation for an input string

Example:

S → cAd        Input: cad

A → ab | a

    S               S                                 S
  / | \           / | \       Backtrack             / | \
 c  A  d         c  A  d      when we choose       c  A  d
                   / \        the wrong rule          |
                  a   b                               a
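The backtracking in this example can be sketched as follows. The dictionary encoding of the grammar and the function name are assumptions for illustration. The sketch always expands the leftmost symbol first and tries the alternatives of a nonterminal in order, falling back to the next one when a choice leads to a mismatch. It would not terminate on a left-recursive grammar.

```python
# Grammar from the example: S -> cAd, A -> ab | a (each body is a list of symbols).
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}

def parse(symbols, text):
    """Return True if the symbol sequence can derive exactly `text`."""
    if not symbols:
        return text == ""
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: try each alternative in turn; a failed one is a backtrack.
        return any(parse(body + rest, text) for body in GRAMMAR[first])
    # Terminal: it must match the next input character.
    return text.startswith(first) and parse(rest, text[len(first):])
```

On the input `cad`, the call `parse(["S"], "cad")` first tries A → ab, fails when `b` does not match `d`, and then succeeds with A → a.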

Recursive Descent Parsing

Top-down parser

Each nonterminal corresponds to one recursive procedure

Each procedure recognizes an instance of a non-terminal

returns tree fragment for the non-terminal

Example:

S → if E then S else S

S → begin S L

S → print E

L → end

L → S L

E → num = num

One function for each nonterminal

One clause for each production

Recursive Descent Parsing Example

S → if E then S else S

S → begin S L

S → print E

L → end

L → S L

E → num = num
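A minimal sketch of a recursive descent parser for this grammar, with one function per nonterminal and one branch per production. It assumes the input is already a list of token kinds (plain strings such as `"if"` or `"num"`), and the tuple shapes returned as tree fragments are hypothetical.

```python
def parse(tokens):
    """Parse a list of token kinds and return a tree made of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {peek()!r} at token {pos}")
        pos += 1

    def S():
        if peek() == "if":                     # S -> if E then S else S
            eat("if")
            cond = E()
            eat("then")
            then_part = S()
            eat("else")
            return ("if", cond, then_part, S())
        if peek() == "begin":                  # S -> begin S L
            eat("begin")
            first = S()
            return ("begin", first, L())
        if peek() == "print":                  # S -> print E
            eat("print")
            return ("print", E())
        raise SyntaxError(f"expected a statement, found {peek()!r} at token {pos}")

    def L():
        if peek() == "end":                    # L -> end
            eat("end")
            return ("end",)
        first = S()                            # L -> S L
        return ("seq", first, L())

    def E():                                   # E -> num = num
        eat("num")
        eat("=")
        eat("num")
        return ("eq", "num", "num")

    tree = S()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r} at token {pos}")
    return tree
```

The first token of each right-hand side is enough to choose the production, so this parser never needs to backtrack.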

Bottom-up Parsing

Constructs a parse tree for an input string beginning at the leaves (the bottom) and working towards the root (the top)

Example: id*id

𝐸 → 𝐸 + 𝑇 | 𝑇

𝑇 → 𝑇 ∗ 𝐹 | 𝐹

𝐹 → (𝐸) | 𝑖𝑑

[Figure: successive snapshots of the bottom-up parse of id*id: id * id, F * id, T * id, T * F, T, E]

Shift-reduce parser

Bottom-up parser

The general idea is to shift some symbols of input to the stack until a reduction can be applied

At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production

The key decisions during bottom-up parsing are about when to reduce and about what production to apply

A reduction is a reverse of a step in a derivation

The goal of a bottom-up parser is to construct a rightmost derivation in reverse:

𝐸 => 𝑇 => 𝑇 ∗ 𝐹 => 𝑇 ∗ 𝑖𝑑 => 𝐹 ∗ 𝑖𝑑 => 𝑖𝑑 ∗ 𝑖𝑑

Shift-reduce parser Handle pruning

A handle is a substring that matches the body of a production and whose reduction represents one step along the reverse of a rightmost derivation

Right sentential form    Handle    Reducing production
id*id                    id        F->id
F*id                     F         T->F
T*id                     id        F->id
T*F                      T*F       T->T*F

𝐸 → 𝐸 + 𝑇 | 𝑇

𝑇 → 𝑇 ∗ 𝐹 | 𝐹

𝐹 → (𝐸) | 𝑖𝑑

Shift-reduce parser Handle pruning

Basic operations:

Shift

Reduce

Accept

Error

Example: 𝑖𝑑 ∗ 𝑖𝑑

Stack     Input     Action
$         id*id$    shift
$id       *id$      reduce by F->id
$F        *id$      reduce by T->F
$T        *id$      shift
$T*       id$       shift
$T*id     $         reduce by F->id
$T*F      $         reduce by T->T*F
$T        $         reduce by E->T
$E        $         accept
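The trace above can be reproduced with a small shift-reduce loop. This is only a sketch: the shift/reduce decisions are hand-coded for the grammar E → E+T | T, T → T*F | F, F → (E) | id by looking at the top of the stack and the next input token, whereas a real bottom-up parser would normally take them from generated tables. The function and variable names are hypothetical.

```python
def shift_reduce(tokens):
    """Parse a token list and return the trace as (stack, input, action) rows."""
    stack, rest, trace = [], list(tokens) + ["$"], []

    def step(action):
        trace.append(("$" + "".join(stack), "".join(rest), action))

    def reduce(count, head, rule):
        step("reduce by " + rule)
        del stack[len(stack) - count:]         # pop the handle
        stack.append(head)                     # push the head of the production

    while True:
        la = rest[0]                           # lookahead token
        if stack[-1:] == ["id"]:
            reduce(1, "F", "F->id")
        elif stack[-3:] == ["(", "E", ")"]:
            reduce(3, "F", "F->(E)")
        elif stack[-3:] == ["T", "*", "F"]:
            reduce(3, "T", "T->T*F")
        elif stack[-1:] == ["F"]:
            reduce(1, "T", "T->F")
        elif stack[-3:] == ["E", "+", "T"] and la != "*":
            reduce(3, "E", "E->E+T")
        elif stack[-1:] == ["T"] and la != "*":
            reduce(1, "E", "E->T")
        elif stack == ["E"] and la == "$":
            step("accept")
            return trace
        elif la != "$":
            step("shift")
            stack.append(rest.pop(0))
        else:
            step("error")
            raise SyntaxError("cannot shift or reduce with stack " + "".join(stack))
```

Calling `shift_reduce(["id", "*", "id"])` returns the nine rows of the table, ending with `("$E", "$", "accept")`. Note that T is not reduced to E while the lookahead is `*`, which is how the decision about when to reduce shows up in code.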

Semantic Analysis

Role of Semantic Analysis

The principal job of the semantic analyzer is to enforce static semantic rules

constructs a syntax tree (usually first)

information gathered is needed by the code generator

Considerable variety in the extent to which parsing, semantic analysis, and intermediate code generation are interleaved

A common approach interleaves construction of a syntax tree with parsing, and then follows with separate, sequential phases for semantic analysis and code generation

Semantic Analysis Attribute Grammars

Context-Free Grammars (CFGs) are used to specify the syntax of programming languages

E.g. arithmetic expressions

How do we tie these rules to mathematical concepts?

Attribute grammars are annotated CFGs in which annotations are used to establish meaning relationships among symbols

Provide a formal framework for decorating such a tree

Both semantic analysis and (intermediate) code generation can be described in terms of annotation, or "decoration" of a parse/syntax tree

Semantic Analysis Attribute Grammars: an example

Each grammar symbol has a set of attributes

E.g. the value of E1 is the attribute E1.val

Each grammar rule has a set of rules over the symbol attributes

Semantic function rules

E.g. sum, quotient

Copy rules

1. 𝐸 → 𝐸 + 𝑇

2. 𝐸 → 𝐸 − 𝑇

3. 𝐸 → 𝑇

4. 𝑇 → 𝑇 ∗ 𝐹

5. 𝑇 → 𝑇/𝐹

6. 𝑇 → 𝐹

7. 𝐹 → 𝑖𝑑

8. 𝐹 → 𝑛𝑢𝑚

9. 𝐹 → (𝐸)

1. 𝐸1 → 𝐸2 + 𝑇    𝐸1.𝑣𝑎𝑙 ≔ 𝑠𝑢𝑚(𝐸2.𝑣𝑎𝑙, 𝑇.𝑣𝑎𝑙)
2. 𝐸1 → 𝐸2 − 𝑇    𝐸1.𝑣𝑎𝑙 ≔ 𝑑𝑖𝑓𝑓(𝐸2.𝑣𝑎𝑙, 𝑇.𝑣𝑎𝑙)
3. 𝐸 → 𝑇          𝐸.𝑣𝑎𝑙 ≔ 𝑇.𝑣𝑎𝑙
4. 𝑇1 → 𝑇2 ∗ 𝐹    𝑇1.𝑣𝑎𝑙 ≔ 𝑝𝑟𝑜𝑑(𝑇2.𝑣𝑎𝑙, 𝐹.𝑣𝑎𝑙)
5. 𝑇1 → 𝑇2/𝐹      𝑇1.𝑣𝑎𝑙 ≔ 𝑞𝑢𝑜𝑡(𝑇2.𝑣𝑎𝑙, 𝐹.𝑣𝑎𝑙)
6. 𝑇 → 𝐹          𝑇.𝑣𝑎𝑙 ≔ 𝐹.𝑣𝑎𝑙
7. 𝐹 → 𝑖𝑑         𝐹.𝑣𝑎𝑙 ≔ 𝑖𝑑.𝑣𝑎𝑙
8. 𝐹 → 𝑛𝑢𝑚        𝐹.𝑣𝑎𝑙 ≔ 𝑛𝑢𝑚.𝑣𝑎𝑙
9. 𝐹 → (𝐸)        𝐹.𝑣𝑎𝑙 ≔ 𝐸.𝑣𝑎𝑙
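One way to read these rules is as a recipe for decorating a parse tree bottom-up. The sketch below assumes a hypothetical `Node` class with a single `val` attribute. It applies the semantic functions for rules 1, 2, 4, and 5, and copy rules for the others, including the parenthesized form of rule 9.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                                   # grammar symbol, e.g. "E", "num", "+"
    children: list = field(default_factory=list)
    val: float = None                             # num leaves carry their value

SEMANTIC = {
    "+": lambda a, b: a + b,                      # sum
    "-": lambda a, b: a - b,                      # diff
    "*": lambda a, b: a * b,                      # prod
    "/": lambda a, b: a / b,                      # quot
}

def decorate(node):
    """Fill in node.val for every node, children first, and return the value."""
    kids = node.children
    for child in kids:
        decorate(child)
    if len(kids) == 1:                            # copy rules 3, 6, 7, 8
        node.val = kids[0].val
    elif len(kids) == 3 and kids[0].symbol == "(":
        node.val = kids[1].val                    # rule 9: F -> ( E )
    elif len(kids) == 3:                          # rules 1, 2, 4, 5
        node.val = SEMANTIC[kids[1].symbol](kids[0].val, kids[2].val)
    return node.val

# Small helpers to build the parse tree for (1 + 3) * 2 by hand.
def num(value):
    return Node("F", [Node("num", val=value)])

def unit(symbol, child):
    return Node(symbol, [child])

def binary(symbol, left, op, right):
    return Node(symbol, [left, Node(op), right])

inner = binary("E", unit("E", unit("T", num(1))), "+", unit("T", num(3)))
paren = Node("F", [Node("("), inner, Node(")")])
tree = unit("E", binary("T", unit("T", paren), "*", num(2)))
```

After `decorate(tree)`, the `val` attribute of the root holds 8, and punctuation and operator nodes are left without a value.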

Semantic Analysis Attribute Grammars

The attribute grammar serves to define the semantics of the input program

Attribute rules are best thought of as definitions, not assignments

They are not necessarily meant to be evaluated at any particular time, or in any particular order, though they do define their left-hand side in terms of the right-hand side

Semantic Analysis Evaluating Attributes

The process of evaluating attributes is called annotation, or decoration, of the parse tree

When a parse tree under this grammar is fully decorated, the value of the expression will be in the val attribute of the root

The code fragments for the rules are called semantic functions (they should be cast as functions)

e.g. 𝐸1.𝑣𝑎𝑙 ≔ 𝑠𝑢𝑚(𝐸2.𝑣𝑎𝑙, 𝑇.𝑣𝑎𝑙)

Semantic Analysis Evaluating Attributes

The figure shows the result of annotating the parse tree for (1 + 3) ∗ 2

Each symbol has at most one attribute, shown in the corresponding box

Numerical value in this example

Punctuation marks have no attributes

Operator symbols have no value

Arrows represent attribute flow

Semantic Analysis Construction of the Syntax Tree

A bottom-up approach

a) The values of the constants 1 and 3 have been placed in new syntax tree leaves

b) The pointers to these leaves become child pointers of a new internal + node

c) A pointer to this node propagates up into the attributes of 𝑇, and a new leaf is created for 2

d) The pointers for 𝑇 and 𝐹 become child pointers of a new internal ∗ node, and a pointer to this node propagates up into the attributes of 𝐸
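The bottom-up construction of the syntax tree for (1 + 3) ∗ 2 can be replayed step by step in code. The `Leaf` and `BinOp` classes are hypothetical stand-ins for syntax tree nodes, and Python object references play the role of the pointers.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# a) the constants 1 and 3 are placed in new leaves
one, three = Leaf(1), Leaf(3)

# b) pointers to these leaves become the children of a new internal + node
plus = BinOp("+", one, three)

# c) the pointer to the + node propagates up unchanged, and a leaf is created for 2
t_pointer, two = plus, Leaf(2)

# d) the pointers for T and F become the children of a new internal * node,
#    and a pointer to that node propagates up to E
root = BinOp("*", t_pointer, two)

def evaluate(node):
    """Evaluate a syntax tree built from Leaf and BinOp nodes."""
    if isinstance(node, Leaf):
        return node.value
    a, b = evaluate(node.left), evaluate(node.right)
    return a + b if node.op == "+" else a * b
```

The parentheses of the source text leave no node behind: the grouping is carried entirely by the shape of the tree, and `evaluate(root)` gives 8.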
