Lecture 5: Language Translation Issues
Dolores Zage
Programming Language Syntax
Syntax: the arrangement of words as elements in a sentence to show their relationship. In C, X = Y + Z represents a valid sequence of symbols; XY +- does not. Syntax provides significant information for understanding a program and for its translation into an object program.
Rules: 2 + 3 x 4 is 14, not the 20 that (2+3) x 4 would give. The syntax specifies the interpretation and thereby guides the translator.
General Syntactic Criteria
Provide a common notation between the programmer and the programming language processor. The choice is constrained only slightly by the necessity to communicate particular items of information. For example, a variable may be represented as a real either by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN.
General criteria: easy to read, easy to write, easy to translate, and unambiguous.
Readability
The algorithm should be apparent from inspection of the text (self-documenting): natural statement formats, liberal use of key words and noise words, provision for embedded comments, unrestricted-length identifiers, and mnemonic operator symbols. The COBOL design emphasizes readability, often at the expense of ease of writing and translation.
Writeability
Enhanced by concise and regular structures (notice that readability tends toward the verbose: the two goals differ) and by features that help us distinguish programming constructs. FORTRAN's implicit naming does not help us catch misspellings: indx and index are both valid integer variables, even though the programmer wanted indx to be index. Redundancy can be good: it is easier to read and allows for error checking.
Ease of Translation
The key to easy translation is regularity of structure. LISP can be translated with a few short, easy rules, but it is a bear to read. COBOL has a large number of syntactic constructs and is therefore hard to translate.
Lack of Ambiguity
A central problem in every language design! An ambiguous construction allows two or more different interpretations. These do not arise in the structure of individual program elements but in the interplay between structures.
The dangling else is a classic example:
if (boolean1) then if (boolean2) then statement1 else statement2
Does the else pair with the first if or with the second? [Figure: the two possible parse trees, one pairing the else with the outer if, one with the inner if]
Resolving the dangling else:
- include a begin ... end delimiter around the embedded conditional (ALGOL)
- Ada: a closing delimiter, end if
- C and Pascal: the final else is paired with the nearest then
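The C/Pascal rule can be checked directly. A small sketch (the function name `classify` is ours, purely for illustration) showing that the else binds to the nearest if:

```c
/* The else below belongs to the NEAREST if (the inner one),
   per the C rule described above. Indentation deliberately
   suggests the other, wrong reading. */
int classify(int a, int b) {
    if (a > 0)
        if (b > 0)
            return 1;   /* a > 0 and b > 0 */
        else
            return 2;   /* a > 0 and b <= 0: else binds to the inner if */
    return 3;           /* a <= 0: the else never runs here */
}
```

If the else paired with the outer if instead, classify(-1, 1) would return 2; under the nearest-then rule it returns 3.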
Character Set
ASCII covers the 26 letters of the English alphabet; other natural languages have hundreds of letters. Characters form identifiers, key words, and reserved words. Blanks may be insignificant except within literal character-string data (as in FORTRAN), or may be used as separators.
Delimiters: begin, end, { }
Other Elements
Identifiers, operators, key words, reserved words.
Free vs. fixed format: free - statements may be written anywhere; fixed - in FORTRAN the first five characters of a line are reserved for labels.
Statements: simple (no embedding) or structured/nested (embedded).
Overall Program-Subprogram Structure
- separate subprogram definitions (COMMON blocks in FORTRAN)
- separate data definitions (the class mechanism)
- nested subprogram definitions (Pascal: nesting one subprogram in another)
- separate interface definitions (package interfaces in Ada; in C you can do this with an include file)
- data descriptions separated from executable statements (the COBOL DATA and ENVIRONMENT divisions)
- unseparated subprogram definitions - no organization (early BASIC and SNOBOL)
Stages in Translation
The process of translating a program from its original syntax into executable form is central to every programming language implementation. Translation can be quite simple, as in LISP and Prolog, but more often it is quite complex. Most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds.
Stages in Translation
Syntactic recognition: these parts of compiler theory are fairly standard.
Analysis of the source program: the structure of the program must be laboriously built up character by character during translation.
Synthesis of the object program: construction of the executable program from the output of the semantic analysis.
Structure of a Compiler
[Figure: compiler pipeline]
Source-program recognition phases: source program -> lexical analysis -> lexical tokens -> syntactic analysis -> parse tree -> semantic analysis -> intermediate code
Object-code generation phases: intermediate code -> optimization -> optimized intermediate code -> code generation -> object code -> linking (with object code from other compilations) -> executable code
A symbol table and other tables are consulted throughout.
Analysis of the Source Program
- lexical analysis (tokenizing)
- parsing (syntactic analysis)
- semantic analysis: symbol-table maintenance, insertion of implicit information (default settings), macro processing and compile-time operations (#ifdefs)
Synthesis of the Object Program
- optimization
- code generation: the internal representation must be formed into assembly-language statements, machine code, or another object form
- linking and loading: resolving references to external data or other subprograms
Translator Groupings
Translators are crudely grouped by the number of passes they make over the source code:
- standard: two passes; the first decomposes the program into components and records variable-name usage, the second generates an object program from the collected information
- one pass: fast compilation; Pascal was designed so that it could be compiled in one pass
- three or more passes: used when execution speed of the generated code is paramount
Formal Translation Models
Based on the context-free theory of languages. The formal definition of the syntax of a programming language is called a grammar. A grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language, beginning from a defined start symbol.
Chomsky Hierarchy
Language syntax was one of the earliest formal models to be applied to programming language design. In 1959 Chomsky outlined a model of grammars.

Classes of grammars and abstract machines:

Chomsky Level   Grammar Class       Machine Class
0               Unrestricted        Turing machine
1               Context sensitive   Linear-bounded automaton
2               Context free        Pushdown automaton
3               Regular             Finite-state automaton
Type 2 grammars are our BNF grammars. Types 2 and 3 are what we use in programming languages.
A type n language is one that is generated by a type n grammar where no grammar of type n+1 also generates it. Every grammar of type n is, by definition, also a grammar of type n-1.
Grammar
To Chomsky, a grammar is a 4-tuple (V, T, P, Z) where
- V is an alphabet
- T, a subset of V, is the alphabet of terminal symbols
- P is a finite set of rewriting rules (productions)
- Z, the distinguished (start) symbol, is a member of V - T
The language of a grammar is the set of terminal strings which can be derived from Z. The difference between the four types lies in the form of the rewriting rules allowed in P.
Type 0 or Phrase Structure
Rules can have the form u ::= v with u in V+ and v in V*. That is, the left part u can be a sequence of symbols and the right part can be empty:
abc -> dca
a -> nil
Type 1 or Context Sensitive (Context Dependent)
Restrict the rewriting rules to the form xUy ::= xuy: we are allowed to rewrite U as u only in the context x...y. In every production a -> b, the length of the left side a must be less than or equal to the length of the right side b.

G = ({S,B,C}, {a,b,c}, S, P)
P:  S -> aSBC
    S -> abC
    bB -> bb
    bC -> bc
    CB -> BC
    cC -> cc
What language is generated by this context-sensitive grammar?
Deciding the Language
Always start with the start rule: in this case it is S, but it can be any nonterminal (look at the 4-tuple definition). Create a derivation starting with the start rule and apply the productions, finally finishing with all terminals. Then "generalize" the pattern.
Identifying L given G
P:  1. S -> aSBC   2. S -> abC   3. bB -> bb   4. bC -> bc   5. CB -> BC   6. cC -> cc

n = 1:
S => abC => abc            (rules 2, 4)

n = 2:
S => aSBC                  (1)
  => aabCBC                (2)
  => aabBCC                (5)
  => aabbCC                (3)
  => aabbcC                (4)
  => aabbcc                (6)

n = 3:
S => aSBC => aaSBCBC       (1, 1)
  => aaabCBCBC             (2)
  => aaabBCCBC             (5)
  => aaabBCBCC             (5)
  => aaabBBCCC             (5)
  => aaabbBCCC             (3)
  => aaabbbCCC             (3)
  => aaabbbcCC             (4)
  => aaabbbccC             (6)
  => aaabbbccc             (6)

Generalizing the pattern: L = { a^n b^n c^n | n >= 1 }
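Although no context-free grammar generates this language, membership is easy to test procedurally. A minimal sketch (the function name `in_language` is ours, not from the slides):

```c
#include <string.h>

/* Recognizer for L = { a^n b^n c^n | n >= 1 }, the language generated by
   the context-sensitive grammar above: equal-length runs of a, b, c. */
int in_language(const char *s) {
    size_t len = strlen(s);
    if (len == 0 || len % 3 != 0) return 0;   /* length must be 3n, n >= 1 */
    size_t n = len / 3;
    for (size_t i = 0; i < n; i++)
        if (s[i] != 'a' || s[n + i] != 'b' || s[2 * n + i] != 'c')
            return 0;
    return 1;
}
```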
Type 2 or Context Free
U can be rewritten as u regardless of the context in which it appears. This grammar allows only one symbol (a nonterminal) on the left-hand side. It also allows a rule to go to the empty string.
Context-Free Expression Grammar
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> number | name | (E)
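This grammar can be turned directly into a small evaluator. A minimal sketch in C, restricted to single-digit numbers (the function names are ours; the left-recursive rules E -> E + T and T -> T * F are implemented as loops, which preserves their left associativity):

```c
/* Recursive descent evaluator for the expression grammar above,
   limited to single-digit operands and no whitespace. */
static const char *p;            /* cursor into the input string */

static int parse_expr(void);

static int parse_factor(void) {  /* F -> number | (E) */
    if (*p == '(') {
        p++;                     /* consume '(' */
        int v = parse_expr();
        p++;                     /* consume ')' */
        return v;
    }
    return *p++ - '0';           /* single-digit number */
}

static int parse_term(void) {    /* T -> T * F | T / F | F */
    int v = parse_factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = parse_factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int parse_expr(void) {    /* E -> E + T | E - T | T */
    int v = parse_term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = parse_term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int eval(const char *s) { p = s; return parse_expr(); }
```

Because * and / live one grammar level below + and -, eval("2+3*4") yields 14 while eval("(2+3)*4") yields 20, exactly the precedence behavior discussed earlier.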
Type 3 - Regular Grammars
Restrict the rules once more: all rules must have the form u ::= N or u ::= WN.
Grammars
As we move from type 3 to type 2 to type 1 to type 0, the resulting languages become more complex. Types 2 and 3 became important in programming languages: type 3 provides a model (the finite-state machine) for building lexical analyzers, and type 2 (BNF) for developing parse trees of programs.
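The finite-state model behind lexical analyzers can be sketched directly. A minimal example (the names are ours) recognizing identifiers of the regular language letter (letter | digit)*:

```c
#include <ctype.h>

/* A finite-state recognizer, the type-3 machine class used in lexical
   analysis. Accepts identifiers: a letter followed by letters/digits. */
enum state { START, IN_ID, REJECT };

int is_identifier(const char *s) {
    enum state st = START;
    for (; *s; s++) {
        switch (st) {
        case START:                      /* first character must be a letter */
            st = isalpha((unsigned char)*s) ? IN_ID : REJECT;
            break;
        case IN_ID:                      /* rest: letters or digits */
            st = isalnum((unsigned char)*s) ? IN_ID : REJECT;
            break;
        case REJECT:
            return 0;
        }
    }
    return st == IN_ID;                  /* IN_ID is the accepting state */
}
```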
BNF Grammars
Consider the structure of an English sentence. We usually describe it as a sequence of categories:
subject / verb / object
Examples:
The girl/ played / baseball.
The boy / cooked / dinner.
BNF Grammars
Each category can be further divided. For example, subject can be represented by article noun:
article / noun / verb / object
There are other possible sentence structures besides the simple declarative ones, such as questions.
auxiliary verb / subject / predicate
Is / the boy / cooking dinner?
Represent sentences by a set of rules:
<sentence> ::= <declarative> | <question>
<declarative> ::= <subject> <verb> <object>.
<subject> ::= <article> <noun>
<question> ::= <auxiliary verb> <subject> <predicate>
This specific notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time, Chomsky developed a similar grammatical form, the context-free grammar. The BNF and context-free grammar forms are equivalent in power; the differences are only in notation. For this reason the terms BNF grammar and context-free grammar are interchangeable.
Syntax
A BNF grammar is composed of a finite set of BNF grammar rules, which together define a language. Syntax is concerned with form rather than meaning: a (programming) language consists of a set of syntactically correct programs, each of which is simply a sequence of characters.
Production Rules
A grammar is a set of production rules:
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(Names in angle brackets are nonterminals; the remaining symbols are tokens, or terminals.)
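A recognizer following these productions is straightforward: one or more digits, a dot, then one or more digits. A minimal C sketch (the helper names are ours):

```c
#include <string.h>

/* Returns 1 when s has at least one <digit> in positions [0, n). */
static int all_digits(const char *s, size_t n) {
    if (n == 0) return 0;                 /* grammar requires >= 1 digit */
    for (size_t i = 0; i < n; i++)
        if (s[i] < '0' || s[i] > '9') return 0;
    return 1;
}

/* Recognizer for <real-number> ::= <integer_part> . <fraction> */
int is_real_number(const char *s) {
    const char *dot = strchr(s, '.');
    if (!dot) return 0;                   /* the '.' is mandatory */
    return all_digits(s, (size_t)(dot - s))
        && all_digits(dot + 1, strlen(dot + 1));
}
```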
Doesn't Have to Make Sense!
A syntactically correct program need not make any sense semantically. If executed, it would not have to compute anything useful; it might not compute anything at all. For example, look at our simple declarative sentences: the syntax subject / verb / object is fulfilled, but the sentence doesn't make any sense:
The home / ran / the girl.
Parse Trees
Production rules are rules for building strings of tokens. Beginning with the starting nonterminal, you can use the rules to build a tree. In the parse tree:
- each leaf either holds a terminal or is empty
- nonleaf nodes are labeled with nonterminals
- the tree generates the string formed by reading the terminals at its leaves from left to right
A string is in the language only if it is generated by some parse tree.
Parse Tree
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Parse tree for the string 13.13:

<real-number>
 +- <integer_part>
 |   +- <integer_part>
 |   |   +- <digit> -> 1
 |   +- <digit> -> 3
 +- .
 +- <fraction>
     +- <digit> -> 1
     +- <fraction>
         +- <digit> -> 3
Use of a Formal Grammar
Important to both the language user and the language implementor:
- the user may consult it to answer subtle questions about program form, punctuation, and structure
- the implementor may use it to determine all the possible cases of input program structures that are allowed
- it is a common, agreed-upon definition
BNF (Context-Free) Grammar
Assigns a structure to each string in the language; the structure is always a tree because of the restrictions on BNF grammar rules. The parse tree provides an intuitive semantic structure. BNF does a good job of defining the syntax of a language.
Syntax Not Defined by BNF Notation
Despite the elegance, power, and simplicity of BNF grammars, there are areas of language that cannot be expressed (contextual dependence). Example: the rule that the same identifier may not be defined twice in the same scope. Also, every language can be defined by multiple grammars.
Another problem: ambiguity (the dangling else). Compare: They / are / flying planes versus They / are flying / planes.
Ambiguity
Ambiguity is often a property of a given grammar.
G: S -> SS | 0 | 1
This grammar, which generates binary strings, is ambiguous because there is a string in the language that has two distinct parse trees.
Ambiguous Grammar
Two distinct parse trees for the string 001:

       S                  S
      / \                / \
     S   S              S   S
    / \  |              |  / \
   S   S 1              0 S   S
   |   |                  |   |
   0   0                  0   1
Ambiguous Grammar
If every grammar for a given language is ambiguous, then the language is inherently ambiguous. However, the language of binary strings is not, because there is a grammar for it that is unambiguous:
G: T -> 0T | 1T | 0 | 1
Expressions
We need control structures for expressions. Implicit (default) control is in effect unless modified by the programmer through some explicit structure; explicit control modifies the implicit sequence.
Sequencing with Arithmetic Expressions
Root = (-B + SQRT(B^2 - 4*A*C)) / (2*A)
There are 15 separate operations in this formula. In a programming language this can be stated as a single expression.
Sequencing with Arithmetic Expressions
Expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems. The sequence-control mechanisms that determine the order of operations within an expression are complex and subtle.
Tree-Structure Representation
Clarifies the control structure of the expression (a+b) * (c-d):

        *
       / \
      +   -
     / \ / \
    a  b c  d
Syntax for Expressions
For a programming language we must have a notation for writing trees as linear sequences of symbols. There are three common ones: prefix, postfix, and infix.
Expression Notation
prefix:   op E1 E2   (example: +ab)
postfix:  E1 E2 op   (example: ab+)
infix:    E1 op E2   (example: a+b)
Postfix and prefix are nice: they do not need parentheses.

infix       postfix   prefix
(a+b)*c     ab+c*     *+abc
a+b*c       abc*+     +a*bc
a+b+c       ab+c+     ++abc
(a+b)+c     ab+c+     ++abc
a + (b+c)   abc++     +a+bc

Which of the following is a valid expression (either postfix or prefix)?
B C * D - + * A B C - B B B * *
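A stack makes postfix evaluation trivial, which is the reason the notation needs no parentheses: each operator simply pops its two most recent operands. A minimal sketch for single-digit operands (the function name is ours):

```c
#include <ctype.h>

/* Stack-based postfix evaluator for single-digit operands. */
int eval_postfix(const char *s) {
    int stack[64], top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';      /* push operand */
        } else {
            int b = stack[--top];         /* pop right operand */
            int a = stack[--top];         /* pop left operand */
            switch (*s) {
            case '+': stack[top++] = a + b; break;
            case '-': stack[top++] = a - b; break;
            case '*': stack[top++] = a * b; break;
            case '/': stack[top++] = a / b; break;
            }
        }
    }
    return stack[0];                      /* single value remains */
}
```

For example, eval_postfix("23+4*") computes (2+3)*4 and eval_postfix("234*+") computes 2+3*4, with no parentheses in either input.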
Expression Notation - Infix
However, infix is familiar and easy to read. Infix is suited to binary operators; unary operators and multi-argument function calls must be exceptions to the general infix pattern. But how do we decode a+b*c? By precedence (order of operations) and associativity (normally left to right).
Precedence
Give operators precedence levels: higher-precedence operators are evaluated before lower-precedence operators. Without precedence rules, parentheses would be needed in expressions. Precedence works well with the classical mathematical symbols but breaks down with new operators not drawn from classical mathematics (?: in C).
Associativity
What if operators with the same precedence are grouped together?
- the operators + - / * are left associative: 1+2+3+4 groups left to right
- a=b=c=2+3 : right associative
- 2^3^4 (exponentiation) : right associative
Mixfix notation: symbols or keywords are interspersed with the components of the expression, as in IF a>b THEN a ELSE b.
Abstract Syntax Tree
Infix, postfix, and prefix use different notations, but all have the same meaningful components; an abstract syntax tree is a way to represent this common structure.

infix: (a+b)*c    postfix: ab+c*    prefix: *+abc

      *
     / \
    +   c
   / \
  a   b
Side Effects
The use of operations that have side effects in expressions is the basis of a long-standing controversy in programming language design. Side effects are implicit results: an operation may return an explicit result, as in the sum returned by an addition, but it may also modify the values stored in other data objects.
a * fun(x) + a
First, we must fetch the r-value of a, and fun(x) must be evaluated. Notice that the addition requires the value of a and the result of the multiplication. It is clearly desirable to fetch a once and use it twice; moreover, it should make no difference whether fun(x) is evaluated before or after the value of a is fetched.
a * fun(x) + a
However, if fun has the side effect of changing the value of a, then the exact order of evaluation is critical! If a has the initial value 1, and fun(x) returns 3 and also changes the value of a to 2, then the possible values for this expression are:
- evaluate each term in sequence: 1 * 3 + 2 = 5
- fetch a only once and reuse it: 1 * 3 + 1 = 4
- call fun(x) before evaluating a: 2 * 3 + 2 = 8
All are correct according to the syntax.
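The three outcomes can be reproduced by simulating each evaluation order explicitly. A C sketch (the function names are ours; in real C the order of the operand fetches in a * fun(x) + a is unspecified, so a compiler could legitimately produce any of these results):

```c
static int a;             /* the variable fun modifies as a side effect */

static int fun(int x) {
    (void)x;
    a = 2;                /* the side effect */
    return 3;
}

int left_to_right(void) { /* fetch a, call fun, fetch a again */
    a = 1;
    int first = a;
    int f = fun(0);
    return first * f + a; /* 1 * 3 + 2 = 5 */
}

int fetch_a_once(void) {  /* fetch a once and reuse the value */
    a = 1;
    int only = a;
    int f = fun(0);
    return only * f + only; /* 1 * 3 + 1 = 4 */
}

int call_fun_first(void) { /* evaluate fun(x) before fetching a */
    a = 1;
    int f = fun(0);
    return a * f + a;     /* 2 * 3 + 2 = 8 */
}
```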
Positions on side effects in expressions:
- outlaw them! disallow functions with side effects, or make such expressions undefined
- allow them, but make it clear exactly what the order of evaluation is, so the programmer can make proper use of it
The latter is the most general, but in many language definitions this question is ignored, with the result that different implementations provide conflicting interpretations.