Lecture 5: Language Translation Issues
Dolores Zage
Programming Language Syntax
Syntax: the arrangement of words as elements in a sentence to show their relationship. In C, X = Y + Z represents a valid sequence of symbols; XY +- does not. Syntax provides significant information for understanding a program and for its translation into an object program.
Rules: 2 + 3 x 4 is 14, not the 20 that (2+3) x 4 would give. The syntax specifies the interpretation and thereby guides the translator.
General Syntactic Criteria
Provide a common notation between the programmer and the programming language processor. The choice is constrained only slightly by the necessity to communicate particular items of information. For example, a variable may be represented as a real either by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN.
General criteria: easy to read, easy to write, easy to translate, and unambiguous.
Readability
The algorithm should be apparent from inspection of the text (self-documenting): natural statement formats, liberal use of key words and noise words, provision for embedded comments, unrestricted-length identifiers, and mnemonic operator symbols. The COBOL design emphasizes readability, often at the expense of ease of writing and translation.
Writeability
Enhanced by concise and regular structures (notice that readability tends toward the verbose: the two goals differ) and by features that help us distinguish programming constructs. FORTRAN's implicit naming does not help us catch misspellings: indx and index are both valid integer variables, even though the programmer wanted indx to be index. Redundancy can be good: it is easier to read and allows for error checking.
Ease of Translation
The key to easy translation is regularity of structure. LISP can be translated with a few short, easy rules, but it is a bear to read. COBOL has a large number of syntactic constructs and is therefore hard to translate.
Lack of Ambiguity
A central problem in every language design! An ambiguous construction allows two or more different interpretations. These do not arise in the structure of individual program elements but in the interplay between structures.
The dangling else is a classic example:
if (boolean1) then if (boolean2) then statement1 else statement2
Does the else pair with the first if or with the second? [Figure: the two possible parse trees, one pairing the else with the outer if, one with the inner if]
Resolving the dangling else:
- include a begin ... end delimiter around the embedded conditional (ALGOL)
- Ada: a closing delimiter, end if
- C and Pascal: the final else is paired with the nearest then
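The C/Pascal rule can be checked directly. A small sketch (the function name `classify` is ours, purely for illustration) showing that the else binds to the nearest if:

```c
/* The else below belongs to the NEAREST if (the inner one),
   per the C rule described above. Indentation deliberately
   suggests the other, wrong reading. */
int classify(int a, int b) {
    if (a > 0)
        if (b > 0)
            return 1;   /* a > 0 and b > 0 */
        else
            return 2;   /* a > 0 and b <= 0: else binds to the inner if */
    return 3;           /* a <= 0: the else never runs here */
}
```

If the else paired with the outer if instead, classify(-1, 1) would return 2; under the nearest-then rule it returns 3.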
Character Set
ASCII covers the 26 letters of the English alphabet; other natural languages have hundreds of letters. Characters form identifiers, key words, and reserved words. Blanks may be insignificant except within literal character-string data (as in FORTRAN), or may be used as separators.
Delimiters: begin, end, { }
Other Elements
Identifiers, operators, key words, reserved words.
Free vs. fixed format: free - statements may be written anywhere; fixed - in FORTRAN the first five characters of a line are reserved for labels.
Statements: simple (no embedding) or structured/nested (embedded).
Overall Program-Subprogram Structure
- separate subprogram definitions (COMMON blocks in FORTRAN)
- separate data definitions (the class mechanism)
- nested subprogram definitions (Pascal: nesting one subprogram in another)
- separate interface definitions (package interfaces in Ada; in C you can do this with an include file)
- data descriptions separated from executable statements (the COBOL DATA and ENVIRONMENT divisions)
- unseparated subprogram definitions - no organization (early BASIC and SNOBOL)
Stages in Translation
The process of translating a program from its original syntax into executable form is central to every programming language implementation. Translation can be quite simple, as in LISP and Prolog, but more often it is quite complex. Most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds.
Stages in Translation
Syntactic recognition: these parts of compiler theory are fairly standard.
Analysis of the source program: the structure of the program must be laboriously built up character by character during translation.
Synthesis of the object program: construction of the executable program from the output of the semantic analysis.
Structure of a Compiler
[Figure: compiler pipeline]
Source-program recognition phases: source program -> lexical analysis -> lexical tokens -> syntactic analysis -> parse tree -> semantic analysis -> intermediate code
Object-code generation phases: intermediate code -> optimization -> optimized intermediate code -> code generation -> object code -> linking (with object code from other compilations) -> executable code
A symbol table and other tables are consulted throughout.
Analysis of the Source Program
- lexical analysis (tokenizing)
- parsing (syntactic analysis)
- semantic analysis: symbol-table maintenance, insertion of implicit information (default settings), macro processing and compile-time operations (#ifdefs)
Synthesis of the Object Program
- optimization
- code generation: the internal representation must be formed into assembly-language statements, machine code, or another object form
- linking and loading: resolving references to external data or other subprograms
Translator Groupings
Translators are crudely grouped by the number of passes they make over the source code:
- standard: two passes; the first decomposes the program into components and records variable-name usage, the second generates an object program from the collected information
- one pass: fast compilation; Pascal was designed so that it could be compiled in one pass
- three or more passes: used when execution speed of the generated code is paramount
Formal Translation Models
Based on the context-free theory of languages. The formal definition of the syntax of a programming language is called a grammar. A grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language, beginning from a defined start symbol.
Chomsky Hierarchy
Language syntax was one of the earliest formal models to be applied to programming language design. In 1959 Chomsky outlined a model of grammars.

Classes of grammars and abstract machines:

Chomsky Level   Grammar Class       Machine Class
0               Unrestricted        Turing machine
1               Context sensitive   Linear-bounded automaton
2               Context free        Pushdown automaton
3               Regular             Finite-state automaton
Type 2 grammars are our BNF grammars. Types 2 and 3 are what we use in programming languages.
A type n language is one that is generated by a type n grammar where no grammar of type n+1 also generates it. Every grammar of type n is, by definition, also a grammar of type n-1.
Grammar
To Chomsky, a grammar is a 4-tuple (V, T, P, Z) where
- V is an alphabet
- T, a subset of V, is the alphabet of terminal symbols
- P is a finite set of rewriting rules (productions)
- Z, the distinguished (start) symbol, is a member of V - T
The language of a grammar is the set of terminal strings which can be derived from Z. The difference between the four types lies in the form of the rewriting rules allowed in P.
Type 0 or Phrase Structure
Rules can have the form u ::= v with u in V+ and v in V*. That is, the left part u can be a sequence of symbols and the right part can be empty:
abc -> dca
a -> nil
Type 1 or Context Sensitive (Context Dependent)
Restrict the rewriting rules to the form xUy ::= xuy: we are allowed to rewrite U as u only in the context x...y. In every production a -> b, the length of the left side a must be less than or equal to the length of the right side b.

G = ({S,B,C}, {a,b,c}, S, P)
P:  S -> aSBC
    S -> abC
    bB -> bb
    bC -> bc
    CB -> BC
    cC -> cc
What language is generated by this context-sensitive grammar?
Deciding the Language
Always start with the start rule: in this case it is S, but it can be any nonterminal (look at the 4-tuple definition). Create a derivation starting with the start rule and apply the productions, finally finishing with all terminals. Then "generalize" the pattern.
Identifying L given G
P:  1. S -> aSBC   2. S -> abC   3. bB -> bb   4. bC -> bc   5. CB -> BC   6. cC -> cc

n = 1:
S => abC => abc            (rules 2, 4)

n = 2:
S => aSBC                  (1)
  => aabCBC                (2)
  => aabBCC                (5)
  => aabbCC                (3)
  => aabbcC                (4)
  => aabbcc                (6)

n = 3:
S => aSBC => aaSBCBC       (1, 1)
  => aaabCBCBC             (2)
  => aaabBCCBC             (5)
  => aaabBCBCC             (5)
  => aaabBBCCC             (5)
  => aaabbBCCC             (3)
  => aaabbbCCC             (3)
  => aaabbbcCC             (4)
  => aaabbbccC             (6)
  => aaabbbccc             (6)

Generalizing the pattern: L = { a^n b^n c^n | n >= 1 }
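Although no context-free grammar generates this language, membership is easy to test procedurally. A minimal sketch (the function name `in_language` is ours, not from the slides):

```c
#include <string.h>

/* Recognizer for L = { a^n b^n c^n | n >= 1 }, the language generated by
   the context-sensitive grammar above: equal-length runs of a, b, c. */
int in_language(const char *s) {
    size_t len = strlen(s);
    if (len == 0 || len % 3 != 0) return 0;   /* length must be 3n, n >= 1 */
    size_t n = len / 3;
    for (size_t i = 0; i < n; i++)
        if (s[i] != 'a' || s[n + i] != 'b' || s[2 * n + i] != 'c')
            return 0;
    return 1;
}
```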
Type 2 or Context Free
U can be rewritten as u regardless of the context in which it appears. This grammar allows only one symbol (a nonterminal) on the left-hand side. It also allows a rule to go to the empty string.
Context-Free Expression Grammar
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> number | name | (E)
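This grammar can be turned directly into a small evaluator. A minimal sketch in C, restricted to single-digit numbers (the function names are ours; the left-recursive rules E -> E + T and T -> T * F are implemented as loops, which preserves their left associativity):

```c
/* Recursive descent evaluator for the expression grammar above,
   limited to single-digit operands and no whitespace. */
static const char *p;            /* cursor into the input string */

static int parse_expr(void);

static int parse_factor(void) {  /* F -> number | (E) */
    if (*p == '(') {
        p++;                     /* consume '(' */
        int v = parse_expr();
        p++;                     /* consume ')' */
        return v;
    }
    return *p++ - '0';           /* single-digit number */
}

static int parse_term(void) {    /* T -> T * F | T / F | F */
    int v = parse_factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = parse_factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int parse_expr(void) {    /* E -> E + T | E - T | T */
    int v = parse_term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = parse_term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int eval(const char *s) { p = s; return parse_expr(); }
```

Because * and / live one grammar level below + and -, eval("2+3*4") yields 14 while eval("(2+3)*4") yields 20, exactly the precedence behavior discussed earlier.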
Type 3 - Regular Grammars
Restrict the rules once more: all rules must have the form u ::= N or u ::= WN.
Grammars
As we move from type 3 to type 2 to type 1 to type 0, the resulting languages become more complex. Types 2 and 3 became important in programming languages: type 3 provides a model (the finite-state machine) for building lexical analyzers, and type 2 (BNF) for developing parse trees of programs.
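The finite-state model behind lexical analyzers can be sketched directly. A minimal example (the names are ours) recognizing identifiers of the regular language letter (letter | digit)*:

```c
#include <ctype.h>

/* A finite-state recognizer, the type-3 machine class used in lexical
   analysis. Accepts identifiers: a letter followed by letters/digits. */
enum state { START, IN_ID, REJECT };

int is_identifier(const char *s) {
    enum state st = START;
    for (; *s; s++) {
        switch (st) {
        case START:                      /* first character must be a letter */
            st = isalpha((unsigned char)*s) ? IN_ID : REJECT;
            break;
        case IN_ID:                      /* rest: letters or digits */
            st = isalnum((unsigned char)*s) ? IN_ID : REJECT;
            break;
        case REJECT:
            return 0;
        }
    }
    return st == IN_ID;                  /* IN_ID is the accepting state */
}
```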
BNF Grammars
Consider the structure of an English sentence. We usually describe it as a sequence of categories:
subject / verb / object
Examples:
The girl/ played / baseball.
The boy / cooked / dinner.
BNF Grammars
Each category can be further divided. For example, subject can be represented by article noun:
article / noun / verb / object
There are other possible sentence structures besides the simple declarative ones, such as questions.
auxiliary verb / subject / predicate
Is / the boy / cooking dinner?
Represent sentences by a set of rules:
<sentence> ::= <declarative> | <question>
<declarative> ::= <subject> <verb> <object>.
<subject> ::= <article> <noun>
<question> ::= <auxiliary verb> <subject> <predicate>
This specific notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time, Chomsky developed a similar grammatical form, the context-free grammar. The BNF and context-free grammar forms are equivalent in power; the differences are only in notation. For this reason the terms BNF grammar and context-free grammar are interchangeable.
Syntax
A BNF grammar is composed of a finite set of BNF grammar rules, which together define a language. Syntax is concerned with form rather than meaning: a (programming) language consists of a set of syntactically correct programs, each of which is simply a sequence of characters.
Production Rules
A grammar is a set of production rules:
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(Names in angle brackets are nonterminals; the remaining symbols are tokens, or terminals.)
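A recognizer following these productions is straightforward: one or more digits, a dot, then one or more digits. A minimal C sketch (the helper names are ours):

```c
#include <string.h>

/* Returns 1 when s has at least one <digit> in positions [0, n). */
static int all_digits(const char *s, size_t n) {
    if (n == 0) return 0;                 /* grammar requires >= 1 digit */
    for (size_t i = 0; i < n; i++)
        if (s[i] < '0' || s[i] > '9') return 0;
    return 1;
}

/* Recognizer for <real-number> ::= <integer_part> . <fraction> */
int is_real_number(const char *s) {
    const char *dot = strchr(s, '.');
    if (!dot) return 0;                   /* the '.' is mandatory */
    return all_digits(s, (size_t)(dot - s))
        && all_digits(dot + 1, strlen(dot + 1));
}
```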
Doesn't Have to Make Sense!
A syntactically correct program need not make any sense semantically. If executed, it would not have to compute anything useful; it might not compute anything at all. For example, look at our simple declarative sentences: the syntax subject / verb / object is fulfilled, but the sentence doesn't make any sense:
The home / ran / the girl.
Parse Trees
Production rules are rules for building strings of tokens. Beginning with the starting nonterminal, you can use the rules to build a tree. In the parse tree:
- each leaf either holds a terminal or is empty
- nonleaf nodes are labeled with nonterminals
- the tree generates the string formed by reading the terminals at its leaves from left to right
A string is in the language only if it is generated by some parse tree.
Parse Tree
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Parse tree for the string 13.13:

<real-number>
 +- <integer_part>
 |   +- <integer_part>
 |   |   +- <digit> -> 1
 |   +- <digit> -> 3
 +- .
 +- <fraction>
     +- <digit> -> 1
     +- <fraction>
         +- <digit> -> 3
Use of a Formal Grammar
Important to both the language user and the language implementor:
- the user may consult it to answer subtle questions about program form, punctuation, and structure
- the implementor may use it to determine all the possible cases of input program structures that are allowed
- it is a common, agreed-upon definition
BNF (Context-Free) Grammar
Assigns a structure to each string in the language; the structure is always a tree because of the restrictions on BNF grammar rules. The parse tree provides an intuitive semantic structure. BNF does a good job of defining the syntax of a language.
Syntax Not Defined by BNF Notation
Despite the elegance, power, and simplicity of BNF grammars, there are areas of language that cannot be expressed (contextual dependence). Example: the rule that the same identifier may not be defined twice in the same scope. Also, every language can be defined by multiple grammars.
Another problem: ambiguity (the dangling else). Compare: They / are / flying planes versus They / are flying / planes.
Ambiguity
Ambiguity is often a property of a given grammar.
G: S -> SS | 0 | 1
This grammar, which generates binary strings, is ambiguous because there is a string in the language that has two distinct parse trees.
Ambiguous Grammar
Two distinct parse trees for the string 001:

       S                  S
      / \                / \
     S   S              S   S
    / \  |              |  / \
   S   S 1              0 S   S
   |   |                  |   |
   0   0                  0   1
Ambiguous Grammar
If every grammar for a given language is ambiguous, then the language is inherently ambiguous. However, the language of binary strings is not, because there is a grammar for it that is unambiguous:
G: T -> 0T | 1T | 0 | 1
Expressions
We need control structures for expressions. Implicit (default) control is in effect unless modified by the programmer through some explicit structure; explicit control modifies the implicit sequence.
Sequencing with Arithmetic Expressions
Root = (-B + SQRT(B^2 - 4*A*C)) / (2*A)
There are 15 separate operations in this formula. In a programming language this can be stated as a single expression.
Sequencing with Arithmetic Expressions
Expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems. The sequence-control mechanisms that determine the order of operations within an expression are complex and subtle.
Tree-Structure Representation
Clarifies the control structure of the expression (a+b) * (c-d):

        *
       / \
      +   -
     / \ / \
    a  b c  d
Syntax for Expressions
For a programming language we must have a notation for writing trees as linear sequences of symbols. There are three common ones: prefix, postfix, and infix.
Expression Notation
prefix:   op E1 E2   (example: +ab)
postfix:  E1 E2 op   (example: ab+)
infix:    E1 op E2   (example: a+b)
Postfix and prefix are nice: they do not need parentheses.

infix       postfix   prefix
(a+b)*c     ab+c*     *+abc
a+b*c       abc*+     +a*bc
a+b+c       ab+c+     ++abc
(a+b)+c     ab+c+     ++abc
a + (b+c)   abc++     +a+bc

Which of the following is a valid expression (either postfix or prefix)?
B C * D - + * A B C - B B B * *
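A stack makes postfix evaluation trivial, which is the reason the notation needs no parentheses: each operator simply pops its two most recent operands. A minimal sketch for single-digit operands (the function name is ours):

```c
#include <ctype.h>

/* Stack-based postfix evaluator for single-digit operands. */
int eval_postfix(const char *s) {
    int stack[64], top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';      /* push operand */
        } else {
            int b = stack[--top];         /* pop right operand */
            int a = stack[--top];         /* pop left operand */
            switch (*s) {
            case '+': stack[top++] = a + b; break;
            case '-': stack[top++] = a - b; break;
            case '*': stack[top++] = a * b; break;
            case '/': stack[top++] = a / b; break;
            }
        }
    }
    return stack[0];                      /* single value remains */
}
```

For example, eval_postfix("23+4*") computes (2+3)*4 and eval_postfix("234*+") computes 2+3*4, with no parentheses in either input.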
Expression Notation - Infix
However, infix is familiar and easy to read. Infix is suited to binary operators; unary operators and multi-argument function calls must be exceptions to the general infix pattern. But how do we decode a+b*c? By precedence (order of operations) and associativity (normally left to right).
Precedence
Give operators precedence levels: higher-precedence operators are evaluated before lower-precedence operators. Without precedence rules, parentheses would be needed in expressions. Precedence works well with the classical mathematical symbols but breaks down with new operators not drawn from classical mathematics (?: in C).
Associativity
What if operators with the same precedence are grouped together?
- the operators + - / * are left associative: 1+2+3+4 groups left to right
- a=b=c=2+3 : right associative
- 2^3^4 (exponentiation) : right associative
Mixfix notation: symbols or keywords are interspersed with the components of the expression, as in IF a>b THEN a ELSE b.
Abstract Syntax Tree
Infix, postfix, and prefix use different notations, but all have the same meaningful components; an abstract syntax tree is a way to represent this common structure.

infix: (a+b)*c    postfix: ab+c*    prefix: *+abc

      *
     / \
    +   c
   / \
  a   b
Side Effects
The use of operations that have side effects in expressions is the basis of a long-standing controversy in programming language design. Side effects are implicit results: an operation may return an explicit result, as in the sum returned by an addition, but it may also modify the values stored in other data objects.
a * fun(x) + a
First, we must fetch the r-value of a, and fun(x) must be evaluated. Notice that the addition requires the value of a and the result of the multiplication. It is clearly desirable to fetch a once and use it twice; moreover, it should make no difference whether fun(x) is evaluated before or after the value of a is fetched.
a * fun(x) + a
However, if fun has the side effect of changing the value of a, then the exact order of evaluation is critical! If a has the initial value 1, and fun(x) returns 3 and also changes the value of a to 2, then the possible values for this expression are:
- evaluate each term in sequence: 1 * 3 + 2 = 5
- fetch a only once and reuse it: 1 * 3 + 1 = 4
- call fun(x) before evaluating a: 2 * 3 + 2 = 8
All are correct according to the syntax.
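The three outcomes can be reproduced by simulating each evaluation order explicitly. A C sketch (the function names are ours; in real C the order of the operand fetches in a * fun(x) + a is unspecified, so a compiler could legitimately produce any of these results):

```c
static int a;             /* the variable fun modifies as a side effect */

static int fun(int x) {
    (void)x;
    a = 2;                /* the side effect */
    return 3;
}

int left_to_right(void) { /* fetch a, call fun, fetch a again */
    a = 1;
    int first = a;
    int f = fun(0);
    return first * f + a; /* 1 * 3 + 2 = 5 */
}

int fetch_a_once(void) {  /* fetch a once and reuse the value */
    a = 1;
    int only = a;
    int f = fun(0);
    return only * f + only; /* 1 * 3 + 1 = 4 */
}

int call_fun_first(void) { /* evaluate fun(x) before fetching a */
    a = 1;
    int f = fun(0);
    return a * f + a;     /* 2 * 3 + 2 = 8 */
}
```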
Positions on side effects in expressions:
- outlaw them! disallow functions with side effects, or make such expressions undefined
- allow them, but make it clear exactly what the order of evaluation is, so the programmer can make proper use of it
The latter is the most general, but in many language definitions this question is ignored, with the result that different implementations provide conflicting interpretations.