
UNIT 3 : INTRODUCTION TO COMPILER WRITING

Structure
3.0 Introduction
3.1 Objectives
3.2 What is a Compiler?
3.3 Approaches to Compiler Development
    3.3.1 Assembly Language Coding
    3.3.2 Cross-compiler
    3.3.3 Bootstrapping
3.4 Compiler Designing Phases
    3.4.1 Lexical Analysis
    3.4.2 Syntactical Analysis
    3.4.3 Semantic Analysis
    3.4.4 Code Generation and Optimization
3.5 Software Tools
    3.5.1 Lex
    3.5.2 Yacc
    3.5.3 Program Development Tools
3.6 Summary
3.7 Model Answers
3.8 Further Readings

3.0 INTRODUCTION

The study of compiler design forms a central theme in the field of computer science. An understanding of the techniques used by high-level language compilers can give the programmer a set of skills applicable in many aspects of software design - one does not have to be a compiler writer to make use of them.

In the previous unit, we discussed one type of translator, i.e. the assembler, which translates an assembly language program into machine language. In this unit, we will look at another type of translator called a compiler. Compiler writing is not confined to one discipline only but rather spans several other disciplines: programming languages, computer architecture, theory of programming languages, algorithms, etc. Today a few basic compiler writing techniques can be used to construct translators for a wide variety of languages. This unit is intended as an introduction to the basic essential features of compiler designing. This unit is organised as follows:

The first two sections cover the basic definition of a compiler and different approaches to compiler development. The next section covers in brief the several phases of compiler design. These are lexical analysis, syntax analysis, semantic analysis, and code generation and optimization. At the end we present Lex and Yacc, two compiler construction tools that can greatly simplify the implementation of a compiler.

A familiarity with the material covered in this unit will be a great help in understanding the inner working of a compiler.

3.1 OBJECTIVES

At the end of this unit, you will be able to:

identify several phases of compilers

identify several approaches to compiler development, and


define two compiler writing tools, i.e. Lex and Yacc

3.2 WHAT IS A COMPILER?

A compiler is a software (program) that reads a program written in a source language and translates it into an equivalent program in another language - the target language (see figure 1). An important aspect of the compilation process is the production of diagnostics (error messages) for the source program. These error messages mainly arise from grammatical mistakes made by the programmer.

Source Program (in high-level language) → Compiler → Target Program
                                             |
                                             v
                                       Error messages

Fig. 1: A Compiler

There are thousands of source languages, ranging from C and PASCAL to specialised languages that have arisen in virtually every area of computer application. Target languages also number in the thousands. A target language may be another programming language, the machine language, or an assembly language. Compilers are classified as single-pass, multi-pass, debugging or optimizing, depending on how they have been constructed or on what functions they are supposed to perform. Earlier (in the 1950's) compilers were considered difficult programs to write. The first FORTRAN compiler, for example, took 18 staff-years to implement. But now several new techniques and tools have been developed for handling many of the important tasks that occur during the compilation process. Good implementation languages, programming environments (editors, debuggers, etc.) and software tools have also been developed. With these developments, the compiler writing exercise has become easier.


3.3 APPROACHES TO COMPILER DEVELOPMENT

There are several approaches to compiler development. Here we will look at some of them.

3.3.1 Assembly Language Coding

Early compilers were mostly coded in assembly language. The main consideration was to increase efficiency. This approach worked very well for small high-level languages (HLLs). As languages and their compilers became larger, lots of bugs started surfacing which were difficult to remove. The major difficulty with assembly language implementation was poor software maintainability.

Around this time, it was realised that coding the compilers in a high-level language would overcome this disadvantage of poor maintenance. Many compilers were therefore coded in FORTRAN, the only widely available HLL at that time. For example, the FORTRAN H compiler for the IBM/360 was coded in FORTRAN. Later many system programming languages were developed to ensure the efficiency of compilers written in HLLs. Assembly language is still being used, but the trend is towards compiler implementation through HLLs.

3.3.2 Cross-Compiler

A cross-compiler is a compiler which runs on one machine and generates code for another machine. The only difference between a cross-compiler and a normal compiler is in terms of the

code generated by it. For example, consider the problem of implementing a Pascal compiler on a new piece of hardware (a computer called X) on which assembly language is the only programming language already available. Under these circumstances, the obvious approach


is to write the Pascal compiler in assembler. Hence, the compiler in this case is a program that takes Pascal source as input, produces machine code for the target machine as output, and is written in the assembly language of the target machine. The languages characterising this compiler can be represented as:

Pascal source → Compiler (written in X assembly language, running on machine X) → X object code

showing that Pascal source is translated by a program written in X assembly language (the compiler) running on machine X into X's object code. This code can then be run on the target machine. This notation is essentially equivalent to the T-diagram. The T-diagram for this compiler is shown in figure 2.

The language accepted as input by the compiler is stated on the left, the language output by the compiler is shown on the right, and the language in which the compiler is written is shown at the bottom. The advantage of this particular notation is that several T-diagrams can be meshed together to represent more complex compiler implementation methods. This compiler implementation involves a great deal of work, since a large assembly language program has to be written for X. It is to be noticed in this case that the compiler is very machine specific; not only does it run on X, but it also produces machine code suitable for running on X. Furthermore, only one computer is involved in the entire implementation process.

The use of a high-level language for coding the compiler can offer great savings in implementation effort. If the language in which the compiler is being written is already available on the computer in use, then the process is simple. For example, Pascal might already be available on machine X, thus permitting the coding of, say, a Modula-2 compiler in Pascal. Such a compiler can be represented as:

Modula-2 source → Pascal compiler running on X → X's object code

If the language in which the compiler is being written is not available on the machine, then all is not lost, since it may be possible to make use of an implementation of that language on another machine. For example, a Modula-2 compiler could be implemented in Pascal on machine Y, producing object code for machine X:

Modula-2 source → Pascal compiler running on Y → X's object code

The object code for X generated on machine Y would of course have to be transferred to X for its execution. This process of generating code on one machine for execution on another is called cross-compilation.

At first sight, the introduction of a second computer to the compiler implementation plan seems to offer a somewhat inconvenient solution. Each time a compilation is required, it has to be done on machine Y and the object code transferred, perhaps via a slow or laborious mechanism, to machine X for execution. Furthermore, both computers have to be running and inter-linked somehow for this approach to work. But the significance of the cross-compilation approach can be seen in the next section.

3.3.3 Bootstrapping

Bootstrapping is the concept of developing a compiler for a language by using a subset (small part) of the same language.


Suppose that a Modula-2 compiler is required for machine X, but that the compiler itself is to be coded in Modula-2. Coding the compiler in the language it is to compile is nothing special and, as will be seen, it has a great deal in its favour. Suppose further that Modula-2 is already available on machine Y. In this case, the compiler can be run on machine Y, producing object code for machine X:

This is the same situation as before except that the compiler is coded in Modula-2 rather than Pascal. The special feature of this approach appears in the next step. The compiler, running on Y, is nothing more than a large program written in Modula-2. Its function is to transform an input file of Modula-2 statements into a functionally equivalent sequence of statements in X's machine code. Therefore, the source statements of this Modula-2 compiler can be passed into itself running on Y, to produce a file containing X's machine code. This file is of course a Modula-2 compiler capable of being run on X. By making the compiler compile itself, a version of the compiler that runs on X has been created. Once this machine code has been transferred to X, a self-sufficient Modula-2 compiler is available on X; hence there is no further use for machine Y for supporting Modula-2 compilation.

Modula-2 source → Modula-2 compiler running on Y → X's object code

This implementation plan is very attractive. Machine Y is only required for compiler development, and once this development has reached the stage at which the compiler can (correctly) compile itself, machine Y is no longer required. Consequently, the original compiler implemented on Y need not be of the highest quality - for example, optimization can be completely disregarded. Further development (and, obviously, conventional use) of the compiler can then continue at leisure on machine X.

This approach to compiler implementation is called bootstrapping. Many languages, includ- ing C, Pascal, FORTRAN and LISP have been implemented in this way.

Pascal was first implemented by writing a compiler in Pascal itself. This was done through several bootstrapping processes. The compiler was then translated "by hand" into an available low-level language.


3.4 COMPILER DESIGNING PHASES


The compiler, being a complex program, is developed through several phases. Each phase transforms the source program from one representation to another. The tasks of a compiler can be divided very broadly into two sub-tasks:

(i) The analysis of the source program

(ii) The synthesis of the object program

In a typical compiler, the analysis task consists of 3 phases.

(i) Lexical analysis

(ii) Syntax analysis

(iii) Semantic analysis

The synthesis task is usually considered as a code generation phase, but it can be divided into some other distinct phases like intermediate code generation and code optimization. These four phases function in sequence as shown in figure 3. Code optimization is beyond the scope of this unit.

The nature of the interface between these four phases depends on the compiler. It is perfectly possible for the four phases to exist as four separate programs.


Source Program → Lexical Analysis → Syntax Analysis → Semantic Analysis → Code Generation & Optimization → Object Code

(all four phases consult the Symbol Table)

Fig. 3: Compiler Design Phases

3.4.1 Lexical Analysis

Lexical analysis, also called scanning, is the first phase of a compiler. It performs two important tasks. First, it scans the source program character by character from left to right and groups the characters into tokens (or syntactic elements) having a collective meaning. Each token or basic syntactic element represents a logically cohesive sequence of characters, such as an identifier (also called a variable), a keyword (if, then, else, etc.) or a multi-character operator such as <=. The output of this phase goes to the next phase, i.e. syntax analysis or parsing. The interaction between the two phases is shown below in figure 4.

Source program → Lexical Analysis - tokens → Syntax Analysis → Parse tree

(both phases consult the Symbol Table)

Fig. 4: Interaction between the first two phases

The second task performed during lexical analysis is to make an entry for each token in the symbol table if it is not already there. Some other tasks performed during lexical analysis are:

to remove all comments, tabs, blank spaces and machine characters.

to produce error messages (also called diagnostics) for errors that occur in a source program.

Let us consider the following Pascal language statement,

For i := 1 TO 50 do sum := sum + x[i];  { sum of numbers stored in array x }

After going through the statement, the lexical analyser transforms it into the sequence of tokens:

For   i   :=   1   TO   50   do   sum   :=   sum   +   x   [   i   ]   ;

(note that the comment has been removed, since comments are not tokens)

Tokens are based on certain grammatical structures. Regular expressions are an important notation for specifying these tokens. A regular expression consists of symbols (in the alphabet of the language that is being defined) and a set of operators that allow:

(i) concatenation (combination of strings),


(ii) repetition, and

(iii) alternation.

Examples of Regular Expressions

(i) ab denotes the set of strings {ab}

(ii) a|b denotes either a or b

(iii) a* denotes {empty, a, aa, aaa, ...}

(iv) ab* denotes {a, ab, abb, abbb, ...}

(v) [a-z A-Z] [a-z A-Z 0-9]* gives a definition of a variable, which means that a variable starts with an alphabetic character followed by any number of alphabetic characters or digits.

Some more examples of operators will also be covered in Section 3.5.1.
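To make the notation concrete, here is a minimal sketch in C (our illustration, not part of the original unit) of how the identifier pattern of example (v) could be recognised by a hand-written routine:

#include <ctype.h>    /* isalpha, isalnum */

/* Returns 1 if the string s matches [a-zA-Z][a-zA-Z0-9]*, else 0. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))        /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))    /* then letters or digits only */
            return 0;
    return 1;
}

A tool such as Lex generates this kind of recognising code automatically from the regular expression itself, as described in Section 3.5.1.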

Writing a lexical analyser completely from scratch is a fairly challenging task. Several tools have been built for constructing lexical analysers from special-purpose notations based on regular expressions. Perhaps the most famous of these tools is Lex, one of the many utilities available with the Unix operating system. Lex requires that the syntax of each lexical token be defined in terms of a regular expression. Associated with each regular expression is a fragment of code that defines the action to be taken when that expression is recognised. Section 3.5.1 contains a detailed discussion of it.

The Symbol Table

An essential function of a compiler is to record the identifiers used in the source program and collect information about their attributes: type (numeric or character), scope (where in the program the identifier is valid) and, in the case of procedure or function names, such things as the number and types of the arguments, the mechanism for passing each argument and the type of result returned.

A symbol table is a set of locations containing a record for each identifier, with fields for the attributes of the identifier. The symbol table allows us to find the record for each identifier (variable) and to store or retrieve data from that record quickly.

For example, take a declaration written in C such as int x, y, z;

The lexical analyser, after going through this declaration, will enter x, y and z into the symbol table. This is shown in the figure given below.

Variable    Memory Location
x           ....
y           ....
z           ....

Fig. 5: Symbol Table

The first column of this table contains the entries for the variables, and the second contains the addresses of the memory locations where the values of these variables will be stored.

The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
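As an illustration (a sketch of ours, not the unit's own code), a minimal symbol table of the kind shown in figure 5 might be declared in C as an array of records; production compilers normally use a hash table so that lookup is fast:

#include <string.h>

#define MAXSYMS 100

struct symbol {
    char name[32];     /* the identifier, e.g. "x" */
    int  location;     /* memory location assigned to it */
};

static struct symbol table[MAXSYMS];
static int nsyms = 0;
static int next_location = 1000;   /* hypothetical first free address */

/* Return the index of name in the table, entering it if absent. */
int lookup(const char *name)
{
    int i;
    for (i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                        /* already present */
    strcpy(table[nsyms].name, name);         /* make a new entry */
    table[nsyms].location = next_location++;
    return nsyms++;
}

With this sketch, processing the declaration int x, y, z; amounts to calling lookup("x"), lookup("y") and lookup("z").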

Check Your Progress 1

Question 1: Explain the relevance of regular expressions to lexical analysis.


Question 2: Construct a regular expression for an unsigned number as defined in Pascal.

3.4.2 Syntax Analysis

Every language, whether it is a programming language or a natural language, follows certain grammatical rules that define the syntactic structure of the language. In the C language, for example, a program is made out of a main function consisting of blocks, a block out of statements, a statement out of expressions, an expression out of tokens, and so on. The syntax of programming language constructs can be described by Backus-Naur Form (BNF) notation. These types of notations are also called context-free grammars. Well-formed grammars offer significant advantages to the compiler designer:

A grammar gives a precise, yet easy to understand, syntactic specification of a programming language.

Tools for designing a parser, which determines whether a source program is syntactically correct, can be developed automatically from certain classes of grammars.

A well-designed grammar imparts a structure to a programming language that is useful for the translation of source programs into correct object code.

Syntax analysis is the second phase of the compilation process. This process is also called parsing. It performs the following operations:

1. Obtains a group of tokens from the lexical analyser.

2. Determines whether a string of tokens can be generated by the grammar of the language, i.e. it checks whether the expression is syntactically correct or not.

3. Reports syntax error(s), if any.

The output of parsing is a representation of the syntactic structure of a statement in the form of a parse tree (syntax tree).

The process of parsing is shown below in figure 6.

Source Program → Lexical Analysis - tokens → Syntax Analysis - parse tree → Rest of phases → Object code

(the Symbol Table is consulted throughout)

Figure 6: Process of Parsing

For example, the statement X = Y+Z could be represented by the syntax tree shown in figure 7.

        =
       / \
      X   +
         / \
        Y   Z

Fig. 7: Parse tree


The parse tree of the statement in this form means that first Y and Z will be added, and then the result will be assigned to X.

Context-free grammars:

Each programming language has its own syntax (grammar). In this section we will discuss context-free grammars for specifying the syntax of a language. A grammar naturally describes the hierarchical structure of many programming language constructs. For example, an if-else statement in C has the form: if (expression) statement else statement

Suppose we take the variables expr and stmt to denote expressions and statements respectively; then the if-else statement can be written as stmt → if (expr) stmt else stmt. Such a rule is called a production. In a production, lexical elements like the keywords if and else and the parentheses are called tokens (also called terminal symbols). Variables like expr and stmt represent sequences of tokens and are called non-terminals.

A context-free grammar has four components:

1. A set of terminal symbols like keywords for a programming language

2. A set of non-terminal symbols

3. A set of productions (rules), where each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals called the right side of the production.

4. A designation of one of the non-terminals as the start symbol.

Example: the use of expressions is common in a programming language. An expression consists of digits and arithmetic operators +, -, *, etc., e.g. 3-2+1, 4+3, 1. Since arithmetic operators must appear between two digits, we define an expression as a list of digits separated by arithmetic operators. The following grammar describes the syntax of arithmetic expressions. The productions are

list → list + digit     (1)
list → list - digit     (2)
list → list * digit     (3)
list → digit            (4)
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9     (5)

The first three productions, with the non-terminal symbol list on the left side, can also be written as list → list + digit | list - digit | list * digit. In these productions +, -, *, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are all tokens of the grammar; list and digit are non-terminal symbols, and list is also the start symbol of the grammar.

                list
             /   |   \
          list   +   digit
        /  |  \        |
     list  -  digit    2
    /  |  \     |
 list  *  digit 6
   |        |
 digit      8
   |
   5

(the interior nodes are non-terminals; the leaves 5 * 8 - 6 + 2 are the tokens, i.e. terminal symbols)

Fig. 8: Parse tree for the expression 5*8-6+2

Page 9: Compiler Writing Tools

I

This grammar is able to generate any arithmetic expression of this type. A grammar derives strings (expressions) by beginning with the start symbol and repeatedly replacing a non-terminal symbol by the right side of a production for that non-terminal. In this example, list will be replaced with another list, which will be replaced further by some other list or digit.

Example: Suppose we have the expression 5*8-6+2. Let us verify whether this expression can be derived from this grammar and construct a parse tree (figure 8).

5 is a list by productions (5) and (4), since 5 is a digit.
5*8 is a list by production (3).
5*8-6 is a list by production (2).
5*8-6+2 is a list by production (1).

A parse tree graphically displays how the start symbol of the grammar derives the expression 5*8-6+2. A parse tree is a tree with the following properties:

1. The root is labeled by the start symbol (list).

2. Each leaf is labeled by a token or terminal symbol.

3. Each interior node is labeled by a non-terminal (list, digit).

The syntax of a language defines the set of valid programs, but a programmer or compiler writer must have more information before the language can be used or a compiler developed. The semantic rules provide this information and specify the meaning or actions of all valid programs allowed by the syntax rules.
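For the list grammar above, the semantics are simple enough to state in code. Here is a minimal C sketch (our illustration, assuming single-digit operands and no blanks) that evaluates an expression by applying the productions from left to right:

#include <stdio.h>

/* Evaluate an expression such as "5*8-6+2" according to the list
   grammar: start with a digit (production 4) and repeatedly apply
   productions (1)-(3) from left to right. */
int eval(const char *s)
{
    int value = *s++ - '0';                       /* list -> digit */
    while (*s != '\0') {
        char op = *s++;                           /* '+', '-' or '*' */
        int  d  = *s++ - '0';                     /* next digit */
        if (op == '+')      value = value + d;    /* production (1) */
        else if (op == '-') value = value - d;    /* production (2) */
        else                value = value * d;    /* production (3) */
    }
    return value;
}

int main(void)
{
    printf("%d\n", eval("5*8-6+2"));   /* prints 36 */
    return 0;
}

Note that this grammar gives all three operators the same precedence; the expr/term/factor grammar used for the desk calculator in Section 3.5.2 is the usual way of giving * higher precedence than + and -.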

Check Your Progress 2

Question 1: Construct context-free grammars for

(a) accepting English language statements
(b) if-then and if-then-else statements
(c) Pascal and C-language statements
(d) an arithmetic expression

Question 2: Generate a parse tree for the expression a+b*c based on a grammar for an arithmetic expression.

Question 3: Generate a parse tree for the English language sentence "The girl ate mango" based on the grammar defined in 1(a) of Check Your Progress 2.

3.4.3 Semantic Analysis

The role of the semantic analyser is to derive methods by which the structures constructed by the syntax analyser may be evaluated or analysed.

The semantic analysis phase checks the source program for semantic errors and gathers data type information for the subsequent code generation phase. An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification. For example, the definitions of many programming languages require a compiler to report an error every time a real number is used to index an array, as in a[5.6], where 5.6 is a real value, not an integer.

To illustrate some of the actions of a semantic analyser, consider the expression a+b-c*d in a language such as Pascal, where a, b and c have data type integer and d has type real. The syntax analyser produces a parse tree of the form shown in figure 9(a).

One of the tasks of the semantic analyser is to perform type checking within this expression. By consulting the symbol table, the data types of all the variables can be inserted into the tree, as shown in figure 9(b); the analyser then performs semantic type conversion and labels each node accordingly.

[Figure: parse trees for a+b-c*d; in (a) the bare parse tree, in (b) the data types of the variables (integer for a, b and c, real for d) attached to the leaves, and in (c) the type attributes propagated to the interior nodes]

Fig. 9: Semantic analysis of an arithmetic expression

The semantic analyser can determine the types of the intermediate results and thus propagate the type attributes through the tree, checking for compatibility as it goes. In our example, the semantic analyser first considers the result of c and d. According to the Pascal semantic rule integer * real → real, the * node can be labelled as real. This is shown in figure 9(c).
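The propagation step can be sketched in C as follows (our illustration; the rule encoded is just the integer/real promotion described above):

enum type { T_INTEGER, T_REAL };

/* Type of an arithmetic node given the types of its two operands:
   integer op integer -> integer; otherwise the integer operand is
   converted and the result is real. */
enum type result_type(enum type left, enum type right)
{
    if (left == T_INTEGER && right == T_INTEGER)
        return T_INTEGER;
    return T_REAL;
}

Applied bottom-up to the tree for a+b-c*d, result_type labels the * node real (integer * real), and that label then propagates to the - node and hence to the whole expression.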

Compilers vary widely in the role taken by the semantic analyser. In some simpler compilers there is no easily identifiable semantic analysis phase; the syntax analyser itself does semantic analysis and intermediate code generation directly. In other compilers syntax analysis, semantic analysis and code generation are separate phases. In the next section we will discuss the code generation phase.

3.4.4 Code Generation and Optimization

The final phase of the compiler is the code generator. The code generator takes as input an intermediate representation (in the form of a parse tree) of the source program and produces as output an equivalent target program (figure 10).

Parse tree → Code Generation & Optimization → Target program

(the Symbol Table supplies data type information)

Figure 10: Code generation phase

The target program may take on a variety of forms: absolute machine language, relocatable machine language or assembly language. Producing an absolute machine language program as output has the advantage that it can be placed in a fixed location in memory and immediately executed.


Producing a relocatable machine language program (object module) as output allows subprograms to be compiled separately. A set of relocatable object modules can be linked together and loaded for execution by a linking loader (refer to Unit 2).


The process of linking and loading to produce relocatable object code may be a little time-consuming, but it provides flexibility in being able to compile subroutines separately and to call other previously compiled programs from the object module. If the target machine does not handle relocation automatically, the compiler must provide relocation information to the loader so that the separately compiled program segments can be linked.

Producing an assembly language program as output makes the process of code generation somewhat simpler. We can generate symbolic instructions and use the macro facilities of the assembler to help generate code.

Some issues in the design of code generation: a thorough knowledge of the target machine's architecture, as well as its instruction set, is required to write a good code generator. The code generator is concerned with the choice of machine instructions, the allocation of machine registers, addressing and interfacing with the operating system. The concepts of registers and addressing schemes have been discussed in Block 1 of Course 1. To produce faster and more compact code, the code generator should include some form of code optimization. This may exploit techniques such as the use of special-purpose machine instructions or addressing modes, register optimization, etc. Code optimization may incorporate both machine-dependent and machine-independent techniques.
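As a small illustration (a sketch of ours, with invented instruction names in the style of a one-accumulator machine), a code generator walking the parse tree of figure 7 might emit:

#include <stdio.h>

struct node {                   /* a (simplified) parse tree node */
    char op;                    /* '+' for an add node, 0 for a leaf */
    const char *name;           /* variable name when the node is a leaf */
    struct node *left, *right;
};

/* Generate accumulator code for the assignment target := e, where e
   is either a single variable or the sum of two variables. */
void gen_assign(const char *target, const struct node *e)
{
    if (e->op == '+') {
        printf("LOAD  %s\n", e->left->name);
        printf("ADD   %s\n", e->right->name);
    } else {
        printf("LOAD  %s\n", e->name);
    }
    printf("STORE %s\n", target);
}

For the tree of X = Y+Z this prints LOAD Y, ADD Z, STORE X. A real code generator would of course recurse through arbitrarily deep trees and manage registers and temporaries.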

3.5 SOFTWARE TOOLS

Writing a compiler is not a simple project, and anything that makes the task simpler is worth exploring. At a very early stage in the history of compiler development it was recognised that some aspects of compiler design could be automated. Consequently, a great deal of effort has been directed towards the development of software tools to aid the production of a compiler. The two best-known software tools for compiler construction are Lex (a lexical analyser generator) and Yacc (a parser generator), both of which are available under the UNIX operating system. Their continuing popularity is partly due to their widespread availability, but also because they are powerful and easy to use, with a wide range of applicability. This section describes these two software tools.

3.5.1 Lex

In this section we describe a particular tool called Lex, which has been widely used to specify lexical analysers for a variety of languages. We refer to the tool as the Lex compiler and to its input specification as the Lex language. Lex is a software tool that takes as input a specification of a set of regular expressions together with actions to be taken on recognising each of these expressions. The output of Lex is a program that recognises the regular expressions and acts appropriately on each. Lex is generally used in the manner depicted in figure 11.

Lex source (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens

Fig. 11: Creating a lexical analyser with Lex

First, a specification of the lexical analyser is prepared by creating a program lex.l in the Lex language. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c, which consists of a C language program containing a recogniser for the regular expressions together with user-supplied code. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is a lexical analyser that transforms an input stream into a sequence of tokens.
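The generated analyser is driven by calling the function yylex(), which returns one token per call. A minimal hand-written driver might look like the following sketch (ours); yylex() and yytext are the standard names used in Lex-generated recognisers, though in some Lex versions yytext is declared as a character array rather than a pointer:

#include <stdio.h>

extern int  yylex(void);    /* supplied by lex.yy.c */
extern char *yytext;        /* text of the token just matched */

int main(void)
{
    int token;

    while ((token = yylex()) != 0)     /* yylex() returns 0 at end of input */
        printf("token code %d, lexeme \"%s\"\n", token, yytext);
    return 0;
}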


Lex specifications:

A Lex program consists of three parts:

declarations
%%
translation rules
%%
user routines

Any of these three sections may be empty, but the %% separator between the definitions and the rules cannot be omitted.

The declarations section includes declarations of variables, constants and regular definitions. The regular definitions are statements used as components of the regular expressions appearing in the translation rules. The translation rules of a Lex program, which are the key part of the Lex input, are statements of the form:

p1   {action 1}
p2   {action 2}
...
pn   {action n}

where each pi is a regular expression and each action is a program fragment describing what action the lexical analyser should take when pattern pi matches a token. In Lex, the actions are generally written in C; however, they can be in any implementation language.

The third section contains user routines which are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyser.

Lex supports a very powerful range of operators for the construction of regular expressions. For example, a regular expression for an identifier can be written as:

[A-Za-z][A-Za-z0-9]*

which represents an arbitrary string of letters and digits beginning with a letter, suitable for matching a variable in many programming languages.

Here is a list of Lex operators with examples:

Operator          Example   Meaning
*  (asterisk)     a*        the set of all strings of zero or more a's, i.e. {empty, a, aa, aaa, ...}
|  (bar)          a|b       either a or b
+  (plus)         a+        one or more instances of a, i.e. a, aa, aaa, etc.
?  (question)     a?        zero or one instance of a
[] (class)        [abc]     the character class a|b|c; a class such as [a-z] denotes the regular expression a|b|...|z

Here are a few more examples of Lex operators:

(a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another regular expression for this same set is aa|ab|ba|bb.

(a|b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's.


X | X*YZ denotes the set containing the string X and all strings consisting of zero or more X's followed by YZ.

Given a familiarity with Lex and C, a lexical analyser for almost any programming language can be written in a very short time. Lex does not have application solely in the generation of lexical analysers. It can also be used to assist in the implementation of almost any text pattern matching application, such as text editing, code conversion and so on.

3.5.2 Yacc

Yacc (Yet Another Compiler-Compiler) assists in the next phase of the compiler. It creates a parser which will be output in a form suitable for inclusion in the next phase. Yacc is available as a command (utility) on the UNIX system and has been used to help implement hundreds of compilers.

A parser can be constructed using Yacc in the manner illustrated in figure 12.

Yacc specification (parse.y) → Yacc compiler → y.tab.c
y.tab.c → C compiler → a.out
Input → a.out → Output

Figure 12: Yacc functioning

First a file, say parse.y, containing a Yacc specification for an expression is prepared. The UNIX system command yacc parse.y transforms the file parse.y into a C program called y.tab.c, which is a representation of the parser written in C, along with other C routines that the user may have prepared. y.tab.c is then run through the C compiler to produce an object program a.out that performs the translation specified by the original Yacc program. A Yacc source program also has three parts, like Lex. This can be expressed in a Yacc specification as:

declarations
%%
translation rules
%%
supporting C routines

Example: To illustrate how to prepare a Yacc source program, let us construct a simple desk calculator that reads an arithmetic expression, evaluates it, and then prints its numeric value. We shall build the desk calculator starting with the following grammar for arithmetic expressions:

expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | digit

The token digit is a single digit ranging from 0 to 9. A Yacc desk calculator program derived from this grammar is shown in figure 13.


The declarations part. There are two optional sections in the declarations part of a Yacc program. In the first section, we write ordinary C declarations, delimited by %{ and %}. Here we place declarations of any temporaries used by the translation rules or procedures of the second and third sections.


In figure 13, this section contains only the include statement #include <ctype.h>, which causes the C preprocessor to include the standard header file <ctype.h> that contains the predicate isdigit.

%{
#include <ctype.h>
%}
%token DIGIT
%%
line   : expr '\n'          { printf("%d\n", $1); }
       ;
expr   : expr '+' term      { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor    { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'       { $$ = $2; }
       | DIGIT
       ;
%%
yylex()
{
    int c;

    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}

Figure 13: Yacc specification of a simple desk calculator

Also in the declarations part are declarations of grammar tokens. In figure 13 the statement

%token DIGIT

declares DIGIT to be a token. Tokens declared in this section can then be used in the second and third parts of the Yacc specification.

The translation rules part. In the part of the Yacc specification after the first %% pair, we put the translation rules. Each rule consists of a grammar production and the associated semantic action. A set of productions that we have been writing as

<left side> → <alt 1> | <alt 2> | ... | <alt n>

would be written in Yacc as

<left side> : <alt 1>    { semantic action 1 }
            | <alt 2>    { semantic action 2 }
            ...
            | <alt n>    { semantic action n }
            ;

In a Yacc production, a quoted single character such as 'c' is taken to be the terminal symbol c, and unquoted strings of letters and digits not declared to be tokens are taken to be non-terminals. Alternative right sides can be separated by a vertical bar, and a semicolon follows each left side with its alternatives and their semantic actions. The first left side is taken to be the start symbol.

A Yacc semantic action is a sequence of C statements. In a semantic action, the symbol $$ refers to the attribute value associated with the non-terminal on the left, while $i refers to the value associated with the i-th grammar symbol (terminal or non-terminal) on the right. The semantic action is performed whenever we reduce by the associated production, so normally the semantic action computes a value for $$ in terms of the $i's. In the Yacc specification, we have written the two production rules for expressions (expr)

expr : expr '+' term | term

and their associated semantic actions as


expr : expr '+' term    { $$ = $1 + $3; }
     | term
     ;

Note that the non-terminal term in the first production is the third grammar symbol on the right, while '+' is the second. The semantic action associated with the first production adds the values of the expr and the term on the right and assigns the result as the value for the non-terminal expr on the left. We have omitted the semantic action for the second production altogether, since copying the value is the default action for productions with a single grammar symbol on the right. In general, { $$ = $1; } is the default semantic action.

Notice that we have added a new starting production

line : expr '\n'

to the Yacc specification. This production says that an input to the desk calculator is to be an expression followed by a newline character. The semantic action associated with this production prints the decimal value of the expression.
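For example (our worked trace, assuming the generated parser has been compiled into a.out), typing the line 2+3*4 causes the parser to reduce 3*4 to a term with value 12 and then expr + term to 14, which the line production prints:

$ a.out
2+3*4
14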

The supporting C-routines part. The third part of a Yacc specification consists of supporting C routines. A lexical analyser by the name yylex() must be provided. Other procedures, such as error recovery routines, may be added as necessary.

The lexical analyser yylex() produces pairs consisting of a token and its associated attribute value. If a token such as DIGIT is returned, the token must be declared in the first section of the Yacc specification. The attribute value associated with a token is communicated to the parser through the Yacc-defined variable yylval.

The C language routine in figure 13 reads input characters one at a time using getchar(). If the character is a digit, the value of the digit is stored in the variable yylval and the token DIGIT is returned. Otherwise, the character itself is returned as the token. This section will become clearer once you have gone through Block 3 of Course 4 on C programming.

The power and utility of Yacc should not be underestimated. The effort saved during compiler implementation by using Yacc rather than a handwritten parser can be considerable.

3.5.3 Program Development Tools

In the previous sections, we have discussed two compiler writing tools, Lex and Yacc, but the support for a programming language does not end with the implementation of a compiler. There are other language support tools a programmer needs in order to reduce the cost of software development. These additional facilities include tools for program editing, debugging, analysis and documentation. Ideally, such tools should be closely integrated with the language implementation so that, for example, when the compiler detects syntactic errors in the source program, the editor can be entered automatically with the cursor indicating the position of the error.

The rapid production of syntactically correct programs is a major goal. The use of syntax-directed editors is increasing, and they can assist the user in eliminating syntax errors as the program is being input. Such an editor only accepts syntactically correct input and prompts the user, if necessary, to input only those constructs that are syntactically correct at the current position. These editors can considerably improve run-time efficiency by removing the necessity for repeated runs of the compiler to remove syntax errors.

Debugging tools should also be integrated with the programming language implementation for finding and removing bugs in the shortest period of time.

Integrated program development environments are now being considered as an important aid for the rapid construction of correct software. Considerable progress has been made over the last decade in the provision of powerful software tools to ease the programmer's burden. The application of the basic principles of language and compiler design will help this development continue.


Check Your Progress 3

Question 1: Explain the utility of Lex and Yacc.


Question 2: Write about the usefulness of program development tools.

Question 3: Construct the lexical tokens, the parse tree and the generated code (using the assembly language program studied in Block 3 of Course 1) for the expression a*b+c-d.

3.6 SUMMARY

This unit discussed several issues related to compilers. The initial discussion focused on approaches to compiler development and the compiler design phases, which include lexical analysis, parsing, semantic analysis and code generation, while the latter part examined two important software tools, Lex and Yacc, as well as program development tools, which greatly simplify the implementation of a compiler.

3.7 MODEL ANSWERS

Check Your Progress 1

1. During the process of lexical analysis, a source program is scanned character by character, and the characters are collected together into groups (called tokens) according to the lexical structure of the language. The syntax of these simple tokens can be specified in terms of regular grammars. Writing a lexical analyser from scratch is a fairly challenging task. Regular expressions can be implemented in software simply and effectively, and several tools have been built for constructing lexical analysers based on regular grammars.

2. Unsigned numbers in Pascal are strings such as 1234, 12.34, 1.234E5 or 1.23E-4.

unsigned-num = digits optional-fraction optional-exponent

digits = digit digit*

optional-fraction = . digits | empty

optional-exponent = ( E ( + | - | empty ) digits ) | empty

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Check Your Progress 2

1. (a) <Eng sentence> ::= <NP> <VP>
       <NP> ::= <article> <noun>
       <VP> ::= <verb> <NP>
       <article> ::= a | an | the

where NP is a noun phrase and VP is a verb phrase respectively.

(b) <stmt> ::= if expr then stmt | if expr then stmt else stmt | others

where others refers to any other language statement.

(c) No model answer.

(d) <expr> ::= <expr> + <term> | <term>
    <term> ::= <term> * <id> | <id>
    <id> ::= a | b | c | ...

where id refers to an identifier, for example a, b, c, etc.

2. Parse tree for the expression a+b*c:

           <expr>
          /   |   \
     <expr>   +   <term>
        |        /   |   \
     <term>  <term>  *  <id>
        |       |        |
      <id>    <id>       c
        |       |
        a       b


3. Parse tree for the English language sentence "The girl ate a mango", based on the grammar defined in 1(a) of Check Your Progress 2:

            <Eng sentence>
            /            \
         <NP>            <VP>
        /     \         /     \
  <article>  <noun>  <verb>   <NP>
      |         |       |    /        \
     The      girl     ate  <article> <noun>
                                |        |
                                a      mango

Check Your Progress 3

1. Lex and Yacc are both software tools available under the UNIX operating system. Lex has been widely used to specify lexical analysers for a variety of languages. Lex takes as input a specification of a set of regular expressions together with actions to be taken on recognising each of these expressions. The output of Lex is a program that recognises the regular expressions and acts appropriately on each. Yacc is a parser generator. It accepts a set of grammar rules, with actions to be taken for each grammar rule, and automatically constructs a parser for the grammar. Yacc and Lex mix together well, and it is easy to construct a lexical analyser using Lex which can be called by the parser constructed by Yacc.

2. No model answer.

3. No model answer.

3.8 FURTHER READINGS

1. Aho, A.V., Ullman, J.D. (1977). Principles of Compiler Design. Addison-Wesley.

2. Aho, A.V., Sethi, R. & Ullman, J.D. Compilers: Principles, Techniques and Tools. Addison-Wesley Publishing Company.

3. Watson, D. High-Level Languages and their Compilers. Addison-Wesley Publishing Company.