Database Management system Dept of Computer Science & Engg, VJCET
MODULE I
1.1 Basic Concepts

A DBMS is a collection of interrelated data and a set of programs to access this data in a convenient and efficient way. It controls the organization, storage, retrieval, security and integrity of data in a database. In other words, it enables users to create and maintain a database. It accepts requests from the application and instructs the operating system to transfer the appropriate data. It facilitates the processes of defining, constructing, manipulating and sharing a database among various users and applications.
- Defining a database means specifying the different types of data elements to be stored in the database, i.e. data types, structures and constraints. For a bank database, this means specifying fields such as Name (a string of alphabetic characters) and Acct Number (an integer within a range), along with the characteristics of each field.
- Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS.
- Manipulating a database is the processing of the database. It includes updating the database and retrieving data from it.
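These three activities can be sketched with Python's built-in sqlite3 module; the table and field names below are invented to mirror the bank example, not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Defining: declare the data types, structure and constraints
conn.execute("""
    CREATE TABLE account (
        acct_number INTEGER PRIMARY KEY CHECK (acct_number > 0),
        name        TEXT NOT NULL
    )
""")

# Constructing: store the data itself under DBMS control
conn.execute("INSERT INTO account VALUES (101, 'Jones')")

# Manipulating: update and retrieve the stored data
conn.execute("UPDATE account SET name = 'Smith' WHERE acct_number = 101")
row = conn.execute("SELECT name FROM account WHERE acct_number = 101").fetchone()
print(row[0])  # -> Smith
```

Note that the DBMS, not the application, rejects data that violates the declared constraints (e.g. a non-positive account number).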
1.2 Purpose of database systems

File system versus database approach

One way to store information in a computer system is to store it in a traditional file system. In this method each piece of data is stored in a separate file, and there is an application program for each application.
Data redundancy and inconsistency
In traditional file systems data may be duplicated. For example, consider a bank offering two kinds of accounts: a savings bank (SB) account and a checking account. In this case, the address of a customer is stored in two files: one with the SB account record and the other with the checking record. This duplication results in a need for more storage space, and it also leads to inconsistency: if the address of a customer changes, the change may be reflected in only one account. This is inconsistency of data.
Difficulty in accessing information
Suppose the bank needs a list of customers with an account balance higher than Rs. 10,000, but we do not have an application at hand to satisfy this request. To access this information we have two choices: either list the SB account customers and extract the needed list manually, or develop a new program to satisfy the new request. Both options are cumbersome.
Data Isolation
Data are scattered in different files, and the files may be in various formats, so it is difficult to extract the appropriate data.
Integrity problems
Constraints on the data are enforced through appropriate code in the application programs. If we need to add a new constraint, we have to change the code, so it is very difficult to add or change constraints. The problem is compounded when a constraint involves data from several different files.
Atomicity problems
Suppose a failure occurs during execution of a program, so that execution stops in the middle and leaves the data inconsistent. The execution of a program should always end in a consistent state, but in a traditional file system a failure usually results in an inconsistent state.
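A DBMS avoids this through atomic transactions: either all updates in a unit of work are applied, or none are. A minimal sketch with sqlite3 (the account names and amounts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 500), ('B', 0)")
conn.commit()

# Transfer 300 from A to B; a failure mid-way must not leave
# the money debited from A but not credited to B.
try:
    conn.execute("UPDATE account SET balance = balance - 300 WHERE name = 'A'")
    raise RuntimeError("simulated crash between the two updates")
    conn.execute("UPDATE account SET balance = balance + 300 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # undo the half-finished transfer

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # -> {'A': 500, 'B': 0}: back in a consistent state
```

The rollback restores the state before the transaction began, which a flat file updated in place cannot do.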
1.3 Features (characteristics) of DBMS
The basic difference between traditional file processing and the database approach is that in traditional file processing, each user defines and implements the files needed for a specific application as part of programming the application. In the database approach, a single repository of data is maintained that is defined once and then accessed by various users.
For example, consider student records. In traditional file processing, the office keeps a record for each student to track his or her fees and payments, while the department keeps another record for each student to track marks and progress. Even though both the office and the department are interested in data about students, each maintains separate files, because each requires some data that is not available from the other.
Now what are the features of database approach?
Database system is
1. Self describing:
i.e. the database system contains not only the database itself but also a complete definition or description of the database structure. This structure is stored in a catalog with the type, storage format and constraints of each data item, as mentioned earlier. The information stored in the catalog is called meta-data.
2. Data security
The DBMS can prevent unauthorized users from viewing or updating the
database. Using passwords, users are allowed access to the entire database
or a subset of it known as a "subschema." For example, in a student
database, some users may be able to view payment details while others
may view only the mark lists of students.
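In SQL terms, such a subschema can be approximated with a view that exposes only some columns; the student table below is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (roll_no INTEGER, name TEXT, marks INTEGER, fees_paid INTEGER)")
conn.execute("INSERT INTO student VALUES (1, 'Anu', 82, 15000)")

# A "subschema" for users who may see marks but not payment details
conn.execute("CREATE VIEW mark_list AS SELECT roll_no, name, marks FROM student")

cur = conn.execute("SELECT * FROM mark_list")
cols = [d[0] for d in cur.description]
print(cols)  # -> ['roll_no', 'name', 'marks']  (fees_paid is hidden)
```

A full DBMS would pair such views with per-user privileges (e.g. GRANT/REVOKE); SQLite has no user accounts, so the view here only illustrates the shape of a subschema.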
3. Data Integrity
The DBMS can ensure that no more than one user can update the same
record at the same time. It can keep duplicate records out of the database;
for example, no two customers with the same customer number can be
entered.
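The duplicate-record guarantee is typically declared to the DBMS rather than programmed by hand; a sketch with an invented customer table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declaring customer_no as primary key makes the DBMS enforce uniqueness
conn.execute("CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (7, 'Jones')")

# A second customer with the same customer number is rejected by the DBMS
try:
    conn.execute("INSERT INTO customer VALUES (7, 'Smith')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # -> False
```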
4. Interactive Query
Most DBMSs provide query languages and report writers that let users
interactively interrogate the database and analyze its data. This important
feature gives users access to all management information as needed; e.g. we can easily get all the details of each student at any time.
5. Interactive Data Entry and Updating
Many DBMSs provide a way to interactively enter and edit data, allowing
you to manage your own files and databases. However, interactive
operation does not leave an audit trail and does not provide the controls
necessary in a large organization. These controls must be programmed into
the data entry and update programs of the application.
6. Data Independence
With a DBMS, the details of the data structure are not stated in each application program. The program asks the DBMS for data by field name; for example, a coded equivalent of "give me customer name and balance due" would be sent to the DBMS. Without a DBMS, the programmer must reserve space for the full structure of the record in the program, and any change in the data structure requires changing all application programs.
1.4 DBMS Components

Data:
Data stored in a database include numerical data which may be
integers (whole numbers only) or floating point numbers (decimal),
and non-numerical data such as characters (alphabetic and numeric
characters), date or logical (true or false). More advanced systems may
include more complicated data entities such as pictures and images as
data types.
Standard operations:
Standard operations are provided by most DBMS. These operations
provide the user basic capabilities for data manipulation. Examples of these
standard operations are sorting, deleting and selecting records.
Data definition language (DDL):
DDL is the language used to describe the contents of the database. It is used to describe, for example, attribute names (field names), data types, location in the database, etc.
Data manipulation and query language:
Normally a query language is supported by a DBMS to form
commands for input, edit, analysis, output, reformatting, etc. Some degree
of standardization has been achieved with SQL (Structured Query
Language).
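The split between the DDL and the data manipulation/query language can be seen in SQL itself. In the sketch below (the book table and its columns are invented), CREATE TABLE is DDL, while INSERT and SELECT belong to the manipulation and query language:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: describe the contents of the database (names, types, constraints)
conn.execute("CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT, price REAL)")

# Data manipulation: input data
conn.executemany("INSERT INTO book VALUES (?, ?, ?)",
                 [("0-13-0", "DB Concepts", 450.0),
                  ("0-07-1", "OS Basics", 380.0)])

# Query language: request information back from the database
cheap = conn.execute("SELECT title FROM book WHERE price < 400").fetchall()
print(cheap)  # -> [('OS Basics',)]
```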
Programming tools:
Besides commands and queries, the database should be accessible
directly from application programs through function calls (subroutine
calls) in conventional programming languages.
File structures:
Every DBMS has its own internal structures used to organize the
data although some common data models are used by most DBMS.
Abstraction
Each application program has some data relevant to a particular task, and an application program may need to use a portion of data that is also used by other programs. In the early days of computerization, each application programmer designed the file structure, the metadata of the file, and the access method for each record. That is, each application program used its own data and its own details concerning the structure of the data, as well as its own way to access and interpret each data item. Because the application programs were implemented independently, any change in the storage medium required changes to these structures and access methods. And because the files were structured for one application, it was difficult to use the data in these files for new applications requiring data from several files belonging to different existing applications.
For example, consider two application programs that require data on an entity set EMPLOYEE. The first application program involves the public relations department sending each employee a newsletter and related material. This application program is interested in the record type EMPLOYEE, containing values for the attributes EMPL_Name and EMPL_Address.
1.5 Architecture of DBMS

The generalized architecture of DBMS is called the ANSI/SPARC model. The architecture is divided into three levels: the external level, the conceptual level and the internal level.
The view at each of these levels is described by a schema. Schema
describes the records and its relationships in the view.
a. External view or User view
It is the highest level of data abstraction. This includes only those portions
of database of concern to a user or Application program. Each user has a
different external view, and it is described by means of a schema called the external schema. The schema contains the definitions of the logical records and relationships in the external view. It also contains the method of deriving the objects in the external view from the objects in the conceptual view.
b. Conceptual view
At this level of database abstraction, all the database entities and the relationships among them are included. A single conceptual view represents the entire database and is described by the conceptual schema. It describes the method of deriving the objects in the conceptual view from the objects in the internal view, and also specifies the checks needed to retain data consistency and integrity.
c. Internal view
It is the lowest level of abstraction, closest to the physical storage method. It
describes how the data is stored, what is the structure of data storage and the
method of accessing these data. It is represented by internal schema.
Fig 1.1 The three levels of the architecture: the view (external) level is defined by the user, the logical (conceptual) level is defined by the DBA, and the physical (internal) level is defined by the DBA for optimization.
1.6 Data independence

Data independence is the capacity to change the schema at one level of a database system without having to change the schema at the next higher level. The three-schema architecture can be used to achieve this data independence. Data independence comes in two types:
1. Logical data independence
It is the capacity to change the conceptual schema without having to
change the external schema. Sometimes, we may need to change the
conceptual schema to expand the database, to change the constraints, or to
reduce the database. In a DBMS that supports logical data independence, only the view definitions and mappings need to be changed; application programmers do not notice any change in the schema constructs of the DBMS.
2. Physical data independence
Physical data independence is the capacity to change the internal
schema without having to change the conceptual schema and external
schema. The internal schema may change to improve the performance of
retrieval or update. The conceptual schema need not change as long as the data remains the same. For example, we need not change the query that retrieves a student's progress report even if the DBMS adopts a new method of storing student records.
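Physical data independence can be illustrated by adding an index: the physical access path changes, but the query text and its answer do not. A sketch, with an invented student table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")
conn.execute("INSERT INTO student VALUES (42, 'Anu')")

query = "SELECT name FROM student WHERE roll_no = 42"
before = conn.execute(query).fetchall()

# Change the internal schema: add an index to speed up retrieval
conn.execute("CREATE INDEX idx_roll ON student(roll_no)")

after = conn.execute(query).fetchall()
print(before == after)  # -> True: same query, same answer, new storage path
```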
Advantages
1. Controlling Redundancy
In traditional file processing, every user group maintains its own files; each group independently keeps files on its data (e.g., students). Therefore, much of the data is stored twice or more, and this redundancy leads to several problems:
a. duplication of effort, and storage space wasted when the same data is stored repeatedly
b. files that represent the same data may become inconsistent (since the updates are applied independently by each user group)
In the database approach we can instead use controlled redundancy.
2. Restricting Unauthorized Access
A DBMS should provide a security and authorization subsystem.
Some db users will not be authorized to access all information in the db(e.g.,
financial data).
Some users are allowed only to retrieve data, while others are allowed both to retrieve and to update the database.
3. Providing Persistent Storage for Program Objects and Data Structures
The data structures provided by the DBMS must be compatible with the programming language's data structures. For example, object-oriented DBMSs are compatible with programming languages such as C++ and SMALLTALK, and the DBMS software automatically performs conversions between programming data structures and file formats.
4. Permitting Inference and Actions Using Deduction Rules
Deductive database systems provide capabilities for defining deduction rules to infer new information from the stored database facts.
5. Providing Multiple User Interfaces
(e.g., query languages, programming languages interfaces, forms, menu-
driven interfaces, etc.)
6. Representing Complex Relationships Between Data
The complex relationship between data is easily represented.
7. Enforcing Integrity Constraints
Integrity constraints on the data are enforced by the database management system.
1.7 DBMS Disadvantages

A database system generally provides on-line access to the database
for many users. In contrast, a conventional system is often designed to meet
a specific need and therefore generally provides access to only a small
number of users. Because of the larger number of users accessing the data
when a database is used, the enterprise may involve additional risks as
compared to a conventional data processing system in the following areas.
1. Confidentiality, Privacy and Security

When information is centralized and is made available to
users from remote locations, the possibilities of abuse are often more than in
a conventional system. To reduce the chances of unauthorized users
accessing sensitive information, it is necessary to take technical,
administrative and, possibly, legal measures. Most databases store valuable
information that must be protected from deliberate attack and destruction.
2. Data Quality

Since the database is accessible to users remotely,
adequate controls are needed to control users updating data and to control
data quality. With increased number of users accessing data directly, there
are enormous opportunities for users to damage the data. Unless there are
suitable controls, the data quality may be compromised.
3. Data Integrity

Since a large number of users could be using a database concurrently, we have to ensure that data remain correct during
operation. The main threat to data integrity comes from several different
users attempting to update the same data at the same time. The database
therefore needs to be protected against accidental changes by the users.
4. Enterprise Vulnerability

Centralizing all data of an enterprise in one database may mean that the database becomes a critical resource. The survival of the enterprise
may depend on reliable information being available from its database. The
enterprise therefore becomes vulnerable to the destruction of the database or
to unauthorized modification of the database.
5. The Cost of using a DBMS
Conventional data processing systems are typically designed to
run a number of well-defined, preplanned processes. Such systems are often
"tuned" to run efficiently for the processes that they were designed for.
Although the conventional systems are usually fairly inflexible in that new
applications may be difficult to implement and/or expensive to run, they are
usually very efficient for the applications they are designed for.
The database approach on the other hand provides a
flexible alternative where new applications can be developed relatively
inexpensively. The flexible approach is not without its costs and one of these
costs is the additional cost of running applications that
the conventional system was designed for. Using standardized software is
almost always less machine efficient than specialized software.
1.8 Data model

Entities and Attributes
Entities are distinguishable objects of concern and are modeled using their characteristics or attributes. A database usually contains a large number of similar entities. For example, a company database covering a large number of employees may want to store similar information for each employee; each employee can then be termed an entity. An entity can be an object with physical existence, e.g. a car, a person or an employee, but each entity has its own values. The properties that describe an entity are called the attributes of that entity. A collection of entities with the same attributes is termed an entity type.
For eg: Employee (Employee_id, Address, Designation, Salary)
Here Employee is an entity and Employee_id, Address, Designation, Salary
represents the attribute of entity Employee.
There can be several types of attributes, such as simple versus composite, single-valued versus multi-valued, and stored versus derived.
1. Composite versus Simple
Composite attributes are those attributes that can be divided into smaller subparts with independent meanings. In the example above, the attribute Address can be divided into smaller subparts such as City, State and Street_address. The attributes that are not divisible are called simple or atomic attributes. The value of a composite attribute is the concatenation of the values of its constituent simple attributes.
2. Single-valued versus multi-valued
Most attributes have only a single value for a particular entity; such attributes are called single-valued. In some cases an attribute may have more than one value for a particular entity; such attributes are called multi-valued. The attribute age of an entity person has only one value, while the college degrees of that person may number more than one. So the attribute age can be considered single-valued and college degree multi-valued.
3. Stored versus derived
In some cases attribute values are related, so that one can be derived from the other. Consider a person as an entity. The attributes age and DateOfBirth of the person are related: the age of a person can be derived from the current date and his DateOfBirth. The age attribute is therefore called a derived attribute, and DateOfBirth is called a stored attribute, from which the person's age is calculated.
Entity set
An entity set is a set of entities of the same type that share the
same properties, or attributes. It is represented by a set of attributes. An
attribute, as used in the E-R model, can be characterized as one of the following types:
Simple and composite attributes
Single and multi-valued attributes
Null attributes
Derived attributes
A relationship is an association among several entities. And a relationship
set is a set of relationships of the same type.
Keys
Before designing a database we should be able to specify how entities
within a given entity set and relationships within a given relationship set are
distinguished. Conceptually the individual entities and relationships are
distinct; but from a database perspective, the difference must be expressed
by their attributes. The concept of key is used to make such distinctions.
A super key is a set of attributes that, taken collectively, identifies an entity in the entity set uniquely. For example, the social_security_no attribute of the entity set employee is sufficient to distinguish one employee entity from another; thus social_security_no is a superkey for the entity set employee. A superkey with no proper subset that is itself a superkey, i.e. a minimal superkey, is known as a candidate key. For example, it is possible to combine the attributes employ_id and employ_name to form a superkey, but social_security_no alone is sufficient to distinguish two employees; thus social_security_no is a candidate key. The term primary key is used to denote the candidate key that is chosen by the database designer to identify entities in an entity set. A key (super, candidate or primary) is a property of the entity set rather than of the individual entities.
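These definitions can be checked mechanically: a set of attributes is a superkey iff no two tuples agree on all of them, and a candidate key is a minimal superkey. A small sketch over an invented employee relation:

```python
from itertools import combinations

# Toy employee relation: each dict is one tuple (row)
employees = [
    {"employ_id": 1, "employ_name": "Jones", "ssn": "111"},
    {"employ_id": 2, "employ_name": "Jones", "ssn": "222"},
    {"employ_id": 3, "employ_name": "Smith", "ssn": "333"},
]

def is_superkey(key):
    # Superkey: the projection onto the key attributes has no duplicates
    projected = [tuple(row[a] for a in key) for row in employees]
    return len(set(projected)) == len(projected)

def is_candidate_key(key):
    # Candidate key: a superkey none of whose proper subsets is a superkey
    return is_superkey(key) and not any(
        is_superkey(sub)
        for r in range(1, len(key))
        for sub in combinations(key, r))

print(is_superkey(("employ_id", "employ_name")))      # True, but not minimal
print(is_candidate_key(("employ_id", "employ_name"))) # False: employ_id alone suffices
print(is_candidate_key(("ssn",)))                     # True
```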
Entity- Relationship (E-R) Diagram
The overall logical structure of a database can be expressed graphically by
an E-R diagram. The diagram consists of the following major components.
Rectangles: represent entity set.
Ellipses: represent attributes.
Diamonds: represents relationship sets.
Lines: links attribute set to entity set and entity set to relationship set.
Double ellipses: represent multi-valued attributes.
Dashed ellipses: denote derived attributes.
For eg: Consider an E-R diagram, which consists of two entity sets
customer and loan.
Fig 1.2
A data model is a plan for building a database. The model represents
data conceptually, the way the user sees it, rather than how computers store
it. Data models focus on required data elements and associations; most often they are expressed graphically using entity-relationship diagrams. On a more abstract level, the term is also used to describe a database's overall structure. The most commonly used data modeling techniques are
1. Entity- Relationship model
2. Hierarchical model
3. Network model
4. Object-oriented model
1.9 Hierarchical Model

The hierarchical data model organizes data in a tree structure.
There is a hierarchy of parent and child data segments. This structure implies
that a record can have repeating information, generally in the child data
segments. Each record has a set of field values attached to it. The model collects all the instances of a specific record together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent-child relationships.
Hierarchical databases link records like an organization chart.
A record type can be owned by only one owner. In the following example,
orders are owned by only one customer. Hierarchical structures were widely
used with early mainframe systems; however, they are often restrictive in
linking real-world structures.
Fig 1.3 A hierarchical structure: Order records are owned by a single Customer.
Advantages:
• Hierarchical Model is simple to construct and operate on
• Corresponds to a number of natural hierarchically organized domains -
e.g., assemblies in manufacturing, personnel organization in companies
• Language is simple; uses constructs like GET, GET UNIQUE, GET
NEXT, GET NEXT WITHIN PARENT etc.
Disadvantages:
• Navigational and procedural nature of processing
• Database is visualized as a linear arrangement of records
• Little scope for "query optimization"
1.10 Network Model
In 1971, the Conference on Data Systems Languages (CODASYL) formally
defined the network model. The basic data modeling construct in the
network model is the set construct. A set consists of an owner record type, a
set name, and a member record type. A member record type can have that
role in more than one set, hence the multiparent concept is supported. An
owner record type can also be a member or owner in another set. In network
databases, a record type can have multiple owners. In the example below,
orders are owned by both customers and products, reflecting their natural
relationship in business.
Fig 1.4 A network structure: Order records are owned by both Customer and Product.
Advantages:
• Network Model is able to model complex relationships and represents
semantics of add/delete on the relationships.
• Can handle most situations for modeling using record types and
relationship types.
• Language is navigational; uses constructs like FIND, FIND member, FIND
owner, FIND NEXT within set, GET etc. Programmers can do optimal
navigation through the database.
Disadvantages:
• Navigational and procedural nature of processing
• Database contains a complex array of pointers that thread through a set of
records.
• Little scope for automated "query optimization"
1.11 Object-Oriented Model

Object DBMSs add database functionality to object programming
languages. They bring much more than persistent storage of
programming language objects. Object DBMSs
extend the semantics of the C++, Smalltalk and Java object programming
languages to provide full-featured database programming capability,
while retaining native language compatibility. A major benefit of this
approach is the unification of the application and database development
into a seamless data model and language environment. As a result,
applications require less code, use more natural data modeling, and code
bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.
Fig 1.5
1.12 Relational model (RDBMS - relational database management system)
A relational database is based on the relational model developed by E.F. Codd. It allows the definition of data structures, storage and retrieval operations, and integrity constraints. In such a database the data and the relations between them are organized in tables. A table is a collection of records, and each record in a table contains the same fields.
It permits the database designer to create a consistent,
logical representation of information. Consistency is achieved by including
declared constraints in the database design, which is usually referred to as
the logical schema. The theory includes a process of database normalization
whereby a design with certain desirable properties can be selected from a set
of logically equivalent alternatives. The access plans and other
implementation and operation details are handled by the DBMS engine, and
are not reflected in the logical model. This contrasts with common practice
for SQL DBMSs in which performance tuning often requires changes to the
logical model.
The basic relational building block is the domain or data
type, usually abbreviated nowadays to type. A tuple is an unordered set of
attribute values. An attribute is an ordered pair of attribute name and type
name. An attribute value is a specific valid value for the type of the attribute.
This can be either a scalar value or a more complex type. Relational
databases do not link records together physically, but the design of the
records must provide a common field, such as account number, to allow for
matching. Often, the fields used for matching are indexed in order to speed
up the process.
In the following example, customers, orders and products
are linked by comparing data fields and/or indexes when information from
more than one record type is needed. This method is more flexible for ad
hoc inquiries. Many hierarchical and network DBMSs also provide this
capability.
Fig 1.6 The relational model: Customer, Order and Product records linked by common data fields.
MODULE 2
2.1 Basic Structure of relational model

The relational model for database management is a data model based on predicate logic and set theory. It was invented by Edgar Codd. The fundamental assumption of the relational model is that all data are represented as mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n sets.
1) Relation - The fundamental organizational structure for data in the relational model is the relation. A relation is a two-dimensional table made up of rows and columns. Each relation, also called a table, stores data about entities.
2) Tuples - The rows in a relation are called tuples. They represent specific occurrences (or records) of an entity. Each row consists of a sequence of values, one for each column in the table. In addition, each row (or record) in a table must be unique. A tuple variable is a variable that stands for a tuple.
3) Attributes – The columns in a relation are called attributes. The attributes represent the characteristics of an entity.
4) Domain – For each attribute there is a set of permitted values called domain of that attribute. For all relations ‘r’, the domain of all attributes of ‘r’ should be atomic. A domain is said to be atomic if elements of the domain are considered to be indivisible units.
2.2 Database Schema – Logical design of the database is termed as database schema.
1) Database instance – Database instance is a snapshot of the data in a database at a given instant of time.
2) Relation schema – The concept of relation schema corresponds to the programming notion of type definition. It can be considered as the definition of a domain of values. The database schema is the collection of relation schemas that define a database.
3) Relation instance – The concept of a relation instance corresponds to the programming language notion of a value of a variable. For relation instance, we actually mean the “relation” itself.
2.3 Keys – A key is the relational means of specifying uniqueness. The keys applicable in relational model are primary key, candidate key and super key.
1.) Primary key - A primary key is a value, formed from one or more attributes, that can be used to identify a unique row in a table.
2.) Candidate key - A candidate key of a relation variable is a set of attributes of that relation variable such that (1) at all times it holds in the relation assigned to that variable that there are no two distinct tuples with the same values for these attributes and (2) there is not a proper subset for which (1) holds.
3.) Super key - A superkey is defined in the relational model as a set of attributes of a relation variable for which it holds that in all relations assigned to that variable there are no two distinct tuples that have the same values for the attributes in this set.
4.) Foreign key - A foreign key is a field or group of fields in a database record that points to a key field or group of fields forming a key of another database record in some (usually different) table. A relation schema, r1, derived from an E-R schema may include among its attributes the primary key of another relation schema, r2. This attribute is a foreign key from r1, referencing r2. The relation r1 is called the referencing relation of the foreign key dependency, and r2 is called the referenced relation.
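A sqlite3 sketch of a foreign key, with invented table names: loan (r1) references branch (r2), and the DBMS rejects a tuple whose branch does not exist in the referenced relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.execute("CREATE TABLE branch (bname TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE loan (
        loan_no INTEGER PRIMARY KEY,
        bname   TEXT REFERENCES branch(bname)  -- foreign key: loan references branch
    )
""")
conn.execute("INSERT INTO branch VALUES ('Redwood')")
conn.execute("INSERT INTO loan VALUES (13, 'Redwood')")  # referenced value exists: OK

try:
    conn.execute("INSERT INTO loan VALUES (99, 'Nowhere')")  # dangling reference
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # -> True
```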
2.4 Schema diagram – A database schema, along with primary key and foreign key dependencies, can be depicted pictorially by schema diagrams. Each relation in the database schema is represented as a box, with the attributes listed inside it and the relation name above it. If there are primary key attributes, a horizontal line crosses the box, with the primary key attributes listed above the line. Foreign key dependencies appear as arrows from the foreign key attributes of the referencing relation to the foreign key attributes of the referenced relation.
2.5 Relational algebra – The relational algebra is a procedural query language. (A query language is a language in which a user requests information from the database.) It consists of a set of operations that take one or two relations as input and produce a new relation as the result. The fundamental operations in relational algebra are select, project, union, set difference, Cartesian product and rename. There are several other operations namely, set intersection, natural join, division and assignment.
Fundamental operations
1. Select operation - The select operation selects tuples that satisfy a given predicate. The Greek symbol ‘σ’ is used to denote selection. The predicate appears as a subscript to σ . It is a unary operation.
E.g. Consider the borrow relation and branch relation in the banking example:
Borrow relation

Branch name    Loan#   Customer name   Amount
Downtown       17      Jones           1000
Round Hill     23      Smith           2000
Redwood        13      Hayes           1300

Table 2.1
Branch relation (Table 2.2, shown below)

To select tuples (rows) of the borrow relation where the branch is “Redwood”, we would write

σ bname = “Redwood” (borrow)

The new relation created as the result of this operation consists of one tuple: (Redwood, 13, Hayes, 1300). We allow comparisons using =, ≠, <, ≤, > and ≥ in the selection predicate. We also allow the logical connectives ∨ (or) and ∧ (and). For example:

σ bname = “Downtown” ∧ amount > 800 (borrow)

2. Project operation - The project operation is used to retrieve specific attributes/columns from a relation. It is denoted using the Greek letter pi (∏). It is a unary operation.
For example, to obtain a relation showing customers and branches, but ignoring amount and loan#, we write
∏branchname,customername(borrow)
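The select and project operations just described can be sketched in a few lines, modelling relations as lists of dictionaries; the data mirrors the borrow relation of Table 2.1, and the helper names select/project are our own.

```python
# A small sketch of select (σ) and project (∏) on relations represented
# as lists of dicts; 'borrow' mirrors Table 2.1.
borrow = [
    {"bname": "Downtown",   "loan": 17, "cname": "Jones", "amount": 1000},
    {"bname": "Round Hill", "loan": 23, "cname": "Smith", "amount": 2000},
    {"bname": "Redwood",    "loan": 13, "cname": "Hayes", "amount": 1300},
]

def select(predicate, relation):
    """sigma_predicate(relation): keep tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(attrs, relation):
    """pi_attrs(relation): keep only the named attributes, dropping duplicates."""
    seen, out = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out

redwood = select(lambda t: t["bname"] == "Redwood", borrow)
pairs = project(["bname", "cname"], borrow)
print(redwood)  # one tuple: the Redwood/Hayes row
```

Both operations are unary, as the text notes: each takes a single relation and yields a new relation.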
3) Union operation – The union operation is a binary operation since it involves 2 relations. It is used to retrieve tuples appearing in either or both the relations participating in the union. It is denoted as U. For a union operation R U S to be legal, we require that
o R and S must have the same number of attributes.
o The domains of the corresponding attributes must be the same.
4) Set difference – The set difference operation is a binary operation. Set difference is denoted by the minus sign (−). It finds tuples that are in one relation, but not in another. Thus R − S results in a relation containing tuples that are in R but not in S.
5) Cartesian product – This is a binary operation involving 2 relations. It is used to obtain all possible combinations of tuples from two relations. The Cartesian product of two relations is denoted by a cross (×), written R1 x R2 for relations R1 and R2.
Branch relation

Branch name   Branch city   Assets
Downtown      Brooklyn      9000000
Round Hill    Horseneck     21000000
Redwood       Palo Alto     17000000

Table 2.2
The result of R1 x R2 is a new relation with a tuple for each possible pairing of tuples from R1 and R2. In order to avoid ambiguity, the attribute names have attached to them the name of the relation from which they came. If no ambiguity will result, we drop the relation name. If R1 has n tuples and R2 has m tuples, then R = R1 x R2 will have m x n tuples.
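The three set-style operations above have direct counterparts on Python sets of tuples; a minimal sketch (relation contents are invented for illustration):

```python
# Union, set difference, and Cartesian product over relations modelled
# as sets of tuples (attribute order fixed by convention).
r = {("Jones",), ("Smith",), ("Hayes",)}
s = {("Smith",), ("Curry",)}

union = r | s                            # R U S: tuples in either relation
diff = r - s                             # R - S: tuples in R but not in S
prod = {a + b for a in r for b in s}     # R x S: every pairing of tuples

# With n = 3 tuples in r and m = 2 in s, the product has m x n = 6 tuples.
assert len(prod) == len(r) * len(s)
print(len(union), len(diff), len(prod))  # 4 2 6
```

Note that union and difference require union-compatible relations (same arity, same domains), exactly as stated for R U S above.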
6) Rename – The rename operation solves the problems that occur with naming when performing the cartesian product of a relation with itself.
Suppose we want to find the names of all the customers who live on the same street and in the same city as Smith.
Customer name   Customer street   Customer city
Jones           Main              Harrison
Smith           North             Rye
Hayes           Main              Harrison

Table 2.3 Customer relation
We can get the street and city of Smith by writing

∏ cstreet, ccity (σ cname = “Smith” (customer))

To find other customers with the same information, we need to reference the customer relation again:

σ p (customer × ∏ cstreet, ccity (σ cname = “Smith” (customer)))

where p is a selection predicate requiring street and ccity values to be equal.
So we have to distinguish between the two street values appearing in the Cartesian product, as both come from a single customer relation. For that, we use the rename operator, denoted by the Greek letter rho (ρ).
We write

ρ x (r)

to get the relation r under the name of x.
If we use this to rename one of the two customer relations we are using, the ambiguities will disappear.
Additional operations
1. Set Intersection - Set intersection is denoted by ∩, and returns a relation that contains tuples that are in both of its argument relations. It does not add any expressive power, since r ∩ s = r − (r − s).
Eg: Consider the depositor and borrower relations. If we want to find all customers who have both a loan and an account, we take the intersection of the two relations. It can be written as ∏ customer name (borrower) ∩ ∏ customer name (depositor).
2. Natural join - Natural join is a dyadic operator that is written as R ⋈ S where R and S are relations. The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names.
Consider R and S to be sets of attributes. We denote the attributes appearing in both relations by R ∩ S, and the attributes in either or both relations by R U S. Consider two relations r(R) and s(S). The natural join of r and s, denoted by r ⋈ s, is a relation on scheme R U S. It is a projection onto R U S of a selection on r x s where the predicate requires r.a = s.a for each attribute a in R ∩ S. Formally,
r ⋈ s = Π R U S (σ r.A1=s.A1 Λ r.A2=s.A2 Λ … Λ r.An=s.An (r x s)) where R ∩ S = {A1, A2, …, An}
For an example consider the tables Employee and Dept and their natural join (the result, Employee ⋈ Dept, appears below as Table 2.6):

Table 2.4 Dept

DeptName     Manager
Sales        Harriet
Production   Charles
Finance      George

Table 2.5 Employee

Name      EmpId   DeptName
Harry     3415    Finance
Sally     2241    Sales
George    3401    Finance
Harriet   2202    Sales
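The definition above (pair tuples that agree on every shared attribute name) can be sketched directly; the data mirrors the Employee/Dept example, with relations as lists of dicts and a helper natural_join of our own:

```python
# A sketch of natural join: pair tuples that agree on all shared
# attribute names (here just DeptName), as in the Employee/Dept example.
employee = [
    {"Name": "Harry",   "EmpId": 3415, "DeptName": "Finance"},
    {"Name": "Sally",   "EmpId": 2241, "DeptName": "Sales"},
    {"Name": "George",  "EmpId": 3401, "DeptName": "Finance"},
    {"Name": "Harriet", "EmpId": 2202, "DeptName": "Sales"},
]
dept = [
    {"DeptName": "Sales",      "Manager": "Harriet"},
    {"DeptName": "Production", "Manager": "Charles"},
    {"DeptName": "Finance",    "Manager": "George"},
]

def natural_join(r, s):
    common = set(r[0]) & set(s[0])   # shared attribute names, R ∩ S
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

joined = natural_join(employee, dept)
print(len(joined))  # 4: no Employee tuple matches Production
```

This is the selection-over-product formulation made literal: the comprehension builds r x s and the all(...) test plays the role of the predicate r.a = s.a for each a in R ∩ S.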
3. Equi-join / θ-join - If we want to combine tuples from two relations where the combination condition is not simply the equality of shared attributes, then it is convenient to have a more general form of join operator: the θ-join (or theta-join). The θ-join is a dyadic operator that is written as R ⋈ a θ b S or R ⋈ a θ v S, where a and b are attribute names, θ is a comparison operator in the set {<, ≤, =, >, ≥}, v is a value constant, and R and S are relations. The result of this operation consists of all combinations of tuples in R and S that satisfy the condition θ. The result of the θ-join is defined only if the headers of S and R are disjoint, that is, do not contain a common attribute. When θ is equality, the θ-join is called an equi-join.
4. Outer-join - Whereas the result of a join (or inner join) consists of tuples formed by combining matching tuples in the two operands, an outer join contains those tuples and additionally some tuples formed by extending an unmatched tuple in one of the operands by "fill" values for each of the attributes of the other operand. Three outer join operators are defined: left outer join, right outer join, and full outer join.
Table 2.6 Employee ⋈ Dept

Name      EmpId   DeptName   Manager
Harry     3415    Finance    George
Sally     2241    Sales      Harriet
George    3401    Finance    George
Harriet   2202    Sales      Harriet

Left Outer join - The left outer join is written as R =X S where R and S are relations. The result of the left outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to the tuples in R that have no matching tuples in S. For an example consider the tables Employee and Dept and their left outer join:
In the resulting relation, tuples in R that have no matching tuples in S take the null value, ω, for the attributes contributed by S. Since there are no tuples in Dept with a DeptName of Finance or Executive, ω appears in the Manager attribute of every result tuple whose DeptName is Finance or Executive.
The left outer join can be simulated using the natural join, set difference and union as follows (the second term pads each unmatched tuple of R with ω values):

R =X S = (R ⋈ S) ∪ ((R − ∏R (R ⋈ S)) × {(ω, …, ω)})
Table 2.8 Dept

DeptName     Manager
Sales        Harriet
Production   Charles

Table 2.9 Employee

Name      EmpId   DeptName
Harry     3415    Finance
Sally     2241    Sales
George    3401    Finance
Harriet   2202    Sales
Tim       1123    Executive
Table 2.10 Employee =X Dept

Name      EmpId   DeptName    Manager
Harry     3415    Finance     ω
George    3401    Finance     ω
Tim       1123    Executive   ω
Sally     2241    Sales       Harriet
Harriet   2202    Sales       Harriet

Right outer join - The right outer join behaves almost identically to the left outer join, except that it is the unmatched tuples of the right-hand relation that are preserved in the result. The right outer join is written as R X= S where R and S are relations. The result of the right outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to the tuples in S that have no matching tuples in R. For an example consider the tables Employee and Dept and their right outer join:

Table 2.11 Employee

Name      EmpId   DeptName
Harry     3415    Finance
Sally     2241    Sales
George    3401    Finance
Harriet   2202    Sales
Tim       1123    Executive

Table 2.12 Dept

DeptName     Manager
Sales        Harriet
Production   Charles

In the resulting relation, tuples in R that have no common values in common attribute names with tuples in S take the null value, ω. Since there are no tuples in Employee with a DeptName of Production, ω occurs in the Name and EmpId attributes of the result tuple whose DeptName is Production.

Table 2.13 Employee X= Dept

Name      EmpId   DeptName     Manager
Sally     2241    Sales        Harriet
Harriet   2202    Sales        Harriet
ω         ω       Production   Charles
Full outer join - The outer join or full outer join in effect combines the results of the
left and right outer joins. The full outer join is written as R =X= S where R and S are
relations. The result of the full outer join is the set of all combinations of tuples in R
and S that are equal on their common attribute names, in addition to tuples in S that
have no matching tuples in R and tuples in R that have no matching tuples in S in their
common attribute names.
For an example consider the tables Employee and Dept and their full outer join:

Table 2.14 Employee

Name      EmpId   DeptName
Harry     3415    Finance
Sally     2241    Sales
George    3401    Finance
Harriet   2202    Sales
Tim       1123    Executive

Table 2.15 Dept

DeptName     Manager
Sales        Harriet
Production   Charles

In the resulting relation, tuples in R which have no common values in common attribute names with tuples in S take the null value, ω. Tuples in S which have no common values in common attribute names with tuples in R also take the null value, ω.

Table 2.16 Employee =X= Dept

Name      EmpId   DeptName     Manager
Harry     3415    Finance      ω
Sally     2241    Sales        Harriet
George    3401    Finance      ω
Harriet   2202    Sales        Harriet
Tim       1123    Executive    ω
ω         ω       Production   Charles
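SQL engines expose these operators directly, with NULL standing in for the fill value ω. A sketch with SQLite via Python's sqlite3 follows, using the Employee/Dept data above; only the left outer join is shown, since the right outer join is just the left outer join with the operands swapped, and FULL OUTER JOIN support depends on the SQLite version.

```python
import sqlite3

# Left outer join with SQLite; NULL (None in Python) plays the role of ω.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (name TEXT, empid INTEGER, deptname TEXT);
CREATE TABLE dept (deptname TEXT, manager TEXT);
INSERT INTO employee VALUES ('Harry',3415,'Finance'),('Sally',2241,'Sales'),
    ('George',3401,'Finance'),('Harriet',2202,'Sales'),('Tim',1123,'Executive');
INSERT INTO dept VALUES ('Sales','Harriet'),('Production','Charles');
""")
left = conn.execute("""
    SELECT e.name, e.empid, e.deptname, d.manager
    FROM employee e LEFT OUTER JOIN dept d ON e.deptname = d.deptname
""").fetchall()
# Finance and Executive employees get a NULL manager, as in Table 2.10.
unmatched = [row for row in left if row[3] is None]
print(len(left), len(unmatched))  # 5 3
```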
5. Division operation - The division is a binary operation that is written as R ÷ S. The result consists of the restrictions of tuples in R to the attribute names unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their combinations with tuples in S are present in R. For an example see the tables Completed, DBProject and their division:

Table 2.18 Completed

Student   Task
Fred      Database1
Fred      Database2
Fred      Compiler1
Eugene    Database1
Eugene    Compiler1
Sara      Database1
Sara      Database2

Table 2.19 DBProject

Task
Database1
Database2

Completed ÷ DBProject

Student
Fred
Sara
Let r(R) and s(S) be relations, with S ⊆ R. The relation r ÷ s is a relation on scheme R − S. A tuple t is in r ÷ s if for every tuple ts in s there is a tuple tr in r satisfying both of the following:

tr[R − S] = t[R − S]
tr[S] = ts[S]

These conditions say that the R − S portion of a tuple t is in r ÷ s if and only if there are tuples in r with that R − S portion and the S portion ts, for every value of the S portion appearing in relation s.
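The Completed ÷ DBProject example can be sketched in a few lines: keep each Student who has completed every Task listed in DBProject. The relations and the helper divide below are our own modelling of Tables 2.18 and 2.19.

```python
# A sketch of Completed ÷ DBProject: keep each Student for whom every
# (Student, Task) combination with the tasks in DBProject is in Completed.
completed = {
    ("Fred", "Database1"), ("Fred", "Database2"), ("Fred", "Compiler1"),
    ("Eugene", "Database1"), ("Eugene", "Compiler1"),
    ("Sara", "Database1"), ("Sara", "Database2"),
}
dbproject = {"Database1", "Database2"}

def divide(r, s):
    # The attribute unique to r is the first component (Student) here.
    students = {stu for stu, _ in r}
    return {stu for stu in students
            if all((stu, task) in r for task in s)}

print(sorted(divide(completed, dbproject)))  # ['Fred', 'Sara']
```

Eugene is excluded because (Eugene, Database2) is missing from Completed, matching the result table above.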
6. Assignment operation - Sometimes it is useful to be able to write a relational algebra expression in parts using a temporary relation variable. The assignment operation, denoted ←, works like assignment in a programming language.
We could rewrite our division definition as

temp1 ← ∏ R−S (r)
temp2 ← ∏ R−S ((temp1 × s) − ∏ R−S, S (r))
result ← temp1 − temp2

No extra relation is added to the database, but the relation variable created can be used in subsequent expressions. Assignment to a permanent relation would constitute a modification to the database.
2.6 Tuple Relational Calculus - The tuple calculus is a calculus that was introduced by Edgar F. Codd as part of the relational model in order to give a declarative database query language for this data model. The tuple relational calculus is a nonprocedural language. (The relational algebra was procedural.) We must provide a formal description of the information desired. A query in the tuple relational calculus is expressed as { t | P(t) }, i.e. the set of tuples t for which predicate P is true. We also use the notation
o t[a] to indicate the value of tuple t on attribute a.
o t є r to show that tuple t is in relation r.
Example Queries
For example, to find the branch-name, loan number, customer name and amount for loans over $1200:

{ t | t є borrow ∧ t[amount] > 1200 }
This gives us all attributes, but suppose we only want the customer names. (We would use project in the algebra.) We need to write an expression for a relation on scheme (cname):

{ t | ∃ s є borrow (t[cname] = s[cname] ∧ s[amount] > 1200) }

In English, we may read this expression as “the set of all tuples t such that there exists a tuple s in the relation borrow for which the values of t and s for the cname attribute are equal, and the value of s for the amount attribute is greater than 1200.”
The notation ∃ t є r (Q(t)) means “there exists a tuple t in relation r such that predicate Q(t) is true”. Consider another example: Find all customers having a loan from the SFU branch, and the cities in which they live:

{ t | ∃ s є borrow (t[cname] = s[cname] ∧ s[bname] = “SFU”) ∧ ∃ u є customer (u[cname] = t[cname] ∧ t[ccity] = u[ccity]) }

In English, we might read this as “the set of all (cname, ccity) tuples for which cname is a borrower at the SFU branch, and ccity is the city of cname”. Tuple variable s ensures that the customer is a borrower at the SFU branch. Tuple variable u is restricted to pertain to the same customer as t, and also ensures that ccity is the city of the customer.
The logical connectives ∧ (and) and ∨ (or) are allowed, as well as ¬ (negation). We also use the existential quantifier ∃ and the universal quantifier ∀.
Formal Definition
A tuple relational calculus expression is of the form { t | P(t) } where P is a formula. Several tuple variables may appear in a formula.
Tuple variable : A tuple variable is said to be a free variable unless it is quantified by a ∃ or a ∀. If it is quantified by a ∃ or a ∀, it is said to be a bound variable.
Formula : A formula is built of atoms. An atom is one of the following forms:
o s є r, where s is a tuple variable and r is a relation (use of the ∉ operator is not allowed).
o s[x] θ u[y], where s and u are tuple variables, x and y are attributes, and θ is a comparison operator (<, ≤, =, ≠, >, ≥).
o s[x] θ c, where c is a constant in the domain of attribute x.
Formulae are built up from atoms using the following rules:
o An atom is a formula.
o If P is a formula, then so are ¬P and (P).
o If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2 and P1 ⇒ P2.
o If P(s) is a formula containing a free tuple variable s, then
∃ s є r (P(s)) and ∀ s є r (P(s))
are also formulae.
Important equivalences:
o P1 ∧ P2 ≡ ¬(¬P1 ∨ ¬P2)
o ∀ t є r (P(t)) ≡ ¬ ∃ t є r (¬P(t))
o P1 ⇒ P2 ≡ ¬P1 ∨ P2
Safety of Expressions
A tuple relational calculus expression may generate an infinite relation, e.g.

{ t | ¬(t є borrow) }

There are an infinite number of tuples that are not in borrow. Most of these tuples contain values that do not even appear in the database. So we have to restrict the relational calculus.
Safe Tuple Expressions
The domain of a formula P, denoted dom(P), is the set of all values referenced in P. We say an expression { t | P(t) } is safe if all values that appear in the result are values from dom(P). A safe expression yields a finite number of tuples as its result. Otherwise, it is called unsafe. The tuple relational calculus restricted to safe expressions is equivalent in expressive power to the relational algebra.
2.7 Domain Relational Calculus - The domain relational calculus (DRC) is a calculus that was introduced by Edgar F. Codd as a declarative database query language for the relational data model. This language uses the same operators as tuple calculus: the logical operators Λ (and), V (or) and ¬ (not). The existential quantifier (∃) and the universal quantifier (∀) can be used to bind the variables.
Formal Definition
An expression is of the form

{ <x1, x2, …, xn> | P(x1, x2, …, xn) }

where the xi represent domain variables and P is a formula.
An atom in the domain relational calculus is of the following forms :
o <x1, x2, …., xn> є r where r is a relation on n attributes, and xi, 1 ≤ i ≤ n, are domain variables or constants.
o x θ y , where x and y are domain variables, and θ is a comparison operator.
o x θ c , where c is a constant.
Formulae are built up from atoms using the following rules:
o An atom is a formula.
o If P is a formula, then so are ¬P and (P).
o If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2 and P1 ⇒ P2.
o If P(x) is a formula containing a free domain variable x, then
∃ x (P(x)) and ∀ x (P(x))
are also formulae.
Example Queries
Find branch name, loan number, customer name and amount for loans of over $1200.

{ <b, l, c, a> | <b, l, c, a> є borrow ∧ a > 1200 }

Find all customers who have a loan for an amount greater than $1200.

{ <c> | ∃ b, l, a (<b, l, c, a> є borrow ∧ a > 1200) }
Find all customers having a loan from the SFU branch, and the city in which they live.
Find all customers having a loan, an account or both at the SFU branch.
Find all customers who have an account at all branches located in Brooklyn.
Safety of Expressions
We say that an expression
{ < x1, x2,…..,xn > | P (x1, x2,….xn)} is safe if all of the following hold:
1. All values that appear in tuples of the expression are values from dom(P).
2. For every “there exists” subformula of the form ∃x (P1(x)), the subformula is true if and only if there is a value x in dom(P1) such that P1(x) is true.
3. For every “for all” subformula of the form ∀x (P1(x)), the subformula is true if and only if P1(x) is true for all values x from dom(P1).
An expression such as { <b, l, a> | ¬(<b, l, a> є loan)} is unsafe because it allows values in the result that are not in the domain of the expression.
All three of the following are equivalent:
o The relational algebra. o The tuple relational calculus restricted to safe expressions. o The domain relational calculus restricted to safe expressions.
2.8 SQL – SQL has become the standard relational database language. It has several parts:
o Data definition language (DDL) - provides commands to define relation schemes, delete relations, create indices and modify schemes.
o Interactive data manipulation language (DML) - a query language based on both relational algebra and tuple relational calculus, plus commands to insert, delete and modify tuples.
o Embedded data manipulation language - for use within programming languages like C, PL/1, Cobol, Pascal, etc.
o View definition - commands for defining views.
o Authorization - specifying access rights to relations and views.
o Integrity - a limited form of integrity checking.
o Transaction control - specifying beginning and end of transactions.
Basic Structure
Basic structure of an SQL expression consists of select, from and where clauses.
A typical SQL query has the form:

select A1, A2, …, An
from r1, r2, …, rm
where P

Each Ai represents an attribute, and each ri a relation. P is a predicate. This query is equivalent to the algebra expression

Π A1, A2, …, An (σ P (r1 x r2 x … x rm))

If the where clause is omitted, the predicate P is true. The list of attributes can be replaced with a * to select all attributes. The result of an SQL query is a relation.
The select clause - corresponds to the projection operation of the relational algebra. It is used to list the attributes desired in the result of a query. If we want to remove duplicates in a selection, we use the keyword distinct after select. The keyword all is used to specify explicitly that duplicates are not removed.

select *

means select all the attributes. The select clause can also contain arithmetic expressions involving the operators +, -, * and / and operating on constants or attributes of tuples.
Eg: 1. select branch-name from loan
2. select branch-name, loan-number, amount * 100 from loan
The where clause - corresponds to selection predicate in relational algebra. It consists of a predicate involving attributes of the relations that appear in the from clause. SQL uses the logical connectives and, or and not - rather than mathematical symbols Λ, V and ¬ in the where clause. The operands of the logical connectives can be expressions involving the comparison operators <, >, ≤, ≥, = and <>. SQL includes a between comparison operator to simplify where clauses that specify that a value be less than or equal to some value or greater than or equal to some other value.
Eg: select loan-number from loan where amount between 90000 and 100000
The from clause - corresponds to Cartesian product of the relational algebra. It lists the relations to be scanned in the evaluation of the expression.
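The select, where and from clauses above can be exercised end-to-end; the sketch below runs them with SQLite through Python's sqlite3, against a hypothetical loan(branch_name, loan_number, amount) table whose contents are invented for illustration.

```python
import sqlite3

# A runnable version of the select-from-where examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loan (branch_name TEXT, loan_number TEXT, amount INTEGER);
INSERT INTO loan VALUES ('Downtown','L-17',1000),('Redwood','L-23',2000),
    ('Perryridge','L-15',95000),('Downtown','L-14',1500);
""")
# distinct removes the duplicate 'Downtown' from the projection.
branches = conn.execute(
    "SELECT DISTINCT branch_name FROM loan").fetchall()
# between abbreviates amount >= 90000 AND amount <= 100000.
big = conn.execute(
    "SELECT loan_number FROM loan WHERE amount BETWEEN 90000 AND 100000"
).fetchall()
print(sorted(branches), big)
```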
The rename operation – SQL provides a mechanism for renaming both relations and attributes. It uses the as clause, taking the form: old-name as new-name.
String operations - The most commonly used operation on strings is pattern matching using the operator like. We describe patterns using two special characters:
Percent (%) – The % character matches any substring. Underscore ( _ ) – The _ character matches any character.
Patterns are case-sensitive. The keyword escape is used to define the escape character. We can use not like to search for strings that do not match a pattern.
Ordering the display of tuples - SQL allows the user to control the order in which tuples are displayed.
o order by makes tuples appear in sorted order (ascending order by default).
o desc specifies descending order.
o asc specifies ascending order.
Set operations - SQL has the set operations union, intersect and except. union eliminates duplicates, being a set operation. If we want to retain duplicates, we may use union all, similarly for intersect and except.
Not all implementations of SQL have these set operations. except in SQL-92 is called minus in SQL-86.
Aggregate functions - In SQL we can compute functions on groups of tuples using the group by clause. Attributes given are used to form groups with the same values. SQL can then compute
o average value -- avg
o minimum value -- min
o maximum value -- max
o total sum of values -- sum
o number in group -- count
These are called aggregate functions. They return a single value. The having clause is used to state conditions that apply to groups rather than to tuples. Predicates in the having clause are applied after the formation of groups. If a where clause and a having clause appear in the same query, the where clause predicate is applied first. Tuples satisfying the where clause are placed into groups by the group by clause. The having clause is applied to each group. Groups satisfying the having clause are used by the select clause to generate the result tuples. If no having clause is present, the tuples satisfying the where clause are treated as a single group.
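The where-then-group-then-having pipeline just described can be observed directly; a sketch with SQLite, using a hypothetical account(branch_name, balance) table invented for illustration:

```python
import sqlite3

# Aggregates with GROUP BY and HAVING: WHERE filters tuples first,
# then groups are formed, then HAVING filters whole groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (branch_name TEXT, balance INTEGER);
INSERT INTO account VALUES ('Downtown',500),('Downtown',700),
    ('Redwood',300),('Redwood',900),('Perryridge',400);
""")
rows = conn.execute("""
    SELECT branch_name, AVG(balance), COUNT(*)
    FROM account
    WHERE balance > 350              -- applied before grouping
    GROUP BY branch_name
    HAVING COUNT(*) >= 2             -- applied to each group
""").fetchall()
print(rows)  # [('Downtown', 600.0, 2)]
```

Redwood and Perryridge each retain only one tuple after the where clause, so the having predicate eliminates their groups and only Downtown survives.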
Null values – The keyword null is used to test for a null value(absence of information about the value of an attribute).
2.9 Views in SQL - A view in SQL is defined using the create view command: create view v as <query expression>, where <query expression> is any legal query expression. The view created is given the name v. To create a view all-customer of all branches and their customers:
create view all-customer as
(select bname, cname from depositor, account where depositor.account# = account.account#) union
(select bname, cname from borrower, loan where borrower.loan# = loan.loan#)
Having defined a view, we can now use it to refer to the virtual relation it creates. View names can appear anywhere a relation name can.
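A runnable sketch of the all-customer idea follows, with SQLite via sqlite3. Note the simplification: the depositor and borrower tables here carry the branch name directly (a schema assumed for illustration), rather than joining through account and loan as in the definition above.

```python
import sqlite3

# A simplified all_customer view: branch/customer pairs from either side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE depositor (bname TEXT, cname TEXT);
CREATE TABLE borrower (bname TEXT, cname TEXT);
INSERT INTO depositor VALUES ('SFU','Smith'),('Downtown','Jones');
INSERT INTO borrower VALUES ('SFU','Hayes'),('SFU','Smith');
CREATE VIEW all_customer AS
    SELECT bname, cname FROM depositor
    UNION
    SELECT bname, cname FROM borrower;
""")
# The view name can be used wherever a relation name can.
sfu = conn.execute(
    "SELECT cname FROM all_customer WHERE bname = 'SFU' ORDER BY cname"
).fetchall()
print(sfu)  # [('Hayes',), ('Smith',)]
```

Smith appears once even though she is both a depositor and a borrower, because union eliminates duplicates.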
2.10 Data manipulations
Insert – It is used to insert a single tuple to a relation. To insert data into a relation, we either specify a tuple, or write a query whose result is the set of tuples to be inserted. Attribute values for inserted tuples must be members of the attribute's domain.
Eg: To insert a tuple for Smith who has $1200 in account A-9372 at the SFU branch.
insert into account values (“SFU”, “A-9372”, 1200)
It is important that we evaluate the select statement fully before carrying out any insertion. If some insertions were carried out even as the select statement were being evaluated, the insertion might insert an infinite number of tuples. Evaluating the select statement completely before performing insertions avoids such problems. It is possible for inserted tuples to be given values on only some attributes of the schema. The remaining attributes are assigned a null value denoted by null. We can prohibit the insertion of null values using the SQL DDL.
Delete – The delete command removes tuples from a relation. Deletion is expressed in much the same way as a query. Instead of displaying, the selected tuples are removed from the database. We can only delete whole tuples. A deletion in SQL is of the form delete from r where P. Tuples in r for which P is true are deleted. If the where clause is omitted, all tuples are deleted. We may only delete tuples from one relation at a time, but we may reference any number of relations in a select-from-where clause embedded in the where clause of a delete. However, if the delete request contains an embedded select that references the relation from which tuples are to be deleted, ambiguities may result.
Update - Updating allows us to change some values in a tuple without necessarily changing all. where clause of update statement may contain any construct legal in a where clause of a select statement (including nesting). A nested select within an update may reference the relation that is being updated. As before, all tuples in the relation are first tested to see whether they should be updated, and the updates are carried out afterwards.
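The three DML statements can be seen together in a short sketch with SQLite; the account(bname, acct_no, balance) schema is assumed for illustration.

```python
import sqlite3

# insert, update and delete against a small account table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (bname TEXT, acct_no TEXT, balance INTEGER)")
# insert: add single tuples to the relation.
conn.execute("INSERT INTO account VALUES ('SFU', 'A-9372', 1200)")
conn.execute("INSERT INTO account VALUES ('Downtown', 'A-101', 500)")

# update: change some values in a tuple without replacing the whole tuple.
conn.execute("UPDATE account SET balance = balance + 60 WHERE bname = 'SFU'")

# delete: remove whole tuples satisfying the predicate.
conn.execute("DELETE FROM account WHERE balance < 600")

rows = conn.execute("SELECT bname, balance FROM account").fetchall()
print(rows)  # [('SFU', 1260)]
```

As the text notes, all tuples are tested against the update predicate first and the updates applied afterwards; the delete then removes the one remaining low-balance tuple.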
Update of a view - The view update exists also in SQL. An example will illustrate: Consider a clerk who needs to see all information in the loan relation except amount. Let the view branch-loan be given to the clerk: create view branch-loan as select bname, loan# from loan
Since SQL allows a view name to appear anywhere a relation name may appear, the clerk can write: insert into branch-loan values (“SFU”, “L-307”). This insertion is represented by an insertion into the actual relation loan, from which the view is constructed. However, we have no value for amount. This insertion results in (“SFU'', “L-307”, null) being inserted into the loan relation.
MODULE 3
3.1 Transaction and system preliminaries.
The concept of transaction has been devised as a convenient and precise way
of describing the various logical units that form a database system. We have
transaction systems, which are systems that operate on very large databases on
which several users (sometimes hundreds) operate concurrently – i.e. they
manipulate the database through transactions. There are several such systems presently in
operation in our country also – if you consider the railway reservation system,
wherein thousands of stations, each with multiple computers, operate on a
huge database, the database containing the reservation details of all trains of our
country for the next several days. There are many other such systems like the airlines
reservation systems, distance banking systems, stock market systems etc. In all these
cases apart from the accuracy and integrity of the data provided by the database (note
that money is involved in almost all the cases – either directly or indirectly), the
systems should provide instant availability and fast response to these hundreds of
concurrent users. In this block, we discuss the concept of transaction, the problems
involved in controlling concurrently operated systems and several other related
concepts. We repeat – a transaction is a logical operation on a database and the users
intend to operate with these logical units trying either to get information from the
database and in some cases modify them. Before we look into the problem of
concurrency, we view the concept of multiuser systems from another point of view –
the view of the database designer.
3.1.1 A typical multiuser system
We remind ourselves that a multiuser computer system is a system that can be
used by a number of persons simultaneously as against a single user system,
which is used by one person at a time. (Note however, that the same system can be
used by different persons at different periods of time). Now extending this
concept to a database, a multiuser database is one which can be accessed and
modified by a number of users simultaneously – whereas a single user database is
one which can be used by only one person at a time. Note that multiuser
databases essentially mean there is a concept of multiprogramming but the
converse is not true. Several users may be operating simultaneously, but not all of
them may be operating on the database simultaneously.
Now, before we see what problems can arise because of concurrency, we see
what operations can be done on the database. Such operations can be single line
commands or can be a set of commands meant to be operated sequentially. Those
operations are invariably limited by the “begin transaction” and “end transaction”
statements and the implication is that all operations in between them are to be done on
a given transaction.
Another concept is the “granularity” of the transaction. Assume each field in a
database is named. The smallest such named item of the database can be called a
field of a record. The unit on which we operate can be one such “grain” or a number
of such grains collectively defining some data unit. However, in this course, unless
specified otherwise, we use “single grain” operations, but without loss of
generality. To facilitate discussions, we presume a database package in which the
following operations are available.
i) read_tr(X): The operation reads the item X and stores it into an assigned
variable. The name of the variable into which it is read can be anything,
but we would give it the same name X, so that confusions are avoided. I.e.
whenever this command is executed the system reads the element required
from the database and stores it into a program variable called X.
ii) write_tr(X): This writes the value of the program variable currently
stored in X into a database item called X.
Once read_tr(X) is encountered, the system will have to perform the
following operations.
1. Find the address of the block on the disk where X is stored.
2. Copy that block into a buffer in the memory.
3. Copy it into a variable (of the program) called X.
A write_tr(X) performs the converse sequence of operations.
1. Find the address of the disk block where the database variable X is stored.
2. Copy the block into a buffer in the memory.
3. Copy the value of X from the program variable to this X.
4. Store this updated block back to the disk.
Normally however, the operation (4) is not performed every time a write –tr is
executed. It would be a wasteful operation to keep writing back to the disk every
time. So the system maintains one/more buffers in the memory which keep getting
updated during the operations and this updated buffer is moved on to the disk at
regular intervals. This would save a lot of computational time, but is at the heart of
some of the problems of concurrency that we will have to encounter.
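The buffered behaviour described above can be sketched in a few lines. This is a minimal illustration, not an actual DBMS implementation; the class name `BufferedStore` and its methods are assumptions of this sketch.

```python
# Sketch of buffered read_tr/write_tr: blocks are fetched from "disk" into an
# in-memory buffer, updates go to the buffer only, and dirty items are written
# back to disk at intervals (here, explicitly via flush()).

class BufferedStore:
    def __init__(self, disk):
        self.disk = dict(disk)   # item -> value, stands in for disk blocks
        self.buffer = {}         # in-memory copies of fetched items
        self.dirty = set()       # items updated but not yet written to disk

    def read_tr(self, x):
        # Steps 1-3 of read: locate the item, copy it into the buffer, return it
        if x not in self.buffer:
            self.buffer[x] = self.disk[x]
        return self.buffer[x]

    def write_tr(self, x, value):
        # Steps 1-3 of write: update only the in-memory copy;
        # step 4 (writing back to disk) is deferred to flush()
        self.buffer[x] = value
        self.dirty.add(x)

    def flush(self):
        # Performed at regular intervals, not on every write_tr
        for x in self.dirty:
            self.disk[x] = self.buffer[x]
        self.dirty.clear()

store = BufferedStore({"X": 10})
v = store.read_tr("X")
store.write_tr("X", v - 2)
assert store.disk["X"] == 10   # disk still holds the old value before flush
store.flush()
assert store.disk["X"] == 8    # updated value reaches the disk only now
```

The gap between the buffer update and the flush is exactly the window in which the concurrency problems discussed next can arise.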
3.1.2 The need for concurrency control
Let us visualize a situation wherein a large number of users (probably spread
over vast geographical areas) are operating on a concurrent system. Several problems
can occur if they are allowed to execute their transaction operations in an
uncontrolled manner.
Consider a simple example of a railway reservation system. Since a number
of people are accessing the database simultaneously, it is obvious that multiple copies
of the transactions are to be provided so that each user can go ahead with his
operations. Let us make the concept a little more specific. Suppose we are
considering the number of reservations in a particular train of a particular date. Two
persons at two different places are trying to reserve for this train. By the very
definition of concurrency, each of them should be able to perform the operations
irrespective of the fact that the other person is also doing the same. In fact they will
not even know that the other person is also booking for the same train. The only way
of ensuring the same is to make available to each of these users their own copies to
operate upon and finally update the master database at the end of their operation.
Now suppose there are 10 seats available. Both the persons, say A and B,
want to get this information and book their seats. Since they are to be accommodated
concurrently, the system provides them two copies of the data. The simple way is to
perform a Read_tr(X) so that the value of X is copied on to the variable X of person
A (let us call it XA) and of person B (XB). So each of them knows that there are 10
seats available.
Suppose A wants to book 8 seats. Since the number of seats he wants is (say
Y) less than the available seats, the program can allot him the seats, change the
number of available seats (X) to X-Y and can even give him the seat numbers that
have been booked for him.
The problem is that a similar operation can be performed by B also. Suppose
he needs 7 seats. So, he gets his seven seats, replaces the value of X to 3 (10 – 7) and
gets his reservation.
The problem is noticed only when these blocks are returned to main database
(the disk in the above case).
Before we can analyse these problems, we look at the problem from a more
technical view.
1 The lost update problem: This problem occurs when two transactions that access
the same database items have their operations interleaved in such a way as to make
the value of some database item incorrect. Suppose the transactions T1 and T2 are
submitted at (approximately) the same time. Because of the concept of interleaving,
each operation is executed for some period of time and then the control is passed on to
the other transaction, and this sequence continues. Because of the delay in the updates,
this creates a problem. This was what happened in the previous example. Let the
transactions be called TA and TB.
TA                              TB
Read_tr(X)
                                Read_tr(X)          Time
X = X - NA
                                X = X - NB
Write_tr(X)
                                Write_tr(X)

Fig 1, Fig 2
Note that the problem occurred because the transaction TB read X before TA
had recorded its update, and since TB did the writing later on, the update of TA was
overwritten and lost.
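The lost update interleaving above can be traced in a few lines. This is an illustrative sketch following the reservation example; the variable names xa and xb stand for the private copies XA and XB.

```python
# Simulation of the lost update problem: TA and TB each copy X into their own
# variable, subtract their bookings, and write back. TB's later write
# overwrites TA's, so TA's update is lost.

db = {"X": 10}          # 10 seats available

xa = db["X"]            # TA: Read_tr(X)
xb = db["X"]            # TB: Read_tr(X), interleaved before TA writes

xa = xa - 8             # TA books NA = 8 seats
xb = xb - 7             # TB books NB = 7 seats

db["X"] = xa            # TA: Write_tr(X) -> X becomes 2
db["X"] = xb            # TB: Write_tr(X) -> X becomes 3, TA's update is lost

assert db["X"] == 3     # 15 seats were booked, yet the database says 3 remain
```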
2 The temporary update (Dirty read) problem
This happens when a transaction TA updates a data item, but later on (for some
reason) the transaction fails. It could be due to a system failure or any other
operational reason. Or the system may have later on noticed that the operation should
not have been done and cancels it. To be fair, it also ensures that the original value is
restored.
But in the meanwhile, another transaction TB has accessed the data and since it
has no indication as to what happened later on, it makes use of this data and goes
ahead. Once the original value is restored by TA, the values generated by TB are
obviously invalid.
TA                              TB
Read_tr(X)                                          Time
X = X - N
Write_tr(X)
                                Read_tr(X)
                                X = X - N
                                Write_tr(X)
Failure
X = X + N
Write_tr(X)

Fig 3
The value generated by TA, belonging to a transaction that did not survive, is
“dirty data”; when it is read by TB, it produces an invalid result. Hence the
problem is called the dirty read problem.
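The dirty read scenario of Fig 3 can be traced the same way. An illustrative sketch, not from the text:

```python
# Simulation of the dirty read problem: TA writes X, TB reads the uncommitted
# ("dirty") value and acts on it, then TA fails and restores X.

db = {"X": 10}
N = 4

db["X"] = db["X"] - N           # TA: X = X - N; Write_tr(X), not yet committed

dirty_value = db["X"]           # TB: Read_tr(X) sees the dirty value 6

db["X"] = db["X"] + N           # TA fails; the system restores the old value

assert db["X"] == 10            # the database is back to the original value
assert dirty_value == 6         # but TB has already used the invalid value
```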
3 The Incorrect Summary Problem: Consider two concurrent transactions, again
called TA and TB. TB is calculating a summary (average, standard deviation or some
such operation) by accessing all elements of a database (note that it is not updating
any of them; it only reads them and uses the resultant data to calculate some
values). In the meanwhile, TA is updating these values. In this case, since the
operations are interleaved, TB, for some of its operations, will be using the
not-yet-updated data, whereas for the other operations it will be using the updated
data. This is called the incorrect summary problem.
TA                              TB
                                Sum = 0
                                Read_tr(A)
                                Sum = Sum + A
Read_tr(X)
X = X - N
Write_tr(X)
                                Read_tr(X)
                                Sum = Sum + X
                                Read_tr(Y)
                                Sum = Sum + Y
Read_tr(Y)
Y = Y - N
Write_tr(Y)

Fig 4

In the above example, TA updates both X and Y. But since it first updates X and
then Y, and the operations are so interleaved that the transaction TB uses both of them
in between the operations, TB ends up using the old value of Y with the new value of
X. In the process, the sum we get refers neither to the old set of values nor to the new
set of values.
4 Unrepeatable read: This can happen when an item is read by a transaction twice,
(in quick succession) but the item has been changed in the meanwhile, though the
transaction has no reason to expect such a change. Consider the case of a reservation
system, where a passenger gets a reservation detail and before he decides on the
aspect of reservation the value is updated at the request of some other passenger at
another place.
3.1.4 The concept of failures and recovery
No database operation can be immune to failures of the system on which it operates
(both the hardware and the software, including the operating system). The system
should ensure that any transaction submitted to it is terminated in one of the following
ways.
a) All the operations listed in the transaction are completed, the
changes are recorded permanently back to the database and the
database is informed that the operations are complete.
b) In case the transaction has failed to achieve it’s desired objective,
the system should ensure that no change, whatsoever, is reflected
onto the database. Any intermediate changes made to the database
are restored to their original values, before calling off the
transaction and intimating the same to the database.
In the second case, we say the system should be able to “Recover” from the
failure. Failures can occur in a variety of ways.
i) A System Crash: A hardware, software or network error can make the
completion of the transaction an impossibility.
ii) A transaction or system error: The transaction submitted may be faulty
– like creating a situation of division by zero or creating negative
numbers which cannot be handled (for example, in a reservation
system, a negative number of seats conveys no meaning). In such cases,
the system simply discontinues the transaction by reporting an error.
iii) Some programs provide for the user to interrupt during execution. If
the user changes his mind during execution, (but before the
transactions are complete) he may opt out of the operation.
iv) Local exceptions: Certain conditions during operation may force the
system to raise what are known as “exceptions”. For example, a bank
account holder may not have sufficient balance for some transaction to
be done or special instructions might have been given in a bank
transaction that prevents further continuation of the process. In all
such cases, the transactions are terminated.
v) Concurrency control enforcement: In certain cases when concurrency
constraints are violated, the enforcement regime simply aborts the
process to restart it later.
The other reasons can be physical problems like theft, fire etc or system
problems like disk failure, viruses etc. In all such cases of failure, a recovery
mechanism is to be in place.
3.2 Transaction States and additional operations
Though the Read_tr and Write_tr operations described above are the most
fundamental operations, they are seldom sufficient. Though most operations on
databases comprise only the read and write operations, the system needs several
additional operations for its purposes. One simple example is the concept of
recovery discussed in the previous section. If the system were to recover from a crash
or any other catastrophe, it should first be able to keep track of the transactions –
when they start, when they terminate or when they abort. Hence the following
operations come into picture.
i) Begin Trans: This marks the beginning of an execution process.
ii) End trans: This marks the end of an execution process.
iii) Commit trans: This indicates that transaction is successful and the
changes brought about by the transaction may be incorporated onto the
database and will not be undone at a later date.
iv) Rollback: Indicates that the transaction is unsuccessful (for whatever
reason) and the changes made to the database, if any, by the transaction
need to be undone.
Most systems also keep track of the present status of all the transactions at the present
instant of time (Note that in a real multiprogramming environment, more than one
transaction may be in various stages of execution). The system should not only be
able to keep a tag on the present status of the transactions, but also should know what
the next possibilities for the transaction to proceed are and, in case of a failure, how to
roll it back. The whole concept takes the form of a state transition diagram. A simple
state transition diagram, in view of what we have seen so far, can appear as follows:
Begin Transaction --> Active (Read/Write operations repeat in this state)
Active -- End Transaction --> Partially committed
Partially committed -- Commit --> Committed -- Terminate --> Terminated
Active / Partially committed -- Failure (Abort) --> Failed -- Terminate --> Terminated

Fig 5
The arrow marks indicate how a state of a transaction can change to a next
state. A transaction is in an active state immediately after the beginning of execution.
Then it will be performing the read and write operations. At this state, the system
protocols begin ensuring that a system failure at this juncture does not make
erroneous recordings on to the database. Once this is done, the system “Commits”
itself to the results and thus enters the “Committed state”. Once in the committed
state, a transaction automatically proceeds to the terminated state.
The transaction may also fail due to a variety of reasons discussed in a
previous section. Once it fails, the system may have to take up error control exercises
like rolling back the effects of the previous write operations of the transaction. Once
this is completed, the transaction enters the terminated state to pass out of the system.
A failed transaction may be restarted later – either by the intervention of the
user or automatically.
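The transitions of Fig 5 can be sketched as a small table-driven state machine. The state and event names below are assumptions of this illustration, chosen to mirror the figure.

```python
# Transition table for the transaction states of Fig 5: a transaction begins
# in the active state; End moves it to partially committed, Commit to
# committed, and Terminate out of the system. A failure from the active or
# partially committed states leads to the failed state, then to terminated.

TRANSITIONS = {
    ("active", "read/write"): "active",
    ("active", "end"): "partially committed",
    ("active", "fail"): "failed",
    ("partially committed", "commit"): "committed",
    ("partially committed", "fail"): "failed",
    ("committed", "terminate"): "terminated",
    ("failed", "terminate"): "terminated",
}

def run(events):
    state = "active"   # a transaction is active immediately after Begin
    for e in events:
        state = TRANSITIONS[(state, e)]
    return state

assert run(["read/write", "end", "commit", "terminate"]) == "terminated"
assert run(["read/write", "fail", "terminate"]) == "terminated"
```

Any event pair absent from the table (say, committing a failed transaction) raises a KeyError, which matches the idea that only the arrows in the diagram are legal moves.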
The concept of system log:
To be able to recover from failures of the transaction operations the
system needs to essentially maintain a track record of all transaction operations that
are taking place and that are likely to affect the status of the database. This
information is called a “System log” (Similar to the concept of log books) and may
become useful when the system is trying to recover from failures. The log
information is kept on the disk, such that it is not likely to be affected by the normal
system crashes, power failures etc. (Otherwise, when the system crashes, if the disk
also crashes, then the entire concept fails). The log is also periodically backed up into
removable devices (like tape) and is kept in archives.
The question is, what type of data or information needs to be logged into the
system log?
Let T refer to a unique transaction – id, generated automatically whenever a
new transaction is encountered and this can be used to uniquely identify the
transaction. Then the following entries are made with respect to the transaction T.
i) [Start-Trans, T] : Denotes that T has started execution.
ii) [Write-tr, T, X, old, new]: denotes that the transaction T has changed the
old value of the data X to a new value.
iii) [read_tr, T, X] : denotes that the transaction T has read the value of the X
from the database.
iv) [Commit, T] : denotes that T has been executed successfully and confirms
that effects can be permanently committed to the database.
v) [abort, T] : denotes that T has been aborted.
These entries are not complete. In some cases certain modifications to their purpose
and format are made to suit special needs.
(Note that though we have been talking that the logs are primarily useful for recovery
from errors, they are almost universally used for other purposes like reporting,
auditing etc).
The two commonly used operations are “undo” and “redo”. In undo, if the
transaction fails before the permanent data can be written back into the database, the
log details can be used to sequentially trace back the updates and return the items to
their old values. Similarly, if the transaction fails just before the commit operation is
complete, one need not report a transaction failure. One can use the old and new values
of all write operations in the log and ensure that the same are entered into the database.
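Assuming log records of the form [write_tr, T, X, old, new] as above, undo and redo can be sketched as follows. This is an illustration, not a production recovery manager; the tuple encoding of log records is an assumption.

```python
# Undo walks the log backwards restoring old values; redo walks it forwards
# re-applying new values. Each write record carries (item, old, new).

log = [
    ("start", "T1"),
    ("write", "T1", "X", 10, 6),   # old value 10, new value 6
    ("write", "T1", "Y", 20, 24),
]

def undo(db, log, t):
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] == t:
            _, _, item, old, _new = rec
            db[item] = old             # restore the old value

def redo(db, log, t):
    for rec in log:
        if rec[0] == "write" and rec[1] == t:
            _, _, item, _old, new = rec
            db[item] = new             # re-apply the new value

db = {"X": 6, "Y": 24}
undo(db, log, "T1")
assert db == {"X": 10, "Y": 20}
redo(db, log, "T1")
assert db == {"X": 6, "Y": 24}
```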
Commit Point of a Transaction:
The next question to be tackled is when should one commit to the results of a
transaction? Note that unless a transaction is committed, its operations do not get
reflected in the database. We say a transaction reaches a “Commit point” when all
operations that access the database have been successfully executed and the effects of
all such transactions have been included in the log. Once a transaction T reaches a
commit point, the transaction is said to be committed – i.e. the changes that the
transaction had sought to make in the database are assumed to have been recorded
into the database. The transaction indicates this state by writing a [commit, T] record
into its log. At this point, the log contains a complete sequence of changes brought
about by the transaction to the database and has the capacity to both undo it (in case
of a crash) or redo it (if a doubt arises as to whether the modifications have actually
been recorded onto the database).
Before we close this discussion on logs, one small clarification. The records
of the log are on the disk (secondary memory). When a log record is to be written, a
secondary device access is to be made, which slows down the system operations. So
normally a copy of the most recent log records is kept in the memory and the
updates are made there. At regular intervals, these are copied back to the disk. In
case of a system crash, only those records that have been written onto the disk will
survive. Thus, when a transaction reaches commit stage, all records must be
forcefully written back to the disk and then commit is to be executed. This concept is
called ‘forceful writing’ of the log file.
3.3 Desirable Transaction Properties (ACID properties)
For effective and smooth database operations, transactions should possess
several properties. These properties are Atomicity, Consistency preservation,
Isolation and Durability. Often, by combining their first letters, they are called ACID
properties.
i) Atomicity: A transaction is an atomic unit of processing, i.e. it cannot be
broken down further into a combination of transactions. Looking at it
another way, a given transaction will either get executed fully or is not
performed at all. There cannot be a possibility of a transaction getting
partially executed.
ii) Consistency preservation: A transaction is said to be consistency
preserving if its complete execution takes the database from one
consistent state to another.
We shall elaborate slightly on this. In a steady state, a database is expected to be
consistent, i.e. there are no anomalies in the values of the items. For example, if a
database stores N values and also their sum, the database is said to be consistent if the
addition of these N values actually leads to the value of the sum. This will be the
normal case.
Now consider the situation when a few of these N values are being changed.
Immediately after one/more values are changed, the database becomes inconsistent.
The sum value no longer corresponds to the actual sum. Only after all the updates
are done and the new sum is calculated does the system become consistent again.
A transaction should always ensure that once it starts operating on a database,
its values are made consistent before the transaction ends.
iii) Isolation: Every transaction should appear as if it is being executed in
isolation. Though, in a practical sense, a large number of such transactions
keep executing concurrently no transaction should get affected by the
operation of other transactions. Then only is it possible to operate on the
transaction accurately.
iv) Durability: The changes effected to the database by the transaction should
be permanent – they should not vanish once the transaction is removed.
These changes should also not be lost due to any other failures at later stages.
Now how does one enforce these desirable properties on the transactions? The
atomicity concept is taken care of while designing and implementing the transaction.
If, however, a transaction fails even before it can complete its assigned task, the
recovery software should be able to undo the partial effects inflicted by the
transaction onto the database.
The preservation of consistency is normally considered as the duty of the
database programmer. A “consistent state” of a database is that state which satisfies
the constraints specified by the schema. Other external constraints may also be
included to make the rules more effective. The database programmer writes his
programs in such a way that a transaction enters a database only when it is in a
consistent state and also leaves the state in the same or any other consistent state.
This, of course implies that no other transaction “interferes” with the action of the
transaction in question.
This leads us to the next concept of isolation, i.e. every transaction goes about
doing its job without being bogged down by any other transaction which may also
be working on the same database. One simple mechanism to ensure this is to make
sure that no transaction makes its partial updates available to the other transactions
until the commit state is reached. This also eliminates the temporary update problem.
However, this has been found to be inadequate to take care of several other problems.
Most database transactions today come with several levels of isolation. A transaction
is said to have level zero (0) isolation if it does not overwrite the dirty reads of
higher level transactions (level zero is the lowest level of isolation). A transaction is
said to have level 1 isolation if it does not lose any updates. At level 2, the
transaction neither loses updates nor has any dirty reads. At level 3, the highest level
of isolation, a transaction does not lose updates, has no dirty reads and, in addition,
has repeatable reads.
3.4 The Concept of Schedules
When transactions are executing concurrently in an interleaved fashion, not
only does the action of each transaction become important, but also the order of
execution of operations from each of these transactions. As an example, in some of
the problems that we have discussed earlier in this section, the problem may get itself
converted into some other form (or may even vanish) if the order of operations becomes
different. Hence, for analyzing any problem, it is not just the history of previous
transactions that one should be worrying about, but also the “schedule” of operations.
Schedule (History of transaction):
We formally define a schedule S of n transactions T1, T2 … Tn as an ordering of
the operations of the transactions, subject to the constraint that, for each transaction Ti
that participates in S, the operations of Ti must appear in the same order in which they
appear in Ti. I.e. if two operations Ti1 and Ti2 are listed in Ti such that Ti1 is earlier than
Ti2, then in the schedule also Ti1 should appear before Ti2. However, if Ti2 appears
immediately after Ti1 in Ti, the same may not be true in S, because some other
operation Tj1 (of a transaction Tj) may be interleaved between them. In short, a
schedule lists the sequence of operations on the database in the same order in which it
was effected in the first place.
For the recovery and concurrency control operations, we concentrate mainly on the
Read_tr and Write_tr operations, because these operations actually effect changes to the
database. The other two (equally) important operations are commit and abort, since
they decide when the changes effected have actually become active on the database.
Since listing each of these operations becomes a lengthy process, we adopt a notation
for describing the schedule. The operations Read_tr, Write_tr, commit and abort we
indicate by r, w, c and a, and each of them comes with a subscript to indicate the
transaction number.
For example, SA : r1(x); r2(y); w2(y); r1(y); w1(x); a1
indicates the following operations in that order:
Read_tr(x)     transaction 1
Read_tr(y)     transaction 2
Write_tr(y)    transaction 2
Read_tr(y)     transaction 1
Write_tr(x)    transaction 1
Abort          transaction 1
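This shorthand can be parsed mechanically. The sketch below is illustrative; the tuple representation (op, txn, item) is an assumption of this example, not part of the notation itself.

```python
# Parse the shorthand schedule notation: r/w/c/a, a transaction-number
# subscript, and an optional item in parentheses,
# e.g. "r1(x); r2(y); w2(y); r1(y); w1(x); a1".

import re

def parse_schedule(s):
    ops = []
    for token in s.split(";"):
        m = re.match(r"([rwca])(\d+)(?:\((\w+)\))?$", token.strip(),
                     re.IGNORECASE)
        # op letter lowercased; commit/abort entries carry item = None
        ops.append((m.group(1).lower(), int(m.group(2)), m.group(3)))
    return ops

sa = parse_schedule("r1(x); r2(y); w2(y); r1(y); w1(x); a1")
assert sa[0] == ("r", 1, "x")
assert sa[-1] == ("a", 1, None)
```

The same tuple form is convenient for the conflict and serializability checks that follow.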
Conflicting operations: Two operations in a schedule are said to be in conflict if they
satisfy these conditions:
i) The operations belong to different transactions
ii) They access the same item X
iii) At least one of the operations is a write operation.
For example: r1(x); w2(x)
             w1(x); r2(x)
             w1(y); w2(y)
conflict, because each pair operates on the same item and at least one of the two
operations is a write. But r1(x); w2(y) and r1(x); r2(x) do not conflict, because in the
first case the read and write are on different data items, and in the second case both
are trying to read the same data item, which they can do without any conflict.
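The three conflict conditions translate directly into code. Again an illustrative sketch using the assumed (op, txn, item) tuples:

```python
# Two operations conflict iff they belong to different transactions, access
# the same item, and at least one of them is a write.

def conflicts(op1, op2):
    return (op1[1] != op2[1]                        # different transactions
            and op1[2] == op2[2]                    # same item
            and "w" in (op1[0], op2[0]))            # at least one write

assert conflicts(("r", 1, "x"), ("w", 2, "x"))      # r1(x), w2(x) conflict
assert conflicts(("w", 1, "y"), ("w", 2, "y"))      # w1(y), w2(y) conflict
assert not conflicts(("r", 1, "x"), ("w", 2, "y"))  # different items
assert not conflicts(("r", 1, "x"), ("r", 2, "x"))  # both are reads
```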
A Complete Schedule: A schedule S of n transactions T1, T2…….. Tn is said to be a
“Complete Schedule” if the following conditions are satisfied.
i) The operations listed in S are exactly the same operations as in T1, T2 ……
Tn, including the commit or abort operations. Each transaction is
terminated by either a commit or an abort operation.
ii) The operations in any transaction Ti appear in the schedule in the same
order in which they appear in the transaction.
iii) Whenever there are conflicting operations, one of the two will occur before
the other in the schedule.
A “Partial order” of the schedule is said to occur, if the first two conditions of the
complete schedule are satisfied, but whenever there are non conflicting operations in
the schedule, they can occur without indicating which should appear first.
This can happen because non conflicting operations any way can be executed in any
order without affecting the actual outcome.
However, in a practical situation, it is very difficult to come across complete
schedules. This is because new transactions keep getting included into the schedule.
Hence, often one works with a “committed projection” C(S) of a schedule S. This set
includes only those operations in S that belong to committed transactions, i.e.
transactions Ti whose commit operation ci is in S.
Put in simpler terms, since uncommitted operations do not get reflected in the actual
outcome of the schedule, only those transactions that have completed their commit
operations contribute to the set, and this schedule is good enough in most cases.
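The committed projection C(S) is a one-line filter over the assumed (op, txn, item) tuples; a sketch:

```python
# C(S): keep only the operations of transactions whose commit appears in S.

def committed_projection(schedule):
    committed = {txn for (op, txn, _item) in schedule if op == "c"}
    return [o for o in schedule if o[1] in committed]

s = [("r", 1, "x"), ("r", 2, "x"), ("w", 2, "x"),
     ("c", 2, None), ("w", 1, "x")]        # T1 has not committed in S
assert committed_projection(s) == [("r", 2, "x"), ("w", 2, "x"), ("c", 2, None)]
```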
3.5 Schedules and Recoverability :
Recoverability is the ability to recover from transaction failures. The success
or otherwise of recoverability depends on the schedule of transactions. If fairly
straightforward operations without much interleaving of transactions are involved,
error recovery is a straightforward process. On the other hand, if a lot of interleaving
of different transactions have taken place, then recovering from the failure of any one
of these transactions could be an involved affair. In certain cases, it may not be
possible to recover at all. Thus, it would be desirable to characterize the schedules
based on their recovery capabilities.
To do this, we observe certain features of the recoverability and also of
schedules. To begin with, we note that any recovery process, most often involves a
“roll back” operation, wherein the operations of the failed transaction will have to be
undone. However, we also note that the roll back needs to be done only as long as the
transaction T has not committed. If the transaction T has committed once, it need not
be rolled back. The schedules that satisfy this criterion are called “recoverable
schedules” and those that do not, are called “non-recoverable schedules”. As a rule,
such non-recoverable schedules should not be permitted.
Formally, a schedule S is recoverable if no transaction T which appears in S
commits until all transactions T' that have written an item which is read by T have
committed.
The concept is a simple one. Suppose the transaction T reads an item X from
the database, completes its operations (based on this and other values) and commits
the values. I.e. the output values of T become permanent values of database.
But suppose this value X was written by another transaction T' (before it was read
by T), and T' aborts after T has committed. What happens? The values committed by T
are no more valid, because the basis of these values (namely X) itself has been
changed. Obviously T also needs to be rolled back (if possible), leading to other
rollbacks and so on.
The other aspect to note is that in a recoverable schedule, no committed
transaction needs to be rolled back. But, it is possible that a cascading roll back
scheme may have to be effected, in which an uncommitted transaction has to be rolled
back, because it read from a value contributed by a transaction which later
aborted. But such cascading rollbacks can be very time consuming because at any
instant of time, a large number of uncommitted transactions may be operating. Thus,
it is desirable to have “cascadeless” schedules, which avoid cascading rollbacks.
This can be ensured by ensuring that transactions read only those values which
are written by committed transactions i.e. there is no fear of any aborted or failed
transactions later on. If the schedule has a sequence wherein a transaction T1 has to
read a value X written by an uncommitted transaction T2, then the sequence is altered
so that the reading is postponed till T2 either commits or aborts.
This delays T1, but avoids any possibility of cascading rollbacks.
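The recoverability criterion can be checked mechanically. The sketch below assumes “T reads from T'” means T reads an item whose last writer was T', and uses the (op, txn, item) tuples of the earlier illustrations.

```python
# A schedule is recoverable if no transaction commits before every
# transaction it read from has committed.

def is_recoverable(schedule):
    last_writer = {}          # item -> transaction that last wrote it
    reads_from = {}           # txn -> set of transactions it read from
    committed = set()
    for op, txn, item in schedule:
        if op == "r" and item in last_writer and last_writer[item] != txn:
            reads_from.setdefault(txn, set()).add(last_writer[item])
        elif op == "w":
            last_writer[item] = txn
        elif op == "c":
            if not reads_from.get(txn, set()) <= committed:
                return False  # commits before a transaction it read from
            committed.add(txn)
    return True

# T2 reads X written by T1 and commits only after T1 commits: recoverable
ok = [("w", 1, "x"), ("r", 2, "x"), ("c", 1, None), ("c", 2, None)]
# T2 commits before T1 (which wrote the X it read): not recoverable
bad = [("w", 1, "x"), ("r", 2, "x"), ("c", 2, None), ("c", 1, None)]
assert is_recoverable(ok)
assert not is_recoverable(bad)
```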
The third type of schedule is a “strict schedule”, which, as the name suggests, is highly
restrictive in nature. Here, transactions are allowed neither to read nor write an item X
until the last transaction that wrote X has committed or aborted. Note that strict
schedules largely simplify the recovery process, but in many cases it may not be
possible to devise strict schedules.
It may be noted that recoverable schedules, cascadeless schedules and strict
schedules are each more stringent than their predecessor. Greater stringency facilitates
the recovery process, but sometimes the process may get delayed or may even become
impossible to schedule.
3.6 Serializability
Given two transactions T1 and T2 to be scheduled, they can be scheduled in
a number of ways. The simplest way is to schedule them without bothering
about interleaving, i.e. schedule all operations of the transaction T1 followed by
all operations of T2, or alternatively schedule all operations of T2 followed by all
operations of T1.
T1                              T2
read_tr(X)
X = X + N
write_tr(X)                                         Time
read_tr(Y)
Y = Y + N
write_tr(Y)
                                read_tr(X)
                                X = X + P
                                write_tr(X)

Fig 6 Non-interleaved (Serial Schedule): A
T1                              T2
                                read_tr(X)
                                X = X + P
                                write_tr(X)
read_tr(X)
X = X + N
write_tr(X)
read_tr(Y)
Y = Y + N
write_tr(Y)

Fig 7 Non-interleaved (Serial Schedule): B
These now can be termed as serial schedules, since the entire sequence of operation in
one transaction is completed before the next sequence of transactions is started.
In the interleaved mode, the operations of T1 are mixed with the operations of T2.
This can be done in a number of ways. Two such sequences are given below:

T1                              T2
read_tr(X)
X = X + N
                                read_tr(X)
                                X = X + P
write_tr(X)
read_tr(Y)
                                write_tr(X)
Y = Y + N
write_tr(Y)

Fig 8 Interleaved (Non-serial Schedule): C
T1                              T2
read_tr(X)
X = X + N
write_tr(X)
                                read_tr(X)
                                X = X + P
                                write_tr(X)
read_tr(Y)
Y = Y + N
write_tr(Y)

Fig 9 Interleaved (Non-serial Schedule): D
Formally, a schedule S is serial if, for every transaction T in the schedule, all
operations of T are executed consecutively; otherwise it is called non-serial. In such a
non-interleaved schedule, if the transactions are independent, one can also presume
that the schedule will be correct, since each transaction commits or aborts before the
next transaction begins. As long as the transactions individually are error free, such a
sequence of events is guaranteed to give correct results.
The problem with such a situation is the wastage of resources. If in a serial
schedule, one of the transactions is waiting for an I/O, the other transactions also
cannot use the system resources and hence the entire arrangement is wasteful of
resources. If some transaction T is very long, the other transactions will have to keep
waiting till it is completed. Moreover, in an environment wherein hundreds of
machines operate concurrently, serial scheduling becomes unthinkable. Hence, in
general, the serial scheduling concept is unacceptable in practice.
However, once the operations are interleaved so that the above cited problems
are overcome, unless the interleaving sequence is well thought out, all the problems
that we encountered at the beginning of this block can reappear. Hence, a
methodology is to be adopted to find out which of the interleaved schedules give
correct results and which do not.
A schedule S of n transactions is “serialisable” if it is equivalent to some
serial schedule of the same n transactions. Note that there are n! different serial
schedules possible to be made out of n transactions. If one goes about interleaving
them, the number of possible combinations becomes unmanageably high. To ease our
operations, we form two disjoint groups of non-serial schedules: those non-serial
schedules that are equivalent to one or more serial schedules, which we call
“serialisable schedules”, and those that are not equivalent to any serial schedule and
hence are not serialisable. Once a non-serial schedule is serialisable, it becomes
equivalent to a serial schedule and, by our previous definition of serial schedules, will
become a “correct” schedule. But now, how can one prove the equivalence of a
non-serial schedule to a serial schedule?
The simplest and most obvious method to conclude that two such
schedules are equivalent is to compare their results. If they produce the same results,
they can be considered equivalent, i.e. if two schedules are “result equivalent”,
then they can be considered equivalent. But such an oversimplification is full of
problems. Two sequences may produce the same results for one or even a large
number of initial values, but still may not be equivalent. Consider the following two
sequences:
S1                   S2
read_tr(X)           read_tr(X)
X = X + X            X = X * X
write_tr(X)          write_tr(X)
fig 10
For a value X = 2, both produce the same result. Can we conclude that they are
equivalent? Though this may look like a simplistic example, with some imagination
one can always come up with more sophisticated examples wherein the “bugs” of
treating them as equivalent are less obvious. But the concept still holds: result
equivalence cannot mean schedule equivalence. A more refined method of finding
equivalence is available. It is called “conflict equivalence”. Two schedules are said
to be conflict equivalent if the order of any two conflicting operations is the same in
both schedules. (Note that conflicting operations belong to two different transactions,
access the same data item, and at least one of them is a write_tr(X) operation.) If two
such conflicting operations appear in different orders in different schedules, then it is
obvious that they produce two different databases in the end and hence the schedules
are not equivalent.
1. Testing for conflict serializability of a schedule:
We suggest an algorithm that tests a schedule for conflict serializability.
1. For each transaction Ti participating in the schedule S, create a node
labeled Ti in the precedence graph.
2. For each case where Tj executes a read_tr(X) after Ti executes a write_tr(X),
create an edge from Ti to Tj in the precedence graph.
3. For each case where Tj executes a write_tr(X) after Ti executes a read_tr(X),
create an edge from Ti to Tj in the graph.
4. For each case where Tj executes a write_tr(X) after Ti executes a
write_tr(X), create an edge from Ti to Tj in the graph.
5. The schedule S is serialisable if and only if there are no cycles in the
graph.
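The algorithm above can be sketched in code. The schedule encoding below (a list of (transaction, operation, item) triples) and the function name are illustrative assumptions, not notation from the text.

```python
from collections import defaultdict

def is_conflict_serializable(schedule):
    """schedule: list of (txn, op, item) triples, op in {'r', 'w'}."""
    # Steps 1-4: build the precedence graph from conflicting operations.
    edges = defaultdict(set)
    txns = {t for t, _, _ in schedule}
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if ti != tj and x == y and 'w' in (op_i, op_j):
                edges[ti].add(tj)          # Ti must precede Tj
    # Step 5: serialisable iff the graph has no cycle (DFS colouring).
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in txns}
    def has_cycle(t):
        colour[t] = GREY
        for u in edges[t]:
            if colour[u] == GREY or (colour[u] == WHITE and has_cycle(u)):
                return True
        colour[t] = BLACK
        return False
    return not any(colour[t] == WHITE and has_cycle(t) for t in txns)

# T1 -> T2 (conflict on X) and T2 -> T1 (conflict on Y): a cycle,
# hence this interleaved schedule is not serialisable.
s = [('T1', 'r', 'X'), ('T2', 'w', 'X'), ('T2', 'r', 'Y'), ('T1', 'w', 'Y')]
print(is_conflict_serializable(s))   # False
```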
If we apply these methods to write the precedence graphs for the four cases of
section 1.8, we get the following precedence graphs.
[Fig 11: precedence graphs for Schedules A, B, C and D, each with nodes T1 and T2
and edges labelled by the conflicting item X]
We may conclude that schedule D is equivalent to schedule A.
2. View equivalence and view serializability:
Apart from the conflict equivalence of schedules and conflict serializability, another
restrictive equivalence definition has been used with reasonable success in the context
of serializability. This is called view serializability.
Two schedules S and S1 are said to be “view equivalent” if the following conditions
are satisfied.
i) The same set of transactions participates in S and S1 and S and S1
include the same operations of those transactions.
ii) For any operation ri(X) of Ti in S, if the value of X read by the
operation has been written by an operation wj(X) of Tj (or if it is the
original value of X before the schedule started), the same condition
must hold for the value of X read by operation ri(X) of Ti in S1.
iii) If the operation wk(Y) of Tk is the last operation to write the item Y in
S, then wk(Y) of Tk must also be the last operation to write the item Y
in S1.
The idea behind view equivalence is that as long as each read operation of a
transaction reads the result of the same write operation in both schedules, the write
operations of each transaction must produce the same results. Hence, the read
operations are said to see the same view in both schedules. It can easily be
verified that when S or S1 operate independently on a database with the same initial
state, they produce the same end states. A schedule S is said to be view serializable
if it is view equivalent to a serial schedule.
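The three conditions can be checked mechanically for schedules written as (transaction, operation, item) triples; this encoding and the helper names are assumptions for illustration.

```python
def reads_from(schedule):
    """For each read, record which transaction's write it reads
    ('initial' = the value before the schedule started)."""
    last_writer, result = {}, []
    for t, op, x in schedule:
        if op == 'r':
            result.append((t, x, last_writer.get(x, 'initial')))
        else:
            last_writer[x] = t
    return sorted(result)

def final_writes(schedule):
    """Condition (iii): the last transaction to write each item."""
    return {x: t for t, op, x in schedule if op == 'w'}

def view_equivalent(s1, s2):
    return (sorted(s1) == sorted(s2) and           # (i) same operations
            reads_from(s1) == reads_from(s2) and   # (ii) same reads-from
            final_writes(s1) == final_writes(s2))  # (iii) same final writes

serial = [('T1', 'r', 'X'), ('T1', 'w', 'X'), ('T2', 'r', 'X'), ('T2', 'w', 'X')]
inter  = [('T1', 'r', 'X'), ('T2', 'r', 'X'), ('T1', 'w', 'X'), ('T2', 'w', 'X')]
print(view_equivalent(serial, inter))   # False: T2 reads a different write
```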
It can also be verified that the definitions of conflict serializability and view
serializability are similar if a condition of “constrained write assumption” holds on
all transactions of the schedules. This condition states that any write operation wi(X)
in Ti is preceded by a ri(X) in Ti and that the value written by wi(X) in Ti depends
only on the value of X read by ri(X). This assumes that computation of the new value
of X is a function f(X) based on the old value of X read from the database. However,
the definition of view serializability is less restrictive than that of conflict
serializability under the “unconstrained write assumption”, where the value written by
the operation wi(X) in Ti can be independent of its old value in the database. Such a
write is called a “blind write”.
But the main problem with view serializability is that it is computationally
extremely complex, and there is no efficient algorithm to test it.
3. Uses of serializability:
If one were to prove the serializability of a schedule S, it is equivalent to saying that S
is correct. Hence, it guarantees that the schedule provides correct results. But being
serializable is not the same as being serial. A serial schedule is inefficient for the
reasons explained earlier, which lead to under-utilization of the CPU and I/O devices,
and in some cases, like mass reservation systems, serial scheduling becomes untenable.
On the other hand, a serializable schedule combines the benefits of concurrent
execution (efficient system utilization, ability to cater to a larger number of concurrent
users) with the guarantee of correctness.
But all is not well yet. The scheduling process is done by the operating system
routines after taking into account various factors like system load, time of transaction
submission, and priority of the process with reference to other processes, among a
large number of other factors. Also, since a very large number of interleaving
combinations are possible, it is extremely difficult to determine beforehand the
manner in which the transactions are interleaved. In other words, getting the various
schedules itself is difficult, let alone testing them for serializability.
Hence, instead of generating the schedules, checking them for serializability and then
using them, most DBMS protocols use a more practical method: impose restrictions
on the transactions themselves. These restrictions, when followed by every
participating transaction, automatically ensure serializability in all schedules that are
created by these participating transactions.
Also, since transactions are being submitted at different times, it is difficult to
determine when a schedule begins and when it ends. Hence serializability theory
deals with this problem by considering only the committed projection C(S) of the
schedule. Hence, as an approximation, we can define a schedule S as serializable if
its committed projection C(S) is equivalent to some serial schedule.
3.7 Locking techniques for concurrency control
Many of the important techniques for concurrency control make use of the concept
of the lock. A lock is a variable associated with a data item that describes the status of
the item with respect to the possible operations that can be done on it. Normally
every data item is associated with a unique lock. They are used as a method of
synchronizing the access of database items by the transactions that are operating
concurrently. Such controls, when implemented properly, can overcome many of the
problems of concurrent operations listed earlier. However, the locks themselves may
create a few problems, which we shall see in some detail in subsequent sections.
Types of locks and their uses:
Binary locks: A binary lock can have two states or values (1 or 0); one of them
indicates that the item is locked and the other that it is unlocked. For example, if we
presume that 1 indicates the lock is on and 0 indicates it is open, then if the lock value
of item X is 1, a read_tr(X) cannot access the item as long as the lock’s value
continues to be 1. We can refer to such a state as lock(X).
The concept works like this. The item X can be accessed only when it is free
to be used by the transactions. If, say, its current value is being modified, then X
cannot be (in fact, should not be) accessed till the modification is complete. The
simple mechanism is to lock access to X as long as the process of modification is on,
and unlock it for use by the other transactions only when the modification is
complete.
So we need two operations: lockitem(X), which locks the item, and
unlockitem(X), which opens the lock. Any transaction that wants to make use of the
data item first checks the lock status of X using lockitem(X). If the item X is
already locked (lock status = 1), the transaction will have to wait. Once the status
becomes 0, the transaction accesses the item and locks it (makes its status = 1).
When the transaction has completed using the item, it issues an unlockitem(X)
command, which again sets the status to 0 so that other transactions can access the
item.
Notice that the binary lock essentially produces a “mutually exclusive” type of
situation for the data item, so that only one transaction can access it. These operations
can be easily written as an algorithm as follows:
The Locking algorithm
lockitem(X):
start: if Lock(X) = 0        /* item is unlocked */
          then Lock(X) ← 1   /* lock it */
       else
          { wait (until Lock(X) = 0 and
            the lock manager wakes up the transaction);
            go to start }
The Unlocking algorithm:
unlockitem(X):
Lock(X) ← 0;   /* unlock the item */
{ if any transactions are waiting,
  wake up one of the waiting transactions }
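As a sketch, the two operations map naturally onto a condition variable, which here plays the role of the lock manager; the class and method names are illustrative assumptions.

```python
import threading

class BinaryLock:
    """One lock variable per data item, as in the algorithms above."""
    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False              # False = 0 (unlocked), True = 1

    def lock_item(self):
        with self._cond:
            while self._locked:           # wait until Lock(X) = 0 ...
                self._cond.wait()         # ... and the manager wakes us up
            self._locked = True           # Lock(X) <- 1

    def unlock_item(self):
        with self._cond:
            self._locked = False          # Lock(X) <- 0
            self._cond.notify()           # wake one waiting transaction

lock_x = BinaryLock()
lock_x.lock_item()        # transaction T acquires X
lock_x.unlock_item()      # T releases X; any waiter is woken
```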
The only restriction on the use of binary locks is that they should be
implemented as indivisible units (also called “critical sections” in operating systems
terminology). That means no interleaving operations should be allowed once a lock
or unlock operation is started, until the operation is completed. Otherwise, if a
transaction locks a unit and gets interleaved with many other transactions, the locked
unit may remain unavailable for long periods, with catastrophic results.
To make use of the binary lock schemes, every transaction should follow certain
protocols:
1. A transaction T must issue the operation lockitem(X) before issuing a
read_tr(X) or write_tr(X).
2. A transaction T must issue the operation unlockitem(X) after all read_tr(X)
and write_tr(X) operations are complete on X.
3. A transaction T will not issue a lockitem(X) operation if it already holds
the lock on X (i.e. if it had issued the lockitem(X) in the immediate
previous instance)
4. A transaction T will not issue an unlockitem(X) operation unless it holds
the lock on X.
Between the lock(X) and unlock(X) operations, the value of X is held only
by the transaction T and hence no other transaction can operate on X, thus
many of the problems discussed earlier are prevented.
Shared/Exclusive locks
While the operation of the binary lock scheme appears satisfactory, it suffers
from a serious drawback. Once a transaction holds a lock (has issued a lock
operation), no other transaction can access the data item. But in large concurrent
systems, this can become a disadvantage. It is obvious that more than one transaction
should not go on writing into X, and that while one transaction is writing into it, no
other transaction should be reading it; but no harm is done if several transactions are
allowed to read the item simultaneously. This would save the time of all these
transactions without in any way affecting correctness.
This concept gave rise to the idea of shared/exclusive locks. When only read
operations are being performed, the data item can be shared by several transactions;
it is only when a transaction wants to write into the item that the lock should be
exclusive. Hence the shared/exclusive lock is also sometimes called a multiple-mode
lock. A read lock is a shared lock (which can be held by several transactions),
whereas a write lock is an exclusive lock. So we need to think of three operations: a
read lock, a write lock, and unlock. The algorithms can be as follows:
Read Lock Operation:
readlock(X):
start: if Lock(X) = “unlocked”
          then { Lock(X) ← “read-locked”;
                 no_of_reads(X) ← 1 }
       else if Lock(X) = “read-locked”
          then no_of_reads(X) ← no_of_reads(X) + 1
       else { wait (until Lock(X) = “unlocked” and
              the lock manager wakes up the transaction);
              go to start }
end.
The writelock operation:
writelock(X):
start: if Lock(X) = “unlocked”
          then Lock(X) ← “write-locked”
       else { wait (until Lock(X) = “unlocked” and
              the lock manager wakes up the transaction);
              go to start }
end.
The Unlock Operation:
unlock(X):
if Lock(X) = “write-locked”
   then { Lock(X) ← “unlocked”;
          wake up one of the waiting transactions, if any }
else if Lock(X) = “read-locked”
   then { no_of_reads(X) ← no_of_reads(X) – 1;
          if no_of_reads(X) = 0
             then { Lock(X) ← “unlocked”;
                    wake up one of the waiting transactions, if any } }
The algorithms are fairly straightforward, except that during the unlocking
operation, if a number of read locks are held, all of them are to be released before
the item itself becomes unlocked.
To ensure smooth operation of the shared / exclusive locking system, the
system must enforce the following rules:
1. A transaction T must issue the operation readlock(X) or writelock(X)
before any read_tr(X) operation is performed on X.
2. A transaction T must issue the operation writelock(X) before any
write_tr(X) operation is performed on X.
3. A transaction T must issue the operation unlock(X) after all read_tr(X)
and write_tr(X) operations are completed on X.
4. A transaction T will not issue a readlock(X) operation if it already holds a
readlock or writelock on X.
5. A transaction T will not issue a writelock(X) operation if it already holds a
readlock or writelock on X.
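A minimal sketch of the three operations, again using a condition variable as the lock manager (the names are illustrative assumptions, and writer starvation is not addressed):

```python
import threading

class SharedExclusiveLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._state = "unlocked"        # or "read-locked" / "write-locked"
        self._readers = 0               # no_of_reads(X)

    def read_lock(self):
        with self._cond:
            while self._state == "write-locked":
                self._cond.wait()
            self._state = "read-locked"
            self._readers += 1          # one more sharing reader

    def write_lock(self):
        with self._cond:
            while self._state != "unlocked":   # exclusive: wait for all
                self._cond.wait()
            self._state = "write-locked"

    def unlock(self):
        with self._cond:
            if self._state == "write-locked":
                self._state = "unlocked"
            elif self._state == "read-locked":
                self._readers -= 1
                if self._readers == 0:  # last reader releases the item
                    self._state = "unlocked"
            self._cond.notify_all()     # wake the waiting transactions

item_x = SharedExclusiveLock()
item_x.read_lock()      # two transactions may both read X ...
item_x.read_lock()
item_x.unlock()
item_x.unlock()         # ... and X is unlocked once both finish
```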
Conversion Locks
In some cases, it is desirable to allow lock conversion by relaxing
conditions (4) and (5) of the shared/exclusive lock mechanism; i.e. if a transaction T
already holds one type of lock on an item X, it may be allowed to convert it to the
other type. For example, if it is holding a readlock on X, it may be allowed to
upgrade it to a writelock. All that the transaction does is issue a writelock(X)
operation. If T is the only transaction holding the readlock, it may be immediately
allowed to upgrade itself to a writelock; otherwise it has to wait till the other
readlocks (of other transactions) are released. Similarly, if it is holding a writelock,
T may be allowed to downgrade it to a readlock(X). The algorithms of the previous
sections can be amended to accommodate these conversion locks, and this is left as
an exercise to the students.
Before we close the section, it should be noted that the use of binary locks does
not by itself guarantee serializability. This is because, in certain combinations of
situations, a lock-holding transaction may end up unlocking the unit too early. This
can happen for a variety of reasons, including a situation wherein a transaction feels
it no longer needs a particular data unit and hence unlocks it, but may be indirectly
writing into it at a later time (through some other unit). This would result in
ineffective locking performance, and serializability is lost. To guarantee
serializability, the protocol of two-phase locking is to be implemented, which we will
see in the next section.
Two phase locking:
A transaction is said to follow two-phase locking if its operations can be divided
into two distinct phases. In the first phase, all items that are needed by the transaction
are acquired by locking them; in this phase, no item is unlocked even if its operations
on that item are over. In the second phase, the items are unlocked one after the other.
The first phase can be thought of as a growing phase, wherein the store of locks held
by the transaction keeps growing. The second phase, called the shrinking phase, is
where the number of locks held by the transaction keeps shrinking.
readlock(Y)
readtr(Y) Phase I
writelock(X)
-----------------------------------
unlock(Y)
readtr(X) Phase II
X=X+Y
writetr(X)
unlock(X)
fig12
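A transaction's operation list can be checked mechanically for this discipline: no lock may be acquired once the first unlock has been issued. The string-based encoding below is an illustrative assumption.

```python
def follows_two_phase_locking(ops):
    """ops: a transaction's operations in order, e.g. 'readlock(Y)'."""
    shrinking = False
    for op in ops:
        if op.startswith("unlock"):
            shrinking = True             # phase II has begun
        elif "lock" in op and shrinking:
            return False                 # acquiring after releasing: not 2PL
    return True

t = ["readlock(Y)", "readtr(Y)", "writelock(X)",       # phase I (growing)
     "unlock(Y)", "readtr(X)", "X=X+Y", "writetr(X)",  # phase II (shrinking)
     "unlock(X)"]
print(follows_two_phase_locking(t))     # True
bad = ["readlock(Y)", "unlock(Y)", "writelock(X)", "unlock(X)"]
print(follows_two_phase_locking(bad))   # False
```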
3.8 Query Optimization Techniques:
1. Heuristic-based query optimization – This is based on heuristic rules for ordering
the operations in a query execution strategy. In general, many different relational
algebra expressions, and hence many different query trees, can be equivalent, i.e.
they can correspond to the same query. The query parser will typically generate a
standard initial query tree corresponding to an SQL query, without doing any
optimization. The optimizer must include rules for equivalence among relational
algebra expressions that can be applied to the query. The heuristic query optimization
rules then utilize these equivalences to transform the initial tree into the final,
optimized query tree.
General transformation rules for relational algebra operations:
1. Cascade of σ : A conjunctive selection condition can be broken up into a cascade
of individual σ operations.
2. Commutativity of σ : The σ operation is commutative.
3. Cascade of П : In a cascade of П operations, all but the last one can be ignored.
4. Commutating σ with П : If the selection condition c involves only those attributes
A1, A2,…An in the projection list, the 2 operations can be commuted:
П A1, A2,..An (σc ( R) ) = σc (П A1, A2,..An ( R))
5. Commutativity of ⋈ (and X): The ⋈ operation is commutative, as is the X
operation: R ⋈ S = S ⋈ R
R X S = S X R
6. Commuting σ with ⋈ (or X): If all the attributes in the selection condition c
involve only the attributes of one of the relations being joined, say R, the two
operations can be commuted as follows:
σc (R ⋈ S) = (σc (R)) ⋈ S
Alternatively, if the selection condition c can be written as c1 and c2, where condition
c1 involves only the attributes of R and condition c2 involves only the attributes of S,
the operations commute as follows:
σc (R ⋈ S) = (σc1 (R)) ⋈ (σc2 (S))
The same rules apply if the ⋈ is replaced by a X operation.
7. Commuting П with ⋈ (or X): Suppose that the projection list is L = {A1, A2,
…, An, B1, B2, …, Bm}, where A1, A2, …, An are attributes of R and B1, B2, …,
Bm are attributes of S. If the join condition c involves only attributes in L, the two
operations can be commuted as follows:
П L (R ⋈c S) = (П A1, A2, …, An (R)) ⋈c (П B1, B2, …, Bm (S))
If the join condition c contains additional attributes not in L, these must be added to
the projection list, and a final П operation is needed; i.e. if attributes An+1, …, An+k
of R and Bm+1, …, Bm+p of S are involved in the join condition c but are not in the
projection list L, the operations commute as follows:
П L (R ⋈c S) = П L ((П A1, …, An, An+1, …, An+k (R)) ⋈c (П B1, …, Bm, Bm+1, …, Bm+p (S)))
For X, there is no condition c, so the first form of the transformation rule always
applies; simply replace ⋈c with X.
8. Commutativity of set operations: The set operations ∪ and ∩ are commutative,
but – is not.
9. Associativity of ⋈, X, ∪ and ∩: These four operations are individually
associative; i.e. if Ө stands for any of these four operations, then (R Ө S) Ө T = R Ө (S Ө T).
10. Commuting σ with set operations: The σ operation commutes with ∪, ∩ and –.
If Ө stands for any of these three operations, then σc (R Ө S) = (σc (R)) Ө (σc (S)).
11. Commuting П with ∪: П L (R ∪ S) = (П L (R)) ∪ (П L (S)).
12. Converting a (σ, X) sequence into ⋈: If the condition c of a σ that follows a X
corresponds to a join condition, convert the (σ, X) sequence into a ⋈ as follows:
σc (R X S) = (R ⋈c S)
Outline Of Heuristic Algebraic Optimization Algorithm
Based on the above mentioned rules we can now outline the steps of the algorithm as :
1. Using rule1, break up any SELECT operations with conjunctive conditions
into a cascade of SELECT operations.
2. Using rules 2, 4, 6 and 10 concerning the commutativity of SELECT with
other operations, move each SELECT operations as far down the query tree as
is permitted by the attributes involved in the select condition.
3. Using rules 5 and 9 concerning commutativity and associativity of binary
operations, rearrange the leaf nodes of the tree using the following criteria.
First, position the leaf node relations with the most restrictive SELECT
operations so they are executed first in the query tree representation. The
definition of most restrictive SELECT can mean either the ones that produce a
relation with the fewest tuples or with the smallest absolute size. Another
possibility is to define the most restrictive SELECT as the one with the
smallest selectivity. Second, make sure that the ordering of leaf nodes does not
cause CARTESIAN PRODUCT operations. For example, if the two relations with
the most restrictive SELECT do not have a direct join condition between
them, it may be desirable to change the order of leaf nodes to avoid Cartesian
products.
4. Using rule 12, combine a CARTESIAN PRODUCT operation with a
subsequent SELECT operation in the tree into a JOIN operation, if the
condition represents a join condition.
5. Using rules 3, 4, 7 and 11 concerning the cascading of PROJECT and the
commuting of PROJECT with other operations, break down and move lists of
projection attributes down the tree as far as possible by creating new
PROJECT operations as needed. Only those attributes needed in the query
result and in subsequent operations in the query tree should be kept after each
PROJECT operation.
6. Identify subtrees that represent groups of operations that can be executed by a
single algorithm.
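Step 2 above (pushing SELECT operations down the tree, via rule 6) can be sketched on a toy query-tree encoding. The tuple-based node format and the EMPLOYEE/DEPARTMENT relations are hypothetical illustrations, not taken from the text.

```python
# Node encoding (illustrative):
#   ("rel", name, attrs)      leaf relation and its attribute set
#   ("select", attrs, child)  sigma whose condition uses exactly `attrs`
#   ("join", left, right)     join (its condition is omitted for brevity)

def attrs_of(node):
    if node[0] == "rel":
        return node[2]
    if node[0] == "select":
        return attrs_of(node[2])
    return attrs_of(node[1]) | attrs_of(node[2])

def push_select(node):
    """Rule 6: move a SELECT below a JOIN when its attributes
    belong entirely to one of the joined relations."""
    if node[0] == "select":
        attrs, child = node[1], push_select(node[2])
        if child[0] == "join":
            left, right = child[1], child[2]
            if attrs <= attrs_of(left):
                return ("join", push_select(("select", attrs, left)), right)
            if attrs <= attrs_of(right):
                return ("join", left, push_select(("select", attrs, right)))
        return ("select", attrs, child)
    if node[0] == "join":
        return ("join", push_select(node[1]), push_select(node[2]))
    return node

emp = ("rel", "EMPLOYEE", {"ssn", "dno"})
dept = ("rel", "DEPARTMENT", {"dnumber", "mgrssn"})
tree = ("select", {"dno"}, ("join", emp, dept))
optimized = push_select(tree)
# The selection on dno now sits directly above EMPLOYEE, below the join,
# so fewer tuples reach the (expensive) join.
```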
2. Cost Based optimization – A query optimizer should not solely depend on
heuristic rules; it should also estimate and compare the costs of executing a query
using different execution strategies and should choose the strategy with the lowest
cost estimate. This approach is more suitable for compiled queries where the
optimization is done at compile time and the resulting execution strategy code is
stored and executed directly at run-time.
Cost Components for Query Execution
The cost of executing a query includes the following components:
1. Access cost to secondary storage: This is the cost of searching for, reading and
writing data blocks that reside on secondary storage, mainly on disk. The cost of
searching for records in a file depends on the type of access structures on that file,
such as ordering, hashing and primary or secondary indices. In addition, factors
such as whether the file blocks are allocated contiguously on the same disk
cylinder or scattered on the disk affect the access cost.
2. Storage cost: This is the cost of storing any intermediate files that are generated by
an execution strategy for the query.
3. Computation cost: This is the cost of performing in memory operations on the
data buffers during query execution. Such operations include searching for and
sorting records, merging records for a join and performing computations on field
values.
4. Memory usage cost: This is the cost pertaining to the number of memory buffers
needed during query execution.
5. Communication cost: This is the cost of shipping the query and its result from the
database site to the site or terminal where the query originated.
These components are used in cost functions that estimate query execution
cost. To estimate the costs of various execution strategies, we must keep track of the
information needed by the cost functions. This information may be stored in
the DBMS catalog, where it is accessed by the query optimizer. First, we must know
the size of each file. For a file whose records are all of the same type, the number of
records (tuples), the (average) record size and the number of blocks are needed. The
blocking factor of the file may also be needed.
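For instance, the blocking factor and block count give quick estimates of access cost in block reads. The figures below are illustrative assumptions, not values from the text.

```python
import math

def file_stats(num_records, record_size, block_size):
    """Catalog-style statistics for a file of fixed-length records."""
    bfr = block_size // record_size       # blocking factor: records per block
    b = math.ceil(num_records / bfr)      # number of blocks in the file
    return bfr, b

bfr, b = file_stats(num_records=30000, record_size=100, block_size=4096)
linear_cost = b / 2                       # average blocks read, unordered file
binary_cost = math.ceil(math.log2(b))    # ordered file, binary search on a key
print(bfr, b, linear_cost, binary_cost)  # 40 750 375.0 10
```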
3.10 Assertions
An assertion is a predicate expressing a condition that we wish the database always to
satisfy. Domain constraints and referential-integrity constraints are special forms of
assertions. There are many constraints that we cannot express using only these special
forms. Examples of such constraints include:
1. The sum of all loan amounts for each branch must be less than the sum of all
account balances at the branch.
2. Every loan has at least one customer who maintains an account with a minimum
balance of $1000.00
An assertion in SQL-92 takes the form
Create assertion <assertion-name> check <predicate>
The two constraints mentioned can be written as shown next. Since SQL does not
provide a “for all X, P(X)” construct (where P is a predicate), we are forced to
implement the construct using the equivalent “not exists X such that not P(X) ”
construct , which can be written in SQL.
1. Create assertion sum-constraint check (not exists (select * from branch
where (select sum(amount) from loan where loan.branch-
name=branch.branch-name) >= (select sum(amount) from account where
account.branch-name=branch.branch-name)))
2. Create assertion balance-constraint check (not exists (select * from loan
where not exists (select * from borrower, depositor, account where
loan.loan-number=borrower.loan-number and
borrower.customer-name=depositor.customer-name and
depositor.account-number=account.account-number and
account.balance>=1000)))
When an assertion is created, the system tests it for validity. If the assertion is valid,
then any future modification to the database is allowed only if it does not cause that
assertion to be violated.
3.10 Triggers
A trigger is a statement that is executed automatically by the system as a side
effect of a modification to the database. To design a trigger mechanism, we must meet
two requirements:
1. Specify the conditions under which the trigger is to be executed.
2. Specify the actions to be taken when the trigger executes
3.11 The basic structure of the Oracle system
An Oracle server consists of an Oracle database (the collection of stored data,
including log and control files) and the Oracle instance (the processes, including
Oracle system processes and user processes taken together, created for a specific
instance of the database operation).
Oracle Database Structure
The Oracle database has two primary structures:
1. A physical structure – referring to the actual stored data.
2. A logical structure – corresponding to an abstract representation of stored data,
which roughly corresponds to the conceptual schema of the databases.
The database contains the following types of files:
1. One or more data files; these contain the actual data.
2. Two or more log files called redo log files; these record all changes made to
data and are used in the process of recovering, if certain changes do not get
written to permanent storage.
3. One or more control files; these contain control information such as database
name, file names and locations and a database creation timestamp.
4. Trace files and an alert log; background processes have a trace file associate
with them and the alert log maintains major database events.
The structure of an Oracle database consists of the definition of database in terms of
schema objects and one or more tablespaces. The schema objects contain definitions
of tables, views, sequences, stored procedures, indexes, clusters and database links.
Oracle instance : The set of processes that constitute an instance of the server’s
operation is called an Oracle instance, which consists of a System Global Area and a
set of background processes.
System Global Area (SGA) : This area of memory is used for database
information shared by users. Oracle assigns an SGA area when an instance starts.
The SGA in turn is divided into several types of memory structures:
1. Database buffer cache: This keeps the most recently accessed data blocks from
the database. This helps in reducing the disk I/O activity.
2. Redo log buffer, which is the buffer for the redo log file and is used for
recovery purposes.
3. Shared pool, which contains shared memory constructs.
User processes : Each user process corresponds to the execution of some
application or some tool.
Program Global Area (PGA): This is a memory buffer that contains data and
control information for a server process.
Oracle processes: A process is a thread of control, i.e. a mechanism in an operating
system that can execute a series of steps. A process has its own private memory
area where it runs. Oracle creates server processes to handle requests from connected
user processes. The background processes are created for each instance of Oracle;
they perform I/O asynchronously and provide parallelism for better performance.
Oracle Startup and Shutdown: An Oracle database is not available to users until
the Oracle server has been started up and the database has been opened. Starting a
database and making it available system wide requires the following steps:
1. Starting an instance of the database: The SGA is allocated and background
processes are created in this step.
2. Mounting a database: This associates a previously started Oracle instance with a
database. Until then it is available only to administrators. The database
administrator chooses whether to run the database in exclusive or parallel mode.
When an oracle instance mounts a database in an exclusive mode, only that
instance can access the database. On the other hand, if the instance is started in a
parallel or share mode, other instances that are started in parallel mode can also
mount the database.
3. Opening a database: Opening a database makes it available for normal database
operations by having oracle open the on-line data files and log files.
The reverse of the above operations will shut down an Oracle instance as follows:
1. Close the database.
2. Dismount the database.
3. Shut down the Oracle instance.
3.12 Database structure and its manipulation in Oracle
Schema Objects: In Oracle, a schema refers to a collection of data definition objects.
Schema objects are the individual objects that describe tables, views, etc. Tables are
the basic units of data. Synonyms are direct references to objects. Program units
include functions, stored procedures and packages.
Oracle Data Dictionary: This is a read-only set of tables that keeps the metadata (the
schema description) for a database. The Oracle dictionary has the following components:
Names of users
Security information
Schema objects information
Integrity constraints
Space allocation and utilization of database objects
Statistics on attributes, tables and predicates
Access audit trail information
3.13 Storage organization in Oracle
A database is divided into logical storage units called tablespaces, with the following
characteristics:
Each database is divided into one or more tablespaces.
There is a system tablespace and one or more user tablespaces.
One or more datafiles are created in each tablespace.
The combined storage capacity of a database’s tablespace is the total storage
capacity of the database.
Data Blocks: Data Block represents the smallest unit of I/O. A data block has the
following components:
Header: Contains general block information such as block address and type of
segment.
Table directory: Contains information about tables that have data in the data
block.
Row directory: Contains information about the actual rows.
Row data: Uses the bulk of space in the data block.
Free space: Space allocated for row updates and new rows.
Extents: When a table is created, Oracle allocates it an initial extent. Incremental
extents are automatically allocated when the initial extent becomes full. All extents
allocated in index segments remain allocated as long as the index exists. When an
index associated with a table or cluster is dropped, Oracle reclaims the space.
Segments: A segment is made up of a number of extents and belongs to a tablespace.
Oracle uses the following four types of segments:
Data segments: Each nonclustered table and each cluster has a single data segment to
hold all its data, which is created when the application creates the table or cluster with
the CREATE command.
Index segments: Each index in an Oracle database has a single index segment, which
is created with the CREATE INDEX command.
Temporary segments: These are created by Oracle for use by SQL statements that
need a temporary work area.
Rollback segments: Each database must contain one or more rollback segments,
which are used for “undoing” transactions.
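As an illustrative sketch, the segments owned by the current user, together with the number of extents allocated to each, can be inspected through the standard USER_SEGMENTS dictionary view:

```sql
-- One row per data, index, temporary or rollback segment owned by the user
SELECT segment_name, segment_type, tablespace_name, extents
FROM   user_segments;
```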
3.14 Programming in PL/SQL:
PL/SQL BLOCK STRUCTURE:
PL/SQL is a block-structured language. A PL/SQL block defines a unit of
processing, which can include its own local variables, SQL statements, cursors, and
exception handlers. The blocks can be nested. The simplest block structure is given
below.
DECLARE
   Variable declarations
BEGIN
   Program statements
EXCEPTION
   WHEN exception THEN
      Exception-handling statements
END;
In the above PL/SQL block, the block parts are logical. A block starts with the
DECLARE section, in which memory variables and other Oracle objects can be
declared. The next section contains SQL executable statements for manipulating
table data by using the variables and constants declared in the DECLARE section.
EXCEPTION is the last section of the PL/SQL block; it contains SQL and/or PL/SQL
code to handle errors that may crop up during the execution of the above code block.
The EXCEPTION section is optional.
Each block can contain other blocks, i.e. blocks can be nested. Blocks of code
cannot, however, be nested in the DECLARE section.
PL/SQL CHARACTER SET
PL/SQL uses the standard ASCII set. The basic character set includes the
following.
Words used in a PL/SQL blocks are called lexical units. We can freely insert
blank spaces between lexical units in a PL/SQL blocks. The spaces have no effect
on the PL/SQL block.
The ordinary symbols used in PL/SQL blocks are
( ) + - * / < > = ; % , “ [ ] :
Compound symbols used in PL/SQL blocks are
<> != ~= ^= <= >= := ** || << >>
The basic character set is:
Uppercase alphabets A to Z
Lowercase alphabets a to z
Numbers 0 to 9
Symbols ( ) + - * / < > = ! ; : , . @ ' % " # $ ^ & _ \ { } ? [ ]
VARIABLES
Variables may be used to store the result of a query or a calculation. Variables
must be declared before being used. Variables in a PL/SQL block are named variables.
A variable name must begin with a letter and can be followed by a maximum of
29 other characters (maximum variable name length is 30 characters).
Reserved words cannot be used as variable names unless enclosed within
double quotes. Variables must be separated from each other by at least one space or
by a punctuation mark. The case (upper/lower) is insignificant in variable names,
and a space cannot be used in a variable name.
LITERALS
A literal is a numeric value or a character string used to represent itself. So,
literals can be classified into two types.
Numeric literals
Non- numeric literals (string literals)
Numeric literals:
These can be either integers or floating point numbers. If a floating point
number is being represented, the integer part must be separated from the
fractional part by a period ( . ).
Integers 25 43 437 -57 etc
Floats 6.34 25E-03 0.1 +17.1 etc
Non numeric literals:
These are represented by one or more legal characters and must be enclosed
within single quotes.
Ex: 'Hello world'
'EMPLOYEE NAME'
'*******'
'A'
'*'
We can represent single quote character itself in a non-numeric literal by writing it
twice.
Ex: 'Don''t go without saving the program'
PL/SQL will also have literals, which are called as logical (boolean) literals.
These are predetermined constants. The value it can take are TRUE, FALSE, and
NULL.
COMMENTS
A comment line begins with a double hyphen (--); the rest of the line
is treated as a comment.
Ex: -- This section performs salary updation.
A comment can also begin with a slash followed by an asterisk (/*) and
run until an asterisk followed by a slash (*/); such comments can
extend over more than one line.
Ex-1: /* this is only for user purpose
which calculates the total salary temporarily
and stores the value in temp_sal */
Ex-2: /* This takes rows from /* table EMPLOYEE */
and put on another table */
In the above example there is a comment nested within another comment;
this is not allowed in PL/SQL.
PL/SQL DATA TYPES AND DECLARATIONS:
PL/SQL supports the standard ORACLE SQL data types. The default data
types that can be declared in PL/SQL are:
NUMBER: For storing numeric data.
Syntax: variable_name NUMBER (precision [, scale])
Precision determines the number of significant digits that the NUMBER
can contain; scale determines the number of digits to the right of the
decimal point.
Ex: NUMBER (6,2) stores 4234.60
NUMBER (10) stores 3289473348
CHAR: This data type stores fixed length character data.
Syntax: Variable name CHAR (size)
where size specifies fixed length of the variable name.
Ex: CHAR (10) stores MASTERFILE
VARCHAR2: It stores variable length character string data.
Syntax: Variable name VARCHAR2 (size)
Where size specifies the maximum length of the variable name.
Ex: VARCHAR2 (20) stores TRANSACTIONFILE
DATE: The DATE data type stores a date and time.
Syntax: variable name DATE
Ex: date_of_birth DATE
BOOLEAN: This data type stores only TRUE, FALSE or NULL values.
Syntax: variable name BOOLEAN
Ex: flag BOOLEAN.
%TYPE declares a variable or constant to have the same data type as that of a
previously defined variable or of a column in a table or in a view.
NOT NULL causes creation of a variable or a constant that cannot have a NULL
value. Attempting to assign the value NULL to a variable or a constant that has
been declared NOT NULL causes an error.
NOTE: As soon as a variable or constant has been declared as NOT NULL, it must be
assigned a value. Hence every NOT NULL declaration of a variable or constant needs
to be followed by PL/SQL expression that loads a value into the variable or constant
declared.
DECLARING VARIABLES
We can declare a variable of any data type either native to the ORACLE or native to
PL/SQL. Variables are declared in the DECLARE section of the PL/SQL block.
Declaration involves the name of the variable followed by its data type. All statements
must end with a semicolon (;), which is the statement delimiter in PL/SQL. To assign
a value to a variable the assignment operator (:=) is used.
The general syntax is <Variable name> <type> [ :=<value> ];
Ex: pay NUMBER (6,2);
in_stack BOOLEAN;
name VARCHAR2 (30);
room CHAR (2);
date_of_purchase DATE;
ASSIGNING A VALUE TO A VARIABLE:
A value can be assigned to the variable in any one of the following two ways.
Using the assignment operator :=
Ex: tax := price * tax_rate;
pay := basic + da;
Selecting or fetching table data values in to variables.
Ex: SELECT sal INTO pay
FROM Employee
WHERE emp_name = ‘SMITH’;
DECLARING A CONSTANT:
Declaring a constant is similar to declaring a variable, except that you have to add
the keyword CONSTANT and immediately assign a value to it. Thereafter, no further
assignment to the constant is possible.
Ex: pf_percent CONSTANT NUMBER (3,2) := 8.33;
PICKING UP A VARIABLE’S PARAMETERS FROM A TABLE CELL
The basic building block of a table is a cell (i.e. a table column). While creating a table, the user attaches certain attributes to each column, such as a data type and constraints. These attributes can be passed on to the variables being created in PL/SQL, which simplifies the declaration of variables and constants.
For this purpose, the %TYPE attribute is used in the declaration of a
variable when the variable’s attributes must be picked from a table field (i.e. column).
Ex: current_sal employee.sal%TYPE;
In the above example, current_sal is a variable of the PL/SQL block. It gets the data
type and constraints of the column (field) sal belonging to the table Employee.
Declaring a variable with the %TYPE attribute has two advantages:
You do not need to know the data type of the table column.
If you change the parameters of the table column, the variable's parameters will
change as well.
PL/SQL allows you to use the %TYPE attribute in a nesting variable declaration.
The following example illustrates several variables defined on earlier %TYPE
declarations in a nesting fashion.
Ex: Dept_sales INTEGER;
Area_sales dept_sales %TYPE;
Group_sales area_sales %TYPE;
Regional_sales area_sales %TYPE;
Corporate_sales regional_sales %TYPE;
In case variables for an entire row of a table need to be declared, then instead
of declaring them individually, %ROWTYPE is used.
Ex: emp_row_var employee %ROWTYPE;
Here, the variable emp_row_var will be a composite variable, consisting of the
column names of the table as its members. To refer to a specific member, say sal,
the following statement will be used.
emp_row_var.sal := 5000;
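A minimal block combining %TYPE and %ROWTYPE might look like this (the Employee table with its sal and emp_name columns is assumed from the surrounding examples):

```sql
DECLARE
   emp_row  employee%ROWTYPE;     -- one member per column of Employee
   pay      employee.sal%TYPE;    -- same data type as the sal column
BEGIN
   SELECT * INTO emp_row
   FROM   employee
   WHERE  emp_name = 'SMITH';
   pay := emp_row.sal;            -- members are referenced with dot notation
END;
```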
AN IDENTIFIER IN PL/SQL BLOCK:
The name of any ORACLE object (variable, memory variable, constant, record,
cursor etc) is known as an Identifier. The following laws have to be followed while
working with identifiers.
An identifier cannot be declared twice in the same block
The same identifier can be declared in two different blocks.
In the second law, the two identifiers are unique and any change in one does
not affect the other.
PL/SQL OPERATORS
Operators are the glue that holds expressions together. PL/SQL operators can be
divided into the following categories.
Arithmetic operators
Comparison operators
Logical operators
String operators
PL/SQL operators are either unary (they act on one value/variable) or binary
(they act on two values/variables).
1) ARITHMETIC OPERATORS:
Arithmetic operators are used for mathematical computations. They are
+ Addition
- Subtraction or Negation (Ex: -5)
* Multiplication
/ Division
** Exponentiation operator (example 10**5 = 10^5)
2) COMPARISON OPERATORS:
Comparison operators return a BOOLEAN result, either TRUE or FALSE.
They are
= Equality operator 5=3
!= Inequality operator a!=b
<> Inequality operator 5<>3
~= Inequality operator 'john' ~= 'johny'
< Less than operator a<b
> Greater than operator a>b
<= Less than or equal to a<=b
>= Greater than or equal to a>=b
In addition to these, PL/SQL also provides some other comparison operators like
LIKE, IN, BETWEEN and IS NULL.
LIKE: Pattern-matching operator.
It is used to compare a character string against a pattern. Two wild card
characters are defined for use with LIKE: % (percent sign) and _ (underscore).
The % sign matches any number of characters in a string and _ matches exactly one.
Ex-1: new% matches newyork, newjersey etc. (i.e. any string beginning with 'new').
Ex-2: '___day' matches Sunday, Monday and Friday, but does not match the
other days, 'Tuesday', 'Wednesday', 'Thursday' and 'Saturday'.
IN: Checks to see if a value lies within a specified list of values.
Syntax: the_value [NOT] IN (value1, value2, value3, ...)
Ex: 3 IN (4, 8, 7, 5, 3, 2) Returns TRUE.
'Sun' NOT IN ('sat', 'mon', 'tue', 'wed', 'sun') Returns TRUE (string
comparison is case sensitive).
BETWEEN: Checks to see if a value lies within a specified range of values.
Syntax: the_value [NOT] BETWEEN low_end AND high_end
Ex: 5 BETWEEN -5 AND 10 Returns TRUE
4 NOT BETWEEN 3 AND 4 Returns FALSE (BETWEEN is inclusive of both ends).
IS NULL: Checks to see if a value is NULL.
Syntax: the_value IS [NOT] NULL
Ex: IF balance IS NULL THEN
IF acc_id IS NOT NULL THEN
3) LOGICAL OPERATORS.
PL/SQL implements three logical operators: AND, OR and NOT. The NOT
operator is a unary operator and is typically used to negate the result of a comparison
expression, whereas the AND and OR operators are typically used to link together
multiple comparisons.
A AND B is TRUE only if both A and B are TRUE; otherwise it is FALSE.
A OR B is TRUE if either A or B is TRUE, and FALSE if both A and B are FALSE.
NOT A returns TRUE if A is FALSE and FALSE if A is TRUE.
Ex: (5 = 5) AND (4<20) AND (2>=2) Returns TRUE
(5=5) OR (5!=4) Returns TRUE.
‘mon’ IN ( ‘sun’, ‘sat’) OR (2 = 2) Returns TRUE.
4) STRING OPERATORS:
PL/SQL has two operators specially designed to operate only on character string
data: LIKE and the concatenation operator ( || ). LIKE is a comparison operator
used to compare strings and was discussed in the previous section. The
concatenation operator has the following syntax.
Syntax: string_1 || string_2
string_1 and string_2 are strings and can be string constants, string variables or
string expressions. The concatenation operator returns a resultant string consisting of
all the characters in string_1 followed by all the characters in string_2.
Ex: 'Chandra' || 'shekhar' Returns 'Chandrashekhar'
If A = 'Engineering', B = 'College' and C is declared VARCHAR2(50), then
C := A || ' ' || B; assigns 'Engineering College' to C.
NOTE-1: PL/SQL string comparisons are always case sensitive, i.e. 'aaa' is not
equal to 'AAA'.
NOTE-2: ORACLE has some built-in functions that are designed to convert
from one data type to another.
To_date: Converts a character string into a date.
To_number: Converts a character string into a number.
To_char: Converts either a number or a date into a character string.
Ex: To_date ('1/1/92', 'mm/dd/yy'); Returns 01-JAN-1992.
To_date ('1-1-1998', 'mm-dd-yyyy'); Returns 01-JAN-1998.
To_date ('Jan 1, 2001', 'Mon dd, yyyy'); Returns 01-JAN-2001.
To_date ('1/1/02', 'mm/dd/rr'); Returns 01-JAN-2002.
To_number ('123.99', '999D99'); Returns 123.99
To_number ('$1,232.95', '$9G999D99'); Returns 1232.95
To_char (123.99, '999D99'); Returns '123.99'
CONDITIONAL CONTROL IN PL/SQL :
In PL/SQL, the IF statement allows you to control the execution of a block of
code. In PL/SQL we can use the following IF forms.
IF condition THEN
   Statements
END IF;

IF condition THEN
   Statements
ELSE
   Statements
END IF;

IF condition THEN
   Statements
ELSE
   IF condition THEN
      Statements
   ELSE
      Statements
   END IF;
END IF;
ITERATIVE CONTROL IN PL/SQL :
PL/SQL provides iterative control over the execution of PL/SQL statements in a
block: the ability to repeat or skip sections of a code block. The following are
the four types of iterative statements provided by PL/SQL.
The Loop statement
The WHILE Loop statement
The GOTO statement
FOR Loop
i. LOOP STATEMENT:
A loop repeats a sequence of statements. The format is as follows.
LOOP
Statements
END LOOP;
One or more PL/SQL statements can be written between the keywords
LOOP and END LOOP. Once a LOOP begins to run, it would go on forever;
hence loops are always accompanied by a conditional statement that keeps
control of the number of times the loop is executed. We can also build
user-defined exits from a loop, where required.
Ex: LOOP
   cntr := cntr + 1;
   IF cntr > 100 THEN
      EXIT;
   END IF;
END LOOP;
The EXIT statement brings control out of the loop when the condition is
satisfied.
ii. WHILE LOOP :
The WHILE loop enables you to evaluate a condition before the sequence of
statements is executed. If the condition is TRUE, the sequence of statements
is executed. This is different from the basic LOOP, whose body is executed
at least once. The syntax for the WHILE loop is as follows:
Syntax: WHILE < Condition is TRUE >
LOOP
< Statements >
END LOOP;
Ex: DECLARE
   count NUMBER(2) := 0;
BEGIN
   WHILE count <= 10
   LOOP
      count := count + 1;
      Message ('while loop executes');
   END LOOP;
END;
EXIT and EXIT WHEN statement:
EXIT and EXIT WHEN statements enable you to escape out of the control
of a loop. The format of the EXIT statement is as follows :
Syntax: EXIT;
EXIT WHEN statements has following syntax
Syntax: EXIT WHEN <condition is true >;
The EXIT WHEN statement enables you to specify the condition required to exit the
execution of the loop. In this case no IF statement is required.
Ex-1: IF count >= 10 THEN EXIT; END IF;
Ex-2: EXIT WHEN count >= 10;
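Using EXIT WHEN, the counter loop shown earlier in this section can be written more compactly:

```sql
DECLARE
   cntr NUMBER(3) := 0;
BEGIN
   LOOP
      cntr := cntr + 1;
      EXIT WHEN cntr > 100;   -- replaces the IF ... THEN EXIT; END IF; form
   END LOOP;
END;
```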
iii. THE GOTO STATEMENT :
The GOTO statement allows you to change the flow of control within a
PL/SQL block. The syntax is as follows
Syntax: GOTO <label name> ;
The label is surrounded by double angle brackets (<< >>) and must not be
followed by a semicolon, because a label is not a PL/SQL statement but rather
an identifier of a block of PL/SQL code. You must have at least one statement
after the label, otherwise an error will result. The GOTO destination must be
in the same block, at the same level as or higher than the GOTO statement itself.
Ex: IF result = 'fail' THEN
   GOTO failed_stud;
END IF;
<<failed_stud>>
Message ('student is failed');
The entry point of the destination block is defined within << >> as
shown above, i.e. labels are written within the symbol << >>. Notice that
<<failed_stud>> is a label and it is not ended with semicolon ( ; ).
iv. FOR LOOP :
FOR loop will allow you to execute a block of code repeatedly until
some condition occurs. The syntax of FOR loop is as follows.
Syntax: FOR loop_index IN [REVERSE] low_value .. high_value
LOOP
   Statements to execute
END LOOP;
The loop_index is defined by Oracle as a local variable of type INTEGER.
REVERSE allows you to execute the loop in reverse order. low_value ..
high_value is the range over which the loop executes; these can be constants
or variables. The LOOP line must not be terminated with a semicolon. The
statements listed are executed once for each value in the range.
Ex: FOR v_count IN 1 .. 5 LOOP
Message ('for loop executes');
END LOOP;
In the above example the message 'for loop executes' is displayed five
times.
We can terminate the FOR loop prematurely using an EXIT statement
based on some BOOLEAN condition. Nesting of FOR loops is also
allowed in PL/SQL. The outer loop executes once, then the inner loop
executes as many times as its range indicates; control then returns
to the outer loop until its own range expires.
Ex: FOR out_count IN 1..2 LOOP
FOR in_count IN 1..2 LOOP
Message ('nested for loop');
END LOOP;
END LOOP;
In the above example the message 'nested for loop' is displayed four
times.
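The REVERSE keyword mentioned in the syntax runs the index from high_value down to low_value. A small sketch (Message stands in for an output routine, as in the other examples here):

```sql
BEGIN
   FOR v_count IN REVERSE 1 .. 3 LOOP
      -- v_count takes the values 3, 2, 1 in that order
      Message ('count is ' || v_count);
   END LOOP;
END;
```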
Let us discuss some examples to understand how to write a PL/SQL
block. Here we assume that a table called "EMP" is created and the
data is already inserted into it.
Table name : EMP
Create table EMP
( emp_no NUMBER (3),
name VARCHAR2 (15),
salary NUMBER (6,2),
dept VARCHAR2 (15),
div VARCHAR2 (2) );
EXAMPLE-1:
DECLARE
   num NUMBER (3);
   sal emp.salary%TYPE;
   emp_name emp.name%TYPE;
   count NUMBER (2) := 1;
   starting_emp CONSTANT NUMBER(3) := 134;
BEGIN
   SELECT name, salary INTO emp_name, sal
   FROM EMP
   WHERE emp_no = starting_emp;
   WHILE sal < 4000.00
   LOOP
      count := count + 1;
      SELECT emp_no, name, salary INTO
         num, emp_name, sal FROM EMP
      WHERE emp_no > 2150;
   END LOOP;
   COMMIT;
END;
In the above example there are five declarations. num is of integer type;
sal and emp_name take the data types of the salary and name columns of the
EMP table respectively. count is a variable of integer type with initial
value 1. starting_emp is a constant of integer type with the immediately
assigned value 134.
Between the BEGIN and END keywords there are SQL executable statements
used for manipulating the table data. The SELECT statement extracts the data
stored in the name and salary columns of the EMP table for the employee with
employee number 134, and stores those values in the variables emp_name and
sal respectively.
If sal is less than 4000, the statements within the loop are executed.
Within the loop there are two statements: the first increments the count
value by 1 and the second is a SELECT statement. The COMMIT statement
commits the changes made to the table, and the END statement terminates
the PL/SQL block.
EXAMPLE-2:
This example assumes the existence of table accounts created by using
the following SQL statements.
Create table Accounts
(accnt_id NUMBER(3),
name VARCHAR2(25),
bal NUMBER(6,2) );
PL/SQL block:
DECLARE
   acct_balance NUMBER(6,2);
   acct CONSTANT NUMBER(3) := 312;
   debit_amt CONSTANT NUMBER(5,2) := 500.00;
BEGIN
   SELECT bal INTO acct_balance FROM Accounts
   WHERE accnt_id = acct;
   IF acct_balance >= debit_amt THEN
      UPDATE Accounts
      SET bal = bal - debit_amt WHERE accnt_id = acct;
   ELSE
      Message ('insufficient amount in account');
   END IF;
END;
The above example illustrates the use of the IF .. THEN .. ELSE .. END IF
conditional control statement.
The declaration part declares one variable and two constants. The SELECT
statement extracts the amount in the bal column of the Accounts table
corresponding to account number 312 and stores it in the variable
acct_balance.
The IF statement checks acct_balance for a sufficient amount before
debiting. It updates the Accounts table if there is a sufficient amount
in the balance; otherwise it displays a message intimating insufficient
funds in the account of the specified accnt_id.
EXAMPLE-3:
This example assumes two tables, which are created by following
statements.
Create table Inventory
( prod_no NUMBER (6),
product VARCHAR2 (15),
quantity NUMBER (5) );
Create table Purchase_record
( mesg VARCHAR2 (50),
d_ate DATE );
PL/SQL block :
DECLARE
   num_in_stock NUMBER(5);
BEGIN
   SELECT quantity INTO num_in_stock
   FROM Inventory WHERE product = 'gasket';
   IF num_in_stock > 0 THEN
      UPDATE Inventory SET quantity = quantity - 1
      WHERE product = 'gasket';
      INSERT INTO Purchase_record
      VALUES ('One gasket purchased', sysdate);
   ELSE
      INSERT INTO Purchase_record
      VALUES ('no gasket available', sysdate);
      Message ('there are no more gaskets in stock');
   END IF;
   COMMIT;
END;
The above block of PL/SQL code does the following:
It determines how many gaskets are left in stock.
If the number left in stock is greater than zero, it updates the inventory
to reflect the sale of a gasket, and it stores the fact that a gasket was
purchased on a certain date.
If the stock available is zero, it stores the fact that there are no more
gaskets for sale on the date on which that situation occurred.
ERROR HANDLING IN PL/SQL :
PL/SQL has the capability of dealing with errors that arise while
executing a PL/SQL block of code. It has a number of conditions
preprogrammed into it that are recognized as error conditions; these are
called internally defined exceptions. You can also program PL/SQL to
recognize user-defined exceptions.
There are two different types of error conditions (exceptions):
User-defined exceptions.
Predetermined (internal) PL/SQL exceptions.
1) USER DEFINED EXCEPTIONS:
The user can write a set of code to be executed when an error occurs during
execution of a PL/SQL block. Such code is called a user-defined exception
handler and is placed in the last section of the PL/SQL block, called
EXCEPTION.
The method used to recognize user-defined exceptions is as follows:
Declare a user defined exception in the declaration section of
PL/SQL block.
In the main program block for the conditions that needs special
attention, execute a RAISE statement.
Reference the declared exception with an error handling routine in
EXCEPTION section of PL/SQL block.
The RAISE statement acts like the CALL statement of high-level languages.
It has the general format
RAISE <name of exception>;
When a RAISE statement is executed, it stops the normal processing of the
PL/SQL block of code and control passes to the error handler block at the
end of the PL/SQL program block (the EXCEPTION section).
An exception declaration declares a name for a user-defined error condition
that the PL/SQL code block recognizes. It can only appear in the DECLARE section
of the PL/SQL code, which precedes the keyword BEGIN.
EXAMPLE :
DECLARE
---------------
zero_commission Exception;
---------------
BEGIN
-----------------
IF commission = 0 THEN
RAISE zero_commission;
------------------------
EXCEPTION
WHEN zero_commission THEN
Process the error
END;
Exception handler (error handler block ) is written between the key words
EXCEPTION and END. The exception handling part of a PL/SQL code is
optional. This block of code specifies what action has to be taken when the named
exception condition occurs.
The naming conventions for exception names are exactly the same as those for
variables or constants. All the rules for accessing an exception from PL/SQL
blocks are the same as those for variables and constants. However, it should
be noted that exceptions cannot be passed as arguments to functions or
procedures the way variables and constants can.
2) PREDETERMINED INTERNAL PL/SQL EXCEPTIONS :
The ORACLE server defines several errors with standard names. Although
every ORACLE error has a number, the error must be referred to by name.
PL/SQL has predefined some common ORACLE errors and exceptions; some of
them are given below:
NO_DATA_FOUND Raised when a SELECT statement returns zero rows.
TOO_MANY_ROWS Raised when a SELECT statement returns more than one row.
VALUE_ERROR Raised when there is either a data type mismatch or the
size is smaller than the required size.
INVALID_NUMBER Raised when conversion of a character string to a number fails.
ZERO_DIVIDE Raised on an attempt to divide by zero.
PROGRAM_ERROR Raised if PL/SQL encounters an internal problem.
STORAGE_ERROR Raised if PL/SQL runs out of memory or if memory is corrupted.
DUP_VAL_ON_INDEX Raised on an attempt to insert or update a duplicate value
in a column that has a unique index.
INVALID_CURSOR Raised when an illegal cursor operation is attempted.
CURSOR_ALREADY_OPEN Raised on an attempt to open a cursor that is already open.
NOT_LOGGED_ON Raised when a database call is made without being logged on
to ORACLE.
LOGIN_DENIED Raised when login to ORACLE fails because of an invalid
username or password.
OTHERS Raised when all other exception handlers fail to catch the error.
It is possible to use the WHEN OTHERS clause in the exception-handling part
of the PL/SQL block. It takes care of all exceptions that are not handled
elsewhere in the code.
The syntax for exception handling portion of PL/SQL block is as follows:
EXCEPTION
WHEN exception_1 THEN Statements;
WHEN exception_2 THEN Statements;
- - --- ---- -- ---
END;
In this syntax, exception_1 and exception_2 are the names of exceptions
(predefined or user-defined). The corresponding statements in the PL/SQL
code are executed when the named exception is raised.
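Putting the pieces together, a sketch of a block that traps two predefined exceptions and falls back to WHEN OTHERS (table and column names are taken from the earlier EMP examples; Message again stands in for an output routine):

```sql
DECLARE
   pay emp.salary%TYPE;
BEGIN
   SELECT salary INTO pay FROM EMP WHERE name = 'SMITH';
EXCEPTION
   WHEN no_data_found THEN
      Message ('no such employee');
   WHEN too_many_rows THEN
      Message ('more than one matching row');
   WHEN OTHERS THEN
      Message ('unexpected error');
END;
```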
EXAMPLE-1:
This example writes PL/SQL code for validating the accnt_id field of the
Accounts table so that it must not be left blank; if it is blank, the cursor
should not be allowed to move to the next field.
DECLARE
   no_value EXCEPTION;
BEGIN
   IF :Accounts.accnt_id IS NULL THEN
      RAISE no_value;
   ELSE
      next_field;
   END IF;
EXCEPTION
   WHEN no_value THEN
      Message ('account id cannot be blank, please enter valid data!!!');
      go_field (:system.cursor_field);
END;
In the above example the accnt_id field of the Accounts table is checked for
a NULL value. If it is NULL, the RAISE statement transfers control to the
exception handler no_value. This exception name no_value is declared in the
DECLARE section and defined in the EXCEPTION section of the PL/SQL block
using the WHEN statement; no_value is a user-defined exception.
EXAMPLE-2:
DECLARE
   balance Accounts.bal%TYPE;
   account_num Accounts.accnt_id%TYPE;
BEGIN
   SELECT accnt_id, bal INTO account_num, balance
   FROM Accounts WHERE accnt_id > 0000;
EXCEPTION
   WHEN no_data_found THEN
      Message ('empty table');
END;
The above example uses a predefined internal PL/SQL exception
(NO_DATA_FOUND). Therefore it requires neither a declaration in the DECLARE
section nor a RAISE statement in the BEGIN ... END portion of the block:
even though it is not explicitly raised, the ORACLE server raises this
exception when the SELECT returns no rows.
PL/SQL FUNCTIONS AND PROCEDURES :
PL/SQL allows you to define functions and procedures. These are similar to
functions and procedures defined in other languages, and each is defined as
one PL/SQL block.
FUNCTIONS :
The syntax for defining a function is as follows :
FUNCTION name [ (argument-list) ] RETURN data-type {IS, AS}
Variable-declarations
BEGIN
Program-code
[ EXCEPTION
error-handling-code]
END;
In this syntax,
name The name you want to give the function.
argument-list The list of input and/or output parameters for the function.
data-type The data type of the function's return value.
variable-declarations Where you declare any variables that are local to the function.
program-code Where you write the PL/SQL statements that make up the function.
error-handling-code Where you write any error-handling routine.
Notice that the function block is similar to the PL/SQL block discussed
earlier. The keyword DECLARE has been replaced by the FUNCTION header,
which names the function, describes the parameters and indicates the return
type. A function is called by using name(argument-list).
Example:
FUNCTION check (b_exp IN BOOLEAN,
                true_number IN NUMBER,
                false_number IN NUMBER)
RETURN NUMBER IS
BEGIN
   IF b_exp THEN
      RETURN true_number;
   ELSE
      RETURN false_number;
   END IF;
END;
The above function can be called as follows.
check (2 > 1, 1, 0)
check (5 = 0, 1, 0)
PROCEDURES:
The declaration of a procedure is almost identical to that of a function,
and the syntax is given below.
PROCEDURE name [(argument list)] {IS,AS}
Variable declaration
BEGIN
Program code
[EXCEPTION
Error handling code ]
END;
Here name is the name that you want to give the procedure; all the other
parts are as in a function declaration. A procedure declaration resembles a
function declaration except that there is no RETURN data type and the
keyword PROCEDURE is used instead of FUNCTION.
Ex: PROCEDURE swapn (A IN OUT NUMBER, B IN OUT NUMBER) IS
    temp_num NUMBER;
BEGIN
    temp_num := A;
    A := B;
    B := temp_num;
END;
The above procedure can be called as follows (IN OUT parameters must be
variables, not literals):
swapn (num1, num2);
DATABASE TRIGGERS :
PL/SQL can be used to write database triggers. Triggers are used to define code
that is executed/fired when certain events occur. At the database level,
triggers can be defined for events such as inserting a record into a table, deleting a
record, and updating a record.
100
Database Management system Dept of Computer Science & Engg, VJCET
The trigger definition consists of following basic parts.
The event that fires the trigger
The database table on which event must occur
An optional condition controlling when the trigger is executed
A PL/SQL block containing the code to be executed when the trigger is fired.
A trigger is a database object, like a table or an index. When you define a trigger,
it becomes a part of the database and is always executed when the event for
which it is defined occurs.
Syntax for creating a data base trigger is shown below.
CREATE [ or REPLACE ] TRIGGER trigger-name
{ BEFORE | AFTER } verb-list ON table-name
[ FOR EACH ROW [ WHEN condition ] ]
DECLARE
Declarations
BEGIN
PL/SQL code
END;
In the above syntax
REPLACE            Recreates the trigger if it already exists.
trigger-name       The name of the trigger to be created.
verb-list          The SQL verbs that fire the trigger, i.e. INSERT, UPDATE or DELETE.
table-name         The table on which the trigger is defined.
condition          An optional condition placed on the execution of the trigger.
declarations       Any variable, record or cursor declarations needed by the PL/SQL block.
PL/SQL code        The PL/SQL code that is executed when the trigger fires.
EXAMPLE:
CREATE TRIGGER check_salary
BEFORE INSERT OR UPDATE OF sal, job ON emp
FOR EACH ROW WHEN ( new.job != 'director' )
DECLARE
    minsal NUMBER;
    maxsal NUMBER;
BEGIN
    SELECT min_sal, max_sal INTO minsal, maxsal
    FROM salary_mast WHERE job = :new.job;
    IF ( :new.sal < minsal OR :new.sal > maxsal ) THEN
        Message ( 'salary out of range' );
    END IF;
END;
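The same idea can be sketched outside ORACLE. Below is a minimal SQLite analogue of a BEFORE INSERT salary-check trigger like the one above, runnable from Python. The table layouts (emp, salary_mast) are assumptions for illustration, and SQLite's RAISE function stands in for the Message() call:

```python
import sqlite3

# In-memory database with illustrative emp and salary_mast tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salary_mast (job TEXT PRIMARY KEY, min_sal REAL, max_sal REAL);
CREATE TABLE emp (emp_no INTEGER, job TEXT, sal REAL);
INSERT INTO salary_mast VALUES ('clerk', 10000, 20000);

-- Fires before each INSERT on emp, skipped for directors, and aborts
-- the statement when the salary is outside the allowed range.
CREATE TRIGGER check_salary
BEFORE INSERT ON emp
WHEN NEW.job != 'director'
BEGIN
    SELECT RAISE(ABORT, 'salary out of range')
    WHERE NEW.sal NOT BETWEEN
          (SELECT min_sal FROM salary_mast WHERE job = NEW.job)
      AND (SELECT max_sal FROM salary_mast WHERE job = NEW.job);
END;
""")

conn.execute("INSERT INTO emp VALUES (1, 'clerk', 15000)")      # accepted
try:
    conn.execute("INSERT INTO emp VALUES (2, 'clerk', 50000)")  # rejected
except sqlite3.IntegrityError as e:
    print(e)   # prints: salary out of range
```

The out-of-range row never reaches the table: RAISE(ABORT, ...) rolls back the offending statement, surfacing in Python as an IntegrityError.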
3.15 CURSOR IN PL/SQL:
PL/SQL cursors provide a way for your program to select multiple rows of data
from the database and then to process each row individually. Cursors are PL/SQL
constructs that enable you to process, one row at a time, the results of a multi row
query.
ORACLE uses work areas to execute SQL statements, and PL/SQL allows the user
to name private work areas and access the stored information. The PL/SQL
construct used to identify and access such a work area is called a cursor.
There are 2 types of cursors.
Implicit cursors
Explicit cursors
Implicit cursors are declared by ORACLE for each UPDATE, DELETE and
INSERT SQL command. Explicit cursors are declared and used by the user to
process multiple rows returned by a SELECT statement.
The set of rows returned by a query is called the Active Set. Its size depends on
the number of rows that meet the search criteria of the SQL query. The data that is
stored in the cursor is called the Active Data Set.
An ORACLE cursor is a mechanism used to easily process multiple rows of data.
Cursors contain a pointer that keeps track of the current row being accessed, which
enables your program to process the rows one at a time.
EXAMPLE:
When a user executes the following SELECT statement
SELECT emp_no, emp_name, job, salary
FROM employee
WHERE dept = 'physics';
the resultant data set will be displayed as follows:

emp_no   emp_name        job               salary
1234     A. N. Sharanu   Asst. Professor   22,000.00
1345     N. Bharath      Senior Lecturer   17,000.00
1400     M. Mala         Lab Incharge       9,000.00

Table 3.1
1) EXPLICIT CURSOR MANAGEMENT :
The following are the steps to using explicitly defined cursors within PL/SQL
Declare the cursor
Open the cursor
Fetch data from the cursor
Close the cursor
Declaring the cursor :
Declaring a cursor enables you to define the cursor and assign a name to it. It has
the following syntax.
CURSOR cursor-name
IS SELECT statement
Ex: CURSOR c_name IS
SELECT emp_name FROM Emp WHERE dept = 'physics'
Opening a cursor:
Opening a cursor executes the query and identifies the active set that contains
all the rows, which meet the query search criteria.
Syntax :
OPEN cursor_name
Ex:
OPEN c_name
The OPEN statement retrieves the records from the database and places them in
the cursor (private SQL area).
Fetching data from cursor:
The FETCH statement retrieves the rows from the active set one row at a time. The
FETCH statement is usually used in conjunction with an iterative process
(looping statements): the cursor advances to the next row in
the active set each time the FETCH command is executed. The FETCH command is
the only means to navigate through the active set.
Syntax : FETCH cursor-name INTO record-list
Record-list is the list of variables that will receive the columns (fields ) from the active set.
Ex: LOOP
-----------
------------
FETCH c_name INTO name;
-----------
END LOOP;
Closing a cursor :
The CLOSE statement deactivates the previously opened cursor and
makes the active set undefined. Once a cursor is closed, you cannot perform any
operations on it, but you can reopen it by using the OPEN statement.
Syntax : CLOSE cursor_name
Ex: CLOSE c_name;
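The declare/open/fetch/close life cycle described above can be sketched with Python's DB-API, whose cursors follow the same pattern. The emp table and its rows are invented for illustration:

```python
import sqlite3

# Set up an in-memory table standing in for Emp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_name TEXT, dept TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("Anu", "physics"), ("Binu", "physics"), ("Chitra", "maths")])

cur = conn.cursor()                       # declare the cursor
cur.execute("SELECT emp_name FROM emp WHERE dept = 'physics' "
            "ORDER BY emp_name")          # open: builds the active set
names = []
while True:
    row = cur.fetchone()                  # fetch one row of the active set
    if row is None:                       # no more rows (like %NOTFOUND)
        break
    names.append(row[0])
cur.close()                               # close the cursor
print(names)                              # ['Anu', 'Binu']
```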
EXAMPLE-1 :
The HRD manager has decided to raise the salary of all the employees in
the physics department by 5% (0.05). Whenever any such raise is given to the
employees, a record of it is maintained in the emp_raise table (the table definitions
are given below). Write a PL/SQL block to update the salary of each employee and
insert a record in the emp_raise table.
Table: employee
emp_code varchar (10)
emp_name varchar (10)
dept varchar (15)
job varchar (15)
salary number (6,2)
Table: emp_raise
emp_code Varchar(10)
raise_date Date
raise_amt Number(6,2)
Solution:
DECLARE
CURSOR c_emp IS
SELECT emp_code, salary FROM employee
WHERE dept = 'physics';
str_emp_code employee.emp_code %TYPE;
num_salary employee.salary %TYPE;
BEGIN
    OPEN c_emp;
    LOOP
        FETCH c_emp INTO str_emp_code, num_salary;
        EXIT WHEN c_emp%NOTFOUND;
        UPDATE employee SET salary = num_salary + (num_salary * 0.05)
        WHERE emp_code = str_emp_code;
        INSERT INTO emp_raise
        VALUES ( str_emp_code, sysdate, num_salary * 0.05 );
    END LOOP;
    COMMIT;
    CLOSE c_emp;
END;
2) EXPLICIT CURSOR ATTRIBUTES:
ORACLE provides certain attributes/cursor variables to control the execution of
the cursor. Whenever any cursor (explicit or implicit) is opened and used,
ORACLE creates a set of four system variables via which ORACLE keeps track of
the "current status" of the cursor. Programmers can access these variables. They
are:
%NOTFOUND: Evaluates to TRUE if the last fetch failed, i.e. no more rows are
left.
Syntax: cursor_name%NOTFOUND
%FOUND: Evaluates to TRUE when the last fetch succeeded.
Syntax: cursor_name%FOUND
%ISOPEN: Evaluates to TRUE if the cursor is open, otherwise evaluates to FALSE.
Syntax: cursor_name%ISOPEN
%ROWCOUNT: Returns the number of rows fetched so far.
Syntax: cursor_name%ROWCOUNT
EXAMPLE :
DECLARE
    v_emp_name varchar2(32);
    v_salary_rate number(6,2);
    v_payroll_total number(9,2);
    v_pay_type char;
    not_opened EXCEPTION;
    CURSOR c_emp IS
        SELECT emp_name, pay_rate, pay_type FROM employee
        WHERE emp_dept = 'physics';
BEGIN
    IF c_emp%ISOPEN THEN
        RAISE not_opened;
    ELSE
        OPEN c_emp;
        LOOP
            FETCH c_emp INTO v_emp_name, v_salary_rate, v_pay_type;
            EXIT WHEN c_emp%NOTFOUND;
            IF v_pay_type = 'S' THEN
                v_payroll_total := v_salary_rate * 1.25;
            ELSE
                v_payroll_total := v_salary_rate * 40;
            END IF;
            INSERT INTO weekly_salary VALUES ( v_payroll_total );
        END LOOP;
        CLOSE c_emp;
    END IF;
EXCEPTION
    WHEN not_opened THEN
        Message ( 'cursor is not opened' );
END;
REFERENCES:
1. Teach Yourself PL/SQL in 21 Days - SAMS Publications.
2. ORACLE-7 - Ivan Bayross.
3. ORACLE Developer 2000 - Ivan Bayross.
4. ORACLE Developer's Guide - David McClanahan.
MODULE 4
4.1 Introduction
Measure of Quality
We can discuss the goodness of a relation schema at two levels.
1. Logical Level
The logical level is the middle level in the three-level architecture of a DBMS. It
describes how users interpret the relation schemas and the meaning of their
attributes. Having good relation schemas at this level enables users to understand
clearly the meaning of the data in the relations and hence to formulate queries
correctly.
2. Implementation Level
This is the lowermost level in the DBMS architecture, which describes how the
tuples in the base relations are stored and updated. This level applies only to the
storage of the database, whereas the logical level applies to both the view
level and the logical level. A database is only as effective as the storage scheme
that underlies it.
4.2 Database Design Techniques
Generally we can design a database using two different approaches.
1. Top-Down Design (Analysis) Methodology
It starts with the major entities of interest, their attributes and their relationships.
We then add other entities, possibly split these entities into a number of
specialized entities, and add the relationships between them.
2. Bottom-Up Design (Synthesis) Methodology
It starts with a set of attributes. These attributes are grouped into entities, and the
relationships between these entities are identified. Then we identify the
higher-level entities, generalize these entities, and locate relationships at this
higher level.
Problems with a bad schema
1. Redundant storage of data
2. Wastage of disk space
3. Longer running time
Informal Design Guidelines for relation schema
The following are informal measures of quality for relation schema design.
1. Semantics of the relation attribute
All attributes belonging to a relation have a certain real-world meaning and a proper interpretation associated with them. The semantics specify how to interpret the attribute values stored in a tuple of the relation. Guideline: Design each relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation; otherwise the semantics become ambiguous and the relation cannot be easily explained.
2. Reducing the redundant values in tuples and update anomalies
One goal of schema design is to minimize the storage space that the base relations occupy. The anomalies that may be present in a database relation can be classified into three categories.
1. Insertion anomalies
2. Deletion anomalies
3. Modification anomalies
Guideline: Design relation schemas so that no insertion, deletion or modification anomalies are present in the relations. If any anomalies are present, note them clearly and make sure that the programs that update the database will operate correctly.
3. Null values in tuples
Null values in tuples waste storage space at the storage level and may also lead to problems with understanding the meaning of attributes and with specifying join operations at the logical level. Another problem is how aggregate functions should account for null-valued attributes. Guideline: As far as possible, avoid placing attributes in a base relation whose values may frequently be null. If nulls are unavoidable, make sure that they apply in exceptional cases only.
4.3 Constraints
Constraints on a database can generally be divided into four main categories.
1. Inherent model-based: Constraints that are inherent to the data
model are called inherent model-based constraints; for example, a relation
cannot have duplicate tuples.
2. Schema-based: Constraints that can be directly expressed in the
schemas of the data model, typically specified in the DDL, are
called schema-based constraints.
3. Application-based: Constraints expressed and enforced by the application
programs are called application-based constraints.
4. Data dependency: These are constraints relating to the
dependencies between the values in relations.
Now we can go through the details of schema-based constraints. These
schema based constraints are expressed in relational model. It includes five basic
constraints.
1. Domain constraint
2. Key constraint
3. Entity integrity constraint
4. Referential integrity constraint
5. Constraint on nulls
4.4 Domain constraint
A domain D represents a set of atomic values. The data type describing the
type of values that can appear in each column is represented by this domain, i.e.
each value in the domain is indivisible as far as the relational model is concerned.
The domain is specified by giving the data type from which its values are
drawn. A domain is given a name, a data type and a format.
4.5 Entity integrity constraint
The entity integrity constraint states that no primary key value can be null. The
primary key value is used to identify individual tuples; a null value in the primary
key implies that we cannot identify some tuples. The key constraint and entity
integrity constraint are specified on individual relations.
4.6 Referential integrity constraint
The referential integrity constraint is specified between two relations and is used
to maintain consistency among the tuples of the two relations. It states that a
tuple in one relation that refers to another relation must refer to an existing tuple
in that relation. To define the referential integrity constraint we first have to
define the concept of a foreign key (FK).
A set of attributes FK in relation schema R1 is a foreign key of R1 that references
relation R2 if it satisfies the following two rules.
1. The attributes in FK have the same domain as the primary key attributes PK
of R2. The attributes FK are said to reference or refer to the relation R2.
2. A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value
of PK for some tuple t2 in the current state r2(R2) or is null.
If t1[FK] = t2[PK], we say that the tuple t1 refers to the tuple t2.
R1 is the referencing relation and R2 is the referenced relation.
Key constraint
A relation is defined as a set of tuples, and by definition all the tuples in a
relation are distinct, i.e. no two tuples can have the same values for all
their attributes. There are subsets of a relation schema R with the property that
no two tuples in any relation state r of R should have the same values for these
attributes; such a subset is called a superkey.
A key K of a relation schema R is a superkey of R with the additional property
that removing any attribute A from K leaves a set of attributes K' that is not a
superkey of R any more. Hence a key satisfies the following two constraints.
1. Two distinct tuples in any state of the relation cannot have identical
values for all the attributes in the key.
2. It is a minimal superkey, i.e. we cannot remove any attribute from it and
still have the uniqueness constraint of the first condition hold.
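These two conditions can be checked mechanically. The following is a minimal Python sketch; the helper names closure and is_key are ours, not from the text, and the example schema is invented:

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_key(K, R, fds):
    if closure(K, fds) != set(R):           # condition 1: must be a superkey
        return False
    return all(closure(sub, fds) != set(R)  # condition 2: must be minimal
               for r in range(1, len(K))
               for sub in combinations(sorted(K), r))

# R(A, B, C) with A -> B and B -> C: {A} is a key, {A, B} is not minimal.
R = {"A", "B", "C"}
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(is_key({"A"}, R, fds))       # True
print(is_key({"A", "B"}, R, fds))  # False: {A} alone is already a superkey
```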
Null constraint
It specifies whether null values are permitted for an attribute in the database.
4.7 Functional Dependency (FD)
A functional dependency is a constraint between two sets of attributes of the
database.
Definition: An FD, denoted X → Y, between two sets of attributes X and Y that
are subsets of R specifies a constraint on the possible tuples that can form a
relation state r of R. The constraint is that, for any two tuples t1 and t2 in r such
that t1[X] = t2[X], we must also have t1[Y] = t2[Y]; i.e. the values of the Y
component of a tuple in r depend on the values of the X component, or the X
component determines the value of the Y component.
Note that:
1. If X is a candidate key of R (so that no relation state r(R) can have more
than one tuple with a given X value), then X → Y holds for any subset of
attributes Y of R.
2. X → Y in R says nothing about whether or not Y → X holds in R.
A functional dependency X → Y is called trivial if Y is a subset of X.
Definition: A functional dependency, denoted by X→Y, between two sets of
attributes X and Y that are subsets of the attributes of relation R, specifies that the
values in a tuple corresponding to the attributes in Y are uniquely determined by
the values corresponding to the attributes in X.
For example, the social security number uniquely determines a name;
SSN→ Name
Functional dependencies are determined by the semantics of the relation; in
general, they cannot be determined by inspection of an instance of the relation.
That is, a functional dependency is a constraint, not a property derived from a
relation.
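While an FD cannot be inferred from a single instance, a given instance can be tested for violations of a stated FD X → Y. A small illustrative sketch (the function name and dict-based rows are our own assumptions):

```python
def satisfies_fd(rows, X, Y):
    """True if no two rows agree on the X attributes but differ on Y."""
    seen = {}
    for t in rows:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False          # two tuples agree on X but differ on Y
        seen[x_val] = y_val
    return True

rows = [
    {"SSN": 1, "Name": "Anu"},
    {"SSN": 2, "Name": "Binu"},
    {"SSN": 1, "Name": "Anu"},    # repeated SSN with the same Name: fine
]
print(satisfies_fd(rows, ["SSN"], ["Name"]))   # True
rows.append({"SSN": 2, "Name": "Chitra"})      # SSN 2 now maps to two names
print(satisfies_fd(rows, ["SSN"], ["Name"]))   # False
```

A True result only means this instance does not violate SSN → Name; it does not prove the dependency holds for the schema.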
Inference rules
Armstrong's axioms are sound and complete, i.e. they enable the derivation of
every functional dependency implied by a given set.
The basic inference rules are:
1. Reflexivity - if the B's are a subset of the A's then A → B
2. Augmentation - If A → B, then A, C → B, C.
3. Transitivity - If A → B and B → C then A → C.
Additional inference rules
4. Decomposition - If A → B, C then A → B
5. Union - If A → B and A → C then A → B, C
6. Pseudo transitive - If A → B and C, B → D then C, A → D
Equivalence of sets of functional dependencies
Two sets of functional dependencies S and T are equivalent iff every FD in T can
be inferred from S and every FD in S can be inferred from T, i.e. S+ = T+.
The dependency {A_1, ..., A_n} → {B_1, ..., B_m}
is trivial if the B's are a subset of the A's
is nontrivial if at least one of the B's is not among the A's
is completely nontrivial if none of the B's is also one of the A's
Closure (F+)
All the dependencies in F, together with all the dependencies that can be inferred
from F using the above rules, form the closure of F, denoted F+.
Algorithm to compute closure
We want to determine whether F ⊨ X → Y, i.e. whether X → Y ∈ F+. Rather
than computing all of F+, a better method is to compute X+, the closure of X
under F, and then test whether Y ⊆ X+.
Algorithm:
Input: A set of FDs F and a set of attributes X.
Output: The closure X+ of X under F.
X+ := X;
change := true;
while change do
begin
    change := false;
    for each FD W → Z in F do
    begin
        if W ⊆ X+ and Z ⊄ X+ then
        begin
            X+ := X+ ∪ Z;
            change := true;
        end;
    end;
end;
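The algorithm above translates almost line for line into Python. This is an illustrative sketch (the function name attribute_closure and the sample FDs are ours):

```python
def attribute_closure(X, fds):
    """Closure X+ of attribute set X under a list of FDs (W, Z)."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for W, Z in fds:
            if W <= closure and not Z <= closure:  # W subset of X+, Z adds something
                closure |= Z                        # X+ := X+ U Z
                changed = True
    return closure

fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]
print(sorted(attribute_closure({"A"}, fds)))      # ['A', 'B', 'C']
# F entails A -> C because C is in {A}+ ; E is not derivable from A alone:
print({"C"} <= attribute_closure({"A"}, fds))     # True
print({"E"} <= attribute_closure({"A"}, fds))     # False
```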
4.8 Normalization
In relational database theory, normalization is the process of restructuring the
logical data model of a database to eliminate redundancy, organize data
efficiently, reduce repeating data and to reduce the potential for anomalies
during data operations. Data normalization also may improve data consistency
and simplify future extension of the logical data model. The formal
classifications used for describing a relational database's level of normalization
are called normal forms (NF).
A non-normalized database can suffer from data anomalies:
A non-normalized database may store data representing a particular referent in
multiple locations. An update to such data in some but not all of those locations
results in an update anomaly, yielding inconsistent data. A normalized database
prevents such an anomaly by storing such data (i.e. data other than primary
keys) in only one location.
A non-normalized database may have inappropriate dependencies, i.e.
relationships between data with no functional dependencies. Adding data to such
a database may require first adding the unrelated dependency. A normalized
database prevents such insertion anomalies by ensuring that database relations
mirror functional dependencies.
Similarly, such dependencies in non-normalized databases can hinder deletion.
That is, deleting data from such databases may require deleting data from the
inappropriate dependency. A normalized database prevents such deletion
anomalies by ensuring that all records are uniquely identifiable and contain no
extraneous information.
4.9 Normal forms
Edgar F. Codd originally defined the first three normal forms.
The first normal form requires that tables be made up of a primary key and a
number of atomic fields, and the second and third deal with the relationship of
non-key fields to the primary key. These have been summarized as requiring
that all non-key fields be dependent on "the key, the whole key and nothing but
the key". In practice, most tables in 3NF are considered fully normalized. However,
research has identified potential update anomalies in 3NF databases. BCNF is a
further refinement of 3NF that attempts to eliminate such anomalies.
The fourth and fifth normal forms (4NF and 5NF) deal specifically with the
representation of many-to-many and one-to-many relationships. Sixth normal
form (6NF) applies only to temporal databases.
4.10 First normal form (1NF)
First normal form (1NF) lays the groundwork for an organized database
design:
Ensure that each table has a primary key: a minimal set of attributes that can
uniquely identify a record. 1NF states that the domain of an attribute must
include only atomic values and that the value of any attribute in a tuple must be
a single value from the domain of that attribute. It does not allow nested
relations. Data that is redundantly duplicated across multiple rows of a table is
moved out to a separate table.
Atomicity: Each attribute must contain a single value, not a set of values.
Eg: Consider a relation Person. A person has the attributes SSN, Name,
Age, Address and College_Degree.

Person

SSN   Name   Address   Age   College_Degree

Table 4.1
Now we can analyze this relation by checking the possible values of each
attribute. SSN and Age have only one value per person, but College_Degree
can have more than one value, and the Address and Name of a person can each
be divided into more than one attribute. Hence this relation is not in 1NF. So let
us change this schema into 1NF by dividing the relation into two relations.
Name→ FName, MInit, LName
Address→ ApartmentNo, City
Person_Residence

SSN   FName   LName   MInit   ApartmentNo   City

Table 4.2

College_Degree

SSN   UG   PG

Table 4.3
4.11 Second normal form (2NF)
First, the table must be in 1NF; in addition, every
non-primary-key attribute (field) must be fully functionally dependent upon the
ENTIRE primary key for its existence. This rule ONLY applies when you have
a multi-part (concatenated) primary key (PK).
It requires that data stored in a table with a composite primary key must not be
dependent on only part of the table's primary key. And the database must meet
all the requirements of the first normal form.
Take each non-key field and ask this question: if I knew part of the PK, could I
tell what the non-key field would be?
Inventory

Description   Supplier   Cost   Supplier_Address

Table 4.4
In this inventory table, Description combined with Supplier is our PK. This is
because we have two of the same product that come from different suppliers.
There are two non-key fields. So, we can ask the questions:
If we know just Description, can we find out Cost? No, because we have more
than one supplier for the same product.
If we know just Supplier, can we find out Cost? No, because we need to know
what the item is as well.
Therefore, Cost is fully, functionally dependent upon the ENTIRE PK
(Description-Supplier) for its existence.
If we know just Description, can we find out Supplier Address? No, because
we have more than one supplier for the same product.
If we know just Supplier, Can we find out Supplier Address? Yes. The
Address does not depend upon the Description of the item.
Therefore, Supplier Address is NOT functionally dependent upon the ENTIRE PK
(Description-Supplier) for its existence.
We must get rid of Supplier Address from this table.
Inventory

Description   Supplier   Cost

Table 4.5

Supplier

Name   Supplier_Address

Table 4.6
At this point, since it is the "Supplier" table, we can rename the "Supplier"
field to "Name". Name is the PK for this new table.
General Definition:
A relation schema R is in second normal form (2NF) if every nonprime
attribute A in R is not partially dependent on any key of R.
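This definition can be tested mechanically on the Inventory example: a partial dependency exists when a proper subset of the composite key determines a nonprime attribute. A rough Python sketch (helper names are illustrative, not from the text):

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def partial_dependencies(key, nonprime, fds):
    """Nonprime attributes determined by a proper subset of the key."""
    found = []
    for r in range(1, len(key)):
        for part in combinations(sorted(key), r):
            det = closure(part, fds)
            for a in nonprime:
                if a in det:
                    found.append((set(part), a))
    return found

key = {"Description", "Supplier"}
fds = [({"Description", "Supplier"}, {"Cost"}),
       ({"Supplier"}, {"Supplier_Address"})]
print(partial_dependencies(key, {"Cost", "Supplier_Address"}, fds))
# -> [({'Supplier'}, 'Supplier_Address')]
```

The reported pair is exactly the 2NF violation found informally above: Supplier alone determines Supplier_Address, so that attribute must move to its own table.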
4.12 Third normal form (3NF)
For 3NF, the table must first be in 2NF; in addition, we want to make sure
that the non-key fields are dependent upon ONLY the PK, and not on any other
field in the table. This is very similar to 2NF, except that now you are
comparing the non-key fields to OTHER non-key fields.
For database to be in third normal form
1. The database must meet all the requirements of the second normal form.
2. Any field which is dependent not only on the primary key but also on another
field is moved out to a separate table.
Book

Name   Auth_Name   #Pages   Auth_Affil_No

Table 4.7
Again, just ask the questions:
If I know # of Pages, can I find out Author's Name? No. Can I find out
Author's affiliation No? No.
If I know Author's Name, can I find out # of Pages? No. Can I find out
Author's affiliation No? YES.
Therefore, Author's affiliation No is functionally dependent upon Author's
Name, not the PK for its existence.
Book

Name   Auth_Name   #Pages

Table 4.8

Author

Auth_Name   Auth_Affil_No

Table 4.9
General Definition:
A relation schema R is in 3NF if, whenever a nontrivial functional
dependency X → A holds in R,
either a) X is a superkey of R,
or b) A is a prime attribute of R.
i.e. A relation schema R is in 3NF if every nonprime attribute of R meets both of
the following terms:
1. It is fully functionally dependent on every key of R.
2. It is nontransitively dependent on every key of R.
4.13 Boyce-Codd normal form (BCNF)
A relation is in BCNF if and only if every determinant is a candidate key.
The second and third normal forms assume that all attributes not part of the
candidate keys depend on the candidate keys, but they do not deal with
dependencies within the keys. BCNF deals with such dependencies.
A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not
in X, then X is a candidate key for R.
BCNF covers very specific situations where 3NF misses interdependencies
between non key attributes. It should be noted that most relations that are in
3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF and this
happens only if
(a) the candidate keys in the relation are composite keys (that is, they are not
single attributes),
(b) there is more than one candidate key in the relation, and
(c) the keys are not disjoint, that is, some attributes in the keys are common.
The BCNF differs from 3NF only when there is more than one candidate
key, and the keys are composite and overlapping. Consider, for example, the
relation
enrol (sno, sname, cno, cname, date-enrolled)
Let us assume that the relation has the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
(we have assumed sname and cname are unique identifiers). The relation is in
3NF but not in BCNF because there are dependencies
sno -> sname
cno -> cname
where attributes that are part of a candidate key are dependent on part of
another candidate key. Such dependencies indicate that although the relation is
about some entity or association that is identified by the candidate keys
e.g. (sno, cno), there are attributes that are not about the whole thing that the
keys identify. For example, the above relation is about an association
(enrolment) between students and subjects and therefore the relation needs to
include only one identifier to identify students and one identifier to identify
subjects. Providing two identifiers about students (sno, sname) and two keys
about subjects (cno, cname) means that some information about students and
subjects that is not needed is being provided. This provision of information
will result in repetition of information and the anomalies. If we wish to include
further information about students and courses in the database, it should not be
done by putting the information in the present relation but by creating new
relations that represent information about entities student and subject.
These difficulties may be overcome by decomposing the above relation in the
following three relations:
(sno, sname)
(cno, cname)
(sno, cno, date-of-enrolment)
We now have a relation that only has information about students, another only
about subjects and the third only about enrolments. All the anomalies and
repetition of information have been removed.
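The enrol example can be checked mechanically against the 3NF and BCNF conditions. A rough sketch (the helper names, and abbreviating date-enrolled, are our own choices):

```python
def closure(attrs, fds):
    """Closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

R = {"sno", "sname", "cno", "cname", "date-enrolled"}
fds = [({"sno"}, {"sname"}), ({"sname"}, {"sno"}),
       ({"cno"}, {"cname"}), ({"cname"}, {"cno"}),
       ({"sno", "cno"}, {"date-enrolled"})]
prime = {"sno", "sname", "cno", "cname"}   # attributes of the four candidate keys

def ok_bcnf(lhs, rhs):
    return closure(lhs, fds) == R           # the determinant must be a superkey

def ok_3nf(lhs, rhs):
    return ok_bcnf(lhs, rhs) or set(rhs) <= prime

print(ok_3nf({"sno"}, {"sname"}))    # True : sname is a prime attribute
print(ok_bcnf({"sno"}, {"sname"}))   # False: sno alone is not a superkey
```

This confirms the claim above: sno -> sname passes the 3NF test only through the prime-attribute escape clause, and it is precisely such dependencies that BCNF rejects.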
4.14 Multivalued Dependency and Fourth normal form
In a relational model, if all of the information about an entity is to be
represented in one relation, it is necessary to repeat all the information
other than the multivalued attribute value in order to represent all the
information we wish to represent. This results in many tuples about the same
instance of the entity in the relation, and in the relation having a composite key
(the entity id and the multivalued attribute). Of course, the other option
suggested was to represent this multivalued information in a separate relation.
The situation becomes much worse if an entity has more than one multivalued
attribute and these values are represented in one relation by a number of
tuples for each entity instance. Multivalued dependency relates to this
problem of more than one multivalued attribute. Consider the
following relation that represents an entity employee that has one multivalued
attribute proj:
emp (e#, dept, salary, proj)
We have so far considered normalization based on functional dependencies;
dependencies that apply only to single-valued facts. For example, e# -> dept
implies only one dept value for each value of e#. Not all information in a
database is single-valued, for example, proj in an employee relation may be
the list of all projects that the employee is currently working on. Although e#
determines the list of all projects that an employee is working on, e# -> proj is
not a functional dependency.
We can make multivalued dependency clearer with the following example.
programmer (emp_name, qualifications, languages)
This relation includes two multivalued attributes of entity programmer;
qualifications and languages. There are no functional dependencies.
The attributes qualifications and languages are assumed independent of each
other. If we were to consider qualifications and languages separate entities,
we would have two relationships (one between employees and qualifications
and the other between employees and programming languages). Both the
above relationships are many-to-many i.e. one programmer could have several
qualifications and may know several programming languages. Also one
qualification may be obtained by several programmers and one programming
language may be known to many programmers.
Functional dependency A -> B relates one value of A to one value of B while
multivalued dependency A ->> B defines a relationship in which a set of
values of attribute B are determined by a single value of A.
Now, more formally, X ->> Y is said to hold for R(X, Y, Z) if, whenever t1 and
t2 are two tuples in R that have the same values for the attributes X, i.e.
t1[X] = t2[X], then R also contains tuples t3 and t4 (not necessarily distinct)
such that
t1[X] = t2[X] = t3[X] = t4[X]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]
In other words if t1 and t2 are given by
t1 = [X, Y1, Z1], and
t2 = [X, Y2, Z2]
then there must be tuples t3 and t4 such that
t3 = [X, Y1, Z2], and
t4 = [X, Y2, Z1]
We are therefore insisting that every value of Y appears with every value of Z
to keep the relation instances consistent. In other words, the above conditions
insist that X alone determines Y and Z and there is no relationship between Y
and Z since Y and Z appear in every possible pair and hence these pairings
present no information and are of no significance.
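The swapped-tuple condition above can be checked mechanically. Below is a minimal Python sketch (the programmer relation and its values are invented for illustration) that tests whether X ->> Y holds by searching, for every ordered pair of tuples agreeing on X, for the required tuple t3 = [X, Y1, Z2]; the reversed pair covers t4:

```python
from itertools import product

def holds_mvd(rows, x, y, z):
    """Return True if X ->> Y holds in `rows` (a list of dicts).

    For every ordered pair (t1, t2) agreeing on X, the tuple with
    X and Z values from t2 and Y values from t1 must also be present;
    checking the reversed pair (t2, t1) covers the symmetric tuple t4."""
    table = {tuple(sorted(r.items())) for r in rows}
    for t1, t2 in product(rows, rows):
        if all(t1[a] == t2[a] for a in x):
            t3 = dict(t2)          # X and Z values taken from t2 ...
            for a in y:
                t3[a] = t1[a]      # ... Y values taken from t1
            if tuple(sorted(t3.items())) not in table:
                return False
    return True

# programmer(emp_name, qualification, language): for one employee,
# every qualification must be paired with every language.
rows = [{"emp_name": "ann", "qualification": q, "language": l}
        for q, l in product(["BSc", "MSc"], ["C", "SQL"])]
print(holds_mvd(rows, ["emp_name"], ["qualification"], ["language"]))  # True

rows.pop()  # drop one (qualification, language) pairing
print(holds_mvd(rows, ["emp_name"], ["qualification"], ["language"]))  # False
```

With the full cross product of qualifications and languages the dependency holds; removing a single pairing violates it, which is exactly the consistency condition described above.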
Fourth normal form
Fourth normal form (or 4NF) requires that there be no non-trivial multivalued
dependencies of attribute sets on anything other than a superset of a
candidate key. A table is said to be in 4NF if and only if it is in BCNF and
all of its non-trivial multivalued dependencies are in fact functional
dependencies. 4NF thus removes an unwanted data structure: multivalued
dependencies.
Definition: A relation schema R is in 4NF with respect to a set of
dependencies F if, for every non-trivial multivalued dependency X ->> Y in F+,
X is a superkey for R.
Properties Of Relational Decompositions
Decomposition Property: A relation must satisfy the following two properties
during decomposition.
i. Lossless join property: This property of a decomposition ensures
that no spurious rows are generated when the decomposed relations are reunited
through a natural join operation. i.e. the information in an instance r of R
must be preserved in the instances r1, r2, r3, ..., rk where ri = ∏Ri(r).
A decomposition is lossless with respect to a set of functional dependencies F
if, for every relation instance r on R satisfying F,
r = ∏R1(r) * ∏R2(r) * . . . * ∏Rn(r)
ii. Dependency Preserving Property: If a set of functional dependencies F
holds on R, it should be possible to enforce F by enforcing appropriate
dependencies on each Ri.
Decomposition D = (R1, R2, R3, ..., Rk) of schema R preserves a set of
dependencies F if
(∏R1(F) U ∏R2(F) U . . . U ∏Rk(F))+ = F+
where ∏Ri(F) is the projection of F onto Ri.
i.e. any FD that logically follows from F must also logically follow from the
union of the projections of F onto the Ri's. Then D is called dependency preserving.
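The lossless-join property is easiest to see on a concrete instance. The sketch below (plain Python; the relation r(A, B, C) and the FD A -> B are invented for the example) projects an instance onto R1 = {A, B} and R2 = {A, C} and verifies that the natural join returns exactly the original tuples:

```python
def project(rows, attrs):
    """pi_attrs(r): project onto `attrs` and eliminate duplicates."""
    return {tuple(r[a] for a in attrs) for r in rows}

def natural_join(rows1, attrs1, rows2, attrs2):
    """r1 * r2 on the attributes common to both schemas."""
    common = [a for a in attrs1 if a in attrs2]
    out_attrs = attrs1 + [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in rows1:
        for t2 in rows2:
            d1 = dict(zip(attrs1, t1))
            d2 = dict(zip(attrs2, t2))
            if all(d1[c] == d2[c] for c in common):
                out.add(tuple({**d1, **d2}[a] for a in out_attrs))
    return out

# r(A, B, C) satisfying A -> B; the decomposition {R1(A, B), R2(A, C)}
# is lossless because R1 ∩ R2 = {A} and A -> B holds.
r = [{"A": 1, "B": "x", "C": 10},
     {"A": 1, "B": "x", "C": 20},
     {"A": 2, "B": "y", "C": 10}]
r1 = project(r, ["A", "B"])
r2 = project(r, ["A", "C"])
joined = natural_join(r1, ["A", "B"], r2, ["A", "C"])
original = {(t["A"], t["B"], t["C"]) for t in r}
print(joined == original)  # True: no tuples lost, no spurious tuples
```

Dropping the FD (e.g. giving the two A = 1 tuples different B values) would make the same join produce spurious tuples, i.e. the decomposition would be lossy.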
4.15 Join Dependency and Fifth Normal Form
Join dependency is the term used for the property of a relation
schema that cannot be decomposed losslessly into two relation schemas, but can
be decomposed losslessly into three or more simpler relation schemas. It means
that a table, after it has been decomposed into three or more smaller tables,
must be capable of being joined again on common keys to form the original
table.
Fifth normal form
Fifth normal form (5NF, also called project-join normal form or PJ/NF) requires
that there are no non-trivial join dependencies that do not follow from the key
constraints. A table is said to be in 5NF if and only if it is in 4NF and every
join dependency in it is implied by the candidate keys.
4.16 Pitfalls in Relational Database Design.
A bad design may have several properties, including:
Repetition of information.
Inability to represent certain information.
Loss of information.
Module 5
5.1 Distributed Database Concepts
A distributed computing system consists of a number of processing elements that
are interconnected by a computer network and that co-operate in performing certain
assigned tasks.
A distributed database (DDB) is a collection of multiple logically interrelated
databases distributed over a computer network. A distributed database management
system (DDBMS) is a software system that manages a distributed database while
making the distribution transparent to the user. At the physical hardware level, the
following main factors distinguish a DDBMS from a centralized system:
There are multiple computers called sites or nodes.
These sites must be connected by some type of communication
network to transmit data and commands among sites.
Parallel versus Distributed technology – There are two main types of
multiprocessor system architecture:
Shared memory (tightly coupled) architecture: Multiple processors share
secondary (disk) storage and also share primary memory.
Shared disk (loosely coupled) architecture: Multiple processors share
secondary (disk) storage but each has its own primary memory.
Database management systems developed using the above types of architectures are
termed parallel database management systems; rather than DDBMS they utilize
parallel processor technology. In another type of architecture called shared nothing
architecture, every processor has its own primary and secondary (disk) memory, no
common memory exists and the processors communicate over a high-speed
interconnection network. Although the shared nothing architecture resembles a
distributed database computing environment, major differences exist in the mode of
operation. In shared nothing architecture, there is symmetry and homogeneity of
nodes; this is not true of the distributed database environment where heterogeneity of
nodes is very common.
Advantages of Distributed Databases
1. Management of distributed data with different levels of transparency: Ideally,
a DBMS should be distribution transparent in the sense of hiding the details of where
each file is physically stored within the system. The following types of transparencies
are possible:
Distribution or network transparency: This refers to the freedom for the user
from the operational details of the network. It may be divided into location
transparency and naming transparency. Location transparency refers to the
fact that the command used to perform a task is independent of the location of
data and the location of the system where the command was issued. Naming
transparency implies that once a name is specified, the named objects can be
accessed unambiguously without additional specification.
Replication transparency: Copies of data may be stored at multiple sites for
better availability, performance and reliability. Replication transparency
makes the user unaware of the existence of these copies.
Fragmentation transparency: Fragmentation makes the user unaware of the
existence of fragments.
2. Increased availability and reliability: Reliability is defined as the probability that
a system is running at a certain time point. Availability is the probability that the
system is continuously available during a time interval. When the data and DBMS
software are distributed over several sites one site may fail while other sites continue
to operate. Only the data and software that exist at the failed site cannot be accessed.
This improves both reliability and availability.
3. Improved performance: A distributed DBMS fragments the database by keeping
the data closer to where it is needed most. Data localization reduces the contention for
CPU and I/O services and simultaneously reduces access delays involved in wide area
networks. When a large database is distributed over multiple sites, smaller databases
exist at each site. As a result, local queries and transactions accessing data at a single
site have better performance because of the small local databases. Moreover,
interquery and intraquery parallelism can be achieved by executing multiple queries at
different sites.
4. Easier expansion: In a distributed environment, expansion of the system in terms
of adding more data, increasing database sizes or adding more processors is much
easier.
Additional Functions of Distributed Databases
1. Keeping track of data: The ability to keep track of the data distribution,
fragmentation and replication by expanding the DBMS catalog.
2. Distributed Query processing: The ability to access remote sites and transmit
queries and data among the various sites via a communication network.
3. Distributed transaction management: The ability to devise execution strategies for
queries and transactions that access data from more than one site and to synchronize
the access to distributed data and maintain integrity of the overall database.
4. Replicated data management: The ability to decide which copy of a replicated data
item to access and to maintain the consistency of copies of a replicated data item.
5. Distributed database recovery: The ability to recover from individual site crashes
and from new types of failures such as the failure of communication links.
6. Security: Distributed transactions must be executed with the proper management of
the security of the data and the authorization/access privileges of users.
7. Distributed directory (catalog) management: A directory contains information
(metadata) about data in the database.
5.2 Data Fragmentation
This is the process of breaking up the database into logical units called fragments,
which may be assigned for storage at the various sites. There are mainly two types of
fragmentation:
Horizontal fragmentation
Vertical fragmentation
a) Horizontal fragmentation – A horizontal fragment of a relation is a subset of the
tuples in that relation. The tuples that belong to the horizontal fragment are specified
by a condition on one or more attributes of the relation. Often, only a single attribute
is involved. Horizontal fragmentation divides a relation “horizontally” by grouping
rows to create subsets of tuples, where each subset has a certain logical meaning.
These fragments can be assigned to different sites in the distributed system. Derived
horizontal fragmentation applies the partitioning of a primary relation to other
secondary relations which are related to the primary via a foreign key. Each horizontal
fragment on a relation R can be specified by a σCi(R) operation in the relational
algebra. A set of horizontal fragments whose conditions C1, C2, ..., Cn include all
the tuples in R (i.e. every tuple in R satisfies (C1 or C2 or ... or Cn)) is called a
complete horizontal fragmentation of R. In many cases, a complete horizontal
fragmentation is also disjoint; i.e. no tuple in R satisfies (Ci and Cj) for any i ≠ j.
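The completeness and disjointness conditions can be demonstrated on a small instance. The sketch below (illustrative Python; the Employee tuples and the choice of DNo as the fragmentation attribute are invented) fragments a relation horizontally and checks both properties:

```python
# Hypothetical Employee tuples, fragmented horizontally on DNo.
employees = [{"SSN": 1, "Name": "ann",  "DNo": 4},
             {"SSN": 2, "Name": "bob",  "DNo": 5},
             {"SSN": 3, "Name": "carl", "DNo": 5}]

# Each fragment is sigma_Ci(Employee) for a selection condition Ci.
conditions = [lambda t: t["DNo"] == 4, lambda t: t["DNo"] == 5]
fragments = [[t for t in employees if c(t)] for c in conditions]

# Completeness: every tuple satisfies (C1 or C2 or ... or Cn).
complete = all(any(c(t) for c in conditions) for t in employees)
# Disjointness: no tuple satisfies both Ci and Cj for i != j.
disjoint = all(sum(c(t) for c in conditions) <= 1 for t in employees)
print(complete, disjoint)  # True True
```

Adding a tuple with, say, DNo = 7 would break completeness; overlapping conditions such as DNo >= 4 and DNo == 5 would break disjointness.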
b) Vertical fragmentation – Vertical fragmentation divides a relation “vertically” by
columns. A vertical fragment of a relation keeps only certain attributes of the relation.
It is necessary to include the primary key or some candidate key attribute in every
vertical fragment so that the full relation can be reconstructed from the fragments. For
e.g.: Consider the schema Employee (Name, Bdate, Address, Sex, SSN, Salary, DNo).
We want to fragment this relation into 2 vertical fragments. The first fragment
includes personal information – Name, Address, Bdate and Sex – and the second
fragment includes work related information – SSN, Salary and DNo. This
fragmentation is not proper because, if the two fragments are stored separately we
cannot put the original employee tuples back together, since there is no common
attribute between the two fragments. Hence we must add SSN attribute to the personal
information fragment also. A vertical fragment on a relation R can be specified by a
ПLi(R) operation in the relational algebra. A set of vertical fragments whose projection
lists L1, L2, ……., Ln include all the attributes in R but share only the primary key
attribute of R is called a complete vertical fragmentation of R. In this case, the
projection lists satisfy the following conditions:
1. L1 U L2 U…..U Ln = ATTRS(R)
2. Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and
PK(R) is the primary key of R.
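The two conditions on the projection lists translate directly into set operations. A small sketch, reusing the Employee schema from the example above:

```python
ATTRS = {"Name", "Bdate", "Address", "Sex", "SSN", "Salary", "DNo"}
PK = {"SSN"}  # the key repeated in every vertical fragment

# Projection lists L1 and L2; SSN is included in the personal-information
# fragment so the original relation can be rebuilt by a join on SSN.
L1 = {"SSN", "Name", "Bdate", "Address", "Sex"}
L2 = {"SSN", "Salary", "DNo"}

covers_all = (L1 | L2) == ATTRS   # condition 1: L1 U L2 = ATTRS(R)
share_only_key = (L1 & L2) == PK  # condition 2: L1 ∩ L2 = PK(R)
print(covers_all, share_only_key)  # True True
```

Leaving SSN out of L1 would fail the second condition and, as noted above, make the original tuples unrecoverable.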
c) Mixed (Hybrid) fragmentations – Mixed fragmentation is the combination of
vertical fragmentation and horizontal fragmentation. In general a fragment of a
relation can be constructed by a SELECT-PROJECT combination of operations
ПL(σC(R)).
If C = True and L ≠ ATTRS(R), we get a vertical fragment.
If C ≠ True and L = ATTRS(R), we get a horizontal fragment.
If C ≠ True and L ≠ ATTRS(R), we get a mixed fragment.
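The three cases can be sketched with a single select-then-project helper (illustrative Python over an invented Employee instance):

```python
def fragment(rows, condition, attrs):
    """Pi_L(sigma_C(R)): select the qualifying tuples, then project."""
    return [{a: t[a] for a in attrs} for t in rows if condition(t)]

# Invented Employee instance for illustration.
employees = [{"SSN": 1, "Name": "ann", "Salary": 50, "DNo": 5},
             {"SSN": 2, "Name": "bob", "Salary": 60, "DNo": 4}]

attrs_r = ["SSN", "Name", "Salary", "DNo"]
# C = True, L != ATTRS(R): a vertical fragment.
vertical = fragment(employees, lambda t: True, ["SSN", "Name"])
# C != True, L = ATTRS(R): a horizontal fragment.
horizontal = fragment(employees, lambda t: t["DNo"] == 5, attrs_r)
# C != True, L != ATTRS(R): a mixed fragment.
mixed = fragment(employees, lambda t: t["DNo"] == 5, ["SSN", "Name"])
print(mixed)  # [{'SSN': 1, 'Name': 'ann'}]
```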
d) Fragmentation schema – A fragmentation schema of a database is a definition of
a set of fragments that includes all attributes and tuples in the database and satisfies
the condition that the whole database can be reconstructed from the fragments by
applying some sequence of OUTER UNION and UNION operations.
e) Allocation schema – An allocation schema describes the allocation of fragments to
sites of the DDBS; hence it is a mapping that specifies for each fragment the site(s) at
which it is stored.
5.3 Data Replication and Allocation
If a fragment is stored at more than one site, it is said to be replicated.
a) Fully replicated distributed database – If the replication of whole database is
done at every site in the distributed system, the resulting database is called a fully
replicated distributed database. This can improve availability remarkably because the
system can continue to operate as long as at least one site is up. It also improves
performance of retrieval for global queries. The disadvantage of full replication is that
it can slow down update operations drastically.
b) Nonredundant allocation – In this system, each fragment is stored at exactly one
site. In this case, all fragments must be disjoint except for the repetition of primary
keys among vertical (or mixed) fragments.
c) Partial replication – In this system, some fragments of the database may be
replicated whereas others may not. The number of copies of each fragment can range
from one up to the total number of sites in the distributed system.
d) Replication schema – A description of the replication of fragments is called a
replication schema. Each fragment – or each copy of a fragment – must be assigned to
a particular site in the distributed system. This process is called data distribution
(or data allocation).
5.4 Types of Distributed Database Systems
The term distributed database management system can describe various systems that
differ from one another in many respects. The main thing that all such systems have
in common is the fact that data and software are distributed over multiple sites
connected by some form of communication network.
The first factor we consider is the degree of homogeneity of the DDBMS
software. If all servers (or individual local DBMSs) use identical software and all
users (clients) use identical software, the DDBMS is called homogeneous; otherwise,
it is called heterogeneous. Another factor related to the degree of homogeneity is the
degree of local autonomy. If there is no provision for the local site to function as a
stand-alone DBMS, then the system has no local autonomy. On the other hand, if
direct access by local transactions to a server is permitted, the system has some
degree of local autonomy.
At one extreme of the autonomy spectrum, we have a DDBMS that "looks like" a
centralized DBMS to the user. A single conceptual schema exists, and all access to the
system is obtained through a site that is part of the DDBMS, which means that no
local autonomy exists. At the other extreme we encounter a type of DDBMS called a
federated DDBMS (or a multidatabase system). In such a system, each server is an
independent and autonomous centralized DBMS that has its own local users, local
transactions, and DBA, and hence has a very high degree of local autonomy. The term
federated database system (FDBS) is used when there is some global view or schema
of the federation of databases that is shared by the applications. On the other hand, a
multidatabase system does not have a global schema and interactively constructs one
as needed by the application. Both systems are hybrids between distributed and
centralized systems, and the distinction we made between them is not strictly
followed. We will refer to them as FDBSs in a generic sense.
In a heterogeneous FDBS, one server may be a relational DBMS, another a network
DBMS, and a third an object or hierarchical DBMS; in such a case it is necessary to
have a canonical system language and to include language translators to translate
subqueries from the canonical language to the language of each server. We briefly
discuss the issues affecting the design of FDBSs below.
Federated Database Management Systems Issues
The type of heterogeneity present in FDBSs may arise from several sources.
Differences in data models: Databases in an organization come from a
variety of data models, including the relational data model, the object data
model, etc. The modeling capabilities of the models vary. Hence, to deal with
them uniformly via a single global schema or to process them in a single
language is challenging. Even if two databases are both from the RDBMS
environment, the same information may be represented as an attribute name,
as a relation name, or as a value in different databases. This calls for an
intelligent query-processing mechanism that can relate information based on
metadata.
Differences in constraints: Constraint specification and implementation
facilities vary from system to system. There are comparable features
that must be reconciled in the construction of a global schema. For example,
the relationships from ER models are represented as referential integrity
constraints in the relational model. Triggers may have to be used to
implement certain constraints in the relational model. The global schema
must also deal with potential conflicts among constraints.
Differences in query languages: Even with the same data model, the
languages and their versions vary. For example, SQL has multiple versions
like SQL-89, SQL-92 (SQL2), and SQL3, and each system has its own set of
data types, comparison operators, string manipulation features, and so on.
Semantic Heterogeneity
Semantic heterogeneity occurs when there are differences in the meaning,
interpretation, and intended use of the same or related data. Semantic heterogeneity
among component database systems (DBSs) creates the biggest hurdle in designing
global schemas of heterogeneous databases. The design autonomy of component
DBSs refers to their freedom in choosing the following design parameters, which in
turn affect the eventual complexity of the FDBS:
The universe of discourse from which the data is drawn: For example, two
customer accounts databases in the federation may be from the United States and
Japan, with entirely different sets of attributes about customer accounts
required by the respective accounting practices. Currency rate fluctuations would
also present a problem. Hence, relations in these two databases with identical
names (CUSTOMER or ACCOUNT) may have some common and some entirely
distinct information.
Representation and naming: The representation and naming of data
elements and the structure of the data model may be prespecified for each
local database.
The understanding, meaning, and subjective interpretation of data:
This is a chief contributor to semantic heterogeneity.
Transaction and policy constraints: These deal with serializability criteria,
compensating transactions, and other transaction policies.
Derivation of summaries: Aggregation, summarization, and other data-
processing features and operations supported by the system.
5.5 Query Processing in Distributed Databases
Data Transfer Costs of Distributed Query Processing
In a distributed system, several additional factors further complicate
query processing. The first is the cost of transferring data over the network. This
data includes intermediate files that are transferred to other sites for further
processing, as well as the final result files that may have to be transferred to the
site where the query result is needed. Although these costs may not be very high
if the sites are connected via a high-performance local area network, they
become quite significant in other types of networks. Hence, DDBMS query
optimization algorithms consider the goal of reducing the amount of data
transfer as an optimization criterion in choosing a distributed query execution
strategy.
Distributed Query Processing Using Semijoin
The idea behind distributed query processing using the semijoin operation is to
reduce the number of tuples in a relation before transferring it to another site.
Intuitively, the idea is to send the joining column of one relation R to the site
where the other relation S is located; this column is then joined with S. Following
that, the join attributes, along with the attributes required in the result, are
projected out and shipped back to the original site and joined with R. Hence, only
the joining column of R is transferred in one direction, and a subset of S with no
extraneous tuples or attributes is transferred in the other direction. If only a
small fraction of the tuples in S participate in the join, this can be quite an
efficient solution for minimizing data transfer.
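The transfer steps of a semijoin strategy can be sketched in Python (the EMPLOYEE and DEPT relations and the assignment of relations to sites are invented for the example; lists stand in for the relations stored at each site):

```python
# Site 1 stores EMPLOYEE(SSN, Name, DNo); site 2 stores DEPT(DNo, DName).
employee = [{"SSN": 1, "Name": "ann", "DNo": 5},
            {"SSN": 2, "Name": "bob", "DNo": 4},
            {"SSN": 3, "Name": "eve", "DNo": 9}]  # DNo 9 has no match
dept = [{"DNo": 4, "DName": "hq"}, {"DNo": 5, "DName": "lab"}]

# Step 1: ship only the joining column pi_DNo(DEPT) from site 2 to site 1.
dept_keys = {d["DNo"] for d in dept}

# Step 2: semijoin at site 1 -- keep only the employees that will match.
reduced = [e for e in employee if e["DNo"] in dept_keys]

# Step 3: ship the reduced relation back and complete the join at site 2.
result = [{**e, "DName": d["DName"]}
          for e in reduced for d in dept if e["DNo"] == d["DNo"]]
print(len(employee), len(reduced))  # 3 2 -- eve's tuple was never shipped
```

Only the small DNo column travels in one direction and the reduced EMPLOYEE subset in the other; the non-matching tuple never crosses the network, which is the data-transfer saving the text describes.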
Query and Update Decomposition
In a DDBMS with no distribution transparency, the user phrases a query directly in
terms of specific fragments.
The user must also maintain consistency of replicated data items when updating a
DDBMS with no replication transparency.
On the other hand, a DDBMS that supports full distribution, fragmentation, and
replication transparency allows the user to specify a query or update request on
the schema just as though the DBMS were centralized. For updates, the DDBMS
is responsible for maintaining consistency among replicated items by using one of
the distributed concurrency control algorithms. For queries, a query decomposition
module must break up or decompose a query into subqueries that can be
executed at the individual sites. In addition, a strategy for combining the results of
the subqueries to form the query result must be generated. Whenever the DDBMS
determines that an item referenced in the query is replicated, it must choose or
materialize a particular replica during query execution.
To determine which replicas include the data items referenced in a query, the
DDBMS refers to the fragmentation, replication, and distribution information
stored in the DDBMS catalog. For vertical fragmentation, the attribute list for
each fragment is kept in the catalog. For horizontal fragmentation, a condition,
sometimes called a guard, is kept for each fragment. This is basically a selection
condition that specifies which tuples exist in the fragment; it is called a guard
because only tuples that satisfy this condition are permitted to be stored in the
fragment. For mixed fragments, both the attribute list and the guard condition are
kept in the catalog.
5.6 Concurrency Control and Recovery in Distributed Databases
For concurrency control and recovery purposes, numerous problems arise in a
distributed DBMS environment that are not encountered in a centralized DBMS
environment. These include the following:
Dealing with multiple copies of the data items: The concurrency control
method is responsible for maintaining consistency among these copies. The
recovery method is responsible for making a copy consistent with other
copies if the site on which the copy is stored fails and recovers later.
Failure of individual sites: The DDBMS should continue to operate with its
running sites, if possible, when one or more individual sites fail. When a site
recovers, its local database must be brought up to date with the rest of the
sites before it rejoins the system.
Failure of communication links: The system must be able to deal with failure
of one or more of the communication links that connect the sites. An extreme
case of this problem is that network partitioning may occur. This breaks up the
sites into two or more partitions, where the sites within each partition can
communicate only with one another and not with sites in other partitions.
Distributed commit: Problems can arise with committing a transaction that is
accessing databases stored on multiple sites if some sites fail during the
commit process. The two-phase commit protocol (see Chapter 21) is often
used to deal with this problem.
Distributed deadlock: Deadlock may occur among several sites, so techniques
for dealing with deadlocks must be extended to take this into account.
References
1. Fundamentals of Database Systems – Elmasri and Navathe (3rd Edition), Pearson
Education Asia
2. Database System Concepts – Henry F. Korth, Abraham Silberschatz (2nd Edition),
McGraw Hill
3. An Introduction to Database Systems – C. J. Date (7th Edition), Pearson
Education Asia
4. Database Principles, Programming and Performance – Patrick O'Neil, Elizabeth
O'Neil
5. An Introduction to Database Systems – Bipin C. Desai
6. Teach Yourself PL/SQL in 21 Days – SAMS Publications
7. SQL, PL/SQL – Ivan Bayross
8. ORACLE Developer's Guide – David McClanahan