
Advanced Database System Set 1

1. List and explain the various Normal Forms. How does BCNF differ from Third Normal Form and Fourth Normal Form?

Normalization is the process of designing a data model to efficiently store data in a database. The end result is that redundant data is eliminated, and only data related to the table's key attributes is stored within the table.

First Normal Form (1NF): A relation is said to be in 1NF if it has only single-valued attributes; neither repeating groups nor arrays are permitted.

Second Normal Form (2NF): A relation is said to be in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the primary key.

Third Normal Form (3NF): We say that a relation is in 3NF ifit is in 2NF and has no transitive dependencies.

Boyce-Codd Normal Form (BCNF): A relation is said to be in BCNF if and only if every determinant in the relation is a candidate key.

Fourth Normal Form (4NF): A relation is said to be in 4NF if it is in BCNF and contains no multi-valued dependencies.

Fifth Normal Form (5NF): A relation is said to be in 5NF if and only if every join dependency in the relation is implied by the candidate keys of the relation.

Domain-Key Normal Form (DKNF): We say that a relation is in DKNF if it is free of all modification anomalies. Insertion, deletion, and update anomalies come under modification anomalies. (A small SQL illustration of 2NF follows.)
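To make the early definitions concrete, here is a hedged SQL sketch (the supplier/part schema is hypothetical, not from the original text) of a table that violates 2NF and its decomposition:

-- Violates 2NF: supplier_city depends only on supplier_id, which is
-- just part of the composite key (supplier_id, part_id).
CREATE TABLE SupplierParts (
    supplier_id   INT,
    part_id       INT,
    supplier_city VARCHAR(50),
    quantity      INT,
    PRIMARY KEY (supplier_id, part_id)
);

-- 2NF decomposition: the partially dependent attribute moves to a
-- table keyed by the attribute it actually depends on.
CREATE TABLE Suppliers (
    supplier_id   INT PRIMARY KEY,
    supplier_city VARCHAR(50)
);

CREATE TABLE Shipments (
    supplier_id INT,
    part_id     INT,
    quantity    INT,
    PRIMARY KEY (supplier_id, part_id)
);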

Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF)


Third normal form states that a table must have no transitive dependencies: every non-key column must depend directly on the key, and no non-key column may be needed to identify a row. If columns X, Y and Z exist with X the key, then Y and Z must each depend on X alone, not on each other. BCNF extends 3NF, stating that no non-trivial functional dependency can exist on anything other than a superkey, that is, a superset of a candidate key.

Typically, 3NF means there are no transitive dependencies. A transitive dependency arises when two column relationships imply a third. For example, name -> extension and extension -> store_location together give name -> store_location, which is not a dependency we want to model in our table and could lead to faulty data. The table we have defined, though, is still a 3NF table. Also, what happens if an employee changes extension and the old sales records aren't updated? Or if a customer moves? These entry points for error are problematic but tolerated in 3NF because the determinants involved are not candidate keys. What we need to look to is BCNF, or Boyce-Codd Normal Form, which requires every determinant to be a candidate key in addition to the table being in 3NF, and thereby eliminates such dependencies. Decomposing the schema as follows achieves this:

-- Column types are illustrative; the original listed column names only.
CREATE TABLE Sales (
    employee_id   INT,
    customer_name VARCHAR(100),
    product_id    INT
);

CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,
    manager_id  INT,
    name        VARCHAR(100),
    extension   VARCHAR(10)
);

CREATE TABLE Customers (
    name    VARCHAR(100) PRIMARY KEY,
    address VARCHAR(200)
);

CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100),
    price      DECIMAL(10,2)
);


Now information about employees, customers, and products is isolated from the Sales table. There are obvious practical problems with this example, such as product price changes breaking sales records, but that is a matter for another discussion. Each fact is represented in a single row in a single table: an employee's extension and manager are listed once in the Employees table, rather than repeatedly (and probably erroneously) in the Sales table.

Difference between BCNF and 4NF (Fourth Normal Form)

• A database must already be in 3NF to take it to BCNF, but it must be in both 3NF and BCNF to reach 4NF.

• In Fourth Normal Form there are no multi-valued dependencies in the tables, but in BCNF there can still be multi-valued dependencies in the tables, as the sketch below illustrates.
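As a hedged illustration of this difference (the course/teacher/book schema is the classic textbook example, with hypothetical names): the first table below is in BCNF, since its only key is the entire row, yet it holds two independent multi-valued facts about a course; 4NF separates them:

-- BCNF but not 4NF: course ->> teacher and course ->> book are
-- independent multi-valued dependencies crammed into one table.
CREATE TABLE CourseInfo (
    course  VARCHAR(50),
    teacher VARCHAR(50),
    book    VARCHAR(50),
    PRIMARY KEY (course, teacher, book)
);

-- 4NF decomposition: one table per multi-valued fact.
CREATE TABLE CourseTeachers (
    course  VARCHAR(50),
    teacher VARCHAR(50),
    PRIMARY KEY (course, teacher)
);

CREATE TABLE CourseBooks (
    course VARCHAR(50),
    book   VARCHAR(50),
    PRIMARY KEY (course, book)
);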

2. Describe the heuristics of query optimization.

Query Optimization

The goal of any query processor is to execute each query as efficiently as possible. Efficiency here is measured chiefly in response time; correctness, of course, is a prerequisite.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is the following (a SQL illustration appears after the list):

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.

2. Execute join operations to further reduce the result set.

3. Execute operations on media data as late as possible, since these can be very time consuming.

4. Prepare the result set for presentation.
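A relational optimizer applies such a sequence internally, but its effect can be sketched by hand in SQL. The following hedged example (using the bank schemas introduced in the next subsection) stages the single-table select and project in a derived table, so the joins operate on a much smaller input:

-- Naive formulation: join everything, then filter.
SELECT b.bname, b.assets
FROM customer c, deposit d, branch b
WHERE c.cname = d.cname AND d.bname = b.bname
  AND c.ccity = 'Port Chester';

-- Heuristic order made explicit: select and project on the single
-- table first, then join the reduced intermediate result.
SELECT b.bname, b.assets
FROM (SELECT cname FROM customer
      WHERE ccity = 'Port Chester') AS pc
JOIN deposit d ON d.cname = pc.cname
JOIN branch  b ON b.bname = d.bname;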


Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We'll use the following relation schemes as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

Selection Operation

1. Consider the query to find the assets and branch names of all branches that have depositors living in Port Chester. In relational algebra, this is:

Π bname, assets (σ ccity="Port Chester" (customer ⋈ deposit ⋈ branch))

This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are interested in only a few tuples. We are also interested in only two attributes of this relation. Since we only want tuples for which ccity = "Port Chester", we can rewrite our query as:

Π bname, assets (σ ccity="Port Chester" (customer) ⋈ deposit ⋈ branch)

This should considerably reduce the size of the intermediate relation.

Projection Operation

1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

Π bname, assets ((σ ccity="Port Chester" (customer) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

σ ccity="Port Chester" (customer) ⋈ deposit

we obtain a relation whose scheme is (cname, street, ccity, bname, account#, balance).

3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that appear in the result of the query or are needed to process subsequent operations.


4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π bname, assets ((Π bname (σ ccity="Port Chester" (customer) ⋈ deposit)) ⋈ branch)

Note that there is no advantage in performing an early projection on a relation before it is needed for some other operation: we would access every block of the relation to remove attributes, and then access every block of the reduced-size relation when it is actually needed. We do more work in total, rather than less!

Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Looking again at our expression

Π bname, assets (σ ccity="Port Chester" (customer) ⋈ deposit ⋈ branch)

we see that we can compute deposit ⋈ branch first and then join with the first part. However, deposit ⋈ branch is likely to be a large relation, as it contains one tuple for every account. The other part, σ ccity="Port Chester" (customer), is probably a comparatively small relation.

So, if we compute σ ccity="Port Chester" (customer) ⋈ deposit first, we get a reasonably small relation.

It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than deposit ⋈ branch. Natural join is commutative:

r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

Π bname, assets ((σ ccity="Port Chester" (customer) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this join is a Cartesian product. Lots of tuples!


If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform this into the more efficient expression we have derived earlier (join with deposit first, then with branch).

One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations. This is because the size of the file resulting from a binary operation, such as JOIN, is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied before a join or other binary operation.

• Cost-based optimization is expensive, even with dynamic programming.

• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.

• Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:

· Perform selection early (reduces the number of tuples)

· Perform projection early (reduces the number of attributes)

· Perform the most restrictive selection and join operations (i.e. those with the smallest result size) before other similar operations.

· Some systems use only heuristics, others combine heuristics with partial cost-based optimization.

E.g.:

Π customer_name ((σ branch_city = "Brooklyn" (branch) ⋈ account) ⋈ depositor)

1) When we compute

σ branch_city = "Brooklyn" (branch) ⋈ account


we obtain a relation whose schema is (branch_name, branch_city, assets, account_number, balance).

2) Push projections using equivalence rules; eliminate unneeded attributes from intermediate results to get:

Π customer_name ((Π account_number (σ branch_city = "Brooklyn" (branch) ⋈ account)) ⋈ depositor)

3) Performing the projection as early as possible reduces the size of the relation to be joined.

3. Describe the Structural Semantic Data Model (SSM) with relevant examples.

Ans: SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects.

The central SSM concepts, with definitions and examples:

· Entity (object): Something of interest to the information system about which data is collected. Examples: a person, student, customer, employee, department, product, exam, order…

· Entity type: A set of entities sharing common attributes. Examples: the citizens of Norway; Person (Name, Address, …).

· Subclass / superclass entity type: A subclass entity type is a specialization of, or alternatively a role played by, a superclass entity type (Subclass IS_A Superclass). Examples: Student IS_A Person; Teacher IS_A Person.

· Shared subclass entity type: A shared subclass entity type has characteristics of two or more parent entity types. Example: a student-assistant IS_BOTH_A student and an employee.

· Category entity type: A subclass entity type of two or more distinct/independent superclass entity types. Example: an owner IS_EITHER_A person or an organization.

(A relational sketch of the IS_A construct follows.)
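In a relational rendering (a hedged sketch; SSM itself is a graphical notation, and these table and column names are hypothetical), the IS_A relationship is commonly implemented by giving each subclass table the superclass key as both primary and foreign key:

CREATE TABLE Person (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100),
    address   VARCHAR(200)
);

-- Student IS_A Person: the subclass shares the superclass key.
CREATE TABLE Student (
    person_id  INT PRIMARY KEY REFERENCES Person(person_id),
    student_no VARCHAR(20)
);

-- Teacher IS_A Person.
CREATE TABLE Teacher (
    person_id INT PRIMARY KEY REFERENCES Person(person_id),
    office    VARCHAR(20)
);

-- A shared subclass (student-assistant IS_BOTH_A student and an
-- employee) would reference both parent tables in the same way.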

4. Describe the following with respect to Object Oriented Databases:

a. Query Processing in Object-Oriented Database Systems

b. Query Processing Architecture

Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to liken first-generation OODBMSs to the earlier (network and hierarchical) DBMSs. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well, e.g. O2 and ObjectStore.

In this Section we discuss the issues related to the optimization and execution of OODBMS query languages (which we collectively call query processing). Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System

Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list) which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.

Encapsulation

Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly. Others propose a mechanism whereby objects "reveal" their costs as part of their interface.

Complex Objects and Inheritance

Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages. We discuss this issue in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.
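As a hedged illustration (the employee/department schema is hypothetical), an object query can navigate references directly through a path expression such as e.department.manager.name; a relational rendering of the same access spells out a chain of joins that the optimizer must then place and order:

-- Relational equivalent of the path expression
-- e.department.manager.name: each dot becomes a join.
SELECT m.name
FROM Employee e
JOIN Department d ON e.dept_id    = d.dept_id
JOIN Employee   m ON d.manager_id = m.emp_id
WHERE e.emp_id = 42;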

Object Models

OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others. This diversity of approaches is likely to prevail for some time, therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.

Query Processing Methodology

A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. Figure 6.1 depicts one such proposed methodology.

The steps of the methodology are as follows.

1. Queries are expressed in a declarative language.

2. The approach requires no user knowledge of object implementations, access paths or processing strategies.

3. The query is first converted into an internal calculus expression, which is then taken through the following stages:

4. Calculus optimization

5. Calculus-to-algebra transformation

6. Type check

7. Algebra optimization

8. Execution plan generation

9. Execution

5. Describe the theory of Fuzzy Querying to Relational Databases.

The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the University of California at Berkeley, and presented not as a control methodology, but as a way of processing data by allowing partial set membership rather than crisp set membership or non-membership. This approach to set theory was not applied to control systems until the 1970s, due to insufficient small-computer capability prior to that time. Professor Zadeh reasoned that people do not require precise, numerical information input, and yet they are capable of highly adaptive control. If feedback controllers could be programmed to accept noisy, imprecise input, they would be much more effective and perhaps easier to implement.

Fuzzy Logic requires some numerical parameters in order to operate, such as what is considered a significant error and a significant rate of change of error, but the exact values of these numbers are usually not critical unless very responsive performance is required, in which case empirical tuning would determine them.

The proposed model

The easiest way of introducing fuzziness in the database model is to use a classical relational database and formulate a front end to it that allows fuzzy querying. A limitation imposed on the system is that, because we are neither extending the database model nor defining a new model, the underlying database remains crisp and hence the fuzziness can only be incorporated in the query.

To incorporate fuzziness we introduce fuzzy sets (linguistic terms) on the attribute domains (linguistic variables); e.g. on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD. These are defined as the following:

Fig. 8.4: Age

For this we take the example of a student database which has a table STUDENTS with the following attributes:

A snapshot of the data existing in the database


Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS with the following structure:

Fig. 8.6: Meta Knowledge

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

· Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set.

· Column_Name: Stores the linguistic variable associated with the given linguistic term.

· Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set (see the sketch below).
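A minimal SQL sketch of this meta-table, assuming trapezoidal fuzzy sets whose four breakpoints are Alpha through Delta (the numeric ranges below are hypothetical illustrations, not values from the original text):

CREATE TABLE LABELS (
    Label       VARCHAR(30) PRIMARY KEY,  -- linguistic term
    Column_Name VARCHAR(30),              -- linguistic variable
    Alpha REAL, Beta REAL, Gamma REAL, Delta REAL  -- fuzzy-set range
);

-- Hypothetical fuzzy sets on the AGE domain.
INSERT INTO LABELS VALUES ('YOUNG',  'AGE',  0,  0, 20, 30);
INSERT INTO LABELS VALUES ('MIDDLE', 'AGE', 20, 30, 40, 50);
INSERT INTO LABELS VALUES ('OLD',    'AGE', 40, 50, 99, 99);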

Implementation

The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query does not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts:

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.

2. Result Attributes: The attributes that are to be displayed; used only in the case of the SELECT query.

3. Source Tables: The tables on which the query is to be applied.

4. Conditions: The conditions that have to be specified before the operation is performed. A condition is further sub-divided into Query Attributes (i.e. the attributes on which the query is to be applied) and the linguistic term. If a condition is not fuzzy, i.e. it does not contain a linguistic term, it need not be subdivided. A hedged sketch of how a fuzzy condition can be translated follows.
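One way such a front end might translate a fuzzy condition into a crisp one (the STUDENTS column names and the use of the interval [Alpha, Delta] as the crisp range are assumptions for illustration, not the paper's exact algorithm):

-- Fuzzy query as entered by the user (not valid SQL as written):
--   SELECT name FROM STUDENTS WHERE AGE = YOUNG;
--
-- Crisp translation: look up YOUNG's range in LABELS, keep rows
-- inside the set's support, and rank by trapezoidal membership.
SELECT s.name, s.age,
       CASE
         WHEN s.age <  l.Alpha OR  s.age > l.Delta THEN 0
         WHEN s.age >= l.Beta  AND s.age <= l.Gamma THEN 1
         WHEN s.age <  l.Beta
           THEN (s.age  - l.Alpha) / (l.Beta  - l.Alpha)
         ELSE  (l.Delta - s.age)   / (l.Delta - l.Gamma)
       END AS membership
FROM STUDENTS s, LABELS l
WHERE l.Label = 'YOUNG' AND l.Column_Name = 'AGE'
  AND s.age BETWEEN l.Alpha AND l.Delta;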


6. Describe the Differences between Distributed & Centralized Databases

Differences between Distributed & Centralized Databases

1 Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who have responsibility for their local databases.

2 Data Independence

In centralized databases, data independence means that the actual organization of data is transparent to the application programmer. Programs are written against a "conceptual" view of the data (the "conceptual schema") and are unaffected by the physical organization of data. In distributed databases, another aspect, "distribution transparency", is added to the notion of data independence as used in centralized databases. Distribution transparency means programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another, although their speed of execution is affected.

3 Reduction of Redundancy

In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system is increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

4 Complex Physical Structures and Efficient Access

In centralized databases, complex access structures such as secondary indexes and interfile chains are used. All these features provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer.

Problems faced in the design of an optimizer can be classified in two categories:

a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites.


b) Local optimization consists of deciding how to perform the local database accesses at each site.

5 Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two main threats to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

6 Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed.

In distributed databases, local administrators face the same problems, as well as two new aspects: (a) security (protection) problems arise because the communication networks intrinsic to distributed database systems must themselves be protected; (b) sites with a high degree of "site autonomy" may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

7 Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as also are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context; the main objective is to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

8 Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information such as fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), all of which is more detailed than in centralized databases.

Relative Advantages of Distributed Databases over Centralized Databases


Organizational and Economic Reasons

Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. In organizations that already have several databases and feel the necessity of global applications, distributed databases are the natural choice.

Incremental Growth

In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

Reduced Communication Overhead

Many applications are local, and these applications do not have any communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

Performance Considerations

Data localization reduces contention for CPU and I/O services and simultaneously reduces the access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel. This contributes to improved performance.

Reliability and Availability

Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

Management of Distributed Data with Different Levels of Transparency

In a distributed database, the following types of transparency are possible:

Distribution or Network Transparency

This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.


Replication Transparency

Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.

Fragmentation Transparency

Two main types of fragmentation are horizontal fragmentation, which distributes a relation into sets of tuples (rows), and vertical fragmentation, which distributes a relation into sub-relations, each defined by a subset of the columns of the original relation. A global query issued by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments; a hedged sketch follows.
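A minimal SQL sketch of both kinds of fragmentation (the Account table and the split criteria are hypothetical; a real DDBMS allocates fragments to sites through its catalog rather than through plain views):

-- Horizontal fragments: each site holds the rows for its own region.
CREATE VIEW Account_Site1 AS
    SELECT * FROM Account WHERE branch_city = 'Brooklyn';
CREATE VIEW Account_Site2 AS
    SELECT * FROM Account WHERE branch_city <> 'Brooklyn';

-- Vertical fragments: each keeps the key plus a subset of columns.
CREATE VIEW Account_Balances AS
    SELECT account_number, balance FROM Account;
CREATE VIEW Account_Locations AS
    SELECT account_number, branch_name, branch_city FROM Account;

-- Fragmentation transparency: the global relation remains
-- reconstructable, e.g. Account = Account_Site1 UNION Account_Site2.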
