Chapter 11 Further Normalization (I)

Throughout this book, we have used the suppliers and parts database as an example, as follows:

S {S#, SNAME, STATUS, CITY}
    PRIMARY KEY {S#}

P {P#, PNAME, COLOR, WEIGHT, CITY}
    PRIMARY KEY {P#}

SP {S#, P#, QTY}
    PRIMARY KEY {S#, P#}
    FOREIGN KEY {S#} REFERENCES S
    FOREIGN KEY {P#} REFERENCES P


The above design does seem to make sense: it is obvious that three tables S, P, and SP are necessary; it is also obvious that the supplier CITY attribute belongs in table S, the part COLOR attribute belongs in P, the shipment QTY belongs in SP, and so on. But why are these the right decisions?

Assume that we move the supplier attribute CITY out of S and into SP to obtain the following table, SCOP:

S#   CITY     P#   QTY
S1   London   P1   100
S1   London   P2   100
S2   Paris    P1   200
S2   Paris    P2   200
S3   Paris    P2   300
S4   London   P2   400
S4   London   P4   400
S4   London   P5   400


Looking at this modified table, we immediately spot what is wrong with this design: redundancy. Every tuple for supplier S1 tells us that S1 is located in London, every tuple for supplier S2 tells us that S2 is located in Paris, and so on. This leads to further problems: after an update, S1 might be shown as located in London in one tuple and in Amsterdam in another (inconsistency).
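The inconsistency just described can be made concrete with a small sketch. The Python below is illustrative only and is not part of the original notes: the helper name satisfies_fd and the list-of-dicts representation of SCOP are assumptions chosen for brevity. It checks whether every supplier has a single city before and after an update that touches only one of S1's tuples.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if all rows that agree on `lhs` also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

HEADING = ("S#", "CITY", "P#", "QTY")
SCOP = [dict(zip(HEADING, r)) for r in [
    ("S1", "London", "P1", 100), ("S1", "London", "P2", 100),
    ("S2", "Paris", "P1", 200), ("S2", "Paris", "P2", 200),
    ("S3", "Paris", "P2", 300), ("S4", "London", "P2", 400),
    ("S4", "London", "P4", 400), ("S4", "London", "P5", 400),
]]

# Each supplier has exactly one city in the sample data.
print(satisfies_fd(SCOP, ["S#"], ["CITY"]))

# Update only the (S1, P1) tuple and leave the (S1, P2) tuple untouched.
updated = [dict(r) for r in SCOP]
updated[0]["CITY"] = "Amsterdam"

# S1 now appears with two different cities: the table is inconsistent.
print(satisfies_fd(updated, ["S#"], ["CITY"]))
```

The check passes on the original data and fails after the partial update, which is exactly the risk that storing the same fact in several tuples creates.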

Thus, a good design principle is “one fact in one place.” This chapter provides a formal treatment of this simple idea. In fact, the relational model is already normalized in a certain sense: every tuple contains exactly one value for each attribute. We say that such a table is in first normal form (1NF).


What are they for?

As we have just seen, a table can be in 1NF and still possess certain undesirable properties; the table SCOP is an example. The principles of further normalization allow us to recognize such cases and to replace such tables with ones that are more desirable in some way.

In the case of the table SCOP, those principles would tell us exactly what is wrong, and would tell us how to replace it with two more desirable tables, one with heading {S#, CITY} and another with heading {S#, P#, QTY}.


Normal forms

The process of further normalization is built around the concept of normal forms. A table is said to be in a certain normal form if it satisfies a certain prescribed set of conditions. For example, a table is in second normal form (2NF) iff it is in 1NF and also satisfies some additional conditions.

Many normal forms have been defined. The following picture shows their relationship:


Before we can start...

To replace a table with something better, we have to apply a normalization procedure, which successively reduces a given collection of tables to some more desirable form. Such a procedure should be reversible, in the sense that it is possible to take its output and map it back to the input; i.e., such a procedure should be information preserving. In other words, the only decompositions we are interested in are ones that lose nothing during their application.

For example, consider S, with heading {S#, STATUS, CITY}. We will consider several options for decomposing it into simpler tables.


You have a choice

Given the following two options:

We observe that in case (a) nothing is lost: the SST and SC values still tell us that S3 has status 30 and city Paris, etc. In case (b), although we can still tell that both suppliers have status 30, we cannot tell which supplier has which city. Thus, we call the second decomposition lossy.


What is the problem?

First of all, decomposition is really a process of projection: SST, SC and ST are each projections of S. Thus, projection is the decomposition operator.

Secondly, when we say case (a) is non-lossy, we really mean that if we join SST and SC, we get back the original S. In case (b), on the other hand, if we join the projections together, we do not get back the original table S. Thus, reversibility means precisely that the original table is equal to the join of its projections. Hence, join is the recomposition operator.
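The roles of projection and join can be tried out in a few lines of Python. This is a sketch under stated assumptions, not the notes' own material: the two-tuple value of S (S3 in Paris with status 30, plus a hypothetical second supplier S5 in Athens with the same status), the helper names relation, project and join, and the choice of {STATUS, CITY} as the second projection in case (b) are all assumptions made for illustration, since the original figure is not reproduced here.

```python
def relation(heading, rows):
    """Build a relation as a set of tuples; each tuple is a frozenset of (attr, value) pairs."""
    return {frozenset(zip(heading, row)) for row in rows}

def project(rel, attrs):
    """Projection: keep only `attrs`; duplicates vanish because the result is a set."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def join(r1, r2):
    """Natural join: combine tuples that agree on all common attributes."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

# Assumed sample value of S (see the lead-in).
S = relation(("S#", "STATUS", "CITY"),
             [("S3", 30, "Paris"), ("S5", 30, "Athens")])

# Case (a): both projections keep S#.
SST = project(S, {"S#", "STATUS"})
SC = project(S, {"S#", "CITY"})
print(join(SST, SC) == S)        # non-lossy: the join gives back S

# Case (b), as assumed here: the second projection drops S#.
STC = project(S, {"STATUS", "CITY"})
lossy = join(SST, STC)
print(lossy == S, len(lossy))    # extra tuples appear, so S is not recovered
```

In case (b) the join pairs each supplier with every city that has status 30, so the original tuples are still present but can no longer be told apart from the spurious ones.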


The million buck question

Question: If R1 and R2 are projections of some table R, and R1 and R2 between them include all of the attributes of R, what conditions have to be satisfied in order to guarantee that their join takes us back to the original R?

We notice that S# → STATUS and S# → CITY are two FDs associated with S. As a matter of fact, we have the following general result:

Heath’s theorem: Let R{A, B, C} be a table, where A, B, and C are sets of attributes of R. If R satisfies the FD A → B, then R is equal to the join of its projections on {A, B} and {A, C}.


More on FDs

1. An FD is left-irreducible if its left-hand side cannot be cut down any further. E.g., consider SCOP: the table satisfies {S#, P#} → CITY. However, P# is redundant, since we also have S# → CITY, which implies the former. Thus, we say that CITY is irreducibly dependent on S#, but not on {S#, P#}.

2. FDs, as special integrity constraints, are semantic notions. Recognizing the FDs is thus part of the process of understanding what the data means. The FD S# → CITY means that each supplier is located in precisely one city, and the way to record this understanding is to declare the FD in the database.


3. Let R be a table and let I be some irreducible set of FDs that apply to R. It is convenient to represent the set I by means of a functional dependency diagram. E.g., below is the FD diagram for S, SP, and P.

Each arrow in the above figure comes out of a candidate key. The normalization process eliminates arrows that do not come out of candidate keys.
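The left-irreducibility test from item 1 can be sketched with the standard attribute-closure computation. The code below is a minimal illustration rather than anything prescribed by the notes: the function names closure and left_reduce are hypothetical, and the FD set for SCOP is the one implied by the text (S# → CITY and {S#, P#} → QTY).

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def left_reduce(lhs, rhs, fds):
    """Drop attributes from `lhs` for as long as lhs -> rhs is still implied by `fds`."""
    lhs = set(lhs)
    for a in sorted(lhs):            # sorted() copies, so lhs may shrink inside the loop
        smaller = lhs - {a}
        if smaller and set(rhs) <= closure(smaller, fds):
            lhs = smaller
    return lhs

# FDs assumed for SCOP, each written as (left-hand side, right-hand side).
SCOP_FDS = [({"S#"}, {"CITY"}), ({"S#", "P#"}, {"QTY"})]

print(left_reduce({"S#", "P#"}, {"CITY"}, SCOP_FDS))   # P# is redundant here
print(left_reduce({"S#", "P#"}, {"QTY"}, SCOP_FDS))    # nothing can be dropped
```

For CITY the determinant shrinks to S# alone, while for QTY both S# and P# are needed, which matches the statement that CITY is irreducibly dependent on S# but not on {S#, P#}.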


The fun starts...

Assume that each table has exactly one candidate key, which is thus also the primary key. Then a table is in 3NF iff the non-key attributes are 1) mutually independent, and 2) irreducibly dependent on the primary key.

In contrast, a table is in 1NF if, in every legal value of that table, every tuple contains exactly one value for each attribute.

A table that is only in 1NF, but not in any higher normal form, has a structure that is undesirable for several reasons. For example, suppose that information about suppliers and shipments is lumped together into a single table, FIRST, as follows:

FIRST {S#,STATUS,CITY,P#,QTY}

PRIMARY KEY {S#,P#}


Below is the FD diagram of this table, with an additional FD: CITY → STATUS:

This diagram is more complex than that for a 3NF table: in a 3NF diagram, arrows come out of candidate keys only, while in the diagram for FIRST there are additional arrows, which cause trouble. In fact, FIRST violates both conditions for being a 3NF table: the non-key attribute STATUS depends on another non-key attribute, CITY; and the non-key attributes are not all irreducibly dependent on the primary key, since both STATUS and CITY depend on S# alone.


An example

S#   STATUS   CITY     P#   QTY
S1   20       London   P1   300
S1   20       London   P2   200
S1   20       London   P3   400
S1   20       London   P4   200
S1   20       London   P5   100
S1   20       London   P6   100
S2   10       Paris    P1   300
S2   10       Paris    P2   400
S3   10       Paris    P2   200
S4   20       London   P2   200
S4   20       London   P4   400
S4   20       London   P5   400

The redundancy is obvious. E.g., every tuple for supplier S1 shows the city as London; likewise, every tuple for city London shows the status as 20.


Update related problems

The redundancies in FIRST lead to a number of problems when we apply update operations such as INSERT, DELETE, and UPDATE to the table. For the moment, we focus on the supplier-city redundancy.

1. We cannot insert the fact that a particular supplier is located in a particular city until that supplier supplies at least one part, since until then there is no primary key value for such a tuple.


2. By the same token, if we delete the only tuple for a particular supplier, we delete not only the shipment connecting that supplier to a particular part, but also the information that the supplier is located in a particular city.

3. The city value for a given supplier appears in FIRST many times, in general. Then, if supplier S1 moves to Amsterdam, we must either search the whole table to find every tuple connecting S1 and London, or risk leaving the table in an inconsistent state.
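The three problems listed above can be reproduced on a small fragment of FIRST. This is only a sketch: the dictionary keyed by (S#, P#), the helper names insert and city_of, and the rule that a primary key attribute may not be missing (modelled here with None) are assumptions made for the illustration.

```python
# A fragment of FIRST keyed by its primary key (S#, P#); values are (STATUS, CITY, QTY).
first = {
    ("S2", "P1"): (10, "Paris", 300),
    ("S2", "P2"): (10, "Paris", 400),
    ("S3", "P2"): (10, "Paris", 200),
}

def insert(table, s, p, status, city, qty):
    """Insert a tuple; every primary key attribute must have a value."""
    if s is None or p is None:
        raise ValueError("primary key (S#, P#) must be fully specified")
    table[(s, p)] = (status, city, qty)

def city_of(table, s):
    """Return the city recorded for supplier `s`, or None if no tuple mentions it."""
    cities = {city for (sno, _), (_, city, _) in table.items() if sno == s}
    return next(iter(cities), None)

# 1. Insertion: S5 is located in Athens but supplies no part yet.
try:
    insert(first, "S5", None, 30, "Athens", None)
    inserted = True
except ValueError:
    inserted = False

# 2. Deletion: removing S3's only shipment also removes S3's city.
del first[("S3", "P2")]
lost_city = city_of(first, "S3") is None

# 3. Update: moving S2 to another city means touching every S2 tuple.
tuples_to_touch = [key for key in first if key[0] == "S2"]

print(inserted, lost_city, len(tuples_to_touch))
```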


What to do?

The solution to these problems is to replace FIRST by the two tables:

SECOND {S#,STATUS,CITY}

and

SP {S#,P#,QTY}.

Below are the corresponding FD diagrams.


Sample values

S#   STATUS   CITY
S1   20       London
S2   10       Paris
S3   10       Paris
S4   20       London
S5   30       Athens

S#   P#   QTY
S1   P1   300
S1   P2   200
S1   P3   400
S1   P4   200
S1   P5   100
S1   P6   100
S2   P1   300
S2   P2   400
S3   P2   200
S4   P2   200
S4   P4   400
S4   P5   400


They are better!

All the aforementioned problems have been eliminated:

We can insert the information that S5 is located in Athens, even though S5 does not currently supply any parts.

We can also delete the shipment connecting S3 and P2 by deleting the appropriate tuple from SP, without losing the information that S3 is located in Paris.

Finally, in the revised structure, the city for a given supplier appears only once. Thus, the S#-CITY redundancy has been eliminated.


The effect of decomposing FIRST into SECOND and SP is to eliminate the dependencies that were not irreducible.

More specifically, in FIRST, the attribute CITY does not describe the entity identified by the primary key, namely a shipment; it merely describes the supplier involved in that shipment. Hence, we should not mix those two kinds of information in the same table.


Second normal form

A table is in 2NF iff it is in 1NF and every non-key attribute is irreducibly dependent on the primary key.

Both SECOND and SP are in 2NF, but FIRST is not, since the two non-key attributes STATUS and CITY depend on S# alone, whereas the primary key is {S#, P#}.

A table that is in 1NF but not in 2NF can always be reduced to an equivalent collection of 2NF tables, by replacing the 1NF table with suitable projections.


The process

Let R(A, B, C, D) be a table with primary key {A, B}, and suppose the FD A → D holds. Then R can be decomposed into the following two tables: R1(A, D), with A as its primary key; and R2(A, B, C), with {A, B} as its primary key and A as a foreign key referencing R1.

For example, given the table {S#, STATUS, CITY, P#, QTY}, the primary key is {S#, P#}, so A ≡ S# and B ≡ P#. We also have the FD S# → {CITY, STATUS}, so D ≡ {CITY, STATUS}; finally, C ≡ QTY.

Thus, by the process, we obtain two tables, {S#, CITY, STATUS} and {S#, P#, QTY}.
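The process can be checked on the sample value of FIRST given earlier. The Python below is a sketch, not the notes' own code: relation, project and join are hypothetical helper names, and tuples are modelled as frozensets of (attribute, value) pairs so that relations can be compared as sets. It forms R1(A, D) and R2(A, B, C) by projection and then joins them back together.

```python
def relation(heading, rows):
    """Build a relation as a set of tuples; each tuple is a frozenset of (attr, value) pairs."""
    return {frozenset(zip(heading, row)) for row in rows}

def project(rel, attrs):
    """Projection on `attrs`, with duplicate tuples removed."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def join(r1, r2):
    """Natural join on all common attributes."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

FIRST = relation(
    ("S#", "STATUS", "CITY", "P#", "QTY"),
    [("S1", 20, "London", "P1", 300), ("S1", 20, "London", "P2", 200),
     ("S1", 20, "London", "P3", 400), ("S1", 20, "London", "P4", 200),
     ("S1", 20, "London", "P5", 100), ("S1", 20, "London", "P6", 100),
     ("S2", 10, "Paris", "P1", 300), ("S2", 10, "Paris", "P2", 400),
     ("S3", 10, "Paris", "P2", 200), ("S4", 20, "London", "P2", 200),
     ("S4", 20, "London", "P4", 400), ("S4", 20, "London", "P5", 400)],
)

# A = S#, B = P#, C = QTY, D = {STATUS, CITY}, as in the text.
A, B, C, D = {"S#"}, {"P#"}, {"QTY"}, {"STATUS", "CITY"}

SECOND = project(FIRST, A | D)      # R1(A, D)
SP = project(FIRST, A | B | C)      # R2(A, B, C)

print(len(SECOND), len(SP))         # one tuple per supplier, one per shipment
print(join(SECOND, SP) == FIRST)    # the decomposition loses nothing
```

Because S# → {STATUS, CITY} holds in FIRST, Heath's theorem guarantees that the join of the two projections reproduces FIRST, and the final comparison confirms it on this sample.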


It is not good enough!

Below are the corresponding FD diagrams for those two tables.

Although table SP is actually in 3NF, and thus satisfactory for now, the FD diagram for SECOND is still more complex than a 3NF diagram.

More specifically, the dependency of STATUS on S# is transitive (via CITY). Such dependencies lead to update problems.


What do you mean?

1. We cannot insert the fact that a particular city has a status until we have some supplier actually located in that city.

2. Similarly, if we delete the only tuple for a particular city, we delete not only the information for the supplier concerned but also the information that the city has a particular status.

3. Again, the status for a given city appears many times, which leads to....

S#   STATUS   CITY
S1   20       London
S2   10       Paris
S3   10       Paris
S4   20       London
S5   30       Athens


One step further

Again, the solution is to replace the original SECOND with two projections, SC {S#, CITY} and CS {CITY, STATUS}, with the following diagram and the corresponding values:

S#   CITY
S1   London
S2   Paris
S3   Paris
S4   London
S5   Athens

CITY     STATUS
Athens   30
London   20
Paris    10
Rome     50


Third normal form

It is clear that the revised structure overcomes all the problems, and the effect of the projection is to eliminate the transitive dependence of, in this case, STATUS on S#. Again, STATUS does not describe the entity identified by the primary key of SECOND, namely a supplier; it describes the city in which that supplier is located.

In general, a table is in 3NF iff it is in 2NF and every non-key attribute is nontransitively dependent on the primary key. Both SC and CS are in 3NF, and SECOND is not.

Again, any table that is in 2NF but not in 3NF can be converted to an equivalent collection of 3NF tables, via projections.


The process

Let R(A, B, C, D) be a table with primary key A, and suppose the FD B → C holds. Then R can be decomposed into the following two tables: R1(B, C), with B as its primary key; and R2(A, B, D), with A as its primary key and B as a foreign key referencing R1.

Given SECOND {S#, CITY, STATUS}, we immediately have A ≡ S#. Moreover, since CITY → STATUS, we have B ≡ CITY, C ≡ STATUS, and D ≡ ∅.

The process then says that we should split the table into R1(CITY, STATUS) and R2(S#, CITY).


An alternative

Starting with the table FIRST, we applied one process to split it into SECOND and SP, both in 2NF, and then another process to convert SECOND into two tables, both in 3NF. We can also do it the other way around.

Given FIRST {S#, STATUS, CITY, P#, QTY}, we have A ≡ {S#, P#}. Since CITY → STATUS, we have B ≡ CITY and C ≡ STATUS. Finally, D ≡ QTY.

Then we obtain R1: {CITY, STATUS} and R2: {S#, P#, CITY, QTY}.

In R2, since S# → CITY, CITY is not irreducibly dependent on {S#, P#}. Hence, R2 is not in 2NF. Using the previous process, R2 can be further decomposed into R21: {S#, CITY} and R22: {S#, P#, QTY}.
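Both orders of decomposition can be compared directly on the sample value of FIRST. The sketch below is an illustration under assumptions: split_off is a hypothetical helper that, for an FD X → Y, returns the projection on X ∪ Y together with the projection on everything except Y, which covers both processes described above.

```python
def relation(heading, rows):
    """Build a relation as a set of tuples; each tuple is a frozenset of (attr, value) pairs."""
    return {frozenset(zip(heading, row)) for row in rows}

def project(rel, attrs):
    """Projection on `attrs`, with duplicate tuples removed."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def split_off(rel, lhs, rhs):
    """Split on the FD lhs -> rhs: (projection on lhs and rhs, projection on all attributes but rhs)."""
    heading = {a for t in rel for a, _ in t}
    return project(rel, lhs | rhs), project(rel, heading - rhs)

FIRST = relation(
    ("S#", "STATUS", "CITY", "P#", "QTY"),
    [("S1", 20, "London", "P1", 300), ("S1", 20, "London", "P2", 200),
     ("S1", 20, "London", "P3", 400), ("S1", 20, "London", "P4", 200),
     ("S1", 20, "London", "P5", 100), ("S1", 20, "London", "P6", 100),
     ("S2", 10, "Paris", "P1", 300), ("S2", 10, "Paris", "P2", 400),
     ("S3", 10, "Paris", "P2", 200), ("S4", 20, "London", "P2", 200),
     ("S4", 20, "London", "P4", 400), ("S4", 20, "London", "P5", 400)],
)

# Order 1: remove the partial dependency first, then the transitive one.
SECOND, SP = split_off(FIRST, {"S#"}, {"STATUS", "CITY"})
CS, SC = split_off(SECOND, {"CITY"}, {"STATUS"})

# Order 2: split on CITY -> STATUS first, then on S# -> CITY.
R1, R2 = split_off(FIRST, {"CITY"}, {"STATUS"})
R21, R22 = split_off(R2, {"S#"}, {"CITY"})

print(R1 == CS, R21 == SC, R22 == SP)   # the same three tables either way
```

On this data the two routes end with the same three tables, with headings {CITY, STATUS}, {S#, CITY} and {S#, P#, QTY}.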

Dependency preservation

During the reduction process, it is often the case that a given table can be decomposed, without losing any information, in a variety of different ways. Consider the table SECOND {S#, STATUS, CITY}, with FDs S# → CITY and CITY → STATUS, and thus S# → STATUS by transitivity. Besides the given decomposition A, namely SC {S#, CITY} and CS {CITY, STATUS}, we can also use decomposition B: SC {S#, CITY} and SS {S#, STATUS}. (?)

Both of them are nonlossy, and both yield 3NF tables. But decomposition B is less satisfactory, since it is still not possible to insert the information that a particular city has a particular status unless some supplier is located in that city.


Dig a bit deeper

We may observe that the two projections in decomposition A are independent of one another, in the sense that updates can be made to either one without regard for the other, provided that the update is legal within the context of the projection concerned.

In decomposition B, updates to either of the two projections must be monitored to ensure that the FD CITY → STATUS is not violated. Thus, the two projections depend on each other.


Big deal

The essence is that in decomposition B, the aforementioned FD has become a database constraint that relates two tables. For decomposition A, on the other hand, it is the transitive FD S# → STATUS that becomes a database constraint, and it is automatically enforced as long as the two separate table constraints S# → CITY and CITY → STATUS are enforced. The latter is much easier to achieve, since we only need to enforce two table constraints.


You have a choice

Thus, the concept of independent projections provides a guideline for choosing a particular decomposition when there is more than one possibility.

It has been shown that projections R1 and R2 of a table R are independent iff 1) every FD in R is implied by those in R1 and R2; and 2) the common attributes of R1 and R2 form a candidate key for at least one of the pair.

For example, in A, the common attribute CITY constitutes the primary key for CS, and every FD in SECOND either appears in one of the two projections or logically follows from those that do. For B, the FD CITY → STATUS cannot be deduced from the FDs of those projections.

There is an algorithm that can decompose any table, without losing information, into a set of independent 3NF projections.
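The two conditions for independence can be tested mechanically for decompositions A and B of SECOND. The code below is a sketch rather than the algorithm mentioned above: the function names are hypothetical, and the FDs of a projection are obtained by brute force (for every subset X of its heading, X → closure(X) restricted to the heading), which is only practical for small headings.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def projected_fds(heading, fds):
    """Nontrivial FDs that hold in the projection on `heading`."""
    out = []
    attrs = sorted(heading)
    for n in range(1, len(attrs) + 1):
        for combo in combinations(attrs, n):
            x = frozenset(combo)
            rhs = frozenset(closure(x, fds) & heading) - x
            if rhs:
                out.append((x, rhs))
    return out

def is_candidate_key(attrs, heading, fds):
    """`attrs` determines the whole heading, and no proper subset of it does."""
    if not heading <= closure(attrs, fds):
        return False
    return all(not heading <= closure(attrs - {a}, fds) for a in attrs)

def independent(r1, r2, fds):
    """Check the two conditions stated above for projections r1 and r2."""
    f1, f2 = projected_fds(r1, fds), projected_fds(r2, fds)
    cond1 = all(rhs <= closure(lhs, f1 + f2) for lhs, rhs in fds)
    common = r1 & r2
    cond2 = is_candidate_key(common, r1, f1) or is_candidate_key(common, r2, f2)
    return cond1 and cond2

FDS = [(frozenset({"S#"}), frozenset({"CITY"})),
       (frozenset({"CITY"}), frozenset({"STATUS"}))]
SC = frozenset({"S#", "CITY"})
CS = frozenset({"CITY", "STATUS"})
SS = frozenset({"S#", "STATUS"})

print(independent(SC, CS, FDS))   # decomposition A
print(independent(SC, SS, FDS))   # decomposition B
```

Decomposition A passes both conditions, while B fails the first one because CITY → STATUS cannot be derived from S# → CITY and S# → STATUS.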


Boyce/Codd normal form

Now, we drop our assumption that every table has just one candidate key and consider the general case, for which we define another normal form. Recall that the term determinant refers to the left-hand side of an FD. Also, by a trivial FD, we mean an FD whose right-hand side is a subset of its left-hand side.

A table is in BCNF iff every determinant of a nontrivial, left-irreducible FD is a candidate key. Thus, for such tables, the only arrows in their diagrams come out of candidate keys.


It is clear that FIRST is not in BCNF, since two of its determinants, S# and CITY, are not candidate keys. Similarly, SECOND is not in BCNF either, since its determinant CITY is not a candidate key.

On the other hand, tables SP, SC, and CS are all in BCNF, since in each case the sole candidate key is the only determinant in the table.

Let’s consider another example, involving two disjoint candidate keys. Suppose that in the usual suppliers table S, S# and SNAME are both candidate keys. Assume, however, that STATUS and CITY are now mutually independent. Thus, we now have the following diagram for their FDs.

Table S is in BCNF, since the only determinants are candidate keys.


Example continued...

Now, we consider an example in which the candidate keys overlap, i.e., they share at least one attribute. Again, we assume that supplier names are unique, and we consider SSP {S#, SNAME, P#, QTY}, with {S#, P#} and {SNAME, P#} as candidate keys. Since both S# and SNAME are determinants (each determines the other, because supplier names are unique) and neither is a candidate key by itself, SSP is not in BCNF.

S#   SNAME   P#   QTY
S1   Smith   P1   300
S1   Smith   P2   200
S1   Smith   P3   400
S1   Smith   P4   200

It is clear that SSP has redundancies and will suffer from update anomalies as well.
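The BCNF verdicts for the tables discussed so far can be reproduced with a short check. The sketch below uses an equivalent formulation rather than the definition as worded above: a table is treated as being in BCNF when the determinant of every nontrivial FD in the given set determines the whole heading. The function names and the FD sets written out for each table are assumptions based on the FDs stated in the text.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(heading, fds):
    """True if every determinant of a nontrivial FD in `fds` determines all of `heading`."""
    for lhs, rhs in fds:
        if rhs <= lhs:                 # trivial FD, nothing to check
            continue
        if not heading <= closure(lhs, fds):
            return False
    return True

TABLES = {
    "FIRST": ({"S#", "STATUS", "CITY", "P#", "QTY"},
              [({"S#", "P#"}, {"QTY"}), ({"S#"}, {"CITY"}), ({"CITY"}, {"STATUS"})]),
    "SECOND": ({"S#", "STATUS", "CITY"},
               [({"S#"}, {"CITY"}), ({"CITY"}, {"STATUS"})]),
    "SP": ({"S#", "P#", "QTY"}, [({"S#", "P#"}, {"QTY"})]),
    "SC": ({"S#", "CITY"}, [({"S#"}, {"CITY"})]),
    "CS": ({"CITY", "STATUS"}, [({"CITY"}, {"STATUS"})]),
    "SSP": ({"S#", "SNAME", "P#", "QTY"},
            [({"S#", "P#"}, {"QTY"}), ({"S#"}, {"SNAME"}), ({"SNAME"}, {"S#"})]),
}

results = {name: is_bcnf(heading, fds) for name, (heading, fds) in TABLES.items()}
for name, ok in results.items():
    print(name, "is in BCNF" if ok else "is not in BCNF")
```

Checking only the listed FDs is enough here because each table is tested against its own complete FD set; for a projection of a larger table, the FDs that hold in the projection would have to be worked out first.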


Yet another solution

So, we break SSP into two projections: SS {S#, SNAME} and SP {S#, P#, QTY}; or, alternatively, SS {S#, SNAME} and SP {SNAME, P#, QTY}. In both cases, the resulting tables are in BCNF.

Common sense tells us that the decomposed structure is much preferable to SSP, a judgment that is supported by functional dependency theory and the BCNF theory.

Homework: Do either Exercise 12.3, or 12.4.
