Normalization (Review)

Database Systems

By Stanley Githinji

WARNING

• This stuff can get confusing.

• So concentrate. This is the science bit.


Redundancy & Normalisation

• Redundant data
  • Is data that _already_ exists elsewhere in the database

• Redundant data leads to various subtle, but important problems:
  • INSERT anomalies
  • UPDATE anomalies
  • DELETE anomalies

• Normalisation
  • Aims to reduce data redundancy

• Redundancy is expressed in terms of dependencies

• Normal forms are defined that do not have certain types of dependency


What is Normalization?

• It is a mathematical process that converts one set of formulae into another equivalent set of formulae. That is it.

• This only makes sense if you think of information in terms of propositions – statements of fact.

• Do not think in terms of objects and entities at the logical level. This is not how we communicate information.


Propositions Example

Program(X-Factor) & Host(Kate)
Program(I’m a celebrity) & Host(Ant) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)

Unnormalized Reality TV (a mess)

Program          Host      coHost
X-Factor         Kate      null
I’m a Celebrity  Ant, Dec  null
Big Brother      Davina    Dermot

By rearranging these propositions into different forms we can achieve a better structure for manipulating the info… this is Normalization.

'Zeroth' and 1st Normal Form

• In the original definition of the relational model

• All data values should be atomic

• This means that table entries should be single values, not repeating groups or ‘complex’ objects

• A relation is said to be in first normal form (1NF) if

• All data values are atomic

• No duplicate columns

• A 'relation' that is not in 1NF is said to be in 'zeroth' normal form (0NF), and is unnormalized


0NF to 1NF

To convert a 0NF ‘relation’ to a 1NF relation: split up any non-atomic values.

0NF - Teaching

Module  Dept  Lecturer  Text
M1      D1    L1        T1, T2
M2      D1    L1        T1, T3
M3      D1    L2        T4
M4      D2    L3        T1, T5
M5      D2    L4        T6

1NF - Teaching

Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6
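The split from 0NF to 1NF can be sketched in a few lines of Python. This is only an illustration: the list-of-tuples layout and the names `teaching_0nf` and `to_1nf` are assumptions made for the example, not part of the slides.

```python
# Hypothetical in-memory copy of the 0NF Teaching table.
# The Text column holds a list, so its values are not atomic.
teaching_0nf = [
    ("M1", "D1", "L1", ["T1", "T2"]),
    ("M2", "D1", "L1", ["T1", "T3"]),
    ("M3", "D1", "L2", ["T4"]),
    ("M4", "D2", "L3", ["T1", "T5"]),
    ("M5", "D2", "L4", ["T6"]),
]


def to_1nf(rows):
    """Emit one row per text, so every value in the result is atomic."""
    return [(module, dept, lecturer, text)
            for module, dept, lecturer, texts in rows
            for text in texts]


teaching_1nf = to_1nf(teaching_0nf)
for row in teaching_1nf:
    print(*row)
```

Each repeating group becomes several rows, which is exactly where the redundancy in the 1NF table comes from.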


Back to Reality

Program(X-Factor) & Host(Kate)
Program(I’m a celebrity) & Host(Ant) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)

becomes

Program(X-Factor) & Host(Kate)
Program(I’m a celebrity) & Host(Ant)
Program(I’m a celebrity) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)

0NF - Reality TV

Program          Host      coHost
X-Factor         Kate      null
I’m a Celebrity  Ant, Dec  null
Big Brother      Davina    Dermot

1NF - Reality TV

Program          Host    coHost
X-Factor         Kate    null
I’m a Celebrity  Ant     null
I’m a Celebrity  Dec     null
Big Brother      Davina  Dermot

What have we done there?

• We took unformatted information and put it into a format that allows it to be represented as… a mathematical relation.

• 1NF is different from subsequent normalization: it essentially says that all data must fit into relations.

• I.e. A table = relation by 1NF


But there are still problems in 1NF…

• INSERT anomalies
  • Can't add a module with no texts

• UPDATE anomalies
  • To change lecturer for M1, we have to change two rows

• DELETE anomalies
  • If we remove M3, we remove L2 as well

1NF - Teaching

Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6

Functional Dependencies

• Redundancy can often be described as a functional dependency

• A functional dependency (FD) is a semantic link between two sets of attributes in a relation

• Another part of 'normalisation' is to remove undesirable FDs

• A set of attributes, A, functionally determines another set, B, if:

• Whenever two rows of the relation have the same value for all attributes in A then they also have the same value for all attributes in B.

• We say: A → B
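The definition translates almost word for word into a check over rows. A minimal sketch, assuming the 1NF Teaching rows are held as Python dicts; `fd_holds` is a name made up for this example. Bear in mind that an FD is a statement about meaning, so sample rows can only refute it, never prove it.

```python
COLUMNS = ("Module", "Dept", "Lecturer", "Text")
TEACHING = [
    ("M1", "D1", "L1", "T1"), ("M1", "D1", "L1", "T2"),
    ("M2", "D1", "L1", "T1"), ("M2", "D1", "L1", "T3"),
    ("M3", "D1", "L2", "T4"), ("M4", "D2", "L3", "T1"),
    ("M4", "D2", "L3", "T5"), ("M5", "D2", "L4", "T6"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in TEACHING]


def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        a = tuple(row[c] for c in lhs)
        b = tuple(row[c] for c in rhs)
        # Remember the first rhs value seen for this lhs value.
        if seen.setdefault(a, b) != b:
            return False
    return True


print(fd_holds(ROWS, ["Module"], ["Lecturer"]))  # consistent with the data
print(fd_holds(ROWS, ["Text"], ["Module"]))      # refuted: T1 is used by M1, M2, M4
```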


Why care about FD?

• We define a set of 'normal forms'

• Each normal form has fewer FDs than the last

• Since FDs represent redundancy, each normal form has less redundancy than the last

• Not all FDs cause a problem, but…

• We identify various sorts of FD that do.

• Each normal form removes a type of FD that is a problem.

• We will also need a way to remove FDs.


Properties of FDs

In any relation:

• The primary key functionally determines any set of attributes in that relation: K → X (K is the primary key, X is any set of other attributes)

• Any set of attributes is, of course, functionally dependent on itself: X → X

Rules for FDs

• Reflexivity: If B is a subset of A then A → B

• Augmentation: If A → B then A ∪ C → B ∪ C

• Transitivity: If A → B and B → C then A → C
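In practice these rules are usually applied through an attribute closure: start with a set of attributes and keep adding whatever the FDs let you derive. A rough sketch, assuming FDs are stored as (left-hand set, right-hand set) pairs; `closure` is an illustrative helper, not something defined on the slides.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)          # reflexivity: attrs determines itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have all of lhs, transitivity gives us rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# FDs from the Teaching example used in these slides.
FDS = [({"Module"}, {"Lecturer"}), ({"Lecturer"}, {"Dept"})]

print(sorted(closure({"Module"}, FDS)))
print(sorted(closure({"Module", "Text"}, FDS)))
```

If the closure of a set is the whole scheme, that set is a key (or contains one).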


FD Example

• The primary key is {Module, Text}, so {Module, Text} → {Dept, Lecturer}

• 'Trivial' FDs, e.g.:
  • {Text, Dept} → {Text}
  • {Module} → {Module}
  • {Dept, Lecturer} → { }

1NF - Teaching

Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6

FD Example

• Other FDs are:
  • {Module} → {Lecturer}
  • {Module} → {Dept}
  • {Lecturer} → {Dept}

• These are non-trivial and don't come from the primary key

FD Diagrams

[FD diagram over Module, Dept, Lecturer, Text]

{Module, Text} is the primary key, so we put a double box around them

{Module} → {Dept} and {Module} → {Lecturer}, so we have {Module} → {Dept, Lecturer}

{Lecturer} → {Dept}, so we have an arrow from Lecturer to Dept

Partial FDs and 2NF

• Partial FDs:
  • An FD, A → B, is a partial FD if some attribute of A can be removed and the FD still holds
  • Formally, there is some proper subset of A, C ⊂ A, such that C → B (i.e. a member of the set A is superfluous)

• 2nd normal form
  • A relation is in second normal form (2NF) if it is in 1NF and no non-primary-key attribute is partially dependent on the primary key

Second Normal Form

• The 1NF Teaching relation is not in 2NF

• We have the FD: {Module, Text} → {Lecturer, Dept}

• But also: {Module} → {Lecturer, Dept}

• And so Lecturer and Dept are partially dependent on the primary key
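Spotting partial dependencies can be mechanised: try every proper subset of the key and see which non-key attributes it already determines. A sketch under the assumption that FDs are stored as pairs of sets; `closure` and `partial_dependencies` are hypothetical helper names.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, attrs, fds):
    """Map each proper subset of key to the non-key attributes it determines."""
    non_key = set(attrs) - set(key)
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) & non_key
            if determined:
                found[subset] = determined
    return found


ATTRS = {"Module", "Dept", "Lecturer", "Text"}
FDS = [({"Module"}, {"Lecturer"}), ({"Lecturer"}, {"Dept"})]

# Module alone already determines Dept and Lecturer, so Teaching is not in 2NF.
print(partial_dependencies({"Module", "Text"}, ATTRS, FDS))
```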


So what do we do?

• Say we have a relation with scheme S and the full FD A → B.

• We can organize this to make sure A ∩ B = { } (i.e. no trivial dependencies)

• Let C = S – (A ∪ B)

• So we have:
  • A – attributes on the LHS of the FD
  • B – attributes on the RHS of the FD
  • C – all other attributes

• Well, it turns out that we can split the relation into two parts:
  • R1, with scheme: C ∪ A
  • R2, with scheme: A ∪ B

• The original relation can be recovered as the natural join of R1 and R2 (vital point)
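The vital point can be checked on the Teaching data. The sketch below takes A = {Module}, B = {Dept, Lecturer} and C = {Text}, projects the relation onto C ∪ A and A ∪ B, and joins the pieces back together. The helpers `project`, `natural_join` and `as_set` are assumptions written for this example.

```python
COLUMNS = ("Module", "Dept", "Lecturer", "Text")
TEACHING = [
    ("M1", "D1", "L1", "T1"), ("M1", "D1", "L1", "T2"),
    ("M2", "D1", "L1", "T1"), ("M2", "D1", "L1", "T3"),
    ("M3", "D1", "L2", "T4"), ("M4", "D2", "L3", "T1"),
    ("M4", "D2", "L3", "T5"), ("M5", "D2", "L4", "T6"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in TEACHING]


def project(rows, cols):
    """Keep only cols from each row and drop duplicate rows."""
    out = []
    for row in rows:
        small = {c: row[c] for c in cols}
        if small not in out:
            out.append(small)
    return out


def natural_join(r1, r2):
    """Combine every pair of rows that agree on all shared columns."""
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in a.keys() & b.keys())]


def as_set(rows):
    """Order-independent view of a relation, for comparison."""
    return {tuple(sorted(row.items())) for row in rows}


r1 = project(ROWS, ["Module", "Text"])              # C ∪ A
r2 = project(ROWS, ["Module", "Dept", "Lecturer"])  # A ∪ B
print(as_set(natural_join(r1, r2)) == as_set(ROWS))
```

The join gives back exactly the original rows because Module determines everything in R2, so each R1 row matches exactly one R2 row.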


1NF to 2NF – Example

1NF - Teaching

Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6

2NF - Text

Module  Text
M1      T1
M1      T2
M2      T1
M2      T3
M3      T4
M4      T1
M4      T5
M5      T6

2NF - Modules

Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

Problems Resolved in 2NF

Those 1NF problems:

• INSERT – Can't add a module with no texts

• UPDATE – To change lecturer for M1, we have to change two rows

• DELETE – If we remove M3, we remove L2 as well, and lose information we might not have wanted to!

• In 2NF the first two are resolved, but not the third one


Problems Remaining in 2NF

• INSERT anomalies
  • Can't add lecturers who teach no modules

• UPDATE anomalies
  • To change the department for L1 we must alter two rows

• DELETE anomalies
  • If we delete M3 we delete L2 as well

2NF - Modules

Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

Transitive FDs and 3NF

• Transitive FDs:
  • An FD, A → C, is a transitive FD if there is some set B such that A → B and B → C are non-trivial FDs in the relation
  • I.e. there exists: A → B → C

• Third normal form:
  • A relation is in third normal form (3NF) if it is in 2NF and no non-primary-key attribute is transitively dependent on the primary key

Third Normal Form

• The 2NF Modules relation is not in 3NF

• We have the FDs:
  • {Module} → {Lecturer}
  • {Lecturer} → {Dept}

• So there is a transitive FD from the primary key {Module} to {Dept}
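A transitive dependency can be found by looking for an FD whose left-hand side is reachable from the key but is not itself a key. This is a simplified sketch that assumes the relation is already in 2NF and that FDs are stored as pairs of sets; the helper names are made up for the example.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def transitive_dependencies(key, attrs, fds):
    """List (via, targets) pairs where key -> via -> targets and via is not a superkey."""
    attrs, key = set(attrs), set(key)
    reachable = closure(key, fds)
    found = []
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) == attrs
        targets = rhs - lhs - key   # non-key attributes hanging off lhs
        if lhs <= reachable and not is_superkey and targets:
            found.append((lhs, targets))
    return found


ATTRS = {"Module", "Dept", "Lecturer"}
FDS = [({"Module"}, {"Lecturer"}), ({"Lecturer"}, {"Dept"})]

# Dept hangs off Lecturer, which is not a key of the 2NF Modules relation.
print(transitive_dependencies({"Module"}, ATTRS, FDS))
```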


2NF to 3NF – Example

2NF - Modules

Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

3NF - Places

Lecturer  Dept
L1        D1
L2        D1
L3        D2
L4        D2

3NF - Modules

Module  Lecturer
M1      L1
M2      L1
M3      L2
M4      L3
M5      L4

Problems Resolved in 3NF

• Problems in 2NF

• INSERT – Can't add lecturers who teach no modules

• UPDATE – To change the department for L1 we must alter two rows

• DELETE – If we delete M3 we delete L2 as well

• In 3NF all of these are resolved:

3NF - Places

Lecturer  Dept
L1        D1
L2        D1
L3        D2
L4        D2

3NF - Modules

Module  Lecturer
M1      L1
M2      L1
M3      L2
M4      L3
M5      L4

Normalisation and Design

• Normalisation is integrally related to DB design:

• A database should normally be in 3NF at least

• If your design leads to a non-3NF DB, then you might want to revise it

• When you find you have a non-3NF DB:
  • Identify the FDs that are causing a problem
  • Think if they will lead to any insert, update, or delete anomalies
  • Try to remove them

So, the story so far:

• 1NF – turn your propositions into a format that fits relations by removing repeating groups.

• 2NF – look for partial FDs and separate into another table anything that is not functionally dependent on the full primary key.

• 3NF – look for transitive FDs and separate them off into another table.

1. Normalization refresher

• Normalization reduces data redundancy in a database

• By doing so it eliminates serious manipulation anomalies.

• Normalization is ultimately just rearranging propositions to a better structure.

• This is done by identifying and removing damaging functional dependencies.

Normal forms so far…

• First normal form
  • All data values are atomic, and so everything fits into a mathematical relation.

• Second normal form
  • As 1NF plus no non-primary-key attribute is partially dependent on the primary key

• Third normal form
  • As 2NF plus no non-primary-key attribute depends transitively on the primary key

2. Normalization Example

• Consider a table representing orders in an online store

• Each entry in the table represents an item on a particular order. (thinking in terms of records. Yuk.)

• Columns
  • Order
  • Product
  • Customer
  • Address
  • Quantity
  • UnitPrice

• Primary key is {Order, Product}

Functional Dependencies

1. {Order} → {Customer}: each order is for a single customer

2. {Customer} → {Address}: each customer has a single address

3. {Product} → {UnitPrice}: each product has a single price

4. {Order} → {Address}: follows from FDs 1 and 2 by transitivity

Example – FD Diagram

[FD diagram for relation R (1NF): Order, Product, Customer, Address, Quantity, UnitPrice]

Normalisation to 2NF

• Remember 2nd normal form means no partial dependencies on the key. But we have:
  • {Order} → {Customer, Address}
  • {Product} → {UnitPrice}

  and a primary key of {Order, Product}

• So to get rid of the first FD we project over:
  • {Order, Customer, Address} and
  • {Order, Product, Quantity, UnitPrice}

Normalisation to 2NF

R (1NF): Order, Product, Customer, Address, Quantity, UnitPrice

splits into

R1: Order, Customer, Address
R2: Order, Product, Quantity, UnitPrice

Normalisation to 2NF

• R1 is now in 2NF, but there is still a partial FD in R2: {Product} → {UnitPrice}

• To remove this we project over: {Product, UnitPrice} and {Order, Product, Quantity}

Normalisation to 2NF

R2 (1NF): Order, Product, Quantity, UnitPrice

splits into R3 and R4 (2NF): {Product, UnitPrice} and {Order, Product, Quantity}

Now let’s go 3NF…

• R has now been split into 3 relations: R1, R3, and R4… but R1 (Order, Customer, Address) has a transitive FD on its key:

  {Order} → {Customer} → {Address}

• To remove this problem we project R1 over: {Order, Customer} and {Customer, Address}

So more chopping…

R1 (2NF): Order, Customer, Address

splits into (3NF)

R5: Order, Customer
R6: Customer, Address

Let’s summarize that:

• 1NF:
  • {Order, Product, Customer, Address, Quantity, UnitPrice}

• 2NF:
  • {Order, Customer, Address}
  • {Product, UnitPrice}
  • {Order, Product, Quantity}

• 3NF:
  • {Product, UnitPrice}
  • {Order, Product, Quantity}
  • {Order, Customer}
  • {Customer, Address}
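To see that nothing is lost along the way, the four 3NF projections can be joined back together. The order rows below are invented sample data (the slides give no rows for this table), and `project`, `natural_join` and `as_set` are hypothetical helpers written for this sketch.

```python
# Invented sample rows that respect the FDs on the slides.
ORDERS = [
    {"Order": 1, "Product": "pen", "Customer": "Ann", "Address": "1 High St",
     "Quantity": 2, "UnitPrice": 1.5},
    {"Order": 1, "Product": "ink", "Customer": "Ann", "Address": "1 High St",
     "Quantity": 1, "UnitPrice": 4.0},
    {"Order": 2, "Product": "pen", "Customer": "Bob", "Address": "9 Low Rd",
     "Quantity": 5, "UnitPrice": 1.5},
    {"Order": 3, "Product": "ink", "Customer": "Ann", "Address": "1 High St",
     "Quantity": 3, "UnitPrice": 4.0},
]


def project(rows, cols):
    """Keep only cols from each row and drop duplicate rows."""
    out = []
    for row in rows:
        small = {c: row[c] for c in cols}
        if small not in out:
            out.append(small)
    return out


def natural_join(r1, r2):
    """Combine every pair of rows that agree on all shared columns."""
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in a.keys() & b.keys())]


def as_set(rows):
    """Order-independent view of a relation, for comparison."""
    return {tuple(sorted(row.items())) for row in rows}


prices = project(ORDERS, ["Product", "UnitPrice"])
amounts = project(ORDERS, ["Order", "Product", "Quantity"])
purchase = project(ORDERS, ["Order", "Customer"])
details = project(ORDERS, ["Customer", "Address"])

rebuilt = natural_join(natural_join(natural_join(purchase, details), amounts), prices)
print(as_set(rebuilt) == as_set(ORDERS))
```

Ann's address and each unit price are now stored once, yet the join still rebuilds every original row.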


So this…

R (1NF): Order, Product, Customer, Address, Quantity, UnitPrice

has become this… (3NF)

• Purchase: Order, Customer
• Details: Customer, Address
• Amounts: Order, Product, Quantity
• Prices: Product, UnitPrice

3. Boyce-Codd Normal Form

Edgar F. Codd. Revolutionized DBs by inventing the relational model (RM) at IBM in 1970. English. Obviously. Very clever. Also responsible for all that DBS coursework you have to do…

• Did Codd make any mistakes along the way?
  • Nulls
  • Forbidding complex objects in 1NF.
  • Not initially seeing that primary keys are a superfluous concept.
  • Initially thinking 3NF was enough.

The Primary Key Myth

• In all our discussions so far we have considered the existence of a primary key.

• What if more than one column (or set of columns) could be the primary key? Which of these candidate keys should we pick?

• None - Candidate keys are the vital concept. Calling one ‘primary’ is pretty unimportant.


But hold on there, Sherlock…

• When we defined our normal forms we always talked about the primary key.

• This was fine if we only had one candidate key, but what if there are several candidate keys?

• This realization changed the requirement of having data in 3NF to requiring Boyce-Codd Normal Form

• … which relies on a concept called prime attributes.


Prime Attributes

• An attribute of a relation is called prime if it is part of a candidate key, and non-prime otherwise.

2NF definition alters:

• Old: As 1NF and in addition no non-primary-key attribute is partially dependent on the primary key.

• New: As 1NF and in addition no non-prime attribute is partially dependent on any candidate key.

n.b. these are the same if there is only one candidate key

3NF Revisited

• The same change is made to 3NF:

3NF definition alters:

• Old: As 2NF and in addition no non-primary-key attribute depends transitively on the primary key.

• New: As 2NF and in addition no non-prime attribute is transitively dependent on any candidate key.

Boyce-Codd Normal Form

• Going to explain this by example.

• Consider a relation, Labs, which stores information about the enrollments for the various lab sessions on computer science courses:

Schema: Labs(Student, Subject, Lecturer)

• Each course can have several lab session slots.

• Each student taking a course is assigned to a single lab session slot for it.

• Each lab session slot is managed by a single lecturer.

Example: Labs Relation

Student  Subject    Lecturer
Pauline  Java       Jim
Pauline  Databases  Peter
Enden    Java       Jim
Enden    Databases  Peter
Blom     Java       Tim

Candidate keys: {Student, Subject} and {Student, Lecturer}

FDs in the Labs Relation

The Labs table (Student, Subject, Lecturer) has the following non-trivial FDs:

• For each lab subject a student is taught by only one lecturer: {Student, Subject} → {Lecturer}

• A lecturer only teaches one lab subject: {Lecturer} → {Subject}
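Given those two FDs, the candidate keys and the prime attributes of Labs can be worked out by brute force: a candidate key is a minimal set of attributes whose closure is the whole scheme. The sketch below assumes FDs are stored as pairs of sets; `closure` and `candidate_keys` are illustrative names, and the exhaustive search is only sensible for small schemes like this one.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Return the minimal attribute sets that determine every attribute."""
    attrs = set(attrs)
    keys = []
    # Smallest sets first, so any superset of a found key is skipped.
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            candidate = set(combo)
            if any(k <= candidate for k in keys):
                continue
            if closure(candidate, fds) == attrs:
                keys.append(candidate)
    return keys


ATTRS = {"Student", "Subject", "Lecturer"}
FDS = [({"Student", "Subject"}, {"Lecturer"}), ({"Lecturer"}, {"Subject"})]

keys = candidate_keys(ATTRS, FDS)
prime = set().union(*keys)
print(keys)
print(sorted(prime))   # every attribute of Labs is prime
```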


Can we normalize?

• 1NF - Any repeating groups? NO

• 2NF - Are there any partial dependencies? NO

• 3NF - Is {Student, Subject} → {Lecturer} → {Subject} cyclic? NO
  (The key at the start is {Student, Subject}, not {Subject}. Every attribute is also prime, so no non-prime attribute can depend transitively on a candidate key.)

The FDs are:

{Student, Subject} → {Lecturer}

{Lecturer} → {Subject}

…so the table is already in 3NF

But it still has Anomalies!

3NF Labs

Student  Subject    Lecturer
Pauline  Java       Jim
Pauline  Databases  Peter
Enden    Java       Jim
Enden    Databases  Peter
Blom     Java       Tim

• INSERT anomalies
  • You can’t set up an empty lab session

• UPDATE anomalies
  • Gary taking over Jim’s Java lab involves changing two rows.

• DELETE anomalies
  • Deleting Blom means losing all knowledge of Tim’s Java lab.

Oh for Pete's sake, so much for 3NF…

4. Boyce-Codd Normal Form

• It was quickly found that 3NF isn't perfect.

• But ONLY on the rare occurrences that:
  (a) candidate keys are composite
  (b) there is more than one candidate key
  (c) those candidate keys overlap.

• The problem is being caused by dependence between parts of the keys themselves:

  {Lecturer} → {Subject}, but Lecturer by itself is not a key

The Solution - BCNF

• A relation is in Boyce-Codd normal form (BCNF) if for every FD A → B either:
  • B is contained in A (the FD is trivial), or
  • A contains a candidate key of the relation

• This is the same as 3NF, except that 3NF also accepts an FD when B is prime (part of a candidate key); BCNF does not allow that exception

• Remember if there is only one candidate key then 3NF and BCNF are the same thing.
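The BCNF test is short enough to write down directly: for each FD, either it is trivial or its left-hand side must determine the whole relation. A minimal sketch, assuming FDs are stored as pairs of sets and checking only the FDs it is given; `closure` and `bcnf_violations` are hypothetical names.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attrs, fds):
    """Return the non-trivial FDs whose left-hand side is not a superkey."""
    attrs = set(attrs)
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != attrs]


ATTRS = {"Student", "Subject", "Lecturer"}
FDS = [({"Student", "Subject"}, {"Lecturer"}), ({"Lecturer"}, {"Subject"})]

# Lecturer determines Subject, but Lecturer alone is not a key of Labs.
print(bcnf_violations(ATTRS, FDS))
```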


So what do we do?

Split the 3NF relation (Student, Subject, Lecturer) into (Student, Subject) and (Subject, Lecturer) to reach BCNF?

NO! Because we have lost information this way. We have lost the links between an individual 'lab' and the person in it!

We would incorrectly have:

Student  Subject
Pauline  Java
Pauline  Databases
Enden    Java
Enden    Databases
Blom     Java

Subject    Lecturer
Java       Jim
Java       Tim
Databases  Peter

If we joined them back together we would have no way of knowing which people who did Java were in Tim’s session or Jim’s session.
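Joining the two halves back together shows the damage: the join invents rows that were never in Labs. A small sketch using the rows from the slides; `project` and `natural_join` are helper names assumed for this example.

```python
LABS = [
    {"Student": "Pauline", "Subject": "Java", "Lecturer": "Jim"},
    {"Student": "Pauline", "Subject": "Databases", "Lecturer": "Peter"},
    {"Student": "Enden", "Subject": "Java", "Lecturer": "Jim"},
    {"Student": "Enden", "Subject": "Databases", "Lecturer": "Peter"},
    {"Student": "Blom", "Subject": "Java", "Lecturer": "Tim"},
]


def project(rows, cols):
    """Keep only cols from each row and drop duplicate rows."""
    out = []
    for row in rows:
        small = {c: row[c] for c in cols}
        if small not in out:
            out.append(small)
    return out


def natural_join(r1, r2):
    """Combine every pair of rows that agree on all shared columns."""
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in a.keys() & b.keys())]


student_subject = project(LABS, ["Student", "Subject"])
subject_lecturer = project(LABS, ["Subject", "Lecturer"])

joined = natural_join(student_subject, subject_lecturer)
spurious = [row for row in joined if row not in LABS]

print(len(LABS), len(joined))
for row in spurious:
    print(row)
```

Every Java student is paired with both Jim and Tim, so the join can no longer say who sat in which session.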


BCNF completed

• If you fail BCNF, there is something wrong with your propositions…

• …they were actually about two things, at least one of which you did not identify correctly.

Enrollment

Student  Subject    Lab
Pauline  Java       1
Pauline  Databases  3
Enden    Java       1
Enden    Databases  3
Blom     Java       2

Labs

Lab  Lecturer
1    Jim
2    Tim
3    Peter
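Once the lab itself is identified, the split is safe. Joining the two relations on Lab and dropping the Lab column gives back exactly the original student, subject and lecturer facts. The sketch below uses the rows from the slides; the variable and helper names are assumptions for illustration.

```python
ENROLLMENT = [
    {"Student": "Pauline", "Subject": "Java", "Lab": 1},
    {"Student": "Pauline", "Subject": "Databases", "Lab": 3},
    {"Student": "Enden", "Subject": "Java", "Lab": 1},
    {"Student": "Enden", "Subject": "Databases", "Lab": 3},
    {"Student": "Blom", "Subject": "Java", "Lab": 2},
]
LAB_LECTURER = [
    {"Lab": 1, "Lecturer": "Jim"},
    {"Lab": 2, "Lecturer": "Tim"},
    {"Lab": 3, "Lecturer": "Peter"},
]
ORIGINAL = {
    ("Pauline", "Java", "Jim"), ("Pauline", "Databases", "Peter"),
    ("Enden", "Java", "Jim"), ("Enden", "Databases", "Peter"),
    ("Blom", "Java", "Tim"),
}


def natural_join(r1, r2):
    """Combine every pair of rows that agree on all shared columns."""
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in a.keys() & b.keys())]


joined = natural_join(ENROLLMENT, LAB_LECTURER)
recovered = {(r["Student"], r["Subject"], r["Lecturer"]) for r in joined}

print(recovered == ORIGINAL)
```

Tim's Java lab now survives even if Blom is deleted, because the lab row lives in its own relation.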


Higher Normal Forms

• BCNF is as far as we can go with FDs

• Higher normal forms are based on other sorts of dependency

• Fourth normal form removes multi-valued dependencies

• Fifth normal form removes join dependencies

[Diagram: nested sets] 1NF Relations ⊃ 2NF Relations ⊃ 3NF Relations ⊃ BCNF Relations ⊃ 4NF Relations ⊃ 5NF Relations