By Stanley Githinji
WARNING
• This stuff can get confusing.
• So concentrate. This is the science bit.
Redundancy & Normalisation
• Redundant data
  • Is data that already exists elsewhere in the database
• Redundant data leads to various subtle, but important problems:
  • INSERT anomalies
  • UPDATE anomalies
  • DELETE anomalies
• Normalisation
  • Aims to reduce data redundancy
  • Redundancy is expressed in terms of dependencies
  • Normal forms are defined that do not have certain types of dependency
What is Normalization?
• It is a mathematical process that converts one set of formulae into another equivalent set of formulae. That is it.
• This only makes sense if you think of information in terms of propositions – statements of fact.
• Do not think in terms of objects and entities at the logical level. This is not how we communicate information.
Propositions Example
Program(X-Factor) & Host(Kate)
Program(I’m a Celebrity) & Host(Ant) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)
Unnormalized Reality TV (a mess)
Program          Host      coHost
X-Factor         Kate      null
I’m a Celebrity  Ant, Dec  null
Big Brother      Davina    Dermot
By rearranging these propositions into different forms we can achieve a better structure for manipulating the info… this is Normalization.
'Zeroth' and 1st Normal Form
• In the original definition of the relational model
• All data values should be atomic
• This means that table entries should be single values, not repeating groups or ‘complex’ objects
• A relation is said to be in first normal form (1NF) if
• All data values are atomic
• No duplicate columns
• A 'relation' that is not in 1NF is said to be in 'zeroth' normal form (0NF), and is unnormalized
0NF to 1NF
To convert a 0NF ‘relation’ to a 1NF relation: split up any non-atomic values
0NF - Teaching
Module  Dept  Lecturer  Text
M1      D1    L1        T1, T2
M2      D1    L1        T1, T3
M3      D1    L2        T4
M4      D2    L3        T1, T5
M5      D2    L4        T6
1NF - Teaching
Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6
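The split from 0NF to 1NF can be sketched in Python. This is only an illustration, not something from the slides: storing each 0NF row as a dict whose Text entry is a list, and the helper name `to_1nf`, are assumptions made for the example.

```python
# Assumed representation: one dict per 0NF row, with the repeating
# group (Text) held as a list.
teaching_0nf = [
    {"Module": "M1", "Dept": "D1", "Lecturer": "L1", "Text": ["T1", "T2"]},
    {"Module": "M2", "Dept": "D1", "Lecturer": "L1", "Text": ["T1", "T3"]},
    {"Module": "M3", "Dept": "D1", "Lecturer": "L2", "Text": ["T4"]},
    {"Module": "M4", "Dept": "D2", "Lecturer": "L3", "Text": ["T1", "T5"]},
    {"Module": "M5", "Dept": "D2", "Lecturer": "L4", "Text": ["T6"]},
]


def to_1nf(rows, column):
    """Emit one row per value held in the non-atomic column."""
    flat = []
    for row in rows:
        for value in row[column]:
            atomic = dict(row)      # copy the other columns unchanged
            atomic[column] = value  # replace the list by a single value
            flat.append(atomic)
    return flat


teaching_1nf = to_1nf(teaching_0nf, "Text")
for row in teaching_1nf:
    print(row["Module"], row["Dept"], row["Lecturer"], row["Text"])
```

The five 0NF rows become the eight 1NF rows of the Teaching table, one per (Module, Text) pair.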
Back to Reality
Program(X-Factor) & Host(Kate)
Program(I’m a Celebrity) & Host(Ant) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)

Program(X-Factor) & Host(Kate)
Program(I’m a Celebrity) & Host(Ant)
Program(I’m a Celebrity) & Host(Dec)
Program(Big Brother) & Host(Davina) & coHost(Dermot)
0NF - Reality TV
Program          Host      coHost
X-Factor         Kate      null
I’m a Celebrity  Ant, Dec  null
Big Brother      Davina    Dermot
1NF - Reality TV
Program          Host    coHost
X-Factor         Kate    null
I’m a Celebrity  Ant     null
I’m a Celebrity  Dec     null
Big Brother      Davina  Dermot
What have we done there?
• We took unformatted information and put it into a format that allows it to be represented as… a mathematical relation.
• 1NF is different from subsequent normalization: it essentially says that all data must fit into relations.
• I.e. A table = relation by 1NF
But there are still problems in 1NF…
• INSERT anomalies
  • Can't add a module with no texts
• UPDATE anomalies
  • To change lecturer for M1, we have to change two rows
• DELETE anomalies
  • If we remove M3, we remove L2 as well
Functional Dependencies
• Redundancy can often be described as a functional dependency
• A functional dependency (FD) is a semantic link between two sets of attributes in a relation
• Another part of 'normalisation' is to remove undesirable FDs
• A set of attributes, A, functionally determines another set, B, if:
  • Whenever two rows of the relation have the same value for all attributes in A then they also have the same value for all attributes in B.
• We say: A → B
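The definition of A → B translates almost directly into a check over rows. The sketch below is illustrative only; the helper name `fd_holds` and the list-of-dicts layout are assumptions. Because an FD is a semantic statement about all possible data, a check on sample rows can show that an FD is violated, but never prove that it holds in general.

```python
COLUMNS = ("Module", "Dept", "Lecturer", "Text")
# The 1NF Teaching relation from the slides.
teaching = [dict(zip(COLUMNS, row)) for row in [
    ("M1", "D1", "L1", "T1"), ("M1", "D1", "L1", "T2"),
    ("M2", "D1", "L1", "T1"), ("M2", "D1", "L1", "T3"),
    ("M3", "D1", "L2", "T4"), ("M4", "D2", "L3", "T1"),
    ("M4", "D2", "L3", "T5"), ("M5", "D2", "L4", "T6"),
]]


def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        a = tuple(row[c] for c in lhs)
        b = tuple(row[c] for c in rhs)
        # Remember the first rhs value seen for each lhs value and
        # compare every later row against it.
        if seen.setdefault(a, b) != b:
            return False
    return True


print(fd_holds(teaching, ["Module"], ["Lecturer"]))  # True
print(fd_holds(teaching, ["Lecturer"], ["Dept"]))    # True
print(fd_holds(teaching, ["Text"], ["Module"]))      # False
```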
Why care about FD?
• We define a set of 'normal forms'
• Each normal form has fewer FDs than the last
• Since FDs represent redundancy, each normal form has less redundancy than the last
• Not all FDs cause a problem, but…
• We identify various sorts of FD that do.
• Each normal form removes a type of FD that is a problem.
• We will also need a way to remove FDs.
Properties of FDs
In any relation:
• The primary key functionally determines any set of attributes in that relation: K → X (K is the primary key, X is a set of other attributes)
• Any set of attributes is of course functionally dependent on itself: X → X
Rules for FDs
• Reflexivity: If B is a subset of A then: A → B
• Augmentation: If A → B then A U C → B U C
• Transitivity: If A → B and B → C then A → C
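One common way to apply these rules mechanically is to compute the closure of an attribute set, i.e. everything it functionally determines. The slides do not name this technique, so treat the following as a sketch under assumptions: the function name `closure` is hypothetical, and the two FDs are taken from the Teaching example.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs.

    fds is a list of (lhs, rhs) pairs of sets. Starting from attrs
    (reflexivity), an FD whose left side is already covered adds its
    right side, and the loop repeats until nothing changes
    (transitivity).
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# FDs from the Teaching example.
fds = [
    ({"Module"}, {"Lecturer"}),
    ({"Lecturer"}, {"Dept"}),
]

print(sorted(closure({"Module"}, fds)))
print(sorted(closure({"Text"}, fds)))
```

{Module} reaches Dept through Lecturer, which is the transitivity rule in action; {Text} determines nothing beyond itself.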
FD Example
• The primary key is {Module, Text}, so
  {Module, Text} → {Dept, Lecturer}
• 'Trivial' FDs, e.g.:
  {Text, Dept} → {Text}
  {Module} → {Module}
  {Dept, Lecturer} → { }
FD Example
• Other FDs are
  • {Module} → {Lecturer}
  • {Module} → {Dept}
  • {Lecturer} → {Dept}
• These are non-trivial and don't come from the primary key
FD Diagrams
[FD diagram: boxes for Module, Text, Dept, and Lecturer]
• {Module, Text} is the primary key, so we put a double box around them
• {Module} → {Dept} and {Module} → {Lecturer}, so we have {Module} → {Dept, Lecturer}
• {Lecturer} → {Dept}, so we have an arrow from Lecturer to Dept
Partial FDs and 2NF
• Partial FDs:
  • An FD, A → B, is a partial FD if some attribute of A can be removed and the FD still holds (i.e. a member of the set A is superfluous)
  • Formally, there is some proper subset of A, C ⊂ A, such that C → B
• 2nd normal form
  • A relation is in second normal form (2NF) if it is in 1NF and no non-primary-key attribute is partially dependent on the primary key
Second Normal Form
• The 1NF Teaching relation is not in 2NF
• We have the FD:
  {Module, Text} → {Lecturer, Dept}
  but also…
  {Module} → {Lecturer, Dept}
• And so Lecturer and Dept are partially dependent on the primary key
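The 2NF test can be sketched as a search over proper subsets of the primary key. This is an illustration rather than the slides' own method: the helper names `closure` and `partial_dependencies` are hypothetical, and the sketch assumes a single candidate key.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, fds):
    """Map each proper subset of the key to the non-key attributes
    it determines on its own (an empty dict means no partial FDs)."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) - set(key)
            if determined:
                found[subset] = determined
    return found


key = {"Module", "Text"}
fds = [
    ({"Module"}, {"Lecturer"}),
    ({"Module"}, {"Dept"}),
    ({"Lecturer"}, {"Dept"}),
]

print(partial_dependencies(key, fds))
```

For Teaching, the only offending subset is {Module}, which determines both Lecturer and Dept without any help from Text.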
So what do we do?
• Say we have a relation with scheme S and the full FD A → B.
• We can organize this to make sure A ∩ B = { } (i.e. no trivial dependencies)
• Let C = S – (A U B)
• So we have
  • A – attributes on the LHS of the FD
  • B – attributes on the RHS of the FD
  • C – all other attributes
• Well, it turns out that we can split the relation into two parts:
  R1, with scheme: C U A
  R2, with scheme: A U B
• Vital point: the original relation can be recovered as the natural join of R1 and R2
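The "vital point" can be checked on the Teaching data. The sketch below is illustrative: the helpers `project`, `natural_join`, and `as_set` are hypothetical names, and rows are assumed to be stored as dicts. It splits on the FD {Module} → {Dept, Lecturer}, so A = {Module}, B = {Dept, Lecturer}, and C = {Text}.

```python
COLUMNS = ("Module", "Dept", "Lecturer", "Text")
teaching = [dict(zip(COLUMNS, row)) for row in [
    ("M1", "D1", "L1", "T1"), ("M1", "D1", "L1", "T2"),
    ("M2", "D1", "L1", "T1"), ("M2", "D1", "L1", "T3"),
    ("M3", "D1", "L2", "T4"), ("M4", "D2", "L3", "T1"),
    ("M4", "D2", "L3", "T5"), ("M5", "D2", "L4", "T6"),
]]


def project(rows, columns):
    """Keep only the given columns, dropping duplicate rows."""
    result = []
    for row in rows:
        small = {c: row[c] for c in columns}
        if small not in result:
            result.append(small)
    return result


def natural_join(left, right):
    """Pair up rows that agree on every shared column."""
    joined = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined


def as_set(rows):
    """Order-independent form of a relation, for comparison."""
    return {tuple(sorted(row.items())) for row in rows}


r1 = project(teaching, ["Text", "Module"])              # C U A
r2 = project(teaching, ["Module", "Dept", "Lecturer"])  # A U B

print(len(r1), len(r2))
print(as_set(natural_join(r1, r2)) == as_set(teaching))
```

The join recovers exactly the original rows because Module, the shared column, determines every other column of R2.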
1NF to 2NF – Example
1NF - Teaching
Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6
2NF - Text
Module  Text
M1      T1
M1      T2
M2      T1
M2      T3
M3      T4
M4      T1
M4      T5
M5      T6
2NF - Modules
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4
Problems Resolved in 2NF
Those 1NF problems:
• INSERT – Can't add a module with no texts
• UPDATE – To change lecturer for M1, we have to change two rows
• DELETE – If we remove M3, we remove L2 as well, and lose information we might not have wanted to!
• In 2NF the first two are resolved, but not the third one
Problems Remaining in 2NF
• INSERT anomalies
  • Can't add lecturers who teach no modules
• UPDATE anomalies
  • To change the department for L1 we must alter two rows
• DELETE anomalies
  • If we delete M3 we delete L2 as well
Transitive FDs and 3NF
• Transitive FDs:
  • An FD, A → C, is a transitive FD if there is some set B such that A → B and B → C are non-trivial FDs in the relation
  • I.e. there exists: A → B → C
• Third normal form:
  • A relation is in third normal form (3NF) if it is in 2NF and no non-primary-key attribute is transitively dependent on the primary key
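A rough way to look for transitive dependencies is to find FDs whose left side is reachable from the key but is not itself a key. The sketch below is illustrative and makes assumptions: the helper names `closure` and `transitive_dependencies` are hypothetical, there is a single candidate key, and only the FDs in the given list are examined. The data is the 2NF Modules relation from the Teaching example.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def transitive_dependencies(key, attributes, fds):
    """FDs B -> C where the key determines B, B is not a superkey,
    and C contains non-key attributes outside B."""
    found = []
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) >= attributes
        targets = rhs - lhs - key
        if not is_superkey and targets and lhs <= closure(key, fds):
            found.append((lhs, targets))
    return found


modules = {"Module", "Lecturer", "Dept"}
fds = [
    ({"Module"}, {"Lecturer"}),
    ({"Lecturer"}, {"Dept"}),
]

print(transitive_dependencies({"Module"}, modules, fds))
```

The one hit, {Lecturer} → {Dept}, is the middle link of the chain Module → Lecturer → Dept.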
Third Normal Form
• The 2NF Modules relation is not in 3NF
• We have the FDs:
  {Module} → {Lecturer}
  {Lecturer} → {Dept}
• So there is a transitive FD from the primary key {Module} to {Dept}
2NF to 3NF – Example
2NF - Modules
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4
3NF - Places
Lecturer  Dept
L1        D1
L2        D1
L3        D2
L4        D2
3NF - Modules
Module  Lecturer
M1      L1
M2      L1
M3      L2
M4      L3
M5      L4
Problems Resolved in 3NF
• Problems in 2NF
• INSERT – Can't add lecturers who teach no modules
• UPDATE – To change the department for L1 we must alter two rows
• DELETE – If we delete M3 we delete L2 as well
• In 3NF all of these are resolved:
3NF - Places
Lecturer  Dept
L1        D1
L2        D1
L3        D2
L4        D2
3NF - Modules
Module  Lecturer
M1      L1
M2      L1
M3      L2
M4      L3
M5      L4
Normalisation and Design
• Normalisation is integrally related to DB design:
• A database should normally be in 3NF at least
• If your design leads to a non-3NF DB, then you might want to revise it
• When you find you have a non-3NF DB:
  • Identify the FDs that are causing a problem
  • Think if they will lead to any insert, update, or delete anomalies
  • Try to remove them
So, the story so far:
• 1NF – turn your propositions into a format that fits relations by removing repeating groups.
• 2NF – look for partial FDs and separate into another table anything that is not functionally dependent on the full primary key.
• 3NF – look for transitive FDs and split them off into a separate table.
1. Normalization refresher
• Normalization reduces data redundancy in a database
• By doing so it eliminates serious manipulation anomalies.
• Normalization is ultimately just rearranging propositions to a better structure.
• This is done by identifying and removing damaging functional dependencies.
Normal forms so far…
• First normal form
  • All data values are atomic, and so everything fits into a mathematical relation.
• Second normal form
  • As 1NF plus no non-primary-key attribute is partially dependent on the primary key
• Third normal form
  • As 2NF plus no non-primary-key attribute depends transitively on the primary key
2. Normalization Example
• Consider a table representing orders in an online store
• Each entry in the table represents an item on a particular order. (thinking in terms of records. Yuk.)
• Columns
  • Order
  • Product
  • Customer
  • Address
  • Quantity
  • UnitPrice
• Primary key is {Order, Product}
Functional Dependencies
1. {Order} → {Customer}: each order is for a single customer
2. {Customer} → {Address}: each customer has a single address
3. {Product} → {UnitPrice}: each product has a single price
4. {Order} → {Address}: follows from FDs 1 and 2 by transitivity
Normalisation to 2NF
• Remember 2nd normal form means no partial dependencies on the key. But we have:
  {Order} → {Customer, Address}
  {Product} → {UnitPrice}
  and a primary key of {Order, Product}
• So to get rid of the first FD we project over:
  {Order, Customer, Address} and
  {Order, Product, Quantity, UnitPrice}
Normalisation to 2NF
R (1NF): Order, Product, Customer, Address, Quantity, UnitPrice
R1: Order, Customer, Address
R2: Order, Product, Quantity, UnitPrice
Normalisation to 2NF
• R1 is now in 2NF, but there is still a partial FD in R2:
  {Product} → {UnitPrice}
  R2: Order, Product, Quantity, UnitPrice
• To remove this we project over:
{Product, UnitPrice} and {Order, Product, Quantity}
Normalisation to 2NF
R2 (1NF): Order, Product, Quantity, UnitPrice
R3 (2NF): Product, UnitPrice
R4 (2NF): Order, Product, Quantity
Now let’s go 3NF…
• R has now been split into 3 relations - R1, R3, and R4… but R1 has a transitive FD on its key…
• To remove this problem we project R1 over:
{Order, Customer} and {Customer, Address}
R1: Order, Customer, Address
{Order} → {Customer} → {Address}
So more chopping…
R1 (2NF): Order, Customer, Address
R5 (3NF): Order, Customer
R6 (3NF): Customer, Address
Let’s summarize that:
• 1NF:
  {Order, Product, Customer, Address, Quantity, UnitPrice}
• 2NF:
  {Order, Customer, Address}
  {Product, UnitPrice}
  {Order, Product, Quantity}
• 3NF:
  {Product, UnitPrice}
  {Order, Product, Quantity}
  {Order, Customer}
  {Customer, Address}
has become this… (3NF)
Purchase: Order, Customer
Details: Customer, Address
Amounts: Order, Product, Quantity
Prices: Product, UnitPrice
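The whole orders example can be replayed at the schema level with the split rule (R1 = C U A, R2 = A U B). This is a sketch, not the slides' own procedure: the helper name `split` is hypothetical, and the variable names follow the slides' R1 to R6 labels.

```python
def split(schema, lhs, rhs):
    """Split a schema on the FD lhs -> rhs.

    Returns two schemas: the original minus the dependent attributes
    (C U A), and the one that holds the FD itself (A U B).
    """
    rhs = rhs - lhs  # make sure A and B are disjoint
    return frozenset(schema - rhs), frozenset(lhs | rhs)


r = {"Order", "Product", "Customer", "Address", "Quantity", "UnitPrice"}

# 1NF -> 2NF: remove the two partial dependencies on {Order, Product}.
r2, r1 = split(r, {"Order"}, {"Customer", "Address"})
r4, r3 = split(r2, {"Product"}, {"UnitPrice"})

# 2NF -> 3NF: remove the transitive dependency Order -> Customer -> Address.
r5, r6 = split(r1, {"Customer"}, {"Address"})

for name, schema in [("Purchase", r5), ("Details", r6),
                     ("Amounts", r4), ("Prices", r3)]:
    print(name, sorted(schema))
```

Each call removes one problem FD, and the four schemas left at the end are the Purchase, Details, Amounts, and Prices relations.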
3. Boyce-Codd Normal Form
Edgar F. Codd. English. Obviously. Very clever. Revolutionized DBs by inventing the relational model (RM) at IBM in 1970. Also responsible for all that DBS coursework you have to do…
• Did Codd make any mistakes along the way?
• Nulls
• Forbidding complex objects in 1NF.
• Not initially seeing that Primary keys are a superfluous concept.
• Initially thinking 3NF was enough.
The Primary Key Myth
• In all our discussions so far we have considered the existence of a primary key.
• What if more than one column (or set of columns) could serve as the primary key? Which of these candidate keys should we pick?
• None - Candidate keys are the vital concept. Calling one ‘primary’ is pretty unimportant.
But hold on there, Sherlock!
• When we defined our normal forms we always talked about the primary key.
• This was fine if we only had one candidate key, but what if there are several candidate keys?
• This realization changed the requirement of having data in 3NF to requiring Boyce-Codd Normal Form
• … which relies on a concept called prime attributes.
Prime Attributes
• An attribute of a relation is called prime if it is part of a candidate key, and non-prime otherwise.
2NF definition alters:
• Old: As 1NF and in addition no non-primary-key attribute is partially dependent on the primary key.
• New: As 1NF and in addition no non-prime attribute is partially dependent on any candidate key.
n.b. these are the same if there is only one candidate key
3NF Revisited
• The same change is made to 3NF:
3NF definition alters:
• Old: As 2NF and in addition no non-primary-key attribute depends transitively on the primary key.
• New: As 2NF and in addition no non-prime attribute is transitively dependent on any candidate key.
Boyce-Codd Normal Form
• Going to explain this by example.
• Consider a relation, Labs, which stores information about the enrollments for the various lab sessions on computer science courses:
Schema: Labs(Student, Subject, Lecturer)
• Each course can have several lab session slots.
• Each student taking a course is assigned to a single lab session slot for it.
• Each lab session slot is managed by a single lecturer.
Example: Labs Relation
Student  Subject    Lecturer
Pauline  Java       Jim
Pauline  Databases  Peter
Enden    Java       Jim
Enden    Databases  Peter
Blom     Java       Tim
Candidate keys: {Student, Subject} and {Student, Lecturer}
FDs in the Labs Relation
The Labs table has the following non-trivial FDs:
• For each lab subject a student is taught by only one lecturer: {Student, Subject} → {Lecturer}
• A lecturer only teaches one lab subject: {Lecturer} → {Subject}
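Given these two FDs, the candidate keys of Labs can be found by brute force: try attribute sets from smallest to largest and keep the minimal ones that determine everything. This is only a sketch; the helper names `closure` and `candidate_keys` are hypothetical, and the exhaustive search is practical only for small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = set(combo)
            if any(k <= combo for k in keys):
                continue  # contains a smaller key, so not minimal
            if closure(combo, fds) >= attributes:
                keys.append(combo)
    return keys


labs = {"Student", "Subject", "Lecturer"}
fds = [
    ({"Student", "Subject"}, {"Lecturer"}),
    ({"Lecturer"}, {"Subject"}),
]

for key in candidate_keys(labs, fds):
    print(sorted(key))
```

The search returns {Student, Lecturer} and {Student, Subject}, the two overlapping candidate keys listed for the Labs relation.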
Can we normalize?
• 1NF - Any repeating groups? NO
• 2NF - Are there any partial dependencies? NO
• 3NF - Is {Student, Subject} → {Lecturer} → {Subject} cyclic? NO
  (The key at the start is {Student, Subject}, not {Subject}, and Subject is a prime attribute)
…so the table is already in 3NF
But it still has Anomalies!
3NF Labs
• INSERT anomalies: You can’t set up an empty lab session
• UPDATE anomalies: Gary taking over Jim’s Java lab involves changing two rows.
• DELETE anomalies: Deleting Blom means losing all knowledge of Tim’s Java lab.
Oh for Pete's sake, so much for 3NF…
4. Boyce-Codd Normal Form
• It was quickly found that 3NF isn't perfect.
• But ONLY on the rare occurrences that:
  (a) candidate keys are composite
  (b) there is more than one candidate key
  (c) those candidate keys overlap.
• The problem is being caused by dependence between parts of the keys themselves:
  {Lecturer} → {Subject}, but Subject by itself is not a key
The Solution - BCNF
• A relation is in Boyce-Codd normal form (BCNF) if for every FD A → B either:
  • B is contained in A (the FD is trivial), or
  • A contains a candidate key of the relation
• This is the same as 3NF, except that 3NF also tolerates an FD whose right-hand side B is prime (part of a candidate key); BCNF does not
• Remember if there is only one candidate key then 3NF and BCNF are the same thing.
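The BCNF condition is easy to test once attribute closures are available, since "A contains a candidate key" is the same as "the closure of A is the whole schema". The following is a sketch with hypothetical helper names (`closure`, `bcnf_violations`), applied to the Labs relation and its two FDs.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """FDs that are non-trivial and whose left side is not a superkey."""
    bad = []
    for lhs, rhs in fds:
        trivial = rhs <= lhs
        superkey = closure(lhs, fds) >= attributes
        if not trivial and not superkey:
            bad.append((lhs, rhs))
    return bad


labs = {"Student", "Subject", "Lecturer"}
fds = [
    ({"Student", "Subject"}, {"Lecturer"}),
    ({"Lecturer"}, {"Subject"}),
]

print(bcnf_violations(labs, fds))
```

Only {Lecturer} → {Subject} is reported: Lecturer on its own does not determine Student, so it contains no candidate key.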
So what do we do?
Split 3NF Labs (Student, Subject, Lecturer) into two BCNF relations, (Student, Subject) and (Subject, Lecturer)?
NO! - because we have lost information this way. We have lost the links between an individual 'lab' and the person in it!
We would incorrectly have:
Student  Subject
Pauline  Java
Pauline  Databases
Enden    Java
Enden    Databases
Blom     Java
Subject    Lecturer
Java       Jim
Java       Tim
Databases  Peter
If we joined them back together we would have no way of knowing which people who did Java were in Tim’s session or Jim’s session.
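The loss of information shows up as spurious rows when the two projections are joined back together. A minimal sketch, assuming the Labs rows are stored as (Student, Subject, Lecturer) tuples; the variable names are made up for the example.

```python
labs = {
    ("Pauline", "Java", "Jim"),
    ("Pauline", "Databases", "Peter"),
    ("Enden", "Java", "Jim"),
    ("Enden", "Databases", "Peter"),
    ("Blom", "Java", "Tim"),
}

# The two projections from the incorrect split.
student_subject = {(student, subject) for student, subject, _ in labs}
subject_lecturer = {(subject, lecturer) for _, subject, lecturer in labs}

# Natural join on Subject.
rejoined = {
    (student, subject, lecturer)
    for student, subject in student_subject
    for subject2, lecturer in subject_lecturer
    if subject == subject2
}

# Rows the join invents that were never in the original relation.
spurious = rejoined - labs
print(len(labs), len(rejoined))
for row in sorted(spurious):
    print(row)
```

The join produces eight rows from an original five. The three extra rows pair every Java student with both Jim and Tim, which is exactly the ambiguity described above.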
BCNF completed
• If you fail BCNF, there is something wrong with your propositions…
• …they were actually about two things, at least one of which you did not identify correctly.
Enrollment
Student  Subject    Lab
Pauline  Java       1
Pauline  Databases  3
Enden    Java       1
Enden    Databases  3
Blom     Java       2
Labs
Lab  Lecturer
1    Jim
2    Tim
3    Peter
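With the lab identified explicitly, joining the two relations on Lab gives back the original information and nothing else. A short sketch, again assuming rows are stored as tuples; the variable names are illustrative.

```python
enrollment = {
    ("Pauline", "Java", 1),
    ("Pauline", "Databases", 3),
    ("Enden", "Java", 1),
    ("Enden", "Databases", 3),
    ("Blom", "Java", 2),
}
labs = {(1, "Jim"), (2, "Tim"), (3, "Peter")}

# The original 3NF Labs relation, for comparison.
original = {
    ("Pauline", "Java", "Jim"),
    ("Pauline", "Databases", "Peter"),
    ("Enden", "Java", "Jim"),
    ("Enden", "Databases", "Peter"),
    ("Blom", "Java", "Tim"),
}

# Natural join on Lab, then drop the Lab column.
rejoined = {
    (student, subject, lecturer)
    for student, subject, lab in enrollment
    for lab2, lecturer in labs
    if lab == lab2
}

print(rejoined == original)
```

Because Lab is a key of the Labs relation, each enrollment row matches exactly one lecturer, so no spurious rows appear.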
Higher Normal Forms
• BCNF is as far as we can go with FDs
• Higher normal forms are based on other sorts of dependency
• Fourth normal form removes multi-valued dependencies
• Fifth normal form removes join dependencies
[Figure: nested sets, from outermost to innermost: 1NF Relations, 2NF Relations, 3NF Relations, BCNF Relations, 4NF Relations, 5NF Relations]