8/13/2019 ITD102_Lecture 6-Normalization Process
Database Normalization
Engr. Julius Cansino
What is Normalization?
Normalization allows us to organize data so that it:
Allows faster access (dependencies make sense)
Takes up less space (less redundancy)
Normalization is a method of organizing data elements into tables.
Normal Forms
Normalization is done by transforming data into various Normal Forms.
Five Normal Forms are commonly listed, but we almost never use 4NF or 5NF.
We will only be concerned with 1NF, 2NF, and 3NF.
For a database to be in a normal form, it must meet all requirements of the previous forms:
E.g., for a database to be in 2NF, it must already be in 1NF. For a database to be in 3NF, it must already be in 1NF and 2NF.
Sample Data
Manager   Employees
Bob       Susie, Eric
Edward    Bella, Andrew
Taylor    Mark, Jane
This data has some problems:
The Employees column is not atomic.
A column must be atomic, meaning that it can only hold a single item of data. This column holds more than one employee name.
When data is not atomic:
We can't easily sort the data
We can't easily search or index the data
We can't easily change the data
We can't easily reference the data in other tables
Manager   Employees
Bob       Susie, Eric
Edward    Bella, Andrew
Taylor    Mark, Jane
Manager Employee1 Employee2
Bob Susie Eric
Edward Bella Andrew
Taylor Mark Jane
Breaking the Employees column into more than one column doesn't solve our problems:
The data may look atomic, but only because we have many identical columns storing a single piece of data instead of a single column storing many pieces of data.
We still can't easily sort, search, or index our employees.
What if a manager has more than 2 employees, 10 employees, 100 employees? We'd need to add columns to our database just for these cases.
It is still hard to reference our employees in other tables.
Manager Employee1 Employee2
Bob Susie Eric
Edward Bella Andrew
Taylor Mark Jane
By the way, what would be a good choice of a Primary Key for this table?
Manager Employee1 Employee2
Bob Susie Eric
Edward Bella Andrew
Taylor Mark Jane
First Normal Form
1NF means that we must:
Eliminate duplicate columns from the same table, and
Create separate tables for each group of related data, each with a unique row identifier (primary key)
Let's get started by making our columns atomic.
Atomic Data
By breaking each tuple of our table into an entry for each employee, we have made our data atomic.
What would be the primary key?
Manager Employee
Bob Susie
Bob Eric
Edward Bella
Edward Andrew
Taylor Mark
Taylor Jane
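The step from the comma-separated Employees column to one row per employee can be sketched in a few lines of Python. This is only an illustration: the list `non_atomic` and the helper `to_atomic` are hypothetical names, not part of the lecture, and the sketch assumes names are separated by commas.

```python
# Hypothetical in-memory copy of the sample data: one row per manager,
# with several employee names packed into a single string.
non_atomic = [
    ("Bob", "Susie, Eric"),
    ("Edward", "Bella, Andrew"),
    ("Taylor", "Mark, Jane"),
]


def to_atomic(rows):
    """Return one (manager, employee) pair per employee name."""
    atomic = []
    for manager, employees in rows:
        for name in employees.split(","):
            # strip() removes the space left after each comma
            atomic.append((manager, name.strip()))
    return atomic


atomic_rows = to_atomic(non_atomic)
for manager, employee in atomic_rows:
    print(manager, employee)
```

Each printed line corresponds to one row of the Manager/Employee table above.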
Primary Key
The best primary key would be the Employee column.
Every employee has only one manager, so each employee appears in exactly one row and the Employee value is unique.
Employee Manager
Susie Bob
Eric Bob
Bella Edward
Andrew Edward
Mark Taylor
Jane Taylor
First Normal Form
Congratulations!
The fact that all our data and columns are atomic and we have a primary key means that we are in 1NF!
Employee Manager
Susie Bob
Eric Bob
Bella Edward
Andrew Edward
Mark Taylor
Jane Taylor
First Normal Form Revised
Of course, there may come a day when we hire a second employee or manager with the same name. To handle this, let's use an employee ID instead of their name.
ID Employee ManagerID
1 Susie 7
2 Eric 7
3 Bella 8
4 Andrew 8
5 Mark 9
6 Jane 9
7 Bob
8 Edward
9 Taylor
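One way to try out the ID-based table is with Python's built-in sqlite3 module. The sketch below is an assumption about how the table might be declared: the table name `Employees`, the column types, and the helper `reports_of` are illustrative choices, not something the lecture specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute(
    """
    CREATE TABLE Employees (
        ID        INTEGER PRIMARY KEY,
        Employee  TEXT NOT NULL,
        ManagerID INTEGER REFERENCES Employees(ID)  -- NULL when no manager is listed
    )
    """
)

# Managers are inserted first so that every ManagerID already exists.
rows = [
    (7, "Bob", None), (8, "Edward", None), (9, "Taylor", None),
    (1, "Susie", 7), (2, "Eric", 7), (3, "Bella", 8),
    (4, "Andrew", 8), (5, "Mark", 9), (6, "Jane", 9),
]
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)", rows)


def reports_of(manager_id):
    """Return the names of the employees who report to manager_id."""
    cur = conn.execute(
        "SELECT Employee FROM Employees WHERE ManagerID = ? ORDER BY ID",
        (manager_id,),
    )
    return [name for (name,) in cur]


print(reports_of(7))
```

Because the key is the ID rather than the name, a second employee called Susie could be added later with a new ID without colliding with the existing row.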
Moving to Second Normal Form
A database in 2NF must also be in 1NF:
Data must be atomic
Every row (or tuple) must have a unique primary key
Plus:
Subsets of data that apply to multiple rows (repeating data) are moved to separate tables
CustID  FirstName  LastName    Address           City       State  Zip
1       Bob        Smith       123 Main St.      Tucson     AZ     12345
2       John       Brown       555 2nd Ave.      St. Paul   MN     54355
3       Sandy      Jessop      4256 James St.    Chicago    IL     43555
4       Maria      Hernandez   4599 Columbia     Vancouver  BC     V5N 1M0
5       Gameil     Hintz       569 Summit St.    St. Paul   MN     54355
6       James      Richardson  12 Cameron Bay    Regina     SK     S4T 2V8
7       Shiela     Green       12 Michigan Ave.  Chicago    IL     43555
8       Ian        Sampson     56 Manitoba St.   Winnipeg   MB     M5W 9N7
9       Ed         Rodgers     15 Athol St.      Regina     SK     S4T 2V9
This data is in 1NF: all fields are atomic and the CustID serves as the primary key.
But let's pay attention to the City, State, and Zip fields:
Two combinations repeat: one for Chicago, and one for St. Paul.
In each case the rows have the same city, state, and Zip code.
City       State  Zip
Tucson     AZ     12345
St. Paul   MN     54355
Chicago    IL     43555
Vancouver  BC     V5N 1M0
St. Paul   MN     54355
Regina     SK     S4T 2V8
Chicago    IL     43555
Winnipeg   MB     M5W 9N7
Regina     SK     S4T 2V9
The CustID determines all the data in the row, but a U.S. Zip code determines the City and State (e.g., a given Zip code can only belong to one city and state, so storing Zip codes with a City and State is redundant).
This means that City and State are functionally dependent on the value in Zip code and not only on the primary key.
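A functional dependency can be checked against sample data with a short Python sketch. The helper `determines` is a hypothetical name introduced for illustration. It only reports whether the rows at hand contradict the dependency; it cannot prove the rule holds for all future data.

```python
def determines(rows, lhs, rhs):
    """Return True if equal values in the lhs columns always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


# City, State, and Zip columns copied from the customer table.
customers = [
    {"CustID": 1, "City": "Tucson", "State": "AZ", "Zip": "12345"},
    {"CustID": 2, "City": "St. Paul", "State": "MN", "Zip": "54355"},
    {"CustID": 3, "City": "Chicago", "State": "IL", "Zip": "43555"},
    {"CustID": 4, "City": "Vancouver", "State": "BC", "Zip": "V5N 1M0"},
    {"CustID": 5, "City": "St. Paul", "State": "MN", "Zip": "54355"},
    {"CustID": 6, "City": "Regina", "State": "SK", "Zip": "S4T 2V8"},
    {"CustID": 7, "City": "Chicago", "State": "IL", "Zip": "43555"},
    {"CustID": 8, "City": "Winnipeg", "State": "MB", "Zip": "M5W 9N7"},
    {"CustID": 9, "City": "Regina", "State": "SK", "Zip": "S4T 2V9"},
]

print(determines(customers, ["Zip"], ["City", "State"]))  # Zip -> City, State
print(determines(customers, ["City"], ["Zip"]))           # City -> Zip
```

In this sample the first check holds, while the second fails because Regina appears with two different postal codes.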
To be in 2NF, this repeating data must be in its own table.
So:
Let's create a Zip code table that maps Zip codes to their City and State.
Note that Canadian Postal Codes are different: the same city and state can have many different postal codes.
Our Data in 2NF
Customer Table
CustID  FirstName  LastName    Address           Zip
1       Bob        Smith       123 Main St.      12345
2       John       Brown       555 2nd Ave.      54355
3       Sandy      Jessop      4256 James St.    43555
4       Maria      Hernandez   4599 Columbia     V5N 1M0
5       Gameil     Hintz       569 Summit St.    54355
6       James      Richardson  12 Cameron Bay    S4T 2V8
7       Shiela     Green       12 Michigan Ave.  43555
8       Ian        Sampson     56 Manitoba St.   M5W 9N7
9       Ed         Rodgers     15 Athol St.      S4T 2V9
Zip Code Table
Zip      City       State
12345    Tucson     AZ
54355    St. Paul   MN
43555    Chicago    IL
V5N 1M0  Vancouver  BC
S4T 2V8  Regina     SK
M5W 9N7  Winnipeg   MB
S4T 2V9  Regina     SK
We see that we can actually save 2 rows in the Zip Code table by removing these redundancies: 9 customer records only need 7 Zip code records.
Zip code becomes a foreign key in the customer table, linked to the primary key in the Zip code table.
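The two tables and the foreign key between them could be declared as follows with Python's sqlite3 module. Treat this as a sketch under assumptions: the table names `Customer` and `ZipCode`, the column types, and the helper `city_state_for` are illustrative and not taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE ZipCode (
        Zip   TEXT PRIMARY KEY,
        City  TEXT NOT NULL,
        State TEXT NOT NULL
    );
    CREATE TABLE Customer (
        CustID    INTEGER PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL,
        Address   TEXT NOT NULL,
        Zip       TEXT NOT NULL REFERENCES ZipCode(Zip)
    );
    """
)

zip_rows = [
    ("12345", "Tucson", "AZ"), ("54355", "St. Paul", "MN"),
    ("43555", "Chicago", "IL"), ("V5N 1M0", "Vancouver", "BC"),
    ("S4T 2V8", "Regina", "SK"), ("M5W 9N7", "Winnipeg", "MB"),
    ("S4T 2V9", "Regina", "SK"),
]
customer_rows = [
    (1, "Bob", "Smith", "123 Main St.", "12345"),
    (2, "John", "Brown", "555 2nd Ave.", "54355"),
    (3, "Sandy", "Jessop", "4256 James St.", "43555"),
    (4, "Maria", "Hernandez", "4599 Columbia", "V5N 1M0"),
    (5, "Gameil", "Hintz", "569 Summit St.", "54355"),
    (6, "James", "Richardson", "12 Cameron Bay", "S4T 2V8"),
    (7, "Shiela", "Green", "12 Michigan Ave.", "43555"),
    (8, "Ian", "Sampson", "56 Manitoba St.", "M5W 9N7"),
    (9, "Ed", "Rodgers", "15 Athol St.", "S4T 2V9"),
]
conn.executemany("INSERT INTO ZipCode VALUES (?, ?, ?)", zip_rows)
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", customer_rows)


def city_state_for(cust_id):
    """Look up a customer's city and state through the Zip foreign key."""
    return conn.execute(
        """
        SELECT z.City, z.State
        FROM Customer AS c
        JOIN ZipCode AS z ON z.Zip = c.Zip
        WHERE c.CustID = ?
        """,
        (cust_id,),
    ).fetchone()


print(city_state_for(5))
```

With foreign keys enabled, inserting a customer whose Zip is missing from the Zip code table is rejected, which is what keeps the two tables consistent.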
Advantages of 2NF
Saves space in the database by reducing redundancies
If a customer calls, you can just ask them for their Zip code and you'll know their city and state! (No more spelling mistakes)
If a City name changes, we only need to make one change to the database.
Summary So Far
1NF:
All data is atomic
All rows have a unique primary key
2NF:
Data is in 1NF
Subsets of data in multiple columns are moved to a new table
These new tables are related using foreign keys
Moving to 3NF
To be in 3NF, a database must:
Be in 2NF
Have all columns fully functionally dependent on the primary key (there are no transitive dependencies)
In this table:
CustID and ProdID depend on the OrderID and no other column (good)
Stated another way, if you know the OrderID, you know the CustID and the ProdID
So: OrderID → CustID, ProdID
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111 50 1,000 50,000
2 1002 AB-111 60 500 30,000
3 1001 ZA-245 35 100 3,500
4 1003 MB-153 82 25 2,050
5 1004 ZA-245 42 10 420
6 1002 ZA-245 40 50 2,000
7 1001 AB-111 75 100 7,500
But there are some fields that are not dependent on OrderID:
Total is the simple product of Price * Quantity. As such, it has a transitive dependency on Price and Quantity.
Because it is a calculated value, it doesn't need to be included at all.
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111 50 1,000 50,000
2 1002 AB-111 60 500 30,000
3 1001 ZA-245 35 100 3,500
4 1003 MB-153 82 25 2,050
5 1004 ZA-245 42 10 420
6 1002 ZA-245 40 50 2,000
7 1001 AB-111 75 100 7,500
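That Total carries no information of its own can be confirmed by recomputing it from the other two columns. The following Python sketch copies the rows from the table above; the names `orders` and `mismatches` are hypothetical.

```python
# (OrderID, CustID, ProdID, Price, Quantity, Total), copied from the table above
orders = [
    (1, 1001, "AB-111", 50, 1000, 50000),
    (2, 1002, "AB-111", 60, 500, 30000),
    (3, 1001, "ZA-245", 35, 100, 3500),
    (4, 1003, "MB-153", 82, 25, 2050),
    (5, 1004, "ZA-245", 42, 10, 420),
    (6, 1002, "ZA-245", 40, 50, 2000),
    (7, 1001, "AB-111", 75, 100, 7500),
]

# Collect the OrderIDs whose stored Total differs from Price * Quantity.
mismatches = [
    order_id
    for order_id, _cust, _prod, price, quantity, total in orders
    if price * quantity != total
]
print(mismatches)
```

An empty list means every stored Total can be reproduced on demand, so dropping the column loses nothing.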
Also, we can see that Price isn't really dependent on ProdID alone, or on OrderID. Customer 1001 bought AB-111 for $50 (in order 1) and for $75 (in order 7), while 1002 spent $60 for each item in order 2.
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111 50 1,000 50,000
2 1002 AB-111 60 500 30,000
3 1001 ZA-245 35 100 3,500
4 1003 MB-153 82 25 2,050
5 1004 ZA-245 42 10 420
6 1002 ZA-245 40 50 2,000
7 1001 AB-111 75 100 7,500
Maybe price is dependent on the ProdID and Quantity: the more you buy of a given product, the cheaper that product becomes!
So we ask the business manager and she tells us that this is the case.
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111 50 1,000 50,000
2 1002 AB-111 60 500 30,000
3 1001 ZA-245 35 100 3,500
4 1003 MB-153 82 25 2,050
5 1004 ZA-245 42 10 420
6 1002 ZA-245 40 50 2,000
7 1001 AB-111 75 100 7,500
Let's diagram the dependencies.
We can see that all fields are dependent on OrderID, the Primary Key (white lines)
OrderID CustID ProdID Price Quantity Total
But Total is also determined by Price and Quantity (yellow lines)
This is a derived field (Price x Quantity = Total)
We can save a lot of space by getting rid of it altogether and just calculating the total when we need it
OrderID CustID ProdID Price Quantity Total
Price is also determined by both ProdID and Quantity rather than the primary key (red lines). This is called a transitive dependency. We must get rid of transitive dependencies to have 3NF.
OrderID CustID ProdID Price Quantity
We do this by moving the transitive dependency into a second table
OrderID CustID ProdID Price Quantity
By splitting out the table, we can quickly adjust our price table to meet our competitor's prices, or if the prices from our suppliers change.
OrderID CustID ProdID Quantity
ProdID Quantity Price
The second table is our pricing list.
Think of Quantity as a range, where each Quantity value in the pricing list is the lowest quantity of its range:
AB-111: 1-100, 101-500, 501 and more
ZA-245: 1-10, 11-50, 51 and more
The primary key for this second table is a composite of ProdID and Quantity.
OrderID CustID ProdID Quantity
1 1001 AB-111 1,000
2 1002 AB-111 500
3 1001 ZA-245 100
4 1003 MB-153 25
5 1004 ZA-245 10
6 1002 ZA-245 50
7 1001 AB-111 100
ProdID Quantity Price
AB-111 1 75
AB-111 101 60
AB-111 501 50
ZA-245 1 42
ZA-245 11 40
ZA-245 51 35
MB-153 1 82
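Reading Quantity as a range means that the price for an order is taken from the pricing row with the largest Quantity that does not exceed the ordered quantity. A minimal Python sketch of that lookup follows, assuming the pricing list is held in a dictionary; `PRICE_LIST` and `unit_price` are hypothetical names.

```python
from bisect import bisect_right

# Pricing list keyed by ProdID; each entry is (lowest quantity of the tier, price),
# sorted by the lowest quantity.
PRICE_LIST = {
    "AB-111": [(1, 75), (101, 60), (501, 50)],
    "ZA-245": [(1, 42), (11, 40), (51, 35)],
    "MB-153": [(1, 82)],
}


def unit_price(prod_id, quantity):
    """Return the price of the tier that the ordered quantity falls into."""
    tiers = PRICE_LIST[prod_id]
    starts = [start for start, _price in tiers]
    # bisect_right counts the tiers whose lowest quantity is <= quantity
    index = bisect_right(starts, quantity) - 1
    if index < 0:
        raise ValueError("quantity is below the first tier")
    return tiers[index][1]


print(unit_price("AB-111", 1000))  # order 1
print(unit_price("AB-111", 100))   # order 7
```

The two calls mirror orders 1 and 7, which buy the same product at different quantities and therefore land in different tiers.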
Congratulations! We're now in 3NF!
We can also quickly figure out what price to offer our customers for any quantity they want.
OrderID CustID ProdID Quantity
1 1001 AB-111 1,000
2 1002 AB-111 500
3 1001 ZA-245 100
4 1003 MB-153 25
5 1004 ZA-245 10
6 1002 ZA-245 50
7 1001 AB-111 100
ProdID Quantity Price
AB-111 1 75
AB-111 101 60
AB-111 501 50
ZA-245 1 42
ZA-245 11 40
ZA-245 51 35
MB-153 1 82
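With the tables split this way, Price and Total can be computed at query time instead of being stored. The sqlite3 sketch below is one possible way to do that; the table names `Orders` and `PriceList`, the column types, and the query itself are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PriceList (
        ProdID   TEXT NOT NULL,
        Quantity INTEGER NOT NULL,  -- lowest quantity of the pricing tier
        Price    INTEGER NOT NULL,
        PRIMARY KEY (ProdID, Quantity)
    );
    CREATE TABLE Orders (
        OrderID  INTEGER PRIMARY KEY,
        CustID   INTEGER NOT NULL,
        ProdID   TEXT NOT NULL,
        Quantity INTEGER NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO PriceList VALUES (?, ?, ?)",
    [
        ("AB-111", 1, 75), ("AB-111", 101, 60), ("AB-111", 501, 50),
        ("ZA-245", 1, 42), ("ZA-245", 11, 40), ("ZA-245", 51, 35),
        ("MB-153", 1, 82),
    ],
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 1001, "AB-111", 1000), (2, 1002, "AB-111", 500),
        (3, 1001, "ZA-245", 100), (4, 1003, "MB-153", 25),
        (5, 1004, "ZA-245", 10), (6, 1002, "ZA-245", 50),
        (7, 1001, "AB-111", 100),
    ],
)

# For each order, pick the pricing row with the largest tier start that does not
# exceed the ordered quantity, then derive Total from Price and Quantity.
QUERY = """
SELECT o.OrderID, p.Price, o.Quantity * p.Price AS Total
FROM Orders AS o
JOIN PriceList AS p ON p.ProdID = o.ProdID
WHERE p.Quantity = (
    SELECT MAX(t.Quantity)
    FROM PriceList AS t
    WHERE t.ProdID = o.ProdID AND t.Quantity <= o.Quantity
)
ORDER BY o.OrderID
"""
totals = conn.execute(QUERY).fetchall()
for row in totals:
    print(row)
```

The derived Price and Total values can be compared against the original six-column order table to confirm that nothing was lost by the split.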
Summarizing
A database is in 2NF if:
It is in 1NF
There is no repeating data in its tables.
Put another way, if we use a composite primary key, then all attributes are dependent on all parts of the key.
And Finally
A database is in 1NF if:
All its attributes are atomic (meaning they contain only a single unit or type of data), and
All rows have a unique primary key.