ITD102_Lecture 6-Normalization Process

  • 8/13/2019 ITD102_Lecture 6-Normalization Process

    1/39

    Database Normalization

    Engr. Julius Cansino

    What is Normalization

    Normalization allows us to organize data so that it:

    Allows faster access (dependencies make sense)

    Takes up less space (less redundancy)

    Normalization is a method of organizing data elements into tables.

    Normal Forms

    Normalization is done by transforming data into successive Normal Forms.

    There are five commonly cited Normal Forms (1NF through 5NF), but we almost never use 4NF or 5NF.

    We will only be concerned with 1NF, 2NF, and 3NF.

    For a database to be in a normal form, it must meet all the requirements of the previous forms:

    E.g., for a database to be in 2NF, it must already be in 1NF. For a database to be in 3NF, it must already be in 1NF and 2NF.

    Sample Data

    Manager   Employees
    Bob       Susie, Eric
    Edward    Bella, Andrew
    Taylor    Mark, Jane

    This data has a problem:

    The Employees column is not atomic.

    A column must be atomic, meaning that each value can hold only a single item of data. This column holds more than one employee name.

    Data that is not atomic means:

    We can't easily sort the data

    We can't easily search or index the data

    We can't easily change the data

    We can't easily reference the data in other tables

    Manager Employee1 Employee2

    Bob Susie Eric

    Edward Bella Andrew

    Taylor Mark Jane

    Breaking the Employees column into more than one column doesn't solve our problems:

    The data may look atomic, but only because we have many identical columns each storing a single piece of data instead of a single column storing many pieces of data.

    We still can't easily sort, search, or index our employees.

    What if a manager has more than 2 employees, say 10 or 100? We'd need to add columns to our database just for these cases.

    It is still hard to reference our employees in other tables.

    By the way, what would be a good choice of Primary Key for this table?

    First Normal Form

    1NF means that we must:

    Eliminate duplicate columns from the same table, and

    Create separate tables for each group of related data, each with a unique row identifier (primary key)

    Let's get started by making our columns atomic.

    Atomic Data

    By breaking each tuple of our table into an entry for each employee, we have made our data atomic.

    What would be the primary key?

    Manager   Employee
    Bob       Susie
    Bob       Eric
    Edward    Bella
    Edward    Andrew
    Taylor    Mark
    Taylor    Jane
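
    The step from the original Manager/Employees table to one row per employee can be sketched in a few lines of Python. This is only an illustration: the list `raw_rows` and the function name `to_atomic` are assumptions made for the example, not part of the lecture.

```python
# Sample rows copied from the non-atomic Manager/Employees table.
raw_rows = [
    ("Bob", "Susie, Eric"),
    ("Edward", "Bella, Andrew"),
    ("Taylor", "Mark, Jane"),
]


def to_atomic(rows):
    """Return one (manager, employee) pair per employee name."""
    atomic = []
    for manager, employees in rows:
        # Each comma-separated name becomes its own row.
        for name in employees.split(","):
            atomic.append((manager, name.strip()))
    return atomic


atomic_rows = to_atomic(raw_rows)
for manager, employee in atomic_rows:
    print(manager, employee)
```

    Each resulting row holds exactly one employee, so sorting, searching, and referencing employees no longer requires parsing a string.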

    Primary Key

    The best primary key would be the Employee column.

    Every employee has only one manager, so each employee appears in exactly one row and the Employee value is unique.

    Employee  Manager
    Susie     Bob
    Eric      Bob
    Bella     Edward
    Andrew    Edward
    Mark      Taylor
    Jane      Taylor

    First Normal Form

    Congratulations!

    The fact that all our data and columns are atomic and we have a primary key means that we are in 1NF!

    First Normal Form Revised

    Of course there may come a day when we hire a second employee or manager with the same name. To avoid this, let's use an employee ID instead of their name.

    ID  Employee  ManagerID
    1   Susie     7
    2   Eric      7
    3   Bella     8
    4   Andrew    8
    5   Mark      9
    6   Jane      9
    7   Bob
    8   Edward
    9   Taylor
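
    As a rough sketch, the revised table could be declared with Python's built-in sqlite3 module. The table name `Employee`, the choice of SQLite, and storing NULL for managers who have no manager listed are assumptions made for this example; the slide only shows the three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute(
    """
    CREATE TABLE Employee (
        ID        INTEGER PRIMARY KEY,
        Employee  TEXT NOT NULL,
        ManagerID INTEGER REFERENCES Employee(ID)  -- NULL when no manager is listed
    )
    """
)

# Managers are inserted first so that every ManagerID already exists.
rows = [
    (7, "Bob", None), (8, "Edward", None), (9, "Taylor", None),
    (1, "Susie", 7), (2, "Eric", 7), (3, "Bella", 8),
    (4, "Andrew", 8), (5, "Mark", 9), (6, "Jane", 9),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", rows)

# A self-join turns each ManagerID back into the manager's name.
pairs = conn.execute(
    """
    SELECT e.Employee, m.Employee
    FROM Employee AS e
    JOIN Employee AS m ON e.ManagerID = m.ID
    ORDER BY e.ID
    """
).fetchall()
print(pairs)
```

    Because rows refer to each other by ID, two people named Bob would simply get two different IDs.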

    Moving to Second Normal Form

    A database in 2NF must also be in 1NF:

    Data must be atomic

    Every row (or tuple) must have a unique primary key

    Plus:

    Subsets of data that apply to multiple rows (repeating data) are moved to separate tables

    CustID  FirstName  LastName    Address           City       State  Zip
    1       Bob        Smith       123 Main St.      Tucson     AZ     12345
    2       John       Brown       555 2nd Ave.      St. Paul   MN     54355
    3       Sandy      Jessop      4256 James St.    Chicago    IL     43555
    4       Maria      Hernandez   4599 Columbia     Vancouver  BC     V5N 1M0
    5       Gameil     Hintz       569 Summit St.    St. Paul   MN     54355
    6       James      Richardson  12 Cameron Bay    Regina     SK     S4T 2V8
    7       Shiela     Green       12 Michigan Ave.  Chicago    IL     43555
    8       Ian        Sampson     56 Manitoba St.   Winnipeg   MB     M5W 9N7
    9       Ed         Rodgers     15 Athol St.      Regina     SK     S4T 2V9

    This data is in 1NF: all fields are atomic and the CustID serves as the primary key.

    But let's pay attention to the City, State, and Zip fields:

    There are 2 sets of rows with repeating data: one for Chicago, and one for St. Paul.

    The rows in each set have the same city, state, and Zip code.

    City       State  Zip
    Tucson     AZ     12345
    St. Paul   MN     54355
    Chicago    IL     43555
    Vancouver  BC     V5N 1M0
    St. Paul   MN     54355
    Regina     SK     S4T 2V8
    Chicago    IL     43555
    Winnipeg   MB     M5W 9N7
    Regina     SK     S4T 2V9

    The CustID determines all the data in the row, but a U.S. Zip code also determines the City and State. (E.g., in this example a given Zip code belongs to only one city and state, so storing the City and State alongside every Zip code is redundant.)

    This means that City and State are Functionally Dependent on the value in Zip, and not only on the primary key.
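
    A functional dependency can be tested against sample data with a small helper. The sketch below is illustrative: the function name `determines` and the dictionary layout are assumptions, and only some of the customer rows are used. Note that sample data can only disprove a dependency; confirming one still requires knowing the business rule.

```python
def determines(rows, lhs, rhs):
    """Return True if equal values in the lhs columns always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        value = tuple(row[column] for column in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# A subset of the customer rows, reduced to the columns of interest.
customers = [
    {"CustID": 2, "City": "St. Paul", "State": "MN", "Zip": "54355"},
    {"CustID": 3, "City": "Chicago", "State": "IL", "Zip": "43555"},
    {"CustID": 5, "City": "St. Paul", "State": "MN", "Zip": "54355"},
    {"CustID": 6, "City": "Regina", "State": "SK", "Zip": "S4T 2V8"},
    {"CustID": 7, "City": "Chicago", "State": "IL", "Zip": "43555"},
    {"CustID": 9, "City": "Regina", "State": "SK", "Zip": "S4T 2V9"},
]

print(determines(customers, ["Zip"], ["City", "State"]))  # Zip -> City, State
print(determines(customers, ["City"], ["Zip"]))           # City -> Zip
```

    In this sample the first check holds, while the second fails because Regina appears with two different postal codes.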

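    To be in 2NF, this repeating data must be in its own table. (Formally, because CustID is a single-column key, CustID → Zip → City, State is a transitive dependency of the kind that 3NF removes; this lecture handles it under the informal "repeating data" rule for 2NF.)

    So:

    Let's create a Zip code table that maps Zip codes to their City and State.

    Note that Canadian Postal Codes are different: the same city and province can have many different postal codes.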

    Our Data in 2NF

    Customer Table

    CustID  FirstName  LastName    Address           Zip
    1       Bob        Smith       123 Main St.      12345
    2       John       Brown       555 2nd Ave.      54355
    3       Sandy      Jessop      4256 James St.    43555
    4       Maria      Hernandez   4599 Columbia     V5N 1M0
    5       Gameil     Hintz       569 Summit St.    54355
    6       James      Richardson  12 Cameron Bay    S4T 2V8
    7       Shiela     Green       12 Michigan Ave.  43555
    8       Ian        Sampson     56 Manitoba St.   M5W 9N7
    9       Ed         Rodgers     15 Athol St.      S4T 2V9

    Zip Code Table

    Zip      City       State
    12345    Tucson     AZ
    54355    St. Paul   MN
    43555    Chicago    IL
    V5N 1M0  Vancouver  BC
    S4T 2V8  Regina     SK
    M5W 9N7  Winnipeg   MB
    S4T 2V9  Regina     SK

    We see that we can actually save 2 rows in the Zip Code table by removing these redundancies: 9 customer records only need 7 Zip code records.

    Zip code becomes a foreign key in the Customer table, linked to the primary key in the Zip Code table.
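
    The two-table layout and its foreign key can be sketched with Python's built-in sqlite3 module. The table names `Customer` and `ZipCode`, the column types, and the use of SQLite are assumptions for this example, and only a few of the rows are loaded. The rename to 'Saint Paul' is a made-up change used purely to show the effect of a single update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript(
    """
    CREATE TABLE ZipCode (
        Zip   TEXT PRIMARY KEY,
        City  TEXT NOT NULL,
        State TEXT NOT NULL
    );
    CREATE TABLE Customer (
        CustID    INTEGER PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL,
        Address   TEXT NOT NULL,
        Zip       TEXT NOT NULL REFERENCES ZipCode(Zip)
    );
    """
)

conn.executemany("INSERT INTO ZipCode VALUES (?, ?, ?)", [
    ("12345", "Tucson", "AZ"),
    ("54355", "St. Paul", "MN"),
    ("43555", "Chicago", "IL"),
])
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", "Smith", "123 Main St.", "12345"),
    (2, "John", "Brown", "555 2nd Ave.", "54355"),
    (3, "Sandy", "Jessop", "4256 James St.", "43555"),
    (5, "Gameil", "Hintz", "569 Summit St.", "54355"),
    (7, "Shiela", "Green", "12 Michigan Ave.", "43555"),
])

# A join rebuilds the original wide rows whenever they are needed.
JOIN_SQL = """
    SELECT c.CustID, c.FirstName, z.City, z.State, c.Zip
    FROM Customer AS c
    JOIN ZipCode AS z ON z.Zip = c.Zip
    ORDER BY c.CustID
"""
full_rows = conn.execute(JOIN_SQL).fetchall()

# Hypothetical rename: one UPDATE in ZipCode reaches every customer with that Zip.
conn.execute("UPDATE ZipCode SET City = 'Saint Paul' WHERE Zip = '54355'")
renamed = [row for row in conn.execute(JOIN_SQL) if row[2] == "Saint Paul"]
print(full_rows)
print(renamed)
```

    With foreign keys enabled, inserting a customer whose Zip is missing from the Zip Code table is rejected, which is what keeps the two tables consistent.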

    Advantages of 2NF

    Saves space in the database by reducing redundancies

    If a customer calls, you can just ask them for their Zip code and you'll know their city and state! (No more spelling mistakes)

    If a City name changes, we only need to make one change to the database.

    Summary So Far

    1NF:

    All data is atomic

    All rows have a unique primary key

    2NF:

    Data is in 1NF

    Subsets of data in multiple columns are moved to a new table

    These new tables are related using foreign keys

    Moving to 3NF

    To be in 3NF:

    The database must be in 2NF

    All non-key columns must depend directly on the primary key and on nothing else (there are no transitive dependencies)

    In this table:

    CustID and ProdID depend on the OrderID and on no other column (good)

    Stated another way, if you know the OrderID, you know the CustID and the ProdID

    So: OrderID → CustID, ProdID

    OrderID CustID ProdID Price Quantity Total

    1 1001 AB-111 50 1,000 50,000

    2 1002 AB-111 60 500 30,000

    3 1001 ZA-245 35 100 3,500

    4 1003 MB-153 82 25 2,050

    5 1004 ZA-245 42 10 420

    6 1002 ZA-245 40 50 2,000

    7 1001 AB-111 75 100 7,500

    But there are some fields that are not directly dependent on OrderID:

    Total is the simple product of Price * Quantity. As such, it has a transitive dependency on Price and Quantity.

    Because it is a calculated value, it doesn't need to be included at all.
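
    Dropping the stored Total and computing it on demand might look like the following sketch. The dictionary layout and the function name `order_total` are assumptions for the example; the Price and Quantity values are taken from the order table.

```python
# Price and Quantity per order, copied from the order table (Total is not stored).
orders = [
    {"OrderID": 1, "Price": 50, "Quantity": 1000},
    {"OrderID": 2, "Price": 60, "Quantity": 500},
    {"OrderID": 3, "Price": 35, "Quantity": 100},
    {"OrderID": 4, "Price": 82, "Quantity": 25},
    {"OrderID": 5, "Price": 42, "Quantity": 10},
    {"OrderID": 6, "Price": 40, "Quantity": 50},
    {"OrderID": 7, "Price": 75, "Quantity": 100},
]


def order_total(order):
    """Derive the total from the two columns it depends on."""
    return order["Price"] * order["Quantity"]


totals = {order["OrderID"]: order_total(order) for order in orders}
print(totals)
```

    Since the total is always derived from Price and Quantity, it can never disagree with them, which is the risk a stored Total column carries.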

    Also, we can see that Price isn't really dependent on ProdID alone, or on OrderID. Customer 1001 bought AB-111 for $50 (in order 1) and for $75 (in order 7), while 1002 spent $60 for each item in order 2.

    Maybe Price is dependent on the ProdID and Quantity: the more you buy of a given product, the cheaper that product becomes!

    So we ask the business manager and she tells us that this is the case.
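
    Before asking the business manager, the volume-discount hunch can be checked against the order data. The sketch below is an illustration only: the tuple layout and the helper name `price_falls_with_quantity` are assumptions, and a pattern in seven sample rows suggests a rule rather than proving it.

```python
from collections import defaultdict

# (OrderID, ProdID, Price, Quantity) copied from the order table.
orders = [
    (1, "AB-111", 50, 1000),
    (2, "AB-111", 60, 500),
    (3, "ZA-245", 35, 100),
    (4, "MB-153", 82, 25),
    (5, "ZA-245", 42, 10),
    (6, "ZA-245", 40, 50),
    (7, "AB-111", 75, 100),
]

# Collect the (quantity, price) points observed for each product.
points_by_product = defaultdict(list)
for _order_id, prod_id, price, quantity in orders:
    points_by_product[prod_id].append((quantity, price))


def price_falls_with_quantity(points):
    """True if the unit price never rises as the ordered quantity grows."""
    prices = [price for _quantity, price in sorted(points)]
    return all(earlier >= later for earlier, later in zip(prices, prices[1:]))


for prod_id, points in sorted(points_by_product.items()):
    print(prod_id, sorted(points), price_falls_with_quantity(points))
```

    Here every product passes the check, and AB-111 alone shows three different prices, so ProdID by itself cannot determine Price.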

    Let's diagram the dependencies.

    We can see that all fields are dependent on OrderID, the Primary Key (white lines)

    OrderID CustID ProdID Price Quantity Total

    But Total is also determined by Price and Quantity (yellow lines)

    This is a derived field (Price x Quantity = Total)

    We can save a lot of space by getting rid of it altogether and just calculating the total when we need it

    OrderID CustID ProdID Price Quantity Total

    Price is also determined by both ProdID and Quantity rather than directly by the primary key (red lines). This is called a transitive dependency. We must get rid of transitive dependencies to have 3NF.

    OrderID CustID ProdID Price Quantity

    We do this by moving the transitive dependency into a second table

    OrderID CustID ProdID Price Quantity

    By splitting out the table, we can quickly adjust our price table to match our competitors, or when prices from our suppliers change.

    OrderID CustID ProdID Quantity

    ProdID Quantity Price

    The second table is our pricing list.

    Think of Quantity as a range, where each row stores the smallest quantity in its range:

    AB-111: 1-100, 101-500, 501 and more

    ZA-245: 1-10, 11-50, 51 and more

    The Primary Key for this second table is a composite of ProdID and Quantity.

    OrderID  CustID  ProdID  Quantity
    1        1001    AB-111  1,000
    2        1002    AB-111  500
    3        1001    ZA-245  100
    4        1003    MB-153  25
    5        1004    ZA-245  10
    6        1002    ZA-245  50
    7        1001    AB-111  100

    ProdID  Quantity  Price
    AB-111  1         75
    AB-111  101       60
    AB-111  501       50
    ZA-245  1         42
    ZA-245  11        40
    ZA-245  51        35
    MB-153  1         82
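
    Looking up a price then means finding the tier whose range contains the ordered quantity. The sketch below assumes that the Quantity stored in the pricing list is the minimum quantity of each tier, as the ranges above suggest. The names `PRICE_LIST` and `unit_price` are hypothetical.

```python
from bisect import bisect_right

# Pricing list keyed by ProdID; each entry is (minimum quantity, unit price),
# sorted by minimum quantity.
PRICE_LIST = {
    "AB-111": [(1, 75), (101, 60), (501, 50)],
    "ZA-245": [(1, 42), (11, 40), (51, 35)],
    "MB-153": [(1, 82)],
}


def unit_price(prod_id, quantity):
    """Return the unit price of the tier that contains the given quantity."""
    tiers = PRICE_LIST[prod_id]
    minimums = [minimum for minimum, _price in tiers]
    # Index of the last tier whose minimum is <= quantity.
    index = bisect_right(minimums, quantity) - 1
    if index < 0:
        raise ValueError("quantity is below the smallest tier")
    return tiers[index][1]


# Order 1 from the orders table: 1,000 units of AB-111.
price = unit_price("AB-111", 1000)
print(price, price * 1000)
```

    Recomputing the price for each of the seven orders this way reproduces the Price column of the original single table.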

    Congratulations! We're now in 3NF!

    We can also quickly figure out what price to offer our customers for any quantity they want.

    Summarizing

    A database is in 2NF if:

    It is in 1NF

    There is no repeating data in its tables.

    Put another way, if we use a composite primary key, then all non-key attributes are dependent on all parts of the key.

    And Finally

    A database is in 1NF if:

    All its attributes are atomic (meaning they contain only a single unit or type of data), and

    All rows have a unique primary key.