
Naeem A. Mahoto

Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Email: [email protected]

Data warehouse and Data Mining

Lecture No. 12

Normalization and De-normalization

Database Design

•  Conceptual
   –  identify important entities and relationships
   –  determine attribute domains and candidate keys

•  Logical
   –  split data into multiple tables, such that:
      •  no information is lost
      •  useful information can be easily reconstituted
   –  draw the E-R diagram
   –  validate model using normalization

•  Physical
   –  implement on DBMS

Database Anomalies

•  Database anomalies are unmatched or missing information caused by limitations or flaws within a given database

•  Database anomalies are the problems in relations that occur due to redundancy in the relations

•  These anomalies affect the process of inserting, deleting and modifying data in the relations/tables

Types of Anomalies

•  Insertion Anomaly: It occurs when a new record is inserted in the relation
   –  In this anomaly, the user cannot insert a fact about an entity until he/she has an additional fact about another entity

•  Deletion Anomaly: It occurs when a record is deleted from the relation
   –  In this anomaly, the deletion of facts about an entity automatically deletes the facts about another entity

•  Modification Anomaly: It occurs when a record is updated in the relation
   –  In this anomaly, the modification of the value of a specific attribute requires modification of all records in which that value occurs
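The three anomalies can be reproduced on a small un-normalized table. The following is a minimal sketch using Python's built-in sqlite3 module; the enrollment table, its columns and its rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical un-normalized table: enrollment facts and instructor facts share a row.
conn.execute("""
    CREATE TABLE enrollment (
        student    TEXT NOT NULL,
        course     TEXT NOT NULL,
        instructor TEXT NOT NULL,
        office     TEXT NOT NULL,
        PRIMARY KEY (student, course)
    )""")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?)",
    [("Ali", "DB", "Dr. Khan", "R-101"),
     ("Sara", "DB", "Dr. Khan", "R-101"),
     ("Ali", "AI", "Dr. Shah", "R-202")],
)

# Insertion anomaly: a course and its instructor cannot be recorded
# until at least one student is known, because student is part of the key.
try:
    conn.execute("INSERT INTO enrollment VALUES (NULL, 'ML', 'Dr. Noor', 'R-303')")
    insertion_blocked = False
except sqlite3.IntegrityError:
    insertion_blocked = True

# Modification anomaly: one change of office must touch every row
# in which the instructor appears.
cur = conn.execute(
    "UPDATE enrollment SET office = 'R-105' WHERE instructor = 'Dr. Khan'")
rows_touched = cur.rowcount

# Deletion anomaly: removing the only AI enrollment also removes
# everything the table knew about Dr. Shah.
conn.execute("DELETE FROM enrollment WHERE student = 'Ali' AND course = 'AI'")
shah_rows = conn.execute(
    "SELECT COUNT(*) FROM enrollment WHERE instructor = 'Dr. Shah'").fetchone()[0]

print(insertion_blocked, rows_touched, shah_rows)
```

All three problems come from storing the instructor's office redundantly alongside each enrollment.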

Normalization

•  Normalization is the process of converting a bad database design into a form that overcomes database anomalies

•  It is the process of organizing the fields and tables of a relational database to minimize redundancy (eliminate redundant data) and dependency (ensure dependencies make sense)

•  Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them

•  The goal is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then the database is updated using the defined relationships

Normalization

•  Edgar F. Codd (inventor of the relational model) introduced normalization in 1970 with the first normal form; the commonly listed normal forms, several of them defined later by Codd and other researchers, are:
   –  First normal form (1NF)
   –  Second normal form (2NF)
   –  Third normal form (3NF)
   –  Boyce-Codd normal form (BCNF)
   –  Fourth normal form (4NF)
   –  Fifth normal form (5NF)
   –  Domain key normal form (DKNF)

First Normal Form (1NF)

•  A relation/table is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain

•  Example: Consider a table that stores Customers and their Telephone Number. A customer may have more than one Telephone number

First Normal Form (1NF) Tables designed with 1NF
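The customer and telephone example can be sketched in a few lines of Python. The rows, field names and the helper to_1nf are hypothetical; the point is only that each telephone number ends up in its own row of a separate table.

```python
# Hypothetical rows: the telephones field holds several values, which violates 1NF.
unnormalized = [
    {"customer_id": 1, "name": "Ali", "telephones": "555-0101, 555-0102"},
    {"customer_id": 2, "name": "Sara", "telephones": "555-0199"},
]


def to_1nf(rows):
    """Split rows into a customer table and a customer_phone table with atomic values."""
    customers = [(r["customer_id"], r["name"]) for r in rows]
    phones = [
        (r["customer_id"], phone.strip())
        for r in rows
        for phone in r["telephones"].split(",")
    ]
    return customers, phones


customers, phones = to_1nf(unnormalized)
print(customers)
print(phones)
```

Each (customer_id, telephone) pair is now a single atomic fact, so a customer can have any number of telephone numbers without repeating groups.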

Second Normal Form (2NF)

•  A relation/table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key

•  A table/relation is in 2NF if it is in first normal form and every non-primary-key column is fully functionally dependent on the primary key

•  Full functional dependency indicates that if A and B are columns (or sets of columns) of a table, B is fully dependent on A when B is functionally dependent on A but not on any proper subset of A

Second Normal Form (2NF)

•  Consider a table describing employees' skills (columns: Employee, Skill, Current Work Location):
   –  The candidate key is composite: {Employee, Skill}
   –  An Employee might need to appear more than once (he/she might have multiple Skills)
   –  Current Work Location is dependent on only part of the candidate key
   –  Therefore the table is not in 2NF

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}

Progressing to 2NF

•  If a table is not in second normal form:
   –  Move each data item that depends on only part of the primary key, together with the part of the key on which it is functionally dependent, to a new table
   –  Add any other data items that are functionally dependent on the same part of the key
   –  Make the partial primary key the primary key for the new table
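The steps above can be sketched for the employees' skills table. The rows below are hypothetical; the decomposition moves Current Work Location, which depends only on Employee, into its own table and then checks that joining the two tables gives back the original rows.

```python
# Hypothetical rows of (employee, skill, current_work_location).
rows = [
    ("Ali", "SQL", "Jamshoro"),
    ("Ali", "Python", "Jamshoro"),
    ("Sara", "SQL", "Karachi"),
]

# New table keyed by the partial key {Employee}: the location is stored once per employee.
employees = sorted({(emp, loc) for emp, _skill, loc in rows})

# Remaining table keeps the full composite key {Employee, Skill}.
employee_skills = sorted({(emp, skill) for emp, skill, _loc in rows})

# Joining on employee reconstructs the original table, so no information is lost.
location_of = dict(employees)
rejoined = sorted((emp, skill, location_of[emp]) for emp, skill in employee_skills)

print(employees)
print(employee_skills)
print(rejoined == sorted(rows))
```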


Third Normal Form (3NF)

•  A table is in 3NF if and only if both of the following conditions hold:
   –  The relation R (table) is in second normal form (2NF)
   –  Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R

•  Equivalently, a table is in 3NF if it is in 1NF and 2NF and no non-primary-key column is transitively dependent on the primary key

Third Normal Form (3NF)

•  Example: consider a table with columns A, B, and C. If B is functionally dependent on A and C is functionally dependent on B, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C)

Third Normal Form (3NF)

•  An example of a 2NF table that fails to meet the requirements of 3NF is one with the composite candidate key {Tournament, Year}:
   –  Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner

Progressing to 3NF

•  Move all items involved in transitive dependencies to a new entity

•  Identify a primary key for the new entity

•  Place the primary key for the new entity as a foreign key on the original entity
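Applied to the tournament winners example, these steps might look as follows. This is a sketch with sqlite3; the table names winner and tournament_winner and all rows are hypothetical. Winner Date of Birth moves to a new table keyed by Winner, and Winner stays behind as a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New entity holding the transitively dependent attribute.
    CREATE TABLE winner (
        winner        TEXT PRIMARY KEY,
        date_of_birth TEXT NOT NULL
    );
    -- Original entity keeps Winner as a foreign key.
    CREATE TABLE tournament_winner (
        tournament TEXT NOT NULL,
        year       INTEGER NOT NULL,
        winner     TEXT NOT NULL REFERENCES winner(winner),
        PRIMARY KEY (tournament, year)
    );
""")
conn.executemany("INSERT INTO winner VALUES (?, ?)",
                 [("A. Player", "1980-01-01"), ("B. Player", "1985-05-05")])
conn.executemany("INSERT INTO tournament_winner VALUES (?, ?, ?)",
                 [("Open Cup", 2020, "A. Player"),
                  ("Open Cup", 2021, "B. Player"),
                  ("City Masters", 2020, "A. Player")])

# A date of birth is now corrected in exactly one row ...
conn.execute("UPDATE winner SET date_of_birth = '1980-02-02' WHERE winner = 'A. Player'")

# ... and every tournament won by that player sees the corrected value.
result = conn.execute("""
    SELECT t.tournament, t.year, w.date_of_birth
    FROM tournament_winner t
    JOIN winner w ON w.winner = t.winner
    WHERE t.winner = 'A. Player'
    ORDER BY t.tournament, t.year
""").fetchall()
print(result)
```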

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

•  It is a slightly stronger version of the third normal form (3NF)

•  A relational schema R is in Boyce–Codd normal form if and only if, for every one of its dependencies X → Y, at least one of the following conditions holds:
   –  X → Y is a trivial functional dependency (Y ⊆ X)
   –  X is a superkey for schema R

•  Only in rare cases does a 3NF table not meet the requirements of BCNF
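The BCNF condition can be checked mechanically once the functional dependencies are written down. The sketch below uses the standard attribute-closure computation; the function names and the example schema (student, course, instructor) are assumptions made for illustration, not part of the lecture.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """List dependencies X -> Y that are not trivial and whose X is not a superkey."""
    schema = set(schema)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs)            # X -> Y is not trivial
        and not schema <= closure(lhs, fds)    # X is not a superkey
    ]


# Hypothetical schema: each instructor teaches one course, and a student
# has one instructor per course. It is in 3NF but not in BCNF.
schema = ("student", "course", "instructor")
fds = [
    (("student", "course"), ("instructor",)),
    (("instructor",), ("course",)),
]
violations = bcnf_violations(schema, fds)
print(violations)
```

Here instructor → course is reported because {instructor} does not determine student and so is not a superkey, which is one of the rare cases mentioned above.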

Fourth Normal Form (4NF)

•  A table is in fourth normal form (4NF) if it is in 3NF and there are no non-trivial multi-valued dependencies

•  Multi-valued Dependency: In a table with columns A, B, and C, there is a multi-valued dependence of column B on column A if each value for A is associated with a specific collection of values for B and, furthermore, this collection is independent of any values for C
   –  E.g. (employee, skill, language): two many-to-many relationships that are independent because any skill can be paired with any language

•  To remove multi-valued dependencies, create separate tables for the independent repeating groups
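The (employee, skill, language) case can be sketched with plain Python sets. The rows are hypothetical. Because skills and languages are independent, the single table must hold every skill and language combination per employee, while the two separate tables hold each fact once.

```python
# Hypothetical independent facts about employees.
employee_skill = {("Ali", "SQL"), ("Ali", "Python"), ("Sara", "SQL")}
employee_language = {("Ali", "Sindhi"), ("Ali", "English"), ("Sara", "Urdu")}

# A single (employee, skill, language) table has to pair every skill
# with every language of the same employee.
combined = {
    (emp, skill, lang)
    for emp, skill in employee_skill
    for emp2, lang in employee_language
    if emp == emp2
}

# Projecting the combined table gives back the two separate tables,
# so the decomposition loses no information.
skills_back = {(emp, skill) for emp, skill, _lang in combined}
languages_back = {(emp, lang) for emp, _skill, lang in combined}

print(len(combined), len(employee_skill) + len(employee_language))
```

Ali's two skills and two languages already cost four rows in the combined table; adding one more language for Ali would add two more rows instead of one.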

De-normalization

•  De-normalization is the process of combining tables in a careful manner to improve performance

•  This is the process of breaking the rules for 3NF

•  The primary reasons to do this are:
   –  To reduce the number of joins that must be processed in queries, thereby improving database performance
   –  To map the physical database structure more closely to the user's dimensional business model, structuring tables along the lines of how users will ask questions

De-normalization

•  Normalization is a rule of thumb in DBMS, but in a Decision Support System (DSS) ease of use is achieved by way of de-normalization

•  It brings dispersed but related data items "close" together

•  Query performance in a DSS is significantly dependent on the physical data model

•  De-normalization specifically improves performance by:
   –  Reducing the number of tables and hence the reliance on joins, which consequently speeds up performance
   –  Reducing the number of joins required during query execution, or
   –  Reducing the number of rows (records) to be retrieved from the Primary Data Table

Normalization vs. De-normalization

De-normalization

•  “Depending on whether the modeler is building the model for a data mart or a data warehouse the data modeler will wish to engage in some degree of de-normalization.” [Bill Inmon]

•  “De-normalization of the logical data model serves the purpose of making the data more efficient to access. In the case of a data mart, a high degree of de-normalization can be practiced. In the case of a data warehouse a low degree of de-normalization is in order.” [Bill Inmon]

Issues to consider in De-normalization

•  The effects of de-normalization on database performance are unpredictable: many applications can be affected negatively by de-normalization

•  De-normalize the implementation of the logical model only after one has thoroughly analyzed the costs and benefits, and only after a normalized logical design has been completed

De-normalization: Effects

•  Consider the following list of effects of de-normalization before one decides to undertake design changes:
   –  A de-normalized physical implementation can increase hardware costs
   –  While de-normalization benefits the applications it is specifically designed to enhance, it often decreases the performance of other applications
   –  De-normalization introduces update anomalies to the database

De-normalization

•  The following items are typical of the de-normalizations that can sometimes be exploited to optimize performance:
   –  Pre-join
   –  Column Replication or Movement
   –  Pre-Aggregation

Pre-join: De-normalization

•  A pre-join de-normalization moves frequently joined attributes to the same base relation in order to eliminate join processing

•  It avoids performance impact of the frequent joins

•  Typically increases storage requirements

Pre-join: De-normalization

•  Before de-normalization:

   sales:         sale_id | store_id | sale_dt | …
   sales_detail:  tx_id | sale_id | item_id | … | item_qty | sale$
   (one row in sales relates to many rows in sales_detail, 1 : m)

   select sum(sales_detail.sale_amt)
   from sales, sales_detail
   where sales.sales_id = sales_detail.sales_id
     and sales.sales_dt between '2006-11-26' and '2006-12-25';

Pre-join: De-normalization

•  After de-normalization:

   d_sales_detail:  tx_id | sale_id | store_id | sale_dt | item_id | … | item_qty | $

   select sum(d_sales_detail.sale_amt)
   from d_sales_detail
   where d_sales_detail.sales_dt between '2006-11-26' and '2006-12-25';
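The two queries can be tried side by side with sqlite3. In this sketch the table and column names follow the queries on the slides (sales_id, sales_dt, sale_amt), while the column types, the sample rows and the CREATE TABLE ... AS step used to build d_sales_detail are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        sales_id INTEGER PRIMARY KEY,
        store_id INTEGER,
        sales_dt TEXT
    );
    CREATE TABLE sales_detail (
        tx_id    INTEGER PRIMARY KEY,
        sales_id INTEGER REFERENCES sales(sales_id),
        item_id  INTEGER,
        item_qty INTEGER,
        sale_amt REAL
    );
    -- Hypothetical sample rows.
    INSERT INTO sales VALUES
        (1, 10, '2006-11-30'), (2, 10, '2006-12-24'), (3, 11, '2007-01-02');
    INSERT INTO sales_detail VALUES
        (100, 1, 7, 2, 20.0), (101, 1, 8, 1, 5.0),
        (102, 2, 7, 1, 10.0), (103, 3, 9, 4, 40.0);

    -- Pre-join: carry the header columns on every detail row.
    CREATE TABLE d_sales_detail AS
        SELECT d.tx_id, s.sales_id, s.store_id, s.sales_dt,
               d.item_id, d.item_qty, d.sale_amt
        FROM sales s JOIN sales_detail d ON s.sales_id = d.sales_id;
""")

# Query against the normalized tables: needs a join.
joined_total = conn.execute("""
    select sum(sales_detail.sale_amt)
    from sales, sales_detail
    where sales.sales_id = sales_detail.sales_id
      and sales.sales_dt between '2006-11-26' and '2006-12-25'
""").fetchone()[0]

# Same question against the pre-joined table: no join.
prejoined_total = conn.execute("""
    select sum(d_sales_detail.sale_amt)
    from d_sales_detail
    where d_sales_detail.sales_dt between '2006-11-26' and '2006-12-25'
""").fetchone()[0]

print(joined_total, prejoined_total)
```

Both queries return the same total; the price is that store_id and sales_dt are now repeated on every detail row, which is the storage increase noted earlier.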

Column Replication: De-normalization

•  Take columns that are frequently accessed via large-scale joins and replicate (or move) them into the detail table(s) to avoid the join operation

•  It avoids performance impact of the frequent joins

•  It increases storage requirements for database

Column Replication: De-normalization

A three-table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior

Before de-normalization:

   Customer_Id | Customer_Nm | Address | SIC | …
      1 : m
   Account_Id | Customer_Id | Balance $ | Open_Dt | …
      1 : m
   Tx_Id | Account_Id | Tx$ | Tx_Dt | Location_Id | …

After de-normalization (Customer_Id is replicated into the transaction table, which now relates directly to the customer table as well as to the account table):

   Customer_Id | Customer_Nm | Address | SIC | …
      1 : m
   Account_Id | Customer_Id | Balance $ | Open_Dt | …
      1 : m
   Tx_Id | Account_Id | Customer_Id | Tx$ | Tx_Dt | Location_Id | …
      (customer 1 : m transaction)
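A small sqlite3 sketch of column replication follows. The table names customer, account and tx, the column tx_amt (standing in for Tx$), and all rows are hypothetical. Copying Customer_Id into the transaction table lets a per-customer question skip the account table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_nm TEXT);
    CREATE TABLE account (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        balance     REAL
    );
    CREATE TABLE tx (
        tx_id      INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES account(account_id),
        tx_amt     REAL,
        tx_dt      TEXT
    );
    -- Hypothetical sample rows.
    INSERT INTO customer VALUES (1, 'Ali'), (2, 'Sara');
    INSERT INTO account VALUES (10, 1, 0), (11, 1, 0), (20, 2, 0);
    INSERT INTO tx VALUES
        (100, 10, 50.0, '2006-12-01'),
        (101, 11, 25.0, '2006-12-02'),
        (102, 20, 10.0, '2006-12-03');
""")

# Before: transaction totals per customer need a three-table join.
before = conn.execute("""
    SELECT c.customer_nm, SUM(t.tx_amt)
    FROM tx t
    JOIN account a ON a.account_id = t.account_id
    JOIN customer c ON c.customer_id = a.customer_id
    GROUP BY c.customer_nm ORDER BY c.customer_nm
""").fetchall()

# Replicate customer_id from account into tx.
conn.executescript("""
    ALTER TABLE tx ADD COLUMN customer_id INTEGER;
    UPDATE tx SET customer_id = (
        SELECT a.customer_id FROM account a WHERE a.account_id = tx.account_id
    );
""")

# After: the account table is no longer needed for the same question.
after = conn.execute("""
    SELECT c.customer_nm, SUM(t.tx_amt)
    FROM tx t
    JOIN customer c ON c.customer_id = t.customer_id
    GROUP BY c.customer_nm ORDER BY c.customer_nm
""").fetchall()

print(before)
print(after)
```

The replicated column has to be kept in step with account.customer_id whenever an account changes owner, which is the kind of update anomaly de-normalization introduces.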

Pre-aggregation: De-normalization

•  Take aggregate values that are frequently used in decision-making and pre-compute them into physical tables in the database

•  It can provide huge performance advantage in avoiding frequent aggregation of detailed data

•  Pre-aggregation adds a significant maintenance burden to the data warehouse, because the pre-computed tables must be refreshed whenever the detailed data changes
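Both the benefit and the maintenance burden show up in a short sqlite3 sketch. The summary table daily_store_sales, the simplified d_sales_detail columns and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d_sales_detail (
        tx_id    INTEGER PRIMARY KEY,
        store_id INTEGER,
        sales_dt TEXT,
        sale_amt REAL
    );
    -- Hypothetical sample rows.
    INSERT INTO d_sales_detail VALUES
        (1, 10, '2006-12-01', 20.0), (2, 10, '2006-12-01', 5.0),
        (3, 10, '2006-12-02', 10.0), (4, 11, '2006-12-01', 40.0);

    -- Pre-aggregation: store daily totals per store as a physical table.
    CREATE TABLE daily_store_sales AS
        SELECT store_id, sales_dt,
               SUM(sale_amt) AS total_amt,
               COUNT(*)      AS tx_count
        FROM d_sales_detail
        GROUP BY store_id, sales_dt;
""")

# Decision-support queries read the small summary table instead of the detail rows.
summary = conn.execute("""
    SELECT store_id, sales_dt, total_amt, tx_count
    FROM daily_store_sales ORDER BY store_id, sales_dt
""").fetchall()

# Maintenance burden: a new detail row leaves the summary stale until it is refreshed.
conn.execute("INSERT INTO d_sales_detail VALUES (5, 11, '2006-12-01', 2.5)")
stale_total = conn.execute("""
    SELECT total_amt FROM daily_store_sales
    WHERE store_id = 11 AND sales_dt = '2006-12-01'
""").fetchone()[0]
live_total = conn.execute("""
    SELECT SUM(sale_amt) FROM d_sales_detail
    WHERE store_id = 11 AND sales_dt = '2006-12-01'
""").fetchone()[0]

print(summary)
print(stale_total, live_total)
```

Until daily_store_sales is rebuilt or incrementally updated, it reports the old total for store 11, so a refresh step has to be part of every load.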
