38
Schema Tuning

Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the

Embed Size (px)

Citation preview

Schema Tuning

Outline

• Database design: Normalization– Problem of redundancy– Why?

• Functional dependency– How to solve?

• Decomposition– Objective of the decomposition?

• Normalized• Normal form

• Schema refinement– Vertical partitioning and Antipartitioning– Denormalization– Horizontal decomposition

Design Steps• The design steps:

1. Real-World 2. ER model 3. Relational Schema

4. Better relational Schema 5. Relational DBMS

• Guideline for Step (3) to step (4) – “design theory” for relations– Or “normalization theory”.

• Why normalization theory– Automatic mappings from ER to relations may not produce the best

relational design possible.– Database designers may go directly from (1) to (3), in which case, the

relational design can be really bad.

Database Schema

• A relation schema is a relation name and a set of attributesR(a int, b varchar[20]);

• A relation instance for R is a set of records over the attributes in the schema for R.

The Evils of Redundancy

• Redundancy is the root of several problems associated with relational schemas:– redundant storage– insert/delete/update anomalies

• Consider relation obtained from Hourly_Emps:– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

• Notation: We will denote this relation schema by listing the attributes: SNLRWH– This is really the set of attributes Hourly_Emps

{S,N,L,R,W,H}.

Example

• Problems due to R W :– Update anomaly:

• Can we change W in just the 1st tuple of SNLRWH?

– Insertion anomaly: • Suppose W is not allowed

null. What if we want to insert an employee and don’t know the hourly wage for his rating?

– Deletion anomaly: • If we delete all employees

with rating 5, we lose the information about the wage for rating 5!

S N L R W H

123-22-3666 Attishoo 48 8 10 40

231-31-5368 Smiley 22 8 10 30

131-24-3650 Smethurst 35 5 7 30

434-26-3751 Guldu 35 5 7 32

612-67-4134 Madayan 35 8 10 40

Functional Dependencies (FDs)• A functional dependency X Y holds over relation R

if, for every allowable instance r of R:– i.e., given two tuples in r, if the X values agree, then the Y

values must also agree. (X and Y are sets of attributes.)

• K is a key for relation R if:1. K determines all attributes of R.

2. For no proper subset of K is (1) true.• If K satisfies only (1), then K is a superkey.

Example• Consider relation Hourly_Emps:

– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

FD is a key:– ssn is the key

S SNLRWH

FDs give more detail than the mere assertion of a key.– rating determines

hrly_wages R W

S N L R W H

123-22-3666 Attishoo 48 8 10 40

231-31-5368 Smiley 22 8 10 40

131-24-3650 Smethurst 35 5 7 30

434-26-3751 Guldu 35 5 7 30

612-67-4134 Madayan 35 8 10 40

Decomposition of a Relation Scheme

• A decomposition of R(A1 ... An) consists of replacing R by two or more relations such that:– Each new relation scheme contains a subset of the attributes

of R • Implies that there exists no attributes that do not appear in R

– Every attribute of R appears as an attribute of one of the new relations.

• E.g., – SNLRWH can be decomposed into SNLRH and RW.

Problems with Decompositions

• Three potential problems to consider:Performance: Some queries become more expensive.

• e.g., How much did sailor Joe earn? (salary = W*H)

Lossless Join Decompositions: Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation!

• Fortunately, not in the SNLRWH example.

Dependency Preserving Decomposition: Checking some dependencies may require joining the instances of the decomposed relations.

• Fortunately, not in the SNLRWH example.

• Tradeoff: Must consider these issues vs. redundancy.

Lossless Join Decompositions

• Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F: X(r) Y(r)=r

• It is always true that rX(r) Y(r)– In general, the other direction does not

hold! If it does, the decomposition is lossless-join.

• Definition extended to decomposition into 3 or more relations in a straightforward way.

Dependency Preserving Decomposition

• Consider CSJDPQV, C is key, JP →C and SD → P.– BCNF decomposition: CSJDQV and SDP

– Problem: Checking JP → C requires a join!

• Dependency preserving decomposition (Intuitive):– If R is decomposed into X, Y and Z, and we enforce

the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold.

– No join is necessary to enforce original FD

Normalization

• A relation is normalized if every interesting functional dependency X A involving attributes in R has the property that X is a key of R.– BCNF– Kent :“ Each attribute

must describe [an entity or relationship identified by] the key, the whole key, and nothing but the key.“

• No partial dependency• No transitive denepdency

BCNF

Partial dependency

Transitive dependency

Example

– Schema1 (not normalized): OnOrder1(supplier_id, part_id, quantity,

supplier_address)

– Schema 2 (Normalized):OnOrder2(supplier_id, part_id, quantity);

Supplier(supplier_id, supplier_address);

Example #1

• Suppose that a bank associates each customer with his or her home branch. Each branch is in a specific legal jurisdiction( 权限 ). – Is the relation R(customer, branch, jurisdiction)

normalized?

Example #1

• What are the functional dependencies?– customer branch

– branch jurisdiction

– customer jurisdiction

– Customer is the key, but a functional dependency exists where customer is not involved.

– R is not normalized.

Example #2

• Suppose that a doctor can work in several hospitals and receives a salary from each one. – Is R(doctor, hospital, salary) normalized?

Example #2

• What are the functional dependencies?– doctor, hospital salary

• The key is doctor, hospital

• The relation is normalized.

Example #3

• Same relation R(doctor, hospital, salary) and we add the attribute primary_home_address. Each doctor has a primary home address and several doctors can have the same primary home address. Is R(doctor, hospital, salary, primary_home_address) normalized?

Example #3

• What are the functional dependencies?– doctor, hospital salary– doctor primary_home_address– doctor, hospital primary_home_address

• The key is “doctor, hospital”, however doctor (a subset) determines one attribute, hence not normalized

• A normalized decomposition would be:– R1(doctor, hospital, salary)– R2(doctor, primary_home_address)

Normal Forms

• Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!

• If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.

Normal formsUniverse of relations

1 NF

2NF3NF

BCNF4NF

5NF

Practical Schema Design

• Identify entities in the application (e.g., doctors, hospitals, suppliers).

• Each entity has attributes (an hospital has an address, a juridiction, …).

• There are two constraints on attributes:– An attribute cannot have attribute of its own.– The entity associated with an attribute must

functionally determine that attribute.

Practical Schema Design

• Each entity becomes a relation• To those relations, add relations that reflect

relationships between entities.– Worksin (doctor_ID, hospital_ID)

• Identify the functional dependencies among all attributes and check that the schema is normalized:– If functional dependency AB C holds, then ABC

should be part of the same relation.

Vertical Partitioning

• Three attributes: account_ID, balance, address.• Functional dependencies:

– account_ID balance– account_ID address

• Two normalized schema design:– (account_ID, balance, address)or– (account_ID, balance)– (account_ID, address)

• Which design is better?

Vertical Partitioning

• Which design is better depends on the query pattern:– The application that

sends a monthly statement is the principal user of the address of the owner of an account

– The balance is updated or examined several times a day.

• The second schema might be better because the relation (account_ID, balance) can be made smaller:– More account_ID, balance

pairs fit in memory, thus increasing the hit ratio

– A scan performs better because there are fewer pages.

Tuning Normalization

• Rule of thumb:– A single normalized relation XYZ is better than two

normalized relations XY and XZ if X, Y and Z are frequently accessed together

– Why? No costly join operations.– The two-relation design is better iff:

• Users access tend to partition between the two sets Y and Z most of the time

• Attributes Y or Z have large values (one-third the page size or larger)

– Too many accesses across pages

Vertical Partitioningand Scan

• R (X,Y,Z)– X is an integer– YZ are large strings

• Scan Query• Vertical partitioning

exhibits poor performance when XYZ are accessed together.

• Vertical partitioning provides a speedup if only two of the attributes (XY) are accessed.

0

0.005

0.01

0.015

0.02

No Partitioning -XYZ

VerticalPartitioning - XYZ

No Partitioning -XY

VerticalPartitioning - XY

Th

rou

gp

ut

(qu

erie

s/se

c)

Vertical Partitioningand Point Queries

• R (X,Y,Z)– X is an integer– YZ are large strings

• A mix of point queries access either XYZ or XY.

• Vertical partitioning gives a performance advantage if the proportion of queries accessing only XY is greater than 20%.

• The join is not expensive compared to a simple look-up.

0

200

400

600

800

1000

0 20 40 60 80 100

% of access that only concern XY

Th

rou

gh

pu

t (q

uer

ies/

sec)

no vertical partitioning

vertical partitioning

Vertical Antipartitioning

• Brokers’ bond-buying decisions are based on the price trends of those bonds.

• The database holds the closing price for the last 3000 trading days, however the 10 most recent trading days are especially important.– (bond_id, issue_date, maturity, …)

(bond_id, date, price)Vs.– (bond_id, issue_date, maturity, today_price, …

10dayago_price)(bond_id, date, price)

Tuning Denormalization

• Denormalizing means violating normalization for the sake of performance:– Denormalization speeds up performance when

attributes from different normalized relations are often accessed together

– Denormalization hurts performance for relations that are often updated.

Denormalizing -- data

Settings:lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY,

L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT );

region( R_REGIONKEY, R_NAME, R_COMMENT );

nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT,);

supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT);

– 600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers

Denormalizing -- transactions

lineitemdenormalized ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT, L_REGIONNAMEL_REGIONNAME);

– 600000 rows in lineitemdenormalized

– Cold Buffer

– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.

Queries on Normalized vs. Denormalized Schemas

Queries:select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY,

L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME

from LINEITEM, REGION, SUPPLIER, NATIONwhereL_SUPPKEY = S_SUPPKEYand S_NATIONKEY = N_NATIONKEYand N_REGIONKEY = R_REGIONKEYand R_NAME = 'EUROPE';

select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME

from LINEITEMDENORMALIZEDwhere L_REGIONNAME = 'EUROPE';

Denormalization

• TPC-H schema• Query: find all lineitems

whose supplier is in Europe.

• With a normalized schema this query is a 4-way join.

• If we denormalize lineitem and introduce the name of the region for each lineitem we obtain a 30% throughput improvement.

0

0.0005

0.001

0.0015

0.002

normalized denormalized

Th

rou

gh

pu

t (Q

uer

ies/

sec)

Horizontal Decompositions

• Relation decomposition: Relation is replaced by a collection of relations that are projections.

• Horizontal Decomposition : Replace relation by a collection of relations that are selections.– Each new relation has same schema as the original,

but a subset of the rows.– Collectively, new relations contain all rows of the

original. Typically, the new relations are disjoint.

Horizontal Decompositions (Contd.)

• Suppose that contracts with value > 10000 are subject to different rules. This means that Contracts with val>10000 will often be queried

• One way to deal with this is to build a clustered B+ tree index on the val field of Contracts.

• A second approach is to replace contracts by two new relations: LargeContracts and SmallContracts, with the same attributes

• Performs like index on such queries, but no index overhead. Can build clustered indexes on other attributes, in addition!

Masking Conceptual Schema Changes

• The replacement of Contracts by LargeContracts and SmallContracts can be masked by the view.

• However, queries with the condition val>10000 must be asked wrt LargeContracts for efficient execution: so users concerned with performance have to be aware of the change

CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)AS SELECT * FROM LargeContractsUNIONSELECT *FROM SmallContracts