DistributedDatabaseSystemsWeek3

ICS 611 Spring Semester, 2008 L Gottschalk

Distributed Data Base Systems

5.3 Fragmentation, page 112

5.3.1 Horizontal Fragmentation, page 112

Primary horizontal fragmentation is performed on a relation using predicates defined on that relation.

Example: student table separated by home campus:
Select * from student group by HomeCampus;

(assume HomeCampus is an attribute of the relation student). Strictly speaking, each fragment is a selection such as "Select * from student where HomeCampus = value"; the group by is shorthand for one fragment per campus.

Derived horizontal fragmentation is based on predicates defined on another relation.

Example: faculty table fragmented by location of fax machine:
Select * from faculty, department where facultyDept = DeptNum group by department.fax;

Information Requirements of Horizontal Fragmentation, page 113

Database Information, page 113

Need to know the Global Conceptual Schema, which knows all the relations and all the links.

The end of a link that has the foreign key is called the member. The end with the primary key being pointed to is the owner.

Given Link L1 between timecards and employees (assuming each timecard row carries a foreign key to its employee):
Owner(L1) = employee
Member(L1) = timecard

Cardinality of relation R is denoted as card(R).

Application Information, page 114

Again, qualitative information is used for fragmentation planning. Quantitative information is used in allocation models.

For qualitative, you can analyze the 20% of user queries that account for 80% of the activity. (This is because these 20% are used over and over.)

Qualitative:
Now, to determine simple predicates:
Given relation R(A1, A2, ..., An), where Ai is an attribute defined over domain Di, a simple predicate pj defined on R has the form
pj : Ai Theta Value
where Theta is in {=, <, NOT=, <=, >, >=} and Value is chosen from the domain of Ai.


Pri denotes the set of all simple predicates defined on relation Ri. The members of Pri are denoted by pij.

A minterm predicate is a conjunction of simple predicates, in which each simple predicate appears either as is or negated. So fragmentation can occur on minterm predicates, as well as on simple predicates.

The math is extremely hard when Theta is other than =, so we will use that only. (p. 116)

Quantitative:
We need two sets of data:

1. Minterm selectivity: the number of tuples of the relation that would be accessed by a user query with a minterm predicate.

Example: m1 is Title = Elect Eng AND SAL < 30000, which produces 0 results. m2 is Title = Elect Eng AND SAL > 30000, which produces many results.
The selectivity of a minterm mi is sel(mi).

2. Access frequency: the frequency with which user applications access data.
For a set of queries Q = {q1, q2, ..., qq}, Acc(qi) indicates the access frequency of qi in a given period.

From this, we can determine the access frequencies of mi: Acc(mi).
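The two quantities above can be sketched in a few lines of Python. This is only an illustration: the PAY rows, the query names q1 and q2, their frequencies, and the rule that Acc(mi) is the sum of Acc(qi) over the queries that use mi are all assumptions made for the example, and m2 is assumed to test SAL > 30000.

```python
# Hypothetical PAY rows (title, salary); not taken from the notes.
PAY = [
    ("Elect Eng", 40000),
    ("Syst Anal", 34000),
    ("Mech Eng", 27000),
    ("Programmer", 24000),
]

def sel(minterm, relation):
    """Minterm selectivity as defined in the notes: number of tuples matched."""
    return sum(1 for row in relation if minterm(row))

def m1(row):
    title, sal = row
    return title == "Elect Eng" and sal < 30000

def m2(row):
    title, sal = row
    return title == "Elect Eng" and sal > 30000

# Assumed access frequencies per period for two hypothetical queries,
# each of which uses exactly one of the minterms above.
ACC_QUERY = {"q1": 10, "q2": 25}
USES = {"q1": "m1", "q2": "m2"}

def acc(minterm_name):
    """Access frequency of a minterm: sum over the queries that use it."""
    return sum(f for q, f in ACC_QUERY.items() if USES[q] == minterm_name)

print(sel(m1, PAY), sel(m2, PAY))
print(acc("m1"), acc("m2"))
```

With these sample rows no electrical engineer earns under 30000, so sel(m1) is zero, which mirrors the example in the notes.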

Primary Horizontal Fragmentation, page 117

For Relation R, each horizontal fragment is Ri, where

Ri = Result(Fi(R))

and Fi is the selection formula used to obtain fragment Ri.

Fi must be a minterm predicate for us to be able to carry out calculations.

Example:
PROJ table is decomposed into two horizontal fragments PROJ1 and PROJ2:
PROJ1 = Result(Budget <= 200000 (PROJ))
PROJ2 = Result(Budget > 200000 (PROJ))

Another example:

PROJ1 = Result(LOC = Montreal (PROJ))
PROJ2 = Result(LOC = New York (PROJ)) and
PROJ3 = Result(LOC = Paris (PROJ))

So, the definition of Horizontal Fragment:
a Horizontal Fragment Ri of relation R consists of all tuples of R that satisfy minterm predicate mi.

Therefore, there are as many horizontal fragments of R as there are minterm predicates.

Completeness rule: A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any minterm fragment, p. 119.

Example: if you fragment PROJ table by location, and the only application that accesses PROJ accesses by location (Select * from PROJ where location = Montreal), then each tuple of each fragment has the same probability of being selected.

All the tuples in fragment PROJ1=Montreal have a probability of 1 of being selected.
All the tuples in fragments PROJ2=New York and PROJ3=Paris have 0 probability.

But if another application selects by project size (e.g., select * from PROJ where budget


Iteratively add predicates to this set, ensuring minimality at each iteration.
Stop when you cannot find any reason to break any of the predicates further.

At the end of step 1, the set of predicates is both minimal and complete.

Step 2: derive the set of minterm predicates that can be defined on the set of predicates from step 1. These minterm predicates determine the fragments that are used as candidates in the allocation step.
Determining the set of predicates (step 1) was easy. But the number of minterm predicates is exponential in the number of predicates. So we need to eliminate some minterm predicates from consideration:

Step 3: identify those minterms that may be contradictory to the set of implications I. For example:
if predicate 1 (p1) is that attributeA = value1;
and predicate 2 (p2) is that attributeA = value2;
and the implications I state that attributeA must be either value1 or value2 and cannot be both,

then the set of minterm predicates is:
attributeA = value1 AND attributeA = value2;
attributeA NOT= value1 AND attributeA = value2;
attributeA = value1 AND attributeA NOT= value2;
attributeA NOT= value1 AND attributeA NOT= value2;

In this example, the first and last minterm predicates cannot happen, so they can be eliminated.
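Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not a general procedure: it models the implications I simply as the list of values attributeA is allowed to take, and the names DOMAIN, PREDICATES, minterms and satisfiable are made up for the example.

```python
from itertools import product

# Domain of attributeA under the implications I: exactly one of two values.
DOMAIN = ["value1", "value2"]

# Simple predicates p1 and p2 from the example, as functions of attributeA.
PREDICATES = {
    "p1": lambda a: a == "value1",
    "p2": lambda a: a == "value2",
}

def minterms(names):
    """Yield every sign assignment: True = predicate as is, False = negated."""
    for signs in product([True, False], repeat=len(names)):
        yield dict(zip(names, signs))

def satisfiable(minterm):
    """A minterm survives if some value allowed by I makes it true."""
    return any(
        all(PREDICATES[n](a) == sign for n, sign in minterm.items())
        for a in DOMAIN
    )

all_minterms = list(minterms(list(PREDICATES)))
kept = [m for m in all_minterms if satisfiable(m)]

for m in all_minterms:
    label = " AND ".join(n if s else "NOT " + n for n, s in m.items())
    print(label, "->", "keep" if satisfiable(m) else "eliminate")
```

Two predicates give four sign assignments, and only the two in which exactly one predicate holds survive, which is the same result as the elimination above.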

Extended example 5.11, page 122:

Given this schema:
relation Pay: title(pk), salary
relation Emp: eno(pk), ename, title
relation Proj: jno(pk), jname, budget, loc
relation Asg: eno(pk), jno(pk), resp, dur

A weak entity is needed to avoid directly connecting Proj and Emp in a many-to-many relationship.

You can see that Asg [assignment] is a weak entity that joins the two strong entities Emp [employee] and Proj [project].

Weak entities always have a compound key made up of the primary keys of the two tables that they connect. That's why Asg has a compound key of eno AND jno, the primary keys of Emp and Proj.

Also, Pay is simply a way to find salary if you know title. So it is attached to Emp via the title foreign key in the Emp table.

Suppose there is only one application that accesses Pay. It checks salary at less than or equal to $30,000. It gives one rate of raise to those under, another rate of raise to those above. (the rich get richer!)

The predicates:
p1: SAL <= 30000
p2: SAL > 30000

This is complete and minimal.

Therefore the minterm predicates are
m1: Sal <= 30000
m2: NOT (Sal <= 30000), that is, Sal > 30000

So we end up with two fragments.

- - - - - -

Now consider two applications that use the Proj table.

The first accesses project name and budget according to site:
Select PName, Budget from Proj where Loc = value;

So the predicates are:
p1: Loc = Montreal
p2: Loc = New York
p3: Loc = Paris

and the second accesses projects based on size, so
p4: Budget <= 200000
p5: Budget > 200000

So the set of predicates has five members and obviously is complete and minimal.

The set of minterms are these six:
Montreal and Budget <= 200000
Montreal and Budget > 200000
and so on (2 for New York and 2 for Paris).

There are many possible minterms such as
p1 ^ p2 ^ p3 ^ p4 ^ p5 (example: Montreal and New York and Paris and under 200000 and over 200000)
but this minterm is impossible, as are most others, due to the obvious set of implications I that you could write easily:
Montreal implies not-New York and not-Paris
New York implies not-Montreal and not-Paris
Under 200000 implies not over 200000.
Over 200000 implies not under 200000.

Don't let current values in the database influence the set of implications I. The values may change.
So don't let Montreal imply budget is under 200000. That may not be true next week.

So, we defined 5 predicates and 6 minterms.
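To make the six minterms concrete, here is a small sketch that places tuples into the fragment of the one minterm each satisfies. The PROJ rows are hypothetical sample data, and the names LOCATIONS, BUDGET_BANDS and fragment are assumptions made for the illustration.

```python
from itertools import product

# Hypothetical PROJ rows: (pno, pname, budget, loc). Sample data only.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

LOCATIONS = ["Montreal", "New York", "Paris"]   # p1, p2, p3
BUDGET_BANDS = {                                 # p4, p5
    "<=200000": lambda b: b <= 200000,
    ">200000": lambda b: b > 200000,
}

# Implications I leave one location and one budget band per tuple,
# so the surviving minterms are the 3 x 2 combinations.
MINTERMS = list(product(LOCATIONS, BUDGET_BANDS))

def fragment(rows):
    """Place each tuple in the fragment of the single minterm it satisfies."""
    frags = {m: [] for m in MINTERMS}
    for row in rows:
        _, _, budget, loc = row
        for loc_value, band in MINTERMS:
            if loc == loc_value and BUDGET_BANDS[band](budget):
                frags[(loc_value, band)].append(row)
    return frags

fragments = fragment(PROJ)
for minterm, rows in fragments.items():
    print(minterm, [r[0] for r in rows])
```

Some fragments come out empty for this sample data. They are still kept as fragments, in line with the advice not to let current values drive the design.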

Derived Horizontal Fragmentation, page 125

To keep this from becoming impossible, we only consider relations joined on equi-joins (and semi-joins).

The attribute used to determine fragment membership is in the owner of the link. But the resulting fragment is defined ONLY on the attributes of the member relation. (Remember: a link has an owner end and a member end.)

Example 5.12, page 125:
We'll use the Pay and the Emp tables again.

The Emp table can be fragmented on pay range (those over 30000 and those under 30000) by using the joined table Pay.
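The idea can be sketched as a semi-join: the owner (Pay) is fragmented on salary, and each Emp fragment keeps the employees whose title appears in the matching Pay fragment. The rows and the helper names select and semijoin_emp_pay are hypothetical, and the 30000 boundary is assumed to fall in the lower fragment.

```python
# Hypothetical rows; the column layout follows the schema in these notes.
# PAY columns: (title, salary). EMP columns: (eno, ename, title).
PAY = [("Elect Eng", 40000), ("Syst Anal", 34000),
       ("Mech Eng", 27000), ("Programmer", 24000)]
EMP = [("E1", "J. Doe", "Elect Eng"), ("E2", "M. Smith", "Syst Anal"),
       ("E3", "A. Lee", "Mech Eng"), ("E4", "J. Miller", "Programmer")]

def select(rows, predicate):
    """Return the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def semijoin_emp_pay(emp_rows, pay_rows):
    """Emp semijoin Pay on title: keep Emp tuples, and only Emp attributes."""
    titles = {title for title, _ in pay_rows}
    return [e for e in emp_rows if e[2] in titles]

# Primary fragmentation of the owner (Pay) on salary.
PAY1 = select(PAY, lambda p: p[1] <= 30000)
PAY2 = select(PAY, lambda p: p[1] > 30000)

# Derived fragmentation of the member (Emp) follows the owner's fragments.
EMP1 = semijoin_emp_pay(EMP, PAY1)
EMP2 = semijoin_emp_pay(EMP, PAY2)

print([e[0] for e in EMP1])
print([e[0] for e in EMP2])
```

Note that the Emp fragments carry no salary column: membership is decided by the owner's attribute, but the fragment is defined only on the member's attributes.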

There is an additional complication with derived fragmentation:
Note that the Asg table (assignments) has two links.
By which one do you fragment: by Proj data, or by Emp data?

Answer:
1. The fragmentation with better join characteristics.
2. The fragmentation used in more applications.

It's obvious how to deal with the second answer.

Here's an example (5.13, page 127) on how to deal with the first answer:

An application accesses information about engineers who work on local projects more than it accesses employees on projects at other sites.

Another application accesses information by emp number.

So this suggests fragmenting Asg according to Proj and according to Emp.

Therefore, one is fragmenting along a chain; in this case, the Pay-Emp-Asg chain.

Typically there is more than one choice for derived fragmentation. The final answer may be delayed until the allocation phase.

5.3.2 Vertical Fragmentation, page 131

Goal: a fragmentation that minimizes the running time of applications that run on these fragments.

On stand-alone mainframes: allows applications to deal with smaller tables, and to put the most active partition on faster disks.

Horizontal partitioning: if there are n simple predicates, then there are 2 to the n possible minterm predicates, some of which are invalidated by I.

Vertical: if the total number of non-key attributes is m, the number of possible fragments is the Bell number B(m), which approaches m to the mth for large m.

Therefore, only heuristics can be applied. Two types of heuristics:
1. Grouping (keep adding more and more attributes to make relations)
2. Splitting

We do (2) since
- it is more like what we do with horizontal, and
- because the solution is closer to the full relation than to each attribute being its own relation, and
- it results in non-overlapped relations.
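To get a feel for why only heuristics are practical, the growth figures quoted above can be tabulated. The sketch below assumes the vertical count is the number of ways to partition m attributes into groups (the Bell number), computed with the Bell triangle; the function name bell_numbers is made up for the example.

```python
def bell_numbers(m_max):
    """Return [B(0), ..., B(m_max)] using the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(m_max):
        new_row = [row[-1]]            # each row starts with the last entry above
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

B = bell_numbers(10)
for m in (4, 10):
    # columns: m, 2 to the m, B(m), m to the mth
    print(m, 2 ** m, B[m], m ** m)
```

Even for ten attributes the number of candidate partitionings is already in the hundreds of thousands, far beyond the 2 to the n bound for minterms.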

Information requirements of Vertical Fragmentation, page 132

The goal of vertical partitioning is to put in one place (fragment) the attributes that are usually accessed together.

The measure of togetherness is the affinity of attributes, which measures how closely related they are.

But the designers or users are unlikely to be able to specify such values directly. So we start with more primitive data.

Step 1:

Let Q = {q1, q2, ..., qq} be the set of user queries (applications) that will run on relation R(A1, A2, ..., An).

For each query qi and each attribute Aj, we have an attribute usage value, denoted as use(qi, Aj).

The value of use(qi, Aj) is 1 if attribute Aj is referenced by qi, otherwise it is 0.

Using the 80/20 rule (check only the main applications), these values should be easy to determine.

(Work in class example 5.15, page 133.)
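As a rough sketch of Step 1, the usage values can be derived from the set of attributes each query touches. The four queries and their attribute sets below are assumptions chosen for illustration; they are not quoted from the class example.

```python
ATTRIBUTES = ["A1", "A2", "A3", "A4"]

# Assumed example: which attributes each main query references.
QUERY_ATTRS = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}

def use(query, attribute):
    """use(qi, Aj) = 1 if Aj is referenced by qi, otherwise 0."""
    return 1 if attribute in QUERY_ATTRS[query] else 0

# One row per query, one column per attribute.
usage_matrix = {
    q: [use(q, a) for a in ATTRIBUTES] for q in QUERY_ATTRS
}

for q, row in usage_matrix.items():
    print(q, row)
```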

Step 2:

Attribute usage values don't represent the weight of application frequencies.

So, we also need an attribute affinity matrix (AA) where each cell is the value of the bond between attribute i and attribute j (with respect to a set of applications Q = {q1, q2, ..., qq}, using the 80/20 rule).

Each cell of the attribute affinity matrix is called the attribute affinity measure aff(Ai, Aj), and is the measure of the bond between Ai and Aj.

We will work this in class.
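One common way to compute the affinity measure is to add up the access frequencies of every query that uses both attributes. The sketch below assumes that definition. The query attribute sets and total access frequencies are assumptions, chosen so that the result reproduces the matrix of figure 5.16 quoted in these notes.

```python
ATTRIBUTES = ["A1", "A2", "A3", "A4"]

# Assumed inputs: attributes used by each query, and each query's total
# access frequency summed over all sites.
QUERY_ATTRS = {"q1": {"A1", "A3"}, "q2": {"A2", "A3"},
               "q3": {"A2", "A4"}, "q4": {"A3", "A4"}}
TOTAL_ACC = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}

def aff(ai, aj):
    """Sum the access frequencies of the queries that use both ai and aj."""
    return sum(
        TOTAL_ACC[q] for q, attrs in QUERY_ATTRS.items()
        if ai in attrs and aj in attrs
    )

AA = [[aff(ai, aj) for aj in ATTRIBUTES] for ai in ATTRIBUTES]

for name, row in zip(ATTRIBUTES, AA):
    print(name, row)
```

The matrix is symmetric, and each diagonal cell is simply the total access frequency of all queries that touch that attribute.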

Clustering Algorithm, page 135

The attribute affinity matrix will be used in the rest of this chapter.

We will cluster together attributes with high affinity for each other.

Then we will split the relation accordingly.

We will permute the rows and columns of the attribute affinity matrix. The result is the CLUSTERED affinity matrix (CA).

The goal is to transform from this: (figure 5.16, page 135)

      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78

to this: (figure 5.17(d), page 140)

      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78

In figure 5.17(d) we see the creation of two clusters: one is in the upper left corner and contains the smaller affinity values and the other is in the lower right corner and contains the larger affinity values. This clustering indicates how the attributes of relation PROJ should be split. However, in general the border for this split is not this clear-cut. When the CA matrix is big, usually more than two clusters are formed and there is more than one candidate partitioning. Thus there is a need to approach this problem more systematically. page 141
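The permutation itself is easy to sketch. The code below applies the attribute order A1, A3, A2, A4 to the affinity matrix and compares a simple adjacency score before and after. The score (products of neighbouring columns, summed over rows) is a simplified stand-in chosen for illustration, not necessarily the measure used in the textbook's clustering algorithm, and the order is taken as given rather than searched for.

```python
ATTRIBUTES = ["A1", "A2", "A3", "A4"]
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]

def permute(matrix, names, new_order):
    """Reorder rows and columns of a square matrix by attribute name."""
    idx = [names.index(n) for n in new_order]
    return [[matrix[i][j] for j in idx] for i in idx]

def neighbour_bond(matrix):
    """Sum, over all rows, of the products of horizontally adjacent cells."""
    n = len(matrix)
    return sum(
        matrix[r][c] * matrix[r][c + 1]
        for r in range(n) for c in range(n - 1)
    )

ORDER = ["A1", "A3", "A2", "A4"]
CA = permute(AA, ATTRIBUTES, ORDER)

for name, row in zip(ORDER, CA):
    print(name, row)
print(neighbour_bond(AA), neighbour_bond(CA))
```

The clustered order scores higher because large values end up next to other large values, which is the visual effect described for figure 5.17(d).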

Partitioning Algorithm, page 141

Consider the clustered attribute matrix of figure 5.18 below:

        A1  A2  ..  Ai   Ai+1  ..  An
A1   |
A2   |
..   |
Ai   |_______________x_________________
Ai+1 |
.    |
.    |
An   |

Note that the matrix is separated into an upper left (UL) quadrant and a lower right (LR) quadrant by placing an x along the diagonal. Hopefully, this will separate the matrix at the border between the two clusterings such that one set of applications (Top Queries or TQ) accesses only the UL attributes, another set of applications (Bottom Queries or BQ) accesses only the LR attributes, and there is a (minimal) set of other applications (Other Queries or OQ) that access both UL and LR attributes.

The optimization problem is that there are n-1 possible placements of x along the diagonal, when there are n attributes. To find the placement of x, we need to compute the cost of each set of applications, as the sum of the number of accesses made by the applications in that group.
These totals are:
CTQ (cost of TQ)
CBQ (cost of BQ)
COQ (cost of OQ).

Then we want to maximize the equation:

z = CTQ * CBQ - (COQ * COQ)

This equation will not be maximized unless CTQ and CBQ are nearly equal.

This has the benefit of balancing the load on the two systems that will hold the two fragments.

The computation of this equation grows ONLY linearly as the number of attributes increases.
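The search over the n-1 placements of x can be sketched directly. The query attribute sets and total access frequencies below are assumptions (the same illustrative values that reproduce the affinity matrix quoted in these notes), and z is taken to be CTQ * CBQ minus COQ squared.

```python
ORDER = ["A1", "A3", "A2", "A4"]          # column order of the CA matrix
QUERY_ATTRS = {"q1": {"A1", "A3"}, "q2": {"A2", "A3"},
               "q3": {"A2", "A4"}, "q4": {"A3", "A4"}}   # assumed usage
TOTAL_ACC = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}       # assumed totals

def z_for_split(split):
    """Score placing x after the first `split` attributes of ORDER."""
    top, bottom = set(ORDER[:split]), set(ORDER[split:])
    ctq = cbq = coq = 0
    for q, attrs in QUERY_ATTRS.items():
        if attrs <= top:            # query touches only UL attributes
            ctq += TOTAL_ACC[q]
        elif attrs <= bottom:       # query touches only LR attributes
            cbq += TOTAL_ACC[q]
        else:                       # query straddles both sides
            coq += TOTAL_ACC[q]
    return ctq * cbq - coq * coq

scores = {split: z_for_split(split) for split in range(1, len(ORDER))}
best_split = max(scores, key=scores.get)

print(scores)
print(ORDER[:best_split], ORDER[best_split:])
```

Under these assumptions the middle placement wins, separating {A1, A3} from {A2, A4}. The other two placements leave one side with no queries of its own, so their z values are negative.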

There are two complications:

1) The procedure above splits into two partitions. For larger sets of attributes, there may be more than two clusters. To compute z for more than two clusters is very computationally intense. Each possible split of 1, 2, ..., m-1 fragments must be calculated.

2) A better clustering may be found if we repeat the calculation of z n times (for n attributes).
Before each calculation of z, we first shift the top row of the clustered matrix to the bottom, and the left column to the right edge. This of course increases the intensity of the calculation effort!

When all this effort described in this section has been done to the CA matrix described earlier, we find that the optimal partition is:

PROJ1 = {PNO, BUDGET}
PROJ2 = {PNO, PNAME, LOC}

This can be shown to be correct:
It is complete, allows reconstruction, and is disjoint (apart from the key PNO, which is repeated in both fragments so that they can be rejoined).
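The three correctness properties can be checked mechanically on a small sample. The rows below are hypothetical, and the helpers project and join_on_key are made-up names; disjointness is checked on the non-key attributes only, since the key PNO is repeated in both fragments.

```python
# Hypothetical PROJ rows keyed by attribute name; sample data only.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
]
KEY = "PNO"
FRAG_ATTRS = [["PNO", "BUDGET"], ["PNO", "PNAME", "LOC"]]

def project(rows, attrs):
    """Vertical fragment: keep only the listed attributes of each row."""
    return [{a: r[a] for a in attrs} for r in rows]

def join_on_key(left, right):
    """Natural join of two fragments on the key attribute."""
    by_key = {r[KEY]: r for r in right}
    return [{**l, **by_key[l[KEY]]} for l in left if l[KEY] in by_key]

PROJ1, PROJ2 = (project(PROJ, attrs) for attrs in FRAG_ATTRS)

all_attrs = set(PROJ[0])
# Completeness: every attribute appears in some fragment.
complete = set(FRAG_ATTRS[0]) | set(FRAG_ATTRS[1]) == all_attrs
# Disjointness: the fragments share nothing but the key.
disjoint = set(FRAG_ATTRS[0]) & set(FRAG_ATTRS[1]) == {KEY}
# Reconstruction: joining the fragments on the key gives back PROJ.
reconstructed = join_on_key(PROJ1, PROJ2) == PROJ

print(complete, disjoint, reconstructed)
```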

5.3.3 Hybrid Fragmentation, page 146

Tables can be fragmented horizontally, and then vertically. The PROJ table in the examples above is a good example of that.

This nesting of different types of fragmentation is called:
- hybrid, or
- mixed, or
- nested.
