Distributed Databases
D = D1 ⋈ D2 ⋈ . . . ⋈ Dn (join of the Di’s on their shared attributes)
- D is implicitly specified
Goal: Discover patterns in implicit D, using the explicit Di’s
Geographically distributed nodes: D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)
Limitations:
- Can’t move Di’s to a common site
  - Size / communication cost / privacy
- Can’t update local databases
- Can’t send actual data tuples
Explicit and Implicit Databases
Implicit Database
A B C D E F
1 6 1 1 2 1
1 6 1 1 2 3
1 6 2 1 1 2
1 2 2 1 1 2
2 6 1 1 2 1
2 6 1 1 2 3
Explicit Component Databases
Node 1     Node 2     Node 3
A B C      D C F      A E C
1 2 2      1 2 2      1 1 2
1 6 2      1 1 1      1 2 1
1 6 1      1 1 3      2 2 1
2 6 1      - - -      - - -
SharedSet
A C
1 1
1 2
2 1
2 2
Decomposition of Computations
- For a computation: $R = F(D)$
- Since D is implicit, decompose F into G and g’s: $R = G[g_1(D_1), g_2(D_2), \ldots, g_n(D_n)]$
- Decomposition depends on
  - F
  - Di’s
  - Set of shared attributes
Decomposition of Computations
Computational primitives
– Arithmetic primitives
  • Count of tuples in implicit D
  • Mean value of an attribute in D
  • Informational entropy for a subset of D
  • Covariance matrix for D
– Non-numeric primitives
  • Median value of an attribute in D
  • Sorting subsets of tuples in D
Decomposition of Computations
• Computational cost of decomposition
  – Communication cost
    • Number of messages exchanged
  – Number of database queries
• Who does the decomposition?
  – Algorithm itself, at run time
  – Depending on the nature of overlap in Di’s
Count All Tuples in Implicit D
$R = \#\text{tuples}(D)$ can be decomposed as:
$$R = \sum_{j=1}^{m} \prod_{i=1}^{n} \big(N(D_i)\big)_{cond_j}$$
– $cond_j$: the j-th tuple in Shareds (m tuples in all)
– n: number of participating databases (Di’s)
– $(N(D_i))_{cond_j}$: count of tuples in $D_i$ satisfying $cond_j$
– Local computation: $g_i(D_i) = (N(D_i))_{cond_j}$
– G is a sum-of-products
Shareds
A C
1 1
1 2
2 1
2 2
l attributes; k values each; $k^l$ tuples
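The sum-of-products count can be tried on the three-node example. The following is a minimal sketch, not the authors' implementation: the node tables are transcribed from the example slide, and the helper names (`matches`, `local_count`, `count_implicit`, `brute_force_join`) are hypothetical. The brute-force join is there only to check the result; the setting described above does not allow moving tuples to one site.

```python
from itertools import product
from math import prod

# Node tables as read from the example slide (one dict per tuple).
NODES = [
    [dict(A=1, B=2, C=2), dict(A=1, B=6, C=2), dict(A=1, B=6, C=1), dict(A=2, B=6, C=1)],
    [dict(D=1, C=2, F=2), dict(D=1, C=1, F=1), dict(D=1, C=1, F=3)],
    [dict(A=1, E=1, C=2), dict(A=1, E=2, C=1), dict(A=2, E=2, C=1)],
]
# Shareds: every combination of values of the shared attributes A and C.
SHARED = [dict(A=a, C=c) for a, c in product((1, 2), repeat=2)]

def matches(row, cond):
    """True if row agrees with cond on every attribute the row actually has."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def local_count(rows, cond):
    """g_i(D_i): number of local tuples satisfying cond_j."""
    return sum(matches(r, cond) for r in rows)

def count_implicit(nodes, shared):
    """G: sum over cond_j of the product of the n local counts."""
    return sum(prod(local_count(rows, cond) for rows in nodes) for cond in shared)

def brute_force_join(nodes):
    """Materialise D directly; for checking only."""
    joined = []
    for combo in product(*nodes):
        merged = {}
        if all(merged.setdefault(a, v) == v for row in combo for a, v in row.items()):
            joined.append(merged)
    return joined

print(count_implicit(NODES, SHARED), len(brute_force_join(NODES)))  # prints: 6 6
```

Each node only ever reports integers, one per tuple of Shareds, which is what keeps actual data tuples local.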
Implementing Decomposed Computations
[Figure: Stationary Agents. An agent (A) sits at each of D1, D2, ..., Dn and at Dx; the agents exchange messages.]
[Figure: Mobile Agents. An aglet travels among D1, D2, ..., Dn and Dx.]
Implementation of Count(D)
Shareds: l attributes; k values each; $k^l$ tuples
Stationary Agents
- Request / Send Summaries
- Simple SQL interface
  - 1 count / message
  - Messages exchanged: $n \cdot k^l$
- Query-code interface
  - $k^l$ counts / message
  - Messages exchanged: n
Mobile Agents:
- Number of hops: n
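The three communication costs for Count(D) are simple enough to tabulate. This is a small sketch with hypothetical function names, assuming the costs read off the slide: n·k^l messages for the simple SQL interface, n messages for the query-code interface, and n hops for a mobile agent.

```python
def messages_simple_sql(n, k, l):
    """One count per message: n nodes times k**l tuples of Shareds."""
    return n * k**l

def messages_query_code(n):
    """Each node returns all k**l counts in a single message."""
    return n

def hops_mobile_agent(n):
    """The agent visits each of the n nodes once."""
    return n

# Three nodes; shared attributes A and C (l = 2) with k = 2 values each.
n, k, l = 3, 2, 2
print(messages_simple_sql(n, k, l), messages_query_code(n), hops_mobile_agent(n))  # prints: 12 3 3
```

None of the three figures depends on how many tuples the local databases hold.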
Implementation of Count(D-test)
$$Count_{test} = \sum_{J} \prod_{t=1}^{n} \big(N(D_t)\big)_{cond_J \text{ and } test}$$
Shareds: l attributes; k values each; $k^l$ tuples
Stationary Agents
- Simple SQL interface
  - Messages exchanged: $n \cdot k^l$
- Query-code interface
  - Messages exchanged: n
Mobile Agents:
- Number of hops: n
Average Value of an attribute in D
Compute counts for each value of an attribute:
$$Avg_C = \frac{1}{N_{total}} \sum_{i=1}^{n} C_i \cdot N(C_i)$$
Stationary Agents
- Simple SQL interface
  - Messages exchanged: $(k+1) \cdot n \cdot k^l$ (1 integer/message)
- Query-code interface
  - Messages exchanged: n ($k^l \cdot (k+1)$ integers/message)
Mobile Agents:
- Number of hops: n
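The average reduces to one Count(D-test) per attribute value. Below is a hedged sketch on the three-node example; the data are transcribed from the example slide, and `matches`, `count_where`, and `average` are hypothetical helpers. Here the total is taken as the sum of the per-value counts rather than fetched separately, which is an assumption of this sketch.

```python
from itertools import product
from math import prod

# Node tables as read from the example slide (one dict per tuple).
NODES = [
    [dict(A=1, B=2, C=2), dict(A=1, B=6, C=2), dict(A=1, B=6, C=1), dict(A=2, B=6, C=1)],
    [dict(D=1, C=2, F=2), dict(D=1, C=1, F=1), dict(D=1, C=1, F=3)],
    [dict(A=1, E=1, C=2), dict(A=1, E=2, C=1), dict(A=2, E=2, C=1)],
]
SHARED = [dict(A=a, C=c) for a, c in product((1, 2), repeat=2)]

def matches(row, cond):
    """True if row agrees with cond on every attribute the row actually has."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def count_where(nodes, shared, test):
    """Count tuples of the implicit D that satisfy `test` (attribute -> value)."""
    return sum(
        prod(sum(matches(r, cond) and matches(r, test) for r in rows) for rows in nodes)
        for cond in shared
    )

def average(nodes, shared, attr, values):
    """Weighted mean of attr over D, built from per-value counts only."""
    counts = {v: count_where(nodes, shared, {attr: v}) for v in values}
    total = sum(counts.values())
    return sum(v * c for v, c in counts.items()) / total

# B takes the value 6 in five tuples of D and 2 in one, so the mean is 32/6.
print(average(NODES, SHARED, "B", (2, 6)))
```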
Exception Tuples
• Database of interest may exclude some tuples of D
• Learning site keeps a relation E of exception tuples
  – E may have explicit tuples
  – E may have rules to generate exception tuples
$$R = \sum_{j=1}^{m} \Big( \prod_{i=1}^{n} \big(N(D_i)\big)_{Cond_j} - \big(N(E)\big)_{Cond_j} \Big)$$
Explicit Databases
Node 1 (A, B, C), Node 2 (D, C, F), Node 3 (A, E, C) and SharedSet (A, C): same tables as on the Explicit and Implicit Databases slide
Exceptions
B E
2 3
Computing Informational Entropy
Consists of various counts only:
$$E = \sum_{b} \sum_{c} -\frac{N_{bc}}{N_b} \log_2\!\left(\frac{N_{bc}}{N_b}\right)$$
Stationary agent/Simple SQL interface:
- Messages exchanged: $n \cdot k^l \cdot k^2$ (1 integer/message)
Stationary agent/Query-code interface:
- Messages exchanged: n ($k^l \cdot k^2$ integers/message)
Mobile agent:
- Number of hops: n
[Number of messages/hops is independent of the size of D]
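Since the entropy is built from counts alone, it can be assembled from Count(D-test) calls. The sketch below is an illustration under assumptions: it takes N_b as the number of tuples of D with attribute value b and N_bc as the number with value b and class c, and sums the within-b entropies over b. The example data come from the example slide; `count_where` and `entropy` are hypothetical names. An ID3-style weighting of each b by N_b/N could be added from the same counts.

```python
from itertools import product
from math import log2, prod

# Node tables as read from the example slide (one dict per tuple).
NODES = [
    [dict(A=1, B=2, C=2), dict(A=1, B=6, C=2), dict(A=1, B=6, C=1), dict(A=2, B=6, C=1)],
    [dict(D=1, C=2, F=2), dict(D=1, C=1, F=1), dict(D=1, C=1, F=3)],
    [dict(A=1, E=1, C=2), dict(A=1, E=2, C=1), dict(A=2, E=2, C=1)],
]
SHARED = [dict(A=a, C=c) for a, c in product((1, 2), repeat=2)]

def matches(row, cond):
    """True if row agrees with cond on every attribute the row actually has."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def count_where(nodes, shared, test):
    """Count tuples of the implicit D that satisfy `test` (attribute -> value)."""
    return sum(
        prod(sum(matches(r, cond) and matches(r, test) for r in rows) for rows in nodes)
        for cond in shared
    )

def entropy(nodes, shared, attr, attr_values, cls, cls_values):
    """Sum over b and c of -(N_bc / N_b) * log2(N_bc / N_b), from counts only."""
    e = 0.0
    for b in attr_values:
        n_b = count_where(nodes, shared, {attr: b})
        for c in cls_values:
            n_bc = count_where(nodes, shared, {attr: b, cls: c})
            if n_bc:  # a zero count contributes nothing (0 * log 0 taken as 0)
                e -= (n_bc / n_b) * log2(n_bc / n_b)
    return e

# Attribute B lives on node 1 and class F on node 2; neither is shared.
print(entropy(NODES, SHARED, "B", (2, 6), "F", (1, 2, 3)))
```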
Decomposition of Algorithms
• Arithmetic primitives are 1-step decompositions
  – Counts, averages, entropy
• Algorithms involve
  – Arithmetic primitives
  – Non-numeric primitives
  – Control structure
• Decomposition studied for
  – Decision tree induction algorithm
  – Mining of association rules
• Control structure is unaltered
• Primitive computations are decomposed
[Figure: nodes D1, D2, ..., Dn and Dx]
• Learner Node
• Control structure
• Decomposition
• Composition
Building a Decision Tree
To induce a decision tree having:
- d levels; m attributes in n databases; l shared attributes
- k values/attribute
Stationary agent/Simple SQL interface:
- Messages exchanged: $[md - d^2/2] \cdot [n \cdot k^l \cdot k^2]$ (1 integer/message)
Stationary agent/Query-code interface:
- Messages exchanged: $[md - d^2/2] \cdot [n]$ ($k^l \cdot k^2$ integers/message)
Mobile agent:
- Hops: $[md - d^2/2] \cdot [n]$
[Number of messages/hops is independent of the size of D]
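The tree-building cost is the per-entropy cost multiplied by the number of entropy evaluations. The sketch below assumes the first bracketed factor reads md − d²/2 (candidate attributes shrinking by one per level over d levels) and uses hypothetical function names; it is a cost model only, not a tree learner.

```python
def entropy_evaluations(m, d):
    """Number of attribute evaluations for d levels and m attributes (assumed md - d^2/2)."""
    return m * d - d**2 / 2

def tree_messages_simple_sql(m, d, n, k, l):
    """One integer per message: n * k**l * k**2 messages per entropy evaluation."""
    return entropy_evaluations(m, d) * n * k**l * k**2

def tree_messages_query_code(m, d, n):
    """One message per node per evaluation; the same figure is the hop count for a mobile agent."""
    return entropy_evaluations(m, d) * n

# Example: m = 6 attributes, d = 2 levels, n = 3 nodes, k = 2 values, l = 2 shared attributes.
print(tree_messages_simple_sql(6, 2, 3, 2, 2), tree_messages_query_code(6, 2, 3))  # prints: 480.0 30.0
```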
Mining Association Rules
Main operations:
- Enumerate item-sets
- Compute support/confidence
- Basic computation: Count-of-tuples
Communication Complexity:
- m (avg.) item sets at each level of enumeration tree
- j levels of enumeration tree
- Query-code can count for all item sets at a level simultaneously
- Therefore, we need:
Number of Counts Needed: $2 \cdot m \cdot j$
$j \cdot n$ messages, or $j \cdot n$ hops
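Support and confidence both reduce to the count-of-tuples primitive. The following sketch assumes an item is an attribute=value pair and reuses the three-node example from the example slide; `count_where`, `support`, and `confidence` are hypothetical helpers, not part of any library.

```python
from itertools import product
from math import prod

# Node tables as read from the example slide (one dict per tuple).
NODES = [
    [dict(A=1, B=2, C=2), dict(A=1, B=6, C=2), dict(A=1, B=6, C=1), dict(A=2, B=6, C=1)],
    [dict(D=1, C=2, F=2), dict(D=1, C=1, F=1), dict(D=1, C=1, F=3)],
    [dict(A=1, E=1, C=2), dict(A=1, E=2, C=1), dict(A=2, E=2, C=1)],
]
SHARED = [dict(A=a, C=c) for a, c in product((1, 2), repeat=2)]

def matches(row, cond):
    """True if row agrees with cond on every attribute the row actually has."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def count_where(nodes, shared, test):
    """Count tuples of the implicit D that satisfy `test` (attribute -> value)."""
    return sum(
        prod(sum(matches(r, cond) and matches(r, test) for r in rows) for rows in nodes)
        for cond in shared
    )

def support(nodes, shared, itemset):
    """Fraction of tuples of D containing every item of the item set."""
    return count_where(nodes, shared, itemset) / count_where(nodes, shared, {})

def confidence(nodes, shared, antecedent, consequent):
    """Count(antecedent and consequent) / Count(antecedent)."""
    both = count_where(nodes, shared, {**antecedent, **consequent})
    return both / count_where(nodes, shared, antecedent)

# Rule B=6 -> F=1: two of the six tuples hold both items, five hold B=6.
print(support(NODES, SHARED, {"B": 6, "F": 1}), confidence(NODES, SHARED, {"B": 6}, {"F": 1}))
```

The two counts per item set (one for the set, one for the antecedent) match the factor of 2 in the number of counts needed.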
More Complex Computations
• Covariance matrix for D
  – Useful for eigenvectors/principal components
  – Needs second order moments
• Graph/Network algorithms
  – Each node has part of a graph
  – Some nodes are shared
    • Determine MST
    • Paths of Min/Max flow
    • Flow patterns
Sum of Products
• Sum of products for two attributes: $\sum_{t \in D} x_t y_t$
• There are six different ways in which x and y may be distributed
• Each requires a different decomposition
  – Case 1: x same as y; and x belongs to the SharedSet.
    $\sum_j x_j^2 \cdot Count(x_j \text{ in } D)$
  – Case 2: x same as y; and x does not belong to the SharedSet.
    $\sum_k Avg(x^2 \text{ for } cond_k) \cdot Count(cond_k)$
  – Case 3: x and y both belong to the SharedSet.
    $\sum_k x_k \cdot y_k \cdot Count(cond_k \text{ in } Shared)$
Sum of Products
  – Case 4: x belongs to SharedSet and y does not.
    $\sum_j x_j \cdot \bar{y} \cdot Count(x = x_j)$, with $\bar{y}$ the average of y over the tuples with $x = x_j$
  – Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
    • For each tuple t in SharedSet, obtain $x(t)$, $y(t)$
    • and then $\sum_t Sum(x(t)) \cdot Sum(y(t))$
  – Case 6: x, y don’t belong to the SharedSet and reside on the same node.
    $\sum_t Prod(t) \cdot Count(t)$, where Prod(t) is the average of the product of x and y for cond-t of SharedSet
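Case 5 can be checked on the three-node example with x = B (node 1) and y = F (node 2). This is a sketch under one stated assumption that goes beyond the slide: when more than two nodes participate, every node holding neither x nor y contributes its local count for the same shared tuple as an extra factor. Helper names are hypothetical, and node tables are assumed non-empty.

```python
from itertools import product

# Node tables as read from the example slide (one dict per tuple).
NODES = [
    [dict(A=1, B=2, C=2), dict(A=1, B=6, C=2), dict(A=1, B=6, C=1), dict(A=2, B=6, C=1)],
    [dict(D=1, C=2, F=2), dict(D=1, C=1, F=1), dict(D=1, C=1, F=3)],
    [dict(A=1, E=1, C=2), dict(A=1, E=2, C=1), dict(A=2, E=2, C=1)],
]
SHARED = [dict(A=a, C=c) for a, c in product((1, 2), repeat=2)]

def matches(row, cond):
    """True if row agrees with cond on every attribute the row actually has."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def local_sum(rows, cond, attr):
    """Sum of attr over the local tuples satisfying cond."""
    return sum(r[attr] for r in rows if matches(r, cond))

def local_count(rows, cond):
    """Number of local tuples satisfying cond."""
    return sum(matches(r, cond) for r in rows)

def sum_of_products_case5(nodes, shared, x, y):
    """Sum of x*y over the implicit D when x and y sit on different nodes."""
    total = 0
    for cond in shared:
        term = 1
        for rows in nodes:
            if x in rows[0]:
                term *= local_sum(rows, cond, x)
            elif y in rows[0]:
                term *= local_sum(rows, cond, y)
            else:
                term *= local_count(rows, cond)  # assumption: other nodes add a count factor
        total += term
    return total

print(sum_of_products_case5(NODES, SHARED, "B", "F"))  # prints: 64
```

The six tuples of the implicit database give B·F products 6, 18, 12, 4, 6, and 18, which also sum to 64.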
Self-decomposing Algorithms
• Easy decomposability of arithmetic primitives
  – Average/Covariance matrix/Entropy
• Control structure of algorithms is not altered
  – More gains possible, by altering control structure
• Decomposition is driven by the set of shared attributes
• Algorithm can determine shared attributes in n messages/hops
• Algorithms decompose in accordance with attribute sharing
  – No human intervention needed
• Message complexity is independent of sizes of databases