Mining Distributed Databases Raj Bhatnagar University of Cincinnati

Mining Distributed Databases

Raj Bhatnagar

University of Cincinnati

Distributed Databases

$D = D_1 \times D_2 \times \dots \times D_n$

- D is implicitly specified

Goal: Discover patterns in implicit D, using the explicit Di’s

D1(A, B, C)   D2(C, D, E)   ...   Dn(A, E, G)

Limitations:
- Can’t move Di’s to a common site
  - Size / communication cost / privacy
- Can’t update local databases
- Can’t send actual data tuples

Geographically distributed nodes

Explicit and Implicit Databases

Implicit Database

A  B  C  D  E  F
1  6  1  1  2  1
1  6  1  1  2  3
1  6  2  1  1  2
1  2  2  1  1  2
2  6  1  1  2  1
2  6  1  1  2  3

Explicit Component Databases

Node 1       Node 2       Node 3
A  B  C      D  C  F      A  E  C
1  2  2      1  2  2      1  1  2
1  6  2      1  1  1      1  2  1
1  6  1      1  1  3      2  2  1
2  6  1      -  -  -      -  -  -

SharedSet

A  C
1  1
1  2
2  1
2  2
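To make the relationship between the explicit and implicit databases concrete, the following minimal sketch joins the three example nodes on their common attributes. It only illustrates what the implicit D means: the setting on these slides assumes the Di's cannot actually be brought to one site. The table contents are read from the example above; the helper names `rows` and `natural_join` are illustrative assumptions.

```python
from functools import reduce

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

# Explicit component databases, as read from the example tables.
D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))  # Node 1
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))             # Node 2
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))             # Node 3

def natural_join(left, right):
    """Join two relations on every attribute name they have in common."""
    joined = []
    for lt in left:
        for rt in right:
            if all(lt[a] == rt[a] for a in lt.keys() & rt.keys()):
                joined.append({**lt, **rt})
    return joined

# Materialize the implicit database (for illustration only).
implicit = reduce(natural_join, [D1, D2, D3])
for t in implicit:
    print([t[a] for a in "ABCDEF"])
print(len(implicit))  # 6
```

The join yields the six tuples of the implicit database shown above.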

Decomposition of Computations

- For a computation:

$R = F(D)$

- Since D is implicit,

$R = G[g_1(D_1), g_2(D_2), \dots, g_n(D_n)]$

- Decompose F into G and g’s

- Decomposition depends on
  - F
  - Di’s
  - Set of shared attributes

D1(A, B, C)   D2(C, D, E)   ...   Dn(A, E, G)

Decomposition of Computations

Computational primitives

– Arithmetic primitives
  • Count of tuples in implicit D
  • Mean value of an attribute in D
  • Informational entropy for a subset of D
  • Covariance matrix for D

– Non-numeric primitives
  • Median value of an attribute in D
  • Sorting subsets of tuples in D

Decomposition of Computations

• Computational cost of decomposition
  – Communication cost
    • Number of messages exchanged
  – Number of database queries

• Who does the decomposition?
  – Algorithm itself, at run time
  – Depending on the nature of overlap in Di’s

Count All Tuples in Implicit D

$R = \#\mathrm{tuples}(D)$ can be decomposed as:

$$R = \sum_{j=1}^{m} \prod_{i=1}^{n} \big(N(D_i)\big)_{cond_j}$$

– cond_j: j-th tuple in Shareds

– m: number of tuples in Shareds

– n: number of participating databases (Di’s)

– (N(D_i))_{cond_j}: count of tuples in D_i satisfying cond_j

– Local computation: g_i(D_i, cond_j) = (N(D_i))_{cond_j}

– G is a sum-of-products

Shareds

A  C
1  1
1  2
2  1
2  2

l attributes; k values each; $k^l$ tuples
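The sum-of-products decomposition can be sketched on the three-node example. This is a minimal illustration, not the author's implementation: it assumes each node only ever returns a count (the local computation g_i), and that a node ignores the parts of a condition that refer to attributes it does not hold. The names `rows`, `local_count`, `DBS` and `SHARED` are illustrative.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

# Explicit databases and shared tuples, as read from the example tables.
D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))  # k^l = 2^2 = 4 conditions

def local_count(db, cond):
    """g_i: number of local tuples matching cond on the attributes held locally."""
    return sum(all(t[a] == v for a, v in cond.items() if a in t) for t in db)

# G: sum over shared tuples of the product of the local counts.
total = sum(prod(local_count(db, cond) for db in DBS) for cond in SHARED)
print(total)  # 6
```

Only counts cross node boundaries here, and the result matches the six tuples of the implicit database.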

Implementing Decomposed Computations

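[Figure: Stationary Agents. An agent (A) resides at each of D1(A, B, C), D2(C, D, E), ..., Dn(A, E, G) and at Dx; the agents exchange messages.]

[Figure: Mobile Agents. An aglet travels among D1(A, B, C), D2(C, D, E), ..., Dn(A, E, G) and Dx.]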

Implementation of Count(D)

Stationary Agents
- Request / Send Summaries
- Simple SQL interface
  - 1 count / message
  - l attributes having k values each
  - Messages exchanged: $n \cdot k^l$
- Query-code interface
  - $k^l$ counts / message
  - l attributes having k values each
  - Messages exchanged: $n$

Mobile Agents:
- Number of hops: $n$

Shareds (attributes A, C): $k^l$ tuples

Implementation of Count(D-test)

$$\mathrm{Count}_{test} = \sum_{J} \prod_{t=1}^{n} \big(N(D_t)\big)_{cond_J \text{ and } test}$$

Stationary Agents
- Simple SQL interface
  - Messages exchanged: $n \cdot k^l$

Mobile Agents:
- Number of hops: $n$

Shareds (attributes A, C): $k^l$ tuples
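Counting only the tuples that satisfy a test can be sketched by adding the test to every shared condition before it is sent to the nodes. The sketch below makes two assumptions that the slide does not spell out: a node ignores test attributes it does not hold, and shared tuples that contradict the test on a shared attribute are skipped. All names are illustrative.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))

def local_count(db, cond):
    """Number of local tuples matching cond on the attributes held locally."""
    return sum(all(t[a] == v for a, v in cond.items() if a in t) for t in db)

def count(test=None):
    """Count tuples of the implicit D that satisfy `test` (attribute -> value)."""
    test = test or {}
    total = 0
    for cond in SHARED:
        # Skip shared tuples that contradict the test on a shared attribute.
        if any(a in cond and cond[a] != v for a, v in test.items()):
            continue
        total += prod(local_count(db, {**cond, **test}) for db in DBS)
    return total

print(count({"B": 6}))          # 5
print(count({"A": 1, "F": 3}))  # 1
```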

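Average Value of an attribute in D

Compute counts for each value of an attribute:

$$\mathrm{Avg}_C = N(\text{total})^{-1} \cdot \sum_{i=1}^{n} \big(N(C_i) \cdot C_i\big)$$

where $C_i$ is the i-th value of attribute C and $N(C_i)$ is the count of tuples in D having that value.

Stationary Agents
- Simple SQL interface
  - Messages exchanged: $(k+1) \cdot n \cdot k^l$ (1 integer/message)

Mobile Agents:
- Number of hops: $n$ ($k^l \cdot (k+1)$ integers/message)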
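As an illustration, the average of attribute B over the implicit D can be assembled from one decomposed count per value of B plus the total count. This is a sketch under the assumption that the learner already knows the value domain of B (here {2, 6}, taken from the example tables); the helper names are illustrative.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))

def local_count(db, cond):
    """Number of local tuples matching cond on the attributes held locally."""
    return sum(all(t[a] == v for a, v in cond.items() if a in t) for t in db)

def count(test=None):
    """Decomposed count of implicit tuples satisfying `test`."""
    test = test or {}
    return sum(
        prod(local_count(db, {**cond, **test}) for db in DBS)
        for cond in SHARED
        if all(cond.get(a, v) == v for a, v in test.items())
    )

VALUES_OF_B = [2, 6]  # assumed known to the learner
per_value = {v: count({"B": v}) for v in VALUES_OF_B}  # k counts
n_total = count()                                      # plus the total count
avg_b = sum(v * c for v, c in per_value.items()) / n_total
print(per_value, n_total, avg_b)
```

For the example this gives one tuple with B = 2 and five with B = 6, so the average is 32/6.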

Exception Tuples

• Database of interest may exclude some tuples of D

• Learning site keeps a relation E of exception tuples
  – E may have explicit tuples
  – E may have rules to generate exception tuples

$$R = \sum_{j=1}^{m} \Big( \prod_{i=1}^{n} \big(N(D_i)\big)_{Cond_j} - \big(N(E)\big)_{Cond_j} \Big)$$

Explicit Databases

Node 1       Node 2       Node 3
A  B  C      D  C  F      A  E  C
1  2  2      1  2  2      1  1  2
1  6  2      1  1  1      1  2  1
1  6  1      1  1  3      2  2  1
2  6  1      -  -  -      -  -  -

SharedSet

A  C
1  1
1  2
2  1
2  2

Exceptions

B  E
2  3
-  -
-  -
-  -
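The exception-adjusted count can be sketched as follows. The exception relation used here is hypothetical (a single full tuple of the implicit D) and is not the one from the slide; the sketch also assumes that every exception tuple really occurs in D and carries the shared attributes, so that subtracting its per-condition count is valid. The exception relation is named `EXCEPTIONS` to avoid confusion with attribute E.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))

# Hypothetical exception relation kept at the learning site:
# one full tuple of D that should be excluded.
EXCEPTIONS = rows("ABCDEF", (2, 6, 1, 1, 2, 3))

def local_count(db, cond):
    """Number of tuples matching cond on the attributes the relation holds."""
    return sum(all(t[a] == v for a, v in cond.items() if a in t) for t in db)

def count_excluding(exceptions):
    """Sum over shared tuples of (product of local counts - exception count)."""
    return sum(
        prod(local_count(db, cond) for db in DBS) - local_count(exceptions, cond)
        for cond in SHARED
    )

print(count_excluding([]))          # 6
print(count_excluding(EXCEPTIONS))  # 5
```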

Computing Informational Entropy

Consists of various counts only:

$$E = \sum_{b} \frac{N_b}{N} \sum_{c} -\frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b}$$

Stationary agent/Simple SQL interface:
- Messages exchanged: $n \cdot k^l \cdot k^2$ (1 integer/message)

Stationary agent/Query-code interface:
- Messages exchanged: $n$ ($k^l \cdot k^2$ integers/message)

Mobile agent:
- Number of hops: $n$

[Number of messages/hops is independent of the size of D]
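Because the entropy depends on counts only, the learner can evaluate it once the counts have been gathered. The sketch below assumes the usual decision-tree reading of the symbols, which the slide does not define: N_bc is the number of tuples with attribute value b and class value c, N_b is the sum of N_bc over c, and N is the grand total. The sample counts are those of (B, F) pairs in the example implicit database; the function name is illustrative.

```python
from collections import defaultdict
from math import log2

def entropy_from_counts(n_bc):
    """Weighted entropy from a dict mapping (b, c) -> count N_bc."""
    n_b = defaultdict(int)
    for (b, _c), cnt in n_bc.items():
        n_b[b] += cnt
    n = sum(n_b.values())
    e = 0.0
    for (b, _c), cnt in n_bc.items():
        if cnt:  # a zero count contributes nothing (0 * log 0 taken as 0)
            e -= (n_b[b] / n) * (cnt / n_b[b]) * log2(cnt / n_b[b])
    return e

# Counts of (B, F) value pairs, as they would be returned by decomposed counts.
counts_bf = {(2, 2): 1, (6, 1): 2, (6, 2): 1, (6, 3): 2}
print(round(entropy_from_counts(counts_bf), 3))  # 1.268
```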

Decomposition of Algorithms

• Arithmetic primitives are 1-step decompositions
  – Counts, averages, entropy

• Algorithms involve
  – Arithmetic primitives
  – Non-numeric primitives
  – Control structure

• Decomposition studied for
  – Decision tree induction algorithm
  – Mining of association rules

• Control structure is unaltered

• Primitive computations are decomposed

D1(A, B, C)   D2(C, D, E)   ...   Dn(A, E, G)   Dx

• Learner Node
  • Control structure
  • Decomposition
  • Composition

Building a Decision Tree

To induce a decision tree having:

- d levels; m attributes in n databases; l shared attributes

- k values/attribute

Stationary agent/Simple SQL interface:
- Messages exchanged: $[md - d^2/2] \cdot [n \cdot k^l \cdot k^2]$ (1 integer/message)

Stationary agent/Query-code interface:
- Messages exchanged: $[md - d^2/2] \cdot [n]$ ($k^l \cdot k^2$ integers/message)

Mobile agent:
- $[md - d^2/2] \cdot [n]$ hops

[Number of messages/hops is independent of the size of D]
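The message estimates above are simple closed forms, so they can be compared directly for a given configuration. The sketch below just evaluates them; the function names and the sample parameter values are illustrative assumptions.

```python
def entropy_evaluations(m, d):
    """The [m*d - d^2/2] factor from the estimates above."""
    return m * d - d**2 / 2

def messages_simple_sql(m, d, n, k, l):
    """Stationary agents, simple SQL interface: one integer per message."""
    return entropy_evaluations(m, d) * n * k**l * k**2

def messages_query_code(m, d, n):
    """Stationary agents, query-code interface: k^l * k^2 integers per message."""
    return entropy_evaluations(m, d) * n

def hops_mobile_agent(m, d, n):
    """Mobile agent: number of hops."""
    return entropy_evaluations(m, d) * n

# Sample values: m=6 attributes, d=2 levels, n=3 databases, k=2 values, l=2 shared.
print(messages_simple_sql(6, 2, 3, 2, 2))  # 480.0
print(messages_query_code(6, 2, 3))        # 30.0
print(hops_mobile_agent(6, 2, 3))          # 30.0
```

None of the three quantities involves the number of tuples stored at the nodes.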

Mining Association Rules

Main operations:
- Enumerate item-sets
- Compute support/confidence
- Basic computation: Count-of-tuples

Communication Complexity:
- m (avg.) item sets at each level of enumeration tree
- j levels of enumeration tree
- Query-code can count for all item sets at a level simultaneously
- Therefore, we need:
  - Number of counts needed: $2 \cdot m \cdot j$
  - $j \cdot n$ messages, or $j \cdot n$ hops
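Since support and confidence reduce to counts of tuples, they can be sketched on top of a decomposed count. The example below treats attribute-value pairs such as B=6 as items, which is an assumption made for illustration on the three-node example data; all names are illustrative.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))

def local_count(db, cond):
    """Number of local tuples matching cond on the attributes held locally."""
    return sum(all(t[a] == v for a, v in cond.items() if a in t) for t in db)

def count(itemset=None):
    """Decomposed count of implicit tuples containing every item in `itemset`."""
    itemset = itemset or {}
    return sum(
        prod(local_count(db, {**cond, **itemset}) for db in DBS)
        for cond in SHARED
        if all(cond.get(a, v) == v for a, v in itemset.items())
    )

antecedent = {"B": 6}
rule_items = {"B": 6, "F": 3}  # rule: B=6 -> F=3
support = count(rule_items) / count()
confidence = count(rule_items) / count(antecedent)
print(count(rule_items), count(antecedent), count())  # 2 5 6
```

Each rule evaluated this way needs two counts beyond the total, one for the full item set and one for the antecedent.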

More Complex Computations

• Covariance matrix for D
  – Useful for eigenvectors/principal components
  – Needs second-order moments: $\sum_{t \in D} x_t \, y_t$

• Graph/Network algorithms
  – Each node has part of a graph
  – Some nodes are shared
    • Determine MST
    • Paths of min/max flow
    • Flow patterns

Sum of Products

• Sum of products for two attributes:

$$\sum_{t \in D} x_t \, y_t$$

• There are six different ways in which x and y may be distributed

• Each requires a different decomposition

– Case 1: x same as y; and x belongs to the SharedSet.

$$\sum_{j} x_j^2 \cdot \mathrm{Count}(x_j \text{ in } D)$$

– Case 2: x same as y; and x does not belong to the SharedSet.

$$\sum_{k} \mathrm{Avg}(x^2 \text{ for } cond_k) \cdot \mathrm{Count}(cond_k)$$

– Case 3: x and y both belong to the SharedSet.

$$\sum_{k} x_k \cdot y_k \cdot \mathrm{Count}(cond_k \text{ in Shared})$$

Sum of Products

– Case 4: x belongs to SharedSet and y does not.

$$\sum_{j} x_j \cdot y \cdot \mathrm{Count}(x = x_j)$$

– Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
  • For each tuple t in SharedSet, obtain $x(t)$, $y(t)$
  • and then

$$\sum_{t} \mathrm{Sum}(x(t)) \cdot \mathrm{Sum}(y(t))$$

– Case 6: x, y don’t belong to the SharedSet and reside on the same node.

$$\sum_{t} \mathrm{Prod}(t) \cdot \mathrm{Count}(t)$$

where Prod(t) is average of product of x and y for cond-t of SharedSet
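Two of the cases can be sketched on the three-node example: Case 3 with x = A and y = C (both shared), and Case 5 with x = B on Node 1 and y = F on Node 2. This is a minimal sketch, not the author's implementation. It reads Count(cond) as the number of implicit tuples matching the shared tuple, and, for Case 5 with more than two nodes, it multiplies by the matching counts at the remaining nodes; that extra factor is an assumption the slide does not state.

```python
from itertools import product
from math import prod

def rows(attrs, *tuples):
    """Build attribute->value dicts from a schema string and value tuples."""
    return [dict(zip(attrs, t)) for t in tuples]

D1 = rows("ABC", (1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1))
D2 = rows("DCF", (1, 2, 2), (1, 1, 1), (1, 1, 3))
D3 = rows("AEC", (1, 1, 2), (1, 2, 1), (2, 2, 1))
DBS = [D1, D2, D3]
SHARED = rows("AC", *product([1, 2], repeat=2))

def matching(db, cond):
    """Local tuples that agree with cond on the attributes held locally."""
    return [t for t in db if all(t[a] == v for a, v in cond.items() if a in t)]

def sum_xy_both_shared(x, y):
    """Case 3: x and y are both shared attributes."""
    return sum(
        c[x] * c[y] * prod(len(matching(db, c)) for db in DBS) for c in SHARED
    )

def sum_xy_different_nodes(x, db_x, y, db_y):
    """Case 5: x and y are unshared and held by different nodes."""
    total = 0
    for c in SHARED:
        sum_x = sum(t[x] for t in matching(db_x, c))
        sum_y = sum(t[y] for t in matching(db_y, c))
        # Assumed extra factor: matching counts at the nodes holding neither x nor y.
        others = prod(
            len(matching(db, c)) for db in DBS if db is not db_x and db is not db_y
        )
        total += sum_x * sum_y * others
    return total

print(sum_xy_both_shared("A", "C"))               # 10
print(sum_xy_different_nodes("B", D1, "F", D2))   # 64
```

Both results agree with summing the products directly over the six tuples of the example implicit database.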

Self-decomposing Algorithms

• Easy decomposability of arithmetic primitives
  – Average/Covariance matrix/Entropy

• Control structure of algorithms is not altered
  – More gains possible, by altering control structure

• Decomposition is driven by the set of shared attributes

• Algorithm can determine shared attributes in n messages/hops

• Algorithms decompose in accordance with attribute sharing
  – No human intervention needed

• Message complexity is independent of sizes of databases

Continuing Work

Determine patterns of flow in a network
– Communication network traffic
– Geographic/economic flows

[Figure: four network nodes, each holding local flow data]
