
CSG230 Summary

Donghui Zhang

What did we learn?

1. Frequent pattern & association

2. Clustering

3. Classification

4. Data warehousing

5. Additional

What did we learn?

1. Frequent pattern & association

frequent itemsets (Apriori, FP-growth)

max and closed itemsets

association rules

essential rules

generalized itemsets

Sequential pattern

2. Clustering

3. Classification

4. Data warehousing

5. Additional

What did we learn?

1. Frequent pattern & association

2. Clustering

k-means

Birch (based on CF-tree)

DBSCAN

CURE

3. Classification

4. Data warehousing

5. Additional

What did we learn?

1. Frequent pattern & association

2. Clustering

3. Classification

decision tree

naïve Bayesian classifier

Bayesian network

neural net and SVM

4. Data warehousing

5. Additional

What did we learn?

1. Frequent pattern & association

2. Clustering

3. Classification

4. Data warehousing

concept, schema

data cube & operations (rollup, …)

cube computation: multi-way array aggregation

iceberg cube

dynamic data cube

5. Additional

What did we learn?

1. Frequent pattern & association

2. Clustering

3. Classification

4. Data warehousing

5. Additional

lattice (of itemsets, g-itemsets, rules, cuboids)

distance-based indexing

1. Frequent pattern & association

frequent itemsets (Apriori, FP-growth)

max and closed itemsets

association rules

essential rules

generalized itemsets

Sequential pattern


Basic Concepts: Frequent Patterns and Association Rules

Itemset X = {x1, …, xk}

Find all the rules X → Y with minimum support and confidence:

support, s: the probability that a transaction contains X ∪ Y

confidence, c: the conditional probability that a transaction having X also contains Y

Let min_support = 50%, min_conf = 50%:

A → C (50%, 66.7%)

C → A (50%, 100%)

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)

Transaction-id Items bought

10 A, B, C

20 A, C

30 A, D

40 B, E, F


From Mining Association Rules to Mining Frequent Patterns (i.e. Frequent Itemsets)

Given a frequent itemset X, how do we find association rules?

Examine every subset S of X:

Confidence(S → X − S) = support(X) / support(S)

Compare with min_conf.

An optimization is possible (refer to exercises 6.1, 6.2).
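A minimal Python sketch of this step (an illustration, not the book's code): it enumerates every proper non-empty subset S of a frequent itemset X and keeps the rules S → X − S that clear min_conf. The support values are taken from the TDB on the previous slide (A: 3/4, C: 2/4, AC: 2/4); the function name and min_conf value are my own choices.

```python
from itertools import combinations

# Supports from the example TDB (as fractions of the 4 transactions).
support = {
    frozenset('A'): 0.75,
    frozenset('C'): 0.50,
    frozenset('AC'): 0.50,
}

def rules_from_itemset(X, support, min_conf):
    """Return (S, X-S, confidence) for every rule S -> X-S meeting min_conf."""
    X = frozenset(X)
    items = list(X)
    rules = []
    for r in range(1, len(items)):                    # proper, non-empty subsets S
        for S in map(frozenset, combinations(items, r)):
            conf = support[X] / support[S]            # confidence(S -> X-S)
            if conf >= min_conf:
                rules.append((set(S), set(X - S), conf))
    return rules

print(rules_from_itemset('AC', support, min_conf=0.5))
# e.g. [({'A'}, {'C'}, 0.666...), ({'C'}, {'A'}, 1.0)]
```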


The Apriori Algorithm—An Example

Database TDB:

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3

L1 (min_sup = 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan → C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2

L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}

3rd scan → L3: {B,C,E}:2


Important Details of Apriori

How to generate candidates?

Step 1: self-joining Lk

Step 2: pruning

How to count supports of candidates?

Example of candidate generation:

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning: acde is removed because ade is not in L3

C4={abcd}
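The join-and-prune step can be written in a few lines (a sketch under my own encoding, with itemsets stored as sorted tuples; apriori_gen is my naming, not the book's):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Generate C(k+1) from Lk: self-join, then prune."""
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            # Self-join: agree on the first k-1 items, last item of a < last of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune: every k-subset of the candidate must itself be in Lk.
                if all(sub in Lk for sub in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {tuple(s) for s in ['abc', 'abd', 'acd', 'ace', 'bcd']}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```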


Construct FP-tree from a Transaction Database

Header Table (item : frequency; each entry also stores the head of that item's node-links): f:4, c:4, a:3, b:3, m:3, p:3

min_support = 3

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)

2. Sort frequent items in frequency descending order: the f-list

3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p


Construct FP-tree from a Transaction Database

(The FP-tree is built by inserting the five ordered transactions one at a time; the completed tree, with item counts, is:)

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Header Table: f:4, c:4, a:3, b:3, m:3, p:3 (min_support = 3; each entry heads the chain of node-links for its item)


Find Patterns Having P From P-conditional Database

Starting at the frequent-item header table in the FP-tree:

Traverse the FP-tree by following the node-links of each frequent item p

Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases

item cond. pattern base

c f:3

a fc:3

b fca:1, f:1, c:1

m fca:2, fcab:1

p fcam:2, cb:1

(FP-tree and header table as on the previous slide.)
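The construction and the conditional pattern bases can be reproduced with a short sketch (an illustration only; FPNode and the helper names are mine, and the f-list is passed in explicitly as f-c-a-b-m-p to match the slides, since ties among equally frequent items can be ordered arbitrarily):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support, flist=None):
    """Two scans: count items, then insert each transaction's frequent items
    in f-list order into a prefix tree, keeping node-links per item."""
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    if flist is None:
        # Frequent items in descending frequency (ties broken arbitrarily).
        flist = sorted((i for i in counts if counts[i] >= min_support),
                       key=lambda i: -counts[i])
    rank = {item: r for r, item in enumerate(flist)}
    root, node_links = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                node_links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, node_links, flist

def conditional_pattern_base(item, node_links):
    """All transformed prefix paths of `item`, each with that node's count."""
    base = []
    for node in node_links[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

transactions = [
    ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
    ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
    ['b', 'f', 'h', 'j', 'o', 'w'],
    ['b', 'c', 'k', 's', 'p'],
    ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
]
root, node_links, _ = build_fp_tree(transactions, min_support=3, flist=list('fcabmp'))
for item in 'cabmp':
    print(item, conditional_pattern_base(item, node_links))
# c [(['f'], 3)]
# a [(['f', 'c'], 3)]
# b [(['f', 'c', 'a'], 1), (['f'], 1), (['c'], 1)]
# m [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
# p [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)]
```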


Max-patterns

A frequent pattern {a1, …, a100} implies

C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30

frequent sub-patterns!

Max-pattern: a frequent pattern without a proper frequent super-pattern

BCDE and ACD are max-patterns; BCD is not a max-pattern

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

Min_sup = 2


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

(ABCDEF)

Items Frequency

A 2

B 2

C 3

D 3

E 2

F 1

ABCDE 0

Min_sup=2

Max patterns:

A(BCDE)   B(CDE)   C(DE)   D(E)   E()


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

Items Frequency

AB 1

AC 2

AD 2

AE 1

ACD 2

Min_sup=2

(ABCDEF)

A(BCDE)   B(CDE)   C(DE)   D(E)   E()

Max patterns:

Node A


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

Items Frequency

AB 1

AC 2

AD 2

AE 1

ACD 2

Min_sup=2

(ABCDEF)

A(BCDE)   B(CDE)   C(DE)   D(E)   E()

ACD

Max patterns:

Node A


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

Items Frequency

BCDE 2

Min_sup=2

(ABCDEF)

A(BCDE)   B(CDE)   C(DE)   D(E)   E()

ACD

Max patterns:

Node B


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

Items Frequency

BCDE 2

Min_sup=2

(ABCDEF)

A(BCDE)   B(CDE)   C(DE)   D(E)   E()

ACD

BCDE

Max patterns:

Node B


Example

Tid Items

10 A,B,C,D,E

20 B, C, D, E

30 A,C,D,F

Min_sup=2

(ABCDEF)

A(BCDE)   B(CDE)   C(DE)   D(E)   E()

Max patterns: ACD, BCDE


A Critical Observation

Rule Support Confidence

A → BC sup(ABC) sup(ABC)/sup(A)

AB → C sup(ABC) sup(ABC)/sup(AB)

AC → B sup(ABC) sup(ABC)/sup(AC)

A → B sup(AB) sup(AB)/sup(A)

A → C sup(AC) sup(AC)/sup(A)

A → BC has support and confidence no larger than any of the other rules, independently of the TDB.

Rules AB → C, AC → B, A → B and A → C are redundant with regard to A → BC.

While mining association rules, a large percentage of rules may be redundant.


Formal Definition of Essential Rule

Definition 1: Rule r1 implies another rule r2 if support(r1) ≤ support(r2) and confidence(r1) ≤ confidence(r2), independently of the TDB. Denote this as r1 ⇒ r2.

Definition 2: Rule r1 is an essential rule if r1 is strong and there is no other rule r2 such that r2 ⇒ r1.


Example of a Lattice of rules

(Figure: the lattice of rules derived from the itemset ABC. Beneath ABC sit A → BC, C → AB, and B → AC; beneath them sit their children, e.g. AC → B, AB → C, A → B, A → C, and so on.)

• Generate the child nodes: move an item from the consequent to the antecedent, or delete it from the consequent.
• To find essential rules: start from each max itemset; browse top-down; prune a sub-tree whenever a rule is confident.


Frequent generalized itemsets

A taxonomy of items. The TDB involves only leaf items of the taxonomy.

A g-itemset may contain g-items, but cannot contain an ancestor and a descendant at the same time.

!! A descendant g-item is a "superset" !! Anyone who bought {milk, bread} also bought {milk}. Anyone who bought {A} also bought {W}.

?? How to find frequent g-itemsets? Browse (and prune) a lattice of g-itemsets! To get children, replace one item by its ancestor (if this creates a conflict, remove the item instead).


What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences

A sequence database is shown below. Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets: those having prefix <a>, those having prefix <b>, …, those having prefix <f>.

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Finding Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>.

<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, by checking the frequency of items like a and _a.

Further partition into 6 subsets: having prefix <aa>; …; having prefix <af>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


2. Clustering

k-means

Birch (based on CF-tree)

DBSCAN

CURE


The K-Means Clustering Method

1. Pick k objects as initial seed points.

2. Assign each object to the cluster with the nearest seed point.

3. Re-compute each seed point as the centroid (or mean point) of its cluster.

4. Go back to Step 2; stop when there are no new assignments.

Not optimal. A counter example?
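A minimal NumPy sketch of these four steps (illustrative only; the toy data, k, and the convergence test are my own choices):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k objects as the initial seed points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute each seed point as the centroid (mean) of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Stop when the centers (equivalently, the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one blob
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])   # another blob
labels, centers = k_means(X, k=2)
print(labels, centers, sep="\n")
```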


BIRCH (1996)

Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:

Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans

Weakness: handles only numeric data, and is sensitive to the order of the data records.


Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: Number of data points

LS: the linear sum of the N data points, Σ(i=1..N) Xi

SS: the square sum of the N data points, Σ(i=1..N) Xi²

(Figure: the five points below plotted on a 0–10 grid.)

Data points: (3,4), (2,6), (4,5), (4,7), (3,8)

CF = (5, (16,30), 244)


Some Characteristics of CF

Two CF can be aggregated.

Given CF1=(N1, LS1, SS1), CF2 = (N2, LS2, SS2),

If combined into one cluster, CF=(N1+N2, LS1+LS2, SS1+SS2).

The centroid and radius can both be computed from CF.

centroid is the center of the cluster

radius is the average (root-mean-square) distance between an object and the centroid.

how?

Centroid: x0 = (Σ(i=1..N) xi) / N

Radius: R = sqrt( Σ(i=1..N) (xi − x0)² / N )


Some Characteristics of CF

From the CF alone:

Centroid: x0 = LS / N

Radius: R = sqrt( Σ(i=1..N) (xi − x0)² / N ) = sqrt( (SS − 2·x0·LS + N·x0²) / N ) = sqrt( SS/N − (LS/N)² )
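A small NumPy sketch checking these identities on the five-point example from the earlier slide (the helper names are my own):

```python
import numpy as np

def cf_of(points):
    """CF = (N, LS, SS) for a set of points."""
    return len(points), points.sum(axis=0), (points ** 2).sum()

def cf_merge(cf1, cf2):
    """Two CFs are aggregated by component-wise addition."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf):
    n, ls, ss = cf
    x0 = ls / n                        # centroid = LS / N
    r = np.sqrt(ss / n - x0 @ x0)      # R = sqrt(SS/N - |LS/N|^2)
    return x0, r

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
cf = cf_of(points)
print(cf)                        # N = 5, LS = (16, 30), SS = 244
print(centroid_and_radius(cf))   # centroid (3.2, 6.0), radius 1.6
```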


CF-Tree in BIRCH

Clustering feature:

summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view.

registers crucial measurements for computing cluster and utilizes storage efficiently

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering

A nonleaf node in a tree has descendants or “children”

The nonleaf nodes store sums of the CFs of their children

A CF tree has two parameters

Branching factor: specifies the maximum number of children.

threshold T: max radius of sub-clusters stored at the leaf nodes


Insertion in a CF-Tree

To insert an object o to a CF-tree, insert to the root node of the CF-tree.

To insert o into an index node, insert into the child node whose centroid is the closest to o.

To insert o into a leaf node,

If an existing leaf entry can “absorb” it (i.e. new radius <= T), let it be;

Otherwise, create a new leaf entry.

Split:

Choose two entries whose centroids are the farthest away;

Assign them to two different groups;

Assign the remaining entries to one of these groups.


Density-Based Clustering: Background (II)

Density-reachable: A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.

Density-connected: A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

(Figures: a chain of points from q = p1 to p illustrating density-reachability, and a point o from which both p and q are density-reachable illustrating density-connectivity.)


DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points of a sample dataset, with Eps = 1 cm and MinPts = 5.)


DBSCAN: The Algorithm

Arbitrarily select a point p

Retrieve all points density-reachable from p wrt Eps and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.
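A compact sketch of this loop (illustrative only; it precomputes the Eps-neighborhoods, labels noise as -1, and uses NumPy, all of which are my own choices rather than the original algorithm's presentation):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise/outliers."""
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]   # includes the point itself
    cluster = 0
    for p in range(n):                          # arbitrarily select a point p
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:         # border or noise for now
            continue
        labels[p] = cluster                     # p is a core point: grow a cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:    # q is also core: expand further
                    queue.extend(neighbors[q])
            if labels[q] == -1:
                labels[q] = cluster             # q is density-reachable from p
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],   # dense blob
              [10, 10]])                                     # an isolated point
print(dbscan(X, eps=1.5, min_pts=3))    # [0 0 0 0 0 -1]
```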


Motivation for CURE

(Figure omitted.) k-means does not perform well on this dataset; AGNES + dmin has the single-link effect!


Cure: The Basic Version

Initially, insert to PQ every object as a cluster.

Every cluster in PQ has:

(Up to) C representative points

A pointer to its closest cluster (dist between two clusters = min{dist(rep1, rep2)})

While PQ has more than k clusters:

Merge the top cluster with its closest cluster.


Representative points

Step 1: choose up to C points. If a cluster has no more than C points, take all of them. Otherwise, choose the first point as the farthest from the mean, and choose each of the others as the farthest from the already-chosen ones.

Step 2: shrink each point towards the mean: p' = p + α·(mean − p), with α ∈ [0,1]. Larger α means shrinking more. Reason for shrinking: it counters outliers, since faraway objects are shrunk more.
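A short sketch of both steps (my own helper, with a made-up square-shaped cluster and α = 0.5):

```python
import numpy as np

def representatives(cluster, c, alpha):
    """Pick up to c well-scattered points, then shrink them toward the mean."""
    mean = cluster.mean(axis=0)
    if len(cluster) <= c:
        reps = cluster.copy()
    else:
        # First representative: the point farthest from the mean;
        # each next one: the point farthest from the representatives chosen so far.
        reps = [cluster[np.argmax(np.linalg.norm(cluster - mean, axis=1))]]
        while len(reps) < c:
            d = np.min(np.linalg.norm(cluster[:, None] - np.array(reps)[None, :], axis=2), axis=1)
            reps.append(cluster[np.argmax(d)])
        reps = np.array(reps)
    # Shrink: p' = p + alpha * (mean - p); larger alpha shrinks more.
    return reps + alpha * (mean - reps)

cluster = np.array([[0., 0.], [0., 4.], [4., 0.], [4., 4.], [2., 2.]])
print(representatives(cluster, c=4, alpha=0.5))   # the four corners, pulled halfway to (2, 2)
```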


3. Classification

decision tree

naïve Bayesian classifier

Bayesian net

neural net and SVM


Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

This follows an example from Quinlan’s ID3


Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no:  buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair:      buys_computer = yes


Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down recursive divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)

There are no samples left


General Case

Suppose X can have one of m values V1, V2, …, Vm, with P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's H(X), the entropy of X:

H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm = −Σ(j=1..m) pj log2 pj

"High entropy" means X is from a uniform (boring) distribution; "low entropy" means X is from a varied (peaks and valleys) distribution.
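A one-function sketch of this formula (terms with pj = 0 are taken as 0, a convention the slides also use implicitly):

```python
import math

def entropy(ps):
    """H(X) = -sum_j p_j * log2(p_j); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))    # 1.0   (uniform: high entropy)
print(entropy([0.9, 0.1]))    # ~0.469 (peaked: low entropy)
print(entropy([2/5, 3/5]))    # ~0.971, the value used for H(C | age<=30) later
```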


Specific Conditional Entropy

Definition of Conditional Entropy:

H(Y|X=v) = The entropy of Y among only those records in which X has value v

Example:

• H(Y|X=Math) = 1

• H(Y|X=History) = 0

• H(Y|X=CS) = 0

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes


Conditional Entropy

Definition of general Conditional Entropy:

H(Y|X) = The average conditional entropy of Y

= ΣjProb(X=vj) H(Y | X = vj)

X = College Major

Y = Likes “Gladiator”

Example:

vj Prob(X=vj) H(Y | X = vj)

Math 0.5 1

History 0.25 0

CS 0.25 0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes


Conditional entropy H(C|age)

H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971

H(C|age in 30..40) = 1 * lg 1 + 0 * lg 1/0 = 0

H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971

age     buy  no
<=30    2    3
30…40   4    0
>40     3    2

H(C|age) = 5/14 · H(C|age<=30) + 4/14 · H(C|age in 30..40) + 5/14 · H(C|age>40) = 0.694
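The whole attribute-selection computation can be checked with a short sketch over the 14-record training table (re-entered here so the snippet runs on its own; the function names are mine, and H(C|student) comes out as 0.788, which the next slide rounds to 0.789):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, attr, target):
    """H(target | attr) = sum over values v of P(attr=v) * H(target | attr=v)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

cols = ('age', 'income', 'student', 'credit_rating', 'buys_computer')
data = [('<=30','high','no','fair','no'),        ('<=30','high','no','excellent','no'),
        ('31..40','high','no','fair','yes'),     ('>40','medium','no','fair','yes'),
        ('>40','low','yes','fair','yes'),        ('>40','low','yes','excellent','no'),
        ('31..40','low','yes','excellent','yes'),('<=30','medium','no','fair','no'),
        ('<=30','low','yes','fair','yes'),       ('>40','medium','yes','fair','yes'),
        ('<=30','medium','yes','excellent','yes'),('31..40','medium','no','excellent','yes'),
        ('31..40','high','yes','fair','yes'),    ('>40','medium','no','excellent','no')]
rows = [dict(zip(cols, r)) for r in data]

for attr in cols[:-1]:
    print(attr, round(conditional_entropy(rows, attr, 'buys_computer'), 3))
# age 0.694, income 0.911, student 0.788, credit_rating 0.892 -> "age" is lowest
```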


Select the attribute with lowest conditional entropy

H(C|age) = 0.694
H(C|income) = 0.911
H(C|student) = 0.789
H(C|credit_rating) = 0.892

Select "age" to be the tree root!

(The resulting tree is the "buys_computer" decision tree shown earlier.)


Bayesian Classification

X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=Fair, Age=40).

Hi: the hypothesis that a record belongs to class Ci, e.g. Hi = "the record belongs to the buys-computer class".

P(Hi), P(X): probabilities.

P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what's the probability of buying a computer?

This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to each class.

What if we need to determine a single class for X?


Bayesian Theorem

Another concept, P(X|Hi) : probability of observing the sample X, given that the hypothesis holds. E.g. among all people who buy computer, what percentage has the same value as X.

We know P(X ∧ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), so

P(Hi|X) = P(X|Hi) P(Hi) / P(X)

We should assign X to the class Ci where P(Hi|X) is maximized, equivalent to maximize P(X|Hi) P(Hi).


Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent given the class:

The probability of observing, say, two attribute values y1 and y2 together, given the current class C, is the product of the probabilities of each value taken separately, given the same class: P([y1,y2] | C) = P(y1 | C) * P(y2 | C)

No dependence relation between attributes

Greatly reduces the number of probabilities to maintain.

P(X|Ci) = Π(k=1..n) P(xk|Ci)


Sample quiz questions

(Training data: the 14-record buys_computer table shown earlier.)

1. What data does the naïve Bayesian classifier maintain?

2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair): buy or not buy?


Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:

P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X|buys_computer="yes") * P(buys_computer="yes") = 0.028
P(X|buys_computer="no") * P(buys_computer="no") = 0.007

X belongs to class "buys_computer=yes". Pitfall: don't forget P(Ci)!
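The same example as a sketch (the table is re-entered so the snippet is self-contained; the scores are P(X|Ci)·P(Ci), so the class with the larger score wins):

```python
from collections import Counter

cols = ('age', 'income', 'student', 'credit_rating', 'buys_computer')
data = [('<=30','high','no','fair','no'),        ('<=30','high','no','excellent','no'),
        ('31..40','high','no','fair','yes'),     ('>40','medium','no','fair','yes'),
        ('>40','low','yes','fair','yes'),        ('>40','low','yes','excellent','no'),
        ('31..40','low','yes','excellent','yes'),('<=30','medium','no','fair','no'),
        ('<=30','low','yes','fair','yes'),       ('>40','medium','yes','fair','yes'),
        ('<=30','medium','yes','excellent','yes'),('31..40','medium','no','excellent','yes'),
        ('31..40','high','yes','fair','yes'),    ('>40','medium','no','excellent','no')]
rows = [dict(zip(cols, r)) for r in data]

def naive_bayes(rows, target, x):
    """Score each class Ci by P(Ci) * prod_k P(x_k | Ci); the largest wins."""
    n = len(rows)
    scores = {}
    for ci, count in Counter(r[target] for r in rows).items():
        score = count / n                                   # P(Ci)
        for attr, value in x.items():
            match = sum(1 for r in rows if r[target] == ci and r[attr] == value)
            score *= match / count                          # P(x_k | Ci)
        scores[ci] = score
    return scores

x = {'age': '<=30', 'income': 'medium', 'student': 'yes', 'credit_rating': 'fair'}
print(naive_bayes(rows, 'buys_computer', x))
# {'no': 0.00686, 'yes': 0.02822} -> predict buys_computer = "yes"
```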


Assume five variables

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)

L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)

R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)

M and S are independent


Making a Bayes net

(Figure: the five nodes S, M, R, L, T.)

Step One: add variables.
• Just choose the variables you'd like to be included in the net.



Making a Bayes net

(Figure: the network structure: M → R, M → L, S → L, L → T.)

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1, Q2, …, Qn, you are promising that any variable that's a non-descendant of X is conditionally independent of X given {Q1, Q2, …, Qn}.



Making a Bayes net

(Figure: the same network, now with a probability table attached to each node:)

P(S) = 0.3, P(M) = 0.6
P(R|M) = 0.3, P(R|~M) = 0.6
P(T|L) = 0.3, P(T|~L) = 0.8
P(L|M^S) = 0.05, P(L|M^~S) = 0.1, P(L|~M^S) = 0.1, P(L|~M^~S) = 0.2

Step Three: add a probability table for each node.
• The table for node X must list P(X | parent values) for each possible combination of parent values.



Computing with Bayes Net

P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).

(The Bayes net from the previous slides, with its probability tables:)

P(S) = 0.3, P(M) = 0.6
P(R|M) = 0.3, P(R|~M) = 0.6
P(T|L) = 0.3, P(T|~L) = 0.8
P(L|M^S) = 0.05, P(L|M^~S) = 0.1, P(L|~M^S) = 0.1, P(L|~M^~S) = 0.2
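Plugging the table values into the factored form gives P(T ∧ ~R ∧ L ∧ ~M ∧ S) = 0.3 · 0.4 · 0.1 · 0.4 · 0.3 = 0.00144. A small sketch (my own encoding of the CPTs):

```python
# CPT values from the slide's Bayes net.
P_S, P_M = 0.3, 0.6
P_R_given = {True: 0.3, False: 0.6}              # P(R | M), P(R | ~M)
P_T_given = {True: 0.3, False: 0.8}              # P(T | L), P(T | ~L)
P_L_given = {(True, True): 0.05, (True, False): 0.1,
             (False, True): 0.1, (False, False): 0.2}   # P(L | M, S)

def joint(t, r, l, m, s):
    """P(T=t, R=r, L=l, M=m, S=s) factored along the net structure."""
    p = (P_S if s else 1 - P_S) * (P_M if m else 1 - P_M)
    p *= P_R_given[m] if r else 1 - P_R_given[m]
    p *= P_L_given[(m, s)] if l else 1 - P_L_given[(m, s)]
    p *= P_T_given[l] if t else 1 - P_T_given[l]
    return p

print(joint(t=True, r=False, l=True, m=False, s=True))   # 0.00144
```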

What did we learn?

4. Data warehousing

concept, schema

data cube & operations (rollup, …)

cube computation: multi-way array aggregation

iceberg cube

dynamic data cube


What is a Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained

separately from the organization’s operational database

Support information processing by providing a solid

platform of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-

variant, and nonvolatile collection of data in support of

management’s decision-making process.”—W. H. Inmon

Data warehousing:

The process of constructing and using data warehouses


Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a set of dimension tables

Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Fact constellations: Multiple fact tables share dimension tables; viewed as a collection of stars, therefore called galaxy schema or fact constellation


A data cube

0-D (apex) cuboid: all

1-D cuboids: product; quarter; country

2-D cuboids: (product, quarter); (product, country); (quarter, country)

3-D (base) cuboid: (product, quarter, country)


Multidimensional Data

Sales volume as a function of product, month, and region

(Figure: a 3-D cube with axes Product, Region, Month.)

Dimensions: Product, Location, Time

Hierarchical summarization paths:

Product:  Industry > Category > Product
Location: Region > Country > City > Office
Time:     Year > Quarter > Month > Day, with Week > Day as an alternative path

Pick one node from each dimension hierarchy, and you get a data cube!

How many cubes? How many distinct cuboids?


Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up; from higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube; visualization; 3-D to a series of 2-D planes


Typical OLAP Operations

Product:  Industry > Category > Product
Location: Region > Country > City > Office
Time:     Year > Quarter > Month > Day, with Week > Day as an alternative path

?? Starting from [product, city, week], what OLAP operations can produce the total sales for every month and every category in the “automobile” industry.


OLAP Server Architectures

Relational OLAP (ROLAP):

Use relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces

Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services

Greater scalability

Multidimensional OLAP (MOLAP):

Array-based multidimensional storage engine (sparse matrix techniques)

Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP):

User flexibility, e.g., low level: relational, high level: array

Specialized SQL servers:

Specialized support for SQL queries over star/snowflake schemas


Multi-way Array Aggregation for Cube Computation

(Figure: a 3-D array over dimensions A (a0–a3), B (b0–b3), C (c0–c3), partitioned 4×4×4 into 64 chunks numbered 1–64.)

Order: ABC. AB: plane; AC: line; BC: point.


Multi-Way Array Aggregation for Cube Computation (Cont.)

Let A: 40 values, B: 400 values, C: 4000 values. Each dimension is split into 4 chunks, so one chunk contains 10*100*1000 = 1,000,000 values.

ABC order needs how much memory?
AB plane: 40*400 = 16,000
AC line: 40*(4000/4) = 40,000
BC point: (400/4)*(4000/4) = 100,000
total: 156,000

CBA order needs how much memory?
CB plane: 4000*400 = 1,600,000
CA line: 4000*(40/4) = 40,000
BA point: (400/4)*(40/4) = 1,000
total: 1,641,000 (about 10 times more!)
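A tiny helper reproducing both totals (assuming, as in the slide, that every dimension is split into 4 chunks; the function name is mine):

```python
def multiway_memory(sizes, chunks=4):
    """Memory (in cells) for multi-way aggregation when the 3 dimensions are
    scanned in the given order: full plane of the first two dimensions,
    one-chunk-thick line, and one-chunk point."""
    d1, d2, d3 = sizes
    plane = d1 * d2
    line = d1 * (d3 // chunks)
    point = (d2 // chunks) * (d3 // chunks)
    return plane + line + point

print(multiway_memory((40, 400, 4000)))    # ABC order: 156,000
print(multiway_memory((4000, 400, 40)))    # CBA order: 1,641,000
```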


Computing iceberg cube using BUC

BUC (Beyer & Ramakrishnan, SIGMOD'99)

Bottom-up vs. top-down? It depends on how you view it!

Apriori property: aggregate the data, then move to the next level; if minsup is not met, stop!


The Dynamic Data Cube [EDBT’00]

E.g. 16 + 12 + 8 + 6 = 42. Query cost = update cost = O(log²(n))

(Figure: an 8×8 array of 1s split into four 4×4 quadrants; row and column borders store the prefix sums 4, 8, 12, 16, and a range-sum query combines one contribution per level, e.g. 16 + 12 + 8 + 6 = 42.)


Dynamic Data Cube summary

A balanced tree with fanout = 4. The leaf nodes contain the original data cube.

Each index entry stores an X-border and a Y-border.

Each border is stored as a binary tree, which supports a 1-dim prefix-sum query and an update in O(log n) time.

Overall, the DDC supports a range-sum query and an update both in O(log² n) time.


5. Additional

lattice (of itemsets, g-itemsets, rules, cuboids)

distance-based indexing


Problem Statement

Given a set S of objects and a metric distance function d(), the similarity search problem is defined as: for an arbitrary object q and a threshold ε, find { o | o ∈ S and d(o, q) < ε }.

Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!

s 82

An Example of the VP-tree

S = {o1, …, o10}. Randomly pick o1 as the root. Compute the distance between o1 and each oi and sort in increasing order of distance:

o3:5, o7:6, o6:18, o9:34, o10:96, o2:102, o8:111, o5:300, o4:401

Build the tree recursively:

root o1
  left subtree:  o3, o7, o6, o9       (maxDL = 34)
  right subtree: o10, o2, o8, o5, o4  (minDR = 96)


Query Processing

Given object q, compute d(q,root). Intuitively, if it’s small, search the left tree; otherwise, search the right tree.

In each index node, store: maxDL = max{ d(root, oi) | oi ∈ left tree }, minDR = min{ d(root, oi) | oi ∈ right tree }.

Pruning condition:
prune the left subtree if d(q, root) − maxDL ≥ ε
prune the right subtree if minDR − d(q, root) ≥ ε

?? maxDL=10, minDR=20, d(q,root)=10, ε=10. Which sub-tree(s) do we check?

?? maxDL=10, minDR=20, d(q,root)=10: for what ε do we have to check both trees?
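A sketch of the pruning test, applied to the index node of the VP-tree example (maxDL = 34, minDR = 96) with some made-up query distances and ε = 10; the function name is my own:

```python
def subtrees_to_search(d_q_root, maxDL, minDR, eps):
    """VP-tree range query: decide which children of an index node to visit."""
    search_left = not (d_q_root - maxDL >= eps)    # prune left if d(q,root) - maxDL >= eps
    search_right = not (minDR - d_q_root >= eps)   # prune right if minDR - d(q,root) >= eps
    return search_left, search_right

# The node from the VP-tree example: maxDL = 34, minDR = 96.
print(subtrees_to_search(d_q_root=50, maxDL=34, minDR=96, eps=10))   # (False, False): prune both
print(subtrees_to_search(d_q_root=40, maxDL=34, minDR=96, eps=10))   # (True, False): left only
print(subtrees_to_search(d_q_root=90, maxDL=34, minDR=96, eps=10))   # (False, True): right only
```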


Summary

1. Frequent pattern & association

2. Clustering

3. Classification

4. Data warehousing

5. Additional