Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets

Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo

20th International Conference on Computer and Their Applications, March 16-18, 2005, New Orleans, Louisiana



20th International Conference on CATA 2005

Outline:

Introduction

Motivation

Contribution

P-tree Vertical Data Structure

Vertical Set Square Distance (VSSD)

Experimental Results

Conclusion and Future Work

Introduction

Determining the closeness (distance) of a point to another point, or to a set of points (average distance), is required in many data mining tasks. For example:

Distance-based clustering: distance determines the cluster in which a given point should be placed, e.g. K-means.

Nearest-neighbor or case-based classification: the class label assigned to an unclassified point is the majority class label among its nearest neighbors, where "nearest" means smallest distance.

Introduction (cont)

One way to measure the distance from a given point to a set of points is to examine the actual total separation of the set of points from the point in question.

However, if the examination is done point by point and the set is very large, scanning the entire space makes the approach costly, slow and non-scalable.

In this paper, we introduce a new technique called the Vertical Set Square Distance (VSSD) that can scalably, quickly and accurately compute the total length of separation (total variation) of a set of points about a point.

Motivation

In data mining, efforts have focused on finding techniques that scale to datasets of large cardinality, because of the explosive growth of data stored in digital form.

Most existing approaches for measuring the closeness of points are slow and computationally intensive (often relying on sampling to become usable).

P-tree technology, a vertical data structure, is well suited to addressing this curse of cardinality.

Our Contributions

We introduce a new vertical technique that scalably, quickly and accurately computes the total length of separation of a set of points about a fixed point.

We present empirical results on real and synthetic datasets and give a one-to-one comparison, in terms of scalability and speed (time), between our new method, which uses the vertical approach (P-tree vertical data structure), and a method that uses the horizontal (record-based) approach.

P-Tree Vertical Data Structure

The P-tree vertical data representation consists of set structures that represent the data column-by-column rather than row-by-row (as relational data is stored).

P-trees are typically constructed from an existing relational table by decomposing each attribute into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute). For raw data coming from a sensor platform, the construction is done directly, without first creating a horizontal relational table. Note that many sensor platforms (e.g., RSI sensors) produce raw vertical data.

The Construction of P-Tree

The construction of the P-tree:

1. Vertically project each attribute.
2. Vertically project each bit position.
3. Compress each bit slice into a P-tree.

Logical operations are AND (∧), OR (∨) and complement (').

The root count operation is the count of 1-bits in the P-tree that results from the logical operations.

[Figure: an example relation R(A1, A2, A3, A4) of eight tuples, shown in decimal and as 3-bit binary values; its twelve vertical bit slices R11, R12, R13, ..., R41, R42, R43; and the P-trees P11, P12, P13, ..., P41, P42, P43 obtained by compressing each slice.]

The binary form of the example relation:

R[A1]  R[A2]  R[A3]  R[A4]
010    111    110    001
011    111    110    000
010    110    101    001
010    111    101    111
101    010    001    100
010    010    001    101
111    000    001    100
111    000    001    100
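The three construction steps and the root count can be illustrated with a minimal Python sketch. It assumes uncompressed bit slices held in plain Python integers (bit t of a slice belongs to tuple t), so it skips the compression step that turns a slice into a real P-tree. The names `bit_slices` and `root_count` are hypothetical helpers, not part of any P-tree library, and the rows are read off the binary form of the example relation.

```python
# Sketch only: Python ints stand in for uncompressed bit slices.
# bit_slices and root_count are hypothetical helper names.

BITS = 3  # bits per attribute in the example relation

# Example relation, written as 3-bit binary values per attribute.
R = [
    (0b010, 0b111, 0b110, 0b001),
    (0b011, 0b111, 0b110, 0b000),
    (0b010, 0b110, 0b101, 0b001),
    (0b010, 0b111, 0b101, 0b111),
    (0b101, 0b010, 0b001, 0b100),
    (0b010, 0b010, 0b001, 0b101),
    (0b111, 0b000, 0b001, 0b100),
    (0b111, 0b000, 0b001, 0b100),
]


def bit_slices(rows, bits):
    """Return slices[i][j]: an int whose bit t is bit j of attribute i in tuple t."""
    n_attrs = len(rows[0])
    slices = [[0] * bits for _ in range(n_attrs)]
    for t, row in enumerate(rows):
        for i, value in enumerate(row):
            for j in range(bits):
                if (value >> j) & 1:
                    slices[i][j] |= 1 << t
    return slices


def root_count(mask):
    """Count the 1-bits of a bit vector."""
    return bin(mask).count("1")


slices = bit_slices(R, BITS)
all_rows = (1 << len(R)) - 1        # mask with one 1-bit per tuple

high_a1 = slices[0][BITS - 1]       # high-order bit of A1 (R11 in the figure)
high_a2 = slices[1][BITS - 1]       # high-order bit of A2 (R21 in the figure)

print(root_count(high_a1))               # tuples whose A1 high bit is set
print(root_count(high_a1 & high_a2))     # AND of two slices
print(root_count(high_a1 | high_a2))     # OR of two slices
print(root_count(all_rows & ~high_a1))   # complement, restricted to the 8 tuples
```

A real P-tree additionally compresses each slice into a tree, so that large runs of identical bits can be ANDed and counted without touching every bit; the integer masks here only mimic the logical behaviour.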

Vertical Set Square Distance

Binary Representation

Let x be a numerical value in the domain of attribute Ai of relation R(A1, A2, …, Ad). Written as a b-bit binary number, x is the left-hand side of the equation below.

$$x_{i,b-1}\, x_{i,b-2} \cdots x_{i,0} = \sum_{j=b-1}^{0} 2^{j} x_{i,j}$$

The first subscript of x is the index of the attribute to which x belongs. The second subscript indicates the bit position. The summation on the right-hand side is equal to the actual value of x as a decimal number.

Vertical Set Square Dist. (Cont)

Let X be a set of vectors in a relation R(A1, A2, …, Ad), let PX be the P-tree class mask of X, let x ∈ X be a vector, and let a be a target vector (also in d-dimensional space). Then the vectors x and a can be written in binary as:

$$x = (x_{1,b-1} \cdots x_{1,0},\; x_{2,b-1} \cdots x_{2,0},\; \ldots,\; x_{d,b-1} \cdots x_{d,0})$$

$$a = (a_{1,b-1} \cdots a_{1,0},\; a_{2,b-1} \cdots a_{2,0},\; \ldots,\; a_{d,b-1} \cdots a_{d,0})$$

The total variation of a set of vectors in X about a vector a can be computed vertically using the Vertical Set Square Distance (VSSD), defined as follows:

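Vertical Set Square Dist. (Cont)

$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$

$$= \sum_{x \in X} \left( \sum_{i=1}^{d} x_i^2 - 2 \sum_{i=1}^{d} x_i a_i + \sum_{i=1}^{d} a_i^2 \right)$$

$$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2$$

$$= T_1 + T_2 + T_3$$

where $T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2$, $T_2 = -2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i$ and $T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2$.

Terms T1, T2 and T3 can be computed separately, and their sum is the total variation: the sum of the squared lengths of separation of the vectors connecting X and a.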
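As a quick check of the decomposition, here is a small worked example. The set X and the target a are made up for illustration and do not come from the experiments. Take X = {(1, 2), (3, 0)} and a = (1, 1), so d = 2.

$$\sum_{x \in X} \sum_{i=1}^{2} (x_i - a_i)^2 = (1-1)^2 + (2-1)^2 + (3-1)^2 + (0-1)^2 = 0 + 1 + 4 + 1 = 6$$

$$T_1 = 1^2 + 2^2 + 3^2 + 0^2 = 14$$

$$T_2 = -2\,(1 \cdot 1 + 2 \cdot 1 + 3 \cdot 1 + 0 \cdot 1) = -12$$

$$T_3 = 2 \cdot (1^2 + 1^2) = 4$$

$$T_1 + T_2 + T_3 = 14 - 12 + 4 = 6$$

Both routes give the same total variation. Only T2 and T3 involve a, and they involve it only through per-attribute sums over X and the size of X.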

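Vertical Set Square Dist. (Cont)

$$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^{j} x_{i,j} \right)^2$$

$$= \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^{j} x_{i,j} \right) \left( \sum_{k=b-1}^{0} 2^{k} x_{i,k} \right)$$

$$= \sum_{x \in X} \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} x_{i,j} x_{i,k}$$

$$= \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \sum_{x \in X} x_{i,j} x_{i,k}$$

$$= \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k}\, rc(P_X \wedge P_{i,j} \wedge P_{i,k})$$

Or we can write T1 expressing the diagonal terms (j = k) separately (noting also that $x_{i,j}^2 = x_{i,j}$, as they are bits):

$$T_1 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \Bigl[ 2^{2j}\, rc(P_X \wedge P_{i,j}) + \sum_{\substack{k = 2j, \ldots, j+1 \\ l = j-1, \ldots, 0 \\ j \neq 0}} 2^{k}\, rc(P_X \wedge P_{i,j} \wedge P_{i,l}) \Bigr]$$

In the inner sum, k and l vary together with k = j + l + 1, so each off-diagonal pair (j, l) with l < j is counted once with weight 2 · 2^(j+l).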

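Vertical Set Square Dist. (Cont)

$$T_2 = -2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = -2 \sum_{x \in X} \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^{j} x_{i,j}$$

$$= -2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^{j} \sum_{x \in X} x_{i,j}$$

$$= -2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^{j}\, rc(P_X \wedge P_{i,j})$$

And T3 is simply:

$$T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \sum_{i=1}^{d} a_i^2 = rc(P_X)\,(a \circ a)$$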
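The formulas for T1, T2 and T3 can be exercised with a short Python sketch. As an assumption for illustration, each P-tree is replaced by an uncompressed Python integer bit mask (bit t belongs to tuple t), and `bit_slices`, `rc`, `vssd` and `brute_force` are hypothetical names. The sample rows and target are made up. The brute-force function is included only to confirm that the vertical computation gives the same number as a row-by-row scan.

```python
# Sketch only: integer bit masks stand in for P-trees; all names are hypothetical.

def bit_slices(rows, bits):
    """slices[i][j] has bit t set when bit j of attribute i is 1 in tuple t."""
    slices = [[0] * bits for _ in range(len(rows[0]))]
    for t, row in enumerate(rows):
        for i, value in enumerate(row):
            for j in range(bits):
                if (value >> j) & 1:
                    slices[i][j] |= 1 << t
    return slices


def rc(mask):
    """Root count: number of 1-bits."""
    return bin(mask).count("1")


def vssd(slices, px, a, bits):
    """Total variation of the tuples selected by mask px about target a."""
    t1 = 0
    t2 = 0
    for i, a_i in enumerate(a):
        for j in range(bits):
            # T2 term: -2 * a_i * 2^j * rc(PX ^ P_ij)
            t2 -= 2 * a_i * (1 << j) * rc(px & slices[i][j])
            for k in range(bits):
                # T1 term: 2^(j+k) * rc(PX ^ P_ij ^ P_ik)
                t1 += (1 << (j + k)) * rc(px & slices[i][j] & slices[i][k])
    t3 = rc(px) * sum(a_i * a_i for a_i in a)
    return t1 + t2 + t3


def brute_force(rows, px, a):
    """Row-by-row reference: sum of squared distances over the selected tuples."""
    return sum(
        sum((x_i - a_i) ** 2 for x_i, a_i in zip(row, a))
        for t, row in enumerate(rows)
        if (px >> t) & 1
    )


rows = [(2, 7), (3, 7), (5, 2), (7, 0)]   # made-up 3-bit data, d = 2
slices = bit_slices(rows, 3)
a = (4, 3)                                 # made-up target vector
everything = 0b1111                        # class mask selecting all four tuples
subset = 0b0101                            # class mask selecting tuples 0 and 2

print(vssd(slices, everything, a, 3), brute_force(rows, everything, a))
print(vssd(slices, subset, a, 3), brute_force(rows, subset, a))
```

The class mask plays the role of PX: changing the mask changes which tuples contribute, without touching the slices themselves.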

Vertical Set Square Dist. (Cont)

The root counts are independent of the target vector a. They comprise the root count of a single P-tree operand, rc(PX); the root counts of two P-tree operands,

$$\sum_{i=1}^{d} \sum_{j=b-1}^{0} rc(P_X \wedge P_{i,j})$$

and the root counts of three P-tree operands,

$$\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{\substack{l = j-1, \ldots, 0 \\ j \neq 0}} rc(P_X \wedge P_{i,j} \wedge P_{i,l})$$

which appear in T1, T2 and T3.

Vertical Set Square Dist. (Cont)

This independence allows us to pre-compute the root counts once, in advance, during the construction of the P-trees, and to reuse them as needed, regardless of the number of target vectors a.

This amortizes the cost of P-tree ANDing for high-value datasets, e.g. cancer analysis, RSI, etc.
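The pre-computation idea can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: integer bit masks replace P-trees, the data and targets are made up, and `precompute` and `total_variation` are hypothetical names. Everything that needs a root count is done once per class mask; each target vector then costs only a few multiplications.

```python
# Sketch only: integer bit masks stand in for P-trees; all names are hypothetical.

def bit_slices(rows, bits):
    """slices[i][j] has bit t set when bit j of attribute i is 1 in tuple t."""
    slices = [[0] * bits for _ in range(len(rows[0]))]
    for t, row in enumerate(rows):
        for i, value in enumerate(row):
            for j in range(bits):
                if (value >> j) & 1:
                    slices[i][j] |= 1 << t
    return slices


def rc(mask):
    """Root count: number of 1-bits."""
    return bin(mask).count("1")


def precompute(slices, px, bits):
    """Do every root count once; nothing here depends on a target vector."""
    count = rc(px)            # rc(PX)
    t1 = 0                    # T1 does not involve a at all
    attr_sums = []            # attr_sums[i] = sum_j 2^j * rc(PX ^ P_ij)
    for attr in slices:
        s = 0
        for j in range(bits):
            s += (1 << j) * rc(px & attr[j])
            for k in range(bits):
                t1 += (1 << (j + k)) * rc(px & attr[j] & attr[k])
        attr_sums.append(s)
    return count, t1, attr_sums


def total_variation(pre, a):
    """Combine the stored counts with one target vector; no root counts needed."""
    count, t1, attr_sums = pre
    t2 = -2 * sum(a_i * s for a_i, s in zip(a, attr_sums))
    t3 = count * sum(a_i * a_i for a_i in a)
    return t1 + t2 + t3


rows = [(2, 7), (3, 7), (5, 2), (7, 0)]    # made-up 3-bit data, d = 2
pre = precompute(bit_slices(rows, 3), 0b1111, 3)

for target in [(4, 3), (0, 0), (7, 7)]:    # made-up target vectors
    print(target, total_variation(pre, target))
```

The loop over targets never looks at the rows or the slices again, which is the reuse the slide describes.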

Horizontal Set Square Distance (HSSD) as a comparison method

Let X be a set of vectors in R(A1, A2, …, Ad), let x = (x1, x2, …, xd) be a vector belonging to X, and let a = (a1, a2, …, ad) be a target vector.

Then the horizontal set square distance (HSSD) is defined as follows:

$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$
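The HSSD definition maps directly onto a record-by-record scan. The sketch below is a minimal Python rendering of that formula; the function name `hssd` and the sample data are assumptions for illustration.

```python
def hssd(rows, a):
    """Sum of squared distances from every row to the target vector a."""
    total = 0
    for x in rows:                      # one full pass over the records
        for x_i, a_i in zip(x, a):
            total += (x_i - a_i) ** 2
    return total


rows = [(2, 7), (3, 7), (5, 2), (7, 0)]   # made-up data
print(hssd(rows, (4, 3)))
```

Every new target vector requires another pass over all rows, so the cost per target grows with the cardinality of X.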

Experimental Results

The experiments were conducted using both real and synthetic datasets.

The goal is to compare the execution time (speed) and scalability of our algorithm, which employs a vertical approach (vertical data structure and horizontal bitwise AND operation), with a horizontal approach (horizontal data structure and vertical scan operation).

The performance of both algorithms was observed on machines with different specifications, including an SGI Altix machine.

Experimental Results (Cont.)

The specification of machines used in the experiments.

Machine      Specification
AMD 1GB      AMD Athlon K7 1.4GHz, 1GB RAM
P4 2GB       Intel P4 2.4GHz processor, 2GB RAM
SGI Altix    SGI Altix CC-NUMA 12 processor shared memory (12 x 4 GB RAM)

Datasets

A set of aerial photographs from the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, ND.

The original image (dataset) is 1024x1024 pixels (cardinality of 1,048,576) and contains 3 bands (RGB) plus synchronized data for soil moisture, soil nitrate and crop yield (6 attributes in total).

Datasets (Cont.)

The datasets contain 4 different yield classes:

Low Yield (0 < intensity <= 63)
Medium Low Yield (63 < intensity <= 127)
Medium High Yield (127 < intensity <= 191)
High Yield (191 < intensity <= 255)

With super sampling, five synthetic datasets are created:

2,097,152 rows
4,194,304 rows (2048x2048 pixels)
8,388,608 rows
16,777,216 rows (4096x4096 pixels)
25,160,256 rows (5016x5016 pixels)
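The yield class ranges can be written as a small lookup, shown here as a Python sketch. The function name `yield_class` is hypothetical. Because the stated ranges start at 0 < intensity, the sketch returns None for an intensity of 0 or for anything above 255; how the original study treated such values is not stated.

```python
def yield_class(intensity):
    """Map a yield intensity to its class, following the ranges on the slide."""
    if 0 < intensity <= 63:
        return "Low Yield"
    if 63 < intensity <= 127:
        return "Medium Low Yield"
    if 127 < intensity <= 191:
        return "Medium High Yield"
    if 191 < intensity <= 255:
        return "High Yield"
    return None  # outside the stated ranges (assumption: left unclassified)


for value in (1, 63, 64, 191, 192, 255):
    print(value, yield_class(value))
```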

Timing and Scalability Results

Observation 1. The first performance evaluation was done on the P4 machine with 2 GB of RAM, using synthetic datasets of 4,194,304 and 8,388,608 tuples.

We use 100 arbitrary unclassified target vectors (resulting in 400 classification computations for the predicted yield).

Datasets larger than 8,388,608 rows cannot be processed on this machine because Horizontal Set Square Distance (HSSD) runs out of memory.

Results

[Figure: VSSD vs HSSD time comparison using 100 test cases on the 4,194,304-row dataset. X-axis: test case ID (0 to 100); y-axis: time (in seconds). Series: VSSD, HSSD.]

[Figure: VSSD vs HSSD time comparison using 100 test cases on the 8,388,608-row dataset. X-axis: test case ID (0 to 100); y-axis: time (in seconds). Series: VSSD, HSSD.]

Average time to compute total variation for each test case (seconds):

Dataset size (# of rows)    VSSD      HSSD
4,194,304                   0.0003    7.3800
8,388,608                   0.0004    15.1600

Timing and Scalability Results

Observation 1 (continued)

Time (seconds). The VSSD column is root count pre-computation and P-tree loading; the HSSD column is horizontal dataset loading.

Dataset (rows)    VSSD      HSSD
1,048,576         3.900     4.974
2,097,152         8.620     10.470
4,194,304         18.690    19.914
8,388,608         38.450    39.646

This table compares the time required to load the vertical data structure into memory and perform the one-time root count operations for VSSD with the time required to load the horizontal records into memory for HSSD.

Timing and Scalability Results

Observation 2

This table shows the algorithms' timing and scalability when they are executed on different machines.

Average running time to compute total variation for each test case (seconds). A dash means no result is reported; the chart for this observation labels these runs as out of memory.

Cardinality    HSSD on    HSSD on    HSSD on SGI     VSSD on
of dataset     AMD 1GB    P4 2GB     Altix 12x4GB    AMD 1GB
1,048,576      2.2000     1.8400     5.4800          0.0003
2,097,152      4.4100     3.6400     8.3200          0.0003
4,194,304      8.5800     7.3800     15.8640         0.0004
8,388,608      -          15.1600    33.9000         0.0004
16,777,216     -          -          66.5400         0.0004
25,160,256     -          -          115.2040        0.0004

Timing and Scalability Results

Observation 2 (continued)

[Figure: VSSD vs HSSD time comparison using 100 different test cases running on different types of machines. X-axis: number of tuples (1024^2), 0 to 24; y-axis: time (seconds), 0 to 120. Series: VSSD on AMD-1G, HSSD on AMD-1G, HSSD on P4-2G, HSSD on SGI-48G. Annotation: "Out of Memory".]

Conclusion

Vertical Set Square Distance (VSSD) is a fast and accurate way to compute total variation for classification, clustering and rule mining, and it scales well to very large datasets compared to the traditional horizontal approach (HSSD).

The complexity of VSSD is O(d · b²), where d is the number of dimensions and b is the maximum bit-width of the attributes (it depends on the width of the dataset only).

VSSD is very fast because the root counts of the P-tree operands are independent of the target vector a, which allows the counts to be pre-computed once, in advance, during the construction of the P-trees.
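The O(d · b²) bound can be made concrete by counting the distinct root counts needed for one class mask PX, following the diagonal form of T1. The figures d = 6 and b = 8 below are an assumption for illustration: six attributes matches the dataset description, and 8 bits is suggested by the 0 to 255 intensity range, but the bit-width of the other attributes is not stated.

$$\underbrace{1}_{rc(P_X)} \;+\; \underbrace{d \cdot b}_{rc(P_X \wedge P_{i,j})} \;+\; \underbrace{d \cdot \frac{b(b-1)}{2}}_{rc(P_X \wedge P_{i,j} \wedge P_{i,l}),\; l < j} \;=\; 1 + d \cdot \frac{b(b+1)}{2}$$

$$d = 6,\; b = 8: \qquad 1 + 6 \cdot \frac{8 \cdot 9}{2} = 1 + 216 = 217$$

The count grows with d and b² but does not contain the number of rows, which is consistent with the flat VSSD timings reported across dataset sizes.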

Future Work

A comprehensive study of the Vertical Set Square Distance (VSSD) in many data mining tasks, such as classification, clustering or outlier detection.

For classification, using VSSD in the voting phase has already been shown to greatly accelerate class assignment, since the class votes can be calculated in one computation without having to visit each individual point, as the horizontal approach must.

Thank You…