51
Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University of Minnesota Research Supported by NSF, DOE, Army Research Office, AHPCRC/ARL http://www. cs . umn . edu /~han Joint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh

Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Embed Size (px)

DESCRIPTION

Why Mine Data? Commercial ViewPoint... zLots of data is being collected and warehoused. zComputing has become affordable. zCompetitive Pressure is Strong yProvide better, customized services for an edge. yInformation is becoming product in its own right.

Citation preview

Page 1: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Scalable Parallel Data Mining

Eui-Hong (Sam) Han

Department of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of Minnesota

Research Supported by NSF, DOE, Army Research Office, AHPCRC/ARL

http://www.cs.umn.edu/~han

Joint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh

Page 2: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

What is Data Mining?Many Definitions

Search for Valuable Information in Large Volumes of Data.

Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.

A Step in the KDD Process…

Page 3: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Why Mine Data? Commercial ViewPoint...Lots of data is being collected and

warehoused.Computing has become affordable.Competitive Pressure is Strong

Provide better, customized services for an edge.

Information is becoming product in its own right.

Page 4: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Why Mine Data?Scientific Viewpoint... Data collected and stored at enormous speeds

(Gbyte/hour) remote sensor on a satellite telescope scanning the skies microarrays generating gene expression data scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data Data mining for data reduction..

cataloging, classifying, segmenting data Helps scientists in Hypothesis Formation

Page 5: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Data Mining Tasks...Classification [Predictive]Clustering [Descriptive]Association Rule Discovery [Descriptive]Sequential Pattern Discovery

[Descriptive]Regression [Predictive]Deviation Detection [Predictive]

Page 6: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Classification Example

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

categorical

categorical

continuous

class

Refund MaritalStatus

TaxableIncome Cheat

No Single 75K ?

Yes Married 50K ?

No Married 150K ?

Yes Divorced 90K ?

No Single 40K ?

No Married 80K ?10

TestSet

Training Set Model

Learn Classifier

Page 7: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Classification ApplicationDirect Marketing

Fraud Detection

Customer Attrition/Churn

Sky Survey Cataloging

Page 8: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Example Decision Tree

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

categorical

categorical

continuous

class

Refund

MarSt

TaxInc

YESNO

NO

NO

Yes No

Married Single, Divorced

< 80K > 80K

Splitting Attributes

The splitting attribute at a node is determined based on the Gini index.

Page 9: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Hunt’s MethodAn Example: Attributes: Refund (Yes, No), Marital Status (Single, Married,

Divorced), Taxable Income (Continuous) Class: Cheat, Don’t Cheat

Refund

Don’t Cheat

Yes No

MaritalStatus

Don’t Cheat

Cheat

Single,Divorced Married

TaxableIncome

Don’t Cheat

< 80K >= 80KDon’t Cheat

Refund

Don’t Cheat

Don’t Cheat

Yes NoRefund

Don’t Cheat

Yes No

MaritalStatus

Don’t Cheat

Cheat

Single,Divorced Married

Page 10: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Binary Attributes: Computing GINI IndexSplits into two partitionsEffect of Weighing partitions:

Larger and Purer Partitions are sought for.

N1 N2C1 0 4C2 6 0Gini=0.000

N1 N2C1 3 4C2 3 0Gini=0.300

N1 N2C1 4 2C2 4 0Gini=0.400

N1 N2C1 6 2C2 2 0Gini=0.300

True?

Yes No

Node N1 Node N2

Page 11: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Continuous Attributes: Computing Gini Index... For efficient computation: for each attribute,

Sort the attribute on values Linearly scan these values, each time updating the count matrix and

computing gini index Choose the split position that has the least gini index

Cheat No No No Yes Yes Yes No No No No

Taxable Income

60 70 75 85 90 95 100 120 125 220

55 65 72 80 87 92 97 110 122 172 230<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >

Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Sorted ValuesSplit Positions

Page 12: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Need for Parallel Formulations

Need to handle very large datasets.Memory limitations of sequential computers

cause sequential algorithms to make multiple expensive I/O passes over data.

Need for scalable, efficient (fast) data mining computations gain competitive advantage. Handle larger data for greater accuracy in

shorter times.

Page 13: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Constructing a Decision Tree in Parallel

Partitioning of data only– global reduction per

node is required– large number of

classification tree nodes gives high communication cost

n recordsm categorical attributes

Good Bad

35 50

20 5

family

sport

Page 14: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Synchronous Tree Construction Approach

+ No data movement is required

– Load imbalance• can be eliminated by

breadth-first expansion

– High communication cost• becomes too high in lower

parts of the tree

Partition Data Across Processors

Page 15: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Constructing a Decision Tree in Parallel

Partitioning of classification tree nodes– natural concurrency– load imbalance as the

amount of work associated with each node varies

– child nodes use the same data as used by parent node

– loss of locality– high data movement

cost

7,000 records

10,000 training records

3,000 records

2,000 5,000 2,000 1,000

Page 16: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Partitioned Tree Construction Approach

+ Highly concurrent

- High communication cost due to excessive data movements

- Load imbalance

Partition Data and Nodes

Page 17: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Hybrid Parallel Formulation

Page 18: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Load Balancing

Page 19: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Splitting CriterionSwitch to Partitioned Tree Construction

when

Splitting criterion ensures

– G. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994

BalancingLoadCostMovingCostionCommunicat

optimalCostionCommunicatCostionCommunicat 2

Page 20: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Experimental ResultsData set

– function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96)

– 2 class labels, 3 categorical and 6 continuous attributes

IBM SP2 with 128 processors– 66.7 MHz CPU with 256 MB real memory– AIX version 4– high performance switch

Page 21: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Speedup Comparison of the Three Parallel Algorithms

0.8 million examples 1.6 million examples

Page 22: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Splitting Criterion Verification in the Hybrid Algorithm

BalancingLoadCostMovingCostionCommunicat

RatioCriterionSplitting

0.8 million examples on 8 processors 1.6 million examples on 16 processors

Page 23: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Speedup of the Hybrid Algorithm with Different Size Data Sets

Page 24: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Summary of Algorithms for Categorical Attributes Synchronous Tree Construction Approach

– no data movement required– high communication cost as tree becomes bushy

Partitioned Tree Construction Approach– processors work independently once partitioned

completely– load imbalance and high cost of data movement

Hybrid Algorithm– combines good features of two approaches– adapts dynamically according to the size and shape of

trees

Page 25: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Handling Continuous AttributesSort continuous attributes at each node of

the tree (as in C4.5). Expensive, hence Undesirable!

Discretize continuous attributes CLOUDS (Alsabti, Ranka, and Singh, 1998) SPEC (Srivastava, Han, Kumar, and Singh, 1997)

Use a pre-sorted list for each continuous attributes SPRINT (Shafer, Agrawal, and Mehta, VLDB’96) ScalParC (Joshi, Karypis, and Kumar, IPPS’98)

Page 26: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Association Rule Discovery: DefinitionGiven a set of records each of which

contain some number of items from a given collection; Produce dependency rules which will predict

occurrence of an item based on occurrences of other items.TID Items

1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}

Page 27: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Association Rule Discovery ApplicationMarketing and Sales Promotion

Supermarket Shelf Management

Inventory Management

Page 28: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Association Rule Discovery: Support and Confidence

TID Items

1 Bread, Milk2 Beer, Diaper, Bread, Eggs3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Bread, Diaper, Milk

yX ,s

))yX,((||

)yX( PsT

s

))X|y((|)X()yX( P

Association Rule:

Support:

Confidence:

Example:Beer}MilkDiaper,{ ,s

4.052

nsTransactio ofNumber Total)BeerMilk,Diaper,(

s

66.0|)MilkDiaper,(

)BeerMilk,Diaper,(

Page 29: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Handling Exponential Complexity

Given n transactions and m different items: number of possible association rules: computation complexity:

Systematic search for all patterns, based on support constraint [Agarwal & Srikant]: If {A,B} has support at least, then both A and B

have support at least If either A or B has support less than , then {A,B} has

support less than . Use patterns of size k-1 to find patterns of size k.

)2( 1mmO)2( mnmO

Page 30: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Illustrating Apriori PrincipleItem Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1

Items (1-itemsets)

Itemset Count{Bread,Milk} 3{Bread,Beer} 2{Bread,Diaper} 3{Milk,Beer} 2{Milk,Diaper} 3{Beer,Diaper} 3

Pairs (2-itemsets)

Triplets (3-itemsets)Itemset Count{Bread,Milk,Diaper} 3{Milk,Diaper,Beer} 2

Minimum Support = 3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41

With support-based pruning,6 + 6 + 2 = 14

Page 31: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Counting CandidatesFrequent Itemsets are found by counting

candidates.Simple way:

Search for each candidate in each transaction.Expensive!!!

TransactionsCandidates

N M

Page 32: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Association Rule Discovery: Hash tree for fast access.

1 5 9

1 4 5 1 3 63 4 5 3 6 7

3 6 83 5 63 5 76 8 9

2 3 45 6 7

1 2 44 5 7

1 2 54 5 8

1,4,7

2,5,8

3,6,9

Hash Function Candidate Hash Tree

Page 33: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Formulation of Association Rules

Need: Huge Transaction Datasets (10s of TB) Large Number of Candidates.

Data Distribution: Partition the Transaction Database, or Partition the Candidates, or Both

Page 34: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Count Distribution (CD)Each Processor has complete candidate

hash tree.Each Processor updates its hash tree with

local data.Each Processor participates in global

reduction to get global counts of candidates in the hash tree.

Multiple database scans are required if the hash tree is too big to fit in the memory.

Page 35: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

CD: Illustration

{5,8}

2

{3,4}{2,3}{1,3}{1,2}

5372 {5,8}

7

{3,4}{2,3}{1,3}{1,2}

3119 {5,8}

0

{3,4}{2,3}{1,3}{1,2}

2826

P0 P1 P2

Global Reduction of Counts

N/p N/p N/p

Page 36: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Data Distribution (DD)Candidate set is partitioned among the

processors.Once local data has been partitioned, it is

broadcast to all other processors.High Communication Cost due to data

movement.Redundant work due to multiple traversals

of the hash trees.

Page 37: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

DD: Illustration

All-to-All Broadcast of Candidates

9{1,3}{1,2}

10 {3,4}{2,3} 12

10{5,8} 17

P0 P1 P2

N/p N/p N/pRemote

DataRemote

DataRemote

Data DataBroadcastCount Count Count

Page 38: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Intelligent Data Distribution (IDD)

Data Distribution using point-to-point communication.

Intelligent partitioning of candidate sets. Partitioning based on the first item of candidates. Bitmap to keep track of local candidate items.

Pruning at the root of candidate hash tree using the bitmap.

Suitable for single data source such as database server.

With smaller candidate set, load balancing is difficult.

Page 39: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

IDD: Illustration

All-to-All Broadcast of Candidates

9{1,3}{1,2}

10 {3,4}{2,3} 12

10{5,8} 17

P0 P1 P2

N/p N/p N/pRemote

DataRemote

DataRemote

Data

1 2,3 5bitmask

DataShift

Count Count Count

Page 40: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Filtering Transactions in IDD

1 5 9

1 4 5 1 3 63 4 5 3 6 7

3 6 83 5 63 5 76 8 9

2 3 45 6 7

1 2 44 5 7

1 2 54 5 8

Skipped!3 5 62 +1 + 2 3 5 6

5 63 +

bitmask 1,3,5 1 2 3 5 6 transaction

Page 41: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Hybrid Distribution (HD)Candidate set is partitioned into G groups

to just fit in main memory Ensures Good load balance with smaller

candidate set.Logical processor mesh G x P/G is formed.Perform IDD along the column processors

Data movement among processors is minimized.

Perform CD along the row processors Smaller number of processors is global

reduction operation.

Page 42: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

HD: Illustration

C0

C1

C2

C0

C1

C2

C0

C1

C2

N/(P/G) N/(P/G) N/(P/G) CDalongRows

IDD

alo

ng C

olum

nsA

ll-to

-All

Bro

adca

st o

f Can

dida

tes

G G

roup

s of

Pro

cess

ors

P/G Processors per Group

N/P

N/P

N/P N/P

N/P

N/P N/P

N/P

N/P

Page 43: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Experimental Setup 128-processor Cray T3E

600 MHz DEC Alpha (EV4) 512MB of main memory per processor 3-D torus interconnection network with peak

unidirectional bandwidth of 430 MB/sec. MPI used for communications. Synthetic data set: avg transaction size 15 and

1000 distinct items. For larger data sets, multiple read of transactions

in blocks of 1000. HD switch to CD after 90.7% of the total

computation is done.

Page 44: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Scaleup Results (50K, 0.1%)

Page 45: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Speedup Results (N=1.3 million, M=0.7 million)

Page 46: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Sizeup Results (P=64, M=0.7 million)

Page 47: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Response Time with Varying Candidate Size (P=64, N=1.3 million)

Page 48: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

SP2 Response Time with Varying Candidate Size (P=64, N=1.3 million)

Page 49: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Parallel Association Rules: Summary of Experiments

HD shows the same linear speedup and sizeup behavior as that of CD.

HD Exploits Total Aggregate Main Memory, while CD does not.

IDD has much better scaleup behavior than DD

Page 50: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

SummaryData mining is a rapidly growing field

Fueled by enormous data collection rates, and need for intelligent analysis for business and scientific gains.

Large and high-dimensional nature data requires new analysis techniques and algorithms.

Scalable, fast parallel algorithms are becoming indispensable.

Many research and commercial opportunities!!!

Page 51: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University

Collaborators George Karypis and Vipin Kumar

Department of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of Minnesota

Anurag Srivastava

Digital Impact

Vineet Singh

Hewlett Packard Laboratories

http://www.cs.umn.edu/~han