[Paper Study] ICDE 2008: Spiros Papadimitriou and Jimeng Sun, IBM T.J. Watson Research Center, NY, USA
DisCo: Distributed Co-clustering with Map-Reduce
2008 IEEE International Conference on Data Engineering (ICDE)
Tzu-Li Tai, Tse-En Liu
Kai-Wei Chan, He-Chuan Hoh
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory
IBM T.J. Watson Research Center, NY, USA
S. Papadimitriou, J. Sun
Agenda
A. Motivation
B. Background: Co-Clustering + MapReduce
C. Proposed Distributed Co-Clustering Process
D. Implementation Details
E. Experimental Evaluation
F. Conclusions
G. Discussion
Motivation
Fast Growth in Volume of Data
• Google processes 20 petabytes of data per day
• Amazon and eBay handle petabytes of transactional data every day
Highly variant structure of data
• Data sources naturally generate data in impure forms
• Unstructured, semi-structured
Motivation
Problems with Big Data mining for DBMSs
• Significant preprocessing costs for the majority of data mining tasks
• DBMSs lack performance on large amounts of data
Motivation
Why distributed processing can solve the issues:
• MapReduce is agnostic to the schema or form of the input data
• Many preprocessing tasks are naturally expressible in MapReduce
• Highly scalable with commodity machines
Motivation
Contributions of this paper:
• Presents the whole end-to-end process for distributed data mining
• Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce
Background: Co-Clustering
• Also named biclustering or two-mode clustering
• Input: a matrix of m rows and n columns
• Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
Background: Co-Clustering
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Why Co-Clustering?
Student A
Student B
Student C
Student D
Traditional Clustering:
A C
B D
Can only know that students A & C / B & D have similar scores
Background: Co-Clustering
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Why Co-Clustering?
Student A
Student B
Student C
Student D
1 1 0 0 0
1 1 0 0 0
0 0 1 1 1
0 0 1 1 1
Student D
Student B
Student C
Student A
Co-Clustering:
Cluster 1 Cluster 2
Good at Science + Math
Good at English + Chinese + Social Studies
B & D A & C
Rows that have similar properties for a subset of selected columns
Background: Co-Clustering

Another Co-Clustering Example: Animal Data

(figures: animal × feature matrix co-clustering example)
Background: MapReduce
The MapReduce Paradigm
Map: (k1, v1) → list(k2, v2)
Shuffle & sort: group values by key → (k2, [v2])
Reduce: (k2, [v2]) → list(k3, v3)

(diagram: several parallel Map tasks feeding several Reduce tasks)
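To make the flow concrete, here is a minimal sketch of map → shuffle → reduce in plain Python (not Hadoop; run_mapreduce and the word-count example are purely illustrative, not from the paper):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process imitation of one MapReduce job."""
    # Map phase: each (k1, v1) record emits zero or more (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)  # shuffle & sort: group values by k2
    # Reduce phase: each (k2, [v2]) group emits (k3, v3) results.
    results = []
    for k2, values in intermediate.items():
        results.extend(reduce_fn(k2, values))
    return results

# Toy usage: word count over numbered lines of text.
lines = [(0, "the cat"), (1, "the dog")]
print(run_mapreduce(
    lines,
    map_fn=lambda _, line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: [(word, sum(ones))],
))  # [('the', 2), ('cat', 1), ('dog', 1)]
```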
Distributed Co-Clustering Process
Mining Network Logs to Co-Cluster Communication Behavior
Distributed Co-Clustering Process
The Preprocessing Process
HDFS (raw network logs)

MapReduce Job: extract SrcIP + DstIP and build the adjacency matrix

HDFS (SrcIP × DstIP adjacency matrix: rows = source IPs, columns = destination IPs, 0/1 entries)

MapReduce Job: build adjacency list

HDFS

MapReduce Job: build transpose adjacency list

HDFS
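As a rough sketch of what the adjacency-list jobs compute (standalone Python; group_edges and the sample IP addresses are hypothetical, and the real jobs stream through HDFS rather than in-memory lists):

```python
from collections import defaultdict

def group_edges(pairs):
    """What the shuffle/reduce side does: group values under each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Each log line yields one (SrcIP, DstIP) pair in the map phase.
edges = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "10.0.0.3"),
         ("10.0.0.2", "10.0.0.3")]
adj   = group_edges(edges)                     # SrcIP -> [DstIP, ...]
adj_T = group_edges((d, s) for s, d in edges)  # transpose: DstIP -> [SrcIP, ...]
print(adj)    # {'10.0.0.1': ['10.0.0.2', '10.0.0.3'], '10.0.0.2': ['10.0.0.3']}
print(adj_T)  # {'10.0.0.2': ['10.0.0.1'], '10.0.0.3': ['10.0.0.1', '10.0.0.2']}
```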
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
r(1) = 1
r(2) = 1
r(3) = 1
r(4) = 2
Random initialization:
r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [g11 g12; g21 g22] = [4 4; 2 0]

Goal: co-cluster into 2×2 = 4 sub-matrices
→ Row Labels: 1 or 2 (k = 2)
→ Column Labels: 1 or 2 (l = 2)
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
r(1) = 1
r(2) = 1
r(3) = 1
r(4) = 2
Fix column labels, iterate through rows:
r(2) = 2
r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [g11 g12; g21 g22] = [2 4; 4 0]
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
Fix row labels, iterate through columns:
r = (1, 2, 1, 2)
c = (1, 2, 1, 2, 2)
G = [g11 g12; g21 g22] = [0 6; 4 0]
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
c(2) = 2
0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0
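A sequential sketch of this alternating relabeling loop, assuming a squared-error cost over co-cluster means (one reasonable choice for illustration; the paper's framework admits other objectives, and cocluster is our name for it):

```python
import numpy as np

def cocluster(A, k, l, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.integers(0, k, A.shape[0])   # row labels 0..k-1
    c = rng.integers(0, l, A.shape[1])   # column labels 0..l-1
    for _ in range(iters):
        # G[i, j] = mean of entries in co-cluster (i, j); empty clusters -> 0
        G = np.zeros((k, l))
        for i in range(k):
            for j in range(l):
                block = A[r == i][:, c == j]
                G[i, j] = block.mean() if block.size else 0.0
        # fix column labels, reassign every row to its cheapest label
        for x in range(A.shape[0]):
            r[x] = np.argmin([((A[x] - G[i, c]) ** 2).sum() for i in range(k)])
        # fix row labels, reassign every column likewise
        for y in range(A.shape[1]):
            c[y] = np.argmin([((A[:, y] - G[r, j]) ** 2).sum() for j in range(l)])
    return r, c

A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0]])
print(cocluster(A, 2, 2))  # groups rows (A, C) vs (B, D) with matching column split
```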
Distributed Co-Clustering Process
Co-Clustering with MapReduce
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
1 -> 2, 4, 5
2 -> 1, 3
3 -> 2, 4, 5
4 -> 1, 3
MR
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
Distributed Co-Clustering Process
Co-Clustering with MapReduce
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
1 -> 2, 4, 5
2 -> 1, 3
3 -> 2, 4, 5
4 -> 1, 3
MR
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
MapReduce Job
r, c, G random initialization based on parameters k, l
r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [4 4; 2 0]
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
Mapper Function:
For each K-V input (k, v):
1. Calculate h_k (using v and the column labels c)
2. Change the row label if doing so lowers the cost (a function of G)
3. Emit (r(k), (h_k, k))

Example, row 1:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
v = {2, 4, 5}
c = {1, 1, 1, 2, 2}
→ h1 = (1, 2)
if r(1) = 2, the cost becomes higher
→ r(1) = 1 (unchanged)
→ emit(r(k), (h_k, k)) = (1, {(1, 2), 1})
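A minimal sketch of the row-iteration mapper (row_mapper is a hypothetical name, and the squared-distance cost is only a stand-in for the paper's cost function, though it happens to reproduce the row-2 decision shown on the next slide):

```python
def row_mapper(k, v, c, G, num_row_labels):
    """Map one adjacency record (row k, nonzero columns v) to a new label."""
    num_col_labels = len(G[0])
    # h[j] = how many of this row's nonzeros fall into column group j
    h = [0] * num_col_labels
    for col in v:
        h[c[col - 1] - 1] += 1  # labels in the slides are 1-based
    # placeholder cost: squared distance between h and G's row for label i
    def cost(i):
        return sum((h[j] - G[i][j]) ** 2 for j in range(num_col_labels))
    best = min(range(num_row_labels), key=cost) + 1
    return (best, (tuple(h), k))  # emit (r(k), (h_k, k))

# Row 2 with nonzeros {1, 3}, c = (1,1,1,2,2), G = [[4,4],[2,0]]:
print(row_mapper(2, [1, 3], [1, 1, 1, 2, 2], [[4, 4], [2, 0]], 2))
# (2, ((2, 0), 2)) -- the label flips to 2, as in the next slide's example
```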
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
Mapper Function:
For each K-V input (k, v):
1. Calculate h_k (using v and the column labels c)
2. Change the row label if doing so lowers the cost (a function of G)
3. Emit (r(k), (h_k, k))

Example, row 2:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
v = {1, 3}
c = {1, 1, 1, 2, 2}
→ h2 = (2, 0)
if r(2) = 2, the cost becomes lower
→ r(2) = 2
→ emit(r(k), (h_k, k)) = (2, {(2, 0), 2})
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
(1, {(1, 2), 1})
(2, {(2, 0), 2})
(1, {(1, 2), 3})
(2, {(2, 0), 4})
(new row label, corresponding data)

(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
Distributed Co-Clustering Process
(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
Reducer Function:
For each K-V input (k, [V]):
For each (h, I) ∈ V:
1. Accumulate all h into h_k
2. I_k = union of all I
3. Emit (k, (h_k, I_k))

h1 = (1, 2) + (1, 2) = (2, 4)
I1 = {1, 3}
→ Emit (1, ((2, 4), {1, 3}))
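A matching sketch of the reducer (row_reducer is our name; the accumulation mirrors the slide's example):

```python
def row_reducer(label, values):
    """Reduce all (h, rowID) pairs emitted for one new row label."""
    g_row, rows = None, set()
    for h, row_id in values:
        # accumulate the h vectors into this label's new G row
        g_row = list(h) if g_row is None else [a + b for a, b in zip(g_row, h)]
        rows.add(row_id)  # union of the row ids that took this label
    return (label, (tuple(g_row), sorted(rows)))

# Label 1 received h = (1, 2) from rows 1 and 3:
print(row_reducer(1, [((1, 2), 1), ((1, 2), 3)]))  # (1, ((2, 4), [1, 3]))
```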
Distributed Co-Clustering Process
(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
(1, ((2, 4), {1, 3}))
(2, ((4, 0), {2, 4}))
(row label, reduced data)
r = (1, 2, 1, 2)
G = [2 4; 4 0]
c = (1, 1, 1, 2, 2)
Sync Results
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
Preprocessing:
HDFS → MapReduce Job: build transpose adjacency list → HDFS

Co-Clustering:
Random r, c, G given k, l
→ MapReduce Job: fix columns, row iteration + sync results
→ synced r, c, G with best r permutation
→ MapReduce Job: fix rows, column iteration + sync results
→ final co-clustering result with best r, c permutations
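Wiring the earlier sketches together, one row iteration plus the sync step might look like this (reuses run_mapreduce, row_mapper, and row_reducer from the sketches above; under the placeholder cost the resulting labels need not match the slides' walkthrough):

```python
def row_iteration(adj, r, c, G, k):
    """One 'fix columns, iterate rows' job followed by the sync step."""
    out = run_mapreduce(
        adj.items(),
        map_fn=lambda row, cols: [row_mapper(row, cols, c, G, k)],
        reduce_fn=lambda label, vals: [row_reducer(label, vals)],
    )
    # "sync results": install the new G rows and row labels everywhere
    for label, (g_row, rows) in out:
        G[label - 1] = list(g_row)
        for row in rows:
            r[row - 1] = label
    return r, G

adj = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}
r, c, G = [1, 1, 1, 2], [1, 1, 1, 2, 2], [[4, 4], [2, 0]]
r, G = row_iteration(adj, r, c, G, 2)
# A column iteration is the same job run over the transpose adjacency list.
```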
Implementation Details
Tuning the number of Reduce Tasks
• The number of reduce tasks is related to the number of intermediate keys produced for the shuffle and sort phase
• For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either k or l
Implementation Details
(row-iteration MapReduce job diagram, repeated from above)
k = 2 (row iteration) → 2 intermediate keys
Implementation Details
Tuning the number of Reduce Tasks
• So, for the row-iteration/column-iteration jobs, a single reduce task is enough
• However, preprocessing tasks such as graph construction generate many intermediate keys and need many more reduce tasks
Implementation Details
The Preprocessing Process
(preprocessing process diagram, repeated from above)
(SrcIP, [DstIP]): one intermediate key per distinct source IP
Experimental Evaluation
Environment
• 39 nodes across four blade enclosures, connected by Gigabit Ethernet
• Blade server:
  - CPU: two dual-core Intel Xeon 2.66 GHz
  - Memory: 8 GB
  - OS: Red Hat Enterprise Linux
• Hadoop Distributed File System (HDFS) capacity: 2.4 TB
Experimental Evaluation
Datasets
Experimental Evaluation
Preprocessing ISS Data
• Optimal values found for this workload:
  - Number of map tasks: 6
  - Number of reduce tasks: 5
  - Input split size: 256 MB
Experimental Evaluation
Co-Clustering TREC Data
• Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, better than what a single machine with 48 GB of RAM can achieve
Conclusion
• The authors share lessons learned from mining vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
• They design a general MapReduce approach for co-clustering algorithms
• They show that the MapReduce co-clustering framework scales well on large real-world datasets (ISS, TREC)
Discussion
• Necessity of the global r, c, G sync action
• Questionable scalability of DisCo
Discussion
Necessity of the global r, c, G sync action
(co-clustering MapReduce flow diagram, repeated from above)
Discussion
(row-iteration MapReduce job diagram, repeated from above)
Discussion
Questionable Scalability of DisCo
• For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed at k (or l)
• This implies that for a given k and l, as the input matrix grows, the reducer size* increases dramatically
• Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a single computing node becomes a severe bottleneck for overall performance
*reference: Upper and Lower Bounds on the Cost of a Map-Reduce Computation, VLDB 2013
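A back-of-envelope illustration of the concern (all numbers are assumptions, not measurements from the paper):

```python
m = 1_000_000_000          # rows in the input matrix (assumed)
k, l = 8, 8                # co-cluster grid (assumed)
record_bytes = 8 * l + 8   # one h vector of l counts plus a row id (assumed)
per_reducer = m / k * record_bytes  # a row iteration funnels m records into k keys
print(f"~{per_reducer / 2**30:.1f} GiB into a single reduce task")  # ~8.4 GiB
```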
Discussion
(row-iteration MapReduce job diagram, repeated from above)