[Paper Study] ICDE 2008: Spiros Papadimitriou and Jimeng Sun, IBM T.J. Watson Research Center, NY, USA
DisCo: Distributed Co-clustering with Map-Reduce
2008 IEEE International Conference on Data Engineering (ICDE)
Tzu-Li Tai, Tse-En Liu
Kai-Wei Chan, He-Chuan Hoh
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory
IBM T.J. Watson Research Center, NY, USA
S. Papadimitriou, J. Sun
Agenda
A. Motivation
B. Background: Co-Clustering + MapReduce
C. Proposed Distributed Co-Clustering Process
D. Implementation Details
E. Experimental Evaluation
F. Conclusions
G. Discussion
Motivation
Fast Growth in Volume of Data
• Google processes 20 petabytes of data per day
• Amazon and eBay handle petabytes of transactional data every day
Highly variant structure of data
• Data sources naturally generate data in impure forms
• Unstructured, semi-structured
Motivation
Problems with Big Data mining for DBMSs
• Significant preprocessing costs for the majority of data mining tasks
• DBMSs lack performance on large amounts of data
Motivation
Why distributed processing can solve the issues:
• MapReduce is agnostic to the schema or form of the input data
• Many preprocessing tasks are naturally expressible in MapReduce
• Highly scalable with commodity machines
Motivation
Contributions of this paper:
• Presents the whole end-to-end process for distributed data mining
• Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce
Background: Co-Clustering
• Also named biclustering or two-mode clustering
• Input: a matrix of m rows and n columns
• Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
Background: Co-Clustering
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Why Co-Clustering?
Student A
Student B
Student C
Student D
Traditional Clustering:
A C
B D
Can only know that students A & C / B & D have similar scores
Background: Co-Clustering
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Why Co-Clustering?
Student A
Student B
Student C
Student D
1 1 0 0 0
1 1 0 0 0
0 0 1 1 1
0 0 1 1 1
Student D
Student B
Student C
Student A
Co-Clustering:
Cluster 1 Cluster 2
Good at Science + Math
Good at English + Chinese + Social Studies
B & D A & C
Rows that have similar properties for a subset of selected columns
Background: Co-Clustering

Another Co-Clustering Example: Animal Data

(figures: animal × feature matrix co-clustering example)
Background: MapReduce
The MapReduce Paradigm
Map: (k1, v1) → list(k2, v2)
Shuffle & sort: group values by key → (k2, [v2])
Reduce: (k2, [v2]) → list(k3, v3)

(diagram: several parallel Map tasks feeding several Reduce tasks)
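To make the flow concrete, here is a minimal sketch of map → shuffle → reduce in plain Python (not Hadoop; run_mapreduce and the word-count example are purely illustrative, not from the paper):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process imitation of one MapReduce job."""
    # Map phase: each (k1, v1) record emits zero or more (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)  # shuffle & sort: group values by k2
    # Reduce phase: each (k2, [v2]) group emits (k3, v3) results.
    results = []
    for k2, values in intermediate.items():
        results.extend(reduce_fn(k2, values))
    return results

# Toy usage: word count over numbered lines of text.
lines = [(0, "the cat"), (1, "the dog")]
print(run_mapreduce(
    lines,
    map_fn=lambda _, line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: [(word, sum(ones))],
))  # [('the', 2), ('cat', 1), ('dog', 1)]
```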
Distributed Co-Clustering Process
Mining Network Logs to Co-Cluster Communication Behavior
Distributed Co-Clustering Process
The Preprocessing Process
HDFS (raw network logs)

MapReduce Job: extract SrcIP + DstIP and build the adjacency matrix

HDFS (SrcIP × DstIP adjacency matrix: rows = source IPs, columns = destination IPs, 0/1 entries)

MapReduce Job: build adjacency list

HDFS

MapReduce Job: build transpose adjacency list

HDFS
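As a rough sketch of what the adjacency-list jobs compute (standalone Python; group_edges and the sample IP addresses are hypothetical, and the real jobs stream through HDFS rather than in-memory lists):

```python
from collections import defaultdict

def group_edges(pairs):
    """What the shuffle/reduce side does: group values under each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Each log line yields one (SrcIP, DstIP) pair in the map phase.
edges = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "10.0.0.3"),
         ("10.0.0.2", "10.0.0.3")]
adj   = group_edges(edges)                     # SrcIP -> [DstIP, ...]
adj_T = group_edges((d, s) for s, d in edges)  # transpose: DstIP -> [SrcIP, ...]
print(adj)    # {'10.0.0.1': ['10.0.0.2', '10.0.0.3'], '10.0.0.2': ['10.0.0.3']}
print(adj_T)  # {'10.0.0.2': ['10.0.0.1'], '10.0.0.3': ['10.0.0.1', '10.0.0.2']}
```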
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
r(1) = 1
r(2) = 1
r(3) = 1
r(4) = 2
Random initialization:
r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [g11 g12; g21 g22] = [4 4; 2 0]

Goal: co-cluster into 2×2 = 4 sub-matrices
→ Row Labels: 1 or 2 (k = 2)
→ Column Labels: 1 or 2 (l = 2)
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
r(1) = 1
r(2) = 1
r(3) = 1
r(4) = 2
Fix column labels, iterate through rows:
r(2) = 2
r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [g11 g12; g21 g22] = [2 4; 4 0]
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
Fix row labels, iterate through columns:
r = (1, 2, 1, 2)
c = (1, 2, 1, 2, 2)
G = [g11 g12; g21 g22] = [0 6; 4 0]
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
c(2) = 2
0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0
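A sequential sketch of this alternating relabeling loop, assuming a squared-error cost over co-cluster means (one reasonable choice for illustration; the paper's framework admits other objectives, and cocluster is our name for it):

```python
import numpy as np

def cocluster(A, k, l, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.integers(0, k, A.shape[0])   # row labels 0..k-1
    c = rng.integers(0, l, A.shape[1])   # column labels 0..l-1
    for _ in range(iters):
        # G[i, j] = mean of entries in co-cluster (i, j); empty clusters -> 0
        G = np.zeros((k, l))
        for i in range(k):
            for j in range(l):
                block = A[r == i][:, c == j]
                G[i, j] = block.mean() if block.size else 0.0
        # fix column labels, reassign every row to its cheapest label
        for x in range(A.shape[0]):
            r[x] = np.argmin([((A[x] - G[i, c]) ** 2).sum() for i in range(k)])
        # fix row labels, reassign every column likewise
        for y in range(A.shape[1]):
            c[y] = np.argmin([((A[:, y] - G[r, j]) ** 2).sum() for j in range(l)])
    return r, c

A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0]])
print(cocluster(A, 2, 2))  # groups rows (A, C) vs (B, D) with matching column split
```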
Distributed Co-Clustering Process
Co-Clustering with MapReduce
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
1 -> 2, 4, 5
2 -> 1, 3
3 -> 2, 4, 5
4 -> 1, 3
MR
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
Distributed Co-Clustering Process
Co-Clustering with MapReduce
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
1 -> 2, 4, 5
2 -> 1, 3
3 -> 2, 4, 5
4 -> 1, 3
MR
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
MapReduce Job
r, c, G random initialization based on parameters k, l
r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [4 4; 2 0]
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
Mapper Function:
For each K-V input (k, v):
1. Calculate h_k (using v and the column labels c)
2. Change the row label if doing so lowers the cost (a function of G)
3. Emit (r(k), (h_k, k))

Example, row 1:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
v = {2, 4, 5}
c = {1, 1, 1, 2, 2}
→ h1 = (1, 2)
if r(1) = 2, the cost becomes higher
→ r(1) = 1 (unchanged)
→ emit(r(k), (h_k, k)) = (1, {(1, 2), 1})
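A minimal sketch of the row-iteration mapper (row_mapper is a hypothetical name, and the squared-distance cost is only a stand-in for the paper's cost function, though it happens to reproduce the row-2 decision shown on the next slide):

```python
def row_mapper(k, v, c, G, num_row_labels):
    """Map one adjacency record (row k, nonzero columns v) to a new label."""
    num_col_labels = len(G[0])
    # h[j] = how many of this row's nonzeros fall into column group j
    h = [0] * num_col_labels
    for col in v:
        h[c[col - 1] - 1] += 1  # labels in the slides are 1-based
    # placeholder cost: squared distance between h and G's row for label i
    def cost(i):
        return sum((h[j] - G[i][j]) ** 2 for j in range(num_col_labels))
    best = min(range(num_row_labels), key=cost) + 1
    return (best, (tuple(h), k))  # emit (r(k), (h_k, k))

# Row 2 with nonzeros {1, 3}, c = (1,1,1,2,2), G = [[4,4],[2,0]]:
print(row_mapper(2, [1, 3], [1, 1, 1, 2, 2], [[4, 4], [2, 0]], 2))
# (2, ((2, 0), 2)) -- the label flips to 2, as in the next slide's example
```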
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
Mapper Function:
For each K-V input (k, v):
1. Calculate h_k (using v and the column labels c)
2. Change the row label if doing so lowers the cost (a function of G)
3. Emit (r(k), (h_k, k))

Example, row 2:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
v = {1, 3}
c = {1, 1, 1, 2, 2}
→ h2 = (2, 0)
if r(2) = 2, the cost becomes lower
→ r(2) = 2
→ emit(r(k), (h_k, k)) = (2, {(2, 0), 2})
Distributed Co-Clustering Process
1 -> 2,4,5
r, c, G
2 -> 1,3
r, c, G
3 -> 2,4,5
r, c, G
4 -> 1,3
r, c, G
M
M
M
M
(rowID, adjList)
( 1, 2,4,5 )
( 2, 1,3 )
( 3, 2,4,5 )
( 4, 1,3 )
(1, {(1, 2), 1})
(2, {(2, 0), 2})
(1, {(1, 2), 3})
(2, {(2, 0), 4})
(new row label, corresponding data)

(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
Distributed Co-Clustering Process
(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
Reducer Function:
For each K-V input (k, [V]):
For each (h, I) ∈ V:
1. Accumulate all h into h_k
2. I_k = union of all I
3. Emit (k, (h_k, I_k))

h1 = (1, 2) + (1, 2) = (2, 4)
I1 = {1, 3}
→ Emit (1, ((2, 4), {1, 3}))
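A matching sketch of the reducer (row_reducer is our name; the accumulation mirrors the slide's example):

```python
def row_reducer(label, values):
    """Reduce all (h, rowID) pairs emitted for one new row label."""
    g_row, rows = None, set()
    for h, row_id in values:
        # accumulate the h vectors into this label's new G row
        g_row = list(h) if g_row is None else [a + b for a, b in zip(g_row, h)]
        rows.add(row_id)  # union of the row ids that took this label
    return (label, (tuple(g_row), sorted(rows)))

# Label 1 received h = (1, 2) from rows 1 and 3:
print(row_reducer(1, [((1, 2), 1), ((1, 2), 3)]))  # (1, ((2, 4), [1, 3]))
```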
Distributed Co-Clustering Process
(1, [{(1, 2), 1}, {(1, 2), 3}])
(2, [{(2, 0), 2}, {(2, 0), 4}])
R
R
(same key, [Value] list)
(1, ((2, 4), {1, 3}))
(2, ((4, 0), {2, 4}))
(row label, reduced data)
r = (1, 2, 1, 2)
G = [2 4; 4 0]
c = (1, 1, 1, 2, 2)
Sync Results
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
Preprocessing:
HDFS → MapReduce Job: build transpose adjacency list → HDFS

Co-Clustering:
Random r, c, G given k, l
→ MapReduce Job: fix columns, row iteration + sync results
→ synced r, c, G with best r permutation
→ MapReduce Job: fix rows, column iteration + sync results
→ final co-clustering result with best r, c permutations
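Wiring the earlier sketches together, one row iteration plus the sync step might look like this (reuses run_mapreduce, row_mapper, and row_reducer from the sketches above; under the placeholder cost the resulting labels need not match the slides' walkthrough):

```python
def row_iteration(adj, r, c, G, k):
    """One 'fix columns, iterate rows' job followed by the sync step."""
    out = run_mapreduce(
        adj.items(),
        map_fn=lambda row, cols: [row_mapper(row, cols, c, G, k)],
        reduce_fn=lambda label, vals: [row_reducer(label, vals)],
    )
    # "sync results": install the new G rows and row labels everywhere
    for label, (g_row, rows) in out:
        G[label - 1] = list(g_row)
        for row in rows:
            r[row - 1] = label
    return r, G

adj = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}
r, c, G = [1, 1, 1, 2], [1, 1, 1, 2, 2], [[4, 4], [2, 0]]
r, G = row_iteration(adj, r, c, G, 2)
# A column iteration is the same job run over the transpose adjacency list.
```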
Implementation Details
Tuning the number of Reduce Tasks
• The number of reduce tasks is related to the number of intermediate keys produced for the shuffle and sort phase
• For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either k or l
Implementation Details
(row-iteration MapReduce job diagram, repeated from above)
k = 2 (row iteration) → 2 intermediate keys
Implementation Details
Tuning the number of Reduce Tasks
• So, for the row-iteration/column-iteration jobs, a single reduce task is enough
• However, preprocessing tasks such as graph construction generate many intermediate keys and need many more reduce tasks
Implementation Details
The Preprocessing Process
(preprocessing process diagram, repeated from above)
(SrcIP, [DstIP]): one intermediate key per distinct source IP
Experimental Evaluation
Environment
• 39 nodes across four blade enclosures, connected by Gigabit Ethernet
• Blade server:
  - CPU: two dual-core Intel Xeon 2.66 GHz
  - Memory: 8 GB
  - OS: Red Hat Enterprise Linux
• Hadoop Distributed File System (HDFS) capacity: 2.4 TB
Experimental Evaluation
Datasets
Experimental Evaluation
Preprocessing ISS Data
• Optimal values found for this workload:
  - Number of map tasks: 6
  - Number of reduce tasks: 5
  - Input split size: 256 MB
Experimental Evaluation
Co-Clustering TREC Data
• Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, better than what a single machine with 48 GB of RAM can achieve
Conclusion
• The authors share lessons learned from mining vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
• They design a general MapReduce approach for co-clustering algorithms
• They show that the MapReduce co-clustering framework scales well on large real-world datasets (ISS, TREC)
Discussion
• Necessity of the global r, c, G sync action
• Questionable scalability of DisCo
Discussion
Necessity of the global r, c, G sync action
(co-clustering MapReduce flow diagram, repeated from above)
Discussion
(row-iteration MapReduce job diagram, repeated from above)
Discussion
Questionable Scalability of DisCo
• For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed at k (or l)
• This implies that for a given k and l, as the input matrix grows, the reducer size* increases dramatically
• Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a single computing node becomes a severe bottleneck for overall performance
*reference: Upper and Lower Bounds on the Cost of a Map-Reduce Computation, VLDB 2013
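A back-of-envelope illustration of the concern (all numbers are assumptions, not measurements from the paper):

```python
m = 1_000_000_000          # rows in the input matrix (assumed)
k, l = 8, 8                # co-cluster grid (assumed)
record_bytes = 8 * l + 8   # one h vector of l counts plus a row id (assumed)
per_reducer = m / k * record_bytes  # a row iteration funnels m records into k keys
print(f"~{per_reducer / 2**30:.1f} GiB into a single reduce task")  # ~8.4 GiB
```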
Discussion
(row-iteration MapReduce job diagram, repeated from above)