
[Paper Study] DisCo: Distributed CoClustering with Map-Reduce, 2008 ICDE


A paper study of the work by Spiros Papadimitriou and Jimeng Sun, IBM T.J. Watson Research Center, NY, USA (ICDE 2008).


Page 1

DisCo: Distributed Co-clustering with Map-Reduce

2008 IEEE International Conference on Data Engineering (ICDE)

Presented by: Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Paper authors: S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center, NY, USA

Page 2

Agenda

A. Motivation
B. Background: Co-Clustering + MapReduce
C. Proposed Distributed Co-Clustering Process
D. Implementation Details
E. Experimental Evaluation
F. Conclusions
G. Discussion

Page 3

Motivation

Fast Growth in Volume of Data

• Google processes 20 petabytes of data per day
• Amazon and eBay handle petabytes of transactional data every day

Highly varied structure of data

• Data sources naturally generate data in impure forms
• Unstructured, semi-structured

Page 4

Motivation

Problems with Big Data mining for DBMSs

• Significant preprocessing costs for the majority of data mining tasks
• DBMSs lack the performance needed for very large data volumes

Page 5

Motivation

Why distributed processing can solve the issues:

• MapReduce is agnostic to the schema or form of the input data
• Many preprocessing tasks are naturally expressible in MapReduce
• Highly scalable on commodity machines

Page 6

Motivation

Contributions of this paper:

• Presents an end-to-end process for distributed data mining
• Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce

Page 7

Background: Co-Clustering

• Also named biclustering, or two-mode clustering
• Input format: a matrix of m rows and n columns
• Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns

Page 8

Background: Co-Clustering

Why Co-Clustering?

Score matrix (students × subjects):

Student A: 0 1 0 1 1
Student B: 1 0 1 0 0
Student C: 0 1 0 1 1
Student D: 1 0 1 0 0

Traditional clustering groups whole rows only: {A, C} and {B, D}. We can only know that students A & C and B & D have similar scores.

Page 9

Background: Co-Clustering

Why Co-Clustering?

Co-clustering reorders both rows and columns:

Student B: 1 1 0 0 0
Student D: 1 1 0 0 0
Student A: 0 0 1 1 1
Student C: 0 0 1 1 1

Cluster 1 (B & D): good at Science + Math
Cluster 2 (A & C): good at English + Chinese + Social Studies

Co-clusters are rows that have similar properties for a subset of selected columns.

Page 10

Background: Co-Clustering

Another Co-Clustering Example: Animal Data (figure)

Page 11

Background: Co-Clustering

Another Co-Clustering Example: Animal Data (figure, continued)

Page 12

Background: Co-Clustering

Another Co-Clustering Example: Animal Data (figure, continued)

Page 13

Background: MapReduce

The MapReduce Paradigm (diagram): mappers each consume input (k1, v1) pairs and emit intermediate (k2, v2) pairs; during shuffle and sort the framework groups these by key into (k2, [v2]) lists, and reducers turn each list into output (k3, v3) pairs.
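To make the dataflow concrete, here is a minimal pure-Python simulation of the paradigm (an illustration, not the paper's code; `run_mapreduce` is a hypothetical helper that later sketches in these notes reuse):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle/sort -> reduce over in-memory (k1, v1) records."""
    groups = defaultdict(list)
    for k1, v1 in records:                    # map phase
        for k2, v2 in mapper(k1, v1):         # each input may emit many (k2, v2)
            groups[k2].append(v2)             # shuffle: group values by key k2
    results = []
    for k2 in sorted(groups):                 # sort phase: keys in order
        results.extend(reducer(k2, groups[k2]))   # reduce: (k2, [v2]) -> (k3, v3)
    return results

# Toy usage: word count.
lines = [(0, "a b a"), (1, "b c")]
print(run_mapreduce(
    lines,
    mapper=lambda _, text: [(w, 1) for w in text.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
))  # [('a', 2), ('b', 2), ('c', 1)]
```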

Page 14

Distributed Co-Clustering Process

Mining Network Logs to Co-Cluster Communication Behavior

Page 15

Distributed Co-Clustering Process

Mining Network Logs to Co-Cluster Communication Behavior

Page 16

Distributed Co-Clustering Process

The Preprocessing Process

HDFS (raw network logs with SrcIP, DstIP fields)
→ MapReduce job: extract SrcIP + DstIP and build the adjacency matrix (one 0/1 row per SrcIP)
→ HDFS
→ MapReduce job: build the adjacency list
→ HDFS
→ MapReduce job: build the transpose adjacency list
→ HDFS
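As a sketch of how the adjacency-list step could be expressed, the snippet below plugs into the `run_mapreduce` simulator from Page 13; the log-line format and field positions are assumptions for illustration:

```python
# Assumed log format: "timestamp srcIP dstIP ..." (one connection per line).
def extract_edge(_, line):
    fields = line.split()
    yield fields[1], fields[2]          # emit (SrcIP, DstIP); SrcIP is the key

def build_adjacency(src, dsts):
    yield src, sorted(set(dsts))        # SrcIP -> deduplicated [DstIP, ...]

logs = [(0, "t1 10.0.0.1 10.0.0.2"), (1, "t2 10.0.0.1 10.0.0.3")]
print(run_mapreduce(logs, extract_edge, build_adjacency))
# [('10.0.0.1', ['10.0.0.2', '10.0.0.3'])]
```

The transpose adjacency list is the same job with the emitted key and value swapped.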

Page 17

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Goal: co-cluster into 2 × 2 = 4 sub-matrices
⇒ Row labels: 1 or 2, k = 2
⇒ Column labels: 1 or 2, l = 2

Random initialization:

0 1 0 1 1   r(1) = 1
1 0 1 0 0   r(2) = 1
0 1 0 1 1   r(3) = 1
1 0 1 0 0   r(4) = 2

r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [g11 g12; g21 g22] = [4 4; 2 0]

(each gij counts the nonzeros in the sub-matrix of rows labeled i and columns labeled j)

Page 18

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Fix column labels, iterate through rows: r(2) changes from 1 to 2.

0 1 0 1 1   r(1) = 1
1 0 1 0 0   r(2) = 1 → 2
0 1 0 1 1   r(3) = 1
1 0 1 0 0   r(4) = 2

r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [2 4; 4 0]

Rows reordered by label:

0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0

Page 19

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Fix row labels, iterate through columns: c(2) changes from 1 to 2.

r = (1, 2, 1, 2)
c = (1, 2, 1, 2, 2)
G = [0 6; 4 0]

Rows and columns reordered by label:

0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0
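A compact sketch of this alternating procedure on the toy matrix. The paper treats the cost as a pluggable function of G; as an illustrative stand-in, this sketch uses the squared deviation of each entry from its block mean, with 0-based labels:

```python
import numpy as np

def block_counts(A, r, c, k, l):
    """G[i][j] = number of nonzeros with row label i and column label j."""
    G = np.zeros((k, l))
    for i in range(len(r)):
        for j in range(len(c)):
            G[r[i]][c[j]] += A[i][j]
    return G

def cost(A, r, c, k, l):
    """Stand-in cost: squared deviation of entries from their block mean."""
    G = block_counts(A, r, c, k, l)
    sizes = np.outer(np.bincount(r, minlength=k), np.bincount(c, minlength=l))
    means = np.divide(G, sizes, out=np.zeros_like(G), where=sizes > 0)
    return sum((A[i][j] - means[r[i]][c[j]]) ** 2
               for i in range(len(r)) for j in range(len(c)))

def sweep(A, r, c, k, l, rows=True):
    """One iteration: move every row (or column) to its cheapest label."""
    labels, m = (r, k) if rows else (c, l)
    for idx in range(len(labels)):
        trial = []
        for g in range(m):
            labels[idx] = g
            trial.append(cost(A, r, c, k, l))
        labels[idx] = int(np.argmin(trial))

A = np.array([[0,1,0,1,1], [1,0,1,0,0], [0,1,0,1,1], [1,0,1,0,0]])
r, c = [0, 0, 0, 1], [0, 0, 0, 1, 1]   # the slides' initialization, 0-based
sweep(A, r, c, 2, 2, rows=True)         # fix columns, iterate rows
sweep(A, r, c, 2, 2, rows=False)        # fix rows, iterate columns
print(r, c)                             # [0, 1, 0, 1] [0, 1, 0, 1, 1]
print(block_counts(A, r, c, 2, 2))      # [[0. 6.] [4. 0.]]
```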

Page 20

Distributed Co-Clustering Process

Co-Clustering with MapReduce

The adjacency matrix is converted into an adjacency list by a MapReduce (MR) job:

Adjacency matrix          Adjacency list
0 1 0 1 1                 1 -> 2, 4, 5
1 0 1 0 0      MR ⇒       2 -> 1, 3
0 1 0 1 1                 3 -> 2, 4, 5
1 0 1 0 0                 4 -> 1, 3

Page 21

Distributed Co-Clustering Process

Co-Clustering with MapReduce

A MapReduce job randomly initializes r, c, G based on the parameters k, l:

r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [4 4; 2 0]

Each adjacency-list row is then paired with the current (r, c, G):

1 -> 2, 4, 5   (r, c, G)
2 -> 1, 3      (r, c, G)
3 -> 2, 4, 5   (r, c, G)
4 -> 1, 3      (r, c, G)

Page 22

Distributed Co-Clustering Process

Each map task M receives one adjacency-list row together with (r, c, G), i.e. input pairs (key_in, value_in):

(1, {2, 4, 5})
(2, {1, 3})
(3, {2, 4, 5})
(4, {1, 3})

Mapper Function: for each K-V input (k, v):

1. Calculate g_k (from v and c)
2. Change the row label r(k) if that results in a lower cost (a function of G)
3. Emit (r(k), (g_k, k))

Example for row 1: v = {2, 4, 5}, c = (1, 1, 1, 2, 2) ⇒ g_1 = (1, 2).
If r(1) = 2, the cost becomes higher ⇒ r(1) stays 1
⇒ emit (r(k), (g_k, k)) = (1, ((1, 2), 1))
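A sketch of this mapper for the `run_mapreduce` simulator from Page 13. Because the slides leave the cost as some function of G, it is abstracted here into a caller-supplied oracle `cost_if_label` (an assumption for illustration); labels and row IDs are 0-based:

```python
def row_mapper(row_id, cols, r, c, k, l, cost_if_label):
    """One map call per adjacency-list row: row_id -> cols (nonzero columns)."""
    g_row = [0] * l
    for j in cols:
        g_row[c[j]] += 1                      # 1. g_k: nonzeros per column group
    r[row_id] = min(range(k),                 # 2. re-label the row if cheaper
                    key=lambda lab: cost_if_label(row_id, lab, g_row))
    yield r[row_id], (tuple(g_row), row_id)   # 3. emit (r(k), (g_k, k))

# Row 1 of the example (0-based: row 0, columns {1, 3, 4}):
r, c = [0, 0, 0, 1], [0, 0, 0, 1, 1]
keep = lambda row_id, lab, g: 0 if lab == r[row_id] else 1   # toy oracle: keep label
print(list(row_mapper(0, [1, 3, 4], r, c, 2, 2, keep)))
# [(0, ((1, 2), 0))]  -- the slide's (1, ((1, 2), 1)), 0-based
```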

Page 23

Distributed Co-Clustering Process

The same mapper, applied to row 2: v = {1, 3}, c = (1, 1, 1, 2, 2) ⇒ g_2 = (2, 0).
If r(2) = 2, the cost becomes lower ⇒ r(2) = 2
⇒ emit (r(k), (g_k, k)) = (2, ((2, 0), 2))

Page 24

Distributed Co-Clustering Process

The mappers emit intermediate pairs (key_intermediate, value_intermediate):

(1, ((1, 2), 1))
(2, ((2, 0), 2))
(1, ((1, 2), 3))
(2, ((2, 0), 4))

Shuffle and sort groups them by key, so each reduce task R receives (key_r-in, [Value]_r-in):

(1, [((1, 2), 1), ((1, 2), 3)])
(2, [((2, 0), 2), ((2, 0), 4)])

Page 25

Distributed Co-Clustering Process

Reducer inputs (key_r-in, [Value]_r-in):

(1, [((1, 2), 1), ((1, 2), 3)])
(2, [((2, 0), 2), ((2, 0), 4)])

Reducer Function: for each K-V input (k, [V]), where each element of V is a pair (g, I):

1. Accumulate all g ∈ V into g_k
2. Set I_k to the union of all I
3. Emit (k, (g_k, I_k))

Example: g_1 = (1, 2) + (1, 2) = (2, 4); I_1 = {1, 3} ⇒ emit (1, ((2, 4), {1, 3}))
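A matching sketch of the reducer (same caveats; the value layout follows the slide, with l = 2 column groups):

```python
def row_reducer(label, values):
    """values: every (g, row_id) pair whose row chose this row label."""
    g_total, members = [0, 0], set()            # l = 2 column groups here
    for g, row_id in values:
        g_total = [a + b for a, b in zip(g_total, g)]   # 1. accumulate the g's
        members.add(row_id)                             # 2. union the row sets
    yield label, (tuple(g_total), sorted(members))      # 3. emit (k, (g_k, I_k))

print(list(row_reducer(0, [((1, 2), 0), ((1, 2), 2)])))
# [(0, ((2, 4), [0, 2]))]  -- the slide's (1, ((2, 4), {1, 3})), 0-based
```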

Page 26

Distributed Co-Clustering Process

The reducers emit (key_result, value_result):

(1, ((2, 4), {1, 3}))
(2, ((4, 0), {2, 4}))

Sync results:

r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [2 4; 4 0]

Rows reordered by label:

0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0

Page 27

Distributed Co-Clustering Process

Preprocessing:
HDFS → MapReduce job: build transpose adjacency list → HDFS

Co-Clustering:
Random (r, c, G) given k, l
→ MapReduce job: fix column labels, row iteration + sync results
→ Synced (r, c, G) with the best r permutation
→ MapReduce job: fix row labels, column iteration + sync results
→ (repeat)
→ Final co-clustering result with the best r, c permutations
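A sketch of the overall driver loop, reusing `sweep`, `block_counts`, and the matrix `A` from the Page 19 sketch; each `sweep` call stands in for one MapReduce job, and returning its labels to the driver plays the role of the sync step:

```python
def disco_driver(A, k, l, max_iters=20, seed=0):
    """Alternate row- and column-iteration 'jobs' until no label changes."""
    rng = np.random.default_rng(seed)
    r = [int(x) for x in rng.integers(0, k, size=A.shape[0])]   # random init
    c = [int(x) for x in rng.integers(0, l, size=A.shape[1])]
    for _ in range(max_iters):
        old = (list(r), list(c))
        sweep(A, r, c, k, l, rows=True)    # fix columns, re-label rows, sync
        sweep(A, r, c, k, l, rows=False)   # fix rows, re-label columns, sync
        if (list(r), list(c)) == old:      # converged: labels stable
            break
    return r, c, block_counts(A, r, c, k, l)

print(disco_driver(A, 2, 2))   # labels and G at a local optimum of the cost
```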

Page 28

Implementation Details

Tuning the number of Reduce Tasks

• The number of reduce tasks is related to the number of intermediate keys during the shuffle-and-sort phase
• For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is only k or l

Page 29

Implementation Details

(The row-iteration MapReduce dataflow from Pages 22-24, shown again.)

k = 2 (row iteration) ⇒ only 2 intermediate keys

Page 30

Implementation Details

Tuning the number of Reduce Tasks

• So, for the row-iteration/column-iteration jobs, one reduce task is enough
• However, some preprocessing tasks, such as graph construction, produce a large number of intermediate keys and need many more reduce tasks

Page 31

Implementation Details

The Preprocessing Process (same pipeline as Page 16)

The adjacency-list job emits one (SrcIP, [DstIP]) pair per source IP, so its number of intermediate keys is large (one per distinct SrcIP).

Page 32

Experimental Evaluation

Environment

• 39 nodes in four different blade enclosures, connected by Gigabit Ethernet
• Blade server: CPU: two dual-core Intel Xeon 2.66 GHz; Memory: 8 GB; OS: Red Hat Enterprise Linux
• Hadoop Distributed File System (HDFS) capacity: 2.4 TB

Page 33

Experimental Evaluation

Datasets


Page 34

Experimental Evaluation

Preprocessing ISS Data

• Optimal values for this setting:
  • Number of map tasks: 6
  • Number of reduce tasks: 5
  • Input split size: 256 MB

Page 35

Experimental Evaluation

Co-Clustering TREC Data

• Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, which is better than what can be achieved on a single machine with 48 GB of RAM

Page 36

Conclusion

• The authors share lessons learned from mining vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
• They design a general MapReduce approach for co-clustering algorithms
• They show that the MapReduce co-clustering framework scales well on large real-world datasets (ISS, TREC)

Page 37

Discussion

• Necessity of the global (r, c, G) sync action
• Questionable scalability of DisCo

Page 38

Discussion

Necessity of the global (r, c, G) sync action

(The co-clustering pipeline from Page 27, shown again: every row- or column-iteration MapReduce job ends with a global sync of (r, c, G) before the next job can start.)

Page 39

Discussion

(The row-iteration MapReduce dataflow from Pages 22-24, shown again for reference.)

Page 40

Discussion

Questionable Scalability of DisCo

• For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed at k (or l)
• This implies that for a given k and l, as the input matrix gets larger, the reducer size* increases dramatically: with k = 2, for example, each reducer receives roughly half of all mapper output, no matter how many nodes are added
• Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a single computing node becomes a severe bottleneck for overall performance

*Reference: "Upper Bound and Lower Bound of a MapReduce Computation", VLDB 2013

Page 41

Discussion

(The row-iteration MapReduce dataflow from Pages 22-24, shown once more: all mapper outputs funnel into only k = 2 reduce keys.)