EFFICIENT IMPLEMENTATION OF MULTI-DIMENSIONAL CO-CLUSTERING
By
XIAOYANG GAO
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2011
© 2011 Xiaoyang Gao
To my Mom and Dad, and everyone who helped me finish this thesis
ACKNOWLEDGMENTS
Of the many people who have been enormously helpful in the preparation of this
thesis, I am especially and heartily thankful to my supervisor, Dr. Sanjay Ranka. The
thesis could not have been written without him; he not only served as my supervisor
but also encouraged and challenged me throughout my academic work. His patience
in answering my various questions and his instructive guidance taught me useful
methods for analyzing and solving a problem, and a good attitude for finishing a job well.
I would like to warmly acknowledge Dr. Ahmed Helmy for his support and guidance
in this interesting project, whose real wireless network data helped greatly in finishing
this thesis. I always learned something in my meetings with him. Also, I would like
to thank Dr. Shigang Chen for his instruction, which stimulated my interest in computer
networks and gave me a good command of networking knowledge, as well as for his
input on my thesis defense.
In addition, special thanks go to Saeed Moghaddam and Clint P. George for their
necessary support in analyzing the co-clustering results, and most especially to my
family and all my friends; their consideration, motivation, and encouragement enabled
me to complete this thesis.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 MULTI-DIMENSIONAL INFORMATION THEORETICAL CO-CLUSTERING
3 DATA REPRESENTATION
   3.1 Original Data Cube
       3.1.1 Dense Cube
       3.1.2 Sparse Cube
   3.2 Clustered Data Cube
   3.3 Marginal Distribution
4 SERIAL IMPLEMENTATION
   4.1 Preprocessing
   4.2 Serial Implementation and Optimization for Dense Cube
   4.3 Serial Implementation and Optimization for Sparse Cube
5 PARALLEL IMPLEMENTATION
   5.1 Parallel Implementation and Optimization for Dense Cube
       5.1.1 Parallel Reduction on CUDA Platform
       5.1.2 Multi-items Calculation on CUDA Platform
       5.1.3 Optimization
   5.2 Parallel Implementation and Optimization for Sparse Cube
6 EXPERIMENTS
   6.1 Data Set, Environment and Measurement Details
   6.2 Performance and Discussion
   6.3 Co-clustering Algorithm Results
7 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

5-1 Example of sorting indexes in threads and blocks
5-2 Example of shared memory
6-1 Information on datasets
LIST OF FIGURES

2-1 Multi-dimensional information theoretic co-clustering algorithm
3-1 Storage for dense cube in memory
3-2 Storage for sparse cube in memory
3-3 Storage for jagged 2D array in memory
4-1 Flow of implementation
4-2 Procedure of computing distances in iteration on sparse cube
5-1 Tree-based reduction [2]
5-2 Example of 2D thread mapping for marginal distribution computation
5-3 Reduction of repeated communication between host and device
5-4 Mapping of threads in parallel implementation for sparse cube
6-1 Trends of the loss of mutual information
6-2 Comparison of loss of mutual information
6-3 Performance results
6-4 Data points distribution before and after co-clustering
6-5 Co-clustering result of domain names of 2D dataset
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
EFFICIENT IMPLEMENTATION OF MULTI-DIMENSIONAL CO-CLUSTERING
By
Xiaoyang Gao
August 2011
Chair: Sanjay Ranka
Major: Computer Engineering
Co-clustering is an important data mining operation that automatically clusters
data along two or more dimensions. Most of the work in the literature focuses
on co-clustering in two dimensions. In this thesis, we develop extensions of ITCC
(Information Theoretical Co-Clustering) for multi-dimensional data. We first extend
the approach to more than two dimensions. We also develop parallel algorithms
for the resulting approach. Our experimental results show that our algorithms and
implementations scale well to large datasets on both sequential and parallel
machines. The Multi-Dimensional ITCC has been used to help analyze
multi-dimensional wireless data records and uncover hidden models of user activities.
CHAPTER 1
INTRODUCTION
In the era of data explosion, large amounts of data are generated every day.
However, our ability to use the data fails to keep up with its growth, and much of the
knowledge hidden in it goes undiscovered. Clustering is a fundamental tool in
data mining: it automatically groups similar objects into clusters in an
unsupervised way, helping people uncover knowledge that can hardly be discovered
by observation based on common sense or current knowledge.
Data in the real world almost always has more than one attribute. Clustering along only
one dimension cannot discover knowledge that relates to all the attributes. To deal
with this, co-clustering provides a way to automatically and simultaneously cluster the
data along two or more dimensions. Nowadays, this technique is widely used in many
areas, including text, web-log, bioinformatics, and wireless network data analysis and
modeling. Researchers have tried different measures of similarity between objects to
analyze the data from different aspects, and various co-clustering algorithms have been
presented in the literature for different applications. Most of them focus on
two-dimensional data. However, there is great demand for clustering data in more
dimensions, since real-world data almost always has more than two. For example,
traffic records from a wireless network have attributes such as users, domains, time,
and locations. It is often desirable to co-cluster on all of them and discover knowledge
spanning all the dimensions. As data dimensionality increases, efficient implementations
of high-dimensional co-clustering algorithms are also needed in practice to deal with
the huge amounts of real-world data.
Generally speaking, we can treat multi-dimensional data as a contingency
table. Information theory provides a principled quantity, mutual information, that
measures the mutual dependence of random variables, and it gives a good way to
measure whether a co-clustering is optimal. Based on mutual information, Dhillon
et al. [1] presented the Information Theoretical Co-Clustering (ITCC) algorithm, an
efficient co-clustering algorithm. It treats the optimal co-clustering as the one that
preserves the largest mutual information between the clustered random variables;
equivalently, it minimizes the difference between the mutual information of the original
random variables and that of the clustered ones. For 2D data, it intertwines row and
column clustering at all stages. The algorithm has been proven to monotonically
decrease this difference in mutual information and to converge to a locally optimal
co-clustering that depends on the initialization of the cluster assignments. Fortunately,
according to the literature [1], ITCC extends to co-clustering on multi-dimensional
data in a reasonable way without introducing much cost in efficiency. In this thesis,
we use the Multi-Dimensional Information Theoretical Co-Clustering (MDITCC)
algorithm in the implementation.
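For reference, the mutual information that drives the objective is the standard information-theoretic quantity; in the notation used throughout this thesis, for discrete variables $D_1, \ldots, D_n$,
$$I(D_1; D_2; \ldots; D_n) = \sum_{d_1} \sum_{d_2} \cdots \sum_{d_n} p(d_1, d_2, \ldots, d_n) \log \frac{p(d_1, d_2, \ldots, d_n)}{p(d_1)\, p(d_2) \cdots p(d_n)},$$
and MDITCC seeks the clustering that minimizes the loss $I(D_1; \ldots; D_n) - I(\hat{D}_1; \ldots; \hat{D}_n)$, as formalized in Chapter 2.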
Due to the large scale of high-dimensional data, it is necessary to construct a
more efficient implementation of the algorithm that makes co-clustering faster without
losing precision. Parallelizing the algorithm is an ideal way to improve performance
and capacity. NVIDIA provides CUDA (Compute Unified Device Architecture) [3], a
parallel computing architecture that enables dramatic increases in computing
performance by harnessing the power of the GPU (Graphics Processing Unit) for
general-purpose computing. CUDA substantially decreases the cost per GFLOPS and
provides a developer-friendly environment for constructing parallel programs. All of
this makes CUDA an ideal platform for parallelizing the Multi-Dimensional ITCC. We
develop parallel algorithms based on Multi-Dimensional ITCC on the NVIDIA CUDA
platform to improve the efficiency and throughput of our implementation.
In this thesis, we present a novel and efficient implementation of the
Multi-Dimensional Co-Clustering algorithm, which is based on Information Theoretical
Co-Clustering and performs efficiently on large multi-dimensional data, especially
sparse, high-dimensional data. We first describe the Multi-Dimensional ITCC and prove
its key formulas in multiple dimensions. Then, we present the data representations
used to store the various data in the computation, including the original data, clustered
data, marginal distributions, and other auxiliary data; separate data structures for
sparse and dense data are presented for different applications. Next come the
optimized serial implementations for sparse and dense data, followed by the parallel
ones on the CUDA platform. Finally, we present experiments that show the
performance improvement our implementation provides and the results of the
co-clustering. We demonstrate that our implementation works correctly and efficiently
on large-scale, high-dimensional data by presenting co-clustering results for
high-dimensional wireless network data. The results also show that the parallel
implementation improves performance markedly, especially on large-scale data.
CHAPTER 2
MULTI-DIMENSIONAL INFORMATION THEORETICAL CO-CLUSTERING
The original description of the Information Theoretical Co-Clustering algorithm is
for two-dimensional data. However, it extends easily to multi-dimensional space, as
mentioned in the original literature [1]. To outline the approach of Multi-Dimensional
ITCC, we first prove the key formulas of ITCC in multi-dimensional space, and then
describe the Multi-Dimensional ITCC.
In multi-dimensional space, we assume the variables in each dimension are
independent of each other, and treat the input data as a multi-dimensional contingency
table. The key is to express the loss of mutual information in multi-dimensional space,
so a new representation of this loss for a multi-dimensional contingency table is
necessary. Based on the above assumptions, we can write the new definition of the
loss of mutual information in multi-dimensional space as follows:
Lemma 1. For a fixed co-clustering $(C_{D_1}, C_{D_2}, \ldots, C_{D_n})$, we can write the loss of mutual information as
$$I(D_1; D_2; \ldots; D_n) - I(\hat{D}_1; \hat{D}_2; \ldots; \hat{D}_n) = D\big(p(D_1, D_2, \ldots, D_n) \,\|\, q(D_1, D_2, \ldots, D_n)\big) \tag{2-1}$$
where $D(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence, also known as relative entropy, and $q(D_1, D_2, \ldots, D_n)$ is the distribution of the form
$$q(d_1, d_2, \ldots, d_n) = p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) \prod_{i=1}^{n} p(d_i \mid \hat{d}_i) \tag{2-2}$$
Proof of Lemma 1. Using
$$p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) = \sum_{d_1 \in \hat{d}_1} \sum_{d_2 \in \hat{d}_2} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, d_2, \ldots, d_n),$$
we have
$$\begin{aligned}
& I(D_1; D_2; \ldots; D_n) - I(\hat{D}_1; \hat{D}_2; \ldots; \hat{D}_n) \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{p(d_1)\, p(d_2) \cdots p(d_n)} \\
&\quad - \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \Big( \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \Big) \log \frac{p(\hat{d}_1, \ldots, \hat{d}_n)}{p(\hat{d}_1) \cdots p(\hat{d}_n)} \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{p(\hat{d}_1, \ldots, \hat{d}_n) \frac{p(d_1)}{p(\hat{d}_1)} \frac{p(d_2)}{p(\hat{d}_2)} \cdots \frac{p(d_n)}{p(\hat{d}_n)}} \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{q(d_1, \ldots, d_n)}
\end{aligned}$$
Some simple but useful equalities between $p$ and $q$, which highlight the properties
that make $q$ a desirable approximation to $p$, are also presented.
Proposition 2.1.
$$q(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) = p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n), \qquad q(d_i, \hat{d}_i) = p(d_i, \hat{d}_i) \tag{2-3}$$
$$q(d_i) = p(d_i), \qquad q(\hat{d}_i) = p(\hat{d}_i) \tag{2-4}$$
$$p(d_i \mid \hat{d}_i) = q(d_i \mid \hat{d}_i) \tag{2-5}$$
$$p(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) = q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \tag{2-6}$$
for all $d_i, \hat{d}_i$, $1 \le i \le n$. Further, if $\hat{d}_i = C_{D_i}(d_i)$, then
$$q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n \mid \hat{d}_i) = q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \prod_{k \ne i} q(d_k \mid \hat{d}_k) \tag{2-7}$$

Proof of Proposition 2.1. Equations 2-4, 2-5, and 2-6 are simple to show and follow from Equation 2-3. Equation 2-7 follows from
$$\begin{aligned}
q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n \mid \hat{d}_i)
&= q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n, \hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \\
&= \frac{q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n, \hat{d}_1, \ldots, \hat{d}_n)}{q(\hat{d}_i)} \\
&= \frac{\sum_{d_i \in \hat{d}_i} p(\hat{d}_1, \ldots, \hat{d}_n) \prod_{k=1}^{n} p(d_k \mid \hat{d}_k)}{q(\hat{d}_i)} \\
&= \frac{q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n)}{q(\hat{d}_i)} \prod_{k \ne i} q(d_k \mid \hat{d}_k) \\
&= q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \prod_{k \ne i} q(d_k \mid \hat{d}_k)
\end{aligned}$$
Algorithm Co_Clustering$(n, p, l_1, l_2, \ldots, l_n, C_{D_1}, C_{D_2}, \ldots, C_{D_n})$
Input: the joint probability distribution $p(D_1, D_2, \ldots, D_n)$; $l_1, l_2, \ldots, l_n$, the desired numbers of clusters in each dimension.
Output: the partition functions $C_{D_1}, C_{D_2}, \ldots, C_{D_n}$.

1. Initialization: set $t = 0$. Start with some initial partition functions $C^{(0)}_{D_1}, C^{(0)}_{D_2}, \ldots, C^{(0)}_{D_n}$. Compute
$$q^{(0)}(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_n), \quad q^{(0)}(D_1 \mid \hat{D}_1), \quad q^{(0)}(D_2 \mid \hat{D}_2), \quad \ldots, \quad q^{(0)}(D_n \mid \hat{D}_n),$$
and the distributions $q^{(0)}(D_2, D_3, \ldots, D_n \mid \hat{d}_1)$, $1 \le \hat{d}_1 \le l_1$.

2. Iterate on each dimension $k$ from 1 to $n$:

(a) Compute the clusters for dimension $k$: for each $d_k$, find its new cluster index as
$$C^{(t+k)}_{D_k}(d_k) = \operatorname*{argmin}_{\hat{d}_k} D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big),$$
resolving ties arbitrarily. Let $C^{(t+k)}_{D_j} = C^{(t+k-1)}_{D_j}$ for $j \ne k$.

(b) Compute the distributions
$$q^{(t+k)}(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_n), \quad q^{(t+k)}(D_1 \mid \hat{D}_1), \quad \ldots, \quad q^{(t+k)}(D_n \mid \hat{D}_n),$$
and the distributions $q^{(t+k)}(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)$, $1 \le \hat{d}_k \le l_k$.

3. If the change in objective function value, that is, $D(p(D_1, \ldots, D_n) \,\|\, q^{(t)}(D_1, \ldots, D_n)) - D(p(D_1, \ldots, D_n) \,\|\, q^{(t+n)}(D_1, \ldots, D_n))$, is "small", stop and return $C_{D_1} = C^{(t+n)}_{D_1}, \ldots, C_{D_n} = C^{(t+n)}_{D_n}$. Otherwise set $t = t + n$ and go to step 2.

Figure 2-1. Multi-dimensional information theoretic co-clustering algorithm
We can now describe the Multi-Dimensional ITCC, summarized in Figure 2-1. The
algorithm starts with an initial cluster assignment for every element in each dimension.
Depending on the initialization, the algorithm converges to a local minimum of the loss
in mutual information; it cannot guarantee a global minimum.
CHAPTER 3
DATA REPRESENTATION
Data representation plays an important role in the implementation. We treat the
multi-dimensional data as a data cube. Analyzing the algorithm shows that the original
data cube, the clustered data cube, and the marginal distributions of the data cubes
are the three most important types of data. The data structures used in the
implementation should generically support data in any number of dimensions. They
are also designed for efficient access, low space overhead, and
parallel-communication friendliness, meaning the data can be extracted into a
one-dimensional array with minimal time and space overhead, since most parallel
communication operations work best with arrays of basic types. The following sections
describe the data structures used for the original data cube, the clustered data cube,
and the marginal distributions in detail.
3.1 Original Data Cube
Based on the proportion of zeros in the cube, original data cubes can be divided
into two types. One is the sparse cube, which is populated primarily with zeros. The
other is the dense cube, in which the majority of elements are non-zero. For these two
types of cubes, two different data structures are designed separately.
3.1.1 Dense Cube
The data structure for the dense cube uses a one-dimensional array to store all the
elements in the cube. The elements are stored sequentially from the logically first
element to the last. To access an element in the cube, we use a converter that turns
the multi-dimensional indexes into an index of this one-dimensional array, as well as
another converter for the opposite conversion. The complexity of accessing an
element is O(k), where k is the number of dimensions. Since k is always small, in most
cases less than 10, we can treat the time to access one element as constant.
Figure 3-1 shows an example of 2D data stored in the dense cube structure.
Figure 3-1. Storage for dense cube in memory
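As a minimal sketch of the two converters just described (illustrative code, not the thesis's implementation; names and the row-major layout are assumptions):

```cuda
// Hypothetical sketch of the dense-cube index converters. dims[i] is the
// number of elements in dimension i; both conversions cost O(k).
#include <cstddef>
#include <vector>

// Multi-dimensional indexes -> flat array address (row-major order).
std::size_t toAddress(const std::vector<std::size_t>& idx,
                      const std::vector<std::size_t>& dims) {
    std::size_t addr = 0;
    for (std::size_t i = 0; i < dims.size(); ++i)
        addr = addr * dims[i] + idx[i];        // Horner-style accumulation
    return addr;
}

// Flat array address -> multi-dimensional indexes (the opposite conversion).
std::vector<std::size_t> toIndexes(std::size_t addr,
                                   const std::vector<std::size_t>& dims) {
    std::vector<std::size_t> idx(dims.size());
    for (std::size_t i = dims.size(); i-- > 0; ) {
        idx[i] = addr % dims[i];
        addr /= dims[i];
    }
    return idx;
}
```

For the 4 × 4 example of Figure 3-1, toAddress({1, 2}, {4, 4}) returns address 6, matching the figure's converter example.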
3.1.2 Sparse Cube
One of the most common data structures for storing a sparse cube is the
coordinate list, in which each record contains the multi-dimensional coordinate of an
element and its value. The coordinate list includes all the non-zero elements in the cube.
An interesting characteristic of the original data cube is that it is unnecessary to
visit its elements in any specific order; visiting the elements in any sequence is
acceptable. At the same time, we never need to modify the cube. These properties
make the coordinate list format an ideal representation for the sparse data.
Again, one-dimensional arrays are used for the storage. Specifically, suppose the
number of non-zero elements is n and the number of dimensions is k. An array of
n * k entries stores the coordinates of all the elements, with each run of k consecutive
entries representing the coordinate of one element. Another array of n entries stores
the values of the elements. The order of elements is the same in both arrays: the i-th
run of k consecutive entries in the first array is the coordinate of the i-th element, while
the i-th entry in the second array is its value.
It is worth noting that no random access interface is provided for the sparse cube,
mainly because the algorithm never needs one.
Figure 3-2. Storage for sparse cube in memory
Figure 3-2 shows an example of 2D data stored in the sparse cube structure. The
time complexity of visiting all the elements in the cube is O(l), where l is the number of
non-zero elements; in a sparse cube, the number of non-zeros is much smaller than in
a dense cube.
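A minimal sketch of this coordinate-list layout (illustrative, with assumed names; the thesis's code is not shown):

```cuda
// Hypothetical sketch of the coordinate-list (COO) representation described
// above: one flat array of coordinates, one flat array of values.
#include <cstddef>
#include <vector>

struct SparseCube {
    std::size_t numDims;          // k: number of dimensions
    std::size_t numNonZeros;      // n: number of non-zero elements
    std::vector<int>   coords;    // n * k entries; coords[i*k .. i*k+k-1] is element i's coordinate
    std::vector<float> values;    // n entries; values[i] is element i's value

    // Sequential visiting is the only access pattern the algorithm needs,
    // so no random-access interface is provided.
    template <typename Visitor>
    void forEach(Visitor visit) const {
        for (std::size_t i = 0; i < numNonZeros; ++i)
            visit(&coords[i * numDims], values[i]);
    }
};
```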
3.2 Clustered Data Cube
Analyzing the algorithm reveals some interesting characteristics of the clustered
data cube that help in designing its data structure:
• The clustered cube is always dense;
• The data in the clustered cube changes frequently at runtime;
• The elements are randomly accessed throughout the execution;
• The clustered cube is always small enough that storing all its elements does not consume much memory.
The dense cube structure described in the Original Data Cube section satisfies all
of the demands above. Therefore, we reuse the dense cube representation for the
clustered data cube.
3.3 Marginal Distribution
The marginal distributions are frequently accessed during execution. Because the
number of elements differs across dimensions, the most space-efficient structure for
them is a jagged two-dimensional array, indexed first by dimension number and then
by the index of the element within that dimension.
In our implementation, we prefer one-dimensional arrays, so we flatten the jagged
array: the one-dimensional array stores all its entries sequentially, from the first
dimension's entries to the last. For fast access to a specific entry, instead of
recomputing its position by adding up the lengths of all preceding dimensions, the
starting offset of each dimension is stored in an auxiliary array.
It is worth noting that the same structure is also used for the cluster assignments,
which stores the cluster index for each element in each dimension.
Figure 3-3 shows an example of 4D marginal distributions stored in this data
structure.
Figure 3-3. Storage for jagged 2D array in memory
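A minimal sketch of this flattened jagged array (illustrative names, not the thesis's code):

```cuda
// Hypothetical sketch of the flattened jagged array for marginal distributions.
// offset[d] is the start of dimension d's entries in the flat array (the
// auxiliary array described above).
#include <vector>

struct MarginalDistribution {
    std::vector<float> values;   // all dimensions' marginals, concatenated
    std::vector<int>   offset;   // offset[d] = index of dimension d's first entry

    // O(1) access to the marginal of element i in dimension d.
    float get(int d, int i) const { return values[offset[d] + i]; }
};
```

With offsets {0, 3, 8, 12} as in the 4D example of Figure 3-3, get(2, 1) reads values[9], the second entry of dimension 2.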
CHAPTER 4
SERIAL IMPLEMENTATION
The serial implementation provides the basic program structure for the algorithm.
For the two types of data, a sparse version and a dense version are implemented and
optimized separately. The serial implementation also provides the fundamental
workflow for the parallel implementation, which mainly focuses on parallelizing the
computation-intensive parts of the algorithm. Figure 4-1 shows the flow structure of
the whole program.
We can divide the computation in the algorithm into several small basic operations:
• Calculating the clustered data cube;
• Calculating the marginal distributions of the cubes (both the original one and the clustered one);
• Calculating the distance between each element and each candidate cluster,
$$D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big),$$
and the corresponding $q^{(t+k)}(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k)$, $1 \le \hat{d}_k \le l_k$;
• Finding the minimum of all the distances,
$$C^{(t+k)}_{D_k}(d_k) = \operatorname*{argmin}_{\hat{d}_k} D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big).$$
These basic operations are the key parts of the implementation. The following
sections first introduce the preprocessing stage of the program, which converts the
input data into a form friendlier to these operations while saving time and space. Then
the computation-intensive operations of the algorithm are discussed separately in their
dense and sparse forms. Some implementation-level optimizations are also described
that achieve better performance.
Figure 4-1. Flow of implementation
4.1 Preprocessing
Preprocessing is indispensable for making the implementation work correctly and
efficiently. The elements of each input dimension might be of any type, not just
integers. Even when they are integers, the values may be scattered over a large
range, which creates huge unused regions in multi-dimensional storage and redundant
computation.
Preprocessing counts the number of elements in each dimension and assigns each
element a new, sequential ID within that dimension. In this way, elements with no
non-zero records in any dimension are eliminated; such elements would otherwise
reduce the performance of the algorithm and waste the limited memory. Repeated
records, which have the same attributes, are aggregated.
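As an illustrative sketch of the ID assignment (hypothetical names, assuming string-valued raw attributes):

```cuda
// Hypothetical sketch of the ID remapping step: each distinct raw attribute
// value observed in a dimension receives the next sequential ID, so values
// with no records leave no holes in the cube.
#include <string>
#include <unordered_map>

struct IdMapper {
    std::unordered_map<std::string, int> ids;  // raw value -> dense sequential ID
    int next = 0;

    int remap(const std::string& raw) {
        auto it = ids.find(raw);
        if (it != ids.end()) return it->second;
        ids.emplace(raw, next);
        return next++;                         // first time seen: assign new ID
    }
};
```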
4.2 Serial Implementation and Optimization for Dense Cube
Most of the computation can be done in a single iteration over all the elements in
the cube. The operations behave as follows:
• Computing the clustered data cube and the corresponding marginal distributions: the program derives the indexes in the clustered cube from each element's index, accesses the corresponding entries of the marginal distributions and the clustered data cube through those indexes, and adds the element's value into those entries.
• Computing the intermediate q values: during the iteration over all the elements of the original cube, the program calculates the q values following the equation
$$q^{(t+k)}(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k) = p(\hat{d}_1, \ldots, \hat{d}_{k-1}, \hat{d}_{k+1}, \ldots, \hat{d}_n \mid \hat{d}_k) \prod_{i \ne k} p(d_i \mid \hat{d}_i), \quad 1 \le \hat{d}_k \le l_k.$$
• Computing the distances between elements and clusters: the program iterates over all pairs of element and cluster and calculates the distances following the definition of the distance, the Kullback-Leibler (KL) divergence; the shortest distance and the corresponding cluster are stored. In detail, the calculation follows
$$D\big(p(\cdot \mid d_k) \,\|\, q(\cdot \mid \hat{d}_k)\big) = \sum_{d_1} \cdots \sum_{d_{k-1}} \sum_{d_{k+1}} \cdots \sum_{d_n} p(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid d_k) \log \frac{p(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid d_k)}{q(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k)}$$
An optimization can be adopted in the distance computation. Heavy computation
takes place in the repeated calculation of indexes and index conversions, which costs
a large amount of time. To reduce it, instead of computing the distance separately for
each pair of element and cluster, we compute all the $p \log \frac{p}{q}$ terms in a single pass
over all the elements and add each term to the corresponding distances. This avoids
repeatedly recomputing the same indexes. Experiments show this optimization
removes a large amount of calculation time and greatly improves the performance of
the whole program.
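A hedged sketch of this one-pass scheme (the array layouts and names are illustrative assumptions; per-entry q values for each candidate cluster are assumed to be available):

```cuda
// Hypothetical sketch: visit each cube entry once and scatter its
// p*log(p/q) contribution into the distance table dist[element][cluster],
// instead of recomputing index conversions per (element, cluster) pair.
#include <cmath>
#include <cstddef>
#include <vector>

void accumulateDistances(const std::vector<float>& p,      // probability of each entry
                         const std::vector<float>& q,      // q per (entry, candidate cluster)
                         const std::vector<int>& elemIdx,  // clustered-dimension index per entry
                         int numClusters,
                         std::vector<float>& dist)         // dist[e * numClusters + c]
{
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] == 0.0f) continue;                        // zero entries contribute nothing
        for (int c = 0; c < numClusters; ++c)
            dist[elemIdx[i] * numClusters + c] +=
                p[i] * std::log(p[i] / q[i * numClusters + c]);
    }
}
```

After the pass, the new assignment of each element is simply the argmin over its row of the distance table.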
4.3 Serial Implementation and Optimization for Sparse Cube
Because the data representation of the sparse cube differs from that of the dense
cube, the serial implementation for the sparse cube solves the problem differently.
Much of it, however, follows the same principles as the dense version, including the
computation of the clustered cube and the marginal distributions.
The most complicated part, the distance computation together with the q
computation, is also done in one iteration over all the elements of the sparse cube.
While visiting one element, we examine each dimension index of the element: if it is the
dimension currently being clustered, the q value is multiplied by
$q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i)$; otherwise it is multiplied by $q(d_i \mid \hat{d}_i)$. Figure 4-2
shows an example of the basic procedure for computing distances in this implementation.
Figure 4-2. Procedure of computing distances in iteration on sparse cube
CHAPTER 5
PARALLEL IMPLEMENTATION
The parallel implementations, for both the dense cube and the sparse one, focus
on parallelizing the core, computation-intensive parts of the algorithm. In this chapter,
the parallel implementation for the dense cube is presented first, followed by the one
for the sparse cube.
5.1 Parallel Implementation and Optimization for Dense Cube
Analyzing the operations on the dense cube in the algorithm, we can divide all the
computation into two types of abstract operations:
• Reduction. This appears in the calculation of the cluster distribution, the marginal distributions, the distances between each element and the candidate clusters in the corresponding dimension, and the cluster assignments.
• Multi-item calculation. This type needs to calculate many items with the same formula but different data sources. It appears mostly in the intermediate q value calculation and the distances.
Some calculations, such as the distance calculation, belong to both types. These two
types make up the most intensive computation in the algorithm, so our parallel
implementation mainly focuses on them.
5.1.1 Parallel Reduction on CUDA Platform
Parallel reduction on the CUDA platform is similar to general parallel reduction,
widely used with MPI on clusters. It is essentially a tree-based reduction, which lets us
exploit maximum parallelism.
Implementing parallel reduction on CUDA requires considering a few more things,
including the thread and block concepts and the shared memory within a thread block.
Some optimizations can also be adopted, such as loop unrolling or even complete
unrolling. Researchers at NVIDIA presented an optimized parallel reduction [2] that
takes these factors into account; Figure 5-1 shows the main idea of that algorithm. Our
implementation adopts most of their ideas for the reduction operation.
Figure 5-1. Tree-based reduction [2]
Because the CUDA platform has no global synchronization, the data is divided into
several blocks, each corresponding to a thread block. The reduction is executed
separately within each thread block, but simultaneously across blocks. After loading its
elements, each thread in the block first reduces at least two elements, which avoids
leaving half the threads idle in the first step. If the element count is more than twice the
total number of threads across all blocks, each thread sums up as many elements as
necessary. Theoretically, if the number of elements is about $\log_2 n$ times twice the
number of threads, that is, if each thread first sums $O(\log n)$ elements sequentially,
this parallel reduction is cost-optimal.
The reduction uses shared memory as a buffer for the elements and results. In the
first step, each thread loads its data from global memory into the block's shared
memory; all following reduction steps take place in shared memory, which reduces
data access time during the reduction.
The reduction also uses sequential addressing in shared memory to reduce shared
memory bank conflicts. In each step, each thread sums its own element with the one in
the second half of the block, and the threads in the second half go idle. In this way, the
elements used in the next step occupy a contiguous address range, and time lost to
bank conflicts is reduced.
Looping is another time-consuming part of the reduction. Once the number of
active threads drops below the warp size, we can unroll the loop and avoid the cost of
synchronizing threads. Because of the limit on the number of threads in one block, we
can even unroll the loop completely using the C++ template feature, so the branches
are eliminated at compile time. For this to work correctly, the number of threads in
each block must be a power of 2.
With the above optimizations, the reduction is efficient and theoretically
cost-optimal, providing a fundamental primitive for the computation in our
implementation.
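A minimal sketch of such a kernel, closely following the scheme in [2] (illustrative code, not the thesis's actual kernel), combining first-add-during-load, sequential addressing in shared memory, and complete unrolling via a C++ template; a power-of-2 block size is assumed:

```cuda
// Sum reduction over n floats. Each block writes one partial sum; a second
// pass (or a host-side sum) reduces the per-block results.
template <unsigned int blockSize>
__global__ void reduceSum(const float* in, float* out, unsigned int n) {
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // First add during load: each thread sums as many elements as necessary,
    // so no thread is idle in the first reduction step.
    float sum = 0.0f;
    while (i < n) {
        sum += in[i];
        if (i + blockSize < n) sum += in[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing;
    // branches on blockSize are resolved at compile time.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Last warp fully unrolled; no __syncthreads() needed within a warp.
    if (tid < 32) {
        volatile float* v = sdata;
        if (blockSize >= 64) v[tid] += v[tid + 32];
        if (blockSize >= 32) v[tid] += v[tid + 16];
        if (blockSize >= 16) v[tid] += v[tid +  8];
        if (blockSize >=  8) v[tid] += v[tid +  4];
        if (blockSize >=  4) v[tid] += v[tid +  2];
        if (blockSize >=  2) v[tid] += v[tid +  1];
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```

A typical launch is reduceSum<256><<<blocks, 256>>>(dIn, dPartial, n), followed by a second pass over the per-block partial sums.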
As described above, each type of calculation takes different source data. For
example, the original distribution is used multiple times by different kinds of
calculations, but each kind uses only part of the original data; the marginal distribution
and the cluster distribution, for instance, only need the elements related to one
element or within one specific block. The data needed by a calculation is not always
contiguous in memory. When we distribute a calculation across threads, we must
make sure each thread takes its part of the data uniquely, correctly, and completely. In
the parallel implementation, we define mapping functions for each type of calculation,
mapping from thread ID to data location and from data location to thread ID. Shared
memory is heavily used when computing an element's mapped address: auxiliary
variables for the mapping, such as the number of elements in each dimension, are
kept in shared memory. Figure 5-2 shows an example of mapping different areas to
sequential threads.
A Example on mapping dimension 0
B Example on mapping dimension 1
Figure 5-2. Example of 2D thread mapping for marginal distribution computation
5.1.2 Multi-items Calculation on CUDA Platform
Another typical calculation in the algorithm computes multiple similar items with the
same formula but different data sources. The most typical instance is the computation
of the intermediate q values within the distance calculation: depending on which q
value is being computed, different data is drawn from the original cube, the clustered
cube, and the marginal distributions of both.
In our implementation, we use an output-data-partition method to divide the
calculation among threads. Each thread calculates one q value: it derives the indexes
corresponding to its thread ID and loads the corresponding data from global memory.
Shared memory is again used for the auxiliary data.
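A hedged sketch of this output-data-partition scheme, computing q values per the factorization of Equation 2-2 (array layouts and names are illustrative assumptions, not the thesis's code):

```cuda
// Hypothetical kernel: one thread per output q value. Each thread recovers
// its per-dimension element indexes from its flat output index, multiplies
// the p(d_i | cluster) factors, and scales by the clustered-cube probability,
// following Equation 2-2.
__global__ void computeQ(const float* clusteredCube, // clustered cube p, dense, row-major
                         const float* elemCond,      // p(d_i | cluster), all dims concatenated
                         const int*   dimOffset,     // start of dimension d in elemCond/assign
                         const int*   assign,        // cluster index of each element
                         const int*   dims,          // elements per dimension
                         const int*   clusterDims,   // clusters per dimension
                         int numDims, int totalQ, float* q)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= totalQ) return;

    int rest = t;
    float value = 1.0f;
    int clusteredAddr = 0, stride = 1;
    for (int d = numDims - 1; d >= 0; --d) {
        int idx = rest % dims[d];                   // element index in dimension d
        rest /= dims[d];
        int cl = assign[dimOffset[d] + idx];        // this element's cluster
        value *= elemCond[dimOffset[d] + idx];      // p(d_d | cluster) factor
        clusteredAddr += cl * stride;               // flat clustered-cube address
        stride *= clusterDims[d];
    }
    q[t] = value * clusteredCube[clusteredAddr];    // Equation 2-2
}
```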
5.1.3 Optimization
Using GPUs for calculation requires frequent basic memory operations: allocating
device memory, freeing it, and copying between host and device. Memory must be
allocated first, then data is copied from the host to device memory; after the calculation
finishes, results are copied back from the device to host memory. These operations
are very time consuming when the amount of data is large.
In our implementation, only some parts of the computation are parallelized, but
intermediate results and the original read-only data should not be copied back and
forth between host and device multiple times. We optimize the implementation by
leaving the read-only data, such as the original data and its marginal distributions,
together with intermediate data, resident in device memory for reuse. We also reuse
previously allocated address space to avoid repeated allocation and freeing.
Figure 5-3. Reduction of repeated communication between host and device
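A minimal sketch of the reuse pattern (illustrative names; the kernel launches are elided):

```cuda
// Read-only inputs cross the PCIe bus once, scratch buffers are allocated
// once and reused across iterations, and only the small per-iteration result
// is copied back each loop.
#include <cstddef>
#include <cuda_runtime.h>

void runLoops(const float* hOriginal, std::size_t nBytes,
              int* hAssignments, std::size_t aBytes, int maxIters) {
    float *dOriginal, *dScratch;
    int *dAssignments;
    cudaMalloc(&dOriginal, nBytes);
    cudaMalloc(&dScratch, nBytes);        // reused every iteration, never freed mid-run
    cudaMalloc(&dAssignments, aBytes);
    cudaMemcpy(dOriginal, hOriginal, nBytes, cudaMemcpyHostToDevice);  // copied once

    for (int it = 0; it < maxIters; ++it) {
        // ... launch kernels that read dOriginal, reuse dScratch,
        //     and write dAssignments ...
        cudaMemcpy(hAssignments, dAssignments, aBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dOriginal); cudaFree(dScratch); cudaFree(dAssignments);
}
```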
5.2 Parallel Implementation and Optimization for Sparse Cube
Instead of the parallel tree-based reduction used for the dense cube, atomic
operations are heavily used in the parallel implementation for the sparse cube.
Tree-based reduction does not work well here, because there is hardly a way to map
the data to be reduced onto sequential threads. In addition, these atomic operations
do not add much overhead: blocking only happens when two threads write to the same
memory location.
The most computation-intensive part is the computation of the distances, which is
the key target for parallelization. The parallel implementation of this computation is,
however, a little tricky. First, we treat the computation for one non-zero element as an
individual task and map it to one GPU thread. We put all the distance computations
related to an element into one task because the clustered cube is normally small,
while the number of non-zero elements is quite large, often exceeding, by several
times, the number of threads the GPU can execute simultaneously; dividing the task
into finer granularity would not improve performance. Although we could perform a
tree-based reduction to find the shortest distance among the candidates, that would
require synchronization and introduce a large overhead in extra memory and in warp
exchanges during execution. Instead, as we calculate the distances, we rewrite the
same shared memory field whenever the newer value is smaller, which is more
efficient. Figure 5-4 shows an example of the mapping between tasks and threads.
Each thread then atomically adds its value to the corresponding field in device memory.
Figure 5-4. Mapping of threads in parallel implementation for sparse cube
During the iteration over all the coordinates, we compute all the distances between
each element and every cluster of the dimension being clustered. The number of
distance fields is usually very large, and each field is frequently read or written while
the algorithm runs. This creates a problem: frequent atomic operations on global
memory cost a huge amount of time, because many of them execute in sequence and
each atomic write to global memory costs hundreds of cycles. This naive approach is
inefficient and can hardly be used in practice.
Shared memory is a good way to reduce the cost of atomic writes, but two
problems remain. One is that the distance table is too large, while in CUDA the shared
memory per block is quite small: just 16KB on devices with Compute Capability 1.x
and 48KB on devices with Compute Capability 2.x. It is impossible to fit the distances
of even a normal-sized co-clustering problem into shared memory. The other is that a
large number of atomic operations reduces parallelism: most of the operations become
sequential, erasing the advantage of parallelization.
We use a simple observation to solve these problems. Even though the total
number of distances is often very large, the maximal number of threads in one block is
small: 512 on devices with Compute Capability 1.x and 1024 on devices with Compute
Capability 2.x (we use the Compute Capability 1.x figures in the following explanation).
In the worst case, the number of distance fields generated in one thread block is 512
multiplied by the number of clusters in the dimension, which is still too large for shared
memory: if the number of clusters exceeds 8, the distance fields may exceed the limit.
To solve this, we add another preprocessing step that generates multiple copies of
the coordinate list. All copies represent the same cube, but the orders of the
coordinates differ: each copy is sorted in ascending order of the indexes of one
dimension. This preprocessing consumes time and space, but it is a one-time
procedure; we execute it once and keep the results in device memory. The sorting
reduces the number of distinct elements of the clustering dimension that appear in any
one block. Although the worst case is unchanged, it rarely happens in practice: from
the statistics of our experiments, no more than 40 distinct elements appear in one
block, and in most situations the number is around 10. Thus, we can place the
distances to be calculated by one block in its shared memory. Not only is the access
time reduced, but the number of atomic operations on global memory is also reduced.
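A hedged sketch of the resulting accumulation kernel (illustrative names and layouts, not the thesis's code; float atomicAdd assumes Compute Capability 2.0 or newer, whereas the thesis's C1060 would need an integer or fixed-point variant):

```cuda
// Each thread handles one non-zero record, accumulates its distance
// contributions into a per-block shared table, and the block merges its
// table into global memory with one atomic write per field. Assumes
// numClusters <= MAX_CLUSTERS and, per the statistics above, at most
// MAX_LOCAL_ELEMS distinct clustered-dimension elements per block.
#define MAX_LOCAL_ELEMS 40
#define MAX_CLUSTERS    16

__global__ void sparseDistances(const int*   elemIdx,  // clustered-dim index per record, sorted
                                const float* contrib,  // p*log(p/q) per (record, cluster)
                                int numRecords, int numClusters,
                                float* globalDist)     // dist[element * numClusters + cluster]
{
    __shared__ float local[MAX_LOCAL_ELEMS * MAX_CLUSTERS];
    __shared__ int firstElem;

    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)  // smallest element index this block touches
        firstElem = elemIdx[min(blockIdx.x * blockDim.x, numRecords - 1)];
    for (int i = threadIdx.x; i < MAX_LOCAL_ELEMS * numClusters; i += blockDim.x)
        local[i] = 0.0f;
    __syncthreads();

    if (r < numRecords) {
        // Because the list is sorted by elemIdx, slot stays small in practice.
        int slot = elemIdx[r] - firstElem;
        for (int c = 0; c < numClusters; ++c)
            atomicAdd(&local[slot * numClusters + c], contrib[r * numClusters + c]);
    }
    __syncthreads();

    // One write per (element, cluster) field per block; only fields whose
    // element straddles two blocks are written by more than one block.
    for (int i = threadIdx.x; i < MAX_LOCAL_ELEMS * numClusters; i += blockDim.x)
        if (local[i] != 0.0f)
            atomicAdd(&globalDist[firstElem * numClusters + i], local[i]);
}
```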
Table 5-1. Example of sorting indexes in threads and blocks
Thread block    Record index in threads
0               0 0 0 0 0 0 1 1
1               1 1 1 2 2 3 3 3
2               3 3 4 4 4 4 4 4
3               5 5 5 5 6 6 6 7
Table 5-2. Example of shared memory
Block   Element   Distance to cluster
0       0         0.14 0.14 0.23 0.23
0       1         0.14 0.12 0.34 0.24
1       1         0.24 0.14 0.23 0.53
1       2         0.23 0.14 0.34 0.24
1       3         0.55 0.53 0.14 0.50
2       3         0.24 0.34 0.33 0.60
2       4         0.06 0.05 0.35 0.50
3       5         0.40 0.44 0.35 0.24
3       6         0.32 0.33 0.53 0.14
3       7         0.36 0.24 0.42 0.11
Atomic operations on shared memory are fast compared to those on global
memory, and copying from shared memory to global memory is straightforward.
Sequential writes only happen for the fields whose element appears in two blocks; in
the worst case, the number of such fields only equals the number of blocks, so the
total time spent on sequential operations drops. In this way, we solve the problem of
the distance computation. The other parts, such as computing the clustered cube and
deriving the assignments from the distances, use the same parallelization approach as
described earlier.
CHAPTER 6
EXPERIMENTS
This chapter provides evidence for the benefits of the Multi-Dimensional ITCC and
its parallel implementation. In particular, we apply the implementation to randomly
generated data for performance evaluation, and to real wireless records to evaluate
the co-clustering itself. We show that the algorithm works well on multi-dimensional
data, and that the parallel implementation achieves an obvious speedup over the serial
implementation on the same data sets.
6.1 Data Set, Environment and Measurement Details
The data sets used to evaluate the performance of the implementations are
randomly generated 3D data. The size of the cube is 200 × 200 × 200. In total we
generated 10 data sets, with 10000, 20000, 40000, 80000, ..., 5120000 records
respectively.
The data sets used to evaluate the co-clustering results are wireless data records.
There are 2 data sets; Table 6-1 shows the details of each. In the table, uid stands for
User ID, did for Domain ID, and lid for Location ID.
The evaluation takes place on a Tesla node of the University of Florida High
Performance Computing Center. The host has 4 Intel E5462 cores running at 2.8GHz,
16GB of RAM, and an NVIDIA Tesla C1060 GPU with 4GB of RAM. The parallel and
serial implementations run on the same machine.
To exclude other factors that may affect the result, such as preprocessing and
output, we only measure the time for the computation in each loop, which is the core of
the parallelization. To reduce the uncertainty of the measurements, each per-loop time
is derived from the time consumed by several loops divided by the number of loops.

Table 6-1. Information on datasets
                                    Dataset 1    Dataset 2
Number of Dimensions                2            3
Number of Non-Zero Elements         305464       18808
Names of Dims                       uid, did     uid, did, lid
Number of Elements per Dimension    22816, 100   1800, 100, 68
Number of Clusters                  15, 15       10, 10, 10
6.2 Performance and Discussion
The performance of the implementations depends on many factors, including the
number of dimensions, the number of elements in each dimension, the number of
non-zero elements, and the number of clusters in the results.
The number of iterations varies with the input data and its initialization. In our
experiments, we set the threshold to $10^{-6}$: the co-clustering stops when the change
in the loss of mutual information falls below it. Figure 6-1 shows how the loss of mutual
information evolves as the co-clustering proceeds; it decreases rapidly at first and
more slowly in each later loop.
Figure 6-1. Trends of the loss of mutual information
We also compared the loss of mutual information before and after co-clustering
across 20 executions on the same data sets. Figure 6-2 shows the changes: the
algorithm successfully decreases the loss of mutual information regardless of the
initialization applied to the original data.
Figure 6-2. Comparison of loss of mutual information
For the performance evaluation, we apply our implementations to the 10 randomly
generated datasets. Figure 6-3 shows the detailed performance results. Generally
speaking, the parallel implementations improve on the serial implementations for both
the sparse cube and the dense cube.
For the same dataset, time consumption grows as the number of clusters grows,
because the number of distances to be calculated grows. Comparing the
implementation for dense data with the one for sparse data, the former performs better
on the dense data sets, while the latter performs better on the sparse ones. The data
shows that, for the serial implementations, the dense implementation performs better
once the density of non-zero records exceeds 8%.
A Performance results on sparse cube
B Performance results on dense cube
Figure 6-3. Performance results
For the parallel implementations, the parallelism exploited by the implementation
for sparse data is much greater than that exploited by the implementation for dense
data. The time consumption of the sparse-cube implementation grows linearly with the
number of records, for both the serial and the parallel implementation.
In sum, the implementations for the dense cube show better performance on
dense data, while the implementations for the sparse cube do better on sparse data.
Both parallel implementations achieve an obvious speedup over the corresponding
serial implementations.
6.3 Co-clustering Algorithm Results
To show the co-clustering results of Multi-Dimensional ITCC, we applied the
implementation to the real 3D wireless data records and visualized the distribution of
data points in 3D space. Figure 6-4 shows the data point distributions before and after
running the co-clustering algorithm; in the figure, the IDs in the same cluster are
grouped together. The figure clearly shows that data points with similar properties are
clustered together after the algorithm executes.
The co-clustering results depend heavily on the initialization of the cluster
assignments of the original elements; in our experiments, we use random initialization
for these datasets. In this section, we show one domain co-clustering result, from the
2D dataset. Figure 6-5 shows the clustered domain names.
From this result, we can empirically find some interesting clusters that group
together domains that certain groups of users usually visit together. First,
microsoftoffice2007, mcafee, and windowsmedia all fall in cluster 5, which may collect
websites that Windows users often visit. yahoo and yimg both belong to Yahoo, and
they are likely always visited together. hotmail, live, and go all belong to Microsoft Live
services. Some potential knowledge might also be extracted from the results. For
example, we might guess that Mac users favor washingtonpost, because mac and
washingtonpost
A Data points distribution after initialization
B Data points distribution after co-clustering
Figure 6-4. Data points distribution before and after co-clustering
Cluster 0: netflix flyingcroc wsj
Cluster 1: google veoh lexis
Cluster 2: about adrevolver brightcove cnn contextweb diggriver doubleclick
ebay ebayrtm fastclick gridserver imageshack itunes microsoft
mozilla panthercdn secureserver theplanet tribalfusion typepad
webtrendslive wikimedia xpc-mii youtube virtualearth tmcs
ebayimg imeem myspace bankofamerica coremetrics americanidol
ha-hosting msnbcsports ltdomains nih nytimes bigcharts
Cluster 3: cbsig ln travelpn
Cluster 4: ilike aol
Cluster 5: infoave usc windowsmedia hackerwatch mcafee microsoftoffice2007
Cluster 6: comcast harvard hamachi ucsb digg
Cluster 7: facebook llnw mediaplex msn tfbnw xlhost apmebf 247realmedia
Level3
Cluster 8: go net live hotmail gotomypc
Cluster 9: aster fastres fastwebnet smartbro bodoglife torrentbox qwest
Cluster 10: apple
Cluster 11: bluehost rr yahoo yimg
Cluster 12: drmisnotforsale softlayer quiettouch westlaw
Cluster 13: co steadfast socialmedia lokm
Cluster 14: cnet washingtonpost earthlink mac opendns orb
Figure 6-5. Co-clustering result of domain names of 2D dataset
belong to one cluster. In other words, the results of multi-dimensional co-clustering are
valuable and instructive for discovering knowledge in large amounts of new data.
However, this is only the locally optimal result for one initialization; different
initializations can give very different results. According to our experiments, the results
from random initializations are quite unstable. A feasible remedy is to initialize ITCC
with the co-clustering results of another algorithm; we then obtain a result with better
mutual information than the one used for initialization.
CHAPTER 7
CONCLUSION
The parallel and serial implementations of the Multi-Dimensional Information
Theoretic Co-Clustering algorithm show that it works well with data of any
dimensionality, especially sparse, high-dimensional data. The parallel implementations
for both the sparse and the dense cube obviously outperform the corresponding serial
versions. The implementations for the sparse cube adapt better to large-scale data,
which makes them the more useful choice in most situations; the parallel
implementation for the sparse cube also exploits more of the algorithm's parallelism
and therefore shows a more pronounced speedup over its serial counterpart. The
Multi-Dimensional ITCC can successfully co-cluster multi-dimensional data and expose
potential knowledge hidden in the data, which is helpful for knowledge discovery from
real-world data.
REFERENCES
[1] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), pages 89-98, 2003.
[2] M. Harris. Optimizing parallel reduction in CUDA. Technical report, NVIDIA Corporation, 2007.
[3] NVIDIA Corporation. NVIDIA CUDA C Programming Guide. http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf, 2010.
BIOGRAPHICAL SKETCH
Xiaoyang Gao is a Master of Science candidate in computer engineering in the
Department of Computer and Information Science and Engineering at the University
of Florida. While studying at the University of Florida, he conducted research in parallel
computing and data mining under Dr. Sanjay Ranka's supervision and applied it to
modeling wireless data. He received his bachelor's degree in Computer Science and
Technology from Huazhong University of Science and Technology, Wuhan, P. R. China.