EFFICIENT IMPLEMENTATION OF MULTI-DIMENSIONAL CO-CLUSTERING
By
XIAOYANG GAO
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2011
© 2011 Xiaoyang Gao
To my Mom and Dad, and everyone who helped me finish this thesis
ACKNOWLEDGMENTS
Of the many people who have been enormously helpful in the preparation of this
thesis, I am especially and heartily thankful to my supervisor, Dr. Sanjay Ranka. The
thesis could not have been written without him; he not only served as my supervisor
but also encouraged and challenged me throughout my academic work. His patience
in answering my various questions and his instructive guidance taught me useful
methods for analyzing and solving a problem, and a good attitude for finishing a job well.
I would like to warmly acknowledge Dr. Ahmed Helmy for his support and guidance
in this interesting project, whose real wireless network data helped greatly in finishing
this thesis. I always learned something in my meetings with him. Also, I would like
to thank Dr. Shigang Chen for his instruction, which stimulated my interest in computer
networks and gave me a good command of networking knowledge, as well as for his
input on my thesis defense.
In addition, special thanks go to Saeed Moghaddam and Clint P. George for their
necessary support in analyzing the co-clustering results, and most especially to my
family and all my friends; their consideration, motivation, and encouragement enabled
me to complete this thesis.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 MULTI-DIMENSIONAL INFORMATION THEORETICAL CO-CLUSTERING
3 DATA REPRESENTATION
   3.1 Original Data Cube
       3.1.1 Dense Cube
       3.1.2 Sparse Cube
   3.2 Clustered Data Cube
   3.3 Marginal Distribution
4 SERIAL IMPLEMENTATION
   4.1 Preprocessing
   4.2 Serial Implementation and Optimization for Dense Cube
   4.3 Serial Implementation and Optimization for Sparse Cube
5 PARALLEL IMPLEMENTATION
   5.1 Parallel Implementation and Optimization for Dense Cube
       5.1.1 Parallel Reduction on CUDA Platform
       5.1.2 Multi-items Calculation on CUDA Platform
       5.1.3 Optimization
   5.2 Parallel Implementation and Optimization for Sparse Cube
6 EXPERIMENTS
   6.1 Data Set, Environment and Measurement Details
   6.2 Performance and Discussion
   6.3 Co-clustering Algorithm Results
7 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

5-1 Example of sorting indexes in threads and blocks
5-2 Example of shared memory
6-1 Information on datasets
LIST OF FIGURES

2-1 Multi-dimensional information theoretic co-clustering algorithm
3-1 Storage for dense cube in memory
3-2 Storage for sparse cube in memory
3-3 Storage for jagged 2D array in memory
4-1 Flow of implementation
4-2 Procedure of computing distances in iteration on sparse cube
5-1 Tree-based reduction [2]
5-2 Example of 2D thread mapping for marginal distribution computation
5-3 Reduction of repeated communication between host and device
5-4 Mapping of threads in parallel implementation for sparse cube
6-1 Trends of the loss of mutual information
6-2 Comparison of loss of mutual information
6-3 Performance results
6-4 Data points distribution before and after co-clustering
6-5 Co-clustering result of domain names of 2D dataset
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
EFFICIENT IMPLEMENTATION OF MULTI-DIMENSIONAL CO-CLUSTERING
By
Xiaoyang Gao
August 2011
Chair: Sanjay Ranka
Major: Computer Engineering
Co-clustering is an important data mining operation that automatically clusters
data along two or more dimensions. Most of the work in the literature focuses
on co-clustering in two dimensions. In this thesis, we develop extensions of ITCC
(Information Theoretical Co-Clustering) for multi-dimensional data. We first extend
the approach to more than two dimensions. We also develop parallel algorithms
for the resulting approach. Our experimental results show that our algorithms and
implementations scale well to large datasets on both sequential and parallel
machines. The Multi-Dimensional ITCC has been used to help analyze
multi-dimensional wireless data records and uncover hidden models of user activities.
CHAPTER 1
INTRODUCTION
In the era of data explosion, large amounts of data are generated every day.
However, our ability to use the data fails to keep up with its growth, and much of the
knowledge hidden in it goes undiscovered. Clustering is a fundamental tool in
data mining: it automatically groups similar objects into clusters in an
unsupervised way, helping people uncover knowledge that can hardly be discovered
by observation based on common sense or current knowledge.
Data in the real world almost always has more than one attribute. Clustering along only
one dimension cannot discover knowledge that relates to all the attributes. To deal
with this, co-clustering provides a way to automatically and simultaneously cluster the
data along two or more dimensions. Nowadays, this technique is widely used in many
areas, including text, web-log, bioinformatics, and wireless network data analysis and
modeling. Researchers have tried different measures of similarity between objects to
analyze the data from different aspects, and various co-clustering algorithms have been
presented in the literature for different applications. Most of them focus on
two-dimensional data. However, there is great demand for clustering data in more
dimensions, since real-world data almost always has more than two. For example,
traffic records from a wireless network have attributes such as users, domains, time,
and locations. It is often desirable to co-cluster on all of them and discover knowledge
spanning all the dimensions. As data dimensionality increases, efficient implementations
of high-dimensional co-clustering algorithms are also needed in practice to deal with
the huge amounts of real-world data.
Generally speaking, we can treat multi-dimensional data as a contingency
table. Information theory provides a principled quantity, mutual information, that
measures the mutual dependence of random variables, and it gives a good way to
measure whether a co-clustering is optimal. Based on mutual information, Dhillon
et al. [1] presented the Information Theoretical Co-Clustering (ITCC) algorithm, an
efficient co-clustering algorithm. It treats the optimal co-clustering as the one that
preserves the largest mutual information between the clustered random variables;
equivalently, it minimizes the difference between the mutual information of the original
random variables and that of the clustered ones. For 2D data, it intertwines row and
column clustering at all stages. The algorithm has been proven to monotonically
decrease this difference in mutual information and to converge to a locally optimal
co-clustering that depends on the initialization of the cluster assignments. Fortunately,
according to the literature [1], ITCC extends to co-clustering on multi-dimensional
data in a reasonable way without introducing much cost in efficiency. In this thesis,
we use the Multi-Dimensional Information Theoretical Co-Clustering (MDITCC)
algorithm in the implementation.
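For reference, the mutual information that drives the objective is the standard information-theoretic quantity; in the notation used throughout this thesis, for discrete variables $D_1, \ldots, D_n$,
$$I(D_1; D_2; \ldots; D_n) = \sum_{d_1} \sum_{d_2} \cdots \sum_{d_n} p(d_1, d_2, \ldots, d_n) \log \frac{p(d_1, d_2, \ldots, d_n)}{p(d_1)\, p(d_2) \cdots p(d_n)},$$
and MDITCC seeks the clustering that minimizes the loss $I(D_1; \ldots; D_n) - I(\hat{D}_1; \ldots; \hat{D}_n)$, as formalized in Chapter 2.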
Due to the large scale of high-dimensional data, it is necessary to construct a
more efficient implementation of the algorithm that makes co-clustering faster without
losing precision. Parallelizing the algorithm is an ideal way to improve performance
and capacity. NVIDIA provides CUDA (Compute Unified Device Architecture) [3], a
parallel computing architecture that enables dramatic increases in computing
performance by harnessing the power of the GPU (Graphics Processing Unit) for
general-purpose computing. CUDA substantially decreases the cost per GFLOPS and
provides a developer-friendly environment for constructing parallel programs. All of
this makes CUDA an ideal platform for parallelizing the Multi-Dimensional ITCC. We
develop parallel algorithms based on Multi-Dimensional ITCC on the NVIDIA CUDA
platform to improve the efficiency and throughput of our implementation.
In this thesis, we present a novel and efficient implementation of the
Multi-Dimensional Co-Clustering algorithm, which is based on Information Theoretical
Co-Clustering and performs efficiently on large multi-dimensional data, especially
sparse, high-dimensional data. We first describe the Multi-Dimensional ITCC and prove
its key formulas in multiple dimensions. Then, we present the data representations
used to store the various data in the computation, including the original data, clustered
data, marginal distributions, and other auxiliary data; separate data structures for
sparse and dense data are presented for different applications. Next come the
optimized serial implementations for sparse and dense data, followed by the parallel
ones on the CUDA platform. Finally, we present experiments that show the
performance improvement our implementation provides and the results of the
co-clustering. We demonstrate that our implementation works correctly and efficiently
on large-scale, high-dimensional data by presenting co-clustering results for
high-dimensional wireless network data. The results also show that the parallel
implementation improves performance markedly, especially on large-scale data.
CHAPTER 2
MULTI-DIMENSIONAL INFORMATION THEORETICAL CO-CLUSTERING
The original description of the Information Theoretical Co-Clustering algorithm is
for two-dimensional data. However, it extends easily to multi-dimensional space, as
mentioned in the original literature [1]. To outline the approach of Multi-Dimensional
ITCC, we first prove the key formulas of ITCC in multi-dimensional space, and then
describe the Multi-Dimensional ITCC.
In multi-dimensional space, we assume the variables in each dimension are
independent of each other, and treat the input data as a multi-dimensional contingency
table. The key is to express the loss of mutual information in multi-dimensional space,
so a new representation of this loss for a multi-dimensional contingency table is
necessary. Based on the above assumptions, we can write the new definition of the
loss of mutual information in multi-dimensional space as follows:
Lemma 1. For a fixed co-clustering $(C_{D_1}, C_{D_2}, \ldots, C_{D_n})$, we can write the loss of mutual information as
$$I(D_1; D_2; \ldots; D_n) - I(\hat{D}_1; \hat{D}_2; \ldots; \hat{D}_n) = D\big(p(D_1, D_2, \ldots, D_n) \,\|\, q(D_1, D_2, \ldots, D_n)\big) \tag{2-1}$$
where $D(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence, also known as relative entropy, and $q(D_1, D_2, \ldots, D_n)$ is the distribution of the form
$$q(d_1, d_2, \ldots, d_n) = p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) \prod_{i=1}^{n} p(d_i \mid \hat{d}_i) \tag{2-2}$$
Proof of Lemma 1. Using
$$p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) = \sum_{d_1 \in \hat{d}_1} \sum_{d_2 \in \hat{d}_2} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, d_2, \ldots, d_n),$$
we have
$$\begin{aligned}
& I(D_1; D_2; \ldots; D_n) - I(\hat{D}_1; \hat{D}_2; \ldots; \hat{D}_n) \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{p(d_1)\, p(d_2) \cdots p(d_n)} \\
&\quad - \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \Big( \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \Big) \log \frac{p(\hat{d}_1, \ldots, \hat{d}_n)}{p(\hat{d}_1) \cdots p(\hat{d}_n)} \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{p(\hat{d}_1, \ldots, \hat{d}_n) \frac{p(d_1)}{p(\hat{d}_1)} \frac{p(d_2)}{p(\hat{d}_2)} \cdots \frac{p(d_n)}{p(\hat{d}_n)}} \\
&= \sum_{\hat{d}_1} \cdots \sum_{\hat{d}_n} \sum_{d_1 \in \hat{d}_1} \cdots \sum_{d_n \in \hat{d}_n} p(d_1, \ldots, d_n) \log \frac{p(d_1, \ldots, d_n)}{q(d_1, \ldots, d_n)}
\end{aligned}$$
Some simple but useful equalities between $p$ and $q$, which highlight the properties
that make $q$ a desirable approximation to $p$, are also presented.
Proposition 2.1.
$$q(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) = p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n), \qquad q(d_i, \hat{d}_i) = p(d_i, \hat{d}_i) \tag{2-3}$$
$$q(d_i) = p(d_i), \qquad q(\hat{d}_i) = p(\hat{d}_i) \tag{2-4}$$
$$p(d_i \mid \hat{d}_i) = q(d_i \mid \hat{d}_i) \tag{2-5}$$
$$p(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) = q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \tag{2-6}$$
for all $d_i, \hat{d}_i$, $1 \le i \le n$. Further, if $\hat{d}_i = C_{D_i}(d_i)$, then
$$q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n \mid \hat{d}_i) = q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \prod_{k \ne i} q(d_k \mid \hat{d}_k) \tag{2-7}$$

Proof of Proposition 2.1. Equations 2-4, 2-5, and 2-6 are simple to show and follow from Equation 2-3. Equation 2-7 follows from
$$\begin{aligned}
q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n \mid \hat{d}_i)
&= q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n, \hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \\
&= \frac{q(d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n, \hat{d}_1, \ldots, \hat{d}_n)}{q(\hat{d}_i)} \\
&= \frac{\sum_{d_i \in \hat{d}_i} p(\hat{d}_1, \ldots, \hat{d}_n) \prod_{k=1}^{n} p(d_k \mid \hat{d}_k)}{q(\hat{d}_i)} \\
&= \frac{q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n)}{q(\hat{d}_i)} \prod_{k \ne i} q(d_k \mid \hat{d}_k) \\
&= q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i) \prod_{k \ne i} q(d_k \mid \hat{d}_k)
\end{aligned}$$
Algorithm Co_Clustering$(n, p, l_1, l_2, \ldots, l_n, C_{D_1}, C_{D_2}, \ldots, C_{D_n})$
Input: the joint probability distribution $p(D_1, D_2, \ldots, D_n)$; $l_1, l_2, \ldots, l_n$, the desired numbers of clusters in each dimension.
Output: the partition functions $C_{D_1}, C_{D_2}, \ldots, C_{D_n}$.

1. Initialization: set $t = 0$. Start with some initial partition functions $C^{(0)}_{D_1}, C^{(0)}_{D_2}, \ldots, C^{(0)}_{D_n}$. Compute
$$q^{(0)}(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_n), \quad q^{(0)}(D_1 \mid \hat{D}_1), \quad q^{(0)}(D_2 \mid \hat{D}_2), \quad \ldots, \quad q^{(0)}(D_n \mid \hat{D}_n),$$
and the distributions $q^{(0)}(D_2, D_3, \ldots, D_n \mid \hat{d}_1)$, $1 \le \hat{d}_1 \le l_1$.

2. Iterate on each dimension $k$ from 1 to $n$:

(a) Compute the clusters for dimension $k$: for each $d_k$, find its new cluster index as
$$C^{(t+k)}_{D_k}(d_k) = \operatorname*{argmin}_{\hat{d}_k} D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big),$$
resolving ties arbitrarily. Let $C^{(t+k)}_{D_j} = C^{(t+k-1)}_{D_j}$ for $j \ne k$.

(b) Compute the distributions
$$q^{(t+k)}(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_n), \quad q^{(t+k)}(D_1 \mid \hat{D}_1), \quad \ldots, \quad q^{(t+k)}(D_n \mid \hat{D}_n),$$
and the distributions $q^{(t+k)}(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)$, $1 \le \hat{d}_k \le l_k$.

3. If the change in objective function value, that is, $D(p(D_1, \ldots, D_n) \,\|\, q^{(t)}(D_1, \ldots, D_n)) - D(p(D_1, \ldots, D_n) \,\|\, q^{(t+n)}(D_1, \ldots, D_n))$, is "small", stop and return $C_{D_1} = C^{(t+n)}_{D_1}, \ldots, C_{D_n} = C^{(t+n)}_{D_n}$. Otherwise set $t = t + n$ and go to step 2.

Figure 2-1. Multi-dimensional information theoretic co-clustering algorithm
We can now describe the Multi-Dimensional ITCC, summarized in Figure 2-1. The
algorithm starts with an initial cluster assignment for every element in each dimension.
Depending on the initialization, the algorithm converges to a local minimum of the loss
in mutual information; it cannot guarantee a global minimum.
CHAPTER 3
DATA REPRESENTATION
Data representation plays an important role in the implementation. We treat the
multi-dimensional data as a data cube. Analyzing the algorithm shows that the original
data cube, the clustered data cube, and the marginal distributions of the data cubes
are the three most important types of data. The data structures used in the
implementation should generically support data in any number of dimensions. They
are also designed for efficient access, low space overhead, and
parallel-communication friendliness, meaning the data can be extracted into a
one-dimensional array with minimal time and space overhead, since most parallel
communication operations work best with arrays of basic types. The following sections
describe the data structures used for the original data cube, the clustered data cube,
and the marginal distributions in detail.
3.1 Original Data Cube
Based on the proportion of zeros in the cube, original data cubes can be divided
into two types. One is the sparse cube, which is populated primarily with zeros. The
other is the dense cube, in which the majority of elements are non-zero. For these two
types of cubes, two different data structures are designed separately.
3.1.1 Dense Cube
The data structure for the dense cube uses a one-dimensional array to store all the
elements in the cube. The elements are stored sequentially from the logically first
element to the last. To access an element in the cube, we use a converter that turns
the multi-dimensional indexes into an index of this one-dimensional array, as well as
another converter for the opposite conversion. The complexity of accessing an
element is O(k), where k is the number of dimensions. Since k is always small, in most
cases less than 10, we can treat the time to access one element as constant.
Figure 3-1 shows an example of 2D data stored in the dense cube structure.
Figure 3-1. Storage for dense cube in memory
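As a minimal sketch of the two converters just described (illustrative code, not the thesis's implementation; names and the row-major layout are assumptions):

```cuda
// Hypothetical sketch of the dense-cube index converters. dims[i] is the
// number of elements in dimension i; both conversions cost O(k).
#include <cstddef>
#include <vector>

// Multi-dimensional indexes -> flat array address (row-major order).
std::size_t toAddress(const std::vector<std::size_t>& idx,
                      const std::vector<std::size_t>& dims) {
    std::size_t addr = 0;
    for (std::size_t i = 0; i < dims.size(); ++i)
        addr = addr * dims[i] + idx[i];        // Horner-style accumulation
    return addr;
}

// Flat array address -> multi-dimensional indexes (the opposite conversion).
std::vector<std::size_t> toIndexes(std::size_t addr,
                                   const std::vector<std::size_t>& dims) {
    std::vector<std::size_t> idx(dims.size());
    for (std::size_t i = dims.size(); i-- > 0; ) {
        idx[i] = addr % dims[i];
        addr /= dims[i];
    }
    return idx;
}
```

For the 4 × 4 example of Figure 3-1, toAddress({1, 2}, {4, 4}) returns address 6, matching the figure's converter example.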
3.1.2 Sparse Cube
One of the most common data structures for storing a sparse cube is the
coordinate list, in which each record contains the multi-dimensional coordinate of an
element and its value. The coordinate list includes all the non-zero elements in the cube.
An interesting characteristic of the original data cube is that it is unnecessary to
visit its elements in any specific order; visiting the elements in any sequence is
acceptable. At the same time, we never need to modify the cube. These properties
make the coordinate list format an ideal representation for the sparse data.
Again, one-dimensional arrays are used for the storage. Specifically, suppose the
number of non-zero elements is n and the number of dimensions is k. An array of
n * k entries stores the coordinates of all the elements, with each run of k consecutive
entries representing the coordinate of one element. Another array of n entries stores
the values of the elements. The order of elements is the same in both arrays: the i-th
run of k consecutive entries in the first array is the coordinate of the i-th element, while
the i-th entry in the second array is its value.
It is worth noting that no random access interface is provided for the sparse cube,
mainly because the algorithm never needs one.
Figure 3-2. Storage for sparse cube in memory
Figure 3-2 shows an example of 2D data stored in the sparse cube structure. The
time complexity of visiting all the elements in the cube is O(l), where l is the number of
non-zero elements; in a sparse cube, the number of non-zeros is much smaller than in
a dense cube.
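A minimal sketch of this coordinate-list layout (illustrative, with assumed names; the thesis's code is not shown):

```cuda
// Hypothetical sketch of the coordinate-list (COO) representation described
// above: one flat array of coordinates, one flat array of values.
#include <cstddef>
#include <vector>

struct SparseCube {
    std::size_t numDims;          // k: number of dimensions
    std::size_t numNonZeros;      // n: number of non-zero elements
    std::vector<int>   coords;    // n * k entries; coords[i*k .. i*k+k-1] is element i's coordinate
    std::vector<float> values;    // n entries; values[i] is element i's value

    // Sequential visiting is the only access pattern the algorithm needs,
    // so no random-access interface is provided.
    template <typename Visitor>
    void forEach(Visitor visit) const {
        for (std::size_t i = 0; i < numNonZeros; ++i)
            visit(&coords[i * numDims], values[i]);
    }
};
```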
3.2 Clustered Data Cube
Analyzing the algorithm reveals some interesting characteristics of the clustered
data cube that help in designing its data structure:
• The clustered cube is always dense;
• The data in the clustered cube changes frequently at runtime;
• The elements are randomly accessed throughout the execution;
• The clustered cube is always small enough that storing all its elements does not consume much memory.
The dense cube structure described in the Original Data Cube section satisfies all
of the demands above. Therefore, we reuse the dense cube representation for the
clustered data cube.
3.3 Marginal Distribution
The marginal distributions are frequently accessed during execution. Because the
number of elements differs across dimensions, the most space-efficient structure for
them is a jagged two-dimensional array, indexed first by dimension number and then
by the index of the element within that dimension.
In our implementation, we prefer one-dimensional arrays, so we flatten the jagged
array: the one-dimensional array stores all its entries sequentially, from the first
dimension's entries to the last. For fast access to a specific entry, instead of
recomputing its position by adding up the lengths of all preceding dimensions, the
starting offset of each dimension is stored in an auxiliary array.
It is worth noting that the same structure is also used for the cluster assignments,
which stores the cluster index for each element in each dimension.
Figure 3-3 shows an example of 4D marginal distributions stored in this data
structure.
Figure 3-3. Storage for jagged 2D array in memory
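A minimal sketch of this flattened jagged array (illustrative names, not the thesis's code):

```cuda
// Hypothetical sketch of the flattened jagged array for marginal distributions.
// offset[d] is the start of dimension d's entries in the flat array (the
// auxiliary array described above).
#include <vector>

struct MarginalDistribution {
    std::vector<float> values;   // all dimensions' marginals, concatenated
    std::vector<int>   offset;   // offset[d] = index of dimension d's first entry

    // O(1) access to the marginal of element i in dimension d.
    float get(int d, int i) const { return values[offset[d] + i]; }
};
```

With offsets {0, 3, 8, 12} as in the 4D example of Figure 3-3, get(2, 1) reads values[9], the second entry of dimension 2.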
CHAPTER 4
SERIAL IMPLEMENTATION
The serial implementation provides the basic program structure for the algorithm.
For the two types of data, a sparse version and a dense version are implemented and
optimized separately. The serial implementation also provides the fundamental
workflow for the parallel implementation, which mainly focuses on parallelizing the
computation-intensive parts of the algorithm. Figure 4-1 shows the flow structure of
the whole program.
We can divide the computation in the algorithm into several small basic operations:
• Calculating the clustered data cube;
• Calculating the marginal distributions of the cubes (both the original one and the clustered one);
• Calculating the distance between each element and each candidate cluster,
$$D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big),$$
and the corresponding $q^{(t+k)}(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k)$, $1 \le \hat{d}_k \le l_k$;
• Finding the minimum of all the distances,
$$C^{(t+k)}_{D_k}(d_k) = \operatorname*{argmin}_{\hat{d}_k} D\big(p(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid d_k) \,\|\, q(D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n \mid \hat{d}_k)\big).$$
These basic operations are the key parts of the implementation. The following
sections first introduce the preprocessing stage of the program, which converts the
input data into a form friendlier to these operations while saving time and space. Then
the computation-intensive operations of the algorithm are discussed separately in their
dense and sparse forms. Some implementation-level optimizations are also described
that achieve better performance.
Figure 4-1. Flow of implementation
4.1 Preprocessing
Preprocessing is indispensable for making the implementation work correctly and
efficiently. The elements of each input dimension might be of any type, not just
integers. Even when they are integers, the values may be scattered over a large
range, which creates huge unused regions in multi-dimensional storage and redundant
computation.
Preprocessing counts the number of elements in each dimension and assigns each
element a new, sequential ID within that dimension. In this way, elements with no
non-zero records in any dimension are eliminated; such elements would otherwise
reduce the performance of the algorithm and waste the limited memory. Repeated
records, which have the same attributes, are aggregated.
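As an illustrative sketch of the ID assignment (hypothetical names, assuming string-valued raw attributes):

```cuda
// Hypothetical sketch of the ID remapping step: each distinct raw attribute
// value observed in a dimension receives the next sequential ID, so values
// with no records leave no holes in the cube.
#include <string>
#include <unordered_map>

struct IdMapper {
    std::unordered_map<std::string, int> ids;  // raw value -> dense sequential ID
    int next = 0;

    int remap(const std::string& raw) {
        auto it = ids.find(raw);
        if (it != ids.end()) return it->second;
        ids.emplace(raw, next);
        return next++;                         // first time seen: assign new ID
    }
};
```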
4.2 Serial Implementation and Optimization for Dense Cube
Most of the computation can be done in a single iteration over all the elements in
the cube. The operations behave as follows:
• Computing the clustered data cube and the corresponding marginal distributions: the program derives the indexes in the clustered cube from each element's index, accesses the corresponding entries of the marginal distributions and the clustered data cube through those indexes, and adds the element's value into those entries.
• Computing the intermediate q values: during the iteration over all the elements of the original cube, the program calculates the q values following the equation
$$q^{(t+k)}(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k) = p(\hat{d}_1, \ldots, \hat{d}_{k-1}, \hat{d}_{k+1}, \ldots, \hat{d}_n \mid \hat{d}_k) \prod_{i \ne k} p(d_i \mid \hat{d}_i), \quad 1 \le \hat{d}_k \le l_k.$$
• Computing the distances between elements and clusters: the program iterates over all pairs of element and cluster and calculates the distances following the definition of the distance, the Kullback-Leibler (KL) divergence; the shortest distance and the corresponding cluster are stored. In detail, the calculation follows
$$D\big(p(\cdot \mid d_k) \,\|\, q(\cdot \mid \hat{d}_k)\big) = \sum_{d_1} \cdots \sum_{d_{k-1}} \sum_{d_{k+1}} \cdots \sum_{d_n} p(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid d_k) \log \frac{p(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid d_k)}{q(d_1, \ldots, d_{k-1}, d_{k+1}, \ldots, d_n \mid \hat{d}_k)}$$
An optimization can be adopted in the distance computation. Heavy computation
takes place in the repeated calculation of indexes and index conversions, which costs
a large amount of time. To reduce it, instead of computing the distance separately for
each pair of element and cluster, we compute all the $p \log \frac{p}{q}$ terms in a single pass
over all the elements and add each term to the corresponding distances. This avoids
repeatedly recomputing the same indexes. Experiments show this optimization
removes a large amount of calculation time and greatly improves the performance of
the whole program.
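A hedged sketch of this one-pass scheme (the array layouts and names are illustrative assumptions; per-entry q values for each candidate cluster are assumed to be available):

```cuda
// Hypothetical sketch: visit each cube entry once and scatter its
// p*log(p/q) contribution into the distance table dist[element][cluster],
// instead of recomputing index conversions per (element, cluster) pair.
#include <cmath>
#include <cstddef>
#include <vector>

void accumulateDistances(const std::vector<float>& p,      // probability of each entry
                         const std::vector<float>& q,      // q per (entry, candidate cluster)
                         const std::vector<int>& elemIdx,  // clustered-dimension index per entry
                         int numClusters,
                         std::vector<float>& dist)         // dist[e * numClusters + c]
{
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] == 0.0f) continue;                        // zero entries contribute nothing
        for (int c = 0; c < numClusters; ++c)
            dist[elemIdx[i] * numClusters + c] +=
                p[i] * std::log(p[i] / q[i * numClusters + c]);
    }
}
```

After the pass, the new assignment of each element is simply the argmin over its row of the distance table.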
4.3 Serial Implementation and Optimization for Sparse Cube
Because the data representation of the sparse cube differs from that of the dense
cube, the serial implementation for the sparse cube solves the problem differently.
Much of it, however, follows the same principles as the dense version, including the
computation of the clustered cube and the marginal distributions.
The most complicated part, the distance computation together with the q
computation, is also done in one iteration over all the elements of the sparse cube.
While visiting one element, we examine each dimension index of the element: if it is the
dimension currently being clustered, the q value is multiplied by
$q(\hat{d}_1, \ldots, \hat{d}_{i-1}, \hat{d}_{i+1}, \ldots, \hat{d}_n \mid \hat{d}_i)$; otherwise it is multiplied by $q(d_i \mid \hat{d}_i)$. Figure 4-2
shows an example of the basic procedure for computing distances in this implementation.
Figure 4-2. Procedure of computing distances in iteration on sparse cube
CHAPTER 5
PARALLEL IMPLEMENTATION
The parallel implementations, for both the dense cube and the sparse one, focus
on parallelizing the core, computation-intensive parts of the algorithm. In this chapter,
the parallel implementation for the dense cube is presented first, followed by the one
for the sparse cube.
5.1 Parallel Implementation and Optimization for Dense Cube
Analyzing the operations on the dense cube in the algorithm, we can divide all the
computation into two types of abstract operations:
• Reduction. This appears in the calculation of the cluster distribution, the marginal distributions, the distances between each element and the candidate clusters in the corresponding dimension, and the cluster assignments.
• Multi-item calculation. This type needs to calculate many items with the same formula but different data sources. It appears mostly in the intermediate q value calculation and the distances.
Some calculations, such as the distance calculation, belong to both types. These two
types make up the most intensive computation in the algorithm, so our parallel
implementation mainly focuses on them.
5.1.1 Parallel Reduction on CUDA Platform
Parallel reduction on the CUDA platform is similar to general parallel reduction,
widely used with MPI on clusters. It is essentially a tree-based reduction, which lets us
exploit maximum parallelism.
Implementing parallel reduction on CUDA requires considering a few more things,
including the thread and block concepts and the shared memory within a thread block.
Some optimizations can also be adopted, such as loop unrolling or even complete
unrolling. Researchers at NVIDIA presented an optimized parallel reduction [2] that
takes these factors into account; Figure 5-1 shows the main idea of that algorithm. Our
implementation adopts most of their ideas for the reduction operation.
Figure 5-1. Tree-based reduction [2]
Because the CUDA platform has no global synchronization, the data is divided into
several blocks, each corresponding to a thread block. The reduction is executed
separately within each thread block, but simultaneously across blocks. After loading its
elements, each thread in the block first reduces at least two elements, which avoids
leaving half the threads idle in the first step. If the element count is more than twice the
total number of threads across all blocks, each thread sums up as many elements as
necessary. Theoretically, if the number of elements is about $\log_2 n$ times twice the
number of threads, that is, if each thread first sums $O(\log n)$ elements sequentially,
this parallel reduction is cost-optimal.
The reduction uses shared memory as a buffer for the elements and results. In the
first step, each thread loads its data from global memory into the block's shared
memory; all following reduction steps take place in shared memory, which reduces
data access time during the reduction.
The reduction also uses sequential addressing in shared memory to reduce shared
memory bank conflicts. In each step, each thread sums its own element with the one in
the second half of the block, and the threads in the second half go idle. In this way, the
elements used in the next step occupy a contiguous address range, and time lost to
bank conflicts is reduced.
Looping is another time-consuming part of the reduction. Once the number of
active threads drops below the warp size, we can unroll the loop and avoid the cost of
synchronizing threads. Because of the limit on the number of threads in one block, we
can even unroll the loop completely using the C++ template feature, so the branches
are eliminated at compile time. For this to work correctly, the number of threads in
each block must be a power of 2.
With the above optimizations, the reduction is efficient and theoretically
cost-optimal, providing a fundamental primitive for the computation in our
implementation.
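A minimal sketch of such a kernel, closely following the scheme in [2] (illustrative code, not the thesis's actual kernel), combining first-add-during-load, sequential addressing in shared memory, and complete unrolling via a C++ template; a power-of-2 block size is assumed:

```cuda
// Sum reduction over n floats. Each block writes one partial sum; a second
// pass (or a host-side sum) reduces the per-block results.
template <unsigned int blockSize>
__global__ void reduceSum(const float* in, float* out, unsigned int n) {
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // First add during load: each thread sums as many elements as necessary,
    // so no thread is idle in the first reduction step.
    float sum = 0.0f;
    while (i < n) {
        sum += in[i];
        if (i + blockSize < n) sum += in[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing;
    // branches on blockSize are resolved at compile time.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Last warp fully unrolled; no __syncthreads() needed within a warp.
    if (tid < 32) {
        volatile float* v = sdata;
        if (blockSize >= 64) v[tid] += v[tid + 32];
        if (blockSize >= 32) v[tid] += v[tid + 16];
        if (blockSize >= 16) v[tid] += v[tid +  8];
        if (blockSize >=  8) v[tid] += v[tid +  4];
        if (blockSize >=  4) v[tid] += v[tid +  2];
        if (blockSize >=  2) v[tid] += v[tid +  1];
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```

A typical launch is reduceSum<256><<<blocks, 256>>>(dIn, dPartial, n), followed by a second pass over the per-block partial sums.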
As described above, each type of calculation takes different source data. For
example, the original distribution is used multiple times by different kinds of
calculations, but each kind uses only part of the original data; the marginal distribution
and the cluster distribution, for instance, only need the elements related to one
element or within one specific block. The data needed by a calculation is not always
contiguous in memory. When we distribute a calculation across threads, we must
make sure each thread takes its part of the data uniquely, correctly, and completely. In
the parallel implementation, we define mapping functions for each type of calculation,
mapping from thread ID to data location and from data location to thread ID. Shared
memory is heavily used when computing an element's mapped address: auxiliary
variables for the mapping, such as the number of elements in each dimension, are
kept in shared memory. Figure 5-2 shows an example of mapping different areas to
sequential threads.
A Example on mapping dimension 0
B Example on mapping dimension 1
Figure 5-2. Example of 2D thread mapping for marginal distribution computation
5.1.2 Multi-items Calculation on CUDA Platform
Another typical calculation in the algorithm computes multiple similar items with the
same formula but different data sources. The most typical instance is the computation
of the intermediate q values within the distance calculation: depending on which q
value is being computed, different data is drawn from the original cube, the clustered
cube, and the marginal distributions of both.
In our implementation, we use an output-data-partition method to divide the
calculation among threads. Each thread calculates one q value: it derives the indexes
corresponding to its thread ID and loads the corresponding data from global memory.
Shared memory is again used for the auxiliary data.
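A hedged sketch of this output-data-partition scheme, computing q values per the factorization of Equation 2-2 (array layouts and names are illustrative assumptions, not the thesis's code):

```cuda
// Hypothetical kernel: one thread per output q value. Each thread recovers
// its per-dimension element indexes from its flat output index, multiplies
// the p(d_i | cluster) factors, and scales by the clustered-cube probability,
// following Equation 2-2.
__global__ void computeQ(const float* clusteredCube, // clustered cube p, dense, row-major
                         const float* elemCond,      // p(d_i | cluster), all dims concatenated
                         const int*   dimOffset,     // start of dimension d in elemCond/assign
                         const int*   assign,        // cluster index of each element
                         const int*   dims,          // elements per dimension
                         const int*   clusterDims,   // clusters per dimension
                         int numDims, int totalQ, float* q)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= totalQ) return;

    int rest = t;
    float value = 1.0f;
    int clusteredAddr = 0, stride = 1;
    for (int d = numDims - 1; d >= 0; --d) {
        int idx = rest % dims[d];                   // element index in dimension d
        rest /= dims[d];
        int cl = assign[dimOffset[d] + idx];        // this element's cluster
        value *= elemCond[dimOffset[d] + idx];      // p(d_d | cluster) factor
        clusteredAddr += cl * stride;               // flat clustered-cube address
        stride *= clusterDims[d];
    }
    q[t] = value * clusteredCube[clusteredAddr];    // Equation 2-2
}
```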
5.1.3 Optimization
Using GPUs for calculation requires frequent basic memory operations: allocating
device memory, freeing it, and copying between host and device. Memory must be
allocated first, then data is copied from the host to device memory; after the calculation
finishes, results are copied back from the device to host memory. These operations
are very time consuming when the amount of data is large.
In our implementation, only some parts of the computation are parallelized, but
intermediate results and the original read-only data should not be copied back and
forth between host and device multiple times. We optimize the implementation by
leaving the read-only data, such as the original data and its marginal distributions,
together with intermediate data, resident in device memory for reuse. We also reuse
previously allocated address space to avoid repeated allocation and freeing.
Figure 5-3. Reduction of repeated communication between host and device
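A minimal sketch of the reuse pattern (illustrative names; the kernel launches are elided):

```cuda
// Read-only inputs cross the PCIe bus once, scratch buffers are allocated
// once and reused across iterations, and only the small per-iteration result
// is copied back each loop.
#include <cstddef>
#include <cuda_runtime.h>

void runLoops(const float* hOriginal, std::size_t nBytes,
              int* hAssignments, std::size_t aBytes, int maxIters) {
    float *dOriginal, *dScratch;
    int *dAssignments;
    cudaMalloc(&dOriginal, nBytes);
    cudaMalloc(&dScratch, nBytes);        // reused every iteration, never freed mid-run
    cudaMalloc(&dAssignments, aBytes);
    cudaMemcpy(dOriginal, hOriginal, nBytes, cudaMemcpyHostToDevice);  // copied once

    for (int it = 0; it < maxIters; ++it) {
        // ... launch kernels that read dOriginal, reuse dScratch,
        //     and write dAssignments ...
        cudaMemcpy(hAssignments, dAssignments, aBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dOriginal); cudaFree(dScratch); cudaFree(dAssignments);
}
```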
5.2 Parallel Implementation and Optimization for Sparse Cube
Instead of the parallel tree-based reduction used for the dense cube, atomic
operations are heavily used in the parallel implementation for the sparse cube.
Tree-based reduction does not work well here, because there is hardly a way to map
the data to be reduced onto sequential threads. In addition, these atomic operations
do not add much overhead: blocking only happens when two threads write to the same
memory location.
The most computation-intensive part is the computation of the distances, which is
the key target for parallelization. The parallel implementation of this computation is,
however, a little tricky. First, we treat the computation for one non-zero element as an
individual task and map it to one GPU thread. We put all the distance computations
related to an element into one task because the clustered cube is normally small,
while the number of non-zero elements is quite large, often exceeding, by several
times, the number of threads the GPU can execute simultaneously; dividing the task
into finer granularity would not improve performance. Although we could perform a
tree-based reduction to find the shortest distance among the candidates, that would
require synchronization and introduce a large overhead in extra memory and in warp
exchanges during execution. Instead, as we calculate the distances, we rewrite the
same shared memory field whenever the newer value is smaller, which is more
efficient. Figure 5-4 shows an example of the mapping between tasks and threads.
Each thread then atomically adds its value to the corresponding field in device memory.
Figure 5-4. Mapping of threads in parallel implementation for sparse cube
During the iteration over all the coordinates, we compute all the distances between
each element and every cluster of the dimension being clustered. The number of
distance fields is usually very large, and each field is frequently read or written while
the algorithm runs. This creates a problem: frequent atomic operations on global
memory cost a huge amount of time, because many of them execute in sequence and
each atomic write to global memory costs hundreds of cycles. This naive approach is
inefficient and can hardly be used in practice.
Shared memory is a good way to reduce the cost of atomic writes, but two
problems remain. One is that the distance table is too large, while in CUDA the shared
memory per block is quite small: just 16KB on devices with Compute Capability 1.x
and 48KB on devices with Compute Capability 2.x. It is impossible to fit the distances
of even a normal-sized co-clustering problem into shared memory. The other is that a
large number of atomic operations reduces parallelism: most of the operations become
sequential, erasing the advantage of parallelization.
We use a simple observation to solve these problems. Even though the total
number of distances is often very large, the maximal number of threads in one block is
small: 512 on devices with Compute Capability 1.x and 1024 on devices with Compute
Capability 2.x (we use the Compute Capability 1.x figures in the following explanation).
In the worst case, the number of distance fields generated in one thread block is 512
multiplied by the number of clusters in the dimension, which is still too large for shared
memory: if the number of clusters exceeds 8, the distance fields may exceed the limit.
To solve this, we add another preprocessing step that generates multiple copies of
the coordinate list. All copies represent the same cube, but the orders of the
coordinates differ: each copy is sorted in ascending order of the indexes of one
dimension. This preprocessing consumes time and space, but it is a one-time
procedure; we execute it once and keep the results in device memory. The sorting
reduces the number of distinct elements of the clustering dimension that appear in any
one block. Although the worst case is unchanged, it rarely happens in practice: from
the statistics of our experiments, no more than 40 distinct elements appear in one
block, and in most situations the number is around 10. Thus, we can place the
distances to be calculated by one block in its shared memory. Not only is the access
time reduced, but the number of atomic operations on global memory is also reduced.
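A hedged sketch of the resulting accumulation kernel (illustrative names and layouts, not the thesis's code; float atomicAdd assumes Compute Capability 2.0 or newer, whereas the thesis's C1060 would need an integer or fixed-point variant):

```cuda
// Each thread handles one non-zero record, accumulates its distance
// contributions into a per-block shared table, and the block merges its
// table into global memory with one atomic write per field. Assumes
// numClusters <= MAX_CLUSTERS and, per the statistics above, at most
// MAX_LOCAL_ELEMS distinct clustered-dimension elements per block.
#define MAX_LOCAL_ELEMS 40
#define MAX_CLUSTERS    16

__global__ void sparseDistances(const int*   elemIdx,  // clustered-dim index per record, sorted
                                const float* contrib,  // p*log(p/q) per (record, cluster)
                                int numRecords, int numClusters,
                                float* globalDist)     // dist[element * numClusters + cluster]
{
    __shared__ float local[MAX_LOCAL_ELEMS * MAX_CLUSTERS];
    __shared__ int firstElem;

    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)  // smallest element index this block touches
        firstElem = elemIdx[min(blockIdx.x * blockDim.x, numRecords - 1)];
    for (int i = threadIdx.x; i < MAX_LOCAL_ELEMS * numClusters; i += blockDim.x)
        local[i] = 0.0f;
    __syncthreads();

    if (r < numRecords) {
        // Because the list is sorted by elemIdx, slot stays small in practice.
        int slot = elemIdx[r] - firstElem;
        for (int c = 0; c < numClusters; ++c)
            atomicAdd(&local[slot * numClusters + c], contrib[r * numClusters + c]);
    }
    __syncthreads();

    // One write per (element, cluster) field per block; only fields whose
    // element straddles two blocks are written by more than one block.
    for (int i = threadIdx.x; i < MAX_LOCAL_ELEMS * numClusters; i += blockDim.x)
        if (local[i] != 0.0f)
            atomicAdd(&globalDist[firstElem * numClusters + i], local[i]);
}
```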
Table 5-1. Example of sorting indexes in threads and blocks
Thread block    Record index in threads
0               0 0 0 0 0 0 1 1
1               1 1 1 2 2 3 3 3
2               3 3 4 4 4 4 4 4
3               5 5 5 5 6 6 6 7
Table 5-2. Example of shared memory
Block   Element   Distance to cluster
0       0         0.14 0.14 0.23 0.23
0       1         0.14 0.12 0.34 0.24
1       1         0.24 0.14 0.23 0.53
1       2         0.23 0.14 0.34 0.24
1       3         0.55 0.53 0.14 0.50
2       3         0.24 0.34 0.33 0.60
2       4         0.06 0.05 0.35 0.50
3       5         0.40 0.44 0.35 0.24
3       6         0.32 0.33 0.53 0.14
3       7         0.36 0.24 0.42 0.11
Atomic operations on shared memory are fast compared to those on global
memory, and copying from shared memory to global memory is straightforward.
Sequential writes only happen for the fields whose element appears in two blocks; in
the worst case, the number of such fields only equals the number of blocks, so the
total time spent on sequential operations drops. In this way, we solve the problem of
the distance computation. The other parts, such as computing the clustered cube and
deriving the assignments from the distances, use the same parallelization approach as
described earlier.
CHAPTER 6
EXPERIMENTS
This chapter provides evidence for the benefits of the Multi-Dimensional ITCC and
its parallel implementation. In particular, we apply the implementation to randomly
generated data for performance evaluation, and to real wireless records to evaluate
the co-clustering itself. We show that the algorithm works well on multi-dimensional
data, and that the parallel implementation achieves an obvious speedup over the serial
implementation on the same data sets.
6.1 Data Set, Environment and Measurement Details
The data sets used to evaluate the performance of the implementations are
randomly generated 3D data. The size of the cube is 200 × 200 × 200. In total we
generated 10 data sets, with 10000, 20000, 40000, 80000, ..., 5120000 records
respectively.
The data sets used to evaluate the co-clustering results are wireless data records.
There are 2 data sets; Table 6-1 shows the details of each. In the table, uid stands for
User ID, did for Domain ID, and lid for Location ID.
The evaluation takes place on a Tesla node of the University of Florida High
Performance Computing Center. The host has 4 Intel E5462 cores running at 2.8GHz,
16GB of RAM, and an NVIDIA Tesla C1060 GPU with 4GB of RAM. The parallel and
serial implementations run on the same machine.
To exclude other factors that may affect the result, such as preprocessing and
output, we only measure the time for the computation in each loop, which is the core of
the parallelization. To reduce the uncertainty of the measurements, each per-loop time
is derived from the time consumed by several loops divided by the number of loops.

Table 6-1. Information on datasets
                                    Dataset 1    Dataset 2
Number of Dimensions                2            3
Number of Non-Zero Elements         305464       18808
Names of Dims                       uid, did     uid, did, lid
Number of Elements per Dimension    22816, 100   1800, 100, 68
Number of Clusters                  15, 15       10, 10, 10
6.2 Performance and Discussion
The performance of the implementations depends on many factors, including the
number of dimensions, the number of elements in each dimension, the number of
non-zero elements, and the number of clusters in the results.
The number of iterations varies with the input data and its initialization. In our
experiments, we set the threshold to $10^{-6}$: the co-clustering stops when the change
in the loss of mutual information falls below it. Figure 6-1 shows how the loss of mutual
information evolves as the co-clustering proceeds; it decreases rapidly at first and
more slowly in each later loop.
Figure 6-1. Trends of the loss of mutual information
We also compared the loss of mutual information before and after co-clustering
across 20 executions on the same data sets. Figure 6-2 shows the changes: the
algorithm successfully decreases the loss of mutual information regardless of the
initialization applied to the original data.
Figure 6-2. Comparison of loss of mutual information
For the performance evaluation, we apply our implementations to the 10 randomly
generated datasets. Figure 6-3 shows the detailed performance results. Generally
speaking, the parallel implementations improve on the serial implementations for both
the sparse cube and the dense cube.
For the same dataset, time consumption grows as the number of clusters grows,
because the number of distances to be calculated grows. Comparing the
implementation for dense data with the one for sparse data, the former performs better
on the dense data sets, while the latter performs better on the sparse ones. The data
shows that, for the serial implementations, the dense implementation performs better
once the density of non-zero records exceeds 8%.
A Performance results on sparse cube
B Performance results on dense cube
Figure 6-3. Performance results
For the parallel implementations, the parallelism exploited by the implementation
for sparse data is much greater than that exploited by the implementation for dense
data. The time consumption of the sparse-cube implementation grows linearly with the
number of records, for both the serial and the parallel implementation.
In sum, the implementations for the dense cube show better performance on
dense data, while the implementations for the sparse cube do better on sparse data.
Both parallel implementations achieve an obvious speedup over the corresponding
serial implementations.
6.3 Co-clustering Algorithm Results
To show the co-clustering results of Multi-Dimensional ITCC, we applied the
implementation to the real 3D wireless data records and visualized the distribution of
data points in 3D space. Figure 6-4 shows the data point distributions before and after
running the co-clustering algorithm; in the figure, the IDs in the same cluster are
grouped together. The figure clearly shows that data points with similar properties are
clustered together after the algorithm executes.
The co-clustering results depend heavily on the initialization of the cluster
assignments of the original elements; in our experiments, we use random initialization
for these datasets. In this section, we show one domain co-clustering result, from the
2D dataset. Figure 6-5 shows the clustered domain names.
From this result, we can empirically find some interesting clusters that group
together domains that certain groups of users usually visit together. First,
microsoftoffice2007, mcafee, and windowsmedia all fall in cluster 5, which may collect
websites that Windows users often visit. yahoo and yimg both belong to Yahoo, and
they are likely always visited together. hotmail, live, and go all belong to Microsoft Live
services. Some potential knowledge might also be extracted from the results. For
example, we might guess that Mac users favor washingtonpost, because mac and
washingtonpost
A Data points distribution after initialization
B Data points distribution after co-clustering
Figure 6-4. Data points distribution before and after co-clustering
Cluster 0: netflix flyingcroc wsj
Cluster 1: google veoh lexis
Cluster 2: about adrevolver brightcove cnn contextweb diggriver doubleclick
ebay ebayrtm fastclick gridserver imageshack itunes microsoft
mozilla panthercdn secureserver theplanet tribalfusion typepad
webtrendslive wikimedia xpc-mii youtube virtualearth tmcs
ebayimg imeem myspace bankofamerica coremetrics americanidol
ha-hosting msnbcsports ltdomains nih nytimes bigcharts
Cluster 3: cbsig ln travelpn
Cluster 4: ilike aol
Cluster 5: infoave usc windowsmedia hackerwatch mcafee microsoftoffice2007
Cluster 6: comcast harvard hamachi ucsb digg
Cluster 7: facebook llnw mediaplex msn tfbnw xlhost apmebf 247realmedia
Level3
Cluster 8: go net live hotmail gotomypc
Cluster 9: aster fastres fastwebnet smartbro bodoglife torrentbox qwest
Cluster 10: apple
Cluster 11: bluehost rr yahoo yimg
Cluster 12: drmisnotforsale softlayer quiettouch westlaw
Cluster 13: co steadfast socialmedia lokm
Cluster 14: cnet washingtonpost earthlink mac opendns orb
Figure 6-5. Co-clustering result of domain names of 2D dataset
belong to one cluster. In other words, the results of multi-dimensional co-clustering are
valuable and instructive for discovering knowledge in large amounts of new data.
However, this is only the locally optimal result for one initialization; different
initializations can give very different results. According to our experiments, the results
from random initializations are quite unstable. A feasible remedy is to initialize ITCC
with the co-clustering results of another algorithm; we then obtain a result with better
mutual information than the one used for initialization.
CHAPTER 7
CONCLUSION
The parallel and serial implementations of the Multi-Dimensional Information
Theoretic Co-Clustering algorithm show that it works well with data of any
dimensionality, especially sparse, high-dimensional data. The parallel implementations
for both the sparse and the dense cube obviously outperform the corresponding serial
versions. The implementations for the sparse cube adapt better to large-scale data,
which makes them the more useful choice in most situations; the parallel
implementation for the sparse cube also exploits more of the algorithm's parallelism
and therefore shows a more pronounced speedup over its serial counterpart. The
Multi-Dimensional ITCC can successfully co-cluster multi-dimensional data and expose
potential knowledge hidden in the data, which is helpful for knowledge discovery from
real-world data.
REFERENCES
[1] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), pages 89-98, 2003.
[2] M. Harris. Optimizing parallel reduction in CUDA. Technical report, NVIDIA Corporation, 2007.
[3] NVIDIA Corporation. NVIDIA CUDA C Programming Guide. http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf, 2010.
BIOGRAPHICAL SKETCH
Xiaoyang Gao is a Master of Science candidate in computer engineering in the
Department of Computer and Information Science and Engineering at the University
of Florida. While studying at the University of Florida, he conducted research in parallel
computing and data mining under Dr. Sanjay Ranka's supervision and applied it to
modeling wireless data. He received his bachelor's degree in Computer Science and
Technology from Huazhong University of Science and Technology, Wuhan, P. R. China.