New Approaches for Clustering and Parallel Algorithms
Professor Suely Oliveira, Computer Science, Mathematics and Applied Math Program. December 3rd, 2010
Weighted and unweighted networks in the real world, and clustering them
• Weighted networks: edges hold closeness information
• Unweighted networks: edges record only whether a connection exists
Clustering Data
UCI Datasets:
Iris (1936): 150 data points, 4 features, 3 classes.
Bag of Words (2008): 8M data points, 100K features, …
My applications so far: documents, manufacturing, protein-protein networks; also genomes of N2-fixing bacteria and other agricultural applications, and other biological and medical problems. Some involve incomplete data.
Datasets
Protein-Protein Interactions represent a pivotal aspect of protein function. Almost every cellular process relies on transient or permanent physical binding of two or more proteins in order to accomplish the respective task.
S. cerevisiae: one of the most intensively studied eukaryotic model organisms in molecular and cell biology.
DIP = Database of Interacting Proteins. Documents experimentally determined protein-protein interactions. Useful for understanding the properties, prediction, and evolution of protein interactions.
MIPS = Mammalian Protein-Protein Interaction Database. Interactions are experimentally determined and cross-referenced to the major sequence databases (SWISS-PROT, GenBank, PIR). The Munich Information Center for Protein Sequences (MIPS) annotates 6451 proteins; 800 of them overlap the residual network of DIP.
Basic Mathematical Idea
Given a similarity matrix $S$ for the graph $G(V,E)$, construct two clusters $A$ and $B$.

Define $s(A,B) = \sum_{i \in A,\, j \in B} S_{ij}$, $s(A,A) = \sum_{i \in A,\, j \in A} S_{ij}$,
$d_i = \sum_j S_{ij}$, $d_A = \sum_{i \in A} d_i$, $d_B = \sum_{i \in B} d_i$, $D = \mathrm{diag}(d_1, \ldots, d_n)$.

Aims:
(P1): Maximize $s(A,A)$ and $s(B,B)$
(P2): Minimize $s(A,B)$
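A tiny worked example (my illustration, not from the slides): take $n = 4$ vertices with

$S = \begin{pmatrix} 0 & 3 & 1 & 0 \\ 3 & 0 & 0 & 1 \\ 1 & 0 & 0 & 2 \\ 0 & 1 & 2 & 0 \end{pmatrix}, \quad A = \{1,2\}, \quad B = \{3,4\}.$

Then $s(A,A) = 6$, $s(B,B) = 4$, $s(A,B) = 2$, $d = (4,4,3,3)$, $d_A = 8$, $d_B = 6$, and $D = \mathrm{diag}(4,4,3,3)$. This split scores well on both aims: the within-cluster similarities (P1) are large while the between-cluster similarity (P2) is small.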
Clustering and Optimization
K-means method: Maximize $s(A,A) + s(B,B)$
Ratio Cut: Minimize $\frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$
Normalized Cut: Minimize $\frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}$
MinMax Cut: Minimize $\frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$

Encode a partition by the vector $q$ with $q(i) = a$ if $i \in A$ and $q(i) = -b$ if $i \in B$. Then

Minimize $\frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$ subject to $q(i) = a$ or $-b$

becomes

$\max_q J_m(q) = \frac{q^T S q}{q^T D q}$ subject to $q(i) = a$ or $-b$.

Relax and approximate: the solution is the second smallest eigenvector of $(D - S)\,q = \lambda D q$.
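A step left implicit here (standard spectral reasoning, added for clarity): since $q^T (D - S)\, q = q^T D q - q^T S q$,

$\max_q \frac{q^T S q}{q^T D q} \iff \min_q \frac{q^T (D - S)\, q}{q^T D q},$

and the stationary points of this generalized Rayleigh quotient satisfy $(D - S)\,q = \lambda D q$. The smallest eigenvalue is $\lambda = 0$ with the constant eigenvector $q = \mathbf{1}$ (the rows of $D - S$ sum to zero), which leaves everything in one cluster; the useful relaxed solution is therefore the eigenvector of the second smallest eigenvalue, and the partition is read off from the signs of its entries.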
Clustering is a relative of Graph Partitioning
In graph partitioning the two subgraphs have the same number of nodes; in clustering the number of data points in each cluster can differ. In both cases you want to minimize the connections (similarities) between the sub-graphs (or clusters).
Mathematical Model
Let $G = (V, E)$ be a graph with vertices $v \in V$ and edges $e_{ij} \in E$. We allow either edges or vertices to have positive weights, $w_e(e_{ij})$ and $w_v(i)$ respectively. One way to describe a partition is to assign a value of +1 to all the vertices in one set and a value of -1 to all the vertices in the other:

$x(i) = \begin{cases} +1 & \text{if } v(i) \text{ is in partition } P_1 \\ -1 & \text{if } v(i) \text{ is in partition } P_2 \end{cases}$

so that

$\frac{1}{4}\,(x(i) - x(j))^2 = \begin{cases} 1 & \text{if } v(i) \text{ and } v(j) \text{ are in different partitions} \\ 0 & \text{otherwise.} \end{cases}$
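A quick check of the identity (my verification, not on the slide): if $x(i) = +1$ and $x(j) = -1$ then $\frac{1}{4}(x(i) - x(j))^2 = \frac{1}{4}(2)^2 = 1$, while $x(i) = x(j)$ gives $0$; summing edge weights against this quantity therefore counts exactly the weight cut by the partition.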
From Discrete to Continuous Model

The discrete optimization problem:

Minimize $f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2$
Subject to (a) $\sum_i x(i) = 0$ and (b) $x(i)^2 = 1$.

Relax to a continuous optimization problem:

Minimize $f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2$
Subject to (a) $\sum_i x(i) = 0$ and (b) $\sum_i x(i)^2 = n$.
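In matrix form (a standard identity, added to connect this slide to the eigenproblem above): let $W_{ij} = w_e(e_{ij})$ for $e_{ij} \in E$ and zero otherwise, $D = \mathrm{diag}(\sum_j W_{ij})$, and $L = D - W$ be the graph Laplacian. Then

$f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2 = \frac{1}{4}\, x^T L x,$

so under constraints (a) $x \perp \mathbf{1}$ and (b) $\|x\|^2 = n$, the continuous minimizer is (up to scaling) the eigenvector of $L$ with the second smallest eigenvalue, and the discrete partition is recovered from the signs of its entries.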
General Approach for Clustering (MR Rao)
Optimization formulation of MR Rao
Objective: minimize the sum of distances $d_{ij}$ over items $i$ and $j$ in the same cluster.
Constraint 1: each item belongs to exactly one cluster.
Constraint 2: each cluster has ≥ 1 item.
$\min_x \sum_{k=1}^{m} \sum_{i,j=1}^{n} d_{ij}\, x_{ik} x_{jk}$ subject to

$\sum_{k=1}^{m} x_{ik} = 1$ for all $i$
$\sum_{i=1}^{n} x_{ik} \geq 1$ for all $k$
$x_{ik} \in \{0, 1\}$ for all $i$ and $k$
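To make the formulation concrete, a minimal brute-force sketch in C (my illustration, not from the slides; the 4x4 distance matrix is made up). It enumerates every assignment of n = 4 items to m = 2 clusters, so constraint 1 holds by construction, checks constraint 2, and evaluates Rao's objective:

    #include <stdio.h>

    #define N 4   /* items */
    #define M 2   /* clusters */

    /* Hypothetical symmetric distances: items 0,1 are close, items 2,3 are close. */
    static const double d[N][N] = {
        {0, 1, 4, 5},
        {1, 0, 3, 6},
        {4, 3, 0, 2},
        {5, 6, 2, 0},
    };

    int main(void) {
        int best_assign[N];
        double best = -1.0;
        int total = 1;
        for (int i = 0; i < N; i++) total *= M;   /* M^N candidate assignments */

        for (int code = 0; code < total; code++) {
            int assign[N], count[M] = {0}, c = code;
            for (int i = 0; i < N; i++) { assign[i] = c % M; count[assign[i]]++; c /= M; }

            int feasible = 1;                     /* constraint 2: no empty cluster */
            for (int k = 0; k < M; k++) if (count[k] < 1) feasible = 0;
            if (!feasible) continue;

            double obj = 0.0;                     /* sum d_ij over same-cluster pairs */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (assign[i] == assign[j]) obj += d[i][j];

            if (best < 0 || obj < best) {
                best = obj;
                for (int i = 0; i < N; i++) best_assign[i] = assign[i];
            }
        }
        printf("minimum objective = %g, assignment =", best);
        for (int i = 0; i < N; i++) printf(" %d", best_assign[i]);
        printf("\n");   /* expects clusters {0,1} and {2,3}, objective 6 */
        return 0;
    }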
A semi-definite programming formulation for clustering
Semi-definite programming (SDP): the "variables" are matrices.
SDPs give convex approximations to combinatorial problems.
The generic form:

$\min_Z\; C \ast Z := \sum_{i,j=1}^{n} c_{ij} z_{ij}$ subject to
$A_p \ast Z = \alpha_p,\ p = 1, 2, \cdots, q$, and $Z$ positive semi-definite.

For clustering:

$\min_{X,Y}\; D \ast Y$ subject to
$X \geq 0$ componentwise, $\sum_{k=1}^{m} x_{ik} = 1$ for all $i$, and
$\begin{pmatrix} Y & X \\ X^T & I \end{pmatrix}$ positive semi-definite.
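Why the block constraint works (a standard fact, added for clarity): by the Schur complement, since $I \succ 0$,

$\begin{pmatrix} Y & X \\ X^T & I \end{pmatrix} \succeq 0 \iff Y \succeq X X^T,$

so $Y$ acts as a convex surrogate for the product $X X^T$, whose entries $(X X^T)_{ij} = \sum_k x_{ik} x_{jk}$ carry the quadratic terms of Rao's objective; dropping the integrality of $X$ then leaves a convex (semi-definite) problem.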
Speed: multicore and many-core
New CPUs have 2, 4, or more cores. Each core is a separate processor that can run several threads and has its own local memory (cache).
Parallel processing has arrived on your laptop with multicore CPUs, but programmers take little advantage of it.
GPUs have been used for graphics since 1998; recently, with CUDA, they started being used for numerical computation in general.
CPUs and GPUs have fundamentally different design philosophies.

[Figure: CPU vs. GPU block diagrams. The CPU devotes its transistors to control logic and a large cache beside a few ALUs and DRAM; the GPU devotes them to many small ALUs. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign]
[Figure: Architecture of a CUDA-capable GPU. The host feeds an input assembler and a thread execution manager; arrays of streaming processors with parallel data caches and texture units reach global memory through load/store units. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign]
Thread Blocks

Divide the monolithic thread array into multiple blocks:
• Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
• Threads in different blocks cannot cooperate.

[Figure: Thread Block 0 through Thread Block N-1, each holding threads with threadID 0…7 running the same code: float x = input[threadID]; float y = func(x); output[threadID] = y; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 498AL, University of Illinois, Urbana-Champaign]
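A minimal runnable sketch of this pattern (my illustration, not from the slides; func here is just a placeholder squaring function): each thread computes one output element, and the global threadID combines the block and thread indices.

    #include <stdio.h>

    __device__ float func(float x) { return x * x; }   /* placeholder per-element op */

    __global__ void apply(const float *input, float *output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
        if (threadID < n) {                                    /* guard the last block */
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    int main(void)
    {
        const int n = 1024;
        float h_in[1024], h_out[1024];
        for (int i = 0; i < n; i++) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void**)&d_in, n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        apply<<<(n + 255) / 256, 256>>>(d_in, d_out, n);   /* 4 blocks of 256 threads */

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("output[10] = %f\n", h_out[10]);            /* expect 100.0 */
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }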
[Figure 3.2. An Example of CUDA Thread Organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) holds Threads (0,0,0) through (3,1,0), with a second layer (0,0,1) to (3,0,1).]
Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on:
• Block ID: 1D or 2D
• Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, …
Matrix Multiply using blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}
PROBLEM: CGMA (compute to global memory access) ratio = 1.0
(2 values are fetched from global memory for every 1 multiply and 1 add), so the kernel is limited by memory bandwidth.
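For completeness, a host-side launch sketch (my addition, not on the slide; assumes Width is a multiple of TILE_WIDTH):

    #define TILE_WIDTH 16

    // Copy the inputs to the device, launch one TILE_WIDTH x TILE_WIDTH block
    // per output tile, and copy the product back.
    void matrixMulOnDevice(const float* M, const float* N, float* P, int Width)
    {
        size_t bytes = (size_t)Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        cudaMalloc((void**)&Md, bytes);
        cudaMalloc((void**)&Nd, bytes);
        cudaMalloc((void**)&Pd, bytes);
        cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

        dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
        dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

        cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }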
Using shared memory on the GPU:

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();   // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // finish using the tile before it is overwritten
    }
    Pd[Row*Width + Col] = Pvalue;
}
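Two details worth noting (my explanation, not on the slide): the first __syncthreads() guarantees the whole tile is in shared memory before any thread reads it, and the second keeps a fast thread from overwriting the tile with the next iteration's data while slower threads are still using it. Each value fetched from global memory is now reused TILE_WIDTH times from shared memory, which is what raises the CGMA ratio from 1.0 to TILE_WIDTH.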
G80 Shared Memory and Threading

• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16. The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS (peak performance)!
• Each SM in G80 has 16 KB of shared memory, enough for up to 8 blocks of 2 KB. For TILE_WIDTH = 16, each thread block uses 2*16*16*4 B = 2 KB of shared memory. That is fine!
• The threading hardware (768 threads per SM) limits the number of 16x16 thread blocks to 3; shared memory is not the limiting factor here!
• The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared memory usage per thread block, allowing at most two thread blocks to be active at the same time.
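The occupancy arithmetic above can be checked with a few lines of C (my sketch; the 768-thread and 16 KB limits are the G80 figures quoted on this slide):

    #include <stdio.h>

    int main(void) {
        const int smem_per_sm = 16 * 1024;  /* 16 KB shared memory per SM (G80) */
        const int threads_per_sm = 768;     /* thread limit per SM (G80) */

        int smem16 = 2 * 16 * 16 * 4;       /* Mds + Nds float tiles, TILE_WIDTH = 16 */
        int smem32 = 2 * 32 * 32 * 4;       /* the same for TILE_WIDTH = 32 */

        printf("blocks/SM by shared memory: %d (tile 16), %d (tile 32)\n",
               smem_per_sm / smem16, smem_per_sm / smem32);  /* prints 8 and 2 */
        printf("blocks/SM by thread count: %d (tile 16)\n",
               threads_per_sm / (16 * 16));                  /* prints 3 */
        return 0;
    }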
Other applications for clustering and GPU algorithms
Spam vs not-spam (interested…) Counter-terrorism and fraud detection (interested) Robotics (collaboration long distance) Simulating impacts (collaboration w/ Stewart) Lots of new papers in being written… Fall 2010 – 22C196: Threads and GPUs Taught by Prof Oliveira: http://www.cs.uiowa.edu/~oliveira Reading/Research credit courses available.