New Approaches for Clustering and Parallel Algorithms
Professor Suely Oliveira, Computer Science, Mathematics and Applied Math Program. December 3rd, 2010
Weighted and unweighted networks in the real world, and clustering them
• Weighted networks: edges hold closeness information
• Unweighted networks: edges record only whether a connection exists
Clustering Data
UCI Datasets:
Iris (1936): 150 data points, 4 features, 3 classes.
Bag of Words (2008): 8M data points, 100K features, …
My applications so far: documents, manufacturing, protein-protein networks; also genomes of N2-fixing bacteria and other agricultural applications, and other biological and medical problems. Some involve incomplete data.
Datasets
Protein-Protein Interactions represent a pivotal aspect of protein function. Almost every cellular process relies on transient or permanent physical binding of two or more proteins in order to accomplish the respective task.
S. cerevisiae: one of the most intensively studied eukaryotic model organisms in molecular and cell biology.
DIP = Database of Interacting Proteins. Documents experimentally determined protein-protein interactions. Useful for understanding the properties, prediction, and evolution of protein interactions.
MIPS = Mammalian Protein-Protein Interaction Database. Interactions are experimentally determined and cross-referenced to the major sequence databases (SWISS-PROT, GenBank, PIR). The Munich Information Center for Protein Sequences (MIPS) annotates 6451 proteins; 800 of them overlap the residual network of DIP.
Basic Mathematical Idea
Given a similarity matrix $S$ for the graph $G(V,E)$, construct two clusters $A$ and $B$.

Define $s(A,B) = \sum_{i \in A,\, j \in B} S_{ij}$, $s(A,A) = \sum_{i \in A,\, j \in A} S_{ij}$,
$d_i = \sum_j S_{ij}$, $d_A = \sum_{i \in A} d_i$, $d_B = \sum_{i \in B} d_i$, $D = \mathrm{diag}(d_1, \ldots, d_n)$.

Aims:
(P1): Maximize $s(A,A)$ and $s(B,B)$
(P2): Minimize $s(A,B)$
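A tiny worked example (my illustration, not from the slides): take $n = 4$ vertices with

$S = \begin{pmatrix} 0 & 3 & 1 & 0 \\ 3 & 0 & 0 & 1 \\ 1 & 0 & 0 & 2 \\ 0 & 1 & 2 & 0 \end{pmatrix}, \quad A = \{1,2\}, \quad B = \{3,4\}.$

Then $s(A,A) = 6$, $s(B,B) = 4$, $s(A,B) = 2$, $d = (4,4,3,3)$, $d_A = 8$, $d_B = 6$, and $D = \mathrm{diag}(4,4,3,3)$. This split scores well on both aims: the within-cluster similarities (P1) are large while the between-cluster similarity (P2) is small.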
Clustering and Optimization
K-means method: Maximize $s(A,A) + s(B,B)$
Ratio Cut: Minimize $\frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$
Normalized Cut: Minimize $\frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}$
MinMax Cut: Minimize $\frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$

Encode a partition by the vector $q$ with $q(i) = a$ if $i \in A$ and $q(i) = -b$ if $i \in B$. Then

Minimize $\frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$ subject to $q(i) = a$ or $-b$

becomes

$\max_q J_m(q) = \frac{q^T S q}{q^T D q}$ subject to $q(i) = a$ or $-b$.

Relax and approximate: the solution is the second smallest eigenvector of $(D - S)\,q = \lambda D q$.
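A step left implicit here (standard spectral reasoning, added for clarity): since $q^T (D - S)\, q = q^T D q - q^T S q$,

$\max_q \frac{q^T S q}{q^T D q} \iff \min_q \frac{q^T (D - S)\, q}{q^T D q},$

and the stationary points of this generalized Rayleigh quotient satisfy $(D - S)\,q = \lambda D q$. The smallest eigenvalue is $\lambda = 0$ with the constant eigenvector $q = \mathbf{1}$ (the rows of $D - S$ sum to zero), which leaves everything in one cluster; the useful relaxed solution is therefore the eigenvector of the second smallest eigenvalue, and the partition is read off from the signs of its entries.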
Clustering is a relative of Graph Partitioning
In graph partitioning the two subgraphs have the same number of nodes; in clustering the number of data points in each cluster can differ. In both cases you want to minimize the connections (similarities) between the sub-graphs (or clusters).
Mathematical Model
Let $G = (V, E)$ be a graph with vertices $v \in V$ and edges $e_{ij} \in E$. We allow either edges or vertices to have positive weights, $w_e(e_{ij})$ and $w_v(i)$ respectively. One way to describe a partition is to assign a value of +1 to all the vertices in one set and a value of -1 to all the vertices in the other:

$x(i) = \begin{cases} +1 & \text{if } v(i) \text{ is in partition } P_1 \\ -1 & \text{if } v(i) \text{ is in partition } P_2 \end{cases}$

so that

$\frac{1}{4}\,(x(i) - x(j))^2 = \begin{cases} 1 & \text{if } v(i) \text{ and } v(j) \text{ are in different partitions} \\ 0 & \text{otherwise.} \end{cases}$
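A quick check of the identity (my verification, not on the slide): if $x(i) = +1$ and $x(j) = -1$ then $\frac{1}{4}(x(i) - x(j))^2 = \frac{1}{4}(2)^2 = 1$, while $x(i) = x(j)$ gives $0$; summing edge weights against this quantity therefore counts exactly the weight cut by the partition.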
From Discrete to Continuous Model

The discrete optimization problem:

Minimize $f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2$
Subject to (a) $\sum_i x(i) = 0$ and (b) $x(i)^2 = 1$.

Relax to a continuous optimization problem:

Minimize $f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2$
Subject to (a) $\sum_i x(i) = 0$ and (b) $\sum_i x(i)^2 = n$.
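In matrix form (a standard identity, added to connect this slide to the eigenproblem above): let $W_{ij} = w_e(e_{ij})$ for $e_{ij} \in E$ and zero otherwise, $D = \mathrm{diag}(\sum_j W_{ij})$, and $L = D - W$ be the graph Laplacian. Then

$f(x) = \frac{1}{4} \sum_{e_{ij} \in E} w_e(e_{ij})\,(x(i) - x(j))^2 = \frac{1}{4}\, x^T L x,$

so under constraints (a) $x \perp \mathbf{1}$ and (b) $\|x\|^2 = n$, the continuous minimizer is (up to scaling) the eigenvector of $L$ with the second smallest eigenvalue, and the discrete partition is recovered from the signs of its entries.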
General Approach for Clustering (MR Rao)
Optimization formulation of MR Rao
Objective: minimize the sum of distances $d_{ij}$ over items $i$ and $j$ in the same cluster.
Constraint 1: each item belongs to exactly one cluster.
Constraint 2: each cluster has ≥ 1 item.
$\min_x \sum_{k=1}^{m} \sum_{i,j=1}^{n} d_{ij}\, x_{ik} x_{jk}$ subject to

$\sum_{k=1}^{m} x_{ik} = 1$ for all $i$
$\sum_{i=1}^{n} x_{ik} \geq 1$ for all $k$
$x_{ik} \in \{0, 1\}$ for all $i$ and $k$
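To make the formulation concrete, a minimal brute-force sketch in C (my illustration, not from the slides; the 4x4 distance matrix is made up). It enumerates every assignment of n = 4 items to m = 2 clusters, so constraint 1 holds by construction, checks constraint 2, and evaluates Rao's objective:

    #include <stdio.h>

    #define N 4   /* items */
    #define M 2   /* clusters */

    /* Hypothetical symmetric distances: items 0,1 are close, items 2,3 are close. */
    static const double d[N][N] = {
        {0, 1, 4, 5},
        {1, 0, 3, 6},
        {4, 3, 0, 2},
        {5, 6, 2, 0},
    };

    int main(void) {
        int best_assign[N];
        double best = -1.0;
        int total = 1;
        for (int i = 0; i < N; i++) total *= M;   /* M^N candidate assignments */

        for (int code = 0; code < total; code++) {
            int assign[N], count[M] = {0}, c = code;
            for (int i = 0; i < N; i++) { assign[i] = c % M; count[assign[i]]++; c /= M; }

            int feasible = 1;                     /* constraint 2: no empty cluster */
            for (int k = 0; k < M; k++) if (count[k] < 1) feasible = 0;
            if (!feasible) continue;

            double obj = 0.0;                     /* sum d_ij over same-cluster pairs */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (assign[i] == assign[j]) obj += d[i][j];

            if (best < 0 || obj < best) {
                best = obj;
                for (int i = 0; i < N; i++) best_assign[i] = assign[i];
            }
        }
        printf("minimum objective = %g, assignment =", best);
        for (int i = 0; i < N; i++) printf(" %d", best_assign[i]);
        printf("\n");   /* expects clusters {0,1} and {2,3}, objective 6 */
        return 0;
    }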
A semi-definite programming formulation for clustering
Semi-definite programming (SDP): the "variables" are matrices.
SDPs give convex approximations to combinatorial problems.
The generic form:

$\min_Z\; C \ast Z := \sum_{i,j=1}^{n} c_{ij} z_{ij}$ subject to
$A_p \ast Z = \alpha_p,\ p = 1, 2, \cdots, q$, and $Z$ positive semi-definite.

For clustering:

$\min_{X,Y}\; D \ast Y$ subject to
$X \geq 0$ componentwise, $\sum_{k=1}^{m} x_{ik} = 1$ for all $i$, and
$\begin{pmatrix} Y & X \\ X^T & I \end{pmatrix}$ positive semi-definite.
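Why the block constraint works (a standard fact, added for clarity): by the Schur complement, since $I \succ 0$,

$\begin{pmatrix} Y & X \\ X^T & I \end{pmatrix} \succeq 0 \iff Y \succeq X X^T,$

so $Y$ acts as a convex surrogate for the product $X X^T$, whose entries $(X X^T)_{ij} = \sum_k x_{ik} x_{jk}$ carry the quadratic terms of Rao's objective; dropping the integrality of $X$ then leaves a convex (semi-definite) problem.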
Speed: multicore and many-core
New CPUs have 2, 4, or more cores. Each core is a separate processor that can run several threads and has its own local memory (cache).
Parallel processing has arrived on your laptop with multicore CPUs, but programmers take little advantage of it.
GPUs have been used for graphics since 1998; recently, with CUDA, they started being used for numerical computation in general.
CPUs and GPUs have fundamentally different design philosophies.

[Figure: CPU vs. GPU block diagrams. The CPU devotes its transistors to control logic and a large cache beside a few ALUs and DRAM; the GPU devotes them to many small ALUs. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign]
[Figure: Architecture of a CUDA-capable GPU. The host feeds an input assembler and a thread execution manager; arrays of streaming processors with parallel data caches and texture units reach global memory through load/store units. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign]
Thread Blocks

Divide the monolithic thread array into multiple blocks:
• Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
• Threads in different blocks cannot cooperate.

[Figure: Thread Block 0 through Thread Block N-1, each holding threads with threadID 0…7 running the same code: float x = input[threadID]; float y = func(x); output[threadID] = y; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 498AL, University of Illinois, Urbana-Champaign]
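A minimal runnable sketch of this pattern (my illustration, not from the slides; func here is just a placeholder squaring function): each thread computes one output element, and the global threadID combines the block and thread indices.

    #include <stdio.h>

    __device__ float func(float x) { return x * x; }   /* placeholder per-element op */

    __global__ void apply(const float *input, float *output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
        if (threadID < n) {                                    /* guard the last block */
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    int main(void)
    {
        const int n = 1024;
        float h_in[1024], h_out[1024];
        for (int i = 0; i < n; i++) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void**)&d_in, n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        apply<<<(n + 255) / 256, 256>>>(d_in, d_out, n);   /* 4 blocks of 256 threads */

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("output[10] = %f\n", h_out[10]);            /* expect 100.0 */
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }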
[Figure 3.2. An Example of CUDA Thread Organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) holds Threads (0,0,0) through (3,1,0), with a second layer (0,0,1) to (3,0,1).]
Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on:
• Block ID: 1D or 2D
• Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, …
Matrix Multiply using blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}
PROBLEM: CGMA (compute to global memory access) ratio = 1.0
(2 values are fetched from global memory for every 1 multiply and 1 add), so the kernel is limited by memory bandwidth.
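For completeness, a host-side launch sketch (my addition, not on the slide; assumes Width is a multiple of TILE_WIDTH):

    #define TILE_WIDTH 16

    // Copy the inputs to the device, launch one TILE_WIDTH x TILE_WIDTH block
    // per output tile, and copy the product back.
    void matrixMulOnDevice(const float* M, const float* N, float* P, int Width)
    {
        size_t bytes = (size_t)Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        cudaMalloc((void**)&Md, bytes);
        cudaMalloc((void**)&Nd, bytes);
        cudaMalloc((void**)&Pd, bytes);
        cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

        dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
        dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

        cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }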
Using shared memory on the GPU:

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();   // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // finish using the tile before it is overwritten
    }
    Pd[Row*Width + Col] = Pvalue;
}
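Two details worth noting (my explanation, not on the slide): the first __syncthreads() guarantees the whole tile is in shared memory before any thread reads it, and the second keeps a fast thread from overwriting the tile with the next iteration's data while slower threads are still using it. Each value fetched from global memory is now reused TILE_WIDTH times from shared memory, which is what raises the CGMA ratio from 1.0 to TILE_WIDTH.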
G80 Shared Memory and Threading

• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16. The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS (peak performance)!
• Each SM in G80 has 16 KB of shared memory, enough for up to 8 blocks of 2 KB. For TILE_WIDTH = 16, each thread block uses 2*16*16*4 B = 2 KB of shared memory. That is fine!
• The threading hardware (768 threads per SM) limits the number of 16x16 thread blocks to 3; shared memory is not the limiting factor here!
• The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared memory usage per thread block, allowing at most two thread blocks to be active at the same time.
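The occupancy arithmetic above can be checked with a few lines of C (my sketch; the 768-thread and 16 KB limits are the G80 figures quoted on this slide):

    #include <stdio.h>

    int main(void) {
        const int smem_per_sm = 16 * 1024;  /* 16 KB shared memory per SM (G80) */
        const int threads_per_sm = 768;     /* thread limit per SM (G80) */

        int smem16 = 2 * 16 * 16 * 4;       /* Mds + Nds float tiles, TILE_WIDTH = 16 */
        int smem32 = 2 * 32 * 32 * 4;       /* the same for TILE_WIDTH = 32 */

        printf("blocks/SM by shared memory: %d (tile 16), %d (tile 32)\n",
               smem_per_sm / smem16, smem_per_sm / smem32);  /* prints 8 and 2 */
        printf("blocks/SM by thread count: %d (tile 16)\n",
               threads_per_sm / (16 * 16));                  /* prints 3 */
        return 0;
    }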
Other applications for clustering and GPU algorithms
Spam vs not-spam (interested…) Counter-terrorism and fraud detection (interested) Robotics (collaboration long distance) Simulating impacts (collaboration w/ Stewart) Lots of new papers in being written… Fall 2010 – 22C196: Threads and GPUs Taught by Prof Oliveira: http://www.cs.uiowa.edu/~oliveira Reading/Research credit courses available.