Weekly Report - K-means
Ph.D. Student: Leo Lee
Date: Nov. 13, 2009
Outline
K-means
◦ CPU-based algorithm workflow;
◦ Reading Kaiyong's code;
◦ Some naïve thoughts;
Work plan
K-means - CPU-based algorithm workflow
N data points and K centers, each of dimension dim;
Compute D[N][K]
Compute MinD[N]
Compute NewCenter[K]
If NewCenter == Center: stop; otherwise set Center = NewCenter and repeat.
K-means - Pseudocode:
    while(!bFlag && ++i <= nIterationsTime){
        ComputeDis(&dis, data, centers);
        FindMinDis(dis, &index);
        ComputeNewCen(&newCen, data, index);
        if(newCen - centers < b)
            bFlag = true;
        else
            centers = newCen;
    }
K-means
Since each iteration relies on the previous one, we run the iterations sequentially but parallelize each function inside an iteration (a host-side sketch follows this list):
◦ Compute the distances;
◦ Find the nearest center;
◦ Compute the new centers;
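A minimal host-side sketch of that structure, assuming the device buffers are already allocated; the kernel names, launch configurations and the CentersConverged helper are placeholders of my own, not Kaiyong's actual interfaces:

    // Hypothetical host loop: iterations run sequentially, each step is a kernel.
    bool bFlag = false;
    int i = 0;
    while (!bFlag && ++i <= nIterationsTime) {
        // 1. Distance matrix D[N][K] (placeholder kernel name)
        ComputeDisKernel<<<gridDis, threadsDis>>>(d_dis, d_data, d_centers, N, K, dim);
        // 2. Nearest-center index per data point
        FindMinDisKernel<<<gridMin, threadsMin>>>(d_dis, d_index, N, K);
        // 3. New centers from the assignment
        ComputeNewCenKernel<<<gridCen, threadsCen>>>(d_newCen, d_data, d_index, N, K, dim);
        // Convergence test, e.g. copy the centers back and compare on the CPU (placeholder helper)
        bFlag = CentersConverged(d_newCen, d_centers, K, dim, b);
        if (!bFlag) { float* tmp = d_centers; d_centers = d_newCen; d_newCen = tmp; }
    }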
K-means - compute the distance
Data[N][Dim] and Centers[Dim][K] → Dis[N][K]. This is nearly the same as matrix multiplication; we only replace A[i][k]*B[k][j] with (A[i][k]-B[k][j])².
Using so-called tiles increases the compute-to-memory-access ratio (a naive, untiled reference kernel is sketched below).
[Figure: tiled access pattern over the Data, Centers and Distances matrices]
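For reference, a naive, untiled version of this step could look like the sketch below; it only illustrates the squared-difference substitution and the [N][Dim] / [Dim][K] layouts from above, without the shared-memory tiling (the kernel name and launch configuration are my own, not Kaiyong's):

    // One thread per (data point, center) pair; Data is N x Dim (row-major),
    // Centers is Dim x K, Dis is N x K.  No tiling, so every thread re-reads
    // its row of Data and column of Centers from global memory.
    __global__ void NaiveComputeDistance(float* Dis, const float* Data,
                                         const float* Centers,
                                         int N, int K, int Dim)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;   // center index
        int n = blockIdx.y * blockDim.y + threadIdx.y;   // data index
        if (n >= N || k >= K) return;

        float sum = 0.0f;
        for (int d = 0; d < Dim; d++) {
            float diff = Data[n * Dim + d] - Centers[d * K + k];
            sum += diff * diff;                          // (A - B)^2 instead of A * B
        }
        Dis[n * K + k] = sum;
    }
    // Possible launch: dim3 threads(16, 16); dim3 grid((K+15)/16, (N+15)/16);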
K-means - compute the distance (from Kaiyong)
    dim3 threads(16, 2, 1);
    dim3 grid(k/32, n/32, 1);
    ComputeDistance<32, 32><<<grid, threads>>>(…);

    template <unsigned int B_WIDTH, unsigned int C_HIGH>
    __global__ void ComputeDistance(…)
    {
    }
K-means - compute the distance

    // Per-thread start pointers into Query (the data points), Ref (the centers)
    // and the output distance matrix C; wB is the width of Ref and C.
    float* indexQ = Query + threadIdx.x + (blockIdx.y*C_HIGH + threadIdx.y) * dim;
    float* indexR = Ref + blockIdx.x*B_WIDTH + threadIdx.y * blockDim.x + threadIdx.x;
    float* indexC = C + blockIdx.x*C_HIGH + threadIdx.y * blockDim.x + threadIdx.x + blockIdx.y*C_HIGH*wB;
K-means - compute the distance

    __shared__ float as[16][C_HIGH+1];    // +1 column helps avoid shared-memory bank conflicts
    do {
        // load a 16 x C_HIGH tile of Query into shared memory
        for (int i = 0; i < C_HIGH; i += 2)
            as[threadIdx.x][threadIdx.y + i] = indexQ[i*dim];
        indexQ += 16;
        __syncthreads();
        // accumulate squared differences (c[] and c_temp are declared in the elided part of the kernel)
        for (int i = 0; i < 16; i++, indexR += wB)
            for (int j = 0; j < C_HIGH; j++) { c_temp = indexR[0] - as[i][j]; c[j] += c_temp*c_temp; }
        __syncthreads();
    } while (indexQ < Alast);
K-means - compute the distance

    // write this thread's C_HIGH accumulated distances back to global memory
    for (int i = 0; i < C_HIGH; i++, indexC += wB) {
        indexC[0] = c[i];
    }
K-means - compute the distance - Questions
◦ Why template <unsigned int B_WIDTH, unsigned int C_HIGH> instead of ordinary function parameters? (A small illustration of my current understanding follows this list.)
◦ Why is the sub-matrix loaded in that way? Something to do with the warp? If we use a 16*16 tile instead of 32*32, should the load method change?
◦ This algorithm is nearly the same as the so-called most efficient matrix multiplication: Thread(16, 4), grid(wc/4, hc/16).
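On the first question, my understanding (to be confirmed) is that template arguments are compile-time constants, so the compiler can size the shared-memory arrays statically and fully unroll the inner loops, which it cannot do with runtime parameters. A tiny illustration, not taken from Kaiyong's code:

    // TILE is a compile-time constant: shared memory can be declared statically
    // and the loop trip count is known, so the compiler can fully unroll it.
    template <unsigned int TILE>
    __global__ void rowSum(float* out, const float* in)
    {
        __shared__ float buf[TILE];              // legal only with a constant size
        buf[threadIdx.x] = in[blockIdx.x * TILE + threadIdx.x];
        __syncthreads();
        if (threadIdx.x == 0) {
            float s = 0.0f;
            #pragma unroll                       // fully unrolled, no loop overhead
            for (unsigned int i = 0; i < TILE; i++)
                s += buf[i];
            out[blockIdx.x] = s;
        }
    }
    // Launched e.g. as rowSum<32><<<grid, 32>>>(out, in).  With a runtime
    // "int tile" parameter, "__shared__ float buf[tile]" would not compile and
    // the loop could not be fully unrolled.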
Compute the distance - very useful in data mining:
◦ K-means;
◦ K-nn;
◦ Hierarchical clustering;
◦ …
K-means - Find the nearest center
N reductions (one per data point):
◦ Sum, max, min, …
◦ Sequential addressing
◦ Completely unrolled
◦ n/log n threads, log n steps;
[Figure: N×K distance matrix; one min-reduction per row of length K]
K-means - Find the nearest center

    dim3 threads_find(16, 1, 1);
    dim3 grid_find(1, data_height, 1);

    template <unsigned int blockSize>
    __global__ void cpu_FindSmallDistance(float* Dist, int* D_index, int k)
    {
        __shared__ float sdata[blockSize];
        __shared__ int d_index[blockSize];
        // perform the first level of the reduction, reading from global memory into shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockSize + threadIdx.x;
        float* p_data = Dist + blockIdx.y*k;    // this block reduces one row of the distance matrix
        sdata[tid] = p_data[tid]; d_index[tid] = tid;
        if (i < k)
            if (sdata[tid] > p_data[i]) { sdata[tid] = p_data[i]; d_index[tid] = i; }
        EMUSYNC;
K-means - Find the nearest center

        if (sdata[tid] > sdata[tid + 8]) { sdata[tid] = sdata[tid + 8]; d_index[tid] = d_index[tid + 8]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 4]) { sdata[tid] = sdata[tid + 4]; d_index[tid] = d_index[tid + 4]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 2]) { sdata[tid] = sdata[tid + 2]; d_index[tid] = d_index[tid + 2]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 1]) { d_index[tid] = d_index[tid + 1]; } EMUSYNC;
        // write the result (index of the nearest center) for this block to global memory
        if (tid == 0) D_index[blockIdx.y] = d_index[0];
    }

Since K is presumed to be equal to or smaller than 32, this implementation is optimal.
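For comparison, the generic sequential-addressing pattern behind this kernel (the standard CUDA SDK reduction idea, written out here from memory as a plain min-reduction; not Kaiyong's code) looks roughly like this:

    // Generic sequential-addressing min-reduction: blockSize threads reduce
    // 2*blockSize elements per block in log2(blockSize) steps.
    template <unsigned int blockSize>
    __global__ void reduceMin(const float* in, float* out, unsigned int n)
    {
        __shared__ float sdata[blockSize];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockSize * 2) + tid;

        // First level: each thread reads two elements from global memory.
        float v = (i < n) ? in[i] : 3.402823466e38f;        // 3.4e38f ~ FLT_MAX
        if (i + blockSize < n) v = fminf(v, in[i + blockSize]);
        sdata[tid] = v;
        __syncthreads();

        // Sequential addressing: the active half of the block halves each step.
        for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }
    // Kaiyong's kernel above is this pattern specialized for k <= 32 (one block
    // of 16 threads per data point) with the index tracked alongside the value.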
K-means - Compute new centers
CPU-based algorithm:
◦ For each data point i: C[Index[i]] += Data[i]
Not direct addressing; not as elegant as matrix multiplication or reduction (a plain CPU sketch is given below).
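For reference, a plain CPU version of this step might look like the following sketch (my own naming and layout assumptions: Data is N×Dim row-major, NewCen is K×Dim); the indirect addressing through Index is what makes the step hard to parallelize:

    // CPU reference for the new-center step (illustrative sketch only).
    void ComputeNewCenCPU(float* NewCen, int* Count,
                          const float* Data, const int* Index,
                          int N, int K, int Dim)
    {
        for (int k = 0; k < K; k++) {                 // clear sums and counts
            Count[k] = 0;
            for (int d = 0; d < Dim; d++) NewCen[k*Dim + d] = 0.0f;
        }
        for (int n = 0; n < N; n++) {                 // scatter-add through Index
            int k = Index[n];
            Count[k]++;
            for (int d = 0; d < Dim; d++)
                NewCen[k*Dim + d] += Data[n*Dim + d];
        }
        for (int k = 0; k < K; k++)                   // divide sums by counts -> means
            if (Count[k] > 0)
                for (int d = 0; d < Dim; d++)
                    NewCen[k*Dim + d] /= Count[k];
    }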
    // Sum up the points assigned to each nearest center; the data are split into 100 groups of 512 points each
    dim3 threads_collect(32, 1, 1);
    dim3 grid_collect(100, 1, 1);

[Figure: Data matrix (N×dim) and Index array]
K-means - Compute new centers

    __shared__ int index[32]; __shared__ float as[32*34];
    int idx = threadIdx.x;
    float* p_Data = Data + blockIdx.x * GroupSize * Dim;     // this block's group of data points
    int* p_index = D_index + blockIdx.x * GroupSize;         // their nearest-center indices
    int* p_index_last = p_index + GroupSize;
    float* p_centro = Centro + blockIdx.x * k * Dim;         // this block's partial center sums
    int* p_counter = Counter + blockIdx.x * 32;              // this block's per-center counts
    int index_i = 0; int centro_count = 0;
    // initialize the shared-memory center sums
    if (idx < k)
        for (int i = 0; i < Dim; i++) as[i*k + idx] = 0;
    EMUSYNC;
K-means - Compute new centers

    // Fetch 32 indices per round, so with 512 data points there are 512/32 = 16 rounds
    for (; p_index < p_index_last; p_index += blockDim.x) {
        // Load 32 indices into shared memory so the reads are coalesced
        index[idx] = p_index[idx]; EMUSYNC;
        // Loop 32 times, processing one data point per iteration
        for (int i = 0; i < 32; i++, p_Data += Dim) {
            index_i = index[i];
            // Each thread owns one center, so increment its count when a matching point is processed
            if (idx == index_i)
                centro_count++;
            for (int j = idx; j < Dim; j += blockDim.x) {
                as[j*k + index_i] += p_Data[j];
            }
            EMUSYNC;
        }
    }
K-means - Compute new centers

    // Only the threads within the center range write results back; each thread handles one center
    if (idx < k) {
        p_counter[idx] = centro_count;
        for (int j = 0; j < Dim; j++, p_centro += k) {
            p_centro[idx] = as[j*k + idx];
        }
    }
Now we have 100 Dim*K matrices (plus the per-group counts):
◦ Matrix adding-reduction, where each element is a matrix.
◦ Kaiyong's code first reduces the 100 partials to 10, then produces the final result (a host-side sketch of the combination is given below).
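A hedged host-side sketch of what that final combination could look like (my own formulation, assuming a [Dim][K] layout for each partial matrix and K counts per group; Kaiyong's code instead performs the reduction on the GPU in two stages, 100 → 10 → final):

    #include <vector>

    // Combine the per-group partial sums and counts into the final centers.
    // partialSum holds numGroups matrices of size Dim*K (element (d, c) at d*K + c),
    // partialCount holds numGroups * K per-center counts.
    void CombinePartialCenters(float* Centers, const float* partialSum,
                               const int* partialCount,
                               int numGroups /* 100 */, int K, int Dim)
    {
        std::vector<float> sum(K * Dim, 0.0f);
        std::vector<int>   cnt(K, 0);
        for (int g = 0; g < numGroups; g++) {
            for (int i = 0; i < K * Dim; i++) sum[i] += partialSum[g * K * Dim + i];
            for (int c = 0; c < K; c++)       cnt[c] += partialCount[g * K + c];
        }
        for (int d = 0; d < Dim; d++)
            for (int c = 0; c < K; c++)
                if (cnt[c] > 0)                        // empty centers keep their old value
                    Centers[d*K + c] = sum[d*K + c] / cnt[c];
    }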
Work plan - K-means
Test the program:
◦ Each function, GPU vs. CPU;
◦ Compare with other papers.
Work plan
◦ K-means;
◦ Learn data mining, prepare for the final exam;
◦ Continue reading parallel computing books.
Thanks