Weekly Report - K-means
Ph.D. Student: Leo Lee
Date: Nov. 13, 2009
Outline
K-means
◦ CPU-based algorithm workflow;
◦ Reading Kaiyong's code;
◦ Some naïve thoughts;
Work plan
K-means - CPU-based algorithm workflow
N data points and K centers, each of dimension dim;
Compute D[N][K]
Compute MinD[N]
Compute NewCenter[K]
If NewCenter == Center: stop; otherwise set Center = NewCenter and repeat.
K-means - Pseudocode:
    while(!bFlag && ++i <= nIterationsTime){
        ComputeDis(&dis, data, centers);
        FindMinDis(dis, &index);
        ComputeNewCen(&newCen, data, index);
        if(newCen - centers < b)
            bFlag = true;
        else
            centers = newCen;
    }
K-means
Since each iteration relies on the previous one, we run the iterations sequentially but parallelize each function inside an iteration (a host-side sketch follows this list):
◦ Compute the distances;
◦ Find the nearest center;
◦ Compute the new centers;
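A minimal host-side sketch of that structure, assuming the device buffers are already allocated; the kernel names, launch configurations and the CentersConverged helper are placeholders of my own, not Kaiyong's actual interfaces:

    // Hypothetical host loop: iterations run sequentially, each step is a kernel.
    bool bFlag = false;
    int i = 0;
    while (!bFlag && ++i <= nIterationsTime) {
        // 1. Distance matrix D[N][K] (placeholder kernel name)
        ComputeDisKernel<<<gridDis, threadsDis>>>(d_dis, d_data, d_centers, N, K, dim);
        // 2. Nearest-center index per data point
        FindMinDisKernel<<<gridMin, threadsMin>>>(d_dis, d_index, N, K);
        // 3. New centers from the assignment
        ComputeNewCenKernel<<<gridCen, threadsCen>>>(d_newCen, d_data, d_index, N, K, dim);
        // Convergence test, e.g. copy the centers back and compare on the CPU (placeholder helper)
        bFlag = CentersConverged(d_newCen, d_centers, K, dim, b);
        if (!bFlag) { float* tmp = d_centers; d_centers = d_newCen; d_newCen = tmp; }
    }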
K-means - compute the distance
Data[N][Dim] and Centers[Dim][K] → Dis[N][K]. This is nearly the same as matrix multiplication; we only replace A[i][k]*B[k][j] with (A[i][k]-B[k][j])².
Using so-called tiles increases the compute-to-memory-access ratio (a naive, untiled reference kernel is sketched below).
[Figure: tiled access pattern over the Data, Centers and Distances matrices]
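For reference, a naive, untiled version of this step could look like the sketch below; it only illustrates the squared-difference substitution and the [N][Dim] / [Dim][K] layouts from above, without the shared-memory tiling (the kernel name and launch configuration are my own, not Kaiyong's):

    // One thread per (data point, center) pair; Data is N x Dim (row-major),
    // Centers is Dim x K, Dis is N x K.  No tiling, so every thread re-reads
    // its row of Data and column of Centers from global memory.
    __global__ void NaiveComputeDistance(float* Dis, const float* Data,
                                         const float* Centers,
                                         int N, int K, int Dim)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;   // center index
        int n = blockIdx.y * blockDim.y + threadIdx.y;   // data index
        if (n >= N || k >= K) return;

        float sum = 0.0f;
        for (int d = 0; d < Dim; d++) {
            float diff = Data[n * Dim + d] - Centers[d * K + k];
            sum += diff * diff;                          // (A - B)^2 instead of A * B
        }
        Dis[n * K + k] = sum;
    }
    // Possible launch: dim3 threads(16, 16); dim3 grid((K+15)/16, (N+15)/16);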
K-means - compute the distance (from Kaiyong)
    dim3 threads(16, 2, 1);
    dim3 grid(k/32, n/32, 1);
    ComputeDistance<32, 32><<<grid, threads>>>(…);

    template <unsigned int B_WIDTH, unsigned int C_HIGH>
    __global__ void ComputeDistance(…)
    {
    }
K-means - compute the distance

    // Per-thread start pointers into Query (the data points), Ref (the centers)
    // and the output distance matrix C; wB is the width of Ref and C.
    float* indexQ = Query + threadIdx.x + (blockIdx.y*C_HIGH + threadIdx.y) * dim;
    float* indexR = Ref + blockIdx.x*B_WIDTH + threadIdx.y * blockDim.x + threadIdx.x;
    float* indexC = C + blockIdx.x*C_HIGH + threadIdx.y * blockDim.x + threadIdx.x + blockIdx.y*C_HIGH*wB;
K-means - compute the distance

    __shared__ float as[16][C_HIGH+1];    // +1 column helps avoid shared-memory bank conflicts
    do {
        // load a 16 x C_HIGH tile of Query into shared memory
        for (int i = 0; i < C_HIGH; i += 2)
            as[threadIdx.x][threadIdx.y + i] = indexQ[i*dim];
        indexQ += 16;
        __syncthreads();
        // accumulate squared differences (c[] and c_temp are declared in the elided part of the kernel)
        for (int i = 0; i < 16; i++, indexR += wB)
            for (int j = 0; j < C_HIGH; j++) { c_temp = indexR[0] - as[i][j]; c[j] += c_temp*c_temp; }
        __syncthreads();
    } while (indexQ < Alast);
K-means - compute the distance

    // write this thread's C_HIGH accumulated distances back to global memory
    for (int i = 0; i < C_HIGH; i++, indexC += wB) {
        indexC[0] = c[i];
    }
K-means - compute the distance - Questions
◦ Why template <unsigned int B_WIDTH, unsigned int C_HIGH> instead of ordinary function parameters? (A small illustration of my current understanding follows this list.)
◦ Why is the sub-matrix loaded in that way? Something to do with the warp? If we use a 16*16 tile instead of 32*32, should the load method change?
◦ This algorithm is nearly the same as the so-called most efficient matrix multiplication: Thread(16, 4), grid(wc/4, hc/16).
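On the first question, my understanding (to be confirmed) is that template arguments are compile-time constants, so the compiler can size the shared-memory arrays statically and fully unroll the inner loops, which it cannot do with runtime parameters. A tiny illustration, not taken from Kaiyong's code:

    // TILE is a compile-time constant: shared memory can be declared statically
    // and the loop trip count is known, so the compiler can fully unroll it.
    template <unsigned int TILE>
    __global__ void rowSum(float* out, const float* in)
    {
        __shared__ float buf[TILE];              // legal only with a constant size
        buf[threadIdx.x] = in[blockIdx.x * TILE + threadIdx.x];
        __syncthreads();
        if (threadIdx.x == 0) {
            float s = 0.0f;
            #pragma unroll                       // fully unrolled, no loop overhead
            for (unsigned int i = 0; i < TILE; i++)
                s += buf[i];
            out[blockIdx.x] = s;
        }
    }
    // Launched e.g. as rowSum<32><<<grid, 32>>>(out, in).  With a runtime
    // "int tile" parameter, "__shared__ float buf[tile]" would not compile and
    // the loop could not be fully unrolled.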
Compute the distance - very useful in data mining:
◦ K-means;
◦ K-nn;
◦ Hierarchical clustering;
◦ …
K-means - Find the nearest center
N reductions (one per data point):
◦ Sum, max, min, …
◦ Sequential addressing
◦ Completely unrolled
◦ n/log n threads, log n steps;
[Figure: N×K distance matrix; one min-reduction per row of length K]
K-means - Find the nearest center

    dim3 threads_find(16, 1, 1);
    dim3 grid_find(1, data_height, 1);

    template <unsigned int blockSize>
    __global__ void cpu_FindSmallDistance(float* Dist, int* D_index, int k)
    {
        __shared__ float sdata[blockSize];
        __shared__ int d_index[blockSize];
        // perform the first level of the reduction, reading from global memory into shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockSize + threadIdx.x;
        float* p_data = Dist + blockIdx.y*k;    // this block reduces one row of the distance matrix
        sdata[tid] = p_data[tid]; d_index[tid] = tid;
        if (i < k)
            if (sdata[tid] > p_data[i]) { sdata[tid] = p_data[i]; d_index[tid] = i; }
        EMUSYNC;
K-means - Find the nearest center

        if (sdata[tid] > sdata[tid + 8]) { sdata[tid] = sdata[tid + 8]; d_index[tid] = d_index[tid + 8]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 4]) { sdata[tid] = sdata[tid + 4]; d_index[tid] = d_index[tid + 4]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 2]) { sdata[tid] = sdata[tid + 2]; d_index[tid] = d_index[tid + 2]; } EMUSYNC;
        if (sdata[tid] > sdata[tid + 1]) { d_index[tid] = d_index[tid + 1]; } EMUSYNC;
        // write the result (index of the nearest center) for this block to global memory
        if (tid == 0) D_index[blockIdx.y] = d_index[0];
    }

Since K is presumed to be equal to or smaller than 32, this implementation is optimal.
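For comparison, the generic sequential-addressing pattern behind this kernel (the standard CUDA SDK reduction idea, written out here from memory as a plain min-reduction; not Kaiyong's code) looks roughly like this:

    // Generic sequential-addressing min-reduction: blockSize threads reduce
    // 2*blockSize elements per block in log2(blockSize) steps.
    template <unsigned int blockSize>
    __global__ void reduceMin(const float* in, float* out, unsigned int n)
    {
        __shared__ float sdata[blockSize];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockSize * 2) + tid;

        // First level: each thread reads two elements from global memory.
        float v = (i < n) ? in[i] : 3.402823466e38f;        // 3.4e38f ~ FLT_MAX
        if (i + blockSize < n) v = fminf(v, in[i + blockSize]);
        sdata[tid] = v;
        __syncthreads();

        // Sequential addressing: the active half of the block halves each step.
        for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }
    // Kaiyong's kernel above is this pattern specialized for k <= 32 (one block
    // of 16 threads per data point) with the index tracked alongside the value.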
K-means - Compute new centers
CPU-based algorithm:
◦ For each data point i: C[Index[i]] += Data[i]
Not direct addressing; not as elegant as matrix multiplication or reduction (a plain CPU sketch is given below).
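For reference, a plain CPU version of this step might look like the following sketch (my own naming and layout assumptions: Data is N×Dim row-major, NewCen is K×Dim); the indirect addressing through Index is what makes the step hard to parallelize:

    // CPU reference for the new-center step (illustrative sketch only).
    void ComputeNewCenCPU(float* NewCen, int* Count,
                          const float* Data, const int* Index,
                          int N, int K, int Dim)
    {
        for (int k = 0; k < K; k++) {                 // clear sums and counts
            Count[k] = 0;
            for (int d = 0; d < Dim; d++) NewCen[k*Dim + d] = 0.0f;
        }
        for (int n = 0; n < N; n++) {                 // scatter-add through Index
            int k = Index[n];
            Count[k]++;
            for (int d = 0; d < Dim; d++)
                NewCen[k*Dim + d] += Data[n*Dim + d];
        }
        for (int k = 0; k < K; k++)                   // divide sums by counts -> means
            if (Count[k] > 0)
                for (int d = 0; d < Dim; d++)
                    NewCen[k*Dim + d] /= Count[k];
    }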
    // Sum up the points assigned to each nearest center; the data are split into 100 groups of 512 points each
    dim3 threads_collect(32, 1, 1);
    dim3 grid_collect(100, 1, 1);

[Figure: Data matrix (N×dim) and Index array]
K-means - Compute new centers

    __shared__ int index[32]; __shared__ float as[32*34];
    int idx = threadIdx.x;
    float* p_Data = Data + blockIdx.x * GroupSize * Dim;     // this block's group of data points
    int* p_index = D_index + blockIdx.x * GroupSize;         // their nearest-center indices
    int* p_index_last = p_index + GroupSize;
    float* p_centro = Centro + blockIdx.x * k * Dim;         // this block's partial center sums
    int* p_counter = Counter + blockIdx.x * 32;              // this block's per-center counts
    int index_i = 0; int centro_count = 0;
    // initialize the shared-memory center sums
    if (idx < k)
        for (int i = 0; i < Dim; i++) as[i*k + idx] = 0;
    EMUSYNC;
K-means - Compute new centers

    // Fetch 32 indices per round, so with 512 data points there are 512/32 = 16 rounds
    for (; p_index < p_index_last; p_index += blockDim.x) {
        // Load 32 indices into shared memory so the reads are coalesced
        index[idx] = p_index[idx]; EMUSYNC;
        // Loop 32 times, processing one data point per iteration
        for (int i = 0; i < 32; i++, p_Data += Dim) {
            index_i = index[i];
            // Each thread owns one center, so increment its count when a matching point is processed
            if (idx == index_i)
                centro_count++;
            for (int j = idx; j < Dim; j += blockDim.x) {
                as[j*k + index_i] += p_Data[j];
            }
            EMUSYNC;
        }
    }
K-means - Compute new centers

    // Only the threads within the center range write results back; each thread handles one center
    if (idx < k) {
        p_counter[idx] = centro_count;
        for (int j = 0; j < Dim; j++, p_centro += k) {
            p_centro[idx] = as[j*k + idx];
        }
    }
Now we have 100 Dim*K matrices (plus the per-group counts):
◦ Matrix adding-reduction, where each element is a matrix.
◦ Kaiyong's code first reduces the 100 partials to 10, then produces the final result (a host-side sketch of the combination is given below).
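A hedged host-side sketch of what that final combination could look like (my own formulation, assuming a [Dim][K] layout for each partial matrix and K counts per group; Kaiyong's code instead performs the reduction on the GPU in two stages, 100 → 10 → final):

    #include <vector>

    // Combine the per-group partial sums and counts into the final centers.
    // partialSum holds numGroups matrices of size Dim*K (element (d, c) at d*K + c),
    // partialCount holds numGroups * K per-center counts.
    void CombinePartialCenters(float* Centers, const float* partialSum,
                               const int* partialCount,
                               int numGroups /* 100 */, int K, int Dim)
    {
        std::vector<float> sum(K * Dim, 0.0f);
        std::vector<int>   cnt(K, 0);
        for (int g = 0; g < numGroups; g++) {
            for (int i = 0; i < K * Dim; i++) sum[i] += partialSum[g * K * Dim + i];
            for (int c = 0; c < K; c++)       cnt[c] += partialCount[g * K + c];
        }
        for (int d = 0; d < Dim; d++)
            for (int c = 0; c < K; c++)
                if (cnt[c] > 0)                        // empty centers keep their old value
                    Centers[d*K + c] = sum[d*K + c] / cnt[c];
    }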
Work plan - K-means
Test the program:
◦ Each function, GPU vs. CPU;
◦ Compare with other papers.
Work plan
◦ K-means;
◦ Learn data mining, prepare for the final exam;
◦ Continue reading parallel computing books.
Thanks