Performance Primitives for Massive Multithreading

P J Narayanan
Centre for Visual Information Technology
IIIT, Hyderabad

Lessons from GPU Computing

• Massively multithreaded: several thousands to millions of threads for good performance
• Good performance depends on a lot
  – Resource utilization: shared memory, registers
  – Memory access: locality, arithmetic intensity
• Optimum point may change with architecture
  – Retuning infeasible for every developer
• Solution: Use standard libraries or primitives
  – Implemented well keeping the trade-offs in mind
  – Used by everyone: build your algorithms using them

What are the primitives?

• Standard data-parallel primitives
  – scan, reduce
  – sort, split
• But also:
  – segmented split
  – scatter, gather, data-copy
  – transpose
• Could have domain-specific primitives
  – Graph theory, numerical algorithms
  – Computer vision, image processing
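For readers new to these names, the following is a minimal sequential sketch of what some of the standard primitives compute. It is plain Python for illustration only, not a GPU implementation; the function names and the choice of addition as the operator are assumptions made here.

```python
from functools import reduce as fold
from itertools import accumulate
import operator


def reduce_add(values):
    """reduce: combine all elements into a single value with +."""
    return fold(operator.add, values, 0)


def exclusive_scan(values):
    """scan (exclusive prefix sum): out[i] = values[0] + ... + values[i-1]."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def split(keys):
    """split: indices of the elements, stably grouped by key."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def gather(src, index):
    """gather: out[i] = src[index[i]]."""
    return [src[j] for j in index]


def scatter(src, index):
    """scatter: out[index[i]] = src[i]; index is assumed to be a permutation."""
    out = [None] * len(src)
    for i, j in enumerate(index):
        out[j] = src[i]
    return out
```

With a permutation index, scatter undoes gather, which is why the two are often used as a pair when data is rearranged and later restored.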

Computing Using Primitives

• A typical program will/should have 75-80% of the work done through such primitives

• Application developer writes glue kernels to connect and clean up the components
  – Code for this is simple and perhaps unchanging
  – Even inefficient implementations are non-critical

• Example: A program with running time T uses primitives for 75% of operations. A new architecture doubles performance

• New running time (with no speedup for the non-primitive part):
  – 0.5 * (0.75 T) + 0.25 T = 0.625 T, instead of the ideal 0.5 T
  – 0.6 T if 80% was using primitives and 0.55 T if 90%
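The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation on the slide; the function name and parameter names are chosen here for illustration.

```python
def new_running_time(primitive_fraction, speedup):
    """Running time as a fraction of T when only the primitive part speeds up."""
    return primitive_fraction / speedup + (1.0 - primitive_fraction)


for fraction in (0.75, 0.80, 0.90):
    print(f"{fraction:.0%} in primitives: {new_running_time(fraction, 2.0):.3f} T")
```

The larger the share of work done through primitives, the closer the program gets to the ideal 0.5 T without any change to the application code.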

Primitive vs Library

• Both motivated by similar thinking: Reuse!
• A primitive is typically an algorithmic step, which finds diverse use
  – Used as a low-level step of an algorithm
• A library function provides an end-to-end functionality
  – Used to achieve a high-level functionality
  – Could be a “primitive” at a sufficiently high level!
• Use a library if available. Avoids development even using primitives!

K-Means Clustering

• An iteration (with N vectors of d dimensions and K clusters)
  – Each vector finds distances to each cluster center
    • O (N K d) operations
  – Attach itself to the closest centre; take its label
    • O (N K) operations to find the minimum distance
  – Compute the mean of each cluster, i.e., of the vectors with the same label
    • O (N d) operations to find K means.

• GPU implementation of clustering of 128-dimensional SIFT vectors, a frequent problem in Computer Vision.
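As a reference point for the three steps of an iteration, here is a minimal sequential sketch in Python. It is not the GPU implementation discussed in these slides; the function name is hypothetical, and keeping the old centre for an empty cluster is an assumption made here.

```python
def kmeans_iteration(vectors, centers):
    """One K-means iteration; returns (labels, new_centers)."""
    K, d = len(centers), len(centers[0])

    # Steps 1 and 2: O(N K d) distance evaluations, then an O(N K) minimum search.
    labels = []
    for v in vectors:
        dists = [sum((c[j] - v[j]) ** 2 for j in range(d)) for c in centers]
        labels.append(min(range(K), key=lambda k: dists[k]))

    # Step 3: O(N d) accumulation to find the K means.
    sums = [[0.0] * d for _ in range(K)]
    counts = [0] * K
    for v, k in zip(vectors, labels):
        counts[k] += 1
        for j in range(d):
            sums[k][j] += v[j]

    # Assumption: an empty cluster keeps its previous centre.
    new_centers = [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centers[k])
        for k in range(K)
    ]
    return labels, new_centers
```

The concurrent accumulation into `sums` in step 3 is the part that does not map directly onto thousands of threads, which is what motivates the rearrangement discussed later.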

SIFT Clustering

Problem: Cluster a few (4-8) million 128-dimensional SIFT vectors into a few (1-2) thousand clusters using K-Means
• Representation: row major. That is, the N components of each of the 128 dimensions stored together, tightly. (N rows of d each)
• Given: initial cluster means (could be random vectors)
• Output: K cluster means and N labels, one for each input vector giving cluster membership
• Large amount of computations; well suited to a GPU-like architecture

Data Representation

[Figure: Input Vectors in Row Major (rows 1 to N, columns 1 to d); Cluster Centers in Row Major (rows 1 to K, columns 1 to d); Cluster Labels (entries 1 to N)]

Distance Computation

1. Loop over K clusters, loading c cluster centers to shared memory at a time

2. A block of t threads loops over all d components of t input vectors, loading component vi and accumulating (Ci – vi)²

3. Write distances in a K x N array, with K distances for a vector stored consecutively.

Shared memory used to the maximum and all memory accesses are perfectly coalesced.
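The loop structure of the three steps above can be sketched sequentially as follows. This is an illustration under assumptions: the tile size `c` and block size `t` defaults are arbitrary, the list slice merely stands in for shared memory, and the result is indexed as `dist[k][n]` for readability without making any claim about the memory layout on the device.

```python
def distance_matrix(vectors, centers, c=2, t=4):
    """Squared distances of N vectors to K centres, computed tile by tile."""
    N, K, d = len(vectors), len(centers), len(centers[0])
    dist = [[0.0] * N for _ in range(K)]
    for k0 in range(0, K, c):                   # step 1: c centres at a time
        tile = centers[k0:k0 + c]               # stands in for shared memory
        for n0 in range(0, N, t):               # one block of t "threads"
            for n in range(n0, min(n0 + t, N)):
                for k, centre in enumerate(tile, start=k0):
                    acc = 0.0
                    for i in range(d):          # step 2: loop over components
                        acc += (centre[i] - vectors[n][i]) ** 2
                    dist[k][n] = acc            # step 3: write the distance
    return dist
```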

After Distance Evaluations

[Figure: Vector to Cluster Distance Matrix (rows 1 to K, columns 1 to N)]

Finding Closest Center

• We need to know the index of the centre that gave the minimum distance.
• A block of t threads loads t distances for a particular centre. Keep track of the minimum distance and the corresponding index across the K centers.
• Write index into a new labels array of length N.

All memory accesses are perfectly coalesced.
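A sequential sketch of the running-minimum search described above, assuming the distances are available as a K x N array indexed `dist[k][n]` (the function name is hypothetical). Each position `n` plays the role of one thread that walks over the K centres and remembers the best index seen so far.

```python
def closest_centres(dist):
    """dist is K x N; returns a labels list of length N."""
    K, N = len(dist), len(dist[0])
    best = list(dist[0])        # minimum distance seen so far, per vector
    labels = [0] * N            # index of the centre that gave it
    for k in range(1, K):
        for n in range(N):      # on the GPU, one thread per n
            if dist[k][n] < best[n]:
                best[n] = dist[k][n]
                labels[n] = k
    return labels
```

Using a strict comparison means ties go to the centre with the smaller index; that tie-breaking rule is a choice made for this sketch.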

New Cluster Centers

• The new labels are given in the input vector order.
• Next step: Find the mean of all vectors with same label. Find their sum first.
• Rearrange input vectors so that vectors of each category are placed together.
• Column major storage makes the memory accesses non-coalesced and inefficient.
• Rearrange and convert to row major. Summing is easy thereafter!

Finding New Centers

1. gIndex = splitGatherIndex(newLabels)
2. dCopy = gather(inputVectors, gIndex)
3. temp = transpose(dCopy)
4. Perform segmented add reduce of temp with segments at label boundaries. Store results in a d x K array newCenters
5. inputVectors = transpose(dCopy)
6. centers = transpose(newCenters)

Now, input vectors are rearranged with new cluster centers. (Need to also keep track of a composition of gIndex values to maintain connection to input vectors)
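The sequence of primitives above can be traced with a small sequential sketch. The helper names mirror the slide but their Python bodies are assumptions made here, labels are taken to be integers 0 to K-1, every cluster is assumed non-empty, and the final division by the cluster size (to turn sums into means) is added for completeness.

```python
def split_gather_index(labels):
    """Indices of the vectors, stably ordered by label (a split)."""
    return sorted(range(len(labels)), key=lambda i: labels[i])


def gather(rows, index):
    return [rows[i] for i in index]


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def segmented_add_reduce(row, sorted_labels, K):
    """Sum each run of equal labels in one row; returns K sums."""
    sums = [0.0] * K
    for value, label in zip(row, sorted_labels):
        sums[label] += value
    return sums


def find_new_centres(input_vectors, new_labels, K):
    g_index = split_gather_index(new_labels)          # step 1
    d_copy = gather(input_vectors, g_index)           # step 2
    sorted_labels = gather(new_labels, g_index)
    temp = transpose(d_copy)                          # step 3: d x N
    sums = [segmented_add_reduce(r, sorted_labels, K) for r in temp]  # step 4: d x K
    counts = [sorted_labels.count(k) for k in range(K)]
    # Step 6 plus the division; assumes every cluster is non-empty.
    centres = [[s / counts[k] for s in row]
               for k, row in enumerate(transpose(sums))]              # K x d
    return d_copy, centres, g_index
```

Because the labels are sorted after the split, the values belonging to one label form a single contiguous segment in every row of `temp`, which is what lets a segmented reduce replace concurrent writes.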

• Input: Input vectors, n, Cluster centers, dim, k
• Output: New Membership array (n*1), New cluster centers (k*dim), Global Index (n*1).

Storage per Block

[Figure: Four Input Vectors (rows 1 to 4, components 1 to dim); Center on shared memory (components 1 to dim, example values 2.4, 11, 15, 28, 193.1)]

Four input vectors are loaded per block, and their corresponding differences are stored in shared memory, which consumes 2*2048 bytes of memory. The center is also held in shared memory. Tree-based addition is then performed on the differences for each vector.

Algorithm Flow

• Perform distance evaluations between input and current centers to generate new membership array
• Apply split sort on membership array, sorting as per cluster center ids
• Create flag and perform segmented scan to get histogram for each cluster
• Rearrange data as per cluster ids
• Perform transpose on rearranged data for coalesced access
• Use CUDPP segmented scan on rearranged data followed by CUDPP compact to extract the summation
• Divide the summation by histogram generated for each cluster to get new cluster centers
• Update the global Index

• The Global Index is initialized by Global Index[i] = i
• After sorting the membership array, we have sorted_membership_index[], i.e. the order in which vectors are supposed to be arranged
• The sorted membership index after split sort is used to get the global index
• Global Index[sorted_membership_index[i]] = i
• In the final Global Index, i is the actual vector id of the input vectors and Global Index[i] is the position of the i’th vector id in the final rearranged input data.
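A small sketch of how the Global Index could be maintained across iterations. It assumes that, after the first iteration, sorted_membership_index refers to positions in the already rearranged data, so each update has to be composed with the previous Global Index; that reading, and the function name, are assumptions made here.

```python
def update_global_index(global_index, sorted_membership_index):
    """Return the new position of every original vector id after one rearrangement."""
    # inverse[p] = new position of the vector currently stored at position p
    inverse = [0] * len(sorted_membership_index)
    for i, p in enumerate(sorted_membership_index):
        inverse[p] = i
    # Compose with the positions recorded so far.
    return [inverse[p] for p in global_index]


global_index = list(range(4))                        # Global Index[i] = i
global_index = update_global_index(global_index, [2, 0, 3, 1])
global_index = update_global_index(global_index, [1, 0, 2, 3])
print(global_index)
```

On the first call the Global Index is the identity, so the update reduces to Global Index[sorted_membership_index[i]] = i as stated on the slide.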

Distance Evaluation

• Sequential approach takes O(dim) steps
• Simple tree based parallel approach
• Takes O(log(dim)) steps to evaluate the net distance
• In a block only 256/2^i threads are active during the i’th iteration of a distance evaluation
• Effectively performed on the shared memory
• Reduces the number of steps from O(dim) to O(log(dim))

Distance Evaluation

[Figure: Tree based addition in log 8 steps. Eight values of 2 are added pairwise over iterations (itr) 1, 2 and 3: 2 2 2 2 2 2 2 2, then 4 4 4 4, then 8 8, then 16]

Algorithm for Distance evaluation

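• Algorithm (Input: d_input, d_centers, dim, no_centers)

    min = infinity
    for i = 0 to no_centers - 1 do
        // each thread squares one component difference;
        // d_centers[i] denotes this thread's component of center i
        shared[threadIdx.x] = (d_input[id] - d_centers[i])^2
        __syncthreads()
        // tree-based addition: the number of active threads halves each step
        for (j = dim/2; j > 0; j = j/2) do
            if (threadIdx.x < j) then
                shared[threadIdx.x] += shared[threadIdx.x + j]
            end if
            __syncthreads()
        end of inner for loop
        if (min > shared[0]) then
            min = shared[0]
        end if
        __syncthreads()   // finish reading shared[0] before it is overwritten
    end of outer for loop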
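The inner loop of the algorithm can be simulated sequentially to see why it needs only log2(dim) steps. The sketch below is plain Python and assumes the input length is a power of two; the function name is hypothetical. Within one step the active "threads" read only from the upper half and write only to the lower half, so running them one after another gives the same result as running them in parallel.

```python
def tree_sum(values):
    """Tree-based addition; returns (total, number_of_steps)."""
    shared = list(values)           # stands in for the shared-memory array
    j = len(shared) // 2
    steps = 0
    while j > 0:
        for tid in range(j):        # only threads with tid < j are active
            shared[tid] += shared[tid + j]
        j //= 2                     # active threads halve every step
        steps += 1                  # one __syncthreads() per step on the GPU
    return shared[0], steps


print(tree_sum([2] * 8))            # eight values of 2, as in the tree figure
```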

Kernel Level Execution

• Every iteration, the number of active threads reduces by a factor of 2

[Figure: tree-based addition across thread ids 1 to 128 (Dim = 128), from the first to the final iteration]

Kernel Functions

• Distance – Evaluates the distance between vectors (block 128,4, grid n/4p,p)
• Get_long_membership – Creates a variable of type long consisting of membership id and corresponding vector id
• SplitSort – Sorts membership array as per cluster ids
• CUDPP Segmented Scan – Scan operation on sorted membership array
• Get_flag – Generates flag for CUDPP operations (block 256,1)
• Gather_histogram – Gathers the final values after scan
• Rearrange_data – Arranges input as per cluster ids (block 128,4, grid n/4p,p)
• Transpose – Performs transpose on rearranged data
• CUDPP Compact – Extracts summed up center values

Rearranging data

[Figure: Input Vectors in Row Major (rows 1 to N, columns 1 to d) and Rearranged Vectors in Row Major based on Sorted Membership array (cluster ids 1 ... k; example vector ids 45, 59, 89, 23; columns 1 to d)]

Center Evaluation

[Figure: Rearranged Input Vectors (one row per vector id, e.g. 45, 59, 23; columns 1 to dim) and Transposed Vectors (rows 1 to dim; one column per vector id)]

We can apply segmented scan on the transposed vectors, which is a coalesced operation. Flag values can be obtained from the histograms generated for each cluster.
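How the flags, the segmented scan and the final extraction fit together can be sketched sequentially for one row of the transposed data. This is a plain Python stand-in and not the CUDPP API; the function names, the head-flag convention (1 marks the first element of a segment) and the inclusive form of the scan are assumptions made here.

```python
def flags_from_histogram(histogram):
    """Head flags: 1 at the first element of every non-empty cluster segment."""
    flags = [0] * sum(histogram)
    start = 0
    for count in histogram:
        if count:
            flags[start] = 1
        start += count
    return flags


def segmented_inclusive_scan(values, flags):
    """Running sum that restarts wherever a head flag is set."""
    out, running = [], 0
    for v, f in zip(values, flags):
        running = v if f else running + v
        out.append(running)
    return out


def segment_totals(scanned, flags):
    """Compact: keep the last scanned value of each segment."""
    n = len(scanned)
    return [scanned[i] for i in range(n) if i == n - 1 or flags[i + 1]]


histogram = [2, 3, 1]                               # cluster sizes
flags = flags_from_histogram(histogram)
scanned = segmented_inclusive_scan([1, 2, 3, 4, 5, 6], flags)
totals = segment_totals(scanned, flags)
means = [t / c for t, c in zip(totals, histogram)]  # one component of each centre
```

Dividing the totals by the histogram gives one component of every new centre; repeating this for each of the dim rows yields the full centres.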

Global Index

• Updating the Global Index array after every iteration
• Global Index[membership_sorted_index[i]] = i

[Figure: Global Index array indexed by Vector Id (1 to n), shown from the first to the final iteration]

Why use Split Sort, Transpose?

• New centers evaluation requires concurrent writes, which is not easily parallelizable
• Split sort sorts the membership array, grouping vector ids belonging to the same cluster together
• Helpful for rearranging entire input vectors as per their clusters
• Transpose provides coalesced access for center evaluation using segmented scan

Issues

• Distance evaluations consume most of the time as the input size increases.
• Input size and number of clusters largely determine the performance

Result

• Kmeans++ to generate initial centers
• Time taken to generate initial cluster centers

Input size    Cluster centers    CPU (P4, 2.4 GHz)    GPU (GTX 280)
1,000         80                 4480 ms              12.177 ms
10,000        800                39341.2 ms           670.06 ms
100,000       8000               897326.5 ms          62547.035 ms
1,000,000     80000              9943472.8 ms         126392.1 ms

Results

• Variation with number of input vectors (128 dimension)
• Time taken per iteration to generate new membership array and new cluster centers (excluding time for kmeans++)

Input size    Cluster centers    CPU (P4, 2.4 GHz)    GPU (GTX 280)
1,000         80                 370 ms               9.91 ms
10,000        800                82900 ms             487.3 ms
100,000       8000               679923.1 ms          36623.58 ms
1,000,000     8000               5189450.4 ms         45789.29 ms

Result

• Variation of cluster centers
• N = 100,000, Dimension = 128

Input size    Cluster centers    GPU (GTX 280)
100,000       500                2188.91 ms
100,000       1000               4486.37 ms
100,000       2000               9241.58 ms
100,000       4000               18419.71 ms
100,000       8000               36623.58 ms

Result

• Variation with dimension of SIFT vector
• N = 100,000, Cluster centers = 8000

Input size    Dimension    GPU (GTX 280)
100,000       16           118.91 ms
100,000       32           997.3 ms
100,000       64           8623.58 ms
100,000       128          36623.58 ms

Result

• Coalesced vs Non-Coalesced
• Coalesced involves transpose followed by segmented scan and non-coalesced involves gather followed by segmented scan

Input size    Cluster centers    Non-coalesced    Coalesced
1,000         80                 0.043 ms         0.077 ms
10,000        800                1.28 ms          0.217 ms
100,000       8000               15.45 ms         1.955 ms
100,000       80000              83.24 ms         19.343 ms

Result

• Membership vs New Centers
• The membership generation consumes the major chunk of time

Input size    Cluster centers    Membership     New centers
1,000         80                 4.23 ms        5.68 ms
10,000        800                369.07 ms      118.23 ms
100,000       8000               36465.2 ms     158.38 ms
1,000,000     8000               45559.48 ms    229.83 ms
