Scalable Clustering for Vision using GPUs

IIIT Hyderabad

K Wasif Mohiuddin, P J Narayanan

Center for Visual Information Technology
International Institute of Information Technology (IIIT), Hyderabad

Publications

1) K Wasif Mohiuddin and P J Narayanan. Scalable Clustering using Multiple GPUs. HiPC '11 (Conference on High Performance Computing), Bangalore, India.

2) K Wasif Mohiuddin and P J Narayanan. GPU Assisted Video Organizing Application. ICCV '11, Workshop on GPU in Computer Vision Applications, Barcelona, Spain.

Presentation Flow

• Scalable Clustering on Multiple GPUs

• GPU assisted Personal Video Organizer

Introduction

• Classification of data desired for meaningful representation.

• Unsupervised learning for finding hidden structure.

• Application in computer vision, data mining with
  – Image Classification
  – Document Retrieval

• K-Means algorithm

Clustering

[Figure: clustering steps: Select Centers, Labeling, Mean Evaluation, Relabeling]

Need for High Performance Clustering

• Clustering 125k vectors of 128 dimension with 2k clusters took nearly 8 minutes on CPU per iteration.

• A fast, efficient clustering implementation is needed to deal with large data, high dimensionality and a large number of centers.

• In computer vision, SIFT (128 dimensions) and GIST features are common. The number of features can run into several millions.

• Bag of Words for Vocabulary generation using SIFT vectors

Challenges and Contributions

• Computational: $O(n^{dk+1} \log n)$

• Growing n, k, d for large scale applications.

• Contributions: A complete GPU based implementation with
  – Exploitation of intra-vector parallelism
  – Efficient Mean evaluation
  – Data Organization for coalesced access
  – Multi GPU framework

Related Work

• General Improvements
  – KD-trees [Moore et al, SIGKDD-1999]
  – Triangle Inequality [Elkan, ICML-2003]
  – Distributed Systems [Dhillon et al, LSPDM-2000]

• Pre-CUDA GPU Efforts
  – Fragment Shader [Hall et al, SIGGRAPH-2004]

Related Work (cont)

• Recent GPU efforts
  – Mean on CPU [Che et al, JPDC-2008]
  – Mean on CPU + GPU [Hong et al, WCCSIE-2009]
  – GPU Miner [Wenbin et al, HKUSTCS-2008]
  – HPK-Means [Wu et al, UCHPC-2009]
  – Divide & Rule [Li et al, ICCIT-2010]

• One thread assigned per vector. Parallelism not exploited within data object.

• Lacking efficiency in Mean evaluation

• Proposed techniques are parameter dependent.

K-Means

• Objective Function: $\sum_{i} \sum_{j} \lVert x_i^{(j)} - c_j \rVert^2$, where $1 \le i \le n$, $1 \le j \le k$

• K random centers are initially chosen from input data objects.

• Steps:
  – Membership Evaluation
  – New Mean Evaluation
  – Convergence
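The three steps can be sketched in plain Python. This is a minimal sequential sketch for illustration, not the GPU implementation described in these slides. The names kmeans and squared_distance, the max_iters and seed parameters, and the rule that an empty cluster keeps its previous center are assumptions.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance; the square root is not needed for comparisons.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # K random centers are initially chosen from the input data objects.
    centers = [list(p) for p in rng.sample(points, k)]
    d = len(points[0])
    labels = None
    for _ in range(max_iters):
        # Membership evaluation: label each point with its nearest center.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                      for p in points]
        # Convergence: stop when no label changes between iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # New mean evaluation: average the points assigned to each center.
        sums = [[0.0] * d for _ in range(k)]
        counts = [0] * k
        for p, j in zip(points, labels):
            counts[j] += 1
            for t in range(d):
                sums[j][t] += p[t]
        for j in range(k):
            if counts[j]:  # assumption: an empty cluster keeps its old center
                centers[j] = [s / counts[j] for s in sums[j]]
    return centers, labels
```

Calling kmeans(points, 2) on two well separated groups of points returns one center per group and a label for every point.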

GPU Architecture

The Fermi architecture has 16 Streaming Multiprocessors (SMs).

Each SM has 32 cores, for a total of 512 CUDA cores.

Kernels launch multiple threads to perform a task in a Single Instruction Multiple Data (SIMD) fashion.

Each SM has registers divided equally amongst its threads. Each thread has a private local memory.

Single unified memory request path for loads and stores, using an L1 cache per SM and an L2 cache that services all operations.

Double precision, faster context switching, faster atomic operations and multiple kernel execution.

K-Means on GPU

Membership Evaluation

• Involves Distance and Minima evaluation.

• Single thread per component of vector

• Parallel computation done on ‘d’ components of input and center vectors stored in row major format.

• Log summation for distance evaluation.

• For each input vector we traverse across all centers.
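One plausible reading of "log summation" is a tree reduction: each of the d threads holds one squared component difference, and pairs are added in log2(d) steps. The sketch below simulates that on the CPU in plain Python. The function names and the padding to a power of two are assumptions made for illustration.

```python
def tree_sum(values):
    # Pad to a power of two so every step halves the active range cleanly.
    buf = list(values)
    size = 1
    while size < len(buf):
        size *= 2
    buf += [0.0] * (size - len(buf))
    stride = size // 2
    while stride > 0:
        # Each t plays the role of one thread; all t in a step are independent.
        for t in range(stride):
            buf[t] += buf[t + stride]
        stride //= 2
    return buf[0]

def squared_distance_parallel(x, c):
    # One "thread" per component computes its squared difference.
    partial = [(xi - ci) ** 2 for xi, ci in zip(x, c)]
    return tree_sum(partial)

def nearest_center(x, centers):
    # For each input vector we traverse across all centers and keep the minimum.
    best, best_d = -1, float("inf")
    for j, c in enumerate(centers):
        dist = squared_distance_parallel(x, c)
        if dist < best_d:
            best, best_d = j, dist
    return best, best_d
```

With d = 128 components the reduction takes 7 steps instead of 127 sequential additions.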

Membership on GPU

[Figure: an input vector is compared against the center vectors 1, 2, ..., k-1, k to produce its label]

Membership on GPU (Cont)

• Data objects stored in row major format

• Provides coalesced access

• Distance evaluation using shared memory.

• Square root computation avoided: comparing squared distances selects the same nearest center.

K-Means on GPU (Cont)

• Mean Evaluation Issues
  – Random reads and writes
  – Concurrent writes
  – Non-uniform distribution of data objects per label.

[Figure: threads performing Write and Read/Write accesses to data]

Mean Evaluation on GPU

• Store labels and index in 64 bit records

• Group data objects with same membership using Splitsort operation.

• We split using labels as key

• Gather primitive used to rearrange input in order of labels.

• Sorted global index of input vectors is generated.
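The grouping step can be sketched as follows. This is an illustrative CPU version: Python's built-in sorted stands in for Splitsort, and the record layout (label in the upper 32 bits, input index in the lower 32 bits) is an assumption, as are the function names.

```python
def pack(label, index):
    # Assumed layout: label in the high 32 bits, input index in the low 32 bits.
    return (label << 32) | index

def unpack(record):
    return record >> 32, record & 0xFFFFFFFF

def group_by_label(labels, vectors):
    # Sorting the packed records orders them by label (the key),
    # with the original index carried along in the low bits.
    records = sorted(pack(label, i) for i, label in enumerate(labels))
    sorted_labels = [unpack(r)[0] for r in records]
    order = [unpack(r)[1] for r in records]   # sorted global index
    gathered = [vectors[i] for i in order]    # gather: rearrange the input
    return sorted_labels, order, gathered
```

After this step all vectors with the same label are contiguous, so the per-label sums no longer need scattered or concurrent writes.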

Splitsort : Suryakant & Narayanan IIITH, TR 2009

Splitsort & Transpose Operation

Mean Evaluation on GPU (cont)

• Row major storage of vectors enabled coalesced access.

• Segmented scan followed by compact operation for histogram count.

• Transpose operation before rearranging input vectors.

• Using segmented scan again we evaluated mean of rearranged vectors as per labels.
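A sequential sketch of how segmented scans can yield both the per-label counts and the per-label means once the vectors are grouped by label. The function names are hypothetical, and reading the last element of each segment is used here as a stand-in for the compact operation.

```python
def segmented_scan(values, head_flags):
    # Inclusive prefix sum that restarts wherever head_flags is True.
    out, running = [], 0
    for v, is_head in zip(values, head_flags):
        running = v if is_head else running + v
        out.append(running)
    return out

def means_by_label(sorted_labels, sorted_vectors):
    n = len(sorted_labels)
    # A segment starts wherever the label changes.
    head_flags = [i == 0 or sorted_labels[i] != sorted_labels[i - 1]
                  for i in range(n)]
    # The last element of each segment holds that segment's total.
    tails = [i for i in range(n) if i == n - 1 or head_flags[i + 1]]
    counts = segmented_scan([1] * n, head_flags)   # histogram count
    columns = list(zip(*sorted_vectors))           # transpose: one list per dimension
    sums = [segmented_scan(col, head_flags) for col in columns]
    return {sorted_labels[t]: [col[t] / counts[t] for col in sums] for t in tails}
```

For labels [0, 0, 1] the scan of ones gives [1, 2, 1], and the values at the segment ends (2 and 1) are the cluster sizes.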

Implementation Details

• Tesla
  – 2 vectors per block, 2 centers at a time
  – Centers accessed via texture memory

• Fermi
  – 2 vectors per block, 4 centers at a time
  – Centers accessed via global memory using L2 cache
  – More shared memory for distance evaluation

• Occupancy of 83% achieved in case of Fermi and Tesla.

Limitations of a GPU device

• Highly compute-intensive and memory-intensive algorithm.

• Overloading on GPU device

• Limited Global and Shared memory on a GPU device.

• Handling of large data vectors

• Scalability of the algorithm

Multi GPU Approach

• Partition input data into chunks proportional to number of cores.

• Broadcast ‘k’ centers to all the nodes.

• Perform Membership and partial mean evaluation on each GPU, for the chunk sent to its node.

Multi GPU Approach (cont)

• Nodes direct partial sums to Master node.

• New means evaluated by Master node for next iteration.

[Figure: Nodes A, B, ..., Z send their partial sums Sa, Sb, ..., Sz to the Master Node, which produces the New Centers]

S = Sa+Sb+…..+Sz
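The partition and combine steps can be sketched without MPI. This plain Python sketch is an illustration only: the function names are hypothetical, giving the rounding remainder to the last GPU is an assumption, and so is keeping the old center for a cluster that received no vectors.

```python
def partition_sizes(n, cores_per_gpu):
    # Chunk sizes proportional to the number of cores on each GPU.
    total = sum(cores_per_gpu)
    sizes = [n * c // total for c in cores_per_gpu]
    sizes[-1] += n - sum(sizes)  # assumption: remainder goes to the last GPU
    return sizes

def partial_sums(vectors, labels, k):
    # Work done on one node: per-label sums and counts for its own chunk.
    d = len(vectors[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for v, j in zip(vectors, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += v[t]
    return sums, counts

def combine(partials, old_centers):
    # Work done on the master: S = Sa + Sb + ... + Sz, then divide by the counts.
    k, d = len(old_centers), len(old_centers[0])
    new_centers = []
    for j in range(k):
        count = sum(counts[j] for _, counts in partials)
        if count == 0:
            new_centers.append(list(old_centers[j]))
            continue
        total = [sum(sums[j][t] for sums, _ in partials) for t in range(d)]
        new_centers.append([s / count for s in total])
    return new_centers
```

Only k sums and k counts travel from each node to the master per iteration, independent of the chunk size.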

Results

• Generated Gaussian SIFT vectors

• Variation in parameters n, d, k

• Performance on CPU (1 GB RAM, 2.7 GHz), Tesla T10, GTX 480, 8600 tested up to n_max: 4 Million, k_max: 8000, d_max: 256

• MultiGPU (4xT10 + GTX 480) using MPI, n_max: 32 Million, k_max: 8000, d_max: 256

• Comparison with previous GPU implementations.

Overall Results

N, K       CPU      GPU: Tesla T10   GPU: GTX 480   GPU: 4xT10
10K, 80    1.3      0.119            0.18           0.097
50K, 800   71.3     2.73             1.73           0.891
125K, 2K   463.6    14.18            7.71           2.47
250K, 4K   1320     38.5             27.7           7.45
1M, 8K     28936    268.6            170.6          68.5

Times of K-Means on CPU, GPUs in seconds for d=128.
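The speedups implied by these timings follow from dividing the CPU time by each GPU time. A small sketch, with the values copied from the table and the helper name speedups chosen for illustration:

```python
# Seconds per row: (CPU, Tesla T10, GTX 480, 4xT10), for d = 128.
times = {
    "10K, 80": (1.3, 0.119, 0.18, 0.097),
    "50K, 800": (71.3, 2.73, 1.73, 0.891),
    "125K, 2K": (463.6, 14.18, 7.71, 2.47),
    "250K, 4K": (1320, 38.5, 27.7, 7.45),
    "1M, 8K": (28936, 268.6, 170.6, 68.5),
}

def speedups(row):
    # CPU time divided by each GPU time, in column order.
    cpu, *gpus = row
    return [cpu / g for g in gpus]

for name, row in times.items():
    t10, gtx480, four_t10 = speedups(row)
    print(f"{name}: T10 {t10:.1f}x, GTX 480 {gtx480:.1f}x, 4xT10 {four_t10:.1f}x")
```

For the 1M, 8K row, 28936 / 170.6 is about 170 on the GTX 480.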

Performance on GPUs

[Figure: Performance of 8600 (32 cores), Tesla (240 cores), GTX 480 (480 cores) for d=128 and k=1,000.]

Performance vs ‘n’

[Figure: Linear in n, with d=128 and k=4,000.]

Overall Performance

• Multi GPU provided linear speedup

• Speedup of up to 170x over the CPU on GTX 480

• 6 Million vectors of 128 dimensions clustered in just 136 sec per iteration.

• Low-end GPUs provide a speedup of nearly 10 to 20 times.

Comparison

N           K     D     Li et al   Wu et al   Our K-Means
2 Million   400   8     1.23       4.53       1.27
4 Million   100   8     0.689      4.95       0.734
4 Million   400   8     2.26       9.03       2.4
51,200      32    64    0.403      -          0.191
51,200      32    128   0.475      -          0.262

Up to a twofold speedup over the best previous GPU implementation, on GTX 280.

Multi GPU Results

N       Dim   1 Tesla   4xTesla   4xTesla+GTX480
1 M     128   120.4     33.6      22.8
1.5 M   128   181.7     47.2      34.8
3 M     128   364.2     95.67     67.4
6 M     128   -         183.8     136.7
16 M    16    220.4     57.8      40.9
32 M    16    -         116       84.3

Scalable to number of cores in a Multi GPU, Results on Tesla, GTX 480 in seconds for d=128, k=4000

Time Division

[Figure: stacked bar chart (0% to 100%) of the share of time spent in Membership and Mean evaluation for (n, d, k) = (50K, 32, 34), (0.5M, 32, 34), (0.5M, 128, 2k), (1M, 128, 4k). Bar labels: 0.073, 0.248, 24.5, 69.2 and 0.2, 0.29, 2.9, 4.1.]

Time on GTX 480 device. Mean evaluation reduced to 6% of the total time for large input of high dimensional data.

Presentation Flow

• Scalable Clustering on Multiple GPUs

• GPU assisted Personal Video Organizer

Motivation

• Many and varied videos in everyone’s collection, growing every day
  – Sports, TV Shows, Movies, home events, etc.

• Categorizing them based on content is useful
  – No effective tools for video (or images)
  – Existing efforts are very category specific

• Should not need heavy training or large clusters of computers

• Goal: Personal categorization performed on personal machines
  – Training and testing on a personal scale

Challenges and Contributions

• Algorithmic: Extend image classification to videos.

• Data: Use a small amount of personal videos spanning a wide class of categories.

• Computational: Need to do it on laptops or personal workstations.

• Contributions: A video organization scheme with
  – Learning categories from user-labelled data
  – Fast category assignment for the collection
  – Exploiting the GPU for computation
  – Good performance even on personal machines

Related Work

• Image Categorization
  – ACDSee, Dbgallery, Flickr, Picasa, etc.

• Image Representation
  – SIFT [Lowe IJCV04], GIST [Torralba IJCV01], HOG [Dalal & Triggs CVPR05] etc.

• Key Frame extraction
  – Difference of Histograms [Gianluigi SPIE05]

Related Work (contd)

• Genre Classification
  – SVM [Ekenel et al AIEMPro2010]
  – HMM [Haoran et al ICICS2003]
  – GMM [Truong et al, ICPR2000]
  – Motion and color [Chen et al, JVCIR2011]
  – Spatio-temporal behavior [Rea et al, ICIP2000]

• Involved extensive learning of categories for a specific type of videos

• Not suitable for personal collections that vary greatly.

Video Classification: Steps

• Category Determination
  – User tags videos separately for each class
  – Learning done using these videos
  – Cluster centers derived for each class

• Category Assignment
  – Use the trained categories on remaining videos
  – Final assigning done based on scoring
  – Ambiguities resolved by user

Category Determination

• Segmentation & Thresholding

• Keyframe extraction & PHOG Features

• K-Means

[Figure: pipeline from Tagged Videos through Segment & Threshold, Keyframes & PHOG and K-Means Clustering to Category Representation]

Work Division

• Less intensive steps processed on CPU.

• Computationally expensive steps moved onto GPU.

• Steps like key frame extraction, feature extraction and clustering are time consuming.

Key frame Extraction

Segmentation

• Compute color histogram for all the frames.

• Divide video into shots using the score of difference of histograms across consecutive frames.

Thresholding

• Shots having more than 60 frames selected.

• Four equidistant frames chosen as key frames from every shot.
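A minimal sketch of the segmentation and thresholding steps, assuming each frame is already reduced to a color histogram (a list of bin counts). The sum of absolute bin differences as the score, the cut_threshold parameter and the placement of the four frames at 1/5, 2/5, 3/5 and 4/5 of the shot are assumptions; the slides do not specify them.

```python
def histogram_difference(h1, h2):
    # Assumed score: sum of absolute differences between corresponding bins.
    return sum(abs(a - b) for a, b in zip(h1, h2))

def split_into_shots(histograms, cut_threshold):
    # Returns (start, end) frame ranges with end exclusive.
    shots, start = [], 0
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > cut_threshold:
            shots.append((start, i))
            start = i
    if histograms:
        shots.append((start, len(histograms)))
    return shots

def key_frames(shots, min_length=60, per_shot=4):
    frames = []
    for start, end in shots:
        length = end - start
        if length <= min_length:
            continue  # only shots with more than 60 frames are kept
        step = length / (per_shot + 1)
        # Four equidistant frames inside the shot.
        frames.extend(start + int(step * (i + 1)) for i in range(per_shot))
    return frames
```

A 100-frame shot yields frames 20, 40, 60 and 80 under these assumptions, while a 30-frame shot is discarded.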

PHOG

• Edge contours extracted using Canny edge detector.

• Orientation gradients computed with a 3 x 3 Sobel mask without Gaussian smoothing.

• HOG descriptor discretized into K orientation bins.

• HOG vector is computed for each grid cell at each pyramid resolution level [Bosch et al. CIVR2007]
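A simplified sketch of a pyramid of orientation histograms in plain Python, for a grayscale image given as a list of rows. It is not the implementation used in this work: the Canny edge step is omitted (every interior pixel contributes, weighted by gradient magnitude), and the defaults of 8 bins and 3 pyramid levels, the final normalization and all function names are assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_gradients(image):
    # 3 x 3 Sobel mask, no smoothing; border pixels are left at zero.
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * image[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.atan2(gy, gx) % math.pi  # unsigned orientation
    return mag, ang

def cell_histogram(mag, ang, y0, y1, x0, x1, bins):
    # Orientation histogram of one grid cell, weighted by gradient magnitude.
    hist = [0.0] * bins
    for y in range(y0, y1):
        for x in range(x0, x1):
            b = min(int(ang[y][x] / math.pi * bins), bins - 1)
            hist[b] += mag[y][x]
    return hist

def phog(image, bins=8, levels=2):
    mag, ang = sobel_gradients(image)
    h, w = len(image), len(image[0])
    descriptor = []
    for level in range(levels + 1):
        cells = 2 ** level  # level l splits the image into 2^l x 2^l cells
        for i in range(cells):
            for j in range(cells):
                descriptor += cell_histogram(
                    mag, ang,
                    h * i // cells, h * (i + 1) // cells,
                    w * j // cells, w * (j + 1) // cells, bins)
    total = sum(descriptor)
    return [v / total for v in descriptor] if total else descriptor
```

With 8 bins and levels 0 to 2 the descriptor has 8 x (1 + 4 + 16) = 168 entries.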

Final Representation

• Cluster the accumulated key frames separately for every category.

• Grouping of similar frames into single cluster.

• Meaningful representation of key frames for each category is achieved.

• Reduced search space for the test videos.

K-Means

• Partitions ‘n’ data objects into ‘k’ partitions

• Clustering of extracted training key frames.

• Separately for each of the categories.

• Represent each category with meaningful cluster centers.

• For instance grouping frames consisting of pitch, goal post, etc.

• 30 clusters per category generated.

PHOG on GPU

• HoG computed using previous code [Prisacariu et al. 2009]
  – Gradients evaluated using convolution kernels from NVIDIA CUDA SDK.
  – One thread per pixel and the thread block size is 16×16.
  – Each thread computes its own histogram

• PHOG descriptors computed by applying HOG for different scales and merging them.
  – Downsample the image and send to HoG.

Category Assignment

• Segmentation, Thresholding, keyframes
  – Extract keyframes from untagged videos.

• Compute PHOG for each keyframe

• Classify each keyframe independently
  – K-Nearest Neighbor classifier
  – Allot each keyframe to the nearest k clusters

• Final scoring for category assignment

K-Nearest Neighbor

• Classification done based on closest training samples.

• K nearest centers evaluated for each frame.

• Euclidean distance used as distance metric.

KNN on GPU

• Each block handles ‘L’ new key frames at a time and loops over all key frames.

• Find distances for each key frame against all centers sequentially
  – Deal with each dimension in parallel using a thread
  – Find the vector distance using a log summation
  – Write back to global memory

• Sort using the distance as key for each key frame.

• Keep the top k values
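A sequential sketch of the same steps for a single key frame: compute the distance to every cluster center, sort with the distance as key, and keep the top k. The function names and the parallel list of category names per center are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest_centers(frame, centers, categories, k):
    # Distance of the key frame to every center, paired with the center's category.
    scored = [(euclidean(frame, c), cat) for c, cat in zip(centers, categories)]
    # Sort with the distance as key and keep the top k values.
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]
```

The result is a list of (distance, category) pairs, nearest first, which is the form a later scoring step can consume.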

Scoring

• Use the distance ratio r = d1 / d2 of distances d1 and d2 to the two nearest neighbors.

• If r < threshold, allot a single membership to the keyframe. Threshold used: 0.6

• Assign multiple memberships otherwise. We assign to top c/2 categories.

• Final category:
  – Count the votes for each category for the video
  – If the top category is a clear winner (20% more score than the next), assign to it.
  – Seek manual assignment otherwise.
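The scoring rules can be sketched as below. Several details are assumptions, because the slide leaves them open: c is taken to be the number of categories, the top c/2 categories are the first distinct ones in order of distance, every membership counts as one vote, and None stands for "ask the user". The function names are hypothetical.

```python
from collections import Counter

def keyframe_memberships(neighbors, num_categories, ratio_threshold=0.6):
    # neighbors: (distance, category) pairs, nearest first, at least two entries.
    (d1, c1), (d2, _) = neighbors[0], neighbors[1]
    if d2 > 0 and d1 / d2 < ratio_threshold:
        return [c1]  # confident: a single membership
    # Ambiguous: assign to the top c/2 distinct categories by proximity.
    limit = max(1, num_categories // 2)
    chosen = []
    for _, cat in neighbors:
        if cat not in chosen:
            chosen.append(cat)
        if len(chosen) == limit:
            break
    return chosen

def assign_video(per_keyframe_memberships, margin=0.2):
    # Count one vote per membership across all key frames of the video.
    votes = Counter(cat for m in per_keyframe_memberships for cat in m)
    ranked = votes.most_common()
    if not ranked:
        return None
    # Clear winner: at least 20% more votes than the runner-up.
    if len(ranked) == 1 or ranked[0][1] >= (1 + margin) * ranked[1][1]:
        return ranked[0][0]
    return None  # ambiguous: seek manual assignment
```

A video whose key frames vote 3 to 1 for one category is assigned automatically; a 1 to 1 tie is returned as None for the user to resolve.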

Results

• Selected four popular Sport categories
  – Cricket, Football, Tennis, Table Tennis

• Collected a dataset of about 100 videos of 10 to 15 minutes each.

• The user tags 3 videos per category.

• Rest of the videos used for testing.

• 4 frames considered to represent a shot.

• Roughly 200 key frames per category.

Keyframes (Football)

Keyframes (Cricket)

Category Labeling

Clubbing of key frames from various tagged videos for each category.

Final key frames for tagged multiple Cricket videos

Final key frames for tagged multiple Football videos

Category Labeling

Final key frames for tagged multiple Tennis videos

Final key frames for tagged multiple Table Tennis videos

Frame classification per category

• Variation of K nearest neighbors

• Evaluated using 12 tagged videos, 3 per category.

• Reduction in error percentage for certain categories using 3 NN vs just NN.
  – 64% to 73% for cricket
  – 58% to 66% for football

• Achieved overall accuracy of nearly 96%

Category Determination

GPU Device   No of Videos   # Keyframes   Segmentation (sec)   PHOG Features (sec)   K-Means (sec)
8600         4              756           182.7                139.6                 3.94
8600         12             2432          584.3                468.4                 14.3
280          4              756           24.8                 19.2                  0.59
280          12             2432          76.9                 61.8                  1.97
580          4              756           11.8                 9.1                   0.26
580          12             2432          37.91                30.2                  0.89

Time taken to process the Category Labeling phase on NVIDIA 8600, GTX 280 and GTX 580 cards

[Figure callouts: 80 secs per video; 5 secs per video]

Category Assignment

• Videos of total duration 1375 minutes are processed in less than 10 minutes.

• Time share for K-NN in seconds

GPU Device   No of Videos   Keyframes   K-NN
8600         88             16946       40.33 sec
280          88             16946       5.39 sec
580          88             16946       2.46 sec

Processing time per 10-15 minute video: 5 sec on GTX 580, 80 sec on an 8600

Conclusions

• Complete GPU based implementation.

• Achieved a speedup of up to 170 on single NVIDIA Fermi GPU.

• High Performance for large ‘d’ due to processing of vector in parallel.

• Scalable in problem size n, d, k and number of cores.

• Use of operations like Splitsort, Transpose for coalesced memory access.

• Large datasets clustered using Multi GPU framework.

Conclusions (contd)

• Achieved accuracy up to 96%.

• Involving user for ambiguous videos reduced misclassification rate.

• Exploited the computational power of GPU for vision algorithms.

• Effective training with variations in a single category.

• Could be extended to other class of sport categories as well as other genres of video.

• More sophisticated classification algorithms can help accuracy.

Future Work

• With evolving GPU architecture the approach may be altered to enhance the performance.

• Improve Multi GPU framework by message passing.

• Target applications in computer vision which use extensive amount of clustering.

• Explore for more categories of video and effective training.

Thank You

Questions?
