Accelerating MATLAB Image Processing Toolbox Functions on GPUs
Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor,
Huiyang Zhou
University of Central Florida
Motivation
• With their high memory bandwidth and teraflops computing capability, Graphics Processing Units (GPUs) have become quite attractive for accelerating general-purpose applications
• Developing high-performance GPU programs, however, requires a deep understanding of both the application algorithms and the GPU hardware architecture
• A systematic way of dealing with a generic class of applications is still missing
Our Contributions
• Compare performance-critical hardware features in different GPUs
• Develop high-quality, open-source library code for some representative functions in the MATLAB™ Image Processing Toolbox (IPT)
– https://sites.google.com/site/iptatiproject/ [15]
• Reveal insights on efficiently accelerating a wide range of image processing algorithms
Presentation Outline
• Motivation
• Our Contributions
• Implication of GPU hardware on GPGPU programming
• A GPGPU library for IPT functions
– categorization and optimization strategies
• Case Studies
– 2D convolution
– dither
• Conclusions
Implication of GPU Hardware on GPGPU Programming: Performance-Critical Hardware Features

Performance-critical hardware features and their implications on GPGPU programs, AMD/ATI HD5870 (RV870) vs. NVIDIA GTX280:

• Memory access bandwidth: the HD5870 favors vector-type (float2 or float4) data accesses; the GTX280 favors scalar-type (float) or vector-type (float2) data accesses.
• Register file: the HD5870's large number of registers (1k float4 registers, or 16kB, per core) implies more computational work in each core; the GTX280's relatively small number of registers (2k float registers, or 8kB, per core) implies less computational work in each core.
• Shared memory / local data share: the HD5870's larger shared memory (32kB per SIMD engine) implies large tile sizes, where one tile is the workload of one thread block/work group; the GTX280's smaller shared memory (16kB per SM) implies small tile sizes.
• Ratio of peak computation throughput to peak memory bandwidth: the HD5870's (2.72 TFLOPS)/(154 GB/s) means more computation needs to be performed for each loaded data item; the GTX280's (0.62 TFLOPS)/(141 GB/s) means relatively little computation needs to be performed per loaded data item.
• Stream processor pipeline: the HD5870 uses a 5-way VLIW pipeline, so ILP is needed to keep the ALUs busy; the GTX280 uses a scalar pipeline, so less ILP is needed.

In more detail:
Memory access bandwidth: our experiments show that using float instead of float2/float4 on the HD5870 would reduce the achieved bandwidth by at least 10%, while using float4 instead of float/float2 on the GTX280 would reduce it by at least 16% (a copy-kernel sketch of such an experiment follows).
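As a rough illustration (this is not the authors' benchmark code), such a bandwidth experiment can be structured as a simple streaming copy kernel in which the vector width of the load/store type is the parameter being varied; the kernel name and the grid-stride loop below are assumptions made for the sketch.

```cuda
// Minimal copy-bandwidth sketch: each thread streams float4 elements from
// src to dst using a grid-stride loop. Swapping float4 for float2 or float
// changes the per-access width, which is the knob the experiment varies.
__global__ void copy_float4(const float4* __restrict__ src,
                            float4* __restrict__ dst, size_t n)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride)
        dst[i] = src[i];   // one vectorized load and one vectorized store per iteration
}
```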
Register file capacity: 256KB per SIMD engine × 20 SIMD engines = 5MB in total on the HD5870, versus 64KB per SM × 30 SMs = 1.875MB in total on the GTX280.
Shared memory / local data share capacity: 32KB per SIMD engine × 20 SIMD engines = 640KB in total on the HD5870, versus 16KB per SM × 30 SMs = 480KB in total on the GTX280.
Ratio of peak computation throughput to peak memory bandwidth: (2720 GFLOPS)/(154 GB/s) = 17.7 flop/B = 70.6 flop/word on the HD5870, versus (624 GFLOPS)/(141 GB/s) = 4.4 flop/B = 17.7 flop/word on the GTX280.
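Spelling out the arithmetic behind these figures (assuming 4-byte single-precision words, which is consistent with the flop/word values above):

```latex
\frac{2720\ \mathrm{GFLOPS}}{154\ \mathrm{GB/s}} \approx 17.7\ \mathrm{flop/B},
\qquad 17.7\ \mathrm{flop/B}\times 4\ \mathrm{B/word} \approx 70.6\ \mathrm{flop/word}
\quad\text{(HD5870)}
\\[4pt]
\frac{624\ \mathrm{GFLOPS}}{141\ \mathrm{GB/s}} \approx 4.4\ \mathrm{flop/B},
\qquad 4.4\ \mathrm{flop/B}\times 4\ \mathrm{B/word} \approx 17.7\ \mathrm{flop/word}
\quad\text{(GTX280)}
```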
Stream processor pipeline: 5-way VLIW on the HD5870, versus a scalar pipeline on the GTX280.
Summary of the Library: MATLAB Image Processing Toolbox (IPT) Function Classification

(A) Data independent:
– intlut: convert integer values using a lookup table
– imadjust: adjust image intensity values
– imlincomb: linear combination of images
(B) Data sharing:
– edge: find edges in a grayscale image
– imregionalmax: regional maxima of an image
– ordfilt2: 2-D order-statistic filtering
– conv2: 2-D convolution of an image
– mean2: average of matrix elements
– imdilate/imerode: dilate/erode a grayscale image
(C) Algorithm dependent:
– bwdist: Euclidean distance transform of a binary image
– radon: Radon transform
(D) Data dependent:
– dither: represent grayscale images in binary format
MATLAB IPT Function Classification and Optimization Strategies
(A) Data independent. Characteristics: straightforward one-to-one mapping, abundant parallelism. Strategies: effectively utilize bandwidth by packing multiple pixels; if possible, perform multiple such lightweight tasks together to amortize the CPU-GPU data transfer overhead (see the sketch below).
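As a hedged sketch of the packing strategy (not code from the library), a data-independent, pointwise operation can read and write four 8-bit pixels at a time; the uchar4 layout, the kernel name, and the gain/bias transform below are illustrative assumptions.

```cuda
// Sketch of a data-independent kernel: each thread loads one uchar4
// (four 8-bit pixels packed together), applies the same pointwise
// intensity transform to every lane, and writes the packed result back.
__global__ void adjust_packed(const uchar4* __restrict__ in,
                              uchar4* __restrict__ out,
                              int n4, float gain, float bias)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;                       // n4 = number of uchar4 elements
    uchar4 p = in[i];
    float x = fminf(255.f, fmaxf(0.f, p.x * gain + bias));
    float y = fminf(255.f, fmaxf(0.f, p.y * gain + bias));
    float z = fminf(255.f, fmaxf(0.f, p.z * gain + bias));
    float w = fminf(255.f, fmaxf(0.f, p.w * gain + bias));
    out[i] = make_uchar4((unsigned char)x, (unsigned char)y,
                         (unsigned char)z, (unsigned char)w);
}
```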
(B) Data sharing. Characteristics: still a one-to-one mapping, but the input pixels used to compute adjacent output pixels overlap. Strategies: data reuse, computation reuse.
(C) Algorithm dependent. Characteristics: lack of explicit parallelism. Strategies: rethink the algorithms and explore their inherent parallelism.
(D) Data dependent. Characteristics: lack of explicit parallelism; sequential in nature, with data dependencies and fine-grained communication requirements. Strategies: give it a shot anyway; you might be surprised.
Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Kernel speedups, listed as CUDA on GTX 280 / OpenCL on GTX 280 / OpenCL on HD 5870:

(A) Data independent:
– intlut: 17.7 / 17.5 / 12.7
– imadjust: 21.4 / 15.7 / 11.9
– imlincomb: 944.6 / 593.7 / 1385.4
(B) Data sharing:
– edge: 3385.9 / 1175.2 / 4955.1
– imregionalmax: 2117.8 / 798.4 / 3694.0
– ordfilt2: 1199.6 / 171.6 / 1727.1
– conv2: 345.5 / 156.9 / 649.8
– mean2: 50.5 / 25.2 / 34.7
– imdilate/imerode: 951.5 / 523.3 / 1579.8
(C) Algorithm dependent:
– bwdist: 134.8 / 126.2 / 104.3
– radon: 84.3 / 67.4 / 61.2
(D) Data dependent:
– dither: 10.2 / 6.5 / 7.6
Geometric mean of kernel speedups: 206x (CUDA on GTX 280), 110x (OpenCL on GTX 280), and 218x (OpenCL on HD 5870).
Notably, for imlincomb, edge, imregionalmax, ordfilt2, conv2, and imdilate, the OpenCL kernels on the HD 5870 outperform the CUDA kernels on the GTX 280 (for example, conv2: 649.8x vs. 345.5x).
2D Convolution Overview
[Figure: a 3 x 3 filter placed over a 3 x 3 neighborhood of input pixels (values 1-9); the overlapped elements are multiplied and accumulated to produce the output pixel value 55.]
• Drag the filter over each pixel of the source image, multiplying and accumulating the overlapped input elements to generate one output pixel.
[Figure: the filter window sliding over the input image, centered on the pixel being computed.]
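For reference, a minimal, unoptimized CUDA sketch of this operation is shown below (one thread per output pixel, with clamped borders); it is an assumed baseline rather than the library's optimized kernel, and the reuse optimizations on the following slides improve on it.

```cuda
// Naive 2D convolution sketch: one thread produces one output pixel by
// multiply-accumulating the K x K filter over the neighborhood of (x, y).
// Borders are clamped here; the library code may treat borders differently.
__global__ void conv2_naive(const float* __restrict__ img,
                            float* __restrict__ out,
                            const float* __restrict__ filt,
                            int width, int height, int K)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int r = K / 2;
    float acc = 0.f;
    for (int fy = -r; fy <= r; ++fy)
        for (int fx = -r; fx <= r; ++fx) {
            int ix = min(max(x + fx, 0), width  - 1);   // clamp at image borders
            int iy = min(max(y + fy, 0), height - 1);
            acc += img[iy * width + ix] * filt[(fy + r) * K + (fx + r)];
        }
    out[y * width + x] = acc;
}
```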
2D Convolution: Intra-Thread Data Reuse
• Each thread computes multiple pixels along a column
• Intra-thread reuse:
– for a 7x7 filter, each input pixel is reused up to 7 times
[Figure: thread i sweeps down a column of the input image, so successive output pixels computed by the same thread share most of their input rows.]
2D Convolution: Inter-Thread Data Reuse
• Threads in the same warp/wavefront access the same row
• Inter-thread reuse:
– the row is fetched into the texture cache/shared memory and reused by different threads on subsequent accesses
[Figure: threads 0-3 of a warp/wavefront reading a row of the input image that has been staged in the texture cache/shared memory and is reused across threads.]
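The sketch below illustrates the inter-thread (shared-memory) reuse just described: a thread block stages an input tile plus its halo once, and every thread in the block reuses those values. It assumes a 16x16 thread block and a filter radius of at most 3 (i.e., up to a 7x7 filter), and it omits the intra-thread column reuse (multiple outputs per thread) that the library kernels also apply; it is an illustration, not the library code.

```cuda
#define TILE  16   // assumed thread-block dimensions (TILE x TILE)
#define MAX_R  3   // assumed maximum filter radius (7x7 filter -> r = 3)

// Tiled 2D convolution sketch: the block stages a (TILE + 2r)^2 input tile in
// shared memory once; all threads of the block then reuse the staged values.
__global__ void conv2_tiled(const float* __restrict__ img,
                            float* __restrict__ out,
                            const float* __restrict__ filt,
                            int width, int height, int K)
{
    __shared__ float tile[TILE + 2 * MAX_R][TILE + 2 * MAX_R];
    int r  = K / 2;                        // requires r <= MAX_R
    int x0 = blockIdx.x * TILE;            // tile origin in the image
    int y0 = blockIdx.y * TILE;

    // Cooperative load of the tile plus its halo (clamped at the borders).
    for (int ty = threadIdx.y; ty < TILE + 2 * r; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE + 2 * r; tx += blockDim.x) {
            int ix = min(max(x0 + tx - r, 0), width  - 1);
            int iy = min(max(y0 + ty - r, 0), height - 1);
            tile[ty][tx] = img[iy * width + ix];
        }
    __syncthreads();

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.f;
    for (int fy = 0; fy < K; ++fy)
        for (int fx = 0; fx < K; ++fx)
            acc += tile[threadIdx.y + fy][threadIdx.x + fx] * filt[fy * K + fx];
    out[y * width + x] = acc;
}
```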
2D Convolution Performance
A 4096 x 4096 image with a 7 x 7 filter
• Jacket's:
– around 20 GFLOPS on GTX 280
– Jacket 1.2.2 trial version (released on 1/4/2010) from AccelerEyes®
• Ours:
– around 350 GFLOPS on GTX 280
– around 733 GFLOPS on HD 5870
Data Dependent Case Study: Dither
Dither
[Figure: error-diffusion example. Each grayscale input pixel is quantized to a 0/1 output pixel: the input value 230 is compared against the threshold 128; since it is not below 128, the output pixel becomes 1, and the error, 230 - 128 = 102, is diffused to the neighboring input pixels.]
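For context, error diffusion is inherently sequential; a minimal CPU reference sketch is shown below. The Floyd-Steinberg diffusion weights and the error convention used here are standard textbook choices, assumed for illustration, and may differ in detail from MATLAB's dither.

```cpp
#include <vector>

// Sequential error-diffusion (Floyd-Steinberg) reference sketch. The data
// dependency is what makes GPU parallelization hard: pixel (i, j) cannot be
// quantized until its left and upper neighbors have diffused errors into it.
void dither_reference(const unsigned char* in, unsigned char* out, int w, int h)
{
    std::vector<float> img(in, in + w * h);             // working copy holding diffused errors
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j) {
            float old = img[i * w + j];
            unsigned char bit = old >= 128.f ? 1 : 0;   // threshold at 128
            out[i * w + j] = bit;
            float err = old - (bit ? 255.f : 0.f);      // quantization error
            // Diffuse the error to not-yet-processed neighbors.
            if (j + 1 < w)              img[i * w + j + 1]       += err * 7.f / 16.f;
            if (i + 1 < h && j > 0)     img[(i + 1) * w + j - 1] += err * 3.f / 16.f;
            if (i + 1 < h)              img[(i + 1) * w + j]     += err * 5.f / 16.f;
            if (i + 1 < h && j + 1 < w) img[(i + 1) * w + j + 1] += err * 1.f / 16.f;
        }
}
```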
Dither – Data Dependency
[Figure: dependency pattern of error diffusion; the pixel at (i, j) cannot be processed until its already-visited neighbors, to its left and in the row above, have diffused their errors into it.]
Dither – Parallel Processing Schedule
[Figure: parallel processing schedule for error diffusion, from P. Metaxas [8]. Pixels are processed in a staggered wavefront: within a row, consecutive pixels are one time step apart, and each row starts two time steps after the row above it, so all pixels scheduled for the same step can be processed in parallel.]
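One way to express this schedule in code (an assumption consistent with the figure, not the authors' implementation) is a loop over time steps in which every pixel (r, c) with 2r + c equal to the current step is processed; all such pixels are mutually independent and could be handled in parallel. process_pixel below is a hypothetical per-pixel dither update.

```cpp
void process_pixel(int r, int c);   // hypothetical: one pixel's thresholding + error diffusion

// Wavefront schedule sketch: pixel (r, c) is processed at step t = 2*r + c,
// which is strictly later than the steps of all four pixels it depends on.
void wavefront_schedule(int h, int w)
{
    for (int t = 0; t <= 2 * (h - 1) + (w - 1); ++t) {
        // On the GPU, this inner set of pixels would form one parallel phase
        // (e.g., one kernel launch or one barrier-separated step).
        for (int r = 0; r < h; ++r) {
            int c = t - 2 * r;
            if (c >= 0 && c < w)
                process_pixel(r, c);
        }
    }
}
```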
Dither – Our GPU Implementation
[Figure: the wavefront schedule mapped onto thread blocks; at any given time step, only a small set of thread blocks has pixels that are ready to process.]
A relatively small number of thread blocks/threads is active at any given time:
• low resource utilization
• synchronization overhead (among thread blocks/threads)
We still get up to a 10.3x kernel speedup and a 3.5x overall speedup!
Conclusions
• We identify performance-critical hardware features for GPGPU programs
• We present our experience and optimization strategies in developing high-performance GPU code for functions from the MATLAB Image Processing Toolbox
Our Open-source Library Project Website
https://sites.google.com/site/iptatiproject/ [15]
You are more than welcome to contribute!
Thank you. Questions?