NVIDIA's new CUDA Libraries Performance Report is now available. The report gives an overview of the performance improvements offered by the current CUDA toolkit, for example in the areas of FFT, BLAS, sparse matrix multiplication, and random number generation.
CUDA Toolkit 3.2
Math Library Performance
January, 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7
Higher performance and accuracy for 1D transform sizes that contain large prime factors
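A minimal host-side sketch of driving cuFFT for the 1D complex-to-complex transforms benchmarked in this report (the transform size and batch count are illustrative, not taken from the report; error checking omitted):

```c
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int NX = 1 << 20;          /* illustrative 2^20-point transform */
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX);

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);          /* 1 transform (batch = 1) */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD); /* in-place forward FFT */

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```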
cuFFT 3.2 up to 8.8x Faster than MKL
[Charts: GFLOPS vs. N for cuFFT, MKL, and FFTW, N = 6–23; left panel: Single-Precision 1D Radix-2, right panel: Double-Precision 1D Radix-2; x-axis: N (transform size = 2^N)]
* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs
Performance may vary based on OS version and motherboard configuration
cuFFT 1D Radix-3 up to 18x Faster than MKL
[Charts: GFLOPS vs. N for cuFFT (ECC off), cuFFT (ECC on), and MKL, N = 2–16; left panel: Single-Precision, right panel: Double-Precision; x-axis: N (transform size = 3^N)]
18x for single precision, 15x for double precision
Similar acceleration for radix-5 and radix-7
* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
Performance may vary based on OS version and motherboard configuration
cuFFT 3.2 up to 100x Faster than cuFFT 3.1
[Chart: speedup of cuFFT 3.2 over cuFFT 3.1 (log scale, up to 100x) for C2C single-precision 1D transforms, sizes 0–16384, on Tesla C2050 (Fermi)]
Performance may vary based on OS version and motherboard configuration
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation
Supports all 152 routines for single, double, complex, and double-complex precision
New in CUDA 3.2:
7x faster GEMM (matrix multiply) on Fermi GPUs
Higher performance on SGEMM & DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
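A DGEMM call through the CUDA 3.2-era cuBLAS API might look like the following sketch (the matrix size matches the 4K-by-4K case charted in this report; data initialization and error checking are omitted):

```c
#include <cublas.h>

/* C = alpha * A * B + beta * C on the GPU (legacy cuBLAS API) */
int main(void)
{
    const int n = 4096;              /* 4K x 4K, as in the GEMM chart */
    const double alpha = 1.0, beta = 0.0;
    double *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);
    /* ... upload A and B with cublasSetMatrix() ... */

    /* 'N','N' is the NN transpose case; TT, TN, NT are the others */
    cublasDgemm('N', 'N', n, n, n, alpha, dA, n, dB, n, beta, dC, n);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```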
Up to 8x Speedup for all GEMM Types
[Bar chart: GEMM performance on 4K-by-4K matrices, in GFLOPS — SGEMM: 636 vs. 78, CGEMM: 775 vs. 80, DGEMM: 301 vs. 39, ZGEMM: 295 vs. 40 (cuBLAS 3.2 vs. MKL)]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
Performance may vary based on OS version and motherboard configuration
cuBLAS 3.2: Consistently 7x Faster DGEMM Performance
[Chart: DGEMM GFLOPS vs. dimension (m = n = k) for cuBLAS 3.2, cuBLAS 3.1, and 4-thread MKL; cuBLAS 3.2 is consistent to within ~8% across sizes and 7x faster than MKL, while cuBLAS 3.1 shows large performance variance]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
Performance may vary based on OS version and motherboard configuration
cuSPARSE
New library for sparse linear algebra
Conversion routines for dense, COO, CSR and CSC formats
Optimized sparse matrix-vector multiplication
[Figure: sparse matrix-vector multiplication, y = alpha * A * x + beta * y, illustrated with a 4x4 sparse matrix of seven nonzero values and dense vectors x and y]
cuSPARSE up to 32x Faster than MKL
[Chart: speedup of cuSPARSE over MKL (log scale, 1x–64x) for 1 sparse matrix x 6 dense vectors, in single, double, complex, and double-complex precision]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz
Performance may vary based on OS version and motherboard configuration
Performance of 1 Sparse Matrix x 6 Dense Vectors
[Chart: GFLOPS for 1 sparse matrix x 6 dense vectors in single, double, complex, and double-complex precision; test cases roughly in order of increasing sparseness]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration
cuRAND
Library for generating random numbers
Features supported in CUDA 3.2
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision, uniform and normal distributions
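The bulk host API listed above can be sketched as follows, using the XORWOW generator (the sample count and seed are illustrative; error checking omitted):

```c
#include <curand.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 24;        /* illustrative number of samples */
    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); /* illustrative seed */
    curandGenerateUniform(gen, dev, n);  /* n uniform floats, in bulk */

    curandDestroyGenerator(gen);
    cudaFree(dev);
    return 0;
}
```

For use inside kernels, cuRAND also ships the inline device API mentioned above, which avoids a round trip through global memory.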
cuRAND performance
[Charts: GigaSamples/second for uniform int, uniform float, uniform double, normal float, and normal double distributions; left panel: Sobol’ Quasi-RNG (1 dimension), right panel: XORWOW Pseudo-RNG]
* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration