NVIDIA's new CUDA Libraries Performance Report is now available. The report gives an overview of the performance improvements offered by the current CUDA toolkit, for example in the areas of FFT, BLAS, sparse matrix multiplication, and random number generation.
CUDA Toolkit 3.2
Math Library Performance
January, 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7
Higher performance and accuracy for 1D transform sizes that contain large prime factors
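A minimal host-side sketch of driving cuFFT for the 1D complex-to-complex transforms benchmarked in this report (the transform size and batch count are illustrative, not taken from the report; error checking omitted):

```c
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int NX = 1 << 20;          /* illustrative 2^20-point transform */
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX);

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);          /* 1 transform (batch = 1) */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD); /* in-place forward FFT */

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```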
cuFFT 3.2 up to 8.8x Faster than MKL
[Charts: GFLOPS vs. N for cuFFT, MKL, and FFTW, N = 6–23; left panel: Single-Precision 1D Radix-2, right panel: Double-Precision 1D Radix-2; x-axis: N (transform size = 2^N)]
* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs
Performance may vary based on OS version and motherboard configuration
cuFFT 1D Radix-3 up to 18x Faster than MKL
[Charts: GFLOPS vs. N for cuFFT (ECC off), cuFFT (ECC on), and MKL, N = 2–16; left panel: Single-Precision, right panel: Double-Precision; x-axis: N (transform size = 3^N)]
18x for single precision, 15x for double precision
Similar acceleration for radix-5 and radix-7
* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
Performance may vary based on OS version and motherboard configuration
cuFFT 3.2 up to 100x Faster than cuFFT 3.1
[Chart: speedup of cuFFT 3.2 over cuFFT 3.1 (log scale, up to 100x) for C2C single-precision 1D transforms, sizes 0–16384, on Tesla C2050 (Fermi)]
Performance may vary based on OS version and motherboard configuration
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation
Supports all 152 routines for single, double, complex, and double-complex precision
New in CUDA 3.2:
7x faster GEMM (matrix multiply) on Fermi GPUs
Higher performance on SGEMM & DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
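A DGEMM call through the CUDA 3.2-era cuBLAS API might look like the following sketch (the matrix size matches the 4K-by-4K case charted in this report; data initialization and error checking are omitted):

```c
#include <cublas.h>

/* C = alpha * A * B + beta * C on the GPU (legacy cuBLAS API) */
int main(void)
{
    const int n = 4096;              /* 4K x 4K, as in the GEMM chart */
    const double alpha = 1.0, beta = 0.0;
    double *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);
    /* ... upload A and B with cublasSetMatrix() ... */

    /* 'N','N' is the NN transpose case; TT, TN, NT are the others */
    cublasDgemm('N', 'N', n, n, n, alpha, dA, n, dB, n, beta, dC, n);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```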
Up to 8x Speedup for all GEMM Types
[Bar chart: GEMM performance on 4K-by-4K matrices, in GFLOPS — SGEMM: 636 vs. 78, CGEMM: 775 vs. 80, DGEMM: 301 vs. 39, ZGEMM: 295 vs. 40 (cuBLAS 3.2 vs. MKL)]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
Performance may vary based on OS version and motherboard configuration
cuBLAS 3.2: Consistently 7x Faster DGEMM Performance
[Chart: DGEMM GFLOPS vs. dimension (m = n = k) for cuBLAS 3.2, cuBLAS 3.1, and 4-thread MKL; cuBLAS 3.2 is consistent to within ~8% across sizes and 7x faster than MKL, while cuBLAS 3.1 shows large performance variance]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
Performance may vary based on OS version and motherboard configuration
cuSPARSE
New library for sparse linear algebra
Conversion routines for dense, COO, CSR and CSC formats
Optimized sparse matrix-vector multiplication
[Figure: sparse matrix-vector multiplication, y = alpha * A * x + beta * y, illustrated with a 4x4 sparse matrix of seven nonzero values and dense vectors x and y]
cuSPARSE up to 32x Faster than MKL
[Chart: speedup of cuSPARSE over MKL (log scale, 1x–64x) for 1 sparse matrix x 6 dense vectors, in single, double, complex, and double-complex precision]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz
Performance may vary based on OS version and motherboard configuration
Performance of 1 Sparse Matrix x 6 Dense Vectors
[Chart: GFLOPS for 1 sparse matrix x 6 dense vectors in single, double, complex, and double-complex precision; test cases roughly in order of increasing sparseness]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration
cuRAND
Library for generating random numbers
Features supported in CUDA 3.2
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision, uniform and normal distributions
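The bulk host API listed above can be sketched as follows, using the XORWOW generator (the sample count and seed are illustrative; error checking omitted):

```c
#include <curand.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 24;        /* illustrative number of samples */
    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); /* illustrative seed */
    curandGenerateUniform(gen, dev, n);  /* n uniform floats, in bulk */

    curandDestroyGenerator(gen);
    cudaFree(dev);
    return 0;
}
```

For use inside kernels, cuRAND also ships the inline device API mentioned above, which avoids a round trip through global memory.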
cuRAND performance
[Charts: GigaSamples/second for uniform int, uniform float, uniform double, normal float, and normal double distributions; left panel: Sobol’ Quasi-RNG (1 dimension), right panel: XORWOW Pseudo-RNG]
* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration