NVIDIA® CUDA™ 5 sample evaluation result 2

NVIDIA® CUDA™ 5.0 Sample evaluation result

PART Ⅱ

GPU: GTX 560 Ti

CPU: i5-3450S (TDP65W)

RAM: 16GB

OS: Windows 7 x64 Ultimate

Yukio Saitoh | FXFROG.com

24/Apr/2013

INDEX

Sample binary:

19. concurrentKernels

20. conjugateGradient

21. concurrentKernels

22. conjugateGradient

23. conjugateGradientPrecond

24. convolutionFFT2D

25. convolutionSeparable

26. convolutionTexture

27. cppIntegration

28. cudaDecodeD3D9 (runaway)

29. cudaDecodeGL

30. cudaEncode (runaway)

31. dct8x8

32. deviceQuery

33. deviceQueryDrv

34. dwtHaar1D

35. dxtc

Sample target path and files

• C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release

concurrentKernels.exe

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\concurrentKernels.exe] - Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

> Detected Compute SM 2.1 hardware with 8 multi-processors

Expected time for serial execution of 8 kernels = 0.080s

Expected time for concurrent execution of 8 kernels = 0.010s

Measured time for sample = 0.010s

Test passed
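The expected times in this log follow from simple arithmetic, sketched below in Python. The per-kernel duration is not printed by the sample; it is inferred here as 0.080 s / 8, assuming all eight kernels run for the same time. Variable names are illustrative.

```python
# Figures copied from the concurrentKernels log.
n_kernels = 8
serial_expected_s = 0.080
concurrent_expected_s = 0.010
measured_s = 0.010

# Assumption: every kernel runs for the same time, so one kernel's duration
# is the serial total divided by the kernel count.
per_kernel_s = serial_expected_s / n_kernels

# With full overlap, the concurrent run takes about as long as a single kernel.
speedup = serial_expected_s / measured_s

print(f"per-kernel time: {per_kernel_s:.3f} s")
print(f"speedup over serial execution: {speedup:.1f}x")
```

The measured time matching the concurrent estimate suggests that all eight kernels overlapped on this device.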

conjugateGradient.exe


> GPU device has 8 Multi-Processors, SM 2.1 compute capabilities

iteration = 1, residual = 4.451374e+001

iteration = 2, residual = 3.248658e+000

iteration = 3, residual = 2.695777e-001






Test Summary: Error amount = 0.000000
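For readers unfamiliar with the method behind these residual lines, here is a minimal conjugate gradient sketch in plain Python. It is not the sample's code: the sample runs on the GPU, whereas this version uses a small hypothetical tridiagonal system (2 on the diagonal, -1 beside it) and a tolerance chosen for illustration.

```python
def conjugate_gradient(matvec, b, tol=1e-5, max_iter=100):
    """Solve A x = b for symmetric positive definite A, given matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the start guess x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    iterations = 0
    while rs_old ** 0.5 > tol and iterations < max_iter:
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
        print(f"iteration = {iterations}, residual = {rs_new ** 0.5:e}")
    return x, iterations


def tridiagonal_matvec(v):
    """Multiply by the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


b = [1.0] * 8                        # hypothetical right-hand side
x, iterations = conjugate_gradient(tridiagonal_matvec, b)

# Largest component of A x - b, comparable in spirit to the log's error amount.
error = max(abs(ai - bi) for ai, bi in zip(tridiagonal_matvec(x), b))
print(f"error amount = {error:.6f}")
```

The residual norm is not guaranteed to shrink at every step, so a temporary increase in the printed value is normal for this method.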

conjugateGradientPrecond.exe

conjugateGradientPrecond starting...


GPU selected Device ID = 0

> GPU device has 8 Multi-Processors, SM 2.1 compute capabilities

laplace dimension = 128

Convergence of conjugate gradient without preconditioning:


Convergence Test: OK

Convergence of conjugate gradient using incomplete LU preconditioning:


Convergence Test: OK

Test Summary:

Counted total of 0 errors

qaerr1 = 0.000004 qaerr2 = 0.000003

convolutionFFT2D.exe 1/2

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\convolutionFFT2D.exe] - Starting...


Testing built-in R2C / C2R FFT-based convolution

...allocating memory

...generating random input data

...creating R2C & C2R FFT plans for 2048 x 2048

...uploading to GPU and padding convolution kernel and input data

...transforming convolution kernel

...running GPU FFT convolution: 1267.922657 MPix/s (3.154767 ms)

...reading back GPU convolution results

...running reference CPU convolution

...comparing the results: rel L2 = 7.179421E-008 (max delta = 4.808732E-007)

L2norm Error OK

...shutting down

Testing custom R2C / C2R FFT-based convolution



...creating C2C FFT plan for 2048 x 1024




...reading back GPU FFT results



L2norm Error OK

...shutting down
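The throughput and time reported for the built-in R2C / C2R test can be multiplied to recover the pixel count the sample used for its rate. The sketch below does that. Reading the result as 2000 x 2000 unpadded input is an assumption; the log only states the padded 2048 x 2048 FFT plan size.

```python
# Figures copied from the built-in R2C / C2R test in the log.
logged_mpix_per_s = 1267.922657
logged_time_ms = 3.154767

# pixels = (pixels per second) * seconds
implied_pixels = logged_mpix_per_s * 1e6 * logged_time_ms * 1e-3

# Size of the padded FFT plan stated in the log.
padded_pixels = 2048 * 2048

print(f"implied pixel count: {implied_pixels:,.0f}")
print(f"padded plan size:    {padded_pixels:,}")
```

The implied count comes out close to 4,000,000, below the padded plan size, which would fit a rate computed on the data before padding.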

convolutionFFT2D.exe 2/2

Testing updated custom R2C / C2R FFT-based convolution



...creating C2C FFT plan for 2048 x 1024




...reading back GPU FFT results



L2norm Error OK

...shutting down

Test Summary: 0 errors

Test passed

convolutionSeparable.exe

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\convolutionSeparable.exe] - Starting...


Image Width x Height = 3072 x 3072

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Running GPU convolution (16 identical iterations)...

convolutionSeparable, Throughput = 3179.0263 MPixels/sec, Time = 0.00297 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

Reading back GPU results...

Checking the results...

...running convolutionRowCPU()

...running convolutionColumnCPU()

...comparing the results

...Relative L2 norm: 0.000000E+000

Shutting down...

Test passed
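As a quick consistency check, the throughput line can be recomputed from the image size and the time in the same log. This is only a sketch; the small gap to the logged value is assumed to come from the time being printed with five decimal places.

```python
# Figures copied from the convolutionSeparable log.
width, height = 3072, 3072
logged_pixels = 9437184
logged_time_s = 0.00297
logged_throughput_mpix = 3179.0263

pixels = width * height
throughput_mpix = pixels / logged_time_s / 1e6

# Relative difference between the recomputed and the logged throughput.
relative_gap = abs(throughput_mpix - logged_throughput_mpix) / logged_throughput_mpix

print(f"pixels: {pixels}")
print(f"recomputed throughput: {throughput_mpix:.1f} MPixels/sec")
print(f"relative gap to log: {relative_gap:.2%}")
```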

convolutionTexture.exe

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\convolutionTexture.exe] - Starting...


Initializing data...

Running GPU rows convolution (10 identical iterations)...

Average convolutionRowsGPU() time: 1.427774 msecs; //3304.859282 Mpix/s

Copying convolutionRowGPU() output back to the texture...

cudaMemcpyToArray() time: 0.481161 msecs; //9806.674660 Mpix/s

Running GPU columns convolution (10 iterations)

Average convolutionColumnsGPU() time: 1.429637 msecs; //3300.552071 Mpix/s

Reading back GPU results...

Checking the results...

...running convolutionRowsCPU()

...running convolutionColumnsCPU()

Relative L2 norm: 0.000000E+000

Shutting down...

Test passed

cppIntegration.exe


Hello World.

Hello World.

cudaDecodeD3D9.exe (runaway)

Command Line Arguments:

argv[0] = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\cudaDecodeD3D9.exe

cudaDecodeGL.exe 1/2

[CUDA/OpenGL Video Decode]


argv[0] = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\cudaDecodeGL.exe

[cudaDecodeGL]: input file: <../../../3_Imaging/cudaDecodeGL/data/plush1_720p_10s.m2v>

VideoCodec : MPEG-2

Frame rate : 30000/1001fps ~ 29.97fps

Sequence format : Progressive

Coded frame size: [1280, 720]

Display area : [0, 0, 1280, 720]

Chroma format : 4:2:0

Bitrate : 14116kBit/s

Aspect ratio : 16:9

argv[0] = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\cudaDecodeGL.exe

> Device 0: <GeForce GTX 560 Ti >, Compute SM 2.1 detected

-> GPU 0: < GeForce GTX 560 Ti > driver mode is: WDDM

>> initGL() creating window [1280 x 720]

> Using CUDA/GL Device [0]: GeForce GTX 560 Ti

> Using GPU Device: GeForce GTX 560 Ti has SM 2.1 compute capability

Total amount of global memory: 1024.0000 MB

>> modInitCTX<NV12ToARGB_drvapi_x64.ptx > initialized OK

>> modGetCudaFunction< CUDA file: NV12ToARGB_drvapi_x64.ptx >

CUDA Kernel Function (0x0a4c6660) = < NV12ToARGB_drvapi >

>> modGetCudaFunction< CUDA file: NV12ToARGB_drvapi_x64.ptx >

CUDA Kernel Function (0x0a4c6210) = < Passthru_drvapi >

> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

cudaDecodeGL.exe 2/2

setTextureFilterMode(GL_NEAREST,GL_NEAREST)

ImageGL::CUcontext = 02047fd0

ImageGL::CUdevice = 00000000

reshape() glViewport(0, 0, 1280, 720)

[cudaDecodeGL] - [Frame: 0016, 00.0 fps, frame time: 98854.47 (ms) ]




















[cudaDecodeGL] statistics

Video Length (hh:mm:ss.msec) = 00:00:00.440

Frames Presented (inc repeats) = 326

Average Present Rate (fps) = 739.44

Frames Decoded (hardware) = 327

Average Rate of Decoding (fps) = 741.71

cudaDecodeD3D9.exe 1/2


argv[0] = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\cudaDecodeD3D9.exe

[cudaDecodeD3D9]: input file: <../../../3_Imaging/cudaDecodeD3D9/data/plush1_720p_10s.m2v>

VideoCodec : MPEG-2

Frame rate : 30000/1001fps ~ 29.97fps

Sequence format : Progressive

Coded frame size: [1280, 720]

Display area : [0, 0, 1280, 720]

Chroma format : 4:2:0

Bitrate : 14116kBit/s

Aspect ratio : 16:9

> Using GPU Device 0: GeForce GTX 560 Ti has SM 2.1 compute capability

Total amount of global memory: 1024.0000 MB

>> modInitCTX<NV12ToARGB_drvapi_x64.ptx> initialized SUCCESS!

>> modGetCudaFunction<NV12ToARGB_drvapi_x64.ptx>

CUDA Kernel Function = <NV12ToARGB_drvapi, 0x04439d20>

>> modGetCudaFunction<NV12ToARGB_drvapi_x64.ptx>

CUDA Kernel Function = <Passthru_drvapi, 0x044398d0>

> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

cudaDecodeD3D9.exe 2/2

[cudaDecodeD3D9] - [Frame: 0016, 833.6 fps, time: 1.20 (ms) ]




















[cudaDecodeD3D9] statistics

Video Length (hh:mm:ss.msec) = 00:00:00.375

Frames Presented (inc repeats) = 326

Average Present FPS = 868.73

Frames Decoded (hardware) = 327

Average Decoder FPS = 871.40
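The statistics blocks of both decode samples can be cross-checked by dividing the frame counts by the reported video length, and the decode rate can be compared with the 30000/1001 fps source rate from the stream header. The sketch below assumes that "Video Length" is the elapsed time of the run, which the logs do not state explicitly.

```python
# Frame rate of the input stream, from the header printed in the log.
source_fps = 30000 / 1001

# name: (frames presented, frames decoded, video length in s,
#        logged present fps, logged decode fps)
runs = {
    "cudaDecodeGL": (326, 327, 0.440, 739.44, 741.71),
    "cudaDecodeD3D9": (326, 327, 0.375, 868.73, 871.40),
}

results = {}
for name, (presented, decoded, length_s, logged_present, logged_decode) in runs.items():
    present_fps = presented / length_s
    decode_fps = decoded / length_s
    # How many times faster than playback speed the decoder ran.
    realtime_factor = logged_decode / source_fps
    results[name] = (present_fps, decode_fps, realtime_factor)
    print(f"{name}: present {present_fps:.1f} fps (logged {logged_present}), "
          f"decode {decode_fps:.1f} fps (logged {logged_decode}), "
          f"{realtime_factor:.1f}x the source frame rate")
```

The recomputed rates land within about one percent of the logged ones, which is plausible given that the video length is printed only to the millisecond.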

cudaEncode.exe (runaway)

Starting cudaEncode...

[ CUDA H.264 Encoder ]

argv[0] = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\cudaEncode.exe

dct8x8.exe

dct8x8.exe Starting...


CUDA sample DCT/IDCT implementation

===================================

Loading test image: barbara.bmp... [512 x 512]... Success

Running Gold 1 (CPU) version... Success

Running Gold 2 (CPU) version... Success

Running CUDA 1 (GPU) version... Success

Running CUDA 2 (GPU) version... 10459.499992 MPix/s //0.025063 ms

Success

Running CUDA short (GPU) version... Success

Dumping result to barbara_gold1.bmp... Success

Dumping result to barbara_gold2.bmp... Success

Dumping result to barbara_cuda1.bmp... Success

Dumping result to barbara_cuda2.bmp... Success

Dumping result to barbara_cuda_short.bmp... Success

Processing time (CUDA 1) : 0.209782 ms

Processing time (CUDA 2) : 0.025063 ms

Processing time (CUDA short): 0.170617 ms

PSNR Original <---> CPU(Gold 1) : 32.777073

PSNR Original <---> CPU(Gold 2) : 32.777046

PSNR Original <---> GPU(CUDA 1) : 32.777092

PSNR Original <---> GPU(CUDA 2) : 32.777077

PSNR Original <---> GPU(CUDA short): 32.749447

PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 64.019310

PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 71.777740

PSNR CPU(Gold 2) <---> GPU(CUDA short): 42.258053

Test Summary...

Test passed
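The timing and PSNR figures in this log can be related with a few lines of Python. The sketch recomputes the CUDA 2 throughput from the 512 x 512 image size, compares CUDA 1 with CUDA 2, and inverts the usual PSNR definition to see what mean squared error a logged value implies. The peak value of 255 is an assumption for an 8-bit image; the log does not print it.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio of two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")          # identical inputs
    return 10.0 * math.log10(peak * peak / mse)


# Figures copied from the dct8x8 log.
pixels = 512 * 512
cuda1_ms = 0.209782
cuda2_ms = 0.025063

cuda2_mpix_per_s = pixels / (cuda2_ms * 1e-3) / 1e6
cuda2_speedup = cuda1_ms / cuda2_ms

# Mean squared error implied by "PSNR Original <---> CPU(Gold 1) : 32.777073",
# assuming a peak value of 255.
implied_mse = 255.0 ** 2 / 10 ** (32.777073 / 10)

print(f"CUDA 2 throughput: {cuda2_mpix_per_s:.1f} MPix/s")
print(f"CUDA 2 vs CUDA 1: {cuda2_speedup:.2f}x faster")
print(f"implied MSE: {implied_mse:.2f}")
```

Under that assumption, the lower PSNR of the CUDA short version against Gold 2 (42.258053) simply means a larger pixel error than the other GPU variants, while the test as a whole still passes.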

dct8x8.exe / result

• barbara_cuda_short.bmp

• barbara_cuda1.bmp

• barbara_cuda2.bmp

• barbara_gold1.bmp

• barbara_gold2.bmp

deviceQuery.exe 1/2

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 560 Ti"

CUDA Driver Version / Runtime Version 5.0 / 5.0

CUDA Capability Major/Minor version number: 2.1

Total amount of global memory: 1024 MBytes (1073741824 bytes)

( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores

GPU Clock rate: 1800 MHz (1.80 GHz)

Memory Clock rate: 2050 Mhz

Memory Bus Width: 256-bit

L2 Cache Size: 524288 bytes

Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)

Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

deviceQuery.exe 2/2

Maximum number of threads per multiprocessor: 1536

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Concurrent copy and kernel execution: Yes with 1 copy engine(s)

Run time limit on kernels: Yes

Integrated GPU sharing Host Memory: No

Support host page-locked memory mapping: Yes

Alignment requirement for Surfaces: Yes

Device has ECC support: Disabled

CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)

Device supports Unified Addressing (UVA): Yes

Device PCI Bus ID / PCI location ID: 1 / 0

Compute Mode:

< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 560 Ti
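Several deviceQuery figures are derived from one another, which the following sketch makes explicit. The peak bandwidth estimate assumes two transfers per reported memory clock cycle; that factor is an assumption about how the clock is reported, not something the log states.

```python
# Figures copied from the deviceQuery log.
multiprocessors = 8
cores_per_mp = 48
global_mem_bytes = 1073741824
warp_size = 32
max_threads_per_mp = 1536
memory_clock_hz = 2050e6
bus_width_bits = 256

total_cores = multiprocessors * cores_per_mp           # log: 384 CUDA Cores
global_mem_mib = global_mem_bytes // (1024 * 1024)     # log: 1024 MBytes
max_warps_per_mp = max_threads_per_mp // warp_size     # resident warps per multiprocessor

# Assumption: double data rate, i.e. two transfers per reported clock cycle.
peak_bandwidth_gb_s = memory_clock_hz * 2 * (bus_width_bits / 8) / 1e9

print(f"CUDA cores: {total_cores}")
print(f"global memory: {global_mem_mib} MiB")
print(f"warps per multiprocessor: {max_warps_per_mp}")
print(f"estimated peak memory bandwidth: {peak_bandwidth_gb_s:.1f} GB/s")
```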

deviceQueryDrv.exe 1/2

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\deviceQueryDrv.exe Starting...

CUDA Device Query (Driver API) statically linked version

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 560 Ti"

CUDA Driver Version: 5.0

CUDA Capability Major/Minor version number: 2.1

Total amount of global memory: 1024 MBytes (1073741824 bytes)

( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores

GPU Clock rate: 1800 MHz (1.80 GHz)

Memory Clock rate: 2050 Mhz

Memory Bus Width: 256-bit

L2 Cache Size: 524288 bytes

Max Texture Dimension Sizes 1D=(65536) 2D=(65536,65535) 3D=(2048,2048,2048)

Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

deviceQueryDrv.exe 2/2

Maximum number of threads per multiprocessor: 1536

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Texture alignment: 512 bytes

Maximum memory pitch: 2147483647 bytes

Concurrent copy and kernel execution: Yes with 1 copy engine(s)

Run time limit on kernels: Yes

Integrated GPU sharing Host Memory: No

Support host page-locked memory mapping: Yes

Concurrent kernel execution: Yes

Alignment requirement for Surfaces: Yes

Device has ECC support: Disabled

CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)

Device supports Unified Addressing (UVA): Yes

Device PCI Bus ID / PCI location ID: 1 / 0

Compute Mode:

< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

dwtHaar1D.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\dwtHaar1D.exe Starting...


source file = "../../../3_Imaging/dwtHaar1D/data/signal.dat"

reference file = "result.dat"

gold file = "../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"

Reading signal from "../../../3_Imaging/dwtHaar1D/data/signal.dat"

Writing result to "result.dat"

Reading reference result from "../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"

Test success!

Signal.dat

9.5012929e-001

2.3113851e-001

6.0684258e-001

4.8598247e-001

8.9129897e-001

...

Regression.gold.dat

Result.dat
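To give a feel for what a Haar wavelet step does to such a signal, here is a minimal one-level sketch in Python applied to the first four Signal.dat values listed above. It is not the sample's kernel: the 1/sqrt(2) normalization and the function names are assumptions chosen for illustration, and the sample's own output layout may differ.

```python
import math


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences (detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Undo haar_step and return the original samples."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * s, (a - d) * s])
    return signal


# First four values from Signal.dat (the input length must be even).
samples = [9.5012929e-001, 2.3113851e-001, 6.0684258e-001, 4.8598247e-001]

approx, detail = haar_step(samples)
restored = inverse_haar_step(approx, detail)

print("approximation:", approx)
print("detail:       ", detail)
print("restored:     ", restored)
```

With this normalization the step is invertible, so the restored values agree with the input up to floating-point rounding.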

dxtc.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\dxtc.exe Starting...


Image Loaded '../../../3_Imaging/dxtc/data/lena_std.ppm', 512 x 512 pixels

Running DXT Compression on 512 x 512 image...

16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...

dxtc, Throughput = 17.7004 MPixels/s, Time = 0.01481 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64
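The launch figures in this log follow from the image size, since DXT compression works on 4 x 4 pixel blocks. The sketch below reproduces the block count, the total thread count, and the throughput from the values printed above; the variable names are illustrative.

```python
# Figures copied from the dxtc log.
width, height = 512, 512
threads_per_block = 64
logged_time_s = 0.01481
logged_throughput_mpix = 17.7004

block_edge = 4                                   # DXT formats compress 4 x 4 pixel blocks
blocks_x = width // block_edge
blocks_y = height // block_edge
blocks = blocks_x * blocks_y                     # log: 16384 Blocks
threads = blocks * threads_per_block             # log: 1048576 Threads in Grid

throughput_mpix = width * height / logged_time_s / 1e6

print(f"block grid: {blocks_x} x {blocks_y} = {blocks} blocks")
print(f"threads in grid: {threads}")
print(f"recomputed throughput: {throughput_mpix:.4f} MPixels/s")
```

A 128 x 128 block grid is also consistent with the accuracy check that follows, where the reported deviation coordinates stay below 128 in both directions.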

dxtc.exe 1/4

Checking accuracy...

Deviation at ( 9, 1): 0.791667 rms







Deviation at ( 100, 8): 2.416667 rms




Deviation at ( 29, 10): 0.020833 rms

Deviation at ( 79, 10): 1.833333 rms

Deviation at ( 13, 11): 1.041667 rms


Deviation at ( 28, 13): 0.562500 rms

Deviation at ( 90, 13): 0.708333 rms

Deviation at ( 25, 14): 0.520833 rms

Deviation at ( 69, 14): 0.770833 rms

Deviation at ( 87, 16): 0.708333 rms

Deviation at ( 90, 17): 1.041667 rms

Deviation at ( 24, 19): 0.916667 rms

Deviation at ( 25, 19): 0.625000 rms

Deviation at ( 26, 19): 1.041667 rms

Deviation at ( 55, 20): 4.791667 rms

Deviation at ( 20, 23): 1.541667 rms

Deviation at ( 99, 23): 3.312500 rms

Deviation at ( 45, 24): 18.104166 rms


dxtc.exe 2/4

Deviation at ( 21, 30): 1.562500 rms

Deviation at ( 115, 32): 24.104166 rms


Deviation at ( 102, 33): 2.250000 rms

Deviation at ( 50, 35): 26.958334 rms

Deviation at ( 68, 35): 11.937500 rms

Deviation at ( 115, 36): 0.458333 rms

Deviation at ( 12, 38): 2.166667 rms

Deviation at ( 40, 40): 0.270833 rms

Deviation at ( 86, 43): 0.604167 rms

Deviation at ( 116, 43): 0.125000 rms

Deviation at ( 43, 44): 2.250000 rms

Deviation at ( 54, 44): 4.791667 rms

Deviation at ( 46, 46): 2.875000 rms

Deviation at ( 116, 46): 0.604167 rms


Deviation at ( 117, 48): 0.937500 rms

Deviation at ( 23, 51): 3.520833 rms

Deviation at ( 11, 52): 0.041667 rms

Deviation at ( 67, 54): 5.687500 rms

Deviation at ( 26, 55): 0.854167 rms

Deviation at ( 21, 56): 5.000000 rms

Deviation at ( 24, 56): 0.562500 rms

Deviation at ( 30, 57): 0.937500 rms

Deviation at ( 21, 59): 2.541667 rms

Deviation at ( 120, 59): 0.104167 rms

Deviation at ( 112, 60): 1.125000 rms

Deviation at ( 77, 61): 1.083333 rms

dxtc.exe 3/4

Deviation at ( 114, 62): 4.958333 rms

Deviation at ( 78, 66): 0.541667 rms

Deviation at ( 106, 68): 0.375000 rms

Deviation at ( 16, 70): 3.104167 rms

Deviation at ( 10, 71): 0.937500 rms

Deviation at ( 108, 71): 0.354167 rms


Deviation at ( 118, 72): 5.562500 rms

Deviation at ( 11, 73): 0.541667 rms

Deviation at ( 68, 74): 1.937500 rms

Deviation at ( 70, 76): 1.791667 rms

Deviation at ( 124, 76): 3.354167 rms

Deviation at ( 103, 78): 0.375000 rms

Deviation at ( 127, 78): 0.541667 rms

Deviation at ( 108, 79): 0.083333 rms

Deviation at ( 120, 81): 0.541667 rms

Deviation at ( 43, 82): 24.979166 rms

Deviation at ( 67, 82): 3.125000 rms

Deviation at ( 78, 82): 2.437500 rms

Deviation at ( 123, 84): 0.541667 rms

Deviation at ( 127, 85): 0.187500 rms

Deviation at ( 122, 87): 0.083333 rms

Deviation at ( 124, 87): 0.541667 rms

Deviation at ( 127, 88): 0.229167 rms

Deviation at ( 93, 91): 0.666667 rms

Deviation at ( 115, 93): 0.083333 rms

Deviation at ( 69, 95): 1.875000 rms

Deviation at ( 106, 95): 1.125000 rms

dxtc.exe 4/4

Deviation at ( 107, 95): 3.708333 rms

Deviation at ( 13, 96): 1.354167 rms

Deviation at ( 115, 98): 0.187500 rms

Deviation at ( 118, 98): 0.187500 rms

Deviation at ( 116, 101): 0.187500 rms

Deviation at ( 78, 105): 0.541667 rms

Deviation at ( 67, 107): 0.708333 rms

Deviation at ( 74, 107): 0.375000 rms

Deviation at ( 65, 109): 0.770833 rms

Deviation at ( 89, 109): 0.708333 rms

Deviation at ( 118, 109): 3.854167 rms

Deviation at ( 67, 110): 1.083333 rms

Deviation at ( 88, 111): 0.208333 rms

Deviation at ( 64, 113): 0.708333 rms

Deviation at ( 84, 113): 0.333333 rms

Deviation at ( 88, 113): 0.187500 rms

Deviation at ( 84, 114): 1.666667 rms

Deviation at ( 66, 115): 0.770833 rms

Deviation at ( 19, 118): 5.270833 rms

Deviation at ( 76, 121): 0.104167 rms

Deviation at ( 70, 122): 0.708333 rms

Deviation at ( 91, 122): 0.208333 rms

Deviation at ( 71, 123): 0.854167 rms

Deviation at ( 75, 123): 0.854167 rms

Deviation at ( 61, 124): 0.937500 rms

Deviation at ( 91, 124): 0.270833 rms

RMS(reference, result) = 0.015488

Test passed

Summary

On the GTX 560 Ti, some samples do not run properly.

→ Some samples report: MUST support CUDA compute capability 3.0.

→ Others report: Requires GPU devices with compute SM 3.5 or higher.

This evaluation will be continued; the results are recorded here for future reference.
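The two requirements can be checked against the compute capability 2.1 reported by deviceQuery. A minimal sketch, comparing versions as (major, minor) tuples rather than as floating-point numbers; the dictionary labels are illustrative:

```python
# Compute capability of the GTX 560 Ti, as reported by deviceQuery.
device_cc = (2, 1)

# Minimum compute capabilities named in the summary.
requirements = {
    "compute capability 3.0": (3, 0),
    "compute SM 3.5": (3, 5),
}

# Tuples compare element by element, so (2, 1) >= (3, 0) is False.
supported = {label: device_cc >= needed for label, needed in requirements.items()}

for label, ok in supported.items():
    print(f"{label}: {'supported' if ok else 'not supported'} on SM {device_cc[0]}.{device_cc[1]}")
```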
