Optimizing Texture Transfers - Nvidiadeveloper.download.nvidia.com › ... ›...

Optimizing Texture Transfers

Shalini Venkataraman

Senior Applied Engineer, NVIDIA

shaliniv@nvidia.com

Outline

Definitions

— Upload : Host (CPU) -> Device (GPU)

— Readback: Device (GPU) -> Host (CPU)

Focus on OpenGL graphics

— Implementing various transfer methods

— Multi-threading and Synchronization

— Debugging transfers

— Best Practices & Results

Applications

Streaming videos/time varying geometry or volumes

— Broadcast, real-time fluid simulations etc

Level of detailing

— Out of core image viewers, terrain engines

— Bricks paged in as needed

Parallel rendering

— Fast communication between multiple GPUs for scaling data/render

Remoting Graphics

— Readback GPU results fast and stream over network

100GB/s

5-10GB/s

Graphics Memory

OpenGL Graphics – Streaming Data

Previous approaches

— Synchronous – CPU and GPU idle during transfer

— CPU Asynchronous

GPU and CPU Asynchronous with Copy Engines

— Application layout

— Use cases

— Results

Synchronous Transfers

Straightforward

— Upload texture every frame

— Driver does all copy

Copy, download and draw are

sequential

[nBricks]

Memory

Graphics

Memory

glTexSubImage

Upload Upload Upload

GPU Draw Draw Draw

Frame Draw

Copy Copy Copy

glTexSubImage

Frame Draw

Other work

CPU Asynchronous Transfers

Non CPU-blocking transfer using Pixel Buffer Objects (PBO)

— Ping-pong PBO’s for optimal throughput

— Data must be in GPU native format

OpenGL Controlled

Memory

Datacur: glTexSubImage

[nBricks]

Main Memory

Graphics Memory

Datanext memcpy

Textures

Example – 3D texture +Ping-Pong PBOs

Gluint pbo[2] ; //ping-pong pbo generate and initialize them ahead

unsigned int curPBO = 0;

//bind current pbo for app->pbo transfer

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]); //bind pbo

GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,

GL_MAP_WRITE_BIT|GL_MAP_INVALIDATE_BUFFER_BIT);

memcpy(ptr,pData[curBrick],xdim*ydim*zdim);

glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);

//Copy pixels from pbo to texture object

glBindTexture(GL_TEXTURE_3D,texId);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]); //bind pbo

glTexSubImage3D(GL_TEXTURE_3D,0,0,0,0,xdim,ydim,zdim,GL_LUMINANCE,GL_UNSIGNED_BYTE,0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB,0);

glBindTexture(GL_TEXTURE_3D,0);

curPBO = 1-curPBO;

//Call drawing code here

CPU Async - Execution Timeline

Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1

GPU Drawt0 Drawt2 Drawt1

Frame Draw

Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0

CPU Async

Analysis with GPUView

(http://graphics.stanford.edu/~mdfish

er/GPUView.html)

GLDriver

Results – Synchronous vs CPU Async

Synchronous

16^3 (4KB) 32^3 (32KB) 64^3 (256KB) 128^3 (2MB) 256^3 (16MB)

PBO vs Synchronous uploads - Quadro 6000

PBO (MB/s) TexSubImage (MB/s)

- Transfers only

- Adding rendering will reduce bandwidth, GPU can’t do both

- Ideally – want to sustain bandwidth with render, need GPU overlap

Texture Size

Achieving Overlap - Copy Engines

Fermi+ have copy engines

— GeForce, low-end Quadro- 1 CE

— Quadro 4000+ - 2 CEs

Allows copy-to-host + compute +

copy-to-device to overlap

simultaneously

Graphics/OpenGL

— Using PBO’s in multiple threads

— Handle synchronization

GPU Asynchronous Transfers

Downloads/uploads in separate thread

— Using OpenGL PBOs

ARB_SYNC used for context

synchronization

Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1

GPU Drawt0 Drawt2 Drawt1

Frame Draw

Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0

Using PBO

Using CE

Upload Draw

Main App Thread

Shared textures

Readback

Upload–Render : Application Layout

OpenGL Controlled

Memory

[nBricks]

Main Memory

Graphics Memory srcTex

[numTextures]

Render

Thread

glBindTexture

Upload Thread

Datacur: glTexSubImage

Datanext : memcpy

uploadGLRC

mainGLRC

Multi-threaded Context Creation

Sharing textures between multiple contexts

— Don’t use wglShareLists

— Use WGL/GLX_ARB_CREATE_CONTEXT instead

— Set OpenGL debug on

static const int contextAttribs[] =

WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,

mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);

wglMakeCurrent(winDC, mainGLRC);

glGenTextures(numTextures, srcTex);

//uploadGLRC now shares all its textures with mainGLRC

uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);

//Create Upload thread

//Do above for readback if using

Synchronization using ARB_SYNC

OpenGL commands are asynchronous

— When glDrawXXX returns, does not mean command is completed

Sync object glSync (ARB_SYNC) is used for multi-threaded

apps that need sync

— Eg rendering a texture waits for upload completion

Fence is inserted in a unsignaled state but when completed

changed to signaled.

//Upload //Render

glTexSubImage(texID,..) glWaitSync(fence);

GLSync fence = glFenceSync(..) glBindTexture(.., texID);

unsignaled

signaled

Upload-Render Sychronizaton

Need additional CPU event to coordinate waiting for GPU

WaitForSingleObject(startUploadValid)

glWaitSync(startUpload[2])

glBindTexture(srcTex[2])

glTexSubImage(..)

endUpload[2] = glFenceSync(…)

SetEvent(endUploadValid)

srcTex

Upload

WaitForSingleObject(endUploadValid)

glWaitSync(endUpload[0])

glBindTexture(srcTex[0])

//Draw

startUpload[0] = glFenceSync(…)

SetEvent(startUploadValid);

Render

GLsync startUpload[MAX_BUFFERS], endUpload[MAX_BUFFERS]; //GPU fence sync objects

HANDLE startUploadValid, endUploadValid; //cpu event to coordinate wait for GPU sync

Analysis with GPUView

Upload and Render in

separate threads

— Map to distinct

hardware queues on

— Executed concurrently

— Will serialize on pre-

Fermi hardware

Adding Readback

OpenGL Controlled

Memory

Images

[nFrames]

Framecur: glGetTexImage

Frameprev : memcpy

glFramebufferTexture

(GL_DRAW_FRAMEBUFFER

_TEXTURE,…)

Use glGetTexImage, not glReadPixels between threads

mainGLRC

readbackGLRC

Render Thread Readback Thread

Main Memory

Graphics Memory

resultTex

[numTextures]

Render-Readback Synchronizaton

WaitForSingleObject(endReadbackValid)

glWaitSync(endReadback[2])

glFramebufferTexture(resultTex[2])

//Draw

startReadback[3] = glFenceSync(…)

SetEvent(startReadbackValid)

resultTex

Render

WaitForSingleObject(startReadbackValid)

glWaitSync(startReadback[0])

glGetTexImage(resultTex[0])

//Read pixels to png-pong pbo

endReadback[0] = glFenceSync(…)

SetEvent(endReadbackValid);

Readback

GLsync startReadback[MAX_BUFFERS],endReadback[MAX_BUFFERS]; //GPU fence sync objects

HANDLE startReadbackValid, endReadbackValid; //cpu event to coordinate wait for GPU

GeForce vs Quadro Readbacks

Readbacks on GeForce are 3x slower than Quadro

256K 1MB 8MB 32MB

Texture Size

Render-Download Bandwidth for Quadro vs GeForce

GeForce GTX 570 Quadro 6000

Upload-Render-Readback pipeline

// Wait for signal to start upload

CPUWait(startUploadValid);

glWaitSync(startUpload[2]);

// Bind texture object

BindTexture(capTex[2]);

// Upload

glTexSubImage(texID…);

// Signal upload complete

GLSync endUpload[2]= glFenceSync(…);

CPUSignal(endUploadValid);

// Wait for download to complete

CPUWait(endDownloadValid);

glWaitSync(endDownload[3]);

// Wait for upload to complete

CPUWait(endUploadValid);

glWaitSync(endUpload)[0]);

// Bind render target

glFramebufferTexture(playTex[3]);

// Bind video capture source texture

BindTexture(capTex[0]);

// Draw

// Signal next upload

startUpload[0] = glFenceSync(…);

CPUSignal(startUploadValid);

// Signal next download

startDownload[3] = glFenceSync(…);

CPUSignal(startDownloadValid);

// Playout thread

CPUWait(startDownloadValid);

glWaitSync(startDownload[2]);

// Readback

glGetTexImage(playTex[2]);

// Read pixels to PBO

// Signal download complete

endDownload[2] = glFenceSync(…);

CPUSignal(endDownloadValid);

Capture Thread Render Thread Playout Thread

True, S038 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings

GPUView trace showing 3-way overlap

Copy Engines

are idle

Frame time

Readback

Render

Upload

Readback

Render

Upload

Balanced render, upload

and readback times

Render time larger than

upload and readback

Debugging Transfers

Some OGL calls may not overlap between transfer/render

thread

— Eg non-transfer related OGL calls in transfer thread

— Driver generates debug message

“Pixel transfer is synchronized with 3D rendering”

— Application uses ARB_DEBUG_OUTPUT to check the OGL debug log

— OpenGL 4.0 and above

GL_ARB_debug_output -

http://www.opengl.org/registry/specs/ARB/debug_output.txt

Copy Engine Results – Best Case

256KB 1MB 8MB 32MB

Texture Size

Performance Scaling from CPU Asynchronous Transfers

Upload-Render Scaling Render-Download Scalng

4.2 GB/s 3.2GB/s

1.4 GB/s

900 MB/s

Perfect Scaling

No Scaling

Quadro 6000

Conclusion

Presented different transfer methods

Keep the transfer method simple

— Look at your application transfer needs and render times

— Tradeoff in scaling vs application complexity

Future

— Debugging multi-threaded transfers made much easier with

Nsight Visual studio http://developer.nvidia.com/nvidia-nsight-

visual-studio-edition)

References

Venkataraman, Fermi Asynchronous Texture

Transfers, OpenGL Insights, 2012

— Source code (around SIGGRAPH 2012) –

https://github.com/organizations/OpenGLInsights

Related GTC Talks

— S0328, Thomas True, Best Practices in GPU-based

video processng

— S0049, Alina Alt &Tom True, Using the GPU Direct for

Video API

— S0353, S Venkataraman, Programming multi-gpus for

scalable rendering

Optimizing Texture Transfers - Nvidiadeveloper.download.nvidia.com › ... ›...

Documents

ATI Radeon HyperZ Technology...• 3 texture multi-texture • 3D volume texture, 2D cube texture • Dot Product per pixel • Dependent texture lookup for environment bump mapping

Texture - cs.umd.edu€¦ · • Shape from texture: estimate surface orientation or shape from texture • Synthesis ( Graphics) : generate new texture patch given some examples

WHAT IS TEXTURE?. TEXTURE in Nature TEXTURE IN FOOD

OpenGL Texture-Mapping Made Simplerweb.engr.oregonstate.edu/~mjb/cs553/Handouts/Texture/texture.pdf · mjb – May 6, 2015 1 OpenGL Texture-Mapping Made Simpler Introduction Texture

Texture Fields: Learning Texture Representations in

International Fund Transfers · transfers include many types of international transfers, including cash-to-cash money transfers, international wire transfers, international ACH transactions,

Texture Mapping - Princeton University Computer … · Texture Mapping • Scan conversion ... Point sampling Area filtering Texture Filtering ... Solid textures Texture values indexed

Texture is - Julianna Kunstler. Your Art Resource. · Texture is texture use of texture types: 1. create an illusion 2. of texture with: Title: texture_element Created Date: 10/6/2017

Texture Analysis - Purdue University · 2004-07-05 · Texture Analysis • There are two primary issues in texture analysis: ntexture classification otexture segmentation • Texture

Transfers: Internal Transfers for Recruiters and Managerstraining.wasteconnections.com/workday/Attachments/Manager... · 2017-04-25 · Transfers: Internal Transfers for Recruiters

Texture Mapping Texture Mapping –– Part Part 11--44yu/Teaching/CISC440_S15/handouts/Lecture19.pdf · Texture Mapping Texture Mapping –– Part Part 11--44 Why Texture Map? How

Ebooksclub.org Introduction to Texture Analysis Macro Texture Micro Texture and Orientation Mapping Second Edition

2D Texture Synthesis Instructor: Yizhou Yu. Texture synthesis Goal: increase texture resolution yet keep local texture variation

International Fund Transfers - Amazon S3€¦ · transfers include many types of international transfers, including cash-to-cash money transfers, international wire transfers, international

S0102 GTC2012 Flame Simulation Games

8. Texture Mapping 8.1 2D Texture Mapping · © 2008 Burkhard Wuensche burkhard Slide 1 8. Texture Mapping 8.1 2D texture mapping 8.2 OpenGL texture mapping example

Texture - University Of Marylanddjacobs/CMSC427/Texture.pdf · • Create and specify a texture object – Create a texture object – Specify the texture image – Specify how texture

Texture Mapping + = Object Texture Texture Mapped Object

A2A Transfers - CUAnswers · 2 a2a transfers scheduled transfers using cubase automated funds transfers (afts) 21 what the member sees 24 “right away” transfers via the transfer

Gradient texture unit coding for texture analysis