GDC : Mar16.pdf


© Copyright Khronos Group 2016 - Page 142

Swapchains Unchained! (What you need to know about Vulkan WSI)

Alon Or-bach, Chair, Vulkan Window System Integration Sub-Group – March 2016

Intro to Vulkan Window System Integration

• Explicit control for acquisition and presentation of images
- Designed to fit the Vulkan API and today’s compositing window systems
• Not all extensions are supported by every platform
- You MUST check and enable the extensions your app/engine uses!
• Today’s presentation should help you get presentation working
- Learn how to present through a swapchain
- Overview of Vulkan objects used by the WSI extensions
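The "check and enable" rule can be sketched as a simple set difference. This is a minimal illustration, not the talk's code: `missingExtensions` is a hypothetical helper, and the `available` list is assumed to have been filled from the platform's extension enumeration beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the required extension names that are absent
// from the list the platform reported as available.
std::vector<std::string> missingExtensions(
    const std::vector<std::string>& required,
    const std::vector<std::string>& available) {
    std::vector<std::string> missing;
    for (const std::string& name : required) {
        if (std::find(available.begin(), available.end(), name) == available.end()) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

An application would refuse to continue, or pick another platform path, when the returned list is not empty.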

WSI Jargon Buster

• Platform: our terminology for an OS / window system, e.g. Android, Windows, Wayland, X11 via XCB
• Presentation Engine: the platform’s compositor or display engine
• Application: your app or game engine

How many WSI extensions are there?

• Two cross-platform instance extensions
- VK_KHR_surface
- VK_KHR_display
• Six (platform) instance extensions
- VK_KHR_android_surface
- VK_KHR_mir_surface
- VK_KHR_wayland_surface
- VK_KHR_win32_surface
- VK_KHR_xcb_surface
- VK_KHR_xlib_surface
• Two cross-platform device extensions
- VK_KHR_swapchain
- VK_KHR_display_swapchain

Vulkan Surfaces

• VkSurfaceKHR
- Vulkan’s way to encapsulate a native window / surface
• Platform-independent surface queries
- Find out crucial information about your surface’s properties
- e.g., if presentation is supported by a particular queue on a particular device
- Some platforms provide additional queries
• An implementation may support multiple platforms
- e.g., both xlib and xcb

Figure: Physical Devices A, B and C with their queue families, Platforms X and Y, and a surface from Platform X.

Vulkan Swapchains: VK_KHR_swapchain

• Array of presentable images associated with a surface
- Application requests a minimum number of presentable images
- Implementation creates at least that number
- Implementation may have a limit
• Upfront allocation of presentable images
- No allocation hitching at crucial moment
- Pre-record fixed content command buffers
• Present mode determines behavior
- FIFO support mandatory
- Platforms can offer mailbox, immediate, FIFO relaxed

const VkSwapchainCreateInfoKHR createInfo =
{
    VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,  // sType
    NULL,                                         // pNext
    0,                                            // flags
    mySurface,                                    // surface
    desiredNumberOfPresentableImages,             // minImageCount
    surfaceFormat,                                // imageFormat
    surfaceColorSpace,                            // imageColorSpace
    myExtent,                                     // imageExtent
    1,                                            // imageArrayLayers
    VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,          // imageUsage
    VK_SHARING_MODE_EXCLUSIVE,                    // imageSharingMode
    0,                                            // queueFamilyIndexCount
    NULL,                                         // pQueueFamilyIndices
    surfaceProperties.currentTransform,           // preTransform
    VK_COMPOSITE_ALPHA_INHERIT_BIT_KHR,           // compositeAlpha
    swapchainPresentMode,                         // presentMode
    VK_TRUE,                                      // clipped
    VK_NULL_HANDLE                                // oldSwapchain
};
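Two of the fields above are usually derived from surface queries rather than hard-coded. The following is a sketch under stated assumptions: `PresentMode`, `chooseImageCount` and `choosePresentMode` are illustrative stand-ins rather than Vulkan names, and the limits are assumed to come from the surface capability query (where a maximum of 0 means "no upper limit").

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the present modes named on the slide.
enum class PresentMode { Immediate, Mailbox, Fifo, FifoRelaxed };

// Clamp the requested image count to what the surface supports.
// maxSupported == 0 is treated as "no upper limit".
std::uint32_t chooseImageCount(std::uint32_t desired,
                               std::uint32_t minSupported,
                               std::uint32_t maxSupported) {
    std::uint32_t count = std::max(desired, minSupported);
    if (maxSupported != 0) {
        count = std::min(count, maxSupported);
    }
    return count;
}

// Use the preferred mode when offered; otherwise fall back to FIFO,
// which the slide describes as mandatory.
PresentMode choosePresentMode(PresentMode preferred,
                              const std::vector<PresentMode>& supported) {
    const bool offered =
        std::find(supported.begin(), supported.end(), preferred) != supported.end();
    return offered ? preferred : PresentMode::Fifo;
}
```

The results would feed `minImageCount` and `presentMode` in the create info.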

Vulkan Swapchains: They’re good!

• Application knows which image within a swapchain it is presenting
- Content of image preserved between presents
• Application is responsible for explicitly recreating swapchains - no surprises
- Platform informs app if current swapchain is:
  - Suboptimal: e.g. after window resize, swapchain still usable for present via image scaling
  - Surface Lost: swapchain no longer usable for present
- Application is responsible for creating a new swapchain

Vulkan Swapchains: They’re jolly good!

• Presenting and acquiring are separate operations
- No need to submit a new image to acquire another one, unless presentation engine cannot release it
• Application must only modify presentable images it has acquired
• Presentation engine must only display presentable images that have been presented!

Stalls in frame loop are very bad!

Steps to set up your presentable images

1 – Create a native window/surface (platform-specific APIs)
2 – Create a Vulkan surface (VK_KHR_<platform>_surface)
3 – Query information about your surface (VK_KHR_surface)
4 – Create a Vulkan swapchain (VK_KHR_swapchain)
5 – Get your presentable images (VK_KHR_swapchain)

Vulkan Frame Loop – as easy as 1-2-3! (VK_KHR_swapchain)

0 – Create your swapchain
1 – Acquire the next presentable image
2 – Submit command buffer(s) for that image
3 – Present the image

Legend: Setup; Steady-state; Response to suboptimal / surface_lost
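The "response to suboptimal / surface_lost" arrow in the loop can be sketched as a small decision function. This is an illustration only: `WsiResult` and `Action` are hypothetical stand-ins for the result codes an application would get back from acquire or present, and whether to recreate on a suboptimal result is treated here as an application policy.

```cpp
#include <cassert>

// Stand-ins for the outcomes described on the slides.
enum class WsiResult { Success, Suboptimal, SurfaceLost };

// What the frame loop does next.
enum class Action { ContinueLoop, RecreateSwapchain };

Action nextAction(WsiResult result, bool recreateOnSuboptimal) {
    switch (result) {
        case WsiResult::Success:
            return Action::ContinueLoop;          // steady state: steps 1-2-3
        case WsiResult::Suboptimal:
            // Swapchain is still usable; recreating is optional.
            return recreateOnSuboptimal ? Action::RecreateSwapchain
                                        : Action::ContinueLoop;
        case WsiResult::SurfaceLost:
            return Action::RecreateSwapchain;     // back to step 0
    }
    return Action::ContinueLoop;
}
```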

Vulkan Displays: VK_KHR_display

• Vulkan’s way to discover display devices (screens, panels) outside a window system
- Reminder: Not supported on all platforms
• Defines VkDisplayKHR and VkDisplayModeKHR objects
- Represent the display devices connected to a VkPhysicalDevice and the modes they support
- Determine if a display supports multiple planes that are blended together
• Enables creation of a VkSurfaceKHR to represent a display plane

Figure: a physical device with Display 0 (Planes 0-2, Display Modes 0-1) and Display 1 (Display Modes 0-1); a surface represents a display plane.

VK_KHR_display_swapchain

• Extends the information provided at vkQueuePresentKHR
- What region to present from the swapchain image
- What region to present to on the display
- Whether the display should persist the image
• Adds ability to create a shared swapchain
- Swapchain that takes multiple VkSwapchainCreateInfoKHR structs
- Allows multiple displays to be presented to simultaneously
- No guarantee that presents are atomic ...presently!

Any questions?

alon.orbach@samsung.com
@alonorbach (disclaimers apply!)

LunarG® SDK for Vulkan®

Karen Ghavam, CEO
Karl Schultz, Principal Engineer
Jon Ashburn, Principal Engineer

Enter the Raffle for your prize!

Congratulations!

You are the recipient of the Vulkan Programming Guide, courtesy of LunarG!

Is your OpenGL Programming Guide getting lonely? Well, it will soon have a companion. In August 2016, when the Vulkan Programming Guide becomes available, LunarG will ship it directly to you!

In the meantime, visit LunarXchange (Vulkan.lunarg.com) for the LunarG SDK for Vulkan, and accept this book bag, anxiously awaiting its Vulkan Programming Guide.

LunarG SDK

• Loader Binary
• Validation Layer Libraries
• Vulkan trace and replay tools
- vktrace
- vkreplay
• SPIR-V Tools
- GLSL Validator
- SPIR-V Disassembler and Assembler
- SPIR-V Remapper
• RenderDoc*
• Sample Programs

*For a detailed demonstration of RenderDoc don’t miss: Practical Development for Vulkan (presented by Valve Software). Thursday, 12:45 – 1:45. Room 3009, West Hall


Download the LunarG SDK for Vulkan at LunarXchange: vulkan.lunarg.com

Version 1.0.5.0 now available!

The Power of a Layered Ecosystem

Figure: two paths through the loader.
Development path: Vulkan application -> Loader -> validation layer, debug layer, other layers -> Installable Client Driver
Production path: Vulkan application -> Loader -> Installable Client Driver

Layers: Fully Integrated
Programmatic Approach

• Application supplies list of layers
• Layers report “results” as messages
• Application handles messages in callback

Figure: Vulkan application (with Debug Report Callback) -> Loader -> Layer -> Installable Client Driver

Layers: Externally Activated
“Ad-hoc” Approach

• User sets environment variables: VK_INSTANCE_LAYERS=“layer name”
• Layers report “results” as messages
• Default Debug Report writes to output stream

Figure: Vulkan application -> Loader -> Layer (configured by a Layer Settings File, reporting through the Debug Report Callback) -> Installable Client Driver

Demo We’ll Be Using

“Hologram” by Chia-I Wu (olv)

• Well-written Vulkan demo
• Simulation of 5000 moving objects
• Demonstrates multi-threaded command buffer recording
• Can be found in: https://github.com/LunarG/VulkanSamples

Demo!

Watch the demo for a minute or so

A Few Hologram Internals – Object Data

5000 ShaderParamBlocks, one ShaderParamBlock per object

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
};

For each frame and for each object:
• Modify ShaderParamBlock
• BindDescriptorSet

Two frames of object data

Modify Demo

Let’s add code to modulate the transparency of each object, independently, as a function of time. To do this, we need to:

1. Add a parameter to the ShaderParamBlock: “per-object” alpha
2. Modify the shader program to apply the per-object alpha
3. Modify the Simulation to change the transparency of each object over time

Start with Step 1!

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;
};
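Adding one float grows the block from 40 to 41 floats (160 to 164 bytes with 4-byte floats), so any code that assumed the old size or per-object stride has to follow. The slides do not say how Hologram packs its blocks; the sketch below only shows one common way to compute an aligned per-object stride. `alignUp` and `blockOffset` are hypothetical helpers, and the alignment value is assumed to come from the device's limits.

```cpp
#include <cassert>
#include <cstddef>

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;  // the new per-object parameter
};

// Round size up to the next multiple of alignment (must be a power of two).
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offset of object `index` when one aligned block is stored per object.
constexpr std::size_t blockOffset(std::size_t index, std::size_t alignment) {
    return index * alignUp(sizeof(ShaderParamBlock), alignment);
}
```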

Let’s See What Happens

Change the code and re-run demo

More Information

• Layer Documentation
- LunarXchange website (https://vulkan.lunarg.com/app/docs/latest/layers)
- More details on validation and other layers
• Screenshot Layer
- Good for showing someone else what is wrong
- Also can be used for before/after image-compare testing
• Vktrace/Vkreplay
- Useful for sending someone a trace file in lieu of setting up a reproduction scenario

A next gen Engine design on a next gen API

Dan Baker

Graphics Architect, Oxide Games

Nitrous design philosophies

• Job based threading

• Message based systems

• Redundant, shallow state design

• Always evaluate – opposite of Lazy Evaluation

• Efficient memory streaming

• Asynchronous systems

Data driven design

Unit AI System

MessageQueue

Physics Queue

FOW queue

Minimap queue

Message Dispatcher

Relating to Graphics Stack

• Collection of messages and systems extends into graphics

• Dozens of independent systems can operate in parallel

• Big systems internally parallelize (e.g. particles, unit rendering)

A modern API

• Concept of message based, asynchronous design well matched

Exposure of asynchronous nature of a GPU is the key design difference of Vulkan over OpenGL/D3D11

A contract between App and API

• Application will not make conflicting calls on the same objects (e.g. writing one object while another is reading it)

• Driver will generally not lock or serialize any API call
– Context information is embedded on the object being operated on
– With the exception of occasional CPU side memory allocation (but should be a rare occurrence on create calls)

Application runs parallel to GPU

Figure: the application and the GPU alternate between even and odd command buffers, each with its own delete queue; a flush queue sits between application and GPU.
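The per-frame delete queues in the diagram can be sketched as deferred destruction: an object is only destroyed once the GPU has finished the frame that last used it. This is a minimal illustration, not Nitrous code; `DeleteQueue` is a hypothetical class, and how the application learns which frame the GPU has completed (for example from a fence) is left as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

class DeleteQueue {
public:
    // Schedule `destroy` to run once the GPU has finished frame `frameIndex`.
    // Entries are assumed to be pushed in non-decreasing frame order.
    void push(std::uint64_t frameIndex, std::function<void()> destroy) {
        pending_.push_back({frameIndex, std::move(destroy)});
    }

    // Run every deleter whose frame the GPU has completed; returns how many ran.
    std::size_t flush(std::uint64_t completedFrame) {
        std::size_t ran = 0;
        while (!pending_.empty() && pending_.front().frame <= completedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
            ++ran;
        }
        return ran;
    }

    std::size_t size() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t frame;
        std::function<void()> destroy;
    };
    std::deque<Entry> pending_;
};
```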

Review

• When we say Vulkan is free threaded, we mean
– Most API function calls are operators: they write only to the data passed to them as output, and only read the data passed to them as input
– API function calls are transparent for thread safety: they are valid to call as long as there are no read/write or write/write hazards. It is the app’s responsibility to manage them
– GPU/CPU hazards are explicitly exposed. GPUs are read operators on data, therefore read/write hazards between CPU and GPU must also be managed by the application
– In general, API function calls will not have locks in them
  • With the exception of calls which must allocate some types of memory

Old way

Figure: per-core timeline (Cores 1-6) for the current frame. Sim, AI, Physics and Game jobs run on some cores while graphics work runs opaquely inside the driver, followed by dead time and a "GPU Fence, or CPU wait?" marker.

• Driver related cores: missing time due to thread accounting and system level synchronization primitives
• Lots of unused CPU space! Engine is just waiting for driver to be done

Powerful New model

Figure: per-core timeline (Cores 1-6) for the current frame. Every core runs Sim jobs followed by Vulkan CMD jobs, with AI and Game jobs interleaved and a Vk present job on the last core; a GPU fence marks the end of frame before the next frame starts.

New way: Vulkan simulation using a modified Mantle build to simulate infinitely fast GPU

Difficult part of Vulkan

• Need to have a strategy for rendering up front, not lazy eval
• Before you can set up a shader, you need to understand bindings; before bindings, you need to understand descriptors
– Probably need to know these even before a descriptor is created
• The more you can know about a render job at compile time, the easier Vulkan will be

Setting up the Engine

• Pipelines created up front, combination(s) specified in shader language
• No concept of individual shader stages – Vertex/Fragment considered one block
• 64 MB temp buffer created for each frame
– Shader constants
– No buffers are updated directly
– Any updates are dumped into staging buffer and copied
– When 64 MB is exceeded, slow allocation path is used, typically only during initialization
• Internal command format that can be built in parallel
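The per-frame temp buffer behaves like a linear (bump) allocator that is reset each frame and reports when a request no longer fits, so the caller can take the slow path. The following is a sketch of that idea only; `FrameAllocator` and `kNoSpace` are hypothetical names, and it tracks offsets rather than real GPU memory.

```cpp
#include <cassert>
#include <cstddef>

// Sentinel returned when a request does not fit in the frame buffer.
constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

class FrameAllocator {
public:
    explicit FrameAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the allocation, or kNoSpace when the request
    // does not fit (the caller would then use the slow allocation path).
    std::size_t allocate(std::size_t size, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t start = (head_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || size > capacity_ - start) {
            return kNoSpace;
        }
        head_ = start + size;
        return start;
    }

    // Called once per frame, after the GPU is done with this frame's data.
    void reset() { head_ = 0; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
};
```

A real engine would construct one such allocator per queued frame with a 64 MB capacity.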

Shader Combos

• Large, monolithic blocks with much state folded in

– Shaders

– Alpha state

– MSAA state

– Depth State

• Managing combinatorics is major challenge

Shader Combos

• Very unlikely that hardware actually needs to create unique pipeline object
– The problem is that each hardware has a different state that might require a new shader
• Vulkan has bulk shader create
– Give a bunch of shader combinations at once to driver
– Most likely driver only has to create a few actual shaders
• Nitrous does group creates – 20-40 combinations of a pipeline that might get used. A little bit of pruning for shader author

Pipeline serialization

• Major problem with D3D12

• Serialization context is passed into shader create

– Needed because most pipelines are not unique

• Driver will use this as a database to store compiled pipeline objects

• Can serialize the whole database

Texture Sets

• Nitrous eliminates individual shader bindings

• Textures must be part of groups

• Maps to a descriptor set

Bind Vector

Figure: a batch consists of a shader set, a primitive (vertices) and a bind vector; each bind vector references a list of texture sets and constant sets.

• Becomes a Layout in Vulkan
• Layouts are specified during the shader creation stage
• Nitrous uses only 1 master layout
• Most engines will use multiple
• Switching layouts has cost
• Can easily sort out redundant changes, only call bind descriptor when something needs changing

Managing hazards

• The trickiest part of Vulkan
• Must manage any time a resource will be used differently
– Cache Flush
– Operator barrier
– Decompression
• USE THE VALIDATOR
– Could get correct results on current hardware only to see problems on future hardware
– No different than multi-threaded coding
• Consider having engine layer automatically partially calculate barriers
– Good design should do a good job
– Nitrous is 100% explicit right now, but will likely switch to a partially automatic system

General performance

• Shader auto recompiling won’t happen automatically
– Constant folding

– But no frame stutters due to recompiles

• Memory barriers can introduce stalls

– Need to plan out

• Changing pipelines, layouts frequently

Threading/Command buffers

• Best idea is to have many command buffers, but 1 allocator per thread per frame queued

• Command buffer allocation can cause memory bloat

• Nitrous sorts command buffers from estimated size, largest first, down to smallest

Questions

twitter: dankbaker, oxidegames

Performance Lessons from Porting Source 2 to Vulkan

Dan Ginsburg

Overview

Dota 2 Vulkan Performance Results

Performance Lessons Learned


Source 2 Overview

OpenGL, Direct3D 9, Direct3D 11, Vulkan

Windows, Linux, Mac

Dota 2 Reborn

Dota 2 Performance Results - Disclaimer

Not an ideal showcase for Vulkan

Source 2 renderer is multithreaded, but…

Dota 2 is only ~1500 draw calls per frame

Allows DX/GL a frame of latency to avoid being renderthread bound

Does not (yet!) take advantage of:

Baking descriptors

Command buffer resubmission


Still very pleased with results!

Dota 2 Vulkan Performance – DX9 Latency

Figure: frame timeline with Frame Start, Frame End and Present Issued markers.

DX9 Latency: 3.8ms

Dota 2 Vulkan Performance – Vulkan Latency

Figure: frame timeline with Frame Start, Frame End and Present Issued markers.

Vulkan Latency: 0.4ms (!)

Dota 2 Vulkan – Latency Reduction

Renderthread no longer a bottleneck

Reduces “wallclock” time of frame

Time from end of frame to present reduced by 3.4ms

Really important for:

Latency sensitive games (eSports)

VR

Dota 2 Vulkan - Framerate

Two timedemos:

Typical Dota 2 Match

High Drawcall Battle Scene

Test system:

NVIDIA TITAN X 356.45

i7-3770k @ 3.50GHz

Test settings:

Resolution: 640x480 (CPU Perf)

Highest Rendering Quality

Vulkan/GL/DX9/DX11

Dota 2 Timedemo – Typical Dota 2 Match

Figure: bar chart of FPS (NVIDIA TITAN X i7 3770k 640x480 356.45 - HQ). Series: Vulkan, OpenGL, DX9, DX11. Bar values: 182.95, 170.55, 188.5, 128.1.

Dota 2 Timedemo – Battle Scene (High Drawcall Timedemo)

Figure: bar chart of FPS (NVIDIA TITAN X i7 3770k 640x480 356.45 - HQ). Series: Vulkan, OpenGL, DX9, DX11. Bar values: 85.3, 75.15, 75.65, 67.5.

Dota 2 Vulkan Performance - Overall

Significant latency reduction

Improved framerate in heavy scenes

Only going to get better…

Overview

Dota 2 Vulkan Performance Results

Performance Lessons Learned


Command Buffer Recycling

Command Buffer Batching

Redundant Call Filtering

Updating Descriptors

Pipeline Cache Usage

Command Buffer Recycling Overview

At least one VkCommandPool per thread

Recycling options:

vkResetCommandPool – resets all command buffers in pool

vkResetCommandBuffer – reset single command buffer

Reset can either recycle or release resources

Command Buffer Recycling

Source 2 recycles individual command buffers after completion

vkBeginCommandBuffer costly

Using VK_COMMAND_BUFFER_RESET_RELEASE_RESOURCES_BIT

Driver reallocates resources

Done to reduce memory footprint, but came at perf cost

Fast Command Buffer Recycling

vkCreateCommandPool

Use VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT

vkResetCommandBuffer( pCmdBuffer, 0 )

flags == 0, keeps resources for reuse

Downside: memory growth

Source 2 strategy for handling memory growth:

Destroy command buffers no longer needed

Heuristic to destroy command buffers

Command Buffer Batching

vkQueueSubmit implies a flush

Also has CPU costs – memory residency

Important to batch submits

Command Buffer Batching

Batched submit: ~0.7ms / frame
Unbatched submits: ~4.5ms / frame

Source 2 Command Buffer Batching

Gather command buffers on renderthread

Up to a threshold, needed during load time

Wait for present request

Issue single submit with all batched command buffers
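The batching steps above can be sketched as follows. This is not Source 2 code: `SubmitBatcher` is a hypothetical class, `CommandBufferId` stands in for a real command buffer handle, and the `submit` callback stands in for a single queue submission carrying all gathered command buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using CommandBufferId = int;  // stand-in for a command buffer handle
using SubmitFn = std::function<void(const std::vector<CommandBufferId>&)>;

class SubmitBatcher {
public:
    SubmitBatcher(std::size_t threshold, SubmitFn submit)
        : threshold_(threshold), submit_(std::move(submit)) {}

    // Gather a finished command buffer; flush early only when the
    // threshold is reached (e.g. during load time).
    void add(CommandBufferId commandBuffer) {
        batch_.push_back(commandBuffer);
        if (batch_.size() >= threshold_) {
            flush();
        }
    }

    // On a present request, issue one submit with everything gathered.
    void onPresent() { flush(); }

private:
    void flush() {
        if (batch_.empty()) {
            return;
        }
        submit_(batch_);
        batch_.clear();
    }

    std::size_t threshold_;
    SubmitFn submit_;
    std::vector<CommandBufferId> batch_;
};
```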

Redundant Call Filtering

Your job now!

Vulkan drivers may not (should not!) filter calls

If we don’t do it, we will force IHVs to

Hurts the good apps at the expense of the bad

Examples from Source 2:

vkCmdBindIndexBuffer

vkCmdBindVertexBuffers

vkCmdBindPipeline

Dynamic render state

vkCmdSet*
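Filtering redundant calls on the application side amounts to remembering the last value bound and skipping the call when nothing changed. A minimal sketch, assuming one filter per piece of state per command buffer; `BindFilter` and `Handle` are hypothetical names, not Source 2 or Vulkan types.

```cpp
#include <cassert>
#include <cstdint>

using Handle = std::uint64_t;  // stand-in for a pipeline or buffer handle

class BindFilter {
public:
    // Returns true when the caller should issue the real bind call.
    bool shouldBind(Handle handle) {
        if (valid_ && handle == current_) {
            return false;  // same object already bound: skip the call
        }
        current_ = handle;
        valid_ = true;
        return true;
    }

    // Call when starting a new command buffer, since bound state
    // does not carry over from one command buffer to the next.
    void invalidate() { valid_ = false; }

private:
    Handle current_ = 0;
    bool valid_ = false;
};
```

A renderer would wrap each bind or dynamic-state call with its own filter and only record the command when `shouldBind` returns true.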

Updating Descriptors

vkUpdateDescriptorSets #1 hotspot

vkCmdBindDescriptorSets #2 hotspot

Source 2 approach:

Single pipeline layout shared across all pipelines

Descriptor sets will have unused entries

Update/bind descriptor set per draw

Not efficient!

Updating Descriptors – The Right Way

In shaders, organize descriptor sets by update frequency

Bake descriptor sets up front

Use compatible pipeline layouts to simplify descriptor allocation


…we plan to do this in the future. Will help perf a lot.

Pipeline Creation

vkCreateShaderModule is relatively fast

Loads in the SPIR-V, no heavy compilation

~0.01ms in Dota 2

vkCreateGraphicsPipelines is expensive

Driver performs shader compile here

0.2 – 152ms in Dota 2 before cache is warmed

Vulkan Pipeline Cache

Serialize compiled pipelines to disk

Preload to remove first-time stutters

Header contains VendorID/DeviceID/UUID

Otherwise opaque format

Avoid unnecessary shader compiles

Driver de-duplicates

Only driver knows when recompile is needed based on state

Pipeline cache should contain only unique pipelines

Allows compilation on multiple threads

Merge later using vkMergePipelineCaches
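Because the cache header carries VendorID/DeviceID/UUID, an application can reject a stale cache file before handing it to the driver. The sketch below assumes the version-one header layout from the Vulkan specification (header length, header version, vendor ID and device ID as 32-bit little-endian values, followed by a 16-byte UUID); `cacheMatchesDevice` and `readU32LE` are hypothetical helpers, and the driver still performs its own validation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kUuidSize = 16;                 // size of the cache UUID
constexpr std::size_t kHeaderSize = 16 + kUuidSize;   // four u32 fields + UUID

// Decode a 32-bit little-endian value.
static std::uint32_t readU32LE(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// True when the serialized cache blob was produced for this device.
bool cacheMatchesDevice(const std::vector<std::uint8_t>& blob,
                        std::uint32_t vendorID,
                        std::uint32_t deviceID,
                        const std::array<std::uint8_t, kUuidSize>& uuid) {
    if (blob.size() < kHeaderSize) return false;
    const std::uint8_t* p = blob.data();
    if (readU32LE(p) < kHeaderSize) return false;      // header length
    if (readU32LE(p + 4) != 1) return false;           // header version one
    if (readU32LE(p + 8) != vendorID) return false;
    if (readU32LE(p + 12) != deviceID) return false;
    return std::equal(uuid.begin(), uuid.end(), p + 16);
}
```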

Summary

Dota 2 Vulkan Performance Results

Reduced latency

Improved framerate in expensive scenes

Performance Lessons Learned

Command Buffer Recycling

Command Buffer Batching

Redundant Call Filtering

Updating Descriptors

Pipeline Cache Usage

Questions?

Vulkan Does Retro
A Vulkan Use-Case Study with RetroArch and libretro

Hans-Kristian Arntzen – GDC 2016

Background

• Me
  • Multimedia programming since 2009
  • Co-founder of RetroArch project in 2010-2011
  • Working at ARM hacking on the Mali GPUs since 2014
  • Contributed Vulkan backend on launch day
• RetroArch / libretro
  • Multi-platform system optimized for enjoying retro content
  • Plugin abstraction to support many different systems
  • Strong focus on portability and performance

Problem

• Retro content usually needs to render on CPU
  • Emulators of classic consoles in particular are a prime example
• Get software rendered images to screen fast and reliably
  • Blazing fast texture uploads part of the equation

Figure: CPU -> GPU magic

Streaming with Vulkan

• Vulkan exposes VK_IMAGE_TILING_LINEAR
  • Finally! For some reason, never added to OpenGL
• GPUs can sample from these textures
  • At least on the Vulkan drivers I have tested ...
  • No reason to copy from linear to optimal layout (used once!)
• Vulkan supports persistently mapped memory
  • Finally, us GLES folks can do it right
• Combine this to a dream scenario
  • Persistently map a ring buffer of linear textures
  • Let libretro core render directly into HOST_VISIBLE memory or use pure memcpy()

Caveats

• Vulkan doesn’t require support for sampling linear textures
  • Might need fallback
• Linear textures might not be DEVICE_LOCAL
  • Mostly a desktop thing
  • Might need same fallback as before ...
• Memory might not be cached
  • Fallback to copy if we want to blend on the surface
• Simple, vendor-neutral fallbacks
  • If we hit either case, copy linear texture to DEVICE_LOCAL
  • Might as well copy to OPTIMAL tiling layout
  • vkCmdCopyImage (or vkCmdCopyBufferToImage)
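The fallback rule and the ring buffer can be sketched as two small helpers. This is an illustration under assumptions, not RetroArch code: `LinearImageSupport`, `chooseUploadPath` and `ringSlot` are hypothetical names, the two booleans are assumed to have been derived from device queries, and the non-DEVICE_LOCAL case is treated as always falling back although the slide only says it "might" need to.

```cpp
#include <cassert>
#include <cstddef>

enum class UploadPath { SampleLinearDirectly, CopyToOptimal };

struct LinearImageSupport {
    bool canSampleLinear;  // device can sample linear-tiled images of this format
    bool deviceLocal;      // the host-visible memory is also DEVICE_LOCAL
};

// If either caveat applies, fall back to copying into an optimally tiled,
// device-local image; otherwise sample the linear texture directly.
UploadPath chooseUploadPath(const LinearImageSupport& support) {
    return (support.canSampleLinear && support.deviceLocal)
               ? UploadPath::SampleLinearDirectly
               : UploadPath::CopyToOptimal;
}

// Ring slot to write for a given frame, so the CPU never touches a
// texture the GPU may still be reading.
std::size_t ringSlot(std::size_t frameNumber, std::size_t slotCount) {
    assert(slotCount > 0);
    return frameNumber % slotCount;
}
```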

The various ways to copy ...

• Ring buffered textures with glTexSubImage appears to be best
  • We already did the hard part for the driver
  • Texture is not in use by GPU, should allow optimal path
  • Only way in pure GLES2
• Classic async PBO uploads have extra overhead on all drivers
  • After all, have to copy to PBO, then copy to texture
  • Doesn’t accomplish anything over plain SubImage in our case
• AZDO-style PBO seems interesting ... but
  • Observed bizarre 10x performance dips in TexSubImage
  • So much for that ...
• On Raspberry Pi 1, things got weirder ...
  • Optimal path was uploading to OpenVG texture
  • Share image with GLES via EGL ...

Benchmark

• NES video from Nestopia libretro core
  • 256x240 resolution @ 32 bpp
• Ran through RetroArch’s Vulkan and GL backends
• Measurements
  • Time to copy texture from CPU to texture
  • Time spent overall to submit frame
  • Measured on Linux

OpenGL results

• Sure, we’re measuring in microseconds
  • We can do so much better!
• * GL calls were blocking mid-frame
  • Probably rate-limiting waiting for older frames

CPU              | GPU                  | Copy OpenGL (µs) | Frame OpenGL (µs)
i5-5257U @ 2.70  | Intel HD 6100 (Mesa) | 130              | N/A (*)
i7 920 @ 2.66    | nVidia GTX 760       | 272              | 302
Cortex-A17 @ 1.8 | Mali T-764           | 585              | 806

Vulkan delivers!

• Copy time essentially a memcpy() benchmark
• Overall frame times way better than the GL texture upload!
• Great uplifts across the board
• Still room for improvement

CPU              | GPU                  | Copy Vulkan (µs) | Frame Vulkan (µs) | Copy uplift
i5-5257U @ 2.70  | Intel HD 6100 (Mesa) | 27               | 122               | 352 %
i7 920 @ 2.66    | nVidia GTX 760       | 46               | 69                | 491 %
Cortex-A17 @ 1.8 | Mali T-764           | 80               | 215               | 631 %
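The "Copy uplift" column appears consistent with (OpenGL copy time / Vulkan copy time - 1) x 100 for the GTX 760 and Mali rows. That reading is an inference from the published numbers, not something the slides state, and by the same formula the Intel row would come out near 381 % rather than the printed 352 %. A small sketch of the calculation, with `upliftPercent` as a hypothetical helper:

```cpp
#include <cassert>
#include <cmath>

// Relative improvement of the Vulkan copy time over the OpenGL copy time,
// expressed as a percentage (assumed formula, see the note above).
double upliftPercent(double glMicroseconds, double vulkanMicroseconds) {
    assert(vulkanMicroseconds > 0.0);
    return (glMicroseconds / vulkanMicroseconds - 1.0) * 100.0;
}
```

For example, `upliftPercent(272, 46)` evaluates to roughly 491 and `upliftPercent(585, 80)` to roughly 631, matching the second and third rows.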

Conclusion

• Even humble 2D applications can gain from Vulkan

• Not reserved for the highest-end engine developers

• Vulkan provides a far more direct and simple path to perf

• Fast paths are more obvious than before

• Going from good to great is much simpler in Vulkan

THANKS!

@themaister

github.com/Themaister

github.com/libretro/RetroArch

Porting Cinder to Vulkan
Hai Nguyen, Google

GFXBench 5 - Aztec Ruins
Benchmarking Vulkan

Gergely Juhasz, Lead Gfx Engineer @Kishonti

GFXBench 5 in a nutshell

• Concept
  • Working title: Aztec Ruins
• Entirely new rendering engine
  • In-house render API for Vulkan, Metal, DX12
  • Also on OpenGL 4.3+, ES 3.2, DX11 for comparison
  • Algorithmic and workload parity across different backends
• High-end graphics features
  • Real time dynamic GI
  • Complex shading and advanced post-effects
• State
  • Near to Beta
  • Gold version expected by Q3

Actual engine footage

Render pipeline – Direct lights

Render pipeline – Dynamic shadows

Render pipeline – Global illumination

Render pipeline – Post-process

Global illumination

• Probes capture the lighting conditions

• SH is generated for every probe

• Final scene is shaded by deferred irradiance lights

• Fits well into Vulkan’s subpass concept

Subpass 1 – Geometry

Subpass 2 – Lighting

Final step – Post effects

Multi-threaded command recording 1

Figure: a render job bundles render targets, render states and drawcalls; render jobs A to F form a dependency graph.

Pipeline consists of several render jobs

Multi-threaded command recording 2

Figure: several command buffers are handed to the main thread, which feeds the command queue.

Main rendering thread submits the command buffers according to the dependency graph

Future development plans

• Planned rendering features

• Indirect specular highlights and shadows by GI

• Deferred decals

• Animated vegetation

• Compute based motion blur

• Atmospheric effects, particles

• VR

Comparing Vulkan to OpenGL (ES)

Barthold Lichtenbelt
March 16, 2016

Beneficial Vulkan Scenarios

Flowchart: from "start", "yes" branches lead through the following boxes to "Vulkan friendly":
• Is your graphics work CPU bound?
• Can your graphics creation be parallelized?
• Your graphics platform is fixed
• You’ll do what it takes to squeeze out max perf.
• You put a premium on avoiding hitches
• You can manage your graphics resource allocations

Unlikely to Benefit

Scenarios to reconsider coding to Vulkan

1. Need for compatibility to pre-Vulkan platforms
2. Heavily GPU-bound application
3. Heavily CPU-bound application due to non-graphics work
4. Single-threaded application, unlikely to change
5. App can target middle-ware engine, avoiding 3D graphics API dependencies
• Consider using an engine targeting Vulkan, instead of coding Vulkan yourself

Comparing OpenGL, AZDO, and Vulkan

Issue                                           | Naïve GL | AZDO    | Vulkan
Deterministic state validation/pre-compilation | no       | no      | yes
Improved single thread performance              | no       | yes     | yes
Multi-threaded work creation                    | no       | partial | yes
Multi-threaded work submission (to driver)      | no       | no      | yes
GPU based work creation                         | no       | partial | partial (through MDI)
Ability to re-use created work                  | no       | partial | yes
Multi-threaded resource updates                 | no       | yes     | yes
Learning curve                                  | low      | high    | significant
Effort                                          | low      | high    | significant

Fish demo

• Vulkan and OpenGL ES 3.1
• Can change
- # of schools of fish
- # of fish per school
- # of fish per drawcall
• Worker threads create command buffers in Vulkan mode
• Reports
- Drawcalls/sec
- FPS
- CPU time per thread
- GPU time
• Android and Windows
• Source code will be available soon

200K Fishies, 100 fish per draw call

Figure: bar chart of drawcalls / sec (axis 0 to 200,000), OpenGL ES vs Vulkan, on Geforce GTX 980, SHIELD Android TV and SHIELD Tablet K1. Speedup labels on the chart: 7x, 1.5x, 1.2x.

200K Fishies, 1 fish per draw call

Figure: bar chart of drawcalls / sec (axis 0 to 18,000,000), OpenGL ES vs Vulkan, on Geforce GTX 980, SHIELD Android TV and SHIELD Tablet K1. Speedup labels on the chart: 6x, 5x, 19x.


FISH DEMO

Porting Cinder to Vulkan
Learning to Follow Rules

Hai Nguyen
Creative Technology Lead
Art Copy & Code Project

Vulkan: Lots of rules and no mercy.

~Joseph Campbell (paraphrased)

Introducing Cinder

● What’s creative coding?
○ Programming with aesthetic intent
● What platforms does Cinder run on?
○ Android, Linux, Windows, iOS and OS X
● Open source under Simplified BSD

C++ Creative Coding Framework | https://libcinder.org

Cinder: Who/What/Where?

● Who is Cinder’s target audience?
○ Creative coders
● What is Cinder used for?
○ Apps: mobile to desktop to Times Square
● Where has Cinder been used?

Audience and Projects

Projects That Use Cinder

Grove | Simon Geilfus
Planetary | BLOOM.io
SCAD Museum | Pentagram
IBM THINK | Mirada
Samsung CenterStage | TBG
Dia Lights | Kollision
Audi Urban Future | Kollision
Androidify | Red Paper Heart
Taxi, Taxi! | Robert Hodgin

Porting Cinder to Vulkan

● Vulkanizing Cinder
● Crossing Vendor Implementations
● Speed Bumps

The Road To Glory

Vulkanizing Cinder

● Added RendererVk to Cinder
○ Cinder rendering architecture is modular
● Wrapped Vulkan in C++
○ Created idiomatic layer for expression
● Created high level graphics classes
○ Textures, vertex buffers, render targets, etc.

Getting to the First Triangle

Vulkanizing Cinder

● Initial port on Windows: ~3 weeks

○ Included updating GLSL to Vulkan convention

● Android and Linux port: ~3 hours (each)

○ Added platform WSI calls

○ Added platform swapchain creation

● Everything else stayed the same

○ Including GLSL shader code used in demos and tests

Going Cross Platform

Crossing Vendor Implementations

● Vendor implementations follow the spec

○ Conformance tested

● Slightly different behaviors

○ Image layout transitions in render passes

● Varying GPU limits/features

○ Found in VkPhysicalDeviceLimits

Implementation Details Will Vary

Speed Bump: Image Layout Transitions

● Initial platform allowed image layouts to be VK_IMAGE_LAYOUT_GENERAL

○ Made it easy to get up and going

● Seemed to work on other GPUs - until one didn’t

○ Why? Vendor had stricter adherence to spec

● Checked spec and added logic for transitions

○ Had to rework a good bit of code

Dad Said Yes But Mom Said No
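
The kind of transition logic described on this slide can be sketched as a small tracker that remembers each image's current layout and reports when a transition is needed. This is only an illustration, not Cinder's actual code: `ImageLayout`, `LayoutTransition`, and `TrackedImage` are hypothetical stand-ins so the sketch compiles without the Vulkan headers. In a real renderer the reported old/new pair would typically fill the `oldLayout` and `newLayout` fields of a `VkImageMemoryBarrier` recorded with `vkCmdPipelineBarrier`.

```cpp
#include <cassert>

// Stand-in for a few VkImageLayout values (assumption: real code would use
// the enum from <vulkan/vulkan.h> instead).
enum class ImageLayout {
    Undefined,
    General,
    ColorAttachmentOptimal,
    ShaderReadOnlyOptimal,
    PresentSrc
};

// One layout change that the caller must record as an image memory barrier.
struct LayoutTransition {
    ImageLayout oldLayout;
    ImageLayout newLayout;
};

// Hypothetical helper that remembers which layout an image is currently in.
class TrackedImage {
public:
    explicit TrackedImage(ImageLayout initial = ImageLayout::Undefined)
        : current_(initial) {}

    // Returns true and fills `out` if the image must be transitioned before
    // it can be used in `required`; returns false if it is already there.
    bool require(ImageLayout required, LayoutTransition& out) {
        if (current_ == required) {
            return false;  // no barrier needed
        }
        out.oldLayout = current_;
        out.newLayout = required;
        current_ = required;  // assume the caller records the barrier now
        return true;
    }

    ImageLayout current() const { return current_; }

private:
    ImageLayout current_;
};
```

Calling `require` before every use keeps the bookkeeping in one place, so code that happened to work while everything stayed in the general layout has a single point at which explicit transitions can be added.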

Whooops...

YAY!

Speed Bump: Not Paying Attention to Limits

● Not adhering to limits often results in crashes

● Mishandled vkCmdBindDescriptorSets

○ Exceeded maxBoundDescriptorSets

● Tried to multithread on device with 1 queue

○ Failed to check queue family’s queue count

VkPhysicalDeviceLimits / VkQueueFamilyProperties
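
Both mistakes above come down to comparing a request against a value the device reports. The sketch below shows the two checks in isolation. `DeviceLimits` and `QueueFamily` are hypothetical stand-ins that mirror only the `maxBoundDescriptorSets` and `queueCount` fields, so the example compiles without the Vulkan headers; a real application would read those fields from `VkPhysicalDeviceLimits` (via `vkGetPhysicalDeviceProperties`) and `VkQueueFamilyProperties` (via `vkGetPhysicalDeviceQueueFamilyProperties`). The helper names are illustrative, not part of any API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stand-ins mirroring the two fields discussed on the slide (assumption:
// real code reads them from the structs returned by the Vulkan queries).
struct DeviceLimits {
    std::uint32_t maxBoundDescriptorSets;
};

struct QueueFamily {
    std::uint32_t queueCount;
};

// True if binding `setCount` descriptor sets starting at index `firstSet`
// stays within the device limit.
bool descriptorSetRangeFits(const DeviceLimits& limits,
                            std::uint32_t firstSet,
                            std::uint32_t setCount) {
    // Subtract instead of adding firstSet + setCount to avoid unsigned overflow.
    return firstSet <= limits.maxBoundDescriptorSets &&
           setCount <= limits.maxBoundDescriptorSets - firstSet;
}

// Number of queues to request from a family: never more than it exposes.
std::uint32_t usableQueueCount(const QueueFamily& family, std::uint32_t wanted) {
    return std::min(wanted, family.queueCount);
}
```

On a device whose queue family reports a single queue, `usableQueueCount` returns 1 no matter how many queues the application would like, which is the case the slide describes.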

No More Black Box / Fewer Black Screens

● Vulkan Specification

○ Clear about requirements and expectations (mostly)

● Check Device Limits / Features at Run Time

○ Easy to query in Vulkan

● Validation Layers Are Your Friends

○ Turn on from day 1 - leave on until shipped

Help Vulkan Help You

Antoine Labour, E. Greg Daniel, Jesse Hall, Shannon Woods, Daniel Koch, Jeff Bolz, Mathias Heyer, Piers Daniell, Tristan Lorach, John McDonald, Dominik Witczak

Special Thanks

Thank You!

Hai Nguyen

https://libcinder.org

GFXBench 5 - Aztec Ruins

Benchmarking Vulkan

Gergely Juhasz, Lead Gfx Engineer @Kishonti

GFXBench 5 in a nutshell

• Concept
  • Working title: Aztec Ruins

• Entirely new rendering engine
  • In-house render API for Vulkan, Metal, DX12
  • Also on OpenGL 4.3+, ES 3.2, DX11 for comparison
  • Algorithmic and workload parity across different backends

• High-end graphics features
  • Real time dynamic GI
  • Complex shading and advanced post-effects

• State
  • Near to Beta
  • Gold version expected by Q3

Actual engine footage

Render pipeline – Direct lights

Render pipeline – Dynamic shadows

Render pipeline – Global illumination

Render pipeline – Post-process

Global illumination

• Probes capture the lighting conditions

• SH is generated for every probe

• Final scene is shaded by deferred irradiance lights

• Fits well with Vulkan’s subpass concept

Subpass 1 – Geometry

Subpass 2 – Lighting

Final step – Post effects

Multi-threaded command recording 1

[Diagram: a render job, shown with render targets, render states, and drawcalls; dependency graph of render jobs A, B, C, D, E, and F.]

Pipeline consists of several render jobs

Multi-threaded command recording 2

[Diagram: four command buffers, the main thread, and the command queue.]

Main rendering thread submits the command buffers according to the dependency graph
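
One way to derive a submission order from such a dependency graph is a topological sort. The sketch below uses Kahn's algorithm; it is an illustration under stated assumptions, not GFXBench's implementation. Jobs are identified by index, each edge `{a, b}` means job `b` depends on job `a`, and the function name `submissionOrder` is hypothetical. Parallel recording of the command buffers is left out, since it does not affect the order in which the main thread submits them.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Returns an order in which render jobs can be submitted so that every job
// comes after all jobs it depends on. Each pair {a, b} means "b depends on a".
// Job indices in `edges` are assumed to be smaller than `jobCount`.
// Returns an empty vector if the graph contains a cycle.
std::vector<std::size_t> submissionOrder(
        std::size_t jobCount,
        const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<std::vector<std::size_t>> dependents(jobCount);
    std::vector<std::size_t> pending(jobCount, 0);  // unmet dependencies per job
    for (const auto& e : edges) {
        dependents[e.first].push_back(e.second);
        ++pending[e.second];
    }

    // Jobs with no unmet dependencies can be submitted right away.
    std::queue<std::size_t> ready;
    for (std::size_t j = 0; j < jobCount; ++j) {
        if (pending[j] == 0) ready.push(j);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t j = ready.front();
        ready.pop();
        order.push_back(j);
        // Submitting j may unblock the jobs that were waiting on it.
        for (std::size_t d : dependents[j]) {
            if (--pending[d] == 0) ready.push(d);
        }
    }

    if (order.size() != jobCount) order.clear();  // cycle: no valid order
    return order;
}
```

With six jobs standing in for A through F and any acyclic set of edges, the returned order places each job after everything it depends on, which is the property the main thread needs when it submits the recorded command buffers.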

Future development plans

• Planned rendering features

• Indirect specular highlights and shadows by GI

• Deferred decals

• Animated vegetation

• Compute based motion blur

• Atmospheric effects, particles

• VR

Comparing Vulkan to OpenGL (ES)

Barthold Lichtenbelt

March 16, 2016

Beneficial Vulkan Scenarios

[Flowchart from "start" to "Vulkan friendly", with a "yes" branch from each of the following:]

• Is your graphics work CPU bound?

• Can your graphics creation be parallelized?

• Your graphics platform is fixed

• You’ll do what it takes to squeeze out max perf.

• You put a premium on avoiding hitches

• You can manage your graphics resource allocations

Unlikely to Benefit

Scenarios to reconsider coding to Vulkan

1. Need for compatibility with pre-Vulkan platforms

2. Heavily GPU-bound application

3. Heavily CPU-bound application due to non-graphics work

4. Single-threaded application, unlikely to change

5. App can target middleware engine, avoiding 3D graphics API dependencies

• Consider using an engine targeting Vulkan, instead of coding Vulkan yourself

Comparing OpenGL, AZDO, and Vulkan

Issue                                             Naïve GL   AZDO      Vulkan
Deterministic state validation/pre-compilation    no         no        yes
Improved single thread performance                no         yes       yes
Multi-threaded work creation                      no         partial   yes
Multi-threaded work submission (to driver)        no         no        yes
GPU based work creation                           no         partial   partial (through MDI)
Ability to re-use created work                    no         partial   yes
Multi-threaded resource updates                   no         yes       yes
Learning curve                                    low        high      significant
Effort                                            low        high      significant

Fish demo

• Vulkan and OpenGL ES 3.1

• Can change

- # of schools of fish

- # of fish per school

- # of fish per drawcall

• Worker threads create command buffers in Vulkan mode

• Reports

- Drawcalls/sec

- FPS

- CPU time per thread

- GPU time

• Android and Windows

• Source code will be available soon
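
The demo's settings and its reported numbers are linked by simple arithmetic, sketched below. This is an assumption about how the quantities relate, not the demo's source code, and both function names are hypothetical. The per-frame draw call count is the total fish divided by the fish per draw call, rounded up, and drawcalls/sec is that count multiplied by the frame rate.

```cpp
#include <cassert>
#include <cstdint>

// Draw calls needed per frame when `fishPerDraw` fish share one draw call.
// Rounds up so that a partially filled last batch still gets its own call.
std::uint64_t drawCallsPerFrame(std::uint64_t totalFish, std::uint64_t fishPerDraw) {
    assert(fishPerDraw > 0);
    return (totalFish + fishPerDraw - 1) / fishPerDraw;
}

// Draw-call throughput implied by a given frame rate.
double drawCallsPerSecond(std::uint64_t perFrame, double fps) {
    return static_cast<double>(perFrame) * fps;
}
```

For 200,000 fish, 100 fish per draw call gives 2,000 draw calls per frame, while 1 fish per draw call gives 200,000, so the second configuration puts far more weight on per-draw CPU overhead.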
