OPTIMIZING NVIDIA VIRTUAL GPU FOR THE BEST VDI USER …on-demand.gputechconf.com/gtc-il/2018/pdf/sil8103... · * Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf

Erik Bohnhorst, NVIDIA

OPTIMIZING NVIDIA VIRTUAL GPU FOR THE BEST VDI USER EXPERIENCE

2

Agenda

• vGPU Introduction

• Virtual GPU October 2018 (vGPU 7.0)

• Architecting for Best User Experience

• NVIDIA Recommended CPUs

• What is the right GPU for your use case

• NVIDIA vGPU Benchmarking

3

VISUAL

WORKSPACE

THE EVOLUTION OF MODERN WORKFLOWS

VISUAL COMPUTING SPECTRUM

COLLABORATIONLARGE DATA

INTERACTIVE

HPC

VRPHOTOREALISMAIMOBILITY

4

HOW IT WORKSNVIDIA virtual GPU technology delivers a GPU experience to every desktop

Server

Hypervisor

Apps and VMs

CPU Only VDILimiting User Experience

NVIDIA Graphics Drivers

NVIDIA Virtual GPU

NVIDIA Tesla GPU

NVIDIA Virtualization Software

With NVIDIA Virtual GPUDriving the Best User Experience across

simple to the most powerful Apps

Apps and VMs

Hypervisor

Server

5

VIRTUAL GPU OCTOBER 2018 (vGPU 7.0)Unprecedented Performance & Manageability

Multi-vGPU SupportWorld’s Most Powerful

Quadro vDWS

vMotion Support for vGPULive Migration of vGPUenabled VMs

Quadro vDWS & GRID

Tesla T4 GPU Support*Latest Generation Turing

Quadro vDWS

NGC with vGPUAvailable with vGPU

Quadro vDWS

FPO FPO

* Tesla T4 support coming with vGPU software 7.1 release

6

GIANT LEAP

TURING

18.6 Billion xtors | 754 mm2 | GDDR6 14GHz

PASCAL

11.8 Billion xtors | 471 mm2 | 24 GB 10GHz

Greatest Leap Since 2006 CUDA GPU

7

GIANT LEAP

TENSOR CORE

FP16

INT8

INT4

RT CORE

Giga Rays/Sec

SHADER | COMPUTE

FP + INT

SHADER | COMPUTE

FP

or

INT

TURINGPASCAL

Greatest Leap Since 2006 CUDA GPU

9

VIRTUAL GPU

10

TESLA T4 IS EXTREMELY VERSATILE

• Great solution for

• Quadro vDWS

• GRID vPC

• Deep Learning Inference

• 6 boards in high volume 2U rack servers

Enablement in Virtual GPU 7.1

TESLA T4

GPU 1x TU104

Cores2,560 CUDA Cores

320 Turing Tensor Cores

RT Cores

Memory 16 GB GDDR6

Form FactorPCIe 3.0 Single Slot

(half height & length)

Thermal Passive

Power 70W – no external power

Max Users 16 (1GB FB)

Compute65 FP16 TFLOPS

130 INT8 TOPS

240 INT4 TOPS

Memory Bandwidth 320 GB/s

11

94% FASTER RENDERING USING MULTI-GPU

“The flexibility of the new multi-GPU feature available with NVIDIA Quadro vDWS opens up powerful new rendering workflows to SOLIDWORKS Visualize users. The near linear performance scaling means they can iterate on their designs at lightning speed on professional virtual workstations, allowing our customers to arrive at their best design in the shortest amount of time.” – Brian Hillner, SOLIDWORKS Product Portfolio Manager

SOLIDWORKS Visualize (Iray) Render Time

94% Faster

1x Tesla V100 2x Tesla V100

Up to 94% Faster Render Time Using Multi-GPU SOLIDWORKS Visualize (IRAY)

Tests were run on a server with 2x Intel Xeon Gold (6154 3.0 GHz) CPUs, 512GB RAM, RHEL 7.5, NVIDIA Quadro

vDWS software, Tesla V100-32Q, Driver - 410.39, 256 GB RAM, Windows 10 x64 RS3

12

UP TO 4.95X FASTER THAN CPU-ONLYAbaqus/Standard 2018 Elastomeric Bearing Model

0 1 2 3 4 5

16 vCPU

32 vCPU

16 vCPU + Quadro vDWSwith 1x V100

16 vCPU + Quadro vDWSwith 2x V100

Abaqus with NVIDIA Quadro vDWS & Tesla V100-32Q

16 License Tokens

16 License Tokens

21 License Tokens

16 License Tokens

Tests run on a sever with 2x Intel Xeon Skylake CPUs (Xeon 6148 2.4 GHz 32-core), NVIDIA Quadro vDWS software, Tesla V100 GPUs with 32Q profile, Driver - 410.53, 256 GB vRAM, Cent OS 7.4 64-bit. Benchmark

Model: ~450-550 TFLOPs, 5.9M DOF, Highly Nonlinear Static, Axisymmetric model with non-axisymmetric loading and twist, Direct Sparse Solver (Model courtesy: SIMULIA)

01X 2X 3X 4X 5X

13

NVIDIA VIRTUAL GPU SOFTWARE LINEUP

For virtual desktops delivering standard

PC applications, browser, and

multimedia.

For professional graphics applications; includes an NVIDIA Quadro driver.

Quadro Virtual Data Center Workstation

(Quadro vDWS)

GRID Virtual PC(GRID vPC)

Use with VMware Horizon Apps.

GRID Virtual Applications(GRID vApps)

Recommended GPU:Tesla P4*

* P40 & V100 for High End & Ultra High-End Use Cases* P6 for blade form factor deployments

Recommended GPU:Tesla M10

Recommended GPU:Tesla M10

14

NVIDIA RECOMMENDED CPU OPTIONS

GRID vPC

AMD EPYC CPU’s higher number of physical cores with lower frequency provide similar user experience to Intel Xeon Gold 6148 at higher scale.

Quadro vDWS

~3.0 GHz is required for many professional applications for optimal performance. Lower frequency can result in degraded performance.

Different workflows require different CPUs

GRID vPC

Quadro vDWS

Intel Xeon Gold 6148- 24 Cores @ 2.4 GHz

AMD EPYC 7501- 32 Cores @ 2.0 GHz

Both CPUs provide similar user experience* while the AMD CPU can host ~25-33% more users

Intel Xeon Gold 6154- 18 Cores @ 3.0 GHz

Provides the required frequency per physical core and allows good scale (18 cores/CPU)

* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

15

RECOMMENDED NVIDIA TESLA GPU OPTIONS

Quadro vDWS:

Tesla P4 GPUs with Quadro vDWS for entry to mid end users provides the most flexible and cost effective solution

Tesla P40/V100 GPUs with Quadro vDWS provides graphics acceleration for few ultra high end users

GRID vPC:

Tesla M10 GPUs with GRID vPC enhances user experience while being the most cost effective solution

Different workflows require different GPUs

GRID vPC

Intel Xeon Gold 6148 or AMD EPYC 7501+

NVIDIA TESLA M10**+

GRID vPC

* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs** Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

Quadro vDWS

Intel Xeon Gold 6154+

NVIDIA TESLA P4*+

Quadro vDWS

GRID vApps

Intel Xeon Gold 6148+

NVIDIA TESLA M10**+

GRID vApps

16

QUADRO vDWS GUIDANCE

AutoCAD, Revit, Inventor

Solidworks, Siemens NX, Creo, Catia

Office, Sketchup

Adobe CC Photoshop, Illustrator Adobe CC Premiere Pro, After Effects, Autodesk Maya, 3ds Max, Mari, Nuke

Schlumberger, Halliburton, DeltaGen, Catia Live RenderingPACS/Diagnostics

Ansys, Abaqus, Simulia

Small/simple CAD models, video, Entry PLM

Medium size/complexity CAD models, Basic DCC, Medical Imaging, PLM

Large/complex CAD models,Advanced DCC, Medical Imaging

Large/complex CAD models,Seismic exploration, complexDCC effects, 3D Medical Imaging Recon

Largest CAD models, CAE,Photorealistic rendering,Seismic exploration, GPGPU compute

Deep learning, rendering, immersive visualization, and GPGPU compute applications

+ 4GB Memory

Tesla P4 / T4

Entry – Mid Range Quadro vDWS

Tesla P40

High-End Quadro vDWS

Tesla V100

17

UP TO 6X TESLA P4

NVIDIA recommends Intel Xeon Gold 6154 18-core 3.0 GHz which provides enough CPU resources to host 6x Tesla P4 GPUs with Quadro vDWS.

Tesla P4 benefits over Tesla M60:

- Performance*

- Price/Performance

- Smaller Form Factor

- Lower Power Consumption

- NVIDIA Pascal GPU Architecture Benefits

Best Density with 6x Tesla P4 on a Dual-Socket Server

* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs

0.8

1

1.2

3dsMax Catia Maya Siemens NX SolidworksRela

tive P

erf

orm

ance

Tesla P4 and Tesla M60 Performance*

Tesla M60 GPU Tesla P4 GPU

0

0.5

1

1.5

3dsMax Catia Maya Siemens NX Solidworks

Rela

tive P

erf

orm

ance

Sufficient CPU resources to host 6x Tesla P4 with Quadro vDWS

1x Tesla P4 6x Tesla P4

12 V

Ms

12 V

Ms

24 V

Ms

24 V

Ms

24 V

Ms

2 V

Ms

2 V

Ms

4 V

Ms

4 V

Ms

4 V

Ms

18

TESLA P4 TESLA P40

Many Low-Mid End Users Few Mid-High End Users

Price/Performance Performance

Form FactorHigh Framebuffer Profiles

(12GB and 24GB)

Power Consumption

Different Profiles (Many P4s)

TESLA P40/V100 FOR ULTRA HIGH END USERS

Tesla P40 with Quadro vDWS for few high to ultra high end users.

Tesla V100 with Quadro vDWS for few high to ultra high end users and/or Deep Learning workflows.

When to choose Tesla P40 over P4:

- Maximum Performance*

- High Framebuffer profiles (12GB/24GB)

Multiple Tesla P4 GPUs are the most cost effective and flexible solution for many entry to mid range end users

Tesla P40 and Tesla V100 power the most demanding workflows

* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs

0.5

1

1.5

2

2.5

Rela

tive P

erf

orm

ance

Tesla P4 and Tesla P40 Performance*

Tesla P4 Tesla P40

REMEMBER:

Large framebuffer GPUs don’t guarantee

high number of Quadro vDWS users

19

SIMPLE vGPU LICENSE DECISION

Do you use CUDA or

professional workstation apps?

Do you use VDI?

(Single user per OS)

You need Quadro vDWS Includes vPC and vApps entitlement

Do you have multiple users

sharing a single OS through

sessions? (RDSH, Horizon Apps,

XenApp, etc.)

You need GRID vPC Includes vApps entitlement

Yes!

Yes!

No

No

You need vAppsYes!

20

HOW NVIDIA MEASURES USER EXPERIENCEApplying Methodology of Physical PCs to Virtual PCs

End User Latency

Framerate

Image Quality Functionality

UX

Metrics Description

End User Latency Measures Interactivity, how remote your session feels

Framerate Measures the fluidity of your session

Image Quality Measures the impact of the remote protocol

Functionality Application and API compatibility

Consistency Measures how consistent the UX is over time

UNIQUELY Quantifies Remote User Experience

+ Monitors Resource Utilization

Metrics Description

Host Resources CPU, GPU, Memory, etc.

Virtual Machine Resources vCPUs, vMemory, vGPU, IOPS, etc.

Network Consumption Bandwidth, etc.

= Realistic Sizing Recommendations

21

NVIDIA IMAGE QUALITY RECOMMENDATION

YUV 4:2:0 YUV 4:4:4 YUV 4:2:0 YUV 4:4:4

Reference Image

Reference Image

YUV 4:4:4 for PC Users

GRID vPCYUV 4:2:0 for Workstation Users

Quadro vDWS

22

YUV 4:4:4 IMPLICATIONS

Similar Bandwidth Utilization*YUV 4:4:4 – 2% less bandwidth

Lower Remoted Frames*YUV 4:4:4 - 9% fewer Remoted Frames

Improved Image QualitySSIM increase to 0.989**

* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload running 64 VMs with Tesla M10-1B** Tested with NVIDIA Cirrus VDI Benchmarking tool and predefined reference images to represent multiple workflows

0

20

40

60

80

100

120

140

160

180

200

Mb

its/

sec

64 VM Bandwidth (Mbit/sec)

YUV 4:2:0 YUV 4:4:40

0.25

0.5

0.75

1

YUV 4:2:0 YUV 4:4:4

Tota

l Rem

ote

d F

ram

es

Remoted Frames/User

0.9

0.92

0.94

0.96

0.98

1

YUV 4:2:0 YUV 4:4:4

SSIM

Image Quality

91%

23

End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User

GRID vPC for Multiple Screens

GRID vPC for High Screen Resolutions

350

350

1

1

0

20

40

60

80

1001x 1080p Screen

(CPU-Only)

1x 1080p Screen

(CPU-Only)



0

20

40

60

80

100

24




350

714

350

800

1

1.2

1

1.26

0

20

40

60

80

1001x 1080p Screen

(CPU-Only)

2x 1080p Screens

(CPU-Only)

1x 1080p Screen

(CPU-Only)

1x 4K Screens

(CPU-Only)



0

20

40

60

80

100

25




350

714

233

350

800

451 1

1.2

1.85

1

1.26

1.86

0

20

40

60

80

1001x 1080p Screen

(CPU-Only)

2x 1080p Screens

(CPU-Only)

2x 1080p Screens

(GRID vPC)

1x 1080p Screen

(CPU-Only)

1x 4K Screens

(CPU-Only)

1x 4K Screens

(GRID vPC)



0

20

40

60

80

100

26

THANK YOU

Documents

OPTIMIZING NVIDIA VIRTUAL GPU FOR THE BEST VDI USER …on-demand.gputechconf.com/gtc-il/2018/pdf/sil8103... · * Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf