Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
Erik Bohnhorst, NVIDIA
OPTIMIZING NVIDIA VIRTUAL GPU FOR THE BEST VDI USER EXPERIENCE
2
Agenda
• vGPU Introduction
• Virtual GPU October 2018 (vGPU 7.0)
• Architecting for Best User Experience
• NVIDIA Recommended CPUs
• What is the right GPU for your use case
• NVIDIA vGPU Benchmarking
3
VISUAL
WORKSPACE
THE EVOLUTION OF MODERN WORKFLOWS
VISUAL COMPUTING SPECTRUM
COLLABORATIONLARGE DATA
INTERACTIVE
HPC
VRPHOTOREALISMAIMOBILITY
4
HOW IT WORKSNVIDIA virtual GPU technology delivers a GPU experience to every desktop
Server
Hypervisor
Apps and VMs
CPU Only VDILimiting User Experience
NVIDIA Graphics Drivers
NVIDIA Virtual GPU
NVIDIA Tesla GPU
NVIDIA Virtualization Software
With NVIDIA Virtual GPUDriving the Best User Experience across
simple to the most powerful Apps
Apps and VMs
Hypervisor
Server
5
VIRTUAL GPU OCTOBER 2018 (vGPU 7.0)Unprecedented Performance & Manageability
Multi-vGPU SupportWorld’s Most Powerful
Quadro vDWS
vMotion Support for vGPULive Migration of vGPUenabled VMs
Quadro vDWS & GRID
Tesla T4 GPU Support*Latest Generation Turing
Quadro vDWS
NGC with vGPUAvailable with vGPU
Quadro vDWS
FPO FPO
* Tesla T4 support coming with vGPU software 7.1 release
6
GIANT LEAP
TURING
18.6 Billion xtors | 754 mm2 | GDDR6 14GHz
PASCAL
11.8 Billion xtors | 471 mm2 | 24 GB 10GHz
Greatest Leap Since 2006 CUDA GPU
7
GIANT LEAP
TENSOR CORE
FP16
INT8
INT4
RT CORE
Giga Rays/Sec
SHADER | COMPUTE
FP + INT
SHADER | COMPUTE
FP
or
INT
TURINGPASCAL
Greatest Leap Since 2006 CUDA GPU
9
VIRTUAL GPU
10
TESLA T4 IS EXTREMELY VERSATILE
• Great solution for
• Quadro vDWS
• GRID vPC
• Deep Learning Inference
• 6 boards in high volume 2U rack servers
Enablement in Virtual GPU 7.1
TESLA T4
GPU 1x TU104
Cores2,560 CUDA Cores
320 Turing Tensor Cores
RT Cores
Memory 16 GB GDDR6
Form FactorPCIe 3.0 Single Slot
(half height & length)
Thermal Passive
Power 70W – no external power
Max Users 16 (1GB FB)
Compute65 FP16 TFLOPS
130 INT8 TOPS
240 INT4 TOPS
Memory Bandwidth 320 GB/s
11
94% FASTER RENDERING USING MULTI-GPU
“The flexibility of the new multi-GPU feature available with NVIDIA Quadro vDWS opens up powerful new rendering workflows to SOLIDWORKS Visualize users. The near linear performance scaling means they can iterate on their designs at lightning speed on professional virtual workstations, allowing our customers to arrive at their best design in the shortest amount of time.” – Brian Hillner, SOLIDWORKS Product Portfolio Manager
SOLIDWORKS Visualize (Iray) Render Time
94% Faster
1x Tesla V100 2x Tesla V100
Up to 94% Faster Render Time Using Multi-GPU SOLIDWORKS Visualize (IRAY)
Tests were run on a server with 2x Intel Xeon Gold (6154 3.0 GHz) CPUs, 512GB RAM, RHEL 7.5, NVIDIA Quadro
vDWS software, Tesla V100-32Q, Driver - 410.39, 256 GB RAM, Windows 10 x64 RS3
12
UP TO 4.95X FASTER THAN CPU-ONLYAbaqus/Standard 2018 Elastomeric Bearing Model
0 1 2 3 4 5
16 vCPU
32 vCPU
16 vCPU + Quadro vDWSwith 1x V100
16 vCPU + Quadro vDWSwith 2x V100
Abaqus with NVIDIA Quadro vDWS & Tesla V100-32Q
16 License Tokens
16 License Tokens
21 License Tokens
16 License Tokens
Tests run on a sever with 2x Intel Xeon Skylake CPUs (Xeon 6148 2.4 GHz 32-core), NVIDIA Quadro vDWS software, Tesla V100 GPUs with 32Q profile, Driver - 410.53, 256 GB vRAM, Cent OS 7.4 64-bit. Benchmark
Model: ~450-550 TFLOPs, 5.9M DOF, Highly Nonlinear Static, Axisymmetric model with non-axisymmetric loading and twist, Direct Sparse Solver (Model courtesy: SIMULIA)
01X 2X 3X 4X 5X
13
NVIDIA VIRTUAL GPU SOFTWARE LINEUP
For virtual desktops delivering standard
PC applications, browser, and
multimedia.
For professional graphics applications; includes an NVIDIA Quadro driver.
Quadro Virtual Data Center Workstation
(Quadro vDWS)
GRID Virtual PC(GRID vPC)
Use with VMware Horizon Apps.
GRID Virtual Applications(GRID vApps)
Recommended GPU:Tesla P4*
* P40 & V100 for High End & Ultra High-End Use Cases* P6 for blade form factor deployments
Recommended GPU:Tesla M10
Recommended GPU:Tesla M10
14
NVIDIA RECOMMENDED CPU OPTIONS
GRID vPC
AMD EPYC CPU’s higher number of physical cores with lower frequency provide similar user experience to Intel Xeon Gold 6148 at higher scale.
Quadro vDWS
~3.0 GHz is required for many professional applications for optimal performance. Lower frequency can result in degraded performance.
Different workflows require different CPUs
GRID vPC
Quadro vDWS
Intel Xeon Gold 6148- 24 Cores @ 2.4 GHz
AMD EPYC 7501- 32 Cores @ 2.0 GHz
Both CPUs provide similar user experience* while the AMD CPU can host ~25-33% more users
Intel Xeon Gold 6154- 18 Cores @ 3.0 GHz
Provides the required frequency per physical core and allows good scale (18 cores/CPU)
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
15
RECOMMENDED NVIDIA TESLA GPU OPTIONS
Quadro vDWS:
Tesla P4 GPUs with Quadro vDWS for entry to mid end users provides the most flexible and cost effective solution
Tesla P40/V100 GPUs with Quadro vDWS provides graphics acceleration for few ultra high end users
GRID vPC:
Tesla M10 GPUs with GRID vPC enhances user experience while being the most cost effective solution
Different workflows require different GPUs
GRID vPC
Intel Xeon Gold 6148 or AMD EPYC 7501+
NVIDIA TESLA M10**+
GRID vPC
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs** Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
Quadro vDWS
Intel Xeon Gold 6154+
NVIDIA TESLA P4*+
Quadro vDWS
GRID vApps
Intel Xeon Gold 6148+
NVIDIA TESLA M10**+
GRID vApps
16
QUADRO vDWS GUIDANCE
AutoCAD, Revit, Inventor
Solidworks, Siemens NX, Creo, Catia
Office, Sketchup
Adobe CC Photoshop, Illustrator Adobe CC Premiere Pro, After Effects, Autodesk Maya, 3ds Max, Mari, Nuke
Schlumberger, Halliburton, DeltaGen, Catia Live RenderingPACS/Diagnostics
Ansys, Abaqus, Simulia
Small/simple CAD models, video, Entry PLM
Medium size/complexity CAD models, Basic DCC, Medical Imaging, PLM
Large/complex CAD models,Advanced DCC, Medical Imaging
Large/complex CAD models,Seismic exploration, complexDCC effects, 3D Medical Imaging Recon
Largest CAD models, CAE,Photorealistic rendering,Seismic exploration, GPGPU compute
Deep learning, rendering, immersive visualization, and GPGPU compute applications
+ 4GB Memory
Tesla P4 / T4
Entry – Mid Range Quadro vDWS
Tesla P40
High-End Quadro vDWS
Tesla V100
17
UP TO 6X TESLA P4
NVIDIA recommends Intel Xeon Gold 6154 18-core 3.0 GHz which provides enough CPU resources to host 6x Tesla P4 GPUs with Quadro vDWS.
Tesla P4 benefits over Tesla M60:
- Performance*
- Price/Performance
- Smaller Form Factor
- Lower Power Consumption
- NVIDIA Pascal GPU Architecture Benefits
Best Density with 6x Tesla P4 on a Dual-Socket Server
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs
0.8
1
1.2
3dsMax Catia Maya Siemens NX SolidworksRela
tive P
erf
orm
ance
Tesla P4 and Tesla M60 Performance*
Tesla M60 GPU Tesla P4 GPU
0
0.5
1
1.5
3dsMax Catia Maya Siemens NX Solidworks
Rela
tive P
erf
orm
ance
Sufficient CPU resources to host 6x Tesla P4 with Quadro vDWS
1x Tesla P4 6x Tesla P4
12 V
Ms
12 V
Ms
24 V
Ms
24 V
Ms
24 V
Ms
2 V
Ms
2 V
Ms
4 V
Ms
4 V
Ms
4 V
Ms
18
TESLA P4 TESLA P40
Many Low-Mid End Users Few Mid-High End Users
Price/Performance Performance
Form FactorHigh Framebuffer Profiles
(12GB and 24GB)
Power Consumption
Different Profiles (Many P4s)
TESLA P40/V100 FOR ULTRA HIGH END USERS
Tesla P40 with Quadro vDWS for few high to ultra high end users.
Tesla V100 with Quadro vDWS for few high to ultra high end users and/or Deep Learning workflows.
When to choose Tesla P40 over P4:
- Maximum Performance*
- High Framebuffer profiles (12GB/24GB)
Multiple Tesla P4 GPUs are the most cost effective and flexible solution for many entry to mid range end users
Tesla P40 and Tesla V100 power the most demanding workflows
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs
0.5
1
1.5
2
2.5
Rela
tive P
erf
orm
ance
Tesla P4 and Tesla P40 Performance*
Tesla P4 Tesla P40
REMEMBER:
Large framebuffer GPUs don’t guarantee
high number of Quadro vDWS users
19
SIMPLE vGPU LICENSE DECISION
Do you use CUDA or
professional workstation apps?
Do you use VDI?
(Single user per OS)
You need Quadro vDWS Includes vPC and vApps entitlement
Do you have multiple users
sharing a single OS through
sessions? (RDSH, Horizon Apps,
XenApp, etc.)
You need GRID vPC Includes vApps entitlement
Yes!
Yes!
No
No
You need vAppsYes!
20
HOW NVIDIA MEASURES USER EXPERIENCEApplying Methodology of Physical PCs to Virtual PCs
End User Latency
Framerate
Image Quality Functionality
UX
Metrics Description
End User Latency Measures Interactivity, how remote your session feels
Framerate Measures the fluidity of your session
Image Quality Measures the impact of the remote protocol
Functionality Application and API compatibility
Consistency Measures how consistent the UX is over time
UNIQUELY Quantifies Remote User Experience
+ Monitors Resource Utilization
Metrics Description
Host Resources CPU, GPU, Memory, etc.
Virtual Machine Resources vCPUs, vMemory, vGPU, IOPS, etc.
Network Consumption Bandwidth, etc.
= Realistic Sizing Recommendations
21
NVIDIA IMAGE QUALITY RECOMMENDATION
YUV 4:2:0 YUV 4:4:4 YUV 4:2:0 YUV 4:4:4
Reference Image
Reference Image
YUV 4:4:4 for PC Users
GRID vPCYUV 4:2:0 for Workstation Users
Quadro vDWS
22
YUV 4:4:4 IMPLICATIONS
Similar Bandwidth Utilization*YUV 4:4:4 – 2% less bandwidth
Lower Remoted Frames*YUV 4:4:4 - 9% fewer Remoted Frames
Improved Image QualitySSIM increase to 0.989**
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload running 64 VMs with Tesla M10-1B** Tested with NVIDIA Cirrus VDI Benchmarking tool and predefined reference images to represent multiple workflows
0
20
40
60
80
100
120
140
160
180
200
Mb
its/
sec
64 VM Bandwidth (Mbit/sec)
YUV 4:2:0 YUV 4:4:40
0.25
0.5
0.75
1
YUV 4:2:0 YUV 4:4:4
Tota
l Rem
ote
d F
ram
es
Remoted Frames/User
0.9
0.92
0.94
0.96
0.98
1
YUV 4:2:0 YUV 4:4:4
SSIM
Image Quality
91%
23
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
GRID vPC for Multiple Screens
GRID vPC for High Screen Resolutions
350
350
1
1
0
20
40
60
80
1001x 1080p Screen
(CPU-Only)
1x 1080p Screen
(CPU-Only)
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
0
20
40
60
80
100
24
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
GRID vPC for Multiple Screens
GRID vPC for High Screen Resolutions
350
714
350
800
1
1.2
1
1.26
0
20
40
60
80
1001x 1080p Screen
(CPU-Only)
2x 1080p Screens
(CPU-Only)
1x 1080p Screen
(CPU-Only)
1x 4K Screens
(CPU-Only)
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
0
20
40
60
80
100
25
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
GRID vPC for Multiple Screens
GRID vPC for High Screen Resolutions
350
714
233
350
800
451 1
1.2
1.85
1
1.26
1.86
0
20
40
60
80
1001x 1080p Screen
(CPU-Only)
2x 1080p Screens
(CPU-Only)
2x 1080p Screens
(GRID vPC)
1x 1080p Screen
(CPU-Only)
1x 4K Screens
(CPU-Only)
1x 4K Screens
(GRID vPC)
End User Latency (ms) Server CPU Utilization (%) Remoted Frames / User
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
0
20
40
60
80
100
26
THANK YOU