ORNL is managed by UT-Battelle LLC for the US Department of Energy
GPU Enhanced Remote Collaborative Scientific Visualization
Benjamin Hernandez (OLCF), Tim Biedert (NVIDIA)
March 20th, 2019
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
2
Contents
• Part I – GPU Enhanced Remote Collaborative Scientific Visualization
• Part II – Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization
3
Oak Ridge Leadership Computing Facility (OLCF)
• Provide the computational and data resources required to solve the most challenging problems.
• Highly competitive user allocation programs (INCITE, ALCC).
  – OLCF provides 10x to 100x more resources than other centers.
• We collaborate with users who are diverse in expertise and geographic location.
4
What do these collaborations look like?
• Collaborations
  – Extend through the life cycle of the data, from computation to analysis and visualization.
  – Are structured around data.
• Data analysis and visualization
  – An iterative and sometimes remote process involving students, visualization experts, PIs, and stakeholders.

[Diagram: iterative cycle of Simulation, Pre-visualization, Evaluation, Visualization, and Discovery, driven by team needs, PI feedback, and viz. expert feedback]
5
What do these collaborations look like?
• The collaborative future must be characterized by [1]:
  – Discovery: resources easy to find
  – Connectivity: no resource is an island
  – Portability: resources widely and transparently usable
  – Centrality: resources efficiently and centrally supported

[1] U.S. Department of Energy (2011). Scientific Collaborations for Extreme-Scale Science (Workshop Report). Retrieved from https://indico.bnl.gov/event/403/attachments/11180/13626/ScientificCollaborationsforExtreme-ScaleScienceReportDec2011_Final.pdf
“Web-based immersive visualization Tools with the ease of a virtual reality game.”
“Visualization ideally would be combined with user-guided as well as template-guided automated feature extraction, real-time annotation, and quantitative geometrical analysis.”
“Rapid data visualization and analysis to enable understanding in near real time by a geographically dispersed team.”
…
6
What do these collaborations look like?
• INCITE “Petascale simulations of short pulse laser interaction with metals,” PI Leonid Zhigilei, University of Virginia
  – Laser ablation in vacuum and liquid environments
  – Atomistic simulations at the hundreds-of-millions to billion-atom scale, dozens of time steps
• INCITE “Molecular dynamics of motor-protein networks in cellular energy metabolism,” PI Abhishek Singharoy, Arizona State University
  – Atomistic simulations at the hundreds-of-millions-of-atoms scale, hundreds of time steps
7
What do these collaborations look like?
• The data is centralized at OLCF.
• SIGHT is a custom platform for interactive data analysis and visualization.
  – Support for collaborative features is needed for:
    • Low-latency remote visualization streaming
    • Simultaneous and independent user views
    • Collaborative multi-display environments
8
SIGHT: Exploratory Visualization of Scientific Data

• Designed around user needs
• Lightweight tool
  – Load your data
  – Perform exploratory analysis
  – Visualize/save results
• Heterogeneous scientific visualization
  – Advanced shading to enable new insights during data exploration
  – Multicore and manycore support
• Remote visualization
  – Server/client architecture to provide high-end visualization on laptops, desktops, and powerwalls
• Multi-threaded I/O
• Supports interactive/batch visualization
  – In situ (some effort)
• Designed with the OLCF infrastructure in mind
9
Publications

M. V. Shugaev, C. Wu, O. Armbruster, A. Naghilou, N. Brouwer, D. S. Ivanov, T. J.-Y. Derrien, N. M. Bulgakova, W. Kautek, B. Rethfeld, and L. V. Zhigilei, Fundamentals of ultrafast laser-material interaction, MRS Bull. 41 (12), 960-968, 2016.

C.-Y. Shih, M. V. Shugaev, C. Wu, and L. V. Zhigilei, Generation of subsurface voids, incubation effect, and formation of nanoparticles in short pulse laser interactions with bulk metal targets in liquid: Molecular dynamics study, J. Phys. Chem. C 121, 16549-16567, 2017.

C.-Y. Shih, R. Streubel, J. Heberle, A. Letzel, M. V. Shugaev, C. Wu, M. Schmidt, B. Gökce, S. Barcikowski, and L. V. Zhigilei, Two mechanisms of nanoparticle generation in picosecond laser ablation in liquids: the origin of the bimodal size distribution, Nanoscale 10, 6900-6910, 2018.
10
SIGHT’s System Architecture
[Diagram]
SIGHT Client: web browser (JavaScript)
OLCF: SIGHT
  – Multi-threaded Dataset Parser
  – Ray Tracing Backends¹: NVIDIA OptiX, Intel OSPRay
  – Data Parallel Analysis²
  – Frame Server (WebSockets): SIMD ENC (TurboJPEG), GPU ENC (NVENC / NVPipe)

¹ S7175, Exploratory Visualization of Petascale Particle Data in NVIDIA DGX-1, NVIDIA GPU Technology Conference 2017.
² P8220, Heterogeneous Selection Algorithms for Interactive Analysis of Billion-Scale Atomistic Datasets, NVIDIA GPU Technology Conference 2018.
11
Design of Collaborative Infrastructure - Alternative 1

[Diagram: User 1, User 2, and User 3 each drive their own SIGHT instance running on separate node(s), all accessing the shared data]
12
Design of Collaborative Infrastructure - Alternative 1

[Diagram: each user's SIGHT instance runs as its own batch job (Job 1, Job 2, Job 3 / "Job ?") on separate node(s), all accessing the shared data]
13
Design of Collaborative Infrastructure - Alternative 1

[Diagram: separate SIGHT jobs (Job 1, Job 2, Job 3) per user on separate node(s), all accessing the shared data]

Issues: co-scheduling! Job coordination! Communication!
14
Design of Collaborative Infrastructure - Alternative 2

[Diagram: a single SIGHT instance running as one job (Job 1) on shared node(s) serves User 1, User 2, and User 3, with direct access to the data]
15
Design of Collaborative Infrastructure
• ORNL Summit System Overview
  – 4,608 nodes
  – Dual-port Mellanox EDR InfiniBand network
  – 250 PB IBM file system transferring data at 2.5 TB/s
• Each node has
  – 2 IBM POWER9 processors
  – 6 NVIDIA Tesla V100 GPUs
  – 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
  – 1.6 TB of NV memory

[Diagram: a single SIGHT instance on shared node(s) serves User 1, User 2, and User 3, with direct access to the data]
16
Design of Collaborative Infrastructure
• Each NVIDIA Tesla V100 GPU
  – 3 NVENC chips
  – Unrestricted number of concurrent sessions
• NVPipe
  – Lightweight C API library for low-latency video compression
  – Easy access to NVIDIA's hardware-accelerated H.264 and HEVC video codecs
17
Enhancing SIGHT Frame Server: Low latency encoding

[Diagram: the SIGHT client (web browser, JavaScript) sends GUI events (x, y, button) and keyboard input; the ray tracing backend writes the framebuffer, the frame server (WebSockets) compresses it with NVENC, and the visualization stream is returned to the client]

Sharing the OptiX buffer (OptiXpp):

// Create the output buffer and obtain its CUDA device pointer so the encoder can compress it without a host copy.
Buffer frameBuffer = context->createBuffer( RT_BUFFER_OUTPUT, RT_FORMAT_UNSIGNED_BYTE4, m_width, m_height );
frameBufferPtr = frameBuffer->getDevicePointer(optixDevice);
...
compress(frameBufferPtr);
18
Enhancing SIGHT Frame Server: Low latency encoding

Opening the encoding session:

// NVPIPE_RGBA32 input, H.264, lossy compression at the requested bitrate and target frame rate.
myEncoder = NvPipe_CreateEncoder(NVPIPE_RGBA32, NVPIPE_H264, NVPIPE_LOSSY,
                                 bitrateMbps * 1000 * 1000, targetFps);
if (!myEncoder)
    return error;

// Worst-case output buffer: one uncompressed RGBA frame.
myCompressedImg = new unsigned char[w * h * 4];

Framebuffer compression:

bool compress (myFramebufferPtr)
{
    ...
    // Encode directly from the device pointer; the pitch is w*4 bytes for RGBA32.
    myCompressedSize = NvPipe_Encode(myEncoder, myFramebufferPtr, w * 4,
                                     myCompressedImg, w * h * 4, w, h, true);
    if (myCompressedSize == 0)
        return error;
    ...
}

Closing the encoding session:

NvPipe_Destroy(myEncoder);
[Diagram: GUI events (x, y, button), keyboard input, and camera parameters flow from the SIGHT client (web browser, JavaScript) to the frame server (WebSockets); the ray tracing backend produces the framebuffer, NVENC compresses it, and the visualization stream is sent back to the client]
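For reference, the fragments above can be assembled into one encode loop. The sketch below is illustrative only, assuming the five-argument NvPipe_CreateEncoder shown on this slide; renderingActive(), renderNextFrame(), and sendToClient() are hypothetical hooks standing in for SIGHT's ray tracing backend and WebSocket frame server.

#include <NvPipe.h>
#include <cstdint>
#include <vector>

// Hypothetical application hooks (not part of NvPipe):
bool renderingActive();                                   // keep streaming?
const void* renderNextFrame();                            // device pointer to the RGBA framebuffer
void sendToClient(const uint8_t* data, uint64_t size);    // WebSocket send

bool streamFrames(uint32_t w, uint32_t h, uint64_t bitrateMbps, uint32_t targetFps)
{
    NvPipe* encoder = NvPipe_CreateEncoder(NVPIPE_RGBA32, NVPIPE_H264, NVPIPE_LOSSY,
                                           bitrateMbps * 1000 * 1000, targetFps);
    if (!encoder)
        return false;

    std::vector<uint8_t> compressed(w * h * 4);           // worst-case output size

    while (renderingActive())
    {
        const void* devicePtr = renderNextFrame();         // e.g., the shared OptiX buffer

        // Encode straight from GPU memory; pitch is w*4 bytes for RGBA32.
        uint64_t size = NvPipe_Encode(encoder, devicePtr, w * 4,
                                      compressed.data(), compressed.size(),
                                      w, h, /*forceIFrame=*/false);
        if (size == 0)
            break;                                         // encode error

        sendToClient(compressed.data(), size);
    }

    NvPipe_Destroy(encoder);
    return true;
}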
19
Enhancing SIGHT Frame Server: Low latency encoding

• System configuration:
  – DGX-1 Volta
  – Connection bandwidth: 800 Mbps (ideally 1 Gbps)
  – NVENC encoder: H.264 baseline profile, 32 Mbps, 30 FPS
  – TurboJPEG: SIMD instructions on, JPEG quality 50
  – Decoders: Broadway.js (Firefox 65), Media Source Extensions (Chrome 72), built-in JPEG decoder (Chrome 72)

Average encoding results:
                      NVENC    NVENC+MP4   TJPEG
Encoding HD (ms)      4.65     6.05        16.71
Encoding 4K (ms)      12.13    17.89       51.89
Frame size HD (KB)    116.00   139.61      409.76
Frame size 4K (KB)    106.32   150.65      569.04

Average decoding results:
                      Broadway.js   MSE     Built-in JPEG
Decoding HD (ms)      43.28         39.97   78.15
Decoding 4K (ms)      87.40         53.10   197.63
20
Enhancing SIGHT Frame Server: Low latency encoding
21
Enhancing SIGHT Frame Server: Low latency encoding
22
Enhancing SIGHT Frame Server: Simultaneous and independent user views

• A Summit node can produce 4K visualizations with NVIDIA OptiX at interactive rates.

[Diagram: a single 4K frame can be split into four HD views for Users 1-4, or a 32:9 EVEREST display into two 16:9 views for User A and User B]
23
Enhancing SIGHT Frame Server: Simultaneous and independent user views
24
Enhancing SIGHT Frame Server: Simultaneous and independent user views

[Diagram: Users 1-3 send GUI events (x, y, button; keyboard) into a GUI event queue; the ray tracing backend consumes per-user camera parameters (user id) and pushes rendered visualization frames, tagged with the user id, into a frame buffer queue; frame server threads 0-2, one per user, deliver each user's frames]
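One plausible way to organize the per-user frame server threads sketched above, assuming one thread and one encoder per user and a simple thread-safe queue; the UserFrame struct, FrameQueue class, and encodeAndSend() hook are illustrative placeholders, not SIGHT's actual implementation.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative frame record tagged with the user it belongs to.
struct UserFrame {
    int userId;
    std::vector<uint8_t> pixels;   // rendered RGBA frame for this user's camera
    uint32_t width = 0, height = 0;
};

// Minimal thread-safe queue; the ray tracing backend pushes, a frame server thread pops.
class FrameQueue {
public:
    void push(UserFrame f) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(f)); }
        ready_.notify_one();
    }
    UserFrame pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        UserFrame f = std::move(queue_.front());
        queue_.pop();
        return f;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<UserFrame> queue_;
};

// Hypothetical hook: per-user NVENC/NvPipe encoder + WebSocket connection.
void encodeAndSend(int userId, const UserFrame& frame);

// One frame server thread per user: each user's view is encoded and streamed
// independently, so one user's interaction rate never stalls another's.
// (Shutdown/termination handling omitted for brevity.)
void frameServerThread(int userId, FrameQueue& queue) {
    for (;;) {
        UserFrame frame = queue.pop();
        encodeAndSend(userId, frame);
    }
}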
25
Enhancing SIGHT Frame Server: Simultaneous and independent user views
• Video
26
Discussion
• Further work
  – Multi-perspective/orthographic projections
  – Traditional case: stereoscopic projection
  – Annotations, sharing content between users, saving sessions
• How AI could help improve remote visualization performance:
  – NVIDIA NGX technology
    • AI Up-Res: improve compression rates and serve different resolutions
      – Render and stream at 720p; decode at HD, 2K, or 4K according to each user's display
    • AI Slow-Mo: render at low frame rates
    • OptiX Denoiser: ray tracing converges faster
27
Thanks!
Benjamin Hernandez
Advanced Data and Workflows Group
Oak Ridge National Laboratory
[email protected]
• Acknowledgments
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Datasets provided by Cheng-Yu Shih and Leonid Zhigilei, Computational Materials Group at the University of Virginia, INCITE award MAT130.
Tim Biedert, 03/20/2019
HARDWARE-ACCELERATED MULTI-TILE STREAMING FOR REALTIME REMOTE VISUALIZATION
2
INSTANT VISUALIZATION FOR FASTER SCIENCE
Traditional (slower time to discovery): CPU supercomputer plus separate viz cluster; simulation ~1 week, data transfer ~days, visualization ~1 day; multiple iterations; time to discovery = months.

Tesla platform (faster time to discovery): GPU-accelerated supercomputer; visualize while you simulate, without data transfers; restart simulation instantly; multiple iterations; time to discovery = weeks.

Value proposition:
• Interactivity
• Scalability
• Flexibility
3
SUPPORTING MULTIPLE VISUALIZATION WORKFLOWS

• Legacy workflow: separate compute & vis systems; communication via the file system
• Co-processing: compute and visualization on the same GPU; communication via host-device transfers or memcpy
• Partitioned system: different nodes for different roles; communication via a high-speed network
4
VISUALIZATION-ENABLED SUPERCOMPUTERS
CSCS Piz Daint, NCSA Blue Waters, ORNL Titan

• Galaxy formation: http://blogs.nvidia.com/blog/2014/11/19/gpu-in-situ-milky-way/
• Molecular dynamics: http://devblogs.nvidia.com/parallelforall/hpc-visualization-nvidia-tesla-gpus/
• Cosmology: http://www.sdav-scidac.org/29-highlights/visualization/66-accelerated-cosmology-data-anal.html
5
GEFORCE NOW
You listen to music on Spotify. You watch movies on Netflix. GeForce Now lets you play games the same way.
Instantly stream the latest titles from our powerful cloud-gaming supercomputers. Think of it as your game console in the sky.
Gaming is now easy and instant.
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

[Chart: V100 performance normalized to P100 across HPC applications, roughly 1.4x to 1.7x; 1.5X HPC performance in 1 year. System config: 2X Xeon E5-2690 v4, 2.6 GHz, with 2X Tesla P100 or V100.]

Summit supercomputer: 200+ PetaFlops, ~3,400 nodes, 10 Megawatts
6
VISUALIZATION TRENDS
New Approaches Required to Solve the Remoting Challenge
Increasing data set sizes
In-situ scenarios
Interactive workflows
New display technologies
Globally distributed user bases
7
STREAMING
Benefits of Rendering on the Supercomputer

• Scale with the simulation: no need to scale a separate vis cluster
• Cheaper infrastructure: all heavy lifting performed on the server
• Interactive high-fidelity rendering: improves perception and scientific insight
8
FLEXIBLE GPU ACCELERATION ARCHITECTURE
Independent CUDA Cores & Video Engines

* Diagram represents support for the NVIDIA Turing GPU family
** 4:2:2 is not natively supported on HW
*** Support is codec dependent
9
CASE STUDY
Streaming of Large Tile Counts
Frame rates / Latency / Bandwidth
Synchronization
Comparison against CPU-based Compressors
Strong Scaling (Direct-Send Sort-First Compositing)
Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization
Tim Biedert, Peter Messmer, Tom Fogal, Christoph Garth
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV) 2018 (Best Paper)
DOI: 10.2312/pgv.20181093
10
CONCEPTUAL OVERVIEW
Asynchronous Pipelines
11
BENCHMARK SCENES
NASA Synthesis 4K

• Space: low complexity
• Orbit: medium complexity
• Ice: high complexity
• Streamlines: extreme complexity
12
CODEC PERFORMANCE
13
BITRATE VS QUALITY
Structural Similarity Index (SSIM)

[Chart: SSIM (0.6 to 1.0) vs. bitrate at 90 Hz (1 to 256 Mbps) for the Space, Orbit, Ice, and Streamlines scenes, each encoded with H.264 and HEVC]
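For reference, SSIM here is the standard structural similarity index between image patches x and y, with small constants c1 and c2 stabilizing the division (general background, not a definition specific to the paper):

\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}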
14
HARDWARE LATENCY

[Chart: encode and decode latency (0 to 20 ms) vs. bitrate at 90 Hz (1 to 256 Mbps), for H.264 and HEVC across the Space, Orbit, Ice, and Streamlines scenes]
15
HARDWARE LATENCY
Decreases with Resolution

[Chart: encode and decode latency (0 to 16 ms) for H.264 and HEVC at resolutions from 3840x2160 down to 480x270]
16
CPU-BASED COMPRESSION
Encode/Decode Latency and Bandwidth

[Chart: encode latency, decode latency (0 to 120 ms), and bandwidth (0 to 2400 MB/s) for HEVC (GPU), H.264 (GPU), HEVC (CPU), H.264 (CPU), TurboJPEG, BloscLZ, LZ4, and Snappy]
17
FULL TILES STREAMING
18
N:N WITH SIMULATED NETWORK DELAY
Mean Frame Rates + Min/Max Ranges

[Chart: frame rate (0 to 120 Hz) vs. network delay (0 ms, 50 ms, 150 ms, 500 ms) for 1 to 256 tiles. Server: Piz Daint; client: Piz Daint; Ice 4K; H.264: 32 Mbps, HEVC: 16 Mbps; delay + 10% jitter]
19
N:N STREAMING
Pipeline Latencies

[Chart: per-stage latency (0 to 45 ms) vs. tile count (1 to 256), broken down into synchronize (servers), encode, network, decode, and synchronize (clients). Server: Piz Daint; client: Piz Daint; Ice 4K; H.264: 32 Mbps; MPI-based synchronization]
20
N:1 STREAMING
Client-Side Frame Rate

[Chart: frame rate (0 to 90 Hz) vs. tile count (1 to 256) for Site A (1x GP100), Site B (1x GP100), and Site B (2x GP100). Server: Piz Daint; clients: Site A (5 ms), Site B (25 ms); Ice 4K; H.264: 32 Mbps]
21
STRONG SCALING
22
N:1 STRONG SCALING
Client-Side Frame Rate

[Chart: frame rate (0 to 400 Hz) vs. tile count (1 to 256) for Site A (1x GP100), Site B (1x GP100), and Site B (2x GP100). Server: Piz Daint; clients: Site A (5 ms), Site B (25 ms); Ice 4K; H.264: 32 Mbps]
23
N:1 STRONG SCALING
Client-Side Frame Rate for Different Bitrates

[Chart: frame rate (0 to 200 Hz) vs. tile count (1 to 256) at 4 Mbps, 72 Mbps, and 256 Mbps (each at 90 Hz). Server: Piz Daint; client: Site A (5 ms); Ice 4K; H.264]
24
EXAMPLE: OPTIX PATH TRACER
Rendering Highly Sensitive to Tile Size

[Chart: frame rate (0 to 30 Hz) vs. tile count (1 to 256), regular tiling vs. auto-tuned tiling. Server: Piz Daint; client: Site A (5 ms); H.264]
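To make the "regular tiling" baseline concrete, below is a minimal sketch of splitting a frame into an N-tile grid so each tile can be rendered and encoded by its own pipeline; this is illustrative only and is not the auto-tuning scheme from the paper.

#include <cstdint>
#include <vector>

struct Tile { uint32_t x, y, width, height; };

// Regular tiling: factor 'count' into rows x cols (rows <= cols) and split the
// frame into a uniform grid. Each tile would get its own renderer + encoder.
std::vector<Tile> regularTiling(uint32_t frameW, uint32_t frameH, uint32_t count)
{
    uint32_t rows = 1;
    for (uint32_t d = 1; d * d <= count; ++d)
        if (count % d == 0)
            rows = d;                       // largest divisor <= sqrt(count)
    uint32_t cols = count / rows;

    std::vector<Tile> tiles;
    for (uint32_t r = 0; r < rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            uint32_t x0 = c * frameW / cols;
            uint32_t y0 = r * frameH / rows;
            uint32_t x1 = (c + 1) * frameW / cols;
            uint32_t y1 = (r + 1) * frameH / rows;
            tiles.push_back({x0, y0, x1 - x0, y1 - y0});
        }
    }
    return tiles;
}

// Example: regularTiling(3840, 2160, 16) yields a 4x4 grid of 960x540 tiles.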
25
INTEROPERABILITY
26
STANDARD-COMPLIANT BITSTREAM
Web Browser Streaming Example

• EGL-based shim GLUT
• Stream the unmodified simpleGL example from a headless node to a web browser (with interaction!)
• JavaScript client
• WebSocket-based bidirectional communication
• On-the-fly MP4 wrapping of H.264
27
RESOURCES
28
VIDEO CODEC SDK
APIs for Hardware-Accelerated Video Encode/Decode

What's New with Turing GPUs and Video Codec SDK 9.0
• Up to 3x decode throughput with multiple decoders on professional cards (Quadro & Tesla)
• Higher quality encoding - H.264 & H.265
• Higher encoding efficiency (15% lower bitrate than Pascal)
• HEVC B-frames support
• HEVC 4:4:4 decoding support
NVIDIA GeForce Now is made possible by leveraging NVENC in the datacenter and streaming the result to end clients
https://developer.nvidia.com/nvidia-video-codec-sdk
29
NVPIPE
A Lightweight Video Codec SDK Wrapper

• Simple C API
• H.264, HEVC
• RGBA32, uint4, uint8, uint16
• Lossy, lossless
• Host/device memory, OpenGL textures/PBOs
• https://github.com/NVIDIA/NvPipe
• Issues? Suggestions? Feedback welcome!
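Complementing the encode-side fragments earlier in the deck, here is a minimal decode-side sketch. The NvPipe_CreateDecoder/NvPipe_Decode calls follow the NvPipe GitHub README of that period; the exact argument lists have varied between releases, so treat this as an assumption and check NvPipe.h for the version you build.

#include <NvPipe.h>
#include <cstdint>
#include <vector>

// Decode one received packet into an RGBA frame (host memory).
std::vector<uint8_t> decodeFrame(NvPipe* decoder,
                                 const uint8_t* packet, uint64_t packetSize,
                                 uint32_t w, uint32_t h)
{
    std::vector<uint8_t> rgba(w * h * 4);
    uint64_t decoded = NvPipe_Decode(decoder, packet, packetSize, rgba.data(), w, h);
    if (decoded == 0)
        rgba.clear();                      // decode error; see NvPipe_GetError(decoder)
    return rgba;
}

// Usage (hypothetical):
//   NvPipe* decoder = NvPipe_CreateDecoder(NVPIPE_RGBA32, NVPIPE_H264, width, height);
//   std::vector<uint8_t> frame = decodeFrame(decoder, packet, packetSize, width, height);
//   NvPipe_Destroy(decoder);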
30
Conclusion
GPU-accelerated video compression opens up novel and fast solutions to the large-scale remoting challenge.

Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
NvPipe: https://github.com/NVIDIA/NvPipe
We want to help you solve your large-scale vis problems on NVIDIA!