ComP-Net
Command Processor Networking for Efficient Intra-kernel Communications on GPUs
Michael LeBeane1,2, Khaled Hamidouche1, Brad Benton1, Mauricio Breternitz3, Steven K. Reinhardt4,
Lizy K. John2
1 Advanced Micro Devices Inc., 2 The University of Texas at Austin, 3 INESC-ID Lisboa, 4 Microsoft Corporation
Presented by Sooraj Puthoor1
GPUS AND NETWORKS IN THE WILD
• GPUs are everywhere in HPC, machine learning, and beyond
  • Excellent performance/watt for many classes of data-parallel computation
• Many GPUs are required to solve the biggest computational problems
  • Can only fit so many GPUs in a single node!
• GPUs talk to each other through Network Interface Controllers (NICs)
  • Path between GPU and NIC must be efficient
• Vendors are selling machines filled with many GPUs and NICs
  • Inventec Project 47 node: 4 Radeon Instinct GPUs, 2 Mellanox 100G NICs, 1 EPYC 7601 32-core CPU (2:1 GPU/NIC ratio)
Introduction
OVERVIEW OF GPU NETWORKING
• Much industry and academic work in the area
• Can largely be broken down into two domains:
  • Data Path: i.e., where the data that goes across the network flows
  • Control Path: i.e., who tells the NIC to move the data across the network
Introduction
DATA PATH OPTIMIZATIONS
• Direct path from discrete GPU memory to NIC
  • No bounce buffers or host memory copies
• Implemented in Mellanox's PeerDirect interface for their NICs
• Used by AMD's ROCn RDMA [1] and Nvidia's GPUDirect RDMA [2]
Introduction
[Figure: initiator and target nodes. On each node, the NIC reads and writes discrete GPU memory directly through the IOC, bypassing the CPU's caches and host memory; data flows between the NICs over the network.]
[1] AMD. 2017. ROCn RDMA. https://github.com/RadeonOpenCompute/ROCnRDMA
[2] Mellanox. 2017. Mellanox OFED GPUDirect RDMA. http://www.mellanox.com/page/products_dyn?product_family=116
CONTROL PATH OPTIMIZATIONS
Introduction
a_kernel<<<…, stream>>>(buf);
cudaStreamSynchronize(stream);
netSend(buf);
netWait();
b_kernel<<<…, stream>>>(buf);
cudaStreamSynchronize(stream);
Host-Driven Networking:
1. CPU schedules kernel and waits for completion
2. CPU posts network operation and waits for completion
3. CPU schedules and waits on final kernel
[Figure: GPU/CPU/NIC timeline with steps 1-3 marked.]
a_kernel<<<…, stream>>>(buf);
netPost_async(stream, qp, buf);
netWait_async(stream, txcq);
b_kernel<<<…, stream>>>(buf);
cudaStreamSynchronize(stream);
GPUDirect Async (GDS) [3]:
1. CPU schedules kernel, network operation, and final kernel
2. GPU triggers initiation of a network operation after the kernel completes
3. GPU launches final kernel
[Figure: GPU/CPU/NIC timeline with steps 1-3 marked.]
• GDS removes the CPU from the critical path and avoids control flow switches
• Communication events triggered at kernel boundaries
[3] Elena Agostini, Davide Rossetti, and Sreeram Potluri. 2017. Offloading Communication Control Logic in GPU Accelerated Applications. In Intl. Symp. on Cluster, Cloud and Grid Computing (CCGrid).
OVERHEAD OF KERNEL BOUNDARY COMMUNICATION
• Kernel launch latencies are much higher than HPC network overheads!
  • Can be up to 20µs for a kernel launch!
  • Compare that to the < 1µs it takes to get to another node over the network
• Obvious solution: can you do networking from within a kernel? Absolutely!
• Two main schools of thought here…
Introduction
[Figure: kernel launch latency (µs) vs. number of kernel commands queued (1 to 512) for three GPUs; smaller is better.]
GPU NATIVE NETWORKING [4, 5, 6, 7]
• Run a networking stack on the GPU
• Allow the GPU work-items to directly interact with the network adapter
• Pros
• Completely decoupled from the CPU
• Can be performant/low-latency
• Cons
• Hard to talk to network interface designed for CPUs
• Can suffer from significant control flow divergence and register pressure
Introduction
[4] Lena Oden, Holger Froning, and Franz-Joseph Pfreundt. 2014. Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU. In Intl. Conf. on Parallel Distributed Processing Symposium Workshops (IPDPSW). 976–983.
[5] Benjamin Klenk, Lena Oden, and Holger Froning. 2014. Analyzing Put/Get APIs for Thread-Collaborative Processors. In Intl. Conf. on Parallel Processing (ICPP) Workshops.
[6] Benjamin Klenk, Lena Oden, and Holger Froning. 2015. Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time. In Intl. Symp. on Performance Analysis of Systems and Software (ISPASS).
[7] Feras Daoud, Amir Watad, and Mark Silberstein. 2016. GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels. In Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS). 6:1–6:8.
[Figure: Put timelines. Host-driven networking: the CPU launches a kernel, waits, sends, waits, and launches again (kernel-boundary networking). GPUDirect Async (GDS): the CPU pre-schedules everything and the GPU triggers the send between kernels (kernel-boundary networking). GPU native networking: the GPU sends from inside a single kernel (intra-kernel networking).]
GPU HOST NETWORKING [8, 9, 10]
• Run your networking stack on the CPU
• Have the GPU place network requests in a producer/consumer queue for the CPU
• Use threads on the CPU to process messages and synchronize with system atomics (a host-side sketch follows below)
• Pros
• Lots of flexibility on the CPU to improve performance through coalescing, etc.
• Cons
• Additional latency imposed by the indirection
• Scales poorly with more or bigger GPUs
Introduction
[8] Jeff A. Stuart and John D. Owens. 2009. Message Passing on Data-Parallel Architectures. In Intl. Symp. on Parallel Distributed Processing (IPDPS). 1–12.
[9] Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Emmett Witchel, Amir Wated, and Mark Silberstein. 2014. GPUnet: Networking Abstractions for GPU Programs. In USENIX Conf. on Operating Systems Design and Implementation (OSDI). 201–216.
[10] Tobias Gysi, Jeremia Bär, and Torsten Hoefler. 2016. dCUDA: Hardware Supported Overlap of Computation and Communication. In Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC). Article 52, 12 pages.
[Figure: Put timeline for GPU host networking, alongside the timelines above. The GPU stays inside one kernel and hands the send to a CPU service thread, which posts it to the NIC: intra-kernel networking, but every request detours through the host.]
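As a flavor of this approach, here is a minimal host-side sketch (C++; the request layout, flag values, and the elided NIC call are illustrative assumptions, not any particular library's API):

    #include <cstdint>

    // One request slot in GPU-visible pinned host memory (e.g., allocated with
    // hipHostMalloc) that GPU work-groups fill and a CPU service thread drains.
    // Illustrative layout only; real runtimes batch and coalesce requests.
    struct NetRequest {
        uint64_t dest, src, size;     // where to send what
        volatile uint32_t state;      // 0 = free, 1 = posted by GPU, 2 = done
    };

    void host_service_thread(NetRequest *slots, int num_slots) {
        for (;;) {
            for (int i = 0; i < num_slots; ++i) {
                // System-scope acquire load: observe the GPU's store over PCIe.
                if (__atomic_load_n(&slots[i].state, __ATOMIC_ACQUIRE) == 1) {
                    // ... post slots[i] to the NIC (e.g., a verbs or Portals4 put) ...
                    __atomic_store_n(&slots[i].state, 2u, __ATOMIC_RELEASE);
                }
            }
        }
    }

Every handshake above crosses the IO bus at least twice, which is exactly the latency and scalability problem quantified on the next slide.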
PERFORMANCE PROBLEMS WITH GPU HOST NETWORKING
• Need multiple trips over the IO bus
• Where to place queues?
  • GPU memory vs. host memory
  • High latency in both cases
• Not scalable
  • 40µs latency with 8 threads
  • Bigger/more GPUs reduce scalability
Introduction
[Figure: service time (µs) vs. active workgroups; smaller is better. Left: host queues vs. GPU queues, both well above raw network latency. Right: 1 to 8 CPU service threads, still far above network latency at high workgroup counts.]
COMMAND PROCESSOR NETWORKING (COMP-NET) OVERVIEW
• Uses built-in CP to support network operations
• CP/GPU communicate over shared L2 cache instead of PCIe
• Potentially much faster (lower latency) than other GPU Host Networking designs
• CP resources can scale with other GPU resources
ComP-Net
[Figure: GPU host networking places host queues in host memory, reached from the CUs over PCIe; ComP-Net places CP queues in the GPU's shared L2 cache between the CUs and CPs. In both designs, network queues in memory connect to the NIC over PCIe.]
COMMAND PROCESSOR OVERVIEW
• GPUs have built-in CPUs called Command Processors (CPs)
  • Scalar cores == good at running network runtime code
• Can connect to GPU CUs through a shared LLC
• Traditionally used to launch kernels
  • But intra-kernel networking encourages fewer kernels…
Introduction
[Figure: GPU block diagram. A Compute Unit (four SIMD engines, Local Data Share, L1 cache) and the Command Processor (a CPU core with its own L1 cache) share the GPU's L2 cache in front of GPU memory.]
Can we leverage CPs for intra-kernel networking?
COMP-NET PRODUCER/CONSUMER QUEUE
• The main component of the ComP-Net runtime is a CP/GPU producer/consumer queue
• Most steps are straightforward (a sketch of the queue layout follows below)
ComP-Net
[Figure: ComP-Net producer/consumer queue. The queue itself (a Write Idx, a global Read Idx, and Queue Entries, each with a Status flag) lives at the cache/memory coherence point shared by the GPU and CP. The work-group keeps a ComP-Net GPU context (Base Ptr, Read Idx Ptr, Local Read Idx) in LDS or non-coherent cache; each command processor thread keeps the same context in registers or non-coherent cache.]
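For reference during the walkthrough on the following slides, here is a minimal sketch of one queue's layout (HIP-style C++; the type and field names are illustrative assumptions, not the paper's actual runtime API):

    #include <cstdint>

    struct Entry {
        uint64_t dest, src, size;     // networking metadata (small payloads can be inlined instead)
        volatile uint32_t status;     // the Status flag: 0 = free/complete, 1 = posted
    };

    // One queue per work-group, serviced by one CP thread. The entries and the
    // global Read Idx live at the cache/memory coherence point shared by the GPU
    // and CP; each side keeps local copies (in LDS or registers) to avoid
    // crossing that point on every access.
    struct CompNetQueue {
        Entry *entries;               // Base Ptr
        volatile uint32_t *read_idx;  // Read Idx Ptr (global Read Idx, advanced by the CP)
        uint32_t num_entries;
        uint32_t write_idx;           // Write Idx (producer side)
        uint32_t local_read_idx;      // Local Read Idx (cached copy of *read_idx)
    };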
COMP-NET PRODUCER/CONSUMER QUEUE
• 1a) Check if the queue is full (using the local Read Idx)
• 1b) If full, refresh the local Read Idx from the global Read Idx and loop until space is available
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 2) Fill the Queue Entry with networking metadata
  • Or inline small payloads in the Queue Entry itself
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 3) Set Status flag with release marker to notify CP
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 4) Increment the local Write Idx (producer steps 1a-4 are sketched in code below)
ComP-Net
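Putting producer steps 1a-4 together, a device-side sketch under the hypothetical layout above (per-work-group serialization, e.g. electing a single lane to post, is omitted for brevity):

    __device__ uint32_t comp_net_post(CompNetQueue *q,
                                      uint64_t dest, uint64_t src, uint64_t size) {
        // 1a/1b) Check for space using the cached local Read Idx; if the queue
        // looks full, refresh the copy from the global Read Idx and re-check.
        while (q->write_idx - q->local_read_idx >= q->num_entries)
            q->local_read_idx = *q->read_idx;

        uint32_t slot = q->write_idx % q->num_entries;
        Entry *e = &q->entries[slot];

        // 2) Fill the Queue Entry with networking metadata.
        e->dest = dest;
        e->src  = src;
        e->size = size;

        // 3) Release marker, then set the Status flag so the CP only ever
        //    observes a fully written entry.
        __threadfence_system();
        e->status = 1;

        // 4) Increment the local Write Idx.
        q->write_idx++;
        return slot;                  // the caller polls this slot in step 5
    }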
COMP-NET PRODUCER/CONSUMER QUEUE
• 1) Poll on next Queue Entry based on local Read Idx with acquire marker
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 2) Read data from Queue Entry
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 3) Perform the network operation and, when complete, set the Status flag to 0 with a release marker
ComP-Net
COMP-NET PRODUCER/CONSUMER QUEUE
• 4a) Update the global Read Idx with a release marker
• 4b) Update the local Read Idx (the CP's side of the protocol is sketched in code below)
ComP-Net
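The CP's side of the protocol, steps 1-4b, sketched as the service loop running on the command processor's scalar core (net_perform_op is a hypothetical stand-in for handing the request to the NIC, e.g. a Portals4 put):

    void net_perform_op(uint64_t dest, uint64_t src, uint64_t size);  // hypothetical NIC hand-off

    void cp_service_loop(CompNetQueue *q) {
        uint32_t rd = 0;              // the CP thread's Local Read Idx
        for (;;) {
            Entry *e = &q->entries[rd % q->num_entries];

            // 1) Poll the next Queue Entry's Status flag with an acquire marker.
            while (__atomic_load_n(&e->status, __ATOMIC_ACQUIRE) != 1)
                ;                     // spin until a work-group posts

            // 2) Read the networking metadata from the Queue Entry.
            uint64_t dest = e->dest, src = e->src, size = e->size;

            // 3) Perform the network operation, then clear the Status flag with
            //    a release marker so the waiting work-group observes completion.
            net_perform_op(dest, src, size);
            __atomic_store_n(&e->status, 0u, __ATOMIC_RELEASE);

            // 4a) Publish the new global Read Idx with a release marker ...
            __atomic_store_n(q->read_idx, rd + 1, __ATOMIC_RELEASE);
            // 4b) ... then advance the local copy.
            ++rd;
        }
    }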
COMP-NET PRODUCER/CONSUMER QUEUE
• 5) The GPU checks the Status flag to determine when the CP has completed the operation (sketched below)
ComP-Net
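And the matching completion check back on the GPU, step 5, under the same hypothetical layout:

    __device__ void comp_net_wait(CompNetQueue *q, uint32_t slot) {
        // 5) Spin until the CP resets our entry's Status flag to 0 ...
        while (q->entries[slot].status != 0)
            ;
        // ... then fence before consuming anything the operation produced.
        __threadfence_system();
    }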
TACKLING GPU CACHE THRASHING
• Residency of data in GPU L2 cache is very short
• Work-group data produced for CP is evicted when other work-groups perform streaming memory accesses
• Can be solved through cache line locking
  • Preliminary results are promising
  • Still much to explore here
ComP-Net
[Figure: L2 hit rate for the CP vs. the mix of networking and streaming wavefronts (40/0 down to 5/35), baseline vs. LLC locking.]
SIMULATION ENVIRONMENT
• gem5 [11] + AMD GCN3 GPU model [12] + internal Portals4 NIC model
• CPU power model with McPAT [13]
• Baseline model is a coherent APU
  • dGPU modeled with extra delay for the I/O bus and by disabling coherence probes
Results
[Figure: simulated system. CPU: multiple cores, each with L1I/L1D and a private L2, sharing an L3. GPU: GPU cores with L1Ds and shared sequencer caches (SQC), a GPU L2, and a CP core with its own L1I/L1D. A directory and memory controllers front memory; the NIC contains NIC processors, DMA engines, L1I/L1D, and the network interface.]
[11] N. Binkert et al. 2011. The gem5 Simulator. SIGARCH Comput. Archit. News.
[12] A. Gutierrez et al. 2018. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. In HPCA.
[13] S. Li et al. 2009. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In MICRO.
SIMULATOR CONFIGURATION
• CPU: standard CPU-only system
  • Baseline non-accelerated system
• HDN: Host-Driven Networking
  • Kernel-boundary networking (host MPI + HIP)
Intra-kernel networking schemes:
• APU: CPU/GPU on the same die
  • Intra-kernel networking through host threads on an APU
• dGPU: GPU Host Networking
  • Intra-kernel networking on a dGPU via host threads
• ComP-Net: Command Processor Networking
  • Intra-kernel networking through the command processor
Results

CPU and Memory Configuration
  Type: 8-wide OOO, x86, 8 cores @ 4GHz
  I-Cache, D-Cache: 64KB, 2-way, 2 cycles
  L2-Cache: 2MB, 8-way, 8 cycles
  L3-Cache: 16MB, 16-way, 20 cycles
  DRAM: DDR4, 8 channels, 2133MHz

GPU Configuration
  Type: AMD GCN3 @ 1.5GHz
  CU Config: 12 CUs with 4 SIMD-16 engines
  Wavefronts: 40 waves per SIMD (64 lanes)
  V-Cache: 32KB, 16-way, 12 cycles, per CU
  K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
  I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
  L2-Cache: 1MB, 16-way, 8 banks, 100 cycles

CP Configuration
  Type: 2-wide OOO, x86, 2 cores @ 2GHz
  D-Cache: 32KB, 8-way, 4 cycles
  I-Cache: 16KB, 8-way, 4 cycles
MICROBENCHMARKS
• Sweep of message size (single networking WG): round-trip network latency
• Sweep of the number of network service threads (many networking WGs): round-trip network latency
• Sweep of the number of network service threads (many networking WGs): energy consumption of the CP/CPU threads
Results
[Figure: remote Get time observed from the GPU (µs) vs. network payload size (1B to 1MB); remote Get time vs. number of network service threads (0 to 10); and energy consumed by network threads vs. number of network service threads. Each chart compares ComP-Net, dGPU, and APU.]
2D JACOBI STENCIL
• 1D data decomposition
• Iterative computation and halo exchange (a kernel sketch follows below)
• Allreduce for residual calculation
Results
[Figure: two nodes exchanging halos in the 1D decomposition (node 0 on top, node 1 on the bottom). Chart: relative speedup vs. the dGPU baseline across per-node problem sizes from 16 to 1024 (N x N grid) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better. Three regions of interest are marked.]
• Three regions of interest:
  1. CPU is best
  2. Intra-kernel networking is best
  3. Any GPU solution is acceptable
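To make the mapping concrete, here is a rough sketch of one node's persistent kernel, reusing the hypothetical comp_net_post/comp_net_wait from the queue sketches earlier (the stencil update, device-wide barriers, and residual allreduce are elided as comments, and the neighbor halo addresses are assumed to have been exchanged ahead of time):

    __global__ void jacobi_persistent(float *cur, float *next, int n, int iters,
                                      CompNetQueue *q,
                                      uint64_t top_halo_dest, uint64_t bot_halo_dest) {
        for (int it = 0; it < iters; ++it) {
            // ... 5-point stencil update of this node's rows (elided) ...
            // ... device-wide barrier (elided) ...
            if (blockIdx.x == 0 && threadIdx.x == 0) {
                // Halo exchange without leaving the kernel: push our top and
                // bottom boundary rows into the neighbors' halo regions.
                uint32_t s0 = comp_net_post(q, top_halo_dest,
                                            (uint64_t)&next[1 * n], n * sizeof(float));
                uint32_t s1 = comp_net_post(q, bot_halo_dest,
                                            (uint64_t)&next[(n - 2) * n], n * sizeof(float));
                comp_net_wait(q, s0);
                comp_net_wait(q, s1);
            }
            // ... device-wide barrier, then allreduce of the residual (elided) ...
            float *tmp = cur; cur = next; next = tmp;   // swap grids for the next iteration
        }
    }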
REDUCTION
• 64MB strong scaling study
  • Fix problem size, sweep node count
• APU performs better than ComP-Net
• ComP-Net is much more energy efficient
Results
[Figure: vector-sum reduction tree. Charts: relative speedup (bigger is better) for ComP-Net, dGPU, APU, HDN, and CPU, and energy consumption (smaller is better) for ComP-Net, dGPU, and APU, vs. number of nodes in the reduction (up to 36).]
DEEP LEARNING TRAINING
• Use Microsoft’s Cognitive Toolkit and sample workloads
• Projected using simulation results + profiling data from TACC’s Stampede supercomputer
• Speedups are bounded by the % of time the application is blocked on network data
Results
[Figure: projected speedup for AlexNet, AN4 LSTM, CIFAR, MNIST Conv, MNIST Hidden, and the average, comparing CPU, HDN, dGPU, APU, and ComP-Net; bigger is better.]
Workload Name | Domain | % Blocked | Reductions
AlexNet | Classification | 14% | 4672
AN4 LSTM | Speech | 50% | 131192
CIFAR | Classification | 4% | 939820
Large Synth | Synthetic | 28% | 52800
MNIST Conv | Text Recognition | 12% | 900000
MNIST Hidden | Text Recognition | 29% | 900000
SUMMARY
Conclusion
• Uses built-in CP to support network operations
• CP/GPU communicate over shared L2 cache instead of PCIe
• CP resources can scale with other GPU resources
• Up to 15% performance improvement and 2x energy reduction vs. GPU Host Networking
[Figure: GPU host networking vs. ComP-Net block diagrams, as on the overview slide: host queues in host memory reached over PCIe vs. CP queues in the GPU's shared L2 cache, with network queues connecting to the NIC over PCIe.]
Thank You!
Questions?
Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2017 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
The work described in this presentation was made with Government support awarded by the DOE. The Government may have certain rights in this work.