Harmony: A Run-Time for Managing Accelerators
Sponsor: LogicBlox Inc.
Gregory Diamos and Sudhakar Yalamanchili
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY | CERCS
Software Challenges of Heterogeneity
Programming Model
Execution Model
Portability
Performance
Pooled Accelerator Execution Model
Heterogeneous multiprocessor systems are viewed as a pool of processors, each potentially with a unique ISA and system interface
Applications that make full use of these systems must include binaries compatible with each accelerator ISA
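The pooled view above can be sketched in a few lines. This is only an illustration under assumptions: the `Kernel` and `Processor` classes, the ISA names, and the `saxpy` example are hypothetical and do not come from Harmony itself. Python callables stand in for per-ISA binaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Kernel:
    """A kernel bundled with one implementation per supported ISA (hypothetical)."""
    name: str
    # Maps an ISA name to a callable standing in for that ISA's binary.
    binaries: Dict[str, Callable[..., object]] = field(default_factory=dict)


@dataclass
class Processor:
    """One member of the processor pool, identified by its ISA."""
    name: str
    isa: str


def compatible_processors(kernel: Kernel, pool: List[Processor]) -> List[Processor]:
    """Return the pool members for which the kernel ships a binary."""
    return [p for p in pool if p.isa in kernel.binaries]


# A pool with three different ISAs; the kernel ships binaries for two of them.
pool = [Processor("cpu0", "x86"), Processor("gpu0", "ptx"), Processor("fpga0", "bitstream")]
saxpy = Kernel("saxpy", {
    "x86": lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)],
    "ptx": lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)],
})

usable = compatible_processors(saxpy, pool)
print([p.name for p in usable])  # ['cpu0', 'gpu0']
```

The FPGA is skipped because the application carries no binary for it, which is the cost the slide points at: full use of the pool requires a binary for every ISA in it.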
Execution Model
Configuration of the Machine Model
The architecture description specifies the configuration of accelerators and processors and communicates QoS requirements
[Figure: execution model diagram. Labels: Control Thread, Kernel, Stream, Stream Elements; Multicore processor 1 with Memory; Accelerator 1 (ACC) with Local Memory, Cache, DMA, FIFO.]
Programming Model
[Figure: programming model diagram. Labels: Source Program; System Architecture Description; Compilation Environment; Accelerator-based Code Segment (compiled for a specific device/driver combination); HVM.]
Goals of Harmony
Low Overhead: Comparable to or better than hand-tuned applications
System Configuration Agnostic: Correct execution on a system with any number and type of heterogeneous architectures; no code modification required
Scalable: EP application performance should scale with the number of devices
Familiar: Do not require any more than the current programming model of threaded applications for homogeneous architectures
Key Idea
Accelerator kernel deployment based on static and dynamic inter-kernel dependencies
Inspired by ILP scheduling techniques
Kernels are “issued” to accelerators and their execution is “committed” to release dependent kernels
[Figure: kernel issue pipeline. Labels: From Application, Dependence resolution, op (several), Ready Buffer, Issue.]
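The issue/commit idea can be illustrated with a small dependency tracker. This is a minimal sketch, not Harmony's implementation: the `KernelScheduler` class and its method names are hypothetical, and commit is called right after issue only because there is no real accelerator to wait for.

```python
from collections import deque


class KernelScheduler:
    """Toy dependency tracker: issue ready kernels, commit to release dependents."""

    def __init__(self):
        self.waiting = {}       # kernel name -> set of producers not yet committed
        self.consumers = {}     # producer name -> names of kernels waiting on it
        self.committed = set()  # kernels whose results are final
        self.ready = deque()    # ready buffer, in arrival order

    def submit(self, name, depends_on=()):
        """Resolve dependences; kernels with none unmet go to the ready buffer."""
        unmet = {d for d in depends_on if d not in self.committed}
        for producer in unmet:
            self.consumers.setdefault(producer, []).append(name)
        if unmet:
            self.waiting[name] = unmet
        else:
            self.ready.append(name)

    def issue(self):
        """Pop the next ready kernel, or return None if the buffer is empty."""
        return self.ready.popleft() if self.ready else None

    def commit(self, name):
        """Mark a kernel finished and release any dependents that are now ready."""
        self.committed.add(name)
        for consumer in self.consumers.pop(name, []):
            unmet = self.waiting[consumer]
            unmet.discard(name)
            if not unmet:
                del self.waiting[consumer]
                self.ready.append(consumer)


sched = KernelScheduler()
sched.submit("A")
sched.submit("B")
sched.submit("C", depends_on=["A", "B"])  # C must wait for both A and B

order = []
while (kernel := sched.issue()) is not None:
    order.append(kernel)
    sched.commit(kernel)  # a real runtime would commit when the device finishes
print(order)  # ['A', 'B', 'C']
```

A and B have no dependences and could be issued to different accelerators at once; C only reaches the ready buffer after both have committed.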
Harmony Architecture & Operation
Harmony Runtime Operation
Accelerator kernels are mapped to specific architectures based on:
Architectures in the system
Available implementations
Performance
Results are forwarded to waiting functions
Can support speculation: results are committed in order
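One way to read the three mapping criteria is as an earliest-finish-time choice over the devices that can run a kernel. The sketch below assumes that policy; the source does not say which policy Harmony uses, and the function name, device names, and run-time estimates are made up for illustration.

```python
def choose_device(kernel_impls, devices, busy_until, estimate):
    """Pick the device with the earliest estimated finish time (assumed policy).

    kernel_impls: set of architecture names the kernel has implementations for.
    devices: dict of device name -> architecture name (what is in the system).
    busy_until: dict of device name -> time at which the device becomes free.
    estimate: dict of architecture name -> estimated run time of this kernel.
    """
    # Only devices whose architecture has an available implementation qualify.
    candidates = [d for d, arch in devices.items() if arch in kernel_impls]
    if not candidates:
        raise LookupError("no device in the system can run this kernel")
    return min(candidates, key=lambda d: busy_until[d] + estimate[devices[d]])


devices = {"cpu0": "cpu", "cpu1": "cpu", "gpu0": "gpu"}
busy_until = {"cpu0": 0.0, "cpu1": 0.0, "gpu0": 0.0}
estimate = {"cpu": 10.0, "gpu": 1.0}  # made-up numbers for illustration

placements = []
for _ in range(12):
    device = choose_device({"cpu", "gpu"}, devices, busy_until, estimate)
    busy_until[device] += estimate[devices[device]]  # the device is now busy longer
    placements.append(device)

print({d: placements.count(d) for d in devices})
```

With these estimates most kernels land on the GPU, and the CPUs pick up work only once the GPU queue is long enough that waiting for it would take as long as running on a CPU.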
Application Development
Programmer-supplied (Harmony) checks on entry/exit to accelerator kernels
Marshalling of operands when an accelerator kernel is invoked
May employ multiple (static) implementations corresponding to multiple accelerators
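The three points above can be combined in a small wrapper. This is a sketch under assumptions, not Harmony's API: `invoke_kernel`, the length-prefixed float buffer format, and the `scale_by_two` stand-in are all hypothetical. It uses Python's standard `struct` module for marshalling.

```python
import struct


def marshal_floats(values):
    """Pack a list of floats into a length-prefixed little-endian byte buffer."""
    return struct.pack(f"<I{len(values)}f", len(values), *values)


def unmarshal_floats(buffer):
    """Inverse of marshal_floats."""
    (count,) = struct.unpack_from("<I", buffer, 0)
    return list(struct.unpack_from(f"<{count}f", buffer, 4))


def invoke_kernel(implementations, target, values):
    """Run one accelerator kernel with entry and exit checks.

    implementations: dict of target name -> callable taking and returning bytes.
    """
    # Entry check: a static implementation must exist for the chosen target.
    if target not in implementations:
        raise KeyError(f"no implementation for target {target!r}")
    operands = marshal_floats(values)
    result = unmarshal_floats(implementations[target](operands))
    # Exit check: this example kernel is expected to preserve the operand count.
    if len(result) != len(values):
        raise ValueError("kernel returned an unexpected number of results")
    return result


def scale_by_two(buffer):
    """Stand-in for a device binary: doubles every operand."""
    return marshal_floats([2.0 * v for v in unmarshal_floats(buffer)])


# The same stand-in is registered twice to mimic one implementation per accelerator.
implementations = {"cpu": scale_by_two, "gpu": scale_by_two}
print(invoke_kernel(implementations, "gpu", [1.0, 2.5, -4.0]))  # [2.0, 5.0, -8.0]
```

Keeping operands in a flat byte buffer is one plausible choice because the same buffer could be handed to any device, whatever its ISA.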
Preliminary Performance Evaluation
[Figure: Matrix Multiplication, execution time in clock cycles (log scale, 1000000000 to 100000000000). Bar labels and data values, in the order they appear on the chart:]
1 Thread: 69169861424
1 GPU: 2834834771
1 Thread (Harmony): 71425720785
2 Threads (Harmony): 35567685367
1 GPU (Harmony): 2944899003
1 Thread, 1 GPU (Harmony): 2708642262
2 Threads, 1 GPU (Harmony): 2657213247
Chart annotations: 3.1% overhead; 3.8% overhead
Scheduling Overhead
[Figure: Overhead for 64x64 Matrix Multiplies. Fraction of cycles (0% to 100%) spent in Harmony and in the application, by scheduling window size (# of functions).]
Window size   Harmony   Application
2             0.012     0.988
4             0.014     0.986
8             0.017     0.983
16            0.018     0.982
32            0.023     0.977
64            0.048     0.952
128           0.053     0.947
256           0.239     0.761
512           0.59      0.41
Extensions to FPGAs
Maintain the base Harmony deployment model: accelerator pools
Associate a Harmony thread with each FPGA-based accelerator
Virtualize the FPGA fabric
Demand-driven vs. static configuration of the fabric
Adapt existing register allocation based scheduling techniques
Example: Virtualized Packet Schedulers (Sponsor: RNET Technologies)
Poster Session
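Demand-driven configuration can be pictured by treating reconfigurable regions of the fabric like a small register file: an accelerator is loaded when a kernel needs it, and an existing one is evicted when no region is free. The sketch below uses least-recently-used eviction as a stand-in policy. That choice is an assumption; the slides do not say which register-allocation technique would be adapted, and `FabricAllocator` and the accelerator names are hypothetical.

```python
from collections import OrderedDict


class FabricAllocator:
    """Treat reconfigurable regions like a small register file for accelerators."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.resident = OrderedDict()  # accelerator name -> region, least recent first
        self.reconfigurations = 0

    def acquire(self, accelerator):
        """Return the region holding `accelerator`, configuring it on demand."""
        if accelerator in self.resident:
            self.resident.move_to_end(accelerator)  # mark as most recently used
            return self.resident[accelerator]
        if len(self.resident) < self.num_regions:
            region = len(self.resident)  # regions fill in order while space remains
        else:
            # Evict the least recently used accelerator and reuse its region.
            _, region = self.resident.popitem(last=False)
        self.resident[accelerator] = region
        self.reconfigurations += 1
        return region


fabric = FabricAllocator(num_regions=2)
for name in ["encrypt", "fft", "encrypt", "decrypt", "fft"]:
    print(name, "-> region", fabric.acquire(name))
print("reconfigurations:", fabric.reconfigurations)  # reconfigurations: 4
```

A static configuration would correspond to calling `acquire` once per accelerator at start-up and never evicting, which works only if every accelerator fits in the fabric at the same time.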
FPGA-Based Accelerator Architecture
[Figure: FPGA-based accelerator architecture. Labels: PCIe/HyperTransport/CSI Interface; PowerPC; Memory Controller; Volatile (DRAM); Nonvolatile (FLASH); Encrypt, Decrypt, and FFT accelerators; Switches; NIs.]
Accelerator Configuration
[Figure: accelerator configuration. Labels: Host Driver; Host (DRAM); PCIe/HyperTransport/CSI Interface; PowerPC; Memory Controller; Volatile (DRAM); Nonvolatile (FLASH); Encrypt, Decrypt, and FFT accelerators; Switches; NIs; two Harmony Threads.]
Address translation in the NI allows isolated paths between accelerators and memory
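The isolation property can be illustrated with a base-and-bounds translation at each NI. This is a sketch under assumptions: the slides do not describe the translation mechanism, "NI" is read here as network interface, and the class name, window sizes, and addresses are hypothetical.

```python
class NetworkInterface:
    """Toy NI that maps an accelerator's local addresses into a private window."""

    def __init__(self, base, size):
        self.base = base  # start of this accelerator's window in shared memory
        self.size = size  # window length in bytes

    def translate(self, local_address):
        """Translate a local address, rejecting anything outside the window."""
        if not 0 <= local_address < self.size:
            raise PermissionError(f"address {local_address:#x} is outside the window")
        return self.base + local_address


memory = bytearray(0x3000)
fft_ni = NetworkInterface(base=0x1000, size=0x1000)
crypto_ni = NetworkInterface(base=0x2000, size=0x1000)

# Both accelerators use the same local address but reach disjoint memory.
memory[fft_ni.translate(0x10)] = 0xAA
memory[crypto_ni.translate(0x10)] = 0xBB
print(hex(fft_ni.translate(0x10)), hex(crypto_ni.translate(0x10)))  # 0x1010 0x2010
```

Because every access passes through `translate`, neither accelerator can name an address in the other's window, which is the sense in which the paths are isolated.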
Heterogeneous Virtual Machines
[Figure: virtualized heterogeneous platform. Labels: User Software and Guest OS (two guests); Virtual Machine Monitor; SW Resources; HW Resources: CPUs, accelerators (ACC) each with Local Memory, Cache, DMA, and FIFO, and Network. Annotations: isolation, security, legacy systems.]
PIs: A. Gavrilovska, K. Schwan, S. Yalamanchili
Virtualization of accelerator resources
Consolidation and sharing of accelerators
Questions?