Harmony: A Run-Time for Managing Accelerators

Sponsor: LogicBlox Inc.

Gregory Diamos and Sudhakar Yalamanchili

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY CERCS

Software Challenges of Heterogeneity

Programming Model

Execution Model

Portability

Performance


Pooled Accelerator Execution Model

Heterogeneous multiprocessor systems are viewed as a pool of processors, each potentially with a unique ISA and system interface

Applications that make full use of these systems must include binaries compatible with each accelerator ISA
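The pooled view above can be sketched in a few lines of Python. This is only an illustration, not Harmony's actual data structures: the `Device` and `Kernel` classes, the ISA labels, and the placeholder binaries are all assumptions made for the example. It shows a kernel that bundles one binary per ISA and a check for which devices in the pool can run it.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    isa: str  # hypothetical ISA label, e.g. "x86" or "ptx"


@dataclass
class Kernel:
    name: str
    # One entry per ISA the application ships a binary for (placeholder bytes).
    binaries: dict = field(default_factory=dict)


def compatible_devices(kernel, pool):
    """Return the devices in the pool whose ISA matches a bundled binary."""
    return [d for d in pool if d.isa in kernel.binaries]


pool = [Device("cpu0", "x86"), Device("gpu0", "ptx"), Device("fpga0", "bitstream")]
matmul = Kernel("matmul", {"x86": b"<x86 binary>", "ptx": b"<ptx binary>"})

# fpga0 is skipped because the application bundles no binary for its ISA.
usable = [d.name for d in compatible_devices(matmul, pool)]
print(usable)
```

A device whose ISA has no bundled binary is simply unusable for that kernel, which is why full use of the pool requires a binary per accelerator ISA.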


Execution Model

Configuration of the Machine Model

The architecture description specifies the configuration of accelerators and processors and communicates QoS requirements

[Figure: machine model and compilation flow. Machine model labels: Kernel, Stream Elements, Control Thread, Stream, ACC, Local Memory, DMA, Cache, FIFO, Multicore processor 1, Accelerator 1, Memory. Compilation flow labels: Programming Model, Source Program, System Architecture Description, Compilation Environment, HVM.]

Accelerator-based code segment: compiled for a specific device/driver combination


Goals of Harmony

Low Overhead
  Comparable to or better than hand-tuned applications

System Configuration Agnostic
  Correct execution on a system with any number and type of heterogeneous architectures
  No code modification required

Scalable
  EP application performance should scale with the number of devices

Familiar
  Do not require any more than the current programming model of threaded applications for homogeneous architectures

Harmony


Key Idea

Accelerator kernel deployment based on static and dynamic inter-kernel dependencies

Inspired by ILP scheduling techniques

Kernels are "issued" to accelerators and their execution is "committed" to release dependent kernels

[Figure: kernel scheduling pipeline. Labels: From Application, op (kernel operations), Dependence resolution, Ready Buffer, Issue.]
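The issue/commit idea can be illustrated with a small dependency-driven scheduler. This is a sketch under simplifying assumptions, not Harmony's scheduler: the function name `schedule` and the kernel names are hypothetical, dependencies are given up front as a dictionary, and each kernel commits immediately after it is issued, so no real concurrency is modeled.

```python
from collections import deque


def schedule(kernels, deps):
    """Issue kernels whose producers have committed; return the issue order.

    kernels: list of kernel names in program order.
    deps: dict mapping a kernel to the set of kernels it depends on.
    """
    pending = {k: set(deps.get(k, ())) for k in kernels}
    ready = deque(k for k in kernels if not pending[k])  # the ready buffer
    issued = []
    while ready:
        k = ready.popleft()  # "issue" k to an accelerator
        issued.append(k)
        # "commit" k: release every kernel that was waiting on it
        for other in kernels:
            if k in pending[other]:
                pending[other].remove(k)
                if not pending[other]:
                    ready.append(other)
    if len(issued) != len(kernels):
        raise ValueError("dependency cycle or unknown dependency")
    return issued


kernels = ["load", "scale", "fft", "sum"]
deps = {"scale": {"load"}, "fft": {"load"}, "sum": {"scale", "fft"}}
order = schedule(kernels, deps)
print(order)  # ['load', 'scale', 'fft', 'sum']
```

Here `scale` and `fft` both become ready as soon as `load` commits, so a runtime with two free accelerators could run them side by side; `sum` is released only after both have committed.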



Harmony Architecture & Operation



Harmony Runtime Operation

Accelerator kernels are mapped to specific architectures based on
  Architectures in the system
  Available implementations
  Performance

Results are forwarded to waiting functions

Can support speculation
  Results are committed in order
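The three mapping criteria listed above (architectures present, available implementations, performance) can be combined in a simple selection function. The sketch below is an assumption about how such a choice might look, not the runtime's real policy: `choose_device`, the device and kernel names, and the cycle estimates are all hypothetical.

```python
def choose_device(kernel, idle_devices, implementations, estimated_cycles):
    """Pick the idle device expected to run `kernel` fastest.

    idle_devices: dict device name -> architecture currently idle in the system.
    implementations: dict kernel -> set of architectures with a binary.
    estimated_cycles: dict (kernel, architecture) -> predicted cycle count.
    Returns None when no idle device has a usable implementation.
    """
    candidates = [
        (estimated_cycles[(kernel, arch)], name)
        for name, arch in idle_devices.items()
        if arch in implementations.get(kernel, set())
        and (kernel, arch) in estimated_cycles
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # lowest predicted cycle count wins


idle = {"cpu0": "cpu", "gpu0": "gpu"}
impls = {"matmul": {"cpu", "gpu"}, "parse": {"cpu"}}
estimates = {  # made-up numbers, for illustration only
    ("matmul", "cpu"): 5000,
    ("matmul", "gpu"): 200,
    ("parse", "cpu"): 10,
}

matmul_target = choose_device("matmul", idle, impls, estimates)
parse_target = choose_device("parse", idle, impls, estimates)
print(matmul_target, parse_target)  # gpu0 cpu0
```

A kernel with only a CPU implementation falls back to the CPU even when a faster architecture is idle, which matches the requirement that execution stay correct on any system configuration.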



Application Development

Programmer-supplied (Harmony) checks on entry/exit to accelerator kernels

Marshalling of operands when an accelerator kernel is invoked

May employ multiple (static) implementations corresponding to multiple accelerators
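One possible reading of the entry/exit checks and operand marshalling is sketched below. The slides do not show Harmony's API, so every name here (`marshal_operands`, `unmarshal_operands`, `invoke`, `scale_by_two`) and the length-prefixed buffer layout are assumptions chosen for illustration.

```python
import struct


def marshal_operands(values):
    """Pack a list of floats into a length-prefixed little-endian byte buffer."""
    return struct.pack(f"<I{len(values)}d", len(values), *values)


def unmarshal_operands(buffer):
    """Inverse of marshal_operands."""
    (count,) = struct.unpack_from("<I", buffer, 0)
    return list(struct.unpack_from(f"<{count}d", buffer, 4))


def invoke(kernel_impl, values, pre_check, post_check):
    """Run programmer-supplied checks around a marshalled kernel call."""
    if not pre_check(values):
        raise ValueError("entry check failed")
    result = unmarshal_operands(kernel_impl(marshal_operands(values)))
    if not post_check(values, result):
        raise ValueError("exit check failed")
    return result


def scale_by_two(buffer):
    """Stand-in for an accelerator kernel that works on marshalled bytes."""
    return marshal_operands([2.0 * v for v in unmarshal_operands(buffer)])


doubled = invoke(
    scale_by_two,
    [1.0, 2.5],
    pre_check=lambda xs: len(xs) > 0,
    post_check=lambda xs, ys: len(xs) == len(ys),
)
print(doubled)  # [2.0, 5.0]
```

Because the kernel only ever sees a flat byte buffer, several static implementations (one per accelerator) could share the same marshalling code and the same checks.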



Preliminary Performance Evaluation

[Figure: Matrix Multiplication. Bar chart of execution time in clock cycles (log scale, axis ticks at 1000000000, 10000000000, 100000000000). Bar labels and data values, paired in the order they appear:]

Configuration                  Clock cycles
1 Thread                       69169861424
1 GPU                          2834834771
1 Thread (Harmony)             71425720785
2 Threads (Harmony)            35567685367
1 GPU (Harmony)                2944899003
1 Thread 1 GPU (Harmony)       2708642262
2 Threads 1 GPU (Harmony)      2657213247

Annotated overheads: 3.1% and 3.8%



Scheduling Overhead

[Figure: Overhead for 64x64 Matrix Multiplies. Stacked bar chart of the fraction of cycles (0% to 100%) spent in Harmony versus the application, by scheduling window size.]

Scheduling Window Size (# of functions)    Harmony    Application
2                                          0.012      0.988
4                                          0.014      0.986
8                                          0.017      0.983
16                                         0.018      0.982
32                                         0.023      0.977
64                                         0.048      0.952
128                                        0.053      0.947
256                                        0.239      0.761
512                                        0.59       0.41



Extensions to FPGAs

Maintain the base Harmony deployment model
  Accelerator pools
  Associate a Harmony thread with each FPGA-based accelerator

Virtualize the FPGA fabric
  Demand-driven vs. static configuration of the fabric
  Adapt existing register-allocation-based scheduling techniques

Example: Virtualized Packet Schedulers (Sponsor: RNET Technologies)

Poster Session
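Demand-driven configuration of the fabric can be pictured as a small cache of loaded accelerator configurations. The sketch below is an assumption made for illustration: the slides mention adapting register-allocation-based techniques but give no details, so a plain least-recently-used replacement policy is swapped in here, and `FabricManager`, the slot count, and the configuration names are hypothetical.

```python
from collections import OrderedDict


class FabricManager:
    """Tracks which accelerator configurations occupy a fixed number of fabric slots."""

    def __init__(self, slots):
        self.slots = slots
        self.loaded = OrderedDict()  # configuration name -> True, oldest use first
        self.reconfigurations = 0

    def request(self, config):
        """Ensure `config` is loaded, evicting the least recently used one if full."""
        if config in self.loaded:
            self.loaded.move_to_end(config)  # mark as most recently used
            return
        if len(self.loaded) >= self.slots:
            self.loaded.popitem(last=False)  # evict least recently used
        self.loaded[config] = True
        self.reconfigurations += 1


fabric = FabricManager(slots=2)
for config in ["fft", "encrypt", "fft", "decrypt", "encrypt"]:
    fabric.request(config)

print(list(fabric.loaded), fabric.reconfigurations)  # ['decrypt', 'encrypt'] 4
```

With a static configuration the same request stream would need all three accelerators resident at once; the demand-driven version trades fabric area for reconfiguration events, which is the cost a smarter allocation policy would try to reduce.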



FPGA-Based Accelerator Architecture

[Figure: FPGA-based accelerator block diagram. Labels: Volatile (DRAM), Nonvolatile (FLASH), PCIe/HyperTransport/CSI Interface, PowerPC, Encrypt, Decrypt, FFT, Memory Controller, Switch (x4), NI (x6).]



Accelerator Configuration

[Figure: configured accelerator block diagram. Labels: Host, Host Driver, Host (DRAM), Harmony Thread (x2), Volatile (DRAM), Nonvolatile (FLASH), PCIe/HyperTransport/CSI Interface, PowerPC, Memory Controller, Encrypt, Decrypt, FFT, Switch, NI.]

Address translation in the NI allows isolated paths between accelerators and memory
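How address translation in the NI could yield isolated paths is sketched below with a base-and-bounds scheme. The slides do not say which translation mechanism is used, so this scheme, the `NetworkInterface` class, and the window addresses are assumptions chosen only to show the isolation property.

```python
class NetworkInterface:
    """Maps an accelerator's local addresses into a private window of memory."""

    def __init__(self, base, size):
        self.base = base  # start of this accelerator's window in shared memory
        self.size = size  # window length in bytes

    def translate(self, local_addr):
        """Return the memory address for local_addr, rejecting out-of-window accesses."""
        if not 0 <= local_addr < self.size:
            raise PermissionError(f"address {local_addr:#x} outside window")
        return self.base + local_addr


# Two accelerators get disjoint windows, so the same local address never collides.
fft_ni = NetworkInterface(base=0x1000, size=0x1000)
enc_ni = NetworkInterface(base=0x2000, size=0x1000)

print(hex(fft_ni.translate(0x10)), hex(enc_ni.translate(0x10)))  # 0x1010 0x2010
```

Because each NI only ever emits addresses inside its own window, one accelerator cannot read or overwrite memory assigned to another, even though both share the same memory controller.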

Future


Heterogeneous Virtual Machines

[Figure: heterogeneous virtual machine stack. Labels: User Software, Guest OS, Virtual Machine Monitor, SW Resources, HW Resources, CPU (x3), ACC, Local Memory, Cache, DMA, FIFO, Network. Annotations: isolation, security, legacy systems.]

PIs: A. Gavrilovska, K. Schwan, S. Yalamanchili

Virtualization of accelerator resources

Consolidation and sharing of accelerators

Looking Ahead


Questions?
