Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke

University of Michigan, Electrical Engineering and Computer Science

Automated C to Gates Solution

• SoC design
– 10-100 Gops, 200 mW power budget
– Low level tools ineffective

• Automated accelerator synthesis for whole application
– Correct by construction
– Increase designer productivity
– Faster time to market

Streaming Applications

[Figure: H.264 Encoder. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor. Input: Image. Output: Coded Image.]

• Data “streaming” through kernels

• Kernels are tight loops
– FIR, Viterbi, DCT

• Coarse grain dataflow between kernels
– Sub-blocks of images, network packets

[Figure: W-CDMA Transmitter. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter. Input: Data in. Output: Data out.]

System Schema Overview

[Figure: Kernels 1-5 of an application mapped onto a pipeline of accelerators, with K2 and K3 grouped together; the mapping is repeated for successive overlapping tasks. Label: Task throughput.]

Input Specification

• Sequential C program

• Kernel specification
– Perfectly nested FOR loop
– Wrapped inside C function
– All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
      for(i=0; i<8; i++) {
        for(j=0; j<8; j++) {
          . . . = inp[i][j];
          out[i][j] = . . . ;
        }
      }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
– Function with main input/output
– Local arrays to pass data
– Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }

[Figure: dataflow between kernels: row_trans → col_trans → zigzag_trans]
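A compilable sketch of this input format may help make the kernel/system split concrete. Only the function names, the 8x8 `char` array signatures, and the call sequence in `dct` come from the slide. The kernel bodies here (a copy, a transpose, and a column reversal) are placeholders assumed purely for illustration; they are not the real DCT kernels, whose bodies are elided on the slide.

```c
#include <assert.h>

enum { N = 8 };

/* Kernel: a perfectly nested FOR loop wrapped in a function, with all
   data access explicit through the array arguments.
   Placeholder body (plain copy), NOT a real row transform. */
void row_trans(char inp[N][N], char out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = inp[i][j];
}

/* Placeholder body (transpose), NOT a real column transform. */
void col_trans(char inp[N][N], char out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = inp[j][i];
}

/* Placeholder body (reverse each row), NOT a real zigzag reordering. */
void zigzag_trans(char inp[N][N], char out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = inp[i][N - 1 - j];
}

/* System: main input/output, local arrays that carry data between
   kernels, and a plain sequence of kernel calls. */
void dct(char inp[N][N], char out[N][N]) {
    assert(inp && out);
    char tmp1[N][N], tmp2[N][N];
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
}
```

The local arrays `tmp1` and `tmp2` are the intermediate arrays that would become buffers between accelerators in the synthesized pipeline.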

System Level Decisions

• Throughput of each LA
– Initiation Interval

• Grouping of loops into a multifunction LA
– More loops in a single LA → LA occupied for longer time in current task

[Figure: loops K3 (TC=100) and K4 (TC=100) grouped on LA 1. LA 1 occupied for 200 cycles. Throughput = 1 task / 200 cycles.]
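The arithmetic in the figure can be sketched as follows. This is a rough model under stated assumptions: TC is taken to mean loop trip count, each loop takes TC × II cycles (pipeline fill and drain ignored), and the example uses II = 1. The `Loop` struct and both function names are hypothetical, not part of Streamroller.

```c
#include <assert.h>

/* Hypothetical description of one loop (kernel). */
typedef struct {
    int trip_count; /* TC on the slide (assumed to mean trip count) */
    int ii;         /* initiation interval of the loop's schedule */
    int la;         /* index of the loop accelerator (LA) it is mapped to */
} Loop;

/* Cycles during which LA `la` is occupied by one task. Loops that share
   an LA run back to back, so their cycle counts add up. */
int la_occupancy(const Loop *loops, int n, int la) {
    int cycles = 0;
    for (int k = 0; k < n; k++) {
        assert(loops[k].trip_count > 0 && loops[k].ii > 0);
        if (loops[k].la == la)
            cycles += loops[k].trip_count * loops[k].ii;
    }
    return cycles;
}

/* Cycles per task for the whole pipeline: the busiest LA is the
   bottleneck, so throughput is 1 task per this many cycles. */
int cycles_per_task(const Loop *loops, int n, int num_las) {
    int worst = 0;
    for (int la = 0; la < num_las; la++) {
        int c = la_occupancy(loops, n, la);
        if (c > worst)
            worst = c;
    }
    return worst;
}
```

With K3 and K4 (TC=100 each, II=1) both mapped to LA 1, the model gives 200 cycles per task, matching the figure; mapping them to separate LAs would give 100 cycles per task at the cost of a second accelerator.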

System Decisions (Contd.)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: TC=100. tmp1 buffer in use by LA2. Adjacent tasks use different buffers.]
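One way to picture "adjacent tasks use different buffers" is round-robin assignment of buffer copies to tasks. The sketch below is an assumption about how such a scheme could work, not the tool's actual buffer allocator; both function names are hypothetical.

```c
#include <assert.h>

/* Which copy of an intermediate array (e.g. tmp1) a given task uses.
   Adjacent tasks get different copies whenever num_buffers >= 2. */
int buffer_for_task(int task, int num_buffers) {
    assert(task >= 0 && num_buffers > 0);
    return task % num_buffers;
}

/* The producer LA may start writing `task` only once the consumer LA has
   finished the task that last used the same copy (task - num_buffers).
   `tasks_consumed` counts tasks the consumer has fully read. */
int producer_may_start(int task, int tasks_consumed, int num_buffers) {
    assert(task >= 0 && tasks_consumed >= 0 && num_buffers > 0);
    return task - tasks_consumed < num_buffers;
}
```

With a single buffer the producer must wait for the consumer to finish every task before writing the next one. With two buffers it can run one task ahead, which is the extra task overlap the slide trades against SRAM cost.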

Case Study: “Simple” benchmark

[Figure: loop graph (TC=256) with candidate mappings onto loop accelerators (LA 1). Cycle labels: 512 cycles, 1536 cycles, 1792 cycles, 2048 cycles.]

Prescribed Throughput Accelerators

• Traditional behavioral synthesis
– Directly translate C operators into gates

• Our approach: Application-centric Architectures
– Achieve fixed throughput
– Maximize hardware sharing

[Figure: Application (operation graph) → Architecture (datapath)]

Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop

Loop Accelerator Design Flow

[Figure: design flow. C code and performance (throughput) → FU allocation → abstract architecture → modulo schedule → scheduled ops (Op1, Op2, Op3, … over time) → build datapath → concrete architecture (FUs) → instantiate architecture → Verilog and control signals → synthesize → loop accelerator, placed in the accelerator pipeline.]
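The FU allocation step can be illustrated with the standard resource lower bound from modulo scheduling: at initiation interval II, a fully pipelined FU can host at most II operations of one loop iteration. Whether Streamroller's allocator uses exactly this bound is an assumption; the FU type list and function names below are hypothetical.

```c
#include <assert.h>

/* Hypothetical FU types: 0 = adder, 1 = multiplier, 2 = memory port. */
enum { NUM_FU_TYPES = 3 };

/* Lower bound on the number of FUs of one type. Assumes fully pipelined
   FUs that can start one operation per cycle, so each FU offers `ii`
   issue slots per loop iteration. */
int min_fus(int num_ops, int ii) {
    assert(num_ops >= 0 && ii > 0);
    return (num_ops + ii - 1) / ii; /* ceil(num_ops / ii) */
}

/* Apply the bound to every FU type of a loop body. */
void allocate_fus(const int ops[NUM_FU_TYPES], int ii, int fus[NUM_FU_TYPES]) {
    for (int t = 0; t < NUM_FU_TYPES; t++)
        fus[t] = min_fus(ops[t], ii);
}
```

A larger II (lower prescribed throughput) lets more operations share each FU, which is where the hardware savings of a fixed-throughput design come from.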

Multifunction Accelerator

• Map multiple loops to single accelerator

• Improve hardware efficiency via reuse

• Opportunities for sharing
– Disjoint stages (loops 2, 3)
– Pipeline slack (loops 4, 5)

[Figure: application with Loop 1, a "FrameType?" branch selecting Loop 2 or Loop 3, Loop 4, and Block 5, mapped to an accelerator pipeline of loop accelerators and a multifunction loop accelerator.]

[Figure: Loop 1 and Loop 2 → cost sensitive modulo scheduler → datapath union of FUs.]

• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
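The difference between "sum of accelerators" and a datapath union can be sketched with a deliberately naive cost model: FU counts only, with a per-type maximum for the union. This ignores registers, interconnect, and the cost sensitive scheduling that the slide's "smart union" refers to. The FU types, unit costs, and function names are hypothetical.

```c
#include <assert.h>

enum { NUM_FU_TYPES = 3 }; /* hypothetical FU types */

/* Cost of one dedicated accelerator per loop: FU counts simply add. */
int sum_cost(int fus[][NUM_FU_TYPES], int num_loops,
             const int unit_cost[NUM_FU_TYPES]) {
    int cost = 0;
    for (int l = 0; l < num_loops; l++)
        for (int t = 0; t < NUM_FU_TYPES; t++)
            cost += fus[l][t] * unit_cost[t];
    return cost;
}

/* Cost of one multifunction accelerator under a naive union: loops never
   run at the same time, so each FU type needs only the largest count
   that any single loop requires. */
int union_cost(int fus[][NUM_FU_TYPES], int num_loops,
               const int unit_cost[NUM_FU_TYPES]) {
    assert(num_loops > 0);
    int cost = 0;
    for (int t = 0; t < NUM_FU_TYPES; t++) {
        int most = 0;
        for (int l = 0; l < num_loops; l++)
            if (fus[l][t] > most)
                most = fus[l][t];
        cost += most * unit_cost[t];
    }
    return cost;
}
```

For two loops needing {2, 1, 1} and {1, 2, 1} FUs at made-up unit costs {1, 4, 2}, the sum is 19 and the naive union is 12. These numbers are illustrative only and unrelated to the 43% figure reported on the slide.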

Challenges: Throughput Enabling Transformations

• Algorithm-level pipeline retiming
– Splitting loops based on tiling
– Co-scheduling adjacent loops

[Figure: pipeline Loop 1, Loop 2, Loop 3, Loop 4 transformed into Loop 1, Loop 2a, Loop 2b, Loop 3,4. Label: Critical loop.]

Challenges: Programmable Loop Accelerator

• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead

[Figure: programmable loop accelerator with interconnect, local memory, and control generating control signals.]

Challenges: Timing Aware Synthesis

• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance

• Strategies to eliminate long wires
– Preemptive: predict & prevent long wires
– Reactive: use feedback from floorplanner

[Figure: FU1, FU2, FU3. Annotations: insert flip flop on long path; reschedule with added latency.]

Challenges: Adaptable Voltage/Frequency Levels

• Allow voltage scaling beyond margins

• Using shadow latches in loop accelerator
– Localized error detection
– Control is predefined: simple error recovery

[Figure: flip-flop paired with a shadow latch; extra queue entries.]

For More Information

• Visit http://cccp.eecs.umich.edu
