Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke

University of Michigan, Electrical Engineering and Computer Science

Automated C to Gates Solution

• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market

[Figure: app.c compiled into a set of loop accelerators (LA)]

Streaming Applications

• Data “streaming” through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets

[Figure: H.264 Encoder, Image to Coded Image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]

[Figure: W-CDMA Transmitter, Data in to Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]

System Schema Overview

[Figure: Kernels 1 to 5 mapped onto LA 1, LA 2, and LA 3. A timeline shows the sequence Kernel 1, K2/K3, Kernel 4, Kernel 5 repeated for successive tasks that overlap in time. Labels: time, task throughput]

Input Specification

• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    void row_trans(char inp[8][8], char out[8][8]) {
        int i, j;
        for (i = 0; i < 8; i++) {
            for (j = 0; j < 8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }

    void col_trans(char inp[8][8], char out[8][8]);
    void zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    void dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

System Level Decisions

• Throughput of each LA
  – Initiation Interval
• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task

[Figure: kernels K1 to K4, each labeled TC=100, mapped onto LA 1, LA 2, and LA 3. LA 1 is occupied for 200 cycles. Timeline marks at 100, 200, 300, and 400 cycles]

Throughput = 1 task / 200 cycles
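
The occupancy arithmetic behind this slide can be sketched in C. This is a minimal illustration, not the Streamroller implementation: the `struct kernel` layout and the function name `cycles_per_task` are hypothetical, each kernel's run time is approximated as II × TC (pipeline fill and drain are ignored), and II=1 is assumed for the slide's kernels.

```c
#include <stddef.h>

/* One kernel (loop). Field names are illustrative, not from the slides. */
struct kernel {
    long trip_count; /* TC: loop iterations per task */
    long ii;         /* II: cycles between successive iterations */
    int  la;         /* index of the loop accelerator that runs this kernel */
};

/* Cycles per task for which the busiest LA is occupied, approximating each
 * kernel's run time as II * TC. Under this model one task can complete every
 * this-many cycles. Returns -1 if a kernel names an LA outside [0, n_las). */
long cycles_per_task(const struct kernel *k, size_t n_kernels, int n_las)
{
    for (size_t i = 0; i < n_kernels; i++)
        if (k[i].la < 0 || k[i].la >= n_las)
            return -1;

    long worst = 0;
    for (int la = 0; la < n_las; la++) {
        long busy = 0; /* total cycles this LA spends on one task */
        for (size_t i = 0; i < n_kernels; i++)
            if (k[i].la == la)
                busy += k[i].ii * k[i].trip_count;
        if (busy > worst)
            worst = busy;
    }
    return worst;
}
```

With four kernels of TC=100 and II=1, where the first and fourth share LA index 0 and the other two each have their own LA, the function returns 200, matching the slide's 1 task / 200 cycles.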

System Decisions (Contd..)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, K3 (each II=1, TC=100) on LA 1, LA 2, LA 3, connected through buffers tmp1 and tmp2. Two timelines with marks at 100, 200, and 300 cycles: one annotated "tmp1 buffer in use by LA2", the other annotated "Adjacent tasks use different buffers"]
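
The effect of buffer count on task overlap can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the tool's cost model: it covers only one producer LA and one consumer LA sharing a ring of buffers, a buffer cannot be rewritten until the consumer has finished reading it, and the name `pipeline_finish_time` and the limit `MAX_TASKS` are hypothetical.

```c
#define MAX_TASKS 64

/* Cycle at which the consumer finishes the last of n_tasks tasks when the
 * producer and consumer LAs communicate through n_buffers buffers used
 * round-robin. Returns -1 on invalid arguments. */
long pipeline_finish_time(int n_tasks, int n_buffers,
                          long prod_cycles, long cons_cycles)
{
    long prod_done[MAX_TASKS];
    long cons_done[MAX_TASKS];

    if (n_tasks < 1 || n_tasks > MAX_TASKS || n_buffers < 1)
        return -1;

    for (int t = 0; t < n_tasks; t++) {
        /* Producer is free once it has finished the previous task. */
        long start = (t > 0) ? prod_done[t - 1] : 0;
        /* The buffer for task t was last used by task t - n_buffers and
         * must have been fully read before it is overwritten. */
        if (t >= n_buffers && cons_done[t - n_buffers] > start)
            start = cons_done[t - n_buffers];
        prod_done[t] = start + prod_cycles;

        /* Consumer needs the data and must be done with the previous task. */
        long cstart = prod_done[t];
        if (t > 0 && cons_done[t - 1] > cstart)
            cstart = cons_done[t - 1];
        cons_done[t] = cstart + cons_cycles;
    }
    return cons_done[n_tasks - 1];
}
```

For three tasks with 100-cycle stages, one buffer gives 600 cycles and two buffers give 400 cycles in this model, which is the kind of overlap gain the slide attributes to extra buffers.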

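Case Study: “Simple” benchmark

[Figure: loop graph of 8 loops, TC=256, shown with three candidate designs]
– 4 LAs (LA 1 to LA 4), every loop at II=1: 512 cycles
– 2 LAs (LA 1, LA 2), loop IIs 1, 1, 2, 1, 1, 1, 3, 3: 1792 cycles and 1536 cycles
– 1 LA, every loop at II=1: 2048 cycles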
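
The grouping decision in this case study can be explored with a brute-force search. The sketch below is illustrative only and is not the algorithm used by Streamroller: it takes the loop IIs as given, assumes every loop has the same trip count, ignores hardware cost and dataflow order, and the function name `best_two_la_split` is hypothetical.

```c
#include <stddef.h>

/* Smallest possible load, in cycles per task, of the busier of two LAs when
 * n_loops loops with the given IIs and a common trip count are divided
 * between them. Tries every assignment, so n_loops is limited to 20.
 * Returns -1 on invalid arguments. */
long best_two_la_split(const long *ii, size_t n_loops, long trip_count)
{
    if (n_loops < 1 || n_loops > 20)
        return -1;

    long total = 0;
    for (size_t i = 0; i < n_loops; i++)
        total += ii[i];

    long best = total; /* worst case: every loop on one LA */
    for (unsigned long mask = 0; mask < (1UL << n_loops); mask++) {
        long a = 0; /* summed II of loops assigned to the first LA */
        for (size_t i = 0; i < n_loops; i++)
            if (mask & (1UL << i))
                a += ii[i];
        long b = total - a;
        long worst = (a > b) ? a : b;
        if (worst < best)
            best = worst;
    }
    return best * trip_count;
}
```

For the IIs 1, 1, 2, 1, 1, 1, 3, 3 and TC=256, the best split puts a summed II of 7 on one LA and 6 on the other, so the function returns 1792, consistent with the 1792 and 1536 cycle figures on the slide.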

Prescribed Throughput Accelerators

• Traditional behavioral synthesis
  – Directly translate C operators into gates
• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing

[Figure: Application (operation graph) mapped to Architecture (datapath)]

Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop
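
In a modulo scheduled loop, a new iteration starts every II cycles, so two operations bound to the same function unit must not occupy the same slot modulo II. The check below is a minimal sketch of that resource constraint, assuming each operation holds its FU for one cycle; the `struct sched_op` layout and the function name are hypothetical and dependence constraints are not checked.

```c
#include <stdbool.h>
#include <stddef.h>

/* One scheduled operation: the FU it is bound to and its issue cycle
 * within a single iteration. Names are illustrative. */
struct sched_op {
    int  fu;
    long time;
};

/* True if no two operations on the same FU collide modulo II, i.e. the
 * schedule can repeat every II cycles without a resource conflict.
 * Returns false for II < 1 or negative issue times. */
bool modulo_schedule_ok(const struct sched_op *ops, size_t n, long ii)
{
    if (ii < 1)
        return false;
    for (size_t i = 0; i < n; i++)
        if (ops[i].time < 0)
            return false;

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (ops[i].fu == ops[j].fu &&
                ops[i].time % ii == ops[j].time % ii)
                return false;
    return true;
}
```

For example, two operations on the same FU at cycles 0 and 2 conflict at II=2 but not at II=3.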

Loop Accelerator Design Flow

[Figure: design flow. C code and performance (throughput) → FU allocation → abstract architecture → modulo schedule (scheduled ops placed on FUs over time) → build datapath → concrete architecture (RF, FUs) → instantiate architecture → Verilog and control signals (.v) → synthesize → loop accelerator]
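
The FU allocation step can be illustrated with the standard resource lower bound from modulo scheduling: to sustain a given II, a class with N operations per iteration needs at least ceil(N / II) units. The sketch below assumes fully pipelined units that each accept one operation per cycle; it is not necessarily the allocator used in this flow, and the function name `allocate_fus` is hypothetical.

```c
#include <stddef.h>

/* For each operation class c, writes the minimum number of function units
 * needed to issue ops_per_class[c] operations every ii cycles.
 * Returns 0 on success, -1 if ii < 1 or any count is negative. */
int allocate_fus(const long *ops_per_class, long *fus_out,
                 size_t n_classes, long ii)
{
    if (ii < 1)
        return -1;
    for (size_t c = 0; c < n_classes; c++) {
        if (ops_per_class[c] < 0)
            return -1;
        /* Integer ceiling of ops / ii. */
        fus_out[c] = (ops_per_class[c] + ii - 1) / ii;
    }
    return 0;
}
```

With 5 adds and 2 multiplies per iteration at II=2, this yields 3 adders and 1 multiplier. Relaxing the throughput target to a larger II lowers the counts.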

Multifunction Accelerator

• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)

[Figure: application with Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, then Loop 4 and Block 5. It is shown mapped first to an accelerator pipeline of five loop accelerators (LA1 to LA5), then to a pipeline of LA1, LA2, and LA3 made of one loop accelerator and two multifunction loop accelerators]

Union

[Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting datapaths (FUs) are merged by a datapath union]

• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
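
The saving from a union comes from the fact that the loops never run at the same time, so the merged datapath needs only the larger requirement per resource class rather than the sum. The sketch below shows that idea for FU counts only. It is a deliberately naive model: the "smart union" named on the slide is not described here, registers, interconnect, and bit widths are ignored, and the function names are hypothetical.

```c
#include <stddef.h>

/* Per-class union of two FU requirement vectors: the shared datapath needs
 * max(a[c], b[c]) units of class c when the two loops never run together. */
void union_fus(const long *a, const long *b, long *out, size_t n_classes)
{
    for (size_t c = 0; c < n_classes; c++)
        out[c] = (a[c] > b[c]) ? a[c] : b[c];
}

/* Total number of FUs in a requirement vector. */
long total_fus(const long *v, size_t n_classes)
{
    long sum = 0;
    for (size_t c = 0; c < n_classes; c++)
        sum += v[c];
    return sum;
}
```

For made-up requirement vectors {2, 1, 0} and {1, 1, 1}, the union is {2, 1, 1}: 4 FUs instead of the 6 needed by two separate accelerators. These numbers are illustrative and unrelated to the reported 43% average.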

Challenges: Throughput Enabling Transformations

• Algorithm-level pipeline retiming
  – Splitting loops based on tiling
  – Co-scheduling adjacent loops

[Figure: a pipeline of Loop 1, Loop 2, Loop 3, Loop 4 transformed into Loop 1, Loop 2a, Loop 2b, and a combined Loop 3,4; a critical loop is marked in each version]

Challenges: Programmable Loop Accelerator

• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead

[Figure: programmable loop accelerator. Labels: Interconnect, FU, MEM, Local Mem, Control, II, Control signals]

Challenges: Timing Aware Synthesis

• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
• Strategies to eliminate long wires
  – Preemptive: predict & prevent long wires
  – Reactive: use feedback from floorplanner

[Figure: FU1, FU2, FU3. Annotations: insert flip flop on long path; reschedule with added latency]

Challenges: Adaptable Voltage/Frequency Levels

• Allow voltage scaling beyond margins
• Using shadow latches in loop accelerator
  – Localized error detection
  – Control is predefined: simple error recovery

[Figure: flip-flop paired with a shadow latch. Labels: D, CLK, Q, error, delay. A second diagram shows FUs with shadow latches and extra queue entries]

For More Information

• Visit http://cccp.eecs.umich.edu