
Spatial Computation: Computing without General-Purpose Processors

Mihai Budiu
mihaib@cs.cmu.edu

Carnegie Mellon University

July 8, 2004


Spatial Computation

A computation model based on:

• application-specific hardware

• no interpretation

• minimal resource sharing


The Engine Behind This Talk

main( )
{
    signal(SIGINT, welcome);
    while (slides( ) && time( )) {
        talk( );
    }
}

Research Scope

Object: future architectures

Tool: compilers

Evaluation: simulators

Research Methodology

Constraint Space

state-of-the-art

X (e.g., power)

Y (e.g., cost)

“reasonable limits”

incremental evolution

new solutions

Outline

• Introduction: problems of current architectures

• Compiling Application-Specific Hardware

• Pipelining

• ASH Evaluation

• Conclusions

Resources

• We do not worry about not having hardware resources

• We worry about being able to use hardware resources

[Intel]

Design Complexity

[Figure: chip size vs. designer productivity over time; axis label: 1981]

Communication vs. Computation

gate: 5ps

wire: 20ps

Power consumption on wires is also dominant

Power Consumption

Toasted CPU: about 2 sec after removing cooler.

(Tom’s Hardware Guide)

Energy Efficiency

Pentium 4

Clock Speed

Cannot rely on global signals (clock is a global signal)

Instruction-Set Architecture

Software

Hardware

VERY rigid to changes (e.g. x86 vs Itanium)

Our Proposal

• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU

CPU: low-ILP computation + OS + VM

ASH: high-ILP computation

Memory

Outline

• Problems of current architectures

• CASH: Compiling ASH
– program representation
– compiling C programs

• Pipelining

• ASH Evaluation

• Conclusions

Application-Specific Hardware

C program

Compiler

Dataflow IR

Reconfigurable/custom hw

HW backend

Application-Specific Hardware

C program

Compiler

Dataflow IR

CPU [predication]

SW backend

Key: Intermediate Representation

Traditionally: def-use, may-dep.

Our IR

• SSA + predication + speculation

• Uniform for scalars and memory

• Explicitly encodes may-depend

• Executable

• Precise semantics

• Dataflow IR

• Close to asynchronous target

Computation = Dataflow

• Operations ⇒ functional units
• Variables ⇒ wires
• No interpretation

x = a & 7;
...
y = x >> 2;

Programs

Circuits
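As a rough software analogy of the mapping above, each operation can be written as its own "functional unit" and each variable as a value handed directly from producer to consumer. The names `and_unit`, `shr_unit`, and `circuit` are hypothetical, chosen only for illustration; this is a sketch, not the compiler's actual output.

```c
#include <stdint.h>

/* Hypothetical functional unit for "x = a & 7". */
static uint32_t and_unit(uint32_t a, uint32_t mask)
{
    return a & mask;
}

/* Hypothetical functional unit for "y = x >> 2". */
static uint32_t shr_unit(uint32_t x, unsigned amount)
{
    return x >> amount;
}

/* The "circuit": wire x connects the AND unit directly to the shifter.
   Nothing is fetched or decoded between the two operations. */
uint32_t circuit(uint32_t a)
{
    uint32_t x = and_unit(a, 7);  /* wire x */
    uint32_t y = shr_unit(x, 2);  /* wire y */
    return y;
}
```

For example, `circuit(13)` computes `13 & 7 = 5` and then `5 >> 2 = 1`.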

Basic Computation

Asynchronous Computation

Distributed Control Logic

ack

rdy

global

asynchronous control

short, local wires
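The rdy/ack signalling between neighbouring operations can be approximated in software as a one-slot channel. This is a simplified sketch under assumptions: the slides do not spell out the exact handshake protocol, and the type `channel` and the functions `channel_send` and `channel_receive` are hypothetical names.

```c
/* One-slot channel between a producer and a consumer.
   rdy: producer has placed valid data.
   ack: consumer has taken the data, so the slot may be reused. */
typedef struct {
    int data;
    int rdy;
    int ack;
} channel;

/* Producer side: succeeds only when the previous value was consumed. */
int channel_send(channel *c, int value)
{
    if (c->rdy)
        return 0;          /* still waiting for the consumer's ack */
    c->data = value;
    c->rdy = 1;
    c->ack = 0;
    return 1;
}

/* Consumer side: succeeds only when valid data is present. */
int channel_receive(channel *c, int *value)
{
    if (!c->rdy)
        return 0;          /* nothing to consume yet */
    *value = c->data;
    c->rdy = 0;
    c->ack = 1;            /* tell the producer the slot is free */
    return 1;
}
```

Each channel involves only its two endpoints, which is the property the slide highlights: control is local to a pair of operations rather than driven by a global signal.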

Outline

• Problems of current architectures

• CASH: Compiling ASH– program representation– compiling C programs

• Pipelining

• ASH Evaluation

• Conclusions

MUX: Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

Conditionals ⇒ Speculation

critical path

SSA = no arbitration
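The conversion of a forward branch into a multiplexor can be sketched in C as follows. Both arms are computed speculatively and the predicate selects one of them. The function names `mux` and `branch_as_dataflow` are hypothetical.

```c
/* Multiplexor: selects one of two already-computed values. */
int mux(int predicate, int if_true, int if_false)
{
    return predicate ? if_true : if_false;
}

/* Dataflow form of:  if (x > 0) y = -x; else y = b*x;
   Note: in C, signed overflow in a speculated arm is undefined behaviour,
   so this sketch assumes small operands. Hardware would simply discard
   the unused result. */
int branch_as_dataflow(int x, int b)
{
    int p      = x > 0;   /* predicate */
    int y_then = -x;      /* computed speculatively */
    int y_else = b * x;   /* computed speculatively */
    return mux(p, y_then, y_else);  /* single definition of y, as in SSA */
}
```

Because y has exactly one producer (the mux), no arbitration between competing writers is needed.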

Control Flow ⇒ Data Flow

[Figure labels: data, predicate, Merge (label), Gateway, Split (branch), p]

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: dataflow circuit for this loop; node labels: +1, < 100, !]
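The merge and split nodes named on this slide can be mimicked in C to show how the loop's values circulate. This is an illustrative sketch only: `merge` and `sum_of_squares_dataflow` are hypothetical names, and the sequential `for (;;)` stands in for what would be concurrent hardware.

```c
/* Merge (label): takes the initial value on entry to the loop and the
   back-edge value on every later iteration. */
static int merge(int first_iteration, int initial, int back_edge)
{
    return first_iteration ? initial : back_edge;
}

int sum_of_squares_dataflow(void)
{
    int i_back = 0, sum_back = 0;  /* values travelling on the back edges */
    int first = 1;

    for (;;) {
        int i   = merge(first, 0, i_back);
        int sum = merge(first, 0, sum_back);
        first = 0;

        int p = i < 100;           /* loop predicate */

        /* Split (branch): the predicate steers each value either to the
           loop exit or back around the loop. */
        if (!p)
            return sum;            /* exit side of the split */

        sum_back = sum + i * i;    /* loop side of the split */
        i_back   = i + 1;
    }
}
```

The function returns the same value as the original loop, 328350.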

no speculation

sequencing of side-effects

Predication and Side-Effects

to memory

Memory Access

Monolithic Memory

local communication

global structures

pipelined arbitrated network

Future work: fragment this! (backup slides: related work, complexity)

CASH Optimizations

• SSA-based optimizations
– unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining

• Memory optimizations
– dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling

• Boolean optimizations
– Espresso CAD tool, bitwidth analysis
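As one illustration of the listed optimizations, strength reduction combined with induction-variable analysis can remove the multiplication from the talk's running example. The code below is a hand-written sketch of the effect of such a transformation, not actual CASH output; both function names are hypothetical.

```c
/* Original loop: one multiplication per iteration. */
int sum_squares(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

/* After strength reduction: sq tracks i*i and is updated using
   (i+1)^2 = i^2 + (2*i + 1), so only additions remain. */
int sum_squares_reduced(int n)
{
    int sum = 0;
    int sq  = 0;   /* invariant at loop top: sq  == i*i     */
    int odd = 1;   /* invariant at loop top: odd == 2*i + 1 */
    for (int i = 0; i < n; i++) {
        sum += sq;
        sq  += odd;
        odd += 2;
    }
    return sum;
}
```

In a spatial implementation this matters because a multiplier is larger and slower than an adder.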

Outline

• Problems of current architectures

• Compiling ASH

• Pipelining

• Evaluation: CASH vs. clocked designs

• Conclusions

Pipelining

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure, step 1: dataflow circuit for the loop, with variable i and a pipelined multiplier (8 stages)]

[Figure, steps 2 to 6: successive animation frames of the pipelined loop circuit]

Pipelining

[Figure, step 7; labels: i’s loop, sum’s loop, long-latency pipe, predicate]

The predicate ack edge is on the critical path.

Pipelining

[Figure labels: critical path, i’s loop, sum’s loop]

Pipeline balancing

[Figure, step 7; labels: i’s loop, sum’s loop, decoupling FIFO]

Pipeline balancing

[Figure labels: i’s loop, sum’s loop, critical path, decoupling FIFO]

Outline

• Problems of current architectures

• Compiling ASH

• Pipelining

• Evaluation: CASH vs. clocked designs

• Conclusions

Evaluating ASH

C

CASH core

Verilog back-end

Synopsys, Cadence P/R

180nm std. cell library, 2V

~1999 technology

Mediabench kernels (1 hot function/benchmark)

ModelSim (Verilog simulation)

performance numbers

ASH Area

[Figure: normalized area, split into memory access and datapath; reference marks: minimal RISC core, P4: 217]

ASH vs 600MHz CPU [.18 μm]

[Figure: ASH vs. CPU comparison; visible data labels: 0.45, 0.45, 1.73, 1.62]

Bottleneck: Memory Protocol

ST Memory

• Token release to dependents: requires round-trip to memory.

• Limit study: round trip zero time ⇒ up to 6x speed-up.

• Exploring protocol for in-order data delivery & fast token release.

Power

DSP: 110

μP: 4000

Xeon [+cache]: 67000

Energy Efficiency

[Figure: energy efficiency in operations/nJ, log scale from 0.01 to 1000, comparing general-purpose DSP, dedicated hardware, ASH media kernels, asynchronous μP, and microprocessors]

Outline

Problems of current architectures

+ Compiling ASH

+ Pipelining

+ ASH Evaluation

= Future/related work & conclusions

Related Work

Nanotechnology

Dataflow machines

High-level synthesis

Reconfigurable computing

Computer architecture

Embedded systems

Asynchronous circuits

Compilation

Future Work

• Optimizations for area/speed/power

• Memory partitioning

• Concurrency

• Compiler-guided layout

• Explore extensible ISAs

• Hybridization with superscalar mechanisms

• Reconfigurable hardware support for ASH

• Formal verification

How far can you go?

Grand Vision: Certified Circuit Generation

• Translation validation: input ≡ output

• Preserve input properties
– e.g., C programs cannot deadlock
– e.g., type-safe programs cannot crash

• Debug, test, verify only at source-level

HLL → IR → IRopt → Verilog → gates → layout

formally validated

Conclusions

Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Design productivity, no ISA

Spatial computation strengths

Backup Slides

• Reconfigurable hardware
• Critical paths
• Control logic
• ASH vs ...
• ASH weaknesses
• Exceptions
• Normalized area
• Why C?
• Splitting memory
• More performance
• Recursive calls

Reconfigurable Hardware

Universal gates

and/or

storage elements

Interconnection network

Programmable switches

Switch controlled by a 1-bit RAM cell

Universal gate = RAM

[Figure labels: a0, a1, a0, data, a1 & a2, 0, data in, control]

Main RH Ingredient: RAM Cell
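The idea that a universal gate is just a small RAM can be sketched as a 2-input lookup table: the gate inputs form the read address, and "programming" the gate means filling in its truth table. The type `lut2` and both function names are hypothetical, used here only for illustration.

```c
#include <stdint.h>

/* A 2-input lookup table: four 1-bit RAM cells packed into the low
   four bits of `cells`. Bit number (a1*2 + a0) holds the output for
   inputs (a1, a0). */
typedef struct {
    uint8_t cells;
} lut2;

/* Program the gate by writing its truth table, listed in the order
   (a1,a0) = 00, 01, 10, 11. */
lut2 lut2_program(int out00, int out01, int out10, int out11)
{
    lut2 g;
    g.cells = (uint8_t)((out00 & 1)
                      | ((out01 & 1) << 1)
                      | ((out10 & 1) << 2)
                      | ((out11 & 1) << 3));
    return g;
}

/* Evaluate the gate: the inputs are used as the RAM read address. */
int lut2_eval(lut2 g, int a1, int a0)
{
    unsigned addr = ((unsigned)(a1 & 1) << 1) | (unsigned)(a0 & 1);
    return (g.cells >> addr) & 1;
}
```

The same storage implements AND when programmed as `lut2_program(0, 0, 0, 1)` and XOR when programmed as `lut2_program(0, 1, 1, 0)`.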

Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

Solves the problem of unbalanced paths
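A small timing model can show why leniency helps with unbalanced paths. In this sketch each value carries the time at which it becomes available; a strict multiplexor waits for all inputs, while a lenient one waits only for the predicate and the selected input. The type `timed`, the time units, and the function names are all assumptions made for illustration, and the model ignores acknowledgment traffic.

```c
/* A value together with the (hypothetical) time it becomes available. */
typedef struct {
    int value;
    int ready_at;
} timed;

static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* Strict mux: output is produced only after every input has arrived. */
timed mux_strict(timed p, timed if_true, timed if_false)
{
    timed out;
    out.value = p.value ? if_true.value : if_false.value;
    out.ready_at = max2(p.ready_at,
                        max2(if_true.ready_at, if_false.ready_at));
    return out;
}

/* Lenient mux: output is produced as soon as the predicate and the
   selected input are known; the unselected input is not waited for. */
timed mux_lenient(timed p, timed if_true, timed if_false)
{
    timed sel = p.value ? if_true : if_false;
    timed out;
    out.value = sel.value;
    out.ready_at = max2(p.ready_at, sel.ready_at);
    return out;
}
```

With x = 3 and b = 5, suppose the negation is ready at time 1 and the multiplication at time 8. The strict mux delivers y = -3 at time 8, while the lenient mux delivers it at time 1.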

Asynchronous Control

[Figure labels: datain, dataout, rdyout, ackin, ackout]

HLL to HW

High-level Synthesis: Behavioral HDL → Synchronous Hardware

Reconfigurable Computing: C [subsets] → Hardware configuration (spatial computation)

Asynchronous circuits: Concurrent Language → Asynchronous Hardware

Prior work

This research

CASH vs High-Level Synthesis

• CASH: the only existing tool to translate complete ANSI C to hardware

• CASH generates asynchronous circuits

• CASH does not treat C as an HDL
– no annotations required
– no reactivity model
– does not handle non-C, e.g., concurrency

ASH Weaknesses

• Low efficiency for low-ILP code

• Does not adapt at runtime

• Monolithic memory

• Resource waste

• Not flexible

• No support for exceptions

ASH Weaknesses (2)

• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming

• Calls/returns not lenient

Branch Prediction

for (i=0; i < N; i++) {
    if (exception) break;
}

Predicted not taken: effectively a noop for CPU!

Predicted taken.

exception

result available before inputs

ASH crit path

CPU crit path

Exceptions

• Strictly speaking, C has no exceptions

• In practice hard to accommodate exceptions in hardware implementations

• An advantage of software flexibility: PC is single point of execution control

CPU: low-ILP computation + OS + VM + exceptions

ASH: high-ILP computation

Memory

Why C?

• Huge installed base

• Embedded specifications written in C

• Small and simple language
– Can leverage existing tools
– Simpler compiler

• Techniques generally applicable

• Not a toy language

Performance

[Figure: performance per benchmark (g721_d, g721_e, mpeg2_d, mpeg2_e, pegwit_); series: MOPSall, MOPSspec, MOPS]

Parallelism Profile

Normalized Area

[Figure: normalized area; labels: 2.5, lines/sq mm, sq mm/kbyte]

Memory Partitioning

• MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00

• Stanford SpC: Semeria DAC ‘01, TVLSI ‘02

• Berkeley CCured: Necula POPL ‘02

• Illinois FlexRAM: Fraguela PPoPP ‘03

• Hand-annotations #pragma


Memory Complexity

[Figure labels: RAM, addr]

Recursion

recursive call

save live values

restore live values

stack
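The save/restore scheme on this slide can be illustrated by rewriting a recursive function with an explicit stack. Factorial is used here only as a hypothetical example; it does not appear in the talk, and `STACK_DEPTH` and the function names are assumptions.

```c
#include <assert.h>

#define STACK_DEPTH 64

/* Reference: ordinary recursive definition. */
unsigned fact_recursive(unsigned n)
{
    return n <= 1 ? 1 : n * fact_recursive(n - 1);
}

/* Same computation with the call made explicit. The value that is live
   across the recursive call (n) is saved on a stack before the "call"
   and restored after the "return". */
unsigned fact_explicit_stack(unsigned n)
{
    unsigned stack[STACK_DEPTH];
    int top = 0;

    /* Call phase: save live values. */
    while (n > 1) {
        assert(top < STACK_DEPTH);
        stack[top++] = n;
        n = n - 1;
    }

    /* Return phase: restore live values and finish each pending multiply. */
    unsigned result = 1;
    while (top > 0) {
        unsigned saved_n = stack[--top];
        result = saved_n * result;
    }
    return result;
}
```

Both versions agree, for instance `fact_explicit_stack(5)` and `fact_recursive(5)` both give 120.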
