The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Page 1: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

The Raw Architecture

Signal Processing on a Scalable Composable Computation Fabric

David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal

http://www.cag.lcs.mit.edu/raw

MIT Laboratory For Computer Science

Page 2: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Outline

  Motivation
  Architecture
  Raw Prototype
  Networks
  Signal Processing Applications
  Status

Page 3: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Wire Delay and Tiled Architectures

Problem: The number of gates we can reach in one cycle is staying constant, but our chips are getting bigger.

Solutions:
  1. Hide wire-delay latency in the micro-architecture (clustering, hidden communication stalls).
  2. Expose the communication at the instruction set level and allow the software to exploit locality.

Fact 1: The number of transistors is growing.

Fact 2: Wires are not getting proportionally faster.

Page 4: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Wire Delay and Tiled Architectures

  2. Expose the communication at the instruction set level and allow the software to exploit locality.

Page 5: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Wire Delay and Tiled Architectures

  2. Expose the communication at the instruction set level and allow the software to exploit locality.

Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer.

Page 7: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

What Are We Building? The Raw Prototype

16 Replicated Tiles (Processors)

What is in a tile?
  8-stage pipelined MIPS-like 32-bit processor
  Pipelined Floating Point Unit
  32KB Data Cache
  32KB Instruction Memory
  Interconnect Routers

Page 8: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Raw’s Networking Resources

2 Dynamic Networks
  Fire and forget
  Header encodes destination
  2-stage router pipeline

2 Static Networks
  Software-configurable crossbar
  Interlocked and flow controlled
  5-stage static router pipeline
  3-cycle nearest-neighbor ALU-to-ALU communication latency
  No header overhead, but requires knowledge of communication patterns at compile time
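The static-network idea on this slide (a software-configured crossbar with no headers, driven by communication patterns known at compile time) can be sketched roughly as follows. This is only an illustrative Python model, not Raw's switch instruction set: the function name `run_static_switch`, the port names, and the one-route-table-per-cycle schedule format are all assumptions made for the example.

```python
from collections import deque

def run_static_switch(schedule, inputs):
    """Route words through a crossbar using a precomputed schedule.

    schedule: list of per-cycle route tables {input_port: output_port}
              (hypothetical format, standing in for compiled switch code).
    inputs:   {input_port: list of words arriving on that port}.
    Returns   {output_port: list of words delivered, in order}.
    """
    queues = {port: deque(words) for port, words in inputs.items()}
    outputs = {}
    for routes in schedule:
        for src, dst in routes.items():
            # No header is inspected: the route comes from the schedule alone.
            word = queues[src].popleft()
            outputs.setdefault(dst, []).append(word)
    return outputs

# Hypothetical two-cycle schedule: pass west->east every cycle,
# and deliver one word from north to the local processor in cycle 2.
schedule = [
    {"west": "east"},
    {"west": "east", "north": "proc"},
]
delivered = run_static_switch(schedule, {"west": [10, 11], "north": [20]})
```

Because the schedule is fixed ahead of time, every word on the wire is payload, which is the "no header overhead" trade-off the slide names; the cost is that the schedule must match the actual arrival pattern.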

Page 9: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Memory Mapped Communication is Not a First Class Citizen

[Figure: tile processor pipeline diagram. Stage labels as extracted: IF, RFDA, TL, M1, M2, F, P, E, U, TV, F4, WB.]

To other tiles, through memory system that happens to go over a network.

Page 10: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Raw’s First Class Register-Mapped Communication

[Figure: tile processor pipeline diagram (stage labels as extracted: IF, RFDA, TL, M1, M2, F, P, E, U, TV, F4, WB), with registers r24, r25, r26, r27 labeled as Network Input FIFOs and registers r24, r25, r26, r27 labeled as Network Output FIFOs.]

Ex: add r26, r25, r24
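One way to picture the register-mapped interface is a small Python model in which reading r24 to r27 dequeues a word from a network input FIFO and writing them enqueues a word on a network output FIFO. This is a sketch under that assumption only; the `Tile` class and its method names are hypothetical, and real hardware would presumably stall on an empty FIFO where this model raises an error.

```python
from collections import deque

# Registers shown in the slide's figure as network-mapped.
NETWORK_REGS = (24, 25, 26, 27)

class Tile:
    """Hypothetical model of one tile's register file plus network FIFOs."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NETWORK_REGS}
        self.net_out = {r: deque() for r in NETWORK_REGS}

    def read(self, r):
        # Reading a network register consumes the next incoming word.
        # (This model raises IndexError if the FIFO is empty.)
        if r in self.net_in:
            return self.net_in[r].popleft()
        return self.regs[r]

    def write(self, r, value):
        # Writing a network register sends the word to the network.
        if r in self.net_out:
            self.net_out[r].append(value)
        else:
            self.regs[r] = value

    def add(self, rd, rs, rt):
        self.write(rd, self.read(rs) + self.read(rt))

tile = Tile()
tile.net_in[25].append(3)   # word arriving on the port mapped to r25
tile.net_in[24].append(4)   # word arriving on the port mapped to r24
tile.add(26, 25, 24)        # the slide's example: add r26, r25, r24
sent = list(tile.net_out[26])
```

In this model the single `add` both receives two operands from the network and sends the result back out, with no separate load or store instructions.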

Page 11: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Signal Processing Applications

Problem: Increase performance of Signal Processing in a scalable fashion

Solution: Exploit parallelism in Signal Processing Applications at all levels

Page 12: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Types of Parallelism in Signal Processing

  DSP Filter Style
  Fine Grain Dataflow
  Instruction Level Parallelism
  Data Parallel
  Thread Level Parallelism (MPI)

[Figure labels: Current Architectures; Raw]

Page 13: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Instruction Level Parallelism

RawCC
  Maps dataflow graphs across tiles
  ILP across Multiprocessor
  Heavily latency sensitive
  Single-cycle reconfigurable communication

Page 14: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Fine Grain DataflowFine Grain Dataflow

Ex: Pipelined FIR Filter

[Figure: four-tap filter pipeline with inputs x_n, x_{n-1}, x_{n-2}, x_{n-3} and weights W0, W1, W2, W3]

Computation: mul, add

Input Operands: x_i, l

Output Operands: k

Cycle count

  Class        First  Second
  Compute      2      2
  Communicate  0      3
  Overall      2      5
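The filter structure on this slide can be sketched functionally in Python: each stage performs the slide's two operations (mul, add) on one sample and one incoming partial sum, and the stages are chained the way the tiles would be. This is a behavioral sketch, not a cycle-accurate model; the function names and the example weights are assumptions made for illustration.

```python
def fir_stage(weight, sample, partial_sum):
    """Work done by one pipeline stage: one multiply and one add."""
    product = weight * sample          # mul
    return partial_sum + product       # add

def pipelined_fir(samples, weights):
    """Compute y_n = sum_k W_k * x_{n-k} by chaining one stage per weight."""
    delay_line = [0] * len(weights)    # holds x_n, x_{n-1}, ..., oldest last
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        partial = 0
        # Each loop iteration stands for one stage passing its partial sum on.
        for w, tap in zip(weights, delay_line):
            partial = fir_stage(w, tap, partial)
        outputs.append(partial)
    return outputs

# Hypothetical weights W0..W3; an impulse input reads the weights back out.
weights = [1, 2, 3, 4]
outputs = pipelined_fir([1, 0, 0, 0, 2], weights)
```

Feeding a unit impulse returns the weights in order, which is a quick sanity check that each stage sees the correctly delayed sample.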

Page 15: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Fine Grain Dataflow

Cycle count

  Class        First  Second
  Compute      2      2
  Communicate  0      3
  Overall      2      5

First Class Interface

    mul $r3, Wx, NET_IN_1
    add NET_OUT1, NET_IN_2, $r3

Second Class Interface

    ld $r4, NET_IN_1_ADDR
    ld $r5, NET_IN_2_ADDR
    mul $r3, Wx, $r4
    add $r6, $r5, $r3
    st NET_OUT_1_ADDR, $r6
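The cycle-count table can be reproduced from the two listings by counting instructions, as in the Python sketch below. It assumes one cycle per instruction and treats `ld` and `st` as communication and everything else as computation; both are simplifying assumptions chosen to match the table, and `cycle_counts` is a hypothetical helper name.

```python
# Listings transcribed from the slide.
FIRST_CLASS = [
    "mul $r3, Wx, NET_IN_1",
    "add NET_OUT1, NET_IN_2, $r3",
]
SECOND_CLASS = [
    "ld $r4, NET_IN_1_ADDR",
    "ld $r5, NET_IN_2_ADDR",
    "mul $r3, Wx, $r4",
    "add $r6, $r5, $r3",
    "st NET_OUT_1_ADDR, $r6",
]

# Assumption: loads and stores exist only to move data to and from the network.
COMMUNICATE_OPCODES = {"ld", "st"}

def cycle_counts(listing):
    """Classify each instruction, assuming one cycle per instruction."""
    communicate = sum(
        1 for line in listing if line.split()[0] in COMMUNICATE_OPCODES
    )
    compute = len(listing) - communicate
    return {"compute": compute, "communicate": communicate,
            "overall": compute + communicate}

first = cycle_counts(FIRST_CLASS)
second = cycle_counts(SECOND_CLASS)
```

Under these assumptions the register-mapped version spends nothing on communication, while the memory-mapped version spends three of its five instructions moving operands.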

Page 16: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

DSP Filter Style

[Figure: signal-processing pipeline mapped across tiles. Labels: Off-chip (x2), Down-Sample, FFT (x4), Frequency Domain Filter, FFT^-1 (x4)]

Page 17: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Raw is Composable

Mix and match types of parallelism

[Figure: tiles partitioned among concurrent workloads. Labels: 4-way Threaded Java Application, 2-way RawCC Application, httpd, White balance (x2), Aliasing filter, mem (x2), Zzz.]

Page 18: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Raw Status

Stats
  IBM SA-27E 0.15 µm, 6-layer copper
  18.2 mm x 18.2 mm die
  0.122 billion transistors
  2048 KB SRAM on-chip
  1657-pin CCGA package
  1080 HSTL signal I/O operating at core speed
  225 MHz
  ~25 Watts

Page 19: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

The Raw Performance

16 OPS/FLOPS per cycle (@225MHz = 3.6 GFLOPS)

230 Gb/s of on-chip “bisection bandwidth”

201 Gb/s of off-chip I/O bandwidth

115 Gb/s of on-chip memory bandwidth
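The peak-rate figure follows directly from 16 tiles at 225 MHz, and two of the bandwidth figures can be matched by simple products. The Python sketch below shows the arithmetic. The decompositions for bisection and memory bandwidth are assumptions that happen to reproduce the slide's rounded numbers (a 4 x 4 tile arrangement, four networks, 32-bit links in both directions, and one 32-bit access per tile per cycle); the slides do not state how those figures were derived.

```python
CLOCK_HZ = 225_000_000   # from the slides
TILES = 16               # from the slides
WORD_BITS = 32           # processor word size, from the slides

# One operation per tile per cycle, as stated on the slide.
peak_gops = TILES * CLOCK_HZ / 1e9

# Assumed decomposition: cutting a 4 x 4 arrangement in half crosses 4 tile
# boundaries, each carrying 4 networks (2 static + 2 dynamic) in 2 directions.
MESH_SIDE = 4
NETWORKS = 4
DIRECTIONS = 2
bisection_gbps = (MESH_SIDE * NETWORKS * DIRECTIONS * WORD_BITS * CLOCK_HZ) / 1e9

# Assumed decomposition: every tile makes one 32-bit memory access per cycle.
memory_gbps = (TILES * WORD_BITS * CLOCK_HZ) / 1e9
```

Under these assumptions the products come to 3.6, 230.4, and 115.2, which agree with the slide's 3.6 GFLOPS, 230 Gb/s, and 115 Gb/s after rounding.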

Page 20: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Raw Status

Working:
  Cycle Accurate Software Simulator
  RTL Simulation
  Emulation System
  RawCC ILP Compiler

Current:
  Verification
  Backend Completion

Tapeout December 2001

Chips Back Summer 2002

Page 21: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Summary

Raw’s First Class communication facilitates exploitation of new forms of parallelism in Signal Processing applications

Page 22: The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric

Extra Slides