
Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar

Granular Computing

June 8, 2017

Important Trends

● Moore’s law, Dennard scaling, Power Wall, Memory Wall

=> Use transistors efficiently to achieve better Performance / Watt

=> Exploit data locality, parallelism

● High NRE costs in fabricating ASICs

=> Build programmable hardware to amortize costs

● Availability of large amounts of data + algorithmic innovations

=> Build hardware with high compute density

=> Programmable Accelerator Architectures


Reconfigurable Accelerators

● Statically reprogrammable data path using configuration bits

● Power Efficiency: Avoids overheads of general purpose CPUs, GPGPUs

Instruction fetch, decode, register file access

40% of datapath energy on CPU [1]

30% of dynamic power on GPU [2]

● Flexibility: Amortizes NRE fabrication costs of ASIC

● FPGAs gaining traction as reconfigurable accelerators

[1] Hameed et al., “Understanding Sources of Inefficiency in General-purpose Chips”, ISCA 2010

[2] Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs”, ISCA 2013


FPGA: The good and bad

● Bit-level reconfigurable logic elements + static interconnect

● Good

Flexibility, Performance / Watt

Commercially successful, mature toolchain support

● Bad

Architectural overheads: 60% of area and power spent in the interconnect

Reduced compute density, slower clock rates

Long compile times, Low-level programming models

Design reconfigurable hardware with the right abstractions


Our Approach

● Parallel Patterns

High-level programming abstractions capturing parallelism and locality

Can express wide variety of applications

Previous work shows that FPGAs can be programmed from parallel patterns

Design reconfigurable primitives to accelerate parallel patterns

[Figure: the map, zip, reduce, and groupBy patterns; groupBy buckets elements under key1, key2, key3]
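The four patterns named on this slide can be illustrated with a minimal software sketch. The function names below (`pattern_map`, `pattern_zip`, `pattern_reduce`, `pattern_group_by`) are hypothetical and chosen only for illustration; they describe the semantics of each pattern, not how Plasticine or its compiler implements them.

```python
from collections import defaultdict
from functools import reduce


def pattern_map(f, xs):
    # map: apply f to every element independently (data parallel)
    return [f(x) for x in xs]


def pattern_zip(f, xs, ys):
    # zip: combine two collections elementwise with f
    return [f(x, y) for x, y in zip(xs, ys)]


def pattern_reduce(f, init, xs):
    # reduce: fold a collection into one value with an associative f
    return reduce(f, xs, init)


def pattern_group_by(key, xs):
    # groupBy: bucket elements by the key computed for each one
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)


squares = pattern_map(lambda x: x * x, [1, 2, 3, 4])
products = pattern_zip(lambda a, b: a * b, [1, 2, 3], [4, 5, 6])
total = pattern_reduce(lambda a, b: a + b, 0, products)
by_parity = pattern_group_by(lambda x: x % 2, [1, 2, 3, 4, 5])
```

Composing `pattern_zip` with `pattern_reduce`, as in `total`, gives a dot product, the same computation used as the running example later in the talk.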


Key Observations

● Nested Parallelism

Data and pipeline parallelism at innermost loop level

Coarse-grained pipelining and parallelism at outer levels

● Locality, on-chip bandwidth, and buffering

Large on-chip memories with parallel read/write ports to sustain compute throughput

On-chip memory access patterns can be different

Address partitioning to implement buffering for coarse-grained pipelining

● Dense and sparse memory accesses

Burst DRAM accesses for dense data structures e.g., matrices

Sparse / random DRAM access for sparse data structures e.g., graphs

● Communication

Patterns produce and consume scalar data and arrays
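The buffering observation above can be sketched in software as a ring of N tile buffers shared by a producer stage and a consumer stage. This is only a behavioral model under assumed names (`NBuffer`, `run_pipeline`); it says nothing about how the hardware partitions addresses.

```python
class NBuffer:
    """Ring of n tile slots between one producer stage and one consumer stage."""

    def __init__(self, n):
        self.slots = [None] * n
        self.n = n
        self.written = 0  # tiles produced so far
        self.read = 0     # tiles consumed so far

    def can_write(self):
        # The producer may run ahead of the consumer by at most n tiles.
        return self.written - self.read < self.n

    def can_read(self):
        return self.read < self.written

    def write(self, tile):
        assert self.can_write()
        self.slots[self.written % self.n] = tile
        self.written += 1

    def read_tile(self):
        assert self.can_read()
        tile = self.slots[self.read % self.n]
        self.read += 1
        return tile


def run_pipeline(tiles, n):
    """Each loop iteration is one pipeline step: the consumer works on an
    older tile while the producer fills a different slot."""
    buf = NBuffer(n)
    pending = list(tiles)
    results = []
    while pending or buf.can_read():
        if buf.can_read():
            results.append(sum(buf.read_tile()))
        if pending and buf.can_write():
            buf.write(pending.pop(0))
    return results


sums = run_pipeline([[1, 2], [3, 4], [5, 6]], n=2)
```

With n = 2 this is ordinary double buffering; larger n lets the producer run further ahead of a slower consumer.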


Plasticine

● New reconfigurable accelerator architecture

● Datapath

Hierarchical organization to exploit nested parallelism

● On-chip Memories

Large, banked scratchpad memories with configurable address decoding

Hardware support for generalized double buffering (N-buffering)

● Address generators and address coalescing

Efficient burst access generation for dense data

Scatter-gather support, large number of outstanding requests for sparse data

● Interconnect

Multi-level interconnect to enable scalar, vector, and control communication

Pipelined switches to avoid overheads, long wires
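To illustrate why configurable address decoding matters for banked scratchpads, here is a minimal sketch comparing two decoding schemes. The scheme names, bank count, and bank size are assumptions made for the example; the slide does not list the decoding modes the hardware supports.

```python
NUM_BANKS = 4     # assumed number of scratchpad banks
BANK_WORDS = 16   # assumed words per bank


def decode_interleaved(addr):
    # Consecutive addresses land in consecutive banks.
    return addr % NUM_BANKS, addr // NUM_BANKS


def decode_blocked(addr):
    # Each bank holds one contiguous range of addresses.
    return addr // BANK_WORDS, addr % BANK_WORDS


def has_bank_conflict(addrs, decode):
    # A conflict occurs when two lanes of one vector access hit the same bank.
    banks = [decode(a)[0] for a in addrs]
    return len(set(banks)) < len(banks)


vector_access = [8, 9, 10, 11]  # four lanes reading consecutive words
conflict_interleaved = has_bank_conflict(vector_access, decode_interleaved)
conflict_blocked = has_bank_conflict(vector_access, decode_blocked)
```

In this sketch the interleaved scheme serves all four lanes in parallel, while the blocked scheme sends them to one bank. Which scheme avoids conflicts depends on the access pattern, which is the motivation for making the decoding configurable.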


Plasticine: Top-Level


Pattern Compute Unit (PCU)


PCU: Pipeline Network


PCU: Reduction Network


PCU: Shift Network


Pattern Memory Unit (PMU)


Address Generators, Coalescing Unit

● Reconfigurable integer data paths for DRAM address calculation logic

Optimized for the common case of dense ‘burst’ DRAM access

Frees up PCUs for other computation, increases utilization

● Arbitration between multiple address streams

Coalescing unit arbitrates between address generators sharing same DRAM channel

● Scatter-gather support

Coalescing unit maintains sparse request metadata in a coalescing cache

Hardware combines requests belonging to same DRAM burst

Coalescing cache allows large number of outstanding requests
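The idea of combining sparse requests that fall in the same DRAM burst can be sketched as follows. The 64-byte burst size and the function name `coalesce` are assumptions for illustration; the real coalescing cache is a hardware structure whose organization is not described on this slide.

```python
BURST_BYTES = 64  # assumed DRAM burst size in bytes


def coalesce(addresses, burst_bytes=BURST_BYTES):
    """Group byte addresses by the burst that contains them.

    Returns a dict mapping each burst base address to a list of
    (request_index, offset_within_burst) pairs, which plays the role of
    the per-request metadata kept until the burst returns.
    """
    bursts = {}
    for index, addr in enumerate(addresses):
        base = addr - addr % burst_bytes
        bursts.setdefault(base, []).append((index, addr % burst_bytes))
    return bursts


requests = [0, 4, 260, 8, 256, 1024]
bursts = coalesce(requests)
num_dram_requests = len(bursts)
```

Here six sparse requests collapse into three burst requests; the metadata records which original request each returned word belongs to.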


Interconnect

● Three interconnects with different levels of granularity

Vector: Vector (multi-word) granularity

Scalar: Single word granularity

Control: Bit-level granularity

● Pipelined switches to avoid long wires

1 hop = 1 cycle

Enables faster clock rate

● Counters and Control within switches

Outer loop logic mostly involves loop indices and control only

Implementing outer loop logic in PCUs => underutilization


Execution Model and Control

● Scratchpad access decoupled from compute

PMU: Scratchpad read/write address calculation

PCU: Core computation

FIFOs at inputs ease routing constraints

● Decentralized control mechanism to orchestrate execution

Tokens: Feed-forward pulse signals indicating forward flow

Credits: Feedback pulse signals indicating backpressure

Generalizes to arbitrary levels of nested pipelining

● Tokens, credits, and local FIFO state drive execution

Control blocks contain counters to manage tokens and credits

See paper for details
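The token and credit scheme above can be sketched as a small cycle-stepped model of one producer and one consumer. Everything here (the function `simulate`, its parameters, and the one-cycle visibility of pulses) is an assumption made for illustration; the paper, not this sketch, defines the actual control protocol.

```python
def simulate(num_items, depth, produce_cycles, consume_cycles):
    """Return (total_cycles, max_outstanding) for one producer/consumer pair.

    credits: free downstream buffers the producer may still fill
    tokens:  filled buffers the consumer has not started on yet
    Both latencies must be at least 1 cycle.
    """
    credits, tokens = depth, 0
    started = consumed = 0
    p_busy = c_busy = 0   # remaining cycles of the item in progress
    cycle = 0
    max_outstanding = 0
    while consumed < num_items:
        cycle += 1
        # Producer starts only if it holds a credit (no backpressure).
        if p_busy == 0 and started < num_items and credits > 0:
            credits -= 1
            started += 1
            p_busy = produce_cycles
        # Consumer starts only if a token has arrived.
        if c_busy == 0 and tokens > 0:
            tokens -= 1
            c_busy = consume_cycles
        max_outstanding = max(max_outstanding, depth - credits)
        if p_busy > 0:
            p_busy -= 1
            if p_busy == 0:
                tokens += 1       # token pulse: data is ready
        if c_busy > 0:
            c_busy -= 1
            if c_busy == 0:
                credits += 1      # credit pulse: buffer is free again
                consumed += 1
    return cycle, max_outstanding


cycles_single, _ = simulate(4, depth=1, produce_cycles=1, consume_cycles=3)
cycles_double, outstanding = simulate(4, depth=2, produce_cycles=1, consume_cycles=3)
```

No central controller appears in the model: each side acts only on its local counters, and the number of items in flight never exceeds the buffer depth.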


Application Mapping

[Figure: compilation flow. Labels on the slide: Unrolling, Splitting, DHDL, Virtual PCUs, Mapping, Resource Allocation, Routing, Bitstream generation, Plasticine Bitstream]

Koeplinger et al., “Automatic Generation of Efficient Accelerators for Reconfigurable Hardware”, ISCA 2016


Example: Dot Product

    val out = Reg[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Reduce(N by B)(out) { i =>
      val tileA = SRAM[Float](B)
      val tileB = SRAM[Float](B)
      val acc = Reg[Float]

      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)

      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}

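[Figure: dot product dataflow. A and B in DRAM feed TileA and TileB; a multiplier (×) feeds an adder (+) into acc, and a second adder (+) accumulates into out]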
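As a cross-check of the two-level structure in this example, here is a plain software reference model of the same tiled dot product. It mirrors the names on the slide (`tileA`, `tileB`, `acc`, `out`) but is only a sequential sketch: it assumes the vector length is a multiple of the tile size and does not model parallelism or pipelining.

```python
def dot_product_tiled(vector_a, vector_b, tile_size):
    """Sequential model of the nested Reduce on the slide."""
    n = len(vector_a)
    assert n == len(vector_b), "vectors must have equal length"
    assert n % tile_size == 0, "assumes N is a multiple of the tile size B"

    out = 0.0
    for i in range(0, n, tile_size):           # outer Reduce(N by B)
        tile_a = vector_a[i:i + tile_size]     # tileA load vectorA(i :: i+B)
        tile_b = vector_b[i:i + tile_size]     # tileB load vectorB(i :: i+B)
        acc = 0.0
        for j in range(tile_size):             # inner Reduce(B by 1)
            acc += tile_a[j] * tile_b[j]
        out += acc                             # outer combine: a + b
    return out


result = dot_product_tiled([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], tile_size=2)
```

The inner loop corresponds to work within one tile, and the outer loop to the coarse-grained level where loads of the next tile could overlap with compute on the current one.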


Example: Dot Product

[Figure: dot product example, with labels A, B, tile A, tile B]


Evaluation

Sizing, Area, Power, Performance, Perf / W


Architecture Sizing


Plasticine Clock, Area, and Power

Technology Node: 28 nm

Clock Frequency: 1 GHz

Total Area: 112.77 mm²

Total Power: 49 W


Area Breakdown

Plasticine (chip): PCU 48%, PMU 30%, Interconnect 17%, MC 5%

PCU: FU 72%, Regs 17%, FIFO 10%, Control 1%

PMU: Scratchpad 90%, FIFO 5%, Regs 4%


Experimental Setup

● Plasticine:

Implemented using Chisel, RTL synthesized with 28nm library

4 DDR3-1600 DRAM channels, peak memory bandwidth = 51.2 GB/s

1 GHz clock

● FPGA:

Altera Stratix V, 28 nm technology

6 DDR3-800 DRAM channels, peak memory bandwidth = 37.5 GB/s

150 MHz clock
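The Plasticine peak bandwidth figure above can be reproduced with a short calculation. The sketch assumes a 64-bit (8-byte) data bus per DRAM channel, which the slide does not state; the helper name `peak_bandwidth_gb_s` is hypothetical.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    # bus_bytes=8 assumes a 64-bit data bus per channel (not stated on the slide)
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# 4 channels of DDR3-1600: 1600 mega-transfers per second per channel
plasticine_bw = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1600)
```

Under that assumption each DDR3-1600 channel contributes 12.8 GB/s, and four channels give the 51.2 GB/s quoted for Plasticine.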


Experimental Setup

● Plasticine:

Performance: Cycle-accurate simulation using VCS + DRAMSim2

Area: Synopsys DC after synthesis

Chip Power: RTL trace-driven simulation using PrimeTime

● FPGA:

Performance: Measured execution time on FPGA

Utilization: Reports from Altera logic synthesis tools

Chip Power: Altera PowerPlay tool


Plasticine vs. FPGA


Resource Utilization

[Figure: bar chart of resource utilization, axis 0 to 100, for PCU, PMU, AG, FU, and Reg]


Conclusion

● Co-designing reconfigurable architecture and programming models based on parallel patterns leads to efficient, programmable systems

● Plasticine accelerates dense and sparse applications composed of parallel patterns

● Design space exploration captures tradeoffs between architecture parameters and application characteristics

● Up to 95x improvement in performance and 77x improvement in Perf/W over an FPGA in a similar process technology, with an area of 113 mm²


The Team

Christos Kozyrakis Kunle Olukotun

Yaqi Zhang David Koeplinger Matt Feldman

Tian Zhao Stefan Hadjis Ardavan Pedram
