Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar

Granular Computing

June 8, 2017 Plasticine: A Reconfigurable Architecture for Parallel Patterns Slide 2

Important Trends

● Moore’s law, Dennard scaling, Power Wall, Memory Wall

=> Use transistors efficiently to achieve better Performance / Watt

=> Exploit data locality, parallelism

● High NRE costs in fabricating ASICs

=> Build programmable hardware to amortize costs

● Availability of large amounts of data + algorithmic innovations

=> Build hardware with high compute density

=> Programmable Accelerator Architectures

Reconfigurable Accelerators

● Statically reprogrammable data path using configuration bits

● Power Efficiency: Avoids overheads of general purpose CPUs, GPGPUs

Instruction fetch, decode, register file access

40% of datapath energy on CPU [1]

30% of dynamic power on GPU [2]

● Flexibility: Amortizes NRE fabrication costs of ASIC

● FPGAs gaining traction as reconfigurable accelerators

[1] Hameed et al., “Understanding Sources of Inefficiency in General-Purpose Chips”, ISCA 2010

[2] Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs”, ISCA 2013

FPGA: The good and bad

● Bit-level reconfigurable logic elements + static interconnect

● Good

Flexibility, Performance / Watt

Commercially successful, mature toolchain support

● Bad

Architectural overheads: 60% area, power spent in the interconnect

Reduced compute density, slower clock rates

Long compile times, Low-level programming models

Design reconfigurable hardware with the right abstractions

Our Approach

● Parallel Patterns

High-level programming abstractions capturing parallelism and locality

Can express wide variety of applications

Previous work shows how to program FPGAs from parallel patterns

Design reconfigurable primitives to accelerate parallel patterns

[Figure: the four parallel patterns map, zip, reduce, and groupBy; groupBy buckets elements under key1, key2, and key3]
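
As a rough illustration of the four patterns named on this slide, the sketch below expresses each one over plain Python lists. The function names and the sample data are assumptions made for illustration; this is not the DSL the authors use.

```python
from collections import defaultdict
from functools import reduce


def map_pattern(f, xs):
    """map: apply f independently to every element (data parallel)."""
    return [f(x) for x in xs]


def zip_pattern(f, xs, ys):
    """zip: combine two equal-length collections elementwise."""
    if len(xs) != len(ys):
        raise ValueError("zip expects collections of equal length")
    return [f(x, y) for x, y in zip(xs, ys)]


def reduce_pattern(f, xs, init):
    """reduce: fold all elements into one value with an associative f."""
    return reduce(f, xs, init)


def group_by_pattern(key, xs):
    """groupBy: bucket elements by key, keeping input order per bucket."""
    buckets = defaultdict(list)
    for x in xs:
        buckets[key(x)].append(x)
    return dict(buckets)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
squares = map_pattern(lambda x: x * x, a)
products = zip_pattern(lambda x, y: x * y, a, b)
dot = reduce_pattern(lambda x, y: x + y, products, 0)
by_parity = group_by_pattern(lambda x: "even" if x % 2 == 0 else "odd", a)
```

Each element of a map or zip is independent of the others, and an associative reduce can be evaluated as a tree, which is the parallelism that the hardware primitives target.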

Key Observations

● Nested Parallelism

Data and pipeline parallelism at innermost loop level

Coarse-grained pipelining and parallelism at outer levels

● Locality, on-chip bandwidth, and buffering

Large on-chip memories with parallel read/write ports to sustain compute throughput

On-chip memory access patterns can be different

Address partitioning to implement buffering for coarse-grained pipelining

● Dense and sparse memory accesses

Burst DRAM accesses for dense data structures e.g., matrices

Sparse / random DRAM access for sparse data structures e.g., graphs

● Communication

Patterns produce and consume scalar data and arrays

Plasticine

● New reconfigurable accelerator architecture

● Datapath

Hierarchical organization to exploit nested parallelism

● On-chip Memories

Large, banked scratchpad memories with configurable address decoding

Hardware support for generalized double buffering (N-buffering)

● Address generators and address coalescing

Efficient burst access generation for dense data

Scatter-gather support, large number of outstanding requests for sparse data

● Interconnect

Multi-level interconnect to enable scalar, vector, and control communication

Pipelined switches to avoid overheads, long wires
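
One way to picture "banked scratchpad memories with configurable address decoding" is a function that maps a word address to a (bank, offset) pair under a selectable scheme. The two schemes below (interleaved and blocked), the bank counts, and all names are assumptions chosen for illustration; the slides do not say which decoding modes the hardware provides.

```python
def decode(addr, num_banks, bank_depth, mode="interleaved"):
    """Map a word address to (bank, offset) in a banked scratchpad."""
    if not 0 <= addr < num_banks * bank_depth:
        raise ValueError("address out of range")
    if mode == "interleaved":
        # Consecutive addresses land in consecutive banks.
        return addr % num_banks, addr // num_banks
    if mode == "blocked":
        # Each bank holds one contiguous range of addresses.
        return addr // bank_depth, addr % bank_depth
    raise ValueError(f"unknown mode: {mode}")


def conflict_free(addrs, num_banks, bank_depth, mode):
    """True if every address in one parallel access hits a distinct bank."""
    banks = [decode(a, num_banks, bank_depth, mode)[0] for a in addrs]
    return len(set(banks)) == len(banks)


# Four consecutive words read in the same cycle (hypothetical vector access).
vector_access = [8, 9, 10, 11]
interleaved_ok = conflict_free(vector_access, 4, 16, "interleaved")
blocked_ok = conflict_free(vector_access, 4, 16, "blocked")
```

In this toy setup the interleaved scheme serves the four-wide access without bank conflicts while the blocked scheme sends all four words to one bank, which suggests why the decoding should be selectable per access pattern.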

Plasticine: Top-Level

Pattern Compute Unit (PCU)

PCU: Pipeline Network

PCU: Reduction Network

PCU: Shift Network

Pattern Memory Unit (PMU)

Address Generators, Coalescing Unit

● Reconfigurable integer data paths for DRAM address calculation logic

Optimizes for common case for dense ‘burst’ DRAM access

Frees up PCUs for other computation, increases utilization

● Arbitration between multiple address streams

Coalescing unit arbitrates between address generators sharing same DRAM channel

● Scatter-gather support

Coalescing unit maintains sparse request metadata in a coalescing cache

Hardware combines requests belonging to same DRAM burst

Coalescing cache allows large number of outstanding requests
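
The scatter-gather idea above can be sketched in software as grouping sparse byte addresses by the DRAM burst that contains them, so that one burst serves several requests. The 64-byte burst size, the function name, and the request list are assumptions for illustration; the real coalescing cache is a hardware structure whose organization is not described on this slide.

```python
from collections import OrderedDict

BURST_BYTES = 64  # assumed DRAM burst size; not stated on the slide


def coalesce(byte_addrs, burst_bytes=BURST_BYTES):
    """Group sparse byte addresses by the DRAM burst that contains them.

    Returns an ordered mapping {burst_base_address: [request indices]}.
    One burst is issued per key, and its data is handed back to every
    request recorded under that key.
    """
    pending = OrderedDict()
    for idx, addr in enumerate(byte_addrs):
        base = (addr // burst_bytes) * burst_bytes
        pending.setdefault(base, []).append(idx)
    return pending


# Six hypothetical gather requests that touch only three bursts.
requests = [4, 260, 8, 1024, 300, 60]
bursts = coalesce(requests)
```

Here six requests collapse into three DRAM bursts; the per-burst index lists play the role of the sparse request metadata mentioned above.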

Interconnect

● Three interconnects with different levels of granularity

Vector: Vector (multi-word) granularity

Scalar: Single word granularity

Control: Bit-level granularity

● Pipelined switches to avoid long wires

1 hop = 1 cycle

Enables faster clock rate

● Counters and Control within switches

Outer loop logic mostly involves loop indices and control only

Implementing outer loop logic in PCUs => underutilization

Execution Model and Control

● Scratchpad access decoupled from compute

PMU: Scratchpad read/write address calculation

PCU: Core computation

FIFOs at inputs ease routing constraints

● Decentralized control mechanism to orchestrate execution

Tokens: Feed-forward pulse signals indicating forward flow

Credits: Feedback pulse signals indicating backpressure

Generalizes over any arbitrary level of nested pipelining

● Tokens, credits, and local FIFO state drive execution

Control blocks contain counters to manage tokens and credits

See paper for details
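
To make the token and credit handshake more concrete, here is a toy cycle-level model of a single producer stage feeding a single consumer stage through a fixed number of buffers. It is only a sketch under stated assumptions: the function name, the latencies, and the rule that a pulse takes effect on the following cycle are all invented for illustration, and the actual control blocks are described in the paper.

```python
def simulate(num_items, num_buffers, producer_latency, consumer_latency):
    """Toy model of one producer and one consumer linked by N buffers.

    The producer needs a credit (a free buffer) to start an item and
    emits a token when the item is done. The consumer needs a token to
    start and returns a credit when done. Returns total cycles taken.
    """
    credits, tokens = num_buffers, 0
    produced = consumed = 0
    prod_busy = cons_busy = 0  # cycles left on the item in flight
    cycle = 0
    while consumed < num_items:
        cycle += 1
        # Start new work if the stage is idle and its counter allows it.
        if prod_busy == 0 and produced < num_items and credits > 0:
            credits -= 1
            prod_busy = producer_latency
        if cons_busy == 0 and tokens > 0:
            tokens -= 1
            cons_busy = consumer_latency
        # Advance in-flight work; a pulse sent now is seen next cycle.
        if prod_busy > 0:
            prod_busy -= 1
            if prod_busy == 0:
                produced += 1
                tokens += 1   # feed-forward pulse: data is ready
        if cons_busy > 0:
            cons_busy -= 1
            if cons_busy == 0:
                consumed += 1
                credits += 1  # feedback pulse: buffer is free again
    return cycle


single_buffered = simulate(4, 1, 3, 3)
double_buffered = simulate(4, 2, 3, 3)
```

With one buffer the two stages alternate, while with two buffers the producer fills the next buffer as the consumer drains the current one, so the stages overlap. Raising `num_buffers` further corresponds to the N-buffering mentioned earlier in the talk.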

Application Mapping

[Figure: compiler flow from DHDL to a Plasticine bitstream. Stage labels: Unrolling, Splitting, Virtual PCUs, Mapping (Resource Allocation, Routing), Bitstream generation]

Koeplinger et al., “Automatic Generation of Efficient Accelerators for Reconfigurable Hardware”, ISCA 2016

Example: Dot Product

val out = Reg[Float]
val vectorA = DRAM[Float](N)
val vectorB = DRAM[Float](N)

Reduce(N by B)(out) { i =>
  val tileA = SRAM[Float](B)
  val tileB = SRAM[Float](B)
  val acc = Reg[Float]

  tileA load vectorA(i :: i+B)
  tileB load vectorB(i :: i+B)

  Reduce(B by 1)(acc) { j =>
    tileA(j) * tileB(j)
  }{a, b => a + b}
}{a, b => a + b}

[Figure: dataflow for the dot product. Labels: DRAM (A, B), TileA, TileB, ×, +, acc, +, out]
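
The nested Reduce in this example can be read as the following sequential Python model, with each line annotated by the DHDL statement it stands in for. This is only a software sketch of the semantics; the hardware pipelines and parallelizes these loops, and the function name and test vectors are assumptions for illustration.

```python
def tiled_dot(vector_a, vector_b, tile_size):
    """Dot product computed tile by tile, mirroring the nested Reduce."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    n = len(vector_a)
    out = 0.0                                # outer accumulator (out)
    for i in range(0, n, tile_size):         # Reduce(N by B)
        tile_a = vector_a[i:i + tile_size]   # tileA load vectorA(i :: i+B)
        tile_b = vector_b[i:i + tile_size]   # tileB load vectorB(i :: i+B)
        acc = 0.0                            # inner accumulator (acc)
        for j in range(len(tile_a)):         # Reduce(B by 1)
            acc += tile_a[j] * tile_b[j]     # multiply, then a + b
        out += acc                           # outer a + b
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
result = tiled_dot(a, b, tile_size=2)
```

The tile size plays the role of B: it sets how much of each vector is held in on-chip SRAM at a time, and it does not change the result.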

Example: DotProduct

[Figure: the dot product mapped onto Plasticine. Labels: A, B, tileA, tileB]

Evaluation

Sizing, Area, Power, Performance, Perf / W

Architecture Sizing

Plasticine Clock, Area, and Power

Technology Node: 28 nm

Clock Frequency: 1 GHz

Total Area: 112.77 mm²

Total Power: 49 W

Area Breakdown

[Figure: three pie charts of area breakdown]

Plasticine: PCU 48%, PMU 30%, Interconnect 17%, MC 5%

PCU: FU 72%, Regs 17%, FIFO 10%, Control 1%

PMU: Scratchpad 90%, FIFO 5%, Regs 4%

Experimental Setup

● Plasticine:

Implemented using Chisel, RTL synthesized with 28nm library

4 DDR3-1600 DRAM channels, peak memory bandwidth = 51.2 GB/s

1 GHz clock

● FPGA:

Altera Stratix V, 28 nm technology

6 DDR3-800 DRAM channels, peak memory bandwidth = 37.5 GB/s

150 MHz clock
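
The 51.2 GB/s figure for Plasticine can be reproduced with the usual peak-bandwidth arithmetic: channels × transfers per second × bytes per transfer. The sketch below assumes a 64-bit (8-byte) data bus per channel, which is the standard DDR3 DIMM width but is not stated on the slide; the function name is likewise hypothetical.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_bytes=8 assumes a 64-bit data bus per channel (an assumption,
    not a figure taken from the slide).
    """
    bytes_per_s = channels * mega_transfers_per_s * 10**6 * bus_bytes
    return bytes_per_s / 1e9


# 4 channels of DDR3-1600, as listed for Plasticine.
plasticine_bw = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1600)
```

Under the same assumptions, six DDR3-800 channels work out to 38.4 GB/s, slightly above the 37.5 GB/s quoted for the FPGA, so that figure may rest on a different unit or rounding convention.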

Experimental Setup

● Plasticine:

Performance: Cycle-accurate simulation using VCS + DRAMSim2

Area: Synopsys DC after synthesis

Chip Power: RTL trace-driven simulation using PrimeTime

● FPGA:

Performance: Measured execution time on FPGA

Utilization: Reports from Altera logic synthesis tools

Chip Power: Altera PowerPlay tool

Plasticine vs. FPGA

Resource Utilization

[Figure: bar chart of resource utilization (%, axis from 0 to 100). Series labels: PCU, PMU, AG, FU, Reg]

Conclusion

● Co-designing reconfigurable architecture and programming models based on parallel patterns leads to efficient, programmable systems

● Plasticine accelerates dense and sparse applications composed of parallel patterns

● Design space exploration explores tradeoffs between architecture parameters and application characteristics

● Up to 95x improvement in Performance, 77x improvement in Perf/W over FPGA in similar process technology, with an area of 113 mm²

The Team

Christos Kozyrakis Kunle Olukotun

Yaqi Zhang David Koeplinger Matt Feldman

Tian Zhao Stefan Hadjis Ardavan Pedram