Plasticine: A Reconfigurable Architecture For Parallel Patterns
Raghu Prabhakar
June 8, 2017 Plasticine: A Reconfigurable Architecture for Parallel Patterns Slide 2
Important Trends
● Moore’s law, Dennard scaling, Power Wall, Memory Wall
=> Use transistors efficiently to achieve better Performance / Watt
=> Exploit data locality, parallelism
● High NRE costs in fabricating ASICs
=> Build programmable hardware to amortize costs
● Availability of large amounts of data + algorithmic innovations
=> Build hardware with high compute density
=> Programmable Accelerator Architectures
Reconfigurable Accelerators
● Statically reprogrammable data path using configuration bits
● Power Efficiency: Avoids overheads of general-purpose CPUs, GPGPUs
Instruction fetch, decode, register file access
40% of datapath energy on CPU [1]
30% of dynamic power on GPU [2]
● Flexibility: Amortizes NRE fabrication costs of ASIC
● FPGAs gaining traction as reconfigurable accelerators
[1] Hameed et al, Understanding Sources of Inefficiency in General-purpose Chips, ISCA 2010
[2] Leng et al, GPUWattch: Enabling Energy Optimizations in GPGPUs, ISCA 2013
FPGA: The good and bad
● Bit-level reconfigurable logic elements + static interconnect
● Good
Flexibility, Performance / Watt
Commercially successful, mature toolchain support
● Bad
Architectural overheads: 60% area, power spent in the interconnect
Reduced compute density, slower clock rates
Long compile times, Low-level programming models
Design reconfigurable hardware with the right abstractions
Our Approach
● Parallel Patterns
High-level programming abstractions capturing parallelism and locality
Can express wide variety of applications
Previous work shows how to program FPGAs from parallel patterns
Design reconfigurable primitives to accelerate parallel patterns
[Figure: the map, zip, reduce, and groupBy patterns; groupBy sorts elements into buckets key1, key2, key3]
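The four patterns named on this slide can be illustrated in software. The following is a minimal sketch, assuming sequential Python semantics; the function names (`p_map`, `p_zip`, `p_reduce`, `p_group_by`) are hypothetical and are not taken from the talk or from any Plasticine tool.

```python
from collections import defaultdict
from functools import reduce


def p_map(f, xs):
    # map: apply f independently to every element (fully data parallel)
    return [f(x) for x in xs]


def p_zip(f, xs, ys):
    # zip: combine two collections element by element
    return [f(x, y) for x, y in zip(xs, ys)]


def p_reduce(f, init, xs):
    # reduce: fold all elements into one value with an associative operator
    return reduce(f, xs, init)


def p_group_by(key, xs):
    # groupBy: place each element into the bucket selected by its key
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)


squares = p_map(lambda x: x * x, [1, 2, 3, 4])
products = p_zip(lambda a, b: a * b, [1, 2, 3], [4, 5, 6])
total = p_reduce(lambda a, b: a + b, 0, products)
by_parity = p_group_by(lambda x: x % 2, [1, 2, 3, 4, 5])
```

Because map and zip have no dependencies between elements, and reduce uses an associative operator, each can in principle be parallelized in hardware.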
Key Observations
● Nested Parallelism
Data and pipeline parallelism at innermost loop level
Coarse-grained pipelining and parallelism at outer levels
● Locality, on-chip bandwidth, and buffering
Large on-chip memories with parallel read/write ports to sustain compute throughput
On-chip memory access patterns vary across applications
Address partitioning to implement buffering for coarse-grained pipelining
● Dense and sparse memory accesses
Burst DRAM accesses for dense data structures e.g., matrices
Sparse / random DRAM access for sparse data structures e.g., graphs
● Communication
Patterns produce and consume scalar data and arrays
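The buffering observation above can be sketched in software. This is a minimal model, assuming that a producer stage fills one copy of a tile while a consumer stage reads the most recently completed copy; the class `NBuffer` and its methods are hypothetical names for illustration, not the hardware interface.

```python
class NBuffer:
    """N copies of a tile buffer; writer and reader touch different copies."""

    def __init__(self, n, size):
        self.n = n
        self.bufs = [[0] * size for _ in range(n)]
        self.write_idx = 0       # copy the producer stage is filling
        self.read_idx = n - 1    # copy the consumer stage is reading

    def write(self, addr, value):
        self.bufs[self.write_idx][addr] = value

    def read(self, addr):
        return self.bufs[self.read_idx][addr]

    def swap(self):
        # The copy just filled becomes readable; the writer moves to the next copy.
        self.read_idx = self.write_idx
        self.write_idx = (self.write_idx + 1) % self.n


buf = NBuffer(n=2, size=4)
for i in range(4):
    buf.write(i, 10 + i)         # producer fills the first tile
buf.swap()
for i in range(4):
    buf.write(i, 20 + i)         # producer fills the next tile...
first_tile = [buf.read(i) for i in range(4)]   # ...while the consumer reads the first
buf.swap()
second_tile = [buf.read(i) for i in range(4)]
```

With `n=2` this is ordinary double buffering; larger `n` gives the producer more copies to run ahead into.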
Plasticine
● New reconfigurable accelerator architecture
● Datapath
Hierarchical organization to exploit nested parallelism
● On-chip Memories
Large, banked scratchpad memories with configurable address decoding
Hardware support for generalized double buffering (N-buffering)
● Address generators and address coalescing
Efficient burst access generation for dense data
Scatter-gather support, large number of outstanding requests for sparse data
● Interconnect
Multi-level interconnect to enable scalar, vector, and control communication
Pipelined switches to avoid overheads, long wires
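To make the banked scratchpad idea concrete, here is a minimal sketch assuming simple interleaved banking (bank = address mod number of banks). This is one common scheme chosen for illustration; the slides do not specify the decoding modes, and `decode` and `conflict_free` are hypothetical helper names.

```python
def decode(addr, num_banks):
    # Interleaved banking: consecutive words land in consecutive banks.
    bank = addr % num_banks
    offset = addr // num_banks
    return bank, offset


def conflict_free(addrs, num_banks):
    # A vector access proceeds in one cycle only if every lane hits a distinct bank.
    banks = [decode(a, num_banks)[0] for a in addrs]
    return len(set(banks)) == len(banks)


example = decode(37, 16)                             # word 37 with 16 banks
unit_stride = conflict_free(range(16, 32), 16)       # 16 consecutive words
stride_16 = conflict_free(range(0, 256, 16), 16)     # every 16th word
```

Under this scheme a unit-stride vector access spreads across all banks, while a stride equal to the bank count sends every lane to the same bank. That kind of mismatch is one reason a configurable decoding scheme is useful.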
Plasticine: Top-Level
Pattern Compute Unit (PCU)
PCU: Pipeline Network
PCU: Reduction Network
PCU: Shift Network
Pattern Memory Unit (PMU)
Address Generators, Coalescing Unit
● Reconfigurable integer data paths for DRAM address calculation logic
Optimizes for the common case of dense ‘burst’ DRAM access
Frees up PCUs for other computation, increases utilization
● Arbitration between multiple address streams
Coalescing unit arbitrates between address generators sharing same DRAM channel
● Scatter-gather support
Coalescing unit maintains sparse request metadata in a coalescing cache
Hardware combines requests belonging to same DRAM burst
Coalescing cache allows large number of outstanding requests
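The scatter-gather behavior described above can be approximated in software. The sketch below assumes a 64-byte DRAM burst and models the coalescing cache as a dictionary keyed by burst number; both the burst size and the name `coalesce` are assumptions made for illustration, not details from the slides.

```python
BURST_BYTES = 64  # assumed burst size; not stated on the slide


def coalesce(byte_addrs, burst_bytes=BURST_BYTES):
    # Map each burst number to the sparse requests that fall inside it.
    # Each entry records (request id, byte offset within the burst).
    pending = {}
    for req_id, addr in enumerate(byte_addrs):
        burst = addr // burst_bytes
        pending.setdefault(burst, []).append((req_id, addr % burst_bytes))
    return pending


requests = [0, 4, 128, 8, 132, 4096]
table = coalesce(requests)
dram_bursts = list(table.keys())   # one DRAM access per distinct burst
```

Here six sparse requests collapse into three DRAM bursts. The per-burst metadata is what lets the returned data be routed back to the original requests.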
Interconnect
● Three interconnects with different levels of granularity
Vector: Vector (multi-word) granularity
Scalar: Single word granularity
Control: Bit-level granularity
● Pipelined switches to avoid long wires
1 hop = 1 cycle
Enables faster clock rate
● Counters and Control within switches
Outer loop logic mostly involves loop indices and control only
Implementing outer loop logic in PCUs => underutilization
Execution Model and Control
● Scratchpad access decoupled from compute
PMU: Scratchpad read/write address calculation
PCU: Core computation
FIFOs at inputs ease routing constraints
● Decentralized control mechanism to orchestrate execution
Tokens: Feed-forward pulse signals indicating forward flow
Credits: Feedback pulse signals indicating backpressure
Generalizes over any arbitrary level of nested pipelining
● Tokens, Credits, and local FIFO state drive execution
Control blocks contain counters to manage tokens and credits
See paper for details
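The token and credit scheme can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions: one producer, one consumer, and a consumer that drains one item every `consumer_period` cycles. The function `simulate` and its parameters are hypothetical and do not reproduce the control blocks described in the paper.

```python
from collections import deque


def simulate(num_items, credits, consumer_period):
    """Producer sends one token per item while it holds a credit;
    the consumer returns a credit each time it drains an item."""
    fifo = deque()
    produced = consumed = 0
    max_occupancy = 0
    cycle = 0
    while consumed < num_items:
        # Consumer: drain one item on its active cycles and send a credit back.
        if fifo and cycle % consumer_period == 0:
            fifo.popleft()
            consumed += 1
            credits += 1
        # Producer: push an item (a token) only if a credit is available.
        if produced < num_items and credits > 0:
            fifo.append(produced)
            produced += 1
            credits -= 1
        max_occupancy = max(max_occupancy, len(fifo))
        cycle += 1
    return cycle, max_occupancy


cycles, occupancy = simulate(num_items=10, credits=2, consumer_period=3)
```

Items in flight plus credits held always equals the initial credit count, so the FIFO can never overflow. This is the backpressure the credits provide without any central controller.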
Application Mapping
Unrolling
Splitting
DHDL
Virtual PCUs
Mapping
Resource Allocation
Routing
Bitstream generation
Plasticine Bitstream
Koeplinger et al, “Automatic Generation of Efficient
Accelerators for Reconfigurable Hardware”, ISCA 2016
Example: Dot Product
    val out = Reg[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Reduce(N by B)(out) { i =>
      val tileA = SRAM[Float](B)
      val tileB = SRAM[Float](B)
      val acc = Reg[Float]

      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)

      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}
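For readers unfamiliar with the DSL, the same tiled computation can be written as a plain software reference. The following is a minimal Python sketch, assuming the vector length is a multiple of the tile size; `dot_tiled` is a hypothetical name and the loops run sequentially, unlike the pipelined hardware.

```python
def dot_tiled(a, b, tile):
    # Assumes len(a) == len(b) and that the length is a multiple of the tile size.
    assert len(a) == len(b) and len(a) % tile == 0
    out = 0.0
    for i in range(0, len(a), tile):        # outer Reduce(N by B)
        tile_a = a[i:i + tile]              # tileA load vectorA(i :: i+B)
        tile_b = b[i:i + tile]              # tileB load vectorB(i :: i+B)
        acc = 0.0
        for j in range(tile):               # inner Reduce(B by 1)
            acc += tile_a[j] * tile_b[j]
        out += acc                          # outer combine: a + b
    return out


result = dot_tiled([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], tile=2)
```

The outer loop corresponds to the coarse-grained pipeline (load tiles, then reduce), and the inner loop to the fine-grained, vectorizable reduction.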
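[Figure: dot product dataflow. DRAM (A, B) feeds TileA and TileB; products (×) are summed (+) into acc, and acc is summed (+) into out]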
Example: Dot Product
[Figure: the dot product mapped onto Plasticine; labels: A, B, tileA, tileB]
Evaluation
Sizing, Area, Power, Performance, Perf / W
Architecture Sizing
Plasticine Clock, Area, and Power
Technology Node    28 nm
Clock Frequency    1 GHz
Total Area         112.77 mm²
Total Power        49 W
Area Breakdown
Plasticine: PCU 48%, PMU 30%, Interconnect 17%, MC 5%
PCU: FU 72%, Regs 17%, FIFO 10%, Control 1%
PMU: Scratchpad 90%, FIFO 5%, Regs 4%
Experimental Setup
● Plasticine:
Implemented using Chisel, RTL synthesized with 28 nm library
4 DDR3-1600 DRAM channels, peak memory bandwidth = 51.2 GB/s
1 GHz clock
● FPGA:
Altera Stratix V, 28 nm technology
6 DDR3-800 DRAM channels, peak memory bandwidth = 37.5 GB/s
150 MHz clock
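The Plasticine peak bandwidth figure above can be checked with simple arithmetic. This sketch assumes a 64-bit (8-byte) data bus per DDR3 channel, which the slides do not state; `peak_bandwidth_gb_s` is a hypothetical helper.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    # Peak bandwidth = channels x transfers per second x bytes per transfer.
    # bus_bytes=8 assumes a 64-bit data bus per channel.
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


plasticine_bw = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1600)
per_channel_bw = peak_bandwidth_gb_s(channels=1, mega_transfers_per_s=1600)
```

Under that assumption each DDR3-1600 channel peaks at 12.8 GB/s, and four channels give the 51.2 GB/s quoted for Plasticine.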
Experimental Setup
● Plasticine:
Performance: Cycle-accurate simulation using VCS + DRAMSim2
Area: Synopsys DC after synthesis
Chip Power: RTL trace-driven simulation using PrimeTime
● FPGA:
Performance: Measured execution time on FPGA
Utilization: Reports from Altera logic synthesis tools
Chip Power: Altera PowerPlay tool
Plasticine v/s FPGA
Resource Utilization
[Figure: bar chart of resource utilization (%, axis from 0 to 100) for PCU, PMU, AG, FU, and Reg]
Conclusion
● Co-designing reconfigurable architecture and programming models
based on parallel patterns leads to efficient, programmable systems
● Plasticine accelerates dense and sparse applications composed of
parallel patterns
● Design space exploration examines tradeoffs between architecture
parameters and application characteristics
● Up to 95x improvement in performance and 77x improvement in Perf/W over
an FPGA in a similar process technology, with an area of 113 mm²
The Team
Christos Kozyrakis Kunle Olukotun
Yaqi Zhang David Koeplinger Matt Feldman
Tian Zhao Stefan Hadjis Ardavan Pedram