1
TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY
CACHE FOR HIGH-THROUGHPUT
FPGA APPLICATIONS
Aaron Severance
University of British Columbia
Advised by Guy Lemieux
2
Our Problem
• We use overlays for data processing
  • Partially/fully fixed processing elements
  • Virtual CGRAs, soft vector processors
• Memory: large register files/scratchpad in overlay
  • Low latency, local data
• Trivial case (large DMA): burst to/from DDR
• Non-trivial?
Scatter/Gather
• Data-dependent store/load:
    vscatter adr_ptr, idx_vect, data_vect
    for i in 1..N
        adr_ptr[idx_vect[i]] <= data_vect[i]
• Random narrow (32-bit) accesses waste bandwidth on DDR interfaces (see the C sketch below)
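To make the access pattern concrete, here is a minimal C model of the scatter operation above; the function name and types are ours for illustration, not the MXP API. Each iteration is an independent, narrow 32-bit store to a data-dependent address, which is exactly the pattern that wastes DDR burst bandwidth.

    #include <stddef.h>
    #include <stdint.h>

    /* Behavioral model of vscatter: data-dependent 32-bit stores.
     * Element i lands at adr_ptr[idx_vect[i]], so the memory system
     * sees n independent random accesses rather than one burst. */
    static void scatter(uint32_t *adr_ptr, const uint32_t *idx_vect,
                        const uint32_t *data_vect, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            adr_ptr[idx_vect[i]] = data_vect[i];
    }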
3
4
If Data Fits on the FPGA…
• BRAMs with interconnect network
• General network: not customized per application; shared, all masters <-> all slaves
• Memory-mapped BRAM
  • Double-pump (2x clk) if possible
  • Banking/LVT/etc. for further ports (a banking sketch follows)
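As a rough sketch of the banking option, the split below maps a word address to a bank index and an in-bank offset; NUM_BANKS and the low-bit interleaving are our assumptions, not a design from the talk.

    #include <stdint.h>

    /* Low-order address bits select the bank, so consecutive words fall
     * into different banks and independent masters can proceed in
     * parallel whenever their addresses differ in those bits. */
    #define NUM_BANKS 4u  /* assumed; must be a power of two */

    static inline uint32_t bank_of(uint32_t word_addr)
    {
        return word_addr & (NUM_BANKS - 1u);
    }

    static inline uint32_t offset_in_bank(uint32_t word_addr)
    {
        return word_addr / NUM_BANKS;
    }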
5
Example BRAM system
6
But if data doesn’t fit… (oversimplified)
7
So Let’s Use a Cache
• But a throughput-focused cache
  • Low-latency data held in local memories
  • Amortize latency over multiple accesses
  • Focus on bandwidth
• Replace on-chip memory or augment the memory controller?
  • Data fits on-chip: want BRAM-like speed and bandwidth, with low overhead compared to shared BRAM
  • Data doesn't fit on-chip: use 'leftover' BRAMs for performance
8
9
TputCache Design Goals
• Fmax near BRAM Fmax
• Fully pipelined
• Support multiple outstanding misses
• Write coalescing
• Associativity
10
TputCache Architecture
• Replay-based architecture
  • Reinsert misses back into the pipeline
  • Separate line fill/evict logic runs in the background
  • Token FIFO for completing requests in order
• No MSHRs for tracking misses
  • Fewer muxes (only a single replay-request mux)
  • 6-stage pipeline -> 6 outstanding misses
• Good performance with high hit rate: the common case is fast (a behavioral sketch follows)
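Below is a behavioral sketch of the replay idea in plain C (not RTL), under our own simplifying assumptions: one request advances per tick, a miss at the final stage loops back into stage 0 through the single replay mux while the fill proceeds in the background, and the token carried by each request is what a token FIFO (not modeled here) would use to complete requests in order.

    #include <stdbool.h>
    #include <stdint.h>

    #define PIPE_DEPTH 6  /* 6-stage pipeline -> up to 6 outstanding misses */

    typedef struct { uint32_t addr; uint32_t token; bool valid; } Req;

    static Req stages[PIPE_DEPTH];

    /* Stubs standing in for the tag lookup and background evict/fill logic. */
    static bool tag_hit(uint32_t addr)    { (void)addr; return true; }
    static void start_fill(uint32_t addr) { (void)addr; }

    /* One clock tick: shift the pipeline one stage. A miss at the final
     * stage re-enters at stage 0 through the replay mux (replays take
     * priority over new requests), so no MSHRs are needed to park it. */
    static void tick(Req incoming)
    {
        Req out = stages[PIPE_DEPTH - 1];
        for (int s = PIPE_DEPTH - 1; s > 0; s--)
            stages[s] = stages[s - 1];

        if (out.valid && !tag_hit(out.addr)) {
            start_fill(out.addr);  /* line fill/evict runs in the background */
            stages[0] = out;       /* replay: the miss is retried later */
        } else {
            stages[0] = incoming;  /* hit or bubble: accept a new request */
        }
    }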
11
TputCache Architecture
12
Cache Hit
13
Cache Miss
14
Evict/Fill Logic
15
Area & Fmax Results
• Reaches 253MHz, compared to 270MHz BRAM fmax, on Cyclone IV
• 423MHz, compared to 490MHz BRAM fmax, on Stratix IV
• Minor degradation with increasing size and associativity
• 13% to 35% extra BRAM usage for tags and queues
16
Benchmark Setup
• TputCache: 128kB, 4-way, 32-byte lines
• MXP soft vector processor: 16 lanes, 128kB scratchpad memory
• Scatter/Gather memory unit: indexed loads/stores per lane
• Double-pumping port adapters: TputCache runs at 2x the frequency of MXP
MXP Soft Vector Processor
17
18
Histogram
• Instantiate a number of Virtual Processors (VPs) mapped across lanes
• Each VP histograms part of the image
• Final pass sums the VP partial histograms (see the sketch below)
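A plain-C model of the scheme, with assumed sizes (16 VPs, 8-bit pixels); names are ours, not MXP's. Each VP fills a private partial histogram over its slice of the image, so there are no cross-VP conflicts, and a final pass reduces the partials.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_VPS 16   /* assumed: one virtual processor per lane */
    #define BINS    256  /* 8-bit pixels */

    static void histogram(const uint8_t *img, size_t n, uint32_t hist[BINS])
    {
        static uint32_t partial[NUM_VPS][BINS];
        memset(partial, 0, sizeof partial);

        /* Each VP histograms its own interleaved slice of the image. */
        for (size_t vp = 0; vp < NUM_VPS; vp++)
            for (size_t i = vp; i < n; i += NUM_VPS)
                partial[vp][img[i]]++;

        /* Final pass: sum the VP partial histograms. */
        memset(hist, 0, BINS * sizeof hist[0]);
        for (size_t b = 0; b < BINS; b++)
            for (size_t vp = 0; vp < NUM_VPS; vp++)
                hist[b] += partial[vp][b];
    }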
19
Hough Transform
• Convert an image to 2D Hough space (angle, radius)
• Each vector element calculates the radius for a given angle
• Adds the pixel value to a counter (see the sketch below)
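A scalar C sketch of the per-pixel update, with assumed Hough-space dimensions: for each angle theta, compute r = x*cos(theta) + y*sin(theta) and add the pixel value to the (angle, radius) counter. In the vectorized version, each vector element handles one angle.

    #include <math.h>
    #include <stdint.h>

    #define ANGLES 180   /* assumed angular resolution */
    #define MAX_R  512   /* assumed maximum |radius|; image must fit */

    static void hough_point(uint32_t acc[ANGLES][2 * MAX_R],
                            int x, int y, uint8_t pixel)
    {
        const double pi = 3.14159265358979323846;
        for (int a = 0; a < ANGLES; a++) {
            double theta = a * pi / ANGLES;
            int r = (int)lround(x * cos(theta) + y * sin(theta));
            acc[a][r + MAX_R] += pixel;  /* shift so negative radii index >= 0 */
        }
    }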
20
Motion Compensation
• Load a block from the reference image and interpolate
• The block is offset by a small amount from its location in the current image (see the sketch below)
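A minimal C sketch of the reference-block load, assuming an 8x8 block and a half-pixel horizontal offset only; the block size, rounding, and function name are our assumptions.

    #include <stdint.h>

    /* Fetch an 8x8 block from the reference image at (x, y), averaging
     * horizontal neighbors for the half-pixel case. The loads walk the
     * reference image at an offset from the current block's location. */
    static void mc_load_block(const uint8_t *ref, int stride,
                              int x, int y, uint8_t out[8][8])
    {
        for (int r = 0; r < 8; r++) {
            const uint8_t *row = ref + (y + r) * stride + x;
            for (int c = 0; c < 8; c++)
                out[r][c] = (uint8_t)((row[c] + row[c + 1] + 1) >> 1);
        }
    }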
21
Future Work
• More ports, needed for scalability
  • Share the evict/fill BRAM port with a 2nd request
  • Banking (sharing the same evict/fill logic)
  • Multiported BRAM designs
• Write cache
  • Currently allocate-on-write
  • Track the dirty state of bytes in BRAMs using the 9th bit
• Non-blocking behavior
  • Multiple token FIFOs (one per requestor)?
22
FAQ
• Coherency: envisioned as the only/last-level cache; coherency support is future work
• Replay loops/problems: avoided by random replacement + associativity
• Power: expected to be not great…
23
Conclusions
• TputCache: an alternative to shared BRAM
  • Low overhead (13%-35% extra BRAM)
  • Nearly as high fmax (253MHz vs 270MHz)
• More flexible than shared BRAM
  • Performance degrades gracefully
  • Cache behavior instead of manual filling
24
Questions?
Thank you