Tutorial Outline
Time Topic
9:00 am – 9:30 am Introduction
9:30 am – 10:10 am Standalone Accelerator Simulation: Aladdin
10:10 am – 10:30 am Standalone Accelerator Generation: High-Level Synthesis
10:30 am – 11:00 am HLS-Based Accelerator-Rich Architecture Simulation: PARADE
11:00 am – 11:30 am Break
11:30 am – 12:00 pm Pre-RTL SoC Simulation: gem5-Aladdin
12:00 pm – 12:30 pm FPGA Prototyping: ARACompiler
12:30 pm – 2:00 pm Lunch
2:00 pm – 3:00 pm Panel on Accelerator Research
3:00 pm – 3:30 pm Accelerator Benchmarks and Workload Characterization
3:30 pm – 4:00 pm Break
4:00 pm – 5:00 pm Hands-on Exercise
2
Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Harvard University
3
Accelerator-CPU Integration: Today’s Conventional SoCs
• Easy to integrate lots of IP, simple accelerator design
• Hard to program and share data
[Figure: block diagram with Core + L2 $ (repeated), L3 $, DMA, On-Chip System Bus, and Acc #1 … Acc #n, each with a Scratchpad]
4
Accelerator Integration Trend
• Users design application-specific hardware accelerators.
• System vendors provide Host Service Layer with virtual memory and cache coherence support
– Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
– IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
[Figure: block diagram with Core + L2 $ (repeated), L3 $, Acc Agent, Host Service Layer, and Accelerator; one side labeled Main CPU/SoC, the other FPGA or user-defined ASIC]
5
IBM CAPI: Two part solution
• Example of state-of-the-art:
– IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
• Virtual Addressing & Data Caching
• Easier, Natural Programming Model
6
IBM CAPI: Two part solution
• Coherent Accelerator Processor Proxy (CAPP)
– Snoops PowerBus on behalf of accelerator
• Power Service Layer (PSL)
– Performs address translations, page table walker support
– Provides cache and interface logic
[Figure: block diagram with Core + L2 $ (two shown), L3 $, On-Chip Coherent PowerBus, Memory, CAPP, PCIe, PSL (Cache, TLB, …), and Accelerator]
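The address translation step listed for the PSL can be pictured with a small sketch. This is a generic, direct-mapped TLB with a single-level page-table fallback, written for illustration only; it is not a description of how the PSL is built. The names `Translator`, `translate`, and `XLATE_FAULT`, the 4 KiB page size, and the table sizes are all assumptions that do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12u          /* assumed 4 KiB pages */
#define TLB_ENTRIES 4u           /* assumed tiny direct-mapped TLB */
#define PT_ENTRIES  16u          /* assumed single-level page table */
#define XLATE_FAULT UINT64_MAX   /* hypothetical "no mapping" marker */

typedef struct {
    int      valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
} TlbEntry;

typedef struct {
    TlbEntry tlb[TLB_ENTRIES];
    uint64_t page_table[PT_ENTRIES]; /* pfn indexed by vpn; 0 means unmapped */
    unsigned hits;                   /* translations served by the TLB */
    unsigned walks;                  /* translations that needed a table walk */
} Translator;

/* Translate a virtual address: try the TLB first, then walk the table. */
uint64_t translate(Translator *t, uint64_t vaddr)
{
    assert(t != NULL);
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1u);
    TlbEntry *e = &t->tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {          /* TLB hit */
        t->hits++;
        return (e->pfn << PAGE_SHIFT) | off;
    }
    if (vpn >= PT_ENTRIES || t->page_table[vpn] == 0)
        return XLATE_FAULT;                   /* unmapped page */

    t->walks++;                               /* page-table walk, then refill */
    e->valid = 1;
    e->vpn = vpn;
    e->pfn = t->page_table[vpn];
    return (e->pfn << PAGE_SHIFT) | off;
}
```

The first access to a page pays for the walk and fills the TLB; later accesses to the same page hit. Every accelerator memory request goes through this kind of lookup when the accelerator uses virtual addresses.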
7
But… accelerators are not one size fits all
• Problem: the PSL consumes ~20-30% of FPGA resources… for one accelerator
• Applications have drastically different requirements.
• Memory design customization is often more important than datapath customization
8
gem5-Aladdin Integration
[Figure: block diagram with CPU, Cache, LLC, DRAM, and an accelerator made up of Acc Datapath, DMA Engine, Scratchpad, TLB, and Cache]
9
Code example: Sift
void imsmooth(F2D* array, float sigma, F2D* product);

void sift() {
  …
  imsmooth(I, temp, gss[0]);
  mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
  mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
  invokeAcceleratorAndBlock(imsmooth);
  …
}
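The map-then-invoke pattern on this slide can be tried outside the simulator with a small mock. The sketch below is a stand-in only: `MockAccelerator`, `mock_map_array`, and `mock_invoke_and_block` are hypothetical names, their signatures are assumptions and not the gem5-Aladdin API, and the "kernel" is a plain copy from the array named "array" to the array named "product".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MAPPINGS 8   /* assumed limit, for the mock only */

/* One host buffer registered under the name the accelerator uses for it. */
typedef struct {
    const char *name;
    void       *addr;
    size_t      size;   /* size in bytes */
} Mapping;

typedef struct {
    Mapping maps[MAX_MAPPINGS];
    int     count;
    int     invocations;
} MockAccelerator;

/* Hypothetical analogue of mapArrayToAccelerator: record name -> buffer. */
int mock_map_array(MockAccelerator *acc, const char *name, void *addr, size_t size)
{
    assert(acc != NULL && name != NULL);
    if (acc->count >= MAX_MAPPINGS || addr == NULL || size == 0)
        return -1;
    acc->maps[acc->count].name = name;
    acc->maps[acc->count].addr = addr;
    acc->maps[acc->count].size = size;
    acc->count++;
    return 0;
}

static const Mapping *mock_find(const MockAccelerator *acc, const char *name)
{
    for (int i = 0; i < acc->count; i++)
        if (strcmp(acc->maps[i].name, name) == 0)
            return &acc->maps[i];
    return NULL;
}

/* Hypothetical analogue of invokeAcceleratorAndBlock: run to completion
 * before returning. The copy is a placeholder for the real kernel. */
int mock_invoke_and_block(MockAccelerator *acc)
{
    assert(acc != NULL);
    const Mapping *in  = mock_find(acc, "array");
    const Mapping *out = mock_find(acc, "product");
    if (in == NULL || out == NULL || in->size != out->size)
        return -1;   /* both arrays must be mapped before invoking */
    memcpy(out->addr, in->addr, in->size);
    acc->invocations++;
    return 0;
}
```

A caller maps both buffers and then invokes, for example `mock_map_array(&acc, "array", in, sizeof in)` with `in` declared as an array. The mock takes explicit byte counts because in C `sizeof` applied to a pointer yields the size of the pointer, not of the buffer it points to.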
10
Code example: Sift
void imsmooth(F2D* array, float sigma, F2D* product);

void sift() {
  …
  // imsmooth(I, temp, gss[0]);
  mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
  mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
  invokeAccelerator(imsmooth);
  …
}
Start Aladdin Simulation
Simulating Accelerator with Memory System using Aladdin
11
[Figure: blocks labeled Acc, Cache, and Memory]
12
[Figure: two sets of blocks, Acc / Cache / Memory and CPU / Cache / Memory]
13
Modeling Accelerators in an SoC-like Environment
[Figure: blocks labeled Acc, Core, Core, Cache, and Memory]
14
Modeling Accelerators in an SoC-like Environment
[Figure: blocks labeled Acc, Core, Cache, and Memory]
Accelerator Research Infrastructure
[Figure: grid with axes Standalone / System Integration and Modeling / RTL, holding Aladdin, gem5-Aladdin, High-Level Synthesis, PARADE, and FPGA Prototyping]
15
Tutorial References
• Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13.
• B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13.
• Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14.
• B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14.
16