Closely-Coupled Timing-Directed Partitioning in HAsim

Michael Pellauer† ([email protected])
Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡

†MIT CS and AI Lab, Computation Structures Group
‡Intel Corporation, VSSAD Group

To Appear In: ISPASS 2008
Motivation

We want to simulate target platforms quickly. We also want to construct simulators quickly. Partitioned simulators are a known technique from traditional performance models:

Functional Partition
• ISA
• Off-chip communication

Timing Partition
• Micro-architecture
• Resource contention
• Dependencies

The two partitions interact. Benefits:
• Simplifies the timing model
• Amortizes functional-model design effort over many timing models
• The functional partition can be extremely FPGA-optimized
Different Partitioning Schemes

As categorized by Mauer, Hill, and Wood (source: [MAUER 2002], ACM SIGMETRICS). We believe that a timing-directed scheme, with both partitions on the FPGA, will ultimately lead to the best performance.
Functional Partition in Software (Asim)

• Get Instruction (at a given address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort Instruction
• Write Memory* (globally visible)

* Optional, depending on instruction type
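The operations above can be sketched as a software interface that a timing model would call. The class and method names below are hypothetical illustrations, not the actual Asim/HAsim API; the point is the locally-visible vs. globally-visible store distinction:

```python
# Hypothetical sketch of the functional-partition operations listed above.
# Names and types are illustrative, not the real Asim/HAsim interface.
from dataclasses import dataclass, field

@dataclass
class FunctionalPartition:
    memory: dict = field(default_factory=dict)          # globally visible memory
    pending_stores: dict = field(default_factory=dict)  # locally visible writes

    def read_memory(self, tok, addr):
        # A speculative store by this same instruction is locally visible.
        if (tok, addr) in self.pending_stores:
            return self.pending_stores[(tok, addr)]
        return self.memory.get(addr, 0)

    def spec_write_memory(self, tok, addr, value):
        self.pending_stores[(tok, addr)] = value        # locally visible only

    def commit(self, tok):
        # Make this token's stores globally visible.
        for (t, addr), v in list(self.pending_stores.items()):
            if t == tok:
                self.memory[addr] = v
                del self.pending_stores[(t, addr)]

    def abort(self, tok):
        # Discard this token's speculative stores.
        self.pending_stores = {k: v for k, v in self.pending_stores.items()
                               if k[0] != tok}

fp = FunctionalPartition()
fp.spec_write_memory(tok=1, addr=0x100, value=42)
assert fp.read_memory(1, 0x100) == 42      # locally visible
assert fp.memory.get(0x100, 0) == 0        # not yet globally visible
fp.commit(1)
assert fp.memory[0x100] == 42              # globally visible after commit
```

The timing model never computes results itself; it only sequences calls like these, which is what makes the scheme timing-directed.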
Execution in Phases

F D X R C
F D X W C W
F D X C

The Emer Assertion: all data dependencies can be represented via these phases.

F D X R A
F D X X C W

(F = Fetch, D = Decode, X = Execute, R = Read memory, W = Write memory, C = Commit, A = Abort)
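One way to read these phase strings is as the sequence of functional-partition operations a timing model issues per instruction. The mapping below is an assumption inferred from the operation list earlier in the deck, not taken from the paper:

```python
# Toy sketch: a timing model drives the functional partition by issuing
# phase operations in whatever order its micro-architecture dictates.
# The phase-letter meanings here are an assumption from the slides.
PHASE_OPS = {
    "F": "getInstruction",
    "D": "getDependencies",
    "X": "getResults",
    "R": "readMemory",
    "W": "writeMemory",
    "C": "commit",
    "A": "abort",
}

def replay(sequence):
    """Translate a phase string like 'F D X R C' into a call sequence."""
    return [PHASE_OPS[p] for p in sequence.split()]

calls = replay("F D X R C")
assert calls == ["getInstruction", "getDependencies", "getResults",
                 "readMemory", "commit"]
```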
Detailed Example: 3 Different Timing Models
Executing the same instruction sequence:
Functional Partition in Hardware?

Requirements:
• Support these operations in hardware
• Allow for out-of-order execution, speculation, and rollback

Challenges:
• Minimize operation execution times
• Pipeline wherever possible
• Trade off between BRAMs and multiported RAMs
• Race conditions due to extreme parallelism
Functional Partition As Pipeline

Token Gen → Fet → Dec → Exe → Mem → LCom → GCom

[Figure: the timing model drives the functional-partition pipeline; the functional partition holds the register state (RegFile) and the memory state]

Conveys the concept well, but poor performance.
Implementation: Large Scoreboards in BRAM

• A series of tables in BRAM stores information about each in-flight instruction
• Tables are indexed by a “token”, also used by the timing partition to refer to each instruction
• A new operation, “getToken”, allocates a slot in the tables
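A minimal sketch of the token-indexed tables, with BRAMs modeled as fixed-size lists. The table fields and the token count are illustrative assumptions, not the actual HAsim layout:

```python
# Sketch of token-indexed scoreboard tables, as described above.
# BRAMs are modeled as fixed-size Python lists; details are illustrative.
NUM_TOKENS = 8  # assumption: a small fixed number of in-flight instructions

class Scoreboard:
    def __init__(self):
        self.valid  = [False] * NUM_TOKENS   # slot in use?
        self.pc     = [0] * NUM_TOKENS       # per-token fetch address
        self.result = [None] * NUM_TOKENS    # per-token execute result

    def get_token(self, pc):
        """Allocate a free slot; the token then indexes every table."""
        for tok in range(NUM_TOKENS):
            if not self.valid[tok]:
                self.valid[tok] = True
                self.pc[tok] = pc
                return tok
        raise RuntimeError("no free tokens: timing model must stall")

    def free_token(self, tok):
        """On commit or abort, the slot is recycled."""
        self.valid[tok] = False

sb = Scoreboard()
t = sb.get_token(pc=0x400)
sb.result[t] = 99          # any pipeline stage can index by token
sb.free_token(t)
```

Because every table shares the same token index, the timing partition can name an in-flight instruction with a single small identifier rather than passing its state around.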
Implementing the Operations
See paper for details (also extra slides)
Assessment: Three Timing Models

• Unpipelined target
• MIPS R10K-like out-of-order superscalar
• 5-stage pipeline
Assessment: Target Performance
Targets have an idealized memory hierarchy.
[Figure: Target Processor CPI — Model Cycles per Instruction (CPI), scale 0 to 3.5, for the benchmarks median, multiply, qsort, towers, vvadd, and their average; bars for the Unpipelined, 5-stage, and Out-of-Order targets]
Assessment: Simulator Performance
Some correspondence between target and functional partition is very helpful
[Figure: Simulation Rate — FPGA-Cycles per Model Cycle (FMR), scale 0 to 45, for the benchmarks median, multiply, qsort, towers, vvadd, and their average; bars for the Unpipelined, 5-Stage, and Out-of-Order targets]
Assessment: Reuse and Physical Stats

Where is functionality implemented:

[Table: for each design (Unpipelined, 5-Stage, Out-of-Order), whether the IMem, Program Counter, Branch Predictor, Scoreboard/ROB, RegFile, Maptable/Freelist, ALU, DMem, Store Buffer, and Snapshots/Rollback come from the Functional Partition or are N/A; the unpipelined target leaves five of these N/A, the 5-stage one, and the out-of-order none]

FPGA usage (Virtex IIPro 70, using ISE 8.1i):

                       Unpipelined   5-Stage       Out-of-Order
FPGA Slices            6,599 (20%)   9,220 (28%)   22,873 (69%)
Block RAMs             18 (5%)       25 (7%)       25 (7%)
Clock Speed            98.8 MHz      96.9 MHz      95.0 MHz
Average FMR            41.1          7.49          15.6
Simulation Rate        2.4 MHz       14 MHz        6 MHz
Average Simulator IPS  2.4 MIPS      5.1 MIPS      4.7 MIPS
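The simulation-rate row follows from dividing the FPGA clock speed by the average FMR. A quick check of that arithmetic (the 5-stage row does not reproduce exactly, 96.9/7.49 ≈ 12.9 vs. the stated 14 MHz, presumably because of how the per-benchmark FMRs were averaged):

```python
# Simulation rate ≈ FPGA clock speed / average FMR (FPGA cycles per model
# cycle). Figures are from the table above; the 5-stage row differs
# slightly, presumably due to how the per-benchmark FMRs were averaged.
def sim_rate_mhz(clock_mhz, fmr):
    return clock_mhz / fmr

assert round(sim_rate_mhz(98.8, 41.1), 1) == 2.4   # matches the table
assert round(sim_rate_mhz(95.0, 15.6), 1) == 6.1   # table rounds to 6 MHz
```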
Future Work: Simulating Multicores

Scheme 1: Duplicate both partitions
[Figure: timing models A–D each pair with their own functional reg state + datapath (interaction occurs here); all share the functional memory state]

Scheme 2: Cluster timing partitions
[Figure: timing models A–D share a single functional reg state + datapath (interaction still occurs here) and the functional memory state; a context ID is used to reference all state lookups]
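The context-ID idea in Scheme 2 amounts to tagging every state lookup with the core it belongs to. The flat-array layout below is a hypothetical illustration of how one shared register file could serve several timing models:

```python
# Hypothetical sketch of context-ID state multiplexing (Scheme 2):
# one shared functional register file serves several timing models,
# with each lookup tagged by the requesting core's context ID.
NUM_CONTEXTS = 4   # timing models A-D
NUM_REGS = 32

# One register file per context, packed into a single flat array,
# as a single BRAM might hold them on the FPGA.
regfile = [0] * (NUM_CONTEXTS * NUM_REGS)

def read_reg(ctx, reg):
    return regfile[ctx * NUM_REGS + reg]

def write_reg(ctx, reg, value):
    regfile[ctx * NUM_REGS + reg] = value

write_reg(ctx=2, reg=5, value=7)   # timing model C writes r5
assert read_reg(2, 5) == 7
assert read_reg(0, 5) == 0         # timing model A's r5 is untouched
```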
Future Work: Simulating Multicores

Scheme 3: Multiplex the timing models themselves
• Leverage HAsim A-Ports in the timing model
• Out of scope of today’s talk

[Figure: multiplexed timing models A–D drive a shared functional reg state + datapath (interaction still occurs here) and the functional memory state, using a context ID to reference all state lookups]
Future Work: Unifying with the UT-FAST Model

UT-FAST is functional-first. This can be unified into timing-directed: just do “execute-at-fetch”.

[Figure: a functional emulator running in software streams executed instructions to the functional and timing partitions on the FPGA; a resteer path runs back to the emulator]
Summary

• Described a scheme for closely-coupled timing-directed partitioning in which both partitions are suitable for on-FPGA implementation
• Demonstrated such a scheme’s benefits: very good reuse, very good area and clock speed, and a good FPGA-to-Model Cycle Ratio
  - Caveat: assuming some correspondence between the timing model and the functional partition (recall the unpipelined target)
• We plan to extend this using contexts for hardware multiplexing [Chung 07]
• Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Questions?
Extra Slides
Functional Partition Fetch
Functional Partition Decode
Functional Partition Execute
Functional Partition Back End
Timing Model: Unpipelined
5-Stage Pipeline Timing Model
Out-Of-Order Superscalar Timing Model