Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
University of Rochester
The gist of the paper…
Radical idea: Trade off frequency and hardware complexity dynamically at runtime
rather than statically at design time
The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture
is key to making this worthwhile
Application phase behavior
Varying behavior over time
[Sherwood, Sair, Calder, ISCA 2003]
Can exploit to save power
[Figure: gcc behavior over time: L2 misses, IPC, L1I misses, L1D misses, branch mispredictions, E per interval]
[Buyuktosunoglu, et al., GLSVLSI 2001]
adaptive issue queue
What about performance?
Lower power and faster access time!
RAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.52  0.31
CAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.55  0.34
[Buyuktosunoglu, GLSVLSI 2001]
What about performance?
How do we exploit the faster speed?
Variable latency
Increase frequency when downsizing
Decrease frequency when upsizing
What about performance?
[Albonesi, ISCA 1998]
[Figure: fully synchronous processor driven by a single clock: Fetch Unit, L1 I-Cache, Br Pred, Dispatch/Rename/ROB, integer and FP Issue Queues with ALUs & RF, Ld/St Unit, L1 D-Cache, L2 Cache, Main Memory]
What about performance?
[Albonesi, ISCA 1998]
[Figure: Avg TPI (ns), y-axis 0.0 to 1.2, per benchmark (m88ksim, gcc, compress, li, ijpeg, perl, vortex, airshed, stereo, radar, appcg, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, average); series: Best Conventional, Process-level Adaptive]
Enter GALS…
[Figure: GALS processor split into clock domains]
Front-end Domain: Fetch Unit, L1 I-Cache, Br Pred, Dispatch, Rename, ROB
Integer Domain: Issue Queue, ALUs & RF
FP Domain: Issue Queue, ALUs & RF
Memory Domain: Ld/St Unit, L1 D-Cache, L2 Cache
External Domain: Main Memory
[Semeraro et al., HPCA 2002][Iyer and Marculescu, ISCA 2002]
Outline
Motivation and background
Adaptive GALS microarchitecture
Control mechanisms
Evaluation methodology
Results
Conclusions and future work
Adaptive GALS microarchitecture
[Figure: GALS processor with Front-end, Integer, FP, Memory, and External domains; the Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and Issue Queues are each drawn in several sizes]
Adaptive GALS operation
[Figure: the same five-domain GALS processor in operation, with the Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and Issue Queues each drawn in several sizes]
Resizable cache organization
Access A part first, then B part on a miss
Swap A and B blocks on an A miss, B hit
Select A/B split according to application phase behavior
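The access order and swap rule above can be sketched for a single cache set. This is a minimal illustration, not the authors' hardware design: the class name `ABSet`, the LRU ordering inside each part, and the choice to fill a missed block into A (pushing A's oldest block into B) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ABSet:
    """One cache set split into an A part (probed first) and a B part.

    Hypothetical model: tags are kept most-recent-first in each part.
    """
    a_ways: int                               # ways assigned to the A part
    total_ways: int                           # ways in A plus B
    a: list = field(default_factory=list)     # tags held in A
    b: list = field(default_factory=list)     # tags held in B

    def access(self, tag):
        """Return 'A hit', 'B hit', or 'miss' and update the set."""
        if tag in self.a:
            # Fast path: found in the A part.
            self.a.remove(tag)
            self.a.insert(0, tag)
            return "A hit"
        if tag in self.b:
            # A miss, B hit: the block moves to A (the swap on the slide).
            self.b.remove(tag)
            outcome = "B hit"
        else:
            # Miss in both parts; assumption: the new block fills into A.
            outcome = "miss"
        self.a.insert(0, tag)
        if len(self.a) > self.a_ways:
            # A's oldest block is displaced into B.
            self.b.insert(0, self.a.pop())
        if len(self.b) > self.total_ways - self.a_ways:
            # B's oldest block leaves the set.
            self.b.pop()
        return outcome


s = ABSet(a_ways=1, total_ways=4)              # configuration A1 B3
print([s.access(t) for t in ["A", "B", "A", "A"]])
```

Changing `a_ways` between intervals corresponds to selecting a different A/B split; a real design would also have to account for the frequency change that goes with it.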
Resizable cache control
[Figure: example accesses to blocks A, B, C, D and the resulting MRU state, shown between its (LRU) and (MRU) ends; each access increments the counter MRU[i] for the MRU position it hits]
Config A1 B3
• hitsA = MRU[0]
• hitsB = MRU[1] + MRU[2] + MRU[3]
Config A2 B2
• hitsA = MRU[0] + MRU[1]
• hitsB = MRU[2] + MRU[3]
Config A3 B1
• hitsA = MRU[0] + MRU[1] + MRU[2]
• hitsB = MRU[3]
Config A4 B0
• hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]
• hitsB = 0
• Calculate the cost for each possible configuration:
  A access costs = (hitsA + hitsB + misses) * CostA
  B access costs = (hitsB + misses) * CostB
  Miss access costs = misses * CostMiss
  Total access cost = A + B + Miss (normalized to frequency)
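The cost calculation can be sketched as a function that evaluates every A/B split from one set of MRU counters. This is an illustrative sketch: the function name, the per-configuration cost and frequency lists, and the reading of "normalized to frequency" as dividing a cycle count by the configuration's relative frequency are assumptions, and the numbers in the usage line are made up.

```python
def cache_config_costs(mru, misses, cost_a, cost_b, cost_miss, freq):
    """Total access cost of each A/B split, from one interval's counters.

    mru[i]    : hits counted at MRU position i (0 = most recently used)
    misses    : accesses that missed in both parts
    cost_a[k] : A access cost for the split with k + 1 ways in A
    cost_b[k] : B access cost for that split (0 when B is empty)
    cost_miss : cost of a miss
    freq[k]   : relative frequency of that split (assumed normalization)
    """
    ways = len(mru)
    totals = {}
    for k in range(1, ways + 1):
        hits_a = sum(mru[:k])          # hits that fall in the A part
        hits_b = sum(mru[k:])          # hits that fall in the B part
        i = k - 1
        a_cost = (hits_a + hits_b + misses) * cost_a[i]   # every access probes A
        b_cost = (hits_b + misses) * cost_b[i]            # A misses go on to B
        miss_cost = misses * cost_miss
        totals[f"A{k} B{ways - k}"] = (a_cost + b_cost + miss_cost) / freq[i]
    return totals


# Made-up counters and costs, for illustration only.
costs = cache_config_costs(
    mru=[1, 2, 3, 4], misses=5,
    cost_a=[1, 1, 1, 1], cost_b=[2, 2, 2, 0], cost_miss=10,
    freq=[1.0, 1.0, 1.0, 1.0],
)
best = min(costs, key=costs.get)   # configuration with the lowest total cost
print(best, costs[best])
```

Because the counters are indexed by MRU position, a single pass over one interval's accesses is enough to score all four configurations.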
Resizable issue queue control
Measures the exploitable ILP for each queue size
Timestamp counter is reset at the start of an interval and incremented each cycle
During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand
The maximum timestamp, MAXN, is maintained for each of the four possible queue sizes over N fetched instructions (N=16, 32, 48, 64)
ILP is estimated as N/MAXN
Queue size with highest ILP (normalized to frequency) is selected
Read the paper
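The timestamp scheme can be sketched in software as follows. This is a simplified model under stated assumptions: instructions are hypothetical `(dest, sources, latency)` tuples, registers not yet written in the interval have timestamp 0, latencies are at least 1, the per-cycle timestamp counter is not modeled, and "normalized to frequency" is taken to mean multiplying the ILP estimate by a relative frequency. The frequency values in the usage lines are made up.

```python
def estimate_ilp(instrs, sizes=(16, 32, 48, 64)):
    """Estimate ILP as N / MAXN for each candidate queue size N.

    instrs: sequence of (dest, sources, latency) tuples in fetch order.
    """
    ready = {}        # register -> timestamp at which its value is available
    max_ts = 0        # largest timestamp seen so far (MAXN once N are seen)
    ilp = {}
    for count, (dest, srcs, latency) in enumerate(instrs, start=1):
        # Timestamp of the slowest source operand plus execution latency.
        ts = max((ready.get(s, 0) for s in srcs), default=0) + latency
        if dest is not None:
            ready[dest] = ts
        max_ts = max(max_ts, ts)
        if count in sizes:
            ilp[count] = count / max_ts
    return ilp


def pick_queue_size(ilp, rel_freq):
    """Choose the size with the highest ILP scaled by relative frequency."""
    return max(ilp, key=lambda n: ilp[n] * rel_freq[n])


rel_freq = {16: 1.6, 32: 1.3, 48: 1.1, 64: 1.0}   # made-up values

# 64 independent single-cycle instructions: a larger queue exposes more ILP.
independent = [(f"r{i}", (), 1) for i in range(64)]
print(pick_queue_size(estimate_ilp(independent), rel_freq))

# A serial dependence chain: ILP is 1 at every size, so the fastest clock wins.
chain = [("r0", ("r0",), 1) for _ in range(64)]
print(pick_queue_size(estimate_ilp(chain), rel_freq))
```

The two traces bracket the trade-off: with abundant parallelism the extra entries outweigh the slower clock, while a dependence chain gains nothing from a larger queue.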
Resizable hardware – some details
Front end domain
• Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
• Branch predictor sized with Icache
  – gshare PHT: 16KB-64KB
  – Local BHT: 2KB-8KB
  – Local PHT: 1024 entries
  – Meta: 16KB-64KB
Load/store domain
• Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
• L2 cache “A” sized with Dcache
  – 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating point domains
• Issue queue: 16, 32, 48, or 64 entries
Evaluation methodology
SimpleScalar and Cacti
40 benchmarks from SPEC, Mediabench, and Olden
Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
Adaptive MCD costs imposed:
• Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
• Frequency penalty as much as 31%
Mean PLL locking time of 15 µsec
Program-Adaptive: profile application and pick the best adaptive configuration for the whole program
Phase-Adaptive: use online cache and issue queue control mechanisms
Performance improvement
[Figure: performance improvement per benchmark, grouped by suite: Mediabench, Olden, SPEC]
Phase behavior – art
[Figure: issue queue entries (16, 32, 48, 64) over a 100 million instruction window]
Phase behavior – apsi
[Figure: Dcache “A” size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]
Performance summary
Program Adaptive: 17% performance improvement
Phase Adaptive: 20% performance improvement
• Automatic
• Never degrades performance for 40 applications
• Few phases in chosen application windows – could perhaps do better
Distribution of chosen configurations for Program Adaptive (slide columns: Integer IQ, FP IQ, D/L2 Cache, Icache):
Issue queue entries: 16 85%, 32 5%, 48 5%, 64 5%
D/L2 cache: 32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
Icache: 16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
Issue queue entries: 16 73%, 32 15%, 48 8%, 64 5%
Domain frequency versus IQ size
[Figure: relative frequency (y-axis, 0.4 to 1.8) versus issue queue size (16, 32, 48, 64)]
Conclusions
Application phase behavior can be exploited to improve performance in addition to power savings
GALS approach is key to localizing the impact of slowing the clock
Cache and queue control mechanisms can evaluate all possible configurations within a single interval
Phase adaptive approach improves performance by as much as 48% and by an average of 20%
Future work
Explore multiple adaptive structures in each domain
Better take into account the branch predictor
Resize the instruction cache by sets rather than ways
Explore better issue queue design alternatives
Build circuits
Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores