Speculative Parallelization in Decoupled Look-ahead
Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering
University of Rochester, Rochester, NY
2
Motivation
• Single-thread performance is still an important design goal
• Branch mispredictions and cache misses are still major performance hurdles
• One effective approach: decoupled look-ahead
• The look-ahead thread can often become the new bottleneck
• Speculative parallelization is aptly suited to accelerating it
  • More parallelism opportunities: the look-ahead thread does not contain all dependences
  • Simple architectural support: look-ahead is not correctness-critical
3
Outline
• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary
4
Baseline Decoupled Look-ahead
• A binary parser generates a skeleton from the original program
• The skeleton runs on a separate core and
  • Maintains its own memory image in the local L1, with no write-back to the shared L2
  • Sends all branch outcomes to the main thread through a FIFO queue and also helps prefetching
*A. Garg and M. Huang. “A Performance-Correctness Explicitly Decoupled Architecture”. In Proc. Int’l Symp. on Microarch., November 2008.
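The branch-outcome FIFO described above can be sketched in a few lines. This is a hypothetical illustration, not code from the paper: the class and method names and the queue depth are assumptions, and it only models the control flow of a bounded queue from the look-ahead thread to the main thread, which also acts as the natural throttling mechanism mentioned later.

```python
from collections import deque

FIFO_DEPTH = 64  # assumed depth of the branch-outcome queue


class DecoupledLookahead:
    """Illustrative model of the look-ahead/main-thread branch FIFO."""

    def __init__(self):
        self.branch_fifo = deque()  # outcomes flowing from look-ahead to main thread

    def lookahead_step(self, taken):
        """Look-ahead thread: record one branch outcome if the queue has room."""
        if len(self.branch_fifo) >= FIFO_DEPTH:
            return False  # queue full: look-ahead stalls (natural throttling)
        self.branch_fifo.append(taken)
        return True

    def main_thread_predict(self, fallback):
        """Main thread: consume a queued outcome as its branch prediction."""
        if self.branch_fifo:
            return self.branch_fifo.popleft()
        return fallback  # queue empty: fall back to the conventional predictor
```

When the main thread stalls, the FIFO fills and the look-ahead thread stops advancing, which is the run-away-prefetching throttle noted on the next slide.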
5
Practical Advantages of Decoupled Look-ahead
• Look-ahead thread is a single, self-reliant agent
• No need for quick spawning and register communication support
• Little management overhead on main thread
• Easier for run-time control to disable
• Natural throttling mechanism to prevent run-away prefetching
• Look-ahead thread size comparable to aggregation of short helper threads
6
Look-ahead: A New Bottleneck
• Comparing four systems
  • Baseline
  • Decoupled look-ahead
  • Ideal
  • Look-ahead alone
• Application categories
  • Bottlenecks removed
  • Speed of look-ahead not the problem
  • Look-ahead is the new bottleneck
7
Outline
• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary
8
Unique Opportunities
• Skeleton code offers more parallelism
  • Certain dependences are removed during slicing for the skeleton
• Look-ahead is inherently error-tolerant
  • Can ignore dependence violations
  • Little to no support needed, unlike in conventional TLS
Assembly code from equake
9
Software Support
• Dependence analysis
  • Profile-guided, coarse-grain, at the basic-block level
• Spawn and target points
  • Basic blocks with a consistent dependence distance of more than a threshold DMIN
  • The spawned thread executes from the target point
• Loop-level parallelism is also exploited
[Figure: parallelism at the basic-block level; available parallelism for a 2-core/context system, with spawn and target points marked along the time axis]
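The spawn/target selection rule above (a consistently large dependence distance between basic blocks) can be sketched as follows. This is an illustration, not the paper's tool: the profile data layout and function name are assumptions; only the idea of a DMIN threshold comes from the slides.

```python
DMIN = 15  # minimum dependence distance, in basic blocks (threshold from the slides)


def select_spawn_points(profile, dmin=DMIN):
    """Pick (spawn, target) basic-block pairs whose profiled dependence
    distance is consistently above the threshold.

    `profile` maps (spawn_bb, target_bb) -> list of dependence distances
    (in basic blocks) observed during profiling runs.
    """
    spawns = []
    for (spawn_bb, target_bb), distances in profile.items():
        # "consistent" here means every observed distance clears the threshold
        if distances and min(distances) > dmin:
            spawns.append((spawn_bb, target_bb))
    return spawns
```

For example, a pair observed at distances [20, 22, 25] qualifies, while one observed at [4, 30] does not, because its distance is not consistently above DMIN.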
10
Parallelism in Look-ahead Binary
Available theoretical parallelism for a 2-core/context system; DMIN = 15 basic blocks
Parallelism potential with stable and consistent spawn and target points
11
Hardware and Runtime Support
• Thread spawning and merging
  • Not too different from regular thread spawning, except
    • The spawned thread shares the same register and memory state
    • The spawning thread terminates at the target PC
• Value communication
  • Register-based: handled naturally through shared registers in SMT
  • Memory-based: can be supported at different levels
    • Partial versioning in the cache at line granularity
Spawning of a new look-ahead thread
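The line-level partial versioning mentioned above can be sketched as a copy-on-write buffer: the spawned thread's stores create private copies of cache lines, its loads read through to the parent state otherwise, and merging simply adopts the private versions. All names here are illustrative assumptions; the key point from the slides is that because look-ahead is not correctness-critical, the merge need not check for dependence violations.

```python
LINE = 64  # assumed cache-line size in bytes


class SpeculativeCache:
    """Illustrative line-level partial versioning for a spawned look-ahead thread."""

    def __init__(self, parent_mem):
        self.parent = parent_mem  # shared state: line index -> {addr: value}
        self.private = {}         # this thread's private line versions

    def write(self, addr, value):
        line = addr // LINE
        # copy-on-write: version the whole line privately on first write
        self.private.setdefault(line, dict(self.parent.get(line, {})))
        self.private[line][addr] = value

    def read(self, addr):
        line = addr // LINE
        if line in self.private:
            return self.private[line].get(addr)
        return self.parent.get(line, {}).get(addr)

    def merge(self):
        # Look-ahead is not correctness-critical: adopt the private versions
        # without any dependence-violation check.
        self.parent.update(self.private)
        self.private.clear()
```

A TLS design for the main thread would need violation detection and squash here; the error tolerance of look-ahead is what lets this support stay simple.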
12
Outline
• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary
13
Experimental Setup
• Program/binary analysis tool: based on ALTO
• Simulator: based on a heavily modified SimpleScalar
  • SMT, look-ahead, and speculative parallelization support
  • True execution-driven simulation (faithful value modeling)
Microarchitectural and cache configurations
14
Speedup of speculative parallel decoupled look-ahead
• 14 applications in which look-ahead is the bottleneck
• Speedups of look-ahead systems over the single-thread baseline:
  • Decoupled look-ahead over single-thread baseline: 1.61x
  • Speculative look-ahead over single-thread baseline: 1.81x
  • Speculative look-ahead over decoupled look-ahead: 1.13x
15
Speculative Parallelization in Look-ahead vs Main thread
• The skeleton provides more opportunities for parallelization
• Speedup of speculative parallelization:
  • Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
  • Speculative main thread over the single-thread baseline: 1.07x
[Figure: IPC of four 2-core configurations: single-thread baseline, baseline TLS (1.07x), decoupled look-ahead, and speculative look-ahead with two contexts LA0/LA1 (1.13x over decoupled look-ahead, 1.81x over the single-thread baseline)]
16
Flexibility in Hardware Design
• Comparison of the regular (partial versioning) cache support with two alternatives:
  • No cache versioning support
  • Dependence violation detection and squash
17
Other Details in the Paper
• The effect of spawning a look-ahead thread on preserving the lead of the overall look-ahead system
• A technique to avoid spurious spawns in the dispatch stage, which can be subject to branch misprediction
• A technique to mitigate the damage of runaway spawns
• The impact of speculative parallelization on overall recoveries
• Modifications to the branch queue in a multiple look-ahead system
18
Outline
• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary
19
Summary
• Decoupled look-ahead can significantly improve single-thread performance, but the look-ahead thread often becomes the new bottleneck
• The look-ahead thread lends itself to TLS acceleration
  • Skeleton construction removes dependences and increases parallelism
  • The hardware design is flexible and can be a simple extension of SMT
• A straightforward implementation of TLS look-ahead achieves an average 1.13x speedup (up to 1.39x)
Speculative Parallelization in Decoupled Look-ahead
Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering
University of Rochester, Rochester, NY
Backup Slides
http://www.ece.rochester.edu/projects/acal/
21
References (Partial)
• J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int’l Conf. on Supercomputing, pages 68–75, July 1997.
• O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 129–140, February 2003.
• J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int’l Symp. on Comp. Arch., pages 14–25, June 2001.
• C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int’l Symp. on Comp. Arch., pages 2–13, June 2001.
• A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 37–48, January 2001.
• Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int’l Symp. on Microarch., pages 269–280, December 2000.
• G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int’l Symp. on Comp. Arch., pages 414–425, June 1995.
22
References (cont.)
• H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In Proc. Int’l Conf. on Parallel Arch. and Compilation Techniques, pages 231–242, September 2005.
• A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int’l Symp. on Microarch., November 2008.
• C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int’l Symp. on Microarch., pages 85–96, November 2002.
• D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
• J. Renau, K. Strauss, L. Ceze, W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread-Level Speculation on a CMP. IEEE Micro, 26(1):80–91, January/February 2006.
• P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and Runahead Execution. In Proc. Intl. Conf. on Supercomputing, pages 410–420, June 2009.
• M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In Proc. Int’l Symp. on Comp. Arch., pages 13–24, June 2000.
Backup Slides
For reference only
24
Register Renaming Support
25
Modified Banked Branch Queue
26
Speedup of all applications
27
Impact on Overall Recoveries
28
“Partial” Recovery in Look-ahead
• When a recovery happens in the primary look-ahead
  • Do we kill the secondary look-ahead or not?
  • If we do not kill it, we gain around 1% performance improvement
  • After a recovery in the main thread, the secondary thread often survives (around 1000 instructions)
• Spawning a secondary look-ahead thread helps preserve the slip of the overall look-ahead system
29
Quality of Thread Spawning
• Successful and runaway spawns: runaway spawns are killed after a certain time
• Impact of a lazy spawning policy
  • Wait a few cycles when a spawning opportunity arrives
  • Avoids spurious spawns; saves some wrong-path execution