
Page 1: Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead

Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering
University of Rochester, Rochester, NY

Page 2: Speculative Parallelization in Decoupled Look-ahead

Motivation

• Single-thread performance is still an important design goal
• Branch mispredictions and cache misses are still major performance hurdles
• One effective approach: decoupled look-ahead
• The look-ahead thread can often become the new bottleneck
• Speculative parallelization is aptly suited for accelerating it
  • More parallelism opportunities: the look-ahead thread does not contain all dependences
  • Simple architectural support: look-ahead is not correctness-critical

Page 3: Speculative Parallelization in Decoupled Look-ahead

Outline

• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary

Page 4: Speculative Parallelization in Decoupled Look-ahead

Baseline Decoupled Look-ahead

• A binary parser is used to generate the skeleton from the original program
• The skeleton runs on a separate core and
  • Maintains its own memory image in the local L1; no write-back to the shared L2
  • Sends all branch outcomes through a FIFO queue (sketched below) and also helps prefetching

*A. Garg and M. Huang. “A Performance-Correctness Explicitly Decoupled Architecture”. In Proc. Int’l Symp. on Microarch., November 2008.
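A minimal, single-threaded C sketch of the branch-outcome FIFO idea described above. The queue depth, the stall/fallback behaviour, and all names are illustrative assumptions, not the paper's actual hardware; a full queue also hints at the natural throttling mentioned on the next slide.

/*
 * Conceptual branch-outcome queue between the look-ahead (skeleton) thread
 * and the main thread.  BOQ_DEPTH and the tiny workload are assumptions.
 */
#include <stdio.h>
#include <stdbool.h>

#define BOQ_DEPTH 64               /* assumed branch-outcome queue depth */

typedef struct {
    bool outcome[BOQ_DEPTH];
    int head, tail, count;
} boq_t;

static bool boq_push(boq_t *q, bool taken)    /* look-ahead side */
{
    if (q->count == BOQ_DEPTH)
        return false;              /* queue full: look-ahead stalls (throttling) */
    q->outcome[q->tail] = taken;
    q->tail = (q->tail + 1) % BOQ_DEPTH;
    q->count++;
    return true;
}

static bool boq_pop(boq_t *q, bool *taken)    /* main-thread side */
{
    if (q->count == 0)
        return false;              /* queue empty: fall back to the normal predictor */
    *taken = q->outcome[q->head];
    q->head = (q->head + 1) % BOQ_DEPTH;
    q->count--;
    return true;
}

int main(void)
{
    boq_t q = { .head = 0, .tail = 0, .count = 0 };
    /* The look-ahead runs ahead and deposits a few outcomes ... */
    for (int i = 0; i < 5; i++)
        boq_push(&q, i % 2);
    /* ... which the main thread later consumes as near-perfect predictions. */
    bool t;
    while (boq_pop(&q, &t))
        printf("main thread uses look-ahead branch outcome: %s\n",
               t ? "taken" : "not taken");
    return 0;
}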

Page 5: Speculative Parallelization in Decoupled Look-ahead

Practical Advantages of Decoupled Look-ahead

• Look-ahead thread is a single, self-reliant agent
• No need for quick spawning and register communication support
• Little management overhead on main thread
• Easier for run-time control to disable
• Natural throttling mechanism to prevent run-away prefetching
• Look-ahead thread size comparable to aggregation of short helper threads

Page 6: Speculative Parallelization in Decoupled Look-ahead

Look-ahead: A New Bottleneck

• Comparing four systems
  • Baseline
  • Decoupled look-ahead
  • Ideal
  • Look-ahead alone
• Application categories
  • Bottlenecks removed
  • Speed of look-ahead not the problem
  • Look-ahead is the new bottleneck

Page 7: Speculative Parallelization in Decoupled Look-ahead

Outline

• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary

Page 8: Speculative Parallelization in Decoupled Look-ahead

Unique Opportunities

• Skeleton code offers more parallelism
  • Certain dependences are removed during slicing for the skeleton (see the sketch below)
• Look-ahead is inherently error-tolerant
  • Can ignore dependence violations
  • Little to no support needed, unlike in conventional TLS

Assembly code from equake
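The slide's example is assembly from equake, which is not reproduced here. Below is a generic, hypothetical C illustration of the same idea: work that feeds only stored data is sliced out of the skeleton, removing a loop-carried dependence and exposing parallelism across iterations. The functions and loop are invented for illustration.

#include <stdio.h>
#include <stddef.h>

/* Original loop: 'acc' carries a dependence across iterations. */
static void original(double *out, const double *a, const int *idx, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {     /* kept: drives the loop branch   */
        double v = a[idx[i]];            /* kept: load helps prefetching   */
        acc += v * v;                    /* dropped: feeds store data only */
        out[i] = acc;                    /* dropped: store data not needed */
    }
}

/* What the skeleton effectively keeps: the loads (for prefetching) and the
 * loop branch, but no cross-iteration dependence, so iterations of the
 * skeleton can be speculatively parallelized. */
static void skeleton(const double *a, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        (void)*(volatile const double *)&a[idx[i]];
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, out[4];
    int idx[4] = {3, 1, 0, 2};
    original(out, a, idx, 4);
    skeleton(a, idx, 4);
    printf("out[3] = %g\n", out[3]);
    return 0;
}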

Page 9: Speculative Parallelization in Decoupled Look-ahead

Software Support

• Dependence analysis
  • Profile-guided, coarse-grain, at the basic-block level
• Spawn and target points (see the sketch after the figure below)
  • Basic blocks with a consistent dependence distance of more than a threshold DMIN
  • Spawned thread executes from the target point
• Loop-level parallelism is also exploited

[Figure: parallelism at the basic-block level; available parallelism for a 2-core/context system, with spawn and target points marked along a time axis]
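A hedged sketch of how spawn/target candidates might be picked from a profile. This is a simplification of the profile-guided, basic-block-level analysis named above, not the actual ALTO-based tool; the profile records, the per-block minimum-distance test, and the tiny data are assumptions.

#include <stdio.h>
#include <limits.h>

#define NUM_BBS 8
#define DMIN    15             /* minimum dependence distance, in basic blocks */

struct prof_rec {              /* one dynamic basic-block instance */
    int bb;                    /* static basic-block id */
    int dep_dist;              /* distance (in blocks) to the nearest producer */
};

int main(void)
{
    /* hypothetical profile records */
    struct prof_rec prof[] = {
        {0, 40}, {3, 2}, {5, 22}, {0, 37}, {3, 1}, {5, 30}, {0, 45}, {5, 19},
    };
    int n = sizeof prof / sizeof prof[0];

    int min_dist[NUM_BBS];
    for (int b = 0; b < NUM_BBS; b++)
        min_dist[b] = INT_MAX;

    /* track the smallest dependence distance observed per static block */
    for (int i = 0; i < n; i++)
        if (prof[i].dep_dist < min_dist[prof[i].bb])
            min_dist[prof[i].bb] = prof[i].dep_dist;

    /* a block whose dependence distance is consistently above DMIN is a
     * viable target point; a spawn point is then a block observed at least
     * DMIN blocks earlier in the dynamic stream */
    for (int b = 0; b < NUM_BBS; b++)
        if (min_dist[b] != INT_MAX && min_dist[b] > DMIN)
            printf("BB%d: consistent dependence distance > DMIN,"
                   " viable target point\n", b);
    return 0;
}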

Page 10: Speculative Parallelization in Decoupled Look-ahead

Parallelism in Look-ahead Binary

Available theoretical parallelism for a 2-core/context system; DMIN = 15 basic blocks

Parallelism potential with stable and consistent target and spawn points

Page 11: Speculative Parallelization in Decoupled Look-ahead

Hardware and Runtime Support

• Thread spawning and merging (sketched after the figure below)
  • Not too different from regular thread spawning, except
    • Spawned thread shares the same register and memory state
    • Spawning thread terminates at the target PC
• Value communication
  • Register-based: handled naturally through shared registers in SMT
  • Memory-based: can be supported at different levels
    • Partial versioning in the cache at line granularity

Spawning of a new look-ahead thread
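A conceptual C sketch of the spawn/merge semantics and line-granularity partial versioning described above. Register and memory state is modelled here as a copied snapshot and versioning as a per-line ownership flag; both are simplifying assumptions for illustration, not the hardware design.

#include <stdio.h>
#include <string.h>

#define NUM_REGS  8
#define NUM_LINES 16

struct ctx {
    unsigned pc;
    long regs[NUM_REGS];
    long line_data[NUM_LINES];   /* private L1 view of the look-ahead image */
    int  line_owned[NUM_LINES];  /* 1 = this context wrote its own version  */
};

/* Spawn: the child starts at the target PC with the parent's current state
 * (a snapshot here; shared registers in SMT in the real design). */
static void spawn(struct ctx *child, const struct ctx *parent, unsigned target_pc)
{
    *child = *parent;
    child->pc = target_pc;
    memset(child->line_owned, 0, sizeof child->line_owned);
}

/* A store marks the line as this context's own version (partial versioning). */
static void store(struct ctx *c, int line, long v)
{
    c->line_data[line] = v;
    c->line_owned[line] = 1;
}

int main(void)
{
    struct ctx la0 = { .pc = 0x100 }, la1;
    store(&la0, 3, 42);
    spawn(&la1, &la0, 0x400);        /* secondary look-ahead starts at target */
    store(&la1, 5, 7);               /* its stores stay in its own version    */

    /* The primary reaches the target PC and terminates; la1 carries on. */
    la0.pc = 0x400;
    printf("la1 takes over at PC 0x%x; line 3 = %ld, line 5 owned = %d\n",
           la1.pc, la1.line_data[3], la1.line_owned[5]);
    return 0;
}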

Page 12: Speculative Parallelization in Decoupled Look-ahead

Outline

• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary

Page 13: Speculative Parallelization in Decoupled Look-ahead

Experimental Setup

• Program/binary analysis tool: based on ALTO
• Simulator: based on heavily modified SimpleScalar
  • SMT, look-ahead, and speculative parallelization support
  • True execution-driven simulation (faithful value modeling)

Microarchitectural and cache configurations

Page 14: Speculative Parallelization in Decoupled Look-ahead

Speedup of Speculative Parallel Decoupled Look-ahead

• 14 applications in which look-ahead is the bottleneck
• Speedup of look-ahead systems over the single-thread baseline
  • Decoupled look-ahead over single-thread baseline: 1.61x
  • Speculative look-ahead over single-thread baseline: 1.81x
  • Speculative look-ahead over decoupled look-ahead: 1.13x (i.e., 1.81 / 1.61)

Page 15: Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Look-ahead vs. Main Thread

• The skeleton provides more opportunities for parallelization
• Speedup of speculative parallelization
  • Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
  • Speculative main thread over the single-thread baseline: 1.07x

[Figure: IPC on two cores/contexts for Baseline (ST), Baseline TLS (1.07x over single thread), Decoupled LA, and Speculative LA with threads LA0/LA1 (1.13x over decoupled LA, 1.81x over single thread)]

Page 16: Speculative Parallelization in Decoupled Look-ahead

Flexibility in Hardware Design

• Comparison of the regular (partial versioning) cache support with two other alternatives (contrasted in the sketch below)
  • No cache versioning support
  • Dependence violation detection and squash
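A toy contrast of the three store-handling options listed above, with the conflict semantics deliberately simplified; an assumption-laden illustration of the design space, not the evaluated hardware. Because look-ahead is not correctness-critical, even the cheapest policy (no versioning) is acceptable, merely risking some pollution of the look-ahead state.

#include <stdio.h>

enum policy { NO_VERSIONING, PARTIAL_VERSIONING, DETECT_AND_SQUASH };

struct line {
    long shared;        /* value in the shared look-ahead memory image   */
    long spec;          /* speculative thread's private version          */
    int  spec_valid;    /* a private version exists                      */
    int  read_by_spec;  /* line already read by the speculative thread   */
};

/* Handle a store that conflicts with the other look-ahead thread's accesses.
 * Returns 1 if the speculative (spawned) thread must be squashed. */
static int conflicting_store(enum policy p, struct line *l, long v)
{
    switch (p) {
    case NO_VERSIONING:
        l->shared = v;             /* just write; tolerate any violation  */
        return 0;
    case PARTIAL_VERSIONING:
        l->spec = v;               /* keep a separate version of the line */
        l->spec_valid = 1;
        return 0;
    case DETECT_AND_SQUASH:
        l->shared = v;             /* conventional TLS behaviour:         */
        return l->read_by_spec;    /* conflict with a prior read squashes */
    }
    return 0;
}

int main(void)
{
    struct line l = { .shared = 1, .read_by_spec = 1 };
    printf("detect-and-squash squashes: %d\n",
           conflicting_store(DETECT_AND_SQUASH, &l, 99));
    printf("partial versioning squashes: %d\n",
           conflicting_store(PARTIAL_VERSIONING, &l, 99));
    printf("no versioning squashes: %d\n",
           conflicting_store(NO_VERSIONING, &l, 99));
    return 0;
}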

Page 17: Speculative Parallelization in Decoupled Look-ahead

Other Details in the Paper

• The effect of spawning a look-ahead thread on preserving the lead of the overall look-ahead system
• Technique to avoid spurious spawns in the dispatch stage, which could be subject to branch misprediction
• Technique to mitigate the damage of runaway spawns
• Impact of speculative parallelization on overall recoveries
• Modifications to the branch queue in a multiple look-ahead system

Page 18: Speculative Parallelization in Decoupled Look-ahead

Outline

• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary

Page 19: Speculative Parallelization in Decoupled Look-ahead

Summary

• Decoupled look-ahead can significantly improve single-thread performance, but the look-ahead thread often becomes the new bottleneck
• The look-ahead thread lends itself to TLS acceleration
  • Skeleton construction removes dependences and increases parallelism
  • Hardware design is flexible and can be a simple extension of SMT
• A straightforward implementation of TLS look-ahead achieves an average 1.13x (up to 1.39x) speedup

Page 20: Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead

Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering
University of Rochester, Rochester, NY

Backup Slides

http://www.ece.rochester.edu/projects/acal/

Page 21: Speculative Parallelization in Decoupled Look-ahead

References (Partial)

• J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int’l Conf. on Supercomputing, pages 68–75, July 1997.
• O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 129–140, February 2003.
• J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int’l Symp. on Comp. Arch., pages 14–25, June 2001.
• C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int’l Symp. on Comp. Arch., pages 2–13, June 2001.
• A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 37–48, January 2001.
• Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int’l Symp. on Microarch., pages 269–280, December 2000.
• G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int’l Symp. on Comp. Arch., pages 414–425, June 1995.

Page 22: Speculative Parallelization in Decoupled Look-ahead

References (cont.)

• H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In Proc. Int’l Conf. on Parallel Arch. and Compilation Techniques, pages 231–242, September 2005.
• A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int’l Symp. on Microarch., November 2008.
• C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int’l Symp. on Microarch., pages 85–96, November 2002.
• D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
• J. Renau, K. Strauss, L. Ceze, W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread-Level Speculation on a CMP. IEEE Micro, 26(1):80–91, January/February 2006.
• P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and Runahead Execution. In Proc. Int’l Conf. on Supercomputing, pages 410–420, June 2009.
• M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In Proc. Int’l Symp. on Comp. Arch., pages 13–24, June 2000.

Page 23: Speculative Parallelization in Decoupled Look-ahead

Backup Slides

For reference only

Page 24: Speculative Parallelization in Decoupled Look-ahead

Register Renaming Support

Page 25: Speculative Parallelization in Decoupled Look-ahead

Modified Banked Branch Queue

Page 26: Speculative Parallelization in Decoupled Look-ahead

Speedup of all applications

Page 27: Speculative Parallelization in Decoupled Look-ahead

Impact on Overall Recoveries

Page 28: Speculative Parallelization in Decoupled Look-ahead

“Partial” Recovery in Look-ahead

• When a recovery happens in the primary look-ahead
  • Do we kill the secondary look-ahead or not?
  • If we don't kill it, we gain around 1% performance improvement
  • After a recovery in the main thread, the secondary thread often survives (1000 insts)
• Spawning a secondary look-ahead thread helps preserve the slip of the overall look-ahead system

Page 29: Speculative Parallelization in Decoupled Look-ahead

Quality of Thread Spawning

• Successful and run-away spawns: killed after a certain time
• Impact of the lazy spawning policy (sketched below)
  • Wait a few cycles when a spawning opportunity arrives
  • Avoids spurious spawns; saves some wrong-path execution
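A minimal sketch of the lazy spawning policy described above: hold a spawn opportunity for a few cycles so that opportunities sitting on a mispredicted (wrong) path are squashed before they trigger spurious spawns. The delay value, the interface, and the event trace are assumed for illustration, not the paper's exact mechanism.

#include <stdio.h>
#include <stdbool.h>

#define LAZY_DELAY 4              /* assumed confirmation delay, in cycles */

struct pending_spawn {
    bool valid;
    unsigned target_pc;
    int countdown;                /* cycles left before the spawn is confirmed */
};

static void see_opportunity(struct pending_spawn *p, unsigned target_pc)
{
    p->valid = true;
    p->target_pc = target_pc;
    p->countdown = LAZY_DELAY;
}

/* Called on a branch-misprediction squash that covers the opportunity. */
static void squash(struct pending_spawn *p) { p->valid = false; }

/* Called every cycle; returns true exactly when the spawn should fire. */
static bool tick(struct pending_spawn *p)
{
    if (!p->valid) return false;
    if (--p->countdown > 0) return false;
    p->valid = false;
    return true;                  /* opportunity survived: spawn now */
}

int main(void)
{
    struct pending_spawn p = { false, 0, 0 };

    see_opportunity(&p, 0x400);
    for (int cyc = 0; cyc < 8; cyc++) {
        if (cyc == 2) squash(&p);             /* wrong path: spurious spawn avoided */
        if (tick(&p)) printf("cycle %d: spawn at PC 0x%x\n", cyc, p.target_pc);
    }

    see_opportunity(&p, 0x800);               /* this opportunity survives */
    for (int cyc = 0; cyc < 8; cyc++)
        if (tick(&p)) printf("cycle %d: spawn at PC 0x%x\n", cyc, p.target_pc);
    return 0;
}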