Parallelizing FPGA Placement with Transactional Memory

Steven Birk*, Greg Steffan**, and Jason Anderson**

*CS Department / **ECE Department

University of Toronto



Implications of Moore’s Law

[Figure: trends by year, 1995 to 2010, with curves for CPUs, FPGAs, and CAD complexity. CPU labels: Pentium II (7.5m), PIV (42m), Core 2 Duo (291m), Core i7 Quad (731m). Remaining labels: 70m, 350m, 1.1b, 2.5b.]

need for parallel CAD is intensifying


Parallelizing CAD Software

• The focus of this talk:

– simulated-annealing-based placement

key algorithm in FPGA CAD


Simulated Annealing Placement: Basic Idea

Algorithm:

1) Start with random placement of blocks

2) Randomly pick a pair of blocks to swap

3) Keep new placement if an improvement

[Figure: blocks A, B, C, D connected by nets, shown before and after swapping A and B]
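The three steps of the basic idea can be sketched as follows. This is a minimal illustration, not VPR's implementation: the bounding-box wirelength cost, the grid of slots, and the names `wirelength` and `place` are all assumptions made for the example. Following the slide, a swap is kept only if it improves the cost; full simulated annealing also accepts some worsening swaps with a temperature-dependent probability.

```python
import random

def wirelength(pos, nets):
    """Sum of half-perimeter bounding boxes over all nets (assumed cost)."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def place(blocks, nets, slots, moves=1000, seed=0):
    rng = random.Random(seed)
    # 1) Start with a random placement of blocks.
    shuffled = slots[:]
    rng.shuffle(shuffled)
    pos = dict(zip(blocks, shuffled))
    cost = wirelength(pos, nets)
    for _ in range(moves):
        # 2) Randomly pick a pair of blocks to swap.
        a, b = rng.sample(blocks, 2)
        pos[a], pos[b] = pos[b], pos[a]
        new_cost = wirelength(pos, nets)
        # 3) Keep the new placement if it is an improvement, else undo it.
        if new_cost < cost:
            cost = new_cost
        else:
            pos[a], pos[b] = pos[b], pos[a]
    return pos, cost

# Hypothetical 4-block circuit on the corners of a 4x4 grid.
blocks = ["A", "B", "C", "D"]
nets = [("A", "B"), ("C", "D")]
slots = [(0, 0), (3, 0), (0, 3), (3, 3)]
pos, cost = place(blocks, nets, slots)
```

Note the explicit undo in step 3: the sequential code has to restore the old placement itself whenever a swap is not kept.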


Potential Parallelism: the Intuition

[Figure: three panels of blocks A, B, C, D with candidate swaps marked "?": "Single-Threaded" (Thread 1), "Parallel Moves (success)" (Thread 1 and Thread 2), and "Parallel Moves (failure)" (Thread 1 and Thread 2)]

parallelism when blocks/nets are disjoint

nice match to Transactional Memory
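To make the intuition concrete, the sketch below treats two swaps as independent when the blocks they move, and the nets attached to those blocks, do not overlap. The names `footprint` and `can_run_in_parallel` are hypothetical. With TM this check is not written by hand; the TM system detects the overlap at run time.

```python
def footprint(swap, nets):
    """Blocks a swap moves, plus the indices of every net touching them."""
    moved = set(swap)
    touched = {i for i, net in enumerate(nets) if moved & set(net)}
    return moved, touched

def can_run_in_parallel(swap1, swap2, nets):
    blocks1, nets1 = footprint(swap1, nets)
    blocks2, nets2 = footprint(swap2, nets)
    return blocks1.isdisjoint(blocks2) and nets1.isdisjoint(nets2)

nets = [("A", "B"), ("C", "D"), ("E", "F")]
# Disjoint blocks and nets: both swaps can proceed (the "success" case).
ok = can_run_in_parallel(("A", "B"), ("C", "D"), nets)
# Both swaps touch nets 0 and 1: one must give way (the "failure" case).
clash = can_run_in_parallel(("A", "C"), ("B", "D"), nets)
```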


Transactional Memory (TM): the Basic Idea

Source Code:

...atomic { ... access_shared_data(); ...}...

Programmer:

Specifies transactions in source code

TM System:

Executes transactions optimistically in parallel

1) Checkpoints execution

2) Detects conflicts

3) Commits or aborts and re-executes

Exploits available parallelism while maintaining correctness!
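The checkpoint, detect, and commit-or-abort cycle can be illustrated with a toy software TM. This is only a sketch under simplifying assumptions (one global lock at commit time, a version number per location, buffered writes); it is not how tinySTM or any other real TM system is implemented, and `Store`, `Txn`, and `atomic` are names invented for the example.

```python
import threading

class Store:
    """Shared memory with a version number per location."""
    def __init__(self, data):
        self.data = dict(data)
        self.version = {k: 0 for k in data}
        self.lock = threading.Lock()

class Txn:
    def __init__(self, store):
        self.store, self.reads, self.writes = store, {}, {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        with self.store.lock:
            # Remember the version seen on first read, for validation.
            self.reads.setdefault(key, self.store.version[key])
            return self.store.data[key]

    def write(self, key, value):
        self.writes[key] = value          # buffered until commit

    def commit(self):
        with self.store.lock:
            # 2) Detect conflicts: did anything we read change since?
            if any(self.store.version[k] != v for k, v in self.reads.items()):
                return False
            for k, v in self.writes.items():
                self.store.data[k] = v
                self.store.version[k] += 1
            return True

def atomic(store, body):
    """Run body(txn) until it commits; return the number of aborts."""
    aborts = 0
    while True:
        txn = Txn(store)                  # 1) fresh buffers act as the checkpoint
        body(txn)
        if txn.commit():                  # 3) commit, or abort and re-execute
            return aborts
        aborts += 1

store = Store({"counter": 0})

def increment(txn):
    txn.write("counter", txn.read("counter") + 1)

def worker():
    for _ in range(100):
        atomic(store, increment)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every increment takes effect exactly once, even though the threads never lock the counter themselves; conflicting attempts are simply retried.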


TM Implementations

• Software TM (STM)

– compiler or library based

– works on current multicores, but high overheads

– Java: DSTM, ASTM

– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM (this work)

• Hardware TM (HTM)

– more automatic, low overhead, limited transaction size

– commercial systems don’t exist yet

– Stanford’s TCC, Wisconsin’s LogTM, SUN’s ROCK

STM has high overhead, no HTMs (yet)


Goals of this Work

• Parallelize simulated-annealing placement

– using software transactional memory (tinySTM)

– demonstrate the potential for good scaling

– not expecting great speedup due to the overheads of STM

• For the FPGA community

– evaluate potential for easier parallelization via TM

– suggest CAD algorithm changes to capitalize on TM

• For the systems/TM community

– lessons from a real application

– TM feature wish-list


Methodology

• CAD SW: Versatile Place and Route (VPR) 5.0

– available at www.eecg.toronto.edu/vpr

• Benchmark circuits: provided by VPR

– sizes ranging from: 67-6000 blocks, 100-60000 nets

– target architecture: 4 LUTs, cluster size 10

• STM: tinySTM

– available at www.tinystm.org

• Platform: 8 CPUs

– 2 x Quad-Core Intel Xeon E5345 @ 2.33 GHz


Challenges: Non-Determinism & Measurement

• Our initial implementation is non-deterministic

– however a deterministic version is possible, see paper

• Non-determinism makes measurement difficult

– different numbers of threads -> different work/results

• Solution: consider both runtime & quality-of-result (QoR)

– QoR: worst-case critical path delay

can trade off runtime and QoR


The Parallelization Story


First Parallelization Attempt

• Fast: one student-month

– includes time to get familiar with tinySTM, VPR code

– very few code changes

– produced correct results very quickly

– no deadlocks or data races

• Standard parallelism optimizations:

– reductions: e.g. success_sum += 1

– scheduling: move unnecessary code out of transactions

additional effort devoted to improving perf.
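A shared counter such as `success_sum` is written by every accepted move, so transactions that update it directly would all conflict on that one location. The usual reduction fix is to count privately per thread and combine the partial counts once at the end. The sketch below shows that pattern with plain Python threads; `run_moves` and the `accept` callback are hypothetical stand-ins, not VPR or tinySTM code.

```python
import threading

def run_moves(n_threads, moves_per_thread, accept):
    partial = [0] * n_threads            # one private slot per thread

    def worker(tid):
        local = 0
        for move in range(moves_per_thread):
            if accept(tid, move):
                local += 1               # no shared write inside the loop
        partial[tid] = local             # single write to this thread's slot

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)                  # combine after all threads finish

# Hypothetical acceptance rule: every second move is accepted.
success_sum = run_moves(4, 1000, lambda tid, move: move % 2 == 0)
```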


Performance (avg all benchmark circuits)

[Figure: performance plot averaged over all benchmark circuits; one label reads "deg."]

high QoR degradation (30%), high abort rate (60%)


More Optimization: Reduce Aborts

• Use feedback to identify causes of aborts

– 80% of aborts caused by accesses to x_lookup[]

  • array used to locate 2nd block in a swap

– interesting: not used by “I/O” type blocks

• Interesting resulting behavior: “favoritism”

– system favors swapping I/O blocks

  • I/O block swaps have much shorter txns, no conflicts

– only one non-I/O block swapping at a time

  • others conflict immediately on x_lookup[]

– intuition: causing QoR degradation, ‘false’ speedup

solution: privatize x_lookup[]
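Why privatization removes these conflicts can be seen by comparing write sets. The model below is an assumption made for illustration: it treats the whole array as a single conflict location and gives each thread its own pair of blocks, which is far simpler than VPR's real data structures. The names `write_set` and `count_conflicts` are hypothetical.

```python
def write_set(tid, privatized):
    """Locations one swap transaction writes, in a simplified model."""
    scratch = f"x_lookup[thread {tid}]" if privatized else "x_lookup"
    return {scratch, f"block_{2 * tid}", f"block_{2 * tid + 1}"}

def count_conflicts(n_threads, privatized):
    """Number of thread pairs whose write sets overlap."""
    sets = [write_set(t, privatized) for t in range(n_threads)]
    return sum(
        1
        for i in range(n_threads)
        for j in range(i + 1, n_threads)
        if sets[i] & sets[j]
    )

shared_conflicts = count_conflicts(4, privatized=False)   # every pair collides
private_conflicts = count_conflicts(4, privatized=True)   # no pair collides
```

With a shared scratch array, all six thread pairs conflict even though they swap different blocks; with one private copy per thread, none do.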


Transactions and Swaps: Terminology

• Swaps

– ACCEPTED or REJECTED

• Transactions

– COMMIT or ABORT

[Figure: blocks A and B]


More Optimization: Leveraging TM

• VPR code implements commit/abort

– directly modifies placement data structures

– undoes modifications if swap is rejected

• TM implements commit/abort, hence optimize:

– delete VPR code for undoing rejected swaps

– force transaction to abort if swap is rejected

requires API for forcing a transaction to abort
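The idea of letting the transaction's abort replace the hand-written undo can be sketched as below. The `Txn` class is a toy write buffer with a hypothetical `abort()` method, standing in for the forced-abort API the slide asks for; it is single-threaded, does no conflict detection, and is not the tinySTM interface. The 1-D positions and the cost function are also assumptions made for brevity.

```python
class Txn:
    """Buffers writes; they reach shared state only on commit."""
    def __init__(self, shared):
        self.shared, self.writes = shared, {}

    def read(self, key):
        return self.writes.get(key, self.shared[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.shared.update(self.writes)

    def abort(self):
        self.writes.clear()              # hypothetical forced-abort API

def cost(read, nets):
    """Total distance between the two endpoints of each net."""
    return sum(abs(read(a) - read(b)) for a, b in nets)

def try_swap(pos, nets, a, b):
    txn = Txn(pos)
    old = cost(txn.read, nets)
    pa, pb = txn.read(a), txn.read(b)
    txn.write(a, pb)
    txn.write(b, pa)
    if cost(txn.read, nets) < old:
        txn.commit()                     # swap ACCEPTED -> transaction COMMIT
        return True
    txn.abort()                          # swap REJECTED -> forced ABORT, no undo code
    return False

pos = {"A": 0, "B": 1, "C": 3, "D": 2}   # 1-D positions for brevity
nets = [("A", "C"), ("B", "D")]
accepted = try_swap(pos, nets, "B", "C")  # cost 4 -> 2, so it commits
rejected = try_swap(pos, nets, "A", "B")  # cost 2 -> 4, so it aborts
```

After the rejected swap, `pos` is untouched without any explicit restore step, because the tentative writes never left the transaction's buffer.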


Impact on Abort Rate

[Figure: abort rate in two panels, "Standard Optimizations" and "Privatization and Leveraging"]

significant decrease in abort rate


Performance of Privatization and Leveraging TM

[Figure: performance plots; the label "deg." appears twice]

improved QoR deg: max 35% to 8%, avg 7% to 2%


Even More Optimization: Ignoring Large Nets

[Figure: two panels, "Privatization and Leveraging" and "Ignore Large Nets"]

improves abort rate, little impact on QoR
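One plausible reading of why ignoring large nets helps: a net with many pins is touched by many different swaps, so it makes otherwise unrelated transactions overlap. The sketch below illustrates that reading only. The `max_pins` threshold, the function name `swap_footprint`, and the example nets are assumptions, and the slides do not state how the actual cutoff was chosen.

```python
def swap_footprint(swap, nets, max_pins=None):
    """Net indices a swap must read or update; optionally skip large nets."""
    moved = set(swap)
    return {
        i
        for i, net in enumerate(nets)
        if moved & set(net) and (max_pins is None or len(net) <= max_pins)
    }

# The last net is "large": it connects every block.
nets = [("A", "B"), ("C", "D"), ("A", "B", "C", "D", "E", "F")]

# Counting all nets, the two swaps share net 2 and would conflict.
full_1 = swap_footprint(("A", "B"), nets)
full_2 = swap_footprint(("C", "D"), nets)

# Ignoring nets with more than 4 pins, their footprints are disjoint.
small_1 = swap_footprint(("A", "B"), nets, max_pins=4)
small_2 = swap_footprint(("C", "D"), nets, max_pins=4)
```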


Evaluating Scaling

[Figure: two panels, "Relative to Single Thread STM" and "Single Thread STM vs. Sequential"; one label reads "(estimated)"]


Conclusions

• Parallel placement via STM

– good algorithmic fit (accept/reject -> commit/abort)

– speedup poor due to overheads, scaling good, need HTM!

• FPGA community:

– should pay attention to TM, especially HTM

– TM offers fast & correct parallelization, focus on performance

– algorithms can be modified to better exploit TM (ignoring nets)

• Systems/TM community:

– need API for forced abort, ordered transactions
