Parallelizing FPGA Placement with Transactional Memory

Steven Birk*, Greg Steffan**, and Jason Anderson**

*CS Department / **ECE Department

University of Toronto


Implications of Moore’s Law

need for parallel CAD is intensifying

[Figure: transistor counts by year (1995, 2000, 2005, 2010) for CPUs and FPGAs, with CAD complexity rising alongside. CPUs: Pentium II 7.5m, PIV 42m, Core 2 Duo 291m, Core i7 Quad 731m. FPGAs: 70m, 350m, 1.1b, 2.5b.]


Parallelizing CAD Software

• The focus of this talk:

– simulated-annealing-based placement

key algorithm in FPGA CAD


Simulated Annealing Placement: Basic Idea

Algorithm:

1) Start with random placement of blocks

2) Randomly pick a pair of blocks to swap

3) Keep new placement if an improvement

…

[Figure: blocks A, B, C, D connected by nets, shown before and after swapping A and B]
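The three steps listed on this slide can be sketched as follows. This is a minimal illustration, not VPR's implementation: the half-perimeter wirelength cost, the `grid` size, and the optional `temperature` parameter (the textbook rule for occasionally accepting a worse placement, which the slide does not spell out) are all assumptions.

```python
import math
import random

def wirelength(placement, nets):
    """Assumed cost: sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(blocks, nets, grid, moves=2000, temperature=0.0, seed=0):
    rng = random.Random(seed)
    # 1) Start with a random placement of blocks on distinct grid sites.
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))
    cost = wirelength(placement, nets)
    for _ in range(moves):
        # 2) Randomly pick a pair of blocks to swap.
        a, b = rng.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        delta = new_cost - cost
        # 3) Keep the new placement if it is an improvement. With a
        #    positive temperature (an assumption, not from the slide),
        #    a worse placement is sometimes kept as well.
        accept = delta < 0 or (
            temperature > 0 and rng.random() < math.exp(-delta / temperature)
        )
        if accept:
            cost = new_cost
        else:
            # Rejected: undo the swap by hand.
            placement[a], placement[b] = placement[b], placement[a]
    return placement, cost
```

With `temperature=0.0` the loop is purely greedy, matching the three steps as written; the hand-written undo in the rejected branch is the part that later slides replace with a transaction abort.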


Potential Parallelism: the Intuition

[Figure: three panels showing moves on blocks A, B, C, D: "Single-Threaded" (Thread 1), "Parallel Moves (success)" (Thread 1, Thread 2), "Parallel Moves (failure)" (Thread 1, Thread 2)]

parallelism when blocks/nets are disjoint

nice match to Transactional Memory
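The disjointness condition can be made concrete with a small check. This is only a sketch: `nets_of` (a map from each block to the set of nets it connects to) is a hypothetical data structure, and a TM system would discover the overlap dynamically at run time rather than testing for it up front.

```python
def touched(swap, nets_of):
    """Blocks and nets that a swap of two blocks reads or writes."""
    blocks = set(swap)
    nets = set()
    for block in swap:
        nets |= nets_of[block]
    return blocks, nets

def can_run_in_parallel(swap1, swap2, nets_of):
    """True when the two swaps share no block and no net."""
    blocks1, nets1 = touched(swap1, nets_of)
    blocks2, nets2 = touched(swap2, nets_of)
    return blocks1.isdisjoint(blocks2) and nets1.isdisjoint(nets2)
```

Two swaps that pass this check correspond to the "success" panel; two swaps that fail it correspond to the "failure" panel, where one transaction would have to abort.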


Transactional Memory (TM): the Basic Idea

Source Code:

    ...
    atomic {
        ...
        access_shared_data();
        ...
    }
    ...

Programmer:

Specifies transactions in source code

TM System:

Executes transactions optimistically in parallel

1) Checkpoints execution

2) Detects conflicts

3) Commits or aborts and re-executes

[Figure: transactions running in parallel; a conflicting transaction is marked "abort!"]

Exploits available parallelism while maintaining correctness!
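To make the three TM-system steps tangible, here is a toy software TM. It is a sketch under stated assumptions, not tinySTM or any real library: `TVar`, `Transaction`, and `atomic` are hypothetical names, writes are buffered (so "checkpointing" reduces to restarting the body), conflicts are detected by comparing version numbers at commit time, and a single global lock serializes commits.

```python
import threading

class Conflict(Exception):
    """Signals that a value read by the transaction changed before commit."""

class TVar:
    """A shared cell with a version counter bumped on every committed write."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.reads = {}    # TVar -> version seen at first read
        self.writes = {}   # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        # Record the version before reading the value, so a concurrent
        # commit in between is caught by validation.
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value   # buffered: nothing to undo on abort

    def commit(self):
        with _commit_lock:
            # 2) Detect conflicts: has anything we read been overwritten?
            for tvar, seen in self.reads.items():
                if tvar.version != seen:
                    raise Conflict()
            # 3) Commit: publish the buffered writes.
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1

def atomic(body):
    """Run body(tx) optimistically; on conflict, abort and re-execute."""
    while True:
        tx = Transaction()          # 1) fresh start point for each attempt
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue
```

A caller would write something like `atomic(lambda tx: tx.write(counter, tx.read(counter) + 1))`, which plays the role of the `atomic { ... }` block in the slide's source code.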


TM Implementations

• Software TM (STM)

– compiler or library based

– works on current multicores, but high overheads

– Java: DSTM, ASTM

– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM (this work)

• Hardware TM (HTM)

– more automatic, low overhead, limited transaction size

– commercial systems don’t exist yet

– Stanford’s TCC, Wisconsin’s LogTM, SUN’s ROCK

STM has high overhead, no HTMs (yet)


Goals of this Work

• Parallelize simulated-annealing placement

– using software transactional memory (tinySTM)

– demonstrate the potential for good scaling

– not expecting great speedup due to the overheads of STM

• For the FPGA community

– evaluate potential for easier parallelization via TM

– suggest CAD algorithm changes to capitalize on TM

• For the systems/TM community

– lessons from a real application

– TM feature wish-list


Methodology

• CAD SW: Versatile Place and Route (VPR) 5.0

– available at www.eecg.toronto.edu/vpr

• Benchmark circuits: provided by VPR

– sizes ranging from: 67-6000 blocks, 100-60000 nets

– target architecture: 4 LUTs, cluster size 10

• STM: tinySTM

– available at www.tinystm.org

• Platform: 8 CPUs

– 2 x Quad-Core Intel Xeon E5345 @ 2.33 GHz


Challenges: Non-Determinism & Measurement

• Our initial implementation is non-deterministic

– however, a deterministic version is possible (see paper)

• Non-determinism makes measurement difficult

– different numbers of threads -> different work/results

• Solution: consider both runtime & quality-of-result (QoR)

– QoR: worst-case critical path delay

can trade off runtime and QoR


The Parallelization Story


First Parallelization Attempt

• Fast: one student-month

– includes time to get familiar with tinySTM, VPR code

– very few code changes

– produced correct results very quickly

– no deadlocks or data races

• Standard parallelism optimizations:

– reductions: e.g. success_sum += 1

– scheduling: move unnecessary code out of transactions

additional effort devoted to improving perf.
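The reduction optimization can be illustrated like this: rather than having every thread update one shared `success_sum` inside its transactions (which would make every pair of transactions conflict), each thread accumulates privately and the partial sums are combined once at the end. The function name and the shape of the input are assumptions made for the sketch.

```python
import threading

def count_successes(outcomes_per_thread):
    """Sum accepted swaps across threads using per-thread partial counts."""
    partial = [0] * len(outcomes_per_thread)   # one private slot per thread

    def worker(tid, outcomes):
        local = 0
        for accepted in outcomes:
            if accepted:
                local += 1      # private update: cannot conflict with others
        partial[tid] = local    # each thread writes only its own slot

    threads = [
        threading.Thread(target=worker, args=(tid, outcomes))
        for tid, outcomes in enumerate(outcomes_per_thread)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)         # the reduction step, done once
```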


Performance (avg all benchmark circuits)

[Figure: performance plot including QoR degradation (deg.)]

high QoR degradation (30%), high abort rate (60%)


More Optimization: Reduce Aborts

• Use feedback to identify causes of aborts

– 80% of aborts caused by accesses to x_lookup[]

• array used to locate 2nd block in a swap

– interesting: not used by “I/O” type blocks

• Interesting resulting behavior: “favoritism”

– system favors swapping I/O blocks

• I/O block swaps have much shorter txns, no conflicts

– only one non-I/O block swapping at a time

• others conflict immediately on x_lookup[]

– intuition: causing QoR degradation, ‘false’ speedup

solution: privatize x_lookup[]
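Privatization here means giving each thread its own copy of the scratch array so that accesses to it no longer register as conflicts. The sketch below assumes `x_lookup` is pure per-swap scratch space (the slides say only that it is used to locate the second block); `private_x_lookup` and `fill_concurrently` are hypothetical helpers, not VPR functions.

```python
import threading

_private = threading.local()

def private_x_lookup(size):
    """Return this thread's own scratch array, creating it on first use."""
    buf = getattr(_private, "x_lookup", None)
    if buf is None or len(buf) != size:
        buf = [0] * size
        _private.x_lookup = buf
    return buf

def fill_concurrently(num_threads=2, size=4):
    """Each thread fills its private array; no thread sees another's writes."""
    barrier = threading.Barrier(num_threads)
    result = {}

    def worker(tid):
        buf = private_x_lookup(size)
        for i in range(size):
            buf[i] = tid
        barrier.wait()      # all threads have written before any reads back
        result[tid] = buf

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Because the array is no longer shared, a non-I/O swap stops conflicting with every other non-I/O swap, which is the behavior the slide identifies as the source of "favoritism".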


Transactions and Swaps: Terminology

• Swaps

– ACCEPTED or REJECTED

• Transactions

– COMMIT or ABORT

[Figure: blocks A and B shown before and after a swap]


More Optimization: Leveraging TM

• VPR code implements commit/abort

– directly modifies placement data structures

– undoes modifications if swap is rejected

• TM implements commit/abort, hence optimize:

– delete VPR code for undoing rejected swaps

– force transaction to abort if swap is rejected

requires API for forcing a transaction to abort
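The idea of mapping a rejected swap onto a forced abort can be sketched as follows. This is a single-threaded illustration with hypothetical names (`Tx`, `ForcedAbort`, `run_transaction`, `try_swap`); it is not tinySTM's API, and conflict detection is left out so that only the buffered-write and abort mechanics are visible.

```python
class ForcedAbort(Exception):
    """Raised inside a transaction to discard its buffered writes."""

class Tx:
    def __init__(self, store):
        self.store = store
        self.writes = {}            # buffered updates, invisible until commit

    def read(self, key):
        return self.writes.get(key, self.store[key])

    def write(self, key, value):
        self.writes[key] = value

    def abort(self):
        raise ForcedAbort()

def run_transaction(store, body):
    """Run body(tx); commit its writes unless it forces an abort."""
    tx = Tx(store)
    try:
        body(tx)
    except ForcedAbort:
        return False                # writes dropped, store untouched
    store.update(tx.writes)         # commit
    return True

def try_swap(placement, a, b, cost_fn):
    """Swap blocks a and b; a REJECTED swap becomes an ABORT."""
    def body(tx):
        old_cost = cost_fn(tx.read)
        pos_a, pos_b = tx.read(a), tx.read(b)
        tx.write(a, pos_b)
        tx.write(b, pos_a)
        if cost_fn(tx.read) >= old_cost:
            tx.abort()              # no hand-written undo code needed
    return run_transaction(placement, body)
```

Here `cost_fn` takes a lookup function from block to position, so the same cost code works on both the committed placement and the transaction's tentative view.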


Impact on Abort Rate

[Figure: abort rate, two panels: "Standard Optimizations" and "Privatization and Leveraging"]

significant decrease in abort rate


Performance of Privatization and Leveraging TM

[Figure: performance plots including QoR degradation (deg.)]

improved QoR deg: max 35% to 8%, avg 7% to 2%


Even More Optimization: Ignoring Large Nets

[Figure: two panels: "Privatization and Leveraging" and "Ignore Large Nets"]

improves abort rate, little impact on QoR
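One way to picture why ignoring large nets lowers the abort rate: a high-fanout net is touched by many blocks, so almost any two swaps share it. The sketch below filters such nets out of the set a swap must update. The `max_pins` threshold and the `nets_of` / `net_pins` maps are hypothetical; the slides do not state the cutoff or the exact mechanism used.

```python
def nets_to_update(block, nets_of, net_pins, max_pins):
    """Nets a swap of `block` must touch, skipping nets above max_pins pins."""
    return {n for n in nets_of[block] if len(net_pins[n]) <= max_pins}

def swaps_conflict(block1, block2, nets_of, net_pins, max_pins):
    """True when the two moved blocks still share a net after filtering."""
    nets1 = nets_to_update(block1, nets_of, net_pins, max_pins)
    nets2 = nets_to_update(block2, nets_of, net_pins, max_pins)
    return not nets1.isdisjoint(nets2)
```

With a generous threshold, two blocks that share only a clock-like net conflict; with a lower threshold the shared net is ignored and the two moves become independent.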


Evaluating Scaling

[Figure: scaling results. Panel labels: "Relative to Single Thread STM", "(estimated)", "Single Thread STM vs. Sequential"]


Conclusions

• Parallel placement via STM

– good algorithmic fit (accept/reject -> commit/abort)

– speedup poor due to overheads, scaling good, need HTM!

• FPGA community:

– should pay attention to TM, especially HTM

– TM offers fast & correct parallelization, focus on performance

– algorithms can be modified to better exploit TM (ignoring nets)

• Systems/TM community:

– need API for forced abort, ordered transactions
