
Just-In-Time Software Pipelining

Hongbo Rong, Hyunchul Park, Youfeng Wu, Cheng Wang

Programming Systems Lab, Intel Labs, Santa Clara

What is software pipelining?

A loop optimization exposing instruction-level parallelism (ILP):

for (i = 0; i < N; i++) {
  a: x      = y + 1
  b: y      = A[i] + x
  c: B[i+2] = B[i] * x
  d: A[i+2] = B[i+2]
}

[Figure: dependence graph over operations a, b, c, d, showing local dependences and loop-carried dependences annotated with their distances <1> and <2>]

[Figure: the example loop pipelined across iterations 0-3. Each iteration is split into 2 stages; stage {a, b, c} of iteration i runs in parallel with stage {d} of iteration i-1. Initiation interval (II) = 2, and the steady-state kernel spans 2 stages.]

• Different iterations work on different stages in parallel

• II is the performance indicator (the smaller, the better)
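For intuition, here is a hedged C-level sketch of the shape the pipelined loop takes, assuming the 2-stage partition shown above ({a, b, c} in stage 0, {d} in stage 1). It is only a source-level illustration of the prologue/kernel/epilogue structure; the JIT itself transforms x86 micro-ops, not C.

#include <stddef.h>

/* Sketch only: a source-level view of the pipelined example loop,
 * assuming stage 0 = {a, b, c} and stage 1 = {d} with II = 2.
 * Arrays A and B are assumed to hold at least N+2 elements. */
void pipelined(size_t N, double A[], double B[], double x, double y)
{
    if (N == 0) return;
    /* prologue: stage 0 of iteration 0 */
    x = y + 1;                      /* a(0) */
    y = A[0] + x;                   /* b(0) */
    B[2] = B[0] * x;                /* c(0) */
    /* kernel: stage 1 of iteration i-1 overlaps stage 0 of iteration i */
    for (size_t i = 1; i < N; i++) {
        A[i + 1] = B[i + 1];        /* d(i-1) */
        x = y + 1;                  /* a(i)   */
        y = A[i] + x;               /* b(i)   */
        B[i + 2] = B[i] * x;        /* c(i)   */
    }
    /* epilogue: stage 1 of the last iteration */
    A[N + 1] = B[N + 1];            /* d(N-1) */
}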

Software pipelining has been static

• Extensively studied for three decades, and efficient for wide-issue architectures
  – VLIW [Lam 1988]
  – Superscalar [Ruttenberg et al. 1996]

• It is seen only in static compilers

• Most work aims to minimize II, not compile overhead

It is time now to extend software pipelining to dynamic compilers!

• Dynamic languages are increasingly popular
  – JavaScript and PHP are used by 88.9% of client and 81.5% of server websites (W3Techs)

• Huge amount of legacy code
  – Small optimization scope: a single loop iteration
  – Software pipelining enlarges the scope to many iterations

• Minimizing compile overhead must be the first objective
  – Only simple, fast algorithms can be used
  – Linear-time algorithms are preferred

Challenges

• Memory aliases kill parallelism
  – Hardware: atomic region + rotating alias registers [MICRO-46]

• Costly rollback
  – Software: light-weight checkpointing

• Scheduling is expensive

Original optimization scope:

for (i = 0; i < N; i++) {
  a b c d
}

Atomic region (M iterations per region):

for (j = 0; j < N; j += M) {
  for (i = j; i < j+M; i++) {
    a b c d
  }
}

Framework (on Transmeta CMS)

[Figure: translation pipeline: x86 binary → interpreter & profiler → hot region optimizations → acyclic scheduler → assembler → code cache]

[Figure: the software pipeliner is added inside the hot region optimizations as loop selection → initialization → scheduling → rotating alias register allocation → code generation, relying on the rotating alias register file and atomic regions in hardware; loops not selected for pipelining go through the acyclic scheduler instead]

Scheduling is expensive

• Finding an optimal schedule is NP-complete [Colland et al. 1996]

• O(V^3) at least, exponential at worst [Rau et al. 1992]
  – V: number of operations

• Can we linearize software pipelining?

Keep scheduling time under control

• Schedule in linear time
  – Use simple, fast sub-algorithms
    • Avoid cubic or exponential complexity
  – For iterative sub-algorithms, set a threshold: the maximum number of iterations allowed
    • The smaller the threshold, the lower the compile overhead
    • Once the threshold is exceeded, abort software pipelining (sketched below)
    • Key question: how small can the threshold be?

• Find a good-enough schedule
  – No backtracking
  – Priority function: approximate and never updated
  – Separate dependence and resource constraints
  – Separate local and loop-carried dependences

• Iteratively improve the schedule
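A minimal C sketch of the threshold pattern described above, assuming a generic iterative sub-algorithm exposed as a hypothetical one-pass callback; the names and signature are illustrative, not the paper's API.

#include <stdbool.h>

/* Run an iterative sub-algorithm for at most 'threshold' passes.
 * 'one_pass' is a hypothetical callback that performs one pass and
 * returns false once nothing changed (i.e., the algorithm converged). */
bool run_bounded(bool (*one_pass)(void *state), void *state, int threshold)
{
    for (int pass = 0; pass < threshold; pass++)
        if (!one_pass(state))
            return true;    /* converged within the budget */
    return false;           /* budget exceeded: abort software pipelining */
}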

Just-In-Time Software Pipelining

[Figure: phase diagram: Prepartition (bounded by the thresholds H and B) → Local scheduling → Kernel expansion → Rotation (applied at most 3 times) → Exit]

• Prepartition: quickly creates an initial schedule to start with

• Local scheduling: handles local dependences and resources

• Kernel expansion: adjusts the schedule for loop-carried dependences

• Rotation: iteratively improves the schedule

• Time complexity: O(V+E)
  – V: #operations, E: #dependences

Illustration

[Figure: Prepartition: operations a, b, c, d from iterations 0-3 are prepartitioned into an initial kernel whose length is RecMII, taking local dependences, resources, and loop-carried dependences into account]

Calculating RecMII

• Recurrence Minimum II
  – The II lower bound determined by the biggest dependence cycles
  – Needed by almost every software pipelining method

• Turn the problem into a Markov decision process
  – The first time the Howard algorithm is applied to pipelining

• Linearize Howard with a small constant H

• A by-product: the critical operations

Complexity: conventional O(V^3); Howard policy iteration O(exponential * E); linearized Howard O(H * E)
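For reference, RecMII is conventionally defined over the dependence cycles of the loop:

RecMII = max over all dependence cycles C of ceil( (sum of latencies along C) / (sum of dependence distances along C) )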

Divide operations into stages

• Also an iterative sub-algorithm: Bellman-Ford, O(V*E)

• Linearize it with a constant B: O(B*E)

• Specific order in visiting edges (sketched below)
  – Scan nodes in sequential order, and visit their incoming edges
  – Values are propagated along local edges in the first pass
  – Values are propagated along loop-carried edges in the second pass
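A self-contained C sketch of this bounded relaxation on the example loop's four operations. The latencies, dependence distances, and II below are assumed values for illustration only, not taken from the paper; edge weights use the usual modulo-scheduling form latency(e) - II*distance(e).

#include <stdbool.h>
#include <stdio.h>

#define NOPS 4          /* operations a, b, c, d from the example loop */
#define NEDGES 5
#define B_THRESHOLD 5   /* maximum relaxation passes before aborting */

struct edge { int src, dst, latency, distance; };

int main(void)
{
    /* Assumed dependences: a->b, a->c, c->d local; b->a, d->b loop-carried. */
    struct edge e[NEDGES] = {
        {0, 1, 1, 0}, {0, 2, 1, 0}, {2, 3, 1, 0},   /* local          */
        {1, 0, 1, 1}, {3, 1, 1, 2},                 /* loop-carried   */
    };
    int II = 2, t[NOPS] = {0};

    for (int pass = 0; pass < B_THRESHOLD; pass++) {
        bool changed = false;
        for (int v = 0; v < NOPS; v++)              /* nodes in program order */
            for (int k = 0; k < NEDGES; k++)        /* their incoming edges   */
                if (e[k].dst == v) {
                    int lb = t[e[k].src] + e[k].latency - II * e[k].distance;
                    if (t[v] < lb) { t[v] = lb; changed = true; }
                }
        if (!changed) {                             /* converged within budget */
            for (int v = 0; v < NOPS; v++)
                printf("op %c: time %d, stage %d\n", 'a' + v, t[v], t[v] / II);
            return 0;
        }
    }
    return 1;                                       /* exceeded B: abort SWP  */
}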

Illustration (cont.)

[Figure: Local scheduling: within the prepartitioned kernel, each iteration's operations are rescheduled to satisfy local dependences and resource constraints; the loop-carried dependences marked X in the original figure are still violated at this point]

Local scheduling

• Any local scheduling algorithm can be used
  – E.g., list scheduling (sketched below)

• Weakness:
  – Loop-carried dependences may be violated

• To reduce the chance of violation:
  – Before scheduling, the priority function considers loop-carried dependences in advance
  – Prioritize critical operations
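A minimal, self-contained list-scheduling sketch over the example loop's local dependences, of the kind this phase could use. The 2-wide issue limit, unit latencies, and priority numbers are assumptions for illustration; in the JIT the priority would be precomputed once (folding in loop-carried dependences and critical operations) and never updated.

#include <stdio.h>

#define NOPS 4
#define WIDTH 2                 /* assumed: at most 2 ops issue per cycle */

int main(void)
{
    /* Local dependences of the example loop: a->b, a->c, c->d, latency 1. */
    int dep[NOPS][NOPS] = {{0}};
    dep[0][1] = dep[0][2] = dep[2][3] = 1;

    int priority[NOPS] = {3, 1, 2, 0};   /* assumed: higher = schedule earlier */
    int cycle[NOPS], done[NOPS] = {0}, scheduled = 0;

    for (int t = 0; scheduled < NOPS; t++) {
        int issued = 0;
        while (issued < WIDTH) {
            int best = -1;
            for (int v = 0; v < NOPS; v++) {        /* pick ready op, best prio */
                if (done[v]) continue;
                int ready = 1;
                for (int u = 0; u < NOPS; u++)
                    if (dep[u][v] && (!done[u] || cycle[u] + dep[u][v] > t))
                        ready = 0;
                if (ready && (best < 0 || priority[v] > priority[best]))
                    best = v;
            }
            if (best < 0) break;                    /* nothing ready this cycle */
            cycle[best] = t; done[best] = 1;
            scheduled++; issued++;
        }
    }
    for (int v = 0; v < NOPS; v++)
        printf("op %c -> cycle %d\n", 'a' + v, cycle[v]);
    return 0;
}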

Illustration (cont.)

[Figure: Kernel expansion: the schedule is adjusted so that the violated loop-carried dependences are satisfied; Rotation: the schedule is then iteratively improved, yielding the final kernel]

Experiments

• Transmeta CMS on SPEC2k traces

• Functional simulator, to comprehensively
  – Explore the thresholds H and B
  – Evaluate compile overhead and schedule quality

• Cycle-accurate simulator
  – Simulates cache misses, latencies, ...
  – Initial performance study

Linearized Howard vs. exponential backoff + binary search [Rau et al. 1992]

[Figure: per-loop comparison plotted against the number of dependences (0 to 700), y axis 0.1 to 1000, log scale]

• Average 9x faster; peak 812x faster

• H ≤ 3 for 96% of 11,992 loops

• H ≤ 14 for all loops

Linearized Bellman-Ford

[Figure: percentage of the total loops converging within B = 2, 3, 4, 5 passes]

• B ≤ 3 for 98.8% of 11,992 loops

• B ≤ 5 for all the loops

• From now on, we set H = 10 and B = 5, so 11,910 loops are scheduled

Scheduling overhead & schedule quality

[Figure: two bar charts comparing RS2, DESP, and JITSP: scheduling overhead (normalized) and the percentage of loops with optimal schedules; reported values include overheads of 1X and 12X and optimal-schedule rates of 13%, 23%, 25%, and 95%]

• JITSP achieves optimal schedules for 95% of the loops

Compile overhead distribution

[Figure: pie chart: hot region optimizations 34%, software pipelining 33%, acyclic scheduler 27%, assembler 6%]

Note: the acyclic scheduler handles acyclic code, or loops NOT selected for software pipelining.

Preliminary performance

• 40 hot loops

• The architecture has a bottleneck in registers
  – 2/7/24/32 predicate/static-alias/integer/floating-point registers available for pipelining

[Figure: pie chart of the 40 hot loops: successfully generated code 25%, filtered 19%, too many registers 55%, too many unrolls 1%]

Preliminary performance (cont.)

[Figure: II/MII per loop (the lower, the better), 10 loops, for RS2, DESP, and JITSP, with the optimal ratio at 1.00]

• JITSP achieves optimal schedules for all but one of the loops

Preliminary performance (cont.)

[Figure: per-loop speedup (-40% to +40%) of RS2, DESP, and JITSP over the 10 loops]

• JITSP: 10~36% speedup, better than the others

• Exception: loop 2 has an optimal schedule but slows down due to memory aliases; retranslation is needed

• Speedups: swim (5.3%), ammp (4.4%), mcf (3.6%)

Conclusion

• The first linear software pipelining algorithm implemented for dynamic compilers

• Turns a traditionally expensive optimization into linear time, O(V+E)
  – Takes advantage of results from various domains:
    • Hardware circuit design (retiming)
    • Stochastic control (Howard algorithm)
    • Graph algorithms (Bellman-Ford)
    • Software pipelining (rotation scheduling and DESP)

• Generates optimal or near-optimal schedules with reasonable compile overhead

Future work

• Register availability
  – Add more architectural registers
  – Make the algorithm register-pressure-aware

• Implementation
  – Loop selection
  – Re-translation

• Evaluation
  – Benchmarks

Backup slides


Howard algorithm

[Figure: the dependence graph of operations a, b, c, d, with one policy edge selected per node]

• Each operation is rewarded for reaching a cycle via one policy edge
  – The bigger the cycle, the more the reward

Distribution statistics of JITSP
