Compiler Optimisation - 6 Instruction Scheduling


Compiler Optimisation 6 – Instruction Scheduling

Hugh Leather (IF 1.18a)

hleather@inf.ed.ac.uk

Institute for Computing Systems Architecture, School of Informatics

University of Edinburgh

2019

Introduction

This lecture:
- Scheduling to hide latency and exploit ILP
- Dependence graph
- Local list scheduling + priorities
- Forward versus backward scheduling
- Software pipelining of loops

Latency, functional units, and ILP

- Instructions take clock cycles to execute (latency)
- Modern machines issue several operations per cycle
- Cannot use results until ready, but can do something else meanwhile
- Execution time is order-dependent
- Latencies are not always constant (cache misses, early exit, etc)

Operation           | Cycles
load, store         | 3
load ∉ cache        | 100s
loadI, add, shift   | 1
mult                | 2
div                 | 40
branch              | 0 – 8

Machine types

In order: deep pipelining allows multiple instructions in flight

Superscalar: multiple functional units, can issue > 1 instruction per cycle

Out of order: large window of instructions can be reordered dynamically

VLIW: compiler statically allocates operations to FUs


Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Simple schedule¹: a := 2*a*b*c

Cycle | Operations             | Operands waiting
1     | loadAI rarp, @a ⇒ r1   | r1
2     |                        | r1
3     |                        | r1
4     | add r1, r1 ⇒ r1        | r1
5     | loadAI rarp, @b ⇒ r2   | r2
6     |                        | r2
7     |                        | r2
8     | mult r1, r2 ⇒ r1       | r1
9     | loadAI rarp, @c ⇒ r2   | r1, r2
10    |                        | r2
11    |                        | r2
12    | mult r1, r2 ⇒ r1       | r1
13    |                        | r1
14    | storeAI r1 ⇒ rarp, @a  | store to complete
15    |                        | store to complete
16    |                        | store to complete

Done.

¹ loads/stores 3 cycles, mults 2, adds 1


Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Schedule loads early²: a := 2*a*b*c

Cycle | Operations             | Operands waiting
1     | loadAI rarp, @a ⇒ r1   | r1
2     | loadAI rarp, @b ⇒ r2   | r1, r2
3     | loadAI rarp, @c ⇒ r3   | r1, r2, r3
4     | add r1, r1 ⇒ r1        | r1, r2, r3
5     | mult r1, r2 ⇒ r1       | r1, r3
6     |                        | r1
7     | mult r1, r3 ⇒ r1       | r1
8     |                        | r1
9     | storeAI r1 ⇒ rarp, @a  | store to complete
10    |                        | store to complete
11    |                        | store to complete

Done. Uses one more register.

11 versus 16 cycles – 31% faster!

² loads/stores 3 cycles, mults 2, adds 1
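The two tables above can be reproduced with a small simulator. This is a sketch under simplifying assumptions: a single-issue, in-order machine that stalls until an instruction's source registers are ready, with the footnote's latencies (loads/stores 3, mult 2, add 1).

```python
# Sketch: single-issue, in-order pipeline model. An op issues in the first
# cycle at which all of its source registers are available; its result
# becomes available `latency` cycles after issue. The store's last busy
# cycle is the total time, matching the tables above.

LAT = {"load": 3, "store": 3, "mult": 2, "add": 1}

def total_cycles(ops):
    """ops: list of (opcode, srcs, dst). Returns the last busy cycle."""
    avail = {}       # register -> cycle its value becomes usable
    cycle = 0        # cycle of the most recent issue
    finish = 0
    for opcode, srcs, dst in ops:
        # earliest issue: one op per cycle, and all operands must be ready
        issue = max(cycle + 1, *(avail.get(r, 1) for r in srcs)) if srcs else cycle + 1
        cycle = issue
        finish = max(finish, issue + LAT[opcode] - 1)
        if dst:
            avail[dst] = issue + LAT[opcode]
    return finish

simple = [
    ("load", [], "r1"), ("add", ["r1"], "r1"),
    ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("store", ["r1"], None),
]
early = [
    ("load", [], "r1"), ("load", [], "r2"), ("load", [], "r3"),
    ("add", ["r1"], "r1"), ("mult", ["r1", "r2"], "r1"),
    ("mult", ["r1", "r3"], "r1"), ("store", ["r1"], None),
]
print(total_cycles(simple), total_cycles(early))  # 16 11
```

The same reordering, same seven operations: only the issue order changes, and the stalls behind the loads disappear.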

Scheduling problem

A schedule maps operations to cycles: ∀a ∈ Ops, S(a) ∈ ℕ

Respect latency: ∀a, b ∈ Ops, a depends on b ⟹ S(a) ≥ S(b) + λ(b)

Respect functional units: no more ops per type per cycle than the FUs can handle

Length of schedule: L(S) = max_{a ∈ Ops} (S(a) + λ(a))

Schedule S is time-optimal if ∀S₁, L(S) ≤ L(S₁)

Problem: Find a time-optimal schedule³

Even local scheduling with many restrictions is NP-complete

³ A schedule might also be optimal in terms of registers, power, or space
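These constraints translate directly into code. A minimal sketch (FU limits omitted for brevity; the op names and latencies in the example are illustrative):

```python
# Sketch: validate a schedule S against the latency constraint above and
# compute its length L(S). `deps` maps an op to the ops it depends on;
# `lat` gives each op's latency. FU limits are left out for brevity.

def check_and_length(S, deps, lat):
    for a, preds in deps.items():
        for b in preds:
            # respect latency: S(a) >= S(b) + lambda(b)
            assert S[a] >= S[b] + lat[b], f"{a} starts before {b} finishes"
    return max(S[a] + lat[a] for a in S)   # L(S)

# Hypothetical 3-op chain: load (3 cycles) -> add (1) -> store (3)
lat  = {"ld": 3, "add": 1, "st": 3}
deps = {"ld": [], "add": ["ld"], "st": ["add"]}
S    = {"ld": 1, "add": 4, "st": 5}
print(check_and_length(S, deps, lat))  # 8
```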

List scheduling

Local greedy heuristic to produce schedules for single basic blocks:
1. Rename to avoid anti-dependences
2. Build dependency graph
3. Prioritise operations
4. For each cycle:
   1. Choose the highest-priority ready operation and schedule it
   2. Update the ready queue


List scheduling
Dependence/Precedence graph

- Schedule an operation only when its operands are ready
- Build dependency graph of read-after-write (RAW) dependences
- Label with latency and FU requirements
- Anti-dependences (WAR) restrict movement – renaming removes them

Example: a = 2*a*b*c
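The RAW graph for straight-line code can be sketched by tracking the last writer of each register. The three-address tuple format and op names here are illustrative:

```python
# Sketch: build RAW edges for a basic block. Each op is (name, srcs, dst).
# An op depends on the most recent writer of each register it reads.
# WAR/WAW hazards are assumed renamed away, as on the slide.

def build_raw_graph(ops):
    last_writer = {}                        # register -> name of last writer
    deps = {name: set() for name, _, _ in ops}
    for name, srcs, dst in ops:
        for r in srcs:
            if r in last_writer:            # read-after-write edge
                deps[name].add(last_writer[r])
        if dst:
            last_writer[dst] = name
    return deps

# a = 2*a*b*c, with loads renamed into r1, r2, r3
block = [
    ("ld_a", [], "r1"), ("ld_b", [], "r2"), ("ld_c", [], "r3"),
    ("add",  ["r1", "r1"], "r1"),
    ("mul1", ["r1", "r2"], "r1"),
    ("mul2", ["r1", "r3"], "r1"),
    ("st_a", ["r1"], None),
]
print(sorted(build_raw_graph(block)["mul2"]))  # ['ld_c', 'mul1']
```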

List scheduling

List scheduling algorithm

Cycle ← 1
Ready ← leaves of D
Active ← ∅
while (Ready ∪ Active ≠ ∅)
    ∀a ∈ Active where S(a) + λ(a) ≤ Cycle
        Active ← Active - a
        ∀b ∈ succs(a) where isready(b)
            Ready ← Ready ∪ b
    if ∃a ∈ Ready such that ∀b ∈ Ready, priority(a) ≥ priority(b)
        Ready ← Ready - a
        S(a) ← Cycle
        Active ← Active ∪ a
    Cycle ← Cycle + 1
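The loop above translates almost line for line into Python. This sketch assumes a single uniform FU (issue width 1) and takes the priority map as a parameter; the example data is the a = 2*a*b*c block with critical-path priorities:

```python
# Sketch of the list scheduling loop: one op issued per cycle (a single
# uniform FU is assumed), highest-priority ready op first.
# deps[a] = ops a depends on; lat[a] = latency; prio[a] = priority.

def list_schedule(deps, lat, prio):
    succs = {a: [] for a in deps}
    for a, preds in deps.items():
        for b in preds:
            succs[b].append(a)
    waiting = {a: len(p) for a, p in deps.items()}   # unretired predecessors
    ready = {a for a, n in waiting.items() if n == 0}  # leaves of D
    active, S, cycle = set(), {}, 1
    while ready or active:
        for a in list(active):                  # retire finished ops
            if S[a] + lat[a] <= cycle:
                active.remove(a)
                for b in succs[a]:
                    waiting[b] -= 1
                    if waiting[b] == 0:         # all operands now ready
                        ready.add(b)
        if ready:                               # issue best ready op
            a = max(ready, key=lambda x: prio[x])
            ready.remove(a)
            S[a] = cycle
            active.add(a)
        cycle += 1
    return S

deps = {"ld_a": [], "ld_b": [], "ld_c": [], "add": ["ld_a"],
        "mul1": ["add", "ld_b"], "mul2": ["mul1", "ld_c"], "st": ["mul2"]}
lat  = {"ld_a": 3, "ld_b": 3, "ld_c": 3, "add": 1, "mul1": 2, "mul2": 2, "st": 3}
prio = {"ld_a": 11, "ld_b": 10, "ld_c": 8, "add": 8, "mul1": 7, "mul2": 5, "st": 3}
print(list_schedule(deps, lat, prio))
# ld_a:1, ld_b:2, ld_c:3, add:4, mul1:5, mul2:7, st:9 (the 11-cycle schedule)
```

With critical-path priorities this reproduces the "schedule loads early" table: loads in cycles 1-3, store issued in cycle 9.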

List scheduling
Priorities

Many different priorities are used; the quality of schedules depends on a good choice.

The longest-latency path, or critical path, is a good priority. Tie breakers:
- Last use of a value: decreases demand for registers by moving uses nearer their definitions
- Number of descendants: encourages the scheduler to pursue multiple paths
- Longer latency first: others can fit in its shadow
- Random
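The critical-path priority itself is a longest-path computation over the dependence graph; a sketch with an illustrative load-add-mult-store chain:

```python
# Sketch: priority(a) = a's latency plus the longest latency path from a
# onward, computed by memoised recursion over the RAW successor graph.
import functools

def critical_path_priority(deps, lat):
    succs = {a: [] for a in deps}
    for a, preds in deps.items():
        for b in preds:
            succs[b].append(a)

    @functools.lru_cache(maxsize=None)
    def cp(a):
        # latency of a plus the most critical path among its successors
        return lat[a] + max((cp(s) for s in succs[a]), default=0)

    return {a: cp(a) for a in deps}

lat  = {"ld_a": 3, "add": 1, "mul1": 2, "st": 3}
deps = {"ld_a": [], "add": ["ld_a"], "mul1": ["add"], "st": ["mul1"]}
print(critical_path_priority(deps, lat))  # ld_a: 9, add: 6, mul1: 5, st: 3
```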

List scheduling
Example: Schedule with priority by critical path length


List scheduling
Forward vs backward

- Can schedule from root to leaves (backward)
- May change schedule time
- List scheduling is cheap, so try both and choose the best


Opcode  | loadI | lshift | add | addI | cmp | store
Latency |   1   |   1    |  2  |  1   |  1  |   4


Forwards:

Cycle | Int    | Int    | Stores
1     | loadI1 | lshift |
2     | loadI2 | loadI3 |
3     | loadI4 | add1   |
4     | add2   | add3   |
5     | add4   | addI   | store1
6     | cmp    |        | store2
7     |        |        | store3
8     |        |        | store4
9     |        |        | store5
10-12 |        |        |
13    | cbr    |        |

Backwards:

Cycle | Int    | Int    | Stores
1     | loadI4 |        |
2     | addI   | lshift |
3     | add4   | loadI3 |
4     | add3   | loadI2 | store5
5     | add2   | loadI1 | store4
6     | add1   |        | store3
7     |        |        | store2
8     |        |        | store1
9-10  |        |        |
11    | cmp    |        |
12    | cbr    |        |

Scheduling Larger Regions

- Schedule extended basic blocks (EBBs)
- Superblock cloning
- Schedule traces
- Software pipelining


Scheduling Larger Regions
Extended basic blocks

An EBB is a maximal set of blocks such that:
- the set has a single entry, Bi
- each block Bj other than Bi has exactly one predecessor


Scheduling Larger Regions
Extended basic blocks

- Schedule entire paths through EBBs
- Example has four EBB paths
- Having B1 in both causes conflicts
- Moving an op out of B1 causes problems: must insert compensation code
- Moving an op into B1 causes problems


Scheduling Larger Regions
Superblock cloning

- Join points create context problems
- Clone blocks to create more context
- Merge any simple control flow
- Schedule EBBs


Scheduling Larger Regions
Trace scheduling

- Edge frequency from profile (not block frequency)
- Pick a "hot" path
- Schedule it with compensation code
- Remove it from the CFG
- Repeat
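The pick-schedule-remove loop can be sketched as a greedy walk over profiled edge frequencies. The CFG and edge counts below are hypothetical, and scheduling plus compensation code are abstracted away to just the trace-selection step:

```python
# Sketch: greedy trace selection over profiled edge frequencies.
# Seed each trace with the hottest remaining edge, grow it forward and
# backward along the most frequent edges (no revisits), then remove the
# trace's blocks and repeat. Edge counts here are made up.

def pick_traces(edges):
    """edges: {(src, dst): frequency}. Returns traces as lists of blocks."""
    remaining = dict(edges)
    traces = []
    while remaining:
        (s, d), _ = max(remaining.items(), key=lambda kv: kv[1])
        trace = [s, d]
        grew = True
        while grew:
            grew = False
            fwd = [(y, f) for (x, y), f in remaining.items()
                   if x == trace[-1] and y not in trace]
            if fwd:
                trace.append(max(fwd, key=lambda p: p[1])[0]); grew = True
            bwd = [(x, f) for (x, y), f in remaining.items()
                   if y == trace[0] and x not in trace]
            if bwd:
                trace.insert(0, max(bwd, key=lambda p: p[1])[0]); grew = True
        remaining = {(x, y): f for (x, y), f in remaining.items()
                     if x not in trace and y not in trace}
        traces.append(trace)
    return traces

cfg = {("B1", "B2"): 90, ("B1", "B3"): 10,
       ("B2", "B4"): 90, ("B3", "B4"): 10, ("B4", "B5"): 100}
print(pick_traces(cfg))  # [['B1', 'B2', 'B4', 'B5']]
```

The cold block B3 is left with no edges, so it would be scheduled on its own as a single-block trace (omitted here).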


Loop scheduling

- Loop structures can dominate execution time
- Specialist technique: software pipelining
- Allows application of list scheduling to loops

Why not loop unrolling?
- Allows loop overhead to become arbitrarily small, but
- Code growth, cache pressure, register pressure

Software pipelining

Consider simple loop to sum array

Software pipelining
Schedule on 1 FU: 5 cycles

load 3 cycles, add 1 cycle, branch 1 cycle

Software pipelining
Schedule on VLIW, 3 FUs: 4 cycles

Software pipelining
A better steady-state schedule exists

Software pipelining
Requires prologue and epilogue (may schedule other ops in the epilogue)

Software pipelining
Respect dependences and latency, including loop carries

Software pipelining
Complete code

Software pipelining
Some definitions

Initiation interval (ii): number of cycles between initiating loop iterations
- Original loop had an ii of 5 cycles
- Final loop had an ii of 2 cycles

Recurrence: loop-based computation whose value is used in a later loop iteration
- Might be several iterations later
- Has dependency chain(s) on itself
- Recurrence latency is the latency of the dependency chain

Software pipelining
Algorithm

1. Choose an initiation interval, ii
   - Compute lower bounds on ii
   - Shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   - Try to schedule into ii cycles, using a modulo scheduler
   - If it fails, increase ii by one and try again
3. Generate the needed prologue and epilogue code
   - For the prologue, work backward from upward-exposed uses in the scheduled loop body
   - For the epilogue, work forward from downward-exposed definitions in the scheduled loop body

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints.

Resource constraint: ii must be large enough to issue every operation
- Let Nu = number of FUs of type u
- Let Iu = number of operations of type u
- ⌈Iu/Nu⌉ is a lower bound on ii for type u
- maxu(⌈Iu/Nu⌉) is a lower bound on ii

Recurrence constraint: ii cannot be smaller than the longest recurrence latency
- Recurrence r spans kr iterations with latency λr
- ⌈λr/kr⌉ is a lower bound on ii for recurrence r
- maxr(⌈λr/kr⌉) is a lower bound on ii

Software pipelining
Initial initiation interval (ii)

Start value = max(maxu(⌈Iu/Nu⌉), maxr(⌈λr/kr⌉))

For the simple loop:

a = A[i]
b = b + a
i = i + 1
if i < n goto end

Resource constraint:
        | Memory | Integer | Branch
Iu      |   1    |    2    |   1
Nu      |   1    |    1    |   1
⌈Iu/Nu⌉ |   1    |    2    |   1

Recurrence constraint:
        | b | i
kr      | 1 | 1
λr      | 2 | 1
⌈λr/kr⌉ | 2 | 1
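The two tables reduce to a couple of lines of ceiling arithmetic; a sketch using the figures above:

```python
import math

# Sketch: lower bound on the initiation interval from the resource and
# recurrence constraints, using the numbers in the tables above.

def min_ii(ops_per_type, fus_per_type, recurrences):
    """recurrences: {name: (latency, iterations spanned)}."""
    res = max(math.ceil(ops_per_type[u] / fus_per_type[u])
              for u in ops_per_type)                  # max_u ceil(Iu/Nu)
    rec = max(math.ceil(lam / k)                      # max_r ceil(lambda_r/k_r)
              for lam, k in recurrences.values())
    return max(res, rec)

ii = min_ii(ops_per_type={"mem": 1, "int": 2, "branch": 1},
            fus_per_type={"mem": 1, "int": 1, "branch": 1},
            recurrences={"b": (2, 1), "i": (1, 1)})
print(ii)  # 2, matching the final loop's ii of 2 cycles
```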

Software pipelining
Modulo scheduling

Modulo scheduling: schedule with cycle modulo initiation interval

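The core trick, checking resources at cycle mod ii, can be sketched with a modulo reservation table. The FU counts below match the simple-loop example; the class and method names are illustrative:

```python
# Sketch: a modulo reservation table. An op placed at absolute cycle c
# occupies its FU type in row c % ii, so two ops from different iterations
# that land on the same row compete for the same FUs in the steady state.

class ModuloReservationTable:
    def __init__(self, ii, fus):            # fus: {fu_type: count}
        self.ii = ii
        self.fus = fus
        self.used = [{u: 0 for u in fus} for _ in range(ii)]

    def try_place(self, cycle, fu_type):
        """Reserve an FU of fu_type at `cycle`; False if the slot is full."""
        row = self.used[cycle % self.ii]
        if row[fu_type] < self.fus[fu_type]:
            row[fu_type] += 1
            return True
        return False                        # caller retries at cycle + 1

mrt = ModuloReservationTable(ii=2, fus={"mem": 1, "int": 1, "branch": 1})
print(mrt.try_place(0, "mem"))   # True: load placed at cycle 0
print(mrt.try_place(2, "mem"))   # False: cycle 2 mod 2 clashes with cycle 0
print(mrt.try_place(1, "int"))   # True: odd row is still free
```

On failure the modulo scheduler slides the op to later cycles; if no cycle works, ii is increased and scheduling restarts, as in the algorithm slide.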

Software pipelining
Current research

- Much research into different software pipelining techniques
- Difficult when there is general control flow in the loop
- Predication (in IA-64, for example) really helps here
- Some recent work on exhaustive scheduling, i.e. solving the NP-complete problem for basic blocks

Summary

- Scheduling to hide latency and exploit ILP
- Dependence graph: dependences between instructions + latency
- Local list scheduling + priorities
- Forward versus backward scheduling
- Scheduling EBBs, superblock cloning, trace scheduling
- Software pipelining of loops

PPar CDT Advert

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:

pervasiveparallelism.inf.ed.ac.uk

- 4-year programme: MSc by Research + PhD
- Collaboration between:
  - University of Edinburgh's School of Informatics (ranked top in the UK by 2014 REF)
  - Edinburgh Parallel Computing Centre (UK's largest supercomputing centre)
- Full funding available
- Industrial engagement programme includes internships at leading companies
- Research-focused: work on your thesis topic from the start
- Research topics in software, hardware, theory and application of:
  - Parallelism
  - Concurrency
  - Distribution
