Software Performance Optimisation Group, Imperial College, London
Instant-access cycle stealing for parallel applications requiring interactive response
Paul Kelly (Imperial College)
Susanna Pelagatti (University of Pisa)
Mark Rossiter (ex-Imperial, now with Telcordia)
Application scenario…
Workplace with fast LAN and many PCs
Some users occasionally need high computing power to accelerate interactive tasks
Example: CAD
Interactive design of components/structures
Analyse structural properties
Simulate fluid flow
Compute high-resolution rendering
Most PCs are under-utilised most of the time
Can we use spare CPU cycles to improve responsiveness?
The challenge…
Cycle stealing the easy way…
Maintain a batch queue
Maximise throughput for multiple, long-running jobs
Wait until desktop users leave their desks
This paper is about doing it the hard way:
Using spare cycles to accelerate short, parallel tasks (5-60 seconds)
In order to reduce interactive response time
While desktop users are at their desks
This means:
No batch queue – execute immediately using resources instantaneously available
No time to migrate or checkpoint tasks
No time to ship data across a wide-area network
A challenging environment…
For our experiments, we used a group of 32 Linux PCs in a very busy CS student lab
Graph shows hourly-average percentage utilisation (on a log scale) over a typical day
Although not 100% busy, the machines are in continuous use
Scenario
Host PCs service interactive desktop users
Requests to execute parallel guest jobs arrive intermittently
System allocates a group of idle PCs to execute each guest job
Objectives:
Minimise average response time for guest jobs
Keep interference suffered by hosts within reasonable limits
We show that this can really work, even in our extremely challenging environment
Next: characterise patterns of idleness
Then: design software to assign guest tasks
Then: evaluate alternative strategies by simulation
Earlier work
Litzkow, Livny, Mutka, “Condor – a hunter of idle workstations”, ICDCS ’88.
Atallah, Black, et al., “Models and algorithms for co-scheduling compute-intensive tasks on networks of workstations”, JPDC, 1992.
Arpaci, Dusseau, et al., “The interaction of parallel and sequential workloads on a network of workstations”, SIGMETRICS ’95.
Acharya, Edjlali, Saltz, “The utility of exploiting idle workstations for parallel computing”, SIGMETRICS ’97.
Petrini, Feng, “Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems”, IPDPS 2000.
United Devices, SETI@home, Entropia.
Subhlok, Lieu, Lowekamp, “Automatic node selection for high performance applications on networks”, PPoPP 1999.
Batch queue, multiple long-running jobs
Parallel jobs
“60-workstation cluster can handle job arrival trace taken from a dedicated 32-node CM-5”
Wide-area networks
Our goal: Improve response time for individual tasks
Characterize patterns of idleness
Idle periods occur frequently
90% of idle periods occur within 5s
Idle = over a one second period, less than 10% of CPU time is spent executing user processes, and at least 90% of CPU time could be devoted to a new process
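The idleness criterion above translates directly into a test on per-second CPU samples. A minimal sketch (the 10% and 90% thresholds come from the definition above; the function name and argument shapes are ours):

```python
def is_idle(user_cpu_fraction, available_cpu_fraction):
    """Apply the idleness test to one one-second sample.

    user_cpu_fraction: fraction of the last second spent executing
        user processes.
    available_cpu_fraction: fraction of CPU time that could have been
        devoted to a new process over the same second.
    """
    return user_cpu_fraction < 0.10 and available_cpu_fraction >= 0.90

# Example samples:
print(is_idle(0.05, 0.95))  # True  - effectively idle
print(is_idle(0.40, 0.55))  # False - the desktop user is busy
```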
Characterize patterns of idleness
Idle periods don’t last long
Only 50% last more than 3.3s
Distribution of idleness – 32 PCs in busy student lab
It’s very likely that we’ll have up to 15 idle machines at any time
It’s unlikely that the same 15 machines will stay idle for long
So how much can we hope to get?
With our 32-PC cluster, an idle group of 5 processors has about a 50% chance of remaining idle for more than 5 seconds
This is our parallel computing resource!
The mpidled software
mpidled is a Linux daemon process which runs on every participating PC:
Monitors system utilisation and determines whether the system is idle
Uses this and past measurements to predict short-term future utilization
mpidle is a client application which lists the participating PCs that are currently predicted to be idle
Produces a list of machine names, for use as an MPI machinefile
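The client's output step can be sketched as follows. This is a hypothetical illustration, not mpidle's actual interface: the function name and the dict of per-host predictions are our assumptions.

```python
def write_machinefile(predictions, path="machines"):
    """Write an MPI machinefile listing the hosts predicted to be idle.

    predictions: dict mapping hostname -> predicted-idle flag, e.g. as
        gathered from the participating daemons (shape is an assumption).
    """
    idle_hosts = sorted(h for h, idle in predictions.items() if idle)
    with open(path, "w") as f:
        f.write("\n".join(idle_hosts) + "\n")
    return idle_hosts

hosts = write_machinefile({"pc01": True, "pc02": False, "pc03": True})
print(hosts)  # ['pc01', 'pc03']
```

The resulting file is then usable in the ordinary way, e.g. `mpirun -machinefile machines -np 2 ./guest_job`.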
Zero administration by leadership election
Participating PCs are regularly unplugged and rebooted
Vital to minimize systems administration overheads…
mpidled daemons autonomously elect a “leader” to handle client requests (the current implementation relies on LAN broadcast, confined to one subnet)
mpidle usually responds in less than 0.15s
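The key property of such an election is that every live daemon, given the same set of peers heard over broadcast, picks the same leader with no further messages. A minimal sketch of that agreement rule only (the lowest-name rule is our illustrative assumption; the real implementation's networking over UDP broadcast is omitted):

```python
def elect_leader(live_hosts):
    """Pick a leader deterministically from the set of live daemons.

    Every daemon applies the same rule to the hosts it has heard
    broadcasts from, so all daemons agree without coordination.
    """
    if not live_hosts:
        raise ValueError("no live daemons heard")
    return min(live_hosts)  # any deterministic total order works

print(elect_leader({"pc07", "pc03", "pc21"}))  # pc03
```

If the leader is unplugged, the surviving daemons simply re-run the rule over the hosts still broadcasting.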
Load prediction
We use recent measurements of idleness to predict how idle each PC will be in the future
Good prediction leads to:
Shorter execution time for guest jobs
Less interference with host processes, i.e. the desktop user
We’re interested in short-running guest jobs – so we don’t consider migrating tasks if the prediction turns out wrong
How good is load prediction?
Previous studies (Dinda and O’Hallaron, Wolski et al.) have shown that taking the weighted mean of the last few samples works as well as anything
[Graph: prediction error against forecast length (seconds); 10-second predictions shown]
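The weighted-mean predictor mentioned above can be sketched as an exponentially weighted mean of the recent samples. The decay value here is a hypothetical choice for illustration; the cited studies tune such weights empirically.

```python
def predict_load(samples, decay=0.5):
    """Predict near-term load as a weighted mean of recent samples.

    samples: recent load measurements, oldest first.
    decay: weight ratio between successive samples (newest has weight 1,
        the one before it `decay`, and so on). Illustrative value.
    """
    if not samples:
        return 0.0
    weights = [decay ** i for i in range(len(samples))]
    weighted = sum(w * s for w, s in zip(weights, reversed(samples)))
    return weighted / sum(weights)

# A machine whose load has been falling is predicted lighter than its
# plain mean (0.5) would suggest:
print(round(predict_load([0.9, 0.5, 0.1]), 3))  # 0.329
```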
How well does it work?
Simulation, driven by traces from 32 machines gathered over one week, during busy working hours
Uses the application’s speedup curve to predict execution time given the number of processors available
Also uses trace load data to compute the CPU share available on each processor
For this study, we simulated execution of a ray-tracing task
Sequential execution takes 42 seconds
Speedup is more-or-less linear, with 50-60% efficiency
Requests to execute a guest task arrive with an exponential distribution, with mean inter-arrival time of 20 seconds
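The workload model above can be sketched briefly. The 42-second sequential time and 20-second mean inter-arrival gap are from the text; the flat 55% efficiency is our assumed mid-point of the stated 50-60% range, and the function names are ours.

```python
import random

SEQ_TIME = 42.0      # sequential execution time (s), from the paper
EFFICIENCY = 0.55    # assumed mid-point of the stated 50-60% range

def exec_time(n_procs, cpu_share=1.0):
    """Predicted run time on n_procs processors, each contributing
    cpu_share of a CPU; speedup modelled as linear with fixed efficiency."""
    speedup = max(1.0, EFFICIENCY * n_procs * cpu_share)
    return SEQ_TIME / speedup

def arrival_gaps(n, mean_gap=20.0, seed=1):
    """n exponentially distributed inter-arrival gaps (seconds)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap) for _ in range(n)]

print(round(exec_time(8), 2))  # 42 / (0.55 * 8) ≈ 9.55 s
```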
How well does it work - baseline
Disruption to desktop users is dramatically reduced compared to assigning work at random (but not zero)
Although many processors are used, speedup is low
Quite often, a guest task is rejected because no processor is idle
Usually because an earlier guest task is still running

                    Allocate to all   Allocate randomly
                    idle processors   to 17 processors
Jobs refused        16.3%             nil
Idle seconds used   21.6%             25%
Mean group size     17.2              17
Mean speedup        3.58              3.68
Seconds disrupted   5.28%             44.4%
Allocation policy matters…
The simplest policy is to allocate all available (idle) processors to each guest job
This leads to a bimodal distribution: a substantial proportion of guest jobs get little or no benefit
A better strategy – holdback
The problem:
If a second guest task arrives before the first has finished, very few processors are available to run it
Idea: “holdback”
Hold back a fraction r of the processors in reserve
Each guest task is allocated (1-r) of the available (idle) processors
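The holdback rule above is a one-liner; a minimal sketch (rounding down to whole processors is our assumption):

```python
import math

def allocate_with_holdback(idle_procs, r):
    """Allocate a (1 - r) fraction of the idle processors to a guest
    job, holding the rest in reserve for guest tasks that arrive
    before this one finishes."""
    if not 0.0 <= r < 1.0:
        raise ValueError("holdback fraction must be in [0, 1)")
    return math.floor((1.0 - r) * idle_procs)

# With 15 idle machines and a 20% holdback, a guest job gets 12:
print(allocate_with_holdback(15, 0.20))  # 12
```

With r = 0 this degenerates to the allocate-everything policy above.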
Holdback improves fairness
By holding back some resources at each allocation, guest tasks get a more predictable and consistent share
How much to hold back depends on the rate of arrival of guest tasks
[Histograms: frequency (%) against group size, for three holdback settings]
How much to hold back
                    Random  No      10%     20%     30%     40%     50%     60%
                            reserve reserve reserve reserve reserve reserve reserve
Jobs refused        nil     16.3%   3.3%    1.5%    2.0%    0.5%    0.2%    0.1%
Idle seconds used   25.0%   21.6%   23.3%   23.2%   22.6%   22.1%   21.5%   21.0%
Mean group size     17.0    17.2    15.24   13.6    12.0    10.3    8.7     7.1
Mean speedup        3.68    3.58    4.58    4.88    4.96    4.82    4.46    3.92
Seconds disrupted   44.4%   5.28%   6.3%    6.5%    5.9%    6.0%    5.9%    6.2%
Mean speedup is maximised with the right holdback
Parallel efficiency is lower than it would be on a dedicated parallel system, due to interference
Larger group size doesn’t imply higher speedup
Details depend on the speedup characteristics of the guest application workload
Conclusions & Further work
Simple, effective tool, to be made freely available
Even extremely busy environments can host a substantial parallel workload
Short interactive jobs can be accelerated, if:
Relatively small startup cost and data size
Parallel execution time lies within the scope of load prediction – 10 seconds or so
Desktop users are prepared to tolerate some interference
Plenty of scope for further study…
Memory contention
Adaptive holdback
Integrate with queuing to handle longer-running jobs
How to reduce startup delay?