Software Performance Optimisation Group, Imperial College, London
Instant-access cycle stealing for parallel applications requiring interactive response
Paul Kelly (Imperial College)
Susanna Pelagatti (University of Pisa)
Mark Rossiter (ex-Imperial, now with Telcordia)
Application scenario…
Workplace with fast LAN and many PCs
Some users occasionally need high computing power to accelerate interactive tasks
Example: CAD
Interactive design of components/structures
Analyse structural properties
Simulate fluid flow
Compute high-resolution rendering
Most PCs are under-utilised most of the time
Can we use spare CPU cycles to improve responsiveness?
The challenge…
Cycle stealing the easy way…
Maintain a batch queue
Maximise throughput for multiple, long-running jobs
Wait until desktop users leave their desks
This paper is about doing it the hard way:
Using spare cycles to accelerate short, parallel tasks (5-60 seconds)
In order to reduce interactive response time
While desktop users are at their desks
This means:
No batch queue – execute immediately using resources instantaneously available
No time to migrate or checkpoint tasks
No time to ship data across a wide-area network
A challenging environment…
For our experiments, we used a group of 32 Linux PCs in a very busy CS student lab
Graph shows hourly-average percentage utilisation (on a log scale) over a typical day
Although not 100% busy, the machines are in continuous use
Scenario
Host PCs service interactive desktop users
Requests to execute parallel guest jobs arrive intermittently
System allocates a group of idle PCs to execute each guest job
Objectives:
Minimise average response time for guest jobs
Keep interference suffered by hosts within reasonable limits
We show that this can really work, even in our extremely challenging environment
Next: characterise patterns of idleness
Then: design software to assign guest tasks
Then: evaluate alternative strategies by simulation
Earlier work
Litzkow, Livny, Mutka, “Condor – a hunter of idle workstations”, ICDCS ’88.
Atallah, Black, et al., “Models and algorithms for co-scheduling compute-intensive tasks on networks of workstations”, JPDC, 1992.
Arpaci, Dusseau, et al., “The interaction of parallel and sequential workloads on a network of workstations”, SIGMETRICS ’95.
Acharya, Edjlali, Saltz, “The utility of exploiting idle workstations for parallel computing”, SIGMETRICS ’97.
Petrini, Feng, “Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems”, IPDPS 2000.
United Devices, SETI@home, Entropia.
Subhlok, Lieu, Lowekamp, “Automatic node selection for high performance applications on networks”, PPoPP 1999.
Batch queue, multiple long-running jobs
Parallel jobs
“60-workstation cluster can handle job arrival trace taken from a dedicated 32-node CM-5”
Wide-area networks
Our goal: Improve response time for individual tasks
Characterize patterns of idleness
Idle periods occur frequently
90% of idle periods occur within 5s
Idle = over a one second period, less than 10% of CPU time is spent executing user processes, and at least 90% of CPU time could be devoted to a new process
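The idleness criterion above translates directly into a test on per-second CPU samples. A minimal sketch (the 10% and 90% thresholds come from the definition above; the function name and argument shapes are ours):

```python
def is_idle(user_cpu_fraction, available_cpu_fraction):
    """Apply the idleness test to one one-second sample.

    user_cpu_fraction: fraction of the last second spent executing
        user processes.
    available_cpu_fraction: fraction of CPU time that could have been
        devoted to a new process over the same second.
    """
    return user_cpu_fraction < 0.10 and available_cpu_fraction >= 0.90

# Example samples:
print(is_idle(0.05, 0.95))  # True  - effectively idle
print(is_idle(0.40, 0.55))  # False - the desktop user is busy
```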
Characterize patterns of idleness
Idle periods don’t last long
Only 50% last more than 3.3s
Distribution of idleness – 32 PCs in busy student lab
It’s very likely that we’ll have up to 15 idle machines at any time
It’s unlikely that the same 15 machines will stay idle for long
So how much can we hope to get?
With our 32-PC cluster, an idle group of 5 processors has about a 50% chance of remaining idle for more than 5 seconds
This is our parallel computing resource!
The mpidled software
mpidled is a Linux daemon process which runs on every participating PC:
Monitors system utilisation and determines whether the system is idle
Uses this and past measurements to predict short-term future utilization
mpidle is a client application which lists the participating PCs that are currently predicted to be idle
Produces a list of machine names, for use as an MPI machinefile
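The client's output step can be sketched as follows. This is a hypothetical illustration, not mpidle's actual interface: the function name and the dict of per-host predictions are our assumptions.

```python
def write_machinefile(predictions, path="machines"):
    """Write an MPI machinefile listing the hosts predicted to be idle.

    predictions: dict mapping hostname -> predicted-idle flag, e.g. as
        gathered from the participating daemons (shape is an assumption).
    """
    idle_hosts = sorted(h for h, idle in predictions.items() if idle)
    with open(path, "w") as f:
        f.write("\n".join(idle_hosts) + "\n")
    return idle_hosts

hosts = write_machinefile({"pc01": True, "pc02": False, "pc03": True})
print(hosts)  # ['pc01', 'pc03']
```

The resulting file is then usable in the ordinary way, e.g. `mpirun -machinefile machines -np 2 ./guest_job`.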
Zero administration by leadership election
Participating PCs are regularly unplugged and rebooted
Vital to minimize systems administration overheads…
mpidled daemons autonomously elect a “leader” to handle client requests (the current implementation relies on LAN broadcast, confined to one subnet)
mpidle usually responds in less than 0.15s
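The key property of such an election is that every live daemon, given the same set of peers heard over broadcast, picks the same leader with no further messages. A minimal sketch of that agreement rule only (the lowest-name rule is our illustrative assumption; the real implementation's networking over UDP broadcast is omitted):

```python
def elect_leader(live_hosts):
    """Pick a leader deterministically from the set of live daemons.

    Every daemon applies the same rule to the hosts it has heard
    broadcasts from, so all daemons agree without coordination.
    """
    if not live_hosts:
        raise ValueError("no live daemons heard")
    return min(live_hosts)  # any deterministic total order works

print(elect_leader({"pc07", "pc03", "pc21"}))  # pc03
```

If the leader is unplugged, the surviving daemons simply re-run the rule over the hosts still broadcasting.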
Load prediction
We use recent measurements of idleness to predict how idle each PC will be in the future
Good prediction leads to:
Shorter execution time for guest jobs
Less interference with host processes, i.e. the desktop user
We’re interested in short-running guest jobs – so we don’t consider migrating tasks if the prediction turns out wrong
How good is load prediction?
Previous studies (Dinda and O’Hallaron, Wolski et al.) have shown that taking the weighted mean of the last few samples works as well as anything
[Graph: prediction error against forecast length (seconds); 10-second predictions shown]
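The weighted-mean predictor mentioned above can be sketched as an exponentially weighted mean of the recent samples. The decay value here is a hypothetical choice for illustration; the cited studies tune such weights empirically.

```python
def predict_load(samples, decay=0.5):
    """Predict near-term load as a weighted mean of recent samples.

    samples: recent load measurements, oldest first.
    decay: weight ratio between successive samples (newest has weight 1,
        the one before it `decay`, and so on). Illustrative value.
    """
    if not samples:
        return 0.0
    weights = [decay ** i for i in range(len(samples))]
    weighted = sum(w * s for w, s in zip(weights, reversed(samples)))
    return weighted / sum(weights)

# A machine whose load has been falling is predicted lighter than its
# plain mean (0.5) would suggest:
print(round(predict_load([0.9, 0.5, 0.1]), 3))  # 0.329
```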
How well does it work?
Simulation, driven by traces from 32 machines gathered over one week, during busy working hours
Uses the application’s speedup curve to predict execution time given the number of processors available
Also uses trace load data to compute the CPU share available on each processor
For this study, we simulated execution of a ray-tracing task
Sequential execution takes 42 seconds
Speedup is more-or-less linear, with 50-60% efficiency
Requests to execute a guest task arrive with an exponential distribution, with mean inter-arrival time of 20 seconds
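The workload model above can be sketched briefly. The 42-second sequential time and 20-second mean inter-arrival gap are from the text; the flat 55% efficiency is our assumed mid-point of the stated 50-60% range, and the function names are ours.

```python
import random

SEQ_TIME = 42.0      # sequential execution time (s), from the paper
EFFICIENCY = 0.55    # assumed mid-point of the stated 50-60% range

def exec_time(n_procs, cpu_share=1.0):
    """Predicted run time on n_procs processors, each contributing
    cpu_share of a CPU; speedup modelled as linear with fixed efficiency."""
    speedup = max(1.0, EFFICIENCY * n_procs * cpu_share)
    return SEQ_TIME / speedup

def arrival_gaps(n, mean_gap=20.0, seed=1):
    """n exponentially distributed inter-arrival gaps (seconds)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap) for _ in range(n)]

print(round(exec_time(8), 2))  # 42 / (0.55 * 8) ≈ 9.55 s
```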
How well does it work - baseline
Disruption to desktop users is dramatically reduced compared to assigning work at random (but not zero)
Although many processors are used, speedup is low
Quite often, a guest task is rejected because no processor is idle
Usually because an earlier guest task is still running

                    Allocate to all   Allocate randomly
                    idle processors   to 17 processors
Jobs refused        16.3%             nil
Idle seconds used   21.6%             25%
Mean group size     17.2              17
Mean speedup        3.58              3.68
Seconds disrupted   5.28%             44.4%
Allocation policy matters…
The simplest policy is to allocate all available (idle) processors to each guest job
This leads to a bimodal distribution: a substantial proportion of guest jobs get little or no benefit
A better strategy – holdback
The problem:
If a second guest task arrives before the first has finished, very few processors are available to run it
Idea: “holdback”
Hold back a fraction r of the processors in reserve
Each guest task is allocated (1-r) of the available (idle) processors
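The holdback rule above is a one-liner; a minimal sketch (rounding down to whole processors is our assumption):

```python
import math

def allocate_with_holdback(idle_procs, r):
    """Allocate a (1 - r) fraction of the idle processors to a guest
    job, holding the rest in reserve for guest tasks that arrive
    before this one finishes."""
    if not 0.0 <= r < 1.0:
        raise ValueError("holdback fraction must be in [0, 1)")
    return math.floor((1.0 - r) * idle_procs)

# With 15 idle machines and a 20% holdback, a guest job gets 12:
print(allocate_with_holdback(15, 0.20))  # 12
```

With r = 0 this degenerates to the allocate-everything policy above.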
Holdback improves fairness
By holding back some resources at each allocation, guest tasks get a more predictable and consistent share
How much to hold back depends on the rate of arrival of guest tasks
[Histograms: frequency (%) against group size, for three holdback settings]
How much to hold back
                    Random  No      10%     20%     30%     40%     50%     60%
                            reserve reserve reserve reserve reserve reserve reserve
Jobs refused        nil     16.3%   3.3%    1.5%    2.0%    0.5%    0.2%    0.1%
Idle seconds used   25.0%   21.6%   23.3%   23.2%   22.6%   22.1%   21.5%   21.0%
Mean group size     17.0    17.2    15.24   13.6    12.0    10.3    8.7     7.1
Mean speedup        3.68    3.58    4.58    4.88    4.96    4.82    4.46    3.92
Seconds disrupted   44.4%   5.28%   6.3%    6.5%    5.9%    6.0%    5.9%    6.2%
Mean speedup is maximised with the right holdback
Parallel efficiency is lower than it would be on a dedicated parallel system, due to interference
Larger group size doesn’t imply higher speedup
Details depend on the speedup characteristics of the guest application workload
Conclusions & Further work
Simple, effective tool, to be made freely available
Even extremely busy environments can host a substantial parallel workload
Short interactive jobs can be accelerated, if:
Relatively small startup cost and data size
Parallel execution time lies within the scope of load prediction – 10 seconds or so
Desktop users are prepared to tolerate some interference
Plenty of scope for further study…
Memory contention
Adaptive holdback
Integrate with queuing to handle longer-running jobs
How to reduce startup delay?