A Prediction-based Real-time Scheduling Advisor

Peter A. Dinda
Prescience Lab
Department of Computer Science
Northwestern University
http://www.cs.northwestern.edu/~pdinda
RTSA
Real-time Scheduling Advisor

“I have a 5 second job. I want it to finish in under 10 seconds with at least 95% probability. Here is a list of hosts where I can run it. Which one should I use?”

“Use host 3. It’ll finish there in 7 to 9 secs.”

“There is no host where that is possible. The fastest is host 5, where it’ll finish in 12 to 15 seconds.”
Core Results

• RTSA based on predictive signal processing
• Layered system architecture for scalable performance prediction
• Targets commodity shared, unreserved distributed environments
• All at user level
• Randomized trace-based evaluation giving evidence of its effectiveness
• Limitations
– Compute-bound tasks
– Evaluation on Digital Unix platform

Publicly available as part of the RPS system
Outline
• Motivation: interactive applications
• Interface
• Implementation
• Performance evaluation
• Conclusions and future work
Interactive Applications on Shared, Unreserved Distributed Computing Environments

• Examples: visualization, games, VR
• Responsiveness requirements => soft deadlines
• No resource reservation or admission control
– Constant competition from other users
• Changing resource availability => adaptation
• Adaptation is via server selection
– Other mechanisms possible
Interactive Applications and the RTSA

• RTSA controls adaptation mechanisms
– Operates on behalf of a single application
– Multiple RTSAs may be running independently
• Current limitation: compute-bound tasks
Interface - Request

struct RTSARequest {
  double tnom;    // size of task in CPU-seconds
  double sf;      // maximum slack allowed
  double conf;    // minimum probability allowed
  Host hosts[];   // list of hosts to choose from
};

int RTSAAdviseTask(RTSARequest &req, RTSAResponse &resp);

deadline = now + tnom(1+sf)

“I have a 5 second job. I want it to finish in under 10 seconds with at least 95% probability. Here is a list of hosts where I can run it. Which one should I use?”
Interface - Response

struct RTSAResponse {
  double tnom;    // size of task in CPU-seconds
  double sf;      // maximum slack allowed
  double conf;    // minimum probability allowed
  Host host;      // host to use
  RunningTimePredictionResponse runningtime;  // predicted running time of task on that host
};

int RTSAAdviseTask(RTSARequest &req, RTSAResponse &resp);

“Use host 3. It’ll finish there in 7 to 9 secs.”
RunningTimePredictionResponse

struct RunningTimePredictionResponse {
  Host host;      // host to use
  double tnom;    // size of task in CPU-seconds
  double conf;    // confidence level
  double texp;    // point estimate of running time
  double tlb;     // lower bound of confidence interval of running time
  double tub;     // upper bound of confidence interval of running time
};

“The most likely running time is 7.5 seconds. There is a 95% chance that the actual running time will be in the range 7 to 9 seconds.”
Implementation

[Architecture diagram] A daemon (one per host) runs the Host Load Measurement System, whose measurement stream feeds the Host Load Prediction System. An application-linked library contains the Running Time Advisor and the Real-time Scheduling Advisor:
• Application -> RTSA: nominal time, slack, confidence, host list
• RTSA -> Application: host, running time estimate
• RTSA -> Running Time Advisor: nominal time, confidence, host
• Running Time Advisor -> RTSA: running time estimate (confidence interval)
• Running Time Advisor <-> Host Load Prediction System: load prediction request / load prediction response
Underlying Components

• Host load measurement
– Digital Unix 5-second load average, sampled at 1 Hz
– [LCR98, SciProg99]
• Host load prediction
– Periodic linear time series analysis (continuously monitored AR(16) predictors)
– <1% of CPU
– [HPDC99, Cluster00]
• Running time advisor (RTA)
– Task size + host load predictions => confidence interval for running time of task
– [SIGMETRICS01, HPDC01, Cluster02]
RTSA Implementation Simplified

[Figure: Predicted Running Time of the task on each host, marking tnom and the point estimate texp]

The RTA predicts the running time of the task on each host.
RTSA Implementation Simplified

[Figure: Predicted Running Time of the task on each host, with the deadline = (1+sf)·tnom marked]

• RTSA picks randomly from among the hosts where the deadline can be met
• If there is no such host, RTSA returns the host with the lowest running time
• RTSA also returns the estimate of the running time
Prediction Error

• Predictions are not perfect
– Some machines are harder to predict than others
– Need more than a point estimate (texp)
• Predictors can estimate their quality
– Covariance matrix for prediction errors
– The estimate of predictor error is also continually monitored for accuracy
• Confidence interval captures this
– Deadline probability serves as the confidence level
RTSA Implementation

[Figure: Predicted Running Time of the task on each host, each prediction shown as a confidence interval [tlb, tub] at conf = 95%, with the deadline = (1+sf)·tnom marked]

• RTSA picks randomly from among the hosts where the deadline can be met even given the maximum running time captured in the confidence interval
• If there is no such host, RTSA returns the host with the lowest running time
• RTSA also returns the estimate of the running time
Experimental Setup

• Environment
– AlphaStation 255s, Digital Unix 4.0
– Private network
– Separate machine for submitting tasks
– Prediction system on each host
• Background workload: host load trace playback [LCR00]
– Traces from the PSC Alpha cluster and a wide range of CMU machines
– Reconstruct any combination of these machines (a “scenario”)
• Testcase: submit a synthetic task to the system, run it on the host that the RTSA selects, measure the result
Scenarios

Name  Hosts  Average Load  Average Epoch
4LS   4      High          Small
4SL   4      Low           Large
4MM   4      Mixed         Mixed
5SS   5      Low           Small
4MS   4      Mixed         Small
4SM   4      Low           Mixed
2CS   2      (large-memory compute servers)
2MP   2      (very predictable hosts)
The Metrics

• Fraction of deadlines met
– Probability of meeting the deadline
• Fraction of deadlines met when predicted
– Probability of meeting the deadline if the RTSA claims it is possible
• Number of possible hosts
– Degree of randomness in the RTSA’s decision
– High randomness means different RTSAs are unlikely to conflict
Testcases
• Synthetic compute-bound tasks
• Size: 0.1 to 10 seconds, uniform
• Interarrival: 5 to 15 seconds, uniform
• sf: 0 to 2, uniform
• conf: 0.95 in all cases
8,000 to 16,000 testcases for each scenario
How do metrics vary with scenario, size, sf?
The RTSA Implementations

• AR(16)
– RTSA as described here
– Instantiated with the AR(16) load predictor
• MEASURE
– Send task to the host with the lowest load
– Does not return a predicted running time
– High probability of conflicts
• RANDOM
– Send task to a random host
– Does not return a predicted running time
– Low probability of conflicts
Fraction of Deadlines Met – 4LS

[Graph: fraction of deadlines met (0 to 1) vs. slack factor (0 to 2) for the random, measure, and ar16 strategies, with the target 95% level marked. Annotation: performance gain from prediction.]
Fraction of Deadlines Met – 4LS

[Graph: fraction of deadlines met (0 to 1) vs. nominal time (0 to 10 seconds) for the random, measure, and ar16 strategies, with the target 95% level marked. Annotation: performance gain from prediction.]
Fraction of Deadlines Met – 4LS

[Graph: fraction of deadlines met (0 to 1) vs. nominal time (0 to 10 seconds) for the random, measure, and ar16 strategies, with the target 95% level marked. Annotation: highest performance gain from prediction near the “critical slack”.]
Fraction of Deadlines Met When Predicted – 4LS

[Graph: fraction of deadlines met when predicted (0 to 1) vs. slack factor (0 to 2) for ar16, with the target 95% level marked. Annotation: only the predictive strategy can indicate whether meeting the deadline is possible.]
Fraction of Deadlines Met When Predicted – 4LS

[Graph: fraction of deadlines met when predicted (0 to 1) vs. nominal time (0 to 10 seconds) for ar16, with the target 95% level marked. Annotation: only the predictive strategy can indicate whether meeting the deadline is possible.]
Fraction of Deadlines Met When Predicted – 4LS

[Graph: fraction of deadlines met when predicted (0 to 1) vs. nominal time (0 to 10 seconds) for ar16, with the target 95% level marked. Annotation: operating near the critical slack is most challenging.]
Number of Possible Hosts – 4LS

[Graph: number of possible hosts (0 to 4.5) vs. slack factor (0 to 2) for the random, measure, and ar16 strategies. Annotation: the predictive strategy introduces “appropriate randomness”.]
Number of Possible Hosts – 4LS

[Graph: number of possible hosts (0 to 4.5) vs. nominal time (0 to 10 seconds) for the random, measure, and ar16 strategies. Annotation: the predictive strategy introduces “appropriate randomness”.]
Number of Possible Hosts – 4LS

[Graph: number of possible hosts (0 to 4.5) vs. nominal time (0 to 10 seconds) for the random, measure, and ar16 strategies. Annotation: operation near the “critical slack” is most challenging.]
Conclusions and Future Work

• Introduced the RTSA concept
• Described a prediction-based implementation
• Demonstrated feasibility
• Evaluated performance
• Current and future work
– Incorporate communication, memory, and disk
– Improved predictive models
For More Information
• Peter Dinda– http://www.cs.northwestern.edu/~pdinda
• RPS– http://www.cs.northwestern.edu/~RPS
• Prescience Lab– http://www.cs.northwestern.edu/~plab