Performance-responsive Middleware for Grid Computing
Dr Stephen Jarvis, High Performance Systems Group
University of Warwick, UK
Context
• Funded by / collaborating with
  – UK e-Science Core Programme
  – IBM (Watson, Hursley)
  – NASA (Ames)
  – NEC Europe
  – Los Alamos National Laboratory
• Integrate established performance tools into emerging grid middleware
Grid Resource Management
How do we enable and regulate resource sharing between users?
While…
  – providing a vision of access to full resources
  – hiding detail & unnecessary complexity
  – providing acceptable levels of service
[Layered diagram: applications (workload generation, visualisation…); middleware (discovery, mapping, scheduling, security, accounting…); resources (computing, storage, instrumentation…)]
Managing through Middleware
Key interface between applications & resources
Key Middleware Activities
• Determine what resources are required (advertise)
• Determine what resources are available (discovery)
• Map requirements to available resources (scheduling)
• Maintain a contract of performance (service level agreement)
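The advertise/discover/schedule cycle above can be sketched as a toy matchmaker. All resource names, fields and figures below are invented for illustration and are not part of any real middleware API:

```python
# Toy matchmaker: map a task's advertised requirements onto the
# tightest-fitting discovered resource. All data is illustrative.

def matchmake(requirements, advertised):
    """Return the advertised resource meeting every requirement,
    preferring the tightest fit (fewest spare CPUs)."""
    candidates = [
        r for r in advertised
        if r["cpus"] >= requirements["cpus"]
        and r["mem_gb"] >= requirements["mem_gb"]
    ]
    if not candidates:
        return None  # discovery failed: no resource can host the task
    return min(candidates, key=lambda r: r["cpus"] - requirements["cpus"])

advertised = [
    {"name": "cluster-a", "cpus": 16, "mem_gb": 32},
    {"name": "cluster-b", "cpus": 8,  "mem_gb": 16},
]
task = {"cpus": 8, "mem_gb": 8}
print(matchmake(task, advertised)["name"])  # -> cluster-b (tightest fit)
```

A real matchmaker (e.g. Condor's ClassAd mechanism, used later in this deck) expresses both sides as constraint/rank expressions rather than fixed fields.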
Performance Services
• Intra-domain
  – Lab- / department-based
  – Shared resources under local administration
• Multi-domain
  – Campus- / country-based
  – Wide-area resource and task management
  – Cross-domain
Performance Prediction
• Performance prediction tools aim to predict
  – Execution time
  – Communication usage
  – Data and resource requirements
• Provide a best guess as to how an application will execute on a given resource
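The flavour of such a prediction can be shown with a minimal analytical model: runtime as an ideal parallel compute term plus a communication term that grows with processor count. The formula and all constants here are illustrative stand-ins, not the PACE model:

```python
# Minimal analytical prediction sketch: wall time = compute + comms.
# All constants are invented for illustration; they are not PACE's.

def predict_runtime(work_flops, procs, flops_per_sec,
                    msg_bytes, latency_s, bandwidth_bps):
    compute = work_flops / (procs * flops_per_sec)           # ideal speed-up
    comms = procs * (latency_s + msg_bytes / bandwidth_bps)  # grows with procs
    return compute + comms

# Scaling behaviour: more processors cut compute time but add
# communication, so predicted runtime eventually rises again.
times = {p: predict_runtime(1e12, p, 1e9, 1e8, 1e-4, 1e9)
         for p in (1, 16, 64, 256)}
```

With these invented constants the predicted optimum sits at 64 processors; the point is that a model lets such sweet spots be found before any job runs.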
[Diagram: PACE. The user supplies an application and a target resource; PACE derives an application model and a resource model, and an evaluation engine combines the model parameters with the resource configuration to produce a prediction.]
Why is prediction useful?
• Scaling properties
• Compare runtime options with
  – deadline
  – available resources
  – priority / other jobs
  – etc.
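Comparing runtime options against a deadline reduces to a simple search over predictions. The prediction table below is made up, standing in for a tool's output:

```python
# Sketch: pick the cheapest processor count whose predicted runtime
# meets a deadline. The prediction table is invented for illustration.

predicted = {1: 48.0, 4: 13.0, 8: 7.5, 16: 5.0}  # procs -> predicted minutes

def cheapest_meeting_deadline(predicted, deadline):
    """Fewest processors whose predicted runtime fits the deadline."""
    feasible = [p for p, t in predicted.items() if t <= deadline]
    return min(feasible) if feasible else None

print(cheapest_meeting_deadline(predicted, deadline=10.0))  # -> 8
```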
[Chart: running time (sec) on an SGI Origin2000 versus number of processors (1–16) for sweep3d, fft, improc, closure, jacobi, memsort and cpi]
Allows runtime scenarios to be explored before deployment
1. Intra-Domain Co-Scheduling
• Augment emerging middleware with additional performance information
• Handle predictive and non-predictive tasks
• Use predictive data for system improvement
  – Time to complete tasks / utilisation of resources
  – QoS – ability to meet deadlines
• Scheduler driver, or co-scheduler (called Titan)
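Titan searches for good schedules with a genetic algorithm; a much simpler greedy stand-in is enough to show how predictive data drives placement. The task names below come from the benchmark chart earlier in the deck, but their runtimes and the two-host setup are invented:

```python
# Greedy stand-in for a predictive co-scheduler: place each task
# (longest predicted runtime first) on the host that frees up earliest.
# Titan itself evolves schedules with a genetic algorithm; this is only
# the simplest illustration of predictions driving scheduling decisions.

def greedy_schedule(predicted_runtimes, n_hosts):
    """predicted_runtimes: {task: seconds}. Returns (makespan, assignment)."""
    free_at = [0.0] * n_hosts
    assignment = {}
    for task in sorted(predicted_runtimes,
                       key=predicted_runtimes.get, reverse=True):
        h = free_at.index(min(free_at))   # earliest-free host
        assignment[task] = h
        free_at[h] += predicted_runtimes[task]
    return max(free_at), assignment

tasks = {"sweep3d": 40.0, "fft": 25.0, "jacobi": 20.0, "cpi": 5.0}
makespan, plan = greedy_schedule(tasks, n_hosts=2)
print(makespan)  # -> 45.0
```

Without predictions a scheduler can only react (e.g. FIFO queues); with them it can pack tasks to shorten the overall makespan, which is exactly the gain shown in the deployment figures that follow.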
Intra-Domain Co-Scheduling
• Non-predictive tasks
• Tasks with prediction data
[Architecture diagram (Titan): requests from users or other domain schedulers arrive via a portal; components include a pre-execution engine, a matchmaker (Condor ClassAds), a GA-driven schedule queue, PACE, and a cluster connector feeding Condor-managed resources]
Intra-Domain Deployment
Without co-scheduler: time to complete = 70.08 m
With co-scheduler: time to complete = 35.19 m
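As a sanity check on the deployment figures, the relative improvement works out to roughly 50%:

```python
# Improvement implied by the two completion times on the slide.
without_cs, with_cs = 70.08, 35.19  # minutes
improvement = (without_cs - with_cs) / without_cs
print(f"{improvement:.1%}")  # -> 49.8%
```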
2. Multi-Domain Management
• Publish intra-domain performance data through global information services (MDS)
• Augment the service with an agent system
  – One agent per domain / VO
• When a task is submitted
  – Agents query the IS and negotiate to discover the best domain to run the task
• Scheme tested on a 256-node experimental Grid
  – 16 resource domains; 6 architecture types
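The negotiation step reduces to choosing the domain with the best predicted completion time. The domain names and figures below are illustrative, and the flat dictionary stands in for the MDS query, not its actual schema:

```python
# Sketch of the agent negotiation: each domain's agent reports its
# predicted completion time (here a made-up table standing in for an
# information-service query), and the task goes to the best domain.

def negotiate(domain_predictions):
    """domain_predictions: {domain: predicted completion time (s)}."""
    return min(domain_predictions, key=domain_predictions.get)

predictions = {"domain-03": 1840.0, "domain-07": 467.0, "domain-12": 2752.0}
print(negotiate(predictions))  # -> domain-07
```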
Multi-Domain Management
[Animated diagram: task placement across domains over time]
Multi-Domain Management
Time to complete = 2752s
Multi-Domain Management
Time to complete = 467s; an improvement of 83%
QoS: Ability to Meet Deadline
[Chart: resource usage over time, active vs. inactive periods]
Many Issues Remain
• Identification of meaningful QoS metrics
  – User-orientated
  – Contract-based
• Honouring of SLAs
  – End-to-end service management
  – Resolving conflicts
• Managing workflow (CCGrid 2003)
  – See poster & demo
• But… version 1.0, Condor/GT2-based, is available for download
  – See www.dcs.warwick.ac.uk/~hpsg
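A contract-based, end-to-end deadline metric amounts to checking the whole workflow against the agreed bound. The SLA shape and stage times below are hypothetical, not the project's actual contract format:

```python
# Hypothetical end-to-end SLA check: the deadline is honoured only if
# every stage of the workflow fits inside it, and the remaining margin
# tells the manager how much slack is left for conflict resolution.

def check_sla(stage_times, deadline):
    total = sum(stage_times)
    return {"met": total <= deadline, "margin_s": deadline - total}

print(check_sla([120.0, 300.0, 45.0], deadline=600.0))
```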