Performance-responsive Middleware for Grid Computing
Dr Stephen Jarvis, High Performance Systems Group
University of Warwick, UK
Context
• Funded by / collaborating with
  – UK e-Science Core Programme
  – IBM (Watson, Hursley)
  – NASA (Ames)
  – NEC Europe
  – Los Alamos National Laboratory
• Integrate established performance tools into emerging grid middleware
Grid Resource Management
How do we enable and regulate resource sharing between users?
While…
  – providing a vision of access to full resources
  – hiding detail & unnecessary complexity
  – providing acceptable levels of service
[Layered diagram: applications (workload generation, visualisation…); middleware (discovery, mapping, scheduling, security, accounting…); resources (computing, storage, instrumentation…)]
Managing through Middleware
Key interface between applications & resources
Key Middleware Activities
• Determine what resources are required (advertise)
• Determine what resources are available (discovery)
• Map requirements to available resources (scheduling)
• Maintain a contract of performance (service level agreement)
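The advertise/discover/schedule cycle above can be sketched as a toy matchmaker. All resource names, fields and figures below are invented for illustration and are not part of any real middleware API:

```python
# Toy matchmaker: map a task's advertised requirements onto the
# tightest-fitting discovered resource. All data is illustrative.

def matchmake(requirements, advertised):
    """Return the advertised resource meeting every requirement,
    preferring the tightest fit (fewest spare CPUs)."""
    candidates = [
        r for r in advertised
        if r["cpus"] >= requirements["cpus"]
        and r["mem_gb"] >= requirements["mem_gb"]
    ]
    if not candidates:
        return None  # discovery failed: no resource can host the task
    return min(candidates, key=lambda r: r["cpus"] - requirements["cpus"])

advertised = [
    {"name": "cluster-a", "cpus": 16, "mem_gb": 32},
    {"name": "cluster-b", "cpus": 8,  "mem_gb": 16},
]
task = {"cpus": 8, "mem_gb": 8}
print(matchmake(task, advertised)["name"])  # -> cluster-b (tightest fit)
```

A real matchmaker (e.g. Condor's ClassAd mechanism, used later in this deck) expresses both sides as constraint/rank expressions rather than fixed fields.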
Performance Services
• Intra-domain
  – Lab- / department-based
  – Shared resources under local administration
• Multi-domain
  – Campus- / country-based
  – Wide-area resource and task management
  – Cross-domain
Performance Prediction
• Performance prediction tools aim to predict
  – Execution time
  – Communication usage
  – Data and resource requirements
• Provide a best guess as to how an application will execute on a given resource
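The flavour of such a prediction can be shown with a minimal analytical model: runtime as an ideal parallel compute term plus a communication term that grows with processor count. The formula and all constants here are illustrative stand-ins, not the PACE model:

```python
# Minimal analytical prediction sketch: wall time = compute + comms.
# All constants are invented for illustration; they are not PACE's.

def predict_runtime(work_flops, procs, flops_per_sec,
                    msg_bytes, latency_s, bandwidth_bps):
    compute = work_flops / (procs * flops_per_sec)           # ideal speed-up
    comms = procs * (latency_s + msg_bytes / bandwidth_bps)  # grows with procs
    return compute + comms

# Scaling behaviour: more processors cut compute time but add
# communication, so predicted runtime eventually rises again.
times = {p: predict_runtime(1e12, p, 1e9, 1e8, 1e-4, 1e9)
         for p in (1, 16, 64, 256)}
```

With these invented constants the predicted optimum sits at 64 processors; the point is that a model lets such sweet spots be found before any job runs.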
[Diagram: PACE. The user supplies an application and a target resource; PACE derives an application model and a resource model, and an evaluation engine combines the model parameters with the resource configuration to produce a prediction.]
Why is prediction useful?
• Scaling properties
• Compare runtime options with
  – deadline
  – available resources
  – priority / other jobs
  – etc.
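Comparing runtime options against a deadline reduces to a simple search over predictions. The prediction table below is made up, standing in for a tool's output:

```python
# Sketch: pick the cheapest processor count whose predicted runtime
# meets a deadline. The prediction table is invented for illustration.

predicted = {1: 48.0, 4: 13.0, 8: 7.5, 16: 5.0}  # procs -> predicted minutes

def cheapest_meeting_deadline(predicted, deadline):
    """Fewest processors whose predicted runtime fits the deadline."""
    feasible = [p for p, t in predicted.items() if t <= deadline]
    return min(feasible) if feasible else None

print(cheapest_meeting_deadline(predicted, deadline=10.0))  # -> 8
```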
[Chart: running time (sec) on an SGI Origin2000 versus number of processors (1–16) for sweep3d, fft, improc, closure, jacobi, memsort and cpi]
Allows runtime scenarios to be explored before deployment
1. Intra-Domain Co-Scheduling
• Augment emerging middleware with additional performance information
• Handle predictive and non-predictive tasks
• Use predictive data for system improvement
  – Time to complete tasks / utilisation of resources
  – QoS – ability to meet deadlines
• Scheduler driver, or co-scheduler (called Titan)
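Titan searches for good schedules with a genetic algorithm; a much simpler greedy stand-in is enough to show how predictive data drives placement. The task names below come from the benchmark chart earlier in the deck, but their runtimes and the two-host setup are invented:

```python
# Greedy stand-in for a predictive co-scheduler: place each task
# (longest predicted runtime first) on the host that frees up earliest.
# Titan itself evolves schedules with a genetic algorithm; this is only
# the simplest illustration of predictions driving scheduling decisions.

def greedy_schedule(predicted_runtimes, n_hosts):
    """predicted_runtimes: {task: seconds}. Returns (makespan, assignment)."""
    free_at = [0.0] * n_hosts
    assignment = {}
    for task in sorted(predicted_runtimes,
                       key=predicted_runtimes.get, reverse=True):
        h = free_at.index(min(free_at))   # earliest-free host
        assignment[task] = h
        free_at[h] += predicted_runtimes[task]
    return max(free_at), assignment

tasks = {"sweep3d": 40.0, "fft": 25.0, "jacobi": 20.0, "cpi": 5.0}
makespan, plan = greedy_schedule(tasks, n_hosts=2)
print(makespan)  # -> 45.0
```

Without predictions a scheduler can only react (e.g. FIFO queues); with them it can pack tasks to shorten the overall makespan, which is exactly the gain shown in the deployment figures that follow.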
Intra-Domain Co-Scheduling
• Non-predictive tasks
• Tasks with prediction data
[Architecture diagram (Titan): requests from users or other domain schedulers arrive via a portal; components include a pre-execution engine, a matchmaker (Condor ClassAds), a GA-driven schedule queue, PACE, and a cluster connector feeding Condor-managed resources]
Intra-Domain Deployment
Without co-scheduler: time to complete = 70.08 m
With co-scheduler: time to complete = 35.19 m
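As a sanity check on the deployment figures, the relative improvement works out to roughly 50%:

```python
# Improvement implied by the two completion times on the slide.
without_cs, with_cs = 70.08, 35.19  # minutes
improvement = (without_cs - with_cs) / without_cs
print(f"{improvement:.1%}")  # -> 49.8%
```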
2. Multi-Domain Management
• Publish intra-domain performance data through global information services (MDS)
• Augment the service with an agent system
  – One agent per domain / VO
• When a task is submitted
  – Agents query the IS and negotiate to discover the best domain to run the task
• Scheme tested on a 256-node experimental Grid
  – 16 resource domains; 6 architecture types
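The negotiation step reduces to choosing the domain with the best predicted completion time. The domain names and figures below are illustrative, and the flat dictionary stands in for the MDS query, not its actual schema:

```python
# Sketch of the agent negotiation: each domain's agent reports its
# predicted completion time (here a made-up table standing in for an
# information-service query), and the task goes to the best domain.

def negotiate(domain_predictions):
    """domain_predictions: {domain: predicted completion time (s)}."""
    return min(domain_predictions, key=domain_predictions.get)

predictions = {"domain-03": 1840.0, "domain-07": 467.0, "domain-12": 2752.0}
print(negotiate(predictions))  # -> domain-07
```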
Multi-Domain Management
[Animated diagram: task placement across domains over time]
Multi-Domain Management
Time to complete = 2752s
Multi-Domain Management
Time to complete = 467s; an improvement of 83%
QoS: Ability to Meet Deadline
[Chart: resource usage over time, active vs. inactive periods]
Many Issues Remain
• Identification of meaningful QoS metrics
  – User-orientated
  – Contract-based
• Honouring of SLAs
  – End-to-end service management
  – Resolving conflicts
• Managing workflow (CCGrid 2003)
  – See poster & demo
• But… version 1.0, Condor/GT2-based, is available for download
  – See www.dcs.warwick.ac.uk/~hpsg
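A contract-based, end-to-end deadline metric amounts to checking the whole workflow against the agreed bound. The SLA shape and stage times below are hypothetical, not the project's actual contract format:

```python
# Hypothetical end-to-end SLA check: the deadline is honoured only if
# every stage of the workflow fits inside it, and the remaining margin
# tells the manager how much slack is left for conflict resolution.

def check_sla(stage_times, deadline):
    total = sum(stage_times)
    return {"met": total <= deadline, "margin_s": deadline - total}

print(check_sla([120.0, 300.0, 45.0], deadline=600.0))
```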