High-Performance Grid Computing and Research Networking

Presented by Selim Kalayci

Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/

sadjadi At cs Dot fiu Dot edu

Grid Computing

Acknowledgements

The content of many of the slides in these lecture notes has been adapted from online resources previously prepared by the people listed below. Many thanks!

• Henri Casanova: Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova
• Ian Foster: presentations and tutorials from www.globus.org

Agenda

• Grid Computing
• Grid Middleware - Globus
• Security in Globus
• Data Management
• Execution Management
• Monitoring
• Metaschedulers - Gridway

Multiple Computers

• Adding CPUs to a single computer becomes very expensive
• How about using multiple computers together?
  • Linux clusters (60% of the TOP500 list)
  • Blue Gene: 30K computers

Beyond the machine room?

• Need more capacity than is available at (most) single sites
  • Everyone would like a 10K-node 100 GHz cluster
  • Very expensive (cooling, power)
  • More economical to have multiple sites
• Need to locate available resources now
• Data and instruments are inherently distributed

[Figure: scales of resource sharing: machine room, campus, nation]

Grid Computing

A dynamic, multi-institutional network of computers that come together to share resources for the purpose of coordinated problem solving.

[Figure: applications using resources across institutional boundaries]

Achieved through:

1. Open general-purpose protocols

2. Standard interfaces

Layers in Grid

A Grid Checklist

A grid:
• coordinates resources that are not subject to centralized control …
• … using standard, open, general-purpose protocols and interfaces …
• … to deliver nontrivial qualities of service.

Virtual Organizations

A group of individuals or institutions, defined by a set of sharing rules, that share the resources of a Grid for a common goal. Examples: application service providers, storage service providers, databases, crisis management teams, consultants.

How is a grid different?

• Grids focus on site autonomy
• Grids involve heterogeneity
• Grids involve more resources than just computers and networks
• Grids focus on the user

Grid Infrastructure

• Distributed management
  • of physical resources
  • of software services
  • of communities and their policies
• Unified treatment
  • Build on a Web services framework
  • Use WS-RF, WS-Notification (or WS-Transfer/WS-Management) to represent and access state
  • Common management abstractions and interfaces

Globus is Open Source Grid Infrastructure

• Implements key Web services standards: state, notification, security, …
• Software for Grid infrastructure
  • Service-enables new and existing resources, e.g., GRAM on computers, GridFTP on storage systems, custom application services
  • Uniform abstractions and mechanisms
• Tools to build applications that exploit Grid infrastructure: registries, security, data management, …
• Enabler of a rich tool and service ecosystem

GLOBUS TOOLKIT 4 – GT4

Open source toolkit developed by The Globus Alliance that allows us to build Grid applications.

Organized as a collection of loosely coupled components.

Consists of services, programming libraries, and development tools.

GT Domain Areas

• Core runtime: infrastructure for building new services
• Security: apply uniform policy across distinct systems
• Execution management: provision, deploy, and manage services
• Data management: discover, transfer, and access large data
• Monitoring: discover and monitor dynamic services

GT4 Components

WSRF & WS-Notification

• Naming and bindings (basis for virtualization)
  • Every resource can be uniquely referenced and has one or more associated services for interacting with it
• Lifecycle (basis for fault-resilient state management)
  • Resources are created by services following the factory pattern
  • Resources are destroyed immediately or at a scheduled time
• Information model (basis for monitoring and discovery)
  • Resource properties associated with resources
  • Operations for querying and setting this information
  • Asynchronous notification of changes to properties
• Service groups (basis for registries and collective services)
  • Group membership rules and membership management
• Base fault type
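The WSRF ideas above (unique naming, factory-created resources, resource properties with change notification, and immediate or scheduled destruction) can be illustrated with a small in-memory model. This is a conceptual sketch only, not the Globus Java WSRF API: the classes `Resource` and `ResourceFactory`, the string keys standing in for endpoint references (EPRs), and the synchronous callbacks standing in for asynchronous notification are all hypothetical simplifications.

```python
import itertools
from typing import Any, Callable, Dict, List, Optional

Callback = Callable[[str, str, Any], None]  # (resource key, property name, new value)


class Resource:
    """A stateful resource exposing named resource properties (RPs)."""

    def __init__(self, key: str, termination_time: Optional[float]) -> None:
        self.key = key                            # unique name, standing in for an EPR
        self.termination_time = termination_time  # None means "no scheduled destruction"
        self._props: Dict[str, Any] = {}
        self._subscribers: List[Callback] = []

    def get_property(self, name: str) -> Any:
        return self._props[name]

    def set_property(self, name: str, value: Any) -> None:
        self._props[name] = value
        # Notify every subscriber of the change (synchronously in this sketch).
        for callback in self._subscribers:
            callback(self.key, name, value)

    def subscribe(self, callback: Callback) -> None:
        self._subscribers.append(callback)


class ResourceFactory:
    """Factory service that creates resources and destroys them."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.resources: Dict[str, Resource] = {}

    def create(self, termination_time: Optional[float] = None) -> str:
        key = f"resource-{next(self._ids)}"
        self.resources[key] = Resource(key, termination_time)
        return key

    def destroy(self, key: str) -> None:
        """Immediate destruction."""
        del self.resources[key]

    def sweep(self, now: float) -> List[str]:
        """Scheduled destruction: remove resources whose termination time has passed."""
        expired = [key for key, res in self.resources.items()
                   if res.termination_time is not None and res.termination_time <= now]
        for key in expired:
            del self.resources[key]
        return expired
```

A client would call `create()`, keep the returned key the way a real client keeps an EPR, and `subscribe()` to property changes; a container would call `sweep()` periodically to enforce scheduled termination.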

Security Services

• Form the underlying communication medium for all the services
• Secure authentication and authorization
• Single sign-on
  • The user need not explicitly authenticate every time a service is requested
• Uniform credentials
• Example: GSI (Grid Security Infrastructure)

Grid Security Infrastructure - GSI

• Use GSI as a standard mechanism for bridging disparate security mechanisms
  • Doesn't solve the trust problem, but now things speak the same protocol and understand each other's identity credentials
• Basic support for delegation and policy distribution
• Translate from other mechanisms to/from GSI as needed
• Convert from GSI identity to local identity for authorization

Grid Security Infrastructure - GSI

• Based on standard PKI technologies
  • CAs allow one-way, lightweight trust relationships (not just site-to-site)
• SSL protocol or WS-Security for authentication and message protection
• X.509 certificates for asserting identity for users, services, hosts, etc.
• Proxy certificates
  • GSI extension to X.509 certificates for delegation and single sign-on

Gridmap file

• A gridmap file at each site maps the grid ID of a user to a local ID
  • The grid ID of the user is the subject of his/her grid user certificate
  • The local ID is site-specific; multiple grid IDs can be mapped to a single local ID
    • Usually a local ID exists for each VO participating in that grid effort
  • The local IDs are then used to implement site-specific policies
    • Priorities, etc.

Gridmap file entry

• The gridmap file is maintained by the site administrator
• Each entry maps a Grid DN (distinguished name of the user; subject name) to local user names

# Distinguished Name    Local username
"/DC=org/DC=doegrids/OU=People/CN=Laukik Chitnis 712960" ivdgl
"/DC=org/DC=doegrids/OU=People/CN=Richard Cavanaugh 710220" grid3
"/DC=org/DC=doegrids/OU=People/CN=JangUk In 712961" ivdgl
"/DC=org/DC=doegrids/OU=People/CN=Jorge Rodriguez 690211" osg
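To make the entry format concrete, here is a small parser that turns gridmap text into a DN-to-accounts mapping. It is a sketch under stated assumptions: each entry is a double-quoted DN followed by whitespace and one or more comma-separated local names (the first treated as the default), and lines starting with `#` are comments. The function names `parse_gridmap` and `local_user` are hypothetical; Globus itself performs this mapping internally, and real gridmap files may contain escaping rules this sketch ignores.

```python
import re
from typing import Dict, List

# Assumed entry shape: "<DN>" user1[,user2,...]
_ENTRY = re.compile(r'^\s*"(?P<dn>[^"]+)"\s+(?P<users>\S+)\s*$')


def parse_gridmap(text: str) -> Dict[str, List[str]]:
    """Map each Grid DN to its list of local user names."""
    mapping: Dict[str, List[str]] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: malformed gridmap entry: {line!r}")
        mapping[match.group("dn")] = match.group("users").split(",")
    return mapping


def local_user(mapping: Dict[str, List[str]], dn: str) -> str:
    """Return the default (first) local account for a DN; KeyError if unmapped."""
    return mapping[dn][0]
```

Because several DNs may map to the same account (for example, two users both mapped to `ivdgl`), a site can apply one VO-wide policy to that single local account.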

How to create and use an Identity (1)

Run the following command to generate a private key and a certificate request for a personal grid identity:

% grid-cert-request

This creates the following files in $HOME/.globus:

• usercert_request.pem (certificate request, to be signed by the CA)
• userkey.pem (private key, encrypted)
• usercert.pem (signed public certificate; empty until the CA returns it)

How to create and use an Identity (2)

After you have created the request, mail it to the local certificate authority (CA):

% cat $HOME/.globus/usercert_request.pem | mail [email protected] (or [email protected])

The CA will then mail you back a signed certificate, which you should save as $HOME/.globus/usercert.pem (it can take up to a day for the CA to process the request).

Commands to log in / log out

• grid-proxy-init: "logs you into" the Globus system
• grid-proxy-info: shows your status
• grid-proxy-destroy: logs you out

A proxy is like a temporary ticket to use the Grid; by default it is valid for 12 hours.

Once this is done, you should be able to run "grid jobs":

% globus-job-run site-name command
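Since a proxy expires (12 hours by default), long-running scripts often check the remaining lifetime before submitting work. The sketch below parses the `timeleft` line of `grid-proxy-info` output and decides whether to renew. It assumes the output contains a line of the form `timeleft : H:MM:SS`; that format, the one-hour threshold, and the helper names are assumptions rather than a documented interface, so check your installation's output before relying on it.

```python
import re
import subprocess
from typing import Optional

# Assumed line shape in grid-proxy-info output: "timeleft : 11:59:58"
_TIMELEFT = re.compile(r"^timeleft\s*:\s*(\d+):(\d{2}):(\d{2})", re.MULTILINE)


def parse_timeleft(info_output: str) -> int:
    """Return the proxy's remaining lifetime in seconds (0 if not found)."""
    match = _TIMELEFT.search(info_output)
    if match is None:
        return 0  # no timeleft line: treat as absent/expired
    hours, minutes, seconds = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def needs_renewal(info_output: str, min_seconds: int = 3600) -> bool:
    """True if less than min_seconds of proxy lifetime remain."""
    return parse_timeleft(info_output) < min_seconds


def current_proxy_info() -> Optional[str]:
    """Run grid-proxy-info and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(["grid-proxy-info"], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout
```

A wrapper script might call `current_proxy_info()` and, if it returns None or `needs_renewal(...)` is true, ask the user to run `grid-proxy-init` again.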

GT4 Data Management

• Stage/move large data to/from nodes
  • GridFTP, Reliable File Transfer (RFT)
  • Alone, and integrated with GRAM
• Locate data of interest
  • Replica Location Service (RLS)
• Replicate data for performance/reliability
  • Data Replication Service (DRS)
• Provide access to diverse data sources
  • File systems, parallel file systems, hierarchical storage: GridFTP
  • Databases: OGSA-DAI

GridFTP

What is GridFTP?

• A secure, robust, fast, efficient, standards-based, widely accepted data transfer protocol
• A protocol: multiple independent implementations can interoperate
  • This works: both the Condor Project at UW-Madison and Fermilab have home-grown servers that work with ours
  • Many people have developed clients independent of the Globus Project
• We also supply a reference implementation:
  • Server
  • Client tools (globus-url-copy)
  • Development libraries

globus-url-copy

• GridFTP-compliant client from the Globus team
• Copies files from one URL to another URL
  • One URL is usually a gsiftp:// URL
  • The other URL is usually a file:/ URL

To move a file from a remote GridFTP-enabled server to the local machine:

% globus-url-copy gsiftp://gcb.fiu.edu/tmp/jt file:/home/skala001/jt

To put a file onto the server, reverse the URLs:

% globus-url-copy file:/home/skala001/jt gsiftp://gcb.fiu.edu/tmp/jt

Monitor performance using the -vb flag:

% globus-url-copy -vb gsiftp://gcb.fiu.edu/tmp/jt file:/home/skala001/jt
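When transfers are driven from a script, it helps to build the `globus-url-copy` command in one place and reject unexpected URL schemes early. The helper below is a hypothetical wrapper (the name `url_copy_command` and the allowed-scheme set are assumptions); it only builds the argument list and does not run anything. It uses the `-vb` flag shown above and, optionally, `-p` for the number of parallel data streams.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Schemes this sketch accepts; real GridFTP clients may support more.
ALLOWED_SCHEMES = {"gsiftp", "file"}


def url_copy_command(source: str, dest: str, verbose: bool = False,
                     parallel: Optional[int] = None) -> List[str]:
    """Build a globus-url-copy argument list for copying source to dest."""
    for url in (source, dest):
        if urlparse(url).scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URL scheme in {url!r}")
    argv = ["globus-url-copy"]
    if verbose:
        argv.append("-vb")              # report transfer performance
    if parallel is not None:
        argv += ["-p", str(parallel)]   # number of parallel data streams
    argv += [source, dest]
    return argv
```

The result can be passed directly to `subprocess.run(..., check=True)`; swapping `source` and `dest` turns a download into an upload, exactly as with the commands above.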

Reliable File Transfer - RFT

• WSRF-compliant, fault-tolerant, high-performance data transfer service
  • Soft state
  • Notifications/query
• Reliability on top of the high performance provided by GridFTP
  • Fire and forget
  • Integrated automatic failure recovery: network-level failures, system-level failures, etc.
• Essentially a data transfer scheduler with FIFO as its queue policy
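The "scheduler with a FIFO queue plus automatic failure recovery" idea can be sketched as a queue that retries failed transfers. This is an illustration, not RFT's implementation: the function name `run_transfers`, the `do_transfer` callback (which would invoke GridFTP in practice), the retry limit, and the choice to requeue a failed request at the tail are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Transfer = Tuple[str, str]  # (source URL, destination URL)


def run_transfers(requests: List[Transfer],
                  do_transfer: Callable[[str, str], bool],
                  max_attempts: int = 3) -> Tuple[List[Transfer], List[Transfer]]:
    """Process transfer requests in FIFO order, retrying failures.

    A failed request is put back at the tail of the queue until it has
    been attempted max_attempts times. Returns (succeeded, failed).
    """
    queue: Deque[Tuple[Transfer, int]] = deque((req, 0) for req in requests)
    succeeded: List[Transfer] = []
    failed: List[Transfer] = []
    while queue:
        request, attempts = queue.popleft()
        if do_transfer(*request):
            succeeded.append(request)
        elif attempts + 1 < max_attempts:
            queue.append((request, attempts + 1))  # recover: retry later
        else:
            failed.append(request)                 # give up after max_attempts
    return succeeded, failed
```

"Fire and forget" corresponds to the client handing over the request list and later inspecting the outcome; in RFT that outcome is exposed through resource properties and notifications rather than a return value.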

RFT

[Figure: RFT architecture. An RFT client exchanges SOAP messages with the RFT service and optionally receives notifications. Two GridFTP servers are shown, each with a protocol interpreter, a master DSI, a slave DSI, an IPC link and IPC receiver, and data channels.]

Execution Management

• Common WS interface to schedulers
  • Unix, Condor, LSF, PBS, SGE, …
• More generally: an interface for process execution management
  • Lay down the execution environment
  • Stage data
  • Monitor and manage the lifecycle
  • Kill it, clean up
• A basis for application-driven provisioning

Grid Job Management Goals

Provide a service to securely:
• Create an environment for a job
• Stage files to/from the environment
• Cause execution of job process(es)
  • Via various local resource managers
• Monitor execution
• Signal important state changes to the client
• Enable client access to output files
  • Streaming access during execution

GRAM

• GRAM: Globus Resource Allocation and Management
  • A Globus Toolkit component for Grid job management
• GRAM is a unifying remote interface to resource managers
  • Yet preserves local site security/control
• GRAM is for stateful job control
  • Reliable operation
  • Asynchronous monitoring and control
  • Remote credential management
  • File staging via RFT and GridFTP

GT4 WS GRAM Architecture

[Figure: GT4 WS GRAM architecture. A client invokes job functions and delegates credentials to the GRAM services and the Delegation service in a GT4 Java container on the service host(s). The GRAM services use sudo and a GRAM adapter for local job control through the local scheduler, which runs the user job on the compute element; the SEG reports job events. File staging goes through RFT file transfer requests to GridFTP (FTP control and FTP data) on remote storage element(s).]

The delegated credential can be made available to the application.

The delegated credential can also be used to authenticate with RFT.

The delegated credential can also be used to authenticate with GridFTP.

A Simple Example

Command example:

% globusrun-ws -submit -c /bin/date
Submitting job...Done.
Job ID: uuid:002a6ab8-6036-11d9-bae6-0002a5ad41e5
Termination time: 01/07/2005 22:55 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.

A successful submission creates a new ManagedJob resource with its own unique EPR (endpoint reference) for messaging.

Use the -o option to save the EPR to a file:

% globusrun-ws -submit -o job.epr -c /bin/date

A Simple Example (2)

To see the output, use the -s (stream) option:

% globusrun-ws -submit -s -c /bin/date
Termination time: 06/14/2007 18:07 GMT
Current job state: Active
Current job state: CleanUp-Hold
Wed Jun 13 14:07:54 EDT 2007
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.

If you want to send the output to a file, use the -so option:

% globusrun-ws -submit -s -so job.out -c /bin/date
…
% cat job.out
Wed Jun 13 14:07:54 EDT 2007

A Simple Example (3)

Submitting your job to different schedulers:

• Fork

% globusrun-ws -submit -Ft Fork -s -c /bin/date

(Actually, the default is Fork, so you can skip it in this case.)

• SGE

% globusrun-ws -submit -Ft SGE -s -c /bin/hostname

Batch Job Submissions

% globusrun-ws -submit -batch -o job_epr -c /bin/sleep 50
Submitting job...Done.
Job ID: uuid:f9544174-60c5-11d9-97e3-0002a5ad41e5
Termination time: 01/08/2005 16:05 GMT

% globusrun-ws -status -j job_epr
Current job state: Active

% globusrun-ws -status -j job_epr
Current job state: Done

% globusrun-ws -kill -j job_epr
Requesting original job description...Done.
Destroying job...Done.
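With batch submission the client returns immediately, so something has to poll `globusrun-ws -status -j job_epr` until the job finishes. The sketch below does that. It assumes the status output contains a `Current job state: <State>` line as above and treats `Done` and `Failed` as terminal; the function names, the poll interval, and the injectable `get_state`/`sleep` parameters (added so the loop can be tested without a Grid) are all hypothetical.

```python
import re
import subprocess
import time
from typing import Callable

TERMINAL_STATES = {"Done", "Failed"}
_STATE = re.compile(r"Current job state: (\S+)")


def query_state(epr_file: str) -> str:
    """Ask globusrun-ws for the state of the job whose EPR is in epr_file."""
    result = subprocess.run(["globusrun-ws", "-status", "-j", epr_file],
                            capture_output=True, text=True, check=True)
    match = _STATE.search(result.stdout)
    if match is None:
        raise RuntimeError(f"no job state in output: {result.stdout!r}")
    return match.group(1)


def wait_for_job(epr_file: str, poll_seconds: float = 10.0,
                 get_state: Callable[[str], str] = query_state,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state(epr_file)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

Polling is the simplest approach; the slides note that GRAM also supports asynchronous monitoring, which avoids repeated status requests.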

Complete Factory Contact

• Overrides the default EPR
  • Selects a different host/service
• Uses a "contact" shorthand for convenience
  • Relies on proprietary knowledge of the EPR format!

Command example:

% globusrun-ws -submit -F gcb.fiu.edu -c /bin/date

Read RSL from File

Command:

% globusrun-ws -submit -f touch.xml

Contents of the touch.xml file:

<job>
    <executable>/bin/touch</executable>
    <argument>touched_it</argument>
</job>

Resource Specification Language (RSL)

• RSL is the language used by clients to submit a job.
  • All job submission requests are described in RSL, including the executable file and arguments.
  • You can specify the type and capabilities of the resources to execute your job.
  • You can also coordinate stage-in and stage-out operations through RSL.
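Because GT4 job descriptions are XML, they can be generated programmatically instead of written by hand. The sketch below builds a `<job>` element with the same tags used in the touch.xml and staging examples (`executable`, `directory`, `argument`, `stdout`, `stderr`). The function name and the element order it emits (copied from the staging example) are assumptions; consult the GRAM job description schema for the full set of elements and their required order.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def build_job_description(executable: str, arguments: List[str],
                          directory: Optional[str] = None,
                          stdout: Optional[str] = None,
                          stderr: Optional[str] = None) -> str:
    """Serialize a minimal <job> description to an XML string."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    if directory is not None:
        ET.SubElement(job, "directory").text = directory
    for arg in arguments:
        ET.SubElement(job, "argument").text = arg  # one element per argument
    if stdout is not None:
        ET.SubElement(job, "stdout").text = stdout
    if stderr is not None:
        ET.SubElement(job, "stderr").text = stderr
    return ET.tostring(job, encoding="unicode")
```

For example, writing `build_job_description("/bin/touch", ["touched_it"])` to touch.xml reproduces the file used with `globusrun-ws -submit -f touch.xml`.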

Common/useful options

• globusrun-ws -J
  • Perform delegation as necessary for the job
• globusrun-ws -S
  • Perform delegation as necessary for the job's file staging
• globusrun-ws -s
  • Stream stdout/stderr to the terminal during job execution
• globusrun-ws -self
  • Useful for testing, when you have started the service using your own credentials instead of host credentials

Staging job

<job>
    <executable>/bin/echo</executable>
    <directory>/tmp</directory>
    <argument>Hello</argument>
    <stdout>job.out</stdout>
    <stderr>job.err</stderr>
    <fileStageOut>
        <transfer>
            <sourceUrl>file:///tmp/job.out</sourceUrl>
            <destinationUrl>gsiftp://host.domain:2811/tmp/stage.out</destinationUrl>
        </transfer>
    </fileStageOut>
</job>
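Before submitting a staging job, it can be useful to list which files it will move and where. The function below reads the `<fileStageOut>/<transfer>` elements of a job description like the one above and returns (source, destination) pairs. The name `stage_out_transfers` is hypothetical, and the sketch only handles stage-out elements without namespaces, as in this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def stage_out_transfers(job_xml: str) -> List[Tuple[str, str]]:
    """Return (sourceUrl, destinationUrl) pairs from a job's <fileStageOut>."""
    root = ET.fromstring(job_xml)
    pairs: List[Tuple[str, str]] = []
    for transfer in root.findall("./fileStageOut/transfer"):
        # strip() tolerates whitespace around the URLs inside the elements.
        source = (transfer.findtext("sourceUrl") or "").strip()
        dest = (transfer.findtext("destinationUrl") or "").strip()
        if not source or not dest:
            raise ValueError("transfer is missing sourceUrl or destinationUrl")
        pairs.append((source, dest))
    return pairs
```

A pre-submission check might confirm that every destination uses `gsiftp://`, since these transfers are carried out through RFT and GridFTP.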

RSL Variables

• Enable late binding of values
  • Values are resolved by the GRAM service
• System-specific variables: ${GLOBUS_USER_HOME}, ${GLOBUS_LOCATION}, ${GLOBUS_SCRATCH_DIR}
  • ${GLOBUS_SCRATCH_DIR} is an alternative directory that is shared with the compute node, typically providing more space than the user's HOME directory

RSL Variable Example

<job>
    <executable>/bin/echo</executable>
    <argument>HOME is ${GLOBUS_USER_HOME}</argument>
    <argument>SCRATCH = ${GLOBUS_SCRATCH_DIR}</argument>
    <argument>GL is ${GLOBUS_LOCATION}</argument>
    <stdout>${GLOBUS_USER_HOME}/echo.stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/echo.stderr</stderr>
</job>
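Late binding means the `${...}` references stay in the job description until the GRAM service fills in site-specific values. The sketch below imitates that step with Python's `string.Template`, which uses the same `${NAME}` syntax. The function name and the example values in `SITE_VALUES` are hypothetical (real values are decided by each GRAM installation), and note that `Template` would also expand a bare `$NAME`, which GRAM may not.

```python
import xml.etree.ElementTree as ET
from string import Template
from typing import Mapping

# Hypothetical site values; a real GRAM service supplies its own.
SITE_VALUES = {
    "GLOBUS_USER_HOME": "/home/skala001",
    "GLOBUS_LOCATION": "/opt/globus",
    "GLOBUS_SCRATCH_DIR": "/scratch/skala001",
}


def resolve_job_variables(job_xml: str, values: Mapping[str, str]) -> str:
    """Substitute ${NAME} references in the text of every element."""
    root = ET.fromstring(job_xml)
    for element in root.iter():
        if element.text:
            # substitute() raises KeyError for a variable with no value.
            element.text = Template(element.text).substitute(values)
    return ET.tostring(root, encoding="unicode")
```

Resolving on the service side, rather than in the client, is what lets the same job description run unchanged at sites with different directory layouts.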

GRAM Commands

Run a job using:

% globus-job-run localhost /bin/date

Submit to Fork:

% globus-job-run localhost/jobmanager-fork /bin/date

Submit a batch job using:

% globus-job-submit localhost /bin/sleep 50

Related commands: globus-job-status, globus-job-get-output, globus-job-cancel

Running a Script in GRAM

Add this script to a file named job:

#! /bin/csh -f
echo "Hello World from "
$GLOBUS_LOCATION/bin/globus-hostname
echo arg 1 = $1
echo arg 2 = $2
echo -n "sum is "
echo "$1+$2" | /usr/bin/bc -l

Make job executable:

% chmod +x job

Run the job:

% globus-job-run localhost ./job 5 6

You should get:

Hello World from
gcb.fiu.edu
arg 1 = 5
arg 2 = 6
sum is 11

What is MDS4?

• A Grid-level monitoring system used most often for resource selection and error notification
  • Helps users/agents identify the host(s) on which to run an application
  • Makes sure that they are up and running correctly
• Uses standard interfaces to provide publishing of data, discovery, and data access, including subscription/notification
  • WS-ResourceProperties, WS-BaseNotification, WS-ServiceGroup
• Functions as an hourglass to provide a common interface to lower-level monitoring tools

MDS4 Components

• Information providers
  • Monitoring is a part of every WSRF service
  • Non-WS services can also be used
• Higher-level services
  • Index Service: a way to aggregate data
  • Trigger Service: a way to be notified of changes
  • Both built on a common aggregator framework
• Clients
  • WebMDS
• All of the tools are schema-agnostic, but interoperability needs a well-understood common language

Information Providers

• GT4 information providers collect information from some system and make it accessible as WSRF resource properties
• A growing number of information providers
  • Ganglia, CluMon, Nagios
  • SGE, LSF, OpenPBS, PBSPro, Torque
• Many opportunities to build additional ones
  • E.g., network monitoring, storage systems, various sensors

Information Providers

• Data sources for the higher-level services
• Some are built into services
  • Any WSRF-compliant service publishes some data automatically
  • WS-RF gives us standard Query/Subscribe/Notify interfaces
  • GT4 services: the ServiceMetaDataInfo element includes start time, version, and service type name
• Most of them also publish additional useful information as resource properties

Information Providers: GT4 Services

• Reliable File Transfer Service (RFT)
  • Service status data, number of active transfers, transfer status, information about the resource running the service
• Community Authorization Service (CAS)
  • Identifies the VO served by the service instance
• Replica Location Service (RLS)
  • Note: not a WS
  • Location of replicas on physical storage systems (based on user registrations) for later queries

Information Providers (2)

• Other sources of data
  • Any executables
  • Other (non-WS) services
  • Interface to another archive or data store
  • File scraping
• Just need to produce a valid XML document
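Since any executable can act as a data source as long as it emits valid XML, a provider can be very small. The script below collects basic host data (name, OS, processor count, load average) and prints it as an XML document. The element and attribute names are hypothetical and are not the GLUE schema; how such a script is registered with MDS4 depends on the local configuration and is not shown.

```python
import os
import platform
import socket
import xml.etree.ElementTree as ET


def host_info_xml() -> str:
    """Collect basic host data and return it as an XML document string."""
    root = ET.Element("HostInfo")  # hypothetical element names, not GLUE
    ET.SubElement(root, "Name").text = socket.gethostname()
    ET.SubElement(root, "OperatingSystem",
                  name=platform.system(), release=platform.release())
    ET.SubElement(root, "Processors", count=str(os.cpu_count() or 0))
    if hasattr(os, "getloadavg"):  # load average is only available on Unix
        try:
            one, five, fifteen = os.getloadavg()
        except OSError:
            pass  # load average could not be read; omit it
        else:
            ET.SubElement(root, "Load", last1min=f"{one:.2f}",
                          last5min=f"{five:.2f}", last15min=f"{fifteen:.2f}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(host_info_xml())
```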

Information Providers: Cluster and Queue Data

• Interfaces to Hawkeye, Ganglia, CluMon, Nagios
  • Basic host data (name, ID), processor information, memory size, OS name and version, file system data, processor load data
  • Some Condor/cluster-specific data
  • This can also be done for sub-clusters, not just at the host level
• Interfaces to PBS, Torque, LSF
  • Queue information, number of CPUs available and free, job count information, some memory statistics, and host info for the head node of the cluster

Higher-Level Services

• Index Service
  • Caching registry
• Trigger Service
  • Warns on error conditions
• All of these have common needs and are built on a common framework

MDS4 Index Service

• The Index Service is both a registry and a cache
  • Datatype and data provider info, like a registry (UDDI)
  • Last value of data, like a cache
• Subscribes to information providers
• In-memory by default
  • A DB backing store is currently being discussed to allow for very large indexes
• Can be set up for a site or set of sites, a specific set of project data, or for user-specific data only
• Can be a multi-rooted hierarchy
  • No *global* index
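The "registry plus cache" description can be made concrete with a toy index: providers register with a datatype (the registry part), and each notification overwrites that provider's last value (the cache part). This is an in-memory sketch of the concept only; the class `IndexService`, its methods, and the provider names in the usage are hypothetical and do not correspond to the MDS4 API.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class IndexService:
    """Toy index: a registry of providers plus a cache of their last values."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._registry: Dict[str, str] = {}             # provider -> datatype
        self._cache: Dict[str, Tuple[Any, float]] = {}  # provider -> (value, received at)

    def register(self, provider: str, datatype: str) -> None:
        self._registry[provider] = datatype

    def on_notification(self, provider: str, value: Any) -> None:
        """Called when a subscribed provider publishes a new value."""
        if provider not in self._registry:
            raise KeyError(f"unregistered provider: {provider}")
        self._cache[provider] = (value, self._clock())  # keep only the last value

    def lookup(self, datatype: str) -> Dict[str, Optional[Any]]:
        """Return the last cached value (or None) of every provider of a datatype."""
        return {provider: self._cache.get(provider, (None, 0.0))[0]
                for provider, kind in self._registry.items() if kind == datatype}
```

Keeping only the last value is what makes the index cheap to query: clients read the cache instead of contacting every provider directly.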

Container-wide Index

• Each GT4 container has a local index
  • Collects information about the services in that container
• Each service registers to the container index when correctly configured

VO-wide indexes

• Local indexes can be registered to VO-wide indexes
• A config file at the resource container or at the VO index contains the URL of the resource or VO index
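The container-to-VO registration above forms a hierarchy: each local index forwards what it learns to an upstream index named in its configuration. The sketch models that with a direct object reference standing in for the configured URL. The class `Index`, the name-prefixing scheme used to avoid key collisions, and the service names in the usage are hypothetical simplifications of what MDS4 actually does.

```python
from typing import Dict, List, Optional


class Index:
    """An index that forwards its entries to an upstream (e.g. VO-wide) index."""

    def __init__(self, name: str, upstream: Optional["Index"] = None) -> None:
        self.name = name
        self.upstream = upstream          # in MDS4 this would come from a config URL
        self.entries: Dict[str, dict] = {}

    def publish(self, service: str, data: dict) -> None:
        self.entries[service] = data
        if self.upstream is not None:
            # Prefix with this index's name so entries from different
            # containers do not collide in the upstream index.
            self.upstream.publish(f"{self.name}/{service}", data)

    def services(self) -> List[str]:
        return sorted(self.entries)
```

Because any index can have its own upstream, several independent top-level VO indexes can coexist, which matches the "multi-rooted hierarchy, no global index" design.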

MDS4 Trigger Service

• Subscribes to a set of resource properties
• Evaluates that data against a set of pre-configured conditions (triggers)
• When a condition matches, an action occurs
  • Email is sent to a pre-defined address
  • A website is updated
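The subscribe, evaluate, act cycle can be sketched as a list of condition/action pairs applied to each new snapshot of resource properties. In this illustration, the `Trigger` dataclass, `evaluate_triggers`, the property names, and the actions (which just record messages instead of sending email or updating a page) are all hypothetical; MDS4 configures its triggers declaratively rather than in Python.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]  # resource property name -> latest value


@dataclass
class Trigger:
    name: str
    condition: Callable[[Snapshot], bool]      # evaluated on each new snapshot
    action: Callable[[str, Snapshot], None]    # e.g. send email, update a web page


def evaluate_triggers(snapshot: Snapshot, triggers: List[Trigger]) -> List[str]:
    """Run every trigger whose condition matches; return the names that fired."""
    fired: List[str] = []
    for trigger in triggers:
        if trigger.condition(snapshot):
            trigger.action(trigger.name, snapshot)
            fired.append(trigger.name)
    return fired
```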

Information models

• Each information source publishes information in XML according to some schema
• Sometimes the author of the information source or the grid resource defines that schema
• There are some collaborative efforts to define common schemas, for example GLUE for compute information
• Schemas are typically written in XSD, but this is not required

GLUE schema

• Grid Laboratory Uniform Environment
• Schema developed by DataTAG for EU/USA interoperability
• Modelled in UML
• Implementations
  • XML version for MDS: information collected from various cluster monitoring systems
  • Also LDAP and SQL versions (used by older versions of MDS and other monitoring systems)

MDS user interfaces

• General-purpose UIs
  • Web browser-based interface: WebMDS
  • Command-line tools
• Specialized clients
  • Brokers

WebMDS

Web-based interface to display monitoring information

Easily extensible for new data using XSLT


70

MDS4 - Command Line
XPath queries can be used to query the Index Service.
To see all information collected in the Index Service:
wsrf-query -s https://gcb.fiu.edu:8443/wsrf/services/DefaultIndexService
To see the number of free CPUs:
wsrf-query -s https://gcb.fiu.edu:8443/wsrf/services/DefaultIndexService "number(//*/glue:GLUECE//glue:ComputingElement/glue:State/@glue:FreeCPUs)"
To see the total number of jobs (running and waiting):
wsrf-query -s https://gcb.fiu.edu:8443/wsrf/services/DefaultIndexService "number(//*[local-name()='GLUECE']//glue:ComputingElement//glue:State/@glue:TotalJobs)"
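To see what these two XPath expressions compute, the sketch below evaluates rough equivalents with Python's ElementTree against a small hand-written document. The sample document and its namespace URI are hypothetical, not real Index Service output. ElementTree supports only a subset of XPath (no `number()` or `local-name()`), so the `{*}` namespace wildcard stands in for the `local-name()` test and `first_number` mimics `number()`. Note that in XPath 1.0, `number()` applied to a node-set converts only the first node in document order, so when several computing elements are registered these queries report the first one rather than a total.

```python
# Approximate ElementTree equivalent of the wsrf-query XPath expressions above.
# The sample XML and namespace are HYPOTHETICAL stand-ins for Index Service data.
import xml.etree.ElementTree as ET

GLUE = "urn:example:glue-ce"  # placeholder namespace bound to the glue: prefix

SAMPLE = f"""
<IndexRP xmlns:glue="{GLUE}">
  <Entry>
    <glue:GLUECE>
      <glue:ComputingElement glue:Name="default">
        <glue:State glue:FreeCPUs="12" glue:TotalJobs="30"/>
      </glue:ComputingElement>
    </glue:GLUECE>
  </Entry>
</IndexRP>
"""


def first_number(root, attr):
    """Mimic XPath number(node-set): value of the first matching node, else NaN."""
    # '{*}' matches any namespace, much like the local-name()='GLUECE' test.
    for ce in root.iterfind(".//{*}GLUECE//{*}ComputingElement"):
        for state in ce.iterfind(".//{*}State"):
            value = state.get(f"{{{GLUE}}}{attr}")
            if value is not None:
                return float(value)
    return float("nan")


root = ET.fromstring(SAMPLE.strip())
print(first_number(root, "FreeCPUs"))   # 12.0
print(first_number(root, "TotalJobs"))  # 30.0
```

Requires Python 3.8 or later for the `{*}` wildcard.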


71

Configuring GRAM to use a cluster monitoring system
GRAM extracts and publishes cluster information from either Ganglia or Hawkeye.
Configuration file: $GLOBUS_LOCATION/etc/globus_wsrf_mds_usefulrp/gluerp.xml
The <defaultProvider> tag specifies whether to use Ganglia, Hawkeye, or neither.
Uncomment the appropriate example supplied in the config file.


72

Agenda
Grid Computing
Grid Middleware - Globus
Security in Globus
Data Management
Execution Management
Monitoring
Metaschedulers - GridWay


73

Grid Meta-Scheduler
Local schedulers are not suited to the Grid environment.
Meta-scheduler(s) should interact with lower-level schedulers to make scheduling decisions.
Resources (computational, data, network, etc.) and jobs are other entities that the meta-scheduler should be aware of and interact with.
The meta-scheduler uses existing Grid services.
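One way to picture a meta-scheduler on top of lower-level schedulers is an adapter layer: each local scheduler is wrapped behind the same small interface, and the meta-scheduler only decides *where* a job goes, leaving local queueing to each site. This is a minimal sketch under heavy simplification. The adapter classes and the "fewest queued jobs" rule are hypothetical, and the adapters only record jobs in memory instead of calling real submission commands. A real meta-scheduler such as GridWay would reach sites through Grid services (e.g. GRAM) rather than talking to local schedulers directly.

```python
# HYPOTHETICAL meta-scheduler over heterogeneous local schedulers.
from abc import ABC, abstractmethod


class LocalScheduler(ABC):
    """Common interface the meta-scheduler expects from every site."""

    def __init__(self, site):
        self.site = site
        self.queue = []

    @abstractmethod
    def submit(self, job): ...

    def queued(self):
        return len(self.queue)


class PBSAdapter(LocalScheduler):
    def submit(self, job):
        self.queue.append(job)  # a real adapter would invoke qsub here
        return f"{self.site}:pbs:{len(self.queue)}"


class CondorAdapter(LocalScheduler):
    def submit(self, job):
        self.queue.append(job)  # a real adapter would invoke condor_submit here
        return f"{self.site}:condor:{len(self.queue)}"


class MetaScheduler:
    def __init__(self, schedulers):
        self.schedulers = schedulers

    def submit(self, job):
        # Ask each lower-level scheduler for its load, then delegate the job.
        target = min(self.schedulers, key=lambda s: s.queued())
        return target.submit(job)


meta = MetaScheduler([PBSAdapter("site-a"), CondorAdapter("site-b")])
ids = [meta.submit(f"job{i}") for i in range(3)]
print(ids)  # ['site-a:pbs:1', 'site-b:condor:1', 'site-a:pbs:2']
```

Ties go to the first scheduler in the list, because `min` returns the first minimal element.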


74

GridWay
Lightweight metascheduler on top of GT 2.4–4.x
Properties:
Support for the GGF DRMAA standard API for job submission and management
Support for JSDL
Simple but extensible scheduling mechanisms
Interoperability between different grid infrastructures and middleware (Globus, EGEE, UNICORE…)
Allows job dependencies (workflows)
Supports job migration/adaptive execution (Grid- and application-initiated)
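Job dependencies mean a job may only be released once every job it depends on has finished. The sketch below orders a small hypothetical workflow into "waves" of jobs that could be submitted together, using Python's standard `graphlib.TopologicalSorter` (Python 3.9+). The job names are made up, and this illustrates dependency ordering in general, not GridWay's own dependency handling.

```python
# HYPOTHETICAL workflow; shows dependency-ordered release of jobs.
from graphlib import TopologicalSorter

# job -> set of jobs it depends on
workflow = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "merge": {"simulate_a", "simulate_b"},
}

ts = TopologicalSorter(workflow)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # jobs whose dependencies have all completed
    waves.append(ready)
    ts.done(*ready)                 # pretend they were submitted and finished
print(waves)
# [['preprocess'], ['simulate_a', 'simulate_b'], ['merge']]
```

Jobs in the same wave have no dependencies on each other, so a metascheduler is free to run them in parallel on different resources.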


75

GridWay Architecture

[Figure: GridWay architecture. Components shown: DRMAA library and CLI (job control operations); GridWay Core with Request Manager, Dispatch Manager, Information Manager, Execution Manager, Transfer Manager, Job pool and Host pool; Scheduler (matchmaking, execution and migration); Globus services GRAM, RFT and MDS on each resource (execution of jobs on the LRM); Performance Monitor.]


76

GridWay Modules
Request Manager: interfaces with client commands
Dispatch Manager: performs job scheduling
Information Manager: resource monitoring and data gathering
Execution Manager: executes job stages
Performance Monitor: evaluates job performance


77

Scheduling Strategy
The Dispatch Manager wakes up at every scheduling interval.
It uses the Resource Selector to select the host(s) to which the job is submitted.
The Resource Selector interfaces with Grid Information Services, such as MDS.
The Resource Selector uses a policy script to return a candidate list of hosts for the job.
You can implement your own policy script, so the strategy is extensible.
The Dispatch Manager then submits the job to the Execution Manager.
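The cycle above can be sketched as a loop that, at each interval, asks a resource selector for candidate hosts, ranks them with a pluggable policy function (playing the role of the policy script), and hands the job to the best host. This is a minimal sketch under stated assumptions. All names are hypothetical: the host dictionaries stand in for data obtained from MDS, the `submitted` list stands in for the Execution Manager, and GridWay's real policy hooks are configured differently.

```python
# HYPOTHETICAL dispatch loop; names and data layout are illustrative only.
import time


def rank_by_free_cpus(host):
    """Example policy: prefer hosts with more free CPUs."""
    return host["free_cpus"]


def resource_selector(hosts, job, policy):
    """Return hosts able to run the job, ordered best-first by the policy."""
    candidates = [h for h in hosts if h["free_cpus"] >= job["cpus"]]
    return sorted(candidates, key=policy, reverse=True)


def dispatch_manager(pending, hosts, policy, interval, cycles):
    submitted = []  # stands in for the Execution Manager's input
    for _ in range(cycles):
        for job in list(pending):       # copy: pending shrinks as jobs are dispatched
            candidates = resource_selector(hosts, job, policy)
            if not candidates:
                continue                # leave pending; retry at the next interval
            best = candidates[0]
            best["free_cpus"] -= job["cpus"]
            submitted.append((job["id"], best["name"]))
            pending.remove(job)
        time.sleep(interval)            # wait for the next scheduling interval
    return submitted


hosts = [{"name": "a", "free_cpus": 2}, {"name": "b", "free_cpus": 8}]
jobs = [{"id": 1, "cpus": 4}, {"id": 2, "cpus": 4}, {"id": 3, "cpus": 4}]
result = dispatch_manager(jobs, hosts, rank_by_free_cpus, interval=0, cycles=1)
print(result)  # [(1, 'b'), (2, 'b')]
```

Job 3 stays in `jobs` because no host has four free CPUs left; swapping in a different policy function changes the ranking without touching the loop, which is the extensibility point the slide describes.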


78

GridWay Commands
gwd: starts the daemon
gwhost: information about resources
gwps: information about jobs
gwuser: information about users
gwsubmit: submits a job
gwkill: cancels a job