
Grid Computing

from a solid past to a bright future?

David Groep, NIKHEF DataGrid and VL group

2003-03-14

Grid – more than hype?

Imagine that you could plug your computer into the wall and have direct access to huge computing resources immediately, just as you plug in a lamp to get instant light. …

Far from being science-fiction, this is the idea the XXXXXX project is about to make into reality.…

from a project brochure in 2001

• Grids and their (science) applications
• Origins of the grid
• What makes a Grid?
• Grid implementations today
• New standards
• Dutch dimensions

Grid – a vision
The GRID: networked data-processing centres, with "middleware" software as the "glue" of resources.

Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.

Scientific instruments and experiments provide huge amounts of data.


Communities and Apps

ENVISAT
• 10 instruments on board
• 200 Mbps data rate to ground
• 400 TBytes data archived/year
• ~100 'standard' products
• 10+ dedicated facilities in Europe
• ~700 approved science user projects

http://www.esa.int/

Added value for EO

• enhance the ability to access high-level products

• allow reprocessing of large historical archives

• data fusion and cross-validation, …

Physics @ CERN
• LHC particle accelerator
• operational in 2007
• 5-10 Petabytes per year
• 150 countries
• > 10,000 users
• lifetime ~ 20 years

Trigger cascade:
• 40 MHz (40 TB/sec) event rate off the detector
• level 1 – special hardware → 75 kHz (75 GB/sec)
• level 2 – embedded processors → 5 kHz (5 GB/sec)
• level 3 – PCs → 100 Hz (100 MB/sec) to data recording & offline analysis

The Need for Grids: LHC

http://www.cern.ch/
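A quick back-of-envelope check of these numbers (a sketch; the ~1e7 seconds of effective running time per year is a standard approximation, not a figure from the slide):

    # Rough yearly data volume at the final trigger level (100 MB/s),
    # using the conventional ~1e7 seconds of effective beam time per year.
    RECORD_RATE_MB_S = 100          # level-3 output: 100 MB/sec
    SECONDS_PER_YEAR = 1e7          # effective running time (assumption)

    petabytes = RECORD_RATE_MB_S * SECONDS_PER_YEAR / 1e9
    print(f"~{petabytes:.0f} PB/year per experiment")   # ~1 PB/year
    # Several experiments plus derived data sets push this towards the
    # 5-10 PB/year quoted above.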

And More …

Bio-informatics – genome pattern matching

• For access to data:
  – large network bandwidth to access computing centres
  – support for data-bank replicas (easier and faster mirroring)
  – distributed data banks
• For interpretation of data:
  – Grid-enabled algorithms: BLAST on distributed data banks, distributed data mining

And even more …

• financial services, life sciences, strategy evaluation, …

• instant immersive teleconferencing

• remote experimentation

• pre-surgical planning and simulation

Why is the Grid successful?

• Applications need large amounts of data or computation

• Ever larger, distributed user community
• Network grows faster than compute power/storage

Inter-networking systems

• Continuous growth (now ~180 million hosts)
• Many protocols and APIs (~3500 RFCs)
• Focus on heterogeneity (and security)

http://www.caida.org/

http://www.isc.org/

Remote Service

• RPC proved hugely successful within domains (see the sketch below):
  – Network Information System (YP)
  – Network File System
  – typical client-server stuff…
• CORBA – also intra-domain:
  – extension of RPC to the OO design model
  – diversification
• Web Services – venturing into the inter-organisational domain:
  – standard service descriptions and discovery
  – common syntax (XML/SOAP)
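The common idea is calling a remote procedure as if it were local. Python's standard xmlrpc modules make a self-contained illustration of the model (a sketch of the idea, not of any of the systems above):

    # Sketch of the remote-procedure-call model these systems share:
    # the client invokes a procedure as if it were local.
    from xmlrpc.server import SimpleXMLRPCServer
    import threading, xmlrpc.client

    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(lambda a, b: a + b, "add")   # expose one procedure
    threading.Thread(target=server.serve_forever, daemon=True).start()

    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))    # -> 5, computed on the "remote" side

Web Services keep this call model but add standardised description (WSDL) and discovery on top of the same XML-over-HTTP syntax.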

Grid beginnings - Systems

• distributed computing research
• Gigabit network test beds
• Meta-supercomputing (I-WAY)
• Condor 'flocking'

GUSTO meta-computing test bed in 1999

Grid beginnings - Apps

• Solve problems using systems in one 'domain':
  – parameter sweeps on batch clusters
  – PIAF for (HE) physics analysis
  – …
• Solvers using systems in multiple domains:
  – SETI@home
  – …

• Ready for the next step …

What is the Grid about?

Resource sharing and coordinated problem solving

in dynamic multi-institutional virtual organisations

Virtual Organisation (VO):

A set of individuals or organisations, not under single hierarchical control, that temporarily join forces to solve a particular problem at hand, each bringing a subset of their resources to the collaboration and sharing those at their own discretion and under their own conditions.

What makes a Grid?

A Grid coordinates resources not subject to central control …
  – more than cluster or centralised distributed computing
  – security, AAA, billing & payment, integrity, procedures

… using standard, open protocols …
  – more than single-purpose solutions
  – requires interoperability, a standards body, multiple implementations

… to deliver non-trivial QoS.
  – the sum is more than the individual components (e.g. single sign-on, transparency)

Ian Foster in Grid Today, 2002

Grid Architecture (v1)

• Application
• Fabric – "Controlling things locally": access to, & control of, resources
• Connectivity – "Talking to things": communication (Internet protocols) & security
• Resource – "Sharing single resources": negotiating access, controlling use
• Collective – "Coordinating multiple resources": ubiquitous infrastructure services, app-specific distributed services

The stack sits alongside the Internet Protocol Architecture: Link – Internet – Transport – Application.

Protocol Layers & Bodies

• OSI stack: Physical, Data Link, Network, Transport, Session, Presentation, Application
• Standards body for the lower (link) layers: IEEE
• Standards body for the Internet layers: IETF
• Standards bodies for the Grid stack (Fabric, Connectivity, Resource, Collective, Application, set alongside the Internet Protocol Architecture): GGF, W3C, OASIS

Grid Middleware

• Globus Project started 1997
• Focus on research only
• Used and extended by many other projects
• Toolkit 'bag-of-services' approach – not a complete architecture
• Several middleware projects:
  – EU DataGrid – production focus
  – CrossGrid, GridLAB, DataTAG, PPDG, GriPhyN
  – Condor
  – in NL: ICES/KIS Virtual Lab, VL-E

http://www.globus.org/
http://www.edg.org/
http://www.vl-e.nl/

Grids Today

Grid Protocols Today

• Use the common Grid Security Infrastructure:
  – extensions to TLS for delegation (single sign-on)
  – organisation of users in VOs
• Currently deployed main services:
  – GRAM (resource allocation): attribute/value pairs over HTTP
  – GridFTP (bulk file transfer): FTP with GSI and high-throughput extras (striping)
  – MDS (monitoring and discovery service): LDAP + a common resource-description schema (see the query sketch below)
• Next generation: Grid Services (OGSA)
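Because MDS is just LDAP plus a schema, an ordinary LDAP client is enough to inspect what a site publishes. A minimal sketch using the third-party ldap3 library; the GIIS host name is a placeholder, while port 2135 and the "Mds-Vo-name=local, o=grid" base were the usual MDS 2.x defaults:

    # Sketch: browse an MDS-style GIIS, which is a plain LDAP server.
    from ldap3 import Server, Connection

    server = Server("giis.example.org", port=2135)   # placeholder host
    conn = Connection(server, auto_bind=True)        # anonymous bind
    conn.search("Mds-Vo-name=local, o=grid",         # default MDS suffix
                "(objectClass=*)",                   # dump everything published
                attributes=["*"])
    for entry in conn.entries:
        print(entry.entry_dn)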

Grid Security Infrastructure

• Requirements:
  – "secure" – user identification
  – accountability
  – site autonomy
  – usage control
  – single sign-on
  – dynamic VOs, any time and any place
  – mobility ("easyEverything", airport kiosk, handheld)
  – multiple roles for each user
  – easy!

Authentication – PKI

• Asserting and binding identities
• Trust issues on a global scale:
  – EDG CA Coordination Group:
    • 16 national certification authorities + CrossGrid CAs
    • policies & procedures → mutual trust
    • users identified by the CAs' certificates
  – Part of the world-wide GridPMA:
    • establishing minimum requirements
    • includes several US and Asia-Pacific CAs
• Scaling is still a challenge
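Identification by CA-issued certificates means any X.509 tooling can show what a credential binds. A small sketch using the stock openssl CLI (the file name is a placeholder):

    # Sketch: inspect the identity a CA has bound into a user certificate.
    import subprocess

    out = subprocess.run(
        ["openssl", "x509", "-in", "usercert.pem",     # placeholder PEM file
         "-noout", "-subject", "-issuer", "-dates"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)   # subject (who), issuer (which CA), validity window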

EDG CAs: CERN, CESNET, CNRS (3), GermanGrid, Grid-Ireland, INFN, NIKHEF, NorduGrid, LIP, Russian DataGrid, DATAGRID-ES, GridPP, US-DOE Root CA, US-DOE Sub CA, CrossGrid (*)

http://marianne.in2p3.fr/datagrid/ca and http://www.gridpma.org/

Getting People Together: Virtual Organisations

• The user community 'out there' is large & highly dynamic
• Applying at each individual resource does not scale
• Users get together to form Virtual Organisations:
  – temporary alliance of stakeholders (users and/or resources)
  – various groups and roles
  – managed by (legal) contracts
  – set up and dissolved at will*

* currently not yet that fast

• Authentication, Authorization, Accounting (AAA)

Authorization (today)

• Virtual Organisation "directories":
  – members are listed in a directory
  – managed by the VO responsible
• Sites extract access lists from the directories:
  – only for VOs they have a "contract" with
  – still need OS-local accounts
• May also use automated tools (sysadmin level) – a mapping sketch follows below:
  • poolAccounts
  • slashGrid

http://cern.ch/hep-project-grid-scg/
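The mechanics are simple enough to sketch: pull the membership list from the VO's directory and emit grid-mapfile lines mapping each certificate subject to a local (pool) account. The VO host, base DN, attribute name, and the ".biome" pool account below are all hypothetical stand-ins:

    # Sketch: build grid-mapfile entries from a VO membership directory.
    from ldap3 import Server, Connection

    conn = Connection(Server("vo.example.org", port=389), auto_bind=True)
    conn.search("ou=People, o=biomedical, dc=example, dc=org",  # hypothetical base
                "(objectClass=*)", attributes=["description"])
    for entry in conn.entries:
        # assumption: the member's certificate subject sits in 'description'
        for subject in entry.description:
            print(f'"{subject}" .biome')   # dot-prefix = pool-account mapping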

Grid Security in Action

• Key elements in the Grid Security Infrastructure (GSI):
  – proxy credentials
  – trusted certificate store
  – delegation: full or restricted rights
• Access services directly
• Establish trust between processes

[Diagram – GSI in Action: "Create Processes at A and B that Communicate & Access Files at C". The user signs on once via a "grid-id" and generates a proxy credential, or retrieves one from an online repository. Remote process-creation requests, with mutual authentication, go to GSI-enabled GRAM servers at site A (Kerberos) and site B (Unix); each site authorizes the request, maps it to a local id, creates the process, and generates credentials (a Kerberos ticket at site A, restricted proxies for both processes). The two processes communicate, again mutually authenticated, and a remote file-access request reaches the GSI-enabled FTP server at site C (Kerberos), which authorizes, maps to a local id, and accesses the file.]
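In practice, single sign-on looks like this: create one short-lived proxy and let every GSI-enabled tool use and delegate it. A sketch using the classic Globus command-line clients; exact flag spellings varied across toolkit versions, so treat them as assumptions:

    # Sketch: GSI single sign-on from a user's shell session.
    import subprocess

    # Sign on once: create a 12-hour proxy from the user certificate
    # (flag names follow the classic Globus CLI; versions differ).
    subprocess.run(["grid-proxy-init", "-valid", "12:00"], check=True)
    subprocess.run(["grid-proxy-info"], check=True)   # inspect the proxy

    # From here on, GSI-enabled clients (GRAM, GridFTP, ...) pick up
    # the proxy automatically and can delegate it to remote services.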

Large-scale production Grids

• Until recently usually "smallish":
  – O(10) sites, O(20) users
  – only one community (VO)

Running Production Grids

• EU DataGrid (EDG):
  – stress testing: up to 2000 jobs at any time
  – focus on stability (>99% of jobs complete correctly)
• VL-E
• NASA IPG
• LCG, PPDG/iVDGL

EU DataGrid

• Middleware research project (2001-2003)
• Driving applications:
  – High-Energy Physics
  – Earth Observation
  – Biomedicine
• Operational testbed:
  – 25 sites, 50 CEs
  – 8 VOs
  – ~350 users, growing by ~50/month!

http://www.eu-datagrid.org/

EU DataGrid Test Bed 1

• DataGrid TB1:
  – 14 countries
  – 21 major sites
  – CrossGrid: 40 more sites
• Submitting jobs:
  – log in only once, run everywhere
  – cross administrative boundaries in a secure and trusted way
  – mutual authorization

http://marianne.in2p3.fr/

EDG: 3-Tier Architecture

[Diagram: the client ('UserInterface') sends a request to the execution resources ('ComputeElement') and receives the result; the ComputeElement in turn requests data from the data server ('StorageElement'), which plays the role of the database server.]

Example: the GOME processing cycle

[Diagram, steps running on the DataGrid:
• 'raw' satellite data from the GOME instrument (Level 1)
• ESA – KNMI: processing of raw GOME data to ozone profiles (Level 2), with Opera and Noprego
• IPSL: validation of GOME ozone profiles against ground-based LIDAR measurements
• Step 8: visualize results]

Situation on a Grid?

Information Services (IS)

• HARDWARE – fabric and storage: cluster information, storage capacity, network connections
  – today: info-providers publish into a hierarchical IS directory
  – next week: R-GMA, a producer-consumer framework based on an RDBMS (see the sketch below)
• DATA – files and collections: file replica locations
  – today: Replica Catalogue (RC)
  – in a few months: Replica Location Service
• SOFTWARE – programs & services: RunTimeEnvironment tags, service entries (SE, CE, RC)
  – today: in the IS
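The R-GMA model is easiest to picture as SQL over one virtual database: producers insert rows, consumers pose relational queries, and a mediator matches them up. A rough sketch of the consumer side; the servlet URL, parameter name, and table are hypothetical stand-ins, not the real R-GMA API:

    # Rough sketch of the R-GMA consumer idea: pose an SQL query
    # against the virtual database of monitoring data.
    import urllib.parse, urllib.request

    sql = "SELECT * FROM CPULoad WHERE site = 'NIKHEF'"    # hypothetical table
    url = ("http://mon.example.org:8080/R-GMA/ConsumerServlet?"
           + urllib.parse.urlencode({"select": sql}))      # hypothetical endpoint
    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode())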

Grid job submission

• Basic protocol: GRAM
  – job submission at an individual CE
  – status inquiries
  – credential delegation
  – file staging
  – job manager (baby-sitter)
• Collective services (Workload Management System):
  – Resource Broker
  – Job Submission Service
  – Logging and Bookkeeping
• The EDG WMS tries to optimize the usage of resources
• It will re-submit on resource failure

Job Preparation

• Information to be specified:
  – job characteristics
  – requirements and preferences for the computing system
  – software dependencies
  – job data requirements
• Specified using a Job Description Language (JDL)

Example JDL File

    Executable = "gridTest";
    StdError = "stderr.log";
    StdOutput = "stdout.log";
    InputSandbox = {"/home/joda/test/gridTest"};
    OutputSandbox = {"stderr.log", "stdout.log"};
    InputData = "LF:testbed0-00019";
    ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/lc=test, rc=WP2 INFN Test, dc=infn, dc=it";
    DataAccessProtocol = "gridftp";
    Requirements = other.Architecture=="INTEL" && other.OpSys=="LINUX" && other.FreeCpus >= 4;
    Rank = "other.MaxCpuTime";

This JDL is the input to dg-job-submit.
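A sketch of the resulting workflow from the User Interface machine, wrapping the dg-job-* commands these slides name (the output parsing and the exact flags are assumptions; releases differed):

    # Sketch: submit a JDL, poll the status, fetch the output sandbox.
    import subprocess, time

    out = subprocess.run(["dg-job-submit", "gridTest.jdl"],
                         capture_output=True, text=True, check=True).stdout
    job_id = out.strip().splitlines()[-1]   # assumption: last line holds the job id

    while True:
        status = subprocess.run(["dg-job-status", job_id],
                                capture_output=True, text=True).stdout
        print(status)
        if "done" in status.lower() or "cleared" in status.lower():
            break
        time.sleep(60)                      # poll once a minute

    subprocess.run(["dg-job-get-output", job_id], check=True)  # output sandbox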

Job Submission Scenario

[Animated diagram, stepped through over the following slides: the user's JDL goes from the User Interface (UI) to the Resource Broker (RB). The RB matches the job against the Information Service (IS) and Replica Catalogue (RC), writes a BrokerInfo file for the job, and hands it to the Job Submission Service (JSS), which dispatches it to a suitable ComputeElement (CE) near the required StorageElement (SE). The input sandbox travels from the UI via the RB to the CE; on completion the output sandbox returns to the UI. The Logging & Bookkeeping service (LB) records the job-submit event and every status change, with the job status advancing through: submitted → waiting → ready → scheduled → running → done → outputready → cleared.]

Data Access & Transport

• Requirements:
  – support single sign-on
  – transfer large files quickly
  – confidentiality/integrity
  – integrated with the information systems (RC)
• Extensions to the FTP protocol: GridFTP
  – GSI, DCAU
  – server striping, parallel streams
  – TCP protocol optimisation
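As an illustration, a third-party transfer between two placeholder storage elements with the standard globus-url-copy client, using parallel TCP streams:

    # Sketch: GSI-authenticated bulk transfer with globus-url-copy;
    # -p sets the number of parallel streams. Hosts/paths are placeholders.
    import subprocess

    subprocess.run([
        "globus-url-copy", "-p", "8",                     # 8 parallel streams
        "gsiftp://se1.example.org/data/run01/file.dat",   # source SE
        "gsiftp://se2.example.org/data/run01/file.dat",   # destination SE
    ], check=True)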

EDG Storage Element

• Transfer methods:
  – GridFTP
  – RFIO
  – G-HTTPS
• Replica Catalogue:
  – yesterday: LDAP directory using GDMP
  – today: Replica Location Service and Giggle
• Backend systems:
  – disk storage
  – HPSS via HRM
  – HPSS with explicit staging

Grid Data Bases ?!

• Database Access and Integration (DAI) WG:
  – OGSA-DAI integration project
  – data virtualisation services
  – standard data-source services
• Early emerging standards:
  – Grid Data Service specification (GDS)
  – Grid Data Service Factory (GDSF)
• Largely a spin-off from the UK e-Science effort & DataGrid

Grid Access to Databases

• Spitfire (standard data-source services): uniform access to persistent storage on the Grid
• Multiple-role support
• Compatible with GSI (single sign-on) through CoG
• Uses standard stuff: JDBC, SOAP, XML
• Supports various back-end databases

http://hep-proj-spitfire.web.cern.ch/hep-proj-spitfire/

Spitfire security model

Standard access to DBs:
• GSI SOAP protocol
• strong authentication
• supports single sign-on
• local role repository
• connection pool to multiple backend DBs

Version 1.0 is out; a Web Services version is in alpha. (A sketch of such a SOAP-over-GSI query follows below.)
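To give the flavour of a Spitfire-style access: the query travels in a SOAP envelope over an authenticated HTTPS channel rather than a native database protocol. Endpoint, element names, and the query below are all hypothetical:

    # Sketch: a database query carried as a SOAP request over HTTPS.
    import urllib.request

    envelope = """<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <query xmlns="urn:example:spitfire">
          SELECT name, size FROM files WHERE owner = 'joda'
        </query>
      </soap:Body>
    </soap:Envelope>"""

    req = urllib.request.Request(
        "https://dbserver.example.org:8443/spitfire",   # hypothetical endpoint
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    print(urllib.request.urlopen(req).read().decode())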

Bringing Grids to the User

• Core services are too complex to present to scientists ⇒ design (graphical/web) portals:
  – VLAM-G
  – GENIUS/EnginFrame
  – EDG GUI
• Application-specific interfaces

A Bright Future?

Grids Around the World

• Many different grid projects
• Different goals (and thus architectures)
• Breadth of applications:
  – meta-supercomputing (origin of the Grid)
  – high-throughput computing (DataGrids)
  – collaboratories, data-fusion grids
  – harnessing idle workstations
  – transaction-oriented grids (industry)
• Interoperability requires standardisation!

Standards Requirements

• GGF established in 2001, a merger of the Grid Forum and the eGrid Forum
• Approx. 50 working & research groups

[Chart: (G)GF attendance per meeting, 1999-2002, growing from a few hundred to over 1000 participants]

http://www.ggf.org/

OGSA: current directions

Open Grid Services Architecture … cleaning up the protocol mess

• Use standard containers (based on web services)
• Based on common standards:
  – SOAP, WSDL, UDDI
  – running over an "upgraded" Grid Security Infrastructure (GSI)
• New in OGSA: transient, "manageable" services:
  – state of distributed activities
  – workflow, multi-media, distributed data analysis

OGSA Roadmap

• Introduced at GGF4 (Toronto, March 2002)
• The OGSI definition draft went to final call last week
• First implementations – Globus Toolkit v3:
  – currently in alpha testing
  – beta release in July
• Significant effort towards homogeneous interfaces
• Large commitment (world-wide and local)

Dutch Dimensions

SURFnet5 connectivity

http://www.surfnet.nl/

Networking: Europe

http://www.dante.net/

DutchGrid Platform

[Map of sites: Amsterdam, Utrecht, KNMI, Delft, Nijmegen, TELIN, Leiden, ASTRON/JIVE]

• DutchGrid:
  – test bed coordination
  – PKI security
  – support
• Participation by:
  – NIKHEF, KNMI, SARA
  – DAS-2 (ASCI): TUDelft, Leiden, VU, UvA, Utrecht
  – Telematics Institute
  – FOM, NWO/NCF
  – Min. EZ (ICES/KIS)
  – IBM, KPN, …

www.dutchgrid.nl

Resources

• ASCI DAS-2 (VU, UvA, Leiden, TUDelft, Utrecht):
  – 200 dual P-III 1 GHz CPUs
  – homogeneous clusters, 5 locations
• NIKHEF DataGrid clusters:
  – 75 dual P-III ~1 GHz
  – 1 Gb/s IPv4 + 1 Gb/s IPv6
• NCF Grid (national computer facilities foundation of NWO):
  – 66-node dual AMD-K7 fabric research cluster (NIKHEF)
  – 32-node dual "production quality" cluster (SARA)*
  – 10 Gb/s optical "lambda" test bed
  – …
• BioASP – various smaller O(1-10 node) clusters

Resources (cont.)

SARA – National HPC Centre
• Processing:
  – SGI 1024-processor MPP
• Mass storage:
  – StorageTek NearLine tape robot
  – currently: 500 TByte
  – integrated as an EDG "Storage Element"
• User expertise centre

SURFnet – networking
• 2.5-10 Gb/s international
• 10 Gb/s to dedicated centres (DAS-2, ASTRON)

A Bright Future!

You could plug your computer into the wall and have direct access to huge (computing) resources almost immediately (with a little help from toolkits and portals)…

It may still be science – although not fiction – but we are working hard to get there!