Global Data Grids for 21st Century Science

Paul Avery
University of Florida
http://www.phys.ufl.edu/~avery/
[email protected]

Physics Colloquium
University of Texas at Arlington
Jan. 24, 2002
What is a Grid?

Grid: geographically distributed computing resources configured for coordinated use
- Physical resources & networks provide the raw capability
- "Middleware" software ties it together
Applications for Grids

Climate modeling
- Climate scientists visualize, annotate, & analyze terabytes of simulation data

Biology
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

High energy physics
- 3,000 physicists worldwide pool petaflops of CPU resources to analyze petabytes of data

Engineering
- Civil engineers collaborate to design, execute, & analyze shake table experiments
- A multidisciplinary analysis in aerospace couples code and data in four companies

(From Ian Foster)
Applications for Grids (cont.)

Application Service Providers
- A home user invokes architectural design functions at an application service provider
- An application service provider purchases cycles from compute cycle providers

Commercial
- Scientists at a multinational soap company design a new product

Communities
- An emergency response team couples real-time data, weather models, population data
- A community group pools members' PCs to analyze alternative designs for a local road

Health
- Hospitals and international agencies collaborate on stemming a major disease outbreak

(From Ian Foster)
Proto-Grid: SETI@home

- Community: SETI researchers + enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
More Advanced Proto-Grid: Evaluation of AIDS Drugs

Community
- 1000s of home computer users
- Philanthropic computing vendor (Entropia)
- Research group (Scripps)

Common goal
- Advance AIDS research
Early Information Infrastructure

Network-centric
- Simple, fixed end systems
- Few embedded capabilities
- Few services
- No user-level quality of service
- O(10^8) nodes
Emerging Information Infrastructure

Application-centric
- Heterogeneous, mobile end-systems
- Many embedded capabilities
- Rich services
- User-level quality of service
- O(10^10) nodes

Qualitatively different, not just "faster and more reliable"

(Diagram: the Grid provides services such as QoS, resource discovery, processing, and caching)
Why Grids?

Resources for complex problems are distributed
- Advanced scientific instruments (accelerators, telescopes, ...)
- Storage and computing
- Groups of people

Communities require access to common services
- Scientific collaborations (physics, astronomy, biology, engineering, ...)
- Government agencies
- Health care organizations, large corporations, ...

Goal is to build "Virtual Organizations"
- Make all community resources available to any VO member
- Leverage strengths at different institutions
- Add people & resources dynamically
Grid Challenges

Overall goal
- Coordinated sharing of resources

Technical problems to overcome
- Authentication, authorization, policy, auditing
- Resource discovery, access, allocation, control
- Failure detection & recovery
- Resource brokering

Additional issue: lack of central control & knowledge
- Preservation of local site autonomy
- Policy discovery and negotiation important
Layered Grid Architecture (Analogy to Internet Architecture)

Grid layers, from bottom to top:
- Fabric: controlling things locally; accessing and controlling resources
- Connectivity: talking to things; communications, security
- Resource: sharing single resources; negotiating access, controlling use
- Collective: managing multiple resources; ubiquitous infrastructure services
- User: specialized services; application-specific distributed services
- Application

Internet Protocol Architecture analogy: Link / Internet / Transport / Application

(From Ian Foster)
Globus Project and Toolkit

Globus Project™ (Argonne + USC/ISI)
- O(40) researchers & developers
- Identify and define core protocols and services

Globus Toolkit™
- A major product of the Globus Project
- Reference implementation of core protocols & services
- Growing open source developer community

Globus Toolkit used by all Data Grid projects today
- US: GriPhyN, PPDG, TeraGrid, iVDGL
- EU: EU DataGrid and national projects
Globus General Approach

Define Grid protocols & APIs
- Protocol-mediated access to remote resources
- Integrate and extend existing standards

Develop reference implementation
- Open source Globus Toolkit
- Client & server SDKs, services, tools, etc.

Grid-enable a wide variety of tools
- Globus Toolkit
- FTP, SSH, Condor, SRB, MPI, ...

Learn about real world problems
- Deployment
- Testing
- Applications

(Diagram: applications sit on diverse global services, which sit on core services, which sit on diverse OS services)
Globus Toolkit Protocols

Security (connectivity layer)
- Grid Security Infrastructure (GSI)

Resource management (resource layer)
- Grid Resource Allocation Management (GRAM)

Information services (resource layer)
- Grid Resource Information Protocol (GRIP)

Data transfer (resource layer)
- Grid File Transfer Protocol (GridFTP)
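To make the layers concrete, here is a rough sketch of how a user of that era's Globus Toolkit might exercise GSI, GRAM, and GridFTP from a client machine. This is an illustration only: the host names are placeholders, and exact command-line options may differ between Toolkit releases.

    import subprocess

    # 1. GSI: create a short-lived proxy credential from the user's X.509 certificate
    subprocess.run(["grid-proxy-init"], check=True)

    # 2. GRAM: submit a job to a remote gatekeeper using an RSL description
    #    ("gatekeeper.example.edu" is a placeholder resource contact)
    rsl = "&(executable=/bin/hostname)(count=1)"
    subprocess.run(["globusrun", "-o", "-r",
                    "gatekeeper.example.edu/jobmanager", rsl], check=True)

    # 3. GridFTP: move a data file from a remote storage server to local disk
    subprocess.run(["globus-url-copy",
                    "gsiftp://storage.example.edu/data/run01.dat",
                    "file:///tmp/run01.dat"], check=True)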
Data Grids
Data Intensive Science: 2000-2015

Scientific discovery increasingly driven by IT
- Computationally intensive analyses
- Massive data collections
- Data distributed across networks of varying capability
- Geographically distributed collaboration

Dominant factor: data growth (1 Petabyte = 1000 TB)
- 2000: ~0.5 Petabyte
- 2005: ~10 Petabytes
- 2010: ~100 Petabytes
- 2015: ~1000 Petabytes? (implied growth rate is checked below)

How to collect, manage, access and interpret this quantity of data?

Drives demand for "Data Grids" to handle the additional dimension of data access & movement
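As a back-of-the-envelope check on those projections (a sketch using only the yearly figures quoted above), the implied growth rate is roughly a doubling every year and a half:

    import math

    # Projected dataset sizes from the slide (petabytes)
    pb_2000, pb_2015 = 0.5, 1000.0
    years = 2015 - 2000

    growth_per_year = (pb_2015 / pb_2000) ** (1 / years)     # ~1.66x per year
    doubling_time = math.log(2) / math.log(growth_per_year)  # ~1.4 years

    print(f"~{growth_per_year:.2f}x per year, doubling every ~{doubling_time:.1f} years")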
Global Data Grid Challenge
“Global scientific communities will perform computationally demanding analyses of distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.”
Data Intensive Physical Sciences

High energy & nuclear physics

Gravity wave searches
- LIGO, GEO, VIRGO

Astronomy: digital sky surveys
- Now: Sloan Sky Survey, 2MASS
- Future: VISTA, other gigapixel arrays
- "Virtual" observatories (Global Virtual Observatory)

Time-dependent 3-D systems (simulation & data)
- Earth observation
- Climate modeling
- Geophysics, earthquake modeling
- Fluids, aerodynamic design
- Pollutant dispersal scenarios
Data Intensive Biology and Medicine

Medical data
- X-ray, mammography data, etc. (many petabytes)
- Digitizing patient records (ditto)

X-ray crystallography
- Bright X-ray sources, e.g. Argonne Advanced Photon Source

Molecular genomics and related disciplines
- Human Genome, other genome databases
- Proteomics (protein structure, activities, ...)
- Protein interactions, drug delivery

Brain scans (3-D, time dependent)

Virtual Population Laboratory (proposed)
- Database of populations, geography, transportation corridors
- Simulate likely spread of disease outbreaks

(Craig Venter keynote, SC2001)
Data and Corporations

Corporations and Grids
- National, international, global
- Business units, research teams
- Sales data
- Transparent access to distributed databases

Corporate issues
- Short term and long term partnerships
- Overlapping networks
- Managing and controlling access to data and resources
- Security
Example: High Energy Physics

"Compact" Muon Solenoid at the LHC (CERN)

(Figure: CMS detector drawing, with a "Smithsonian standard man" shown for scale)
LHC Computing Challenges

"Events" resulting from beam-beam collisions:
- The signal event is obscured by ~20 overlapping uninteresting collisions in the same crossing
- CPU time does not scale from previous generations

(Figure: simulated event displays comparing 2000 and 2007 conditions)
LHC: Higgs Decay into 4 Muons

(Event display: all charged tracks with pt > 2 GeV; reconstructed tracks with pt > 25 GeV; +30 minimum-bias events overlaid)

10^9 events/sec, selectivity: 1 in 10^13
LHC Computing Challenges

- Complexity of the LHC interaction environment & resulting data
- Scale: petabytes of data per year (100 PB by ~2010-12)
- Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries
Global LHC Data Grid

Tier structure:
- Tier0: CERN
- Tier1: National laboratory
- Tier2: Regional center (university, etc.)
- Tier3: University workgroup
- Tier4: Workstation

(Diagram: Tier 0 at CERN feeding multiple Tier 1 centers, each serving several Tier 2 centers, with Tier 3 and Tier 4 resources below)

Key ideas:
- Hierarchical structure
- Tier2 centers
Global LHC Data Grid (bandwidth and capacity estimates)

- Online system feeds Tier 0+1 (CERN Computer Center, > 20 TIPS): ~PBytes/sec off the detector, ~100 MBytes/sec to archive
- Bunch crossings every 25 ns; ~100 triggers per second; each event is ~1 MByte (a quick consistency check follows below)
- Tier 0 to Tier 1 national centers (USA, France, Italy, UK): 2.5 Gbits/sec links
- Tier 1 to Tier 2 regional centers: ~622 Mbits/sec
- Tier 2 to Tier 3 institutes (~0.25 TIPS each, with physics data caches): 100-1000 Mbits/sec
- Tier 4: workstations and other portals
- Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels
- Experiment CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~1:1:1
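A quick consistency check on these figures, using only the trigger rate and event size quoted above and assuming roughly 10^7 seconds of live data-taking per year (a common rule of thumb, not stated on the slide):

    trigger_rate_hz = 100          # triggers per second
    event_size_mb = 1.0            # ~1 MByte per event
    live_seconds_per_year = 1e7    # assumed accelerator live time

    rate_mb_per_s = trigger_rate_hz * event_size_mb             # ~100 MBytes/sec to storage
    raw_pb_per_year = rate_mb_per_s * live_seconds_per_year / 1e9

    print(f"~{rate_mb_per_s:.0f} MB/s, ~{raw_pb_per_year:.0f} PB of raw data per year")

This reproduces both the ~100 MBytes/sec archive rate and the "petabytes of data per year" scale quoted earlier.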
Example: Global Virtual Observatory

- Source catalogs, image data
- Specialized data: spectroscopy, time series, polarization
- Information archives: derived & legacy data (NED, Simbad, ADS, etc.)
- Discovery tools: visualization, statistics
- Standards

Multi-wavelength astronomy, multiple surveys
GVO Data Challenge

Digital representation of the sky
- All-sky + deep fields
- Integrated catalog and image databases
- Spectra of selected samples

Size of the archived data
- 40,000 square degrees
- Resolution < 0.1 arcsec, giving > 50 trillion pixels
- One band (2 bytes/pixel): 100 Terabytes (see the check below)
- Multi-wavelength: 500-1000 Terabytes
- Time dimension: many Petabytes

Large, globally distributed database engines
- Integrated catalog and image databases
- Multi-Petabyte data size
- GByte/s aggregate I/O speed per site
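The pixel and volume figures follow directly from the survey geometry; a minimal check using only the parameters on the slide:

    sq_degrees = 40_000
    arcsec_per_degree = 3600
    pixel_arcsec = 0.1            # < 0.1 arcsec resolution
    bytes_per_pixel = 2           # one band

    pixels_per_sq_degree = (arcsec_per_degree / pixel_arcsec) ** 2
    total_pixels = sq_degrees * pixels_per_sq_degree          # ~5e13, i.e. > 50 trillion
    one_band_tb = total_pixels * bytes_per_pixel / 1e12       # ~100 TB

    print(f"{total_pixels:.1e} pixels, ~{one_band_tb:.0f} TB for one band")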
Sloan Digital Sky Survey Data Grid
LIGO (Gravity Wave) Data Grid

(Diagram: the Hanford and Livingston observatories feed the Caltech and MIT Tier1 centers over OC3/OC12/OC48 links via Internet2/Abilene, with LSC university sites as Tier2 centers)
Data Grid Projects
Large Data Grid Projects

Funded projects
- GriPhyN       USA  NSF    $11.9M + $1.6M   2000-2005
- EU DataGrid   EU   EC     €10M             2001-2004
- PPDG          USA  DOE    $9.5M            2001-2004
- TeraGrid      USA  NSF    $53M             2001-?
- iVDGL         USA  NSF    $13.7M + $2M     2001-2006
- DataTAG       EU   EC     €4M              2002-2004

Proposed projects
- GridPP        UK   PPARC  >$15M?           2001-2004

Many national projects
- Initiatives in US, UK, Italy, France, NL, Germany, Japan, ...
- EU networking initiatives (Géant, SURFNet)
PPDG Middleware Components

- Object- and file-based application services (request interpreter)
- File access service (request planner)
- Cache manager
- Matchmaking service
- Cost estimation
- File fetching service
- File replication index
- File movers (shown on either side of the site boundary / security domain)
- Mass storage manager
- Resource management
- End-to-end network services

Future: OO-collection export; cache and state tracking; prediction
EU DataGrid Project

Work packages and lead contractors:
- WP1   Grid Workload Management                 INFN
- WP2   Grid Data Management                     CERN
- WP3   Grid Monitoring Services                 PPARC
- WP4   Fabric Management                        CERN
- WP5   Mass Storage Management                  PPARC
- WP6   Integration Testbed                      CNRS
- WP7   Network Services                         CNRS
- WP8   High Energy Physics Applications         CERN
- WP9   Earth Observation Science Applications   ESA
- WP10  Biology Science Applications             INFN
- WP11  Dissemination and Exploitation           INFN
- WP12  Project Management                       CERN
GriPhyN: PetaScale Virtual-Data Grids

(Diagram: production teams, individual investigators, and workgroups use interactive user tools that drive virtual data tools, request planning & scheduling tools, and request execution & management tools; these rest on resource management services, security and policy services, and other Grid services, which manage distributed resources (code, storage, CPUs, networks), raw data sources, and transforms. Scale: ~1 Petaflop, ~100 Petabytes.)
GriPhyN Research Agenda

Virtual Data technologies (see figure)
- Derived data, calculable via algorithm
- Instantiated 0, 1, or many times (e.g., caches)
- "Fetch value" vs. "execute algorithm"
- Very complex (versions, consistency, cost calculation, etc.)

LIGO example
- "Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year"

For each requested data value, need to
- Locate the item and its producing algorithm
- Determine the cost of fetching vs. calculating
- Plan the data movements & computations required to obtain results
- Execute the plan
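The fetch-vs-compute decision at the heart of virtual data can be sketched as a tiny planner. This is an illustrative toy, not GriPhyN's actual planner; the cost model and item names are invented for the example:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DataItem:
        name: str
        fetch_cost: Optional[float]   # cost to fetch an existing copy; None if never materialized
        compute_cost: float           # cost to rerun the producing transformation

    def plan(item: DataItem) -> str:
        """Choose the cheaper way to satisfy a virtual-data request."""
        if item.fetch_cost is None:
            return f"compute {item.name} (no materialized copy exists)"
        if item.fetch_cost <= item.compute_cost:
            return f"fetch {item.name} from an existing replica"
        return f"recompute {item.name} (cheaper than moving the data)"

    # Two of the 200 requested strain segments from the LIGO example
    print(plan(DataItem("strain[GRB-017]", fetch_cost=5.0, compute_cost=40.0)))
    print(plan(DataItem("strain[GRB-042]", fetch_cost=None, compute_cost=40.0)))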
Virtual Data in Action

A data request may
- Compute locally or remotely
- Access local or remote data

Scheduling based on
- Local policies
- Global policies
- Cost

(Diagram: requests flow across local facilities and caches, regional facilities and caches, and major facilities and archives; "fetch item" when a copy already exists)
GriPhyN/PPDG Data Grid Architecture

- Application, Planner, Executor: the planner hands DAGs to the executor (DAGMan, Kangaroo)
- Catalog Services: MCAT; GriPhyN catalogs
- Info Services: MDS
- Policy/Security: GSI, CAS
- Monitoring: MDS
- Replica Management: GDMP
- Reliable Transfer Service
- Compute Resource: GRAM; Storage Resource: GridFTP, GRAM, SRM

(Many components are Globus services; marked components in the original figure indicate where an initial solution is operational)
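The planner-to-executor hand-off above is expressed as a DAG of jobs. As a rough illustration of the kind of description DAGMan consumes (a sketch: the job names and submit files are invented, and submit-file details are omitted), a snippet that writes and submits a two-stage DAG might look like:

    import subprocess

    # A toy DAG: job A (simulation) must finish before job B (reconstruction).
    # "sim.sub" and "reco.sub" are hypothetical Condor submit descriptions.
    dag_text = """\
    JOB  A  sim.sub
    JOB  B  reco.sub
    PARENT A CHILD B
    """

    with open("pipeline.dag", "w") as f:
        f.write(dag_text)

    # condor_submit_dag hands the DAG to DAGMan, which submits jobs in dependency order
    subprocess.run(["condor_submit_dag", "pipeline.dag"], check=True)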
Catalog Architecture

Transparency with respect to materialization
- Derived Data Catalog: maps a derived-data id to the transformation and parameters that produce it
      Id    Trans  Param  Name     ...
      i1    F      X      F.X      ...
      i2    F      Y      F.Y      ...
      i10   G      Y P    G(P).Y   ...
- Transformation Catalog: maps a transformation name to program storage (URLs for program location) and cost
      Trans  Prog   Cost  ...
      F      URL:f  10    ...
      G      URL:g  20    ...
- Derived Metadata Catalog: application-specific attributes mapped to derived-data ids (e.g., i2, i10)
- Catalogs are updated upon materialization

Transparency with respect to location
- Metadata Catalog: object name mapped to logical object name
      Name    LObjN  ...
      X       logO1  ...
      Y       logO2  ...
      F.X     logO3  ...
      G(1).Y  logO4  ...
- Replica Catalog: logical container name mapped to physical file names (URLs for physical file storage)
      LCN    PFNs
      logC1  URL1
      logC2  URL2, URL3
      logC3  URL4
      logC4  URL5, URL6
- In the figure, object names enter the catalogs through GCMS
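A minimal sketch of how a request might walk these catalogs to turn an object name into something usable. All data structures here are toys mirroring the tables above, and the logical-object-to-container mapping is invented for illustration:

    # Toy catalogs (illustrative only)
    metadata_catalog = {"F.X": "logO3"}                # object name -> logical object name
    container_of    = {"logO3": "logC2"}               # assumed logical object -> logical container
    replica_catalog = {"logC2": ["URL2", "URL3"]}      # logical container -> physical replicas
    derived_data    = {"F.X": ("F", "X")}              # object name -> (transformation, params)
    transformations = {"F": ("URL:f", 10)}             # transformation -> (program location, cost)

    def resolve(name: str):
        """Return physical replicas if the object is materialized,
        otherwise return the recipe (program, params, cost) needed to derive it."""
        log_obj = metadata_catalog.get(name)
        if log_obj is not None:
            replicas = replica_catalog.get(container_of[log_obj], [])
            if replicas:
                return ("fetch", replicas)
        trans, params = derived_data[name]
        prog, cost = transformations[trans]
        return ("derive", prog, params, cost)

    print(resolve("F.X"))   # -> ('fetch', ['URL2', 'URL3'])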
Early GriPhyN Challenge Problem: CMS Data Reconstruction (April 2001; Caltech, Wisconsin, NCSA)

1. Master Condor job running on a Caltech workstation
2. Master launches a secondary job on the Wisconsin pool; input files shipped via Globus GASS
3. 100 Monte Carlo jobs run on the Wisconsin Condor pool
4. 100 data files (~1 GB each) transferred via GridFTP
5. Secondary job reports completion to the master
6. Master starts reconstruction jobs via the Globus jobmanager on the NCSA Linux cluster
7. GridFTP fetches data from NCSA UniTree (a GridFTP-enabled FTP server)
8. Processed Objectivity database stored back to UniTree
9. Reconstruction jobs report completion to the master
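For flavor, the remote pieces of such a run are driven by Condor-G, which submits jobs to a Globus jobmanager through a submit description roughly like the one below. This is a hedged sketch: the hostname, executable, and file names are placeholders, and attribute details varied across Condor versions.

    import subprocess

    # Write a minimal Condor-G submit description and hand it to condor_submit.
    # "cluster.example.edu" and "reconstruct.sh" are placeholders for this sketch.
    submit_text = """\
    universe        = globus
    globusscheduler = cluster.example.edu/jobmanager
    executable      = reconstruct.sh
    output          = reco.out
    error           = reco.err
    log             = reco.log
    queue
    """

    with open("reco.submit", "w") as f:
        f.write(submit_text)

    subprocess.run(["condor_submit", "reco.submit"], check=True)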
Trace of a Condor-G Physics Run

(Plot: number of running jobs over time, ranging from 0 to 120, for the pre/simulation/post stages on the UW Condor pool and the ooHits and ooDigis stages at NCSA; a visible gap marks a delay due to a script error)
iVDGL: A World Grid Laboratory

International Virtual-Data Grid Laboratory
- A global Grid laboratory (US, EU, Asia, ...)
- A place to conduct Data Grid tests "at scale"
- A mechanism to create common Grid infrastructure
- A facility to perform production exercises for LHC experiments
- A laboratory for other disciplines to perform Data Grid tests

US part funded by NSF on Sep. 25, 2001: $13.65M + $2M

"We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science."  (From the NSF proposal, 2001)
iVDGL Summary Information

Principal components
- Tier1 sites (laboratories)
- Tier2 sites (universities)
- Selected Tier3 sites (universities)
- Fast networks: US, Europe, transatlantic, transpacific
- Grid Operations Center (GOC)
- Computer Science support teams (6 UK Fellows)
- Coordination, management

Proposed international participants
- Initially US, EU, Japan, Australia
- Other world regions later
- Discussions with Russia, China, Pakistan, India, Brazil

Complementary EU project: DataTAG
- Transatlantic network from CERN to STAR-TAP (+ people)
- Initially 2.5 Gb/s
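To put that link capacity in context (a sketch assuming the link could be driven near its full rate, which real transfers rarely achieve):

    link_gbps = 2.5                      # DataTAG transatlantic link
    terabyte_bits = 8e12                 # 1 TB expressed in bits

    seconds_per_tb = terabyte_bits / (link_gbps * 1e9)    # ~3200 s
    print(f"~{seconds_per_tb/60:.0f} minutes to move 1 TB at full rate, "
          f"or ~{86400/seconds_per_tb:.0f} TB per day")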
US iVDGL Proposal Participants

- U Florida: CMS
- Caltech: CMS, LIGO
- UC San Diego: CMS, CS
- Indiana U: ATLAS, iGOC
- Boston U: ATLAS
- U Wisconsin, Milwaukee: LIGO
- Penn State: LIGO
- Johns Hopkins: SDSS, NVO
- U Chicago: CS
- U Southern California: CS
- U Wisconsin, Madison: CS
- Salish Kootenai: Outreach, LIGO
- Hampton U: Outreach, ATLAS
- U Texas, Brownsville: Outreach, LIGO
- Fermilab: CMS, SDSS, NVO
- Brookhaven: ATLAS
- Argonne Lab: ATLAS, CS

(Site roles in the original figure: T2 / software, CS support, T3 / outreach, T1 / labs (not funded))
Initial US-iVDGL Data Grid

(Map: Tier1 at Fermilab; proto-Tier2 and Tier3 university sites including Caltech/UCSD, Florida, Wisconsin, BNL, Indiana, BU, Michigan, SKC, Brownsville, Hampton, and PSU; other sites to be added in 2002)
iVDGL Map (2002-2003)

(Map legend: Tier0/1, Tier2, and Tier3 facilities; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links; DataTAG and SURFnet connections)
"Infrastructure" Data Grid Projects

- GriPhyN (US, NSF): Petascale Virtual-Data Grids (http://www.griphyn.org/)
- Particle Physics Data Grid (US, DOE): Data Grid applications for HENP (http://www.ppdg.net/)
- European Data Grid (EU, EC): Data Grid technologies, EU deployment (http://www.eu-datagrid.org/)
- TeraGrid Project (US, NSF): distributed supercomputing resources, 13 TFlops (http://www.teragrid.org/)
- iVDGL + DataTAG (NSF, EC, others): global Grid laboratory & transatlantic network

- Collaborations of application scientists & computer scientists
- Focus on infrastructure development & deployment
- Broad application
Data Grid Project Timeline (Q4 2000 through Q1 2002)

- GriPhyN approved, $11.9M + $1.6M
- EU DataGrid approved, $9.3M
- 1st Grid coordination meeting
- PPDG approved, $9.5M
- 2nd Grid coordination meeting
- iVDGL approved, $13.65M + $2M
- TeraGrid approved ($53M)
- 3rd Grid coordination meeting
- 4th Grid coordination meeting
- DataTAG approved (€4M)
- LHC Grid Computing Project
Need for Common Grid Infrastructure

Grid computing is sometimes compared to the electric grid
- You plug in to get a resource (CPU, storage, ...)
- You don't care where the resource is located

The analogy is more appropriate than originally intended
- It expresses a US viewpoint: a uniform power grid
- What happens when you travel around the world?
  - Different frequencies: 60 Hz, 50 Hz
  - Different voltages: 120 V, 220 V
  - Different sockets: USA, 2-pin, France, UK, etc.
- We want to avoid this situation in Grid computing
Role of Grid Infrastructure

Provide essential common Grid services
- Cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.)

Meet the needs of high-end scientific & engineering collaborations
- HENP, astrophysics, GVO, earthquake, climate, space, biology, ...
- Already international and even global in scope
- Drive future requirements

Be broadly applicable outside science
- Government agencies: national, regional (EU), UN
- Non-governmental organizations (NGOs)
- Corporations, business networks (e.g., suppliers, R&D)
- Other "virtual organizations" (see "Anatomy of the Grid")

Be scalable to the global level
Grid Coordination Efforts

Global Grid Forum (GGF)
- www.gridforum.org
- International forum for general Grid efforts
- Many working groups, standards definitions
- Next meeting in Toronto, Feb. 17-20

HICB (high energy physics)
- Represents HEP collaborations, primarily LHC experiments
- Joint development & deployment of Data Grid middleware
- GriPhyN, PPDG, TeraGrid, iVDGL, EU DataGrid, LCG, DataTAG, CrossGrid
- Common testbed, open source software model
- Several meetings so far

New infrastructure Data Grid projects?
- Fold into the existing Grid landscape (primarily US + EU)
Summary

- Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
- The iVDGL will provide vast experience for new collaborations
- Many challenges remain during the coming transition
  - New Grid projects will provide rich experience and lessons
  - Difficult to predict the situation even 3-5 years ahead
Grid References

- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org