Open Science Grid
Frank Würthwein, OSG Application Coordinator
Experimental Elementary Particle Physics, UCSD
8/7/06 NBCR 2006 2
Particle Physics & Computing: Science Driver
Event rate = Luminosity × cross section
LHC revolution starting in 2008:
— Luminosity × 10
— Cross section × 150 (e.g. top quark)
Computing challenge: 20 PB in the first year of running; ~100 MSpecInt2000, i.e. close to 100,000 cores.
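As a rough worked example of the rate formula (the numbers here are illustrative, not from the slide): at the LHC design luminosity of about 10^34 cm^-2 s^-1, with a top-quark pair cross section of order 800 pb = 8 × 10^-34 cm^2,

    R = L \cdot \sigma \approx (10^{34}\,\mathrm{cm^{-2}\,s^{-1}}) \times (8\times 10^{-34}\,\mathrm{cm^{2}}) \approx 8\ \mathrm{events/s},

i.e. hundreds of thousands of top-quark events per day, which is what drives the data volumes and core counts quoted above.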
8/7/06 NBCR 2006 3
Overview
— OSG in a nutshell
— Organization
— “Architecture”
— Using the OSG
— Present Utilization & Expected Growth
— Summary of OSG Status
8/7/06 NBCR 2006 4
OSG in a nutshell
High Throughput Computing
— Opportunistic scavenging on cheap hardware.
— Owner-controlled policies.
“Open consortium”
— Add OSG project to an open consortium to provide cohesion and sustainability.
Heterogeneous middleware stack
— Minimal site requirements & optional services.
— Production grid allows coexistence of multiple OSG releases.
“Linux rules”: mostly RHEL3 on Intel/AMD
Grid of clusters
— Compute & storage (mostly) on private Gb/s LANs.
— Some sites with (multiple) 10 Gb/s WAN “uplinks”.
Organization
Started in 2005 as a Consortium with contributed effort only.
Now adding the OSG project to sustain the production grid.
People coming together to build … people paid to operate …
8/7/06 NBCR 2006 6
Consortium & Project
Consortium Council:
— IT departments & their hardware resources
— Science application communities
— Middleware providers
Funded Project (starting 9/06):
— Operate services for a distributed facility.
— Improve, extend, expand & interoperate.
— Engagement, education & outreach.
Council members:
Argonne Nat. Lab., Brookhaven Nat. Lab., CCR SUNY Buffalo, Fermi Nat. Lab., Thomas Jefferson Nat. Lab., Lawrence Berkeley Nat. Lab., Stanford Lin. Acc. Center, Texas Adv. Comp. Center, RENCI, Purdue,
US Atlas Collaboration, BaBar Collaboration, CDF Collaboration, US CMS Collaboration, D0 Collaboration, GRASE, LIGO, SDSS, STAR,
US Atlas S&C Project, US CMS S&C Project, Condor, Globus, SRM, OSG Project
8/7/06 NBCR 2006 7
Consortium & Project (annotated)
[The previous slide repeated with annotations mapping contributions: Middleware (from the middleware providers), Hardware (from the IT departments), and User Support (for the science application communities).]
8/7/06 NBCR 2006 8
OSG Management
Executive Director: Ruth Pordes
Facility Coordinator: Miron Livny
Application Coordinators: Torre Wenaus & fkw
Resource Managers: P. Avery & A. Lazzarini
Education Coordinator: Mike Wilde
Engagement Coord.: Alan Blatecky
Council Chair: Bill Kramer
Diverse Set of people from Universities & National Labs, including CS, Science Apps, & IT infrastructure people.
8/7/06 NBCR 2006 11
Grid of sites
IT departments at universities & national labs make their hardware resources available via OSG interfaces.
— CE: (modified) pre-WS GRAM
— SE: SRM for large volume; gftp & (N)FS for small volume
Today’s scale:
— 20-50 “active” sites (depending on the definition of “active”)
— ~5,000 batch slots
— ~500 TB storage
— ~10 “active” sites with shared 10 Gbps or better connectivity
Expected scale for end of 2008:
— ~50 “active” sites
— ~30-50,000 batch slots
— Few PB of storage
— ~25-50% of sites with shared 10 Gbps or better connectivity
8/7/06 NBCR 2006 12
Making the Grid attractive
Minimize entry threshold for resource owners:
— Minimize the software stack.
— Minimize the support load.
Minimize entry threshold for users:
— Feature-rich software stack.
— Excellent user support.
Resolve this contradiction via a “thick” Virtual Organization layer of services between users and the grid.
8/7/06 NBCR 2006 13
Me -- My friends -- The grid
— “Me”: O(10^4) users, each with only a thin client (domain-science specific).
— “My friends”: O(10^1-2) VOs providing the thick VO middleware & support layer.
— “The anonymous grid”: O(10^2-3) sites reached through a thin “Grid API” (common to all sciences).
8/7/06 NBCR 2006 14
Grid of Grids - from local to global
Science Community Infrastructure (e.g. Atlas, CMS, LIGO, …)
CS/IT Campus Grids (e.g. GLOW, FermiGrid, …)
National & International CyberInfrastructure for Science (e.g. TeraGrid, EGEE, …)
OSG enables its users to operate transparently across Grid boundaries globally.
8/7/06 NBCR 2006 16
Authentication & Authorization
OSG responsibilities:
— X509-based middleware
— Accounts may be dynamic/static, shared/FQAN-specific
VO responsibilities:
— Instantiate VOMS
— Register users & define/manage their roles
Site responsibilities:
— Choose security model (what accounts are supported)
— Choose which VOs to allow
— Default accept of all users in a VO, but individuals or groups within the VO can be denied.
8/7/06 NBCR 2006 17
User Management
User obtains a DN from a CA that is vetted by TAGPMA.
User registers with a VO and is added to the VO's VOMS.
— VO responsible for registration of its VOMS with the OSG GOC.
— VO responsible for its users signing the AUP.
— VO responsible for VOMS operations.
— VOMS shared for operations on multiple grids globally by some VOs.
— Default OSG VO exists for new communities & single PIs.
Sites decide which VOs to support (striving for default admit).
— Site populates GUMS daily from the VOMSes of all VOs.
— Site chooses uid policy for each VO & role: dynamic vs. static vs. group accounts.
User uses whatever services the VO provides in support of users.
— VOs generally hide the grid behind a portal.
Any and all support is the responsibility of the VO:
— Helping its users.
— Responding to complaints from grid sites about its users.
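For concreteness, a minimal sketch (in Python, calling the standard VOMS client tools) of what a registered user typically does before using the grid: obtain a short-lived VOMS proxy from their X509 certificate and inspect it. The VO name "myvo" and the option values are placeholders; real VOs, roles, and required options differ.

    # Sketch: create and inspect a VOMS proxy before submitting work to OSG.
    # Assumes the VOMS client tools and the user's X509 certificate are in place;
    # the VO name "myvo" is a hypothetical placeholder.
    import subprocess

    def make_voms_proxy(vo="myvo", hours=12):
        """Request a VOMS-extended proxy valid for the given number of hours."""
        subprocess.run(["voms-proxy-init", "-voms", vo, "-valid", f"{hours}:00"], check=True)

    def show_proxy():
        """Print the proxy holder's DN, VO attributes, and remaining lifetime."""
        subprocess.run(["voms-proxy-info", "-all"], check=True)

    if __name__ == "__main__":
        make_voms_proxy()
        show_proxy()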
8/7/06 NBCR 2006 18
Moving & storing data
OSG responsibilities:
— Define storage types & their APIs from WAN & LAN
— Define information schema for “finding” storage
— All storage is local to a site - no global filesystem!
VO responsibilities:
— Manage data transfer & catalogues (a transfer sketch follows below)
Site responsibilities:
— Choose which storage type to support & how much
— Implement the storage type according to OSG rules
— Truth in advertisement
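A minimal sketch of the VO-side data movement implied above, using the standard gridftp client to push a file into a site's gftp/SRM-managed area and pull it back. The storage-element host, port, and paths are placeholders, and a valid grid proxy is assumed; real VOs normally wrap this in their own transfer and catalogue tools.

    # Sketch only: copy a local file to a site's gridftp door and fetch it back.
    # "se.example.edu" and the remote path are hypothetical placeholders.
    import subprocess

    SITE_GFTP = "gsiftp://se.example.edu:2811"

    def stage_in(local_path, remote_path):
        """Push a local file to the site's storage element over gridftp."""
        subprocess.run(["globus-url-copy", f"file://{local_path}", f"{SITE_GFTP}{remote_path}"], check=True)

    def stage_out(remote_path, local_path):
        """Pull a file from the site's storage element back to local disk."""
        subprocess.run(["globus-url-copy", f"{SITE_GFTP}{remote_path}", f"file://{local_path}"], check=True)

    if __name__ == "__main__":
        stage_in("/tmp/input.dat", "/vo/myvo/input.dat")
        stage_out("/vo/myvo/input.dat", "/tmp/input_copy.dat")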
8/7/06 NBCR 2006 19
Disk areas in some detail:
Shared filesystem as applications area at the site.
— Read-only from the compute cluster.
— Role-based installation via GRAM.
Batch-slot-specific local work space.
— No persistency beyond the batch slot lease.
— Not shared across batch slots.
— Read & write access (of course).
SRM/gftp controlled data area.
— “Persistent” data store beyond job boundaries.
— Job-related stage in/out.
— SRM v1.1 today.
— SRM v2 expected in late 2006 (space reservation).
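A sketch of how a job running in a batch slot might use the three areas just described. The environment-variable names (APP_DIR, DATA_DIR, SCRATCH_DIR) are hypothetical stand-ins for whatever the site actually advertises, not OSG-defined names, and the sketch assumes the data area happens to be visible via (N)FS; otherwise data moves via gftp as in the earlier sketch.

    # Sketch: use the application area (read-only), the SRM/gftp data area, and
    # the per-slot scratch space from inside a job.  Variable names and default
    # paths are hypothetical; take the real locations from the site's configuration.
    import os, tempfile

    app_dir = os.environ.get("APP_DIR", "/grid/app")        # shared, read-only application area
    data_dir = os.environ.get("DATA_DIR", "/grid/data")     # SRM/gftp-controlled data area
    scratch = os.environ.get("SCRATCH_DIR", tempfile.gettempdir())  # batch-slot local work space

    os.chdir(scratch)                                        # all transient I/O stays on local disk
    with open(os.path.join(data_dir, "input.dat")) as src:   # data staged in earlier via SRM/gftp
        payload = src.read()
    with open("output.dat", "w") as out:                     # written locally, then copied back to the
        out.write(payload.upper())                           # data area for stage-out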
8/7/06 NBCR 2006 20
Securing your data
Archival storage in your trusted archive:
— You control where your data is archived.
Data moved by a party you trust:
— You control who moves your data.
— You control encryption of your data.
You compute at sites you trust:
— E.g. sites that guarantee a specific unix uid for you.
— E.g. sites whose security model satisfies your needs.
You decide how secure your data needs to be!
8/7/06 NBCR 2006 21
Submitting jobs/workloads
OSG responsibilities:
— Define the interface to the batch system (today: pre-WS GRAM)
— Define the information schema
— Provide middleware that implements the above (a Condor-G submit sketch follows below)
VO responsibilities:
— Manage submissions & workflows
— VO-controlled workload management system, or a wms from other grids, e.g. EGEE/LCG
Site responsibilities:
— Choose the batch system
— Configure the interface according to OSG rules
— Truth in advertisement
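A minimal sketch of the kind of submit description Condor-G consumes for the pre-WS GRAM interface named above; the gatekeeper "ce.example.edu/jobmanager-condor" and all file names are hypothetical placeholders, and VOs normally generate such files automatically from their workload systems.

    # Sketch: generate and submit a Condor-G job to an OSG compute element
    # via pre-WS (GT2) GRAM.  Host name and file names are placeholders.
    import subprocess

    submit = """\
    universe      = grid
    grid_resource = gt2 ce.example.edu/jobmanager-condor
    executable    = analyze.sh
    arguments     = input.dat
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue
    """

    with open("analyze.sub", "w") as f:
        f.write(submit)

    subprocess.run(["condor_submit", "analyze.sub"], check=True)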
8/7/06 NBCR 2006 22
Simple Workflow
Install application software at site(s):
— VO admin installs via GRAM.
— VO users have read-only access from batch slots.
“Download” data to site(s):
— VO admin moves data via SRM/gftp.
— VO users have read-only access from batch slots.
Submit job(s) to site(s):
— VO users submit job(s)/DAG via condor-g (see the DAG sketch after this list).
— Jobs run in batch slots, writing output to local disk.
— Jobs copy output from local disk to the SRM/gftp data area.
Collect output from site(s):
— VO users collect output from site(s) via SRM/gftp as part of the DAG.
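The job(s)/DAG step above could, for example, be expressed as a small DAGMan workflow chaining stage-in, processing, and stage-out; a sketch, assuming the three per-step submit files already exist (the names are placeholders):

    # Sketch: a three-node DAGMan workflow (stage-in -> analyze -> stage-out).
    # Each .sub file is a Condor(-G) submit description like the earlier example.
    import subprocess

    dag = """\
    JOB stagein  stagein.sub
    JOB analyze  analyze.sub
    JOB stageout stageout.sub
    PARENT stagein  CHILD analyze
    PARENT analyze  CHILD stageout
    """

    with open("workflow.dag", "w") as f:
        f.write(dag)

    subprocess.run(["condor_submit_dag", "workflow.dag"], check=True)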
8/7/06 NBCR 2006 23
Some technical details
Job submission:
— Condor:
  — Condor-G
  — “schedd on the side” (simple multi-site brokering using the Condor schedd)
  — Condor glide-in
— EGEE workload management system:
  — OSG CE compatible with the gLite Classic CE
  — Submissions via either the LCG 2.7 RB or the gLite RB, including bulk submission
— Virtual Data System (VDS) in use on OSG
Data placement using SRM:
— SRM/dCache in use to virtualize many disks into one storage system
— Schedule WAN Xfer across many gftp servers
— Typical WAN IO capability today ~ 10 TB/day ~ 2 Gbps
— Schedule random access from batch slots to many disks via LAN
— Typical LAN IO capability today ~ 0.5-5 GByte/sec
— Space reservation
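As a sanity check on the WAN figure above: 10 TB moved over a full day averages to

    \frac{10^{13}\ \mathrm{bytes} \times 8\ \mathrm{bits/byte}}{86\,400\ \mathrm{s}} \approx 0.9\ \mathrm{Gbps},

so the quoted ~2 Gbps presumably refers to the rate sustained while transfers are actually running rather than to a 24-hour average.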
8/7/06 NBCR 2006 24
Middleware lifecycle
— Domain science requirements.
— Joint projects between the OSG applications group & middleware developers to develop & test on community grids.
— Integrate into VDT and deploy on the OSG-ITB.
— Inclusion into an OSG release & deployment on (part of) the production grid.
EGEE et al.
Status of Utilization
OSG job = job submitted via an OSG CE
“Accounting” of OSG jobs not (yet) required!
8/7/06 NBCR 2006 26
OSG use by Numbers
32 Virtual Organizations:
— 3 with >1000 jobs max. (all particle physics)
— 3 with 500-1000 jobs max. (all outside physics)
— 5 with 100-500 jobs max. (particle, nuclear, and astrophysics)
8/7/06 NBCR 2006 27
[Usage plot, 5/05-5/06, peak running jobs by community: experimental particle physics ~2250 jobs; Bio/Eng/Med/Math and campus grids ~850 jobs (e.g. GADU using VDS, a PI from a campus grid); non-HEP physics ~100 jobs.]
8/7/06 NBCR 2006 30
CMS Xfer on OSG in June 2006
All CMS sites have exceeded 5 TB per day in June 2006.
Caltech, Purdue, UCSD, UFL, UW exceeded 10 TB/day.
Hoping to reach 30-40 TB/day capability by end of 2006.
[Transfer-rate plot; peak around 450 MByte/sec.]
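Assuming the plotted ~450 MByte/sec is the aggregate peak across the OSG CMS sites, that corresponds to

    450\ \mathrm{MB/s} \times 8 \approx 3.6\ \mathrm{Gbps}, \qquad 450\ \mathrm{MB/s} \times 86\,400\ \mathrm{s/day} \approx 39\ \mathrm{TB/day},

so the 30-40 TB/day capability targeted for the end of 2006 amounts to sustaining roughly this peak rate around the clock.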
Grid of Grids
OSG enables single PIs and user communities to operate transparently across grid boundaries globally.
E.g.: CMS, a particle physics experiment.
8/7/06 NBCR 2006 32
CMS Experiment - a global community grid
Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global scale.
[World map: the CMS Experiment spans CERN and sites in Germany, Taiwan, the UK, Italy, France, and the USA (FNAL, Caltech, Florida, Wisconsin, UCSD, Purdue, MIT, UNL), reaching across both OSG and EGEE.]
8/7/06 NBCR 2006 33
Grid of Grids - Production Interop
Job submission:
— 16,000 jobs per day submitted across EGEE & OSG via the “LCG RB”.
— Jobs brokered transparently onto both grids.
Data transfer:
— Peak IO of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites.
— All 8 CMS sites on OSG have exceeded the 5 TB/day goal.
— Caltech, FNAL, Purdue, UCSD/SDSC, UFL, UW exceed 10 TB/day.
8/7/06 NBCR 2006 34
CMS Xfer FNAL to World
The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global Xfer challenge.
Peak Xfer rates of ~5 Gbps are reached.
8/7/06 NBCR 2006 35
Summary of OSG Status
OSG facility opened July 22nd 2005.
OSG facility is under steady use:
— ~2-3000 jobs at all times
— Mostly HEP, but large Bio/Eng/Med occasionally
— Moderate other physics (Astro/Nuclear)
OSG project:
— 5-year proposal to DOE & NSF funded starting FY07.
— Facility & Improve/Expand/Extend/Interoperate & E&O
Off to a running start … but lots more to do:
— Routinely exceeding 1 Gbps at 6 sites; scale by x4 by 2008 and many more sites.
— Routinely exceeding 1000 running jobs per client; scale by at least x10 by 2008.
— Have reached a 99% success rate for 10,000 jobs per day submission; need to reach this routinely, even under heavy load.