Upload
bernard-hutchinson
View
215
Download
1
Embed Size (px)
Citation preview
f
Run II Experiments and the Grid
Amber BoehnleinFermilab
September 16, 2005
September 13-14 2005
2
f
Run II Computing Review
DO Status
• DO is running SAMGrid for MC production and Reprocessing– SAMGrid is a 1st generation production
system• Typical configuration, installation and
robustness issues are being addressed– LCG SamGrid interoperability proto-type is
going well– OSG resource selector will be developed in
order to facilitate similar functionality as with LCG
September 13-14 2005
3
f
Run II Computing Review
CDF Status
• CDF has prototype grid job submission based on the CDF Analysis Facility that uses Condor Glide-in – Running well and usefully in “owner/operator” mode
on a few sites– Does not have integrated data handling– May not be handling tarballs– Requires installation on a head node, and outbound
node connectivity– Has some legacy security policies to address
• CAF is kerberos basedCDF has prototype grid job submission based on the CDF Analysis Facility that uses Condor Glide-in
September 13-14 2005
4
f
Run II Computing Review
Why?
• Glide-in technology is attractive in many ways.– There is always a certain appeal in the next great thing.
• Illustrative of a general tension for the Run II experiments– Competing agendas—difficult for CDF to turn down effort.
Italians support Glide-CAF• CDF wants to do analysis on the Grid, and they do not
want user interface to change.– Probably could have achieved that requirement other
ways, however CDF is also vested in the CAF as an model.• Ultimately probably beneficial to both CDF and DO
– If Glide-in works in production on a reasonable time scale, might be able to use
– VO specific services support is a motivation for the Edge Services Pre-proposal for OSG—Edge Services will almost certainly benefit DO
September 13-14 2005
5
f
Run II Computing Review
Run II computing in the LHC Era
• Grid is the strategic direction for FNAL CD to meet commitments to Run II, CMS and other stakeholders.– 05 Run II computing review complimented DO and CDF on moving
to towards grid models– Run II effort task force acknowledges strategy
• Concerns about– Availability of resources, especially disk– Urged to make more formal agreements– “Expenses” involved in operating a production Grid– About heavyweight and nonstandard interfaces on the production
system– About real world issues for the prototype
• Mitigations– DO and FNAL CD proposing an installation team, supported by the
review– Move towards standard interfaces, more robust– Guest Scientist positions could be used to leverage knowledge and
expertise—particularly in cases where physics potential would also leveraged.
September 13-14 2005
6
f
Run II Computing Review
OSG Pre Proposals
• The OSG Pre Proposal call was targeted at core functionality– SAMGrid was built with the support of PPDG
funds. – Noted that a service without customers is of
limited use.– Some calls to work closely with TERAGrid.– Still working through details for a full
proposal– Encouraged to make a proposal for an OSG
that will thrive!
September 13-14 2005
7
f
Run II Computing Review
Summary
September 13-14 2005
8
f
Run II Computing Review
RUN II Department Roles
• Operations—Running the systems, standing pager rotations/shifts, researching latest technologies– purchasing and deploying equipment – tracking down and fixing problems– code management
• Development—exploring use cases, writing code, introducing new features, testing, documenting, exploring technologies
• Integration—testing, more testing, training users, transition from development to operations
• Planning—how best to use resources to meet stakeholder needs, facility issues
• Interfacing – Serve in experiment management roles, bridging the CD and the experiments, CD department to CD department, hosting guest scientists
• Participate in physics analysis as collaboration members -- 30% of department FTEs hold scientific positions
September 13-14 2005
9
f
Run II Computing Review
Risks, expanded
• Increased calls on FNAL CD as migration of effort and equipment to LHC
• Declining equipment and operations budgets are already limiting the data collection rate.– Over time, limits in the equipment and operating budget will
create delays• Operational performance of user code
– DO reconstruction code performance and release turn-around
– CDF user code has caused inefficiencies on the CAF• COTS Computing
– Experiments need best price/performance, which introduces risk.
– Moore’s law– Have a good process in place for evaluation, purchase and
acceptance.– Each purchase of worker nodes presents challenges
• FNAL CD plays engineering/integrator role by default– Commodity fileservers are maintenance intensive
September 13-14 2005
10
f
Run II Computing Review
Risks, expanded
• Data Handling– SAM system, dCache, hardware working well– User patterns are still evolving, sometimes conflicts
between wanting to get results out and using standard production.
– Scaling with data sample size might have unanticipated consequences.
– Count on next generation tape drives to mitigate tape costs
• Longevity of hardware components and software applications– Starting to use a 4 year replacement cycle for worker
nodes so the equipment is off warranty the final year.– 5 year life cycle on major components, replacement
needed again around 2010 when budget for Run II will be extremely limited.
– Migrating either experiment from existing mode of operation or user interfaces would be time intensive and costly.