Run II Experiments and the Grid, Amber Boehnlein, Fermilab, September 16, 2005


Run II Experiments and the Grid

Amber Boehnlein, Fermilab

September 16, 2005

Run II Computing Review, September 13-14, 2005

DO Status

• DO is running SAMGrid for MC production and reprocessing
  – SAMGrid is a first-generation production system
    • Typical configuration, installation and robustness issues are being addressed
  – LCG-SAMGrid interoperability prototype is going well
  – OSG resource selector will be developed to provide functionality similar to that available with LCG


CDF Status

• CDF has prototype grid job submission based on the CDF Analysis Facility (CAF) that uses Condor Glide-in
  – Running well and usefully in “owner/operator” mode on a few sites
  – Does not have integrated data handling
  – May not be handling tarballs
  – Requires installation on a head node, and outbound node connectivity
  – Has some legacy security policies to address
    • CAF is Kerberos based
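As a rough illustration of the glide-in idea mentioned above: pilot jobs are first started on grid worker nodes, and only then pull user jobs from a central queue (late binding). The sketch below is a toy model in Python, not Condor code; the function name `run_glideins`, the slot labels, and the round-robin order are all assumptions made for illustration.

```python
from collections import deque


def run_glideins(user_jobs, pilot_slots):
    """Toy model of late binding: running pilots pull user jobs from a queue.

    user_jobs   -- job names waiting in the central (submit-side) queue
    pilot_slots -- labels of pilots that have already started on worker nodes
    Returns a dict mapping each pilot slot to the jobs it pulled, in order.
    """
    queue = deque(user_jobs)
    assignments = {slot: [] for slot in pilot_slots}
    if not pilot_slots:
        # No pilot has started anywhere, so every job stays queued.
        return assignments
    # Hypothetical policy: pilots ask for work in turn until the queue is empty.
    while queue:
        for slot in pilot_slots:
            if not queue:
                break
            assignments[slot].append(queue.popleft())
    return assignments


if __name__ == "__main__":
    print(run_glideins(["job1", "job2", "job3"], ["siteA/slot0", "siteB/slot0"]))
```

The point of the model is that a user job is never bound to a site at submission time; it runs wherever a pilot happens to be alive, which is why the submit-side user interface can stay unchanged.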


Why?

• Glide-in technology is attractive in many ways.
  – There is always a certain appeal in the next great thing.
• Illustrative of a general tension for the Run II experiments
  – Competing agendas: difficult for CDF to turn down effort. Italians support Glide-CAF
• CDF wants to do analysis on the Grid, and they do not want the user interface to change.
  – Probably could have achieved that requirement other ways; however, CDF is also vested in the CAF as a model.
• Ultimately probably beneficial to both CDF and DO
  – If Glide-in works in production on a reasonable time scale, might be able to use
  – VO-specific services support is a motivation for the Edge Services Pre-proposal for OSG; Edge Services will almost certainly benefit DO


Run II computing in the LHC Era

• Grid is the strategic direction for FNAL CD to meet commitments to Run II, CMS and other stakeholders.
  – 2005 Run II computing review complimented DO and CDF on moving towards grid models
  – Run II effort task force acknowledges strategy
• Concerns about
  – Availability of resources, especially disk
  – Urged to make more formal agreements
  – “Expenses” involved in operating a production Grid
  – About heavyweight and nonstandard interfaces on the production system
  – About real world issues for the prototype
• Mitigations
  – DO and FNAL CD proposing an installation team, supported by the review
  – Move towards standard interfaces, more robust
  – Guest Scientist positions could be used to leverage knowledge and expertise, particularly in cases where physics potential would also be leveraged.


OSG Pre Proposals

• The OSG Pre Proposal call was targeted at core functionality
  – SAMGrid was built with the support of PPDG funds.
  – Noted that a service without customers is of limited use.
  – Some calls to work closely with TeraGrid.
  – Still working through details for a full proposal
  – Encouraged to make a proposal for an OSG that will thrive!


Summary


RUN II Department Roles

• Operations: running the systems, standing pager rotations/shifts, researching latest technologies
  – purchasing and deploying equipment
  – tracking down and fixing problems
  – code management
• Development: exploring use cases, writing code, introducing new features, testing, documenting, exploring technologies
• Integration: testing, more testing, training users, transition from development to operations
• Planning: how best to use resources to meet stakeholder needs, facility issues
• Interfacing: serving in experiment management roles, bridging the CD and the experiments, CD department to CD department, hosting guest scientists
• Participating in physics analysis as collaboration members (30% of department FTEs hold scientific positions)


Risks, expanded

• Increased calls on FNAL CD as effort and equipment migrate to LHC
• Declining equipment and operations budgets are already limiting the data collection rate.
  – Over time, limits in the equipment and operating budget will create delays
• Operational performance of user code
  – DO reconstruction code performance and release turn-around
  – CDF user code has caused inefficiencies on the CAF
• COTS Computing
  – Experiments need best price/performance, which introduces risk.
  – Moore’s law
  – Have a good process in place for evaluation, purchase and acceptance.
  – Each purchase of worker nodes presents challenges
    • FNAL CD plays engineering/integrator role by default
  – Commodity fileservers are maintenance intensive


Risks, expanded

• Data Handling
  – SAM system, dCache, hardware working well
  – User patterns are still evolving, with occasional conflicts between wanting to get results out and using standard production.
  – Scaling with data sample size might have unanticipated consequences.
  – Count on next generation tape drives to mitigate tape costs
• Longevity of hardware components and software applications
  – Starting to use a 4 year replacement cycle for worker nodes so the equipment is off warranty the final year.
  – 5 year life cycle on major components, replacement needed again around 2010 when budget for Run II will be extremely limited.
  – Migrating either experiment from existing mode of operation or user interfaces would be time intensive and costly.
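The replacement-cycle arithmetic in the longevity bullets can be sketched in a few lines of Python. This is only an illustration: the 3 year warranty and the 2005 purchase year are assumptions (chosen to be consistent with "off warranty the final year" and "replacement needed again around 2010"), and the function names are hypothetical.

```python
def replacement_years(purchase_year, life_cycle_years, horizon_year):
    """List the years up to horizon_year in which a replacement comes due."""
    if life_cycle_years <= 0:
        raise ValueError("life_cycle_years must be positive")
    years = []
    year = purchase_year + life_cycle_years
    while year <= horizon_year:
        years.append(year)
        year += life_cycle_years
    return years


def off_warranty_years(life_cycle_years, warranty_years):
    """Number of years equipment runs without warranty before replacement."""
    return max(0, life_cycle_years - warranty_years)


if __name__ == "__main__":
    # Assumed: major components bought in 2005 with a 5 year life cycle.
    print(replacement_years(2005, 5, 2012))
    # Assumed: worker nodes carry a 3 year warranty on a 4 year cycle.
    print(off_warranty_years(4, 3))
```

Under these assumptions the major components come due once before 2012, in 2010, and worker nodes spend one year off warranty.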
