Grids Now: The USCMS Integration Grid Testbed and the European Data Grid

Michael Ernst

6-Mar-2003

Why Grids?

Efficient sharing of distributed heterogeneous compute and storage resources.

- Virtual Organizations and institutional resource sharing
- Dynamic reallocation of resources to target specific problems
- Collaboration-wide data access and analysis environments

Grid solutions need to be scalable and robust. They must handle:
- Petabytes per year
- Tens of thousands of CPUs
- Tens of thousands of jobs

Grid solutions presented here are supported in part by the EDG, DataTag, PPDG, GriPhyN, and iVDGL.

We are learning a lot from these current efforts for LCG-1!

Introduction

US CMS is positioning itself to learn, prototype, and develop while providing a production environment that meets CMS, US CMS, and LCG demands.

R&D, Integration, Production

The three-phase approach is not a new idea, but it is mission critical:
- Development Grid Testbed (DGT)
- Integration Grid Testbed (IGT)
- Production Grid

Commissioning with CMS Monte Carlo Production

USCMS IGT is producing:
- 1M Egamma “BigJets” events from generation stage all the way through analysis Ntuples
  - Time dominated by GEANT simulation, but no intermediate data is kept
  - Must run on RedHat Linux 6.1 resources for Objectivity license
- 500K Egamma “BigJets” events at CMSIM stage only

EDG is producing:
- 1M Egamma “BigJets” events at CMSIM stage only

Comparison of requests:
- CMSIM-only involves large intermediate data (~1.5 TB) transfers
- Full simulation through to Ntuples involves more complex workflow

Special IGT Software: MOP

MOP is a system for packaging production processing jobs into DAGMan job descriptions

DAGMan jobs are Directed Acyclic Graphs (DAGs).

MOP uses the following DAG nodes for each job:
- Stage-in: Stages in needed application files, scripts, and data from the submit host
- Run: The application(s) run on the remote host
- Stage-out: The produced data is staged out from the execution site back to the submit host
- Clean-up: Temporary areas on the remote site are cleaned
- Publish: Data is published to a GDMP replica catalogue after it is returned
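The five nodes above can be sketched as a DAGMan input file. This is a minimal illustration, not MOP's actual output: the node names, the submit-file names, and the strictly linear ordering are assumptions. Only the `JOB` and `PARENT ... CHILD` keywords are standard DAGMan syntax.

```python
# Hypothetical sketch: emit a DAGMan input file for one MOP-style job.
# Node names, submit-file names, and the linear chain are assumptions.

NODE_ORDER = ["stage_in", "run", "stage_out", "clean_up", "publish"]


def build_dag(job_id: str) -> str:
    """Return DAGMan text with one JOB line per node and a linear chain."""
    lines = []
    for node in NODE_ORDER:
        # JOB <node name> <Condor submit description file>
        lines.append(f"JOB {job_id}_{node} {job_id}_{node}.sub")
    for parent, child in zip(NODE_ORDER, NODE_ORDER[1:]):
        # The child node starts only after the parent completes successfully.
        lines.append(f"PARENT {job_id}_{parent} CHILD {job_id}_{child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag("bigjets_0001"))
```

Because a child waits for its parent to succeed, a failed stage-in prevents the run node from starting at all.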

[Figure: MOP architecture. Master site: IMPALA, mop_submitter, DAGMan, Condor-G, GridFTP. Remote sites 1 to N: batch queue, GridFTP.]

Grid Resources

The USCMS Integration Grid Testbed (IGT) comprises:
- About 230 CPU (750 MHz equivalent, RedHat Linux 6.1)
- Additional 80 CPU at 2.4 GHz running RedHat Linux 7.X
- About 5 TB local disk space plus Enstore mass storage at FNAL using dCache
- This is the combined USCMS Tier-1/Tier-2 resources: Caltech, Fermilab, U Florida, UC San Diego, (UW Madison)

The European Data Grid (EDG) comprises:
- About 350 CPU (750-1000 MHz equivalent, mostly RHL 6.2)
- Up to 400 additional CPU at Lyon that are shared
- CERN, CNAF, RAL, (NIKHEF), Legnaro, Lyon
- About 3.7 TB local disk space plus CASTOR mass storage at CERN and HPSS at Lyon

Current Grid Software

Both grids use Globus and Condor core middleware.

USCMS Integration Grid Testbed (IGT):
- Using Virtual Data Toolkit (VDT) 1.1.3
- “Bottom-up” approach
  - Advantage: Bugs have been shaken out of the core middleware products

European Data Grid:
- Using EDG release 1.3
- Enhanced functionality
  - E.g., using multiple Resource Brokers relying on real-time monitoring information to schedule jobs
- “Top-down” approach
  - Advantage: More functionality has been added

The CMS IGT “Stack”

The CMS IGT “stack” comprises nine layers. The Application layer contains only CMS executables. The Job Creation layer comprises the CMS-provided tools MCRunJob and Impala; neither is specifically “grid aware.” Next come a DAG Creation layer and a Job Submission layer, both provided by MOP. Jobs are submitted to DAGMan, which, through Condor-G, manages jobs run on remote Globus Job Managers. Finally, a local farm or batch system is used by Globus GRAM to manage jobs. In the case of the IGT, the local batch manager was always FBSNG or Condor. Scheduling and integrated monitoring are not present.

[Figure: the CMS IGT stack. Layers: Application, Job Creation, DAG Creation, Job Submission, DAGMan/Condor-G, Globus, Network, Job Manager, Farm/Batch System. Components shown: CMS applications, MCRunJob, MOP, Globus/GRAM, FBS/PBS/Condor, VDT, Mass Storage System (dCache + Enstore).]

Monitoring

MonALISA is used as the primary IGT Grid-wide monitor:
- Physical parameters: CPU load, network usage, disk space, etc.
- Dynamic discovery of monitoring targets and schema
- Implemented with Java/Jini with SNMP local monitors
- Interfaces to/from other monitoring packages

EDG uses two sources of monitoring:
- EDG Monitoring System (based on the Globus Monitoring and Discovery Service, MDS)
  - Physical parameters: CPU load, network usage, disk space, etc.
- BOSS (Batch Object Submission System)
  - Application level: running time, size of output, job progress, etc.
  - Stores information in a MySQL DB in real time

ML Design Considerations

Act as a true dynamic service and provide the necessary functionality to be used by any other services that require such information (Jini, UDDI - WSDL / SOAP):
- mechanism to dynamically discover all the "Farm Units" used by a community
- remote event notification for changes in any system
- lease mechanism for each registered unit

Allow dynamic configuration, including the list of monitored parameters.

Integrate existing monitoring tools (SNMP, Ganglia, Hawkeye …).

To provide:
- single-farm values and details for each node
- network aspect
- real time information
- historical data and extracted trend information
- listener subscription / notification
- (mobile) agent filters (algorithms for prediction and decision-support)
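The lease and notification ideas listed above can be illustrated with a toy registry. This is only a sketch of the general pattern. It is not the Jini or MonALISA API, and the class, method, and event names are hypothetical.

```python
import time


class LeaseRegistry:
    """Toy registry: a unit stays registered only while its lease is renewed."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.expiry = {}      # unit name -> time at which its lease runs out
        self.listeners = []   # callbacks invoked as listener(event, unit)

    def subscribe(self, listener):
        """Register a callback for 'joined' and 'left' events."""
        self.listeners.append(listener)

    def _notify(self, event, unit):
        for listener in self.listeners:
            listener(event, unit)

    def register(self, unit):
        """Register a unit, or renew the lease of a known unit."""
        is_new = unit not in self.expiry
        self.expiry[unit] = self.clock() + self.lease_seconds
        if is_new:
            self._notify("joined", unit)

    def expire(self):
        """Drop units whose lease has lapsed and notify listeners."""
        now = self.clock()
        for unit in [u for u, t in self.expiry.items() if t <= now]:
            del self.expiry[unit]
            self._notify("left", unit)

    def active_units(self):
        """Return the names of units with a live lease, sorted."""
        self.expire()
        return sorted(self.expiry)
```

A farm unit that crashes simply stops renewing, so it drops out of the registry without anyone having to deregister it explicitly.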

6-Mar-2003

Grids NOW

12

IGT Results (so far)

Time to process 1 event: 500 sec @ 750 MHz

Speedup: Avg factor of 100 speedup during current run

Resources: Approximately 230 CPU @750 MHz equiv.

Sustained efficiency: about 43.5%
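The figures above are roughly self-consistent if the speedup is read as CPU count times sustained efficiency. That reading is an assumption, since the slides do not say how the factor of 100 was derived. A short calculation under that assumption:

```python
def effective_cpus(cpus: int, efficiency: float) -> float:
    """CPUs' worth of useful work, assuming speedup = CPUs x efficiency."""
    return cpus * efficiency


def events_per_second(cpus: int, efficiency: float, sec_per_event: float) -> float:
    """Aggregate event rate implied by the per-event time on one CPU."""
    return effective_cpus(cpus, efficiency) / sec_per_event


if __name__ == "__main__":
    # Inputs taken from the slide: 230 CPUs, 43.5% efficiency, 500 s/event.
    print(effective_cpus(230, 0.435))            # about 100
    print(events_per_second(230, 0.435, 500.0))  # about 0.2 events/s
```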

EDG CMSIM events vs. time

Avg. CPU utilization was “about the same”

Analysis of Problems on IGT

- The usual: servers die and need to be restarted
  - But nothing that seems to be related to the load...
- Failure semantics currently lean heavily towards automatic resubmission of failed jobs
  - Sometimes failures are not recognized right away
  - Need better system for spotting chronic problems
    - BOSS does this already; we aren’t using it because it was still under development when we were planning the IGT
- Problems must be better routed to the “right” people
  - At one point, an application problem was misdiagnosed early as a middleware issue
    - Partly because IGT is currently run by middleware developers!
  - Once the application expert looked into it, the problem was solved in 90 minutes
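One simple way to spot chronic problems that automatic resubmission would otherwise hide is to count repeated failures per site and reason. The sketch below illustrates that idea only. The record layout and the threshold are hypothetical, and it does not represent how BOSS or MOP work.

```python
from collections import Counter


def find_chronic_failures(failure_log, threshold=3):
    """Return (site, reason) pairs that failed at least `threshold` times.

    failure_log is an iterable of (job_id, site, reason) tuples; this
    record layout is an assumption made for the example.
    """
    counts = Counter((site, reason) for _job, site, reason in failure_log)
    return sorted(key for key, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    log = [
        (1, "site_a", "stage-in timeout"),
        (2, "site_a", "stage-in timeout"),
        (3, "site_b", "application crash"),
        (4, "site_a", "stage-in timeout"),
    ]
    print(find_chronic_failures(log))
```

A pair that keeps reappearing points at a site or application fault to hand to the right expert, rather than a transient error that resubmission will cure.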

Analysis of Problems on EDG

The biggest problems related to the Information System:
- Symptom: no resources are found
  - Cause: instability of the MDS when it is overloaded
  - Solution: submitting jobs at a lower rate improves the chances of success
- Symptom: the RB gets stuck (no job ever starts)
  - Cause: investigating...
- Symptom: grid elements disappear from the II
  - Cause: services on some machines stopped working
  - Solution: restart the services
- Symptom: timeouts when copying the input sandbox
- Symptom: log file lost (“Stdout does not contain useful data”)
  - Cause: several (no free files/inodes, broken connection between CE & RB, …)

Problems related to the replica manager:
- Symptom: file registration in the RC fails from time to time

None of these problems is a show-stopper, and they happen in just a fraction of the jobs! Fixes are already there for some of them (but not yet deployed).
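The workaround of submitting jobs at a lower rate can be sketched as a throttled submission loop with back-off. This is a generic illustration, not the EDG job submission interface: the `submit` callable, the interval, and the retry count are all hypothetical.

```python
import time


def submit_throttled(jobs, submit, min_interval=30.0, max_retries=3,
                     sleep=time.sleep):
    """Submit jobs one at a time, pausing after every attempt.

    `submit(job)` is a caller-supplied function returning True on success.
    Returns the jobs that still failed after `max_retries` attempts.
    """
    failed = []
    for job in jobs:
        delay = min_interval
        for _ in range(max_retries):
            ok = submit(job)
            sleep(delay)   # always pause, so the submit rate stays bounded
            if ok:
                break
            delay *= 2     # back off further after a failure
        else:
            failed.append(job)
    return failed
```

Pausing after successful submissions as well as failed ones is what keeps the load on the information system bounded; the doubling only adds extra slack when it already appears overloaded.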

The Integration Grid Testbed (IGT)

          2002 (IGT)  2002 (PG)  2003 (New)  2003 (IGT)  2003 (PG)
FNAL          60          0         260          10         310
Florida       80          0         175           5         250
Caltech      120          0          88           5         203
UCSD         128          0          88           5         211
Total        388          0         611          25         974

Resource Allocations (1 GHz equiv. CPU) in 2002/2003 for IGT and Production Grid. (R&D Grid not included.)

New resources for Tier-2 are from iVDGL.
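The allocation table can be checked for internal consistency. The sketch below verifies the column totals and one further observation: for every site, the 2002 resources plus the new 2003 resources equal the 2003 IGT plus 2003 Production Grid allocations. Reading that as "everything is redistributed between IGT and PG in 2003" is an interpretation of the numbers, not something the slide states.

```python
# Columns: 2002 IGT, 2002 PG, 2003 New, 2003 IGT, 2003 PG (1 GHz equiv. CPUs)
ALLOCATIONS = {
    "FNAL": (60, 0, 260, 10, 310),
    "Florida": (80, 0, 175, 5, 250),
    "Caltech": (120, 0, 88, 5, 203),
    "UCSD": (128, 0, 88, 5, 211),
}
TOTAL = (388, 0, 611, 25, 974)


def column_sums(table):
    """Sum each column over all sites."""
    return tuple(sum(col) for col in zip(*table.values()))


def balanced(row):
    """True if 2002 resources plus new 2003 resources equal the 2003 split."""
    igt02, pg02, new03, igt03, pg03 = row
    return igt02 + pg02 + new03 == igt03 + pg03


if __name__ == "__main__":
    print(column_sums(ALLOCATIONS) == TOTAL)
    print(all(balanced(row) for row in ALLOCATIONS.values()))
```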

Comparison to CMS Spring 02

This is really apples and oranges, but...
- Average CPU utilization was about 20-25% during Spring02 over all participating CMS Regional Centers
  - Though it is impossible to extract a concrete number...
  - It really isn’t known how many Spring02 CPUs should be counted in the denominator; estimated 700-1000 CPUs
- File transfers were much more complicated during Spring02
  - Objectivity data was kept and archived
  - Different events were processed at different steps

But the current efforts show that the Grids are in the same ballpark!

We still need a factor of ~2.5 for DC04 production: we want slightly better than 1 evt/sec.

Manpower Results (so far)

Estimates of manpower for the IGT:
- 2.65 FTE equivalent during initial phase and debugging
  - Reported voluntarily in response to a general query
- 1.1 FTE equivalent during smooth running periods
  - The manager plus periodic small file transfers
- Expected to be less than 1 FTE when regular shift procedures are adopted

Caveats:
- This is the “second” attempt for the IGT. The first attempt last Spring needed more manpower.
- We STILL have a rapidly changing middleware environment
- This is not counting general sysadmin support
- This is really saying that production ops is becoming less specialized

EDG began in early December (first attempts); no manpower estimates yet.
- A task force has been set up including reps from EDG, CMS, and LCG

Next steps

For the EDG:
- Deploy the fixes for the problems encountered so far
- Put the online monitoring in place

For the IGT:
- Deploy fixes for problems uncovered so far
- Deploy more functionality

IGT and EDG are preparation for LCG-1:
- Recently, LCG started participating in the IGT
  - 50 nodes running CMSIM-only production
- Getting “top” and “bottom” to meet in the middle

Conclusion

Our approach to developing the software systems for the distributed data processing environment adopts “rolling prototyping”:
- Analyze current practices
- Prototyping of the distributed processing environment
- Software support and transitioning
- Servicing external milestones

Next prototype system to be delivered is the US CMS contribution to the LCG Production Grid (June 2003):
- CMS will run a large Data Challenge on that system to prove the computing systems (including new object storage solution?)

This scheme allows us to flexibly react to technology developments AND to changing and developing external requirements
