ATLAS Goals and Status
Jim Shank
US LHC OSG Technology Roadmap
May 4-5th, 2005
Outline
High-level US ATLAS goals
Milestone Summary and Goals
Identified Grid3 Shortcomings
Workload Management Related
Distributed Data Management Related
VO Service Related
US ATLAS High Level Requirements
Ability to transfer large datasets between CERN, BNL, and the Tier2 Centers
Ability to store and manage access to large datasets at BNL
Ability to produce Monte Carlo at Tier2 and other sites and deliver it to BNL
Ability to store, serve, catalog, manage, and discover ATLAS-wide datasets, for both users and the ATLAS file transfer services
Ability to create and submit user analysis jobs to data locations
Ability to access non-dedicated ATLAS resources opportunistically
Deliver all these capabilities at a scale commensurate with the available resources (CPU, storage, network, and human) and the number of users
ATLAS Computing Timeline
2003 • POOL/SEAL release (done)
     • ATLAS release 7 (with POOL persistency) (done)
     • LCG-1 deployment (done)
2004 • ATLAS complete Geant4 validation (done)
     • ATLAS release 8 (done)
     • DC2 Phase 1: simulation production (done)
     • DC2 Phase 2: intensive reconstruction (the real challenge!) LATE!
     • Combined test beams (barrel wedge) (done)
     • Computing Model paper (done)
2005 • Computing Memorandum of Understanding (in progress)
     • ATLAS Computing TDR and LCG TDR (starting)
     • Rome Physics Workshop (NOW)
2006 • DC3: produce data for PRR and test LCG-n
     • Physics Readiness Report
2007 • Start cosmic ray run
     • GO!
Grid Pressure
• Grid3/DC2 was a very valuable exercise.
• There is now a lot of pressure on me to develop backup plans in case the "grid middleware" does not come through with the required functionality on our timescale.
• Since manpower is short, this could mean pulling manpower out of our OSG effort.
Schedule & Organization
Grid Tools & Services
2.3.4.1 Grid Service Infrastructure
2.3.4.2 Grid Workload Management
2.3.4.3 Grid Data Management
2.3.4.4 Grid Integration & Validation
2.3.4.5 Grid User Support
[Gantt chart: WBS items 2.3.4.1-2.3.4.5 scheduled across 2005-2007, against the ATLAS milestones DC2, CSC, TDR, CRR, PRR, and detector start; NOW = May 2005]
Milestone Summary (I)
2.3.4.1 Grid Service Infrastructure
May 2005 ATLAS software management service upgrade
June 2005 Deployment of OSG 0.2 (expected increments after this)
June 2005 ATLAS-wide monitoring and accounting service
June 2005 ATLAS site certification service
June 2005 LCG interoperability (OSG-LCG)
July 2005 SRM/dCache deployed on Tier2 Centers
Sep 2005 LCG interoperability services for SC05
Milestone Summary (II)
2.3.4.2 Grid Workload Management
June 2005: Defined metrics for submit host scalability met
June 2005: Capone recovery possible without job loss
July 2005: Capone2 WMS for ADA
Aug 2005: Pre-production Capone2 delivered to Integration team
  - Integration with SRM
  - Job scheduling
  - Provision of Grid WMS for ADA
  - Integration with DDMS
Sep 2005: Validated Capone2 + DDMS
Oct 2005: Capone2 WMS with LCG Interoperability components
Milestone Summary (III)
2.3.4.3 Grid Data Management
April 2005: Interfaces and functionality for OSG-based DDM
June 2005: Integrate with OSG-based storage services with benchmarks
Aug 2005: Interfaces to the ATLAS OSG workload management system (Capone)
Nov 2005: Implementing storage authorization and policies
Milestone Summary (IV)
2.3.4.4 Grid Integration & Validation
March 2005: Deployment of ATLAS on OSG ITB (Integration Testbed)
June 2005: Pre-production Capone2 deployed and validated OSG ITB
July 2005: Distributed analysis service implemented with WMS
August 2005: Integrated DDM+WMS service challenges on OSG
Sept 2005: CSC (formerly DC3) full functionality for production service
Oct 2005: Large scale distributed analysis challenge on OSG
Nov 2005: OSG-LCG Interoperability exercises with ATLAS
Dec 2005: CSC full functionality pre-production validation
ATLAS DC2 Overview of Grids as of 2005-02-24 18:11:30
Grid       submitted  pending  running  finished   failed  efficiency
Grid3             36        3      814   153028    46943       77%
NorduGrid         29      130     1105   114264    70349       62%
LCG               60      528      610   145692   242247       38%
TOTAL            125      661     2529   412984   359539       53%
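The efficiency column is simply finished / (finished + failed); a quick sanity check of the Grid3 and TOTAL rows from the table above:

```python
# Grid efficiency as reported in the DC2 summary table:
# the fraction of completed jobs that finished successfully.
def efficiency(finished: int, failed: int) -> float:
    return finished / (finished + failed)

print(round(100 * efficiency(153028, 46943)))   # Grid3: 77
print(round(100 * efficiency(412984, 359539)))  # TOTAL: 53
```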
Capone + Grid3 Performance
• Capone submitted & managed >150K ATLAS jobs on Grid3
• 1.2M CPU-hours in 2004
• Grid3 sites with more than 1000 successful DC2 jobs: 20
• Capone instances with more than 1000 jobs: 13
ATLAS DC2 and Rome Production on Grid3
[Plot: Capone jobs/day on Grid3 during DC2 and Rome production; average 350 jobs/day, maximum 1020 jobs/day]
Most Painful Shortcomings
• "Unprotected" Grid services: GT2 GRAM and GridFTP vulnerable to multiple users and VOs; frequent manual intervention by site administrators
• No reliable file transfer service or other data management services, such as space management
• No policy-based authorization infrastructure: no distinction between production and individual grid users
• Lack of a reliable information service
• Overall robustness and reliability (on client and server) poor
ATLAS Production System
[Diagram: the ATLAS production system. A central prodDB at CERN feeds the Windmill supervisors, which drive one executor per grid flavor: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3), and a legacy LSF executor, communicating over Jabber and SOAP. Don Quijote ("DQ") handles data management across the per-grid RLS catalogs; job metadata lives in AMI.]
This system drives the requirements for VO and core services: the DDMS and the WMS.
Grid Data Management
ATLAS Distributed Data Management System (DDMS)
US ATLAS GTS role in this project:
• Provide input to the design: expected interfaces, functionality, scalability performance and metrics, based on experience with Grid3 and DC2 and on compatibility with OSG services
• Integrate with OSG-based storage services; benchmark OSG implementation choices against ATLAS standards
• Specify and develop, as needed, the OSG-specific components required for integration with the overall ATLAS system
• Introduce new middleware services as they mature
• Interface to the OSG workload management system (Capone)
• Implement storage authorization and policies for role-based usage (reservation, expiration, cleanup, connection to VOMS, etc.), consistent with ATLAS data management tools and services
DDMS Issues: File Transfers
• Dataset discovery and distribution for both production and analysis services
• AOD distribution from CERN to the Tier1
• Monte Carlo produced at Tier2s delivered to the Tier1
• Support role-based priorities for transfers and space
Reliable FT in DDMS
• MySQL backend
• Services for agents and clients
• Schedule transfers for later execution; monitor and account resource usage
• Security and policy issues affecting the core infrastructure:
  • Authentication and authorization infrastructure needs to be in place
  • Priorities for production managers & end-users need to be settable
  • Roles to set group and user quotas and permissions
• DQ evolving (M. Branco)
LCG Deployment Group
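The agent model above (a MySQL-backed request queue that schedules transfers for later and retries failures) can be sketched as follows. This is a minimal illustration, with SQLite standing in for the slide's MySQL backend and `do_transfer` a placeholder for the real transfer mechanism (e.g. a GridFTP call):

```python
import sqlite3

def do_transfer(source: str, dest: str) -> bool:
    """Placeholder for the actual transfer (e.g. GridFTP third-party copy)."""
    return True

class TransferAgent:
    """Minimal DB-backed file-transfer agent: clients enqueue requests,
    the agent works through them on later passes and records outcomes."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY, source TEXT, dest TEXT,"
            " state TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0)")

    def enqueue(self, source: str, dest: str):
        self.db.execute("INSERT INTO queue (source, dest) VALUES (?, ?)",
                        (source, dest))
        self.db.commit()

    def run_once(self, max_attempts: int = 3):
        """One scheduling pass: try every pending transfer; failures are
        retried on later passes, up to max_attempts."""
        rows = self.db.execute(
            "SELECT id, source, dest FROM queue WHERE state = 'pending'"
        ).fetchall()
        for rowid, source, dest in rows:
            if do_transfer(source, dest):
                self.db.execute(
                    "UPDATE queue SET state = 'done', attempts = attempts + 1"
                    " WHERE id = ?", (rowid,))
            else:
                self.db.execute(
                    "UPDATE queue SET attempts = attempts + 1, state ="
                    " CASE WHEN attempts + 1 >= ? THEN 'failed'"
                    " ELSE 'pending' END WHERE id = ?",
                    (max_attempts, rowid))
        self.db.commit()
```

Because the queue survives in the database, a crashed agent can restart and pick up exactly the pending requests it had before.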
LCG-required SRM functions
• SRM v1.1 insufficient: mainly lack of pinning
• SRM v3 not required, and its timescale is too late
• Require Volatile and Permanent space; Durable not practical
• Global space reservation: reserve, release, update (mandatory for LHCb, useful for ATLAS, ALICE). Compactspace NN
• Permissions on directories mandatory; prefer permissions based on roles rather than DN (SRM integrated with VOMS desirable, but timescale?)
• Directory functions (except mv) should be implemented asap
• Pin/unpin high priority
• srmGetProtocols useful but not mandatory
• Abort, suspend, resume request: all low priority
• Relative paths in SURL important for ATLAS and LHCb, not for ALICE
• CMS input/comments not included yet
DDMS INTERFACING TO STORAGE
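To make the requested operations concrete, here is a toy in-memory model of space reservation and pinning. The class and method names paraphrase the bullet list above; they are not the actual SRM interface:

```python
from enum import Enum
from itertools import count

class SpaceType(Enum):
    VOLATILE = "volatile"    # required
    PERMANENT = "permanent"  # required; Durable was judged not practical

class SpaceManager:
    """Toy model of the requested space-management calls
    (reserve/release/update) and file pinning."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.reservations = {}   # token -> (SpaceType, size in bytes)
        self.pinned = set()      # SURLs protected from cleanup
        self._tokens = count(1)

    def reserved_total(self) -> int:
        return sum(size for _, size in self.reservations.values())

    def reserve(self, space_type: SpaceType, size: int) -> str:
        if self.reserved_total() + size > self.capacity:
            raise RuntimeError("not enough free space")
        token = f"token-{next(self._tokens)}"
        self.reservations[token] = (space_type, size)
        return token

    def update(self, token: str, new_size: int):
        space_type, _ = self.reservations[token]
        self.reservations[token] = (space_type, new_size)

    def release(self, token: str):
        del self.reservations[token]

    def pin(self, surl: str):
        self.pinned.add(surl)    # pinning: the main gap in SRM v1.1

    def unpin(self, surl: str):
        self.pinned.discard(surl)
```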
Managing Persistent Services
General summary: larger sites and ATLAS-controlled sites should let us run services such as the LRC, space management, and replication.
Other sites can be handled by being part of a multi-site domain managed by one of our sites, i.e. their persistent-service needs are covered by services running at our sites, and site-local actions like space management happen via submitted jobs (working with the remote service, and therefore requiring some remote connectivity or gateway proxying) rather than via local persistent services.
Scalable Remote Data Access
ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere.
This presents additional traffic on the network, especially for large sites.
We suggest the use of local mechanisms, such as web proxy caches, to minimize this impact.
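The proxy-cache suggestion is, in essence, a read-through cache placed between the jobs and the remote server; a site-local web proxy (e.g. squid) plays this role at the HTTP layer. A minimal sketch of the idea, with `fetch_remote` a stand-in for the remote round trip:

```python
def fetch_remote(key: str) -> str:
    """Stand-in for a round trip to the remote database server."""
    return f"payload-for-{key}"

class ReadThroughCache:
    """Serve repeated requests locally; only misses go remote."""

    def __init__(self, fetch=fetch_remote):
        self.fetch = fetch
        self.store = {}
        self.remote_calls = 0   # counts actual round trips to the server

    def get(self, key: str) -> str:
        if key not in self.store:
            self.remote_calls += 1          # only a miss touches the network
            self.store[key] = self.fetch(key)
        return self.store[key]
```

Many jobs at one site asking for the same conditions data then cost a single remote round trip instead of one per job.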
Workload Management
• LCG core-services contributions focused on the LCG RB, etc.
• Here we address the scalability and stability problems experienced on Grid3
• Defined metrics, to be achieved by job batching or by DAG appends:
  • 5000 active jobs from a single submit host
  • Submission rate: 30 jobs/minute
  • >90% job efficiency for accepted jobs
  • All of US production managed by 1 person
• Job state persistency mechanism: Capone recovery possible without job loss
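One way to realize job-state persistency is to write every state change through to disk, so a restarted submit host recovers all in-flight jobs without losing them. A minimal sketch, with a JSON file standing in for Capone's actual store:

```python
import json
import os

class PersistentJobStore:
    """Write each job-state change through to disk so that a submit-host
    restart can resume every in-flight job where it left off."""

    def __init__(self, path: str):
        self.path = path
        self.jobs = {}
        if os.path.exists(path):        # recovery after a crash or restart
            with open(path) as f:
                self.jobs = json.load(f)

    def set_state(self, job_id: str, state: str):
        self.jobs[job_id] = state
        tmp = self.path + ".tmp"        # atomic replace: never a torn file
        with open(tmp, "w") as f:
            json.dump(self.jobs, f)
        os.replace(tmp, self.path)
```

The atomic write matters: a crash mid-update must not corrupt the store, or recovery itself fails.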
Other Considerations
• Require a reliable information system
• Integration with managed storage: several resources in OSG will be SRM/dCache, which does not support third-party transfers (used heavily in Capone); NeST is an option for space management
• Job scheduling: static at present; change to matchmaking based on data location, job queue depth, and policy
• Integration with the ATLAS data management system (evolving): the current system works directly with a file-based RLS; the new system will interface with the new ATLAS dataset model and the POOL file catalog interface
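Such matchmaking could rank candidate sites on data locality, queue depth, and policy. The fields and weights below are illustrative only, not the production algorithm:

```python
def rank_sites(sites, job):
    """Rank candidate sites for a job: prefer sites already holding the
    input data, penalize deep queues, and respect a simple policy flag."""
    def score(site):
        s = 0.0
        if job["input_dataset"] in site["datasets"]:
            s += 100.0           # data already local: no staging needed
        s -= site["queued_jobs"] # deeper queue, lower rank
        if not site["accepts_vo"]:
            s -= 1e9             # policy: site closed to this VO
        return s
    return sorted(sites, key=score, reverse=True)

# Hypothetical site states for illustration.
sites = [
    {"name": "BNL", "datasets": {"dc2.simul.A"}, "queued_jobs": 40, "accepts_vo": True},
    {"name": "UC",  "datasets": set(),           "queued_jobs": 2,  "accepts_vo": True},
    {"name": "IU",  "datasets": {"dc2.simul.A"}, "queued_jobs": 5,  "accepts_vo": True},
]
best = rank_sites(sites, {"input_dataset": "dc2.simul.A"})[0]["name"]
```

Here IU wins: it holds the data (unlike UC) and has a much shorter queue than BNL.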
Grid Interoperability
Interoperability with LCG
• A number of proof-of-principle exercises completed
• Progress on the GLUE schema with LCG
• Publication of US ATLAS resources (compute, storage) to the LCG index service (BDII)
• Use of the LCG-developed Generic Information Provider
• Submit ATLAS jobs from LCG to OSG (via an LCG Resource Broker)
• Submit from an OSG site to LCG services directly via Condor-G
• Demonstration challenge under discussion with LCG in OSG Interoperability
Interoperability with TeraGrid
• Initial discussions begun in the OSG Interoperability Activity; F. Luehring (US ATLAS) co-Chair
• Initial issues identified: authorization, allocations, platform dependencies
Summary
• We've been busy almost continuously with production since July 2004.
• We uncovered many shortcomings in the Grid3 core infrastructure and in our own services (e.g. WMS scalability, no DDMS).
• Expect to see more ATLAS services and agents on gatekeeper nodes, especially for the DDMS.
• We need VO-scoped management of resources by the WMS and DDMS until middleware services supply these capabilities.
• To be compatible with ATLAS requirements and OSG principles, we see resources partitioned (hosted dedicated-VO, guest-VO, and general OSG opportunistic resources) to achieve reliability and robustness.
• Time is of the essence. A grid component that fully meets our specs but is delivered late is the same as a failed component: we will have already implemented a workaround.
Appendix
Grid3 System Architecture
[Diagram: elements of the execution environment used to run ATLAS DC2 on Grid3. The dashed (red) box marks the processes executing on the Capone submit host ("Capone inside"): Windmill, Capone (jobsch), Condor-G (schedd, GridMgr), Pegasus, Chimera, VDC, and DonQuijote. Yellow boxes indicate VDT components. The submit host reaches the Grid3 sites (CE, gsiftp, WN, SE) via GRAM, and consults the ProdDB, RLS, GridCat, MDS, and the MonALISA monitoring servers.]
Capone Workload Manager
• Message interface: Web Service & Jabber
• Translation layer: Windmill (Prodsys) schema; Distributed Analysis (ADA) - TBD
• Process execution: Capone Finite State Machine (CPE-FSM); Process eXecution Engine (PXE), from an SBIR project (FiveSight)
• Processes: Grid interface (GCE Client); Stub (local shell-script testing); Data Management - TBD
[Diagram: the Capone architecture, showing three common component layers (message protocols: Jabber, Web Service; translation: ADA, Windmill; process execution: PXE, CPE-FSM) above selectable modules (Stub, Grid (GCE Client), Data Management) which interact with local and remote services.]
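The process-execution layer is a finite state machine; a minimal sketch of a job FSM in that spirit, with illustrative states (not Capone's actual state set) and the rule that only declared transitions are legal:

```python
class JobFSM:
    """Minimal finite-state machine for a grid job's lifecycle, in the
    spirit of Capone's CPE-FSM layer. States and transitions here are
    illustrative only."""

    TRANSITIONS = {
        "created":   {"submitted"},
        "submitted": {"running", "failed"},
        "running":   {"stage_out", "failed"},
        "stage_out": {"done", "failed"},
        "failed":    {"submitted"},   # recovery: resubmit, job not lost
        "done":      set(),
    }

    def __init__(self):
        self.state = "created"

    def advance(self, new_state: str):
        # Reject anything not declared for the current state, so bugs
        # surface as errors instead of silently corrupting job state.
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Combined with a persistent store of each job's current state, this is what makes recovery without job loss tractable: on restart, every job resumes from its recorded state.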