
Page 1

José M. Hernández

CIEMAT

Grid Computing in the CMS Experiment at the LHC

Jornada de usuarios de Infraestructuras Grid

19-20 January 2012, CIEMAT, Madrid

Page 2

The CMS Experiment at the LHC


The Large Hadron Collider

p-p collisions, 7 TeV, 40 MHz

The Compact Muon Solenoid

Precision measurements

Search for new phenomena

Page 3

LHC: a challenge for computing

The Large Hadron Collider at CERN is the largest scientific instrument on the planet

Unprecedented data handling scale: 40 MHz event rate (~1 GHz collision rate) → ~100 TB/s → online filtering to ~300 Hz (~300 MB/s) → ~3 PB/year (10^7 s of data taking per year)

Need large computing power to process the data: complex events

Many interesting signals at rates << 1 Hz

Thousands of scientists around the world access and analyze the data

Need computing infrastructure able to store, move around the globe, process, simulate and analyze data at the Petabyte scale [O(10) PB/year]
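The petabyte-scale figure follows directly from the numbers above. Below is a back-of-the-envelope sketch (not from the slides; the ~1 MB event size is an assumption consistent with the quoted rates):

```python
# Back-of-the-envelope estimate of the CMS data volume after online filtering.
# Inputs taken from the rates quoted above; the ~1 MB/event size is an assumption.
rate_hz = 300            # events/s written out by the online filter
event_size_mb = 1.0      # assumed average event size in MB
live_seconds = 1e7       # effective data-taking time per year (10^7 s)

throughput_mb_s = rate_hz * event_size_mb                 # ~300 MB/s
volume_pb_year = throughput_mb_s * live_seconds / 1e9     # MB -> PB

print(f"throughput ~ {throughput_mb_s:.0f} MB/s")         # ~300 MB/s
print(f"raw data volume ~ {volume_pb_year:.1f} PB/year")  # ~3 PB/year
```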


Page 4

The LHC Computing Grid


LCG: 300+ centers, 50+ countries, ~100k CPUs, ~100 PB disk/tape, 10k users

The LHC Computing Grid provides the distributed computing infrastructure

Computing resources (CPU, storage, networking)

Computing services (data and job management, monitoring, etc.)

Integrated to provide a single LHC computing service

Using Grid technologies

Transparent and reliable access to heterogeneous, geographically distributed computing resources via the Internet

High capacity wide area networking

Page 5

The CMS Computing Model

Distributed computing model for data storage, processing and analysis

Grid technologies (Worldwide LHC Computing Grid, WLCG)

Tiered architecture of computing resources

~20 Petabytes of data (real and simulated) every year

About 200k jobs (data processing, simulation production and analysis) per day

Page 6

WLCG network infrastructure


T0-T1 and T1-T1 are interconnected via the LHCOPN (10 Gbps links)

T1-T2 and T2-T2 traffic uses general-purpose research networks; a dedicated network infrastructure (LHCONE) is being deployed

Page 7

Grid services in WLCG

Middleware providers: gLite/EMI, OSG, ARC

Global services: data transfers and job management, authentication / authorization, information system

Compute elements (gateway, local batch system, WNs) and storage elements (GridFTP servers, disk servers, mass storage system) at the sites

Experiment specific services


Page 8

CMS Data and Workload Management

Experiment-specific DMWM services on top of the basic Grid services

Pilot-based WMS

Data bookkeeping, location and transfer systems

Data is pre-located at the sites

Jobs go to the data

Experiment software is pre-installed at the sites

[Architecture diagram: CMS services (production system WMAgent, analysis system CRAB, data bookkeeping & location system DBS, data transfer system PhEDEx, pilot-based WMS), used by operators and users, sit on top of Grid services (gLite WMS, File Transfer System) and site resources (CEs, SEs, local batch systems, mass storage systems).]
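To make the pilot model concrete, here is a minimal sketch (not CMS code; all class, function and dataset names are hypothetical): pilots are submitted to sites as ordinary Grid jobs, and each pilot, once running on a worker node, verifies its environment and pulls matching payloads from a central queue, preferring jobs whose input data is already located at that site.

```python
"""Minimal sketch of a pilot-based workload management system (hypothetical names,
not the actual CMS WMAgent/CRAB implementation)."""

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Payload:
    job_id: str
    dataset: str               # input dataset the job needs
    priority: int = 0


@dataclass
class TaskQueue:
    """Central queue of payloads; data location is known from the bookkeeping system."""
    payloads: list = field(default_factory=list)
    dataset_locations: dict = field(default_factory=dict)  # dataset -> set of sites

    def match(self, site: str) -> Optional[Payload]:
        # Prefer jobs whose input data is already located at the pilot's site ("jobs go to data").
        local = [p for p in self.payloads if site in self.dataset_locations.get(p.dataset, set())]
        if not local:
            return None
        best = max(local, key=lambda p: p.priority)   # global priorities applied centrally
        self.payloads.remove(best)
        return best


def run_pilot(site: str, queue: TaskQueue) -> None:
    """What a pilot does once the Grid has landed it on a worker node at `site`."""
    if not node_environment_ok():
        return                      # bad node: give the slot back instead of failing user jobs
    while (payload := queue.match(site)) is not None:
        execute(payload)            # set up environment, run the job, stage out the output


def node_environment_ok() -> bool:
    # Placeholder for checks of disk space, software area, outbound connectivity, etc.
    return True


def execute(payload: Payload) -> None:
    print(f"running {payload.job_id} on dataset {payload.dataset}")


if __name__ == "__main__":
    q = TaskQueue(
        payloads=[Payload("analysis-001", "/DoubleMu/Run2011A"), Payload("mc-042", "/TTJets/Fall11", 5)],
        dataset_locations={"/DoubleMu/Run2011A": {"T2_ES_CIEMAT"}, "/TTJets/Fall11": {"T1_ES_PIC"}},
    )
    run_pilot("T2_ES_CIEMAT", q)
```

The key point of the design is that a payload is chosen only after a slot has actually been acquired and the node verified, which is what gives the pilot model its smaller overhead and centrally applied priorities (see the "Lessons learnt" slides).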

Page 9

CMS Grid Operations - Jobs

Large-scale data processing & analysis

~50k used slots, 300k jobs/day

Plots correspond to Aug 2011 – Jan 2012


Page 10

Spanish contribution to CMS Computing Resources


Spain contributes ~5% of the CMS computing resources

PIC Tier-1: ~1/2 of an average Tier-1 (3000 cores, 4 PB disk, 6 PB tape)

IFCA Tier-2: ~2/3 of an average Tier-2, ~3% of T2 resources (1000 CPUs, 600 TB disk)

CIEMAT Tier-2: ~2/3 of an average Tier-2, ~3% of T2 resources (1000 cores, 600 TB disk)

Page 11

Contribution from Spanish sites


~5% of the total CPU delivered for CMS

CPU delivered Feb 2011 – Jan 2012

Page 12

CMS Grid Operations - Data

Large-scale data replication: 1-2 GB/s throughput CMS-wide

~1 PB/week data transfers

Full mesh across 50+ sites: T0→T1, T1↔T1, T1↔T2, T2↔T2


[Plots: production and debug transfer throughput, each around 1 GB/s.]

Page 13

Site monitoring/readiness


Page 14

Lessons learnt


Porting the production and analysis applications to the Grid was easy

Package the job wrapper and user libraries into the input sandbox

Experiment software pre-installed at the sites

The job wrapper sets up the environment, runs the job and stages out the output (see the sketch after this list)

When running at large scale in WLCG, additional services are needed

Job and data management services on top of the Grid services

Data bookkeeping and location

Monitoring
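As an illustration of the job-wrapper pattern described above, here is a minimal sketch (illustrative only; the paths, environment variable and the stage-out command are assumptions, not the actual CMS wrapper): it sets up the pre-installed experiment software, runs the payload shipped in the input sandbox, and stages the output out to a storage element.

```python
"""Minimal sketch of a Grid job wrapper (illustrative only; paths, environment
variables and the stage-out command are assumptions, not the real CMS wrapper)."""

import os
import subprocess
import sys


def setup_environment(software_area: str, release: str) -> dict:
    # Start from the pre-installed experiment software area on the worker node.
    env = os.environ.copy()
    env["EXPERIMENT_SW_DIR"] = software_area          # hypothetical variable name
    env["PATH"] = f"{software_area}/{release}/bin:" + env["PATH"]
    return env


def run_payload(executable: str, env: dict) -> int:
    # The executable and user libraries arrive with the job in the input sandbox.
    return subprocess.call(["./" + executable], env=env)


def stage_out(local_file: str, destination: str) -> None:
    # Copy the output to the site's storage element; a generic copy tool is assumed
    # here (real jobs would use a Grid data-management client).
    subprocess.check_call(["grid-copy", local_file, destination])


if __name__ == "__main__":
    env = setup_environment("/opt/experiment/sw", "release_4_2_8")
    status = run_payload("analysis.sh", env)
    if status == 0:
        stage_out("output.root", "srm://storage.example.org/store/user/output.root")
    sys.exit(status)
```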

Page 15

Lessons learnt


Monitoring is essential

Multi-layer complex system (experiment, Grid and site layers)

Monitor workflows, services, sites

Experiment services should be robust

Deal with the (inherent) Grid unreliability

Be prepared for retries and cool-off periods (see the sketch after this list)

Pilot-based WMS: the gLite BDII and WMS were not reliable enough

Smaller overhead, verification of the node environment, global priorities, etc.

Isolating users from the Grid; dedicated Grid operations team

Lots of manpower is needed to operate the system

Central operations team (~20 FTE)

Contacts at sites (50+)
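The retry/cool-off advice above boils down to a small, generic pattern (illustrative sketch, not CMS code; the delays and exception handling are assumptions): failed Grid operations are retried a bounded number of times, with a cool-off delay so that a struggling service or site is not hammered.

```python
"""Generic retry-with-cool-off pattern for unreliable Grid operations
(illustrative sketch; delays and exception handling are assumptions)."""

import time


def with_retries(operation, max_attempts=3, cooloff_seconds=60):
    """Run `operation`; on failure wait (cool off) and retry, up to `max_attempts`."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as err:                      # e.g. transfer or submission failure
            if attempt == max_attempts:
                raise                                 # give up: report to monitoring / operators
            print(f"attempt {attempt} failed ({err}); cooling off {cooloff_seconds}s")
            time.sleep(cooloff_seconds)
            cooloff_seconds *= 2                      # back off progressively


if __name__ == "__main__":
    def flaky_transfer():
        raise RuntimeError("storage element temporarily unavailable")

    try:
        with_retries(flaky_transfer, max_attempts=2, cooloff_seconds=1)
    except RuntimeError:
        print("transfer failed after all retries; job marked for later resubmission")
```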

Page 16

Future developments


Dynamic data placement/deletion

Most of the pre-located data is not actually accessed much

Investigating automatic replication of hot data, deletion of cold data

Replicate data when it is accessed by jobs and cache it locally (see the sketch after this list)

Remote data access

Jobs go to free slots and access the data remotely

CMS has considerably improved read performance over the WAN

At the moment remote access is only used for fail-over and overflow

Service to asynchronously copy user data

Remote stage-out from the worker node is a bad idea

Multi-core processing

More efficient use of multi-core nodes, savings in RAM, and far fewer jobs to handle
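A minimal sketch of the popularity-based placement idea mentioned above (hypothetical thresholds, names and decision rules; not the actual CMS implementation): datasets with many recent accesses get extra replicas, and replicas that have gone cold become candidates for deletion.

```python
"""Sketch of dynamic data placement driven by access popularity
(illustrative only; thresholds, names and decision rules are assumptions)."""

from dataclasses import dataclass


@dataclass
class Replica:
    dataset: str
    site: str
    recent_accesses: int       # e.g. job accesses in the last few months
    n_replicas: int            # current number of replicas CMS-wide


HOT_THRESHOLD = 500            # assumed: accesses above this make a dataset "hot"
COLD_THRESHOLD = 5             # assumed: accesses below this make a replica "cold"
MAX_REPLICAS = 4
MIN_REPLICAS = 1               # always keep a custodial copy


def placement_decision(replica: Replica) -> str:
    """Decide whether to add a replica, delete this one, or leave things as they are."""
    if replica.recent_accesses > HOT_THRESHOLD and replica.n_replicas < MAX_REPLICAS:
        return "replicate"     # hot data: add a copy close to where jobs run
    if replica.recent_accesses < COLD_THRESHOLD and replica.n_replicas > MIN_REPLICAS:
        return "delete"        # cold data: free the disk space
    return "keep"


if __name__ == "__main__":
    samples = [
        Replica("/DoubleMu/Run2011A", "T2_ES_CIEMAT", recent_accesses=1200, n_replicas=2),
        Replica("/OldMC/Summer10", "T2_ES_IFCA", recent_accesses=0, n_replicas=3),
    ]
    for r in samples:
        print(r.dataset, "->", placement_decision(r))
```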

Page 17

Future developments


Virtualization of WNs / cloud computing

Decouple the node OS from the application environment using VMs or chroot

Allow use of opportunistic resources

CernVM-FS (CVMFS) for distributing the experiment software
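As a small illustration of the idea, a job can pick up the experiment software from a CernVM-FS mount instead of a locally installed software area; in this sketch the mount point and fallback path are assumptions, not CMS policy.

```python
"""Sketch: prefer experiment software served via CernVM-FS over a local install
(illustrative; the mount point and local path are assumptions)."""

import os

CVMFS_AREA = "/cvmfs/experiment.example.org"   # assumed CernVM-FS mount point
LOCAL_AREA = "/opt/experiment/sw"              # assumed locally installed software area


def software_area() -> str:
    """Return the software area the job should use on this worker node."""
    if os.path.isdir(CVMFS_AREA):
        return CVMFS_AREA      # same read-only software tree on every node, no local install needed
    return LOCAL_AREA          # fall back to the site-installed software


if __name__ == "__main__":
    print("using software from", software_area())
```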

Page 18

Summary


CMS has been very successful in using the LHC Computing Grid at large scale

A lot of work has gone into making the system efficient, reliable and scalable

Some developments in the pipeline to make CMS distributed computing more dynamic and transparent