
Page 1: Take-away Messages from Lecture 1

LHC computing has been well sized to handle the production and analysis needs of the LHC (very high data rates and throughputs).
It is based on the hierarchical MONARC model and has been very successful:
- WLCG operates smoothly and reliably.
- Data is transferred efficiently and made available to everybody within a very short time: the Higgs boson discovery was announced within a week of the latest data update!
- The network has performed well and now allows for changes to the computing models.

Page 2: Grid computing enables the rapid delivery of physics results

[email protected] / August 2012

Page 3: Outlook to the Future


Page 4: Computing Model Evolution

Evolution of the computing models: from a strict hierarchy to a full mesh.

[Diagram: Hierarchy → Mesh]

Page 5: Evolution

During its development, the WLCG production grid has oscillated between structure and flexibility, driven by the capabilities of the infrastructure and the needs of the experiments.

Examples: ALICE remote access, PD2P/popularity-based placement (ATLAS), CMS full mesh.

Page 6: Data Management Evolution

Data management in the WLCG has been moving towards a less deterministic system as the software improved:

- It started with deterministic pre-placement of data on disk storage for all samples (ATLAS).
- Then came subscriptions driven by the physics groups (CMS).
- Then dynamic placement of data, based on access patterns, to replicate only the samples that were actually going to be looked at (ATLAS).
- Once I/O is optimized and the network links improve, we can send data over the wide area so jobs can run anywhere and access the data (ALICE, ATLAS, CMS). This is good for opportunistic resources, load balancing, clouds, or any other case where a sample will be accessed only once.

[Diagram arrow: Structure → Less Deterministic]

Page 7: Scheduling Evolution

The scheduling evolution has similar drivers:

- We started with a very deterministic system in which jobs were sent directly to a specific site.
- This early binding of jobs to resources left requests idle in long queues, with no ability to reschedule.
- All four experiments evolved to use pilot jobs, which make better scheduling decisions based on current information (a minimal late-binding sketch follows below).
- The pilot systems are now evolving further to allow submission to additional resources such as clouds.

What began as a deterministic system has evolved towards flexibility in both scheduling and resources.

[Diagram arrow: Structure → Less Deterministic]
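The late-binding idea fits in a few lines. Below is a minimal sketch (the queue contents, site name, and matching rule are invented for illustration; real pilot frameworks such as PanDA, DIRAC, AliEn, or glideinWMS are far richer):

import collections

# Hypothetical central queue of jobs, each tagged with the dataset it needs.
central_queue = collections.deque([
    {"id": 1, "dataset": "data12_8TeV.A"},
    {"id": 2, "dataset": "mc12.ttbar"},
    {"id": 3, "dataset": "data12_8TeV.B"},
])

def run_pilot(site, local_datasets, max_jobs=2):
    """One pilot: it has already secured a batch slot at `site`, so jobs are
    bound to that slot only now, using up-to-date local information."""
    done = 0
    for _ in range(len(central_queue)):
        if done >= max_jobs:
            break
        job = central_queue.popleft()
        if job["dataset"] in local_datasets:
            print(f"pilot@{site}: job {job['id']} runs on local {job['dataset']}")
            done += 1
        else:
            central_queue.append(job)  # leave it for a better-placed pilot

run_pilot("Tier-2_Example", {"data12_8TeV.A", "mc12.ttbar"})

The crucial difference from direct submission is that no job is assigned until a pilot has actually validated a live slot, so nothing sits bound to a broken or overloaded site.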

Page 8: Data Access Frequency

More dynamic data placement is needed:

- fewer restrictions on where the data comes from,
- but data is still pushed to sites.

[Diagram: ATLAS data flows between Tier-1 and Tier-2 sites] (Ian Fisk, FNAL/CD)

Page 9: Popularity

Services like the Data Popularity Service track all file accesses and can show which data is accessed and for how long.

Over a year, popular data stays popular for reasonably long periods of time.

[Plot: CMS Data Popularity Service]

Page 10: Dynamic Data Placement

ATLAS uses the central queue and popularity information to understand how heavily used a dataset is (an illustrative placement rule is sketched below):

- additional copies of the data are made,
- jobs are re-brokered to use them,
- unused copies are cleaned up.

[Diagram: PanDA requests driving replication between Tier-1 and Tier-2 sites]
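To make the mechanism concrete, here is an illustrative popularity-driven placement rule. The thresholds and access counts are invented for the example; this is not ATLAS's actual PD2P algorithm:

# Hypothetical access statistics and replica catalogue.
accesses_last_week = {"dsA": 420, "dsB": 3, "dsC": 95}
replicas           = {"dsA": 1,   "dsB": 2, "dsC": 1}

ADD_THRESHOLD = 50    # weekly accesses per existing copy that justify a new one
CLEAN_THRESHOLD = 5   # fewer weekly accesses than this marks extra copies idle

for ds, n_acc in accesses_last_week.items():
    if n_acc / replicas[ds] > ADD_THRESHOLD:
        replicas[ds] += 1                       # replicate a hot dataset
        print(f"{ds}: {n_acc} accesses -> add a copy, now {replicas[ds]}")
    elif n_acc < CLEAN_THRESHOLD and replicas[ds] > 1:
        replicas[ds] -= 1                       # clean an unused extra copy
        print(f"{ds}: {n_acc} accesses -> clean a copy, now {replicas[ds]}")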

Page 11: Wide Area Access

With optimized I/O, other methods of managing the data and the storage become available:

- Data can be sent directly to applications over the WAN, allowing users to open any file regardless of their own location or the file's source (a usage sketch follows below).
- Sites deploy at least one xrootd server that acts as a proxy/door.
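From the user's point of view, wide area access just means that a remote URL works like a local path. A minimal sketch with ROOT's Python bindings (the redirector host and file path are placeholders; a real federation publishes its own entry point):

import ROOT

url = "root://xrootd-redirector.example.org//store/data/Run2012A/sample.root"
f = ROOT.TFile.Open(url)        # the redirector finds a site that holds the file
if f and not f.IsZombie():
    print("opened remotely:", f.GetName())
    f.Close()
else:
    print("could not open", url)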

Page 12: Transparent Access to Data

Once we have the combination of dynamic placement, wide area access to data, and reasonable networking, the facilities can be treated as parts of one coherent system.

This also opens the door to new kinds of resources (opportunistic resources, commercial clouds, data centres, ...).

Page 13: Example: Expanding the CERN Tier-0

CERN is deploying a remote computing facility in Budapest:

- 200 Gb/s of networking between the centres, at 35 ms ping time (see the bandwidth-delay estimate below).
- As experiments, we cannot really tell the difference where the resources are installed.

[Diagram: CERN ↔ Budapest, 2 × 100 Gb/s links]
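To see what these numbers imply for transfers, here is a back-of-the-envelope bandwidth-delay estimate (an editorial addition, not from the slides):

bandwidth_bps = 200e9    # 2 x 100 Gb/s between the centres
rtt_s = 0.035            # the quoted 35 ms ping time

# Bandwidth-delay product: the data that must be in flight to fill the pipe.
bdp_gigabytes = bandwidth_bps * rtt_s / 8 / 1e9
print(f"{bdp_gigabytes:.3f} GB in flight")   # ~0.875 GB

Filling such a pipe therefore needs large TCP windows or many parallel streams, which is why the transfer tooling has to be tuned for it.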

Page 14: Tier-0: Wigner Data Centre, Budapest

- New facility due to be ready at the end of 2012.
- 1100 m² (725 m²) in an existing building, but with new infrastructure.
- 2 independent HV lines.
- Full UPS and diesel coverage for all IT load (and cooling).
- Maximum 2.7 MW.

Page 15: Networks

These 100 Gb/s links are the first in production for the WLCG; other sites will soon follow.

We have reduced the differences in site functionality, and then reduced even the perception that two sites are separate.

We can begin to think of the facility as one big centre rather than a cluster of centres. This concept can be expanded to many facilities.

Page 16: Changing the Services

The WLCG service architecture has been reasonably stable for over a decade. This is beginning to change with new middleware for resource provisioning:

- A variety of places are opening their resources to "cloud"-style provisioning.
- From a site perspective, this is often chosen for cluster-management and flexibility reasons.
- Everything is virtualized, and the services are put on top.

Page 17: Clouds vs Grids

Grids primarily offer standard services with agreed protocols: designed to be generic, but each executing a particular task.

Clouds offer the ability to build custom services and functions: more flexible, but also more work for the users.

Page 18: Trying This Out

CMS and ATLAS are trying to provision resources this way with their High Level Trigger farms, with OpenStack interfaced to the pilot systems (a provisioning sketch follows below):

- In CMS we got to 6000 running cores, and the facility looks like just another destination, even though no grid CE exists.
- It will be used for large-scale production running within a few weeks.
- Several sites have already requested similar connections to local resources.
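As an illustration of what "interfacing OpenStack to the pilot systems" can look like, here is a sketch using the python-novaclient library. All names, the image, the flavour, and the credentials are placeholders, and a real integration adds authentication, quota, and error handling:

from novaclient import client

nova = client.Client("2", "pilot-factory", "secret", "experiment-tenant",
                     "https://openstack.example.org:5000/v2.0/")

# cloud-init user-data: each VM boots straight into a pilot, so no grid CE
# is needed in front of the resources.
user_data = """#!/bin/bash
/opt/pilot/start-pilot.sh --queue central-task-queue
"""

nova.servers.create(
    name="pilot-worker",
    image=nova.images.find(name="sl6-worker"),   # hypothetical VM image
    flavor=nova.flavors.find(name="m1.large"),
    min_count=100,                               # ask for a group of machines
    userdata=user_data,
)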

Page 19: WLCG Will Remain a Grid

We have a grid because we need to collaborate and share resources; thus we will always have a "grid". Our network of trust is of enormous value for us and for (e-)science in general.

We also need distributed data management that supports very high data rates and throughputs, and we will continually work on these tools.

We are now working on how to integrate cloud infrastructures into the WLCG.

Page 20: Evolution of the Services and Tools


Page 21: Need for Common Solutions

Computing infrastructure is a necessary piece of the ultimate core mission of the HEP experiments, while the available development effort is steadily decreasing.

Common solutions try to take advantage of the similarities in the experiments' activities: they optimize the development effort and offer lower long-term maintenance and support costs. This goes together with the willingness of the experiments to work together.

There are successful examples in distributed data management, data analysis, and monitoring (HammerCloud, the Dashboards, Data Popularity, the Common Analysis Framework, ...), taking advantage of Long Shutdown 1 (LS1).

Page 22: Architecture of the Common Analysis Framework

[Diagram: architecture of the Common Analysis Framework]

Page 23: Evolution of Capacity: CERN & WLCG

- Modest growth until 2014.
- Anticipate x2 in 2015.
- Anticipate x5 after 2018.

[Chart annotations: "what we thought was needed at LHC start" vs. "what we actually used at LHC start!"]

Page 24: CMS Resource Utilization

Resource utilization was highest in 2012 for both the Tier-1 and the Tier-2 sites.

[Charts: "CMS Tier-1 Pledge Usage" and "CMS Tier-2 Pledge Usage", monthly from January 2012 to September 2013]

Page 25: Growth Curves for Resources

Growth curves for the CMS resources, comparing the Run 2 resource requests with flat growth (a toy flat-growth calculation follows below).

[Charts: "CMS Tier-1 CPU Run2" (kHS06), "CMS Tier-1 Disk Run2" (PB), and "Tier-1 Tape" (PB), 2012-2017, each comparing "Resource Request" with "Flat Growth"]
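For intuition, "flat growth" means a flat budget, under which capacity still grows because the same money buys more every year. A toy calculation (the starting capacity and the 20%/year price-performance gain are assumptions for illustration, not the slide's numbers):

capacity_khs06 = 300.0          # hypothetical Tier-1 CPU capacity in 2012
gain_per_year = 0.20            # assumed annual price-performance improvement

for year in range(2012, 2018):
    print(f"{year}: {capacity_khs06:6.1f} kHS06")
    capacity_khs06 *= 1 + gain_per_year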

Page 26: Conclusions

- In the first years of LHC data taking, the WLCG has helped deliver physics rapidly: data is available everywhere within 48 hours.
- This is just the start of decades of exploration of new physics, so the solutions must be sustainable!
- We are entering a phase of consolidation and, at the same time, evolution.
- LS1 is an opportunity for disruptive changes and for scale testing of new technologies: wide area access, dynamic data placement, new analysis tools, clouds.
- The challenges for computing (scale and complexity) will continue to increase.

Page 27: Evolving the Infrastructure

In the new resource provisioning model, the pilot infrastructure communicates with the resource provisioning tools directly, requesting groups of machines for periods of time (sketched below).

[Diagram: pilots send resource requests to the resource provisioning layer, which either goes through a cloud interface to start VMs with pilots, or through a CE and batch queue to start worker nodes with pilots]
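A minimal sketch of such a request loop (invented names and thresholds): the pilot infrastructure watches its own queue depth and leases machines for a fixed period, instead of submitting individual jobs through a CE:

def provision(queued_jobs, running_vms, jobs_per_vm=8, lease_hours=24):
    """Return how many additional VMs to request for the next lease period."""
    wanted = -(-queued_jobs // jobs_per_vm)     # ceiling division
    to_request = max(0, wanted - running_vms)
    print(f"{queued_jobs} queued jobs, {running_vms} VMs running "
          f"-> request {to_request} VMs for {lease_hours}h")
    return to_request

provision(queued_jobs=1500, running_vms=120)    # -> request 68 VMs for 24h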