Will / Can Clouds Replace Grids? A Three-Point Checklist

Jamie.Shiers@cern.ch

Grid Support Group, IT Department, CERN

Introduction

This talk tries to establish a checklist to determine whether Cloud computing could be an alternative – or possibly a complementary solution – to Grids for LHC-scale computing

In other words, it tries to build a list of criteria that a cloud-based solution must satisfy if it is to be considered an acceptable solution

This checklist leads naturally to a set of actions or possible project(s) in this area

Outstanding issues and / or current experience from today’s solution (aka Grid) are interleaved to emphasize the relevance of these criteria


Abstract

The WLCG service was declared officially open for production and analysis during the LCG Grid Fest held at CERN - with live contributions from around the world - on Friday 3rd October 2008. But the service is not without its problems: services or even whole sites suffer degradation or complete outage, with painful repercussions on experiment activities. The operations and service model is arguably not sustainable at this level, yet an important element of the funding comes to an end approximately one year after this conference! Cloud computing - which has been referred to as Grid computing with a viable business model - makes ambitious claims. Could it solve all - or even a significant fraction, say Monte Carlo production - of our computing problems? What would be the associated costs, and the technical and sociological implications? This presentation analyzes the Strengths, Weaknesses, Opportunities and Threats of these potential rival models from the viewpoint of the current WLCG service. It proposes studies that should be performed - beyond the existing, largely paper, analyses - and highlights some key differentiators between the two approaches.


What is Cloud Computing?

a. The latest in a series of hype;

b. Yet another form of utility computing;

c. Grid Computing but with a business model;

d. Where the action (money) is currently at;

e. All of the above?


Does it matter?

Some ten years ago Larry Ellison – head honcho at Oracle – declared:

“There have been 3 generations of computing: mainframe, client-server and Internet computing

There’ll be nothing new for one thousand (1000) years”

Curiously enough, just a couple of years later, Oracle declared Grid to be “the next big thing”


What is Grid Computing?

Today there are many definitions of Grid computing:

The definitive definition of a Grid is provided by [1] Ian Foster in his article "What is the Grid? A Three Point Checklist" [2].

The three points of this checklist are:

1. Computing resources are not administered centrally;

2. Open standards are used;

3. Non-trivial quality of service is achieved.



WLCG Key Performance Indicators

• Since the beginning of last year we have held conference calls every weekday, open to all experiments and sites, to follow up on short-term operations issues

• These have been well attended by the experiments, with somewhat patchier attendance from sites, but the minutes are widely and rapidly read by members of the WLCG Management Board and beyond

• A weekly summary is given to the Management Board, where we have tried to evolve towards a small set of Key Performance Indicators

• These currently include a summary of the GGUS tickets opened in the previous week by the LHC VOs, as well as the more important service incidents requiring follow-up: Service Incident Reports (aka post-mortems)


GGUS Summary

VO concerned   USER   TEAM   ALARM   TOTAL
ALICE             3      0       0       3
ATLAS            16     16       0      32
CMS              13      0       0      13
LHCb              9      2       0      11
Totals           41     18       0      59

• No alarm tickets – this may also reflect activity

• Increasing use of TEAM TICKETS

• Regular test of ALARM TICKETS coming soon! See https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090223#Tuesday under AOB
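The weekly ticket summary is simple enough to tabulate automatically. The sketch below is only an illustration: it hard-codes the counts shown above, and the names `TICKETS`, `row_totals` and `column_totals` are invented here, not part of GGUS or any WLCG tool.

```python
# Ticket counts copied from the GGUS summary above: (USER, TEAM, ALARM) per VO.
TICKET_TYPES = ("USER", "TEAM", "ALARM")
TICKETS = {
    "ALICE": (3, 0, 0),
    "ATLAS": (16, 16, 0),
    "CMS": (13, 0, 0),
    "LHCb": (9, 2, 0),
}


def row_totals(tickets):
    """Total number of tickets per VO (the TOTAL column)."""
    return {vo: sum(counts) for vo, counts in tickets.items()}


def column_totals(tickets):
    """Total number of tickets per ticket type (the Totals row)."""
    sums = [sum(column) for column in zip(*tickets.values())]
    return dict(zip(TICKET_TYPES, sums))


def grand_total(tickets):
    """All tickets opened by all VOs in the week."""
    return sum(row_totals(tickets).values())
```

Recomputing the totals from the per-VO rows in this way is a cheap consistency check on a report that is meant to be generated quasi-automatically.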


Intervention Summary (fake)

Site    # scheduled   # overran   # unscheduled   Hours sched.   Hours unsched.
Bilbo             5           0               1             10                4
Frodo             1           1               0              2               22
Drogo            27           0               0            165                0

• As with the GGUS summary, we will drill down in case of exceptions (examples highlighted above)

• Q: what are reasonable thresholds?

• Proposal: look briefly at ALL unscheduled interventions, ALL overruns and “high” (TBD) # of scheduled
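The proposal above could be sketched as a small filter over the summary rows. This is a minimal illustration under stated assumptions: the rows are the fake ones from the table, and `max_scheduled` is a hypothetical stand-in for the "high" threshold, which the slide leaves TBD.

```python
from dataclasses import dataclass


@dataclass
class SiteInterventions:
    site: str
    scheduled: int
    overran: int
    unscheduled: int
    hours_scheduled: float
    hours_unscheduled: float


# The (fake) rows from the intervention summary above.
SUMMARY = [
    SiteInterventions("Bilbo", 5, 0, 1, 10, 4),
    SiteInterventions("Frodo", 1, 1, 0, 2, 22),
    SiteInterventions("Drogo", 27, 0, 0, 165, 0),
]


def exceptions(rows, max_scheduled):
    """Return {site: [reasons]} for every row that deserves a closer look.

    max_scheduled is a hypothetical value for the "high" (TBD) threshold.
    """
    flagged = {}
    for row in rows:
        reasons = []
        if row.unscheduled > 0:   # ALL unscheduled interventions
            reasons.append("unscheduled")
        if row.overran > 0:       # ALL overruns
            reasons.append("overran")
        if row.scheduled > max_scheduled:
            reasons.append("many scheduled")
        if reasons:
            flagged[row.site] = reasons
    return flagged
```

With an arbitrary `max_scheduled=20`, all three fake sites are flagged, each for a different reason; a realistic threshold would have to come from operational experience.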


(Some) Unscheduled Interventions

Site: NL-T1 (SARA-MATRIX)
Reason: A DDN storage device partially crashed and needs a cold reboot and some additional actions. We are uncertain how long it will take. The SARA CE's may be affected.
Period announced: 23-02-2009 09:30 – 11:15
Intervention terminated: 23-02-2009 12:20

Site: NDGF
Reason: Some dCache pools offline from time to time due to bad hardware causing spontaneous reboots.
Period announced: 20-02-2009 15:22 – 23-02-2009 15:22
Terminated: 23-02-2009 16:25

We need to automatically harvest this information and improve follow-up reporting

A convenient place to provide such a report is at the daily WLCG operations call!
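As an illustration of what harvesting this information might yield, the sketch below computes how far each intervention above ran past its announced end. The timestamps are copied from the slide; the tuple layout and function names are assumptions made for this example, not an existing WLCG interface.

```python
from datetime import datetime

FMT = "%d-%m-%Y %H:%M"

# (site, announced start, announced end, actual end), copied from the slide.
INTERVENTIONS = [
    ("NL-T1 (SARA-MATRIX)", "23-02-2009 09:30", "23-02-2009 11:15", "23-02-2009 12:20"),
    ("NDGF", "20-02-2009 15:22", "23-02-2009 15:22", "23-02-2009 16:25"),
]


def overrun_minutes(announced_end, actual_end):
    """Minutes by which an intervention exceeded its announced window (0 if none)."""
    delta = datetime.strptime(actual_end, FMT) - datetime.strptime(announced_end, FMT)
    return max(0, int(delta.total_seconds() // 60))


def overrun_report(interventions):
    """Map each site to the overrun, in minutes, of its intervention."""
    return {
        site: overrun_minutes(announced_end, actual_end)
        for site, _start, announced_end, actual_end in interventions
    }
```

Both interventions ended roughly an hour after their announced end (65 and 63 minutes respectively), which is the kind of figure a follow-up report could list.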


[Figure: one panel per experiment: ALICE, ATLAS, CMS, LHCb]


Constant improvement of the quality of the infrastructure

Comparison of the CMS site availability, based on the results of SAM tests specific to the CMS VO, for the first and last quarters of 2008.


The Goal – WLCG Services

• The goal is that – by end 2009 – the weekly WLCG operations / service report is quasi-automatically generated 3 weeks out of 4 with no major service incidents – just a (tabular?) summary of the KPIs

We are currently very far from this target, with (typically) multiple service incidents that are either:

• New in a given week;

• Still being investigated or resolved several to many weeks later

Quite a few are avoidable too, if we followed some basic rules!

• By definition, such incidents are characterized by severe (or total) loss of a service or even of a complete site (or even Cloud in the case of ATLAS)

From February 2009 LHCC mini-review of (W)LCG
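One possible reading of the "3 weeks out of 4" goal is that at least 75% of weeks should pass with no major service incident. The sketch below checks that reading only; the per-week incident counts passed to it are hypothetical inputs, not WLCG data, and other interpretations (for example a sliding four-week window) are equally plausible.

```python
def clean_week_fraction(incidents_per_week):
    """Fraction of weeks with no major service incident."""
    if not incidents_per_week:
        raise ValueError("need at least one week of data")
    clean = sum(1 for count in incidents_per_week if count == 0)
    return clean / len(incidents_per_week)


def meets_goal(incidents_per_week, target=0.75):
    """True if the share of incident-free weeks reaches the target.

    target=0.75 encodes "3 weeks out of 4"; treating the goal as a simple
    fraction over the whole period is an assumption of this sketch.
    """
    return clean_week_fraction(incidents_per_week) >= target
```

For example, `meets_goal([0, 0, 1, 0])` holds for a made-up month with a single incident week, while `meets_goal([2, 1, 0, 3])` does not.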


How Can We Improve?

Change Management

• Plan and communicate changes carefully;

• Do not make untested changes on production systems – these can be extremely costly to recover from.

Incident Management

• The point is to learn from the experience and hopefully avoid similar problems in the future;

• Documenting clearly what happened together with possible action items is essential.

All teams must buy into this: it does not work simply by high-level management decision (which might not even filter down to the technical teams involved).

• CERN IT plans to address this systematically (ITIL) as part of its 2009+ Programme of Work

Pronounced “Common Sense”


What is LHC Scale Computing?

(W)LCG was initially declared to require “100,000 of today’s fastest PCs”

Technology has changed quite significantly since this was first written, but with a very minor change this still holds true: 100,000 cores

This is also used to loosely characterize “petascale” computing (aka supercomputing)

Any demonstration that we do must be on a scale commensurate with this – a 1% (or less) test is completely irrelevant!

(and we know that the data is petabyte-scale…)


Can Clouds Replace Grids?

We now have the first two of three points of this checklist:

1. Non-trivial quality of service must be achieved;

We have well understood metrics from day-to-day operation of WLCG services

2. The scale of the test(s) must be meaningful for petascale computing;

Obviously one cannot expect to dedicate 100,000 cores to the first prototype, but anything done on a scale < 1,000 cores will not be relevant to the conclusion!

3. Data
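The scale criterion is plain arithmetic: 1,000 cores is 1% of the 100,000-core figure. The sketch below makes that explicit; the constant and function names are invented for this illustration, and the cut-off follows the "< 1,000 cores" wording of point 2.

```python
LHC_SCALE_CORES = 100_000   # "100,000 cores", the LHC-scale figure quoted earlier
MIN_RELEVANT_CORES = 1_000  # tests below this "will not be relevant"


def scale_fraction(test_cores):
    """Size of a test relative to LHC-scale computing."""
    return test_cores / LHC_SCALE_CORES


def is_relevant_scale(test_cores):
    """Apply the cut-off from point 2: anything under 1,000 cores is ruled out."""
    return test_cores >= MIN_RELEVANT_CORES
```

A 1,000-core prototype is therefore the smallest test this rule admits, and it still represents only 1% of the target scale.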


Current Data Management vs Database Strategies

Data Management: Specify only the interface (e.g. SRM) and allow sites to choose the implementation (both of SRM and of the backend s/w & h/w mass storage system)

Databases: Agree on a single technology (for specific purposes) and agree on detailed implementation and deployment details

WLCG experience from both areas shows that you need to have very detailed control down to the lowest levels to get the required performance and scalability.

How can this be achieved through today’s (or tomorrow’s) Cloud interfaces?

Are we just dumb???


Major Service Incidents

• Quite a few such incidents are “DB-related” in the sense that they concern services with a DB backend

• The execution of a “not quite tested” procedure on ATLAS online led – partly due to the Xmas shutdown – to a break in replication of ATLAS conditions from online out to Tier1s of over 1 month (online-offline was restored much earlier)

• Various Oracle problems over many weeks affected numerous services (CASTOR, SRM, FTS, LFC, ATLAS conditions) at ASGC: need for ~1 FTE of suitably qualified personnel at WLCG Tier1 sites, particularly those running CASTOR; recommendations to follow CERN/3D DB configuration & perform a clean Oracle+CASTOR install; communication issues

• Various problems affecting CASTOR+SRM services at RAL over a prolonged period, including “Oracle bugs” strongly reminiscent of those seen at CERN with an earlier Oracle version: very similar (but not identical) problems seen recently at CERN & ASGC (not CNAF…)

• Plus not infrequent power + cooling problems [ + weather! ]

• Can take out an entire site – main concern is controlled recovery (and communication)

From February 2009 LHCC mini-review of (W)LCG


At the November 2008 WLCG workshops a recommendation was made that each WLCG Tier1 site should have at least 1 FTE of DBA effort. This effort (preferably spread over multiple people) should proactively monitor the databases behind the WLCG services at that site: CASTOR/dCache, LFC/FTS, conditions and other relevant applications.

The skills required include the ability to backup and recover, tune and debug the database and associated applications.

At least one WLCG Tier1 does not have this effort available today.


Services – Concrete Actions

1. Review on a regular (3-6 monthly?) basis open Oracle “Service Requests” that are significant risk factors for the WLCG service (Tier0+Tier1s+Oracle)

• The first such meeting is being set up and will hopefully take place prior to CHEP 2009

2. Perform “technology-oriented” reviews of the main storage solutions (CASTOR, dCache) focussing on service and operational issues

• Follow-on to Jan/Feb workshops in these areas; again report at pre-CHEP WLCG Collaboration Workshop

3. Perform Site Reviews – initially Tier0 and Tier1 sites – focussing again on service and operational issues.

• Will take some time to cover all sites; proposal is for the review panel to include members of the site to be reviewed, who will also participate in the reviews before and after that of their own site


Remaining Questions

Are Grids too complex?

Are Clouds too Simple?

IMHO we can learn much from the strengths and weaknesses of these approaches, particularly in the key (for us) areas of data(base) management & service provision. This must be a priority for the immediate future….

Do Grids have to be too complex?

Do Clouds have to be too simple?


Can Clouds Replace Grids? – The Checklist

We have established a short checklist that will allow us to determine whether clouds can replace – or be used in conjunction with – Grids for LHC-scale data intensive applications:

1. Non-trivial quality of service must be achieved;

2. The scale of the test(s) must be meaningful for petascale computing;

3. Data Volumes, Rates and Access patterns representative of LHC data acquisition, (re-)processing and analysis;

4. Cost (of entry; of ownership).


Conclusions

We cannot afford to ignore major trends in the computing industry

• Some may turn out to be dead-ends

• Some may die only to be reborn in a different guise

We have established – through a long series of challenges – a well-proven mechanism for determining whether a (set of) computing service(s) satisfies an agreed set of requirements

Not evaluating cloud computing for at least some HEP Use Cases would appear to be the one option we cannot afford to take…


Summary

The following targets must be met by a Cloud-based (or any other) solution to satisfy LHC-scale needs:

1. Non-trivial quality of service must be achieved;

2. The scale of the test(s) must be meaningful for petascale computing;

3. Data Volumes, Rates and Access patterns representative of LHC data acquisition, (re-)processing and analysis;

4. Cost (of entry; of ownership).


Acknowledgements

Elsie Gee (E. Gee!) – for many interesting but often heated discussions