RAL Tier1 Operations, Andrew Sansum, 18th April 2012


RAL Tier1 Operations

Andrew Sansum, 18th April 2012


Staffing

Staff changes since GridPP27:

Leavers
• Kier Hawker (Database Team Leader)

New Starters
• Orlin Alexandrov (Grid Team)
• Dimitrios (Fabric Team)
• Vasilij Savin (Fabric Team)

New Roles
• Ian Collier – “Grid Team” Leader
• Richard Sinclair – Database Team Leader
• James Adams – storage system development

10 April 2023 Tier-1 Status


Some Changes

• CVMFS in use for Atlas & LHCb:
  – The Atlas (NFS) software server used to give significant problems.
  – Some CVMFS teething issues but overall much better!
• Virtualisation:
  – Starting to bear fruit. Uses Hyper-V.
    • Numerous test systems
    • Production systems that do not require particular resilience.
• Quattor:
  – Large gains already made.


Database Infrastructure

We are making significant changes to the Oracle database infrastructure.

Why?
• Old servers are out of maintenance
• Move from 32bit to 64bit databases
• Performance improvements
• Standby systems
• Simplified architecture


Database Disk Arrays - Future

[Diagram showing: Fibrechannel SAN, Oracle RAC Nodes, Disk Arrays, Power Supplies (on UPS), Data Guard]


Castor

Changes since last GridPP Meeting:

• Castor upgrade to 2.1.10 (March)
• Castor version 2.1.10-1 (July) needed for the higher capacity "T10KC" tapes.
• Updated Garbage Collection Algorithm (to “LRU” rather than the default which is based on size). (July)
• (Moved ‘logrotate’ to 1pm rather than 4am.)
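The difference between an LRU and a size-based garbage collection policy can be illustrated with a minimal sketch. This is not Castor's implementation: the `CachedFile` record, the `pick_victims` function and the "largest first" reading of the size-based default are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFile:
    """Hypothetical record for a file held on a disk pool."""
    name: str
    size_bytes: int
    last_access: float  # seconds since the epoch


def pick_victims(files, bytes_to_free, policy="lru"):
    """Return the files to delete, in order, until bytes_to_free is reached.

    policy="lru"  : least recently accessed files go first.
    policy="size" : largest files go first (an assumed reading of a
                    size-based default; the slides give no detail).
    """
    if policy == "lru":
        ordered = sorted(files, key=lambda f: f.last_access)
    elif policy == "size":
        ordered = sorted(files, key=lambda f: f.size_bytes, reverse=True)
    else:
        raise ValueError(f"unknown policy: {policy!r}")

    victims, freed = [], 0
    for f in ordered:
        if freed >= bytes_to_free:
            break
        victims.append(f)
        freed += f.size_bytes
    return victims


pool = [
    CachedFile("a", size_bytes=10, last_access=300.0),
    CachedFile("b", size_bytes=100, last_access=200.0),
    CachedFile("c", size_bytes=50, last_access=100.0),
]
lru_names = [f.name for f in pick_victims(pool, 60, policy="lru")]
size_names = [f.name for f in pick_victims(pool, 60, policy="size")]
```

With this toy pool the LRU policy removes the files that have gone longest without access, whereas the size-based policy removes the single largest file even though it was accessed more recently.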


Recent Developments (I)

• Hardware
  – Procured and commissioned 2.6PB disk
  – Procured and commissioned 15kHS06 CPU
  – T10KC tape drives deployed and (1.5PB) ATLAS data migrated
  – New head nodes and core infrastructure storage capacity
  – Procured a new Tier-1 core network and new Site network
• ORACLE Database Hardware upgrade and re-organisation
  – Rebuilding database SAN infrastructure
  – Increased CASTOR database resilience. Now have two copies of the CASTOR database, maintained in step by Oracle Data Guard.
  – Upgraded 3D service to ORACLE 11
• Virtualisation infrastructure (Hyper-V) now approved for critical production systems (deployment starting).


• CASTOR (significant improvements in latency)
  – Upgraded to CASTOR 2.1.11-8 (major upgrade)
  – Head node replacement
• EMI/UMD upgrades of Grid Middleware


Castor Issues

• Load related issues on small/full service classes (e.g. AtlasScratchDisk; LHCbRawRDst)
  – Load can become concentrated on one or two disk servers.
  – Exacerbated by an uneven distribution of disk server sizes.
• Solutions:
  – Add more capacity; clean-up.
  – Changes to tape migration policies.
  – Re-organization of service classes.


Disk Server Outages by Cause (2011)


Disk Drive Failure – Year 2011


Double Disk Failures (2011)

In process of updating the firmware on the particular batch of disk controllers.


Data Loss Incidents

Summary of losses since GridPP26

Total of 12 incidents logged:
• 1 – Due to a disk server failure (loss of 8 files for CMS)
• 1 – Due to a bad tape (loss of 3 files for LHCb)
• 1 – Files not in Castor Nameserver but no location (9 LHCb files)
• 9 – Cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).

Checksumming is in place for tape and disk files. Daily and random checks are made on disk files.
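A random checksum check on disk files might look roughly like the sketch below. It is illustrative only: the slides do not say which checksum algorithm or tooling is used, so Adler-32 (via Python's standard `zlib.adler32`), the `catalogue` mapping of path to expected checksum, and both function names are assumptions.

```python
import random
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Stream a file and return its Adler-32 checksum as an unsigned int."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def check_random_sample(catalogue, sample_size, rng=None):
    """Re-checksum a random sample of files and return the mismatching paths.

    catalogue maps file path -> checksum recorded when the file was written
    (a hypothetical stand-in for whatever the storage system records).
    """
    rng = rng or random.Random()
    paths = rng.sample(sorted(catalogue), min(sample_size, len(catalogue)))
    return [p for p in paths if adler32_of(p) != catalogue[p]]
```

A daily job could call `check_random_sample` with a modest sample size, so that over time most files are re-read without the whole pool having to be scanned at once.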


T10KC Tapes In Production

Type   Capacity   In Use   Total Capacity
A      0.5TB      5570     2.2PB
B      1TB        2170     1.9PB (CMS)
C      5TB


T10000C Issues

• Failure of 6 out of 10 tapes.
  – Current A/B failure rate roughly 1 in 1000.
  – After writing part of a tape an error was reported.
• Concerns are threefold:
  – A high rate of write errors causes disruption
  – If tapes could not be filled our capacity would be reduced
  – We were not 100% confident that data would be secure
• Updated firmware in drives.
  – 100 tapes now successfully written without problem.
• In contact with Oracle.
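To see why 6 failures in 10 tapes points to a systematic problem rather than bad luck, one can ask how likely that outcome would be at the A/B failure rate. The sketch below assumes that tape failures are independent and that "1 in 1000" is a per-tape probability; neither assumption is stated in the slides, and `prob_at_least` is a helper introduced here.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial probability of at least k failures in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed per-tape failure probability, taken from the quoted A/B rate.
ab_failure_rate = 1 / 1000
p_six_of_ten = prob_at_least(6, 10, ab_failure_rate)
print(f"P(at least 6 failures in 10 tapes) = {p_six_of_ten:.2e}")
```

Under these assumptions the probability is vanishingly small, which is consistent with treating the T10000C failures as a drive or firmware issue rather than ordinary media failures.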


A couple of final comments

Disk server issues are the main area of effort for hardware reliability / stability.

...but do not forget the network.

Hardware that has performed reliably in the past may throw up a systematic problem.


Formal Operations Processes

[Diagram with boxes: Change Review, Exception Review, SIR Review, Team Fault Review, WLCG Daily Ops, Liaison Meeting, Production Scheduling, Management Meeting, Requirements, Exception Handling]


Service Exceptions 2011

• Definitions
  – Service exception: high priority fault alert raising a pager call
  – Callout: service exception raised outside formal working hours
• Operations Team
  – Daytime: “Admin on Duty” (AoD) holds the pager and handles service exceptions, passing them on to daytime teams.
  – Night-time: Primary On-call (like AoD) holds the pager, fixes easy problems and is operationally “in charge”. Second-line On-call (one per team) guarantees response. Some (not guaranteed) third-line support or escalation in serious incidents.
• Exceptions Count in 2011
  – 461 service exceptions
  – 265 callouts
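The two 2011 totals can be turned into rough rates with a few lines of arithmetic. The sketch below uses only the quoted counts; spreading them evenly over 52 weeks is an assumption made for illustration, and `summarise` is a hypothetical helper.

```python
SERVICE_EXCEPTIONS_2011 = 461  # from the slide
CALLOUTS_2011 = 265            # exceptions raised outside working hours
WEEKS_PER_YEAR = 52            # assumes an even spread over the year


def summarise(exceptions, callouts, weeks=WEEKS_PER_YEAR):
    """Derive simple rates from the yearly exception and callout counts."""
    return {
        "callout_fraction": callouts / exceptions,
        "exceptions_per_week": exceptions / weeks,
        "callouts_per_week": callouts / weeks,
        "in_hours_exceptions": exceptions - callouts,
    }


summary = summarise(SERVICE_EXCEPTIONS_2011, CALLOUTS_2011)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

On these figures more than half of all service exceptions arrived outside formal working hours, averaging about five callouts a week.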


Exceptions by Type by Week


Exceptions by Service


Plans for Future

• ORACLE 11 upgrade for CASTOR/LFC/FTS needed by July
• CASTOR
  – Switch on transfer manager (reduce transfer startup latency)
  – Upgrade to 2.1.11-9 (needed before Oracle 11 upgrade)
  – Upgrade to 2.1.12
• Network (move Tier-1 backbone to 40Gb/s)
  – Site “front of house” network upgrade “early summer”
  – Tier-1 new routing and spine layer .. DRI ….
