Castor Monitoring

Tony Osborne


Page 1: Castor Monitoring

Castor Monitoring

Page 2: Castor Monitoring

Castor is:

• Nameserver(s) (3 in a load-balanced service + 1 DB)
• Diskservers (~10**2)
• Stagers (~50 – up from ~5 when monitoring started)
• Tape Robots (4)
• Tapeservers (~10**2)
• Tapedrives (~10**2)
• Tapes (~3*10**4)

All of which can go wrong and some of which can be easily abused/overwhelmed by users.

Page 3: Castor Monitoring

User abuse

• Almost always occurs because internally Castor is not protected against ‘unreasonable’ behavior.
• Castor appears to have infinite capacity – there is no feedback from the system to users when they do silly things (and a lot of requests come from batch anyway).
• Lack of user information/education.
• Because they can/because it’s there.

(Peer pressure is one effective way to control user excesses: ‘(Please) don’t do that again’ cc: [email protected])

Page 4: Castor Monitoring

User abuse

• Nameserver: nsls of large directories or nslisttape of many tapes (up to ~2*10**5 files in a directory, up to 1.4*10**5 files on one tape).
• Too many stager requests: with no scheduling, all requests are started until the stager grinds to a halt.
• Too many stager requests per second: requests submitted at >~5/second will reliably crash a stager (a detection sketch follows this list).
• Too many rfio requests to the same filesystems/diskserver (no scheduling) can kill a diskserver (very high load).
• Too many tape requests (read) can cause vdqm meltdown (solved by use of multifile staging).
• Too many files in a stager catalogue can cause stagers to grind to a halt.
• Too many files can cause migrations to last days (blocking TapeDrives); the average Castor file size is ~105 MB, but many files are < 10 KB.
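A minimal sketch of how such a request-rate burst could be spotted in a stager log. The log format assumed here – an epoch timestamp followed by a userid on each request line – is invented for illustration; the real log layout is not shown in the slides:

  #!/usr/bin/perl
  # Sketch: flag users submitting stager requests faster than ~5/second.
  # Assumed input: one request per line, "epoch-timestamp userid ...".
  use strict;
  use warnings;

  my %per_second;                            # "userid second" -> request count
  while (my $line = <>) {
      my ($epoch, $user) = split ' ', $line;
      next unless defined $user;
      $per_second{"$user $epoch"}++;
  }

  for my $key (sort keys %per_second) {
      print "possible abuse: $key -> $per_second{$key} requests in one second\n"
          if $per_second{$key} > 5;
  }

Run e.g. as: perl rate_check.pl /var/log/stager.log (path assumed).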

Page 5: Castor Monitoring

Reasons for monitoring (1)

• First and foremost: to understand the beast (which came with little or no documentation – though many man pages). In the early days ‘nasties’ lurked in most nooks and crannies; a fair number of bugs were detected.
• To ensure similar objects (TapeServers, DiskServers, Stagers, FileClasses, TapePools etc.) are correctly configured.
• To add functionality (e.g. control of stage catalogues, prevention of needless mounting of problematic tapes).
• To impose useful conventions (i.e. run a ‘tighter ship’ than Castor itself imposed).
• To record administrative actions – who did what, why and when, e.g. a change of Tape status.

Page 6: Castor Monitoring

Reasons for monitoring (2)

• To provide a hierarchical overview of a highly distributed system involving hundreds of boxes, hundreds of Castor objects (FileClasses, TapePools) and tens of thousands of tape volumes.
• To ensure the correct running of ~50 stagers:
  – compaction of Cdb when the file grows beyond ~0.5 GB (a gain of ~factor 5 in performance for catalogue changes such as file delete)
  – adequate staged-file lifetimes in disk pools
  – ‘cleaning up’ after Castor (the infamous PUT_FAILED files)
• To detect user abuse (and take action).
• To understand the long-term evolution of resource usage by the experiments (TapePools, diskserver filespace, stagers).

Page 7: Castor Monitoring

Reasons for monitoring (3)

• To enable power users (admins) in the experiments to understand their data-management resource usage.
• To enable end-users to understand why (perhaps) they cannot access their data (Tape problems, DiskServer problems).
• To inform end-users of ‘their (user)’ errors (e.g. cannot use a FileClass/TapePool, or cannot write to a specific Castor directory).

Page 8: Castor Monitoring

Basics

• Put it all on the web (comparing 200 objects is very simple – anomalies highlighted by colour coding; see the sketch after this list).
• Warn of serious issues by email (Castor operations, Tape operations, Computer operations).
• Web pages use a drill-down approach: one page shows a summary of the entire system, with colour codes showing possible problems at lower levels.
• In conjunction with Lemon monitoring, warn FIO admins of serious individual machine state transitions (daemons dead, no_contact) for sub-groups of a large number of machines (~850). Even though sysadmins may fix many problems, we need to know that they occur.
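As an illustration of the colour-coding idea, a minimal sketch (not the production code; the metric name and the warn/alarm thresholds are invented for the example):

  #!/usr/bin/perl
  # Sketch: render one monitored value as a colour-coded HTML table cell.
  use strict;
  use warnings;

  sub colour_cell {
      my ($value, $warn, $alarm) = @_;
      my $colour = $value >= $alarm ? 'red'
                 : $value >= $warn  ? 'orange'
                 :                    'lightgreen';
      return qq{<td bgcolor="$colour">$value</td>};
  }

  # e.g. total vdqm tape queue, warn at 3000, alarm at 5000 (thresholds assumed)
  print '<tr><td>vdqm queue</td>', colour_cell(4200, 3000, 5000), "</tr>\n";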

Page 9: Castor Monitoring

Monitoring methods and tools (1)

• The analysis model is (mostly) centralized analysis accessing distributed data via rfcp.
• Perl scripts gather data via cron/acrontabs with varying sampling periods – from 5 minutes to 2 hours, depending on resource consumption; results are glued together on the web (see the scheduling sketch after this list).
• Batch jobs run once a week to check ALL files in ALL stager diskpools (-> orphaned files).
• Make use of existing Castor accounting data (stager stats, RTCOPY errors, TapeMounting stats).
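A hedged sketch of the scheduling idea in plain cron syntax (the script names, paths and exact periods are invented; the slides only state the 5-minute to 2-hour range):

  # Sampling periods tuned to the cost of each probe (names/periods invented)
  */5  * * * *   /usr/local/castor-mon/check_tape_queues.pl   # cheap: every 5 min
  0    * * * *   /usr/local/castor-mon/scan_stager_logs.pl    # hourly log scan
  0  */2 * * *   /usr/local/castor-mon/survey_stagers.pl      # expensive: every 2 h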

Page 10: Castor Monitoring

Monitoring aids and tools (2)

• Enable any new problem to be quickly added to the list of monitored issues (shutting the barn door after… is useful).
• Parse log files (incrementally) – errors tolerated up to predefined limits; also used for extraction of performance statistics (see the sketch after this list).
• Use existing Castor tools:
  – stageqry --format x,y,z… considered v. powerful!
  – showqueues (Tape queue analysis, TapeServer and TapeDrive status)
  – vmgrlistpool (TapePool evolution, media management)
• Deduce syntactic rules for the different types and sets of configuration files and regularly scan for inconsistencies.
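A minimal sketch of the incremental-parsing idea: remember how far we read on the previous run, scan only new lines, and complain when an error count passes a tolerated limit. The file names, error pattern and limit are invented for illustration:

  #!/usr/bin/perl
  # Sketch: incremental log parsing with a tolerated error limit.
  use strict;
  use warnings;

  my $log   = '/var/log/stager.log';        # path assumed
  my $state = '/var/tmp/stager.log.offset';
  my $limit = 20;                           # tolerated errors per scan (assumed)

  my $offset = 0;
  if (open my $s, '<', $state) { $offset = (<$s> // 0) + 0; close $s; }

  open my $fh, '<', $log or die "cannot open $log: $!";
  $offset = 0 if $offset > -s $log;         # log was rotated: start over
  seek $fh, $offset, 0;

  my $errors = 0;
  while (my $line = <$fh>) {
      $errors++ if $line =~ /ERROR|SEGV|timed out/;   # pattern assumed
  }
  $offset = tell $fh;
  close $fh;

  open my $s, '>', $state or die $!;
  print $s $offset;
  close $s;

  warn "too many errors since last scan: $errors\n" if $errors > $limit;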

Page 11: Castor Monitoring

Top level page (mother..)

[Screenshot of the top-level summary page. Callouts: some trivial RTCOPY errors; non-uniform configuration of castorgrid machines; all stagers OK; too many DISABLED tapes.]

Page 12: Castor Monitoring

Tape Queues (vdqm)
Source: showqueues

[Screenshot. Callouts: average time to mount a tape not already mounted; the time to ‘remount’ an already-mounted tape is often >> the time to mount a ‘new’ tape; the userid@stager requesting the most tape mounts.]

Page 13: Castor Monitoring

Tape Queues (vdqm)

The total tape queue must be kept < 5000 or the system freezes and all drives must be manually reset (tapes dismounted and daemons restarted). An e-mail alert system is used (with precise instructions).
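A minimal sketch of such an alert. showqueues is the Castor tool named in the slides, but the invocation and output format assumed here (one queued request per line containing a status word), the pattern, and the mail address are all guesses:

  #!/usr/bin/perl
  # Sketch: alert when the total vdqm tape queue approaches the freeze limit.
  use strict;
  use warnings;

  my $limit  = 5000;
  my @lines  = `showqueues`;                 # invocation/output format assumed
  my $queued = grep { /QUEUED/i } @lines;    # matching pattern assumed

  if ($queued > $limit) {
      open my $mail, '|-', 'mail -s "vdqm queue critical" castor-ops@example.org'
          or die "cannot start mail: $!";
      print $mail "Total tape queue is $queued (limit $limit).\n",
                  "Follow the drive-reset procedure before the system freezes.\n";
      close $mail;
  }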

Page 14: Castor Monitoring

Tape activity per DeviceGroupName (dgn)

Reading through a 200 GB tape at ~25 MB/sec takes a long time (>2 hours). This issue does not go away with newer tape technologies.

Page 15: Castor Monitoring

Tape activity per DeviceGroupName (dgn)

[Screenshot. Callouts: typically many read mounts, each to read a few (~2) files – INEFFICIENT (not streaming); typically few write mounts, each outputting many files (stager migrations) – EFFICIENT (streaming).]

Page 16: Castor Monitoring

Tape activity per DeviceGroupName (dgn)

[Screenshot. Callout: each tape is requested ~6 times on average.]

Page 17: Castor Monitoring

Tape activity per DeviceGroupName (dgn)

[Screenshot. Callouts: number of different tapes (VIDs) per TapePool; most-solicited TapePools; most-solicited tapes (VIDs).]

Page 18: Castor Monitoring

TapeDrive status: overview

[Screenshot. Callouts: dedication used as an administrative tool – write precedence (over read) used for stager migrations and CDR; overview by drive type, drive status and dgn.]

Page 19: Castor Monitoring

TapeDrive status: dgn 994BR0

[Screenshot. Callouts: links to recent RTCOPY errors; would show requests that are ‘stuck’; would show tapes ‘trapped’ on drives that are DOWN for a long period.]

Page 20: Castor Monitoring

TapeDrives with too many errors

[Screenshot. Callouts: bad drives identified by recent RTCOPY errors (STK: change this drive!); RTCOPY errors extracted from accounting records; drive serial numbers extracted from each TapeDrive.]

Page 21: Castor Monitoring

TapeDrives with too many errors

[Screenshot. Callout: 7 tapes DISABLED by the same TapeDrive. Source: vmgr.]

Page 22: Castor Monitoring

Bad tapes also identified

The same tape has errors on multiple TapeDrives within a short time frame (the tape needs repair/copying).

Page 23: Castor Monitoring

TapeRepairs: drive/volume/errDB

[Screenshot. Callouts: select a DISABLED tape; a simple work log of who did what, when and why.]

Page 24: Castor Monitoring

TapeRepairs: drive/volume/errDB

[Screenshot. Callout: re-enable a tape after repair (log of who did what, when and why).]

Page 25: Castor Monitoring

Drive/volume/errDB

Overview of error rate by category. User errors are usually not important (and are not usually caused by the user..). ‘Tape’ and ‘Unknown’ (unclassified) are the most significant errors – and even these need to be filtered to spot bad VIDs and/or bad TapeDrives.

Page 26: Castor Monitoring

Drive/volume/errDB

[Screenshot. Callout: serious errors by VID – the likely solution is to copy/retire this VID.]

Page 27: Castor Monitoring

Drive/volume/errDB

[Screenshot. Callout: new TapeDrives/media are not without problems!]

Page 28: Castor Monitoring

Drive/volume/errDB

An Oracle Forms query front-end allows querying for specific error strings, TapeDrives and VIDs.

Page 29: Castor Monitoring

End-user pages to explain data inaccessibility

Similar pages exist for disabled tape segments (files). Tapes that have been repaired as much as possible but still contain (inaccessible) files are archived; users may (and do) try to access these files years later. Castor does not reveal why it does what it does.. Files still on these tapes may be lost forever.

Page 30: Castor Monitoring

Stager survey

[Screenshot. Callouts: a set of problems/parameters for all stagers – colour coded to indicate if action is required; Cdb filesize (influences the speed of catalogue changes) – see next slide; rfdir of every DiskPool FileSystem using weekly batch jobs; basic stager response to a simple stageqry.]

Page 31: Castor Monitoring

Stager survey

[Screenshot. Callouts: the Aleph stager is hardly used these days; keep the catalogue < 0.5 GB (size significantly affects the speed of catalogue operations – notably stageclrs). A size-check sketch follows.]
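A minimal sketch of the size check, assuming invented catalogue paths (the ~0.5 GB threshold is the figure quoted in the slides):

  #!/usr/bin/perl
  # Sketch: flag stager catalogues (Cdb files) past the ~0.5 GB slow-down point.
  use strict;
  use warnings;

  my $threshold  = 0.5 * 1024**3;            # 0.5 GB in bytes
  my @catalogues = glob '/stage/*/Cdb*';     # path pattern assumed

  for my $file (@catalogues) {
      my $size = -s $file or next;
      printf "%s: %.2f GB%s\n", $file, $size / 1024**3,
             $size > $threshold ? '  <-- compact this catalogue' : '';
  }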

Page 32: Castor Monitoring

Stager survey

[Screenshot. Callouts: current migration under way; basic information about the configuration and performance of stagers and diskpools – many fields change colour in case of problems (source: stageqry; admins are notified of serious conditions by e-mail); colour warning if e.g. DiskServer(s) are in trouble or the catalogue is too big.]

Page 33: Castor Monitoring

Stager survey

[Screenshot. Callouts: the last few years’ weekly statistics of basic stager operations; the story so far today (basic stager operations); evolution of the overall stager catalogue, file sizes, lifetimes etc. over recent months; overview of DiskServer status, tests and links to Lemon.]

Page 34: Castor Monitoring

Stager survey

[Screenshot. Callouts: DiskPool configurations and FileSystem information; breakdown of the total stager catalogue by file status.]

Page 35: Castor Monitoring

Stager activity so far today

[Screenshot. Callouts: time distribution of the basic stager command (stagein), source: stager logs; top users opening files to read today.]

Page 36: Castor Monitoring

Stager activity so far today

[Screenshot. Callouts: castorc3 is the userid used for catalogue control (root performs garbage collections); time distribution of stageclrs (garbage collection for space, or catalogue reduction when there are too many files).]

Page 37: Castor Monitoring

Evolution of stager (past few years)

[Screenshot covering March 2003 -> May 2006. Source: Castor weekly accounting records.]

Page 38: Castor Monitoring

Evolution of stager (past few months)

The catalogue size grows beyond Castor’s ability to manage it (on this hardware and with this castor version). The catalogue starts to be limited by external scripts tailored to each stager instance.

Page 39: Castor Monitoring

Stager DiskPool statistics

[Screenshot. Callouts: basic file statistics; files in this DiskPool by status (source: stageqry); 23 MB!]

Page 40: Castor Monitoring

Stager DiskPool: time evolution

Page 41: Castor Monitoring

Stager DiskPool: time evolution

Page 42: Castor Monitoring

Stager DiskPool: current distribution

Page 43: Castor Monitoring

Stager DiskPool: current distribution

Page 44: Castor Monitoring

Migrator information

[Screenshot. Callouts: detailed information on streams and tapes; migrator speeds/data volumes.]

If files of ~> 1 GB are used then we get close to the TapeDrive streaming speed (~30 MB/sec); otherwise….
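To see why file size matters, a rough worked example (the 5 s per-file overhead is an assumed figure for illustration, not from the slides): at a 30 MB/sec streaming rate, a 1 GB file takes ~33 s of transfer + 5 s of overhead, an effective ~26 MB/sec; a 10 MB file takes ~0.3 s + 5 s, an effective ~2 MB/sec – which is how a migration of many small files stretches to days.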

Page 45: Castor Monitoring

Migration candidates: FreeSpace Evolution

[Screenshot. Callouts: FreeSpace evolution (no GC ran in this time period); number of files to be migrated (it has been as high as 150k!).]

Page 46: Castor Monitoring

Migration candidates: FreeSpace Evolution

[Screenshot. Callouts: garbage collector starts; garbage collection limits.]

Page 47: Castor Monitoring

Machine management – by groups

A high-level view of monitoring and configuration information for ~850 machines. Data sources: Cdb, Castor configuration (stageqry), Lemon, Landb, SMS and a local db for state-transition detection. The overall consistency between the sources is continuously checked.

Transitions between significant states (no contact, high load, daemons wrong, machine added/removed to/from each Cdb cluster) are also signalled by e-mail to the person(s) responsible for that cluster (a transition-detection sketch follows).

[Screenshot. Callouts: all castor1 machines OK (just 2 minor problems); these are in use now in castor1 stagers; group castor (other): in standby/repair.]
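A minimal sketch of the transition-detection idea: compare the current machine states with those recorded on the previous run and report only the changes. The hostnames, states and state file are invented; the production version kept this in a local database:

  #!/usr/bin/perl
  # Sketch: report machine state transitions between monitoring runs.
  use strict;
  use warnings;

  my %current = (lxfs001 => 'ok', lxfs002 => 'no_contact');   # from Lemon, assumed
  my $statefile = '/var/tmp/machine.states';

  my %previous;
  if (open my $fh, '<', $statefile) {
      while (<$fh>) { chomp; my ($host, $st) = split /\s+/; $previous{$host} = $st; }
      close $fh;
  }

  for my $host (sort keys %current) {
      my $old = $previous{$host} // 'unknown';
      next if $old eq $current{$host};
      print "TRANSITION $host: $old -> $current{$host}\n";   # would be e-mailed
  }

  open my $fh, '>', $statefile or die $!;
  print $fh "$_ $current{$_}\n" for keys %current;
  close $fh;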

Page 48: Castor Monitoring

Machine management – by groups

[Screenshot. Callouts: colour coding indicates problems/issues; links to the Lemon/ITCM Remedy feed; link to a comparison of all machines of this model; data organized by Cdb cluster and sub_cluster; view only problem cases; link to see all machines on a given switch; not in production.]

Page 49: Castor Monitoring

vmgr and Robotic library comparisons

[Screenshot. Callouts: layout of drives and tape media within the Robot silos; agreement between the vmgr and Robot library inventories; tape ‘type’: VID definition (convention).]

Page 50: Castor Monitoring

Data volumes per Experiment

~50,000,000 files in ~5 PetaBytes (~100 MB average filesize). These files are from stageslap – which once had 150k migration candidates.

Page 51: Castor Monitoring

VIDs in TapePools

[Screenshot. Callouts: LHC experiment castor2 pools recently moved to new media; links to vmgr information and RTCOPY errors; link to the evolution of TapePool size with time.]

Page 52: Castor Monitoring

Supply TapePools

Castor TapePools are fed from supply_pools via specified filling policies. Lifetimes of the current supply pools are predicted from last month’s consumption (see the sketch below).
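A small worked sketch of the lifetime prediction. The pool contents and per-tape capacity are invented figures; the production pages derived them from vmgrlistpool output and a month of consumption history:

  #!/usr/bin/perl
  # Sketch: predict when a supply pool runs dry from recent consumption.
  use strict;
  use warnings;

  my $free_tapes  = 420;     # tapes left in the supply pool (assumed)
  my $capacity_gb = 200;     # per-tape capacity (assumed)
  my $daily_gb    = 1500;    # last month's average consumption, ~1.5 TB/day

  my $days_left = $free_tapes * $capacity_gb / $daily_gb;
  printf "Supply pool exhausted in ~%.0f days\n", $days_left;   # ~56 days here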

Page 53: Castor Monitoring

SupplyPool monitoring

Tape purchases, repacking exercises, CDR and normal tape consumption (typical background rate of ~1–2 TeraBytes/day).

Page 54: Castor Monitoring

Largest tape consumers

In a few weeks lhcb has written 150 TB of data onto T10K media (*).

(*) In fact this was not the case: we had problems with the processing of new media by castor.

Page 55: Castor Monitoring

Evolution of tape consumption by Compass

[Screenshot. Callouts: the different TapePools used by Compass over the last ~3 years; the CDR 2004 run. Source: vmgrlisttape.]

Page 56: Castor Monitoring

Evolution of tape consumption by Engineers

Backup/archives of mechanical and electrical engineering data.

Page 57: Castor Monitoring

TapeMount Statistics
Source: weekly accounting

Page 58: Castor Monitoring

TapeMount Statistics

[Screenshot. Callouts: few files per read mount; many files per write mount (stager migration) – don’t be misled by itdc, this is ‘us’; we can do much more than this per week with the currently installed equipment.]

Page 59: Castor Monitoring

Summary

• A highly visual overview, plus notification by email, allowed management of ~50 castor1 stagers, 200 diskservers, ~200 TapePools, ~200 FileClasses, ~100 TapeServers, ~10**3 VIDs.
• A lot (most?) of the work went into the nitty-gritty of Tape media and TapeDrive management.
• ~250 scripts totalling ~50k lines of perl/sql; ~1.5 man-years of development + maintenance over 3 years.

Page 60: Castor Monitoring

Acknowledgements

Charles Curran: for provision of tape related information (serial numbers, Silo inventories, current state of media on robotic drives) etc.

Jean-Damien Durand: for patient answers to endless questions concerning stager workings.