TSM Linux User Experience at CERN
David Asbury, CERN, Geneva, Switzerland
Oxford TSM Symposium, 26 September 2007

Topics
What is CERN?
What do we do with all that data?
How TSM is used in CERN
Managing the growth of data
Configuration
Experience with Linux
26 September 2007 Oxford TSM Symposium 2007 | Linux User experience at CERN 2

What is CERN?
European Laboratory for Particle Physics
French-Swiss border near Geneva
20 member states, ~3000 staff
~6500 visiting scientists from ~500 institutes, ~80 nationalities
Large Hadron Collider (LHC) to open 2008

Large Hadron Collider

Accelerator Complex

ATLAS Experiment

Data Pyramid
Derived data, physics dbs
Mail, home directories, databases, systems etc.
Raw data from experiments is distributed among 10 other Grid sites.
~15 PB per year
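
To give a feel for the "~15 PB per year" figure, the arithmetic below converts it to an average sustained rate. It is only a sketch: it assumes decimal units (1 PB = 10**15 bytes), a 365-day year, and data arriving evenly, none of which the slide states. Under those assumptions the result is roughly 475 MB/s.

```python
# Rough average data rate implied by "~15 PB per year".
# Assumptions (not from the slides): decimal units, 1 PB = 10**15 bytes,
# a 365-day year, and data spread evenly over the year.

PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600


def average_rate_mb_per_s(petabytes_per_year: float) -> float:
    """Return the sustained rate in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_second = petabytes_per_year * PB / SECONDS_PER_YEAR
    return bytes_per_second / 10**6


if __name__ == "__main__":
    print(f"{average_rate_mb_per_s(15):.0f} MB/s")
```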

CERN Policy on Backup
Home directories
    AFS: AFS volume backup -> Castor
    Windows DFS: TSM
Mail (Microsoft Exchange): TSM
Databases: TSM
Unix group & project servers: TSM
Experimental data: Castor
Castor: CERN Advanced Storage Manager (local)

Growth!
[Chart: Data Received by TSM. Vertical axis: TB per week (0 to 60). Horizontal axis: Date, quarterly ticks, 2005 to 2007.]

Managing growth
Ask the major clients for forecasts
Monitoring everything they do too!
    Servergraph, moving to home-grown TSMMS
Want a repeatable “unit” of TSM
    Can add when needed to avoid performance problems
    Use existing TSM FC infrastructure
    Profit from local Linux expertise and installation
    Make use of physics robotic tape infrastructure

A Unit of TSM Capacity
PC running standard RHEL4 64-bit Linux
    4 cpus, 8GB memory, 2 Qlogic HBAs for FC
System disks mirrored by 3ware card
Disks for TSM db & log mirrored by TSM
RAID6 disks for staging areas
Use physics robot tape infrastructure

TSM Configuration
[Diagram labels: 2nd Storage Centre, Computer Centre; FC stack (x2); FC switch (x2); AIX and Linux servers; RAID disk; tape robot; SAN]

Setting up the Linux etc.
IBM only supports specific Linux kernels
IBM tape drives need specific IBM driver
More restrictive than AIX or Solaris
    No “smitty” system tool like AIX
Must reload FC driver to add devices
Disks MUST be labelled in /etc/fstab for safety
Cannot avoid Unix disk cache with ext3 fs
Tape drive devices may change name if new ones are added
Must change access to tape devices if TSM not run as root
Usually have to reboot
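
The point about labelling disks in /etc/fstab can be checked mechanically. The sketch below flags fstab entries that refer to a disk by a raw /dev/sdX name, which can shift when devices are added, rather than by LABEL= or UUID=. The function name, the regular expression, and the sample entries are hypothetical and only for illustration; they are not taken from the CERN setup.

```python
import re

# Assumption for illustration: "unstable" names are plain SCSI/FC disk
# nodes such as /dev/sda1. Other naming schemes are not considered.
UNSTABLE_DEVICE = re.compile(r"^/dev/sd[a-z]+\d*$")


def unlabelled_entries(fstab_text: str) -> list[str]:
    """Return fstab device fields that use raw /dev/sdX names."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        device = line.split()[0]
        if UNSTABLE_DEVICE.match(device):
            flagged.append(device)
    return flagged


# Hypothetical sample entries, not from the slides.
SAMPLE = """\
# device        mount point  type  options   dump pass
LABEL=/tsmdb1   /tsm/db1     ext3  defaults  1 2
/dev/sdc1       /tsm/stage   ext3  defaults  1 2
"""

if __name__ == "__main__":
    print(unlabelled_entries(SAMPLE))
```

On a real system the text would come from reading /etc/fstab; any entry the function returns is a candidate for relabelling.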

Spec of 1st TSM on Linux
PC Intel Xeon 2x3GHz cpus, 4GB memory
System disks mirrored by 3ware card
Standard RHEL4 64-bit Linux (specified)
Raptor disks mirrored by TSM for db & log
SATA RAID6 Infortrend array for staging
Ext3 file system used (specified)
8 IBM 3592J tapes (300GB) in 3584 robot

1st TSM on Linux
Started well, performance okay
Functioned normally
High load (>1 cpu) when doing i/o
Sometimes does not schedule all TSM processes concurrently?
Beware of Linux “tools” for devices
    Rewound tape drives!
Added 2nd Linux machine …

Spec of 2nd TSM on Linux
AMD Opteron dual core, 4 cpu, 8GB mem.
System disks mirrored by 3ware card
Standard RHEL4 64-bit Linux (specified)
Raptor disks mirrored by TSM for db & log
SATA RAID6 Infortrend array for staging
6 LTO3 HP drives in STK 8500 robot
LTO drives mounted via ACSLS
    Need special script to create device files

2nd TSM on Linux
Started well, but high cpu with i/o again
Corrupted file systems with high disk i/o
    /var/log/messages: "trying to seek off end of disk"
    Reboot stopped - needed manual fsck of file systems
    System down for some hours to check file systems and run TSM AUDIT on disks to clean up
Upset backup clients!
    System not available when needed
    Backups corrupted? - yes
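
Since the first visible symptom was a message in /var/log/messages, a simple log scan could raise the alarm earlier. The following is a minimal sketch, not something the slides describe: the message text is copied from the slide and may not match the exact kernel wording on a given system, so it is left as a parameter, and the sample log lines are invented for illustration.

```python
# The message text is copied from the slide; the exact kernel wording on a
# given system may differ, so the pattern is a parameter.
SLIDE_MESSAGE = "trying to seek off end of disk"


def matching_lines(log_lines, pattern=SLIDE_MESSAGE):
    """Return the log lines that contain the pattern (case-insensitive)."""
    needle = pattern.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "Sep  1 02:10:11 host1 kernel: trying to seek off end of disk\n",
    "Sep  1 02:10:12 host1 sshd[123]: session opened\n",
]

if __name__ == "__main__":
    for hit in matching_lines(SAMPLE_LOG):
        print(hit)
    # On a real system one might pass an open file instead, for example:
    # with open("/var/log/messages", errors="replace") as f:
    #     hits = matching_lines(f)
```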

Tracing the Corruption
Tried changing RAID arrays, updated kernel and Qlogic FC driver
Tried single-processor kernel
    Better, but still corrupted
Borrowed RedHat certified PC
    Still corrupted with memory problems, audit errors
Eventually moved big clients back to AIX
    Linux better lightly loaded, but still see audit errors

Corruption
Needs high disk i/o - only seen with disks connected by FC
Single-processor kernel was better, but too slow (limited cpu for i/o)
Did not seriously suspect RAID arrays as they have worked well with AIX for years
Difficult to separate Linux fs from FC
Run TSM AUDIT frequently, but cannot check data (only metadata)

CERN Corruption Survey
Used fsprobe program in C (not TSM)
    Just reads/writes Unix files and checks
Run on ~3000 farm PCs in CERN for some weeks
Variety of silent corruption found:
    Memory errors, less than expected. 1-bit errors are corrected
    Sector/page sized regions corrupted
    Larger blocks of invalid data - ext3 file system?
All makes of PC eventually showed errors
Memory is most dangerous place for your data!
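
The "reads/writes Unix files and checks" idea can be sketched in a few lines. This is not fsprobe itself, which the slide says is a C program; it is a hypothetical Python illustration of the same principle: write a pattern whose digest is known in advance, read it back, and compare. The function name, the file size, and the pattern are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def write_and_verify(path: str, size: int = 1 << 20, seed: bytes = b"probe") -> bool:
    """Write a known pattern to path, read it back, and compare digests."""
    # Deterministic pattern, so the expected digest is known before any I/O.
    block = hashlib.sha256(seed).digest()
    data = (block * (size // len(block) + 1))[:size]
    expected = hashlib.sha256(data).hexdigest()

    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ok = write_and_verify(os.path.join(d, "probe.dat"))
        print("ok" if ok else "CORRUPTION DETECTED")
```

Note that the read-back here is likely to be served from the operating system's cache rather than from disk, so a check this simple mostly exercises memory and the file system layer. A survey tool would need larger files, repeated passes over time, and logging of where any mismatch occurred.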

Conclusions
Jury still out. Linux fs or FC-related?
Linux offers cheaper repeatable unit?
Problem: no single point of contact
    No clear line between hardware and software
    Different PCs show corruption in different ways
    Extremely time consuming, disruptive to service

Next Steps
Try IBM configuration certified for TSM
    PC, Qlogic HBAs, IBM RAID with RHEL4
Pay IBM to take all problems (Redhat too)
Hope for clear answer to problem - do not want to repeat all this with new hardware!
Talk in TSM Symposium 2009 on results?

Acknowledgements
Lio Frost-Ainley
Gordon Lee
Tim Bell (boss)
Charles Silvan (Expert from GATE & IBM)
Peter Kelemen (Corruption Survey)

Contact Details
David Asbury, CERN IT Department
Email: david.asbury@cern.ch
CERN Website: www.cern.ch

Questions?