25-29 May 2009, HEPiX Spring
ASGC Site Report
Jason Shih, ASGC/OPS
HEPiX Spring 2009, Umea, Sweden
Overview
• Fire incident
• Hardware
• Network
• Storage
• Future remarks
Fire incident – event summary
• Damage analysis: fire was limited to the power room
  • Severe damage to the UPS
  • Wiring of power system, AHR
  • Smoke dust pervaded and smudged almost everywhere, including computing & storage systems
• History and planning
  • 16:53 Feb. 25: UPS battery burning
  • 19:50 Feb. 25: Fire extinguished by the fire department
  • 10:00 Feb. 26: Fire scene investigation by the fire department
  • 15:00 Feb. 26 ~ Mar. 23: DC cleaning, re-partitioning, re-wiring, deodorization, and re-installation
    • From ceiling to ground under raised floor, from power room to machine room, from power system, air conditioning, and fire prevention system to computing system
    • All facilities moved outside for cleaning
  • Mar. 23: Computing system installation
  • Mar. 23 ~ Apr. 9: Recovery of monitoring, environment control, and access control systems
Fire incident – recovery plan
• DC consultant will review the re-design on Mar. 11; the schedule will be revised based on the inspection
• Tier1/Tier2 services will be collocated at IDC for 3 months from Mar. 20
Fire incident – review/lessons (I)
• DC infrastructure standards to comply with
  • ANSI TIA/EIA
  • ASHRAE thermal guidelines for data processing environments
  • Guidelines for green data centers are available, e.g., LEED
  • NFPA: fire suppression system
• Capacity and type of UPS (min. scale)
  • Varies with the response time of generators
• Adjust rating of all breakers (NFB and ACB)
• Location of UPS (open space & outside PR)
• Regular maintenance of batteries
  • Inner resistance measurement
Fire incident – review/lessons (II)
• Smoke damage: fire stopping
• Improvement of monitoring system
  • Re-design the monitoring system
  • Earlier pre-action: consider VESDA
• Emergency response and procedures
  • Routine fire drills are indispensable
  • Disaster recovery plan is necessary
• Other improvements:
  • PP and H/C aisle splitting
  • Fiber panels: MDF and FOR
  • OH cable tray (existing: PWR tray in subfloor) + fiber guide
  • Raised floor grommets
Move out all facilities for cleaning
Container as storage and humidification
Protect Racks from Dust
Ceiling Removal
Fire incident - Tape system
• Snapshots of decommissioned tape drives after the incident
DC recovered – mid-May
• FOR in area #1
• MDF moved to center of DC area
• H/C aisle fully split
• Plan to replace racks to provide 1100mm depth
IDC Collocation (I)
• Site selection and paperwork – one week
• Preparation at IDC – one week
  • 15R + reservation for tape system (6R)
  • Power (14kW per rack)
  • Cooling (perforated raised floor)
  • 10G protected SDH STM-64 networking between IDC and ASGC
IDC collocation (II)
• Relocation of 50+% of computing/storage – one week
  • 2k job slots (3.2MSI2K), 26 chassis of blade servers
  • 2.3PB storage (1PB allocated dynamically)
• Cabling + setup + reconfiguration – one week
IDC collocation (III)
• Facility installation completed on Mar. 27
• Tape system delayed until after Apr. 9
  • Realignment
  • RMA for faulty parts
T1 performance
• 7G peak reached to Amsterdam
• 9G peak observed between IDC and ASGC
Network – before May
(Network topology diagram; only the labels survive.)
• Sites: Sinica, Taipei; HK, Mega-iAdvantage; JP, KDDI Otemachi; SG, KIM CHUNG
• Routers: M320, M120 (x2), M20
• Peers and transit: KREONET2, CSTNet, HARNet, HKIX, Pacnet IP Transit, APAN-JP, KEK, JPIX, SINet, WIDE, NUS, AARNet, CERNet, TWGate IP Transit
• Link labels: GE, GE*2, 100M, 2.5G WL non-protect, NCIC 2.5G (STM-16) SDH, 622M (STM-4) SDH on APCN2
Network - 2009
(Network topology diagram; only the labels survive.)
• Sites: Sinica, Taipei; HK, Mega-iAdvantage; JP, KDDI Otemachi; Singapore, Global Switch
• Routers: M320, M120 (x2), M20
• Peers and transit: KREONET2, CSTNet, HARNet, HKIX, Pacnet IP Transit, APAN-JP, KEK, JPIX, SINet, WIDE, SingAREN, NUS, AARNet, CERNet, TWGate IP Transit
• Link labels: GE, GE*2, 100M, STM-16 SDH, 2.5G (STM-16) SDH, 622M (STM-4) SDH on EAC
ASGC Resource Level Targets
Date        CPU (MSI2k)   Disk (PB)   Tape (PB)
Current     2.4           1.2         0.8
Year End    5.6           2.4         1.3
MoU 2009    7.55          3.15        2.1
• 2008
  • 0.5PB expansion of tape system in Q2
  • Meet MoU target in mid-November
  • 1.3MSI2k per rack, based on recent E5450 processors
• 2009
  • 150 QC blade servers
  • 2TB per drive for RAID subsystem
  • 42TB net capacity per chassis and 0.75PB in total
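The gap between the current resource levels and the targets in the table can be worked out directly. The following is a minimal sketch: the numbers come from the table and the 2009 storage bullets, while the dictionary layout, the `shortfall` helper, and the use of decimal units (1 PB = 1000 TB) are assumptions made for illustration.

```python
import math

# Resource levels from the table: CPU in MSI2k, disk and tape in PB.
LEVELS = {
    "current":  {"cpu": 2.4,  "disk": 1.2,  "tape": 0.8},
    "year_end": {"cpu": 5.6,  "disk": 2.4,  "tape": 1.3},
    "mou_2009": {"cpu": 7.55, "disk": 3.15, "tape": 2.1},
}


def shortfall(have, want):
    """Return how much of each resource is still missing (never negative)."""
    return {key: round(max(want[key] - have[key], 0.0), 2) for key in want}


# 2009 disk expansion: 0.75 PB in total at 42 TB net per chassis.
# Assumes 1 PB = 1000 TB; the slide does not state the unit convention.
CHASSIS_NEEDED = math.ceil(0.75 * 1000 / 42)

if __name__ == "__main__":
    for name in ("current", "year_end"):
        print(name, "vs MoU 2009:", shortfall(LEVELS[name], LEVELS["mou_2009"]))
    print("chassis for 0.75 PB:", CHASSIS_NEEDED)
```

Under these assumptions the year-end plan still leaves a gap to the MoU 2009 figures in all three categories, and the 0.75PB disk expansion corresponds to roughly 18 chassis.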
Hardware Profile and Selection (I)
• CPU:
  • 2K8 expansion: 330 blade servers provide 3.6MSI2k
    • 7U height chassis
    • SMP Xeon E5430 processors, 16GB FB-DIMM
    • Each blade provides 11KSI2k
    • 2 blades/U density, Web/SOL management
  • Current capacity: 2.4MSI2k
  • Year-end total computing power: ~5.6MSI2k
    • 22KSI2k/U (24 chassis in 168U)
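The density figures on this slide follow from the per-blade rating. As a rough check, the sketch below recomputes them; the constant names are illustrative, and it assumes every chassis is populated at the stated 2 blades per U.

```python
# Figures taken from the slide.
BLADE_KSI2K = 11        # rating of one blade
BLADES_PER_U = 2        # stated density
CHASSIS_HEIGHT_U = 7    # chassis height
BLADES_2008 = 330       # blades in the 2008 expansion
CHASSIS_COUNT = 24      # chassis counted for the density figure

blades_per_chassis = BLADES_PER_U * CHASSIS_HEIGHT_U   # 14 blades
ksi2k_per_u = BLADE_KSI2K * BLADES_PER_U               # 22 KSI2k per U
total_u = CHASSIS_COUNT * CHASSIS_HEIGHT_U             # 168 U
blade_slots = CHASSIS_COUNT * blades_per_chassis       # 336 slots
expansion_msi2k = BLADES_2008 * BLADE_KSI2K / 1000     # about 3.6 MSI2k

if __name__ == "__main__":
    print(f"{ksi2k_per_u} KSI2k/U over {total_u} U")
    print(f"{BLADES_2008} blades in {blade_slots} slots: {expansion_msi2k:.2f} MSI2k")
```

On these numbers, 24 chassis offer 336 blade slots, enough for the 330 blades, and 330 blades at 11KSI2k each come to about 3.6MSI2k.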
Tape system
• Before incident:
  • LTO3 * 8 + LTO4 * 4
  • 720TB with LTO3
  • 530TB with LTO4
• May 2009:
  • Two loan LTO3 drives
  • MES: 6 LTO4 drives at end of May
  • Capacity: 1.3PB (old) + 0.8PB (LTO4)
• New S54 model introduced
  • 2K slots with tier model
  • Upgrade ALMS
  • Enhanced gripper
Roadmap – Host I/F 2009
(Chart: host interface availability by quarter, Q1 to Q4 2009.)
• 4G FC (≈ 400 MB/sec)
• 8G FC (≈ 800 MB/sec)
• SAS 3G (4-lane ≈ 1200 MB/sec)
• iSCSI – 1Gb
• U320 SCSI (≈ 320 MB/sec)
• iSCSI – 10Gb
• SAS 6G (4-lane ≈ 2400 MB/sec)
• 3U/16-bay FC-SAS in May; 2U/12-bay and 4U/24-bay in June
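The approximate throughput figures quoted for the FC and SAS host interfaces are consistent with 8b/10b line coding, where every 8 payload bits travel as 10 bits on the wire. The sketch below reproduces them under that assumption; the function name and the use of nominal line rates (4, 8, 3 and 6 Gbit/s) are illustrative simplifications, not figures taken from the slides.

```python
def payload_mb_per_s(line_rate_gbit, lanes=1, data_bits=8, coded_bits=10):
    """Approximate payload throughput in MB/s for a coded serial link.

    Assumes 8b/10b coding by default and a nominal line rate per lane.
    """
    coded_bits_per_s = line_rate_gbit * 1e9 * lanes
    payload_bits_per_s = coded_bits_per_s * data_bits / coded_bits
    return payload_bits_per_s / 8 / 1e6  # bits -> bytes -> megabytes


# (nominal Gbit/s per lane, number of lanes)
LINKS = {
    "4G FC": (4, 1),
    "8G FC": (8, 1),
    "SAS 3G (4-lane)": (3, 4),
    "SAS 6G (4-lane)": (6, 4),
}

RATES = {name: payload_mb_per_s(rate, lanes) for name, (rate, lanes) in LINKS.items()}

if __name__ == "__main__":
    for name, mb in RATES.items():
        print(f"{name}: about {mb:.0f} MB/sec")
```

U320 SCSI is a parallel bus and the iSCSI entries depend on the Ethernet link, so they are left out of this calculation.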
Roadmap – Drive I/F 2009
(Chart: drive interface availability by quarter, Q1 to Q4 2009.)
• 4G FC
• SAS 3G
• SAS 6G
• U320 SCSI
• SATA-II
• 2.5” SSD (B12F series)
Est. Density
• 2009 H1: 1TB, 1 rack (42U) = 240TB
• 2009 H2: 2TB, 1 rack (42U) = 480TB
• 2010 H1: 2TB, 1 rack (42U) = 480TB
• 2010 H2: 3TB, 1 rack (42U) = 720TB
• 2012: 5TB…..
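The per-rack figures scale linearly with drive size, which implies a fixed count of 240 drives per 42U rack (240TB with 1TB drives). The sketch below makes that explicit; the drive count is inferred rather than stated on the slide, `racks_needed` is a hypothetical helper, and the capacities are raw, ignoring any RAID overhead.

```python
import math

DRIVES_PER_RACK = 240  # inferred: 240 TB per rack with 1 TB drives

# (period, drive size in TB) as listed on the slide
ROADMAP = [("2009 H1", 1), ("2009 H2", 2), ("2010 H1", 2), ("2010 H2", 3)]


def rack_capacity_tb(drive_tb, drives=DRIVES_PER_RACK):
    """Raw capacity of one rack in TB for a given drive size."""
    return drive_tb * drives


def racks_needed(target_tb, drive_tb):
    """Hypothetical helper: whole racks required to reach a raw capacity target."""
    return math.ceil(target_tb / rack_capacity_tb(drive_tb))


if __name__ == "__main__":
    for period, drive_tb in ROADMAP:
        print(f"{period}: {drive_tb} TB drives, {rack_capacity_tb(drive_tb)} TB per rack")
    print("racks for 750 TB with 2 TB drives:", racks_needed(750, 2))
```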
Future remarks
• DC full restore at end of May
  • Restart round-the-clock operation
• Relocated resources fully involved in STEP09
• Facility relocation from IDC at end of June
• New resource expansion at end of July
• Improve DC monitoring
Water mist
• Fire suppression system
  • Review the implementation of gas suppression system
  • Consider water mist in power room
• Wall cabinet outside data center area
Water mist – design plan