Office of Instructional and Research Technology
Very large computing and the real world
a very few thoughts
Eric Marshall, Associate Director for Research Technology
Rutgers University
Shock and awe
Bigger is better!
The shiny future
• Newer is Better!
The real world
• Bugs, warts, and the eternal problem of hindsight
The problem of architecture
• Build as you go vs. predicting the future
Where do you put it, and for how long?
• The problem of a 2x footprint in the land of 24x7
Who is the expert?
• Is the architect, programmer, scientist, owner, vendor, or bottle washer the expert? Complex problems are hard.
“Anyone who understands the system isn’t doing science!”
• The problem of users
Supercomputers are disposable
• 3 to 5 year ‘shelf life’
“This system sucks, the last one was better!” (no matter how many systems)
• The problem of transition: porting, change and habits
Goldilocks paradox
• The problem of useful use: efficient programming, useful scaling, overhead, keeping track of results, allocation, etc.
Goldilocks paradox (cont’d)
• Someone will always say the solution is just around the corner!
Scaling is deadly
• Scaling problems: OS/SAN/code/people/etc.
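One way to make the code side of the scaling problem concrete is Amdahl's law. The slide does not name it; it is brought in here only as an illustration. If a fraction p of a program's runtime parallelizes perfectly, the speedup on n processors is at most 1 / ((1 - p) + p / n). A minimal sketch, in which the function name and the sample fractions are hypothetical:

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup under Amdahl's law (an idealized model)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical workloads: 90%, 95%, and 99% of the runtime parallelizes.
    for fraction in (0.90, 0.95, 0.99):
        for processors in (64, 512):
            bound = amdahl_speedup(fraction, processors)
            print(f"p={fraction:.2f}, n={processors}: at most {bound:.1f}x")
```

The model ignores the OS, SAN, and people problems the slide also lists, so it is an optimistic bound: however many processors are added, the speedup stays below 1 / (1 - p).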
GFDL HPCS, July 2005
• Large Scale Cluster (LSC): SGI Origin 3800 + 3900, 600 MHz; 2 nodes x 512 PE + 512 GB + 2.9 TB disk; 5 nodes x 256 PE + 256 GB + 0.9 TB disk; 1 node x 128 PE + 128 GB + 0.9 TB disk; SAN bandwidth: 2 GB/s per LSC node; CXFS, PCP, Workshop Pro, GridEngine, S-Plus, TotalView, Matlab, NAG SMP, Mathematica
• Analysis Cluster (ANC): SGI Origin 3900, 600 MHz; 2 nodes x 96 PE + 96 GB + 4.2 TB disk; SAN bandwidth: 2 GB/s per ANC node; GridEngine, CXFS, PCP, Workshop Pro
• Tape SAN: 4 x STK 9310 tape libraries; 24 x 9940B drives (200 GB, 30 MB/s); 22 x 9840A drives (20 GB, 10 MB/s); 3.5 PB tape storage on-line, 1.5 PB off-line
• LAN: Cisco Catalyst 6509; 4 x 16 GbE; 2 x 48 Fast Ethernet
• SAN (FC) switch: Brocade 2800 & 3800; redundant access, dual-ported Fibre Channel
• MetaData Server (MDS), HFS & HSMS server: SGI Origin 3800, 600 MHz; 2 nodes x 64 PE + 64 GB; disk SAN: 4 GB/s per MDS node; tape SAN: 1 GB/s per MDS node; 2.8 TB disk, Failsafe, DMF, CXFS
• Onyx 3 - Infinite Reality 3
• Disk SAN: 23.6 TB SAN disk; TP9100B5+P+HS RAID5 w/ dual controllers, 2 Gbit/s Fibre
• CCCI Cluster (IC): SGI Altix 3700, 1.5 GHz; 2 nodes x 256 PE + 512 GB + 2 TB disk; 1 node x 96 PE + 192 GB + 3 TB disk; SAN bandwidth: 2 GigE per node, NFS mounted; PCP, Workshop Pro, GridEngine, TotalView, NAG
• Computational capability & capacity: 89 coupled climate model years per computational day (1 deg. ocean model, 2 deg. atmospheric)
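To get a feel for the size of the GFDL configuration listed above, the processing-element counts can be tallied per system. The sketch below is only a reading aid. It assumes "N nodes x P PE" means N nodes with P processing elements each; the names `SYSTEMS`, `total_pes`, and `pe_totals` are hypothetical.

```python
# (node count, PEs per node) pairs as read from the GFDL HPCS July 2005 slide.
# Assumption: "N nodes x P PE" means N nodes with P processing elements each.
SYSTEMS = {
    "LSC": [(2, 512), (5, 256), (1, 128)],
    "ANC": [(2, 96)],
    "MDS": [(2, 64)],
    "IC": [(2, 256), (1, 96)],
}


def total_pes(groups):
    """Sum nodes * PEs-per-node over every node group of one system."""
    return sum(nodes * pes for nodes, pes in groups)


def pe_totals(systems):
    """Map each system name to its total PE count."""
    return {name: total_pes(groups) for name, groups in systems.items()}


if __name__ == "__main__":
    totals = pe_totals(SYSTEMS)
    for name, count in totals.items():
        print(f"{name}: {count} PE")
    print(f"All systems: {sum(totals.values())} PE")
```

Under that reading the LSC alone accounts for 2,432 of the 3,360 PEs, which is the scale at which the OS, SAN, and code problems on the previous slide start to bite.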
Questions?
Eric Marshall, Office of Instructional and Research Technology, [email protected], 445-2262