High Performance Computing
Compute Clusters at the Chrysler Group: Battle Scars and Victory Parades
June 24, 2003
John Picklo
Manager, Mainframes and High Performance Computing
Agenda
Introductory comments
History of clusters at the Chrysler Group
Implementation issues
Lessons learned
Chrysler Group Product Development
Design and engineering for Chrysler Group vehicles
High Performance Computing
Computer Aided Engineering
Vehicle simulation:
- Fluid Dynamics
- Impact Events
- Noise, Vibration, Harshness
CAE Simulation Process
Pre-process
- Create meshed model
- Elements and properties
- > 700,000 elements
Simulation
- Batch process
- Compute and memory intensive
- Duration: hours to days
Post-process
- Graph results
- Visualize
- Animate
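A minimal sketch of this pipeline as a Python batch driver. The stage commands (premesh, solve, postviz) are hypothetical placeholders, not actual CAE tools; a real deck would chain vendor pre-processors and solvers here.

    import subprocess

    # Hypothetical three-stage CAE driver; the tool names below are
    # placeholders for real pre-processors, solvers, and visualizers.
    STAGES = [
        ["premesh", "model.cad", "-o", "model.mesh"],   # pre-process: build the mesh
        ["solve", "model.mesh", "-o", "model.result"],  # batch solve: hours to days
        ["postviz", "model.result"],                    # post-process: graph/animate
    ]

    for cmd in STAGES:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop the pipeline if a stage fails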
Workload Characteristics
Multi-CPU jobs: 1-24
- Processor, memory, and I/O intensive
- Not data centric
Heterogeneous systems
- Multi-vendor
- Evolutionary introduction of technology
- Jobs modified to match environment
LSF provides workload management (Impact, NVH)
- Job scheduling interface
- Load leveling across systems
- Maximize utilization and minimize queue time
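As a concrete illustration of that job scheduling interface, a hedged sketch of submitting a multi-CPU job through LSF's bsub from Python. The bsub options shown (-n, -q, -J, -o) are standard LSF; the queue name and solver command are made-up examples.

    import subprocess

    # Submit an 8-slot job to LSF. Queue "cae" and the solver command
    # are hypothetical; only the bsub flags themselves are standard.
    def submit(command, cpus=8, queue="cae"):
        subprocess.run(
            ["bsub",
             "-n", str(cpus),            # job slots (CPUs) requested
             "-q", queue,                # target queue
             "-J", "impact_run",         # job name
             "-o", "impact_run.%J.out",  # log file; %J expands to the job ID
             ] + command.split(),
            check=True,
        )

    submit("solver input.deck")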
HPC Goals
Reduce simulation cycle time
- Shorten the design timeline
- Evaluate more design alternatives
Reduce costs
- Engineer productivity
- Physical tests
- Cost of computing
Increased accuracy
- Improved vehicles
HPC Systems
HP:
- 6 Superdomes (352 CPUs)
- 96-node Itanium cluster (192 CPUs)
SGI:
- 12 Origin 300 (384 CPUs)
- 4 Origin 3000 (256 CPUs)
- 2 Origin 2000 (128 CPUs)
IBM:
- 172-node Pentium cluster (344 CPUs)
- 64-node Pentium cluster (128 CPUs)
Chrysler Group's First Cluster
HP N-Class, 56 compute nodes
- Installed 1998; first entry into MPI and clusters
- Retired 2003
- Impact and NVH workloads
- 4-8 CPUs per node, 400 MHz PA-8500
- 8-12 CPUs per job
- Gigabit and HyperFabric backbone
- 11 TB scratch disk
- 5 front-end L-Class data servers with fail-over software
Impact Pentium Cluster
IBM IntelliStations: 172 nodes, 344 CPUs
- Xeon 2.2 GHz and 2.8 GHz
- Gigabit internal network, 12 nodes per switch
- Red Hat Linux 7.2
- Installed July 2002
- Management node with xCAT software
- Storage node with 1.8 TB shared disk
Itanium NVH Cluster
96 nodes, 192 CPUs
- HP zx6000
- HP-UX 11.22
- MSC/Nastran
- 675 GB shared storage
- 288 GB scratch space per node
- Gigabit internal network
CFD Hybrid Clusters
2 SGI SMP front-end systems
- Geographically separated for disaster recovery purposes
- 64 CPUs, 64 GB
- 64-bit applications, job scheduling, and file serving
Pentium compute cluster behind each front-end
- 32 nodes each (64 CPUs)
- IBM x335, 2.8 GHz
- 32-bit applications
PBS Pro workload management
- Load leveling between sites and front/back ends
- Match workload to available licenses
Cluster Enablers
ISVs
- Implement the Message Passing Interface (MPI); see the sketch after this list
- Port to Linux
- Quality assurance
Intel
- 1.8 GHz Xeon performance exceeds RISC
- Itanium 2 performance
Integrators
- Industrial-grade solutions, not a "bag of wires"
- Turnkey implementation services
- Hardware and software support
- Cost-effective designs
- Integrator provides research and development: node speed and size, back-end network
ClusterWare
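What "implement MPI" amounts to for a solver, per the ISV bullet above: split the elements across ranks and combine partial results. A toy sketch using Python's mpi4py bindings (an assumption made for illustration; production CAE solvers do this in C or Fortran):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the job
    size = comm.Get_size()   # total processes in the job

    # Toy decomposition: spread 700,000 elements evenly across ranks.
    n_elements = 700_000
    chunk = n_elements // size
    start = rank * chunk
    end = n_elements if rank == size - 1 else start + chunk

    partial = float(end - start)  # stand-in for real per-rank solver work
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"{size} ranks covered {int(total)} elements")

Run with, e.g., mpiexec -n 8 python decompose.py.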
Internal Network
Speed vs. efficiency (see the ping-pong sketch after this list)
- Proprietary (Myrinet, Scali, etc.)
- 10/100/1000 Ethernet
- New technology (InfiniBand)
Switch size
- Flat is expensive
- Hierarchical limits job size
Sub-clusters
- Maintaining locality of traffic for jobs
Multiple clusters
- Network of private networks
- Separating public and private traffic
Traffic types
- Data
- MPI
- Management
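The speed-vs-efficiency question is usually settled with a ping-pong test: two nodes bounce a message, and the round-trip time exposes the interconnect's latency. A minimal mpi4py sketch (message size and repetition count are arbitrary choices):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    reps = 1000
    buf = np.zeros(1024, dtype="b")  # 1 KB message; grow it to probe bandwidth

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)    # ping...
            comm.Recv(buf, source=1)  # ...pong
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t1 = MPI.Wtime()

    if rank == 0:
        # Half the average round trip approximates one-way latency.
        print(f"one-way latency ~ {(t1 - t0) / reps / 2 * 1e6:.1f} us")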
Workload Management
Workload management is more than job scheduling:
- Clean-up after kill / failure (see the sketch after this list)
- Accounting
- Visibility to back-end processes
Sub-clusters
- Maintaining locality of traffic
Heterogeneous systems
- Unequal nodes: number of CPUs, CPU speed
- Mixed environment (SMP and clusters)
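On clean-up after kill: schedulers generally deliver SIGTERM before SIGKILL, so a job wrapper can turn the kill into a normal exit and scrub its node-local scratch. A minimal sketch; the scratch location is a hypothetical example:

    import atexit
    import shutil
    import signal
    import sys
    import tempfile

    # Hypothetical node-local scratch directory created for this job.
    scratch = tempfile.mkdtemp(prefix="job_scratch_")

    # Always remove the scratch area when the process exits...
    atexit.register(shutil.rmtree, scratch, ignore_errors=True)

    # ...and convert the scheduler's SIGTERM into a normal exit so the
    # atexit cleanup still runs before a follow-up SIGKILL arrives.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))

    print(f"working in {scratch}")  # long-running solver work goes here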
Data
Internal disks (see the sketch after this list)
- Do you need / want internal disks?
- Is there activity that should be targeted to the internal disks?
Shared disk
- SAN: costly and complex, high performance
- NFS: less expensive and simpler, moderate performance
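One way to act on the internal-disk question above: keep the solver's heavy intermediate I/O on node-local disk and let only the final result cross the network to NFS. A sketch with hypothetical mount points (/scratch, /shared/results):

    import os
    import shutil

    LOCAL_SCRATCH = "/scratch"      # hypothetical fast node-local disk
    SHARED_NFS = "/shared/results"  # hypothetical NFS mount

    def run_job(job_id):
        workdir = os.path.join(LOCAL_SCRATCH, job_id)
        os.makedirs(workdir, exist_ok=True)
        tmp = os.path.join(workdir, "solver.tmp")
        with open(tmp, "wb") as f:               # heavy scratch traffic stays local
            f.write(b"\0" * 1024 * 1024)         # stand-in for solver I/O
        os.makedirs(SHARED_NFS, exist_ok=True)
        shutil.copy(tmp, os.path.join(SHARED_NFS, job_id + ".result"))
        shutil.rmtree(workdir)                   # free the local disk

    run_job("job42")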
Operating System
Corporate culture
- Corporate standards for open source
- Niche implementations vs. corporate direction
- Can you fly below the radar?
Which OS is right for your cluster?
- Linux concerns
- What about Windows?
- HP-UX for Itanium
Resilience
Likelihood of failure
- Are commodity components more or less likely to fail?
Impact of failure
- Cost of a lost node
HW support
- Back-end nodes vs. shared components
- Off-hours level of diagnosis and ability to repair
DCX Failure Experiences (2002)
SMP: 0.10% of jobs lost to system failure
Cluster: 0.05% of jobs lost to system failure
Cost savings for mixed-mode hardware support
ClusterWare
Number of offerings
Need for standardization
Interface to other standard tools
Open source vs. proprietary
Choosing Your Integrator
Roll your own: a "bag of wires"
- Who you gonna call?
- How much did you really save?
- When will you find out if you made a configuration mistake?
Application expertise is the dominant factor
- Hardware and software support
- Let your integrators do your research and development
- There are plenty of players; when in doubt, bid it out
Futures
Internal networks
- Ethernet improvements
- InfiniBand
- Proprietary (Myrinet, Quadrics, etc.): cost, future, duplicate network
Futures
Multiple clusters
- Homogeneous interface to heterogeneous clusters
- Multiple operating systems
Hybrid hardware
- 32-bit and 64-bit applications
ClusterWare and workload management
Cluster experience will benefit Grid implementation
Benefits
Cost metrics?
- Transactions (jobs) per dollar spent
Performance metrics?
- Turnaround vs. throughput (see the toy calculation after this list)
Visibility
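A toy calculation making both metrics concrete. All numbers are hypothetical placeholders, not DCX figures:

    # Hypothetical inputs for illustration only.
    annual_cost = 1_000_000.0   # hardware + support + facilities, $/year
    jobs_per_year = 50_000      # completed simulations

    # Cost metric: transactions (jobs) per dollar spent.
    print(f"jobs per dollar: {jobs_per_year / annual_cost:.3f}")

    # Turnaround (one job's wall time) is not throughput (jobs per day):
    # more nodes raise throughput; faster nodes cut turnaround.
    wall_hours = 12.0           # average turnaround per job
    concurrent_jobs = 40        # jobs running at once
    print(f"throughput: {concurrent_jobs * 24 / wall_hours:.0f} jobs/day")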
Benefits
Pentium cluster
- 20% performance improvement over an equivalent number of RISC processors
- 40% cost benefit
Itanium cluster
- 50% performance improvement over RISC
Enablers
It is all about the application
Know your workload
- How many bells and whistles will provide benefits?
Run benchmarks
Use application-savvy integrators
Cluster Strategy
Good for only the next installation
- Technology is changing too fast
- When does it make sense to expand an existing cluster?
Niche solutions
- Don't have to transform the enterprise
Corporate resistance
- The only thing more notorious than a successful implementation is an unsuccessful one
Perspective
Remember, the primary reason to embrace clusters is to reduce costs. It is easy to lose sight of this goal and succumb to the temptation to over-configure.