Considerations for Scalable CAE on the SGI ccNUMA Architecture
Stan Posey, Applications Market Development
Cheng Liao, Principal Scientist, FEA Applications
Christian Tanasescu, CAE Applications Manager
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
Hardware transition: from Mainframes to Workstations and Servers

Economics: Physical prototyping costs continue increasing; the engineer is more expensive than the simulation tools

[Chart: cost vs. years (1960 to 2000) - cost of CAE simulation, cost of physical prototyping, cost of CAE engineer]

MSC/NASTRAN Simulation Costs (Source: General Motors): 1960 $30,000; 1999 $0.02
CAE Engineer vs. System Costs (Source: Detroit Big 3): Engineer $36/hr vs. System $1.5/hr
Motivation for CAE Technology
Computer Hardware Advances:
Processors: Ability to "hide" system latency
Architecture: ccNUMA crossbar switch replaces the shared bus
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999
Late 1980's: Shared Memory Parallel
Hardware: Bus-based shared memory parallel (SMP)
Parallel Model: Compiler-enabled loop level (SMP fine grain; see the sketch below)
Characteristics: Low scalability (2p to 6p) but easy to program
Limitations: Expensive memory for vector architectures

Early 1990's: Distributed Memory Parallel
Hardware: MPP and cluster distributed memory parallel (DMP)
Parallel Model: DMP coarse grain through explicit message passing
Characteristics: High scalability (> 64p) but difficult to program
Limitations: Commercial CAE applications generally unavailable

Late 1990's: Distributed Shared Memory Parallel
Hardware: Physically DMP but logically SMP ccNUMA
Parallel Model: SMP fine grain, DMP and SMP coarse grain
Characteristics: High scalability and easy to program
Recent History of Parallel Computing
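To make the loop-level (SMP fine grain) model concrete, here is a minimal sketch using OpenMP-style directives; the arrays and loop are hypothetical, not taken from any CAE code.

```c
/* A minimal loop-level (SMP fine grain) sketch using OpenMP-style
 * directives; the arrays and loop are hypothetical, not from any CAE code. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double dot = 0.0;

    /* the directive parallelizes one loop; threads split the iteration
     * space and the runtime combines the partial sums (reduction)        */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
        dot += x[i] * y[i];
    }

    printf("dot = %.1f using up to %d threads\n", dot, omp_get_max_threads());
    return 0;
}
```

The appeal of this model is exactly what the slide notes: the compiler and runtime handle the parallelism, but scalability is limited by how much work sits inside each loop.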
Origin ccNUMA Architecture Basics
[Diagram: two nodes, each with processors and caches connected through a local switch to main memory, directory (Dir), and I/O; the nodes are joined by the global switch interconnect]
Features of ccNUMA Multi-purpose Architecture

[Diagram: detail of two nodes with a router, and the 32p topology of nodes and routers]
• Origin2000 ccNUMA available since 1996
• Non-blocking crossbar switch as the interconnect fabric
• High levels of scalability over shared-bus SMP
• Physically DMP but logically SMP (cache-coherent memories)
• 2 to 512 MIPS R12000/400 MHz processors with 8 MB L2 cache
• High memory bandwidth (1.6 GB/s) and scalable I/O
• Distributed and shared memory (fine and coarse grain) parallel models
Parallel Computing with ccNUMA
Origin2000/256
Features of ccNUMA Multi-purpose Architecture
Computer Hardware Advances:
Processors: Ability to "hide" system latency
Architecture: ccNUMA crossbar switch replaces the shared bus

Application Software Advances:
Implicit FEA: Sparse solvers increase performance by 10-fold
Explicit FEA: Domain parallel increases performance by 10-fold
CFD: Scalability increases performance by 100-fold
Meshing: Automatic and robust "tetra" meshing
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999
Characterization of CAE Applications

[Chart: degree of parallelism (low to high) vs. compute intensity (flops/word of memory traffic, 0.1 to 1000), ranging from memory-bandwidth-bound to cache-friendly behavior, with MP scalar and vector regions marked. Applications plotted: CFD (FLUENT, STAR-CD, OVERFLOW); Explicit FEA (PAM-CRASH, LS-DYNA, RADIOSS); Implicit FEA statics (ABAQUS, ADINA, ANSYS, MARC, MSC.Nastran SOL 101); Implicit FEA direct frequency (MSC.Nastran SOL 108); Implicit FEA modal frequency (MSC.Nastran SOL 103 and 111)]
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
Implicit FEA - ABAQUS, ANSYS, MSC.Marc, MSC.Nastran
Explicit FEA - LS-DYNA, PAM-CRASH, RADIOSS
General CFD - CFX, FLUENT, STAR-CD

Domain Parallel Example:
Compressible 2D flow over a wedge, partitioned as 4 domains for parallel execution on 4 processors of one system (see the MPI sketch below)

[Figure: wedge flow field split into domains 1-4, each assigned to one of CPU 1-4]
Scalable CAE: Domain Decomposition Parallel
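The domain-parallel idea can be sketched with a minimal MPI halo exchange; this assumes a simple 1-D strip partition and is illustrative only, not the partitioning or communication scheme of any particular CFD code.

```c
/* A minimal MPI halo-exchange sketch of domain-decomposition parallelism,
 * assuming a simple 1-D strip partition (hypothetical, not the scheme of
 * any particular CFD code).                                              */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000                /* cells owned by each process         */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double u[NLOCAL + 2];          /* interior cells + one ghost per side */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    u[0] = u[NLOCAL + 1] = -1.0;   /* ghost cells, filled by the exchange */
    for (int i = 1; i <= NLOCAL; i++)
        u[i] = (double)rank;       /* stand-in for the local flow solution */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* only partition boundaries are communicated each iteration, which is
     * why equal-sized partitions and small interfaces matter for scaling  */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: ghost values %.0f / %.0f\n", rank, u[0], u[NLOCAL + 1]);
    MPI_Finalize();
    return 0;
}
```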
Scalability Emerging for all CAE
Parallel Scalability in CAE

[Chart: parallel scalability (1 to 512 CPUs) for Nastran (SOL 101, 103, 108; V70.5 and V70.7), crash codes, and CFD codes, distinguishing SMP vs. DMP and usable vs. peak parallelism]
Sources that Inhibit Efficient Parallelism

Source: Computational load imbalance
Solution: Nearly equal-sized partitions

Source: Communication overhead between neighboring partitions
Solution: Minimize communication between adjacent cells on different CPUs

Source: Data and process placement
Solution: Enforce memory-process affinity

Source: Message passing performance (MPICH latency: ~31 µs)
Solution: Latency and bandwidth awareness (SGI MPI 3.1 latency: ~12 µs; see the ping-pong sketch below)
[Chart annotations: "Scaling to 16p only" vs. "Scaling to 64p!"]
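The latency figures above come from point-to-point message timing; a minimal ping-pong sketch of that kind of measurement (assuming two MPI ranks) looks like this:

```c
/* A minimal MPI ping-pong sketch of the kind of measurement behind the
 * latency figures above (illustrative only; run with two ranks).       */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int reps = 10000;
    char byte = 0;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 1; }   /* needs two ranks */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* half the average round trip approximates one-way latency */
        printf("one-way latency ~ %.2f us\n", (t1 - t0) / (2.0 * reps) * 1.0e6);

    MPI_Finalize();
    return 0;
}
```

Run with two ranks, e.g. mpirun -np 2.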
Considerations for Scalable CAE
Processor-Memory Affinity (Data Placement)
[Diagram: router (R) and node (N) topology contrasting "process migrates, data stays" with keeping process + data together]
Theory: the system will place data and execution threads together properly, and will migrate the data to follow the executing process.
Real Life: on a 32p Origin2000 this cannot be taken for granted; memory-process affinity must be enforced explicitly (see the first-touch sketch below).
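One common way to enforce memory-process affinity on ccNUMA is first-touch placement: initialize data with the same thread decomposition that the compute phase uses, so pages land near the CPUs that reference them. A minimal sketch, assuming OpenMP and an operating system with a first-touch page policy:

```c
/* A minimal first-touch placement sketch, assuming OpenMP and an OS
 * first-touch page policy: each thread initializes the same array slice
 * it later computes on, so pages are allocated near the CPU that uses
 * them (illustrative only).                                             */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 8000000

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double sum = 0.0;

    /* first touch in parallel: pages of a[] land on each thread's node */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* compute phase with the same static schedule -> mostly local memory */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f with %d threads\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}
```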
Considerations for Scalable CAE
FLUENT Scalability Study of SSI vs. Cluster

Software: FLUENT 5.1.1
CFD Model: External aerodynamics, 3D, segregated, incompressible, isothermal, 29M cells

Time per iteration (seconds) and parallel speed-up:

CPUs      SSI               4 x 64 Cluster
 10     381 s    1.0        424 s    1.0
 30      99 s    3.9        139 s    3.0
 60      67 s    5.7         72 s    5.9
120      29 s   13.1         39 s   10.9
240      18 s   21.2         49 s    8.7

FLUENT Scalability on ccNUMA
Largest FLUENT automotive case achieved near-ideal scaling on SGI 2800/256
SSI Advantage for CFD with MPI

Single System Image (SSI) latency (256-CPU SSI):

CPUs    Shared Memory (ns)    MPI (ns)
  8          528              19 x 10^3
 16          641              23 x 10^3
 32          710              26 x 10^3
 64          796              29 x 10^3
128          903              34 x 10^3
256        1,200              44 x 10^3

Cluster configuration latency (4 x 64 cluster):
HIPPI OS-bypass: 139 x 10^3 ns
OVERFLOW Complete Boeing 747 Aerodynamics Simulation
NASA Ames Research Center / Boeing Commercial Aircraft

[Chart: performance (GFLOP/s, 0 to 75) vs. number of CPUs (0 to 512), marking 60 GFLOPS in Oct 99, the FY98 milestone, and the C916/16 OVERFLOW limit]

Problem: 35M points, 160 zones
Largest model in NASA history; achieved 60 Gflops on SGI 2800/512 with linear scaling

Grand Scale HPC: NASA and Boeing
Computational Requirements for MSC.Nastran

Compute Task            Memory Bandwidth    CPU Cycles
Sparse Direct Solver          7%                93%
Lanczos Solver               60%                40%
Iterative Solver             83%                17%
I/O Activity                100%                 0%
MSC/NASTRAN MPI Based Scalability for SOL 108:
• Independent frequency steps, naturally parallel
• File and memory space not shared
• Near linear parallel scalability
• Improved accuracy over SOL 111 with increasing frequency
• Released on SGI with v70.7 (Oct 99)
MSC/NASTRAN MPI Based Scalability for SOL 103, 111:
• Typical scalability - 2x to 3x on 8p, less for SOL 111
MSC.Nastran Scalability on ccNUMA
Parallel Schematics
Parallel schemes for an excitation frequency of 200 Hz on a 4-CPU system

MSC/NASTRAN MPI Based Scalability for SOL 111 (modes distributed across CPUs):
[Schematic: 0-400 Hz band split by modes - CPU 1: 150 modes, CPU 2: 350 modes, CPU 3: 300 modes, CPU 4: 200 modes]

MSC/NASTRAN MPI Based Scalability for SOL 108 (frequency steps distributed across CPUs; a code sketch follows below):
[Schematic: 0-200 Hz sweep split evenly - CPU 1: steps 1-50, CPU 2: steps 51-100, CPU 3: steps 101-150, CPU 4: steps 151-200]
MSC.Nastran Scalability on ccNUMA
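Because the SOL 108 frequency steps are independent, they parallelize naturally. A minimal sketch of dealing steps out to MPI ranks (hypothetical code, not the MSC.Nastran implementation):

```c
/* A sketch of dealing independent frequency steps out to MPI ranks, in the
 * spirit of the SOL 108 schematic above (hypothetical code, not the
 * MSC.Nastran implementation).                                            */
#include <mpi.h>
#include <stdio.h>

#define NFREQ 96                      /* total frequency steps              */

static void solve_frequency(double f_hz)   /* placeholder for one solve     */
{
    (void)f_hz;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* block distribution: rank r owns steps [lo, hi); no shared files or
     * memory between ranks, so the steps run fully independently          */
    int base = NFREQ / nprocs, rem = NFREQ % nprocs;
    int lo = rank * base + (rank < rem ? rank : rem);
    int hi = lo + base + (rank < rem ? 1 : 0);

    for (int k = lo; k < hi; k++)
        solve_frequency(200.0 * (k + 1) / NFREQ);   /* e.g. a 0-200 Hz sweep */

    printf("rank %d solved frequency steps %d..%d\n", rank, lo, hi - 1);
    MPI_Finalize();
    return 0;
}
```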
[Chart: wall clock in seconds (0 to 140,000) for 1-way, 4-way, and 16-way parallel runs]

CPUs    Elapsed Time (s)    Parallel Speed-up
  1        120,720               1.0
  2         61,680               2.0
  4         32,160               3.8
  8         17,387               6.9
 16         10,387              11.6 (*)
* measured on populated nodes

Cray T90 Baseline Results
SOL: 111
DOF: 525K
Eigensolution: 2714 modes
Freq Steps: 96
Elapsed Time: 31,610 sec
SOL 108 Comparison with Conventional NVH (SOL 111 on T90)
MSC.Nastran Scalability on ccNUMA
CPUs    Elapsed Time (h)    Parallel Speed-up
  1         31.7                 1.0
  8          4.1                 7.8
 16          2.2                14.2
 32          1.4                22.6

Model Description
Model: BIW
SOL: 108
DOF: 536K
Freq Steps: 96

Run Statistics (per MPI Process)
Memory: 340 MB
FFIO Cache: 128 MB
Disk Space: 3.6 GB
Processes/Node: 2
MSC.Nastran Parallel Scalability for Direct Frequency Response (SOL 108)
MSC.Nastran Scalability on ccNUMA
The Future of Automotive NVH Modeling

Higher excitation frequencies of interest will increase DOF and modal density beyond the practical limits of SOL 103 and 111

[Chart: elapsed time vs. frequency for direct frequency response (SOL 108) and modal frequency response (SOL 103, 111), with 199X and 200X model ranges marked]
Future Automotive NVH Modeling
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
[Roadmap chart: capability/features vs. general availability - functionality migrating from UNICOS/Vector to IRIX/MIPS SSI, and on to Linux/IA-64 clusters and SSI]

Economics of HPC Rapidly Changing
SGI Partnership with HPC Community on Technology Roadmap
• Bandwidth improvement of 2x over Origin2000
• Latency decrease of 50% from Origin2000
• System support for IRIX/MIPS or Linux/IA-64
• Modular design allows subsystem upgrades without forklift replacement
• Next-generation IRIX features and improvements:
  - Shared memory to 512 processors and beyond
  - RAS enhancements: resiliency and hot swap
  - Data center management: scheduling, accounting
  - HPC clustering: GSN, CXFS shared file system

SN-MIPS: Features of Next Generation ccNUMA
HPC Architecture Roadmap at SGI
Characterization of CAE Applications

[Chart: the same parallelism vs. compute intensity map of CAE applications, with the region of SN-MIPS benefit marked]
Characterization of CAE Applications

[Chart: the same map with the regions of SN-MIPS benefit and SN-IA benefit marked]

Current as of SEP 1999
Architecture Mix for Automotive HPC

[Bar charts: MP scalar vs. vector share (%) of installed automotive HPC capacity in the USA, Europe, and Japan]

1997 (1.1 TFlops installed in automotive OEMs worldwide): USA 49.8% / 50.2%, Europe 31.1% / 68.9%, Japan 78.3% / 21.7%
1999 (2.9 TFlops installed in automotive OEMs worldwide): USA 18.3% / 81.7%, Europe 35.2% / 64.8%, Japan 72.5% / 27.5%
[Chart: installed GFLOPs (0 to 1,400) by year, 1995 to 1999, for the US, Europe, and Japan]

GM and DaimlerChrysler each grew capacity more than 2x over the past year
Automotive Industry HPC Investments
Meta-Computing with Explicit FEA

Los Alamos and DOE Applied Engineering Analysis: "Stochastic Simulation of 18 CPU Years Completed in 3 Days on ASCI Blue Mtn" - US DOE-supported research achieved the first-ever full-scale ABAQUS/Explicit simulation of nuclear weapons impact response on the Origin/6144 ASCI system (Feb 00)

Ford Motor SRL and NASA Langley: optimization of a vehicle body for NVH and crash completed 9 CPU-months of RADIOSS and MSC.Nastran work overnight with a response surface technique (Apr 00)

BMW Body Engineering: 672 MIPS CPUs dedicated to stochastic crash simulation with PAM-CRASH (Jan 00)
Non-deterministic methods for improved FEA simulation
Future Directions in CAE Applications
Meta-Computing with Explicit FEA
Objective: Manage design uncertainty arising from variability - scatter in materials, loading, test conditions
Approach: Non-deterministic simulation of a vehicle "population" - meta-computing on SSI or a large cluster (a sketch follows below)
Insight: Improved design space exploration - moving the design toward target parameters

[Chart: distribution of simulated responses, from most likely to unlikely performance]
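A minimal sketch of the meta-computing approach: each MPI rank perturbs its own design and material inputs, runs an independent analysis, and the resulting population of responses is gathered for statistics. All names and the perturbation model below are hypothetical, not taken from the studies cited here.

```c
/* A minimal meta-computing sketch: each MPI rank perturbs its own inputs,
 * runs an independent analysis, and the population of responses is
 * gathered for statistics (all names and the perturbation model are
 * hypothetical).                                                          */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* placeholder for one full crash/NVH analysis with perturbed inputs */
static double run_analysis(double thickness_mm, double yield_mpa)
{
    return thickness_mm * yield_mpa;          /* stand-in response value */
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(1000 + rank);                       /* independent sample per rank */
    double s1 = (rand() / (double)RAND_MAX - 0.5) * 0.1;   /* +/- 5% scatter */
    double s2 = (rand() / (double)RAND_MAX - 0.5) * 0.1;

    double resp = run_analysis(1.2 * (1.0 + s1), 250.0 * (1.0 + s2));

    /* gather the "population" of responses on rank 0 for statistics */
    double *all = (rank == 0) ? malloc(nprocs * sizeof(double)) : NULL;
    MPI_Gather(&resp, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double mean = 0.0;
        for (int i = 0; i < nprocs; i++) mean += all[i];
        printf("mean response over %d samples: %.2f\n", nprocs, mean / nprocs);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```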
NVH & Crash Optimization of Vehicle Body Overnight
Ford Motor Scientific Research Labs / NASA Langley Research Center

Achieved overnight BIP optimization on SGI 2800/256, with an equivalent yield of 9 months of CPU time
• Ford body-in-prime (BIP) model of 390K DOF
• MSC.Nastran for NVH, 30 design variables
• RADIOSS for crash, 20 design variables
• 10 design variables in common
• Sensitivity based Taylor approx. for NVH
• Polynomial response surface for crash
Grand Scale HPC: NASA and Ford
Historical Growth of CAE Application
Source: Survey of major automotive developers

[Chart: growth index (1 to 100, 1993 to 1999) for key CAE metrics - crash model size (to 450,000 elements), NVH model size (to 2 million DOF), CFD model size (to >10 million cells), installed capacity in GFlops (largest site 564 Gflops), number of engineers, cost per CPU-hour, crash turnaround time with SMP, and crash/CFD turnaround time with MPP; growth factors range from roughly x5-x7 to x90+]
CAE to evolve into fully scalable, RISC-based technology: high-resolution models - CFD today; crash and FEA emerging

Deterministic CAE giving way to probabilistic techniques: deployment increases computational requirements 10-fold

Visual interaction with models beyond 3M cells/DOF: high-resolution modeling will strain visualization technology

Multi-discipline optimization (MDO) implementation in earnest: coupling of structure, fluids, acoustics, electromagnetics
Future Directions of Scalable CAE
Conclusions
For small and medium-size problems, a cluster can be a viable solution in the range of 8 to 16 CPUs.

For large and extremely large problems, the SSI architecture provides better parallel performance because of the superior characteristics of the in-box interconnect.

To increase single-CPU performance, developers should consider how their data structures and algorithms map onto the specific memory hierarchy.

A ccNUMA system allows coupling of different parallel programming paradigms, which can benefit the performance of multiphysics applications.