Improving Virtual Prototyping and Certification with Implicit Finite Element Method at Scale
Seid Koric1,2, Robert F. Lucas3, Erman Guleryuz1
1 National Center for Supercomputing Applications
2 Mechanical Science and Engineering Department, University of Illinois
3 Livermore Software Technology Corporation
Blue Waters Symposium 2019, June 5th
Seid Koric, Erman Guleryuz
Todd Simons, James Ong
Jef Dawson, Ting-Ting Zhu
Robert Lucas, Roger Grimes, Francois-Henry Rouet
Overview of the project
- Today: Virtual prototypes supplement physical tests in design and certification
- Vision: Further reduce cost and risk (supplement → replacement)
- Immediate goal: Increase the impact of simulation technology
- Impact of simulation = f (speed, scale, fidelity)
- Performance scaling = f (code, input, machine)
- FEM: Partial differential equations → sparse linear system (see the sketch after this list)
- HPC strategy: Sparse linear algebra → dense linear algebra
- Overall approach: Scale, analyze, and improve with real-life models such as the Rolls-Royce Representative Engine Model
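As a minimal sketch of the "partial differential equations → sparse linear system" step referenced above: the toy 1D bar, mesh size, and the SciPy direct solve below are illustrative assumptions, not LS-DYNA's implementation, which handles 3D models at the 100M DOF scale.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy 1D FEM assembly: discretize -u'' = 1 on [0,1] with linear elements,
# assemble the sparse stiffness matrix, and solve the resulting linear system.
n = 1000                       # number of elements
h = 1.0 / n                    # element length
k_e = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])  # element stiffness

rows, cols, vals = [], [], []
for e in range(n):
    dofs = [e, e + 1]
    for a in range(2):
        for b in range(2):
            rows.append(dofs[a]); cols.append(dofs[b]); vals.append(k_e[a, b])

# Duplicate (row, col) entries are summed on conversion, which is the assembly.
K = sp.coo_matrix((vals, (rows, cols)), shape=(n + 1, n + 1)).tocsc()
R = np.full(n + 1, h)          # consistent nodal load for a unit body force

# Dirichlet BC at both ends, eliminated by restricting to the free DOFs
free = np.arange(1, n)
u = np.zeros(n + 1)
u[free] = spla.spsolve(K[free][:, free], R[free])
print(u[n // 2])               # midpoint displacement of the toy problem
```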
Overview of challenges
- More specific: these apply to LS-DYNA and to any other significant MCAE ISV code
  - Large legacy code: cannot start from scratch, must evolve gracefully
  - General-purpose code: cannot optimize for a narrow class of problems
  - Key algorithms are NP-complete/NP-hard, so heuristics are unavoidable
- More universal: these probably apply to any significant scientific or engineering code
  - Limited number of software development tools, especially for performance engineering
  - Increasing complexity of hardware architectures, combined with frequent design updates
  - Performance portability constraints for codes used on many systems
  - Limited HPC access, especially for ISVs
Parallel scaling at the beginning of the Blue Waters project
[Figure: Wall-clock time (seconds, 0–14,000) vs. MPI ranks (128–2048) for a 100M DOF model with three implicit load steps; "Before" version, hybrid mode with 8 threads per MPI rank.]
Improvement framework and progress highlights
- Memory management improvements
  - Dynamic allocation
- Existing algorithm improvements
  - Inter-node communication
- Previously unknown bottlenecks
  - Constraint processing
- Entirely new algorithms
  - Parallel matrix reordering
  - Parallel symbolic factorization
- Computation workflow modifications
  - Offline parsing and decomposition of the model

[Diagram: iterative Measure → Analyze → Improve → Scale-up cycle]
NCSA OVIS view of LS-DYNA execution
[Figure: Free memory (GB, 0–70) on MPI rank zero's node vs. time (0–140 minutes) for a 105M DOF model on 256 MPI ranks with 8 threads each; annotated phases: input processing, reordering, sequential symbolic preprocessing, and assemble/redistribute/factor/solve; a "2X" marker appears on the trace.]
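A crude stand-in for the quantity the OVIS trace plots: the snippet below simply polls MemAvailable from /proc/meminfo on a Linux node. OVIS/LDMS collects this kind of data with its own samplers, so this is only an illustration of the metric, not of the monitoring framework itself.

```python
import time

def free_memory_gb():
    """Return MemAvailable from /proc/meminfo in GB (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2   # value is reported in kB
    return float("nan")

# Poll a few times; a real monitor (OVIS/LDMS) samples continuously on every node.
for _ in range(3):
    print(f"{time.strftime('%H:%M:%S')}  free memory = {free_memory_gb():.1f} GB")
    time.sleep(1)
```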
Multifrontal sparse linear solver
Multifrontal method: input processing → matrix reordering → symbolic factorization → numeric factorization → triangular solution

[Figure: Assembly tree of submatrices]

Sparse linear system solved at each equilibrium iteration:
$\left[\,{}^{t+\Delta t}K_{i-1}\right]\left\{{}^{t+\Delta t}\Delta u_{i-1}\right\} = \left\{{}^{t+\Delta t}R_{i-1}\right\}$
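The linearized system above is re-assembled and re-solved at every equilibrium iteration of every load step. Below is a minimal sketch of that outer loop; newton_load_step, assemble_tangent, and assemble_residual are hypothetical placeholders for the FEM assembly, and a generic SciPy sparse solve stands in for LS-DYNA's multifrontal solver.

```python
import numpy as np
import scipy.sparse.linalg as spla

def newton_load_step(u, assemble_tangent, assemble_residual,
                     tol=1e-8, max_iter=20):
    """One implicit load step: repeatedly solve [K_{i-1}]{du} = {R_{i-1}}.

    assemble_tangent / assemble_residual are hypothetical callbacks standing in
    for the FEM assembly; they return the sparse tangent stiffness K (CSC) and
    the out-of-balance force vector R at the current state u.
    """
    for i in range(max_iter):
        K = assemble_tangent(u)          # sparse tangent stiffness at iterate i-1
        R = assemble_residual(u)         # residual (external minus internal forces)
        if np.linalg.norm(R) < tol:
            return u, i                  # converged
        du = spla.spsolve(K, R)          # the sparse solve this talk focuses on
        u = u + du
    raise RuntimeError("Newton iteration did not converge")
```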
[Figure: Multifrontal factorization parallel scaling — numeric factorization rate (Tflop/s, 0–30) and numeric factorization memory footprint per process (GB, 0–20) vs. number of threads (2,000–18,000).]
Results – Comparison with MUMPS factorization
LS-GPart nested dissection for eight processors
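To illustrate the idea of nested dissection (not LS-GPart's actual algorithm, which targets general unstructured FEM graphs), here is a toy geometric version for a structured 2D grid: each recursion removes a separator, numbers the two halves first, and numbers the separator last so its fill-in stays confined.

```python
import numpy as np

def nested_dissection_2d(rows, cols):
    """Toy geometric nested dissection ordering for a rows x cols grid graph.

    Illustrative only. Returns an elimination order in which separator nodes
    are numbered after the subdomains they separate.
    """
    idx = np.arange(rows * cols).reshape(rows, cols)

    def recurse(block):
        r, c = block.shape
        if r * c <= 4:                       # small enough: number directly
            return list(block.ravel())
        if r >= c:                           # split along the longer dimension
            mid = r // 2
            left, sep, right = block[:mid], block[mid], block[mid + 1:]
        else:
            mid = c // 2
            left, sep, right = block[:, :mid], block[:, mid], block[:, mid + 1:]
        # number the two halves first, the separator last
        return recurse(left) + recurse(right) + list(sep.ravel())

    return recurse(idx)

order = nested_dissection_2d(8, 8)           # elimination order for an 8x8 grid
print(order)
```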
Results – LS-GPart matrix reordering quality
[Figure: nnz(L+U), normalized with respect to METIS (log scale, 0.1–10), for the test matrices torso3, Transport, memchip, atmosmodd, audikw1, boneS10, cage13, CurlCurl4, G3circuit, HV15R, kktpower, LongCoupdt6, and MLGeer, comparing the AMD, MMD, Sparspak-ND, METIS, Scotch, Spectral, and LS-GPart orderings.]

LS-GPart added to the reordering comparison presented in "Preconditioning Using Rank-Structured Sparse Matrix Factorization", Ghysels et al., SIAM PP 2018.
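A rough analogue of the slide's metric can be reproduced with SciPy's SuperLU interface: count nnz(L+U) under different fill-reducing orderings and normalize against a reference. The random test matrix and the built-in SuperLU orderings below are assumed stand-ins for the SuiteSparse matrices and the AMD/MMD/METIS/Scotch/Spectral/LS-GPart orderings actually compared on the slide.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Build a symmetric, diagonally dominant random sparse matrix as a toy test case.
n = 2000
A = sp.random(n, n, density=0.002, random_state=0)
A = (A + A.T + 10.0 * sp.identity(n)).tocsc()

def fill(ordering):
    """Factor A with the given SuperLU column ordering and return nnz(L+U)."""
    lu = spla.splu(A, permc_spec=ordering)
    return lu.L.nnz + lu.U.nnz

# Normalize against one ordering, mirroring the "normalized w.r.t. METIS" axis.
baseline = fill("COLAMD")
for ordering in ("NATURAL", "MMD_AT_PLUS_A", "COLAMD"):
    print(f"{ordering:>15}: nnz(L+U) / COLAMD = {fill(ordering) / baseline:.2f}")
```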
Results – LS-GPart performance

[Figure: Reordering time (seconds, 0–350) vs. processor count (128–2048) for LS-GPart, ParMETIS, and PT-Scotch.]
Results – New symbolic factorization performance scaling

[Figure: Time (seconds, 0–30) vs. MPI ranks (128–2048) for the new parallel symbolic factorization (mf3Sym with LS-GPart) on a 100M DOF model, broken down by internal routines (mf3Indist, mf3Permute, mf3Locate, mf3Affinity, mf3PrmSize, mf3Redist, mf3ObjMember, wait mf3DomSym, mf3DomSymFct, mf3SepTree, mf3SepSymFct, mf3FormTree, mf3Assign, mf3PostOrder, mf3PtrInit, mf3Symbolic, mf3Owner, mf3KObjStats, mf3Finish, mf3SNtile, mf3SMPsize). This phase took ~300 seconds in the original sequential code.]
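The symbolic phase works only on the nonzero structure of the matrix. As a flavor of what it computes, here is Liu's elimination-tree algorithm in a few lines; it is a textbook building block, not the actual mf3* implementation listed in the legend above.

```python
import numpy as np
import scipy.sparse as sp

def elimination_tree(A):
    """Elimination tree of a symmetric sparse matrix pattern (Liu's algorithm).

    Illustrates the kind of structural analysis the symbolic phase performs.
    Returns parent[j] = parent of column j in the elimination tree, -1 for roots.
    """
    A = sp.csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)              # path-compressed ancestors
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
            i = int(i)
            # walk from row index i toward the root, stopping at column j
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j            # path compression
                if nxt == -1:
                    parent[i] = j
                i = nxt
    return parent
```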
Results – Before and after Blue Waters engagement
[Figure: Wall-clock time (seconds, 0–14,000) vs. MPI ranks (128–2048) for the 100M DOF model with three implicit load steps, comparing "Before, hybrid (8 threads/MPI rank)" with "After, MPP" and "After, hybrid (8 threads/MPI rank)".]
Results – Overall practical impact
q Finite element model with 200 million degrees of freedom
q Cumulative effect of better code and more compute resources
q Two orders of magnitude reduction in time-to-solution
q Work in progress for more practical impact
Future work and concluding remarks
- Industrial challenges are still beyond the capabilities of today's hardware and software!
- New design decisions based on finer-grained analyses and more benchmarks!
- More scale will also be coupled with more physics!
- The right collaboration model accelerates progress!
- HPC access is critical to advancing the state of the art!
- The project benefits a much broader community and many sectors!
- Special thanks to the Blue Waters SEAS team for technical support!