Improving Virtual Prototyping and Certification with Implicit Finite Element Method at Scale
Seid Koric1,2, Robert F. Lucas3, Erman Guleryuz1
1 National Center for Supercomputing Applications
2 Mechanical Science and Engineering Department, University of Illinois
3 Livermore Software Technology Corporation
Blue Waters Symposium 2019, June 5th
Seid Koric, Erman Guleryuz
Todd Simons, James Ong
Jef Dawson, Ting-Ting Zhu
Robert Lucas, Roger Grimes, Francois-Henry Rouet
Overview of the project
- Today: Virtual prototypes supplement physical tests in design and certification
- Vision: Further reduce cost and risk (supplement → replacement)
- Immediate goal: Increase the impact of simulation technology
- Impact of simulation = f (speed, scale, fidelity)
- Performance scaling = f (code, input, machine)
- FEM: Partial differential equations → sparse linear system (see the sketch after this list)
- HPC strategy: Sparse linear algebra → dense linear algebra
- Overall approach: Scale, analyze, and improve with real-life models such as the Rolls-Royce Representative Engine Model
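As a minimal sketch of the "partial differential equations → sparse linear system" step referenced above: the toy 1D bar, mesh size, and the SciPy direct solve below are illustrative assumptions, not LS-DYNA's implementation, which handles 3D models at the 100M DOF scale.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy 1D FEM assembly: discretize -u'' = 1 on [0,1] with linear elements,
# assemble the sparse stiffness matrix, and solve the resulting linear system.
n = 1000                       # number of elements
h = 1.0 / n                    # element length
k_e = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])  # element stiffness

rows, cols, vals = [], [], []
for e in range(n):
    dofs = [e, e + 1]
    for a in range(2):
        for b in range(2):
            rows.append(dofs[a]); cols.append(dofs[b]); vals.append(k_e[a, b])

# Duplicate (row, col) entries are summed on conversion, which is the assembly.
K = sp.coo_matrix((vals, (rows, cols)), shape=(n + 1, n + 1)).tocsc()
R = np.full(n + 1, h)          # consistent nodal load for a unit body force

# Dirichlet BC at both ends, eliminated by restricting to the free DOFs
free = np.arange(1, n)
u = np.zeros(n + 1)
u[free] = spla.spsolve(K[free][:, free], R[free])
print(u[n // 2])               # midpoint displacement of the toy problem
```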
Overview of challenges
- More specific: these apply to LS-DYNA and to any other significant MCAE ISV code
  - Large legacy code: cannot start from scratch, must evolve gracefully
  - General-purpose code: cannot optimize for a narrow class of problems
  - Key algorithms are NP-complete/NP-hard, so heuristics are unavoidable
- More universal: these probably apply to any significant scientific or engineering code
  - Limited number of software development tools, especially for performance engineering
  - Increasing complexity of hardware architectures, combined with frequent design updates
  - Performance portability constraints for codes used on many systems
  - Limited HPC access, especially for ISVs
Parallel scaling at the beginning of the Blue Waters project
[Figure: Wall-clock time (seconds, 0–14,000) vs. MPI ranks (128–2048) for a 100M DOF model with three implicit load steps; "Before" version, hybrid mode with 8 threads per MPI rank.]
Improvement framework and progress highlights
- Memory management improvements
  - Dynamic allocation
- Existing algorithm improvements
  - Inter-node communication
- Previously unknown bottlenecks
  - Constraint processing
- Entirely new algorithms
  - Parallel matrix reordering
  - Parallel symbolic factorization
- Computation workflow modifications
  - Offline parsing and decomposition of the model

[Diagram: iterative Measure → Analyze → Improve → Scale-up cycle]
NCSA OVIS view of LS-DYNA execution
[Figure: Free memory (GB, 0–70) on MPI rank zero's node vs. time (0–140 minutes) for a 105M DOF model on 256 MPI ranks with 8 threads each; annotated phases: input processing, reordering, sequential symbolic preprocessing, and assemble/redistribute/factor/solve; a "2X" marker appears on the trace.]
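A crude stand-in for the quantity the OVIS trace plots: the snippet below simply polls MemAvailable from /proc/meminfo on a Linux node. OVIS/LDMS collects this kind of data with its own samplers, so this is only an illustration of the metric, not of the monitoring framework itself.

```python
import time

def free_memory_gb():
    """Return MemAvailable from /proc/meminfo in GB (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2   # value is reported in kB
    return float("nan")

# Poll a few times; a real monitor (OVIS/LDMS) samples continuously on every node.
for _ in range(3):
    print(f"{time.strftime('%H:%M:%S')}  free memory = {free_memory_gb():.1f} GB")
    time.sleep(1)
```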
Multifrontal sparse linear solver
Multifrontal method: input processing → matrix reordering → symbolic factorization → numeric factorization → triangular solution

[Figure: Assembly tree of submatrices]

Sparse linear system solved at each equilibrium iteration:
$\left[\,{}^{t+\Delta t}K_{i-1}\right]\left\{{}^{t+\Delta t}\Delta u_{i-1}\right\} = \left\{{}^{t+\Delta t}R_{i-1}\right\}$
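The linearized system above is re-assembled and re-solved at every equilibrium iteration of every load step. Below is a minimal sketch of that outer loop; newton_load_step, assemble_tangent, and assemble_residual are hypothetical placeholders for the FEM assembly, and a generic SciPy sparse solve stands in for LS-DYNA's multifrontal solver.

```python
import numpy as np
import scipy.sparse.linalg as spla

def newton_load_step(u, assemble_tangent, assemble_residual,
                     tol=1e-8, max_iter=20):
    """One implicit load step: repeatedly solve [K_{i-1}]{du} = {R_{i-1}}.

    assemble_tangent / assemble_residual are hypothetical callbacks standing in
    for the FEM assembly; they return the sparse tangent stiffness K (CSC) and
    the out-of-balance force vector R at the current state u.
    """
    for i in range(max_iter):
        K = assemble_tangent(u)          # sparse tangent stiffness at iterate i-1
        R = assemble_residual(u)         # residual (external minus internal forces)
        if np.linalg.norm(R) < tol:
            return u, i                  # converged
        du = spla.spsolve(K, R)          # the sparse solve this talk focuses on
        u = u + du
    raise RuntimeError("Newton iteration did not converge")
```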
[Figure: Multifrontal factorization parallel scaling — numeric factorization rate (Tflop/s, 0–30) and numeric factorization memory footprint per process (GB, 0–20) vs. number of threads (2,000–18,000).]
Results – Comparison with MUMPS factorization
LS-GPart nested dissection for eight processors
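To illustrate the idea of nested dissection (not LS-GPart's actual algorithm, which targets general unstructured FEM graphs), here is a toy geometric version for a structured 2D grid: each recursion removes a separator, numbers the two halves first, and numbers the separator last so its fill-in stays confined.

```python
import numpy as np

def nested_dissection_2d(rows, cols):
    """Toy geometric nested dissection ordering for a rows x cols grid graph.

    Illustrative only. Returns an elimination order in which separator nodes
    are numbered after the subdomains they separate.
    """
    idx = np.arange(rows * cols).reshape(rows, cols)

    def recurse(block):
        r, c = block.shape
        if r * c <= 4:                       # small enough: number directly
            return list(block.ravel())
        if r >= c:                           # split along the longer dimension
            mid = r // 2
            left, sep, right = block[:mid], block[mid], block[mid + 1:]
        else:
            mid = c // 2
            left, sep, right = block[:, :mid], block[:, mid], block[:, mid + 1:]
        # number the two halves first, the separator last
        return recurse(left) + recurse(right) + list(sep.ravel())

    return recurse(idx)

order = nested_dissection_2d(8, 8)           # elimination order for an 8x8 grid
print(order)
```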
Results – LS-GPart matrix reordering quality
[Figure: nnz(L+U), normalized with respect to METIS (log scale, 0.1–10), for the test matrices torso3, Transport, memchip, atmosmodd, audikw1, boneS10, cage13, CurlCurl4, G3circuit, HV15R, kktpower, LongCoupdt6, and MLGeer, comparing the AMD, MMD, Sparspak-ND, METIS, Scotch, Spectral, and LS-GPart orderings.]

LS-GPart added to the reordering comparison presented in "Preconditioning Using Rank-Structured Sparse Matrix Factorization", Ghysels et al., SIAM PP 2018.
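A rough analogue of the slide's metric can be reproduced with SciPy's SuperLU interface: count nnz(L+U) under different fill-reducing orderings and normalize against a reference. The random test matrix and the built-in SuperLU orderings below are assumed stand-ins for the SuiteSparse matrices and the AMD/MMD/METIS/Scotch/Spectral/LS-GPart orderings actually compared on the slide.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Build a symmetric, diagonally dominant random sparse matrix as a toy test case.
n = 2000
A = sp.random(n, n, density=0.002, random_state=0)
A = (A + A.T + 10.0 * sp.identity(n)).tocsc()

def fill(ordering):
    """Factor A with the given SuperLU column ordering and return nnz(L+U)."""
    lu = spla.splu(A, permc_spec=ordering)
    return lu.L.nnz + lu.U.nnz

# Normalize against one ordering, mirroring the "normalized w.r.t. METIS" axis.
baseline = fill("COLAMD")
for ordering in ("NATURAL", "MMD_AT_PLUS_A", "COLAMD"):
    print(f"{ordering:>15}: nnz(L+U) / COLAMD = {fill(ordering) / baseline:.2f}")
```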
Results – LS-GPart performance

[Figure: Reordering time (seconds, 0–350) vs. processor count (128–2048) for LS-GPart, ParMETIS, and PT-Scotch.]
Results – New symbolic factorization performance scaling

[Figure: Time (seconds, 0–30) vs. MPI ranks (128–2048) for the new parallel symbolic factorization (mf3Sym with LS-GPart) on a 100M DOF model, broken down by internal routines (mf3Indist, mf3Permute, mf3Locate, mf3Affinity, mf3PrmSize, mf3Redist, mf3ObjMember, wait mf3DomSym, mf3DomSymFct, mf3SepTree, mf3SepSymFct, mf3FormTree, mf3Assign, mf3PostOrder, mf3PtrInit, mf3Symbolic, mf3Owner, mf3KObjStats, mf3Finish, mf3SNtile, mf3SMPsize). This phase took ~300 seconds in the original sequential code.]
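The symbolic phase works only on the nonzero structure of the matrix. As a flavor of what it computes, here is Liu's elimination-tree algorithm in a few lines; it is a textbook building block, not the actual mf3* implementation listed in the legend above.

```python
import numpy as np
import scipy.sparse as sp

def elimination_tree(A):
    """Elimination tree of a symmetric sparse matrix pattern (Liu's algorithm).

    Illustrates the kind of structural analysis the symbolic phase performs.
    Returns parent[j] = parent of column j in the elimination tree, -1 for roots.
    """
    A = sp.csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)              # path-compressed ancestors
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
            i = int(i)
            # walk from row index i toward the root, stopping at column j
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j            # path compression
                if nxt == -1:
                    parent[i] = j
                i = nxt
    return parent
```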
Results – Before and after Blue Waters engagement
[Figure: Wall-clock time (seconds, 0–14,000) vs. MPI ranks (128–2048) for the 100M DOF model with three implicit load steps, comparing "Before, hybrid (8 threads/MPI rank)" with "After, MPP" and "After, hybrid (8 threads/MPI rank)".]
Results – Overall practical impact
q Finite element model with 200 million degrees of freedom
q Cumulative effect of better code and more compute resources
q Two orders of magnitude reduction in time-to-solution
q Work in progress for more practical impact
Future work and concluding remarks
- Industrial challenges are still beyond the capabilities of today's hardware and software!
- New design decisions based on finer-grained analyses and more benchmarks!
- More scale will also be coupled with more physics!
- The right collaboration model accelerates progress!
- HPC access is critical to advancing the state of the art!
- The project benefits a much broader community and many sectors!
- Special thanks to the Blue Waters SEAS team for technical support!