State of the art distributed parallel computational ... · 1 PARENG- 2011 Louis Komzsik State of the art distributed parallel computational techniques in industrial finite element

1 Louis Komzsik PARENG- 2011

State of the art distributed parallel computational techniques in industrial finite element analysis

Second Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering

Dr. Louis Komzsik

Siemens PLM Software, USA

Ajaccio, France

April 12-15, 2011


Introduction to industrial analysis

Geometric domain decomposition

Distributed computational solutions

Parallel computational kernels

Application case studies

Conclusions and future work

Scope of presentation


Industrial complexity – constantly increasing

Engine block: 1,000,000 elements

Car: 30,000 parts

Jet engine: 10,000 parts

Factory: 10,000 machines



Computer hardware – constantly changing

               Cray Computer     Multi-core CPU
Price          $15 million       $150
Performance    O(1) gigaflops    O(100) gigaflops
Units sold     1000              100 million


Lifecycle simulations

Designer view

Analyst view


Multidisciplinary solutions

Designer view

Analyst view


High performance requirements

The constrained stiffness matrix of an analysis problem

• Number of rows: 35,734,709

• Nonzero terms: 1,384,305,995

• Nonzero terms in sparse factor matrix: 43,827,004,000

• Memory used during factorization: 1,080,732,000 (4 byte) words

• Actual elapsed time of sparse factorization on a single high performance processor: 335 minutes
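A few derived quantities follow from these figures by simple arithmetic. The sketch below uses only the numbers quoted on the slide; the variable names and the choice of decimal gigabytes are illustrative assumptions.

```python
# Numbers quoted on the slide.
rows = 35_734_709
nonzeros_matrix = 1_384_305_995
nonzeros_factor = 43_827_004_000
memory_words = 1_080_732_000      # 4-byte words
elapsed_minutes = 335

# Derived quantities (plain arithmetic, not stated on the slide).
fill_ratio = nonzeros_factor / nonzeros_matrix   # growth of nonzero terms in the factor
terms_per_row = nonzeros_matrix / rows           # average nonzero terms per row
memory_gb = memory_words * 4 / 1e9               # decimal gigabytes
elapsed_hours = elapsed_minutes / 60

print(f"fill-in ratio of the factor: {fill_ratio:.1f}")
print(f"average nonzero terms per row: {terms_per_row:.1f}")
print(f"factorization memory: {memory_gb:.2f} GB")
print(f"elapsed time: {elapsed_hours:.2f} hours")
```

The factor holds roughly thirty times more nonzero terms than the matrix itself, which is why a single processor needs several hours for the factorization.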










Single level geometric domain decomposition

• Subdivide large geometry domains into a limited number of partitions

• Computations in the geometry partitions are dependent

• Minimize the boundary size of each partition with respect to its interior

• Minimize the total boundary size, since the boundaries require communication

[Figure: partitions assigned to Proc 1, Proc 2, ..., Proc k]
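The effect of partition shape on boundary size can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the partitioner discussed in the talk: it splits a hypothetical 16 x 16 grid of nodes into four partitions in two ways (strips and square blocks) and counts the grid edges that cross a partition boundary, a simple stand-in for the communication volume.

```python
def cut_edges(part):
    """Count grid edges whose two end nodes lie in different partitions."""
    n_rows, n_cols = len(part), len(part[0])
    cut = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Edge to the right-hand neighbour.
            if j + 1 < n_cols and part[i][j] != part[i][j + 1]:
                cut += 1
            # Edge to the neighbour below.
            if i + 1 < n_rows and part[i][j] != part[i + 1][j]:
                cut += 1
    return cut

n = 16  # hypothetical grid size
# Four horizontal strips of four rows each.
strips = [[i // 4 for j in range(n)] for i in range(n)]
# Four square blocks of 8 x 8 nodes.
blocks = [[2 * (i // 8) + (j // 8) for j in range(n)] for i in range(n)]

strip_cut = cut_edges(strips)
block_cut = cut_edges(blocks)
print("strips:", strip_cut, "blocks:", block_cut)
```

Both layouts give every partition 64 nodes, but the compact blocks cut fewer edges than the strips, which is the motivation for minimizing boundary size relative to the interior.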


Single level

Multi-level geometry domain decomposition

• Subdivide large geometry domains into a limited number of partitions

• Subdivide the partitions into sub-partitions and dynamically reduce them to their collectors

• Assemble the multilevel substructures to obtain the engineering solution

• The total number of substructures may exceed the number of processors


Finite element problem domain decomposition

Based on model or matrices

Graph         Matrix            FE model
Vertices      Diagonal terms    Node points
Edges         Off-diagonals     Elements
Undirected    Symmetric         Linear


Graphs and matrices

Graph model and its Laplacian matrix

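Finite element model and its stiffness matrix

[Figure: finite element model with node points 1 to 5 and two membrane elements, Membrane Element 1 and Membrane Element 2]

$$
K = \begin{bmatrix}
k & k & 0 & k & 0 \\
k & k & 0 & k & 0 \\
0 & 0 & k & k & k \\
k & k & k & k & k \\
0 & 0 & k & k & k
\end{bmatrix}
$$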

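[Figure: graph model with vertices 1 to 5]

Laplacian matrix of the graph model:

$$
L = \begin{bmatrix}
 2 & -1 &  0 & -1 &  0 \\
-1 &  2 &  0 & -1 &  0 \\
 0 &  0 &  2 & -1 & -1 \\
-1 & -1 & -1 &  4 & -1 \\
 0 &  0 & -1 & -1 &  2
\end{bmatrix}
$$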
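The correspondence between a graph and its Laplacian matrix can be sketched in a few lines. The edge list below is an assumption read off from the off-diagonal terms of the matrix L on this slide (rows and columns numbered 1 to 5); the function name is illustrative.

```python
def graph_laplacian(n, edges):
    """Build the Laplacian of an undirected graph with vertices 1..n."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][j] -= 1      # each edge gives a symmetric pair of -1 off-diagonals
        L[j][i] -= 1
        L[i][i] += 1      # and adds one to the degree of both end vertices
        L[j][j] += 1
    return L

# Edges inferred from the nonzero off-diagonal terms of L on the slide.
edges = [(1, 2), (1, 4), (2, 4), (3, 4), (3, 5), (4, 5)]
L = graph_laplacian(5, edges)
for row in L:
    print(row)
```

Each diagonal term is the degree of the vertex and every row sums to zero, which mirrors the table above: vertices map to diagonal terms and edges to off-diagonal terms.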


Partitioning technology

Spectral bisection method

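$$
L u_2 = \lambda_2 u_2:
\qquad
\begin{bmatrix}
 2 & -1 &  0 & -1 &  0 \\
-1 &  2 &  0 & -1 &  0 \\
 0 &  0 &  2 & -1 & -1 \\
-1 & -1 & -1 &  4 & -1 \\
 0 &  0 & -1 & -1 &  2
\end{bmatrix}
\begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ 0 \\ -1/2 \end{bmatrix}
= 1 \cdot
\begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ 0 \\ -1/2 \end{bmatrix}
$$

[Figure: the five-vertex graph before partitioning]

Vertex cut result

[Figure: the graph split into two three-vertex parts that share the cut vertex]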
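The spectral bisection step can be reproduced numerically for the five-vertex example. The sketch below is a dependency-free illustration under stated assumptions: it finds the second eigenpair of L by shifted power iteration with the constant vector removed, which is a simple substitute chosen for brevity and not the eigensolver an industrial code would use (a Lanczos-type method would be the usual choice). The function name, iteration count and tolerance are hypothetical.

```python
import math

# Laplacian of the five-vertex graph, as given on the slide.
L = [
    [ 2, -1,  0, -1,  0],
    [-1,  2,  0, -1,  0],
    [ 0,  0,  2, -1, -1],
    [-1, -1, -1,  4, -1],
    [ 0,  0, -1, -1,  2],
]

def fiedler_pair(L, iterations=200):
    """Approximate (lambda_2, u_2) of a connected graph Laplacian."""
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n))   # Gershgorin bound on lambda_max
    x = [float(i + 1) for i in range(n)]           # arbitrary start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]                # remove the constant eigenvector
        # Multiply by (shift*I - L); its dominant remaining eigenvector is u_2.
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    if x[0] < 0:                                   # fix the arbitrary sign
        x = [-xi for xi in x]
    # Rayleigh quotient of the unit vector x gives the eigenvalue.
    lam = sum(x[i] * sum(L[i][j] * x[j] for j in range(n)) for i in range(n))
    return lam, x

lam2, u2 = fiedler_pair(L)
separator = [i + 1 for i, v in enumerate(u2) if abs(v) < 1e-8]
part_a = [i + 1 for i, v in enumerate(u2) if v > 1e-8]
part_b = [i + 1 for i, v in enumerate(u2) if v < -1e-8]
print(lam2, u2, separator, part_a, part_b)
```

The vertex whose component of u_2 is zero forms the vertex cut, and the signs of the remaining components assign the other vertices to the two sides. The vertex numbers here follow the row order of the matrix L.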


Recursive graph partitioning

Coarsening, partitioning and refining phases

[Figure: a nine-vertex graph taken through the coarsening, partitioning and refining phases, resulting in Partition 1 and Partition 2]
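The coarsening phase can be sketched with a greedy matching that collapses pairs of adjacent vertices into single coarse vertices. This is a minimal illustration only: the matching rule (first unmatched neighbour in vertex order), the function name and the four-vertex test graph are assumptions, and production partitioners use more elaborate matching and weighting schemes.

```python
def coarsen(n, edges):
    """Collapse matched vertex pairs; return the vertex map and the coarse graph."""
    adjacency = {v: [] for v in range(n)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    coarse_of = [-1] * n          # coarse vertex id of every fine vertex
    next_id = 0
    for v in range(n):
        if coarse_of[v] != -1:
            continue              # already absorbed into a coarse vertex
        coarse_of[v] = next_id
        for w in adjacency[v]:
            if coarse_of[w] == -1:
                coarse_of[w] = next_id   # match v with its first free neighbour
                break
        next_id += 1

    # Keep one coarse edge per pair of distinct coarse vertices.
    coarse_edges = sorted(
        {(min(coarse_of[u], coarse_of[v]), max(coarse_of[u], coarse_of[v]))
         for u, v in edges if coarse_of[u] != coarse_of[v]}
    )
    return coarse_of, next_id, coarse_edges

# Hypothetical example: a path graph 0 - 1 - 2 - 3.
coarse_of, n_coarse, coarse_edges = coarsen(4, [(0, 1), (1, 2), (2, 3)])
print(coarse_of, n_coarse, coarse_edges)
```

Applying the function repeatedly yields a small graph that is cheap to partition; the partition is then projected back through the stored vertex maps and refined at each finer level.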










Distributed memory parallel architecture

• Cluster of high performance workstations

• Distributed memory workstation

• Dedicated I/O devices

• High level parallelism

• Feasible number of nodes: 16-1024


Recursive matrix partitioning

[Figure: two panels, "Geometric problem" and "Partitioning hierarchy"; the components of the hierarchy are numbered 1 to 7]


Distributed normal modes analysis

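Physical problem

$$
(K - \lambda M)\,\Phi = 0
$$

Partitioned form

$$
\begin{bmatrix}
K_{oo}^{1}-\lambda M_{oo}^{1} & & K_{ot}^{1,3}-\lambda M_{ot}^{1,3} & & & & \\
& K_{oo}^{2}-\lambda M_{oo}^{2} & K_{ot}^{2,3}-\lambda M_{ot}^{2,3} & & & & \\
& & K_{tt}^{3}-\lambda M_{tt}^{3} & & & & K_{tt}^{3,7}-\lambda M_{tt}^{3,7} \\
& & & K_{oo}^{4}-\lambda M_{oo}^{4} & & K_{ot}^{4,6}-\lambda M_{ot}^{4,6} & \\
& & & & K_{oo}^{5}-\lambda M_{oo}^{5} & K_{ot}^{5,6}-\lambda M_{ot}^{5,6} & \\
& & & & & K_{tt}^{6}-\lambda M_{tt}^{6} & K_{tt}^{6,7}-\lambda M_{tt}^{6,7} \\
\text{sym.} & & & & & & K_{tt}^{7}-\lambda M_{tt}^{7}
\end{bmatrix}
\begin{bmatrix}
\varphi_{o}^{1} \\ \varphi_{o}^{2} \\ \varphi_{t}^{3} \\ \varphi_{o}^{4} \\ \varphi_{o}^{5} \\ \varphi_{t}^{6} \\ \varphi_{t}^{7}
\end{bmatrix}
= 0
$$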


Phase 1

Start: Processor 1, Processor 2, Processor 3, Processor 4

Communicate

Phase 2

Start: Processors 1-2, Processors 3-4

Communicate

Phase 3

Start: Processors 1-2-3-4

Solve reduced order problem

$$
(\tilde{K} - \lambda \tilde{M})\,\tilde{\Phi} = 0
$$

Recover physical solution

[Equation: the solution vector $\Phi$ is recovered in three successive stages; each stage lists the components $\varphi$ of substructures 1 to 7, with subscripts q, o and t]
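The three phases amount to a tree-like schedule in which processor groups are merged pairwise until all processors cooperate on the final reduced problem. The sketch below only generates that grouping; pairing adjacent groups is an assumption taken from the Phase 1 to Phase 3 diagrams, and the function name is hypothetical. It performs no numerical work.

```python
def reduction_phases(processors):
    """Return the processor groups active in each phase of a pairwise merge."""
    groups = [[p] for p in processors]      # Phase 1: every processor works alone
    phases = [groups]
    while len(groups) > 1:
        # Merge neighbouring groups two at a time for the next phase.
        groups = [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]
        phases.append(groups)
    return phases

phases = reduction_phases([1, 2, 3, 4])
for number, groups in enumerate(phases, start=1):
    print(f"Phase {number}:", groups)
```

With four processors this gives three phases, and with 2^k processors it gives k + 1 phases, so the number of communication steps grows only logarithmically with the processor count.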










Shared memory parallel architecture

• Multi-core processors

• Shared cache

• Shared memory

• Low level parallelism

• Feasible number of cores: 2-16


Sparse factorization

[Figure: four panels showing Matrix connectivity, Reordering, Elimination tree and Factorization]
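The elimination tree records which column of the factor each column first updates, and it exposes the independent work available to parallel factorization. The sketch below computes it with the classical path-compression approach; the input format (a list of nonzero index pairs, numbered from 0) and the function name are assumptions, and the test pattern is the five-by-five example matrix used earlier in this presentation, without any reordering applied.

```python
def elimination_tree(n, entries):
    """Return parent[j] of every column j (-1 for a root) for a symmetric pattern."""
    # rows_above[j] holds the row indices i < j with a nonzero term in column j.
    rows_above = [[] for _ in range(n)]
    for i, j in entries:
        lo, hi = min(i, j), max(i, j)
        if lo != hi:
            rows_above[hi].append(lo)

    parent = [-1] * n
    ancestor = [-1] * n            # compressed path toward the current root
    for j in range(n):
        for i in sorted(rows_above[j]):
            r = i
            # Climb from i to the root of its subtree, compressing the path to j.
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:  # r was a root: column j becomes its parent
                ancestor[r] = j
                parent[r] = j
    return parent

# Off-diagonal pattern of the five-by-five example, numbered from 0.
pattern = [(0, 1), (0, 3), (1, 3), (2, 3), (2, 4), (3, 4)]
parent = elimination_tree(5, pattern)
print(parent)
```

Columns in different subtrees of the tree do not update each other and can be factorized concurrently.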


Multifrontal factorization

Sparsity pattern

Frontal steps

Front amalgamation


Symbolic reordering

Consecutive columns

Same sparsity pattern

Cache fitting size

Supernodal approach
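The idea of grouping consecutive columns with the same sparsity pattern can be sketched as follows. This is an illustration under assumptions, not the implementation discussed in the talk: the factor structure comes from a simple symbolic factorization, supernodes are detected with the plain "identical structure below the diagonal block" rule, and the optional `max_width` parameter is a hypothetical stand-in for the cache fitting size.

```python
def factor_structure(n, entries):
    """Row indices below the diagonal in each column of the Cholesky factor."""
    struct = [set() for _ in range(n)]
    for i, j in entries:
        lo, hi = min(i, j), max(i, j)
        if lo != hi:
            struct[lo].add(hi)
    for j in range(n):
        if struct[j]:
            parent = min(struct[j])                 # first column updated by j
            struct[parent] |= struct[j] - {parent}  # fill-in passed to the parent
    return struct

def supernodes(struct, max_width=None):
    """Group consecutive columns that share one sparsity pattern (n >= 1)."""
    groups = [[0]]
    for j in range(1, len(struct)):
        previous = groups[-1][-1]
        same_pattern = struct[previous] == (struct[j] | {j})
        fits = max_width is None or len(groups[-1]) < max_width
        if same_pattern and fits:
            groups[-1].append(j)
        else:
            groups.append([j])
    return groups

# Off-diagonal pattern of the five-by-five example, numbered from 0.
pattern = [(0, 1), (0, 3), (1, 3), (2, 3), (2, 4), (3, 4)]
struct = factor_structure(5, pattern)
groups = supernodes(struct)
narrow_groups = supernodes(struct, max_width=2)
print(groups, narrow_groups)
```

Each group can be stored and updated as one dense block, which is what allows the factorization to use cache-friendly dense kernels.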


Matrix update

Panel selection

Downstream columns

Different sparsity pattern

BLAS 2.5 operation










High performance workstation cluster

111 IBM P575 nodes, each with 4 dual-core 1.9 GHz POWER5 CPUs

3.5 Terabyte aggregate memory

100 Terabyte total disk space

IBM High Performance Switch (HPS)

8 GB/sec bidirectional bandwidth

AIX OS Version 5.3

Parallel Environment (PE) V4.2


Trimmed car body application

Shell element model

• 1.3 M grid points

• 1.2 M shell elements

• 7.9 M degrees of freedom

Normal modes analysis

• Frequency: 0 – 300 Hz

• ~1000 normal modes

• 512 partitions


Shortening solution time

[Chart: speed-up versus number of DMP processes]

Number of DMP processes    Speed-up
Serial                     1.0
1                          4.0
2                          7.8
4                          29.3
8                          49.2
16                         77.5
32                         96.5
64                         104.1
128                        105.9
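The speed-up values of this chart can be turned into a parallel efficiency figure. The sketch below is plain arithmetic on the chart data; the assignment of the values to the process counts follows the order in which they appear in the chart and is therefore an assumption, as is the choice of the one-process distributed run as the baseline.

```python
# Speed-up over the serial solution, in chart order (assumed mapping).
processes = [1, 2, 4, 8, 16, 32, 64, 128]
speedup_vs_serial = [4.0, 7.8, 29.3, 49.2, 77.5, 96.5, 104.1, 105.9]

# Efficiency relative to the run with one DMP process:
# (speed-up gained over that run) divided by the number of processes.
base = speedup_vs_serial[0]
efficiency = {p: s / base / p for p, s in zip(processes, speedup_vs_serial)}

for p in processes:
    print(f"{p:4d} processes: efficiency {efficiency[p]:.2f}")
```

On these numbers the efficiency stays near or above one up to 16 processes and falls off beyond 32, where the curve in the chart flattens.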


Increased fidelity of analysis

[Chart: normalized solution time and normalized number of modes versus frequency range]

Frequency range (Hz)    Solution time (normalized)    Number of modes (normalized)
0 - 100                 1.00                          1.00
0 - 200                 1.08                          2.41
0 - 300                 1.21                          4.67
0 - 400                 1.34                          7.44
0 - 500                 1.55                          10.93


Distributed memory workstation

HP Proliant DL320G5 server

64 dual core (1.85 GHz) Xeon CPUs

50GB local SATA disks per node

4 GB memory per node

GigE interconnect with HP MPI

Suse Linux Version 10.3


Automotive engine application

Solid element model

• 3.6 M grid points

• 2.3 M tetrahedral elements

• 10.8 M degrees of freedom

Normal modes analysis

• Frequency: 0 – 10,000 Hz

• ~250 normal modes

• 256 partitions


Shortening solution time

[Chart: speed-up versus number of DMP processes]

Number of DMP processes    Speed-up
Serial                     1.00
1                          4.00
2                          7.11
4                          12.47
8                          17.15
16                         25.78
32                         34.58
64                         49.27


Increased fidelity of analysis

[Chart: normalized solution time and normalized number of modes versus frequency range]

Frequency range (Hz)    Solution time (normalized)    Number of modes (normalized)
0 - 10,000              1.00                          1.00
0 - 20,000              1.25                          2.95
0 - 30,000              1.28                          5.61
0 - 40,000              1.32                          8.79
0 - 50,000              1.34                          12.57










Conclusions

Geometric domain decomposition technologies provide the basis for distributed solutions on modern hardware

Recursive computational solutions can support a wide range of engineering analyses with practically acceptable accuracy

Handling the local matrix operations on multi-core processors contributes to the overall performance gain

The performance advantages of distributed computational solutions are significant (speed-ups of roughly 50 to 100 over the serial solution in the two case studies) and greatly accelerate engineering work


Future work

Extending the distributed finite element technology to a grid computing environment

Overcoming the lack of a node-to-node communication mechanism with a high-speed network

Minimizing the need for a high-bandwidth connection between the local nodes and storage devices

Synchronizing the completion of components of similar computational complexity in a non-homogeneous grid environment


Thank you for your attention!

www.siemens.com

www.siemens.com/plm

www.siemens.com/plm/nxnastran

Siemens and the Siemens logo are registered trademarks of Siemens AG. NX is a registered trademark of Siemens PLM Software Inc. in the United States and in other countries.

NASTRAN is a registered trademark of the National Aeronautics and Space Administration.

SpaceShip One pictures by courtesy and permission of Quartus Engineering Inc.
