
CST – COMPUTER SIMULATION TECHNOLOGY | www.cst.com













High Performance Computing in Electromagnetic Simulations: Achieving Shorter Design Cycles

Dr. Klaus Krohne


Making Simulations Run Faster...

Algorithmic and hardware approaches are usually combined.

Algorithmic
Use the most efficient algorithm. This is easily done with CST STUDIO SUITE® as many numerical algorithms are available in one frontend.
Use "divide and conquer" strategies:
  Field sources
  Model assembly in CST DESIGN STUDIO®
See other webinars in this series!

Hardware
Use better (faster) hardware at the point where your bottleneck occurs.
Approach 1: Make single unit faster
Approach 2: Parallelization
[Diagram labels: Hardware unit, Waiting tasks, Finished tasks]


Hardware Based Acceleration Techniques

Multithreading
GPU Computing
MPI Computing
Distributed Computing


Benchmarks Dual Xeon/Quad Xeon

Very good scalability on the new Sandy Bridge EP (Xeon E5) processors!

Up to 48 CPU cores are supported on a single machine.

[Figure: Speedup of Solver Loop. Speedup (0 to 7) vs. number of CPU cores (0 to 24); series: Dual Xeon E5-2690, Quad Xeon E7-4807]

[Figure: Speedup of Matrix Computation. Speedup (0 to 10) vs. number of CPU cores (0 to 16); series: Dual Xeon E5-2690]


Multithreading Performance Limit

[Figure: Typical speedup (0 to 7) vs. number of threads (1, 2, 4, 8); series: Transient solver, I-Solver (direct)]

Measured on a system with dual Intel Xeon E5520 (2.27 GHz), 24 GB RAM, Windows Server 2003 R2

The bottleneck which limits the performance of the transient solver is the memory bandwidth of the system (i.e. how fast data can be copied from/to RAM).

[Diagram: four CPU cores connected to one memory controller]

Many CPU cores are competing for memory access.

Widen this bottleneck! GPU Computing






Supported GPU Devices

GPU Computing using the Nvidia Tesla GPU series is supported.

Recent Tesla 20 GPU devices:
Tesla C2050/C2075 (for workstations)
Tesla M2050/M2075/M2090 (passive cooling/typically integrated in rack mounted servers)
Tesla S2050 / NextIO vCore Express 2075 / NextIO vCore Express 2090 (external rack mounted device)


Supported GPU Devices

Series                 RAM           Cores/GPU   GPU clock   Memory bandwidth
Tesla 20 (2090)        6 GB          512         1.3 GHz     177 GB/s
Tesla 20 (2050/2075)   3 GB / 6 GB   448         1.15 GHz    148 GB/s
Tesla 10               4 GB          240         1.3 GHz     102 GB/s
Tesla 8                1.5 GB        128         1.35 GHz    76.8 GB/s

Some "magic numbers" ...

For comparison: Memory bandwidth for DDR3-1333 RAM (triple channel configuration) is 32 GB/s.

The acceleration of the transient solver is mainly a result of the very high memory bandwidth, which "widens" the memory bottleneck that limits the performance on the CPU.
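The "magic numbers" can be put side by side with a few lines of arithmetic. The sketch below only divides the GPU memory bandwidths from the table by the 32 GB/s quoted for triple-channel DDR3-1333; the function name `bandwidth_ratio` is illustrative. The ratio is a rough indication of how much wider the memory path is, not a predicted solver speedup.

```python
# Memory bandwidths in GB/s, as listed in the table of supported GPU devices.
GPU_BANDWIDTH_GBPS = {
    "Tesla 20 (2090)": 177.0,
    "Tesla 20 (2050/2075)": 148.0,
    "Tesla 10": 102.0,
    "Tesla 8": 76.8,
}
# DDR3-1333 in triple channel configuration, as quoted for comparison.
CPU_BANDWIDTH_GBPS = 32.0


def bandwidth_ratio(gpu_gbps: float, cpu_gbps: float = CPU_BANDWIDTH_GBPS) -> float:
    """Return how many times wider the GPU memory path is than the CPU one."""
    return gpu_gbps / cpu_gbps


if __name__ == "__main__":
    for name, gbps in GPU_BANDWIDTH_GBPS.items():
        print(f"{name}: {bandwidth_ratio(gbps):.1f}x the CPU memory bandwidth")
```

Real speedups also depend on model size and on the features used, so this ratio is better read as a ceiling for a purely bandwidth-bound loop than as a benchmark result.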


GPU Computing — Typical Performance

Not sure whether your models will benefit from GPU Computing? Send us a test model and we will run a benchmark with your model on different GPU hardware!

The features which need the largest amount of memory on the GPU are:
Dispersive materials
Lossy metal
Open boundaries

[Figure: Speedup of the solver loop vs. model size (number of mesh cells), relative to CPU performance; annotated regions: Small model, Swapping, Model too large]


Typical GPU System Configurations

Entry level
  Workstation with 1 GPU card
  Well tested configurations
  Available "off the shelf"
  Good acceleration for smaller models
  Limited model size (depends on available GPU memory and features used)

Professional level
  Workstation/server with multiple internal or external GPU cards
  Many configurations available
  Good acceleration for medium size and large models
  Limited model size (depends on available GPU memory and features used)

Enterprise level
  Cluster system with high-speed interconnect
  High flexibility: can handle extremely large models using MPI Computing and also a lot of parallel simulation tasks using Distributed Computing (DC)
  Administrative overhead
  Higher price

CST engineers are available to discuss with you which configuration makes sense for your applications and usage scenario.


GPU Computing for PIC

The Particle-In-Cell (PIC) solver supports GPU Computing on a Tesla 20 card.

[Figure: Total simulation time in hours (axis 0 to 14); bars: Total Time CPU, Total Time GPU (C2070)]

Speedup = 6.6

Benchmark model: slow wave structure, 400,000 mesh cells, 350,000 particles
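For reference, a speedup figure such as the 6.6 quoted for this benchmark is the ratio of the CPU run time to the GPU run time. The short sketch below shows that ratio and the fraction of run time it saves; the function names are illustrative and the 13.2 h / 2.0 h pair in the test is a made-up example, not a value read from the chart.

```python
def speedup(cpu_time_h: float, gpu_time_h: float) -> float:
    """Speedup is the reference (CPU) time divided by the accelerated time."""
    return cpu_time_h / gpu_time_h


def time_saved_fraction(speedup_factor: float) -> float:
    """Fraction of the original run time that is saved at a given speedup."""
    return 1.0 - 1.0 / speedup_factor


if __name__ == "__main__":
    # 6.6 is the speedup stated for the slow wave structure benchmark.
    print(f"time saved: {time_saved_fraction(6.6):.1%}")
```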


GPU Computing - New Features

All features of the transient solver (except "subgridding" in some configurations and the "ferrite" material model) are available for GPU Computing.
The TLM solver supports GPU Computing on Tesla 20 GPUs.
The PIC solver of CST PARTICLE STUDIO® is supported on Tesla 20 GPUs (single GPU only).
Up to 8 GPUs per host system are supported.


GPU Computing Conclusion

GPU Computing is a very efficient method to accelerate the transient solver and the PIC solver by widening the memory bottleneck.

The maximum model size which can be handled depends on the features used in your model (mainly material models).

There are currently many GPU accelerated systems offered by hardware vendors. Please ask us if you are not sure which hardware configuration is the best one for your use case.






MPI Computing - Area of Application

MPI Computing is a way to handle very large models efficiently.

Some application examples for MPI Computing:
Electrically very large structures (e.g. RCS calculation, lightning strike)
Extremely complex structures (e.g. SI simulation for a full package)


MPI Computing — Working Principle

Based on a domain decomposition of the simulation domain.

Each cluster computer works on its part of the domain.

Automatic load balancing ensures an equal distribution of the workload.

It works cross-platform on Windows and Linux systems.
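The idea of a load-balanced domain decomposition can be illustrated with a minimal sketch. It is not CST's algorithm: it assumes a one-dimensional split of mesh planes into contiguous slabs, and the per-node `weights` (relative node speeds) are hypothetical inputs.

```python
from typing import List, Tuple


def decompose(n_planes: int, weights: List[float]) -> List[Tuple[int, int]]:
    """Split n_planes mesh planes into contiguous slabs, one per node.

    Slab sizes are proportional to the node weights, so a node with twice
    the weight receives about twice as many planes. Returns half-open
    (start, stop) index ranges. Very small weights can yield empty slabs.
    """
    total = sum(weights)
    bounds = [0]
    cumulative = 0.0
    for w in weights[:-1]:
        cumulative += w
        bounds.append(round(n_planes * cumulative / total))
    bounds.append(n_planes)  # the last slab always ends at the domain edge
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]


if __name__ == "__main__":
    # Four equally fast nodes share 100 mesh planes.
    print(decompose(100, [1, 1, 1, 1]))
```

In a real solver the neighbouring slabs also exchange boundary field values every time step, which is why the interconnect speed matters.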

Domain decomposition is shown in mesh view.

[Diagram: CST STUDIO SUITE® Frontend connects to MPI Client Nodes (Node 1 to Node 4), which are linked by a high speed/low latency interconnection network (optional)]


MPI Computing - Performance

CST STUDIO SUITE® offers native support for high speed/low latency networks.

A GPU accelerated cluster system requires a high-speed network in order to perform well!

[Diagram: CST STUDIO SUITE Frontend connected to an MPI Cluster System with GPU Hardware and Cluster Interconnect]

[Figures: Speedup per Timestep vs. number of nodes (1 to 4), shown in three panels; series: CPU / IB (QDR), CPU / GbE (1 Gb/s), 2xTesla 10 / IB (QDR), 2xTesla 10 / GbE (1 Gb/s)]





Cluster nodes: Dual Intel Xeon E5530, 2.4 GHz


MPI Computing Conclusion

MPI Computing allows you to use the resources of a computer cluster as a single "supercomputer".
Sometimes the only approach to solve very large models efficiently.
Allows overcoming the memory limitation of the GPU hardware.
If you are using a GPU accelerated cluster, a high-speed/low-latency interconnection network (Infiniband) is a "must-have".






CST Simulation Acceleration: Distributed Computing


DC — Working Principle

Users send simulation jobs to the DC Main Controller.
The DC Main Controller selects solver servers for the jobs and sends the simulation tasks to them. It manages a simple FIFO queue.

[Diagram: CST STUDIO SUITE® Frontend connects to DC Main Controller, which connects to DC Solver Servers]
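The FIFO behaviour described above can be pictured with a toy model. This is a sketch only, not the DC Main Controller's implementation: the class `MainController`, its methods, and the server and job names are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class MainController:
    """Toy controller that hands queued jobs to idle solver servers."""

    def __init__(self, servers: List[str]) -> None:
        self.idle: Deque[str] = deque(servers)  # servers waiting for work
        self.queue: Deque[str] = deque()        # jobs in arrival order (FIFO)
        self.running: Dict[str, str] = {}       # server name -> job name

    def submit(self, job: str) -> None:
        """Append a job to the queue and start it if a server is free."""
        self.queue.append(job)
        self._dispatch()

    def finish(self, server: str) -> str:
        """Mark the server's job as done and make the server available again."""
        job = self.running.pop(server)
        self.idle.append(server)
        self._dispatch()
        return job

    def _dispatch(self) -> None:
        # Oldest job first, for as long as an idle server exists.
        while self.queue and self.idle:
            self.running[self.idle.popleft()] = self.queue.popleft()
```

With two servers and three submitted jobs, the third job waits in the queue until one of the servers calls `finish`.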



As soon as a solver server has finished its work the results are automatically transferred back to the frontend.
Results are saved by the Main Controller (MC) if the frontend is not connected.






Additional constraints can be defined, which allows a "fine grained" DC Solver Server selection.

If you have very complex policies or requirements such as:
priorities for users/jobs,
hardware is not dedicated to CST,
both MPI and DC jobs must run on your cluster,
you may now integrate the DC system into your favorite queuing system.

[Diagram labels for the DC Solver Servers: 1 GPU, 24 GB RAM, 24 GB RAM, 48 GB RAM, 96 GB RAM]
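A constraint-based server selection can be sketched as a simple filter. The data classes, field names, and the pairing of the 1 GPU with a 24 GB machine are assumptions made for illustration; the actual DC constraint options are not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SolverServer:
    name: str
    ram_gb: int
    gpus: int = 0


@dataclass
class JobConstraints:
    min_ram_gb: int = 0
    min_gpus: int = 0


def select_server(servers: List[SolverServer],
                  need: JobConstraints) -> Optional[SolverServer]:
    """Return the first server that satisfies every constraint, or None."""
    for server in servers:
        if server.ram_gb >= need.min_ram_gb and server.gpus >= need.min_gpus:
            return server
    return None


if __name__ == "__main__":
    # Hypothetical pool loosely modelled on the RAM sizes in the diagram.
    pool = [
        SolverServer("a", ram_gb=24, gpus=1),
        SolverServer("b", ram_gb=24),
        SolverServer("c", ram_gb=48),
        SolverServer("d", ram_gb=96),
    ]
    print(select_server(pool, JobConstraints(min_ram_gb=40)))
```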


DC - Benefits

Very good utilization of computational resources
Very efficient parallelization strategy for independent tasks
Watch the progress of your distributed jobs from the CST Frontend
Easy way to share computational resources in a multi user environment
The automatic update system keeps track of distributing updates to the whole cluster
The system is cross platform


DC - Main Controller

The DC Main Controller gives you a complete overview of what is happening on your cluster.

[Screenshot panels: Job Status, Machine Status]


DC Security

Only trusted machines can register at the DC Main Controller.
The remote configuration feature can be protected by a password.


Model has 16 ports.
Only 8 ports need to be computed if symmetry conditions are defined.
Distribute the 8 simulation runs to different solver servers with GPU acceleration.
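For independent runs like these 8 port excitations, an idealized speedup can be estimated by counting how many rounds of simulations are needed. The sketch below assumes that all runs take equally long and ignores data transfer and queuing overhead; `ideal_dc_speedup` is an illustrative name.

```python
import math


def ideal_dc_speedup(n_tasks: int, n_servers: int) -> float:
    """Ideal speedup for equally long, independent tasks executed in rounds.

    With n_servers machines, the tasks finish after ceil(n_tasks / n_servers)
    rounds instead of n_tasks sequential runs.
    """
    rounds = math.ceil(n_tasks / n_servers)
    return n_tasks / rounds


if __name__ == "__main__":
    for servers in (1, 2, 4, 8):
        print(servers, "servers:", ideal_dc_speedup(8, servers))
```

Server counts that do not divide the number of tasks waste part of the last round: 3 servers need 3 rounds for 8 tasks, so the ideal speedup is 8/3 rather than 3.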


DC Simulation Time Improvement

[Figure: Speedup (total time), axis 0 to 30, vs. number of DC Solver Servers (1, 2, 4, 8); series: CPU, 1 GPU (Tesla 20)]

Dual Intel Xeon X5675 CPUs (3.06 GHz), fastest memory configuration, 1 Tesla 20 GPU per node, 1 Gb Ethernet interconnect, 40 million mesh cells


DC Conclusion

Efficient approach to parallelizing the execution of independent simulation tasks (e.g. port excitations, parameter sweeps).
The DC system can organize the jobs of a multi-user environment. It is in fact a queuing system for CST simulations.
You can watch the progress of your jobs and get intermediate results with the CST STUDIO SUITE® frontend.
The DC system can take job constraints (e.g. GPU requirements) into account when selecting solver servers for a job.


A Solution for Challenging Problems

Combined MPI Computing and GPU Computing

System: 8 compute nodes with dual Intel Xeon E5530, 2.4 GHz, Infiniband (QDR, 40 Gb/s), 2 Tesla 10 GPUs per node

Model was provided by "Institut für Theorie Elektromagnetischer Felder" (www.temf.de)

1 billion mesh cells

[Figure: Magnetic field (absolute values)]


IBM Benchmark
Application ID 326

Large multilayer electronic circuit and package benchmark offered by IBM to different 3D EM simulation tool vendors
Interest in delay time, cross-talk and S-parameters
Model available within CADENCE

[Figure labels: Full model, Layer stackup, Small portion of model]


[Diagram: Import of the CADENCE Allegro file into CST MICROWAVE STUDIO® (CST MWS); labelled paths: via native import, ACIS SAT format, STL format (triangulated surface description, 15 million triangles)]


Good agreement with the measured results provided by IBM:
Measured delay time: 0.12 ns
Simulated delay time: 0.118 ns

[Figure: Time signals; labels: Input, Output, Cross talk]

S-parameters for 0 to 20 GHz in one go!
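The agreement between the measured (0.12 ns) and simulated (0.118 ns) delay times can be expressed as a relative deviation. The helper below is a plain arithmetic sketch; its name is illustrative.

```python
def relative_deviation(simulated: float, measured: float) -> float:
    """Absolute deviation of the simulated value, relative to the measurement."""
    return abs(simulated - measured) / abs(measured)


if __name__ == "__main__":
    # Delay times in ns, as quoted for the IBM benchmark.
    print(f"{relative_deviation(0.118, 0.12):.1%}")  # prints "1.7%"
```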



Double Negative Materials
Application ID 294

[Figure: E-field for material permittivity and permeability smaller than 1; labels: Source, Vacuum, Material, Vacuum]

[Figure: E-field for material permittivity and permeability equal to 0; labels: Source, Vacuum, Material, Vacuum]

[Figure: E-field for material permittivity and permeability equal to -1; labels: Source, Vacuum, Material, Vacuum]


MRI Simulation

7 Tesla body coil
8 individual channels to optimize homogeneous field distribution
Full 3D simulation model including human model and RF shield of gradient coil

[Figure label: fine detail]


Ch 1

Ch 2

8 individually tuned solutions

Coil combination optimisation


Tuning the matching circuits

All 8 coils can be tuned to 297.2 MHz using the tuning tool in CST DS.

[Figure: S-parameters in dB vs. frequency in MHz; curves: reflection coefficients, coupling]


Optimised B1+ distribution in CST MWS

Coil Combination

Homogeneous fields desired in this region.


Simulation vs. Measurement

[Figure: Simulated phase shims, |B1+|, in µT, and measured actual flip angle distribution in degrees (scale 0 to 180). Panel labels: Male, 1.74 m, 70 kg; Female, 1.6 m, 58 kg; Male, 1.85 m, 95 kg; Female, 1.65 m, 64 kg; Male, 1.83 m, 82 kg; Male, 1.86 m, 100 kg. Reported statistics: Mean 60°, SD 41%; Mean 66°, SD 33%; Mean 80°, SD 28%]


SAR Calculation

Point SAR 10g averaged SAR

10g SAR limit: 10 W/kg


Thermal Bioheat Evaluation

Power scaled according to max. SAR10g = 10 W/kg

Temperature increase: ~1°C

Start temperature Temperature after 10 min
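Scaling the power to a SAR limit follows from SAR being proportional to the input power. The sketch below computes the scale factor for a 10 W/kg limit; the function name and the 4 W/kg example value are hypothetical, not results from this study.

```python
def power_scale_factor(simulated_max_sar10g: float, sar_limit: float = 10.0) -> float:
    """Factor by which to scale input power so the peak 10g SAR meets the limit.

    SAR is proportional to input power, so the factor is a plain ratio.
    Both SAR values are in W/kg.
    """
    if simulated_max_sar10g <= 0:
        raise ValueError("simulated SAR must be positive")
    return sar_limit / simulated_max_sar10g


if __name__ == "__main__":
    # Hypothetical case: the simulation at 1 W input gives a peak SAR10g of 4 W/kg.
    factor = power_scale_factor(4.0)
    print(f"scale input power by {factor}")  # 2.5, i.e. 2.5 W input
```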