CST – COMPUTER SIMULATION TECHNOLOGY | www.cst.com
High Performance Computing in
Electromagnetic Simulations –
Achieving Shorter Design Cycles
Dr. Klaus Krohne
Making Simulations Run Faster...
Algorithmic and hardware measures are usually combined.
Hardware: Use better (faster) hardware at the point where your bottleneck occurs.
Approach 1: Make a single hardware unit faster
Approach 2: Parallelization
[Diagram: waiting tasks pass through hardware units and become finished tasks.]
Algorithmic: Use the most efficient algorithm. This is easily done with CST STUDIO SUITE®, as many numerical algorithms are available in one frontend.
Use "divide and conquer" strategies: field sources, model assembly in CST DESIGN STUDIO®.
See other webinars in this series!
Hardware Based Acceleration Techniques
MPI Computing Distributed Computing
GPU Computing Multithreading
Benchmarks Dual Xeon/Quad Xeon
Very good scalability on the new Sandy Bridge EP (Xeon E5) processors!
Up to 48 CPU cores are supported on a single machine.
[Figure: Speedup of Solver Loop. Speedup vs. number of CPU cores (0 to 24) for Dual Xeon E5-2690 and Quad Xeon E7-4807.]
[Figure: Speedup of Matrix Computation. Speedup vs. number of CPU cores (0 to 16) for Dual Xeon E5-2690.]
Multithreading Performance Limit
[Figure: Typical speedup vs. number of threads (1, 2, 4, 8) for the transient solver and the I-Solver (direct).]
Measured on a system with dual Intel Xeon E5520 (2.27 GHz), 24 GB RAM, Windows Server 2003 R2
The bottleneck which limits the performance of the transient solver is the memory bandwidth of the system (i.e. how fast data can be copied from/to RAM).
[Diagram: four CPU cores connected to one memory controller.]
Many CPU cores are competing for memory access.
Widen this bottleneck with GPU Computing!
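The saturation described on this slide can be illustrated with a toy model. The sketch below assumes a purely memory-bound loop in which every thread streams data at a fixed rate until the memory controller is saturated. The function name and both bandwidth values are hypothetical and chosen only for illustration; they are not measurements from the benchmark system.

```python
def bandwidth_limited_speedup(threads, per_thread_bw_gbs, system_bw_gbs):
    """Speedup over one thread for a purely memory-bound loop (toy model)."""
    # All threads together can never draw more than the memory controller delivers.
    achieved = min(threads * per_thread_bw_gbs, system_bw_gbs)
    return achieved / per_thread_bw_gbs


# Hypothetical values, for illustration only.
PER_THREAD_BW = 8.0   # GB/s a single core can consume
SYSTEM_BW = 32.0      # GB/s the memory controller can deliver

speedups = {n: bandwidth_limited_speedup(n, PER_THREAD_BW, SYSTEM_BW)
            for n in (1, 2, 4, 8)}
for n, s in speedups.items():
    print(f"{n} threads: speedup {s:.1f}")
```

In this model the speedup stops growing once the threads saturate the memory controller, which is the qualitative behavior the slide attributes to the transient solver.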
Hardware Based Acceleration Techniques
GPU Computing Multithreading
MPI Computing Distributed Computing
Supported GPU Devices
GPU Computing using the Nvidia Tesla GPU series is supported.
Recent Tesla 20 GPU devices:
Tesla S2050 / NextIO vCore Express 2075 / NextIO vCore Express 2090 (external rack mounted devices)
Tesla C2050/C2075 (for workstations)
Tesla M2050/M2075/M2090 (passive cooling, typically integrated in rack mounted servers)
Supported GPU Devices
Series                RAM          Cores/GPU  GPU clock  Memory bandwidth
Tesla 20 (2090)       6 GB         512        1.3 GHz    177 GB/s
Tesla 20 (2050/2075)  3 GB / 6 GB  448        1.15 GHz   148 GB/s
Tesla 10              4 GB         240        1.3 GHz    102 GB/s
Tesla 8               1.5 GB       128        1.35 GHz   76.8 GB/s
Some "magic numbers" ...
For comparison: Memory bandwidth for DDR3-1333 RAM (triple channel configuration) is 32 GB/s.
The acceleration of the transient solver is mainly a result of the very high memory bandwidth, which "widens" the memory bottleneck that limits the performance on the CPU.
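To put the "magic numbers" in relation, the short sketch below divides each GPU memory bandwidth from the table by the 32 GB/s quoted for DDR3-1333 RAM. The variable and function names are illustrative. The resulting ratio is only a rough indicator for a bandwidth-bound solver loop, not a measured speedup.

```python
# Memory bandwidth in GB/s, copied from the table on this slide.
GPU_BANDWIDTH = {
    "Tesla 20 (2090)": 177.0,
    "Tesla 20 (2050/2075)": 148.0,
    "Tesla 10": 102.0,
    "Tesla 8": 76.8,
}
CPU_BANDWIDTH = 32.0  # DDR3-1333 RAM, triple channel configuration


def bandwidth_ratio(gpu_bw, cpu_bw=CPU_BANDWIDTH):
    """How many times more memory bandwidth the GPU offers than the CPU."""
    return gpu_bw / cpu_bw


ratios = {name: bandwidth_ratio(bw) for name, bw in GPU_BANDWIDTH.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the CPU memory bandwidth")
```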
GPU Computing — Typical Performance
Not sure whether your models will benefit from GPU Computing? Send us a test model and we
will run a benchmark with your model on different GPU hardware!
The features which
need the largest
amount of memory on
the GPU are:
Dispersive materials
Lossy metal
Open boundaries
[Figure: Speedup of the solver loop vs. model size (number of mesh cells), compared with CPU performance. Annotated regions: small model; model too large (swapping).]
Typical GPU System Configurations
Entry level
Workstation with 1 GPU card
Well tested configurations
Available "off the shelf"
Good acceleration for
smaller models
Limited model size
(depends on available GPU
memory and features used)
Professional level
Workstation/server with
multiple internal or
external GPU cards
Many configurations available
Good acceleration for medium
size and large models
Limited model size
(depends on available GPU
memory and features used)
Enterprise level
Cluster system with high-
speed interconnect.
High flexibility: Can
handle extremely large
models using MPI
Computing and also a lot
of parallel simulation
tasks using Distributed
Computing (DC)
Administrative overhead
Higher price
CST engineers are available to discuss with you which configuration makes sense for your applications and usage scenario.
GPU Computing for PIC
The Particle-In-Cell (PIC) solver supports GPU Computing
on a Tesla 20 card.
[Figure: Total simulation time in hours, CPU vs. GPU (C2070).]
Speedup = 6.6
Benchmark model:
Slow wave structure
400,000 mesh cells
350,000 particles
GPU Computing - New Features
All features of the transient solver (except
"subgridding" in some configurations and the "ferrite"
material model) are available for GPU Computing
The TLM solver supports GPU Computing on Tesla
20 GPUs
The PIC solver of CST PARTICLE STUDIO® is supported
on Tesla 20 GPUs (single GPU only)
Up to 8 GPUs per host system are supported
GPU Computing Conclusion
GPU Computing is a very efficient method to
accelerate the transient solver and the PIC solver by
widening the memory bottleneck.
The maximum model size which can be handled
depends on the features used in your model (mainly
material models).
There are currently many GPU accelerated systems
offered by hardware vendors. Please ask us if you are
not sure which hardware configuration is the best one
for your use case.
Hardware Based Acceleration Techniques
MPI Computing Distributed Computing
GPU Computing Multithreading
MPI Computing - Area of Application
MPI Computing is a way to handle very large models efficiently.
Some application examples for MPI Computing:
Electrically very large structures (e.g. RCS calculation, lightning strike)
Extremely complex structures (e.g. SI simulation for a full package)
MPI Computing — Working Principle
Based on a domain decomposition of the simulation domain.
Each cluster computer works on its part of the domain.
Automatic load balancing ensures an equal distribution of the workload.
It works cross-platform on Windows and Linux systems.
Domain decomposition is shown in mesh view (Node 1 to Node 4).
[Diagram: the CST STUDIO SUITE® frontend connects to the MPI client nodes, which are linked by an optional high speed/low latency interconnection network.]
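The idea of domain decomposition with balanced workload can be sketched in a few lines. The example below is a minimal one-dimensional illustration, assuming the mesh is cut into contiguous slabs of layers and that the workload of a node is proportional to its number of mesh cells. The function `split_into_slabs` and its greedy strategy are hypothetical; the slides do not describe how CST STUDIO SUITE® actually partitions the domain or balances the load.

```python
def split_into_slabs(cells_per_layer, num_nodes):
    """Assign contiguous mesh layers to nodes with near-equal cell counts.

    Returns one (start, end) layer index range per node, end exclusive.
    """
    if not 1 <= num_nodes <= len(cells_per_layer):
        raise ValueError("need between 1 and len(cells_per_layer) nodes")
    total = sum(cells_per_layer)
    slabs, start, running = [], 0, 0
    for i, cells in enumerate(cells_per_layer):
        running += cells
        nodes_left = num_nodes - len(slabs)          # nodes still without a slab
        layers_left = len(cells_per_layer) - (i + 1)
        target = total * (len(slabs) + 1) / num_nodes
        must_cut = layers_left == nodes_left - 1     # keep one layer per remaining node
        if nodes_left > 1 and (running >= target or must_cut):
            slabs.append((start, i + 1))
            start = i + 1
    slabs.append((start, len(cells_per_layer)))      # last node takes the rest
    return slabs


layers = [5, 5, 10, 20, 20, 10, 5, 5]                # mesh cells per layer (made up)
slabs = split_into_slabs(layers, 2)
loads = [sum(layers[a:b]) for a, b in slabs]
print(slabs, loads)
```

With these made-up layer sizes both nodes receive 40 cells each. A real solver would also have to account for the field data exchanged across slab boundaries in every time step, which is why the interconnect matters.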
MPI Computing - Performance
CST STUDIO SUITE® offers native support for high speed/low latency networks.
A GPU accelerated cluster system requires a high-speed network in order to perform well!
[Diagram: CST STUDIO SUITE® frontend connected to an MPI cluster system with GPU hardware and cluster interconnect.]
[Figure: Speedup per timestep vs. number of nodes (1 to 4) for CPU, 2xTesla 10 and 2xTesla 20 nodes, each with IB (QDR) and GbE (1 Gb/s) interconnect.]
Cluster nodes: Dual Intel Xeon E5530, 2.4 GHz
MPI Computing Conclusion
MPI Computing allows you to use the resources of a
computer cluster as a single "supercomputer"
It is sometimes the only approach to solve very large
models efficiently
It allows overcoming the memory limitation of the GPU
hardware
If you are using a GPU accelerated cluster, a high-speed/low-latency
interconnection network (InfiniBand) is a "must-have"
Hardware Based Acceleration Techniques
MPI Computing Distributed Computing
GPU Computing Multithreading
CST Simulation Acceleration:
Distributed Computing
DC — Working Principle
Users send simulation jobs to
the DC Main Controller.
The DC Main Controller
selects solver servers for the
jobs and sends the
simulation tasks to them.
It manages a simple FIFO
queue.
[Diagram: the CST STUDIO SUITE® frontend connects to the DC Main Controller, which connects to the DC Solver Servers.]
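The working principle above (jobs arrive, wait in a FIFO queue, and are handed to free solver servers) can be sketched as follows. This is a toy model, not the DC Main Controller's implementation: the class `MainController`, its methods and the server and job names are all hypothetical.

```python
from collections import deque


class MainController:
    """Toy FIFO dispatcher: jobs wait in arrival order for a free solver server."""

    def __init__(self, servers):
        self.idle = deque(servers)   # solver servers ready for work
        self.queue = deque()         # waiting jobs, oldest first
        self.running = {}            # server name -> job currently running

    def submit(self, job):
        self.queue.append(job)
        self._dispatch()

    def finish(self, server):
        """Mark the server's job as done and hand out the next waiting job."""
        job = self.running.pop(server)
        self.idle.append(server)
        self._dispatch()
        return job

    def _dispatch(self):
        # Pair the oldest waiting job with a free server while both exist.
        while self.queue and self.idle:
            server = self.idle.popleft()
            self.running[server] = self.queue.popleft()


mc = MainController(["server-a", "server-b"])
for job in ["job-1", "job-2", "job-3"]:
    mc.submit(job)
print(mc.running, list(mc.queue))
```

With two servers and three jobs, the third job stays in the queue until one of the servers reports that it has finished.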
DC — Working Principle
As soon as a solver server has finished its work, the results are automatically transferred back to the frontend.
Results are saved by the Main Controller (MC) if the frontend is not connected.
DC — Working Principle
Additional constraints can be defined, which allow a "fine grained" DC Solver Server selection.
If you have very complex policies or requirements such as:
priorities for users/jobs,
hardware is not dedicated to CST,
both MPI and DC jobs must run on your cluster,
you may now integrate the DC system into your favorite queuing system.
[Diagram: the CST STUDIO SUITE® frontend and DC Solver Servers labeled 1 GPU, 24 GB RAM, 24 GB RAM, 48 GB RAM, 96 GB RAM.]
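A constraint-based server selection of the kind mentioned on this slide might look like the sketch below. The `SolverServer` type, the `select_server` function and the server list are hypothetical; the RAM sizes loosely follow the labels in the diagram, and assigning the GPU to the first machine is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class SolverServer:
    name: str
    ram_gb: int
    gpus: int = 0


def select_server(servers, min_ram_gb=0, min_gpus=0):
    """Return the first server that satisfies the job constraints, or None."""
    for server in servers:
        if server.ram_gb >= min_ram_gb and server.gpus >= min_gpus:
            return server
    return None


# Hypothetical cluster, loosely modeled on the diagram labels.
servers = [
    SolverServer("node-1", ram_gb=24, gpus=1),
    SolverServer("node-2", ram_gb=24),
    SolverServer("node-3", ram_gb=48),
    SolverServer("node-4", ram_gb=96),
]

big_job = select_server(servers, min_ram_gb=40)
gpu_job = select_server(servers, min_gpus=1)
impossible = select_server(servers, min_ram_gb=48, min_gpus=1)
print(big_job, gpu_job, impossible)
```

A job that cannot be placed (here one needing both 48 GB RAM and a GPU) would have to wait or be rejected; which of the two happens in the real DC system is not stated in the slides.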
DC - Benefits
Very good utilization of computational resources
Very efficient parallelization strategy for independent tasks
Watch the progress of your distributed jobs from the CST frontend
Easy way to share computational resources in a multi-user environment
The automatic update system takes care of distributing updates to the whole cluster
The system is cross-platform
DC - Main Controller
The DC Main Controller gives you a complete overview of what is happening on your cluster.
Job Status
Machine Status
DC Security
Only trusted machines can
register at the DC Main
Controller
The remote configuration
feature can be protected
by a password
Model has 16 ports
Only 8 ports need to be computed if symmetry conditions are defined
Distribute the 8 simulation runs to different solver servers with GPU acceleration
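The benefit of distributing the 8 port excitations can be estimated with simple arithmetic. The sketch below assumes that all runs take equally long, that they are fully independent, and that data transfer and queuing overhead are negligible; under these assumptions the runs execute in "rounds" and the speedup is the number of runs divided by the number of rounds. The function name is illustrative.

```python
import math


def ideal_dc_speedup(num_tasks, num_servers):
    """Ideal speedup when equally long, independent tasks run in rounds."""
    rounds = math.ceil(num_tasks / num_servers)
    return num_tasks / rounds


PORT_RUNS = 8  # 16 ports reduced to 8 by symmetry conditions
speedups = {n: ideal_dc_speedup(PORT_RUNS, n) for n in (1, 2, 3, 4, 8)}
for n, s in speedups.items():
    print(f"{n} solver servers: ideal speedup {s:.2f}")
```

Note that 3 servers give no more than 8/3 in this model because the last round is only partly filled, so server counts that divide the number of runs evenly use the hardware best.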
DC Simulation Time Improvement
[Figure: Speedup (total time) vs. number of DC Solver Servers (1, 2, 4, 8) for CPU and 1 GPU (Tesla 20).]
Dual Intel Xeon X5675 CPUs (3.06 GHz), fastest memory configuration, 1 Tesla 20 GPU
per node, 1 Gb Ethernet interconnect, 40 million mesh cells
DC Conclusion
Efficient approach to parallelizing the execution of
independent simulation tasks (e.g. port excitations,
parameter sweeps)
The DC system can organize the jobs of a multi-user
environment. It is in fact a queuing system for CST simulations
You can watch the progress of your jobs and get intermediate
results with the CST STUDIO SUITE® frontend
The DC system can take job constraints (e.g. GPU
requirements) into account when selecting solver servers
for a job
A Solution for Challenging Problems
Combined MPI Computing and GPU Computing
System: 8 compute nodes with dual Intel Xeon E5530, 2.4 GHz, InfiniBand (QDR, 40 Gb/s),
2 Tesla 10 GPUs per node
Model was provided by "Institut für Theorie Elektromagnetischer Felder" (www.temf.de)
1 billion mesh cells
[Figure: magnetic field (absolute values).]
IBM Benchmark Application ID 326
Large multilayer electronic circuit and package benchmark offered by IBM to different 3D EM simulation tool vendors
Interest in delay time, cross-talk and S-parameters
Model available within CADENCE
[Figures: full model, layer stackup, small portion of model.]
IBM Benchmark Application ID 326
[Diagram: import paths from the Allegro file (CADENCE) to CST MWS: SAT format (ACIS) and STL format, a triangulated surface description (15 million Δ's); or via native import from CADENCE Allegro to CST MICROWAVE STUDIO®.]
IBM Benchmark Application ID 326
Good agreement with the measured results provided by IBM:
Measured delay time: 0.12 ns
Simulated delay time: 0.118 ns
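The agreement between the two delay times quoted on this slide can be expressed as a relative deviation. The short calculation below uses only the two values from the slide; the variable names are illustrative.

```python
measured_ns = 0.12    # measured delay time provided by IBM
simulated_ns = 0.118  # simulated delay time

# Deviation of the simulation from the measurement, relative to the measurement.
relative_deviation = abs(simulated_ns - measured_ns) / measured_ns
print(f"Relative deviation: {relative_deviation:.1%}")  # prints "Relative deviation: 1.7%"
```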
IBM Benchmark Application ID 326
Time Signals
[Figure: input, output and cross-talk time signals.]
IBM Benchmark Application ID 326
S-Parameters for 0 to 20 GHz in One Go!
Double Negative Materials Application ID 294
E-Field for material permittivity and permeability smaller than 1
[Figure labels: Source, Vacuum, Material, Vacuum.]
Double Negative Materials Application ID 294
E-Field for material permittivity and permeability equal to 0
[Figure labels: Source, Vacuum, Material, Vacuum.]
Double Negative Materials Application ID 294
E-Field for material permittivity and permeability equal to -1
[Figure labels: Source, Vacuum, Material, Vacuum.]
MRI Simulation
7 Tesla body coil
8 individual channels to optimize homogeneous field distribution
Full 3D simulation model including human model and RF shield of gradient coil
[Figure label: fine detail.]
Ch 1
Ch 2
8 individually tuned solutions
Coil combination optimisation
Tuning the matching circuits
All 8 coils can be tuned to 297.2 MHz using the tuning tool in CST DS.
[Figure: S-parameters in dB vs. frequency in MHz, showing reflection coefficients and coupling.]
Optimised B1+ distribution in CST MWS
Coil Combination
Homogeneous fields desired in
this region.
Simulation vs. Measurement
Simulated phase shims, |B1+|, in µT
Measured actual flip angle distribution in degrees (color scale 0 to 180)
Subjects: Male, 1.74 m, 70 kg; Female, 1.6 m, 58 kg; Male, 1.85 m, 95 kg; Female, 1.65 m, 64 kg; Male, 1.83 m, 82 kg; Male, 1.86 m, 100 kg
Flip angle statistics shown: Mean 60°, SD 41%; Mean 66°, SD 33%; Mean 80°, SD 28%
SAR Calculation
[Figures: point SAR and 10 g averaged SAR.]
10 g SAR limit: 10 W/kg
Thermal Bioheat Evaluation
Power scaled according to max. SAR10g = 10 W/kg
Temperature increase: ~1°C
[Figures: start temperature and temperature after 10 min.]