Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

Jeremy Sweezy, Scientist

Monte Carlo Methods, Codes and Applications Group

3/28/2018

LA-UR-18-XXXX

Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport

GTC 2018

What is Monte Carlo Particle Transport?

3/23/18 | Los Alamos National Laboratory

– Follows the path of individual particles through a system
– Uses pseudo-random numbers to sample processes
– Randomly samples physical and non-physical processes
– Attributed to Stanislaw Ulam and Enrico Fermi
– Named because Ulam had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo”
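As a minimal sketch of how pseudo-random numbers are used to sample a physical process, the distance to the next collision in a homogeneous medium can be sampled by inverting the exponential free-path distribution. The function name, the seed, and the cross-section value are illustrative assumptions and are not taken from MCNP, MCATK, or MonteRay.

```python
import math
import random


def sample_distance_to_collision(sigma_t, rng):
    """Sample a free-flight distance (cm) for total cross section sigma_t (1/cm).

    Inverts the CDF of the density p(s) = sigma_t * exp(-sigma_t * s).
    """
    xi = rng.random()  # uniform in [0, 1)
    return -math.log(1.0 - xi) / sigma_t  # 1 - xi lies in (0, 1], so the log is finite


rng = random.Random(12345)  # fixed seed so the run is reproducible
sigma_t = 0.5  # hypothetical total macroscopic cross section, 1/cm
samples = [sample_distance_to_collision(sigma_t, rng) for _ in range(100_000)]
mean_free_path = sum(samples) / len(samples)
print(mean_free_path)  # should be close to 1 / sigma_t = 2.0 cm
```

The sample mean approaches the analytic mean free path 1/sigma_t as the number of samples grows, which is the basic idea behind following many individual particle histories.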

FERMIAC

Porting to Specialized Hardware is Prohibitively Expensive

– The world’s production Monte Carlo codes have decades of development
– LANL’s MCNP code has been in development since 1977
– Equally extensive amount of V&V effort
– Codes have to run on desktop machines and supercomputers
– DOE HPC platforms have been in a state of flux for the last 10 years
  • Cell Broadband Engine
  • Intel Xeon Phi (MIC)
  • GPUs
  • ARM???

Barrier #1: Limited Resources (Money, People, Time)

Monte Carlo Random Walk on GPU Hardware has reached a Performance Wall

• At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport
• All report results against different numbers of CPUs
• All get the same results!
• Almost all are extremely simplified
• Production codes will likely have worse performance.
• What are the limitations?
– Conditional branching
– Random data access
– No small, computationally intensive kernel to accelerate

Barrier #2: Performance of random walk on GPUs

[Chart labels: 4.5x, 3.0x]

How do You Define Performance?

• A computer scientist might measure performance as an increase in speed:

$P = \frac{T_{CPU}}{T_{GPU}}$

• A Monte Carlo specialist would measure performance as a balance between speed and statistical variance using a Figure-of-Merit (FOM):

$FOM = \frac{\sigma_{CPU}^2 \, T_{CPU}}{\sigma_{GPU}^2 \, T_{GPU}}$

Example: $FOM = \frac{0.1^2 \times 1\ \mathrm{min}}{0.05^2 \times 2\ \mathrm{min}} = 2$

To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
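The difference between the two performance measures can be checked with a few lines of Python. This is only a sketch using the numbers from the slide's example (relative errors of 0.1 and 0.05, run times of 1 and 2 minutes); the function names are illustrative.

```python
import math


def speedup(t_cpu, t_gpu):
    """Pure speed ratio P = T_CPU / T_GPU."""
    return t_cpu / t_gpu


def relative_fom(sigma_cpu, t_cpu, sigma_gpu, t_gpu):
    """Figure-of-merit ratio: (sigma_CPU^2 * T_CPU) / (sigma_GPU^2 * T_GPU)."""
    return (sigma_cpu**2 * t_cpu) / (sigma_gpu**2 * t_gpu)


# Slide example: the GPU run takes twice as long but halves the uncertainty.
spd = speedup(t_cpu=1.0, t_gpu=2.0)
fom = relative_fom(sigma_cpu=0.1, t_cpu=1.0, sigma_gpu=0.05, t_gpu=2.0)
print(spd)  # 0.5
print(math.isclose(fom, 2.0))  # True
```

In this example the run is slower by the speed measure (0.5) yet twice as good by the figure-of-merit, which is why an estimator with lower variance can pay off even when it costs more time.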

Next Event Estimator

• The next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction

• Typically used for image tallies

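[Figure: ray from an event through Cell 1 and Cell 2 to the Image Plane; labels: A, B, μ]

$S(R, E) = \frac{w}{2\pi R^2} \times \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T} \, p_i(\mu, E \to E') \, \exp\left(-\int_0^R \Sigma_T(s, E') \, ds\right)$

[Equation annotation: Ray-cast]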

One to two orders of magnitude faster on GPU hardware
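The next-event score can be sketched in Python under simplifying assumptions: a single reaction channel (so the sum over i and the ratio of cross sections are dropped), and a total cross section that is constant within each cell, so the attenuation integral becomes a sum over the segments the ray crosses. All names and numbers here are hypothetical and do not reflect the MonteRay API.

```python
import math


def optical_depth(segments):
    """Sum sigma_t * length over the (sigma_t, length) segments crossed by the ray."""
    return sum(sigma_t * length for sigma_t, length in segments)


def next_event_score(weight, distance, p_mu, segments):
    """Score of one source or collision event at a point `distance` cm away.

    p_mu is the probability density (per unit mu) of emission toward the point.
    """
    geometric = weight * p_mu / (2.0 * math.pi * distance**2)
    return geometric * math.exp(-optical_depth(segments))


# Hypothetical ray: 2 cm through Cell 1 (0.5 /cm), then 1 cm through Cell 2 (1.0 /cm).
segments = [(0.5, 2.0), (1.0, 1.0)]
# p_mu = 0.5 corresponds to isotropic emission over mu in [-1, 1].
score = next_event_score(weight=1.0, distance=3.0, p_mu=0.5, segments=segments)
print(score)
```

Each event needs one such ray per detector point, so an image tally with thousands of pixels generates a large number of independent ray casts per event.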

Traditional Track-Length Estimator

• The standard Monte Carlo fluence estimator
• Uses the sampled distance in each cell as fluence estimator
• Only contributes to cells through which the particle passes
• Easy to compute
• Nothing to accelerate on GPU
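A minimal sketch of the track-length estimator, assuming the usual form in which the fluence in a cell is the weighted track length summed over all tracks in that cell, divided by the cell volume and the number of source histories. The data layout and values are hypothetical.

```python
def track_length_fluence(tracks, volumes, n_histories):
    """Return fluence per source particle for each cell.

    tracks:  iterable of (cell_id, weight, length_cm), one per track segment
    volumes: dict mapping cell_id to cell volume in cm^3
    """
    tally = {cell: 0.0 for cell in volumes}
    for cell, weight, length in tracks:
        tally[cell] += weight * length  # only cells the particle crossed are scored
    return {cell: tally[cell] / (volumes[cell] * n_histories) for cell in volumes}


volumes = {1: 10.0, 2: 20.0, 3: 5.0}
tracks = [(1, 1.0, 2.0), (2, 1.0, 4.0), (1, 0.5, 2.0)]  # no track enters cell 3
fluence = track_length_fluence(tracks, volumes, n_histories=2)
print(fluence)
```

Cell 3 receives no score because no particle passed through it, and the per-track work is a single multiply-add, which leaves little for a GPU to accelerate.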

[Figure: particle track through Cell 1, Cell 2, and Cell 3; label: B]

Computing has changed, we need to change our algorithms too!

Volumetric-Ray-Casting Estimator

• For use in place of the traditional track-length estimator on GPU
• Multiple pseudo-rays are generated at each source and collision event
• Computationally intensive estimator with lower variance

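[Figure: pseudo-rays cast through Cell 1, Cell 2, and Cell 3; label: B]

$F(i, E') = w \, \frac{1 - \exp\left(-\Sigma_{T,i}(E') \, l_i\right)}{N \, \Sigma_{T,i}(E')} \, \exp\left(-\int_0^{|\mathbf{r}' - \mathbf{r}|} \Sigma_T(\mathbf{r} + \mathbf{\Omega}' s', E') \, ds'\right)$

[Equation annotation: Ray-cast]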
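One reading of the estimator above is that each pseudo-ray scores, in every cell it crosses, the expected track length in that cell multiplied by the probability of reaching the cell uncollided. The following Python sketch implements that reading for a single ray, assuming a piecewise-constant total cross section; the function name, data layout, and numbers are hypothetical and are not the MonteRay implementation.

```python
import math


def ray_cast_contributions(weight, n_rays, segments):
    """Expected track length scored in each cell along one pseudo-ray.

    segments: list of (cell_id, sigma_t, length) in the order the ray crosses them.
    Returns a list of (cell_id, contribution).
    """
    contributions = []
    tau = 0.0  # optical depth from the event site to the entry of the current cell
    for cell, sigma_t, length in segments:
        reach = math.exp(-tau)  # probability of reaching this cell uncollided
        if sigma_t > 0.0:
            expected = (1.0 - math.exp(-sigma_t * length)) / sigma_t
        else:
            expected = length  # void limit of the expression above
        contributions.append((cell, weight / n_rays * expected * reach))
        tau += sigma_t * length
    return contributions


# Hypothetical ray: Cell 1 (0.5 /cm, 2 cm), Cell 2 (1.0 /cm, 1 cm), Cell 3 (void, 3 cm).
segments = [(1, 0.5, 2.0), (2, 1.0, 1.0), (3, 0.0, 3.0)]
contribs = dict(ray_cast_contributions(weight=1.0, n_rays=8, segments=segments))
print(contribs)
```

Unlike the track-length estimator, every cell along the ray receives a score, including cells the sampled particle would never have reached, which is where the lower variance and the extra arithmetic both come from.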

A neutron dance for a neutron fan. P.M. Dawn

MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing

• MonteRay – A library for accelerating Monte Carlo tallies with GPU
• Random walk is maintained on CPU
• Ray casting based tallies are calculated on the GPU
– Next-Event estimator
– Volumetric-Ray-Casting estimator, a new estimator designed for GPUs
– Supports neutron and photon tallies
• Can be incorporated into new and legacy Monte Carlo codes
• Uses continuous energy cross-section data
• Single precision ray casting
• Single precision attenuation cross-sections
• Double precision tallies

Reduces cost of accelerating an existing Monte Carlo code with GPUs

MonteRay - Testing

• Tests use:
– GeForce GTX TitanX GPU with NVIDIA Maxwell architecture
– 2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz), with 10 cores each
• MonteRay linked with LANL’s C++ Monte Carlo code MCATK
• MCATK uses MPI parallelism, building shared ray buffers using MPI-3 shared memory
• 3-D Cartesian Structured Mesh Geometry
• 2 tests measured performance of the Next-event estimator
• 4 tests measured the performance of the Volumetric-ray-casting estimator
• Volumetric-ray-casting estimator performance on GPU compared to the Track-length estimator performance on the CPU
• Base performance measured as compared to 8 CPU cores

Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests

MonteRay – Medical X-Ray Imaging Simulation

• 50-keV X-ray beam
• 0.12-mm spot size
• Radiograph used Next-Event Estimator
• Simulation useful for designing a collimator to minimize scattered contribution

MonteRay – Medical X-Ray Imaging Simulation

• Source and Collided contribution calculated separately

• Source contribution relatively easy to calculate

• Collided contribution important for collimator design

• Collided performance 15-18x

[Chart labels: 14.5x, 15.3x]

MonteRay – Industrial Radiography

• Simulated a physical test object used at Los Alamos’ Dual Axis Radiographic Hydrodynamic Test Facility

• Used 4-MeV mono-energetic X-ray beam
• 100 x 100 image grid (10,000 estimators) to simulate image detector
• Calculation of scatter component needed to design collimators and experiment, but too computationally expensive

I'm a peeping-tom techie with x-ray eyes – Patrick Lee MacDonald

MonteRay – Industrial Radiography

[Figure: GPU Performance vs Number of CPU Cores. Axes: Relative Performance vs. Number of CPU Cores / GPU. Series: Source, Collided. Chart labels: 28.5x, 24.2x]

Collided calculation performance 15-32x!

Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware

Cancer Treatment Simulation

• 2-MeV photon beam (peak of 6-MV medical accelerator photon spectrum)
• 1-cm beam radius

[Figure labels: Tumor; 2-MeV Photon Beam]

What is the dose to healthy tissue?

GPU Performance vs 8 CPU Cores

14x performance improvement in healthy tissue

Cancer Treatment Simulation

GPU Performance vs Number of CPU Cores in Healthy Tissue

Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores

[Chart labels: 14.3x, 10.2x]

Pressurized Water Reactor Assembly Simulation

• 16x16 Fuel Assembly
• Performance 7.5x in the Control Rods, 5x in the fuel, and 4.5x in the coolant

[Figure: GPU Performance vs 8 CPU Cores; labels: Control Rod, Fuel Pin]

Pressurized Water Reactor Assembly Simulation

GPU Performance vs Number of CPU Cores

Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel

[Chart labels: 7.2x, 5.4x, 6.0x, 4.4x]

Criticality Accident Simulation

• Critical Uranium sphere in the corner of a concrete room
• Concrete floor, walls, ceiling, and 4 concrete pillars

[Figure: GPU Performance vs 8 CPU Cores; label: Uranium Sphere]

Performance increase of 14-16x in the center of the room

Criticality Accident Simulation – Smoother Fluence Estimate

[Figure panels: Track-Length Estimator | Volumetric-Ray-Casting Estimator]

Criticality Accident Simulation

GPU Performance vs Number of CPU Cores

Things are going great, and they’re only getting better – Patrick Lee MacDonald

[Chart labels: 15x, 10.5x]

Reflected Godiva Criticality Experiment Simulation

• U-235 sphere reflected by water
• Performance Improvement
– 2.5x in the core
– 1.0x in the water

GPU Performance vs 8 CPU Cores

Reflected Godiva Criticality Experiment Simulation

• Variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material.

[Figure: Variance Ratio vs Num. Collisions. Axes: Variance Ratio ($\sigma^2_{TL} / \sigma^2_{VRC}$) vs. Number of Samples per Collision (N)]

[Figure: GPU Performance vs. Num. CPU Cores. Chart labels: 2.2x, 2.2x]

Performance is limited by the estimator variance, not the GPU speed

Conclusions

• MonteRay provides a low-cost method of providing GPU-accelerated Monte Carlo particle transport
– Can be incorporated into legacy codes at low cost
– Works with standard variance reduction methods
• Performance improvements of MonteRay are significant:
– Up to 32 times for the Next-event estimator as compared to 8 CPU cores
– Up to 14 times for the Volumetric-ray-casting estimator as compared to the Track-Length estimator on 8 CPU cores

MonteRay provides a method of breaking through the barriers of limited resources and limited performance

Questions?

Jeremy Sweezy

jsweezy@lanl.gov

Extra

Uncertainty - Pressurized Water Reactor Assembly Simulation

[Figure panels]
Volumetric-Ray-Casting Estimator: 600 sec., 8 CPU cores and 1 GPU; 93 cycles; 40,000 particles/cycle; 8 rays/collision
Track-Length Estimator: 600 sec., 8 CPU cores; 124 cycles; 40,000 particles/cycle
