Earthquake Simulations with AWP-ODC on Titan, Blue Waters and Keeneland
Yifeng Cui, San Diego Supercomputer Center
Efecan Poyraz, Jun Zhou, Dongju Choi, Heming Xu, Kyle Withers, Scott Callaghan, Po Chen, Zheqiang Shi, Kim Olsen, Steven Day, Philip Maechling, Thomas Jordan and the SCEC CME Collaboration
NVIDIA Technology Theater @ SC13
November 21, 2013
HPGeoC
Supported by
Dr. Heming Xu, Dr. Yifeng Cui, Sheau-Yen Chen
Jun Zhu, Efecan Poyraz, Amit Chourasia, Dr. Daniel Roten
Ian Zhang
FEMA 336 Report (2000)
About 50% of the national seismic risk is in Southern California
U.S. Seismic Risk Map
San Andreas Fault System
Creeping Section
Pacific plate is moving NW relative to North America at 5 meters per 100 years
1906 San Francisco Earthquake, M 7.8
1812 Earthquake M 7.5
1680 Earthquake M 7.4 1857 Fort Tejon Earthquake, M 7.9
Open interval 104 years Open interval 153 years
Open interval 198 years
Open interval 330 years
The entire southern San Andreas fault is "locked and loaded."
New paleoseismic data reduce the mean recurrence interval for the Carrizo section from 260 yr to < 140 yr.
(Source: T. Jordan, SCEC)
Area-Magnitude and Slip-Magnitude Scaling, San Andreas Fault System
log10 A ~ Mw
log10 D ~ ½ Mw
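A quick numeric reading of these scaling relations (a sketch; the proportionality constants are illustrative and not from the slide): each unit increase in moment magnitude multiplies rupture area by 10 and average slip by √10.

```python
def area_factor(d_mw: float) -> float:
    """Factor by which rupture area grows, given log10 A ~ Mw."""
    return 10.0 ** d_mw

def slip_factor(d_mw: float) -> float:
    """Factor by which average slip grows, given log10 D ~ 0.5 * Mw."""
    return 10.0 ** (0.5 * d_mw)

# One magnitude unit: area x10, slip x~3.162
print(area_factor(1.0))   # 10.0
print(slip_factor(1.0))   # ~3.162
```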
Frequency-Magnitude Scaling, San Andreas Fault System
M8 “outer scale”
UCERF2 (Field et al., 2008)
Natural Frequency of Buildings
Building Height | Typical Natural Period
2 story         | 0.2 seconds
5 story         | 0.5 seconds
10 story        | 1.0 second
20 story        | 2.0 seconds
30 story        | 3.0 seconds
50 story        | 5.0 seconds
Tall buildings tend to have a lower natural frequency than shorter buildings
f = (1 / 2π) √(K / M)
f − natural frequency in Hertz
K − the stiffness of the building for a specific mode
M − the mass of the building associated with that mode
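As a quick check of the formula (the stiffness and mass values below are hypothetical, chosen so that K/M = (2π)², which matches the ~10-story case in the table with a 1-second period):

```python
import math

def natural_frequency(k: float, m: float) -> float:
    """f = (1/2π)·sqrt(K/M): natural frequency in Hz for stiffness K (N/m) and mass M (kg)."""
    return math.sqrt(k / m) / (2.0 * math.pi)

# Hypothetical building: K/M = (2π)² s⁻², i.e. a ~10-story structure
f = natural_frequency(k=4.0 * math.pi**2 * 1.0e6, m=1.0e6)
print(f)  # ≈ 1.0 Hz, i.e. a natural period of ≈ 1.0 s
```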
Wenchuan Earthquake, 2008
Deterministic earthquake wave propagation simulations do not yet approach the frequencies of interest to building engineers for most common buildings.
Computational Requirements
[Chart: computational requirements (mesh points × time steps) for TeraShake, ShakeOut, M8 2-Hz, and M8 10-Hz, spanning ~10^13 to ~10^21, run on SDSC DataStar (2004), TACC Ranger (2007), and OLCF Jaguar (2010)]
4D ratio: 7 × 10^16
M8 Earthquake Simulation
• 443-billion elements, using 16,640 Titan GPUs
• Small-scale fault geometry and media complexity
• Dynamic rupture propagation along a rough fault embedded in a 3D velocity structure
[Chart: speedup and sustained Tflop/s vs. number of GPUs / XE6 cores (2 to 20,000); curves for ideal speedup, NCCS Titan speedup, NCCS Titan XK7 FLOPS, and Blue Waters XK7 FLOPS; peak sustained 2.3 Pflop/s]
Ground Motion Up to 10-Hz on BW/Titan
SCEC Computational Pathways
[Diagram: four computational pathways feeding a Probabilistic Seismic Hazard Model]
1. Standard seismic hazard analysis (RWG, AWP-ODC): Earthquake Rupture Forecast → Attenuation Relationship → Intensity Measures (empirical models)
2. Ground-motion simulation (AWP-ODC, Hercules): KFR → AWP → NSR → Ground Motions (physics-based simulations)
3. Dynamic rupture modeling (SORD, AWP-ODC; hybrid MPI/CUDA): "Extended" Earthquake Rupture Forecast → Structural Representation → DFR → AWP
4. Ground-motion inverse problem (AWP-ODC, SPECFEM3D): invert with other data (geology, geodesy) → improvement of models
AWP = Anelastic Wave Propagation; NSR = Nonlinear Site Response; KFR = Kinematic Fault Rupture; DFR = Dynamic Fault Rupture
[Diagram: sources 1-3 and a receiver. M sources to N receivers requires M forward simulations; with reciprocity, only 3N simulations]
• A physics-based seismic hazard model requires more than a few earthquake simulations
– Standard "forward" simulation computing 3-component seismograms from M sources at N sites requires M simulations (M > 10^5, N < 10^3)
– Strain-Green-tensor-based "reciprocal" simulation computing 3-component seismograms for M sources at N sites requires only 3N simulations
– Use of reciprocity reduces CPU time by a factor of ~2,000
U_n(r, r_s) = G_nj,i(r, r_s) M_ji
U_n(r, r_s) = H(r_s, r) M_ji
H(r_s, r) = [G_jn,i(r_s, r) + G_in,j(r_s, r)] / 2
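The simulation-count arithmetic behind the reciprocity bullet (the slide's ~2,000× CPU-time saving also folds in per-run cost differences; the raw counts alone give the flavor):

```python
def forward_runs(num_sources: int) -> int:
    # Standard forward approach: one wave-propagation run per source
    return num_sources

def reciprocal_runs(num_sites: int) -> int:
    # Reciprocal approach: three strain-Green-tensor runs (one per component) per site
    return 3 * num_sites

M, N = 10**5, 10**3          # M sources, N sites, as on the slide
print(forward_runs(M))       # 100000 runs
print(reciprocal_runs(N))    # 3000 runs
```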
P(IM_k) = Σ_n P(IM_k | S_n) P(S_n)
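A minimal numeric sketch of this total-probability sum over rupture sources S_n (the probabilities below are toy values, not CyberShake outputs):

```python
def hazard(p_im_given_s, p_s):
    """P(IM_k) = Σ_n P(IM_k | S_n) · P(S_n) over rupture sources S_n."""
    assert abs(sum(p_s) - 1.0) < 1e-9  # source probabilities form a distribution
    return sum(pi * ps for pi, ps in zip(p_im_given_s, p_s))

# Toy example: three sources with exceedance probabilities 0.2, 0.05, 0.5
p = hazard([0.2, 0.05, 0.5], [0.5, 0.3, 0.2])
print(p)  # 0.2*0.5 + 0.05*0.3 + 0.5*0.2 ≈ 0.215
```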
Intensity Measures
Attenuation Relationship
Earthquake Rupture Forecast
Probabilistic Seismic Hazard Analysis
CyberShake hazard map PoE = 2% in 50 yrs
CyberShake seismogram
• 1,144 sites in LA region, f < 0.5 Hz, 2013
– Produced four alternative seismic hazard maps for southern California
– 7.1 million CPU hours (28-day run using Blue Waters and Stampede)
– 189 million jobs
– 165 TB of total output data
– 10.6 TB of stored data
CyberShake Hazard Model
• 5,000 sites, f < 1.0 Hz, 2014-2015
– Produce seismic hazard maps for all of California
– 723 million core-hours
– 4.2 billion jobs
– 56 PB of total output data
– 3.0 PB of stored data
LA Region Map (v13.4) and State-wide Map
CyberShake SGT Simulations on XK7 vs XE6

CyberShake 1.0 Hz                | XE6   | XK7   | XK7 (CPU-GPU co-scheduling)
Nodes                            | 400   | 400   | 400
SGT hours per site               | 10.36 | 2.80  | 2.80
Post-processing hours per site** | 0.94  | 1.88  | 2.00
Total hours per site             | 11.30 | 4.68  | 2.80
Total SUs (millions)*            | 723 M | 299 M | 179 M
SU savings (millions)            |       | 424 M | 543 M

* Scaled to 5,000 sites based on two strain Green tensor runs per site; ** based on CyberShake 13.4 map

3.7× speedup
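The "3.7× speedup" annotation is the ratio of SGT hours per site on XE6 CPUs vs. XK7 GPUs; end-to-end, co-scheduling (which overlaps post-processing with the GPU SGT runs) does even better:

```python
# Values taken directly from the table above
xe6_sgt_hrs, xk7_sgt_hrs = 10.36, 2.80   # SGT hours per site
xe6_total, cosched_total = 11.30, 2.80   # total hours per site

sgt_speedup = xe6_sgt_hrs / xk7_sgt_hrs
end_to_end = xe6_total / cosched_total
print(round(sgt_speedup, 1))  # 3.7
print(round(end_to_end, 1))   # 4.0
```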
AWP-ODC
• Started as personal research code (Olsen 1994)
• 3D velocity-stress wave equations solved by explicit staggered-grid 4th-order FD
• Memory-variable formulation of inelastic relaxation using a coarse-grained representation (Day 1998)
• Dynamic rupture by the staggered-grid split-node (SGSN) method (Dalguer and Day 2007)
• Absorbing boundary conditions by perfectly matched layers (PML) (Marcinkovich and Olsen 2003) and Cerjan et al. (1985)
∂_t v = (1/ρ) ∇·σ
∂_t σ = λ(∇·v) I + μ(∇v + ∇vᵀ)
τ_i dς_i(t)/dt + ς_i(t) = λ_i (δM/M_u) ε(t)
σ(t) = M_u [ ε(t) − Σ_{i=1..N} ς_i(t) ]
Q⁻¹(ω) ≈ (δM/M_u) Σ_{i=1..N} λ_i ω τ_i / (ω² τ_i² + 1)
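A direct evaluation of the Q⁻¹(ω) approximation above (the weights λ_i and relaxation times τ_i here are illustrative, not AWP-ODC's calibrated values; each term is a Debye peak that maximizes at ωτ_i = 1):

```python
def q_inverse(omega, dM_over_Mu, lam, tau):
    """Q⁻¹(ω) ≈ (δM/M_u) Σ_i λ_i ω τ_i / (ω² τ_i² + 1)."""
    return dM_over_Mu * sum(
        l * omega * t / (omega**2 * t**2 + 1.0) for l, t in zip(lam, tau)
    )

# Single relaxation mechanism: λ = 1, τ = 1 s, δM/M_u = 1, evaluated at the peak ω = 1/τ
print(q_inverse(1.0, 1.0, [1.0], [1.0]))  # 0.5
```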
[Figure: relaxation-time distribution assigned across 8-point unit cells of the staggered grid, with velocity and stress components]
Inelastic relaxation variables for memory-variable ODEs in AWP-ODC
GPU Code: Decomposition on CPU and GPU
Two-layer 3D domain decomposition on CPU-GPU:
• X & Y decomposition for CPUs
• Y & Z decomposition for GPU SMs
Single-GPU Optimizations
Global memory optimization:
• global memory coalescing
• texture memory for 3D constant variables
• constant memory for scalar constants
Using L1/L2 cache rather than shared memory
Communication Reduction
• Velocity is input to the stress computation, so velocity must be communicated; extend the ghost cell region with two extra layers and compute, rather than communicate, the ghost cell updates before the stress computation.
• The 2D XY plane represents the 3D sub-domain, as no communication in the Z direction is required due to the 2D decomposition for GPUs.
[Diagram: velocity before computation → velocity after computation → velocity after communication → stress after computation]
Stress is then input to compute the next time step's velocity, ∂_t v = (1/ρ) ∇·σ
GPU-GPU Communication

Communication                  | Velocity (freq., message size) | Stress (freq., message size)
Before communication reduction | 4, 6(nx+ny)·NZ                 | 4, 12(nx+ny)·NZ
After communication reduction  | 4, 12(nx+ny+4)·NZ              | no communication
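Plugging the table's expressions into numbers (a hypothetical sub-domain with nx = ny = NZ = 100; 4 exchanges per entry, as in the frequency column) shows the net traffic drops even though the velocity messages grow:

```python
def volume_before(nx, ny, nz):
    # 4 velocity exchanges of 6(nx+ny)·NZ plus 4 stress exchanges of 12(nx+ny)·NZ
    return 4 * 6 * (nx + ny) * nz + 4 * 12 * (nx + ny) * nz

def volume_after(nx, ny, nz):
    # 4 enlarged velocity exchanges of 12(nx+ny+4)·NZ; stress needs no communication
    return 4 * 12 * (nx + ny + 4) * nz

nx = ny = nz = 100
print(volume_before(nx, ny, nz))  # 1440000 words per time step
print(volume_after(nx, ny, nz))   # 979200 words per time step
```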
Computing and Communication Overlapping
XK7 Nodes used | Elements (1000s) | Wall Clock Time | Parallel Efficiency
8,192          | 429,496,729      | 0.1085          | 100%
16,384         | 858,993,459      | 0.1159          | 93.2%
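For weak scaling, the problem size grows with the node count, so parallel efficiency is the reference wall-clock time divided by the wall-clock time at the larger scale. With the (rounded) timings in the table this comes out to ~93.6%, consistent with the table's 93.2% figure, which was presumably computed from unrounded timings:

```python
def weak_scaling_efficiency(t_ref: float, t_n: float) -> float:
    """Weak-scaling efficiency: reference wall-clock time over scaled-up wall-clock time."""
    return t_ref / t_n

eff = weak_scaling_efficiency(0.1085, 0.1159)  # timings from the table
print(f"{eff:.1%}")  # ~93.6%
```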
Two-phase I/O Model
• Parallel I/O
– Read and redistribute multiple-terabyte inputs
– Contiguous block reads by a reduced number of readers
– High-bandwidth asynchronous point-to-point communication for redistribution
[Diagram: cores reading contiguous blocks of a shared file]
• Aggregate and write
[Diagram: a temporal aggregator buffers time steps 1..N and writes stripe-size chunks to the OSTs]
MPI-IO
• Aggregate and write
– Temporal aggregation buffers
– Contiguous writes
– Throughput
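A minimal serial sketch of the temporal-aggregation idea behind the two-phase model (class and names are mine, not from the AWP-ODC code): instead of issuing one small write per time step, an aggregator buffers several steps locally and emits one large contiguous write.

```python
import io

class TemporalAggregator:
    """Buffer N time steps of output, then flush them as one contiguous write."""
    def __init__(self, sink, steps_per_flush):
        self.sink = sink                      # any writable binary stream
        self.steps_per_flush = steps_per_flush
        self.buffer = []
        self.flushes = 0                      # count of large writes issued

    def write_step(self, data: bytes):
        self.buffer.append(data)
        if len(self.buffer) >= self.steps_per_flush:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(b"".join(self.buffer))  # one large contiguous write
            self.buffer.clear()
            self.flushes += 1

sink = io.BytesIO()
agg = TemporalAggregator(sink, steps_per_flush=10)
for step in range(100):
    agg.write_step(b"\x00" * 64)  # 64-byte record per time step
agg.flush()                       # drain any remaining buffered steps
print(agg.flushes, len(sink.getvalue()))  # 10 large writes instead of 100 small ones; 6400 bytes
```

The real code does this across MPI ranks with designated aggregator processes; the buffering-then-contiguous-write pattern is the same.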
ADIOS
• ADIOS checkpointing
– Effective I/O by separating metadata (external XML file) and an API library
[Diagram: time steps 1..N flow through a temporal aggregator and a spatial aggregator into Files 1-3 on OST1-OST3]
Joint ADIOS work with S. Klasky, N. Podhorszki and Q. Liu of ORNL
AWP-ODC Weak Scaling
[Chart: TFLOPS vs. number of nodes (2 to 20,000) for ideal, AWPg/XK7, AWPg/HP SL250, and AWPc/XE6; annotations: 94% efficiency and 100% efficiency]
CPUs/GPUs Co-scheduling

aprun -n 50 <GPU executable> <arguments> &
get the PID of the GPU job
cybershake_coscheduling.py:
    build all the CyberShake input files
    divide up the nodes and work among a customizable number of jobs
    for each job:
        fork extract_sgt.py cores --> performs pre-processing and launches
            "aprun -n <cores per job> -N 15 -r 1 <cpu executable A> &"
        get the PID of the CPU job
    while executable A jobs are running:
        check PIDs to see if a job has completed
        if completed:
            launch "aprun -n <cores per job> -N 15 -r 1 <cpu executable B> &"
    while executable B jobs are running:
        check for completion
    check for GPU job completion
– CPUs run reciprocity-based seismogram and intensity computations
– Run multiple MPI jobs on compute nodes using Node Managers (MOM)
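The co-scheduling loop boils down to: launch the GPU job in the background, launch CPU stage A jobs, poll their PIDs, start stage B as each A job finishes, then verify everything completed. A runnable sketch with placeholder workloads standing in for the aprun commands (the commands and job counts here are illustrative, not the production script):

```python
import subprocess
import sys
import time

def launch(cmd):
    # Background launch, like "aprun ... &"; returns a handle we can poll later
    return subprocess.Popen(cmd)

# Placeholders for the GPU executable and the CPU stage-A jobs
gpu_job = launch([sys.executable, "-c", "import time; time.sleep(0.2)"])
stage_a = [launch([sys.executable, "-c", "pass"]) for _ in range(2)]

stage_b = []
while stage_a:                        # poll PIDs; as each A job finishes, start its B job
    for job in list(stage_a):
        if job.poll() is not None:    # poll() returns None while the job is running
            stage_a.remove(job)
            stage_b.append(launch([sys.executable, "-c", "pass"]))
    time.sleep(0.01)

for job in stage_b:                   # wait for all stage-B jobs
    job.wait()
gpu_job.wait()                        # finally, check the GPU job
print("all jobs done")
```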
Post-processing on CPUs: API for Pthreads
• AWP-API lets individual pthreads make use of CPUs for post-processing:
– Vmag, SGT, seismograms
– Statistics (real-time performance measuring)
– Adaptive/interactive control tools
– Visualization
– Output writing is introduced as a pthread that uses the API
[Flowchart: the main thread initializes the simulation and modules, then starts computation on the GPU. At specified time steps it copies velocity data and signals the modules on other XK7 CPUs; each module calculates SGTs when a signal is received and writes out via MPI-IO when it is time. The loop repeats while more time steps remain, then finalizes.]
Accelerating CyberShake Calculations on GPUs (USC)
[Hazard curves: probability rate (1/yr), 10^-5 to 10^-1, vs. 3s SA (g), 10^-2 to 10^0]
CyberShake as a Platform for Operational Earthquake Forecasting
Operational Forecast – Harvard Curves
Operational Forecast – NSHMP
Operational Forecast – After 2009 Bombay Beach
Operational Forecast – After 2004 Parkfield
10-Hz Visualization
Collaborators
Carl Ponder, Cyril Zeller, Stanley Posey and Roy Kim (NVIDIA); Jeffrey Vetter, Mitch Horton, Graham Lopez, Richard Glassbrook (NICS/ORNL); Matthew Norman and Jack Wells (ORNL); Bruce Loftis (NICS); D.K. Panda, Sreeram Potluri and D.K.'s team (OSU); Gregory Bauer, Jay Alameda, Omar Padron (NCSA); Robert Fiedler (Cray); Scott Baden and Didem Unat (UCSD); Liwen Shih (UH)
Computing Resources
NCSA Blue Waters, OLCF Titan, XSEDE Keeneland, NVIDIA GPU donation to HPGeoC/SDSC
NSF Grants
NCSA NEIS-P2/PRAC OCI-0832698, XSEDE ECCS, PRAC, SI2-SSI (OCI-1148493), Geoinformatics (EAR-1226343), NSF/USGS SCEC4 Core (EAR-0529922 and 07HQAG0008)
Acknowledgements