Earthquake Simulations with AWP-ODC on Titan, Blue Waters and Keeneland
Yifeng Cui, San Diego Supercomputer Center
Efecan Poyraz, Jun Zhou, Dongju Choi, Heming Xu, Kyle Withers, Scott Callaghan, Po Chen, Zheqiang Shi, Kim Olsen, Steven Day, Philip Maechling, Thomas Jordan and the SCEC CME Collaboration
NVIDIA Technology Theater @ SC13
November 21, 2013
HPGeoC
Supported by
Dr. Heming Xu, Dr. Yifeng Cui, Sheau-Yen Chen
Jun Zhu, Efecan Poyraz, Amit Chourasia, Dr. Daniel Roten
Ian Zhang
FEMA 336 Report (2000)
About 50% of the national seismic risk is in Southern California
U.S. Seismic Risk Map
San Andreas Fault System
Creeping Section
Pacific plate is moving NW relative to North America at 5 meters per 100 years
1906 San Francisco Earthquake, M 7.8
1812 Earthquake M 7.5
1680 Earthquake M 7.4 1857 Fort Tejon Earthquake, M 7.9
Open interval 104 years Open interval 153 years
Open interval 198 years
Open interval 330 years
The entire southern San Andreas fault is "locked and loaded."
New paleoseismic data reduce the mean recurrence interval for the Carrizo section from 260 yr to < 140 yr.
(Source: T. Jordan, SCEC)
Area-Magnitude and Slip-Magnitude Scaling, San Andreas Fault System
log10 A ~ Mw
log10 D ~ ½ Mw
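A quick numeric reading of these scaling relations (a sketch; the proportionality constants are illustrative and not from the slide): each unit increase in moment magnitude multiplies rupture area by 10 and average slip by √10.

```python
def area_factor(d_mw: float) -> float:
    """Factor by which rupture area grows, given log10 A ~ Mw."""
    return 10.0 ** d_mw

def slip_factor(d_mw: float) -> float:
    """Factor by which average slip grows, given log10 D ~ 0.5 * Mw."""
    return 10.0 ** (0.5 * d_mw)

# One magnitude unit: area x10, slip x~3.162
print(area_factor(1.0))   # 10.0
print(slip_factor(1.0))   # ~3.162
```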
Frequency-Magnitude Scaling, San Andreas Fault System
M8 “outer scale”
UCERF2 (Field et al., 2008)
Natural Frequency of Buildings
Building Height | Typical Natural Period
2 story         | 0.2 seconds
5 story         | 0.5 seconds
10 story        | 1.0 second
20 story        | 2.0 seconds
30 story        | 3.0 seconds
50 story        | 5.0 seconds
Tall buildings tend to have a lower natural frequency than shorter buildings
f = (1 / 2π) √(K / M)
f − natural frequency in Hertz
K − the stiffness of the building for a specific mode
M − the mass of the building associated with that mode
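As a quick check of the formula (the stiffness and mass values below are hypothetical, chosen so that K/M = (2π)², which matches the ~10-story case in the table with a 1-second period):

```python
import math

def natural_frequency(k: float, m: float) -> float:
    """f = (1/2π)·sqrt(K/M): natural frequency in Hz for stiffness K (N/m) and mass M (kg)."""
    return math.sqrt(k / m) / (2.0 * math.pi)

# Hypothetical building: K/M = (2π)² s⁻², i.e. a ~10-story structure
f = natural_frequency(k=4.0 * math.pi**2 * 1.0e6, m=1.0e6)
print(f)  # ≈ 1.0 Hz, i.e. a natural period of ≈ 1.0 s
```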
Wenchuan Earthquake, 2008
Deterministic earthquake wave propagation simulations do not yet approach the frequencies of interest to building engineers for most common buildings.
Computational Requirements
[Chart: computational requirements (mesh points × time steps) for TeraShake, ShakeOut, M8 2-Hz, and M8 10-Hz, spanning ~10^13 to ~10^21, run on SDSC DataStar (2004), TACC Ranger (2007), and OLCF Jaguar (2010)]
4D ratio: 7 × 10^16
M8 Earthquake Simulation
• 443-billion elements, using 16,640 Titan GPUs
• Small-scale fault geometry and media complexity
• Dynamic rupture propagation along a rough fault embedded in a 3D velocity structure
[Chart: speedup and sustained Tflop/s vs. number of GPUs / XE6 cores (2 to 20,000); curves for ideal speedup, NCCS Titan speedup, NCCS Titan XK7 FLOPS, and Blue Waters XK7 FLOPS; peak sustained 2.3 Pflop/s]
Ground Motion Up to 10-Hz on BW/Titan
SCEC Computational Pathways
[Diagram: four computational pathways feeding a Probabilistic Seismic Hazard Model]
1. Standard seismic hazard analysis (RWG, AWP-ODC): Earthquake Rupture Forecast → Attenuation Relationship → Intensity Measures (empirical models)
2. Ground-motion simulation (AWP-ODC, Hercules): KFR → AWP → NSR → Ground Motions (physics-based simulations)
3. Dynamic rupture modeling (SORD, AWP-ODC; hybrid MPI/CUDA): "Extended" Earthquake Rupture Forecast → Structural Representation → DFR → AWP
4. Ground-motion inverse problem (AWP-ODC, SPECFEM3D): invert with other data (geology, geodesy) → improvement of models
AWP = Anelastic Wave Propagation; NSR = Nonlinear Site Response; KFR = Kinematic Fault Rupture; DFR = Dynamic Fault Rupture
[Diagram: sources 1-3 and a receiver. M sources to N receivers requires M forward simulations; with reciprocity, only 3N simulations]
• A physics-based seismic hazard model requires more than a few earthquake simulations
– Standard "forward" simulation computing 3-component seismograms from M sources at N sites requires M simulations (M > 10^5, N < 10^3)
– Strain-Green-tensor-based "reciprocal" simulation computing 3-component seismograms for M sources at N sites requires only 3N simulations
– Use of reciprocity reduces CPU time by a factor of ~2,000
U_n(r, r_s) = G_nj,i(r, r_s) M_ji
U_n(r, r_s) = H(r_s, r) M_ji
H(r_s, r) = [G_jn,i(r_s, r) + G_in,j(r_s, r)] / 2
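The simulation-count arithmetic behind the reciprocity bullet (the slide's ~2,000× CPU-time saving also folds in per-run cost differences; the raw counts alone give the flavor):

```python
def forward_runs(num_sources: int) -> int:
    # Standard forward approach: one wave-propagation run per source
    return num_sources

def reciprocal_runs(num_sites: int) -> int:
    # Reciprocal approach: three strain-Green-tensor runs (one per component) per site
    return 3 * num_sites

M, N = 10**5, 10**3          # M sources, N sites, as on the slide
print(forward_runs(M))       # 100000 runs
print(reciprocal_runs(N))    # 3000 runs
```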
P(IM_k) = Σ_n P(IM_k | S_n) P(S_n)
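A minimal numeric sketch of this total-probability sum over rupture sources S_n (the probabilities below are toy values, not CyberShake outputs):

```python
def hazard(p_im_given_s, p_s):
    """P(IM_k) = Σ_n P(IM_k | S_n) · P(S_n) over rupture sources S_n."""
    assert abs(sum(p_s) - 1.0) < 1e-9  # source probabilities form a distribution
    return sum(pi * ps for pi, ps in zip(p_im_given_s, p_s))

# Toy example: three sources with exceedance probabilities 0.2, 0.05, 0.5
p = hazard([0.2, 0.05, 0.5], [0.5, 0.3, 0.2])
print(p)  # 0.2*0.5 + 0.05*0.3 + 0.5*0.2 ≈ 0.215
```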
Intensity Measures
Attenuation Relationship
Earthquake Rupture Forecast
Probabilistic Seismic Hazard Analysis
CyberShake hazard map PoE = 2% in 50 yrs
CyberShake seismogram
• 1,144 sites in LA region, f < 0.5 Hz, 2013
– Produced four alternative seismic hazard maps for southern California
– 7.1 million CPU hours (28-day run using Blue Waters and Stampede)
– 189 million jobs
– 165 TB of total output data
– 10.6 TB of stored data
CyberShake Hazard Model
• 5,000 sites, f < 1.0 Hz, 2014-2015
– Produce seismic hazard maps for all of California
– 723 million core-hours
– 4.2 billion jobs
– 56 PB of total output data
– 3.0 PB of stored data
LA Region Map (v13.4) and State-wide Map
CyberShake SGT Simulations on XK7 vs XE6

CyberShake 1.0 Hz                | XE6   | XK7   | XK7 (CPU-GPU co-scheduling)
Nodes                            | 400   | 400   | 400
SGT hours per site               | 10.36 | 2.80  | 2.80
Post-processing hours per site** | 0.94  | 1.88  | 2.00
Total hours per site             | 11.30 | 4.68  | 2.80
Total SUs (millions)*            | 723 M | 299 M | 179 M
SU savings (millions)            |       | 424 M | 543 M

* Scaled to 5,000 sites based on two strain Green tensor runs per site; ** based on CyberShake 13.4 map

3.7× speedup
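The "3.7× speedup" annotation is the ratio of SGT hours per site on XE6 CPUs vs. XK7 GPUs; end-to-end, co-scheduling (which overlaps post-processing with the GPU SGT runs) does even better:

```python
# Values taken directly from the table above
xe6_sgt_hrs, xk7_sgt_hrs = 10.36, 2.80   # SGT hours per site
xe6_total, cosched_total = 11.30, 2.80   # total hours per site

sgt_speedup = xe6_sgt_hrs / xk7_sgt_hrs
end_to_end = xe6_total / cosched_total
print(round(sgt_speedup, 1))  # 3.7
print(round(end_to_end, 1))   # 4.0
```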
AWP-ODC
• Started as personal research code (Olsen 1994)
• 3D velocity-stress wave equations solved by explicit staggered-grid 4th-order FD
• Memory-variable formulation of inelastic relaxation using a coarse-grained representation (Day 1998)
• Dynamic rupture by the staggered-grid split-node (SGSN) method (Dalguer and Day 2007)
• Absorbing boundary conditions by perfectly matched layers (PML) (Marcinkovich and Olsen 2003) and Cerjan et al. (1985)
∂_t v = (1/ρ) ∇·σ
∂_t σ = λ(∇·v) I + μ(∇v + ∇vᵀ)
τ_i dς_i(t)/dt + ς_i(t) = λ_i (δM/M_u) ε(t)
σ(t) = M_u [ ε(t) − Σ_{i=1..N} ς_i(t) ]
Q⁻¹(ω) ≈ (δM/M_u) Σ_{i=1..N} λ_i ω τ_i / (ω² τ_i² + 1)
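A direct evaluation of the Q⁻¹(ω) approximation above (the weights λ_i and relaxation times τ_i here are illustrative, not AWP-ODC's calibrated values; each term is a Debye peak that maximizes at ωτ_i = 1):

```python
def q_inverse(omega, dM_over_Mu, lam, tau):
    """Q⁻¹(ω) ≈ (δM/M_u) Σ_i λ_i ω τ_i / (ω² τ_i² + 1)."""
    return dM_over_Mu * sum(
        l * omega * t / (omega**2 * t**2 + 1.0) for l, t in zip(lam, tau)
    )

# Single relaxation mechanism: λ = 1, τ = 1 s, δM/M_u = 1, evaluated at the peak ω = 1/τ
print(q_inverse(1.0, 1.0, [1.0], [1.0]))  # 0.5
```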
[Figure: relaxation-time distribution assigned across 8-point unit cells of the staggered grid, with velocity and stress components]
Inelastic relaxation variables for memory-variable ODEs in AWP-ODC
GPU Code: Decomposition on CPU and GPU
Two-layer 3D domain decomposition on CPU-GPU:
• X & Y decomposition for CPUs
• Y & Z decomposition for GPU SMs
Single-GPU Optimizations
Global memory optimization:
• global memory coalescing
• texture memory for 3D constant variables
• constant memory for scalar constants
Using L1/L2 cache rather than shared memory
Communication Reduction
• Velocity is input to the stress computation, so velocity must be communicated; extend the ghost cell region with two extra layers and compute, rather than communicate, the ghost cell updates before the stress computation.
• The 2D XY plane represents the 3D sub-domain, as no communication in the Z direction is required due to the 2D decomposition for GPUs.
[Diagram: velocity before computation → velocity after computation → velocity after communication → stress after computation]
Stress is then input to compute the next time step's velocity, ∂_t v = (1/ρ) ∇·σ
GPU-GPU Communication

Communication                  | Velocity (freq., message size) | Stress (freq., message size)
Before communication reduction | 4, 6(nx+ny)·NZ                 | 4, 12(nx+ny)·NZ
After communication reduction  | 4, 12(nx+ny+4)·NZ              | no communication
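Plugging the table's expressions into numbers (a hypothetical sub-domain with nx = ny = NZ = 100; 4 exchanges per entry, as in the frequency column) shows the net traffic drops even though the velocity messages grow:

```python
def volume_before(nx, ny, nz):
    # 4 velocity exchanges of 6(nx+ny)·NZ plus 4 stress exchanges of 12(nx+ny)·NZ
    return 4 * 6 * (nx + ny) * nz + 4 * 12 * (nx + ny) * nz

def volume_after(nx, ny, nz):
    # 4 enlarged velocity exchanges of 12(nx+ny+4)·NZ; stress needs no communication
    return 4 * 12 * (nx + ny + 4) * nz

nx = ny = nz = 100
print(volume_before(nx, ny, nz))  # 1440000 words per time step
print(volume_after(nx, ny, nz))   # 979200 words per time step
```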
Computing and Communication Overlapping
XK7 Nodes used | Elements (1000s) | Wall Clock Time | Parallel Efficiency
8,192          | 429,496,729      | 0.1085          | 100%
16,384         | 858,993,459      | 0.1159          | 93.2%
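For weak scaling, the problem size grows with the node count, so parallel efficiency is the reference wall-clock time divided by the wall-clock time at the larger scale. With the (rounded) timings in the table this comes out to ~93.6%, consistent with the table's 93.2% figure, which was presumably computed from unrounded timings:

```python
def weak_scaling_efficiency(t_ref: float, t_n: float) -> float:
    """Weak-scaling efficiency: reference wall-clock time over scaled-up wall-clock time."""
    return t_ref / t_n

eff = weak_scaling_efficiency(0.1085, 0.1159)  # timings from the table
print(f"{eff:.1%}")  # ~93.6%
```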
Two-phase I/O Model
• Parallel I/O
– Read and redistribute multiple-terabyte inputs
– Contiguous block reads by a reduced number of readers
– High-bandwidth asynchronous point-to-point communication for redistribution
[Diagram: cores reading contiguous blocks of a shared file]
• Aggregate and write
[Diagram: a temporal aggregator buffers time steps 1..N and writes stripe-size chunks to the OSTs]
MPI-IO
• Aggregate and write
– Temporal aggregation buffers
– Contiguous writes
– Throughput
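A minimal serial sketch of the temporal-aggregation idea behind the two-phase model (class and names are mine, not from the AWP-ODC code): instead of issuing one small write per time step, an aggregator buffers several steps locally and emits one large contiguous write.

```python
import io

class TemporalAggregator:
    """Buffer N time steps of output, then flush them as one contiguous write."""
    def __init__(self, sink, steps_per_flush):
        self.sink = sink                      # any writable binary stream
        self.steps_per_flush = steps_per_flush
        self.buffer = []
        self.flushes = 0                      # count of large writes issued

    def write_step(self, data: bytes):
        self.buffer.append(data)
        if len(self.buffer) >= self.steps_per_flush:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(b"".join(self.buffer))  # one large contiguous write
            self.buffer.clear()
            self.flushes += 1

sink = io.BytesIO()
agg = TemporalAggregator(sink, steps_per_flush=10)
for step in range(100):
    agg.write_step(b"\x00" * 64)  # 64-byte record per time step
agg.flush()                       # drain any remaining buffered steps
print(agg.flushes, len(sink.getvalue()))  # 10 large writes instead of 100 small ones; 6400 bytes
```

The real code does this across MPI ranks with designated aggregator processes; the buffering-then-contiguous-write pattern is the same.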
ADIOS
• ADIOS checkpointing
– Effective I/O by separating metadata (external XML file) and an API library
[Diagram: time steps 1..N flow through a temporal aggregator and a spatial aggregator into Files 1-3 on OST1-OST3]
Joint ADIOS work with S. Klasky, N. Podhorszki and Q. Liu of ORNL
AWP-ODC Weak Scaling
[Chart: TFLOPS vs. number of nodes (2 to 20,000) for ideal, AWPg/XK7, AWPg/HP SL250, and AWPc/XE6; annotations: 94% efficiency and 100% efficiency]
CPUs/GPUs Co-scheduling

aprun -n 50 <GPU executable> <arguments> &
get the PID of the GPU job
cybershake_coscheduling.py:
    build all the CyberShake input files
    divide up the nodes and work among a customizable number of jobs
    for each job:
        fork extract_sgt.py cores --> performs pre-processing and launches
            "aprun -n <cores per job> -N 15 -r 1 <cpu executable A> &"
        get the PID of the CPU job
    while executable A jobs are running:
        check PIDs to see if a job has completed
        if completed:
            launch "aprun -n <cores per job> -N 15 -r 1 <cpu executable B> &"
    while executable B jobs are running:
        check for completion
    check for GPU job completion
– CPUs run reciprocity-based seismogram and intensity computations
– Run multiple MPI jobs on compute nodes using Node Managers (MOM)
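The co-scheduling loop boils down to: launch the GPU job in the background, launch CPU stage A jobs, poll their PIDs, start stage B as each A job finishes, then verify everything completed. A runnable sketch with placeholder workloads standing in for the aprun commands (the commands and job counts here are illustrative, not the production script):

```python
import subprocess
import sys
import time

def launch(cmd):
    # Background launch, like "aprun ... &"; returns a handle we can poll later
    return subprocess.Popen(cmd)

# Placeholders for the GPU executable and the CPU stage-A jobs
gpu_job = launch([sys.executable, "-c", "import time; time.sleep(0.2)"])
stage_a = [launch([sys.executable, "-c", "pass"]) for _ in range(2)]

stage_b = []
while stage_a:                        # poll PIDs; as each A job finishes, start its B job
    for job in list(stage_a):
        if job.poll() is not None:    # poll() returns None while the job is running
            stage_a.remove(job)
            stage_b.append(launch([sys.executable, "-c", "pass"]))
    time.sleep(0.01)

for job in stage_b:                   # wait for all stage-B jobs
    job.wait()
gpu_job.wait()                        # finally, check the GPU job
print("all jobs done")
```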
Post-processing on CPUs: API for Pthreads
• AWP-API lets individual pthreads make use of CPUs for post-processing:
– Vmag, SGT, seismograms
– Statistics (real-time performance measuring)
– Adaptive/interactive control tools
– Visualization
– Output writing is introduced as a pthread that uses the API
[Flowchart: the main thread initializes the simulation and modules, then starts computation on the GPU. At specified time steps it copies velocity data and signals the modules on other XK7 CPUs; each module calculates SGTs when a signal is received and writes out via MPI-IO when it is time. The loop repeats while more time steps remain, then finalizes.]
Accelerating CyberShake Calculations on GPUs (USC)
[Hazard curves: probability rate (1/yr), 10^-5 to 10^-1, vs. 3s SA (g), 10^-2 to 10^0]
CyberShake as a Platform for Operational Earthquake Forecasting
Operational Forecast – Harvard Curves
Operational Forecast – NSHMP
Operational Forecast – After 2009 Bombay Beach
Operational Forecast – After 2004 Parkfield
10-Hz Visualization
Collaborators
Carl Ponder, Cyril Zeller, Stanley Posey and Roy Kim (NVIDIA); Jeffrey Vetter, Mitch Horton, Graham Lopez, Richard Glassbrook (NICS/ORNL); Matthew Norman and Jack Wells (ORNL); Bruce Loftis (NICS); D.K. Panda, Sreeram Potluri and D.K.'s team (OSU); Gregory Bauer, Jay Alameda, Omar Padron (NCSA); Robert Fiedler (Cray); Scott Baden and Didem Unat (UCSD); Liwen Shih (UH)
Computing Resources
NCSA Blue Waters, OLCF Titan, XSEDE Keeneland, NVIDIA GPU donation to HPGeoC/SDSC
NSF Grants
NCSA NEIS-P2/PRAC OCI-0832698, XSEDE ECCS, PRAC, SI2-SSI (OCI-1148493), Geoinformatics (EAR-1226343), NSF/USGS SCEC4 Core (EAR-0529922 and 07HQAG0008)
Acknowledgements