CyberShake Study 15.4 Technical Readiness Review


Page 1: CyberShake Study 15.4 Technical Readiness Review

CyberShake Study 15.4 Technical Readiness Review

Page 2: CyberShake Study 15.4 Technical Readiness Review

Study 15.4 Scientific Goals

• Calculate a 1 Hz map of Southern California

• Produce meaningful 2 second results for the UGMS
  • RotD50 and RotD100 at 2, 3, 4, 5, 7.5, 10 seconds
  • Contour maps

• Compare 0.5 Hz and 1 Hz hazard maps

• Use Graves & Pitarka (2014) rupture generator with regularly spaced hypocenters

• 336 sites (10 km mesh, points of interest, “gap” sites)
  • Run 14 UGMS sites first

• Produce 1 Hz seismograms which could be combined with BBP high-frequency seismograms

Page 3: CyberShake Study 15.4 Technical Readiness Review

Study 15.4 Technical Goals

• Run CyberShake across Blue Waters and Titan
  • SGT and post-processing workflows on Blue Waters
  • SGTs only on Titan

• All SGTs to be calculated on GPUs

• Will measure CyberShake application makespan
  • Equivalent to the makespan of all of the workflows
    • (All jobs complete) – (first workflow submitted)
  • Includes hazard curve calculation time
  • Includes system downtime, workflow stoppages

• Compare performance of Blue Waters and Titan

• Compare 1 Hz performance to previous studies
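The makespan definition on this slide, (all jobs complete) minus (first workflow submitted), can be illustrated with a minimal sketch. The function name and the timestamps are hypothetical and used only for illustration; the study's actual measurement is taken from the workflow logs.

```python
from datetime import datetime


def application_makespan(workflow_submit_times, job_end_times):
    """Return (last job completion) - (first workflow submission) as a timedelta.

    Downtime and workflow stoppages are included automatically, because only
    the two endpoints of the whole study are used.
    """
    if not workflow_submit_times or not job_end_times:
        raise ValueError("need at least one submit time and one end time")
    return max(job_end_times) - min(workflow_submit_times)


# Placeholder timestamps, not taken from any real run.
submits = [datetime(2015, 1, 1, 9, 0), datetime(2015, 1, 2, 12, 0)]
ends = [datetime(2015, 1, 3, 3, 30), datetime(2015, 1, 5, 9, 0)]

makespan = application_makespan(submits, ends)
makespan_hours = makespan.total_seconds() / 3600.0
print(f"makespan: {makespan_hours:.1f} hours")
```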

Page 4: CyberShake Study 15.4 Technical Readiness Review

Performance Enhancements

• Pegasus cleanup used to decrease temp storage
  • Should avoid Study 14.2 problem of running out of scratch space

• Parallel version of AWP reformat code
  • 65% reduction in runtime

• DirectSynth for post-processing
  • Single job for entire post-processing
  • Reads in SGTs across nodes, communicates via MPI
  • Reduces 1 Hz post-processing CPU-hours by 75%

• MD5 sums in post-processing workflow moved out of critical path

Page 5: CyberShake Study 15.4 Technical Readiness Review

Proposed Study sites (336)

Green sites are the 50 new “gap” sites

Page 6: CyberShake Study 15.4 Technical Readiness Review

Study 15.4 Data Products

• CVM-S4.26 Los Angeles-area hazard maps
  • RotD100 2, 3, 4, 5, 7.5, 10 sec
  • RotD50 2, 3, 4, 5, 7.5, 10 sec
  • Geometric mean 2, 3, 5, 10 sec

• Hazard curves for 336 sites, at 2, 3, 5, 10 sec

• 2-component seismograms for all ruptures (~160M)

• Peak amplitudes in DB
  • Geometric mean at 2, 3, 5, 10 sec
  • RotD100, RotD50 at 2, 3, 4, 5, 7.5, 10 sec
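RotD50 and RotD100 are, by their standard definition, the median and the maximum over rotation angles of the peak amplitude of the rotated horizontal component. The sketch below shows that definition for two equal-length horizontal time series. It is not the CyberShake implementation: the function name and the 1-degree angle step are assumptions, and for spectral values the inputs would be oscillator response histories at the period of interest rather than raw seismograms.

```python
import math
import statistics


def rotd50_rotd100(x, y, step_deg=1):
    """Return (RotD50, RotD100) for two horizontal component time series.

    For each rotation angle in [0, 180) degrees, the two components are
    combined into a single rotated trace and its peak absolute value is
    recorded. RotD50 is the median of those peaks, RotD100 the maximum.
    """
    if len(x) != len(y) or not x:
        raise ValueError("components must be non-empty and of equal length")
    peaks = []
    for deg in range(0, 180, step_deg):
        theta = math.radians(deg)
        c, s = math.cos(theta), math.sin(theta)
        peaks.append(max(abs(a * c + b * s) for a, b in zip(x, y)))
    return statistics.median(peaks), max(peaks)


# Toy input: all motion on one component, a single sample.
rotd50, rotd100 = rotd50_rotd100([1.0], [0.0])
print(rotd50, rotd100)
```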

Page 7: CyberShake Study 15.4 Technical Readiness Review

Study 15.4 Notables

• First 1 Hz hazard maps

• First study with RotD50 and RotD100 calculated

• First study to use OLCF Titan

• First study with Graves & Pitarka (2014) rupture generator with uniformly spaced hypocenters

• First study with 200 m rupture grid point spacing

• First study with source filtered at a different frequency than the simulation frequency

Page 8: CyberShake Study 15.4 Technical Readiness Review

Study 15.4 Parameters

• 1.0 Hz deterministic
  • 100 m spacing
  • dt=0.005 sec
  • nt=40000 timesteps

• CVM-S 4.26
  • Vs min = 500 m/s

• UCERF 2

• Graves & Pitarka (2014) rupture variations
  • 200 m rupture grid point spacing

• Source filtered at 2.0 Hz
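A few derived quantities follow directly from the parameters listed above. The short sketch below works them out; the variable names are illustrative, and the points-per-wavelength figure is simple arithmetic on the listed values rather than a statement about the solver's accuracy requirements.

```python
# Values taken from the slide.
f_max_hz = 1.0      # deterministic simulation frequency
dx_m = 100.0        # grid spacing
dt_s = 0.005        # timestep
nt = 40000          # number of timesteps
vs_min_m_s = 500.0  # minimum shear-wave velocity

# Simulated duration covered by nt timesteps (about 200 s).
duration_s = nt * dt_s

# Shortest wavelength at f_max in the slowest material (500 m).
min_wavelength_m = vs_min_m_s / f_max_hz

# Grid points sampling that shortest wavelength (5).
points_per_wavelength = min_wavelength_m / dx_m

print(f"duration: {duration_s:.0f} s")
print(f"minimum wavelength: {min_wavelength_m:.0f} m")
print(f"grid points per minimum wavelength: {points_per_wavelength:.0f}")
```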

Page 9: CyberShake Study 15.4 Technical Readiness Review

Changes to SGT Software Stack

• UCVM 14.3

• SGTs
  • Only using AWP-ODC-SGT GPU version, on 800 nodes per component

• PostAWP
  • Changed to parallel version of AWP reformatting, 65% speedup
  • Reduced read sizes to avoid issue with Titan filesystem
  • Separated MD5 sum calculation into separate job

• Handoff
  • Modified handoff job from Study 13.4 to provide interface between Titan SGTs and Blue Waters post-processing

Page 10: CyberShake Study 15.4 Technical Readiness Review

Changes to PP Software Stack

• Rupture generator V3.3.1

• Extraction & Synthesis
  • Created new job, DirectSynth
    • Single job for all seismogram synthesis, PSA, and RotD calculation
    • Set of SGT handler processes read in SGTs
    • Set of workers work on synthesis, request SGTs from SGT handlers
    • Data products are sent to master, which writes to filesystem
    • Will use 1024 SGT handlers, 2560 workers per job

• Additional database jobs added for RotD data
  • Checks, insertions, curve calculations
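The master, SGT handler, and worker roles described for DirectSynth can be pictured as a partition of the processes in one job. The sketch below is only an illustration of that idea: the rank ordering (master first, then handlers, then workers) and the contiguous block partition of SGT points across handlers are assumptions, since the slide gives only the process counts.

```python
NUM_SGT_HANDLERS = 1024  # from the slide
NUM_WORKERS = 2560       # from the slide
TOTAL_PROCESSES = 1 + NUM_SGT_HANDLERS + NUM_WORKERS  # one master assumed


def role_for_rank(rank):
    """Map a process rank to its role (hypothetical rank layout)."""
    if rank == 0:
        return "master"
    if 1 <= rank <= NUM_SGT_HANDLERS:
        return "sgt_handler"
    if rank < TOTAL_PROCESSES:
        return "worker"
    raise ValueError(f"rank {rank} out of range")


def handler_for_point(point_index, num_points):
    """Return the handler rank holding a given SGT point.

    Assumes points are split into contiguous, near-equal blocks, so a
    worker can work out which handler to ask without any lookup table.
    """
    if not 0 <= point_index < num_points:
        raise ValueError("point index out of range")
    block = -(-num_points // NUM_SGT_HANDLERS)  # ceiling division
    return 1 + point_index // block


print(TOTAL_PROCESSES, role_for_rank(0), role_for_rank(1025))
```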

Page 11: CyberShake Study 15.4 Technical Readiness Review

Changes to Workflows

• Post-processing workflows simpler
  • Only 1 job for extraction and synthesis
  • MD5 sum out of critical path, but will abort workflow if it fails

• Auto-submit cron tool on shock enhanced
  • List of sites to execute passed to cron job
    • Will start with 14 UGMS sites
    • Will repeat first 2 sites (CCP, COO) on Titan and Blue Waters for verification
  • Maintains constant # of workflows
  • When more are needed, selects next site(s), creates/plans/runs the workflow
  • Now supports SGT, PP, and full workflows
  • Dynamically assigns workflows to remote resources
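The auto-submit behaviour described above (keep a constant number of workflows active, pull the next sites from the list when slots open, and assign each workflow to a remote resource) might be sketched as follows. The function names are hypothetical, site names other than CCP and COO are placeholders, and the least-loaded assignment rule is an assumption, because the slide does not say how resources are chosen.

```python
def select_sites_to_submit(pending_sites, running_count, target_count):
    """Return the next sites to start so that target_count workflows run."""
    open_slots = max(0, target_count - running_count)
    return pending_sites[:open_slots]


def assign_resource(queued_by_resource):
    """Pick the resource with the fewest queued workflows (assumed policy).

    Ties are broken alphabetically so the choice is deterministic.
    """
    return min(sorted(queued_by_resource), key=lambda r: queued_by_resource[r])


# One pass of the cron job with illustrative numbers.
pending = ["CCP", "COO", "SITE_A", "SITE_B"]
to_start = select_sites_to_submit(pending, running_count=3, target_count=5)
queued = {"titan": 2, "bluewaters": 1}
for site in to_start:
    resource = assign_resource(queued)
    queued[resource] += 1
    print(f"create/plan/run workflow for {site} on {resource}")
```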

Page 12: CyberShake Study 15.4 Technical Readiness Review

Workflow Hierarchy

[Diagram: workflow hierarchy]

• AutoSubmit cron job on shock.usc.edu
• Integrated Workflow (one per site per model); 336 workflows for Study 15.4
  • SGT workflow (Blue Waters or Titan): PreCVM (creates volume), Generate AWP Workflow, AWP Workflow
  • Handoff (Titan/BW interface)
  • Post-processing workflow (Blue Waters & shock): PP Pre Workflow, PP Main Workflow, DB workflow

Page 13: CyberShake Study 15.4 Technical Readiness Review

Distributed Processing

• Pegasus 4.5.0 RC, HTCondor 8.2.8, Globus 5.2.5

• Cron job on shock.usc.edu creates/plans/runs SGT, PP, and full workflows

• Jobs submitted to Blue Waters via GRAM

• Results staged back to shock, DB populated, curves generated

• Jobs submitted to Titan using pilot jobs
  • Cannot submit jobs to Titan directly, due to security

Page 14: CyberShake Study 15.4 Technical Readiness Review

Titan Distributed Processing

[Diagram: shock.usc.edu hosts monitor_daemon.py, the Condor queue (SGT workflow, Titan; SGT workflow, BW; SGT workflow, Titan; PP workflow, BW), and the Condor collector. Titan hosts the batch queue (two sets of Pre SGT, SGT, and Post SGT pilot jobs) and condor_master. Arrows (1), (2), and (3) correspond to the steps below.]

1. Every 5 minutes, the monitor daemon queries the Condor queue on shock.
2. If there are more Titan SGT workflows in the Condor queue than sets of jobs in the Titan queue, a new set of pilot jobs is submitted, with qsub dependencies.
3. These jobs start up Condor processes which call back to the shock Condor collector and can be assigned work.
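The decision made by the monitor daemon on each 5-minute pass can be sketched as below. This is an illustration, not the contents of monitor_daemon.py: the function names and job names are hypothetical, and the queue counts are passed in as plain integers rather than obtained from Condor or the Titan batch system.

```python
POLL_INTERVAL_S = 5 * 60  # the daemon checks every 5 minutes
PILOT_STAGES = ["pre_sgt", "sgt", "post_sgt"]  # one set = three chained pilot jobs


def should_submit_pilot_set(titan_sgt_workflows, pilot_sets_in_titan_queue):
    """True when Condor holds more Titan SGT workflows than Titan has pilot sets."""
    return titan_sgt_workflows > pilot_sets_in_titan_queue


def build_pilot_set(set_id):
    """Return [(job_name, depends_on), ...] for one set of pilot jobs.

    Each job depends on the previous stage, mirroring the qsub dependencies
    mentioned on the slide. The real submission command is not shown here.
    """
    jobs = []
    previous = None
    for stage in PILOT_STAGES:
        name = f"{stage}_pilot_{set_id}"
        jobs.append((name, previous))
        previous = name
    return jobs


# One pass with illustrative counts: 2 Titan SGT workflows, 1 pilot set queued.
if should_submit_pilot_set(2, 1):
    for name, depends_on in build_pilot_set(set_id=2):
        print(f"submit {name} (after {depends_on})")
```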

Page 15: CyberShake Study 15.4 Technical Readiness Review

Computational Requirements

• Per site: ~3720 node-hrs
  • SGTs: depends on execution site (~50%)
    • Titan = 2110 node-hrs / 63,300 SUs
    • Blue Waters = 1760 node-hrs / 30,200 SUs
    • More expensive for Titan because of padding in pilot jobs and different node-hrs -> SU conversion
  • PP: 1880 node-hrs / 60,200 SUs (~50%)

• Computational time:
  • Titan (SGTs): 355K node-hours / 10.7M SUs
  • Blue Waters: 928K node-hours
    • SGTs: 275K GPU node-hrs, 21K CPU node-hrs
    • PP: 632K CPU node-hrs

• Titan has 104M SUs remaining

• Blue Waters has 5.3M node-hrs remaining
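The study totals above can be cross-checked against the per-site figures. The sketch below assumes the SGTs for half of the 336 sites run on each system; that even split is an assumption, not something the slide states, but the resulting totals agree with the quoted ones to within rounding.

```python
SITES = 336
TITAN_SGT_SITES = SITES // 2   # assumed even split of SGT runs
BW_SGT_SITES = SITES - TITAN_SGT_SITES

# Per-site figures from the slide (node-hours).
TITAN_SGT_NODE_HRS = 2110
BW_SGT_NODE_HRS = 1760
PP_NODE_HRS = 1880

# Titan conversion implied by the slide: 63,300 SUs for 2110 node-hrs.
titan_sus_per_node_hr = 63300 / TITAN_SGT_NODE_HRS

titan_node_hrs = TITAN_SGT_SITES * TITAN_SGT_NODE_HRS
titan_sus = titan_node_hrs * titan_sus_per_node_hr
bw_sgt_node_hrs = BW_SGT_SITES * BW_SGT_NODE_HRS
bw_pp_node_hrs = SITES * PP_NODE_HRS          # all post-processing is on Blue Waters
bw_node_hrs = bw_sgt_node_hrs + bw_pp_node_hrs

print(f"Titan: {titan_node_hrs:,} node-hrs, {titan_sus:,.0f} SUs (slide: 355K / 10.7M)")
print(f"Blue Waters SGTs: {bw_sgt_node_hrs:,} node-hrs (slide: 275K + 21K)")
print(f"Blue Waters PP: {bw_pp_node_hrs:,} node-hrs (slide: 632K)")
print(f"Blue Waters total: {bw_node_hrs:,} node-hrs (slide: 928K)")

# Both totals fit within the remaining allocations quoted on the slide.
fits_titan = titan_sus < 104e6
fits_blue_waters = bw_node_hrs < 5.3e6
```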

Page 16: CyberShake Study 15.4 Technical Readiness Review

Storage Requirements

• Titan
  • Purged: 526 TB (for SGTs and temp data)

• Blue Waters
  • Delayed purge: 506 TB (for Titan SGTs)
  • Purged: 526 TB SGTs + 9 TB data products

• SCEC
  • Archived: 9.1 TB (seismograms, PSA, RotD)
  • Database: 268 GB (Geom @ 4 periods, RotD @ 6)
  • Temporary: 608 GB (workflow logs)
  • Shared SCEC disks have 171 TB free

Page 17: CyberShake Study 15.4 Technical Readiness Review

Metrics Gathering

• Monitord for workflow metrics
  • Will run during workflows, since DirectSynth dramatically cuts the number of tasks

• Python scripts to calculate standard metrics

• Cronjob on Blue Waters
  • Core usage over time
  • Jobs running and idle

• Pilot monitor process on Titan
  • Core usage
  • Jobs running and idle

• Will use start and end of workflow logs to perform makespan measurement

Page 18: CyberShake Study 15.4 Technical Readiness Review

Monitoring Tools

• Will use Study Manager to track progress
  • Hosted on northridge.usc.edu
  • Tracks number of runs in each state
  • Estimates completion time based on velocity

• Run Manager tracks status of individual runs

• If errors, will dig into individual logs
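A velocity-based completion estimate of the kind mentioned for Study Manager can be sketched as below. How Study Manager actually defines velocity is not described on the slide; this version assumes it is simply completed runs per elapsed day, and the function name and example numbers are hypothetical.

```python
def estimate_remaining_days(completed_runs, total_runs, elapsed_days):
    """Estimate days left from the average completion rate so far.

    Returns None when no run has finished yet, since no rate is available.
    """
    if completed_runs <= 0 or elapsed_days <= 0:
        return None
    velocity = completed_runs / elapsed_days  # runs per day
    return (total_runs - completed_runs) / velocity


# Illustrative numbers only: 84 of 336 sites done after 21 days.
remaining = estimate_remaining_days(84, 336, 21.0)
print(f"estimated days remaining: {remaining:.0f}")
```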

Page 19: CyberShake Study 15.4 Technical Readiness Review

Estimated Duration

• Limiting factors:
  • XK node queue time
    • 800 XK nodes is 19% of Blue Waters
  • Titan -> Blue Waters
    • If throughput is very high, transfer could be bottleneck
  • USC HPC downtime for ~1 week in April

• Estimated completion is 12 weeks (11 running + 1 downtime)
  • Based on same node availability as Study 14.2

• Planning to request reservation on Blue Waters

• Planning to request high priority on Titan

Page 20: CyberShake Study 15.4 Technical Readiness Review

Personnel Support

• Scientists
  • Tom Jordan, Kim Olsen, Rob Graves

• Technical Lead
  • Scott Callaghan

• Job Submission / Run Monitoring
  • Scott Callaghan, David Gill, Phil Maechling

• NCSA Support
  • Omar Padron, Tim Bouvet

• Titan Support
  • Val Anantharaj

• USC Support
  • John Yu, John Mehringer

• Workflow Support
  • Karan Vahi, Gideon Juve

Page 21: CyberShake Study 15.4 Technical Readiness Review

Risks

• Queue times on Blue Waters for XK nodes
  • Will try to dynamically assign SGT jobs to resources

• Unforeseen complications with Titan pilot jobs

• Globus toolkit upgrades on NCSA
  • Globus upgraded on shock and tested
  • Waiting on new Condor release for shock

• Congestion protection events (network overloaded)
  • If triggered consistently, will need to limit number of post-processing workflows

• Scott goes on leave before study is complete
  • Difficult to run study under other accounts at this time

Page 22: CyberShake Study 15.4 Technical Readiness Review

Action Items

• Upgrade UCVM on Blue Waters to 14.3.0

• Notify OLCF, NCSA, and USC of study

• Request reservation on Blue Waters

• Request increased priority on Titan

• Develop an approach for continuity of study during Scott’s leave

• Test single workflow, multiple sites approach

Page 23: CyberShake Study 15.4 Technical Readiness Review

Thanks for your time!