Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Pegasus WMS
1. Pegasus determines “Express flow” requirements for workflow data transfers
2. Pegasus sends “modifyNetwork” QoS requests to messaging space
Mod
ify
requ
ests
Request Manager (RM), AHAB, VirtualSDX API
Mobius ++
3. RM uses AHAB and VirtualSDX API to send SDN QoS requests to SDX Tree Network
4. SDX Tree Network actuates QoS actions using REST API of SDN controller
6. Slice modified with required bandwidth QoS
NIaaS/Infrastructure
VirtualSDX + ExoGENI
Slic
e re
ady
with
Q
oS 7. RM sends ack. to Pegasus when Express flow QoS is set
5. SDN controller contacts SDX switches for each site
Workflow Models (Aspen) +
Simulation (CODES)Resource
Provisioner
Ewa Deelman, Christopher Carothers, Anirban Mandal, Brian Tierney, Jeffrey Vetter, Ilya Baldin, Mark BlancoRafael Ferreira da Silva, Mariam Kiran, Vickie Lynch, Shirley Moore, Paul Ruth
LEARN MOREPanorama Websitehttp://sites.google.com/site/panoramaofworkflows/
Panorama is funded by the US Department of Energy under Grant #DE-SC0012636
PANORAMAOverview of the Project Description
Water is seen as small red and white molecules on large nanodiamonds spheres. The colored tRNA can be seen on the nanodiamond surface. Image :Michael Mattheson, ORNL (https://www.ornl.gov/news/diamonds-deliver).
IMPACT ON DOE SCIENCEDiamonds that deliver!
Panorama enabled cutting-edge domain science research and development that has the potential to solve some of the challenges associated with drug discovery and delivery:
• The motions of a tRNA (or transfer RNA) model system can be enhanced when coupled with nanodiamonds, or diamond nanoparticles approximately 5 to 10 nanometers in size
Flow Prioritization Use Case
NETWORK PROVISIONINGData Flow Prioritization for Scientific Workflows Using a Virtual SDX
SYSTEM DESIGNWorkflow System, Infrastructure Monitoring, Analytical Modeling, Simulation
PANORAMA: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows
Infrastructure
Dashboard<web UI>
Pegasus DB<RDBMS>
Monitoring DB<InfluxDB>
WorkflowMonitor
<monitord>
Real TimeJob Monitor<Kickstart>
Analytical Modeling<Aspen>
Workflow State,Performance, Estimates,
Anomalies
MonitoringTimeseries
MonitoringEvents
Timeseries
Anomalies
ActualPerformance
Perf.Estimates
DBUpdates
EstimatedPerformance
Infrastructure Monitoring
I/O Network
Resource
Workflow Planner
<Pegasus>
WorkflowStructure
and Estimates
Parameters
Workflow and Task Models
Workflow Logs
ExecutableWorkflow<DAG>Abstract
Workflow<DAX>
Workflow Engine<DAGMan>
Job Manager<Condor>
ApplicationJob
Simulator<ROSS/CODES>
Message Broker<RabbitMQ>
MonitoringEvents
MonitoringEvents
Anomaly Detection<Complex Event
Processing>
New PANORAMA components
Unmodified existing components
Components modified by PANORAMA
• Pegasus interfaces with Aspen to estimate resource requirements of individual workflow tasks as well as the entire workflow
• Aspen interfaces with ROSS/CODES to simulate network behavior not easily modeled using analytical techniques
• Workflow and infrastructure monitoring data is stored in InfluxDB and Pegasus DB
• Anomaly detection process monitors data stream and generates anomaly notifications, which are displayed in the web dashboard
The Panorama project aims to further the understanding of the behavior of scientific workflows asthey are executing in heterogeneous environments. Panaroma’s approach to modeling anddiagnosing the runtime performance of complex scientific workflows is to integrate extreme-scalesystems testbed experimentation, structured analytical modeling and parallel systems simulationinto a comprehensive workflow framework that can characterize the end-to-end workflowperformance on today’s and future generation architectures, which can be used to improve theoverall workflow performance and reliability. The Panaroma architecture includes the individualframework components: the Aspen analytical application modeling software, the ROSS simulationframework, the Pegasus workflow management system, and how they are used to model thebehavior of DOE-relevant applications. By having a coupled model of the application andexecution environment, decisions can be made about resource provisioning, application taskscheduling, data management within the application, etc. Our approach for correlating the realtime application and infrastructure monitoring data can be used to verify application behavior,perform anomaly detection and diagnosis, and support adaptivity during workflow execution.
• We have developed an SNS Pegasus workflow to confirm that nanodiamonds enhance the dynamics of tRNA when in the presence of water. The workflow calculates the epsilon which best matches experimental data. These calculations used almost 400,000 CPU hours on a Cray XE6at NERSC.
• The workflow runs NAMD parallel simulations, which varies the epsilon between -0.01 and -0.19 for each temperature specified (it requires 800 cores: equilibrium runs take ~1.5hs and production runs 12-16hs). AMBER’s cpptraj removes global translation and rotation, and SASSENA calculates neutron scattering intensities from the trajectories (400 cores, 3-6hs). This workflow was used to computer 4 temperatures between 260K and 300K, which generated ~3TB of data.
MODELING AND SIMULATIONScalable Workload Generation for Application Performance Modeling and Simulation
BURST BUFFERSOn the use of Burst Buffers for Accelerating Scientific Workflow Executions
Major Findings:• I/O write performance was improved by a
factor of 9, and I/O read performance by a factor of 15
• Performance decreased slightly at node counts above 64 (potential I/O ceiling)
• I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications
• I/O contention may limit the broad applicability of burst buffers for all workflow applications (e.g., in situ processing)
Burst Buffers (BB) have emerged as a non-volatile storage solution that is positioned between the processors’ memory and the PFS, buffering the large volume of data produced by the application at an higher rate than the PFS, while seamlessly draining the data to the PFS asynchronously.
We explored the impact of Burst Buffers (BB) in scientific workflow applications. Using a software stack including Pegasus-WMS and HT-Condor, we ran a workflow on the Cori system at NERSC which included provisioning and releasing remote-shared BB nodes. Our application wrote and read about 550 GB of data.
Virtualized SDX
Site 1
OpenFlowController Site 3Site 2
Data Site
ExoGENI Slice(s)
DynamicConnectivity
SDX Tree Network with Express Flows
RYU OpenFlowController
OpenFlow v1.3
Site0
Site1
Express Flows (Bandwidth Provisioned)
Site5
Site6
Site4
Site3
Site2
Tree network with vSDX
SDX: meeting point networks to exchange traffic, securely and with QoS, using SDN protocols
We developed mechanisms to arbitrate and prioritize data flows from competing workflows by leveraging advanced network provisioning technologies like a virtual Software Defined Exchange (SDX).
Virtualized SDX
Site 1
OpenFlowController Site 3Site 2
Data Site
ExoGENI Slice with Virtual SDX + HTCondor Pools + Data Site
DynamicConnectivity
HTCondorPool
HTCondorPool
HTCondorPool
vSDX use case with HTCondor pools and Pegasus
• Software Defined Exchanges (SDX) –meeting point of networks to exchange traffic, securely and with QoS, using SDN protocols
• Virtual SDX – virtual overlay acting as SDX without persistent physical location
• ExoGENI virtual SDX can modify compute, network, storage to support changing demands of SDX
0
2500
5000
7500
1 4 8 16 32 64 128# Nodes
MiB
/s
BB No−BB
0.5
1.0
1.5
2.0
Spee
dup
fx.sgt fy.sgt
1 4 8 16 32 64 128 1 4 8 16 32 64 1280
500
1000
1500
2000
# Nodes
Tim
e (s
econ
ds)
BB No−BB
BB No−BB
1 4 8 16 32 64 128 1 4 8 16 32 64 1280
25
50
75
100
# Nodes
Cum
ulat
ive C
PU ti
me
usag
e (%
)
stime utime
MPI-IO module data: Average time consumed in I/O read operations per process
I/O performance estimate for read operations at the MPI-IO layer (solid lines), and average runtime speedup (dashed lines)
Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for different numbers of nodes
• Virtual SDX transparently arbitrates workflow data flows communicated by Pegasus
Prioritized Data Flows
The Mobius++ framework can be used by several high-level applications to provision and adapt infrastructure based on particular requirements
Experimental Results
10-80 20-70 30-60 40-50 45-45 50-40 60-30 70-20 80-10
Relative data flow priorities (Workflow1-Workflow2)
0
200
400
600
800
1000
1200
1400
1600
1800
Avg
. tr
an
sfe
r tim
e f
or
"in
div
idu
al"
job
s (s
ecs
)
Transfer time vs. Flow priorities (Different bandwidths)
Workflow1 on HTCondor pool 1 (vSDX bandwidth = 250Mb/s)Workflow2 on HTCondor pool 2 (vSDX bandwidth = 250Mb/s)Workflow1 on HTCondor pool 1 (vSDX bandwidth = 500Mb/s)Workflow2 on HTCondor pool 2 (vSDX bandwidth = 500Mb/s)
10-80 20-70 30-60 40-50 45-45 50-40 60-30 70-20 80-10
Relative data flow priorities (Workflow1-Workflow2)
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
Work
flow
exe
cutio
n tim
e (
secs
)
Execution time vs. Flow priorities (vSDX bandwidth = 250Mb/s)
Workflow1 on HTCondor pool 1Workflow2 on HTCondor pool 2
Effect of relative flow priorities on observed data transfer times for data-intensive workflow tasks for different vSDX provisioned bandwidths
Effect of relative flow priorities on overall workflow execution times
• We have created a new technique for generating scalable workloads from real applications, and implemented a prototype, called Durango, using a performance modeling toolkit.
• We demonstrate the efficacy of Durango’s direct integration approach, which links Aspen into CODES as part of the running network simulation model. Here, Aspen generates the application-level computation timing events, which in turn drives the start of a network communication phase. Durango in direct integration
mode with 1.3M node dragonfly network and Aspen compute node generator for 1K to 32K MPI ranks
Durango in direct integration mode with 32K node torus network and Aspen compute node generator for 1K to 16K MPI ranks.