
Toward an End-to-end Framework for Modeling, Monitoring and Anomaly Detection for Scientific Workflows

Anirban Mandal, Paul Ruth, Ilya Baldin
RENCI - UNC Chapel Hill
{anirban,pruth,ibaldin}@renci.org

Dariusz Król, Gideon Juve, Rajiv Mayani, Rafael Ferreira da Silva, Ewa Deelman
USC Information Sciences Institute
{darek,gideon,mayani,rafsilva,deelman}@isi.edu

Jeremy Meredith, Jeffrey Vetter, Vickie Lynch, Ben Mayer, James Wynne III
Oak Ridge National Laboratory
{jsmeredith,vetter,lynchve,mayerbw,wynnejr}@ornl.gov

Mark Blanco, Chris Carothers, Justin LaPre
Rensselaer Polytechnic Institute
[email protected], {chrisc,laprej}@cs.rpi.edu

Brian Tierney
LBL
[email protected]

Abstract—Modern science is often conducted on large scale, distributed, heterogeneous and high-performance computing infrastructures. Increasingly, the scale and complexity of both the applications and the underlying execution platforms have been growing. Scientific workflows have emerged as a flexible representation to declaratively express complex applications with data and control dependences. However, it is extremely challenging for scientists to execute their science workflows in a reliable and scalable way due to a lack of understanding of the expected and realistic behavior of complex scientific workflows on large scale and distributed HPC systems. This is exacerbated by failures and anomalies in large scale systems and applications, which makes detecting, analyzing and acting on anomaly events challenging. In this work, we present a prototype of an end-to-end system for modeling and diagnosing the runtime performance of complex scientific workflows. We interfaced the Pegasus workflow management system, Aspen performance modeling, monitoring and anomaly detection into an integrated framework that not only improves the understanding of complex scientific applications on large scale complex infrastructure, but also detects anomalies and supports adaptivity. We present a black box modeling tool, a comprehensive online monitoring system, and anomaly detection algorithms that employ the models and monitoring data to detect anomaly events. We present an evaluation of the system with a Spallation Neutron Source workflow as a driving use case.

Keywords-scientific workflows, performance modeling, monitoring, anomaly detection

I. INTRODUCTION

Modern computational and data science often involves processing and analyzing vast amounts of data through large scale simulations of underlying science phenomena. In addition, access to remote sensors, instruments, data-sets, high-performance computing resources, and demands for high-volume data flows are making modern scientific analysis highly dynamic, heterogeneous and distributed. This is evident in diverse fields of science including astronomy, bioinformatics, physics, and climate modeling among many others.

Not only are applications growing in scale and complexity, the distributed and high-performance computing infrastructure required to support science applications is increasingly diverse in scale and complexity — DOE Leadership Computing Facilities, OSG [1], XSEDE [2], cloud infrastructures [3], [4] and national and regional network transit providers like ESnet [5] and Internet2 [6]. It is extremely challenging for scientists to navigate the complexity of applications and infrastructures to execute their computational campaigns in a reliable and scalable way. This, in part, stems from a lack of understanding of the expected and realistic behavior of complex scientific workflows on large scale and distributed HPC systems. Furthermore, there are bound to be failures and anomalies in large scale systems and applications, and detecting, analyzing the root causes for, and acting on anomaly events are challenging.

In order to address some of the above challenges, the Panorama project [7] aims to further our understanding of the behavior of scientific workflows as they are executing in large scale, heterogeneous environments by modeling and diagnosing the runtime performance of complex scientific workflows. We are developing tools that analyze the workflow and that develop models of expected behavior given a particular computing environment, such as an HPC system, clusters distributed over wide area networks, or clouds. We are also investigating the use of analytical models for resource provisioning, scheduling and data management decisions. This is coupled with correlating real-time monitoring of application and infrastructure to verify application behavior, to detect and diagnose anomalies, and to support adaptivity.

In this paper, we present our initial successes in developing analytical models that can predict the behavior of complex, data-intensive, scientific workflows executing on large-scale infrastructures. In particular, we describe a black-box modeling approach for generating informed thresholds for various performance metrics using the Aspen analytical performance modeling language and suite of tools [8] (Section II). We also present the design and implementation of a comprehensive performance monitoring framework (Section III) for dynamically monitoring various workflow, application and infrastructure metrics, and describe how the monitoring data can be used to build initial black box models for performance prediction and for anomaly detection (Section IV) during workflow execution. The paper makes the case for an automated, end-to-end framework (Section VI) that ties together workflow planning and management using Pegasus [9], Aspen performance modeling, online monitoring, and anomaly detection and notification for improving the overall performance of complex scientific workflows on current and future generation architectures. We present a prototype system that used a Spallation Neutron Source workflow (Section V) as the driving use case.

II. ANALYTICAL MODELING WITH ASPEN BLACK-BOX MODELING TOOL

To generate more informed thresholds for the online monitoring and anomaly detection, we turn to the Aspen analytical performance modeling language and suite of tools [8].

There are myriad ways of generating application models for Aspen, including automated parsing of source code [10] and manual efforts by those knowledgeable about the application in question. In the case of modeling workflows, our system may be exposed to applications for which no model has yet been created, and it must be able to generate informed thresholds for anomaly detection even in the absence of more detailed descriptions of these applications. For this purpose, we created a technique to use Aspen for black-box application model generation.

A. Black-box Modeling for Recorded Data

Suppose we have obtained information about runtimes for some application, running on the CPU cores of a known machine, at two different problem sizes (n). This information is seen in the following table:

    n    machine        socket  runtime
    100  machine.aspen  cpu     11
    500  machine.aspen  cpu     15

A simple solution to generating an analytical performance model for this application would be to perform, say, a linear regression on the runtime. In particular, we could predict that runtime = 10 + n/100. This is very easy, but provides minimal information about the application itself, and provides no ability to extrapolate to other machines as will likely be needed by a workflow system.
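For illustration, this naive regression step could be reproduced in a few lines of Python; this is a minimal sketch, assuming only the two (n, runtime) pairs from the table above:

    # Fit runtime = a + b*n to the recorded data points by least squares.
    import numpy as np

    n = np.array([100.0, 500.0])
    runtime = np.array([11.0, 15.0])

    b, a = np.polyfit(n, runtime, deg=1)    # coefficients, highest degree first
    print(f"runtime = {a:.0f} + {b:.2f} * n")  # runtime = 10 + 0.01 * n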

A better solution is to use some known aspect of the machine to transition away from a model based on runtime alone. For example, since the known abstract machine model for our system has a clock speed for the CPUs, such as 1 GHz, and we know that it can perform one floating point operation per cycle, we can refine our prediction to flops = 10^10 + n × 10^7 (the constant 10 seconds of runtime corresponds to 10^10 cycles, and the n/100 term to n × 10^7). This prediction is now machine-independent, and we have a performance model based solely on application parameters and application resource usage. In other words, we now have an application model which can generate performance predictions for the current machine as well as unknown future machines.
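The conversion from a runtime model to a machine-independent resource model can be sketched in a few lines; the clock speed and flops-per-cycle values below are the assumed abstract machine parameters from the example above:

    # Convert measured runtimes into machine-independent flop counts
    # using the abstract machine parameters (1 GHz, 1 flop per cycle).
    CLOCK_HZ = 1e9
    FLOPS_PER_CYCLE = 1

    def runtime_to_flops(runtime_s):
        return runtime_s * CLOCK_HZ * FLOPS_PER_CYCLE

    print(runtime_to_flops(11))  # n=100: 1.1e10 = 10^10 + 100 * 10^7
    print(runtime_to_flops(15))  # n=500: 1.5e10 = 10^10 + 500 * 10^7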

Our system takes this approach one step further and can automatically generate an application model based on other resource parameters such as floating point operations, bytes loaded and stored from memory, and messages communicated between tasks – i.e., from any application resource usage which can be described for an abstract machine model.

Similar work has been done in this area. The approach taken in [11] utilized runtimes and flops, though lost some accuracy by ignoring threshold factors such as I/O, contention, and network latency. A later approach in [12] added a genetic programming error correction procedure to increase the overall accuracy. The approach we describe here allows for greater abstraction, flexibility, and portability by supporting any resource representable in template application models and abstract machine models within the Aspen framework.

B. Aspen Black-box Modeling Tool

Figure 1. Aspen black box modeling tool.

The black-box modeling tool in Aspen uses the following approach, as shown in Figure 1:

• The user chooses a template Aspen model or creates their own. This template model contains free parameters used to fit the data.

• The user feeds this template model and recorded runtime data into the Aspen black-box modeling tool.

• Aspen converts the templated Aspen model into symbolic runtime equations for each machine listed in the input runtime data.

• The modeling tool creates an objective function which returns the error between the runtime predictions and the recorded data for a given set of parameters. This is currently done via least-squares fitting (a sketch of this fitting step follows the list).

• An optimizer (such as NLopt, http://ab-initio.mit.edu/nlopt/) solves for the free parameters to minimize the error of the objective function.

• The tool outputs a concrete Aspen application model which combines the input template model and the best solution to the free parameters.
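The fitting step can be illustrated with a small Python sketch. The residual structure below mirrors the NAMD template that follows (per-timestep loads scaling as c·nAtoms² and flops as d·nAtoms); the machine rates, the recorded runtimes, and the use of SciPy in place of NLopt are all illustrative assumptions, not the tool's actual implementation:

    import numpy as np
    from scipy.optimize import least_squares

    # Assumed abstract machine rates: bytes loaded per second, flops per second.
    LOAD_RATE, FLOP_RATE = 1e10, 1e9

    def predicted_runtime(params, n_atoms, n_steps):
        c, d = params
        per_step = (c * n_atoms**2) / LOAD_RATE + (d * n_atoms) / FLOP_RATE
        return n_steps * per_step

    # Illustrative recorded runtimes at two problem sizes.
    n_atoms = np.array([1e5, 1e6])
    n_steps = 100
    measured = np.array([205.0, 20050.0])

    def objective(params):
        # Residuals between symbolic-model predictions and recorded data.
        return predicted_runtime(params, n_atoms, n_steps) - measured

    # Bounds correspond to the "in 1 .. 1e18" ranges in the template model.
    fit = least_squares(objective, x0=[10.0, 10.0], bounds=([1, 1], [1e18, 1e18]))
    print(fit.x)  # best-fit values of c and d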

An example input template Aspen application model is as follows:

    model NAMD_Template {
      param nAtoms = 1e6        // application parameters
      param nTimeSteps = 100    // (defined in the input file)
      param c = 1 in 1 .. 1e18  // solve for these parameters
      param d = 1 in 1 .. 1e18  // (within the given ranges)
      kernel main               // application behavior: execution and control flow
      {
        iterate [nTimeSteps] {
          execute {
            loads [c * nAtoms^2]
            flops [d * nAtoms]
          }
        }
      }
    }

Here, we see problem parameters like nAtoms and nTimeSteps. These values can be defined in the input data file for each recorded runtime. We also see free parameters c and d; the modeling tool recognizes the allowable ranges on these parameters and uses them as constraints during the optimization phase. Finally, we see the control flow and resource usage (flops and loads); this is a simple example, but it shows how some minimal knowledge about the application (e.g., resource usage scaling versus problem size, and a linear effect of the number of time steps) can be used to generate a more informed template file for modeling.


The model output by the tool, now in concrete form with problem-specific parameters, is as follows:

    model NAMD_Equilibrate {
      param nAtoms = 1e6      // NAMD input parameters
      param nTimeSteps = 100
      param c = 402.1         // calculation-specific constants
      param d = 10.95
      kernel main             // NAMD application behavior
      {
        iterate [nTimeSteps] {
          execute {
            loads [c * nAtoms^2]
            flops [d * nAtoms]
          }
        }
      }
    }

C. Interaction with CODES

CODES [13] is an HPC storage and network simulation framework built on the ROSS [14] parallel simulation framework, which implements the Time Warp parallel discrete-event simulation protocol using reverse computation [15]. ROSS has been shown to efficiently scale to nearly 2 million cores on the largest supercomputers available to date [14]. These scaling results have enabled CODES to model Dragonfly, Slimfly and Torus topologies with over 1 million nodes.

These large-scale network models can be linked to Aspen compute node models to provide a combined, overall picture of an application's performance. In the current integration, CODES drives Aspen to report computational kernel time information. That information is used by CODES to both delay and generate application-specific communication patterns into the simulated network. When executing together, the Aspen/CODES integrated model reports no performance loss over CODES alone when executing a large-scale Dragonfly network model using 1024 Blue Gene/Q nodes with 8,192 MPI ranks.

III. MONITORING

Workflow modeling and analysis methods require detailed event traces and online monitoring data in order to build models of system behavior and perform anomaly detection and diagnosis as the workflow is running. The Panorama project has developed a sophisticated set of tools and infrastructure for collecting traces and monitoring data for workflows.

A. Workflow and application monitoring

The monitord component of Pegasus collects, aggregates and publishes all the monitoring data produced by the workflow. This includes high-level information on the state of the workflow generated by the workflow execution engine (DAGMan) and the workflow scheduler (HTCondor), as well as low-level information published by job monitoring tools.

Job-level monitoring is performed by Kickstart [16], a monitoring and tracing tool used to launch computation and data management jobs and to collect information about the behavior of the jobs and their execution environment. A Kickstart process forks application processes on the compute node and uses its position as the parent process to inspect the behavior of the application. Kickstart writes trace data to a file for offline analysis, and reports real-time monitoring data to monitord.

As part of the DOE dV/dt project [17], we added functionality to Kickstart to automatically capture resource usage metrics of workflow jobs [18]. This functionality uses operating system monitoring facilities such as procfs and getrusage() as well as library call interposition to collect fine-grained profile data that includes process I/O, file accesses, runtime, memory usage, and CPU utilization.
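A minimal sketch of the kind of procfs sampling involved (these are standard Linux interfaces; the helper names are ours, not Kickstart's):

    # Read cumulative I/O byte counters from /proc/<pid>/io.
    def read_io_counters(pid):
        counters = {}
        with open(f"/proc/{pid}/io") as f:
            for line in f:
                key, value = line.split(":")
                counters[key.strip()] = int(value)
        return counters

    # Resident set size in pages: second field of /proc/<pid>/statm.
    def read_rss_pages(pid):
        with open(f"/proc/{pid}/statm") as f:
            return int(f.read().split()[1])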

Library call interposition is used to implement monitoring functionality that requires access to the internal state of an application process. It is implemented by a component called libinterpose, which uses LD_PRELOAD to intercept calls to POSIX functions for file and network I/O and threads, as well as performing monitoring activities at process start and exit, such as starting monitoring threads, activating CPU counters, and reporting final performance metrics.

We extended Kickstart and libinterpose to collect CPU performance counters using the PAPI library [19]. libinterpose enables the CPU counters when the process starts, and reports their values periodically to Kickstart. This includes counters for the number of floating-point operations, instructions, loads, stores, and cache misses executed by the application.


We expect this information to be very useful in modeling because it tells us more about the computational requirements of the application independent of the execution hardware. Previously, our modeling was focused on metrics such as runtime, which are a function of both the computation requirements and the hardware capability. By measuring the computation in terms of loads, stores, and operations, we expect to be able to construct machine-independent models of the application using Aspen (Section II-A) that have the potential to be evaluated against different execution hardware.

We also extended Kickstart to support MPI jobs. Previously, Kickstart was not able to monitor MPI processes running on multiple compute nodes without running a Kickstart process for each MPI rank. Using libinterpose, however, there can be one Kickstart process that invokes mpiexec (or equivalent) to launch the MPI job, and libinterpose can attach to each MPI process and report results back to the Kickstart process. To facilitate this, libinterpose sends MPI rank information along with monitoring data to Kickstart, and this data is aggregated across all MPI ranks to produce one unified time series for the entire MPI job.

B. Infrastructure monitoring

We are also collecting infrastructure-level monitoring data for network, storage system, and host performance. Our current tools collect monitoring data from the infrastructure with particular emphasis on I/O and network performance, which are critical for many workflows. The goal is to gather monitoring data to help build initial models. Workflow- and job-level monitoring is correlated with infrastructure-level monitoring to identify anomalies and their potential sources. We are using a combination of standard system monitoring tools, including sar, and automated scripts, as well as active monitoring infrastructure such as perfSONAR [20] and fio [21]. We are using racks on the ExoGENI [3] testbed as a controlled environment to collect monitoring data, both application-level and infrastructure-level, because it offers minimal performance interference, which is extremely important to build and validate initial models.

During workflow execution, we launch automated monitoring scripts as part of the workflow launch bootstrap to observe various OS-level performance metrics on the nodes. The monitoring scripts use the sar tool from the sysstat [22] Linux utilities for online monitoring of I/O and network. For example, if the workflow applications are using a shared filesystem like NFS, we monitor the I/O performance of the node that acts as the NFS server. In this case, we monitor the number of observed read/write requests and the observed read/write bandwidth on a node using the '-b' option in sar. In the network case, we observe receive/transfer bandwidth and the number of packets received/transferred using the '-n' option in sar on the data-plane network interface, which is used for application data movement. All the above infrastructure monitoring data is collected every five seconds and stored as a time series.
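The sampling loop can be sketched as follows; this is a minimal illustration, assuming sysstat's sar is installed (the exact column layout varies across sysstat versions, so the parsing here is deliberately loose):

    import subprocess

    # Stream `sar -b` samples at a five-second interval.
    proc = subprocess.Popen(["sar", "-b", "5"], stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        fields = line.split()
        # Data rows start with a timestamp; the final column (bwrtn/s on
        # our sysstat version) is the observed write throughput.
        if fields and ":" in fields[0] and fields[-1].replace(".", "").isdigit():
            print(fields[0], "bwrtn/s:", fields[-1])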

In addition, we have created a tool for actively monitoring block storage I/O performance, which involves running periodic tests on the ExoGENI racks. The tool works by probing each ExoGENI site and submitting a resource request that provisions a machine, runs a suite of I/O performance benchmarks, and stores the performance results as a time series. Our tool uses fio, a flexible tool for benchmarking I/O performance and generating simulated I/O load. We have configured fio to simulate several representative patterns of I/O, including reading and writing of sequential and random I/O blocks. For each pattern of I/O in the benchmark suite, we collect several metrics that include read/write bandwidth, latency, and latency percentiles. For active network monitoring, we created a perfSONAR image based on Docker for the ExoGENI infrastructure testbed. We updated the base image with client tools that can push perfSONAR network monitoring data to our monitoring database. We have provisioned resources between different pairs of racks, and ran perfSONAR bwctl tests with appropriate parameters.

IV. ANOMALY DETECTION

Figure 2. Anomaly detection components.

Figure 3. AR based detection example (training series).

Coupling online monitoring data from the workflow application and infrastructure with the Aspen performance models gives us a powerful tool to detect anomalies during workflow executions. An accurate attribution of the anomalies to the observed workflow performance is critical for fixing the underlying problem, or adapting the system. In this section, we describe the various components of the anomaly detection engine and its interactions with the rest of the system (Figure 2).

The “getMonitoringData” component interacts with the online monitoring data to obtain the different time series for relevant performance metrics. The “getPerfModel” component is designed to interact with the Aspen performance modeling system to obtain the estimated performance of workflow applications from the Aspen models. In the current implementation, it statically reads the generated thresholds from the Aspen black box modeling tool. In the future, we plan a tighter integration with the modeling system. The “Publish Anomaly” component is responsible for generating anomaly events and publishing the anomaly messages to an AMQP message broker. The anomaly messages are then consumed by the Pegasus dashboard to enable users to see the details of the anomalies and the corresponding time series data, which is useful for attribution of anomalies.

The heart of the anomaly detection engine is a suite of detection algorithms, which are pluggable modules. These detection algorithms can be coupled in different ways to detect and correlate anomalies. Three types of detection algorithms are currently implemented in the system. The first is a simple threshold-based detection algorithm that does continuous diffs between Aspen-generated thresholds and online measurements for application-specific metrics. This can be coupled with a “MovingAverage” (MA) based detection algorithm for analyzing time-series data for infrastructure-related metrics. We have also developed “AutoRegression” (AR) based algorithms to detect anomalies by analyzing time-series data for application-related metrics. An AR model is first developed from non-anomalous runs, which includes the AR coefficients and the degree of AR that fits a given metric best. When run against online monitoring data, the AR model is used to predict the metric; the prediction errors become significant in the presence of anomalies.
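A minimal sketch of an AR-based detector of this kind (a plain least-squares fit of the AR coefficients on a 1-D numpy series; the degree p and the threshold factor k are tuning choices, not values from our system):

    import numpy as np

    def fit_ar(series, p):
        # Fit x[t] ~ sum_i a[i] * x[t-i], i = 1..p, on a non-anomalous run.
        X = np.column_stack([series[p - i:len(series) - i] for i in range(1, p + 1)])
        y = series[p:]
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma = np.std(y - X @ coeffs)  # spread of training-time prediction errors
        return coeffs, sigma

    def detect(series, coeffs, sigma, k=4.0):
        # Flag points whose one-step prediction error exceeds k * sigma.
        p = len(coeffs)
        anomalies = []
        for t in range(p, len(series)):
            pred = series[t - p:t][::-1] @ coeffs  # most recent lag first
            if abs(series[t] - pred) > k * sigma:
                anomalies.append(t)
        return anomalies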

We now present an example of the AR-based detection technique. The first graph (Figure 3) shows the time series plots of two application metrics (write bytes and iowait) related to I/O performance of the NAMD application from the SNS workflow (Section V), and one infrastructure metric (write bandwidth) related to observed I/O on the node, for a non-anomalous run. The X-axis represents time since the start of the application. In this example, the AR model is calculated for the iowait metric. Figure 4 shows the time series plots of the three metrics for an anomalous run where a write anomaly was introduced one hour into the run, for an hour, by utilizing the stress benchmark (http://people.seas.harvard.edu/~apw/stress/) on the node that acts as the NFS server. The stress benchmark is a workload generator for POSIX systems, and provides the capability to impose a configurable amount of CPU, memory, I/O, and disk stress on the system. In this case, we employed stress to inject an I/O anomaly by only running disk stress threads. We observe the anomaly showing up as a step in the infrastructure time series plot. The second subfigure plots the errors from the AR model for iowait at each point in time during the execution of NAMD. We observe that the errors are largest in the region where the anomaly existed and that they vanish when the anomaly is removed. We can use the AR model errors to trigger observed anomalies in an online fashion.

Figure 4. AR based detection example (anomalous series).

V. SNS WORKFLOW

We have developed several use-case applications to evaluate our workflow modeling, monitoring and anomaly detection system. The research described in this paper uses the Spallation Neutron Source (SNS) workflow as a test case.


The Spallation Neutron Source (SNS) [23] is a research facility at the Department of Energy's Oak Ridge National Laboratory that uses pulsed neutron beams to investigate the properties of materials for scientific and industrial research. SNS uses a particle accelerator to impact a mercury-filled target with a stream of protons, generating neutrons by the process of spallation. These neutrons are directed toward a sample, causing some neutrons to scatter. The scattering events are collected by an array of detectors and distilled into different science products depending on the experiment. This reduced data is then analyzed and compared to materials simulations to extract scientific information about the material.

In collaboration with the Center for Accelerating Materials Modeling (CAMM), we created a workflow that executes an ensemble of molecular dynamics and neutron scattering simulations to fit model parameter values to experimental results. This parameter refinement workflow has been used to investigate temperature and hydrogen charge parameters for models of water molecules.

The workflow executes one job to unpack a reference database, along with 5 jobs for each set of parameter values. Each parameter value is fed into a series of parallel molecular dynamics simulations using NAMD [24]. The output from the MD simulations has the global translation and rotation removed using AMBER's [25] cpptraj utility [26], and is passed to Sassena [27] for the calculation of coherent and incoherent neutron scattering intensities. The final outputs of the workflow are transferred to the user's desktop and loaded into Mantid [28] for analysis and visualization.

The production version of this workflow has been run on the Hopper supercomputer at NERSC and is in the process of being ported to the Titan supercomputer at OLCF.

VI. END-TO-END SYSTEM

The end-to-end workflow modeling, monitoring, and anomaly detection system is illustrated in Figure 5.

The workflow is planned and executed using the Pegasus Workflow Management System. Pegasus converts an abstract workflow description into an executable, directed acyclic graph (DAG) of jobs that is managed by the DAGMan [29] workflow engine. DAGMan submits the jobs in the DAG to HTCondor [30], which executes them on the distributed infrastructure. As the workflow is running, the Pegasus monitord service collects workflow monitoring data and updates the state of the workflow in the Pegasus database.

Application monitoring is implemented by wrapping each job in the workflow with Kickstart. Kickstart performs two functions: it collects trace data for offline analysis and modeling, and it monitors the job and reports real-time performance metrics via a RabbitMQ (https://www.rabbitmq.com) message broker. The monitord service aggregates monitoring data for a job from RabbitMQ, updates the current metrics for the job in the Pegasus database, and publishes job performance metrics to an InfluxDB (https://influxdata.com) time series database.
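As an illustration, publishing one such sample with the influxdb-python client might look like this (the database, measurement, and tag names are hypothetical, not the system's actual schema):

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="panorama")
    client.write_points([{
        "measurement": "job_metrics",                    # hypothetical names
        "tags": {"workflow": "sns", "job": "namd_eq"},
        "fields": {"iowait": 0.12, "write_bytes": 1048576},
    }])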

Infrastructure monitoring is implemented using the infrastructure monitoring tools described in Section III-B. These tools continuously collect monitoring metrics for the hosts, storage systems and networks used by the workflow and store them as time series in InfluxDB.


Anomaly detection tools use models to predict application performance, and compare the predicted performance to the observed performance from InfluxDB. When an anomaly is detected, a message is published via RabbitMQ that describes the details of the anomaly. These anomaly messages are collected and stored in the Pegasus database by monitord.
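A minimal sketch of publishing an anomaly event with the pika AMQP client (the exchange name, routing key, and message fields are illustrative assumptions, not the system's actual schema):

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="anomaly.exchange", exchange_type="topic")
    channel.basic_publish(
        exchange="anomaly.exchange",
        routing_key="anomaly.io",
        body=json.dumps({"workflow": "sns-001", "metric": "iowait", "level": "warning"}),
    )
    connection.close()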

Monitoring and anomaly data are displayed in a web-based user interface called the Pegasus Dashboard. Users can select a job and see current values for all of the performance metrics collected by the online monitoring infrastructure, as well as time series plots of the data for the job from InfluxDB. The Dashboard also displays anomalies that were detected by the anomaly detection tools. The Dashboard reads these anomalies from the Pegasus database and displays them along with time series plots from InfluxDB that help to illustrate and explain the cause of the anomaly.

Figure 5. End-to-end workflow modeling, monitoring and anomaly detection system. Red indicates existing components, blue indicates modified components, and green indicates new components.

We deployed the SNS workflow application on an ExoGENI [3] rack located at the Pittsburgh Supercomputing Center. We instantiated a virtualized HTCondor system capable of running MPI simulations utilizing 100 cores, and ran the SNS workflow application using Pegasus. As discussed earlier, we collected online monitoring data from the application and infrastructure during workflow execution. We introduced an I/O anomaly using the stress tool. This anomaly was designed to simulate competition with other jobs for I/O resources, which is a common source of job performance variability on production HPC systems. The Aspen model thresholds were available to the anomaly detection engine, and we ran the MA- and threshold-based detection algorithms. The I/O anomaly appears in the application monitoring as a spike in the iowait metric, which measures the amount of time that the process was blocked waiting on I/O. It was detected by the system and shown on the Pegasus dashboard with the corresponding relevant time series, as shown in Figure 6.
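The injection itself can be reproduced with a short script; a minimal sketch, assuming the stress tool is installed on the NFS server node (the worker count is illustrative):

    import subprocess
    import time

    time.sleep(3600)  # let the workflow run undisturbed for one hour
    # Spin disk-stress workers for one hour to contend for I/O.
    subprocess.run(["stress", "--hdd", "4", "--timeout", "3600"])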

Figure 6. Pegasus dashboard anomaly notification.

VII. CONCLUSION

We presented a prototype of an integrated, end-to-end system that models, executes, monitors, and detects anomalies for complex scientific workflows. The system has been applied and evaluated in the context of executing SNS workflows on the ExoGENI testbed. It is a foundational framework for our future work on developing more complex predictive models, anomaly detection algorithms, workflow adaptation mechanisms, and adaptive resource provisioning for workflows executing on distributed, heterogeneous and HPC systems.

ACKNOWLEDGMENT

This work was funded by DOE under contract #DE-SC0012636, “Panorama - Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”. The development of the neutron scattering simulation workflow was supported by the U.S. Department of Energy (DOE), Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. The use of Oak Ridge National Laboratory's Spallation Neutron Source was sponsored by the Scientific User Facilities Division, Office of Basic Energy Sciences.

REFERENCES

[1] “Open Science Grid,” http://www.opensciencegrid.org.

[2] “Extreme Science and Engineering Discovery Environment,” http://www.xsede.org.

[3] I. Baldine, Y. Xin et al., “ExoGENI: A multi-domain infrastructure-as-a-service testbed,” in 8th International ICST Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, 2012, pp. 97–113.

[4] “CloudLab,” http://cloudlab.us.

[5] “ESnet,” http://www.es.net.

[6] “Internet2,” http://www.internet2.edu.

[7] E. Deelman, C. Carothers, A. Mandal, B. Tierney, J. S. Vetter, I. Baldin, C. Castillo, G. Juve, D. Krol, V. Lynch, B. Mayer, J. Meredith, T. Proffen, P. Ruth, and R. Ferreira da Silva, “PANORAMA: An approach to performance modeling and diagnosis of extreme scale workflows,” International Journal of High Performance Computing Applications, vol. to appear, 2015.

[8] K. Spafford and J. S. Vetter, “Aspen: A domain specific language for performance modeling,” in SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, 2012.

[9] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger, “Pegasus: a workflow management system for science automation,” Future Generation Computer Systems, vol. 46, pp. 17–35, 2015.


[10] S. Lee, J. S. Meredith, and J. S. Vetter, “COMPASS: A framework for automated performance modeling and prediction,” in ACM International Conference on Supercomputing (ICS). Newport Beach, California: ACM, 2015.

[11] G. Mahinthakumar, M. Sayeed, J. Blondin, P. Worley, A. Mezzacappa, and R. Hix, “Performance evaluation and modeling of a parallel astrophysics application,” in Proceedings of the High Performance Computing Symposium, J. Meyer, Ed., 2004, pp. 27–33.

[12] K. Raghavachar, G. Mahinthakumar, P. Worley, E. Zechman, and R. Ranjithan, “Parallel performance modeling using a genetic programming-based error correction procedure,” SIMULATION, vol. 83, no. 7, pp. 515–527, 2007.

[13] M. Mubarak, C. D. Carothers, R. B. Ross, and P. Carns, “A case study in using massively parallel simulation for extreme-scale torus network codesign,” in Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ser. SIGSIM PADS ’14. New York, NY, USA: ACM, 2014, pp. 27–38.

[14] P. D. Barnes, Jr., C. D. Carothers, D. R. Jefferson, and J. M. LaPre, “Warp Speed: Executing Time Warp on 1,966,080 Cores,” in Proc. 2013 ACM SIGSIM Conf. Principles of Advanced Discrete Simulation, ser. PADS ’13, Montreal, Canada, 19–22 May 2013, pp. 327–336.

[15] C. D. Carothers, K. S. Perumalla, and R. M. Fujimoto, “Efficient optimistic parallel simulations using reverse computation,” ACM Trans. Model. Comput. Simul., vol. 9, no. 3, pp. 224–253, Jul. 1999.

[16] J. S. Vockler, G. Mehta, Y. Zhao, E. Deelman, and M. Wilde, “Kickstarting remote applications,” in International Workshop on Grid Computing Environments, 2007.

[17] “dV/dt: Accelerating the rate of progress towards extreme scale collaborative science,” https://sites.google.com/site/acceleratingexascale/.

[18] G. Juve, B. Tovar, R. Ferreira da Silva, D. Krol, D. Thain, E. Deelman, W. Allcock, and M. Livny, “Practical resource monitoring for robust high throughput computing,” in 2nd Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, ser. HPCMASPA’15, 2015, pp. 650–657.

[19] K. London, S. Moore, P. Mucci, K. Seymour, and R. Luczak, “The PAPI cross-platform interface to hardware performance counters,” in Department of Defense Users’ Group Conference Proceedings, Biloxi, Mississippi, vol. 3, no. 4, 2001.

[20] B. Tierney, J. Metzger, J. Boote, E. Boyd, A. Brown, R. Carlson, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev, “perfSONAR: Instantiating a global network measurement framework,” SOSP Wksp. Real Overlays and Distrib. Sys., 2009.

[21] “fio - Flexible I/O Tester,” https://github.com/axboe/fio.

[22] “sysstat,” http://sebastien.godard.pagesperso-orange.fr.

[23] T. Mason, D. Abernathy et al., “The Spallation Neutron Source in Oak Ridge: A powerful tool for materials research,” Physica B: Condensed Matter, vol. 385, pp. 955–960, 2006.

[24] J. C. Phillips, R. Braun et al., “Scalable molecular dynamics with NAMD,” Journal of Computational Chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.

[25] D. Case, V. Babin et al., “AMBER 14,” University of California, San Francisco, 2014.

[26] D. R. Roe and T. E. Cheatham, III, “PTRAJ and CPPTRAJ: Software for processing and analysis of molecular dynamics trajectory data,” Journal of Chemical Theory and Computation, vol. 9, no. 7, pp. 3084–3095, 2013.

[27] B. Lindner and J. Smith, “Sassena: X-ray and neutron scattering calculated from molecular dynamics trajectories using massively parallel computers,” Computer Physics Communications, vol. 183, no. 7, pp. 1491–1501, 2012.

[28] O. Arnold, J. C. Bilheux et al., “Mantid: Data analysis and visualization package for neutron scattering and μSR experiments,” Nuclear Instruments and Methods in Physics Research Section A, vol. 764, pp. 156–166, November 2014.

[29] “DAGMan,” http://research.cs.wisc.edu/htcondor/dagman/dagman.html.

[30] “HTCondor,” http://research.cs.wisc.edu/htcondor.