
Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler )


Page 1

LBL, 11/4/2003

Towards Scientific Workflows Based on Dataflow Process Networks
(or from Ptolemy to Kepler)

Bertram Ludäscher
San Diego Supercomputer Center
[email protected]

Page 2

Acknowledgements

• NSF, NIH, DOE
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/

Page 3

Outline

• Scientific Workflows
• Business Workflows
• [Problem Solving Environments (SCIRun)]
• Dataflow Process Networks (Ptolemy-II)
• Scientific Workflows (Kepler)

Page 4

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

Page 5

Source: NIH BIRN (Jeffrey Grethe, UCSD)

Page 6

GARP Invasive Species Pipeline

[Figure: GARP invasive species pipeline. Processing steps: EcoGridQuery, LayerIntegration, SampleData, DataCalculation, MapGeneration, Validation (with user input), GenerateMetadata, ArchiveToEcogrid. Data products, for both the native range and the invasion area: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, (d) training and test samples, (e) GARP rule set, (f) prediction maps, (g) model quality parameter, (h) selected prediction maps. Inputs come from registered EcoGrid databases; the labels +A1, +A2, +A3 also appear in the figure.]

Source: NSF SEEK (Deana Pennington, UNM)

Page 7

Scientific Workflow Aspects

• Data orientation
– Data volume
– Data complexity
– Data integration
• Computational complexity
• Grid-aspects
– Distributed computation
– Distributed data
• Analysis and tool integration
• User-interactions/WF steering
• Data and workflow provenance

Page 8

Business Workflows

• Business Workflows
– show their office automation ancestry
– documents and “work-tasks” are passed
– no data streaming, no data-intensive pipelines
– lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, XPDL, …
– but often no clear execution semantics for constructs as simple as this:

[Figure: workflow construct example]

Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

Page 9

A ZOO of Workflow Standards and Systems

Source: W.M.P. van der Aalst et al., http://tmitwww.tm.tue.nl/research/patterns/

Page 10

More on Scientific WF vs Business WF

• Business WF
– Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects are still identifiable throughout
– Complex control flow, task-oriented
– Transactions w/o rollback (ticket: reserved → purchased)
– …
• Scientific WF
– data-in and data-out of an analysis step are not the same object!
– dataflow, data-oriented (cf. AVS/Express, Khoros, …)
– re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
– data integration & semantic typing as part of SWF framework
– …

Page 11

Scientific Workflows: Some Findings

• More dataflow than (business) workflow
– but some branching, looping, merging, …
– not: documents/objects undergoing modifications
– instead often: dataset-out = analysis(dataset-in)
• Need for “programming extension”
– Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (compute/transform alternations)
• Need for rich user interaction & workflow steering:
– pause / revise / resume
– select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
– → data provenance (“virtual data” concept)

Page 12

Problem Solving Environments

• SCIRun: a dynamic dataflow system (in the Ptolemy sense) → separate presentation

Page 13

SWF vs Distributed Computing

• Distributed Computing (e.g. a la Condor-(G))
– Batch oriented
– Transparent distributed computing (“remote Unix/Java”; standard/Java universes in Condor)
– HPC resource allocation & scheduling
• SWF
– Often highly interactive for decision making/steering of the WF and visualization (data analysis)
– Transparent data access (Grid) and integration (database mediation & semantic extensions)
– Desktop metaphor; often (but not always!) light-weight web service invocation

Page 14

Dataflow Process Networks and Ptolemy-II

see! try! read!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Page 15

Dataflow Process Networks: Why Ptolemy-II?

• PtII Objective:
– “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Data & Process oriented:
– Dataflow process networks
• Natural Data Streaming Support
• Pragmatics
– mature, actively maintained, open source system
– leverage “sister projects” activities (e.g. SEEK)

Page 16

Ptolemy-II Type System

Page 17

Scientific Workflows = Dataflow Process Networks + ?

• X = …
– Grid extensions:
  • Actors as web/grid services
  • 3rd party data transfer, high-throughput data streaming
  • Data and service repositories, discovery
– Extended type system (structural & semantic extensions)
– Programming extensions (declarative/FP)
– Rich user interactions/workflow steering
– Rich data transformations (compute/transform alternations)
– Data provenance
  • (semi-)automatic meta-data creation
– …
• …
– …
– (minus) upcoming Ptolemy-II extensions (PtII, SEEK, …)!
– The slower we are, the less we have to do ourselves ;-)

Kepler = Ptolemy-II + X

Page 18

X includes: The customer is always right …

• Intuitive …
– component composition
– data binding
– execution monitoring
• Reusability of …
– Generic components (actors)
– Derived data products
• Application specific packaging and “branding”
• Transparent “gridification”

Page 19

Some specific tasks for Kepler
($ DONE (or almost ;-), % ONGOING, * NEW)

• User interaction, workflow steering
– $ Pause/revise/resume
– % BrowserUI actor (browser as a 0-learning display and selection tool)
• Distributed execution
– % Dynamically port-specializing WSDL actor
– * Dynamically specializing Grid service actor
• Port & actor type extensions (SEEK leverage)
– * Structural types (XML Schema)
– * Semantic types (OWL) incl. unit types w/ automatic conversion
• Programming extensions
– % Data transformation actors (XSLT, XQuery, Python, Perl, …)
– * map, zip, zipWith, …, loop, switch “patterns”
• Specialized Data Sources
– $ EML (SEEK)
– % MS Access (GEON), * JDBC
– * XML, * NetCDF, …

Page 20

Some specific tasks for Kepler (all NEW)

• Design & develop transparent, Grid-enabled PNs:
– Communication protocol details
– Grid-actor extensions and/or
– Grid-Process Network director (G-PN)
– Host/Source-location becomes actor parameter
  • add “active-inline” parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@{src-loc|catalog-loc})
• Activity Monitoring
– Add “activity status” display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!)
• Register & Deploy mechanism
– Actor/Data/Workflow repository (=composite actors)
– Shows up as (config’able) actor library
– OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ)
  • http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf
• MOML extensions
– Also separate language?

Page 21

Example: Grid-enabling
(again: SEEK leverage opportunity)

Page 22

Dataflow Process Networks

• Synchronous Dataflow Network (SDF)
– Statically schedulable single-threaded dataflow
  • Can execute multi-threaded, but the firing-sequence is known in advance
– Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
– Multi-threaded dynamically scheduled dataflow
– More expressive than SDF (dynamic token rate prevents static scheduling)
– Natural streaming model
• Other Execution Models (“Domains”)
– Implemented through different “Directors”

[Figure: two actors connected through typed i/o ports by a FIFO channel; annotation: advanced push/pull]
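
The difference between fixed-rate (SDF-style) and data-dependent (PN-style) actors can be sketched with lazy lists standing in for FIFO channels. This is only a single-threaded illustration of the dataflow semantics, not Ptolemy-II code: the names `Stream`, `counter`, `pairSum`, and `multiplesOf` are made up for this example, and no director or thread scheduling is modelled.

```haskell
module Main where

-- A channel is modelled as a lazy list: tokens are consumed in FIFO order,
-- and laziness lets a downstream actor start before the upstream one ends.
type Stream a = [a]

-- Source actor (hypothetical): emits the tokens 1, 2, 3, ... forever.
counter :: Stream Int
counter = [1 ..]

-- SDF-style actor: every firing consumes exactly 2 tokens and produces 1,
-- so the rates are fixed and a firing sequence could be computed in advance.
pairSum :: Stream Int -> Stream Int
pairSum (x : y : rest) = (x + y) : pairSum rest
pairSum _              = []

-- PN-style actor: how many tokens come out depends on the data itself,
-- which is what rules out a static schedule.
multiplesOf :: Int -> Stream Int -> Stream Int
multiplesOf k = filter (\x -> x `mod` k == 0)

-- Wiring actors together is ordinary function composition.
network :: Stream Int
network = (multiplesOf 3 . pairSum) counter

main :: IO ()
main = print (take 5 network)  -- prints [3,15,27,39,51]
```

Because the source is infinite, the result only terminates thanks to demand-driven evaluation, which is one way to read the "natural streaming model" bullet.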

Page 23

Transparently Grid-Enabling PtII: Handles

1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)

Example: &X = “GA.17”, *X = <some_huge_file>

[Figure: actors A and B in PtII space, grid hosts GA and GB in Grid space; arrows numbered 1 to 7 correspond to the steps above]

Logical token transfer (3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion.
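
The seven steps can be mimicked in a few lines to show that only the small handle travels between the actors while the payload moves between grid hosts. This is a pure, in-memory sketch under stated assumptions: `Store`, `getHandle`, and `execHandle` are hypothetical names chosen to echo the slide's get_handle/exec_handle, and no real Grid or Ptolemy-II API is used.

```haskell
module Main where

-- A handle is a small reference such as "GA.17"; the payload stays on the grid.
type Handle  = String
type Payload = String
type Store   = [(Handle, Payload)]  -- data held by one grid host

-- Steps 1-2: actor A asks its grid host for a handle to output number n.
getHandle :: String -> Int -> Handle
getHandle host n = host ++ "." ++ show n

-- Steps 4-7: B passes the handle to GB, GB pulls the payload from GA's store
-- and keeps a local copy. Nothing signals that the handle was unknown.
execHandle :: Handle -> Store -> Store -> Maybe Store
execHandle h gaStore gbStore = do
  payload <- lookup h gaStore
  pure ((h, payload) : gbStore)

main :: IO ()
main = do
  let handle  = getHandle "GA" 17               -- &X = "GA.17"
      gaStore = [(handle, "<some_huge_file>")]  -- *X lives on GA
      token   = handle                          -- step 3: A sends only &X to B
  print token
  print (execHandle token gaStore [])
```

The token passed in step 3 is a short string regardless of how large the file behind it is, which is the point of the handle indirection.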

Page 24

Transparently Grid-Enabling PtII

• Different phases
– Register designed WF (could include external validation service)
– Find suitable grid service hosts for actors
– Pre-stage execution
– Execute
– Archive execution log
• Implementation choices:
– Grid-actors (no change of director necessary)
– and/or Grid-(PN)-director (also need to change actors!?)
– Add grid service host id as actor parameter: A@GA
– Similar for data: myDB@GA

Page 25

Programming Extensions
(some lessons from SciDAC/SSDBM demo)

Page 26

Promoter Identification Workflow in Ptolemy-II (SSDBM’03)

[Figure: the PIW as built in Ptolemy-II, annotated with: hand-crafted control solution (also forces sequential execution!); designed to fit; hand-crafted Web-service actor; complex backward control-flow; no data transformations available]

Page 27

Promoter Identification Workflow in FP

genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
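
The slide gives only type signatures for the services, so the pipeline cannot be run as shown. A minimal runnable sketch follows, assuming stub implementations: every function body and every `newtype` below is a placeholder invented for illustration, not the behaviour of GenBank or BLAST. The `transfac` step (d5) is left out because the slide's printed output does not depend on it.

```haskell
module Main where

-- Wrapper types matching the names in the slide's signatures.
newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String

-- Stub services (assumptions): each returns a tagged string, not real data.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("gene-" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p1"), Pid (s ++ "/p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq (p ++ "-seq")

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion (s ++ "-region")

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ ": " ++ r ++ "\n"

-- The workflow as one function from a gene id to the report text,
-- using the same intermediate names as the slide.
piw :: GeneId -> String
piw d0 = concat d7
  where
    d1 = genBankG d0            -- gene sequence
    d2 = blast d1               -- candidate promoter ids
    d3 = map genBankP d2        -- promoter sequences
    d4 = map promoterRegion d3  -- promoter regions
    d6 = zip d2 d4              -- pairs promoter-id/region
    d7 = map gpr2str d6         -- one report line per pair

main :: IO ()
main = putStr (piw (Gid "7"))
```

With these stubs the program prints one line per promoter, `gene-7/p1: gene-7/p1-seq-region` followed by `gene-7/p2: gene-7/p2-seq-region`.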

Page 28

Simplified Process Network PIW

• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
– no control-flow spaghetti
– data-intensive apps
– free concurrent execution
– free type checking
– automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Figure: simplified PIW process network, annotated with: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable sub-workflow piw(GeneId)]

Page 29

Optimization by Declarative Rewriting I

• PIW as a declarative, referentially transparent functional process
– optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
– Technical report & PIW specification in Haskell

[Figure: rewritten workflow, annotated with: map(f o g) instead of map(f) o map(g); combination of map and zip]

http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
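
The rewriting law map(f o g) = map(f) o map(g) can be checked on a toy input. The two step functions `normalize` and `score` are hypothetical stand-ins for per-item analysis steps; the sketch only illustrates the law and is not taken from the technical report.

```haskell
module Main where

-- Two hypothetical per-item analysis steps.
normalize :: Int -> Int
normalize x = x - 10

score :: Int -> Int
score x = x * x

-- Unfused: two passes, with an intermediate list between them.
twoPass :: [Int] -> [Int]
twoPass = map score . map normalize

-- Fused: a single pass with no intermediate list.
-- Equal to twoPass by the law  map f . map g = map (f . g).
onePass :: [Int] -> [Int]
onePass = map (score . normalize)

main :: IO ()
main = do
  let input = [8, 10, 13]
  print (twoPass input)                   -- prints [4,0,9]
  print (twoPass input == onePass input)  -- prints True
```

The rewrite is safe only because both steps are pure functions, which is what "referentially transparent" buys: in a workflow setting the fused form would correspond to one actor firing per item instead of two.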

Page 30

Optimizing II: Streams & Pipelines

• Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g → mapS (f • g)

Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

Page 31

Data Transformation Actors: Our Approach (proposal)

• Manual
– XQuery, XSLT, Perl, Python, … transformation actor (development)
• (Semi-)automatic
– Semantic-type guided transformation generation (research)
• Also: Web Service Composition is …
– … a hot topic
– … a reincarnation of many “old” ideas (e.g., AI-style planning born-again; functional composition; query composition; …)
– … a separate topic

Page 32

Contrast to Existing Dataflow Systems
Here: Commercial

Page 33

Workflow and distributed computation grid created with Kensington Discovery Edition from InforSense.

Page 34

F I N: Words to/from the Wise

FYI: Flow-based programming has been re-discovered/re-invented several times by different communities. Here is an “IBM practitioner’s view”:

– Flow-based Programming, http://www.jpaulmorrison.com/fbp/… In "Flow-Based Programming" (FBP), applications are defined as networks of "black box" processes, which exchange data across predefined connections. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. It is thus naturally component-oriented. To describe this capability, the distinguished IBM engineer, Nate Edwards, coined the term "configurable modularity", which he calls the basis of all true engineered systems. When using FBP, the application developer works with flows of data, being processed asynchronously, rather than the conventional single hierarchy of sequential, procedural code.   It is thus a good fit with multiprocessor computers, and also with modern embedded software. In many ways, an FBP application resembles more closely a real-life factory, where items travel from station to station, undergoing various transformations.  Think of a soft drink bottling factory, where bottles are filled at one station, capped at the next and labelled at yet another one.  FBP is therefore highly visual: it is quite hard to work with an FBP application without having the picture laid out on one's desk, or up on a screen!  For an example, see Sample DrawFlow Diagram. Strangely though, in spite of being at the leading edge of application development, it is also simple enough that trainee programmers can pick it up, and it is a much better match with the primitives of data processing than the conventional primitives of procedural languages. The key, of course (and perhaps the reason why it hasn't caught on more widely), is that it involves a significant paradigm shift that changes the way you look at programming, and once you have made this transition, you find you can never go back! FBP seems to dovetail neatly with a concept that I call "smart data". There is a section on this in stuff about the author. 
A new web page on this topic has just been uploaded - see "Smart Data" and Business Data Types - and we will be publishing more as it develops. …