Scientific Workflows - rafaelt/projects/OMNIDATA-WEB... · Cierzo Siokia + The workflow concept


  • +

    Scientific Workflows

    Programming Data-Oriented Applications in large-scale, distributed and heterogeneous resources

    2MoRo, September 14th, 2010

  • +Outline

    OMNI-DATA consortium

    The workflow concept

    Business workflows vs. scientific workflows

    Scientific workflow features

    Achievements

    Our research on scientific workflows

    - Fault tolerance at workflow level

    - Workflow performance prediction

  • +OMNI-DATA consortium

    France

    University of Pau, LIUPPA: 6 persons

    Company 2MoRo

    Spain

    University of Zaragoza

    - I3A: 8 persons

    - BIFI: 6 persons

    Company Cierzo

    Siokia

  • +The workflow concept

    The workflow concept has existed for decades

    A workflow is a model to represent real work for further assessment

    Workflows have also been used in industry to manage business processes

    Workflows are designed to achieve processing intents of some sort, such as physical transformation, service provision, or information processing.

  • +Business workflow example

    Enterprise Data Model Content: Accounting Subject Area by Peter Aiken Virginia Commonwealth University

    Workflow Engine system

  • +Other kinds of processes

    Scientific experiments are based on the scientific method, which consists of the collection of data through observation and experimentation, and the formulation and testing of hypotheses.

    Define problems → Experiments → Data analysis → Discovery

    Activities are:

    - Iterative, dynamic, and human steered

    Taken from Dr Zhao, Scientific Workflows for eScience

  • +Business vs. Scientific workflows

    Scientific and business workflows may not always be distinguishable: they share common characteristics.

    However, scientific research requires flexible design and exploration capabilities that appear to depart significantly from the more prescriptive use of workflows in business.

    Scientific workflows also need:

    - to ensure repeatable experiments

    - to support a variety and heterogeneity of data

  • +Scientific workflows

    Data-intensive analysis: in the experiments, heterogeneous databases are extensively accessed.

    Many large-scale scientific computations of interest are long-running, lasting weeks if not months.

    They can also involve much human intervention.

  • +Scientific workflows

    The computing environments are heterogeneous, including supercomputers, networks of workstations and instruments.

    - users typically want some predictability of the execution time

    Scientific activities require the exploration of variants:

    - experimentation with alternative settings

    - configuration of experiments with different parameters

    The computing environments are distributed and operated by third parties:

    - fault tolerance is a must

    - flexibility in the interactions
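
    The exploration of variants mentioned above can be illustrated with a small parameter sweep. This is only a sketch: the parameter names, their values, and the `run_experiment` placeholder are hypothetical and do not come from the slides.

```python
from itertools import product


def variants(settings):
    """Yield one configuration dict per combination of parameter values."""
    names = list(settings)
    for values in product(*(settings[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical experiment parameters (assumption, for illustration only).
SETTINGS = {
    "temperature": [280, 300],
    "solver": ["fast", "accurate"],
    "steps": [1000],
}


def run_experiment(config):
    # Placeholder for a real workflow run; it only describes the configuration.
    return f"{config['solver']}@{config['temperature']}K x{config['steps']}"


# One result per variant, keyed by the parameter values used.
results = {tuple(c.values()): run_experiment(c) for c in variants(SETTINGS)}
```

    Keeping each configuration next to its result also helps with repeatability, since every run can be traced back to the exact settings that produced it.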

  • +Scientific Workflows

    [Figure: tiered architecture of a scientific workflow system. Tiers: workflow tier, middleware tier, resource tier. Other labels: workflow GUI composition tool, workflow specification, workflow engine, build-time, run-time, middleware, WSRF, cluster, WS, instrument.]

  • +Scientific workflow achievements

    Some examples:

    Life sciences: bioinformatics

    Astronomy: Montage

    Data mining

    Map / reduce based workflows

  • +Scientific workflow achievements

    Life sciences have terabytes of heterogeneous data and tools on the Web that need to be integrated in order to understand DNA, genes, genomes, proteins, biological pathways, etc.

    858 public databases

    150+ public web servers

    Between 2,000 and 3,000 public services (e.g. sequence analysis programs like BLAST that use Web Service standards like WSDL and SOAP)

    All these databases, servers and services allow us to perform many different sorts of computations on DNA, RNA and Proteins

    Taken from slides by Dr Duncan Hull myGrid project

  • +Scientific workflow achievements

    Taken from slides by Dr Duncan Hull myGrid project

  • +Taverna workflow engine: composition of WS

  • +Astronomy: Montage

    Montage is a centralized tool for assembling images of the universe.

    The scientific workflow system Pegasus can build enormous workflows expressing how a very large number of images can be combined to form a bigger image.

    - performance speed-up

    - experiments that could not be done otherwise

    A mosaic of M104 (also known as the Sombrero Galaxy) taken from http://montage.ipac.caltech.edu/

  • +Data mining

    Data mining is the process of extracting patterns from data. It is used in marketing, surveillance, fraud detection, and scientific discovery.

    Weka contains a collection of visualization tools and algorithms for data mining.

    Weka4WS extends Weka to support distributed data mining in a Grid environment and speeds up the execution of Weka workflows.

  • +Map / reduce workflows

    Paradigm for processing huge datasets

    Map function m: a dataset is automatically partitioned and function m is applied to the slices (in parallel)

    Reduce function r: the outputs from the map are combined by function r

    Some scientific workflow systems, such as Kepler, integrate this paradigm into pipelines of
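
    The map and reduce steps described on this slide can be sketched in a few lines. This is a minimal single-machine illustration and not how Kepler or any particular engine implements the paradigm. The names `partition`, `map_reduce`, `sum_of_squares`, and `add` are assumptions made for the example, and a thread pool stands in for the distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(dataset, n_slices):
    """Split the dataset into at most n_slices contiguous slices."""
    size = max(1, -(-len(dataset) // n_slices))  # ceiling division
    return [dataset[i:i + size] for i in range(0, len(dataset), size)]


def map_reduce(dataset, m, r, n_slices=4):
    """Apply m to every slice in parallel, then combine the outputs with r."""
    slices = partition(dataset, n_slices)
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(m, slices))
    # Note: reduce raises TypeError on an empty dataset (no initial value given).
    return reduce(r, mapped)


def sum_of_squares(chunk):
    """Map function m: runs on one slice."""
    return sum(x * x for x in chunk)


def add(a, b):
    """Reduce function r: combines two partial results."""
    return a + b


total = map_reduce(list(range(10)), sum_of_squares, add)
```

    Because `add` is associative, the partial results can be combined in any grouping, which is what lets the map outputs come from independent workers.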

  • +Our research on scientific workflows

    Workflow fault tolerance

    Performance prediction

    Autonomic computing for workflows

  • +Background: Reference nets

    Formal tool for representing processes

    Rapid prototyping systems

    Suitable for the characteristics of scientific workflows:

    - ease the implementation of dynamic, flexible architectural requirements

    - express dynamism and change in workflow specifications in a formal way

    - workflow specifications can be easily used for performance prediction models
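
    As background for readers unfamiliar with Petri nets, the token game underlying such formalisms can be sketched as follows. This is an ordinary place/transition net only: Reference nets additionally let tokens refer to other net instances, which this sketch does not model. The class name `PetriNet` and the place and transition names are assumptions chosen for illustration.

```python
from collections import Counter


class PetriNet:
    """A minimal place/transition net: a marking plus named transitions."""

    def __init__(self, marking):
        self.marking = Counter(marking)  # place -> number of tokens
        self.transitions = {}            # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self, name):
        """A transition is enabled when every input place holds enough tokens."""
        inputs, _ = self.transitions[name]
        return all(self.marking[place] >= n for place, n in inputs.items())

    def fire(self, name):
        """Consume tokens from the input places and produce them on the outputs."""
        if not self.enabled(name):
            raise RuntimeError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking -= inputs
        self.marking += outputs


# Hypothetical two-step workflow: begin consumes the arguments, end yields the result.
net = PetriNet({"args": 1})
net.add_transition("t1", inputs=["args"], outputs=["running"])
net.add_transition("t2", inputs=["running"], outputs=["result"])
net.fire("t1")
net.fire("t2")
```

    After both firings, the single token has moved from the `args` place to the `result` place, and neither transition is enabled any longer.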

  • +Reference nets

    [Figure: example Reference net. Labels: Input data (args), Output data node (result), transitions t1 and t2, net reference w. Inscriptions: w:new WorkflowModel; w:begin(args); w:end(result).]

  • +Architectural Issues

    Workflow engine

    Tuplespace

    SoapAdapter

    RMIAdapter

    CondorAdapter

    Message Broker

    Internet

    Resources
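
    The slide only names the components, so the following is a guess at how a tuplespace can decouple a workflow engine from middleware adapters, not a description of the actual system. The tuple layout, the `TupleSpace` class, and the `soap_adapter` function are all hypothetical.

```python
import threading


class TupleSpace:
    """A tiny in-memory tuplespace with write and pattern-matching take."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def write(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _find(self, pattern):
        # None fields in the pattern act as wildcards.
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == field for p, field in zip(pattern, tup)
            ):
                return tup
        return None

    def take(self, pattern, timeout=None):
        """Remove and return the first matching tuple, or None on timeout."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            )
            if not found:
                return None
            tup = self._find(pattern)
            self._tuples.remove(tup)
            return tup


def soap_adapter(space):
    """Hypothetical adapter: takes one task tagged 'soap' and posts a result."""
    task = space.take(("task", "soap", None, None), timeout=1.0)
    if task is not None:
        _, _, task_id, payload = task
        space.write(("result", task_id, f"soap call done: {payload}"))


# The engine writes tasks without knowing which adapter will serve them.
space = TupleSpace()
space.write(("task", "soap", 1, "blast(seq)"))
space.write(("task", "condor", 2, "render.sub"))
soap_adapter(space)
result = space.take(("result", 1, None))
```

    In this sketch the engine and the adapters share only the tuple format, so an adapter for another middleware could be added without changing the engine.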

  • +Expressing Scientific Workflows

  • Failure! Exception raised!

  • +Workflow Performance Prediction

    The Reference net-based workflow representations are annotated / parameterised with time

    We propose 3 different time profiles for feeding workflows:

    - constant values

    - stochastic values

    - real values obtained from experimentation / instrumentation

    By simulation, we obtain performance predictions
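
    The idea of feeding a workflow model with time profiles and simulating it can be sketched as below. This is not the Reference net-based simulation used by the authors: it is a plain Monte Carlo estimate over a task graph, assuming unlimited resources. The example workflow, the exponential distribution for stochastic values, and all durations are made up for illustration.

```python
import random
import statistics


# Three kinds of time profile. Each returns a function that draws one duration.
def constant(value):
    return lambda rng: value


def stochastic(mean):
    # Assumption: exponentially distributed durations with the given mean.
    return lambda rng: rng.expovariate(1.0 / mean)


def measured(samples):
    # Resample durations observed in earlier, instrumented runs.
    return lambda rng: rng.choice(samples)


# Hypothetical workflow: task -> (time profile, predecessors), in topological order.
WORKFLOW = {
    "fetch": (constant(5.0), []),
    "align": (stochastic(20.0), ["fetch"]),
    "filter": (measured([8.0, 9.5, 12.0]), ["fetch"]),
    "merge": (constant(2.0), ["align", "filter"]),
}


def simulate_once(workflow, rng):
    """Return the makespan of one simulated run."""
    finish = {}
    for task, (profile, preds) in workflow.items():
        start = max((finish[p] for p in preds), default=0.0)
        finish[task] = start + profile(rng)
    return max(finish.values())


def predict(workflow, runs=1000, seed=0):
    """Estimate the mean and worst-case makespan over many simulated runs."""
    rng = random.Random(seed)
    samples = [simulate_once(workflow, rng) for _ in range(runs)]
    return statistics.mean(samples), max(samples)


mean_makespan, worst_makespan = predict(WORKFLOW)
```

    With constant profiles only, the simulation reduces to a critical-path computation. The stochastic and measured profiles turn the prediction into a distribution, from which the mean or the worst case can be read.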