Session 37 - Intro to Workflows, APIs and Semantics

Introduction to Workflows, APIs and Semantics

Session 37. July 13th, 2009

Oscar Corcho (Universidad Politécnica de Madrid)

Based on slides from all the presenters in the following two days

Work distributed under the license Creative Commons Attribution-Noncommercial-Share Alike 3.0

Themes of the Second Week

Date          Theme                                                        Technology
Mon 13 July   How to solve my problem?
Tue 14 July   Higher level APIs: OGSA-DAI, SAGA and metadata management    SAGA, OGSA-DAI, Grid SAM
Wed 15 July   Workflows                                                    P-GRADE, Semantic Metadata
Thu 16 July   Integrating Practical                                        All
Fri 17 July   Cloud Computing (lecture)

• Principles of job submission and execution management
• Principles of high-throughput computing
• Principles of service-oriented architecture
• Principles of distributed data management
• Principles of using distributed and high performance systems
• Higher level APIs: OGSA-DAI, SAGA and metadata management
• Workflows

Motivation
• Grids are:
  – Dynamic:
    • Versions, updates, new resources...
  – Heterogeneous:
    • Operating systems, libraries, software stack
    • Middleware service versions and semantics
    • Administrative policies: access, usage, upgrade
  – Complex:
    • Providing a production-level service with high QoS is non-trivial
    • Complexity is derived from the above as well as being inherent


In general…
• As described by Steven this morning, operating Grids is still an effort-consuming task, and it is still somewhat difficult to develop, program & deploy Grid applications using the existing Grid middleware
• But as you have also seen during last week (and in Morris’ presentation today), there are many commonalities among heterogeneous middleware
• There is a need for:
  – Programmatic approaches that provide common grid functionality at a correct level of abstraction for applications
  – Ability to hide underlying complexity of infrastructure, varying semantics, heterogeneity and changes from the application developer
  – Improved data access and integration mechanisms
  – Traceable, repeatable analyses of e-Science experiments
  – Graphical modelling languages to ease Grid application development

e-Science Approach Interoperability
• Increasing complexity of e-science applications that embrace multiple physical models (i.e. multi-physics) & larger scale
  – Creating a steadily growing demand for compute power
  – Demand for a ‘United Federation of world-wide Grids’

Balatonfüred, Hungary, 6th-18th July 2008

[Figure: five approaches for e-Science applications, layered on Grid Middleware: I. Simple Scripts & Control; II. Scientific Application plug-ins; III. Complex Workflows; IV. Interactive Access; V. Interoperability (Grid and other Grid type)]

[2] Morris Riedel et al., ‘Classification of Different Approaches for e-Science Applications in Next Generation Infrastructures’, Int. Conference on e-Science 2008, Indianapolis, Indiana

SAGA one-slide summary
• Simple API for Grid Applications – SAGA
  – Provides a simple and usable programmatic interface that can be widely adopted for the development of applications for the grid
  – Simplicity (80:20 restricted scope)
    • Easy to use, install, administer and maintain
  – Uniformity
    • Provides support for different application programming languages as well as consistent semantics and style for different Grid functionality
  – Scalability
    • Contains mechanisms for the same application (source) code to run on a variety of systems ranging from laptops to HPC resources
  – Genericity
    • Adds support for different grid middleware
  – Modularity
    • Provides a framework that is easily extendable
• SAGA is not…
  – Middleware
  – A service management interface
  – A way to hide the resources themselves (remote files, jobs); it hides only their details


Example: SAGA Job submission

Example: Copy a File (Globus)

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include "globus_gass_copy.h"

/* my_callback (the completion callback) is assumed to be defined elsewhere. */

int copy_file (char const* source, char const* target)
{
  globus_url_t                        source_url;
  globus_io_handle_t                  dest_io_handle;
  globus_ftp_client_operationattr_t   source_ftp_attr;
  globus_result_t                     result;
  globus_gass_transfer_requestattr_t  source_gass_attr;
  globus_gass_copy_attr_t             source_gass_copy_attr;
  globus_gass_copy_handle_t           gass_copy_handle;
  globus_gass_copy_handleattr_t       gass_copy_handleattr;
  globus_ftp_client_handleattr_t      ftp_handleattr;
  globus_io_attr_t                    io_attr;
  int                                 output_file = -1;

  /* parse the source URL and check that its scheme is supported */
  if ( globus_url_parse (source, &source_url) != GLOBUS_SUCCESS ) {
    printf ("can not parse source URL \"%s\"\n", source);
    return (-1);
  }

  if ( source_url.scheme_type != GLOBUS_URL_SCHEME_GSIFTP &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_FTP    &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_HTTP   &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_HTTPS  ) {
    printf ("can not copy from %s - wrong prot\n", source);
    return (-1);
  }

  /* initialise the copy, FTP and I/O attributes and the copy handle */
  globus_gass_copy_handleattr_init  (&gass_copy_handleattr);
  globus_gass_copy_attr_init        (&source_gass_copy_attr);
  globus_ftp_client_handleattr_init (&ftp_handleattr);
  globus_io_fileattr_init           (&io_attr);

  globus_gass_copy_attr_set_io             (&source_gass_copy_attr, &io_attr);
  globus_gass_copy_handleattr_set_ftp_attr (&gass_copy_handleattr, &ftp_handleattr);
  globus_gass_copy_handle_init             (&gass_copy_handle, &gass_copy_handleattr);

  /* choose FTP or GASS transfer attributes depending on the scheme */
  if ( source_url.scheme_type == GLOBUS_URL_SCHEME_GSIFTP ||
       source_url.scheme_type == GLOBUS_URL_SCHEME_FTP    ) {
    globus_ftp_client_operationattr_init (&source_ftp_attr);
    globus_gass_copy_attr_set_ftp (&source_gass_copy_attr, &source_ftp_attr);
  }
  else {
    globus_gass_transfer_requestattr_init (&source_gass_attr, source_url.scheme);
    globus_gass_copy_attr_set_gass (&source_gass_copy_attr, &source_gass_attr);
  }

  output_file = globus_libc_open ((char*) target,
                                  O_WRONLY | O_TRUNC | O_CREAT,
                                  S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
  if ( output_file == -1 ) {
    printf ("could not open the file \"%s\"\n", target);
    return (-1);
  }

  /* convert the output file descriptor to a globus_io_handle */
  if ( globus_io_file_posix_convert (output_file, 0, &dest_io_handle)
       != GLOBUS_SUCCESS ) {
    printf ("Error converting the file handle\n");
    return (-1);
  }

  /* register the asynchronous copy from the source URL to the file handle */
  result = globus_gass_copy_register_url_to_handle (
             &gass_copy_handle, (char*) source,
             &source_gass_copy_attr, &dest_io_handle,
             my_callback, NULL);
  if ( result != GLOBUS_SUCCESS ) {
    printf ("error: %s\n", globus_object_printable_to_string (globus_error_get (result)));
    return (-1);
  }

  globus_url_destroy (&source_url);
  return (0);
}


Example: Copy a File (SAGA)

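#include <iostream>
#include <string>
#include <saga/saga.hpp>

void copy_file(std::string source_url, std::string target_url)
{
  try {
    // open the source file through SAGA and copy it to the target URL
    saga::file f(source_url);
    f.copy(target_url);
  }
  catch (saga::exception const &e) {
    // report any middleware or I/O failure
    std::cerr << e.what() << std::endl;
  }
}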

The interface is simple, and the function calls stay the same whichever Grid middleware sits underneath

Workflow one-slide summary
• Build distributed applications through orchestration of multiple services
  – Allows larger applications to be composed from individual application components
  – The components can be independent or connected by some control flow / data flow dependencies
  – Scaled-up execution over several computational resources
• Integration of multiple teams involved (collaborative work)
• Unit of reuse: e-science requires traceable, repeatable analysis
  – Provide automation: reproducibility of scientific analyses and processes is at the core of the scientific method
  – Support easy analysis modifications
  – Sharing workflows is an essential element of education, and acceleration of knowledge dissemination
  – Allows capture and generation of provenance information
• Ease the use of grids: graphical representation
  – Capture individual data transformation and analysis steps

NSF Workshop on the Challenges of Scientific Workflows, 2006, www.isi.edu/nsf-workflows06
Y. Gil, E. Deelman et al., Examining the Challenges of Scientific Workflows. IEEE Computer, 12/2007

Workflow
• The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules to achieve, or contribute to, an overall business goal.
• A workflow management system (WFMS) is the software that does it

www.wfmc.org

Workflow Reference Model, 19/11/1998

What does a typical Grid WFMS provide?
• A level of abstraction above grid processes
  – gridftp, lcg-cr, lfc-mkdir, ...
  – condor_submit, globus-job-run, glite-wms-job-submit, ...
  – lcg-infosites, ...
• A level of abstraction above “legacy processes”
  – SQL read/write
  – HTTP file transfer, …
• Mapping and execution of tasks onto grid resources
  – Submission of jobs
  – Invocation of (Web) services
  – Manage data
  – Catalog intermediate and final data products
• Improve successful application execution
• Improve application performance
• Provide provenance tracking capabilities

http://www.gridworkflow.org/

What does a typical Grid WFMS provide?

Source: Jia Yu and Rajkumar Buyya: A Taxonomy of Workflow Management Systems for Grid Computing, Journal of Grid Computing, Volume 3, Numbers 3-4 / September, 2005

Abstract Workflow
• Describes your workflow at a logical level
• Site Independent
• Captures just the computation that the user wants to do

Executable Workflow
• Describes your workflow in terms of physical files and paths
• Site Specific
• Has additional jobs for data movement etc.
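The step from an abstract to an executable workflow can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not how any particular WFMS does it: the types AbstractTask, ExecutableJob and ReplicaCatalog, the function map_to_site, and the placeholder "transfer" command are all hypothetical names introduced here.

```cpp
#include <map>
#include <string>
#include <vector>

// Abstract task: names a transformation and logical files only (site independent).
struct AbstractTask {
    std::string transformation;
    std::vector<std::string> inputs;  // logical file names
};

// Executable job: a concrete command line bound to one site (site specific).
struct ExecutableJob {
    std::string site;
    std::string command;
};

// Hypothetical replica catalog: logical file name -> physical URL.
typedef std::map<std::string, std::string> ReplicaCatalog;

// Maps one abstract task to executable jobs for `site`: one data-movement job
// per input file, followed by the compute job that reads the staged copies.
// "transfer" is a placeholder command, not a real grid tool.
std::vector<ExecutableJob> map_to_site(const AbstractTask& task,
                                       const ReplicaCatalog& replicas,
                                       const std::string& site,
                                       const std::string& scratch_dir) {
    std::vector<ExecutableJob> jobs;
    std::string compute = task.transformation;
    for (const auto& lfn : task.inputs) {
        const std::string local = scratch_dir + "/" + lfn;
        // .at() throws std::out_of_range if the logical file is not registered.
        jobs.push_back({site, "transfer " + replicas.at(lfn) + " " + local});
        compute += " " + local;
    }
    jobs.push_back({site, compute});
    return jobs;
}
```

The abstract task mentions no paths or sites; those appear only in the returned jobs, together with the additional data-movement jobs.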

What does a typical workflow consist of?

• Dataflow graph
• Activities
  – Definition of jobs
  – Specification of services
• Data channels
  – Data transfer
  – Coordination
• Acyclic (DAG) / cyclic
• Conditional statements
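For the acyclic case, a valid execution order of the activities follows directly from the data channels. The sketch below uses Kahn's topological sort; the Workflow struct and the function execution_order are hypothetical names chosen for illustration, and the code assumes that activity names are unique and that every channel endpoint is a declared activity.

```cpp
#include <map>
#include <queue>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A workflow as a dataflow graph: activities are nodes, data channels are
// directed edges (producer -> consumer).
struct Workflow {
    std::vector<std::string> activities;
    std::vector<std::pair<std::string, std::string>> channels;
};

// Returns one valid execution order (Kahn's algorithm).
// Throws std::runtime_error if the graph contains a cycle.
std::vector<std::string> execution_order(const Workflow& wf) {
    std::map<std::string, int> pending;  // number of unfinished predecessors
    std::map<std::string, std::vector<std::string>> successors;
    for (const auto& a : wf.activities) pending[a] = 0;
    for (const auto& c : wf.channels) {
        successors[c.first].push_back(c.second);
        ++pending[c.second];
    }

    std::queue<std::string> ready;  // activities whose inputs are all available
    for (const auto& a : wf.activities)
        if (pending[a] == 0) ready.push(a);

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string a = ready.front();
        ready.pop();
        order.push_back(a);
        for (const auto& next : successors[a])
            if (--pending[next] == 0) ready.push(next);
    }
    if (order.size() != pending.size())
        throw std::runtime_error("workflow contains a cycle");
    return order;
}
```

A workflow with cycles or conditional statements needs an engine that evaluates control flow at run time, so a static ordering like this one is not enough there.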

Workflow Lifecycle

[Figure: workflow lifecycle with four stages: Creation, Mapping, Scheduling/Execution (distributed) and Reuse. A Workflow Template is populated with data to give a Workflow Instance, which is mapped to available resources to give an Executable Workflow. Executing it on compute, storage and network resources yields data products plus data, metadata and provenance information. Workflows are adapted and modified through workflow and component libraries. Inputs: data and metadata catalogs; resource and application component descriptions.]

Data lifecycle in workflows

[Figure: Data Lifecycle in a Workflow Environment. Data phases: Data Discovery, Data Analysis Setup, Data Processing, Derived Data and Provenance Archival. Workflow phases: Workflow Creation, Workflow Mapping and Execution, Workflow Execution, Workflow Reuse. Supporting services: Metadata Catalogs, Provenance Catalogs, Component Libraries, Workflow Template Libraries, Data Replica Catalogs, Data Movement Services, Software Catalogs.]

P-GRADE one-slide summary
• P-GRADE portal desiderata
  – Hide the complexity of the underlying grid middleware
  – Provide a high-level graphical user interface that is easy to use for e-scientists
  – Support many different grid programming approaches:
    • Simple Scripts & Control (sequential and MPI job execution)
    • Scientific Application Plug-ins (based on GEMLCA)
    • Complex Workflows
    • Parameter sweep applications: both on job and workflow level
    • Interoperability: transparent access to grids based on different middleware technology
  – Support three levels of parallelism
• History
  – Started in the Hungarian SuperComputing Grid project in 2003
  – http://portal.p-grade.hu/
  – https://sourceforge.net/projects/pgportal/

Workflow sharing: MyExperiment

http://www.myexperiment.org/

Data access and integration

Researcher wants to obtain specified data from multiple distributed data sources and to supply the result to a process and then view its output.

1 Researcher formulates query
2 Researcher submits query
3 Query system transforms and distributes query
4 Data services send back local results
5 Query system combines these to form requested data
6 Query system sends data to process
7 Process system sends derived data to researcher
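The seven steps above can be illustrated with an in-memory sketch. It is not the OGSA-DAI API: Row, DataService, federated_query and mean_value are hypothetical names, the "query" is reduced to a row predicate, and the "process" is a simple mean.

```cpp
#include <functional>
#include <string>
#include <vector>

// One record held by a data service (fields are illustrative).
struct Row {
    std::string station;
    double value;
};

// Step 4: a data service answers a query over its local data only.
struct DataService {
    std::string name;
    std::vector<Row> rows;

    std::vector<Row> query(const std::function<bool(const Row&)>& pred) const {
        std::vector<Row> out;
        for (const auto& r : rows)
            if (pred(r)) out.push_back(r);
        return out;
    }
};

// Steps 3 and 5: distribute the query to every service, then combine the
// local results into the requested data.
std::vector<Row> federated_query(const std::vector<DataService>& services,
                                 const std::function<bool(const Row&)>& pred) {
    std::vector<Row> combined;
    for (const auto& s : services) {
        const std::vector<Row> local = s.query(pred);
        combined.insert(combined.end(), local.begin(), local.end());
    }
    return combined;
}

// Steps 6 and 7: a process consumes the combined data and returns derived data.
double mean_value(const std::vector<Row>& rows) {
    if (rows.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& r : rows) sum += r.value;
    return sum / rows.size();
}
```

The researcher only formulates the predicate and reads the derived value; which services hold which rows stays hidden behind federated_query.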

OGSA-DAI one-slide summary
• Enable the sharing of data resources to support:
  – Data access: access to structured data in distributed heterogeneous data resources
  – Data transformation, e.g. expose data in schema X to users as data in schema Y
  – Data integration, e.g. expose multiple databases to users as a single virtual database
  – Data delivery: delivering data to where it's needed by the most appropriate means, e.g. web service, e-mail, HTTP, FTP, GridFTP
• History
  – Started in February 2002 as part of the UK e-Science Grid Core Program
  – Part of OMII-UK, a partnership between:
    • OMII, The University of Southampton
    • myGrid, The University of Manchester
    • OGSA-DAI, The University of Edinburgh

OGSA-DAI Generic web services
• Manipulate data using OGSA-DAI’s generic web services
• Clients see the data in its ‘raw’ format, e.g.
  – Tables, columns, rows for relational data
  – Collections, elements etc. for XML data
• Clients can obtain the schema of the data
• Clients send queries in the appropriate query language, e.g. SQL, XPath

[Figure: a client sends a request to OGSA-DAI and receives data; OGSA-DAI sits in front of a Relational Database, an XML Database and an Indexed File.]

OGSA-DAI Workflows

• Pipeline, Sequence, Parallel workflows
• Composed of activities
• Reduces data transfers and web service calls


Metadata Management: A Satellite Scenario

[Figure: the Space Segment and the Ground Segment. Satellite files: DMOP files and Product files.]

A Sample File in the Satellite Domain

[Figure: a sample file, divided into a DATA part and a METADATA part.]

Metadata can be present in file names…

File name (Product): "RA2_MW__1PNPDK20060201_120535_000000062044_00424_20518_0349.N1"

Corresponds to:


…and in file headers

FILE ; DMOP (generated by FOS Mission Planning System)
RECORD fhr
  FILENAME="DMOP_SOF__VFOS20060124_103709_00000000_00001215_20060131_014048_20060202_035846.N1"
  DESTINATION="PDCC"
  PHASE_START=2
  CYCLE_START=44
  REL_START_ORBIT=404
  ABS_START_ORBIT=20498
ENDRECORD fhr
................................
RECORD dmop_er
  RECORD dmop_er_gen_part
    RECORD gen_event_params
      EVENT_TYPE=RA2_MEA
      EVENT_ID="RA2_MEA_00000000002063"
      NB_EVENT_PR1=1
      NB_EVENT_PR3=0
      ORBIT_NUMBER=20521
      ELAPSED_TIME=623635
      DURATION=41627862
    ENDRECORD gen_event_params
ENDRECORD dmop_er
ENDLIST all_dmop_er
ENDFILE

Callouts on the slide: RECORD ID; RECORD parameters; RECORD parameters corresponding to another RECORD structure.
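Header metadata of this shape can be pulled out with a small tokenizer. The sketch below is an assumption-laden illustration rather than a parser for the real DMOP format: parse_header is a hypothetical function, and it assumes whitespace-separated tokens, values without embedded spaces, and balanced RECORD / ENDRECORD pairs.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parses "RECORD name ... KEY=VALUE ... ENDRECORD name" text into a flat map
// whose keys are dotted paths such as "fhr.CYCLE_START".
std::map<std::string, std::string> parse_header(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::vector<std::string> open_records;  // stack of enclosing RECORD names
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        if (token == "RECORD") {
            std::string name;
            if (in >> name) open_records.push_back(name);
        } else if (token == "ENDRECORD") {
            std::string name;
            in >> name;  // consume the record name that follows ENDRECORD
            if (!open_records.empty()) open_records.pop_back();
        } else {
            const std::string::size_type eq = token.find('=');
            if (eq == std::string::npos || eq == 0) continue;  // not KEY=VALUE
            std::string value = token.substr(eq + 1);
            // Strip surrounding double quotes, if any.
            if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
                value = value.substr(1, value.size() - 2);
            std::string path;
            for (const auto& r : open_records) path += r + ".";
            fields[path + token.substr(0, eq)] = value;
        }
    }
    return fields;
}
```

Once the fields sit in a map, they can be loaded into a catalog and queried, instead of staying buried inside each file.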

Metadata can be exposed
• Metadata deserves a better treatment
  – In most cases it appears together with files or other resources
  – It is difficult to deal with
  – What about trying to query about all the files that deal with instrument X and where the information was taken from time T1 to T2?

Our goal:
Let’s make metadata a FIRST-CLASS CITIZEN in our systems
And let’s make it FLEXIBLE to changes
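Once metadata is held as separate records, the "instrument X between T1 and T2" question becomes a simple filter. The following is a minimal sketch, assuming a hypothetical FileMetadata record and find_files function; timestamps are assumed to be fixed-width "YYYYMMDDhhmmss" strings so that string comparison matches chronological order.

```cpp
#include <string>
#include <vector>

// One catalog entry per file. Field names are illustrative.
struct FileMetadata {
    std::string file_name;
    std::string instrument;
    std::string start;  // acquisition start, "YYYYMMDDhhmmss"
    std::string end;    // acquisition end,   "YYYYMMDDhhmmss"
};

// Returns the names of the files of `instrument` whose acquisition interval
// overlaps the closed interval [t1, t2].
std::vector<std::string> find_files(const std::vector<FileMetadata>& catalog,
                                    const std::string& instrument,
                                    const std::string& t1,
                                    const std::string& t2) {
    std::vector<std::string> hits;
    for (const auto& m : catalog)
        if (m.instrument == instrument && m.start <= t2 && m.end >= t1)
            hits.push_back(m.file_name);
    return hits;
}
```

Adding a new field to FileMetadata does not affect existing queries, which hints at the flexibility goal; a real catalog would more likely use a schema-less or ontology-based model for this.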

Workflow Lifecycle

[Figure: the workflow lifecycle diagram shown earlier, repeated: Workflow Template, Workflow Instance, Executable Workflow, data products, catalogs, resources and libraries.]

Metadata and workflows

• Metadata for describing workflow entities
  – What is the added value of a given workflow?
  – What is the task a given service performs?
  – What are the services that can be associated with a processor?
• Metadata for describing workflow provenance
  – How did the execution of a given workflow go?
  – What are the semantics of a data product?
  – How many invocations of a given service failed?

Some metadata about a workflow

[Figure: a scientific workflow and its metadata content: RDF annotations, social tag annotations and free-text annotations, linked to Reference Ontology 1, Reference Ontology 2 and a reference controlled vocabulary.]

Metadata is everywhere
• We can attach metadata to almost anything
  – Events, notifications, logs
  – Services and resources
  – Schemas and catalogue entries
  – People, meetings, discussions, conference talks
  – Scientific publications, recommendations, quality comments
  – Models, codes, builds, workflows
  – Data files and data streams
  – Sensors and sensor data
• But..., what do we mean by metadata???

What is the metadata of this HTML fragment?

Based on Dublin Core
• The contributor and creator is the flight booking service “www.flightbookings.com”.
• The date would be January 1st, 2003, in case that the HTML page has been generated on that specific date.
• The description would be something like “flight details for a travel between Madrid and Seattle via Chicago on February 8th, 2004”.
• The document format is “HTML”.
• The document language is “en”, which stands for English.

Based on thesauri
• Madrid is a reference to the term with ID 7010413 in the thesaurus, which refers to the city of Madrid in Spain.
• Spain is a reference to the term with ID 1000095, which refers to the kingdom of Spain in Europe.
• Chicago is a reference to the term with ID 7013596, which refers to the city of Chicago in Illinois, US.
• United States of America is a reference to the term “United States” with ID 7012149, which refers to the US nation.
• Seattle is a reference to the term with ID 7014494, which refers to the city of Seattle in Washington, US.

Based on ontologies
• Concept instances relate a part of the document to one or several concepts in an ontology. For example, “Flight details” may represent an instance of the concept Flight, and can be named as AA7615_Feb08_2003, although concept instances do not necessarily have a name.
• Attribute values relate a concept instance with part of the document, which is the value of one of its attributes. For example, “American Airlines” can be the value of the attribute companyName.
• Relation instances relate two concept instances by some domain-specific relation. For example, the flight AA7615_Feb08_2003 and the location Madrid can be connected by the relation departurePlace.
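The three kinds of ontology-based annotation can be written down as subject-predicate-object triples, in the spirit of RDF. The sketch below is a toy in-memory model, not an RDF library: Triple, match and flight_annotations are hypothetical names, and the predicate "instanceOf" is an assumed label for concept membership.

```cpp
#include <string>
#include <vector>

struct Triple {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Returns all triples matching the pattern; an empty string acts as a wildcard.
std::vector<Triple> match(const std::vector<Triple>& graph,
                          const std::string& s,
                          const std::string& p,
                          const std::string& o) {
    std::vector<Triple> out;
    for (const auto& t : graph)
        if ((s.empty() || t.subject == s) &&
            (p.empty() || t.predicate == p) &&
            (o.empty() || t.object == o))
            out.push_back(t);
    return out;
}

// The flight example from the text, one triple per annotation kind.
std::vector<Triple> flight_annotations() {
    std::vector<Triple> g;
    // Concept instance: the flight is an instance of the concept Flight.
    g.push_back(Triple{"AA7615_Feb08_2003", "instanceOf", "Flight"});
    // Attribute value: a literal taken from the document.
    g.push_back(Triple{"AA7615_Feb08_2003", "companyName", "American Airlines"});
    // Relation instance: a link between two concept instances.
    g.push_back(Triple{"AA7615_Feb08_2003", "departurePlace", "Madrid"});
    return g;
}
```

For example, match(flight_annotations(), "", "departurePlace", "") returns the single triple that links the flight to Madrid.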

Need to Add “Semantics”
• External agreement on meaning of annotations
  – E.g., Dublin Core for annotation of library/bibliographic information
• Use ontologies to specify meaning of annotations
  – Ontologies provide a vocabulary of terms, plus
  – a set of explicit assumptions regarding the intended meaning of the vocabulary
    • Almost always including concepts and their classification
    • Almost always including properties between concepts
    • Similar to an object-oriented model
  – Meaning (semantics) of terms is formally specified
  – Can also specify relationships between terms in multiple ontologies
• Thus, an ontology describes a formal specification of a certain domain:
  – Shared understanding of a domain of interest
  – Formal and machine-manipulable model of a domain of interest

S-OGSA Model

Summary
• From the lower level of abstraction…
  – Difficulties in developing, programming & deploying Grid applications using the existing Grid middleware
• To a higher level of abstraction:
  – High-level APIs and metadata management
    • Programmatic approaches that provide common grid functionality at a correct level of abstraction for applications
    • Ability to hide underlying complexity of infrastructure, varying semantics, heterogeneity and changes from the application developer
  – Improved data access and integration mechanisms
  – Workflow management
    • Traceable, repeatable analyses of e-Science experiments
    • Graphical modelling languages to ease Grid application development
