Policy-based Data Management
Integrated Rule Oriented Data Grid (iRODS)
Reagan W. Moore (DICE-UNC)
Arcot Rajasekar (DICE-UNC)
http://irods.diceresearch.org



What is the Opportunity Play for iRODS?

At a high level …

The Management of Big Data is the #1 concern for IT
• Life Cycle Management
• Useful (actionable) and searchable metadata
• Integrity
• Collaboration (Federation of Immutable data)

iRODS Provides Policy-Based data management
• Next Generation data management cyber-infrastructure
• System that enables a flexible, adaptive, customizable data management architecture
• Tool for large collections (petabytes, hundreds of millions of files)

Properties of policy-based data management systems
  Management of the data life cycle (project collection, digital library, persistent archive, processing pipeline)

Applications of iRODS
  LifeTime™ Library (digital library for students)
  Genomics data grid
  Carolina Digital Repository (institutional repository)
  French National Library (IT automation)
  DataNet Federation Consortium (data and workflow sharing for collaborative research)

We will touch on …
1. What iRODS is and what problems it is solving today and tomorrow
2. Speak to different use cases (there will be many companies attending, representing many departments with different opportunities/problems)
   a. Digitization of University Assets: Library archive
   b. Genomic pipeline automation
   c. IT service automation

Topics

• Principles behind policy-based data management
  - Enable collaborative research
  - Enable reproducible science
  - Enable creation of reference collections
• Integrated Rule-Oriented Data System (iRODS)
  - Enforce management policies
  - Automate administrative functions
  - Validate assessment criteria

Shared Collections - Data Grid

[Figure: clients (50 clients: web browser, unix shell command, …) reach a file system and a tape archive through the data grid. Data grid middleware provides global name, single sign-on, policy enforcement, metadata, replication. Multiple types of systems can be used to store data.]

Policy-based Data Management

[Figure: a client and two iRODS servers, each with its own rule engine, rule base, workflows, and storage, presented as one logical collection (data grid).]

Consensus on Policies and Procedures controls the Data Collection

Policy-Based Data Environments

Purpose: Reason a collection is assembled
Properties: Attributes needed to ensure the purpose
Policies: Controls for enforcing desired properties
  • Mapped to computer actionable rules
Procedures: Functions that implement the policies
  • Mapped to computer actionable workflows
Persistent state information: Results of applying the procedures
  • Mapped to system metadata
Property verification: Validation that state information conforms to the desired purpose
  • Mapped to periodically executed policies
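The chain listed above (purpose, properties, policies, procedures, persistent state, verification) can be pictured as a small data model. The following is a minimal Python sketch, not iRODS code: the class names, the `replica_count` attribute, and the two-replica target are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative model only: these class and attribute names are hypothetical, not iRODS APIs.
State = Dict[str, Dict[str, int]]  # file name -> persistent state attributes


@dataclass
class Policy:
    prop: str                                  # property the policy enforces
    procedure: Callable[[State, str], None]    # workflow that implements the policy
    check: Callable[[State, str], bool]        # verification run against stored state


@dataclass
class Collection:
    purpose: str
    policies: List[Policy] = field(default_factory=list)
    state: State = field(default_factory=dict)

    def ingest(self, name: str) -> None:
        """Apply every procedure to a new file and record the resulting state."""
        self.state[name] = {}
        for policy in self.policies:
            policy.procedure(self.state, name)

    def verify(self) -> List[str]:
        """Return 'property:file' for each file whose state violates a property."""
        return [
            f"{policy.prop}:{name}"
            for policy in self.policies
            for name in sorted(self.state)
            if not policy.check(self.state, name)
        ]


def replicate(state: State, name: str) -> None:
    state[name]["replica_count"] = 2  # assumed target of two replicas


def has_two_replicas(state: State, name: str) -> bool:
    return state[name].get("replica_count", 0) >= 2


archive = Collection("preservation", [Policy("replication", replicate, has_two_replicas)])
archive.ingest("a.dat")
archive.state["b.dat"] = {"replica_count": 1}  # simulate a file that lost a replica
failures = archive.verify()
print(failures)
```

In this sketch the procedure writes state when a file is ingested, and the verification step only reads that state, which mirrors the separation between applying procedures and periodically checking properties.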

Community-based Collection Life Cycle

Stage                       Data status   Policy added
Project Collection          Private       Local Policy
Data Grid                   Shared        Distribution Policy
Digital Library             Published     Description Policy
Data Processing Pipeline    Analyzed      Service Policy
Reference Collection        Preserved     Representation Policy
Federation                  Sustained     Re-purposing Policy

Stages correspond to addition of new policies for a broader community.
Virtualize the stages of the collection life cycle through policy evolution.
The driving purpose changes at each stage of the data life cycle.
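One way to read "policy evolution" is that each stage keeps the policies of the earlier stages and adds one more. The Python sketch below encodes that reading; the stage and policy names come from the slide, while the cumulative model and the function name are assumptions.

```python
from typing import List

# Stage and policy names are from the life-cycle slide.
# Treating the policies as cumulative is an assumption of this sketch.
STAGES = [
    ("Project Collection", "Local Policy"),
    ("Data Grid", "Distribution Policy"),
    ("Digital Library", "Description Policy"),
    ("Data Processing Pipeline", "Service Policy"),
    ("Reference Collection", "Representation Policy"),
    ("Federation", "Re-purposing Policy"),
]


def policies_in_force(stage: str) -> List[str]:
    """Return every policy accumulated up to and including the given stage."""
    names = [name for name, _ in STAGES]
    if stage not in names:
        raise ValueError(f"unknown stage: {stage}")
    return [policy for _, policy in STAGES[: names.index(stage) + 1]]


print(policies_in_force("Digital Library"))
```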

Applications

Data Grids (data sharing)
  Ocean Observatories Initiative
  The iPlant Collaborative
  National Optical Astronomy Observatory
  BaBar High Energy Physics
  Broad Institute genomics data grid
  Wellcome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
  Texas Digital Library
  French National Library
  UNC-CH SILS LifeTime Library
Repositories, Archives (data preservation)
  NASA Center for Climate Simulation
  Carolina Digital Repository

Sequencing Work - an Infrastructure View

[Figure: infrastructure diagram with these labeled groups. National resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, … Data production: UNC HTSF, third-party vendors. RENCI infrastructure, test/development: distributed ad-hoc processing, iRODS data-grid managed processing, pipelines, genome databases. Production: pipelines, archive, genome databases. Clinical data systems: NCGenes, Secure Medical Workspace. Data sharing: local (TUCASI), NIH, other institutions. Reference data: RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes. Also shown: VarDB, Hadoop. Legend labels: RC, RENCI, LCCC, HTSF (UNC High Throughput Sequencing Facility), infrastructure hardware, ITS, software.]

Managing several hundred TBs of genomic data

Managing Data on the Research Side

[Figure: research data environment with these labeled elements. Storage: RENCI storage (tape drives), UNC storage (tape drives), genomics storage. Compute: UNC HPC, RENCI HPC, genomics HPC, RENCI Hadoop, genomics Hadoop, IT machines, lab machines. External compute: Open Science Grid, Clemson, clouds. Partners: NIH, external partners. People: data providers, researchers, students, external collaborators, IT staff.]

iRODS gracefully allows for introducing control
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• … all policy driven
• … without breaking the in-place systems

Wild West → Managed

SILS LifeTime Library

Student digital libraries
Enable students to build collections of
  • Photographs
  • MP3 audio files
  • Class documents
  • Video
  • Web site archive
Resources provided by School of Information and Library Science at UNC-CH
Student collections range from 2 GB to 150 GB
Number of files ranges from 2,000 to 12,000

SILS LifeTime Library Policies

Library management
  • Replication
  • Checksums
  • Versioning
  • Strict access controls
  • Quotas
  • Metadata catalog replication
  • Installation environment archiving
Ingestion
  • Automated synchronization of student directory with LifeTime Library
  • Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services

Carolina Digital Repository

[Figure: Carolina Digital Repository ingest workflow]

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid
  • Web browsers (iDrop-web, Rich Web client)
  • Web services (VOSpace)
  • Load libraries (Python, Java)
  • I/O libraries (C, C++, Fortran)
  • File systems (FUSE, WebDAV, Parrot)
  • Synchronization interfaces (iDrop)
  • Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
  • Workflows (Kepler, Taverna)
  • Digital Libraries (Fedora, DSpace)
  • Portals (EnginFrame)

Managing Information & Knowledge

Concepts
  Data: objects
  Information: names
  Knowledge: relationships between names
  Wisdom: relationships between relationships
Implementation
  Data: bytes (storage system)
  Information: metadata (relational database)
  Knowledge: policies, procedures (rule base, rule engine)
  Wisdom: policy enforcement point (data grid)

Data Virtualization

Layers, from client to storage
  Access Interface
  Policy Enforcement Points
  Standard Micro-services
  Standard I/O Operations
  Storage Protocol
  Storage System

• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
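The four mappings above can be sketched as a chain of lookups. The Python below is only an illustration of that layering: the enforcement-point names, the micro-service functions, and the local-file driver are hypothetical and are not the iRODS implementation.

```python
import hashlib
import os
import tempfile

# Hypothetical names throughout; only the four-step mapping comes from the slide.


def local_driver(op, path, data=b""):
    """Storage protocol: carry out a standard I/O operation on a local file system."""
    if op == "write":
        with open(path, "wb") as handle:
            handle.write(data)
        return None
    if op == "read":
        with open(path, "rb") as handle:
            return handle.read()
    raise ValueError(f"unsupported operation: {op}")


audit_log = []


def msi_store(driver, path, data):
    """Micro-service built only from standard I/O operations."""
    driver("write", path, data)


def msi_record_checksum(driver, path, data):
    """Micro-service that re-reads the stored bytes and records their digest."""
    audit_log.append(hashlib.sha256(driver("read", path)).hexdigest())


# Client action -> policy enforcement points -> micro-services
ENFORCEMENT_POINTS = {"put": ["store_object", "after_put"]}
POLICIES = {"store_object": [msi_store], "after_put": [msi_record_checksum]}


def client_action(action, driver, path, data):
    for point in ENFORCEMENT_POINTS[action]:
        for micro_service in POLICIES[point]:
            micro_service(driver, path, data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "object.dat")
    client_action("put", local_driver, target, b"hello")
    stored = local_driver("read", target)

print(stored == b"hello", len(audit_log))
```

Swapping `local_driver` for a driver that speaks another storage protocol would leave the policies and micro-services untouched, which is the point of the layering.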

System and User-driven Rules

The data grid automatically applies rules defined in the rule base core.re
You can define rules that are applied interactively or that are deferred for later execution

  irule -F "rule-file.r"

Example Rule

Write "Hello World"
Create a rule file called ruleHello.r

  myTestRule {
    writeLine("stdout", "Hello World");
  }
  INPUT null
  OUTPUT ruleExecOut

  irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
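Several of the requirements above (checksum verification, replacing bad or missing replicas, batches of 256, a repair log, and restart from the last checked batch) can be illustrated with a short simulation. This is a Python sketch over in-memory dictionaries, not the production rule: the data layout, the function names, and the `max_batches` argument used to simulate a halt are assumptions, and the deadline scheduler and load leveling are left out.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements above


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_collection(catalog, replicas, checkpoint, log, max_batches=None):
    """Check every logical name in batches, repairing replicas from a good copy.

    catalog:    logical name -> expected checksum
    replicas:   resource name -> {logical name -> bytes}
    checkpoint: dict holding 'next_index' so a later call can resume
    max_batches: stop early after this many batches (simulates a system halt)
    """
    names = sorted(catalog)
    start = checkpoint.get("next_index", 0)
    batches = 0
    while start < len(names):
        if max_batches is not None and batches >= max_batches:
            break
        for name in names[start:start + BATCH_SIZE]:
            good = [
                resource
                for resource, store in replicas.items()
                if name in store and checksum(store[name]) == catalog[name]
            ]
            if not good:
                log.append(f"unrecoverable {name}")
                continue
            for resource, store in replicas.items():
                if resource not in good:
                    store[name] = replicas[good[0]][name]  # replace missing or bad copy
                    log.append(f"repaired {name} on {resource}")
        start += BATCH_SIZE
        checkpoint["next_index"] = start  # persist progress after each batch
        batches += 1
    return checkpoint


names = [f"f{i:04d}" for i in range(300)]
catalog = {n: checksum(n.encode()) for n in names}
replicas = {
    "resA": {n: n.encode() for n in names},
    "resB": {n: n.encode() for n in names},
}
del replicas["resB"]["f0005"]             # missing replica
replicas["resA"]["f0290"] = b"corrupted"  # replica that fails its checksum

checkpoint, log = {}, []
verify_collection(catalog, replicas, checkpoint, log, max_batches=1)  # halted run
first_stop = checkpoint["next_index"]
verify_collection(catalog, replicas, checkpoint, log)                 # restart
print(first_stop, log)
```

The second call resumes at the saved index rather than rechecking the first batch, which is the restart behaviour the list asks for.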

Workflow Management & Registration

[Figure: layout of a registered workflow, with each element and its name in the example]
  eCWkflow.mss: workflow file
  /earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
  eCWkflow.mpf: input parameter file; lists parameters and input and output file names
  /earthCube/eCWkflow/eCWkflow.runDir/0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
  eCWkflow.run: automatically generated run file for executing each input file
  Outfile: output file created for eCWkflow.mpf
  Also shown: eCWkflow2.run, eCWkflow2.mpf, /earthCube/eCWkflow/eCWkflow2.runDir/0, Newfile

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org

NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

  *Val = "0";
  msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
  foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
  }
  if (int(*Val) == 0) {
    # first execution: attach the restart marker to the collection
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
  }
  # on a restart TEST_DATA_ID will be greater than 0
  msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
  msiExecGenQuery(*GenQInp2, *GenQOut2);
  foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
  }

Workflow Operations Used

Arithmetic (+, -)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
  if, then, else
Control
  break, fail
Loops
  for, foreach, while
List manipulation
  initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation
  initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation
  msiGetValByKey: get metadata from structure
  msiExecStrCondQuery: execute string conditional query
  msiString2KeyValPair: convert string to key-value pair
  msiAssociateKeyValuePairsToObj: add metadata
  msiMakeGenQuery: create a query
  msiExecGenQuery: execute a query
  msiCloseGenQuery: release query buffers
  msiGetContInxFromGenQueryOut: check for more rows
  msiRemoveKeyValuePairsFromObj: remove metadata
  msiGetMoreRows: get more rows from query

Micro-services Used (continued)

Data and directory manipulation
  msiIsColl: check whether name is a collection
  msiCollCreate: create a collection
  msiDataObjCreate: create a file
  msiDataObjRepl: replicate a file
  msiDataObjChksum: checksum a file
  msiDataObjUnlink: delete a file
System functions
  msiGetSystemTime: get the system time
  writeLine: write a line to a file or standard out
  msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

[Figure: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state. Labeled steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM). Terrain and network layers: slope, aspect, streams (NHD), roads (DOT), strata, hillslope, patch, basin, stream network, nested watershed structure. Other inputs: land use, leaf area index, phenology, soil data, from NLCD (EPA), Landsat TM, MODIS, USDA; soil and vegetation parameter files. Outputs: worldfile and flowtable for RHESSys.]

iRODS Rule for RHESSys

  main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
  }

Modular workflow composed by chaining basic transformations
  • Define input variables
  • Call functions to apply each transformation step
  • Store results in shared collection
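The chaining pattern described above can be shown outside iRODS in a few lines. In this Python sketch each step takes a context dictionary and returns an extended copy; the step functions and their values are hypothetical stand-ins, not the RHESSys functions named in the rule.

```python
from functools import reduce


def chain(*steps):
    """Compose steps left to right; each step takes and returns a context dict."""
    def run(context):
        return reduce(lambda ctx, step: step(ctx), steps, dict(context))
    return run


# Hypothetical stand-ins for the transformation steps; the values are made up.
def get_extent(ctx):
    return {**ctx, "extent": "100 200 300 50"}


def extract_tile(ctx):
    return {**ctx, "tile": "SUBSET-" + ctx["dataset"] + ".img"}


def store_result(ctx):
    return {**ctx, "shared_collection": ctx.get("shared_collection", []) + [ctx["tile"]]}


workflow = chain(get_extent, extract_tile, store_result)
result = workflow({"gage": "example-gage", "dataset": "dem_a"})
print(result["shared_collection"])
```

Because every step has the same shape, a step can be replaced or reordered without touching the others.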

  extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
  }
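The path arithmetic in the rule above is easy to get wrong, so it may help to see it in isolation. The Python sketch below derives the same three values (tile file path, tile object path, and the `gdal_translate` argument list) from example inputs. The function name and every example path are assumptions; only the `-of HFA -projwin` arguments and the `SUBSET-<name>.img` naming come from the rule.

```python
import posixpath


def build_tile_command(header_path, obj_coll, extent_coords):
    """Derive output paths and the gdal_translate argument list.

    header_path:   physical path of a file inside the raster dataset directory
    obj_coll:      logical collection that holds the DEM object
    extent_coords: four window coordinates as one space-separated string
    """
    in_dir = posixpath.dirname(header_path)            # physical input directory
    parent_dir, dataset = posixpath.split(in_dir)      # parent dir and dataset name
    tile_name = "SUBSET-" + dataset + ".img"
    tile_path = posixpath.join(parent_dir, tile_name)  # physical output file
    obj_path = posixpath.join(posixpath.dirname(obj_coll), tile_name)  # logical output
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_dir, tile_path]
    return argv, tile_path, obj_path


# Example values are invented for illustration.
argv, tile_path, obj_path = build_tile_command(
    "/vault/nhd/dem_a/hdr.adf", "/zone/home/nhd/dem_a", "100 200 300 50"
)
print(" ".join(argv))
```

Building the command as a list rather than one string avoids quoting problems if a path ever contains spaces.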

Summary

iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
  • Enables a flexible, adaptive, and customizable data management architecture
  • "Canned" scripts (policies) can be created to standardize and automate users' processes
  • Simple menu-driven interface
  • No CS degree needed

iRODS is the middleware for Distributed Data Management

Thank you

Questions

Page 2: iRODS

What is the Opportunity Play for iRODS

At a high level hellip

The Management of Big Data is the 1 concern for IT

bull Life Cycle Management

bull Useful (actionable) and searchable metadata

bull Integrity

bull Collaboration (Federation of Immutable data)

iRODS Provides Policy-Based data management

bull Next Generation data management cyber-infrastructure

bull System that enables a flexible adaptive customizable

data management architecture

bull Tool for large collections (Petabytes hundreds of millions of files)

Properties of policy-based data management systems

Management of the data life cycle (project collection digital library persistent

archive processing pipeline)

Applications of iRODS

LifeTimetrade Library (digital library for students)

Genomics data grid

Carolina Digital Repository (institution repository)

French National Library (IT automation)

DataNet Federation Consortium (data and workflow sharing for collaborative

research)

1 What iRODS is and what problems it is solving today and tomorrow

2 Speak to different use cases (there will be many companies attending

representing many departments with different opportunitiesproblems)

a Digitization of University Assets- Library archive

b Genomic pipeline automation

c IT service automation

We will touch on hellip

Topics

bull Principles behind policy-based data management

ndash Enable collaborative research

ndash Enable reproducible science

ndash Enable creation of reference collections

bull Integrated Rule-Oriented Data System (iRODS)

ndash Enforce management policies

ndash Automate administrative functions

ndash Validate assessment criteria

Shared Collections ndash Data Grid

File

System

Client 50 clients web browser

unix shell command hellip

Data grid middleware

provides global name

single sign-on policy

enforcement metadata

replication Tape

Archive

Data Grid

Multiple types of systems

can be used to store data

Policy-based Data Management

Client

iRODS-server

Rule-engine

Rule base

Workflows

iRODS-server

Rule Engine

Rule base

Workflows

Storage Storage

Logical

Collection

(data grid)

Consensus on Policies and Procedures

controls the Data Collection

7

Policy-Based Data Environments

Purpose Reason a collection is assembled

Properties Attributes needed to ensure the purpose

Policies Controls for enforcing desired properties

bull mapped to computer actionable rules

Procedures Functions that implement the policies

bull Mapped to computer actionable workflows

Persistent state information Results of applying the procedures

bull mapped to system metadata

Property verification Validation that state information conforms to the desired purpose

bull mapped to periodically executed policies

Community-based Collection Life Cycle

Project

Collection

Private

Local

Policy

Data

Grid

Shared

Distribution

Policy

Digital

Library

Published

Description

Policy

Data

Processing

Pipeline

Analyzed

Service

Policy

Reference

Collection

Preserved

Representation

Policy

Federation

Sustained

Re-purposing

Policy

Stages correspond to addition of new policies for a broader community

Virtualize the stages of the collection life cycle through policy evolution

The driving purpose changes at each stage of the data life cycle

Applications

Data Grids (data sharing)

Ocean Observatories Initiative

The iPlant Collaborative

National Optical Astronomy Observatory

Babar High Energy Physics

Broad Institute genomics data grid

WellCome Trust Sanger Institute genomics data grid

Digital Libraries (data publication)

Texas Digital Library

French National Library

UNC-CH SILS LifeTime Library

Repositories Archives (data preservation)

NASA Center for Climate Simulation

Carolina Digital Repository

Sequencing Work ndash an Infrastructure View

RENCI

Science

Portal

Open

Science

Grid

TeraGrid

UNC BASS

hellip

National Resources

Pipelines Genome

Databases

RENCI Infrastructure TestDevelopment

Distributed ad-hoc

processing

iRODS data-grid managed

processing

Data Production

UNC HTFS

Third Party

Vendors

Clinical Data Systems

NCGenes

Secure Medical

Workspace

Production

Pipelines

Archive Genome

Databases

RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High

Throughput Sequencing facility

Local

(TUCASI)

Data Sharing

NIH

Other

Institutions

Ref

Se

q

Genome

Annotations

dbSN

P HGMD

1000

Genomes

Managing several hundred TBs of genomic data

VarDB Hadoop

Managing Data on the Research Side

RENCI

STORAGE

(Tape Drives)

UNC

STORAGE

(Tape Drives)

UNC HPC RENCI HPC

External

Compute

Open Science

Grid

Clemson

Clouds

IT Machines

RENCI Hadoop

Genomics

Storage

Lab

Machines

NIH

External

Partners

Genomics HPC

Genomics

Hadoop

Data

Providers

Researchers

Students

External

Collaborators

IT Staff

iRODS gracefully allows for introducing control

bullData movement and replication

bullMetadata standards

bullArchival deletion and retention

bullIntegration with workflows hadoop databases

bullHiding complexities

bullAutomation

bullhellip all policy driven

bullhellip without breaking the in-place systems

Wild West Managed

SILS LifeTime Library

Student digital libraries

Enable students to build collections of

Photographs

MP3 audio files

Class documents

Video

Web site archive

Resources provided by School of Information and

Library Science at UNC-CH

Student collections range from 2 GBytes to 150 Gbytes

Number of files from 2000 to 12000

SILS LifeTime Library Policies

Library management

Replication

Checksums

Versioning

Strict access controls

Quotas

Metadata catalog replication

Installation environment archiving

Ingestion

Automated synchronization of student directory

with LifeTime Library

Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project

funded by the Institute for Museum and Library Services

Carolina Digital Repository

Carolina Digital Repository

Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to

interact with the data grid

Web browsers (iDrop-web Rich Web client)

Web services (VOSpace)

Load libraries (Python Java)

IO libraries (C C++ Fortran)

File systems (FUSE WebDav Parrot)

Synchronization interfaces (iDrop)

Unix tools Grid tools (icommands SAGA SRM Griphyn)

Workflows (Kepler Taverna)

Digital Libraries (Fedora DSpace)

Portals (EnginFrame)

Managing Information amp Knowledge

Concepts

Data objects

Information names

Knowledge relationships between names

Wisdom relationships between relationships

Implementation

Data bytes Storage system

Information metadata Relational database

Knowledge policies procedures Rule base Rule engine

Wisdom policy enforcement point Data Grid

Data Virtualization

Storage System

Storage Protocol

Access Interface

Policy Enforcement Points

Standard Micro-services

Standard IO Operations

bull Map from the actions

requested by the client to

multiple policy

enforcement points

bull Map from policy to

standard micro-services

bull Map from micro-services

to standard Posix IO

operations

bull Map standard IO

operations to the

protocol supported by

the storage system

System and User-driven Rules

The data grid automatically applies rules

defined in the rule base corere

You can define rules that are applied

interactively or that are deferred for later

execution

irule ndashF ldquorule-filerrdquo

Example Rule

Write ldquoHello Worldrdquo

Create rule file call ruleHellor

myTestRule

writeLine(stdoutrdquo Hello World)

INPUT null

OUTPUT ruleExecOut

irule ndashF ruleHellor

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked

Workflow Management amp Registration

Workflow file

Directory holding all input and output files

associated with workflow file (mounted

collection that is linked to the workflow file)

Input parameter file lists parameters

and input and output file names

Directory holding all output

files generated for invocation

of eCWkflowrun the version

number is incremented

Automatically generated run file for

Executing each input file

Output file created for

eCWKflowmpf

eCWkflowmss

earthCubeeCWkflow

eCWkflowmpf

earthCubeeCWkfloweCWkflowrunDir0

eCWkflowrun

Outfile

eCWkflow2run

eCWkflow2mpf

earthCubeeCWkfloweCWkflow2runDir

0

Newfile

Publications

Rajasekar R M Wan R Moore W Schroeder S-Y Chen L

Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B

Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo

Morgan amp Claypool 2010

Ward R M Wan W Schroeder A Rajasekar A de Torcy T

Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data

System (iRODS 30) Micro-service Workbookrdquo DICE

Foundation November 2011 ISBN 9781466469129

Amazoncom

Reagan W Moore

rwmoorerenciorg

httpirodsdiceresearchorg

NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo

NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo

NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo

NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo

iRODS - Open Source Software

iRODS Distributed Data Management

Val = 0rdquo

msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =

Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)

foreach (GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)

if(int(Val) == 0)

Str1 = TEST_DATA_ID=0rdquo

msiString2KeyValPair(Str1kvp)

msiAssociateKeyValuePairsToObj(kvpColl-C)

writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)

on a restart TEST_DATA_ID will be greater than 0

msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and

META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)

msiExecGenQuery(GenQInp2GenQOut2)

foreach(GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)

Initializing Workflow Parameters

Workflow Operations Used

Arithmetic (+ - )

Boolean tests (== = ampamp || gt lt gt=)

Conditional statements

if then else

Control

break fail

Loops

for foreach while

List manipulation

initialization list addition (cons) extracting an element from a

list (elem) updating an element in a list (setelem)

Variable manipulation

initialization type conversion (int double str)

Micro-services Used

Metadata catalog manipulation

msiGetValByKey get metadata from structure

msiExecStrCondQuery execute string conditional query

msiString2KeyValPair convert string to key-value pair

msiAssociateKeyValuePairsToObj add metadata

msiMakeGenQuery create a query

msiExecGenQuery execute a query

msiCloseGenQuery release query buffers

msiGetContInxFromGenQueryOut check for more rows

msiRemoveKeyValuePairsFromObj remove metadata

msiGetMoreRows get more rows from query

Micro-services Used

Data and directory manipulation

msiIsColl check whether name is a collection

msiCollCreate create a collection

msiDataObjCreate create a file

msiDataObjRepl replicate a file

msiDataObjChksum checksum a file

msiDataObjUnlink delete a file

System functions

msiGetSystemTime get the system time

writeLine write a line to a file or standard out

msiSleep sleep

Performance at rencirsquo

bull Execute call to rule engine 18 msecs

bull Execute metadata query 714 msecs

bull Disk seek latency 5 msecs

bull Disk rotational latency 11 msecs

bull Production loop logic 63 msecs

bull Checksum verification 21 msecs

Data Analysis Use Cases

bull Demonstrate reproducible science A use case could include the

registration storage sharing and re-execution of a workflow The hypoxia

use case from the Cross-Domain and Brokering Concept groups could be

used as an example

bull Automate data retrieval A use case could demonstrate remote access to a

data collection retrieval of desired data sets transformation and use in

an analysis workflow An eco-hydrology example that automates access

to digital elevation maps and land use coverage is being built

bull Integrate community resources with collaboration environments An

example would be use of the DAB protocol to identify and cache local

copies of relevant data sets for local analysis

bull Integrate multiple community resources A use case could be

demonstration of invocation of multiple workflow systems within the

same analysis An example is the integration of Cyber-integrator

workflow with collaboration environments to support drought prediction

RHESSys workflow to develop a nested

watershed parameter file (worldfile)

containing a nested ecogeomorphic object

framework and full initial system state

Choose gauge

or outlet (HIS)

Extract

drainage area

(NHDPlus)

Digital

Elevation

Model (DEM)

Worldfile Flowtable

RHESSys

Slope

Aspect

Streams (NHD)

Roads (DOT) Strata

Hillslope

Patch

Basin

Stream network

Nested watershed

structure

Land Use

Leaf Area

Index

Phenology

Soil Data

NLCD (EPA)

Landsat TM

MODIS

USDA

Soil and vegetation

parameter files

Eco-Hydrology

iRODS Rule for RHESSys

main

getExtentForGageReachcode(gageReachcode extentInNHD_Vect_Coords)

convertExtentToNHD_DEM(extentInNHD_Vect_Coords extentInNHD_DEM_Coords)

extractTileFromNHD_DEM(trimr(extentInNHD_DEM_Coords n))

importDEMTileIntoNewGRASSLocationAsUTM(extentInNHD_Vect_Coords newLocPhysPath

newLocObjPath)

delineateWatershedForNHDGage(nhdStreamGageID newLocPhysPath newLocObjPath)

Modular workflow composed by chaining basic transformation

Define input variables

Call functions to apply each transformation step

Store results in shared collection

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
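The path bookkeeping in the rule above is easy to get wrong, so here is a minimal Python sketch of the same derivation. It assumes that msiSplitPath behaves like a POSIX parent/name split; the function name and the example paths are hypothetical, and the sketch only builds the gdal_translate argument list without running anything.

```python
import posixpath


def plan_tile_extraction(dem_obj_path, dem_phys_path, extent_coords):
    """Derive output paths and gdal_translate arguments for one DEM tile."""
    # Logical (iRODS) path: collection and object name.
    dem_coll, _dem_name = posixpath.split(dem_obj_path)
    # Physical path: directory holding the raster, then its parent and name.
    in_file_dir, _header = posixpath.split(dem_phys_path)
    in_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_parent_dir, tile_file_name)
    # The logical path of the tile sits beside the source collection.
    coll_parent, _ = posixpath.split(dem_coll)
    tile_obj_path = posixpath.join(coll_parent, tile_file_name)
    # -of HFA selects the Erdas Imagine (.img) format; -projwin takes
    # four numbers: ulx uly lrx lry.
    args = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, args


tile_file, tile_obj, command = plan_tile_extraction(
    "/zone/home/demo/NHD_DEM/elev/hdr.adf",  # hypothetical logical path
    "/vault/NHD_DEM/elev/hdr.adf",           # hypothetical physical path
    "10 40 20 30",
)
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains spaces.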

Summary

iRODS is a powerful Policy-Based engine for Managing NextGen Big Data Cyber-Infrastructures

Enables a Flexible, Adaptive, and Customizable Data Management Architecture

“Canned” scripts (policies) can be created to standardize and automate users' processes

Simple menu-driven interface

No CS Degree needed

iRODS is the middleware for Distributed Data Management

Thank you

Questions?

Shared Collections – Data Grid

[Figure: clients (50 clients: web browser, Unix shell command, …) reach a file system and a tape archive through the data grid.]

Data grid middleware provides global names, single sign-on, policy enforcement, metadata, and replication.

Multiple types of systems can be used to store data.

Policy-based Data Management

[Figure: a client connects to two iRODS servers, each with its own rule engine, rule base, workflows, and storage; together they form one logical collection (data grid).]

Consensus on Policies and Procedures controls the Data Collection

Policy-Based Data Environments

Purpose: reason a collection is assembled

Properties: attributes needed to ensure the purpose

Policies: controls for enforcing desired properties

• Mapped to computer-actionable rules

Procedures: functions that implement the policies

• Mapped to computer-actionable workflows

Persistent state information: results of applying the procedures

• Mapped to system metadata

Property verification: validation that state information conforms to the desired purpose

• Mapped to periodically executed policies
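One way to read the mapping above is as a small program. The Python sketch below is a toy model under stated assumptions, not iRODS code: the property chosen (at least two replicas per file), the class, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

REQUIRED_REPLICAS = 2  # property (assumed for illustration)


@dataclass
class Collection:
    purpose: str
    # Persistent state information: replica count per file (system metadata).
    replica_counts: dict = field(default_factory=dict)


def replicate_procedure(coll, name):
    """Procedure: a workflow that brings a file up to the required count."""
    current = coll.replica_counts.get(name, 0)
    coll.replica_counts[name] = max(current, REQUIRED_REPLICAS)


def on_ingest_policy(coll, name):
    """Policy: a rule that fires when a file is added to the collection."""
    coll.replica_counts[name] = 1
    replicate_procedure(coll, name)


def verify_property(coll):
    """Property verification: a periodic check over the state information."""
    return [n for n, c in coll.replica_counts.items() if c < REQUIRED_REPLICAS]


coll = Collection(purpose="preservation")
on_ingest_policy(coll, "a.dat")
coll.replica_counts["b.dat"] = 1  # a file that bypassed the policy
violations = verify_property(coll)
```

The verification step reads only the recorded state, which is why the state information has to persist between policy executions.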

Community-based Collection Life Cycle

Stage                       State        Policy
Project Collection          Private      Local Policy
Data Grid                   Shared       Distribution Policy
Digital Library             Published    Description Policy
Data Processing Pipeline    Analyzed     Service Policy
Reference Collection        Preserved    Representation Policy
Federation                  Sustained    Re-purposing Policy

Stages correspond to the addition of new policies for a broader community.

Virtualize the stages of the collection life cycle through policy evolution.

The driving purpose changes at each stage of the data life cycle.

Applications

Data Grids (data sharing)

Ocean Observatories Initiative

The iPlant Collaborative

National Optical Astronomy Observatory

BaBar High Energy Physics

Broad Institute genomics data grid

Wellcome Trust Sanger Institute genomics data grid

Digital Libraries (data publication)

Texas Digital Library

French National Library

UNC-CH SILS LifeTime Library

Repositories, Archives (data preservation)

NASA Center for Climate Simulation

Carolina Digital Repository

Sequencing Work – an Infrastructure View

Managing several hundred TBs of genomic data

[Figure: infrastructure diagram. Labels include: RENCI Science Portal; national resources (Open Science Grid, TeraGrid, UNC BASS, …); data production (UNC HTSF, third-party vendors); clinical data systems (NCGenes, Secure Medical Workspace); RENCI infrastructure (test/development, production pipelines, archive, genome databases, VarDB, Hadoop); distributed ad-hoc processing and iRODS data-grid managed processing; data sharing, local (TUCASI) and with NIH and other institutions; reference data (RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes). Legend entries: RC, RENCI, LCCC, HTSF, ITS (infrastructure hardware and software). HTSF is the UNC High Throughput Sequencing Facility.]

Managing Data on the Research Side

[Figure: research data landscape, labelled “Wild West” and “Managed”. Labels include: RENCI storage (tape drives), UNC storage (tape drives), UNC HPC, RENCI HPC, external compute (Open Science Grid, Clemson, clouds), IT machines, RENCI Hadoop, genomics storage, lab machines, NIH, external partners, genomics HPC, genomics Hadoop. Users: data providers, researchers, students, external collaborators, IT staff.]

iRODS gracefully allows for introducing control:

• Data movement and replication

• Metadata standards

• Archival, deletion, and retention

• Integration with workflows, Hadoop, and databases

• Hiding complexities

• Automation

• … all policy driven

• … without breaking the in-place systems

SILS LifeTime Library

Student digital libraries

Enable students to build collections of:

Photographs

MP3 audio files

Class documents

Video

Web site archive

Resources provided by the School of Information and Library Science at UNC-CH

Student collections range from 2 GBytes to 150 GBytes

Number of files ranges from 2,000 to 12,000

SILS LifeTime Library Policies

Library management:

Replication

Checksums

Versioning

Strict access controls

Quotas

Metadata catalog replication

Installation environment archiving

Ingestion:

Automated synchronization of student directory with LifeTime Library

Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services
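Three of the library management policies listed above (quotas, checksums, replication) can be illustrated with a short in-memory sketch. This is a simplified Python model, not the LifeTime Library implementation: the data layout, the resource names, the SHA-256 choice, and the function name are assumptions, and versioning and access control are left out.

```python
import hashlib


class QuotaExceeded(Exception):
    """Raised when an ingest would push a student over their quota."""


def ingest(library, student, name, data, quota_bytes,
           resources=("disk", "tape")):
    """Apply quota, checksum, and replication policies to one new file."""
    objects = library.setdefault(student, {})
    used = sum(len(obj["data"]) for obj in objects.values())
    # Quota policy: refuse the file before anything is stored.
    if used + len(data) > quota_bytes:
        raise QuotaExceeded(f"{student}: {used + len(data)} > {quota_bytes}")
    record = {
        "data": data,
        # Checksum policy: record a digest at ingest time.
        "checksum": hashlib.sha256(data).hexdigest(),
        # Replication policy: one replica per configured resource.
        "replicas": list(resources),
    }
    objects[name] = record
    return record


library = {}
record = ingest(library, "student1", "photo.jpg", b"abc", quota_bytes=10)
```

Recording the checksum at ingest gives later integrity checks a trusted value to compare each replica against.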

Carolina Digital Repository

Carolina Digital Repository Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:

Web browsers (iDrop-web, Rich Web client)

Web services (VOSpace)

Load libraries (Python, Java)

I/O libraries (C, C++, Fortran)

File systems (FUSE, WebDAV, Parrot)

Synchronization interfaces (iDrop)

Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)

Workflows (Kepler, Taverna)

Digital Libraries (Fedora, DSpace)

Portals (EnginFrame)

Managing Information & Knowledge

Concepts

Data: objects

Information: names

Knowledge: relationships between names

Wisdom: relationships between relationships

Implementation

Data: bytes (Storage system)

Information: metadata (Relational database)

Knowledge: policies, procedures (Rule base, Rule engine)

Wisdom: policy enforcement point (Data Grid)

Data Virtualization

Layers: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System

• Map from the actions requested by the client to multiple policy enforcement points

• Map from policy to standard micro-services

• Map from micro-services to standard POSIX I/O operations

• Map standard I/O operations to the protocol supported by the storage system
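The four mappings above can be pictured as a chain of lookup tables. The Python sketch below is a toy model: every table entry, enforcement-point name, micro-service name, and driver is hypothetical and chosen only to show how one client action fans out into storage-level calls.

```python
# Client action -> policy enforcement points (hypothetical names).
PEPS = {"put": ["pre_put", "post_put"]}

# Policy enforcement point -> micro-services run by its policy.
POLICIES = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}

# Micro-service -> standard POSIX-style I/O operations.
MICROSERVICE_IO = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}

# Standard I/O operation -> protocol of a particular storage system.
DRIVERS = {
    "unixfs": lambda op: f"posix:{op}",
    "tape": lambda op: f"tape-protocol:{op}",
}


def plan(action, storage):
    """Expand one client action into the storage-level calls it triggers."""
    driver = DRIVERS[storage]
    calls = []
    for pep in PEPS[action]:
        for microservice in POLICIES[pep]:
            for op in MICROSERVICE_IO[microservice]:
                calls.append((pep, microservice, driver(op)))
    return calls


calls = plan("put", "tape")
```

Because only the last table depends on the storage system, swapping "tape" for "unixfs" changes the protocol calls and nothing above them.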

System and User-driven Rules

The data grid automatically applies rules defined in the rule base core.re.

You can define rules that are applied interactively or that are deferred for later execution:

irule -F "rule-file.r"

Example Rule

Write “Hello World”

Create a rule file called ruleHello.r:

myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut

Run it with:

irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
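Several of the requirements above (batches of 256 files, restart from the last checked set, a repair log, and execution statistics) can be illustrated with a small in-memory sketch. This is not the production iRODS rule: the catalog layout, the function names, and the use of SHA-256 are assumptions made for illustration, and the deadline scheduler and load leveling are omitted.

```python
import hashlib

BATCH_SIZE = 256  # batch size stated in the requirements above


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def check_batch(names, catalog, log):
    """Verify every replica of each file; repair from a good copy."""
    repaired = 0
    for name in names:
        entry = catalog[name]
        good = [d for d in entry["replicas"].values()
                if d is not None and sha256(d) == entry["checksum"]]
        if not good:
            log.append(f"UNRECOVERABLE {name}")
            continue
        for resource, data in list(entry["replicas"].items()):
            # A replica is bad if it is missing (None) or fails its checksum.
            if data is None or sha256(data) != entry["checksum"]:
                entry["replicas"][resource] = good[0]
                log.append(f"REPAIRED {name} on {resource}")
                repaired += 1
    return repaired


def run_integrity_check(catalog, state, log,
                        batch_size=BATCH_SIZE, max_batches=None):
    """Process the collection in batches, checkpointing after each one.

    `state` persists between runs, so a call after a halt resumes at
    state["next"] instead of starting over.
    """
    names = sorted(catalog)
    done = 0
    while state["next"] < len(names):
        if max_batches is not None and done >= max_batches:
            break  # simulates a halt between batches
        batch = names[state["next"]:state["next"] + batch_size]
        state["repaired"] += check_batch(batch, catalog, log)
        state["checked"] += len(batch)
        state["next"] += len(batch)  # checkpoint
        done += 1
    return state
```

In this sketch the checkpoint is an index into the sorted file names, which is only safe if the collection does not change between runs; a real policy would need a sturdier marker to pick up files added during execution.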

Workflow Management & Registration

eCWkflow.mss: workflow file

/earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (a mounted collection that is linked to the workflow file)

eCWkflow.mpf: input parameter file; lists parameters and input and output file names

eCWkflow.run: automatically generated run file for executing each input file

/earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented

Outfile: output file created for eCWkflow.mpf

A second parameter file follows the same pattern: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
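The run-directory naming above suggests a simple versioning scheme. The Python sketch below assumes that numbering starts at 0 and increases by one per invocation, which is inferred from the name runDir0 and the note that the version number is incremented; the function name is hypothetical.

```python
import re


def next_run_dir(workflow, existing):
    """Return the next versioned run directory name for a workflow."""
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    versions = [int(m.group(1)) for name in existing
                if (m := pattern.match(name))]
    next_version = max(versions) + 1 if versions else 0
    return f"{workflow}.runDir{next_version}"


first = next_run_dir("eCWkflow", [])
later = next_run_dir(
    "eCWkflow",
    ["eCWkflow.runDir0", "eCWkflow.runDir1", "eCWkflow2.runDir5"],
)
```

Keeping every run in its own directory means earlier outputs are never overwritten, which is what makes re-execution comparable across runs.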

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 “DataNet Federation Consortium”

NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”

NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”

NSF SDCI-0721400 “Data Grids for Community Driven Applications”

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
    # first execution: create the progress attribute on the collection
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}

Workflow Operations Used

Arithmetic (+, -)

Boolean tests (==, !=, &&, ||, >, <, >=)

Conditional statements: if, then, else

Control: break, fail

Loops: for, foreach, while

List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)

Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation

msiGetValByKey: get metadata from structure
msiExecStrCondQuery: execute string conditional query
msiString2KeyValPair: convert string to key-value pair
msiAssociateKeyValuePairsToObj: add metadata
msiMakeGenQuery: create a query
msiExecGenQuery: execute a query
msiCloseGenQuery: release query buffers
msiGetContInxFromGenQueryOut: check for more rows
msiRemoveKeyValuePairsFromObj: remove metadata
msiGetMoreRows: get more rows from query

Micro-services Used

Data and directory manipulation

msiIsColl: check whether name is a collection
msiCollCreate: create a collection
msiDataObjCreate: create a file
msiDataObjRepl: replicate a file
msiDataObjChksum: checksum a file
msiDataObjUnlink: delete a file

System functions

msiGetSystemTime: get the system time
writeLine: write a line to a file or standard out
msiSleep: sleep

Performance at RENCI

• Execute call to rule engine 18 msecs

• Execute metadata query 714 msecs

• Disk seek latency 5 msecs

• Disk rotational latency 11 msecs

• Production loop logic 63 msecs

• Checksum verification 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be a demonstration of invoking multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.


Shared Collections – Data Grid

• Client: 50 clients (web browser, unix shell command, …)
• Data grid middleware provides global name, single sign-on, policy enforcement, metadata, replication
• Multiple types of systems can be used to store data (file system, tape archive)

Policy-based Data Management

[Diagram labels: Client; two iRODS servers, each with a rule engine, rule base, workflows, and storage; Logical Collection (data grid)]

Consensus on policies and procedures controls the data collection


Policy-Based Data Environments

Purpose: reason a collection is assembled
Properties: attributes needed to ensure the purpose
Policies: controls for enforcing desired properties
  • mapped to computer actionable rules
Procedures: functions that implement the policies
  • mapped to computer actionable workflows
Persistent state information: results of applying the procedures
  • mapped to system metadata
Property verification: validation that state information conforms to the desired purpose
  • mapped to periodically executed policies
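The mapping above (policies to rules, procedures to workflows, state information to metadata, verification to periodic checks) can be sketched in a few lines of Python. This is a toy model rather than the iRODS implementation: the `Policy` and `Collection` classes and the "two-replicas" policy are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    name: str
    # Procedure: the function that implements the policy and records state.
    procedure: Callable[[str, Dict[str, str]], None]
    # Verification: checks that the recorded state still shows the property.
    verify: Callable[[Dict[str, str]], bool]


@dataclass
class Collection:
    purpose: str
    policies: List[Policy]
    # Persistent state information, kept as per-object metadata.
    state: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def ingest(self, obj_name: str) -> None:
        """Apply every policy's procedure to a newly added object."""
        meta = self.state.setdefault(obj_name, {})
        for policy in self.policies:
            policy.procedure(obj_name, meta)

    def verify(self) -> List[str]:
        """Return 'object:policy' entries whose property no longer holds."""
        return [
            f"{name}:{policy.name}"
            for name, meta in sorted(self.state.items())
            for policy in self.policies
            if not policy.verify(meta)
        ]


def record_two_replicas(obj_name: str, meta: Dict[str, str]) -> None:
    # Stand-in for a real replication workflow: only the resulting state is recorded.
    meta["replicas"] = "2"


two_replicas = Policy(
    name="two-replicas",
    procedure=record_two_replicas,
    verify=lambda meta: int(meta.get("replicas", "0")) >= 2,
)
```

A periodic job would call `Collection.verify()` and act on whatever it returns; an empty list means the state information still conforms to the collection's purpose.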

Community-based Collection Life Cycle

• Project Collection: private; Local Policy
• Data Grid: shared; Distribution Policy
• Digital Library: published; Description Policy
• Data Processing Pipeline: analyzed; Service Policy
• Reference Collection: preserved; Representation Policy
• Federation: sustained; Re-purposing Policy

Stages correspond to the addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
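Since each stage corresponds to adding policies for a broader community, the set of policies in force at a stage can be read as cumulative. The short Python sketch below works under that reading. The stage and policy names come from the slide, while `LIFE_CYCLE` and `policies_in_force` are hypothetical, and pairing each stage with exactly one policy is an assumption drawn from the slide layout.

```python
# Stage names and the policy each stage adds, in life-cycle order.
LIFE_CYCLE = [
    ("Project Collection", "Local Policy"),
    ("Data Grid", "Distribution Policy"),
    ("Digital Library", "Description Policy"),
    ("Data Processing Pipeline", "Service Policy"),
    ("Reference Collection", "Representation Policy"),
    ("Federation", "Re-purposing Policy"),
]


def policies_in_force(stage):
    """Return the policies accumulated up to and including the given stage."""
    names = [name for name, _ in LIFE_CYCLE]
    last = names.index(stage)  # raises ValueError for an unknown stage
    return [policy for _, policy in LIFE_CYCLE[: last + 1]]
```

Under this model a collection moves to the next stage by adding one policy, not by being migrated to a different system.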

Applications

Data Grids (data sharing)
• Ocean Observatories Initiative
• The iPlant Collaborative
• National Optical Astronomy Observatory
• BaBar High Energy Physics
• Broad Institute genomics data grid
• Wellcome Trust Sanger Institute genomics data grid

Digital Libraries (data publication)
• Texas Digital Library
• French National Library
• UNC-CH SILS LifeTime Library

Repositories, Archives (data preservation)
• NASA Center for Climate Simulation
• Carolina Digital Repository

Sequencing Work – an Infrastructure View

[Diagram labels, in slide order:
• National Resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, …
• RENCI Infrastructure Test/Development: Pipelines, Genome Databases
• Distributed ad-hoc processing; iRODS data-grid managed processing
• Data Production: UNC HTSF, Third Party Vendors
• Clinical Data Systems: NCGenes, Secure Medical Workspace
• Production: Pipelines, Archive, Genome Databases
• Legend: RC, RENCI, LCCC, HTSF; infrastructure hardware; ITS software; LCCC; UNC High Throughput Sequencing Facility
• Local (TUCASI)
• Data Sharing: NIH, Other Institutions
• RefSeq, Genome Annotations, dbSNP, HGMD, 1000 Genomes
• VarDB, Hadoop]

Managing several hundred TBs of genomic data

Managing Data on the Research Side

[Diagram labels: RENCI storage (tape drives); UNC storage (tape drives); UNC HPC; RENCI HPC; external compute (Open Science Grid, Clemson, clouds); IT machines; RENCI Hadoop; genomics storage; lab machines; NIH; external partners; genomics HPC; genomics Hadoop; data providers; researchers; students; external collaborators; IT staff]

iRODS gracefully allows for introducing control
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• … all policy-driven
• … without breaking the in-place systems

Wild West → Managed

SILS LifeTime Library

Student digital libraries
• Enable students to build collections of photographs, MP3 audio files, class documents, video, and web site archives
• Resources provided by the School of Information and Library Science at UNC-CH
• Student collections range from 2 GBytes to 150 GBytes
• Number of files ranges from 2,000 to 12,000

SILS LifeTime Library Policies

Library management
• Replication
• Checksums
• Versioning
• Strict access controls
• Quotas
• Metadata catalog replication
• Installation environment archiving

Ingestion
• Automated synchronization of student directory with LifeTime Library
• Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project funded by the Institute of Museum and Library Services

Carolina Digital Repository

[Figure: Carolina Digital Repository ingest workflow]

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:
• Web browsers (iDrop-web, Rich Web client)
• Web services (VOSpace)
• Load libraries (Python, Java)
• I/O libraries (C, C++, Fortran)
• File systems (FUSE, WebDAV, Parrot)
• Synchronization interfaces (iDrop)
• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
• Workflows (Kepler, Taverna)
• Digital libraries (Fedora, DSpace)
• Portals (EnginFrame)

Managing Information & Knowledge

Concepts
• Data: objects
• Information: names
• Knowledge: relationships between names
• Wisdom: relationships between relationships

Implementation
• Data (bytes): storage system
• Information (metadata): relational database
• Knowledge (policies, procedures): rule base, rule engine
• Wisdom (policy enforcement points): data grid

Data Virtualization

Layers, from client to storage: Access Interface; Policy Enforcement Points; Standard Micro-services; Standard I/O Operations; Storage Protocol; Storage System

• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
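The four mappings above form a chain of lookups, which the Python sketch below spells out. Everything in it is illustrative. The action, enforcement-point, and micro-service names are hypothetical and are not taken from the iRODS rule base, and a storage protocol is reduced to a string prefix.

```python
# Client action -> policy enforcement points (hypothetical names).
PEPS_FOR_ACTION = {"put": ["pre_put", "post_put"]}

# Policy enforcement point -> standard micro-services (hypothetical names).
MICROSERVICES_FOR_PEP = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}

# Micro-service -> standard POSIX-style I/O operations.
IO_FOR_MICROSERVICE = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}


def resolve(action, protocol):
    """Expand a client action into protocol-level calls, one layer at a time."""
    calls = []
    for pep in PEPS_FOR_ACTION.get(action, []):
        for service in MICROSERVICES_FOR_PEP.get(pep, []):
            for op in IO_FOR_MICROSERVICE.get(service, []):
                # Last mapping: standard I/O operation -> storage protocol call.
                calls.append(f"{protocol}.{op}")
    return calls
```

Swapping the `protocol` argument changes only the last layer, which is the point of the virtualization: policies and micro-services stay the same when the storage system changes.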

System and User-driven Rules

• The data grid automatically applies rules defined in the rule base core.re
• You can define rules that are applied interactively or that are deferred for later execution

    irule -F "rule-file.r"

Example Rule

Write “Hello World”

Create a rule file called ruleHello.r:

    myTestRule {
        writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut

Run it:

    irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
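Several of the requirements above (verify each file, repair bad replicas, work in batches, restart after a halt, log repairs) fit together as in the Python sketch below. It models the catalog and the replicas as in-memory dictionaries and uses SHA-256 from `hashlib`; both are assumptions made for illustration, and none of the names correspond to iRODS micro-services.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements above


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_collection(catalog, replicas, state, log, batch_size=BATCH_SIZE):
    """Check every replica against the catalog checksum and repair bad copies.

    catalog:  {logical name: expected checksum}
    replicas: {logical name: {resource name: bytes}}
    state:    {"last_checked": index}, persisted by the caller so a restart resumes
    log:      list that receives one line per repair or failure
    """
    names = sorted(catalog)
    start = state.setdefault("last_checked", 0)
    for begin in range(start, len(names), batch_size):
        for name in names[begin:begin + batch_size]:
            copies = replicas.setdefault(name, {})
            good = [r for r, data in copies.items() if checksum(data) == catalog[name]]
            if not good:
                log.append(f"no valid replica for {name}")
                continue
            for resource in sorted(copies):
                if resource not in good:
                    # Overwrite the bad copy from a replica that verified.
                    copies[resource] = copies[good[0]]
                    log.append(f"repaired {name} on {resource}")
        # Record progress after each batch so a halt loses at most one batch.
        state["last_checked"] = min(begin + batch_size, len(names))
    return state["last_checked"]
```

Persisting `state` after every batch is what makes the restart requirement cheap: a second call with the same `state` skips everything already checked. The deadline scheduler and load leveling from the list are left out of this sketch.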

Workflow Management & Registration

• Workflow file
• Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
• Input parameter file: lists parameters and input and output file names
• Directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• Automatically generated run file for executing each input file
• Output file created for eCWkflow.mpf

[Diagram labels: eCWkflow.mss; /earthCube/eCWkflow; eCWkflow.mpf; /earthCube/eCWkflow/eCWkflow.runDir0; eCWkflow.run; Outfile; eCWkflow2.run; eCWkflow2.mpf; /earthCube/eCWkflow/eCWkflow2.runDir0; Newfile]
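The slide says the run directory's version number is incremented for each invocation. One way to model that is sketched in Python below. The helper `next_run_dir` is hypothetical, and the "<collection>/<run name>.runDir<N>" pattern is an assumption reconstructed from the diagram labels, whose dots and slashes were lost in extraction.

```python
def next_run_dir(workflow_coll, run_name, existing):
    """Return the path of the next versioned run directory.

    workflow_coll: collection linked to the workflow file
    run_name:      base name of the run file, without its suffix
    existing:      iterable of paths already present in the collection
    """
    prefix = f"{workflow_coll}/{run_name}.runDir"
    versions = [
        int(path[len(prefix):])
        for path in existing
        if path.startswith(prefix) and path[len(prefix):].isdigit()
    ]
    # Start at 0 when no earlier run exists, otherwise use the highest version plus one.
    return f"{prefix}{max(versions, default=-1) + 1}"
```

Keeping one directory per invocation means earlier outputs are never overwritten, which is what allows a workflow run to be re-examined or re-executed later.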

Publications

• Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.
• Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org

NSF OCI-0940841 “DataNet Federation Consortium”
NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

    *Val = "0";
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    if (int(*Val) == 0) {
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
    }
    # on a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
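The initialization logic (create the TEST_DATA_ID attribute on first execution, otherwise read back the saved value) reduces to the Python sketch below. The collection's metadata is modelled as a plain dictionary, and `init_restart_marker` is a hypothetical helper rather than an iRODS call.

```python
def init_restart_marker(coll_metadata, attr="TEST_DATA_ID"):
    """Return (marker value, created flag) for a collection's restart marker.

    On first execution the attribute is missing, so it is created with value "0".
    On a restart the stored value is returned unchanged.
    """
    created = attr not in coll_metadata
    if created:
        coll_metadata[attr] = "0"
    # Metadata values are stored as strings, so convert for arithmetic use.
    return int(coll_metadata[attr]), created
```

Because the marker lives in the collection's metadata and not in the running process, it survives a system halt and tells the next execution where to resume.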

Workflow Operations Used

• Arithmetic (+, -)
• Boolean tests (==, !=, &&, ||, >, <, >=)
• Conditional statements: if, then, else
• Control: break, fail
• Loops: for, foreach, while
• List manipulation: initialization (list), addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
• Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation
• msiGetValByKey: get metadata from structure
• msiExecStrCondQuery: execute string conditional query
• msiString2KeyValPair: convert string to key-value pair
• msiAssociateKeyValuePairsToObj: add metadata
• msiMakeGenQuery: create a query
• msiExecGenQuery: execute a query
• msiCloseGenQuery: release query buffers
• msiGetContInxFromGenQueryOut: check for more rows
• msiRemoveKeyValuePairsFromObj: remove metadata
• msiGetMoreRows: get more rows from query

Data and directory manipulation
• msiIsColl: check whether name is a collection
• msiCollCreate: create a collection
• msiDataObjCreate: create a file
• msiDataObjRepl: replicate a file
• msiDataObjChksum: checksum a file
• msiDataObjUnlink: delete a file

System functions
• msiGetSystemTime: get the system time
• writeLine: write a line to a file or standard out
• msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state

[Workflow diagram labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Worldfile; Flowtable; RHESSys]



Page 7: iRODS

7

Policy-Based Data Environments

Purpose Reason a collection is assembled

Properties Attributes needed to ensure the purpose

Policies Controls for enforcing desired properties

bull mapped to computer actionable rules

Procedures Functions that implement the policies

bull Mapped to computer actionable workflows

Persistent state information Results of applying the procedures

bull mapped to system metadata

Property verification Validation that state information conforms to the desired purpose

bull mapped to periodically executed policies

Community-based Collection Life Cycle

Project

Collection

Private

Local

Policy

Data

Grid

Shared

Distribution

Policy

Digital

Library

Published

Description

Policy

Data

Processing

Pipeline

Analyzed

Service

Policy

Reference

Collection

Preserved

Representation

Policy

Federation

Sustained

Re-purposing

Policy

Stages correspond to addition of new policies for a broader community

Virtualize the stages of the collection life cycle through policy evolution

The driving purpose changes at each stage of the data life cycle

Applications

Data Grids (data sharing)

Ocean Observatories Initiative

The iPlant Collaborative

National Optical Astronomy Observatory

Babar High Energy Physics

Broad Institute genomics data grid

WellCome Trust Sanger Institute genomics data grid

Digital Libraries (data publication)

Texas Digital Library

French National Library

UNC-CH SILS LifeTime Library

Repositories Archives (data preservation)

NASA Center for Climate Simulation

Carolina Digital Repository

Sequencing Work ndash an Infrastructure View

RENCI

Science

Portal

Open

Science

Grid

TeraGrid

UNC BASS

hellip

National Resources

Pipelines Genome

Databases

RENCI Infrastructure TestDevelopment

Distributed ad-hoc

processing

iRODS data-grid managed

processing

Data Production

UNC HTFS

Third Party

Vendors

Clinical Data Systems

NCGenes

Secure Medical

Workspace

Production

Pipelines

Archive Genome

Databases

RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High

Throughput Sequencing facility

Local

(TUCASI)

Data Sharing

NIH

Other

Institutions

Ref

Se

q

Genome

Annotations

dbSN

P HGMD

1000

Genomes

Managing several hundred TBs of genomic data

VarDB Hadoop

Managing Data on the Research Side

[Diagram labels: RENCI storage (tape drives); UNC storage (tape drives); UNC HPC; RENCI HPC; external compute (Open Science Grid, Clemson, clouds); IT machines; RENCI Hadoop; genomics storage; lab machines; NIH; external partners; genomics HPC; genomics Hadoop; data providers; researchers; students; external collaborators; IT staff. The diagram contrasts "Wild West" with "Managed".]

iRODS gracefully allows for introducing control

• Data movement and replication

• Metadata standards

• Archival, deletion, and retention

• Integration with workflows, Hadoop, databases

• Hiding complexities

• Automation

• … all policy driven

• … without breaking the in-place systems

SILS LifeTime Library

Student digital libraries

Enable students to build collections of:

• Photographs

• MP3 audio files

• Class documents

• Video

• Web site archive

Resources provided by the School of Information and Library Science at UNC-CH

Student collections range from 2 GBytes to 150 GBytes

Number of files ranges from 2000 to 12000

SILS LifeTime Library Policies

Library management

• Replication

• Checksums

• Versioning

• Strict access controls

• Quotas

• Metadata catalog replication

• Installation environment archiving

Ingestion

• Automated synchronization of student directory with LifeTime Library

• Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services

Carolina Digital Repository

Carolina Digital Repository

Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:

• Web browsers (iDrop-web, Rich Web client)

• Web services (VOSpace)

• Load libraries (Python, Java)

• I/O libraries (C, C++, Fortran)

• File systems (FUSE, WebDAV, Parrot)

• Synchronization interfaces (iDrop)

• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)

• Workflows (Kepler, Taverna)

• Digital Libraries (Fedora, DSpace)

• Portals (EnginFrame)

Managing Information & Knowledge

Concepts

• Data: objects

• Information: names

• Knowledge: relationships between names

• Wisdom: relationships between relationships

Implementation

• Data: bytes (storage system)

• Information: metadata (relational database)

• Knowledge: policies, procedures (rule base, rule engine)

• Wisdom: policy enforcement point (data grid)

Data Virtualization

[Diagram layers: Access Interface; Policy Enforcement Points; Standard Micro-services; Standard I/O Operations; Storage Protocol; Storage System]

• Map from the actions requested by the client to multiple policy enforcement points

• Map from policy to standard micro-services

• Map from micro-services to standard POSIX I/O operations

• Map standard I/O operations to the protocol supported by the storage system
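
The four mappings on the data virtualization slide can be pictured as a chain of table lookups. The Python sketch below is only an illustration of that layering: every table entry (`pre_put`, `check_quota`, `unix_disk`, and so on) is a hypothetical name invented for the example, not an actual iRODS policy enforcement point, micro-service, or storage driver.

```python
from __future__ import annotations

# All names in these tables are hypothetical, chosen only to show the layering.
ACTION_TO_PEPS = {"put": ["pre_put", "post_put"]}
PEP_TO_MICROSERVICES = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}
MICROSERVICE_TO_IO = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}
STORAGE_PROTOCOL = {"unix_disk": "posix", "tape_archive": "archive"}


def resolve(action: str, storage: str) -> list[str]:
    """Expand a client action into protocol-level calls for one storage system."""
    protocol = STORAGE_PROTOCOL[storage]
    calls = []
    for pep in ACTION_TO_PEPS[action]:                # action -> enforcement points
        for service in PEP_TO_MICROSERVICES[pep]:     # policy -> micro-services
            for op in MICROSERVICE_TO_IO[service]:    # micro-service -> I/O operations
                calls.append(f"{protocol}:{op}")      # I/O operation -> storage protocol
    return calls
```

Only the last lookup depends on the storage system, which is the point of the layering: the same policies and micro-services apply whichever protocol the storage speaks.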

System and User-driven Rules

The data grid automatically applies rules defined in the rule base core.re

You can define rules that are applied interactively or that are deferred for later execution

    irule -F "rule-file.r"

Example Rule

Write "Hello World"

Create a rule file called ruleHello.r:

    myTestRule {
        writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut

Run it with:

    irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
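
Several of the requirements above (checksum verification, replacement of missing or corrupt replicas, a repair log, batches of 256 files, and restart after a halt) can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the production rule: the catalog is an in-memory dict, replicas are byte strings, and `verify_collection`, `state`, and `max_batches` are hypothetical names. The deadline scheduler, load leveling, and detection of newly added files are left out.

```python
import hashlib

BATCH_SIZE = 256  # batch size named on the slide


def checksum(data: bytes) -> str:
    """Checksum used for integrity checks (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(catalog, state, log, batch_size=BATCH_SIZE, max_batches=None):
    """Verify replicas batch by batch, resuming from state["next_index"]."""
    names = sorted(catalog)  # logical names, independent of replica location
    done = 0
    while state.get("next_index", 0) < len(names):
        if max_batches is not None and done >= max_batches:
            break  # simulate a halt; state keeps the restart point
        start = state.get("next_index", 0)
        for name in names[start:start + batch_size]:
            entry = catalog[name]
            replicas = entry["replicas"]
            # Find one replica whose checksum matches the registered value.
            good = next((d for d in replicas.values()
                         if d is not None and checksum(d) == entry["checksum"]), None)
            for resource, data in replicas.items():
                if data is not None and checksum(data) == entry["checksum"]:
                    continue  # replica is present and intact
                if good is None:
                    log.append(f"{name}: no valid replica to repair {resource}")
                else:
                    replicas[resource] = good
                    log.append(f"{name}: replaced replica on {resource}")
        # Record progress only after the whole batch has been checked.
        state["next_index"] = min(start + batch_size, len(names))
        done += 1
    return state["next_index"]
```

Because progress is recorded once per batch, a halt costs at most one batch of repeated work on restart.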

Workflow Management & Registration

• Workflow file

• Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)

• Input parameter file: lists parameters and input and output file names

• Directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented

• Automatically generated run file for executing each input file

• Output file created for

[Diagram labels: eCWkflow.mss; eCWkflow.mpf; /earthCube/eCWkflow; /earthCube/eCWkflow/eCWkflow.runDir0; eCWkflow.run; Outfile; eCWkflow2.run; eCWkflow2.mpf; /earthCube/eCWkflow/eCWkflow2.runDir0; Newfile]

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 "DataNet Federation Consortium"

NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"

NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"

NSF SDCI-0721400 "Data Grids for Community Driven Applications"

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

    *Val = "0";
    # count the TEST_DATA_ID attributes already attached to the collection
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    if (int(*Val) == 0) {
        # first execution: attach TEST_DATA_ID=0 to the collection
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
    }
    # on a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
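
The initialization step boils down to "create the TEST_DATA_ID attribute with value 0 if it is missing, otherwise read the stored value so a restart resumes where it stopped". A minimal Python sketch of that logic follows, assuming a plain dict as a stand-in for the metadata catalog; `init_checkpoint` and the dict layout are hypothetical and are not iRODS APIs.

```python
def init_checkpoint(metadata: dict, collection: str, log: list) -> int:
    """Return the stored TEST_DATA_ID for a collection, creating it as 0 if absent."""
    attrs = metadata.setdefault(collection, {})
    if "TEST_DATA_ID" not in attrs:
        # First execution: attach the attribute and record the action.
        attrs["TEST_DATA_ID"] = "0"
        log.append(f"added TEST_DATA_ID attribute to collection {collection}")
    # On a restart the stored value is greater than 0.
    return int(attrs["TEST_DATA_ID"])
```

Calling it twice on the same collection logs the attribute creation only once, so the step is safe to repeat on every execution of the policy.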

Workflow Operations Used

• Arithmetic (+, -)

• Boolean tests (==, !=, &&, ||, >, <, >=)

• Conditional statements: if, then, else

• Control: break, fail

• Loops: for, foreach, while

• List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)

• Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation

• msiGetValByKey: get metadata from structure

• msiExecStrCondQuery: execute string conditional query

• msiString2KeyValPair: convert string to key-value pair

• msiAssociateKeyValuePairsToObj: add metadata

• msiMakeGenQuery: create a query

• msiExecGenQuery: execute a query

• msiCloseGenQuery: release query buffers

• msiGetContInxFromGenQueryOut: check for more rows

• msiRemoveKeyValuePairsFromObj: remove metadata

• msiGetMoreRows: get more rows from query

Micro-services Used

Data and directory manipulation

• msiIsColl: check whether name is a collection

• msiCollCreate: create a collection

• msiDataObjCreate: create a file

• msiDataObjRepl: replicate a file

• msiDataObjChksum: checksum a file

• msiDataObjUnlink: delete a file

System functions

• msiGetSystemTime: get the system time

• writeLine: write a line to a file or standard out

• msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs

• Execute metadata query: 714 msecs

• Disk seek latency: 5 msecs

• Disk rotational latency: 11 msecs

• Production loop logic: 63 msecs

• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of a Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state

[Diagram labels: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); slope; aspect; streams (NHD); roads (DOT); strata; hillslope; patch; basin; stream network; nested watershed structure; land use; leaf area index; phenology; soil data; NLCD (EPA); Landsat TM; MODIS; USDA; soil and vegetation parameter files; worldfile; flowtable; RHESSys]

iRODS Rule for RHESSys

    main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }

Modular workflow composed by chaining basic transformations:

• Define input variables

• Call functions to apply each transformation step

• Store results in shared collection
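
The chaining pattern (define inputs, call one function per transformation step, keep results where later steps can find them) can be sketched in Python as follows. The step functions are hypothetical placeholders that only build strings; they do not call RHESSys, GRASS, or iRODS, and the shared dict stands in for the shared collection.

```python
def run_workflow(steps, inputs):
    """Apply each step in order; every step reads and extends a shared context."""
    context = dict(inputs)  # input variables defined up front
    for step in steps:
        context.update(step(context))  # store the step's results for later steps
    return context


# Hypothetical placeholder steps, named loosely after the rule's stages.
def get_extent(ctx):
    return {"extent_vect": f"extent-of-{ctx['gage_reachcode']}"}


def convert_extent(ctx):
    return {"extent_dem": ctx["extent_vect"].upper()}


def extract_tile(ctx):
    return {"tile": f"SUBSET-{ctx['extent_dem']}.img"}
```

A step can be swapped or inserted without touching the others, as long as it reads and writes the agreed context keys.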

    extractTileFromNHD_DEM(*extentCoords) {
        # Split path to object into collection and name
        msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
        writeLine("serverLog", *nhdDEMObjColl);
        writeLine("serverLog", *nhdDEMObjName);
        # Build query to discover physical path
        msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
        msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
        msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
        msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
        # Run query
        msiExecGenQuery(*genQInp, *genQOut);
        # Extract path from query result
        foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
        writeLine("serverLog", *filePath);
        # Determine physical path of input directory
        msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
        # Generate physical path of output file
        msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
        *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
        *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
        # Generate iRODS path of output
        msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
        *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
        # Extract the tile with gdal_translate on the remote host
        *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
        writeLine("serverLog", *args);
        msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
        writeLine("serverLog", *cmd_out);
        # Register tile file with iRODS
        msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
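
The path arithmetic and argument string in the rule can be checked outside iRODS. The Python sketch below rebuilds the `gdal_translate` command line as an argument list; `build_tile_command` is a hypothetical helper, the example paths are invented, and the `.img` suffix is assumed from the HFA (Erdas Imagine) output format. It only constructs the command and does not run it.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def build_tile_command(extent_coords: str, in_file_dir: str) -> list[str]:
    """Build the gdal_translate argument list used to cut a tile from a raster."""
    src = PurePosixPath(in_file_dir)
    # The tile is named after the raster dataset directory and placed beside it.
    tile_name = f"SUBSET-{src.name}.img"
    tile_path = src.parent / tile_name
    return [
        "gdal_translate",
        "-of", "HFA",                        # output format
        "-projwin", *extent_coords.split(),  # ulx uly lrx lry
        str(src),
        str(tile_path),
    ]
```

Passing the result to `subprocess.run` would execute it where GDAL is installed; keeping the arguments as a list avoids shell quoting problems that a concatenated string can run into.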

Summary

iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures

• Enables a flexible, adaptive, and customizable data management architecture

• "Canned" scripts (policies) can be created to standardize and automate users' processes

• Simple menu-driven interface

• No CS degree needed

iRODS is the middleware for distributed data management

Thank you

Questions

Page 9: iRODS

Applications

Data Grids (data sharing)

Ocean Observatories Initiative

The iPlant Collaborative

National Optical Astronomy Observatory

Babar High Energy Physics

Broad Institute genomics data grid

WellCome Trust Sanger Institute genomics data grid

Digital Libraries (data publication)

Texas Digital Library

French National Library

UNC-CH SILS LifeTime Library

Repositories Archives (data preservation)

NASA Center for Climate Simulation

Carolina Digital Repository

Sequencing Work ndash an Infrastructure View

RENCI

Science

Portal

Open

Science

Grid

TeraGrid

UNC BASS

hellip

National Resources

Pipelines Genome

Databases

RENCI Infrastructure TestDevelopment

Distributed ad-hoc

processing

iRODS data-grid managed

processing

Data Production

UNC HTFS

Third Party

Vendors

Clinical Data Systems

NCGenes

Secure Medical

Workspace

Production

Pipelines

Archive Genome

Databases

RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High

Throughput Sequencing facility

Local

(TUCASI)

Data Sharing

NIH

Other

Institutions

Ref

Se

q

Genome

Annotations

dbSN

P HGMD

1000

Genomes

Managing several hundred TBs of genomic data

VarDB Hadoop

Managing Data on the Research Side

RENCI

STORAGE

(Tape Drives)

UNC

STORAGE

(Tape Drives)

UNC HPC RENCI HPC

External

Compute

Open Science

Grid

Clemson

Clouds

IT Machines

RENCI Hadoop

Genomics

Storage

Lab

Machines

NIH

External

Partners

Genomics HPC

Genomics

Hadoop

Data

Providers

Researchers

Students

External

Collaborators

IT Staff

iRODS gracefully allows for introducing control

bullData movement and replication

bullMetadata standards

bullArchival deletion and retention

bullIntegration with workflows hadoop databases

bullHiding complexities

bullAutomation

bullhellip all policy driven

bullhellip without breaking the in-place systems

Wild West Managed

SILS LifeTime Library

Student digital libraries

Enable students to build collections of

Photographs

MP3 audio files

Class documents

Video

Web site archive

Resources provided by School of Information and

Library Science at UNC-CH

Student collections range from 2 GBytes to 150 Gbytes

Number of files from 2000 to 12000

SILS LifeTime Library Policies

Library management

Replication

Checksums

Versioning

Strict access controls

Quotas

Metadata catalog replication

Installation environment archiving

Ingestion

Automated synchronization of student directory

with LifeTime Library

Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project

funded by the Institute for Museum and Library Services

Carolina Digital Repository

Carolina Digital Repository

Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to

interact with the data grid

Web browsers (iDrop-web Rich Web client)

Web services (VOSpace)

Load libraries (Python Java)

IO libraries (C C++ Fortran)

File systems (FUSE WebDav Parrot)

Synchronization interfaces (iDrop)

Unix tools Grid tools (icommands SAGA SRM Griphyn)

Workflows (Kepler Taverna)

Digital Libraries (Fedora DSpace)

Portals (EnginFrame)

Managing Information amp Knowledge

Concepts

Data objects

Information names

Knowledge relationships between names

Wisdom relationships between relationships

Implementation

Data bytes Storage system

Information metadata Relational database

Knowledge policies procedures Rule base Rule engine

Wisdom policy enforcement point Data Grid

Data Virtualization

Storage System

Storage Protocol

Access Interface

Policy Enforcement Points

Standard Micro-services

Standard IO Operations

bull Map from the actions

requested by the client to

multiple policy

enforcement points

bull Map from policy to

standard micro-services

bull Map from micro-services

to standard Posix IO

operations

bull Map standard IO

operations to the

protocol supported by

the storage system

System and User-driven Rules

The data grid automatically applies rules

defined in the rule base corere

You can define rules that are applied

interactively or that are deferred for later

execution

irule ndashF ldquorule-filerrdquo

Example Rule

Write ldquoHello Worldrdquo

Create rule file call ruleHellor

myTestRule

writeLine(stdoutrdquo Hello World)

INPUT null

OUTPUT ruleExecOut

irule ndashF ruleHellor

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked

Workflow Management amp Registration

Workflow file

Directory holding all input and output files

associated with workflow file (mounted

collection that is linked to the workflow file)

Input parameter file lists parameters

and input and output file names

Directory holding all output

files generated for invocation

of eCWkflowrun the version

number is incremented

Automatically generated run file for

Executing each input file

Output file created for

eCWKflowmpf

eCWkflowmss

earthCubeeCWkflow

eCWkflowmpf

earthCubeeCWkfloweCWkflowrunDir0

eCWkflowrun

Outfile

eCWkflow2run

eCWkflow2mpf

earthCubeeCWkfloweCWkflow2runDir

0

Newfile

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 "DataNet Federation Consortium"

NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"

NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"

NSF SDCI-0721400 "Data Grids for Community Driven Applications"

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
# first execution: the attribute does not exist yet, so create it with value 0
if (int(*Val) == 0) {
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
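The fragment on this slide creates the TEST_DATA_ID attribute with value 0 on the first execution and reads the stored value back on a restart. The same initialize-or-resume pattern can be sketched in Python, with a plain dictionary standing in for the collection metadata; the dictionary and the function name are assumptions for illustration, not part of iRODS.

```python
def init_or_resume(metadata, attr="TEST_DATA_ID"):
    """Return (last_checked_id, created) for a collection's metadata dict."""
    created = attr not in metadata
    if created:
        metadata[attr] = "0"  # first execution: start from ID 0
    # Metadata values are stored as strings, so convert before use.
    return int(metadata[attr]), created
```

Calling the function twice is harmless: the second call finds the attribute and only reads it, which is the property that makes a restart safe.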

Workflow Operations Used

Arithmetic (+, -)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements: if, then, else
Control: break, fail
Loops: for, foreach, while
List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation: initialization, type conversion (int, double, str)
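The three list operations named above can be pictured with small Python equivalents. This sketch assumes that cons adds the new element at the front and that setelem returns an updated copy rather than changing the list in place; both points are assumptions about the rule language, not statements taken from the slides.

```python
def cons(item, items):
    """List addition: return a new list with item placed first."""
    return [item] + list(items)


def elem(items, index):
    """Extract the element at a zero-based index."""
    return items[index]


def setelem(items, index, value):
    """Return a copy of items with the element at index replaced."""
    updated = list(items)
    updated[index] = value
    return updated
```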

Micro-services Used

Metadata catalog manipulation
msiGetValByKey: get metadata from structure
msiExecStrCondQuery: execute string conditional query
msiString2KeyValPair: convert string to key-value pair
msiAssociateKeyValuePairsToObj: add metadata
msiMakeGenQuery: create a query
msiExecGenQuery: execute a query
msiCloseGenQuery: release query buffers
msiGetContInxFromGenQueryOut: check for more rows
msiRemoveKeyValuePairsFromObj: remove metadata
msiGetMoreRows: get more rows from query

Micro-services Used

Data and directory manipulation
msiIsColl: check whether name is a collection
msiCollCreate: create a collection
msiDataObjCreate: create a file
msiDataObjRepl: replicate a file
msiDataObjChksum: checksum a file
msiDataObjUnlink: delete a file

System functions
msiGetSystemTime: get the system time
writeLine: write a line to a file or standard out
msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state.

[Figure: RHESSys workflow diagram. Steps and inputs: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); streams (NHD); roads (DOT); land use; leaf area index; phenology; soil data. Data sources: NLCD (EPA), Landsat TM, MODIS, USDA. Derived products: slope, aspect, strata, hillslope, patch, basin, stream network, nested watershed structure, soil and vegetation parameter files. Outputs: worldfile and flowtable for RHESSys.]

iRODS Rule for RHESSys

main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}

Modular workflow composed by chaining basic transformations:
Define input variables
Call functions to apply each transformation step
Store results in a shared collection
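The chaining idea (define inputs, apply each transformation step in turn, store the result) can be sketched in Python. The step functions below are hypothetical stand-ins chosen for illustration; they do not reproduce the actual RHESSys rules.

```python
from functools import reduce


def run_workflow(steps, inputs):
    """Apply each step in order; every step maps a context dict to a new one."""
    return reduce(lambda ctx, step: step(ctx), steps, dict(inputs))


# Hypothetical stand-ins for transformation steps.
def get_extent(ctx):
    return {**ctx, "extent": f"extent-for-{ctx['gage']}"}


def extract_tile(ctx):
    return {**ctx, "tile": f"SUBSET-{ctx['extent']}.img"}


def store_result(ctx):
    return {**ctx, "stored": ["shared/" + ctx["tile"]]}
```

Because each step only reads what earlier steps wrote into the context, steps can be reordered, replaced, or reused in another workflow without touching the others.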

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    # Run gdal_translate on the remote host to cut out the tile
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
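The path handling in this rule (split the physical path twice, build the SUBSET file name, then assemble the gdal_translate arguments) can be followed more easily in a short Python sketch. The function name and the example paths used to exercise it are hypothetical; only the `-of HFA` and `-projwin` options come from the rule.

```python
import posixpath


def build_tile_command(file_path, obj_coll, extent_coords):
    """Derive output paths and the gdal_translate argument list.

    file_path     -- physical path of a file inside the raster dataset directory
    obj_coll      -- logical collection that holds the dataset
    extent_coords -- four window coordinates as a space-separated string
    """
    in_dir, _header = posixpath.split(file_path)       # input directory
    parent_dir, dataset = posixpath.split(in_dir)      # parent dir, dataset name
    tile_name = "SUBSET-" + dataset + ".img"
    tile_path = posixpath.join(parent_dir, tile_name)  # physical output path
    coll_parent, _ = posixpath.split(obj_coll)
    tile_obj_path = posixpath.join(coll_parent, tile_name)  # logical output path
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_dir, tile_path]
    return argv, tile_path, tile_obj_path
```

Returning the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains a space.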

Summary

iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures

Enables a flexible, adaptive, and customizable data management architecture

"Canned" scripts (policies) can be created to standardize and automate users' processes

Simple menu-driven interface

No CS degree needed

iRODS is the middleware for distributed data management

Thank you

Questions?


Sequencing Work - an Infrastructure View

[Figure: sequencing infrastructure diagram. Labels: RENCI Science Portal; Open Science Grid; TeraGrid; UNC BASS; national resources; pipelines; genome databases; RENCI infrastructure (test/development); distributed ad-hoc processing; iRODS data-grid managed processing; data production (UNC HTSF, third-party vendors); clinical data systems (NCGenes, Secure Medical Workspace); production pipelines; archive; local (TUCASI); data sharing (NIH, other institutions); RefSeq; genome annotations; dbSNP; HGMD; 1000 Genomes; VarDB; Hadoop. Legend abbreviations: RC, RENCI, LCCC, HTSF (UNC High Throughput Sequencing Facility), ITS; infrastructure, hardware, software.]

Managing several hundred TBs of genomic data

Managing Data on the Research Side

[Figure: research data landscape moving from "Wild West" to "Managed". Resources: RENCI storage (tape drives), UNC storage (tape drives), UNC HPC, RENCI HPC, external compute (Open Science Grid, Clemson, clouds), IT machines, RENCI Hadoop, genomics storage, genomics HPC, genomics Hadoop, lab machines. People and organizations: NIH, external partners, data providers, researchers, students, external collaborators, IT staff.]

iRODS gracefully allows for introducing control:

• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• … all policy driven
• … without breaking the in-place systems

SILS LifeTime Library

Student digital libraries

Enable students to build collections of

Photographs

MP3 audio files

Class documents

Video

Web site archive

Resources provided by School of Information and

Library Science at UNC-CH

Student collections range from 2 GBytes to 150 GBytes

Number of files ranges from 2,000 to 12,000

SILS LifeTime Library Policies

Library management

Replication

Checksums

Versioning

Strict access controls

Quotas

Metadata catalog replication

Installation environment archiving

Ingestion

Automated synchronization of student directory with LifeTime Library

Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services

Carolina Digital Repository

Carolina Digital Repository Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:

Web browsers (iDrop-web, Rich Web client)
Web services (VOSpace)
Load libraries (Python, Java)
I/O libraries (C, C++, Fortran)
File systems (FUSE, WebDAV, Parrot)
Synchronization interfaces (iDrop)
Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
Workflows (Kepler, Taverna)
Digital Libraries (Fedora, DSpace)
Portals (EnginFrame)

Managing Information & Knowledge

Concepts
Data: objects
Information: names
Knowledge: relationships between names
Wisdom: relationships between relationships

Implementation
Data: bytes; storage system
Information: metadata; relational database
Knowledge: policies, procedures; rule base, rule engine
Wisdom: policy enforcement point; data grid

Data Virtualization

Layers shown in the figure: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System

• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
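The four mappings can be read as a chain of lookups. The Python sketch below expands one client action through each layer; every name in the tables (actions, enforcement points, micro-services, protocol verbs) is invented for illustration and is not a real iRODS identifier.

```python
# All table contents are illustrative assumptions.
PEPS_FOR_ACTION = {"put": ["check_access", "select_resource", "post_put"]}

MICROSERVICES_FOR_PEP = {
    "check_access": ["verify_permission"],
    "select_resource": ["choose_storage"],
    "post_put": ["compute_checksum", "replicate"],
}

IO_FOR_MICROSERVICE = {
    "verify_permission": [],
    "choose_storage": [],
    "compute_checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}

PROTOCOL_OPS = {
    "unix": {"open": "open", "read": "read", "write": "write", "close": "close"},
    "object_store": {"open": "HEAD", "read": "GET", "write": "PUT", "close": "noop"},
}


def resolve(action, protocol):
    """Expand a client action into storage-protocol calls, layer by layer."""
    calls = []
    for pep in PEPS_FOR_ACTION[action]:
        for msvc in MICROSERVICES_FOR_PEP[pep]:
            for op in IO_FOR_MICROSERVICE[msvc]:
                calls.append((pep, msvc, PROTOCOL_OPS[protocol][op]))
    return calls
```

Only the last table depends on the storage system, so adding a new kind of storage means adding one entry to `PROTOCOL_OPS` while the policies and micro-services stay unchanged.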

System and User-driven Rules

The data grid automatically applies rules defined in the rule base core.re

You can define rules that are applied interactively or that are deferred for later execution

irule -F "rule-file.r"

Example Rule

Write "Hello World"

Create a rule file called ruleHello.r:

myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut

Run the rule:

irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection



Page 12: iRODS

SILS LifeTime Library

Student digital libraries

Enable students to build collections of

Photographs

MP3 audio files

Class documents

Video

Web site archive

Resources provided by School of Information and

Library Science at UNC-CH

Student collections range from 2 GBytes to 150 Gbytes

Number of files from 2000 to 12000

SILS LifeTime Library Policies

Library management

Replication

Checksums

Versioning

Strict access controls

Quotas

Metadata catalog replication

Installation environment archiving

Ingestion

Automated synchronization of student directory

with LifeTime Library

Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services

Carolina Digital Repository

Carolina Digital Repository: Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:
• Web browsers (iDrop-web, Rich Web client)
• Web services (VOSpace)
• Load libraries (Python, Java)
• I/O libraries (C, C++, Fortran)
• File systems (FUSE, WebDAV, Parrot)
• Synchronization interfaces (iDrop)
• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
• Workflows (Kepler, Taverna)
• Digital Libraries (Fedora, DSpace)
• Portals (EnginFrame)

Managing Information & Knowledge

Concepts
• Data: objects
• Information: names
• Knowledge: relationships between names
• Wisdom: relationships between relationships

Implementation
• Data: bytes (storage system)
• Information: metadata (relational database)
• Knowledge: policies and procedures (rule base, rule engine)
• Wisdom: policy enforcement point (data grid)

Data Virtualization

Layers, from client to storage: Access Interface; Policy Enforcement Points; Standard Micro-services; Standard I/O Operations; Storage Protocol; Storage System

• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
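The four mappings can be pictured as a chain of lookup tables. The Python sketch below is an illustration only: every action, enforcement-point, micro-service, and driver name in it is hypothetical and does not correspond to an actual iRODS identifier.

```python
# Illustrative only: every name in these tables is hypothetical,
# chosen to mirror the four mapping layers of data virtualization.

# Layer 1: client action -> policy enforcement points (PEPs)
ACTION_TO_PEPS = {
    "put": ["pre_put_check", "post_put_process"],
}

# Layer 2: PEP -> standard micro-services
PEP_TO_MICROSERVICES = {
    "pre_put_check": ["check_quota"],
    "post_put_process": ["checksum_file", "replicate_file"],
}

# Layer 3: micro-service -> standard POSIX-style I/O operations
MICROSERVICE_TO_IO = {
    "check_quota": ["stat"],
    "checksum_file": ["open", "read", "close"],
    "replicate_file": ["open", "read", "write", "close"],
}

# Layer 4: I/O operation -> call in the storage system's own protocol
STORAGE_DRIVERS = {
    "unix_fs": lambda op: f"posix:{op}",
    "object_store": lambda op: f"http:{op}",
}


def resolve(action, storage):
    """Expand a client action into the storage-protocol calls it triggers."""
    translate = STORAGE_DRIVERS[storage]
    calls = []
    for pep in ACTION_TO_PEPS[action]:
        for service in PEP_TO_MICROSERVICES[pep]:
            for op in MICROSERVICE_TO_IO[service]:
                calls.append((pep, service, translate(op)))
    return calls
```

Because only the last table depends on the storage system, swapping "unix_fs" for "object_store" changes the emitted protocol calls while the policies and micro-services stay the same.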

System and User-driven Rules

The data grid automatically applies rules defined in the rule base, core.re.

You can define rules that are applied interactively or that are deferred for later execution:

    irule -F "rule-file.r"

Example Rule

Write "Hello World"

Create a rule file called ruleHello.r:

    myTestRule {
        writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut

Run the rule:

    irule -F ruleHello.r

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
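Three of these requirements (batches of 256 files, restart from the last checked set, and a deadline scheduler) can be sketched in Python. This is a minimal illustration under stated assumptions, not the production rule: in-memory dictionaries stand in for the storage system and the metadata catalog, SHA-256 is an arbitrary choice of checksum, and all function names are hypothetical.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements list


def batches(items, size=BATCH_SIZE):
    """Yield (start index, slice) pairs so memory use stays bounded."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def verify_collection(files, expected, state, log):
    """Check each file's checksum, resuming from state['next'].

    files    : dict mapping logical name -> bytes (stand-in for stored data)
    expected : dict mapping logical name -> expected SHA-256 hex digest
    state    : dict holding the restart position, updated after every batch
    log      : list that collects one line per problem found
    """
    names = sorted(files)
    for start, batch in batches(names):
        if start < state.get("next", 0):
            continue  # this batch was completed before the restart
        for name in batch:
            digest = hashlib.sha256(files[name]).hexdigest()
            if digest != expected[name]:
                log.append(f"checksum mismatch: {name}")
        state["next"] = start + len(batch)  # checkpoint after each batch
    return state["next"]


def pause_per_batch(files_left, seconds_left, seconds_per_file, size=BATCH_SIZE):
    """Deadline scheduling: spread the spare time evenly over remaining batches."""
    batches_left = -(-files_left // size)  # ceiling division
    if batches_left == 0:
        return 0.0
    spare = seconds_left - files_left * seconds_per_file
    return max(0.0, spare / batches_left)
```

Sleeping once per batch rather than once per file keeps the number of sleep periods small, which is one way to read the "minimize the number of sleep periods" requirement.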

Workflow Management & Registration

• eCWkflow.mss: workflow file
• /earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (a mounted collection that is linked to the workflow file)
• eCWkflow.mpf: input parameter file; lists parameters and input and output file names
• eCWkflow.run: automatically generated run file for executing each input file
• /earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• Outfile: output file created for eCWkflow.mpf

A second parameter file follows the same pattern: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile

Publications

Rajasekar, A., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841: "DataNet Federation Consortium"

NSF OCI-1032732: "Improvement of iRODS for Multi-Disciplinary Applications"

NSF OCI-0848296: "NARA Transcontinental Persistent Archives Prototype"

NSF SDCI-0721400: "Data Grids for Community Driven Applications"

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

    *Val = "0";
    # Count how many TEST_DATA_ID attributes are attached to the collection
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    # First execution: create the attribute with value 0
    if (int(*Val) == 0) {
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
    }
    # On a restart, TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
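The initialization logic of this rule (create the TEST_DATA_ID attribute on the first execution, read it back on a restart) can be sketched in Python. This is an illustration under assumptions: a plain dictionary stands in for the metadata catalog, and the function name init_checkpoint is hypothetical.

```python
def init_checkpoint(metadata, coll, log, attr="TEST_DATA_ID"):
    """Return the restart position stored on a collection, creating it if absent.

    metadata : dict mapping collection name -> {attribute: value}; a stand-in
               for the metadata catalog
    coll     : collection name
    log      : list that collects log lines
    """
    attrs = metadata.setdefault(coll, {})
    if attr not in attrs:      # corresponds to the COUNT query returning 0
        attrs[attr] = "0"      # attribute values are kept as strings
        log.append(f"added {attr} attribute to collection {coll}")
    return int(attrs[attr])    # greater than 0 on a restart
```

Storing the position as an attribute on the collection itself means a restarted run needs no state beyond the collection name.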

Workflow Operations Used

• Arithmetic (+, -, *, /)
• Boolean tests (==, !=, &&, ||, >, <, >=)
• Conditional statements: if, then, else
• Control: break, fail
• Loops: for, foreach, while
• List manipulation: initialization (list), addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
• Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation
• msiGetValByKey: get metadata from structure
• msiExecStrCondQuery: execute string conditional query
• msiString2KeyValPair: convert string to key-value pair
• msiAssociateKeyValuePairsToObj: add metadata
• msiMakeGenQuery: create a query
• msiExecGenQuery: execute a query
• msiCloseGenQuery: release query buffers
• msiGetContInxFromGenQueryOut: check for more rows
• msiRemoveKeyValuePairsFromObj: remove metadata
• msiGetMoreRows: get more rows from query

Data and directory manipulation
• msiIsColl: check whether name is a collection
• msiCollCreate: create a collection
• msiDataObjCreate: create a file
• msiDataObjRepl: replicate a file
• msiDataObjChksum: checksum a file
• msiDataObjUnlink: delete a file

System functions
• msiGetSystemTime: get the system time
• writeLine: write a line to a file or standard out
• msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of a Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

[Figure: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state. Diagram labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Worldfile; Flowtable; RHESSys]

iRODS Rule for RHESSys

    main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }

Modular workflow composed by chaining basic transformations:
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
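The chaining pattern can be sketched in Python, assuming each step is a function that receives a shared context and returns an extended copy. The step functions and their return values below are hypothetical placeholders; they perform no real GIS processing.

```python
from functools import reduce


def run_pipeline(steps, context):
    """Apply each step in order; every step reads and extends a shared dict."""
    return reduce(lambda ctx, step: step(ctx), steps, context)


def get_extent(ctx):
    # Hypothetical lookup: a real step would query a hydrography data set.
    return {**ctx, "extent": (0.0, 10.0, 10.0, 0.0)}


def extract_tile(ctx):
    # Derive an output name from the extent produced by the previous step.
    ulx, uly, lrx, lry = ctx["extent"]
    return {**ctx, "tile": f"tile_{ulx}_{uly}_{lrx}_{lry}.img"}


def store_result(ctx):
    # Stand-in for registering the result in a shared collection.
    return {**ctx, "stored": ctx.get("stored", []) + [ctx["tile"]]}
```

A call such as run_pipeline([get_extent, extract_tile, store_result], {"gage": "G1"}) runs the three steps in order; adding or reordering a transformation changes only the list of steps.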

    extractTileFromNHD_DEM(*extentCoords) {
        # Split path to object into collection and name
        msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
        writeLine("serverLog", *nhdDEMObjColl);
        writeLine("serverLog", *nhdDEMObjName);
        # Build query to discover physical path
        msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
        msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
        msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
        msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
        # Run query
        msiExecGenQuery(*genQInp, *genQOut);
        # Extract path from query result
        foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
        writeLine("serverLog", *filePath);
        # Determine physical path of input directory
        msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
        # Generate physical path of output file
        msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
        *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
        *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
        # Generate iRODS path of output
        msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
        *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
        *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
        writeLine("serverLog", *args);
        msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
        writeLine("serverLog", *cmd_out);
        # Register tile file with iRODS
        msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
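The path and argument construction in this rule can be mirrored in Python. The sketch below only builds the gdal_translate argument list and the two output paths; it does not run the command or register anything. The function name build_tile_command and the sample paths in the usage note are assumptions for illustration.

```python
import posixpath


def build_tile_command(file_path, obj_coll, extent_coords):
    """Derive output paths and the gdal_translate argument list.

    file_path     : physical path of a file inside the raster data set directory
    obj_coll      : logical collection that holds the source data set
    extent_coords : "ulx uly lrx lry" string for the -projwin option
    """
    in_dir = posixpath.dirname(file_path)          # directory holding the raster
    parent_dir, dataset = posixpath.split(in_dir)  # parent directory, data set name
    tile_name = f"SUBSET-{dataset}.img"
    tile_path = posixpath.join(parent_dir, tile_name)
    tile_obj_path = posixpath.join(posixpath.dirname(obj_coll), tile_name)
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_dir, tile_path]
    return argv, tile_path, tile_obj_path
```

For example, build_tile_command("/data/dem/ned30/hdr.adf", "/zone/home/dem/ned30", "100 200 300 50") places the tile next to the source data set directory, both physically and in the logical collection hierarchy. Building the command as a list instead of one concatenated string avoids quoting problems when a path contains spaces.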

Summary

iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures

Enables a flexible, adaptive, and customizable data management architecture

"Canned" scripts (policies) can be created to standardize and automate users' processes
• Simple menu-driven interface
• No CS degree needed

iRODS is the middleware for distributed data management

Thank you

Questions?

Page 15: iRODS

Carolina Digital Repository

Ingest Workflow

iRODS Data Grid

More than 50 different clients have been used to interact with the data grid:

Web browsers (iDrop-web, Rich Web client)

Web services (VOSpace)

Load libraries (Python, Java)

I/O libraries (C, C++, Fortran)

File systems (FUSE, WebDAV, Parrot)

Synchronization interfaces (iDrop)

Unix tools, Grid tools (icommands, SAGA, SRM, Griphyn)

Workflows (Kepler, Taverna)

Digital libraries (Fedora, DSpace)

Portals (EnginFrame)

Managing Information & Knowledge

Concepts

Data: objects

Information: names

Knowledge: relationships between names

Wisdom: relationships between relationships

Implementation

Data: bytes (storage system)

Information: metadata (relational database)

Knowledge: policies and procedures (rule base, rule engine)

Wisdom: policy enforcement points (data grid)

Data Virtualization

Layers, from client to storage: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System

• Map from the actions requested by the client to multiple policy enforcement points

• Map from policy to standard micro-services

• Map from micro-services to standard POSIX I/O operations

• Map standard I/O operations to the protocol supported by the storage system
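
The four mappings above can be pictured as a chain of lookup tables. The following is a minimal Python sketch of that idea, not iRODS code: every action, enforcement-point, micro-service, and protocol name in it is hypothetical and chosen only for illustration.

```python
# Illustrative lookup tables. All names are hypothetical, not taken from iRODS.
ACTION_TO_PEPS = {
    "put": ["check_access", "select_storage", "post_process_put"],
}
PEP_TO_MICROSERVICES = {
    "check_access": ["verify_permissions"],
    "select_storage": ["choose_resource"],
    "post_process_put": ["checksum_file", "replicate_file"],
}
MICROSERVICE_TO_IO = {
    "verify_permissions": [],
    "choose_resource": [],
    "checksum_file": ["open", "read", "close"],
    "replicate_file": ["open", "read", "write", "close"],
}
IO_TO_PROTOCOL = {
    "posix_disk": {"open": "open", "read": "read", "write": "write", "close": "close"},
    "object_store": {"open": "HEAD", "read": "GET", "write": "PUT", "close": "noop"},
}


def resolve(action, storage):
    """Expand one client action into the storage-protocol calls it triggers."""
    calls = []
    for pep in ACTION_TO_PEPS[action]:
        for microservice in PEP_TO_MICROSERVICES[pep]:
            for io_op in MICROSERVICE_TO_IO[microservice]:
                calls.append((pep, microservice, io_op, IO_TO_PROTOCOL[storage][io_op]))
    return calls


if __name__ == "__main__":
    for call in resolve("put", "object_store"):
        print(call)
```

Because only the last table depends on the storage system, swapping `"object_store"` for `"posix_disk"` changes the protocol calls while the policies and micro-services stay the same, which is the point of the virtualization layers.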

System and User-driven Rules

The data grid automatically applies rules defined in the rule base core.re

You can define rules that are applied interactively or that are deferred for later execution

irule -F "rule-file.r"

Example Rule

Write "Hello World"

Create a rule file called ruleHello.r:

myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut

irule -F ruleHello.r
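
The two steps above (create the rule file, then pass it to `irule -F`) could be scripted. The Python sketch below is one way to do that, assuming the rule text shown in the example; the helper names `write_rule_file` and `irule_command` are hypothetical. It only builds the command line, since running it needs a configured iRODS client.

```python
import os
import shlex
import tempfile

# Rule text from the Hello World example.
RULE_TEXT = '''myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
'''


def write_rule_file(directory, name="ruleHello.r"):
    """Write the rule text to directory/name and return the full path."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(RULE_TEXT)
    return path


def irule_command(rule_path):
    """Build the argument list for irule; this does not execute anything."""
    return ["irule", "-F", rule_path]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        rule_path = write_rule_file(workdir)
        print(shlex.join(irule_command(rule_path)))
```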

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
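
Several of the requirements above (batching, checksum verification, replica repair, a repair log, and restart from the last checked file) can be sketched together. The Python below is a simplified in-memory model, not the production rule: the catalog and replicas are plain dictionaries, SHA-256 stands in for whatever checksum the grid uses, and the deadline scheduler and load leveling are left out. All function and parameter names are hypothetical.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements list


def checksum(data):
    """Checksum stand-in; the choice of SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).hexdigest()


def verify_batch(catalog, replicas, state, log, batch_size=BATCH_SIZE):
    """Check one batch of files, repair bad replicas, and record progress.

    catalog:  {logical_name: expected_checksum}
    replicas: {logical_name: {storage_name: bytes, or None if missing}}
    state:    {"last_checked": logical_name or None}, the restart checkpoint
    Returns the number of files checked in this batch.
    """
    names = sorted(catalog)
    if state.get("last_checked") is not None:
        # Restart support: skip everything up to the checkpoint.
        names = [n for n in names if n > state["last_checked"]]
    batch = names[:batch_size]
    for name in batch:
        copies = replicas[name]
        good = [s for s, d in copies.items()
                if d is not None and checksum(d) == catalog[name]]
        if not good:
            log.append(f"{name}: no valid replica, cannot repair")
        else:
            source = copies[good[0]]
            for storage, data in list(copies.items()):
                if storage not in good:
                    reason = "missing" if data is None else "corrupt"
                    copies[storage] = source
                    log.append(f"{name}: replaced {reason} replica on {storage}")
        state["last_checked"] = name
    return len(batch)
```

Calling `verify_batch` repeatedly until it returns 0 walks the whole collection one batch at a time, and persisting `state` between calls is what allows a restart after a halt.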

Workflow Management & Registration

eCWkflow.mss: workflow file

/earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)

eCWkflow.mpf: input parameter file; lists parameters and input and output file names

eCWkflow.run: automatically generated run file for executing each input file

/earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented

Outfile: output file created for eCWkflow.mpf

A second parameter file follows the same pattern: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
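
The slide says the run directory's version number is incremented but does not say how the next number is chosen. The Python sketch below shows one plausible scheme, assuming names of the form `<workflow>.runDir<N>` and that the next run takes the highest existing number plus one. The function name `next_run_dir` is hypothetical.

```python
import re


def next_run_dir(workflow, existing):
    """Return the next run-directory name for a workflow.

    workflow: base name such as "eCWkflow"
    existing: names already present in the workflow collection
    """
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    # Assumption: numbering starts at 0 and the next run uses max + 1.
    next_version = max(versions) + 1 if versions else 0
    return f"{workflow}.runDir{next_version}"
```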

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 "DataNet Federation Consortium"

NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"

NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"

NSF SDCI-0721400 "Data Grids for Community Driven Applications"

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
    # first execution: attach the restart marker to the collection
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
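
The logic of this initialization step is: if the collection has no TEST_DATA_ID attribute, add it with value 0; then read the current value, which is greater than 0 on a restart. A minimal Python model of that logic follows, assuming a plain dictionary in place of the metadata catalog. The function name `init_restart_marker` is hypothetical.

```python
def init_restart_marker(collection, metadata, log):
    """Ensure TEST_DATA_ID exists on the collection and return its value.

    collection: collection name, used only in the log message
    metadata:   {attribute_name: string_value}, a stand-in for the catalog
    log:        list that receives log lines
    """
    if "TEST_DATA_ID" not in metadata:
        # First execution: create the marker so later runs can resume.
        metadata["TEST_DATA_ID"] = "0"
        log.append(f"added TEST_DATA_ID attribute to collection {collection}")
    # On a restart this value is greater than 0.
    return int(metadata["TEST_DATA_ID"])
```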

Workflow Operations Used

Arithmetic (+ - )

Boolean tests (==, !=, &&, ||, >, <, >=)

Conditional statements: if, then, else

Control: break, fail

Loops: for, foreach, while

List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)

Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation

msiGetValByKey: get metadata from structure

msiExecStrCondQuery: execute string conditional query

msiString2KeyValPair: convert string to key-value pair

msiAssociateKeyValuePairsToObj: add metadata

msiMakeGenQuery: create a query

msiExecGenQuery: execute a query

msiCloseGenQuery: release query buffers

msiGetContInxFromGenQueryOut: check for more rows

msiRemoveKeyValuePairsFromObj: remove metadata

msiGetMoreRows: get more rows from query

Data and directory manipulation

msiIsColl: check whether name is a collection

msiCollCreate: create a collection

msiDataObjCreate: create a file

msiDataObjRepl: replicate a file

msiDataObjChksum: checksum a file

msiDataObjUnlink: delete a file

System functions

msiGetSystemTime: get the system time

writeLine: write a line to a file or standard out

msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs

• Execute metadata query: 714 msecs

• Disk seek latency: 5 msecs

• Disk rotational latency: 11 msecs

• Production loop logic: 63 msecs

• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state

[Figure: RHESSys workflow diagram. Steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM). Derived layers: slope, aspect, stream network, nested watershed structure (basin, hillslope, patch, strata). Other inputs: streams (NHD), roads (DOT), land use, leaf area index, phenology, soil data (sources shown: NLCD (EPA), Landsat TM, MODIS, USDA), soil and vegetation parameter files. Outputs: worldfile and flowtable for RHESSys.]

iRODS Rule for RHESSys

main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}

Modular workflow composed by chaining basic transformations:

Define input variables

Call functions to apply each transformation step

Store results in shared collection
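
The chaining pattern described here (define inputs, apply each transformation in turn, keep the results together) can be shown in a few lines. The Python sketch below is a toy model: the three step functions are hypothetical placeholders that only manipulate strings, and stand in for the real geospatial steps of the rule.

```python
from functools import reduce


def run_pipeline(steps, context):
    """Apply each step in order; every step reads and extends a shared context."""
    return reduce(lambda ctx, step: step(ctx), steps, context)


# Placeholder steps. Real steps would query NHD data and cut DEM tiles.
def get_extent(ctx):
    return {**ctx, "extent_vect": f"extent-of-{ctx['gage_reachcode']}"}


def convert_extent(ctx):
    return {**ctx, "extent_dem": ctx["extent_vect"].upper()}


def extract_tile(ctx):
    return {**ctx, "tile": f"SUBSET-{ctx['extent_dem']}.img"}


if __name__ == "__main__":
    result = run_pipeline(
        [get_extent, convert_extent, extract_tile],
        {"gage_reachcode": "r1"},  # input variable, hypothetical value
    )
    print(result["tile"])
```

Each step returns a new context rather than mutating shared state, so steps can be reordered, replaced, or re-run individually, which is what makes the workflow modular.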

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
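
The path handling in this rule is easy to get wrong, so it helps to trace it outside the grid. The Python sketch below mirrors the rule's path splitting and the construction of the `gdal_translate` argument list, assuming POSIX-style paths and an extent given as four space-separated numbers. The function name and the example paths are hypothetical, and nothing is executed or registered.

```python
import posixpath


def split_path(path):
    """Return (parent, name), the role msiSplitPath plays in the rule."""
    return posixpath.split(path)


def build_tile_command(header_file_path, dem_collection, extent_coords):
    """Derive the output paths and the gdal_translate argument list.

    header_file_path: physical path of a file inside the raster dataset directory
    dem_collection:   logical collection that holds the DEM object
    extent_coords:    "ulx uly lrx lry" as a single string
    """
    in_file_dir, _header = split_path(header_file_path)
    in_parent_dir, raster_name = split_path(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_parent_dir, tile_file_name)
    coll_parent, _ = split_path(dem_collection)
    tile_obj_path = posixpath.join(coll_parent, tile_file_name)
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_file_dir, tile_file_path]
    return argv, tile_obj_path


if __name__ == "__main__":
    argv, obj_path = build_tile_command(
        "/vault/nhd/dem01/hdr.adf",   # hypothetical physical path
        "/zone/home/nhd/dem01",       # hypothetical logical collection
        "10 40 20 30",
    )
    print(argv)
    print(obj_path)
```

Returning the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains spaces.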

Summary

iRODS is a powerful Policy-Based engine for Managing NextGen Big Data Cyber-Infrastructures

Enables a Flexible, Adaptive, and Customizable Data Management Architecture

"Canned" scripts (policies) can be created to standardize and automate users' processes

Simple menu-driven interface

No CS degree needed

iRODS is the middleware for Distributed Data Management

Thank you

Questions?

Page 18: iRODS

Data Virtualization

Storage System

Storage Protocol

Access Interface

Policy Enforcement Points

Standard Micro-services

Standard IO Operations

bull Map from the actions

requested by the client to

multiple policy

enforcement points

bull Map from policy to

standard micro-services

bull Map from micro-services

to standard Posix IO

operations

bull Map standard IO

operations to the

protocol supported by

the storage system

System and User-driven Rules

The data grid automatically applies rules

defined in the rule base corere

You can define rules that are applied

interactively or that are deferred for later

execution

irule ndashF ldquorule-filerrdquo

Example Rule

Write ldquoHello Worldrdquo

Create rule file call ruleHellor

myTestRule

writeLine(stdoutrdquo Hello World)

INPUT null

OUTPUT ruleExecOut

irule ndashF ruleHellor

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked
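
A few of the requirements above (batches of 256 files, restart from the last checked set, and pacing against a deadline with one sleep per batch) can be sketched in Python. This is a minimal illustration, not the production rule: the function name, the `state` dictionary, and the injected `clock`, `sleep`, and `read_bytes` callables are all assumptions made for the example, and replica repair, load leveling, and logging are left out.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the list above


def run_integrity_pass(names, read_bytes, expected, state, deadline, clock, sleep):
    """Checksum files in batches, checkpointing so a halted run can resume.

    names      -- ordered list of logical file names (hypothetical input)
    read_bytes -- callable returning the bytes of one file
    expected   -- dict mapping name -> expected SHA-256 hex digest
    state      -- dict persisted between runs; 'next' is the resume index
    deadline   -- clock value by which the pass should finish
    """
    bad = state.setdefault("bad", [])
    start = state.get("next", 0)  # the first execution starts at 0
    for i in range(start, len(names), BATCH_SIZE):
        t0 = clock()
        batch = names[i:i + BATCH_SIZE]
        for name in batch:
            digest = hashlib.sha256(read_bytes(name)).hexdigest()
            if digest != expected.get(name):
                bad.append(name)  # record the file for later repair
        state["next"] = i + len(batch)  # checkpoint after each batch
        remaining = -(-(len(names) - state["next"]) // BATCH_SIZE)
        if remaining:
            # One sleep per batch: spread the time left over the remaining
            # batches, assuming each costs about as much as the last one.
            pause = (deadline - clock()) / remaining - (clock() - t0)
            if pause > 0:
                sleep(pause)
    return state
```

Passing the clock and sleep functions in as arguments keeps the pacing logic testable without waiting in real time; a real deployment would persist `state` somewhere durable, such as the metadata catalog.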

Workflow Management & Registration

Workflow file: eCWkflow.mss

Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file): /earthCube/eCWkflow

Input parameter file (lists parameters and input and output file names): eCWkflow.mpf

Automatically generated run file for executing each input file: eCWkflow.run

Directory holding all output files generated for an invocation of eCWkflow.run (the version number is incremented): /earthCube/eCWkflow/eCWkflow.runDir0

Output file created for eCWkflow.mpf: Outfile

Second parameter file, with its run file, run directory, and output file: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
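
The versioned run directory can be illustrated with a short Python sketch. It assumes, purely for illustration, that run directories are named `<workflow>.runDir<N>` and that the next invocation takes the next unused number; the helper name `next_run_dir` is hypothetical and is not part of iRODS.

```python
import re


def next_run_dir(workflow, existing):
    """Return the next versioned run directory name for a workflow.

    workflow -- base name of the run file without extension (assumed convention)
    existing -- names already present in the mounted collection
    """
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    version = max(versions) + 1 if versions else 0
    return f"{workflow}.runDir{version}"


print(next_run_dir("eCWkflow", ["eCWkflow.runDir0", "eCWkflow2.runDir0"]))
```

Keeping each invocation's outputs in its own numbered directory means earlier results are never overwritten, which is what makes a re-execution comparable with the original run.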

Publications

Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.

Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN: 9781466469129, Amazon.com

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 “DataNet Federation Consortium”

NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”

NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”

NSF SDCI-0721400 “Data Grids for Community Driven Applications”

iRODS - Open Source Software

iRODS Distributed Data Management

Initializing Workflow Parameters

*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
    # first execution: create the checkpoint attribute on the collection
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
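
The logic of this initialization step is "create the checkpoint attribute if it is missing, then read it back". A minimal Python sketch of that pattern follows; the metadata catalog is stood in for by a plain dictionary, and the function name `init_checkpoint` and the `log` list are assumptions made for the example, not iRODS calls.

```python
def init_checkpoint(catalog, coll, attr="TEST_DATA_ID", log=None):
    """Return the resume point stored on a collection, creating it on first use.

    catalog -- dict of collection name -> {attribute: value}, a stand-in
               for the metadata catalog (hypothetical structure)
    coll    -- logical name of the collection being checked
    """
    attrs = catalog.setdefault(coll, {})
    if attr not in attrs:
        attrs[attr] = "0"  # first execution: start from the beginning
        if log is not None:
            log.append(f"added {attr} attribute to collection {coll}")
    # On a restart the stored value is greater than 0.
    return int(attrs[attr])


catalog = {}
log = []
print(init_checkpoint(catalog, "/zone/home/demo", log=log))
```

Storing the resume point as metadata on the collection itself, rather than in the rule's local variables, is what lets a later execution pick up where a halted one stopped.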

Workflow Operations Used

Arithmetic (+, -)

Boolean tests (==, !=, &&, ||, >, <, >=)

Conditional statements: if, then, else

Control: break, fail

Loops: for, foreach, while

List manipulation: initialization (list), addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)

Variable manipulation: initialization, type conversion (int, double, str)

Micro-services Used

Metadata catalog manipulation

msiGetValByKey: get metadata from structure

msiExecStrCondQuery: execute string conditional query

msiString2KeyValPair: convert string to key-value pair

msiAssociateKeyValuePairsToObj: add metadata

msiMakeGenQuery: create a query

msiExecGenQuery: execute a query

msiCloseGenQuery: release query buffers

msiGetContInxFromGenQueryOut: check for more rows

msiRemoveKeyValuePairsFromObj: remove metadata

msiGetMoreRows: get more rows from query

Micro-services Used

Data and directory manipulation

msiIsColl: check whether name is a collection

msiCollCreate: create a collection

msiDataObjCreate: create a file

msiDataObjRepl: replicate a file

msiDataObjChksum: checksum a file

msiDataObjUnlink: delete a file

System functions

msiGetSystemTime: get the system time

writeLine: write a line to a file or standard out

msiSleep: sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs

• Execute metadata query: 714 msecs

• Disk seek latency: 5 msecs

• Disk rotational latency: 11 msecs

• Production loop logic: 63 msecs

• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of Cyber-integrator workflow with collaboration environments to support drought prediction.

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state

[Figure: RHESSys workflow diagram. Steps and inputs: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); slope; aspect; streams (NHD); roads (DOT); land use; leaf area index; phenology; soil data. Data sources: NLCD (EPA), Landsat TM, MODIS, USDA. Nested watershed structure: strata, patch, hillslope, basin, stream network. Outputs: soil and vegetation parameter files, worldfile, and flowtable for RHESSys.]

Eco-Hydrology

iRODS Rule for RHESSys

main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}

Modular workflow composed by chaining basic transformations

Define input variables

Call functions to apply each transformation step

Store results in shared collection

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
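
Most of this rule is path bookkeeping: it derives the physical path of the output tile, its logical iRODS path, and the gdal_translate arguments from two input paths. The Python sketch below mirrors only that derivation, using the standard library's posixpath in place of msiSplitPath. The function name `plan_tile` and the example paths are hypothetical; nothing is queried, executed, or registered.

```python
import posixpath


def plan_tile(dem_obj_path, header_phys_path, extent_coords):
    """Derive output paths and the gdal_translate argument list.

    dem_obj_path     -- logical path of the DEM object (hypothetical example)
    header_phys_path -- physical path of a file inside the raster directory
    extent_coords    -- whitespace-separated window coordinates
    """
    # Logical collection that holds the DEM object.
    obj_coll, _obj_name = posixpath.split(dem_obj_path)
    # Physical directory of the raster, then its parent and dataset name.
    in_file_dir, _header = posixpath.split(header_phys_path)
    in_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_parent_dir, tile_file_name)
    # The tile is registered in the parent of the DEM's collection.
    obj_coll_parent, _junk = posixpath.split(obj_coll)
    tile_obj_path = posixpath.join(obj_coll_parent, tile_file_name)
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, argv


paths = plan_tile("/zone/home/nhd/DEM/elev_cm",
                  "/vault/nhd/DEM/elev_cm/hdr.adf",
                  "100 200 300 50")
print(paths)
```

Building the arguments as a list rather than one concatenated string avoids quoting problems when a path contains spaces, although the rule itself passes a single string to msiExecCmd.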

Summary

iRODS is a powerful Policy-Based engine for Managing NextGen Big Data Cyber-Infrastructures

Enables a Flexible, Adaptive, and Customizable Data Management Architecture

“Canned” scripts (policies) can be created to standardize and automate users' processes

Simple menu-driven interface

No CS degree needed

iRODS is the middleware for Distributed Data Management

Thank you

Questions

Page 21: iRODS

Production Integrity Rule

Verify all input parameters for consistency

Query the iRODS metadata catalog to retrieve status information

Verify the integrity of each file in a collection

Update all replicas to the most recent version

Minimize the load on production services through a deadline scheduler

Differentiate between the logical name for a file and the physical replica locations

Identify all missing replicas and document their lack

Create new replicas to replace missing replicas

Implement load leveling to distribute the new replicas across the storage systems

Create a log file that records all repair operations performed upon the collection

Track progress of the policy execution

Initialize the rule for the first execution

Enable restart of the process from the last set of checked files in case of a system halt

Manipulate files in batches of 256 files at a time to handle arbitrarily large collections

Minimize the number of sleep periods used by the deadline scheduler

Include the checking of new files that have been added during the execution of the policy

Write out statistics about the effective execution rate and the number of files checked

Workflow Management amp Registration

Workflow file

Directory holding all input and output files

associated with workflow file (mounted

collection that is linked to the workflow file)

Input parameter file lists parameters

and input and output file names

Directory holding all output

files generated for invocation

of eCWkflowrun the version

number is incremented

Automatically generated run file for

Executing each input file

Output file created for

eCWKflowmpf

eCWkflowmss

earthCubeeCWkflow

eCWkflowmpf

earthCubeeCWkfloweCWkflowrunDir0

eCWkflowrun

Outfile

eCWkflow2run

eCWkflow2mpf

earthCubeeCWkfloweCWkflow2runDir

0

Newfile

Publications

Rajasekar R M Wan R Moore W Schroeder S-Y Chen L

Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B

Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo

Morgan amp Claypool 2010

Ward R M Wan W Schroeder A Rajasekar A de Torcy T

Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data

System (iRODS 30) Micro-service Workbookrdquo DICE

Foundation November 2011 ISBN 9781466469129

Amazoncom

Reagan W Moore

rwmoorerenciorg

httpirodsdiceresearchorg

NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo

NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo

NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo

NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo

iRODS - Open Source Software

iRODS Distributed Data Management

Val = 0rdquo

msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =

Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)

foreach (GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)

if(int(Val) == 0)

Str1 = TEST_DATA_ID=0rdquo

msiString2KeyValPair(Str1kvp)

msiAssociateKeyValuePairsToObj(kvpColl-C)

writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)

on a restart TEST_DATA_ID will be greater than 0

msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and

META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)

msiExecGenQuery(GenQInp2GenQOut2)

foreach(GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)

Initializing Workflow Parameters

Workflow Operations Used

Arithmetic (+ - )

Boolean tests (== = ampamp || gt lt gt=)

Conditional statements

if then else

Control

break fail

Loops

for foreach while

List manipulation

initialization list addition (cons) extracting an element from a

list (elem) updating an element in a list (setelem)

Variable manipulation

initialization type conversion (int double str)

Micro-services Used

Metadata catalog manipulation

msiGetValByKey get metadata from structure

msiExecStrCondQuery execute string conditional query

msiString2KeyValPair convert string to key-value pair

msiAssociateKeyValuePairsToObj add metadata

msiMakeGenQuery create a query

msiExecGenQuery execute a query

msiCloseGenQuery release query buffers

msiGetContInxFromGenQueryOut check for more rows

msiRemoveKeyValuePairsFromObj remove metadata

msiGetMoreRows get more rows from query

Micro-services Used

Data and directory manipulation

msiIsColl check whether name is a collection

msiCollCreate create a collection

msiDataObjCreate create a file

msiDataObjRepl replicate a file

msiDataObjChksum checksum a file

msiDataObjUnlink delete a file

System functions

msiGetSystemTime get the system time

writeLine write a line to a file or standard out

msiSleep sleep

Performance at rencirsquo

bull Execute call to rule engine 18 msecs

bull Execute metadata query 714 msecs

bull Disk seek latency 5 msecs

bull Disk rotational latency 11 msecs

bull Production loop logic 63 msecs

bull Checksum verification 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.

• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.

• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.

• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of Cyber-integrator workflow with collaboration environments to support drought prediction.

[Figure: Eco-Hydrology. RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state. Diagram labels: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); slope; aspect; streams (NHD); roads (DOT); strata; hillslope; patch; basin; stream network; nested watershed structure; land use; leaf area index; phenology; soil data; NLCD (EPA); Landsat TM; MODIS; USDA; soil and vegetation parameter files; worldfile; flowtable; RHESSys.]

iRODS Rule for RHESSys

main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}

• Modular workflow composed by chaining basic transformations
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
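The chaining idea on this slide can be sketched in Python as a list of steps, each reading what earlier steps produced and adding its own result. The step functions and the values they return below are hypothetical placeholders; they are not the RHESSys transformations themselves.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


def get_extent(state: State) -> State:
    """Hypothetical step: look up an extent for the gage reach code."""
    return {**state, "extent": f"extent-of-{state['gage']}"}


def extract_tile(state: State) -> State:
    """Hypothetical step: derive a tile name from the extent."""
    return {**state, "tile": f"tile-for-{state['extent']}"}


def delineate(state: State) -> State:
    """Hypothetical step: derive a watershed from the tile."""
    return {**state, "watershed": f"watershed-from-{state['tile']}"}


def run_workflow(inputs: State, steps: List[Step]) -> State:
    """Apply each step in order, passing the growing state along."""
    state = dict(inputs)
    for step in steps:
        state = step(state)
    return state


# Define input variables, then call each transformation step in turn.
result = run_workflow({"gage": "G1"}, [get_extent, extract_tile, delineate])
```

Because every step has the same shape, a step can be replaced or reordered without touching the others.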

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);

    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);

    # Run query
    msiExecGenQuery(*genQInp, *genQOut);

    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);

    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);

    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;

    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);

    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
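The path handling in extractTileFromNHD_DEM can be followed more easily with concrete values. The Python sketch below derives the tile file path, the tile object path, and the gdal_translate argument list from example inputs. The function name plan_tile_extraction, the example paths, and the ".img" suffix are assumptions for illustration. The sketch only builds the command and does not run it.

```python
import posixpath
import shlex
from typing import List, Tuple


def plan_tile_extraction(
    header_path: str, dem_collection: str, extent_coords: str
) -> Tuple[str, str, List[str]]:
    """Return (tile file path, tile object path, gdal_translate argv)."""
    # Physical directory that holds the input raster.
    in_file_dir = posixpath.dirname(header_path)
    # Its parent directory and the raster dataset name.
    parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(parent_dir, tile_name)
    # Logical path of the tile, placed beside the source collection.
    tile_obj_path = posixpath.join(posixpath.dirname(dem_collection), tile_name)
    argv = ["gdal_translate", "-of", "HFA", "-projwin"]
    argv += shlex.split(extent_coords)
    argv += [in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, argv


# Hypothetical physical path, collection, and extent.
tile_file, tile_obj, command = plan_tile_extraction(
    "/vault/dem/ned30/hdr.adf", "/zone/home/dem/ned30", "100 200 300 50"
)
```

Building the arguments as a list rather than one concatenated string avoids quoting problems when a path contains spaces.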

Summary

• iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
• Enables a flexible, adaptive, and customizable data management architecture
• “Canned” scripts (policies) can be created to standardize and automate users' processes
– Simple menu-driven interface
– No CS degree needed
• iRODS is the middleware for distributed data management

Thank you

Questions?


Workflow Management & Registration

• eCWkflow.mss: workflow file
• /earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
• eCWkflow.mpf: input parameter file; lists parameters and input and output file names
• /earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• eCWkflow.run: automatically generated run file for executing each input file
• Outfile: output file created for eCWkflow.mpf
• Second parameter set shown in the diagram: eCWkflow2.run, eCWkflow2.mpf, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
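The slide notes that the run directory carries a version number that is incremented. The Python sketch below shows one way such a name could be chosen. It assumes names end in "runDir" plus an integer, as in the runDir0 name on this slide, and that the next version is one more than the highest already present; both are assumptions for illustration, not documented iRODS behavior.

```python
import re
from typing import Iterable


def next_run_dir(workflow: str, existing: Iterable[str]) -> str:
    """Return the next '<workflow>.runDir<N>' name not yet in use."""
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    # Start at 0 when no earlier run directory exists.
    next_version = max(versions) + 1 if versions else 0
    return f"{workflow}.runDir{next_version}"


first_run = next_run_dir("eCWkflow", [])
third_run = next_run_dir(
    "eCWkflow", ["eCWkflow.runDir0", "eCWkflow.runDir1", "eCWkflow2.runDir0"]
)
```

Directories that belong to a different parameter file, such as the eCWkflow2 entry in the example list, are ignored when picking the next version.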

Publications

• Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.

• Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.

Reagan W. Moore

rwmoore@renci.org

http://irods.diceresearch.org

NSF OCI-0940841 “DataNet Federation Consortium”

NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”

NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”

NSF SDCI-0721400 “Data Grids for Community Driven Applications”

iRODS - Open Source Software

iRODS Distributed Data Management





Page 26: iRODS

Val = 0rdquo

msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =

Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)

foreach (GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)

if(int(Val) == 0)

Str1 = TEST_DATA_ID=0rdquo

msiString2KeyValPair(Str1kvp)

msiAssociateKeyValuePairsToObj(kvpColl-C)

writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)

on a restart TEST_DATA_ID will be greater than 0

msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and

META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)

msiExecGenQuery(GenQInp2GenQOut2)

foreach(GenQOut2)

msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)

Initializing Workflow Parameters

Workflow Operations Used

Arithmetic (+ - )

Boolean tests (== = ampamp || gt lt gt=)

Conditional statements

if then else

Control

break fail

Loops

for foreach while

List manipulation

initialization list addition (cons) extracting an element from a

list (elem) updating an element in a list (setelem)

Variable manipulation

initialization type conversion (int double str)

Micro-services Used

Metadata catalog manipulation
• msiGetValByKey - get metadata from structure
• msiExecStrCondQuery - execute string conditional query
• msiString2KeyValPair - convert string to key-value pair
• msiAssociateKeyValuePairsToObj - add metadata
• msiMakeGenQuery - create a query
• msiExecGenQuery - execute a query
• msiCloseGenQuery - release query buffers
• msiGetContInxFromGenQueryOut - check for more rows
• msiRemoveKeyValuePairsFromObj - remove metadata
• msiGetMoreRows - get more rows from query
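Three of the query micro-services listed above (execute a query, check for more rows, get more rows) are typically combined into a paging loop. The Python below sketches that control flow only. `FakeQuery` is a hypothetical stand-in for the catalog, and the convention that a continuation index of 0 means "no more rows" is an assumption here.

```python
class FakeQuery:
    """Hypothetical stand-in for a catalog query that returns rows in pages."""

    def __init__(self, rows, page_size):
        self._rows = rows
        self._page_size = page_size
        self._offset = 0

    def fetch(self):
        """Return (page, continuation_index); index 0 means no more rows."""
        page = self._rows[self._offset:self._offset + self._page_size]
        self._offset += len(page)
        more = 1 if self._offset < len(self._rows) else 0
        return page, more


def collect_all(query):
    """Gather every row by following the continuation index."""
    rows = []
    page, cont = query.fetch()        # role of msiExecGenQuery
    rows.extend(page)
    while cont > 0:                   # role of msiGetContInxFromGenQueryOut
        page, cont = query.fetch()    # role of msiGetMoreRows
        rows.extend(page)
    return rows


all_rows = collect_all(FakeQuery(["f1", "f2", "f3", "f4", "f5"], page_size=2))
print(all_rows)
```

In a real rule, msiCloseGenQuery would release the query buffers once the loop finishes.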

Micro-services Used

Data and directory manipulation
• msiIsColl - check whether name is a collection
• msiCollCreate - create a collection
• msiDataObjCreate - create a file
• msiDataObjRepl - replicate a file
• msiDataObjChksum - checksum a file
• msiDataObjUnlink - delete a file

System functions
• msiGetSystemTime - get the system time
• writeLine - write a line to a file or standard out
• msiSleep - sleep

Performance at RENCI

• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs

Data Analysis Use Cases

• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.

Eco-Hydrology

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state.

[Figure: RHESSys workflow diagram. Recoverable labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Worldfile; Flowtable; RHESSys.]

iRODS Rule for RHESSys

    main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }

• Modular workflow composed by chaining basic transformations
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
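The chaining idea (each step consumes what the previous step produced) can be shown with a small pipeline runner. This Python sketch is illustrative only: the step functions are hypothetical placeholders that build strings, not the real RHESSys or GRASS operations.

```python
from functools import reduce


def run_pipeline(steps, state):
    """Apply each step in order; every step takes and returns the state dict."""
    return reduce(lambda acc, step: step(acc), steps, state)


def get_extent(state):
    # Placeholder: a real step would look up the extent for the gauge reachcode.
    return {**state, "extent": f"extent-for-{state['gage']}"}


def extract_tile(state):
    # Placeholder: a real step would cut a DEM tile for the extent.
    return {**state, "tile": f"tile[{state['extent']}]"}


def delineate(state):
    # Placeholder: a real step would delineate the watershed from the tile.
    return {**state, "watershed": f"watershed({state['tile']})"}


result = run_pipeline([get_extent, extract_tile, delineate], {"gage": "G1"})
print(result["watershed"])
```

Because every step has the same shape, a step can be swapped or re-run without touching the others, which is the point of composing the workflow from basic transformations.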

    extractTileFromNHD_DEM(*extentCoords) {
        # Split path to object into collection and name
        msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
        writeLine("serverLog", *nhdDEMObjColl);
        writeLine("serverLog", *nhdDEMObjName);
        # Build query to discover physical path
        msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
        msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
        msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
        msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
        # Run query
        msiExecGenQuery(*genQInp, *genQOut);
        # Extract path from query result
        foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
        writeLine("serverLog", *filePath);
        # Determine physical path of input directory
        msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
        # Generate physical path of output file
        msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
        *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
        *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
        # Generate iRODS path of output
        msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
        *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
        *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
        writeLine("serverLog", *args);
        msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
        writeLine("serverLog", *cmd_out);
        # Register tile file with iRODS
        msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
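The path bookkeeping in this rule is easy to get wrong, so it may help to trace it with concrete values. The Python sketch below mirrors the three path splits and the gdal_translate argument list. The function name, the example paths, and the extent are hypothetical, and `posixpath.split` is used as a stand-in for msiSplitPath.

```python
import posixpath


def plan_tile_extraction(file_path, dem_obj_coll, extent_coords):
    """Derive output paths and gdal_translate arguments for one DEM tile."""
    # Physical directory that holds the raster dataset (header file dropped).
    in_file_dir, _header = posixpath.split(file_path)
    # Parent directory and dataset name; the tile is written next to the dataset.
    in_file_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_file_parent_dir, tile_file_name)
    # Logical (iRODS) path: same file name, placed in the parent collection.
    obj_coll_parent, _ = posixpath.split(dem_obj_coll)
    tile_obj_path = posixpath.join(obj_coll_parent, tile_file_name)
    # -of HFA selects the output format; -projwin takes the four window coordinates.
    args = ["-of", "HFA", "-projwin", *extent_coords.split(),
            in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, args


# Hypothetical example values, not taken from the slides.
tile_file_path, tile_obj_path, args = plan_tile_extraction(
    "/vault/nhd/dem/hdr.adf", "/zone/home/nhd/dem", "-79.2 36.1 -78.9 35.8")
print(tile_file_path)
print(tile_obj_path)
print(" ".join(args))
```

The physical path and the logical path are derived separately because the final registration step needs both: the file that was written on disk and the name under which it is registered.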

Summary

• iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
• Enables a flexible, adaptive, and customizable data management architecture
• "Canned" scripts (policies) can be created to standardize and automate user processes
  – Simple menu-driven interface
  – No CS degree needed
• iRODS is the middleware for distributed data management

Thank you

Questions?

Page 34: iRODS

extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);

    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);

    # Run query
    msiExecGenQuery(*genQInp, *genQOut);

    # Extract path from query result
    foreach (*genQOut) {
        msiGetValByKey(*genQOut, "DATA_PATH", *filePath);
    }
    writeLine("serverLog", *filePath);

    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);

    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-"++*rasterDatasetName++".img";
    *tileFilePath = *inFileParentDir++"/"++*tileFileName;

    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent++"/"++*tileFileName;

    *args = "-of HFA -projwin "++*extentCoords++" "++*inFileDir++" "++*tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);

    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
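
The path handling in this rule is easier to follow when traced on concrete values. The following is a minimal Python sketch under stated assumptions: paths are POSIX-style, msiSplitPath is approximated by a parent/last-component split, and the "/" separator and ".img" suffix are assumed readings of the concatenations on the slide. The function names are hypothetical and nothing here calls iRODS or GDAL.

```python
import posixpath


def split_path(path):
    """Approximate msiSplitPath: return (parent, last component)."""
    return posixpath.split(path.rstrip("/"))


def plan_tile_extraction(file_path, dem_obj_coll, extent_coords):
    """Derive the tile paths and gdal_translate arguments the way the rule does."""
    # Physical directory of the input raster (the header file name is ignored).
    in_file_dir, _header_file = split_path(file_path)
    # The output tile is written next to the raster dataset directory.
    in_file_parent_dir, raster_dataset_name = split_path(in_file_dir)
    tile_file_name = "SUBSET-" + raster_dataset_name + ".img"
    tile_file_path = posixpath.join(in_file_parent_dir, tile_file_name)
    # Logical (iRODS) path of the tile: parent of the DEM collection.
    dem_obj_coll_parent, _junk = split_path(dem_obj_coll)
    tile_obj_path = posixpath.join(dem_obj_coll_parent, tile_file_name)
    # Argument list corresponding to the string passed to gdal_translate.
    args = ["-of", "HFA", "-projwin", *extent_coords.split(),
            in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, args
```

For a hypothetical physical path such as /vault/nhd/dem01/hdr.adf, the tile is planned one level above the raster dataset directory, and the same file name is reused for the logical path that the rule registers with msiPhyPathReg.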

Page 35: iRODS

Summary

iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures

Enables a flexible, adaptive, and customizable data management architecture

“Canned” scripts (policies) can be created to standardize and automate users' processes

Simple menu-driven interface

No CS degree needed

iRODS is the middleware for distributed data management

Page 36: iRODS

Thank you

Questions?