Applied CyberInfrastructure Concepts
ISTA 420/520 Fall 2014

Nirav Merchant (nirav@email.arizona.edu), Bio Computing & iPlant Collaborative
Eric Lyons ([email protected]), Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/

Will Computers Crash Genomics? Science Vol 331 Feb 2011


Topic Coverage

• Lifecycle Issues (example from MIT)

• Why DM (Data Management)

• iRODS Introduction

• Scaling the Infrastructure for Data Management (Chapter 3 from FiMDA)

• Group homework


Reality of data

“We are drowning in data, but starving of information” - Attribution unknown


Data Life Cycle

http://www.data-archive.ac.uk/create-manage/life-cycle


iRODS Background and Evolution

• integrated Rule-Oriented Data System (iRODS) http://www.irods.org

• Originated at SDSC, developed by the DICE (Data Intensive Cyber Environments) group

• Based on decade-long SRB development experience for managing distributed data

• Community-driven

• Most of the group migrated to UNC Chapel Hill in 2008-2009
– The group is bi-coastal: DICE-UNC, DICE-UCSD

• First release of iRODS in 2009

• iRODS picked up where SRB left off


iRODS Background and Evolution

• Modular, extensible, customizable

• Open source (BSD license)

• Supported at UNC with complementary activities by DICE and RENCI, a research unit of UNC Chapel Hill

• https://github.com/irods/irods


iRODS

I. Data grid middleware

II. Data management infrastructure

III. A framework for procedural implementation of data management policy (policy-driven data management)

iRODS is all of these.


iRODS View of Distributed Data

[Figure: a User Client sees a single iRODS Unified Virtual Collection spanning three groups of storage. My Data: disk, filesystem, site-specific storage, ... My Data: tape, database, filesystem, ... Partner’s Data: remote disk, tape, filesystem, site-specific storage, ...]

• iRODS installs over heterogeneous data resources

• Users can share & manage distributed data as a single collection


iRODS as a Data Grid

• Sharing data across:
– geographic and institutional boundaries
– heterogeneous resources (hardware/software)

• Virtual (logical) collections of distributed data

• Global name spaces
– data: files and collections
– users: single sign on
– storage: virtual resources

• Metadata catalogue (iCAT) manages mappings between logical and physical name spaces
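The logical-to-physical mapping described above can be illustrated with a toy catalogue. This is a minimal sketch in plain Python, not the iRODS API: the `Catalog` and `Replica` classes, the zone name `zoneA`, the resource names, and all paths are hypothetical and exist only to show how one logical name can point at several physical copies on different storage systems.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str       # hypothetical virtual resource name, e.g. "disk_a"
    physical_path: str  # where the bytes actually live on that resource


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalogue: logical path -> physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, replica):
        # Record one more physical copy under the same logical name.
        self.entries.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path):
        # Return every known physical copy of a logical object.
        return list(self.entries.get(logical_path, []))

    def list_collection(self, collection):
        # A collection is just a shared logical prefix; storage is not consulted.
        prefix = collection.rstrip("/") + "/"
        return sorted(p for p in self.entries if p.startswith(prefix))


catalog = Catalog()
catalog.register("/zoneA/home/alice/reads.fastq", Replica("disk_a", "/vault/a/0001"))
catalog.register("/zoneA/home/alice/reads.fastq", Replica("tape_b", "/archive/b/7f3c"))
catalog.register("/zoneA/home/alice/notes.txt", Replica("disk_a", "/vault/a/0002"))

listing = catalog.list_collection("/zoneA/home/alice")
replicas = catalog.resolve("/zoneA/home/alice/reads.fastq")
print(listing)
print([r.resource for r in replicas])
```

The user only ever handles the logical paths; which disk or tape system holds the bytes is a detail kept in the catalogue.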


A RENCI Data Grid

[Figure: several iRODS Servers, one with the Metadata Catalog (iCAT); site labels: iPlant, NCSU, UNC-A, Duke, UNC-CH, RENCI Europa Center]

• Client asks for data – request goes to an iRODS server

• Server contacts the iCAT-enabled server

• Information (location, access rights, etc.) is retrieved from the iCAT

• Server containing data is signaled to send data to authorized client

A complete data grid (zone) has one metadata catalogue (iCAT)
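The four-step request flow on this slide can be sketched as a small simulation. This is an illustrative model only, assuming a single zone with one catalogue; the `ICat` and `Server` classes, the server names, the user names, and the path are all hypothetical and do not correspond to real iRODS calls.

```python
class ICat:
    """Toy catalogue: knows where each object lives and who may read it."""

    def __init__(self):
        self.locations = {}  # logical path -> name of the server holding the data
        self.acl = {}        # logical path -> set of users allowed to read

    def add(self, path, server_name, readers):
        self.locations[path] = server_name
        self.acl[path] = set(readers)

    def lookup(self, path, user):
        # Step 3: location and access rights are retrieved from the catalogue.
        if path not in self.locations:
            raise KeyError(path)
        if user not in self.acl[path]:
            raise PermissionError(f"{user} may not read {path}")
        return self.locations[path]


class Server:
    """Toy server; every server in the zone shares the same catalogue."""

    def __init__(self, name, icat, grid):
        self.name = name
        self.icat = icat
        self.grid = grid    # server name -> Server, stands in for the network
        self.storage = {}   # logical path -> bytes held locally
        grid[name] = self

    def handle_get(self, path, user):
        # Step 1 happened when the client called this server.
        # Steps 2-3: ask the catalogue where the data is and whether access is allowed.
        holder = self.icat.lookup(path, user)
        # Step 4: the server that holds the data sends it to the authorized client.
        return self.grid[holder].storage[path]


grid = {}
icat = ICat()
server_a = Server("server_a", icat, grid)
server_b = Server("server_b", icat, grid)

path = "/zone/home/alice/data.csv"
server_b.storage[path] = b"x,y\n1,2\n"
icat.add(path, "server_b", {"alice"})

# alice contacts server_a, but the bytes come from server_b.
data = server_a.handle_get(path, "alice")
print(data)
```

The client never needs to know which server holds the object; any server in the zone can accept the request because all of them consult the same catalogue.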


TUCASI Infrastructure Project (TIP) Federated Data Grids

Independent data grids (zones), each with its own iCAT, can be federated

18 September 2012


Federation of Data Grids

• NASA
– Disparate data collections: satellite data, model data, remote sensing data
– Manage the collections separately (technically and administratively) with separate data grids
– Federate the data grids to give users an overall view onto NASA data

• Collaboration between consortia
– DataNet Federation Consortium: 6 science domain partners, federating their data grids to share data, users
– Users authenticate to home data grid, access federated data grids

• For geographically distributed replication, evolution in data life cycle
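The idea that users authenticate to their home data grid and then access federated grids can be sketched as follows. This is a simplified model under stated assumptions, not a description of how iRODS implements federation: the `Zone` class, the zone names, the `user#zone` identity string used here, and the trust check are all illustrative.

```python
class Zone:
    """Toy data grid with its own users, its own objects, and a list of federated zones."""

    def __init__(self, name):
        self.name = name
        self.local_users = set()
        self.objects = {}          # logical path -> (content, set of "user#zone" readers)
        self.trusted_zones = set() # names of zones this zone is federated with

    def authenticate(self, user):
        # Users log in only at their home zone and get a zone-qualified identity.
        if user not in self.local_users:
            raise PermissionError(f"unknown user {user} in {self.name}")
        return f"{user}#{self.name}"

    def federate_with(self, other):
        # Federation is modelled here as mutual trust between two zones.
        self.trusted_zones.add(other.name)
        other.trusted_zones.add(self.name)

    def put(self, path, content, readers):
        self.objects[path] = (content, set(readers))

    def get(self, path, identity):
        home_zone = identity.split("#", 1)[1]
        # Remote identities are accepted only from federated zones.
        if home_zone != self.name and home_zone not in self.trusted_zones:
            raise PermissionError(f"{home_zone} is not federated with {self.name}")
        content, readers = self.objects[path]
        if identity not in readers:
            raise PermissionError(f"{identity} may not read {path}")
        return content


home = Zone("homeZone")
partner = Zone("partnerZone")
home.local_users.add("alice")
partner.put("/partnerZone/shared/model.nc", b"model-bytes", {"alice#homeZone"})

identity = home.authenticate("alice")   # log in once, at the home zone
home.federate_with(partner)             # the two grids agree to federate
result = partner.get("/partnerZone/shared/model.nc", identity)
print(result)
```

Each zone keeps its own catalogue and administration; federation only adds the agreement that lets an identity from one zone be recognized in another.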


iPlant Data Store

Free Your Data

Different Users, Different Access Needs: One Data Store