The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets

Caitlin Minteer & Kelly Clynes

The Data Grid

Large dataset size
Geographic distribution of users and resources
Computationally intensive analysis
No other architecture exists that allows us to apply technologies in large-scale application domains

The Data Grid

Data grid applications must frequently operate in wide-area, multi-institutional, diverse environments

Design Architecture for The Data Grid

Mechanism Neutrality
  Designed to be as independent as possible of low-level mechanisms
  Achieved by defining interfaces that encapsulate the peculiarities of specific storage systems

Design Architecture for The Data Grid

Policy Neutrality
  Structured so that design decisions with significant performance implications are exposed to the user

Design Architecture for The Data Grid

Compatibility with Grid Infrastructure
  Take advantage of fundamental Grid infrastructure
  Compatible with lower-level Grid mechanisms

Design Architecture for The Data Grid

Uniformity of Information Infrastructure
  The same data model and interface are used to access the grid's metadata

Design Architecture for The Data Grid

These four principles lead to the development of a layered architecture.

Lower layers provide high-performance access to an orthogonal set of basic mechanisms, without enforcing specific usage policies.

In data grids, the focus on simple, policy-independent mechanisms will encourage and enable wide use without limiting the range of applications that can be built.

Core Grid Data Services

Two fundamental services required in data grid architecture:
  Data Access
  Metadata Access

Data Access

Provides mechanisms for accessing, managing, and initiating third-party transfers of data stored in storage systems

Metadata Access

Provides mechanisms for accessing and managing information about data stored in storage systems

Data Abstraction: Storage System

The basic grid component is the storage system, which provides functions for creating, destroying, reading, writing, and manipulating file instances

File instances are the basic unit of information in a storage system

A storage system can be implemented by any storage technology that supports the required access functions
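
The storage system abstraction can be sketched as an interface plus one interchangeable backend. This is a minimal illustration only: the class names `StorageSystem` and `InMemoryStorage`, the method names, and the file name are hypothetical and are not taken from the paper or from any Grid toolkit.

```python
from abc import ABC, abstractmethod


class StorageSystem(ABC):
    """Hypothetical interface: the access functions a storage system must support."""

    @abstractmethod
    def create(self, name): ...

    @abstractmethod
    def destroy(self, name): ...

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def write(self, name, data): ...


class InMemoryStorage(StorageSystem):
    """Toy backend; a disk cache or tape archive could implement the same interface."""

    def __init__(self):
        self._files = {}  # file instance name -> bytes

    def create(self, name):
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = b""

    def destroy(self, name):
        del self._files[name]  # raises KeyError if the file instance is missing

    def read(self, name):
        return self._files[name]

    def write(self, name, data):
        if name not in self._files:
            raise FileNotFoundError(name)
        self._files[name] = data


store = InMemoryStorage()
store.create("run42.dat")
store.write("run42.dat", b"event data")
content = store.read("run42.dat")
```

Callers depend only on `StorageSystem`, so swapping the backend does not change application code, which is the point of mechanism neutrality.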

Data Access:

Storage system access functions must be integrated with the security environment of each site to which remote access is required

Applications should be able to provide storage systems with hints concerning access patterns, network performance, etc., that the storage system can use to optimize performance

Data movement functions must be able to detect and report errors
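
As a rough illustration of hints and error reporting, the sketch below copies data in chunks whose size comes from an application hint, then verifies a checksum. `AccessHints`, `TransferError`, `transfer`, and the `corrupt` flag are all hypothetical names invented for this example; using a SHA-256 checksum is an assumption, not something the slides specify.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class AccessHints:
    """Hypothetical hints an application might pass to a storage system."""
    sequential: bool = True
    block_size: int = 4  # bytes per chunk; tiny for illustration


class TransferError(Exception):
    """Raised when a data movement function detects corruption."""


def transfer(source, hints, corrupt=False):
    """Copy `source` chunk by chunk and verify a checksum at the end."""
    expected = hashlib.sha256(source).hexdigest()
    chunks = [source[i:i + hints.block_size]
              for i in range(0, len(source), hints.block_size)]
    if corrupt and chunks:
        chunks[0] = b"?" * len(chunks[0])  # simulate a damaged chunk
    received = b"".join(chunks)
    if hashlib.sha256(received).hexdigest() != expected:
        raise TransferError("checksum mismatch after transfer")
    return received


data = transfer(b"detector output", AccessHints(block_size=8))
```

Raising a dedicated exception lets the caller decide whether to retry, pick another source, or give up.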

Metadata

Management of the data grid itself

Information about file instances, the contents of file instances, and the various storage systems contained in the grid

The metadata service provides a way to publish and access this metadata

Application Metadata

Describes the contents and structure of the data
  Content represented by the file
  Circumstances under which the data was obtained
  Other info useful to applications that process the data

Replica Metadata

Used to manage replication of data objects

Includes information for mapping file instances to particular storage system locations

System Configuration Metadata

Describes the fabric of the grid itself, i.e., network connectivity and details about storage systems
  Capacity
  Usage policy

Additional Requirements

Service must operate efficiently in a distributed environment
  Scalable
  Robust
  Assert local control over information

Hierarchical Distributed System

Because of these requirements, the metadata service must be a hierarchical, distributed system
  Achieve scalability
  Avoid single points of failure
  Facilitate local control over data
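
One way to picture a hierarchical metadata service is a tree in which each node owns its own entries and delegates lookups to its children. The sketch below is an assumption-laden toy: `MetadataNode`, the path syntax, and the names `site_a`, `physics`, and `run42` are hypothetical, and a real service would distribute nodes across machines rather than hold them in one process.

```python
class MetadataNode:
    """Hypothetical node in a hierarchical metadata service."""

    def __init__(self, name):
        self.name = name
        self.entries = {}   # metadata owned and controlled locally
        self.children = {}  # child nodes, keyed by name

    def add_child(self, child):
        self.children[child.name] = child
        return child

    def lookup(self, path):
        """Resolve a path such as 'site_a/physics/run42' by delegating downwards."""
        head, _, rest = path.partition("/")
        if not rest:
            return self.entries.get(head)
        child = self.children.get(head)
        return child.lookup(rest) if child else None


root = MetadataNode("grid")
site = root.add_child(MetadataNode("site_a"))
group = site.add_child(MetadataNode("physics"))
group.entries["run42"] = {"size_gb": 120, "format": "raw"}

found = root.lookup("site_a/physics/run42")
missing = root.lookup("site_b/physics/run42")
```

Each site updates only its own `entries`, which mirrors local control, and no single node has to hold the metadata for the whole grid.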

Higher-Level Data Grid Components

Two types of representative components:
  Replica management
  Replica selection

Replica Management

Replica Manager
  Creates copies of file instances, or replicas, within specified storage systems
  Offers better performance or availability for access to or from a particular location
  Maintains a repository or catalog
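
The replica manager's two jobs, copying file instances and recording where they live, can be sketched as follows. `ReplicaManager` and its methods are hypothetical names, plain dictionaries stand in for storage systems, and the site and file names are made up for illustration.

```python
class ReplicaManager:
    """Hypothetical replica manager: copies file instances and records their locations."""

    def __init__(self, storage_systems):
        self.storage = storage_systems  # system name -> dict acting as a storage system
        self.catalog = {}               # logical file name -> set of system names

    def register(self, logical_name, system):
        self.catalog.setdefault(logical_name, set()).add(system)

    def replicate(self, logical_name, source, target):
        if source not in self.catalog.get(logical_name, set()):
            raise LookupError(f"{logical_name} is not stored on {source}")
        # Copy the file instance, then record the new replica in the catalog.
        self.storage[target][logical_name] = self.storage[source][logical_name]
        self.register(logical_name, target)

    def locations(self, logical_name):
        return sorted(self.catalog.get(logical_name, set()))


systems = {"site_a": {"run42.dat": b"events"}, "site_b": {}}
manager = ReplicaManager(systems)
manager.register("run42.dat", "site_a")
manager.replicate("run42.dat", "site_a", "site_b")
where = manager.locations("run42.dat")
```

Updating the catalog only after the copy succeeds keeps it from listing a replica that does not exist.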

Replica Selection and Data Filtering

Replica selection is a high-level service provided in the data grid
  Chooses a replica to optimize performance criteria
    Speed
    Cost
    Security

Replicas may be local or accessed remotely
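
Replica selection can be sketched as filtering candidates on security and then minimizing a score built from speed and cost. Everything here is an assumption made for illustration: the `Replica` fields, the `select_replica` function, and especially the scoring formula, which simply adds estimated transfer seconds to a weighted monetary cost.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    location: str
    bandwidth_mbps: float  # higher is faster
    cost_per_gb: float     # lower is cheaper
    secure: bool


def select_replica(replicas, size_gb, require_secure=False, cost_weight=1.0):
    """Pick the replica with the lowest combined time-and-cost score."""
    candidates = [r for r in replicas if r.secure or not require_secure]
    if not candidates:
        raise LookupError("no replica satisfies the security requirement")

    def score(r):
        transfer_seconds = size_gb * 8000 / r.bandwidth_mbps  # 1 GB = 8000 megabits
        return transfer_seconds + cost_weight * r.cost_per_gb * size_gb

    return min(candidates, key=score)


replicas = [
    Replica("local", bandwidth_mbps=1000, cost_per_gb=0.0, secure=False),
    Replica("remote", bandwidth_mbps=100, cost_per_gb=0.0, secure=True),
]
fastest = select_replica(replicas, size_gb=10)
safest = select_replica(replicas, size_gb=10, require_secure=True)
```

Exposing `require_secure` and `cost_weight` as parameters leaves the policy choice to the caller, in keeping with policy neutrality.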

Summary

Architecture of the Data Grid
  Mechanism Neutrality
  Policy Neutrality
  Compatibility with Grid Infrastructure
  Uniformity of Information Infrastructure

Data Services
  Data Access
  Metadata Access

Replica Management