Data Management Reading: Chapter 5: “Data-Intensive Computing” And “A Network-Aware Distributed Storage Cache for Data Intensive Environments”


Page 1

Data Management

Reading:

Chapter 5: “Data-Intensive Computing”

And

“A Network-Aware Distributed Storage Cache for Data Intensive Environments”

Page 2

What is Data Management? It depends…

Storage systems
- Disk arrays
- Network caches (e.g., DPSS)
- Hierarchical storage systems (e.g., HPSS)

Efficient data transport mechanisms
- Striped
- Parallel
- Secure
- Reliable
- Third-party transfers

Page 3

What is Data Management? (cont.)

Replication management
- Associate files into collections
- Mechanisms for reliably copying collections, propagating updates to collections, selecting among replicas

Metadata management
- Associate attributes that describe data
- Select data based on attributes

Publishing and curation of data
- “Official” versions of important collections
- Digital libraries

Page 4

Outline for Today

Examples of data-intensive applications

Storage systems:
- Disk arrays
- High-Performance Network Caches (DPSS)
- Hierarchical Storage Systems (Chris: HPSS)

Next two lectures:
- GridFTP
- Globus replica management
- Metadata systems
- Curation

Page 5

Data-Intensive Applications: Physics

CERN Large Hadron Collider

- Several petabytes of data per year
- Starting in 2005
- Continuing 15 to 20 years

Replication scenario:
- Copy of everything at CERN (Tier 0)
- Subsets at national centers (Tier 1)
- Smaller regional centers (Tier 2)
- Individual researchers will have copies

Page 6

GriPhyN Overview (www.griphyn.org)

5-year, $12.5M NSF ITR proposal to realize the concept of virtual data, via:

Key research areas:
- Virtual data technologies (information models, management of virtual data software, etc.)
- Request planning and scheduling (including policy representation and enforcement)
- Task execution (including agent computing, fault management, etc.)

Development of Virtual Data Toolkit (VDT)

Four Applications: ATLAS, CMS, LIGO, SDSS

Page 7

GriPhyN Participants

Computer Science
- U.Chicago, USC/ISI, UW-Madison, UCSD, UCB, Indiana, Northwestern, Florida

Toolkit Development
- U.Chicago, USC/ISI, UW-Madison, Caltech

Applications
- ATLAS (Indiana), CMS (Caltech), LIGO (UW-Milwaukee, UT-B, Caltech), SDSS (JHU)

Unfunded collaborators
- UIC (STAR-TAP), ANL, LBNL, Harvard, U.Penn

Page 8

The Petascale Virtual Data Grid (PVDG) Model

Data suppliers publish data to the Grid

Users request raw or derived data from the Grid, without needing to know:
- Where data is located
- Whether data is stored or computed

Users can easily determine:
- What it will cost to obtain data
- Quality of derived data

PVDG serves requests efficiently, subject to global and local policy constraints
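
A minimal sketch of the virtual-data idea behind this model: a request is served from a stored replica when one exists, and recomputed from its derivation recipe otherwise. The catalog contents and derivation functions below are illustrative assumptions, not actual PVDG interfaces.

# Toy replica catalog and derivation recipes (hypothetical names).
replica_catalog = {"raw_events": "/archive/cern/raw_events"}
derivations = {"muon_histogram": lambda: "histogram derived from raw_events"}

def request(name):
    """Serve a request by data access if a replica exists, by computation if not."""
    if name in replica_catalog:
        return f"fetched {replica_catalog[name]}"   # data access
    return derivations[name]()                      # computation

print(request("raw_events"))       # satisfied from a stored replica
print(request("muon_histogram"))   # satisfied by (re)derivation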

Page 9

PVDG Scenario

[Diagram: a user request flows among local sites, network caches & regional centers, and major archive facilities]

User requests may be satisfied via a combination of data access and computation at local, regional, and central sites

Page 10

Other Application Scenarios

Climate community
- Terabyte-scale climate model datasets:
  - Collecting measurements
  - Simulation results
- Must support sharing, remote access to, and analysis of datasets

Distance visualization
- Remote navigation through large datasets, with local and/or remote computing

Page 11

Storage Systems: Disk Arrays

What is a disk array? A collection of disks

Advantages:
- Higher capacity: many small, inexpensive disks
- Higher throughput:
  - Higher bandwidth (Mbytes/sec) on large transfers
  - Higher I/O rate (transactions/sec) on small transfers

Page 12

Trends in Magnetic Disks

- Capacity increases: 60% per year
- Cost falling at a similar rate ($/MB or $/GB)
- Evolving to smaller physical sizes: 14in → 5.25in → 3.5in → 2.5in → 1.0in → … ?

Put lots of small disks together

Problem: RELIABILITY
- Reliability of N disks = reliability of 1 disk divided by N
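
A rough worked example of the 1/N rule above, assuming independent, identical disks (the figures are illustrative, not from the chapter):

# One disk rated at 500,000 hours mean time to failure (illustrative number).
single_disk_mttf_hours = 500_000
n_disks = 100

# Under the simple 1/N approximation, an array of N disks fails N times as often.
array_mttf_hours = single_disk_mttf_hours / n_disks
print(array_mttf_hours)   # 5000.0 hours, i.e. roughly seven months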

Page 13

Key Concepts in Disk Arrays

Striping for High Performance
- Interleave data from a single file across multiple disks
- Fine-grained interleaving:
  - Every file spread across all disks
  - Any access involves all disks
- Coarse-grained interleaving:
  - Interleave in large blocks
  - Small accesses may be satisfied by a single disk
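
A minimal sketch of the coarse-grained (block-interleaved) mapping described above, assuming fixed-size blocks numbered from zero; the function name is illustrative:

def locate_block(block_number: int, num_disks: int):
    """Round-robin block interleaving: a small access touches one disk, while a
    long run of consecutive blocks is spread across all disks in the array."""
    disk = block_number % num_disks
    offset_on_disk = block_number // num_disks
    return disk, offset_on_disk

# With 4 disks, logical block 13 lands on disk 1, at on-disk block slot 3.
print(locate_block(13, 4))   # (1, 3)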

Page 14

Key Concepts in Disk Arrays

Redundancy
- Maintain extra information in the disk array:
  - Duplication
  - Parity
  - Reed-Solomon error correction codes
  - Others
- When a disk fails: use the redundancy information to reconstruct the data on the failed disk

Page 15

RAID “Levels”
Defined by combinations of striping & redundancy

RAID Level 1: Mirroring or Shadowing
- Maintain a complete copy of each disk
- Very reliable
- High cost: twice the number of disks
- Great performance: on a read, may go to the disk with the faster access time

RAID Level 2: Memory-Style Error Detection and Correction
- Not really implemented in practice
- Based on DRAM-style Hamming codes
- In disk systems, don’t need detection
- Use less expensive correction schemes

Page 16

RAID “Levels” (cont.)

RAID Level 3: Fine-Grained Interleaving and Parity
- Many commercial RAIDs
- Calculate parity bit-wise across disks in the array (using exclusive-OR logic)
- Maintain a separate parity disk; update it on write operations
- When a disk fails, use the other data disks and the parity disk to reconstruct the data on the lost disk
- Fine-grained interleaving: all disks are involved in any access to the array
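
A minimal sketch of the exclusive-OR parity and reconstruction described above, computed byte-wise rather than bit-wise, with made-up data:

from functools import reduce

def parity(stripes):
    """XOR corresponding bytes of the given stripes to produce the parity stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*stripes))

data_disks = [b"\x0f\xf0", b"\xaa\x55", b"\x01\x02"]   # three data disks
parity_disk = parity(data_disks)                       # stored on the parity disk

# Disk 1 fails: XOR the surviving data disks with the parity disk to rebuild it.
rebuilt = parity([data_disks[0], data_disks[2], parity_disk])
assert rebuilt == data_disks[1]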

Page 17

RAID “Levels” (cont.)

RAID Level 4: Large Block Interleaving and Parity
- Similar to level 3, but interleave on larger blocks
- Small accesses may be satisfied by a single disk
- Supports a higher rate of small I/Os
- Parity disk may become a bottleneck with multiple concurrent I/Os

RAID Level 5: Large Block Interleaving and Distributed Parity
- Similar to level 4
- Distributes parity blocks throughout all disks in the array
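
A small sketch of the distributed-parity idea: the parity block moves to a different disk on each stripe, so parity updates do not all hit the single disk that bottlenecks RAID level 4. The simple rotation shown is one possible layout; real arrays use several variants:

def parity_disk_for_stripe(stripe_number: int, num_disks: int) -> int:
    """RAID 5: rotate the parity block across the disks stripe by stripe,
    instead of dedicating a single parity disk as RAID 4 does."""
    return stripe_number % num_disks

# With 5 disks, stripes 0..5 place their parity on disks 0, 1, 2, 3, 4, 0.
print([parity_disk_for_stripe(s, 5) for s in range(6)])   # [0, 1, 2, 3, 4, 0]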

Page 18

RAID Levels (cont.)

RAID Level 6: Reed-Solomon Error Correction Codes
- Protection against two disk failures

Disks are getting so cheap: consider massive storage systems composed entirely of disks
- No tape!!

Page 19

DPSS: Distributed Parallel Storage System

Produced by Lawrence Berkeley National Laboratory

“Cache”: provides storage that is
- Faster than typical local disk
- Temporary

“Virtual disk”: appears to be a single large, random-access, block-oriented I/O device

Isolates the application from the tertiary storage system:
- Acts as a large buffer between slow tertiary storage and high-performance network connections
- “Impedance matching”

Page 20

Features of DPSS

Components:

DPSS block servers
- Typically low-cost workstations
- Each with several disk controllers, and several disks per controller

DPSS master process
- Data requests are sent from the client to the master process
- Determines which DPSS block server stores the requested blocks
- Forwards the request to that block server

Note: servers can be anywhere on the network (a distributed cache)
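
A minimal sketch of the request path just described, with the master acting only as a lookup-and-forward step; every name here is hypothetical, not the real DPSS protocol:

# Hypothetical block-location table maintained by the DPSS master process.
block_location = {0: "block-server-a", 1: "block-server-b", 2: "block-server-a"}

def master_route(block_number: int) -> str:
    """Determine which block server stores the requested block and forward
    the request there; that server then returns the data to the client."""
    return block_location[block_number]

print(master_route(1))   # 'block-server-b'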

Page 21

Features of DPSS (cont.)

Client API library
- Supports a variety of I/O semantics
- dpssOpen(), dpssRead(), dpssWrite(), dpssLSeek(), dpssClose()

Application controls data layout in the cache
- For typical applications that read sequentially: stripe blocks of data across servers in round-robin fashion

DPSS client library is multi-threaded
- Number of client threads is equal to the number of DPSS servers: client speed scales with server speed
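
A short sketch of how the round-robin layout and the one-thread-per-server client combine for a sequential reader; the helper names and the toy fetch function are illustrative, not the DPSS client library:

from concurrent.futures import ThreadPoolExecutor

def read_sequentially(fetch, num_servers: int, num_blocks: int) -> list:
    """One worker per server; block b lives on server b % num_servers, so a
    sequential read keeps every server busy and throughput scales with servers."""
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        return list(pool.map(lambda b: fetch(b % num_servers, b), range(num_blocks)))

# Toy stand-in for a network read from a block server.
blocks = read_sequentially(lambda server, block: (server, block), 4, 8)
print(blocks[:5])   # [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4)]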

Page 22

Features of DPSS (cont.)

Optimized for a relatively small number of large files
- Several thousand files
- Greater than 50 MB each

DPSS blocks are available as soon as they are placed in the cache
- Good for staging large files to/from tertiary storage
- Don’t have to wait for a large transfer to complete

Dynamically reconfigurable
- Add or remove servers or disks on the fly

Page 23

Features of DPSS (cont.)

Agent-based performance monitoring system

Client library automatically sets the TCP buffer size to the optimal value
- Uses information published by the monitoring system

Load balancing
- Supports replication of files on multiple servers
- DPSS master uses status information stored in an LDAP directory to select the replica that will give the fastest response
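
The usual rule for sizing such a TCP buffer is the bandwidth-delay product of the path; the chapter does not give the exact formula DPSS uses, so treat this as the standard heuristic with illustrative numbers:

# The buffer must hold a full round-trip's worth of data to keep a
# high-bandwidth, high-latency path busy. Numbers are illustrative.
bandwidth_bits_per_sec = 622_000_000   # an OC-12-class wide-area link
round_trip_time_sec = 0.05             # 50 ms round-trip time

buffer_bytes = bandwidth_bits_per_sec * round_trip_time_sec / 8
print(buffer_bytes)   # 3887500.0 bytes, i.e. roughly 3.7 MB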

Page 24

Hierarchical Storage System

Fast disk cache in front of larger, slower storage

Works on the same principle as other hierarchies:
- Level-1 and Level-2 caches: minimize off-chip memory accesses
- Virtual memory systems: minimize page faults to disk

Goal:
- Keep popular material in faster storage
- Keep most of the material on cheaper, slower storage
- Locality: 10% of the material gets 90% of the accesses

Problem with tertiary storage (especially tape):
- Very slow
- Tape seek times can be a minute or more…
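
To see why the 10%/90% locality figure matters, a quick worked example of average access time with illustrative timings (a disk-cache hit versus a tape access):

# Illustrative timings: a disk-cache hit costs ~10 ms, a tape access ~60 s.
hit_rate = 0.90          # 90% of accesses go to the 10% of data kept on disk
disk_time_sec = 0.010
tape_time_sec = 60.0

average_sec = hit_rate * disk_time_sec + (1 - hit_rate) * tape_time_sec
print(average_sec)   # about 6 seconds on average, versus 60 seconds if everything lived on tape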