Andrew Hanushevsky 17-Mar-99 Pursuit of a Scalable High Performance Multi-Petabyte Database 16th IEEE Symposium on Mass Storage Systems Andrew Hanushevsky SLAC Computing Services Marcin Nowak CERN Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy


Page 1:

Andrew Hanushevsky 17-Mar-99 1

Pursuit of a Scalable High Performance Multi-Petabyte Database

16th IEEE Symposium on Mass Storage Systems

Andrew Hanushevsky
SLAC Computing Services

Marcin Nowak
CERN

Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy

Page 2:

High Energy Experiments

BaBar at SLAC
  High precision investigation of B-meson decays
  Explore the asymmetry between matter and antimatter
    Where did all the antimatter go?

ATLAS at CERN
  Probe the Higgs boson energy range
  Explore the more exotic reaches of physics

Page 3:

High Energy Physics Quantitative Challenge

Experiment            BaBar/SLAC         ATLAS/CERN
Starts                May 1999           May 2005
Data Volume           0.2 petabytes/yr   5.0 petabytes/yr
  Total amount        2.0 petabytes      100 petabytes
Aggregate xfr rate    200 MB/sec disk    100 GB/sec disk
                      60 MB/sec tape     1 GB/sec tape
Processing power      5,000 SPECint95    250,000 SPECint95
  SPARC Ultra 10's    526                27,000
Physicists            800                3,000
Locations             87                 250
Countries             9                  50

Page 4:

Common Elements

Data will be stored in an Object Oriented database: Objectivity/DB
  Has the theoretical ability to scale to the size of the experiments
  Most data will be kept offline

HPSS
  Heavy duty, industrial strength mass storage system

BaBar will be blazing the path
  First large scale experiment to use this combination
  The year of the hare will be a very interesting time

Page 5:

Objectivity/DB

Client/Server Application
  Primary access is through the Advanced Multithreaded Server (AMS)
  Can have any number of AMSs
  AMS serves “pages” (512 to 64K byte blocks)
    Similar to other remote filesystem interfaces (e.g., NFS)

Objectivity client can read and write database “pages” via AMS
  Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.)

(Diagram labels: ams protocol, ufs protocol)
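The page-size rule above (512 bytes to 64K, powers of two) is easy to capture in code. The helper below is purely illustrative — it is not part of the Objectivity/DB or AMS API:

```cpp
#include <cstdint>

// Illustrative helper (not an actual Objectivity/DB call): returns true if
// a requested page size is one the AMS can serve -- between 512 bytes and
// 64K, and an exact power of two.
bool isValidPageSize(std::uint32_t bytes) {
    if (bytes < 512 || bytes > 64 * 1024)
        return false;
    return (bytes & (bytes - 1)) == 0;  // power-of-two check
}
```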

Page 6:

High Performance Storage System

Components: Bitfile Server, Name Server, Storage Servers, Physical Volume Library, Physical Volume Repositories, Storage System Manager, Migration/Purge Server, Metadata Manager, Log Daemon, Log Client, Startup Daemon, Encina/SFS, DCE

(Diagram: components linked by a Control Network and a Data Network)

Page 7:

The Obvious Solution

(Diagram: a Network Switch connects the Compute Farm, the Database Servers, the Mass Storage System, and External Collaborators)

But… the devil is in the details

Page 8:

Capacity and Transfer Rate

(Chart: Tape Cartridge Capacity, Tape Transfer Rate, Disk System Capacity, and Disk Transfer Rate plotted from 1988 through 2006; GB capacity on a 1–1024 log scale, MB/sec transfer rate on a 3–384 log scale)

Page 9:

The Capacity Transfer Rate Gap

Density is growing faster than the ability to transfer data
  We can store the data just fine, but do we have the time to look at it?

There are solutions short of poverty
  Striped tape?
    Only if you want a lot of headaches
  Intelligent staging
  Primary access on RAID devices
    Cost/Performance is still a problem
    Need to address UFS scaling problem
  Replication - a fatter pipe?
    Data synchronization problem
    Load balancing issues

Whatever the solution is, you'll need a lot of them

Page 10:

Part of the solution: Together Alone

HPSS
  Highly scalable, excellent I/O performance for large files, but
  high latency for small block transfers (i.e., Objectivity/DB)

AMS
  Efficient database protocol and highly flexible, but
  limited security, tied to the local filesystem

Need to synergistically mate these systems

Page 11:

Opening up new vistas: The Extensible AMS

(Diagram: the Extensible AMS layered over the oofs interface and a system specific interface)

Page 12:

As big as it gets: Scaling The File System

Veritas Volume Manager
  Catenates disk devices to form very large capacity logical devices

Veritas File System
  High performance (60+ MB/sec) journaled file system for fast recovery

Combination used as HPSS staging target
  Allows for fast streaming I/O and efficient small block transfers

Page 13:

Not out of the woods yet: Other Issues

Access patterns
  Random vs. sequential
Staging latency
Scalability
Security

Page 14:

No prophets here: Supplying Performance Hints

Need additional information for optimum performance
  Different from Objectivity clustering hints
    Database clustering
    Processing mode (sequential/random)
    Desired service levels
  Information is Objectivity independent

Need a mechanism to tunnel opaque information
  Client supplies hints via oofs_set_info() call
  Information relayed to AMS in a transparent way
  AMS relays information to the underlying file system via oofs()
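The tunneling idea above can be sketched as the client packing its hints into one opaque string that the AMS relays without parsing. The key names and encoding below are purely illustrative — they are not the actual oofs_set_info() format:

```cpp
#include <sstream>
#include <string>

// Hypothetical hint encoding (not the real oofs_set_info() format): the
// client packs Objectivity-independent hints -- processing mode and a
// desired service level -- into one opaque string. The AMS never parses
// it; only the underlying file system interprets the contents.
std::string packHints(const std::string& mode, int serviceLevel) {
    std::ostringstream os;
    os << "mode=" << mode << "&svc=" << serviceLevel;
    return os.str();
}
```

Because the string is opaque to the AMS, the hint vocabulary can evolve without touching Objectivity or the server at all.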

Page 15:

Where’s the data? Dealing With Latency...

Hierarchical filesystems may have high latency bursts
  Mounting a tape file

Need a mechanism to notify the client of the expected delay
  Prevents request timeout
  Prevents retransmission storms
  Also allows the server to degrade gracefully
    Can delay clients when overloaded

Defer Request Protocol
  Certain oofs() requests can tell the client of the expected delay
    For example, open()
  Client waits the indicated amount of time and tries again
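The client side of the Defer Request Protocol amounts to a wait-and-retry loop. The sketch below uses invented names (OpenResult, tryOpen) since the actual AMS interfaces are not shown here:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Result of one attempt: either the request completed, or the server
// asked the client to come back after deferSeconds (hypothetical shape).
struct OpenResult {
    bool ok;
    int  deferSeconds;
};

// Instead of timing out (and risking a retransmission storm), the client
// sleeps for the server-indicated delay and simply tries again.
bool openWithDefer(const std::function<OpenResult()>& tryOpen, int maxRetries) {
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
        OpenResult r = tryOpen();
        if (r.ok)
            return true;
        std::this_thread::sleep_for(std::chrono::seconds(r.deferSeconds));
    }
    return false;  // server still busy after maxRetries deferrals
}
```

Note the graceful-degradation property: an overloaded server controls its own load simply by handing out longer defer times.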

Page 16:

Many out of one: Dynamically Replicated Databases

Dynamically distributed databases
  A single machine can't manage over a terabyte of disk cache
  No good way to statically partition the database

Dynamically varying database access paths
  As load increases, add more copies
    Copies accessed in parallel
  As load decreases, remove copies to free up disk space

Objectivity catalog independence
  Copies managed outside of Objectivity
  Minimizes impact on administration

Page 17:

If there are many, which one do I go to?

Request Redirect Protocol
  oofs() routines supply an alternate AMS location
  oofs routines responsible for update synchronization

Typically, read/only access provided on copies
  Only one read/write copy conveniently supported
  Client must declare intention to update prior to access
  Lazy synchronization possible
  Good mechanism for largely read/only databases

Load balancing provided by an AMS collective
  Has one distinguished member recorded in the catalogue
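One way to picture the collective's load balancing: the distinguished member answers first and redirects the client to whichever interchangeable copy is least loaded. The structures below are an illustration of that idea, not the actual oofs routines:

```cpp
#include <string>
#include <vector>

// A member of an AMS collective. By convention in this sketch, the
// distinguished member (the one recorded in the Objectivity catalogue)
// is the first entry of the vector.
struct Member {
    std::string host;
    int load;  // e.g., current client count (illustrative metric)
};

// Returns the host the client should use: the least-loaded member.
// When that is not the distinguished member, this models a redirect.
std::string selectMember(const std::vector<Member>& collective) {
    const Member* best = &collective.front();
    for (const Member& m : collective)
        if (m.load < best->load)
            best = &m;
    return best->host;
}
```

Since copies are read/only and interchangeable, the choice of member affects only load, never correctness.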

Page 18:

The AMS Collective

(Diagram: a client is redirected from a distinguished member to other members of the collective; collective members are effectively interchangeable)

Page 19:

Keeping the hackers at bay: Object Oriented Security

No performance is sufficient if you always have to recompute
  Need a security mechanism to thwart hackers

Protocol Independent Authentication Model
  Public or private key
    PGP, RSA, Kerberos, etc.
    Can be negotiated at run-time
  Automatically called by client and server kernels
  Supplied via replaceable shared libraries

Client Objectivity Kernel creates security objects as needed
  Security objects supply context-sensitive authentication credentials

Works only with the Extensible AMS via the oofs interface

Page 20:

Overall Effects

Extensible AMS
  Allows use of any type of filesystem via the oofs layer
Generic Authentication Protocol
  Allows proper client identification
Opaque Information Protocol
  Allows passing of hints to improve filesystem performance
Defer Request Protocol
  Accommodates hierarchical filesystems
Redirection Protocol
  Accommodates terabyte+ filesystems
  Provides for dynamic load balancing

Page 21:

Dynamic Load Balancing Hierarchical Secure AMS

(Diagram: dynamic selection among AMS instances)

Page 22:

Summary

AMS is capable of high performance
  Ultimate performance limited by disk speeds
  Should be able to deliver an average of 20 MB/sec per disk

The oofs interface + other protocols greatly enhance performance, scalability, usability, and security

5+ TB of SLAC data has been processed using AMS+HPSS
  Some AMS problems
  No HPSS problems

SLAC will be using this combination to store physics data
  BaBar experiment will produce over 2 PB of data in 10 years
    2,000,000,000,000,000 = 2×10^15 bytes
    200,000 3590 tapes
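The tape count follows from the original IBM 3590's native cartridge capacity of 10 GB: 2×10^15 / 10^10 = 200,000. A quick compile-time check of the arithmetic:

```cpp
#include <cstdint>

// 2 PB of BaBar data divided by the 10 GB native capacity of an original
// IBM 3590 cartridge reproduces the slide's 200,000-tape figure.
constexpr std::int64_t kBytesTotal   = 2'000'000'000'000'000LL;  // 2 PB
constexpr std::int64_t kBytesPerTape = 10'000'000'000LL;         // 10 GB
constexpr std::int64_t kTapesNeeded  = kBytesTotal / kBytesPerTape;
static_assert(kTapesNeeded == 200'000, "matches the slide's estimate");
```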

Page 23:

Now for the reality

Full AMS features not yet implemented
  SLAC/Objectivity design has been completed
    oofs OO interface, OO security, protocols (i.e., DRP, RRP, and GAP)
  oofs and ooss layers are completely functional
    HPSS integration is full-featured and complete

Protocol development has been fully funded at SLAC
  DRP, RRP, and GAP
  Initial feature set to be deployed late summer
    DRP, GAP, and limited RRP
  Full asynchronous replication within 2 years

CERN & SLAC approaches similar
  But quite different in detail…

Page 24:

CERN staging approach: RFIO/RFCP + HPSS

(Diagram: the AMS performs UNIX FS I/O of DB pages against a disk pool on a Solaris disk server and issues stage-in requests to an RFIO daemon; the HPSS Server handles file & catalog management; a migration daemon uses RFIO calls and RFCP (RFIO copy) to move data between the disk pool and the HPSS Mover / tape robot)

Page 25:

SLAC staging approach: PFTP + HPSS

(Diagram: as at CERN, but the AMS issues stage-in requests to a Gateway daemon, which passes gateway requests to the HPSS Server for file & catalog management; the migration daemon moves data between the Solaris disk server's disk pool and the HPSS Mover / tape robot over PFTP data and control connections)

Page 26:

SLAC ultimate approach: Direct Tape Access

(Diagram: the AMS talks to the HPSS Server directly via the native API (rpc); data moves by direct transfer between the disk pool on the Solaris disk server and the HPSS Mover / tape robot, with the migration daemon handling stage-in requests and file & catalog management as before)

Page 27:

CERN 1TB Test Bed

(Diagram: FDDI, HIPPI, and Fast Ethernet interconnect a DEC Alpha, two IBM RS6000s, two SUN Sparc 5s, an IBM tape silo, and a staging pool, hosting the HPSS Data Movers, the HPSS Server / RFIO daemon, and the AMS/HPSS interface; labeled a current approximation, with a future move to 1 Gb switched Ethernet in a star topology)

Page 28:

SLAC Configuration

(Diagram, approximate: Gigabit Ethernet connecting several Sun 4500s, each running an AMS Server and an HPSS Mover with 900 GB of disk, and an IBM RS6000 F50 running the HPSS Server)

Page 29:

SLAC Detailed Configuration