Overview of the Storage Systems Research Center
Darrell Long & Ethan Miller
Jack Baskin School of Engineering
The SSRC in one slide
Research Thrusts
• Archival storage
• New metadata & indexing approaches
• New storage technologies
• Deduplication
• Reliable & secure storage
• Large-scale object-based storage
Research Challenges
• Exascale capacity & scalability
• Performance
• Security
• Reliability & self-healing
• New storage technologies
• Improved metadata & indexing
SSRC features
• 2+2 core faculty
• ≈20 graduate students
• High degree of collaboration
• Close cooperation with sponsors
• High visibility
SSRC sponsors
• Dept. of Energy, NSF, LLNL, LANL
• Data Domain, Hitachi, IBM, LSI, NetApp, Seagate, Symantec
• Over $1 million per year
Recent SSRC research
• Archival storage
  • Secure long-term storage: POTSHARDS (USENIX 2007)
  • “Brick-based” archival storage: Pergamum (FAST 2008)
  • Managing exascale archives (PDSW 2008)
• Metadata and file system indexing
  • Understanding current workloads (USENIX 2008)
  • Scalable metadata indexes (FAST 2009)
• Deduplication
  • Scalable, partitioned indexes (KDD 2007)
  • Secure deduplication (StorageSS 2008)
• Secure petabyte-scale storage (SC 07)
• Reliability
  • Large-stripe redundancy for disks (StorageSS 2007)
  • Finding good erasure codes (DSN 2008)
External collaborations
• Petascale Data Storage Institute (PDSI)
  • Three universities (UCSC, CMU, Michigan)
  • Five national laboratories (LANL, NERSC, ORNL, PNL, Sandia)
• NetApp
  • Non-volatile memory for large-scale file servers
  • Archival storage
  • Scalable indexing
• Symantec
  • Deduplication for virtual machines
• IBM
  • Storage class memories
• Hewlett Packard
  • Deduplication
  • Reliability in file and storage systems
A few of our projects...
• Archival Storage
  • Evolvable and secure archives, long-term management
• Deduplication
  • VM deduplication, secure deduplication, scalable indexing, on-line deduplication
• Petascale Storage: Security and Reliability
  • Secure petascale storage, quotas, Store, Forget & Check
• Reliability
  • Disk reliability, disaster codes, restructuring RAID
• Indexing
  • Highly scalable metadata indexing, content-based access
Pergamum: Evolvable Archival Storage
• Archival storage requires archival solutions
  • Legislated requirements for professional data
  • Personal histories recorded digitally (photos, documents, ...)
  • Traditional systems too expensive to buy and operate
  • Archives must scale over many dimensions (time, technology, capacity, vendors, ...)
• Approach: distributed network of intelligent, disk-based devices
  • Function independently, cooperate in inter-device redundancy
  • Static cost control: standardized interfaces, commodity hardware
  • Operational cost control
    • Low-power hardware
    • Keep disks spun down
Power-Efficient Data Protection
• Goal: provide data protection without wasting energy
• Approach:
  • Scale the response to the problem
    • Intra-disk – protection from latent sector errors
    • Inter-disk – protection from device loss
  • Trees of signatures for algebraic consistency checking (sketched below)
  • Power-aware encoding and rebuild strategies
[Figure: four tomes (Tome 0–3), each holding parity (P) and redundancy (R) blocks, organized into inter-device redundancy groups]
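To make the consistency-checking idea concrete, here is a minimal sketch of a tree of signatures, assuming SHA-256 hashes in place of the algebraic signatures Pergamum actually uses; names are illustrative. Devices compare roots (or small subtrees) rather than reading all data.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(blocks):
        """Return the tree as a list of levels: leaf hashes first, root last."""
        level = [h(b) for b in blocks]
        levels = [level]
        while len(level) > 1:
            level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    # Two devices agree iff their roots match; on a mismatch, descend level by
    # level to localize the divergent block range instead of scanning everything.
    tree_a = build_tree([b"block0", b"block1", b"block2", b"block3"])
    tree_b = build_tree([b"block0", b"block1", b"blockX", b"block3"])
    print(tree_a[-1] == tree_b[-1])   # False: some block differs

Because only a small root must be exchanged while disks stay spun down, this style of check fits the power-aware rebuild strategy described above.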
POTSHARDS: Secure Long-Term Storage
• Problem: over potentially indefinite data lifetimes...
  • ...computationally bound mechanisms may eventually fail
  • ...change and loss are constants (keys, cryptosystems, ...)
• Trust the consensus of groups, not individuals
  • Store data on multiple, independently managed archives
  • Full collaboration should allow data recovery
• Use alternatives to computation-bound security
  • Unconditionally secure mechanisms (e.g., secret splitting)
  • Make targeted attacks difficult
  • Make aberrant behavior noticeable
Key Concept: Secret Splitting
[Figure: a user’s object is divided into fragments; each fragment is secret-split into shards (plus parity) that form a redundancy group and are placed on independently managed archives (Archive 0–3)]
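A minimal sketch of the secret-splitting idea in the figure, using plain XOR splitting: every shard is needed to recover a fragment, and any smaller subset reveals nothing. POTSHARDS’ actual scheme (approximate pointers, parity shards, placement policy) is richer; this code is only illustrative.

    import os

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def split(fragment: bytes, n: int):
        """Split a fragment into n shards; all n are required to reconstruct it."""
        shards = [os.urandom(len(fragment)) for _ in range(n - 1)]
        last = fragment
        for s in shards:
            last = xor_bytes(last, s)   # last shard = fragment XOR all random shards
        return shards + [last]

    def combine(shards):
        result = shards[0]
        for s in shards[1:]:
            result = xor_bytes(result, s)
        return result

    # One shard per independently managed archive:
    shards = split(b"fragment of an archived object", 4)
    assert combine(shards) == b"fragment of an archived object"

Because recovery needs nothing but the shards themselves, the security is unconditional rather than computation-bound, matching the POTSHARDS goals above.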
Data Deduplication
• Data deduplication is typically achieved by chunking files
• Different ways to chunk data
  • Variable- and fixed-size chunking (sketched below)
• Analyze the chunks to see if there are any interesting properties
  • Frequency of chunks per chunking scheme
  • Chunk locality: chunks that occur in sequence each time the first occurs
    • What about runs of three or four chunks?
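A minimal sketch of fixed- versus variable-size chunking, assuming a plain hash over a sliding window in place of the rolling hash (e.g., Rabin fingerprinting) a real deduplication system would use; parameters and names are illustrative.

    import hashlib

    def fixed_chunks(data: bytes, size: int = 4096):
        """Fixed-size chunking: cut every `size` bytes."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def variable_chunks(data: bytes, window: int = 48, mask_bits: int = 12):
        """Content-defined chunking: cut where the window hash matches a pattern,
        giving an average chunk size of about 2**mask_bits bytes."""
        chunks, start = [], 0
        for i in range(window, len(data)):
            digest = hashlib.sha1(data[i - window:i]).digest()
            if int.from_bytes(digest[:4], "big") % (1 << mask_bits) == 0:
                chunks.append(data[start:i])
                start = i
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    # Chunk locality: identical chunk *sequences* recur across similar files, so a
    # hit on one chunk's signature predicts hits on the two or three that follow.

Variable-size chunking keeps boundaries aligned with content, so an insertion early in a file does not shift every later chunk the way fixed-size chunking does.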
Managing Deduplication in Large Scale Archival Systems
• Index lookup: an integral part of the deduplication process in archival systems
  • A lookup index can consist of chunk signatures or shingles
• In large-scale archival systems, index lookup becomes a performance bottleneck
  • Index size increases
  • Query and update performance suffer
  • Network bandwidth is under-utilized
  • Turnaround time is not satisfactory
• Approaches used today:
  • Out-of-band deduplication
  • In-band deduplication
• Goal: partition the index while preserving efficient lookups
Partitioning the Chunk Index
• Goal: partition the chunk index I into multiple partitions I_0, …, I_{K-1}
• When adding a new file f_n to the archive:
  • Extract chunk signatures
  • Choose m partitions using the document routing algorithm (m: routing factor)
  • Route all the chunks to the m partitions, m < K
  • Add the chunk signatures to each of the m partitions
• When identifying redundancies within a file f_q:
  • Extract chunk signatures
  • Choose m partitions using the document routing algorithm
  • Route all the features to the m partitions, m < K
  • Query each of the m partitions to identify redundant chunks
• Goal: minimal loss of compression while keeping m much smaller than K (see the sketch below)
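A minimal sketch of the add/lookup flow above, assuming a simple min-hash-style routing function; the real document routing algorithm and per-partition data structures differ, and all names are illustrative.

    import hashlib

    K = 16                                    # number of index partitions
    partitions = [set() for _ in range(K)]    # each holds chunk signatures

    def chunk_signatures(chunks):
        return [hashlib.sha1(c).hexdigest() for c in chunks]

    def route(sigs, m):
        # Pick (up to) m of the K partitions from the file's smallest signatures,
        # so similar files tend to route to the same partitions.
        return {int(s, 16) % K for s in sorted(sigs)[:m]}

    def add_file(chunks, m=2):
        sigs = chunk_signatures(chunks)
        for p in route(sigs, m):
            partitions[p].update(sigs)        # add signatures to each chosen partition

    def find_redundant(chunks, m=2):
        sigs = chunk_signatures(chunks)
        hits = set()
        for p in route(sigs, m):              # query only m partitions, m < K
            hits |= partitions[p] & set(sigs)
        return hits                           # chunks that are already indexed

Because only m of K partitions are consulted per file, some duplicates can be missed; the design goal stated above is to keep that loss of compression minimal while the index scales out.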
VM Deduplication
• Huge opportunity for saving space in virtualized environments
• Most efficient with homogeneous VMs: more images, more savings
• Roughly half the space is still saved even with heterogeneous VMs
Collection of Ubuntu VMs
Secure deduplication
• Naive deduplication of encrypted data cannot work
• Solution: deduplicate the data first!
  • But then how do you encrypt?
  • Encrypt so that if you know the data that was deduplicated you can reproduce the encryption; otherwise it is impossible to guess
  • Solution: convergent encryption of chunks (with appropriate data structures), sketched after the figure
[Figure: a file is chunked; each chunk is hashed to derive its key and chunk ID, encrypted with that key, and recorded in the chunk map]
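A minimal sketch of convergent encryption as used here: the per-chunk key is derived from the chunk’s own content, so identical plaintext chunks produce identical ciphertext and still deduplicate. A SHA-256 keystream stands in for a real cipher (e.g., AES) to keep the sketch dependency-free; names are illustrative.

    import hashlib

    def keystream(key: bytes, length: int) -> bytes:
        out, counter = b"", 0
        while len(out) < length:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def convergent_encrypt(chunk: bytes):
        key = hashlib.sha256(chunk).digest()          # key = H(plaintext chunk)
        chunk_id = hashlib.sha256(key).hexdigest()    # ID used for dedup lookup
        cipher = bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))
        return chunk_id, cipher, key                  # key is kept in the chunk map

    # Identical chunks always yield identical (chunk_id, ciphertext) pairs:
    assert convergent_encrypt(b"same data") == convergent_encrypt(b"same data")

Only someone who already knows the plaintext chunk can derive its key, which is exactly the property stated on the slide above.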
Secure file systems
• Scalable security for Ceph
  • At most one “ticket” per file, regardless of how many clients and OSDs
  • Public-key operations amortized across many uses
• Quotas in Ceph
  • Ensure usage stays within limits at near-zero cost, with 100% catching of cheaters
• Remote verification of stored data
  • Verify consistency of stored erasure-coded data
  • Secure against collusion by all remote servers
Scalable Security in Ceph
• Extended capabilities
  • Can authorize I/O for any number of clients to any number of whole files
  • Reduce the number of capabilities
• Automatic revocation
  • Capability expiration ⇔ capability revocation
  • Revocation without any contact
• Secure delegation
  • Delegate access rights to other clients
  • Shift security to key possession
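A minimal sketch of the extended-capability idea, assuming a symmetric MAC shared by the metadata server and OSDs purely for brevity (the slide’s amortized public-key signatures would take its place); field names are illustrative and the actual Ceph token format differs.

    import hashlib, hmac, json, time

    MDS_OSD_SECRET = b"placeholder-shared-secret"   # assumed MDS/OSD key

    def issue_capability(clients, files, rights, lifetime_s=300):
        """One token authorizes many clients to many whole files; its expiry
        doubles as revocation, so no revocation contact is needed."""
        body = {"clients": sorted(clients), "files": sorted(files),
                "rights": sorted(rights), "expires": int(time.time()) + lifetime_s}
        raw = json.dumps(body, sort_keys=True).encode()
        tag = hmac.new(MDS_OSD_SECRET, raw, hashlib.sha256).hexdigest()
        return body, tag

    def osd_check(body, tag, client, filename, right):
        raw = json.dumps(body, sort_keys=True).encode()
        good = hmac.new(MDS_OSD_SECRET, raw, hashlib.sha256).hexdigest()
        return (hmac.compare_digest(tag, good)
                and time.time() < body["expires"]
                and client in body["clients"]
                and filename in body["files"]
                and right in body["rights"])

    cap, tag = issue_capability({"client1", "client2"}, {"/data/a", "/data/b"}, {"read"})
    print(osd_check(cap, tag, "client1", "/data/a", "read"))   # True until expiry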
Quotas in Ceph
• Separate allocation and quota management using a digital-cash-based system model
  • The quota management server acts as a bank
• Clients withdraw vouchers from the quota server for a user and store them for later use
• Clients spend vouchers for users in order to purchase storage from storage servers
• Storage servers periodically update the quota server about user storage
  • Cheaters are caught by the bank at defined intervals
• Vouchers are cryptographically protected byte sequences {epoch, expiry, user, amount, serial}_auth
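A minimal sketch of the voucher flow, assuming an HMAC from the quota server serves as the auth field; the real authenticator and reconciliation protocol may differ, and all names are illustrative.

    import hashlib, hmac, itertools, json, time

    BANK_KEY = b"quota-server-key"        # assumed secret held by the quota server
    _serial = itertools.count(1)

    def withdraw(user: str, amount: int, epoch: int, lifetime_s: int = 3600):
        """Client withdraws a voucher {epoch, expiry, user, amount, serial}_auth."""
        body = {"epoch": epoch, "expiry": int(time.time()) + lifetime_s,
                "user": user, "amount": amount, "serial": next(_serial)}
        raw = json.dumps(body, sort_keys=True).encode()
        return {**body, "auth": hmac.new(BANK_KEY, raw, hashlib.sha256).hexdigest()}

    def verify(voucher) -> bool:
        body = {k: v for k, v in voucher.items() if k != "auth"}
        raw = json.dumps(body, sort_keys=True).encode()
        good = hmac.new(BANK_KEY, raw, hashlib.sha256).hexdigest()
        return hmac.compare_digest(good, voucher["auth"])

    def reconcile(spent_vouchers) -> set:
        """Bank-side check at defined intervals: forged or double-spent vouchers
        identify cheaters."""
        cheaters, seen = set(), set()
        for v in spent_vouchers:
            if not verify(v) or v["serial"] in seen:
                cheaters.add(v["user"])
            seen.add(v["serial"])
        return cheaters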
Store, Forget and Check
• Systems store data on remote nodes
  • Remote nodes may not be trustworthy
  • The data owner must check to ensure that data is really stored
• Two current approaches:
  • Read data from multiple sites and check for consistency
  • Generate a checksum remotely and compare it to the checksum of local data
• We developed an efficient algorithm that does not require keeping a local copy of the data
• A storage utility provides remotely managed storage
  • The client sends data to the SSP, then retrieves it as needed
• Trust issue: how can the client tell if the SSP is doing its job?
  • Read data, check a (public-key-based) signature
  • Read data, decrypt, check the secure hash and object ID
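A minimal sketch of the second check style listed above (secure hash plus object ID), assuming the client keeps only a small secret key and the object ID rather than a copy of the data; the SSRC algorithm itself is more efficient than this read-everything check, and names are illustrative.

    import hashlib, hmac

    CLIENT_KEY = b"client-secret"   # assumed key known only to the client

    def store(object_id: bytes, data: bytes) -> bytes:
        """Bind the data to its object ID before handing it to the SSP."""
        tag = hmac.new(CLIENT_KEY, object_id + data, hashlib.sha256).digest()
        return data + tag

    def check(object_id: bytes, blob: bytes) -> bool:
        """Verify what the SSP returns without any local copy of the data."""
        data, tag = blob[:-32], blob[-32:]
        good = hmac.new(CLIENT_KEY, object_id + data, hashlib.sha256).digest()
        return hmac.compare_digest(tag, good)

    blob = store(b"obj-42", b"payload")
    assert check(b"obj-42", blob)                    # honest SSP
    assert not check(b"obj-42", b"x" * len(blob))    # corrupted or substituted data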
Reliable Storage (Advanced use of erasure codes)
• Cannot treat erasure codes as a black box!
  • Must do more than choose m and n
  • Code choice affects performance, reliability, and power
• Goal: smart application of codes in emerging storage systems
  • NVRAM-based storage systems
  • Arrays of heterogeneous devices
  • Power-managed archives
  • Secure storage systems
Surviving disasters in storage
• Storage systems need to survive failures
  • CPUs / networks / power supplies fail: no data loss (usually)
  • Drives fail: data loss!
• Typical solution: use RAID
  • RAID doesn’t provide enough reliability for extreme cases (multiple failures)
  • RAID configurations that can survive multiple failures are often slow
• Our solution: disaster recovery codes
  • Handle the common case (single failure) quickly
  • Handle the uncommon case (multiple failures) correctly but somewhat more slowly
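A minimal sketch of the fast common-case path: a single XOR parity (which could live on an NVRAM device, as on the next slide) rebuilds any one failed drive, while multiple failures would fall back to a slower, stronger mechanism. Data and device names are illustrative.

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    drives = [b"data-d0!", b"data-d1!", b"data-d2!"]   # equal-sized drive contents
    parity = reduce(xor, drives)                        # one extra parity "disk"

    # Common case: one drive is lost; rebuild it from the survivors plus parity.
    lost = 1
    survivors = [d for i, d in enumerate(drives) if i != lost]
    assert reduce(xor, survivors + [parity]) == drives[lost]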
How effective is this?
• Just a single NVRAM parity “disk” reduces the chance of data loss dramatically
• One disk of NVRAM in a system with hundreds of disks isn’t that expensive...
[Plot: “Data Loss Probabilities for Mirroring and Single NVRAM Parity” — probability of data loss (0–1) vs. number of OSD failures (0–80), comparing the No_Parity and Single_Parity configurations]
Using non-volatile memory technologies
• File systems for byte-addressable NVRAM
  • Highly compressible metadata: 3× regular file systems
  • High performance: compressed metadata is faster
  • Rich linking structures
• Reliable file systems in NVRAM
  • Algebraic signatures ensure 64–128 B blocks are correct
  • Parity (Reed-Solomon) provides redundancy for sets of blocks
  • Necessary given the relatively low reliability of NVRAM
  • Protects against software errors as well (in conjunction with good file system design)
• View-based file systems
Reliable NVRAM arrays
• Flash memory is unreliable
  • Relatively high error rates
  • More reads ➔ higher error rates
  • More writes ➔ higher error rates
• How can reliable systems rely on flash?
• Use multi-level redundancy for arrays of flash memory
  • Choose codes at each level knowing that higher levels are present to correct errors
  • Low levels detect errors (and correct a few)
  • Higher levels correct erasures
• Tradeoffs in performance and reliability
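A minimal sketch of the multi-level idea, assuming a truncated SHA-256 checksum as the low-level detecting code and XOR parity across devices as the higher-level erasure-correcting code; a real design would use proper ECC, and all names are illustrative.

    import hashlib
    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def write_page(data: bytes) -> bytes:
        return data + hashlib.sha256(data).digest()[:4]      # low level: detect

    def read_page(page: bytes):
        data, csum = page[:-4], page[-4:]
        return data if hashlib.sha256(data).digest()[:4] == csum else None  # erasure

    pages = [write_page(b"dev%d-data" % i) for i in range(4)]
    parity = reduce(xor, [p[:-4] for p in pages])             # high level: correct

    pages[2] = pages[2][:3] + b"\x00" + pages[2][4:]          # simulate a bit error
    reads = [read_page(p) for p in pages]
    survivors = [d for d in reads if d is not None]
    recovered = reduce(xor, survivors + [parity])             # rebuild the erasure
    assert recovered == b"dev2-data"

Because the low level only has to detect (turning errors into erasures), the higher level can use a simpler, cheaper correcting code than it would need on its own.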
Distributed Indexing
• Storage systems are rapidly increasing in size
  • Petabytes of data, billions of files, thousands of users
  • Too large to manage with existing tools!
• Need a scalable way to find and access files
  • Users spend too much time organizing their data
  • Administrators need scalable tools to manage systems
• Search is an emerging solution
  • Common on desktops (Spotlight, WinFS)
  • However, these solutions cannot scale to large systems
• Need scalable search for large storage systems
Spyglass: Scalable Distributed Search
• Designed to allow fast search and indexing across a data center and its entire history!
• Optimized for non-selective, hierarchical data sets
• Storage nodes partition the namespace into individual indexes
  • Preserves locality of data properties, exploits the hierarchical namespace
• The index is a K-D tree, a k-dimensional search tree
  • Fast k-dimensional search for non-selective data
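A minimal sketch of the namespace-partitioned query path, assuming a fixed-depth subtree split and simple per-partition lists standing in for Spyglass’s K-D trees; all names and parameters are illustrative.

    from collections import defaultdict

    partitions = defaultdict(list)   # subtree root -> list of (path, size, mtime)

    def subtree(path: str, depth: int = 2) -> str:
        """Map a path to its enclosing subtree partition, e.g. "/home/alice"."""
        return "/".join(path.split("/")[:depth + 1])

    def index_file(path: str, size: int, mtime: int):
        partitions[subtree(path)].append((path, size, mtime))

    def query(root: str, min_size: int, since: int):
        """Search only partitions under `root`, preserving namespace locality."""
        return [entry for key, files in partitions.items() if key.startswith(root)
                for entry in files
                if entry[1] >= min_size and entry[2] >= since]

    index_file("/home/alice/paper.tex", 40_000, 1_200)
    index_file("/home/bob/data.bin", 9_000_000, 1_500)
    print(query("/home/bob", min_size=1_000_000, since=1_000))

Partitioning by subtree keeps files with related metadata in the same index, so most queries touch only a few partitions instead of a single global index.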
Conclusions
• The SSRC is a very active research group
  • Lots of interesting projects
  • Many collaborators from academia, industry, and the national laboratories
  • Several students graduating each year
• We welcome your involvement!
  • Graduate student internships (get them early)
  • Recruiting researchers (employees)
  • Sponsoring research
  • Visiting professors
  • Visitors from your laboratory