Overview of the Storage Systems Research Center
Darrell Long & Ethan Miller
Jack Baskin School of Engineering
The SSRC in one slide
Research Thrusts
• Archival storage
• New metadata & indexing approaches
• New storage technologies
• Deduplication
• Reliable & secure storage
• Large-scale object-based storage
Research Challenges
• Exascale capacity & scalability
• Performance
• Security
• Reliability & self-healing
• New storage technologies
• Improved metadata & indexing
SSRC features
• 2+2 core faculty
• ≈20 graduate students
• High degree of collaboration
• Close cooperation with sponsors
• High visibility
SSRC sponsors
• Dept. of Energy, NSF, LLNL, LANL
• Data Domain, Hitachi, IBM, LSI, NetApp, Seagate, Symantec
• Over $1 million per year
Recent SSRC research
• Archival storage
  • Secure long-term storage: POTSHARDS (USENIX 2007)
  • “Brick-based” archival storage: Pergamum (FAST 2008)
  • Managing exascale archives (PDSW 2008)
• Metadata and file system indexing
  • Understanding current workloads (USENIX 2008)
  • Scalable metadata indexes (FAST 2009)
• Deduplication
  • Scalable, partitioned indexes (KDD 2007)
  • Secure deduplication (StorageSS 2008)
• Secure petabyte-scale storage (SC 07)
• Reliability
  • Large-stripe redundancy for disks (StorageSS 2007)
  • Finding good erasure codes (DSN 2008)
External collaborations
• Petascale Data Storage Institute (PDSI)
  • Three universities (UCSC, CMU, Michigan)
  • Five national laboratories (LANL, NERSC, ORNL, PNL, Sandia)
• NetApp
  • Non-volatile memory for large-scale file servers
  • Archival storage
  • Scalable indexing
• Symantec
  • Deduplication for virtual machines
• IBM
  • Storage class memories
• Hewlett Packard
  • Deduplication
  • Reliability in file and storage systems
A few of our projects...
• Archival Storage
  • Evolvable and secure archives, long-term management
• Deduplication
  • VM deduplication, secure deduplication, scalable indexing, on-line deduplication
• Petascale Storage: Security and Reliability
  • Secure petascale storage, quotas, Store, Forget & Check
• Reliability
  • Disk reliability, disaster codes, restructuring RAID
• Indexing
  • Highly scalable metadata indexing, content-based access
Pergamum: Evolvable Archival Storage
• Archival storage requires archival solutions
  • Legislated requirements for professional data
  • Personal histories recorded digitally (photos, documents, ...)
  • Traditional systems too expensive to buy and operate
  • Archives must scale over many dimensions (time, technology, capacity, vendors, ...)
• Approach: distributed network of intelligent, disk-based devices
  • Function independently, cooperate in inter-device redundancy
  • Static cost control: standardized interfaces, commodity hardware
  • Operational cost control
    • Low-power hardware
    • Keep disks spun down
Power-Efficient Data Protection
• Goal: provide data protection without wasting energy
• Approach:
  • Scale the response to the problem
    • Intra-disk – protection from latent sector errors
    • Inter-disk – protection from device loss
  • Trees of signatures for algebraic consistency checking (sketched below)
  • Power-aware encoding and rebuild strategies
[Figure: four tomes (Tome 0–3), each holding parity (P) and redundancy (R) blocks, organized into inter-device redundancy groups]
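To make the consistency-checking idea concrete, here is a minimal sketch of a tree of signatures, assuming SHA-256 hashes in place of the algebraic signatures Pergamum actually uses; names are illustrative. Devices compare roots (or small subtrees) rather than reading all data.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(blocks):
        """Return the tree as a list of levels: leaf hashes first, root last."""
        level = [h(b) for b in blocks]
        levels = [level]
        while len(level) > 1:
            level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    # Two devices agree iff their roots match; on a mismatch, descend level by
    # level to localize the divergent block range instead of scanning everything.
    tree_a = build_tree([b"block0", b"block1", b"block2", b"block3"])
    tree_b = build_tree([b"block0", b"block1", b"blockX", b"block3"])
    print(tree_a[-1] == tree_b[-1])   # False: some block differs

Because only a small root must be exchanged while disks stay spun down, this style of check fits the power-aware rebuild strategy described above.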
POTSHARDS: Secure Long-Term Storage
• Problem: over potentially indefinite data lifetimes...
  • ...computationally bound mechanisms may eventually fail
  • ...change and loss are constants (keys, cryptosystems, ...)
• Trust the consensus of groups, not individuals
  • Store data on multiple, independently managed archives
  • Full collaboration should allow data recovery
• Use alternatives to computation-bound security
  • Unconditionally secure mechanisms (e.g., secret splitting)
  • Make targeted attacks difficult
  • Make aberrant behavior noticeable
Key Concept: Secret Splitting
[Figure: a user’s object is divided into fragments; each fragment is secret-split into shards (plus parity) that form a redundancy group and are placed on independently managed archives (Archive 0–3)]
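A minimal sketch of the secret-splitting idea in the figure, using plain XOR splitting: every shard is needed to recover a fragment, and any smaller subset reveals nothing. POTSHARDS’ actual scheme (approximate pointers, parity shards, placement policy) is richer; this code is only illustrative.

    import os

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def split(fragment: bytes, n: int):
        """Split a fragment into n shards; all n are required to reconstruct it."""
        shards = [os.urandom(len(fragment)) for _ in range(n - 1)]
        last = fragment
        for s in shards:
            last = xor_bytes(last, s)   # last shard = fragment XOR all random shards
        return shards + [last]

    def combine(shards):
        result = shards[0]
        for s in shards[1:]:
            result = xor_bytes(result, s)
        return result

    # One shard per independently managed archive:
    shards = split(b"fragment of an archived object", 4)
    assert combine(shards) == b"fragment of an archived object"

Because recovery needs nothing but the shards themselves, the security is unconditional rather than computation-bound, matching the POTSHARDS goals above.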
Data Deduplication
• Data deduplication is typically achieved by chunking files
• Different ways to chunk data
  • Variable- and fixed-size chunking (sketched below)
• Analyze the chunks to see if there are any interesting properties
  • Frequency of chunks per chunking scheme
  • Chunk locality: chunks that occur in sequence each time the first occurs
    • What about runs of three or four chunks?
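A minimal sketch of fixed- versus variable-size chunking, assuming a plain hash over a sliding window in place of the rolling hash (e.g., Rabin fingerprinting) a real deduplication system would use; parameters and names are illustrative.

    import hashlib

    def fixed_chunks(data: bytes, size: int = 4096):
        """Fixed-size chunking: cut every `size` bytes."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def variable_chunks(data: bytes, window: int = 48, mask_bits: int = 12):
        """Content-defined chunking: cut where the window hash matches a pattern,
        giving an average chunk size of about 2**mask_bits bytes."""
        chunks, start = [], 0
        for i in range(window, len(data)):
            digest = hashlib.sha1(data[i - window:i]).digest()
            if int.from_bytes(digest[:4], "big") % (1 << mask_bits) == 0:
                chunks.append(data[start:i])
                start = i
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    # Chunk locality: identical chunk *sequences* recur across similar files, so a
    # hit on one chunk's signature predicts hits on the two or three that follow.

Variable-size chunking keeps boundaries aligned with content, so an insertion early in a file does not shift every later chunk the way fixed-size chunking does.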
Managing Deduplication in Large Scale Archival Systems
• Index lookup: an integral part of the deduplication process in archival systems
  • A lookup index can consist of chunk signatures or shingles
• In large-scale archival systems, index lookup becomes a performance bottleneck
  • Index size increases
  • Query and update performance suffer
  • Network bandwidth is under-utilized
  • Turnaround time is not satisfactory
• Approaches used today:
  • Out-of-band deduplication
  • In-band deduplication
• Goal: partition the index while preserving efficient lookups
Partitioning the Chunk Index
• Goal: partition the chunk index I into multiple partitions I_0, …, I_{K-1}
• When adding a new file f_n to the archive:
  • Extract chunk signatures
  • Choose m partitions using the document routing algorithm (m: routing factor)
  • Route all the chunks to the m partitions, m < K
  • Add the chunk signatures to each of the m partitions
• When identifying redundancies within a file f_q:
  • Extract chunk signatures
  • Choose m partitions using the document routing algorithm
  • Route all the features to the m partitions, m < K
  • Query each of the m partitions to identify redundant chunks
• Goal: minimal loss of compression while keeping m much smaller than K (see the sketch below)
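A minimal sketch of the add/lookup flow above, assuming a simple min-hash-style routing function; the real document routing algorithm and per-partition data structures differ, and all names are illustrative.

    import hashlib

    K = 16                                    # number of index partitions
    partitions = [set() for _ in range(K)]    # each holds chunk signatures

    def chunk_signatures(chunks):
        return [hashlib.sha1(c).hexdigest() for c in chunks]

    def route(sigs, m):
        # Pick (up to) m of the K partitions from the file's smallest signatures,
        # so similar files tend to route to the same partitions.
        return {int(s, 16) % K for s in sorted(sigs)[:m]}

    def add_file(chunks, m=2):
        sigs = chunk_signatures(chunks)
        for p in route(sigs, m):
            partitions[p].update(sigs)        # add signatures to each chosen partition

    def find_redundant(chunks, m=2):
        sigs = chunk_signatures(chunks)
        hits = set()
        for p in route(sigs, m):              # query only m partitions, m < K
            hits |= partitions[p] & set(sigs)
        return hits                           # chunks that are already indexed

Because only m of K partitions are consulted per file, some duplicates can be missed; the design goal stated above is to keep that loss of compression minimal while the index scales out.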
VM Deduplication
• Huge opportunity for saving space in virtualized environments
• Most efficient with homogeneous VMs: more images, more savings
• Roughly half the space is still saved even with heterogeneous VMs
Collection of Ubuntu VMs
Secure deduplication
• Naive deduplication of encrypted data cannot work
• Solution: deduplicate the data first!
  • But then how do you encrypt?
  • Encrypt so that if you know the data that was deduplicated you can reproduce the encryption; otherwise it is impossible to guess
  • Solution: convergent encryption of chunks (with appropriate data structures), sketched after the figure
[Figure: a file is chunked; each chunk is hashed to derive its key and chunk ID, encrypted with that key, and recorded in the chunk map]
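A minimal sketch of convergent encryption as used here: the per-chunk key is derived from the chunk’s own content, so identical plaintext chunks produce identical ciphertext and still deduplicate. A SHA-256 keystream stands in for a real cipher (e.g., AES) to keep the sketch dependency-free; names are illustrative.

    import hashlib

    def keystream(key: bytes, length: int) -> bytes:
        out, counter = b"", 0
        while len(out) < length:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def convergent_encrypt(chunk: bytes):
        key = hashlib.sha256(chunk).digest()          # key = H(plaintext chunk)
        chunk_id = hashlib.sha256(key).hexdigest()    # ID used for dedup lookup
        cipher = bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))
        return chunk_id, cipher, key                  # key is kept in the chunk map

    # Identical chunks always yield identical (chunk_id, ciphertext) pairs:
    assert convergent_encrypt(b"same data") == convergent_encrypt(b"same data")

Only someone who already knows the plaintext chunk can derive its key, which is exactly the property stated on the slide above.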
Secure file systems
• Scalable security for Ceph
  • At most one “ticket” per file, regardless of how many clients and OSDs
  • Public-key operations amortized across many uses
• Quotas in Ceph
  • Ensure usage stays within limits at near-zero cost, with 100% catching of cheaters
• Remote verification of stored data
  • Verify consistency of stored erasure-coded data
  • Secure against collusion by all remote servers
Scalable Security in Ceph
• Extended capabilities
  • Can authorize I/O for any number of clients to any number of whole files
  • Reduce the number of capabilities
• Automatic revocation
  • Capability expiration ⇔ capability revocation
  • Revocation without any contact
• Secure delegation
  • Delegate access rights to other clients
  • Shift security to key possession
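A minimal sketch of the extended-capability idea, assuming a symmetric MAC shared by the metadata server and OSDs purely for brevity (the slide’s amortized public-key signatures would take its place); field names are illustrative and the actual Ceph token format differs.

    import hashlib, hmac, json, time

    MDS_OSD_SECRET = b"placeholder-shared-secret"   # assumed MDS/OSD key

    def issue_capability(clients, files, rights, lifetime_s=300):
        """One token authorizes many clients to many whole files; its expiry
        doubles as revocation, so no revocation contact is needed."""
        body = {"clients": sorted(clients), "files": sorted(files),
                "rights": sorted(rights), "expires": int(time.time()) + lifetime_s}
        raw = json.dumps(body, sort_keys=True).encode()
        tag = hmac.new(MDS_OSD_SECRET, raw, hashlib.sha256).hexdigest()
        return body, tag

    def osd_check(body, tag, client, filename, right):
        raw = json.dumps(body, sort_keys=True).encode()
        good = hmac.new(MDS_OSD_SECRET, raw, hashlib.sha256).hexdigest()
        return (hmac.compare_digest(tag, good)
                and time.time() < body["expires"]
                and client in body["clients"]
                and filename in body["files"]
                and right in body["rights"])

    cap, tag = issue_capability({"client1", "client2"}, {"/data/a", "/data/b"}, {"read"})
    print(osd_check(cap, tag, "client1", "/data/a", "read"))   # True until expiry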
Quotas in Ceph
• Separate allocation and quota management using a digital-cash-based system model
  • The quota management server acts as a bank
• Clients withdraw vouchers from the quota server for a user and store them for later use
• Clients spend vouchers for users in order to purchase storage from storage servers
• Storage servers periodically update the quota server about user storage
  • Cheaters are caught by the bank at defined intervals
• Vouchers are cryptographically protected byte sequences {epoch, expiry, user, amount, serial}_auth
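A minimal sketch of the voucher flow, assuming an HMAC from the quota server serves as the auth field; the real authenticator and reconciliation protocol may differ, and all names are illustrative.

    import hashlib, hmac, itertools, json, time

    BANK_KEY = b"quota-server-key"        # assumed secret held by the quota server
    _serial = itertools.count(1)

    def withdraw(user: str, amount: int, epoch: int, lifetime_s: int = 3600):
        """Client withdraws a voucher {epoch, expiry, user, amount, serial}_auth."""
        body = {"epoch": epoch, "expiry": int(time.time()) + lifetime_s,
                "user": user, "amount": amount, "serial": next(_serial)}
        raw = json.dumps(body, sort_keys=True).encode()
        return {**body, "auth": hmac.new(BANK_KEY, raw, hashlib.sha256).hexdigest()}

    def verify(voucher) -> bool:
        body = {k: v for k, v in voucher.items() if k != "auth"}
        raw = json.dumps(body, sort_keys=True).encode()
        good = hmac.new(BANK_KEY, raw, hashlib.sha256).hexdigest()
        return hmac.compare_digest(good, voucher["auth"])

    def reconcile(spent_vouchers) -> set:
        """Bank-side check at defined intervals: forged or double-spent vouchers
        identify cheaters."""
        cheaters, seen = set(), set()
        for v in spent_vouchers:
            if not verify(v) or v["serial"] in seen:
                cheaters.add(v["user"])
            seen.add(v["serial"])
        return cheaters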
Store, Forget and Check
• Systems store data on remote nodes
  • Remote nodes may not be trustworthy
  • The data owner must check to ensure that data is really stored
• Two current approaches:
  • Read data from multiple sites and check for consistency
  • Generate a checksum remotely and compare it to the checksum of local data
• We developed an efficient algorithm that does not require keeping a local copy of the data
• A storage utility provides remotely managed storage
  • The client sends data to the SSP, then retrieves it as needed
• Trust issue: how can the client tell if the SSP is doing its job?
  • Read data, check a (public-key-based) signature
  • Read data, decrypt, check the secure hash and object ID
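A minimal sketch of the second check style listed above (secure hash plus object ID), assuming the client keeps only a small secret key and the object ID rather than a copy of the data; the SSRC algorithm itself is more efficient than this read-everything check, and names are illustrative.

    import hashlib, hmac

    CLIENT_KEY = b"client-secret"   # assumed key known only to the client

    def store(object_id: bytes, data: bytes) -> bytes:
        """Bind the data to its object ID before handing it to the SSP."""
        tag = hmac.new(CLIENT_KEY, object_id + data, hashlib.sha256).digest()
        return data + tag

    def check(object_id: bytes, blob: bytes) -> bool:
        """Verify what the SSP returns without any local copy of the data."""
        data, tag = blob[:-32], blob[-32:]
        good = hmac.new(CLIENT_KEY, object_id + data, hashlib.sha256).digest()
        return hmac.compare_digest(tag, good)

    blob = store(b"obj-42", b"payload")
    assert check(b"obj-42", blob)                    # honest SSP
    assert not check(b"obj-42", b"x" * len(blob))    # corrupted or substituted data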
Reliable Storage (Advanced use of erasure codes)
• Cannot treat erasure codes as a black box!
  • Must do more than choose m and n
  • Code choice affects performance, reliability, and power
• Goal: smart application of codes in emerging storage systems
  • NVRAM-based storage systems
  • Arrays of heterogeneous devices
  • Power-managed archives
  • Secure storage systems
Surviving disasters in storage
• Storage systems need to survive failures
  • CPUs / networks / power supplies fail: no data loss (usually)
  • Drives fail: data loss!
• Typical solution: use RAID
  • RAID doesn’t provide enough reliability for extreme cases (multiple failures)
  • RAID configurations that can survive multiple failures are often slow
• Our solution: disaster recovery codes
  • Handle the common case (single failure) quickly
  • Handle the uncommon case (multiple failures) correctly but somewhat more slowly
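A minimal sketch of the fast common-case path: a single XOR parity (which could live on an NVRAM device, as on the next slide) rebuilds any one failed drive, while multiple failures would fall back to a slower, stronger mechanism. Data and device names are illustrative.

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    drives = [b"data-d0!", b"data-d1!", b"data-d2!"]   # equal-sized drive contents
    parity = reduce(xor, drives)                        # one extra parity "disk"

    # Common case: one drive is lost; rebuild it from the survivors plus parity.
    lost = 1
    survivors = [d for i, d in enumerate(drives) if i != lost]
    assert reduce(xor, survivors + [parity]) == drives[lost]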
How effective is this?
• Just a single NVRAM parity “disk” reduces the chance of data loss dramatically
• One disk of NVRAM in a system with hundreds of disks isn’t that expensive...
[Plot: “Data Loss Probabilities for Mirroring and Single NVRAM Parity” — probability of data loss (0–1) vs. number of OSD failures (0–80), comparing the No_Parity and Single_Parity configurations]
Using non-volatile memory technologies
• File systems for byte-addressable NVRAM
  • Highly compressible metadata: 3× regular file systems
  • High performance: compressed metadata is faster
  • Rich linking structures
• Reliable file systems in NVRAM
  • Algebraic signatures ensure 64–128 B blocks are correct
  • Parity (Reed-Solomon) provides redundancy for sets of blocks
  • Necessary given the relatively low reliability of NVRAM
  • Protects against software errors as well (in conjunction with good file system design)
• View-based file systems
Reliable NVRAM arrays
• Flash memory is unreliable
  • Relatively high error rates
  • More reads ➔ higher error rates
  • More writes ➔ higher error rates
• How can reliable systems rely on flash?
• Use multi-level redundancy for arrays of flash memory
  • Choose codes at each level knowing that higher levels are present to correct errors
  • Low levels detect errors (and correct a few)
  • Higher levels correct erasures
• Tradeoffs in performance and reliability
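A minimal sketch of the multi-level idea, assuming a truncated SHA-256 checksum as the low-level detecting code and XOR parity across devices as the higher-level erasure-correcting code; a real design would use proper ECC, and all names are illustrative.

    import hashlib
    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def write_page(data: bytes) -> bytes:
        return data + hashlib.sha256(data).digest()[:4]      # low level: detect

    def read_page(page: bytes):
        data, csum = page[:-4], page[-4:]
        return data if hashlib.sha256(data).digest()[:4] == csum else None  # erasure

    pages = [write_page(b"dev%d-data" % i) for i in range(4)]
    parity = reduce(xor, [p[:-4] for p in pages])             # high level: correct

    pages[2] = pages[2][:3] + b"\x00" + pages[2][4:]          # simulate a bit error
    reads = [read_page(p) for p in pages]
    survivors = [d for d in reads if d is not None]
    recovered = reduce(xor, survivors + [parity])             # rebuild the erasure
    assert recovered == b"dev2-data"

Because the low level only has to detect (turning errors into erasures), the higher level can use a simpler, cheaper correcting code than it would need on its own.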
Distributed Indexing
• Storage systems are rapidly increasing in size
  • Petabytes of data, billions of files, thousands of users
  • Too large to manage with existing tools!
• Need a scalable way to find and access files
  • Users spend too much time organizing their data
  • Administrators need scalable tools to manage systems
• Search is an emerging solution
  • Common on desktops (Spotlight, WinFS)
  • However, these solutions cannot scale to large systems
• Need scalable search for large storage systems
Spyglass: Scalable Distributed Search
• Designed to allow fast search and indexing across a data center and its entire history!
• Optimized for non-selective, hierarchical data sets
• Storage nodes partition the namespace into individual indexes
  • Preserves locality of data properties, exploits the hierarchical namespace
• The index is a K-D tree, a k-dimensional search tree
  • Fast k-dimensional search for non-selective data
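A minimal sketch of the namespace-partitioned query path, assuming a fixed-depth subtree split and simple per-partition lists standing in for Spyglass’s K-D trees; all names and parameters are illustrative.

    from collections import defaultdict

    partitions = defaultdict(list)   # subtree root -> list of (path, size, mtime)

    def subtree(path: str, depth: int = 2) -> str:
        """Map a path to its enclosing subtree partition, e.g. "/home/alice"."""
        return "/".join(path.split("/")[:depth + 1])

    def index_file(path: str, size: int, mtime: int):
        partitions[subtree(path)].append((path, size, mtime))

    def query(root: str, min_size: int, since: int):
        """Search only partitions under `root`, preserving namespace locality."""
        return [entry for key, files in partitions.items() if key.startswith(root)
                for entry in files
                if entry[1] >= min_size and entry[2] >= since]

    index_file("/home/alice/paper.tex", 40_000, 1_200)
    index_file("/home/bob/data.bin", 9_000_000, 1_500)
    print(query("/home/bob", min_size=1_000_000, since=1_000))

Partitioning by subtree keeps files with related metadata in the same index, so most queries touch only a few partitions instead of a single global index.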
Conclusions
• The SSRC is a very active research group
  • Lots of interesting projects
  • Many collaborators from academia, industry, and the national laboratories
  • Several students graduating each year
• We welcome your involvement!
  • Graduate student internships (get them early)
  • Recruiting researchers (employees)
  • Sponsoring research
  • Visiting professors
  • Visitors from your laboratory