
Page 1: Web20expo Filesystems

Beyond the File System

Designing Large Scale File Storage and Serving

Cal Henderson

Page 2: Web20expo Filesystems


Hello!

Page 3: Web20expo Filesystems


Big file systems?

• Too vague!

• What is a file system?

• What constitutes big?

• Some requirements would be nice

Page 4: Web20expo Filesystems


1. Scalable – Looking at storage and serving infrastructures

Page 5: Web20expo Filesystems


2. Reliable – Looking at redundancy, failure rates, on-the-fly changes

Page 6: Web20expo Filesystems


3. Cheap – Looking at upfront costs, TCO and lifetimes

Page 7: Web20expo Filesystems


Four buckets

Storage

Serving

BCP

Cost

Page 8: Web20expo Filesystems


Storage

Page 9: Web20expo Filesystems


The storage stack

• File protocol – NFS, CIFS, SMB

• File system – ext, reiserFS, NTFS

• Block protocol – SCSI, SATA, FC

• RAID – Mirrors, Stripes

• Hardware – Disks and stuff

Page 10: Web20expo Filesystems


Hardware overview

The storage scale, lower to higher: Internal, DAS, SAN, NAS

Page 11: Web20expo Filesystems


Internal storage

• A disk in a computer– SCSI, IDE, SATA

• 4 disks in 1U is common

• 8 for half depth boxes

Page 12: Web20expo Filesystems


DAS

Direct attached storage

Disk shelf, connected by SCSI/SATA

HP MSA30 – 14 disks in 3U

Page 13: Web20expo Filesystems


SAN

• Storage Area Network

• Dumb disk shelves

• Clients connect via a ‘fabric’

• Fibre Channel, iSCSI, InfiniBand – low-level protocols

Page 14: Web20expo Filesystems


NAS

• Network Attached Storage

• Intelligent disk shelf

• Clients connect via a network

• NFS, SMB, CIFS – high-level protocols

Page 15: Web20expo Filesystems


Of course, it’s more confusing than that

Page 16: Web20expo Filesystems


Meet the LUN

• Logical Unit Number

• A slice of storage space

• Originally for addressing a single drive:
  – c1t2d3
  – Controller, Target, Disk (Slice)

• Now means a virtual partition/volume
  – LVM, Logical Volume Management

Page 17: Web20expo Filesystems


NAS vs SAN

With a SAN, a single host (initiator) owns a single LUN/volume

With NAS, multiple hosts own a single LUN/volume

NAS head – NAS access to a SAN

Page 18: Web20expo Filesystems


SAN Advantages

Virtualization within a SAN offers some nice features:

• Real-time LUN replication

• Transparent backup

• SAN booting for host replacement

Page 19: Web20expo Filesystems


Some Practical Examples

• There are a lot of vendors

• Configurations vary

• Prices vary wildly

• Let’s look at a couple– Ones I happen to have experience with– Not an endorsement ;)

Page 20: Web20expo Filesystems


NetApp Filers

Heads and shelves, up to 500TB in 6 Cabs

FC SAN with 1 or 2 NAS heads

Page 21: Web20expo Filesystems


Isilon IQ

• 2U Nodes, 3-96 nodes/cluster, 6-600 TB

• FC/InfiniBand SAN with NAS head on each node

Page 22: Web20expo Filesystems


Scaling

Vertical vs Horizontal

Page 23: Web20expo Filesystems


Vertical scaling

• Get a bigger box

• Bigger disk(s)

• More disks

• Limited by current tech – size of each disk and total number in appliance

Page 24: Web20expo Filesystems


Horizontal scaling

• Buy more boxes

• Add more servers/appliances

• Scales forever*

*sort of

Page 25: Web20expo Filesystems


Storage scaling approaches

• Four common models:

• Huge FS

• Physical nodes

• Virtual nodes

• Chunked space

Page 26: Web20expo Filesystems


Huge FS

• Create one giant volume with growing space– Sun’s ZFS– Isilon IQ

• Expandable on-the-fly?

• Upper limits– Always limited somewhere

Page 27: Web20expo Filesystems


Huge FS

• Pluses
  – Simple from the application side
  – Logically simple
  – Low administrative overhead

• Minuses
  – All your eggs in one basket
  – Hard to expand
  – Has an upper limit

Page 28: Web20expo Filesystems


Physical nodes

• Application handles distribution to multiple physical nodes– Disks, Boxes, Appliances, whatever

• One ‘volume’ per node

• Each node acts by itself

• Expandable on-the-fly – add more nodes

• Scales forever

Page 29: Web20expo Filesystems


Physical Nodes

• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once

• Minuses
  – Many ‘mounts’ to manage
  – More administration

Page 30: Web20expo Filesystems


Virtual nodes

• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes

• Multiple volumes per node

• Flexible

• Expandable on-the-fly – add more nodes

• Scales forever
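
To make the virtual-node idea concrete, here is a minimal sketch (hostnames, volume numbers and the naming scheme are all invented, not from the talk): the application addresses a virtual volume number, and a separate map records which physical node currently hosts that volume, so volumes can be moved or consolidated without rewriting file metadata.

```python
import random

# Hypothetical mapping: virtual volume id -> physical node that currently hosts it.
# Moving a volume to new hardware only means updating this map.
VOLUME_TO_NODE = {
    1: "storage01.example.com",
    2: "storage01.example.com",   # a node can host many virtual volumes
    3: "storage02.example.com",
    4: "storage03.example.com",
}

# Volumes still accepting new files (e.g. ones with free space).
WRITABLE_VOLUMES = [3, 4]

def store_file(file_id: int) -> tuple[int, str]:
    """Pick a writable virtual volume and return (volume, path) for the app to record."""
    volume = random.choice(WRITABLE_VOLUMES)
    path = f"/vol{volume:04d}/{file_id}.dat"
    return volume, path

def url_for(volume: int, path: str) -> str:
    """Resolve the virtual volume to whichever physical node holds it right now."""
    node = VOLUME_TO_NODE[volume]
    return f"http://{node}{path}"

if __name__ == "__main__":
    vol, path = store_file(12345)
    print("stored on virtual volume", vol, "at", url_for(vol, path))
```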

Page 31: Web20expo Filesystems


Virtual Nodes

• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
  – Addressing is logical, not physical
  – Flexible volume sizing, consolidation

• Minuses
  – Many ‘mounts’ to manage
  – More administration

Page 32: Web20expo Filesystems


Chunked space

• Storage layer writes parts of files to different physical nodes

• A higher-level RAID striping

• High performance for large files– read multiple parts simultaneously

Page 33: Web20expo Filesystems


Chunked space

• Pluses
  – High performance
  – Limitless size

• Minuses
  – Conceptually complex
  – Can be hard to expand on the fly
  – Can’t manually poke it

Page 34: Web20expo Filesystems


Real Life

Case Studies

Page 35: Web20expo Filesystems


GFS – Google File System

• Developed by … Google

• Proprietary

• Everything we know about it is based on talks they’ve given

• Designed to store huge files for fast access

Page 36: Web20expo Filesystems


GFS – Google File System

• Single ‘Master’ node holds metadata
  – A single point of failure (SPF) – a shadow master allows warm swap

• Grid of ‘chunkservers’
  – 64-bit filenames
  – 64 MB file chunks

Page 37: Web20expo Filesystems


GFS – Google File System

[Diagram: the master node and a grid of chunkservers holding replicated chunks – e.g. chunk 1 replicas (a) and (b), chunk 2 replica (a)]

Page 38: Web20expo Filesystems


GFS – Google File System

• Client reads metadata from master then file parts from multiple chunkservers

• Designed for big files (>100MB)

• Master server allocates access leases

• Replication is automatic and self-repairing
  – Synchronously, for atomicity
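
A rough sketch of that read path, with invented names and a stand-in for the master lookup (GFS is proprietary, so this only illustrates the shape of the protocol): the client turns a byte offset into a chunk index, asks the master where that chunk's replicas live, then reads the bytes from one of the chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described in the talk

# Hypothetical replica map the master would hold: (file, chunk index) -> chunkservers.
CHUNK_LOCATIONS = {
    ("/logs/2007-04-17.log", 0): ["chunkserver-a", "chunkserver-b", "chunkserver-c"],
    ("/logs/2007-04-17.log", 1): ["chunkserver-b", "chunkserver-d", "chunkserver-e"],
}

def master_lookup(path: str, chunk_index: int) -> list[str]:
    """Stand-in for asking the master node which chunkservers hold a chunk."""
    return CHUNK_LOCATIONS[(path, chunk_index)]

def fetch_from_chunkserver(server, path, chunk_index, chunk_offset, span) -> bytes:
    """Placeholder for the actual network read against one chunkserver."""
    return b"\x00" * span

def read(path: str, offset: int, length: int) -> bytes:
    """Read a byte range by fetching the chunks that cover it from chunkservers."""
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE
        chunk_offset = offset % CHUNK_SIZE
        span = min(length, CHUNK_SIZE - chunk_offset)
        replicas = master_lookup(path, chunk_index)   # metadata comes from the master
        # bulk data comes from one of the chunkservers, not the master
        data += fetch_from_chunkserver(replicas[0], path, chunk_index, chunk_offset, span)
        offset += span
        length -= span
    return data
```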

Page 39: Web20expo Filesystems


GFS – Google File System

• Reading is fast (parallelizable)– But requires a lease

• Master server is required for all reads and writes

Page 40: Web20expo Filesystems


MogileFS – OMG Files

• Developed by Danga / SixApart

• Open source

• Designed for scalable web app storage

Page 41: Web20expo Filesystems


MogileFS – OMG Files

• Single metadata store (MySQL)– MySQL Cluster avoids SPF

• Multiple ‘tracker’ nodes locate files

• Multiple ‘storage’ nodes store files

Page 42: Web20expo Filesystems


MogileFS – OMG Files

[Diagram: multiple tracker nodes backed by a shared MySQL metadata store]

Page 43: Web20expo Filesystems


MogileFS – OMG Files

• Replication of file ‘classes’ happens transparently

• Storage nodes are not mirrored – replication is piecemeal

• Reading and writing go through trackers, but are performed directly upon storage nodes
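
A hedged sketch of that flow (the helper names and URLs are invented, not the real MogileFS client API): the application asks a tracker where a key's replicas live, then reads the file over plain HTTP directly from a storage node.

```python
import urllib.request

TRACKERS = ["tracker01:7001", "tracker02:7001"]  # hypothetical tracker addresses

def ask_tracker_for_paths(key: str) -> list[str]:
    """Stand-in for the tracker query: returns HTTP URLs of replicas on storage nodes."""
    # In reality this is a small request/response protocol spoken to a tracker;
    # here we just pretend the tracker answered with two replica locations.
    return [
        f"http://storage01:7500/dev1/0/000/123/{key}.fid",
        f"http://storage03:7500/dev7/0/000/123/{key}.fid",
    ]

def read_file(key: str) -> bytes:
    """Reads go through a tracker for lookup, then straight to a storage node."""
    for url in ask_tracker_for_paths(key):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except OSError:
            continue  # replica unreachable, fall back to the next path
    raise IOError(f"no reachable replica for {key}")
```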

Page 44: Web20expo Filesystems


Flickr File System

• Developed by Flickr

• Proprietary

• Designed for very large scalable web app storage

Page 45: Web20expo Filesystems


Flickr File System

• No metadata store– Deal with it yourself

• Multiple ‘StorageMaster’ nodes

• Multiple storage nodes with virtual volumes

Page 46: Web20expo Filesystems


Flickr File System

[Diagram: multiple StorageMaster (SM) nodes]

Page 47: Web20expo Filesystems


Flickr File System

• Metadata stored by app– Just a virtual volume number– App chooses a path

• Virtual nodes are mirrored– Locally and remotely

• Reading is done directly from nodes
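
A small sketch of what 'metadata stored by the app' might look like, with invented column names and hosts: the application's own database row carries just a virtual volume number and a path, and serving the file means resolving that volume to one of its mirrored nodes.

```python
# Hypothetical app-side record: the storage layer keeps no metadata of its own.
photo_row = {
    "photo_id": 31337,
    "volume": 42,                    # virtual volume number chosen at upload time
    "path": "/31337/original.jpg",   # path chosen by the web app
}

# Hypothetical map of virtual volumes to their mirrored nodes (local + remote copy).
VOLUME_MIRRORS = {
    42: ["storage12.dc1.example.com", "storage03.dc2.example.com"],
}

def read_url(row: dict) -> str:
    """Reads go directly to a node holding the volume; no StorageMaster involved."""
    node = VOLUME_MIRRORS[row["volume"]][0]
    return f"http://{node}/vol{row['volume']}{row['path']}"

print(read_url(photo_row))
```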

Page 48: Web20expo Filesystems


Flickr File System

• StorageMaster nodes only used for write operations

• Reading and writing can scale separately

Page 49: Web20expo Filesystems


Amazon S3

• A big disk in the sky

• Multiple ‘buckets’

• Files have user-defined keys

• Data + metadata

Page 50: Web20expo Filesystems


Amazon S3

[Diagram: your servers and Amazon S3]

Page 51: Web20expo Filesystems


Amazon S3

[Diagram: your servers, Amazon S3, and end users]

Page 52: Web20expo Filesystems


The cost

• Fixed price, by the GB

• Store: $0.15 per GB per month

• Serve: $0.20 per GB

Page 53: Web20expo Filesystems


The cost

[Chart: S3 cost]

Page 54: Web20expo Filesystems


The cost

[Chart: S3 cost vs. regular bandwidth cost]

Page 55: Web20expo Filesystems


End costs

• ~$2k to store 1TB for a year

• ~$63 a month for 1Mb/s of transfer

• ~$65k a month for 1Gb/s of transfer
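
A quick back-of-the-envelope check of those figures at the listed prices ($0.15 per GB-month stored, $0.20 per GB served), assuming a 30-day month of sustained transfer; it lands near, not exactly on, the numbers above.

```python
STORE_PER_GB_MONTH = 0.15   # $ per GB stored per month
SERVE_PER_GB = 0.20         # $ per GB transferred out

# Storing 1 TB for a year
print(1000 * STORE_PER_GB_MONTH * 12)               # -> 1800.0, i.e. ~$2k/year

# Serving a sustained 1 Mb/s and 1 Gb/s for a 30-day month
seconds_per_month = 30 * 24 * 3600
gb_at_1mbps = (1e6 / 8) * seconds_per_month / 1e9   # bits/s -> GB transferred per month
print(gb_at_1mbps * SERVE_PER_GB)                   # -> ~$65/month for 1 Mb/s
print(gb_at_1mbps * 1000 * SERVE_PER_GB)            # -> ~$65,000/month for 1 Gb/s
```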

Page 56: Web20expo Filesystems


Serving

Page 57: Web20expo Filesystems


Serving files

Serving files is easy!

[Diagram: one Apache server reading from one disk]

Page 58: Web20expo Filesystems


Serving files

Scaling is harder

[Diagram: many Apache-plus-disk pairs side by side]

Page 59: Web20expo Filesystems


Serving files

• This doesn’t scale well

• Primary storage is expensive– And takes a lot of space

• In many systems, we only access a small number of files most of the time

Page 60: Web20expo Filesystems


Caching

• Insert caches between the storage and serving nodes

• Cache frequently accessed content to reduce reads on the storage nodes

• Software (Squid, mod_cache)

• Hardware (Netcache, Cacheflow)

Page 61: Web20expo Filesystems


Why it works

• Keep a smaller working set

• Use faster hardware
  – Lots of RAM
  – SCSI
  – Outer edge of disks (ZCAV)

• Use more duplicates– Cheaper, since they’re smaller

Page 62: Web20expo Filesystems


Two models

• Layer 4
  – ‘Simple’ balanced cache
  – Objects in multiple caches
  – Good for few objects requested many times

• Layer 7
  – URL-balanced cache
  – Objects in a single cache
  – Good for many objects requested a few times
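
A toy illustration of the difference, using invented cache names: a layer-4 balancer picks a cache without looking at the request, so hot objects end up in every cache, while a layer-7 balancer hashes the URL so each object lives in exactly one cache.

```python
import hashlib
import itertools

CACHES = ["cache01", "cache02", "cache03", "cache04"]
_round_robin = itertools.cycle(CACHES)

def pick_cache_layer4(url: str) -> str:
    """Layer 4: balancer ignores the URL; the same object may end up cached everywhere."""
    return next(_round_robin)

def pick_cache_layer7(url: str) -> str:
    """Layer 7: balance on the URL, so each object maps to a single cache."""
    digest = hashlib.md5(url.encode()).hexdigest()
    return CACHES[int(digest, 16) % len(CACHES)]

for _ in range(3):
    print(pick_cache_layer4("/photos/123.jpg"), pick_cache_layer7("/photos/123.jpg"))
```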

Page 63: Web20expo Filesystems


Replacement policies

• LRU – Least recently used

• GDSF – Greedy dual size frequency

• LFUDA – Least frequently used with dynamic aging

• All have advantages and disadvantages

• Performance varies greatly with each
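
For a concrete example, here is the simplest of the three, LRU, sketched with an OrderedDict; GDSF and LFUDA also weigh object size and access frequency, which is why the three can behave so differently on the same traffic.

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used eviction: on overflow, drop what was touched longest ago."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry
```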

Page 64: Web20expo Filesystems


Cache Churn

• How long do objects typically stay in cache?

• If it gets too short, we’re doing badly– But it depends on your traffic profile

• Make the cached object store larger

Page 65: Web20expo Filesystems


Problems

• Caching has some problems:

– Invalidation is hard– Replacement is dumb (even LFUDA)

• Avoiding caching makes your life (somewhat) easier

Page 66: Web20expo Filesystems


CDN – Content Delivery Network

• Akamai, Savvis, Mirror Image Internet, etc

• Caches operated by other people– Already in-place– In lots of places

• GSLB/DNS balancing

Page 67: Web20expo Filesystems


Edge networks

[Diagram: a single origin server]

Page 68: Web20expo Filesystems


Edge networks

[Diagram: an origin server surrounded by many edge caches]

Page 69: Web20expo Filesystems


CDN Models

• Simple model
  – You push content to them, they serve it

• Reverse proxy model
  – You publish content on an origin, they proxy and cache it

Page 70: Web20expo Filesystems


CDN Invalidation

• You don’t control the caches– Just like those awful ISP ones

• Once something is cached by a CDN, assume it can never change– Nothing can be deleted– Nothing can be modified

Page 71: Web20expo Filesystems


Versioning

• When you start to cache things, you need to care about versioning

– Invalidation & Expiry– Naming & Sync

Page 72: Web20expo Filesystems


Cache Invalidation

• If you control the caches, invalidation is possible

• But remember ISP and client caches

• Remove deleted content explicitly– Avoid users finding old content– Save cache space

Page 73: Web20expo Filesystems


Cache versioning

• Simple rule of thumb:– If an item is modified, change its name (URL)

• This can be independent of the file system!

Page 74: Web20expo Filesystems


Virtual versioning

• Database indicates version 3 of file

• Web app writes version number into URL

• Request comes through cache and is cached with the versioned URL

• mod_rewrite converts versioned URL to path

Example: version 3 → example.com/foo_3.jpg → cached as foo_3.jpg → rewritten on disk: foo_3.jpg -> foo.jpg
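
The same flow in a short sketch (names are illustrative): the web app builds the versioned URL from whatever version the database holds, and the origin strips the version off again, which is the job the mod_rewrite rule does in the example above.

```python
import re

def public_url(name: str, version: int) -> str:
    """Web app: embed the current version in the URL the cache will key on."""
    return f"http://example.com/{name}_{version}.jpg"

def path_on_disk(request_path: str) -> str:
    """Origin: strip the version so every versioned URL hits the same file on disk,
    equivalent to a rewrite of foo_3.jpg -> foo.jpg."""
    return re.sub(r"_\d+(\.jpg)$", r"\1", request_path)

print(public_url("foo", 3))         # http://example.com/foo_3.jpg
print(path_on_disk("/foo_3.jpg"))   # /foo.jpg
```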

Page 75: Web20expo Filesystems


Authentication

• Authentication inline layer– Apache / perlbal

• Authentication sideline– ICP (CARP/HTCP)

• Authentication by URL– FlickrFS

Page 76: Web20expo Filesystems


Auth layer

• Authenticator sits between client and storage

• Typically built into the cache software

[Diagram: cache with a built-in authenticator, in front of the origin]

Page 77: Web20expo Filesystems


Auth sideline

• Authenticator sits beside the cache

• Lightweight protocol used for authenticator

[Diagram: authenticator sitting beside the cache, which fronts the origin]

Page 78: Web20expo Filesystems


Auth by URL

• Someone else performs authentication and gives URLs to client (typically the web app)

• URLs hold the ‘keys’ for accessing files

[Diagram: web server, cache, and origin]
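
One common way to put 'keys' in the URL is an expiring HMAC signature. This is only an illustrative sketch, not how FlickrFS actually does it: the web app signs the path and an expiry time, and the cache or origin recomputes the signature before serving.

```python
import hashlib
import hmac
import time

SECRET = b"shared-secret-between-web-app-and-cache"   # hypothetical shared key

def sign_url(path: str, ttl: int = 3600) -> str:
    """Web app: hand the client a URL that proves it may fetch this file, for a while."""
    expires = int(time.time()) + ttl
    message = f"{path}:{expires}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={signature}"

def check_request(path: str, expires: int, sig: str) -> bool:
    """Cache/origin: recompute the signature and reject expired or forged URLs."""
    if expires < time.time():
        return False
    message = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```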

Page 79: Web20expo Filesystems


BCP

Page 80: Web20expo Filesystems


Business Continuity Planning

• How can I deal with the unexpected?– The core of BCP

• Redundancy

• Replication

Page 81: Web20expo Filesystems


Reality

• On a long enough timescale, anything that can fail, will fail

• Of course, everything can fail

• True reliability comes only through redundancy

Page 82: Web20expo Filesystems


Reality

• Define your own SLAs

• How long can you afford to be down?

• How manual is the recovery process?

• How far can you roll back?

• How many $node boxes can fail at once?

Page 83: Web20expo Filesystems


Failure scenarios

• Disk failure

• Storage array failure

• Storage head failure

• Fabric failure

• Metadata node failure

• Power outage

• Routing outage

Page 84: Web20expo Filesystems


Reliable by design

• RAID avoids disk failures, but not head or fabric failures

• Duplicated nodes avoid host and fabric failures, but not routing or power failures

• Dual-colo avoids routing and power failures, but may need duplication too

Page 85: Web20expo Filesystems


Tend to all points in the stack

• Going dual-colo: great

• Taking a whole colo offline because of a single failed disk: bad

• We need a combination of these

Page 86: Web20expo Filesystems


Recovery times

• BCP is not just about continuing when things fail

• How can we restore after they come back?

• Host and colo level syncing– replication queuing

• Host and colo level rebuilding

Page 87: Web20expo Filesystems


Reliable Reads & Writes

• Reliable reads are easy– 2 or more copies of files

• Reliable writes are harder
  – Write 2 copies at once
  – But what do we do when we can’t write to one?

Page 88: Web20expo Filesystems


Dual writes

• Queue up data to be written
  – Where?
  – Needs itself to be reliable

• Queue up a journal of changes
  – And then read data from the disk whose write succeeded

• Duplicate whole volume after failure– Slow!
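
A sketch of the journalling option, with stand-ins for the storage and queue layers: try to write both copies, and if one side fails, record the gap somewhere (which must itself be reliable) so it can be replayed when the node returns.

```python
import queue

repair_journal = queue.Queue()  # (node, path) pairs; a real one must itself be durable

def write_to_node(node: str, path: str, data: bytes) -> bool:
    """Placeholder for the real network write; returns False on failure."""
    return True

def reliable_write(path: str, data: bytes, nodes=("storage-a", "storage-b")) -> None:
    """Write two copies; journal any copy we could not make instead of failing the write."""
    successes = []
    for node in nodes:
        if write_to_node(node, path, data):
            successes.append(node)
        else:
            repair_journal.put((node, path))   # replay later, reading from a good copy
    if not successes:
        raise IOError(f"could not write {path} to any node")
```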

Page 89: Web20expo Filesystems


Cost

Page 90: Web20expo Filesystems


Judging cost

• Per GB?

• Per GB upfront and per year

• Not as simple as you’d hope– How about an example

Page 91: Web20expo Filesystems


Hardware costs

Single cost = cost of hardware ÷ usable GB

Page 92: Web20expo Filesystems


Power costs

Recurring cost = cost of power per year ÷ usable GB

Page 93: Web20expo Filesystems


Power costs

Single cost = power installation cost ÷ usable GB

Page 94: Web20expo Filesystems


Space costs

Recurring cost = (cost per U × U’s needed, including network) ÷ usable GB

Page 95: Web20expo Filesystems


Network costs

Single cost = cost of network gear ÷ usable GB

Page 96: Web20expo Filesystems


Misc costs

Single & recurring costs = (support contracts + spare disks + bus adaptors + cables) ÷ usable GB

Page 97: Web20expo Filesystems


Human costs

Recurring cost = (admin cost per node × node count) ÷ usable GB

Page 98: Web20expo Filesystems


TCO

• Total cost of ownership in two parts
  – Upfront
  – Ongoing

• Architecture plays a huge part in costing
  – Don’t get tied to hardware
  – Allow heterogeneity
  – Move with the market
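
Pulling the per-GB fractions from the previous slides together, a toy cost model with entirely made-up numbers, split into the upfront and ongoing parts:

```python
usable_gb = 5000   # whatever survives after RAID and file system overhead (illustrative)

upfront = {                       # single costs, in $
    "hardware": 12000,
    "power installation": 800,
    "network gear": 1500,
    "misc (adaptors, cables, spares)": 900,
}
recurring_per_year = {            # ongoing costs, in $/year
    "power": 1100,
    "space (per-U rent)": 1600,
    "support contracts": 700,
    "admin time": 2500,
}

print("upfront $/GB:   ", round(sum(upfront.values()) / usable_gb, 3))
print("ongoing $/GB/yr:", round(sum(recurring_per_year.values()) / usable_gb, 3))
```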

Page 99: Web20expo Filesystems

(fin)

Page 100: Web20expo Filesystems


Photo credits

• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/

Page 101: Web20expo Filesystems


You can find these slides online:

iamcal.com/talks/