http://www.iamcal.com/talks/
Beyond the File System
Designing Large Scale File Storage and Serving
Cal Henderson
Web Builder 2.0 2
Hello!
Big file systems?
• Too vague!
• What is a file system?
• What constitutes big?
• Some requirements would be nice
1. Scalable: looking at storage and serving infrastructures
2. Reliable: looking at redundancy, failure rates, on-the-fly changes
3. Cheap: looking at upfront costs, TCO and lifetimes
Four buckets
Storage
Serving
BCP
Cost
Storage
The storage stack
• File protocol: NFS, CIFS, SMB
• File system: ext, reiserFS, NTFS
• Block protocol: SCSI, SATA, FC
• RAID: Mirrors, Stripes
• Hardware: Disks and stuff
Hardware overview
The storage scale, from lower to higher: Internal, DAS, SAN, NAS
Internal storage
• A disk in a computer
  – SCSI, IDE, SATA
• 4 disks in 1U is common
• 8 for half depth boxes
DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30 – 14 disks in 3U
SAN
• Storage Area Network
• Dumb disk shelves
• Clients connect via a ‘fabric’
• Fibre Channel, iSCSI, InfiniBand
  – Low level protocols
NAS
• Network Attached Storage
• Intelligent disk shelf
• Clients connect via a network
• NFS, SMB, CIFS
  – High level protocols
Of course, it’s more confusing than that
Meet the LUN
• Logical Unit Number
• A slice of storage space
• Originally for addressing a single drive:
  – c1t2d3
  – Controller, Target, Disk (Slice)
• Now means a virtual partition/volume
  – LVM, Logical Volume Management
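The c1t2d3 style of address can be unpacked mechanically. A minimal sketch, assuming names of the form cXtYdZ with an optional sN slice suffix; the helper name parse_ctd and the DiskAddress type are hypothetical, made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class DiskAddress(NamedTuple):
    controller: int
    target: int
    disk: int
    slice_no: Optional[int]  # None when the name carries no slice suffix


# Assumed naming scheme: c<controller>t<target>d<disk>, optionally s<slice>.
_CTD = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s(\d+))?$")


def parse_ctd(name: str) -> DiskAddress:
    """Split a cXtYdZ[sN] device name into its numeric parts."""
    match = _CTD.match(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ[sN] name: {name!r}")
    c, t, d, s = match.groups()
    return DiskAddress(int(c), int(t), int(d), int(s) if s is not None else None)
```

For example, parse_ctd("c1t2d3") gives controller 1, target 2, disk 3 and no slice.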
NAS vs SAN
With SAN, a single host (initiator) owns a single LUN/volume
With NAS, multiple hosts own a single LUN/volume
NAS head – NAS access to a SAN
SAN Advantages
Virtualization within a SAN offers some nice features:
• Real-time LUN replication
• Transparent backup
• SAN booting for host replacement
Some Practical Examples
• There are a lot of vendors
• Configurations vary
• Prices vary wildly
• Let’s look at a couple
  – Ones I happen to have experience with
  – Not an endorsement ;)
NetApp Filers
Heads and shelves, up to 500TB in 260U
FC SAN with 1 or 2 NAS heads
Isilon IQ
• 2U Nodes, 3-96 nodes/cluster, 6-600 TB
• FC/InfiniBand SAN with NAS head on each node
Scaling
Vertical vs Horizontal
Vertical scaling
• Get a bigger box
• Bigger disk(s)
• More disks
• Limited by current tech – size of each disk and total number in appliance
Horizontal scaling
• Buy more boxes
• Add more servers/appliances
• Scales forever*
*sort of
Storage scaling approaches
• Four common models:
• Huge FS
• Physical nodes
• Virtual nodes
• Chunked space
Huge FS
• Create one giant volume with growing space
  – Sun’s ZFS
  – Isilon IQ
• Expandable on-the-fly?
• Upper limits
  – Always limited somewhere
Huge FS
• Pluses
  – Simple from the application side
  – Logically simple
  – Low administrative overhead
• Minuses
  – All your eggs in one basket
  – Hard to expand
  – Has an upper limit
Physical nodes
• Application handles distribution to multiple physical nodes
  – Disks, Boxes, Appliances, whatever
• One ‘volume’ per node
• Each node acts by itself
• Expandable on-the-fly – add more nodes
• Scales forever
Physical Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
• Minuses
  – Many ‘mounts’ to manage
  – More administration
Virtual nodes
• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
• Multiple volumes per node
• Flexible
• Expandable on-the-fly – add more nodes
• Scales forever
Virtual Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
  – Addressing is logical, not physical
  – Flexible volume sizing, consolidation
• Minuses
  – Many ‘mounts’ to manage
  – More administration
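One way to picture "addressing is logical, not physical" is a lookup table from virtual volume to physical host. This is only a sketch under assumptions: the host names, the count of 8 volumes and the CRC-based placement are all made up for illustration, and an application could just as well store the chosen volume number alongside each file.

```python
import zlib
from typing import Dict, Tuple

VOLUME_COUNT = 8  # hypothetical: fixed number of virtual volumes

# Hypothetical layout: the 8 virtual volumes are spread over 2 physical hosts.
VOLUME_TO_HOST: Dict[int, str] = {
    v: f"storage{(v % 2) + 1}.example.internal" for v in range(VOLUME_COUNT)
}


def pick_volume(file_id: str) -> int:
    """Choose a virtual volume for a file; stable while VOLUME_COUNT is fixed."""
    return zlib.crc32(file_id.encode("utf-8")) % VOLUME_COUNT


def locate(file_id: str) -> Tuple[int, str]:
    """Return (virtual volume, physical host currently holding that volume)."""
    volume = pick_volume(file_id)
    return volume, VOLUME_TO_HOST[volume]


def move_volume(volume: int, new_host: str) -> None:
    """Re-home a volume: only the table changes, file addresses stay the same."""
    VOLUME_TO_HOST[volume] = new_host
```

Moving a volume to a new box changes one table entry; every file keeps its volume number, which is the flexibility the virtual model buys.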
Chunked space
• Storage layer writes parts of files to different physical nodes
• A higher-level RAID striping
• High performance for large files
  – Read multiple parts simultaneously
Chunked space
• Pluses
  – High performance
  – Limitless size
• Minuses
  – Conceptually complex
  – Can be hard to expand on the fly
  – Can’t manually poke it
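Chunking can be sketched as striping at the file level. The example below assumes simple round-robin placement and an arbitrary chunk size; the names stripe and reassemble are hypothetical, and real systems add replication and metadata that are left out here.

```python
from typing import List, Tuple

Placement = Tuple[int, str, bytes]  # (chunk index, node name, chunk data)


def stripe(data: bytes, nodes: List[str], chunk_size: int) -> List[Placement]:
    """Cut data into fixed-size chunks and assign them to nodes round-robin."""
    if chunk_size < 1 or not nodes:
        raise ValueError("need a positive chunk size and at least one node")
    placements = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        node = nodes[index % len(nodes)]
        placements.append((index, node, data[start:start + chunk_size]))
    return placements


def reassemble(placements: List[Placement]) -> bytes:
    """Rebuild the file; chunks may arrive in any order from parallel reads."""
    ordered = sorted(placements, key=lambda p: p[0])
    return b"".join(chunk for _, _, chunk in ordered)
```

Because each chunk lives on a different node, a reader can fetch them in parallel and join them by index.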
Real Life
Case Studies
GFS – Google File System
• Developed by … Google
• Proprietary
• Everything we know about it is based on talks they’ve given
• Designed to store huge files for fast access
GFS – Google File System
• Single ‘Master’ node holds metadata
  – A single point of failure (SPOF)
  – Shadow master allows warm swap
• Grid of ‘chunkservers’
  – 64-bit chunk handles
  – 64 MB file chunks
GFS – Google File System
[Diagram: a master node and chunkservers holding chunks 1(a), 1(b) and 2(a)]
GFS – Google File System
• Client reads metadata from master then file parts from multiple chunkservers
• Designed for big files (>100MB)
• Master server allocates access leases
• Replication is automatic and self repairing
  – Synchronously for atomicity
GFS – Google File System
• Reading is fast (parallelizable)
  – But requires a lease
• Master server is required for all reads and writes
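With fixed 64 MB chunks, working out which chunks a read touches is plain integer arithmetic. The sketch below is illustrative only and is not Google's client code; the function name chunks_for_range is hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted in the slides


def chunks_for_range(offset: int, length: int, chunk_size: int = CHUNK_SIZE) -> range:
    """Return the chunk indexes covered by bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return range(first, last + 1)
```

A 100 MB file read from the start spans chunks 0 and 1, so a client would ask the master about two chunks and could then fetch them from different chunkservers in parallel.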
MogileFS – OMG Files
• Developed by Danga / SixApart
• Open source
• Designed for scalable web app storage
MogileFS – OMG Files
• Single metadata store (MySQL)
  – MySQL Cluster avoids a single point of failure (SPOF)
• Multiple ‘tracker’ nodes locate files
• Multiple ‘storage’ nodes store files
MogileFS – OMG Files
[Diagram: two tracker nodes and a MySQL metadata store]
MogileFS – OMG Files
• Replication of file ‘classes’ happens transparently
• Storage nodes are not mirrored – replication is piecemeal
• Reading and writing go through trackers, but are performed directly upon storage nodes
Flickr File System
• Developed by Flickr
• Proprietary
• Designed for very large scalable web app storage
Flickr File System
• No metadata store
  – Deal with it yourself
• Multiple ‘StorageMaster’ nodes
• Multiple storage nodes with virtual volumes
Flickr File System
[Diagram: three StorageMaster (SM) nodes]
Flickr File System
• Metadata stored by app
  – Just a virtual volume number
  – App chooses a path
• Virtual nodes are mirrored
  – Locally and remotely
• Reading is done directly from nodes
Flickr File System
• StorageMaster nodes only used for write operations
• Reading and writing can scale separately
Serving
Serving files
Serving files is easy!
[Diagram: one Apache server in front of one disk]
Serving files
Scaling is harder
[Diagram: three Apache servers, each in front of its own disk]
Serving files
• This doesn’t scale well
• Primary storage is expensive
  – And takes a lot of space
• In many systems, we only access a small number of files most of the time
Caching
• Insert caches between the storage and serving nodes
• Cache frequently accessed content to reduce reads on the storage nodes
• Software (Squid, mod_cache)
• Hardware (Netcache, Cacheflow)
Why it works
• Keep a smaller working set
• Use faster hardware
  – Lots of RAM
  – SCSI
  – Outer edge of disks (ZCAV)
• Use more duplicates
  – Cheaper, since they’re smaller
Two models
• Layer 4
  – ‘Simple’ balanced cache
  – Objects in multiple caches
  – Good for few objects requested many times
• Layer 7
  – URL balanced cache
  – Objects in a single cache
  – Good for many objects requested a few times
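The difference between the two models comes down to whether the balancer looks at the URL. A rough sketch, assuming three caches with made-up names, round-robin for the layer 4 case and a hash of the URL for the layer 7 case; real load balancers offer other strategies too.

```python
import hashlib
import itertools

CACHES = ["cache1", "cache2", "cache3"]  # hypothetical cache pool

_round_robin = itertools.cycle(CACHES)


def pick_layer4() -> str:
    """Layer 4 style: ignore the URL and rotate through the caches.

    Any object can end up in every cache, which suits a small hot set.
    """
    return next(_round_robin)


def pick_layer7(url: str) -> str:
    """Layer 7 style: hash the URL so each object lives in exactly one cache."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return CACHES[int.from_bytes(digest[:8], "big") % len(CACHES)]
```

Plain modulo hashing reshuffles most objects when a cache is added or removed, so it is a starting point rather than a finished design.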
Replacement policies
• LRU – Least recently used
• GDSF – Greedy dual size frequency
• LFUDA – Least frequently used with dynamic aging
• All have advantages and disadvantages
• Performance varies greatly with each
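LRU is the simplest of the three policies to show in code. The sketch below is a toy in-memory version that counts objects rather than bytes; the class name LRUCache is our own, and production caches such as Squid implement these policies quite differently.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry
```

GDSF and LFUDA also weigh object size and request frequency, which is why their behaviour differs so much from LRU on the same traffic.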
Cache Churn
• How long do objects typically stay in cache?
• If it gets too short, we’re doing badly
  – But it depends on your traffic profile
• Make the cached object store larger
Problems
• Caching has some problems:
  – Invalidation is hard
  – Replacement is dumb (even LFUDA)
• Avoiding caching makes your life (somewhat) easier
CDN – Content Delivery Network
• Akamai, Savvis, Mirror Image Internet, etc
• Caches operated by other people
  – Already in-place
  – In lots of places
• GSLB/DNS balancing
Edge networks
[Diagram: a single origin, then the same origin surrounded by edge caches]
CDN Models
• Simple model
  – You push content to them, they serve it
• Reverse proxy model
  – You publish content on an origin, they proxy and cache it
CDN Invalidation
• You don’t control the caches
  – Just like those awful ISP ones
• Once something is cached by a CDN, assume it can never change
  – Nothing can be deleted
  – Nothing can be modified
Versioning
• When you start to cache things, you need to care about versioning
  – Invalidation & Expiry
  – Naming & Sync
Cache Invalidation
• If you control the caches, invalidation is possible
• But remember ISP and client caches
• Remove deleted content explicitly
  – Avoid users finding old content
  – Save cache space
Cache versioning
• Simple rule of thumb:
  – If an item is modified, change its name (URL)
• This can be independent of the file system!
Virtual versioning
• Database indicates version 3 of file
• Web app writes version number into URL
• Request comes through cache and is cached with the versioned URL
• mod_rewrite converts versioned URL to path
Version 3
example.com/foo_3.jpg
Cached: foo_3.jpg
foo_3.jpg -> foo.jpg
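The two halves of virtual versioning are writing the version into the URL and stripping it again before touching storage. In the slides the second half is done by mod_rewrite; the Python below is only a stand-in for that rule, assuming the name_VERSION.ext pattern from the foo_3.jpg example. Both function names are hypothetical.

```python
import re

# Assumed URL shape: <stem>_<version><.ext>, e.g. /foo_3.jpg
_VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<version>\d+)(?P<ext>\.[A-Za-z0-9]+)$")


def versioned_url(path: str, version: int) -> str:
    """What the web app emits: /foo.jpg at version 3 becomes /foo_3.jpg."""
    stem, dot, ext = path.rpartition(".")
    if not dot:
        raise ValueError("path needs a file extension")
    return f"{stem}_{version}.{ext}"


def storage_path(url_path: str) -> str:
    """What the rewrite step does: /foo_3.jpg maps back to /foo.jpg."""
    match = _VERSIONED.match(url_path)
    if match is None:
        return url_path  # not versioned, serve as-is
    return match.group("stem") + match.group("ext")
```

One caveat with this pattern: a file genuinely named something like img_2007.jpg would be mistaken for a versioned URL, so a real scheme needs a less ambiguous marker.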
Authentication
• Authentication inline layer
  – Apache / perlbal
• Authentication sideline
  – ICP (CARP/HTCP)
• Authentication by URL
  – FlickrFS
Auth layer
• Authenticator sits between client and storage
• Typically built into the cache software
[Diagram: cache, authenticator and origin in line]
Auth sideline
• Authenticator sits beside the cache
• Lightweight protocol used for authenticator
[Diagram: cache and origin, with the authenticator beside the cache]
Auth by URL
• Someone else performs authentication and gives URLs to client (typically the web app)
• URLs hold the ‘keys’ for accessing files
[Diagram: web server, cache and origin]
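One common way to make a URL "hold the key" is to sign it with an HMAC. This is a generic sketch and not a description of how FlickrFS does it; the secret, the query parameter names and both function names are assumptions made for the example.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical secret shared by web app and file server


def _signature(path: str, expires: int, secret: bytes) -> str:
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_path(path: str, expires: int, secret: bytes = SECRET) -> str:
    """Web app side: hand the client a URL that carries its own proof."""
    return f"{path}?expires={expires}&sig={_signature(path, expires, secret)}"


def is_valid(path: str, expires: int, sig: str, now: int,
             secret: bytes = SECRET) -> bool:
    """Serving side: accept only unexpired URLs with a matching signature."""
    if now > expires:
        return False
    return hmac.compare_digest(_signature(path, expires, secret), sig)
```

The serving layer never calls back to the web app: anyone holding a valid URL can fetch the file until the expiry time passes.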
BCP
Business Continuity Planning
• How can I deal with the unexpected?
  – The core of BCP
• Redundancy
• Replication
Reality
• On a long enough timescale, anything that can fail, will fail
• Of course, everything can fail
• True reliability comes only through redundancy
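The case for redundancy is easy to put in numbers. The sketch below assumes copies fail independently, which is a strong assumption (shared power, fabric or colo breaks it), and the probabilities used are made up for illustration.

```python
def p_all_copies_lost(p_single: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0 or copies < 1:
        raise ValueError("need 0 <= p_single <= 1 and copies >= 1")
    return p_single ** copies


def copies_needed(p_single: float, target: float) -> int:
    """Smallest number of copies that pushes the loss chance to target or below."""
    if not 0.0 < p_single < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("probabilities must be strictly between 0 and 1")
    copies = 1
    while p_single ** copies > target:
        copies += 1
    return copies
```

With a hypothetical 5% chance of losing any one copy in a year, two copies bring the chance of losing both down to 0.25%, and three copies are needed to get under 0.1%.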
Reality
• Define your own SLAs
• How long can you afford to be down?
• How manual is the recovery process?
• How far can you roll back?
• How many nodes/boxes can fail at once?
Failure scenarios
• Disk failure
• Storage array failure
• Storage head failure
• Fabric failure
• Metadata node failure
• Power outage
• Routing outage
Reliable by design
• RAID avoids disk failures, but not head or fabric failures
• Duplicated nodes avoid host and fabric failures, but not routing or power failures
• Dual-colo avoids routing and power failures, but may need duplication too
Tend to all points in the stack
• Going dual-colo: great
• Taking a whole colo offline because of a single failed disk: bad
• We need a combination of these
Recovery times
• BCP is not just about continuing when things fail
• How can we restore after they come back?
• Host and colo level syncing
  – Replication queuing
• Host and colo level rebuilding
Reliable Reads & Writes
• Reliable reads are easy
  – 2 or more copies of files
• Reliable writes are harder
  – Write 2 copies at once
  – But what do we do when we can’t write to one?
Dual writes
• Queue up data to be written
  – Where?
  – Needs itself to be reliable
• Queue up journal of changes
  – And then read data from the disk whose write succeeded
• Duplicate whole volume after failure
  – Slow!
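The journal option can be sketched with two in-memory stand-ins for storage nodes. Everything here is hypothetical (the Replica and DualWriter classes, the in-memory journal); in particular, a real journal would itself have to live somewhere reliable, which is the hard part the slide points at.

```python
from typing import Dict, List, Tuple


class Replica:
    """In-memory stand-in for a storage node."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.online = True

    def write(self, key: str, data: bytes) -> None:
        if not self.online:
            raise IOError("replica offline")
        self.files[key] = data


class DualWriter:
    def __init__(self, a: Replica, b: Replica) -> None:
        self.replicas = [a, b]
        # (replica index, key) pairs that still need to be copied across.
        self.journal: List[Tuple[int, str]] = []

    def write(self, key: str, data: bytes) -> None:
        failed = []
        for i, replica in enumerate(self.replicas):
            try:
                replica.write(key, data)
            except IOError:
                failed.append(i)
        if len(failed) == len(self.replicas):
            raise IOError("no replica accepted the write")
        self.journal.extend((i, key) for i in failed)

    def replay(self) -> None:
        """Copy journaled keys from the replica whose write succeeded."""
        remaining = []
        for i, key in self.journal:
            source = self.replicas[1 - i]
            try:
                self.replicas[i].write(key, source.files[key])
            except (IOError, KeyError):
                remaining.append((i, key))  # still down or source lost it
        self.journal = remaining
```

Only the names of changed files are queued, so recovery copies just what was missed instead of duplicating the whole volume.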
Cost
Judging cost
• Per GB?
• Per GB upfront and per year
• Not as simple as you’d hope
  – How about an example
Hardware costs
Cost of hardware / Usable GB (single cost)
Power costs
Cost of power per year / Usable GB (recurring cost)
Power costs
Power installation cost / Usable GB (single cost)
Space costs
[Cost per U x U’s needed (inc network)] / Usable GB (recurring cost)
Network costs
Cost of network gear / Usable GB (single cost)
Misc costs
[Support contracts + spare disks + bus adaptors + cables] / Usable GB (single & recurring costs)
Human costs
[Admin cost per node x Node count] / Usable GB (recurring cost)
TCO
• Total cost of ownership in two parts
  – Upfront
  – Ongoing
• Architecture plays a huge part in costing
  – Don’t get tied to hardware
  – Allow heterogeneity
  – Move with the market
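The per-GB formulas from the cost slides can be rolled into upfront and ongoing totals. This is a sketch under assumptions: the field names are ours, the split of misc costs into an upfront part (spares, adaptors, cables) and a yearly part (support contracts) is a guess at what "single & recurring" means, and any figures fed in are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageCosts:
    usable_gb: float
    # Single (upfront) costs
    hardware: float
    power_install: float
    network_gear: float
    misc_upfront: float          # assumed: spare disks, bus adaptors, cables
    # Recurring (per year) costs
    power_per_year: float
    cost_per_u_per_year: float
    rack_units: int              # U's needed, including network gear
    misc_per_year: float         # assumed: support contracts
    admin_per_node_per_year: float
    node_count: int

    def upfront_per_gb(self) -> float:
        total = (self.hardware + self.power_install
                 + self.network_gear + self.misc_upfront)
        return total / self.usable_gb

    def yearly_per_gb(self) -> float:
        total = (self.power_per_year
                 + self.cost_per_u_per_year * self.rack_units
                 + self.misc_per_year
                 + self.admin_per_node_per_year * self.node_count)
        return total / self.usable_gb

    def tco_per_gb(self, years: float) -> float:
        """Upfront cost plus ongoing cost over the expected lifetime."""
        return self.upfront_per_gb() + years * self.yearly_per_gb()
```

Running two candidate architectures through the same calculation over the same lifetime makes the upfront versus ongoing trade-off visible.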
(fin)
Photo credits
• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/
You can find these slides online:
iamcal.com/talks/