CS194-24
Advanced Operating Systems Structures and Implementation
Lecture 23
Application-Specific File Systems
Deep Archival Storage
Security and Protection
April 29th, 2013
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs194-24
Lec 23.24/29/13 Kubiatowicz CS194-24 ©UCB Fall 2013
Goals for Today
• Application-specific File Systems
  – Dynamo, Haystack
• Deep Archival Storage
  – OceanStore
• Security and Protection
Interactive is important! Ask Questions!
Note: Some slides and/or pictures in the following are adapted from Bovet, “Understanding the Linux Kernel”, 3rd edition, 2005
Recall: VFS Common File Model
• Four primary object types for VFS:
  – superblock object: represents a specific mounted filesystem
  – inode object: represents a specific file
  – dentry object: represents a directory entry
  – file object: represents an open file associated with a process
• There is no specific directory object (VFS treats directories as files)
• May need to fit the model by faking it
  – Example: make it look like directories are files
  – Example: make it look like the filesystem has inodes, superblocks, etc.
Recall: Data-based Caching (Data “De-Duplication”)
• Use a sliding-window hash function to break files into chunks
  – Rabin Fingerprint: randomized function of data window
    » Pick sensitivity: e.g. 48 bytes at a time, lower 13 bits = 0 ⇒ 2^-13 probability of happening, expected chunk size 8192
    » Need minimum and maximum chunk sizes
  – Now, if data stays same, chunk stays the same
• Blocks named by cryptographic hashes such as SHA-256
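The chunking idea above can be sketched as follows. This is a minimal illustration, not the scheme from any particular system: it swaps in a simple polynomial rolling hash for a true Rabin fingerprint, and the names `chunk_boundaries` and `chunk_names` as well as the `MIN_CHUNK`/`MAX_CHUNK` values are assumptions. Only the 48-byte window, the 13-bit mask, and SHA-256 naming come from the slide.

```python
import hashlib

WINDOW = 48               # bytes in the sliding window (from the slide)
MASK = (1 << 13) - 1      # cut when the low 13 bits are zero (from the slide)
MIN_CHUNK = 2048          # assumed minimum chunk size
MAX_CHUNK = 65536         # assumed maximum chunk size
BASE = 257                # assumed polynomial base (stand-in for Rabin)
MOD = (1 << 61) - 1       # assumed modulus
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunk_boundaries(data: bytes) -> list:
    """Return the end offset of every chunk in data."""
    cuts = []
    start = 0   # offset where the current chunk begins
    h = 0       # rolling hash of the last (up to) WINDOW bytes
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Drop the contribution of the byte sliding out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + b) % MOD
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # trailing partial chunk
    return cuts


def chunk_names(data: bytes) -> list:
    """Name each chunk by the SHA-256 hash of its contents."""
    names = []
    prev = 0
    for cut in chunk_boundaries(data):
        names.append(hashlib.sha256(data[prev:cut]).hexdigest())
        prev = cut
    return names
```

Because `MIN_CHUNK` is larger than `WINDOW`, every hash-triggered cut depends only on the 48 bytes just before it, which is why unchanged data tends to produce the same chunks and therefore the same names.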
Recall: Peer-to-Peer: Fully equivalent components
• Peer-to-Peer has many interacting components
  – View system as a set of equivalent nodes
    » “All nodes are created equal”
  – Any structure on system must be self-organizing
    » Not based on physical characteristics, location, or ownership
Recall: Lookup with Leaf Set (Chord)
[Figure: lookup from Source toward the Lookup ID and the Response back; routing prefixes 0…, 10…, 110…, 111…]
• Assign IDs to nodes
  – Map hash values to node with closest ID
• Leaf set is successors and predecessors
  – All that’s needed for correctness
• Routing table matches successively longer prefixes
  – Allows efficient lookups
• Data Replication:
  – On leaf set
Advantages/Disadvantages of Consistent Hashing
• Advantages:
  – Automatically adapts data partitioning as node membership changes
  – Node given random key value automatically “knows” how to participate in routing and data management
  – Random key assignment gives approximation to load balance
• Disadvantages
  – Uneven distribution of key storage is a natural consequence of random node names ⇒ leads to uneven query load
  – Key management can be expensive when nodes transiently fail
    » Assuming that we immediately respond to node failure, must transfer state to new node set
    » Then when node returns, must transfer state back
    » Can be a significant cost if transient failure common
• Disadvantages of “Scalable” routing algorithms
  – More than one hop to find data ⇒ O(log N) or worse
  – Number of hops unpredictable and almost always > 1
    » Node failure, randomness, etc.
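A small sketch may make the first advantage concrete. It assumes a 64-bit ring derived from SHA-256 and assigns each key to the first node clockwise from the key's position (a successor rule, chosen here for simplicity); the helper names `ring_id` and `preference_list` and the node and key names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def ring_id(name: str) -> int:
    """Hash a node name or key onto a 64-bit ring (assumed ring size)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")


def preference_list(nodes, key, n=1):
    """Return the n nodes clockwise from the key's position, owner first."""
    ring = sorted(nodes, key=ring_id)
    ids = [ring_id(node) for node in ring]
    start = bisect_left(ids, ring_id(key))
    return [ring[(start + i) % len(ring)] for i in range(min(n, len(ring)))]


nodes = [f"node-{i}" for i in range(8)]
keys = [f"key-{i}" for i in range(1000)]

# Owner of every key before and after one node joins.
before = {k: preference_list(nodes, k)[0] for k in keys}
after = {k: preference_list(nodes + ["node-new"], k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Per-node key counts show the uneven load that random node IDs produce.
load = {node: sum(1 for owner in before.values() if owner == node) for node in nodes}
print(f"{len(moved)} of {len(keys)} keys moved, all to node-new")
print("keys per node before the join:", load)
```

Every key that changes owner moves to the new node, so the join disturbs only one arc of the ring. The printed per-node counts illustrate the first disadvantage: with random IDs the arcs, and hence the loads, are usually unequal.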
Dynamo Assumptions
• Query Model: simple interface exposed to application level
  – Get(), Put()
  – No Delete()
  – No transactions, no complex queries
• Atomicity, Consistency, Isolation, Durability
  – Operations either succeed or fail, no middle ground
  – System will be eventually consistent, no sacrifice of availability to assure consistency
  – Conflicts can occur while updates propagate through system
  – System can still function while entire sections of network are down
• Efficiency: measure system by the 99.9th percentile
  – Important with millions of users, 0.1% can be in the 10,000s
• Non-Hostile Environment
  – No need to authenticate query, no malicious queries
  – Behind web services, not in front of them
Service Level Agreements (SLA)
• Application can deliver its functionality in a bounded time:
  – Every dependency in the platform needs to deliver its functionality with even tighter bounds.
• Example: service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second
• Contrast to services which focus on mean response time
[Figure: Service-oriented architecture of Amazon’s platform]
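To see why a 99.9th-percentile target differs from a mean target, consider the sketch below. The latency samples are made up for illustration, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical workload: 99.8% of requests take 20 ms, 0.2% take 900 ms.
latencies_ms = [20] * 9980 + [900] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean = {mean_ms:.2f} ms, 99.9th percentile = {p999_ms} ms")
```

The mean sits far below the 300ms bound from the example, yet the 99.9th percentile exceeds it, so this hypothetical service would meet a mean-based target while violating the SLA.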
Replication
• Each data item is replicated at N hosts
• “preference list”: The list of nodes responsible for storing a particular key
  – Successive nodes not guaranteed to be on different physical nodes
  – Thus preference list includes physically distinct nodes
• Sloppy Quorum
  – R (or W) is the minimum number of nodes that must participate in a successful read (or write) operation.
  – Setting R + W > N yields a quorum-like system.
  – Latency of a get (or put) is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
• Replicas synchronized via anti-entropy protocol
  – Use of Merkle tree for each unique range
  – Nodes exchange root of trees for shared key range
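The R + W > N condition can be checked by brute force. The sketch below assumes a strict quorum over a fixed set of N replicas (Dynamo's sloppy quorum relaxes membership during failures, which this does not model); the function name `quorums_always_overlap` is hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (3, 1, 2)]:
    print(f"N={n} R={r} W={w}: overlap guaranteed = {quorums_always_overlap(n, r, w)}")
```

When the read and write sets must overlap, at least one replica contacted by a read has seen the latest successful write; when R + W ≤ N, a read can miss it entirely.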
Administrivia
• Get moving on Lab 4
  – Will require you to read a bunch of code to digest the VFS layer
  – Design due this Thursday!
    » So that Palmer can have design reviews on Friday
    » Focus on behavioral aspects
      • Mounting, File operations, Etc.
• Don’t forget final Lecture during RRR
  – Monday 5/6
  – Send me final topics
Data Versioning
• A put() call may return to its caller before the update has been applied at all the replicas
• A get() call may return many versions of the same object.
• Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future.
• Solution: uses vector clocks in order to capture causality between different versions of the same object
  – A vector clock is a list of (node, counter) pairs
  – Every version of every object is associated with one vector clock
  – If every counter in the first object’s clock is less than or equal to the corresponding counter in the second object’s clock, then the first is an ancestor of the second and can be forgotten.
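The ancestor test can be sketched with clocks stored as dictionaries from node name to counter, where a missing node counts as 0. The function names and the node names `Sx`, `Sy`, `Sz` are illustrative assumptions, not part of any real API.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is equal to or newer than clock b in every component."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"   # b is an ancestor and can be forgotten
    if descends(b, a):
        return "b supersedes a"
    return "conflict"             # concurrent versions, must be reconciled


def increment(clock: dict, node: str) -> dict:
    """Return a copy of clock advanced by one write coordinated at node."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum, used when a client reconciles two versions."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


d1 = increment({}, "Sx")          # {Sx: 1}
d2 = increment(d1, "Sx")          # {Sx: 2}
d3 = increment(d2, "Sy")          # {Sx: 2, Sy: 1}
d4 = increment(d2, "Sz")          # {Sx: 2, Sz: 1}
d5 = increment(merge(d3, d4), "Sx")  # reconciled write coordinated at Sx

print(compare(d2, d1))  # a supersedes b
print(compare(d3, d4))  # conflict
print(compare(d5, d3))  # a supersedes b
```

Here `d3` and `d4` both descend from `d2` but neither descends from the other, so a read returns both and the client reconciles them into `d5`.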
Vector clock example
Conflicts (multiversion data)
• Client must resolve conflicts
  – Only resolve conflicts on reads
  – Different resolution options:
    » Use vector clocks to decide based on history
    » Use timestamps to pick latest version
  – Examples given in paper:
    » For shopping cart, simply merge different versions
    » For customer’s session information, use latest version
  – Stale versions returned on reads are updated (“read repair”)
• Vary N, R, W to match requirements of applications
  – High performance reads: R=1, W=N
  – Fast writes with possible inconsistency: W=1
  – Common configuration: N=3, R=2, W=2
• When do branches occur?
  – Branches uncommon: 0.06% of requests saw > 1 version over 24 hours
  – Divergence occurs because of high write rate (more coordinators), not necessarily because of failure
Haystack File System
• Does it ever make sense to adapt a file system to a particular usage pattern?
  – Perhaps
• Good example: Facebook’s “Haystack” filesystem
  – Specific application (Photo Sharing)
    » Large files!, Many files!
    » 260 Billion images, 20 PetaBytes (10^15 bytes!)
    » One billion new photos a week (60 TeraBytes)
  – Presence of Content Delivery Network (CDN)
    » Distributed caching and distribution network
    » Facebook web servers return special URLs that encode requests to CDN
    » Pay for service by bandwidth
  – Specific usage patterns:
    » New photos accessed a lot (caching well)
    » Old photos accessed little, but likely to be requested at any time ⇒ NEEDLES
[Figure: Number of photos requested in day]
Old Solution: NFS
• Issues with this design?
• Long Tail ⇒ Caching does not work for most photos
  – Every access to back end storage must be fast without benefit of caching!
• Linear Directory scheme works badly for many photos/directory
  – Many disk operations to find even a single photo
  – Directory’s block map too big to cache in memory
  – “Fixed” by reducing directory size, however still not great
• Meta-Data (FFS) requires ≥ 3 disk accesses per lookup
  – Caching all iNodes in memory might help, but iNodes are big
• Fundamentally, Photo Storage different from other storage:
  – Normal file systems fine for developers, databases, etc.
New Solution: Haystack
• Finding a needle (old photo) in Haystack
• Differentiate between old and new photos
  – How? By looking at “Writeable” vs “Read-only” volumes
  – New Photos go to Writeable volumes
• Directory: Help locate photos
  – Name (URL) of photo has embedded volume and photo ID
• Let CDN or Haystack Cache serve new photos
  – rather than forwarding them to Writeable volumes
• Haystack Store: Multiple “Physical Volumes”
  – Physical volume is large file (100 GB) which stores millions of photos
  – Data Accessed by Volume ID with offset into file
  – Since Physical Volumes are large files, use XFS which is optimized for large files
Haystack Details
• Each physical volume is stored as single file in XFS
  – Superblock: General information about the volume
  – Each photo (a “needle”) stored by appending to file
• Needles stored sequentially in file
  – Naming: [Volume ID, Key, Alternate Key, Cookie]
  – Cookie: random value to avoid guessing attacks
  – Key: Unique 64-bit photo ID
  – Alternate Key: four different sizes, ‘n’, ‘a’, ‘s’, ‘t’
• Deleted Needle ⇒ Simply marked as “deleted”
  – Overwritten Needle: new version appended at end
Haystack Details (Con’t)
• Replication for reliability and performance:
  – Multiple physical volumes combined into logical volume
    » Factor of 3
  – Four different sizes
    » Thumbnails, Small, Medium, Large
• Lookup
  – User requests Webpage
  – Webserver returns URL of form:
    » http://<CDN>/<Cache>/<Machine id>/<Logical volume,photo>
    » Possibly reference cache only if old image
  – CDN will strip off CDN reference if missing, forward to cache
  – Cache will strip off cache reference and forward to Store
• In-memory index on Store for each volume maps: [Key, Alternate Key] ⇒ Offset
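The append-only volume with an in-memory index can be sketched as below. This is a toy model under stated assumptions: an `io.BytesIO` buffer stands in for the large XFS file, the `Volume` class and its methods are hypothetical, and the on-disk needle header, cookie, and deleted flag are omitted (a delete here only drops the index entry).

```python
import io


class Volume:
    """Toy append-only volume with an in-memory (key, alt_key) -> (offset, size) index."""

    def __init__(self):
        self.file = io.BytesIO()  # stands in for the single large file on XFS
        self.index = {}           # in-memory index, rebuilt from the file in a real store

    def put(self, key: int, alt_key: str, data: bytes) -> int:
        """Append a needle; an overwrite simply appends and repoints the index."""
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(data)
        self.index[(key, alt_key)] = (offset, len(data))
        return offset

    def get(self, key: int, alt_key: str):
        """One index lookup plus one read at a known offset; no directory walk."""
        entry = self.index.get((key, alt_key))
        if entry is None:
            return None
        offset, size = entry
        self.file.seek(offset)
        return self.file.read(size)

    def delete(self, key: int, alt_key: str) -> None:
        """Forget the needle; its bytes stay in the file until compaction."""
        self.index.pop((key, alt_key), None)


vol = Volume()
vol.put(42, "t", b"thumbnail-v1")
vol.put(42, "n", b"full-size image bytes")
vol.put(42, "t", b"thumbnail-v2")   # overwrite: new version appended at end
print(vol.get(42, "t"))
```

The point of the design is that a photo read costs a memory lookup and a single file read, instead of the several metadata disk accesses a general-purpose directory tree would need.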
What about Protection?
• Start by asking some high-level questions…
  – What do we expect of our systems?
    » Won’t leak our information
    » Won’t lose our information
    » Will always work when we need them
    » Won’t launch attacks against other people
  – How can we prevent systems from misbehaving?
    » Never connect them to the network?
    » Always authenticate users?
    » Never use them?
• Protection: use of one or more mechanisms for controlling the access of programs, processes, or users to resources
  – Page Table Mechanism
  – File Access Mechanism
  – On-disk encryption
• Can use lots of Protection but still have an insecure system!
  – Bugs, back doors, viruses, poorly defined policy, inside man
  – Denial of service, …
Protection vs Security
• Security is a very complex topic: see, e.g., CS161
  – Security is about Policy, i.e. what human-centered properties do we want from our system
    » Usually with reference to an attack model
  – Security is achieved through a series of Mechanisms, i.e. individual elements of the system combined together to achieve a security policy
• Security: use of protection mechanisms to prevent misuse of resources
  – Misuse defined with respect to policy
    » E.g.: prevent exposure of certain sensitive information
    » E.g.: prevent unauthorized modification/deletion of data
  – Requires consideration of the external environment within which the system operates
    » Even the most well-constructed system cannot protect information if user accidentally reveals password
Preventing Misuse
• Types of Misuse:
  – Accidental:
    » If I delete shell, can’t log in to fix it!
    » Could make it more difficult by asking: “do you really want to delete the shell?”
  – Intentional:
    » Some high school brat who can’t get a date, so instead he transfers $3 billion from B to A.
    » Doesn’t help to ask if they want to do it (of course!)
• Three Pieces to Security
  – Authentication: who the user actually is
  – Authorization: who is allowed to do what
  – Enforcement: make sure people do only what they are supposed to do
• Loopholes in any carefully constructed system:
  – Log in as superuser and you’ve circumvented authentication
  – Log in as self and can do anything with your resources; for instance: run program that erases all of your files
  – Can you trust software to correctly enforce Authentication and Authorization?????
Authentication: Identifying Users
• How to identify users to the system?
  – Passwords
    » Shared secret between two parties
    » Since only user knows password, someone who types the correct password must be the user
    » Very common technique
  – Smart Cards
    » Electronics embedded in card capable of providing long passwords or satisfying challenge response queries
    » May have display to allow reading of password
    » Or can be plugged in directly; several credit cards now in this category
  – Biometrics
    » Use of one or more intrinsic physical or behavioral traits to identify someone
    » Examples: fingerprint reader, palm reader, retinal scan
    » Becoming quite a bit more common
• What else?
  – Consider the “Swarm” and “Un-pad” views
Timing Attacks: Tenex Password Checking
• Tenex: early 70’s, BBN
  – Most popular system at universities before UNIX
  – Thought to be very secure, gave “red team” all the source code and documentation (want code to be publicly available, as in UNIX)
  – In 48 hours, they figured out how to get every password in the system
• Here’s the code for the password check:
    for (i = 0; i < 8; i++)
        if (userPasswd[i] != realPasswd[i])
            goto error;   /* stops at the first mismatching character */
• How many combinations of passwords?
  – 256^8?
  – Wrong!
Defeating Password Checking
• Tenex used VM, and it interacts badly with the above code
  – Key idea: force page faults at inopportune times to break passwords quickly
• Arrange 1st char in string to be last char in pg, rest on next pg
  – Then arrange for pg with 1st char to be in memory, and rest to be on disk (e.g., ref lots of other pgs, then ref 1st page)
        a|aaaaaa
        page in memory | page on disk
• Time password check to determine if first character is correct!
  – If fast, 1st char is wrong
  – If slow, 1st char is right, pg fault, one of the others wrong
  – So try all first characters, until one is slow
  – Repeat with first two characters in memory, rest on disk
• Only 256 * 8 attempts to crack passwords
  – Fix is easy, don’t stop until you look at all the characters
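The attack can be simulated without real page faults. In the sketch below, the early-exit check also reports how many characters it examined, which stands in for the "fast vs slow" timing signal; that side channel, the lowercase-only `ALPHABET`, and the function names are assumptions made for illustration. The fix shown uses Python's standard `hmac.compare_digest`, which does not stop at the first mismatch.

```python
import hmac
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet; the slide counts all 256 byte values
LENGTH = 8                         # fixed password length, as in the slide's loop


def leaky_check(guess: str, real: str):
    """Early-exit comparison as on the slide.

    Also returns the number of characters examined, a stand-in for the
    page-fault timing the attacker observes.
    """
    for i in range(LENGTH):
        if guess[i] != real[i]:
            return False, i + 1
    return True, LENGTH


def crack(oracle):
    """Recover the password one position at a time using the leak."""
    known, attempts = "", 0
    for pos in range(LENGTH):
        for c in ALPHABET:
            guess = (known + c).ljust(LENGTH, ALPHABET[0])
            attempts += 1
            ok, examined = oracle(guess)
            # The check ran past position pos, so the character at pos is right.
            if ok or examined > pos + 1:
                known += c
                break
    return known, attempts


def safe_check(guess: str, real: str) -> bool:
    """Comparison whose running time does not depend on where the strings differ."""
    return hmac.compare_digest(guess.encode(), real.encode())


secret = "password"
found, attempts = crack(lambda g: leaky_check(g, secret))
print(found, attempts, "attempts; brute force would need up to", len(ALPHABET) ** LENGTH)
```

The attempt count is bounded by alphabet size times length (the slide's 256 * 8) rather than alphabet size to the power of length (256^8), because each position is confirmed independently.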
Recall: Authorization: Who Can Do What?
• How do we decide who is authorized to do actions in the system?
• Access Control Matrix: contains all permissions in the system
  – Resources across top
    » Files, Devices, etc…
  – Domains in rows
    » A domain might be a user or a group of permissions
    » E.g. above: User D3 can read F2 or execute F3
  – In practice, table would be huge and sparse!
• Two approaches to implementation
  – Access Control Lists: store permissions with each object
    » Still might be lots of users!
    » UNIX limits each file to: r,w,x for owner, group, world
    » More recent systems allow definition of groups of users and permissions for each group
  – Capability List: each process tracks objects it has permission to touch
    » Popular in the past, idea out of favor today
    » Consider page table: Each process has list of pages it has access to, not each page has list of processes …
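The two implementation approaches are just two ways of slicing the same sparse matrix, as the sketch below tries to show. Only the D3 entries (read F2, execute F3) come from the slide; the other matrix entries and all function names are made up for illustration.

```python
from collections import defaultdict

# Sparse access control matrix: (domain, object) -> set of rights.
# Only the D3 entries are from the slide; the rest are assumed.
MATRIX = {
    ("D1", "F1"): {"read"},
    ("D2", "F1"): {"read", "write"},
    ("D3", "F2"): {"read"},
    ("D3", "F3"): {"execute"},
}


def access_control_lists(matrix):
    """Slice by column: each object stores who may do what to it."""
    acls = defaultdict(dict)
    for (domain, obj), rights in matrix.items():
        acls[obj][domain] = set(rights)
    return dict(acls)


def capability_lists(matrix):
    """Slice by row: each domain carries the objects it may touch."""
    caps = defaultdict(dict)
    for (domain, obj), rights in matrix.items():
        caps[domain][obj] = set(rights)
    return dict(caps)


def allowed(matrix, domain, obj, right) -> bool:
    """Check one cell of the matrix; a missing cell grants nothing."""
    return right in matrix.get((domain, obj), set())


print(access_control_lists(MATRIX)["F1"])   # who can touch F1
print(capability_lists(MATRIX)["D3"])       # what D3 can touch
```

An ACL answers "who can access this object?" cheaply, while a capability list answers "what can this domain access?" cheaply; the page-table example on the slide is the second kind.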
Authorization Continued
• Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
  – Very hard to do in practice
    » How do you figure out what the minimum set of privileges is needed to run your programs?
  – People often run at higher privilege than necessary
    » Such as the “administrator” privilege under Windows
• One solution: Signed Software
  – Only use software from sources that you trust, thereby dealing with the problem by means of authentication
  – Fine for big, established firms such as Microsoft, since they can make their signing keys well known and people trust them
    » Actually, not always fine: recently, one of Microsoft’s signing keys was compromised, leading to malicious software that looked valid
  – What about new startups?
    » Who “validates” them?
    » How easy is it to fool them?
Mandatory Access Control (MAC)
• Mandatory Access Control (MAC)
  – “A type of access control by which the operating system constrains the ability of a subject or initiator to access or generally perform some sort of operation on an object or target.” From Wikipedia
  – Subject: a process or thread
  – Object: files, directories, TCP/UDP ports, etc.
  – Security policy is centrally controlled by a security policy administrator: users not allowed to operate outside the policy
  – Examples: SELinux, HiStar, etc.
• Contrast: Discretionary Access Control (DAC)
  – Access restricted based on the identity of subjects and/or groups to which they belong
  – Controls are discretionary: a subject with a certain access permission is capable of passing that permission on to any other subject
  – Standard UNIX model
Data Centric Access Control (DCAC?)
• Problem with many current models:
  – If you break into OS ⇒ data is compromised
  – In reality, it is the data that matters: hardware is somewhat irrelevant (and ubiquitous)
• Data-Centric Access Control (DCAC)
  – I just made this term up, but you get the idea
  – Protect data at all costs, assume that software might be compromised
  – Requires encryption and sandboxing techniques
  – If hardware (or virtual machine) has the right cryptographic keys, then data is released
• All of the previous authorization and enforcement mechanisms reduce to key distribution and protection
  – Never let decrypted data or keys outside sandbox
  – Examples: Use of TPM, virtual machine mechanisms
Enforcement
• Enforcer checks passwords, ACLs, etc.
  – Makes sure that only authorized actions take place
  – Bugs in enforcer ⇒ things for malicious users to exploit
• Normally, in UNIX, superuser can do anything
  – Because of coarse-grained access control, lots of stuff has to run as superuser in order to work
  – If there is a bug in any one of these programs, you lose!
• Paradox
  – Bullet-proof enforcer
    » Only known way is to make enforcer as small as possible
    » Easier to make correct, but simple-minded protection model
  – Fancy protection
    » Tries to adhere to principle of least privilege
    » Really hard to get right
• Same argument for Java or C++: What do you make private vs public?
  – Hard to make sure that code is usable but only necessary modules are public
  – Pick something in middle? Get bugs and weak protection!
Summary
• Peer-to-Peer:
  – Use of 100s or 1000s of nodes to keep higher performance or greater availability
  – May need to relax consistency for better performance
• Application-Specific File Systems (e.g. Haystack):
  – Optimize system for particular usage pattern
• Security: use of protection mechanisms to prevent misuse of resources
  – Represents Human-Centered Policy as opposed to mechanism
• Three Pieces to Security
  – Authentication: who the user actually is
  – Authorization: who is allowed to do what
  – Enforcement: make sure people do only what they are supposed to do
• Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks