
Page 1: CSC  536 Lecture  8

CSC 536 Lecture 8

Page 2: CSC  536 Lecture  8

Outline

Running Akka applications in the cloud
  Akka Cluster
  Docker
  etcd

Overview of Google’s distributed systems (part 1)

Page 3: CSC  536 Lecture  8

Akka Cluster

Page 4: CSC  536 Lecture  8

Akka cluster

A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck
Allows Actors/ActorSystems to work together without having to specify the nodes they will live in
Encourages development of peer-to-peer applications instead of client-server ones
Allows peers to automatically discover new nodes and remove dead ones
Introduces the concept of "roles" to distinguish different Akka applications within a cluster
Introduces clustered routers that automatically adjust their routees list based on node availability

Page 5: CSC  536 Lecture  8

Akka cluster

A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck

A layer of abstraction above Akka remoting

Page 6: CSC  536 Lecture  8

Akka cluster membership

A cluster is made up of a set of member nodes. Each node is an ActorSystem (all having the same name). The identifier for each node is a hostname:port tuple.

An Akka application can be distributed over a cluster with each node hosting some part of the application.

Joining a cluster is initiated by issuing a Join command to one of the nodes of the cluster being joined.

Page 7: CSC  536 Lecture  8

Akka cluster events

The membership service allows actors in member nodes to monitor state transitions of other member nodes

To do this, the actor must register as a listener for ClusterDomainEvents

MemberEvent: MemberUp, MemberRemoved

ReachabilityEvent: UnreachableMember, ReachableMember
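
A minimal sketch of such a listener actor using the Akka 2.5 cluster API (the actor name and printouts are illustrative, not part of the lecture samples):

import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Sketch: an actor that registers itself for cluster domain events
class ClusterListener extends Actor {
  val cluster = Cluster(context.system)

  // Replay the current membership as events, then receive live events
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[ReachabilityEvent])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member)          => println(s"Member up: ${member.address}")
    case MemberRemoved(member, _)  => println(s"Member removed: ${member.address}")
    case UnreachableMember(member) => println(s"Unreachable: ${member.address}")
    case ReachableMember(member)   => println(s"Reachable again: ${member.address}")
    case _: MemberEvent            => // ignore other member events
  }
}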

Page 8: CSC  536 Lecture  8

Akka cluster example

Add dependency to build.sbt:

libraryDependencies += "com.typesafe.akka" %% "akka-cluster" % "2.5.1"

Page 9: CSC  536 Lecture  8

Akka cluster example

Application.conf:

akka {
  actor {
    provider = cluster
  }
  remote {
    netty.tcp {
      hostname = "127.0.0.1"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "akka.tcp://ClusterSystem@127.0.0.1:2552"]
  }
}

The seed nodes are configured contact points for the initial, automatic join of the cluster

Page 11: CSC  536 Lecture  8

Seed nodes

The seed nodes are configured contact points for new nodes joining the cluster

When a new node is started it sends a message to all seed nodes and then sends a join command to the seed node that answers first

The seed nodes configuration value is only relevant for new nodes joining the cluster as it helps them to find contact points to send the join command to

A new member can send this command to any current member of the cluster, not only to the seed nodes
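
A minimal sketch of joining programmatically rather than through seed-node configuration; the system name "ClusterSystem" and the contact address are assumptions taken from the sample configuration above:

import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

object ManualJoin extends App {
  val system = ActorSystem("ClusterSystem")

  // Any current member works as a contact point, not only a configured seed node
  val contact = Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2551)
  Cluster(system).join(contact)
}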

Page 12: CSC  536 Lecture  8

Akka cluster example 2

How to make use of cluster membership events?

Example application with backend workers that detect, and then register with, new frontend master nodes (a sketch of the backend's registration step follows the run commands below):
  TransformationMessages.scala
  TransformationBackend.scala
  TransformationFrontend.scala

In separate windows:
$ runMain sample.cluster.transformation.TransformationFrontend 2551
$ runMain sample.cluster.transformation.TransformationBackend 2552
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationFrontend 0
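
A sketch of the backend's registration logic in TransformationBackend.scala, following the structure of the standard Akka cluster samples; BackendRegistration is the registration message defined in TransformationMessages.scala there, reproduced here so the sketch is self-contained:

import akka.actor.{Actor, RootActorPath}
import akka.cluster.{Cluster, Member}
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberUp}

case object BackendRegistration // in the sample this lives in TransformationMessages.scala

class TransformationBackend extends Actor {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents, classOf[MemberUp])
  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(m) => register(m)
    // ... transformation job messages handled here ...
  }

  // When a node with the "frontend" role comes up, register with its frontend actor
  def register(member: Member): Unit =
    if (member.hasRole("frontend"))
      context.actorSelection(RootActorPath(member.address) / "user" / "frontend") !
        BackendRegistration
}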

Page 13: CSC  536 Lecture  8

Node roles

Nodes in a cluster need not perform the same function

Some nodes could run the web front-end, others could run the data access layer, and some others could run the number-crunching

Deployment of actors—for example by cluster-aware routers—can take node roles into account to achieve this distribution of responsibilities

The roles of a node are defined in the configuration property named akka.cluster.roles

The roles of the nodes are part of the membership information in MemberEvent that you can subscribe to.
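
A small sketch of setting a node's roles and reading them back (the role name "compute" matches the router configuration used later in this lecture; the rest is illustrative):

import akka.actor.ActorSystem
import akka.cluster.Cluster
import com.typesafe.config.ConfigFactory

object RoleDemo extends App {
  // Roles are declared per node in the akka.cluster.roles configuration property
  val config = ConfigFactory.parseString("""akka.cluster.roles = ["compute"]""")
    .withFallback(ConfigFactory.load())
  val system = ActorSystem("ClusterSystem", config)

  // Roles travel with the membership information gossiped around the cluster
  Cluster(system).registerOnMemberUp {
    println(s"Member up with roles: ${Cluster(system).selfRoles}")
  }
}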

Page 14: CSC  536 Lecture  8

Akka cluster implementation

Cluster membership is communicated using a gossip protocol

The current state of the cluster is gossiped randomly through the cluster

The gossip protocol is a variation of push-pull gossip, used to reduce the amount of gossip information sent around the cluster
  a digest is sent representing current versions but not actual values
  the recipient of the gossip can then send back any values for which it has newer versions and also request values for which it has outdated versions

based on Amazon’s Dynamo system as well as Riak (both to be discussed in a future lecture)

Page 15: CSC  536 Lecture  8

Akka cluster implementation

Page 16: CSC  536 Lecture  8

Akka cluster implementation

Vector clocks are used to reconcile and merge differences in cluster state during gossiping

Each update to the cluster state has an accompanying update to the vector clock

The digest used by the push-pull gossip consists of vector timestamps, so the actual state is pushed only as needed

Page 17: CSC  536 Lecture  8

Akka cluster implementation

The recipient of the gossip vector timestamp needs to determine whether:

it has a newer version of the gossip state, in which case it sends that back to the gossiper

it has an outdated version of the state, in which case the recipient requests the current state from the gossiper by sending back its version of the gossip state
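
A toy sketch (not Akka's internal code) of the vector clock comparison behind this decision:

object VectorClockSketch {
  // Each node keeps a counter per node; a clock that is ahead in some entry
  // and behind in none is strictly newer.
  type VClock = Map[String, Long]

  sealed trait Relation
  case object Newer      extends Relation // local state is newer: push it back to the gossiper
  case object Older      extends Relation // local state is outdated: pull the gossiper's state
  case object Same       extends Relation
  case object Concurrent extends Relation // divergent histories: states must be merged

  def compare(local: VClock, remote: VClock): Relation = {
    val keys        = local.keySet ++ remote.keySet
    val localAhead  = keys.exists(k => local.getOrElse(k, 0L) > remote.getOrElse(k, 0L))
    val remoteAhead = keys.exists(k => remote.getOrElse(k, 0L) > local.getOrElse(k, 0L))
    (localAhead, remoteAhead) match {
      case (true, false)  => Newer
      case (false, true)  => Older
      case (false, false) => Same
      case (true, true)   => Concurrent
    }
  }
}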

Page 18: CSC  536 Lecture  8

Akka cluster implementation

Information about the cluster converges locally at a node at certain points in time

This is when a node can prove that the cluster state it is observing has been observed by all other nodes in the cluster

Convergence is implemented by passing, during gossip, the set of nodes that have seen the current state version. This set is referred to as the seen set

When all nodes are included in the seen set there is gossip convergence

Page 19: CSC  536 Lecture  8

Akka cluster implementation

Gossip convergence cannot occur while any nodes are unreachable

The nodes need to become reachable again, or be moved to the down and removed states

Until then the leader is prevented from performing its cluster membership management role
  For example, this means that during a network partition it is not possible to add more nodes to the cluster

Page 20: CSC  536 Lecture  8

Akka cluster implementation

After gossip convergence a leader for the cluster is chosen

There is no leader election process: the leader is just the first node in the seen set (in sorted order) whenever there is gossip convergence

The role of the leader is to shift members in and out of the cluster, changing joining members to the up state or exiting members to the removed state.

The leader also has the power, if configured so, to "auto-down" a node that according to the Failure Detector is considered unreachable.
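
A toy expression of convergence and leader determination as just described (illustrative, not the actual Akka internals):

object ConvergenceSketch {
  // Gossip convergence: every current member appears in the seen set
  def converged(members: Set[String], seen: Set[String]): Boolean =
    members.subsetOf(seen)

  // No election: the leader is simply the first node in sorted order,
  // determined whenever there is gossip convergence
  def leader(members: Set[String], seen: Set[String]): Option[String] =
    if (converged(members, seen)) members.toList.sorted.headOption else None
}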

Page 21: CSC  536 Lecture  8

Akka cluster implementation

To detect whether a node is unreachable

Nodes send each other heartbeats on an ongoing basis
If a node “misses enough” heartbeats, as determined by the accrual failure detector, this triggers unreachable gossip messages from its peers

If a quorum of cluster nodes agree that the node is unreachable and the leader is configured to “auto-down”, it will mark it as down and begin removing the node from the cluster.
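
Auto-downing is enabled through configuration; a minimal sketch (the 30-second timeout is an arbitrary example, and auto-downing is risky in production because a network partition can split the cluster in two):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object AutoDownDemo extends App {
  // After 30s of unreachability, the leader marks the node down and removes it
  val config = ConfigFactory.parseString(
      "akka.cluster.auto-down-unreachable-after = 30s")
    .withFallback(ConfigFactory.load())
  val system = ActorSystem("ClusterSystem", config)
}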

Page 22: CSC  536 Lecture  8

Akka cluster implementation

An accrual failure detector is used to detect if a node is unreachable from the rest of the cluster

an implementation of The Phi Accrual Failure Detector by Hayashibara et al.

An accrual failure detector decouples monitoring and interpretation

Makes “educated” guesses about whether a specific node is up or down, by keeping a history of failure statistics calculated from heartbeats received from other nodes
It returns a phi value representing the likelihood that the node is down
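
A toy version of the phi calculation, using the common logistic approximation of the normal CDF (Akka's real detector additionally maintains the heartbeat history that yields the mean and standard deviation used here):

object PhiSketch {
  // How "surprising" is the current silence, given past heartbeat inter-arrival times?
  def phi(timeSinceLastHeartbeatMs: Double, meanMs: Double, stdDevMs: Double): Double = {
    val y = (timeSinceLastHeartbeatMs - meanMs) / stdDevMs
    // Logistic approximation of the normal cumulative distribution function
    val cdf = 1.0 / (1.0 + math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -math.log10(1.0 - cdf) // large phi: the node is very likely down
  }
}

Akka declares a node unreachable when phi exceeds a configurable threshold (akka.cluster.failure-detector.threshold, 8.0 by default).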

Page 23: CSC  536 Lecture  8

Membership lifecycle

Normal Lifecycle:

A node begins in the joining state
Once all nodes have seen that the new node is joining (through gossip convergence), the leader sets the member state to up
If a node is leaving the cluster in a safe, expected manner, it switches to the leaving state
Once the leader sees convergence on the node in the leaving state, the leader moves it to exiting
Once all nodes have seen the exiting state (convergence), the leader removes the node from the cluster, marking it as removed

Page 24: CSC  536 Lecture  8

Membership lifecycle

Failure Lifecycle:

If a node is unreachable then gossip convergence is not possible and therefore any leader actions are also not possible

To be able to move forward, the state of the unreachable nodes must be changed

It must become reachable again or be marked as down
The cluster can, through the leader, auto-down a node after a configured time of unreachability

Page 25: CSC  536 Lecture  8

Akka cluster node states

Page 26: CSC  536 Lecture  8

Cluster aware routers

All routers can be made aware of member nodes in the cluster

When a node becomes unreachable or leaves the cluster the routees of that node are automatically unregistered from the router

When new nodes join the cluster, additional routees are added to the router, according to the configuration
Routees are also added when a node becomes reachable again, after having been unreachable

Page 27: CSC  536 Lecture  8

Cluster aware routers

There are two distinct types of routers

Group - a router that sends messages to the specified path using actor selection
  The routees can be shared among routers running on different nodes in the cluster
  Example: a worker service running on backend nodes in the cluster and used by routers running on front-end nodes

Pool - a router that creates routees as child actors and deploys them on remote nodes
  Each router will have its own routee instances
  Example: a single master that coordinates jobs and delegates the actual work to routees running on other cluster nodes

Page 28: CSC  536 Lecture  8

Router with Group of Routees

Configuration (stats1.conf):

akka.actor.deployment {
  /statsService/workerRouter {
    router = consistent-hashing-group
    routees.paths = ["/user/statsWorker"]
    cluster {
      enabled = on
      allow-local-routees = on
      use-role = compute
    }
  }
}
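
Inside the StatsService actor, the router is then created from that deployment section; a sketch following the standard Akka stats sample (for a group router no routee Props are needed, because the workers already run at /user/statsWorker on each node):

import akka.actor.{Actor, Props}
import akka.routing.FromConfig
import akka.routing.ConsistentHashingRouter.ConsistentHashableEnvelope

class StatsService extends Actor {
  // Router type, routee paths and cluster awareness all come from
  // the /statsService/workerRouter deployment configuration above
  val workerRouter = context.actorOf(FromConfig.props(Props.empty), name = "workerRouter")

  def receive = {
    case word: String =>
      // With consistent hashing, each message carries a hash key so that
      // the same word always lands on the same worker
      workerRouter forward ConsistentHashableEnvelope(message = word, hashKey = word)
  }
}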

Page 29: CSC  536 Lecture  8

Router with Group of Routees

StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSample.scala

In separate windows:
$ runMain sample.cluster.stats.StatsSample 2551
$ runMain sample.cluster.stats.StatsSample 2552
$ runMain sample.cluster.stats.StatsSampleClient
$ runMain sample.cluster.stats.StatsSample 0

Page 30: CSC  536 Lecture  8

Router with Pool of Remote Deployed Routees

Configuration (stats2.conf):

akka.actor.deployment {
  /statsService/singleton/workerRouter {
    router = consistent-hashing-pool
    cluster {
      enabled = on
      max-nr-of-instances-per-node = 3
      allow-local-routees = on
      use-role = compute
    }
  }
}

Page 31: CSC  536 Lecture  8

Router with Pool of Remote Deployed Routees

StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSampleOneMaster.scala

In separate windows:
$ runMain sample.cluster.stats.StatsSampleOneMaster 2551
$ runMain sample.cluster.stats.StatsSampleOneMaster 2552
$ runMain sample.cluster.stats.StatsSampleOneMasterClient
$ runMain sample.cluster.stats.StatsSampleOneMaster 0

Page 32: CSC  536 Lecture  8

Docker

Page 33: CSC  536 Lecture  8

Problem

How to deploy an Akka cluster application to production servers or the cloud?

The server OS is likely configured differently than the developer’s OS
The server OS may change between runs

Page 34: CSC  536 Lecture  8

Solution

Containers such as Docker separate your applications from your infrastructure and let you treat your infrastructure like a managed application

Page 35: CSC  536 Lecture  8

Docker overview

Page 37: CSC  536 Lecture  8

etcd

Page 38: CSC  536 Lecture  8

Problem

Running an Akka cluster requires establishing a set of seed nodes

all nodes joining the cluster do so by contacting any of the seed nodes over the network, and
one of the seed nodes bootstraps the cluster by "joining itself", establishing a new cluster of size 1

Problem: When running a distributed application on a cloud platform, the IP address of the container where the seed node(s) will execute is generally not known beforehand

Page 39: CSC  536 Lecture  8

A solution

Need an external, fault-tolerant key-value store that can be used to

elect the initial seed node, and then
publish a list of seed nodes afterwards

This would enable a zero-configuration deployment scenario for an Akka cluster

Page 40: CSC  536 Lecture  8

More generally

Need a fault-tolerant service for cluster coordination and state management

the system’s “source of truth”

Page 41: CSC  536 Lecture  8

etcd

Distributed key-value store designed to reliably and quickly preserve and provide access to critical data

Together, the keys form a hierarchical keyspace with directories and keys

Implementation: a replicated state machine that runs Raft under the hood

Allows developers to write systems that agree on the state of values

Page 42: CSC  536 Lecture  8

etcd

The name comes from the idea of distributing the Unix "/etc" directory, where many configuration files live, across multiple machines – /etc distributed

The CoreOS developers used it to make their distributed "/etc" plan work.

Page 43: CSC  536 Lecture  8

etcd API

From the command line to a local etcd instance:

$ etcdctl put mykey "this is awesome"
OK
$ etcdctl get mykey
mykey
this is awesome

In addition: set, rm, mk, mkdir, rmdir, ls, setdir, update, updatedir, watch and watch-exec, (atomic) compareAndSwap, (atomic) compareAndDelete, and others

https://coreos.com/etcd/docs/latest/api.html

Page 44: CSC  536 Lecture  8

etcd API

read/write a value
$ etcdctl get /folder/key
$ etcdctl set /folder/key

read/create a directory
$ etcdctl mkdir /folder
$ etcdctl ls /folder

listen for changes
$ etcdctl watch /folder/key
$ etcdctl exec-watch /folder/key -- /bin/bash -c "touch /tmp/test"

Page 45: CSC  536 Lecture  8

Akka cluster with etcd

The known seed nodes are replaced by a known etcd service

Node A attempts to join the cluster; no one is there, so it puts itself into etcd and starts a single-node cluster as a seed
B joins, finds A in etcd, and joins A
C joins, finds A...
A gets shut down
After a timeout, B or C is put into etcd as the new seed node
D joins, finds B or C...
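
A sketch of this bootstrap logic; the EtcdClient trait here is hypothetical (a stand-in for a real etcd HTTP/gRPC client and its compare-and-swap write), while the Cluster calls are the real Akka API:

import akka.actor.{ActorSystem, AddressFromURIString}
import akka.cluster.Cluster

// Hypothetical etcd client interface, only to express the bootstrap flow
trait EtcdClient {
  def seedNodes(): List[String]              // currently registered seed addresses
  def tryRegisterSeed(self: String): Boolean // atomic compare-and-swap write
}

object EtcdBootstrap {
  def bootstrap(system: ActorSystem, etcd: EtcdClient): Unit = {
    val cluster = Cluster(system)
    etcd.seedNodes() match {
      case Nil =>
        // Nobody there yet: try to become the seed via CAS; on success,
        // "join yourself" to establish a cluster of size 1
        if (etcd.tryRegisterSeed(cluster.selfAddress.toString))
          cluster.join(cluster.selfAddress)
        else
          bootstrap(system, etcd) // lost the race: retry and join the winner
      case seeds =>
        cluster.joinSeedNodes(seeds.map(AddressFromURIString(_)))
    }
  }
}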

Page 46: CSC  536 Lecture  8

Akka cluster with etcd

ConstructR provides libraries for bootstrapping Akka and Cassandra clusters via etcd.

ConstructR first tries to get the seed nodes from the coordination service
If none are available, it tries to acquire a lock, e.g. via a CAS write for etcd, and uses itself or retries getting the nodes
Then it joins using these nodes as seed nodes
After that it adds its address to the nodes and starts the refresh loop

Page 47: CSC  536 Lecture  8

Other etcd uses

Kubernetes, the open source container cluster manager from Google

It takes care of storing and replicating data used by Kubernetes across the entire cluster

Cloud Foundry also uses etcd as their distributed key-value store

Page 48: CSC  536 Lecture  8

Overview of Google’s distributed systems

Page 49: CSC  536 Lecture  8

Original Google search engine architecture

Page 50: CSC  536 Lecture  8

More than just a search engine

Page 51: CSC  536 Lecture  8

Organization of Google’s physical infrastructure

40-80 PCs per rack (terabytes of disk space each)

30+ racks per cluster

Hundreds of clusters spread across data centers worldwide

Page 52: CSC  536 Lecture  8

System architecture requirements

Scalability

Reliability

Performance

Openness (at the beginning, at least)

Page 53: CSC  536 Lecture  8

Overall Google systems architecture

Page 54: CSC  536 Lecture  8

Google infrastructure

Page 55: CSC  536 Lecture  8

Design philosophy

Simplicity
  Software should do one thing and do it well

Provable performance
  Estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
  “every millisecond counts”

Testing
  Stringent testing
  “if it ain’t broke, you’re not trying hard enough”

Page 56: CSC  536 Lecture  8

Data and coordination services

Google File System (GFS)
  Broadly similar to NFS and AFS
  Optimized for the type of files and data access used by Google

BigTable
  A distributed database that stores (semi-)structured data
  Just enough organization and structure for the type of data Google uses

Chubby
  A locking service (and more) for GFS and BigTable

Page 57: CSC  536 Lecture  8

GFS requirements

Must run reliably on the physical platform
  Must tolerate failures of individual commodity components
  So application-level services can rely on the file system

Optimized for Google’s usage patterns
  Huge files (100+ MB, up to 1 GB)
  Relatively small number of files
  Accesses dominated by sequential reads and appends
  Appends done concurrently by hundreds of writers

Meets the requirements of the whole Google infrastructure
  Scalable, reliable, high-performance, open
  Important: throughput has higher priority than latency

Page 58: CSC  536 Lecture  8

GFS API

Familiar file system interface
  The usual file operations to create, delete, open, close, read, and write files

Files are organized hierarchically in directories and identified by pathnames

GFS also has snapshot and record append operations
  Snapshot creates a copy of a file or a directory tree at low cost
  Record append is an atomic append operation

Page 59: CSC  536 Lecture  8

GFS architecture

Files are stored in 64MB chunks in a cluster with
  a master node (operations log replicated on remote machines)
  hundreds of chunkservers

Chunks are replicated 3 times

Page 60: CSC  536 Lecture  8

GFS master

The master maintains all file system metadata (in memory)
  namespace (also kept persistently)
  access control information (also kept persistently)
  mapping from files to chunks (also kept persistently)
  current locations of chunks (NOT kept persistently)

It also controls system-wide activities
  chunk lease management
  garbage collection of orphaned chunks
  chunk migration between chunkservers

After recovering from a crash, the master will
  poll chunkservers to find out which ones have a replica of a given chunk
  periodically communicate with each chunkserver in HeartBeat messages to give it instructions and collect its state

Page 61: CSC  536 Lecture  8

GFS client

GFS client code is linked into each application
  Implements the file system API
  Communicates with the master and chunkservers to read or write data on behalf of the application

Clients interact with the master for metadata operations

All data-bearing communication goes directly to the chunkservers
  So the master is not a bottleneck when doing reads and writes

Page 62: CSC  536 Lecture  8

Reading and writing

When the client wants to access a particular offset in a file
  The GFS client translates this to a (file name, chunk index) pair
  and then sends this to the master

When the master receives the (file name, chunk index) pair
  It replies with the chunk identifier and replica locations

The client then accesses the closest chunk replica directly

No client-side caching
  Caching would not help in the type of (streaming) access GFS has
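
A small sketch of the client-side translation described above (the 64 MB fixed chunk size is from the GFS design; the function and names are illustrative):

object GfsClientSketch {
  val ChunkSize: Long = 64L * 1024 * 1024 // fixed 64 MB chunks

  // Translate a byte offset in a file into the (file name, chunk index) pair
  // sent to the master, plus the offset inside that chunk used when talking
  // directly to the chosen chunkserver.
  def locate(fileName: String, byteOffset: Long): ((String, Long), Long) = {
    val chunkIndex    = byteOffset / ChunkSize
    val offsetInChunk = byteOffset % ChunkSize
    ((fileName, chunkIndex), offsetInChunk)
  }
}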

Page 63: CSC  536 Lecture  8

Keeping chunk replicas consistent

When the master receives a mutation request from a client
  the master grants one chunk replica a lease (that replica becomes the primary)
  and returns the identity of the primary and the other replicas to the client

The client sends the mutation directly to all the replicas
  Replicas cache the mutation and acknowledge receipt

The client then sends a write request to the primary
  The primary orders mutations and applies the updates accordingly
  The primary then requests that the other replicas apply the mutations in the same order
  When all the replicas have acknowledged success, the primary reports an ack to the client

What consistency model does this seem to implement?

Page 64: CSC  536 Lecture  8

Keeping chunk replicas consistent

Page 65: CSC  536 Lecture  8

GFS (non-)guarantees

Writes (at a file offset) are not atomic
  Concurrent writes to the same location may corrupt replicated chunks
  If any replica is left inconsistent, the write fails (and is retried a few times)

Appends are executed atomically “at least once”
  The offset is chosen by the primary
  Replicated chunks may end up non-identical, with some having duplicate appends
  GFS does not guarantee that the replicas are identical
  It only guarantees that some file regions are consistent across replicas

When needed, GFS uses an external locking service (Chubby)
  As well as a leader election service (also Chubby) to select the primary replica

Page 66: CSC  536 Lecture  8

GFS locking needs

Each node in the namespace tree (absolute file names and absolute directory names) has associated read/write locks

To create a file in a directory the master obtains a write lock on the directory

The read lock on a directory name suffices to prevent the directory from being deleted, renamed, or snapshotted

A snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot; read locks over regions of the namespace are used to ensure proper serialization

Page 67: CSC  536 Lecture  8

Bigtable

GFS provides raw data storage

Also needed: storage for structured data...
  ... optimized to handle the needs of Google’s apps...
  ... that is reliable, scalable, high-performance, open, etc.

Page 68: CSC  536 Lecture  8

Examples of structured data

URLs: content, crawl metadata, links, anchors, PageRank, ...

Per-user data: user preference settings, recent queries/search results, ...

Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...

Page 69: CSC  536 Lecture  8

Commercial DB

Why not use a commercial database?
  Not scalable enough
  Too expensive
  A full-featured relational database is not required
  Low-level optimizations may be needed

Page 70: CSC  536 Lecture  8

Bigtable table

Implementation: Sparse, distributed, persistent, multi-dimensional map (row, column, timestamp) → cell contents
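
A toy rendering of that map in types (illustrative only):

object BigtableModel {
  type RowKey    = String // e.g. "com.cnn.www"
  type Column    = String // "family:qualifier", e.g. "anchor:com.cnn.www/sport"
  type Timestamp = Long

  // Sparse, multi-dimensional map from (row, column, timestamp) to an
  // uninterpreted array of bytes (the cell contents)
  type Table = Map[(RowKey, Column, Timestamp), Array[Byte]]
}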

Page 71: CSC  536 Lecture  8

Rows

Each row has a key
  A string up to 64KB in size
  Access to data in a row is atomic

Rows are ordered lexicographically
  Rows close together lexicographically reside on one machine or nearby machines (locality)

Page 72: CSC  536 Lecture  8

Columns

Example row “com.cnn.www”:
  ‘contents:’ → “<html>…”
  ‘anchor:com.cnn.www/sport’ → “CNN Sports”
  ‘anchor:com.cnn.www/world’ → “CNN world”

Columns have a two-level name structure: family:qualifier

Column family
  logical grouping of data
  groups an unbounded number of columns (named with qualifiers)
  may have a single column with no qualifier

Page 73: CSC  536 Lecture  8

Timestamps

Used to store different versions of data in a cell
  defaults to the current time
  can also be set explicitly by the client

Garbage collection
  Per-column-family GC settings
  “Only retain most recent K values in a cell”
  “Keep values until they are older than K seconds”
  ...

Page 74: CSC  536 Lecture  8

API

Create / delete tables and column families

// Open the table and update anchors for row "com.cnn.www"
Table *T = OpenOrDie("/bigtable/web/webtable");
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");
Operation op;
Apply(&op, &r1);

Page 75: CSC  536 Lecture  8

Bigtable architecture

An instance of Bigtable is a cluster that stores tables
  library on the client side
  master server
  tablet servers

A table is decomposed into tablets

Page 76: CSC  536 Lecture  8

Tablets

A table is decomposed into tablets
  A tablet holds a contiguous range of rows
  100MB - 200MB of data per tablet
  A tablet server is responsible for ~100 tablets

Each tablet is represented by
  a set of files stored in GFS
    the files use the SSTable format, a mapping of (string) keys to (string) values
  log files

Page 77: CSC  536 Lecture  8

Tablet Server

When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a specific Chubby directory

The master monitors this directory to discover tablet servers
  The master assigns tablets to tablet servers

Tablet server
  Handles read/write requests to tablets from clients
  Clients obtain server info (i.e., tablet location info) from Chubby

No data goes through the master
  The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
  The metadata table contains metadata about the actual tablets, including location information of the associated SSTables and log files

Page 78: CSC  536 Lecture  8

Master

Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
  provided by the locking service (Chubby)

Monitors tablet servers
  periodically scans the directory of tablet servers provided by the naming service (Chubby)
  keeps track of tablets assigned to its tablet servers
  obtains a lock on the tablet server from the locking service (Chubby)
    the lock is the communication mechanism between master and tablet server

Assigns unassigned tablets in the cluster to tablet servers it monitors
  when a tablet server hosting the tablet goes down, or
  to move tablets around to achieve load balancing

Garbage collects underlying files stored in GFS

Page 79: CSC  536 Lecture  8

BigTable tablet architecture

Each SSTable is an ordered and immutable mapping of keys to values

Page 80: CSC  536 Lecture  8

Tablet Serving

Writes are committed to a log
Memtable: an ordered log of recent commits (in memory)
SSTables really store a snapshot of recent changes

When the memtable gets too big, do a minor compaction
  Create a new empty memtable
  Merge the old memtable with recent SSTables to create a new SSTable

A major compaction results in a single SSTable

Page 81: CSC  536 Lecture  8

SSTable

Operations (a toy sketch follows below)
  Look up the value for a key
  Iterate over all key/value pairs in a specified range

Relies on the lock service (Chubby) to
  ensure there is at most one active master
  administer tablet server death
  store column family information
  store access control lists
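
A toy model of the two SSTable operations above (illustrative; real SSTables are immutable files with an index, not an in-memory map):

import scala.collection.immutable.TreeMap

// Ordered, immutable mapping from string keys to string values
final case class SSTableSketch(entries: TreeMap[String, String]) {
  // Look up the value for a key
  def lookup(key: String): Option[String] = entries.get(key)

  // Iterate over all key/value pairs with keys in [from, until)
  def scan(from: String, until: String): Iterator[(String, String)] =
    entries.iteratorFrom(from).takeWhile { case (k, _) => k < until }
}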

Page 82: CSC  536 Lecture  8

Bigtable locking/metadata storage needs

Bigtable uses a locking service/key-value store (Chubby) for a variety of tasks

to ensure that there is at most one active master at any time
to store the bootstrap location of Bigtable data
to discover tablet servers and finalize tablet server deaths (i.e., to keep the system state)
to store Bigtable schema information (the column family information for each table)
to store access control lists

If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable

Page 83: CSC  536 Lecture  8

Chubby

Chubby provides to the infrastructure
  a locking service
  a file system for reliable storage of small files
  a leader election service (e.g., to select a primary replica)
  a name service

Seemingly violates “simplicity” design philosophy but...

... Chubby really provides an asynchronous distributed agreement service

Page 84: CSC  536 Lecture  8

Chubby API

Page 85: CSC  536 Lecture  8

Overall architecture of Chubby

Cell: a single instance of the Chubby system

5 replicas, 1 master replica

Each replica maintains a database of directories and files/locks

Consistency is achieved using Lamport’s Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically GC the operation log

Page 86: CSC  536 Lecture  8

Paxos distributed consensus algorithm

A distributed consensus protocol for asynchronous systems

Used by servers managing replicas in order to reach agreement on an update when
  messages may be lost, re-ordered, or duplicated
  servers may operate at arbitrary speed and fail
  servers have access to stable persistent storage

Fact: consensus is not always possible in asynchronous systems
Paxos works by ensuring safety (correctness), not liveness (termination)

Page 87: CSC  536 Lecture  8

The Big Picture

Customized solutions for Google-type problems

GFS: stores data reliably
  Just raw files

BigTable: provides a key/value map
  Database-like, but doesn’t provide everything we need

Chubby: locking mechanism
  Handles all synchronization problems

Page 88: CSC  536 Lecture  8

Common Principles

One master, multiple workers
  MapReduce: master coordinates work amongst map/reduce workers
  Chubby: master among five replicas
  Bigtable: master knows about location of tablet servers
  GFS: master coordinates data across chunkservers

Strong consistency models
  Sequential consistency