CSC 536 Lecture 8
Outline
Running Akka applications in the cloud
Akka Cluster
Docker
etcd
Overview of Google’s distributed systems (part 1)
Akka Cluster
Akka cluster
A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck
Allows Actors/ActorSystems to work together without having to specify the nodes they will live in
Encourages development of peer-to-peer applications instead of client-server ones
Allows peers to automatically discover new nodes and remove dead ones
Introduces the concept of "roles" to distinguish different Akka applications within a cluster
Introduces clustered routers that automatically adjust their routees list based on node availability
Akka cluster
A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck
A layer of abstraction above Akka remoting
Akka cluster membership
A cluster is made up of a set of member nodes
Each node is an ActorSystem (having the same name)
The identifier for each node is a hostname:port tuple
An Akka application can be distributed over a cluster with each node hosting some part of the application.
Joining a cluster is initiated by issuing a Join command to one of the nodes in the cluster to join.
Akka cluster events
The membership service allows actors in member nodes to monitor state transitions of other member nodes
To do this, the actor must register as a listener for ClusterDomainEvents
MemberEvent: MemberUp, MemberRemoved
ReachabilityEvent: UnreachableMember, ReachableMember
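For illustration, a minimal sketch of such a listener, following the Akka 2.5 cluster API's subscriber pattern (the actor name and log messages are illustrative):

import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class ClusterListener extends Actor {
  val cluster = Cluster(context.system)

  // Subscribe to member and reachability events; InitialStateAsEvents
  // replays the current cluster state as if the events had just occurred
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member) =>
      println(s"Member is up: ${member.address}")
    case UnreachableMember(member) =>
      println(s"Member detected as unreachable: ${member.address}")
    case MemberRemoved(member, previousStatus) =>
      println(s"Member removed: ${member.address} (was $previousStatus)")
    case _: MemberEvent => // ignore other membership transitions
  }
}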
Akka cluster example
Add dependency to build.sbt:
libraryDependencies += "com.typesafe.akka" %% "akka-cluster" % "2.5.1"
Akka cluster example
Application.conf:

akka {
  actor {
    provider = cluster
  }
  remote {
    netty.tcp {
      hostname = "127.0.0.1"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "akka.tcp://ClusterSystem@127.0.0.1:2552"]
  }
}
The seed nodes are configured contact points for initial, automatic, join of the cluster
Akka cluster example
http://example.lightbend.com/v1/download/akka-samples-cluster-scala
Seed nodes
The seed nodes are configured contact points for new nodes joining the cluster
When a new node is started it sends a message to all seed nodes and then sends a join command to the seed node that answers first
The seed nodes configuration value is only relevant for new nodes joining the cluster as it helps them to find contact points to send the join command to
A new member can send this command to any current member of the cluster, not only to the seed nodes
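For example, a node can also join programmatically instead of relying on the seed-nodes configuration (a minimal sketch; the address values are illustrative):

import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

object JoinExample extends App {
  val system = ActorSystem("ClusterSystem")
  // Join the cluster by contacting any current member (here a seed node)
  val seed = Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2551)
  Cluster(system).join(seed)
}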
Akka cluster example 2
How to make use of cluster membership events?
Example application with backend workers that detect, and then register with, new frontend master nodes
TransformationMessages.scala
TransformationBackend.scala
TransformationFrontend.scala
In separate windows:
$ runMain sample.cluster.transformation.TransformationFrontend 2551
$ runMain sample.cluster.transformation.TransformationBackend 2552
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationFrontend 0
Node roles
Nodes in a cluster need not perform the same function
Some nodes could run the web front-end, others could run the data access layer, and some others could run the number-crunching
Deployment of actors—for example by cluster-aware routers—can take node roles into account to achieve this distribution of responsibilities
The roles of a node are defined in the configuration property named akka.cluster.roles
The roles of the nodes are part of the membership information in MemberEvent that you can subscribe to.
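For example, a node can declare its roles in its configuration, in the same application.conf style as above, or as a JVM system property (a minimal snippet; the role name is illustrative):

akka.cluster.roles = ["compute"]
# equivalently on the command line: -Dakka.cluster.roles.0=compute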
Akka cluster implementation
Cluster membership is communicated using a gossip protocol
The current state of the cluster is gossiped randomly through the cluster
The gossip protocol is a variation of push-pull gossip, used to reduce the amount of gossip information sent around the cluster
a digest is sent representing current versions but not actual values
the recipient of the gossip can then send back any values for which it has newer versions and also request values for which it has outdated versions
based on Amazon's Dynamo system as well as Riak (both to be discussed in a future lecture)
Akka cluster implementation
Vector clocks are used to reconcile and merge differences in cluster state during gossiping
Each update to the cluster state has an accompanying update to the vector clock
The digest used by the push-pull gossip consists of vector timestamps, so that the actual state is pushed only as needed
Akka cluster implementation
The recipient of the gossip vector timestamp needs to determine whether:
it has a newer version of the gossip state, in which case it sends that back to the gossiper
it has an outdated version of the state, in which case the recipient requests the current state from the gossiper by sending back its version of the gossip state
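A toy sketch of the version comparison that drives this exchange (the Map representation is illustrative; Akka's internal VectorClock type differs):

object VectorClockSketch {
  type VClock = Map[String, Long] // member node -> logical counter

  sealed trait Relation
  case object Same extends Relation
  case object Before extends Relation     // local state is outdated
  case object After extends Relation      // local state is newer
  case object Concurrent extends Relation // divergent histories: merge needed

  def compare(a: VClock, b: VClock): Relation = {
    val nodes = a.keySet ++ b.keySet
    val aLeB = nodes.forall(n => a.getOrElse(n, 0L) <= b.getOrElse(n, 0L))
    val bLeA = nodes.forall(n => b.getOrElse(n, 0L) <= a.getOrElse(n, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Same
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }
}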
Akka cluster implementation
Information about the cluster converges locally at a node at certain points in time
This is when a node can prove that the cluster state it is observing has been observed by all other nodes in the cluster
Convergence is implemented by passing around, during gossip, the set of nodes that have seen the current state version
This information is referred to as the seen set
When all nodes are included in the seen set there is gossip convergence
Akka cluster implementation
Gossip convergence cannot occur while any nodes are unreachable
The nodes need to become reachable again, or be moved to the down and removed states
Until then the leader is prevented from performing its cluster membership management role
For example, this means that during a network partition it is not possible to add more nodes to the cluster
Akka cluster implementation
After gossip convergence, a leader for the cluster is chosen
There is no leader election process: the leader is just the first node in the seen set (in sorted order) whenever there is gossip convergence
The role of the leader is to shift members in and out of the cluster, changing joining members to the up state or exiting members to the removed state.
The leader also has the power, if configured so, to "auto-down" a node that according to the Failure Detector is considered unreachable.
Akka cluster implementation
To detect whether a node is unreachable
Nodes send each other heartbeats on an ongoing basis
If a node "misses enough" heartbeats, as determined by the accrual failure detector, this will trigger unreachable gossip messages from its peers
If a quorum of cluster nodes agree that the node is unreachable and the leader is configured to “auto-down”, it will mark it as down and begin removing the node from the cluster.
Akka cluster implementation
An accrual failure detector is used to detect if a node is unreachable from the rest of the cluster
an implementation of The Phi Accrual Failure Detector by Hayashibara et al.
An accrual failure detector decouples monitoring and interpretation
Makes "educated" guesses about whether a specific node is up or down by keeping a history of failure statistics, calculated from heartbeats received from other nodes
It returns a phi value representing the likelihood that the node is down
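For concreteness, the phi computation itself; the formula below is the logistic approximation of the normal CDF used in Akka's PhiAccrualFailureDetector, while the wrapper object and parameter names are illustrative:

object PhiSketch {
  // phi = -log10(probability that a heartbeat arrives even later than now),
  // given the observed mean and standard deviation of inter-arrival times
  def phi(timeSinceLastMs: Double, meanMs: Double, stdDevMs: Double): Double = {
    val y = (timeSinceLastMs - meanMs) / stdDevMs
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastMs > meanMs) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  def main(args: Array[String]): Unit = {
    // With mean 1000 ms and stddev 100 ms, 1300 ms of silence gives
    // phi of about 2.9; phi keeps growing as the silence lengthens, and
    // crossing a configured threshold (8 by default) marks the node unreachable
    println(phi(1300, 1000, 100))
  }
}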
Membership lifecycle
Normal Lifecycle:
A node begins in the joining state
Once all nodes have seen that the new node is joining (through gossip convergence) the leader will set the member state to up
If a node is leaving the cluster in a safe, expected manner then it switches to the leaving state
Once the leader sees convergence on the node in the leaving state, the leader will then move it to exiting
Once all nodes have seen the exiting state (convergence) the leader will remove the node from the cluster, marking it as removed
Membership lifecycle
Failure Lifecycle:
If a node is unreachable then gossip convergence is not possible and therefore any leader actions are also not possible
To be able to move forward, the state of the unreachable nodes must be changed
It must become reachable again or be marked as down
The cluster can, through the leader, auto-down a node after a configured time of unreachability
Akka cluster node states
Cluster aware routers
All routers can be made aware of member nodes in the cluster
When a node becomes unreachable or leaves the cluster the routees of that node are automatically unregistered from the router
When new nodes join the cluster, additional routees are added to the router, according to the configuration
Routees are also added when a node becomes reachable again, after having been unreachable
Cluster aware routers
There are two distinct types of routers
Group - router that sends messages to the specified path using actor selection
The routees can be shared among routers running on different nodes in the cluster
Example: worker service running on backend nodes in the cluster and used by routers running on front-end nodes
Pool - router that creates routees as child actors and deploys them on remote nodes
Each router will have its own routee instances
Example: single master that coordinates jobs and delegates the actual work to routees running on other cluster nodes
Router with Group of Routees
Configuration (stats1.conf):
akka.actor.deployment {
  /statsService/workerRouter {
    router = consistent-hashing-group
    routees.paths = ["/user/statsWorker"]
    cluster {
      enabled = on
      allow-local-routees = on
      use-role = compute
    }
  }
}
Router with Group of Routees
StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSample.scala
In separate windows:
$ runMain sample.cluster.stats.StatsSample 2551
$ runMain sample.cluster.stats.StatsSample 2552
$ runMain sample.cluster.stats.StatsSampleClient
$ runMain sample.cluster.stats.StatsSample 0
Router with Pool of Remote Deployed Routees
Configuration (stats2.conf):
akka.actor.deployment {
  /statsService/singleton/workerRouter {
    router = consistent-hashing-pool
    cluster {
      enabled = on
      max-nr-of-instances-per-node = 3
      allow-local-routees = on
      use-role = compute
    }
  }
}
Router with Pool of Remote Deployed Routees
StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSampleOneMaster.scala
In separate windows:
$ runMain sample.cluster.stats.StatsSampleOneMaster 2551
$ runMain sample.cluster.stats.StatsSampleOneMaster 2552
$ runMain sample.cluster.stats.StatsSampleOneMasterClient
$ runMain sample.cluster.stats.StatsSampleOneMaster 0
Docker
Problem
How to deploy an Akka cluster application to production servers or the cloud?
The server OS is likely configured differently than the developer's OS
The server OS may change between runs
Solution
Containers such as Docker separate your applications from your infrastructure and treat your infrastructure like a managed application
Docker overview
Akka cluster with Docker
Get
http://www.lightbend.com/activator/template/bundle/akka-docker-cluster
etcd
Problem
Running an Akka cluster requires establishing a set of seed nodes
all nodes joining the cluster do so by contacting any of the seed nodes over the network, and
one of the seed nodes bootstraps the cluster by "joining itself" - establishing a new cluster of size 1
Problem: When running a distributed application on a cloud platform, the IP address of the container where the seed node(s) will execute is generally not known beforehand
A solution
Need an external, fault-tolerant key-value store that can be used to
elect the initial seed node, and then
publish a list of seed nodes afterwards
This would enable a zero-configuration deployment scenario for an Akka cluster
More generally
Need a fault-tolerant service for cluster coordination and state management
the system's "source of truth"
etcd
Distributed key-value store designed to reliably and quickly preserve and provide access to critical data
Together, the keys form a hierarchical keyspace with directories and keys
Implementation: a replicated state machine that runs Raft under the hood
Allows developers to write systems that agree on the state of values
etcd
The name comes from the idea of distributing the Unix "/etc" directory, where many configuration files live, across multiple machines – /etc distributed
The CoreOS developers used it to make their distributed "/etc" plan work.
etcd API
From the command line to a local etcd instance:
$ etcdctl put mykey "this is awesome"
OK
$ etcdctl get mykey
mykey
this is awesome
In addition: set, rm, mk, mkdir, rmdir, ls, setdir, update, updatedir, watch and exec-watch, (atomic) compareAndSwap, (atomic) compareAndDelete, and others
https://coreos.com/etcd/docs/latest/api.html
etcd API
read/write a value
$ etcdctl get /folder/key
$ etcdctl set /folder/key value
read/create directory
$ etcdctl mkdir /folder
$ etcdctl ls /folder
listen to changes
$ etcdctl watch /folder/key
$ etcdctl exec-watch /folder/key -- /bin/bash -c "touch /tmp/test"
Akka cluster with etcd
The known seed nodes are replaced by a known etcd service
Node A attempts to join the cluster; no one is there, so it puts itself into etcd and starts a single-node cluster as a seed
B joins, finds A in etcd, and joins A
C joins, finds A, ...
A gets shut down
After a timeout, B or C is put into etcd as the new seed node
D joins, finds B or C, ...
Akka cluster with etcd
ConstructR provides libraries for bootstrapping Akka and Cassandra clusters via etcd.
ConstructR first tries to get the seed nodes from the coordination service
If none are available, it tries to acquire a lock, e.g. via a CAS write for etcd, and uses itself or retries getting the nodes
Then it joins using these nodes as seed nodes
After that it adds its address to the nodes and starts the refresh loop
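A pseudocode sketch of that bootstrap loop (the CoordinationClient trait and all names here are hypothetical; ConstructR's real API differs):

object BootstrapSketch {
  trait CoordinationClient {
    def getSeedNodes(): List[String]
    def tryAcquireLock(self: String): Boolean // e.g. a CAS write in etcd
    def addSelf(self: String): Unit           // TTL-based entry, refreshed periodically
  }

  def bootstrap(self: String, coordination: CoordinationClient): List[String] = {
    val known = coordination.getSeedNodes()
    val seeds =
      if (known.nonEmpty) known
      else if (coordination.tryAcquireLock(self)) List(self) // first node joins itself
      else { Thread.sleep(1000); coordination.getSeedNodes() } // lost the race: retry
    // join the cluster using `seeds`, then register ourselves as a contact point
    coordination.addSelf(self)
    seeds
  }
}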
Other etcd uses
Kubernetes, the open source container cluster manager from Google
It takes care of storing and replicating data used by Kubernetes across the entire cluster
Cloud Foundry also uses etcd as its distributed key-value store
Overview of Google’s distributed systems
Original Google search engine architecture
More than just a search engine
Organization of Google's physical infrastructure
40-80 PCs per rack (terabytes of disk space each)
30+ racks per cluster
Hundreds of clusters spread across data centers worldwide
System architecture requirements
Scalability
Reliability
Performance
Openness (at the beginning, at least)
Overall Google systems architecture
Google infrastructure
Design philosophy
Simplicity
Software should do one thing and do it well
Provable performance
Estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
"every millisecond counts"
Testing
Stringent testing
"if it ain't broke, you're not trying hard enough"
Data and coordination services
Google File System (GFS)
Broadly similar to NFS and AFS
Optimized for the type of files and data access used by Google
BigTable
A distributed database that stores (semi-)structured data
Just enough organization and structure for the type of data Google uses
Chubby
A locking service (and more) for GFS and BigTable
GFS requirements
Must run reliably on the physical platform
Must tolerate failures of individual commodity components
So application-level services can rely on the file system
Optimized for Google's usage patterns
Huge files (100+ MB, up to 1 GB)
Relatively small number of files
Accesses dominated by sequential reads and appends
Appends done concurrently by hundreds of writers
Meets the requirements of the whole Google infrastructure
scalable, reliable, high performance, open
Important: throughput has higher priority than latency
GFS API
Familiar file system interface
usual file operations to create, delete, open, close, read, and write files
Files are organized hierarchically in directories and identified by pathnames
GFS also has snapshot and record append operations
Snapshot creates a copy of a file or a directory tree at low cost
Record append is an atomic append operation
GFS architecture
Files are stored in 64 MB chunks in a cluster with
a master node (operations log replicated on remote machines)
hundreds of chunkservers
Chunks are replicated 3 times
GFS master
The master maintains all file system metadata (in memory)
namespace (also kept persistently)
access control information (also kept persistently)
mapping from files to chunks (also kept persistently)
current locations of chunks (NOT kept persistently)
It also controls system-wide activities
chunk lease management
garbage collection of orphaned chunks
chunk migration between chunkservers
After recovering from a crash, the master will
poll chunkservers to find out who has a replica of a given chunk
periodically communicate with each chunkserver in HeartBeat messages to give it instructions and collect its state
GFS client
GFS client code is linked into each application
Implements the file system API
Communicates with the master and chunkservers to read or write data on behalf of the application
Clients interact with the master for metadata operations
All data-bearing communication goes directly to the chunkservers
So the master is not a bottleneck when doing reads and writes
Reading and writing
When the client wants to access a particular offset in a file
the GFS client translates this to a (file name, chunk index) pair
and then sends this to the master
When the master receives the (file name, chunk index) pair
it replies with the chunk identifier and replica locations
The client then accesses the closest chunk replica directly
No client-side caching
Caching would not help in the type of (streaming) access GFS has
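A minimal sketch of the client-side offset-to-chunk translation, assuming the fixed 64 MB chunk size (names are illustrative; GFS's real client library is internal C++):

object GfsClientSketch {
  val ChunkSize: Long = 64L * 1024 * 1024 // fixed 64 MB chunks

  // The client turns a byte offset into a chunk index; it then asks the
  // master only for that chunk's identifier and replica locations
  def chunkIndex(byteOffset: Long): Long = byteOffset / ChunkSize

  def main(args: Array[String]): Unit = {
    println(chunkIndex(200L * 1000 * 1000)) // byte offset 200,000,000 is in chunk 2
  }
}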
Keeping chunk replicas consistent
When the master receives a mutation request from a client
the master grants a chunk replica a lease (that replica becomes the primary)
and returns the identity of the primary and other replicas to the client
The client sends the mutation directly to all the replicas
Replicas cache the mutation and acknowledge receipt
The client sends a write request to the primary
The primary orders mutations and updates accordingly
The primary then requests that other replicas apply the mutations in the same order
When all the replicas have acknowledged success, the primary reports an ack to the client
What consistency model does this seem to implement?
Keeping chunk replicas consistent
GFS (non-)guarantees
Writes (at a file offset) are not atomic
Concurrent writes to the same location may corrupt replicated chunks
If any replica is left inconsistent, the write fails (and is retried a few times)
Appends are executed atomically "at least once"
Offset is chosen by the primary
May end up with non-identical replicated chunks, with some having duplicate appends
GFS does not guarantee that the replicas are identical
It only guarantees that some file regions are consistent across replicas
When needed, GFS uses an external locking service (Chubby)
As well as a leader election service (also Chubby) to select the primary replica
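Because record append is at-least-once, the GFS paper has readers tolerate duplicates, e.g. by de-duplicating on unique record IDs that writers embed. A REPL-style toy sketch of that reader-side filter (the types are illustrative):

def dedup(records: Iterator[(Long, Array[Byte])]): Iterator[Array[Byte]] = {
  val seen = scala.collection.mutable.Set.empty[Long]
  // keep only the first occurrence of each writer-assigned record ID
  records.collect { case (id, payload) if seen.add(id) => payload }
}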
GFS locking needs
Each node in the namespace tree has associated read/write locks
absolute file names and
absolute directory names
To create a file in a directory the master obtains a write lock on the directory
The read lock on a directory name suffices to prevent the directory from being deleted, renamed, or snapshotted
A snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot; read locks over regions of the namespace are used to ensure proper serialization
Bigtable
GFS provides raw data storage
Also needed:
Storage for structured data ...
... optimized to handle the needs of Google's apps ...
... that is reliable, scalable, high-performance, open, etc.
Examples of structured data
URLs: content, crawl metadata, links, anchors, PageRank, ...
Per-user data: user preference settings, recent queries/search results, ...
Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
Commercial DB
Why not use a commercial database?
Not scalable enough
Too expensive
A full-featured relational database is not required
Low-level optimizations may be needed
Bigtable table
Implementation: Sparse, distributed, persistent, multi-dimensional map (row, column, timestamp) → cell contents
Rows
Each row has a key
A string up to 64KB in size
Access to data in a row is atomic
Rows are ordered lexicographically
Rows close together lexicographically reside on one machine or nearby machines (locality)
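A quick REPL-style illustration of why row-key design matters under lexicographic ordering: reversed domain names keep pages from the same site adjacent (the keys are just examples):

val rowKeys = List("com.cnn.www/world", "org.example.www", "com.cnn.www/sport")
// sorting groups the com.cnn.www pages together, so they tend to land
// on the same (or nearby) tablet servers
println(rowKeys.sorted)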
Columns
Example row "com.cnn.www":
'contents:' → "<html>…"
'anchor:com.cnn.www/sport' → "CNN Sports"
'anchor:com.cnn.www/world' → "CNN world"
Columns have a two-level name structure: family:qualifier
Column family
logical grouping of data
groups an unbounded number of columns (named with qualifiers)
may have a single column with no qualifier
Timestamps
Used to store different versions of data in a cell
default to current time
can also be set explicitly by the client
Garbage collection
Per-column-family GC settings
"Only retain most recent K values in a cell"
"Keep values until they are older than K seconds"
...
API
Create / delete tables and column families
Table *T = OpenOrDie("/bigtable/web/webtable");
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");
Operation op;
Apply(&op, &r1);
Bigtable architecture
An instance of BigTable is a cluster that stores tables
library on client side
master server
tablet servers
table is decomposed into tablets
Tablets
A table is decomposed into tablets
A tablet holds a contiguous range of rows
100MB - 200MB of data per tablet
A tablet server is responsible for ~100 tablets
Each tablet is represented by
a set of files stored in GFS
The files use the SSTable format, a mapping of (string) keys to (string) values
Log files
Tablet Server
When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a specific Chubby directory
The master monitors this directory to discover tablet servers
The master assigns tablets to tablet servers
Tablet server
Handles read/write requests to its tablets from clients
Clients obtain server info (i.e., tablet location info) from Chubby
No data goes through the master
The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
The metadata table contains metadata about actual tablets
including location information of associated SSTables and log files
Master
Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
provided by locking service (Chubby)
Monitors tablet servers
periodically scans the directory of tablet servers provided by the naming service (Chubby)
keeps track of tablets assigned to its tablet servers
obtains a lock on the tablet server from the locking service (Chubby)
the lock is the communication mechanism between master and tablet server
Assigns unassigned tablets in the cluster to the tablet servers it monitors
when a tablet server hosting the tablet goes down, or ...
... to move tablets around to achieve load balancing
Garbage collects underlying files stored in GFS
BigTable tablet architecture
Each SSTable is an ordered and immutable mapping of keys to values
Tablet Serving
Writes are committed to a log
Memtable: ordered log of recent commits (in memory)
SSTables store a snapshot of recent changes
When the Memtable gets too big, do a minor compaction
Create a new empty Memtable
Merge the old Memtable with recent SSTables to create a new SSTable
A major compaction results in a single SSTable
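A toy sketch of the merge step (SortedMap stands in for the on-disk SSTable format; the names are illustrative, not Bigtable's internals):

object CompactionSketch {
  import scala.collection.immutable.SortedMap
  type Table = SortedMap[String, String]

  // `recent` is ordered oldest to newest, so later tables (and finally
  // the memtable) overwrite earlier values for the same key
  def compact(memtable: Table, recent: List[Table]): Table =
    recent.foldLeft(SortedMap.empty[String, String])(_ ++ _) ++ memtable
}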
SSTable
Operations
Look up value for key
Iterate over all key/value pairs in a specified range
Relies on lock service (Chubby)
Ensure there is at most one active master
Administer tablet server death
Store column family information
Store access control lists
Bigtable locking/metadata storage needs
Bigtable uses a locking service/key-value store (Chubby) for a variety of tasks
to ensure that there is at most one active master at any time
to store the bootstrap location of Bigtable data
to discover tablet servers and finalize tablet server deaths (i.e., to keep the system state)
to store Bigtable schema information (the column family information for each table)
to store access control lists
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable
Chubby
Chubby provides to the infrastructure
a locking service
a file system for reliable storage of small files
a leader election service (e.g. to select a primary replica)
a name service
Seemingly violates “simplicity” design philosophy but...
... Chubby really provides an asynchronous distributed agreement service
Chubby API
Overall architecture of Chubby
Cell: a single instance of the Chubby system
5 replicas
1 master replica
Each replica maintains a database of directories and files/locks
Consistency is achieved using Lamport's Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically GC the operation log
Paxos distributed consensus algorithm
A distributed consensus protocol for asynchronous systems
Used by servers managing replicas in order to reach agreement on an update when
messages may be lost, re-ordered, or duplicated
servers may operate at arbitrary speed and fail
servers have access to stable persistent storage
Fact: consensus is not always possible in asynchronous systems
Paxos works by ensuring safety (correctness), not liveness (termination)
The Big Picture
Customized solutions for Google-type problems
GFS: stores data reliably
Just raw files
BigTable: provides a key/value map
Database-like, but doesn't provide everything we need
Chubby: locking mechanism
Handles all synchronization problems
Common Principles
One master, multiple workers
MapReduce: master coordinates work amongst map/reduce workers
Chubby: master among five replicas
Bigtable: master knows about location of tablet servers
GFS: master coordinates data across chunkservers
Strong consistency models
Sequential consistency