CSC 536 Lecture 8
Outline
Running Akka applications in the cloud
Akka Cluster
Docker
etcd
Overview of Google’s distributed systems (part 1)
Akka Cluster
Akka cluster
A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck
Allows Actors/ActorSystems to work together without having to specify the nodes they will live in
Encourages development of peer-to-peer applications instead of client-server ones
Allows peers to automatically discover new nodes and remove dead ones
Introduces the concept of "roles" to distinguish different Akka applications within a cluster
Introduces clustered routers that automatically adjust their routees list based on node availability
Akka cluster
A fault-tolerant, elastic, decentralized p2p-based cluster membership service with no single point of failure or bottleneck
A layer of abstraction above Akka remoting
Akka cluster membership
A cluster is made up of a set of member nodes
Each node is an ActorSystem (having the same name)
The identifier for each node is a hostname:port tuple
An Akka application can be distributed over a cluster with each node hosting some part of the application.
Joining a cluster is initiated by issuing a Join command to one of the nodes in the cluster to join.
Akka cluster events
The membership service allows actors in member nodes to monitor state transitions of other member nodes
To do this, the actor must register as a listener for ClusterDomainEvents
MemberEvent: MemberUp, MemberRemoved
ReachabilityEvent: UnreachableMember, ReachableMember
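For illustration, a minimal sketch of such a listener, following the Akka 2.5 cluster API's subscriber pattern (the actor name and log messages are illustrative):

import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class ClusterListener extends Actor {
  val cluster = Cluster(context.system)

  // Subscribe to member and reachability events; InitialStateAsEvents
  // replays the current cluster state as if the events had just occurred
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member) =>
      println(s"Member is up: ${member.address}")
    case UnreachableMember(member) =>
      println(s"Member detected as unreachable: ${member.address}")
    case MemberRemoved(member, previousStatus) =>
      println(s"Member removed: ${member.address} (was $previousStatus)")
    case _: MemberEvent => // ignore other membership transitions
  }
}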
Akka cluster example
Add dependency to build.sbt:
libraryDependencies += "com.typesafe.akka" %% "akka-cluster" % "2.5.1"
Akka cluster example
Application.conf:

akka {
  actor {
    provider = cluster
  }
  remote {
    netty.tcp {
      hostname = "127.0.0.1"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "akka.tcp://ClusterSystem@127.0.0.1:2552"]
  }
}
The seed nodes are configured contact points for initial, automatic, join of the cluster
Akka cluster example
http://example.lightbend.com/v1/download/akka-samples-cluster-scala
Seed nodes
The seed nodes are configured contact points for new nodes joining the cluster
When a new node is started it sends a message to all seed nodes and then sends a join command to the seed node that answers first
The seed nodes configuration value is only relevant for new nodes joining the cluster as it helps them to find contact points to send the join command to
A new member can send this command to any current member of the cluster, not only to the seed nodes
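For example, a node can also join programmatically instead of relying on the seed-nodes configuration (a minimal sketch; the address values are illustrative):

import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

object JoinExample extends App {
  val system = ActorSystem("ClusterSystem")
  // Join the cluster by contacting any current member (here a seed node)
  val seed = Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2551)
  Cluster(system).join(seed)
}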
Akka cluster example 2
How to make use of cluster membership events?
Example application with backend workers that detect, and then register with, new frontend master nodes
TransformationMessages.scala
TransformationBackend.scala
TransformationFrontend.scala
In separate windows:
$ runMain sample.cluster.transformation.TransformationFrontend 2551
$ runMain sample.cluster.transformation.TransformationBackend 2552
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationBackend 0
$ runMain sample.cluster.transformation.TransformationFrontend 0
Node roles
Nodes in a cluster need not perform the same function
Some nodes could run the web front-end, others could run the data access layer, and some others could run the number-crunching
Deployment of actors—for example by cluster-aware routers—can take node roles into account to achieve this distribution of responsibilities
The roles of a node are defined in the configuration property named akka.cluster.roles
The roles of the nodes are part of the membership information in MemberEvent that you can subscribe to.
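For example, a node can declare its roles in its configuration, in the same application.conf style as above, or as a JVM system property (a minimal snippet; the role name is illustrative):

akka.cluster.roles = ["compute"]
# equivalently on the command line: -Dakka.cluster.roles.0=compute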
Akka cluster implementation
Cluster membership is communicated using a gossip protocol
The current state of the cluster is gossiped randomly through the cluster
The gossip protocol is a variation of push-pull gossip, used to reduce the amount of gossip information sent around the cluster
a digest is sent representing current versions but not actual values
the recipient of the gossip can then send back any values for which it has newer versions and also request values for which it has outdated versions
based on Amazon's Dynamo system as well as Riak (both to be discussed in a future lecture)
Akka cluster implementation
Vector clocks are used to reconcile and merge differences in cluster state during gossiping
Each update to the cluster state has an accompanying update to the vector clock
The digest used by the push-pull gossip consists of vector timestamps, so that the actual state is pushed only as needed
Akka cluster implementation
The recipient of the gossip vector timestamp needs to determine whether:
it has a newer version of the gossip state, in which case it sends that back to the gossiper
it has an outdated version of the state, in which case the recipient requests the current state from the gossiper by sending back its version of the gossip state
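A toy sketch of the version comparison that drives this exchange (the Map representation is illustrative; Akka's internal VectorClock type differs):

object VectorClockSketch {
  type VClock = Map[String, Long] // member node -> logical counter

  sealed trait Relation
  case object Same extends Relation
  case object Before extends Relation     // local state is outdated
  case object After extends Relation      // local state is newer
  case object Concurrent extends Relation // divergent histories: merge needed

  def compare(a: VClock, b: VClock): Relation = {
    val nodes = a.keySet ++ b.keySet
    val aLeB = nodes.forall(n => a.getOrElse(n, 0L) <= b.getOrElse(n, 0L))
    val bLeA = nodes.forall(n => b.getOrElse(n, 0L) <= a.getOrElse(n, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Same
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }
}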
Akka cluster implementation
Information about the cluster converges locally at a node at certain points in time
This is when a node can prove that the cluster state it is observing has been observed by all other nodes in the cluster
Convergence is implemented by passing around, during gossip, the set of nodes that have seen the current state version
This information is referred to as the seen set
When all nodes are included in the seen set there is gossip convergence
Akka cluster implementation
Gossip convergence cannot occur while any nodes are unreachable
The nodes need to become reachable again, or be moved to the down and removed states
Until then the leader is prevented from performing its cluster membership management role
For example, this means that during a network partition it is not possible to add more nodes to the cluster
Akka cluster implementation
After gossip convergence, a leader for the cluster is chosen
There is no leader election process: the leader is just the first node in the seen set (in sorted order) whenever there is gossip convergence
The role of the leader is to shift members in and out of the cluster, changing joining members to the up state or exiting members to the removed state.
The leader also has the power, if configured so, to "auto-down" a node that according to the Failure Detector is considered unreachable.
Akka cluster implementation
To detect whether a node is unreachable
Nodes send each other heartbeats on an ongoing basis
If a node "misses enough" heartbeats, as determined by the accrual failure detector, this will trigger unreachable gossip messages from its peers
If a quorum of cluster nodes agree that the node is unreachable and the leader is configured to “auto-down”, it will mark it as down and begin removing the node from the cluster.
Akka cluster implementation
An accrual failure detector is used to detect if a node is unreachable from the rest of the cluster
an implementation of The Phi Accrual Failure Detector by Hayashibara et al.
An accrual failure detector decouples monitoring and interpretation
Makes "educated" guesses about whether a specific node is up or down by keeping a history of failure statistics, calculated from heartbeats received from other nodes
It returns a phi value representing the likelihood that the node is down
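For concreteness, the phi computation itself; the formula below is the logistic approximation of the normal CDF used in Akka's PhiAccrualFailureDetector, while the wrapper object and parameter names are illustrative:

object PhiSketch {
  // phi = -log10(probability that a heartbeat arrives even later than now),
  // given the observed mean and standard deviation of inter-arrival times
  def phi(timeSinceLastMs: Double, meanMs: Double, stdDevMs: Double): Double = {
    val y = (timeSinceLastMs - meanMs) / stdDevMs
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastMs > meanMs) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  def main(args: Array[String]): Unit = {
    // With mean 1000 ms and stddev 100 ms, 1300 ms of silence gives
    // phi of about 2.9; phi keeps growing as the silence lengthens, and
    // crossing a configured threshold (8 by default) marks the node unreachable
    println(phi(1300, 1000, 100))
  }
}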
Membership lifecycle
Normal Lifecycle:
A node begins in the joining state
Once all nodes have seen that the new node is joining (through gossip convergence) the leader will set the member state to up
If a node is leaving the cluster in a safe, expected manner then it switches to the leaving state
Once the leader sees convergence on the node in the leaving state, the leader will then move it to exiting
Once all nodes have seen the exiting state (convergence) the leader will remove the node from the cluster, marking it as removed
Membership lifecycle
Failure Lifecycle:
If a node is unreachable then gossip convergence is not possible and therefore any leader actions are also not possible
To be able to move forward, the state of the unreachable nodes must be changed
It must become reachable again or be marked as down
The cluster can, through the leader, auto-down a node after a configured time of unreachability
Akka cluster node states
Cluster aware routers
All routers can be made aware of member nodes in the cluster
When a node becomes unreachable or leaves the cluster the routees of that node are automatically unregistered from the router
When new nodes join the cluster, additional routees are added to the router, according to the configuration
Routees are also added when a node becomes reachable again, after having been unreachable
Cluster aware routers
There are two distinct types of routers
Group - router that sends messages to the specified path using actor selection
The routees can be shared among routers running on different nodes in the cluster
Example: worker service running on backend nodes in the cluster and used by routers running on front-end nodes
Pool - router that creates routees as child actors and deploys them on remote nodes
Each router will have its own routee instances
Example: single master that coordinates jobs and delegates the actual work to routees running on other cluster nodes
Router with Group of Routees
Configuration (stats1.conf):
akka.actor.deployment {
  /statsService/workerRouter {
    router = consistent-hashing-group
    routees.paths = ["/user/statsWorker"]
    cluster {
      enabled = on
      allow-local-routees = on
      use-role = compute
    }
  }
}
Router with Group of Routees
StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSample.scala
In separate windows:
$ runMain sample.cluster.stats.StatsSample 2551
$ runMain sample.cluster.stats.StatsSample 2552
$ runMain sample.cluster.stats.StatsSampleClient
$ runMain sample.cluster.stats.StatsSample 0
Router with Pool of Remote Deployed Routees
Configuration (stats2.conf):
akka.actor.deployment {
  /statsService/singleton/workerRouter {
    router = consistent-hashing-pool
    cluster {
      enabled = on
      max-nr-of-instances-per-node = 3
      allow-local-routees = on
      use-role = compute
    }
  }
}
Router with Pool of Remote Deployed Routees
StatsMessage.scala
StatsWorker.scala
StatsService.scala
StatsSampleOneMaster.scala
In separate windows:
$ runMain sample.cluster.stats.StatsSampleOneMaster 2551
$ runMain sample.cluster.stats.StatsSampleOneMaster 2552
$ runMain sample.cluster.stats.StatsSampleOneMasterClient
$ runMain sample.cluster.stats.StatsSampleOneMaster 0
Docker
Problem
How to deploy an Akka cluster application to production servers or the cloud?
The server OS is likely configured differently than the developer's OS
The server OS may change between runs
Solution
Containers such as Docker separate your applications from your infrastructure and treat your infrastructure like a managed application
Docker overview
Akka cluster with Docker
Get
http://www.lightbend.com/activator/template/bundle/akka-docker-cluster
etcd
Problem
Running an Akka cluster requires establishing a set of seed nodes
all nodes joining the cluster do so by contacting any of the seed nodes over the network, and
one of the seed nodes bootstraps the cluster by "joining itself" - establishing a new cluster of size 1
Problem: When running a distributed application on a cloud platform, the IP address of the container where the seed node(s) will execute is generally not known beforehand
A solution
Need an external, fault-tolerant key-value store that can be used to
elect the initial seed node, and then
publish a list of seed nodes afterwards
This would enable a zero-configuration deployment scenario for an Akka cluster
More generally
Need a fault-tolerant service for cluster coordination and state management
the system's "source of truth"
etcd
Distributed key-value store designed to reliably and quickly preserve and provide access to critical data
Together, the keys form a hierarchical keyspace with directories and keys
Implementation: a replicated state machine that runs Raft under the hood
Allows developers to write systems that agree on the state of values
etcd
The name comes from the idea of distributing the Unix "/etc" directory, where many configuration files live, across multiple machines – /etc distributed
The CoreOS developers used it to make their distributed "/etc" plan work.
etcd API
From the command line to a local etcd instance:
$ etcdctl put mykey "this is awesome"
OK
$ etcdctl get mykey
mykey
this is awesome
In addition: set, rm, mk, mkdir, rmdir, ls, setdir, update, updatedir, watch and exec-watch, (atomic) compareAndSwap, (atomic) compareAndDelete, and others
https://coreos.com/etcd/docs/latest/api.html
etcd API
read/write a value
$ etcdctl get /folder/key
$ etcdctl set /folder/key value
read/create directory
$ etcdctl mkdir /folder
$ etcdctl ls /folder
listen to changes
$ etcdctl watch /folder/key
$ etcdctl exec-watch /folder/key -- /bin/bash -c "touch /tmp/test"
Akka cluster with etcd
The known seed nodes are replaced by a known etcd service
Node A attempts to join the cluster; no one is there, so it puts itself into etcd and starts a single-node cluster as a seed
B joins, finds A in etcd, and joins A
C joins, finds A, ...
A gets shut down
After a timeout, B or C is put into etcd as the new seed node
D joins, finds B or C, ...
Akka cluster with etcd
ConstructR provides libraries for bootstrapping Akka and Cassandra clusters via etcd.
ConstructR first tries to get the seed nodes from the coordination service
If none are available, it tries to acquire a lock, e.g. via a CAS write for etcd, and uses itself or retries getting the nodes
Then it joins using these nodes as seed nodes
After that it adds its address to the nodes and starts the refresh loop
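A pseudocode sketch of that bootstrap loop (the CoordinationClient trait and all names here are hypothetical; ConstructR's real API differs):

object BootstrapSketch {
  trait CoordinationClient {
    def getSeedNodes(): List[String]
    def tryAcquireLock(self: String): Boolean // e.g. a CAS write in etcd
    def addSelf(self: String): Unit           // TTL-based entry, refreshed periodically
  }

  def bootstrap(self: String, coordination: CoordinationClient): List[String] = {
    val known = coordination.getSeedNodes()
    val seeds =
      if (known.nonEmpty) known
      else if (coordination.tryAcquireLock(self)) List(self) // first node joins itself
      else { Thread.sleep(1000); coordination.getSeedNodes() } // lost the race: retry
    // join the cluster using `seeds`, then register ourselves as a contact point
    coordination.addSelf(self)
    seeds
  }
}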
Other etcd uses
Kubernetes, the open source container cluster manager from Google
It takes care of storing and replicating data used by Kubernetes across the entire cluster
Cloud Foundry also uses etcd as its distributed key-value store
Overview of Google’s distributed systems
Original Google search engine architecture
More than just a search engine
Organization of Google's physical infrastructure
40-80 PCs per rack (terabytes of disk space each)
30+ racks per cluster
Hundreds of clusters spread across data centers worldwide
System architecture requirements
Scalability
Reliability
Performance
Openness (at the beginning, at least)
Overall Google systems architecture
Google infrastructure
Design philosophy
Simplicity
Software should do one thing and do it well
Provable performance
Estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
"every millisecond counts"
Testing
Stringent testing
"if it ain't broke, you're not trying hard enough"
Data and coordination services
Google File System (GFS)
Broadly similar to NFS and AFS
Optimized for the type of files and data access used by Google
BigTable
A distributed database that stores (semi-)structured data
Just enough organization and structure for the type of data Google uses
Chubby
A locking service (and more) for GFS and BigTable
GFS requirements
Must run reliably on the physical platform
Must tolerate failures of individual commodity components
So application-level services can rely on the file system
Optimized for Google's usage patterns
Huge files (100+ MB, up to 1 GB)
Relatively small number of files
Accesses dominated by sequential reads and appends
Appends done concurrently by hundreds of writers
Meets the requirements of the whole Google infrastructure
scalable, reliable, high performance, open
Important: throughput has higher priority than latency
GFS API
Familiar file system interface
usual file operations to create, delete, open, close, read, and write files
Files are organized hierarchically in directories and identified by pathnames
GFS also has snapshot and record append operations
Snapshot creates a copy of a file or a directory tree at low cost
Record append is an atomic append operation
GFS architecture
Files are stored in 64 MB chunks in a cluster with
a master node (operations log replicated on remote machines)
hundreds of chunkservers
Chunks are replicated 3 times
GFS master
The master maintains all file system metadata (in memory)
namespace (also kept persistently)
access control information (also kept persistently)
mapping from files to chunks (also kept persistently)
current locations of chunks (NOT kept persistently)
It also controls system-wide activities
chunk lease management
garbage collection of orphaned chunks
chunk migration between chunkservers
After recovering from a crash, the master will
poll chunkservers to find out who has a replica of a given chunk
periodically communicate with each chunkserver in HeartBeat messages to give it instructions and collect its state
GFS client
GFS client code is linked into each application
Implements the file system API
Communicates with the master and chunkservers to read or write data on behalf of the application
Clients interact with the master for metadata operations
All data-bearing communication goes directly to the chunkservers
So the master is not a bottleneck when doing reads and writes
Reading and writing
When the client wants to access a particular offset in a file
the GFS client translates this to a (file name, chunk index) pair
and then sends this to the master
When the master receives the (file name, chunk index) pair
it replies with the chunk identifier and replica locations
The client then accesses the closest chunk replica directly
No client-side caching
Caching would not help in the type of (streaming) access GFS has
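A minimal sketch of the client-side offset-to-chunk translation, assuming the fixed 64 MB chunk size (names are illustrative; GFS's real client library is internal C++):

object GfsClientSketch {
  val ChunkSize: Long = 64L * 1024 * 1024 // fixed 64 MB chunks

  // The client turns a byte offset into a chunk index; it then asks the
  // master only for that chunk's identifier and replica locations
  def chunkIndex(byteOffset: Long): Long = byteOffset / ChunkSize

  def main(args: Array[String]): Unit = {
    println(chunkIndex(200L * 1000 * 1000)) // byte offset 200,000,000 is in chunk 2
  }
}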
Keeping chunk replicas consistent
When the master receives a mutation request from a client
the master grants a chunk replica a lease (that replica becomes the primary)
and returns the identity of the primary and other replicas to the client
The client sends the mutation directly to all the replicas
Replicas cache the mutation and acknowledge receipt
The client sends a write request to the primary
The primary orders mutations and updates accordingly
The primary then requests that other replicas apply the mutations in the same order
When all the replicas have acknowledged success, the primary reports an ack to the client
What consistency model does this seem to implement?
Keeping chunk replicas consistent
GFS (non-)guarantees
Writes (at a file offset) are not atomic
Concurrent writes to the same location may corrupt replicated chunks
If any replica is left inconsistent, the write fails (and is retried a few times)
Appends are executed atomically "at least once"
Offset is chosen by the primary
May end up with non-identical replicated chunks, with some having duplicate appends
GFS does not guarantee that the replicas are identical
It only guarantees that some file regions are consistent across replicas
When needed, GFS uses an external locking service (Chubby)
As well as a leader election service (also Chubby) to select the primary replica
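Because record append is at-least-once, the GFS paper has readers tolerate duplicates, e.g. by de-duplicating on unique record IDs that writers embed. A REPL-style toy sketch of that reader-side filter (the types are illustrative):

def dedup(records: Iterator[(Long, Array[Byte])]): Iterator[Array[Byte]] = {
  val seen = scala.collection.mutable.Set.empty[Long]
  // keep only the first occurrence of each writer-assigned record ID
  records.collect { case (id, payload) if seen.add(id) => payload }
}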
GFS locking needs
Each node in the namespace tree has associated read/write locks
absolute file names and
absolute directory names
To create a file in a directory the master obtains a write lock on the directory
The read lock on a directory name suffices to prevent the directory from being deleted, renamed, or snapshotted
A snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot; read locks over regions of the namespace are used to ensure proper serialization
Bigtable
GFS provides raw data storage
Also needed:
Storage for structured data ...
... optimized to handle the needs of Google's apps ...
... that is reliable, scalable, high-performance, open, etc.
Examples of structured data
URLs: content, crawl metadata, links, anchors, PageRank, ...
Per-user data: user preference settings, recent queries/search results, ...
Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
Commercial DB
Why not use a commercial database?
Not scalable enough
Too expensive
A full-featured relational database is not required
Low-level optimizations may be needed
Bigtable table
Implementation: Sparse, distributed, persistent, multi-dimensional map (row, column, timestamp) → cell contents
Rows
Each row has a key
A string up to 64KB in size
Access to data in a row is atomic
Rows are ordered lexicographically
Rows close together lexicographically reside on one machine or nearby machines (locality)
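A quick REPL-style illustration of why row-key design matters under lexicographic ordering: reversed domain names keep pages from the same site adjacent (the keys are just examples):

val rowKeys = List("com.cnn.www/world", "org.example.www", "com.cnn.www/sport")
// sorting groups the com.cnn.www pages together, so they tend to land
// on the same (or nearby) tablet servers
println(rowKeys.sorted)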
Columns
Example row "com.cnn.www":
'contents:' → "<html>…"
'anchor:com.cnn.www/sport' → "CNN Sports"
'anchor:com.cnn.www/world' → "CNN world"
Columns have a two-level name structure: family:qualifier
Column family
logical grouping of data
groups an unbounded number of columns (named with qualifiers)
may have a single column with no qualifier
Timestamps
Used to store different versions of data in a cell
default to current time
can also be set explicitly by the client
Garbage collection
Per-column-family GC settings
"Only retain most recent K values in a cell"
"Keep values until they are older than K seconds"
...
API
Create / delete tables and column families
Table *T = OpenOrDie("/bigtable/web/webtable");
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");
Operation op;
Apply(&op, &r1);
Bigtable architecture
An instance of BigTable is a cluster that stores tables
library on client side
master server
tablet servers
table is decomposed into tablets
Tablets
A table is decomposed into tablets
A tablet holds a contiguous range of rows
100MB - 200MB of data per tablet
A tablet server is responsible for ~100 tablets
Each tablet is represented by
a set of files stored in GFS
The files use the SSTable format, a mapping of (string) keys to (string) values
Log files
Tablet Server
When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a specific Chubby directory
The master monitors this directory to discover tablet servers
The master assigns tablets to tablet servers
Tablet server
Handles read/write requests to its tablets from clients
Clients obtain server info (i.e., tablet location info) from Chubby
No data goes through the master
The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
The metadata table contains metadata about actual tablets
including location information of associated SSTables and log files
Master
Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
provided by locking service (Chubby)
Monitors tablet servers
periodically scans the directory of tablet servers provided by the naming service (Chubby)
keeps track of tablets assigned to its tablet servers
obtains a lock on the tablet server from the locking service (Chubby)
the lock is the communication mechanism between master and tablet server
Assigns unassigned tablets in the cluster to the tablet servers it monitors
when a tablet server hosting the tablet goes down, or ...
... to move tablets around to achieve load balancing
Garbage collects underlying files stored in GFS
BigTable tablet architecture
Each SSTable is an ordered and immutable mapping of keys to values
Tablet Serving
Writes are committed to a log
Memtable: ordered log of recent commits (in memory)
SSTables store a snapshot of recent changes
When the Memtable gets too big, do a minor compaction
Create a new empty Memtable
Merge the old Memtable with recent SSTables to create a new SSTable
A major compaction results in a single SSTable
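A toy sketch of the merge step (SortedMap stands in for the on-disk SSTable format; the names are illustrative, not Bigtable's internals):

object CompactionSketch {
  import scala.collection.immutable.SortedMap
  type Table = SortedMap[String, String]

  // `recent` is ordered oldest to newest, so later tables (and finally
  // the memtable) overwrite earlier values for the same key
  def compact(memtable: Table, recent: List[Table]): Table =
    recent.foldLeft(SortedMap.empty[String, String])(_ ++ _) ++ memtable
}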
SSTable
Operations
Look up value for key
Iterate over all key/value pairs in a specified range
Relies on lock service (Chubby)
Ensure there is at most one active master
Administer tablet server death
Store column family information
Store access control lists
Bigtable locking/metadata storage needs
Bigtable uses a locking service/key-value store (Chubby) for a variety of tasks
to ensure that there is at most one active master at any time
to store the bootstrap location of Bigtable data
to discover tablet servers and finalize tablet server deaths (i.e., to keep the system state)
to store Bigtable schema information (the column family information for each table)
to store access control lists
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable
Chubby
Chubby provides to the infrastructure
a locking service
a file system for reliable storage of small files
a leader election service (e.g. to select a primary replica)
a name service
Seemingly violates “simplicity” design philosophy but...
... Chubby really provides an asynchronous distributed agreement service
Chubby API
Overall architecture of Chubby
Cell: a single instance of the Chubby system
5 replicas
1 master replica
Each replica maintains a database of directories and files/locks
Consistency is achieved using Lamport's Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically GC the operation log
Paxos distributed consensus algorithm
A distributed consensus protocol for asynchronous systems
Used by servers managing replicas in order to reach agreement on an update when
messages may be lost, re-ordered, or duplicated
servers may operate at arbitrary speed and fail
servers have access to stable persistent storage
Fact: consensus is not always possible in asynchronous systems
Paxos works by ensuring safety (correctness), not liveness (termination)
The Big Picture
Customized solutions for Google-type problems
GFS: stores data reliably
Just raw files
BigTable: provides a key/value map
Database-like, but doesn't provide everything we need
Chubby: locking mechanism
Handles all synchronization problems
Common Principles
One master, multiple workers
MapReduce: master coordinates work amongst map/reduce workers
Chubby: master among five replicas
Bigtable: master knows about location of tablet servers
GFS: master coordinates data across chunkservers
Strong consistency models
Sequential consistency