CSC 536 Lecture 8. Outline Reactive Streams Streams Reactive streams Akka streams Case study Google infrastructure (part I)

CSC 536 Lecture 8

Outline

Reactive Streams
  Streams
  Reactive streams
  Akka streams

Case study
  Google infrastructure (part I)

Reactive Streams

Streams

Stream
  Process involving data flow and transformation
  Data possibly of unbounded size
  Focus on describing transformation

Examples
  bulk data transfer
  real-time data sources
  batch processing of large data sets
  monitoring and analytics

Needed: Asynchrony

For fault tolerance: encapsulation, isolation

For scalability: distribution across nodes, distribution across cores

Problem: Managing data flow across an async boundary

Types of Async Boundaries

between different applications

between network nodes

between CPUs

between threads

between actors

Possible solutions

Traditional way: Synchronous/blocking (possibly remote) method calls

Does not scale

Push way: Asynchronous/non-blocking message passing

Scales! Problem: message buffering and message dropping

Reactive way: non-blocking and non-dropping

Reactive way

View slides 24-55 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014

Supply and Demand

data items flow downstream
demand flows upstream
data items flow only when there is demand

recipient is in control of incoming data rate
data in flight is bounded by signaled demand

Dynamic Push-Pull

“push” behavior when consumer is faster
“pull” behavior when producer is faster
switches automatically between these
batching demand allows batching data

Tailored Flow Control

Splitting the data means merging the demand

Tailored Flow Control

Merging the data means splitting the demand
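One way to read the two flow-control slides, sketched in Python. The function names are hypothetical, and the policies are assumptions: a broadcast stage that signals the minimum of its consumers' outstanding demand, and a merge stage that divides its demand evenly among its producers. Other policies are possible.

```python
def merged_demand(downstream_demand):
    """Demand a fan-out (broadcast) stage may signal upstream.

    Every consumer receives every element, so the stage can only ask for
    as many elements as its slowest consumer has requested.
    """
    return min(downstream_demand.values(), default=0)


def split_demand(total, upstreams):
    """Divide the demand of a fan-in (merge) stage among its producers.

    The shares always add up to `total`, so data in flight stays bounded.
    """
    base, extra = divmod(total, len(upstreams))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(upstreams)}


print(merged_demand({"fast": 5, "slow": 2}))
print(split_demand(7, ["x", "y", "z"]))
```

In both directions the stage never signals more demand upstream than it has received from downstream.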

Reactive Streams

Back-pressured Asynchronous Stream Processing
  asynchronous non-blocking data flow
  asynchronous non-blocking demand flow
  Goal: minimal coordination and contention

Message passing allows for distribution
  across applications
  across nodes
  across CPUs
  across threads
  across actors

Reactive Streams Projects

Standard implemented by many libraries

Engineers from Netflix, Oracle, Red Hat, Twitter, Typesafe, …

See http://reactive-streams.org

Reactive Streams

All participants had the same basic problem

All are building tools for their community

A common solution benefits everybody

Interoperability to make best use of efforts
  minimal interfaces
  rigorous specification of semantics
  full TCK for verification of implementation
  complete freedom for many idiomatic APIs

The underlying (internal) API

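trait Publisher[T] {
  // Registers a subscriber; the publisher answers through onSubscribe.
  def subscribe(sub: Subscriber[T]): Unit
}

trait Subscription {
  // Signals demand for up to n more elements
  // (renamed request(n: Long) in Reactive Streams 1.0).
  def requestMore(n: Int): Unit
  // Stops the flow of elements.
  def cancel(): Unit
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit
  def onNext(elem: T): Unit
  def onError(thr: Throwable): Unit
  def onComplete(): Unit
}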
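To make the protocol concrete, here is a minimal Python analogue of the three interfaces. It is a sketch only: the class names are made up for illustration, and the calls are synchronous, whereas the specification requires signals to be dispatched asynchronously. What it does show is that elements are emitted only in response to signaled demand.

```python
class RangePublisher:
    """Publishes the integers 0..count-1, no faster than demand allows."""

    def __init__(self, count):
        self.count = count

    def subscribe(self, subscriber):
        subscriber.on_subscribe(RangeSubscription(self.count, subscriber))


class RangeSubscription:
    def __init__(self, count, subscriber):
        self.next = 0
        self.count = count
        self.subscriber = subscriber
        self.finished = False

    def request(self, n):
        # Emit at most n elements: data in flight is bounded by demand.
        while n > 0 and self.next < self.count and not self.finished:
            elem = self.next
            self.next += 1
            n -= 1
            self.subscriber.on_next(elem)
        if self.next == self.count and not self.finished:
            self.finished = True
            self.subscriber.on_complete()

    def cancel(self):
        self.finished = True


class CollectingSubscriber:
    def __init__(self):
        self.subscription = None
        self.received = []
        self.done = False

    def on_subscribe(self, subscription):
        self.subscription = subscription  # nothing flows until we request

    def on_next(self, elem):
        self.received.append(elem)

    def on_complete(self):
        self.done = True


sub = CollectingSubscriber()
RangePublisher(5).subscribe(sub)
sub.subscription.request(2)    # only two elements may flow
print(sub.received, sub.done)
sub.subscription.request(10)   # more demand than data: stream completes
print(sub.received, sub.done)
```

The subscriber decides when and how much to request, which is what puts the recipient in control of the incoming data rate.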

The Process

Reactive Streams

All calls on Subscriber must dispatch async

All calls on Subscription must not block

Publisher is just there to create Subscriptions

Akka Streams

Powered by Akka Actors

Type-safe streaming through Actors with bounded buffering

Akka Streams API is geared towards end-users

Akka Streams implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages

Examples

View slides 62-80 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014

basic.scala

TcpEcho.scala

WritePrimes.scala


Overview of Google’s distributed systems

Original Google search engine architecture

More than just a search engine

Organization of Google’s physical infrastructure

40-80 PCs per rack (terabytes of disk space each)

30+ racks per cluster

Hundreds of clusters spread across data centers worldwide

System architecture requirements

Scalability

Reliability

Performance

Openness (at the beginning, at least)

Overall Google systems architecture

Google infrastructure

Design philosophy

Simplicity
  Software should do one thing and do it well

Provable performance
  “every millisecond counts”
  Estimate performance costs (accessing memory and disk, sending packet over network, locking and unlocking a mutex, etc.)

Testing
  “if it ain’t broke, you’re not trying hard enough”
  Stringent testing

Data and coordination services

Google File System (GFS)
  Broadly similar to NFS and AFS
  Optimized to type of files and data access used by Google

BigTable
  A distributed database that stores (semi-)structured data
  Just enough organization and structure for the type of data Google uses

Chubby
  a locking service (and more) for GFS and BigTable

GFS requirements

Must run reliably on the physical platform
  Must tolerate failures of individual components
  So application-level services can rely on the file system

Optimized for Google’s usage patterns
  Huge files (100+MB, up to 1GB)
  Relatively small number of files
  Accesses dominated by sequential reads and appends
  Appends done concurrently

Meets the requirements of the whole Google infrastructure
  scalable, reliable, high performance, open
  Important: throughput has higher priority than latency

GFS architecture

File stored in 64MB chunks in a cluster with
  a master node (operations log replicated on remote machines)
  hundreds of chunk servers

Chunks replicated 3 times

Reading and writing

When the client wants to access a particular offset in a file
  The GFS client translates this to a (file name, chunk index) pair
  and then sends this to the master

When the master receives the (file name, chunk index) pair
  It replies with the chunk identifier and replica locations

The client then accesses the closest chunk replica directly

No client-side caching
  Caching would not help in the type of (streaming) access GFS has
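The offset translation done by the GFS client is simple integer arithmetic over the fixed 64MB chunk size. A small Python sketch, where the function name and file name are made up for illustration:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks


def to_chunk_request(file_name, byte_offset):
    """Translate a byte offset into the (file name, chunk index) pair that
    is sent to the master, plus the offset within that chunk."""
    chunk_index, offset_in_chunk = divmod(byte_offset, CHUNK_SIZE)
    return (file_name, chunk_index), offset_in_chunk


# A read at byte offset 200MB falls 8MB into the fourth chunk (index 3).
request, within = to_chunk_request("/logs/crawl", 200 * 1024 * 1024)
print(request, within)
```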

Keeping chunk replicas consistent

When the master receives a mutation request from a client
  the master grants a chunk replica a lease (replica is primary)
  returns identity of primary and other replicas to client

The client sends the mutation directly to all the replicas
  Replicas cache the mutation and acknowledge receipt

The client sends a write request to primary
  Primary orders mutations and updates accordingly
  Primary then requests that other replicas do the mutations in the same order
  When all the replicas have acknowledged success, the primary reports an ack to the client
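The separation between pushing data to all replicas and ordering it at the primary can be sketched as follows. This is a toy model with hypothetical class and method names: it ignores leases, failures and networking, and only shows that the order in which the primary applies mutations is the order every replica ends up with.

```python
class Replica:
    def __init__(self):
        self.cached = {}  # mutation id -> data pushed by a client
        self.chunk = []   # mutations applied so far, in order

    def push(self, mutation_id, data):
        self.cached[mutation_id] = data
        return True       # acknowledge receipt

    def apply(self, mutation_id):
        self.chunk.append(self.cached.pop(mutation_id))
        return True       # acknowledge success


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, mutation_id):
        # The order in which write requests reach the primary becomes the
        # serial order; secondaries are asked to apply in that same order.
        self.apply(mutation_id)
        return all(s.apply(mutation_id) for s in self.secondaries)


s1, s2 = Replica(), Replica()
primary = Primary([s1, s2])
for replica in (primary, s1, s2):   # data goes to all replicas first
    replica.push("m1", "x")
    replica.push("m2", "y")
primary.write("m2")                 # write requests may arrive in any order
primary.write("m1")
print(primary.chunk, s1.chunk, s2.chunk)
```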

What consistency model does this seem to implement?

GFS (non-)guarantees

Writes (at a file offset) are not atomic
  Concurrent writes to the same location may corrupt replicated chunks
  If any replica is left inconsistent, the write fails (and is retried a few times)

Appends are executed atomically “at least once”
  Offset is chosen by primary
  May end up with non-identical replicated chunks, with some having duplicate appends
  GFS does not guarantee that the replicas are identical
  It only guarantees that some file regions are consistent across replicas

When needed, GFS relies on an external locking service (Chubby)
  As well as a leader election service (also Chubby) to select the primary replica
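A toy simulation of how “at least once” appends can leave replicas non-identical. The model is an assumption made for illustration: chunks are Python lists, a failed attempt is retried at a fresh offset chosen by the primary, and replicas that missed the failed attempt are padded up to that offset. All names are hypothetical.

```python
PAD = None  # filler for a region a replica missed


def append_once(replicas, record, failing=()):
    """One append attempt at the offset chosen by the primary."""
    offset = max(len(chunk) for chunk in replicas.values())
    ok = True
    for name, chunk in replicas.items():
        chunk.extend([PAD] * (offset - len(chunk)))  # align to the offset
        if name in failing:
            ok = False            # this replica did not apply the append
        else:
            chunk.append(record)
    return offset, ok


def append_at_least_once(replicas, record, failures):
    """Retry until an attempt succeeds on every replica.

    `failures` lists, per attempt, which replicas fail; after those
    attempts are used up, the next attempt succeeds everywhere."""
    for failing in list(failures) + [()]:
        offset, ok = append_once(replicas, record, failing)
        if ok:
            return offset


replicas = {"primary": [], "r1": [], "r2": []}
offset = append_at_least_once(replicas, "rec", failures=[("r2",)])
print(offset, replicas)
```

After the retry, every replica holds the record at the returned offset, but "primary" and "r1" also hold a duplicate that "r2" lacks. Only the region at the returned offset is consistent across replicas.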

Bigtable

GFS provides raw data storage

Also needed:
  Storage for structured data ...
  ... optimized to handle the needs of Google’s apps ...
  ... that is reliable, scalable, high-performance, open, etc.

Examples of structured data

URLs:
  Content, crawl metadata, links, anchors, PageRank, ...

Per-user data:
  User preference settings, recent queries/search results, …

Geographic locations:
  Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

Commercial DB

Why not use a commercial database?
  Not scalable enough
  Too expensive
  Full-featured relational database not required
  Low-level optimizations may be needed

Bigtable table

Implementation: sparse, distributed, multi-dimensional map
  (row, column, timestamp) → cell contents
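The data model can be pictured as a dictionary keyed by (row, column, timestamp). The sketch below is an in-memory stand-in, not Bigtable's API; the helper names `put` and `get_latest` are made up for illustration.

```python
table = {}  # sparse: only cells that were written take up space


def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value


def get_latest(row, column):
    """Return the value with the highest timestamp for this cell, if any."""
    versions = [(ts, value) for (r, c, ts), value in table.items()
                if r == row and c == column]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


put("com.cnn.www", "contents:", 1, "<html>v1")
put("com.cnn.www", "contents:", 2, "<html>v2")
print(get_latest("com.cnn.www", "contents:"))
print(get_latest("com.cnn.www", "anchor:missing"))
```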

Rows

Each row has a key
  A string up to 64KB in size
  Access to data in a row is atomic

Rows ordered lexicographically
  Rows close together lexicographically reside on one machine or on nearby machines (locality)

Columns

Example row with key “com.cnn.www”:
  ‘contents:.’                 → “<html>…”
  ‘anchor:com.cnn.www/sport’   → “CNN Sports”
  ‘anchor:com.cnn.www/world’   → “CNN world”

Columns have two-level name structure:
  family:qualifier

Column family
  logical grouping of data
  groups unbounded number of columns (named with qualifiers)
  may have a single column with no qualifier

Timestamps

Used to store different versions of data in a cell
  default to current time
  can also be set explicitly by client

Garbage Collection
  Per-column-family GC settings
  “Only retain most recent K values in a cell”
  “Keep values until they are older than K seconds”
  ...
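The two garbage-collection policies quoted above can be sketched as filters over the versions of a single cell. Here a cell is assumed to be a list of (timestamp, value) pairs with timestamps in seconds; the function names are hypothetical.

```python
def keep_last_k(versions, k):
    """Only retain the most recent k values in a cell."""
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    return newest_first[:k]


def keep_newer_than(versions, max_age_seconds, now):
    """Keep values until they are older than max_age_seconds."""
    return [(ts, value) for ts, value in versions
            if now - ts <= max_age_seconds]


cell = [(100, "v1"), (200, "v2"), (300, "v3")]
print(keep_last_k(cell, 2))
print(keep_newer_than(cell, 150, now=320))
```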

API

Create / delete tables and column families

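// Open the table.
Table *T = OpenOrDie("/bigtable/web/webtable");

// Set one anchor column and delete another in row "com.cnn.www".
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");

// Apply both changes to the row as a single mutation.
Operation op;
Apply(&op, &r1);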

Bigtable architecture

An instance of BigTable is a cluster that stores tables
  library on client side
  master server
  tablet servers

Each table is decomposed into tablets

Tablets

A table is decomposed into tablets
  Tablet holds contiguous range of rows
  100MB - 200MB of data per tablet
  Tablet server responsible for ~100 tablets

Each tablet is represented by
  A set of files stored in GFS
    The files use the SSTable format, a mapping of (string) keys to (string) values
  Log files
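Because each tablet holds a contiguous range of lexicographically ordered rows, finding the tablet for a row key is a binary search over the split points. A sketch with made-up split keys and server names:

```python
import bisect

# Hypothetical split points: tablet i holds rows in [splits[i-1], splits[i]).
splits = ["com.g", "org."]
servers = ["tablet-server-a", "tablet-server-b", "tablet-server-c"]


def locate(row_key):
    """Return (tablet index, server) responsible for row_key."""
    index = bisect.bisect_right(splits, row_key)
    return index, servers[index]


for key in ["com.cnn.www", "com.cnn.www/sport", "com.google.www", "org.wikipedia"]:
    print(key, locate(key))
```

Rows that share a prefix, such as the two "com.cnn.www" keys, land in the same tablet, which is the locality property of lexicographic row ordering.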

Tablet Server

Master assigns tablets to tablet servers

Tablet server
  Handles read / write requests to tablets from clients

No data goes through the master
  Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
  The metadata table contains metadata about actual tablets,
  including location information of associated SSTables and log files

Master
  Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
    provided by locking service (Chubby)
  Monitors tablet servers
    periodically scans directory of tablet servers provided by naming service (Chubby)
    keeps track of tablets assigned to its tablet servers
    obtains a lock on the tablet server from locking service (Chubby)
    lock is the communication mechanism between master and tablet server
  Assigns unassigned tablets in the cluster to tablet servers it monitors,
  moving tablets around to achieve load balancing
  Garbage collects underlying files stored in GFS

BigTable tablet architecture

Each SSTable is an ordered and immutable mapping of keys to values

Tablet Serving

Writes committed to log
  Memtable: ordered log of recent commits (in memory)
  SSTables really store a snapshot

When Memtable gets too big
  Create new empty Memtable
  Merge old Memtable with SSTables and write to GFS
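The write path described above can be modeled in a few lines. This is a toy sketch with hypothetical names: plain Python lists and tuples stand in for the commit log and the files in GFS, and "too big" is an arbitrary entry count.

```python
class Tablet:
    """Toy tablet: commit log, in-memory memtable, one merged snapshot."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # assumed threshold for "too big"
        self.log = []        # stands in for the commit log
        self.memtable = {}   # recent commits, held in memory
        self.sstable = ()    # immutable sorted (key, value) pairs

    def write(self, key, value):
        self.log.append((key, value))   # commit to the log first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        merged = dict(self.sstable)
        merged.update(self.memtable)    # newer values win
        self.sstable = tuple(sorted(merged.items()))
        self.memtable = {}              # start a new, empty memtable

    def read(self, key):
        if key in self.memtable:        # most recent data first
            return self.memtable[key]
        return dict(self.sstable).get(key)


tablet = Tablet(memtable_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)   # memtable is full: merged into the snapshot
tablet.write("a", 3)   # newer value lives in the memtable for now
print(tablet.read("a"), tablet.read("b"), tablet.sstable)
```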

SSTable

Operations
  Look up value for key
  Iterate over all key/value pairs in specified range
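Both SSTable operations follow from keeping the keys sorted and immutable. A minimal in-memory sketch follows. The class is illustrative only and says nothing about the real on-disk format, and the half-open range [start, end) is an assumption.

```python
import bisect


class SSTable:
    """Immutable, ordered mapping of string keys to string values."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        self._keys = tuple(key for key, _ in pairs)
        self._values = tuple(value for _, value in pairs)

    def lookup(self, key):
        """Binary search for a single key; None if absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        """Yield all (key, value) pairs with start <= key < end, in order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1


sst = SSTable([("com.cnn.www", "a"), ("org.w3.www", "c"), ("com.ibm.www", "b")])
print(sst.lookup("com.ibm.www"))
print(list(sst.scan("com.", "com/")))
```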

Bigtable relies on lock service (Chubby) to
  Ensure there is at most one active master
  Administer tablet server death
  Store column family information
  Store access control lists

Chubby

Chubby provides to the infrastructure
  a locking service
  a file system for reliable storage of small files
  a leader election service (e.g. to select a primary replica)
  a name service

Seemingly violates “simplicity” design philosophy but...

... Chubby really provides an asynchronous distributed agreement service

Chubby API

Overall architecture of Chubby

Cell: single instance of Chubby system
  5 replicas
  1 master replica

Each replica maintains a database of directories and files/locks

Consistency achieved using Lamport’s Paxos consensus protocol that uses an operation log
  Chubby internally supports snapshots to periodically GC the operation log

Paxos distributed consensus algorithm

A distributed consensus protocol for asynchronous systems

Used by servers managing replicas in order to reach agreement on an update when
  messages may be lost, re-ordered, duplicated
  servers may operate at arbitrary speed and fail
  servers have access to stable persistent storage

Fact: Consensus not always possible in asynchronous systems
  Paxos works by ensuring safety (correctness), not liveness (termination)

Paxos algorithm - step 1

Paxos algorithm - step 2
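For orientation, the two phases of single-decree Paxos can be sketched in Python. This is a simplified, synchronous model rather than Chubby's implementation: there is no message loss, no persistent storage and no concurrency, and the names `Acceptor` and `propose` are chosen for illustration. It shows the safety rule: once a value has been accepted by a majority, later proposals must adopt it.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) of last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for proposal n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None                      # give up; retry with a higher n
    prior = [p for p in promises if p is not None]
    if prior:
        # Safety: adopt the value of the highest-numbered accepted proposal.
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


cell = [Acceptor() for _ in range(5)]
print(propose(cell, 1, "A"))   # first proposal is chosen
print(propose(cell, 2, "B"))   # later proposer must adopt the chosen value
print(propose(cell, 1, "C"))   # stale proposal number is rejected
```

A proposer whose number is rejected simply gets no result, which reflects the point above that safety is guaranteed while termination is not.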

The Big Picture

Customized solutions for Google-type problems

GFS: Stores data reliably
  Just raw files

BigTable: provides key/value map
  Database like, but doesn’t provide everything we need

Chubby: locking mechanism
  Handles all synchronization problems

Common Principles

One master, multiple workers
  MapReduce: master coordinates work amongst map / reduce workers
  Chubby: master among five replicas
  Bigtable: master knows about location of tablet servers
  GFS: master coordinates data across chunkservers
