CSC 536 Lecture 3
Outline
Akka example: mapreduce
Distributed transactions
MapReduce Framework: Motivation
Want to process lots of data ( > 1 TB)
Want to parallelize the job across hundreds/thousands of commodity CPUs connected by a commodity network
Want to make this easy, re-usable
Example Uses at Google
Pagerank
wordcount
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
…
Programming Model
Users implement interface of two functions:
mapper(in_key, in_value) -> list((out_key, intermediate_value))
reducer(out_key, intermediate_values list) -> (out_key, out_value)
Map phase
Records from the data source are fed into the mapper function as (key, value) pairs
(filename, content) (goal: wordcount)
(web page URL, web page content) (goal: web link-graph reversal)
mapper produces one or more intermediate (output key, intermediate value) pairs from the input
(word, 1) (wordcount)
(link URL, web page URL) (web link-graph reversal)
Reduce phase
After the Map phase is over, all the intermediate values for a given output key are combined together into a list
(“hello”, 1), (“hello”, 1), (“hello”, 1) -> (“hello”, [1,1,1])
This is done by an intermediate aggregation step of MapReduce
reducer function combines those intermediate values into one or more final values for that same output key
(“hello”, [1,1,1]) -> (“hello”, 3)
[Figure: MapReduce dataflow. Input (key, value) pairs are read from data stores 1..n and fed to parallel map tasks, each emitting intermediate (key, values...) pairs. == Barrier ==: intermediate values are aggregated by output key. Parallel reduce tasks then consume (key, intermediate values) pairs and produce the final values for each key.]
Parallelism
mapper functions run in parallel, creating different intermediate values from different input data sets
reducer functions also run in parallel, each working on a different output key
All values are processed independently
MapReduce example: wordcount
Problem: Count the number of occurrences of words in a set of files
Input to any MapReduce job: A set of (input_key, input_value) pairs
In wordcount: (input_key, input_value) = (filename, content)
filenames = ["a.txt", "b.txt", "c.txt"]
content = {}
for filename in filenames:
    f = open(filename)
    content[filename] = f.read()
    f.close()
MapReduce example: wordcount
The content of the input files
a.txt:
The quick brown fox jumped over the lazy grey dogs.
b.txt:
That's one small step for a man, one giant leap for mankind.
c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
MapReduce example: wordcount
Map phase: Function mapper is applied to every (filename, content) pair
mapper moves through the words in the file; for each word it encounters, it emits the intermediate key and value
(word, 1)
A call to mapper("a.txt", content["a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
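A minimal Python sketch of such a mapper; the slides only show its output, so the lowercasing and punctuation handling below are assumptions chosen to reproduce it:

import string

def mapper(input_key, input_value):
    # Split the file content into words, lowercase them, and drop
    # punctuation (including internal apostrophes, so "That's"
    # becomes 'thats' as in the output below), emitting (word, 1)
    # for each occurrence.
    output = []
    for word in input_value.split():
        cleaned = word.lower().strip(string.punctuation).replace("'", "")
        if cleaned:
            output.append((cleaned, 1))
    return output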
The output of the Map phase is the concatenation of the lists returned by mapper("a.txt", content["a.txt"]), mapper("b.txt", content["b.txt"]), and mapper("c.txt", content["c.txt"])
MapReduce example: wordcount
The output of the Map phase
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1),('thats', 1), ('one', 1), ('small', 1), ('step', 1),('for', 1), ('a', 1), ('man', 1), ('one', 1),('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]
MapReduce example: wordcount
The Map phase of MapReduce is logically trivial. But when the input dictionary has, say, 10 billion keys, and those keys point to files held on thousands of different machines, implementing the Map phase is actually quite non-trivial.
The MapReduce library should handle:
knowing which files are stored on what machines
making sure that machine failures don't affect the computation
making efficient use of the network
storing the output in a usable form
The programmer only writes the mapper function; the MapReduce framework takes care of everything else
MapReduce example: wordcount
In preparation for the Reduce phase, the MapReduce library groups together all the intermediate values which have the same key to obtain this intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
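A sketch of this grouping step in plain Python, assuming the entire Map output fits in one in-memory list (which the real library of course avoids):

from collections import defaultdict

def group_intermediate(map_output):
    # Collect all intermediate values for the same key into one list,
    # e.g. ('the', 1), ('the', 1), ('the', 1) becomes 'the': [1, 1, 1].
    intermediate = defaultdict(list)
    for key, value in map_output:
        intermediate[key].append(value)
    return dict(intermediate)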
MapReduce example: wordcount
In the Reduce phase, a programmer-defined function
reducer(out_key, intermediate_value_list)
is applied to each entry in the intermediate dictionary.
For wordcount, reducer sums up the list of intermediate values, and returns both out_key and the sum as the output.
def reducer(out_key, intermediate_value_list):
    return (out_key, sum(intermediate_value_list))
MapReduce example: wordcount
The output from the Reduce phase, and from the complete MapReduce computation, is:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
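Tying the phases together, a toy single-machine driver might look like this (map_reduce is an illustrative name, not part of any real MapReduce API):

from collections import defaultdict

def map_reduce(content, mapper, reducer):
    # Map phase: apply mapper to every (filename, content) pair
    # and concatenate the resulting intermediate lists.
    map_output = []
    for filename, text in content.items():
        map_output.extend(mapper(filename, text))
    # Group phase: collect intermediate values by output key.
    intermediate = defaultdict(list)
    for key, value in map_output:
        intermediate[key].append(value)
    # Reduce phase: apply reducer to each intermediate entry.
    return [reducer(key, values) for key, values in intermediate.items()]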
MapReduce example: wordcount
Map and Reduce can be done in parallel... but how is the grouping step that takes place between the Map phase and the Reduce phase done?
For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine
The general idea: Imagine you've got 1000 machines that you're going to use to run reduce on.
As the mapper functions compute the output keys and intermediate value lists, they compute hash(out_key) mod 1000 for some hash function.
This number identifies the machine in the cluster that the corresponding reducer will be run on, and the resulting output key and value list is then sent to that machine.
Because every machine running mapper uses the same hash function, this ensures that value lists corresponding to the same output key all end up at the same machine.
Furthermore, by using a hash we ensure that the output keys end up pretty evenly spread over machines in the cluster
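A sketch of the partitioning rule in Python; the cluster size of 1000 comes from the example above, and the use of md5 as a stable hash is an assumption:

import hashlib

NUM_REDUCE_MACHINES = 1000

def partition(out_key):
    # Every mapper applies the same hash function, so all value lists
    # for a given key are sent to the same reduce machine. Python's
    # built-in hash() is randomized per process, hence a stable hash.
    digest = hashlib.md5(out_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCE_MACHINES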
mapreduce example
See project mapreduce in the Lecture 3 code
MapReduce optimizations
Locality
Fault tolerance
Time optimization
Bandwidth optimization
Locality
Master program divvies up tasks based on location of data:
tries to have mapper tasks on the same machine as the physical file data, or at least on the same rack
mapper task inputs are divided into 64 MB blocks
same size as Google File System chunks
Redundancy for Fault Tolerance
Master detects worker failures via periodic heartbeats
Re-executes completed & in-progress mapper tasks of failed workers
Re-executes in-progress reducer tasks of failed workers
Redundancy for time optimization
Reduce phase can’t start until Map phase is complete
Slow workers significantly lengthen completion time:
A single slow disk controller can rate-limit the whole process
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled
Solution: Near the end of the phase, spawn backup copies of tasks
Whichever one finishes first "wins"
Effect: Dramatically shortens job completion time
Bandwidth Optimizations
“Aggregator” function can run on same machine as a mapper function
Causes a mini-reduce phase to occur before the real Reduce phase, to save bandwidth
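For wordcount, such a mini-reduce might look like the sketch below (combine_wordcount is an illustrative name, not part of the MapReduce interface):

from collections import defaultdict

def combine_wordcount(map_output):
    # Pre-sum the counts on the mapper's machine so that one
    # (word, n) pair per word crosses the network instead of
    # many (word, 1) pairs.
    counts = defaultdict(int)
    for word, count in map_output:
        counts[word] += count
    return list(counts.items())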
Distributed Transactions
Distributed transactions
Transactions, like mutual exclusion, protect shared data against simultaneous access by several concurrent processes.
Transactions allow a process to access and modify multiple data items as a single atomic transaction.
If the process backs out halfway during the transaction, everything is restored to the point just before the transaction started.
Distributed transactions: example 1
A customer dials into her bank web account and does the following:
Withdraws amount x from account 1.
Deposits amount x to account 2.
If telephone connection is broken after the first step but before the second, what happens?
Either both or neither should be completed.
This requires special primitives provided by the distributed system.
The Transaction Model
Examples of primitives for transactions
Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise
Distributed transactions: example 2
a) Transaction to reserve three flights commits
b) Transaction aborts when third flight is unavailable

(a) BEGIN_TRANSACTION
      reserve WP -> JFK;
      reserve JFK -> Nairobi;
      reserve Nairobi -> Malindi;
    END_TRANSACTION

(b) BEGIN_TRANSACTION
      reserve WP -> JFK;
      reserve JFK -> Nairobi;
      reserve Nairobi -> Malindi full =>
    ABORT_TRANSACTION
ACID
Transactions are
Atomic: to the outside world, the transaction happens indivisibly.
Consistent: the transaction does not violate system invariants.
Isolated (or serializable): concurrent transactions do not interfere with each other.
Durable: once a transaction commits, the changes are permanent.
Flat, nested and distributed transactions
a) A nested transaction
b) A distributed transaction
Implementation of distributed transactions
For simplicity, we consider transactions on a file system.
Note that if each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.
Other methods required.
Atomicity
If each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.
Solution 1: Private Workspace
a) The file index and disk blocks for a three-block file
b) The situation after a transaction has modified block 0 and appended block 3
c) After committing
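A toy Python sketch of the private-workspace idea, modeling a file as a dict from block number to data (all names here are illustrative):

class PrivateWorkspace:
    # The transaction sees a private, copy-on-write view of the file:
    # reads fall through to the original index, writes go to a shadow
    # index that replaces the real one only on commit.
    def __init__(self, file_blocks):
        self.original = file_blocks   # shared index: block number -> data
        self.shadow = {}              # private copies of modified blocks

    def read(self, block_no):
        return self.shadow.get(block_no, self.original.get(block_no))

    def write(self, block_no, data):
        self.shadow[block_no] = data  # never touches the original

    def commit(self):
        self.original.update(self.shadow)  # install the changes
        self.shadow.clear()

    def abort(self):
        self.shadow.clear()           # the private copies simply vanish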
Solution 2: Writeahead Log
(a) A transaction
(b)–(d) The log before each statement is executed

(a) x = 0;
    y = 0;
    BEGIN_TRANSACTION;
      x = x + 1;
      y = y + 2;
      x = y * y;
    END_TRANSACTION;

(b) Log: [x = 0 / 1]
(c) Log: [x = 0 / 1] [y = 0 / 2]
(d) Log: [x = 0 / 1] [y = 0 / 2] [x = 0 / 4]
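A minimal sketch of a writeahead log with rollback in Python; the (name, old value, new value) record format mirrors the slide, the rest is an assumption:

class WriteaheadLog:
    # Append (variable, old value, new value) to the log before each
    # write; on abort, replay the log backwards to restore old values.
    def __init__(self, store):
        self.store = store            # variable name -> value
        self.log = []

    def write(self, name, new_value):
        self.log.append((name, self.store[name], new_value))  # log first...
        self.store[name] = new_value                          # ...then write

    def abort(self):
        for name, old_value, _ in reversed(self.log):
            self.store[name] = old_value
        self.log.clear()

store = {"x": 0, "y": 0}
txn = WriteaheadLog(store)
txn.write("x", store["x"] + 1)   # log: [x = 0/1]
txn.write("y", store["y"] + 2)   # log: [x = 0/1] [y = 0/2]
txn.write("x", store["y"] ** 2)  # log: [x = 0/1] [y = 0/2] [x = 0/4]
txn.abort()                      # store is {"x": 0, "y": 0} again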
Concurrency control (1)
We just learned how to achieve atomicity; we will learn about durability when discussing fault tolerance
Need to handle consistency and isolation
Concurrency control allows several transactions to be executed simultaneously, while making sure that the data is left in a consistent state
This is done by scheduling operations on data in an order whereby the final result is the same as if all transactions had run sequentially
Concurrency control (2)
General organization of managers for handling transactions
Concurrency control (3)
General organization of managers for handling distributed transactions.
Serializability
The main issue in concurrency control is the scheduling of conflicting operations (operations on the same data item, at least one of which is a write)
Read/Write operations can be synchronized using:
Mutual exclusion mechanisms, or
Scheduling using timestamps
Pessimistic/optimistic concurrency control
The lost update problem
Transaction T:                      Transaction U:
balance = b.getBalance()            balance = b.getBalance()
b.setBalance(balance*1.1)           b.setBalance(balance*1.1)
a.withdraw(balance/10)              c.withdraw(balance/10)

balance = b.getBalance()    $200
                                    balance = b.getBalance()    $200
b.setBalance(balance*1.1)   $220
                                    b.setBalance(balance*1.1)   $220
a.withdraw(balance/10)      $80
                                    c.withdraw(balance/10)      $280

Accounts a, b, and c start with $100, $200, and $300, respectively
b ends at $220 instead of $242: T's update of b is lost
The inconsistent retrievals problem
Transaction V:                      Transaction W:
a.withdraw(100)                     aBranch.branchTotal()
b.deposit(100)

a.withdraw(100)    $100
                                    total = a.getBalance()          $100
                                    total = total+b.getBalance()    $300
                                    total = total+c.getBalance()    ...
b.deposit(100)     $300

Accounts a and b start with $200 each
W's total misses the $100 in transit between a and b: the retrievals are inconsistent
A serialized interleaving of T and U
Transaction T:                      Transaction U:
balance = b.getBalance()            balance = b.getBalance()
b.setBalance(balance*1.1)           b.setBalance(balance*1.1)
a.withdraw(balance/10)              c.withdraw(balance/10)

balance = b.getBalance()    $200
b.setBalance(balance*1.1)   $220
                                    balance = b.getBalance()    $220
                                    b.setBalance(balance*1.1)   $242
a.withdraw(balance/10)      $80
                                    c.withdraw(balance/10)      $278
A serialized interleaving of V and W
Transaction V:                      Transaction W:
a.withdraw(100)                     aBranch.branchTotal()
b.deposit(100)

a.withdraw(100)    $100
b.deposit(100)     $300
                                    total = a.getBalance()          $100
                                    total = total+b.getBalance()    $400
                                    total = total+c.getBalance()    ...
Read and write operation conflict rules
Operations of different transactions    Conflict    Reason
read   read     No     The effect of a pair of read operations does not depend on the order in which they are executed
read   write    Yes    The effect of a read and a write operation depends on the order of their execution
write  write    Yes    The effect of a pair of write operations depends on the order of their execution
Serializability
Two transactions are serialized
if and only if
all pairs of conflicting operations of the two transactions are executed in the same order at all objects they both access.
A non-serialized interleaving of operations of transactions T and U
Transaction T:              Transaction U:
x = read(i)
                            write(i, 10)
y = read(j)
                            write(j, 30)
write(j, 20)
                            z = read(i)

On object i, T's read precedes U's write; on object j, U's write precedes T's write. The conflicting pairs are ordered differently, so the interleaving is not serializable.
Recoverability of aborts
Aborted transactions must be prevented from affecting other concurrent transactions
Dirty reads
Cascading aborts
A dirty read when transaction T aborts
Transaction T:                      Transaction U:
a.getBalance()                      a.getBalance()
a.setBalance(balance + 10)          a.setBalance(balance + 20)

balance = a.getBalance()      $100
a.setBalance(balance + 10)    $110
                                    balance = a.getBalance()      $110
                                    a.setBalance(balance + 20)    $130
                                    commit transaction
abort transaction

U has committed a result based on a value written by T, which later aborts: a dirty read
Cascading aborts
Suppose:
Transaction U has seen the effects of transaction T
Transaction V has seen the effects of transaction U
T decides to abort
Then V and U must abort as well
Transactions T and U with locks

Transaction T:                      Transaction U:
bal = b.getBalance()                bal = b.getBalance()
b.setBalance(bal*1.1)               b.setBalance(bal*1.1)
a.withdraw(bal/10)                  c.withdraw(bal/10)

Operations                 Locks            Operations                 Locks
openTransaction
bal = b.getBalance()       lock B
b.setBalance(bal*1.1)                       openTransaction
a.withdraw(bal/10)         lock A           bal = b.getBalance()       waits for T's lock on B
closeTransaction           unlock A, B
                                            lock B
                                            b.setBalance(bal*1.1)
                                            c.withdraw(bal/10)         lock C
                                            closeTransaction           unlock B, C
Two-phase locking (1)
Idea: the scheduler grants locks in a way that creates only serializable schedules.
In 2-phase locking, the transaction acquires all the locks it needs in the first (growing) phase, and then releases them in the second (shrinking) phase. This ensures a serializable schedule.
Dirty reads and cascading aborts are still possible
Two-phase locking (2)
Under strict 2-phase locking, a transaction that needs to read or write an object must be delayed until other transactions that wrote the same object have committed or aborted
Locks are held until the transaction commits or aborts
Example: CORBA Concurrency Control Service
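A minimal sketch of strict 2PL in Python, with coarse exclusive locks only (class and method names are illustrative):

import threading

class StrictTwoPhaseLocker:
    # Growing phase: a lock is acquired on first use of each object.
    # Strictness: nothing is released until commit or abort, so other
    # transactions never see this transaction's uncommitted writes.
    def __init__(self, lock_table):
        self.lock_table = lock_table  # object name -> threading.Lock
        self.held = []

    def acquire(self, obj):
        if obj not in self.held:
            self.lock_table[obj].acquire()  # blocks while another txn holds it
            self.held.append(obj)

    def release_all(self):
        # Shrinking phase, run only at closeTransaction (commit/abort).
        for obj in reversed(self.held):
            self.lock_table[obj].release()
        self.held.clear()

locks = {"A": threading.Lock(), "B": threading.Lock()}
txn = StrictTwoPhaseLocker(locks)
txn.acquire("B")   # lock B before reading/writing b
txn.acquire("A")   # lock A before withdrawing from a
txn.release_all()  # closeTransaction: unlock A, B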
Two-phase locking in a distributed system
The data is assumed to be distributed across multiple machines
Centralized 2PL: central scheduler grants locks
Primary 2PL: local scheduler is coordinator for local data
Distributed 2PL (data may be replicated):
the local schedulers use a distributed mutual exclusion algorithm to obtain a lock
the local scheduler forwards Read/Write operations to the data managers holding the replicas
Two-phase locking issues
Exclusive locks reduce concurrency more than necessary. It is sometimes preferable to allow concurrent transactions to read an object; two types of locks may be needed (read locks and write locks)
Deadlocks are possible.
Solution 1: acquire all locks in the same order.
Solution 2: use a graph to detect potential deadlocks.
Deadlock with write locks
Transaction T                               Transaction U
Operations            Locks                 Operations            Locks
a.deposit(100)        write lock A
                                            b.deposit(200)        write lock B
b.withdraw(100)       waits for U's         a.withdraw(200)       waits for T's
                      lock on B                                   lock on A
The wait-for graph
[Figure: wait-for graph. T holds the lock on A and waits for the lock on B; U holds the lock on B and waits for the lock on A. Abstractly: T waits for U, and U waits for T.]
A cycle in a wait-for graph
[Figure: a cycle in a wait-for graph among transactions T, U, and V.]
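Deadlock detection then amounts to finding a cycle in the wait-for graph. A small sketch (the dict-of-sets representation is an assumption):

def has_deadlock(waits_for):
    # waits_for maps a transaction to the set of transactions it is
    # waiting on, e.g. {"T": {"U"}, "U": {"T"}}. Depth-first search:
    # reaching a node already on the current path means a cycle.
    on_path, done = set(), set()

    def dfs(txn):
        on_path.add(txn)
        for other in waits_for.get(txn, ()):
            if other in on_path:
                return True          # cycle found: deadlock
            if other not in done and dfs(other):
                return True
        on_path.discard(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in list(waits_for) if t not in done)

assert has_deadlock({"T": {"U"}, "U": {"T"}})   # the example above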
Deadlock prevention with timeouts
Transaction T                               Transaction U
Operations            Locks                 Operations            Locks
a.deposit(100)        write lock A
                                            b.deposit(200)        write lock B
b.withdraw(100)       waits for U's         a.withdraw(200)       waits for T's
                      lock on B                                   lock on A
(timeout elapses)
T's lock on A becomes vulnerable: unlock A, abort T
                                            a.withdraw(200)       write lock A
                                                                  unlock A, B
Disadvantages of locking
High overhead
Deadlocks
Locks cannot be released until the end of the transaction, which reduces concurrency
In most applications, the likelihood of two clients accessing the same object is low
Pessimistic timestamp concurrency control
A transaction’s request to write an object is valid only if that object was last read and written by an earlier transaction
A transaction’s request to read an object is valid only if that object was last written by an earlier transaction
Advantage: Non-blocking and deadlock-free
Disadvantage: Transactions may need to abort and restart
Operation conflicts for timestamp ordering
Rule    Tc       Ti
1.      write    read     Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc ≥ the maximum read timestamp of the object.
2.      write    write    Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
3.      read     write    Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
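A sketch of these checks in Python, ignoring tentative values of uncommitted transactions (the object layout and names are assumptions):

class Abort(Exception):
    pass  # the transaction must restart with a new timestamp

class TimestampedObject:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0    # maximum timestamp of any reader (rule 1)
        self.write_ts = 0   # timestamp of the committed writer (rules 2, 3)

def timestamped_read(obj, tc):
    if tc <= obj.write_ts:          # rule 3: written by a later Ti
        raise Abort("read too late")
    obj.read_ts = max(obj.read_ts, tc)
    return obj.value

def timestamped_write(obj, tc, value):
    if tc < obj.read_ts:            # rule 1: read by a later Ti
        raise Abort("write too late: object read by a later transaction")
    if tc <= obj.write_ts:          # rule 2: written by a later Ti
        raise Abort("write too late: object written by a later transaction")
    obj.value = value
    obj.write_ts = tc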
Pessimistic Timestamp Ordering
Concurrency control using timestamps.
Optimistic timestamp ordering
Idea: just go ahead and do the operations without paying attention to what concurrent transactions are doing:
Keep track of when each data item has been read and written.
Before committing, check whether any item has been changed since the transaction started.
If so, abort. If not, commit.
Advantage: deadlock free and fast.
Disadvantage: it can fail and transactions must be run again.
Example: Scala Software Transactional Memory (next week)
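A toy sketch of optimistic concurrency control in Python, with per-item version counters standing in for "when each data item has been read and written" (all names are illustrative):

class OptimisticTxn:
    # Reads record the version they saw; writes are buffered locally.
    # At commit time, validate that nothing read has changed since;
    # if validation fails, the caller must rerun the transaction.
    def __init__(self, store):
        self.store = store            # name -> (version, value)
        self.read_versions = {}
        self.write_buffer = {}

    def read(self, name):
        if name in self.write_buffer:
            return self.write_buffer[name]
        version, value = self.store[name]
        self.read_versions.setdefault(name, version)
        return value

    def write(self, name, value):
        self.write_buffer[name] = value   # invisible until commit

    def commit(self):
        for name, seen in self.read_versions.items():
            if self.store[name][0] != seen:
                return False              # changed since we read it: abort
        for name, value in self.write_buffer.items():
            old_version = self.store[name][0] if name in self.store else 0
            self.store[name] = (old_version + 1, value)
        return True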