
Page 1: CSC  536 Lecture  3

CSC 536 Lecture 3

Page 2: CSC  536 Lecture  3

Outline

Akka example: mapreduce
Distributed transactions

Page 3: CSC  536 Lecture  3

MapReduce Framework: Motivation

Want to process lots of data ( > 1 TB)

Want to parallelize the job across hundreds/thousands of commodity CPUs connected by a commodity network

Want to make this easy, re-usable

Page 4: CSC  536 Lecture  3

Example Uses at Google

Pagerank
wordcount
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
…

Page 5: CSC  536 Lecture  3

Programming Model

Users implement interface of two functions:

mapper(in_key, in_value) -> list((out_key, intermediate_value))

reducer(out_key, intermediate_values list) -> (out_key, out_value)

Page 6: CSC  536 Lecture  3

Map phase

Records from the data source are fed into the mapper function as (key, value) pairs

(filename, content)                (goal: wordcount)
(web page URL, web page content)   (goal: web link-graph reversal)

mapper produces one or more intermediate (output key, intermediate value) pairs from the input

(word, 1)
(link URL, web page URL)

Page 7: CSC  536 Lecture  3

Reduce phase

After the Map phase is over, all the intermediate values for a given output key are combined together into a list

(“hello”, 1), (“hello”, 1), (“hello”, 1) -> (“hello”, [1,1,1])
Done by intermediate aggregator step of MapReduce

reducer function combines those intermediate values into one or more final values for that same output key

(“hello”, [1,1,1]) -> (“hello”, 3)

Page 8: CSC  536 Lecture  3

[Diagram: input key/value pairs from each data store are fed to map tasks, which emit (key, intermediate values) pairs; a barrier aggregates intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, ...]

Page 9: CSC  536 Lecture  3

Parallelism

mapper functions run in parallel, creating different intermediate values from different input data sets

reducer functions also run in parallel, each working on a different output key

All values are processed independently

Page 10: CSC  536 Lecture  3

MapReduce example: wordcount

Problem: Count the number of occurrences of words in a set of files

Input to any MapReduce job: A set of (input_key, input_value) pairs

In wordcount: (input_key, input_value) = (filename, content)

filenames = ["a.txt", "b.txt", "c.txt"]
content = {}
for filename in filenames:
    f = open(filename)
    content[filename] = f.read()
    f.close()

Page 11: CSC  536 Lecture  3

MapReduce example: wordcount

The content of the input files

a.txt:

The quick brown fox jumped over the lazy grey dogs.

b.txt:

That's one small step for a man, one giant leap for mankind.

c.txt:

Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

Page 12: CSC  536 Lecture  3

MapReduce example: wordcount

Map phase: Function mapper is applied to every (filename, content) pair

mapper moves through the words in the file; for each word it encounters, it returns the intermediate key and value

(word, 1)

A call to mapper("a.txt", content["a.txt"]) returns:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

The output of the Map phase is the concatenation of the lists for mapper("a.txt", content["a.txt"]), mapper("b.txt", content["b.txt"]), and mapper("c.txt", content["c.txt"])
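The slides do not show the mapper itself. A minimal sketch that would produce the list above (assuming words are lowercased and punctuation is stripped, which is what the sample output suggests; the actual lecture code may differ) is:

import re

def mapper(filename, content):
    # Lowercase the text, drop punctuation, then emit (word, 1) for every word.
    words = re.sub(r"[^\w\s]", "", content.lower()).split()
    return [(word, 1) for word in words]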

Page 13: CSC  536 Lecture  3

MapReduce example: wordcount

The output of the Map phase

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1),('thats', 1), ('one', 1), ('small', 1), ('step', 1),('for', 1), ('a', 1), ('man', 1), ('one', 1),('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]

Page 14: CSC  536 Lecture  3

MapReduce example: wordcount

The Map phase of MapReduce is logically trivial. But when the input dictionary has, say, 10 billion keys, and those keys point to files held on thousands of different machines, implementing the Map phase is actually quite non-trivial.

The MapReduce library should handle: knowing which files are stored on what machines, making sure that machine failures don’t affect the computation, making efficient use of the network, and storing the output in a usable form.

The programmer only writes the mapper function; the MapReduce framework takes care of everything else.

Page 15: CSC  536 Lecture  3

MapReduce example: wordcount

In preparation for the Reduce phase, the MapReduce library groups together all the intermediate values which have the same key to obtain this intermediate dictionary:

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
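In a single-process sketch (ignoring the distributed implementation), this grouping step amounts to:

from collections import defaultdict

def group(map_output):
    # Collect every intermediate value under its output key.
    intermediate = defaultdict(list)
    for key, value in map_output:
        intermediate[key].append(value)
    return dict(intermediate)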

Page 16: CSC  536 Lecture  3

MapReduce example: wordcount

In the Reduce phase, a programmer-defined function

reducer(out_key, intermediate_value_list)

is applied to each entry in the intermediate dictionary.

For wordcount, reducer sums up the list of intermediate values, and returns both out_key and the sum as the output.

def reducer(out_key, intermediate_value_list):
    return (out_key, sum(intermediate_value_list))

Page 17: CSC  536 Lecture  3

MapReduce example: wordcount

The output from the Reduce phase, and from the complete MapReduce computation, is:

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
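As a single-process sketch, the whole computation chains the pieces together (using the hypothetical mapper and group helpers sketched earlier, plus the reducer from the previous slide):

map_output = []
for filename in filenames:
    map_output.extend(mapper(filename, content[filename]))

intermediate = group(map_output)
result = [reducer(key, values) for key, values in intermediate.items()]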

Page 18: CSC  536 Lecture  3

MapReduce example: wordcount

Map and Reduce can be done in parallel... but how is the grouping step that takes place between the Map phase and the Reduce phase done?

For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine

Page 19: CSC  536 Lecture  3

MapReduce example: wordcount

Map and Reduce can be done in parallel... but how is the grouping step that takes place between the Map phase and the Reduce phase done?

For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine

The general idea:

Imagine you’ve got 1000 machines that you’re going to use to run reduce on.

As the mapper functions compute the output keys and intermediate value lists, they compute hash(out_key) mod 1000 for some hash function.

This number is used to identify the machine in the cluster that the corresponding reducer will be run on, and the resulting output key and value list is then sent to that machine.

Because every machine running mapper uses the same hash function, this ensures that value lists corresponding to the same output key all end up at the same machine.

Furthermore, by using a hash we ensure that the output keys end up pretty evenly spread over machines in the cluster
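As an illustration only (the real framework and the lecture code may compute this differently), the partitioning decision can be a stable hash of the output key. A stable hash is used here because Python's built-in hash is salted per process and would not agree across machines:

import hashlib

def partition(out_key, num_reducers=1000):
    # Map the key to the reducer machine responsible for it; every mapper
    # computes the same value for the same key.
    digest = hashlib.md5(out_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers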

Page 20: CSC  536 Lecture  3

mapreduce example

project mapreduce in lecture 3 code

Page 21: CSC  536 Lecture  3

MapReduce optimizations

Locality
Fault tolerance
Time optimization
Bandwidth optimization

Page 22: CSC  536 Lecture  3

Locality

Master program divvies up tasks based on location of data: it tries to have mapper tasks on the same machine as the physical file data, or at least on the same rack

mapper task inputs are divided into 64 MB blocks (the same size as Google File System chunks)

Page 23: CSC  536 Lecture  3

Redundancy for Fault Tolerance

Master detects worker failures via periodic heartbeats
  Re-executes completed & in-progress mapper tasks
  Re-executes in-progress reducer tasks

Page 24: CSC  536 Lecture  3

Redundancy for time optimization

Reduce phase can’t start until Map phase is complete

Slow workers significantly lengthen completion time
  A single slow disk controller can rate-limit the whole process
  Other jobs consuming resources on the machine
  Bad disks with soft errors transfer data very slowly
  Weird things: processor caches disabled

Solution: Near end of phase, spawn backup copies of tasks
  Whichever one finishes first "wins"
Effect: Dramatically shortens job completion time

Page 25: CSC  536 Lecture  3

Bandwidth Optimizations

“Aggregator” function can run on same machine as a mapper function

Causes a mini-reduce phase to occur before the real Reduce phase, to save bandwidth
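For wordcount, this aggregator (often called a combiner) can simply be the reducer applied to the mapper's local output; a minimal sketch:

def combiner(out_key, values):
    # Mini-reduce on the mapper's machine: sum the local counts
    # before they are sent over the network.
    return (out_key, sum(values))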

Page 26: CSC  536 Lecture  3

Distributed Transactions

Page 27: CSC  536 Lecture  3

Distributed transactions

Transactions, like mutual exclusion, protect shared data against simultaneous access by several concurrent processes.

Transactions allow a process to access and modify multiple data items as a single atomic transaction.

If the process backs out halfway during the transaction, everything is restored to the point just before the transaction started.

Page 28: CSC  536 Lecture  3

Distributed transactions: example 1

A customer dials into her bank web account and does the following:

Withdraws amount x from account 1.
Deposits amount x to account 2.

If the connection is broken after the first step but before the second, what happens?

Either both or neither should be completed.
Requires special primitives provided by the DS.

Page 29: CSC  536 Lecture  3

The Transaction Model

Examples of primitives for transactions

Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise

Page 30: CSC  536 Lecture  3

Distributed transactions: example 2

a) Transaction to reserve three flights commits
b) Transaction aborts when third flight is unavailable

(a)
BEGIN_TRANSACTION
  reserve WP -> JFK;
  reserve JFK -> Nairobi;
  reserve Nairobi -> Malindi;
END_TRANSACTION

(b)
BEGIN_TRANSACTION
  reserve WP -> JFK;
  reserve JFK -> Nairobi;
  reserve Nairobi -> Malindi full =>
ABORT_TRANSACTION

Page 31: CSC  536 Lecture  3

ACID

Transactions are:

Atomic: to the outside world, the transaction happens indivisibly.

Consistent: the transaction does not violate system invariants.

Isolated (or serializable): concurrent transactions do not interfere with each other.

Durable: once a transaction commits, the changes are permanent.

Page 32: CSC  536 Lecture  3

Flat, nested and distributed transactions

a) A nested transaction
b) A distributed transaction

Page 33: CSC  536 Lecture  3

Implementation of distributed transactions

For simplicity, we consider transactions on a file system.

Note that if each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.

Other methods required.

Page 34: CSC  536 Lecture  3

Atomicity

If each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.

Page 35: CSC  536 Lecture  3

Solution 1: Private Workspace

a) The file index and disk blocks for a three-block file
b) The situation after a transaction has modified block 0 and appended block 3
c) After committing

Page 36: CSC  536 Lecture  3

Solution 2: Writeahead Log

(a) A transaction
(b) – (d) The log before each statement is executed

(a)  x = 0;
     y = 0;
     BEGIN_TRANSACTION;
       x = x + 1;
       y = y + 2;
       x = y * y;
     END_TRANSACTION;

(b)  Log: [x = 0 / 1]
(c)  Log: [x = 0 / 1] [y = 0 / 2]
(d)  Log: [x = 0 / 1] [y = 0 / 2] [x = 0 / 4]
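A minimal single-machine sketch of the idea (not the lecture's implementation): every write appends an old-value/new-value record to the log before updating the data in place, so an abort can roll everything back.

class WriteAheadLog:
    def __init__(self, data):
        self.data = data    # shared dict: variable name -> value
        self.log = []       # records of the form (name, old_value, new_value)

    def write(self, name, new_value):
        # Log first, then update in place.
        self.log.append((name, self.data[name], new_value))
        self.data[name] = new_value

    def commit(self):
        self.log.clear()

    def abort(self):
        # Undo by replaying the log backwards.
        for name, old_value, _ in reversed(self.log):
            self.data[name] = old_value
        self.log.clear()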

Page 37: CSC  536 Lecture  3

Concurrency control (1)

We just learned how to achieve atomicity; we will learn about durability when discussing fault tolerance

Need to handle consistency and isolation

Concurrency control allows several transactions to be executed simultaneously, while making sure that the data is left in a consistent state

This is done by scheduling operations on data in an order whereby the final result is the same as if all transactions had run sequentially

Page 38: CSC  536 Lecture  3

Concurrency control (2)

General organization of managers for handling transactions

Page 39: CSC  536 Lecture  3

Concurrency control (3)

General organization of managers for handling distributed transactions

Page 40: CSC  536 Lecture  3

Serializability

The main issue in concurrency control is the scheduling of conflicting operations (operations on the same data item, at least one of which is a write operation)

Read/Write operations can be synchronized using:
  Mutual exclusion mechanisms, or
  Scheduling using timestamps

Pessimistic/optimistic concurrency control

Page 41: CSC  536 Lecture  3

The lost update problem

Transaction T:
  balance = b.getBalance();
  b.setBalance(balance*1.1);
  a.withdraw(balance/10)

Transaction U:
  balance = b.getBalance();
  b.setBalance(balance*1.1);
  c.withdraw(balance/10)

Interleaving:
  T: balance = b.getBalance()      $200
  U: balance = b.getBalance()      $200
  T: b.setBalance(balance*1.1)     $220
  U: b.setBalance(balance*1.1)     $220
  T: a.withdraw(balance/10)        $80
  U: c.withdraw(balance/10)        $280

Accounts a, b, and c start with $100, $200, and $300, respectively

Page 42: CSC  536 Lecture  3

The inconsistent retrievals problem

Transaction V:
  a.withdraw(100)
  b.deposit(100)

Transaction W:
  aBranch.branchTotal()

Interleaving:
  V: a.withdraw(100)                   $100
  W: total = a.getBalance()            $100
  W: total = total + b.getBalance()    $300
  W: total = total + c.getBalance()
  V: b.deposit(100)                    $300

Accounts a and b start with $200 each.

Page 43: CSC  536 Lecture  3

A serialized interleaving of T and U

Transaction T:
  balance = b.getBalance()
  b.setBalance(balance*1.1)
  a.withdraw(balance/10)

Transaction U:
  balance = b.getBalance()
  b.setBalance(balance*1.1)
  c.withdraw(balance/10)

Interleaving:
  T: balance = b.getBalance()      $200
  T: b.setBalance(balance*1.1)     $220
  U: balance = b.getBalance()      $220
  U: b.setBalance(balance*1.1)     $242
  T: a.withdraw(balance/10)        $80
  U: c.withdraw(balance/10)        $278

Page 44: CSC  536 Lecture  3

A serialized interleaving of V and W

Transaction V:
  a.withdraw(100)
  b.deposit(100)

Transaction W:
  aBranch.branchTotal()

Interleaving:
  V: a.withdraw(100)                   $100
  V: b.deposit(100)                    $300
  W: total = a.getBalance()            $100
  W: total = total + b.getBalance()    $400
  W: total = total + c.getBalance()    ...

Page 45: CSC  536 Lecture  3

Read and write operation conflict rules

Operations of different transactions   Conflict   Reason
read  / read     No     The effect of a pair of read operations does not depend on the order in which they are executed
read  / write    Yes    The effect of a read and a write operation depends on the order of their execution
write / write    Yes    The effect of a pair of write operations depends on the order of their execution

Page 46: CSC  536 Lecture  3

Serializability

Two transactions are serialized

if and only if

all pairs of conflicting operations of the two transactions are executed in the same order at all objects they both access.

Page 47: CSC  536 Lecture  3

A non-serialized interleaving of operations of transactions T and U

T: x = read(i)
T: write(i, 10)
U: y = read(j)
U: write(j, 30)
T: write(j, 20)
U: z = read(i)

Page 48: CSC  536 Lecture  3

Recoverability of aborts

Aborted transactions must be prevented from affecting other concurrent transactions

Dirty reads
Cascading aborts

Page 49: CSC  536 Lecture  3

A dirty read when transaction T aborts

Transaction T:
  balance = a.getBalance()
  a.setBalance(balance + 10)

Transaction U:
  balance = a.getBalance()
  a.setBalance(balance + 20)

Interleaving:
  T: balance = a.getBalance()      $100
  T: a.setBalance(balance + 10)    $110
  U: balance = a.getBalance()      $110
  U: a.setBalance(balance + 20)    $130
  U: commit transaction
  T: abort transaction

Page 50: CSC  536 Lecture  3

Cascading aborts

Suppose:
  Transaction U has seen the effects of transaction T
  Transaction V has seen the effects of transaction U
  T decides to abort

Page 51: CSC  536 Lecture  3

Cascading aborts

Suppose:
  Transaction U has seen the effects of transaction T
  Transaction V has seen the effects of transaction U
  T decides to abort

V and U must abort

Page 52: CSC  536 Lecture  3

Transactions T and U with locksTransaction T: balance = b.getBalance()b.setBalance(bal*1.1)a.withdraw(bal/10)

Transaction U: balance = b.getBalance()b.setBalance(bal*1.1)c.withdraw(bal/10)

Operations Locks Operations Locks

openTransactionbal = b.getBalance() lock B

b.setBalance(bal*1.1) openTransaction

a.withdraw(bal/10) lock A bal = b.getBalance() waits for T’slock on B

closeTransaction unlock A, B lock B

b.setBalance(bal*1.1) c.withdraw(bal/10) lock C

closeTransaction unlock B, C

Page 53: CSC  536 Lecture  3

Two-phase locking (2)

Idea: the scheduler grants locks in a way that creates only serializable schedules.

In 2-phase locking, the transaction acquires all the locks it needs in the first phase, and then releases them in the second. This will ensure a serializable schedule.

Dirty reads and cascading aborts are still possible

Page 54: CSC  536 Lecture  3

Two-phase locking (2)

Idea: the scheduler grants locks in a way that creates only serializable schedules.

In 2-phase locking, the transaction acquires all the locks it needs in the first phase, and then releases them in the second. This will ensure a serializable schedule.

Dirty reads and cascading aborts are still possible

Under strict 2-phase locking, a transaction that needs to read or write an object must be delayed until other transactions that wrote the same object have committed or aborted

Locks are held until transaction commits or aborts

Example: CORBA Concurrency Control Service
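A minimal threading sketch of strict two-phase locking (illustrative only; the class and method names are made up): locks are acquired as objects are first touched and are all released only at commit or abort.

import threading

class StrictTwoPhaseLocking:
    def __init__(self, locks):
        self.locks = locks   # shared dict: object id -> threading.Lock
        self.held = []

    def access(self, obj_id):
        # Growing phase: acquire the lock the first time the object is touched.
        lock = self.locks[obj_id]
        if lock not in self.held:
            lock.acquire()   # blocks until the holding transaction finishes
            self.held.append(lock)

    def finish(self):
        # Shrinking phase: release everything only at commit or abort.
        for lock in reversed(self.held):
            lock.release()
        self.held.clear()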

Page 55: CSC  536 Lecture  3

Two-phase locking in a distributed system

The data is assumed to be distributed across multiple machines

Centralized 2PL: central scheduler grants locks

Primary 2PL: local scheduler is coordinator for local data

Distributed 2PL (data may be replicated):
  The local schedulers use a distributed mutual exclusion algorithm to obtain a lock
  The local scheduler forwards Read/Write operations to data managers holding the replicas

Page 56: CSC  536 Lecture  3

Two-phase locking issues

Exclusive locks reduce concurrency more than necessary. It is sometimes preferable to allow concurrent transactions to read an object; two types of locks may be needed (read locks and write locks)

Deadlocks are possible.
  Solution 1: acquire all locks in the same order.
  Solution 2: use a graph to detect potential deadlocks.

Page 57: CSC  536 Lecture  3

Deadlock with write locks

Operations                  Locks
  T: a.deposit(100)         write lock A
  U: b.deposit(200)         write lock B
  T: b.withdraw(100)        waits for U's lock on B
  U: a.withdraw(200)        waits for T's lock on A

Page 58: CSC  536 Lecture  3

The wait-for graph

[Diagram: wait-for graph for the deadlock above. T holds the lock on A and waits for the lock on B; U holds the lock on B and waits for the lock on A.]

Page 59: CSC  536 Lecture  3

A cycle in a wait-for graph

[Diagram: a wait-for graph with a cycle through transactions T, U, and V.]
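Detecting a deadlock then amounts to finding a cycle in the wait-for graph. A small illustrative sketch, assuming each transaction waits for at most one other transaction:

def has_cycle(waits_for):
    # waits_for: dict mapping a transaction to the transaction it is waiting for.
    for start in waits_for:
        seen = set()
        current = start
        while current in waits_for:
            if current in seen:
                return True
            seen.add(current)
            current = waits_for[current]
    return False

# Example: T waits for U and U waits for T, so there is a deadlock.
assert has_cycle({"T": "U", "U": "T"})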

Page 60: CSC  536 Lecture  3

Deadlock prevention with timeouts

Operations                  Locks
  T: a.deposit(100)         write lock A
  U: b.deposit(200)         write lock B
  T: b.withdraw(100)        waits for U's lock on B
  U: a.withdraw(200)        waits for T's lock on A
  (timeout elapses)         T's lock on A becomes vulnerable: unlock A, abort T
  U: a.withdraw(200)        write lock A
  U: ...                    unlock A, B

Page 61: CSC  536 Lecture  3

Disadvantages of locking

High overhead

Deadlocks

Locks cannot be released until the end of the transaction, which reduces concurrency

In most applications, the likelihood of two clients accessing the same object is low

Page 62: CSC  536 Lecture  3

Pessimistic timestamp concurrency control

A transaction’s request to write an object is valid only if that object was last read and written by an earlier transaction

A transaction’s request to read an object is valid only if that object was last written by an earlier transaction

Advantage: Non-blocking and deadlock-free

Disadvantage: Transactions may need to abort and restart

Page 63: CSC  536 Lecture  3

Operation conflicts for timestamp ordering

Rule   Tc      Ti
1.     write   read     Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc ≥ the maximum read timestamp of the object.
2.     write   write    Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
3.     read    write    Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
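The rules above reduce to simple comparisons against the object's timestamps; a sketch with hypothetical helper names:

def write_allowed(tc, max_read_ts, committed_write_ts):
    # Rules 1 and 2: Tc may write only if no younger transaction
    # has read or written the object.
    return tc >= max_read_ts and tc > committed_write_ts

def read_allowed(tc, committed_write_ts):
    # Rule 3: Tc may read only if the object was last written
    # by an older transaction.
    return tc > committed_write_ts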

Page 64: CSC  536 Lecture  3

Pessimistic Timestamp Ordering

Concurrency control using timestamps.

Page 65: CSC  536 Lecture  3

Optimistic timestamp ordering

Idea: just go ahead and do the operations without paying attention to what concurrent transactions are doing:

Keep track of when each data item has been read and written.
Before committing, check whether any item has been changed since the transaction started. If so, abort. If not, commit.

Advantage: deadlock free and fast.
Disadvantage: it can fail and transactions must be run again.
Example: Scala Software Transactional Memory (next week)
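A minimal single-process sketch of this optimistic scheme (not Scala STM itself): reads record the version they saw, writes are buffered locally, and commit validates the read set before installing the write set. A real implementation would make validation and installation atomic.

class OptimisticTransaction:
    def __init__(self, store, versions):
        self.store = store          # shared dict: key -> value
        self.versions = versions    # shared dict: key -> version counter
        self.read_set = {}          # key -> version observed at first read
        self.write_set = {}         # key -> buffered new value

    def read(self, key):
        if key in self.write_set:
            return self.write_set[key]
        self.read_set.setdefault(key, self.versions.get(key, 0))
        return self.store[key]

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Validate: abort if anything read has changed since it was read.
        for key, seen in self.read_set.items():
            if self.versions.get(key, 0) != seen:
                return False        # caller re-runs the transaction
        for key, value in self.write_set.items():
            self.store[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True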