CSC 536 Lecture 3
Outline
Akka example: mapreduce
Distributed transactions
MapReduce Framework: Motivation
Want to process lots of data ( > 1 TB)
Want to parallelize the job across hundreds/thousands of commodity CPUs connected by a commodity network
Want to make this easy, re-usable
Example Uses at Google
Pagerank
wordcount
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
…
Programming Model
Users implement interface of two functions:
mapper(in_key, in_value) -> list((out_key, intermediate_value))
reducer(out_key, intermediate_values list) -> (out_key, out_value)
Map phase
Records from the data source are fed into the mapper function as (key, value) pairs
(filename, content) (goal: wordcount)
(web page URL, web page content) (goal: web link-graph reversal)
mapper produces one or more intermediate (output key, intermediate value) pairs from the input
(word, 1) (wordcount)
(link URL, web page URL) (web link-graph reversal)
Reduce phase
After the Map phase is over, all the intermediate values for a given output key are combined together into a list
(“hello”, 1), (“hello”, 1), (“hello”, 1) -> (“hello”, [1,1,1])
This is done by an intermediate aggregation step of MapReduce
reducer function combines those intermediate values into one or more final values for that same output key
(“hello”, [1,1,1]) -> (“hello”, 3)
[Figure: MapReduce dataflow. Input (key, value) pairs are read from data stores 1..n and fed to parallel map tasks, each emitting intermediate (key, values...) pairs. == Barrier ==: intermediate values are aggregated by output key. Parallel reduce tasks then consume (key, intermediate values) pairs and produce the final values for each key.]
Parallelism
mapper functions run in parallel, creating different intermediate values from different input data sets
reducer functions also run in parallel, each working on a different output key
All values are processed independently
MapReduce example: wordcount
Problem: Count the number of occurrences of words in a set of files
Input to any MapReduce job: A set of (input_key, input_value) pairs
In wordcount: (input_key, input_value) = (filename, content)
filenames = ["a.txt", "b.txt", "c.txt"]
content = {}
for filename in filenames:
    f = open(filename)
    content[filename] = f.read()
    f.close()
MapReduce example: wordcount
The content of the input files
a.txt:
The quick brown fox jumped over the lazy grey dogs.
b.txt:
That's one small step for a man, one giant leap for mankind.
c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
MapReduce example: wordcount
Map phase: Function mapper is applied to every (filename, content) pair
mapper moves through the words in the file; for each word it encounters, it emits the intermediate key and value
(word, 1)
A call to mapper("a.txt", content["a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
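A minimal Python sketch of such a mapper; the slides only show its output, so the lowercasing and punctuation handling below are assumptions chosen to reproduce it:

import string

def mapper(input_key, input_value):
    # Split the file content into words, lowercase them, and drop
    # punctuation (including internal apostrophes, so "That's"
    # becomes 'thats' as in the output below), emitting (word, 1)
    # for each occurrence.
    output = []
    for word in input_value.split():
        cleaned = word.lower().strip(string.punctuation).replace("'", "")
        if cleaned:
            output.append((cleaned, 1))
    return output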
The output of the Map phase is the concatenation of the lists returned by mapper("a.txt", content["a.txt"]), mapper("b.txt", content["b.txt"]), and mapper("c.txt", content["c.txt"])
MapReduce example: wordcount
The output of the Map phase
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1),('thats', 1), ('one', 1), ('small', 1), ('step', 1),('for', 1), ('a', 1), ('man', 1), ('one', 1),('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]
MapReduce example: wordcount
The Map phase of MapReduce is logically trivial. But when the input dictionary has, say, 10 billion keys, and those keys point to files held on thousands of different machines, implementing the Map phase is actually quite non-trivial.
The MapReduce library should handle:
knowing which files are stored on what machines
making sure that machine failures don't affect the computation
making efficient use of the network
storing the output in a usable form
The programmer only writes the mapper function; the MapReduce framework takes care of everything else
MapReduce example: wordcount
In preparation for the Reduce phase, the MapReduce library groups together all the intermediate values which have the same key to obtain this intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
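A sketch of this grouping step in plain Python, assuming the entire Map output fits in one in-memory list (which the real library of course avoids):

from collections import defaultdict

def group_intermediate(map_output):
    # Collect all intermediate values for the same key into one list,
    # e.g. ('the', 1), ('the', 1), ('the', 1) becomes 'the': [1, 1, 1].
    intermediate = defaultdict(list)
    for key, value in map_output:
        intermediate[key].append(value)
    return dict(intermediate)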
MapReduce example: wordcount
In the Reduce phase, a programmer-defined function
reducer(out_key, intermediate_value_list)
is applied to each entry in the intermediate dictionary.
For wordcount, reducer sums up the list of intermediate values, and returns both out_key and the sum as the output.
def reducer(out_key, intermediate_value_list):
    return (out_key, sum(intermediate_value_list))
MapReduce example: wordcount
The output from the Reduce phase, and from the complete MapReduce computation, is:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
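Tying the phases together, a toy single-machine driver might look like this (map_reduce is an illustrative name, not part of any real MapReduce API):

from collections import defaultdict

def map_reduce(content, mapper, reducer):
    # Map phase: apply mapper to every (filename, content) pair
    # and concatenate the resulting intermediate lists.
    map_output = []
    for filename, text in content.items():
        map_output.extend(mapper(filename, text))
    # Group phase: collect intermediate values by output key.
    intermediate = defaultdict(list)
    for key, value in map_output:
        intermediate[key].append(value)
    # Reduce phase: apply reducer to each intermediate entry.
    return [reducer(key, values) for key, values in intermediate.items()]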
MapReduce example: wordcount
Map and Reduce can be done in parallel... but how is the grouping step that takes place between the Map phase and the Reduce phase done?
For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine
The general idea: Imagine you've got 1000 machines that you're going to use to run reduce on.
As the mapper functions compute the output keys and intermediate value lists, they compute hash(out_key) mod 1000 for some hash function.
This number identifies the machine in the cluster that the corresponding reducer will be run on, and the resulting output key and value list is then sent to that machine.
Because every machine running mapper uses the same hash function, this ensures that value lists corresponding to the same output key all end up at the same machine.
Furthermore, by using a hash we ensure that the output keys end up pretty evenly spread over machines in the cluster
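A sketch of the partitioning rule in Python; the cluster size of 1000 comes from the example above, and the use of md5 as a stable hash is an assumption:

import hashlib

NUM_REDUCE_MACHINES = 1000

def partition(out_key):
    # Every mapper applies the same hash function, so all value lists
    # for a given key are sent to the same reduce machine. Python's
    # built-in hash() is randomized per process, hence a stable hash.
    digest = hashlib.md5(out_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCE_MACHINES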
mapreduce example
See project mapreduce in the Lecture 3 code
MapReduce optimizations
Locality
Fault tolerance
Time optimization
Bandwidth optimization
Locality
Master program divvies up tasks based on location of data:
tries to have mapper tasks on the same machine as the physical file data, or at least on the same rack
mapper task inputs are divided into 64 MB blocks
same size as Google File System chunks
Redundancy for Fault Tolerance
Master detects worker failures via periodic heartbeats
Re-executes completed & in-progress mapper tasks of failed workers
Re-executes in-progress reducer tasks of failed workers
Redundancy for time optimization
Reduce phase can’t start until Map phase is complete
Slow workers significantly lengthen completion time:
A single slow disk controller can rate-limit the whole process
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled
Solution: Near the end of the phase, spawn backup copies of tasks
Whichever one finishes first "wins"
Effect: Dramatically shortens job completion time
Bandwidth Optimizations
“Aggregator” function can run on same machine as a mapper function
Causes a mini-reduce phase to occur before the real Reduce phase, to save bandwidth
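For wordcount, such a mini-reduce might look like the sketch below (combine_wordcount is an illustrative name, not part of the MapReduce interface):

from collections import defaultdict

def combine_wordcount(map_output):
    # Pre-sum the counts on the mapper's machine so that one
    # (word, n) pair per word crosses the network instead of
    # many (word, 1) pairs.
    counts = defaultdict(int)
    for word, count in map_output:
        counts[word] += count
    return list(counts.items())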
Distributed Transactions
Distributed transactions
Transactions, like mutual exclusion, protect shared data against simultaneous access by several concurrent processes.
Transactions allow a process to access and modify multiple data items as a single atomic transaction.
If the process backs out halfway during the transaction, everything is restored to the point just before the transaction started.
Distributed transactions: example 1
A customer dials into her bank web account and does the following:
Withdraws amount x from account 1.
Deposits amount x to account 2.
If telephone connection is broken after the first step but before the second, what happens?
Either both or neither should be completed.
This requires special primitives provided by the distributed system.
The Transaction Model
Examples of primitives for transactions
Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise
Distributed transactions: example 2
a) Transaction to reserve three flights commits
b) Transaction aborts when third flight is unavailable

(a) BEGIN_TRANSACTION
      reserve WP -> JFK;
      reserve JFK -> Nairobi;
      reserve Nairobi -> Malindi;
    END_TRANSACTION

(b) BEGIN_TRANSACTION
      reserve WP -> JFK;
      reserve JFK -> Nairobi;
      reserve Nairobi -> Malindi full =>
    ABORT_TRANSACTION
ACID
Transactions are
Atomic: to the outside world, the transaction happens indivisibly.
Consistent: the transaction does not violate system invariants.
Isolated (or serializable): concurrent transactions do not interfere with each other.
Durable: once a transaction commits, the changes are permanent.
Flat, nested and distributed transactions
a) A nested transaction
b) A distributed transaction
Implementation of distributed transactions
For simplicity, we consider transactions on a file system.
Note that if each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.
Other methods required.
Atomicity
If each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.
Solution 1: Private Workspace
a) The file index and disk blocks for a three-block file
b) The situation after a transaction has modified block 0 and appended block 3
c) After committing
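A toy Python sketch of the private-workspace idea, modeling a file as a dict from block number to data (all names here are illustrative):

class PrivateWorkspace:
    # The transaction sees a private, copy-on-write view of the file:
    # reads fall through to the original index, writes go to a shadow
    # index that replaces the real one only on commit.
    def __init__(self, file_blocks):
        self.original = file_blocks   # shared index: block number -> data
        self.shadow = {}              # private copies of modified blocks

    def read(self, block_no):
        return self.shadow.get(block_no, self.original.get(block_no))

    def write(self, block_no, data):
        self.shadow[block_no] = data  # never touches the original

    def commit(self):
        self.original.update(self.shadow)  # install the changes
        self.shadow.clear()

    def abort(self):
        self.shadow.clear()           # the private copies simply vanish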
Solution 2: Writeahead Log
(a) A transaction
(b)–(d) The log before each statement is executed

(a) x = 0;
    y = 0;
    BEGIN_TRANSACTION;
      x = x + 1;
      y = y + 2;
      x = y * y;
    END_TRANSACTION;

(b) Log: [x = 0 / 1]
(c) Log: [x = 0 / 1] [y = 0 / 2]
(d) Log: [x = 0 / 1] [y = 0 / 2] [x = 0 / 4]
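A minimal sketch of a writeahead log with rollback in Python; the (name, old value, new value) record format mirrors the slide, the rest is an assumption:

class WriteaheadLog:
    # Append (variable, old value, new value) to the log before each
    # write; on abort, replay the log backwards to restore old values.
    def __init__(self, store):
        self.store = store            # variable name -> value
        self.log = []

    def write(self, name, new_value):
        self.log.append((name, self.store[name], new_value))  # log first...
        self.store[name] = new_value                          # ...then write

    def abort(self):
        for name, old_value, _ in reversed(self.log):
            self.store[name] = old_value
        self.log.clear()

store = {"x": 0, "y": 0}
txn = WriteaheadLog(store)
txn.write("x", store["x"] + 1)   # log: [x = 0/1]
txn.write("y", store["y"] + 2)   # log: [x = 0/1] [y = 0/2]
txn.write("x", store["y"] ** 2)  # log: [x = 0/1] [y = 0/2] [x = 0/4]
txn.abort()                      # store is {"x": 0, "y": 0} again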
Concurrency control (1)
We just learned how to achieve atomicity; we will learn about durability when discussing fault tolerance
Need to handle consistency and isolation
Concurrency control allows several transactions to be executed simultaneously, while making sure that the data is left in a consistent state
This is done by scheduling operations on data in an order whereby the final result is the same as if all transactions had run sequentially
Concurrency control (2)
General organization of managers for handling transactions
Concurrency control (3)
General organization of managers for handling distributed transactions.
Serializability
The main issue in concurrency control is the scheduling of conflicting operations (operations on the same data item, at least one of which is a write)
Read/Write operations can be synchronized using:
Mutual exclusion mechanisms, or
Scheduling using timestamps
Pessimistic/optimistic concurrency control
The lost update problem
Transaction T:                      Transaction U:
balance = b.getBalance()            balance = b.getBalance()
b.setBalance(balance*1.1)           b.setBalance(balance*1.1)
a.withdraw(balance/10)              c.withdraw(balance/10)

balance = b.getBalance()    $200
                                    balance = b.getBalance()    $200
b.setBalance(balance*1.1)   $220
                                    b.setBalance(balance*1.1)   $220
a.withdraw(balance/10)      $80
                                    c.withdraw(balance/10)      $280

Accounts a, b, and c start with $100, $200, and $300, respectively
b ends at $220 instead of $242: T's update of b is lost
The inconsistent retrievals problem
Transaction V:                      Transaction W:
a.withdraw(100)                     aBranch.branchTotal()
b.deposit(100)

a.withdraw(100)    $100
                                    total = a.getBalance()          $100
                                    total = total+b.getBalance()    $300
                                    total = total+c.getBalance()    ...
b.deposit(100)     $300

Accounts a and b start with $200 each
W's total misses the $100 in transit between a and b: the retrievals are inconsistent
A serialized interleaving of T and U
Transaction T:                      Transaction U:
balance = b.getBalance()            balance = b.getBalance()
b.setBalance(balance*1.1)           b.setBalance(balance*1.1)
a.withdraw(balance/10)              c.withdraw(balance/10)

balance = b.getBalance()    $200
b.setBalance(balance*1.1)   $220
                                    balance = b.getBalance()    $220
                                    b.setBalance(balance*1.1)   $242
a.withdraw(balance/10)      $80
                                    c.withdraw(balance/10)      $278
A serialized interleaving of V and W
Transaction V:                      Transaction W:
a.withdraw(100)                     aBranch.branchTotal()
b.deposit(100)

a.withdraw(100)    $100
b.deposit(100)     $300
                                    total = a.getBalance()          $100
                                    total = total+b.getBalance()    $400
                                    total = total+c.getBalance()    ...
Read and write operation conflict rules
Operations of different transactions    Conflict    Reason
read   read     No     The effect of a pair of read operations does not depend on the order in which they are executed
read   write    Yes    The effect of a read and a write operation depends on the order of their execution
write  write    Yes    The effect of a pair of write operations depends on the order of their execution
Serializability
Two transactions are serialized
if and only if
all pairs of conflicting operations of the two transactions are executed in the same order at all objects they both access.
A non-serialized interleaving of operations of transactions T and U
Transaction T:              Transaction U:
x = read(i)
                            write(i, 10)
y = read(j)
                            write(j, 30)
write(j, 20)
                            z = read(i)

On object i, T's read precedes U's write; on object j, U's write precedes T's write. The conflicting pairs are ordered differently, so the interleaving is not serializable.
Recoverability of aborts
Aborted transactions must be prevented from affecting other concurrent transactions
Dirty reads
Cascading aborts
A dirty read when transaction T aborts
Transaction T:                      Transaction U:
a.getBalance()                      a.getBalance()
a.setBalance(balance + 10)          a.setBalance(balance + 20)

balance = a.getBalance()      $100
a.setBalance(balance + 10)    $110
                                    balance = a.getBalance()      $110
                                    a.setBalance(balance + 20)    $130
                                    commit transaction
abort transaction

U has committed a result based on a value written by T, which later aborts: a dirty read
Cascading aborts
Suppose:
Transaction U has seen the effects of transaction T
Transaction V has seen the effects of transaction U
T decides to abort
Then V and U must abort as well
Transactions T and U with locks

Transaction T:                      Transaction U:
bal = b.getBalance()                bal = b.getBalance()
b.setBalance(bal*1.1)               b.setBalance(bal*1.1)
a.withdraw(bal/10)                  c.withdraw(bal/10)

Operations                 Locks            Operations                 Locks
openTransaction
bal = b.getBalance()       lock B
b.setBalance(bal*1.1)                       openTransaction
a.withdraw(bal/10)         lock A           bal = b.getBalance()       waits for T's lock on B
closeTransaction           unlock A, B
                                            lock B
                                            b.setBalance(bal*1.1)
                                            c.withdraw(bal/10)         lock C
                                            closeTransaction           unlock B, C
Two-phase locking (1)
Idea: the scheduler grants locks in a way that creates only serializable schedules.
In 2-phase locking, the transaction acquires all the locks it needs in the first (growing) phase, and then releases them in the second (shrinking) phase. This ensures a serializable schedule.
Dirty reads and cascading aborts are still possible
Two-phase locking (2)
Under strict 2-phase locking, a transaction that needs to read or write an object must be delayed until other transactions that wrote the same object have committed or aborted
Locks are held until the transaction commits or aborts
Example: CORBA Concurrency Control Service
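A minimal sketch of strict 2PL in Python, with coarse exclusive locks only (class and method names are illustrative):

import threading

class StrictTwoPhaseLocker:
    # Growing phase: a lock is acquired on first use of each object.
    # Strictness: nothing is released until commit or abort, so other
    # transactions never see this transaction's uncommitted writes.
    def __init__(self, lock_table):
        self.lock_table = lock_table  # object name -> threading.Lock
        self.held = []

    def acquire(self, obj):
        if obj not in self.held:
            self.lock_table[obj].acquire()  # blocks while another txn holds it
            self.held.append(obj)

    def release_all(self):
        # Shrinking phase, run only at closeTransaction (commit/abort).
        for obj in reversed(self.held):
            self.lock_table[obj].release()
        self.held.clear()

locks = {"A": threading.Lock(), "B": threading.Lock()}
txn = StrictTwoPhaseLocker(locks)
txn.acquire("B")   # lock B before reading/writing b
txn.acquire("A")   # lock A before withdrawing from a
txn.release_all()  # closeTransaction: unlock A, B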
Two-phase locking in a distributed system
The data is assumed to be distributed across multiple machines
Centralized 2PL: central scheduler grants locks
Primary 2PL: local scheduler is coordinator for local data
Distributed 2PL (data may be replicated):
the local schedulers use a distributed mutual exclusion algorithm to obtain a lock
the local scheduler forwards Read/Write operations to the data managers holding the replicas
Two-phase locking issues
Exclusive locks reduce concurrency more than necessary. It is sometimes preferable to allow concurrent transactions to read an object; two types of locks may be needed (read locks and write locks)
Deadlocks are possible.
Solution 1: acquire all locks in the same order.
Solution 2: use a graph to detect potential deadlocks.
Deadlock with write locks
Transaction T                               Transaction U
Operations            Locks                 Operations            Locks
a.deposit(100)        write lock A
                                            b.deposit(200)        write lock B
b.withdraw(100)       waits for U's         a.withdraw(200)       waits for T's
                      lock on B                                   lock on A
The wait-for graph
[Figure: wait-for graph. T holds the lock on A and waits for the lock on B; U holds the lock on B and waits for the lock on A. Abstractly: T waits for U, and U waits for T.]
A cycle in a wait-for graph
[Figure: a cycle in a wait-for graph among transactions T, U, and V.]
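Deadlock detection then amounts to finding a cycle in the wait-for graph. A small sketch (the dict-of-sets representation is an assumption):

def has_deadlock(waits_for):
    # waits_for maps a transaction to the set of transactions it is
    # waiting on, e.g. {"T": {"U"}, "U": {"T"}}. Depth-first search:
    # reaching a node already on the current path means a cycle.
    on_path, done = set(), set()

    def dfs(txn):
        on_path.add(txn)
        for other in waits_for.get(txn, ()):
            if other in on_path:
                return True          # cycle found: deadlock
            if other not in done and dfs(other):
                return True
        on_path.discard(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in list(waits_for) if t not in done)

assert has_deadlock({"T": {"U"}, "U": {"T"}})   # the example above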
Deadlock prevention with timeouts
Transaction T                               Transaction U
Operations            Locks                 Operations            Locks
a.deposit(100)        write lock A
                                            b.deposit(200)        write lock B
b.withdraw(100)       waits for U's         a.withdraw(200)       waits for T's
                      lock on B                                   lock on A
(timeout elapses)
T's lock on A becomes vulnerable: unlock A, abort T
                                            a.withdraw(200)       write lock A
                                                                  unlock A, B
Disadvantages of locking
High overhead
Deadlocks
Locks cannot be released until the end of the transaction, which reduces concurrency
In most applications, the likelihood of two clients accessing the same object is low
Pessimistic timestamp concurrency control
A transaction’s request to write an object is valid only if that object was last read and written by an earlier transaction
A transaction’s request to read an object is valid only if that object was last written by an earlier transaction
Advantage: Non-blocking and deadlock-free
Disadvantage: Transactions may need to abort and restart
Operation conflicts for timestamp ordering
Rule    Tc       Ti
1.      write    read     Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc ≥ the maximum read timestamp of the object.
2.      write    write    Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
3.      read     write    Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
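A sketch of these checks in Python, ignoring tentative values of uncommitted transactions (the object layout and names are assumptions):

class Abort(Exception):
    pass  # the transaction must restart with a new timestamp

class TimestampedObject:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0    # maximum timestamp of any reader (rule 1)
        self.write_ts = 0   # timestamp of the committed writer (rules 2, 3)

def timestamped_read(obj, tc):
    if tc <= obj.write_ts:          # rule 3: written by a later Ti
        raise Abort("read too late")
    obj.read_ts = max(obj.read_ts, tc)
    return obj.value

def timestamped_write(obj, tc, value):
    if tc < obj.read_ts:            # rule 1: read by a later Ti
        raise Abort("write too late: object read by a later transaction")
    if tc <= obj.write_ts:          # rule 2: written by a later Ti
        raise Abort("write too late: object written by a later transaction")
    obj.value = value
    obj.write_ts = tc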
Pessimistic Timestamp Ordering
Concurrency control using timestamps.
Optimistic timestamp ordering
Idea: just go ahead and do the operations without paying attention to what concurrent transactions are doing:
Keep track of when each data item has been read and written.
Before committing, check whether any item has been changed since the transaction started.
If so, abort. If not, commit.
Advantage: deadlock free and fast.
Disadvantage: it can fail and transactions must be run again.
Example: Scala Software Transactional Memory (next week)
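A toy sketch of optimistic concurrency control in Python, with per-item version counters standing in for "when each data item has been read and written" (all names are illustrative):

class OptimisticTxn:
    # Reads record the version they saw; writes are buffered locally.
    # At commit time, validate that nothing read has changed since;
    # if validation fails, the caller must rerun the transaction.
    def __init__(self, store):
        self.store = store            # name -> (version, value)
        self.read_versions = {}
        self.write_buffer = {}

    def read(self, name):
        if name in self.write_buffer:
            return self.write_buffer[name]
        version, value = self.store[name]
        self.read_versions.setdefault(name, version)
        return value

    def write(self, name, value):
        self.write_buffer[name] = value   # invisible until commit

    def commit(self):
        for name, seen in self.read_versions.items():
            if self.store[name][0] != seen:
                return False              # changed since we read it: abort
        for name, value in self.write_buffer.items():
            old_version = self.store[name][0] if name in self.store else 0
            self.store[name] = (old_version + 1, value)
        return True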