EC2, MapReduce, and Distributed Processing

RailsConf Europe talk on MapReduce and EC2.

Jonathan Dahl

(and Rail Spikes, Slantwise, Zencoder, etc.)

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Refers to any of a variety of computer systems that use more than one computer, or processor, to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various

asynchronous processing /a·syn·chro·nous prŏs'ěs'ĭz/ noun Computations that run independently of each other, without requiring constant synchronization. Each operation

parallel processing /par·al·lel prŏs'ěs'ĭz/ noun Simultaneous computation of a single problem or system running across separate CPU cores. Includes

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Just like parallel processing, but utilizing separate full systems, not just separate CPU cores.

You

Me

Map______...

[Diagram: Rails DB, Message Queue, and Transcoders 1, 2, and 3. Steps: 1. Poll Queue, 2. Get job, 3. Result]

Roadmap:
I. Functional Programming
II. MapReduce
III. EC2
IV. Distributed Processing

I. Functional Programming

ƒ(x) vs. i++;

ƒ(x) = 2x + 1

ƒ(person) = first name + last name

lambda {|x| x*2 + 1 }

lambda do |user|
  "#{user.firstname} #{user.lastname}"
end

ƒ(users) = ∑ of logins for each user

users.sum { |user| user.number_of_logins }

var total_logins = 0;

for (i = 0; i < users.size; i++) { total_logins += number_of_logins(users[i])}

users.sum(&:number)
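The contrast between the loop and the `sum` call can be sketched in plain Ruby. The `User` struct and the login numbers below are invented for illustration, and `Array#sum` with a block is assumed to be available (it is built in from Ruby 2.4; at the time of the talk it came from ActiveSupport).

```ruby
# Hypothetical stand-in for a Rails model; the values are made up.
User = Struct.new(:name, :number_of_logins)
users = [User.new("anna", 3), User.new("david", 5), User.new("stanley", 2)]

# Imperative style: mutate a counter inside a loop.
total_logins = 0
users.each { |user| total_logins += user.number_of_logins }

# Functional style: describe the result as a function of the list.
functional_total = users.sum { |user| user.number_of_logins }
shorthand_total  = users.sum(&:number_of_logins)

puts total_logins     # 10
puts functional_total # 10
```

Both styles give the same number; the functional form simply has no counter variable to keep track of.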

users.each {}

result = Array.new

users.each {|user| result << user.email }

result

reduce

reduce == inject == fold

reduce(list, function, init)

list:
(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

function:
ƒ(x,y) = x + y
ƒ(x,y) = x << y if y > 0
ƒ(x,y) = x << y.upcase

lambda {|result, i| result + i}

lambda do |result, i|
  result << i if i > 0
  result
end

lambda {|r, i| r << i.upcase }

init:
0
[]
Hash.new("")

list.reduce(init) {}
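To make the three ingredients concrete, here is a small sketch that pairs each function from the slides with a matching list and init value. The sample lists are made up; the only point is that the block receives the running result plus the next element, and has to hand the running result back on every step.

```ruby
# f(x, y) = x + y, starting from 0
sum = (1..10).reduce(0) { |result, i| result + i }

# f(x, y) = x << y if y > 0, starting from []
# The block returns `result` on every step, including when the element is skipped.
positives = [3, -1, 0, 7].reduce([]) do |result, i|
  result << i if i > 0
  result
end

# f(x, y) = x << y.upcase, starting from an empty (unfrozen) string
shout = ["a", "b", "c", "d"].reduce(+"") { |r, i| r << i.upcase }

puts sum               # 55
puts positives.inspect # [3, 7]
puts shout             # ABCD
```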

(1..10).reduce(0) do |r, x|
  r + x
end
# 55
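The step-by-step walk through this reduce can be reproduced by recording each call to the block. This is only a sketch; the `trace` array is an addition for illustration and is not part of the original example.

```ruby
trace = []
total = (1..10).reduce(0) do |r, x|
  trace << [r, x, r + x] # accumulator so far, current element, new accumulator
  r + x
end

# Print the first few steps of the fold.
trace.first(3).each { |r, x, sum| puts "#{r} + #{x} = #{sum}" }
puts total # 55
```

Each step needs the result of the step before it, which is what makes a plain reduce sequential.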

reduce, inject, fold: list -> value

|result, x|

map(list, function)

list:
(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

function:
lambda {|x| x + 1 }
lambda {|x| x.upcase }
lambda {|x| x.nil? }

list.map {}

(1..10).map {|x| x > 5 }
# [false, false, false, false, false, true, true, true, true, true]

["a", "b", "c"] => ["A", "B", "C"]

User.all => ["david", "stanley", "anna"]

(1..5).map {|x| x * x}

1 * 1
2 * 2
3 * 3
4 * 4
5 * 5

parallelizable!
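Because each square depends only on its own element, the work can be handed out independently. A minimal sketch using Ruby threads, chosen here purely for illustration: on MRI the global lock means threads will not speed up CPU-bound work like this, so read it as a picture of independence rather than a benchmark.

```ruby
# One thread per element; each block only ever looks at its own x.
threads = (1..5).map { |x| Thread.new { x * x } }

# Thread#value waits for the thread and returns the block's result,
# so the results come back in list order regardless of finish order.
squares = threads.map(&:value)

puts squares.inspect # [1, 4, 9, 16, 25]
```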

(1..5).reduce(1) {|i,x| i * x}

map: parallelizable

reduce: not (?)
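The question mark is earned: when the combining function is associative (addition is), a reduce can be split into partial reduces whose results are reduced again. A small sketch, with the slice size of 3 picked arbitrarily:

```ruby
numbers = (1..10).to_a

# Plain sequential reduce.
sequential = numbers.reduce(0) { |r, x| r + x }

# Split the list, reduce each slice on its own (these could run on
# separate machines), then reduce the partial results.
partials = numbers.each_slice(3).map { |slice| slice.reduce(0) { |r, x| r + x } }
combined = partials.reduce(0) { |r, x| r + x }

puts partials.inspect # [6, 15, 24, 10]
puts combined         # 55
```

This only works because regrouping the additions does not change the answer; a function like subtraction would not survive the split.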

II. MapReduce

MapReduce != map + reduce

MAP a problem across several servers

REDUCE the results of each server to a single result set

list.map {|i| i.function }

(group)

results.reduce {|final, i| final[i.key] = i.function }

key -> value

1. Initial data

(1..10).map {|x| }

(1..10).map_with_index {|i, x| }

• GFS chunk identifier
• Book page number
• Web URL
• Arbitrary group ID

2. Intermediate data

Map server 1:
'key1' -> 6.8
'key2' -> 6.9
'key3' -> 8.1

Map server 2:
'key1' -> 6.2
'key4' -> 5.5

3. Final data

Reduce results:
'key1' -> 6.5
'key2' -> 6.9
'key3' -> 8.1
'key4' -> 5.5
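The numbers on these slides can be reproduced with a few lines of Ruby. The slides do not name the reduce function; an average is assumed here because it is the simplest function that turns 6.8 and 6.2 into 6.5. The variable names are illustrative only.

```ruby
# Intermediate data, as emitted by the two map servers.
map_server_1 = { "key1" => 6.8, "key2" => 6.9, "key3" => 8.1 }
map_server_2 = { "key1" => 6.2, "key4" => 5.5 }

# Group: collect every value emitted for a key, across all map servers.
grouped = Hash.new { |hash, key| hash[key] = [] }
[map_server_1, map_server_2].each do |output|
  output.each { |key, value| grouped[key] << value }
end

# Reduce: ASSUMED to be an average (matches 'key1' -> 6.5).
final = grouped.map do |key, values|
  [key, (values.sum / values.size).round(1)]
end.to_h

puts grouped["key1"].inspect # [6.8, 6.2]
puts final.inspect
```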

another view

• Stage in between ‘map’ and ‘reduce’

• All mappers must finish before reduce

• Prepare intermediate results

• (Group results by key)

Parallel reduce?

ƒ(key1), ƒ(key3), ƒ(key4)

ƒ(key2), ƒ(key5)
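Once results are grouped by key, each key can be reduced without looking at any other key, so the keys can be divided among reducers. A sketch of that split, using the same two groups of keys as above: the values for `key5`, the averaging function, and the use of threads as stand-ins for reduce servers are all assumptions for illustration.

```ruby
# Grouped intermediate data; key5's value is invented.
grouped = {
  key1: [6.8, 6.2], key2: [6.9], key3: [8.1], key4: [5.5], key5: [7.0]
}

# Which reducer handles which keys.
assignments = [[:key1, :key3, :key4], [:key2, :key5]]

# ASSUMED reduce function: average of the values for one key.
reduce_key = ->(values) { (values.sum / values.size).round(1) }

# Each reducer works through its own keys independently.
reducers = assignments.map do |keys|
  Thread.new { keys.map { |key| [key, reduce_key.call(grouped[key])] }.to_h }
end

# Merge the per-reducer result sets into one.
final = reducers.map(&:value).reduce({}) { |all, part| all.merge(part) }
puts final.inspect
```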

Example

chunky: 12
bacon: 15

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]

+1 second

word_chunks = input_words.chunk(200)

mapped_words = word_chunks.map do |words|
  distributed_count(words)
end

def distributed_count(words)
  c = words.inject(Hash.new(0)) do |i, word|
    i[word.downcase.to_sym] += 1
    i
  end
  c.sort{|a,b| b[1]<=>a[1]}
end

grouped_words = group(mapped_words)

# :the => [1829, 887, 1523] ..
# :cat => [19, 7, 36, 132] ...

final_results = grouped_words.inject({}) do |result, words|
  result[words.first] = words.last.inject(0) {|r, i| r + i }
  result
end
words = final_results.sort{|a,b| b[1]<=>a[1]}

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]
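The chunk, map, group, and reduce steps can be run end to end on a small in-memory text. This is a sketch under a few assumptions: `each_slice` stands in for the `chunk` call, the `group` helper is a hypothetical implementation (the slides use it but do not show it), `distributed_count` returns a hash here rather than a sorted list, and the sample sentence replaces the book file.

```ruby
text = "chunky bacon chunky bacon bacon the cat and the bacon"
input_words = text.split(" ")

# Count the words in one chunk; in a real setup this runs on a map server.
def distributed_count(words)
  words.inject(Hash.new(0)) do |counts, word|
    counts[word.downcase.to_sym] += 1
    counts
  end
end

# Hypothetical group step: :word => [count_in_chunk_1, count_in_chunk_2, ...]
def group(mapped)
  mapped.each_with_object(Hash.new { |h, k| h[k] = [] }) do |counts, grouped|
    counts.each { |word, count| grouped[word] << count }
  end
end

# Single-process baseline.
single = distributed_count(input_words)

# Map: chunks of 4 words each (size chosen arbitrarily).
mapped_words = input_words.each_slice(4).map { |words| distributed_count(words) }

# Group, then reduce by adding up the per-chunk counts for each word.
grouped_words = group(mapped_words)
final_results = grouped_words.inject({}) do |result, (word, counts)|
  result[word] = counts.inject(0) { |r, i| r + i }
  result
end

puts grouped_words[:bacon].inspect # [2, 1, 1]
puts final_results[:bacon]         # 4
```

The distributed result matches the single-process count, which is the property the split relies on.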

requirements

1. Fixed problem

2. Mappable problem

3. Distributed reduce

example uses

III. EC2

Why?

Example

1851-1922

4TB

Hadoop + EC2

Hadoop

100 instances

24 hours

$240

(€164)
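The three figures above imply a price per instance-hour. A quick check using only the numbers on the slides; the resulting rate is derived from them, not quoted from any price list.

```ruby
instances = 100
hours     = 24
total_usd = 240.0

instance_hours        = instances * hours          # total machine time bought
usd_per_instance_hour = total_usd / instance_hours # implied hourly rate

puts instance_hours                 # 2400
puts usd_per_instance_hour.round(2) # 0.1
```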

IV. Three Thoughts

[Diagram: Rails DB, Message Queue, and Transcoders 1, 2, and 3. Steps: 1. Poll Queue, 2. Get job, 3. Result]
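The poll, get job, result loop can be sketched in a single process. Everything here is a stand-in: Ruby's in-memory `Queue` plays the message queue, a second `Queue` plays the place results are written back to, threads play the transcoder machines, and the file names are invented. A real deployment would use a networked queue service instead.

```ruby
jobs    = Queue.new # stand-in for the message queue
results = Queue.new # stand-in for reporting results back

# Hypothetical jobs.
%w[a.mov b.mov c.mov d.mov].each { |file| jobs << file }
jobs.close # once closed and drained, pop returns nil (Ruby 2.3+)

transcoders = (1..3).map do |id|
  Thread.new do
    # 1. Poll the queue, 2. get a job; stop when the queue is drained.
    while (job = jobs.pop)
      # 3. Report the result (the actual transcoding is omitted).
      results << [job, "done by transcoder #{id}"]
    end
  end
end
transcoders.each(&:join)

done = []
done << results.pop until results.empty?
puts done.map(&:first).sort.inspect # ["a.mov", "b.mov", "c.mov", "d.mov"]
```

Each job is taken by exactly one worker, and adding capacity is a matter of starting more workers against the same queue.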

Hadoop

Thanks!

Jonathan Dahl

Slides at Rail Spikes http://railspikes.com

Photo Credits

•Rofi: http://flickr.com/photos/rofi/

•Digital:Slurp http://flickr.com/photos/digitalslurp/

•Others stolen from Google Image search
