EC2, MapReduce, and Distributed Processing

Jonathan Dahl

(and Rail Spikes, Slantwise, Zencoder,

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Refers to any of a variety of computer systems that use more than one computer, or processor, to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various

asynchronous processing /a·syn·chro·nous prŏs'ěs'ĭz/ noun Computations that run independently of each other, without requiring constant synchronization. Each operation

parallel processing /par·al·lel prŏs'ěs'ĭz/ noun Simultaneous computation of a single problem or system running across separate CPU cores. Includes

distributed processing/dis'trib'ut'ed prŏs'ěs'ĭz/ noun Just like parallel processing, but utilizing separate full systems, not just separate CPU cores.

Map______...

[Diagram: Transcoder 1, Transcoder 2, Transcoder 3, and Rails DB. Steps: 1. Poll Queue, 2. Get job (Message), 3. Result]

Roadmap:
I. Functional Programming
II. MapReduce
III. EC2
IV. Distributed Processing

1. Functional Programming

ƒ(x) vs. i++;

ƒ(x) = 2x + 1

ƒ(person) = first name + last name

lambda {|x| x*2 + 1 }

lambda do |user|
  "#{user.firstname} #{user.lastname}"
end

ƒ(users) = ∑ of logins for each user

users.sum { |user| user.number_of_logins }

var total_logins = 0;

for (i = 0; i < users.size; i++) {
  total_logins += number_of_logins(users[i])
}

users.sum(&:number_of_logins)

users.each {}

result = Array.new

users.each {|user| result << user.email }

result
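The accumulate-into-an-array loop above is the pattern that map replaces. A minimal sketch, assuming a hypothetical User struct with an email field (the slides do not define the model):

```ruby
# Hypothetical stand-in for the User model used on the slides.
User = Struct.new(:firstname, :email)

users = [
  User.new("david", "david@example.com"),
  User.new("anna", "anna@example.com")
]

# Imperative version: build the result by mutating an array.
result = Array.new
users.each { |user| result << user.email }

# Functional version: map returns the new array directly.
emails = users.map { |user| user.email }
```

Both versions produce the same list; the second has no mutable state outside the block.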

reduce

reduce == inject == fold

reduce(list, function, init)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

ƒ(x,y) = x + y
ƒ(x,y) = x << y if y > 0
ƒ(x,y) = x << y.upcase

lambda {|result, i| result + i}

lambda do |result, i|
  result << i if i > 0
  result
end

lambda {|r, i| r << i.upcase }

0
[]
Hash.new("")

list.reduce(init) {}

(1..10).reduce(0) do |r, x|
  r + x
end
# 55

reduce / inject / fold

list -> value

|result, x|
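The three function shapes and three init values above can be tried directly. A small sketch with made-up input lists; note that the block must always return the accumulator, including when nothing is appended:

```ruby
# Sum: init 0, f(x, y) = x + y
sum = (1..10).reduce(0) { |result, i| result + i }

# Keep positives: init [], f(x, y) = x << y if y > 0
positives = [3, -1, 0, 7].reduce([]) do |result, i|
  result << i if i > 0
  result # return the accumulator even when i was skipped
end

# Upcase and concatenate: init is an empty string, f(x, y) = x << y.upcase
shout = ["a", "b", "c", "d"].reduce(String.new) { |r, i| r << i.upcase }

# inject is an alias for reduce in Ruby.
same_sum = (1..10).inject(0) { |result, i| result + i }
```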

map(list, function)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

lambda {|x| x + 1 }
lambda {|x| x.upcase }
lambda {|x| x.nil? }

list.map {}

(1..10).map {|x| x > 5 }
# [false, false, false, false, false, true, true, true, true, true]

["a", "b", "c"] => ["A", "B", "C"]

User.all => ["david", "stanley", "anna"]
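The map examples above, written out as a sketch. The User struct and its firstname field are assumptions standing in for the Rails model on the slides:

```ruby
# Hypothetical stand-in for the User model on the slides.
User = Struct.new(:id, :firstname)
users = [User.new(19, "david"), User.new(43, "stanley")]

incremented = (1..10).map { |x| x + 1 }
upcased     = ["a", "b", "c"].map { |x| x.upcase }
nil_flags   = [1, nil, "c"].map { |x| x.nil? }
names       = users.map { |user| user.firstname }
```

In every case the output list has the same length as the input: map transforms each element independently.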

(1..5).map {|x| x * x}

1 * 1
2 * 2
3 * 3
4 * 4
5 * 5

parallelizable!

(1..5).reduce(1) {|r,x| r * x}

map: parallelizable

reduce: not (?)
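One way to read the question mark: reduce is sequential in general, but when the operation is associative (such as +), each slice can be reduced on its own and the partial results combined. A sketch using threads as stand-ins for separate servers; the slice size of 250 is arbitrary:

```ruby
numbers = (1..1000).to_a

# map: every element is independent, so each slice can be squared in its own thread.
map_threads = numbers.each_slice(250).map do |slice|
  Thread.new { slice.map { |x| x * x } }
end
squares = map_threads.flat_map(&:value) # value waits for the thread and returns its result

# reduce: for an associative operation, reduce each slice separately...
reduce_threads = squares.each_slice(250).map do |slice|
  Thread.new { slice.reduce(0) { |r, x| r + x } }
end
partial_sums = reduce_threads.map(&:value)

# ...then reduce the partial results into the final value.
total = partial_sums.reduce(0) { |r, x| r + x }
```

This only works because addition gives the same answer however the list is split. A reduce whose result depends on processing order cannot be divided this way.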

II. MapReduce

MapReduce != map + reduce

MAP a problem across several servers

REDUCE the results of each server to a single result set

list.map {|i| i.function }

(group)

results.reduce({}) {|final, i| final[i.key] = i.function; final }

key -> value

(1..10).map {|x| }

1. Initial data

(1..10).each_with_index.map {|x, i| }

1. Initial data

• GFS chunk identifier
• Book page number
• Web URL
• Arbitrary group ID

Map server 1:
'key1' -> 6.8
'key2' -> 6.9
'key3' -> 8.1

2. Intermediate data

Map server 2:
'key1' -> 6.2
'key4' -> 5.5

Reduce results:
'key1' -> 6.5
'key2' -> 6.9
'key3' -> 8.1
'key4' -> 5.5

3. Final data
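The numbers above are consistent with a reduce that averages the values for each key (6.8 and 6.2 give 6.5). A sketch under that assumption; the slides do not name the reduce function:

```ruby
# Intermediate key -> value pairs emitted by the two map servers (values from the slides).
map_server_1 = { "key1" => 6.8, "key2" => 6.9, "key3" => 8.1 }
map_server_2 = { "key1" => 6.2, "key4" => 5.5 }

# Group: collect every value emitted for a key into one list.
grouped = Hash.new { |hash, key| hash[key] = [] }
[map_server_1, map_server_2].each do |output|
  output.each { |key, value| grouped[key] << value }
end

# Reduce: collapse each key's list to one value. The mean is an assumption.
final = grouped.each_with_object({}) do |(key, values), result|
  total = values.reduce(0.0) { |r, v| r + v }
  result[key] = (total / values.size).round(1)
end
```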

another view

• Stage in between ‘map’ and ‘reduce’

• All mappers must finish before reduce

• Prepare intermediate results

• (Group results by key)

Parallel reduce?

ƒ(key1), ƒ(key3), ƒ(key4)

ƒ(key2), ƒ(key5)
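Once results are grouped by key, each key can be reduced independently, so the keys can be split across reducers. A sketch with made-up values, threads standing in for reduce servers, and a round-robin key assignment chosen only for illustration:

```ruby
# Grouped intermediate data: key -> list of values (made-up numbers).
grouped = {
  "key1" => [1, 2], "key2" => [3], "key3" => [4, 5],
  "key4" => [6],    "key5" => [7, 8]
}

reducer_count = 2

# Assign keys to reducers round-robin. Any scheme works as long as
# all values for a given key end up on the same reducer.
partitions = Array.new(reducer_count) { [] }
grouped.keys.each_with_index { |key, i| partitions[i % reducer_count] << key }

# Each reducer handles only its own keys.
threads = partitions.map do |keys|
  Thread.new do
    keys.each_with_object({}) do |key, result|
      result[key] = grouped[key].reduce(0) { |r, v| r + v }
    end
  end
end

# The partial results share no keys, so merging them is trivial.
final = threads.map(&:value).reduce({}) { |merged, part| merged.merge(part) }
```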

Example

chunky: 12
bacon: 15

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]

+1 second

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

word_chunks = input_words.each_slice(200).to_a

mapped_words = word_chunks.map do |words|
  distributed_count(words)
end

def distributed_count(words)
  c = words.inject(Hash.new(0)) do |i, word|
    i[word.downcase.to_sym] += 1
    i
  end
  c.sort{|a,b| b[1]<=>a[1]}
end

grouped_words = group(mapped_words)

# :the => [1829, 887, 1523] ...
# :cat => [19, 7, 36, 132] ...
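The slides do not show the body of group. A hypothetical implementation, assuming each mapper returns a list of [word, count] pairs as distributed_count does:

```ruby
# Hypothetical group step: merge per-chunk [word, count] pairs
# into word => [count, count, ...].
def group(mapped_words)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  mapped_words.each do |chunk_counts|
    chunk_counts.each { |word, count| grouped[word] << count }
  end
  grouped
end

# Made-up mapper output for three chunks.
mapped_words = [
  [[:the, 3], [:cat, 1]],
  [[:the, 2]],
  [[:cat, 4], [:the, 1]]
]
grouped_words = group(mapped_words)
```

Each key now carries one count per chunk in which the word appeared, which is the shape the final reduce expects.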

final_results = grouped_words.inject({}) do |result, words|
  result[words.first] = words.last.inject(0) {|r, i| r + i }
  result
end
words = final_results.sort{|a,b| b[1]<=>a[1]}

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]
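The whole pipeline can be run locally as a sketch. Here a short string stands in for the book file, the chunk size of 4 is arbitrary, and everything runs in one process; on a real cluster each distributed_count call would go to a different machine:

```ruby
# Stand-in for the book text; the slides read "wrnpc12.txt" instead.
text = "chunky bacon chunky Bacon the fox and the cat and the bacon"
input_words = text.split(" ")

# Map: count the words in a single chunk.
def distributed_count(words)
  words.inject(Hash.new(0)) do |counts, word|
    counts[word.downcase.to_sym] += 1
    counts
  end
end

word_chunks  = input_words.each_slice(4).to_a
mapped_words = word_chunks.map { |words| distributed_count(words) }

# Group: word => list of per-chunk counts.
grouped_words = mapped_words.each_with_object({}) do |chunk_counts, grouped|
  chunk_counts.each { |word, count| (grouped[word] ||= []) << count }
end

# Reduce: sum the per-chunk counts for each word.
final_results = grouped_words.inject({}) do |result, (word, counts)|
  result[word] = counts.inject(0) { |r, i| r + i }
  result
end

# Most frequent words first.
words = final_results.sort { |a, b| b[1] <=> a[1] }
```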

requirements

1. Fixed problem

2. Mappable problem

3. Distributed reduce

example uses

III. EC2

Example

1851-1922

Hadoop + EC2

Hadoop

100 instances

24 hours

(€164)

IV. Three Thoughts

[Diagram: Transcoder 1, Transcoder 2, Transcoder 3, and Rails DB. Steps: 1. Poll Queue, 2. Get job (Message), 3. Result]
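The poll, get job, report result loop in the diagram can be sketched in one process. Everything here is an assumption made for illustration: Ruby's in-memory Queue stands in for the message queue, a second Queue stands in for the Rails DB, and the "transcoding" is a filename change:

```ruby
jobs    = Queue.new # stand-in for the message queue
results = Queue.new # stand-in for the Rails DB

%w[a.mov b.mov c.mov d.mov].each { |file| jobs << file }

# Three workers, like Transcoder 1 to 3 in the diagram.
transcoders = (1..3).map do |id|
  Thread.new do
    loop do
      begin
        file = jobs.pop(true) # 1. poll the queue, 2. get a job (non-blocking)
      rescue ThreadError
        break # queue is empty, so this worker stops
      end
      results << [id, file.sub(".mov", ".mp4")] # 3. report the result
    end
  end
end
transcoders.each(&:join)

finished = []
finished << results.pop until results.empty?
```

Because workers pull jobs rather than having jobs pushed to them, adding a fourth transcoder requires no change to the producer.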

Hadoop

Thanks!
Jonathan Dahl

Slides at Rail Spikes http://railspikes.com

Photo Credits

• Rofi: http://flickr.com/photos/rofi/

• Digital:Slurp http://flickr.com/photos/digitalslurp/

• Others stolen from Google Image search
