25
Introduction to MapReduce EECS 4415 Big Data Systems Tilemachos Pechlivanoglou [email protected]

Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Introduction to MapReduce

EECS 4415

Big Data Systems

Tilemachos Pechlivanoglou

[email protected]

Page 2: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

■ Our first peek into MapReduce implementation

■ Using Python

■ Example program: WordCount

MapReduce

2

Page 3: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

3

Conventional approach

Page 4: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Preparation:

Loading file line by line:

4

Conventional (step 0)

import sys

import re

sums = {}

for line in sys.stdin:

Page 5: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Removing non-word characters:

Splitting into words:

5

Conventional (step 1)

line = re.sub( r'^\W+|\W+$', '', line )

words = re.split( r'\W+', line )

Page 6: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Iterating over words:

Making everything lowercase:

Incrementing the count of every word in the dictionary

(if word doesn’t exist, get 0)6

Conventional (step 2)

for word in words:

word = word.lower()

sums[word] = sums.get( word, 0 ) + 1

Page 7: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

7

Conventional (Moby Dick)

Page 8: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

8

Conventional (output)

Page 9: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Limitations of approach

■ Requires use of dictionary– entire object stored in memory– if too big for memory crashes

■ Slower as dictionary grows– the bigger it is, the more time needed to get key (word)

9

Page 10: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Limitations of approach (graph)

10

Page 11: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce approach

Does not require a central data structure (dictionary)

Steps:■ map: intermediate results, associates them with output key■ shuffle: intermediate results, same output key■ reduce: final result, takes keys as input

11

Page 12: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Mapper

12

Page 13: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Same first steps:

13

MapReduce Mapper (step 0)

import sys

import re

sums = {}

for line in sys.stdin:

line = re.sub( r'^\W+|\W+$', '', line )

words = re.split( r'\W+', line )

for word in words:

Page 14: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Output word and count:■ convert to lowercase■ “\t” (tab) is Hadoop for “:” separates key from value

Conventional execution:

14

MapReduce Mapper (step 1)

print( word.lower() + "\t1" )

./mapper.py < input.txt

Page 15: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Mapper (output)

15

Page 16: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Simple sort of calculated words■ Running on a cluster, more distribution happens here

Conventional execution (Linux command):

16

MapReduce Shuffle

./mapper.py < input.txt | sort

Page 17: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Shuffle (output)

17

Page 18: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Reducer

18

Page 19: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Preparation:

Loading previous results line by line:

19

MapReduce Reducer (step 0)

import sys

previous = None

sum = 0

for line in sys.stdin:

Page 20: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Split pairs again:

If we are still counting occurences of the same word:

Unless it’s the first entry:

20

MapReduce Reducer (step 1)

key, value = line.split( '\t' )

if key != previous:

if previous is not None:

Page 21: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Sum up 2 words:

Otherwise, re-initialize for next word

Either way, add new value to sum

21

MapReduce Reducer (step 1)

print str( sum ) + '\t' + previous

previous = key

sum = 0

sum = sum + int( value )

Page 22: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Return those two words:

Conventional execution:

22

MapReduce Reducer (step 4)

print str( sum ) + '\t' + previous

./mapper.py < input.txt | sort | ./reducer.py

Page 23: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Reduce (output)

23

Page 24: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

MapReduce Execution

24

hadoop jar /usr/hadoop-3.0.0/share/hadoop/tools/lib/hadoop-streaming-

3.0.0.jar \

-file ./mapper.py \

-mapper ./mapper.py \

-file ./reducer.py \

-reducer ./reducer.py \

-input /input.txt \

-output /output

Page 25: Introduction to MapReduce - York Universitypapaggel/courses/eecs... · Iterating over words: Making everything lowercase: Incrementing the count of every word in the dictionary (if

Thank you!

25

Based onhttps://zettadatanet.wordpress.com/2015/04/04/a-hands-on-introduction-to-mapreduce-in-python/