
Introduction to MapReduce

EECS 4415: Big Data Systems

York University

Tilemachos Pechlivanoglou

tipech@eecs.yorku.ca

■ Our first peek into MapReduce implementation

■ Using Python

■ Example program: WordCount

MapReduce


Conventional approach

Conventional (step 0)

Preparation:

import sys

import re

sums = {}

Loading the file line by line:

for line in sys.stdin:

Conventional (step 1)

Removing leading and trailing non-word characters:

line = re.sub( r'^\W+|\W+$', '', line )

Splitting the line into words:

words = re.split( r'\W+', line )

Conventional (step 2)

Iterating over the words:

for word in words:

Making everything lowercase:

word = word.lower()

Incrementing the count of each word in the dictionary (if the word doesn't exist yet, get 0):

sums[word] = sums.get( word, 0 ) + 1
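Assembled into one piece, the conventional counter might look like this sketch (the `word_count` function wrapper and the sample text are my own additions, made so the logic is easy to test — the slides pipe `sys.stdin` instead):

```python
import re

def word_count(text):
    """Count word occurrences the conventional way: one in-memory dict."""
    sums = {}
    for line in text.splitlines():
        line = re.sub(r'^\W+|\W+$', '', line)   # strip leading/trailing non-word chars
        for word in re.split(r'\W+', line):
            if not word:                        # skip empty tokens from blank lines
                continue
            word = word.lower()
            sums[word] = sums.get(word, 0) + 1  # missing word -> start from 0
    return sums

print(word_count("the cat and the hat"))  # -> {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```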

Conventional (Moby Dick)

Conventional (output)

Limitations of approach

■ Requires use of a dictionary
– the entire object is stored in memory
– if it is too big for memory, the program crashes

■ Slower as the dictionary grows
– the bigger it is, the more time needed to look up a key (word)

Limitations of approach (graph)


MapReduce approach

Does not require a central data structure (a dictionary)

Steps:
■ map: processes the input and emits intermediate results, associating each with an output key
■ shuffle: groups intermediate results that share the same output key
■ reduce: takes each key's grouped values as input and produces the final result
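The three steps can be sketched in plain Python (a toy simulation of my own, not Hadoop itself); the point is that the shuffle is just a sort, which brings equal keys next to each other so reduce can stream over them:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map: emit one (word, 1) pair per word
    return [(w.lower(), 1) for w in line.split()]

def word_count(lines):
    # map phase: run the mapper over every input line
    pairs = [p for line in lines for p in mapper(line)]
    # shuffle phase: sorting makes identical keys adjacent
    pairs.sort(key=itemgetter(0))
    # reduce phase: sum the values of each key's group
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

print(word_count(["the cat sat", "the hat"]))  # -> {'cat': 1, 'hat': 1, 'sat': 1, 'the': 2}
```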

MapReduce Mapper


MapReduce Mapper (step 0)

Same first steps, minus the dictionary (the mapper keeps no central state):

import sys

import re

for line in sys.stdin:

line = re.sub( r'^\W+|\W+$', '', line )

words = re.split( r'\W+', line )

for word in words:

MapReduce Mapper (step 1)

Output the word and a count of 1:
■ convert to lowercase
■ "\t" (tab) is the Hadoop streaming separator between key and value (its version of ":")

print( word.lower() + "\t1" )

Conventional execution:

./mapper.py < input.txt
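Collected into one file, the whole mapper might look like this sketch (the `map_line` helper is my own structuring, added for testability; the slides inline everything in the loop):

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word of the input
import sys
import re

def map_line(line):
    """Turn one line of text into a list of 'word\t1' records."""
    line = re.sub(r'^\W+|\W+$', '', line)   # strip leading/trailing non-word chars
    return [w.lower() + "\t1" for w in re.split(r'\W+', line) if w]

if __name__ == "__main__":
    for line in sys.stdin:
        for record in map_line(line):
            print(record)                   # tab separates key from value
```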

MapReduce Mapper (output)


MapReduce Shuffle

Simple sort of the emitted word/count pairs
■ Running on a cluster, more distribution happens here

Conventional execution (Linux command):

./mapper.py < input.txt | sort
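To see why a plain `sort` is enough for the shuffle, note that sorting the mapper's output makes every copy of a key adjacent (toy data of my own), which is all the reducer needs to sum counts in one streaming pass:

```python
# mapper output: keys are scattered in input order
mapped = ["the\t1", "cat\t1", "the\t1", "hat\t1"]

# shuffle: sorting groups equal keys together
shuffled = sorted(mapped)
print(shuffled)  # -> ['cat\t1', 'hat\t1', 'the\t1', 'the\t1']
```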

MapReduce Shuffle (output)


MapReduce Reducer


MapReduce Reducer (step 0)

Preparation:

import sys

previous = None

sum = 0

Loading the previous (sorted) results line by line:

for line in sys.stdin:

MapReduce Reducer (step 1)

Split the key/value pairs again:

key, value = line.split( '\t' )

If we have moved on to a new word (no longer counting occurrences of the same one):

if key != previous:

Unless it's the first entry:

if previous is not None:

MapReduce Reducer (step 2)

Print the finished word and its total:

print( str( sum ) + '\t' + previous )

Then re-initialize for the next word:

previous = key

sum = 0

Either way, add the new value to the sum:

sum = sum + int( value )
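The full reducer, gathered into one sketch (I wrapped the loop in a `reduce_stream` helper for testability and renamed `sum` to `total` so it does not shadow Python's builtin; the slides' inline version is equivalent):

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts of consecutive identical keys from sorted input
import sys

def reduce_stream(lines):
    """Yield 'sum\tword' for each run of identical keys (input must be sorted)."""
    previous = None
    total = 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != previous:              # a new word has started
            if previous is not None:     # emit the finished word, unless first entry
                yield str(total) + "\t" + previous
            previous = key               # re-initialize for the new word
            total = 0
        total += int(value)              # either way, accumulate
    if previous is not None:             # emit the very last word after the loop
        yield str(total) + "\t" + previous

if __name__ == "__main__":
    for record in reduce_stream(sys.stdin):
        print(record)
```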

MapReduce Reducer (step 3)

After the loop, print the sum for the very last word:

print( str( sum ) + '\t' + previous )

Conventional execution:

./mapper.py < input.txt | sort | ./reducer.py

MapReduce Reducer (output)


MapReduce Execution


hadoop jar /usr/hadoop-3.0.0/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar \
  -file ./mapper.py \
  -mapper ./mapper.py \
  -file ./reducer.py \
  -reducer ./reducer.py \
  -input /input.txt \
  -output /output

Thank you!


Based on https://zettadatanet.wordpress.com/2015/04/04/a-hands-on-introduction-to-mapreduce-in-python/
