Introduction to MapReduce


Introduction to MapReduce

Bhupesh Chawda

bhupesh@apache.org

DataTorrent

Why Hadoop?

● Data growth is mind-boggling. Forecast for 2020: 40 trillion GB

● Cost effective

● Scalable

● Fast

● Open source

Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg

What is MapReduce?

● It is a powerful paradigm for parallel computation

● Hadoop uses MapReduce to execute jobs on files in HDFS

● Hadoop intelligently distributes the computation over the cluster

● It takes the computation to the data

Analogy: Counting Fans

● Given a cricket stadium, count the number of fans for each player / team

● Traditional way

● Smart way

● Smarter way?

Origin: Functional Programming

● Map - Returns a list constructed by applying a function (the first argument) to all items in the list passed as the second argument

○ map f [a, b, c] = [f(a), f(b), f(c)]

○ map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1, 4, 9]

● Reduce - Returns a result constructed by applying a function (the first argument) over the list passed as the second argument. The function can be the identity (do nothing). A short Java sketch of both primitives follows.

○ reduce f [a, b, c] = f(a, b, c)

○ reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, sum(NULL)))) = 14
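Both primitives can be tried directly in Java; a minimal sketch using the streams API (class and variable names here are illustrative, not from the slides):

import java.util.List;
import java.util.stream.Collectors;

public class MapReduceFP {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3);

        // map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1, 4, 9]
        List<Integer> squares = input.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // reduce sum [1, 4, 9] = 1 + 4 + 9 = 14
        int sum = squares.stream().reduce(0, Integer::sum);

        System.out.println(squares + " -> " + sum);   // prints [1, 4, 9] -> 14
    }
}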

Sum of squares example

Sum of squares of even and odd numbers
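These two example slides are image-based in the original deck. As a stand-in, the even/odd variant can be sketched in plain Java by treating parity as the key and summing squares per key (names are illustrative):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SumOfSquaresByParity {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5);

        // "Map": emit (parity, square) pairs; "Reduce": sum the squares per key
        Map<String, Integer> result = input.stream()
                .collect(Collectors.groupingBy(
                        x -> (x % 2 == 0) ? "even" : "odd",
                        Collectors.summingInt(x -> x * x)));

        System.out.println(result);   // e.g. {even=20, odd=35} (map ordering not guaranteed)
    }
}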

Programming Model - Key-Value Pairs

● Format of input and output: (key, value)

● Map: (k1, v1) → list (k2, v2)

● Reduce: (k2, list(v2)) → list (k3, v3) - these signatures map onto the Hadoop class sketch below
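In Hadoop's Java API, these signatures appear as the four type parameters of the Mapper and Reducer base classes. A minimal sketch with the type choices used for word count (class names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (k1, v1) -> list (k2, v2), e.g. (byte offset, line) -> (word, 1)
class WordCountMapperTypes extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reduce: (k2, list v2) -> list (k3, v3), e.g. (word, [1, 1, ...]) -> (word, count)
class WordCountReducerTypes extends Reducer<Text, IntWritable, Text, IntWritable> { }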

Sum of squares of odd, even and prime

MapReduce overview

MapReduce with combiner

The Big Picture

Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914

The Bigger Picture

Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914

MapReduce Code Example - Word Count

Image Source: http://arnon.me/2014/06/mapreduce/

MapReduce - The Mapper

Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
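The mapper on this slide is an image taken from the linked tutorial; the word-count mapper there is roughly the following, reproduced here as a sketch:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input record: split the line into tokens
    // and emit (word, 1) for every token.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}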

MapReduce - The Reducer

Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
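Similarly, the reducer from the linked tutorial sums the counts it receives for each word; roughly:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Called once per key with all of its values:
    // sum the counts and emit (word, total).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}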

MapReduce - The Driver

Image Source: https://memegenerator.net/instance/56997204
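The driver wires the pieces together; a sketch along the lines of the same tutorial's word-count main method, using the mapper and reducer classes above (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner = local reduce on map output
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}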

Hadoop Distributions

Who is using Hadoop?

References

● https://hadoop.apache.org/

● www.slideshare.net/SandeepDeshmukh5/hadoopintroduction-46841859

● Hadoop: The Definitive Guide, 4th Edition

● Images shamelessly stolen from the internet - Have credited though!

Acknowledgements

● Sandeep Deshmukh, DataTorrent - for some of the slides

Thank You!!

Please send your questions to: bhupesh@apache.org

Extra Slides

Anatomy of a MapReduce Run

● In the classic MapReduce context

○ The client, which submits the job

○ The JobTracker, which coordinates the run

○ The TaskTrackers, which run the map and reduce tasks

○ HDFS

● In YARN context - Will see later

○ The client which submits the job

○ YARN resource manager

○ YARN node managers

○ MapReduce Application Master

○ HDFS

MapReduce in YARN (covered later)

The Map Side - Details

● Each map task writes its output to a circular in-memory buffer

● Once the buffer fills to a threshold, its contents start to spill to local disk (the buffer and spill settings are sketched in code after this list)

● Before writing to disk, the data is partitioned according to the reducers it will be sent to

● Each partition is sorted by key, and the combiner (if one is configured) is run on the sorted output

● Multiple spill files may have been created by the time the map finishes; these spill files are merged into a single partitioned, sorted output file

● The output file partitions are made available to reducers over HTTP
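The buffer, spill, and merge behaviour described above is governed by standard MapReduce configuration properties; a hedged sketch follows (the values shown are the commonly documented defaults - verify against your distribution before relying on them):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapSideTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Size of the circular in-memory buffer each map task writes into (MB).
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Fill fraction of that buffer at which a background spill to local disk starts.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // How many spill files are merged together in one pass when the map finishes.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        return Job.getInstance(conf, "map-side tuning example");
    }
}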

The Reduce Side - Details

● The map outputs sit on the local disks of the nodes that ran the maps; reduce tasks need this data in order to proceed

● Each reduce task needs the map output for its particular partition from several maps across the cluster

● A reduce task starts copying map outputs as soon as each map completes. This is the copy phase; the map outputs are fetched in parallel by multiple threads (the relevant shuffle settings are sketched in code after this list)

● Map outputs are copied into the reduce task JVM's memory if they are small enough, otherwise they are copied to disk. As copies accumulate, they are merged into larger sorted files; when all have been copied, they are merged while maintaining their sort order

● The reduce function is invoked for each key in the sorted output, and the output is written directly to HDFS
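The copy and merge phases are likewise configurable; a hedged sketch with the commonly documented defaults (again, check your distribution's documentation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceSideTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Number of parallel threads fetching map outputs during the copy phase.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
        // Fraction of the reducer JVM heap used to hold fetched map outputs in memory.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Usage threshold at which the in-memory outputs are merged and spilled to disk.
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        return Job.getInstance(conf, "reduce-side tuning example");
    }
}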

MapReduce as Unix Commands

Problem:

● Input

○ A 1 TB file containing color names - Red, Blue, Green, Yellow, Purple, Maroon

● Output

○ Number of occurrences of the colors Blue and Green (a MapReduce sketch of the same count follows)
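The unix pipeline itself was on the slide image and is not in the extracted text. As a rough companion, the same count expressed in MapReduce terms reuses the word-count pattern with a filter in the mapper (class name is hypothetical; pair it with the IntSumReducer shown earlier):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (color, 1) only for the colors of interest; the standard
// sum reducer then totals the occurrences per color.
public class ColorFilterMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text color = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String name = value.toString().trim();
        if (name.equals("Blue") || name.equals("Green")) {
            color.set(name);
            context.write(color, ONE);
        }
    }
}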
