apache-apex
Why Hadoop?
Data growth is mind-boggling. Forecast for 2020: 40 trillion GB
Cost effective
Scalable
Fast
Open source
Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg
What is MapReduce?
It is a powerful paradigm for parallel computation
Hadoop uses MapReduce to execute jobs on files in HDFS
Hadoop distributes the computation across the cluster
It takes the computation to the data instead of moving the data to the computation
Analogy: Counting Fans
Given a cricket stadium, count the number of fans for each player / team
Traditional way
Smart way
Smarter way?
Origin: Functional Programming
Map - Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument
map f [a, b, c] = [f(a), f(b), f(c)]
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1, 4, 9]
Reduce - Returns a single value constructed by combining the items of the list passed as the second argument using a function (the first argument). In MapReduce, the reduce function can also be the identity (do nothing).
reduce f [a, b, c] = f(a, f(b, f(c, e))), where e is the identity element of f (0 for sum)
reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, 0))) = 14
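The two definitions above correspond closely to Python's built-in `map` and `functools.reduce`. The following is a minimal sketch, assuming Python as the illustration language (the slides use functional pseudocode); `sq` is a hypothetical helper named after the one on the slide.

```python
from functools import reduce
from operator import add

def sq(x):
    """Square a single number (hypothetical helper mirroring `sq` on the slide)."""
    return x * x

numbers = [1, 2, 3]

# map: apply sq to every item, producing a new list
squares = list(map(sq, numbers))

# reduce: fold the list into one value, starting from the identity 0
total = reduce(add, squares, 0)

print(squares, total)
```

Chaining the two calls gives the sum of squares in one expression: `reduce(add, map(sq, numbers), 0)`.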
Sum of squares example
Sum of squares of even and odd numbers
Programming model - Key Value Pairs
Format of input and output:
(key, value)
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)
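The signatures above can be made concrete with a small in-memory simulation. This is only a sketch of the programming model, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the grouping step stands in for the shuffle that the framework performs. It computes the sum of squares of even and odd numbers.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). k1 (the input position) is ignored here."""
    parity = "even" if value % 2 == 0 else "odd"
    return [(parity, value * value)]

def reduce_fn(key, values):
    """Reduce: (k2, list of v2) -> list of (k3, v3)."""
    return [(key, sum(values))]

def run_job(records):
    # Shuffle: group every mapped value under its intermediate key
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce each key in sorted key order
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

result = run_job(enumerate([1, 2, 3, 4, 5]))
print(result)
```

Adding a third category such as "prime" would only change `map_fn`, which may then emit more than one pair per input; the reduce side stays the same.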
Sum of squares of odd, even and prime
Map reduce overview
Map reduce with combiner
The Big Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
The Bigger Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
MapReduce Code Example - Word Count
Image Source: http://arnon.me/2014/06/mapreduce/
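As a rough illustration of the word count flow, here is a minimal in-memory sketch in Python. It is not the Hadoop Java API; `mapper`, `reducer` and `word_count` are hypothetical names, and the dictionary grouping stands in for the framework's shuffle.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Sum the counts collected for one word."""
    return (word, sum(counts))

def word_count(lines):
    # Shuffle stand-in: collect all the 1s emitted for each word
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in grouped.items())

counts = word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(counts)
```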
MapReduce - The Mapper
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Reducer
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Driver
Image Source: https://memegenerator.net/instance/56997204
Hadoop Distributions
Who is using Hadoop?
References
https://hadoop.apache.org/
www.slideshare.net/SandeepDeshmukh5/hadoopintroduction-46841859
Hadoop - The Definitive Guide - 4th Edition
Images shamelessly stolen from the internet - Have credited though!
Acknowledgements
Sandeep Deshmukh, DataTorrent - For some of the slides
Thank You!!
Please send your questions to: [email protected] / [email protected]
Extra Slides
Anatomy of a Map reduce run
In Map reduce context
The client which submits the job
Job tracker which coordinates the run
Task trackers which run the map and reduce tasks
HDFS
In YARN context - Will see later
The client which submits the job
YARN resource manager
YARN node managers
Map Reduce App Master
HDFS
Map reduce in YARN - Will see later
The Map Side - Details
The map task writes its output to a circular in-memory buffer
Once the buffer reaches a threshold, its contents start to spill to local disk
Before writing to disk, the data is partitioned according to the reducers that it will be sent to
Within each partition the data is sorted by key, and the combiner (if one is configured) is run on the sorted output
Multiple spill files may have been created by the time the map finishes. These spill files are merged into a single partitioned, sorted output file
The output file partitions are made available to reducers over HTTP
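The partition, sort and combine steps of a single spill can be sketched as follows. This is an illustration only, not Hadoop's implementation: `partition` and `spill` are hypothetical names, and the partitioning rule (sum of character codes modulo the number of reducers) is an assumed deterministic stand-in for a hash partitioner.

```python
from itertools import groupby

def partition(key, num_reducers):
    """Pick the reducer for a key. Assumption: sum of character codes modulo
    the reducer count, a deterministic stand-in for a hash partitioner."""
    return sum(ord(c) for c in key) % num_reducers

def spill(buffer, num_reducers, combiner):
    """Turn one buffer of (key, value) pairs into partitioned, sorted, combined output."""
    partitions = {p: [] for p in range(num_reducers)}
    for key, value in buffer:
        partitions[partition(key, num_reducers)].append((key, value))
    spilled = {}
    for p, pairs in partitions.items():
        pairs.sort(key=lambda kv: kv[0])  # sort each partition by key
        # Combine: collapse each run of equal keys into a single pair
        spilled[p] = [
            (key, combiner(v for _, v in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        ]
    return spilled

out = spill([("b", 1), ("a", 1), ("b", 1), ("c", 1)], num_reducers=2, combiner=sum)
print(out)
```

Running the combiner here shrinks the two `("b", 1)` pairs into one `("b", 2)` pair before anything is sent to a reducer, which is the point of combining on the map side.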
The Reduce Side - Details
The map outputs are sitting on the local disks of the machines that ran the map tasks. Reduce tasks need this data in order to proceed
Each reduce task needs the map output for its particular partition from several maps across the cluster
The reduce task starts copying the map outputs as soon as each map completes. This is the copy phase. The map outputs are fetched in parallel by multiple threads.
Map outputs are copied into the reduce task's JVM memory if small enough, else copied to disk. As copies accumulate, they are merged into larger sorted files. When all are copied, they are merged, maintaining their sort order
The reduce function is invoked for each key in the sorted output, and its output is written directly to HDFS
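The merge and reduce steps can be sketched with the standard library. This is a simplified in-memory illustration, not Hadoop code: `reduce_side` is a hypothetical name, and the three lists stand in for sorted partitions already fetched from three map tasks.

```python
import heapq
from itertools import groupby

def reduce_side(map_outputs, reduce_fn):
    """Merge already-sorted map outputs, then call reduce_fn once per key."""
    # heapq.merge keeps the combined stream sorted without re-sorting everything
    merged = heapq.merge(*map_outputs, key=lambda kv: kv[0])
    return [
        (key, reduce_fn([v for _, v in group]))
        for key, group in groupby(merged, key=lambda kv: kv[0])
    ]

# Each inner list is one map task's sorted output for this reducer's partition
fetched = [
    [("a", 1), ("c", 2)],
    [("a", 3), ("b", 1)],
    [("b", 4), ("c", 1)],
]
result = reduce_side(fetched, sum)
print(result)
```

Because every input is already sorted by key, a streaming merge is enough to bring equal keys together, so the reduce function sees each key exactly once.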
Map reduce as unix commands
Problem:
Input: 1 TB file containing color names - Red, Blue, Green, Yellow, Purple, Maroon
Output: Number of occurrences of colors Blue and Green
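The problem decomposes into filter, sort and count stages, similar in spirit to a `grep | sort | uniq -c` pipeline. The sketch below shows those stages in Python under loud assumptions: `count_colors` is a hypothetical name, the input has one color name per line, and a small in-memory list stands in for the 1 TB file, which would not fit in memory.

```python
from itertools import groupby

def count_colors(lines, wanted=("Blue", "Green")):
    # "Map": keep only the colors of interest (like grep)
    selected = (line.strip() for line in lines)
    selected = [c for c in selected if c in wanted]
    # "Shuffle": sort so equal colors sit next to each other (like sort)
    selected.sort()
    # "Reduce": count each run of identical colors (like uniq -c)
    return {color: sum(1 for _ in run) for color, run in groupby(selected)}

sample = ["Red", "Blue", "Green", "Blue", "Yellow", "Green", "Blue"]
color_counts = count_colors(sample)
print(color_counts)
```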