
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

MapReduce and Hadoop

Corso di Sistemi e Architetture per Big Data A.A. 2017/18

Valeria Cardellini

The reference Big Data stack

•  Resource Management
•  Data Storage
•  Data Processing
•  High-level Interfaces
•  Support / Integration


The Big Data landscape


The open-source landscape for Big Data


Processing Big Data

•  Some frameworks and platforms to process Big Data
   –  Relational DBMS: old-fashioned, no longer applicable
   –  Batch processing: store and process data sets at massive scale (especially Volume+Variety)
      •  In this lesson: MapReduce and Hadoop
   –  Data stream processing: process fast data (in real time) as it is being generated, without storing it

MapReduce


Parallel programming: background

•  Parallel programming
   –  Break processing into parts that can be executed concurrently on multiple processors
•  Challenge
   –  Identify tasks that can run concurrently and/or groups of data that can be processed concurrently
   –  Not all problems can be parallelized!

Parallel programming: background

•  Simplest environment for parallel programming
   –  No dependency among data
•  Data can be split into equal-size chunks
   –  Each process can work on a chunk
   –  Master/worker approach
•  Master
   –  Initializes the array and splits it according to the number of workers
   –  Sends each worker its sub-array
   –  Receives the results from each worker
•  Worker
   –  Receives a sub-array from the master
   –  Performs the processing
   –  Sends the results to the master
•  Single Program, Multiple Data (SPMD): technique to achieve parallelism (see the sketch below)
   –  The most common style of parallel programming
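A minimal Java sketch of the master/worker idea described above (not part of the original slides; class and variable names are illustrative). The master splits an array into equal-size chunks, each worker sums its own chunk, and the master combines the partial results:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterWorkerSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        int workers = 4;
        int chunk = data.length / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        // Master: split the array and hand one index range to each worker
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = (w == workers - 1) ? data.length : from + chunk;
            partials.add(pool.submit(() -> {
                // Worker: process its own chunk and return a partial result
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // Master: collect and combine the partial results
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}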


Key idea behind MapReduce: Divide and conquer

•  A feasible approach to tackling large-data problems
   –  Partition a large problem into smaller sub-problems
   –  Independent sub-problems executed in parallel
   –  Combine intermediate results from each individual worker
•  The workers can be:
   –  Threads in a processor core
   –  Cores in a multi-core processor
   –  Multiple processors in a machine
   –  Many machines in a cluster
•  Implementation details of divide and conquer are complex

Divide and conquer: how?

•  Decompose the original problem into smaller, parallel tasks
•  Schedule tasks on workers distributed in a cluster, taking into account:
   –  Data locality
   –  Resource availability
•  Ensure workers get the data they need
•  Coordinate synchronization among workers
•  Share partial results
•  Handle failures


Key idea behind MapReduce: Scale out, not up!

•  For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
   –  The cost of super-computers is not linear
   –  Datacenter efficiency is a difficult problem to solve, but there have been recent improvements
•  Processing data is quick, I/O is very slow
•  Sharing vs. shared nothing:
   –  Sharing: manage a common/global state
   –  Shared nothing: independent entities, no common state
•  Sharing is difficult:
   –  Synchronization, deadlocks
   –  Finite bandwidth to access data from a SAN
   –  Temporal dependencies are complicated (restarts)

MapReduce

•  Programming model for processing huge data sets over thousands of servers
   –  Originally proposed by Google in 2004: "MapReduce: simplified data processing on large clusters" http://bit.ly/2iq7jlY
   –  Based on a shared nothing approach
•  Also an associated implementation (framework) of the distributed system that runs the corresponding programs
•  Some examples of applications at Google:
   –  Web indexing
   –  Reverse Web-link graph
   –  Distributed sort
   –  Web access statistics


MapReduce: programmer view

•  MapReduce hides system-level details
   –  Key idea: separate the what from the how
   –  MapReduce abstracts away the "distributed" part of the system
   –  Such details are handled by the framework
•  Programmers get a simple API
   –  They don't have to worry about handling:
      •  Parallelization
      •  Data distribution
      •  Load balancing
      •  Fault tolerance

Typical Big Data problem

•  Iterate over a large number of records
•  Extract something of interest from each record
•  Shuffle and sort intermediate results
•  Aggregate intermediate results
•  Generate final output

Key idea: provide a functional abstraction of the two Map and Reduce operations


MapReduce: model

•  Processing occurs in two phases: Map and Reduce
   –  Functional programming roots (e.g., Lisp)
•  Map and Reduce are defined by the programmer
•  Input and output: sets of key-value pairs
•  Programmers specify two functions, map and reduce:
   •  map(k1, v1) → [(k2, v2)]
   •  reduce(k2, [v2]) → [(k3, v3)]
   –  (k, v) denotes a (key, value) pair
   –  [...] denotes a list
   –  Keys do not have to be unique: different pairs can have the same key
   –  Normally the keys k1 of the input elements are not relevant
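For instance, in the WordCount example shown later, these pairs are instantiated as map(document_name, document_text) → [(word, 1), …] and reduce(word, [1, 1, …]) → [(word, count)].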

Map

•  Execute a function on a set of key-value pairs (an input shard) to create a new list of key-value pairs:
   map(in_key, in_value) → list(out_key, intermediate_value)
•  Map calls are distributed across machines by automatically partitioning the input data into shards
   –  Parallelism is achieved as keys can be processed by different machines
•  The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function


Reduce

•  Combine values in sets to create a new value:
   reduce(out_key, list(intermediate_value)) → list(out_key, out_value)
   –  The key in the output is often identical to the key in the input
   –  Parallelism is achieved as reducers operating on different keys can be executed simultaneously

Your first MapReduce example (in Lisp)

•  Example: sum-of-squares (sum the squares of the numbers from 1 to n) in MapReduce fashion
•  Map function: map square [1,2,3,4] returns [1,4,9,16]
•  Reduce function: reduce [1,4,9,16] returns 30 (the sum of the squared elements)
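The same computation can be sketched in Java 8 streams, which expose map and reduce directly (an illustrative aside, not from the original slides):

import java.util.stream.IntStream;

public class SumOfSquares {
    public static void main(String[] args) {
        // map: square each number in 1..4 -> [1, 4, 9, 16]
        // reduce: sum the squared values   -> 30
        int result = IntStream.rangeClosed(1, 4)
                              .map(x -> x * x)
                              .reduce(0, Integer::sum);
        System.out.println(result);  // prints 30
    }
}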


MapReduce program

•  A MapReduce program, referred to as a job, consists of:
   –  Code for Map and Reduce packaged together
   –  Configuration parameters (where the input lies, where the output should be stored)
   –  The input data set, stored on the underlying distributed file system
      •  The input will not fit on a single computer's disk
•  Each MapReduce job is divided by the system into smaller units called tasks
   –  Map tasks, or mappers
   –  Reduce tasks, or reducers
   –  All mappers need to finish before reducers can begin
•  The output of a MapReduce job is also stored on the underlying distributed file system
•  A MapReduce program may consist of many rounds of different map and reduce functions

MapReduce computation

1.  Some number of Map tasks are each given one or more chunks of data from a distributed file system
2.  These Map tasks turn the chunks into a sequence of key-value pairs
    –  The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function
3.  The key-value pairs from each Map task are collected by a master controller and sorted by key
4.  The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task
5.  The Reduce tasks work on one key at a time and combine all the values associated with that key in some way
    –  The manner of combining values is determined by the code written by the user for the Reduce function
6.  Output key-value pairs from each reducer are written persistently back onto the distributed file system
7.  The output ends up in r files, where r is the number of reducers
    –  Such output may be the input to a subsequent MapReduce phase


Where the magic happens

•  Implicit between the map and reduce phases is a distributed "group by" operation on intermediate keys, called shuffle and sort
   –  Transfers the mappers' output to the reducers, merging and sorting it
   –  Intermediate data arrive at every reducer sorted by key
•  Intermediate keys are transient
   –  They are not stored on the distributed file system
   –  They are "spilled" to the local disk of each machine in the cluster

Data flow: Input Splits, (k, v) pairs → Map Function → Intermediate Outputs, (k', v') pairs → Reduce Function → Final Outputs, (k'', v'') pairs

MapReduce computation: the complete picture


A simplified view of MapReduce: example

•  Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
•  Reducers are applied to all intermediate values associated with the same intermediate key
•  Between the map and reduce phases lies a barrier that involves a large distributed sort and group by

“Hello World” in MapReduce: WordCount

•  Problem: count the number of occurrences of each word in a large collection of documents
•  Input: a repository of documents; each document is an element
•  Map: read a document and emit a sequence of key-value pairs, where keys are the words of the document and values are equal to 1:
   (w1, 1), (w2, 1), …, (wn, 1)
•  Shuffle and sort: group by key and generate pairs of the form
   (w1, [1, 1, …, 1]), …, (wn, [1, 1, …, 1])
•  Reduce: add up all the values and emit (w1, k), …, (wn, l)
•  Output: (w, m) pairs where:
   –  w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among those documents


WordCount in practice

Map → Shuffle → Reduce

WordCount: Map

•  Map emits each word in the document with an associated value equal to "1"

   Map(String key, String value):
     // key: document name
     // value: document contents
     for each word w in value:
       EmitIntermediate(w, "1")


WordCount: Reduce

•  Reduce adds up all the "1"s emitted for a given word

   Reduce(String key, Iterator values):
     // key: a word
     // values: a list of counts
     int result = 0
     for each v in values:
       result += ParseInt(v)
     Emit(AsString(result))

•  This is pseudo-code; for the complete code of the example see the MapReduce paper

Example: WordLengthCount

•  Problem: count how many words of each length exist in a collection of documents
•  Input: a repository of documents; each document is an element
•  Map: read a document and emit a sequence of key-value pairs, where the key is the length of a word and the value is the word itself:
   (i, w1), …, (j, wn)
•  Shuffle and sort: group by key and generate pairs of the form
   (1, [w1, …, wk]), …, (n, [wr, …, ws])
•  Reduce: count the number of words in each list and emit:
   (1, l), …, (p, m)
•  Output: (l, n) pairs, where l is a length and n is the total number of words of length l in the input documents (a Hadoop-style sketch follows below)
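A minimal Hadoop-style sketch of this job (not from the original slides; class names are illustrative and the driver is omitted), assuming the same Java API used by the WordCount example later in these slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordLengthCount {

  public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final IntWritable length = new IntWritable();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String w = itr.nextToken();
        length.set(w.length());
        context.write(length, new Text(w));   // key: word length, value: the word
      }
    }
  }

  public static class CountReducer
      extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (Text v : values) {
        count++;                               // count the words of this length
      }
      context.write(key, new IntWritable(count));
    }
  }
}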


Example: matrix-vector multiplication

•  Sparse matrix A = [aij] of size n × n
•  Vector x = [xj] of size n × 1
•  Problem: matrix-vector multiplication y = Ax, where yi = Σj=1…n aij xj
   –  Used in the PageRank algorithm
•  Let us assume that x can fit into the mapper's memory
•  Map: apply to each ((i, j), aij) and produce the key-value pair (i, aij xj)
•  Reduce: receive (i, [ai1 x1, ..., ain xn]) and sum all the values of the list associated with a given key i, i.e., yi = Σj=1…n aij xj. The result is the pair (i, yi) (a sketch follows below)
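A hypothetical Hadoop-style sketch of this scheme (not from the original slides): it assumes each input line holds one non-zero entry as "i j a_ij", and that the vector x has already been loaded into the mapper's memory by a helper (loadVector below is a placeholder, e.g., reading from the distributed cache):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixVector {

  public static class MultiplyMapper
      extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    private double[] x;                 // assumed to fit in the mapper's memory

    @Override
    protected void setup(Context context) {
      x = loadVector();                 // placeholder: load x, e.g., from the distributed cache
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\s+");
      long i = Long.parseLong(parts[0]);
      int j = Integer.parseInt(parts[1]);
      double aij = Double.parseDouble(parts[2]);
      context.write(new LongWritable(i), new DoubleWritable(aij * x[j]));  // emit (i, aij * xj)
    }

    private double[] loadVector() { /* omitted */ return new double[0]; }
  }

  public static class SumReducer
      extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    public void reduce(LongWritable key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double yi = 0.0;
      for (DoubleWritable v : values) yi += v.get();   // yi = sum_j aij * xj
      context.write(key, new DoubleWritable(yi));
    }
  }
}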

Example: matrix-vector multiplication

•  What if vector x does not fit in the mapper's memory?
•  Solution:
   –  Split x into horizontal stripes that fit in memory
   –  Split matrix A accordingly into vertical stripes
   –  Stripes of A do not need to fit in memory
   –  Each mapper is assigned a matrix stripe, together with the corresponding vector stripe
   –  The Map and Reduce functions are as before
•  The best strategy is to partition A into square blocks, rather than stripes

(figure: y = A x, with A split into vertical stripes and x into matching horizontal stripes)


Combine

•  When the Reduce function is associative and commutative, we can push some of what the reducers do to the preceding Map tasks
   –  Associative and commutative: the values to be combined can be combined in any order, with the same result
   –  E.g., the addition in the Reduce of WordCount
•  In this case we also apply a combiner on the Map side
   –  combine(k2, [v2]) → [(k3, v3)]
•  In many cases the same function can be used for the combining as for the final reduction
•  Shuffle and sort is still necessary!
•  Advantages:
   –  Reduces the amount of intermediate data
   –  Reduces network traffic
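In the Hadoop WordCount code shown later in these slides, this is done with job.setCombinerClass(IntSumReducer.class): the reducer class itself is reused as the combiner, which is legitimate because addition is associative and commutative.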

Combine


MapReduce: execution overview

•  Master-worker architecture
•  The master controller process coordinates the map and reduce tasks running the MapReduce job
   –  It can rerun tasks on a failed compute node
•  The worker processes execute the map and reduce tasks

MapReduce: execution overview


Coping with failures

•  Worst scenario: the compute node hosting the master fails
   –  The entire MapReduce job must be restarted
•  A compute node hosting a map worker fails
   –  All the Map tasks that were assigned to this worker have to be redone, even if they had completed
•  A compute node hosting a reduce worker fails
   –  Its tasks are rescheduled on another reduce worker

Apache Hadoop

What is Apache Hadoop?

•  Open-source software framework for reliable, scalable, distributed data-intensive computing
   –  Originally developed by Yahoo!
•  Goal: storage and processing of data sets at massive scale
•  Infrastructure: clusters of commodity hardware
•  Core components:
   –  HDFS: Hadoop Distributed File System
   –  Hadoop YARN
   –  Hadoop MapReduce
•  Includes a number of related projects
   –  Among which Apache Pig, Apache Hive, Apache HBase
•  Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo! and many others

Hadoop runs on clusters

•  Compute nodes are stored on racks
   –  8-64 compute nodes per rack
•  There can be many racks of compute nodes
•  The nodes on a single rack are connected by a network, typically gigabit Ethernet
•  Racks are connected by another level of network or a switch
•  The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication
•  Compute nodes can fail! Solution:
   –  Files are stored redundantly
   –  Computations are divided into tasks


Hadoop runs on commodity hardware

•  Pros
   –  You are not tied to expensive, proprietary offerings from a single vendor
   –  You can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster
   –  Scale out, not up
•  Cons
   –  Significant failure rates
   –  Solution: replication!

Commodity hardware

•  A typical specification of commodity hardware:
   –  Processor: Intel Xeon E5, 8-10 cores, 1.8-2.2 GHz
   –  Memory: 32-64 GB ECC RAM
   –  Storage: 4 × 1 TB HDD
   –  Network: 1-10 Gb Ethernet
•  Yahoo! has a huge installation of Hadoop:
   –  In 2014: more than 100,000 CPUs in > 40,000 computers running Hadoop
      •  Used to support research for Ad Systems and Web Search
      •  Also used to run scaling tests to support development of Apache Hadoop on larger clusters
   –  In 2017: 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments
•  For some examples, see https://wiki.apache.org/hadoop/PoweredBy


Hadoop core

•  HDFS
   ✓ Already examined
•  Hadoop YARN
   –  Cluster resource management
•  Hadoop MapReduce
   –  System to write applications that process large data sets in parallel on large clusters (> 1000 nodes) of commodity hardware in a reliable, fault-tolerant manner


Hadoop core: MapReduce

•  Programming paradigm for expressing distributed computations over multiple servers
•  The powerhouse behind most of today's big data processing
•  Also used in other MPP environments and NoSQL databases (e.g., Vertica and MongoDB)
•  Improving Hadoop usage:
   –  Improving programmability: Pig and Hive
   –  Improving data access: HBase, Sqoop and Flume

We will study YARN later

•  Global ResourceManager (RM)
•  A set of per-application ApplicationMasters (AMs)
•  A set of NodeManagers (NMs)


YARN job execution


Data locality optimization

•  Hadoop tries to run each map task on a data-local node (data locality optimization)
   –  So as not to use cluster bandwidth
   –  Otherwise rack-local
   –  Off-rack as a last choice


MapReduce data flow: single Reduce task


MapReduce data flow: multiple Reduce tasks

•  When there are multiple reducers, the map tasks partition their output


Partitioner

•  When there are multiple reducers, the map tasks partition their output
   –  One partition for each reduce task
•  Goal: determine which reducer will receive which intermediate keys and values
   –  The records for any given key are all in a single partition
•  The default partitioner uses a hash function on the key to determine the bucket (i.e., the reducer)
•  Partitioning can also be controlled by a user-defined function
   –  Requires implementing a custom partitioner (see the sketch below)
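An illustrative custom partitioner for a WordCount-like job (not from the original slides; the routing rule is purely an example): words starting with a-m go to the first reducer, all other keys to the second.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions < 2) {
      return 0;                         // a single reducer gets everything
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    return (first >= 'a' && first <= 'm') ? 0 : 1;
  }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class), assuming the job is configured with two reducers.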

MapReduce data flow: no Reduce task


Combiner

•  Many MapReduce jobs are limited by the cluster bandwidth
   –  How to minimize the data transferred between map and reduce tasks?
•  Use a Combine task
   –  A combiner is an optional localized reducer that applies a user-provided method to combine mapper output
   –  It performs data aggregation on intermediate data with the same key from the Map task's output, before transmitting the result to the Reduce task
•  It takes each key-value pair from the Map task, processes it, and produces the output as key-value collection pairs
•  The Reduce task is still needed to process records with the same key coming from different maps

Programming languages for Hadoop

•  Java is the default programming language
•  In Java you write a program with at least three parts:
   1.  A main method, which configures the job and launches it
       •  Sets the number of reducers
       •  Sets the mapper and reducer classes
       •  Sets the partitioner
       •  Sets other Hadoop configurations
   2.  A Mapper class
       •  Takes (k, v) inputs, writes (k, v) outputs
   3.  A Reducer class
       •  Takes (k, Iterator[v]) inputs, writes (k, v) outputs


Programming languages for Hadoop

•  The Hadoop Streaming API allows writing the Map and Reduce functions in programming languages other than Java
•  It uses Unix standard streams as the interface between Hadoop and the MapReduce program
•  Any language (e.g., Python) that can read standard input (stdin) and write to standard output (stdout) can be used to write the MapReduce program
   –  Standard input is the data stream going into a program; by default it comes from the keyboard
   –  Standard output is the stream where a program writes its output data; by default it is the text terminal
   –  The reducer interface for streaming is different than in Java: instead of receiving reduce(k, Iterator[v]), your script is sent one line per value, including the key

WordCount on Hadoop

•  Let's analyze the WordCount code
•  The code in Java: http://bit.ly/1m9xHym
•  The same code in Python
   –  Uses the Hadoop Streaming API for passing data between the Map and Reduce code via standard input and standard output
   –  See Michael Noll's tutorial: http://bit.ly/19YutB1


WordCount in Java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

WordCount in Java (2)

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

•  The map method processes one line at a time, splits the line into tokens separated by whitespace via StringTokenizer, and emits a key-value pair <word, 1>


WordCount in Java (3)

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

•  The reduce method sums up the values, which are the occurrence counts for each key

WordCount in Java (4)

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

•  The main method specifies various facets of the job
•  The output of each map is passed through the local combiner (the same class as the Reducer) for local aggregation, after being sorted on the keys


WordCount in Java (5)

•  Input files:

   Hello World Bye World
   Hello Hadoop Goodbye Hadoop

•  Output file:

   Bye 1
   Goodbye 1
   Hadoop 2
   Hello 2
   World 2

WordCount in Java (6)

•  1st mapper output:

   < Hello, 1>
   < World, 1>
   < Bye, 1>
   < World, 1>

•  After the local combiner:

   < Bye, 1>
   < Hello, 1>
   < World, 2>


WordCount in Java (7)

•  2nd mapper output:

   < Hello, 1>
   < Hadoop, 1>
   < Goodbye, 1>
   < Hadoop, 1>

•  After the local combiner:

   < Goodbye, 1>
   < Hadoop, 2>
   < Hello, 1>

•  Reducer output:

   < Bye, 1>
   < Goodbye, 1>
   < Hadoop, 2>
   < Hello, 2>
   < World, 2>

Hadoop installation

•  Some straightforward alternatives:
   –  Single-node installation: http://bit.ly/1eMFVFm, https://bit.ly/2GTjB35
   –  Cloudera CDH, Hortonworks HDP, MapR distributions
      •  Integrated Hadoop-based stacks containing all the components needed for production, tested and packaged to work together
      •  Include at least Hadoop, YARN, HDFS (not in MapR), HBase, Hive, Pig, Sqoop, Flume, ZooKeeper, Kafka, Spark
•  MapR
   –  Uses its own distributed POSIX file system (MapR-FS) rather than HDFS


Hadoop installation

•  Cloudera CDH
   –  See QuickStarts: http://bit.ly/2oMHob4
   –  Single-node cluster: requires having previously installed a hypervisor (VMware, KVM, or VirtualBox) or Docker
   –  QuickStarts are not for use in production

Hadoop installation

•  Hortonworks HDP
   –  Hortonworks Sandbox on a VM
      •  Requires either VirtualBox, VMware or Docker
      •  Image file sizes: 10-12 GB
   –  Includes Ambari: an open-source management platform for Hadoop
•  Distributions include security frameworks (Sentry, Ranger, Knox, …) to handle access control, security policies, …


Hadoop configuration

•  Tuning Hadoop clusters for good performance is somewhat of a black art
•  HW and SW systems (including the OS) matter, e.g.:
   –  Optimal number of disks that maximizes I/O bandwidth
   –  BIOS parameters
   –  File system type (on Linux, EXT4 is better)
   –  Open file handles
   –  Optimal HDFS block size
   –  JVM settings for Java heap usage and garbage collection
•  See http://bit.ly/2nT7TIn for some detailed hints

Hadoop configuration

•  There are also Hadoop-specific parameters that can be tuned for performance:
   –  Number of mappers
   –  Number of reducers
   –  Many other map-side and reduce-side tuning parameters


How many mappers

•  Number of mappers
   –  Driven by the number of blocks in the input files
   –  You can adjust the HDFS block size to adjust the number of maps
   –  Good level of parallelism: 10-100 mappers per node, but up to 300 mappers for very CPU-light map tasks
   –  Task setup takes a while, so it is best if the maps take at least a minute to execute
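As a rough, illustrative calculation (the numbers are not from the slides): with a 128 MB HDFS block size, a 10 GB input is split into about 80 blocks, so the job runs about 80 map tasks; raising the block size to 256 MB halves this to about 40, making each map task longer-running.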

How many reducers

•  Number of reducers
   –  Can be user-defined (the default is 1)
   –  Use Job.setNumReduceTasks(int)
   –  The right number of reducers seems to be 0.95 or 1.75 multiplied by (<no. of nodes> × <no. of maximum containers per node>)
      •  0.95: all of the reduces can launch immediately and start transferring map outputs as the maps finish
      •  1.75: the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing
   –  Can be set to zero if no reduction is desired
      •  No sorting of map outputs before writing them to the output file
   –  See http://bit.ly/2oK0D5A
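For example (illustrative numbers only): on a cluster with 20 nodes and 8 containers per node, the rule of thumb gives 0.95 × (20 × 8) = 152 reducers for a single wave, or 1.75 × (20 × 8) = 280 reducers for two waves; in the driver this would be set with job.setNumReduceTasks(152).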


Some tips for performance

•  To handle massive I/O and save bandwidth:
   –  Compress input data, map output data, and reduce output data
•  To address massive I/O in the partition and sort phases:
   –  Each map task has a circular memory buffer to which it writes its output; when the buffer is full, its contents are written ("spilled") to disk. Avoid records being spilled more than once
   –  How? Adjust the spill threshold and the sorting buffer (a configuration sketch follows below)
•  To address massive network traffic caused by large map output:
   –  Compress map output
   –  Implement a combiner to reduce the amount of data passing through the shuffle
•  See https://bit.ly/2GU82sq
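A sketch of how such settings could be applied in a job driver (not from the original slides; the property names refer to Hadoop 2.x and the values are illustrative, so both should be checked against the version in use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map output to cut shuffle traffic
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.set("mapreduce.map.output.compress.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");

    // Compress the final job output as well
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);

    // Enlarge the in-memory sort buffer (MB) and raise the spill threshold,
    // so that map output is spilled to disk as few times as possible
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

    Job job = Job.getInstance(conf, "tuned job");
    // ... set mapper, combiner, reducer and input/output paths as in the
    //     WordCount example above, then submit the job
  }
}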

Hadoop in the Cloud

•  Pros:
   –  Gain Cloud scalability and elasticity
   –  No need to manage and provision the infrastructure and the platform
•  Main challenges:
   –  Moving data to the Cloud
      •  Latency is not zero (because of the speed of light)!
      •  Minor issue: network bandwidth
   –  Data security and privacy


Amazon Elastic MapReduce (EMR)

•  Distributes the computational work across a cluster of virtual servers running in the AWS cloud (EC2 instances)
•  Not only Hadoop: also Hive, Pig, Spark and Presto
   –  Usually not the latest released version
•  Input and output: Amazon S3, DynamoDB, Elastic File System (EFS)
•  Access through the AWS Management Console or the command line

Create EMR cluster

Steps:
1.  Provide a cluster name
2.  Select the EMR release and the applications to install
3.  Select the instance type (including spot instances)
4.  Select the number of instances
5.  Select the EC2 key pair to connect securely to the cluster

See http://amzn.to/2nnujp5


EMR cluster details

•  You can still tune some parameters for performance:
   –  Some EC2 parameters (e.g., the heap size used by Hadoop and YARN)
   –  Hadoop parameters (e.g., memory for the map and reduce JVMs)

Google Cloud Dataproc

•  Distributes the computational work across a cluster of virtual servers running in the Google Cloud Platform
•  Not only Hadoop (plus Hive and Pig), but also Spark
•  Input and output from other Google services: Cloud Storage, Bigtable
•  Access through the REST API, the Cloud SDK, and the Cloud Dataproc UI
•  Fine-grained pay-per-use (per second)


Create Cloud Dataproc cluster

References

•  Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004. http://static.googleusercontent.com/media/research.google.com/it//archive/mapreduce-osdi04.pdf
•  White, "Hadoop: The Definitive Guide", 4th edition, O'Reilly, 2015.
•  Miner and Shook, "MapReduce Design Patterns", O'Reilly, 2012.
•  Leskovec, Rajaraman, and Ullman, "Mining of Massive Datasets", chapter 2, 2014. http://www.mmds.org