Page 1

Map Reduce & Hadoop

June 3, 2015
HS Oh, HR Lee, JY Choi

YS Lee, SH Choi

Page 2

Outline

Part 1
Introduction to Hadoop
MapReduce Tutorial with Simple Example
Hadoop v2.0: YARN

Part 2
MapReduce
Hive
Stream Data Processing: Storm
Spark
Up-to-date Trends

Page 3

MapReduce

Overview
Task flow
Shuffle configurables
Combiner
Partitioner
Custom Partitioner Example
Number of Maps and Reduces
How to write MapReduce functions

Page 4

MapReduce Overview

http://www.micronautomata.com/big_data

[Diagram: input records tagged with keys A and B flow through the map, shuffle-by-key, and reduce stages]

Page 5

MapReduce Task flow

http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce/

Page 6

MapReduce Shuffle Configurables

http://grepalex.com/2012/11/26/hadoop-shuffle-configurables/

Page 7

Combiner
A "mini reducer": functionally the same as the reducer
Runs on each map task (locally), reducing communication cost
Use a combiner only when the reduce function is both commutative and associative (see the sketch below)

http://www.kalyanhadooptraining.com/2014_07_01_archive.html
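A minimal, illustrative sketch (not from the slides) of registering a combiner with Hadoop's old mapred API. It assumes the word-count Map/Reduce classes shown later in this deck; reusing the reducer as the combiner is safe there because summing counts is commutative and associative.

JobConf conf = new JobConf(WordCount.class);   // WordCount, Map, Reduce are the word-count classes
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);           // combiner = reducer, applied locally to each map task's output
conf.setReducerClass(Reduce.class);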

Page 9

Custom Partitioner Example

Input records contain name, age, sex, and score; the map output is partitioned by age range.

public static class AgePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String[] nameAgeScore = value.toString().split("\t");
        String age = nameAgeScore[1];
        int ageInt = Integer.parseInt(age);

        // avoid performing mod with 0
        if (numReduceTasks == 0)
            return 0;

        // age <= 20: partition 0
        if (ageInt <= 20) {
            return 0;
        }
        // age between 21 and 50: partition 1
        if (ageInt > 20 && ageInt <= 50) {
            return 1 % numReduceTasks;
        }
        // otherwise: partition 2
        else
            return 2 % numReduceTasks;
    }
}

http://hadooptutorial.wikispaces.com/Custom+partitioner
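A hedged sketch of wiring the AgePartitioner into a job driver with the new (org.apache.hadoop.mapreduce) API that the Partitioner base class above belongs to. The mapper, reducer, and driver class names are illustrative placeholders, not part of the original example.

Job job = Job.getInstance(new Configuration(), "partition-by-age");
job.setJarByClass(AgePartitioner.class);
job.setMapperClass(AgeMapper.class);             // hypothetical mapper emitting (Text key, Text "name\tage\tscore")
job.setReducerClass(AgeReducer.class);           // hypothetical reducer
job.setPartitionerClass(AgePartitioner.class);   // route records by age range
job.setNumReduceTasks(3);                        // one reducer per age bucket: <=20, 21-50, >50
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);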

Page 10

Number of Maps and Reduces

The number of maps
Equal to the number of DFS blocks in the input; adjust the DFS block size to change the number of maps
The right level of parallelism for maps is roughly 10~100 maps per node
The mapred.map.tasks parameter is just a hint

The number of reduces
Suggested values:
Set the number of reduce tasks a little below the total number of reduce slots
Aim for a task time between 5 and 15 minutes
Create the fewest output files possible

conf.setNumReduceTasks(int num)

http://wiki.apache.org/hadoop/HowManyMapsAndReduces
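A rough sketch of the knobs above with the old mapred API; the values are only illustrative. The map count ultimately follows the number of input splits (DFS blocks), so setNumMapTasks is only a hint, while setNumReduceTasks is honored directly.

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
conf.setNumMapTasks(100);                  // hint only; actual count = number of input splits
conf.setNumReduceTasks(7);                 // e.g. slightly fewer than the cluster's total reduce slots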

Page 11

How to write MapReduce functions [1/2]

Java Word Count Example

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());   // word is a reusable Text field of the Map class
        output.collect(word, one);         // one is a constant IntWritable(1)
    }
}

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}


http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
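For completeness, here is a sketch of the surrounding class and driver in the style of the Hadoop 1.2.1 tutorial linked above. The static fields one and word are the values referenced inside map(), and the method bodies are the ones shown on this slide.

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // map() as shown above
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce() as shown above
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // sums are commutative and associative
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}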

Page 12

How to write MapReduce functions [2/2]

Python Word Count Example

Mapper.py

#!/usr/bin/python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

How to execute

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -files /home/hduser/Mapper.py,/home/hduser/Reducer.py \
    -mapper /home/hduser/Mapper.py \
    -reducer /home/hduser/Reducer.py \
    -input /input/count_of_monte_cristo.txt \
    -output /output

Reducer.py

#!/usr/bin/python
import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1
    current_word = word

# emit the final word
if current_word:
    print "%s\t%d" % (current_word, current_count)

http://dogdogfish.com/2014/05/19/hadoop-wordcount-in-python/

Page 13

Hive & Stream Data Processing: Storm

Hadoop Ecosystem

Page 14

The World of Big Data Tools

[Diagram (from Bingjing Zhang): the big data tool landscape arranged by programming model (MapReduce model, DAG model, graph model, BSP/collective model) and by purpose (iterations/learning, query, streaming). Tools shown include Hadoop, MPI, HaLoop, Twister, Spark, Harp, Flink, REEF, Dryad/DryadLINQ, Pig/PigLatin, Hive, Tez, SparkSQL (Shark), MRQL, S4, Storm, Samza, Spark Streaming, Drill, Giraph, Hama, GraphLab, and GraphX.]

Page 15

Hive

Data warehousing on top of Hadoop

Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data

HiveQL statements are automatically translated into MapReduce jobs

Page 16

Advantages

Higher-level query language
Simplifies working with large amounts of data

Lower learning curve than Pig or MapReduce
HiveQL is much closer to SQL than Pig
Less trial and error than Pig

Page 17

Disadvantages

Updating data is complicated, mainly because the data lives in HDFS
Can add records
Can overwrite partitions

No real-time access to data; use other means such as HBase or Impala

High latency

Page 18

Hive Architecture

Page 19

Metastore

Page 20

Compiler

Parser → Semantic Analyzer → Logical Plan Generator → Query Plan Generator

Page 21

Hive Architecture

Page 22

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard.

HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only basic support for indexes.

HiveQL lacks support for transactions and materialized views, and offers only limited subquery support.

Support for insert, update, and delete with full ACID functionality was made available with release 0.14.

Page 23

Datatypes in Hive

Primitive datatypes: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING

Page 24

HiveQL – Group By

HiveQL:

INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users
pageid   age
1        25
2        25
1        32
2        25
3        27
1        21
...      ...
18570    30
18570    26

pageid_age_sum
pageid   age   count
1        25    1
1        32    1
1        21    1
2        25    2
3        27    1
...      ...   ...
18570    30    1
18570    26    1

Page 25

HiveQL – Group By in MapReduce

Map: each map task reads a block of pv_users rows and emits (<pageid, age>, 1).
  Map 1 input (1,25), (2,25), (1,32)  →  (<1,25>,1), (<2,25>,1), (<1,32>,1)
  Map 2 input (2,25), (3,27), (1,21)  →  (<2,25>,1), (<3,27>,1), (<1,21>,1)
  …
  Map n input (18570,30), (18570,26)  →  (<18570,30>,1), (<18570,26>,1)

Shuffle: pairs are partitioned and sorted by key, so all values for the same <pageid, age> key reach the same reducer; e.g. (<2,25>,1) and (<2,25>,1) end up together.

Reduce: each reducer sums the values per key and writes the pageid_age_sum rows:
  (1,25,1), (1,32,1), (1,21,1), (2,25,2), (3,27,1), …, (18570,30,1), (18570,26,1)

Map → Shuffle → Reduce
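To make the flow concrete, here is a hand-written sketch (old mapred API, illustrative only; Hive generates its own operator plan rather than code like this) of a mapper and reducer implementing the same GROUP BY: the mapper emits a composite <pageid, age> key with a count of 1, the shuffle groups identical keys, and the reducer sums the counts.

public static class GroupByMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] cols = value.toString().split("\t");       // pv_users row: pageid \t age
        output.collect(new Text(cols[0] + "\t" + cols[1]), one);
    }
}

public static class GroupByReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int count = 0;
        while (values.hasNext()) count += values.next().get();
        output.collect(key, new IntWritable(count));        // pageid \t age, count
    }
}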

Page 26

Stream Data Processing

Page 27

Distributed Stream Processing Engine

Stream data: an unbounded sequence of event tuples
E.g., sensor data, stock trading data, web traffic data, …

Since large volumes of data flow from many sources, centralized systems can no longer process them in real time.

Page 28

Distributed Stream Processing Engine

General stream processing model
Stream processing processes data before storing it.
cf. Batch systems (like Hadoop) process data after storing it.

Processing Element (PE): a processing unit in a stream engine

Generally, a stream processing engine creates a logical network of processing elements (PEs) connected in a directed acyclic graph (DAG).

Page 29

Distributed Stream Processing Engine

Page 30

DSPE Systems

Apache Storm (current release: 0.10)
Developed by Twitter; donated to the Apache Software Foundation in 2013
Pull-based messaging
http://storm.apache.org/

Apache S4 (current release: 0.6)
Developed by Yahoo; donated to the Apache Software Foundation in 2011
S4 stands for Simple Scalable Streaming Systems
Push-based messaging
http://incubator.apache.org/s4/

Apache Samza (current release: 0.9)
Developed by LinkedIn; donated to the Apache Software Foundation in 2013
Messaging through a message broker (Kafka)
http://samza.apache.org/

Page 31

Apache Storm

System Architecture

Page 32

Apache Storm

Topology: a PE DAG on Storm
Spout: the starting point of a data stream; it can listen on an HTTP port or pull from a queue
Bolt: processes incoming stream tuples; a bolt pulls messages from its upstream PE, so bolts do not take on an excessive amount of messages

Stream grouping
Shuffle grouping, fields grouping, partial key grouping, all grouping, global grouping, …

Message processing guarantee
Each PE keeps an output message until the downstream PE has processed it and sent an acknowledgement message (see the sketch below)
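A minimal sketch of how that guarantee looks from a bolt, assuming the pre-1.0 backtype.storm API matching the release cited above: emitting with the input tuple as an anchor links the new tuple to it, and ack() signals that the input has been fully processed. The class and suffix parameter are illustrative, not from the slides.

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimBolt extends BaseRichBolt {
    private final String suffix;            // e.g. "!!" or "**"
    private OutputCollector collector;

    public ExclaimBolt(String suffix) { this.suffix = suffix; }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        collector.emit(input, new Values(input.getString(0) + suffix));  // anchored emit
        collector.ack(input);                                            // acknowledge the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}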

Page 33

Apache Storm: Spouts

Source of streams
[Diagram: a spout emitting a continuous stream of tuples]

Page 34

Apache Storm: Bolts

Processes input streams and produces new streams
[Diagram: a bolt consuming tuples from input streams and emitting tuples on new streams]

Page 35

Apache Storm: Topology

Network of spouts and bolts

Page 36

Apache Storm: Task

Each spout and bolt executes as many tasks spread across the cluster

Page 37

Apache Storm: Stream grouping

Shuffle grouping: pick a random task

Fields grouping: consistent hashing on a subset of tuple fields

All grouping: send to all tasks

Global grouping: pick task with lowest id

Page 38

Apache Storm

Supported languages: Python, Java, Clojure

Tutorial

Bolt ‘exclaim1’ appends the string “!!” to its input.
Bolt ‘exclaim2’ appends the string “**” to its input.
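A rough sketch of how this tutorial topology could be wired up (again assuming the backtype.storm API; TestWordSpout ships with Storm, and the two bolts are instances of the ExclaimBolt sketch from the earlier slide, one appending "!!" and the other "**"):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclaimBolt("!!"), 3).shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclaimBolt("**"), 2).shuffleGrouping("exclaim1");

StormSubmitter.submitTopology("exclamation-topology", new Config(), builder.createTopology());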

Page 39

Apache Storm

[Diagram: the ‘word’ spout emits “John”, “Bob”, “Rice”; ‘exclaim1’ turns them into “John!!”, “Bob!!”, “Rice!!”; ‘exclaim2’ produces “John**”, “Bob**”, “Rice**”, and “John!!**”, “Bob!!**”, “Rice!!**” for tuples that passed through ‘exclaim1’ first]

Page 40

References

1. Apache Hive, https://hive.apache.org/

2. Design - Apache Hive, https://cwiki.apache.org/confluence/display/Hive/Design

3. Apache Storm, https://storm.apache.org/

Page 41

Spark
Fast, Interactive, Language-Integrated Cluster Computing

Page 42

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

[Diagram: acyclic data flow in MapReduce: the input feeds parallel Map tasks, whose outputs feed Reduce tasks, which write the output]

Page 43

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (machine learning, graphs)
Interactive data mining tools (R, Excel, Python)

With such frameworks, apps reload data from stable storage on each query

Page 44

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability

Support a wide range of applications: batch, query processing, stream processing, graph processing, machine learning

Page 45

RDD Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
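As a small illustration of the split (a sketch with Spark's Java API; the names and data are invented for the example): transformations only define new RDDs, and nothing executes until an action is called.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local[*]", "rdd-ops-sketch");
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

JavaRDD<Integer> squares = nums.map(x -> x * x);             // transformation: lazy
JavaRDD<Integer> evens = squares.filter(x -> x % 2 == 0);    // transformation: lazy

long howMany = evens.count();                                // action: triggers the job
int total = evens.reduce((a, b) -> a + b);                   // action: result returned to the driver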

Page 46

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = sc.textFile("hdfs://...")                       // base RDD
errors = lines.filter(_.startsWith("ERROR"))            // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count              // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers holding HDFS blocks 1-3; each worker caches its partition of messages and sends results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Page 47

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Page 48

Performance

Logistic Regression

https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html

Page 49

Fault Recovery

Run K-means on a 75-node cluster
Each iteration consists of 400 tasks working on 100 GB of data
Lost RDD partitions are reconstructed using lineage

Recovery overhead: 24 s (≈ 30%)
Lineage graph: ≤ 10 KB

Zaharia et al., Resilient Distributed Datasets, NSDI '12

Page 50

Generality

Various types of applications can be built atop RDDs

They can be combined in a single application and run on the Spark runtime

http://spark.apache.org

Page 51

Interactive Analytics

An interactive shell is provided
The program returns results directly
Run ad-hoc queries

Page 52

Demo

WordCount in Scala API

To show the result on the shell, replace counts.saveAsTextFile() with counts.collect()
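The demo itself ran in the Scala shell; as a rough equivalent, here is the same word count sketched with Spark's Java API (Spark 1.x signatures; the paths are placeholders):

JavaSparkContext sc = new JavaSparkContext("local[*]", "wordcount-sketch");

JavaPairRDD<String, Integer> counts = sc.textFile("hdfs://.../input.txt")
    .flatMap(line -> Arrays.asList(line.split(" ")))     // Spark 1.x: flatMap returns an Iterable
    .mapToPair(word -> new Tuple2<>(word, 1))            // Tuple2 comes from the scala package
    .reduceByKey((a, b) -> a + b);

// counts.saveAsTextFile("hdfs://.../output");           // write the result to HDFS, or ...
System.out.println(counts.collect());                    // ... bring it back and print it on the shell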

Page 53

Conclusion

Performance: fast due to caching data in memory

Fault tolerance: fast recovery using lineage history

Programmability: multiple language support; a simple, integrated programming model

Page 54

Up-to-date Trends

Page 55

Up-to-date Trends

Batch + Real-time Analytics

Big-Data-as-a-Service

Page 56

Trend1: Batch + Real-time Analytics

Lambda Architecture

1. Data
Dispatched to both the batch layer and the speed layer

2. Batch layer
Manage the master dataset (an immutable, append-only set of raw data)
Pre-compute the batch views

Page 57

Trend1: Batch + Real-time Analytics

Lambda Architecture

3. Serving layer
Index the batch views so they can be queried in a low-latency, ad-hoc way

4. Speed layer
Deals with recent data only (the serving layer's update cost is high)

5. Merge results from batch views and real-time views when answering queries.

Page 58

Trend2: Big-Data-as-a-Service

Big-Data-as-a-Service
Big data analytics systems are provided as a cloud service
Programming API & monitoring interface
Infrastructure can also be provided as a service
No need to worry about distributing data, resource optimization, resource provisioning, etc.
Users can focus on the data itself

Page 59

Trend2: Big-Data-as-a-Service

Google Cloud Dataflow

[Screenshots: Programming API and Monitoring UI]

Page 60

References

1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI`12

2. Apache Spark, http://spark.apache.org

3. Databricks, http://www.databricks.com

4. Lambda Architecture, http://lambda-architecture.net

5. Google Cloud Dataflow, http://cloud.google.com/dataflow

Page 61

Thank you. Questions?