Page 1

Map Reduce & Hadoop

June 3, 2015
HS Oh, HR Lee, JY Choi

YS Lee, SH Choi

Page 2

Outline

Part 1
Introduction to Hadoop
MapReduce Tutorial with Simple Example
Hadoop v2.0: YARN

Part 2
MapReduce
Hive
Stream Data Processing: Storm
Spark
Up-to-date Trends

Page 3

MapReduce

Overview
Task flow
Shuffle configurables
Combiner
Partitioner
Custom Partitioner Example
Number of Maps and Reduces
How to write MapReduce functions

Page 4

MapReduce Overview

http://www.micronautomata.com/big_data

[Diagram: input records tagged with keys A and B flow through the map, shuffle-by-key, and reduce stages]

Page 5

MapReduce Task flow

http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce/

Page 6

MapReduce Shuffle Configurables

http://grepalex.com/2012/11/26/hadoop-shuffle-configurables/

Page 7

Combiner
A "mini reducer": functionally the same as the reducer
Runs on each map task (locally), reducing communication cost
Use a combiner only when the reduce function is both commutative and associative (see the sketch below)

http://www.kalyanhadooptraining.com/2014_07_01_archive.html
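A minimal, illustrative sketch (not from the slides) of registering a combiner with Hadoop's old mapred API. It assumes the word-count Map/Reduce classes shown later in this deck; reusing the reducer as the combiner is safe there because summing counts is commutative and associative.

JobConf conf = new JobConf(WordCount.class);   // WordCount, Map, Reduce are the word-count classes
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);           // combiner = reducer, applied locally to each map task's output
conf.setReducerClass(Reduce.class);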

Page 9

Custom Partitioner Example

Input records contain name, age, sex, and score; the map output is partitioned by age range.

public static class AgePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String[] nameAgeScore = value.toString().split("\t");
        String age = nameAgeScore[1];
        int ageInt = Integer.parseInt(age);

        // avoid performing mod with 0
        if (numReduceTasks == 0)
            return 0;

        // age <= 20: partition 0
        if (ageInt <= 20) {
            return 0;
        }
        // age between 21 and 50: partition 1
        if (ageInt > 20 && ageInt <= 50) {
            return 1 % numReduceTasks;
        }
        // otherwise: partition 2
        else
            return 2 % numReduceTasks;
    }
}

http://hadooptutorial.wikispaces.com/Custom+partitioner
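A hedged sketch of wiring the AgePartitioner into a job driver with the new (org.apache.hadoop.mapreduce) API that the Partitioner base class above belongs to. The mapper, reducer, and driver class names are illustrative placeholders, not part of the original example.

Job job = Job.getInstance(new Configuration(), "partition-by-age");
job.setJarByClass(AgePartitioner.class);
job.setMapperClass(AgeMapper.class);             // hypothetical mapper emitting (Text key, Text "name\tage\tscore")
job.setReducerClass(AgeReducer.class);           // hypothetical reducer
job.setPartitionerClass(AgePartitioner.class);   // route records by age range
job.setNumReduceTasks(3);                        // one reducer per age bucket: <=20, 21-50, >50
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);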

Page 10

Number of Maps and Reduces

The number of maps
Equal to the number of DFS blocks in the input; adjust the DFS block size to change the number of maps
The right level of parallelism for maps is roughly 10~100 maps per node
The mapred.map.tasks parameter is just a hint

The number of reduces
Suggested values:
Set the number of reduce tasks a little below the total number of reduce slots
Aim for a task time between 5 and 15 minutes
Create the fewest output files possible

conf.setNumReduceTasks(int num)

http://wiki.apache.org/hadoop/HowManyMapsAndReduces
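A rough sketch of the knobs above with the old mapred API; the values are only illustrative. The map count ultimately follows the number of input splits (DFS blocks), so setNumMapTasks is only a hint, while setNumReduceTasks is honored directly.

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
conf.setNumMapTasks(100);                  // hint only; actual count = number of input splits
conf.setNumReduceTasks(7);                 // e.g. slightly fewer than the cluster's total reduce slots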

Page 11

How to write MapReduce functions [1/2]

Java Word Count Example

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());   // word is a reusable Text field of the Map class
        output.collect(word, one);         // one is a constant IntWritable(1)
    }
}

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}


http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
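For completeness, here is a sketch of the surrounding class and driver in the style of the Hadoop 1.2.1 tutorial linked above. The static fields one and word are the values referenced inside map(), and the method bodies are the ones shown on this slide.

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // map() as shown above
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce() as shown above
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // sums are commutative and associative
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}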

Page 12

How to write MapReduce functions [2/2]

Python Word Count Example

Mapper.py

#!/usr/bin/python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

How to execute

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -files /home/hduser/Mapper.py,/home/hduser/Reducer.py \
    -mapper /home/hduser/Mapper.py \
    -reducer /home/hduser/Reducer.py \
    -input /input/count_of_monte_cristo.txt \
    -output /output

Reducer.py

#!/usr/bin/python
import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1
    current_word = word

# emit the final word
if current_word:
    print "%s\t%d" % (current_word, current_count)

http://dogdogfish.com/2014/05/19/hadoop-wordcount-in-python/

Page 13

Hive & Stream Data Processing: Storm

Hadoop Ecosystem

Page 14

The World of Big Data Tools

[Diagram (from Bingjing Zhang): the big data tool landscape arranged by programming model (MapReduce model, DAG model, graph model, BSP/collective model) and by purpose (iterations/learning, query, streaming). Tools shown include Hadoop, MPI, HaLoop, Twister, Spark, Harp, Flink, REEF, Dryad/DryadLINQ, Pig/PigLatin, Hive, Tez, SparkSQL (Shark), MRQL, S4, Storm, Samza, Spark Streaming, Drill, Giraph, Hama, GraphLab, and GraphX.]

Page 15

Hive

Data warehousing on top of Hadoop

Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data

HiveQL statements are automatically translated into MapReduce jobs

Page 16

Advantages

Higher-level query language
Simplifies working with large amounts of data

Lower learning curve than Pig or MapReduce
HiveQL is much closer to SQL than Pig
Less trial and error than Pig

Page 17

Disadvantages

Updating data is complicated, mainly because the data lives in HDFS
Can add records
Can overwrite partitions

No real-time access to data; use other means such as HBase or Impala

High latency

Page 18

Hive Architecture

Page 19

Metastore

Page 20

Compiler

Parser → Semantic Analyzer → Logical Plan Generator → Query Plan Generator

Page 21

Hive Architecture

Page 22

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard.

HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only basic support for indexes.

HiveQL lacks support for transactions and materialized views, and offers only limited subquery support.

Support for insert, update, and delete with full ACID functionality was made available with release 0.14.

Page 23

Datatypes in Hive

Primitive datatypes: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING

Page 24

HiveQL – Group By

HiveQL:

INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users
pageid   age
1        25
2        25
1        32
2        25
3        27
1        21
...      ...
18570    30
18570    26

pageid_age_sum
pageid   age   count
1        25    1
1        32    1
1        21    1
2        25    2
3        27    1
...      ...   ...
18570    30    1
18570    26    1

Page 25

HiveQL – Group By in MapReduce

Map: each map task reads a block of pv_users rows and emits (<pageid, age>, 1).
  Map 1 input (1,25), (2,25), (1,32)  →  (<1,25>,1), (<2,25>,1), (<1,32>,1)
  Map 2 input (2,25), (3,27), (1,21)  →  (<2,25>,1), (<3,27>,1), (<1,21>,1)
  …
  Map n input (18570,30), (18570,26)  →  (<18570,30>,1), (<18570,26>,1)

Shuffle: pairs are partitioned and sorted by key, so all values for the same <pageid, age> key reach the same reducer; e.g. (<2,25>,1) and (<2,25>,1) end up together.

Reduce: each reducer sums the values per key and writes the pageid_age_sum rows:
  (1,25,1), (1,32,1), (1,21,1), (2,25,2), (3,27,1), …, (18570,30,1), (18570,26,1)

Map → Shuffle → Reduce
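To make the flow concrete, here is a hand-written sketch (old mapred API, illustrative only; Hive generates its own operator plan rather than code like this) of a mapper and reducer implementing the same GROUP BY: the mapper emits a composite <pageid, age> key with a count of 1, the shuffle groups identical keys, and the reducer sums the counts.

public static class GroupByMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] cols = value.toString().split("\t");       // pv_users row: pageid \t age
        output.collect(new Text(cols[0] + "\t" + cols[1]), one);
    }
}

public static class GroupByReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int count = 0;
        while (values.hasNext()) count += values.next().get();
        output.collect(key, new IntWritable(count));        // pageid \t age, count
    }
}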

Page 26

Stream Data Processing

Page 27

Distributed Stream Processing Engine

Stream data: an unbounded sequence of event tuples
E.g., sensor data, stock trading data, web traffic data, …

Since large volumes of data flow from many sources, centralized systems can no longer process them in real time.

Page 28

Distributed Stream Processing Engine

General stream processing model
Stream processing processes data before storing it.
cf. Batch systems (like Hadoop) process data after storing it.

Processing Element (PE): a processing unit in a stream engine

Generally, a stream processing engine creates a logical network of processing elements (PEs) connected in a directed acyclic graph (DAG).

Page 29

Distributed Stream Processing Engine

Page 30

DSPE Systems

Apache Storm (current release: 0.10)
Developed by Twitter; donated to the Apache Software Foundation in 2013
Pull-based messaging
http://storm.apache.org/

Apache S4 (current release: 0.6)
Developed by Yahoo; donated to the Apache Software Foundation in 2011
S4 stands for Simple Scalable Streaming Systems
Push-based messaging
http://incubator.apache.org/s4/

Apache Samza (current release: 0.9)
Developed by LinkedIn; donated to the Apache Software Foundation in 2013
Messaging through a message broker (Kafka)
http://samza.apache.org/

Page 31

Apache Storm

System Architecture

Page 32

Apache Storm

Topology: a PE DAG on Storm
Spout: the starting point of a data stream; it can listen on an HTTP port or pull from a queue
Bolt: processes incoming stream tuples; a bolt pulls messages from its upstream PE, so bolts do not take on an excessive amount of messages

Stream grouping
Shuffle grouping, fields grouping, partial key grouping, all grouping, global grouping, …

Message processing guarantee
Each PE keeps an output message until the downstream PE has processed it and sent an acknowledgement message (see the sketch below)
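A minimal sketch of how that guarantee looks from a bolt, assuming the pre-1.0 backtype.storm API matching the release cited above: emitting with the input tuple as an anchor links the new tuple to it, and ack() signals that the input has been fully processed. The class and suffix parameter are illustrative, not from the slides.

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimBolt extends BaseRichBolt {
    private final String suffix;            // e.g. "!!" or "**"
    private OutputCollector collector;

    public ExclaimBolt(String suffix) { this.suffix = suffix; }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        collector.emit(input, new Values(input.getString(0) + suffix));  // anchored emit
        collector.ack(input);                                            // acknowledge the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}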

Page 33

Apache Storm: Spouts

Source of streams
[Diagram: a spout emitting a continuous stream of tuples]

Page 34

Apache Storm: Bolts

Processes input streams and produces new streams
[Diagram: a bolt consuming tuples from input streams and emitting tuples on new streams]

Page 35

Apache Storm: Topology

Network of spouts and bolts

Page 36

Apache Storm: Task

Each spout and bolt executes as many tasks spread across the cluster

Page 37

Apache Storm: Stream grouping

Shuffle grouping: pick a random task

Fields grouping: consistent hashing on a subset of tuple fields

All grouping: send to all tasks

Global grouping: pick task with lowest id

Page 38

Apache Storm

Supported languages: Python, Java, Clojure

Tutorial

Bolt ‘exclaim1’ appends the string “!!” to its input.
Bolt ‘exclaim2’ appends the string “**” to its input.
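A rough sketch of how this tutorial topology could be wired up (again assuming the backtype.storm API; TestWordSpout ships with Storm, and the two bolts are instances of the ExclaimBolt sketch from the earlier slide, one appending "!!" and the other "**"):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclaimBolt("!!"), 3).shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclaimBolt("**"), 2).shuffleGrouping("exclaim1");

StormSubmitter.submitTopology("exclamation-topology", new Config(), builder.createTopology());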

Page 39

Apache Storm

[Diagram: the ‘word’ spout emits “John”, “Bob”, “Rice”; ‘exclaim1’ turns them into “John!!”, “Bob!!”, “Rice!!”; ‘exclaim2’ produces “John**”, “Bob**”, “Rice**”, and “John!!**”, “Bob!!**”, “Rice!!**” for tuples that passed through ‘exclaim1’ first]

Page 40

References

1. Apache Hive, https://hive.apache.org/

2. Design - Apache Hive, https://cwiki.apache.org/confluence/display/Hive/Design

3. Apache Storm, https://storm.apache.org/

Page 41

Spark
Fast, Interactive, Language-Integrated Cluster Computing

Page 42

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

[Diagram: acyclic data flow in MapReduce: the input feeds parallel Map tasks, whose outputs feed Reduce tasks, which write the output]

Page 43

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (machine learning, graphs)
Interactive data mining tools (R, Excel, Python)

With such frameworks, apps reload data from stable storage on each query

Page 44

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability

Support a wide range of applications: batch, query processing, stream processing, graph processing, machine learning

Page 45

RDD Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
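As a small illustration of the split (a sketch with Spark's Java API; the names and data are invented for the example): transformations only define new RDDs, and nothing executes until an action is called.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local[*]", "rdd-ops-sketch");
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

JavaRDD<Integer> squares = nums.map(x -> x * x);             // transformation: lazy
JavaRDD<Integer> evens = squares.filter(x -> x % 2 == 0);    // transformation: lazy

long howMany = evens.count();                                // action: triggers the job
int total = evens.reduce((a, b) -> a + b);                   // action: result returned to the driver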

Page 46

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = sc.textFile("hdfs://...")                       // base RDD
errors = lines.filter(_.startsWith("ERROR"))            // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count              // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers holding HDFS blocks 1-3; each worker caches its partition of messages and sends results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Page 47

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Page 48

Performance

Logistic Regression

https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html

Page 49

Fault Recovery

Run K-means on a 75-node cluster
Each iteration consists of 400 tasks working on 100 GB of data
Lost RDD partitions are reconstructed using lineage

Recovery overhead: 24 s (≈ 30%)
Lineage graph: ≤ 10 KB

Zaharia et al., Resilient Distributed Datasets, NSDI '12

Page 50

Generality

Various types of applications can be built atop RDDs

They can be combined in a single application and run on the Spark runtime

http://spark.apache.org

Page 51

Interactive Analytics

An interactive shell is provided
The program returns results directly
Run ad-hoc queries

Page 52

Demo

WordCount in Scala API

To show the result on the shell, replace counts.saveAsTextFile() with counts.collect()
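The demo itself ran in the Scala shell; as a rough equivalent, here is the same word count sketched with Spark's Java API (Spark 1.x signatures; the paths are placeholders):

JavaSparkContext sc = new JavaSparkContext("local[*]", "wordcount-sketch");

JavaPairRDD<String, Integer> counts = sc.textFile("hdfs://.../input.txt")
    .flatMap(line -> Arrays.asList(line.split(" ")))     // Spark 1.x: flatMap returns an Iterable
    .mapToPair(word -> new Tuple2<>(word, 1))            // Tuple2 comes from the scala package
    .reduceByKey((a, b) -> a + b);

// counts.saveAsTextFile("hdfs://.../output");           // write the result to HDFS, or ...
System.out.println(counts.collect());                    // ... bring it back and print it on the shell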

Page 53

Conclusion

Performance: fast due to caching data in memory

Fault tolerance: fast recovery using lineage history

Programmability: multiple language support; a simple, integrated programming model

Page 54

Up-to-date Trends

Page 55

Up-to-date Trends

Batch + Real-time Analytics

Big-Data-as-a-Service

Page 56

Trend1: Batch + Real-time Analytics

Lambda Architecture

1. Data
Dispatched to both the batch layer and the speed layer

2. Batch layer
Manage the master dataset (an immutable, append-only set of raw data)
Pre-compute the batch views

Page 57

Trend1: Batch + Real-time Analytics

Lambda Architecture

3. Serving layer
Index the batch views so they can be queried in a low-latency, ad-hoc way

4. Speed layer
Deals with recent data only (the serving layer's update cost is high)

5. Merge results from batch views and real-time views when answering queries.

Page 58

Trend2: Big-Data-as-a-Service

Big-Data-as-a-Service
Big data analytics systems are provided as a cloud service
Programming API & monitoring interface
Infrastructure can also be provided as a service
No need to worry about distributing data, resource optimization, resource provisioning, etc.
Users can focus on the data itself

Page 59

Trend2: Big-Data-as-a-Service

Google Cloud Dataflow

[Screenshots: Programming API and Monitoring UI]

Page 60

References

1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI`12

2. Apache Spark, http://spark.apache.org

3. Databricks, http://www.databricks.com

4. Lambda Architecture, http://lambda-architecture.net

5. Google Cloud Dataflow, http://cloud.google.com/dataflow

Page 61

Thank you. Questions?