
XML Parsing with Map Reduce


Page 1: XML Parsing with Map Reduce

www.edureka.co/big-data-and-hadoop

Hadoop: the ultimate data storage and processing together

Page 2: XML Parsing with Map Reduce

Slide 2 www.edureka.co/big-data-and-hadoop

Objectives

At the end of this module, you will be able to:

Analyze different use cases where MapReduce is used

Differentiate between the traditional way and the MapReduce way

Learn about the Hadoop 2.x MapReduce architecture and components

Understand the execution flow of a YARN MapReduce application

Implement basic MapReduce concepts

Run a MapReduce program

Page 3: XML Parsing with Map Reduce

Slide 3 www.edureka.co/big-data-and-hadoop

Where is MapReduce Used?

Weather Forecasting

Problem Statement: » Find the maximum temperature recorded in a year.

Healthcare

Problem Statement: » De-identify personal health information.

Page 4: XML Parsing with Map Reduce

Slide 4 www.edureka.co/big-data-and-hadoop

Where is MapReduce Used?

MapReduce

» Features: large-scale distributed model, parallel programming, a program model
» Function: Map, Reduce
» Used in: Classification, Analytics, Recommendation, Index and Search
» Design Pattern: Classification (e.g. Top N records), Analytics (e.g. Join, Selection), Recommendation (e.g. Sort), Summarization (e.g. Inverted Index)
» Implemented by: Google, Apache Hadoop (for HDFS, Pig, Hive, HBase)

Page 5: XML Parsing with Map Reduce

Slide 5 www.edureka.co/big-data-and-hadoop

The Traditional Way

(Diagram: Very Big Data is split into several Split Data chunks; grep runs on each chunk to produce matches, and cat combines them into all matches)

Page 6: XML Parsing with Map Reduce

Slide 6 www.edureka.co/big-data-and-hadoop

MapReduce Way

(Diagram: Very Big Data is split into several Split Data chunks; the MapReduce Framework applies MAP to each chunk and REDUCE to combine the results into all matches)

Page 7: XML Parsing with Map Reduce

Slide 7 www.edureka.co/big-data-and-hadoop

MapReduce Paradigm

The Overall MapReduce Word Count Process

Input (K1,V1): Deer Bear River / Car Car River / Deer Car Bear

Splitting: Deer Bear River | Car Car River | Deer Car Bear

Mapping, List(K2,V2): (Deer,1) (Bear,1) (River,1) | (Car,1) (Car,1) (River,1) | (Deer,1) (Car,1) (Bear,1)

Shuffling, K2,List(V2): Bear,(1,1) | Car,(1,1,1) | Deer,(1,1) | River,(1,1)

Reducing: Bear,2 | Car,3 | Deer,2 | River,2

Final Result, List(K3,V3): Bear,2  Car,3  Deer,2  River,2

Page 8: XML Parsing with Map Reduce

Slide 8 www.edureka.co/big-data-and-hadoop

Anatomy of a MapReduce Program

Map:    (K1, V1) → List(K2, V2)

Reduce: (K2, List(V2)) → List(K3, V3)

(K = Key, V = Value)
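In the Java API (org.apache.hadoop.mapreduce), these contracts become the type parameters of Mapper and Reducer. A minimal sketch, with LongWritable/Text/IntWritable chosen only as illustrative key/value types:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: map(K1, V1) may emit any number of (K2, V2) pairs.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // context.write(k2, v2) can be called zero or more times per input record
  }
}

// Reducer<K2, V2, K3, V3>: reduce(K2, Iterable<V2>) may emit any number of (K3, V3) pairs.
class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // context.write(k3, v3) is typically called once per distinct key
  }
}
```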

Page 9: XML Parsing with Map Reduce

Slide 9 www.edureka.co/big-data-and-hadoop

Why MapReduce?

Two biggest Advantages:

» Taking processing to the data

» Processing data in parallel

(Diagram: a Map Task runs on the node that holds its HDFS block, inside a rack within the data center)

Page 10: XML Parsing with Map Reduce

Slide 10 www.edureka.co/big-data-and-hadoop

Hadoop 2.x MapReduce Components

Client
» Submits a MapReduce job

Resource Manager
» Cluster-level resource manager
» Long life, high-quality hardware

Node Manager
» One per DataNode
» Monitors resources on the DataNode

ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with the Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)

Container
» Created by the NM when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node

Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
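The Client's part in this picture is simply to describe a job and hand it to the Resource Manager. A minimal, runnable sketch of that submission path (it uses the identity Mapper and Reducer base classes so nothing application-specific is assumed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "identity job");
    job.setJarByClass(SubmitJobDriver.class);   // the "App Jar" shipped with the job
    job.setMapperClass(Mapper.class);           // identity map; a real job sets its own class
    job.setReducerClass(Reducer.class);         // identity reduce
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submits the application to the Resource Manager and polls until it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```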

Page 11: XML Parsing with Map Reduce

Slide 11 www.edureka.co/big-data-and-hadoop

YARN – Moving beyond MapReduce

» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 12: XML Parsing with Map Reduce

Slide 12 www.edureka.co/big-data-and-hadoop

MapReduce Application Execution

Executing MapReduce Application on YARN

Page 13: XML Parsing with Map Reduce

Slide 13 www.edureka.co/big-data-and-hadoop

YARN MR Application Execution Flow

MapReduce Job Execution

» Job Submission

» Job Initialization

» Task Assignment

» Memory Assignment

» Status Updates

» Failure Recovery

Page 14: XML Parsing with Map Reduce

Slide 14 www.edureka.co/big-data-and-hadoop

YARN MR Application Execution Flow

(Diagram: the Client JVM holding the application's Job object, the Resource Manager on the management node, and HDFS)

Run Job
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application Submit Context
   3.1 App Jar
   3.2 Job Resources (block locations)
   3.3 User Information
4. Submit Application Context

Page 15: XML Parsing with Map Reduce

Slide 15 www.edureka.co/big-data-and-hadoop

YARN MR Application Execution Flow

(Diagram: the Client JVM, the Resource Manager on the management node, a Node Manager and the App Master on a Data Node, and HDFS)

Run Job
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application Submit Context
   3.1 App Jar
   3.2 Job Resources (block locations)
   3.3 User Information
4. Submit Application Context
5. Start AppMaster container / allocate context for the AppMaster
6. Allocate container for the AppMaster
7. Request resources
8. Notify with resource availability

Page 16: XML Parsing with Map Reduce

Slide 16 www.edureka.co/big-data-and-hadoop

YARN MR Application Execution Flow

(Diagram: the Client, the Resource Manager on the management node, Node Managers on Data node-1 and Data node-2 holding map blocks, the App Master, and HDFS)

1. Notify Start Application
2. Get New Application ID
3. Prepare the Application Submit Context
   3.1 App Jar
   3.2 Job Resources (block locations)
   3.3 User Information
4. Submit Application Context
5. Start AppMaster container / allocate context for the AppMaster
6. Allocate container for the AppMaster
7. Request resources
8. Notify with resource availability
9. Start containers in the worker nodes
10. NMs allocate the containers

Page 17: XML Parsing with Map Reduce

Slide 17 www.edureka.co/big-data-and-hadoop

YARN MR Application Execution Flow

11. The tasks get executed.

12. If the job has any reducers, the AppMaster again requests the Node Manager to start and allocate a container for the reducer.

13. The output of all the maps is given to the reducer, and the reducer is executed.

14. Once the job finishes, the Application Master notifies the Resource Manager and the client library.

15. The Application Master is closed.

Page 18: XML Parsing with Map Reduce

Slide 18 www.edureka.co/big-data-and-hadoop

Hadoop 2.x: YARN Workflow

(Diagram: the Resource Manager, made up of the Scheduler and the Applications Manager (AsM), oversees many Node Managers; AppMaster 1 runs Containers 1.1–1.2 and AppMaster 2 runs Containers 2.1–2.3 on those Node Managers)

Page 19: XML Parsing with Map Reduce

Slide 19 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

(Sequence diagram: Client, RM, NM, AM)

Page 20: XML Parsing with Map Reduce

Slide 20 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM


Page 21: XML Parsing with Map Reduce

Slide 21 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM


Page 22: XML Parsing with Map Reduce

Slide 22 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM

4. AM asks containers from RM


Page 23: XML Parsing with Map Reduce

Slide 23 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM

4. AM asks containers from RM

5. AM notifies NM to launch containers


Page 24: XML Parsing with Map Reduce

Slide 24 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM

4. AM asks containers from RM

5. AM notifies NM to launch containers

6. Application code is executed in container


Page 25: XML Parsing with Map Reduce

Slide 25 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM

4. AM asks containers from RM

5. AM notifies NM to launch containers

6. Application code is executed in container

7. Client contacts RM/AM to monitor application’s status


Page 26: XML Parsing with Map Reduce

Slide 26 www.edureka.co/big-data-and-hadoop

Summary: Application Workflow

Execution Sequence :

1. Client submits an application

2. RM allocates a container to start AM

3. AM registers with RM

4. AM asks containers from RM

5. AM notifies NM to launch containers

6. Application code is executed in container

7. Client contacts RM/AM to monitor application’s status

8. AM unregisters with RM


Page 27: XML Parsing with Map Reduce

Slide 27 www.edureka.co/big-data-and-hadoop

Input Splits

(Diagram: INPUT DATA is divided physically into HDFS Blocks and logically into Input Splits)

Page 28: XML Parsing with Map Reduce

Slide 28 www.edureka.co/big-data-and-hadoop

Relation Between Input Splits and HDFS Blocks

(Diagram: file lines 1–11 laid out across four block boundaries and grouped into three input splits)

Logical records do not fit neatly into the HDFS blocks.

Some logical records (lines) cross a block boundary.

The first split contains line 5 even though that line spans two blocks.
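Split size normally tracks the HDFS block size, but a job can nudge it through FileInputFormat. A small sketch; the 64 MB / 128 MB figures are only illustrative values:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
  static void configure(Job job) {
    // Lower and upper bounds on the size of each input split, in bytes.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB
  }
}
```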

Page 29: XML Parsing with Map Reduce

Slide 29 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

(Diagram: INPUT DATA distributed across Node 1 and Node 2)

Page 30: XML Parsing with Map Reduce

Slide 30 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

Each map task works on a “split” of data

Page 31: XML Parsing with Map Reduce

Slide 31 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

Each map task works on a “split” of data

Mapper outputs intermediate data


Page 32: XML Parsing with Map Reduce

Slide 32 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

Each map task works on a “split” of data

Mapper outputs intermediate data

Data exchange between nodes in a “shuffle” process


Page 33: XML Parsing with Map Reduce

Slide 33 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

Each map task works on a “split” of data

Mapper outputs intermediate data

Data exchange between nodes in a “shuffle” process

Intermediate data of the same key goes to the same reducer


Page 34: XML Parsing with Map Reduce

Slide 34 www.edureka.co/big-data-and-hadoop

MapReduce Job Submission Flow

Input data is distributed to nodes

Each map task works on a “split” of data

Mapper outputs intermediate data

Data exchange between nodes in a “shuffle” process

Intermediate data of the same key goes to the same reducer

Reducer output is stored


Page 35: XML Parsing with Map Reduce

Slide 35 www.edureka.co/big-data-and-hadoop

Combiner

Block 1: B C D E D B
Block 2: D A A C B D

Mapper output (Block 1): (B,1) (C,1) (D,1) (E,1) (D,1) (B,1)
Mapper output (Block 2): (D,1) (A,1) (A,1) (C,1) (B,1) (D,1)

Combiner output (Block 1): (B,2) (C,1) (D,2) (E,1)
Combiner output (Block 2): (D,2) (A,2) (C,1) (B,1)

Shuffle: (A,[2]) (B,[2,1]) (C,[1,1]) (D,[2,2]) (E,[1])

Reducer output: (A,2) (B,3) (C,2) (D,4) (E,1)
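In the Java API a combiner is just a Reducer run on each mapper's local output before the shuffle. A minimal wiring sketch, using the IntSumReducer class that ships with Hadoop (valid here because summing counts is associative and commutative):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerWiring {
  static void configure(Job job) {
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates (key, 1) pairs on the map side
    job.setReducerClass(IntSumReducer.class);  // final aggregation after the shuffle
  }
}
```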

Page 36: XML Parsing with Map Reduce

Slide 36 www.edureka.co/big-data-and-hadoop

Partitioner – Redirecting Output from Mapper

(Diagram: the output of each Map passes through a Partitioner, which redirects each record to one of the Reducers)
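A minimal custom Partitioner sketch: it picks the reducer that will receive each (key, value) pair. This example simply hashes the key, which is what the default HashPartitioner already does; a real partitioner would encode application-specific routing (e.g. by key range):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Must return a partition number in the range [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

It is registered on the job with job.setPartitionerClass(MyPartitioner.class), together with job.setNumReduceTasks(n).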

Page 37: XML Parsing with Map Reduce

Slide 37 www.edureka.co/big-data-and-hadoop

Getting Data to the Mapper

(Diagram: each Input File is divided into Input Splits; a RecordReader reads each split and feeds its records to a Mapper, which emits intermediates)

Page 38: XML Parsing with Map Reduce

Slide 38 www.edureka.co/big-data-and-hadoop

Partition and Shuffle

(Diagram: each Mapper's intermediates pass through a Partitioner; the shuffled intermediates for each partition are delivered to one Reducer)

Page 39: XML Parsing with Map Reduce

Slide 39 www.edureka.co/big-data-and-hadoop

Demo: Word Count Program, to illustrate the default Input Format (TextInputFormat)
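A minimal word count in the spirit of the demo above; it assumes the default TextInputFormat, so K1 is the byte offset of a line and V1 is the line itself:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // (K1 = offset, V1 = line) -> List(K2 = word, V2 = 1)
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // (K2 = word, List(V2) = counts) -> List(K3 = word, V3 = total)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // safe: counts are summed associatively
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar and directory names are placeholders).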

Page 40: XML Parsing with Map Reduce

Slide 40 www.edureka.co/big-data-and-hadoop

Input Format

(Diagram: the Input Format divides each input file into Input Splits; a RecordReader reads each split and passes its records to a Mapper, which produces intermediates)

Page 41: XML Parsing with Map Reduce

Slide 41 www.edureka.co/big-data-and-hadoop

Input Format – Class Hierarchy (org.apache.hadoop.mapreduce)

» InputFormat<K,V>
  » FileInputFormat<K,V>
    » CombineFileInputFormat<K,V>
    » TextInputFormat
    » KeyValueTextInputFormat
    » NLineInputFormat
    » SequenceFileInputFormat<K,V>
      » SequenceFileAsBinaryInputFormat
      » SequenceFileAsTextInputFormat
      » SequenceFileInputFilter<K,V>
  » DBInputFormat<T>
  » ComposableInputFormat<K,V> <<interface>>
    » CompositeInputFormat<K,V>
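Switching away from the default TextInputFormat is a one-line change on the job. For example, KeyValueTextInputFormat reads each line as a tab-separated key/value pair instead of (offset, line):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatChoice {
  static void configure(Job job) {
    // Keys and values both arrive in the mapper as Text.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  }
}
```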

Page 42: XML Parsing with Map Reduce

Slide 42 www.edureka.co/big-data-and-hadoop

Output Format

(Diagram: each Reducer writes its results through a RecordWriter to an output file, as directed by the Output Format)

Page 43: XML Parsing with Map Reduce

Slide 43 www.edureka.co/big-data-and-hadoop

Output Format – Class Hierarchy (org.apache.hadoop.mapreduce)

» OutputFormat<K,V>
  » FileOutputFormat<K,V>
    » TextOutputFormat<K,V>
    » SequenceFileOutputFormat<K,V>
      » SequenceFileAsBinaryOutputFormat
  » DBOutputFormat<K,V>
  » NullOutputFormat<K,V>
  » FilterOutputFormat<K,V>
    » LazyOutputFormat<K,V>
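Output formats are chosen the same way. A small sketch: SequenceFileOutputFormat writes binary (key, value) pairs, and LazyOutputFormat can wrap another format so that empty part files are not created for reducers that emit nothing:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatChoice {
  static void configure(Job job) {
    // Write reducer output as a binary sequence file instead of plain text.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
  }

  static void configureLazy(Job job) {
    // Wrap the real format; part files are created only when something is written.
    LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);
  }
}
```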

Page 44: XML Parsing with Map Reduce

Slide 44 www.edureka.co/big-data-and-hadoop

Demo

Demo: Custom Input Format
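For XML input, a common custom InputFormat approach is a RecordReader that scans for a configurable start tag and end tag and emits each enclosed XML block as one record (the pattern popularized by Mahout's XmlInputFormat). The sketch below follows that idea; the class name and the xmlinput.start / xmlinput.end configuration keys are illustrative assumptions, not a stock Hadoop API:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

  public static final String START_TAG_KEY = "xmlinput.start"; // e.g. "<record>"
  public static final String END_TAG_KEY   = "xmlinput.end";   // e.g. "</record>"

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                             TaskAttemptContext ctx) {
    return new XmlRecordReader();
  }

  /** Emits each start-tag ... end-tag block found in the split as one Text value. */
  public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
    private byte[] startTag;
    private byte[] endTag;
    private long start, end, pos;
    private FSDataInputStream in;
    private final DataOutputBuffer buffer = new DataOutputBuffer();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      Configuration conf = ctx.getConfiguration();
      startTag = conf.get(START_TAG_KEY).getBytes("UTF-8");
      endTag = conf.get(END_TAG_KEY).getBytes("UTF-8");
      FileSplit fileSplit = (FileSplit) split;
      start = fileSplit.getStart();
      end = start + fileSplit.getLength();
      Path file = fileSplit.getPath();
      in = file.getFileSystem(conf).open(file);
      in.seek(start);
      pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      // Find the next start tag in this split, then copy bytes up to and including the end tag.
      if (pos < end && readUntilMatch(startTag, false)) {
        buffer.reset();
        buffer.write(startTag);
        if (readUntilMatch(endTag, true)) {
          key.set(pos);
          value.set(buffer.getData(), 0, buffer.getLength());
          return true;
        }
      }
      return false;
    }

    /** Scans forward byte by byte until the tag is matched, EOF, or the split boundary. */
    private boolean readUntilMatch(byte[] tag, boolean withinRecord) throws IOException {
      int matched = 0;
      while (true) {
        int b = in.read();
        pos++;
        if (b == -1) return false;                       // end of file
        if (withinRecord) buffer.write(b);               // keep the record body
        if (b == tag[matched]) {
          if (++matched >= tag.length) return true;      // full tag matched
        } else {
          matched = 0;
        }
        if (!withinRecord && matched == 0 && pos >= end) return false; // stop at split end
      }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return (pos - start) / (float) (end - start); }
    @Override public void close() throws IOException { in.close(); }
  }
}
```

A driver would register it with job.setInputFormatClass(XmlInputFormat.class) and set the two tag properties on the job's Configuration; the mapper then receives one complete XML element per call and can parse it with any XML parser.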

Page 45: XML Parsing with Map Reduce