CS 626 Large Scale Data Science Jun Zhang January 30, 2020 Originally created by Dr. Licong Cui Lecture 4 – Hadoop System



Outline

Hadoop Distributed File System (HDFS)

MapReduce

Hands On


Review: Basic Scalable Computing Concepts

Distributed File Systems

Scalable Computing over the Internet

Programming Models for Big Data

Hadoop Ecosystem


Hadoop Ecosystem – Layer Diagram


What is Hadoop?

Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals/Requirements

Abstract and facilitate the storage and processing of large and/or rapidly growing data sets

High scalability and availability

Use commodity hardware (cheap!)

Fault-tolerance

Move computation to data


Hadoop Architecture


Hadoop Architecture (cont.)

HDFS: Name Node, Data Node

MapReduce: Job Tracker, Task Tracker


Hadoop Distributed File System (HDFS)

Scalability: Split files into blocks across nodes for parallel access

Default block size: 128MB

[Figure: a file split into blocks A, B, C, D, distributed across Node 1 to Node 4]
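As a rough illustration of the splitting step, the sketch below computes how many blocks a file occupies at the default 128MB block size and spreads them over four nodes. The helper names and the round-robin assignment are assumptions made for illustration only; HDFS picks block locations with its own placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size from the slide: 128 MB


def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size


def assign_round_robin(n_blocks, nodes):
    """Hypothetical placement: block i goes to node i mod len(nodes)."""
    return {i: nodes[i % len(nodes)] for i in range(n_blocks)}


if __name__ == "__main__":
    size = 500 * 1024 * 1024  # a 500 MB file
    n = num_blocks(size)
    print(n)  # 4: three full 128 MB blocks plus one 116 MB block
    print(assign_round_robin(n, ["Node 1", "Node 2", "Node 3", "Node 4"]))
```

Because each block lives on a different node, the four blocks can be read in parallel.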


Hadoop Distributed File System (HDFS)

Reliability: Replication for fault tolerance

Default: replicates 3 times

[Figure: a file's blocks A, B, C, D, each stored as three replicas spread across Node 1 to Node 4]
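To see why three replicas give fault tolerance, the following sketch places each block on three distinct nodes and checks that every block is still readable after any single node fails. The placement rule (consecutive nodes) and the function names are hypothetical and chosen for simplicity; they are not the HDFS placement policy.

```python
REPLICATION = 3  # default replication factor from the slide


def place_replicas(blocks, nodes, replication=REPLICATION):
    """Hypothetical placement: replicas of block i go to consecutive nodes,
    starting at node i, so each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def survives_failure(placement, failed_node):
    """True if every block still has a replica on some other node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


if __name__ == "__main__":
    nodes = ["Node 1", "Node 2", "Node 3", "Node 4"]
    placement = place_replicas(["A", "B", "C", "D"], nodes)
    for block, holders in placement.items():
        print(block, holders)
    print(all(survives_failure(placement, n) for n in nodes))  # True
```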


Hadoop Architecture


HDFS Components

Name Node (admin/master): metadata, manages blocks

Data Node (slave): actual data, block storage

Backup Node (name node): checkpoints


HDFS Components (cont.)


Hadoop Rack Aware Replication


HDFS Name Node

Stores metadata for the files, like the directory structure of a typical FS.

Transaction log for file deletes/adds, etc.

Handles creation of more replica blocks when necessary after a Data Node failure
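The re-replication duty can be pictured with a small sketch: given the Name Node's block-to-node metadata and the set of live Data Nodes, find the blocks that have fallen below the target replica count. The data layout and the function name are assumptions for illustration, not the actual Name Node data structures.

```python
def under_replicated(block_map, live_nodes, target=3):
    """Return {block: missing_replicas} for blocks with fewer than
    `target` replicas on live nodes."""
    missing = {}
    for block, holders in block_map.items():
        live = [node for node in holders if node in live_nodes]
        if len(live) < target:
            missing[block] = target - len(live)
    return missing


if __name__ == "__main__":
    block_map = {
        "A": ["dn1", "dn2", "dn4"],
        "B": ["dn1", "dn2", "dn3"],
    }
    # dn4 has failed, so block A is one replica short.
    print(under_replicated(block_map, live_nodes={"dn1", "dn2", "dn3"}))
    # {'A': 1}
```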


HDFS Data Node

Stores the actual data in HDFS

Notifies Name Node of what blocks it has

Replicates blocks 2x in local rack, 1x elsewhere


Write Files to HDFS


Hadoop Rack Aware Replication


Read Files from HDFS


MapReduce

Programming model for Hadoop ecosystem

Based on functional programming

Map = apply operation to all elements

Reduce = summarize operation on elements

f(x) = y
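The functional-programming roots can be shown with Python's built-in `map` and `functools.reduce`. This is only a single-machine analogy for the two operations, with an arbitrary choice of f(x) = x * x; it is not Hadoop code.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply f(x) = y to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the mapped elements into a single summary value.
total = reduce(lambda acc, y: acc + y, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Since each application of f touches only one element, the map step can be spread across machines without coordination.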


MapReduce Engine

Job Tracker & Task Tracker

Job Tracker splits a job into smaller tasks (“Map”) and sends them to the Task Tracker process on each node

Task Tracker reports job progress back to the Job Tracker node, sends data (“Reduce”), or requests new tasks


MapReduce Job Tracker

Runs on the master node (often the same machine as the NameNode)

Receives MapReduce execution requests from the client

Talks to NameNode to determine the location of the data

Finds the best TaskTracker nodes to execute tasks

Monitors individual TaskTrackers and reports the overall status of the job back to the client

When the JobTracker is down, HDFS remains functional, but new MapReduce jobs cannot be started and running MapReduce jobs are halted.


MapReduce Task Tracker

Runs on DataNode

Executes Mapper and Reducer tasks assigned by the JobTracker

Continuously communicates with the JobTracker, signaling the progress of the tasks in execution

TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker reassigns the tasks it was executing to another node.


Heartbeats from Datanode

A DataNode sends heartbeats to the NameNode to report its status

The default interval is 3 seconds

If a DataNode does not send a heartbeat to the NameNode for ten minutes, the NameNode considers it out of service and treats the block replicas it hosts as unavailable.
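A toy version of this bookkeeping is sketched below, using the 3-second interval and ten-minute threshold quoted above. The class and method names are hypothetical, and timestamps are passed in explicitly so the example is deterministic; this is not the HDFS implementation.

```python
HEARTBEAT_INTERVAL_S = 3   # default interval from the slide
DEAD_AFTER_S = 10 * 60     # "ten minutes" from the slide


class HeartbeatMonitor:
    """Toy NameNode-side bookkeeping of DataNode heartbeats."""

    def __init__(self, timeout_s=DEAD_AFTER_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    mon = HeartbeatMonitor()
    mon.heartbeat("dn1", now=0)
    mon.heartbeat("dn2", now=0)
    # dn1 keeps sending heartbeats every 3 seconds; dn2 goes silent.
    for t in range(HEARTBEAT_INTERVAL_S, 601, HEARTBEAT_INTERVAL_S):
        mon.heartbeat("dn1", now=t)
    print(mon.dead_nodes(now=601))  # ['dn2']
```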


MapReduce Model

Map -> Sort & Shuffle -> Reduce


Word Count Example

Given a large file of words

Count the number of times each distinct word appears in the file

Sample application

Analyze web server logs to find popular URLs


Map + Reduce

Map

Accepts input <key,value> pair

Emits intermediate <key,value> pair

Reduce

Accepts intermediate <key,value*> pair

Emits output <key,value> pair


<key, value>

Map

(k1, v1) -> list(k2, v2)

Reduce

(k2, list(v2)) -> list(k3, v3)
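These two signatures can be exercised with a small in-memory driver. The sketch below runs all three phases in one process and counts URL hits, in the spirit of the web server log application mentioned earlier. The driver name, the log format, and the mapper and reducer names are assumptions for illustration; a real Hadoop job distributes these phases across nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle/sort, and reduce phases."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):        # Map: (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)            # Shuffle: collect list(v2) per k2
    output = []
    for k2 in sorted(groups):                # Sort: visit keys in order
        output.extend(reducer(k2, groups[k2]))  # Reduce: (k2, list(v2)) -> list(k3, v3)
    return output


def url_mapper(line_no, log_line):
    # Hypothetical log format: "<method> <url>"
    method, url = log_line.split()
    yield (url, 1)


def hit_reducer(url, counts):
    yield (url, sum(counts))


if __name__ == "__main__":
    log = [(1, "GET /index.html"), (2, "GET /about.html"), (3, "GET /index.html")]
    print(run_mapreduce(log, url_mapper, hit_reducer))
    # [('/about.html', 1), ('/index.html', 2)]
```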


Word Count using MapReduce (Pseudocode)

map(key, value):
    // key: hidden line number; value: line text
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
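The pseudocode translates almost line for line into Python. In the sketch below, `emit` becomes `yield`, and the sort and shuffle step that Hadoop performs between the two functions is imitated with a local sort followed by `itertools.groupby`. The `word_count` helper and the sample sentences are assumptions added so the example runs on one machine.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: line number (unused); value: line text
    for word in value.split():
        yield (word, 1)


def reduce_fn(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for v in values:
        result += v
    yield (key, result)


def word_count(lines):
    """Local stand-in for the framework: map, sort and group, then reduce."""
    pairs = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    pairs.sort(key=itemgetter(0))  # sort & shuffle: bring equal words together
    out = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(word, (count for _, count in group)))
    return out


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```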


Word Count using MapReduce in Java


Hands On Materials

Hadoop MapReduce Example

http://docs.cloudera.com/documentation/other/tutorial/CDHS/topics/Hadoop=Tutorial.html

Get this example to work on your machine


MapReduce Pros and Cons

MapReduce architecture provides

Automatic parallelization & distribution

Fault tolerance

I/O scheduling

Monitoring & status updates

MapReduce is not suitable for

Frequently changing data

Dependent tasks

Interactive analysis


MapReduce

Simplified parallel programming

Applications with independent data parallel tasks


Hands On: Basic File Manipulation in HDFS

Create directory in HDFS

hadoop fs -mkdir cs626

Copy file to HDFS

hadoop fs -copyFromLocal words.txt cs626

List files in HDFS directory

hadoop fs -ls cs626

Copy a file within HDFS

hadoop fs -cp cs626/words.txt cs626/words2.txt


Hands On: Basic File Manipulation in HDFS (cont.)

Copy a file from HDFS

hadoop fs -copyToLocal cs626/words2.txt

Delete a file in HDFS

hadoop fs -rm cs626/words2.txt


Hands On: Run the Word Count program

Execute the Word Count application

hadoop jar wordcount.jar cs626/words.txt cs626/output/

Copy the results from Word Count out of HDFS

hadoop fs -copyToLocal cs626/output/part-r-00000 local.txt


Hands On Materials

Create and Execute MapReduce in Eclipse

https://www.youtube.com/watch?v=VzKGdM4hc74

Build a MapReduce Code Using Maven in Eclipse

https://www.youtube.com/watch?v=JwnUl42-JSE

Apache Hadoop Main 2.9.1 API

https://hadoop.apache.org/docs/current/api/


Reading Materials

The Hadoop Distributed File System

by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler

MapReduce: Simplified Data Processing on Large Clusters

by Jeffrey Dean and Sanjay Ghemawat
