
Page 1:

MapReduce (Hadoop) Robert Neumann

The contents of this lecture are based to a great extent on the books

“Hadoop in Action” and “Hadoop – The Definitive Guide”

Page 2:

MapReduce

A framework for processing “planet-size” data

1. “Map” Phase

- Master node takes the input and splits it up into smaller problems

- Assigns the problems to worker nodes, which in turn can split them further

- Each worker node processes its problem and returns the answer to the master

2. “Reduce” Phase

- Master recombines the answers of all worker nodes to form the overall solution

Page 3:

MapReduce – Divide and Conquer

[Diagram: a master node fans work out to several worker nodes in the map phase and recombines their answers in the reduce phase.]

Page 4:

Case Study – New York Times (NYT)

• Converting 11 million documents from the NYT archive

• NYT decided to make all articles between 1851 and 1922 freely available

• Articles were stored as TIFF images; the pieces of each article had to be combined into one PDF

• First, NYT chose a real-time approach to scale, glue, and convert the TIFF images

• This worked well for a small number of images, but did not scale

• The solution was to pregenerate all articles as PDFs

• Archive had 11 million articles with 4 TB of data

• NYT copied the 4 TB of data into S3, built a Hadoop program to do the transformation, and ran it on 100 nodes

• Job ran for 24 hours and generated another 1.5 TB of data

• The whole job cost $240 in computation (100 nodes × 24 hours × $0.10 per node-hour)

Page 5:

Case Study – StumbleUpon (SU)

• SU uses HBase and Hadoop to analyze the click ratings of its users

• 10 million users, millions of click ratings per day

1) For a 20-node cluster

• Write performance ranged between 100,000 and 300,000 operations per second (rows were ~100 bytes in size)

• Using an 80-way-parallel MapReduce read aggregation job, the cluster achieved a total read speed of 4.5 million rows per second

• At this speed, reading their largest tables takes less than an hour

2) For a 10 GB Apache log

  System                Result
  1 node (Hadoop)       21m46s
  3 nodes (Hadoop)      8m3s
  15 nodes (Hadoop)     1m30s
  Naive Perl script     42m49s

Page 6:

MapReduce Terminology

• Job

• Piece of work that the client wants to be performed

• Consists of input data, the program, and the configuration

• A job is divided into tasks

• Task

• Distinguished into map tasks and reduce tasks

• Input splits

• Input data divided into fixed-size pieces

• One map task is created for each split

• Each map task runs the user-defined map function for each record in its split

Page 7:

MapReduce Concept

(Dean, 2004)

Page 8:

Data Locality Optimization

• Run the map task on the node where the input data resides in HDFS

• To save on valuable cluster bandwidth

• If all nodes hosting HDFS block replicas for a map task's input split are busy, the JobTracker will first look for a free map slot on another node in the same rack, before searching other racks (which results in inter-rack network transfer)

The split size should equal the HDFS block size, since that is the largest amount of input that can be guaranteed to be stored on a single node
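A minimal sketch of this rule, assuming the split-size computation that Hadoop's FileInputFormat performs (the split defaults to the block size unless configured minimum/maximum bounds override it):

```java
// Sketch of the split-size rule: the split size falls back to the HDFS
// block size when no min/max bounds are configured.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block (Hadoop 1.x default)
        long minSize = 1L;                  // no lower bound configured
        long maxSize = Long.MAX_VALUE;      // no upper bound configured

        // With default bounds the split size equals the block size, so each
        // map task reads exactly one locally stored block.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
    }
}
```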

Page 9:

Data Locality Optimization

• Map tasks

• Write their output to the local file system, not to HDFS

• Map output is intermediate output that is consumed by the reduce tasks to produce the final output

• If the node running a map task fails before the map output has been consumed by a reducer, the map task will automatically be rerun on another node

Page 10:

Data Locality Optimization

• Reduce tasks

• Don't have the advantage of data locality

• The input to a single reduce task is normally the output from many (or all) mappers

• Therefore, the sorted map outputs have to be transferred across the network to the reducer

• The reducer's output is normally stored in HDFS (for reliability)

(White, 2012)

Page 11:

Data Locality Optimization

• Reduce tasks

(White, 2012)

Page 12:

Data Locality Optimization

• Reduce tasks

• With multiple reducers, map tasks partition their output, each creating one partition for each reduce task

• There can be many keys in each partition, but the records for any given key are all in a single partition (see the partitioner sketch below)

• The “shuffle” is the process by which each reduce task is fed by the outputs of many map tasks

(White, 2012)
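A minimal sketch of the default partitioning rule (Hadoop's HashPartitioner works this way): hashing the key modulo the number of reduce tasks guarantees that all records for a given key land in the same partition. The toy keys below are illustrative only.

```java
// Sketch of hash partitioning: same key -> same reducer, always.
public class HashPartitionSketch {

    static int getPartition(Object key, int numReduceTasks) {
        // Mask the sign bit so negative hash codes still yield a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String word : new String[] {"hadoop", "map", "reduce", "hadoop"}) {
            System.out.println(word + " -> partition " + getPartition(word, reducers));
        }
        // "hadoop" is printed with the same partition number both times:
        // its records all end up in a single partition.
    }
}
```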

Page 13:

Data Locality Optimization

• Reduce tasks

(White, 2012)

Page 14:

Anatomy of a MapReduce Program

MapReduce programs process data by manipulating (key, value) pairs in the general form

Map: (K1, V1) → list(K2, V2)

Reduce: (K2, list(V2)) → list(K3, V3)

The general MapReduce data flow. Note that after the input data has been distributed to different nodes, the only time nodes communicate with each other is at the “shuffle” step. This restriction on communication greatly helps scalability.

(Lam, 2011)
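As a sketch, these two signatures map directly onto the generic type parameters of the Mapper and Reducer base classes in Hadoop's newer org.apache.hadoop.mapreduce API (the names K1…V3 are placeholders, not Hadoop types):

```java
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1, V1) -> list(K2, V2)
abstract class GenericMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    // map(K1 key, V1 value, Context context) is called once per input record
    // and emits zero or more (K2, V2) pairs via context.write(k2, v2).
}

// Reduce: (K2, list(V2)) -> list(K3, V3)
abstract class GenericReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
    // reduce(K2 key, Iterable<V2> values, Context context) receives all values
    // grouped under one key and emits (K3, V3) pairs via context.write(k3, v3).
}
```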

Page 15:

MapReduce Sample Program

Problem: “Count the number of occurrences of each word in a large set of documents”

The map function writes “1” for each word to intermediate files; the reduce function sums up all the “1”s (occurrences) for one word.

(Dean, 2004)
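A minimal Java sketch of this word-count pattern (class names are illustrative; input is assumed to come from TextInputFormat, which supplies the byte offset as K1 and the line of text as V1):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit ("word", 1) for every token in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // writes "1" for each word
            }
        }
    }
}

// Reduce phase: after the shuffle, receive ("word", [1, 1, ...]) and sum
// up all the "1"s (occurrences) for one word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```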

Page 16:

MapReduce Applications

Grep

- 1 TB of data

- Rare three-character pattern (occurrences: 92,337)

- Max. bandwidth = 30 GB/s

- Execution time = 150s (90s + 60s startup overhead)

- Machine count = 1700

(Dean, 2004)

Page 17:

MapReduce Applications

Sort

- 1 TB of data

- Max. bandwidth = 13 GB/s

- Execution time = 891s

- Machine count = 1764

(Dean, 2004)

Page 18:

MapReduce – Indexing the Internet

At Google

- Crawlers provide 20 TB of raw data

- 5–10 MapReduce operations

- 700 lines of code

- Execution time?

Page 19:

Hadoop Components

• General

• Master-slave architecture for both distributed storage (HDFS) and distributed computing

• NameNode

• Master of HDFS; directs the slave DataNodes to perform low-level I/O tasks

• Keeps track of how files are broken down into blocks and which nodes store those blocks

• Single point of failure; if other nodes die, the cluster will continue to function smoothly; not so if the NameNode dies

• There is only one NameNode per Hadoop cluster

Page 20:

Hadoop Components

• DataNode

• Reads and writes HDFS blocks to actual files on the local filesystem

• When a file is read or written, it is broken down into blocks, and the NameNode tells the client which DataNode each block resides on

• Clients communicate directly with the DataNodes to process the files (see the client read sketch below)

• DataNodes may also communicate with one another to replicate blocks

(Lam, 2011)
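A minimal sketch of such a client-side read using Hadoop's FileSystem API (the file path is a made-up example; the NameNode lookup happens behind fs.open, after which the bytes stream directly from the DataNodes):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to HDFS
        Path file = new Path("/user/demo/sample.txt"); // hypothetical file

        // fs.open asks the NameNode for the block locations; the stream then
        // reads the block contents directly from the DataNodes.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```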

Page 21:

Hadoop Components

• DataNode

(Lam, 2011)

Page 22:

Hadoop Components

• Secondary NameNode (SNN)

• Assistant daemon for monitoring the state of the HDFS cluster

• Does not receive or record any real-time changes to HDFS

• Instead, it communicates with NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster config

• If the NameNode dies, the snapshots help minimize downtime and data loss, since a new NameNode can be brought up quickly and restored from the backup HDFS metadata held by the SNN

• There is only one SNN per Hadoop cluster

Page 23:

Hadoop Components

• JobTracker

• The “liaison” between a client application and Hadoop

• Once code (a program) is submitted to the cluster, the JobTracker

• Determines execution plan,

• Determines which files to process,

• Assigns nodes to different tasks, and

• Monitors all tasks

• If a task fails, the JobTracker automatically relaunches the task, possibly on a different node (a job submission sketch follows this list)

• Only one JobTracker per Hadoop cluster!
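A minimal sketch of that submission path, using the classic (MRv1) JobConf/JobClient API; the commented-out mapper and reducer classes are hypothetical stand-ins for old-API implementations:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// JobClient hands the JobConf to the JobTracker, which then determines the
// execution plan, assigns tasks to TaskTrackers, and monitors them.
public class SubmitJobSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJobSketch.class);
        conf.setJobName("word-count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(OldApiWordCountMapper.class);   // hypothetical
        // conf.setReducerClass(OldApiWordCountReducer.class); // hypothetical
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // blocks until the JobTracker reports completion
    }
}
```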

Page 24:

Hadoop Components

• TaskTracker

• Is responsible for executing the individual tasks the JobTracker assigns to it

• Each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel

• Constantly sends heartbeats to the JobTracker; if the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes the TaskTracker has failed and launches the same tasks on another TaskTracker

(Lam, 2011)

Page 25:

Hadoop Components

• TaskTracker

(Lam, 2011)

Page 26:

YARN (MapReduce2)

• YARN = Yet Another Resource Negotiator

• Developed in 2010 by a group at Yahoo!

• Remedies the scalability shortcomings of classic MapReduce by splitting the responsibilities of the JobTracker

• Job scheduling (matching tasks with TaskTrackers)

• Task progress monitoring (keeping track of tasks, restarting failed or slow tasks, maintaining counter totals)

• Separates these two roles into two independent daemons

• ResourceManager (manages resources across the cluster)

• Application master (manages the lifecycle of applications running on the cluster)

• In contrast to the JobTracker, each instance of a MapReduce application has a dedicated application master, which runs for the duration of the application (see the YARN submission sketch below)
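For contrast, a minimal sketch of submitting the same kind of job on Hadoop 2, where the client talks to the ResourceManager and a per-job application master takes over; it reuses the word-count classes sketched earlier, and the explicit framework property is shown only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitOnYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // route the job to YARN

        Job job = Job.getInstance(conf, "word-count");
        job.setJarByClass(SubmitOnYarn.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountReducer.class); // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The ResourceManager allocates a container for the application
        // master, which then manages the map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```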

Page 27:

And now lean back after another tough lecture…
