Alexandru Costan Apache Hadoop

Apache Hadoop - IRISA · Platform for data storage and processing • Scalable ... What for? 7 • Advocated by industry’s premier Web players (Google, Yahoo!, Microsoft, Facebook)


Alexandru Costan

Apache Hadoop


2 Big Data Landscape

No one-size-fits-all solution: SQL, NoSQL, MapReduce, …

No standard, except Hadoop


3 Outline

•  What is Hadoop?
•  Who uses it?
•  Architecture
•  HDFS
•  MapReduce
•  Open Issues
•  Examples



5 What is Hadoop?

Hadoop is a top-level Apache project
•  Open source implementation of MapReduce
•  Developed in Java

Platform for data storage and processing
•  Scalable
•  Fault tolerant
•  Distributed
•  Any type of complex data


6 Why?


7 What for?

•  Advocated by industry’s premier Web players (Google, Yahoo!, Microsoft, Facebook) as the engine to power the cloud.
•  Used for:
    •  Batch data processing, not real-time / user facing: web search
    •  Log processing
    •  Document analysis and indexing
    •  Web graphs and crawling
    •  Highly parallel data intensive distributed applications


8 Who uses Hadoop?


9 Components: the Hadoop stack


10 HDFS

Distributed storage system
•  Files are divided into large blocks (128 MB)
•  Blocks are distributed across the cluster
•  Blocks are replicated to protect against hardware failure
•  Data placement is exposed so that computation can be migrated to data
•  Master / Slave architecture

Notable differences from mainstream DFS work
•  Single ‘storage + compute’ cluster vs. separate clusters
•  Simple I/O centric API
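To make the block-splitting idea concrete, here is a minimal Python sketch of how a file could be cut into fixed-size blocks. The 128 MB value comes from the slide; the function name `split_into_blocks` is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes.

    Illustrative helper only: every block is full-size except possibly the last.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

For example, a 300 MB file yields two full blocks and one 44 MB block; each of these would then be replicated and placed independently.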


11 HDFS Architecture

HDFS Master: NameNode
•  Manages all file system metadata in memory:
    •  List of files
    •  For each file name: a set of blocks
    •  For each block: a set of DataNodes
    •  File attributes (creation time, replication factor)
•  Controls read/write access to files
•  Manages block replication
•  Transaction log: registers file creation, deletion, etc.
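The metadata listed above can be pictured as a few in-memory maps plus an append-only log. The sketch below is a toy model under that assumption; `ToyNameNode`, `FileEntry` and their methods are hypothetical names, not the real NameNode classes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    # Per-file attributes named on the slide
    blocks: list = field(default_factory=list)
    replication: int = 3
    created: float = field(default_factory=time.time)


class ToyNameNode:
    """Toy model of NameNode metadata: all state lives in memory."""

    def __init__(self):
        self.files = {}            # file name -> FileEntry
        self.block_locations = {}  # block id -> set of DataNode names
        self.edit_log = []         # append-only transaction log

    def create(self, name, replication=3):
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = FileEntry(replication=replication)
        self.edit_log.append(("create", name))

    def add_block(self, name, block_id, datanodes):
        self.files[name].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.edit_log.append(("add_block", name, block_id))

    def delete(self, name):
        entry = self.files.pop(name)
        for block_id in entry.blocks:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", name))

    def locate(self, name):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[name].blocks]
```

Keeping everything in memory makes lookups fast, which is also why the number of files the NameNode can track is bounded by its RAM.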


12 HDFS Architecture

HDFS Slaves: DataNodes
•  A DataNode is a block server
    •  Stores data in the local file system (e.g. ext3)
    •  Stores meta-data of a block (e.g. CRC)
    •  Serves data and meta-data to Clients
•  Block report
    •  Periodically sends a report of all existing blocks to the NameNode
•  Pipelining of data
    •  Forwards data to other specified DataNodes
•  Performs replication tasks upon instruction by the NameNode
•  Rack-aware


13 HDFS Architecture


14 Fault tolerance in HDFS

NameNode uses heartbeats to detect DataNode failures:
•  Once every 3 seconds
•  Chooses new DataNodes for new replicas
•  Balances disk usage
•  Balances communication traffic to DataNodes

Multiple copies of a block are stored:
•  Default replication: 3
•  Copy #1 on another node on the same rack
•  Copy #2 on another node on a different rack
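The placement rule stated on this slide can be sketched as follows. This is only an illustration of the slide's rule (original on the writer's node, copy #1 on the same rack, copy #2 on a different rack); the default policy in a given HDFS release may differ. The function name and the `topology` dictionary layout are assumptions, and candidates are picked in sorted order purely to keep the example deterministic.

```python
def place_replicas(writer, topology):
    """Pick three nodes for a block following the rule on the slide.

    topology: dict mapping rack name -> list of node names (assumed layout).
    Returns [node holding the original, copy #1, copy #2].
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    same_rack = sorted(n for n in topology[local_rack] if n != writer)
    other_racks = sorted(
        n for rack, nodes in topology.items() if rack != local_rack for n in nodes
    )
    if not same_rack or not other_racks:
        raise ValueError("topology too small for the 3-replica rule")
    return [writer, same_rack[0], other_racks[0]]
```

With replicas on two racks, the loss of a single node or of a whole rack still leaves at least one readable copy.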


15 Data Correctness

Use checksums to validate data
•  CRC32

File creation
•  Client computes a checksum per 512 bytes
•  DataNode stores the checksum

File access
•  Client retrieves the data and checksum from the DataNode
•  If validation fails, client tries other replicas
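The checksum scheme above can be sketched with Python's standard `zlib.crc32`. The 512-byte chunk size comes from the slide; the helper names `checksums` and `read_with_validation` are hypothetical, and replicas are modeled as plain byte strings rather than network reads.

```python
import zlib

CHUNK = 512  # bytes per checksum, as stated on the slide


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 value per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_with_validation(replicas, expected):
    """Return the first replica whose checksums match the stored ones.

    replicas: list of byte strings, one per DataNode holding the block.
    expected: checksum list computed by the client at file creation.
    """
    for data in replicas:
        if checksums(data) == expected:
            return data
    raise IOError("all replicas failed checksum validation")
```

A client that receives a corrupted copy from the first DataNode simply falls through to the next replica in the list.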


16 NameNode failures

A single point of failure

Transaction Log stored in multiple directories
•  A directory on the local file system
•  A directory on a remote file system (NFS/CIFS)

The Secondary NameNode holds a backup of the NameNode data
•  On the same machine :(

Need to develop a really highly available solution!


17 Data pipelining

Client retrieves a list of DataNodes on which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to the second
•  The second DataNode forwards the data to the third DataNode in the pipeline
•  When all replicas are written, the client moves on to write the next block in the file
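The write pipeline described above can be sketched as a chain of forwarding calls. This is a single-process toy model: `ToyDataNode` and `write_file` are hypothetical names, and real HDFS streams data in packets over the network rather than passing whole blocks between objects.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes stored on this node

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it down the pipeline.

        Returns the names of all nodes that stored the block (the acks).
        """
        self.blocks[block_id] = data
        acks = [self.name]
        if downstream:
            acks += downstream[0].write(block_id, data, downstream[1:])
        return acks


def write_file(blocks, pipeline):
    """Write each (block id, data) pair through the pipeline, one block at a time."""
    for block_id, data in blocks:
        acks = pipeline[0].write(block_id, data, pipeline[1:])
        if len(acks) != len(pipeline):
            raise IOError(f"block {block_id} not fully replicated")
```

The client only talks to the first DataNode, so its outgoing bandwidth is used once per block regardless of the replication factor.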


18 User interface

Commands for HDFS User:
–  hadoop dfs -put /localsrc /dest
–  hadoop dfs -mkdir /foodir
–  hadoop dfs -cat /foodir/myfile.txt
–  hadoop dfs -rm /foodir/myfile.txt
–  hadoop dfs -cp /src /dest

Commands for HDFS Administrator:
–  hadoop dfsadmin -report
–  hadoop dfsadmin -decommission datanodename

Web Interface:
–  http://host:port/dfshealth.jsp


19 Hadoop MapReduce

Parallel processing for large datasets
Relies on HDFS
Master-Slave architecture:
•  Job Tracker
•  Task Trackers


20 Hadoop MapReduce Architecture

Map-Reduce Master: JobTracker
•  Accepts MapReduce jobs submitted by users
•  Assigns Map and Reduce tasks to TaskTrackers
•  Monitors task and TaskTracker status
•  Re-executes tasks upon failure


21 Hadoop MapReduce Architecture

Map-Reduce Slaves: TaskTrackers
•  Run Map and Reduce tasks upon instruction from the JobTracker
•  Manage storage and transmission of intermediate output
•  Generic Reusable Framework supporting pluggable user code (file system, I/O format)


22 Putting everything together: HDFS and MapReduce deployment


23 Hadoop MapReduce Client

•  Define Mapper and Reducer classes and a “launching” program

•  Language support
    •  Java
    •  C++
    •  Python

•  Special case: Maps only
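As an illustration of the three pieces a client defines, the following Python sketch runs a word count entirely in one process. It does not use the Hadoop API: `WordCountMapper`, `SumReducer` and the launcher `run_job` are hypothetical names, and the shuffle is simulated with a dictionary. Passing no reducer models the maps-only special case.

```python
from collections import defaultdict


class WordCountMapper:
    def map(self, key, value):
        """key: record offset (unused), value: one line of text."""
        for word in value.split():
            yield word.lower(), 1


class SumReducer:
    def reduce(self, key, values):
        yield key, sum(values)


def run_job(records, mapper, reducer=None):
    """Hypothetical single-process launcher: map, group by key, reduce."""
    intermediate = [kv for k, v in records for kv in mapper.map(k, v)]
    if reducer is None:  # maps-only job: map output is the final output
        return intermediate
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer.reduce(k, groups[k]))
    return output
```

In a real deployment the map calls would run on the nodes holding the input blocks and the grouping step would move data across the network to the reducers.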


24 Zoom on the Map phase


25 Zoom on the Reduce Phase


26 Data locality

•  Data locality is exposed in the map task scheduling: put tasks where the data is
•  Data are replicated:
    •  Fault tolerance
    •  Performance: divide the work among nodes
•  JobTracker schedules map tasks considering:
    •  Node-aware
    •  Rack-aware
    •  Non-local map tasks
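The three locality levels can be read as a preference order when a TaskTracker asks for work. The sketch below assumes that order (node-local, then rack-local, then non-local); `pick_map_task` and its argument layout are hypothetical and much simpler than the actual scheduler.

```python
def pick_map_task(tracker, pending, replicas, rack_of):
    """Choose a map task for a TaskTracker, preferring local data.

    pending:  list of task ids still to run
    replicas: task id -> set of nodes holding that task's input block
    rack_of:  node name -> rack name
    Returns (task id, locality level) or None if nothing is pending.
    """
    # 1. Node-local: the input block is stored on the requesting node
    for task in pending:
        if tracker in replicas[task]:
            return task, "node-local"
    # 2. Rack-local: a replica is stored on the same rack
    for task in pending:
        if any(rack_of[n] == rack_of[tracker] for n in replicas[task]):
            return task, "rack-local"
    # 3. Non-local: the data must cross racks
    if pending:
        return pending[0], "non-local"
    return None
```

Because each block has several replicas, the chance that some pending task is node-local for the requesting tracker grows with the replication factor.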


27 Fault tolerance

•  TaskTrackers send heartbeats to the JobTracker
    •  Once every 3 seconds
    •  Node is labeled as failed if no heartbeat is received for a defined expiry time (default: 10 minutes)
•  Re-execute all the ongoing and completed tasks of the failed node
•  Need to develop a more efficient policy to prevent re-executing completed tasks (storing this data in HDFS)!
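The expiry rule above can be sketched as a small bookkeeping class. The 3-second interval and 10-minute expiry come from the slide; `ToyJobTracker` and its methods are hypothetical, and time is passed in explicitly as a number of seconds so the behaviour is easy to follow.

```python
HEARTBEAT_INTERVAL = 3  # seconds between heartbeats, as on the slide
EXPIRY = 10 * 60        # seconds without a heartbeat before a node is failed


class ToyJobTracker:
    def __init__(self, expiry=EXPIRY):
        self.expiry = expiry
        self.last_seen = {}  # tracker name -> time of last heartbeat
        self.tasks = {}      # tracker name -> {task id: "running" | "completed"}

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now
        self.tasks.setdefault(tracker, {})

    def assign(self, tracker, task):
        self.tasks.setdefault(tracker, {})[task] = "running"

    def complete(self, tracker, task):
        self.tasks[tracker][task] = "completed"

    def expire(self, now):
        """Drop trackers that missed the expiry window.

        Returns every task of the failed trackers, ongoing or completed,
        since all of them are re-executed under the policy on the slide.
        """
        to_rerun = []
        for tracker, seen in list(self.last_seen.items()):
            if now - seen > self.expiry:
                to_rerun.extend(sorted(self.tasks.pop(tracker, {})))
                del self.last_seen[tracker]
        return to_rerun
```

Completed tasks are included because their intermediate output lived on the failed node, which is exactly the cost the last bullet proposes to avoid.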


28 Speculation in Hadoop

Slow nodes (stragglers) -> run backup tasks


29 Life cycle of Map/Reduce tasks


30 Open Issues - 1

All the metadata are handled through one single Master in HDFS (NameNode)

Performs badly when:
•  Handling many files
•  Heavy concurrency


31 Open Issues - 2

Data locality is crucial for Hadoop’s performance

How can we expose data-locality of Hadoop in the Cloud efficiently?

Hadoop in the Cloud:
•  Unaware of network topology
•  Node-aware or non-local map tasks


32 Open Issues - 2

Data locality in the Cloud

The simplicity of map task scheduling leads to non-local map execution (25%)

[Figure: block placement across Node1 to Node6, with three nodes labeled "Empty node"]


33 Open Issues - 3

Data Skew
•  The current Hadoop hash partitioning works well when the keys are equally frequent and uniformly stored in the data nodes
•  In the presence of partitioning skew:
    •  Variation in Intermediate Keys frequencies
    •  Variation in Intermediate Keys distribution among different Data Nodes
•  The native blind hash partitioning is inadequate and will lead to:
    •  Network congestion
    •  Unfairness in reducers inputs -> Reduce computation skew
    •  Performance degradation


34 Open Issues - 3

Hash code: (Intermediate-key) Modulo ReduceID

Intermediate key counts per Data Node:

              K1   K2   K3   K4   K5   K6
Data Node1     3    5    4    4    1    1
Data Node2     9    1    0    3    2    3
Data Node3     4    3    0    6    5    0

                           Data Node1   Data Node2   Data Node3
Total Out Data Transfer            11           15           18    Total 44/54
Reduce Input                       29           17            8
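The totals in this figure can be reproduced with a short Python sketch. The per-node key counts are read from the figure; the mapping of keys to reducers (key Ki goes to reducer (i - 1) mod 3) and the placement of reducer r on Data Node r+1 are assumptions, chosen because they match the figure's totals. The names `reducer_for` and `shuffle_stats` are hypothetical.

```python
# Intermediate key counts per data node, read from the figure
counts = {
    "DataNode1": {"K1": 3, "K2": 5, "K3": 4, "K4": 4, "K5": 1, "K6": 1},
    "DataNode2": {"K1": 9, "K2": 1, "K4": 3, "K5": 2, "K6": 3},
    "DataNode3": {"K1": 4, "K2": 3, "K4": 6, "K5": 5},
}


def reducer_for(key, num_reducers):
    """Assumed partitioner: key 'Ki' goes to reducer (i - 1) mod num_reducers."""
    return (int(key[1:]) - 1) % num_reducers


def shuffle_stats(counts):
    """Return (records sent off-node per data node, input size per reducer).

    Assumes reducer r runs on the r-th data node, so records whose reducer
    is on the same node do not cross the network.
    """
    nodes = list(counts)
    n = len(nodes)
    out_transfer = [0] * n
    reduce_input = [0] * n
    for i, node in enumerate(nodes):
        for key, c in counts[node].items():
            r = reducer_for(key, n)
            reduce_input[r] += c
            if r != i:
                out_transfer[i] += c
    return out_transfer, reduce_input
```

Under these assumptions 44 of the 54 records cross the network, and the first reducer receives more than three times the input of the last one, which is the computation skew named on the previous slide.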


35 Job scheduling in Hadoop

•  Considerations
    •  Job priority: deadline
    •  Capacity: cluster resources available, resources needed for job
•  FIFO
    •  The first job to arrive at a JobTracker is processed first
•  Capacity Scheduling
    •  Organizes jobs into queues
    •  Queue shares as %’s of cluster
•  Fair Scheduler
    •  Group jobs into “pools”
    •  Divide excess capacity evenly between pools
    •  Delay scheduler to expose data locality
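The "divide excess capacity evenly between pools" idea can be sketched as below. The notion of a guaranteed minimum per pool, the function name `fair_allocation` and the tie-breaking by pool name are assumptions made for illustration; this is not the Fair Scheduler's actual algorithm.

```python
def fair_allocation(total_slots, guaranteed):
    """Split cluster slots between pools.

    guaranteed: dict pool name -> slots reserved for that pool (assumed input).
    Slots left after the guarantees are divided evenly; any remainder goes
    to the first pools in name order.
    """
    used = sum(guaranteed.values())
    if used > total_slots:
        raise ValueError("guarantees exceed cluster capacity")
    pools = sorted(guaranteed)
    share, extra = divmod(total_slots - used, len(pools))
    return {
        pool: guaranteed[pool] + share + (1 if i < extra else 0)
        for i, pool in enumerate(pools)
    }
```

By contrast, FIFO would hand all free slots to the oldest job, and a capacity scheduler would cap each queue at its configured percentage of the cluster.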


36 Hadoop at work!

Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)


37 Running Hadoop

Multiple options:
•  On your local machine (standalone or pseudo distributed)
•  Local with a Virtual Machine
•  On the cloud (e.g. Amazon EC2)
•  In your own datacenter (e.g. Grid5000)


38 Word Count Example In Hadoop


39 Bibliography

HDFS Design: http://hadoop.apache.org/core/docs/current/hdfs_design.html

Hadoop API: http://hadoop.apache.org/core/docs/current/api/