Understanding Hadoop Clusters and the Network
Part 1. Introduction and Overview
BRAD HEDLUND .com
Brad Hedlund
http://bradhedlund.com
http://www.linkedin.com/in/bradhedlund
@bradhedlund
Hadoop Server Roles
Clients
masters: Name Node, Job Tracker, Secondary Name Node
slaves: Data Node & Task Tracker (six shown)
Map Reduce: Distributed Data Analytics
HDFS: Distributed Data Storage
Hadoop Cluster
Diagram: Racks 1 through N, each with a rack switch and a set of Data Node + Task Tracker (DN + TT) servers. Rack 1 also holds the Name Node, Rack 2 the Job Tracker, Rack 3 the Secondary NN, and Rack 4 a Client. The rack switches connect to two upstream switches, which connect to the World.
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample Scenario:
How many times did our customers type the word “Fraud” into emails sent to customer service?
File.txt: Huge file containing all emails sent to customer service
Writing files to HDFS
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicate block
• Cycle repeats for next block
Client to Name Node: I want to write Blocks A, B, C of File.txt
Name Node to Client: OK. Write to Data Nodes 1, 5, 6
Diagram: the Client holds File.txt as Blk A, Blk B and Blk C; the Name Node and Data Nodes 1, 5, 6 and N are shown.
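The four write steps above can be sketched as a loop over blocks. This is a minimal illustration, not HDFS client code: `split_into_blocks`, `plan_write` and the `pick_targets` callback are hypothetical names, the Name Node is stood in for by a plain function, and the block size in the usage line is an arbitrary number chosen for the example.

```python
def split_into_blocks(file_size, block_size):
    """Return the length of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def plan_write(block_names, pick_targets):
    """Ask the (stand-in) Name Node for target Data Nodes, one block at a time."""
    plan = {}
    for name in block_names:
        targets = pick_targets(name)       # client consults Name Node
        first, *rest = targets             # client writes to `first` only;
        plan[name] = {"write_to": first,   # the Data Nodes replicate to `rest`
                      "replicated_to": rest}
    return plan


# Hypothetical Name Node answer: the node numbers from the slide, for every block.
plan = plan_write(["A", "B", "C"], lambda name: [1, 5, 6])
print(plan["A"])                   # {'write_to': 1, 'replicated_to': [5, 6]}
print(split_into_blocks(150, 64))  # [64, 64, 22]
```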
Hadoop Rack Awareness – Why?
• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth, lower latency
Name Node metadata:
File.txt =
Blk A: DN1, DN5, DN6
Blk B: DN7, DN1, DN2
Blk C: DN5, DN8, DN9
Rack aware:
Rack 1: Data Node 1, Data Node 2, Data Node 3
Rack 5: Data Node 5, Data Node 6, Data Node 7
Diagram: Rack 1, Rack 5 and Rack 9, each with a rack switch and four Data Nodes, with the replicas of blocks A, B and C spread across the racks.
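Rack awareness boils down to a lookup from Data Node to rack. A minimal sketch, using the rack assignments listed on the slide; the `RACKS` table, `rack_of`, `survives_rack_failure` and the `"unknown"` fallback label are illustrative assumptions, not Hadoop's actual configuration or API.

```python
# Rack membership as listed under "Rack aware" on the slide.
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3"],
    "Rack 5": ["DN5", "DN6", "DN7"],
}
NODE_TO_RACK = {node: rack for rack, nodes in RACKS.items() for node in nodes}


def rack_of(node):
    """Return the rack a Data Node lives in ("unknown" is a made-up fallback)."""
    return NODE_TO_RACK.get(node, "unknown")


def survives_rack_failure(replicas):
    """True if the replicas sit in more than one rack."""
    return len({rack_of(node) for node in replicas}) > 1


print(survives_rack_failure(["DN1", "DN5", "DN6"]))  # True
print(survives_rack_failure(["DN1", "DN2", "DN3"]))  # False
```

The first call is Blk A from the metadata above: losing either rack still leaves a copy. The second is a hypothetical placement that keeps all three copies in Rack 1.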
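Preparing HDFS writes
• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for M/R
Client to Name Node: I want to write File.txt Block A
Name Node to Client: OK. Write to Data Nodes 1, 5, 6
Client to Data Node 1: Ready Data Nodes 5, 6
Data Node 1 to Data Node 5: Ready Data Node 6
Data Node 5 to Data Node 6: Ready?
Each Data Node answers back along the chain: Ready!
Rack aware:
Rack 1: Data Node 1
Rack 5: Data Node 5, Data Node 6
Diagram: the Client holds File.txt as Blk A, Blk B and Blk C; Data Node 1 sits in Rack 1, Data Nodes 5 and 6 in Rack 5, with the rack switches joined by an upstream switch.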
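The placement rule on this slide (two nodes in the same rack, one node in a different rack) can be sketched as below. This is a simplified illustration under stated assumptions: `place_replicas` is a hypothetical helper, the choice is purely random, and the real Name Node weighs more than this sketch does.

```python
import random


def place_replicas(racks, rng=random):
    """Pick three Data Nodes: one in one rack, two together in another rack.

    `racks` maps a rack name to its list of Data Nodes. The sketch assumes at
    least one rack has two or more nodes and at least one other rack is
    non-empty; it raises IndexError otherwise.
    """
    pair_rack = rng.choice([r for r, nodes in racks.items() if len(nodes) >= 2])
    single_rack = rng.choice([r for r, nodes in racks.items()
                              if r != pair_rack and nodes])
    single = rng.choice(racks[single_rack])
    pair = rng.sample(racks[pair_rack], 2)
    return [single] + pair


# Rack layout from the slide.
racks = {"Rack 1": [1], "Rack 5": [5, 6]}
print(sorted(place_replicas(racks)))  # [1, 5, 6]
```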
Pipelined Write
• Data Nodes 1 & 2 pass data along as it's received
• TCP 50010
Rack aware:
Rack 1: Data Node 1
Rack 5: Data Node 5, Data Node 6
Diagram: the Client streams Blk A of File.txt to Data Node 1 in Rack 1, which passes it on to Data Nodes 5 and 6 in Rack 5; each Data Node ends up with a copy of A.
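A rough way to picture "pass data along as it's received": the block travels as a series of smaller pieces, and each Data Node forwards a piece to the next node instead of waiting for the whole block. The sketch below only records who sends what to whom; `pipelined_write` and the notion of named `packets` are assumptions made for illustration, and no real networking is involved.

```python
def pipelined_write(packets, pipeline):
    """Push each packet through the pipeline, node by node.

    Returns what every node has stored and the ordered list of
    (sender, receiver, packet) hops.
    """
    stored = {node: [] for node in pipeline}
    hops = []
    for packet in packets:
        sender = "Client"
        for node in pipeline:
            hops.append((sender, node, packet))  # one transfer on the network
            stored[node].append(packet)          # node keeps its own copy
            sender = node                        # and forwards to the next node
    return stored, hops


stored, hops = pipelined_write(["p0", "p1"], ["DN1", "DN5", "DN6"])
print(hops[:3])
# [('Client', 'DN1', 'p0'), ('DN1', 'DN5', 'p0'), ('DN5', 'DN6', 'p0')]
print(stored["DN6"])  # ['p0', 'p1']
```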
Pipelined Write
Name Node metadata:
File.txt
Blk A: DN1, DN2, DN3
Rack 1: Data Node 1
Rack 5: Data Node 2, Data Node 3
Diagram: Data Nodes 1, 2 and 3 each hold a copy of Blk A; "Block received" and "Success" messages are sent once the write completes.
Multi-block Replication Pipeline
File.txt: Blk A, Blk B, Blk C
1TB File = 3TB storage, 3TB network traffic
Diagram: the Client writes Blk A, Blk B and Blk C through separate replication pipelines, leaving three copies of each block on Data Nodes in Rack 1, Rack 4 and Rack 5.
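The 1TB to 3TB figures follow directly from keeping three copies of every block. A small worked calculation, assuming the client is not itself a Data Node and using a decimal terabyte for round numbers (`write_cost` is a hypothetical helper):

```python
REPLICATION = 3   # three copies of every block, as on the slides
TB = 10**12       # decimal terabyte, an assumption for round numbers


def write_cost(file_bytes, replication=REPLICATION):
    """Return (bytes stored, bytes sent over the network) for one file write."""
    storage = file_bytes * replication
    # Each copy crosses the network once: Client -> DN, DN -> DN, DN -> DN.
    network = file_bytes * replication
    return storage, network


storage, network = write_cost(1 * TB)
print(storage // TB, network // TB)  # 3 3
```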
Client writes span the HDFS Cluster
Factors:
• Block size
• File Size
More blocks = Wider spread
Diagram: a Client writing File.txt to Data Nodes spread across Rack 1, Rack 2, Rack 3, Rack 4 and Rack N, each rack behind its own switch.
Data Node writes span itself, and other racks
Results.txt: Blk A, Blk B, Blk C
Diagram: a Data Node in Rack 1 writes Results.txt. It keeps a copy of blocks A, B and C itself, and two further copies of each block land on Data Nodes in the other racks (Rack 2, Rack 3, Rack 4, Rack N).
Name Node
• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down
Data Node to Name Node: I’m alive! I have blocks: A, C
Name Node to Data Node: Awesome! Thanks.
Name Node metadata:
DN1: A,C
DN2: A,C
DN3: A,C
File system: File.txt = A,C
Diagram: the Name Node above Data Nodes 1, 2, 3 and N, with copies of blocks A and C on the Data Nodes.
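The heartbeat and block-report bookkeeping described above can be sketched as follows. The two constants come from the slide (a heartbeat every 3 seconds, every 10th one carrying a block report); `NameNodeSketch` and `run_data_node` are hypothetical names, and time is simulated rather than measured.

```python
HEARTBEAT_SECONDS = 3     # from the slide
BLOCK_REPORT_EVERY = 10   # every 10th heartbeat is a block report


class NameNodeSketch:
    """Toy Name Node that builds its block map purely from block reports."""

    def __init__(self):
        self.last_seen = {}   # node -> simulated time of last heartbeat
        self.blocks_on = {}   # node -> set of block names it reported

    def heartbeat(self, node, now, block_report=None):
        self.last_seen[node] = now
        if block_report is not None:
            self.blocks_on[node] = set(block_report)

    def locations(self, block):
        return sorted(n for n, blocks in self.blocks_on.items() if block in blocks)


def run_data_node(name_node, node, blocks, beats):
    """Send `beats` heartbeats; attach the block list to every 10th one."""
    for count in range(1, beats + 1):
        now = count * HEARTBEAT_SECONDS
        report = blocks if count % BLOCK_REPORT_EVERY == 0 else None
        name_node.heartbeat(node, now, report)


nn = NameNodeSketch()
for dn in ("DN1", "DN2", "DN3"):
    run_data_node(nn, dn, ["A", "C"], beats=10)
print(nn.locations("A"))  # ['DN1', 'DN2', 'DN3']
```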
Re-replicating missing replicas
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate
Name Node: Uh Oh! Missing replicas
Name Node to a Data Node: Copy blocks A,C to Node 8
Name Node metadata:
DN1: A,C
DN2: B,C
DN3: A, B
Rack Awareness:
Rack1: DN1, DN2
Rack5: DN3
Rack9: DN8
Diagram: the Name Node above Data Nodes 1, 2, 3 and 8; copies of blocks A and C are made on Data Node 8.
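A minimal sketch of the first, second and fourth steps above: spot nodes whose heartbeats have stopped, find the blocks they held, and pick a surviving holder to copy each block to a spare node. The rack-awareness step is left out for brevity. `plan_re_replication`, the 30-second `timeout` and the sample timestamps are assumptions for illustration; the slides do not state how many heartbeats may be missed.

```python
def plan_re_replication(blocks_on, last_seen, now, timeout, spare_node):
    """Return {block: (source_node, spare_node)} for blocks held by dead nodes."""
    dead = {node for node, seen in last_seen.items() if now - seen > timeout}
    plan = {}
    for node in sorted(dead):
        for block in sorted(blocks_on.get(node, ())):
            survivors = sorted(n for n, held in blocks_on.items()
                               if block in held and n not in dead)
            if survivors:                       # someone still has a copy
                plan[block] = (survivors[0], spare_node)
    return plan


# Hypothetical state: DN3 stopped sending heartbeats at t=40.
blocks_on = {"DN1": {"A", "C"}, "DN2": {"A", "C"}, "DN3": {"A", "C"}, "DN8": set()}
last_seen = {"DN1": 100, "DN2": 100, "DN3": 40, "DN8": 100}
print(plan_re_replication(blocks_on, last_seen, now=101, timeout=30, spare_node="DN8"))
# {'A': ('DN1', 'DN8'), 'C': ('DN1', 'DN8')}
```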
Secondary Name Node
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
Secondary Name Node to Name Node: It's been an hour, give me your metadata
Name Node metadata:
File system: File.txt = A,C
Client reading files from HDFS
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially
Client to Name Node: Tell me the block locations of Results.txt
Name Node to Client: Blk A = 1,5,6; Blk B = 8,1,2; Blk C = 5,8,9
Name Node metadata:
Results.txt =
Blk A: DN1, DN5, DN6
Blk B: DN7, DN1, DN2
Blk C: DN5, DN8, DN9
Diagram: Rack 1 (Data Nodes 1, 2), Rack 5 (Data Nodes 5, 6) and Rack 9 (Data Nodes 8, 9), each with a switch and two further unnumbered Data Nodes, holding the replicas of blocks A, B and C.
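The client-side read logic in the bullets above is small enough to sketch directly. `read_plan` is a hypothetical helper, and the block locations are the Name Node's reply as shown on the slide.

```python
def read_plan(block_locations):
    """Pick the first listed Data Node for each block, keeping block order."""
    return [(block, nodes[0]) for block, nodes in block_locations]


# Name Node's answer for Results.txt, as shown on the slide.
locations = [("A", [1, 5, 6]), ("B", [8, 1, 2]), ("C", [5, 8, 9])]

# Blocks are then read one after another, in this order.
print(read_plan(locations))  # [('A', 1), ('B', 8), ('C', 5)]
```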
Data Node reading files from HDFS
• Name Node provides rack local Nodes first
• Leverage in-rack bandwidth, single hop
Data Node to Name Node: Tell me the locations of Block A of File.txt
Name Node to Data Node: Block A = 1,5,6
Name Node metadata:
File.txt =
Blk A: DN1, DN5, DN6
Blk B: DN7, DN1, DN2
Blk C: DN5, DN8, DN9
Rack aware:
Rack 1: Data Node 1, Data Node 2, Data Node 3
Rack 5: Data Node 5
Diagram: Rack 1 (Data Nodes 1, 2, 3), Rack 5 (Data Nodes 5, 6) and Rack 9 (Data Nodes 8, 9), each behind a switch, holding the replicas of blocks A, B and C.
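"Rack local Nodes first" amounts to sorting the replica list by whether each holder shares a rack with the reader. A minimal sketch: `order_for_reader` is a hypothetical helper, the reader is assumed to be Data Node 3 in Rack 1, and Data Node 6 is assumed to sit in Rack 5 as on the earlier slides.

```python
def order_for_reader(replicas, rack_of, reader):
    """Sort replica holders so nodes in the reader's rack come first."""
    reader_rack = rack_of[reader]
    # False sorts before True, and sorted() is stable, so the original
    # order is preserved within the in-rack and off-rack groups.
    return sorted(replicas, key=lambda node: rack_of[node] != reader_rack)


rack_of = {"DN1": "Rack 1", "DN2": "Rack 1", "DN3": "Rack 1",
           "DN5": "Rack 5", "DN6": "Rack 5"}

print(order_for_reader(["DN5", "DN1", "DN6"], rack_of, reader="DN3"))
# ['DN1', 'DN5', 'DN6']
```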
Data Processing: Map
• Map: “Run this computation on your local data”
• Job Tracker delivers Java code to Nodes with local data
Client to Job Tracker: How many times does “Fraud” appear in File.txt?
Job Tracker to Map Task: Count “Fraud” in Block C
Map Task on Data Node 1 (Block A): Fraud = 3
Map Task on Data Node 5 (Block B): Fraud = 0
Map Task on Data Node 9 (Block C): Fraud = 11
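The Map step for this scenario is just a per-block word count. The sketch below is in Python for brevity (the slide notes that the Job Tracker ships Java code), `map_count` is a hypothetical name, and the block contents are made-up text chosen so the counts match the slide.

```python
def map_count(block_text, word="Fraud"):
    """Map task: count whitespace-separated occurrences of `word` in one block.

    Punctuation and case variations are not handled in this sketch.
    """
    return block_text.split().count(word)


# Made-up block contents standing in for the real email data.
blocks = {
    "A": "Fraud alert Fraud claim Fraud",
    "B": "thanks for the quick reply",
    "C": "Fraud " * 11,
}

map_output = {name: map_count(text) for name, text in blocks.items()}
print(map_output)  # {'A': 3, 'B': 0, 'C': 11}
```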
What if data isn’t local?
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness
Client to Job Tracker: How many times does “Fraud” appear in File.txt?
Job Tracker to Map Task: Count “Fraud” in Block C
Data Node 1 (Rack 1, holds Block A): “no Map tasks left”
Data Node 2 (Rack 1): “I need block A”
Map Task on Data Node 5 (Rack 5, Block B): Fraud = 0
Map Task on Data Node 9 (Rack 9, Block C): Fraud = 11
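The fallback described here (prefer a node that holds the block, otherwise a node in the same rack) can be sketched as a two-pass search. `choose_node` and the locality labels it returns are hypothetical, not Job Tracker internals.

```python
def choose_node(block_holders, free_nodes, rack_of):
    """Prefer a free node holding the block, else a free node in the same rack."""
    for node in block_holders:
        if node in free_nodes:
            return node, "data-local"
    holder_racks = {rack_of[node] for node in block_holders}
    for node in free_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"
    # Last resort: any free node, or nothing if the cluster is fully busy.
    return (free_nodes[0], "off-rack") if free_nodes else (None, "none")


rack_of = {"DN1": "Rack 1", "DN2": "Rack 1", "DN5": "Rack 5", "DN9": "Rack 9"}

# Data Node 1 holds Block A but has no Map task slots left; Data Node 2 is free.
print(choose_node(["DN1"], ["DN2"], rack_of))  # ('DN2', 'rack-local')
```

The chosen node then has to fetch the block over the network, which is the "I need block A" request in the diagram.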
Data Processing: Reduce
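• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS
Map Task on Data Node 1 (Block A): Fraud = 3
Map Task on Data Node 5 (Block B): Fraud = 0
Map Task on Data Node 9 (Block C): Fraud = 11
Reduce Task on Data Node 3: Sum “Fraud”
Results.txt: Fraud = 14
Diagram: the Job Tracker and Client above the three Map Tasks, which send their output to the Reduce Task on Data Node 3; HDFS is shown holding blocks X, Y and Z.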
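The Reduce step for this scenario simply adds up the three Map results. A minimal sketch, with `reduce_sum` as a hypothetical name and the output line formatted like the slide's Results.txt:

```python
def reduce_sum(map_outputs):
    """Reduce task: combine the per-block counts into one total."""
    return sum(map_outputs)


# Map results for blocks A, B and C, as shown on the slide.
total = reduce_sum([3, 0, 11])
print(f"Fraud = {total}")  # Fraud = 14
```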
Unbalanced Cluster
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
*I’m bored!
**I was assigned a Map Task but don’t have the block. Guess I need to get it.
Diagram: Rack 1 and Rack 2 alongside two New Racks of Data Nodes attached through NEW switches; File.txt is shown on the existing racks.
Cluster Balancing
• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default speed limit 1 MB/s
Diagram: Rack 1, Rack 2 and the two New Racks (behind NEW switches), with File.txt blocks being spread onto the new Data Nodes.
brad@cloudera-1:~$ hadoop balancer
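To get a feel for what a 1 MB/s limit means, here is a rough worked estimate. It assumes a binary megabyte and a single stream running at exactly the limit; the slides do not say how the limit is applied across nodes, so treat the result as an order-of-magnitude figure. `hours_to_move` is a hypothetical helper.

```python
MB = 1024 * 1024   # assumption: binary megabyte
GB = 1024 * MB


def hours_to_move(bytes_to_move, limit_bytes_per_sec=1 * MB):
    """Hours needed to move the data at a fixed rate limit."""
    return bytes_to_move / limit_bytes_per_sec / 3600


print(round(hours_to_move(100 * GB), 1))  # 28.4
```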