DESCRIPTION
Given at IoT Asia 2014
Hadoop 101
A really quick overview of the concepts…
A few Terabytes of Data...
Text processing--a few hours?
But what if you have more data?
Network Storage--Petabytes!
What if you need compute power for complex algorithms?
8 core? 16 Cores? 64 cores? 512 GB RAM?
A network of commodity computers
Run jobs on PART of the data on each computer, then AGGREGATE the intermediate results from each computer.
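The split-then-aggregate idea can be sketched in a few lines of Python. This is only an illustration on a single machine: the function names (`split_into_chunks`, `count_words`, `aggregate`) and the sample lines are made up for this sketch, and each "computer" is simulated by an ordinary function call.

```python
from collections import Counter

def split_into_chunks(lines, num_workers):
    """Deal lines out round-robin so each (hypothetical) computer gets PART of the data."""
    return [lines[i::num_workers] for i in range(num_workers)]

def count_words(chunk):
    """The job one computer runs on its own part of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def aggregate(partials):
    """Merge the intermediate results from every computer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Sample data, invented for the sketch.
lines = ["hard drives fail", "computers fail", "racks fail"]
partials = [count_words(chunk) for chunk in split_into_chunks(lines, 2)]
totals = aggregate(partials)
```

Because each partial count depends only on its own chunk, the chunks could be processed on different machines in any order before the final merge.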
Let’s add a computer to manage the process of job delegation, merging the results...
and keeping track of the results...
We also need something to keep track of what files are where, so we know where the data is that needs
to be computed...
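A minimal sketch of such a "what is where" index, assuming files are cut into blocks and each block lives on several computers. The file path, block IDs, and node names below are hypothetical, and a real system would keep far more metadata than two dictionaries.

```python
# Hypothetical catalog: which blocks make up a file, and which nodes hold each block.
file_blocks = {"logs/2014-04.txt": ["blk_1", "blk_2"]}
block_locations = {
    "blk_1": {"node-a", "node-c"},
    "blk_2": {"node-b", "node-c"},
}

def locate(path):
    """Return (block, holders) for every block of a file, in file order."""
    return [(blk, sorted(block_locations[blk])) for blk in file_blocks[path]]

placement = locate("logs/2014-04.txt")
```

With this lookup, whoever delegates the jobs can send each piece of work to a computer that already holds the block it needs.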
When you have a lot of computers, and even more hard drives,
one thing I can guarantee...
Computers will eventually fail.
Hard drives will eventually fail.
Even whole racks will fail.
If a computer fails and you only have one copy of your data...
You will be very, very unhappy.
So let's store multiple copies of the data. Hard drives are CHEAP!
If one hard drive fails... we are still OK
If one computer fails... we are still OK
Even if a whole rack fails... we are still OK
Once we find a failure, let's have the system re-copy the lost copies onto healthy machines.
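The replication and repair steps can be sketched as follows. This is a deliberately simplified placement rule (prefer a rack that has no copy yet), not Hadoop's actual policy; the node names, rack names, block ID, and the copy count of 3 are assumptions made for the illustration.

```python
def pick_nodes(live_nodes, have, want=3):
    """Add holders until a block has `want` copies.

    live_nodes maps node name -> rack name; `have` is the set of live
    nodes that already hold a copy. Racks without a copy are preferred.
    """
    holders = set(have)
    while len(holders) < want:
        used_racks = {live_nodes[n] for n in holders}
        candidates = [n for n in sorted(live_nodes) if n not in holders]
        if not candidates:
            break  # not enough machines left to reach `want` copies
        # False sorts before True, so nodes on unused racks come first.
        candidates.sort(key=lambda n: live_nodes[n] in used_racks)
        holders.add(candidates[0])
    return holders

def handle_failure(block_map, live_nodes, failed, want=3):
    """Drop failed nodes, then re-copy every block that lost a replica."""
    for node in failed:
        live_nodes.pop(node, None)
    for block in list(block_map):
        survivors = {n for n in block_map[block] if n in live_nodes}
        block_map[block] = pick_nodes(live_nodes, survivors, want)

# Hypothetical cluster: five nodes spread over three racks.
nodes = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB", "n5": "rackC"}
block_map = {"blk_1": pick_nodes(nodes, set())}
initial = set(block_map["blk_1"])

# The whole of rackA fails; the block is topped back up to three copies.
handle_failure(block_map, nodes, failed=["n1", "n2"])
```

Spreading the first copies across racks is what lets the block survive the loss of a whole rack before the repair runs.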
Send the compute job to all nodes.
And let it run on its part of the data…
One is stuck…
We have three copies—we can redistribute the compute
And take the one that finishes fastest
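A rough sketch of this "start a second attempt and keep the fastest" idea, using threads on one machine to stand in for nodes. The `patience` threshold, the simulated delays, and the function names are assumptions for the illustration; a real scheduler decides when a task counts as stuck in its own way.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task(attempt, delay, data):
    """Simulated task: `delay` stands in for how slow the node is."""
    time.sleep(delay)
    return attempt, sum(data)

def run_with_backup(primary_delay, backup_delay, data, patience=0.05):
    """Run on one node; if it looks stuck, rerun on another copy and take the first result."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = [pool.submit(run_task, "primary", primary_delay, data)]
        done, _ = wait(attempts, timeout=patience)
        if not done:
            # The primary is taking too long: start a second attempt elsewhere.
            attempts.append(pool.submit(run_task, "backup", backup_delay, data))
            done, _ = wait(attempts, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

winner, total = run_with_backup(primary_delay=0.3, backup_delay=0.01, data=[1, 2, 3])
```

This only works because both attempts read identical copies of the data, so whichever finishes first gives the same answer.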
Merge sorted sets based on some key…
A-E F-J K-O P-T U-Z
…and write partial results
PART-01 PART-02 PART-03 PART-04 PART-05
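The merge-and-partition step can be sketched like this, using the five letter ranges from the slide. The sample records are invented, keys are assumed to start with a letter A-Z, and the "files" are plain in-memory lists rather than anything written to disk.

```python
import heapq

# Key ranges from the slide, one per output part.
RANGES = [("A", "E"), ("F", "J"), ("K", "O"), ("P", "T"), ("U", "Z")]

def partition_for(key):
    """Map a key to its part name by its first letter."""
    first = key[0].upper()
    for i, (lo, hi) in enumerate(RANGES, start=1):
        if lo <= first <= hi:
            return f"PART-{i:02d}"
    raise ValueError(f"key {key!r} does not start with a letter A-Z")

def merge_and_partition(sorted_runs):
    """Merge already-sorted (key, value) runs and split them into parts."""
    parts = {f"PART-{i:02d}": [] for i in range(1, len(RANGES) + 1)}
    for key, value in heapq.merge(*sorted_runs):
        parts[partition_for(key)].append((key, value))
    return parts

# Two sorted runs, as if produced by two different computers.
run_a = [("apple", 1), ("kiwi", 2), ("zebra", 1)]
run_b = [("banana", 3), ("fig", 1), ("pear", 2)]
parts = merge_and_partition([run_a, run_b])
```

Since every run is already sorted, the merge never needs to hold more than one record per run at a time, and each part comes out sorted as well.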
Guess what? We’ve just invented Hadoop!
So let’s talk about the pieces of Hadoop.
Data nodes store and manage the data on a single “slave” computer
Task trackers manage the compute
Job tracker manages task trackers, ships code to compute nodes
Name node manages distribution and replication on the data nodes
MapReduce: Job Tracker, Task Tracker
HDFS (Hadoop Distributed File System): Name Node, Data Node
Visual Example
Map
Shuffle
Reduce
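The three phases can be sketched in plain Python, without Hadoop itself. The example finds the maximum reading per sensor; the record format (`sensor_id,reading`), the sample values, and the function names are all assumptions made for this illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one raw record into (key, value) pairs."""
    sensor_id, reading = record.split(",")
    yield sensor_id, float(reading)

def shuffle(pairs):
    """Shuffle: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)

# Hypothetical sensor records.
records = ["s1,21.5", "s2,19.0", "s1,23.0", "s2,18.5"]
mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle(mapped)
results = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Map and reduce each look at one record or one key at a time, which is what allows the framework to spread them across many computers; only the shuffle has to move data between them.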
Putting It All Together