Hadoop 101 v2

Hadoop 101: A really quick overview of the concepts…

A few Terabytes of Data...

Text processing--a few hours?

But what if you have more data?

Network Storage--Petabytes!

What if you need compute power for complex algorithms?

8 cores? 16 cores? 64 cores? 512 GB RAM?

A network of commodity computers

Run jobs on PART of the data on each computer, then AGGREGATE the intermediate results from each computer.
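To make the split-then-aggregate idea concrete, here is a minimal single-process sketch. The job (counting lines that contain "ERROR"), the round-robin split, and all function names are assumptions chosen for illustration; each list of lines stands in for the part of the data held by one computer.

```python
def count_errors(lines):
    """The job one computer runs on its own part: count lines containing 'ERROR'."""
    return sum(1 for line in lines if "ERROR" in line)


def split_into_parts(lines, num_computers):
    """Deal the lines out round-robin so each computer gets one part."""
    return [lines[i::num_computers] for i in range(num_computers)]


def run_job(lines, num_computers):
    parts = split_into_parts(lines, num_computers)
    # One intermediate result per computer (computed in a loop here,
    # in parallel on a real cluster).
    partial_results = [count_errors(part) for part in parts]
    # Aggregate the intermediate results into the final answer.
    return sum(partial_results)


log_lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
total = run_job(log_lines, 3)
```

The answer does not depend on how many computers share the work; only the time to get it does.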

Let’s add a computer to manage the process of job delegation, merging the results...

and keeping track of the results...

We also need something to keep track of what files are where, so we know where the data that needs to be computed is...
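One way to picture this bookkeeping is a lookup table from each piece of a file to the computers that hold it. The sketch below is an assumption for illustration only: the class name `FileCatalog`, its methods, and the file and computer names are all hypothetical.

```python
from collections import defaultdict


class FileCatalog:
    """Hypothetical catalog: remembers which computers hold each block of a file."""

    def __init__(self):
        # (filename, block index) -> set of computer names
        self._locations = defaultdict(set)

    def record(self, filename, block, computer):
        self._locations[(filename, block)].add(computer)

    def where(self, filename, block):
        # .get avoids creating an empty entry for an unknown block
        return sorted(self._locations.get((filename, block), set()))

    def blocks_of(self, filename):
        return sorted(b for (f, b) in self._locations if f == filename)


catalog = FileCatalog()
catalog.record("logs.txt", 0, "computer-1")
catalog.record("logs.txt", 1, "computer-2")
catalog.record("logs.txt", 1, "computer-3")
```

With this table, a job that needs block 1 of `logs.txt` can be sent to `computer-2` or `computer-3`, where the data already sits.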

When you have a lot of computers, and even more hard drives,

one thing I can guarantee...

Computers will eventually fail.

Hard drives will eventually fail.




Even whole racks will fail.

If a computer fails and you only have one copy of your data...

You will be very, very unhappy.

So let's store multiple copies of the data. Hard drives are CHEAP!




If one hard drive fails... we are still OK

If one computer fails... we are still OK

Even if a whole rack fails... we are still OK

Once we find a failure, let's have the system make new copies to replace the ones that were lost.
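A rough sketch of that repair step, assuming a target of three copies per block (the number this deck uses a few slides later). The function name, the block and computer names, and the "first available computer" choice are assumptions; rack placement is deliberately ignored here.

```python
TARGET_COPIES = 3  # assumption: three copies of every block


def handle_failure(block_locations, failed, live_computers):
    """Forget the failed computer, then recopy blocks that are short of copies.

    block_locations maps a block name to the set of computers holding a copy.
    """
    for block, holders in block_locations.items():
        holders.discard(failed)
        if not holders:
            continue  # every copy is gone: nothing left to copy from
        candidates = [c for c in live_computers
                      if c != failed and c not in holders]
        while len(holders) < TARGET_COPIES and candidates:
            holders.add(candidates.pop(0))  # simulate copying the block over
    return block_locations


locations = {
    "block-A": {"c1", "c2", "c3"},
    "block-B": {"c2", "c3", "c4"},
}
handle_failure(locations, "c2", ["c1", "c2", "c3", "c4", "c5"])
```

After the call, every block is back to three copies and none of them lives on the failed computer.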

Send the compute job to all nodes.

And let it run on its part of the data…




One is stuck…

We have three copies—we can redistribute the compute

And take the one that finishes fastest
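The "run it twice, keep the fastest" idea can be sketched with threads standing in for computers. Everything here is an assumption for illustration: the function names, the computer names, and the simulated delays. Unlike a real cluster, this sketch does not cancel the slow attempt; it simply ignores its result.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task(computer, delay_seconds, numbers):
    """Simulate one computer working on its copy of the data."""
    time.sleep(delay_seconds)  # stand-in for a slow or stuck machine
    return computer, sum(numbers)


def run_with_backup(numbers, attempts):
    """attempts is a list of (computer, delay); return the first result to finish."""
    with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(run_task, computer, delay, numbers)
                   for computer, delay in attempts]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()


# computer-1 is "stuck" (slow); computer-2 holds another copy of the same data.
winner, total = run_with_backup(
    [1, 2, 3], [("computer-1", 0.5), ("computer-2", 0.01)]
)
```

Because both attempts read identical copies of the data, it does not matter which one wins; the result is the same.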

Merge sorted sets based on some key…

A-E F-J K-O P-T U-Z

…and write partial results

PART-01 PART-02 PART-03 PART-04 PART-05
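The key ranges and part names on these two slides suggest a simple routing rule: the first letter of the key decides which part a record lands in. The sketch below keeps the parts in memory instead of writing files, and the function names and sample records are assumptions.

```python
# Key ranges from the slide: A-E, F-J, K-O, P-T, U-Z
RANGES = ["AE", "FJ", "KO", "PT", "UZ"]


def part_name_for(key):
    """Pick the part whose letter range covers the key's first letter."""
    first = key[0].upper()
    for index, (low, high) in enumerate(RANGES, start=1):
        if low <= first <= high:
            return f"PART-{index:02d}"
    raise ValueError(f"key {key!r} does not start with a letter A-Z")


def write_parts(records):
    """Group (key, value) records by key range and keep each part sorted."""
    parts = {f"PART-{i:02d}": [] for i in range(1, len(RANGES) + 1)}
    for key, value in records:
        parts[part_name_for(key)].append((key, value))
    for rows in parts.values():
        rows.sort()
    return parts


parts = write_parts([("hadoop", 1), ("apple", 2), ("zebra", 3), ("cat", 4)])
```

Since the ranges do not overlap and each part is sorted, reading PART-01 through PART-05 in order yields all records in key order.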

Guess what? We've just invented Hadoop!

So let’s talk about the pieces of Hadoop.

Data nodes store and manage the data on a single “slave” computer

Data Node

Task trackers manage the compute

Data Node

Task Tracker

Job tracker manages task trackers, ships code to compute nodes

Data Node

Task Tracker

Job Tracker

Name node manages distribution and replication on the data nodes

Data Node


Name Node

Map Reduce


HDFS (Hadoop Distributed File System)

Data Node

Name Node

HDFS

Visual Example

Map

Shuffle

Reduce
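The three phases can be sketched in a few lines of plain Python, using word counting as the example job. This is a single-process illustration under assumed function names, not Hadoop's Java API: map emits (word, 1) pairs, shuffle groups the pairs by word, and reduce sums each group.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values)
                for key, values in sorted(grouped.items()))


counts = word_count(["the quick fox", "the lazy dog", "The end"])
```

On a cluster, the map calls run next to the data on many computers and the reduce calls run once per key range; the logic of each phase stays the same.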

Putting It All Together

Data & Analytics
