Hadoop 101 A really quick overview of the concepts…

Hadoop 101 v2

DESCRIPTION

Given at IoT Asia 2014

Page 1: Hadoop 101 v2

Hadoop 101: A really quick overview of the concepts…

Page 2: Hadoop 101 v2

A few Terabytes of Data...

Page 3: Hadoop 101 v2
Page 4: Hadoop 101 v2
Page 5: Hadoop 101 v2

Text processing--a few hours?

Page 6: Hadoop 101 v2

But what if you have more data?

Page 7: Hadoop 101 v2

Network Storage--Petabytes!

Page 9: Hadoop 101 v2

What if you need compute power for complex algorithms?

Page 10: Hadoop 101 v2

8 cores? 16 cores? 64 cores? 512 GB RAM?

Page 11: Hadoop 101 v2

A network of commodity computers

Page 12: Hadoop 101 v2

Run jobs on PART of the data on each computer, then AGGREGATE the intermediate results from each computer.
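
A minimal sketch of that split-then-aggregate idea, in plain Python rather than Hadoop itself. The function names and the sample log lines are made up for illustration, and the "computers" are simulated by slices of one list.

```python
def split_into_parts(records, num_workers):
    """Deal the records out so each (simulated) computer gets one slice."""
    return [records[i::num_workers] for i in range(num_workers)]

def run_job_on_part(part):
    """The per-computer job: count ERROR lines in this slice only."""
    return sum(1 for line in part if "ERROR" in line)

def aggregate(partial_results):
    """Combine the intermediate results from every computer."""
    return sum(partial_results)

log_lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
parts = split_into_parts(log_lines, num_workers=2)
partials = [run_job_on_part(part) for part in parts]
total_errors = aggregate(partials)
print(partials, total_errors)
```

No single worker ever sees the whole data set; only the small partial counts travel to the aggregation step.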

Page 13: Hadoop 101 v2

Let’s add a computer to manage job delegation, merging the results, and keeping track of them...

Page 14: Hadoop 101 v2

We also need something to keep track of which files are where, so we know where the data that needs to be processed lives...
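
One way to picture this bookkeeping is a lookup table from files to blocks to the computers holding them. The sketch below is a toy, not an HDFS API: the `catalog` layout, the node names, and both helper functions are assumptions for illustration.

```python
# Hypothetical catalog: file name -> ordered blocks, each with the nodes holding a copy.
catalog = {
    "logs.txt": [
        {"block": "blk_1", "nodes": ["node-a", "node-b"]},
        {"block": "blk_2", "nodes": ["node-c"]},
    ],
}

def locate(catalog, filename):
    """Return {block id: nodes that hold it} for one file."""
    return {entry["block"]: entry["nodes"] for entry in catalog[filename]}

def pick_node(nodes, busy):
    """Prefer a node that already holds the block and is not busy."""
    for node in nodes:
        if node not in busy:
            return node
    return None  # no free local node; the caller decides what to do next

locations = locate(catalog, "logs.txt")
assignment = {blk: pick_node(nodes, busy={"node-a"}) for blk, nodes in locations.items()}
print(assignment)
```

With `node-a` busy, the work on `blk_1` goes to `node-b`, which also holds that block, so the job runs where the data already is.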

Page 15: Hadoop 101 v2

When you have a lot of computers, and even more hard drives, one thing I can guarantee...

Page 16: Hadoop 101 v2

Computers will eventually fail.

Page 18: Hadoop 101 v2

Hard drives will eventually fail.

Page 22: Hadoop 101 v2

Even whole racks will fail.

Page 23: Hadoop 101 v2

If a computer fails and you only have one copy of your data...

Page 24: Hadoop 101 v2

You will be very, very unhappy.

Page 25: Hadoop 101 v2

So let’s store multiple copies of the data. Hard drives are CHEAP!

Page 29: Hadoop 101 v2

If one hard drive fails... we are still OK

Page 30: Hadoop 101 v2

If one computer fails... we are still OK

Page 31: Hadoop 101 v2

Even if a whole rack fails... we are still OK

Page 32: Hadoop 101 v2

Once we detect a failure, let’s have the system re-create the missing copies.
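
A rough sketch of that repair step, assuming a target of three copies per block. The data layout and the `rereplicate` helper are hypothetical, and the sketch ignores rack placement, which a real system would also have to consider.

```python
def rereplicate(replicas, live_nodes, target=3):
    """Drop dead holders from each block, then add live nodes until `target` copies exist."""
    repaired = {}
    for block, holders in replicas.items():
        alive = [node for node in holders if node in live_nodes]
        if not alive:
            raise RuntimeError(f"all copies of {block} are lost")
        # Candidate destinations: live nodes that do not hold this block yet.
        spares = [node for node in sorted(live_nodes) if node not in alive]
        needed = max(0, target - len(alive))
        repaired[block] = alive + spares[:needed]
    return repaired

replicas = {"blk_1": ["n1", "n2", "n3"], "blk_2": ["n2", "n3", "n4"]}
live_nodes = {"n1", "n3", "n4", "n5"}  # n2 has failed
repaired = rereplicate(replicas, live_nodes)
print(repaired)
```

Each block is back to three copies, and none of them lives on the failed node.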

Page 33: Hadoop 101 v2

Send the compute job to all nodes.

Page 34: Hadoop 101 v2

And let it run on its part of the data…

Page 38: Hadoop 101 v2

One is stuck…

Page 39: Hadoop 101 v2

We have three copies, so we can redistribute the compute to another computer that holds the same data

Page 40: Hadoop 101 v2

And take the result from whichever one finishes first
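
The "run it twice, keep the first result" idea can be sketched with Python threads. This is an illustration only: `run_task` and `first_to_finish` are made-up names, the sleep stands in for real work, and for simplicity both attempts start at once rather than launching the second only after the first looks stuck.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(node, delay):
    """Stand-in for the task on one copy of the data; `delay` simulates a slow node."""
    time.sleep(delay)
    return node

def first_to_finish(attempts):
    """Run the same task on several nodes and return the first result to arrive."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = [pool.submit(run_task, node, delay) for node, delay in attempts]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower attempt
    return next(iter(done)).result()

winner = first_to_finish([("node-a", 0.5), ("node-b", 0.01)])
print(winner)
```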

Page 41: Hadoop 101 v2

Merge sorted sets based on some key…

A-E F-J K-O P-T U-Z

Page 42: Hadoop 101 v2

…and write partial results

PART-01 PART-02 PART-03 PART-04 PART-05
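
A small sketch of splitting results by key range, using the five letter ranges and PART names from the slides. The helper names are hypothetical, and the "files" here are just in-memory lists rather than anything written to disk.

```python
RANGES = [("A", "E"), ("F", "J"), ("K", "O"), ("P", "T"), ("U", "Z")]

def partition_for(key):
    """Return the 1-based number of the range that covers the key's first letter."""
    first = key[0].upper()
    for index, (low, high) in enumerate(RANGES, start=1):
        if low <= first <= high:
            return index
    raise ValueError(f"no range covers key {key!r}")

def build_parts(records):
    """Group (key, value) records by range and sort each group by key."""
    parts = {f"PART-{i:02d}": [] for i in range(1, len(RANGES) + 1)}
    for key, value in records:
        parts[f"PART-{partition_for(key):02d}"].append((key, value))
    return {name: sorted(rows) for name, rows in parts.items()}

parts = build_parts([("hadoop", 2), ("apple", 1), ("zebra", 5), ("data", 3)])
print(parts["PART-01"])
```

Because the ranges do not overlap and each part is sorted, reading PART-01 through PART-05 in order yields all keys in sorted order.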

Page 43: Hadoop 101 v2

Guess what? We’ve just invented Hadoop!

Page 44: Hadoop 101 v2

So let’s talk about the pieces of Hadoop.

Page 45: Hadoop 101 v2

Data nodes store and manage the data on a single “slave” computer

Data Node

Page 46: Hadoop 101 v2

Task trackers manage the compute

Data Node

Task Tracker

Page 47: Hadoop 101 v2

The job tracker manages the task trackers and ships code to the compute nodes

Data Node

Task Tracker

Job Tracker

Page 48: Hadoop 101 v2

Name node manages distribution and replication on the data nodes

Data Node

Task Tracker

Job Tracker

Name Node

Page 49: Hadoop 101 v2

MapReduce

Task Tracker

Job Tracker

Page 50: Hadoop 101 v2

HDFS (Hadoop Distributed File System)

Data Node

Name Node

Page 51: Hadoop 101 v2

HDFS

Page 52: Hadoop 101 v2

Visual Example

Page 53: Hadoop 101 v2

Map

Page 54: Hadoop 101 v2

Shuffle

Page 55: Hadoop 101 v2

Reduce
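
The three phases can be sketched as a word count in plain Python. This runs in a single process and uses no Hadoop API; the function names and the two sample lines are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: bring all values for the same key together, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)
```

In a cluster, the map calls would run on the computers holding each part of the input, and the shuffle would move each key's values to the computer that reduces that key.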

Page 56: Hadoop 101 v2

Putting It All Together