
Sky Agile Horizons

Hadoop at Sky

Overview

• What is Hadoop?
- Reliable, Scalable, Distributed

• Where did it come from?
- Community + Yahoo!

• Where is it now?
- Apache Software Foundation

• Why is it called “Hadoop”?

Who is using it?

To name just a few…

Is it “production” ready?

This screengrab is from one of the Hadoop clusters at Facebook (May 2010)

So, what does it give you?

Just two things...

• Distributed Filesystem (HDFS)
- Name Node
- Data Node(s)

• Distributed Processing Infrastructure
- Job Tracker
- Task Tracker(s)

HDFS - Overview

• Blocks
- 64MB chunks (configurable)

• WORM (Write once, read many)
- NO EDITS
- NO APPENDS

• Replication
- 3 copies
- direct
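
To make the block and replication figures concrete, here is a minimal sketch in plain Python (it is arithmetic only, not Hadoop code). The helper name `hdfs_footprint` and the 1 GB example file are assumptions for illustration; the 64MB block size and 3 copies come from the slide.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide (configurable)
REPLICATION = 3      # number of copies quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block_count, raw_storage_mb) for a file of the given size.

    Hypothetical helper for illustration; it only does arithmetic.
    """
    if file_size_mb <= 0:
        return 0, 0
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A final, partly filled block only occupies the bytes it holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(f"{blocks} blocks, {blocks * REPLICATION} block replicas, {raw_mb} MB raw storage")
```

For the 1 GB example this gives 16 blocks, 48 block replicas and 3072 MB of raw storage across the cluster.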

HDFS - Read

[Diagram: a Client, the Name Node and several Data Nodes holding blocks 1 to 4, three copies of each. Control / Monitoring links are marked separately.]

Client:
1. Get Metadata
2. Fetch Blocks
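
The two read steps (get metadata, then fetch blocks) can be sketched with a toy in-memory model. This is not the Hadoop client API: the classes `DataNode` and `NameNode`, the method `get_block_locations` and the file `/demo.txt` are all made up for illustration, and the replica choice is deliberately naive.

```python
class DataNode:
    """Toy data node: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Toy name node: holds only metadata, never file contents."""

    def __init__(self):
        self.files = {}  # path -> list of (block_id, [DataNode, ...])

    def get_block_locations(self, path):
        return self.files[path]


def read_file(name_node, path):
    # Step 1: get metadata (block list and replica locations) from the name node.
    locations = name_node.get_block_locations(path)
    data = b""
    # Step 2: fetch each block directly from a data node that holds it.
    for block_id, replicas in locations:
        data += replicas[0].blocks[block_id]  # naive choice: first replica
    return data


# Build a tiny cluster by hand: 2 blocks, each stored on 3 of the 4 nodes.
nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode()
contents = {1: b"hello ", 2: b"world"}
placement = {1: nodes[0:3], 2: nodes[1:4]}
for block_id, replicas in placement.items():
    for node in replicas:
        node.blocks[block_id] = contents[block_id]
nn.files["/demo.txt"] = [(b, placement[b]) for b in (1, 2)]

print(read_file(nn, "/demo.txt"))
```

The point of the split is that file contents never pass through the name node in this model; it only answers the question of where the blocks live.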

HDFS - Write

[Diagram: a Client, the Name Node and several Data Nodes receiving blocks 1 to 3. Control / Monitoring links are marked separately.]

Client:
1. Create Metadata
2. Put Blocks
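
The two write steps (create metadata, then put blocks) can be sketched the same way. This is a toy model under stated assumptions, not Hadoop's write path: `ToyNameNode`, `write_file`, the round-robin placement and the 4-byte block size are all invented for illustration, and the sketch simply copies each block to every target rather than modelling how Hadoop transfers replicas.

```python
import itertools

REPLICATION = 3  # copies per block, as on the HDFS overview slide


class ToyNameNode:
    def __init__(self, node_names):
        self.metadata = {}                            # path -> [(block_id, [node names])]
        self._ids = itertools.count(1)
        self._targets = itertools.cycle(node_names)   # round-robin placement (an assumption)
        self._node_count = len(node_names)

    def create(self, path):
        # Write once, read many: an existing path cannot be rewritten.
        if path in self.metadata:
            raise FileExistsError(path)
        self.metadata[path] = []

    def allocate_block(self, path):
        copies = min(REPLICATION, self._node_count)
        targets = [next(self._targets) for _ in range(copies)]
        block_id = next(self._ids)
        self.metadata[path].append((block_id, targets))
        return block_id, targets


def write_file(name_node, data_nodes, path, data, block_size):
    # Step 1: create the metadata entry on the name node.
    name_node.create(path)
    # Step 2: put each block onto the data nodes chosen for it.
    for start in range(0, len(data), block_size):
        block_id, targets = name_node.allocate_block(path)
        for name in targets:
            data_nodes[name][block_id] = data[start:start + block_size]


data_nodes = {name: {} for name in ("dn1", "dn2", "dn3", "dn4")}
nn = ToyNameNode(list(data_nodes))
write_file(nn, data_nodes, "/demo.txt", b"abcdefghij", block_size=4)
print(nn.metadata["/demo.txt"])
```

Calling `write_file` a second time with the same path raises `FileExistsError`, which is the toy equivalent of the "no edits, no appends" rule.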

Distributed Processing

• Slots
- X mapper slots, Y reducer slots (per node)

• Jobs
- Queued
- Prioritised

• Tasks
- Data-aware
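
Slots, prioritised jobs and data-aware tasks fit together roughly as in the sketch below. It is a simplified stand-in, not the Job Tracker's actual scheduling algorithm: the function `schedule`, the node names and the job names are hypothetical, only mapper slots are modelled, and a lower priority number is assumed to run first.

```python
import heapq


def schedule(jobs, block_locations, map_slots):
    """Assign map tasks to nodes, preferring nodes that already hold the data.

    jobs: list of (priority, job_name, [block_id, ...]); lower number runs first.
    block_locations: block_id -> set of node names holding a replica.
    map_slots: node name -> number of free mapper slots.
    Returns a list of (job_name, block_id, node, is_local).
    """
    free = dict(map_slots)
    queue = list(jobs)
    heapq.heapify(queue)              # jobs are queued and prioritised
    assignments = []
    while queue:
        _, job_name, blocks = heapq.heappop(queue)
        for block_id in blocks:
            local = [n for n in sorted(block_locations[block_id]) if free.get(n, 0) > 0]
            anywhere = [n for n in sorted(free) if free[n] > 0]
            if not anywhere:
                return assignments    # no slots left; remaining tasks would wait
            # Data-aware: take a node holding the block if one has a free slot.
            node = local[0] if local else anywhere[0]
            free[node] -= 1
            assignments.append((job_name, block_id, node, bool(local)))
    return assignments


block_locations = {1: {"n1", "n2"}, 2: {"n2", "n3"}, 3: {"n1", "n3"}}
map_slots = {"n1": 1, "n2": 1, "n3": 1}
jobs = [(5, "report", [3]), (1, "urgent", [1, 2])]
for assignment in schedule(jobs, block_locations, map_slots):
    print(assignment)
```

In this example every task lands on a node that already stores its block, so no block has to cross the network before processing starts.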

Distributed Processing

[Diagram: a Client, the Job Tracker and five Task Trackers, each shown with four map (M) and two reduce (R) boxes. Control / Monitoring links are marked separately.]

Client:
1. Setup Job

Implementation

• Two modes of operation

[Diagram: Standalone groups the Name Node, Data Node, Job Tracker and Task Tracker together. Master / Slaves puts the Name Node and Job Tracker on the Master, with six Slaves each running a Data Node and a Task Tracker.]

Building upon the basics

Sub-projects

• Map/Reduce – divide & conquer

• Pig – SQL-like “Pig Latin”

• HBase – column-based database

• Hive – data-warehousing (SQL-like queries)

• Mahout – distributed algorithms

Map/Reduce

• Java-based
- Key,Value input, Key,Value output(s)

• Intended for low-level / bespoke work

[Diagram: from Start, data flows through a set of map (M) and reduce (R) tasks to End.]
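
The key/value in, key/value out contract is easiest to see in a word count. The sketch below simulates the map, shuffle and reduce phases in memory with plain Python; it is not the Java Map/Reduce API, and the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) input pair."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    for key in sorted(groups):
        yield from reducer(key, groups[key])


def word_mapper(offset, line):
    # Input pair: (position of the line, text of the line).
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    # Output pair: (word, total occurrences).
    yield word, sum(counts)


lines = [(0, "the quick brown fox"), (20, "the lazy dog"), (33, "the end")]
result = dict(reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer))
print(result)
```

Each mapper call sees one input pair in isolation and each reducer call sees one key with all of its values, which is what lets the real framework spread the work across many nodes.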

Hive

• SQL-like syntax, Map/Reduce under the hood

• Client-only software

[Diagram: a Query is turned into a series of map/reduce (M R) passes that produce the Results.]
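
As a rough illustration of "Map/Reduce under the hood", a query such as `SELECT channel, COUNT(*) AS views FROM viewing GROUP BY channel ORDER BY views DESC` can be pictured as two chained map/reduce passes: one to count, one to sort. This is a sketch of the idea only, not how Hive actually plans queries; the table, its columns, the sample rows and the helper `run_stage` are all hypothetical.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """One in-memory map/reduce pass: map, group by key, reduce in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical rows of a "viewing" table: (viewer_id, channel).
rows = [(1, "news"), (2, "sport"), (3, "news"), (4, "movies"), (5, "news"), (6, "sport")]

# Stage 1 (GROUP BY channel, COUNT(*)): key on channel, sum the ones.
counted = run_stage(
    rows,
    mapper=lambda row: [(row[1], 1)],
    reducer=lambda channel, ones: [(channel, sum(ones))],
)

# Stage 2 (ORDER BY views DESC): key on the negated count so key order is descending.
ordered = run_stage(
    counted,
    mapper=lambda rec: [(-rec[1], rec[0])],
    reducer=lambda neg, channels: [(c, -neg) for c in sorted(channels)],
)

print(ordered)
```

The output of the first pass is the input of the second, which mirrors the chain of M R boxes between Query and Results on the slide.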

Live Demo

Lastly, a word of warning...

• It’s not a magic bullet…

• If the tools you need don’t exist…

• Approach is everything…

• Hadoop is *just* the framework

Thank you!

Questions?

http://cotdp.com/hadoop.html
- Soft-copy of this presentation
- VM image available to download
- Example code is on GitHub
