Introduction to Hadoop

Introduction to hadoop V2

Page 1: Introduction to hadoop V2

Introduction to Hadoop

Page 2: Introduction to hadoop V2

• Tarjei Romtveit

• Co-founder of Monokkel AS

• Former CTO – Integrasco AS

• My story with Hadoop

www.monokkel.io

Page 3: Introduction to hadoop V2

• CEO of Monokkel AS

• Former COO of Integrasco AS

• Persistence, processing and presentation of data

Persistence – Processing – Presentation

Page 4: Introduction to hadoop V2

Bombshell

If you work with data today and do not start learning the Hadoop ecosystem, you may be unemployed soon

Page 5: Introduction to hadoop V2

Agenda

• Context – Big Data and how to handle it

• What is Hadoop?

• Demo

• Distributions and/or demo

• “Deep dive” into Hadoop - Architecture

– HDFS

– YARN

– MapReduce

• Languages and ecosystem

Page 6: Introduction to hadoop V2

What we will not cover

• Security

• Integrations with database X or system Y

• Running Hadoop in production

Page 7: Introduction to hadoop V2

Big Data

Page 8: Introduction to hadoop V2

Big Data – hype and hipsters

Page 9: Introduction to hadoop V2

Big Data

The originator

Page 10: Introduction to hadoop V2

Big Data – Let’s add some letters

• Volume

• Variety

• Velocity

• Variability

• Veracity / Data quality

and the step-brother

• Complexity

Relatively boring stuff

Page 11: Introduction to hadoop V2

Big Data – Example

This is a CEO

Page 12: Introduction to hadoop V2

The Nordic Hotel Tycoon

1600 Hotels in 5 countries

Page 13: Introduction to hadoop V2

I am a digital champion: The website

Page 14: Introduction to hadoop V2

I am a digital champion: The desk

Page 15: Introduction to hadoop V2

I am a digital champion: The external provider

Page 16: Introduction to hadoop V2

I am a digital champion: The IoT case

Page 17: Introduction to hadoop V2

I am a digital champion: Social

Page 18: Introduction to hadoop V2

Houston, we have a problem

• Sales are declining and my stock price is tumbling

Page 19: Introduction to hadoop V2

The CEO

No clue what is happening

Page 20: Introduction to hadoop V2

How can the CEO manage his problem?

• Get control over the data

• Implement analytical processes to aid sales

Page 21: Introduction to hadoop V2

BUT HE DOES NOT WANT TO PAY $10 000 000 000 000 FOR IT

Page 22: Introduction to hadoop V2

The data he needs to handle

• Volume – Gigabytes/Terabytes

• Variety – Click stream, Voice, emails, sensor data, social data, different languages, timestamp data, transactional data, third party data

• Variability – Various quality

• Velocity – MB per second

Page 23: Introduction to hadoop V2

The data he needs to handle

• Veracity / Data quality – Inconsistent data quality

• Complexity – Many legacy domain models

Page 24: Introduction to hadoop V2

How to handle?

Web

Emails

Sensors

Social

Storage

Processing

RDBMS

Search

Page 25: Introduction to hadoop V2

How to understand?

Web

Emails

Sensors

Social

Storage

Processing

RDBMS

Search

Page 26: Introduction to hadoop V2

So what does Hadoop solve?

Storage

Processing

Page 27: Introduction to hadoop V2

What is Hadoop?

Page 28: Introduction to hadoop V2

What is Hadoop? An operating system for data

Page 29: Introduction to hadoop V2

An OS needs software on top

Page 30: Introduction to hadoop V2

Distributions

Page 31: Introduction to hadoop V2

Distributions

• “Stable” compilation of the Hadoop ecosystem

• Operational tools

• Integration tools and frameworks

• Data governance and data management tools

• Security

Page 32: Introduction to hadoop V2

Distributions

Page 33: Introduction to hadoop V2

HADOOP An operating system for data

Layman’s terms

• Store huge files (unstructured) on many machines

• Query and modify data

• Can run sophisticated analytics on top

Page 34: Introduction to hadoop V2

How to start:

Alt 1

• https://hadoop.apache.org/

• Getting Started

• Download

• Unzip

• bin/hadoop <commandline arguments>

Alt 2

• http://hortonworks.com/products/hortonworks-sandbox/#install

• Install VMWare Player or VirtualBox

• Download image (6 GB)

• Install and run (give it lots of memory)

Page 35: Introduction to hadoop V2

DEMO

– Transform and modify data

– Machine learning with Spark

– Integrate with ElasticSearch

NEXT: ARCHITECTURE AND HOW IT WORKS

Page 36: Introduction to hadoop V2

DEMO

• Hortonworks Sandbox

• Hortonworks Ambari

• Hortonworks Hue

Page 37: Introduction to hadoop V2

Hadoop - Architecture

HDFS

YARN

MapReduce

Page 38: Introduction to hadoop V2

2.X.X

• Hadoop Distributed File System (HDFS)

• YARN (Yet Another Resource Negotiator)

• MapReduce

Page 39: Introduction to hadoop V2

HDFS

D1

D2

DX

Name Node

Failover Name Node

Client

Page 40: Introduction to hadoop V2

HDFS

Name node

Block index

B: 1, D1
B: 2, D2
B: 3, D3
B: 4, D1
B: 5, D2
B: 6, D3

Data Nodes

D1

D2

D3

Page 42: Introduction to hadoop V2

HDFS Write

Client

/path/to/document1, R:2, B:{1,2}

Name Node

I need to write a document!

Page 43: Introduction to hadoop V2

Client

/path/to/document1, R:2, B:{1,2}

Name Node

I need to write

/path/to/document1, R:2, B:{3,4}

/path/to/document1, R:2, B:{5,6}

HDFS Write

Page 44: Introduction to hadoop V2

Client

Name Node

You can write to: D1, D2, D3

D1

D2

D3

Data Nodes

HDFS Write

Page 45: Introduction to hadoop V2

Client

Name Node

D1

D2

D3

B:{D2:5,D3:6}

B:{D3:3,D1:4}

B:{D1:1,D2:2}

Split and write

HDFS Write

Page 46: Introduction to hadoop V2

HDFS Write

Client

Name Node

D1

D2

D3

Replicate B:1 to D2:2

Success
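
The write walkthrough above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the HDFS client API: the `NameNode` class, the `write_file` helper, the round-robin placement and the tiny block size are all hypothetical. In real HDFS the client streams each block to the first data node, which forwards it along a replication pipeline; here the replicas are simply written in a loop.

```python
class NameNode:
    """Holds metadata only: which blocks form a file and where replicas live."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.files = {}        # path -> ordered list of block ids
        self.block_index = {}  # block id -> data nodes holding a replica
        self._next_block = 1

    def allocate(self, path, n_blocks):
        """Reserve block ids and pick target nodes round-robin (an assumption)."""
        ids = []
        for _ in range(n_blocks):
            block_id = self._next_block
            self._next_block += 1
            start = (block_id - 1) % len(self.data_nodes)
            targets = [
                self.data_nodes[(start + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            self.block_index[block_id] = targets
            ids.append(block_id)
        self.files[path] = ids
        return [(block_id, self.block_index[block_id]) for block_id in ids]


def write_file(name_node, storage, path, data, block_size):
    """Split data into blocks, ask the name node where to put them, store replicas."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = name_node.allocate(path, len(chunks))
    for chunk, (block_id, targets) in zip(chunks, plan):
        for node in targets:  # every target ends up with a full copy of the block
            storage[node][block_id] = chunk
    return plan


name_node = NameNode(["D1", "D2", "D3"], replication=2)
storage = {"D1": {}, "D2": {}, "D3": {}}
plan = write_file(name_node, storage, "/path/to/document1", b"x" * 25, block_size=10)
```

With replication 2 (the `R:2` on the slides), every block id appears on exactly two data nodes, and the name node never stores the block contents themselves.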

Page 47: Introduction to hadoop V2

HDFS Read

Client

Name Node

D1

D2

D3

I want to read

/path/to/document1

B:{D3:3,D3:6}

B:{D2:2,D2:5}
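
A read can be sketched the same way: the client asks the name node for the block list, then fetches each block from a data node that holds a replica. The metadata layout, the block contents and the idea of marking a node as down with `None` are assumptions made for this illustration; they are not the HDFS API.

```python
# Metadata as a name node might report it (hypothetical layout).
FILE_BLOCKS = {"/path/to/document1": [1, 2, 3]}
BLOCK_LOCATIONS = {1: ["D1", "D2"], 2: ["D2", "D3"], 3: ["D3", "D1"]}

# Block contents per data node; D2 is unavailable in this example.
DATA_NODES = {
    "D1": {1: b"Deer Bear River\n", 3: b"Deer Car Bear\n"},
    "D2": None,
    "D3": {2: b"Car Car River\n", 3: b"Deer Car Bear\n"},
}


def read_block(block_id):
    """Return (node, data) from the first live node that holds the block."""
    for node in BLOCK_LOCATIONS[block_id]:
        blocks = DATA_NODES.get(node)
        if blocks is not None and block_id in blocks:
            return node, blocks[block_id]
    raise IOError(f"no live replica for block {block_id}")


def read_file(path):
    """Fetch all blocks of a file in order and join them."""
    parts, served_by = [], []
    for block_id in FILE_BLOCKS[path]:
        node, data = read_block(block_id)
        served_by.append(node)
        parts.append(data)
    return b"".join(parts), served_by


content, served_by = read_file("/path/to/document1")
```

Because block 2 has a second replica on D3, the read still succeeds although D2 is down. This is the practical benefit of the replication step in the write path.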

Page 48: Introduction to hadoop V2

• HDFS blocks are immutable: you cannot change them!

• Deletes and updates are written as new blocks

• The name node takes care of overwriting deleted blocks

• Small files consume a lot of name node memory

HDFS Delete/Update
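
The small-files point can be made concrete with rough arithmetic. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of name node memory per file or block object and a 128 MB block size; both numbers are assumptions for illustration, not measurements.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not an exact figure
BLOCK_SIZE = 128 * 1024 ** 2  # assumed block size of 128 MB


def name_node_bytes(file_sizes):
    """Estimate name node memory: one object per file plus one per block."""
    total = 0
    for size in file_sizes:
        blocks = math.ceil(size / BLOCK_SIZE)
        total += (1 + blocks) * BYTES_PER_OBJECT
    return total


one_gb = 1024 ** 3
one_big_file = name_node_bytes([one_gb])
many_small_files = name_node_bytes([one_gb // 10_000] * 10_000)
```

Under these assumptions the same gigabyte of data costs the name node about 1.35 KB as one file (1 file object plus 8 block objects) but about 3 MB as 10 000 small files, a difference of more than a factor of 2000.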

Page 49: Introduction to hadoop V2

HDFS Scalability

D1

D2

DX

Name Node

Failover Name Node

Page 50: Introduction to hadoop V2

YARN

HOW DOES HADOOP PROCESS THE DATA STORED IN HDFS?

Page 51: Introduction to hadoop V2

YARN

Client

Resource Manager

Scheduler

Applications manager

I want to process file “document1” with my-app.jar

Page 52: Introduction to hadoop V2

YARN

Resource Manager

Scheduler

Applications manager

You can process on D1!

Page 53: Introduction to hadoop V2

YARN

D1 D2

Node Manager Node Manager

Resource Manager

Scheduler

Applications manager

Start my-app.jar

Application Master

Page 54: Introduction to hadoop V2

YARN

D1 D2

Node Manager Node Manager

Resource Manager

Scheduler

Applications manager

Application Master

AM to RM: “document1” is located on D1 and D2 and I need X GB of RAM

Page 55: Introduction to hadoop V2

YARN

D1 D2

Node Manager Node Manager

Application Master Container

Resource Manager

Scheduler

Applications manager

my-app.jar is running here!

Start my-app.jar

Page 56: Introduction to hadoop V2

YARN + HDFS

D1

D2

D3

Name Node

Client

Client

Client

• YARN will try to make sure data is processed where it is stored

• … this is known as data locality
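
The locality preference can be sketched as a placement rule: first try the nodes that already hold the data, and fall back to any node with enough free memory. The function name, the memory figures and the "most free memory wins" tie-breaker are hypothetical; real YARN schedulers also weigh rack locality, queues and capacity limits.

```python
def pick_node(block_nodes, free_memory_gb, needed_gb):
    """Return (node, is_local): prefer a node that stores the data."""
    local = [n for n in block_nodes if free_memory_gb.get(n, 0) >= needed_gb]
    if local:
        return max(local, key=lambda n: free_memory_gb[n]), True
    remote = [n for n, free in free_memory_gb.items() if free >= needed_gb]
    if remote:
        return max(remote, key=lambda n: free_memory_gb[n]), False
    return None, False  # nothing fits right now; the request has to wait


# "document1" is stored on D1 and D2 (as in the AM-to-RM request above).
free = {"D1": 2, "D2": 8, "D3": 16}
local_choice = pick_node(["D1", "D2"], free, needed_gb=4)
remote_choice = pick_node(["D1", "D2"], free, needed_gb=12)
```

In the first request the container lands on D2, next to the data. In the second, no data-holding node has enough memory, so the work goes to D3 and the blocks must travel over the network.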

Page 57: Introduction to hadoop V2

YARN + HDFS

• Blocks are immutable. This enables high write speeds

• Data is schema free! You can store any data you want

• Data locality is what differentiates HDFS from other data storage

• You can read massive amounts of data, limited only by disk read speeds

Page 58: Introduction to hadoop V2

MapReduce and others

OK… BUT HOW DO I PROCESS?

Page 59: Introduction to hadoop V2

YARN

Tez MapReduce <Name here>

Libraries: Mahout, MLlib, GraphX, Oryx

Languages: Hive, Pig, R, Spark SQL, Stinger

Page 60: Introduction to hadoop V2

YARN

Tez <Name here>

Languages: Hive, Pig, R, Spark SQL, Stinger

Libraries: Mahout, Crunch, MLlib, GraphX, Oryx

MapReduce

Page 61: Introduction to hadoop V2

MapReduce

Page 62: Introduction to hadoop V2

Document

Deer Bear River

Car Car River

Deer Car Bear

Document stored in HDFS

Page 63: Introduction to hadoop V2

Splitting

Deer Bear River

Car Car River

Deer Car Bear

Split 1: Deer Bear River

Split 2: Car Car River

Split 3: Deer Car Bear

Page 64: Introduction to hadoop V2

Mapping

Deer Bear River → Deer 1, Bear 1, River 1

Car Car River → Car 1, Car 1, River 1

Deer Car Bear → Deer 1, Car 1, Bear 1

Page 65: Introduction to hadoop V2

Shuffling

Deer 1, Bear 1, River 1

Car 1, Car 1, River 1

Deer 1, Car 1, Bear 1

Deer 1, Deer 1

Bear 1, Bear 1

Car 1, Car 1, Car 1

River 1, River 1

Page 66: Introduction to hadoop V2

Reduce

Deer 1, Deer 1 → Deer 2

Bear 1, Bear 1 → Bear 2

Car 1, Car 1, Car 1 → Car 3

River 1, River 1 → River 2

Deer 2, Bear 2, Car 3, River 2

HDFS
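
The four phases can be traced end to end in a few lines of plain Python. This is a single-process sketch of the data flow only; it does not use the Hadoop API, and the function names are chosen here for illustration. The counts are computed directly from the three input lines.

```python
from collections import defaultdict

DOCUMENT = "Deer Bear River\nCar Car River\nDeer Car Bear"


def split(document):
    """Splitting: one input split per line."""
    return document.splitlines()


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in line.split()]


def shuffle(mapped):
    """Shuffling: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(groups)


def reduce_phase(word, counts):
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


def word_count(document):
    mapped = [map_phase(line) for line in split(document)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


result = word_count(DOCUMENT)
```

On a cluster, each call to `map_phase` and `reduce_phase` could run in a different container; only the shuffle step needs to move data between machines.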

Page 67: Introduction to hadoop V2

API: Mapper interface

Page 68: Introduction to hadoop V2

API: Reduce interface

Page 69: Introduction to hadoop V2

API: Main

Page 70: Introduction to hadoop V2

How to run

$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out

Page 71: Introduction to hadoop V2

MapReduce

• Mappers and reducers are distributed in YARN containers

• Chaining MapReduce jobs makes them slow

• Easy to scale but difficult to code

• … use the data DSLs instead

Page 72: Introduction to hadoop V2

Languages

Page 73: Introduction to hadoop V2

YARN

Tez MapReduce <Name here>

Languages: Hive, Pig, R, Spark SQL, Stinger

Libraries: Mahout, Crunch, MLlib, GraphX, Oryx

Page 74: Introduction to hadoop V2

“Languages”

Page 75: Introduction to hadoop V2

PIG

• Procedural language

• Executes on YARN

• Great for

– Structuring

– Moving

– Transforming

Page 76: Introduction to hadoop V2

Hive/Drill/Spark SQL

• Declarative / SQL-like languages

• Great for

– Column data / Database dumps

– Aggregations

– Connect BI tools and Dashboards

• Data Warehouse for Hadoop++
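
To show what "declarative" means for an aggregation, the sketch below runs a `GROUP BY` query. Python's built-in `sqlite3` module stands in for a SQL-on-Hadoop engine here purely so the example is runnable; it is not Hive, Drill or Spark SQL, and the `bookings` table and its rows are made-up illustration data.

```python
import sqlite3


def revenue_per_city(rows):
    """Describe the result in SQL; the engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (city TEXT, amount REAL)")
    conn.executemany("INSERT INTO bookings VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT city, SUM(amount) FROM bookings GROUP BY city ORDER BY city"
    ).fetchall()
    conn.close()
    return result


# Made-up hotel bookings, in the spirit of the hotel example earlier in the deck.
totals = revenue_per_city([("Oslo", 120.0), ("Oslo", 80.0), ("Bergen", 50.0)])
```

The query states what is wanted (a sum per city) without spelling out the map, shuffle and reduce steps; on Hadoop, a SQL-like language leaves that planning to the engine.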

Page 77: Introduction to hadoop V2

Spark

• General processing engine (runs on YARN or standalone)

• Great for

– Anything that MapReduce can do

– Analytics, Machine Learning

• In-memory processing, with APIs in Java, Scala and Python

Page 78: Introduction to hadoop V2

Summary

• Hadoop is designed to handle/process massive amounts of data through HDFS and/or YARN

• The data does not need to be structured before it is stored in HDFS

• Hadoop is an ecosystem and has languages/frameworks for data extraction, data management, data analysis and data integration

• It is most convenient to begin with Hadoop by testing distributions, e.g. Hortonworks, Cloudera, MapR etc.

• Learn MapReduce and learn to understand languages and a few integration tools

Page 79: Introduction to hadoop V2

Is it a fad?