Introduction to Hadoop
• Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io
• CEO of Monokkel AS
• Former COO of Integrasco AS
• Persistence, Processing and Presentation of data
Persistence – Processing – Presentation
Bombshell
If you work with data today and don't start learning the Hadoop ecosystem, you may be unemployed soon
Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop architecture – HDFS, YARN, MapReduce
• Languages and ecosystem
What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production
Big Data
Big Data – hype and hipsters
Big Data
The originator
Big Data – Let’s add some letters
• Volume
• Variety
• Velocity
• Variability
• Veracity / Data quality
and the step-brother
• Complexity
Relatively boring stuff
Big Data – Example
This is a CEO
The Nordic Hotel Tycoon
1600 hotels in 5 countries
I am a digital champion: The website
I am a digital champion: The desk
I am a digital champion: The external provider
I am a digital champion: The IoT case
I am a digital champion: Social
Houston, we have a problem
• Sales are declining and my stock price is tumbling
The CEO
No clue what is happening
How can the CEO manage his problem?
• Get control over the data
• Implement analytical processes to aid sales
BUT HE DOES NOT WANT TO PAY $10,000,000,000,000 FOR IT
The data he needs to handle
• Volume – Gigabytes/Terabyte
• Variety – Click stream, Voice, emails, sensor data, social data, different languages, timestamp data, transactional data, third party data
• Variability – Various quality
• Velocity – MB per second
The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models
How to handle?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
How to understand?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
So what does Hadoop solve?
Storage
Processing
What is Hadoop?
What is Hadoop? An operating system for data
An OS needs software on top
Distributions
Distributions
• ”Stable” compilation of the Hadoop ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security
Distributions
HADOOP An operating system for data
Layman’s terms
• Store huge files (unstructured) on many machines
• Query and modify data
• Can run sophisticated analytics on top
How to start: Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <command-line arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMware Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)
DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch
NEXT: ARCHITECTURE AND HOW IT WORKS
DEMO
• Hortonworks Sandbox
• Hortonworks Ambari
• Hortonworks Hue
Hadoop – Architecture (2.X.X)
HDFS
YARN
MapReduce
• Hadoop Distributed File System (HDFS)
• YARN (Yet Another Resource Negotiator)
• MapReduce
HDFS
D1
D2
DX
Name Node
Failover Name Node
Client
HDFS
Block index (name node):
B:1 → D1, B:2 → D2, B:3 → D3, B:4 → D1, B:5 → D2, B:6 → D3
Data Nodes: D1, D2, D3
HDFS Write
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write a document!
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write /path/to/document1, R:2, B:{3,4}, B:{5,6}
HDFS Write
Client
Name Node
You can write to: D1, D2, D3
D1
D2
D3
Data Nodes
HDFS Write
Client
Name Node D1
D2
D3
B:{D2:5,D3:6}
B:{D3:3,D1:4}
B:{D1:1,D2:2}
Split and write
HDFS Write
HDFS Write
Client
Name Node
D1
D2
D3
Replicate B:1 to D2:2
Success
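The write flow in the slides above can be simulated in a few lines of Python. This is an illustrative sketch only: the tiny block size, the round-robin placement policy and the `write_file` helper are all invented for the example (real HDFS uses 128 MB blocks and a default replication factor of 3).

```python
import itertools

# Toy parameters -- real HDFS defaults are 128 MB blocks, replication 3.
BLOCK_SIZE = 4           # bytes, tiny so the example splits into two blocks
REPLICATION = 2          # "R:2" as in the slides
DATA_NODES = ["D1", "D2", "D3"]

_block_ids = itertools.count(1)

def write_file(path, data, block_index):
    """Split `data` into blocks and record replica locations in
    `block_index`, mimicking the name node's metadata."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = next(_block_ids)
        # Round-robin placement over REPLICATION distinct data nodes
        replicas = [DATA_NODES[(block_id + r) % len(DATA_NODES)]
                    for r in range(REPLICATION)]
        block_index[block_id] = {"path": path,
                                 "replicas": replicas,
                                 "data": data[offset:offset + BLOCK_SIZE]}
        blocks.append(block_id)
    return blocks

block_index = {}   # name node metadata: block id -> location and content
blocks = write_file("/path/to/document1", b"DeerBear", block_index)
print(blocks)                                  # → [1, 2]
print(block_index[blocks[0]]["replicas"])      # → ['D2', 'D3']
```

The point to notice is that the client only talks to the name node for metadata; the block data itself goes straight to the data nodes.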
HDFS Read
Client
Name Node
D1
D2
D3
I want to read
/path/to/document1
B:{D3:3,D3:6}
B:{D2:2,D2:5}
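The read path can be sketched the same way: the client asks the name node which blocks make up the file and where their replicas live, then fetches each block from a data node. The `block_index` layout is invented for illustration.

```python
# Name node metadata as it might look after a write:
# block id -> file path, replica locations, and (for this toy) the bytes.
block_index = {
    1: {"path": "/path/to/document1", "replicas": ["D1", "D2"], "data": b"Deer"},
    2: {"path": "/path/to/document1", "replicas": ["D2", "D3"], "data": b"Bear"},
}

def read_file(path, block_index):
    # Step 1 (name node): which blocks make up the file, in order?
    block_ids = sorted(bid for bid, meta in block_index.items()
                       if meta["path"] == path)
    # Step 2 (client): fetch each block from the first reachable replica
    return b"".join(block_index[bid]["data"] for bid in block_ids)

print(read_file("/path/to/document1", block_index))   # → b'DeerBear'
```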
HDFS Delete/Update
• HDFS blocks are immutable – you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted blocks
• Small files consume a lot of name node memory
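A minimal sketch of that append-only behaviour, with invented class and method names: an "update" never touches the old block, it writes a new one and re-points the file at it.

```python
class NameNode:
    """Toy name node: files map to block ids; blocks are write-once."""

    def __init__(self):
        self.blocks = {}    # block id -> immutable content
        self.files = {}     # path -> list of block ids
        self._next_id = 0

    def _new_block(self, content):
        self._next_id += 1
        self.blocks[self._next_id] = content   # written once, never changed
        return self._next_id

    def write(self, path, content):
        # Create and update look the same: fresh blocks every time.
        self.files[path] = [self._new_block(content)]

    def delete(self, path):
        # Only the metadata goes away; old blocks linger until cleanup.
        del self.files[path]

nn = NameNode()
nn.write("/doc", "v1")
old = nn.files["/doc"][0]
nn.write("/doc", "v2")              # "update" = new block
print(nn.files["/doc"][0] != old)   # → True: the file points at a new block
print(nn.blocks[old])               # → v1: the old block is untouched
```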
HDFS Scalability
D1
D2
DX
Name Node
Failover Name Node
YARN
HOW DOES HADOOP PROCESS THE DATA STORED IN HDFS?
YARN
Client
Resource Manager
Scheduler
Applications manager
I want to process file “document1” with my-app.jar
YARN
Resource Manager
Scheduler
Applications manager
You can process on D1!
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Start my-app.jar
Application Master
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Application Master
AM to RM: “document1” is located on D1 and D2 and I need X GB RAM
YARN
D1 D2
Node Manager Node Manager
Application Master Container
Resource Manager
Scheduler
Applications manager
my-app.jar is running here!
Start my-app.jar
YARN + HDFS
D1
D2
D3
Name Node
Client
Client
Client
• YARN will try to make sure data is processed where it is stored
• … data locality
YARN + HDFS
• Blocks are immutable. This enables high write speeds
• Data is schema-free! You can store any data you want
• Data locality is what differentiates HDFS from other data storage
• You can read massive amounts of data, limited only by disk read speeds
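The data-locality bullet above can be illustrated with a toy scheduler (node names and the `schedule` function are invented): given the replica locations of a block and the set of nodes with free capacity, prefer a node that already holds the data.

```python
def schedule(block_replicas, free_nodes):
    """Pick a node for the task: data-local if possible, remote otherwise."""
    for node in block_replicas:
        if node in free_nodes:
            return node                 # local: no network transfer needed
    return sorted(free_nodes)[0]        # fallback: read the block remotely

replicas = ["D1", "D2"]                  # where the block's copies live
print(schedule(replicas, {"D2", "D3"}))  # → D2 (data-local)
print(schedule(replicas, {"D3"}))        # → D3 (remote read)
```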
MapReduce and others
OK… BUT HOW DO I PROCESS ?
YARN
Tez MapReduce <Name here>
Libraries: Mahout, MLlib, GraphX, Oryx
Languages: Hive, Pig, R, Spark SQL, Stinger
YARN
Tez <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
MapReduce
MapReduce
Document stored in HDFS:
Deer Bear River
Car Car River
Deer Car Bear
Splitting
Deer Bear River
Car Car River
Deer Car Bear
Mapping
Deer Bear River → Deer 1, Bear 1, River 1
Car Car River → Car 1, Car 1, River 1
Deer Car Bear → Deer 1, Car 1, Bear 1
Shuffling
Bear 1, Bear 1
Car 1, Car 1, Car 1
Deer 1, Deer 1
River 1, River 1

Reduce
Bear 1, Bear 1 → Bear 2
Car 1, Car 1, Car 1 → Car 3
Deer 1, Deer 1 → Deer 2
River 1, River 1 → River 2

Result written to HDFS:
Bear 2, Car 3, Deer 2, River 2
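The Split → Map → Shuffle → Reduce flow above can be reproduced in plain Python (no Hadoop involved) to make the data movement concrete:

```python
from collections import defaultdict

# The three lines of the input document, one per split
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Map: emit a (word, 1) pair for every word in every split
mapped = [(word, 1) for line in splits for word in line.split()]

# Shuffle: group all pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the values of each group
counts = {word: sum(ones) for word, ones in groups.items()}
print(sorted(counts.items()))
# → [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

In real MapReduce each split's map task runs in its own container, and the shuffle moves the grouped pairs across the network between map and reduce tasks.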
API: Mapper interface
API: Reduce interface
API: Main
How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out
MapReduce
• Mappers and reducers are distributed in YARN containers
• Chaining of MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSL languages instead
Languages
YARN
Tez MapReduce <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
”Languages”
Pig
• Procedural language
• Executes on YARN
• Great for:
• Structuring
• Moving
• Transforming
Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for:
• Column data / database dumps
• Aggregations
• Connecting BI tools and dashboards
• Data Warehouse for Hadoop++
Spark
• Core engine (runs on YARN or standalone)
• Great for:
• Anything that MapReduce can do
• Analytics, machine learning
• In-memory, with APIs in Java, Scala and Python
Summary
• Hadoop is designed to handle and process massive amounts of data through HDFS and YARN
• Data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data extraction, data management, data analysis and data integration
• The easiest way to get started with Hadoop is to test a distribution, e.g. Hortonworks, Cloudera or MapR
• Learn MapReduce, then learn the higher-level languages and a few integration tools
Is it a fad?