Introduction to Hadoop
• Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io
• CEO of Monokkel AS
• Former COO of Integrasco AS
• Persistence, Processing and Presentation of data
Persistence – Processing – Presentation
Bombshell
If you work with data today and don't start learning the Hadoop ecosystem, you may be unemployed soon
Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop architecture – HDFS, YARN, MapReduce
• Languages and ecosystem
What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production
Big Data
Big Data – hype and hipsters
Big Data
The originator
Big Data – Let’s add some letters
• Volume
• Variety
• Velocity
• Variability
• Veracity / Data quality
and the step-brother
• Complexity
Relatively boring stuff
Big Data – Example
This is a CEO
The Nordic Hotel Tycoon
1600 hotels in 5 countries
I am a digital champion: The website
I am a digital champion: The desk
I am a digital champion: The external provider
I am a digital champion: The IoT case
I am a digital champion: Social
Houston, we have a problem
• Sales are declining and my stock price is tumbling
The CEO
No clue what is happening
How can the CEO manage his problem?
• Get control over the data
• Implement analytical processes to aid sales
BUT HE DOES NOT WANT TO PAY $10,000,000,000,000 FOR IT
The data he needs to handle
• Volume – Gigabytes/Terabyte
• Variety – Click stream, Voice, emails, sensor data, social data, different languages, timestamp data, transactional data, third party data
• Variability – Various quality
• Velocity – MB per second
The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models
How to handle?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
How to understand?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
So what does Hadoop solve?
Storage
Processing
What is Hadoop?
What is Hadoop? An operating system for data
An OS needs software on top
Distributions
Distributions
• ”Stable” compilation of the Hadoop ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security
Distributions
HADOOP An operating system for data
Layman’s terms
• Store huge files (unstructured) on many machines
• Query and modify data
• Can run sophisticated analytics on top
How to start: Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <command-line arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMware Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)
DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch
NEXT: ARCHITECTURE AND HOW IT WORKS
DEMO
• Hortonworks Sandbox
• Hortonworks Ambari
• Hortonworks Hue
Hadoop – Architecture (2.X.X)
HDFS
YARN
MapReduce
• Hadoop Distributed File System (HDFS)
• YARN (Yet Another Resource Negotiator)
• MapReduce
HDFS
D1
D2
DX
Name Node
Failover Name Node
Client
HDFS
Block index (name node):
B:1 → D1, B:2 → D2, B:3 → D3, B:4 → D1, B:5 → D2, B:6 → D3
Data Nodes: D1, D2, D3
HDFS Write
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write a document!
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write /path/to/document1, R:2, B:{3,4}, B:{5,6}
HDFS Write
Client
Name Node
You can write to: D1, D2, D3
D1
D2
D3
Data Nodes
HDFS Write
Client
Name Node D1
D2
D3
B:{D2:5,D3:6}
B:{D3:3,D1:4}
B:{D1:1,D2:2}
Split and write
HDFS Write
HDFS Write
Client
Name Node
D1
D2
D3
Replicate B:1 to D2:2
Success
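The write flow in the slides above can be simulated in a few lines of Python. This is an illustrative sketch only: the tiny block size, the round-robin placement policy and the `write_file` helper are all invented for the example (real HDFS uses 128 MB blocks and a default replication factor of 3).

```python
import itertools

# Toy parameters -- real HDFS defaults are 128 MB blocks, replication 3.
BLOCK_SIZE = 4           # bytes, tiny so the example splits into two blocks
REPLICATION = 2          # "R:2" as in the slides
DATA_NODES = ["D1", "D2", "D3"]

_block_ids = itertools.count(1)

def write_file(path, data, block_index):
    """Split `data` into blocks and record replica locations in
    `block_index`, mimicking the name node's metadata."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = next(_block_ids)
        # Round-robin placement over REPLICATION distinct data nodes
        replicas = [DATA_NODES[(block_id + r) % len(DATA_NODES)]
                    for r in range(REPLICATION)]
        block_index[block_id] = {"path": path,
                                 "replicas": replicas,
                                 "data": data[offset:offset + BLOCK_SIZE]}
        blocks.append(block_id)
    return blocks

block_index = {}   # name node metadata: block id -> location and content
blocks = write_file("/path/to/document1", b"DeerBear", block_index)
print(blocks)                                  # → [1, 2]
print(block_index[blocks[0]]["replicas"])      # → ['D2', 'D3']
```

The point to notice is that the client only talks to the name node for metadata; the block data itself goes straight to the data nodes.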
HDFS Read
Client
Name Node
D1
D2
D3
I want to read
/path/to/document1
B:{D3:3,D3:6}
B:{D2:2,D2:5}
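The read path can be sketched the same way: the client asks the name node which blocks make up the file and where their replicas live, then fetches each block from a data node. The `block_index` layout is invented for illustration.

```python
# Name node metadata as it might look after a write:
# block id -> file path, replica locations, and (for this toy) the bytes.
block_index = {
    1: {"path": "/path/to/document1", "replicas": ["D1", "D2"], "data": b"Deer"},
    2: {"path": "/path/to/document1", "replicas": ["D2", "D3"], "data": b"Bear"},
}

def read_file(path, block_index):
    # Step 1 (name node): which blocks make up the file, in order?
    block_ids = sorted(bid for bid, meta in block_index.items()
                       if meta["path"] == path)
    # Step 2 (client): fetch each block from the first reachable replica
    return b"".join(block_index[bid]["data"] for bid in block_ids)

print(read_file("/path/to/document1", block_index))   # → b'DeerBear'
```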
HDFS Delete/Update
• HDFS blocks are immutable – you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted blocks
• Small files consume a lot of name node memory
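A minimal sketch of that append-only behaviour, with invented class and method names: an "update" never touches the old block, it writes a new one and re-points the file at it.

```python
class NameNode:
    """Toy name node: files map to block ids; blocks are write-once."""

    def __init__(self):
        self.blocks = {}    # block id -> immutable content
        self.files = {}     # path -> list of block ids
        self._next_id = 0

    def _new_block(self, content):
        self._next_id += 1
        self.blocks[self._next_id] = content   # written once, never changed
        return self._next_id

    def write(self, path, content):
        # Create and update look the same: fresh blocks every time.
        self.files[path] = [self._new_block(content)]

    def delete(self, path):
        # Only the metadata goes away; old blocks linger until cleanup.
        del self.files[path]

nn = NameNode()
nn.write("/doc", "v1")
old = nn.files["/doc"][0]
nn.write("/doc", "v2")              # "update" = new block
print(nn.files["/doc"][0] != old)   # → True: the file points at a new block
print(nn.blocks[old])               # → v1: the old block is untouched
```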
HDFS Scalability
D1
D2
DX
Name Node
Failover Name Node
YARN
HOW DOES HADOOP PROCESS THE DATA STORED IN HDFS?
YARN
Client
Resource Manager
Scheduler
Applications manager
I want to process file “document1” with my-app.jar
YARN
Resource Manager
Scheduler
Applications manager
You can process on D1!
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Start my-app.jar
Application Master
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Application Master
AM to RM: “document1” is located on D1 and D2 and I need X GB RAM
YARN
D1 D2
Node Manager Node Manager
Application Master Container
Resource Manager
Scheduler
Applications manager
my-app.jar is running here!
Start my-app.jar
YARN + HDFS
D1
D2
D3
Name Node
Client
Client
Client
• YARN will try to make sure data is processed where it is stored
• … data locality
YARN + HDFS
• Blocks are immutable. This enables high write speeds
• Data is schema-free! You can store any data you want
• Data locality is what differentiates HDFS from other data storage
• You can read massive amounts of data, limited only by disk read speeds
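The data-locality bullet above can be illustrated with a toy scheduler (node names and the `schedule` function are invented): given the replica locations of a block and the set of nodes with free capacity, prefer a node that already holds the data.

```python
def schedule(block_replicas, free_nodes):
    """Pick a node for the task: data-local if possible, remote otherwise."""
    for node in block_replicas:
        if node in free_nodes:
            return node                 # local: no network transfer needed
    return sorted(free_nodes)[0]        # fallback: read the block remotely

replicas = ["D1", "D2"]                  # where the block's copies live
print(schedule(replicas, {"D2", "D3"}))  # → D2 (data-local)
print(schedule(replicas, {"D3"}))        # → D3 (remote read)
```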
MapReduce and others
OK… BUT HOW DO I PROCESS ?
YARN
Tez MapReduce <Name here>
Libraries: Mahout, MLlib, GraphX, Oryx
Languages: Hive, Pig, R, Spark SQL, Stinger
YARN
Tez <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
MapReduce
MapReduce
Document stored in HDFS:
Deer Bear River
Car Car River
Deer Car Bear
Splitting
Deer Bear River
Car Car River
Deer Car Bear
Mapping
Deer Bear River → Deer 1, Bear 1, River 1
Car Car River → Car 1, Car 1, River 1
Deer Car Bear → Deer 1, Car 1, Bear 1
Shuffling
Bear 1, Bear 1
Car 1, Car 1, Car 1
Deer 1, Deer 1
River 1, River 1

Reduce
Bear 1, Bear 1 → Bear 2
Car 1, Car 1, Car 1 → Car 3
Deer 1, Deer 1 → Deer 2
River 1, River 1 → River 2

Result written to HDFS:
Bear 2, Car 3, Deer 2, River 2
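The Split → Map → Shuffle → Reduce flow above can be reproduced in plain Python (no Hadoop involved) to make the data movement concrete:

```python
from collections import defaultdict

# The three lines of the input document, one per split
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Map: emit a (word, 1) pair for every word in every split
mapped = [(word, 1) for line in splits for word in line.split()]

# Shuffle: group all pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the values of each group
counts = {word: sum(ones) for word, ones in groups.items()}
print(sorted(counts.items()))
# → [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

In real MapReduce each split's map task runs in its own container, and the shuffle moves the grouped pairs across the network between map and reduce tasks.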
API: Mapper interface
API: Reduce interface
API: Main
How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out
MapReduce
• Mappers and reducers are distributed in YARN containers
• Chaining of MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSL languages instead
Languages
YARN
Tez MapReduce <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
”Languages”
Pig
• Procedural language
• Executes on YARN
• Great for:
• Structuring
• Moving
• Transforming
Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for:
• Column data / database dumps
• Aggregations
• Connecting BI tools and dashboards
• Data Warehouse for Hadoop++
Spark
• Core engine (runs on YARN or standalone)
• Great for:
• Anything that MapReduce can do
• Analytics, machine learning
• In-memory, with APIs in Java, Scala and Python
Summary
• Hadoop is designed to handle and process massive amounts of data through HDFS and YARN
• Data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data extraction, data management, data analysis and data integration
• The easiest way to get started with Hadoop is to test a distribution, e.g. Hortonworks, Cloudera or MapR
• Learn MapReduce, then learn the higher-level languages and a few integration tools
Is it a fad?