Upload
others
View
16
Download
0
Embed Size (px)
Citation preview
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop & Spark – What is it ?
Joseph Allemandou
JoalTech
17 / 05 / 2018
1 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Hello!
Joseph Allemandou
2 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Plan
1 Generalities on High Performance Computing (HPC)
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
3 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
More computation power: Scale up vs. scale out
Scale Up Scale out
4 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
More computation power: Scale up vs. scale out
Scale UpScale out
4 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Parallel computing
Things to consider when doing parallel computing:
Partitioning (tasks, data)
Communications
Synchronization
Data dependencies
Load balancing
Granularity
IO
Livermore Computing Center - Tutorial
5 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Looking back - Since 1950
Figure: Google Ngram Viewer
6 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Looking back - Since 1950
Figure: Google Ngram Viewer
6 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Looking back - Since 1950
Figure: Google Ngram Viewer
6 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Looking back - Recent times
Figure: Google Trends
7 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Same problem, different tools
Supercomputer
Dedicated hardware
Message Passing Interface
Big Data
Commodity hardware
8 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
MPI
C / C++ / Fortran / Python
Low-level API - Send / receive messages
a lot to do manually
split the data
assign tasks to workers
handle synchronisation
handle errors
9 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Hadoop + Spark
Java / Scala / Python / R
High-level API - dataflows
Less performant than MPI (see this scientific paper )
Easier to code than MPI (a lot!)
High-availability oriented
10 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Plan
1 Generalities on High Performance Computing (HPC)
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
11 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop (v2) in (almost) 3 points
Distributed storage and computation platform (Java, opensource)
HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator
Built for scalability
Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)
Working with multiple computation engines
Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)
12 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop (v2) in (almost) 3 points
Distributed storage and computation platform (Java, opensource)
HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator
Built for scalability
Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)
Working with multiple computation engines
Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)
12 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop (v2) in (almost) 3 points
Distributed storage and computation platform (Java, opensource)
HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator
Built for scalability
Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)
Working with multiple computation engines
Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)
12 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop (v2) in (almost) 3 points
Distributed storage and computation platform (Java, opensource)
HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator
Built for scalability
Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)
Working with multiple computation engines
Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)
12 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - A big ecosysem
13 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Things to know about Apache Hadoop
HDFS
Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)
YARN
Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)
Global
Execution engines exploit data locality
14 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Things to know about Apache Hadoop
HDFS
Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)
YARN
Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)
Global
Execution engines exploit data locality
14 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Things to know about Apache Hadoop
HDFS
Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)
YARN
Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)
Global
Execution engines exploit data locality
14 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Things to know about Apache Hadoop
HDFS
Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)
YARN
Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)
Global
Execution engines exploit data locality
14 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - The Frame (HDFS + YARN)
15 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - The Frame (HDFS + YARN)
15 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - The Frame (HDFS + YARN)
15 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - MapReduce
16 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Hadoop - MapReduce
16 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Spark
17 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Apache Spark
17 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Why is Spark preferred to MapReduce?
Programming data-flows instead of single map-reducesteps
A lot less code for complex flowsData lineage allows for error recovery
Decoupling of tasks and containers – Spark executors runmultiple tasks
Less executor management overhead
Executors can reuse RAM - Caching!
18 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Why is Spark preferred to MapReduce?
Programming data-flows instead of single map-reducesteps
A lot less code for complex flowsData lineage allows for error recovery
Decoupling of tasks and containers – Spark executors runmultiple tasks
Less executor management overhead
Executors can reuse RAM - Caching!
18 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Why is Spark preferred to MapReduce?
Programming data-flows instead of single map-reducesteps
A lot less code for complex flowsData lineage allows for error recovery
Decoupling of tasks and containers – Spark executors runmultiple tasks
Less executor management overhead
Executors can reuse RAM - Caching!
18 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Spark - A big ecosysem as well
19 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Spark - An example of (very) complex flow
Mediawiki History Reconstruction Job
20 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Plan
1 Generalities on High Performance Computing (HPC)
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
21 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Scala Spark / Toree / Jupyter notebook
Notebook with Spark on Docker
22 / 23
Hadoop –Spark
Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Thank you!
23 / 23