CentraleSupelec 2019-2020 MSC DSBA / DATA SCIENCES Big Data Algorithms, Techniques and Platforms Distributed Databases Hadoop Applications and Ecosystem. Hugues Talbot & Céline Hudelot, professors.

Big Data Hadoop Stack

Lecture #1

Hadoop Beginnings

What is Hadoop?

Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005.

Cutting named the project after his son's toy elephant.

Moving Computation to Data

Computation

Data
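
The idea behind moving computation to data can be illustrated with a rough count of the bytes that cross the network in each direction. The sketch below is only an illustration: the node names, data sizes, and program size are hypothetical values, not figures from the lecture.

```python
# Hypothetical cluster: bytes of input data already stored on each node.
DATA_BYTES_PER_NODE = {
    "node1": 64 * 2**30,
    "node2": 64 * 2**30,
    "node3": 64 * 2**30,
}
# Hypothetical size of the packaged job (code and configuration).
PROGRAM_BYTES = 10 * 2**20


def bytes_moved_data_to_computation(data_per_node):
    """All input data is shipped to wherever the program runs."""
    return sum(data_per_node.values())


def bytes_moved_computation_to_data(data_per_node, program_bytes):
    """One copy of the program is shipped to every node holding data."""
    return program_bytes * len(data_per_node)


to_code = bytes_moved_data_to_computation(DATA_BYTES_PER_NODE)
to_data = bytes_moved_computation_to_data(DATA_BYTES_PER_NODE, PROGRAM_BYTES)
print(f"data to computation: {to_code / 2**30:.0f} GiB")
print(f"computation to data: {to_data / 2**20:.0f} MiB")
```

Under these assumed sizes, shipping the program is several thousand times cheaper than shipping the data.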

Scalability at Hadoop’s core!

Reliability! Reliability! Reliability!

Google File System

One computer: a failure about once a year

365 computers: a failure about once a day

Even larger clusters: failures hourly
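
The failure frequencies on these slides follow from simple arithmetic. The sketch below assumes (this is an assumption, not something stated on the slides) that each machine fails independently about once per year, and derives the cluster sizes at which a failure is expected daily and hourly.

```python
# Assumption (not from the slides): each machine fails independently,
# on average once per year.
FAILURES_PER_MACHINE_PER_YEAR = 1.0
DAYS_PER_YEAR = 365
HOURS_PER_YEAR = DAYS_PER_YEAR * 24


def expected_failures_per_day(machines: int) -> float:
    """Expected number of machine failures per day across the cluster."""
    return machines * FAILURES_PER_MACHINE_PER_YEAR / DAYS_PER_YEAR


def machines_for_hourly_failure() -> int:
    """Cluster size at which one failure per hour is expected."""
    return int(HOURS_PER_YEAR / FAILURES_PER_MACHINE_PER_YEAR)


print(expected_failures_per_day(365))  # 1.0
print(machines_for_hourly_failure())   # 8760
```

At a few thousand machines, failure is a routine event, which is why reliability has to be handled by the framework rather than by each application.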

New Approach to Data

Keep all data

New Kinds of Analysis

Schema-on-read style

New analysis
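
Schema-on-read means the data is stored raw and a structure is applied only when it is read, so the same bytes can serve several analyses. The following is a minimal sketch of that idea in plain Python; the sample lines, schema names, and the `read_with_schema` helper are hypothetical and do not correspond to any Hadoop API.

```python
# Raw records are kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    "alice:x:1000:1000:Alice A:/home/alice:/bin/bash",
    "bob:x:1001:1001:Bob B:/home/bob:/bin/sh",
]

# Two different schemas projected onto the same raw data at read time.
LOGIN_SCHEMA = [("uname", str), ("pswd", str), ("uid", int)]
FULL_SCHEMA = LOGIN_SCHEMA + [
    ("gid", int), ("fullname", str), ("hdir", str), ("shell", str),
]


def read_with_schema(lines, schema, sep=":"):
    """Split each raw line and cast fields according to (name, type) pairs.

    Fields beyond the end of the schema are ignored.
    """
    for line in lines:
        fields = line.split(sep)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


for record in read_with_schema(RAW_LINES, LOGIN_SCHEMA):
    print(record)
for record in read_with_schema(RAW_LINES, FULL_SCHEMA):
    print(record["fullname"], record["hdir"])
```

Nothing about the stored lines changes when a new schema is introduced; only the reader does.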

Small Data & Complex Algorithm vs. Large Data & Simple Algorithm

Lecture #2

Apache Framework Hadoop Modules

Apache Framework Basic Modules

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce

Lecture #3

Hadoop Distributed File System (HDFS)

HDFS: Hadoop Distributed File System

Distributed, scalable, and portable file system written in Java for the Hadoop framework

MapReduce Engine

• Job Tracker
• Task Tracker
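
In the classic MapReduce engine, the Job Tracker schedules a job's map and reduce tasks and the Task Trackers run them on worker nodes. The sketch below shows only the programming model (map, shuffle, reduce) as a single-process word count in plain Python. It uses no Hadoop API, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map calls run in parallel on the nodes that hold each input split, and the shuffle moves data between nodes; here everything happens in one process for clarity.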

Apache Hadoop NextGen MapReduce (YARN)

What is YARN?

YARN enhances the power of a Hadoop compute cluster:

• Scalability
• Improved cluster utilization
• MapReduce compatibility
• Supports other workloads

Lecture #4

The Hadoop “Zoo”

How to figure out the Zoo??

Original Google Stack

Facebook's Version of the Stack

Yahoo's Version of the Stack

LinkedIn's Version of the Stack

Cloudera's Version of the Stack

[Stack diagrams; layer labels on each: Data Integration, Coordination, Languages, Compilers, Data Store]

Lecture #5

Hadoop Ecosystem Major Components

Apache Sqoop

• Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
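
To make the direction of such a transfer concrete, here is a minimal sketch that dumps a relational table to delimited text, the kind of flat file that can then be placed in HDFS. It is not Sqoop and uses none of its options: the in-memory SQLite database, the `users` table, and the `export_table` helper are all hypothetical stand-ins.

```python
import csv
import io
import sqlite3


def export_table(conn, table, sep=","):
    """Dump every row of `table` as delimited text, one line per row.

    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical source database standing in for a production RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

print(export_table(conn, "users"), end="")
```

A real bulk-transfer tool additionally splits the table across parallel tasks and writes directly into the cluster, which this single-process sketch does not attempt.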

HBase

• Column-oriented database management system
• Key-value store
• Based on Google Bigtable
• Can hold extremely large data
• Dynamic data model
• Not a relational DBMS

Pig

• High-level programming on top of Hadoop MapReduce
• The language: Pig Latin
• Data analysis problems as data flows
• Originally developed at Yahoo in 2006

Pig for ETL

Apache Hive

• Data warehouse software that facilitates querying and managing large datasets residing in distributed storage such as HDFS
• Provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL

Oozie

• Workflow scheduler system to manage Apache Hadoop jobs
• Oozie Coordinator jobs
• Supports MapReduce, Pig, Apache Hive, Sqoop, etc.
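
A workflow scheduler has to run each job only after the jobs it depends on have finished. The sketch below illustrates that ordering with Python's standard `graphlib` module (Python 3.9+). It is not Oozie's workflow format, which is defined in XML; the action names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions
# that must complete before it may start.
WORKFLOW = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
}


def execution_order(workflow):
    """Return one valid order in which the actions can be run."""
    return list(TopologicalSorter(workflow).static_order())


for step, action in enumerate(execution_order(WORKFLOW), start=1):
    print(step, action)
```

The two final actions do not depend on each other, so a scheduler is free to run them in either order or in parallel.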

ZooKeeper

Provides operational services for a Hadoop cluster.

Centralized service for:

• maintaining configuration information
• naming services
• providing distributed synchronization
• providing group services

Flume

Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

Additional Cloudera Hadoop Components

Impala

• Cloudera's open source massively parallel processing (MPP) SQL query engine for Apache Hadoop

Spark: The New Paradigm

Spark

Apache Spark™ is a fast and general engine for large-scale data processing

Spark Benefits

• Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
• Allows user programs to load data into a cluster's memory and query it repeatedly
• Well suited to machine learning
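
The benefit of loading data into memory once and querying it repeatedly can be sketched without Spark itself. The class below is a hypothetical single-machine stand-in, not the Spark API: it counts how often the underlying source is read so the effect of caching is visible.

```python
class CachedDataset:
    """Loads rows from a source on first use and keeps them in memory."""

    def __init__(self, loader):
        self._loader = loader
        self._rows = None
        self.load_count = 0  # how many times the source was actually read

    def rows(self):
        if self._rows is None:
            self._rows = list(self._loader())
            self.load_count += 1
        return self._rows

    def query(self, predicate):
        """Filter the in-memory rows; never touches the source again."""
        return [row for row in self.rows() if predicate(row)]


def slow_loader():
    # Hypothetical stand-in for an expensive read from distributed storage.
    yield from range(10)


dataset = CachedDataset(slow_loader)
print(dataset.query(lambda r: r % 2 == 0))
print(dataset.query(lambda r: r > 7))
print("source reads:", dataset.load_count)
```

Iterative algorithms, such as many machine learning methods, pass over the same data many times, which is why keeping it in memory pays off.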

Up Next

Tour of Cloudera's QuickStart VM

Apache Pig

• Overview of apps, high-level languages, services
• Databases/Stores
• Querying
• Machine Learning
• Graph Processing

Databases/Stores

• Avro: data structures within context of Hadoop MapReduce jobs
• HBase: distributed non-relational database
• Cassandra: distributed data management system

Querying

• Pig: Platform for analyzing large data sets in HDFS
• Hive: Query and manage large datasets
• Impala: High-performance, low-latency SQL querying of data in Hadoop file formats
• Spark: General processing engine for streaming, SQL, machine learning, and graph processing

Machine Learning, Graph Processing

• Giraph: Iterative graph processing using the Hadoop framework
• Mahout: Framework for machine learning applications using Hadoop, Spark
• Spark: General processing engine for streaming, SQL, machine learning, and graph processing

Apache Pig

• Pig components: Pig Latin and infrastructure layer
• Typical Pig use cases
• Run Pig with Hadoop integration

Apache Pig

• Platform for data processing
• Pig Latin: high-level language
• Pig execution environment: Local, MapReduce, Tez
• Built-in operators and functions
• Extensible

Pig Usage Areas

• Extract, Transform, Load (ETL) operations
• Manipulating, analyzing "raw" data
• Widely used, extensive list at: https://cwiki.apache.org/confluence/display/PIG/PoweredBy

Pig Example

• Load the passwd file and work with the data.
• Step 1: copy the file into HDFS:
    hdfs dfs -put /etc/passwd /user/cloudera
• Step 2: start Pig in MapReduce mode:
    pig -x mapreduce

Pig Example

• Puts you in the "grunt" shell.
• Clear the screen:
    clear
• Load the file:
    A = load '/user/cloudera/passwd' using PigStorage(':');
• Pick a subset of values:
    B = foreach A generate $0, $4, $5;
    dump B;
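
For readers unfamiliar with Pig Latin's positional fields, the load-and-project steps can be mirrored in a few lines of plain Python: split each line on ':' and keep fields 0, 4, and 5. This is only an analogy that runs on one machine; the sample passwd lines and the helper names are hypothetical.

```python
# Hypothetical sample in /etc/passwd format.
SAMPLE_PASSWD = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "cloudera:x:501:501:Cloudera User:/home/cloudera:/bin/bash\n",
]


def load(lines, sep=":"):
    """Rough analogue of `load ... using PigStorage(':')`: one tuple per line."""
    return [tuple(line.rstrip("\n").split(sep)) for line in lines]


def project(relation, positions):
    """Rough analogue of `foreach ... generate $i, $j, ...`."""
    return [tuple(row[i] for i in positions) for row in relation]


A = load(SAMPLE_PASSWD)
B = project(A, (0, 4, 5))
for row in B:  # rough analogue of `dump B;`
    print(row)
```

Fields 0, 4, and 5 of a passwd line are the username, full name, and home directory.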

Pig Example

• Backend Hadoop job info.
• Outputs username, full name, and home directory path.
• Can store this processed data in HDFS:
    store B into 'userinfo.out';
• Verify the new data is in HDFS.

Summary

• Used interactive shell for Pig example
• Can also run using scripts
• Also as embedded programs in a host language (Java, for example)

Apache Hive

• Query and manage data using HiveQL

• Run interactively using beeline.

• Other run mechanisms

Apache Hive

• Data warehouse software
• HiveQL: SQL-like language to structure and query data
• Execution environment: MapReduce, Tez, Spark
• Data in HDFS, HBase
• Custom mappers/reducers

Hive Usage Areas

• Data mining, analytics
• Machine learning
• Ad hoc analysis
• Widely used, extensive list at: https://cwiki.apache.org/confluence/display/Hive/PoweredBy

Hive Example

• Revisit the /etc/passwd file example from the Pig lesson video.
• Start by loading the file into HDFS:
    hdfs dfs -put /etc/passwd /tmp/
• Run beeline to access interactively:
    beeline -u jdbc:hive2://

Hive Example

• Copy passwd file to HDFS
• Running interactively using beeline

Hive Example: Command list

    CREATE TABLE userinfo (
      uname STRING, pswd STRING, uid INT, gid INT,
      fullname STRING, hdir STRING, shell STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE userinfo;

    SELECT uname, fullname, hdir FROM userinfo ORDER BY uname;
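
The same create, load, and select sequence can be tried without a cluster by using Python's built-in `sqlite3` module as a stand-in. This is an analogy only: SQLite is not Hive, the column types are SQLite's rather than Hive's, the delimited file is parsed in Python instead of by `ROW FORMAT DELIMITED`, and the sample passwd lines are hypothetical.

```python
import sqlite3

# Hypothetical sample in /etc/passwd format.
SAMPLE_PASSWD = [
    "root:x:0:0:root:/root:/bin/bash",
    "cloudera:x:501:501:Cloudera User:/home/cloudera:/bin/bash",
]


def query_userinfo(lines):
    """Create the userinfo table, load ':'-delimited lines, run the SELECT."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE userinfo ("
        "uname TEXT, pswd TEXT, uid INTEGER, gid INTEGER, "
        "fullname TEXT, hdir TEXT, shell TEXT)"
    )
    rows = [line.strip().split(":") for line in lines]
    conn.executemany("INSERT INTO userinfo VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT uname, fullname, hdir FROM userinfo ORDER BY uname"
    ).fetchall()
    conn.close()
    return result


for row in query_userinfo(SAMPLE_PASSWD):
    print(row)
```

The SELECT statement is identical to the one in the command list, which is the point of a SQL-like query language.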

Hive Example

• Run the CREATE TABLE command.
• Load the passwd file from HDFS.
• Select info: this launches the Hadoop job and prints the output once it's complete.
• Completed MapReduce jobs; output shows username, full name, and home directory.

Summary

• Used beeline for interactive Hive example
• Can also use:
  • Hive command line interface (CLI)
  • HCatalog
  • WebHCat

Apache HBase

• HBase features
• Run interactively using HBase shell
• List other access mechanisms

Apache HBase

• Scalable data store
• Non-relational distributed database
• Runs on top of HDFS
• Compression
• In-memory operations: MemStore, BlockCache

HBase Features

• Consistency
• High availability
• Automatic sharding
• Replication
• Security
• SQL-like access (Hive, Spark, Impala)

HBase Example

• Start the HBase shell:
    hbase shell
• Create table:
    create 'userinfotable',{NAME=>'username'},{NAME=>'fullname'},{NAME=>'homedir'}
• Add data:
    put 'userinfotable','r1','username','vcsa'
• Scan table after data entry:
    scan 'userinfotable'
• Select info from all rows corresponding to column 'fullname'.
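
The create, put, and scan steps can be modelled with a nested dictionary to show how a row key maps to columns and values. The class below is a toy sketch for illustration only. It is not the HBase client API, it ignores versions and timestamps, and the second `put` value is a hypothetical example.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table: row key -> {column -> value}."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # A column is 'family' or 'family:qualifier'; the family must exist.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def scan(self, column=None):
        """Yield (row_key, cells) in row-key order, optionally one column."""
        for row_key in sorted(self.rows):
            cells = self.rows[row_key]
            if column is None:
                yield row_key, dict(cells)
            elif column in cells:
                yield row_key, {column: cells[column]}


table = TinyTable("userinfotable", ["username", "fullname", "homedir"])
table.put("r1", "username", "vcsa")
table.put("r1", "fullname", "VCSA User")  # hypothetical value
print(list(table.scan()))
print(list(table.scan("fullname")))
```

Rows may hold different sets of columns, which is the dynamic data model mentioned earlier in the deck.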

Summary

• We used: Apache HBase Shell
• Other options:
  • HBase, MapReduce
  • HBase API
  • HBase External API