Hadoop: A Highly Available and Secure Enterprise Data Warehousing Solution

www.edureka.co/r-for-analytics

www.edureka.co/hadoop-admin

Slide 2

By the end of this webinar, we will know about:

What Big Data is

Why enterprises care about Big Data

Why your DWH needs Hadoop

Security in Hadoop

How Hadoop maintains high availability

Data warehousing tools in Hadoop

Agenda

Slide 3

What is Big Data

Slide 4

Slide 5

What Is Wrong with Our Traditional DWH Solutions?

Slide 6

Storing unstructured data like images and video

Processing images and video

Storing and processing other large files: PDFs, Excel files

Processing large blocks of natural language text: blog posts, job ads, product descriptions

Processing semi-structured data: CSV, JSON, XML, log files

Sensor data

When Does an RDBMS Make No Sense?

Slide 7

Ad-hoc, exploratory analytics

Integrating data from external sources

Data cleanup tasks

Very advanced analytics (machine learning)

When Does an RDBMS Make No Sense?

Slide 8

It is:

– Unstructured

– Unprocessed

– Un-aggregated

– Un-filtered

– Repetitive

– Low quality

– And generally messy.

Oh, and there is a lot of it.

Big Problems with Big Data

Slide 9

Storage capacity

Storage throughput

Pipeline throughput

Processing power

Parallel processing

System Integration

Data Analysis

Scalable storage

Massive Parallel Processing

Ready to use tools

Technical Challenges

Slide 10

Too many channels for data

Technical Challenges

Slide 11

Why Do Enterprises Care About Big Data?

Slide 12

Slide 13

Slide 14

You said RDBMS does not have a solution for Big Data. Then who does?

Slide 15

I have the solution for the Big Data problem

Hadoop

Hadoop : The Savior

Slide 16

How Hadoop differs from RDBMS

Hadoop can store all types of data, which gives you the flexibility to analyze all of it.

You can drill down into big data to find even rare insights, which was not possible earlier.

Slide 17

First load the data, then do whatever you want with it.

This is possible because of cheap storage and the distributed HDFS.

Hadoop Is The New DWH Solution

• This is ETL

• Before loading, you must transform the data into a particular format

• This puts a restriction on the type of data that can be stored

Slide 18

• This is ELT

• There is no need to transform the data beforehand

• You can have all kinds of data on board

• Freedom to work with all data
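
The difference between the ETL and ELT bullets can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample records, the list standing in for files on HDFS, and all function names are hypothetical and do not come from the webinar.

```python
import json

# Hypothetical raw inputs in mixed shapes, as they might arrive from different sources.
RAW_LINES = [
    '{"user": "alice", "amount": "42.5"}',
    '{"user": "bob", "amount": "7", "note": "extra field"}',
    'not json at all: free text from a log',
]


def etl_load(lines):
    """ETL: transform first, so anything that does not fit the schema is lost."""
    table = []
    for line in lines:
        try:
            rec = json.loads(line)
            table.append({"user": rec["user"], "amount": float(rec["amount"])})
        except (ValueError, KeyError, TypeError):
            continue  # rejected before it ever reaches the warehouse
    return table


def elt_load(lines):
    """ELT: store every line untouched; the list stands in for files on HDFS."""
    return list(lines)


def total_amount(raw_store):
    """Transform at query time: apply a schema only when a question is asked."""
    total = 0.0
    for line in raw_store:
        try:
            total += float(json.loads(line)["amount"])
        except (ValueError, KeyError, TypeError):
            pass  # this query skips the line, but the raw data is still stored
    return total


warehouse = etl_load(RAW_LINES)
raw_store = elt_load(RAW_LINES)
print(len(warehouse), len(raw_store), total_amount(raw_store))
```

In the ETL path the free-text line and the extra `note` field are gone for good, while the ELT path keeps them available for a later question that nobody has thought of yet.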

Slide 19

Hadoop is the new Data Warehouse for all kinds of BI requirements.

Hadoop Does ELT, Not ETL

Slide 20

Core Features of Hadoop

Slide 21

Hadoop Is Fault Tolerant And Super Consistent

Slide 22

Maintaining High Availability (HA)

In distributed computing, failure is the norm, which means YARN should provide an acceptable level of availability.

[Diagram: a Client gets block locations from the NameNode (namespace and block management) and reads data from the DataNodes. Callouts: NameNode has no horizontal scale; NameNode has no high availability.]
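
The read path in this diagram (the client asks the NameNode for block locations, then reads the data from DataNodes) can be modeled as a toy sketch. The class and function names are hypothetical and do not mirror the real HDFS client API, and the tiny block size is chosen only for illustration.

```python
class DataNode:
    """Holds block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Keeps only metadata: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [DataNode, ...])

    def get_block_locations(self, path):
        return self.block_map[path]


def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    """Split data into blocks and place each block on `replication` DataNodes."""
    entries = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        replicas = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        for dn in replicas:
            dn.blocks[block_id] = data[offset:offset + block_size]
        entries.append((block_id, replicas))
    namenode.block_map[path] = entries


def read_file(namenode, path):
    """Step 1: get block locations from the NameNode. Step 2: read from DataNodes."""
    chunks = []
    for block_id, replicas in namenode.get_block_locations(path):
        chunks.append(replicas[0].blocks[block_id])  # any replica would do
    return b"".join(chunks)


nn = NameNode()
dns = [DataNode(f"dn{i}") for i in range(3)]
write_file(nn, dns, "/logs/a.txt", b"hello hadoop")
print(read_file(nn, "/logs/a.txt"))
```

Every lookup goes through the single `NameNode` object. If it is lost, the file cannot be read even though the DataNodes still hold every block, which is the availability problem the callouts point at.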

Slide 23

Secondary NameNode:

"Not a hot standby" for the NameNode

Connects to NameNode every hour*

Housekeeping, backup of NameNode metadata

Saved metadata can be used to rebuild a failed NameNode

[Diagram: the NameNode (single point of failure) passes its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]

NameNode – Single Point of Failure
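
As a rough illustration of the housekeeping role described on this slide, the sketch below merges an edit log into a saved namespace image, which is the general idea of a checkpoint. The data structures and function names are hypothetical, and the one-hour interval is taken from the slide rather than from any real configuration.

```python
import copy

CHECKPOINT_INTERVAL_SECONDS = 3600  # "every hour", as stated on the slide


def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace image (a dict of path -> size)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image and return the new image."""
    merged = copy.deepcopy(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged


# NameNode state: the last saved image plus the edits made since then.
image = {"/data/a.csv": 100}
edits = [("create", "/data/b.json", 250), ("delete", "/data/a.csv")]

# The Secondary NameNode pulls both, merges them, and keeps the result.
saved_image = checkpoint(image, edits)

# After a NameNode failure, the namespace can be rebuilt from the saved image.
# Edits made after the last checkpoint would be missing.
rebuilt = copy.deepcopy(saved_image)
print(rebuilt)
```

The last comment is the reason the slide calls the Secondary NameNode "not a hot standby": it can help rebuild a NameNode, but it does not take over when one fails.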

Slide 24

[Diagram: Hadoop 2.0 cluster architecture with HA. HDFS: Client, Active NameNode, Standby NameNode, Secondary Name Node, and DataNodes. YARN (Next Generation MapReduce): Resource Manager, plus a Node Manager on each worker node hosting Containers and App Masters.]

NameNode High Availability

Shared edit logs: all namespace edits are logged to shared NFS storage by a single writer (fencing).

The Standby NameNode reads the edit logs and applies them to its own namespace.

HDFS HIGH AVAILABILITY

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

Hadoop 2.0 Cluster Architecture - HA
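
The shared edit log idea (a single writer, fencing, and a standby that replays edits into its own namespace) can be sketched as a small simulation. The class names are hypothetical, and the epoch counter is an assumption chosen to make fencing visible in a few lines; it is not how the NFS-based setup actually fences the old active node.

```python
class FencedError(Exception):
    """Raised when a writer that lost its role tries to log an edit."""


class SharedEditLog:
    """Stands in for the shared NFS directory; only one writer is allowed."""

    def __init__(self):
        self.edits = []
        self.writer_epoch = 0

    def become_writer(self):
        """Fence the previous writer by handing out a newer epoch."""
        self.writer_epoch += 1
        return self.writer_epoch

    def append(self, epoch, edit):
        if epoch != self.writer_epoch:
            raise FencedError("stale writer rejected")
        self.edits.append(edit)


class NameNode:
    def __init__(self, log):
        self.log = log
        self.namespace = set()
        self.epoch = None   # set once this node becomes active
        self.applied = 0    # number of shared edits already replayed

    def catch_up(self):
        """Standby role: read new edits and apply them to the local namespace."""
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def activate(self):
        """Replay anything outstanding, then take over as the single writer."""
        self.catch_up()
        self.epoch = self.log.become_writer()

    def mkdir(self, path):
        """Active role: log the edit first, then apply it locally."""
        self.log.append(self.epoch, ("mkdir", path))
        self.namespace.add(path)
        self.applied = len(self.log.edits)


log = SharedEditLog()
nn1, nn2 = NameNode(log), NameNode(log)
nn1.activate()
nn1.mkdir("/warehouse")
nn2.catch_up()          # the standby stays in sync
nn2.activate()          # failover: nn1 is now fenced
try:
    nn1.mkdir("/late")
    fenced = False
except FencedError:
    fenced = True
nn2.mkdir("/reports")
print(sorted(nn2.namespace), fenced)
```

Rejecting the stale writer is what keeps the two NameNodes from diverging when the old active node is slow to notice that it has been replaced.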

Demo

Achieving HDFS and YARN High Availability
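
For readers who want to try the HDFS part of this demo, the sketch below generates a partial `hdfs-site.xml` with the core HA properties described in the Apache guide linked above. The nameservice ID, host names, and shared directory are placeholder assumptions; a real cluster needs further settings (for example the SSH key for `sshfence`), and YARN Resource Manager HA is configured separately.

```python
import xml.etree.ElementTree as ET

# Placeholder values; only the property names follow the Apache HDFS HA (NFS) guide.
NAMESERVICE = "mycluster"
PROPERTIES = {
    "dfs.nameservices": NAMESERVICE,
    f"dfs.ha.namenodes.{NAMESERVICE}": "nn1,nn2",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn1": "machine1.example.com:8020",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn2": "machine2.example.com:8020",
    # Shared NFS directory where the active NameNode writes its edit log.
    "dfs.namenode.shared.edits.dir": "file:///mnt/filer1/dfs/ha-name-dir-shared",
    f"dfs.client.failover.proxy.provider.{NAMESERVICE}":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    # How the previous active NameNode is cut off during a failover.
    "dfs.ha.fencing.methods": "sshfence",
}


def build_hdfs_site(properties):
    """Build a <configuration> element with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


root = build_hdfs_site(PROPERTIES)
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```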

Slide 26

Hadoop is Secure

Slide 27

Security

Service-level authorization and web proxy capabilities in YARN.

Access Control Lists (ACL): The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model.
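
A POSIX-style permission check of the kind this model is based on can be sketched as follows. This is a simplified illustration, not HDFS code: the `Inode` class and `is_allowed` function are hypothetical, and the sketch ignores the superuser, extended ACL entries, and the sticky bit.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o750


def is_allowed(inode, user, user_groups, wanted):
    """Pick the owner, group, or other bits, then test the wanted access."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 7
    elif inode.group in user_groups:
        bits = (inode.mode >> 3) & 7
    else:
        bits = inode.mode & 7
    return (bits & wanted) == wanted


sales_dir = Inode(owner="etl", group="analysts", mode=0o750)
print(is_allowed(sales_dir, "etl", set(), WRITE))            # owner class
print(is_allowed(sales_dir, "alice", {"analysts"}, READ))    # group class
print(is_allowed(sales_dir, "alice", {"analysts"}, WRITE))   # group class
print(is_allowed(sales_dir, "guest", set(), READ))           # other class
```

Only one class of bits is consulted per request, so an owner with mode `0o070` would be denied even if their group is allowed. ACL entries extend this model by naming additional users and groups.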

Slide 28

Security – Simple Flow

Security Risks

Insufficient authentication: users and services are not authenticated

No privacy and no integrity: insecure network transport, no message-level security

Arbitrary code execution: no user verification for MapReduce code execution, so malicious users could submit a job

[Diagram: Client, Job Tracker, and two Task Trackers, each running a Task that accesses HDFS.]

Slide 29

Managing users, permissions, quotas, etc.

Checking Resources Usage And Users Permissions

Demo

Demo on ACL
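
The commands used in the demo are not shown on the slide. As a hedged sketch, the helpers below assemble the standard `hdfs` command lines for ACLs and quotas without running them. The path `/warehouse/sales` and the user `alice` are hypothetical, and ACL support may need to be switched on in the NameNode configuration (`dfs.namenode.acls.enabled`) depending on the Hadoop version.

```python
def setfacl_cmd(path, acl_spec):
    """Add or modify ACL entries, e.g. acl_spec='user:alice:r-x'."""
    return ["hdfs", "dfs", "-setfacl", "-m", acl_spec, path]


def getfacl_cmd(path):
    """Show the ACL entries of a file or directory."""
    return ["hdfs", "dfs", "-getfacl", path]


def set_space_quota_cmd(path, quota):
    """Limit the space a directory tree may consume, e.g. quota='10g'."""
    return ["hdfs", "dfsadmin", "-setSpaceQuota", quota, path]


def quota_usage_cmd(path):
    """Report quotas and current usage for a directory."""
    return ["hdfs", "dfs", "-count", "-q", path]


commands = [
    setfacl_cmd("/warehouse/sales", "user:alice:r-x"),
    getfacl_cmd("/warehouse/sales"),
    set_space_quota_cmd("/warehouse/sales", "10g"),
    quota_usage_cmd("/warehouse/sales"),
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where the `hdfs` client is installed and the caller has the required privileges.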

Slide 31

Hadoop provides a traditional SQL interface as well as a NoSQL interface for data storage

Slide 32

Hive?

Slide 33

Hive Architecture

Slide 34

HBase and Its Architecture?

Hive and HBase Integration

Questions

Slide 36

Slide 37

Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your experience better!

Please spare a few minutes to take the survey after the webinar.

Survey
