
Page 1: Introduction to Hadoop

Introduction to Big Data & Hadoop

Big Data Hadoop Training

Page 2: Introduction to Hadoop

Introduction to Big Data

Page 3: Introduction to Hadoop


Importance Of Data

“Data is the new oil,” said Andreas Weigend, social data guru and former chief scientist at Amazon.com. “Oil needs to be refined before it can be useful.”

Page 4: Introduction to Hadoop


ESG Report on Analytics:

• Majority of organizations view data analytics as a top-5 business and IT priority.
• Reduced costs and process improvement are top data analytics platform benefits.
• No leading data analytics platform has emerged yet; nearly one-third of the organizations surveyed are using a custom-developed solution.
• Big data is driving changes in analytics tools, infrastructure, and processes.

Page 5: Introduction to Hadoop


Meaning of the term Big Data

Page 6: Introduction to Hadoop


Size of the largest dataset for processing

Page 7: Introduction to Hadoop


Number of Data Sources to integrate

Page 8: Introduction to Hadoop


Update frequency of the largest data set

Page 9: Introduction to Hadoop


Challenges while processing data

Page 10: Introduction to Hadoop


Key benefits from processing data

Page 11: Introduction to Hadoop


Big Data & Its Hype

• Gartner: Hadoop will be in two-thirds of advanced analytics products by 2015.
• Livemint.com: SMAC is the new flavour of IT companies; SMAC will allow the IT industry to offer more value to clients.
• Offshore Insights: Growth of IT companies will be dictated by cloud, mobile, analytics, big data and social media services, according to a survey of 410 global IT decision-makers by research firm Offshore Insights, released in February.

Page 12: Introduction to Hadoop


What is Big Data?

• Lots of data (in terms of terabytes or petabytes).
• A term applied to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
• Systems and enterprises generate huge amounts of data, ranging from terabytes to even petabytes.

Page 13: Introduction to Hadoop


Structured vs. Unstructured Data

Page 14: Introduction to Hadoop


Big Data Characteristics

• Big Data is characterized by the 3 Vs: Volume, Velocity, and Variety

Page 15: Introduction to Hadoop


Time for a Quiz

• For the given file formats, identify which category of data each belongs to:
  • Word docs, PDFs, text files
  • Email body
  • XML files
  • Data generated by ERPs, CRMs, etc.

Page 16: Introduction to Hadoop


Big Data Users & Scenarios

Page 17: Introduction to Hadoop


Challenges Of Big Data

• Problem #1: Slow Disk Reads/Writes

• Problem #2: Hardware Failures

• Problem #3: Data Integration & Transfer

Page 18: Introduction to Hadoop


Why Distributed Processing?

To read 1 TB of data:

• Disk transfer rate: roughly 100 MB/sec per disk

Page 19: Introduction to Hadoop


Why Distributed Processing?

To read 1 TB of data:

• With a single disk at 100 MB/sec: 1 TB / 100 MB/sec ≈ 10,486 sec, or about 175 min.
• With five disks read in parallel: 1 TB / (5 × 100 MB/sec) ≈ 2,097 sec, or about 35 min.
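A quick back-of-the-envelope check of these figures, as a minimal sketch in Java (the 100 MB/sec per-disk rate and the five-way parallel read are the assumptions stated above, not measured values):

public class ReadTimeEstimate {
    public static void main(String[] args) {
        double totalMb = 1024.0 * 1024.0; // 1 TB expressed in MB
        double mbPerSec = 100.0;          // assumed per-disk transfer rate
        int disks = 5;                    // disks read in parallel

        double singleDiskSec = totalMb / mbPerSec;
        double parallelSec = totalMb / (disks * mbPerSec);

        System.out.printf("1 disk : %.0f sec (~%.0f min)%n", singleDiskSec, singleDiskSec / 60);
        System.out.printf("%d disks: %.0f sec (~%.0f min)%n", disks, parallelSec, parallelSec / 60);
    }
}

This prints roughly 10486 sec (~175 min) and 2097 sec (~35 min), the same speed-up Hadoop obtains by spreading blocks over many machines and reading them in parallel.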

Page 20: Introduction to Hadoop

Introduction to Hadoop

Page 21: Introduction to Hadoop


Course Contents:

• History of Hadoop
• Hadoop Ecosystem
• Hadoop Animal Planet
• What is Hadoop?
• Distinctions of Hadoop
• Hadoop Components
• Anatomy of a File Write
• Anatomy of a File Read
• Replication & Rack Awareness

Page 22: Introduction to Hadoop


History of Hadoop

Page 23: Introduction to Hadoop


Hadoop Ecosystem

Page 24: Introduction to Hadoop


Hadoop Animal Planet

Page 25: Introduction to Hadoop


What is Hadoop?

• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

• Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

Page 26: Introduction to Hadoop


Key Distinctions of Hadoop

Hadoop is:
• Scalable
• Robust
• Accessible
• Simple

Page 27: Introduction to Hadoop


Hadoop Components

Page 28: Introduction to Hadoop


Hadoop Components

• HDFS – Hadoop Distributed File System (storage):
  • Data is split and distributed across nodes
  • Each split is replicated
  • Namenode is the master & Datanodes are the slaves

• MapReduce (processing):
  • Splits a task across processors
  • Execution happens near the data & the results are merged
  • Self-healing
  • Jobtracker is the master & Tasktrackers are the slaves

(See the client sketch below.)
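To make the storage bullets concrete, here is a minimal sketch of an HDFS client written against the standard org.apache.hadoop.fs API; the path /user/demo/sample.txt and the default Configuration are illustrative assumptions rather than anything from the original slides. The client never handles splits or replicas itself; the Namenode and Datanodes do that behind FileSystem.create() and FileSystem.open().

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and friends from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        // Write: HDFS splits the stream into blocks and replicates them transparently
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the Namenode tells the client which Datanodes hold each block
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}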

Page 29: Introduction to Hadoop


Hadoop Components

Figure: the MapReduce layer (a Job Tracker master with several Task Tracker slaves) runs on top of the HDFS cluster (a Namenode master with several Data Node slaves).

Page 30: Introduction to Hadoop


Storage Components

• NameNode
  • The master node, responsible for the entire cluster
  • Manages the filesystem namespace (metadata about files and their blocks)
  • Typically runs on reliable, enterprise-grade hardware (in contrast to the commodity DataNodes)

• DataNode
  • Slaves, which run on commodity/cheap hardware
  • Store and retrieve blocks when told to (by a client or the NameNode)
  • Send heartbeat signals to the NameNode, reporting the blocks they store

• Secondary NameNode
  • Performs periodic checkpoints of the namespace (merging the edit log into the fsimage); it is not a hot standby or backup for the NameNode
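To see this division of labour from a client's point of view, here is a minimal sketch using FileSystem.getFileBlockLocations(); the file path is again an illustrative assumption. The block metadata comes from the NameNode, while the hosts listed are the DataNodes that actually hold the replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        FileStatus status = fs.getFileStatus(file);
        // Only metadata is fetched here; no block data is read from the DataNodes
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}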

Page 31: Introduction to Hadoop


Processing Components

• Job Tracker
  • Coordinates all the jobs run on the system by scheduling tasks
  • Keeps a record of the overall progress of each job
  • If a task fails, reschedules it on a different Task Tracker

• Task Tracker
  • Slave daemon that accepts tasks to be run on a block of data
  • Sends progress reports to the Job Tracker as heartbeat signals at regular intervals

(A minimal MapReduce job sketch follows below.)
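Here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API to show what the Job Tracker schedules and the Task Trackers run; the class names and the command-line input/output paths are illustrative, following the stock WordCount pattern rather than code from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // Map task: runs near the data, emits (word, 1) for every word in its input split
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: merges the per-split counts into a total per word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and submitted with hadoop jar, the framework splits the input, runs map tasks near the blocks that hold their data, and merges the reducer output, which is the division of labour sketched above.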

Page 32: Introduction to Hadoop


HDFS

Page 33: Introduction to Hadoop


MapReduce Job

Page 34: Introduction to Hadoop


Anatomy of a File Read

Page 35: Introduction to Hadoop


Anatomy of a File Write

Page 36: Introduction to Hadoop


Replication & Rack Awareness

Figure: Blocks A, B, and C replicated across twelve Data Nodes, with nodes 1–4 in Rack 1, nodes 5–8 in Rack 2, and nodes 9–12 in Rack 3.
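As a minimal, hedged sketch of how the replication factor behind this placement is controlled: the property name dfs.replication and the default value of 3 are standard HDFS configuration, while the file path below is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default replication factor for new files

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        // Ask HDFS to keep 3 replicas of an existing file's blocks
        fs.setReplication(file, (short) 3);
    }
}

With rack awareness configured, the default placement policy spreads these replicas over more than one rack, so the failure of an entire rack never removes every copy of a block.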