Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData Community

FEARLESS engineering

Acharya Institute of Technology, Bengaluru.

A grand welcome to the world of “bigdata_community”

Introduction to BigData and Installation of a Single-Node Apache Hadoop Cluster

Presented by: Mahantesh C. Angadi, Nagarjuna DN, Manoj PT

Under the guidance of: Prof. Manjunath TN, Prof. Amogh PK, Dept. of ISE, AIT, Bengaluru

Session-1: 24 March 2014

Contents

Purpose of This Talk

Introduction

Terminologies

Cloud Computing

BigData

Traditional Approaches to Solve BigData Problems

Hadoop and its Characteristics

Architecture of Hadoop

Why Only Hadoop?

Advantages of Hadoop

Limitations of Hadoop

Job Opportunities in BigData & Hadoop

Conclusion

References

Purpose of This Talk

To understand the terminologies

To introduce you to BigData and Hadoop

To clearly differentiate between Cloud, BigData, and Hadoop

To review traditional approaches to handling BigData

To describe the characteristics of Hadoop

To explain how Hadoop works

To get friendly with MapReduce and HDFS

To become aware of the Hadoop ecosystem

Advantages and Limitations of Hadoop

Job opportunities in Hadoop

Conclusion

Introduction

Today we live in the Data Age.

Due to the Internet of Things (IoT), the rate of data ingestion keeps increasing.

So the world is getting “hungrier and hungrier for data”.

Terminologies

Cloud Computing

BigData

Hadoop

Distributed Computing

Parallel Computing

Utility Computing

Data Scientist

Let's start the journey…!

Cloud Computing

“Computing may someday be organized as a public utility, just as the telephone system is organized as a public utility”

- John McCarthy, 1961

The word “Cloud” was first used in a technical sense by people at HP and Compaq.

Cloud Computing is a form of Utility Computing in which a large number of computers, connected through a communication network such as the Internet, provide services on demand.

Utility Computing

Utility computing is the packaging of computing resources, such as computation, storage, and services, as a metered service.

This model has the advantage of a low or no initial cost to acquire computer resources.

Distributed Computing

Distributed computing refers to the use of distributed systems to solve computational problems.

Here a problem is divided into many small tasks, each of which is solved by one or more computers, which communicate with each other by passing messages.

Parallel Computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.
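
The divide-and-combine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_into_chunks` and `parallel_sum` are made up for this example, and a thread pool stands in for separate machines. In standard CPython, threads mainly show the structure of the approach rather than a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(numbers, chunk_count):
    """Divide the list into roughly equal, contiguous chunks."""
    chunk_size = max(1, -(-len(numbers) // chunk_count))  # ceiling division
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def parallel_sum(numbers, worker_count=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    chunks = split_into_chunks(numbers, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)
```

Calling `parallel_sum(list(range(1, 101)))` splits the numbers 1 to 100 into four chunks, sums each chunk in its own worker, and adds the four partial results.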

Why BigData Deserves Our Attention

Every day we create 2.5 quintillion bytes of data; 90% of this data is unstructured.

90% of the data in the world today has been created in the last two years alone.

Cisco estimates that global Internet traffic will reach 4.8 zettabytes a year by the end of 2015.

BigData would create 4.4 million jobs by 2015.

There is a shortage of 140,000-190,000 BigData professionals in the United States alone…!

What Happens in an Internet Minute…?

What Is BigData…?

BigData is any amount of structured and/or unstructured data that is beyond the storage and processing capabilities of a single physical machine and of traditional database techniques.

Data that has extra-large Volume, comes from a Variety of sources in a Variety of formats, and arrives at great Velocity is normally referred to as BigData.

3 V’s of BigData

Rise of BigData Adoption

Data Scientist is the hottest job of the 21st century…!

- Harvard Business Review Magazine

Positions such as Data Scientist and Data Analyst did not exist a few years ago.

Today companies are fighting to recruit these specialists.

The market is not growing at the rate it wants to because a skills shortage is looming, so companies are pushing salaries up…!

Data Scientists take huge amounts of data and attempt to pull useful “Business Insights” from that raw data.

Traditional Approaches to Solve BigData Problems

Traditional Approach: Storage Area Network (SAN)

Application Servers

Characteristics of a Storage Area Network (SAN)

A SAN can be visualized as one massive store that appears to offer unlimited storage.

Data is moved to the computational nodes.

It has multiple Application Servers.

Programs run on each Application Server.

All the data is stored in one SAN.

Before execution, each server gets the data from the SAN.

After execution, each server writes the output to the SAN.

Problems with a Storage Area Network (SAN)

Heavy dependence on the network

Huge bandwidth demand

Scaling up and scaling down is not a smooth process

Partial failures are difficult to handle

A lot of processing power is spent on transferring the data

Data synchronization is required during exchange
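
To get a feel for the bandwidth problem, here is a rough back-of-the-envelope calculation in Python. All the numbers are assumptions chosen for illustration (a 1 TB dataset, a 1 Gbit/s link, a 10 MB program), and protocol overhead is ignored; the point is only the difference in scale between shipping the data and shipping the program.

```python
def transfer_seconds(data_bytes, link_bits_per_second):
    """Time to push data_bytes through a link, ignoring protocol overhead."""
    return data_bytes * 8 / link_bits_per_second


ONE_TERABYTE = 10**12        # bytes (decimal terabyte, assumed dataset size)
ONE_GIGABIT_LINK = 10**9     # bits per second (assumed link speed)
PROGRAM_SIZE = 10 * 10**6    # bytes (a 10 MB program, assumed for contrast)

move_data = transfer_seconds(ONE_TERABYTE, ONE_GIGABIT_LINK)   # 8000 seconds
move_code = transfer_seconds(PROGRAM_SIZE, ONE_GIGABIT_LINK)   # 0.08 seconds
```

Under these assumptions, pulling the dataset from shared storage takes more than two hours per server, while sending the program to where the data already lives takes a fraction of a second.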

Moore’s Law

“The number of transistors that can be placed on a silicon chip will double approximately every two years, for half the cost.”

It is named after Gordon Moore, a co-founder of Intel.
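
The doubling rule can be written as a one-line projection. The sketch below is illustrative only: the function name `projected_transistors` and the starting count in the usage note are hypothetical, and the two-year doubling period is taken from the statement above.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count that doubles every doubling_period_years."""
    return initial_count * 2 ** (years / doubling_period_years)
```

For example, a chip with 1,000 transistors would be projected to reach 32,000 after ten years, because ten years contain five doubling periods and 2 to the power 5 is 32.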

Hadoop

Inspired by Google.

Google was founded in 1998.

In the early 2000s they faced a serious challenge in handling BigData.

Google released two papers:

- GFS: The Google File System (2003)

- MapReduce: A Programming Model (2004)

What Is Hadoop…?

Apache Hadoop is an open-source software framework used to manage BigData.

It is built and used by a global community of contributors and users.

It is not just a single tool; it is a framework of tools.

Its guiding principle: moving computation is cheaper than moving data.

Most important Hadoop sub-projects:

i. HDFS: Hadoop Distributed File System

ii. MapReduce: A Programming Model

Founders of Hadoop

Why the Name Hadoop…?

“Hadoop” is simply the name of a stuffed toy elephant that belonged to the son of its creator, Doug Cutting.

Characteristics of Hadoop

Scalable: New nodes can be added without changing data formats.

Cost-effective: It processes huge datasets in parallel on large clusters of commodity computers.

Efficient and Flexible: It is schema-less and can absorb any type of data from any number of sources.

Fault-tolerant and Reliable: It handles node failures easily because of replication.

Easy to use: It uses simple Map and Reduce functions to process the data.

It is developed in Java, but it can support Python and other languages too.

Who Uses Hadoop…?

Hadoop Ecosystem

Hadoop Core Components

Hadoop core has two major components:

1. HDFS

a. Name Node

b. Secondary Name Node

c. Data Node

2. MapReduce Engine

a. Job Tracker

b. Task Tracker

Architecture of Hadoop

Overview of Hadoop

Hadoop Distributed File System

Modeled on the Google File System (GFS)

It consists of three major components:

i. Name Node

• It manages the file system metadata and is responsible for how data is distributed throughout the Hadoop cluster.

ii. Secondary Name Node

• It regularly contacts the Name Node and maintains up-to-date snapshots of the Name Node's directory information. It is often described as a backup node, but it is a checkpointing helper rather than a hot standby for the Name Node.

iii. Data Node

• It is responsible for storing the chunks of data assigned to it by the Name Node.
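
The Name Node's bookkeeping role can be illustrated with a toy model. The sketch below is not Hadoop code: `split_into_blocks` and `place_blocks` are hypothetical names, and the round-robin policy is a deliberate simplification (real HDFS placement also takes racks and free space into account). The replication factor of 3 matches the usual HDFS default.

```python
from itertools import cycle


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(block_count, data_nodes, replication=3):
    """Assign each block to `replication` distinct Data Nodes, round-robin.

    Toy policy only: it assumes the node names in data_nodes are unique.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of Data Nodes")
    node_cycle = cycle(data_nodes)
    return {
        block_id: [next(node_cycle) for _ in range(replication)]
        for block_id in range(block_count)
    }
```

With a 300-unit file and 128-unit blocks, `split_into_blocks(300, 128)` yields three blocks, and `place_blocks(3, ["dn1", "dn2", "dn3", "dn4"])` records which three Data Nodes hold a copy of each one. That mapping is the kind of directory information the Name Node keeps.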

MapReduce

Pioneered by Google, popularized by Yahoo (Apache).

It consists of two major components:

i. Job Tracker

• It is responsible for scheduling tasks on the slave nodes.

• To do so, it consults the Name Node and assigns each task to a node that holds the data on which the task will be performed.

ii. Task Tracker

• It runs the actual task logic: it performs the Map and Reduce functions on the data assigned to it by the master node.
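
The scheduling idea described for the Job Tracker, sending each task to a node that already holds the data, can be sketched as a small function. This is a simplified illustration, not the real scheduler: `assign_tasks`, the slot counts, and the node names are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that hold the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots: {node: number of tasks the node can still accept}
    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignments = {}
    for block_id, replica_nodes in block_locations.items():
        # First choice: a node that stores the block and still has a free slot.
        local = [n for n in replica_nodes if slots.get(n, 0) > 0]
        # Fallback: any node with a free slot (the data must then be copied).
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots left in the cluster")
        node = candidates[0]
        slots[node] -= 1
        assignments[block_id] = (node, bool(local))
    return assignments
```

When every node holding a block is busy, the task falls back to another node and is flagged as not data-local, which is exactly the case where data has to travel over the network.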

Distributed Model

Task Tracker and Data Nodes

Master/Slave Architecture

MapReduce Example: WordCount
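
As a rough illustration of the WordCount idea, the three MapReduce stages can be simulated inside a single Python process. This is a sketch of the programming model only, not Hadoop code: the function names are invented for this example, and on a real cluster the map and reduce steps would run on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

Calling `word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])` maps each line to word pairs, groups the pairs by word, and reduces each group to a total per word.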

Advantages of Hadoop

Moving computation is far better than moving data

Runs on commodity hardware

It uses a Master/Slave architecture

It detects node failures through live heartbeats

It handles assigning tasks to nodes

It has rack awareness across nodes

So programmers only need to concentrate on getting business value from BigData
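
Heartbeat-based failure detection from the list above can be sketched as a simple timeout check. This is an assumption-laden illustration: `find_dead_nodes` is a hypothetical helper, and the 30-second timeout is an arbitrary example value, not a Hadoop default.

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds=30.0):
    """Return nodes whose last heartbeat is older than timeout_seconds.

    last_heartbeat: {node: timestamp in seconds of its latest heartbeat}
    The 30-second timeout is an illustrative value, not a Hadoop default.
    """
    return sorted(
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )
```

A master would run a check like this periodically and reschedule the work of any node it reports, relying on the replicated copies of that node's data.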

Limitations of Hadoop

Do you think Hadoop is a “silver bullet” that can solve all kinds of problems…?

- The answer is NO…!!!

Not suitable if the data is too small.

Not suitable if there are dependencies within the data.

Not suitable if the job cannot be divided into small chunks.

Not suitable for real-time and stream-based processing.

Closer Home

AADHAAR: the Government of India’s UIDAI project is considered one of the largest BigData projects in the world…!

Feb 14th, 2011: IBM’s supercomputer “Watson”, built using BigData technology.

It is not connected to the Internet, and it processes information like a human brain…!

Job Opportunities

Roles and Profiles:

Hadoop Administrator

Prerequisites: networking, system administration

Hadoop Developer

Prerequisites: programming expertise, preferably Java/Python

Data Scientist and Data Analyst

Prerequisites: mathematics and statistics background; scripting languages such as Perl

Conception About BigData

“BigData is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks that everyone else is doing it, so everyone claims they are doing it…! But anyone who actually tries it will be terrible at it…!”

- Dan Ariely, Behavioral Economics Guru.

Conclusion

• BigData brings new and exciting opportunities to companies that utilize the available platforms.

• In this Information Era, BigData technology has become important for businesses.

• It offers many opportunities in the coming days.

Any queries…???

Thank you, one and all, for your patience <3