
Page 1: Hadoop and BigData - July 2016

Hadoop and BigData
Ranjith Sekar

July 2016

Page 2: Hadoop and BigData - July 2016

Agenda
What is BigData and Hadoop?

Hadoop Architecture

HDFS

MapReduce

Installing Hadoop

Develop & Run a MapReduce Program

Hadoop Ecosystems

Page 3: Hadoop and BigData - July 2016

Introduction

Page 4: Hadoop and BigData - July 2016

Data
Structured – relational databases, library catalogues (date, author, place, subject, etc.)
Semi-structured – CSV, XML, JSON, NoSQL databases (see the JSON sketch below)
Unstructured
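A purely illustrative sketch of a catalogue-style record carried as semi-structured JSON: the fields describe themselves, but no fixed schema is enforced (all names and values below are made up).

{
  "title": "Sample Book",
  "author": "Jane Doe",
  "date": "2016-07-01",
  "place": "Chennai",
  "subjects": ["Hadoop", "BigData"]
}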

Page 5: Hadoop and BigData - July 2016

Unstructured Data
Machine Generated – satellite images, scientific data, photographs and video, radar or sonar data.

Human Generated – Word, PDF, text, social media data (Facebook, Twitter, LinkedIn), mobile data (text messages), website content (blogs, Instagram).

Page 6: Hadoop and BigData - July 2016

Storage

Page 7: Hadoop and BigData - July 2016

Key Terms
Commodity Hardware – inexpensive, off-the-shelf PCs/servers that can be used to form clusters.

Node – a commodity server interconnected with others through network devices.

NameNode = Master Node, DataNode = Slave Node.

Cluster – an interconnection of different nodes/systems in a network.

Page 8: Hadoop and BigData - July 2016

BigData

Page 9: Hadoop and BigData - July 2016
Page 10: Hadoop and BigData - July 2016

BigData
Traditional approaches are not fit for data analysis because of the explosive growth of data.

Handling large volumes of data (petabytes and even zettabytes) that are structured or unstructured.

Datasets that grow so large that it is difficult to capture, store, manage, share, analyze and visualize them with typical database software tools.

Generated by different sources around us, such as systems, sensors and mobile devices.

2.5 quintillion bytes of data are created every day.

80–90% of the data in the world today has been created in the last two years alone.

Page 11: Hadoop and BigData - July 2016

Flood of Data
More than 3 billion internet users in the world today.

The New York Stock Exchange generates about 4–5 TB of data per day.

7 TB of data are processed by Twitter every day.

10 TB of data are processed by Facebook every day, growing at 7 PB per month.

Interestingly, 80% of these data are unstructured.

With this massive quantity of data, businesses need fast, reliable, deeper data insight.

Therefore, BigData solutions based on Hadoop and other analytics software are becoming more and more relevant.

Page 12: Hadoop and BigData - July 2016

Dimensions of BigData

Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Velocity – Often time-sensitive, big data must be used as it is streaming into the enterprise in order to maximize its value to the business.

Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.

Page 13: Hadoop and BigData - July 2016

BigData Benefits
Analyze markets and derive new strategies to improve business across different geo locations.

Know the response to campaigns, promotions, and other advertising media.

Use the medical history of patients to help hospitals provide better and quicker service.

Re-develop your products.

Perform risk analysis.

Create new revenue streams.

Reduce maintenance costs.

Faster, better decision making.

New products & services.

Page 14: Hadoop and BigData - July 2016

Hadoop

Page 15: Hadoop and BigData - July 2016
Page 16: Hadoop and BigData - July 2016

Hadoop
Inspired by the Google File System paper (2003).

Developed by Doug Cutting at Yahoo!.

Hadoop 0.1.0 was released in April 2006.

Open source project of the Apache Software Foundation.

A framework written in Java.

Distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

The name comes from a toy elephant belonging to Doug Cutting's son.

Page 17: Hadoop and BigData - July 2016

Hardware & Software Hardware (commodity hardware)

Software OS

RedHat Enterprise Linux (RHEL)

CentOS

Ubuntu

Java Oracle JDK 1.6 (v 1.6.31)

Medium HighCPU 8 physical cores 12 physical coresMemory 16 GB 48 GBDisk 4 disks x 1TB = 4 TB 12 disks x 3TB = 36 TBNetwork 1 GB Ethernet 10 GB Ethernet or Infiniband

Page 18: Hadoop and BigData - July 2016

When Hadoop?
When you must process lots of unstructured data.

When your processing can easily be made parallel.

When running batch jobs is acceptable.

When you have access to lots of cheap hardware.

Page 19: Hadoop and BigData - July 2016

Hadoop Distributions

Cloudera – http://www.cloudera.com/downloads/

Hortonworks – http://hortonworks.com/downloads/

MapR – https://www.mapr.com/products/hadoop-download

Pivotal HDB – http://pivotal.io/big-data/pivotal-hdb

IBM BigInsights – http://www.ibm.com/developerworks/downloads/im/biginsightsquick/

Page 20: Hadoop and BigData - July 2016

Hadoop Architecture

Page 21: Hadoop and BigData - July 2016

Hadoop Core Components

Page 22: Hadoop and BigData - July 2016

Hadoop Configurations
Standalone Mode

All Hadoop services run in a single JVM on a single machine.

Pseudo-Distributed Mode

Each Hadoop service runs in its own JVM, but on a single machine.

Fully Distributed Mode

Hadoop services run in individual JVMs, and the JVMs reside on separate machines in a single cluster.

Page 23: Hadoop and BigData - July 2016

Hadoop Core Services
NameNode

Secondary NameNode

DataNode

ResourceManager

ApplicationMaster

NodeManager

Page 24: Hadoop and BigData - July 2016

How does Hadoop work?
Stage 1
The user submits the job, giving the locations of the input and output files in HDFS and the JAR file of the MapReduce program.

The job is configured by setting different parameters specific to the job.

Stage 2
The Hadoop job client submits the job and its configuration to the JobTracker.
The JobTracker distributes the work to the TaskTrackers, which run on the slave nodes.
The JobTracker schedules the tasks and monitors them, providing status and diagnostic information to the job client.

Stage 3
The TaskTrackers execute the tasks as per the MapReduce implementation. The input is processed and the output is stored in HDFS.

Page 25: Hadoop and BigData - July 2016

Hadoop Cluster

Page 26: Hadoop and BigData - July 2016

HDFS

Page 27: Hadoop and BigData - July 2016

Hadoop Distributed File System (HDFS)
Java-based file system to store large volumes of data.

Scales up to 200 PB of storage in a single cluster of 4,500 servers.

Supports close to a billion files and blocks.

Access
Java API
Python/C bindings for non-Java applications
Web GUI over HTTP
FS Shell – shell-like commands that interact directly with HDFS
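A minimal sketch of the Java API access route (the HDFS path below is illustrative; FileSystem, Path and IOUtils are part of the standard Hadoop API):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();             // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);                  // connects to the file system named in fs.default.name
    InputStream in = null;
    try {
      in = fs.open(new Path("/user/ranjith/sample.txt"));  // open an HDFS file for reading (illustrative path)
      IOUtils.copyBytes(in, System.out, 4096, false);      // stream its contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}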

Page 28: Hadoop and BigData - July 2016

HDFS Features
HDFS can handle large data sets.

Since HDFS deals with large scale data, it supports a multitude of machines.

HDFS provides a write-once-read-many access model.

HDFS is built using the Java language making it portable across various platforms.

Fault Tolerance and availability are high.

Page 29: Hadoop and BigData - July 2016

HDFS Architecture

Page 30: Hadoop and BigData - July 2016

File Storage in HDFS
Files are split into multiple blocks/chunks and stored on different machines.

Blocks – 64 MB size (default), 128 MB (recommended).

Replication – for fault tolerance and availability; the replication factor is configurable and can be modified.

No storage space is wasted: e.g., a 420 MB file is stored as six 64 MB blocks plus one 36 MB block, and the final block occupies only 36 MB on disk rather than a full 64 MB.
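As an illustrative sketch (property names are from classic Hadoop 1.x; the values are examples, not recommendations), the block size and replication factor can be overridden in hdfs-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.block.size</name>   <!-- block size in bytes; 134217728 = 128 MB -->
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>  <!-- number of replicas kept for each block -->
    <value>3</value>
  </property>
</configuration>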

Page 31: Hadoop and BigData - July 2016

NameNode
One per Hadoop cluster; acts as the master server.

Runs the NameNode software on commodity hardware with the Linux operating system.

Responsible for:

Managing the file system namespace.

Regulating clients' access to files.

Executing file system operations such as renaming, closing, and opening files and directories.

Page 32: Hadoop and BigData - July 2016

Secondary NameNode
The NameNode holds the file system metadata in RAM.

The Secondary NameNode contacts the NameNode periodically and takes a copy of the metadata out of the NameNode.

When the NameNode crashes, the metadata can be restored from the Secondary NameNode's copy.

Page 33: Hadoop and BigData - July 2016

DataNode
Many per Hadoop cluster.

Uses inexpensive commodity hardware.

Stores the actual data.

Performs read/write operations on files based on client requests.

Performs block creation, deletion, and replication according to the instructions of the NameNode.

Page 34: Hadoop and BigData - July 2016

HDFS Command Line Interface
View existing files

Copy files from local (copyFromLocal / put)

Copy files to local (copyToLocal / get)

Reset replication
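A minimal sketch of the corresponding commands (paths are illustrative; on newer releases hdfs dfs replaces hadoop fs):

> hadoop fs -ls /user/ranjith/
> hadoop fs -put localfile.txt /user/ranjith/            (equivalent to -copyFromLocal)
> hadoop fs -get /user/ranjith/localfile.txt .           (equivalent to -copyToLocal)
> hadoop fs -setrep -w 3 /user/ranjith/localfile.txt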

Page 35: Hadoop and BigData - July 2016

HDFS Operation Principle

Page 36: Hadoop and BigData - July 2016

MapReduce

Page 37: Hadoop and BigData - July 2016

MapReduce
The heart of Hadoop.

Programming model/algorithm for data processing.

Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).

MapReduce programs are inherently parallel.

Master-slave model.

Mapper – performs filtering and sorting.

Reducer – performs a summary (aggregation) operation.
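A small worked illustration, using the sample input files that appear later in this deck, of the (key, value) flow between the phases:

map("Hello World Bye World")       -> (Hello,1) (World,1) (Bye,1) (World,1)
map("Hello Hadoop Goodbye Hadoop") -> (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1)
shuffle/sort groups values by key  -> (Bye,[1]) (Goodbye,[1]) (Hadoop,[1,1]) (Hello,[1,1]) (World,[1,1])
reduce sums each list              -> (Bye,1) (Goodbye,1) (Hadoop,2) (Hello,2) (World,2)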

Page 38: Hadoop and BigData - July 2016

MapReduce Architecture

Page 39: Hadoop and BigData - July 2016

Job Tracker
One per Hadoop cluster.

Controls the overall execution of a MapReduce program.

Manages the TaskTrackers running on the DataNodes.

Tracks available and utilized resources.

Tracks the running jobs and provides fault tolerance.

Receives periodic heartbeats from the TaskTrackers (typically every few seconds).

Page 40: Hadoop and BigData - July 2016

Task Tracker
Many per Hadoop cluster.

Executes and manages the individual tasks assigned by the JobTracker.

Reports periodic status to the JobTracker about the execution of its tasks.

Handles the data motion between the map() and reduce() phases.

Notifies the JobTracker if any task fails.

Page 41: Hadoop and BigData - July 2016

MapReduce Engine

Page 42: Hadoop and BigData - July 2016

Hadoop Installation

Page 43: Hadoop and BigData - July 2016

Installing Hadoop
Prerequisites & Installation
Download: http://hadoop.apache.org/releases.html

> tar xzf hadoop-x.y.z.tar.gz
> export JAVA_HOME=/user/software/java6/
> export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
> export PATH=$PATH:$HADOOP_INSTALL/bin
> hadoop version
Hadoop 0.20.0

Page 44: Hadoop and BigData - July 2016

Pseudo-Distributed Mode Configuration

core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Formatting HDFS
> hadoop namenode -format

Start HDFS & MapReduce
> start-dfs.sh
> start-mapred.sh

Stop HDFS & MapReduce
> stop-dfs.sh
> stop-mapred.sh
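As a quick sanity check (a sketch; daemon names are for classic Hadoop 1.x, and the process IDs shown are illustrative), jps from the JDK should list the running daemons once everything has started:

> jps
12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 JobTracker
12349 TaskTracker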

Page 45: Hadoop and BigData - July 2016

Develop & Run a MapReduce Program

Page 46: Hadoop and BigData - July 2016

Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();                      // one line of input text
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);                          // emit (word, 1) for every token
    }
  }
}

Page 47: Hadoop and BigData - July 2016

Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {                    // sum all counts emitted for this word
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));             // emit (word, total count)
  }
}

Page 48: Hadoop and BigData - July 2016

Main Program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");                 // configure the word count job
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    job.waitForCompletion(true);                          // submit and wait for the job to finish
  }
}

Page 49: Hadoop and BigData - July 2016

Input Data
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/

/user/ranjith/mapreduce/input/file01

/user/ranjith/mapreduce/input/file02

$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01

Hello World Bye World

$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02

Hello Hadoop Goodbye Hadoop

Page 50: Hadoop and BigData - July 2016

Run
Create the JAR: WordCount.jar

Run command:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output

Output
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-00000

Bye 1

Goodbye 1

Hadoop 2

Hello 2

World 2

Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html

Page 51: Hadoop and BigData - July 2016

Hadoop Ecosystem

Page 52: Hadoop and BigData - July 2016

Hadoop Ecosystem
HDFS & MapReduce

Ambari – provisioning, managing, and monitoring Apache Hadoop clusters.

Pig – scripting language for writing MapReduce programs.

Mahout – scalable, commercial-friendly machine learning for building intelligent applications.

Hive – SQL-like data warehouse layer (with a metastore) for querying HDFS data.

HBase – open source, non-relational, distributed database.

Sqoop – CLI application for transferring data between relational databases and Hadoop.

ZooKeeper – distributed configuration service, synchronization service, and naming registry for large distributed systems.

Oozie – workflow scheduler to define and manage Hadoop jobs.