
Hadoop and BigData

Ranjith Sekar

July 2016

Agenda

What is BigData and Hadoop?

Hadoop Architecture

HDFS

MapReduce

Installing Hadoop

Develop & Run a MapReduce Program

Hadoop Ecosystems

Introduction

Data

Structured

Relational databases, library catalogues (date, author, place, subject, etc.)

Semi-structured

CSV, XML, JSON, NoSQL databases

Unstructured

Machine-generated: satellite images, scientific data, photographs and video, radar or sonar data

Human-generated: Word, PDF and text documents, social media data (Facebook, Twitter, LinkedIn), mobile data (text messages), website contents (blogs, Instagram)

Storage

Key Terms

Commodity hardware – inexpensive, standard PCs/servers that can be used to form clusters.

Node – a commodity server interconnected with others through network devices.

NameNode = Master Node, DataNode = Slave Node

Cluster – interconnection of different nodes/systems in a network.

BigData

BigData

Traditional approaches are no longer fit for data analysis because of the explosive growth of data.

Involves handling large volumes of data (petabytes and even zettabytes), both structured and unstructured.

Datasets grow so large that it becomes difficult to capture, store, manage, share, analyze and visualize them with typical database software tools.

Generated by many different sources around us, such as systems, sensors and mobile devices.

2.5 quintillion bytes of data are created every day.

80-90% of the data in the world today has been created in the last two years alone.

Flood of Data

More than 3 billion internet users in the world today.

The New York Stock Exchange generates about 4-5 TB of data per day.

7TB of data are processed by Twitter every day.

10TB of data are processed by Facebook every day and growing at 7 PB per month.

Interestingly, about 80% of this data is unstructured.

With this massive quantity of data, businesses need fast, reliable, deeper data insight.

Therefore, BigData solutions based on Hadoop and other analytics software are becoming more and more relevant.

Dimensions of BigData

Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Velocity – Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.

Variety – Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files and more.

BigData Benefits

Analyze markets and derive new strategies to improve business in different geographic locations.

Measure the response to campaigns, promotions and other advertising media.

Use patients' medical histories to help hospitals provide better and quicker service.

Redevelop products.

Perform risk analysis.

Create new revenue streams.

Reduce maintenance costs.

Faster, better decision making.

New products & services.

Hadoop

Hadoop

Inspired by the Google File System paper (2003).

Developed by Doug Cutting (Yahoo!).

Hadoop 0.1.0 was released in April 2006.

Open source project of the Apache Software Foundation.

A Framework written in Java.

Distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

The name "Hadoop" came from a toy elephant belonging to Doug Cutting's son.

Hardware & Software

Hardware – commodity hardware (typical node configurations are listed below).

Software

OS:

RedHat Enterprise Linux (RHEL)

CentOS

Ubuntu

Java: Oracle JDK 1.6 (e.g. 1.6.0_31)

Medium node: CPU 8 physical cores, Memory 16 GB, Disk 4 disks x 1 TB = 4 TB, Network 1 Gb Ethernet

High-end node: CPU 12 physical cores, Memory 48 GB, Disk 12 disks x 3 TB = 36 TB, Network 10 Gb Ethernet or InfiniBand

When Hadoop?

When you must process lots of unstructured data.

When your processing can easily be made parallel.

When running batch jobs is acceptable.

When you have access to lots of cheap hardware.

Hadoop Distributions

http://www.cloudera.com/downloads/

http://hortonworks.com/downloads/

https://www.mapr.com/products/hadoop-download

http://pivotal.io/big-data/pivotal-hdb

http://www.ibm.com/developerworks/downloads/im/biginsightsquick/

Hadoop Architecture

Hadoop Core Components

Hadoop Configurations

Standalone Mode

All Hadoop services run in a single JVM on a single machine.

Pseudo-Distributed Mode

Each Hadoop service runs in its own JVM, but all on a single machine.

Fully Distributed Mode

Hadoop services run in individual JVMs, and the JVMs reside on separate machines within a single cluster.

Hadoop Core Services

NameNode

Secondary NameNode

DataNode

ResourceManager

ApplicationMaster

NodeManager

How does Hadoop work?

Stage 1

The user submits the job, specifying the locations of the input and output files in HDFS and the jar file containing the MapReduce program.

The job is configured by setting the various parameters specific to that job.

Stage 2

The Hadoop job client submits the job and its configuration to the JobTracker. The JobTracker distributes the work to TaskTrackers running on the slave nodes, schedules the tasks, and monitors them, providing status and diagnostic information to the job client.

Stage 3

The TaskTrackers execute the tasks as per the MapReduce implementation; the input is processed and the output is stored in HDFS.

Hadoop Cluster

HDFS

Hadoop Distributed File System (HDFS)

A Java-based file system for storing large volumes of data.

Scales up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks.

Access

Java API (see the sketch below)

Python/C bindings for non-Java applications

Web GUI through HTTP

FS shell – shell-like commands that interact directly with HDFS
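A minimal sketch of reading an HDFS file through the Java API, assuming the Hadoop configuration files are on the classpath; the class name HdfsCat and its argument handling are illustrative, not part of the original deck:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(conf);            // handle to the configured file system (HDFS)
    InputStream in = null;
    try {
      in = fs.open(new Path(args[0]));               // HDFS path passed on the command line
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

With the compiled class on HADOOP_CLASSPATH it could be run, for example, as: hadoop HdfsCat /user/ranjith/mapreduce/input/file01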

HDFS Features

HDFS can handle large data sets.

Since HDFS deals with large-scale data, it scales out across a large number of machines.

HDFS provides a write-once-read-many access model.

HDFS is built using the Java language making it portable across various platforms.

Fault Tolerance and availability are high.

HDFS Architecture

File Storage in HDFS

Files are split into multiple blocks/chunks and stored on different machines.

Blocks – 64 MB (default), 128 MB (recommended).

Replication – provides fault tolerance and availability; the replication factor is configurable and can be modified.

No storage space is wasted, because the last block only occupies as much space as the data it holds. E.g. a 420 MB file is stored as six full 64 MB blocks plus one 36 MB block, each replicated across DataNodes.
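To see this split in practice, the Java API can report a file's block layout. A small sketch (the file path and class name are hypothetical; assumes a running HDFS):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/ranjith/bigfile.dat");    // hypothetical 420 MB file
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {                  // one entry per block (7 blocks for 420 MB at 64 MB each)
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts())); // DataNodes holding the replicas
    }
  }
}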

NameNode

One per Hadoop cluster; acts as the master server.

The NameNode software runs on commodity hardware with a Linux operating system.

Responsible for

Manages the file system namespace.

Regulates client’s access to files.

Executes file system operations such as renaming, closing, and opening files and directories.

Secondary NameNode

The NameNode holds the file system metadata (job and data details) in RAM.

The Secondary NameNode contacts the NameNode periodically and takes a copy of this metadata.

If the NameNode crashes, the metadata can be restored from the Secondary NameNode's copy.

DataNode

Many per Hadoop cluster.

Uses inexpensive commodity hardware.

Contains actual data.

Performs read/write operations on files based on client requests.

Performs block creation, deletion, and replication according to the instructions of the NameNode.

HDFS Command Line Interface

View existing files (ls)

Copy files from local (copyFromLocal / put)

Copy files to local (copyToLocal / get)

Reset replication (setrep) – examples below
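A few example invocations (the paths and the replication factor are illustrative):

$ hadoop fs -ls /user/ranjith
$ hadoop fs -put localfile.txt /user/ranjith/localfile.txt     (same effect as -copyFromLocal)
$ hadoop fs -get /user/ranjith/localfile.txt localcopy.txt     (same effect as -copyToLocal)
$ hadoop fs -setrep -w 2 /user/ranjith/localfile.txt           (change the file's replication factor to 2)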

HDFS Operation Principle

MapReduce

MapReduce

The heart of Hadoop.

Programming model/algorithm for data processing.

Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).

MapReduce programs are inherently parallel.

Master-slave model.

Mapper – performs filtering and sorting.

Reducer – performs a summary (aggregation) operation.

MapReduce Architecture

JobTracker

One per Hadoop cluster.

Controls overall execution of MapReduce Program.

Manages the Task Tracker running on Data Node.

Tracking of available & utilized resources.

Tracks the running jobs and provides fault tolerance.

Receives a heartbeat from each TaskTracker every few seconds.

TaskTracker

Many per Hadoop cluster.

Executes and manages the individual tasks assigned by Job Tracker.

Sends periodic status reports to the JobTracker about the execution of its tasks.

Handles the data motion between map() and reduce().

Notifies the JobTracker if any task fails.

MapReduce Engine

Hadoop Installation

Installing Hadoop

Prerequisites & Installation

Download : http://hadoop.apache.org/releases.html

> tar xzf hadoop-x.y.z.tar.gz

> export JAVA_HOME=/user/software/java6/

> export HADOOP_INSTALL=/home/tom/hadoop-x.y.z

> export PATH=$PATH:$HADOOP_INSTALL/bin

> hadoop version

Hadoop 0.20.0

Pseudo-Distributed Mode Configuration

core-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
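The start/stop scripts below use SSH to launch the daemons, so pseudo-distributed mode also needs passphraseless SSH to localhost. A common setup (assuming an OpenSSH server is installed and running) is:

> ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
> ssh localhost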

Format HDFS

> hadoop namenode -format

Start HDFS & MapReduce

> start-dfs.sh

> start-mapred.sh

Stop HDFS & MapReduce

> stop-dfs.sh

> stop-mapred.sh
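To verify that the daemons came up, the JDK's jps tool lists the running Java processes; on a healthy pseudo-distributed setup it should show NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker (each with its process id):

> jps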

Develop & Run a MapReduce Program

Mapper

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

context.write(word, one);

}
}
}

Reducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

// Note: the new (org.apache.hadoop.mapreduce) API passes an Iterable, not an Iterator, of values.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {

int sum = 0;

// sum all the counts emitted for this word
for (IntWritable value : values) {

sum += value.get();

}

context.write(key, new IntWritable(sum));

}

}

Main Program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
job.waitForCompletion(true);

}
}

Input Data

$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/

/user/ranjith/mapreduce/input/file01

/user/ranjith/mapreduce/input/file02

$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01

Hello World Bye World

$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02

Hello Hadoop Goodbye Hadoop

Run

Create the jar, WordCount.jar (one way to build it is sketched below).
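A build sketch, assuming the three source files declare package jbr.hadoopex and sit under jbr/hadoopex/ (to match the class name used in the run command below), and that the Hadoop core jar from the installation provides the compile-time classpath (the jar name and version are assumptions):

> mkdir classes
> javac -classpath $HADOOP_INSTALL/hadoop-core-x.y.z.jar -d classes jbr/hadoopex/*.java
> jar cvf WordCount.jar -C classes .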

Run command

> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output

Output

$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-r-00000

Bye 1

Goodbye 1

Hadoop 2

Hello 2

World 2

Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html

Hadoop Ecosystem

Hadoop Ecosystem

HDFS & MapReduce

Ambari - provisioning, managing, and monitoring Apache Hadoop clusters.

Pig – high-level scripting language (Pig Latin) whose scripts are compiled into MapReduce jobs.

Mahout – scalable, commercially friendly machine-learning library for building intelligent applications.

Hive – data warehouse infrastructure providing SQL-like queries (HiveQL) over data in HDFS, backed by a metastore.

HBase – open-source, non-relational, distributed database that runs on top of HDFS.

Sqoop – CLI application for transferring data between relational databases and Hadoop.

ZooKeeper – distributed configuration service, synchronization service, and naming registry for large distributed systems.

Oozie – workflow scheduler used to define and manage workflows of Hadoop jobs.

Queries?

http://www.slideshare.net/java2ranjith

java2ranjith@gmail.com
