
PROJECT REPORT ON HADOOP


CHAPTER-1

HADOOP ECOSYSTEM 2.X.X

With the rapid evolution of Big Data, its processing frameworks are evolving just as quickly. Hadoop has progressed from the restricted, batch-oriented MapReduce processing model of Hadoop 1.0 to the specialized and interactive processing models of Hadoop 2.0. With the advent of Hadoop 2.0, organizations can build data-crunching methodologies within Hadoop that were not possible under the architectural limitations of Hadoop 1.0. This chapter gives readers an insight into the new Hadoop 2.0 (YARN) and helps them understand the need to move from Hadoop 1.0 to Hadoop 2.0.

Evolution of Hadoop 2.0 (YARN) - The Swiss Army Knife of Big Data

Since its introduction in 2005 to support distributed processing of large-scale data workloads across clusters through the MapReduce processing engine, Hadoop has undergone a major overhaul. The result is a better, more advanced Hadoop framework that does not merely support MapReduce but also supports various other distributed processing models.

Large web companies such as Yahoo and Facebook that adopted Apache Hadoop had to depend on the pairing of HDFS with its resource management environment and the MapReduce programming model. These technologies collectively enabled users to manage processes and store huge amounts of structured, semi-structured, and unstructured data within Hadoop clusters. Nevertheless, there were certain intrinsic drawbacks to the Hadoop-MapReduce pairing. For instance, many users of Apache Hadoop found that Hadoop 1.0 could not keep up with the flood of information they were collecting online, because of the batch-processing arrangement of MapReduce.


Figure: - 1

What Is Hadoop?

To get started, let's look at a simple definition of the tool that the utilities we'll discuss support. Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management system with scale-out storage and distributed processing. It is designed with big data in mind and is ideal for handling large amounts of information.

The Hadoop Ecosystem

The Hadoop ecosystem consists of tools for data analysis, moving large amounts of structured and unstructured data, data processing, querying, storage, and other similar data-oriented processes. Each of these utilities serves a unique purpose and is geared toward a different task or user role in interacting with Hadoop.


Data Storage

HDFS (Hadoop Distributed File System) is the key storage component of Hadoop. HDFS is used to store and access huge files based on a client/server architecture. It also enables the distribution and storage of data across Hadoop clusters.

HBase (Hadoop Database) is a columnar database built on top of HDFS. Being a file system, HDFS lacks random read and write capability; this is where HBase steps in, providing fast record lookups in large tables.
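To illustrate the client/server access model described above, here is a minimal sketch using the standard HDFS Java API (org.apache.hadoop.fs.FileSystem) to write and then read back a small file. The fs.defaultFS address and the file path are placeholder values chosen for this example, not values from the report.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; the namenode records the metadata,
        // the datanodes store the actual blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem handle.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}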

Data Processing

MapReduce is a framework for parallel data processing over clusters. Using MapReduce can save a data analyst a lot of time; for example, if it takes a typical relational database around 20 hours to process a large data set, MapReduce might need only around three minutes to get everything done.

YARN (Yet Another Resource Negotiator) is a resource manager. It is said to be the second generation of MapReduce and is a critical advancement over Hadoop 1. YARN plays the role of an operating system for the cluster: its job is to manage and monitor workloads, make sure the cluster can serve multiple clients, and enforce security controls. In addition, YARN supports new processing models that MapReduce does not.

Data Access

Hive offers a new kind of structured query language. It was created to help those who are familiar with traditional databases and SQL to leverage Hadoop and MapReduce.
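As a rough sketch of how an SQL-savvy user could query Hive from Java, the snippet below uses the HiveServer2 JDBC driver. The connection URL, credentials, and the weather table are placeholder assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder URL and credentials; a real cluster would differ.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // 'weather' is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery(
                 "SELECT station, MAX(temperature) FROM weather GROUP BY station")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}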

Pig serves the purpose of analyzing large data sets. Pig is made up of two components: first, the platform to execute Pig programs; second, a powerful and simple scripting language called Pig Latin, which is used to write those programs.

Mahout provides a library of the most popular machine learning algorithms written in Java that

supports collaborative filtering, clustering, and classification.

Avro is a data serialization system. It uses JSON for defining data types and protocols to support data-driven applications. Avro provides simple integration with many different languages, with the aim of allowing Hadoop applications to be written in languages other than Java (e.g. Python or C++).


Sqoop (SQL + Hadoop = Sqoop) is a command line interface application, which helps transfer

data between Hadoop and relational databases (e.g. MySQL or Oracle) or mainframes.

Data Management

Oozie is a workflow scheduler for Hadoop. Oozie streamlines the process of creating workflows and managing coordination jobs among Hadoop and other applications such as MapReduce, Pig, Sqoop, and Hive. The main responsibilities of Oozie are, first, to define a sequence of actions to be executed and, second, to place triggers for those actions.

Chukwa is another framework built on top of HDFS and MapReduce. Its purpose is to provide a dynamic and powerful data collection system. Chukwa is capable of monitoring, analyzing, and presenting the results to get the most out of the collected data.

Flume is likewise a scalable and reliable system for collecting and moving cluster logs from various sources to a centralized store, much like Chukwa. However, there are some differences: in Flume, chunks of data are transferred from node to node in a store-and-forward manner, while in Chukwa the agent on each machine determines what data to send.

ZooKeeper is a distributed coordination service for distributed systems. It provides a very simple programming interface and helps reduce management complexity by providing services such as configuration management, distributed synchronization, naming, and group services.
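To show how simple that programming interface is, here is a minimal sketch using the ZooKeeper Java client to publish and read back a small configuration value. The connection string and the znode path are placeholders chosen for this example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; a real ensemble would list its own hosts.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Publish a small piece of configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "value=1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any other process in the cluster can now read the same znode.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}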


CHAPTER-2

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

The Hadoop Distributed File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and is designed to run on low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, the files are spread across multiple machines. They are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

It is suitable for distributed storage and processing.

Hadoop provides a command interface to interact with HDFS.

The built-in servers of the namenode and datanode help users easily check the status of the cluster.

Streaming access to file system data.

HDFS provides file permissions and authentication.

HDFS Architecture

Namenode

The namenode runs on commodity hardware with the GNU/Linux operating system and the namenode software; the software itself can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks:

Manages the file system namespace.

Regulates client’s access to files.

It also executes file system operations such as renaming, closing, and opening files and directories.


Figure: - 2

Datanode

The datanode is a commodity machine running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.

Datanodes perform read-write operations on the file systems, as per client request.

They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), but it can be changed as needed in the HDFS configuration.
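As a small illustration of how the block size can be changed per client, the sketch below sets the dfs.blocksize property on a Hadoop Configuration object before creating a file. The 128 MB value and the file path are example choices; cluster-wide defaults would normally be set in hdfs-site.xml instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 128 MB blocks for files created by this client
        // (the value must be a multiple of the checksum chunk size, usually 512 bytes).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Example path; data written through this stream is split into 128 MB blocks.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/bigfile.dat"))) {
            out.writeBytes("data split into 128 MB blocks as it grows");
        }
    }
}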


Goals of HDFS

Fault detection and recovery: Since HDFS is built from a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.

Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.


CHAPTER-3

MAPREDUCE

MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

o Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

o Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.


During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

Most of the computing takes place on nodes with data on local disks, which reduces network traffic.

After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

Figure: - 3


Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

            Input               Output

Map         <k1, v1>            list(<k2, v2>)

Reduce      <k2, list(v2)>      list(<k3, v3>)

Figure: - 4
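To make the <key, value> flow above concrete, here is a minimal WordCount sketch using the standard org.apache.hadoop.mapreduce API: the mapper emits <word, 1> pairs (<k2, v2>), and the reducer sums the list of counts for each word (<k2, list(v2)> -> <k3, v3>). The class and job names are illustrative, not taken from the report's project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: <line offset, line text> -> list of <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: <word, list(1)> -> <word, total count>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}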

Terminology

PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.

Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.

NameNode - Node that manages the Hadoop Distributed File System (HDFS).

DataNode - Node where the data is present in advance, before any processing takes place.

MasterNode - Node where the JobTracker runs and which accepts job requests from clients.

SlaveNode - Node where the Map and Reduce programs run.

JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.

TaskTracker - Tracks the task and reports status to the JobTracker.

Job - An execution of a Mapper and Reducer across a dataset.

Task - An execution of a Mapper or a Reducer on a slice of data.

Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.


CHAPTER-4

PROJECT

Aim:- Temperature Data Analysis.

The National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and providing public access to weather data. A log file is created to store all this information. This file includes various types of climate data, such as temperature, wind speed and direction, information about cyclones, and weather changes; the temperature of each day is also recorded. Through this project we analyze the temperature variation over whole months and years. With the help of the MapReduce technique we can calculate the highest and lowest temperature, or the hottest and coolest day, of a month or year. After going through the WordCount MapReduce guide, you now have a basic idea of how a MapReduce program works. So, let us look at a more complex MapReduce program on a weather dataset. Here I am using one of the datasets for the year 2015 for Austin, Texas. We will analyze the dataset and classify whether each day was a hot day or a cold day depending on the temperature recorded by the NCDC.

The NCDC provides all the weather data we need for this MapReduce project.

The dataset which we will be using looks like the snapshot below.

Figure: - 5
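Before walking through the build and run steps, here is a minimal sketch of what a temperature-classification mapper and reducer might look like. Since the exact layout of the NCDC file in the snapshot is not reproduced here, the sketch assumes each input line contains a date followed by the day's maximum temperature in Fahrenheit, separated by whitespace, and uses arbitrary 85°F/60°F thresholds for "hot" and "cold"; the field positions and thresholds would need to be adjusted to the real dataset.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TempClassifier {

    // Map: one input line -> <date, "Hot Day <temp>"> or <date, "Cold Day <temp>">
    public static class TempMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: "<date> <maxTempF>" separated by whitespace.
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length < 2) {
                return; // skip malformed lines
            }
            try {
                String date = fields[0];
                float maxTemp = Float.parseFloat(fields[1]);
                if (maxTemp > 85.0f) {
                    context.write(new Text(date), new Text("Hot Day " + maxTemp));
                } else if (maxTemp < 60.0f) {
                    context.write(new Text(date), new Text("Cold Day " + maxTemp));
                }
            } catch (NumberFormatException e) {
                // skip header or non-numeric lines
            }
        }
    }

    // Reduce: simply forward the classification for each date to the output.
    public static class TempReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "temperature classification");
        job.setJarByClass(TempClassifier.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(TempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}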

Step 1:- Import the project into the Eclipse IDE.

Step 2:- Once the project has no errors, we will export it as a JAR file, the same as we did in the WordCount MapReduce guide. Right-click on the project file, click on Export, and select JAR file.


Figure: - 6

Give the path where we want to save the file.


Figure: - 7

Select the main class file by clicking on Browse.


Figure: - 8


Click on Finish to complete the export.

Figure: - 9

Step 3:- Before running the MapReduce program, check that your cluster is up and that all the Hadoop daemons are running (for example, by verifying that the jps command lists them).


Figure: - 10

Step 4:- Put the input file onto HDFS.


Command :- hdfs dfs -put download/inputfile.txt

Figure: - 11

Step 5:- Run the JAR file.


Command :- hadoop jar temp.jar /weather-data.txt /output

Figure: - 12


References:-

https://www.ncdc.noaa.gov/

https://www.tutorialspoint.com/

http://hadoop.apache.org/
