
A DIVE INTO BIG DATA AND ITS SOLUTION USING HADOOP

    BY

    G.LOUIS AROKIARAJ

    B.TECH CSE (IV YEAR)

NATIONAL INSTITUTE OF TECHNOLOGY, PUDUCHERRY


HOW IS IT MORE EFFICIENT THAN DATA WAREHOUSE ANALYSIS?

Data warehouses store current as well as historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.

The drawback of a data warehouse is that when such a large quantity of data floods into the system, it is not able to process all of that data. It is also more expensive. Some data warehouse solutions are INFORMATICA and TERADATA.

    SOURCES OF BIG DATA

Social media: Facebook, Twitter, Google+, Orkut.
Stock market: risk analysis.
Health care: patient details, diagnosis, prescriptions, medicines, reports.
Information technology companies: employee details, statistics.
E-commerce: recommendations.

The Indian Government is trying to implement big data analysis on tax revenues to increase the economy of the country.

    STATISTICS

In the present data world about 90% of data is unstructured; only the remaining 10% is structured. In the last two years there has been an immense increase in the quantity of data because of various factors like online shopping, Facebook, Twitter, etc.

In one day, about 2.2 million pieces of data are created.

In 2010, the big data market was valued at $3.2 billion. By 2016, the big data market is expected to grow to $16.9 billion.

[Figure: huge data is lost and left unprocessed.]


HOW TO SOLVE THIS?

Here comes the solution: HADOOP.

Hadoop is one of the solutions for big data analysis. Other big data technologies are:

NoSQL databases: Cassandra, MongoDB, etc.
Search tools: Lucene, Elasticsearch, etc.
Stream processing: Storm, S4, etc.
Others: Kafka, Thrift, Scribe, etc.

    WHY HADOOP?

Flexible: Hadoop can process all three types of data (structured, semi-structured and unstructured). Hadoop also supports various languages like Perl, Python, Java and SQL through the Hadoop Streaming API, so Hadoop is not restricted to Java experts alone (a sample streaming invocation follows this list).
Scale-out architecture.
Builds a more efficient data economy.
Robust ecosystem.
Cost effective.
Hadoop is getting cloudy too (cloud deployments are available).
Hadoop focuses on moving code to data instead of data to code.
Hadoop supports OLAP (online analytical processing) but not OLTP (online transaction processing).
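As a rough sketch of the Streaming API mentioned above: in Hadoop 1.x the streaming jar normally ships under $HADOOP_HOME/contrib/streaming/, while the input/output paths and the mapper.py/reducer.py scripts here are hypothetical placeholders.

    $ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.2.jar \
          -input /user/ubunutu/inputdata \
          -output /user/ubunutu/streamout \
          -mapper mapper.py \
          -reducer reducer.py \
          -file mapper.py -file reducer.py

The -file options ship the scripts to the cluster nodes, so the map and reduce logic can be written in any language that reads from standard input and writes to standard output.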

    HADOOP OVER TRADITIONAL SYSTEMS

HADOOP ANALYSIS

Hadoop is an open source framework which allows for distributed processing of large data across clusters of computers using a simple programming model.

When the data keeps growing and the rack architecture is not able to take it up, we can simply add or replace a cheap commodity machine in the rack and continue the execution.

Scale-out architecture.

TRADITIONAL SYSTEM ANALYSIS

Traditional systems use paid software and tools for the analysis of data.

When the data keeps growing and the system is not able to withstand it, we can extend the system only up to a certain limit; if the data crosses that limit, we are forced to replace the entire machine.

Scalable (scale-up) architecture.

    HISTORY OF HADOOP

Hadoop was created by DOUG CUTTING, the creator of Apache Lucene (the widely used text search library), and MIKE CAFARELLA.

The concept was proposed in a paper by GOOGLE describing the GOOGLE FILE SYSTEM (GFS), which evolved into HDFS (Hadoop Distributed File System) in Hadoop.

Once again, it is from Google's MapReduce paper that Hadoop's concept of distributed parallel processing came into existence.

2004: Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.
December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006: Doug Cutting joins Yahoo!
February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
February 2006: Adoption of Hadoop by the Yahoo! Grid team.
April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes.
May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
October 2006: Research cluster reaches 600 nodes.
December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
January 2007: Research cluster reaches 900 nodes.
April 2007: Research clusters: 2 clusters of 1000 nodes.
April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
October 2008: Loading 10 terabytes of data per day onto research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).

The name Hadoop was taken from Doug Cutting's son's toy elephant, and thus the elephant symbol came into existence as well.


    DISTRIBUTORS OF HADOOP

Apache (the original open-source project, built on Google's published papers)
Hortonworks
Cloudera
MapR
Intel

    HADOOP ARCHITECTURE

The Hadoop architecture comprises two major parts.

HDFS (Hadoop Distributed File System): a distributed file system that runs on large clusters of commodity machines.

It comprises three components (a small inspection sketch follows this list):

Name Node is the master of the system. It maintains the name system (directories and files) and manages the blocks which are present on the Data Nodes. It holds the metadata for HDFS.

Data Nodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from the clients.

Secondary Name Node is responsible for performing periodic checkpoints. In the event of Name Node failure, the Name Node can be restarted using the latest checkpoint.
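As an illustrative sketch of the Name Node's metadata (the file path below is hypothetical), fsck lists each block of a file already stored in HDFS together with the Data Nodes that hold it:

    $ hadoop fsck /user/ubunutu/sample.txt -files -blocks -locations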

MAP REDUCE: this is responsible for two processes:

Map task: breaks the input into a set of key-value pairs.

Reduce task: consolidates the outputs from each distributed execution and processes them into reduced tuples.

This part is responsible for the computation of the problem.

It comprises two components:

Job Tracker is the master of the system which manages the jobs and resources in the cluster (Task Trackers). The Job Tracker tries to schedule each map as close as possible to the actual data being processed, i.e. on the Task Tracker which is running on the same Data Node as the underlying block.

Task Trackers are the slaves which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the Job Tracker.


Input and output must always be in HDFS for the execution of a Hadoop job.

CLUSTER: the entire configuration of one Hadoop architecture is called a cluster.

RACK: a rack is a metal shelf which holds the various nodes, servers and storage components.

DATA BLOCKS: the data to be processed is split into blocks which are then processed in parallel. The block size can be 64 MB, 128 MB or 256 MB. For example, with a 128 MB block size a 1 GB file is split into 8 blocks.

    FAULT TOLERANCE SCHEME


In order to make the system fault tolerant, Hadoop uses the concept of REPLICATION. The given data is replicated and saved in other blocks of memory. In case of any data loss, the replicated block of data is used for further processing/execution.

The replication factor can be 1, 2 or 3 (the Hadoop default is 3); a sketch for changing it on an existing file follows.
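For example, the replication factor of a file already in HDFS can be changed with the setrep command (the path here is hypothetical; -w waits until the new replication level is reached):

    $ hadoop dfs -setrep -w 2 /user/ubunutu/sample.txt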

    EXECUTION PROCESS OF HADOOP

The client submits the job to the Job Tracker.

The Job Tracker communicates with the Name Node, which holds the metadata (the index) of all the other nodes.

The Job Tracker retrieves this information from the Name Node and determines the availability of the Task Trackers.

Depending upon the availability, the Job Tracker assigns the jobs to the corresponding Task Trackers.

The corresponding Data Nodes hold the blocks of data upon which the Task Trackers work.

The Task Trackers intimate the Job Tracker upon completion of execution.

Upon receiving the completion status for all the blocks of the given data, the Job Tracker informs the client that the execution is complete.

The reduce task starts only after the entire map task is done.

WORD COUNT PROBLEM

The word count problem is a basic example illustrating the working of the Hadoop architecture. Here a huge text file consisting of various words and phrases is fed into the Hadoop cluster. The job is to count the number of distinct words and their occurrences in the given file. This is done by the following process (a small worked example follows this list):

Mapper instantiation for each line.
Map key-value splitting.
Sorting and shuffling.
Reducing key-value pairs.
Printing the final output.
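As a small worked illustration (the one-line input below is made up), the stages transform the data roughly as follows:

    Input line:          big data needs big tools
    Map output:          (big,1) (data,1) (needs,1) (big,1) (tools,1)
    After sort/shuffle:  (big,[1,1]) (data,[1]) (needs,[1]) (tools,[1])
    Reduce output:       (big,2) (data,1) (needs,1) (tools,1)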


This simple problem can be solved using a single node cluster, which can be set up on our personal computers.

    SETTING UP A SINGLE NODE CLUSTER.

The following steps are used to set up a single node cluster. Hadoop 1.x supports only Linux platform operating systems.

1. Unzip the tar file:

    $ tar -xzvf hadoop-1.1.2.tar.gz

2. Install the JDK:

    $ sh jdk-6u45-linux-x64.bin

3. Open the /home/ubunutu/.bashrc file and set the environment path (a sketch follows).
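A minimal sketch of the .bashrc entries, assuming the JDK and Hadoop were both extracted under /home/ubunutu (adjust the paths to the actual install locations):

    export JAVA_HOME=/home/ubunutu/jdk1.6.0_45       # assumed JDK location
    export HADOOP_HOME=/home/ubunutu/hadoop-1.1.2
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin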


4. The Hadoop configuration files are located at /home/ubunutu/hadoop-1.1.2/conf:

    $ cd hadoop-1.1.2/

    $ cd conf

    $ vi hadoop-env.sh
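Inside hadoop-env.sh, the only change usually needed for a single node cluster is to point JAVA_HOME at the installed JDK (the path below is an assumption):

    export JAVA_HOME=/home/ubunutu/jdk1.6.0_45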


    5. $ vi core-site.xml
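A typical single node core-site.xml sketch, assuming the Name Node listens on port 9000 of localhost:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>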


    6. $ vi mapred-site.xml
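A typical single node mapred-site.xml sketch, assuming the Job Tracker runs on port 9001 of localhost:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>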

    7. $ vi hdfs-site.xml
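A typical single node hdfs-site.xml sketch; with only one Data Node the replication factor is set to 1:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>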


8. Change the directory to /home/ubunutu/

9. Now generate the SSH key from the user's home directory:

    $ ssh-keygen -t rsa

    $ cd .ssh

    $ sudo apt-get install openssh-server

    $ cat id_rsa.pub >> authorized_keys

    $ ssh localhost

It should now log in without prompting for a password, confirming that passwordless SSH works.

    $ exit.

10. Now we have set up the single node cluster on our personal system.

11. We have to start the cluster using:

    $ cd /home/ubunutu/hadoop-1.1.2/bin

    $ ./start-all.sh


The jps command is used to list the various daemons running in the cluster.
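On a healthy single node cluster, jps should list the five Hadoop daemons (process IDs will differ from machine to machine), and dfsadmin can confirm that the Data Node has registered:

    $ jps
      (expected: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, plus Jps itself)
    $ hadoop dfsadmin -report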

12. Since Hadoop reads from HDFS and writes back to HDFS, it requires both the input file and the output file to be present in its local HDFS.

Thus we now need to copy the file from our local storage to HDFS storage. This is done using the command:

    $ hadoop dfs -copyFromLocal /home/ubunutu/filename.txt /home/ubunutu/inputdata

Thus the file is copied from our local file system to the HDFS file system of Hadoop.

13. Now we can execute the distributed processing of the data using the following command:

    $ hadoop jar (jar name) (program name) (input file path) (output file path)
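For the word count program listed later, the command looks roughly like this (the jar name and the output path are assumptions; the class name matches the program's package and class):

    $ hadoop jar wordcount.jar org.samples.mapreduce.training.WordCount /home/ubunutu/inputdata /home/ubunutu/outputdata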


The input is a plain text file containing arbitrary words and sentences.


The output and the cluster status can be seen through the following web addresses:

Name Node: http://localhost:50070

Job Tracker: http://localhost:50030

Thus the word count program is executed on sample data using Hadoop's architecture of distributed execution.
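The result can also be read from the command line (assuming the output path used in the example command above):

    $ hadoop dfs -cat /home/ubunutu/outputdata/part-*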

    WORD COUNT PROGRAM IN JAVA

package org.samples.mapreduce.training;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token found in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts received for each word and emits (word, total).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path in HDFS, args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

BIBLIOGRAPHY

Hadoop: The Definitive Guide, 3rd edition, Tom White, published by O'Reilly.
Hadoop: The Definitive Guide, 2nd edition, Tom White, published by O'Reilly and Yahoo! Press.
Kick Start Hadoop: Word Count - Hadoop Map Reduce Example (web tutorial).
Hadoop Tutorial 1 -- Running WordCount (DftWiki, web tutorial).
