
A DIVE INTO BIG DATA AND ITS SOLUTION USING HADOOP

    BY

    G.LOUIS AROKIARAJ

    B.TECH CSE (IV YEAR)

NATIONAL INSTITUTE OF TECHNOLOGY, PUDUCHERRY


HOW IS IT MORE EFFICIENT THAN DATA WAREHOUSE ANALYSIS?

Data warehouses store current as well as historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.

The drawback of a data warehouse is that when such a large quantity of data floods into the system, it is not able to process all of that data. It is also more expensive. Some data warehouse solutions are INFORMATICA and TERADATA.

    SOURCES OF BIG DATA

Social media: Facebook, Twitter, Google+, Orkut.
Stock market: risk analysis.
Health care: patient details, diagnosis, prescriptions, medicines, reports.
Information technology companies: employee details, statistics.
E-commerce: recommendations.

The Indian Government is trying to implement big data analysis on tax revenues to increase the economy of the country.

    STATISTICS

In the present data world about 90% of data is unstructured; only the remaining 10% is structured. In the last two years there has been an immense increase in the quantity of data because of various factors like online shopping, Facebook, Twitter, etc.

In one day, about 2.2 million pieces of data are created.

In 2010, the big data market was valued at $3.2 billion. By 2016, the big data market is expected to grow to $16.9 billion.

[Figure: huge data is lost and left unprocessed.]


HOW TO SOLVE THIS?

Here comes the solution: HADOOP.

Hadoop is one of the solutions for big data analysis. Other big data technologies are:

NoSQL databases: Cassandra, MongoDB, etc.
Search tools: Lucene, Elasticsearch, etc.
Stream processing: Storm, S4, etc.
Others: Kafka, Thrift, Scribe, etc.

    WHY HADOOP?

Flexible: Hadoop can process all three types of data (structured, semi-structured and unstructured). Hadoop also supports various languages like Perl, Python, Java and SQL through the Hadoop Streaming API, so Hadoop is not restricted to Java experts alone (a sample streaming invocation follows this list).
Scale-out architecture.
Builds a more efficient data economy.
Robust ecosystem.
Cost effective.
Hadoop is getting cloudy too (cloud deployments are available).
Hadoop focuses on moving code to data instead of data to code.
Hadoop supports OLAP (online analytical processing) but not OLTP (online transaction processing).
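As a rough sketch of the Streaming API mentioned above: in Hadoop 1.x the streaming jar normally ships under $HADOOP_HOME/contrib/streaming/, while the input/output paths and the mapper.py/reducer.py scripts here are hypothetical placeholders.

    $ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.2.jar \
          -input /user/ubunutu/inputdata \
          -output /user/ubunutu/streamout \
          -mapper mapper.py \
          -reducer reducer.py \
          -file mapper.py -file reducer.py

The -file options ship the scripts to the cluster nodes, so the map and reduce logic can be written in any language that reads from standard input and writes to standard output.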

    HADOOP OVER TRADITIONAL SYSTEMS

HADOOP ANALYSIS

Hadoop is an open source framework which allows for distributed processing of large data across clusters of computers using a simple programming model.

When the data keeps growing and the rack architecture is not able to take it up, we can simply add or replace a cheap commodity machine in the rack and continue the execution.

Scale-out architecture.

TRADITIONAL SYSTEM ANALYSIS

Traditional systems use paid software and tools for the analysis of data.

When the data keeps growing and the system is not able to withstand it, we can extend the system only up to a certain limit; if the data crosses that limit, we are forced to replace the entire machine.

Scalable (scale-up) architecture.

    HISTORY OF HADOOP

Hadoop was created by DOUG CUTTING, the creator of Apache Lucene (the widely used text search library), and MIKE CAFARELLA.

The concept was proposed in a paper by GOOGLE describing the GOOGLE FILE SYSTEM (GFS), which evolved into HDFS (Hadoop Distributed File System) in Hadoop.

Once again, it is from Google's MapReduce paper that Hadoop's concept of distributed parallel processing came into existence.

2004: Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.
December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006: Doug Cutting joins Yahoo!
February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
February 2006: Adoption of Hadoop by the Yahoo! Grid team.
April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes.
May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
October 2006: Research cluster reaches 600 nodes.
December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
January 2007: Research cluster reaches 900 nodes.
April 2007: Research clusters: 2 clusters of 1000 nodes.
April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
October 2008: Loading 10 terabytes of data per day onto research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).

The name Hadoop was taken from Doug Cutting's son's toy elephant, and thus the elephant symbol came into existence as well.


    DISTRIBUTORS OF HADOOP

Apache (the original open-source project, built on Google's published papers)
Hortonworks
Cloudera
MapR
Intel

    HADOOP ARCHITECTURE

The Hadoop architecture comprises two major parts.

HDFS (Hadoop Distributed File System): a distributed file system that runs on large clusters of commodity machines.

It comprises three components (a small inspection sketch follows this list):

Name Node is the master of the system. It maintains the name system (directories and files) and manages the blocks which are present on the Data Nodes. It holds the metadata for HDFS.

Data Nodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from the clients.

Secondary Name Node is responsible for performing periodic checkpoints. In the event of Name Node failure, the Name Node can be restarted using the latest checkpoint.
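As an illustrative sketch of the Name Node's metadata (the file path below is hypothetical), fsck lists each block of a file already stored in HDFS together with the Data Nodes that hold it:

    $ hadoop fsck /user/ubunutu/sample.txt -files -blocks -locations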

MAP REDUCE: this is responsible for two processes:

Map task: breaks the input into a set of key-value pairs.

Reduce task: consolidates the outputs from each distributed execution and processes them into reduced tuples.

This part is responsible for the computation of the problem.

It comprises two components:

Job Tracker is the master of the system which manages the jobs and resources in the cluster (Task Trackers). The Job Tracker tries to schedule each map as close as possible to the actual data being processed, i.e. on the Task Tracker which is running on the same Data Node as the underlying block.

Task Trackers are the slaves which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the Job Tracker.


Input and output must always be in HDFS for the execution of a Hadoop job.

CLUSTER: the entire configuration of one Hadoop architecture is called a cluster.

RACK: a rack is a metal shelf which holds the various nodes, servers and storage components.

DATA BLOCKS: the data to be processed is split into blocks which are then processed in parallel. The block size can be 64 MB, 128 MB or 256 MB. For example, with a 128 MB block size a 1 GB file is split into 8 blocks.

    FAULT TOLERANCE SCHEME


In order to make the system fault tolerant, Hadoop uses the concept of REPLICATION. The given data is replicated and saved in other blocks of memory. In case of any data loss, the replicated block of data is used for further processing/execution.

The replication factor can be 1, 2 or 3 (the Hadoop default is 3); a sketch for changing it on an existing file follows.
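For example, the replication factor of a file already in HDFS can be changed with the setrep command (the path here is hypothetical; -w waits until the new replication level is reached):

    $ hadoop dfs -setrep -w 2 /user/ubunutu/sample.txt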

    EXECUTION PROCESS OF HADOOP

The client submits the job to the Job Tracker.

The Job Tracker communicates with the Name Node, which holds the metadata (the index) of all the other nodes.

The Job Tracker retrieves this information from the Name Node and determines the availability of the Task Trackers.

Depending upon the availability, the Job Tracker assigns the jobs to the corresponding Task Trackers.

The corresponding Data Nodes hold the blocks of data upon which the Task Trackers work.

The Task Trackers intimate the Job Tracker upon completion of execution.

Upon receiving the completion status for all the blocks of the given data, the Job Tracker informs the client that the execution is complete.

The reduce task starts only after the entire map task is done.

WORD COUNT PROBLEM

The word count problem is a basic example illustrating the working of the Hadoop architecture. Here a huge text file consisting of various words and phrases is fed into the Hadoop cluster. The job is to count the number of distinct words and their occurrences in the given file. This is done by the following process (a small worked example follows this list):

Mapper instantiation for each line.
Map key-value splitting.
Sorting and shuffling.
Reducing key-value pairs.
Printing the final output.
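As a small worked illustration (the one-line input below is made up), the stages transform the data roughly as follows:

    Input line:          big data needs big tools
    Map output:          (big,1) (data,1) (needs,1) (big,1) (tools,1)
    After sort/shuffle:  (big,[1,1]) (data,[1]) (needs,[1]) (tools,[1])
    Reduce output:       (big,2) (data,1) (needs,1) (tools,1)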


This simple problem can be solved using a single node cluster, which can be set up on our personal computers.

    SETTING UP A SINGLE NODE CLUSTER.

The following steps are used to set up a single node cluster. Hadoop 1.x supports only Linux platform operating systems.

1. Unzip the tar file:

    $ tar -xzvf hadoop-1.1.2.tar.gz

2. Install the JDK:

    $ sh jdk-6u45-linux-x64.bin

3. Open the /home/ubunutu/.bashrc file and set the environment path (a sketch follows).
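A minimal sketch of the .bashrc entries, assuming the JDK and Hadoop were both extracted under /home/ubunutu (adjust the paths to the actual install locations):

    export JAVA_HOME=/home/ubunutu/jdk1.6.0_45       # assumed JDK location
    export HADOOP_HOME=/home/ubunutu/hadoop-1.1.2
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin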


4. The Hadoop configuration files are located at /home/ubunutu/hadoop-1.1.2/conf:

    $ cd hadoop-1.1.2/

    $ cd conf

    $ vi hadoop-env.sh
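Inside hadoop-env.sh, the only change usually needed for a single node cluster is to point JAVA_HOME at the installed JDK (the path below is an assumption):

    export JAVA_HOME=/home/ubunutu/jdk1.6.0_45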


    5. $ vi core-site.xml
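A typical single node core-site.xml sketch, assuming the Name Node listens on port 9000 of localhost:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>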


    6. $ vi mapred-site.xml
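A typical single node mapred-site.xml sketch, assuming the Job Tracker runs on port 9001 of localhost:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>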

    7. $ vi hdfs-site.xml
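A typical single node hdfs-site.xml sketch; with only one Data Node the replication factor is set to 1:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>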


8. Change the directory to /home/ubunutu/

9. Now generate the SSH key from the user's home directory:

    $ ssh-keygen -t rsa

    $ cd .ssh

    $ sudo apt-get install openssh-server

    $ cat id_rsa.pub >> authorized_keys

    $ ssh localhost

It should now log in without prompting for a password, confirming that passwordless SSH works.

    $ exit.

10. Now we have set up the single node cluster on our personal system.

11. We have to start the cluster using:

    $ cd /home/ubunutu/hadoop-1.1.2/bin

    $ ./start-all.sh


The jps command is used to list the various daemons running in the cluster.
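On a healthy single node cluster, jps should list the five Hadoop daemons (process IDs will differ from machine to machine), and dfsadmin can confirm that the Data Node has registered:

    $ jps
      (expected: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, plus Jps itself)
    $ hadoop dfsadmin -report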

12. Since Hadoop reads from HDFS and writes back to HDFS, it requires both the input file and the output file to be present in its local HDFS.

Thus we now need to copy the file from our local storage to HDFS storage. This is done using the command:

    $ hadoop dfs -copyFromLocal /home/ubunutu/filename.txt /home/ubunutu/inputdata

Thus the file is copied from our local file system to the HDFS file system of Hadoop.

13. Now we can execute the distributed processing of the data using the following command:

    $ hadoop jar (jar name) (program name) (input file path) (output file path)
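For the word count program listed later, the command looks roughly like this (the jar name and the output path are assumptions; the class name matches the program's package and class):

    $ hadoop jar wordcount.jar org.samples.mapreduce.training.WordCount /home/ubunutu/inputdata /home/ubunutu/outputdata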


The input is a plain text file containing arbitrary words and sentences.


The output and the cluster status can be seen through the following web addresses:

Name Node: http://localhost:50070

Job Tracker: http://localhost:50030

Thus the word count program is executed on sample data using Hadoop's architecture of distributed execution.
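The result can also be read from the command line (assuming the output path used in the example command above):

    $ hadoop dfs -cat /home/ubunutu/outputdata/part-*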

    WORD COUNT PROGRAM IN JAVA

package org.samples.mapreduce.training;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token found in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts received for each word and emits (word, total).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path in HDFS, args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

BIBLIOGRAPHY

Hadoop: The Definitive Guide, 3rd edition, Tom White, published by O'Reilly.
Hadoop: The Definitive Guide, 2nd edition, Tom White, published by O'Reilly and Yahoo! Press.
Kick Start Hadoop: Word Count - Hadoop Map Reduce Example (web tutorial).
Hadoop Tutorial 1 -- Running WordCount (DftWiki, web tutorial).
