
Summer Independent Study Report


Summer 2016 Report by: Shreya Chakrabarti

Self-Learning Hadoop

What is Big Data?

(Image Reference: http://www.webopedia.com/TERM/B/big_data.html)

According to recent research, every day we create around 2.5 quintillion bytes of data. Surprisingly, the majority of this data has been generated in the short span of the last 10 years. A major contribution comes from the various social media ventures of recent years, namely Facebook, Twitter, Instagram, etc. Other sources include cell phone GPS signals, shopper profiles stored by shopping giants like Amazon and eBay, and numerous other resources. Data so large that storing, analyzing, visualizing and performing analytics on it becomes increasingly difficult because of its sheer volume is called Big Data.

Big Data has become a very popular term in recent times as the world realizes the importance of using existing data to its advantage and maximizing business profits. The main advantage of storing this data and utilizing newer Big Data technologies is analytics. Four types of analytic techniques can be used by companies to better engage with their customers and, in turn, maximize their own capital:

1) Descriptive Analytics: "What happened?" A simple measure like page views can give us an idea of the success of a particular campaign.

2) Diagnostic Analytics: "Why did it happen?" Business Intelligence tools analyze the data currently available in the company and give specific reasons why a particular campaign was successful or unsuccessful, based on which the decision to continue or discontinue the campaign can easily be taken.

3) Predictive Analytics: "What will happen?" Predictive analytics is a branch of advanced analytics used to make predictions about unknown future events. It uses techniques such as data mining, statistical modeling, machine learning and artificial intelligence to analyze current data and make predictions about the future.

4) Prescriptive Analytics: "Prevention is better than cure." Once predictive analytics predicts what needs to be done to maximize profits, care must be taken that nothing is done in the opposite direction that would hamper those profits.

Why Hadoop?

As discussed earlier, technology needs to advance at a drastic speed for the world to take advantage of existing as well as ever-growing data. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware. In simple terms, "Hadoop" can be described as a data storage and processing platform used to hold large datasets and perform data analysis on them.

Hadoop was designed on the basis of the Google File System paper published in 2003. Doug Cutting, the creator of Hadoop, named it after his son's toy elephant. Hadoop 0.1.0 was released in April 2006, and the framework continues to evolve through the many contributors to the Apache Hadoop project.

Hadoop's processing model is based on the MapReduce algorithm.

Hadoop Components

1) Hadoop Distributed File System (HDFS)

2) MapReduce Processing


HDFS Architecture

(https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)

Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing capabilities. It distributes data across multiple machines: files are logically divided into equal-sized blocks, and the blocks are spread across multiple machines, which hold replicas of the blocks. Three replicas are maintained to ensure availability. Data integrity is maintained by computing a checksum for each block. The Name-node maintains the addresses of the blocks on the respective data-nodes. Whenever data is requested, the Name-node provides the address of the replica physically closest to the client. The Secondary Name-node serves as a checkpoint server; it is not a replacement for the primary Name-node when it fails.
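To make the block and replica ideas concrete, here is a toy Python simulation (not HDFS code, and not from the report); the block size, node names and file size are invented purely for illustration.

# A toy Python simulation (NOT HDFS code) of the ideas above: a file is
# logically divided into equal-sized blocks, each block gets three replicas on
# different data nodes, and a checksum guards data integrity. Block size, node
# names and the file size are invented for illustration.
import hashlib

BLOCK_SIZE = 128 * 1024 * 1024            # assume a 128 MB block size
DATA_NODES = ["datanode1", "datanode2", "datanode3", "datanode4", "datanode5"]
REPLICATION = 3                           # three replicas per block

def split_into_blocks(file_size_bytes):
    """Return (offset, length) pairs describing the logical blocks of a file."""
    blocks, offset = [], 0
    while offset < file_size_bytes:
        length = min(BLOCK_SIZE, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(block_index):
    """Choose REPLICATION distinct data nodes for a block (simple round-robin)."""
    start = block_index % len(DATA_NODES)
    return [DATA_NODES[(start + i) % len(DATA_NODES)] for i in range(REPLICATION)]

def block_checksum(data):
    """Checksum used to verify block integrity (MD5 here, purely illustrative)."""
    return hashlib.md5(data).hexdigest()

if __name__ == "__main__":
    file_size = 400 * 1024 * 1024         # a hypothetical 400 MB file
    for i, (offset, length) in enumerate(split_into_blocks(file_size)):
        print(f"block {i}: offset={offset} length={length} replicas={place_replicas(i)}")
    print("example checksum:", block_checksum(b"example block contents"))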


Map Reduce

MapReduce, which originated at Google, is a popular programming model for processing and generating large datasets. The name MapReduce originally referred to the proprietary Google technology but has since been genericized. Google itself, however, has moved on to newer technologies since 2014.

The diagram below is from Google's original MapReduce paper and describes the working of the MapReduce algorithm.

The MapReduce algorithm breaks down into three important steps: Map, Group & Sort, and Reduce. The Map part of the algorithm divides the data into key-value pairs. The key is the most important part of the Map function, as this key is also used later by the Reduce function.

Group and Sort gathers the values that share the same key together, to make things simpler for the next stage, the Reducer.

In the final stage, the Reducer receives the grouped and sorted data from the previous stage and produces the desired output from the processing of the dataset.
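As a concrete illustration (not from the report), the three stages can be sketched in a few lines of plain Python; the sample word records below are invented.

# A minimal pure-Python sketch of the three MapReduce stages described above.
# This is not Hadoop code; it only illustrates Map -> Group & Sort -> Reduce
# on a few invented word records.
from itertools import groupby

records = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Map: turn each input record into a key-value pair.
mapped = [(word, 1) for word in records]

# Group & Sort: bring together all values that share the same key.
mapped.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in group] for key, group in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: collapse each key's list of values into a single result.
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)   # {'apple': 3, 'banana': 2, 'cherry': 1}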


Some examples that give an in-depth understanding of MapReduce are explained in the projects below.

Mini-Project 1: Max and Min Temperatures in year 1800

The dataset in this mini project contains temperatures from the year 1800 which were recorded at various weather stations.

The dataset can be explained as below:

The data also contains some other fields that are not relevant to our mini project. We will be finding the "Minimum Temperature recorded at a particular weather station throughout the year 1800" and the "Maximum Temperature recorded at that weather station throughout the year 1800". (There are only two weather stations included in this particular dataset.)

Understanding the data plays a very important role in determining the "Map" and "Reduce" parts when writing a MapReduce program.

The relevant fields in each record are:

1) Weather station code

2) Date in the year 1800 when the temperature was recorded

3) Type of reading (maximum or minimum temperature)

4) Temperature in Celsius


How a MapReduce program works:

Data → Mapper (key-value pairs) → Group and Sort → Reducer

The working of the MapReduce algorithm can be explained by the above flow. The data is fed to the Mapper, which selects the data relevant to the result and separates it into key-value pairs. This data is then grouped and sorted according to the keys. The Reducer is the function that ultimately gives us the result.

Sample input records (station, date, reading type, temperature):

ITE00100554 18000101 TMAX -75
GM000010962 18000101 PRCP 0
EZE00100082 18000101 TMAX -86
EZE00100082 18000101 TMIN -135
ITE00100554 18000102 TMAX -60
ITE00100554 18000102 TMIN -125
GM000010962 18000102 PRCP 0
EZE00100082 18000102 TMAX -44

Mapper output (key-value pairs for the TMAX readings): ITE00100554, -75 | EZE00100082, -86 | ITE00100554, -60

After Group and Sort: ITE00100554: -75, -60 | EZE00100082: -86

Reducer output (maximum per station): ITE00100554, -60 | EZE00100082, -86


The above logic can be written in Python, once for the minimum temperature and once for the maximum temperature. Each version has two parts: a Mapper (to establish the key-value pairs) and a Reducer (to produce the final results), as sketched below.
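The report's original listings are screenshots and are not reproduced here. The sketch below expresses the same mapper/reducer logic using the mrjob library; the choice of mrjob, the whitespace-separated field layout and the script name are assumptions rather than details taken from the report.

# min_temperatures.py -- a minimal sketch of the mapper/reducer logic described
# above, written with the mrjob library. The use of mrjob, the field layout and
# the file name are assumptions; the report's exact code is not reproduced.
from mrjob.job import MRJob

class MRMinTemperature(MRJob):

    def mapper(self, _, line):
        # Each record: stationID, date, readingType, temperature
        # (split on whitespace to match the sample records shown above;
        #  the real file may be comma-separated instead).
        fields = line.split()
        station_id, reading_type, temperature = fields[0], fields[2], float(fields[3])
        # Key-value pair: keep only the minimum-temperature readings.
        if reading_type == 'TMIN':
            yield station_id, temperature

    def reducer(self, station_id, temperatures):
        # All TMIN values for one station arrive grouped together.
        yield station_id, min(temperatures)

if __name__ == '__main__':
    MRMinTemperature.run()

The maximum-temperature version is identical except that the mapper keeps 'TMAX' readings and the reducer yields max(temperatures). Assuming the records live in a local file named 1800.csv, the job could be run with: python min_temperatures.py 1800.csv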


Running the Minimum Temperatures Code:

Output for Minimum Temperatures:

Running the Maximum Temperatures Code:

Output for Maximum Temperatures:


Mini-Project 2: Total Amount Ordered by each customer

The dataset contains a list of customers with the amount they spent on each order they placed at a restaurant. The dataset contains 3 attributes, namely Customer ID, Order Number and Amount Spent.

To write the code for this data analysis problem, let us first design an approach, outlined below:

Data: the raw order records (Customer ID, Order Number, Amount Spent).

Mapper: establishes the key-value pair. In this case the key is the customer ID and the value is the amount the customer spent on that order.

Group and Sort: groups the values on the basis of the customer, so after grouping and sorting each customer ID is paired with all of the amounts that customer spent.

Reducer: sums those amounts and outputs, for each customer ID, how much money that customer spent across all orders.


The code for the same is thus written as below in Python:
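The original Python listing is a screenshot; the following is a minimal sketch of the approach above, again using mrjob. The library choice and the comma-separated field order (customer ID, order number, amount) are assumptions.

# total_spent_by_customer.py -- minimal sketch of the approach described above,
# using mrjob. The library choice and the comma-separated field order
# (customerID, orderNumber, amount) are assumptions.
from mrjob.job import MRJob

class MRTotalSpentByCustomer(MRJob):

    def mapper(self, _, line):
        customer_id, order_number, amount = line.split(',')
        # Key-value pair: customer ID -> amount spent on this order.
        yield customer_id, float(amount)

    def reducer(self, customer_id, amounts):
        # All of one customer's order amounts arrive together; add them up.
        yield customer_id, sum(amounts)

if __name__ == '__main__':
    MRTotalSpentByCustomer.run()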

Output:

The output of this project can also be improved by feeding the output of the first reducer into another mapper to get a sorted output. This sort of MapReduce job is called a "chained MapReduce job".


Revised Code:
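The revised listing is also a screenshot. A plausible sketch of the chained version, using mrjob's MRStep to run two map/reduce passes, is shown below; zero-padding the total lets the shuffle-and-sort phase order the results by amount. The same assumptions as before apply.

# total_spent_sorted.py -- a sketch of the chained ("multi-step") version using
# mrjob's MRStep. Assumptions as before: mrjob, comma-separated input records.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRTotalSpentSorted(MRJob):

    def steps(self):
        # Step 1 computes the total per customer; step 2 re-maps those totals
        # so that Hadoop's shuffle/sort phase orders the results by amount.
        return [
            MRStep(mapper=self.mapper_get_orders, reducer=self.reducer_totals),
            MRStep(mapper=self.mapper_make_amount_key, reducer=self.reducer_output),
        ]

    def mapper_get_orders(self, _, line):
        customer_id, order_number, amount = line.split(',')
        yield customer_id, float(amount)

    def reducer_totals(self, customer_id, amounts):
        yield customer_id, sum(amounts)

    def mapper_make_amount_key(self, customer_id, total):
        # Zero-pad the total so string sorting matches numeric sorting.
        yield '%010.2f' % total, customer_id

    def reducer_output(self, total, customer_ids):
        for customer_id in customer_ids:
            yield customer_id, float(total)

if __name__ == '__main__':
    MRTotalSpentSorted.run()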

Revised Output:

The first reducer's output of "order totals" is sent to another mapper/reducer pair to get the results sorted.


Project: Social Graph of Superheroes

This dataset contains superhero data from Marvel, recording the appearance of superheroes with each other in various comic books. It essentially traces which superheroes appear together in the comic books that feature them.

The above image is a snippet from the data, in which numbers are assigned to the various characters: the first number on each line (highlighted) is the superhero, and the numbers that follow belong to the other characters that this main character is friends with.

Step 1: Find the total number of friends per superhero

To find the most popular superhero, we first need to map each character to the number of friends that superhero has. To do this we count the friends listed for a character on each line, emit them as a key-value pair, and feed them to the reducer. The reducer then adds up the number of friends per character.

Step 2: Find the superhero with the maximum friend count

Mapper 1: counts the number of friends per character on each line and establishes a key-value pair of Superhero: NumberOfFriends.

Reducer 1: adds up the number of friends per superhero, producing the total number of friends per superhero.

Mapper 2: substitutes a common (empty) key, for example None: 59 5933, where None is the key and "59 5933" is the value.

Reducer 2: finds the superhero with the maximum number of friends.


These two steps give us the most popular superhero.
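The report's listings for these two steps are screenshots. Below is a minimal sketch of both steps as a single chained mrjob job; the library choice and the input format (space-separated IDs with the superhero's ID first) are assumptions.

# most_popular_superhero.py -- sketch of steps 1 and 2 above as one chained
# mrjob job. Assumptions: mrjob, and input lines of space-separated IDs where
# the first ID is the superhero and the rest are its friends.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostPopularSuperhero(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_friends, reducer=self.reducer_total_friends),
            MRStep(mapper=self.mapper_common_key, reducer=self.reducer_find_max),
        ]

    # Mapper 1: Superhero -> number of friends on this line.
    def mapper_count_friends(self, _, line):
        ids = line.split()
        yield ids[0], len(ids) - 1

    # Reducer 1: total number of friends per superhero.
    def reducer_total_friends(self, hero_id, counts):
        yield hero_id, sum(counts)

    # Mapper 2: substitute a common (empty) key so everything reaches one reducer.
    def mapper_common_key(self, hero_id, total_friends):
        yield None, (total_friends, hero_id)

    # Reducer 2: pick the superhero with the maximum friend count.
    def reducer_find_max(self, _, totals):
        friends, hero_id = max(totals)
        yield hero_id, friends

if __name__ == '__main__':
    MRMostPopularSuperhero.run()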


The load_name_dictionary function loads the superhero names file so that the output shows the superhero's name, rather than its numeric code, alongside the number of friends the superhero has.
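The body of load_name_dictionary is not shown in the report. A plausible sketch, assuming a names file in which each line starts with a numeric ID followed by the character's name in double quotes, might look like this:

# A plausible sketch of load_name_dictionary (the report does not show its
# body). It assumes a names file in which each line starts with a numeric ID
# followed by the character's name in double quotes, e.g.: 5933 "SOME HERO"
def load_name_dictionary(path):
    names = {}
    with open(path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            fields = line.split('"')
            if len(fields) >= 2:
                hero_id = fields[0].strip()
                names[hero_id] = fields[1]
    return names

# Usage: replace the numeric ID in the final output with the hero's name.
# names = load_name_dictionary('superhero_names.txt')   # hypothetical file name
# print(names.get('5933', 'Unknown'))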

Output:

Other Important Technologies in Hadoop

YARN

YARN can simply be called the operating system of Hadoop, because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls and managing the high-availability features of Hadoop.

(https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html)


Resource Manager: the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.

Node Manager: takes instructions from the Resource Manager and manages resources on a single node.

Application Master: the negotiator; an application master is responsible for negotiating resources from the Resource Manager on behalf of a single application.

HIVE

Hive is an open-source project run by volunteers at the Apache Software Foundation. Hive is essentially a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Hive provides a SQL-like language, HiveQL, with schema-on-read, and transparently converts queries to MapReduce jobs.

SQOOP

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Sqoop got its name from "SQL-to-Hadoop".

SPARK

Spark was developed in response to limitations in the MapReduce cluster computing paradigm. Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
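As a brief illustration (not taken from the report), the "total amount ordered by each customer" problem from Mini-Project 2 can be expressed in a few lines of PySpark; the SparkSession API usage is standard, but the input path and field order are assumptions.

# A minimal PySpark sketch (not from the report) computing the same
# "total amount ordered by each customer" result as Mini-Project 2.
# The input path 'customer-orders.csv' and its field order are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TotalSpentByCustomer").getOrCreate()

lines = spark.sparkContext.textFile("customer-orders.csv")

# Map each record to a (customerID, amount) pair, then reduce by key.
pairs = lines.map(lambda line: line.split(',')) \
             .map(lambda fields: (fields[0], float(fields[2])))
totals = pairs.reduceByKey(lambda a, b: a + b)

for customer_id, total in totals.collect():
    print(customer_id, round(total, 2))

spark.stop()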