WHAT ARE WE GOING TO LEARN?
• What is Big Data?
• What is HADOOP?
• Distributed Computing
• The 4 V's of Big Data
• HADOOP Daemons
• Writing files to HDFS
• Reading files from HDFS
• Replication Factor
• Rack Awareness
• Map/Reduce
WHAT IS BIG DATA?
• Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications (Source: Tom White, Hadoop: The Definitive Guide).
• Why do we need to manage this big data?
The data is growing enormously. In earlier days, only employees and customers generated data (e.g., feedback forms, survey results). But in today's world, even machines have started generating data (e.g., sensors, RFID, satellites). In fact, 90% of the world's data was generated in the last 3 years.
Companies have realized that they need to manage this huge amount of data. Imagine the data flow that happens in a search engine like Google. Google came up with the idea of distributed computing and parallel processing, which is very well explained in its research papers "The Google File System" and "MapReduce". They showed the world how Google was able to process this huge amount of data.
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://research.google.com/archive/mapreduce.html
Wait! What is the use of this big data?
Big data is used to better understand customers and their behaviors and preferences. Take Flipkart as an example. I'm a great lover of Nike shoes; I might have purchased a pair of Nike shoes some 5 years ago. But for some reason my interest turned towards Adidas shoes, and I've been buying only Adidas shoes for the last couple of years. Based on this historical data, Flipkart now knows that I'm much more interested in Adidas shoes than Nike shoes, so I may get ads for Adidas shoes whenever I visit Flipkart.
The Flipkart logs for the user "vinothkumar" will be:
• In year 2010- Bought Nike shoes
• In year 2011- Bought Adidas shoes
• In year 2012- Bought Adidas shoes
• In year 2013- Bought Adidas shoes
• In year 2014- Bought Adidas shoes
• In year 2015- Bought Adidas shoes
Now this is considered big data, and here we've looked at only one user ("vinothkumar") and only one product (shoes). Imagine the logs of all Flipkart users. The company can use this big data to predict customer preferences and sell products, which adds profit to the organization.
This is one such example of big data. It is also used in other fields, such as the hospital and medical industry, the travel industry, etc.
WHAT IS HADOOP?
In simple words, HADOOP is a framework for managing and processing this big data.
The data is divided into smaller chunks, and those chunks are distributed across "n" systems. We then process the data in parallel to obtain the final result.
NETWORK IDS – a real-time application (this was my graduate project; source code on request).
Imagine a scenario: the company has realized that a data breach has occurred, and it wants to track the malicious activity from a particular IP. The company had Wireshark running 24/7 to capture data packets, and the capture is stored as a .pcap file. Imagine the size of that file to be 1000 TB. Using traditional computing, it takes a huge amount of time to process this data, whereas in the HADOOP framework the data is split and stored across 1000 systems. Now each system has to process just 1 TB of data, and the data is processed in parallel to obtain the result more quickly, i.e., the malicious activities observed from that particular IP.
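The split-and-process-in-parallel idea above can be sketched in plain Java (this is not Hadoop itself, just a minimal single-machine analogy; the class and method names are made up for illustration). A hypothetical countMatches splits a packet log into chunks, scans each chunk on its own thread, and combines the per-chunk counts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {
    // Split a packet log into chunks and scan each chunk on its own
    // thread, then combine the per-chunk counts into one total.
    static long countMatches(List<String> log, String badIp, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        int size = Math.max(1, (log.size() + chunks - 1) / chunks);
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < log.size(); i += size) {
            List<String> chunk = log.subList(i, Math.min(i + size, log.size()));
            parts.add(pool.submit(() ->
                    chunk.stream().filter(line -> line.contains(badIp)).count()));
        }
        long total = 0;
        try {
            for (Future<Long> f : parts) total += f.get(); // combine partial results
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "10.0.0.1 GET /index", "10.0.0.9 LOGIN FAIL",
                "10.0.0.9 LOGIN FAIL", "10.0.0.2 GET /home");
        System.out.println(countMatches(log, "10.0.0.9", 2)); // prints 2
    }
}
```

In real Hadoop the "threads" are separate machines and the "combine" step is a reduce phase, but the shape of the computation is the same.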
DISTRIBUTED COMPUTING
• Let's understand the concept of distributed computing with a simple scenario.
• Consider the "Ox and the Load" example. Oxen are used to carry loads.
• When the load grows, we don't try to grow a bigger ox.
• Instead, we increase the number of oxen.
• The same concept is applied in distributed computing: when the data grows, we add more machines rather than building one bigger machine.
HADOOP DAEMONS
A daemon is a service provided by the operating system that runs in the background. It exits as soon as the server shuts down. There are 5 daemons in the HADOOP architecture, categorized into two types: Masters and Slaves.
MASTERS
1. Name Node
2. Job Tracker
3. Secondary Name Node
SLAVES
4. Data Node
5. Task Tracker
• As the name says, the SLAVES always act as per the commands/orders received from the MASTERS.
• Think of this like a typical job in the IT industry: employees always report their work to their reporting manager. Similarly, the Data Node and Task Tracker are the slaves, which report to the masters, the Name Node and Job Tracker respectively.
• Name Node (NN):
– The heart of the Hadoop architecture.
– Contains metadata, such as where each block of data is stored.
• Secondary Name Node (SNN):
– Nope, you've guessed it wrong: the SNN is not exactly a backup for the NN. Instead, it stores "checkpoints": at a particular instant (the checkpoint), it takes a backup of the NN's metadata. The SNN is also known as the checkpoint server.
– Checkpoints (CP):
If you are a gamer, you might have heard the term "checkpoint". When your mission fails, you resume from the checkpoint, not from the start of the game. The same goes for the SNN: when the NN fails, the SNN can act as a temporary NN from that particular checkpoint.
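The checkpoint idea can be illustrated with a toy simulation in plain Java (this is not real HDFS internals; the maps and method names are hypothetical stand-ins for the NN's metadata and the SNN's saved checkpoint). State written after the checkpoint is lost on recovery, which is exactly the limitation of resuming "from that particular CP":

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointDemo {
    // hypothetical in-memory stand-ins for NN metadata and the SNN's checkpoint
    static Map<String, String> nameNode = new HashMap<>();
    static Map<String, String> checkpoint = new HashMap<>();

    static void takeCheckpoint() {              // SNN copies the NN's state
        checkpoint = new HashMap<>(nameNode);
    }

    static Map<String, String> recover() {      // NN fails: resume from the CP
        return new HashMap<>(checkpoint);
    }

    public static void main(String[] args) {
        nameNode.put("File.txt/BlkA", "DN1,DN5,DN6");
        takeCheckpoint();                        // state saved here
        nameNode.put("Temp.txt/BlkX", "DN2");    // written after the checkpoint
        nameNode.clear();                        // simulate the NN crashing
        Map<String, String> restored = recover();
        // only metadata recorded up to the checkpoint survives
        System.out.println(restored.containsKey("File.txt/BlkA")); // true
        System.out.println(restored.containsKey("Temp.txt/BlkX")); // false
    }
}
```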
STEP 1: The client wants to store the file "File.txt" as chunks (Blocks A, B, C) in the HADOOP framework, so it seeks the help of the Name Node.
STEP 2: The Name Node queries its metadata to find free space. It replies to the client that Data Nodes DN 1, 5 and 6 are free, so the client can go ahead and store the data there.
STEP 3: The client stores these chunks on the Data Nodes accordingly.
NOTE: Storing and retrieving files from HDFS is automatically taken care of by HADOOP, i.e., we need not worry about where our data lives on the Data Nodes. The only things HADOOP expects from us are the input data and the Mapper/Reducer program, which is explained later.
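The write steps above can be sketched as a toy Name Node in plain Java (not the real HDFS API; the class, the metadata map, and allocate are hypothetical). The "Name Node" picks free Data Nodes for each block and records the placement in its metadata, which is what the client then uses to write:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WriteFlow {
    // toy NameNode state: block name -> DataNodes chosen to hold it
    static Map<String, List<Integer>> metadata = new HashMap<>();

    // STEP 2: the NN picks `rf` free DataNodes and records the placement;
    // STEP 3: the client writes the block to the returned nodes
    static List<Integer> allocate(String block, List<Integer> freeNodes, int rf) {
        List<Integer> chosen = new ArrayList<>(freeNodes.subList(0, rf));
        metadata.put(block, chosen);
        return chosen;
    }

    public static void main(String[] args) {
        List<Integer> free = List.of(1, 5, 6, 8, 2, 9);
        for (String blk : List.of("File.txt/A", "File.txt/B", "File.txt/C"))
            System.out.println(blk + " -> DN" + allocate(blk, free, 3));
    }
}
```

A real Name Node balances placement across nodes and racks rather than always taking the first free ones; this sketch only shows the query-metadata-then-write flow.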
STEP 1: The client wants to retrieve the contents of "Result.txt" (Blocks A, B, C), so it seeks the help of the Name Node.
STEP 2: The Name Node queries its metadata to find out where Result.txt is stored. It replies to the client that:
Blk A is stored in DN 1, 5, 6
Blk B is stored in DN 8, 1, 2
Blk C is stored in DN 5, 8, 9
You may be wondering why HADOOP stores the same block in 3 different Data Nodes (e.g., Blk A is stored in DN 1, 5 and 6). That is where the Replication Factor (RF) comes into the picture.
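The read lookup can be sketched the same way (again a toy, not the HDFS API; the metadata map mirrors the block locations listed above). The client asks the "Name Node" where each block lives, then fetches from any of the returned Data Nodes:

```java
import java.util.List;
import java.util.Map;

public class ReadFlow {
    // toy NameNode metadata: block -> DataNodes holding a replica,
    // matching the Result.txt example from the slide
    static Map<String, List<Integer>> metadata = Map.of(
            "Result.txt/BlkA", List.of(1, 5, 6),
            "Result.txt/BlkB", List.of(8, 1, 2),
            "Result.txt/BlkC", List.of(5, 8, 9));

    // STEP 2: the client asks the NN where a block lives
    static List<Integer> locate(String block) {
        return metadata.get(block);
    }

    public static void main(String[] args) {
        System.out.println(locate("Result.txt/BlkA")); // prints [1, 5, 6]
    }
}
```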
REPLICATION FACTOR
The HADOOP architecture is designed so that data loss shouldn't occur. But failures may still happen for whatever reason, and in such cases we need to ensure the availability of the data. So each block of data is replicated three times across the Data Nodes; the default replication factor for Hadoop is 3.
If you look carefully, you can see that Blk A is stored twice in Rack-5 and once in Rack-1. This is a specialty of the HADOOP architecture.
RACK AWARENESS
• If Block A (Blk A) has to be retrieved from a Data Node, Hadoop will normally prefer the rack that stores two copies of Blk A — in the previous slide, RACK-5.
• RACK-5 has two copies of Blk A, stored on DN-5 and DN-6.
• Imagine a scenario: the file pointer (FP) is in RACK-5, and for some reason DN-5 fails. Hadoop then has to fetch Blk A either from DN-6 in RACK-5 or from DN-1 in RACK-1.
• The search time to locate the block increases if Hadoop has to go to DN-1 in RACK-1. Instead, it can quickly retrieve another copy of Blk A from DN-6, which is in the same RACK-5. So, by default, HADOOP stores two copies of a block in the same rack and another copy in a different rack.
• This concept is known as RACK AWARENESS, and it makes HADOOP more effective at retrieving data.
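The two-copies-in-one-rack, one-copy-elsewhere rule can be sketched in plain Java (a toy placement function, not Hadoop's actual block placement policy; the rack map and names are hypothetical):

```java
import java.util.List;
import java.util.Map;

public class RackAwarePlacement {
    // hypothetical cluster layout: rack id -> DataNodes in that rack
    static Map<Integer, List<Integer>> racks = Map.of(
            1, List.of(1, 2, 3),
            5, List.of(5, 6, 7));

    // place 3 replicas: two in the primary rack, one in a different rack
    static List<String> place(int primaryRack, int otherRack) {
        List<Integer> same = racks.get(primaryRack);
        List<Integer> other = racks.get(otherRack);
        return List.of(
                "Rack" + primaryRack + "/DN" + same.get(0),
                "Rack" + primaryRack + "/DN" + same.get(1),
                "Rack" + otherRack + "/DN" + other.get(0));
    }

    public static void main(String[] args) {
        System.out.println(place(5, 1));
        // prints [Rack5/DN5, Rack5/DN6, Rack1/DN1]
    }
}
```

If a node in the primary rack fails, a reader still has a same-rack copy to fall back on before crossing racks, which is the whole point of the policy.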
MAP / REDUCE
• Map and Reduce are simply Java programs that do the processing on the data. They answer the following question:
"What do you want to do with this data?"
class Mapper {
    // Java code that processes each input record
}
class Reducer {
    // Java code that aggregates the mapper's output
}
WRITING YOUR FIRST HADOOP PROGRAM
Word Count is a simple application that counts the number of occurrences of each word in a given input set. This example is generally considered the "Hello, World!" program of Hadoop.
Visit the following link to learn how to install Hadoop and how to write the mapper and reducer code to count the occurrences of each word:
http://javabeginnerstutorial.com/hadoop/your-first-hadoop-map-reduce-job
• Mapper:
The mapper simply processes the input data based on the Java code written in the mapper class (in the example above, counting the occurrences of each word in the input file) and creates several small chunks of intermediate data.
• Reducer:
The reducer phase processes the chunks of data created in the mapper phase and, after shuffling and sorting, gives the output in the reduced or user-required format.
This can be better understood with the diagram explained on the next slide.
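The map → shuffle → reduce idea behind Word Count can be sketched in plain Java. This is not the actual Hadoop Mapper/Reducer API from the linked tutorial, just a self-contained sketch of the same three phases: map emits a (word, 1) pair per word, the shuffle groups pairs by word, and reduce sums each group:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // map phase: emit (word, 1) for every word in a line of input
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            out.add(Map.entry(w, 1));
        return out;
    }

    // shuffle + reduce phase: group the pairs by word, then sum the 1s
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : List.of("hello hadoop", "hello big data"))
            pairs.addAll(map(line));
        System.out.println(reduce(pairs));
        // prints {big=1, data=1, hadoop=1, hello=2}
    }
}
```

In real Hadoop, map runs on many machines in parallel and the framework performs the shuffle and sort between the phases; here everything happens in one process to keep the flow visible.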
CONCLUSION
• Big data and Hadoop are considered among the hottest topics in the IT industry. Everyone wants to learn this technology, but there aren't many professionals with in-depth knowledge of these frameworks, so there is always demand for big data engineers.
• The concepts of HADOOP and BIG DATA are very vast, and it's very difficult to explain the whole HADOOP ecosystem in slides. This presentation mainly aims to help beginners kick-start their brain.exe in the field of Big Data and HADOOP.
• I've tried my best to explain the concepts in a simple way that can be understood even by a beginner.
• Feedback is always appreciated.
• Thanks for taking the time to read my slides.