2. What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
3. An RDBMS stores data in a structured format: rows, columns, tuples, primary keys and foreign keys. It was built mainly for transactional data analysis, with data warehouses later used for offline data. With the massive growth of the Internet and social networking (Facebook, LinkedIn), data has become far less structured.
4. What is Big Data? Big Data is similar to small data, only bigger: datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, and analytics.
5. The 3 Vs of Big Data: Volume (data quantity), Velocity (data speed), Variety (data types).
6. Hadoop History: In 2003 Doug Cutting was building Nutch, an open-source web crawler and indexer. Crawling and indexing at web scale posed a massive storage and processing problem. Google published its GFS paper in 2003 and its MapReduce paper in 2004; based on those papers, Doug redesigned Nutch and delivered the result in 2006 as Hadoop.
7. What is Hadoop? A framework of tools, open source and maintained under the Apache License, that supports running applications on Big Data and addresses the Big Data challenges: Volume, Velocity and Variety.
8. What is Hadoop? Hadoop is a software framework for distributed processing of large datasets (terabytes or petabytes of data) across large clusters of computers (hundreds or thousands of nodes). It is an open-source implementation of Google's MapReduce, is based on that simple programming model, and is written in Java.
9. Hadoop makes it easier to store, process and analyze a lot of data on commodity hardware!
10. Apache Hadoop. Developer(s): Apache Software Foundation; Initial release: December 10, 2011; Stable release: 2.6.0 (November 18, 2014); Development status: Active; Written in: Java.
11. Operating system: Cross-platform; Type: Distributed file system; License: Apache License 2.0; Website: hadoop.apache.org.
12. Characteristics of Hadoop. Scalable: a cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications. Cost effective: Hadoop brings massively parallel computing to commodity servers, resulting in a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
13. Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis than any one system can provide. Fault tolerant: when a node is lost, the system redirects work to another copy of the data and continues processing without missing a beat.
14. Hadoop Master/Slave Architecture: Hadoop is designed as a master-slave, shared-nothing architecture, with a single master node and many slave nodes.
16. HDFS Basics: HDFS (Hadoop Distributed File System) is a file system written in Java. It sits on top of a native file system and provides redundant storage for massive amounts of data.
17. Main Properties of HDFS. Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data. Replication: each data block is replicated many times (the default is 3). Failure: failure is the norm rather than the exception. Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the NameNode constantly checks on the DataNodes.
18. Hadoop Distributed File System (HDFS): a centralized NameNode maintains metadata about files; many DataNodes (thousands) store the actual data. Files are divided into blocks (64 MB by default), and each block is replicated N times (default N = 3). (Diagram: a file F split into blocks 1-5 of 64 MB each.)
19. HDFS Data: data is split into blocks and stored on multiple nodes in the cluster. Each block is usually 64 MB or 128 MB, and each block is replicated multiple times, with replicas stored on different DataNodes. HDFS is intended for large files (100 MB and up).
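The block size and replication factor described above can also be set per file through the HDFS Java client API. A minimal sketch follows; the path, the 128 MB block size, and the replication factor of 3 are illustrative assumptions, not values taken from the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the cluster file system

            // Write a (hypothetical) file with a 128 MB block size and 3 replicas
            Path file = new Path("/user/demo/sample.txt");
            long blockSize = 128L * 1024 * 1024;
            FSDataOutputStream out = fs.create(file, true,
                    conf.getInt("io.file.buffer.size", 4096), (short) 3, blockSize);
            out.writeUTF("hello hdfs");
            out.close();

            // Ask the NameNode where each block (and its replicas) is stored
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc);                // offset, length and hosting DataNodes
            }
        }
    }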
20. Two Kinds of Nodes: Master Nodes and Slave Nodes.
21. Master Nodes: NameNode (only 1 per cluster), the metadata server and database; JobTracker (only 1 per cluster), the job scheduler.
22. Slave Nodes: DataNodes (1-4000 per cluster), which store block data; TaskTrackers (1-4000 per cluster), which execute tasks.
23. NameNode: a single NameNode stores all metadata: filenames, the locations on DataNodes of each block, owner, group, etc. All information is maintained in RAM for fast lookup, so file system metadata size is limited by the amount of RAM available on the NameNode.
24. DataNode: DataNodes store file contents. Different blocks of the same file are stored on different DataNodes, and the same block is stored on three (or more) DataNodes for redundancy.
25. MapReduce
26. MapReduce: the programming model used by Google. Input: a set of key/value pairs. The user supplies two functions: map(k, v) -> list(k1, v1) and reduce(k1, list(v1)) -> v2. Map processes a key/value pair to generate intermediate key/value pairs; Reduce merges all intermediate values associated with the same key.
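As a concrete illustration of those two signatures, here is a minimal Hadoop Java sketch. The "station,temperature" record format and the max-per-station logic are assumptions chosen only for this example; the deck's own WordCount example appears a few slides later.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(k, v) -> list(k1, v1): parse one "station,temperature" line, emit (station, temperature)
    class MaxTempMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim()),
                        new LongWritable(Long.parseLong(parts[1].trim())));
            }
        }
    }

    // reduce(k1, list(v1)) -> v2: merge all temperatures seen for one station into their maximum
    class MaxTempReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text station, Iterable<LongWritable> temps, Context context)
                throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable t : temps) max = Math.max(max, t.get());
            context.write(station, new LongWritable(max));
        }
    }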
28. Properties of the MapReduce Engine: the JobTracker runs on the master node (alongside the NameNode). It receives the user's job and decides how many tasks (mappers) will run; for example, a file with 5 blocks results in 5 map tasks, spread across the nodes holding those blocks.
29. Properties of the MapReduce Engine (contd.): the TaskTracker runs on each slave node (each DataNode). It receives tasks from the JobTracker, runs each task (map or reduce) to completion, and stays in constant communication with the JobTracker to report progress. (Diagram: one MapReduce job consisting of 4 map tasks feeding, via a parse/hash partitioning step, into 3 reduce tasks.)
30. How Map and Reduce Work Together: Map emits intermediate information; Reduce accepts that information and applies a user-defined function to reduce (aggregate) the amount of data.
31. MapReduce Example - WordCount
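The slides do not include the WordCount source, so the following is a sketch of the standard pattern against Hadoop's Java MapReduce API: the mapper emits (word, 1) for every word in each input line, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line, emit (word, 1) for every word in the line
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the 1s emitted for the same word
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }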
32. Lifecycle of a MapReduce Job: write the Map function, write the Reduce function, then run the program as a MapReduce job.
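Running the program as a MapReduce job is done by a small driver that configures and submits the job. A minimal sketch, reusing the TokenizerMapper and IntSumReducer classes from the WordCount sketch above and taking the HDFS input and output paths as command-line arguments:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);

            job.setMapperClass(TokenizerMapper.class);   // the Map function
            job.setReducerClass(IntSumReducer.class);    // the Reduce function
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(3);                    // e.g. 3 reduce tasks, as in slide 29

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

            // Submit the job to the cluster and wait for it to finish
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }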
33. Hadoop Workflow: 1. Load data into HDFS. 2. Develop code locally. 3. Submit the MapReduce job to the cluster (and go back to Step 2 as needed). 4. Retrieve data from HDFS.
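Steps 1 and 4 of this workflow (moving data into and out of HDFS) can also be done programmatically. A minimal sketch, assuming a hypothetical local file ./input.txt and a job that writes its results to /user/demo/output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsInOut {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Step 1: load data into HDFS
            fs.copyFromLocalFile(new Path("./input.txt"), new Path("/user/demo/input/input.txt"));

            // Steps 2-3: develop the MapReduce code locally and submit the job
            // (see the WordCount driver sketch above)

            // Step 4: retrieve results from HDFS once the job has finished
            fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"), new Path("./result.txt"));
        }
    }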