Big Data (Hadoop and MapReduce)
What is Hadoop? The simple answer: Hadoop lets you store files bigger than what
can be stored on any one particular node or server. So you can store very large
files, and many files, across multiple servers/computers in a distributed fashion.
Advantages of Hadoop include affordability (it runs on industry-standard hardware)
and agility (store any data, run any analysis).
Hadoop is an Apache open source project that provides a parallel storage and
processing framework. Its primary purpose is to run MapReduce batch programs in
parallel on tens to thousands of server nodes.
Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed
File System (HDFS) to manage huge data sets and spread them across the servers.
Hadoop comes with the libraries and utilities needed by other Hadoop modules. Hadoop
consists of the Hadoop Common package, which provides filesystem- and OS-level
abstractions, and a MapReduce engine. The Hadoop Common package contains the
Java files and scripts needed to start Hadoop. The package also provides
source code, documentation, and a contribution section that includes projects from
the Hadoop community.
The Hadoop Distributed File System (HDFS) stores data on commodity machines, providing
very high aggregate bandwidth across the cluster.
HDFS was designed to be a scalable, fault-tolerant, distributed storage system that
works closely with MapReduce. HDFS will “just work” under a variety of physical and
systemic circumstances. By distributing storage and computation across many servers,
the combined storage resource can grow with demand while remaining economical at
every size.
What is MapReduce? MapReduce is a framework for processing the data. The data is not
moved over the network in the conventional fashion, because that is slow for huge
volumes of data. MapReduce takes an approach that fits big data sets better: rather
than move the data to the software, MapReduce moves the processing software to the
data.
The classic MAP → REDUCE example is a word count on the phrase "TO BE OR NOT TO BE".
The REDUCE phase produces:

KEY    TO  BE  OR  NOT
VALUE   2   2   1    1
MapReduce – a programming model for large-scale data processing. MapReduce
refers to the application modules written by a programmer that run in two phases: first
mapping the data (extract), then reducing it (transform).
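The two phases can be sketched in plain Python as a local simulation (this is not Hadoop's actual API; the function names `map_phase` and `reduce_phase` are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: group the pairs by key and sum the values per word.
    # (In Hadoop, the framework sorts and groups by key between phases.)
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

pairs = map_phase("TO BE OR NOT TO BE")
print(pairs)                 # six (word, 1) pairs, one per word
print(reduce_phase(pairs))   # {'BE': 2, 'NOT': 1, 'OR': 1, 'TO': 2}
```

On a real cluster each map task runs against the data blocks stored on its own node, and only the small intermediate (key, value) pairs travel over the network to the reducers.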
One of Hadoop’s greatest benefits is the ability of programmers to write application
modules in almost any language and run them in parallel on the same cluster that
stores the data. With Hadoop, any programmer can harness the power and capacity of
thousands of CPUs and hard drives simultaneously.
The MAP phase output for the same phrase, before reduction:

KEY    TO  BE  OR  NOT  TO  BE
VALUE   1   1   1    1   1   1
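Because modules can be written in almost any language, the same word count is often done with Hadoop Streaming, where the mapper and reducer are ordinary programs reading lines from stdin and writing tab-separated key/value lines to stdout. A minimal Python sketch of that style (written here as functions over line iterables rather than two standalone scripts):

```python
from collections import defaultdict

def mapper(lines):
    # Streaming mapper: emit "word<TAB>1" for each word in the input.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming reducer: sum the counts per word. (Hadoop sorts the
    # mapper output by key before it reaches the reducer.)
    totals = defaultdict(int)
    for line in lines:
        word, count = line.rsplit("\t", 1)
        totals[word] += int(count)
    for word, total in totals.items():
        yield f"{word}\t{total}"
```

On a cluster these would be two scripts passed to the hadoop-streaming jar via its `-mapper` and `-reducer` options; the sketch above only illustrates the stdin/stdout contract.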