
Hadoop map reduce


Big Data (Hadoop and MapReduce)

What is Hadoop? Simply put, Hadoop lets you store files bigger than what can be stored on one particular node or server. So you can store very large files, and many files, on multiple servers/computers in a distributed fashion. Advantages of Hadoop include affordability (it runs on industry-standard hardware) and agility (store any data, run any analysis).

Hadoop is an Apache open source project that provides a parallel storage and processing framework. Its primary purpose is to run MapReduce batch programs in parallel on tens to thousands of server nodes.

Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers.

Hadoop comes with the libraries and utilities needed by other Hadoop modules. Hadoop consists of the Hadoop Common package, which provides filesystem and OS-level abstractions, and a MapReduce engine. The Hadoop Common package contains the necessary Java files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.

HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
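As a rough illustration of the idea described above, a file can be thought of as being split into fixed-size blocks, with each block replicated on several nodes. The sketch below is a toy simulation in plain Python, not the real HDFS implementation; the block size (128 MB) and replication factor (3) are the standard HDFS defaults, and the function and node names are illustrative:

```python
# Toy illustration of HDFS-style storage: split a file into fixed-size
# blocks and replicate each block on several nodes. A conceptual sketch
# only, not the actual HDFS implementation.

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

# A 300 MB file on a five-node cluster: 128 MB + 128 MB + 44 MB blocks,
# each stored on three different nodes.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))
print(place_blocks(blocks, ["node1", "node2", "node3", "node4", "node5"]))
```

Losing any single node leaves at least two copies of every block available, which is how HDFS tolerates hardware failure on commodity machines.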

What is MapReduce? MapReduce is a framework for processing the data. The data is not moved across the network in the conventional fashion, because that is slow for huge amounts of data. MapReduce uses an approach better suited to big data sets: rather than move the data to the software, MapReduce moves the processing software to the data.

MAP -> REDUCE

Reduce output (counts per word):
KEY    TO  BE  OR  NOT
VALUE   2   2   1   1

MapReduce – a programming model for large-scale data processing. MapReduce refers to the application modules written by a programmer that run in two phases: first mapping the data (extract), then reducing it (transform).

One of Hadoop’s greatest benefits is the ability of programmers to write application modules in almost any language and run them in parallel on the same cluster that stores the data. With Hadoop, any programmer can harness the power and capacity of thousands of CPUs and hard drives simultaneously.

Map output (one pair per word of "TO BE OR NOT TO BE"):
KEY    TO  BE  OR  NOT  TO  BE
VALUE   1   1   1   1   1   1
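The word-count example in the tables above can be sketched in plain Python. The functions here are illustrative stand-ins for the map and reduce phases (and the shuffle step Hadoop runs between them), not the Hadoop API itself:

```python
# Word count as a minimal map/shuffle/reduce simulation: the mapper emits
# each word with a count of 1, the shuffle groups pairs by key, and the
# reducer sums the counts per word. Plain Python, not the Hadoop API.

from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group the mapper's (key, value) pairs by key, as Hadoop does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(grouped):
    """Reduce phase: sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

pairs = mapper("TO BE OR NOT TO BE")
print(pairs)    # six (word, 1) pairs, matching the map-output table
counts = reducer(shuffle(pairs))
print(counts)   # {'TO': 2, 'BE': 2, 'OR': 1, 'NOT': 1}
```

In a real cluster the mapper and reducer run in parallel on many nodes, each processing the data blocks stored locally; this sketch runs the same logic on one machine to show the data flow.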