Structured, Unstructured and Complex Data Management Amit Chaudhary 11MCA03 Karthik Iyer 11MCA05

Hadoop


DESCRIPTION

Hadoop is a project run under the Apache Software Foundation. It is an efficient choice for managing large volumes of data across clusters of machines.



Page 2: Hadoop

Hadoop

What is this?
Structure of this
Is this unknown thing right for me?
Where is this used?

Page 3: Hadoop

Any idea? (Idea SIM card)

Page 4: Hadoop

What is Hadoop?

It is an open-source project by the Apache Software Foundation for large-scale data processing

It was inspired by Google’s MapReduce and Google File System (GFS) papers

It was originally conceived by Doug Cutting

It is named, incidentally, after his son’s toy elephant

Page 5: Hadoop

Large Data Means?

1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
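The ladder above is just repeated multiplication by 1000, which a few lines of Python can encode (note that "brontobyte" and "geopbyte" are informal names, not official SI prefixes; the function name here is invented for the sketch):

```python
# Each decimal unit is 1000x the previous one, starting from 1000 bytes.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte",
         "exabyte", "zettabyte", "yottabyte", "brontobyte", "geopbyte"]

def size_in_bytes(unit):
    """Return the size of `unit` in bytes, using 1000x steps."""
    return 1000 ** (units.index(unit) + 1)

print(size_in_bytes("megabyte"))  # 1000000
print(size_in_bytes("petabyte"))  # 1000000000000000
```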

Page 6: Hadoop

So what’s the big deal?

Scalable: New nodes can be added as needed, without changing the formats

Flexible: It is schema-less, and can absorb any type of data, structured or not, from any number of sources

Fault tolerant: System redirects work to another location if a node fails

Page 7: Hadoop

Hadoop = HDFS + MapReduce

HDFS: For storing massive datasets using low-cost storage

MapReduce: The programming model on which Google built its empire
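Real Hadoop MapReduce jobs are usually written in Java against the Hadoop API; purely to illustrate the idea, here is a minimal in-memory word count in plain Python, with the map, shuffle, and reduce phases spelled out (all function names here are invented for the sketch):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"], counts["big"])  # 2 2
```

In a real cluster the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network.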

Page 8: Hadoop

HDFS

It is a fault-tolerant storage system
It is able to store huge amounts of information
It creates clusters of machines and coordinates work among them
If one machine fails, it continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster

Page 9: Hadoop

HDFS

It manages storage on the cluster by breaking incoming files into pieces, called blocks

Stores each of the blocks redundantly across the pool of servers

By default, it stores three complete copies of each file by copying each block to three different servers
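As a toy illustration of block splitting and three-way replication (plain Python, not HDFS code; the tiny block size and server names are invented for the sketch, while real HDFS uses blocks of 64 or 128 MB):

```python
BLOCK_SIZE = 4   # bytes, absurdly small so the example is visible
REPLICATION = 3  # HDFS's default replication factor

servers = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the incoming file into fixed-size pieces (blocks).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, servers, replication=REPLICATION):
    # Round-robin placement: each block lands on `replication` distinct servers.
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [servers[(i + r) % len(servers)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hadoop distributed file system")
placement = place_replicas(blocks, servers)
print(len(blocks), placement[0])  # 8 ['node1', 'node2', 'node3']
```

Because every block lives on three servers, losing any one server leaves at least two copies of each block, and the system can re-replicate them onto the remaining machines.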

Page 10: Hadoop

How does this work?

Page 11: Hadoop

How does this work?

Page 12: Hadoop

Which companies are using it?

LinkedIn
Walt Disney
Wal-Mart
General Electric
Nokia
Bank of America
Foursquare

Page 13: Hadoop

Hadoop at Foursquare

Foursquare: Mobile + Location + Social Networking

Page 14: Hadoop

Is this unknown thing right for me?