Structured, Unstructured and Complex Data Management Amit Chaudhary 11MCA03 Karthik Iyer 11MCA05

Hadoop


DESCRIPTION

Hadoop is a project run under the Apache Software Foundation. It is an efficient choice for managing large volumes of data across clusters of machines.



Page 2: Hadoop

Hadoop

What is this?
Structure of this
Is this unknown thing right for me?
Where is this used?

Page 3: Hadoop

Any idea? (Idea SIM card)

Page 4: Hadoop

What is Hadoop?

It is an open-source project by the Apache Software Foundation for large-scale data processing

It was inspired by Google’s MapReduce and Google File System (GFS) papers

It was originally conceived by Doug Cutting

It is named, incidentally, after his son’s toy elephant

Page 5: Hadoop

Large Data Means?

1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
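The ladder above is just repeated multiplication by 1000, which a few lines of Python can encode (note that "brontobyte" and "geopbyte" are informal names, not official SI prefixes; the function name here is invented for the sketch):

```python
# Each decimal unit is 1000x the previous one, starting from 1000 bytes.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte",
         "exabyte", "zettabyte", "yottabyte", "brontobyte", "geopbyte"]

def size_in_bytes(unit):
    """Return the size of `unit` in bytes, using 1000x steps."""
    return 1000 ** (units.index(unit) + 1)

print(size_in_bytes("megabyte"))  # 1000000
print(size_in_bytes("petabyte"))  # 1000000000000000
```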

Page 6: Hadoop

So what’s the big deal?

Scalable: New nodes can be added as needed, without changing the formats

Flexible: It is schema-less, and can absorb any type of data, structured or not, from any number of sources

Fault tolerant: System redirects work to another location if a node fails

Page 7: Hadoop

Hadoop = HDFS + MapReduce

HDFS: For storing massive datasets using low-cost storage

MapReduce: The programming model on which Google built its empire
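Real Hadoop MapReduce jobs are usually written in Java against the Hadoop API; purely to illustrate the idea, here is a minimal in-memory word count in plain Python, with the map, shuffle, and reduce phases spelled out (all function names here are invented for the sketch):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"], counts["big"])  # 2 2
```

In a real cluster the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network.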

Page 8: Hadoop

HDFS

It is a fault-tolerant storage system
It is able to store huge amounts of information
It creates clusters of machines and coordinates work among them
If one machine fails, it continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster

Page 9: Hadoop

HDFS

It manages storage on the cluster by breaking incoming files into pieces, called blocks

Stores each of the blocks redundantly across the pool of servers

By default, it stores three complete copies of each file by copying each block to three different servers
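As a toy illustration of block splitting and three-way replication (plain Python, not HDFS code; the tiny block size and server names are invented for the sketch, while real HDFS uses blocks of 64 or 128 MB):

```python
BLOCK_SIZE = 4   # bytes, absurdly small so the example is visible
REPLICATION = 3  # HDFS's default replication factor

servers = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the incoming file into fixed-size pieces (blocks).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, servers, replication=REPLICATION):
    # Round-robin placement: each block lands on `replication` distinct servers.
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [servers[(i + r) % len(servers)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hadoop distributed file system")
placement = place_replicas(blocks, servers)
print(len(blocks), placement[0])  # 8 ['node1', 'node2', 'node3']
```

Because every block lives on three servers, losing any one server leaves at least two copies of each block, and the system can re-replicate them onto the remaining machines.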

Page 10: Hadoop

How does this work?

Page 11: Hadoop

How does this work?

Page 12: Hadoop

Which companies are using it?

LinkedIn
Walt Disney
Wal-Mart
General Electric
Nokia
Bank of America
Foursquare

Page 13: Hadoop

Hadoop at Foursquare

Foursquare: Mobile + Location + Social Networking

Page 14: Hadoop

Is this unknown thing right for me?