APACHE HADOOP SHAH HUSSAIN 1213313318

APACHE HADOOP

SHAH HUSSAIN

1213313318

DATA IS EVERYWHERE

DATA IS IMPORTANT

What is Hadoop?

Motivation of Hadoop

• How do you scale up applications?

– Run jobs processing 100’s of terabytes of data

– Takes 11 days to read on 1 computer

• Need lots of cheap computers

– Fixes speed problem (15 minutes on 1000 computers), but…

– Reliability problems

  • In large clusters, computers fail every day

  • Cluster size is not fixed

• Need common infrastructure

– Must be efficient and reliable
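
The read-time figures above can be checked with a little arithmetic. The sketch below is only an illustration: the 105 MB/s sequential read rate per machine is an assumption chosen to match the slide's numbers, and `read_time_seconds` is a hypothetical helper, not part of Hadoop.

```python
TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)


def read_time_seconds(data_bytes, throughput_bytes_per_s, machines=1):
    """Time to scan the data once if it is split evenly across machines."""
    return data_bytes / (throughput_bytes_per_s * machines)


data = 100 * TB   # "100's of terabytes": take 100 TB as the example size
disk = 105 * MB   # ASSUMED per-machine sequential read rate, not from the slides

single = read_time_seconds(data, disk)
cluster = read_time_seconds(data, disk, machines=1000)

print(f"1 computer:     {single / 86400:.1f} days")
print(f"1000 computers: {cluster / 60:.1f} minutes")
```

Under that assumption, one machine needs about 11 days and 1000 machines need roughly a quarter of an hour, which is in line with the figures on the slide. The calculation ignores coordination overhead and machine failures, which is exactly the reliability problem the slide goes on to raise.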

Motivation of Hadoop

• Open Source Apache Project

• Hadoop Core includes:

– Distributed File System - distributes data

– Map/Reduce - distributes application

• Written in Java

• Runs on

– Linux, Mac OS X, Windows, and Solaris

– Commodity hardware
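
The Map/Reduce idea listed above can be sketched in a few lines. This is a single-process toy in Python, not Hadoop's Java API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["data is everywhere", "data is important"])
print(counts)
```

Because each call to `map_phase` only looks at one line and each call to `reduce_phase` only looks at one key, both steps can be spread across many computers, which is what "distributes application" refers to.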

Fun Fact of Hadoop

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term."

---- Doug Cutting, Hadoop project creator

History of Hadoop

Apache Nutch

Doug Cutting

“Map-reduce” 2004

“It is an important technique!”

Extended

The great journey begins…

Nowadays…

• When you visit Yahoo!, you are interacting with data processed with Hadoop!

Nowadays…

• Yahoo! has ~20,000 machines running Hadoop

• The largest clusters are currently 2000 nodes

• Several petabytes of user data (compressed, unreplicated)

• Yahoo! runs hundreds of thousands of jobs every month

Applications…

• Who uses Hadoop?

• Amazon

• AOL

• Facebook

• Fox Interactive Media

• Google

• IBM

• New York Times

• PowerSet (now Microsoft)

• Quantcast

• Rackspace/Mailtrust

• Veoh

• Yahoo!

References

• http://hadoop.apache.org/

• http://en.wikipedia.org/wiki/Apache_Hadoop

• https://github.com/apache/hadoop

• http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html

Questions?

THANK YOU!