Hadoop
Chris McConnell
CSI-541
4/6/2010
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion
Introduction
What motivated Hadoop? Large amounts of data, with a desire to query it on demand and quickly.
What about a traditional RDBMS?
            Traditional RDBMS          MapReduce
Data Size   Gigabytes                  Petabytes
Access      Interactive and batch      Batch
Updates     Read and write many times  Write once, read many
Structure   Static schema              Dynamic schema
Integrity   High                       Low
Scaling     Nonlinear                  Linear
Introduction II
What is Hadoop good for?
Semi-structured/unstructured data
Large volumes of data
Many reads, few writes, and writes that span a majority of the data set
Generally*: image analysis, graph-based problems, even machine learning algorithms
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion
Map/Reduce
Map/Reduce is a programming ‘technique’ introduced by Google in 2004
What is it? Given data, we ‘Map’ it, then ‘Reduce’ it until we have our answer
When is it good? Lots of data that follows a similar format for the specified query
Map/Reduce II
Example: Stock Market, Top Price
Data lines: historical stock data, one line per recording:
XYZD01142010T08301234P2534C+12
NADD01142010T08452549P453C-01
…
“Mapped” to (key, value)… let’s get something meaningful, though:
(XYZ, D01142010T08301234P2534C+12)
(NAD, D01142010T08452549P453C-01)
…
Map/Reduce III
Example continued
Mapped: there could be multiple lines per key, so emit each (key, value) pair for the reducing stage:
(XYZ, 25.34)
(XYZ, 35.12)
(NAD, 5.66)
(NAD, 4.53)
…
Now, pass it along to the Reduce(r)…
Map/Reduce IV
From Map to Reduce, the system organizes pairs by key (more details later). Reduce gets:
(XYZ, [25.34, 35.12])
(NAD, [5.66, 4.53])
Final Reduce: iterate over the values, take the max, and output:
(XYZ, 35.12)
(NAD, 5.66)
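The whole pipeline above can be sketched in plain Java, with no Hadoop involved. The record layout (first three characters are the ticker; the digits between 'P' and 'C' are the price in cents) is inferred from the sample lines on these slides, so treat the parser as an illustrative assumption; the second XYZ record is invented here so that one key has two values to reduce over.

```java
import java.util.*;

// Plain-Java sketch of the map -> shuffle -> reduce flow for the
// stock example above. Parsing rule and the second XYZ record are
// illustrative assumptions, not part of Hadoop itself.
public class MaxPrice {

    // Map: parse one raw line into (ticker, price). Assumes the first
    // three characters are the ticker and the digits between 'P' and
    // 'C' are the price in cents, as the sample lines suggest.
    static Map.Entry<String, Double> map(String line) {
        String ticker = line.substring(0, 3);
        int p = line.indexOf('P');
        int c = line.indexOf('C');
        double price = Integer.parseInt(line.substring(p + 1, c)) / 100.0;
        return Map.entry(ticker, price);
    }

    // Shuffle: group the mapped values by key, as the framework would.
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> e : pairs)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return groups;
    }

    // Reduce: iterate over each key's value list and keep the maximum.
    static Map<String, Double> reduce(Map<String, List<Double>> groups) {
        Map<String, Double> out = new TreeMap<>();
        groups.forEach((k, vs) -> out.put(k, Collections.max(vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "XYZD01142010T08301234P2534C+12",
            "XYZD01142010T09001234P3512C+10",  // invented second XYZ record
            "NADD01142010T08452549P453C-01");
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String l : lines) mapped.add(map(l));
        System.out.println(reduce(shuffle(mapped)));  // {NAD=4.53, XYZ=35.12}
    }
}
```

Running this prints the per-ticker maxima, matching the slide's XYZ result; only one NAD record appears in this sample, so its max differs from the slide's two-record case.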
Map/Reduce V
General Flow
Map/Reduce VI
Shuffle (not talking about iPods)
When sending data from the Map to the Reduce stage, it flows through Partition, Shuffle, and Sort post-/pre-processing steps
This is done to allow for higher parallelism
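As a concrete sketch of the partition step: Hadoop's default HashPartitioner routes each key to a reducer with essentially the formula below. This is a standalone re-implementation for illustration, not the Hadoop class itself.

```java
// Sketch of the partition step that routes map output to reducers.
// Hadoop's default HashPartitioner uses essentially this formula:
// clear the sign bit of the key's hash, then take it modulo the
// number of reduce tasks, so every copy of a key reaches the same reducer.
public class Partition {

    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Example keys from the stock data; the reducer count (2) is arbitrary.
        for (String key : new String[] {"XYZ", "NAD"})
            System.out.println(key + " -> reducer " + partition(key, 2));
    }
}
```

Because the mapping is deterministic, all (XYZ, value) pairs land on one reducer and can be processed independently of the others, which is what makes the reduce stage parallelizable.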
Map/Reduce VII
Map/Reduce VIII
Everything is parallel… sort of
Implicitly, mapping can be parallel: while the list for one key is being created, the other keys don’t care, but it might take planning to accomplish
Reducing can be fully parallel
Bottleneck? Reduce needs to wait for Map
Map/Reduce IX
One more overview, at the highest level
Outline
Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion
Hadoop – Introduction
History
Created by Doug Cutting; started as an open-source web search engine (Nutch)
2004: began the Nutch Distributed File System, modeled after the Google File System
2005: began a MapReduce implementation on NDFS
2006: the Hadoop subproject began
2008: Yahoo! utilizes a 10,000-core cluster for its production search engine
Hadoop – Introduction II
Current status: multiple subprojects (will discuss some later)
Core – components and interfaces for the distributed filesystems and general I/O
MapReduce – discussed earlier
HDFS – Hadoop Distributed File System (details coming soon)
Hadoop – HDFS
Clusters follow a master/worker pattern
Namenode (master):
Single per cluster (not required for all)
Maintains the file system tree and metadata
Accepts MR jobs from clients
Handles replication and block assignments, among other tasks
Hadoop – HDFS II
Datanode (workers) Execute tasks as told to do so Useless for recovery Files on Datanode are handled in large blocks,
typically 64/128MB Client can interact with namenode and
datanodes since data is not sent to namenode Minimizing Bottlenecks
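To make the block sizes above concrete, here is a quick arithmetic sketch of how many blocks a file occupies. The 64 MB figure comes from the slide; the helper names are my own.

```java
// Arithmetic sketch of HDFS-style block splitting, using the 64 MB
// default mentioned on the slide. blockCount is ceiling division:
// even a 1-byte remainder still occupies one more block.
public class Blocks {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGB));     // 16
        System.out.println(blockCount(oneGB + 1)); // 17
    }
}
```

Each of those blocks is then replicated across datanodes, which is what the balancing and fault-tolerance points on the next slide deal with.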
Hadoop – HDFS III
General overview
Hadoop – HDFS IV
Meta-data: list of files, blocks, datanodes, file attributes
Balancing: try to balance data across all datanodes by moving blocks or creating replicas
Fault tolerance: logs, a secondary namenode, and write confirmation after all replicas are written
Communication: basic TCP/IP with protocols
Outline
Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion
Hadoop – Setup
Download… double click… next a few times…
Just kidding; however, it’s not too bad
Hadoop is available for Windows and Linux systems
We will discuss some brief setup for a Linux cluster system
Hadoop – Setup II
Ensure you have Java 1.6 installed Download and extract the Hadoop system
(reference at end) Single machine – all set Multiple machine clusters involve a few
more steps…
Hadoop – Setup III
Set up your namenode in the conf/master file (specify IP)
Set up datanodes in the conf/slave file (specify IP)
Configure ports in conf/core-site.xml, conf/mapred-site.xml, conf/hdfs-site.xml Override defaults
Finally, point to an output location
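For illustration, a minimal override in conf/core-site.xml for a 0.20-era cluster could look like the fragment below. The hostname and port are placeholders; per-cluster settings such as the replication factor would go in conf/hdfs-site.xml in the same property format.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where clients find the HDFS namenode (placeholder host and port). -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```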
Hadoop – Setup IV
Common commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop namenode -format
bin/hadoop dfs (or fs) -copyFromLocal file dir
bin/hadoop dfs -copyToLocal file dir
Many more basic *nix commands (-ls, -cat, -mkdir)
Some non-standard commands (-rmr instead of rmdir)
Hadoop – Example
Let’s look at some code and a simple run:
bin/hadoop jar <jar location> <main class> <input dir> <output dir> <other args>
Note: args can be in any order, but it is usually suggested to have input/output first, as they are “required”
Hadoop – Example II
Extra information can be found under /logs/userlogs
Directories exist for the Map/Reduce stages
stdout will hold System.out.print..()
stderr will hold System.err.print..()
Hadoop – Example III
A deeper look
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
  Other Technologies
  Hadoop vs. Others
  Future Topics
Conclusion
Adv – Other Technologies
HBase (Powerset): table storage for semi-structured data
ZooKeeper (Yahoo!): coordinating distributed applications
Hive (Facebook): SQL-like query language
Adv – Other Technologies II
Pig (Yahoo!): high-level language for data analysis
Example, in SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
The same query in Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Adv – Other Technologies III
How? These build on top of the structure already there
Adv – Hadoop vs. Others
How does Hadoop stack up?
A study compares Hadoop vs. Vertica vs. DBMS-X (legal restrictions prevent the actual name)
The study was performed on a cluster of 100 nodes @ 2.40 GHz Intel dual core
A few measurements: load time for data, specific task run times, startup, even ease of use
The outcome…
Adv – Hadoop vs. Others II
Data loading was much faster
Adv – Hadoop vs. Others III
However, a simple select proves too much
Adv – Hadoop vs. Others IV
But, an advanced task…
Adv – Hadoop vs. Others V
Finally, the study felt that Hadoop was much easier to get started with
Programming with Hadoop breaks the rules
Not really a good interface… yet
Needs a significant amount of data for processing to show improvements
Conclusion
Hadoop has a lot of potential, but needs some work for any moderate-sized data set
Systems where unstructured data needs the same question answered across all of it stand out as Hadoop candidates
Tough learning curve for anything advanced (maybe)
Questions?

References
Map/Reduce with Hadoop Presentation, ETH Zurich, 2008. www.systems.ethz.ch/hs08/hadoop.pdf
Apache Hadoop website (http://hadoop.apache.org/)
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
Olston, Christopher et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD, 2008.
Pavlo, Andrew et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, 2009.
White, Tom. Hadoop: The Definitive Guide. O’Reilly, 2009.