Hadoop
Chris McConnell, CSI-541, 4/6/2010

Page 1: Hadoop - Albany gilder/Hadoop.pdf

Hadoop

Chris McConnell, CSI-541, 4/6/2010

Page 2:

Outline

Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion

Page 4:

Introduction

What motivated Hadoop? Large amounts of data, with a desire to query it on demand and quickly. What about a traditional RDBMS?

            Traditional RDBMS          MapReduce
Data Size   Gigabytes                  Petabytes
Access      Interactive and batch      Batch
Updates     Read and write many times  Write once, read many
Structure   Static schema              Dynamic schema
Integrity   High                       Low
Scaling     Nonlinear                  Linear

Page 5:

Introduction II

What is Hadoop good for?
Semi-structured/unstructured data
Large volumes of data
Many reads, few writes, where a write spans a majority of the data set
Generally*: image analysis, graph-based problems, even machine learning algorithms

Page 6:

Outline

Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion

Page 7:

Map/Reduce

Map/Reduce is a programming ‘technique’ that was introduced by Google in 2004

What is it? Given data, we want to ‘Map’ it, then ‘Reduce’ it until we have our answer.
When is it good? Lots of data that follows a similar format for the specified query.

Page 9:

Map/Reduce II

Example: Stock Market, Top Price
Data lines: historical stock data, one line per recording
XYZD01142010T08301234P2534C+12
NADD01142010T08452549P453C-01
…
“Mapped” as (key, value)…let’s get something meaningful though:
(XYZ, D01142010T08301234P2534C+12)
(NAD, D01142010T08452549P453C-01)
…

Page 10:

Map/Reduce III

Example continued
Mapped: there could be multiple lines per key, so emit each (key, value) pair for the reducing stage:
(XYZ, 25.34)
(XYZ, 35.12)
(NAD, 5.66)
(NAD, 4.53)
…
Now, pass it along to the Reduce(r)…

Page 11:

Map/Reduce IV

From Map to Reduce, the system organizes based upon keys (more details later). Reduce gets:
(XYZ, [25.34, 35.12])
(NAD, [5.66, 4.53])
Final Reduce: iterate over the values, take the max, and output:
(XYZ, 35.12)
(NAD, 5.66)
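The map, shuffle, and reduce stages above can be simulated in a single process to make the flow concrete. This is an illustrative sketch, not Hadoop API code: the tab-separated "SYMBOL price" input and all class/method names are assumptions standing in for the compact recording format on the slide.

```java
import java.util.*;

public class MaxPriceMapReduce {
    // Map: emit one (symbol, price) pair per input line.
    static List<Map.Entry<String, Double>> map(List<String> lines) {
        List<Map.Entry<String, Double>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            pairs.add(Map.entry(parts[0], Double.parseDouble(parts[1])));
        }
        return pairs;
    }

    // Shuffle: group values by key, as the framework does between stages.
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (var p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: iterate over each key's values and keep the max.
    static Map<String, Double> reduce(Map<String, List<Double>> grouped) {
        Map<String, Double> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, Collections.max(vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("XYZ\t25.34", "XYZ\t35.12", "NAD\t5.66", "NAD\t4.53");
        System.out.println(reduce(shuffle(map(lines)))); // {NAD=5.66, XYZ=35.12}
    }
}
```

In real Hadoop the three stages run on different machines; here they are just three method calls, which is enough to see why each stage can be parallelized independently.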

Page 12:

Map/Reduce V

General Flow

Page 14:

Map/Reduce VI

Shuffle: not talking about iPods. When sending from the Map to the Reduce stage, data flows through Partition, Shuffle, and Sort post-/pre-processing. This is done to allow for higher parallelism.
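The partition step is commonly just a hash of the key modulo the number of reducers, which is what buys the parallelism: every value for a given key lands on the same reducer. A minimal sketch (this mirrors default hash partitioning in spirit; the class and method names here are made up for illustration):

```java
public class PartitionSketch {
    // Route a key to one of numReducers partitions.
    // Masking with Integer.MAX_VALUE keeps the hash non-negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, so a reducer
        // sees every value for the keys assigned to it.
        System.out.println(partition("XYZ", 4));
    }
}
```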

Page 15:

Map/Reduce VII

Page 17:

Map/Reduce VIII

Everything is parallel…sort of. Implicitly, mapping can be parallel: while the list for one key is being created, the other keys don't care, though it might take planning to accomplish. Reducing can be done entirely in parallel.
Bottleneck? Reduce needs to wait for Map.

Page 18:

Map/Reduce IX

One more overview, at the highest level

Page 19:

Outline

Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion

Page 20:

Hadoop - Introduction History

Created by Doug Cutting
Started as an open source web search engine
2004: began the Nutch Distributed File System, modeled after the Google File System
2005: began a MapReduce implementation on NDFS…
2006: the Hadoop subproject began
2008: Yahoo! utilizes a 10,000-core cluster for its production search engine

Page 22:

Hadoop – Introduction II

Current status: multiple subprojects (will discuss some later)
Core: components and interfaces for the distributed filesystems and general I/O
MapReduce: discussed earlier
HDFS: Hadoop Distributed File System (details coming soon)

Page 23:

Hadoop – HDFS

Clusters follow a master/worker pattern
Namenode (master)
  Single per cluster (not required for all)
  Maintains the file system tree and metadata
  Accepts MR jobs from clients
  Handles replication and block assignments, among other tasks

Page 24:

Hadoop – HDFS II

Datanodes (workers)
  Execute tasks as told to do so
  Useless for recovery
  Files on datanodes are handled in large blocks, typically 64/128 MB
Clients can interact with the namenode and datanodes directly; since data is not sent through the namenode, this minimizes bottlenecks
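Because files are stored as fixed-size blocks, the number of blocks a file occupies is a ceiling division. A small sketch using the slide's typical 64 MB block size (the helper is hypothetical, not an HDFS API):

```java
public class BlockSketch {
    // Number of HDFS-style blocks needed for a file: ceiling division.
    static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        // A 1 GB file with a typical 64 MB block size occupies 16 blocks.
        System.out.println(numBlocks(1024 * MB, 64 * MB)); // 16
    }
}
```

Each of those blocks can then be replicated and placed on different datanodes, which is what the namenode's block-assignment bookkeeping tracks.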

Page 25:

Hadoop – HDFS III

General overview

Page 29:

Hadoop – HDFS IV

Meta-data: list of files, blocks, datanodes, file attributes
Balancing: try to balance data across all datanodes by moving blocks or creating replicas
Fault tolerance: logs, secondary namenode, write confirmation after all replicas are written
Communication: basic TCP/IP with protocols

Page 30:

Outline

Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion

Page 34:

Hadoop – Setup

Download…double click…next a few times…

Just kidding; however, it's not too bad. Hadoop is available for Windows and Linux systems. We will discuss some brief setup for a Linux cluster system.

Page 35:

Hadoop – Setup II

Ensure you have Java 1.6 installed
Download and extract the Hadoop system (reference at end)
Single machine: all set
Multiple-machine clusters involve a few more steps…

Page 36:

Hadoop – Setup III

Set up your namenode in the conf/master file (specify its IP)
Set up datanodes in the conf/slave file (specify their IPs)
Configure ports in conf/core-site.xml, conf/mapred-site.xml, and conf/hdfs-site.xml (override the defaults)
Finally, point to an output location

Page 37:

Hadoop – Setup IV

Common commands
  bin/start-all.sh
  bin/stop-all.sh
  bin/hadoop namenode -format
  bin/hadoop dfs (or fs) -copyFromLocal <file> <dir>, and -copyToLocal
Many more basic *nix commands (-ls, -cat, -mkdir)
Some non-standard commands (-rmr instead of rmdir)

Page 38:

Hadoop - Example

Let's look at some code and a simple run:
bin/hadoop jar <jar location> <main class> <input dir> <output dir> <other args>
Note: args can be in any order, but it is usually suggested to put input/output first, as they are “required”

Page 39:

Hadoop – Example II

Extra information can be found under /logs/userlogs
  Directories for the Map/Reduce stages
  stdout will hold System.out.print..()
  stderr will hold System.err.print..()

Page 40:

Hadoop – Example III

A deeper look

Page 41:

Outline

Introduction
Map/Reduce
Hadoop
Advanced Topics
  Other Technologies
  Hadoop vs. Others
  Future Topics
Conclusion

Page 42:

Adv – Other Technologies

HBase (Powerset): table storage for semi-structured data
ZooKeeper (Yahoo!): coordinating distributed applications
Hive (Facebook): SQL-like query language

Page 43:

Adv – Other Technologies II

Pig (Yahoo!): high-level language for data analysis
Example, in SQL:
SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
The same query in Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
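What the Pig script computes can be sketched in plain Java to make the dataflow concrete: filter, group, keep big groups, average. The Url record and the 0.2 threshold come from the example; the class and method names are hypothetical, and the group-size cutoff is a parameter here (instead of 10^6) so toy data works:

```java
import java.util.*;
import java.util.stream.*;

public class PigSketch {
    // Hypothetical record standing in for one row of 'urls'.
    record Url(String url, String category, double pagerank) {}

    // The same dataflow as the Pig script: FILTER by pagerank, GROUP by
    // category, keep groups above a size cutoff, average each group's pagerank.
    static Map<String, Double> avgPagerankOfBigGroups(List<Url> urls, long minCount) {
        return urls.stream()
                .filter(u -> u.pagerank() > 0.2)                   // good_urls
                .collect(Collectors.groupingBy(Url::category))     // groups
                .entrySet().stream()
                .filter(e -> e.getValue().size() > minCount)       // big_groups
                .collect(Collectors.toMap(                          // output
                        Map.Entry::getKey,
                        e -> e.getValue().stream()
                                .mapToDouble(Url::pagerank).average().orElse(0)));
    }

    public static void main(String[] args) {
        List<Url> urls = List.of(
                new Url("a.com", "news", 0.5),
                new Url("b.com", "news", 0.75),
                new Url("c.com", "news", 0.1),   // filtered out by pagerank
                new Url("d.com", "blog", 0.9));  // group of one: too small
        System.out.println(avgPagerankOfBigGroups(urls, 1)); // {news=0.625}
    }
}
```

Pig compiles each of these relational steps down to the Map/Reduce stages described earlier, which is the point of the next slide.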

Page 44:

Adv – Other Technologies III

How?

Build on top of the structure already there

Page 45:

Adv – Hadoop vs. Others

How does Hadoop stack up?
A study compares Hadoop vs. Vertica vs. DBMS-X (legal restrictions prevent the actual name)
The study was performed on a cluster with 100 nodes @ 2.40 GHz Intel dual core
A few measurements: load time for data, specific task run times, startup, even ease of use
The outcome…

Page 46:

Adv – Hadoop vs. Others II

Data loading was much faster in Hadoop

Page 47:

Adv – Hadoop vs. Others III

However, a simple select proves too much for Hadoop

Page 48:

Adv – Hadoop vs. Others IV

But, an advanced task…

Page 49:

Adv – Hadoop vs. Others V

Finally, the study felt that Hadoop was much easier to get started with
Programming with Hadoop breaks the rules
Not really a good interface…yet
Needs a significant amount of data for processing to show improvements

Page 50:

Conclusion

Hadoop has a lot of potential, but needs some work for any moderate-sized data set
Systems where unstructured data needs to have the same question answered stand out as Hadoop candidates
Tough learning curve for anything advanced (maybe)

Page 51:

Questions?

References

Map/Reduce with Hadoop Presentation, ETH Zurich, 2008. www.systems.ethz.ch/hs08/hadoop.pdf (note: link shortened to fit on screen)
Apache Hadoop website (http://hadoop.apache.org/)
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
Olston, Christopher et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD, 2008.
Pavlo, Andrew et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, 2009.
White, Tom. Hadoop: The Definitive Guide. O'Reilly, 2009.