Hadoop
Chris McConnell
CSI-541
4/6/2010
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion
Introduction
What motivated Hadoop? Large amounts of data, with a desire to query it on demand and quickly.
What about a traditional RDBMS?
            Traditional RDBMS          MapReduce
Data Size   Gigabytes                  Petabytes
Access      Interactive and batch      Batch
Updates     Read and write many times  Write once, read many
Structure   Static schema              Dynamic schema
Integrity   High                       Low
Scaling     Nonlinear                  Linear
Introduction II
What is Hadoop good for?
Semi-structured/unstructured data
Large volumes of data
Many reads, few writes, and writes that span a majority of the data set
Generally*: image analysis, graph-based problems, even machine learning algorithms
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
Conclusion
Map/Reduce
Map/Reduce is a programming ‘technique’ introduced by Google in 2004
What is it? Given data, we ‘Map’ it, then ‘Reduce’ it until we have our answer
When is it good? Lots of data that follows a similar format for the specified query
Map/Reduce II
Example: Stock Market, Top Price
Data lines: historical stock data, one line per recording:
XYZD01142010T08301234P2534C+12
NADD01142010T08452549P453C-01
…
“Mapped” to (key, value)… let’s get something meaningful, though:
(XYZ, D01142010T08301234P2534C+12)
(NAD, D01142010T08452549P453C-01)
…
Map/Reduce III
Example continued
Mapped: there could be multiple lines per key, so emit each (key, value) pair for the reducing stage:
(XYZ, 25.34)
(XYZ, 35.12)
(NAD, 5.66)
(NAD, 4.53)
…
Now, pass it along to the Reduce(r)…
Map/Reduce IV
From Map to Reduce, the system organizes pairs by key (more details later). Reduce gets:
(XYZ, [25.34, 35.12])
(NAD, [5.66, 4.53])
Final Reduce: iterate over the values, take the max, and output:
(XYZ, 35.12)
(NAD, 5.66)
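The whole pipeline above can be sketched in plain Java, with no Hadoop involved. The record layout (first three characters are the ticker; the digits between 'P' and 'C' are the price in cents) is inferred from the sample lines on these slides, so treat the parser as an illustrative assumption; the second XYZ record is invented here so that one key has two values to reduce over.

```java
import java.util.*;

// Plain-Java sketch of the map -> shuffle -> reduce flow for the
// stock example above. Parsing rule and the second XYZ record are
// illustrative assumptions, not part of Hadoop itself.
public class MaxPrice {

    // Map: parse one raw line into (ticker, price). Assumes the first
    // three characters are the ticker and the digits between 'P' and
    // 'C' are the price in cents, as the sample lines suggest.
    static Map.Entry<String, Double> map(String line) {
        String ticker = line.substring(0, 3);
        int p = line.indexOf('P');
        int c = line.indexOf('C');
        double price = Integer.parseInt(line.substring(p + 1, c)) / 100.0;
        return Map.entry(ticker, price);
    }

    // Shuffle: group the mapped values by key, as the framework would.
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> e : pairs)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return groups;
    }

    // Reduce: iterate over each key's value list and keep the maximum.
    static Map<String, Double> reduce(Map<String, List<Double>> groups) {
        Map<String, Double> out = new TreeMap<>();
        groups.forEach((k, vs) -> out.put(k, Collections.max(vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "XYZD01142010T08301234P2534C+12",
            "XYZD01142010T09001234P3512C+10",  // invented second XYZ record
            "NADD01142010T08452549P453C-01");
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String l : lines) mapped.add(map(l));
        System.out.println(reduce(shuffle(mapped)));  // {NAD=4.53, XYZ=35.12}
    }
}
```

Running this prints the per-ticker maxima, matching the slide's XYZ result; only one NAD record appears in this sample, so its max differs from the slide's two-record case.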
Map/Reduce V
General Flow
Map/Reduce VI
Shuffle (not talking about iPods)
When sending data from the Map to the Reduce stage, it flows through Partition, Shuffle, and Sort post-/pre-processing steps
This is done to allow for higher parallelism
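As a concrete sketch of the partition step: Hadoop's default HashPartitioner routes each key to a reducer with essentially the formula below. This is a standalone re-implementation for illustration, not the Hadoop class itself.

```java
// Sketch of the partition step that routes map output to reducers.
// Hadoop's default HashPartitioner uses essentially this formula:
// clear the sign bit of the key's hash, then take it modulo the
// number of reduce tasks, so every copy of a key reaches the same reducer.
public class Partition {

    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Example keys from the stock data; the reducer count (2) is arbitrary.
        for (String key : new String[] {"XYZ", "NAD"})
            System.out.println(key + " -> reducer " + partition(key, 2));
    }
}
```

Because the mapping is deterministic, all (XYZ, value) pairs land on one reducer and can be processed independently of the others, which is what makes the reduce stage parallelizable.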
Map/Reduce VII
Map/Reduce VIII
Everything is parallel… sort of
Implicitly, mapping can be parallel: while the list for one key is being created, the other keys don’t care, but it might take planning to accomplish
Reducing can be fully parallel
Bottleneck? Reduce needs to wait for Map
Map/Reduce IX
One more overview, at the highest level
Outline
Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion
Hadoop – Introduction
History
Created by Doug Cutting; started as an open-source web search engine (Nutch)
2004: began the Nutch Distributed File System, modeled after the Google File System
2005: began a MapReduce implementation on NDFS
2006: the Hadoop subproject began
2008: Yahoo! utilizes a 10,000-core cluster for its production search engine
Hadoop – Introduction II
Current status: multiple subprojects (will discuss some later)
Core – components and interfaces for the distributed filesystems and general I/O
MapReduce – discussed earlier
HDFS – Hadoop Distributed File System (details coming soon)
Hadoop – HDFS
Clusters follow a master/worker pattern
Namenode (master):
Single per cluster (not required for all)
Maintains the file system tree and metadata
Accepts MR jobs from clients
Handles replication and block assignments, among other tasks
Hadoop – HDFS II
Datanode (workers) Execute tasks as told to do so Useless for recovery Files on Datanode are handled in large blocks,
typically 64/128MB Client can interact with namenode and
datanodes since data is not sent to namenode Minimizing Bottlenecks
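To make the block sizes above concrete, here is a quick arithmetic sketch of how many blocks a file occupies. The 64 MB figure comes from the slide; the helper names are my own.

```java
// Arithmetic sketch of HDFS-style block splitting, using the 64 MB
// default mentioned on the slide. blockCount is ceiling division:
// even a 1-byte remainder still occupies one more block.
public class Blocks {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGB));     // 16
        System.out.println(blockCount(oneGB + 1)); // 17
    }
}
```

Each of those blocks is then replicated across datanodes, which is what the balancing and fault-tolerance points on the next slide deal with.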
Hadoop – HDFS III
General overview
Hadoop – HDFS IV
Meta-data: list of files, blocks, datanodes, file attributes
Balancing: try to balance data across all datanodes by moving blocks or creating replicas
Fault tolerance: logs, a secondary namenode, and write confirmation after all replicas are written
Communication: basic TCP/IP with protocols
Outline
Introduction
Map/Reduce
Hadoop
  Introduction
  File System
  Setup
  Example
Advanced Topics
Conclusion
Hadoop – Setup
Download… double click… next a few times…
Just kidding; however, it’s not too bad
Hadoop is available for Windows and Linux systems
We will discuss some brief setup for a Linux cluster system
Hadoop – Setup II
Ensure you have Java 1.6 installed Download and extract the Hadoop system
(reference at end) Single machine – all set Multiple machine clusters involve a few
more steps…
Hadoop – Setup III
Set up your namenode in the conf/master file (specify IP)
Set up datanodes in the conf/slave file (specify IP)
Configure ports in conf/core-site.xml, conf/mapred-site.xml, conf/hdfs-site.xml Override defaults
Finally, point to an output location
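For illustration, a minimal override in conf/core-site.xml for a 0.20-era cluster could look like the fragment below. The hostname and port are placeholders; per-cluster settings such as the replication factor would go in conf/hdfs-site.xml in the same property format.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where clients find the HDFS namenode (placeholder host and port). -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```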
Hadoop – Setup IV
Common commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop namenode -format
bin/hadoop dfs (or fs) -copyFromLocal file dir
bin/hadoop dfs -copyToLocal file dir
Many more basic *nix commands (-ls, -cat, -mkdir)
Some non-standard commands (-rmr instead of rmdir)
Hadoop – Example
Let’s look at some code and a simple run:
bin/hadoop jar <jar location> <main class> <input dir> <output dir> <other args>
Note: args can be in any order, but it is usually suggested to have input/output first, as they are “required”
Hadoop – Example II
Extra information can be found under /logs/userlogs
Directories exist for the Map/Reduce stages
stdout will hold System.out.print..()
stderr will hold System.err.print..()
Hadoop – Example III
A deeper look
Outline
Introduction
Map/Reduce
Hadoop
Advanced Topics
  Other Technologies
  Hadoop vs. Others
  Future Topics
Conclusion
Adv – Other Technologies
HBase (Powerset): table storage for semi-structured data
ZooKeeper (Yahoo!): coordinating distributed applications
Hive (Facebook): SQL-like query language
Adv – Other Technologies II
Pig (Yahoo!): high-level language for data analysis
Example, in SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
The same query in Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Adv – Other Technologies III
How? These build on top of the structure already there
Adv – Hadoop vs. Others
How does Hadoop stack up?
A study compares Hadoop vs. Vertica vs. DBMS-X (legal restrictions prevent the actual name)
The study was performed on a cluster of 100 nodes @ 2.40 GHz Intel dual core
A few measurements: load time for data, specific task run times, startup, even ease of use
The outcome…
Adv – Hadoop vs. Others II
Data loading was much faster
Adv – Hadoop vs. Others III
However, a simple select proves too much
Adv – Hadoop vs. Others IV
But, an advanced task…
Adv – Hadoop vs. Others V
Finally, the study felt that Hadoop was much easier to get started with
Programming with Hadoop breaks the rules
Not really a good interface… yet
Needs a significant amount of data for processing to show improvements
Conclusion
Hadoop has a lot of potential, but needs some work for any moderate-sized data set
Systems where unstructured data needs the same question answered across all of it stand out as Hadoop candidates
Tough learning curve for anything advanced (maybe)
Questions?

References
Map/Reduce with Hadoop Presentation, ETH Zurich, 2008. www.systems.ethz.ch/hs08/hadoop.pdf
Apache Hadoop website (http://hadoop.apache.org/)
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
Olston, Christopher et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD, 2008.
Pavlo, Andrew et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, 2009.
White, Tom. Hadoop: The Definitive Guide. O’Reilly, 2009.