Hadoop Introduction
Rob Hughes
1
Why the need for Hadoop?
• LOTS!!! of data causes some problems:
• In 1990 a typical hard drive could store 1,370 MB of data with a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.
• Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
• Hard drive access speeds have not kept up with storage capacity. Stupid seek time!
2
How to mitigate access times
• Partition data into more-or-less equal-size chunks spread across separate drives. Process data by working in parallel.
• With a hundred drives each containing a hundredth of the data, working in parallel, you could read an entire 1 terabyte drive in under 2 minutes.
• But this creates other problems…
3
More Problems
• Increasing the amount of hardware increases the likelihood of hardware failure.
• If a computer fails, you lose part of the computation.
• If a drive fails, you lose part of your data.
• Intermediate results of processing are now stored on multiple drives. Data may need to be combined to produce a final result.
4
Again, why the need for Hadoop?
• Distributed processing (parallelism).
• Distributed data (replication).
• Fault-tolerance.
• Mechanism to combine data at key points during processing.
5
Hadoop Features
• Data Compression.
• Separation of concerns:
• Hadoop manages the complexity of data storage and replication, coordinates hundreds to thousands of machines, and provides a fault-tolerant platform for data access and job execution.
• Developers develop instead of becoming distributed-system experts. Hadoop defines an API for packaging and submitting jobs, an API hook to update job progress, and a file system to capture job results.
• Data locality.
• Scalable + commodity hardware.
6
HDFS
Hadoop Distributed File System
• Core component of Hadoop.
• Exhibits all the characteristics of a distributed file system:
• Files managed across a network of servers.
• Data file size can grow beyond the limits of a physical server.
• Scalable storage of data.
• Tolerates failure of nodes without losing access to data.
• HDFS achieves this through a high replication count: each block is stored on multiple nodes.
7
HDFS
• Designed for storing very large files and large amounts of data.
• Capable of storing petabytes of data.
• Designed to run on commodity hardware.
• Doesn't require expensive or highly available hardware.
• Runs on large clusters of inexpensive, commonly available hardware.
• Chance of node failure is high for large clusters, but HDFS is designed to survive in the face of failure.
• Optimized for write-once, read-many-times access.
• Optimized for high throughput when reading an entire dataset.
8
HDFS – Not Ideal For:
• Low-latency data access (tens of milliseconds).
• Large numbers of files.
• Due to Namenode memory constraints, the number of files is currently limited. Scales to millions but not billions of files.
• Multiple writers, or writes in the middle of files.
• Files may only be written by a single writer, and file modifications are always made at the end of a file.
9
HDFS – Blocks
• Disks are organized into blocks, the basic unit of storage.
• A block is the minimum amount of data that can be read or written.
• Filesystems perform I/O to individual disks in terms of multiple disk blocks.
• HDFS has the notion of a block as well, but the block size is much larger than in typical filesystems—64 MB by default.
• The goal is to keep the number of relatively slow disk seeks low.
• Hadoop operations are designed to operate on data the size of an HDFS block, allowing operations to access data with a single disk seek.
• Favors throughput over low latency.
• Files are stored in HDFS as one or more HDFS blocks.
• File replication in HDFS occurs at the block level.
10
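As a rough illustration (plain Java, not a Hadoop API), a file stored in HDFS occupies ceil(fileSize / blockSize) blocks; the file size below is a hypothetical example, and 64 MB is the default block size the slides mention:

```java
public class BlockCount {
    // Number of HDFS blocks needed to hold a file of the given size.
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;           // 64 MB default block size
        long oneGb = 1024L * 1024 * 1024;             // a hypothetical 1 GB file
        System.out.println(blocks(oneGb, blockSize)); // 16 blocks
    }
}
```

Note that, unlike most local filesystems, a file smaller than one HDFS block does not occupy a full block of underlying storage.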
HDFS Architecture
• A collection of HDFS nodes/servers is known as an HDFS cluster.
• An HDFS cluster is comprised of two types of nodes (Namenode and Datanode) operating in a master-worker relationship:
• Namenode (the master) – Server running a special piece of software called the NameNode.
• Datanodes (the workers) – Servers running a special piece of software called the DataNode.
11
HDFS Architecture—Nodes
• Namenode
• Maintains the filesystem tree and the metadata for all the files and directories in the tree.
• This information is stored persistently on the local disk in two files: the namespace image and the edit log.
• It is recommended to configure HDFS to write copies of the namespace image and edit log to a remote NFS-mounted filesystem.
• Determines the mapping of blocks to Datanodes.
• Knows the Datanodes on which all the blocks for a given file are located. This information is stored in memory and not persisted to disk.
• Datanodes
• Store and retrieve blocks when requested.
• Perform block creation, deletion, and replication upon instruction from the Namenode.
• Periodically (and at system startup) report their list of blocks back to the Namenode.
12
HDFS Architecture (misc.)
• The Namenode is a single point of failure.
• The filesystem cannot be used without the Namenode.
• In the Apache Hadoop distribution, manual procedures are needed to bring another node online as the new Namenode. This involves recovering the namespace image and edit log from the failed Namenode server or from an external copy of those files.
• Hadoop 2.x offers federated Namenodes and high-availability (HA) features.
• Secondary Namenode—an optional node type.
• Not a standby for the Namenode, as the name may imply.
• Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
13
HDFS Access
• Hadoop and third-party clients are used to access the HDFS filesystem:
• A Command Line Interface (CLI) program named "hadoop".
• Various Hadoop Java libraries provide programmatic filesystem access.
• A C library called libhdfs, also bundled with Hadoop.
• Clients access the filesystem on behalf of a user or program, hiding the interaction with Namenodes and Datanodes.
14
Sample HDFS Cluster
[Diagram: several racks of servers; each worker server runs a DataNode on top of its HDFS-managed local storage, and one server runs the NameNode.]
Hadoop Network Topology
• Hadoop takes a simple approach in which the network is represented as a tree.
• The "distance" between nodes is important: for high-volume data processing, the limiting factor is how rapidly data can be transferred between nodes.
• Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node.
• Network distance and data-locality optimizations are key features that distinguish HDFS from other distributed file systems.
16
Sample Hierarchical/Tree Network Topology
[Diagram: a logical root above racks above nodes. The HDFS cluster root sits over Rack 1 (Datanode1 … DatanodeN, plus the Namenode) through RackN (DN1, DN2, … DNN).]
17
Sample Topology Adding Data Center Layer
[Diagram: the root "/" sits above Data Center 1 and DC2; each data center contains racks (Rack 1 … RackN, R1 … RN), and each rack contains datanodes. The Namenode sits in Data Center 1.]
18
Distance between two nodes is the sum of their distances to their closest common ancestor:
• Same node: D=0
• Same rack: D=2
• Same data center, different rack: D=4
• Different data centers: D=6
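The distance rule can be sketched in a few lines of plain Java (a toy model, not Hadoop's actual NetworkTopology class): represent each node's location as a path of levels such as /dc1/rack1/node1, and count the hops from each node up to their closest common ancestor:

```java
public class TopologyDistance {
    // A location is a path of tree levels, e.g. "/d1/r1/n1" for
    // data center d1, rack r1, node n1 (names are illustrative).
    // Distance = hops from each node up to the closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0; // depth of the closest common ancestor
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // same node: 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // same rack: 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n1")); // same DC, other rack: 4
        System.out.println(distance("/d1/r1/n1", "/d2/r1/n1")); // different DCs: 6
    }
}
```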
HDFS Cluster—Flat Topology
[Diagram: the root "/" sits directly above Datanode1, Datanode2, the Namenode, and DatanodeN.]
19
With a flat topology, the distance from a node to itself is D=0 and the distance between any two distinct nodes is D=2.
HDFS Write: Replication Factor=3
[Diagram: a client JVM on a client node writes a file through the HDFS client classes (DistributedFileSystem and FSDataOutputStream); block b1 is written to a pipeline of three DataNodes spanning two racks, with block locations supplied by the NameNode.]
1. create – the client calls create on DistributedFileSystem.
2. create – DistributedFileSystem asks the NameNode to create the file; the NameNode supplies block locations.
3. write – the client writes data through FSDataOutputStream.
4. write packet – packets are pipelined from DataNode to DataNode, one replica per node (pipelined write).
5. ack packet – acknowledgements flow back up the pipeline (pipelined ack).
6. close – the client closes the stream.
7. complete – the NameNode is notified that the write is complete.
Hadoop Modes
• Standalone
• All Hadoop/MapReduce/HDFS daemons run as threads within a single Java Virtual Machine (JVM).
• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.
• Pseudo-distributed
• All Hadoop/MapReduce/HDFS daemons run in separate JVMs but on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.
• Cluster
• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of a Hadoop/HDFS cluster.
21
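For reference, pseudo-distributed mode is typically enabled through a handful of Hadoop 1.x configuration properties; a minimal sketch (the hostnames and ports are illustrative, not mandated):

```xml
<!-- conf/core-site.xml: point the filesystem at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can hold only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the jobtracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```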
Exercise 2
22
MapReduce
• We've skirted the issue so far: what good is all that data without analysis?
• MapReduce is one of the core components of Hadoop and provides a data model for processing data.
• Analyzing data with Hadoop is broken up into two primary phases: the Map phase and the Reduce phase (hence the name).
• Each phase has key-value pairs as input and output. The types of those key-value pairs are selectable by the programmer.
• The MapReduce API provides a number of available input and output types.
• Types are extensible.
• The programmer must specify two functions: the map function and the reduce function.
• The output key-value types of the Map phase must be the same as the input key-value types of the Reduce phase.
• The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key, and is known as "the shuffle".
• MapReduce provides an API for creating a 'job' and submitting it to Hadoop for execution.
23
Map Phase
• Input: key/value pairs.
• The values represent the data set to be processed.
• Map function: user-defined, and applied to every value in the data set.
• Output: a new list of key/value pairs.
• Output key/value types may be different than the input types.
24
Reduce Phase
• Input: intermediate key/value pairs output from the Map phase.
– Data is sorted and grouped by key before being passed to the reduce function.

Map Function Output:       Input to Reduce Function:
(K3, V1)                   (K1, [V1, V2])
(K1, V1)         =>        (K2, [V1])
(K1, V2)                   (K3, [V1])
(K2, V1)

• Reduce function: user-defined function applied to each grouping (by key) of values.
– Typically a function that takes a large number of key/value pairs and produces a smaller number of key/value pairs; hence the name "reduce".
• Output: finalized set of key/value pairs.
• All values with the same key will eventually be processed by the same reduce task.
25
MapReduce (cont.)
• In addition to the Map and Reduce phases there are a few other data processing steps:
• Input – Turn raw data into key-value pairs for input into the Map phase.
• "Shuffle" – Turn the output key-value pairs from the Map phase into input key-value pairs for the Reduce phase. The framework sorts and groups the key-value pairs by key before they are sent to the reduce function.
• Output – Write the results of the Reduce phase to the file system.
26
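To make the input → map → shuffle → reduce pipeline concrete, here is a toy in-memory sketch in plain Java (no Hadoop involved), using word counting as the map/reduce pair:

```java
import java.util.*;

public class MiniMapReduce {
    // Input -> map -> shuffle -> reduce, entirely in memory.
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every input record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // "Shuffle": sort and group the map output by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());

        // Reduce phase: apply a user-defined function (here, a sum)
        // to each key's grouped values.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            reduced.put(e.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Output step: in real Hadoop this would be written back to HDFS.
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

In Hadoop the same four steps are distributed: the map work is spread across map tasks, the shuffle moves data between nodes, and the reduce work is spread across reduce tasks.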
Scaling MapReduce
• Leverage data stored in HDFS.
• Use Hadoop to move computations to the nodes hosting part of the data.
• MapReduce job – the unit of work to be completed for a client. Consists of:
• Input data.
• MapReduce program.
• Configuration information.
• Hadoop divides a job into two types of tasks:
• Map tasks.
• Reduce tasks.
27
Scaling MapReduce
Division Of Labor
• Hadoop divides the input data into fixed-size pieces called splits.
• Hadoop creates one map task for each split. The map task executes the user-defined map function for each record in the split.
28
Map and Reduce Tasks
[Diagram: input → map/combine → shuffle → reduce → output. Each input split feeds a map task, which runs map and sorts its output into partitions (part 1 … part n). Each reduce task merge/sorts the matching partition from every map task, runs reduce, and writes its output part to HDFS, where it is replicated.]
29
Hadoop Architecture
• A Hadoop cluster uses two node types to facilitate job execution:
• Jobtracker – Server running a special piece of software known as the JobTracker.
• Tasktracker – Server(s) running a special piece of software known as the TaskTracker.
30
Hadoop Architecture—Nodes
• Jobtracker – Coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
• Keeps track of progress for each job.
• Reschedules failed tasks on another tasktracker.
• Tasktracker – Runs tasks and sends progress reports to the jobtracker.
• Note: This architecture is for Hadoop 1.x. The architecture changed in Hadoop 2.x (YARN) to overcome 1.x scaling issues.
31
Sample HDFS/Hadoop Cluster
[Diagram: the same racks of servers as the HDFS cluster, but each worker server now runs a TaskTracker alongside its DataNode; one server runs the JobTracker and another runs the NameNode.]
Hadoop Data Locality Optimization
[Diagram: map tasks placed relative to their HDFS input blocks: data-local (task on the same node as block a), rack-local (task in the same rack as block b), and off-rack (block c read across racks).]
• Hadoop does its best to run map tasks on the same node containing the input split.
• Optimal split size is one HDFS block.
MapReduce Example

Input dataset: SINGLE.TXT*
aardwolves
draftable
flatbread
tutor
trout
tourt

34

• Find all input rows containing a word that is an anagram of another word in the input.
• Input to the Map phase is a single file, SINGLE.TXT, containing all words.
• Output of the Map phase is a set of key/value pairs, one for each line of the input dataset. The key is the sorted letters of the word; the value is the word itself:

aadelorsvw  aardwolves
aabdeflrt   draftable
aabdeflrt   flatbread
orttu       tutor
orttu       trout
orttu       tourt

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip
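The mapper's key derivation, sorting the characters of each word, can be tried standalone in plain Java (mirroring the logic the slides describe, without any Hadoop classes):

```java
import java.util.Arrays;

public class AnagramKey {
    // Same trick the map function uses: the sorted letters of a word
    // form a key shared by all of its anagrams.
    static String key(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(key("trout"));     // orttu
        System.out.println(key("tutor"));     // orttu  (same key => anagrams)
        System.out.println(key("flatbread")); // aabdeflrt
    }
}
```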
MapReduce Example

• Input to the Reduce phase is the sorted and grouped keys and values.
• The idea being that anagrams will show up as key/value pairs with multiple values in the group (keyed by common sorted letters):

<snip>
aadelorsvw [aardwolves]
aabdeflrt [draftable,flatbread]
<snip>
orttu [tutor,trout,tourt]
<snip>

• Output of the Reduce phase will be records with multiple values for the same key:

Output: part-00000
<snip>
aabdeflrt draftable,flatbread
<snip>
orttu tutor,trout,tourt
<snip>

35
Map Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AnagramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Emit (sorted letters of word, word): anagrams share a key.
        String word = value.toString();
        char[] wordChars = word.toCharArray();
        Arrays.sort(wordChars);
        String sortedWord = new String(wordChars);
        sortedText.set(sortedWord);
        originalText.set(word);
        outputCollector.collect(sortedText, originalText);
    }
}

36
Reduce Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text anagramKey, Iterator<Text> anagramValues,
            OutputCollector<Text, Text> results, Reporter reporter)
            throws IOException {
        // Concatenate all words that share this sorted-letter key.
        String output = "";
        while (anagramValues.hasNext()) {
            Text anagram = anagramValues.next();
            output = output + anagram.toString() + "~";
        }
        // Only emit keys with two or more words, i.e. actual anagrams.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            // Drop the trailing separator, then join the words with commas.
            output = output.substring(0, output.length() - 1).replace("~", ",");
            outputKey.set(anagramKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

37
Package The Job

package com.hadoop.examples.anagrams;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AnagramJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.anagrams.AnagramJob.class);
        conf.setJobName("anagramcount");
        conf.setKeepTaskFilesPattern(".*");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(AnagramMapper.class);
        conf.setReducerClass(AnagramReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

38
MapReduce Example 2

Dataset 1: NAMES-F.TXT*
Aaren
Aarika
Abagael
Abagail
Abbe
…
Zsazsa
Zulema
Zuzana

Dataset 2: NAMES-M.TXT*
Aaron
Ab
Abba
Abbe
Abbey
…
Zerk
Zollie
Zolly

39

• Compare two datasets for a set of common names.
• Input to the Map phase is both datasets (female and male names).
• Output of the Map phase is a set of key/value pairs, one for each line of the input datasets. Both key and value are set to the name from the input record.
• Input to the Reduce phase is the sorted and grouped keys and values.
• The idea being that common names will show up as key/value pairs with multiple values in the group.

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip
Map Function

package com.hadoop.examples.commonnames;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CommonNamesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Emit each name as both key and value.
        String word = value.toString();
        originalText.set(word);
        outputCollector.collect(originalText, originalText);
    }
}

40
Reduce Function

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text namesKey, Iterator<Text> namesValues,
            OutputCollector<Text, Text> results, Reporter reporter)
            throws IOException {
        String output = "";
        while (namesValues.hasNext()) {
            Text names = namesValues.next();
            output = output + names.toString() + "~";
        }
        // A name with two or more values appeared in both datasets.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(namesKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

41
Package The Job

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.commonnames.CommonNamesJob.class);
        conf.setJobName("commonnames");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(CommonNamesMapper.class);
        conf.setReducerClass(CommonNamesReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

42
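The grouping idea behind this job can be checked without a cluster. The plain-Java sketch below emulates what the shuffle plus the CommonNamesReducer logic computes, namely names appearing in both datasets; it assumes (as the slides' data does) that each name occurs at most once per dataset:

```java
import java.util.*;

public class CommonNames {
    // Emulates map (name -> name), shuffle (group by key), and the
    // reducer's "two or more values" test across both input datasets.
    static List<String> common(List<String> female, List<String> male) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like the shuffle
        for (String n : female) counts.merge(n, 1, Integer::sum);
        for (String n : male) counts.merge(n, 1, Integer::sum);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= 2) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        List<String> f = Arrays.asList("Abbe", "Abagail", "Zuzana");
        List<String> m = Arrays.asList("Aaron", "Abbe", "Zolly");
        System.out.println(common(f, m)); // [Abbe]
    }
}
```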
Hadoop Modes
• Standalone
• All Hadoop/MapReduce/HDFS daemons run as threads within a single Java Virtual Machine (JVM).
• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.
• Pseudo-distributed
• All Hadoop/MapReduce/HDFS daemons run in separate JVMs but on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.
• Cluster
• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of a Hadoop/HDFS cluster.
43
Exercise 3
44
Hadoop Introduction
• Vendor: Copyright © 2011 The Apache Software Foundation.
Website: http://hadoop.apache.org/

Description:
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

• The project includes these subprojects:
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

45
Additional Resources
• "Hadoop: The Definitive Guide", Third Edition, by Tom White. Copyright 2011 Tom White, 978-1-449-31152-0.
• Cloudera:
• Linux packages and virtual machines:
• https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH3PackagesandDownloads
• Apache Hadoop motivation webinars (free registration required):
• http://www.cloudera.com/resource/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop/
46
Backup Slides
47
Alternatives
• MPI
• PVM
48
HDFS Access
• Talk about HTTP/Proxy access??
• Talk about pluggable filesystems???
49
HDFS
• File structures
• SequenceFile
• MapFile
50
MapReduce Features
• Counters
• Sorting
• Joins
• Side Data
51