Hadoop, Hadoop, Hadoop!!!

  • Published on
    25-Feb-2016


DESCRIPTION

Hadoop, Hadoop, Hadoop!!! Jerome Mitchell, Indiana University. Outline: Big Data; Hadoop MapReduce; the Hadoop Distributed File System (HDFS); Workflow; Conclusions; References; Hands-On. Lots of data everywhere. Why should you care? Even grocery stores care. What is Hadoop? - PowerPoint PPT Presentation

Transcript

Hadoop, Hadoop, Hadoop!!!
Jerome Mitchell, Indiana University

Outline
- Big Data
- Hadoop MapReduce
- The Hadoop Distributed File System (HDFS)
- Workflow
- Conclusions
- References
- Hands-On

Lots of Data Everywhere
Why should you care? Even grocery stores care.

What is Hadoop?
- At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose. GFS is not open source.
- Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).
- The software framework that supports HDFS, MapReduce, and other related entities is called the project Hadoop, or simply Hadoop. It is open source and distributed by Apache.

What Exactly is Hadoop?
A growing collection of subprojects.

Motivation for MapReduce
- Large-scale data processing
- Want to use 1000s of CPUs
- But don't want the hassle of managing things

The MapReduce architecture provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Monitoring and status updates

MapReduce Model
- Input and output: a set of key/value pairs
- Two primitive operations:
  - map: (k1, v1) -> list(k2, v2)
  - reduce: (k2, list(v2)) -> list(k3, v3)
- Each map operation processes one input key/value pair and produces a set of intermediate key/value pairs.
- Each reduce operation merges all intermediate values (produced by map operations) for a particular key and produces final key/value pairs.
- Operations are organized into tasks:
  - Map tasks apply the map operation to a set of key/value pairs.
  - Reduce tasks apply the reduce operation to intermediate key/value pairs.
- Each MapReduce job comprises a set of map and (optionally) reduce tasks.

HDFS Architecture
[Diagram of the HDFS architecture.]

The Workflow
1. Load data into the cluster (HDFS writes)
2. Analyze the data (MapReduce)
3. Store results in the cluster (HDFS)
4. Read the results from the cluster (HDFS reads)

Hadoop Server Roles
[Diagram: a client talks to a master running the Name Node, Secondary Name Node, and Job Tracker; the slaves run Data Nodes and Task Trackers. HDFS provides distributed data storage, MapReduce provides distributed data processing.]

Hadoop Cluster
[Diagram: racks 1..N of Data Node + Task Tracker machines, each rack behind its own switch, connected through core switches.]

Hadoop Rack Awareness - Why?
[Diagram: the Name Node keeps metadata mapping each block of Results.txt to Data Nodes (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) plus a rack-aware map of which Data Nodes live in which rack.]

Sample Scenario
- A huge file, File.txt, containing all emails sent to Indiana University.
- Question: how many times did our customers type the word "refund" into emails sent to Indiana University?

Writing Files to HDFS
[Diagram: the client asks the Name Node where to write File.txt, then streams blocks BLK A, BLK B, and BLK C to the assigned Data Nodes (e.g., Data Nodes 1, 5, and 6), which replicate them.]

Reading Files from HDFS
[Diagram: the Name Node returns the block-to-Data-Node metadata for Results.txt (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and the rack-aware node list; the reader then fetches the blocks from those Data Nodes.]

MapReduce: Three Phases
1. Map
2. Sort
3. Reduce
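The slides above describe the map and reduce primitives abstractly. As a rough illustration (not part of the original deck), here is a minimal Java sketch of what they might look like for the refund-counting scenario, written against the org.apache.hadoop.mapreduce API; the class name RefundCount and the tokenization are assumptions for the example.
--------------------------------------------
// Illustrative sketch only (not from the slides): the map and reduce
// primitives for the "refund" scenario.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RefundCount {

  // map: (byte offset, one line of File.txt) -> list(("refund", 1)) for each occurrence
  public static class RefundMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text("refund");

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().toLowerCase().split("\\s+")) {
        if (token.equals("refund")) {
          context.write(word, ONE);   // emit an intermediate ("refund", 1) pair
        }
      }
    }
  }

  // reduce: ("refund", list(1, 1, ...)) -> ("refund", total)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();               // merge all intermediate values for this key
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
--------------------------------------------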
Data Processing: Map
[Diagram: the Job Tracker and Name Node schedule a map task on each Data Node that holds a block of File.txt (BLK A on Data Node 1, BLK B on Data Node 5, BLK C on Data Node 9).]

MapReduce: The Map Step
[Diagram: each input key/value pair is passed through the map function, producing intermediate key/value pairs.]

Data Processing: Reduce
[Diagram: the map tasks over BLK A, BLK B, and BLK C feed their intermediate output to a reduce task, which writes Results.txt back to HDFS.]

MapReduce: The Reduce Step
[Diagram: intermediate key/value pairs are grouped by key, and each group is passed through the reduce function to produce the output key/value pairs.]

Clients Reading Files from HDFS
[Diagram: the client asks the Name Node for the block locations of Results.txt (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and reads the blocks from those Data Nodes.]

Conclusions
- We introduced the MapReduce programming model for processing large-scale data.
- We discussed the supporting Hadoop Distributed File System.
- The concepts were illustrated using a simple example.
- We reviewed some important parts of the source code for the example.

References
1. Apache Hadoop Tutorial: http://hadoop.apache.org and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html

Finally, a Hands-On Assignment!

The MapReduce Framework (pioneered by Google)
[Diagram of the MapReduce framework.]

Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically: e.g., restarts tasks if a node fails, and runs multiple copies of the same task to avoid a slow task slowing down the whole job.

The Map (Example)
Three map tasks (M = 3) each read their input text and emit (word, 1) pairs into two partitions of intermediate files (R = 2):
- "When in the course of human events it ..." / "It was the best of times and the worst of times ..." -> (in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) (when,1) (course,1) (human,1) (events,1) (best,1) ...
- "This paper evaluates the suitability of the ..." -> (this,1) (paper,1) (evaluates,1) (suitability,1) (the,1) (of,1) (the,1)
- "Over the past five years, the authors and many ..." -> (over,1) (past,1) (five,1) (years,1) (authors,1) (many,1) (the,1) (the,1) (and,1)

The Reduce (Example)
A reduce task (only one of the two reduce tasks is shown) reads its partition of intermediate files, sorts and groups the pairs by key, and merges the values:
- sorted groups: (and,(1)) (in,(1)) (it,(1,1)) (of,(1,1,1)) (the,(1,1,1,1,1,1)) (was,(1))
- reduce output: (and,1) (in,1) (it,2) (of,3) (the,6) (was,1)

Lifecycle of a MapReduce Job
[Diagram: a map function and a reduce function are packaged and run as a MapReduce job.]

Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or the defaults are used
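The deck does not show how those parameters are set in code. As a rough, hedged illustration (not from the slides, and using Hadoop 1.x-era property names that differ between versions), a driver can override a few of them on its Configuration object while everything else falls back to the defaults:
--------------------------------------------
// Illustrative sketch (not from the slides): overriding a couple of the 190+
// job parameters; anything not set explicitly comes from the *-default.xml
// and cluster *-site.xml files.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfiguredJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // starts from the defaults
    conf.set("mapred.reduce.tasks", "2");            // example manual override
    conf.set("mapred.compress.map.output", "true");  // example manual override

    Job job = new Job(conf, "configured job");       // the job picks up this configuration
    // ... set the mapper, reducer, input, and output as in the launching-program sketch below ...
  }
}
--------------------------------------------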
Formatting the NameNode
Before we start, we have to format Hadoop's distributed filesystem (HDFS) for the namenode. You need to do this the first time you set up Hadoop. Do not format a running Hadoop namenode: this will erase all of the data in the HDFS filesystem. To format the filesystem, run the following command from the master:
--------------------------------------------
bin/hadoop namenode -format
--------------------------------------------

Starting Hadoop
Starting Hadoop is done in two steps:
1. The HDFS daemons are started: the namenode daemon is started on the master, and datanode daemons are started on all slaves (here: master and slave).
2. The MapReduce daemons are started: the jobtracker is started on the master, and tasktracker daemons are started on all slaves (here: master and slave).

HDFS daemons
Run the command bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on that machine and datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/start-dfs.sh on the master:
--------------------------------------------
bin/start-dfs.sh
--------------------------------------------
On a slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hadoop-datanode-slave.log.

Now, the following Java processes should be running on the master:
--------------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
--------------------------------------------

MapReduce daemons
In our case, we will run bin/start-mapred.sh on the master:
--------------------------------------------
bin/start-mapred.sh
--------------------------------------------
On a slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hadoop-tasktracker-slave.log.

At this point, the following Java processes should be running on the master:
--------------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
--------------------------------------------

Running your first Hadoop MapReduce job
We will execute your first Hadoop MapReduce job using the WordCount example, which reads a text file and counts the frequency of words. The input is a text file, and the output is a text file in which each line contains a word and the count of how often it occurred, separated by a tab. Download the example input data.

Copy the local data file to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS:
--------------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ hadoop dfs -copyFromLocal /tmp/source destination
--------------------------------------------

Build WordCount
Execute build.sh.

Run the MapReduce job
Now we actually run the WordCount example job. This command reads all the files in the HDFS destination directory, processes them, and stores the result in the HDFS directory output:
--------------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ bin/hadoop jar hadoop-example.jar wordcount destination output
--------------------------------------------
You can check whether the result was successfully stored in the HDFS directory output.

Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system:
--------------------------------------------
root@ubuntu:/usr/local/hadoop$ mkdir /tmp/output
root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyToLocal output/part-00000 /tmp/output
--------------------------------------------
Alternatively, you can read the file directly from HDFS, without copying it to the local file system, by using the command:
--------------------------------------------
root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000
--------------------------------------------
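For completeness (this is not part of the original hands-on instructions), the last step can also be done from Java with the HDFS FileSystem API; the sketch below is the rough programmatic counterpart of "hadoop dfs -cat", and the path and configuration loading are assumptions that depend on your setup.
--------------------------------------------
// Illustrative sketch (not from the slides): reading the WordCount result
// directly from HDFS instead of using the dfs shell.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CatResult {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml, etc.
    FileSystem fs = FileSystem.get(conf);            // connects to the default (HDFS) filesystem
    Path result = new Path("output/part-00000");     // the reducer's output file

    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(result)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                    // each line: word <TAB> count
      }
    }
  }
}
--------------------------------------------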
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages such as Python or C++.

Creating a launching program for your application
The launching program configures:
- the Mapper and Reducer to use,
- the output key and value types (input types are inferred from the InputFormat), and
- the locations of your input and output.
The launching program then submits the job and typically waits for it to complete.

A MapReduce job may specify how its input is to be read by specifying an InputFormat to be used, and how its output is to be written by specifying an OutputFormat to be used.
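The deck describes the launching program but does not show one. A minimal sketch of such a driver (not from the original slides) for the refund example above, using the org.apache.hadoop.mapreduce API, might look like the following; the class names RefundCountDriver, RefundCount.RefundMapper, and RefundCount.SumReducer refer to the earlier illustrative sketch and are assumptions.
--------------------------------------------
// Illustrative sketch (not from the slides): a launching program that wires up
// the mapper and reducer, sets the output key/value types, the input/output
// locations, and waits for the job to complete.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RefundCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "refund count");
    job.setJarByClass(RefundCountDriver.class);

    // The Mapper and Reducer to use
    job.setMapperClass(RefundCount.RefundMapper.class);
    job.setReducerClass(RefundCount.SumReducer.class);

    // Output key and value types (input types are inferred from the InputFormat)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // How input is read and output is written
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Locations of your input and output (HDFS paths passed on the command line)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and wait for it to complete
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
--------------------------------------------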
