Hadoop & MapReduce
Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone: 18610660375, QQ/WeChat: 155970


CAABI, Texas Tech University
- Center for Advanced Analytics and Business Intelligence (CAABI), initially started in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business.

FIFE, SWUFE
- Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE): one of two key labs in finance, founded in 2008 and sponsored by the Sichuan Provincial Government.
- Underpinned by two areas in SWUFE: Information and Finance.

Know Big Data One More Step
- When we talk about big data, we must know what Hadoop is.
- When we plan data warehousing, we must know what HDFS and NoSQL are.
- When we say data mining, we must know what Mahout and H2O are.
- Do you know that Hadoop data warehousing does not need dimensional modeling?
- Do you know how Hadoop stores heterogeneous data?
- Do you know what Hadoop's "Achilles' heel" is?
- Do you know that you can install a Hadoop system on your laptop?
- Do you know that Alibaba retired its last minicomputer in 2014?
- So, let's talk about Hadoop.

After this lecture you will
- Understand the challenges in big data management
- Understand how Hadoop and MapReduce work
- Be familiar with the Hadoop ecology
- Be able to install Hadoop on your laptop
- Be able to install a handy big data tool on your laptop to visualize and mine data

Outlines
- Apache Hadoop
- Hadoop Data Warehousing
- Hadoop ETL
- Hadoop Data Mining
- Data Visualization with Hadoop
- MapReduce Algorithm
- Setting up Your Hadoop
- Appendixes: The Hadoop Ecological System; Matrix Calculation with MapReduce

A Traditional Business Intelligence System
[figure: the Microsoft/SAS BI stack, with MS SQL Server, SSMS, SSIS, SSAS, SSRS, BIDS, SAS EM, and SAS EG]

Hadoop ecosystem
[figure]

What is Hadoop?
- Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.
- Hadoop is not a replacement for a traditional RDBMS but a supplement for handling and processing large datasets.
- It achieves two tasks: 1. massive data storage; 2. faster processing.
- Using Hadoop is cheaper, faster, and better.

Hadoop 2: Big data's big leap forward
- The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
- The biggest constraint on scale has been Hadoop's job handling: all jobs in Hadoop 1 are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
- Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what is happening on that node.

Hadoop 1.0 vs Hadoop 2.0: features of Hadoop 2.0 over Hadoop 1.0
- Horizontal scalability of the NameNode.
- The NameNode is no longer a single point of failure.
- The ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph.
- The two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) are split into two separate daemons.

Apache Spark
- Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley.
- Spark's in-memory processing provides performance up to 100 times faster for certain applications.
- Spark is well suited for machine learning algorithms.
- Spark requires a cluster manager and a distributed storage system; it supports Hadoop YARN.
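To make Spark's in-memory API concrete, here is a minimal word-count sketch using Spark's Java API. This is a hedged illustration rather than anything from the slides; it assumes a Spark 2.x classpath, and the HDFS input and output paths are made up for the example.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load lines from HDFS (path is illustrative), split them into words,
        // pair each word with 1, and aggregate the counts in memory.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///tmp/wordcount-output");
        sc.stop();
    }
}
```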
MapReduce
- MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.

MapReduce 2.0: YARN (Yet Another Resource Negotiator)
[figure]

How Hadoop Operates
[figure]

Hadoop Ecosystem
[figure]

Hadoop Topics

No. | Topic | Components
1 | Data warehousing | HDFS, HBase, Hive, Kylin, NoSQL/NewSQL, Solr
2 | Publicly available big data services | Hortonworks, Cloudera, HaaS, EC2
3 | MapReduce & data mining | Mahout, H2O, R, Python
4 | Big data ETL | Kettle, Flume, Sqoop, Impala, Chukwa, Dremel, Pig
5 | Big data platform management | Oozie, ZooKeeper, Ambari, Loom, Ganglia
6 | Application development platform | Tomcat, Neo4j, Pig, Hue
7 | Tools & visualizations | Pentaho, Tableau, Saiku, Mondrian, Gephi
8 | Streaming data processing | Spark, Storm, Kafka, Avro

HADOOP DATA WAREHOUSING

Comparing the RDBMS and Hadoop data warehousing stacks

Layer | Conventional RDBMS | Hadoop | Advantage of Hadoop over a conventional RDBMS
Storage | Database tables | HDFS file system | HDFS is purpose-built for extreme IO speeds
Metadata | System tables | HCatalog | All clients can use HCatalog to read files
Query | SQL query engine | Multiple engines (SQL and non-SQL) | Multiple query engines such as Hive or Impala are available

HDFS (Hadoop Distributed File System)
- The Hadoop ecosystem consists of many components and libraries for varied tasks; the storage part of Hadoop is HDFS and the processing part is MapReduce.
- HDFS is a Java-based distributed file system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the cluster.

HDFS Architecture & Design
- HDFS has a master/slave architecture: a cluster consists of a single NameNode and a number of DataNodes.
- In HDFS, files are split into one or more blocks, which are stored in a set of DataNodes.
- HDFS exposes a file-system namespace and allows user data to be stored in files.
- DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
[figure]
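As an illustration beyond the slides, a minimal round trip through HDFS with the Java FileSystem API. The NameNode URI and the file path are assumptions made for the sketch.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // illustrative NameNode URI
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/demo.txt");
        // Write: the client streams data to DataNodes; the NameNode only tracks metadata.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }
        // Read the block(s) back from whichever DataNodes hold the replicas.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```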
What is NoSQL?
- Stands for Not Only SQL.
- NoSQL is a non-relational database management system, different from traditional relational database management systems in some significant ways.
- NoSQL is designed for distributed data stores where very large-scale data storage is needed (for example, Google or Facebook, which collect terabits of data every day for their users).
- These kinds of data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.

NoSQL
[figures] (credit: Praveen Ashokan)

What is NewSQL?
- A modern RDBMS that seeks to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system.
- SQL as the primary interface.
- Non-locking concurrency control.
- High per-node performance.
- The H-Store parallel database system is the first known NewSQL system.

Classification of NoSQL and NewSQL
[figure]

Taxonomy of Big Data Stores
[figure]

Features of OldSQL vs NoSQL vs NewSQL
[figure]

HBase
- HBase is a non-relational, distributed database.
- It is a column-oriented DBMS.
- It is an implementation of Google's BigTable.
- HBase is built on top of the Hadoop Distributed File System (HDFS).

Differences between HBase and a relational database
- HBase is a column-oriented database, while a relational database is row-oriented.
- HBase is highly scalable, while an RDBMS is hard to scale.
- HBase has a flexible schema, while an RDBMS has a fixed schema.
- HBase holds denormalized data, while data in a relational database is normalized.
- HBase performs well on large volumes of unstructured data, where a relational database performs poorly.
- HBase does not use any query language, while a relational database uses SQL to retrieve data.

HBase Data Model
[figure]

HBase: Keys and Column Families
- Each record is divided into column families.
[figure]
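A minimal sketch of the column-family model through the HBase 1.x Java client. The table name "users", the column family "info", and the row data are illustrative assumptions; the sketch presumes the table and family already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family "info", qualifier "email".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Read it back by row key; note there is no SQL involved.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```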
What is Apache Hive?
- Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
- It is built on top of Apache Hadoop and provides tools for easy data extract/transform/load (ETL).
- It supports analysis of large datasets stored in Hadoop's HDFS.
- It supports an SQL-like language called HQL, as well as big data analytics with the help of MapReduce.

What is HQL?
- HQL: Hive Query Language.
- It doesn't conform to any ANSI standard; it is very close to the MySQL dialect, but with some differences.
- SQL-to-HQL cheat sheet (Hortonworks): .../content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
- HQL doesn't support transactions, so don't compare it with an RDBMS.
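As an illustration beyond the slides, HQL can be submitted from Java through the standard HiveServer2 JDBC driver. This is a hedged sketch: it assumes the hive-jdbc driver is on the classpath, and the host, port, credentials, table name, and schema are all made up for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHqlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC endpoint; host, port, and database are illustrative.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Schema-on-read: the table definition is layered over files in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            // HQL looks like SQL but compiles to MapReduce (or Tez) jobs underneath.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) AS total FROM pageviews GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```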
HADOOP ETL

List of Tools
- Sqoop
- Flume
- Impala
- Chukwa
- Kettle

ETL
[figure]

Sqoop
- Short for "SQL to Hadoop."
- Used to move data back and forth between an RDBMS and HDFS for performing analysis with BI tools.
- A simple command-line tool (Sqoop 2 brings a web interface as well).

How Sqoop Works
[figure: a dataset is split into slices, each handled by its own mapper]

Sqoop 1 & Sqoop 2

Feature: connectors for all major RDBMSs
- Sqoop 1: supported.
- Sqoop 2: not supported. Workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. This connector should work on any other JDBC-compliant database, though performance might not be comparable to that of the specialized connectors in Sqoop.

Feature: encryption of stored passwords
- Sqoop 1: not supported; no workaround.
- Sqoop 2: supported, using Derby's on-disk encryption. Disclaimer: although expected to work in the current version of Sqoop 2, this configuration has not been verified.

Feature: data transfer from an RDBMS to Hive or HBase
- Sqoop 1: supported.
- Sqoop 2: not supported. Workaround: follow a two-step approach: (1) import data from the RDBMS into HDFS; (2) load the data into Hive or HBase manually, using appropriate tools and commands such as the LOAD DATA statement in Hive.

Feature: data transfer from Hive or HBase to an RDBMS
- Sqoop 1: not supported. Workaround: follow a two-step approach: (1) extract data from Hive or HBase into HDFS (either as a text or an Avro file); (2) use Sqoop to export the output of the previous step to the RDBMS.
- Sqoop 2: not supported; follow the same workaround as for Sqoop 1.

Sqoop 1 & Sqoop 2 Architecture
[figure]
For more on the differences: https://www.youtube.com/watch?v=xzU3HL4ZYI0

What is Flume?
- Flume is a distributed, reliable service used for gathering, aggregating, and transporting large amounts of streaming event data for analysis.
- Event data: streaming log data (website/application logs, to analyze user activity) or streaming data (e.g., social media, to analyze an event; stock prices, to analyze a stock's performance).

Architecture and Working
[figure]

Impala
- An open-source SQL query engine.
- Developed by Cloudera and fully open source, hosted on GitHub.
- Released as a beta in 10/2012; version 1.0 became available in 05/2013.

About Impala
[figure]

What is Chukwa
- Chukwa is an open-source data collection system for monitoring large distributed systems.
- Used for log collection and analysis.
- Built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework.
- Not a streaming database; not a real-time system.

Why do we need Chukwa?
- Data monitoring and analysis: to collect system metrics and log files.
- Stores data in Hadoop clusters and uses MapReduce to analyze it.
- Robust, scalable, rapid data processing.

How it Works?
[figure]

Data Analysis
[figure]

ETL tool comparison

Tool | Features | Advantages | Disadvantages
Sqoop | Bulk import, direct input, data interaction, data export | Parallel data transfer, efficient data analysis | Not easy to manage installations and configurations
Flume | Fan-out, fan-in, processors, auto-batching of events, multiplexing channels for data mining | Reliable, scalable, manageable, customizable, high performance; feature-rich and fully extensible; contextual routing | Has to weaken some delivery guarantees
Kettle | Migrating data between applications or databases, exporting data from databases to flat files, loading data massively into databases, data cleansing, integrating applications | Higher level than code; well-tested full suite of components; data analysis tools; free | Does not run fast; takes some time to install

Building a Data Warehouse in Hadoop using ETL Tools
- Copy data into HDFS with an ETL tool (e.g., Informatica), Sqoop, or Flume into standard HDFS files (write once). This registers the metadata with HCatalog.
- Declare the query schema in Hive or Impala, which doesn't require data copying or re-loading, thanks to the schema-on-read advantage of Hadoop over the schema-on-write constraint in an RDBMS.
- Explore with SQL queries and launch BI tools (e.g., Tableau, BusinessObjects) for exploratory analytics.

HADOOP DATA MINING

What is Mahout?
- Meaning: a person who keeps and drives an elephant (an Indian term).
- Mahout is a scalable open-source machine learning library hosted by Apache.
- Mahout's core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.
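The slides contain no Mahout code, so here is a hedged sketch of a user-based recommender with Mahout's non-distributed Taste API, assuming a Mahout 0.x classpath and a hypothetical ratings.csv of userID,itemID,rating triples.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds "userID,itemID,rating" lines (hypothetical input file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```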
Mahout's position
[figure]

MapReduce flow in Mahout
[figure]

What is H2O?
- H2O scales statistics, machine learning, and math over big data.
- H2O is extensible, and users can build blocks using simple math legos in the core.
- H2O keeps familiar interfaces like R, Excel, and JSON so that big data enthusiasts and experts can explore, merge, model, and score datasets using a range of simple to advanced algorithms.
- H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling.
- H2O has a vision of online scoring and modeling in a single platform.

How is H2O different from Mahout?

H2O | Mahout
Can use any of R, REST/JSON, GUI (browser), Java, or Scala | Can use Java
A GUI product with fewer algorithms | More algorithms, which need knowledge of Java
Algorithms are typically 100x faster than current MapReduce-based Mahout | Algorithms are typically slower compared to H2O
Knowledge of Java is NOT required to develop a prediction model | Knowledge of Java is required to develop a prediction model
Real time | Not real time

Users of H2O
- H2O predictive modeling factories: better marketing with H2O.
- Advertising technology: better conversions with H2O.
- Risk & fraud analysis: better detection with H2O.
- Customer intelligence: better sales with H2O.

MAP/REDUCE ALGORITHM

How to write a MapReduce program
- Parallelization is the key; the algorithm is different from a single-server application.
- Write a Map function and a Reduce function.
- Considerations: load balance, efficiency, memory management.

MapReduce Executes
[figure]

Schematic of a map-reduce computation
[figure]

Example: counting the number of occurrences of each word in a collection of documents
- The input file is a repository of documents, and each document is an element.
- The Map function for this example uses keys of type String (the words) and values that are integers.
- The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs (w1, 1), (w2, 1), ..., (wn, 1).

Map Task
- A single Map task will typically process many documents, so its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.
- After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, ..., vn]), where (k, v1), (k, v2), ..., (k, vn) are all the key-value pairs with key k coming from all the Map tasks.

Reduce Task
- The output of the Reduce function is a sequence of zero or more key-value pairs. Here the Reduce function simply adds up all the values, so the output of a reducer consists of the word and the sum.
- Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
- The application of the Reduce function to a single key and its associated list of values is referred to as a reducer.
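The word-count scheme above maps directly onto Hadoop's Java MapReduce API; this is the standard sketch, with input and output paths supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (w, 1) for every word w in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receive (w, [1, 1, ...]) and emit (w, m), the total count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```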
Big Data Visualization and Tools
- Tools: Tableau, Pentaho, Mondrian, Saiku, Spotfire, Gephi.

What is Tableau?
- Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag-and-drop operations.

Tableau Alliance Partners
[figure]

Tableau
[figure]

What is Pentaho?
- Pentaho is commercial open-source software for business intelligence (BI), developed since 2004 in Orlando, Florida.
- Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining, and a BI platform.
- It is built on the Java platform and runs well on various platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.).
- It has a complete package, from reporting and ETL for warehousing data management to an OLAP server, data mining, and dashboards.
- The BI platform supports Pentaho's end-to-end business intelligence capabilities and provides central access to your business information, with back-end security, integration, scheduling, auditing, and more.
- Designed to meet the needs of any size of organization.

A few facts
[figure]

Analyzer
[figure]

Reports
[figure]

Overall Features
[figure]

HADOOP IN YOUR LAPTOP

Hortonworks Background
- Hortonworks is a business computer-software company based in Palo Alto, California.
- Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers.
- They are sponsors of the Apache Software Foundation.
- Founded in June 2011 by Yahoo and Benchmark Capital as an independent company; it went public in December 2014.
- Companies that have collaborated with Hortonworks:
  - Microsoft (October 2011), to develop Azure & Windows Server support
  - Informatica (November 2011), to develop HParser
  - Teradata (February 2012), to develop the Aster data system
  - SAP AG (September 2012), which announced it would resell the Hortonworks distribution

They do Hadoop using HDP
[figure]

Hortonworks Data Platform
- Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data.
- It includes Apache projects such as HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper, and other components.
- Why was it developed? With one aim: to make Apache Hadoop ready for the enterprise.
- What does it do? It takes the big data components of Apache Hadoop and makes them ready for prime-time use in an enterprise environment.

HDP Functional Areas
[figure]

Certified Technology Program
- One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP.
- The Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform (HDP).
- Certifications: YARN Ready, Operations Ready, Security Ready, Governance Ready.

How to get HDP?
- HDP is architected, developed, and built completely in the open. Anyone can download it for free from http://hortonworks.com/hdp/downloads/
- It comes in different versions, which can be used as needed:
  - HDP 2.2 on Sandbox: runs on VirtualBox or VMware
  - Automated (Ambari): RHEL/Ubuntu/CentOS/SLES
  - Manual: RHEL/Ubuntu/CentOS/SLES
  - Windows: Windows Server 2008 & 2012

Installing HDP
[figure: use the VM's IP address to log in from the browser]

DEMO-HDP
Below are the steps we will be performing in HDP:
1. Starting HDP
2. Uploading a source file
3. Loading the file into HCatalog
4. Pig basic tutorial

About Cloudera
- Cloudera is "The commercial Hadoop company."
- Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo.
- Provides consulting and training services for Hadoop users.
- Staff includes several committers to Hadoop projects.

Who uses Cloudera?
[figure]

Cloudera Software (All Open-Source)
- Cloudera's Distribution including Apache Hadoop (CDH): a single, easy-to-install package from the Apache Hadoop core repository.
- Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version.
- Components: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper, Flume, Hue, Oozie, and Sqoop.

CDH and Enterprise Ecosystem
[figure]

Beyond Hadoop
- Hadoop is incapable of handling OLTP tasks because of its latency.
- Alibaba has developed its own distributed system instead of using Hadoop. Currently, it takes Alipay's system 20 ms to process a payment transaction, but 200 ms for fraud detection.
- China's railway ticketing site 12306.cn has replaced its old system with the VMware vFabric GemFire in-memory database system, which makes its services stable and robust.
HaaS (Hadoop as a Service)
- Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system.
- Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD, and the Windows Azure HDInsight Service are the primary HaaS offerings from global IT giants.

HaaS examples
[figures]

APPENDIX 1: HADOOP ECOLOGICAL SYSTEM

Choosing the right Hadoop architecture
- Application dependent.
- Too many solution providers.
- Too many choices.

Teradata Big Data Platform
[figure]

Dell's Hadoop ecosystem
[figure]

Nokia's Big Data Architecture
[figure]

Cloudera's Hadoop System
[figures]

Intel
[figure]

Comparison of Two Generations of Hadoop
[figures]

Different Components of Hadoop
[figures]

APPENDIX 2: MATRIX CALCULATION

Map/Reduce Matrix Multiplication
[figure]

Map/Reduce Scheme 1, Step 1
[figure]

Map/Reduce Scheme 1, Step 2
[figure]

Map/Reduce Scheme 2, Oneshot
[figure]

Communication Cost
- The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing it. In addition to the time to execute a task, it includes the time for moving data into memory.
- The algorithm executed by each task tends to be very simple, often linear in the size of its input.
- The typical interconnect speed for a computing cluster is one gigabit per second, so the time taken to move data from a chunk into main memory may exceed the time needed to operate on the data.

Reducer size
- The upper bound on the number of values that are allowed to appear in the list associated with a single key.
- Reducer size can be selected with at least two goals. First, by making the reducer size small, we force there to be many reducers, across which the problem input is divided by the Map tasks. Second, we can choose a reducer size sufficiently small that the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located; the running time is greatly reduced if we avoid moving data repeatedly between main memory and disk.

Replication rate
- The number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. That is, the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input.

Segmenting the Matrix to Reduce the Cost
[figure]

Map/Reduce Scheme 3
[figure]

Map/Reduce Scheme 4, Step 1
[figure]

Map/Reduce Scheme 4, Step 2
[figure]
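Since the scheme slides above are figures, a compact in-memory Java simulation of the one-pass ("oneshot") scheme for P = MN may help. In the Map phase, each element m_ij of M is replicated to every output cell (i, k), and each n_jk of N to every cell (i, k), so the replication rate is K for M's elements and I for N's; the reducer for key (i, k) then computes p_ik as the sum over j of m_ij * n_jk. The matrices and sizes are illustrative, and this is a sketch of the scheme, not Hadoop job code.

```java
import java.util.*;

/** One-pass MapReduce scheme for P = M * N, simulated in memory. */
public class OneShotMatrixMultiply {
    public static void main(String[] args) {
        double[][] M = {{1, 2}, {3, 4}};   // I x J (illustrative)
        double[][] N = {{5, 6}, {7, 8}};   // J x K
        int I = M.length, J = N.length, K = N[0].length;

        // Map phase: value = [whichMatrix, j, element], keyed by output cell (i, k).
        Map<List<Integer>, List<double[]>> shuffle = new HashMap<>();
        for (int i = 0; i < I; i++)
            for (int j = 0; j < J; j++)
                for (int k = 0; k < K; k++)      // replicate m_ij to all K cells in row i
                    shuffle.computeIfAbsent(Arrays.asList(i, k), c -> new ArrayList<>())
                           .add(new double[]{0, j, M[i][j]});   // 0 marks "from M"
        for (int j = 0; j < J; j++)
            for (int k = 0; k < K; k++)
                for (int i = 0; i < I; i++)      // replicate n_jk to all I cells in column k
                    shuffle.computeIfAbsent(Arrays.asList(i, k), c -> new ArrayList<>())
                           .add(new double[]{1, j, N[j][k]});   // 1 marks "from N"

        // Reduce phase: reducer for key (i, k) pairs m_ij with n_jk and sums over j.
        double[][] P = new double[I][K];
        for (Map.Entry<List<Integer>, List<double[]>> e : shuffle.entrySet()) {
            double[] mRow = new double[J], nCol = new double[J];
            for (double[] v : e.getValue()) {
                if (v[0] == 0) mRow[(int) v[1]] = v[2];
                else nCol[(int) v[1]] = v[2];
            }
            double sum = 0;
            for (int j = 0; j < J; j++) sum += mRow[j] * nCol[j];
            P[e.getKey().get(0)][e.getKey().get(1)] = sum;
        }
        System.out.println(Arrays.deepToString(P));  // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```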