MapReduce: Simplified Data Processing on Large...

MapReduce: Simplified Data Processing on Large Clusters

Naser AlDuaij

MapReduce: Benefits

  Programming model for processing/generating large data sets

  Easy to use   Highly scalable (order of TBs of data)

MapReduce: Features

  Parallelization   Fault-tolerance   Locality optimization (avoid network overhead)   Load balancing

MapReduce: Real-life applications

  Google's production web search service   Sorting   Data mining   Machine learning   Graph computations

MapReduce: Map & Reduce

  Distributed divide & conquer approach   Input/output key/value pairs   Map: Master node partitions input   Reduce: Master node collects & combines data

MapReduce: Example

Word frequency

MapReduce: Types

Map (k1, v1) → list (k2, v2) Reduce (k2, list (v2)) → list (v2)

  Input k,v ≠ Output k,v data domain   Word Frequency example: Map (string, string) → list (partial_str, int) Reduce (partial_str, Iterator (list (int))) → list (int)

MapReduce: More Examples

  Distributed Grep: Map <pattern> – Reduce <collect output>

  URL Access Frequency: Map <URL, 1> – Reduce <URL, total count>

  Reverse Web-Link Graph: Map <target, src> - Reduce <target, list(src)>

MapReduce: More Examples

  Term-Vector per Host: Map <hostname, term vec> - Reduce <hostname, term vec>

  Inverted Index: Map <word, doc ID> - Reduce <word, list(doc ID)>

  Distributed Sort: Map <key, rec> - Reduce <key, rec>

MapReduce: Execution Overview

  M map tasks, R reduce tasks / output files   Input partitioned into M splits (~16-64 MB/piece)   R pieces using tweak-able partitioning function   Ideally, M && R >> # of worker machines   Task states: <idle, in-progress, completed>

How is master node chosen?

MapReduce: Fault Tolerance

  “Master pings every worker periodically”   Completed map tasks re-executed on failure

(locality)   Master redistributes tasks for failed workers   MapReduce is aborted on master failure

MapReduce: Backup Tasks

“Stragglers”: Slow machines that hinder completion

  Solution: Schedule backup tasks for in-progress tasks towards

the end of MapReduce run

MapReduce: Refinements

  Partitioning function (default: Hash(key) mod R)   Key/value pairs processed in increasing order (sorted

output file per partition)   Combiner function: Performed in map with results

stored in intermediate file   Implementable reader interface for new input types

MapReduce: Side-effects & Debugging

  Users are responsible for output files   “No support for atomic two-phase commits of

multiple output files”   Signal handler for worker process (UDP packet)   Option to run MapReduce on one local machine   “Counter” facility for counting (e.g. # of words)   Status information page with statistics

MapReduce: Performance

  Two computations: Search and sort (~1TB)   Cluster: ~1800 Machines   Input split into 64 MB pieces   M = 15000, R = 1 for Grep and R = 4000 for sort

MapReduce: Performance - Grep

  Search for rare three-character pattern

  ~150 seconds to complete

MapReduce: Performance - Sort

MapReduce: Statistics

MapReduce: Google's web search indexing

  Code is “Simpler, smaller, easier to understand”   3800 → 700 lines of code with MapReduce   Good performance with fault tolerance

MapReduce: Related work

  Parallelization (Bulk Synchronous Programming)   Locality optimization (Active disks)   Eager scheduling mechanism (Charlotte System)   Cluster management system (Condor)   Sorting facility (NOW-Sort)   Sending data over distributed queues (River)   Fault tolerance (BAD-FS, TACC)

MapReduce: Simplified Data Processing on Large Clusters

Questions?

Thank you

MapReduce: Simplified Data Processing on Large...

Documents

MapReduce with Scalding · Java MapReduce Word count example . MapReduce Techs Scalding.io Java MapReduce Pig Hive Hadoop Cascading Others tion . The promise of Cascading Scalding.io

MapReduce - polito.itdbdmg.polito.it/.../uploads/2019/11/04-MapReduce.pdf · MapReduce •Published in 2004 by Google •J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing

MapReduce - Rangerranger.uta.edu/~jrao/CSE5306/fall2016/slides/MapReduce.pdf · Tod Lipcon@Hadoop summit RAMManager Local disk Merge to disk Yes, fetch to RAM No, fetch to disk Merge

MapReduce - cacs.usc.educacs.usc.edu/education/cs653/MapReduce.pdf · MapReduce • Parallel programming model for data-intensive applications on large clusters > User just implements

MapReduce & Hadoop IIcslui/CMSC5702/mapreduce_hadoop2.pdf · MapReduce & Hadoop II ... MapReduce & Hadoop MapReduce Recap ... example, the combiners aggregate term counts across the

Clustering Lecture 8: MapReduce - csce.uark.eduxintaowu/BDAM/mapreduce.pdf · Distributed Grep . Very big data . Split data Split data Split data Split data . grep grep grep grep

Data-Intensive Text Processing with MapReducegtsat/collection/map reduce/Lin-MapReduce.pdf · Data-Intensive Text Processing with MapReduce Jimmy LinJimmy Lin The iSchool University

Processing on Large Clusters Mapreduce: Simplified Datamanosk/assets/slides/f19/mapreduce.pdf · Break down computation into map and reduce stages Map stage applies some operation

Cloud Computing using MapReduce, Hadoop, Sparkoxent2.ic.unicamp.br/sites/oxent2.ic.unicamp.br/files/mapreduce.pdf · MapReduce Goals • Cloud Environment: – Commodity nodes (cheap,

Data Mining and MapReduce - postgraduates undergraduatesgprovan/CS6323/2015/L13-NaiveBayes-MapReduce.pdf · Data Mining and MapReduce Adapted from Lectures by Prabhakar Raghavan (Yahoo

MapReduce: Theory and Implementationuser.it.uu.se/~justin/Teaching/DS/mapreduce.pdf · reduce After the map phase is over, all the intermediate values for a given output key are combined

Data Management in Large-Scale Distributed Systems - MapReduce … · Introduction to MapReduce The Hadoop Eco-System HDFS Hadoop MapReduce 4. MapReduce at Google Publication The

MapReduce-MPI Library Users Manualmapreduce.sandia.gov/doc/Manual.pdf · MapReduce-MPI WWW Site - MapReduce-MPI Documentation What is a MapReduce? The canonical example of a MapReduce

MapReduce Paradigm

MapReduce (Hadoop)densetsu.org/Cloud2012/(11) MapReduce.pdf · Map: (K1, V1) list (K2, V2) Reduce: (K2, list(V2)) list(K3, V3) The general MapReduce data flow. Note that after distributing

Hadoop/MapReduce - 123seminarsonly.comHadoop MapReduce • MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper submitted in

MapReduce Tutorial

Python MapReduce Programming with Pydoop · MapReduce and Hadoop Hadoop Crash Course Pydoop: a Python MapReduce and HDFS API for Hadoop Python MapReduce Programming with Pydoop Simone

Pipelined-MapReduce an Improved MapReduce

MapReduce - University of Technologycse.hcmut.edu.vn/~ptvu/ppds/MapReduce.pdf · More examples: Distributed Grep, Count of URL access frequency, etc. ... master-worker paradigm to