Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe · n Apache Tez n Extensible framework being developed at Hortonworks for building high-performance applications in YARN n

Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe


CHAPTER 25

Big Data Technologies Basedon MapReduce and Hadoop


Introduction

n Phenomenal growth in data generationn Social median Sensorsn Communications networks and satellite imageryn User-specific business data

n “Big data” refers to massive amounts of datan Exceeds the typical reach of a DBMS

n Big data analytics

Slide 25- 3


25.1 What is Big Data?

n Big data ranges from terabytes (1012 bytes) or petabytes (1015 bytes) to exobytes (1018 bytes)

n Volumen Refers to size of data managed by the system

n Velocityn Speed of data creation, ingestion, and processing

n Varietyn Refers to type of data sourcen Structured, unstructured

Slide 25- 4


What is Big Data? (cont’d.)

n Veracityn Credibility of the sourcen Suitability of data for the target audiencen Evaluated through quality testing or credibility

analysis

Slide 25- 5


25.2 Introduction to MapReduce and Hadoop

n Core components of Hadoopn MapReduce programming paradigmn Hadoop Distributed File System (HDFS)

n Hadoop originated from quest for open source search enginen Developed by Cutting and Carafella in 2004n Cutting joined Yahoo in 2006n Yahoo spun off Hadoop-centered company in

2011n Tremendous growth

Slide 25- 6


Introduction to MapReduce and Hadoop (cont’d.)

n MapReducen Fault-tolerant implementation and runtime

environmentn Developed by Dean and Ghemawat at Google in

2004n Programming style: map and reduce tasks

n Automatically parallelized and executed on large clusters of commodity hardware

n Allows programmers to analyze very large datasets

n Underlying data model assumed: key-value pairSlide 25- 7


The MapReduce Programming Model

n Mapn Generic function that takes a key of type K1 and

value of type V1n Returns a list of key-value pairs of type K2 and V2

n Reducen Generic function that takes a key of type K2 and a

list of values V2 and returns pairs of type (K3, V3)n Outputs from the map function must match the

input type of the reduce function

Slide 25- 8


The MapReduce Programming Model (cont’d.)

Slide 25-9

Figure 25.1 Overview of MapReduce execution (Adapted from T. White, 2012)



n MapReduce examplen Make a list of frequencies of words in a documentn Pseudocode

Slide 25- 10



n MapReduce example (cont’d.)n Actual MapReduce code

Slide 25- 11



n Distributed grepn Looks for a given pattern in a filen Map function emits a line if it matches a supplied

patternn Reduce function is an identity function

n Reverse Web-link graphn Outputs (target URL, source URL) pairs for each

link to a target page found in a source page

Slide 25- 12



n Inverted indexn Builds an inverted index based on all words

present in a document repositoryn Map function parses each document

n Emits a sequence of (word, document_id) pairsn Reduce function takes all pairs for a given word

and sorts them by document_idn Job

n Code for Map and Reduce phases, a set of artifacts, and properties

Slide 25- 13



n Hadoop releasesn 1.x features

n Continuation of the original code basen Additions include security, additional HDFS and

MapReduce improvementsn 2.x features

n YARN (Yet Another Resource Navigator)n A new MR runtime that runs on top of YARNn Improved HDFS that supports federation and

increased availability

Slide 25- 14


25.3 Hadoop Distributed File System (HDFS)

n HDFSn File system component of Hadoopn Designed to run on a cluster of commodity

hardwaren Patterned after UNIX file systemn Provides high-throughput access to large datasetsn Stores metadata on NameNode servern Stores application data on DataNode servers

n File content replicated on multiple DataNodes

Slide 25- 15


Hadoop Distributed File System (cont’d.)

n HDFS design assumptions and goalsn Hardware failure is the normn Batch processingn Large datasetsn Simple coherency model

n HDFS architecturen Master-slaven Decouples metadata from data operationsn Replication provides reliability and high availabilityn Network traffic minimized

Slide 25- 16



n NameNoden Maintains image of the file system

n i-nodes and corresponding block locationsn Changes maintained in write-ahead commit log

called Journaln Secondary NameNodes

n Checkpointing role or backup rolen DataNodes

n Stores blocks in node’s native file systemn Periodically reports state to the NameNode

Slide 25- 17



n File I/O operationsn Single-writer, multiple-reader modeln Files cannot be updated, only appendedn Write pipeline set up to minimize network

utilizationn Block placement

n Nodes of Hadoop cluster typically spread across many racks

n Nodes on a rack share a switch

Slide 25- 18



n Replica managementn NameNode tracks number of replicas and block

locationn Based on block reports

n Replication priority queue contains blocks that need to be replicated

n HDFS scalabilityn Yahoo cluster achieved 14 petabytes, 4000 nodes,

15k clients, and 600 million files

Slide 25- 19


The Hadoop Ecosystem

n Related projects with additional functionalityn Pig and hive

n Provides higher-level interface for working with Hadoop framework

n Oozien Service for scheduling and running workflows of

jobsn Sqoop

n Library and runtime environment for efficiently moving data between relational databases and HDFS

Slide 25- 20


The Hadoop Ecosystem (cont’d.)

n Related projects with additional functionality (cont’d.)n HBase

n Column-oriented key-value store that uses HDFS

Slide 25- 21


25.4 MapReduce: Additional Details

n MapReduce runtime environmentn JobTracker

n Master processn Responsible for managing the life cycle of Jobs and

scheduling Tasks on the clustern TaskTracker

n Slave processn Runs on all Worker nodes of the cluster

Slide 25- 22


MapReduce: Additional Details (cont’d.)

n Overall flow of a MapReduce jobn Job submissionn Job initializationn Task assignmentn Task executionn Job completion

Slide 25- 23



n Fault tolerance in MapReducen Task failure

n Runtime exceptionn Java virtual machine crashn No timely updates from the task process

n TaskTracker failuren Crash or disconnection from JobTrackern Failed Tasks are rescheduled

n JobTracker failuren Not a recoverable failure in Hadoop v1

Slide 25- 24



n The shuffle proceduren Reducers get all the rows for a given key togethern Map phase

n Background thread partitions buffered rows based on the number of Reducers in the job and the Partitioner

n Rows sorted on key valuesn Comparator or Combiner may be used

n Copy phasen Reduce phase

Slide 25- 25



n Job schedulingn JobTracker schedules work on cluster nodesn Fair Scheduler

n Provides fast response time to small jobs in a Hadoop shared cluster

n Capacity Schedulern Geared to meet needs of large enterprise

customers

Slide 25- 26



n Strategies for equi-joins in MapReduce environmentn Sort-merge joinn Map-side hash joinn Partition joinn Bucket joinsn N-way map-side joinsn Simple N-way joins

Slide 25- 27



n Apache Pign Bridges the gap between declarative-style

interfaces such as SQL, and rigid style required by MapReduce

n Designed to solve problems such as ad hoc analyses of Web logs and clickstreams

n Accommodates user-defined functions

Slide 25- 28



n Apache Hiven Provides a higher-level interface to Hadoop using

SQL-like queriesn Supports processing of aggregate analytical

queries typical of data warehousesn Developed at Facebook

Slide 25- 29


Hive System Architecture and Components

Slide 25-30

Figure 25.2 Hive system architecture and components


Advantages of the Hadoop/MapReduce Technology

n Disk seek rate a limiting factor when dealing with very large data setsn Limited by disk mechanical structure

n Transfer speed is an electronic feature and increasing steadily

n MapReduce processes large datasets in paralleln MapReduce handles semistructured data and

key-value datasets more easilyn Linear scalability

Slide 25- 31


25.5 Hadoop v2 (Alias YARN)

n Reasons for developing Hadoop v2n JobTracker became a bottleneckn Cluster utilization less than desirablen Different types of applications did not fit into the

MR modeln Difficult to keep up with new open source versions

of Hadoop

Slide 25- 32


YARN Architecture

n Separates cluster resource management from Jobs management

n ResourceManager and NodeManager together form a platform for hosting any application on YARN

n ApplicationMasters send ResourceRequests to the ResourceManager which then responds with cluster Container leases

n NodeManager responsible for managing Containers on their nodes

Slide 25- 33


Hadoop Version Schematics

Slide 25-34

Figure 25.3 The Hadoop v1 vs. Hadoop v2 schematic


Other Frameworks on YARN

n Apache Tezn Extensible framework being developed at

Hortonworks for building high-performance applications in YARN

n Apache Giraphn Open-source implementation of Google’s Pregel

system, a large-scale graph processing system used to calculate Page-Rank

n Hoya: HBase on YARNn More flexibility and improved cluster utilization

Slide 25- 35


25.6 General Discussion

n Hadoop/MapReduce versus parallel RDBMSn 2009: performance of two approaches measured

n Parallel database took longer to tune compared to MR

n Performance of parallel database 3-6 times faster than MR

n MR improvements since 2009n Hadoop has upfront cost advantage

n Open source platform

Slide 25- 36


General Discussion (cont’d.)

n MR able to handle semistructured datasetsn Support for unstructured data on the rise in

RDBMSsn Higher level language support

n SQL for RDBMSsn Hive has incorporated SQL features in HiveQL

n Fault-tolerance: advantage of MR-based systems

Slide 25- 37



n Big data somewhat dependent on cloud technology

n Cloud model offers flexibilityn Scaling out and scaling upn Distributed software and interchangeable

resourcesn Unpredictable computing needs not uncommon in

big data projectsn High availability and durability

Slide 25- 38



n Data locality issuesn Network load a concernn Self-configurable, locality-based data and virtual

machine management framework proposedn Enables access of data locally

n Caching techniques also improve performancen Resource optimization

n Challenge: optimize globally across all jobs in the cloud rather than per-job resource optimizations

Slide 25- 39



n YARN as a data service platformn Emerging trend: Hadoop as a data lake

n Contains significant portion of enterprise datan Processing happens

n Support for SQL in Hadoop is improvingn Apache Storm

n Distributed scalable streaming enginen Allows users to process real-time data feeds

n Storm on YARN and SAS on YARN

Slide 25- 40



n Challenges faced by big data technologiesn Heterogeneity of informationn Privacy and confidentialityn Need for visualization and better human interfacesn Inconsistent and incomplete information

Slide 25- 41



n Building data solutions on Hadoopn May involve assembling ETL (extract, transform,

load) processing, machine learning, graph processing, and/or report creation

n Programming models and metadata not unifiedn Analytics application developers must try to

integrate services into coherent solutionn Cluster a vast resource of main memory and flash

storagen In-memory data enginesn Spark platform from Databricks

Slide 25- 42


25.7 Summary

n Big data technologies at the center of data analytics and machine learning applications

n MapReducen Hadoop Distributed File Systemn Hadoop v2 or YARN

n Generic data services platformn MapReduce/Hadoop versus parallel DBMSs

Slide 25- 43

Documents

Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe · n Apache Tez n Extensible framework being developed at Hortonworks for building high-performance applications in YARN n