Extending Hadoop for Fun & Profit

Milind Bhandarkar, Chief Scientist, Pivotal Software

(Twitter : @techmilind)

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Agenda

• Extending MapReduce

• Functionality

• Performance

• Beyond MapReduce with YARN

• Hamster & GraphLab

• Extending HDFS

• Q & A

Extending MapReduce

MapReduce Overview

• Record = (Key, Value)

• Key : Comparable, Serializable

• Value: Serializable

• Logical Phases: Input, Map, Shuffle, Reduce, Output

Map

• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation

Shuffle

• Input: List(Key2, Value2)

• Output

• Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop : Several Customizations Possible

Reduce

• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregations
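The Map, Shuffle, and Reduce signatures above can be sketched as a small in-memory word count. This is only an illustration of the data flow, not Hadoop code: the names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are hypothetical, and the CRC-based partitioning is an assumed stand-in for a real partitioner.

```python
import zlib
from collections import defaultdict

def map_fn(key, value):
    # (Key1, Value1) -> List(Key2, Value2): emit (word, 1) for every word in the line.
    return [(word, 1) for word in value.split()]

def shuffle(pairs, num_partitions=2):
    # Partition by a stable hash of the key, group values per key,
    # then sort the keys inside each partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index][key].append(value)
    return [sorted(p.items()) for p in partitions]

def reduce_fn(key, values):
    # (Key2, List(Value2)) -> List(Key3, Value3): aggregate by summing.
    return [(key, sum(values))]

def run_job(records, num_partitions=2):
    # Input -> Map -> Shuffle -> Reduce -> Output, all in one process.
    mapped = [kv for key, value in records for kv in map_fn(key, value)]
    output = []
    for partition in shuffle(mapped, num_partitions):
        for key, values in partition:
            output.extend(reduce_fn(key, values))
    return output

if __name__ == "__main__":
    # Keys are byte offsets of lines, values are the lines themselves.
    print(dict(run_job([(0, "a b a"), (6, "b c")])))
```

In a real cluster each partition would be handled by a separate Reduce task, which is why grouping and sorting are done per partition here.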

MapReduce DataFlow

Configuration

• Unified Mechanism for

• Configuring Daemons

• Runtime environment for Jobs/Tasks

• Defaults: *-default.xml

• Site-Specific: *-site.xml

• final parameters

Example

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head.server.node.com:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head.server.node.com:9000</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <final>true</final>
  </property>
  ....
</configuration>
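How a final parameter resists later overrides can be sketched roughly as below. This is a simplified model, not Hadoop's Configuration class: the function name `merge_resources` is hypothetical, and the sketch assumes resources are applied in load order (defaults, then site, then job).

```python
import xml.etree.ElementTree as ET

def merge_resources(xml_resources):
    """Merge configuration documents in load order.

    A property marked <final>true</final> keeps its value even if a
    resource loaded later tries to set it again.
    """
    values, finals = {}, set()
    for text in xml_resources:
        for prop in ET.fromstring(text).findall("property"):
            name = prop.findtext("name")
            if name in finals:
                continue  # an earlier resource declared this property final
            values[name] = prop.findtext("value")
            if (prop.findtext("final") or "").strip().lower() == "true":
                finals.add(name)
    return values

SITE = """<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <final>true</final>
  </property>
</configuration>"""

# Hypothetical job-level settings that try to raise the heap size.
JOB = """<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4g</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(merge_resources([SITE, JOB]))
```

Here the job-level heap setting is ignored because the site file marked the property final.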

Extending Input Phase

• Convert ByteStream to List(Key, Value)

• Several Formats pre-packaged

• TextInputFormat<LongWritable, Text>

• SequenceFileInputFormat<K,V>

• KeyValueTextInputFormat<Text,Text>

• Specify InputFormat for each job

• JobConf.setInputFormat()

InputFormat

• getSplits() : From Input descriptors, get Input Splits, such that each Split can be processed independently

• <FileName, startOffset, length>

• getRecordReader() : From an InputSplit, get list of Records
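The split computation can be sketched as follows, assuming a single file of known size cut into fixed-size pieces. The names `FileSplit` and `get_splits` are hypothetical Python stand-ins, and details of the real FileInputFormat (such as block locations) are left out.

```python
from typing import List, NamedTuple

class FileSplit(NamedTuple):
    # Mirrors the <FileName, startOffset, length> triple.
    file_name: str
    start_offset: int
    length: int

def get_splits(file_name: str, file_size: int, split_size: int) -> List[FileSplit]:
    """Cut a file into independent splits using only metadata (name and size)."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)  # last split may be short
        splits.append(FileSplit(file_name, offset, length))
        offset += length
    return splits

if __name__ == "__main__":
    mb = 1024 * 1024
    for split in get_splits("video.mpg", 150 * mb, 64 * mb):
        print(split)
```

Because the function never opens the file, it stays cheap even when run for many inputs in a single process.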

Industry Use Case

Surveillance Video Anomaly Detection

Acknowledgements

• Victor Fang

• Regu Radhakrishnan

• Derek Lin

• Sameer Tiwari

Anomaly Detection in Surveillance Video

• Detect anomalous objects in a restricted perimeter

• A typical large enterprise collects terabytes of video per day

• Hadoop MapReduce runs computer vision algorithms in parallel and captures violation events

• Post-Incident monitoring enabled by Interactive Query

Video DataFlow

• Timestamped Video Files as input

• Distributed Video Transcoding : ETL in Hadoop

• Distributed Video Analytics in Hadoop/HAWQ

• Insights in relational DB

Real World Video Data

• Benchmark Surveillance videos from UK Home Office (iLids)

• CCTV Video footage depicting scenarios central to Govt requirements

Common Video Standards

• MPEG & ITU responsible for most video standards

• MPEG-2 (1995) Widely adopted in DVDs, TV, Set Top boxes

MPEG Standard Format

• Sequence of encoded video frames

• Compression by eliminating:

• Redundancy in Time: Inter-Frame Encoding

• Redundancy in Space: Intra-Frame Encoding

Motion Compensation

• I-Frame: Intra-Frame encoding

• P-Frame: Predicted frame from previous frame

• B-Frame: Predicted frame from both previous & next frame

Distributed MPEG Decoding

• HDFS splits large files into 64 MB/128 MB blocks

• Each HDFS block can be processed independently by a Map task

• Can we decode individual video frames from an arbitrary HDFS block in an MPEG File ?

Splitting MPEG-2

• Header Information available only once per file

• Group of Pictures (GOP) header repeats

• Each GOP starts with an I-Frame and ends with an I-Frame

• Each GOP can be decoded independently

• First and last GOP may straddle HDFS blocks

MPEG2InputFormat

• Derived from FileInputFormat

• getSplits() : Identical to FileInputFormat

• InputSplit = HDFS Block

• getRecordReader()

• MPEG2RecordReader

MPEG2RecordReader

• Start from beginning of block

• Search for the first GOP Header

• Locate an I-Frame, decode, keep in memory

• If P-Frame, decode using last frame

• If B-Frame, keep current frame in memory, read next frame, decode current frame
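The "search for the first GOP Header" step can be sketched as a byte scan. This assumes a raw MPEG-2 video elementary stream, where a GOP header begins with the start code 0x000001B8. The function names are hypothetical, frame decoding is omitted, and a real reader would also have to handle a start code that straddles two HDFS blocks.

```python
# GOP start code in an MPEG-2 video elementary stream.
GOP_START_CODE = b"\x00\x00\x01\xb8"

def first_gop_offset(block: bytes):
    """Return the offset of the first GOP header in the block, or None."""
    pos = block.find(GOP_START_CODE)
    return None if pos == -1 else pos

def gop_offsets(block: bytes):
    """Return the offsets of every GOP header found inside the block."""
    offsets = []
    pos = block.find(GOP_START_CODE)
    while pos != -1:
        offsets.append(pos)
        pos = block.find(GOP_START_CODE, pos + len(GOP_START_CODE))
    return offsets

if __name__ == "__main__":
    # Synthetic block: 5 filler bytes, a GOP header, 2 payload bytes, another GOP header.
    block = b"\xff" * 5 + GOP_START_CODE + b"\x01\x02" + GOP_START_CODE + b"\x03"
    print(first_gop_offset(block), gop_offsets(block))
```

Bytes before the first offset belong to a GOP that started in the previous block, so the reader for that block is the one that would finish decoding them.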

Considerations for Input Format

• Use as little metadata as possible

• Number of Splits = Number of Map Tasks

• Combine small files

• Split determination happens in a single process, so should be metadata-based

• Affects scalability of MapReduce
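The "combine small files" advice can be illustrated with a greedy packing sketch that relies only on file names and sizes. The function `combine_small_files` is a hypothetical name, and this is not the algorithm used by any particular Hadoop class.

```python
def combine_small_files(files, max_split_size):
    """Pack (name, size) pairs into splits whose total size stays under the cap.

    Files are taken in order; a file larger than the cap gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_split_size:
            splits.append(current)  # close the split that would overflow
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

if __name__ == "__main__":
    files = [("a", 10), ("b", 20), ("c", 40), ("d", 5)]
    print(combine_small_files(files, 64))
```

Fewer splits mean fewer Map tasks, which matters because the number of splits equals the number of Map tasks.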

Scalability

• If one node processes k MB/s, then N nodes should process (k*N) MB/s

• If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes

• Linear Scalability

• Reduce Latency: Minimize job execution time

• Increase Throughput: Maximize amount of data processed per unit time

Amdahl’s Law

$$S = \frac{N}{1 + \alpha (N - 1)}$$

(N = number of processors, α = serial fraction of the computation)

Multi-Phase Computations

• If computation C is split into N different parts, C1..CN

• If partial computation Ci can be sped up by a factor of Si

Amdahl’s Law, Restated

$$S = \frac{\sum_{i=1}^{N} C_i}{\sum_{i=1}^{N} \frac{C_i}{S_i}}$$

Amdahl’s Law

• Suppose Job has 5 phases: P0 is 10 seconds, P1, P2, P3 are 200 seconds each, and P4 is 10 seconds

• Sequential runtime = 620 seconds

• P1, P2, P3 parallelized on 100 machines with speedup of 80 (Each executes in 2.5 seconds)

• After parallelization, runtime = 27.5 seconds

• Effective Speedup: (620s/27.5s) = 22.5
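The arithmetic of this five-phase job can be checked with a short sketch of the restated form of Amdahl's Law (total sequential time divided by the sum of per-phase times after speedup). The function name `effective_speedup` is hypothetical.

```python
def effective_speedup(phase_times, phase_speedups):
    """Overall speedup of a multi-phase job: sum(C_i) / sum(C_i / S_i)."""
    if len(phase_times) != len(phase_speedups):
        raise ValueError("need one speedup per phase")
    sequential = sum(phase_times)
    parallel = sum(t / s for t, s in zip(phase_times, phase_speedups))
    return sequential / parallel

if __name__ == "__main__":
    times = [10, 200, 200, 200, 10]   # P0..P4 in seconds
    speedups = [1, 80, 80, 80, 1]     # only P1..P3 are parallelized
    print(sum(times))                                      # sequential runtime
    print(sum(t / s for t, s in zip(times, speedups)))     # parallel runtime
    print(round(effective_speedup(times, speedups), 1))    # effective speedup
```

Although three phases run 80 times faster, the two 10-second sequential phases cap the overall gain at roughly 22.5.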

MapReduce Workflow

Extending Shuffle

Why Shuffle ?

• Often the most expensive phase in MapReduce, because it involves slow disks and the network

• Map tasks partition, sort and serialize outputs, and write to local disk

• Reduce tasks pull individual Map outputs over network, merge, and may spill to disk

Message Cost Model

T = α + Nβ

Message Granularity

• For Gigabit Ethernet

• α = 300 μs (per-message latency)

• β = 10 ns per byte (1/β = 100 MB/s bandwidth)

• 100 Messages of 10 KB each = 40 ms

• 10 Messages of 100 KB each = 13 ms
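The two Gigabit Ethernet figures follow from applying T = α + Nβ to each message and summing over all messages. A minimal sketch, assuming 1 KB = 1,000 bytes and 1 MB = 1,000,000 bytes, with `transfer_time` as a hypothetical helper name:

```python
ALPHA_SECONDS = 300e-6         # per-message latency (alpha)
BANDWIDTH_BYTES_PER_S = 100e6  # 100 MB/s, so beta = 1 / bandwidth = 10 ns per byte

def transfer_time(num_messages, message_bytes,
                  alpha=ALPHA_SECONDS, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Total time to send num_messages messages of message_bytes each."""
    beta = 1.0 / bandwidth  # seconds per byte
    return num_messages * (alpha + message_bytes * beta)

if __name__ == "__main__":
    # The same 1 MB of payload, sent at two different granularities.
    print(transfer_time(100, 10_000) * 1e3, "ms")   # many small messages
    print(transfer_time(10, 100_000) * 1e3, "ms")   # few large messages
```

The payload is identical in both cases; the difference comes entirely from paying the latency α 100 times instead of 10.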

Alpha-Beta

• Common Mistake: Assuming that α is constant

• Scheduling latency for responder

• MR daemons' time slice is inversely proportional to the number of concurrent tasks

• Common Mistake: Assuming that β is constant

• Network congestion

• TCP incast

Efficient Hardware Platforms

• Mellanox - Hadoop Acceleration through Network-assisted Merge

• RoCE - Brocade, Cisco, Extreme, Arista...

• SSD - Velobit, Violin, FusionIO, Samsung..

• Niche - Compression, Encryption...

Pluggable Shuffle & Sort

• Replace HTTP-based pull with RDMA

• Avoid spilling altogether

• Replace default Sort implementation with Job-optimized sorting algorithm

• Experimental APIs

• Search for PluggableShuffleAndPluggableSort.html (Apache Hadoop documentation)

Mellanox UDA

• Developed jointly with Auburn University

• 2x Performance on TeraSort

• Reduces disk writes by 45%, disk reads by 15%

Syncsort DMX-h

Beyond MapReduce with YARN

[Figure: single-application silos labeled BATCH, INTERACTIVE, and ONLINE, over HDFS]

Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)

MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)

Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

• HDFS (redundant, reliable storage)

• MapReduce (cluster resource management & data processing)

• Pig (data flow), Hive (sql), Others (cascading)

HADOOP 2.0

• HDFS2 (redundant, reliable storage)

• YARN (cluster resource management)

• Tez (execution engine)

• MR (batch)

• RT Stream, Graph (Storm, Giraph)

• Services (HBase)

• Pig (data flow), Hive (sql), Others (cascading)

Applications Run Natively IN Hadoop

• HDFS2 (Redundant, Reliable Storage)

• YARN (Cluster Resource Management)

• BATCH (MapReduce)

• INTERACTIVE (Tez)

• STREAMING (Storm, S4, …)

• GRAPH (Giraph)

• IN-MEMORY (Spark)

• HPC MPI (OpenMPI)

• ONLINE (HBase)

• OTHER (Search) (Weave…)

YARN Platform (Image Courtesy Arun Murthy, Hortonworks)

[Figure: ResourceManager with Scheduler, Client2, and four NodeManagers hosting AM 1, AM2, and Containers 1.1 to 1.3 and 2.1 to 2.4]

YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

• More to come ...

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

• ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

[Figure: Client, Resource Manager (Scheduler, AMService), and Node Managers (Aux Srvcs) hosting Proc/Containers that run the MPI AM (Scheduler, HNP) and Framework Daemons (NS, MPI); links labeled Client-RM, RM-AM, AM-NM, RM-NodeManager]

Hamster Architecture

Hamster Scalability

• Sufficient for small to medium HPC workloads

• Job launch time gated by YARN resource scheduler

           Launch    WireUp    Collectives  Monitor
OpenMPI    O(logN)   O(logN)   O(logN)      O(logN)
Hamster    O(N)      O(logN)   O(logN)      O(logN)

GraphLab + Hamster on Hadoop

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin at CMU in 2009

• Recently founded Graphlab Inc to commercialize Graphlab.org

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (Pagerank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...

Only Graphs are not Enough

• Full data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Coming Soon…

Extending HDFS

HCFS

• Hadoop Compatible File Systems

• FileSystem, FileContext

• S3, Local FS, webhdfs

• Azure Blob Storage, CassandraFS, Ceph, CleverSafe, Google Cloud Storage, Gluster, Lustre, QFS, EMC ViPR (more to come)

New Dataset

• Reuse Namenode and Datanode implementations

• Substitute a different DataSet implementation: FsDatasetSpi, FsVolumeSpi

• Jira: HDFS-5194

Extending Namenode

• Pluggable Namespace: HDFS-5324, HDFS-5389

• Pluggable Block Management: HDFS-5477

• Requires fine-grained locking in Namenode: HDFS-5453

Questions ?
