Hadoop on VMware

© 2009 VMware Inc. All rights reserved

Hadoop as a Service

Hadoop on Virtualization
Hadoop World, December 2011
Jun Ping Du, Richard McDougall
VMware, Inc.


Cloud: Big Shifts in Simplification and Optimization

1. Reduce the Complexity to simplify operations and maintenance

2. Dramatically Lower Costs to redirect investment into value-add opportunities

3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business


Infrastructure, Apps and now Data…

Simplify Infrastructure With Cloud

Simplify App Platform Through PaaS

Next Trend: Simplify Data

[Figure labels: Private, Public; Build, Run, Manage]


Trend 1/3: New Data Growing at 60% Y/Y

Source: The Information Explosion, 2009

[Figure labels: medical imaging, sensors, CAD/CAM, appliances, videoconferencing, digital movies, digital photos, digital TV, audio, camera phones, RFID, satellite images, games, scanners, Twitter]

Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030

Yes, you are part of the yotta generation…


Trend 2/3: Big Data – Driven by Real-World Benefit


Trend 3/3: Value from Data Exceeds Hardware Cost

!  Value from the intelligence of data analytics now outstrips the cost of hardware
•  Hadoop enables the use of lower-cost hardware
•  Hardware cost halving every 18 months

[Figure: Value vs. Cost; Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]


Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

!  Trend is “not just Hadoop” for big data
•  Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
•  Unify the infrastructure platform for all
!  Common Hardware Base
•  Eliminate the hardware/driver/testing phase
•  Use existing team for ordering, diagnosis, capacity management of hardware farm

[Figure: SQL Cluster, Hadoop Cluster, DSS Cluster, and NoSQL Cluster shown alongside a Unified Big Data Infrastructure (Private, Public) running Big SQL, Hadoop, and NoSQL]


Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

!  Instant Cluster Provisioning
•  Provision Hadoop clusters instantly
•  Automatable using provisioning engines/scripts, e.g. Whirr


Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

!  Increase Utilization
•  Hadoop cluster only uses resources it needs
•  Extra resources can be used by other applications when not in use
!  Eliminate single points of failure
•  Use vSphere HA for NameNode and JobTracker
!  Use VM Isolation
•  Create separate clusters with defensible security
•  Enables multiple versions of Hadoop on the same infrastructure
•  Extends to Hadoop and Linux environments
!  Leverage Resource Management
•  Control/assign resources through resource pools
•  E.g. use spare cycles for Hadoop processing through priority control


What? Hadoop in a VM? Really?

Actually, Hadoop performs well in a virtual machine


Performance Test: Cluster Configuration

AMAX ClusterMax: 2X X5650, 96 GB, 12X SATA 500 GB, Mellanox 10 GbE adapter

Mellanox 10 GbE switch


Cluster Configuration

!  Hardware
•  AMAX ClusterMax, 7 nodes
•  2X X5650 2.67 GHz hex-core, 96 GB memory
•  12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
•  Mellanox ConnectX VPI (MT26418), 10 GbE
•  Mellanox Vantage 6048, 10 GbE
!  OS/Hypervisor
•  RHEL 6.1 x86_64 (native and guest)
•  ESX 5.0 RTM with devel Mellanox driver
!  VMs (HT off/on)
•  1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
•  2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
•  4 VMs (HT on only):
   •  2 small: 18400 MB, 5 vCPUs, 2 disks
   •  2 large: 27600 MB, 7 vCPUs, 3 disks
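
The VM layouts above can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch, assuming each layout describes the VMs on a single host and that the memory, vCPU, and disk figures are per VM; the `LAYOUTS` table and `totals` helper are illustrative names, not part of the original test setup.

```python
# Per-host VM layouts from the slide (HT-on vCPU counts).
# Assumption: every figure is per VM, and each layout fills one host.
LAYOUTS = {
    "1 VM": [(92000, 24, 10)],
    "2 VMs": [(46000, 12, 5)] * 2,
    "4 VMs": [(18400, 5, 2)] * 2 + [(27600, 7, 3)] * 2,
}


def totals(vms):
    """Sum (memory MB, vCPUs, disks) over the VMs of one layout."""
    return tuple(sum(column) for column in zip(*vms))


for name, vms in LAYOUTS.items():
    mem_mb, vcpus, disks = totals(vms)
    print(f"{name}: {mem_mb} MB, {vcpus} vCPUs, {disks} disks")
```

Under these assumptions every layout totals 92000 MB, 24 vCPUs, and 10 data disks per host, so the configurations differ only in how the same resources are partitioned.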


Hadoop Configuration

!  Distribution
•  Cloudera CDH3u0
•  Based on Apache open-source 0.20.2
!  Parameters
•  dfs.datanode.max.xcievers=4096
•  dfs.replication=2
•  dfs.block.size=134217728
•  io.file.buffer.size=131072
•  mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
•  mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
!  Network topology
•  Hadoop uses topology information for reliability and performance
•  Multiple VMs/host: each host is a “rack”
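
The parameters listed above can be rendered into Hadoop's XML configuration format. The following is a minimal Python sketch that does so. The grouping into `core-site.xml`, `hdfs-site.xml`, and `mapred-site.xml` follows the usual convention for Hadoop 0.20-era releases and is an assumption here, since the slide does not say which file each property was set in; `SETTINGS` and `render` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Parameters from the slide, grouped by the file that conventionally holds them
# (the grouping is an assumption; the slide only lists the properties).
SETTINGS = {
    "core-site.xml": {"io.file.buffer.size": "131072"},
    "hdfs-site.xml": {
        "dfs.datanode.max.xcievers": "4096",
        "dfs.replication": "2",
        "dfs.block.size": "134217728",  # 128 MB
    },
    "mapred-site.xml": {"mapred.child.java.opts": "-Xmx1900m -Xmn512m"},  # virtual case
}


def render(properties):
    """Build a Hadoop-style <configuration> document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, properties in SETTINGS.items():
    print(f"<!-- {filename} -->")
    print(render(properties))
```

For the native runs, the only difference would be the larger task heap, `-Xmx2048m`.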


Benchmarks

!  Derived from test apps included in distro
!  Pi
•  Direct-exec Monte-Carlo estimation of pi
•  # map tasks = # logical processors
•  1.68 T samples
!  TestDFSIO
•  Streaming write and read
•  1 TB
•  More tasks than processors
!  Terasort
•  3 phases: teragen, terasort, teravalidate
•  10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
•  More tasks than processors
•  CPU, networking, and storage I/O

π ~ 4*R/(R+G) = 22/7
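
The estimator behind the Pi benchmark can be illustrated in a few lines. This is a minimal single-process Python sketch, assuming R counts sample points that fall inside a unit quarter circle and G counts those that fall outside it (the slide does not define the symbols). It uses Python's pseudo-random generator and is not the Hadoop example's implementation; `estimate_pi` is an illustrative name.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi as 4*R/(R+G): R = points inside the unit quarter circle, G = points outside."""
    rng = random.Random(seed)
    inside = 0  # R
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside  # G
    return 4 * inside / (inside + outside)


print(estimate_pi(1_000_000))
```

With only a handful of points the estimate is coarse: for instance, R = 11 and G = 3 give 4 × 11 / 14 = 22/7. The benchmark's 1.68 T samples are what make the workload CPU-bound and easy to split across map tasks.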


Performance of Hadoop for Several Workloads

[Figure: bar chart of elapsed time as a ratio to native for 1 VM and 2 VMs across workloads; y-axis "Ratio to Native" from 0 to 1.2; lower is better]


Architecting Hadoop as a Service using Virtualization

!  Goals
•  Make it fast and easy to provision new Hadoop clusters on demand
•  Leverage virtual machines to provide isolation (esp. for multi-tenant)
•  Optimize Hadoop’s performance based on virtual topologies
•  Make the system reliable based on virtual topologies
!  Leveraging Virtualization
•  Elastic scale in/out
•  Use high availability to protect NameNode/JobTracker
•  Resource controls and sharing: re-use underutilized memory, CPU
•  Prioritize workloads: limit or guarantee resource usage in a mixed environment


Provisioning

!  Leverage the vSphere APIs to auto-deploy a cluster
•  Whirr, HOD, or custom using Ruby, Chef, etc.
!  Use linked clones to rapidly fork many nodes


Fast Provisioning

!  From a “seed” node to a cluster

Thin Provisioning: 60 GB => 3.5 GB

Linked Clone: ~6 seconds


SAN, NAS or Local Disk?

!  Shared Storage: SAN or NAS
•  Easy to provision
•  Automated cluster rebalancing
!  Hybrid Storage
•  SAN for boot images, VMs, other workloads
•  Local disk for HDFS
•  Scalable bandwidth, lower cost/GB

[Figure: six hosts, each running one or two Hadoop VMs alongside other VMs]


Enable Automatic Rack Awareness through vSphere

!  Important for a robust Hadoop cluster

!  Automatic network topology detection: an important vSphere feature

!  Rack script is generated automatically
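
A generated rack script could look roughly like the following Python sketch. Hadoop releases of this era read the script path from the `topology.script.file.name` property, pass one or more node addresses as arguments, and expect one rack path per argument on standard output. The `VM_TO_HOST` table, its addresses, and the host names are hypothetical. The sketch treats each physical host as its own rack, as in the test configuration described earlier.

```python
import sys

# Hypothetical VM-address -> physical-host table; a real script would be
# generated from the virtualization layer's inventory.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Map a VM address to a rack path, treating each physical host as its own rack."""
    host = VM_TO_HOST.get(address)
    return f"/{host}" if host else DEFAULT_RACK


def main(argv):
    # Hadoop passes one or more addresses and reads one rack path per argument.
    print(" ".join(rack_for(arg) for arg in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Two VMs on the same physical host resolve to the same rack path, so HDFS avoids placing all replicas of a block on VMs that would fail together.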


Multi-tenant: share cluster or not

!  Shared big cluster vs. isolated small clusters
•  Shared big cluster: high performance, large scale, pre-job provisioning
•  Isolated small clusters: secure, flexible, post-job provisioning
!  Combination, as customers’ requirements are different


Elastic Hadoop Cluster

!  Traditional Hadoop cluster
•  Easy to scale out
•  Fast-provision new Hadoop nodes and join them to the existing cluster
•  Hard to scale in
!  Elastic Hadoop cluster

while (ClusterIsTooLarge) {
    choose node k;
    kill(node k);
    wait(node k's data blocks are recovered);
    if necessary, hadoop.rebalance();
}
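
The scale-in loop above can be simulated on a toy cluster. The following is a minimal Python sketch that models the cluster as a mapping from node name to the set of blocks it holds. It assumes a replication factor of 2 (the `dfs.replication` value used in the benchmark configuration), removes the least-loaded node first, and recovers replicas onto the least-loaded survivors. Those policy choices and all names are illustrative assumptions, not Hadoop's actual behavior. The optional rebalance step is omitted.

```python
REPLICATION = 2  # assumed; matches dfs.replication=2 from the benchmark setup


def remove_node(cluster, victim):
    """Remove `victim` and re-replicate its blocks onto the least-loaded surviving nodes."""
    lost_blocks = cluster.pop(victim)
    for block in sorted(lost_blocks):
        holders = [n for n, blocks in cluster.items() if block in blocks]
        candidates = [n for n in cluster if n not in holders]
        while len(holders) < REPLICATION and candidates:
            target = min(candidates, key=lambda n: len(cluster[n]))
            cluster[target].add(block)
            holders.append(target)
            candidates.remove(target)


def scale_in(cluster, target_size):
    """Shrink the cluster one node at a time, recovering replicas after each removal."""
    while len(cluster) > target_size:
        victim = min(cluster, key=lambda n: len(cluster[n]))  # choose node k
        remove_node(cluster, victim)  # kill node k, then wait for block recovery


cluster = {
    "node1": {"b1", "b2"},
    "node2": {"b1", "b3"},
    "node3": {"b2", "b4"},
    "node4": {"b3", "b4"},
}
scale_in(cluster, target_size=3)
print({node: sorted(blocks) for node, blocks in cluster.items()})
```

Removing one node at a time and waiting for recovery is what keeps every block at full replication throughout. It is also why scale-in is slow compared with scale-out.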

[Figure: NN and JT with elastic nodes and normal nodes; legend: TaskTracker, DataNode]


Replica Placement

!  Second Replica
•  Different rack
•  Rack-awareness required
!  Third Replica
•  Same rack, different physical host
•  Nodes share host (in virtualized environment)
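
The placement rules above can be sketched as a small selection function. This minimal Python example assumes the first replica stays on the writing node and that "same rack" for the third replica means the rack of the second replica, as in HDFS's default policy; the slide does not spell out either point. The `Node` type, `place_replicas`, and the sample topology are hypothetical, and the function raises `StopIteration` if no suitable node exists.

```python
from collections import namedtuple

# host = physical machine running the VM (hypothetical extra topology layer)
Node = namedtuple("Node", "name rack host")


def place_replicas(writer, nodes):
    """Pick three nodes: the writer, one on a different rack, then one on the
    second replica's rack but on a different physical host."""
    first = writer
    second = next(n for n in nodes if n.rack != first.rack)
    third = next(
        n for n in nodes
        if n.rack == second.rack and n.host != second.host
    )
    return [first, second, third]


nodes = [
    Node("vm1", "/rack1", "hostA"),
    Node("vm2", "/rack1", "hostA"),
    Node("vm3", "/rack2", "hostB"),
    Node("vm4", "/rack2", "hostB"),
    Node("vm5", "/rack2", "hostC"),
]
print([n.name for n in place_replicas(nodes[0], nodes)])  # ['vm1', 'vm3', 'vm5']
```

Without the host check, the third replica could land on `vm4`, which shares `hostB` with `vm3`. A single physical host failure would then take out two of the three replicas.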


Demo


Performance

!  Create more, smaller VMs
•  Makes Hadoop scale better
•  Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
!  Sizing/configuration of storage is critical
•  Plan on ~50 MBytes/sec of bandwidth per core
•  SANs are typically configured by default for IOPS, not bandwidth
•  Ensure SAN ports/switch topology allows required aggregate bandwidth
•  Performance of the backend storage should be tested/sized
•  Local disks will give ~100-140 MBytes/sec per disk: pick correct controller
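
The sizing guidance above turns into simple arithmetic. The following is a minimal Python sketch using the slide's planning figures (~50 MB/s per core, ~100-140 MB/s per local disk); the function names and the 12-core example host are illustrative assumptions.

```python
import math

MB_PER_SEC_PER_CORE = 50            # planning figure from the slide
DISK_MB_PER_SEC_RANGE = (100, 140)  # per-disk local throughput range from the slide


def required_bandwidth(cores):
    """Aggregate storage bandwidth (MB/s) to plan for a host with `cores` cores."""
    return cores * MB_PER_SEC_PER_CORE


def disks_needed(cores):
    """(slow-disk case, fast-disk case) count of local disks needed to feed `cores` cores."""
    need = required_bandwidth(cores)
    slow, fast = DISK_MB_PER_SEC_RANGE
    return math.ceil(need / slow), math.ceil(need / fast)


cores = 12  # e.g. two hex-core sockets
print(required_bandwidth(cores), disks_needed(cores))  # 600 (6, 5)
```

A 12-core host therefore calls for about 600 MB/s of storage bandwidth, or roughly 5 to 6 local disks. The same figure can be used to check whether a SAN's ports and switch topology provide enough aggregate bandwidth per host.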


Summary

!  Hadoop does work well in a virtual environment
!  Plan a virtual cluster, enable other big-data solutions on the same infrastructure
!  Leverage the recipes to automate your configuration and deployment
