© 2009 VMware Inc. All rights reserved Hadoop as a Service Hadoop on Virtualization Hadoop World, December 2011 Jun Ping Du Richard McDougall VMware, Inc.

Hadoop on VMware


DESCRIPTION

Hadoop on VMware, presented at Hadoop World 2011


Page 1: Hadoop on VMware


Page 2: Hadoop on VMware

Cloud: Big Shifts in Simplification and Optimization

1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business

Page 3: Hadoop on VMware

Infrastructure, Apps and now Data…

• Simplify Infrastructure With Cloud
• Simplify App Platform Through PaaS
• Next Trend: Simplify Data

[Diagram labels: Private, Public; Build, Run, Manage]

Page 4: Hadoop on VMware

Trend 1/3: New Data Growing at 60% Y/Y

Source: The Information Explosion, 2009

[Chart: exabytes of information stored, by source: medical imaging, sensors; CAD/CAM, appliances, videoconferencing, digital movies; digital photos; digital TV; audio; camera phones, RFID; satellite images, games, scanners, Twitter]

20 Zetta by 2015. 1 Yotta by 2030. Yes, you are part of the yotta generation…

Page 5: Hadoop on VMware


Trend 2/3: Big Data – Driven by Real-World Benefit

Page 6: Hadoop on VMware

Trend 3/3: Value from Data Exceeds Hardware Cost

• Value from the intelligence of data analytics now outstrips the cost of hardware
  - Hadoop enables the use of lower-cost hardware
  - Hardware cost halving every 18 months

[Chart labels: Value, Cost; Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]

Page 7: Hadoop on VMware

Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

• Trend is "not just Hadoop" for big data
  - Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
  - Unify the infrastructure platform for all
• Common Hardware Base
  - Eliminate the hardware/driver/testing phase
  - Use existing team for ordering, diagnosis, capacity management of hardware farm

[Diagram labels: SQL Cluster, Hadoop Cluster, DSS Cluster, NoSQL Cluster; Unified Big Data Infrastructure (Private, Public): Big SQL, Hadoop, NoSQL]

Page 8: Hadoop on VMware

Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

• Instant Cluster Provisioning
  - Provision Hadoop clusters instantly
  - Automatable using provisioning engines/scripts, e.g. Whirr

Page 9: Hadoop on VMware

Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

• Increase Utilization
  - Hadoop cluster only uses resources it needs
  - Extra resources can be used by other applications when not in use
• Eliminate single points of failure
  - Use vSphere HA for NameNode and JobTracker
• Use VM Isolation
  - Create separate clusters with defensible security
  - Enables multiple versions of Hadoop on the same infrastructure
  - Extends to Hadoop and Linux environments
• Leverage Resource Management
  - Control/assign resources through resource pools
  - E.g. use spare cycles for Hadoop processing through priority control

Page 10: Hadoop on VMware


What? Hadoop in a VM? Really?

Actually, Hadoop performs well in a virtual machine

Page 11: Hadoop on VMware

Performance Test: Cluster Configuration

[Diagram labels: AMAX ClusterMax nodes (2X X5650, 96 GB, 12X SATA 500 GB, Mellanox 10 GbE adapter); Mellanox 10 GbE switch]

Page 12: Hadoop on VMware

Cluster Configuration

• Hardware
  - AMAX ClusterMax, 7 nodes
  - 2X X5650 2.67 GHz hex-core, 96 GB memory
  - 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
  - Mellanox ConnectX VPI (MT26418), 10 GbE
  - Mellanox Vantage 6048, 10 GbE
• OS/Hypervisor
  - RHEL 6.1 x86_64 (native and guest)
  - ESX 5.0 RTM with devel Mellanox driver
• VMs (HT off/on)
  - 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
  - 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
  - 4 VMs (HT on only):
      2 small: 18400 MB, 5 vCPUs, 2 disks
      2 large: 27600 MB, 7 vCPUs, 3 disks
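The three VM layouts appear to carve up the same host resources in different ways. The following Python sketch checks that arithmetic. It assumes the memory, vCPU, and disk figures on the slide are per VM and uses the hyper-threading-on vCPU counts; the names `LAYOUTS` and `host_totals` are illustrative only.

```python
# Per-VM figures from the slide: (vm_count, memory_mb, vcpus, disks).
# Assumption: figures are per VM, and vCPU counts are the HT-on values.
LAYOUTS = {
    "1 VM": [(1, 92000, 24, 10)],
    "2 VMs": [(2, 46000, 12, 5)],
    "4 VMs": [(2, 18400, 5, 2), (2, 27600, 7, 3)],
}


def host_totals(layout):
    """Sum memory (MB), vCPUs, and disks over every VM in one layout."""
    memory = sum(count * mem for count, mem, _, _ in layout)
    vcpus = sum(count * cpu for count, _, cpu, _ in layout)
    disks = sum(count * dsk for count, _, _, dsk in layout)
    return memory, vcpus, disks


if __name__ == "__main__":
    for name, layout in LAYOUTS.items():
        print(name, host_totals(layout))
```

Under those assumptions every layout adds up to the same per-host totals, so the comparison between layouts varies only how the host is partitioned.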

Page 13: Hadoop on VMware

Hadoop Configuration

• Distribution
  - Cloudera CDH3u0
  - Based on Apache open-source 0.20.2
• Parameters
  - dfs.datanode.max.xcievers=4096
  - dfs.replication=2
  - dfs.block.size=134217728
  - io.file.buffer.size=131072
  - mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
  - mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
• Network topology
  - Hadoop uses topology information for reliability and performance
  - Multiple VMs/host: each host is a "rack"
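As an illustration of how the parameters above might be written out, here is a minimal Python sketch that renders them as Hadoop-style `<configuration>` XML. The split across `hdfs-site.xml`, `core-site.xml`, and `mapred-site.xml` follows the usual Hadoop 0.20 layout and is an assumption, not something the slides state; `SITE_FILES` and `render_site_xml` are hypothetical names. The virtual-machine heap setting is used for `mapred.child.java.opts`.

```python
import xml.etree.ElementTree as ET

# Values copied from the slide. The grouping into files is an assumption
# based on the conventional Hadoop 0.20 configuration layout.
SITE_FILES = {
    "hdfs-site.xml": {
        "dfs.datanode.max.xcievers": "4096",
        "dfs.replication": "2",
        "dfs.block.size": "134217728",  # 128 MiB
    },
    "core-site.xml": {
        "io.file.buffer.size": "131072",  # 128 KiB
    },
    "mapred-site.xml": {
        "mapred.child.java.opts": "-Xmx1900m -Xmn512m",  # virtual setting
    },
}


def render_site_xml(properties):
    """Return a <configuration> document with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in SITE_FILES.items():
        print(filename)
        print(render_site_xml(props))
```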

Page 14: Hadoop on VMware

Benchmarks

• Derived from test apps included in distro
• Pi
  - Direct-exec Monte-Carlo estimation of pi
  - # map tasks = # logical processors
  - 1.68 T samples
  - π ≈ 4*R/(R+G) ≈ 22/7
• TestDFSIO
  - Streaming write and read
  - 1 TB
  - More tasks than processors
• Terasort
  - 3 phases: teragen, terasort, teravalidate
  - 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
  - More tasks than processors
  - CPU, networking, and storage I/O
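The Pi benchmark's estimate can be illustrated with a small single-process Python sketch. It assumes R counts sample points that fall inside a quarter circle and G counts those that fall outside, which the slide does not spell out, and it uses plain pseudo-random sampling rather than whatever point sequence the Hadoop example generates.

```python
import random


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi from points in the unit square.

    Assumption: R = points inside the quarter circle, G = points outside,
    so pi is approximated by 4 * R / (R + G).
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside
    return 4.0 * inside / (inside + outside)


if __name__ == "__main__":
    print(estimate_pi(1_000_000))
```

Because each sample is independent, the work splits cleanly across map tasks, which is why the benchmark mostly exercises CPU rather than storage or network.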

Page 15: Hadoop on VMware

Performance of Hadoop for Several Workloads

[Chart: ratio to native (y-axis from 0 to 1.2) for 1 VM and 2 VMs. Ratio of time taken; lower is better]

Page 16: Hadoop on VMware

Architecting Hadoop as a Service using Virtualization

• Goals
  - Make it fast and easy to provision new Hadoop clusters on demand
  - Leverage virtual machines to provide isolation (esp. for multi-tenant)
  - Optimize Hadoop's performance based on virtual topologies
  - Make the system reliable based on virtual topologies
• Leveraging Virtualization
  - Elastic scale in/out
  - Use high availability to protect NameNode/JobTracker
  - Resource controls and sharing: re-use underutilized memory, CPU
  - Prioritize workloads: limit or guarantee resource usage in a mixed environment

Page 17: Hadoop on VMware

Provisioning

• Leverage the vSphere APIs to auto-deploy a cluster
  - Whirr, HOD, or custom using Ruby, Chef, etc.
• Use linked clones to rapidly fork many nodes

Page 18: Hadoop on VMware

Fast Provisioning

• From a "seed" node to a cluster

[Diagram labels: Thin Provisioning, Linked Clone; 60 GB => 3.5 GB; ~6 seconds]

Page 19: Hadoop on VMware

SAN, NAS or Local Disk?

• Shared Storage: SAN or NAS
  - Easy to provision
  - Automated cluster rebalancing
• Hybrid Storage
  - SAN for boot images, VMs, other workloads
  - Local disk for HDFS
  - Scalable bandwidth, lower cost/GB

[Diagram: six hosts, each running a mix of Hadoop VMs and other VMs]

Page 20: Hadoop on VMware

Enable Automatic Rack Awareness through vSphere

• Important for a robust Hadoop cluster
• Automatic network topology detection, an important vSphere feature
• Rack script is generated automatically
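To make the rack script concrete: Hadoop calls the script named by `topology.script.file.name` with one or more node addresses as arguments and reads one rack path per address from standard output. The Python sketch below shows what a generated script might look like under the earlier "each host is a rack" convention. The `VM_TO_HOST` table, its addresses, and the host names are hypothetical; in the scheme described here that table would be produced from vSphere inventory rather than written by hand.

```python
import sys

# Hypothetical VM-address-to-physical-host map. In the approach on the
# slide this would be generated automatically from vSphere.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}

# Hadoop's conventional fallback rack for unknown nodes.
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Map a VM address to a rack path, treating each host as a rack."""
    host = VM_TO_HOST.get(address)
    return "/" + host if host else DEFAULT_RACK


def main(argv):
    # Emit one rack path per input address, in order, separated by spaces.
    print(" ".join(rack_for(address) for address in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this mapping, two VMs on the same physical host report the same rack, so HDFS does not count them as independent failure domains.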

Page 21: Hadoop on VMware

Multi-tenant: share cluster or not

• Shared big cluster vs. isolated small clusters
  - High performance, large scale
  - Pre-job provisioning
  - Secure, flexible
  - Post-job provisioning

Combination, as customers' requirements differ

Page 22: Hadoop on VMware

Elastic Hadoop Cluster

• Traditional Hadoop cluster
  - Easy to scale out
  - Fast-provision new Hadoop nodes and join into existing cluster
  - Hard to scale in
• Elastic Hadoop cluster

while (cluster is too large) {
    choose node k;
    kill(node k);
    wait until node k's data blocks are re-replicated;
    if necessary, hadoop.rebalance();
}
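The scale-in loop can be simulated in a few lines of Python. This is a toy model under stated assumptions, not the presenters' implementation: blocks are tracked in a plain dictionary, "recovery" is modelled as immediately copying each under-replicated block to a random surviving node, and the rebalance step is omitted. The function name `scale_in` and its parameters are hypothetical.

```python
import random


def scale_in(block_replicas, nodes, target_size, replication=2, seed=0):
    """Shrink a simulated cluster one node at a time.

    block_replicas: dict mapping block id -> set of node names holding it.
    nodes: set of live node names. Both arguments are modified in place.
    """
    rng = random.Random(seed)
    while len(nodes) > target_size:           # cluster is too large
        victim = rng.choice(sorted(nodes))    # choose node k
        nodes.remove(victim)                  # kill node k
        for replicas in block_replicas.values():
            replicas.discard(victim)
            # Model "wait until k's blocks are re-replicated": restore the
            # replication factor before the next node is removed.
            while len(replicas) < min(replication, len(nodes)):
                candidates = sorted(nodes - replicas)
                replicas.add(rng.choice(candidates))
    return nodes


if __name__ == "__main__":
    live = {"n1", "n2", "n3", "n4", "n5"}
    blocks = {b: set(random.Random(b).sample(sorted(live), 2)) for b in range(8)}
    scale_in(blocks, live, target_size=3)
    print(sorted(live), blocks)
```

Removing a single node per iteration and waiting for recovery is what keeps data safe: with two replicas per block, killing two nodes at once could destroy both copies of a block.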

[Diagram labels: NN, JT; Elastic node (TaskTracker); Normal node (TaskTracker, DataNode)]

Page 23: Hadoop on VMware

Replica Placement

• Second Replica
  - Different rack
  - Rack-awareness required
• Third Replica
  - Same rack, different physical host
  - Nodes share host (in virtualized environment)
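The placement rules above can be sketched as a small Python function. It is illustrative only and not HDFS's actual placement code: it assumes the first replica stays on the writer's node (the usual HDFS default, not shown on the slide), picks the first matching candidate rather than a random one, and uses hypothetical names (`Node`, `place_replicas`).

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    host: str  # physical host running this VM


def place_replicas(writer, nodes):
    """Pick three replica targets following the rules on the slide."""
    first = writer  # assumption: first replica on the writer's node
    second = next((n for n in nodes if n.rack != first.rack), None)
    if second is None:
        raise ValueError("no node on a different rack")
    # Same rack as the second replica, but on a different physical host,
    # so two replicas never share one piece of hardware.
    third = next(
        (n for n in nodes if n.rack == second.rack and n.host != second.host),
        None,
    )
    if third is None:
        raise ValueError("no node on the same rack with a different host")
    return [first, second, third]


if __name__ == "__main__":
    cluster = [
        Node("vm1", "rackA", "h1"), Node("vm2", "rackA", "h1"),
        Node("vm3", "rackB", "h2"), Node("vm4", "rackB", "h2"),
        Node("vm5", "rackB", "h3"),
    ]
    print([n.name for n in place_replicas(cluster[0], cluster)])
```

The host check is the virtualization-specific part: without it, the second and third replicas could land on two VMs that share one physical host.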

Page 24: Hadoop on VMware


Demo

Page 25: Hadoop on VMware

Performance

• Create more, smaller VMs
  - Makes Hadoop scale better
  - Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
• Sizing/configuration of storage is critical
  - Plan on ~50 MB/s of bandwidth per core
  - SANs are typically configured by default for IOPS, not bandwidth
  - Ensure SAN ports/switch topology allows required aggregate bandwidth
  - Performance of the backend storage should be tested/sized
  - Local disks will give ~100-140 MB/s per disk: pick correct controller
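The sizing rules of thumb above lend themselves to a quick calculation. The Python sketch below applies them to a 12-core host like the ones in the test cluster; the function names are hypothetical, and the per-core and per-disk rates are simply the planning figures quoted on the slide, not measurements.

```python
import math


def required_bandwidth_mb_s(cores, per_core_mb_s=50):
    """Aggregate storage bandwidth to plan for, at ~50 MB/s per core."""
    return cores * per_core_mb_s


def disks_needed(cores, per_disk_mb_s, per_core_mb_s=50):
    """Local disks needed to supply that bandwidth at a given disk rate."""
    return math.ceil(required_bandwidth_mb_s(cores, per_core_mb_s) / per_disk_mb_s)


if __name__ == "__main__":
    cores = 12  # two hex-core sockets, as in the test hosts
    print(required_bandwidth_mb_s(cores), "MB/s per host")
    print(disks_needed(cores, 100), "disks at 100 MB/s each")
    print(disks_needed(cores, 140), "disks at 140 MB/s each")
```

Multiplying the per-host figure by the number of hosts gives the aggregate bandwidth a SAN and its switch topology would need to sustain, which is the check the slide recommends before choosing shared storage for HDFS.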

Page 26: Hadoop on VMware

Summary

• Hadoop does work well in a virtual environment
• Plan a virtual cluster, enable other big-data solutions on the same infrastructure
• Leverage the recipes to automate your configuration and deployment