
FEDERATION WHITE PAPER

VIRTUALIZING HADOOP IN LARGE-SCALE INFRASTRUCTURES How Adobe Systems achieved breakthrough results in Big Data analytics with Hadoop-as-a-Service

ABSTRACT

Large-scale Apache Hadoop analytics have long eluded the industry, especially in virtualized environments. In a ground-breaking proof of concept (POC), Adobe Systems demonstrated that Hadoop-as-a-Service (HDaaS), running on a virtualized and centralized infrastructure, can handle large-scale data analytics workloads. This white paper documents the POC's infrastructure design, initial obstacles, and successful completion, along with sizing and configuration details and best practices. Importantly, the paper also underscores how HDaaS built on an integrated and virtualized infrastructure delivers outstanding performance, scalability, and efficiency, paving the path toward larger-scale Big Data analytics in Hadoop environments.

December 2014


To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store.

Copyright © 2014 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.

VMware and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other trademarks used herein are the property of their respective owners.

Part Number H13856


TABLE OF CONTENTS

EXECUTIVE SUMMARY

INTRODUCTION

BOLD ARCHITECTURE FOR HDAAS

NAVIGATING TOWARD LARGE-SCALE HDAAS
    A Few Surprises
    Diving in Deeper
        Revisiting Memory Settings
        Modifying Settings Properly with BDE
        Bigger is Not Always Better
        Storage Sizing Proved Successful

BREAKTHROUGH IN HADOOP ANALYTICS
    Impressive Performance Results
    Breaking with Tradition Adds Efficiency
    Stronger Data Protection
    Freeing the Infrastructure

BEST PRACTICE RECOMMENDATIONS
    Memory Settings Are Key
    Understand Sizing and Configuration
    Acquire or Develop Hadoop Expertise

NEXT STEPS: LIVE WITH HDAAS


EXECUTIVE SUMMARY

Apache Hadoop has become a prime tool for analyzing Big Data and achieving greater insights that help organizations improve strategic decision making.

Traditional Hadoop clusters have proved inefficient for handling large-scale analytics jobs sized at hundreds of terabytes or even petabytes. Adobe's Digital Marketing organization, which runs analytics jobs at this scale, was encountering increased internal demand to use Hadoop for analysis of the company's existing eight-petabyte data repository.

To address this need, Adobe explored an innovative approach to Hadoop. Rather than running traditional Hadoop clusters on commodity servers with locally attached storage, Adobe virtualized the Hadoop computing environment and used its existing EMC® Isilon® storage, where the eight-petabyte data repository resides, as a central location for Hadoop data.

Adobe enlisted the resources, technologies, and expertise of EMC, VMware, and Cisco to build a reference architecture for virtualized Hadoop-as-a-Service (HDaaS) and perform a comprehensive proof of concept. While the five-month POC encountered some challenges, the project also yielded a wealth of insights and understanding relating to how Hadoop operates and its infrastructure requirements.

After meticulous configuring, refining, and testing, Adobe successfully ran a 65-terabyte Hadoop job, one of the industry's largest to date in a virtualized environment. This white paper details the process that Adobe and the POC team followed to reach this milestone.

The paper includes specific configurations of the virtual HDaaS environment used in the POC. It also covers the initial obstacles and how the POC team overcame them, and documents how the team adjusted settings, sized systems, and reconfigured the environment to support large-scale Hadoop analytics in a virtual environment with centralized storage.

Most importantly, the paper presents the POC's results, along with valuable best practices for other organizations interested in pursuing similar projects. The last section describes Adobe's plans to bring virtual HDaaS to production for its business users and data scientists.


INTRODUCTION

Organizations across the world increasingly view Big Data as a prime source of competitive differentiation, and analytics as the means to tap this source. Specifically, Hadoop enables data scientists to perform sophisticated queries against massive volumes of data to gain insights, discover trends, and predict outcomes. In fact, a GE and Accenture study reported that 84 percent of survey respondents believe that using Big Data analytics "has the power to shift the competitive landscape for my industry" in the next year.¹

Apache Hadoop, an increasingly popular environment for running analytics jobs, is an open source framework for storing and processing large data sets. Traditionally running on clusters of commodity servers with local storage, Hadoop comprises multiple components, primarily the Hadoop Distributed File System (HDFS) for data storage, Yet Another Resource Negotiator (YARN) for managing system resources like memory and CPUs, and MapReduce for processing massive jobs by splitting input data into small subtasks and collating the results.

At Adobe, a global leader in digital marketing and digital media solutions, the Technical Operations team uses traditional Hadoop clusters to deliver Hadoop as a Service (HDaaS) in a private cloud for several application teams. These teams run Big Data jobs such as log and statistical analysis of application layers to uncover trends that help guide product enhancements.

Elsewhere, Adobe's Digital Marketing organization tracks and analyzes customers' website statistics, which are stored in an eight-petabyte data repository on EMC Isilon storage. Adobe Digital Marketing would like to use HDaaS for more in-depth analysis that would help its clients improve website effectiveness, correlate site visits to revenue, and guide strategic business decisions. Rather than moving data from the large data repository to the Hadoop clusters, a time-consuming task, Technical Operations determined it would be most efficient to simply use Hadoop to access data sets on the existing Isilon-based data repository.

Adobe has a goal of running analytics jobs against data sets that are hundreds of terabytes in size. Simply adding commodity servers to Hadoop clusters would become highly inefficient, especially since traditional Hadoop clusters require three copies of the data to ensure availability. Adobe also was concerned that current Hadoop versions lack high availability features. For example, Hadoop has only two NameNodes, which track where data resides in Hadoop environments. If both NameNodes fail, the entire Hadoop cluster would collapse.

Technical Operations proposed separating the Hadoop elements and placing them where they can scale more efficiently and reliably. This meant using Isilon, where Adobe’s file-based data repository is stored, for centralized Hadoop storage and virtualizing the Hadoop cluster nodes to enable more flexible scalability and lower compute costs. (Figures 1 and 2)

Figure 1. Traditional Hadoop Architecture

1 "Industrial Internet Insights Report for 2015." GE, Accenture. 2014.


Figure 2. Virtual Hadoop Architecture with Isilon

Despite internal skepticism about a virtualized infrastructure handling Hadoop's complexity, Technical Operations recognized a compelling upside: improving efficiency and increasing scalability to a level that had not been achieved for single-job data sets in a virtualized Hadoop environment with Isilon. This is enticing, especially as data analytics jobs continue to grow in size across all environments.

"People think by that virtualizing Hadoop, you're going to take a performance hit. But we

showed that's not the case. Instead you get added flexibility that actually unencumbers

your infrastructure."

Chris Mutchler, Compute Platform Engineer, Adobe Systems

To explore the possibilities, Adobe Technical Operations embarked on a virtual HDaaS POC for Adobe Systems Digital Marketing. The infrastructure comprised EMC, VMware, and Cisco solutions and was designed to test the outer limits of Big Data analytics on Isilon and VMware using Hadoop.

Key objectives of the POC included:

• Building a virtualized HDaaS environment to deliver analytics through a self-service catalog to internal Adobe customers

• Decoupling storage from compute by using EMC Isilon to provide HDFS, ultimately enabling access to the entire data repository for analytics

• Understanding sizing and security requirements of the integrated EMC Isilon, EMC VNX, VMware, and Cisco UCS infrastructure to support larger-scale HDaaS

• Proving an attractive return on investment and total cost of ownership in virtualized HDaaS environments compared to physical in-house solutions or public cloud services such as Amazon Web Services

• Documenting key learnings and best practices

The results were impressive. While the POC uncovered some surprises, Adobe gained valuable knowledge for future HDaaS projects. Ultimately, Adobe ran some of the largest Hadoop data analytics jobs to date in a virtualized HDaaS environment, a groundbreaking achievement that heralds a new era of scale and efficiency for Big Data analytics.


BOLD ARCHITECTURE FOR HDAAS

The POC's physical topology is built on Cisco Unified Computing System (UCS), Cisco Nexus networking, EMC VNX® block storage, and EMC Isilon scale-out storage. (Figure 3)

Figure 3. HDaaS Hardware Topology

At the compute layer, Adobe was particularly interested in Cisco UCS for its firmware management and centralized configuration capabilities. Plus, UCS provides a converged compute and network environment when deployed with Nexus.

VNX provides block storage for VMware ESX hosts and virtual machines (VMs) that comprise the Hadoop cluster. Adobe's focus was learning the VNX sizing and performance requirements to support virtualized HDaaS.

An existing Isilon customer, Adobe especially liked Isilon's data lake concept, which enables access to one source of data through multiple protocols, such as NFS, FTP, Object, and HDFS. In the POC, data was loaded onto Isilon via NFS and accessed via HDFS by virtual machines in the Hadoop compute cluster. The goal was to prove that Isilon delivered sufficient performance to support large Hadoop workloads.
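To make the decoupling concrete, the following minimal sketch (with a hypothetical SmartConnect zone name and directory path, not values from the POC) shows how a compute-only Hadoop client treats Isilon as its HDFS namespace: data loaded over NFS becomes visible to MapReduce through the standard HDFS API, with no copy into a separate Hadoop storage tier.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the default file system at the Isilon SmartConnect zone
        // (placeholder hostname; OneFS serves HDFS on port 8020 by default).
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // List a directory that was populated over NFS and is now read via HDFS.
            for (FileStatus status : fs.listStatus(new Path("/data/weblogs"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```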

To deploy, run, and manage Hadoop on a common virtual infrastructure, Adobe relied on VMware Big Data Extensions (BDE), an essential software component of the overall environment. Adobe already used BDE in its private cloud HDaaS deployment and wanted to apply it to the new infrastructure.

BDE enabled Adobe to automate and simplify deployment of hundreds of virtualized Hadoop compute nodes that were tied directly to Isilon for HDFS. During testing, Adobe also used BDE to deploy, reclaim, and redeploy the Hadoop cluster more than 30 times to evaluate different cluster configurations. Without the automation and flexibility of BDE, Adobe would not have been able to conduct such a wide range and high volume of tests within such a short timeframe.

In this POC, Adobe used Pivotal HD as an enhanced Hadoop distribution framework but designed the infrastructure to run any Hadoop distribution.

The following tools assisted Adobe with monitoring, collecting and reporting on metrics generated by the POC:

• VNX Monitor and Reporting Suite (M&R)

• Isilon InsightIQ (IIQ)

• VMware vCenter Operations Manager (vCOps)

• Cisco UCS Director (UCSD)


NAVIGATING TOWARD LARGE-SCALE HDAAS

The POC spanned five months from hardware delivery through final testing. Adobe expected the infrastructure components to integrate well, provide a stable environment, and perform satisfactorily.

In fact, the POC team implemented the infrastructure in about one and a half weeks. Then it put Isilon to the test as the HDFS data store and evaluated how well Hadoop ran in a virtualized environment.

A FEW SURPRISES

Adobe ran its first Hadoop MapReduce job in the virtual HDaaS environment within three days of initial setup. Smaller data sets of 60 to 450 gigabytes performed well, but the team hit a wall beyond 450 gigabytes.

The team focused on the Hadoop job definition and configuration to determine whether the job was written correctly and was using memory efficiently. In researching the industry at large, Adobe learned that most enterprise Hadoop environments were testing data on a small scale. In fact, Adobe did not find another Hadoop POC or implementation that exceeded 10 terabytes for single-job data sets in a virtualized Hadoop environment with Isilon.

"When we talked to other people in the industry, we realized we were on the forefront of

scaling Hadoop at levels possibly never seen before."

Jason Farnsworth, Senior Storage Engineer, Adobe Systems

After four weeks of tweaking the Hadoop job definition and adjusting memory settings, the team successfully ran a six-terabyte job. Pushing beyond six terabytes, the team sought to run larger data sets upwards of 60 terabytes. The larger jobs again proved difficult to complete successfully.

DIVING IN DEEPER

The next phase involved Adobe Technical Operations enlisting help from storage services, compute platforms, research scientists, data center operations, and network engineering. Technical Operations also reached out to the POC's key partners: EMC (including Isilon and Pivotal), VMware, Cisco, and Trace3, an EMC value-added reseller and IT systems integrator.

The team, which included several Hadoop experts, dissected nearly every element of the HDaaS environment. This included Hadoop job definitions, memory settings, Java memory allocations, command line options, physical and virtual infrastructure configurations, and HDFS options.

"We had several excellent meetings with Hadoop experts from EMC and VMware. We

learned an enormous amount that helped us solve our initial problems and tweak the

infrastructure to scale the way we wanted."

Jason Farnsworth, Senior Storage Engineer, Adobe Systems

Revisiting Memory Settings

Close inspection revealed that Hadoop was not yet mature enough to run smoothly in virtualized environments. For example, some operations launched through VMware BDE did not function properly on Hadoop, requiring significant tweaking. Complicating matters, the team learned that Hadoop error messages did not clearly describe the problem or indicate its origin.

Most notably, the team discovered that Hadoop lacked sufficient intelligence to analyze memory requirements for large analytics jobs, which necessitated manually adjusting memory settings.

The POC team recommends the following memory settings as a good starting point for organizations to diagnose scaling and job-related issues when testing Hadoop in larger-scale environments (a brief configuration sketch follows the Mapred settings list below):

YARN Settings

• Amount of physical memory, in megabytes, that can be allocated for containers: yarn.nodemanager.resource.memory-mb. BDE calculates a base value for this setting according to how much RAM is allocated to the workers on deployment. Default value is 8192.

• Minimum container memory for YARN; the minimum allocation for every container request at the ResourceManager, in megabytes: yarn.scheduler.minimum-allocation-mb. Default value is 1024.

• Application Master memory, in megabytes: yarn.app.mapreduce.am.resource.mb. Default value is 1536.

• Java options for the Application Master (JVM heap size): yarn.app.mapreduce.am.command-opts. The value is passed as a Java option (e.g., -Xmx7000m). Default value is -Xmx1024m.


Mapred Settings

• Mapper memory, in megabytes: mapreduce.map.memory.mb. Default value is 1536.

• Reducer memory, in megabytes: mapreduce.reduce.memory.mb. Default value is 3072.

• Mapper Java options (JVM heap size), the heap size for child JVMs of maps: mapreduce.map.java.opts. The value is passed as a Java option (e.g., -Xmx2000m). Default value is -Xmx1024m.

• Reducer Java options (JVM heap size), the heap size for child JVMs of reduces: mapreduce.reduce.java.opts. The value is passed as a Java option (e.g., -Xmx4000m). Default value is -Xmx2560m.

• Maximum size of the split metainfo file: mapreduce.jobtracker.split.metainfo.maxsize. Default value is 10000000. The POC team set this to -1, which removes the size limit.
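As a reference for how the per-job settings above can be applied, the sketch below uses the standard Hadoop Java API; the values are illustrative placeholders rather than the POC's tuned numbers, and cluster-wide properties such as yarn.nodemanager.resource.memory-mb and yarn.scheduler.minimum-allocation-mb still belong in yarn-site.xml on the nodes (managed through BDE in this environment).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Application Master container size and JVM heap (heap kept below container size).
        conf.set("yarn.app.mapreduce.am.resource.mb", "8192");
        conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx7000m");

        // Mapper and reducer container sizes, in megabytes.
        conf.set("mapreduce.map.memory.mb", "2560");
        conf.set("mapreduce.reduce.memory.mb", "5120");

        // JVM heap for map and reduce child tasks; keep below the container sizes.
        conf.set("mapreduce.map.java.opts", "-Xmx2000m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx4000m");

        // Remove the split metainfo cap for very large input sets, as the POC team did.
        conf.set("mapreduce.jobtracker.split.metainfo.maxsize", "-1");

        Job job = Job.getInstance(conf, "memory-tuned-example");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```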

For guidance on baseline values to use in these memory settings, the POC team recommends the following documents:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html

http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

https://support.pivotal.io/hc/en-us/articles/201462036-Mapreduce-YARN-Memory-Parameters

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

http://hadoop.apache.org/docs/r2.5.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Modifying Settings Properly with BDE

Both the virtual and physical infrastructure required configuration adjustments. Since VMware BDE acts as a management service layer on top of Hadoop, the team relied on BDE to modify Hadoop settings to ensure they were properly applied to the virtual clusters and remained persistent. Changing the settings directly on the servers would not apply modifications consistently across all the virtual clusters. The team also kept in mind that stopping, restarting, or redeploying a cluster through BDE automatically resets all node settings to their default values.

Bigger is Not Always Better

The POC revealed that the configuration of physical servers (hosts) and virtual servers (Hadoop workers or guests) affected Hadoop performance and cost efficiency.

For example, a greater number of physical cores (CPUs) at lower clock speeds delivered better performance than fewer cores at higher clock speeds. At a higher cost, the same number of cores at higher clock speeds delivered even better performance.

In the virtual environment, a larger number of Hadoop workers with fewer virtual CPUs (vCPUs) each performed and scaled better than fewer workers with more vCPUs each.

The team also learned to keep all physical hosts in the VMware cluster configured identically, with no variations in host configurations. This way, VMware distributed resource scheduling would not be invoked to spend time and resources balancing the cluster, and resources instead would be immediately available to Hadoop. BDE also was especially valuable in ensuring that memory settings and the alignment between cores and VMs were consistent.

Storage Sizing Proved Successful

Both VNX and Isilon performed perfectly in the POC. The team sized VNX to hold both the VMware environment and the Hadoop intermediate space (temporary space used by Hadoop jobs such as MapReduce). Intermediate space also can be configured to be stored directly on the Isilon cluster, but this setting was not tested during the POC.

Technical Operations also tested various HDFS block sizes, resulting in performance optimizations. Depending on job and workload, the team found that block sizes of 64 megabytes to 1024 megabytes drove optimal throughput. The 12 Isilon X-Series nodes with two-terabyte drives provided more than enough capacity and performance for the tested workloads, and could easily scale to support Hadoop workloads hundreds of terabytes in size.
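One way to see why block size matters at this scale: MapReduce schedules roughly one map task per HDFS block (assuming the input format uses the block size as the split size, which is the default), so larger blocks mean far fewer tasks to schedule and track. The figures below are illustrative arithmetic, not POC measurements.

```java
public class SplitCount {
    public static void main(String[] args) {
        long inputBytes = 65L * 1024 * 1024 * 1024 * 1024; // a 65-terabyte input set
        long[] blockSizesMb = {64, 256, 1024};              // block sizes within the range tested
        for (long mb : blockSizesMb) {
            long blockBytes = mb * 1024 * 1024;
            long mapTasks = (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
            System.out.printf("%4d MB blocks -> ~%,d map tasks%n", mb, mapTasks);
        }
    }
}
```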

While the POC’s Isilon did not incorporate flash technology, the team noted that adding flash drives would provide a measurable performance increase.


BREAKTHROUGH IN HADOOP ANALYTICS

After eight weeks of fine-tuning the virtual HDaaS infrastructure, Adobe succeeded in running a 65-terabyte Hadoop workload, significantly larger than the largest known virtual Hadoop workloads. In addition, this was the largest workload ever tested by EMC in a virtual Hadoop environment on Isilon.

Fundamentally, these results proved that Isilon works as the HDFS layer. In fact, the POC refutes claims by some in the industry that shared storage will cause problems with Hadoop. To the contrary, Isilon had no adverse effects and even contributed superior results in a virtualized HDaaS environment compared to traditional Hadoop clusters. These advantages apply to many aspects of Hadoop, including performance, storage efficiency, data protection, and flexibility.

"Our results proved that having Isilon act as the HDFS layer was not adverse. In fact, we

got better results with Isilon than we would have in a traditional cluster."

Chris Mutchler, Compute Platform Engineer, Adobe Systems

IMPRESSIVE PERFORMANCE RESULTS

With compute resources allocated in small quantities to a large number of VMs, job run time improved significantly. (Figures 4 and 5) Furthermore, the test demonstrated that Isilon performed well without flash drives.

Figure 4. TeraSort Job Run Time by Worker Count

Figure 5. Adobe Pig Job Run Time by Worker Count

The team concluded that Hadoop performs better in a scale-out rather than a scale-up configuration. That is, jobs complete more quickly when run on a greater number of compute nodes, so having more cores is more important than having faster processors. In fact, performance improved as the number of workers increased.

Tests were run with the following cluster configurations:

• 256 workers, 1 vCPU, 7.25 GB RAM, 30 GB intermediate space

• 128 workers, 2 vCPU, 14.5 GB RAM, 90 GB intermediate space

• 64 workers, 4 vCPU, 29 GB RAM, 210 GB intermediate space

• 32 workers, 8 vCPU, 58 GB RAM, 450 GB intermediate space
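A quick arithmetic check (ours, not stated in the paper): each of the four configurations above presents the same aggregate compute, so the run-time differences in Figures 4 and 5 isolate the effect of worker granularity rather than total resources.

```latex
\begin{aligned}
\text{Total vCPUs:} &\quad 256 \times 1 = 128 \times 2 = 64 \times 4 = 32 \times 8 = 256 \\
\text{Total RAM:}   &\quad 256 \times 7.25 = 128 \times 14.5 = 64 \times 29 = 32 \times 58 = 1856\ \text{GB}
\end{aligned}
```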


BREAKING WITH TRADITION ADDS EFFICIENCY

Traditional Hadoop clusters require three copies of the data in case servers fail. Isilon eliminates the need to triple storage capacity thanks to the built-in data protection capabilities of the Isilon OneFS operating system.

For example, in a traditional Hadoop cluster running jobs against eight petabytes of data, the infrastructure would require 24 petabytes of raw disk capacity (a 200 percent overhead) to accommodate three copies. The same eight petabytes of Hadoop data stored on Isilon requires only 9.6 petabytes of raw disk capacity, a 60 percent reduction. Not only does Isilon save on storage, it also streamlines storage administration by eliminating the need to oversee numerous islands of storage. Using Adobe's eight-petabyte data set in a traditional environment would require 24 petabytes of local disk capacity and thousands of Hadoop nodes, when hundreds of compute nodes would be adequate.
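The arithmetic behind these figures (a worked check; the roughly 20 percent protection overhead is implied by the paper's own 9.6-petabyte number rather than stated directly):

```latex
\begin{aligned}
\text{Traditional 3-way replication:} &\quad 8\ \text{PB} \times 3 = 24\ \text{PB raw} \\
\text{Isilon OneFS protection } (\approx 20\%\ \text{overhead}): &\quad 8\ \text{PB} \times 1.2 = 9.6\ \text{PB raw} \\
\text{Capacity reduction:} &\quad (24 - 9.6) / 24 = 60\%
\end{aligned}
```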

Enabling a "data lake," Isilon OneFS provides enterprises with one central repository of data accessible through multiple protocols. Rather than requiring a separate, purpose-built HDFS device, Isilon supports HDFS along with NFS, FTP, SMB, HTTP, NDMP, Swift, and object access. (Figure 6) This allows organizations to bring Hadoop to the data, a more streamlined approach than moving data to Hadoop.

Figure 6. Isilon Data Lake Concept with Multi-protocol Support

STRONGER DATA PROTECTION

Isilon provides secure control over data access by supporting POSIX for granular file access permissions. Isilon stores data in a POSIX-compliant file system with SMB and NFS workflows that users can also access through HDFS for MapReduce. Isilon protects partitioned subsets of data with access zones that prevent unauthorized access.
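As a small illustration (hypothetical path, user, and group, not from the POC), the same POSIX ownership and mode bits that NFS and SMB clients see can be managed through the HDFS API, so one set of permissions governs every access path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestrictDataset {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path dataset = new Path("/analytics/weblogs");             // placeholder path
            fs.setOwner(dataset, "etl_svc", "analytics");               // placeholder user and group
            fs.setPermission(dataset, new FsPermission((short) 0750));  // rwxr-x--- in octal
        }
    }
}
```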

In addition, Isilon offers rich data services that are not available in traditional Hadoop environments. For example, Isilon enables users to create snapshots of the Hadoop environment for point-in-time data protection or to create duplicate environments. Isilon replication also can synchronize Hadoop data to a remote site, providing even greater protection. This allows organizations to keep Hadoop data secure on premises, rather than moving data to a public cloud.

FREEING THE INFRASTRUCTURE

Virtualizing HDaaS introduces greater opportunities for flexibility, unencumbering the infrastructure from physical limitations. Instead of traditional bare-metal clusters with rigid configurations, virtualization allows organizations to tailor Hadoop VMs to their individual workloads and even use existing compute infrastructure. This is key to optimizing performance and efficiency. Plus, virtualization facilitates multi-tenancy and offers additional high-availability advantages through fluid movement of VMs from one physical host to another.


BEST PRACTICE RECOMMENDATIONS

Several important lessons learned and best practices were documented from this breakthrough POC, as follows.

MEMORY SETTINGS ARE KEY

It's important to recognize that Hadoop is still a maturing product and does not automatically determine optimal memory requirements. Memory settings are crucial to achieving sufficient performance to run Hadoop jobs against large data sets. EMC recommends methodically adjusting memory settings and repeatedly testing configurations until the optimal environment is achieved.

UNDERSTAND SIZING AND CONFIGURATION

Operating at Adobe's scale (hundreds of terabytes to tens of petabytes) demands close attention to sizing and configuration of virtualized infrastructure components. Since no two Hadoop jobs are alike, IT organizations must thoroughly understand the data sets and jobs their customers plan to run. Key sizing and configuration insights from this POC include:

• Devote ample time upfront to sizing storage layers based on workload and scalability requirements. Sizing for Hadoop intermediate space also deserves careful consideration.

• Consider setting large HDFS block sizes of 256 to 1024 megabytes to ensure sufficient performance. On Isilon, HDFS block size is configured as a protocol setting in the OneFS operating system.

• In the compute environment, deploy a large number of hosts using processors with as many cores as possible and align the VMs to those cores. In general, having more cores is more important than having faster processors and results in better performance and scalability.

• Configure all physical hosts in the VMware cluster identically. For example, mixing eight-core and ten-core systems will make CPU alignment challenging when using BDE. Different RAM amounts also will cause unwanted overhead as VMware's distributed resource scheduling moves virtual guests.

ACQUIRE OR DEVELOP HADOOP EXPERTISE

Hadoop is complex, with numerous moving parts that must operate in concert. For example, MapReduce settings may affect Java, which may in turn impact YARN. EMC recommends that organizations wishing to use Hadoop ramp up gradually and review the many resources available to help simplify Hadoop implementation with Isilon. Hadoop insights also may be gained through "tribal" sharing of experiences among industry colleagues, as well as formal documentation and training. The POC team recommends these resources as a starting place:

• EMC Isilon Free Hadoop website

• EMC Hadoop Starter Kit

• EMC Isilon Best Practices for Hadoop Data Storage white paper

• EMC Big Data website

When building and configuring the virtual HDaaS infrastructure, companies should select vendors with extensive expertise in Hadoop, and especially in large-scale Hadoop environments. EMC, VMware, and solution integrators with Big Data experience can help accelerate a Hadoop deployment and ensure success.

Because of the interdependencies among the many components in a virtual HDaaS infrastructure, internal and external team members will need broad knowledge of the technology stack, including compute, storage, virtualization, and networking, with a deep understanding of how each performs separately and together. While IT as a whole is still evolving toward integrated skill sets, EMC has been at the forefront of this trend and can provide insights and guidance.

NEXT STEPS: LIVE WITH HDAAS

With the breakthrough results of this POC, Adobe plans to take the HDaaS reference architecture using Isilon into production and test even larger Hadoop jobs. To generate additional results, Adobe also will run a variety of Hadoop jobs on the virtual HDaaS platform repeatedly, as many as hundreds of times. The goal is to demonstrate that virtual HDaaS can deliver and is ready for large production applications.

While the POC pointed one Hadoop cluster to Isilon, additional testing will focus on multiple Hadoop clusters accessing data sets on Isilon to further prove scalability. This multi-tenancy capability is crucial for supporting multiple analytics teams with separate projects. Adobe Technical Operations plans to run Hadoop jobs through Isilon access zones to ensure isolation is preserved without impacting performance or scalability.

In addition, the team plans to move intermediate space from VNX block storage to Isilon and evaluate the impact of the additional I/O on Isilon. Adobe also expects that an all-flash array such as EMC XtremIO would provide an excellent option for block storage in place of VNX.

Additional configuration adjustments and testing are well worth the effort to Adobe and present tremendous opportunities for the analytics community as a whole. Using centralized storage, such as Isilon, provides a common data source rather than creating numerous storage locations for multiple Hadoop projects. The flexibility and scalability of the virtual HDaaS environment is also of great value as Hadoop jobs continue to grow in size.

Most important, moving virtual HDaaS into production will enable Adobe's data scientists to query against the entire data set residing on Isilon. In doing so, they will have a powerful way to gain more insight and intelligence that can be presented to Adobe's customers, providing both Adobe and its customers with a strong competitive advantage.
