Ready Solutions for Data Analytics Cloudera Hadoop 6.1 Architecture Guide April 2019 H17614.1 Abstract This reference architecture guide describes the architectural recommendations for Cloudera Hadoop 6.1 software on Dell EMC PowerEdge servers and Dell EMC Networking switches.


Ready Solutions for Data Analytics

Cloudera Hadoop 6.1

Architecture Guide

April 2019

H17614.1

Abstract

This reference architecture guide describes the architectural recommendations for Cloudera Hadoop 6.1 software on Dell EMC PowerEdge servers and Dell EMC Networking switches.


Copyright © 2017-2019 Dell Inc. or its subsidiaries. All rights reserved.

Published April 2019

Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.

Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property of their respective owners. Published in the USA.

Dell EMC
Hopkinton, Massachusetts 01748-9103
1-508-435-1000 In North America 1-866-464-7381
www.DellEMC.com


CONTENTS

Figures

Tables

Chapter 1 Executive Summary
    Document purpose
    Audience
    Hadoop overview
    Cloudera Enterprise software overview
        Hadoop for the enterprise
        Data management
        Cloudera Enterprise components
        Cloudera Enterprise Data Hub
    Cloudera Hadoop 6.1 Ready Solution
    Solution use case summary

Chapter 2 Ready Architecture Components
    Solution components
    Dell EMC PowerEdge rack servers
        Dell EMC PowerEdge R640 server
        Dell EMC PowerEdge R740xd server

Chapter 3 Solution Architecture Overview
    Cluster architecture
        High-level node architecture
        High availability
    Network architecture
        Network definitions
        Cluster physical networks
        Physical network components
    Rack server hardware configurations
        Infrastructure Nodes
        Worker Nodes
        Edge Nodes
        Node configuration

Chapter 4 References
    Cloudera partnership and certification
    Dell EMC Customer Solution Centers
    Technical support

Appendix A Dell EMC PowerEdge R740xd Worker Nodes Physical Rack Configuration
    Worker Nodes single-rack configuration
    Worker Nodes initial rack configuration
    Worker Nodes additional pod rack configuration

Appendix B Tested Component Versions
    Software versions
    Network switch firmware versions
    Dell EMC PowerEdge R640 firmware versions
    Dell EMC PowerEdge R740xd firmware versions

Glossary

Index


FIGURES

1 Solution components
2 Dell EMC PowerEdge R640 server 10 x 2.5 in. chassis
3 Dell EMC PowerEdge R740xd server 3.5 in. chassis
4 Cluster architecture
5 Cluster network fabric architecture
6 Hadoop 25 GbE network connections
7 Dell EMC PowerEdge R640 network ports
8 Dell EMC PowerEdge R740xd Worker Node network ports
9 25 GbE single-pod networking equipment
10 Dell EMC Networking Z9100-ON multiple-pod networking equipment
11 Multiple-pod view using Dell EMC Networking Z9100-ON switches (based on Layer 3 ECMP)


TABLES

1 Cloudera Enterprise components/services
2 Cloudera Enterprise Data Hub
3 Solution use cases
4 Data processing and access components
5 Cluster node roles
6 Service locations by node
7 Recommended number of nodes and pods for 25 GbE cluster
8 Alternative number of nodes and pods for 25 GbE cluster
9 Rack and pod density scenarios
10 CDH network definitions
11 Cluster networks
12 Network/bond/interface cross reference
13 Per rack network equipment
14 Per pod network equipment
15 Per cluster aggregation network switches for multiple pods
16 Per node network cables required
17 Hardware configurations: Dell EMC PowerEdge R640 Infrastructure Nodes
18 Dell EMC PowerEdge R640 Infrastructure Node volumes
19 Dell EMC PowerEdge R640 Infrastructure Node partitions
20 Hardware configurations: Dell EMC PowerEdge R740xd Worker Nodes
21 Dell EMC PowerEdge R740xd Worker Node volumes
22 Dell EMC PowerEdge R740xd Worker Node partitions
23 Hardware Configurations – Dell EMC PowerEdge R640 Edge Nodes
24 Dell EMC PowerEdge R640 Edge Node volumes
25 Dell EMC PowerEdge R640 Edge Node partitions
26 Specialized Worker Node subtypes
27 Solution Support Matrix
28 Single-rack configuration: Worker Nodes
29 Initial pod rack configuration: Dell EMC PowerEdge R740xd Worker Nodes
30 Additional pod rack configuration: Dell EMC PowerEdge R740xd Worker Nodes
31 Software versions
32 Network switch firmware versions
33 Dell EMC PowerEdge R640 firmware versions
34 Dell EMC PowerEdge R740xd firmware versions


CHAPTER 1

Executive Summary

This chapter presents the following topics:

• Document purpose
• Audience
• Hadoop overview
• Cloudera Enterprise software overview
• Cloudera Hadoop 6.1 Ready Solution
• Solution use case summary


Document purpose

This document describes the Dell EMC server hardware and networking configuration that is recommended for running CDH, the Cloudera distribution including Apache Hadoop.

It also includes recommendations for the location of CDH core services and components. The installation of additional components and services is flexible and depends on the applications and workloads.

For additional and relevant information, see the Dell EMC Ready Architectures for Hadoop web page.

Audience

This document is for customers and system architects who require information about configuring Hadoop clusters in their information technology environment for Big Data analytics.

Hadoop overview

Hadoop is an Apache project that is being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to this project and uses Apache Hadoop extensively across its businesses. Core committed contributors on the Hadoop project include employees from Cloudera, eBay, Facebook, Getopt, Hortonworks, Huawei, IBM, InMobi, INRIA, LinkedIn, MapR, Microsoft, Pivotal, Twitter, UC Berkeley, VMware, WANdisco, and Yahoo!. Many more individuals and organizations have made contributions.

Cloudera Enterprise software overview

Cloudera Enterprise helps enterprises become information-driven. It combines the best of open-source software components with enterprise capabilities.

Hadoop for the enterprise

Specifically for mission-critical environments, Cloudera Enterprise includes CDH, a leading open-source Hadoop-based platform. It also includes advanced system management and data management tools plus dedicated support and community advocacy from its team of Hadoop developers and experts.


Cloudera Enterprise with Apache Hadoop is:

• Unified: One integrated system that brings diverse users and application workloads to one pool of data on common infrastructure. No data movement is required.

• Secure: Perimeter security, authentication, granular authorization, and data protection.

• Governed: Enterprise-grade data auditing, data lineage, and data discovery.

• Managed: Native high availability, fault-tolerant and self-healing storage, automated backup and disaster recovery, and advanced system and data management.

• Open: Apache-licensed open source to ensure that your data and applications remain yours, and an open platform to connect with all your existing investments in technology and skills.

Data management

With Cloudera Enterprise, organizations put their data at the center of their operations to increase business visibility and reduce costs, while successfully managing risk and compliance requirements.

Cloudera Enterprise provides:

• A massively scalable platform to store any amount or type of data, in its original form, for as long as required

• Integration with your existing infrastructure and tools

• Flexibility to run a variety of enterprise workloads such as batch processing, interactive SQL, enterprise search, and advanced analytics

• Robust security, governance, data protection, and management

Cloudera Enterprise components

The following table lists the products and services that are included with Cloudera Enterprise.

Table 1 Cloudera Enterprise components/services

Product/Service | Description
CDH | As the core of Cloudera Enterprise, combines Apache Hadoop with several other open-source projects to create a single, massively scalable system. You can unite storage with an array of powerful processing and analytic frameworks.
Cloudera Manager | Helps you easily deploy, manage, monitor, and diagnose issues with your cluster. Cloudera Manager is critical for operating clusters at scale.
Cloudera Support | Provides technical support for Hadoop. With Cloudera Support, you gain more uptime, faster issue resolution, better performance to support your mission-critical applications, and faster delivery of platform features.


Cloudera Enterprise Data Hub

Cloudera Enterprise includes several advanced components that extend and complement the value of Apache Hadoop, as shown in the following table.

Table 2 Cloudera Enterprise Data Hub

Component | Description
Online NoSQL – HBase | HBase is a distributed key-value store that helps you build real-time applications on massive tables (billions of rows and millions of columns) with fast, random access.
Analytic SQL – Impala | Impala is a massively parallel processing (MPP) SQL engine that is built for Hadoop.
Search – Cloudera Search | Cloudera Search, based on Apache Solr, enables you to query and browse data in Hadoop, similar to searching Google or an eCommerce site.
In-Memory Machine Learning and Stream Processing – Apache Spark | Spark delivers fast, in-memory analytics and real-time stream processing for Hadoop.
Data Management – Cloudera Navigator | Cloudera Navigator provides critical enterprise data audit, lineage, and data discovery capabilities that enterprises require. It includes Active Data Optimization (Cloudera Navigator Optimizer), Governance and Data Management (Cloudera Navigator including auditing, lineage, discovery, and policy lifecycle management), and Encryption and Key Management (Cloudera Navigator Encrypt and Key Trustee).

Cloudera Hadoop 6.1 Ready Solution

The Cloudera Hadoop 6.1 Ready Solution lowers the barrier to adoption for organizations that are intending to use Apache Hadoop in production.

Although Hadoop is popular and widely used, installing, configuring, and running a production Hadoop cluster involves multiple considerations, including:

• Appropriate Hadoop software distribution and extensions

• Monitoring and management software

• Allocation of Hadoop services to physical nodes

• Selection of appropriate server hardware

• Design of the network fabric

• Sizing and scalability

• Performance

These considerations are complicated by the need to understand the type of workloads that will run on the cluster, the fast-moving pace of the core Hadoop project, and the challenges of managing a system that is designed to scale to thousands of nodes in a single cluster.

Dell EMC’s customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions that run on hyperscale hardware. Dell EMC's unique Hadoop solution combines optimized hardware, software, and services to streamline deployment and improve the customer experience.

Dell EMC and Cloudera designed this solution jointly. It embodies all the hardware, software, resources, and services that are needed to run Hadoop in a production environment. This end-to-end solution enables you to be in production with Hadoop in a shorter time than is possible with homegrown solutions.

The solution is based on Cloudera Enterprise, Dell EMC PowerEdge servers, and Dell EMC Networking hardware. This solution includes best practices, optimized server configurations, and optimized network infrastructure.


Solution use case summary

This solution is designed to address the use cases described in the following table.

Table 3 Solution use cases

Use case | Description
Big Data analytics | Rapidly query petabyte-scale unstructured and semistructured data in real time by using HBase and Hive
Data storage | Collect and store unstructured and semistructured data in a secure, fault-resilient, scalable data store that can be organized and sorted for indexing and analysis
Batch processing of unstructured data | Batch process (index, analyze, and so on) tens to hundreds of petabytes of unstructured and semistructured data
Data archiving | Archive medium-term (12–36 months) data from EDW/DBMS to expedite access, increase data retention time, or meet data retention policies or compliance requirements
Big Data visualization | Capture, index, and visualize unstructured and semistructured Big Data in real time
Search and predictive analytics | Crawl, extract, index, and transform semistructured and unstructured data for search and predictive analytics


CHAPTER 2

Ready Architecture Components

This chapter presents the following topics:

• Solution components
• Dell EMC PowerEdge rack servers


Solution components

The Dell EMC PowerEdge servers, Dell EMC Networking switches, and the operating system make up the foundation on which the solution software stack runs.

The following figure shows the primary components of this Ready Architecture.

Figure 1 Solution components

The Store layer components provide multiple layers of functionality on top of this foundation. The Hadoop Distributed File System (HDFS) provides the core storage for data files in the system. HDFS is a distributed, scalable, reliable, and portable file system. Apache Kudu provides a columnar relational storage option, while Apache HBase provides NoSQL access to storage. Object storage is also available.

The Integrate layer shows the components that move data in and out of the Hadoop system. Apache Sqoop provides data transfer to and from relational databases, while Apache Flume and Apache Kafka are optimized for real-time processing of event and log data. Also, the HDFS API and tools can be used to move data files to and from the Hadoop system.

YARN provides a resource management framework for running distributed applications under Hadoop. The most popular distributed application is Hadoop’s MapReduce. Other applications, such as Apache Spark, Apache Hive, and Apache Pig, also run under YARN. Apache Sentry and RecordService provide enterprise-grade security services.

The right side of the figure shows the data management capabilities that are integrated across the entire system. The left side of the figure shows the operational components that Cloudera Manager provides for Hadoop administration and management.

The following multiple complementary processing and access alternatives sit on top of the Cloudera Enterprise core:

• Batch data processing

• Stream data processing

• SQL query

• Data search


You can use these layers simultaneously or independently, depending on the workload and problems that you need to solve, as shown in the following table.

Table 4 Data processing and access components

Access layer | Description
Batch data processing | Spark, Hive, Pig, and MapReduce provide access to the massively parallel Hadoop data processing framework.
Stream data processing | Spark provides stream processing.
SQL query | Impala provides SQL query access to data.
Data search | Apache Solr provides real-time search of indexed data.

Dell EMC PowerEdge rack servers

This Ready Architecture uses Dell EMC's latest rack or modular server solutions.

Highlights

• Highly optimized air flow design that enables exceptional configuration flexibility and industry-leading energy efficiency

• Out-of-band management architecture that facilitates rapid bare-metal deployment and remediation regardless of operating system state

• Embedded SupportAssist that reduces troubleshooting and downtime with embedded diagnostics and automated case creation

Automated productivity

• Up to 4 times performance improvement in common management tasks with the new iDRAC9 dual-core ARM processor (compared to iDRAC8)

• Use of the same next generation of embedded automation to standardize BIOS and secure boot configuration, firmware updates, server asset inventory, health monitoring, and power/reset control across all Dell EMC PowerEdge servers

• Embedded proactive automated support that resolves issues up to 90 percent faster

Comprehensive security

• Fully signed firmware updates in which embedded trust only allows authenticated code to run

• Security lock-down that protects your server configuration and firmware (BIOS, iDRAC, and RAID) from malicious changes

• Secure instant erase for HDDs, SSDs, and NVMs

• A more secure, unique default password

• Redfish, a new REST-based management API, that is more secure and scalable than legacy Intelligent Platform Management Interface (IPMI)
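
As a rough illustration of what REST-based management means in practice, the following Python sketch builds (but does not send) a request for the Systems collection that the Redfish standard defines at /redfish/v1/Systems. The host name, user name, password, and the function name build_systems_request are hypothetical placeholders, and HTTP Basic authentication is assumed for brevity. None of these details come from this guide.

```python
import base64
import urllib.request


def build_systems_request(bmc_host, username, password):
    """Build (but do not send) a GET request for the Redfish Systems collection.

    bmc_host, username, and password are placeholders for this sketch.
    """
    url = f"https://{bmc_host}/redfish/v1/Systems"
    # HTTP Basic authentication is an assumption made to keep the sketch short.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Accept", "application/json")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical management controller address and credentials.
req = build_systems_request("idrac.example.com", "monitor", "example-password")
print(req.get_method(), req.full_url)
```

Sending the request, for example with urllib.request.urlopen, would also require the client to trust the management controller's TLS certificate, which is outside the scope of this sketch.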

Dell EMC PowerEdge R640 server

The Dell EMC PowerEdge R640 server is a dense, general purpose, scale-out compute node.

Dell EMC PowerEdge R640 is an ideal choice for dense scale-out data center computing and storage in a 1U/2S platform. It enables optimization of application performance, price performance, or performance per watt per unit of rack space in most data center environments.

The following figure shows the server.

Figure 2 Dell EMC PowerEdge R640 server 10 x 2.5 in. chassis

Dell EMC PowerEdge R740xd server

The Dell EMC PowerEdge R740xd server is a highly configurable software-defined storage server.

The Dell EMC PowerEdge R740xd is the ideal platform for uncompromising storage performance and data set processing in a 2U/2S form factor. It provides excellent storage performance and density for applications such as software-defined storage. The Dell EMC PowerEdge R740xd is designed with the versatility that is demanded by cloud service providers, Hadoop and Big Data users, and colocation hosting.

The following figure shows the server.

Figure 3 Dell EMC PowerEdge R740xd server 3.5 in. chassis


CHAPTER 3

Solution Architecture Overview

This chapter presents the following topics:

• Cluster architecture
• Network architecture
• Rack server hardware configurations


Cluster architecture

This Ready Architecture addresses all aspects of a production Hadoop cluster, including the software layers, server hardware, and network fabric, as well as scalability, performance, and ongoing management.

High-level node architecture

The following figure displays the roles for the nodes in a basic cluster.

Figure 4 Cluster architecture

The cluster environment consists of multiple software services running on multiple physical server nodes. The implementation divides the server nodes into several roles, and each node has a configuration that is optimized for its role in the cluster. The physical server configurations are divided into two broad classes:

• Worker Nodes, which handle the bulk of the Hadoop processing

• Master Nodes, which support services that are needed for the cluster operation

A high-performance network fabric connects the cluster nodes and separates the core data network from management functions.

The minimum configuration supports nine cluster nodes, plus an optional Administration Node, as shown in the following table.

Table 5 Cluster node roles

Physical node | Required or optional | Hardware configuration
Administration Node | Optional | Infrastructure
Master Node 1 | Required | Infrastructure
Master Node 2 | Required | Infrastructure
Master Node 3 | Required | Infrastructure
Edge Node | Required | Infrastructure
Worker Node 1 | Required | Worker
Worker Node 2 | Required | Worker
Worker Node 3 | Required | Worker
Worker Node 4 | Required | Worker
Worker Node 5 | Required | Worker
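
The required node counts in Table 5 can be written as a small check. The following Python sketch is illustrative only: the role labels ("master", "edge", "worker"), the node names, and the function check_minimum_cluster are assumptions made for this example and are not part of the Ready Architecture tooling.

```python
from collections import Counter

# Required node counts taken from Table 5; the role labels are assumed names.
MINIMUM_REQUIRED = {"master": 3, "edge": 1, "worker": 5}


def check_minimum_cluster(nodes):
    """Return a list of problems; an empty list means the minimum is met.

    nodes is a list of (name, role) tuples. The optional Administration
    Node is not counted toward the minimum.
    """
    counts = Counter(role for _name, role in nodes)
    problems = []
    for role, needed in MINIMUM_REQUIRED.items():
        if counts[role] < needed:
            problems.append(
                f"need at least {needed} {role} node(s), found {counts[role]}"
            )
    return problems


# Hypothetical inventory matching the nine required nodes in Table 5.
minimum_cluster = (
    [(f"master{i}", "master") for i in range(1, 4)]
    + [("edge1", "edge")]
    + [(f"worker{i}", "worker") for i in range(1, 6)]
)

print(check_minimum_cluster(minimum_cluster))
```

Removing any required node from the inventory causes the function to report which role falls below the minimum.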

Node definitionsThe following list provides node definitions for this solution.

l Administration Node: Provides cluster deployment and management capabilities. The Administration Node is optional in cluster deployments, depending on whether existing provisioning, monitoring, and management infrastructure is used.

l Master Node 1: Runs all the services that manage the HDFS data storage and YARN resource management. It is sometimes called the “master name node.” There are four primary services running on Master Node 1:

n YARN Resource Manager: Supports cluster resource management, including MapReduce jobs

n NameNode: Supports HDFS data storage

n Journal Manager: Supports high availability

n ZooKeeper: Supports coordination

l Master Node 2: When quorum-based HA mode is used, runs the standby NameNode process, a second journal manager, and an optional standby resource manager. This node also runs the Spark History Server and a second ZooKeeper service.

l Master Node 3 (high availability (HA) node): Provides the third journal node for HA. Master Node 1 and Master Node 2 provide the first and second journal nodes. Master Node 3 also runs a third ZooKeeper service. The operational databases that are required for Cloudera Manager and additional metastores are on this node.

l Edge Node: Provides an interface between the data and processing capacity that is available in the Hadoop cluster and a user of that capacity. An Edge Node has an additional connection to the Edge Network and is sometimes called a “gateway node.” At least one Edge Node is required.

l Worker Node: Runs all the services that are required to store blocks of data on the local hard drives and run processing tasks against that data. A minimum of five Worker Nodes is required. Larger clusters are scaled primarily by adding Worker Nodes. The primary services running on the Worker Nodes are:

n DataNode daemon (to support HDFS data storage)

n NodeManager daemon (to support YARN job execution)

n Services managed with Cloudera Manager service pools instead of YARN, such as Impala and HBase

Spark jobs run on the Worker Nodes. However, there is no persistent service that is associated with Spark jobs.


Node locations

The following table describes the node locations and functions of the cluster services.

Table 6 Service locations by node

Physical node | Software function
Administration Node | Systems Management Services
First Edge Node | Hadoop Clients; Cloudera Manager; YARN Management
Master Node 1 | NameNode; Resource Manager; ZooKeeper; Quorum Journal Node; Impala State Store and Catalog Daemons; Kudu Master; YARN Management
Master Node 2 | Yum Repositories; Standby NameNode; Standby Resource Manager (optional); Spark History Server; Spark2 History Server; Quorum Journal Node; HBase Master; HBase REST Server; Thrift Server; Hue Server; Hue Load Balancer; ZooKeeper
Master Node 3 | ZooKeeper; Quorum Journal Node; Operational Databases (PostgreSQL)
Worker Node(N) | DataNode; NodeManager; HBase RegionServer; Impala Daemon; Kudu Tablet Server


Cluster sizing

This Ready Architecture is organized into three units to help you size the solution as the Hadoop environment grows.

From smallest to largest, the units include:

l Rack

l Pod

l Cluster

Each unit has specific characteristics and sizing considerations. The design goal for the Hadoop environment is to enable you to scale the environment by adding more capacity without replacing existing components.

Rack

A rack is the smallest unit for a Hadoop environment.

A rack consists of the power, network cabling, and data and management switches to support a group of Worker Nodes. A rack is a physical unit and its capacity is defined by physical constraints that include available space, power, cooling, and floor loading. Ensure that a rack uses its own power within the data center, independent from other racks, and is treated as a fault zone. If a rack fails in a multiple-rack pod or cluster, the cluster continues to function with reduced capacity.

This Ready Architecture uses 12 nodes as the size of a rack, but higher or lower densities are possible. Typically, a rack contains about 12 nodes using a scale-out server such as the Dell EMC PowerEdge R740xd server. The node density of a rack does not affect overall cluster scaling and sizing, but it does affect fault zones in the cluster.

Pod

A pod is the set of nodes that is connected to the first level of network switches in the cluster. It consists of one or more racks.

A pod can include a small number of nodes initially and expand to the maximum number of nodes over time. A pod is a second-level fault zone above the rack level. If a pod fails in a multiple-pod cluster, the cluster continues to function with reduced capacity. A pod can support enough Hadoop server nodes and network switches for a minimum commercial-scale installation.

In this Ready Architecture, a pod supports up to 36 nodes (typically three racks). This size results in a bandwidth oversubscription of 2.25:1 between pods in a full cluster. The size of a pod can vary from this baseline recommendation. Changing the pod size affects the bandwidth oversubscription at the pod level, the size of the fault zones, and the maximum cluster size.

Cluster

A cluster is a single Hadoop environment that is attached to a pair of network switches that provide an aggregation layer for the entire cluster.

A cluster can range in size from a pod consisting of a single rack up to many pods. A single-pod cluster is a special case and can function without an aggregation layer. This scenario is typical for smaller clusters before the addition of more pods.

In this Ready Architecture, the limit on the total size of a cluster depends on the choice of Layer 2 or Layer 3 switching and the switch models that are used. See Sizing summary for the limits.
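The rack, pod, and cluster units above can be turned into a quick sizing calculation. The following Python sketch is illustrative only: the function name `size_units` is hypothetical, and it assumes the baseline figures from this section (12 nodes per rack, three racks per pod) and that racks and pods are filled completely before a new one is started.

```python
import math

NODES_PER_RACK = 12   # baseline rack size used in this guide
RACKS_PER_POD = 3     # baseline pod size (36 nodes)


def size_units(total_nodes, nodes_per_rack=NODES_PER_RACK,
               racks_per_pod=RACKS_PER_POD):
    """Return (racks, pods) needed to house total_nodes.

    Assumption: racks and pods are filled completely before a new
    one is started; real layouts may leave headroom.
    """
    if total_nodes < 1:
        raise ValueError("total_nodes must be positive")
    racks = math.ceil(total_nodes / nodes_per_rack)
    pods = math.ceil(racks / racks_per_pod)
    return racks, pods


print(size_units(9))    # minimum configuration -> (1, 1)
print(size_units(100))  # -> (9, 3)
```

A result of one pod corresponds to the single-pod special case, which can run without an aggregation layer.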


Sizing summary

The minimum configuration supports nine nodes:

l Master Node 1

l Master Node 2

l Master Node 3

l Edge Node

l Five Worker Nodes

Although each cluster requires a minimum of one Edge Node, larger clusters and clusters with high ingest volumes or rates might require additional Edge Nodes. Cloudera recommends a baseline of one Edge Node for every twenty Worker Nodes.
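As a rough illustration of the Edge Node baseline, the following Python sketch computes a starting Edge Node count from the number of Worker Nodes. The function name is hypothetical, and rounding up to a whole node is an assumption; the guide states only the one-per-twenty baseline and the minimum of one.

```python
import math


def recommended_edge_nodes(worker_nodes, workers_per_edge=20):
    """Baseline Edge Node count: one per twenty Worker Nodes, minimum one.

    Assumption: partial groups of workers round up to a whole Edge Node.
    """
    if worker_nodes < 0:
        raise ValueError("worker_nodes must not be negative")
    return max(1, math.ceil(worker_nodes / workers_per_edge))


print(recommended_edge_nodes(5))    # minimum configuration -> 1
print(recommended_edge_nodes(100))  # -> 5
```

Clusters with high ingest volumes or rates might need more Edge Nodes than this baseline suggests.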

The hardware configuration for the Infrastructure Nodes supports clusters in the range of petabyte storage. Other than the Infrastructure Nodes, cluster capacity is primarily a function of the server platform and disk drives that are chosen, and the number of Worker Nodes.

The following table shows the recommended number of nodes per pod and pods per cluster for 25 GbE clusters using the S5048-ON switch.

Table 7 Recommended number of nodes and pods for 25 GbE cluster

Nodes per rack | Nodes per pod | Pods per cluster | Nodes per cluster | Bandwidth oversubscription
12 | 36 | 8 | 288 | 2.25 : 1

The following table shows alternatives for cluster sizing with different bandwidth oversubscription ratios.

Table 8 Alternative number of nodes and pods for 25 GbE cluster

Nodes per rack | Nodes per pod | Pods per cluster | Nodes per cluster | Bandwidth oversubscription
12 | 48 | 8 | 384 | 3 : 1
12 | 36 | 10 | 360 | 3 : 1
12 | 24 | 16 | 384 | 3 : 1
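The figures in Table 7 and Table 8 can be cross-checked with simple arithmetic. The Python sketch below is one reading that reproduces the stated numbers, not a formula given by the guide. It assumes each node has a single 25 GbE link to the pod switch and that each pod has four 100 GbE uplinks, the uplink count described for Layer 2 aggregation elsewhere in this guide. The function names are hypothetical. Uplink counts for the 36-node and 24-node rows of Table 8 are not stated in the guide, so they are not evaluated here.

```python
NODE_LINK_GBPS = 25   # one 25 GbE link per node (assumption)
UPLINK_GBPS = 100     # each pod uplink is 100 GbE


def oversubscription(nodes_per_pod, uplinks_per_pod):
    """Ratio of node-facing bandwidth to uplink bandwidth for one pod."""
    downstream = nodes_per_pod * NODE_LINK_GBPS
    upstream = uplinks_per_pod * UPLINK_GBPS
    return downstream / upstream


def nodes_per_cluster(nodes_per_pod, pods_per_cluster):
    """Total node count for a cluster of identical pods."""
    return nodes_per_pod * pods_per_cluster


print(oversubscription(36, 4))    # Table 7 row -> 2.25
print(oversubscription(48, 4))    # first Table 8 row -> 3.0
print(nodes_per_cluster(36, 8))   # -> 288
print(nodes_per_cluster(36, 10))  # -> 360
```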


Power and cooling are typically the primary constraints on rack density. However, a rack is a potential fault zone and rack density affects overall cluster reliability, especially for smaller clusters. The following table shows possible scenarios that are based on typical data center constraints.

Table 9 Rack and pod density scenarios

Server platform | Nodes per rack | Racks per pod | Comments
Dell EMC PowerEdge R740xd | 12 | 3 | Typical configuration that requires less than 10 kW power per rack and provides good rack-level fault zone isolation
Dell EMC PowerEdge R740xd | 10 | 2 | Smaller rack and pod fault zones with slightly higher bandwidth oversubscription of 2.5 : 1

High availability

This Ready Architecture implements high availability (HA) at multiple levels through a combination of hardware redundancy and software support.

Hadoop redundancy

The Hadoop Distributed File System (HDFS) implements redundant storage for data resiliency, and is aware of node and rack location.

Data is replicated across multiple nodes and across racks. This replication provides multiple copies of data for reliability if there are disk or node failures. It can also increase performance. The number of replicas defaults to three and can be changed easily at the cluster and file level. The specified networks provide sufficient bandwidth for replication traffic as well as production traffic. Hadoop automatically balances data across the cluster nodes and creates additional replicas when a node fails. The bandwidth that is used for replication can also be controlled.
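Replication has a direct effect on usable capacity, because every block is stored as many times as the replication factor. The following Python sketch gives a rough upper bound. It assumes the Worker Node disk configuration listed later in this guide (16 data drives of 4 TB each) and the default replication factor of three. It ignores space reserved for the operating system, temporary files, and non-HDFS use, so real usable capacity is lower. The function name is hypothetical.

```python
def usable_hdfs_capacity_tb(worker_nodes, disks_per_node=16, disk_tb=4,
                            replication=3):
    """Upper-bound estimate of usable HDFS capacity in TB.

    Assumption: raw capacity divided by the replication factor, with no
    allowance for reserved space, temporary data, or file system overhead.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    raw_tb = worker_nodes * disks_per_node * disk_tb
    return raw_tb / replication


# Minimum configuration: five Worker Nodes, 320 TB raw.
print(round(usable_hdfs_capacity_tb(5), 1))  # -> 106.7
```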

Note

The Hadoop job parallelism model can scale to larger and smaller numbers of nodes, enabling jobs to run when parts of the cluster are offline.

Network redundancy

The production network can optionally use bonded connections to pairs of switches in each pod and switch pairs at the aggregation level. This configuration provides increased bandwidth capacity and allows operation at reduced capacity if a network port, network cable, or switch fails.

When using 25 GbE as the core fabric, we typically do not use bonded networking. For large clusters, we recommend the use of Layer 3 aggregation, which provides network redundancy at the spine-switch level. Refer to 25 GbE Layer 3 Dell EMC Networking Z9100-ON cluster aggregation for details.

HDFS highly available NameNodes

This Ready Architecture implements high availability (HA) for the Hadoop Distributed File System (HDFS) directory through a quorum mechanism that replicates critical NameNode data across multiple physical nodes. Production clusters implement NameNode HA.

In quorum-based HA, there are typically two NameNode processes running on two physical servers. At any point in time, one of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby NameNode acts as a slave, maintaining enough state to provide a fast failover if necessary.

For the Standby NameNode to keep its state synchronized with the Active NameNode in this implementation, both nodes communicate with a group of separate daemons called JournalNodes. When the Active NameNode modifies the namespace, it durably logs a record of the modification to a majority of these JournalNodes.

The Standby NameNode can read the edits from the JournalNodes, and is constantly watching them for changes to the edit log. As the Standby NameNode detects the edits, it applies them to its own namespace. If a failover occurs, the Standby NameNode ensures that it has read all the edits from the JournalNodes before promoting itself to the Active state. This action ensures that the namespace state is fully synchronized before a failover occurs.

To provide a fast failover, it is necessary that the Standby NameNode has up-to-date information about the location of blocks in the cluster. Therefore, the Worker Nodes are configured with the location of both the NameNode and Standby NameNode, and they send block location information and heartbeats to both.

Because edit log modifications must be written to a majority of JournalNodes, there must be an odd number of (and at least three) JournalNode daemons. The JournalNode daemons run on the Master Node 1, Master Node 2, and Master Node 3 in this Ready Architecture.
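The majority rule explains why an odd number of JournalNodes is used. The following Python sketch is a toy model, not Hadoop code: the `JournalNode` class, `majority`, `tolerated_failures`, and `log_edit` are hypothetical names chosen for illustration. It shows that three JournalNodes tolerate one failure, and that adding a fourth does not raise that number.

```python
def majority(journal_nodes):
    """Smallest number of JournalNodes that forms a majority."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes):
    """How many JournalNodes can be down while writes still succeed."""
    return journal_nodes - majority(journal_nodes)


class JournalNode:
    """Toy stand-in for a JournalNode daemon (illustrative only)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.edits = []

    def append(self, edit):
        # An unreachable node cannot acknowledge the write.
        if not self.reachable:
            return False
        self.edits.append(edit)
        return True


def log_edit(edit, journal_nodes):
    """Return True if a majority of JournalNodes acknowledged the edit.

    Simplification: a real quorum journal also handles recovery of edits
    that reached only a minority of nodes; this toy model does not.
    """
    acks = sum(1 for jn in journal_nodes if jn.append(edit))
    return acks >= majority(len(journal_nodes))


nodes = [JournalNode("master1"), JournalNode("master2"),
         JournalNode("master3", reachable=False)]
print(log_edit("mkdir /data", nodes))                # -> True
print(tolerated_failures(3), tolerated_failures(4))  # -> 1 1
```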

YARN resource manager high availability

This Ready Architecture supports high availability (HA) for the Hadoop YARN resource manager.

Without resource manager HA, currently running jobs fail when a Hadoop resource manager fails. When resource manager HA is enabled, jobs can continue running if a resource manager fails.

On failover, the applications can resume from their last checkpointed state. For example, completed map tasks in a MapReduce job are not rerun on a subsequent attempt. This action enables events such as machine crashes or planned maintenance to be handled without any significant performance effect on running applications.

An Active/Standby pair of resource managers implements resource manager HA. On startup, each resource manager is in the standby state, which means that the process is started, but the state is not loaded. When transitioning to the active state, the resource manager loads the internal state from the designated state store and starts all the internal services. The stimulus to transition to active comes from either the administrator or through the integrated failover controller when automatic failover is enabled.
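The standby-to-active transition can be pictured as a small state machine. The following Python sketch is a toy model under stated assumptions, not YARN code: the `ResourceManager` class, its methods, and the dictionary that stands in for the state store are all hypothetical. It shows only the idea that a standby process holds no application state until it loads that state from the shared store on becoming active.

```python
class ResourceManager:
    """Toy model of one member of an Active/Standby pair."""

    def __init__(self, name, state_store):
        self.name = name
        self.state_store = state_store  # shared dict standing in for the state store
        self.state = "standby"          # every resource manager starts in standby
        self.applications = None        # state is not loaded while in standby

    def transition_to_active(self):
        # Load the internal state from the designated state store.
        self.applications = dict(self.state_store)
        self.state = "active"

    def submit(self, app_id, status):
        if self.state != "active":
            raise RuntimeError(f"{self.name} is not active")
        self.applications[app_id] = status
        self.state_store[app_id] = status  # persist so a standby can recover it


def fail_over(failed, standby):
    """Mark one resource manager as failed and promote the other."""
    failed.state = "failed"
    standby.transition_to_active()


store = {}
rm1 = ResourceManager("rm1", store)
rm2 = ResourceManager("rm2", store)
rm1.transition_to_active()
rm1.submit("app_1", "RUNNING")
fail_over(rm1, rm2)
print(rm2.state, rm2.applications)  # -> active {'app_1': 'RUNNING'}
```

In a real deployment the trigger for `fail_over` would be the administrator or the integrated failover controller, as described above.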

Note

Resource manager HA is not always implemented in production clusters.


Database server high availability

This Ready Architecture supports high availability (HA) for the operational databases.

The database server that is used for both the Cloudera Manager operational and metadata databases stores its data on a RAID 10 partition to provide redundancy in the event of a drive failure.

Note

Our default installation uses a single PostgreSQL instance. Therefore, there is a single point of failure. You can implement database server HA by using one or more additional PostgreSQL instances on other nodes in the cluster or by using an external database server.

Network architecture

The cluster network is designed to meet the needs of a high-performance and scalable cluster, while providing redundancy and access to management capabilities.

The architecture is a leaf/spine model that is based on 25 GbE networking technologies. It uses Dell EMC Networking S5048-ON switches for the leaves and Dell EMC Networking Z9100-ON switches for the spine.

IPv4 is used for the network layer. At this time, the architecture does not support the use of IPv6 for network connectivity.

The following figure shows the logical network architecture.

Figure 5 Cluster network fabric architecture


Network definitions

Three distinct networks are used in the cluster.

The following table describes the CDH networks and their purposes.

Table 10 CDH network definitions

Network | Description | Available services
Cluster Data Network | The Data network carries the bulk of the traffic within the cluster. This network is aggregated within each pod, and pods are aggregated into the cluster switch. | The Cloudera Enterprise services are available on this network. Note: The Cloudera Enterprise services do not support multihoming and are only accessible on the Cluster Data Network.
iDRAC/BMC Network | The BMC network connects the BMC or iDRAC ports and the out-of-band management ports of the switches. It is used for hardware provisioning and management. This network is aggregated into a management switch in each rack. | This network provides access to the BMC and iDRAC functionality on the servers. It also provides access to the management ports of the cluster switches.
Edge Network | The Edge network provides connectivity from the Edge Nodes to an existing premises network, either directly, or by using the pod or cluster aggregation switches. | SSH access to Edge Nodes is available on this network, and other application services may be configured and available.

Cluster physical networks

The following table lists the distinct networks that are used in the cluster.

Table 11 Cluster networks

Logical network | Connection | Switch
Cluster Data network | 25 GbE | Top-of-rack (ToR) (pod) switches and aggregation switches
BMC network | 1 GbE | Dedicated switch per rack
Edge network | 25 GbE | Direct to Edge network or through a pod or aggregation switch


Each network uses a separate VLAN and dedicated components when possible. The following figure shows the logical organization of the network.

For more information about the configuration of the interfaces and switches, see the solution deployment guide.

Figure 6 Hadoop 25 GbE network connections

Physical network components

The physical networks of this Ready Architecture consist of the following components:

l Server node connections

l Network fabric

n 25 GbE pod switches

n 25 GbE cluster aggregation switches

l iDRAC management network

Network integration information is presented in:

l Core network integration

l Layer 2 and Layer 3 separation

All equipment is listed in:

l Network equipment summary: 25 GbE configurations

Server node connections

Server connections to the network switches for the Data network use Ethernet technology. Connections to the network use 25 GbE, which is recommended for new Dell EMC PowerEdge R640 and Dell EMC PowerEdge R740xd server deployments.

Edge Nodes have an additional available network connection. This connection facilitates high-performance cluster access between applications running on those nodes and the optional Edge network.


Server connections to the BMC network use a single connection from the iDRAC port to an S3048-ON management switch in each rack, as shown in the following figures.

Figure 7 Dell EMC PowerEdge R640 network ports

Figure 8 Dell EMC PowerEdge R740xd Worker Node network ports

The following table shows the mapping of individual interfaces to networks and bonds.

Table 12 Network/bond/interface cross reference

Server Platform | Network | Interface | Bond
Dell EMC PowerEdge R740xd | Cluster Data | em1 | none
Dell EMC PowerEdge R640 | Cluster Data | em1 | none
Dell EMC PowerEdge R640 | Edge | em2 | none

Network fabric

We recommend 25 GbE for new deployments of Dell EMC PowerEdge R740xd and Dell EMC PowerEdge R640 servers.

Clusters larger than a single pod require an aggregation layer. The aggregation layer can be implemented at either Layer 2 (L2) or Layer 3 (L3). The choice depends on the initial size and planned scaling.

Layer 2 aggregation provides lower cost and medium scalability, and can support approximately 250 nodes.


Layer 3 aggregation is recommended for:

l Larger initial deployments of over 250 nodes

l Deployments where extreme scale-out is planned to about 1500 nodes

l Instances where the cluster must be colocated with other applications in a different rack
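The Layer 2 and Layer 3 guidance above can be summarized as a small decision helper. The following Python sketch is only an illustration of that guidance: the function name and parameters are hypothetical, the 250-node threshold is the approximate figure quoted above, and the 36-node pod size is the baseline from the cluster sizing section. An actual choice also depends on the switch models and oversubscription ratio.

```python
L2_NODE_LIMIT = 250  # approximate Layer 2 scalability quoted in this guide


def recommend_aggregation(initial_nodes, planned_nodes=None,
                          nodes_per_pod=36, colocated=False):
    """Suggest an aggregation approach: 'none', 'Layer 2', or 'Layer 3'.

    colocated: the cluster must share infrastructure with other
    applications in a different rack.
    """
    planned = initial_nodes if planned_nodes is None else planned_nodes
    largest = max(initial_nodes, planned)
    if colocated or largest > L2_NODE_LIMIT:
        return "Layer 3"
    if largest <= nodes_per_pod:
        return "none"  # a single pod can run without an aggregation layer
    return "Layer 2"


print(recommend_aggregation(20))                      # -> none
print(recommend_aggregation(100))                     # -> Layer 2
print(recommend_aggregation(100, planned_nodes=1000)) # -> Layer 3
```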

The scalability depends on the switches that are used and the oversubscription ratio, and is summarized in Sizing summary.

The standard implementation instructions and tools that are described in the solution deployment guide are oriented to a Layer 2 aggregation implementation, while Layer 3 aggregation is a customized deployment.

The following sections describe the fabric:

l 25 GbE pod switches

l 25 GbE cluster aggregation switches

25 GbE pod switches

Each pod uses a Dell EMC Networking S5048-ON switch as the first layer switch. The pod switches are often referred to as top of rack (ToR) switches, although this architecture splits a physical rack from a logical pod.

The Dell EMC Networking Z9100-ON switch is a multiple-rate 100 GbE 1U spine switch that is optimized for high-performance, ultra-low-latency data center requirements. It can provide a cumulative bandwidth of 7.4 Tb/sec of throughput at line-rate traffic from every port. It can be configured with up to:

l 32 ports of 100 GbE (QSFP28)

l 64 ports of 50 GbE (QSFP+)

l 32 ports of 40 GbE (QSFP+)

l 128 ports of 25 GbE (QSFP+)

l 128+2 ports of 10 GbE

The following figure shows the single-pod network configuration, with a Dell EMC Networking S5048-ON switch aggregating the pod traffic.


Figure 9 25 GbE single-pod networking equipment

For a single pod, the pod switch can act as the aggregation layer for the entire cluster. For multiple-pod clusters, a cluster aggregation layer is required.

In this architecture, each pod is managed as a separate entity from a switching perspective, and the individual pod switches connect only to the aggregation switch.


25 GbE cluster aggregation switches

For clusters consisting of more than one pod, the architecture uses the Dell EMC Networking Z9100-ON switch for aggregation.

The Dell EMC Networking Z9100-ON can be used for both Layer 2 and Layer 3 implementations.

25 GbE Layer 2 Dell EMC Networking Z9100-ON cluster aggregation

The following figure illustrates the configuration for a multiple-pod cluster that uses the Z9100-ON switch for cluster aggregation with Layer 2 networking.

The uplink from each S5048-ON pod switch to the aggregation layer uses four 100 GbE interfaces in a bonded configuration, providing a collective bandwidth of 400 Gb/s from each pod.

Figure 10 Dell EMC Networking Z9100-ON multiple-pod networking equipment

25 GbE Layer 3 Dell EMC Networking Z9100-ON cluster aggregation

The Dell EMC Networking Z9100-ON core switch can be used for aggregation at Layer 3 in larger clusters using 25 GbE.

We use a different network architecture for a cluster that uses Layer 3 aggregation, based on Equal-Cost Multipath (ECMP) routing and a leaf/spine organization. In this configuration, a cluster can scale to over 1,500 nodes, with a low 3:1 oversubscription per pod.


The following figure shows this alternative configuration for a multiple-pod cluster by using Layer 3 and ECMP routing.

Figure 11 Multiple-pod view using Dell EMC Networking Z9100-ON switches (based on Layer 3 ECMP)

For more details about Layer 3 Leaf/Spine deployment, see Leaf-Spine Deployment and Best Practices Guide for Greenfield Deployments.

iDRAC management network

In addition to the Cluster Data network, a separate network is provided for cluster management: the iDRAC (or BMC) network.

The iDRAC management ports are aggregated into a per-rack Dell EMC Networking S3048-ON switch with a dedicated VLAN. This aggregate provides a dedicated iDRAC/BMC network for hardware provisioning and management. Switch management ports are also connected to this network.

The management switches can be connected to the core or connected to a dedicated management network if out-of-band management is required.

Core network integration

The aggregation layer functions as the network core for the cluster.

In most instances, the cluster connects to a larger core in the enterprise, as shown in Figure 10. When you use the Dell EMC Networking S5048-ON switch, two 100 GbE ports are reserved at the aggregation level for connection to the core. Connection details are site-specific and must be determined as part of the deployment planning.

Layer 2 and Layer 3 separation

The Layer 2 and Layer 3 boundaries are separated at either the pod or the aggregation layer. Either option is equally viable. This Ready Architecture is based on Layer 2 for switching in the cluster.

The colors blue and red in Figure 11 represent the Layer 2 and Layer 3 boundaries.


Network equipment summary: 25 GbE configurations

The following tables summarize the required cluster networking equipment.

Table 13 Per rack network equipment

Component | Quantity
Total racks | 1 (12 nodes nominal)
Management switch | 1 x Dell EMC Networking S3048-ON
Switch interconnect cables | 1 x 1 GbE cables (to next rack management switch)

Table 14 Per pod network equipment

Component | Quantity
Total racks | 3 (36 nodes)
Top-of-rack switches | 1 x Dell EMC Networking S5048-ON
Pod uplink cables (to aggregate switch) | 4 x 100 Gb QSFP+ cables

Table 15 Per cluster aggregation network switches for multiple pods

Component | Quantity
Total pods | 8
Aggregation layer switches | 1 x Dell EMC Networking Z9100-ON

The following table summarizes the number of cables that are needed for a cluster.

Table 16 Per node network cables required

Description | 1 GbE cables required | 25 GbE connections with QSFP+ required
Master Nodes | 1 x number of nodes | 1 x number of nodes
Edge Nodes | 1 x number of nodes | 2 x number of nodes
Worker Nodes | 1 x number of nodes | 1 x number of nodes

Note

25 GbE node connections typically use a QSFP+ to quad SFP breakout cable, so the cable count is typically one-fourth the number of connections in the preceding table.
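The per-node figures in Table 16 and the breakout-cable note can be combined into a quick cable estimate. The following Python sketch is illustrative: the function name is hypothetical, and rounding the breakout cable count up to a whole cable is an assumption, because the guide says only that the count is typically one-fourth of the connections. It also treats all 25 GbE connections as one group, although Edge network connections may terminate on a different switch.

```python
import math

# 25 GbE connections per node, taken from Table 16.
CONNECTIONS_25GBE = {"master": 1, "edge": 2, "worker": 1}


def cable_counts(masters, edges, workers):
    """Return (1 GbE cables, 25 GbE connections, breakout cables)."""
    one_gbe_cables = masters + edges + workers  # one 1 GbE cable per node
    connections = (masters * CONNECTIONS_25GBE["master"]
                   + edges * CONNECTIONS_25GBE["edge"]
                   + workers * CONNECTIONS_25GBE["worker"])
    # Assumption: each breakout cable serves four connections, rounded up.
    breakout_cables = math.ceil(connections / 4)
    return one_gbe_cables, connections, breakout_cables


# Minimum configuration: three Master Nodes, one Edge Node, five Worker Nodes.
print(cable_counts(3, 1, 5))  # -> (9, 10, 3)
```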


Rack server hardware configurations

This Ready Architecture supports the Dell EMC PowerEdge R640 and Dell EMC PowerEdge R740xd servers using configurations for the following:

l Infrastructure Nodes

l Worker Nodes

l Edge Nodes

For more information about configuration, refer to Appendix A, which provides the recommended rack layout for Dell EMC PowerEdge R740xd clusters.

Infrastructure Nodes

Infrastructure Nodes host the critical cluster services. The configuration is optimized to reduce downtime and provide high performance.

The following table shows the recommended configuration.

Table 17 Hardware configurations: Dell EMC PowerEdge R640 Infrastructure Nodes

Components | Details
Platform | Dell EMC PowerEdge R640
Chassis | 2.5 in. chassis with up to 10 hard drives and 2 PCIe slots
Processor | Dual Intel Xeon Gold 6134 3.2 GHz (8 core) 24.75M Cache
RAM | 192 GB (12 x 16 GB 2,667 MT/s)
Network Daughter Card | Mellanox ConnectX-4 Lx Dual Port 25 GbE DA/SFP rNDC
Boot configuration | From PERC controller
Storage controller | Dell EMC PERC H740P 2 Gb NV Cache, Minicard
Disk - Spindles | 8 x 1 TB 7.2K RPM NLSAS 12 Gb/s
Disk - SSD | 2 x 480 GB SSD SAS mixed-use 12 Gb/s
Drive configuration | Combination of RAID 1, RAID 10, and dedicated drives

Note

Consult your Dell EMC account representative before changing the recommended disk sizes.

The Infrastructure Nodes (Master Node 1, Master Node 2, Master Node 3, and Edge Node) are configured as multiple partitions and file systems by using all available drives. Each partition is optimized for both performance and reliability.

The following table shows the recommended disk and partition layout for the Infrastructure Nodes.


Table 18 Dell EMC PowerEdge R640 Infrastructure Node volumes

Physical disks | Used by | Volume type
2-3 | Operating system | RAID 1
0 | ZooKeeper journal | Passthrough SSD
1 | NameNode journal | Passthrough SSD
4-5 | HDFS metadata | RAID 1
6-9 | Database storage | RAID 10

Table 19 Dell EMC PowerEdge R640 Infrastructure Node partitions

Disk | Partition | Mount point | Size | File system type | Description
Virtual 0 | Primary | /boot | 1,074 MB | ext4 | BIOS boot files that must be within the first 2 GB of disk
Virtual 0 | LVM | / | 100 GB | ext4 | Root file system
Virtual 0 | LVM | swap | 4 GB | swap | Operating system swap space partition
Virtual 0 | LVM | /home | 1 GB | ext4 | User home directories
Virtual 0 | LVM | /var | 825 GB | ext4 | Operational data directory for databases. It primarily contains the Cloudera Manager databases because the Postgres Data Directory (PGDATA) is typically in /var/lib/pgsql. Configure alternatives to Postgres to store their data files here.

Worker Nodes

Worker Nodes are the workhorses of the cluster. Worker Nodes combine compute and storage. Depending on the intended workload, they can be optimized for storage-heavy, compute-heavy, or mixed loads.

The following table shows a 2U chassis option using large form-factor (LFF) 3.5 in. drives for data. This option provides dense storage capability with high-performance compute and solid-state storage for fast caching of temporary data.

Table 20 Hardware configurations: Dell EMC PowerEdge R740xd Worker Nodes

Component | Details
Platform | Dell EMC PowerEdge R740xd server
Chassis | Chassis with up to 12 x 3.5 in. HDD, 4 x 3.5 in. HDDs on MP and 4 x 2.5 in. HDDs on Flex Bay
Processor | Dual Intel Xeon Gold 6140 2.3 GHz, 18 Core, 25 M Cache
RAM (minimum) | 384 GB (12 x 32 GB 2667 MT/s)
Network Daughter Card | Mellanox ConnectX-4 Lx Dual Port 25 GbE DA/SFP rNDC
Boot configuration | BOSS controller card + with 2 M.2 Sticks 240 GB
Storage controller | Dell EMC PERC HBA330 RAID Controller, 12 Gb Minicard
Disk - spindles | 16 x 4 TB 7.2 K RPM SATA 6 Gb/s 512n 3.5 in. hot-plug hard drive
Disk - SSD | 4 x 480 GB SSD SAS mixed-use 12 Gb/s
Drive configuration | RAID 1 - OS; JBOD - data drives

The following tables show the recommended disk and partition layout for the Worker Nodes.

Table 21 Dell EMC PowerEdge R740xd Worker Node volumes

Physical disks | Usage | Volume type
BOSS 0, BOSS 1 | Operating system | RAID 1
0-11 | HDFS data | Passthrough
12-15 | Selectable | Passthrough SSD
16-19 | HDFS data | Passthrough

Table 22 Dell EMC PowerEdge R740xd Worker Node partitions

Virtual disk | Partition | Mount point | Size | File system type | Description
DellBOSS 1 | Primary | /boot | 1074 MB | ext4 | Contains BIOS boot files that must be within the first 2 GB of disk
DellBOSS 2 | LVM | / | 100 GB | ext4 | Root file system
DellBOSS 3 | LVM | swap | 4 GB | swap | Operating system swap space partition
DellBOSS 4 | LVM | /home | 1 GB | ext4 | User home directories
DellBOSS 5 | LVM | /var | 117.5 GB | ext4 | Contains variable data such as system logging files, databases, mail and printer spool directories, and transient and temporary files
sda | Primary | /data/1 | 4096 GB | ext4 | Contains HDFS data
sdb | Primary | /data/2 | 4096 GB | ext4 | Contains HDFS data
sdn | Primary | /data/n | 4096 GB | ext4 | Contains HDFS data
ssd1 | Primary | /datassd/1 | 4096 GB | ext4 | Tiered HDFS Storage, Spark cache, MapReduce temp files, or HBase tiered cache
ssd2 | Primary | /datassd/2 | 4096 GB | ext4 | Tiered HDFS Storage, Spark cache, MapReduce temp files, or HBase tiered cache
ssd3 | Primary | /datassd/3 | 4096 GB | ext4 | Tiered HDFS Storage, Spark cache, MapReduce temp files, or HBase tiered cache
ssd4 | Primary | /datassd/4 | 4096 GB | ext4 | Tiered HDFS Storage, Spark cache, MapReduce temp files, or HBase tiered cache

Note

• Dell EMC does not recommend that you configure a large swap space. Due to the large and random performance degradation that might result, avoid swapping in a Hadoop cluster.

• Operating system partitions are configured with the Logical Volume Manager enabled.
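The data-drive mount points in the partition table map naturally onto the HDFS DataNode directory list. The following Python sketch builds a comma-separated value for the HDFS `dfs.datanode.data.dir` property from the `/data/<n>` mount points. The function name and the `dfs/dn` subdirectory are assumptions made for illustration; the guide itself specifies only the mount points.

```python
def hdfs_data_dirs(hdd_count: int = 16, subdir: str = "dfs/dn") -> str:
    """Build a dfs.datanode.data.dir value for mount points /data/1 .. /data/<hdd_count>.

    hdd_count defaults to the 16 spindles listed in the Worker Node hardware table.
    subdir is a hypothetical per-disk subdirectory, not taken from the guide.
    """
    if hdd_count < 1:
        raise ValueError("hdd_count must be at least 1")
    return ",".join(f"/data/{n}/{subdir}" for n in range(1, hdd_count + 1))


if __name__ == "__main__":
    # Print the directory list for the default 16-drive Worker Node.
    print(hdfs_data_dirs())
```

In a Cloudera Manager deployment this value would normally be set through the HDFS service configuration rather than edited by hand.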

Edge Nodes

Edge Nodes are the primary interface through which data moves in and out of the cluster. They are also used to run applications that access the cluster. Because of the wide variation in applications, Edge Node configurations can vary significantly. The main characteristic of Edge Nodes is a connection to the Cluster Data network and additional network connections for external access.

A common baseline choice for Edge Node configuration uses the same configuration as an Infrastructure Node, as shown in the following table.

Table 23 Hardware Configurations – Dell EMC PowerEdge R640 Edge Nodes

Component | Details
Platform | Dell EMC PowerEdge R640
Chassis | 2.5 in. Chassis with up to 10 Hard Drives and 2 PCIe slots
Processor | Dual Intel Xeon Gold 6134 3.2 GHz (8 Core) 24.75 M Cache
RAM | 192 GB (12 x 16 GB 2667 MT/s)
Network Daughter Card | Mellanox ConnectX-4 Lx Dual Port 25 GbE DA/SFP rNDC
Boot Configuration | From PERC controller
Storage Controller | Dell EMC PERC H740P Gb NV Cache, Minicard
Disk - Spindles | 8 x 1 TB 7.2 K RPM NLSAS 12 Gb/s
Disk - SSD | 2 x 480 GB SSD SAS mixed-use 12 Gb/s
Drive Configuration | Combination of RAID 1, RAID 10, and dedicated drives


Table 24 Dell EMC PowerEdge R640 Edge Node volumes

Physical disks | Used by | Volume type
2-3 | Operating system | RAID 1
0 | Spare Flash storage | Passthrough SSD
1 | Spare Flash storage | Passthrough SSD
4-9 | Determined by application | RAID 10, or determined by application

Table 25 Dell EMC PowerEdge R640 Edge Node partitions

Disk | Partition | Mount point | Size | File system type | Description
Virtual 0 | Primary | /boot | 1,074 MB | ext4 | BIOS boot files that must be within the first 2 GB of disk
Virtual 0 | LVM | / | 100 GB | ext4 | Root file system
Virtual 0 | LVM | swap | 4 GB | swap | Operating system swap space partition
Virtual 0 | LVM | /home | 1 GB | ext4 | User home directories
Virtual 0 | LVM | /var | 825 GB | ext4 | Operational data directory for databases. It primarily contains the Cloudera Manager databases because the Postgres Data Directory (PGDATA) is typically in /var/lib/pgsql. Configure alternatives to Postgres to store their data files here.

Node configuration

See Appendix A, Dell EMC PowerEdge R740xd Worker Nodes Physical Rack Configuration, for the recommended rack layout for Dell EMC PowerEdge R740xd clusters.

Infrastructure Node sizing

The hardware configuration for Infrastructure Nodes supports petabyte-scale clusters, based on the number of HDFS blocks that are used.

The size of the HDFS metadata storage must be adjusted for clusters:

• Larger than 250 nodes

• With per-node HDFS storage larger than 64 TB

• With very large HDFS block sizes
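The three conditions above can be expressed as a simple pre-deployment check. This is a minimal Python sketch, not a Dell EMC or Cloudera tool. The function name is hypothetical, and because the guide does not define "very large" block sizes, the `large_block_mb` threshold is an assumed placeholder that should be replaced with a value agreed for the site.

```python
def metadata_resize_reasons(node_count: int,
                            per_node_storage_tb: float,
                            block_size_mb: int,
                            large_block_mb: int = 512) -> list:
    """Return the reasons (if any) to revisit HDFS metadata storage sizing.

    large_block_mb is an assumed threshold; the guide gives no number
    for "very large" HDFS block sizes.
    """
    reasons = []
    if node_count > 250:
        reasons.append("cluster is larger than 250 nodes")
    if per_node_storage_tb > 64:
        reasons.append("per-node HDFS storage is larger than 64 TB")
    if block_size_mb >= large_block_mb:
        reasons.append("HDFS block size is very large (assumed threshold)")
    return reasons


if __name__ == "__main__":
    # A 300-node cluster of 80 TB nodes with 128 MB blocks.
    for reason in metadata_resize_reasons(300, 80, 128):
        print(reason)
```

An empty list means that none of the listed conditions applies and the default Infrastructure Node configuration is the starting point.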

Approximately 2 TB of RAID 10 storage is available for the operational databases, including the Hive metastore and the Cloudera Manager databases. This storage is enough for a typical large cluster. You might need to adjust the size of this partition for very large clusters.

The load on Master Node 3 is lower than on the other Infrastructure Nodes. Master Node 3 nevertheless uses the same configuration as Master Node 1 and Master Node 2, which simplifies operational hardware maintenance and allows Master Node 3 to be used as a spare node. You can specialize the configuration of this node, if necessary.


Worker Node sizing

Storage sizing

Drive capacities greater than 4 TB or node storage density greater than 48 TB require special consideration for HDFS setup. Configurations of this size approach the limit of Hadoop per-node storage capacity. At a minimum, the HDFS block size must be no less than 128 MB and can be as large as 1,024 MB. Because the number of files, blocks per file, compression, and reserved space factor into the calculations, the configuration requires an analysis of the intended cluster usage and data.

When sizing nodes, per-node density also has an impact on cluster performance if nodes fail. The bandwidth that is required to replicate the lost data affects overall performance, the time that is required to finish the recovery is lengthy, and data is under-replicated and at risk during the recovery.
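To make the recovery trade-off concrete, the following Python sketch estimates how long re-replication might take after a node failure. It is a rough back-of-the-envelope model only: the function name is hypothetical, and the effective re-replication bandwidth is an assumed input that in practice depends on cluster size, HDFS throttling, and competing workloads.

```python
def recovery_hours(lost_data_tb: float, effective_gbps: float) -> float:
    """Estimate hours needed to re-replicate lost_data_tb of data.

    lost_data_tb   -- data stored on the failed node, in decimal terabytes
    effective_gbps -- assumed aggregate re-replication bandwidth in Gb/s
    """
    if lost_data_tb < 0 or effective_gbps <= 0:
        raise ValueError("data must be non-negative and bandwidth positive")
    bits_to_copy = lost_data_tb * 1e12 * 8
    seconds = bits_to_copy / (effective_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # Compare a 48 TB node with a 96 TB node at the same assumed bandwidth.
    for tb in (48, 96):
        print(f"{tb} TB: about {recovery_hours(tb, 10):.1f} hours")
```

Doubling per-node density doubles the estimated window during which data is under-replicated, which is the reason density is treated as a sizing constraint.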

CAUTION

Do not configure a single Worker Node with more than 100 TB of storage.

Note

Your Dell EMC representative can assist you with estimates and calculations.
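The storage thresholds in this section can be sketched as a small classification helper. The Python below is illustrative only; the function names and the category labels are assumptions, while the numeric limits (4 TB drives, 48 TB and 100 TB per node, 128 MB to 1,024 MB block size) are taken from the text above.

```python
def assess_node_storage(drive_count: int, drive_tb: float) -> str:
    """Classify a Worker Node storage configuration against the guide's limits."""
    total_tb = drive_count * drive_tb
    if total_tb > 100:
        return "exceeds-limit"          # more than 100 TB on a single Worker Node
    if drive_tb > 4 or total_tb > 48:
        return "special-consideration"  # needs an HDFS setup review
    return "standard"


def block_size_in_range(block_size_mb: int) -> bool:
    """Check the HDFS block size against the 128 MB to 1,024 MB range."""
    return 128 <= block_size_mb <= 1024


if __name__ == "__main__":
    # The recommended Worker Node uses 16 x 4 TB data drives (64 TB raw).
    print(assess_node_storage(16, 4))
    print(block_size_in_range(128))
```

With the recommended 16 x 4 TB configuration, the helper returns the special-consideration category, because 64 TB is above the 48 TB density threshold while remaining below the 100 TB limit.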

Node subtypes

You can configure Worker Nodes to match their intended use in the cluster. We recommend the subtypes that are described in the following table.

Table 26 Specialized Worker Node subtypes

Subtype | Cloudera category | Workload/usage | Mount point
Generic | Data Engineering | Batch processing, data lake, data pipelines, Spark, and MapReduce | /datassd/<n>
HBase | Operational Database | HBase with inserts, updates, and queries | /hbase/<n>
Kudu | Analytical Database | Interactive and analytical queries with Kudu and Impala | /kudu/<n>
Tiered | Data Engineering | Batch processing/HDFS with tiered storage | /datassd/<n>

• Generic: The node is configured for general Hadoop data engineering operations. Rotational storage is configured for HDFS use, and solid-state drives are configured for temporary file storage for Spark and MapReduce spill data.

• HBase: The node is configured for operational usage of HBase. Solid-state drives are configured for use as the HBase write-ahead log and bucket cache.

• Kudu: The node is configured for Kudu and Impala analytical operations. Solid-state drives are set up for the Kudu write-ahead log, Impala scratch space, and general temporary files.

• Tiered: The node is configured for general Hadoop data engineering operations. Solid-state drives are configured as a separate HDFS storage tier.
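The subtype-to-mount-point mapping can be captured in a small lookup, sketched below in Python. The helper names are hypothetical. For the Tiered subtype, the sketch also shows how HDFS storage-type tags (`[DISK]` and `[SSD]`, as used in the `dfs.datanode.data.dir` property for HDFS tiered storage) might be applied to the mount points. The drive counts are assumptions based on the Worker Node hardware table.

```python
# SSD mount point templates per Worker Node subtype, from the subtype table.
MOUNT_TEMPLATES = {
    "Generic": "/datassd/{n}",
    "HBase": "/hbase/{n}",
    "Kudu": "/kudu/{n}",
    "Tiered": "/datassd/{n}",
}


def ssd_mount_points(subtype: str, ssd_count: int = 4) -> list:
    """Return the SSD mount points for a subtype (4 SSDs assumed by default)."""
    template = MOUNT_TEMPLATES[subtype]  # raises KeyError for unknown subtypes
    return [template.format(n=n) for n in range(1, ssd_count + 1)]


def tiered_data_dirs(hdd_count: int = 16, ssd_count: int = 4) -> str:
    """Build a storage-type-tagged data directory list for the Tiered subtype."""
    disk = [f"[DISK]/data/{n}" for n in range(1, hdd_count + 1)]
    ssd = [f"[SSD]{path}" for path in ssd_mount_points("Tiered", ssd_count)]
    return ",".join(disk + ssd)


if __name__ == "__main__":
    print(ssd_mount_points("HBase"))
    print(tiered_data_dirs(hdd_count=2, ssd_count=1))
```

Keeping the mapping in one table makes it easier to generate consistent mount configurations when a cluster mixes several Worker Node subtypes.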

Processors

The recommended Intel Xeon SP processors provide the best balance among several characteristics that include cost, performance, and power consumption. Alternative processors can be used to optimize different scenarios. The recommended processors include dual AVX-512 units for the highest performance on analytic applications.

Memory sizing

The base recommendation of 384 GB assumes a mixture of MapReduce, query, and computational analytic workloads. Spark can take advantage of substantially higher memory footprints for RDD caches. Additional memory can be useful for Spark workloads. You might need to change the Spark spark.memory.offHeap.enabled configuration setting to take advantage of the larger available memory. You can also configure HBase with a large memory cache. Many HBase workloads can benefit from the larger cache from larger memory configurations.
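As an illustration of the Spark setting mentioned above, the following Python sketch assembles `--conf` arguments for `spark-submit` that enable off-heap memory. Spark pairs `spark.memory.offHeap.enabled` with `spark.memory.offHeap.size`, which must be set to a positive value when off-heap memory is enabled. The helper name is hypothetical and the 16 GB size is an arbitrary example, not a recommendation from this guide.

```python
def offheap_spark_args(offheap_gb: int) -> list:
    """Return spark-submit --conf arguments that enable off-heap memory."""
    if offheap_gb <= 0:
        raise ValueError("off-heap size must be positive when off-heap is enabled")
    settings = {
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_gb}g",
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example size only; tune to the workload and the memory installed per node.
    print(" ".join(offheap_spark_args(16)))
```

The appropriate size depends on executor counts and on how much memory is reserved for HDFS, YARN, and other services on the node.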

When changing memory configurations, keep the number of DIMMs to 6 or 12 per processor, if possible. Having fewer than six DIMMs per processor incurs a significant performance penalty because not all memory channels are used. Using more than two DIMMs per processor channel slightly reduces memory speed.
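The DIMM population rule can be checked with a few lines of arithmetic. The Python sketch below is a hypothetical helper, not a Dell EMC utility. It assumes that DIMMs are split evenly across processors and applies the "6 or 12 per processor" guidance from the paragraph above.

```python
def assess_dimms(total_dimms: int, dimm_gb: int, processors: int = 2) -> dict:
    """Summarize a DIMM configuration against the 6-or-12-per-processor guidance."""
    if total_dimms % processors:
        raise ValueError("DIMMs are assumed to be split evenly across processors")
    per_processor = total_dimms // processors
    return {
        "capacity_gb": total_dimms * dimm_gb,
        "dimms_per_processor": per_processor,
        # Fewer than six DIMMs per processor leaves memory channels unused.
        "all_channels_populated": per_processor >= 6,
        "recommended": per_processor in (6, 12),
    }


if __name__ == "__main__":
    # Base Worker Node memory: 12 x 32 GB across two processors.
    print(assess_dimms(12, 32))
```

For the base Worker Node configuration this yields 384 GB with six DIMMs per processor, which matches the recommendation in the hardware table.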

Edge Node sizing

The baseline Edge Node configuration uses the same configuration as an Infrastructure Node.

In practice, Edge Nodes are typically customized based on their intended purpose. The first Edge Node typically runs Cloudera Manager, but it has additional capacity to run other applications.

If the Edge Node is used to run driver or front-end programs for cluster applications, additional memory might be required. The processors rarely need to be upgraded.

For instances in which Edge Nodes are used for streaming data or general ingest operations, you can configure disk space as a staging area for input data. The recommended chassis configuration has 10 x 2.5 in. drive bays available, which can be configured with HDDs or SSDs as needed with no restrictions beyond the underlying server platform restrictions. You can use alternative server platforms when appropriate.

Edge Nodes include dual 25 GbE connections. One connection is used for the cluster data network, and the other connection is available for interfacing to the core network or external networks. You can add more network interfaces for additional capacity or you can use alternative network interface cards. We recommend using a 25 GbE connection to the cluster data network. However, you can use a lower-speed connection if it is adequate for your environment.


CHAPTER 4

References

This chapter presents the following topics:

• Cloudera partnership and certification
• Dell EMC Customer Solution Centers
• Technical support


Cloudera partnership and certification

Note

Cloudera, Inc., as a result of a merger transaction, is now the parent company of Hortonworks, Inc.

Cloudera is a key contributor to the Apache Hadoop project. CDH, the Cloudera distribution including Apache Hadoop, is a highly scalable open-source platform for high-volume data management and analytics. CDH integrates with existing enterprise IT infrastructure, enabling data engineers and data scientists to quickly and easily develop and deploy Hadoop applications in a cost-efficient manner.

Dell EMC is a Platinum member of the Cloudera IHV Program. Platinum membership is the highest level of partnership and indicates Dell EMC's ongoing commitment to Cloudera and our customers.

The Dell EMC infrastructure in this guide is Cloudera-certified.

Dell EMC Customer Solution Centers

Our global network of dedicated Dell EMC Customer Solution Centers provides trusted environments where world-class IT experts collaborate with customers and prospects to share best practices, facilitate in-depth discussions of effective business strategies using briefings, workshops, or proofs of concept (PoCs), and help businesses become more successful and competitive. Dell EMC Customer Solution Centers reduce the risks associated with new technology investments and can help improve speed of implementation.


Technical support

The following table shows the supported components and operating environments for this Ready Architecture.

Table 27 Solution Support Matrix

Category | Component | Available support
Operating system | Red Hat Enterprise Linux Server | Red Hat Linux support
Operating system | CentOS | Dell EMC Hardware support
Java Virtual Machine | Sun Oracle JVM | Not available
Hadoop | Cloudera Enterprise | Cloudera support
Hadoop | Cloudera Manager | Cloudera support
Hadoop | Cloudera Navigator | Cloudera support


APPENDIX A

Dell EMC PowerEdge R740xd Worker Nodes Physical Rack Configuration

This appendix contains suggested rack layouts for single-rack, single-pod, and multiple-pod installations. Rack layouts vary depending on power, cooling, and loading constraints.

• Worker Nodes single-rack configuration
• Worker Nodes initial rack configuration
• Worker Nodes additional pod rack configuration


Worker Nodes single-rack configuration

Table 28 Single-rack configuration: Worker Nodes

RU | RACK1
42 | R1 - Switch 1: Dell EMC Networking S5048-ON
41 | Cable management
40 | Cable management
39 | Cable management
38 | R1 - Dell EMC Networking S3048-ON iDRAC Management switch
37 | Cable management
36 | Cable management
35-29 | Empty
28-27 | Edge01: Dell EMC PowerEdge R640
26-25 | Master Node 1: Dell EMC PowerEdge R640
24-23 | Master Node 2: Dell EMC PowerEdge R640
22-21 | Master Node 3: Dell EMC PowerEdge R640
20-19 | Empty
18-17 | Empty
16-15 | R1 - Chassis08: Dell EMC PowerEdge R740xd
14-13 | R1 - Chassis07: Dell EMC PowerEdge R740xd
12-11 | R1 - Chassis06: Dell EMC PowerEdge R740xd
10-9 | R1 - Chassis05: Dell EMC PowerEdge R740xd
8-7 | R1 - Chassis04: Dell EMC PowerEdge R740xd
6-5 | R1 - Chassis03: Dell EMC PowerEdge R740xd
4-3 | R1 - Chassis02: Dell EMC PowerEdge R740xd
2-1 | R1 - Chassis01: Dell EMC PowerEdge R740xd
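A rack layout such as the single-rack configuration can be sanity-checked programmatically before it is handed to a data center team. The Python sketch below encodes the layout as (top RU, bottom RU, label) ranges and verifies that the ranges cover a 42U rack from top to bottom without gaps or overlaps. The data structure, the shortened labels, and the function name are assumptions made for illustration.

```python
# (top RU, bottom RU, label), transcribed from the single-rack table above.
SINGLE_RACK = [
    (42, 42, "Switch 1: S5048-ON"),
    (41, 39, "Cable management"),
    (38, 38, "S3048-ON iDRAC management switch"),
    (37, 36, "Cable management"),
    (35, 29, "Empty"),
    (28, 27, "Edge01: R640"),
    (26, 25, "Master Node 1: R640"),
    (24, 23, "Master Node 2: R640"),
    (22, 21, "Master Node 3: R640"),
    (20, 17, "Empty"),
] + [(2 * n, 2 * n - 1, f"Chassis{n:02d}: R740xd") for n in range(8, 0, -1)]


def free_rack_units(layout, rack_height: int = 42) -> int:
    """Validate contiguous top-to-bottom coverage and return the count of empty RUs."""
    expected_top = rack_height
    free = 0
    for top, bottom, label in layout:
        if top != expected_top or bottom > top:
            raise ValueError(f"gap or overlap at RU {top} ({label})")
        if label == "Empty":
            free += top - bottom + 1
        expected_top = bottom - 1
    if expected_top != 0:
        raise ValueError("layout does not reach RU 1")
    return free


if __name__ == "__main__":
    print("Empty rack units:", free_rack_units(SINGLE_RACK))
```

The same check can be applied to each rack column of the pod layouts by transcribing the corresponding table rows.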

Worker Nodes initial rack configuration

Table 29 Initial pod rack configuration: Dell EMC PowerEdge R740xd Worker Nodes

RU | RACK1 | RACK2 | RACK3
42 | Empty | R2 - Switch 1: Dell EMC Networking S5048-ON | Empty
41 | Empty | Empty | Empty
40 | Cable management | Cable management | Cable management
39 | Cable management | Cable management | Cable management
38 | R1 - Dell EMC Networking S3048-ON iDRAC Management switch | R2 - Dell EMC Networking S3048-ON iDRAC Management switch | R3 - Dell EMC Networking S3048-ON iDRAC Management switch
37 | Cable management | Cable management | Cable management
36 | Cable management | Cable management | Cable management
35 | Master Node 1: Dell EMC PowerEdge R640 | Edge01: Dell EMC PowerEdge R640 | R3 - Switch 1: Dell EMC Networking Z9100-ON
34 | Empty | Empty | Empty
33-32 | Empty | Master Node 2: Dell EMC PowerEdge R640 | Master Node 3: Dell EMC PowerEdge R640
31-21 | Empty | Empty | Empty
20-19 | R1 - Chassis10: Dell EMC PowerEdge R740xd | R2 - Chassis10: Dell EMC PowerEdge R740xd | R3 - Chassis10: Dell EMC PowerEdge R740xd
18-17 | R1 - Chassis09: Dell EMC PowerEdge R740xd | R2 - Chassis09: Dell EMC PowerEdge R740xd | R3 - Chassis09: Dell EMC PowerEdge R740xd
16-15 | R1 - Chassis08: Dell EMC PowerEdge R740xd | R2 - Chassis08: Dell EMC PowerEdge R740xd | R3 - Chassis08: Dell EMC PowerEdge R740xd
14-13 | R1 - Chassis07: Dell EMC PowerEdge R740xd | R2 - Chassis07: Dell EMC PowerEdge R740xd | R3 - Chassis07: Dell EMC PowerEdge R740xd
12-11 | R1 - Chassis06: Dell EMC PowerEdge R740xd | R2 - Chassis06: Dell EMC PowerEdge R740xd | R3 - Chassis06: Dell EMC PowerEdge R740xd
10-9 | R1 - Chassis05: Dell EMC PowerEdge R740xd | R2 - Chassis05: Dell EMC PowerEdge R740xd | R3 - Chassis05: Dell EMC PowerEdge R740xd
8-7 | R1 - Chassis04: Dell EMC PowerEdge R740xd | R2 - Chassis04: Dell EMC PowerEdge R740xd | R3 - Chassis04: Dell EMC PowerEdge R740xd
6-5 | R1 - Chassis03: Dell EMC PowerEdge R740xd | R2 - Chassis03: Dell EMC PowerEdge R740xd | R3 - Chassis03: Dell EMC PowerEdge R740xd
4-3 | R1 - Chassis02: Dell EMC PowerEdge R740xd | R2 - Chassis02: Dell EMC PowerEdge R740xd | R3 - Chassis02: Dell EMC PowerEdge R740xd
2-1 | R1 - Chassis01: Dell EMC PowerEdge R740xd | R2 - Chassis01: Dell EMC PowerEdge R740xd | R3 - Chassis01: Dell EMC PowerEdge R740xd

Worker Nodes additional pod rack configuration

Table 30 Additional pod rack configuration: Dell EMC PowerEdge R740xd Worker Nodes

RU | RACK1 | RACK2 | RACK3
42 | Empty | R2 - Switch 1: Dell EMC Networking S5048-ON | Empty
41 | Empty | Empty | Empty
40 | Cable management | Cable management | Cable management
39 | Cable management | Cable management | Cable management
38 | R1 - Dell EMC Networking S3048-ON iDRAC Management switch | R2 - Dell EMC Networking S3048-ON iDRAC Management switch | R3 - Dell EMC Networking S3048-ON iDRAC Management switch
37 | Cable management | Cable management | Cable management
36 | Cable management | Cable management | Cable management
35-25 | Empty | Empty | Empty
24-23 | R1 - Chassis12: Dell EMC PowerEdge R740xd | R2 - Chassis12: Dell EMC PowerEdge R740xd | R3 - Chassis12: Dell EMC PowerEdge R740xd
22-21 | R1 - Chassis11: Dell EMC PowerEdge R740xd | R2 - Chassis11: Dell EMC PowerEdge R740xd | R3 - Chassis11: Dell EMC PowerEdge R740xd
20-19 | R1 - Chassis10: Dell EMC PowerEdge R740xd | R2 - Chassis10: Dell EMC PowerEdge R740xd | R3 - Chassis10: Dell EMC PowerEdge R740xd
18-17 | R1 - Chassis09: Dell EMC PowerEdge R740xd | R2 - Chassis09: Dell EMC PowerEdge R740xd | R3 - Chassis09: Dell EMC PowerEdge R740xd
16-15 | R1 - Chassis08: Dell EMC PowerEdge R740xd | R2 - Chassis08: Dell EMC PowerEdge R740xd | R3 - Chassis08: Dell EMC PowerEdge R740xd
14-13 | R1 - Chassis07: Dell EMC PowerEdge R740xd | R2 - Chassis07: Dell EMC PowerEdge R740xd | R3 - Chassis07: Dell EMC PowerEdge R740xd
12-11 | R1 - Chassis06: Dell EMC PowerEdge R740xd | R2 - Chassis06: Dell EMC PowerEdge R740xd | R3 - Chassis06: Dell EMC PowerEdge R740xd
10-9 | R1 - Chassis05: Dell EMC PowerEdge R740xd | R2 - Chassis05: Dell EMC PowerEdge R740xd | R3 - Chassis05: Dell EMC PowerEdge R740xd
8-7 | R1 - Chassis04: Dell EMC PowerEdge R740xd | R2 - Chassis04: Dell EMC PowerEdge R740xd | R3 - Chassis04: Dell EMC PowerEdge R740xd
6-5 | R1 - Chassis03: Dell EMC PowerEdge R740xd | R2 - Chassis03: Dell EMC PowerEdge R740xd | R3 - Chassis03: Dell EMC PowerEdge R740xd
4-3 | R1 - Chassis02: Dell EMC PowerEdge R740xd | R2 - Chassis02: Dell EMC PowerEdge R740xd | R3 - Chassis02: Dell EMC PowerEdge R740xd
2-1 | R1 - Chassis01: Dell EMC PowerEdge R740xd | R2 - Chassis01: Dell EMC PowerEdge R740xd | R3 - Chassis01: Dell EMC PowerEdge R740xd


APPENDIX B

Tested Component Versions

This appendix describes the versions of software and firmware that were used during validation of this Ready Architecture:

• Software versions
• Network switch firmware versions
• Dell EMC PowerEdge R640 firmware versions
• Dell EMC PowerEdge R740xd firmware versions


Software versions

Table 31 Software versions

Component | Version
Operating system | Red Hat Enterprise Linux 7.6
CDH | 6.1.1
HDFS | 3.0.0
YARN | 3.0.0
MapReduce2 | 3.0.0
Hive | 2.1.1
ZooKeeper | 3.4.5
Spark2 | 2.4.0

Network switch firmware versions

Table 32 Network switch firmware versions

Component | Version
Dell EMC S5048-ON | 9.12(1.0)
Dell EMC S3048-ON | 9.11(2.4)

Dell EMC PowerEdge R640 firmware versions

Table 33 Dell EMC PowerEdge R640 firmware versions

Component | Version
BIOS | 1.6.12
iDRAC with LC | 3.21.26.22
Mellanox ConnectX-4 LX 25 GbE SFP Rack NDC | 14.23.15.04
Driver for operating system deployment | 18.10.17
Dell 12 Gb expander firmware | 2.25
PERC H740P | 50.5.0.1750
CPLD | 1.0.2


Dell EMC PowerEdge R740xd firmware versions

Table 34 Dell EMC PowerEdge R740xd firmware versions

Component | Version
BIOS | 1.6.12
iDRAC with LC | 3.21.26.22
Mellanox ConnectX-4 LX 25 GbE SFP Rack NDC | 14.23.15.04
Driver for operating system deployment | 18.10.17
Dell 12 Gb expander firmware | 2.25
HBA 330 mini | 16.17.00.03
CPLD | 1.0.6
Nonexpander storage backplane | 4.26


GLOSSARY

A

API Application Programming Interface

ASCII American Standard Code for Information Interchange, a binary code for alphanumeric characters developed by ANSI.

B

BIOS Basic input/output system

BMC Baseboard Management Controller

BMP Bare Metal Provisioning

C

CDH The Cloudera distribution including Apache Hadoop

Clos A multiple-stage, non-blocking network switch architecture. It reduces the number of required ports within a network switch fabric.

CMC Chassis Management Controller

CRM Customer Relationship Management

D

DBMS Database Management System

DGI Data Governance Initiative

DTK Dell EMC OpenManage Deployment Toolkit

E

EBCDIC Extended Binary Coded Decimal Interchange Code, a binary code for alphanumeric characters developed by IBM.

ECMP Equal Cost Multi-Path

EDW Enterprise Data Warehouse

EoR End-of-Row Switch/Router

ERP Enterprise Resource Planning

ETL Extract, Transform, Load is a process for extracting data from various data sources, transforming the data into proper structure for storage, and then loading the data into a data store.

F

FQDN A Fully Qualified Domain Name (FQDN) is the portion of an Internet Uniform Resource Locator that fully identifies the server to which an Internet request is addressed. The FQDN includes the second-level domain name, such as dell.com, and any other levels as required.

H

HBA Host Bus Adapter

HDF Cloudera DataFlow

HDFS Hadoop Distributed File System

HVE Hadoop Virtualization Extensions

I

iDRAC Integrated Dell Remote Access with Lifecycle Controller

IPMI Intelligent Platform Management Interface

J

JBOD Just a Bunch of Disks

JDBC Java Database Connectivity

JDK Java Development Kit

K

KPI Key Performance Indicator


L

LACP Link Aggregation Control Protocol

LAG Link Aggregation Group

LOM Local Area Network on Motherboard

M

MTU A maximum transmission unit is the largest size packet or frame, in octets, that can be sent over a packet/frame-based computer network.

N

NIC Network Interface Card

NTP Network Time Protocol

NVM Node Version Manager

O

OS Operating system

OS-HCTK A configuration utility with sample scripts and configuration files that is used to automate the setup and configuration of BIOS and RAID settings for Dell EMC servers in OpenStack and Hadoop open source software solutions.

P

PAM Pluggable Authentication Modules, a centralized authentication method for Linux systems.

Q

QSFP Quad Small Form-factor Pluggable

R

RAID Redundant Array of Independent Disks

REST Representational State Transfer

RPM Red Hat Package Manager

RSTP Rapid Spanning Tree Protocol


RTO Recovery Time Objective

RU A Rack Unit measures 1.75 inches, or 44.45 mm, in a 19-inch or 23-inch electronic equipment rack frame.

S

SIEM Security Information and Event Management

SLA Service Level Agreement

SSD Solid-state Drive (or Solid-state Disk)

T

THP Transparent Huge Pages

ToR Top-of-Rack Switch/Router

U

UID A code identifying each user on a Unix or Unix-like computer system

V

VLT Virtual Link Trunking

VRRP Virtual Router Redundancy Protocol

Y

YARN Yet Another Resource Negotiator

