HPE Reference Configuration for HPE Apollo 4200 Gen10 with Hadoop 3 Reference Architecture

Contents

Executive summary
HPE Apollo 4200 Gen10 server
HPE Apollo 4200 Gen10 enhancements
Major Hadoop 3 distros
    Hortonworks Data Platform overview
    Cloudera Enterprise overview
    MapR Enterprise Edition overview
Big Data infrastructure designs
Solution overview
    HPE Apollo 4200 within the HPE Elastic Platform for Big Data Analytics architecture
    HPE Apollo 4200 for Traditional Big Data Analytics
Solution components and configuration guide
    Single-rack Reference Configuration
    Pre-deployment considerations
    High Availability considerations
    Software components for control blocks
    Software components for compute/storage blocks
Capacity and sizing
Best practices and tuning guidelines
HPE Sizer for the Elastic Platform for Big Data Analytics
HPE Performance Cluster Manager
HPE Pointnext Services
Summary
Appendix A: HPE Pointnext value-added services and support
Resources and additional links

Executive summary

This HPE Reference Configuration provides a recommended solution for running analytics workloads with Hadoop 3 (Hortonworks Data Platform, Cloudera Enterprise, and MapR Enterprise Edition) on the HPE Apollo 4200 Gen10 server.

This Reference Configuration (RC) describes deployment options for Hadoop 3 using HPE Apollo 4200 Gen10 servers as building blocks in the HPE Elastic Platform for Big Data Analytics (EPA) architecture or for a traditional cluster. The HPE EPA is a big data analytics infrastructure solution based on modular building blocks of compute and storage optimized for modern workloads. The Apollo 4200 Gen10 servers are cost-effective industry-standard storage servers that provide exceptional storage density in a 2U form factor that holds up to 28 LFF disks. This RC also provides suggested configurations that highlight the benefits of a building block approach to address the diverse processing and storage requirements typical of modern big data platforms.

For configuration-level details on HPE EPA solutions, refer to the HPE Reference Configuration for Elastic Platform for Big Data Analytics. The HPE Enterprise Library provides a comprehensive list of Reference Architectures, Reference Configurations, and technical articles on big data at http://h17007.www1.hpe.com/us/en/enterprise/reference-architecture/info-library/index.aspx?workload=big_data.

Target audience: This document is intended for decision makers, system and solution architects, system administrators, and experienced users who are interested in reducing the time to design and purchase an HPE and Hadoop solution. An intermediate knowledge of Apache Hadoop and scale-out infrastructure is recommended.

Document purpose: The purpose of this document is to highlight the key benefits of the HPE Apollo 4200 Gen10 for technical audiences and provide guidance for end users on selecting the right configuration for building an elastic or traditional Hadoop cluster using the Apollo 4200 Gen10. Sample configurations for both are provided in this document.

This white paper describes testing performed on HPE Apollo 4200 Gen10 in October 2018.

HPE Apollo 4200 Gen10 server

The HPE Apollo 4200 Gen10 server offers an architecture optimized for Big Data Analytics and other data storage intensive workloads. Its unique, easily serviceable 2U design saves data center space. It delivers accelerated performance with superior bandwidth and a balanced architecture, Intel® Xeon® Scalable Processor Family, and NVMe connected SSDs. The focus on security extends from FIPS 140-2 Level 1 validated storage controllers down to the system silicon level, taking full advantage of HPE innovations in firmware protection, malware detection, and recovery.

• The HPE Apollo 4200 Gen10 server offers revolutionary storage and compute density in a 2U form factor. It provides excellent storage capacity along with an unprecedented selection of processors to match data-intensive workloads. Its unique density-optimized 2U form factor, which holds up to 28 LFF or 54 SFF hot-plug drives, saves valuable data center space.

• HPE Smart Array P408i-a and P816i-a controllers provide increased I/O throughput performance, resulting in a significant performance increase for I/O bound Hadoop workloads (a common use case) and the flexibility for the customer to choose the desired amount of resilience in the Hadoop cluster with either Just a Bunch of Disks (JBOD) or various RAID configurations.

• The HPE iLO management engine on the servers contains HPE Integrated Lights-Out 5 (iLO 5) and features a complete set of embedded management features for HPE Power/Cooling, Agentless Management, Active Health System, and Intelligent Provisioning which reduces node and cluster level administration costs for Hadoop.

• The new features of the HPE Apollo 4200 Gen10 server include:

– Intel Xeon Processor Scalable Family, up to 24 cores

– Faster memory with sixteen (16) HPE SmartMemory DDR4 2666 MT/s, 6 channels per processor

– Up to 62% higher performance across processors with three (3) Intel Ultra Path Interconnect (UPI) design

– Up to six (6) U.2 SFF NVMe SSD support for metadata or caching applications

– Generation iLO 5 to cover security, performance, and management

– Support for an iLO Service Port on the front of the server

– Dual (2) dedicated iLO NIC ports to save infrastructure with rack level deployment

Figure 1. HPE Apollo 4200 Gen10 server

For more details on the HPE Apollo 4200 Gen10 server, visit HPE High-Performance Computing Solutions.

HPE Apollo 4200 Gen10 enhancements

The table below details the main enhancements between the HPE Apollo 4200 Gen9 and HPE Apollo 4200 Gen10 server. Particularly important to Hadoop and big data analytics are the new Intel Xeon Scalable processor upgrades with additional cores and 62% higher inter-processor bandwidth due to the Intel Ultra Path Interconnect (Intel UPI). In addition, 6 memory channels per CPU and faster memory increase the memory bandwidth by greater than 50%.

Compute-intensive big data workloads take advantage of these increases in CPU and memory performance. Apache Spark in particular benefits from the faster memory-to-memory transfers across nodes, along with the use of 25GbE cards on the HPE Gen10 servers.
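As a rough cross-check of the memory bandwidth claim, the theoretical peak per processor can be worked out from the channel counts and transfer rates listed in Table 1. The sketch below is illustrative only: it assumes the standard 64-bit (8-byte) DDR4 channel width and computes theoretical peaks rather than measured throughput, and the function name is made up for this example.

```python
# Theoretical peak memory bandwidth per processor, using the Gen9 and Gen10
# figures from Table 1. Assumption: each DDR4 channel moves 8 bytes per transfer.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int) -> float:
    """Return the theoretical peak bandwidth in GB/s for one processor."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


gen9 = peak_bandwidth_gbs(2400, 4)   # Gen9: 2400 MT/s, 4 channels per CPU
gen10 = peak_bandwidth_gbs(2666, 6)  # Gen10: 2666 MT/s, 6 channels per CPU
increase_pct = (gen10 / gen9 - 1) * 100

print(f"Gen9 : {gen9:.1f} GB/s per processor")
print(f"Gen10: {gen10:.1f} GB/s per processor")
print(f"Theoretical increase: {increase_pct:.0f}%")
```

Under these assumptions, the move from 4 to 6 channels alone accounts for a 50% increase, and the faster transfer rate raises the theoretical figure further.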

Table 1. HPE Apollo 4200 Gen10 enhancements

| Specifications | HPE Apollo 4200 Gen9 System | HPE Apollo 4200 Gen10 System |
|---|---|---|
| Processor | Intel Xeon E5-2600 v3/v4 product family | Intel Xeon Scalable processors (8100, 6100, 5100, and 4100 series) |
| Processors / core / speed | (2) processors; up to 22 cores 145W | (2) processors; up to 24 cores 150W |
| Memory (type, max, slots) | Supports up to 2400MT/s DDR4 SmartMemory; 4 channels per CPU; 1 TB Max with 64GB LRDIMM@2400 MHz; 16 DIMM slots | Supports up to 2666MT/s DDR4 SmartMemory; 6 channels per CPU, 50% greater bandwidth; 1 TB Max with 64GB LRDIMM@2666 MT/s, 16 DIMM slots |
| Drive Bays | Front: Up to 24 LFF or 48 SFF in the two front HDD Cages; Optional Rear HDD Cages: 4LFF, 2SFF+2HHHL PCIe, or 6SFF; Optional M.2 kits | Front: Up to 24 LFF or 48 SFF in the two front HDD Cages; Optional Rear HDD Cages: 4LFF, 2SFF+2FHHL PCIe (supports (2) UFF Dual M.2), or 6NVMe; Optional M.2 kits |
| Network controller | Embedded dual 1Gb NIC; FlexibleLOM/PCIe Standup | Embedded dual 1Gb NIC; PCIe Standup ((1) 16x PCIe Gen3 slot from each processor) |
| Infrastructure management | iLO 4 Management (standard), Intelligent Provisioning (standard), iLO Advanced (optional) | HPE iLO 5 Management (standard), (2) iLO dedicated management ports; Intelligent Provisioning (standard), UEFI, iLO Advanced (optional), HPE OneView Advanced (optional) |
| Power Supply – Hot Plug | (2) HPE 800W or 1400W, Flex Slot Power Supplies (AC/DC/277V AC) | (2) HPE 800W or 1600W, Flex Slot Power Supplies (AC/DC/277V AC) |
| Storage Controller | (1) HPE Dynamic Smart Array B140i; (1) HPE Smart Array P840ar; optional HPE Smart Array PCIe cards; up to 3 HPE Smart Array Gen9 Controllers | (1) HPE Smart Array S100i; optional HPE Smart Array cards; up to 3 HPE Smart Array Gen10 Controllers |
| Warranty (parts, labor, onsite support) | 3/1/1 (3/3/3 for APJ only) | 3/3/3 |

Major Hadoop 3 distros

Hortonworks Data Platform, Cloudera Enterprise, and MapR Enterprise Edition are the three main Hadoop 3 distros. Below is a brief overview of each and a link to additional information. All of these distributions are supported on HPE Big Data configurations such as HPE Apollo 4200 Gen10.

Hortonworks Data Platform overview

Hortonworks is a major contributor to Apache Hadoop, the world’s most popular big data platform. Hortonworks focuses on further accelerating the development and adoption of Apache Hadoop by making the software more robust and easier to consume for enterprises and more open and extensible for solution providers. The Hortonworks Data Platform (HDP), powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner.

HDP is a platform for multi-workload data processing across an array of processing methods – from batch through interactive and real-time – all supported with solutions for governance, integration, security and operations. As the only completely open Hadoop data platform available, HDP integrates with and augments your existing best-of-breed applications and systems so you can gain value from your enterprise Big Data, with minimal changes to your data architectures. Finally, HDP allows you to deploy Hadoop wherever you want it: in the cloud, on-premises, or as an appliance, and across both Linux® and Microsoft® Windows®. For detailed information, visit Hortonworks.

Cloudera Enterprise overview

Founded in 2008, Cloudera was the first company to commercialize Apache Hadoop and to develop enterprise-grade solutions built on this powerful open source technology. Today, Cloudera is the leading innovator in and largest contributor to the Hadoop open source software community. Cloudera employs a “hybrid open” subscription software business model, affording customers all the benefits of open source software, plus the features and support expected from traditional enterprise software such as security, data governance and system management.

Cloudera Enterprise is built on top of Cloudera’s Enterprise Data Hub (EDH) software platform. In this way, it empowers organizations to store, process and analyze all enterprise data, of whatever type, in any volume – creating remarkable cost-efficiencies as well as enabling business transformation. It is one place to store all your data for as long as desired in its original fidelity. With Apache Hadoop at its core, it is a new, more powerful and scalable data platform with the flexibility to run a variety of workloads – batch processing, interactive SQL, enterprise search, advanced analytics – together with the robust security, governance, data protection, and management that enterprises require.

For detailed information, visit Cloudera Enterprise.

MapR Enterprise Edition overview

The MapR Data Platform delivers distributed processing, real-time analytics, and enterprise-grade requirements across cloud and on-premises environments, while leveraging the significant ongoing development in open source technologies including Spark and Hadoop. MapR Data Platform fabric powers the shared services of the Data Platform including high availability, unified security, multi-tenancy, disaster recovery, global namespace, data management, automation, global event streaming and real-time data access.

MapR 6.x provides a new MapR Control System (MCS), a beautiful and unified management solution for efficiently administering all data in the MapR Data Platform and underlying cluster infrastructure. Key aspects of MapR 6.x include a best-in-class database management system for building global data-intensive applications, a data science offering to fit the needs of all data teams, and automated platform health and security capabilities. For detailed information, visit MapR.

Big Data infrastructure designs

The traditional approach with Hadoop is to use co-located compute and storage, which works well for batch analytics using HDFS and MapReduce. HPE EPA is a modular infrastructure foundation designed to deliver a scalable, multi-tenant platform by enabling independent scaling of compute and storage through infrastructure building blocks that are optimized for density and running disparate workloads.

The “Solution overview” below includes both a traditional and an EPA configuration for Hadoop 3 and the HPE Apollo 4200 Gen10 server.

For more information on EPA solutions, refer to HPE Reference Configuration for Elastic Platform for Big Data Analytics.

Solution overview

These configurations are based on Hadoop 3 and the HPE Apollo 4200 Gen10 for a big data analytics cluster.

HPE Apollo 4200 within the HPE Elastic Platform for Big Data Analytics architecture

The HPE Elastic Platform for Big Data Analytics architecture optimizes efficiency and price performance through a building block approach. This architecture allows for independent scaling of compute and storage, while accommodating the independent growth of data and workloads. As compute and data storage requirements change, the architecture allows customers to easily scale by adding compute and storage blocks independently. The HPE Sizer for the Elastic Platform for Big Data Analytics (a.k.a. HPE EPA Sizing Tool) can be used to produce a detailed configuration based on your requirements.

HPE EPA configurations for big data analytics infrastructure blueprints are composed of five blocks: storage blocks, compute blocks, control blocks, network blocks, and rack blocks. Listed below are the blocks and models used in this solution.

Table 2. HPE EPA blocks using the Apollo 4200

| Blocks | Model |
|---|---|
| Control Block | HPE ProLiant DL360 Gen10 |
| Compute Block | HPE Apollo 2600 with HPE ProLiant XL170r Gen10 |
| Storage Block | HPE Apollo 4200 Gen10 |
| Network Block | HPE FlexFabric 5950 48SFP28 8QSFP28 switch |
| Rack Block | 1200mm or 1075mm |

Reference Configuration for an HPE Elastic Platform for Big Data Analytics cluster with HPE Apollo 4200 Gen10

Refer to Figure 2 for a rack-level view of a single-rack EPA configuration with the HPE Apollo 4200 Gen10 server as the storage block.

Figure 2. Single-rack EPA configuration with the Apollo 4200 Gen10 storage block

HPE Apollo 4200 for Traditional Big Data Analytics

HPE Apollo 4200 clusters for traditional big data analytics infrastructure blueprints are composed of four blocks: storage/compute blocks, control blocks, network blocks, and rack blocks. Listed below are the blocks and models used in this solution.

Table 3. HPE Apollo 4200 solution for traditional big data analytics components

| Blocks | Model |
|---|---|
| Control Block | HPE ProLiant DL360 Gen10 |
| Compute/Storage Block | HPE Apollo 4200 Gen10 |
| Network Block | HPE FlexFabric 5950 48SFP28 8QSFP28 switch |
| Rack Block | 1200mm or 1075mm |

Reference Configuration for a traditional big data analytics cluster with HPE Apollo 4200 Gen10

Refer to Figure 3 for a rack-level view of the single-rack traditional big data configuration with the HPE Apollo 4200 Gen10 server as the compute/storage block.

Figure 3. Single-rack traditional solution with Apollo 4200 Gen10 compute/storage block

For more information about EPA solutions, refer to HPE Reference Configuration for Elastic Platform for Big Data Analytics.

Solution components and configuration guide

Single-rack Reference Configuration

This single-rack Hadoop Reference Configuration (RC) is designed to perform well as a single-rack cluster but also to form the basis for a much larger multi-rack design. When moving from the single-rack to the multi-rack design, one can simply add racks to the cluster without having to change any components within the single rack. This RC reflects the following:

• Single-rack network block

The HPE FlexFabric 5950 48SFP28 8QSFP28 switch is a high-density ToR switch available in a 1RU form factor with 48 25GbE SFP28 ports and 8 100GbE QSFP28 ports. This switch can be used for high-density 10GbE/25GbE ToR with 100GbE/40GbE/25GbE/10GbE spine/ToR connectivity. Each 100GbE port may be split into four 25GbE ports, or it can run at 40GbE and be split into four 10GbE ports, for a total of 80 25/10GbE ports (48 SFP28 ports plus 8 x 4 breakout ports). The switch includes eight 100GbE uplinks, which can be used to connect the switches in the rack into the desired network. Keep in mind that if HPE Intelligent Resilient Fabric (IRF) bonding is used, it requires 2x 100GbE ports per switch, which leaves 6x 100GbE ports on each HPE FlexFabric 5950 48SFP28 8QSFP28 switch for uplinks.

• Power and cooling

In planning for large clusters, it is important to properly manage power redundancy and distribution. To ensure the servers and racks have adequate power redundancy we recommend that each server have a backup power supply, and each rack have at least two Power Distribution Units (PDUs). There is an additional cost associated with procuring redundant power supplies.
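The uplink arithmetic for the single-rack network block above can be sketched as follows. This is a hedged illustration, not an HPE sizing rule: it assumes every SFP28 port carries a 25GbE server link and every QSFP28 port not reserved for IRF is used as a 100GbE uplink. The constant and function names are invented for the example.

```python
# Per-switch port budget for the HPE FlexFabric 5950 48SFP28 8QSFP28,
# using the port counts and the IRF requirement stated in the text.
SFP28_PORTS = 48            # 25GbE server-facing ports
SFP28_SPEED_GBPS = 25
QSFP28_PORTS = 8            # 100GbE ports
QSFP28_SPEED_GBPS = 100
IRF_PORTS_PER_SWITCH = 2    # 100GbE ports consumed by IRF bonding


def uplink_budget(irf_enabled: bool) -> dict:
    """Return uplink port count, aggregate bandwidths, and their ratio."""
    reserved = IRF_PORTS_PER_SWITCH if irf_enabled else 0
    uplink_ports = QSFP28_PORTS - reserved
    downlink_gbps = SFP28_PORTS * SFP28_SPEED_GBPS
    uplink_gbps = uplink_ports * QSFP28_SPEED_GBPS
    return {
        "uplink_ports": uplink_ports,
        "downlink_gbps": downlink_gbps,
        "uplink_gbps": uplink_gbps,
        # Assumption: worst case with all server ports and all uplinks in use.
        "oversubscription": downlink_gbps / uplink_gbps,
    }


for irf in (False, True):
    print("IRF" if irf else "no IRF", uplink_budget(irf))
```

The ratio is only meaningful for traffic that leaves the rack; a single-rack cluster keeps its Hadoop traffic on the ToR switches.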

Pre-deployment considerations

The operating system and the network are key factors you need to consider prior to designing and deploying a Hadoop cluster. The following subsections articulate the design decisions in creating the baseline configurations for the Reference Configurations.

Operating system

All Hadoop distros support 64-bit operating systems. In this RC, we tested with Red Hat® Enterprise Linux® (RHEL) 7.5.

Key point

Hewlett Packard Enterprise recommends that all HPE Apollo systems be upgraded to the latest BIOS and firmware versions before installing the OS. HPE Service Pack for ProLiant [1] (SPP) is a comprehensive systems software and firmware update solution, which is delivered as a single ISO image. HPE recommends using the latest available SPP version, which can be found at: http://h17007.www1.hpe.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx

Computations

Employing Hyper-Threading increases the effective core count, potentially allowing the Hadoop distro in use to assign more cores as needed.
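To make the effect on core count concrete, the sketch below counts logical cores for a two-processor node with the maximum of 24 cores per processor listed for the Apollo 4200 Gen10. Hyper-Threading exposes two logical cores per physical core. The reserve held back for the OS and Hadoop daemons is a hypothetical value chosen for illustration, not an HPE recommendation; in stock Apache Hadoop the resulting number would typically be supplied through the yarn.nodemanager.resource.cpu-vcores property.

```python
def logical_cores(sockets: int, cores_per_socket: int,
                  hyper_threading: bool = True) -> int:
    """Logical cores visible to the OS (two threads per core with HT on)."""
    threads_per_core = 2 if hyper_threading else 1
    return sockets * cores_per_socket * threads_per_core


def yarn_vcores(total_logical: int, reserved: int = 4) -> int:
    """vcores offered to YARN after holding back a reserve.

    The default reserve of 4 is a hypothetical illustration value.
    """
    return max(1, total_logical - reserved)


with_ht = logical_cores(sockets=2, cores_per_socket=24)
without_ht = logical_cores(sockets=2, cores_per_socket=24, hyper_threading=False)
print("Logical cores with Hyper-Threading:", with_ht)
print("Logical cores without Hyper-Threading:", without_ht)
print("Example vcores for YARN:", yarn_vcores(with_ht))
```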

I/O performance

The more disks you have, the less likely it is that multiple tasks will access a given disk at the same time. This avoids queued I/O requests and the resulting I/O performance degradation.
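The relationship between disk count and contention can be illustrated with a simple probability model. The sketch below assumes that each concurrent task reads from one disk chosen uniformly at random and independently of the others, which is a simplification of real HDFS block placement and scheduling; the task count of 8 is an arbitrary illustration value.

```python
from math import prod


def p_some_disk_shared(tasks: int, disks: int) -> float:
    """Probability that at least two tasks land on the same disk.

    Simplifying assumption: each task picks one disk uniformly at random.
    """
    if tasks > disks:
        return 1.0  # more tasks than disks: a collision is certain
    p_all_distinct = prod((disks - i) / disks for i in range(tasks))
    return 1.0 - p_all_distinct


# Compare a few disk counts, including the 24 front and 28 total LFF bays.
for disks in (12, 24, 28):
    p = p_some_disk_shared(tasks=8, disks=disks)
    print(f"{disks} disks, 8 tasks: P(shared disk) = {p:.2f}")
```

In this model the probability of two tasks sharing a disk falls as disks are added, which is the effect the paragraph above describes.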

Disk configuration

One has a choice of SAS or SATA drives for the Hadoop server nodes and, as with any component, there is a cost/performance tradeoff.

Network

Configuring a single ToR switch per rack introduces a single point of failure for each rack. In a multi-rack system, such a failure results in a very long replication recovery time as Hadoop rebalances storage; in a single-rack system, such a failure could bring down the whole cluster. Consequently, configuring two ToR switches per rack is recommended for all production configurations, as it provides an additional measure of redundancy. This can be further improved by configuring link aggregation between the switches. The most desirable way to configure link aggregation is to bond the two physical NICs on each server, with Port1 wired to the first ToR switch, Port2 wired to the second ToR switch, and the two switches IRF bonded. When done properly, this allows the bandwidth of both links to be used. If either switch fails, the servers still have full network functionality, but with the performance of only a single link. Not all switches can do link aggregation from individual servers to multiple switches; however, the HPE FlexFabric 5950 48SFP28 8QSFP28 switch supports this through HPE Intelligent Resilient Fabric (IRF) technology. In addition, switch failures can be further mitigated by incorporating dual power supplies for the switches.

Hadoop is rack-aware and tries to limit the amount of network traffic between racks. The bandwidth and latency provided by two bonded 25 Gigabit Ethernet (GbE) connections from the compute nodes to the ToR switch is more than adequate for most Hadoop configurations.
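Rack awareness in stock Apache Hadoop depends on the cluster being told which rack each node belongs to. One common mechanism is an executable script named in the net.topology.script.file.name property of core-site.xml; Hadoop passes host names or IP addresses as arguments and reads back one rack path per argument. A minimal sketch of such a script follows. The IP addresses and rack names are hypothetical, and a distribution's management tooling may configure topology in a different way.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack topology script (hypothetical host table)."""
import sys

# Hypothetical mapping of node addresses to rack paths.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts not in the table


def resolve_racks(hosts, table=RACK_BY_HOST):
    """Return one rack path per host, in the same order as the input."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop may pass several hosts per invocation; print one rack per line.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

A static table is the simplest form; larger clusters often derive the rack from the subnet or from an inventory file instead.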

A more detailed white paper on Hadoop networking best practices is available at http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=a00004216enw

For sizing the cluster, use the HPE Sizer for the Elastic Platform for Big Data Analytics, available at http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=a00005868enw

[1] HPE Service Pack for ProLiant also supports HPE Apollo servers.

High Availability considerations

The following are some of the High Availability (HA) features considered in this Reference Configuration:

• ResourceManager HA – To make a YARN cluster highly available, the ResourceManagers are configured in an Active/Standby architecture. One ResourceManager is Active and one or more ResourceManagers are in standby mode, waiting to take over should anything happen to the Active ResourceManager. As a result, the completed tasks of in-flight jobs are not re-run on recovery after the ResourceManager is restarted or failed over.

• OS availability and reliability – For reliability of the server, the OS disk is configured in a RAID1 configuration thus mitigating failure of the system from OS hard disk failures.

• Network reliability – The Reference Configuration uses a traditional big data analytics solution network block with two HPE FlexFabric 5950 48SFP28 8QSFP28 switches, joined with Intelligent Resilient Fabric (IRF) bonding, for redundancy, resiliency, and scalability. We recommend using redundant power supplies.

• Power supply – To ensure the servers and racks have adequate power redundancy, we recommend that each server have a backup power supply, and each rack have at least two Power Distribution Units (PDUs).
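For the ResourceManager HA item above, the sketch below generates the kind of yarn-site.xml fragment that stock Apache Hadoop 3 uses to define an Active/Standby ResourceManager pair. The property names are those documented for Apache Hadoop 3; the cluster ID, host names, and ZooKeeper quorum are hypothetical placeholders. In practice the distribution's management console usually writes these settings, and a given distribution may handle ResourceManager failover differently.

```python
import xml.etree.ElementTree as ET


def build_rm_ha_properties(cluster_id, rm_hosts, zk_quorum):
    """Return yarn-site.xml properties for an Active/Standby RM set."""
    rm_ids = [f"rm{i}" for i in range(1, len(rm_hosts) + 1)]
    props = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.cluster-id": cluster_id,
        "yarn.resourcemanager.ha.rm-ids": ",".join(rm_ids),
        "hadoop.zk.address": zk_quorum,  # ZooKeeper quorum used for failover
    }
    for rm_id, host in zip(rm_ids, rm_hosts):
        props[f"yarn.resourcemanager.hostname.{rm_id}"] = host
    return props


def to_configuration_xml(props):
    """Serialize the properties in the Hadoop <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical head node and ZooKeeper host names, for illustration only.
props = build_rm_ha_properties(
    cluster_id="epa-cluster",
    rm_hosts=["head1.example.com", "head2.example.com"],
    zk_quorum="mgmt1.example.com:2181,head1.example.com:2181,head2.example.com:2181",
)
print(to_configuration_xml(props))
```

The three-member ZooKeeper quorum in the example mirrors the ZooKeeper service listed on the management and head nodes in this document.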

Software components for control blocks

The control block is made up of three HPE ProLiant DL360 Gen10 servers, with an optional fourth server acting as an edge or gateway node depending on the customer enterprise network requirements.

Management node

The management node hosts the applications that submit jobs to the Hadoop cluster. We recommend that you install it with the software components shown in Table 4.

Table 4. Management node basic software components

| Software | Description |
|---|---|
| Red Hat Enterprise Linux 7.5 | Recommended Operating System |
| HPE Insight Cluster Management Utility (CMU) 8.2 | Infrastructure Deployment, Management, and Monitoring |
| Oracle JDK 1.8 | Java Development Kit |
| Hadoop Management Console | Cloudera Manager, Ambari, or MapR Control System |
| ZooKeeper | Cluster coordination service |

Head nodes

The head node servers run the base software components listed in Table 5, with the HA feature enabled.

Table 5. Head node server base software components

| Software | Description |
|---|---|
| Red Hat Enterprise Linux 7.5 | Recommended Operating System |
| Oracle JDK 1.8 | Java Development Kit |
| ResourceManager | YARN ResourceManager |
| NameNode | HDFS NameNode² |
| ZooKeeper | Cluster coordination service |
| Additional Services | Head functions for any additional Hadoop distribution services installed |

² MapR has the NameNode distributed across all Storage Nodes; it is not required on the Head Nodes.

Edge nodes

The edge node hosts the client configurations that submit jobs to the Hadoop cluster. This control block is optional, depending on the customer's enterprise network requirements. We recommend installing the software components shown in Table 6.

Table 6. Edge node basic software components

| Software | Description |
|---|---|
| Red Hat Enterprise Linux 7.5 | Recommended Operating System |
| Oracle JDK 1.8 | Java Development Kit |
| Gateway Services | Hadoop Gateway Services (FS, YARN, MapReduce, HBase, and others) |

Software components for compute/storage blocks

The compute nodes run the DataNode, NodeManager, and YARN container processes, so storage capacity and compute performance are both important factors.

HPE Apollo 4200 Gen10 solution software components

Table 7 lists the compute node software components.

Table 7. Compute/storage node base software components

| Software | Description |
|---|---|
| Red Hat Enterprise Linux 7.5 | Recommended Operating System |
| Oracle JDK 1.8 | Java Development Kit |
| NodeManager | The NodeManager process for YARN |
| DataNode | The DataNode process for HDFS |

For hardware configuration guidelines, refer to the HPE Reference Configuration for Elastic Platform for Big Data Analytics.

Capacity and sizing

Sizing Hadoop cluster storage requires careful planning and an understanding of current and future storage and compute needs. Use the following as general guidelines for data inventory:

• Sources of data

• Frequency of data

• Raw storage

• Processed FS storage

• Replication factor

• Default compression turned on

• Space for intermediate files
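The inventory items above can be combined into a rough raw-capacity estimate. The Python sketch below is one common rule-of-thumb calculation, not an HPE sizing method. The function names, the 25% allowance for intermediate files, and the example inputs are assumptions chosen for illustration.

```python
import math


def raw_capacity_tb(ingest_tb_per_day, retention_days,
                    replication_factor=3, compression_ratio=1.0,
                    intermediate_fraction=0.25):
    """Estimate the raw disk capacity (TB) needed for the retained data.

    compression_ratio is uncompressed size divided by compressed size
    (1.0 means compression is off). intermediate_fraction is the extra
    space reserved for intermediate files (an assumed allowance).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    logical_tb = ingest_tb_per_day * retention_days   # sources x frequency
    stored_tb = logical_tb / compression_ratio        # after compression
    replicated_tb = stored_tb * replication_factor    # HDFS replicas
    return replicated_tb * (1.0 + intermediate_fraction)


def nodes_required(raw_tb, usable_tb_per_node):
    """Round up to the number of storage nodes needed for raw_tb."""
    return math.ceil(raw_tb / usable_tb_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 1 TB/day kept for one year, 2:1 compression.
    needed = raw_capacity_tb(1.0, 365, replication_factor=3,
                             compression_ratio=2.0)
    print(f"Raw capacity needed: {needed:.1f} TB")
    # Hypothetical node with 100 TB of usable disk.
    print(f"Storage nodes: {nodes_required(needed, 100.0)}")
```

Keeping each inventory item as a separate parameter makes it easy to see how a change in replication factor or compression affects the node count.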

Best practices and tuning guidelines

YARN configuration

To configure YARN, replace the default values of the following attributes with values that reflect the cores and memory available on a compute node.

• yarn.nodemanager.resource.memory-mb – Defines the amount of memory, in MB, available to YARN containers on the node.

• yarn.nodemanager.resource.cpu-vcores – Defines the number of virtual cores (vcores) available to YARN containers on the node.
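A minimal Python sketch of how these two values might be derived from a node's hardware. The memory and cores held back for the operating system and the DataNode and NodeManager daemons are assumptions, as are the function name and the example node specification. Tune them for your workload.

```python
def nodemanager_resources(total_memory_gb, physical_cores,
                          reserved_memory_gb=16, reserved_cores=2,
                          vcores_per_core=1):
    """Derive NodeManager resource settings from node hardware.

    reserved_memory_gb and reserved_cores are held back for the OS and
    Hadoop daemons (assumed values, not HPE recommendations).
    """
    memory_mb = (total_memory_gb - reserved_memory_gb) * 1024
    vcores = (physical_cores - reserved_cores) * vcores_per_core
    if memory_mb <= 0 or vcores <= 0:
        raise ValueError("reservations leave no resources for containers")
    return {
        "yarn.nodemanager.resource.memory-mb": memory_mb,
        "yarn.nodemanager.resource.cpu-vcores": vcores,
    }


if __name__ == "__main__":
    # Hypothetical compute node: 256 GB of RAM and 32 physical cores.
    for name, value in nodemanager_resources(256, 32).items():
        print(f"{name} = {value}")
```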

When configuring YARN for MapReduce jobs, make sure that the following attributes have been specified with sufficient vcores and memory. They represent resource allocation attributes for map and reduce containers. Note that the optimum values for these attributes depend on the nature of the workload or use case.

• mapreduce.map.memory.mb – Defines the container size for map tasks in MB.

• mapreduce.reduce.memory.mb – Defines the container size for reduce tasks in MB.
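To see how the container sizes interact with the NodeManager limits, the Python sketch below estimates how many map or reduce containers fit on one node and derives a matching JVM heap setting. The 80% heap-to-container ratio is a commonly used starting point rather than a value from this document, and the function name and example numbers are hypothetical.

```python
def container_plan(node_memory_mb, node_vcores,
                   map_memory_mb, reduce_memory_mb,
                   map_vcores=1, reduce_vcores=1, heap_fraction=0.8):
    """Estimate per-node container counts and JVM heap options.

    node_memory_mb and node_vcores correspond to
    yarn.nodemanager.resource.memory-mb and
    yarn.nodemanager.resource.cpu-vcores.
    """
    def max_containers(memory_mb, vcores):
        # A node is limited by whichever resource runs out first.
        return min(node_memory_mb // memory_mb, node_vcores // vcores)

    return {
        "mapreduce.map.memory.mb": map_memory_mb,
        "mapreduce.reduce.memory.mb": reduce_memory_mb,
        # Keep the JVM heap below the container size to leave headroom
        # for non-heap memory (assumed ratio).
        "mapreduce.map.java.opts":
            f"-Xmx{int(map_memory_mb * heap_fraction)}m",
        "mapreduce.reduce.java.opts":
            f"-Xmx{int(reduce_memory_mb * heap_fraction)}m",
        "max_map_containers_per_node":
            max_containers(map_memory_mb, map_vcores),
        "max_reduce_containers_per_node":
            max_containers(reduce_memory_mb, reduce_vcores),
    }


if __name__ == "__main__":
    # Hypothetical node offering 240 GB and 30 vcores to YARN.
    plan = container_plan(245760, 30, map_memory_mb=4096,
                          reduce_memory_mb=8192)
    for name, value in plan.items():
        print(f"{name} = {value}")
```

If the estimated container count is limited by vcores while memory remains unused, larger containers may suit the workload better, and the reverse holds when memory is the limit.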

Isolating ZooKeeper nodes

For large clusters (100 nodes or more), isolate ZooKeeper on nodes that do not perform any other function. Isolating ZooKeeper enables the node to perform its functions without competing for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages.

HPE Sizer for the Elastic Platform for Big Data Analytics

HPE has developed the HPE Sizer for the Elastic Platform for Big Data Analytics to assist customers with proper sizing of these environments. Based on design requirements, the sizer will provide a suggested bill of materials (BOM) and metrics data for a traditional big data cluster which can be modified further to meet customer requirements.

To download the HPE Sizer for the Elastic Platform for Big Data Analytics, visit hpe.com/info/sizers.

HPE Performance Cluster Manager

HPE Performance Cluster Manager delivers an integrated system management solution for Linux-based High Performance Computing (HPC) clusters such as big data environments. HPE Performance Cluster Manager provides complete provisioning, management, and monitoring for clusters scaling to 100,000 nodes. The software enables fast system setup from bare-metal, comprehensive hardware monitoring and management, image management, software updates and power management. HPE Performance Cluster Manager reduces the time and resources spent administering HPC systems – lowering total cost of ownership, increasing productivity and providing a better return on hardware investments. A simple graphical interface enables an “at-a-glance” real-time or 3D historical view of the entire cluster for both infrastructure and application (including Hadoop) metrics, provides frictionless scalable remote management and analysis, and allows rapid provisioning of software to all nodes of the system.

Figure 4. HPE Performance Cluster Manager – system management

Best practice

HPE recommends using HPE Performance Cluster Manager for all Hadoop clusters. It lets you easily correlate Hadoop metrics with cluster infrastructure metrics, such as CPU Utilization, Network Transmit/Receive, Memory Utilization and I/O Read/Write. This allows you to characterize Hadoop workloads and optimize the system, thereby improving the performance of the Hadoop cluster. HPE Performance Cluster Manager Time View Metric Visualizations will help you understand, based on your workloads, whether your cluster needs more memory, a faster network or processors with faster clock speeds.

HPE Performance Cluster Manager is highly flexible and customizable, offers both GUI and CLI interfaces, supports Ansible, and can be used to deploy a range of software environments, from simple compute farms to highly customized, application-specific configurations. HPE Performance Cluster Manager is available for HPE ProLiant and HPE BladeSystem servers, and is supported on a variety of Linux operating systems, including Red Hat Enterprise Linux, SUSE Linux Enterprise Server, CentOS, and Ubuntu. Figures 5 and 6 show the instant and time view of the HPE Performance Cluster Manager.

Figure 5. HPE Performance Cluster Manager – Instant View

Figure 6 shows the Time View of the HPE Performance Cluster Manager.

Figure 6. HPE Performance Cluster Manager – Time View

HPE Pointnext Services

HPE recommends that customers purchase the optional services from HPE Pointnext, as detailed in Appendix A: HPE Pointnext value-added services and support. These services include hardware configuration from the HPE Pointnext Factory Express team and Big Data consulting services for design, architecture, implementation and management of Hadoop solutions.

Summary

Hewlett Packard Enterprise and the Hadoop 3 distributions let you derive new business insights from big data by providing a platform to store, manage, and process data at scale. However, designing and ordering Hadoop clusters can be both complex and time consuming. This white paper provides a Reference Configuration for deploying clusters of varying sizes with Hadoop 3 on HPE infrastructure and management software, to assist in their rapid design and deployment. These configurations leverage HPE servers, storage, and networking, along with integrated management software and bundled support.

Appendix A: HPE Pointnext value-added services and support

To help customers jump-start their big data solution development, HPE Pointnext offers flexible, value-added services, including Factory Express and Big Data Consulting services, which can accommodate an end-to-end customer experience.

HPE Pointnext Factory Express Services

Factory-integration services are available for customers seeking a streamlined deployment experience. With the purchase of Factory Express services, your cluster will arrive racked and cabled, with software installed and configured per an agreed-upon custom statement of work, for the easiest deployment possible. HPE Factory Express Level 4 Service (HA454A1) is the recommended Factory Integration service for big data, covering hardware and software integration as well as end-to-end delivery project management. Please engage HPE Pointnext Factory Express for details and quoting assistance. For more information and assistance on Factory Integration services, go to: hpe.com/us/en/services/factory-express.html

Or contact:

• AMS: [email protected]

• APJ: [email protected]

• EMEA: [email protected]

HPE Pointnext Big Data Consulting – Reference Configuration Implementation Service for Hadoop

With the HPE Reference Architecture Implementation Service for Hadoop, experienced HPE Big Data consultants install, configure, deploy, and test your Hadoop environment based on the HPE Reference Configuration for Hadoop. HPE will implement a Hadoop design (naming, hardware, networking, software, administration, backup, and operating procedures) and work with you to configure the environment according to your goals and needs. HPE will also conduct an acceptance test to validate and prove that the system is operating as defined in the Reference Configuration.

HPE GreenLake Big Data – your complete end-to-end solution

HPE Pointnext offers a scalable solution that radically simplifies your experience with Hadoop. It removes much of the complexity and cost, so that you can focus purely on deriving intelligence from your Hadoop clusters. Supporting both symmetrical and asymmetrical environments, HPE GreenLake Big Data covers the whole Hadoop lifecycle with a comprehensive, end-to-end solution composed of the required hardware, software, and HPE Pointnext services, including data migration if needed. HPE Pointnext experts will get you set up and operational, and help you manage and maintain your clusters. They will also simplify billing, aligning it with business KPIs. With HPE’s unique pricing and billing method, it’s much easier to understand your existing Hadoop costs and better predict future costs associated with your solution.

HPE Pointnext Advisory, Transform and Manage – Big Data Consulting Services

HPE Pointnext Big Data Consulting Services cover the spectrum of services to advise, transform, and manage your Hadoop environment, helping you to reshape your IT infrastructure to corral increasing volumes of bytes – from e-mails, social media, and website downloads – and convert them into beneficial information. Our Big Data solutions encompass strategy, design, implementation, protection and compliance. We deliver these solutions in three steps.

1. Big Data Architecture Strategy and Roadmap: We’ll define the functionalities and capabilities needed to align your IT with your Big Data initiatives. Through transformation workshops, roadmap and design services, you’ll learn to capture, consolidate, manage and protect business-aligned information, including structured, semi-structured and unstructured data.

2. Big Data System Infrastructure: HPE experts will design and implement a high-performance, integrated platform to support a strategic architecture for Big Data. Choose from design and implementation services, Reference Configuration implementations and integration services. Your flexible, scalable infrastructure will support the variety, consolidation, and analysis of Big Data needed to help drive your business.

3. Big Data Protection: Ensure availability, security and compliance of Big Data systems. Our consultants can help you safeguard your data, achieve regulatory compliance and lifecycle protection across your Big Data landscape, as well as improve your backup and continuity measures.

For additional information, visit: hpe.com/us/en/services/consulting/big-data.html

Hewlett Packard Enterprise Support options

HPE offers a variety of support levels to meet your needs:

• HPE Datacenter Care – HPE Datacenter Care provides a more personalized, customized approach for large, complex environments, with one solution for reactive, proactive, and multi-vendor support needs.

• HPE Support Plus 24 – For a higher return on your server and storage technology, our combined reactive support service delivers integrated onsite hardware/software support services available 24x7x365, including access to HPE technical resources, 4-hour response onsite hardware support and software updates.

• HPE Proactive Care – HPE Proactive Care provides all of the benefits of proactive monitoring and reporting along with rapid reactive care. You also receive enhanced reactive support through access to HPE’s expert reactive support specialists. You can customize your reactive support level by selecting either 6-hour call-to-repair or 24x7 with 4-hour onsite response. You may also choose the Defective Media Retention (DMR) option.

• HPE Proactive Care with the HPE Personalized Support Option – Adding the Personalized Support Option for HPE Proactive Care is highly recommended. The Personalized Support option builds on the benefits of HPE Proactive Care Service, providing you an assigned Account Support Manager who knows your environment and delivers support planning, regular reviews, and technical and operational advice specific to your environment. These proactive services will be coordinated with Microsoft's proactive services that come with Microsoft Premier Mission Critical, if applicable.

• HPE Proactive Select – To address your ongoing and changing needs, HPE recommends adding Proactive Select credits to provide tailored support options from a wide menu of services, designed to help you optimize capacity, performance, and management of your environment. These credits may also be used for assistance in implementing updates for the solution. As your needs change over time, you can flexibly choose the specific services best suited to address your current IT challenges.

• Other offerings – In addition, Hewlett Packard Enterprise highly recommends HPE Education Services (for customer training and education) and additional HPE Pointnext services, as well as in-depth installation or implementation services, as needed.

© Copyright 2018 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Intel and Xeon are trademarks of the Intel Corporation in the U.S. and other countries. Linux is a registered trademark of Linus Torvalds in the U.S. and other countries. Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates.

a00061276enw, December 2018

Resources and additional links

MapR, mapr.com

Hortonworks, hortonworks.com

Cloudera Enterprise, cloudera.com

HPE Solutions for Apache Hadoop, hpe.com/info/hadoop

HPE Performance Cluster Manager, hpe.com/us/en/product-catalog/detail/pip.hpe-performance-cluster-manager-software.1010836945.html

HPE FlexFabric 5900 switch series, hpe.com/us/en/product-catalog/networking/networking-switches/pip.fixed-port-l3-managed-ethernet-switches.5221896.html

HPE FlexFabric 5950 switch series, hpe.com/us/en/product-catalog/networking/networking-switches/pip.hpe-flexfabric-5950-switch-series.1008901775.html

HPE Apollo servers, hpe.com/us/en/solutions/hpc-high-performance-computing.html

HPE Networking, hpe.com/networking

HPE Services, hpe.com/services

Red Hat, redhat.com

HPE Sizer for the Elastic Platform for Big Data Analytics, http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=a00005868enw

HPE Education Services, http://h10076.www1.hpe.com/ww/en/training/portfolio/bigdata.html

To help us improve our documents, please provide feedback at hpe.com/contact/feedback.