
  • September 2018

    By Shashikant Gaikwad, Subhash Shinde

    Reference Architecture Guide

    Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database to Cloudera Distribution of Apache Hadoop

  • Feedback

    Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email message to [email protected]. To assist the routing of this message, use the paper number in the subject and the title of this white paper in the text.

    Revision History

    Revision Changes Date

    MK-SL-098-00    Initial release    August 20, 2018

    MK-SL-098-01    Updated text in Merge and Join Two Tables Data Copy    September 24, 2018

    mailto:[email protected]?subject=Document%20AS-NNN-RR%20mailto:[email protected]?subject=Document%20MK-SL-098-01%20

  • Table of Contents

    Solution Overview 2

    Business Benefits 2

    High Level Infrastructure 3

    Key Solution Components 4

    Pentaho 6

    Hitachi Advanced Server DS120 7

    Hitachi Virtual Storage Platform Gx00 Models 7

    Hitachi Virtual Storage Platform Fx00 Models 7

    Brocade Switches 7

    Cisco Nexus Data Center Switches 7

    Cloudera 8

    Oracle Database 9

    Red Hat Enterprise Linux 9

    Solution Design 9

    Solution Validation 11

    Storage Architecture 13

    Network Architecture 15

    Data Analytics and Performance Monitoring Using Hitachi Storage Advisor 17

    Oracle Enterprise Data Offload Workflow 18

    Engineering Validation 30

    Test Methodology 30

    Test Results 31

  • 1

    Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database to Cloudera Distribution of Apache Hadoop

    Reference Architecture Guide

    Use this reference architecture guide to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This Oracle converged infrastructure provides a high-performance, integrated solution for advanced analytics using the following big data applications:

    Hitachi Advanced Server DS120 with Intel Xeon Silver 4110 processors

    Pentaho

    Cloudera Distribution Hadoop

    This architecture establishes best practices for environments where you can copy data in an enterprise data warehouse to an Apache Hive database on top of the Hadoop Distributed File System (HDFS). Then, you obtain your data for analysis from the Hive database instead of from the busy Oracle server.

    This reference architecture guide is for you if you are in one of the following roles and need to create a big data management and advanced analytics solution:

    Data scientist

    Database administrator

    System administrator

    Storage administrator

    Database performance analyzer

    IT professional with the responsibility of planning and deploying an EDW offload solution

    To use this reference architecture guide to create your big data infrastructure, you should have familiarity with the following:

    Hitachi Advanced Server DS120

    Pentaho

    Big data and Cloudera Distribution Hadoop (CDH)

    Apache Hive

    Oracle Real Application Cluster Database 12c Release 1

    IP networks

    Red Hat Enterprise Linux

    1

  • 2

    Note — Testing of this configuration was in a lab environment. Many things affect production environments beyond prediction or duplication in a lab environment. Follow the recommended practice of conducting proof-of-concept testing for acceptable results in a non-production, isolated test environment that otherwise matches your production environment before your production implementation of this solution.

    Solution Overview

    Use this reference architecture to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database to Cloudera Distribution Hadoop.

    Business Benefits

    This solution provides the following benefits:

    Improve database manageability

    You can take a "divide and conquer" approach to data management by moving data onto a lower cost storage tier without disrupting access to data.

    Extreme scalability

    Leveraging the extreme scalability of Hadoop Distributed File System, you can offload data from the Oracle servers onto commodity servers running big data solutions.

    Lower total cost of ownership (TCO)

    Reduce your capital expenditure by reducing the resources needed to run applications. Using Hadoop Distributed File System and low-cost storage makes it possible to keep information that is not deemed currently critical, but that you still might want to access, off the Oracle servers.

    This approach reduces the number of CPUs needed to run the Oracle database, optimizing your infrastructure. This potentially delivers hardware and software savings, including maintenance and support costs.

    Reduce the costs of running your workloads by leveraging less expensive, general-purpose servers running a Hadoop cluster.

    Improve availability

    Reduce scheduled downtime by allowing database administrators to perform backup operations on a smaller subset of your data. With offloading, perform daily backups on hot data, and less frequent backups on warm data.

    For an extremely large database, this can make the difference between having enough time to complete a backup in off-business hours or not.

    This also reduces the amount of time required to recover your data.

    Analyze data without affecting the production environment

    When processing and offloading data through Pentaho Data Integration, dashboards in Pentaho can analyze your data without affecting the performance of the Oracle production environment.

    2

  • 3

    High Level Infrastructure

    Figure 1 shows the high-level infrastructure for this solution. The configuration of Hitachi Advanced Server DS120 provides the following characteristics:

    Fully redundant hardware

    High compute and storage density

    Flexible and scalable I/O options

    Sophisticated power and thermal design to avoid unnecessary operation expenditures

    Quick deployment and maintenance

    Figure 1

    3

  • 4

    To avoid any performance impact to the production database, Hitachi Vantara recommends using a configuration with a dedicated IP network for the following:

    Production Oracle database

    Pentaho server

    Hadoop servers

    Uplink speed to the corporate network depends on your environment and requirements. The Cisco Nexus 93180YC-EX switches can support uplink speeds of 40 GbE.

    This solution uses Hitachi Unified Compute Platform CI for the Oracle Database architecture, with Hitachi Advanced Server DS220, Hitachi Virtual Storage Platform G600, and two Brocade G620 SAN switches hosting the Oracle EDW. You can use your existing Oracle database environment, or purchase Hitachi Unified Compute Platform CI to host Oracle RAC or a standalone solution to host the enterprise data warehouse.

    Key Solution Components

    The key solution components for this solution are listed in Table 1, “Hardware Components,” on page 4 and Table 2, “Software Components,” on page 6.

    TABLE 1. HARDWARE COMPONENTS

    Hardware Detailed Description Firmware or Driver Version Quantity

    Hitachi Virtual Storage Platform G600 (VSP G600)

    One controller

    8 × 16 Gb/s Fibre Channel ports

    8 × 12 Gb/s backend SAS ports

    256 GB cache memory

    40 × 960 GB SSDs, plus 2 spares

    16 Gb/s × 2 ports CHB

    83-04-47-40/00 1

    Hitachi Advanced Server DS220 (Oracle host)

    2 Intel Xeon Gold 6140 CPU @ 2.30 GHz

    768 GB (64GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz

    3A10.H3 1

    Intel XXV710 Dual Port 25 GbE NIC cards i40e-2.3.6 2

    Emulex LightPulse LPe31002-M6 2-Port 16 Gb/s Fibre Channel adapter

    11.2.156.27 2

    4

  • 5

    Hitachi Advanced Server DS120 (Hadoop host)

    2 Intel Xeon Silver 4110 CPU @ 2.10GHz

    2 × 64 GB MLC SATADOM for boot

    384 GB (32 GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz

    3A10.H3 3

    Intel XXV710 dual port 25 GbE NIC cards i40e-2.3.6 6

    1.8 TB SAS drives 4

    Hitachi Advanced Server DS120 (Pentaho host)

    2 Intel Xeon Silver 4110 CPU @ 2.10GHz

    2 × 64 GB MLC SATADOM for boot

    128 GB (32 GB × 4) DIMM DDR4 synchronous registered (buffered) 2666 MHz

    3A10.H3 1

    Intel XXV710 Dual Port 25 GbE i40e-2.3.6 2

    Brocade G620 switches

    48 port Fibre Channel switch

    16 Gb/s SFPs

    Brocade hot-pluggable SFP+, LC connector

    V8.0.1 2

    Cisco Nexus 93180YC-EX switches

    48 × 10/25 GbE Fiber ports

    6 × 40/100 Gb/s quad SFP (QSFP28) ports

    7.0(3)I5(1) 2

    Cisco Nexus 3048TP switch

    1 GbE 48-Port Ethernet switch 7.0(3)I4(2) 1

    TABLE 1. HARDWARE COMPONENTS (CONTINUED)

    Hardware Detailed Description Firmware or Driver Version Quantity

    5

  • 6

    [Note 1] These software programs were used for this Oracle Database architecture built on Hitachi Unified Compute Platform CI. They may not be required for your implementation.

    Pentaho

    A unified data integration and analytics program, Pentaho addresses the barriers that block your organization's ability to get value from all your data. Simplify preparing and blending any data with a spectrum of tools to analyze, visualize, explore, report, and predict. Open, embeddable, and extensible, Pentaho ensures that each member of your team — from developers to business users — can translate data into value.

    Internet of things — Integrate machine data with other data for better outcomes.

    Big data — Accelerate value with Apache Hadoop, NoSQL, and other big data.

    Data integration — Access, manage, and blend any data from any source.

    This solution uses Pentaho Data Integration to drive the extract, transform, and load (ETL) process. The end target of this process is an Apache Hive database on top of Hadoop Distributed File System.

    Business analytics — Turn data into insights with embeddable analytics.

    TABLE 2. SOFTWARE COMPONENTS

    Software Version Function

    Red Hat Enterprise Linux Version 7.3

    Kernel Version: kernel-3.10.0-514.36.5.el7.x86_64

    Operating system for Cloudera Distribution Hadoop environment.

    Red Hat Enterprise Linux Version 7.4

    3.10.0-693.11.6.el7.x86_64

    Operating system for Oracle Environment

    Microsoft® Windows Server® 2012 R2 Standard Operating system for Pentaho Data Integration (PDI) environment.

    Oracle 12c Release 1 (12.1.0.2.0) Database Software

    Oracle Grid Infrastructure 12c Release 1 (12.1.0.2.0) Volume Management, File System Software, and Oracle Automatic Storage Management

    Pentaho Data Integration 8.0 Extract-transform-load software

    Cloudera Distribution Hadoop 5.13 Hadoop Distribution

    Apache Hive 1.1.0 Hive Database

    Red Hat Enterprise Linux Device Mapper Multipath

    1.02.140 Multipath Software

    Hitachi Storage Navigator [Note 1] Microcode dependent Storage management Software

    Hitachi Storage Advisor (HSA) [Note 1] 2.1.0 Storage orchestration software

    6

    http://www.pentaho.com/

  • 7

    Hitachi Advanced Server DS120

    Optimized for performance, high density, and power efficiency in a dual-processor server, Hitachi Advanced Server DS120 delivers a balance of compute and storage capacity. This rack mounted server has the flexibility to power a wide range of solutions and applications.

    The highly scalable memory supports up to 3 TB using 24 slots of 2666 MHz DDR4 RDIMM. DS120 is powered by the Intel Xeon Scalable processor family for complex and demanding workloads. There are flexible OCP and PCIe I/O expansion card options available. This server supports up to 12 storage devices, with up to 4 NVMe drives.

    Hitachi Virtual Storage Platform Gx00 Models

    Hitachi Virtual Storage Platform Gx00 models are based on industry-leading enterprise storage technology. With flash-optimized performance, these systems provide advanced capabilities previously available only in high-end storage arrays. With the Virtual Storage Platform Gx00 models, you can build a high performance, software-defined infrastructure to transform data into valuable information.

    Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Gx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

    This solution uses Virtual Storage Platform G600, which supports Oracle Real Application Clusters.

    Hitachi Virtual Storage Platform Fx00 Models

    Hitachi Virtual Storage Platform Fx00 models deliver superior all-flash performance for business-critical applications, with continuous data availability. High-performance network attached storage with non-disruptive deduplication reduces the required storage capacity by up to 90% with the power to handle large, mixed-workload environments.

    Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Fx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

    Brocade Switches

    Brocade and Hitachi Vantara partner to deliver storage networking and data center solutions. These solutions reduce complexity and cost, as well as enable virtualization and cloud computing to increase business agility.

    Optionally, this solution uses the following Brocade product:

    Brocade G620 switch, 48-port Fibre Channel

    In this solution, SAN switches are optional. Direct connect is possible under certain circumstances. Check the support matrix to ensure support for your choice.

    Cisco Nexus Data Center Switches

    Cisco Nexus data center switches are built for scale, industry-leading automation, programmability, and real-time visibility.

    This solution uses the following Cisco switches to provide Ethernet connectivity:

    Nexus 93180YC-EX, 48-port 10/25 GbE switch

    Nexus 3048TP, 48-port 1GbE Switch

    7

    https://www.hitachivantara.com/en-us/pdf/datasheet/hitachi-datasheet-advanced-server-ds120.pdf
    https://www.hitachivantara.com/en-us/products/storage/virtual-storage-platform-g-series.html
    http://www.oracle.com/us/products/database/options/real-application-clusters/overview/index.html
    https://www.hds.com/en-us/products-solutions/storage/virtual-storage-platform-f-series.html
    https://www.hds.com/en-us/products-solutions/storage/brocade-networking.html
    https://www.cisco.com/c/en/us/products/switches/data-center-switches/index.html

  • 8

    Cloudera

    Cloudera is the leading provider of enterprise-ready, big data software and services. Cloudera Enterprise Data Hub is the market-leading Hadoop distribution. It includes Apache Hadoop, Cloudera Manager, related open source projects, and technical support.

    Big Data is a generic term to cover a set of components that are used with very large data sets to provide advanced data analytics. Big data usually refers to large volumes of unstructured or semi-structured data.

    Usually, a big data solution is part of the Apache Hadoop project. However, big data can include components from many different software companies.

    This reference architecture uses Red Hat Enterprise Linux and Cloudera Enterprise Data Hub.

    The following is a partial list of the software components and modules that can be used in a Cloudera Enterprise Data Hub deployment:

    Cloudera Manager

    Cloudera Manager provides management and monitoring of the Cloudera Hadoop distribution cluster.

    Apache Hadoop Distributed File System

    Hadoop Distributed File System (HDFS) is a distributed, high-performance file system designed to run on commodity hardware.

    Apache Hadoop Common

    These common utilities support the other Hadoop modules. This programming framework supports the distributed processing of large data sets.

    Apache Hadoop YARN

    Apache Hadoop YARN is a framework for job scheduling and cluster resource management. It splits the following functionalities into separate daemons:

    ResourceManager interfaces with the client, tracks tasks, and assigns tasks to NodeManagers.

    NodeManager launches and tracks task execution on the worker nodes.

    Apache Hive

    Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

    Apache ZooKeeper

    Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

    ZooKeeper Master Node

    ZooKeeper is a high-availability system, whereby two or more nodes can connect to a ZooKeeper master node. The ZooKeeper master node controls the nodes to provide high availability.

    8

    http://hadoop.apache.org/
    https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux
    https://www.cloudera.com/content/dam/www/marketing/resources/solution-briefs/enterprise-data-hub-solution-brief.pdf.landing.html
    http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
    https://github.com/apache/hadoop-common
    https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
    https://hive.apache.org/
    https://www.hitachivantara.com/en-us/pdf/architecture-guide/hitachi-solution-for-enterprise-data-intelligence-with-cloudera-whitepaper.pdf

  • 9

    ZooKeeper Standby Master Node

    When ZooKeeper runs in a highly available setup, several nodes can be configured as ZooKeeper master nodes. Only one of these configured nodes is active as the master node at any time. The others are standby master nodes.

    If the currently-active master node fails, then the ZooKeeper cluster itself promotes one of the standby master nodes to the active master node.

    Oracle Database

    Oracle Database has a multi-tenant architecture so you can consolidate many databases quickly and manage them as a cloud service. Oracle Database also includes in-memory data processing capabilities for analytical performance. Additional database innovations deliver efficiency, performance, security, and availability. Oracle Database comes in two editions: Enterprise Edition and Standard Edition 2.

    Oracle Automatic Storage Management (Oracle ASM) is a volume manager and a file system for Oracle database files. This supports single-instance Oracle Database and Oracle Real Application Clusters configurations. Oracle ASM is the recommended storage management solution that provides an alternative to conventional volume managers, file systems, and raw devices.

    Red Hat Enterprise Linux

    Red Hat Enterprise Linux delivers military-grade security, 99.999% uptime, support for business-critical workloads, and so much more. Ultimately, the platform helps you reallocate resources from maintaining the status quo to tackling new challenges.

    Device mapper multipathing (DM-Multipath) allows you to configure multiple I/O paths between server nodes and storage arrays into a single device.

    These I/O paths are physical SAN connections that can include separate cables, switches, and controllers. Multipathing aggregates the I/O paths, creating a new device that consists of the aggregated paths.

    Solution Design

    This describes the reference architecture environment used to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. The environment uses Hitachi Advanced Server DS120 and Hitachi Advanced Server DS220.

    The infrastructure configuration includes the following:

    Pentaho server — There is one server configured to run Pentaho.

    Hadoop cluster servers — There are at least three servers configured to run a Hadoop cluster with the Hive2 database. The number of Hadoop servers can be expanded, based on the size of the working set, sharding, and other factors. This solution uses local HDDs as JBOD for the Hadoop cluster hosts.

    IP network connection — There are IP connections to connect the Pentaho server, the Apache Hive servers, and the Oracle server through Cisco Nexus switches.

    Oracle Database architecture built on Hitachi Unified Compute Platform CI — This is the Oracle infrastructure used for hosting the Oracle enterprise data warehouse. The infrastructure includes one Hitachi Advanced Server DS220, two Brocade G620 SAN switches, and Hitachi Virtual Storage Platform G600. In your implementation, any Hitachi Virtual Storage Platform Gx00 model or Virtual Storage Platform Fx00 model can be used. The Oracle database can be configured as a single instance or an Oracle Real Application Clusters environment.

    9

    https://www.oracle.com/database/index.htmlhttps://docs.oracle.com/database/121/OSTMG/toc.htmhttps://www.redhat.com/en/technologies/linux-platforms/enterprise-linux

  • 10

    The following are the components in a minimal deployment, single rack configuration of this solution. It uses a single Hitachi Advanced Server DS120.

    Switches Top-of-rack data switches Management switches

    3 master nodes 2 Intel 4110 processors 256 GB of RAM 10 × 1.8 TB SAS drives 2 × 128 GB SATADOMS

    1 edge node 2 Intel 4110 processors 128 GB of RAM 2 × 1.8 TB SAS drives 2 × 128GB SATADOMS

    1 Pentaho node 2 Intel 4110 processors 368 GB of RAM 4 × 1.8 TB SAS drives 2 × 128 GB SATADOMS

    9 worker nodes 2 Intel 4110 processors 368 GB of RAM 12 × 1.8 TB SAS drives 2 × 128 GB SATADOMS

    The 9 worker nodes provide 1.8 × 12 × 6 TB of raw storage, allowing for 129.6 TB of raw storage. After a replication factor of 3 and reserving 20% for overhead, this gives 34.56 TB of usable storage (a short worked example follows this list). There are multiple reasons to add more worker nodes:

    Provide more storage.

    Improve performance, after monitoring system performance to see if there is a need.

    The queries require more than the standard 20% overhead.
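    The usable-capacity estimate above can be reproduced with a short calculation. This is a minimal sketch only: the node count below is the one implied by the 129.6 TB raw figure quoted above (1.8 TB × 12 drives × 6 nodes), and all inputs should be adjusted to your actual configuration.

```python
# Hedged worked example: usable HDFS capacity from the raw-drive figures above.
nodes = 6             # node count implied by the 129.6 TB raw figure in the text
drives_per_node = 12  # 1.8 TB SAS drives per worker node
drive_tb = 1.8
replication = 3       # dfs.replication
overhead = 0.20       # fraction of space reserved for overhead

raw_tb = nodes * drives_per_node * drive_tb          # 129.6 TB raw
usable_tb = raw_tb / replication * (1 - overhead)    # 34.56 TB usable

print(f"raw = {raw_tb:.1f} TB, usable = {usable_tb:.2f} TB")
```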

    For complete configuration and deployment information, see Hitachi Solution for Enterprise Data Intelligence with Cloudera.

    10

    https://www.hitachivantara.com/en-us/pdf/architecture-guide/hitachi-solution-for-enterprise-data-intelligence-with-cloudera-whitepaper.pdf

  • 11

    Solution Validation

    For validation purposes, this reference architecture uses three Hitachi Advanced Server DS120 servers for a three-node Hadoop host configuration, and one Advanced Server DS120 server to host the Pentaho Data Integration (PDI) tool. The architecture provides the compute power for the Apache Hive database to handle complex database queries and a large volume of transaction processing in parallel.

    Table 3, “Hitachi Advanced Server DS120 Specifications,” and Table 4, “Hitachi Advanced Server DS220 Specifications,” describe the details of the server configuration for this solution.

    Table 5 shows the server BIOS and Red Hat Enterprise Linux 7.4 kernel parameters for the Apache Hadoop cluster servers.

    TABLE 3. HITACHI ADVANCED SERVER DS120 SPECIFICATIONS

    Server Server Name Role CPU Cores RAM

    Hadoop Server 1 hadoopnode1 Hadoop Cluster Node 1

    16 384 GB (32 GB × 12)

    Hadoop Server 2 hadoopnode2 Hadoop Cluster Node 2

    16 384 GB (32 GB × 12)

    Hadoop Server 3 hadoopnode3 Hadoop Cluster Node 3

    16 384 GB (32 GB × 12)

    Pentaho Server edwpdi Pentaho PDI Server

    16 128 GB (32 GB × 4)

    TABLE 4. HITACHI ADVANCED SERVER DS220 SPECIFICATIONS

    Server Server Name Role CPU Cores RAM

    Oracle Oracle host Oracle Server 36 768 GB (64 GB × 12)

    TABLE 5. BIOS AND RED HAT ENTERPRISE LINUX 7.4 KERNEL PARAMETERS FOR THE APACHE HADOOP CLUSTER SERVERS

    Parameter Category Setting Value

    BIOS NUMA ENABLE

    DISK READ AHEAD DISABLE

    11

  • 12

    Table 6 shows the server BIOS and Red Hat Enterprise Linux 7.4 kernel parameters for the Pentaho server.

    Table 7 shows the parameters for the Apache Hadoop Environment

    TABLE 5. BIOS AND RED HAT ENTERPRISE LINUX 7.4 KERNEL PARAMETERS FOR THE APACHE HADOOP CLUSTER SERVERS (CONTINUED)

    Parameter Category Setting Value

    RHEL 7.4 Kernel ulimit unlimited

    vm.dirty_ratio 20

    vm.dirty_background_ratio 10

    vm.swappiness 1

    transparent_hugepage never

    IO scheduler noop

    RHEL 7.3 OS services tuned disable
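    The kernel settings in Table 5 can be applied from a short script, as in the hedged sketch below. The device name and the way the settings are persisted (sysctl.conf, tuned profiles, or similar) are assumptions and vary by site; treat this as illustrative only.

```python
import subprocess

# Illustrative only: apply the RHEL kernel settings from Table 5 on a Hadoop node.
# Persisting them (for example in /etc/sysctl.d/) is left to your own standards.
sysctl_settings = {
    "vm.dirty_ratio": "20",
    "vm.dirty_background_ratio": "10",
    "vm.swappiness": "1",
}

for key, value in sysctl_settings.items():
    subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)

# Disable transparent huge pages and set the noop I/O scheduler for a data disk.
# "sdb" is a placeholder device name.
with open("/sys/kernel/mm/transparent_hugepage/enabled", "w") as f:
    f.write("never")
with open("/sys/block/sdb/queue/scheduler", "w") as f:
    f.write("noop")
```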

    TABLE 6. PARAMETERS FOR THE PENTAHO SERVER

    Parameter Category Setting Value

    BIOS NUMA ENABLE

    DISK READ AHEAD DISABLE

    TABLE 7. PARAMETERS FOR THE APACHE HADOOP ENVIRONMENT

    Setting Property Value

    Memory Tuning mapred.child.java.opts 2048 MB

    Replication factor dfs.replication 3

    Map CPU vcores mapreduce.map.cpu.vcores 2

    Reduce CPU vcores mapreduce.reduce.cpu.vcores 2

    Buffer size io.file.buffer.size 128

    DFS Block size dfs.block.size 128

    Speculative Execution mapreduce.map.tasks.speculative.execution TRUE

    Hive vectorized execution set hive.vectorized.execution.enabled TRUE


    12

  • 13

    Table 8 has the connection parameters for the Oracle environment.

    Note — The connection information given in Table 8 may differ for your production environment, based on your offload configuration. To execute multiple transformation steps, Pentaho Data Integration (PDI) opens multiple database connections. Make sure that Oracle Database allows a sufficient number of connections to offload using PDI.
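    Before a large offload run it can help to confirm the current limits. The following is a minimal sketch using the cx_Oracle client library; the connection details are placeholders, not values from the tested environment.

```python
import cx_Oracle  # assumes the Oracle client libraries are installed

# Placeholders: replace with your Oracle EDW connection details.
conn = cx_Oracle.connect("system", "password", "oracle-host:1521/EDWPDB")
cur = conn.cursor()

cur.execute(
    "SELECT name, value FROM v$parameter "
    "WHERE name IN ('processes', 'sessions', 'transactions')"
)
for name, value in cur:
    print(name, value)  # compare against the targets in Table 8 (5000 / 7000 / 6000)

conn.close()
```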

    Storage Architecture

    This describes the storage architecture for this solution.

    Storage Configuration for Hadoop Cluster

    This configuration uses recommended practices with Hitachi Advanced Server DS120 and CDH for the design and deployment of storage for Hadoop.

    Configure a total of four HDDs on each Hadoop Node.

    For the best performance and size to accommodate the Oracle Enterprise Data Warehouse offloading space, adjust the size of HDDs to meet your business requirements.

    For more information on how Hitachi provides a high-performance, integrated, and converged solution for Oracle database, see Hitachi Solution for Databases Reference Architecture for Oracle Real Application Clusters Database 12c.

    TABLE 7. PARAMETERS FOR THE APACHE HADOOP ENVIRONMENT (CONTINUED)

    Setting Property Value

    Hive Server Heap Size hive-env.sh hadoop_opts 16GB

    Spark driver Memory spark.driver.memory 12GB

    TABLE 8. CONNECTION PARAMETERS FOR ORACLE ENVIRONMENT

    Setting Property Value

    Database processes processes 5000

    Database sessions sessions 7000

    Database transactions transactions 6000


    13

    https://www.hitachivantara.com/en-us/pdf/white-paper/solution-databases-reference-architecture-oracle-rac-database-12c.pdf

  • 14

    Table 9 shows the storage configuration used for the Hadoop cluster in this solution.

    Note — On the node where you install Cloudera Management services, you need 100 GB space allocated exclusively for Cloudera Management services data. If you want to use the root partition as data directories for Cloudera Management services, have a root partition with 100 GB. If you want to use any other disk, make sure available space in it is more than 100 GB.

    File System Recommendation for Cloudera Distribution for Hadoop

    Hadoop Distributed File System (HDFS) is designed to run on top of an underlying file system in an operating system. Cloudera recommends that you use one of the following file systems tested on the supported operating systems.

    ext3 — This is the most tested underlying filesystem for HDFS.

    ext4 — This scalable extension of ext3 is supported in more recent Linux releases.

    XFS — This is the default file system in Red Hat Enterprise Linux 7.

    Note — Cloudera does not support in-place upgrades from ext3 to ext4. Cloudera recommends that you format disks as ext4 before using them as data directories.

    TABLE 9. STORAGE CONFIGURATION FOR APACHE HIVE DATABASE

    RAID Level JBOD

    Drive Type 1.2 TB HDD

    Number of Drives 4

    Total Useable Capacity 4.8 TB

    File System Type XFS

    File System Block Size 4 KB

    Disk Readahead (BIOS Setting) Disable
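    Preparing a data disk for the configuration in Table 9 can be scripted, as in the hedged sketch below. The device name and mount point are placeholders, and the noatime mount option is a common Hadoop data-disk choice rather than a requirement from this guide; follow the file system recommendation above when choosing XFS or ext4.

```python
import os
import subprocess

# Illustrative only: format and mount one Hadoop data disk (names are placeholders).
device = "/dev/sdb"
mount_point = "/data/1"

subprocess.run(["mkfs.xfs", "-f", device], check=True)   # XFS, per Table 9
os.makedirs(mount_point, exist_ok=True)
# noatime is a commonly used option for Hadoop data disks; confirm against your standards.
subprocess.run(["mount", "-o", "noatime", device, mount_point], check=True)
```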

    14

    https://www.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#cdh_cm_supported_os

  • 15

    Network Architecture

    This architecture requires the following separate networks for the Pentaho server and the Cloudera Distribution Hadoop servers:

    Public Network — This network must be scalable. In addition, it must meet the low latency needs of the network traffic generated by the servers running applications in the environment.

    Management Network — This network provides BMC connections to the physical servers.

    Data Network — This network provides communication between nodes.

    Hitachi Vantara recommends using pairs of 10/25 Gb/s NICs for the public network and 1 Gb/s LOM for the management network.

    Observe these points when configuring public network in your environment:

    For each server in the configuration, use at least two identical, high-bandwidth, low-latency NICs for the public network.

    Use NIC bonding to provide failover and load balancing within a server. If using two dual-port NICs, NIC bonding can be configured across the two cards.

    Ensure all NICs are set to full duplex mode.

    For the complete network architecture, see Hitachi Solution for Enterprise Data Intelligence with Cloudera.

    Figure 2 on page 16 shows the network configuration in this solution.

    15

    https://www.hitachivantara.com/en-us/pdf/architecture-guide/hitachi-solution-for-enterprise-data-intelligence-with-cloudera-whitepaper.pdf

  • 16

    Figure 2

    Table 10, “Network Configuration and IP Addresses,” on page 17 shows the network configuration, IP addresses, and name configuration used when testing the environment with Hitachi Advanced Server DS120. Your implementation of this solution can differ.

    Configure pairs of ports from different physical NIC cards to avoid a single point of failure (SPoF) when installing two NICs on each server. However, if high availability is not a concern for you, this environment supports using one NIC on the Cloudera Distribution Hadoop servers and the Pentaho server for lower cost.

    The IP configuration for the Oracle environment can be adjusted for Oracle real application cluster, as applicable.

    16

  • 17

    Data Analytics and Performance Monitoring Using Hitachi Storage Advisor

    Use Hitachi Storage Advisor for data analytics and performance monitoring with this solution.

    By reducing storage infrastructure management complexities, Hitachi Storage Advisor simplifies management operations. This helps you to rapidly configure storage systems and IT services for new business applications.

    Hitachi Storage Advisor can be used for this Oracle Database architecture, built on Hitachi Unified Compute Platform CI. However, this may not be required for your implementation.

    TABLE 10. NETWORK CONFIGURATION AND IP ADDRESSES

    Server | NIC Ports | Subnet | NIC Bond | IP Address | Network | Bandwidth (Gb/s) | Cisco Nexus Switch Number | Port

    DS120 Server 1 (Hadoop) | NIC-0 | 167 | Bond0 | 172.17.167.71 | Public | 10/25 | 1 | 11
    | NIC-3 | | | | | 10/25 | 2 |
    | BMC dedicated NIC | 242 | - | 172.17.242.165 | Management | 1 | 3 | 11

    DS120 Server 2 (Hadoop) | NIC-0 | 167 | Bond0 | 172.17.167.72 | Public | 10/25 | 1 | 12
    | NIC-3 | | | | | 10/25 | 2 |
    | BMC dedicated NIC | 242 | - | 172.17.242.166 | Management | 1 | 3 | 12

    DS120 Server 3 (Hadoop) | NIC-0 | 167 | Bond0 | 172.17.167.73 | Public | 10/25 | 1 | 13
    | NIC-3 | | | | | 10/25 | 2 |
    | BMC dedicated NIC | 242 | - | 172.17.242.167 | Management | 1 | 3 | 13

    DS120 Server 4 (Pentaho) | NIC-0 | 167 | Bond0 | 172.17.167.74 | Public | 10/25 | 1 | 14
    | NIC-3 | | | | | 10/25 | 2 |
    | BMC dedicated NIC | 242 | - | 172.17.242.74 | Management | 1 | 3 | 14

    DS220 Server (Oracle) | NIC-0 | 167 | Bond0 | 172.17.167.75 | Public | 10/25 | 1 | 15
    | NIC-3 | | | | | 10/25 | 2 |
    | BMC dedicated NIC | 242 | - | 172.17.242.75 | Management | 1 | 3 | 15

    17

    https://www.hitachivantara.com/en-us/pdf/datasheet/hitachi-datasheet-storage-advisor.pdf

  • 18

    Oracle Enterprise Data Offload Workflow

    Use a Python script to generate the Oracle enterprise data workflow mapping of large Oracle data sets to Cloudera Distribution Hadoop, and then offload the data using Pentaho Data Integration. Have this script create a transformation Kettle file with Spoon for the data offload.

    This auto-generated transformation transfers row data from Oracle database tables or views in a schema to the Apache Hadoop database. Pentaho Data Integration uses this transformation directly.

    Pentaho Data Integration

    Pentaho Data Integration (PDI) allows you to ingest, blend, cleanse, and prepare diverse data from any source. With visual tools to eliminate coding and complexity, Pentaho puts all data sources and the best quality data at the fingertips of businesses and IT users.

    Using intuitive drag-and-drop data integration coupled with data agnostic connectivity, your use of Pentaho Data Integration can span from flat files and RDBMS to Cloudera Distribution Hadoop and beyond. Go beyond a standard extract-transform-load (ETL) designer to scalable and flexible management for end-to-end data flows.

    In this reference architecture, the end target of the ETL process is an Apache Hive database.

    To set up Pentaho to connect to a Cloudera Distribution Hadoop cluster, make sure that the correct plugin (shim) is configured. View the plugin configuration steps at Set Up Pentaho to Connect to a Cloudera Cluster.

    For this solution, configure and set Active Shim to Cloudera CDH 5.13. Figure 3 shows setting this in Cloudera Hadoop Distribution for PDI.

    Figure 3

    18

    http://www.pentaho.com/product/data-integration
    https://help.pentaho.com/Documentation/8.0/Setup/Configuration/Hadoop_Clusters/Cloudera

  • 19

    Script for Automatic Enterprise Data Workflow Offload

    You can use an Oracle enterprise data workflow script that uses existing user accounts with appropriate permissions for Oracle, Hadoop Distributed File System, and Apache Hive database access. Have the script test the Oracle, HDFS, and Apache Hive connections first, and then proceed with the further options; a minimal sketch of these connection tests follows the list of options below.

    The main options to generate a Pentaho Data Integration transformation are the following:

    Transfer all tables in selected Oracle schema to Apache Hive on Hadoop.

    Transfer Oracle tables based on partition to Apache Hive on Hadoop.

    Transfer specific Oracle table rows to Apache Hive on Hadoop based on date range.

    Transfer specific Oracle table rows to Apache Hive on Hadoop based on column key value.
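    The connection tests mentioned above might look like the following minimal sketch, assuming the cx_Oracle, hdfs (WebHDFS), and PyHive client libraries are available. Host names, ports, and credentials are placeholders, not values from the tested environment.

```python
import cx_Oracle
from hdfs import InsecureClient
from pyhive import hive


def test_connections():
    # Oracle EDW source (placeholder credentials and service name).
    ora = cx_Oracle.connect("edw_user", "password", "oracle-host:1521/EDWPDB")
    ora.cursor().execute("SELECT 1 FROM dual")

    # HDFS over WebHDFS (placeholder NameNode URL).
    hdfs_client = InsecureClient("http://hadoopnode1:50070", user="hdfs")
    hdfs_client.status("/")

    # HiveServer2 target (placeholder host).
    hv = hive.Connection(host="hadoopnode1", port=10000, username="hive")
    hv.cursor().execute("SHOW DATABASES")

    print("Oracle, HDFS, and Hive connections verified")


if __name__ == "__main__":
    test_connections()
```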

    Example Workflow Offloads Using Pentaho Data Integration

    These are enterprise data warehouse offload examples implemented with Pentaho Data Integration. They were created using the graphical user interface in Pentaho Data Integration.

    Full Table Data Copy from Oracle to Apache Hive

    It can be a challenge to convert data types between two database systems. With the graphical user interface in Pentaho Data Integration, you can do data type conversion with a few clicks. No coding is needed.

    Use the graphical user interface in Pentaho Data Integration to construct a workflow to copy all the data from an Oracle table to Apache Hive.

    You can define data connections for the Pentaho server so that Pentaho Data Integration can access data from sources like Oracle Database. See Define Data Connections for the Pentaho Server for these procedures.

    Figure 4 shows the Pentaho Data Integration workflow for a full table data copy from Oracle to Apache Hive in the user interface.

    Figure 4

    Your workflow can copy data directly from Oracle to Apache Hive. However, there is a performance penalty if you do that. Read about this performance penalty.

    To avoid the performance penalty, create the PDI workflow with the following three steps:

    1. Read the Oracle table (table input).

    2. Copy the Oracle table to Hadoop Distributed File System (Hadoop file output).

    3. Load the HDFS file into the Apache Hive table (execute SQL script).
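    As an illustration of the third step, the SQL executed against Hive is typically a LOAD DATA statement. The following hedged sketch uses PyHive; the HiveServer2 host, staging path, and table name are placeholders, not the objects used in the tested workflow.

```python
from pyhive import hive

# Placeholders: HiveServer2 host, the staging file written by the Hadoop File Output step,
# and the target Hive table.
conn = hive.Connection(host="hadoopnode1", port=10000, username="hive")
cur = conn.cursor()

# Move the staged HDFS file into the Hive table
# (the equivalent of the "execute SQL script" step in the PDI workflow).
cur.execute("LOAD DATA INPATH '/user/pentaho/staging/sales.txt' INTO TABLE sales")
```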

    19

    https://help.pentaho.com/Documentation/7.1/0H0/Specify_Data_Connections_for_the_Pentaho_Server/Define_Data_Connections_for_the_Pentaho_Server
    https://support.pentaho.com/hc/en-us/requests/83157

  • 20

    To create a full table data copy workflow, do the following.

    1. On the Pentaho Data Integration workflow, double-click Read Oracle Table (Figure 4 on page 19). The Table Input dialog box opens (Figure 5).

    2. On the Table Input page, provide the connection information and SQL query for the data, and click OK.

    Figure 5

    3. Set parameters for the file transfer.

    (1) To transfer Oracle table input data to Hadoop Distributed File System, double-click Copy Oracle Table to HDFS File (Figure 4 on page 19). The Hadoop File Output dialog box opens to the File tab (Figure 6 on page 21).

    20

  • 21

    Figure 6

    (2) Click the Fields tab (Figure 7 on page 22).

    (3) Change settings as needed, and then click OK.

    Figure 7 shows setting the value for the HDFS output file to convert a date field (TIME_ID) to a string value during offload.
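    As a small illustration of that conversion, the date column is written out as text using a format mask. The mask below is an assumption for illustration only, not the value used in the tested workflow.

```python
from datetime import date

# Equivalent of converting the TIME_ID date column to a string during the HDFS write.
time_id = date(2018, 9, 24)
print(time_id.strftime("%Y-%m-%d"))  # "2018-09-24"
```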

    21

  • 22

    Figure 7

    4. Load the HDFS file into the Apache Hive database.

    (1) Double-click Load HDFS File into Hive Table (Figure 4 on page 19). The Execute SQL statements dialog box opens (Figure 8 on page 23).

    (2) Change settings as necessary, and then click OK.

    You can form a Hive SQL (HQL) query to match your requirements.

    22

  • 23

    Figure 8

    23

  • 24

    Figure 9 shows the execution of the workflow.

    Figure 9

    24

  • 25

    Figure 10 shows the copied table information in Apache Hive Database.

    Figure 10

    Merge and Join Two Tables Data Copy from Oracle to Apache Hive

    Often you need to join two or more enterprise data warehouse tables.

    This example helps you understand how to query two different data sources simultaneously while saving the operational result (a join, in this case) to Hadoop. This example also applies to reading data from Oracle and Hive databases together: you can perform the join on one table from the Oracle database and another table from the Hive database.
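    The merge-join pattern used in this kind of workflow typically expects both inputs sorted on the join key, which is why the sort settings discussed below matter. The sketch that follows is not PDI code; it only illustrates the underlying sort-merge join on two small, already-sorted row sets with placeholder data.

```python
# Illustrative sort-merge inner join of two already-sorted row streams on a shared key.
oracle_rows = [(1, "EMEA"), (2, "APAC"), (3, "AMER")]  # e.g. rows from the Oracle table input
hive_rows = [(1, 1200.0), (3, 800.0)]                  # e.g. rows from the Hive table input


def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            out.append(left[i] + right[j][1:])  # combine matching rows
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1
        else:
            j += 1
    return out


print(merge_join(oracle_rows, hive_rows))  # [(1, 'EMEA', 1200.0), (3, 'AMER', 800.0)]
```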

    Figure 11 on page 26 shows the transformation workflow and the execution results for the merged workflow in the graphical user interface of Pentaho Data Integration.

    25

  • 26

    Figure 11

    26

  • 27

    Figure 12 shows transformation information to load the HDFS file into the Hive table.

    Figure 12

    27

  • 28

    Figure 13 shows a simple verification of the joined data in the Hive database.

    Figure 13

    If done as part of the Pentaho workflow, sorting rows in large tables could be time consuming. Hitachi Vantara recommends sorting all the rows in server memory instead of using a memory-plus-disk approach.

    On the Sort row dialog box (Figure 14 on page 29), make the following settings:

    Use the Sort size (rows in memory) text box to control how many rows are sorted in server memory.

    Use the Free memory threshold (in %) text box to help avoid filling all available memory in server memory. Make sure to allocate enough RAM to Pentaho Data Integration on the server when you need to do large sorting tasks.

    Figure 14 shows controlling the cache size in the Sort rows dialog box from the graphical user interface in Pentaho Data Integration.

    28

  • 29

    Figure 14

    Sorting on the database is often faster than sorting externally, especially if there is an index on the sort field or fields. You can use this as another option to improve performance.
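    One simple way to apply that option is to let the source database do the sorting by adding an ORDER BY clause to the Table Input query, as in this hedged example; the table and column names are placeholders.

```python
# Placeholder query for the PDI Table Input step: the database sorts the rows
# (ideally using an index on the sort column), and PDI only streams the result.
table_input_sql = """
SELECT cust_id, time_id, amount_sold
FROM sales
ORDER BY cust_id
"""
```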

    More Oracle tables can be joined, one at a time, with the same Pentaho Data Integration sorting and join steps. You can also use Execute SQL Script in Pentaho as another option to join multiple tables. This example in the Pentaho Community Forums shows how to do this from Pentaho Data Integration.

    Pentaho Kettle Performance

    By default, all steps used in the Kettle transformation for offloading the data run in parallel and consume CPU and memory resources. Because these resources are limited, plan how many transformation steps should run in parallel.

    Form multiple transformations with a limited number of steps and then execute them in sequence. This ensures copying or offloading all data without exhausting the resources in the server.
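    One simple way to run a set of transformations one after another is to call PDI's command-line runner (Pan) in a loop, as in the sketch below. The installation path and transformation file names are placeholders, and the approach itself is an assumption about how you might script the sequence, not the method used in the tested environment.

```python
import subprocess

# Placeholders: path to the PDI installation and the generated transformation files.
PAN = "/opt/pentaho/data-integration/pan.sh"
transformations = ["offload_part1.ktr", "offload_part2.ktr", "offload_part3.ktr"]

# Run each transformation to completion before starting the next one, so the offload
# never runs more steps in parallel than a single transformation contains.
for ktr in transformations:
    subprocess.run([PAN, f"-file={ktr}", "-level=Basic"], check=True)
```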

    Decide on the number of steps in a transformation based on the available CPU and memory resources on the Kettle application host. Refer to the Pentaho server hardware information in Table 1, “Hardware Components,” on page 4.

    Environment Workflow Information

    Multiple Kettle transformations were created to run the copy or offload sequentially, rather than creating multiple parallel transformations through a Kettle job.

    29

    https://forums.pentaho.com/showthread.php?83324-Multiple-table-inputs-with-merge-joins-in-one-transformation

  • 30

    Transformation Steps Information

    The number of tables in each Kettle transformation was set to 80. The number of columns in the source Oracle database tables was set to 22.

    Observations

    The CPU resource utilization was between 55% and 60% throughout the transformation tests. This means that there was no resource depletion within the server; possibly, the number of tables in your workflow could be increased.

    Workflow execution with optimum performance depends on the following:

    How much CPU resource is available?

    How much data is being copied or offloaded in a table row?

    How many tables were included in the transformation?

    Engineering Validation

    This summarizes the key observations from the test results for the Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This environment uses Hitachi Advanced Server DS120, Pentaho 8.0, and the Cloudera Distribution of the Hadoop ecosystem 5.13.

    When evaluating this Oracle Enterprise Data Warehouse solution, the laboratory environment used the following:

    One Hitachi Unified Compute Platform CI for the Oracle Database environment

    Four Hitachi Advanced Server DS120

    One Hitachi Virtual Storage Platform G600

    Two Brocade G620 SAN switches

    Using this same test environment is not a requirement to deploy this solution.

    Test Methodology

    The source data was preloaded into a sample database schema in the Oracle database.

    After preloading the data, a few example Pentaho Data Integration workflows were developed for offloading data to the Apache Hive database.

    Once data was loaded into the Apache Hive database, verification was done to make sure the data was offloaded correctly.

    The example workflows found in Oracle Enterprise Data Offload Workflow were used to validate this environment.

    Testing involved following this procedure for each example workflow:

    1. Verify the following for each Oracle table before running the enterprise data offload workflow:

    Number of rows

    Number of columns

    Data types

    2. Run the enterprise data offload workflow.

    30

  • 31

    3. Verify the following for the Apache Hive table after running the enterprise data offload workflow to see if the numbers matched those in the Oracle table:

    Number of documents (same as the number of rows)

    Number of fields (same as the number of columns)

    Data types
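    A minimal sketch of that verification, comparing row counts for one table in both systems, is shown below. Connection details and the table name are placeholders, and the client libraries (cx_Oracle and PyHive) are assumptions about tooling rather than part of the validated environment.

```python
import cx_Oracle
from pyhive import hive

table = "SALES"  # placeholder name of a table offloaded by the workflow

# Count rows in the Oracle source (placeholder connection details).
ora = cx_Oracle.connect("edw_user", "password", "oracle-host:1521/EDWPDB")
ora_cur = ora.cursor()
ora_cur.execute(f"SELECT COUNT(*) FROM {table}")
oracle_rows = ora_cur.fetchone()[0]

# Count rows in the Apache Hive target (placeholder connection details).
hv = hive.Connection(host="hadoopnode1", port=10000, username="hive")
hv_cur = hv.cursor()
hv_cur.execute(f"SELECT COUNT(*) FROM {table.lower()}")
hive_rows = hv_cur.fetchone()[0]

print(f"Oracle: {oracle_rows}  Hive: {hive_rows}  match: {oracle_rows == hive_rows}")
```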

    Test Results

    After running each enterprise data offload workflow example, the test results showed the same number of documents (rows), fields (columns), and data types in the Apache Hive database as in the Oracle database.

    These results show that you can use Pentaho Data Integration to move data from an Oracle host to Apache Hive on top of Hadoop Distributed File System to relieve the workload on your Oracle host. This provides a cost-effective solution to expanding capacity to relieve server utilization pressures.

    31

  • For More Information

    Hitachi Vantara Global Services offers experienced storage consultants, proven methodologies and a comprehensive services portfolio to assist you in implementing Hitachi products and solutions in your environment. For more information, see the Services website.

    Demonstrations and other resources are available for many Hitachi products. To schedule a live demonstration, contact a sales representative or partner. To view on-line informational resources, see the Resources website.

    Hitachi Academy is your education destination to acquire valuable knowledge and skills on Hitachi products and solutions. Our Hitachi Certified Professional program establishes your credibility and increases your value in the IT marketplace. For more information, see the Hitachi Vantara Training and Certification website.

    For more information about Hitachi products and services, contact your sales representative, partner, or visit the Hitachi Vantara website.

    https://www.hitachivantara.com/en-us/services.html
    https://www.hitachivantara.com/en-us/news-resources/resources.html
    https://www.hitachivantara.com/en-us/services/training-certification.html
    https://www.hitachivantara.com/

  • 1

    Corporate Headquarters
    2845 Lafayette Street
    Santa Clara, CA 96050-2639 USA
    www.HitachiVantara.com | community.HitachiVantara.com

    Regional Contact Information

    Americas: +1 408 970 1000 or [email protected]

    Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected]

    Asia Pacific: +852 3189 7900 or [email protected]

    Hitachi Vantara

    © Hitachi Vantara Corporation 2018. All rights reserved. HITACHI is a trademark or registered trademark of Hitachi, Ltd. VSP is a registered trademark of Hitachi Vantara Corporation. Microsoft and Windows Server are trademarks or registered trademarks of Microsoft Corporation. All other trademarks, service marks, and company names are properties of their respective owners.

    Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation.

    MK-SL-098-01. September 2018.

