Best Practices for Deploying Hadoop (BigInsights) in the Cloud

Best Practices for Deploying InfoSphere BigInsights and InfoSphere Streams in the Cloud

IBD-3456

Leons Petrazickis, IBM Canada

© 2013 IBM Corporation

Please Note

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Agenda

Introduction

Optimizing for disk performance

Optimizing Java for computational performance

Optimizing MapReduce for computational performance

Optimizing with Adaptive MapReduce

Common considerations for InfoSphere BigInsights and InfoSphere Streams

Questions and Answers

Prerequisites

To get the most out of this session, you should be familiar with the basics of the following:

Hadoop and Streams

MapReduce

HDFS or GPFS

Linux shell

XML

My Team

IBM Information Management Cloud Computing Centre of Competence

Information Management Demo Cloud

Deploy complete stacks of IBM software for demonstration and evaluation purposes

[email protected]

Images and templates with IBM software for public clouds

IBM SmartCloud Enterprise

IBM SoftLayer

Amazon EC2

My Work

Development: Ruby on Rails, Python, Bash/KSH shell scripting, Java

IBM SmartCloud Enterprise (public cloud): InfoSphere BigInsights, InfoSphere Streams, DB2

RightScale and Amazon EC2 (public cloud): InfoSphere BigInsights, InfoSphere Streams, DB2

IBM PureApplication System (private cloud appliance): DB2

Background

BigInsights recommendations are based on my experience optimizing BigInsights Enterprise 2.1 performance on an OpenStack private cloud

Streams recommendations are based on my experience optimizing Streams 3.1 performance on IBM SmartCloud Enterprise

Some recommendations are based on work with the IBM Social Media Accelerator to process enormous amounts of Twitter data using BigInsights and Streams

Hadoop Challenges in the Cloud

Hadoop does batch processing of data stored on disk. The bottleneck is disk I/O.

Infrastructure-as-a-Service clouds have traditionally focused on uses such as web servers that are optimized for in-memory operation and have different constraints.

Hadoop Disk Performance

Disk Performance

Hadoop performance is I/O bound. It depends on disk performance.

Hadoop is for batch processing of data stored on disks

Contrast with real-time and in-memory workloads (Streams, Apache), which depend on memory and processor speed

Infrastructure-as-a-Service (IaaS) clouds were originally optimized for in-memory workloads, not disk workloads

Cloud disk performance has traditionally been weak due to virtualization abstraction and network separation between computational units and storage

Different clouds have different solutions to this

Disk Performance – Choice of Cloud

Choice of cloud provider and instance type is crucial

Some cloud providers are worse for Hadoop than others

Favour local storage over network-attached storage (NAS)

For example, EBS on Amazon tends to be slower than local storage

Options:

SoftLayer and clouds of physical hardware

Storage-optimized instances on Amazon EC2

Other public and private clouds that keep storage as close to computational nodes as possible

Disk performance – Concepts

Hadoop Distributed File System (HDFS) and General Parallel File System (GPFS) are both abstractions

HDFS and GPFS run on top of disk filesystems

A disk is a device

A disk is divided into partitions

Partitions are formatted with filesystems

Formatted partitions can be mounted as a directory and used to store anything

For Hadoop, we want Just-a-Bunch-Of-Disks (JBOD), not RAID. HDFS has built-in redundancy.

Eschew Linux Logical Volume Manager (LVM).

Disk performance – Partitioning

We’ll use /dev/sdb as a sample disk name

Disks greater than 2TB in size require the use of a GUID Partition Table (GPT) instead of Master Boot Record (MBR)

parted -s /dev/sdb mklabel gpt

For Hadoop storage, create a single partition per disk

The partition editor can be finicky about where that partition stops and starts

end=$( parted /dev/sdb print free -m | grep sdb | cut -d: -f2 )

parted -s /dev/sdb mkpart logical 1 $end

If you were working with disk /dev/sdb, you will now have a partition called /dev/sdb1

Disk performance – Formatting

Many options: ext4, ext3, xfs

xfs is not included in base Red Hat Enterprise Linux (RHEL), so assume ext4

mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb1

“-m 1” reduces the number of filesystem blocks reserved for root to 1%. Hadoop does not run as root.

“dir_index” makes listing files in a directory faster. Instead of using a linked list, the filesystem will use a hashed B-tree.

“extent” makes the filesystem faster when working with large files. HDFS divides data into blocks of 64MB or more, so you’ll have many large files.

“sparse_super” saves space on large filesystems by keeping fewer backups of superblocks. Big Data processing implies large filesystems.

Disk performance – Mounting

Before you can access a partition, you have to mount it in an empty directory

mkdir -p /disks/sdb1

mount -o noatime,nodiratime /dev/sdb1 /disks/sdb1

“noatime” skips writing file access time to disk every time a file is accessed

“nodiratime” does the same for directories

In order for the system to re-mount your partition after reboot, you also have to add it to the /etc/fstab configuration file

echo "/dev/sdb1 /disks/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab

HDFS Data Storage on Multiple Partitions

Don’t forget that you can spread HDFS across multiple partitions (and so disks) on a single system

In the cloud, the root partition / is usually very small. You definitely don’t want to store Big Data on it.

Don’t use the root of a mounted filesystem (e.g. /disks/sdb1) as the data path. Create a subdirectory (e.g. /disks/sdb1/data)

mkdir -p /disks/sdb1/data

Otherwise, HDFS will get confused by things Linux puts in the root (e.g. /disks/sdb1/lost+found)
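As a rough sketch of the advice above, the following shell function creates a data subdirectory on every filesystem found under a common parent directory and prints the comma-separated list that the HDFS data directory setting expects. The function name hdfs_data_dirs is hypothetical, and the /disks layout is an assumption carried over from the examples in this deck rather than a BigInsights convention.

```shell
# Create <mount>/data for each directory under a parent (default /disks,
# following the examples in this deck) and print the resulting paths as a
# comma-separated list. hdfs_data_dirs is a hypothetical helper name.
# Assumption: every subdirectory of the parent is a mounted data disk.
hdfs_data_dirs() {
  parent=${1:-/disks}
  list=""
  for mount_point in "$parent"/*/; do
    [ -d "$mount_point" ] || continue      # the glob matched nothing
    data_dir="${mount_point%/}/data"       # e.g. /disks/sdb1/data
    mkdir -p "$data_dir"
    list="${list:+$list,}$data_dir"
  done
  printf '%s\n' "$list"
}
```

With partitions mounted at /disks/sdb1 and /disks/sdc1, calling hdfs_data_dirs prints /disks/sdb1/data,/disks/sdc1/data, which keeps lost+found out of the HDFS data path.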

HDFS Data Storage – Installation and Timing

You can set HDFS data storage path during installation or after installation.

BigInsights has a fantastic installer for Hadoop. It offers both a web-based graphical installer and a powerful silent install driven by a response file.

The web-based graphical installer will generate a silent install response file for you for future automation.

BigInsights also comes with sample silent install response files.

HDFS Data Storage – During installation

During installation, HDFS data storage path is controlled by the values of <hdfs-data-directory /> and <data-directory />

For example:

<cluster-configuration>
  <hadoop><datanode><data-directory>
    /disks/sdb1/data,/disks/vdc1/data
  </data-directory></datanode></hadoop>
  <node-list><node><hdfs-data-directory>
    /disks/sdb1/data,/disks/vdc1/data
  </hdfs-data-directory></node></node-list>
</cluster-configuration>

HDFS Data Storage – During Installation (2)

Multiple paths are separated by commas

Any path with an omitted initial / is considered relative to the installation’s <directory-prefix />

If <directory-prefix/> is “/mnt”, then the <hdfs-data-directory/> “hadoop/data” would be interpreted as “/mnt/hadoop/data”

You can mix relative and absolute paths in the comma-separated list of directories
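The path rule described on this slide can be illustrated with a small shell function. This is only a sketch of the rule as stated here, not the installer's own code; the name resolve_data_dirs is hypothetical.

```shell
# Resolve a comma-separated directory list against a directory prefix:
# entries that start with / are kept as-is, all others are placed under
# the prefix. resolve_data_dirs is a hypothetical helper name.
resolve_data_dirs() {
  prefix=${1%/}            # drop a trailing slash from the prefix, if any
  result=""
  old_ifs=$IFS
  IFS=,                    # split the second argument on commas
  for entry in $2; do
    case $entry in
      /*) path=$entry ;;             # absolute path
      *)  path=$prefix/$entry ;;     # relative to the directory prefix
    esac
    result="${result:+$result,}$path"
  done
  IFS=$old_ifs
  printf '%s\n' "$result"
}
```

For example, resolve_data_dirs /mnt "hadoop/data,/disks/sdb1/data" prints /mnt/hadoop/data,/disks/sdb1/data.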

HDFS Data Storage – After Installation

You can change the path of HDFS data storage after installation

Path is controlled by dfs.data.dir variable in hdfs-site.xml

In Hadoop 2.0, dfs.data.dir is renamed to dfs.datanode.data.dir

Note: With BigInsights, never modify configuration files in $BIGINSIGHTS_HOME/hadoop-conf/ directly

Modify $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hdfs-site.xml

Then run syncconf.sh to apply the configuration setting across the cluster

echo 'y' | syncconf.sh hadoop force

Note: Never reformat data nodes in BigInsights. Reformatting will erase BigInsights libraries from HDFS.

HDFS Namenode Storage

The Namenode of a Hadoop cluster stores the locations of all the files on the cluster

During installation, the path of this storage is determined by the value of <name-directory />

After installation, the path of namenode storage is determined by the value of dfs.name.dir variable in hdfs-site.xml

You can separate multiple locations with commas

In Hadoop 2.0, dfs.name.dir is renamed to dfs.namenode.name.dir

Hadoop Computational Performance

Java and Computational Performance

BigInsights and Hadoop are Java-based

Configuring the Java Virtual Machine (JVM) correctly is crucial to processing Big Data in Hadoop

Correct JVM configuration depends on both the machine and the type of data

BigInsights has a configuration preprocessor that will easily size the configuration to match the machine

Java and Computational Performance

Note: Never modify mapred-site.xml in $BIGINSIGHTS_HOME/hadoop-conf/ directly

Modify mapred-site.xml in $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/

Run syncconf.sh to process the calculations and apply the new configuration to the cluster

Java and Computational Performance

A key property for performance is the amount of memory allocated to each Java process or task

Keep in mind many tasks will be running at the same time, and you’ll want them all to fit within available machine memory with some margin

A good value for many use cases is 600m

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx600m</value>
</property>

When working with the IBM Social Media Accelerator, you’ll want much more memory per task. 4096m or more is common, with implications for size of machine expected.

Note: Do not enable -Xshareclasses. This was a bad default in older BigInsights releases.

Java and Computational Performance – Streams

Streams and Streams Studio are Java applications

You can increase the amount of memory allocated to the Streams Web Server (SWS) as follows, where X is in megabytes:

streamtool setproperty --instance-id myinstance SWS.jvmMaximumSize=X

streamtool stopinstance --instance-id myinstance

streamtool startinstance --instance-id myinstance

You can increase the amount of memory for Streams Studio in <install-directory>/StreamsStudio/streamsStudio.ini

After -vmargs, add -Xmx1024m or similar

MapReduce and Computational Performance

Hadoop traditionally uses the MapReduce algorithm for processing Big Data in parallel on a cluster of machines

Each machine runs a certain number of Mappers and Reducers

A Hadoop Mapper is a task that splits input data into intermediate key-value pairs

A Hadoop Reducer is a task that reduces a set of intermediate key-value pairs with a shared key to a smaller set of values


You’ll want more than one reduce task per machine, with both the number of available cores and the amount of available memory constraining the number you can have

The 600 denominator in the per-machine task formulas comes from the value for JVM memory (-Xmx600m) in mapred.child.java.opts

<property>

<name>mapred.reduce.tasks</name>

<value><%= Math.ceil(numOfTaskTrackers * avgNumOfCores * 0.5 * 0.9) %></value>

</property>
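To sanity-check the value the configuration preprocessor should produce for mapred.reduce.tasks, the same expression can be evaluated by hand. The sketch below assumes whole-number inputs and uses a hypothetical helper name, cluster_reduce_tasks; it is not part of BigInsights.

```shell
# ceil(task_trackers * avg_cores * 0.5 * 0.9) in integer arithmetic:
# 0.5 * 0.9 = 45/100, and adding 99 before dividing by 100 rounds up.
# cluster_reduce_tasks is a hypothetical helper name; both arguments
# are assumed to be whole numbers.
cluster_reduce_tasks() {
  task_trackers=$1
  avg_cores=$2
  echo $(( (task_trackers * avg_cores * 45 + 99) / 100 ))
}
```

For a cluster of 10 task trackers averaging 4 cores each, cluster_reduce_tasks 10 4 prints 18.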


Map tasks and reduce tasks use the machine differently. Map tasks will fetch input locally, while reduce tasks will fetch input from the network. They will run at the same time.

Running more tasks than will fit in a machine’s memory will cause tasks to fail.

Set the number of map tasks per machine to the number of available processor cores, limited by available memory

<name>mapred.tasktracker.map.tasks.maximum</name>

<value><%= Math.min(Math.ceil(numOfCores * 1.0),Math.ceil(0.8*0.66*totalMem/600)) %></value>

Set the number of reduce tasks per machine to half the number of map tasks

<name>mapred.tasktracker.reduce.tasks.maximum</name>

<value><%= Math.min(Math.ceil(numOfCores * 0.5),Math.ceil(0.8*0.33*totalMem/600)) %></value>


Cloud machine size    Number of mappers    Number of reducers
1 core, 2GB           1                    1
1 core, 4GB           1                    1
2 core, 8GB           2                    1
4 core, 15GB          4                    2
16 core, 61GB         16                   8
16 core, 117GB        16                   8
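The table values can be reproduced from the two per-machine formulas. The following is a minimal sketch, assuming totalMem is expressed in megabytes and that 1GB is 1024MB; task_slots is a hypothetical helper name, and the heap argument corresponds to the -Xmx value in mapred.child.java.opts.

```shell
# Print "<map slots> <reduce slots>" for one machine, following the two
# per-machine formulas from the slides. Assumptions: total_mem_mb is in
# megabytes, and heap_mb is the -Xmx value in mapred.child.java.opts
# (600 if omitted). task_slots is a hypothetical helper name.
task_slots() {
  cores=$1
  total_mem_mb=$2
  heap_mb=${3:-600}

  # Memory caps: ceil(0.8*0.66*mem/heap) and ceil(0.8*0.33*mem/heap),
  # written as integer arithmetic (0.8*0.66 = 528/1000, 0.8*0.33 = 264/1000).
  map_mem_cap=$(( (total_mem_mb * 528 + heap_mb * 1000 - 1) / (heap_mb * 1000) ))
  reduce_mem_cap=$(( (total_mem_mb * 264 + heap_mb * 1000 - 1) / (heap_mb * 1000) ))

  # Core caps: ceil(cores * 1.0) and ceil(cores * 0.5).
  map_core_cap=$cores
  reduce_core_cap=$(( (cores + 1) / 2 ))

  # Each slot count is the smaller of its core cap and its memory cap.
  map_slots=$(( map_core_cap < map_mem_cap ? map_core_cap : map_mem_cap ))
  reduce_slots=$(( reduce_core_cap < reduce_mem_cap ? reduce_core_cap : reduce_mem_cap ))
  echo "$map_slots $reduce_slots"
}
```

For instance, task_slots 4 15360 prints 4 2, matching the 4 core, 15GB row. On the machine sizes in the table the core count is the binding limit; memory only becomes the limit on machines with many cores and little memory, or when the heap per task is raised well above 600m.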

More options in mapred-site.xml

“mapred.child.ulimit” lets you control virtual memory used by Hadoop’s Java processes. 1.5x the heap size set in mapred.child.java.opts is a good value. Note that the value is in kilobytes. If the Java options are “-Xmx600m”, then a good value for the ulimit is 600*1.5*1024 which is “921600”.

“io.sort.mb” controls the size of the output buffer for map tasks. When it’s 80% full, it will start being written to disk. Increasing the size of the output buffer will reduce the number of separate writes to disk. Increasing the size will use more memory and do less disk I/O.

“io.sort.factor” defines the number of files that can be merged at one time. Merging is done when a map task is complete, and again before reducers start executing your analytic code. Increasing the size will use more memory and do less disk I/O.
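The mapred.child.ulimit arithmetic is easy to get wrong because the heap is given in megabytes and the limit in kilobytes. A small sketch of the conversion, using a hypothetical helper name child_ulimit_kb:

```shell
# mapred.child.ulimit in kilobytes for a given task heap in megabytes:
# heap * 1.5 * 1024, written as heap * 1536 to stay in integer arithmetic.
# child_ulimit_kb is a hypothetical helper name.
child_ulimit_kb() {
  heap_mb=$1
  echo $(( heap_mb * 1536 ))
}
```

child_ulimit_kb 600 prints 921600, the value quoted above for -Xmx600m.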

More options in mapred-site.xml (2)

“mapred.compress.map.output” enables compression when writing the output of map tasks. Compression uses more processor capacity but reduces disk I/O. The compression algorithm is determined by “mapred.map.output.compression.codec”

“mapred.job.tracker.handler.count” determines the size of the thread pool for responding to network requests from clients and tasktrackers. A good value is the natural logarithm (ln) of cluster size times 20. “dfs.namenode.handler.count” should also be set to this, as it performs the same functions for HDFS.

“mapred.jobtracker.taskScheduler” determines the algorithm used for assigning tasks to task trackers. For production, you’ll want something more sophisticated than the default JobQueueTaskScheduler.
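The handler count rule of thumb (natural logarithm of cluster size, times 20) can be computed with awk, whose log function is the natural logarithm. This is a sketch only: the helper name handler_count is hypothetical, and truncating to a whole number is an assumption, since the slides do not say how to round.

```shell
# Handler thread count: ln(cluster size) * 20, truncated to an integer.
# Truncation is an assumption; handler_count is a hypothetical helper name.
handler_count() {
  awk -v nodes="$1" 'BEGIN { printf "%d\n", log(nodes) * 20 }'
}
```

handler_count 10 prints 46 and handler_count 100 prints 92. For very small clusters the formula yields a low number, so you may prefer to keep the shipped default there.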

Kernel Configuration

Linux kernel configuration is stored in /etc/sysctl.conf

“vm.swappiness” controls kernel’s swapping of data from memory to disk. You’ll want to discourage swapping to disk, so 0 is a good value.

“vm.overcommit_memory” allows more memory to be allocated than exists on the system. If you experience memory shortages, you may want to set this to 1 as the way the JVM spawns Hadoop processes will have them request more memory than they need. Further tuning is done through “vm.overcommit_ratio”.
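One way to apply these kernel settings repeatably, for example from an image build script, is a small function that replaces any existing assignment of a key in the configuration file and appends the new one. This is a sketch under the assumption that it runs as root on a standard Linux system; set_sysctl is a hypothetical helper name.

```shell
# Set key = value in a sysctl configuration file (default /etc/sysctl.conf),
# replacing any existing assignment of the same key.
# set_sysctl is a hypothetical helper name.
set_sysctl() {
  key=$1
  value=$2
  file=${3:-/etc/sysctl.conf}
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')   # match dots literally
  tmp=$(mktemp)
  # Copy every line except existing assignments of this key.
  grep -v "^[[:space:]]*${key_re}[[:space:]]*=" "$file" > "$tmp" 2>/dev/null
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"     # overwrite in place to keep ownership and mode
  rm -f "$tmp"
}

# Typical use, as root:
#   set_sysctl vm.swappiness 0
#   set_sysctl vm.overcommit_memory 1
#   sysctl -p      # reload /etc/sysctl.conf
```

Rewriting the file rather than appending blindly avoids duplicate, conflicting entries when the script is run more than once.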

More BigInsights Performance

[Diagram: IBM Big Data Platform. IBM InfoSphere BigInsights architecture, with IBM and open source components grouped under Visualization & Discovery, Applications & Development, Advanced Analytic Engines, Workload Optimization, Runtime / Scheduler, Data Store, File System, Administration, Management, and Integration. Components shown include MapReduce, Adaptive MapReduce, HDFS, GPFS FPO, HBase, Hive, HCatalog, Pig, Jaql, ZooKeeper, Oozie, Lucene, Sqoop, Flume, BigSheets, Text Analytics, R, and Symphony, plus integration with Streams, Netezza, DB2, DataStage, Guardium, and Cognos.]

Adaptive MapReduce

Adaptive MapReduce lets mappers communicate through a distributed metadata store and take into account the global state of the job

Open install.properties before you install BigInsights

To enable Adaptive MapReduce, set the following:

AdaptiveMR.Enable=true

To also enable High Availability, set the following:

AdaptiveMR.HA.Enable=true

High Availability requires at least nodes in your cluster

Adaptive MapReduce is a single-tenant implementation of IBM Platform Symphony

Common Considerations for BigInsights and Streams

Common Considerations

Both BigInsights and Streams rely on working with large numbers of open files and running processes

Raise the Linux limit on the number of open files (“nofile”) to 131072 or more in /etc/security/limits.conf

Raise the Linux limit on the number of processes (“nproc”) to unlimited in /etc/security/limits.conf

Remove RHEL forkbomb protection from /etc/security/limits.d/90-nproc.conf

Validate your changes with a fresh login as your BigInsights and Streams users (e.g. biadmin, streamsadmin) and the ulimit command
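The limits.conf entries follow the format "domain type item value". As a sketch, the function below prints the soft and hard entries for one user with the values suggested above; limits_entries is a hypothetical helper name, and the user names are the examples from this deck.

```shell
# Print /etc/security/limits.conf entries for one user: open files raised
# to a given value (131072 if omitted) and processes set to unlimited.
# limits_entries is a hypothetical helper name.
limits_entries() {
  user=$1
  nofile=${2:-131072}
  for kind in soft hard; do
    printf '%s %s nofile %s\n' "$user" "$kind" "$nofile"
    printf '%s %s nproc unlimited\n' "$user" "$kind"
  done
}

# Typical use, as root:
#   limits_entries biadmin >> /etc/security/limits.conf
#   su - biadmin -c 'ulimit -n; ulimit -u'   # validate with a fresh login
```

Generating the lines per user keeps the BigInsights and Streams accounts consistent when both products share a machine.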

Questions and Answers

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.
