

© 2014 Unisys Corporation. All rights reserved. FOR INTERNAL USE ONLY.

St. Louis Hadoop User Group MeetUp, March 24, 2015

Roger Hill – Senior Big Data Architect Unisys – AMCOE Office St. Louis Missouri

Top 10 Linux (Red Hat or CentOS) Operating System Tips for Building an Optimal Performing Hadoop Cluster


Who is this “Roger Hill” you speak of ?

• Unisys Senior Architect (28 years of experience) for the BDAaaS group within the AMCOE data center, providing team guidance and direction around administration, development, business use cases, hardware recommendation and selection, best practices, industry standards, performance tuning and monitoring, and security standards and policies for Unisys Big Data environments.

• Lead the team in standardizing the installation and configuration of Hadoop environment build-outs. Construct and drive project schedules, timelines, and project lifecycles to acceptable levels of performance and closure. Train junior team members in Hadoop core competencies and fundamentals.

• Red Hat Certified Engineer (RHEL6)

• Assist Hadoop developers in the ETL process; responsible for data ingestion orchestration, data import/export, data governance, and data protection and security policies and implementations. Hands-on Hadoop administration and Java application troubleshooting, regular environment updating and patching, environment maintenance scheduling and migration planning.

• In past positions I've worked as a network admin, programmer, developer, system admin, system engineer, and application architect. I've worked as a Senior Consultant for companies like IBM, HP, DISYS, Savvis, CenturyLink and now Unisys. I am a certified Hadoop Engineer (Cloudera) and HBase specialist. I help to architect and design the layout for new Big Data environments, and automate and standardize the build process. I enjoy helping others, providing training on various Big Data technical areas, and solving complex problems. Currently we are working with Hortonworks and HDP 2.2 on AWS, with Platfora, R Studio, Pentaho and Graylog to complete the BDAaaS stack.


What this presentation will likely NOT cover

• Environment Deployment

– How to install Hadoop software

– Hardware Selection

• Hadoop Cluster Sizing

– Dependent upon data, workload and many other things

– Virtualization vs. Dedicated hardware

• Code Optimization

– Debugging bad code, garbage collection

– Code tracing, logic optimization

• Vendor Selection: we are vendor agnostic (neutral)

– These tips and tricks will work anywhere ... almost


What I hope that you DO get out of this presentation

The answer to many questions related to Hadoop (or ‘Big Data’) is …

… “It depends” …

“Because we know our environment and application needs, and we know EXACTLY what type of infrastructure we need”

“Said no one. Ever.”


• Better Understanding of Preparing the Linux OS for Hadoop

– Creating a “baseline” OS image

– Configuring, and Tuning the OS for Hadoop

• Better ability to Troubleshoot Hadoop + OS related issues

– Create a system of ‘checks and competencies’ for your environment

– Instinctive knowledge of where to look and what to check first

• Hadoop Cluster Performance Tuning and Adjustment

– How to build a ‘stable environment’ for your cluster

– How to ‘stabilize an existing environment’ that is maybe not so ‘stable’


To build a good house, we first need to start with a “Good Foundation” (why this is important!)

! Pay Attention !

• Start correctly

• Start correctly

• Start correctly


We have to remember that Hadoop is a complex ‘High End Computing Environment’, and the little things we do at the beginning of a project will matter!

• Tip Number One

– Create a good “baseline document record” of all of your servers at the beginning of the project (script to record hardware and software info and basic configuration)

– Have a repeatable installation and configuration method or process (Pre-installation scripts, Post Installation scripts, Kickstart, LinuxCOE, Puppet, Chef, Salt or Ansible, VM templates)
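A baseline record does not need to be elaborate to be useful. The following is a minimal sketch of such a script, assuming a Linux host with /proc mounted; the function name `baseline_record` and the key names it prints are illustrative choices, not part of the original talk.

```shell
#!/bin/sh
# Sketch: print a key=value baseline record for the local server.
# The field names (hostname=, kernel=, ...) are illustrative, not a standard.
baseline_record() {
    echo "hostname=$(uname -n)"
    echo "kernel=$(uname -r)"
    echo "arch=$(uname -m)"
    echo "cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null)"
    echo "mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)"
    echo "swap_total_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo 2>/dev/null)"
    # /etc/redhat-release exists on Red Hat and CentOS; fall back to "unknown" elsewhere.
    echo "os_release=$(cat /etc/redhat-release 2>/dev/null || echo unknown)"
}

baseline_record
```

Redirecting the output to a dated file per server at the start of the project gives you something to diff against later, when a node starts to behave differently from its peers.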


Why does this matter ?

...because he is using Hadoop too ...


And potentially, his next MapReduce job ...

... could turn this ...

... into this ...


Disable Swap to reduce performance hits

• Definition (Red Hat): “Swap space in Linux is used when the amount of physical memory (RAM) is full. If the system needs more memory resources and the RAM is full, inactive pages in memory are moved to the swap space. While swap space can help machines with a small amount of RAM, it should not be considered a replacement for more RAM. Swap space is located on hard drives, which have a slower access time than physical memory.”

“Swap space can be a dedicated swap partition (recommended), a swap file, or a combination of swap partitions and swap files.”

• When it occurs

– You will likely experience drastically slowed performance.

– If swapping activity continues, there is potential to “crash” your app and/or the server.

– Re-configure your application, add more RAM, or reduce SGA size.

• Sometimes called paging

– Some vendors now recommend not having ANY swap partitions at all.


To “swap” or not, that is the question ... ... so ... don’t

• Tip Number Two : What exactly is swap space anyhow ?

– Just like when you write a check that you don’t actually have the funds for.

– The old standards for swap space allocation ... 2x the server memory, really ? (Not today)

• How to find and display swap space on my Linux server ?

# free -m

# grep SwapTotal /proc/meminfo

# vmstat 3 10

# sar -B 3 3

# top

# cat /proc/sys/vm/swappiness (kernel tuning parameter controlling swap)
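The commands above are interactive. For a scripted check across many nodes, a small sketch like the following reads the same counters from /proc/meminfo. The function names `swap_total_kb` and `swap_used_kb` are made up for this example, and the optional file argument exists only so the functions can be tried against a saved copy of meminfo.

```shell
#!/bin/sh
# Sketch: report configured and used swap in kB from a meminfo-format file.
# With no argument the live /proc/meminfo is read.
swap_total_kb() {
    awk '/^SwapTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

swap_used_kb() {
    # Used swap is the difference between the SwapTotal and SwapFree counters.
    awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t - f}' "${1:-/proc/meminfo}"
}
```

A non-zero `swap_used_kb` on a worker node is a reasonable first thing to look at when a job suddenly slows down.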


Disable Swap

• How to find and modify (disable) swap space on a Linux server ?

• Immediate, in the running system kernel

# echo 0 > /proc/sys/vm/swappiness

• To make permanent across server reboots

# echo "vm.swappiness=0" >> /etc/sysctl.conf

• To disable devices and files for paging and swapping

# swapoff -a
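Appending to /etc/sysctl.conf with `>>` adds a second line every time the step is re-run. As a sketch of a repeatable alternative, the helper below replaces an existing setting or appends it if absent. The name `set_sysctl_conf` is hypothetical, and the in-place edit assumes GNU sed (`sed -i`), which Red Hat and CentOS ship.

```shell
#!/bin/sh
# Sketch: set KEY=VALUE in a sysctl.conf-style file without creating duplicates.
# Usage: set_sysctl_conf KEY VALUE [FILE]   (FILE defaults to /etc/sysctl.conf)
set_sysctl_conf() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file" 2>/dev/null; then
        # The key is already present: rewrite that line instead of appending.
        sed -i "s|^[[:space:]]*$key[[:space:]]*=.*|$key=$value|" "$file"
    else
        echo "$key=$value" >> "$file"
    fi
}
```

Running `set_sysctl_conf vm.swappiness 0` as root and then `sysctl -p` would load the value from the file into the running kernel.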


Set the “noatime” mount option to eliminate heavy IO operations on Linux disks and partitions

• Tip Number Three

– Tracking File Access Time on Linux Partitions

– What is “File Access Time”? Linux records the last time each file was read in its inode (the “atime”). The mount option “noatime” says ‘do not update inode access times on this filesystem (e.g., for faster access on the mail spool to speed up mail servers)’.

– Why disable? Hadoop traditionally does many more “seeks”, HDFS is still ‘write-once, read-many’, and the NameNode already tracks access information, so keeping “atime” enabled is redundant in our case. Disabling atime updates cuts down on disk IO transactions, thereby helping performance.

How to display the “File Access Time” setting ?

# mount

/dev/sda2 on / type ext4 (rw,noatime)

# cat /etc/fstab | egrep -v "^$|#"

/dev/sda2 / ext4 defaults,noatime 1 1

How to modify “File Access Time” (“noatime” also implies “nodiratime”)

# vi /etc/fstab

OR

# mount -o remount,rw,noatime /dev/sda2 /
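To audit a node for data filesystems that still update access times, something like the following sketch can scan the mount table. The function name `list_atime_mounts` and the choice of ext3, ext4 and xfs as the filesystem types of interest are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: print mount points of ext3/ext4/xfs filesystems mounted WITHOUT noatime.
# Reads a mounts-format file (device, mount point, type, options, ...);
# defaults to the live /proc/mounts.
list_atime_mounts() {
    awk '$3 ~ /^(ext3|ext4|xfs)$/ && $4 !~ /(^|,)noatime(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}
```

An empty result means every matching filesystem already carries the option; anything printed is a candidate for the /etc/fstab change shown above.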


Change the ‘Root Reserved Space’ on your Linux disks and partitions to a lower value to reclaim wasted disk space

• Tip Number Four

– Root Reserved Space

– By default the ext4 filesystem reserves 5% of the disk for ‘root-owned’ files.

– This is about “51 GB” of wasted space on a 1 TB HDD! The space is reserved so that root-owned processes can keep writing when the filesystem fills up, and to help avoid fragmentation.

– For modern high-capacity disks, this is higher than necessary. It is generally safe to reduce the percentage of reserved blocks to free up disk space when the partition is either very large or used for archive (not many writes).

– Over a large environment this equates to a “lot of wasted space” for no reason.
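The arithmetic behind the “wasted space” figure is easy to reproduce. This sketch uses integer shell arithmetic; the function name `reserved_gib` is made up, and sizes are taken in GiB so that a 1 TB disk is entered as 1024.

```shell
#!/bin/sh
# Sketch: space held back by the root reservation.
# Usage: reserved_gib SIZE_GIB PERCENT   (result is truncated to whole GiB)
reserved_gib() {
    echo $(( $1 * $2 / 100 ))
}

reserved_gib 1024 5    # one 1 TB (1024 GiB) disk at the default 5%
reserved_gib 1024 1    # the same disk at 1%
```

The first call prints 51 and the second prints 10. For a hypothetical cluster of 20 nodes with 12 such disks each, `reserved_gib $((20 * 12 * 1024)) 5` gives 12288 GiB, or 12 TB held back at the default setting.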


– The "tune2fs" command is used by the system administrator to change/modify tunable parameters on ext2, ext3 and ext4 type filesystems. To display the current values that are set you can use the tune2fs command with the "-l" option.

– How to display the current “root reserved space” on your partitions.

[root@server01 ~]# tune2fs -l /dev/sda2 | grep -i "Reserved block count"

Reserved block count: 114521
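The reserved block count is easier to judge as a percentage of the filesystem. As a sketch, the filter below reads `tune2fs -l` output on standard input and divides the "Reserved block count" field by the "Block count" field. The function name `reserved_pct` is hypothetical.

```shell
#!/bin/sh
# Sketch: compute the reserved-block percentage from `tune2fs -l` output on stdin.
reserved_pct() {
    awk -F: '
        /^Block count/          { blocks = $2 + 0 }
        /^Reserved block count/ { reserved = $2 + 0 }
        END { if (blocks > 0) printf "%.1f\n", 100 * reserved / blocks }
    '
}
```

On a real system this would be used as `tune2fs -l /dev/sda2 | reserved_pct` (run as root), which should report a value close to 5.0 on a filesystem created with the defaults.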


– How to modify the current “root reserved space” on your partitions.

Setting to 1% ‘reserved root space’ (for the extra cautious)

# tune2fs -m 1 /dev/sda2

tune2fs 1.41.12 (17-May-2010)

Setting reserved blocks percentage to 1% (22904 blocks)

Setting to 0% ‘reserved root space’ (still relatively safe)

# tune2fs -m 0 /dev/sda2

tune2fs 1.41.12 (17-May-2010)

Setting reserved blocks percentage to 0% (0 blocks)


Enable the Linux “nscd” service or daemon to improve name resolution lookups

• Tip Number Five

– Enable and turn on the “Name Service Cache Daemon”

– What is the “nscd” service ? A daemon that caches name service requests for ‘Passwords’, ‘Groups’ and ‘Hosts’.

– Why turn it on ? It helps with high latency LDAP, NIS and NIS+ requests, it takes up very few resources, and there is no configuration required. Hadoop and its subcomponents are very ‘network heavy’ applications.

– (HBase is especially dependent on DNS performance)

– How to display your current “Name Service Cache Daemon” settings.

# chkconfig --list nscd

nscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

# service nscd status

nscd (pid 6417) is running...

# nscd -g

– How to enable your “Name Service Cache Daemon”.

# chkconfig nscd on

# service nscd start


Modify the ulimit settings to give Hadoop greater ‘Open Number of Files’ (or more ‘File Descriptors’) , and increase ‘Number of max processes’

• Tip Number Six

– A “File Handle”, also known as a “File Descriptor”, is the way that the kernel refers to an open file.

– The “Number of Open Files” is a limit on how many files a process can have open at one time.

– The “Max Number of Processes” is also a ulimit-definable variable.

– “File Handle Limits” and “Max Number of Processes” are tunable via the ‘ulimit’ command (‘ulimit -a’ displays them) and the file “/etc/security/limits.conf”.


– Why should we modify ?

“2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files”

“Exception in thread "Thread-0" java.lang.OutOfMemoryError: unable to create new native thread”

– Linux ulimit default settings are too low for many Java based applications, especially a highly demanding application like Hadoop.


– How to display “Number of Open Files (File Handle Limits)” and “Maximum Number of Processes”

[root@server01 ~]# ulimit -n (number of open files / descriptors)

1024

[root@server01 ~]# ulimit -u (max number of processes)

4096


– How to modify “Number of Open Files (File Handle Limits)” and “Maximum Number of Processes”

[root@server01 ~]# ulimit -n 32768 (number of open files / descriptors)

[root@server01 ~]# ulimit -u unlimited (max number of processes)

[root@server01 ~]# vi /etc/security/limits.conf

* - nofile 32768

* - nproc 65536
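Once limits.conf is changed, it is worth confirming that a fresh login session actually sees the new values. The sketch below compares a current value against a minimum; the function name `check_limit` and its output wording are made up here, and the thresholds in the usage note are simply the values from the slide.

```shell
#!/bin/bash
# Sketch: compare one ulimit value against a required minimum.
# Usage: check_limit NAME CURRENT MINIMUM
check_limit() {
    name=$1
    current=$2
    minimum=$3
    # "unlimited" always satisfies the minimum; otherwise compare numerically.
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$minimum" ]; then
        echo "$name OK ($current)"
    else
        echo "$name LOW ($current < $minimum)"
    fi
}
```

Run as the user that owns the Hadoop daemons, `check_limit nofile "$(ulimit -n)" 32768` and `check_limit nproc "$(ulimit -u)" 65536` show whether that account picked up the settings.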


Have a dedicated Disk for the OS and logs, that is separate from the ‘Hadoop Data’

• Tip Number Seven

– Dedicated Disk for OS and Logs

– Dedicated Disk for HDFS, NameNode Data, etc

• Why would this make a difference ?

– OS operations and OS logs require much overhead

– Hadoop logging is quite “verbose”

– Need ‘separate space’ for Hadoop execution and Userspace execution


– Lay out “Disk sda” for the OS and Logs

– Lay out “Disk sdb, sdc, etc” for Hadoop HDFS Data and NameNode metadata

• How to display your Linux disks and layout

# df -ha

OR

# lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT

sr0 11:0 1 1024M 0 rom

sda 8:0 0 10G 0 disk

├─sda1 8:1 0 300M 0 part /boot

├─sda2 8:2 0 8.8G 0 part /

└─sda3 8:3 0 992M 0 part [SWAP]

The variable that controls the HDFS data location is “dfs.data.dir” (renamed “dfs.datanode.data.dir” in Hadoop 2), within hdfs-site.xml

Note : “dfs.data.dir” may contain a comma separated list of directory names, so that data may be stored on multiple local devices.

The variable that controls the NameNode metadata location is “dfs.namenode.name.dir”, also within hdfs-site.xml

Note : This determines where the NameNode will store the "name table". If a comma delimited list is given, the "name table" is stored in all locations for redundancy.
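With one filesystem per data disk, the value for “dfs.data.dir” is just the same subdirectory repeated under each mount point. Here is a small sketch that builds that list; the function name `join_data_dirs` and the /data/1, /data/2 mount points and hadoop/hdfs/data subdirectory in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: build a comma separated directory list for an HDFS directory property.
# Usage: join_data_dirs SUBDIR MOUNT [MOUNT...]
join_data_dirs() {
    suffix=$1
    shift
    list=""
    for mount in "$@"; do
        # Add a comma only when the list already has an entry.
        list="${list:+$list,}$mount/$suffix"
    done
    echo "$list"
}
```

For example, `join_data_dirs hadoop/hdfs/data /data/1 /data/2` prints `/data/1/hadoop/hdfs/data,/data/2/hadoop/hdfs/data`, which can be pasted into hdfs-site.xml or fed to a configuration template.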


Name Resolution Options – ALL

• Tip Number Eight

– Name Resolution, Forward and Reverse for DNS

– Setup of the “/etc/hosts” file: set up canonical names properly

10.8.15.1 server01.example.com server01 master01

10.8.15.2 server02.example.com server02 worker01

– Setup of the “/etc/hosts” file: set up the loopback address

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

– Your server’s hostname MUST match the DNS Fully Qualified Domain Name assigned to it.

# hostname -f

OR

# uname -n

# host server01.example.com
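A common mistake in “/etc/hosts” is listing the short name before the fully qualified name, which makes the short name canonical. The following sketch flags such entries; the function name `non_fqdn_entries` is made up, and treating "first name contains no dot" as "not fully qualified" is a simplifying assumption.

```shell
#!/bin/sh
# Sketch: print "IP NAME" for hosts-file entries whose first (canonical) name
# contains no dot. Comments, blank lines and loopback entries are skipped.
non_fqdn_entries() {
    awk '
        /^[ \t]*(#|$)/              { next }   # comment or blank line
        $1 ~ /^127\./ || $1 == "::1" { next }   # loopback entries
        NF >= 2 && $2 !~ /\./       { print $1, $2 }
    ' "${1:-/etc/hosts}"
}
```

No output means every non-loopback entry leads with a dotted name, matching the layout shown on the slide.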


– How to display and test forward DNS and reverse DNS.

# host www.cyberciti.biz

www.cyberciti.biz has address 75.126.153.206

www.cyberciti.biz has IPv6 address 2607:f0d0:1002:51::4

# host 75.126.153.206

206.153.126.75.in-addr.arpa domain name pointer www.cyberciti.biz.

– Your server’s DNS resolution file is at “/etc/resolv.conf”

# cat /etc/resolv.conf

# Generated by NetworkManager

domain localdomain

search localdomain example.com

nameserver 192.168.21.2

Other Linux DNS tools :

# nslookup

# dig


Benchmark your Hardware as much as possible before installation

• Tip Number Nine

– Benchmark your IO time on your disks

– Why would this matter ? Disk IO is often a bottleneck. If you have something configured wrong (hardware, OS, etc), it may show up here first.

– There are many options :

1. SysBench

2. IOZone

3. Phoronix Test Suite

4. kSar sar grapher

5. Cacti

6. Custom made scripts

7. Hardinfo – CPU Benchmark

8. GtkPerf – GTK+ Benchmark

9. Bonnie++

10. glxgears
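As an illustration of the “custom made scripts” option, here is a minimal sequential write test built on `dd`. It is a sketch rather than a substitute for the tools listed above: the function name `dd_write_test` is hypothetical, it assumes GNU dd (as shipped with Red Hat and CentOS), and it measures only large sequential writes.

```shell
#!/bin/sh
# Sketch: time a sequential write of MB megabytes into DIR, then clean up.
# Usage: dd_write_test DIR [MB]   (MB defaults to 1024)
dd_write_test() {
    dir=$1
    mb=${2:-1024}
    file="$dir/ddtest.$$"
    # conv=fsync makes dd flush to disk before reporting, so the page cache
    # does not inflate the throughput figure. dd prints its summary on stderr.
    dd if=/dev/zero of="$file" bs=1M count="$mb" conv=fsync 2>&1 | tail -n 1
    rm -f "$file"
}
```

Running the same test on every data disk of every node before installing Hadoop makes a slow or misconfigured disk stand out, since identical hardware should report similar numbers.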


Network Configuration: set the MTU correctly for the Linux NIC device to support ‘Jumbo Frames’

• Tip Number Ten

– Network Interfaces MTU=9000 to take advantage of ‘Jumbo Frames’

NOTE : Your network hardware MUST support this. Most modern networking equipment is capable of handling larger frames but must be explicitly configured to do so. Frames which take advantage of this ability are known as ‘jumbo frames’, and 9000 bytes is a popular choice for the MTU.

Why does this matter ? Hadoop is a very “network dependent” application that will use as much bandwidth as it is given. Having an optimal network configuration is essential to getting the most out of your Hadoop cluster !


How to display your current network card settings from within Linux

# netstat -i

OR

# ifconfig -a | grep MTU

How to modify your current network card settings from within Linux

# ifconfig eth0 mtu 9000

OR

# vi /etc/sysconfig/network-scripts/ifcfg-eth0

MTU=9000

How to validate your MTU settings with network equipment

# ping -M do -c 4 -s 1500 192.168.21.2

PING 192.168.21.2 (192.168.21.2) 1500(1528) bytes of data.
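The ping output above shows 28 bytes being added to the payload (1500 becomes 1528): 20 bytes of IPv4 header plus 8 bytes of ICMP header. To test the full jumbo frame path, the payload therefore has to be the MTU minus 28. A tiny sketch of that arithmetic follows; the function name `ping_payload` is made up, and the figure assumes IPv4 without IP options.

```shell
#!/bin/sh
# Sketch: largest ICMP payload that fits in one unfragmented IPv4 packet.
# 28 = 20 bytes IPv4 header + 8 bytes ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}

ping_payload 9000   # payload for a 9000 byte jumbo frame MTU
ping_payload 1500   # payload for the standard 1500 byte MTU
```

These print 8972 and 1472. With the do-not-fragment flag set, `ping -M do -c 4 -s 8972 192.168.21.2` should only succeed if every device on the path accepts 9000 byte frames.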


Top 10 Linux (Red Hat or CentOS) Operating System Tips for Building an Optimal Performing Hadoop Cluster - Summary

• Top 10 Linux Tips Summary

– #1 Create a Documentation Baseline Record

– #2 Disable Swap to Reduce Performance Hits

– #3 Set the mount option “noatime” on Linux Partitions

– #4 Change ‘Root Reserved Space’ on your Linux Disks to 1 or 0%

– #5 Enable the Linux “nscd” service to improve DNS lookups

– #6 Modify the ulimit settings to allow a greater ‘Number of Open Files’ (‘File Descriptors’) and ‘Max Number of Processes’

– #7 Have a dedicated Disk for the OS and logs, separate from the ‘Hadoop Data’ Disk

– #8 Name Resolution Options – Forward and Reverse

– #9 Benchmark your Hardware as much as possible before installation

– #10 Network Configuration: set the MTU correctly for the Linux NIC device to support ‘Jumbo Frames’ (if your network supports them)


We have to remember that Hadoop is a complex ‘High End Computing Environment’, and the little things we do at the beginning of a project will matter!

• Bonus Round (time contingent ... if Roger left you any time)

– NIC Bonding

– Monitor your environment (Nagios, Ganglia or Other)

– Misc Kernel Tuning Parameters


Thank You!

Q & A

Roger Hill – Senior Big Data Architect
