Hadoop Operations - Best practices from the field

Page 1: Hadoop Operations - Best practices from the field


Did someone just order Hadoop?

- Best practices from the field

uweseiler

Page 2: Hadoop Operations - Best practices from the field

About me

Big Data Nerd

Travelpirate

Photography Enthusiast

Hadoop Trainer

NoSQL Fan Boy

Page 3: Hadoop Operations - Best practices from the field

About us

specializes in...

Big Data Nerds Agile Ninjas Continuous Delivery Gurus

Enterprise Java Specialists Performance Geeks

Join us!

Page 4: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Software I
  • Architecture & Rack Design
  • Hardware & Cluster Sizing
  • Software II

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 5: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Software I
  • Architecture & Rack Design
  • Hardware & Cluster Sizing
  • Software II

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 6: Hadoop Operations - Best practices from the field

Deployment Options

• On Premise
• Hadoop Appliance
• Hadoop Hosting
• Hadoop as a Service

(spectrum from bare metal to cloud)

Page 7: Hadoop Operations - Best practices from the field

Hadoop Distributions

Page 8: Hadoop Operations - Best practices from the field

Cloudera vs. Hortonworks

Guess what:

Both will do the job!

Page 9: Hadoop Operations - Best practices from the field

Cloudera vs. Hortonworks

Which ideology do you prefer?

“Closed” Source vs. Open Source

Page 10: Hadoop Operations - Best practices from the field

Cloudera vs. Hortonworks

Pricing model: Software + Support vs. Support only

Page 11: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Software I
  • Architecture & Rack Design
  • Hardware & Cluster Sizing
  • Software II

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 12: Hadoop Operations - Best practices from the field


Platform for Data Exploration

ETL

Visualization

Data Warehouse

Create the Big Picture

Page 13: Hadoop Operations - Best practices from the field

Pick your Hadoop Stack

• Data Storage: HDFS, NFS Gateway
• Data Processing: YARN, MapReduce, Tez, Spark, Pig, Hive
• Search: Solr
• Data Ingestion & Governance: Sqoop, Falcon
• Workflow Mgmt.: Oozie
• Security: Knox, Ranger
• Monitoring: Ganglia, Nagios
• Cluster Management Services: Ambari, ZooKeeper, Journal Nodes
• Backing databases: 3 x MySQL

Page 14: Hadoop Operations - Best practices from the field

Rack Design (without HA)

• 5 x Master Nodes: NameNode, SecondaryNameNode, ResourceManager, Mgmt. Server, Gateway Server
• Worker Nodes: 5 x in Rack 1, 6 x in Rack 2
• Per rack: 1 x ToR Switch (Nexus 3K), 1 x Mgmt. Network (Cisco Catalyst 2960)

Page 15: Hadoop Operations - Best practices from the field

Rack Design (with HA)

• 4 x Master Nodes: NameNode (Active), ResourceManager (Active), Mgmt. Server, Gateway Server
• 2 x Standby HA Nodes: NameNode (Passive), ResourceManager (Passive)
• Worker Nodes: 5 x in Rack 1, 6 x in Rack 2
• Per rack: 1 x ToR Switch (Nexus 3K), 1 x Mgmt. Network (Cisco Catalyst 2960)

Page 16: Hadoop Operations - Best practices from the field

Service Mapping

Worker Nodes:
• HDFS DataNode, YARN NodeManager, Hadoop Client Libraries

HDFS NameNode (Active):
• NameNode (Active), ZooKeeper Server, Journal Node, Hadoop Client Libraries

HDFS NameNode (Passive):
• NameNode (Passive), ZooKeeper Server, Journal Node, Hadoop Client Libraries

YARN ResourceManager (Active):
• ResourceManager, App Timeline Server, MapReduce2 History Server, ZooKeeper Server, Journal Node, Hadoop Client Libraries

YARN ResourceManager (Passive):
• ResourceManager, App Timeline Server, MapReduce2 History Server, ZooKeeper Server, Journal Node, Hadoop Client Libraries

Management Server:
• MySQL Server (Hive MetaStore, Oozie, Ganglia), HiveServer2, Oozie Server, Ganglia Server, Nagios Server, ZooKeeper Server, Journal Node, Kerberos, Hue Server, Ambari Server, Hadoop Client Libraries

Gateway Server:
• NFS Gateway Server, WebHCat Server, WebHDFS, Falcon, Sqoop, Solr, Hadoop Client Libraries

Page 17: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Software I
  • Architecture & Rack Design
  • Hardware & Cluster Sizing
  • Software II

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 18: Hadoop Operations - Best practices from the field

Hardware

• Get good-quality commodity hardware!

• Buy the sweet spot in pricing: 3 TB disks, 128 GB RAM, 8-12 core CPUs
  – More memory is better. Always.

• First scale horizontally, then vertically (1U with 6 disks vs. 2U with 12 disks)
  – Get to at least 30-40 machines or 3-4 racks

• Don’t forget about rack size (42U) and power consumption.

• Use a pilot cluster to learn about your load patterns
  – Balanced workload
  – Compute intensive
  – I/O intensive

Page 19: Hadoop Operations - Best practices from the field

It’s about storage

Per 3 TB disk:

  Total: 3.00 TB
  Intermediate data: ~25%, i.e. minus 0.75 TB
  = 2.25 TB available to HDFS
  HDFS replication factor 3: divide by 3
  = 0.75 TB of usable capacity per disk

x 12 disks
x 11 DataNodes
= 99 TB usable cluster capacity

Compression: …well, it depends…
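The arithmetic above can be sketched as a quick shell calculation (the disk and node counts are the example values from this slide; adjust them for your cluster):

```shell
#!/bin/sh
# Usable-capacity estimate: 3 TB disks, ~25% reserved for intermediate
# data, HDFS replication factor 3, 12 disks per node, 11 DataNodes.
DISK_TB=3
INTERMEDIATE_PCT=25
REPLICATION=3
DISKS_PER_NODE=12
DATA_NODES=11

# (3 TB - 25%) / 3 = 0.75 TB of user data per disk; x 132 disks = 99 TB
usable_tb=$(awk -v d="$DISK_TB" -v p="$INTERMEDIATE_PCT" -v r="$REPLICATION" \
                -v k="$DISKS_PER_NODE" -v n="$DATA_NODES" \
            'BEGIN { printf "%.2f", d * (100 - p) / 100 / r * k * n }')
echo "Usable HDFS capacity: ${usable_tb} TB"
```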

Page 20: Hadoop Operations - Best practices from the field

It’s about Zen

Balance the components:
• Xeon 10-core, model E5-2660 v2 (10 cores)
• 4 memory channels, populated with 8 x 16 GB
• 12 disks

Page 21: Hadoop Operations - Best practices from the field

Hardware

Master Nodes (HDFS NameNode + HDFS Secondary NN + YARN Resource Manager):
• CPU: 2 x 3+ GHz with 8+ cores
• Memory: 128 GB (DDR3, ECC)
• Storage: 2 x 1+ TB (RAID 1, OS), 1 x 1 TB (Hadoop logs), 1 x 1 TB (ZooKeeper), 1 x 3 TB (HDFS)
• Network: 2 x bonded 10 GbE NICs, 1 x 1 GbE NIC (for mgmt.)

Management Server + Gateway Server:
• CPU: 2 x 3+ GHz with 8+ cores
• Memory: 128 GB (DDR3, ECC)
• Storage: 2 x 1+ TB (RAID 1, OS), 1 x 1 TB (Hadoop logs), 1 x 3 TB (HDFS)
• Network: 2 x bonded 10 GbE NICs, 1 x 1 GbE NIC (for mgmt.)

Worker Nodes:
• CPU: 2 x 2.6+ GHz with 8+ cores
• Memory: 128 GB (DDR3, ECC)
• Storage: 2 x 1+ TB (RAID 1, OS), 10 x 3 TB (HDFS); if the disk chassis allows: 12 x 3 TB (HDFS)
• Network: 2 x bonded 10 GbE NICs, 1 x 1 GbE NIC (for mgmt.)

Page 22: Hadoop Operations - Best practices from the field

Example: IBM x3650 series

Master Nodes

Data Nodes

Page 23: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Architecture & Rack Design
  • Hardware
  • Software

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 24: Hadoop Operations - Best practices from the field

Operating System

Page 25: Hadoop Operations - Best practices from the field

Linux File System

• Ext3

• Ext4

• XFS with the noatime, inode64, nobarrier mount options
  – Possibly better performance, but be aware of delayed data allocation (consider turning off the delalloc option in /etc/fstab)
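A minimal /etc/fstab sketch for one data disk with those XFS options (device name and mount point are placeholders):

```
# /etc/fstab entry for an HDFS data disk (example device and mount point)
/dev/sdb1  /grid/0  xfs  noatime,inode64,nobarrier  0 0
```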

Page 26: Hadoop Operations - Best practices from the field

OS Optimizations

• Of course depending on your OS choice; specific recommendations are available from OS vendors

• Common recommendations:
  • Avoid physical I/O scheduling, since it competes with virtual/HDFS I/O scheduling (e.g. use the NOOP scheduler)
  • Adjust vm.swappiness to 0
  • Set the number of file handles (ulimit, soft+hard) to 16384 (Data Nodes) / 65536 (Master Nodes)
  • Set the number of pending connections (net.core.somaxconn) to 1024
  • Use Jumbo Frames (MTU=9000)
  • Consider network bonding (802.3ad)
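The common recommendations above might land in config files like this (device, NIC, and user names are examples; follow your OS vendor's guide for exact values, and note these commands need root):

```shell
# Kernel settings, persisted in /etc/sysctl.conf and applied with sysctl -p
cat >> /etc/sysctl.conf <<'EOF'
vm.swappiness = 0
net.core.somaxconn = 1024
EOF
sysctl -p

# File handles: 16384 on Data Nodes (use 65536 on Master Nodes)
cat >> /etc/security/limits.conf <<'EOF'
hdfs  soft  nofile  16384
hdfs  hard  nofile  16384
EOF

# NOOP I/O scheduler per data disk, jumbo frames on the cluster NIC
echo noop > /sys/block/sdb/queue/scheduler
ip link set dev eth0 mtu 9000
```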

Page 27: Hadoop Operations - Best practices from the field

Java

• Oracle JDK 1.7 (64-bit)

• Oracle JDK 1.6 (64-bit)

• Open JDK 7 (64-bit)

Page 28: Hadoop Operations - Best practices from the field

Java Optimizations

• Use a 64-bit JVM for all daemons
  – Compressed OOPs are enabled by default (Java 6u23+)

• Java heap size
  – Set -Xmx == -Xms
  – Avoid the Java defaults for NewSize and MaxNewSize
    • Use 1/8 to 1/6 of the max heap size for JVMs larger than 4 GB
  – Configure -XX:PermSize=128m, -XX:MaxPermSize=256m

• Use a low-latency GC collector
  – Set -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
    • Use a high <N> on NameNode & ResourceManager

• Useful for debugging:
  – -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
  – -XX:ErrorFile=<file>
  – -XX:+HeapDumpOnOutOfMemoryError
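Combined into a hadoop-env.sh sketch for the NameNode daemon (heap sizes, thread count, and log paths are illustrative; size them to your metadata volume and core count):

```shell
# NameNode JVM options in hadoop-env.sh (example values)
export HADOOP_NAMENODE_OPTS="-Xms64g -Xmx64g \
  -XX:NewSize=8g -XX:MaxNewSize=8g \
  -XX:PermSize=128m -XX:MaxPermSize=256m \
  -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 \
  -verbose:gc -Xloggc:/var/log/hadoop/nn-gc.log -XX:+PrintGCDetails \
  -XX:ErrorFile=/var/log/hadoop/nn-error.log \
  -XX:+HeapDumpOnOutOfMemoryError ${HADOOP_NAMENODE_OPTS}"
```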

Page 29: Hadoop Operations - Best practices from the field

Hadoop Configuration

• Multiple redundant directories for NameNode metadata
  – One of dfs.namenode.name.dir should be on NFS
  – Soft-mount the NFS share with -tcp,soft,intr,timeo=20,retrans=5

• Take periodic backups of the NameNode metadata
  – Make copies of the entire storage directory

• Set dfs.datanode.failed.volumes.tolerated > 0 (the property takes the number of failed volumes to tolerate, not a boolean)
  – A single disk failure is then no longer a complete DataNode failure
  – Especially important for high-density nodes

• Set dfs.namenode.name.dir.restore=true
  – Restores a failed NN storage directory during checkpointing

• Reserve a lot of disk space for NameNode logs
  – Hadoop logging is verbose – set aside multiple GBs
  – NameNode logs roll within minutes – hard to debug issues otherwise

• Use version control for your configuration!
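The properties above as an hdfs-site.xml fragment (the directory paths are placeholders; note again that dfs.datanode.failed.volumes.tolerated takes a count):

```
<!-- hdfs-site.xml fragment (example paths) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/grid/0/nn,/grid/1/nn,/mnt/nfs/nn</value>
</property>
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir.restore</name>
  <value>true</value>
</property>
```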

Page 30: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Software I
  • Architecture & Rack Design
  • Hardware & Cluster Sizing
  • Software II

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 31: Hadoop Operations - Best practices from the field

Options for Data Ingestion

• hadoop fs -put
• hadoop distcp
• WebHDFS
• NFS Gateway
• MapReduce
• Connectors for Oracle, Teradata, SQL Server, et al.
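Command sketches for the options above (hosts, ports, and paths are examples, not real endpoints):

```shell
# Simple copy from an edge node
hadoop fs -put /data/logs/2015-01-01.log /raw/logs/

# Bulk copy between clusters (runs as a MapReduce job)
hadoop distcp hdfs://source-nn:8020/raw hdfs://target-nn:8020/raw

# WebHDFS over REST (first step of the two-step create: the NameNode
# answers with a redirect to a DataNode, where the data is then PUT)
curl -i -X PUT "http://namenode:50070/webhdfs/v1/raw/file.csv?op=CREATE&user.name=hdfs"

# RDBMS import via a Sqoop connector
sqoop import --connect jdbc:mysql://db-host/sales --table orders --target-dir /raw/orders
```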

Page 32: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Architecture & Rack Design
  • Hardware
  • Software

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 33: Hadoop Operations - Best practices from the field

Operation

Apache Ambari Cloudera Manager

Page 34: Hadoop Operations - Best practices from the field

Monitoring

• The basics: Nagios, Ganglia, Ambari/Cloudera Manager, Hue

• Admins need to understand the principles behind Hadoop and learn their tool set: fsck, dfsadmin, …

• Monitor the hardware usage for your workload
  – Disk I/O, network I/O, CPU and memory usage
  – Use this information when expanding cluster capacity

• Monitor the usage with Hadoop metrics
  – JVM metrics: GC times, memory used, thread status
  – RPC metrics: especially latency, to track slowdowns
  – HDFS metrics: used storage, # of files & blocks, cluster load, file system operations
  – Job metrics: slot utilization and job status

• Tweak configurations during upgrades & maintenance windows on an ongoing basis

• Establish regular performance tests
  – Use Oozie to run standard tests like TeraSort, TestDFSIO, HiBench, …
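A few of the commands from that tool set, plus a benchmark run (the examples jar path varies by distribution and is only an example here):

```shell
# Namespace health: missing/corrupt blocks, replication status
hdfs fsck / -files -blocks

# Per-DataNode capacity, liveness and remaining space
hdfs dfsadmin -report

# Standard performance test: generate 100M rows, then sort them
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  teragen 100000000 /benchmarks/teragen
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  terasort /benchmarks/teragen /benchmarks/terasort
```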

Page 35: Hadoop Operations - Best practices from the field

Agenda

• Basics
  • Architecture & Rack Design
  • Hardware
  • Software

• Advanced
  • Data Ingestion
  • Operation & Monitoring
  • Security

Page 36: Hadoop Operations - Best practices from the field

Security today

Authentication – control access to the cluster
• Kerberos in native Apache Hadoop
• Perimeter security with Apache Knox (LDAP, SSO)

Authorization – restrict access to explicit data
• Native in Apache Hadoop: HDFS permissions + ACLs, queues + job ACLs
• Fine-grained role-based authorization: Hive, Apache Sentry, Apache Accumulo
• Service-level authorization with Knox
• Central security policies with Ranger

Audit – understand who did what
• Process execution audit trail in native Apache Hadoop

Data Protection – encrypt data at rest & in motion
• Wire encryption in native Apache Hadoop
• Wire encryption with Knox
• Orchestrated encryption with 3rd-party tools

Page 37: Hadoop Operations - Best practices from the field

Apache Knox

Client -> Firewall -> Knox (in the DMZ, backed by LDAP/SSO) -> Firewall -> Hadoop Cluster

• Client protocols: REST and SSH
• REST APIs proxied by Knox: WebHDFS, WebHCat, Oozie, Hive, YARN
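For example, listing a directory via WebHDFS through the Knox gateway instead of contacting the NameNode directly (host, topology name "default", and credentials are examples):

```shell
curl -ku alice:password \
  "https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"
```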

Page 38: Hadoop Operations - Best practices from the field

Data Boxing

(Diagram: a raw data layer plus per-division areas, e.g. Division 1 and Division 2, each with distinct read and read-&-write rights.)

Set up data boxing using
• Users & Groups
• HDFS Permissions & ACLs
• Higher-level mechanisms where applicable
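A sketch of such boxing with HDFS permissions and ACLs (the paths and group names are invented for the example):

```shell
# Raw layer: writable by the ingest pipeline only
hdfs dfs -mkdir -p /data/raw /data/division1 /data/division2
hdfs dfs -chown -R ingest:ingest /data/raw
hdfs dfs -chmod -R 750 /data/raw

# Divisions may read the raw layer, but not write it
hdfs dfs -setfacl -R -m group:division1:r-x,group:division2:r-x /data/raw

# Each division reads & writes only its own box
hdfs dfs -chown -R division1:division1 /data/division1
hdfs dfs -chmod -R 770 /data/division1
```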

Page 39: Hadoop Operations - Best practices from the field

Apache Ranger

• File-level access control
• Central control of permissions
• Supports: HDFS, Hive, HBase, Storm, Knox

Page 40: Hadoop Operations - Best practices from the field

Thanks for listening

Twitter: @uweseiler

Mail: [email protected]

XING: https://www.xing.com/profile/Uwe_Seiler