Keep your Hadoop cluster at its best!

Chris Nauroth, Sheetal Dolas
Hadoop Summit, San Jose, 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved



About Us

⬢  Principal Engineer @ Hortonworks

⬢  Committer and PMC, Apache Hadoop

–  Key contributor to HDFS ACLs, Windows compatibility, and operability improvements

⬢  Hadoop user since 2010

–  Experience deploying, maintaining and using Hadoop clusters


cnauroth

Chris Nauroth


About Us

⬢  SmartSense Engineering Lead @ Hortonworks

⬢  Most of career spent in the field, solving real-life business problems

⬢  Last 6+ years in Big Data

⬢  Committer and PMC, Apache Metron


sheetal_dolas

Sheetal Dolas


Agenda

⬢  Days in the life of Hadoop users – Real war stories!

⬢  Hadoop Operational Challenges

⬢  Winning and avoiding the wars

⬢  Q & A


Days in the life of Hadoop users – Real war stories!


Story I: Unstable NameNode, Frequent Fail Overs

⬢  NameNode periodically becomes unresponsive

⬢  In an HA setup, it fails over to the standby

⬢  Shortly afterwards, it fails back again

⬢  Very frequent failovers and failbacks

It was the garbage collection!


Story II: Very high CPU usage but low throughput

⬢  Unusually high system CPU usage

⬢  Jobs slowed down

⬢  Reduced data IO

[Chart: system CPU, user CPU, and network IO]

Transparent Huge Pages (THP) was turned on!


[Figure labels: Job Performance, Cluster Stability]

Story III: Cascading impact and cluster melt down

⬢  HDFS upgraded

⬢  HDFS utilization kept on increasing even after large data deletion

⬢  Rebalancing made the situation worse

⬢  Eventually HDFS became unresponsive

An un-finalized HDFS upgrade had a cascading impact on the cluster!


Story IV: Overloaded cluster

⬢  Jobs run slower

⬢  Containers and jobs are always waiting; all YARN queues are fully utilized

⬢  Some jobs had to wait for hours to get the container slots

Suboptimally configured container sizes!

[Chart: requested memory vs. used memory]


Story V: Accidental deletion of critical datasets

⬢  User accidentally ran hdfs dfs -rm -R on a root directory

⬢  The delete was issued in parallel, so Ctrl+C did not help

⬢  In a panic, the user shut down HDFS immediately (fortunately)

⬢  Restarted later to check the trash and lost all data

⬢  It's nearly impossible to recover blocks from the local file system

This is a more common mistake than one may think!


Story VI: Hive query returning random results

⬢  A Hive query returns different results every time

⬢  Results are usually accurate during office hours

⬢  After office hours, results keep changing randomly on every execution

-- QUERY: WHAT IS TODAY'S TOTAL SALE AS OF NOW?
SELECT SUM(amount)
FROM sales
WHERE sale_date = TO_DATE(UNIX_TIMESTAMP())

One of the hosts had a different time zone!
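The slide names only the root cause. A plausible mechanism, sketched here under that assumption, is that "today" is derived from the clock of whichever host evaluates the query. The UTC offsets and timestamps below are hypothetical illustration values, not taken from the talk.

```python
from datetime import datetime, timedelta, timezone


def local_date(epoch_seconds: int, utc_offset_hours: int) -> str:
    """Return the calendar date a host with the given UTC offset sees."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).date().isoformat()


# Hypothetical instants: one during office hours, one after hours (both in UTC).
OFFICE_HOURS = int(datetime(2016, 6, 28, 20, 0, tzinfo=timezone.utc).timestamp())
AFTER_HOURS = int(datetime(2016, 6, 29, 2, 30, tzinfo=timezone.utc).timestamp())

# Hypothetical hosts: most run at UTC-7, one stray host runs at UTC.
for label, instant in (("office hours", OFFICE_HOURS), ("after hours", AFTER_HOURS)):
    print(label, local_date(instant, -7), local_date(instant, 0))
```

During office hours both hosts agree on the date, so the filter selects the same rows. After hours the stray host is already on the next calendar day, so the result depends on which host happens to evaluate the date.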


and the stories continue…


Hadoop operational challenges


Hadoop has lots of configurations

⬢  So many configurations! Overwhelming for many users

⬢  Best practices are evolving and change across versions


Many configurations are cluster and workload specific

⬢  A configuration good for one cluster may not be suitable for another cluster

⬢  A cluster that is optimally configured today may become suboptimal tomorrow as it grows


Large clusters add to the complexities

⬢  Managing, updating and keeping nodes in sync becomes challenging

⬢  Nodes that are down miss maintenance cycles and get out of sync

⬢  Newly added nodes may have different standards (Java version, OS, user configurations etc.)

⬢  Clusters end up with heterogeneous hardware over time


Winning and avoiding the wars with SmartSense


SmartSense

⬢  Proactive support & personalized cluster insights by

–  Enabling faster case resolution

–  Applying industry best practices

–  Providing proactive analysis

⬢  SmartSense is a collection of tools and services

–  Evaluates the cluster's current configuration and runtime environment against a rich set of rules

–  Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads

–  Continuously evolving and improving rule sets, developed by or in close consultation with active committers, support engineers, and field engineers


SmartSense Architecture

[Diagram. Components shown: agents on worker nodes, Ambari, server, bundle, gateway, landing zone, SmartSense Analytics]


Addressing: Unstable NameNode, Frequent Fail Overs

Daunting Questions

⬢  What is the right heap size for my NN?

⬢  What should the new generation size be?

⬢  Which GC should I use?

⬢  Which GC options should be configured?

⬢  What if my cluster grows?

SmartSense Answer

⬢  Rule: hdfs_nn_jvm_opts

⬢  Calculates heap size based on

–  Current heap usage

–  Total number of objects in the file system

–  Best practices

⬢  Recalculates dependent JVM options based on Heap size

⬢  Validates existing JVM opts

⬢  Provides continuous validations and proactive recommendations


NameNode JVM Opts

⬢  Heap Size

–  200 bytes per HDFS object (files, directories, blocks)

–  25 % buffer

⬢  -Xms should be the same as -Xmx

⬢  New generation size should be 1/8th of -Xmx (capped at 8G)

⬢  Use Concurrent Mark Sweep (CMS) Garbage Collection

–  -XX:+UseConcMarkSweepGC

–  -XX:CMSInitiatingOccupancyFraction=70

–  -XX:+UseCMSInitiatingOccupancyOnly

–  -XX:ParallelGCThreads=8
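The sizing guidance above can be sketched as a small calculation. This is a minimal illustration, not the actual hdfs_nn_jvm_opts rule: the function name, the rounding up to whole gigabytes, the 1 GB floors, and the use of -XX:NewSize / -XX:MaxNewSize to set the new generation are assumptions.

```python
import math

BYTES_PER_OBJECT = 200   # from the slide: 200 bytes per HDFS object
BUFFER_FRACTION = 0.25   # from the slide: 25 % buffer
NEW_GEN_CAP_GB = 8       # from the slide: new generation capped at 8G


def namenode_jvm_opts(total_objects: int) -> str:
    """Build a JVM option string for a NameNode tracking total_objects objects."""
    raw_bytes = total_objects * BYTES_PER_OBJECT * (1 + BUFFER_FRACTION)
    # Assumption: round up to whole gigabytes and never go below 1 GB.
    heap_gb = max(1, math.ceil(raw_bytes / 1024 ** 3))
    # New generation is 1/8 of the heap, capped at 8 GB (1 GB floor is assumed).
    new_gen_gb = min(NEW_GEN_CAP_GB, max(1, heap_gb // 8))
    return " ".join([
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",          # -Xms equals -Xmx
        f"-XX:NewSize={new_gen_gb}g",
        f"-XX:MaxNewSize={new_gen_gb}g",
        "-XX:+UseConcMarkSweepGC",
        "-XX:CMSInitiatingOccupancyFraction=70",
        "-XX:+UseCMSInitiatingOccupancyOnly",
        "-XX:ParallelGCThreads=8",
    ])


# Example: a file system with 100 million files, directories, and blocks.
print(namenode_jvm_opts(100_000_000))
```

Re-running the calculation as the object count grows is what keeps the heap and its dependent options in step with the cluster.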


Addressing: Very high CPU usage but low throughput

Daunting Questions

⬢  Is THP applicable to my OS version?

⬢  Is it disabled? Completely disabled?

⬢  How do I make sure it is disabled on newly added nodes too?

⬢  How do I make these configurations independent of any one person?

SmartSense Answer

⬢  Rule: os_thp

⬢  Checks if THP is completely disabled

⬢  Provides OS specific disabling instructions

⬢  Continuous evaluation that validates newly added nodes and re-commissioned nodes


Disable THP

⬢  For RedHat & CentOS

echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled

⬢  For Debian, Ubuntu & SUSE

echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
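A check for the current setting could look like the sketch below. It relies on the kernel convention that the file lists every mode and marks the active one in square brackets (for example `always madvise [never]`). The function names are hypothetical, and this is not the os_thp rule itself.

```python
from pathlib import Path
from typing import Optional

# The two sysfs locations named on the slide.
THP_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def active_thp_mode(file_text: str) -> str:
    """Return the bracketed (active) mode, e.g. 'always madvise [never]' -> 'never'."""
    for token in file_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {file_text!r}")


def thp_disabled_on_this_host() -> Optional[bool]:
    """True/False if a THP file exists on this host, None if neither path exists."""
    for path in THP_PATHS:
        candidate = Path(path)
        if candidate.exists():
            return active_thp_mode(candidate.read_text()) == "never"
    return None


print(active_thp_mode("always madvise [never]"))
```

Running the same check on every node, including newly added ones, addresses the consistency question raised earlier.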

[Chart: system CPU, user CPU, and network IO]


Addressing: Cascading impact and cluster melt down

Daunting Questions

⬢  Should I finalize the upgrade?

⬢  What is the right time to finalize?

⬢  How do I make sure it does not fall through the cracks?

SmartSense Answer

⬢  Rule: hdfs_nn_finalize_upgrade

⬢  Checks HDFS health after upgrade

⬢  Evaluates how long HDFS has been running in an un-finalized state

⬢  Reminds until it is finalized


HDFS Upgrade finalization

⬢  Check NN UI / JMX for upgrade status

⬢  Do not finalize HDFS upgrade until

–  All files and blocks have been verified after upgrade

–  Critical jobs have been executed at least once after upgrade

⬢  Finalize between 2 - 7 days after upgrade

hdfs dfsadmin -finalizeUpgrade
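The finalization checklist above can be expressed as a simple decision function. This is a sketch only: the function name, its inputs, and the returned labels are hypothetical, and gathering the inputs (from the NN UI or JMX) is left out.

```python
def finalize_advice(days_since_upgrade: int,
                    blocks_verified: bool,
                    critical_jobs_ran: bool) -> str:
    """Return a recommendation label following the 2 - 7 day guidance."""
    if not (blocks_verified and critical_jobs_ran):
        # Both preconditions from the checklist must hold before finalizing.
        return "wait: verification incomplete"
    if days_since_upgrade < 2:
        return "wait: too early"
    if days_since_upgrade <= 7:
        return "finalize now"
    return "overdue: finalize as soon as possible"


print(finalize_advice(3, blocks_verified=True, critical_jobs_ran=True))
```

A rule of this kind only reminds; the finalization itself remains the explicit `hdfs dfsadmin -finalizeUpgrade` step.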


Addressing : Overloaded cluster

Daunting Questions

⬢  What is the right container size for my cluster?

⬢  If I add additional components (HBase, Storm), how does the container size change?

⬢  How do container sizes change when I add new types of nodes to the cluster?

⬢  What is the impact on container sizes if I add SSDs to the nodes?

SmartSense Answer

⬢  Rules: yarn_container_size, mr_container_size, tez_container_size

⬢  Evaluates resources available on individual host (CPU, Memory, Disks, Running Services etc.)

⬢  Calculates technology specific container sizes (MR, Tez, Hive)

⬢  Continuously evaluates as the cluster dynamics change


Container sizing

⬢  Identify resources (CPU, Memory, Disks) available on each node

⬢  Keep aside resources required for other processes (OS, DN, NM, HBase RS)

⬢  Calculate max possible containers for each resource (CPU, Memory, Disks)

–  CPU Containers: 4x cores

–  Disk Containers: ( 3x HDD + 10x SSD )

–  Memory Containers: (Available RAM / 2 )

⬢  Number of containers = Min (CPU Containers, Disk Containers, Memory Containers)
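A worked version of the container formula above, as a minimal sketch. It assumes that the inputs are what remains after reserving resources for other processes, and that "Available RAM / 2" means RAM measured in GB. The `NodeResources` type and the sample node are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    """Resources left on a node after reserving for OS, DN, NM, etc."""
    cores: int
    hdd_count: int
    ssd_count: int
    available_ram_gb: float  # assumption: the formula expects gigabytes


def max_containers(node: NodeResources) -> int:
    """Number of containers = min of the CPU, disk, and memory limits."""
    cpu_containers = 4 * node.cores
    disk_containers = 3 * node.hdd_count + 10 * node.ssd_count
    memory_containers = int(node.available_ram_gb // 2)
    return min(cpu_containers, disk_containers, memory_containers)


# Hypothetical worker: 16 cores, 12 HDDs, no SSDs, 96 GB available RAM.
node = NodeResources(cores=16, hdd_count=12, ssd_count=0, available_ram_gb=96)
print(max_containers(node))  # CPU allows 64, disks 36, memory 48
```

Here the disks are the limiting resource. Adding SSDs or memory shifts the bottleneck, which is why the result needs recomputing whenever node types change.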


Addressing: Accidental deletion of critical datasets

Daunting Questions

⬢  Is HDFS trash enabled?

⬢  What is a safe trash interval?

⬢  How do I prevent accidental deletion of critical data?

SmartSense Answer

⬢  Rule: hdfs_trash_interval

–  Checks if trash is enabled

–  Validates that the trash interval is within reasonable limits

⬢  Rule: hdfs_nn_protect_imp_dirs

–  New feature available in Hadoop 2.8

–  Helps you mark critical directories such as "/", "/user", "/user/apps/hive", "/user/apps/hbase" etc. as delete protected


HDFS Trash interval and directory protection

⬢  fs.trash.interval determines the number of minutes after which trashed data gets deleted

–  0 means trash disabled (data gets deleted immediately)

–  Keep it in the range 1440 (1 day) – 10080 (7 days)

–  Recommended 4320 (3 days)

⬢  fs.protected.directories specifies directories that will be delete protected

–  Available from Hadoop 2.8

–  List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp" etc.)
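The trash-interval guidance can be checked mechanically. The sketch below classifies a configured fs.trash.interval value against the range given above; the function name and the returned labels are hypothetical and not part of SmartSense.

```python
MIN_MINUTES = 1440          # 1 day, lower bound from the slide
MAX_MINUTES = 10080         # 7 days, upper bound from the slide
RECOMMENDED_MINUTES = 4320  # 3 days, recommended value from the slide


def check_trash_interval(minutes: int) -> str:
    """Classify an fs.trash.interval value (in minutes)."""
    if minutes < 0:
        raise ValueError("interval must not be negative")
    if minutes == 0:
        return "disabled"   # deletes bypass trash entirely
    if minutes < MIN_MINUTES:
        return "too short"
    if minutes > MAX_MINUTES:
        return "too long"
    return "ok"


print(check_trash_interval(RECOMMENDED_MINUTES))
```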


Addressing : Hive query returning random results

Daunting Questions

⬢  Is my cluster configured consistently?

⬢  How do I prevent such hard-to-analyze issues?

⬢  How do I make sure newly added nodes do not bring these types of issues?

⬢  How do I make these setups independent of any one person?

SmartSense Answer

⬢  Rule: os_time_zone

⬢  Checks if all hosts have same time zone

⬢  Rule: os_service_ntpd_on makes sure all host times are in sync

⬢  Continuous evaluation that validates newly added nodes and re-commissioned nodes
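A consistency check of this kind can be sketched as follows. It assumes the time zone of each host has already been collected into a mapping; the function name, host names, and zone names are hypothetical, and this is not the os_time_zone rule itself.

```python
from collections import Counter
from typing import Dict, List


def hosts_off_majority_zone(host_zones: Dict[str, str]) -> List[str]:
    """Return hosts whose time zone differs from the most common one."""
    if not host_zones:
        return []
    majority_zone, _ = Counter(host_zones.values()).most_common(1)[0]
    return sorted(host for host, zone in host_zones.items()
                  if zone != majority_zone)


# Hypothetical inventory: one node was provisioned with a different zone.
inventory = {
    "worker1": "America/Los_Angeles",
    "worker2": "America/Los_Angeles",
    "worker3": "UTC",
    "worker4": "America/Los_Angeles",
}
print(hosts_off_majority_zone(inventory))
```

Running the check on a schedule, not only once, is what catches newly added and re-commissioned nodes.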


There are 250+ more such rules

Operations
⬢  hdfs_dn_volume_tolerance
⬢  hdfs_dn_xceivers
⬢  hdfs_nn_handler_count
⬢  …
⬢  yarn_zk_quorum
⬢  yarn_nm_recovery
⬢  …
⬢  os_hostname_reverse_lookup
⬢  os_ssd_tuning
⬢  …
⬢  hive_mr_strict_mode
⬢  hive_datanucleus_cache
⬢  …
⬢  tez_am_heap
⬢  tez_shuffle_buffer
⬢  …

Performance
⬢  ams_mc_distributed_configs
⬢  ams_mc_write_path
⬢  ...
⬢  hbase_jvm_opts
⬢  hbase_rs_open_region_threads
⬢  hbase_tcp_nodelay
⬢  ...
⬢  hdfs_dn_jvm_opts
⬢  hdfs_mount_options
⬢  hdfs_nn_dn_staleness_interval
⬢  ...
⬢  hive_auto_convert_join
⬢  hive_disable_caching
⬢  hive_enable_cbo
⬢  ...

Security
⬢  hdfs_dn_volume_tolerance
⬢  hdfs_audit_log
⬢  hdfs_block_access_token
⬢  hdfs_enable_security_check
⬢  hdfs_nn_super_user_group
⬢  hdfs_zkfc_ha_acl
⬢  ...
⬢  ranger_policy_refresh_interval
⬢  smartsense_2_way_ssl_enabled
⬢  ...
⬢  yarn_ats_security
⬢  yarn_enable_acl
⬢  ...


There is more than just configurations

⬢  How do I show back/charge back my tenants?

⬢  Who are the top users of my platform?

⬢  What type of workloads are running on my cluster?

⬢  Which jobs have significant impact on my cluster?

⬢  How do I improve performance of key jobs?

⬢  What is a good time for maintenance?


Activity Analysis


Summary

⬢  There are many things involved in managing a Hadoop cluster

⬢  Best practices evolve and change across versions

⬢  What is optimal today may not be optimal tomorrow

⬢  Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments

⬢  SmartSense can significantly help avoid common mistakes, issues, and pitfalls, and simplify Hadoop operations


Let's keep your Hadoop cluster at its best!

Thank You!


Appendix


More Resources

⬢  https://docs.hortonworks.com/index.html

⬢  http://hortonworks.com/products/subscriptions/smartsense/

⬢  http://hortonworks.com/info/smartsense/

⬢  http://hortonworks.com/blog/introducing-hortonworks-smartsense/

⬢  https://www.youtube.com/watch?v=IKulo9c8PjE

⬢  https://community.hortonworks.com/topics/smartsense.html


SmartSense Bundle Security

⬢  All bundles are anonymized and encrypted

⬢  Multiple built-in security measures

–  Ambari clear text passwords are not collected

–  Hive and Oozie database properties are not collected

–  All IP addresses and host names are anonymized

⬢  Extensible security rules

–  Exclude properties within specific Hadoop configuration files

–  Global REGEX replacements across all configuration, metrics, and logs


SmartSense Stack Support

⬢  SmartSense 1.x with HDP: 2.4, 2.3, 2.2, 2.1, 2.0

⬢  SmartSense 1.x with Ambari: 2.2 (Built-In!), 2.1 (Plug-In), 2.0 (Plug-In), 1.7, 1.6