
Page 1: Debugging Apache Hadoop YARN Cluster in Production

Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016

Page 2: Debugging Apache Hadoop YARN Cluster in Production


Who are We

Junping Du
– Apache Hadoop Committer and PMC Member
– Dev Lead in Hortonworks YARN team

Xuan Gong
– Apache Hadoop Committer and PMC Member
– Software Engineer

Jian He
– Apache Hadoop Committer and PMC Member
– Staff Software Engineer

Page 3: Debugging Apache Hadoop YARN Cluster in Production


Today’s Agenda

YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Summary and Future

Page 4: Debugging Apache Hadoop YARN Cluster in Production


Agenda
YARN in a Nutshell

Page 5: Debugging Apache Hadoop YARN Cluster in Production


YARN Architecture

ResourceManager
NodeManager
ApplicationMaster
Other daemons:
– Application History/Timeline Server
– Job History Server (for MR only)
– Proxy Server
– Etc.

Page 6: Debugging Apache Hadoop YARN Cluster in Production


RM and NM in a Nutshell

Page 7: Debugging Apache Hadoop YARN Cluster in Production


Agenda
YARN in a Nutshell

Trouble-shooting Process and Tools

Page 8: Debugging Apache Hadoop YARN Cluster in Production


“Troubles” that trigger a troubleshooting effort on a YARN cluster

Applications failed
Applications hang / run slow
YARN configuration doesn’t work
YARN APIs (CLI, web service, etc.) don’t work
YARN daemons crashed (OOM issue, etc.)
YARN daemons’ logs show errors/warnings
YARN cluster monitoring tools (like Ambari) raise alerts

Problem type distribution (chart): Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance

Page 9: Debugging Apache Hadoop YARN Cluster in Production


Process: Phenomenon -> Root Cause -> Solution

Phenomenon:
– Application failed

Root cause:
– Container launch failures
• Classpath issue
• Resource localization failures
– Too many attempt failures
• Network connection issue
• NM disk issues
• AM failure caused by node restart
– Application logic issue
• Container failed with OOM, etc.
– Security issue
• Token related issues

Solution:
– Infrastructure/hardware issue
• Replace disks
• Fix network
– Mis-configuration
• Fix configuration
• Enhance documentation
– Setup issue
• Fix setup
• Restart services
– Application issue
• Update application
• Workaround
– A YARN bug
• Report/fix it in the Apache community!

Page 10: Debugging Apache Hadoop YARN Cluster in Production


Iceberg of troubleshooting – Case Study

"java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)”

That was actually due to a too-many-TCP-connections issue.
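To confirm a suspicion like this on the affected node, a few standard Linux commands are usually enough. A minimal sketch, assuming you are on the NodeManager host and <NM_PID> stands in for its process id:

# system-wide file-descriptor usage (allocated / unused / max)
cat /proc/sys/fs/file-nr
# socket summary: a huge ESTAB/TIME-WAIT count points at a connection leak
ss -s
# descriptors held by the NodeManager process itself
lsof -p <NM_PID> | wc -l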

Page 11: Debugging Apache Hadoop YARN Cluster in Production


Iceberg of troubleshooting – Dig Deeply

Most connections are from the local NM to DNs
– LogAggregationService
– ResourceLocalizationService

We found the root cause is a thread leak in the NM LogAggregationService:
– YARN-4697: NM aggregation thread pool is not bound by limits
– YARN-4325: Purge app state from NM state-store should cover more LOG_HANDLING cases
– YARN-4984: LogAggregationService shouldn't swallow exceptions in handling createAppDir(), which causes a thread leak
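A thread leak like this shows up quickly in a thread dump. A hedged sketch (the PID placeholder and the thread-name grep pattern are assumptions; check the actual names in your dump):

# total native threads of the NodeManager, sampled repeatedly over time
ps -o nlwp= -p <NM_PID>
# full dump, then count threads that belong to log aggregation
jstack <NM_PID> | grep -c -i "LogAggregation"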

Page 12: Debugging Apache Hadoop YARN Cluster in Production


Lessons Learned for Troubleshooting on a Production Cluster

What do we mean by a “production” cluster?
– Cannot afford to stop/restart the cluster for troubleshooting
– Most operations on the cluster are “read only”
– The cluster sits in a fenced network; remote debugging is done with the local cluster admin

Lessons learned:
1. Gather as much related info (screenshots, log files, jstack output, memory heap dumps, etc.) as you can
2. Work closely with the end user to gain an understanding of the issue and symptoms
3. Build a knowledge base so new cases can be compared to previous ones
4. If possible, reproduce the issue on a test/backup cluster – much easier to troubleshoot and verify
5. Version your configuration! (see the sketch below)
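One low-tech way to follow point 5 is to keep the Hadoop config directory under version control, so every change is attributable and reversible. A minimal sketch, assuming the configs live under /etc/hadoop/conf:

cd /etc/hadoop/conf
git init
git add -A
git commit -m "baseline YARN/Hadoop configuration"
# after every change: git add -A && git commit -m "why the change was made"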

Page 13: Debugging Apache Hadoop YARN Cluster in Production


Handy Tools for YARN Troubleshooting

Log
UI
Historic info
– JobHistoryServer (for MR only)
– Application Timeline Service (v1, v1.5, v2.0)
Monitoring tools, like Ambari
Runtime info
– Memory dump
– Jstack
– System metrics

Page 14: Debugging Apache Hadoop YARN Cluster in Production


Log

Log CLI
– yarn logs -applicationId <application ID> [OPTIONS]
– Discussed more later

Enable debug log
– When daemons are NOT running
• Put a log level setting such as export YARN_ROOT_LOGGER="DEBUG,console" into yarn-env.sh
• Start the daemons
– When daemons are running
• Dynamically change the log level via the daemon's logLevel UI/CLI
• CLI:
– yarn daemonlog [-getlevel <host:httpPort> <classname>]
– yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
– For the YARN client side
• Similar setting as when daemons are not running
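For instance, flipping the CapacityScheduler class to DEBUG on a live ResourceManager could look like the lines below. This is only a sketch: the host, port and class name are illustrative, and the change is in-memory only (it does not survive a restart):

yarn daemonlog -getlevel <rm_host>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn daemonlog -setlevel <rm_host>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler DEBUG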

Page 15: Debugging Apache Hadoop YARN Cluster in Production


Runtime Log Level settings in YARN UI

RM: http://<rm_addr>:8088/logLevel
NM: http://<nm_addr>:8042/logLevel
ATS: http://<ats_addr>:8188/logLevel
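The same servlet can be driven from scripts. A hedged example (the logger name is illustrative; any log4j logger or package name should work):

# read the current effective level
curl "http://<rm_addr>:8088/logLevel?log=org.apache.hadoop.yarn.server.resourcemanager.scheduler"
# set it to DEBUG
curl "http://<rm_addr>:8088/logLevel?log=org.apache.hadoop.yarn.server.resourcemanager.scheduler&level=DEBUG"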

Page 16: Debugging Apache Hadoop YARN Cluster in Production


UI (Ambari and YARN)

Page 17: Debugging Apache Hadoop YARN Cluster in Production


Job History Server

Page 18: Debugging Apache Hadoop YARN Cluster in Production


Memory dump analysis
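When a daemon's heap keeps growing (or it dies with OOM), a heap dump plus an offline analyzer such as Eclipse MAT is the usual path. A minimal sketch, assuming <RM_PID> is the ResourceManager process id (jmap briefly pauses the JVM, so use it with care in production):

# quick class histogram, cheap first look
jmap -histo <RM_PID> | head -30
# full binary heap dump of live objects for offline analysis
jmap -dump:live,format=b,file=/tmp/rm-heap.hprof <RM_PID>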

Page 19: Debugging Apache Hadoop YARN Cluster in Production


Hadoop metrics

RPC metrics
– RpcQueueTimeAvgTime
– ReceivedBytes
– …

JVM metrics
– MemHeapUsedM
– ThreadsBlocked
– …

Documentation:
– http://s.apache.org/UwSu

Page 20: Debugging Apache Hadoop YARN Cluster in Production


YARN top

A top-like command-line view of application stats and queue stats
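It ships with the YARN client, so on a gateway node with the cluster configs no extra setup should be needed:

# interactive, refreshes in place like Unix top
yarn top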

Page 21: Debugging Apache Hadoop YARN Cluster in Production


Agenda
YARN in a Nutshell

Trouble-shooting Process and Tools

Case Study

Page 22: Debugging Apache Hadoop YARN Cluster in Production


Why is my job hung?

A job can be stuck in one of three states:

NEW_SAVING: waiting for the app to be persisted in the state-store
– Connection error with the state-store (ZooKeeper, etc.)

ACCEPTED: waiting for the ApplicationMaster container to be allocated
– Max-AM-resource-percentage config set too low

RUNNING: waiting for containers to be allocated?
– Are there resources available for the app?
– Otherwise, an application-land issue, e.g. stuck on socket read/write

App states (diagram)
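The state a job is stuck in is also visible from the client side. A small sketch (the application ID is the illustrative one used later in this deck):

# shows the YarnApplicationState (NEW_SAVING / ACCEPTED / RUNNING / ...) plus diagnostics
yarn application -status application_1467090861129_0001
# list everything that has not reached RUNNING yet
yarn application -list -appStates NEW_SAVING,ACCEPTED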

Page 23: Debugging Apache Hadoop YARN Cluster in Production


Case Study

Friday evening: the customer experiences a cluster outage
A large number of jobs are getting stuck
There are resources available in the cluster
Restarting the ResourceManager resolves the issue temporarily
But after several hours, the cluster goes back to the bad state

Page 24: Debugging Apache Hadoop YARN Cluster in Production


Case Study

Are there any resources available in the queue?

Page 25: Debugging Apache Hadoop YARN Cluster in Production


Case Study

Are there any resources available for the app?
– Sometimes, even if the cluster has resources, a user may still not be able to run their applications because they hit the user limit
– The user limit controls how many resources a single user can use
– Check user-limit info on the scheduler UI
– Check the application headroom on the application UI
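Two quick checks, sketched here with an illustrative queue name (the CLI prints the queue state and capacity figures; per-user usage and limits are easiest to read on the scheduler page):

yarn queue -status default
# scheduler UI with per-queue and per-user usage:
#   http://<rm_addr>:8088/cluster/scheduler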

Page 26: Debugging Apache Hadoop YARN Cluster in Production


Page 27: Debugging Apache Hadoop YARN Cluster in Production


Case study

Not a problem of resource contention
Use the yarn logs command to get the hung application's logs
– Found the app waiting for containers to be allocated

Problem: the cluster has free resources, but the app is not able to use them
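For a still-running application, the AM log is usually the quickest place to see what it is waiting on; with the enhanced log CLI described later (YARN-4904) that is a one-liner:

yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles syslog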

Page 28: Debugging Apache Hadoop YARN Cluster in Production


Case study

Maybe a scheduling issue. Analyze the scheduler log (the most difficult part)
– Users are not very familiar with the scheduler log
– The RM log is huge; hard to do text searching in it
– It gets worse if debug logging is enabled

Dump the scheduling log into a separate file (see the sketch below)
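One way to do that is a dedicated log4j appender for the scheduler package in the RM's log4j.properties. This is a sketch only: the appender name, file name and rotation sizes are assumptions, $HADOOP_CONF_DIR must point at the RM's config directory, and the change typically takes effect after an RM restart:

cat >> $HADOOP_CONF_DIR/log4j.properties <<'EOF'
# route scheduler (DEBUG) output to its own rolling file, out of the main RM log
log4j.appender.SCHED=org.apache.log4j.RollingFileAppender
log4j.appender.SCHED.File=${hadoop.log.dir}/yarn-scheduler-debug.log
log4j.appender.SCHED.MaxFileSize=256MB
log4j.appender.SCHED.MaxBackupIndex=20
log4j.appender.SCHED.layout=org.apache.log4j.PatternLayout
log4j.appender.SCHED.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler=DEBUG,SCHED
log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.scheduler=false
EOF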

Page 29: Debugging Apache Hadoop YARN Cluster in Production


Case study

The scheduler log shows several apps are skipped for scheduling
Pick one of the applications, go to the application attempt UI, and check the resource requests table (see below): billions of containers are being asked for by the application


Page 30: Debugging Apache Hadoop YARN Cluster in Production


Case study

Tried killing those misbehaving jobs; the cluster went back to normal
Found the user who submitted those jobs and stopped him/her from doing that
Big achievement so far: the cluster is unblocked

Offline debugging then found a product bug
Surprisingly, we use int for measuring memory size in the scheduler
The misbehaving app asked for so many resources that the total exceeded Integer.MAX_VALUE and overflowed in the scheduler
YARN-4844: replace int with long for the resource memory API

Page 31: Debugging Apache Hadoop YARN Cluster in Production


What We Learned

Rebooting a service can solve many problems
– Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336)

Denial of service: poorly written or accidentally misconfigured workloads can cause component outages
– Carefully code against DoS scenarios
– Example: a user RPC method (getQueueInfo) holds the scheduler lock

UI enhancements
– Small change, big impact
– Example: the resource requests table on the application page was very useful in this case

Alerts
– Alert users when an application asks for too many containers

Page 32: Debugging Apache Hadoop YARN Cluster in Production


Case study 2

10% of the jobs are failing every day
After re-running, the jobs sometimes finish successfully
No resource contention when the jobs are running
Logs contain a lot of mysterious connection errors (unable to read call parameters)

Page 33: Debugging Apache Hadoop YARN Cluster in Production


Case study 2

Initial attempt:
– Dig deeper into the code to see under what conditions this exception may be thrown
– Not able to figure it out

Page 34: Debugging Apache Hadoop YARN Cluster in Production


Case study 2

Requested more failed application logs
Identified a pattern for these applications
Finally, we realized all the apps failed on a certain set of nodes
Asked the customer to exclude those nodes; jobs ran fine after that
The customer checked /var/log/messages and found disk issues on those nodes

When dealing with mysterious connection failures or hung problems, try to find a correlation between failed apps and nodes.
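A rough way to surface such a correlation from aggregated logs, sketched with an illustrative application ID (the "Container: <id> on <node>" header format may differ slightly between versions):

# count a failed app's containers per node; a node that dominates the list is suspicious
yarn logs -applicationId application_1467090861129_0001 | grep '^Container:' | awk '{print $4}' | sort | uniq -c | sort -rn
# cross-check node health as reported by the RM
yarn node -list -all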

Page 35: Debugging Apache Hadoop YARN Cluster in Production


Agenda
YARN in a Nutshell

Trouble-shooting Process and Tools

Case Study

Enhanced YARN Log Tool Demo

Page 36: Debugging Apache Hadoop YARN Cluster in Production


Enhanced YARN Log CLI (YARN-4904)

Useful Log CLIs
– Get container logs for running apps
• yarn logs -applicationId ${appId}
– Get a specific container log
• yarn logs -applicationId ${appId} -containerId ${containerId}
– Get AM container logs
• yarn logs -applicationId ${appId} -am 1
– Get a specific log file
• yarn logs -applicationId ${appId} -logFiles syslog
• Supports Java regular expressions
– Get the log file's first 'n' bytes or the last 'n' bytes
• yarn logs -applicationId ${appId} -size 100
– Dump the application/container logs
• yarn logs -applicationId ${appId} -out ${local_dir}
– List application/container log information
• yarn logs -applicationId ${appId} -show_application_log_info
• yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info

Page 37: Debugging Apache Hadoop YARN Cluster in Production


Page 38: Debugging Apache Hadoop YARN Cluster in Production


Agenda
YARN in a Nutshell

Trouble-shooting Process and Tools

Case Study

Enhanced YARN Log Tool Demo

Summary and Future

Page 39: Debugging Apache Hadoop YARN Cluster in Production


Summary and Future

Summary
– Methodology and tools for trouble-shooting on YARN
– Case Study
– Enhanced YARN Log CLI
• YARN-4904

Future Enhancement
– ATS (Application Timeline Service) v2
• YARN-2928
• #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”
– New ResourceManager UI
• YARN-3368

Page 40: Debugging Apache Hadoop YARN Cluster in Production


New RM UI (YARN-3368)

Page 41: Debugging Apache Hadoop YARN Cluster in Production


Thank You

Page 42: Debugging Apache Hadoop YARN Cluster in Production


Backup Slides

Page 43: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -am 1

Page 44: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002

Page 45: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr

Page 46: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000

Page 47: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -out ${localDir}

Page 48: Debugging Apache Hadoop YARN Cluster in Production


YARN Log Command example Screenshot

yarn logs -applicationId application_1467090861129_0001 -show_application_log_info

yarn logs -applicationId application_1467090861129_0001 -show_container_log_info