Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016
Who We Are
- Junping Du – Apache Hadoop Committer and PMC Member; Dev Lead, Hortonworks YARN team
- Xuan Gong – Apache Hadoop Committer and PMC Member; Software Engineer
- Jian He – Apache Hadoop Committer and PMC Member; Staff Software Engineer
Today’s Agenda
- YARN in a Nutshell
- Trouble-shooting Process and Tools
- Case Study
- Enhanced YARN Log Tool Demo
- Summary and Future
Agenda: YARN in a Nutshell
YARN Architecture
- ResourceManager
- NodeManager
- ApplicationMaster
- Other daemons:
  - Application History/Timeline Server
  - Job History Server (for MR only)
  - Proxy Server
  - Etc.
RM and NM in a Nutshell
Agenda: Trouble-shooting Process and Tools
“Troubles” that trigger a troubleshooting effort on a YARN cluster
- Applications failed
- Applications hang or run slowly
- YARN configuration doesn’t work
- YARN APIs (CLI, web service, etc.) don’t work
- YARN daemons crashed (OOM issues, etc.)
- YARN daemons’ logs contain errors/warnings
- YARN cluster monitoring tools (like Ambari) raise alerts

[Chart: Problem Type Distribution – Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance]
Process: Phenomenon -> Root Cause -> Solution

Phenomenon:
- Application failed

Root cause:
- Container launch failures
  - Classpath issue
  - Resource localization failures
- Too many attempt failures
  - Network connection issue
  - NM disk issues
  - AM failure caused by node restart
- Application logic issue
  - Container failed with OOM, etc.
- Security issue
  - Token-related issues

Solution:
- Infrastructure/hardware issue
  - Replace disks
  - Fix network
- Mis-configuration
  - Fix configuration
  - Enhance documentation
- Setup issue
  - Fix setup
  - Restart services
- Application issue
  - Update application
  - Workaround
- A YARN bug
  - Report/fix it in the Apache community!
Iceberg of troubleshooting – Case Study
"java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)"
This was actually due to too many TCP connections.
Iceberg of troubleshooting – Dig Deeper
Most connections are from the local NM to DataNodes, opened by:
- LogAggregationService
- ResourceLocalizationService
A quick way to confirm is to count the daemon's connections and file descriptors (see the sketch below). We found the root cause to be a thread leak in the NM LogAggregationService:
- YARN-4697: NM aggregation thread pool is not bound by limits
- YARN-4325: Purge app state from NM state-store should cover more LOG_HANDLING cases
- YARN-4984: LogAggregationService shouldn't swallow exceptions in handling createAppDir(), which causes a thread leak
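A minimal diagnostic sketch, assuming a Linux node and that $NM_PID holds the NodeManager's process id (both illustrative, not from the slides):

    # Count the NodeManager's open file descriptors
    ls /proc/$NM_PID/fd | wc -l

    # Count its TCP connections grouped by remote address
    # (DataNode addresses dominated in this case)
    ss -tnp 2>/dev/null | grep "pid=$NM_PID" | awk '{print $5}' | sort | uniq -c | sort -rn | head

    # Count threads whose names mention log aggregation
    # (thread-name pattern is an assumption)
    jstack $NM_PID | grep -c LogAggregation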
Lessons Learned for Troubleshooting on a Production Cluster

What do we mean by a "production" cluster?
- Cannot afford to stop/restart the cluster for troubleshooting
- Most operations on the cluster are "read only"
- In a fenced network, debugging is done remotely with the local cluster admin

Lessons learned:
1. Gather as much related info (screenshots, log files, jstack output, memory heap dumps, etc.) as you can – see the capture sketch below
2. Work closely with the end user to gain an understanding of the issue and symptoms
3. Build a knowledge base so new cases can be compared to previous ones
4. If possible, reproduce the issue on a test/backup cluster – easier to troubleshoot and verify
5. Version your configuration!
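A minimal capture sketch for item 1, assuming $PID holds the daemon's process id and a matching JDK is on the PATH (both assumptions):

    # Thread dump: cheap; take several, a few seconds apart
    jstack $PID > /tmp/rm-jstack-$(date +%s).txt

    # Heap dump: expensive and pauses the JVM ('live' also triggers a full GC),
    # so schedule it with the cluster admin on a production daemon
    jmap -dump:live,format=b,file=/tmp/rm-heap.hprof $PID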
Handy Tools for YARN Troubleshooting
- Log
- UI
- Historic info
  - JobHistoryServer (for MR only)
  - Application Timeline Service (v1, v1.5, v2.0)
- Monitoring tools, like Ambari
- Runtime info
  - Memory dump
  - jstack
  - System metrics
Log
- Log CLI
  - yarn logs -applicationId <application ID> [OPTIONS]
  - Discussed in more detail later
- Enable debug log
  - When daemons are NOT running:
    - Add a log level setting like export YARN_ROOT_LOGGER="DEBUG,console" to yarn-env.sh
    - Start the daemons
  - When daemons are running:
    - Dynamically change the log level via the daemon's logLevel UI/CLI
    - CLI (see the sketch below):
      - yarn daemonlog -getlevel <host:httpPort> <classname>
      - yarn daemonlog -setlevel <host:httpPort> <classname> <level>
  - For the YARN client side: same settings as when daemons are not running
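A worked example, assuming an RM web UI at rm-host:8088 (host and class name are illustrative):

    # Before starting a daemon: route DEBUG output to the console
    # (append to yarn-env.sh, then start the daemon as usual)
    export YARN_ROOT_LOGGER="DEBUG,console"

    # On a running daemon: query and raise one logger's level without a restart
    yarn daemonlog -getlevel rm-host:8088 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
    yarn daemonlog -setlevel rm-host:8088 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager DEBUG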
Runtime Log Level Settings in YARN UI
- RM: http://<rm_addr>:8088/logLevel
- NM: http://<nm_addr>:8042/logLevel
- ATS: http://<ats_addr>:8188/logLevel
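The same servlet can be scripted. A sketch assuming the standard Hadoop /logLevel endpoint accepts log and level query parameters (the parameter names are an assumption based on Hadoop 2.x behavior):

    # Read the current level of one logger on the RM
    curl 'http://rm-host:8088/logLevel?log=org.apache.hadoop.yarn.server.resourcemanager'
    # Raise it to DEBUG
    curl 'http://rm-host:8088/logLevel?log=org.apache.hadoop.yarn.server.resourcemanager&level=DEBUG'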
UI (Ambari and YARN)
Job History Server
Memory dump analysis
Hadoop Metrics
- RPC metrics
  - RpcQueueTimeAvgTime
  - ReceivedBytes
  - …
- JVM metrics
  - MemHeapUsedM
  - ThreadsBlocked
  - …
- Documentation: http://s.apache.org/UwSu
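These metrics are also exposed as JSON over each daemon's standard /jmx HTTP endpoint; a sketch (host/port illustrative):

    # Dump the ResourceManager's JVM metrics (MemHeapUsedM, ThreadsBlocked, ...)
    curl -s 'http://rm-host:8088/jmx?qry=Hadoop:service=ResourceManager,name=JvmMetrics'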
YARN top
A top-like command-line view of application stats and queue stats (see below).
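Invocation is just the subcommand (availability depends on the Hadoop release):

    yarn top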
Agenda: Case Study
Why is my job hung?
A job can be stuck in three states:
- NEW_SAVING: waiting for the app to be persisted in the state-store
  - e.g. a connection error with the state-store (ZooKeeper, etc.)
- ACCEPTED: waiting for the ApplicationMaster container to be allocated
  - e.g. a low maximum-AM-resource-percent config (see the sketch below)
- RUNNING: waiting for containers to be allocated
  - Are there resources available for the app?
  - Otherwise it is an application-side issue, e.g. stuck on a socket read/write

[Diagram: app states]
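For the ACCEPTED case, the relevant knob is the CapacityScheduler's maximum AM resource percentage. A minimal capacity-scheduler.xml sketch (the property name is real; the value is only an illustration):

    <!-- Fraction of cluster resources that may be used to run
         ApplicationMaster containers; if set too low, new apps sit in
         ACCEPTED because no AM container can be allocated -->
    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.2</value>
    </property>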
Case Study
- Friday evening, a customer experiences cluster outages
- A large number of jobs are getting stuck
- There are resources available in the cluster
- Restarting the ResourceManager resolves the issue temporarily
- But after several hours, the cluster goes back to the bad state
Case Study
Are there any resources available in the queue?
Case Study
Are there any resources available for the app?
- Sometimes, even if the cluster has resources, a user may still be unable to run their applications because they hit the user limit
- The user limit controls how many resources a single user can use (see the config sketch below)
- Check user-limit info on the scheduler UI
- Check the application headroom on the application UI
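The user limit is governed by per-queue CapacityScheduler properties. A sketch for a hypothetical root.default queue (the property names are real; the queue and values are illustrative):

    <!-- Each active user in root.default is guaranteed at least 25% of the
         queue's resources when there is demand from multiple users -->
    <property>
      <name>yarn.scheduler.capacity.root.default.minimum-user-limit-percent</name>
      <value>25</value>
    </property>
    <!-- A single user may consume at most 1x the queue's configured capacity -->
    <property>
      <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
      <value>1</value>
    </property>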
Case Study
- Not a problem of resource contention
- Used the yarn logs command to get the hung application's logs
  - Found the app waiting for containers to be allocated
- Problem: the cluster has free resources, but the app is not able to use them
Case Study
- Maybe a scheduling issue; analyze the scheduler log (the most difficult part)
  - Users are not very familiar with the scheduler log
  - The RM log is huge, so text searching in it is hard
  - It gets even worse with debug logging enabled
- Dump the scheduling log into a separate file (see the log4j sketch below)
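One way to do that is through log4j; a minimal log4j.properties sketch (the appender name and file path are illustrative; the scheduler package is from Hadoop's source tree):

    # Route scheduler DEBUG output to its own rolling file, and stop it from
    # also flowing into the main RM log (additivity=false)
    log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler=DEBUG,SCHEDULER
    log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.scheduler=false
    log4j.appender.SCHEDULER=org.apache.log4j.RollingFileAppender
    log4j.appender.SCHEDULER.File=${hadoop.log.dir}/yarn-scheduler-debug.log
    log4j.appender.SCHEDULER.MaxFileSize=256MB
    log4j.appender.SCHEDULER.MaxBackupIndex=20
    log4j.appender.SCHEDULER.layout=org.apache.log4j.PatternLayout
    log4j.appender.SCHEDULER.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n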
Case Study
- The scheduler log shows several apps being skipped during scheduling
- Pick one of those applications, go to its application attempt UI, and check the resource requests table (see below): billions of containers are being asked for by the application

[Screenshot: resource requests table, e.g. 8912124 containers requested]
Case Study
- Tried killing the misbehaving jobs; the cluster went back to normal
- Found the user who submitted those jobs and stopped him/her from doing that
- Big achievement so far: the cluster is unblocked
- Offline debugging then found a product bug: surprisingly, the scheduler used int for measuring memory size. The misbehaving app asked for so many resources that it caused an integer overflow in the scheduler (see the sketch below)
- YARN-4844: replace int with long for the resource memory API
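A minimal Java illustration of the failure mode (not the actual scheduler code; the numbers are only examples):

    // Multiplying a huge container count by a per-container memory size
    // silently wraps around in 32-bit int arithmetic, so the scheduler
    // computes a bogus total; widening to long fixes it (YARN-4844).
    public class OverflowDemo {
      public static void main(String[] args) {
        int numContainers = 8_912_124;      // e.g. the count from the requests table
        int memoryPerContainerMB = 1024;
        int totalInt = numContainers * memoryPerContainerMB;          // wraps to 536080384
        long totalLong = (long) numContainers * memoryPerContainerMB; // correct: 9126014976
        System.out.println("int total MB (wrong): " + totalInt);
        System.out.println("long total MB:        " + totalLong);
      }
    }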
What We Learned
- Rebooting a service can solve many problems
  - Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336)
- Denial of service: poorly written applications or accidental workload configurations can cause component outages
  - Carefully code against DoS scenarios
  - Example: a user-facing RPC method (getQueueInfo) holds the scheduler lock
- UI enhancements: small change, big impact
  - Example: the resource requests table on the application UI was very useful in this case
- Alerts
  - If an app asks for too many containers, alert the user
Case Study 2
- 10% of jobs are failing every day
- After a re-run, jobs sometimes finish successfully
- No resource contention while jobs are running
- Logs contain a lot of mysterious connection errors ("unable to read call parameters")
Case Study 2
- Initial attempt:
  - Dig deeper into the code to see under what conditions this exception may be thrown
  - Not able to figure it out
Case Study 2
- Requested more failed application logs
- Identified a pattern across these applications
- Finally, we realized all failed apps ran on a certain set of nodes
- Asked the customer to exclude those nodes; jobs ran fine after that
- The customer checked /var/log/messages and found disk issues on those nodes

Takeaway: when dealing with mysterious connection failures or hangs, try to find a correlation between failed apps and nodes.
Agenda: Enhanced YARN Log Tool Demo
Enhanced YARN Log CLI (YARN-4904)

Useful log CLIs:
- Get container logs for running apps
  - yarn logs -applicationId ${appId}
- Get a specific container's logs
  - yarn logs -applicationId ${appId} -containerId ${containerId}
- Get the AM container's logs
  - yarn logs -applicationId ${appId} -am 1
- Get a specific log file
  - yarn logs -applicationId ${appId} -logFiles syslog
  - Supports Java regular expressions
- Get a log file's first 'n' or last 'n' bytes
  - yarn logs -applicationId ${appId} -size 100
- Dump the application/container logs to a local directory
  - yarn logs -applicationId ${appId} -out ${local_dir}
- List application/container log information
  - yarn logs -applicationId ${appId} -show_application_log_info
  - yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
Agenda: Summary and Future
Summary and Future

Summary:
- Methodology and tools for troubleshooting on YARN
- Case studies
- Enhanced YARN Log CLI (YARN-4904)

Future enhancements:
- ATS (Application Timeline Service) v2
  - YARN-2928
  - #hs16sj "How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale"
- New ResourceManager UI
  - YARN-3368
New RM UI (YARN-3368)
Thank You
Backup Slides
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
YARN Log Command Example Screenshot
yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
yarn logs -applicationId application_1467090861129_0001 -show_container_log_info