Oracle SOA Suite 11g Troubleshooting Methodology
April 10th, 2013
16:15-17:15
Mile High Ballroom 3C
Harold Dost III, Senior Consultant
Raastech, Inc.
Slide 2 of 64 © Raastech, Inc. 2012 | All rights reserved.
Agenda
1. Introduction
2. The Problem
3. The Art of Troubleshooting: Where Do You Start?
4. Infrastructure Issues
5. Performance Issues
6. Deployment Issues
7. Summary

INTRODUCTION
About Me
Harold Dost III
5+ years of Oracle middleware experience
Experience in large implementations involving SOA Suite, BAM, AIA, OSB, OSR, ODI, OWSM, OER, OEG, and more
OCE (SOA Foundation Practitioner)

THE PROBLEM
How Every Large Company Troubleshoots
The Macy’s support team had an exceedingly difficult time pinpointing the specific cause of the problem.
Not only did the team involve representatives for each IT functional area, they had no way to troubleshoot from the source and no one team had visibility of the complete picture.
In general, resolving problems took the Macy’s melded support team multiple days.
http://www.splunk.com/web_assets/pdfs/secure/Troubleshooting_Critical_Applications.pdf

Problem With Troubleshooting Integrations
In the past, network admins were blamed for everything.

Problem With Troubleshooting Integrations
In the 21st century, the integration folks are the new target.

Problem With Troubleshooting Integrations
Numerous touch points
Numerous SOA technologies
Focus of this presentation is on Oracle SOA Suite 11g
[Diagram: Web Application, OEG, OSB, SOA Suite, OSB, ODI, with callouts numbered 1 to 4]

Real World Scenario – Bizarre Behaviour
We created an Ant wrapper script that loops through and deploys all composites
Calls the deploy target in ant-sca-deploy.xml
Always getting OutOfMemoryError: PermGen space after exactly 66 composite deployments
Weird… but at least consistent

Real World Scenario – Vague & Unclear
The infamous and ever-misleading “Unable to access the following endpoints” error

Real World Scenario – Vague & Unclear
Could be:
Caused by: java.net.SocketTimeoutException: Read timed out
Message send failed: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

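The two stack traces above point at very different root causes behind the same console message. A minimal Python sketch of that triage, assuming the relevant server log text has already been read into a string; the `likely_cause` function and its category labels are hypothetical and not part of any Oracle tooling:

```python
# Hypothetical triage helper: maps exception signatures that can sit behind
# the "Unable to access the following endpoints" error to a likely cause.
SIGNATURES = [
    ("java.net.SocketTimeoutException",
     "read timeout: target service did not respond in time"),
    ("PKIX path building failed",
     "SSL trust: no valid certification path to the target's certificate"),
]


def likely_cause(log_text: str) -> str:
    """Return the label of the first signature found, or 'unknown'."""
    for needle, cause in SIGNATURES:
        if needle in log_text:
            return cause
    return "unknown"


if __name__ == "__main__":
    sample = "Caused by: java.net.SocketTimeoutException: Read timed out"
    print(likely_cause(sample))
```

The list is deliberately short; other signatures would need to be added as they are encountered.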
THE ART OF TROUBLESHOOTING: WHERE DO YOU START?
What is Troubleshooting?
Part skill
Some people have a natural tendency to pinpoint problem areas
Can be learned; usually involves a methodical approach and logic
Part knowledge
Without understanding the product, it doesn’t matter how smart you are
Most frustrating when it’s related to an area we don’t know

Existing Resources
Co-workers
Internet searches
OTN discussion forums http://support.oracle.com
My Oracle Support http://support.oracle.com
Oracle Troubleshooting Guide http://docs.oracle.com/cd/E15586_01/fusionapps.1111/e14496/soa_trouble.htm
Oracle SOA Suite 11g Administrator’s Handbook http://www.packtpub.com/oracle-soa-suite-11g-administrators-handbook/book

Start Somewhere – Narrow Down Problem Area
Issues
Performance: Server-wide, Service-specific
Runtime: Composite, Infrastructure
Deployment

INFRASTRUCTURE ISSUES
Troubleshooting the Infrastructure
Could be a server issue
Could be a coding issue
Could be a business fault that should be handled by the code
Must be able to differentiate between infrastructure errors and composite instance errors

Troubleshooting the Infrastructure
1. Use logs
2. Use thread dumps

1. Using Logs
The soa_server1.out log file contains most runtime issues
Must differentiate between infrastructure errors and composite instance errors

Example: Infrastructure Error
Random crashes immediately after go-live
Only happened in Production
No warning signs
Error does not appear on the EM console
<Aug 5, 2011 12:00:02 AM EDT> <Error> <oracle.soa.bpel.engine.dispatch> <BEA-000000> <failed to handle message
javax.ejb.EJBException: EJB Exception: java.lang.StackOverflowError...

Example: Business Fault
Often easy to distinguish
Should be handled by the code
Shows as a faulted instance on the EM console
<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000> <Got an exception: oracle.fabric.common.FabricInvocationException: javax.xml.ws.soap.SOAPFaultException:
Message: Organization 129024 not found. Stack trace: at Core.WebServices.Message.MessageWebService.SaveNotification(Organization organization, Notification notification) in c:\Data\1.0\Core\Message\MessageWebService.svc.cs:line 100, detail=javax.xml.ws.soap.SOAPFaultException:

Example: System Fault (but not your fault!)
Thrown by external system
Shows as a faulted instance on the EM console
No action needed; follow up with target system
<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000> <Got an exception: oracle.fabric.common.FabricInvocationException: javax.xml.ws.soap.SOAPFaultException:
CreateCustomer failed with Message: Cannot insert the value NULL into column 'CustomerID', table '@Customers'; column does not allow nulls. INSERT fails.

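The log examples above share the same angle-bracket header: timestamp, severity, subsystem, and message ID. As a rough sketch, assuming entries keep that layout, the fields can be split out in Python so that entries can be filtered by subsystem or severity; `parse_entry` is an illustrative name, not an Oracle utility:

```python
import re

# Leading "<...>" fields as they appear in the examples above:
# timestamp, severity, subsystem, message id, then the message text.
HEADER = re.compile(
    r"<([^<>]*)>\s*<([^<>]*)>\s*<([^<>]*)>\s*<([^<>]*)>\s*<?(.*)",
    re.S,
)


def parse_entry(entry: str) -> dict:
    """Split one log entry into its header fields and remaining message."""
    match = HEADER.match(entry.strip())
    if match is None:
        raise ValueError("entry does not start with the expected header")
    timestamp, severity, subsystem, msg_id, message = match.groups()
    return {
        "timestamp": timestamp,
        "severity": severity,
        "subsystem": subsystem,
        "msg_id": msg_id,
        "message": message,
    }


if __name__ == "__main__":
    line = ("<Aug 5, 2011 12:00:02 AM EDT> <Error> "
            "<oracle.soa.bpel.engine.dispatch> <BEA-000000> "
            "<failed to handle message")
    print(parse_entry(line)["subsystem"])
```

Whether an entry is an infrastructure error or a composite fault still has to be judged from the message itself; the parsed fields only narrow down where to look.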
Example: System Fault
The infamous and ever-misleading “Unable to access the following endpoints” error

Example: System Fault
In this case, due to:
Message send failed: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Example: Coding or Infrastructure Problem?
Just an infrastructure warning
Threads would eventually clear themselves up
Does not show on the EM console
Due to a failed transaction that continues to retry
<Sep 30, 2011 11:30:04 PM EDT> <Warning> <oracle.integration.platform.instance.store.async> <BEA-000000> <Unable to allocate additional threads, as all the threads [10] are in use. Threads distribution : Fabric Instance Activity = 1,Fabric-Instance-Manager = 9,>

Modifying Logger Levels
A lot more information is logged in the soa_server1-diagnostic.log file
[2012-01-01T22:35:56.144-05:00] [soa_server1] [TRACE] [] [oracle.soa.adapter] [ecid: cb680017c6a0acfe:-3f1527ec:13487d1ea4c:-8000-0000000000000fe1,0:2] JmsProducer_execute:[default destination = jndi/CustomerJMSQueue]: Successfully produced message.
[2012-01-01T22:35:56.256-05:00] [soa_server1] [NOTIFICATION] [] [oracle.soa.adapter] [ecid: cb680017c6a0acfe:-5675273b:1348cccad75:-8000-0000000000055743,0] JMSAdapter JMSConsumer JMSMessageConsumer_consume: Got message with ID ID:<458362.1325475356144.0> from destination jndi/CustomerJMSQueue
[2012-01-01T22:35:56.261-05:00] [soa_server1] [TRACE] [] [oracle.soa.adapter] [ecid: cb680017c6a0acfe:-5675273b:1348cccad75:-8000-0000000000055743,0] JMS Adapter JMSProducer:CustomerJMS [ CustomerProduce_ptt::CustomerProduce(body) ] XMLHelper_convertJmsMessageHeadersAndPropertiesToXML: <JMSInboundHeadersAndProperties xmlns="http://xmlns.oracle.com/pcbpel/adapter/jms/">[[
<JMSInboundHeaders>
<JMSMessageID>ID:<458362.1325475356144.0></JMSMessageID>
<JMSTimestamp>1325475356144</JMSTimestamp>

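Each diagnostic log entry above carries an ECID, which makes it possible to pull together every entry that belongs to one request. The following is a sketch only, assuming the bracketed field order shown in the sample (timestamp, server, level, message ID, logger, then an `ecid:` field); `parse_line` and `group_by_ecid` are hypothetical helper names:

```python
import re
from collections import defaultdict

# Bracketed header fields in the order shown in the sample entries.
LINE = re.compile(
    r"\[(?P<timestamp>[^\]]*)\]\s*"
    r"\[(?P<server>[^\]]*)\]\s*"
    r"\[(?P<level>[^\]]*)\]\s*"
    r"\[(?P<msg_id>[^\]]*)\]\s*"
    r"\[(?P<logger>[^\]]*)\]\s*"
    r"(?:\[(?!ecid:)[^\]]*\]\s*)*"  # tolerate other bracketed fields before ecid
    r"\[ecid:\s*(?P<ecid>[^\],]*)[^\]]*\]\s*"
    r"(?P<message>.*)",
    re.S,
)


def parse_line(entry: str):
    """Return the header fields of one entry as a dict, or None if no match."""
    match = LINE.match(entry.strip())
    return match.groupdict() if match else None


def group_by_ecid(entries):
    """Collect parsed entries under their ECID so one flow can be read together."""
    grouped = defaultdict(list)
    for entry in entries:
        fields = parse_line(entry)
        if fields is not None:
            grouped[fields["ecid"]].append(fields)
    return dict(grouped)
```

Grouping by ECID keeps the consume and produce steps of the same message side by side, even when entries from other requests are interleaved in the file.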
2. Using Thread Dumps
When a managed server goes into warning state, what are you supposed to do?

Understanding Stuck Threads
Navigate to Servers > (managed server) > Monitoring > Threads

Understanding Stuck Threads
AdminServer.log:
####<Dec 23, 2011 6:03:49 PM EST> <Error> <WebLogicServer> <soahost1> <AdminServer> <BEA-000337> <[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "658" seconds
bam_server1.log:
####<Dec 23, 2011 5:53:36 PM EST> <Error> <JMX> <soahost1> <bam_server1> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1324680816405> <BEA-149500> <An exception occurred while registering the MBean com.bea:Name=AdminServer,Type=WebServiceRequestBufferingQueue,WebServiceBuffering=AdminServer,Server=AdminServer,WebService=AdminServer. java.lang.OutOfMemoryError: PermGen space

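The stuck-thread entry above has a recognisable shape, so a log file can be scanned for it. A minimal sketch, assuming the message text matches the sample exactly; `find_stuck_threads` is an illustrative name:

```python
import re

# Pattern taken from the "[STUCK] ExecuteThread" message in the sample entry.
STUCK = re.compile(
    r"\[STUCK\] ExecuteThread: '(?P<thread>\d+)' for queue: "
    r"'(?P<queue>[^']*)' has been busy for \"(?P<seconds>\d+)\" seconds"
)


def find_stuck_threads(log_text: str):
    """Return (thread id, queue name, busy seconds) for each stuck-thread entry."""
    return [
        (m.group("thread"), m.group("queue"), int(m.group("seconds")))
        for m in STUCK.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = ("<BEA-000337> <[STUCK] ExecuteThread: '0' for queue: "
              "'weblogic.kernel.Default (self-tuning)' has been busy "
              "for \"658\" seconds")
    print(find_stuck_threads(sample))
```

This only finds the symptom; the cause may sit in another server's log, as it did in this example.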
Understanding Stuck Threads
1. We found AdminServer to be in the “Warning” state, due to a stuck thread.
2. We confirmed that there was indeed a stuck “ExecuteThread”, as shown on both the Oracle WebLogic Administration Console and the AdminServer.log file.
3. By reviewing the soa_server1.log and bam_server1.log files, we found startup errors in the BAM server log.
4. The BAM server was unable to register an AdminServer MBean due to the java.lang.OutOfMemoryError exception that was thrown.

PERFORMANCE ISSUES
Server-Wide Performance Issues
Is logging in to Oracle Enterprise Manager Fusion Middleware Control extremely slow?
Are all composite instances taking unusually long to complete?
Are the logs or your dehydration database growing unusually quickly?
Are you seeing an exceptionally high number of errors in the logs?

Check available disk space
Often an overlooked area
root@soahost1:/root> df -m
Filesystem  1M-blocks    Used  Available  Use%  Mounted on
/dev/sda8         996     451        494   48%  /
/dev/sda9      815881  697454      76314   91%  /u01
/dev/sda7         996      36        909    4%  /home
/dev/sda5        1984     138       1744    8%  /tmp
/dev/sda3        1984     283       1598   16%  /var
/dev/sda2        5950    3842       1802   69%  /usr
/dev/sda1          99      12         83   13%  /boot
tmpfs            8023       0       8023    0%  /dev/shm

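The same check can be scripted so that a nearly full filesystem is flagged before it becomes an outage. The sketch below uses Python's standard `shutil.disk_usage`; the 90% threshold and the function names are assumptions chosen for illustration. The percentage is computed as used / (used + available) and rounded up, which is consistent with the Use% figures in the sample output above:

```python
import math
import shutil


def percent_used(used: int, available: int) -> int:
    """Usage percentage computed as used / (used + available), rounded up."""
    total = used + available
    return math.ceil(100 * used / total) if total else 0


def is_nearly_full(path: str, threshold: int = 90) -> bool:
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.free) >= threshold


if __name__ == "__main__":
    # Figures for /u01 from the sample output (in MB).
    print(percent_used(697454, 76314))
```

With the /u01 figures from the sample, the helper reports 91, which would trip a 90% threshold.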
Check CPU, RAM, and I/O
The vmstat command easily outputs CPU, memory, and I/O statistics
Do not rely on Linux’s reporting of available memory; it is best to look at swap space usage
Why Linux reports 100% memory usage all the time http://blog.raastech.com/2008/01/why-linux-reports-100-memory-usage-all.html
root@soahost1:/root> vmstat -S m
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0     59    402  15055    0    0     2    16    0    0  2  2 96  1  0

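Since swap usage is the figure to watch, the relevant vmstat columns (`swpd` for swap in use, `si` and `so` for swap-in and swap-out activity) can be pulled out programmatically. A small sketch, assuming the output ends with one header row and one row of values as in the sample; the function names are illustrative:

```python
def parse_vmstat(output: str) -> dict:
    """Map vmstat column names to integer values using the last two lines."""
    rows = [line.split() for line in output.strip().splitlines()]
    header, values = rows[-2], rows[-1]
    return dict(zip(header, (int(v) for v in values)))


def swap_in_use(stats: dict) -> bool:
    """True if any swap is allocated or pages are moving to or from swap."""
    return stats["swpd"] > 0 or stats["si"] > 0 or stats["so"] > 0


if __name__ == "__main__":
    sample = """procs -------memory--------- --swap-- ---io-- --system-- ----cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 0 59 402 15055 0 0 2 16 0 0 2 2 96 1 0"""
    stats = parse_vmstat(sample)
    print(stats["free"], stats["cache"], swap_in_use(stats))
```

In the sample, free memory looks low at 59 MB, but most memory is held as cache and no swap is in use, which is the situation the linked article describes.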
Check OS Resources
System log files can reveal resource issues:
root@soahost1:/root> cat /var/log/messages
Aug 31 20:53:22 uslx286 sshd[22480]: fatal: setresuid 10000: Resource temporarily unavailable
Too many running processes can exhaust system resources:
root@soahost1:/root> ps -A | wc -l
297
Too many open files can exhaust system resources:
root@soahost1:/root> lsof | wc -l
6064

JVM Performance Tuning
For performance, consider the following:
Switching from Sun JDK to JRockit JDK
Optimizing JVM settings
Additional JVM performance tuning documentation from Oracle can be found at:
http://docs.oracle.com/cd/E23943_01/web.1111/e13814.pdf
http://docs.oracle.com/cd/E15289_01/doc.40/e15060.pdf

JVM Logging
Add this to the PORT_MEM_ARGS argument in the setSOADomainEnv.sh(.cmd) script:
-XX:+HeapDumpOnOutOfMemoryError
Although this is not a performance setting, we recommend setting it to dump the heap to an hprof file when java.lang.OutOfMemoryError exceptions are thrown
This is useful for later analysis and troubleshooting

Check JVM
Ensure that the heap allocated to the JVM is appropriately sized (that is, compare heap versus non-heap usage)
Ensure that there is no excessive garbage collection
Monitor JVM thread performance

Check Data Sources
Data source errors are usually easy to identify: when a connection pool is exhausted, errors show up everywhere

Check Database Performance
Involve a DBA

Viewing Performance Summary Graphs
Navigate to Monitoring > Performance Summary
Can choose metrics to display for any composite

Viewing Request Processing Metrics
Right-click on Monitoring > Request Processing
Using SQL queries is so much better

Composite Instance Performance
Remember the SQL output from the last page?
Let’s also get the invoke durations
SELECT
  composite_instance_id,
  composite_creation_date,
  component_name,
  action,
  component_state,
  TO_CHAR((TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),12,2))*60*60) +
          (TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),15,2))*60) +
          TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),18,4)),'999990.000') duration
FROM
  mediator_instance
WHERE
  component_name = 'Order.Create'

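The duration expression in the query works by slicing the text form of the interval between `created_time` and `updated_time`. As a worked illustration in Python, assuming the interval renders as `+DDDDDDDDD HH:MI:SS.FFFFFF` (an assumption that is consistent with the character positions 12, 15, and 18 used in the query, not something stated in the slides):

```python
def duration_seconds(interval: str) -> float:
    """Mirror the SUBSTR arithmetic from the query on an interval string
    such as '+000000000 00:00:01.234000' (assumed layout)."""
    hours = int(interval[11:13])      # SUBSTR(..., 12, 2)
    minutes = int(interval[14:16])    # SUBSTR(..., 15, 2)
    seconds = float(interval[17:21])  # SUBSTR(..., 18, 4), e.g. '01.2'
    return hours * 60 * 60 + minutes * 60 + seconds


if __name__ == "__main__":
    print(duration_seconds("+000000000 01:02:03.456000"))
```

Under that assumed layout, two properties of the query follow: the day portion of the interval is ignored, and the four-character seconds slice keeps only tenths of a second even though the output format shows three decimal places.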
DEPLOYMENT ISSUES
Understanding the Ant Deployment Process
Involves:
1. Compilation
ant -f ant-sca-package.xml package -DcompositeDir=$CODE/HelloWorld -DcompositeName=HelloWorld -Drevision=1.0
2. Deployment
ant -f ant-sca-deploy.xml deploy -DserverURL=$SOAURL/soa-infra/deployer -Duser=$USERNAME -Dpassword=$PASSWORD -DsarLocation=$CODE/HelloWorld/deploy/sca_HelloWorld_rev1.0.jar -Dpartition=default -Doverwrite=true -DforceDefault=true

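When many composites are deployed from a wrapper script, it can help to build the two Ant command lines from a few parameters. The sketch below only assembles argument lists using the properties shown above; the function names and parameters are hypothetical, and nothing is executed:

```python
def package_command(code_dir: str, name: str, revision: str) -> list:
    """Argument list for the compilation (packaging) step."""
    return [
        "ant", "-f", "ant-sca-package.xml", "package",
        f"-DcompositeDir={code_dir}/{name}",
        f"-DcompositeName={name}",
        f"-Drevision={revision}",
    ]


def deploy_command(code_dir: str, name: str, revision: str,
                   soa_url: str, user: str, password: str) -> list:
    """Argument list for the deployment step, using the SAR path convention
    from the example (deploy/sca_<name>_rev<revision>.jar)."""
    sar = f"{code_dir}/{name}/deploy/sca_{name}_rev{revision}.jar"
    return [
        "ant", "-f", "ant-sca-deploy.xml", "deploy",
        f"-DserverURL={soa_url}/soa-infra/deployer",
        f"-Duser={user}",
        f"-Dpassword={password}",
        f"-DsarLocation={sar}",
        "-Dpartition=default",
        "-Doverwrite=true",
        "-DforceDefault=true",
    ]


if __name__ == "__main__":
    print(" ".join(package_command("/u01/svn", "HelloWorld", "1.0")))
```

Each list could be passed to `subprocess.run(cmd, check=True)`, which starts a separate Ant process per composite.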
Understanding the Ant Compilation Process
Compilation done via the package target in ant-sca-package.xml
The package target calls other targets to perform:
1. Cleanup
2. Validation
3. Compilation

Compilation: The clean Target
Removes any existing SAR files
clean:
[echo] deleting /u01/svn/HelloWorld/deploy/sca_HelloWorld_rev1.0.jar

Compilation: The scac-validate Target
Sets environment variables and validates all resources within the code
scac-validate:
[echo] Running scac-validate in /u01/svn/HelloWorld/composite.xml
[echo] oracle.home = /u01/app/oracle/middleware/Oracle_SOA1/bin/..
[input] skipping input as property compositeDir has already been set.
[input] skipping input as property compositeName has already been set.
[input] skipping input as property revision has already been set.

Compilation: The scac Target
Compiles the code
scac:
[scac] Validating composite "/u01/svn/HelloWorld/composite.xml"
[scac] error: location
.
Load of wsdl "HelloWorldWebService.wsdl with Message part element undefined in wsdl [file:/u01/svn/HelloWorld/
.
[echo]
[echo] ERROR IN TRYCATCH BLOCK:
[echo] /u01/scripts/build.soa.xml:112: The following error occurred while executing this line:
.
[echo] /u01/app/oracle/middleware/Oracle_SOA1/bin/ant-sca-compile.xml:269: Java returned: 1 Check log file : /tmp/out.err for errors

Types of Errors
Understand that ant runs on the client machine, not the SOA server
[echo] /u01/app/oracle/middleware/Oracle_SOA1/bin/ant-sca-deploy.xml:188: java.lang.OutOfMemoryError: PermGen space
Compilation errors: check out.err and understand adf-config.xml
oracle.fabric.common.wsdl.SchemaBuilder.loadEmbeddedSchemas(SchemaBuilder.java:492) Caused by: java.io.IOException: oracle.mds.exception.MDSException: MDS-00054: The file to be loaded oramds:/apps/Common/HelloWorld.xsd does not exist.
Deployment errors are usually straightforward
[deployComposite] INFO: Creating HTTP connection to host:soahost1, port:8001
[deployComposite] java.net.UnknownHostException: soahost1

Location of out.err
Located in Unix/Linux: /tmp/out.err
Located in Microsoft Windows: C:\Users\[user]\AppData\Local\Temp\out.err

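Because out.err can be long, it may be quicker to search it for known signatures such as the MDS-00054 error shown earlier. A rough sketch in Python: `tempfile.gettempdir()` is used as an approximation of the locations listed above (it usually resolves to the platform's temp directory but honours environment overrides), and the helper names are illustrative:

```python
import os
import re
import tempfile

# Matches the MDS-00054 message and captures the missing oramds path.
MISSING = re.compile(
    r"MDS-00054:\s+The file to be\s+loaded\s+(oramds:\S+)\s+does not exist"
)


def default_out_err() -> str:
    """Best-guess path of out.err in the platform's temp directory."""
    return os.path.join(tempfile.gettempdir(), "out.err")


def missing_mds_files(text: str) -> list:
    """Return the distinct oramds paths reported as missing, sorted."""
    return sorted(set(MISSING.findall(text)))


if __name__ == "__main__":
    path = default_out_err()
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for item in missing_mds_files(handle.read()):
                print(item)
```

A missing `oramds:` resource usually points back to how the MDS store is referenced in adf-config.xml on the machine running ant.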
OTHER STUFF
The DMS Spy Servlet
DMS Spy Servlet displays Dynamic Monitoring Service (DMS) metrics in real time
Navigate to http://<host>:<soaport>/dms/Spy
http://docs.oracle.com/cd/E15586_01/core.1111/e10108/monitor.htm#CFAHIAIB

Check Event Delivery Network (EDN)
The EDN Database Debug Log can be accessed at:
http://<host>:<soaport>/soa-infra/events/edn-db-log
Changing the oracle.integration.platform.blocks.event.saq logger to TRACE:32 captures the body of the event message in the EDN trace

SUMMARY
Summary
Troubleshooting is part art, part product knowledge
Oracle SOA Suite 11g errors can mostly be classified into:
Runtime (or infrastructure) errors
Performance issues/errors
Deployment errors

Summary
For infrastructure errors:
Identify whether it is a composite or an infrastructure error
Consider increasing logger levels
Identifying the root cause of stuck threads may require some drill-down investigation

Summary
For performance issues:
Identify whether it is a server-wide performance issue, or specific to a single composite
Check overall system health, even the obvious areas
Obtaining composite instance performance metrics is easily done through SQL

Summary
For deployment errors:
Understand the ant compilation (i.e., packaging) and deployment processes
Understand adf-config.xml

Book
Oracle SOA Suite 11g Administrator’s Handbook
http://www.packtpub.com/oracle-soa-suite-11g-administrators-handbook/book
Chapter 6: Troubleshooting the Oracle SOA Suite 11g Infrastructure
“Highly recommended, a tour de force.” ~Mark Nelson, Oracle A-Team
http://redstack.wordpress.com/2012/10/28/a-review-of-oracle-soa-suite-11g-administrators-handbook/

Contact Information
Harold Dost III
Senior Consultant
@hdost

Evaluation
Session #: 185
Oracle SOA Suite 11g Troubleshooting Methodology
ioug.org/eval