Troubleshooting AWR Snapshot Collection Issues (Doc ID 1301503.1)

8/19/2019 Troubleshooting AWR Snapshot Collection Issues (Doc ID 1301503.1)

1/8

Troubleshooting: AWR Snapshot Collection Issues (Doc ID 1301503.1)

In this Document

Purpose

Troubleshooting Steps

Snapshots not starting

Snapshots not completing successfully

Known Issues

Community Discussions

References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.4 and later

Information in this document applies to any platform.

Oracle RDBMS Automatic Workload Repository

PURPOSE

The purpose of this document is to describe the diagnostic data that can be collected and the corrective steps that can be taken when

encountering issues with slow, hung or missing AWR snapshots.

TROUBLESHOOTING STEPS

Sometimes AWR snapshots aren't created as expected. A check in DBA_HIST_SNAPSHOT shows no snapshots in the expected time range.

col instance_number format 999 heading 'Inst'

col startup_time format a15 heading 'Startup'

col begin_interval_time format a15 heading 'Begin snap'

col end_interval_time format a15 heading 'End Snap'

col flush_elapsed format a20 heading 'flush elapsed' col error_count format 9999 heading 'Err#'

SQL> select instance_number, startup_time,begin_interval_time,

end_interval_time,flush_elapsed,error_count

from dba_hist_snapshot where rownum < 5

order by begin_interval_time desc

/

Inst Startup Begin snap End Snap flush elapsed Err#

---- --------------- --------------- ------------- ------------------ -----

1 12-FEB-11 09.27 16-FEB-11 09.00 16-FEB-11 10.00 +00000 00:00:00.2 0

1 12-FEB-11 09.27 15-FEB-11 09.00 15-FEB-11 10.00 +00000 00:00:00.1 0

1 12-FEB-11 09.27 12-FEB-11 03.00 12-FEB-11 04.00 +00000 00:00:00.1 0

1 12-FEB-11 09.27 12-FEB-11 02.00 12-FEB-11 03.00 +00000 00:00:00.1 0

There are two different causes for this scenario: snapshots not starting, and snapshots not completing successfully.

Snapshots not starting

Start out by verifying if snapshots are in fact starting. You need to know the time of the last snapshot and the configured snapshot interval.

You can find this with the following SQL:

col systimestamp form a35

col most_recent_snap_time form a25

col snap_interval form a17

SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid

from v$database);

SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL

----------------------------------- ------------------------- --------------


2/8

28-DEC-09 01.15.25.098600 PM -08:00 24-DEC-09 01.00.13.418 PM 01:00:00.0

Snapshots should happen every hour (+0 01:00:00). AWR tries to take the snapshots at the top of the hour (08:00, 09:00, 10:00, ...). That

means that the first snapshot after startup, or the first automatic snapshot following a manual snapshot, may appear to be "late" or "early".

Common reasons snapshots might not be collected are given below.

Database Closed

In the case shown here, the last snap time was 24th Dec 01:00 PM (see output above) and the date at the time of checking is 28th Dec, so

here the last snapshot was taken over 4 days ago. The explanation is simple: the database was just opened, and has been closed for severa

days.

SQL> alter session set nls_date_format = 'DD-MON-YYYY HH24:MI:SS';

Session altered.

SQL> select open_time from v$thread ;

OPEN_TIME

--------------------

28-DEC-2009 13:11:46

The instance alert log will confirm this.

STATISTICS_LEVEL setting

If the parameter STATISTICS_LEVEL is set to BASIC then snapshots will not be taken automatically.

SQL> show parameter statistic

NAME TYPE VALUE

------------------------------------ ----------- -------------------

optimizer_use_pending_statistics boolean FALSE

statistics_level string TYPICAL

timed_os_statistics integer 0

timed_statistics boolean TRUE

Parameter with "SWRF", "AWR" or "SYSAUX" in them could also have an impact. If any of those are set find out why. If they're no longer

required try to unset them.

Database Open Mode

Some open modes may also disable automatic snapshots. For example, in restricted mode or read only mode, snapshots are disabled.

Check the database open mode:

SQL> select open_mode from v$database;

OPEN_MODE

----------

READ WRITE

If the above isn't sufficient to resolve the issue, customers may need to gather traces and send to Oracle support. Often, it's best to get the

MMON action trace ("28" below) and the snapshot flush trace ("10" below).

SQL> alter session set "_swrf_test_action" = 28;

Session altered.


Session altered.

Wait for at least an hour, and then gather the MMON and M00x traces.

To disable the traces again, use the following syntax:


Session altered.


Session altered.

With the tracing enabled the MMON trace file shows the start time and completion messages for each auto flush and purge:

*** MODULE NAME:() 2011-02-17 06:27:37.051


3/8

*** ACTION NAME:() 2011-02-17 06:27:37.051

KEWRAFM: Beginning one MMON Auto-Flush cycle ...

Finished one MMON Auto-Flush cycle.

*** 2011-02-17 06:28:37.057

KEWRAFM: Beginning one MMON Auto-Flush cycle ...

Finished one MMON Auto-Flush cycle.

*** 2011-02-17 06:29:22.192

KEWRAPM: Beginning one MMON Auto-Purge cycle ...

KEWRAPM: Finished one MMON Auto-Purge cycle.

The MMON slave file trace file output will show the begin and end times of each step of the AWR statistics

collection:

*** MODULE NAME:(MMON_SLAVE) 2011-02-17 07:00:16.734

*** ACTION NAME:(Auto-Flush Slave Action) 2011-02-17 07:00:16.734

kewrfos_flush_onesnap: Flushing snapshot id: 584 at time: 02/17/11 07:00:16

[WRH$_ACTIVE_SESSION_HISTORY Total], btime: 2058403137, etime:2058403138, Elapsed=1

[WRH$_SYSMETRIC_HISTORY Total], btime: 2058403138, etime:2058403141, Elapsed=3

[WRH$_FILEMETRIC_HISTORY Total], btime: 2058403141, etime:2058403141, Elapsed=0

[WRH$_WAITCLASSMETRIC_HISTORY Total], btime: 2058403141, etime:2058403141, Elapsed=0

[WRH$_SESSMETRIC_HISTORY Total], btime: 2058403141, etime:2058403141, Elapsed=0

[Top Segment Selection], btime: 2058403141, etime:2058403141, Elapsed=0

Snapshots not completing successfully

One common source of confusion is that the view DBA_HIST_SNAPSHOT filters out snapshots that didn't complete successfully.

That means that DBA_HIST_SNAPSHOT is not a reliable source to investigate this problem. Instead we can refer to the information in

WRM$_SNAPSHOT. For instance, let's look at the last 10 snapshots in the database. First, please run this query to verify whether

WRM$_SNAPSHOT_DETAILS in the particular release or not:

select owner, object_name, object_type from dba_objects where object_name LIKE UPPER('WRM$%')

and object_type='TABLE';

WRM$_SNAPSHOT_DETAILS does not exist in 11.1.0.7 database but exists in 11.2.0.3.

If this object do not exist in certain release, the SQLs will generate error when executed.

col instance_number form 999 head INST

col begin_interval_time form a25 col flush_elapsed form a17

col status form 999

col error_count form 999 head ERR

col snap_flag form 999 head SNAP

SQL> select * from

(select snap_id,

instance_number,

begin_interval_time,

flush_elapsed,

status,

error_count,

snap_flag

from wrm$_snapshot

where dbid = (select dbid from v$database)

order by snap_id desc)

where rownum


4/8

In this case, we can see that a few of the snapshots had errors (ERROR_COUNT > 0), but they still completed (STATUS = 0 and

FLUSH_ELAPSED is not NULL).

To investigate the errors, start with the WRM$_SNAP_ERROR table:

SQL> select * from wrm$_snap_error

where dbid = (select dbid from v$database)

order by snap_id;

SNAP_ID DBID INST TABLE_NAME ERROR_NUMBER

---------- ---------- ----- ------------------------------ ------------

97 581521084 1 WRH$_SQL_PLAN 942

99 581521084 1 WRH$_SQL_PLAN 942100 581521084 1 WRH$_SQL_PLAN 942

In most cases, the M00x process will write some information about the error to a file as well. It will be either an incident dump, or a regular

trace file. From 11g onwards incident dumps should always be collected for service requests where available. If there are no errors, but

status is non-zero, it means that the snapshot ran for too long and was terminated, or that it's still running. In those cases, it's good to trace

the snapshots to get some timing information.

If this happens with automatic snapshots, you may find that manual snapshots actually complete but take a long period of time to do so.

Tracing with normal SQL trace will tend to just show high CPU usage:

begin

dbms_workload_repository.create_snapshot();

end;

call count cpu elapsed disk query current rows

------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse 1 0.00 0.00 0 0 0 0

Execute 1 485.98 485.98 0 1 34 1

Fetch 0 0.00 0.00 0 0 0 0

------- ------ -------- ---------- ---------- ---------- ---------- ----------

total 2 485.98 485.98 0 1 34 1

This is not particularly useful in terms of diagnostics. You need to get further into the processing so as to determine the source of the issue.

The easiest way to trace the snapshots is the same one shown above:


Session altered.

This will write some information (about 120 rows per snapshot) into the M00x process trace, including some timing information.

The output willl be similar to the following:

...

[Top SQL Selection], btime: 1961091529, etime:1961095597, Elapsed=4068

[WRH$_SQL_BIND_METADATA Attr Selection], btime: 1961095597, etime:1961095598, Elapsed=1

[WRH$_SQL_BIND_METADATA Update], btime: 1961095598, etime:1961095598, Elapsed=0

[WRH$_SQL_BIND_METADATA Insert], btime: 1961095598, etime:1961095599, Elapsed=1

[WRH$_SQL_BIND_METADATA Total], btime: 1961095597, etime:1961095599, Elapsed=2

[WRH$_SQL_PLAN Attr Selection], btime: 1961095599, etime:1961095601, Elapsed=2

[WRH$_SQL_PLAN Update], btime: 1961095601, etime:1961095602, Elapsed=1

[WRH$_SQL_PLAN Insert], btime: 1961095602, etime:1961095604, Elapsed=2

[WRH$_SQL_PLAN Total], btime: 1961095599, etime:1961095604, Elapsed=5

[WRH$_OPTIMIZER_ENV Attr Selection], btime: 1961095604, etime:1961095604, Elapsed=0

[WRH$_OPTIMIZER_ENV Update], btime: 1961095604, etime:1961095604, Elapsed=0[WRH$_OPTIMIZER_ENV Insert], btime: 1961095604, etime:1961095604, Elapsed=0

[WRH$_OPTIMIZER_ENV Total], btime: 1961095604, etime:1961095604, Elapsed=0

[WRH$_SQLTEXT Attr Selection], btime: 1961095604, etime:1961095605, Elapsed=1

[WRH$_SQLTEXT Update], btime: 1961095605, etime:1961095605, Elapsed=0

[WRH$_SQLTEXT Insert], btime: 1961095605, etime:1961095605, Elapsed=0

[WRH$_SQLTEXT Total], btime: 1961095604, etime:1961095605, Elapsed=1

[WRH$_SQLSTAT All Stats], btime: 1961095605, etime:1961095618, Elapsed=13

[WRH$_SQLSTAT Total], btime: 1961095605, etime:1961095618, Elapsed=13

[WRH$_SQL_SUMMARY Total], btime: 1961095618, etime:1961095618, Elapsed=0

[SQL Group Total], btime: 1961091529, etime:1961095618, Elapsed=4089

[Snapshot Total], btime: 1961089445, etime:1961095618, Elapsed=6173

To interpret, look at the [Snapshot Total] at the end of the report and then see which like makes up the largest proportion. In this case the

[Top SQL Selection] section is using the time:


5/8

[Top SQL Selection], btime: 1961091529, etime:1961095597, Elapsed=4068

[WRH$_SQL_BIND_METADATA Attr Selection], btime: 1961095597, etime:1961095598, Elapsed=1

...

[WRH$_SQL_SUMMARY Total], btime: 1961095618, etime:1961095618, Elapsed=0

[SQL Group Total], btime: 1961091529, etime:1961095618, Elapsed=4089

[Snapshot Total], btime: 1961089445, etime:1961095618, Elapsed=6173

You can use this information to look for similar bugs or other occurrences.

If more details are required, then SQL tracing can be used to collect more information. There are two ways of getting SQL trace for

snapshots. The best way is usually to set module/action-specific tracing for the two flush actions:

begin

dbms_monitor.serv_mod_act_trace_enable(service_name=>'SYS$BACKGROUND',

module_name=>'MMON_SLAVE',

action_name=>'Auto-Flush Slave Action');

dbms_monitor.serv_mod_act_trace_enable(service_name=>'SYS$BACKGROUND',

module_name=>'MMON_SLAVE',

action_name=>'Remote-Flush Slave Action');

end;

/

To disable the tracing later, use "dbms_monitor.serv_mod_act_trace_disable" instead.

An alternative way is to use an AWR test event:

alter session set events 'immediate trace name awr_test level 3';

To disable that form of tracing later, use "awr_test level 4" instead.

For completeness, there are two other forms of tracing that can also be interesting.

# trace AWR snapshots


# trace AWR purging


# trace AWR SQL


# turn off all of the above, if set


Note: These commands will take effect for the entire instance, even though the syntax used is "alter session".

ADDM

There's one additional scenario where snapshots don't finish properly. At the end of the snapshot itself, we also start ADDM (the Automatic

Database Diagnostic Monitor), which analyzes ASH data to identify potential performance issues. If the call stack (use "pstack" or similar)

shows "keh" functions, ADDM is most likely running. The proper diagnosis of ADDM performance is a separate topic. However, a good place

to start would be to get SQL trace for the process using the 10046 trace event, and run tkprof to see if there are any unusual amounts of

waits or any plans that look bad.

Workarounds

When you review the trace information, you might find a single table that seems to be causing a problem. If so, you can try to disable the

flushing of that table, so that you can verify if snapshots work OK after that. Here's an example of how to disable flush of the table

WRH$_TEMPSTATXS. Please note that this command should be run on all instances in the cluster.

alter system set "_awr_disabled_flush_tables" = 'wrh$_tempstatxs';

Complications

In some cases, the SQL trace will show a lot of 'reliable message' waits. That just means that the local node (the one that started thesnapshot) is waiting for the remote nodes to complete. The real issue at that point will be in the trace files from the remote nodes.

In some cases, all local and remote nodes will have completed their snapshots successfully, but the local node doesn't realize that, and

keeps waiting for 'reliable message' until it hits the 15-minute timeout. This is believed to be caused by a bug, but it has not yet been

resolved.

If this is the case, each node will have a successfully completed snapshot (STATUS=0) with an elapsed time less than the timeout

(FLUSH_ELAPSED < 00:15:00.00).

Table timing information

Release 11.2.0.2 introduces a new table, WRM$_SNAPSHOT_DETAILS, which contains timing information for each individual table flush.

This can be used to determine which table is taking a lot of time. For instance, sometimes a single table takes close to 15 minutes, causing

the entire flush to fail.

set pagesize 999


6/8

column name format a28

column time format a29

variable dbid number

exec select dbid into :dbid from v$database;

PL/SQL procedure successfully completed.

SQL> variable snap_id number

SQL> exec select max(snap_id) into :snap_id from wrm$_snapshot where dbid=:dbid;

PL/SQL procedure successfully completed.

SQL> select table_name_kewrtb name, end_time-begin_time time2 from wrm$_snapshot_details, x$kewrtb

3 where snap_id = :snap_id

4 and dbid = :dbid

5 and table_id = table_id_kewrtb

6 order by table_id;

NAME TIME

---------------------------- -----------------------------

WRH$_STAT_NAME +000000000 00:00:00.238

WRH$_PARAMETER_NAME +000000000 00:00:03.159

WRH$_EVENT_NAME +000000000 00:00:00.476

WRH$_LATCH_NAME +000000000 00:00:00.296

WRH$_FILESTATXS +000000000 00:00:00.041

WRH$_TEMPSTATXS +000000000 00:00:00.011

WRH$_DATAFILE +000000000 00:00:00.018

WRH$_TEMPFILE +000000000 00:00:00.033

WRH$_SQLSTAT +000000000 00:00:00.071WRH$_SQLTEXT +000000000 00:00:00.000

WRH$_SQLTEXT +000000000 00:00:00.067

WRH$_SQL_SUMMARY +000000000 00:00:00.023

WRH$_SQL_PLAN +000000000 00:00:00.204

WRH$_SYSTEM_EVENT +000000000 00:00:00.042

WRH$_BG_EVENT_SUMMARY +000000000 00:00:00.040

WRH$_WAITSTAT +000000000 00:00:00.018

WRH$_ENQUEUE_STAT +000000000 00:00:00.031

WRH$_LATCH +000000000 00:00:00.028

WRH$_LATCH_MISSES_SUMMARY +000000000 00:00:00.125

WRH$_LIBRARYCACHE +000000000 00:00:00.010

WRH$_DB_CACHE_ADVICE +000000000 00:00:00.079

WRH$_BUFFER_POOL_STATISTICS +000000000 00:00:00.027

WRH$_ROWCACHE_SUMMARY +000000000 00:00:00.033

WRH$_SGA +000000000 00:00:00.010

WRH$_SGASTAT +000000000 00:00:00.047WRH$_PGASTAT +000000000 00:00:00.014

WRH$_RESOURCE_LIMIT +000000000 00:00:00.012

WRH$_SHARED_POOL_ADVICE +000000000 00:00:00.017

WRH$_SQL_WORKAREA_HISTOGRAM +000000000 00:00:00.016

WRH$_PGA_TARGET_ADVICE +000000000 00:00:00.009

WRH$_INSTANCE_RECOVERY +000000000 00:00:00.026

WRH$_SYSSTAT +000000000 00:00:00.110

WRH$_PARAMETER +000000000 00:00:00.052

WRH$_UNDOSTAT +000000000 00:00:00.009

WRH$_PROCESS_MEMORY_SUMMARY +000000000 00:00:00.048

WRH$_SEG_STAT +000000000 00:00:00.048

WRH$_SEG_STAT_OBJ +000000000 00:00:00.246

WRH$_SERVICE_NAME +000000000 00:00:00.013

WRH$_SERVICE_STAT +000000000 00:00:00.040

WRH$_ACTIVE_SESSION_HISTORY +000000000 00:00:00.172

WRH$_SYSMETRIC_HISTORY +000000000 00:00:00.009

WRH$_FILEMETRIC_HISTORY +000000000 00:00:00.007WRH$_WAITCLASSMETRIC_HISTORY +000000000 00:00:00.007

WRH$_TABLESPACE_STAT +000000000 00:00:00.070

WRH$_LOG +000000000 00:00:00.071

WRH$_MTTR_TARGET_ADVICE +000000000 00:00:00.009

WRH$_TABLESPACE_SPACE_USAGE +000000000 00:00:00.027

WRH$_SQL_BIND_METADATA +000000000 00:00:00.141

WRH$_JAVA_POOL_ADVICE +000000000 00:00:00.032

WRH$_THREAD +000000000 00:00:00.006

WRH$_SESSMETRIC_HISTORY +000000000 00:00:00.012

WRH$_OSSTAT +000000000 00:00:00.031

WRH$_SYS_TIME_MODEL +000000000 00:00:00.028

WRH$_OPTIMIZER_ENV +000000000 00:00:00.011

WRH$_SERVICE_WAIT_CLASS +000000000 00:00:00.036

SQL Group +000000000 00:00:00.542


7/8

Segment Group +000000000 00:00:00.312

Datafile Group +000000000 00:00:00.060

Tempfile Group +000000000 00:00:00.044

Service Group +000000000 00:00:00.089

Undo Group +000000000 00:00:00.009

WRH$_COMP_IOSTAT +000000000 00:00:00.007

WRH$_SGA_TARGET_ADVICE +000000000 00:00:00.008

WRH$_EVENT_HISTOGRAM +000000000 00:00:00.033

WRH$_MUTEX_SLEEP +000000000 00:00:00.030

WRH$_MEMORY_TARGET_ADVICE +000000000 00:00:00.007

WRH$_MEMORY_RESIZE_OPS +000000000 00:00:00.056

WRH$_IOSTAT_FUNCTION +000000000 00:00:00.037

WRH$_IOSTAT_FILETYPE +000000000 00:00:00.017WRH$_IOSTAT_FILETYPE_NAME +000000000 00:00:00.053

WRH$_RSRC_CONSUMER_GROUP +000000000 00:00:00.024

WRH$_RSRC_PLAN +000000000 00:00:00.005

WRH$_MEM_DYNAMIC_COMP +000000000 00:00:00.008

WRH$_DISPATCHER +000000000 00:00:00.006

WRH$_SHARED_SERVER_SUMMARY +000000000 00:00:00.033

WRH$_IOSTAT_DETAIL +000000000 00:00:00.036

WRH$_MVPARAMETER +000000000 00:00:00.023

WRH$_TABLESPACE +000000000 00:00:00.014

78 rows selected.

In this case, WRH$_PARAMETER_NAME took a lot more time than the other tables, but still not enough to be a problem.

Please note that WRH$_SQLTEXT was flushed twice. That's expected behaviour.

Known Issues

Document 1392603.1 AWR or STATSPACK Snapshot collection extremely slow in 11gR2

Document 1466333.1 AWR Snapshots Missing Intermittently Accompanied by Spike in CPU from MMON Process

Community Discussions

Still have questions? Use the communities window below to search for similar discussions or start a new discussion on this subject. (Window

is the live community not a screenshot)

Click here to open in main browser window


8/8

All Places > My Oracle Support Community > Oracle Database (MOSC) > Database Tuning (MOSC) > Discussions

33049 Views Tags:

Discussion Thread: Diagnosing Database Pe[Document ID 1359094.1]

Steve Dixon-Oracle Apr 23, 2014 6:24 AM

This thread can be used for discussions associated with : FAQ: How to Use AWR Reports

[ID 1359094.1 ]

Additionally there is a thread discussing an article: AWR Report Interpretation Checklist f

(Doc ID 1628089.1 ) which you can find here.

REFERENCES

NOTE:748642.1 - How to Generate an AWR Report and Create Baselines

NOTE:761298.1 - MMON Suspended Due to ORA-12751 "cpu time or run time policy violation"NOTE:1392603.1 - AWR or STATSPACK Snapshot collection extremely slow in 11gR2

NOTE:1086120.1 - Quick Instructions For Obtaining The Automatic Workload Repository (AWR) Report

NOTE:13257247.8 - Bug 13257247 - AWR Snapshot collection hangs due to slow inserts into WRH$_TEMPSTATXS.

NOTE:1363422.1 - Automatic Workload Repository (AWR) Reports - Main Information Sources

Documents

Troubleshooting AWR Snapshot Collection Issues (Doc ID 1301503.1)