Oracle ASH Monitoring

I have over 60 Oracle diagnostic SQL scripts in my arsenal.  A lot of them hit ASH and AWR.  I’ve tested these scripts on a high traffic web-site backed by an 6 node Oracle RAC Cluster.  Everyone of these scripts saved my day at one point, but there is one script that stands out from them all – Oracle ASH Top Waits Script: h1.sql.

Oracle ASH Top Waits

This little gem is the #1 thing I reach for when I am asked to troubleshoot a performance problem that was reported hours ago. For example lets say it’s 9:30am and I get a call from a dev saying that her APEX web application is hanging – she suspects a locking issue. All I need to know at this point is when the problem was first reported – armed with the start time (lets say 7:00am) I simply do this:

sqlplus / as sysdba
@h1 0700 0930 0 -1 -1

What’s happening here? There are 5 parameters:

1: start HHMI     [0700 = 7:00am]
2: end HHMI       [0930 = 9:30am]
3: days back      [0 = today; or 7 = seven days back]
4: instance       [1 = INST_ID=1, give -1 for all]
5: service_hash   [1225919510 = dba_services.name_hash, give -1 for all]

and here’s what I get back:

Oracle ASH Top Waits Script output sample

Can you spot the problem? It’s “SQL*Net message from dblink”, there are no locks – it’s a simple problem of a badly written distributed query that is waiting on remote DB. That was easy! One other hidden benefit of this script is that it saves it’s output in a table under a RUN_ID (in this case RUN_ID=81) allowing you to compare the output of two RUN_IDs and clearly spot the differences in values grouped by WAIT EVENT. This is extremely valuable especially when someone says – “this used to work last week!”, you simply do this:

sqlplus / as sysdba
@h1 0700 0930 7 -1 -1

The “7” in the third parameter instructs the script to look 7 days back for the same time frame (7-9:30am). The output of the above report will have it’s own RUN_ID=82 (next in sequence) and you can now compare the two using a special script h1d.sql like so:

@h1d 81 82

We didn’t need OEM or any GUI apps to get to the bottom of the problem – all because the diagnostics data is already in AWR tables and is available to us directly from command line / sqlplus. Years ago – we’d have to sample v$session_wait to get similar diagnostics, in fact, I wrote a complete monitoring system that utilized such technique. But now, Oracle built this into the core engine in a form of Active Session History (ASH) that automatically samples this data every second with practically no overhead to the database! That is an incredible level of instrumentation available to us and it would be a shame not to utilize it beyond what the OEM reports are capable of.

Note however that ASH sampling is short lived – only 1/10th of it’s sampling is saved in AWR based on some internal thresholds Oracle came up with. Now imagine that we:

  1. Take a script like h1.sql and refactor it to use ASH instead of AWR because ASH samples are at higher fidelity (AWR only get’s 1/10th of ASH data).
  2. Run this script every 3 minutes to capture the live heartbeat of the database.
  3. Define your own thresholds on top of this sampling and get notified if something is amiss.
  4. Save all this data for historical purposes so that even if AWR gets wiped out you have solid performance metrics years later.
  5. Wrap this all up in an easy to deploy (single binary) distribution which only takes a minute to install on a new host.

Are you interested? Does this sound like something you’d like in your shop? If yes – then I’d like to get some feedback from you because I am building it! Sign up for the EvenTorEx Mailing List to start the discussion and get the private beta as soon as it’s ready (no SPAM here I guarantee it!):

EvenTorEx Mailing List

December 2, 2015

Posted In: Operations, Scripts

Tags: , ,