DBAToolZ

Lightweight Oracle DBA Tools

 

Making sense of ASH TIME_WAITED Units

It’s widely believed that v$active_session_history.TIME_WAITED is presented in microseconds. It sure seems like it, but something is amiss as my small test shows. Using two sessions I created a test table in one, inserted a row and then from another session tried to select the same row for update:

| session A > SQL> create table ash_time_waited_test (t number);
| session A > 
| session A > Table created.
| session A > 
| session A > SQL> insert into ash_time_waited_test values(1);
| session A > 
| session A > 1 row created.
| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL> update ash_time_waited_test set t=2 where t=1;
| session A > 
| session A > 1 row updated.
 
| session B > SQL> !date
| session B > Thu Jul 30 18:20:15 PDT 2009
| session B > 
| session B > SQL> select * from ash_time_waited_test for update;
 
| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL>
| session A > 
 
| session B > 
| session B >          T
| session B > ----------
| session B >          2
| session B > 
| session B > SQL> !date
| session B > Thu Jul 30 18:20:38 PDT 2009
| session B >

What I found next was rather strange — the sum(TIME_WAITED) from ASH (v$active_session_history) did not add up to the total time session A kept the lock on the test table:

First we get the SQL_ID for session B (the waiter) so we can tie into ASH.SQL_ID later:

SQL> select sql_id from v$sql where sql_text = 'select * from ash_time_waited_test for update';
 
SQL_ID
-------------
fpm89fyrsf6gv

Now having the SQL_ID in hand we can dig into ASH to see what waits our session experienced:

SQL> col SAMPLE_TIME format a25
SQL> col event format a30
SQL> select sample_time,EVENT,WAIT_TIME,TIME_WAITED from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  order by sample_time;
 
SAMPLE_TIME               EVENT                           WAIT_TIME TIME_WAITED
------------------------- ------------------------------ ---------- -----------
30-JUL-09 06.20.25.751 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.26.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.27.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.28.761 PM enq: TX - row lock contention           0      489238
30-JUL-09 06.20.29.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.30.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.31.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.32.761 PM enq: TX - row lock contention           0      489253
30-JUL-09 06.20.33.761 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.34.761 PM enq: TX - row lock contention           0      489245
30-JUL-09 06.20.35.761 PM enq: TX - row lock contention           0      489245
 
11 rows selected.
 
SQL>

Based on the output above we know that the first sample that trapped this wait occurred at “06.20.25.751 PM” and the last one at “06.20.35.761 PM”. The difference (delta) between these two timestamps is just over 10 seconds:

SQL> select e.SAMPLE_TIME - s.SAMPLE_TIME
  2  from v$active_session_history s,
  3       v$active_session_history e
  4  where s.sample_id = (select min(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  5    and e.sample_id = (select max(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  6  ;
 
E.SAMPLE_TIME-S.SAMPLE_TIME
---------------------------------------------------------------------------
+000000000 00:00:10.010
 
SQL>

So far so good. Now lets see what we get by summing up the TIME_WAITED:

SQL> select sum(TIME_WAITED) from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  ;
 
SUM(TIME_WAITED)
----------------
         5381663

Interesting? If the ASH.TIME_WAITED is measured in microseconds then the total wait time = 5381663/1000000 = 5.381663 seconds and as we know it’s not correct. In fact, we can prove this using another metric, the ELAPSED_TIME_* from DBA_HIST_SQLSTAT:

SQL> select ELAPSED_TIME_TOTAL, ELAPSED_TIME_DELTA,
  2  APWAIT_TOTAL, APWAIT_DELTA,
  3  CCWAIT_TOTAL, CCWAIT_DELTA
  4  from DBA_HIST_SQLSTAT
  5  where sql_id = 'fpm89fyrsf6gv'
  6  ;
 
ELAPSED_TIME_TOTAL ELAPSED_TIME_DELTA APWAIT_TOTAL APWAIT_DELTA CCWAIT_TOTAL CCWAIT_DELTA
------------------ ------------------ ------------ ------------ ------------ ------------
          11110172           11110172     11106973     11106973          254          254
 
SQL>

DBA_HIST_SQLSTAT is dead on — with Elapsed Time Total = 11110172/1000000 = 11.110172 seconds … So what’s happening with ASH then? Why such a large > 2 times discrepancy?

I have my theory — I think that because ASH sampling is taken every second half of that seconds is actually taken to process the data and for this reason the TIME_WAITED gets truncated. Whatever the reason is, one thing is for sure — ASH.TIME_WAITED, at least in this particular case, cannot be trusted.

The final question … how do you then use ASH? How do you arrive at a more or less accurate metrics, specifically the TIME_WAITED? One way is this — we know what ASH sampling takes place every second, so instead of relying on the buggy TIME_WAITED values we can rely on the actual number of times that a SQL_ID appeared in ASH … If a SQL_ID was trapped by ASH 10 consecutive times with WAIT_TIME=0 (Waiting) we then know for sure that it waited for at least 10 seconds …

UPDATE Aug 12, 2009

Graham Wood from Oracle explains the issue with ASH: TIME_WAITED in his paper Sifting through the ASHes (see Page 35). He also lists some other “ASH data gotcha’s” on page 27. What really caught my attention through, was page 36, which explains “ASH vs AWR/Statpack” and it specifically says that for “Counts/occurrence data” analysis you should go to AWR not ASH.

And it really makes sense, after all, ASH is sampled every second and while it sounds like “often enough” it really isn’t. All you have to do is run sql_trace on a busy session and you’ll see that there’s a lot that goes on between the 1 second intervals — all of it would be missed by ASH. ASH is great at putting spotlight on the “big issues”, not so great for subtle, sub second waits which could and do add up to a considerable wait time.

Bottom line for me is that there’s no substitute for the good, old sql_trace, but overall, ASH is a huge improvement over what we had or rather didn’t have before. It definitely replaces my own monitoring scripts which I’ve used for many years now, sampling v$session_wait where wait_time=0 every 5 seconds.

No related posts.


6 Responses to “Making sense of ASH TIME_WAITED Units”

  1. 1
    Karl Arao:

    Yes I agree :)
    ASH data is a statistical approximation.. Not exact, but close enough to be used.. that’s why we do count(*) because each sample is 1second. You can know more from the seminar slides of Kyle Hailey. The first 9 slides tackled more on ASH..

    http://www.perfvision.com/ftp/emea/

    - Karl Arao

  2. 2
    MVE:

    Thanks Karl! I am going through the PPPs and the AAS (Average Active Sessions) approach to looking at system overload makes sense to me — good stuff :)

  3. 3
    Blogroll Report 31/07/2009 – 07/08/2009 « Coskan’s Approach to Oracle:

    [...] Vitaliy Mogilevskiy- Making sense of ASH TIME_WAITED Units [...]

  4. 4
    Roger Baker:

    You mention at the top of the article that it is believed TIME_WAITED is presented in microseconds. However the Graham Wood article that you reference, http://www.oracle.com/technology/products/manageability/database/pdf/twp03/PPT_active_session_history.pdf indicates on page 35 that the units are “ms” which is milliseconds. I realize whatever it is, it is an approximation, but why cant Oracle just tell us what it is in their documentation?

  5. 5
    MVE:

    @Roger,

    I think it’s just the way the author represented the test case in this document, it wasn’t meant as the UOM definition:

  6. 6
    JB:

    Not sure what version you were using when this test was run but looks like it was buggy, 10.1 perhaps?

    For the record:

    1/ TIME_WAITED is in microseconds
    2/ TIME_WAITED is unknown if session is in wait when sampled into ASH, it is “fixed up” to be accurate in subsequent samples
    3/ You cannot in general hope to add up TIME_WAITED over samples and get “total wait time” for session
    4/ Enqueue waits like your experiment time-out every 3 seconds and re-wait

    Here is output from similar experiment to yours executed on 10.2. Session was captured into ASH waiting for enqueue in 8 samples. Enqueue timed-out twice and these (completed) waits were “fixed up” to show 3 seconds of wait (2999200 and 2999926 microseconds) with a final wait of 1.6 seconds for total of approx 7.6 seconds. Sum of TIME_WAITED in this case is equal to cumulative total wait time for the lock.

    SQL> /

    SAMPLE_TIME EVENT TIME_WAITED
    ———————————– —————————— ———–
    28-FEB-12 02.56.39.080 PM enq: TX – row lock contention 1652592
    28-FEB-12 02.56.38.080 PM enq: TX – row lock contention 0
    28-FEB-12 02.56.37.080 PM enq: TX – row lock contention 2999926
    28-FEB-12 02.56.36.080 PM enq: TX – row lock contention 0
    28-FEB-12 02.56.35.080 PM enq: TX – row lock contention 0
    28-FEB-12 02.56.34.080 PM enq: TX – row lock contention 2999200
    28-FEB-12 02.56.33.080 PM enq: TX – row lock contention 0
    28-FEB-12 02.56.32.080 PM enq: TX – row lock contention 0

Leave a Reply

Vitaliy Mogilevskiy - Solving ORACLE problems since 1996.

View Vitaliy Mogilevskiy's profile on LinkedIn

What you will find here is what works for me day in day out - 24x7x365 - simple ORACLE DBA Tools written using components that are available on every Unix platform with an ORACLE database -- sqlplus and KSH.

Do you have DBA questions? Head over to Ask DBA.

RSS DBA Forums


Recent Comments

  • Sachin: Thanks! This helped really well.
  • jeevan: Hi, how to calculate Oracle process maximum limit to set based on servers.
  • vivek: I am getting following error SQL*Plus: Release 11.1.0.7.0 – Production on Thu May 22 18:42:01 2014...
  • Kevin: Thanks Vitaliy. It was a big help.
  • derm exclusive a: WOW just ԝhat I was looking for. Came here by searching for instantclient basic

Categories

Blogroll


Archives