It’s widely believed that ASH / v$active_session_history.TIME_WAITED is presented in microseconds. It sure seems like it, but something is amiss as my small test shows. Using two sessions I created a test table in one, inserted a row and then from another session tried to select the same row for update:
| session A > SQL> create table ash_time_waited_test (t number);
| session A >
| session A > Table created.
| session A >
| session A > SQL> insert into ash_time_waited_test values(1);
| session A >
| session A > 1 row created.
| session A >
| session A > SQL> commit;
| session A >
| session A > Commit complete.
| session A >
| session A > SQL> update ash_time_waited_test set t=2 where t=1;
| session A >
| session A > 1 row updated.
| session B > SQL> !date
| session B > Thu Jul 30 18:20:15 PDT 2009
| session B >
| session B > SQL> select * from ash_time_waited_test for update;
| session A >
| session A > SQL> commit;
| session A >
| session A > Commit complete.
| session A >
| session A > SQL>
| session A >
| session B >
| session B > T
| session B > ----------
| session B > 2
| session B >
| session B > SQL> !date
| session B > Thu Jul 30 18:20:38 PDT 2009
| session B >
What I found next was rather strange — the sum(TIME_WAITED) from ASH (v$active_session_history) did not add up to the total time session A kept the lock on the test table:
First we get the SQL_ID for session B (the waiter) so we can tie into ASH.SQL_ID later:
SQL> select sql_id from v$sql where sql_text = 'select * from ash_time_waited_test for update';
SQL_ID
-------------
fpm89fyrsf6gv
Now having the SQL_ID in hand we can dig into ASH to see what waits our session experienced:
SQL> col SAMPLE_TIME format a25
SQL> col event format a30
SQL> select sample_time,EVENT,WAIT_TIME,TIME_WAITED from v$active_session_history
2 where sql_id = 'fpm89fyrsf6gv'
3 order by sample_time;
SAMPLE_TIME EVENT WAIT_TIME TIME_WAITED
------------------------- ------------------------------ ---------- -----------
30-JUL-09 06.20.25.751 PM enq: TX - row lock contention 0 489233
30-JUL-09 06.20.26.761 PM enq: TX - row lock contention 0 489244
30-JUL-09 06.20.27.761 PM enq: TX - row lock contention 0 489242
30-JUL-09 06.20.28.761 PM enq: TX - row lock contention 0 489238
30-JUL-09 06.20.29.761 PM enq: TX - row lock contention 0 489244
30-JUL-09 06.20.30.761 PM enq: TX - row lock contention 0 489242
30-JUL-09 06.20.31.761 PM enq: TX - row lock contention 0 489244
30-JUL-09 06.20.32.761 PM enq: TX - row lock contention 0 489253
30-JUL-09 06.20.33.761 PM enq: TX - row lock contention 0 489233
30-JUL-09 06.20.34.761 PM enq: TX - row lock contention 0 489245
30-JUL-09 06.20.35.761 PM enq: TX - row lock contention 0 489245
11 rows selected.
SQL>
Based on the output above we know that the first sample that trapped this wait occurred at “06.20.25.751 PM” and the last one at “06.20.35.761 PM”. The difference (delta) between these two timestamps is just over 10 seconds:
SQL> select e.SAMPLE_TIME - s.SAMPLE_TIME
2 from v$active_session_history s,
3 v$active_session_history e
4 where s.sample_id = (select min(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
5 and e.sample_id = (select max(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
6 ;
E.SAMPLE_TIME-S.SAMPLE_TIME
---------------------------------------------------------------------------
+000000000 00:00:10.010
SQL>
So far so good. Now lets see what we get by summing up the TIME_WAITED:
SQL> select sum(TIME_WAITED) from v$active_session_history
2 where sql_id = 'fpm89fyrsf6gv'
3 ;
SUM(TIME_WAITED)
----------------
5381663
Interesting? If the ASH.TIME_WAITED is measured in microseconds then the total wait time = 5381663/1000000 = 5.381663 seconds and as we know it’s not correct. In fact, we can prove this using another metric, the ELAPSED_TIME_* from DBA_HIST_SQLSTAT:
SQL> select ELAPSED_TIME_TOTAL, ELAPSED_TIME_DELTA,
2 APWAIT_TOTAL, APWAIT_DELTA,
3 CCWAIT_TOTAL, CCWAIT_DELTA
4 from DBA_HIST_SQLSTAT
5 where sql_id = 'fpm89fyrsf6gv'
6 ;
ELAPSED_TIME_TOTAL ELAPSED_TIME_DELTA APWAIT_TOTAL APWAIT_DELTA CCWAIT_TOTAL CCWAIT_DELTA
------------------ ------------------ ------------ ------------ ------------ ------------
11110172 11110172 11106973 11106973 254 254
SQL>
DBA_HIST_SQLSTAT is dead on — with Elapsed Time Total = 11110172/1000000 = 11.110172 seconds … So what’s happening with ASH then? Why such a large > 2 times discrepancy?
I have my theory — I think that because ASH sampling is taken every second half of that seconds is actually taken to process the data and for this reason the TIME_WAITED gets truncated. Whatever the reason is, one thing is for sure — ASH.TIME_WAITED, at least in this particular case, cannot be trusted.
The final question … how do you then use ASH? How do you arrive at a more or less accurate metrics, specifically the TIME_WAITED? One way is this — we know what ASH sampling takes place every second, so instead of relying on the buggy TIME_WAITED values we can rely on the actual number of times that a SQL_ID appeared in ASH … If a SQL_ID was trapped by ASH 10 consecutive times with WAIT_TIME=0 (Waiting) we then know for sure that it waited for at least 10 seconds …
UPDATE Aug 12, 2009
Graham Wood from Oracle explains the issue with ASH: TIME_WAITED in his paper Sifting through the ASHes (see Page 35). He also lists some other “ASH data gotcha’s” on page 27. What really caught my attention through, was page 36, which explains “ASH vs AWR/Statpack” and it specifically says that for “Counts/occurrence data” analysis you should go to AWR not ASH.
And it really makes sense, after all, ASH is sampled every second and while it sounds like “often enough” it really isn’t. All you have to do is run sql_trace on a busy session and you’ll see that there’s a lot that goes on between the 1 second intervals — all of it would be missed by ASH. ASH is great at putting spotlight on the “big issues”, not so great for subtle, sub second waits which could and do add up to a considerable wait time.
Bottom line for me is that there’s no substitute for the good, old sql_trace, but overall, ASH is a huge improvement over what we had or rather didn’t have before. It definitely replaces my own monitoring scripts which I’ve used for many years now, sampling v$session_wait where wait_time=0 every 5 seconds.
Sat: January 2 2016: Update!
I am building a new monitoring framework that:
- Samples Oracle ASH every N minutes and captures the live heartbeat of the database.
- Allows you to define thresholds on top of this sampling and notifies you if something is amiss.
- Saves all this data for historical purposes so that when AWR is wiped out you have solid performance metrics years later.
- Monitors Oracle LOCKS.
- Monitors Oracle SPACE.
- Delivers this functionality in an easy to deploy (single binary) distribution that only takes a minute to install on a new host.
Does this sound like something you’d like in your shop? If yes – then I’d like to hear from you – sign up for the mailing list below to start the discussion and get the private beta as soon as it’s ready (no SPAM here I guarantee it!):
EvenTorEx Mailing List6 Comments
Vitaliy Mogilevskiy July 31, 2009
Posted In: Tuning
Tags: ASH, DBA_HIST_SQLSTAT, microseconds, SQL_ID, TIME_WAITED, v$active_session_history, WAIT_TIME
Yes I agree
ASH data is a statistical approximation.. Not exact, but close enough to be used.. that’s why we do count(*) because each sample is 1second. You can know more from the seminar slides of Kyle Hailey. The first 9 slides tackled more on ASH..
http://www.perfvision.com/ftp/emea/
– Karl Arao
Thanks Karl! I am going through the PPPs and the AAS (Average Active Sessions) approach to looking at system overload makes sense to me — good stuff
[…] Vitaliy Mogilevskiy- Making sense of ASH TIME_WAITED Units […]
You mention at the top of the article that it is believed TIME_WAITED is presented in microseconds. However the Graham Wood article that you reference, http://www.oracle.com/technology/products/manageability/database/pdf/twp03/PPT_active_session_history.pdf indicates on page 35 that the units are “ms” which is milliseconds. I realize whatever it is, it is an approximation, but why cant Oracle just tell us what it is in their documentation?
@Roger,
I think it’s just the way the author represented the test case in this document, it wasn’t meant as the UOM definition:
Not sure what version you were using when this test was run but looks like it was buggy, 10.1 perhaps?
For the record:
1/ TIME_WAITED is in microseconds
2/ TIME_WAITED is unknown if session is in wait when sampled into ASH, it is “fixed up” to be accurate in subsequent samples
3/ You cannot in general hope to add up TIME_WAITED over samples and get “total wait time” for session
4/ Enqueue waits like your experiment time-out every 3 seconds and re-wait
Here is output from similar experiment to yours executed on 10.2. Session was captured into ASH waiting for enqueue in 8 samples. Enqueue timed-out twice and these (completed) waits were “fixed up” to show 3 seconds of wait (2999200 and 2999926 microseconds) with a final wait of 1.6 seconds for total of approx 7.6 seconds. Sum of TIME_WAITED in this case is equal to cumulative total wait time for the lock.
SQL> /
SAMPLE_TIME EVENT TIME_WAITED
———————————– —————————— ———–
28-FEB-12 02.56.39.080 PM enq: TX – row lock contention 1652592
28-FEB-12 02.56.38.080 PM enq: TX – row lock contention 0
28-FEB-12 02.56.37.080 PM enq: TX – row lock contention 2999926
28-FEB-12 02.56.36.080 PM enq: TX – row lock contention 0
28-FEB-12 02.56.35.080 PM enq: TX – row lock contention 0
28-FEB-12 02.56.34.080 PM enq: TX – row lock contention 2999200
28-FEB-12 02.56.33.080 PM enq: TX – row lock contention 0
28-FEB-12 02.56.32.080 PM enq: TX – row lock contention 0