“ASH math” of time_waited explained with pictures and simulation

As explained by John Beresniewicz, Graham Wood and Uri Shaft in their excellent overview ASH architecture and advanced usage, avg( v$active_session_history.time_waited ) is not a correct estimate of the average latency (the “true average”) esperienced by a wait event, the reason being that short events are less likely to be sampled. In order to correct this, the authors propose a formula that gives an unbiased estimate of the “true average”.

In this post I will quantitatively illustrate why sampling is so bad for time_waited, how the formula corrects it, and the sampling “distortion” of “reality” in general, by using an “ASH simulator” I have built and analyzing its data using basic statistic tools (and graphs). I hope that this might help others (as it has definitely helped myself) to better understand the formula and especially the characteristics of this fantastic tool of Oracle named ASH – the one that I use the most nowadays when tuning and troubleshooting, mined by almost every script of my toolbox.

All the code and spools are available here.

the simulator

Obviously implemented in PL/SQL, as all great things in life, this pipelined function produces a stream of events:

SQL> select ... from table( sim_pkg.events( ... ) );

SESSION_STATE ELAPSED_USEC

------------- ----------------

WAITING 180,759.713

ON CPU 500,000.000

WAITING 164,796.844

ON CPU 500,000.000

WAITING 2,068,034.610

ON CPU 500,000.000

WAITING 2,720,383.092

ON CPU 500,000.000

It simulates a process that alternates between consuming CPU and waiting for an event; the desired probability density functions of the two stochastic processes “on cpu” and “waiting” can be passed as arguments.

This stream can then be sampled by pipelining the stream to this other function:

SQL> select ... from table( sim_pkg.samples (

cursor( select * from table( sim_pkg.events( ... ) ) )

);

SAMPLE_ID SAMPLE_TIME SESSION_STATE TIME_WAITED

---------- ----------- ------------- ----------------

0 0 WAITING 180,759.713

1 1000000 ON CPU .000

2 2000000 WAITING .000

3 3000000 WAITING 2,068,034.610

4 4000000 WAITING .000

5 5000000 WAITING .000

6 6000000 WAITING 2,720,383.092

7 7000000 ON CPU .000

Note that the function also reproduces the artificial zeros introduced by ASH’s fix-up mechanism (i.e., event with time_waited =2,720,383.092 spans three samples and hence has the two previous samples set to zero; same for its predecessor).

Notice also that one “short” event has been missed by the reaping sampling hand.

investigation by pictures

Let’s produce an event stream that follows an uniform distribution on the interval [0 .. 2 sec]; here is its histogram (produced by the wonderful geom_histogram() of R’s ggplot2 package):

histogram of event stream

So we can visually appreciate and confirm that all events are contained in the requested interval and respect the desired distribution; please also note that the average (the thin dashed vertical line) is almost perfectly equal to the expected value E[ events ] = 1.0.

Let’s sample the stream, ignore the artificial zeros, and superimpose the samples’ histogram to the previous one:

histogram of events and samples combined

Page 1 of 2 | Next page