CBO: the formula for the "density" column statistic (densities part II)

In this post we are going to explore and explain the rationale for the formula used by dbms_stats to compute the “density” column statistic, used by the CBO in versions less than 10.2.0.4 to estimate the cardinality of a class of SQL statements. In the next post, we will speak about its replacement, named “NewDensity” in 10053 trace files.

We will consider only the non-trivial case of Height-Balanced histograms, since for Frequency Histograms density is a constant (0.5 / num_rows) and for columns without histogram, it is simply 1/num_distinct.

Let’s illustrate the test case on which we will base our discussion, contained in this zip file.

First, a table T is created with the following exponential value distribution:
[text]
SQL> select value, count(*)
2 from t
3 group by value
4 order by value;

VALUE COUNT(*)
———- ———-
1 1
2 2
4 4
8 8
16 16
64 64
[/text]
The test case then computes a SIZE 5 Height-Balanced histogram. The resulting histogram (from dba_histograms) is as follows (note that I have added the column POPULARITY that marks popular values with “1”; EP is shorthand for column ENDPOINT_NUMBER, VALUE for column ENDPOINT_VALUE):
[text]
SQL> select ep, value, popularity from formatted_hist;

EP VALUE POPULARITY
———- ———- ———-
0 1 0
1 16 0
5 64 1
[/text]
The test case then issues this SQL statement that contains only an equality filter predicate on table T:
[sql]
select …
from t
where value = 2.4;
[/sql]
The literal value 2.4 is not contained in the table (and hence in the histogram), in order to make the CBO factor in “density” in its estimate of the expected cardinality – in fact, as it might be known, density is used when the literal is not popular (that is, not equal to 64 in our case), and it doesn’t matter whether the literal is not contained in the histogram, or contained as an unpopular value (1 and 16 in our case), or even contained in the table or not. All it takes is its being not popular.
Side note: I’m assuming the literal is inside the closed min-max interval (1-64 in this case); when outside, it depends on the version.

When the literal is not popular, the formula used for the expected cardinality calculation is equal to
[text]
E[card] = density * num_rows;
[/text]
That is easy to verify from the test case logs; in 9i we can see that density = 0.115789474 and num_rows=95, hence 0.115789474 * 95 = 11.000000000 which is exactly equal to the CBO estimate for our statement.

The formula used by dbms_stats to compute “density” was published in Jonathan Lewis’ book Cost Based Oracle (page 172) and Wolfgang Breitling’s presentation Histograms – Myths and Facts. The key fact is that the formula takes as input the rows of what I’ve nicknamed the not-popular subtable (NPS), that is, the original table without the rows whose values are popular values (in this case, 64 is the only popular value). Letting num_rows_nps the number of rows of the NPS (for our example, num_rows_nps=1+2+4+8+16=31), we have:
[text]
density = (1 / num_rows) *
sum (count (value) ^ 2) / num_rows_nps
for “value” belonging to the NPS
[/text]
The script performs this calculation automatically; it is anyway instructive to perform the calculation manually at least one time:
density = (1/95) * (1*1+2*2+4*4+8*8+16*16) / 31 = .115789474
that matches perfectly the density we observed in the script log before.

What is the statistical rationale for this seemingly strange computation?

If we plug it inside the formula for E[card], we can see that num_rows is cancelled:
[text]
E[card] = sum (count (value) ^ 2) / num_rows_nps
summed over all “values” belonging to the NPS
[/text]

Now we must reference the statistical concepts introduced in this post, and consider the family of all statements of our kind that can reference the NPS:
[sql]
select …
from t
where x = :x;
😡 being a value belonging to the NPS
[/sql]
its E[card] is
[text]
E[card] = sum ( w(:x) * E[count(:x)] )
for all values of 😡 (belonging to the NPS)
[/text]
dbms_stats takes count(:x) as the best estimate for E[count(:x)] (for example, E[count(4)] = count(4) = 4 in our case). All we have to do in order to obtain the observed formula, is to assume w(:x) = count(:x) / num_rows_nps:
[text]
E[card] = sum ( (count(:x) / num_rows_nps) * count(:x) )
= sum ( count(:x) ^ 2 ) / num_rows_nps
for all values of 😡 (belonging to the NPS)
[/text]

The meaning of the above particular shape of w(:x) is that the probability that the client submits a certain value for 😡 is proportional to the number of rows (in the NPS) that has that value; more precisely, that if X% of rows has a certain common value, X% of user-submitted statements that “hit” the NPS will ask for that value. Under this assumption, dbms_stats precomputes “density” to give back the above E[card] when the literal is known to be not popular, hence hitting the NPS – remember that the CBO operates under the “non-empty result set assumption“, hence if the literal does not hit a popular value, it must hit a value of the NPS.

The above assumption for w(:x) is quite a strange assumption – and in fact, we will see in the next post that in 11g (and 10.2.0.4), this assumption has been dropped and replaced with a more standard one. The “density” column statistics is in fact ignored in 10.2.0.4+ and a value computed at run-time, named “newDensity” in 10053 trace files, is used instead.

Other posts belonging to this series:
densities part I
densities part III
densities part IV

Rudy says:

October 13, 2009 at 10:01

I checked Lewis’ book, but he doesn’t explain how he worked out the density formula – he will publish it in the upcoming book, so we have to wait (how long?!?).

Your explanation about w(:x) is very interesting, although the post is hard for me to read with your notation. I must say it’s convincing, why is it strange for you?

No doubt it’s an easy task for dbms_stats to compute the estimated value for count(:x), or E[count(:x)]… it’s count(:x) indeed, no need to estimate :-)))

Alberto Dell'Era says:

October 13, 2009 at 10:42

@rudy

In Lewis’ book the formula was introduced using words instead of math symbols but still understandable; Wolfgang uses a more formal mathematical definition.

What I think is “strange” is the assumption about the shape of w(:x); why a customer that has ordered 100 books should ask about the order book list 100 times as frequently as another customer that has ordered only 1 book ? 11g (and 10.2.0.4) corrects this with “NewDensity”, as we will see in the next post.

About E[count(:x)] = “observed count”: that is not true in general, since one might use a statistical model that infer E[count(:x)] from the observed count, for example taking into account the way the table is modified over time (e.g: that an insert_date column is constantly increasing its max value). That would be, of course, very complicated in practice, but statistically sound and possible.

Alberto Dell’Era’s Oracle blog » CBO: “NewDensity” replaces “density” in 11g, 10.2.0.4 (densities part III) says:

October 16, 2009 at 22:33

[…] a previous post, we already discussed the pre-10.2.0.4 scenario: we saw how and when the “density” column statistic […]

Blogroll Report 09/10/2009-16/10/2009 « Coskan’s Approach to Oracle says:

October 22, 2009 at 01:29

[…] Alberto Dell’Era-CBO: the formula for the “density” column statistic (densities part II) […]

Alberto Dell’Era’s Oracle blog » CBO: about the statistical definition of “cardinality” (densities part I) says:

October 25, 2009 at 19:44

[…] posts belonging to this series: densities part II densities part III densities part […]

density estimate says:

April 2, 2010 at 21:07

[…] bone reduces resulting in osteoporosis with porous bone fragility and high risk of bone fracture,Alberto Dell’Era’s Oracle blog CBO: the formula for the …In this post we are going to explore and explain the rationale for the formula used by dbms_stats to […]

Leave a Reply Cancel reply