# CBO: the formula for the “density” column statistic (densities part II)

In this post we are going to explore and explain the rationale for the formula used by dbms_stats to compute the “density” column statistic, used by the CBO in versions less than 10.2.0.4 to estimate the cardinality of a class of SQL statements. In the next post, we will speak about its replacement, named “NewDensity” in 10053 trace files.

We will consider only the non-trivial case of Height-Balanced histograms, since for Frequency Histograms density is a constant (0.5 / num_rows) and for columns without histogram, it is simply 1/num_distinct.

Let’s illustrate the test case on which we will base our discussion, contained in this zip file.

First, a table T is created with the following exponential value distribution:

SQL> select value, count(*)

2 from t

3 group by value

4 order by value;

VALUE COUNT(*)

---------- ----------

1 1

2 2

4 4

8 8

16 16

64 64

The test case then computes a SIZE 5 Height-Balanced histogram. The resulting histogram (from dba_histograms) is as follows (note that I have added the column POPULARITY that marks popular values with “1″; EP is shorthand for column ENDPOINT_NUMBER, VALUE for column ENDPOINT_VALUE):

SQL> select ep, value, popularity from formatted_hist;

EP VALUE POPULARITY

---------- ---------- ----------

0 1 0

1 16 0

5 64 1

The test case then issues this SQL statement that contains only an equality filter predicate on table T:

select ...

from t

where value = 2.4;

The literal value 2.4 is not contained in the table (and hence in the histogram), in order to make the CBO factor in “density” in its estimate of the expected cardinality – in fact, as it might be known, density is used when the literal is not popular (that is, not equal to 64 in our case), and it doesn’t matter whether the literal is not contained in the histogram, or contained as an unpopular value (1 and 16 in our case), or even contained in the table or not. All it takes is its being not popular.

Side note: I’m assuming the literal is inside the closed min-max interval (1-64 in this case); when outside, it depends on the version.

When the literal is not popular, the formula used for the expected cardinality calculation is equal to

E[card] = density * num_rows;

That is easy to verify from the test case logs; in 9i we can see that density = 0.115789474 and num_rows=95, hence 0.115789474 * 95 = 11.000000000 which is exactly equal to the CBO estimate for our statement.

The formula used by dbms_stats to compute “density” was published in Jonathan Lewis’ book Cost Based Oracle (page 172) and Wolfgang Breitling’s presentation Histograms – Myths and Facts. The key fact is that the formula takes as input the rows of what I’ve nicknamed the not-popular subtable (NPS), that is, the original table without the rows whose values are popular values (in this case, 64 is the only popular value). Letting num_rows_nps the number of rows of the NPS (for our example, num_rows_nps=1+2+4+8+16=31), we have:

density = (1 / num_rows) *

sum (count (value) ^ 2) / num_rows_nps

for "value" belonging to the NPS

The script performs this calculation automatically; it is anyway instructive to perform the calculation manually at least one time:

density = (1/95) * (1*1+2*2+4*4+8*8+16*16) / 31 = .115789474

that matches perfectly the density we observed in the script log before.

What is the statistical rationale for this seemingly strange computation?

If we plug it inside the formula for E[card], we can see that num_rows is cancelled:

E[card] = sum (count (value) ^ 2) / num_rows_nps

summed over all "values" belonging to the NPS

Now we must reference the statistical concepts introduced in this post, and consider the family of all statements of our kind that can reference the NPS:

select ...

from t

where x = :x;

:x being a value belonging to the NPS

its E[card] is

E[card] = sum ( w(:x) * E[count(:x)] )

for all values of :x (belonging to the NPS)

dbms_stats takes count(:x) as the best estimate for E[count(:x)] (for example, E[count(4)] = count(4) = 4 in our case). All we have to do in order to obtain the observed formula, is to assume w(:x) = count(:x) / num_rows_nps:

E[card] = sum ( (count(:x) / num_rows_nps) * count(:x) )

= sum ( count(:x) ^ 2 ) / num_rows_nps

for all values of :x (belonging to the NPS)

Page 1 of 2 | Next page