CBO: “NewDensity” replaces “density” in 11g, (densities part III)

The formula is statistically based on replacing the previous versions’ assumption (that we labeled “strange and strong”) about w(:x) with the standard assumption that the client will ask for the values in the NPS with the same probability; mathematically, that means replacing the formula “w(:x) = count(:x) / num_rows_nps” with the standard “w(:x) = 1 / num_distinct_nps” (where num_distinct_nps is of course the number of distinct values of the NPS). If you plug this shape of w(:x) into the formula for E[card], you get

E[card] = sum ( w(:x) * E[count(:x)] ) =

= sum (E[count(:x)] ) / num_distinct_nps

for all values of :x (belonging to the NPS)

that is

E[card] = num_rows_nps / num_distinct_nps

which is, not surprising, the standard formula used for columns without histograms, but applied to the NPS, not the whole table.

One possibility for producing the above E[card] value at run-time could have been to change dbms_stats to compute a value for “density” equal to (num_rows_nps / num_distinct_nps) / num_rows; but forcing users to recompute statistics for all their tables in their upgraded databases is not really a viable option. Hence, the CBO designers chose to simply ignore “density” and calculate the above formula at run-time, mining the histogram, at the cost of reduced precision. In fact, the easy part is num_distinct_nps, which is obviously exactly equal to num_distinct minus the number of popular values; but num_rows_nps can only calculated approximately, since the histogram is a (deterministic) sample of the column values obtained by first sorting the column values and then sampling on a uniform grid (for more information and illustration, see the first part of this article of mine). Using the histogram, the best approximation for num_rows_nps is num_rows times the fraction of buckets not covered by popular values. Hence, using the 10053 terminology

num_distinct_nps = NDV - PopValCnt (exactly)

num_rows_nps = [(BktCnt - PopBktCnt) / BktCnt] * num_rows (approximately)

which gets back (again, approximately) the E[card] formula above, as can be trivially checked.

It might be desirable that one day, NewDensity gets calculated exactly by dbms_stats and stored in the data dictionary, at least for columns with new statistics, albeit the precision reduction is probably more than acceptable (that is, I have never seen a case where that has been an issue). The test case script, just for the sake of completeness, calculates the exact figure as well; it gets back an E[card] of 6.2 instead of 3.8.

For a summary of the above discussion and some more discussion, check back this investigation of mine. By the way, NewDensity replaces “density” also in join cardinality formulae, even if I have not run a complete investigation – but that is not surprising at all.

As a final nore – NewDensity is used also for Frequency Histograms, and in a very creative way; we will discuss this in part IV of this series.

Other posts belonging to this series:

densities part I

densities part II

densities part IV

Page 2 of 2 | Previous page

5 comments on this post.
  1. Blogroll Report 09/10/2009-16/10/2009 « Coskan’s Approach to Oracle:

    [...] Alberto Dell’Era-CBO: “NewDensity” replaces “density” in 11g, (densities part III) [...]

  2. Alberto Dell’Era’s Oracle blog » CBO: the formula for the “density” column statistic (densities part II):

    [...] posts belonging to this series: densities part I densities part III densities part [...]

  3. Ung:

    Thanks for these excellent article.

    I’ve just tested out in but couldn’t find the NewDensity, just wondering if this is only happened to windows?


  4. Alberto Dell'Era:


    it shouldn’t be an operating-system dependent feature in theory … may you run one of the script of my test case that generates the 10053 trace on your system and send the trace/log to alberto.dellera@gmail.com ? Thanks

  5. TomPier:

    great post as usual!

Leave a comment