CBO: “NewDensity” replaces “density” in 11g, 10.2.0.4 (densities part III)

The formula is statistically based on replacing the previous versions’ assumption (that we labeled “strange and strong”) about w(:x) with the standard assumption that the client will ask for the values in the NPS with the same probability; mathematically, that means replacing the formula “w(:x) = count(:x) / num_rows_nps” with the standard “w(:x) = 1 / num_distinct_nps” (where num_distinct_nps is of course the number of distinct values of the NPS). If you plug this shape of w(:x) into the formula for E[card], you get

E[card] = sum ( w(:x) * E[count(:x)] ) =

= sum (E[count(:x)] ) / num_distinct_nps

for all values of :x (belonging to the NPS)

that is

E[card] = num_rows_nps / num_distinct_nps

which is, not surprising, the standard formula used for columns without histograms, but applied to the NPS, not the whole table.

One possibility for producing the above E[card] value at run-time could have been to change dbms_stats to compute a value for “density” equal to (num_rows_nps / num_distinct_nps) / num_rows; but forcing users to recompute statistics for all their tables in their upgraded databases is not really a viable option. Hence, the CBO designers chose to simply ignore “density” and calculate the above formula at run-time, mining the histogram, at the cost of reduced precision. In fact, the easy part is num_distinct_nps, which is obviously exactly equal to num_distinct minus the number of popular values; but num_rows_nps can only calculated approximately, since the histogram is a (deterministic) sample of the column values obtained by first sorting the column values and then sampling on a uniform grid (for more information and illustration, see the first part of this article of mine). Using the histogram, the best approximation for num_rows_nps is num_rows times the fraction of buckets not covered by popular values. Hence, using the 10053 terminology

num_distinct_nps = NDV - PopValCnt (exactly)

num_rows_nps = [(BktCnt - PopBktCnt) / BktCnt] * num_rows (approximately)

which gets back (again, approximately) the E[card] formula above, as can be trivially checked.

It might be desirable that one day, NewDensity gets calculated exactly by dbms_stats and stored in the data dictionary, at least for columns with new statistics, albeit the precision reduction is probably more than acceptable (that is, I have never seen a case where that has been an issue). The test case script, just for the sake of completeness, calculates the exact figure as well; it gets back an E[card] of 6.2 instead of 3.8.

For a summary of the above discussion and some more discussion, check back this investigation of mine. By the way, NewDensity replaces “density” also in join cardinality formulae, even if I have not run a complete investigation – but that is not surprising at all.

As a final nore – NewDensity is used also for Frequency Histograms, and in a very creative way; we will discuss this in part IV of this series.

Other posts belonging to this series:

densities part I

densities part II

densities part IV

Page 2 of 2 | Previous page

7 comments on this post.
  1. Blogroll Report 09/10/2009-16/10/2009 « Coskan’s Approach to Oracle:

    [...] Alberto Dell’Era-CBO: “NewDensity” replaces “density” in 11g, 10.2.0.4 (densities part III) [...]

  2. Alberto Dell’Era’s Oracle blog » CBO: the formula for the “density” column statistic (densities part II):

    [...] posts belonging to this series: densities part I densities part III densities part [...]

  3. Ung:

    Thanks for these excellent article.

    I’ve just tested out in 10.2.0.4 but couldn’t find the NewDensity, just wondering if this is only happened to windows?

    Rgds
    Ung

  4. Alberto Dell'Era:

    @Ung

    it shouldn’t be an operating-system dependent feature in theory … may you run one of the script of my test case that generates the 10053 trace on your system and send the trace/log to alberto.dellera@gmail.com ? Thanks

  5. TomPier:

    great post as usual!

  6. Elena:

    Hello Alberto,

    we are running Oracle RDBMS 11.2.0.4 on Solaris 64bit.
    We have a very simple slow query in production, Oracle does not take the existing index because of wrong histograms. It looks like bug 18377553 or bug 10174050, because we have 500 million values in a column, frequency histograms, but just one bucket.
    Anyway, investigating the issue I had to look at the 10053 trace and also found your article about NewDensity calculation.
    It is an old post, but may be, you are still interested on this.
    For me it looks like NewDensity is just set to 0.5 and not to NewDensity 0.5 / num_rows. My be it is a change in 11.2.0.4 or a part of the bug or… I am wrong ;-).
    From the 10053 trace:
    Single Table Cardinality Estimation for TBL_…
    Column (#2):
    NewDensity:0.500000, OldDensity:0.000000 BktCnt:517289083, PopBktCnt:517289083, PopValCnt:1, NDV:507249889
    Column (#2): F…(
    AvgLen: 96 NDV: 507249889 Nulls: 0 Density: 0.500000
    Histogram: Freq #Bkts: 1 UncompBkts: 517289083 EndPtVals: 1
    Using density: 0.500000 of col #2 as selectivity of unpopular value pred

    (517289083 is number of rows in the table)

    Best regards
    Elena

  7. Alberto Dell'Era:

    Elena,

    how could it be a Frequency Histogram, since NDV is definitely much greater than the max value you can specify for SIZE (i.e. 254) ?
    It looks like the CBO is wrongly classifying the histogram.

    Check the “Card:” line just below in the 10053 trace, and check whether it is equal to either 0.5 or 0.5 * num_rows.

    Anyway, as stated in the post, I observed that the rule “NewDensity = 0.5 / num_rows” is generally observed, but I cannot be sure that is observed “always” ;) – albeit this rule seems very sound from a statistical perspective.

Leave a comment