CBO: "NewDensity" replaces "density" in 11g, 10.2.0.4 (densities part III)

In this post we are going to explore and explain the rationale for the formula used by the CBO to compute the “NewDensity” figure that replaces, from 10.2.0.4 onwards, the “density” column statistic in the cardinality estimation formulae for columns with height-balanced (HB) histograms defined.

In a previous post, we already discussed the pre-10.2.0.4 scenario: we saw how and when the “density” column statistic is used in the cardinality formula for equality filter predicates, we explained its statistical rationale and defining formula, introduced the concept of the NPS (Not Popular Subtable), and built a test case. Now we are going to use the very same test case and explain the differences in the most recent versions (the previous post zip file contains logs for them also).

To summarize the test case: we have a table T with a single column VALUE, exponentially distributed, with a SIZE 5 height-balanced histogram collected on it. The histogram is:
[text]
SQL> select ep, value, popularity from formatted_hist;

        EP      VALUE POPULARITY
---------- ---------- ----------
         0          1          0
         1         16          0
         5         64          1
[/text]
Thus, we have a single popular value, 64; all the others are unpopular.
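
As a minimal sketch of how the POPULARITY column can be read off the histogram, the snippet below assumes the usual rule that a value is popular when its endpoint number is more than one bucket ahead of the previous endpoint. The function and variable names are illustrative only; they are not part of the test case scripts.

```python
# Endpoint rows of the height-balanced histogram shown above: (ep, value).
histogram = [(0, 1), (1, 16), (5, 64)]

def popular_values(hist):
    """Return {value: buckets_covered} for values spanning more than one bucket.

    Assumption: "popular" means the endpoint number jumps by more than 1
    with respect to the previous histogram row.
    """
    popular = {}
    for (prev_ep, _), (ep, value) in zip(hist, hist[1:]):
        buckets = ep - prev_ep
        if buckets > 1:
            popular[value] = buckets
    return popular

popular = popular_values(histogram)
bucket_count = histogram[-1][0]               # last endpoint number = number of buckets
popular_bucket_count = sum(popular.values())  # buckets covered by popular values
popular_value_count = len(popular)            # how many values are popular

print(popular, bucket_count, popular_bucket_count, popular_value_count)
# → {64: 4} 5 4 1
```

The three derived figures (5 buckets, 4 of them covered by the single popular value) are the histogram-derived inputs used in the rest of the post.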

In this “densities” series of posts, we focus on a SQL statement that contains only an equality filter predicate on table T:
[sql]
select …
from t
where value = 2.4;
[/sql]
The literal value is not a popular value (but lies inside the 1-64 interval) and hence, in pre-10.2.0.4, the formula used for the expected cardinality calculation is:
[text]
E[card] = density * num_rows;
[/text]
We discussed, in the previous post, how density is carefully calculated by dbms_stats to get back the expected cardinality of the family (class) of all possible equality filter predicate statements that hit the NPS, under the usual “non-empty result set assumption” and the further (strange and strong) assumption that the more a value is represented in the NPS, the higher the probability that the value is used as the literal of the equality predicate (an assumption that mathematically translates into the formula “w(:x) = count(:x) / num_rows_nps”).

Switching to 10.2.0.4 – the formula for E[card] is still the same, but with “density” replaced by “NewDensity” (as hinted by the fact that “density” is reported as “OldDensity” in the 10053 trace files, as we are going to see in a moment):
[text]
E[card] = NewDensity * num_rows;
[/text]

NewDensity is not stored anywhere in the data dictionary, but it is computed at query optimization time by the CBO (note that density is still computed by dbms_stats using the old formula, but then it is ignored by the CBO). The NewDensity formula is based mainly on some histogram-derived figures; using the same names found in 10053 traces:

[text]
NewDensity = [(BktCnt - PopBktCnt) / BktCnt] / (NDV - PopValCnt)
[/text]
Where BktCnt (“Bucket Count”) is the number of buckets (the “N” in the “SIZE N” clause);
PopBktCnt (“Popular Bucket Count”) is the number of buckets covered by the popular values;
PopValCnt (“Popular Value Count”) is the number of popular values; NDV (“Number of Distinct Values”) is the traditional name used by CBO developers for the num_distinct column statistic. With the exception of NDV, all these values are derived from the histogram.

Side note: if the numerator is equal to zero, NewDensity is set to 0.5 / num_rows, thus giving E[card] = 0.5, as far as I have seen (not exhaustively) in a few test cases; it looks like a lower-bound “sanity check”. The denominator cannot be zero for HB histograms.

To illustrate the formula: the histogram of our test case has 5 buckets, hence BktCnt=5; 64 is the only popular value, hence PopValCnt=1; this popular value covers 4 buckets (since its EP is 5 and the previous EP is 1), hence PopBktCnt=4; we know that the column has num_distinct=6, hence NDV=6. This is in fact what we see in the 10053 trace file (in 11.1.0.7 and 11.2.0.1):

[text]
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for T[T]
Column (#1):
NewDensity:0.040000, OldDensity:0.115789 BktCnt:5, PopBktCnt:4, PopValCnt:1, NDV:6
Using density: 0.040000 of col #1 as selectivity of unpopular value pred
Table: T Alias: T
Card: Original: 95.000000 Rounded: 4 Computed: 3.80 Non Adjusted: 3.80
[/text]
So NewDensity = [(5-4)/5] / (6-1) = 1/25 = 0.04 and E[card]=0.04*95=3.8, which is exactly what we see in the above trace fragment.
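
The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula as described in this post, not the CBO's actual code; the function name is made up, and the zero-numerator branch encodes the lower bound that I have observed in a few test cases only.

```python
def new_density(bkt_cnt, pop_bkt_cnt, pop_val_cnt, ndv, num_rows):
    """NewDensity for a height-balanced histogram, as described in the post."""
    unpopular_buckets = bkt_cnt - pop_bkt_cnt
    if unpopular_buckets == 0:
        # Assumption: lower-bound "sanity check" seen in a few test cases,
        # not verified exhaustively.
        return 0.5 / num_rows
    return (unpopular_buckets / bkt_cnt) / (ndv - pop_val_cnt)

# Figures taken from the 10053 trace fragment above.
num_rows = 95
density = new_density(bkt_cnt=5, pop_bkt_cnt=4, pop_val_cnt=1, ndv=6, num_rows=num_rows)
card = density * num_rows

print(f"NewDensity={density:.6f} Computed={card:.2f}")
# → NewDensity=0.040000 Computed=3.80
```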

The formula is statistically based on replacing the previous versions’ assumption (that we labeled “strange and strong”) about w(:x) with the standard assumption that the client will ask for the values in the NPS with the same probability; mathematically, that means replacing the formula “w(:x) = count(:x) / num_rows_nps” with the standard “w(:x) = 1 / num_distinct_nps” (where num_distinct_nps is of course the number of distinct values of the NPS). If you plug this shape of w(:x) into the formula for E[card], you get
[text]
E[card] = sum ( w(:x) * E[count(:x)] ) =
= sum (E[count(:x)] ) / num_distinct_nps
for all values of :x (belonging to the NPS)
[/text]
that is
[text]
E[card] = num_rows_nps / num_distinct_nps
[/text]
which is, not surprisingly, the standard formula used for columns without histograms, but applied to the NPS, not the whole table.
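
To make the difference between the two weightings concrete, here is a small sketch that evaluates both on a data set. The row counts are an assumption on my part: each value v occurring v times, with values 1, 2, 4, 8, 16 and 64. That distribution is consistent with the figures quoted in this post (95 rows, NDV 6, 64 popular), but it is not spelled out here, so treat it as illustrative.

```python
# ASSUMPTION: value v occurs v times; consistent with num_rows=95 and NDV=6
# quoted in the post, but not stated there explicitly.
counts = {1: 1, 2: 2, 4: 4, 8: 8, 16: 16, 64: 64}
popular = {64}

# The NPS (Not Popular Subtable) is the table minus the rows of popular values.
nps = {v: c for v, c in counts.items() if v not in popular}
num_rows = sum(counts.values())
num_rows_nps = sum(nps.values())
num_distinct_nps = len(nps)

# Old assumption: w(:x) = count(:x) / num_rows_nps
card_old = sum((c / num_rows_nps) * c for c in nps.values())

# New assumption: w(:x) = 1 / num_distinct_nps
card_new = sum((1 / num_distinct_nps) * c for c in nps.values())

print(f"old weighting: E[card]={card_old:.2f} density={card_old / num_rows:.6f}")
print(f"new weighting: E[card]={card_new:.2f}")
# → old weighting: E[card]=11.00 density=0.115789
# → new weighting: E[card]=6.20
```

Under this assumed distribution the old weighting gives back the OldDensity figure reported in the trace (0.115789), while the uniform weighting gives num_rows_nps / num_distinct_nps = 31 / 5.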

One possibility for producing the above E[card] value at run-time could have been to change dbms_stats to compute a value for “density” equal to (num_rows_nps / num_distinct_nps) / num_rows; but forcing users to recompute statistics for all their tables in their upgraded databases is not really a viable option. Hence, the CBO designers chose to simply ignore “density” and calculate the above formula at run-time, mining the histogram, at the cost of reduced precision. In fact, the easy part is num_distinct_nps, which is obviously exactly equal to num_distinct minus the number of popular values; but num_rows_nps can only be calculated approximately, since the histogram is a (deterministic) sample of the column values obtained by first sorting the column values and then sampling on a uniform grid (for more information and illustration, see the first part of this article of mine). Using the histogram, the best approximation for num_rows_nps is num_rows times the fraction of buckets not covered by popular values. Hence, using the 10053 terminology
[text]
num_distinct_nps = NDV - PopValCnt (exactly)

num_rows_nps = [(BktCnt - PopBktCnt) / BktCnt] * num_rows (approximately)
[/text]
which gets back (again, approximately) the E[card] formula above, as can be trivially checked.
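
The “trivial check” can be spelled out with the figures of our test case. A minimal sketch, using only the numbers that appear in the 10053 trace fragment above (the variable names are illustrative):

```python
# Figures from the 10053 trace fragment of the test case.
num_rows = 95
bkt_cnt, pop_bkt_cnt, pop_val_cnt, ndv = 5, 4, 1, 6

# Exact: distinct values in the NPS.
num_distinct_nps = ndv - pop_val_cnt

# Approximate: rows in the NPS, estimated from the fraction of unpopular buckets.
num_rows_nps = (bkt_cnt - pop_bkt_cnt) / bkt_cnt * num_rows

# E[card] computed from the NPS figures ...
card_from_nps = num_rows_nps / num_distinct_nps

# ... and computed as NewDensity * num_rows.
new_density = ((bkt_cnt - pop_bkt_cnt) / bkt_cnt) / (ndv - pop_val_cnt)
card_from_density = new_density * num_rows

print(f"num_rows_nps~{num_rows_nps:.1f} num_distinct_nps={num_distinct_nps}")
print(f"E[card]={card_from_nps:.2f} NewDensity*num_rows={card_from_density:.2f}")
# → num_rows_nps~19.0 num_distinct_nps=5
# → E[card]=3.80 NewDensity*num_rows=3.80
```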

It might be desirable that one day, NewDensity gets calculated exactly by dbms_stats and stored in the data dictionary, at least for columns with new statistics, albeit the precision reduction is probably more than acceptable (that is, I have never seen a case where that has been an issue). The test case script, just for the sake of completeness, calculates the exact figure as well; it gets back an E[card] of 6.2 instead of 3.8.

For a summary of the above discussion and some more discussion, see this investigation of mine. By the way, NewDensity also replaces “density” in join cardinality formulae, although I have not run a complete investigation; that is not surprising at all.

As a final note: NewDensity is also used for Frequency Histograms, and in a very creative way; we will discuss this in part IV of this series.

Other posts belonging to this series:
densities part I
densities part II
densities part IV

Ung says:

December 16, 2009 at 11:30

Thanks for this excellent article.

I’ve just tested this in 10.2.0.4 but couldn’t find NewDensity; could it be that this only happens on Windows?

Rgds
Ung

Alberto Dell'Era says:

December 16, 2009 at 13:05

@Ung

it shouldn’t be an operating-system dependent feature in theory … could you run one of the scripts of my test case that generates the 10053 trace on your system and send the trace/log to alberto.dellera@gmail.com ? Thanks

TomPier says:

May 4, 2010 at 10:07

great post as usual!

Elena says:

August 8, 2014 at 08:32

Hello Alberto,

we are running Oracle RDBMS 11.2.0.4 on Solaris 64bit.
We have a very simple slow query in production, Oracle does not take the existing index because of wrong histograms. It looks like bug 18377553 or bug 10174050, because we have 500 million values in a column, frequency histograms, but just one bucket.
Anyway, investigating the issue I had to look at the 10053 trace and also found your article about NewDensity calculation.
It is an old post, but maybe you are still interested in this.
To me it looks like NewDensity is just set to 0.5 and not to 0.5 / num_rows. Maybe it is a change in 11.2.0.4, or part of the bug, or… I am wrong ;-).
From the 10053 trace:
Single Table Cardinality Estimation for TBL_…
Column (#2):
NewDensity:0.500000, OldDensity:0.000000 BktCnt:517289083, PopBktCnt:517289083, PopValCnt:1, NDV:507249889
Column (#2): F…(
AvgLen: 96 NDV: 507249889 Nulls: 0 Density: 0.500000
Histogram: Freq #Bkts: 1 UncompBkts: 517289083 EndPtVals: 1
Using density: 0.500000 of col #2 as selectivity of unpopular value pred

(517289083 is number of rows in the table)

Best regards
Elena

Alberto Dell'Era says:

August 24, 2014 at 11:51

Elena,

how could it be a Frequency Histogram, since NDV is definitely much greater than the max value you can specify for SIZE (i.e. 254) ?
It looks like the CBO is wrongly classifying the histogram.

Check the “Card:” line just below in the 10053 trace, and check whether it is equal to either 0.5 or 0.5 * num_rows.

Anyway, as stated in the post, I observed that the rule “NewDensity = 0.5 / num_rows” is generally observed, but I cannot be sure that is observed “always” 😉 – albeit this rule seems very sound from a statistical perspective.
