As we have seen in the previous posts of this series, in 11g a new figure named "NewDensity" has been introduced as a replacement for the "density" column statistic for columns whose histogram has been collected; this change has been backported in 10.2.0.4 also.
In the previous post we discussed how NewDensity influences the CBO cardinality estimate for Height-Balanced histograms; in this one we are going to investigate the same for Frequency Histograms. We will see that the most important change is the introduction of the “half the least popular" rule (see the "Histogram change" post by Jonathan Lewis, which distills the findings of Randolf Geist and Riyaj Shamsudeen) - a surprising rule that might easily cause trouble (in fact as Jonathan reports in the comments - bug 6082745 was opened against this rule).
The test case (script density_post_freq.sql) considers the same test statement we focused on in the previous post (a single equality filter predicate which asks for a value inside the min-max range of the column):
select ... from t where value = 64.5;
Of course we compute a Frequency instead of an Height-Balanced histogram, and use a slightly different value distribution in order to highlight the new rule:
SQL> select value, count(*) 2 from t 3 group by value 4 order by value; VALUE COUNT(*) ---------- ---------- 8 8 16 16 64 64 128 128
The histogram generated by the test case is (from DBA_HISTOGRAMS):
VALUE EP BKT ---------- ---------- ---------- 8 8 8 16 24 16 64 88 64 128 216 128
VALUE is an abbreviation for ENDPOINT_VALUE, EP for ENDPOINT_NUMBER.
BKT is the number of buckets covered by the value (i.e.: EP minus the previous EP), that is, the number of rows whose column value was equal to VALUE at statistics collection time.
When the filter predicate selects a value contained in the histogram, the new releases behave the same as the old ones (but check the "bottom note about singleton values" at the bottom for a a minor but interesting detail): neither density nor NewDensity is used, and the cardinality estimate is the usual intuitive one. In the complementary case of a value not contained in the histogram (but still inside the min-max interval), the cardinality used to be calculated as density*num_rows and it is now NewDensity*num_rows. Note the simmetry with the Height-Balanced case: the formula is the same, with NewDensity simply replacing density.
NewDensity with the “half the least popular" rule active
By default the rule is active, and in this case, NewDensity is set to
NewDensity = 0.5 * bkt(least_popular_value) / num_rows
and hence, for non-existent values:
E[card] = (0.5 * bkt(least_popular_value) / num_rows) * num_rows = 0.5 * bkt(least_popular_value)
For our test case, the least_popular_value is 8 and bkt(8) = 8, hence E[card] = 0.5 * 8 = 4 thanks to NewDensity being equal to 0.5 * 8 / 216 = 0.018518519. In fact, we can verify in the 10053 traces (in 10.2.0.4, 22.214.171.124, 126.96.36.199) for our statement, that asks for a not-existent value (64.5), that E[card] and NewDensity are set as above:
NewDensity:0.018519, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Using density: 0.018519 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 216.000000 Rounded: 4 Computed: 4.00 Non Adjusted: 4.00
As another check, let's see what happens if bkt(least_popular_value) = 1, that is, if there is (at least) one value that occurred exactly one time (a singleton value) at statistics collection time. Adding such a row to our test case is trivial (just uncomment the first insert row in the script); in this scenario, our formula above predicts E[card] = 0.5 with NewDensity = 0.5 / 217 = .002304147, and in fact (check the *_least_is_one.trc traces):
NewDensity:0.002304, OldDensity:0.002304 BktCnt:217, PopBktCnt:216, PopValCnt:4, NDV:5 Using density: 0.002304 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 217.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
note that E[card] gets rounded up from 0.5 to 1 (as usual).
What is the rationale behind this rule? Thanks to Randolf Geist (see the comment in Jonathan's blog entry above), we know that it was introduced as a patch to solve one particular scenario (see bug 5483301) and then included in the main release, for some reason. Luckily, the rule can be disabled and the old sane behaviour can be restored.
NewDensity with the “half the least popular" rule disabled
To disable the new rule, just switch off the patch 5483301:
alter session set "_fix_control"='5483301:off';
(or alter system if you want to make it permanent)
with this setting, NewDensity becomes simply
NewDensity = 0.5 / num_rows
and hence, again for non-existent values:
E[card] = 0.5
which is exactly what we got in pre-10.2.0.4, where density was used (and density was, and is still, set to 0.5 / num_rows by dbms_stats). So the cardinality estimate is 0.5 (rounded up to 1).
For our test case, we predict NewDensity = 0.5 / 216 = 0.002314815. In fact our 10053 traces tell us:
NewDensity:0.002315, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Table: T Alias: NOT_EXISTENT_OFF Card: Original: 216.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
The rationale for this behaviour is sound; the CBO knows that no row with the requested value existed at statistics collection time, hence it returns the minimal cardinality estimate compatible with the non empty result set assumption (check this post for the importance of this assumption). If the statistics are reasonably fresh, this is the only sane estimate that can be made.
Playing with density - a warning
If you set your own column stats using dbms_stats.set_column_stats, the behaviour is different; I haven't made any extensive investigations but as far as I can tell, the value you provide for density is used instead of NewDensity. User-provided column statistics are flagged with dba_tab_cols.user_stats = 'YES'. You can disguise your user statistics as non-user by setting the flags parameter of dbms_stats.set_column_stats to 2 - but since the latter parameter is labeled as "for Oracle internal use only", I would do it only for investigations purposes - that is, never in production.
Bottom note about singleton values: actually in pre-10.2.0.4 versions, if the value was present in the Frequency histogram but covering a single bucket (hence it was present in the table exactly one time at statistic collection time), it used to be classified as "unpopular" and hence used to get the same treatment as a value not in the histogram - the end result being that the cardinality was estimated as 0.5 rounded up to 1; now it is 1 before rounding as one would intuitively expects. I hope to be able to investigate whether this change fixes the issues about join cardinality estimation I investigated - see "the mystery of halving" in this investigation of mine if interested.