CBO: NewDensity for Frequency Histograms,11g-10.2.0.4 (densities part IV)

As we have seen in the previous posts of this series, in 11g a new figure named “NewDensity” has been introduced as a replacement for the “density” column statistic for columns whose histogram has been collected; this change has been backported in 10.2.0.4 also.

In the previous post we discussed how NewDensity influences the CBO cardinality estimate for Height-Balanced histograms; in this one we are going to investigate the same for Frequency Histograms. We will see that the most important change is the introduction of the “half the least popular” rule (see the “Histogram change” post by Jonathan Lewis, which distills the findings of Randolf Geist and Riyaj Shamsudeen) – a surprising rule that might easily cause trouble (in fact as Jonathan reports in the comments – bug 6082745 was opened against this rule).

The test case (script density_post_freq.sql) considers the same test statement we focused on in the previous post (a single equality filter predicate which asks for a value inside the min-max range of the column):
[sql]
select …
from t
where value = 64.5;
[/sql]

Of course we compute a Frequency instead of an Height-Balanced histogram, and use a slightly different value distribution in order to highlight the new rule:
[sql]
SQL> select value, count(*)
2 from t
3 group by value
4 order by value;

VALUE COUNT(*)
———- ———-
8 8
16 16
64 64
128 128
[/sql]

The histogram generated by the test case is (from DBA_HISTOGRAMS):
[text]
VALUE EP BKT
———- ———- ———-
8 8 8
16 24 16
64 88 64
128 216 128
[/text]
VALUE is an abbreviation for ENDPOINT_VALUE, EP for ENDPOINT_NUMBER.
BKT is the number of buckets covered by the value (i.e.: EP minus the previous EP), that is, the number of rows whose column value was equal to VALUE at statistics collection time.

When the filter predicate selects a value contained in the histogram, the new releases behave the same as the old ones (but check the “bottom note about singleton values” at the bottom for a a minor but interesting detail): neither density nor NewDensity is used, and the cardinality estimate is the usual intuitive one. In the complementary case of a value not contained in the histogram (but still inside the min-max interval), the cardinality used to be calculated as density*num_rows and it is now NewDensity*num_rows. Note the simmetry with the Height-Balanced case: the formula is the same, with NewDensity simply replacing density.

NewDensity with the “half the least popular” rule active

By default the rule is active, and in this case, NewDensity is set to
[text]
NewDensity = 0.5 * bkt(least_popular_value) / num_rows
[/text]
and hence, for non-existent values:
[text]
E[card] = (0.5 * bkt(least_popular_value) / num_rows) * num_rows
= 0.5 * bkt(least_popular_value)
[/text]
For our test case, the least_popular_value is 8 and bkt(8) = 8, hence E[card] = 0.5 * 8 = 4 thanks to NewDensity being equal to 0.5 * 8 / 216 = 0.018518519. In fact, we can verify in the 10053 traces (in 10.2.0.4, 11.1.0.7, 11.2.0.1) for our statement, that asks for a not-existent value (64.5), that E[card] and NewDensity are set as above:
[text]
NewDensity:0.018519, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4
Using density: 0.018519 of col #1 as selectivity of unpopular value pred
Table: T Alias: NOT_EXISTENT
Card: Original: 216.000000 Rounded: 4 Computed: 4.00 Non Adjusted: 4.00
[/text]

As another check, let’s see what happens if bkt(least_popular_value) = 1, that is, if there is (at least) one value that occurred exactly one time (a singleton value) at statistics collection time. Adding such a row to our test case is trivial (just uncomment the first insert row in the script); in this scenario, our formula above predicts E[card] = 0.5 with NewDensity = 0.5 / 217 = .002304147, and in fact (check the *_least_is_one.trc traces):

[text]
NewDensity:0.002304, OldDensity:0.002304 BktCnt:217, PopBktCnt:216, PopValCnt:4, NDV:5
Using density: 0.002304 of col #1 as selectivity of unpopular value pred
Table: T Alias: NOT_EXISTENT
Card: Original: 217.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
[/text]
note that E[card] gets rounded up from 0.5 to 1 (as usual).

What is the rationale behind this rule? Thanks to Randolf Geist (see the comment in Jonathan’s blog entry above), we know that it was introduced as a patch to solve one particular scenario (see bug 5483301) and then included in the main release, for some reason. Luckily, the rule can be disabled and the old sane behaviour can be restored.

NewDensity with the “half the least popular” rule disabled

To disable the new rule, just switch off the patch 5483301:
alter session set “_fix_control”=’5483301:off’;
(or alter system if you want to make it permanent)

with this setting, NewDensity becomes simply
[text]
NewDensity = 0.5 / num_rows
[/text]
and hence, again for non-existent values:
[text]
E[card] = 0.5
[/text]
which is exactly what we got in pre-10.2.0.4, where density was used (and density was, and is still, set to 0.5 / num_rows by dbms_stats). So the cardinality estimate is 0.5 (rounded up to 1).

For our test case, we predict NewDensity = 0.5 / 216 = 0.002314815. In fact our 10053 traces tell us:
[text]
NewDensity:0.002315, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4
Table: T Alias: NOT_EXISTENT_OFF
Card: Original: 216.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
[/text]

The rationale for this behaviour is sound; the CBO knows that no row with the requested value existed at statistics collection time, hence it returns the minimal cardinality estimate compatible with the non empty result set assumption (check this post for the importance of this assumption). If the statistics are reasonably fresh, this is the only sane estimate that can be made.

Playing with density – a warning

If you set your own column stats using dbms_stats.set_column_stats, the behaviour is different; I haven’t made any extensive investigations but as far as I can tell, the value you provide for density is used instead of NewDensity. User-provided column statistics are flagged with dba_tab_cols.user_stats = ‘YES’. You can disguise your user statistics as non-user by setting the flags parameter of dbms_stats.set_column_stats to 2 – but since the latter parameter is labeled as “for Oracle internal use only”, I would do it only for investigations purposes – that is, never in production.

—
Bottom note about singleton values: actually in pre-10.2.0.4 versions, if the value was present in the Frequency histogram but covering a single bucket (hence it was present in the table exactly one time at statistic collection time), it used to be classified as “unpopular” and hence used to get the same treatment as a value not in the histogram – the end result being that the cardinality was estimated as 0.5 rounded up to 1; now it is 1 before rounding as one would intuitively expects. I hope to be able to investigate whether this change fixes the issues about join cardinality estimation I investigated – see “the mystery of halving” in this investigation of mine if interested.

Other posts belonging to this series:
densities part I
densities part II
densities part III

14 Responses to "CBO: NewDensity for Frequency Histograms,11g-10.2.0.4 (densities part IV)"

Blogroll Report 16/10/2009-23/10/2009 « Coskan’s Approach to Oracle says:

November 3, 2009 at 14:48

[…] Alberto Dell’Era-CBO: NewDensity for Frequency Histograms,11g-10.2.0.4 (densities part IV) […]

Reply
Randolf Geist says:

December 3, 2009 at 16:38

Alberto,

regarding your comment “Playing with density”: I’ve suggested in a recent discussion to refer to your excellent blog series on Oracle-l (and try to set the density manually to get the “old” density behaviour back) – yesterday Wolfgang Breitling mentioned when we met at the UKOUG that manually setting the density actually removes the histogram, and generating a histogram manually will still use the NewDensity but not the density provided as part of the user-generated statistics.

I haven’t had time yet to test this personally, but now that Wolfgang has mentioned it I’m quite sure I remember that he’s right regarding setting the density manually – it removes the complete histogram immediately.

Can you comment on what you meant by above: “If you set your own column stats using dbms_stats.set_column_stats…”?

Have you tried to modify the density only or generating your own histogram?

Randolf

Reply
1. Alberto Dell'Era says:
  
  December 3, 2009 at 17:59
  
  Hi Randolf,
  
  thank you very much for your positive feedback, I really appreciate it – as you “may” know, it takes a lot of time to write about this kind of things trying to be clear and exhaustive …
  
  I use the following tiny SP to set the density while preserving the histogram:
  [sql]
  — a utility procedure to change density, preserving the histogram
  create or replace procedure set_density (
  p_table_name varchar2,
  p_column_name varchar2,
  p_new_density number,
  p_flags number default null
  )
  is
  l_distcnt number;
  l_old_density number;
  l_nullcnt number;
  l_srec dbms_stats.statrec;
  l_avgclen number;
  l_new_density number;
  begin
  — get the current column statistics
  dbms_stats.get_column_stats (
  ownname => user,
  tabname => p_table_name,
  colname => p_column_name,
  distcnt => l_distcnt,
  density => l_old_density,
  nullcnt => l_nullcnt,
  srec => l_srec,
  avgclen => l_avgclen
  );
  
  — reset them, overwriting “density”
  if p_new_density is not null then
  l_new_density := p_new_density;
  else
  l_new_density := l_old_density;
  end if;
  
  dbms_stats.set_column_stats (
  ownname => user,
  tabname => p_table_name,
  colname => p_column_name,
  distcnt => l_distcnt,
  density => l_new_density,
  nullcnt => l_nullcnt,
  srec => l_srec,
  avgclen => l_avgclen,
  no_invalidate => false,
  flags => p_flags
  );
  
  dbms_output.put_line (‘density of ‘||p_table_name||’.’||p_column_name||’ changed from ‘||l_old_density||’ to ‘|| l_new_density);
  
  end set_density;
  /
  show errors;
  [/sql]
  I’ve tested it right now again, albeit on 11.1.0.7 only, and it actually preserves the histogram. It is a script I’ve used for some years now.
  
  Notice how it works : it fetches the histogram from the data dictionary, and rewrites it back verbatim with a new density, possibly changing the “flags” as well …
  
  Reply
Randolf Geist says:

December 3, 2009 at 18:44

Alberto,

thanks for the clarification.

So you mean to say: Using your SP you’ve manually modified the density while preserving the histogram – and the CBO uses then the manually modified density from the dictionary for calculation but not the NewDensity any longer?

I thought that Wolfgang meant to say that he has generated a histogram manually along with a density, but the CBO sill applied the NewDensity algorithm to his made-up histogram.

At least two things come into my mind that might matter:
– I don’t know the version he has used to test
– He seems to have setup a made-up histogram rather than preserving an existing one, however I don’t see where this exactly should make any difference

Anyway, thanks for the feedback. If I find some time I’ll test it myself resp. ask Wolfgang for clarification what he did and what version he used to test.

Randolf

Reply
1. Alberto Dell'Era says:
  
  December 3, 2009 at 20:18
  
  Randolf,
  
  >So you mean to say: Using your SP you’ve manually modified the density
  >while preserving the histogram – and the CBO uses then the manually
  >modified density from the dictionary for calculation but not the NewDensity any longer?
  
  Yes if you use the default and only supported value (null) for the “flags” parameter of dbms_stats.set_column_stats – as far as I understand, in this case the statistics get labeled as “user generated stats” (dba_tab_cols.user_stats = ‘YES’). If you set “flags” to 2, you are pretending to be dbms_stats or a system routine (dba_tab_cols.user_stats = ‘NO’) and hence NewDensity gets used.
  
  It makes sense – if the user has provided her own value for “density”, use it instead of NewDensity.
  
  NB: I have exchanged in the past many, many emails with Wolfgang and I’ve noticed that He commonly uses flags=2 in his scripts … that should explain everything nicely 🙂
  
  Reply
Xiang Rao says:

August 22, 2012 at 20:59

Hi Alberto,
I am wondering if the num_rows inside the “half the least popular” formula NewDensity = 0.5 * bkt(least_popular_value)/num_rows should be the last endponiter_number? In your case, both of your row count and the buckets are the same: 216.

I have a table with 99,826,738 rows, the concerned column has 60 distinct values ranged between 3 and 451, with frequncy histogram of 17 buckets, and density 5.17e-9. The last endponiter_number is 5,327 and the least popular value 303 has only 1 bucket. For value 4 (not any of the endpoint_value, but beteween first 2 values), Oracle gives cardinality estimate as 9,370 and newDensity as 0.000094, which is 0.5*bkt(303)/last_endpoint_number = 0.5*1/5,327. I was looking for the theory behind the newDensity value 0.000094, and google brought me to your article again.

Thanks,

Xiang

Reply
1. Alberto Dell'Era says:
  
  August 25, 2012 at 18:56
  
  Hi Xiang,
  
  yes, you’re right – in all my examples I have always used no sampling when gathering statistics (to keep the discussion simple and the examples deterministic), and when no sampling, the last endpoint_number is always exactly the same as num_rows. But it is the last endpoint_number that is to be used in the formula, as you kindly point out – which makes a difference to you since you are sampling for sure.
  
  Just to be complete, here’s a demo that shows the effect of sampling on endpoint_number for a table with num_rows=100000:
  [sql]
  SQL> create table tt as select 0 as x from dual connect by level <= 100000;
  SQL> exec dbms_stats.gather_table_stats(user,’tt’,method_opt=>’for all columns size 254′, estimate_percent=>0.000001);
  SQL> select max(endpoint_number) from user_histograms where table_name=’TT’ and column_name=’X’;
  
  MAX(ENDPOINT_NUMBER)
  ——————–
  2710
  SQL> exec dbms_stats.gather_table_stats(user,’tt’,method_opt=>’for all columns size 254′, estimate_percent=>null);
  SQL> select max(endpoint_number) from user_histograms where table_name=’TT’ and column_name=’X’;
  
  MAX(ENDPOINT_NUMBER)
  ——————–
  100000
  [/sql]
  Thanks for commenting!
  
  Reply
Rickey says:

January 14, 2013 at 05:34

Definitely believe that that you stated. Your
favorite reason seemed to be on the net the easiest thing to
take into accout of. I say to you, I definitely get annoyed even as folks think
about concerns that they just do not know about.

You managed to hit the nail upon the top as well as defined out
the whole thing with no need side-effects , other people could take a signal.

Will likely be again to get more. Thanks

Reply
Sayan Malakshinov says:

December 3, 2013 at 21:43

Alberto,

could you tell me please: is this my view dba_newdensity written correctly?
[sql]
create or replace view dba_newdensity as
with
histgrm1 as (
select–+ inline merge
owner as owner
,table_name as table_name
,column_name as column_name
,endpoint_number as ep_num
,endpoint_value as ep_val
,endpoint_actual_value as ep_act_val
,lag(endpoint_number) over(partition by owner,table_name,column_name order by endpoint_number) as ep_num_prev
from
dba_histograms h1
)
,histgrm2 as (
select–+ inline
owner as owner
,table_name as table_name
,column_name as column_name
,ep_num as ep_num
,ep_val as ep_val
,ep_act_val as ep_act_val
,ep_num – nvl(ep_num_prev,0) as bkt
,decode (ep_num – nvl(ep_num_prev,0)
, 0, 0
, 1, 0
, 1
) as popularity
from
histgrm1 h
)
,hist_agg as (
select–+ inline
owner
,table_name
,column_name
,max(ep_num) as BktCnt — should be equal to sum(bkt)
,sum(decode(popularity, 1, bkt,0)) as PopBktCnt
,sum(decode(popularity, 1, 1 ,0)) as PopValCnt
— ,min(bkt) keep(dense_rank first order by decode(popularity,1,ep_num)) as bkt_least_popular_value
— ,min(decode(popularity,1,bkt)) keep(dense_rank first order by decode(popularity,1,bkt) nulls last) as bkt_least_popular_value
,min(decode(popularity,1,bkt)) as bkt_least_popular_value
from histgrm2
group by owner,table_name,column_name
)
select
st.owner
,st.table_name
,st.column_name
,st.histogram
,h.BktCnt
,h.PopBktCnt
,h.PopValCnt
,st.num_distinct as NDV
,h.bkt_least_popular_value
,st.density as old_Density
,case st.histogram
when ‘FREQUENCY’
then — 0.5 * bkt_least_popular_value / t.num_rows
0.5 * bkt_least_popular_value / BktCnt
when ‘HEIGHT BALANCED’
then ( 1 – PopBktCnt / BktCnt ) / (st.num_distinct – PopValCnt)
end as newdensity
from
dba_tab_col_statistics st
,hist_agg h
,dba_tables t
where
st.owner = h.owner
and st.table_name = h.table_name
and st.column_name = h.column_name
and st.owner = t.owner
and st.table_name = t.table_name;[/sql]

Reply
Sayan Malakshinov says:

December 3, 2013 at 21:49

without unnecessary:
[sql]
create or replace view dba_newdensity as
with
histgrm1 as (
select–+ inline merge
owner as owner
,table_name as table_name
,column_name as column_name
,endpoint_number as ep_num
,endpoint_value as ep_val
,endpoint_actual_value as ep_act_val
,lag(endpoint_number) over(partition by owner,table_name,column_name order by endpoint_number) as ep_num_prev
from
dba_histograms h1
)
,histgrm2 as (
select–+ inline
owner as owner
,table_name as table_name
,column_name as column_name
,ep_num as ep_num
,ep_val as ep_val
,ep_act_val as ep_act_val
,ep_num – nvl(ep_num_prev,0) as bkt
,decode (ep_num – nvl(ep_num_prev,0)
, 0, 0
, 1, 0
, 1
) as popularity
from
histgrm1 h
)
,hist_agg as (
select–+ inline
owner
,table_name
,column_name
,max(ep_num) as BktCnt — should be equal to sum(bkt)
,sum(decode(popularity, 1, bkt,0)) as PopBktCnt
,sum(decode(popularity, 1, 1 ,0)) as PopValCnt
,min(decode(popularity,1,bkt)) as bkt_least_popular_value
from histgrm2
group by owner,table_name,column_name
)
select
st.owner
,st.table_name
,st.column_name
,st.histogram
,h.BktCnt
,h.PopBktCnt
,h.PopValCnt
,st.num_distinct as NDV
,h.bkt_least_popular_value
,st.density as old_Density
,case st.histogram
when ‘FREQUENCY’
then 0.5 * bkt_least_popular_value / BktCnt
when ‘HEIGHT BALANCED’
then ( 1 – PopBktCnt / BktCnt ) / (st.num_distinct – PopValCnt)
end as newdensity
from
dba_tab_col_statistics st
,hist_agg h
where
st.owner = h.owner
and st.table_name = h.table_name
and st.column_name = h.column_name;
create public synonym dba_newdensity for dba_newdensity;
grant select on dba_newdensity to public;[/sql]

Reply
1. Alberto Dell'Era says:
  
  January 25, 2014 at 16:26
  
  Sayan,
  
  sorry for the long delay, my notification module has gone berserk again ;(
  
  There was a NewDensity calculation function inside the test case, have you checked that ? As far as I know my function is correct; if that’s not the case, I would like to know the reason
  
  Reply
DiF says:

October 9, 2015 at 22:20

Hi Alberto,

I tried to influence the optimizer to use Old Density value when calculating the join cardinality/selectivity for a statement, but without any success. Is it possible to do such a thing ?

–10053 before
Table Stats::
Table: TAB_1 Alias: TAB_1 (Using composite stats)
#Rows: 3913250500 #Blks: 34223419 AvgRowLen: 408.00 ChainCnt: 0.00
Index Stats::
Index: IDX_1 Col#: 16
USING COMPOSITE STATS
LVLS: 2 #LB: 897500 #DK: 231507 LB/K: 3.00 DB/K: 19.00 CLUF: 4469600.00
ALL PARTITIONS USABLE

Column (#16):
NewDensity:0.000004, OldDensity:0.000072 BktCnt:254, PopBktCnt:28, PopValCnt:1, NDV:231507
Column (#16): COL_1(
AvgLen: 5 NDV: 231507 Nulls: 0 Density: 0.000004 Min: -1 Max: 836355
Histogram: HtBal #Bkts: 254 UncompBkts: 254 EndPtVals: 227

Access Path: index (AllEqJoinGuess)
Index: IDX_1
resc_io: 214.00 resc_cpu: 2089826
ix_sel: 0.000004 ix_sel_with_filters: 0.000004
NL Join : Cost: 1077.02 Resp: 63.62 Degree: 20
Cost_io: 1074.00 Cost_cpu: 10495264
Resp_io: 63.44 Resp_cpu: 626640

Height histogram on column COL_1 has a single popular value (‘-1′, about 10% of the rows, 430M). Unfortunately this popular value is the one that is being retrieved from TAB_1. Instead of E-Rows of 17K (3,913,250,500 / 231,507 = 16903.37873152864) we get 431,381,945 rows (cardinality of popular value). COL_1 is a join column (no predicates are applied to it) for TAB_1, which is an inner table in a NESTED LOOP.

The following piece of code was executed:
exec dbms_stats.set_column_stats(OWNNAME =>null, TABNAME =>’TAB_1′, colname =>’COL_1’,density => 0.00007169487, flags => null);

–0.00007169487 is the Old Density
–Histogram on COL_1 disappeared after command was executed
–dba_tab_cols.user_stats = ‘YES’ for COL_1
–10053 after:

Table Stats::
Table: TAB_1 Alias: TAB_1 (Using composite stats)
#Rows: 3912654200 #Blks: 34224610 AvgRowLen: 408.00 ChainCnt: 0.00
Index Stats::
Index: IDX_1 Col#: 16
USING COMPOSITE STATS
LVLS: 2 #LB: 897500 #DK: 231507 LB/K: 3.00 DB/K: 19.00 CLUF: 4469600.00
ALL PARTITIONS USABLE
Column (#16): COL_1(
AvgLen: 5 NDV: 237644 Nulls: 0 Density: 0.000072 Min: -1 Max: 841836

Access Path: index (AllEqJoinGuess)
Index: IDX_1
resc_io: 214.00 resc_cpu: 2089626
ix_sel: 0.000004 ix_sel_with_filters: 0.000004
NL Join : Cost: 1077.02 Resp: 63.62 Degree: 20
Cost_io: 1074.00 Cost_cpu: 10494264
Resp_io: 63.44 Resp_cpu: 626584

Even if Density: 0.000072 for COL_1, index selectivity is still calculated using 1/NDV (1/231507 = 0.00000431952381569). Is this happening because the histogram is missing and density is not taken into account when selectivity is calculated ?

Is there any method to change ix_sel ?

Regards,
Dif

Reply
1. Alberto Dell'Era says:
  
  November 25, 2015 at 21:10
  
  DiF,
  
  sorry for the delay … have you tried changing the index stats to check whether the CBO is using them while ignoring the column ones?
  
  You can set _optimizer_enable_density_improvements=false to get back the old density, as stated in
  http://www.adellera.it/investigations/11g_newdensity/11gNewDensity.pdf
  
  HTH – Al
  
  Reply
12c Hybrid histogram – All Things Oracle says:

January 26, 2016 at 17:19

[…] buckets, the number of popular values, etc. Nevertheless, using the following select (inspired by Alberto Dell’ Era) we can get a reliable value for the […]

Reply

Leave a Reply Cancel reply