Range predicates selectivity
Abstract

An investigation, based on Jonathan Lewis' test case "Discrete Dangers" on page 126 of "Cost Based Oracle", to discover and illustrate the exact formula that the CBO uses to estimate the cardinality for (bounded) range predicates.

The main result is that Jonathan's "standard formula" applies in a wider context than expected (by me at least), allowing for a couple of slight modifications relevant for some (but important) cases only.

We'll start on this page with (open,open) ranges, than we'll continue on the next page with (closed,closed) ranges, and then with (open,closed),(closed,open) on another page. The conclusion is here.

Selection over an infinitesimal range

The script range_sel_inf.sql builds this table (10.2.0.2):
  create table t as select 10 * mod (rownum-1, 2) x from dual connect by level <= 2000000;
and then collects statistics without histograms:
  exec dbms_stats.gather_table_stats (user, 't', method_opt=>'for all columns size 1', estimate_percent=>100);
The resulting data distribution is
select x, count(*) from t group by x;

         X   COUNT(*)
---------- ----------
         0    1000000
        10    1000000
ie, a table with
  num_distinct = 2
  num_rows / num_distinct = 1,000,000
  min_x := min(column x)  = 0
  max_x := max(column x)  = 10
Now the script runs this series of statements, and collects the CBO estimate of cardinality for each one:
select x from t where x > 0.1234 and x < 0.1234 + 0.001
select x from t where x > 0.3456 and x < 0.3456 + 0.001
..
select x from t where x > 9.6789 and x < 9.6789 + 0.001
The template is of course a selection over a (bounded, open) range predicate:
select x from t where x > low_x and x < low_x + w
keeping w constant and very small ("infinitesimal"), and letting low_x vary to cover the (minx_, max_x) interval.

If you diagram the estimated cardinality versus low_x, you get the red line on the following graph:
The green line (num_distinct=3) is obtained by adding to the table another distinct value (we add 1,000,000 rows to keep num_rows/num_distinct the same):
insert into t(x) select 6.1 from dual connect by level <= 1000000;
And the blue line (num_distinct=4) after adding a fourth distinct value:
insert into t(x) select 7.42 from dual connect by level <= 1000000;
By looking at the graphs and the data, we can make the following deductions (the ones labeled by [JLB] are already discussed in Jonathan Lewis' Book quoted above; here I've just verified that they apply to my data sets): Selection over a finite range

The script range_sel_finite.sql repeats the very same steps of the script above, but the statements' template is now
select x from t where x > low_x and x < high_x
where low_x = constant = 0.1, and high_x varies to cover the whole interval (low_x, max_x).
The diagram of the estimated cardinality versus high_x (click to get the data sets for the red, green and blue line) is:
We can make the following deductions: Let's move to the closed range case.

For corrections / feedback:
alberto.dellera@gmail.com