CBO: the “non-empty result set” assumption

The CBO assumes that SELECT statements are always going to retrieve at least one row – even if this is not necessarily the case, of course. Understanding why this is done is both useful and fascinating.

We must start from the very beginning and remember that one of the most important tasks of the CBO is estimating the statement cardinality, that is, to make a guess about the number of rows that will be fetched. In statistics, that means that the CBO must calculate (estimate) the expected value of the cardinality random variable.

In order to calculate the expected value, in our case, we can consider the table potential population (i.e. the set of all possible row values), execute (ideally!) the statement over each table in the population, and compute (again ideally) the average of the cardinality of each result set.

The population must be coherent with the set of observations stored in the data dictionary when the table and column statistics were collected; in other words, the population must satisfy a set of statistical constraints. For example the number of distinct values in each column must be equal (or statistically equal) to the num_distinct statistic; the range of values must be inside (or statistically inside) the min-max interval dictated by low_value-high_value, etc.

Now consider a simple statement with a filter predicate:

select ...

from t

where x = 1;

Assuming that column X contains numbers, there are an infinite number of values of X inside the min-max interval (assuming that min is not equal to max) that can satisfy the constraints. In the table population, how many tables have X=1, and how many rows will be retrieved by the statement?

If a frequency histogram has been collected on column X, the population is constrained to (statistically) satisfy it, and hence we have the answer: the expected cardinality is zero if value X is not contained in the histogram and strictly greater than zero (computed with the usual formula) otherwise.

But if no histogram is collected on the column, the number of tables with X=1 will be negligible, and hence the expected value will be zero. That is not very useful.

But if we assume that the result set is never empty, then we have another constraint to apply. That means that the value X is contained in all tables of the population, and (if we add the additional customary assumption of uniform distribution of values) we can easily derive the usual num_rows / num_distinct(X) formula.

Note that the “non-empty result set” assumption is very strong; it means that the statement and the table are not independent, but actually are highly correlated, since the assumption is equivalent to say that the client executes the statement in order to retrieve rows whose existence is certain before the statement execution. In other words, the CBO infers information about the data from the statement itself, not only from the data dictionary statistics, trusting that the user has some knowledge about the data stored in the table.

The assumption is of course more than reasonable for almost all statements and clients, but not always. For instance, X=1 might mean “new record” in a table-queue that contains the history of the last few years as well, and the table migth have no observable record with X=1 thanks to the consumer(s) being very quick or the producer(s) rarely enqueuing records. Or maybe, X=1 might mean “failed record” in a process that never fails, and the statement could be issued for checking the rows existence, not for retrieving them. In this kind of scenario, the CBO predictions can be affected by the “non-empty result set” assumption.

6 comments on this post.
  1. Entradas de Oracle semanas 35-37 « Gruñidos sobre Oracle y SAP:

    [...] en SAP debe ser configurado segĂșn sus recomendaciones. Alberto Dell’Era nos explica que el CBO asume siempre que una sentencia SELECT devuelve por lo menos una fila. Acabo con un articulo interesante de Porus Homi Havewala, “Patch a Thousand Databases, Using [...]

  2. Alberto Dell’Era’s Oracle blog » CBO: about the statistical definition of “cardinality” (densities part I):

    [...] all values of :x for which w(:x) is non zero. At the other end of the spectrum, that is, under the non empty result set assumption, we have E[card] = sum ( w(:x) ) * num_rows / num_distinct(X) = num_rows / num_distinct(X), the [...]

  3. Alberto Dell’Era’s Oracle blog » CBO: NewDensity for Frequency Histograms,11g- (densities part IV):

    [...] statistics collection time, hence it returns the minimal cardinality estimate compatible with the non empty result set assumption (check this post for the importance of this assumption). If the statistics are reasonably fresh, [...]

  4. Alberto Dell’Era’s Oracle blog » CBO: the formula for the “density” column statistic (densities part II):

    [...] is known to be not popular, hence hitting the NPS – remember that the CBO operates under the “non-empty result set assumption”, hence if the literal does not hit a popular value, it must hit a value of the [...]

  5. Alberto Dell’Era’s Oracle blog » CBO: “NewDensity” replaces “density” in 11g, (densities part III):

    [...] (class) of all possible equality filter predicate statements that hit the NPS, under the usual “non-empty result set assumption” and the further (strange and strong) assumption that the more a value is represented in the NPS, [...]

  6. non zero values:

    [...] with a zero simulation however they can if required be executed in a non zero simulation time. …Alberto Dell’Era’s Oracle blog CBO: the non-empty result …The CBO assumes that SELECT statements are always going to retrieve at least one row – even if this [...]

Leave a comment