CBO: the formula for the “density” column statistic (densities part II)

In this post we are going to explore and explain the rationale for the formula used by dbms_stats to compute the "density" column statistic, used by the CBO in versions less than 10.2.0.4 to estimate the cardinality of a class of SQL statements. In the next post, we will speak about its replacement, named "NewDensity" in 10053 trace files.

We will consider only the non-trivial case of Height-Balanced histograms, since for Frequency Histograms density is a constant (0.5 / num_rows) and for columns without histogram, it is simply 1/num_distinct.

Let's illustrate the test case on which we will base our discussion, contained in this zip file.

First, a table T is created with the following exponential value distribution:

SQL> select value, count(*)
  2    from t
  3   group by value
  4   order by value;

     VALUE   COUNT(*)
---------- ----------
         1          1
         2          2
         4          4
         8          8
        16         16
        64         64

The test case then computes a SIZE 5 Height-Balanced histogram. The resulting histogram (from dba_histograms) is as follows (note that I have added the column POPULARITY that marks popular values with "1"; EP is shorthand for column ENDPOINT_NUMBER, VALUE for column ENDPOINT_VALUE):

SQL> select ep, value, popularity from formatted_hist;

        EP      VALUE POPULARITY
---------- ---------- ----------
         0          1          0
         1         16          0
         5         64          1

The test case then issues this SQL statement that contains only an equality filter predicate on table T:

select ...
  from t
 where value = 2.4;

The literal value 2.4 is not contained in the table (and hence in the histogram), in order to make the CBO factor in "density" in its estimate of the expected cardinality - in fact, as it might be known, density is used when the literal is not popular (that is, not equal to 64 in our case), and it doesn't matter whether the literal is not contained in the histogram, or contained as an unpopular value (1 and 16 in our case), or even contained in the table or not. All it takes is its being not popular.
Side note: I'm assuming the literal is inside the closed min-max interval (1-64 in this case); when outside, it depends on the version.

When the literal is not popular, the formula used for the expected cardinality calculation is equal to

E[card] = density * num_rows;

That is easy to verify from the test case logs; in 9i we can see that density = 0.115789474 and num_rows=95, hence 0.115789474 * 95 = 11.000000000 which is exactly equal to the CBO estimate for our statement.

The formula used by dbms_stats to compute "density" was published in Jonathan Lewis' book Cost Based Oracle (page 172) and Wolfgang Breitling's presentation Histograms - Myths and Facts. The key fact is that the formula takes as input the rows of what I've nicknamed the not-popular subtable (NPS), that is, the original table without the rows whose values are popular values (in this case, 64 is the only popular value). Letting num_rows_nps the number of rows of the NPS (for our example, num_rows_nps=1+2+4+8+16=31), we have:

   density = (1 / num_rows) *
   sum (count (value) ^ 2) / num_rows_nps
   for  "value" belonging to the NPS

The script performs this calculation automatically; it is anyway instructive to perform the calculation manually at least one time:
density = (1/95) * (1*1+2*2+4*4+8*8+16*16) / 31 = .115789474
that matches perfectly the density we observed in the script log before.

What is the statistical rationale for this seemingly strange computation?

If we plug it inside the formula for E[card], we can see that num_rows is cancelled:

E[card] = sum (count (value) ^ 2) / num_rows_nps
summed over all "values" belonging to the NPS

Now we must reference the statistical concepts introduced in this post, and consider the family of all statements of our kind that can reference the NPS:

select ...
  from t
 where x = :x;
:x being a value belonging to the NPS

its E[card] is

E[card] = sum ( w(:x) * E[count(:x)] )
for all values of :x (belonging to the NPS)

dbms_stats takes count(:x) as the best estimate for E[count(:x)] (for example, E[count(4)] = count(4) = 4 in our case). All we have to do in order to obtain the observed formula, is to assume w(:x) = count(:x) / num_rows_nps:

E[card] = sum ( (count(:x) / num_rows_nps) * count(:x) )
        = sum ( count(:x) ^ 2 ) / num_rows_nps
for all values of :x (belonging to the NPS)

The meaning of the above particular shape of w(:x) is that the probability that the client submits a certain value for :x is proportional to the number of rows (in the NPS) that has that value; more precisely, that if X% of rows has a certain common value, X% of user-submitted statements that "hit" the NPS will ask for that value. Under this assumption, dbms_stats precomputes "density" to give back the above E[card] when the literal is known to be not popular, hence hitting the NPS - remember that the CBO operates under the "non-empty result set assumption", hence if the literal does not hit a popular value, it must hit a value of the NPS.

The above assumption for w(:x) is quite a strange assumption - and in fact, we will see in the next post that in 11g (and 10.2.0.4), this assumption has been dropped and replaced with a more standard one. The "density" column statistics is in fact ignored in 10.2.0.4+ and a value computed at run-time, named "newDensity" in 10053 trace files, is used instead.

Other posts belonging to this series:
densities part I
densities part III
densities part IV

CBO: about the statistical definition of “cardinality” (densities part I)

Let's explore the concept of cardinality from the point of view of the statistician; this is both to get a clearer vision of the matter (i.e. for fun) and to path the way for understanding the rationale for the "density" statistics as calculated by dbms_stats (the topic of an upcoming post).

Let's consider a statement with input parameters (bind variables), and consider the most fundamental of them all, the one with a filter predicate:

select ...
  from t
 where x = :x;

the cardinality "card" of the set of rows retrieved depends on the table possible values and the actual inputs provided by the client as bind variable values. What about the expected value E[card] ?
Let:
1) w(:x) ("w" stands for "workload") the probability mass function of the random variable :x (that completely characterizes the workload);
2) E[count(:x)] the expected value of the cardinality of the rows retrieved for each value of :x.

We have, assuming that the two are independent:

E[card] = sum ( w(:x) * E[count(:x)] )
for all values of :x

To solve the formula we have to know (or assume) the client-dictated w(:x). The same goes for the table potential population (that -together with the statement of course- shapes E[count(:x)]); we must either have some statistical measurements about the table (for example, a frequency histogram on column X that we consider representative of the table population) or assume them (for example, assume a certain distribution for the column X values).

It is interesting to explore the most used scenario: a uniform (or assumed uniform) distribution for the column X values, of which we know the number of distinct values num_distinct(X) and the total number of values num_rows (let them be deterministically known for simplicity, and exclude null values). That means that E[count(:x)]) is equal to num_rows / num_distinct(X) over a finite set that contains num_distinct(X) values and is equal to zero over the remaining ones.

It is relatively easy to see that E[card] depends on how w(:x) and E[count(:x)]) overlap. At one end of the spectrum, if the client always submits values for :x that are not contained in the table, E[card] is zero - since the client choice matematically translates into E[count(:x)]) being zero over all values of :x for which w(:x) is non zero. At the other end of the spectrum, that is, under the non empty result set assumption, we have E[card] = sum ( w(:x) ) * num_rows / num_distinct(X) = num_rows / num_distinct(X), the usual formula used by the CBO in many (most) situations.

The "density" column statistic formula can be derived in a similar way, but using a different assumption about w(:x) - as we will see in a dedicated post.

Other posts belonging to this series:
densities part II
densities part III
densities part IV

An interview with Mark Townsend

While attending the 11gR2 launch event in Milan last Thursday, I had the distinguished opportunity (invited, as a blogger, by the Oracle team that was organizing the event) to meet Mark Townsend and exchange a few words about the new features of 11gR2 and the Oracle database in general as well.

For those who don't know, Mark is (among other things) the Vice President in charge of coordinating the Product Managers and a technical expert at the same time, and this rare combination has the advantage that you can ask him about any feature you like at whatever granularity you like, from the strategic level down to the technical gory details. In fact Mark is frequently seen at public events (such as Oracle Open World), speaking to mixed audiences composed of both Engineers and Management.

We spoke about many new features of 11gR2, so much in fact that in order to do justice to the information that Mark very kindly provided me, I will use them as the foundation of some future blog posts. In brief anyway, I noticed that this release has features that are targeted mostly to vastly improve the "grid", but with very interesting features for the "core" as well (my personal favourites being the in-memory parallel execution and the new SCN-based MV log, which I plan to blog about in the very near future).

But especially, I didn't miss the unique opportunity to discuss with one of the top players of Oracle Corporation about the amount of information that Oracle shares with its professionals, obviously trying to push for much more. I'm sure that everyone that works with any kind of software product agrees with me on the fact that knowing how the product works is key not only to troubleshooting (that being almost obvious) but to good design also, or perhaps especially; the more you know, the better designer you are going to be.

We had a lot of back and forth on this topic, but to summarize, Mark agreed on detailed information being very useful for experts, but also pointed out that it takes years and years to become proficient enough to be able to digest very detailed documentation, and in the meanwhile, too much information is going to confuse, rather than clear things (especially for juniors coming from other database products). Just the current detail level is overwhelming, the current docs being composed of a staggering 21,000,000 words - and Oracle is in fact trying to organize the documentation in layers as much as possible, starting from high-level descriptions (the Two Day DBA course), than the Concept manual, and than the rest. The most details, however, will be still provided forever as Metalink (aka "My Oracle Support") notes, the only reason being to avoid confusing people, not to hide anything (besides strategic algorithms not covered by Patents of course).

So in short, from this discussion I have understood that Oracle is willing to share information with its community; it only wants to find the right way to do so, since the community is huge (hundreds of thousands of people) and the product is very, very complex. Anyway, I have insisted for more information about the two topics that I'm sure that are not documented well enough, naming the "auto" features and the CBO algorithms - and actually I feel like I have insisted perhaps too much on that with Mark ... but that's something I really care about, both for professional reasons and keen interest alone.

Well, I might add that information sharing, and community involvement as well, is in my opinion one of the main factors of Oracle's success; actually, being able to dig a lot into the inner workings of the product is the reason why I chose an Oracle career ten years ago.

PS I'm back from my vacation and I have a lot of interesting things to investigate at work that look like perfect candidates for being turned into posts, so I will be able to blog more frequently in the future. I also have a series of posts about the CBO that is "almost complete".

CBO: the “non-empty result set” assumption

The CBO assumes that SELECT statements are always going to retrieve at least one row - even if this is not necessarily the case, of course. Understanding why this is done is both useful and fascinating.

We must start from the very beginning and remember that one of the most important tasks of the CBO is estimating the statement cardinality, that is, to make a guess about the number of rows that will be fetched. In statistics, that means that the CBO must calculate (estimate) the expected value of the cardinality random variable.

In order to calculate the expected value, in our case, we can consider the table potential population (i.e. the set of all possible row values), execute (ideally!) the statement over each table in the population, and compute (again ideally) the average of the cardinality of each result set.

The population must be coherent with the set of observations stored in the data dictionary when the table and column statistics were collected; in other words, the population must satisfy a set of statistical constraints. For example the number of distinct values in each column must be equal (or statistically equal) to the num_distinct statistic; the range of values must be inside (or statistically inside) the min-max interval dictated by low_value-high_value, etc.

Now consider a simple statement with a filter predicate:

select ...
  from t
 where x = 1;

Assuming that column X contains numbers, there are an infinite number of values of X inside the min-max interval (assuming that min is not equal to max) that can satisfy the constraints. In the table population, how many tables have X=1, and how many rows will be retrieved by the statement?

If a frequency histogram has been collected on column X, the population is constrained to (statistically) satisfy it, and hence we have the answer: the expected cardinality is zero if value X is not contained in the histogram and strictly greater than zero (computed with the usual formula) otherwise.

But if no histogram is collected on the column, the number of tables with X=1 will be negligible, and hence the expected value will be zero. That is not very useful.

But if we assume that the result set is never empty, then we have another constraint to apply. That means that the value X is contained in all tables of the population, and (if we add the additional customary assumption of uniform distribution of values) we can easily derive the usual num_rows / num_distinct(X) formula.

Note that the "non-empty result set" assumption is very strong; it means that the statement and the table are not independent, but actually are highly correlated, since the assumption is equivalent to say that the client executes the statement in order to retrieve rows whose existence is certain before the statement execution. In other words, the CBO infers information about the data from the statement itself, not only from the data dictionary statistics, trusting that the user has some knowledge about the data stored in the table.

The assumption is of course more than reasonable for almost all statements and clients, but not always. For instance, X=1 might mean "new record" in a table-queue that contains the history of the last few years as well, and the table migth have no observable record with X=1 thanks to the consumer(s) being very quick or the producer(s) rarely enqueuing records. Or maybe, X=1 might mean "failed record" in a process that never fails, and the statement could be issued for checking the rows existence, not for retrieving them. In this kind of scenario, the CBO predictions can be affected by the "non-empty result set" assumption.

NoCOUG’s “First international SQL challenge”

Just a short note to tell my friends that I have been bestowed the August Order of the Wooden Pretzel, that is, that I won the NoCOUG's "First international SQL challenge" with this solution.

I'm especially happy to see that, after (way too) many years since graduation, I am still able to use my math skills to solve problems ... :).

Many thanks to Iggy Fernandez for setting up the contest (and running it with love and dedication), Dan Tow for judging the solution and of course Chen Shapira for advertising and "supporting" it on her blog!

fast refresh of single-table materialized views – algorithm summary

Today we are going to investigate how Oracle fast refreshes materialized views (MVs) of a single master table, containing no aggregate but, at most, filter predicates and additional column definitions:

create materialized view test_mv
build immediate
refresh fast on demand
with rowid
-- with primary key
as
select test_t1.*, x1+x2 as x1x2
  from test_t1
 where x1 != 0.42;

This kind of MVs might be considered a degenerate case of a join-only MV, a topic that we investigated in an earlier post, and one could expect the same algorithm. But that is not the case: the test case shows that the algorithm used is very different.
The two main differences are (as we are going to illustrate in detail) that UPDATEs are actually used in this case (as noted by Cristian Cudizio) instead of DELETE+INSERT only, and especially that row-by-row propagation is performed instead of using a couple of single SQL statements.

This kind of MV is frequently used for replication across a db-link (with a clause such as "from test_t1@db_link"); in this scenario, the MV used to be named SNAPSHOT in old releases. I have checked this scenario as well (not included in the test case) and the only difference is that, obviously, the master table test_t1 is referenced via a db-link and a few hints are injected by the refreshing engine.

In the test case, I have checked both the WITH ROWID and WITH PRIMARY KEY options for the MV DDL; the algorithm turns out as being identical, besides (obviously) that in the former the rowid and in the latter the primary key is used to identify rows.

I am going to follow the path of the previous discussion about join-only MVs referenced above, as both the test case format and some of the actual refresh steps are very similar. I have tested on 9.2.0.8, 10.2.0.4 and 11.1.0.7 for the most common DML on the base table (conventional INSERTs, UPDATEs and DELETEs). I have seen no difference in the algorithm for the three kernel versions.

Materialized view logs configuration

Even for this test case, I have configured the materialized view logs to "log everything" to check whether Oracle is able to take advantage of more information in the log:

create materialized view log on test_t1
with sequence, rowid, primary key (x1, x2)
including new values;

but even for single-table MVs the algorithm uses only the rowid or primary key information, hence the minimal (and hence optimal) log configuration is, for the WITH ROWID option:

create materialized view log on test_t1 with rowid;
 

and for the WITH PRIMARY KEY option:

create materialized view log on test_t1 with primary key;
 

Log snapshots

The first step in the refresh algorithm is to take a log snapshot, exactly as in the join-only case, by setting snaptime$$ = current time. Hence the marked log rows (the ones and only ones to consider for propagation) will be characterized by snaptime$$ <= current time and > last snapshot refresh time. See the previous post about the join-only case for a more in-depth discussion.

Note: actually, for the sake of precision, two (minor) differences with the join-only case are that the snapshot statement is exactly the same in all versions (there's no special version for 11.1.0.7) and that the log is not "inspected to count the number and type of the logged modifications".

Core algorithm: the DELETE and UPSERT steps

Then, the core replication starts. The propagation from the master table is composed of two simple steps, steps that I've named DELETE and UPSERT (UPDate + insERT).

The first DELETE step is a simple select-then-delete row-by-row processing, where each row returned by a select statement is passed to a single-row delete statement.
For the WITH ROWID option, the select statement of the DELETE step is (editing for readability: removing hints, unnecessary aliases, etc):

select distinct m_row$$
  from (
select m_row$$
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'I'
       ) log
 where m_row$$ not in
       (
select rowid from test_t1 mas
 where (mas.x1 <> 0.42)
   and mas.rowid = log.m_row$$
       );

and the delete is a trivial

delete from test_mv where m_row$$ = :1;

The select+delete purpose is to delete all marked rows that are not in the master table anymore, or that are still there but that do not satisfy the MV defining SQL (here, x1 != 0.42) anymore.

In fact, the first in-line view fetches from the log the rowid of a subset (those whose dmltype$$ != 'I') of the marked rows, since :1 is set to the date of the previous refresh of the materialized view. Well actually - the SQL, as it is, would also get the log rows inserted after the snapshot was taken, which is obviously not acceptable since the propagation must operate on a stable set of rows. I'm not sure how the non-marked rows are excluded, but probably the various "select for update" on the log data dictionary tables might play a role by locking the commits on the logs, or maybe the serialization level is set to read-only or serializable (I will investigate this in the future). For now, let's make the conjecture that only the marked rows are selected.

The last correlated subquery simply filters out the rowid of the rows that are still in the master table. The condition dmltype$$ != 'I' ('I' stands for INSERT) is only an optimization, since an inserted row would be filtered out by the subquery anyway - unless it has not been deleted after being inserted, but that would be recorded with another log row with dmltype$$ = 'D'.

Why are updates (dmltype$$ = 'U') not optimized away as well? This is to delete rows from the MV that no longer belong to the current image of the MV defining SQL statement, since they used to satisfy the filter condition (here, x1 != 0.42) but no longer do after an update. Thanks to the filter condition (x1 != 0.42) being included in the subquery, any row that does not satisfy it anymore after an update will not be filtered out, and hence will be deleted.

Note that column m_row$$ of the MV is a hidden (but not virtual) column that records, for each MV row, the rowid of the corresponding master table row. It is automatically created when you define the MV with the WITH ROWID option; an index is automatically created on m_row$$ as well (unless you specify USING NO INDEX, something that does not make sense if you want to fast refresh the MV). Hence you do not need to create any additional index, neither on the master table nor on the MV, to optimize this step of the fast refresh.

Switching to the WITH PRIMARY KEY option, the select statement of the DELETE step is

select distinct pk1
  from (
select pk1
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'I')
       ) log
 where pk1 not in
       (
select pk1
  from test_t1 mas
 where (mas.x1 <> 0.42)
   and log.pk1 = mas.pk1
       );

and the delete is simply

delete from test_mv where pk1 = :1;

That is, the statements are the same as in the WITH ROWID case, with the primary key instead of the rowid in all statements. Since the master table must have a primary key for the MV create to succeed, and since an index on the MV that spans the primary key column(s) is automatically created (unless you specify USING NO INDEX of course), even in the WITH PRIMARY KEY case you do not need to create any additional index for performance. Actually, for best performance, an index on the master table that combines the PK and the column(s) referenced by the MV filter condition - here on (pk1, x1) - might help a bit, since probably the optimal plan is a nested loop having test_t1 as the inner table. This would avoid a block get on the master tables for marked rows not satisfying the MV filter condition; the effectiveness of this index depends on whether you have a lot of updates on the column referenced in the filter condition.

The UPSERT step is a simple select-then-upsert row-by-row processing, where each row returned by a select statement (that calculates the current image of the row that needs to be propagated to the MV) is used to update the corresponding row in the MV; if the update finds no row, the row is inserted.

For the WITH ROWID option, the select statement of the UPSERT step is:

select current.x1, current.x2, current.pk1, current.x1x2,
       rowidtochar (current.rowid) m_row$$
  from (
select x1, x2, pk1, x1+x2 as x1x2
  from test_t1
 where (x1 <> 0.42)
       ) current,
       (
select distinct m_row$$
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'D'
       ) log
 where current.rowid = log.m_row$$;

and the update and insert statements are simply:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2 = :4 where m_row$$ = :5;
insert into test_mv (x1,x2,pk1,x1x2,m_row$$) values (:1,:2,:3,:4,:5);

The select+upsert purpose is to calculate the new image of all marked rows that satisfy the MV defining SQL filter condition (here, x1 != 0.42) and then overwrite the old image in the MV with the new one. Note that an update on the master table might produce an insert if the old image did not satisfy the filter condition and the new one does.

The structure of the select statement should be obvious after the previous illustration of the DELETE step. Note of course the different optimization in the second inline view (dmltype$$ != 'D'). Even in this case, the automatically created index on the m_row$$ MV column optimizes the update statement, and no other index is necessary for performance on neither the base table nor the MV.

Switching to the WITH PRIMARY KEY option, the select statement of the UPSERT step is

select current.x1, current.x2, current.pk1, current.x1x2
  from (
select x1, x2, pk1, x1+x2 x1x2
  from test_t1
 where (x1 <> 0.42)
       ) current,
       (
select distinct pk1
  from mlog_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'D'
       ) log
 where current.pk1 = log.pk1;

and the update and insert statements are:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2=:4 where pk1=:3;
insert into test_mv  (x1, x2, pk1, x1x2) values (:1, :2, :3, :4);

And the same considerations about the substitution of rowid with the primary key hold. The index on the master table on (pk1, x1) might be of help here as well.

So here it is what the algorithm, essentially, is all about: a row-by-row propagation of all the modified (marked) rows to the MV, with a few optimizations.

Algorithm optimizations

Whatever the type of modifications, the algorithm is always the same: both the DELETE and UPSERT step are performed in all cases. Of course, in both cases, the select statement might select no row.

Xplan 2.0

A lot of new features have been added in version 2.0 of xplan, the sqlplus script I use to investigate about SQL statements performance (I spoke about version 1.0 in this post). Here's a brief description.

wait profile (from ASH)

For each statement, its wait profile is calculated fetching wait information from Active Session History:

-----------------------------------------
|ash event                    |cnt |%   |
-----------------------------------------
|enq: HW - contention         |2606|61.0|
|enq: TX - row lock contention| 875|20.5|
|db file sequential read      | 344| 8.0|
|enq: TX - index contention   | 158| 3.7|
|gc current grant busy        | 152| 3.6|
|cpu                          |  56| 1.3|
|gc current block 2-way       |  34| 0.8|
|gc current block busy        |  13| 0.3|
|gc buffer busy               |  10| 0.2|
|gc cr block 2-way            |   7| 0.2|
|gc current grant 2-way       |   5| 0.1|
|read by other session        |   5| 0.1|
|direct path write            |   3| 0.1|
|gc cr block busy             |   3| 0.1|
|gc cr grant 2-way            |   1| 0.0|
|SQL*Net more data from client|   1| 0.0|
|cr request retry             |   1| 0.0|
-----------------------------------------

By default this feature is on in 10g+ and inspects a window of ash_profile_mins=15 minutes from v$active_session_history.

Important note: you must have bought the appropriate Oracle licence (i.e. the Diagnostic Pack in 11.1) to read from that view and hence to use this feature (xplan will output a warning to remember you about that); you can disable this feature by setting ash_profile_mins=0.

Dump of dependent object definitions

If the statement references some database objects (e.g. a view, a pl/sql function) and hence depends on them, xplan will list them right below the statement text:

SELECT /*+ index(t,t_fbi) ordered use_nl(v) xplan_test_marker */
       T.RR, PLSQL_FUNC(MAX(T.X))
  FROM T, V
 WHERE UPPER(T.X) >= '0'
   AND T.X > :B1
   AND V.RR ='x'
 GROUP BY T.RR
 ORDER BY T.RR

- depends on view DELLERA.V
- depends on function DELLERA.PLSQL_FUNC

and the object definition will be reported at the bottom of the xplan output:

############################################# function DELLERA.PLSQL_FUNC ###
function plsql_func (p varchar2)
return varchar2
is
begin
  return p;
end plsql_func;
############################################# view DELLERA.V ###
view columns: #1 X(NUMBER),#2 PADDING(VARCHAR2),#3 RR(VARCHAR2)
select x, padding, rr
  from t
 where x > 0

Reading other RAC instance statements

Now you can read from another instance by specifying the option inst_id (defaults to the instance you are connected). This is handy for inspecting other instances of the RAC cluster without reconnecting.

Automatic dump of AWR most-expensive statements

The experimental script xplan_awr.sql will inspect AWR (Active Workload Repository) and dump all the statements that are still in the library cache and that have exceeded some resource consumption thresholds in any of the periods marked by two consecutive AWR snapshots. Thresholds can be the percentage of total (e.g. dump if the CPU consumption is more that 10% of total CPU) or the ranking position (e.g. dump if the statement ranks more than 5th in the CPU chart - the typical "top-N" analysis). The thresholds are configurable in the topmost "params" WITH clause.

Again, you must have bought the appropriate Oracle licence to use AWR, and hence to run xplan_awr.sql.

fast refresh of join-only materialized views – algorithm summary

This post investigates how Oracle fast refreshes materialized views containing only joins of master tables:

create materialized view test_mv
build immediate
refresh fast on demand
as
select test_t1.*, test_t1.rowid as test_t1_rowid,
       test_t2.*, test_t2.rowid as test_t2_rowid,
       test_t3.*, test_t3.rowid as test_t3_rowid
  from test_t1, test_t2, test_t3
 where test_t1.j1_2 = test_t2.j2_1
   and test_t2.j2_3 = test_t3.j3_2
;

The fast refresh algorithm is simple and very easy to understand - so trivial in fact that once examined and understood, the possible tuning techniques follow naturally.

The test case traces the fast refresh of the above materialized view (MV) using the 10046 event (aka "sql trace"). The test case has been run on 9.2.0.8, 10.2.0.4 and 11.1.0.7 (the latest versions of 9i, 10g and 11g available as of today), and on all of these versions the algorithm used by the refreshing engine (run by invoking dbms_mview.refresh) appears to be the same, with only a few implementation differences.

The test case explores the most general case: it performs inserts, updates and deletes on all the three master tables (the inserts being conventional; I will explore direct-path inserts another time).

Materialized view logs configuration

In the test case, I have configured the materialized view logs to "log everything", in order to check whether more information in the logs could trigger some special kernel code designed to take advantage of it:

create materialized view log on test_t1
with sequence, rowid, primary key (j1_2, x1)
including new values;

but the engine uses only the rowid information even in 11.1.0.7, so you are better off logging only the rowid if the master table feeds join-only materialized views exclusively:

create materialized view log on test_t1 with rowid;

Minimal logging obviously improves the performance of DML against the master tables, but it also optimizes the fast refresh, since the latter, as we are going to see in a moment, reads each log twice, and of course the less you log, the more compact the logs will be.

Log snapshots

After some preliminary visits to the data dictionary, the first operation performed by the fast refresh engine is to "mark" the modifications (recorded in the materialized view logs) to be propagated to the MV. Only the marked log rows are then fed by the fast refresh engine as input to the next steps.

The "flag" used to mark the rows is the column snaptime$$. When the refresh starts, the engine performs a "snapshot" of the materialized view logs by setting the snaptime$$ of all the new rows (those with snaptime$$ = '01/01/4000') of each log in turn to the current time (SYSDATE).

In detail, the snapshot is performed by issuing this SQL statement (slightly edited for readability) in 9.2.0.8 and 10.2.0.4:

update MLOG$_TEST_T1
   set snaptime$$ = :1
 where snaptime$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')

The bind variable :1 is a DATE whose value is equal to SYSDATE.

Note: In 11.1.0.7, the statement is slightly different but makes the same thing, probably in a more scalable way concurrency-wise (check the script spools if you're interested).

You might have noticed the where condition on snaptime$$; that is necessary since the logs might be used by more than one materialized view. When a refresh ends, in fact, the engine checks whether other MVs might need each log row, and deletes only the log rows that have been processed by all dependant MVs; the other ones are left unchanged (and hence keep the snaptime$$ that was set when the fast refresh started). The where condition is needed to avoid overwriting the snaptime$$, and mark with the current time only the brand new rows (those with snaptime$$ = '01/01/4000').

So, at the end of the snapshot, the log rows that must be examined by the refresh engine will be the ones that are marked by having their snaptime$$ between the date of the last refresh (excluded) and :1 (included). All the other log rows must be ignored.

Side note: marking data at a certain point in time and then replicating the marked data is the only replication strategy that can work when you cannot "freeze" the master tables, as this is definitely our case. This is a general topic worth blogging about in the future.

The marked log rows are then inspected to count the number and type of the logged modifications. This is to check whether any of the replication steps (i.e. the DEL and INS steps that we are going to discuss in a moment) could be skipped. Also, the number of modifications is used (in some versions) to inject some hints in the SQL statements of the replication steps, a topic that falls out of the scope of this post.

Core algorithm: the INS and DEL steps

Then, the core replication starts. The replication considers each master table in turn, and for each table, propagates the modifications to the MV. So we have essentially one single algorithm that propagates from a single master table, just repeated once for each master table.

The propagation for each master table is made of two simple steps, steps that I'm going to name after the comments of the SQL as a DEL (DELETE) step followed by an INS (INSERT) step.

The DEL step is (editing for readability: removing hints, unnecessary aliases, etc):

/* MV_REFRESH (DEL) */
delete from test_mv
 where test_t1_rowid in
       (
select * from
       (
select chartorowid (m_row$$)
  from mlog$_test_t1
 where snaptime$$ > :1
       ) as of snapshot (:2)
       )

The subquery simply fetches the rowid of all marked rows, since :1 is the date of the previous refresh of the materialized view, and :2 is the SCN (coded as a RAW variable) of the time when the snapshot was performed. So, this step deletes from the MV all the rows that record the result of the MV-defining join of any of the marked rows (of the current master table) with the other master tables.

This is the step that can benefit from the index on the column that stores the master table rowid (here, test_t1_rowid) that the documentation suggests to create. Note that in order to optimize this step, you need three separate single-column indexes (here, on test_t1_rowid, test_t2_rowid, test_t3_rowid), not a single composite index spanning the (here, three) columns, as it is sometimes wrongly stated.

The INS step is (again editing for readability):

/* MV_REFRESH (INS) */
insert into test_mv
select jv.j1_2, jv.x1, jv.pk1, jv.rid$,
       mas2.j2_1, mas2.j2_3, mas2.x2, mas2.pk2, mas2.rowid,
       mas3.j3_2, mas3.x3, mas3.pk3, mas3.rowid
  from (
select log.rowid rid$, log.*
  from test_t1 log
 where rowid in
       (
select chartorowid(log.m_row$$)
  from mlog$_test_t1
 where snaptime$$ > :1
       )
       ) as of snapshot (:2) jv,
       test_t2 as of snapshot (:2)  mas2,
       test_t3 as of snapshot (:2)  mas3
 where   jv.j1_2 = mas2.j2_1
   and mas2.j2_3 = mas3.j3_2

The subquery is the same as the DEL step and serves the very same function. So, this statement replays the SQL statement of the MV definition, but limited to the marked rows only. Note that all tables are read at the same point in time in the past, the time when the snapshot of the log was performed, thanks to the argument of the "as of snapshot" clause being the same.

In order to speed up the INS step, indexes on the joined columns can be created on the master tables (not on the MV!). This is because, special cases aside, it is well known that the "fast refresh" (the name itself is quite horrible, many people prefer the adjective "incremental") can be actually "fast" only if a small fraction of the master tables is modified (otherwise, a complete refresh is better); in this scenario, almost certainly the optimal plan is a series of NESTED LOOPs that has the current table (test_t1 in this case) as the most outer table, series that can usually benefit a lot by an index on the inner tables join columns. Of course, you must remember that every table, in turn, acts as the most outer table, hence you should index every possible join permutation.

So here what the algorithm is all about: the DEL and INS steps, together, simply delete and recreate the "slice" of the MV that is referenced by the marked rows, whatever the modification type. The algorithm is as simple (and brutal) as it seems.

Algorithm optimizations

The only optimizations performed are the skipping of some steps when they are obviously unnecessary. For every master table, the DEL step is skipped when only INSERTs are marked in the logs; the INS is skipped when only DELETEs are marked, and of course both are skipped if no row is marked. I have not been able to spot any other optimization.

Note that this means that UPDATEs always turn into a delete+insert of the entire "slice". For example, consider the typical case of a parent table (say, CUSTOMER), with a child (say, ORDER) and a grandchild (say, ORDER_LINE); if you update a column of a row of the parent (say, ORDERS_TOTAL_AMOUNT), the parent row and its whole progeny (the "slice") will be deleted and then recreated. This was a quite surprising discovery for me - a fact that I have now committed to memory.

Bind Variables Checker for Oracle – now install-free

I've finally managed to implement an install-free version of my utility to check for bind variables usage. The new script is named bvc_check.sql and when run, it examines the SQL statements stored in the library cache (through gv$sql) and dumps the ones that would be the same if the literals were replaced by bind variables.

An example of the output:

------------------
statements count :  0000000003
bound    : select*from t where x=:n
example 1: select * from t where x = 2
example 2: select * from t where x = 3
------------------

So we have 3 statements that are the same once literals are replaced with bind variables. Two examples are provided; the action of replacing the literals 2 and 3 with the bind variable :n makes the statements the same.

The script are available on this page, that also explains the script workings in more detail and describes other scripts that might be of interest.

Tuning Oracle for Siebel – SQL template

The time has come to write down some of the most relevant discoveries I've made so far while being part of a team that is tuning a huge Siebel installation for a leading Italian company ("huge" especially because of the user base dimension and secondarily because of the hardware deployed, a three-node RAC on pretty powerful SMP machines).

This blog entry is about the structure of the Siebel queries and the effect of the settings of some CBO-related parameters - settings made by the Siebel clients by altering the session at connect time, or required as mandatory in the Siebel installation notes. Other postings may follow in case I discover something worth writing about.

But first of all, let me thank for their support all the fellow members of the OakTable network (especially Tim Gorman that has exchanged long emails with me) and Andy Cowling (introduced to me by Doug Burns) that kindly provided me with a lot of detailed information coming from their vast hands-on experience with Siebel on Oracle.

For the record, the environment is Siebel 8.0, using a three-node 10.2.0.4 RAC cluster on identical SMP machines with 16 CPUs each.

Most of the Siebel queries follow this template:

select ...
  from base, inner1, ..., innerN
 where base.j1 = inner1.j1(+)
   and base.j2 = inner2.j2(+)
   ...
   and base.jN = innerN.jN(+)
   and "filter conditions on base"
  order by base.order1 [asc|desc], base.order2 [asc|desc], ..., base.orderK [asc|desc];

which must be discussed keeping in mind another critical and subtle information: the Siebel clients read only a subset of the returning rows (probably the ones that fit on the screen, sometimes only 2 or 3 on average) - an intention that the Siebel client wisely communicates to the CBO by altering the session and setting optimizer_mode = first_rows_10.

Side note: there are variants to this template; sometimes there are two or three base tables that are joined and filtered together, and more frequently, some of the outer join relations are based on one of the innerN tables, not on the base tables (eg. innerM.jN = innerN.jN), but that is inessential for what we are going to discuss here.

The Siebel client's intention is clear: get some rows from the base table, order them by the columns order1, order2, ..., orderK, get only the first rows, and then get additional information (columns) by following the outer join relations. Note the order of the operations.

Visiting the outer-joined tables almost always comes out as the most expensive part. The reason is two-fold; first, the sheer number of outer-joined tables (ten tables is the norm, up to about 50, since the Siebel relational model is highly normalized and hence the information is highly dispersed), and second and most importantly, because of the join method the CBO is forced to follow.

In fact, Siebel blocks HASH JOINS (_hash_join_enabled=false) and SORT MERGE JOINS (_optimizer_sortmerge_join_enabled=false), which leaves NESTED LOOPS as the only surviving join method. When nested looping, Oracle must obviously start from the outer table listed in the SQL statement (base) and then visit the SQL inner tables (inner1 ... innerN) using the join conditions, something that can be efficiently done only by visiting an index that has innerN.jN as the leading column of its definition. Each of these visits requires blevel+1 consistent gets for the index and some additional consistent gets for every row that is found - usually one or two, sometimes zero, at least in the scenarios I've seen.

So it is clear that every row that is fetched from the base table may easily produce tens and tens of consistent gets coming from the outer joins - and that can easily lead to a disaster if the number of rows from the base table is not very very small. In one of our queries, for example, about one million rows came out from the base table, and since the CBO chose to outer join first and then order, each execution caused a whopping 15,000,000 consistent gets - only to have the Siebel client select the first 3 rows or so. Needless to say, such a huge number of gets is completely unacceptable; the impact on the response time and CPU consumption is obvious, but scalability suffers as well (gets cause latches or mutexes acquisitions that are the most frequent cause of contention) - that means that even a slight increase on the workload may cause severe degradation of response time (especially on RAC).

The solution is to have Oracle order first, and then join - to very quickly feed the first 3 or 4 rows to the client thus sparing the effort to outer join the remaing ones, that will never (or rarely) be fetched. That usually means that you must prepare an ad-hoc index for the query that has the order1,order2, orderK column as the last columns; the first_rows_10 setting will strongly encourage the CBO into choosing the index. For example, say that the last "filter conditions on base" and the order clause are

...
and base.col1 = :1 and upper(base.col2) = :2
order by base.order1 desc, base.order2 asc;

a suitable index might be (col1, upper(col2), order1 desc, order2 asc).
Oracle will hopefully reach the start of the index range (or "slice") corresponding to the filter condition, a range that has the rows sorted exactly as Siebel wants them, and then scan the index range rows in succession; for each of them, it will outer join and then return the result to Siebel, and hence stop after a few rows read from the index and a few outer joins performed.

This is the main "tuning" technique in Siebel; of course there are others (using covering indexes for the outer join tables for example) that can provide some additional advantage to our beloved Oracle instances, that we are going to investigate and that I will blog about if they turn out as being useful. But I doubt that they can be as effective as this one.

« Previous PageNext Page »

Links (alphabetical order)

Blogroll (alphabetical order)

Blog Aggregators (alphabetical order)