### 11gR2: materialized view logs changes

In this post we are going to discuss some 11gR2 changes to materialized view logs that are aimed at increasing the performance of the fast-refresh engine of materialized views (MVs), especially the on-commit variant.

The MV logs, in 10gr2, now comes in two flavours: the traditional (and still the default) **timestamp**-based one and the brand new **commit SCN**-based one; you choose the latter type by specifing the "WITH COMMIT SCN" clause at MV log creation time. Interestingly, the "old" timestamp-based implementation has been changed as well. Let's examine both with the help, as usual, of a test case.

**Timestamp-based MV logs (the "old" type)**

The test case configures an MV log as "log everything", that is, it activates all the logging options:

create materialized view log on test_t1 with sequence, rowid, primary key (x1) including new values;

In pre-11gR2 (e.g. in 11.1.0.7, 10.2.0.4), the MV log columns were:

pk1 number(22) x1 number(22) m_row$$ varchar2(255) sequence$$ number(22) snaptime$$ date(7) dmltype$$ varchar2(1) old_new$$ varchar2(1) change_vector$$ raw(255)

now in 11gR2 (11.2.0.1):

pk1 number(22) x1 number(22) m_row$$ varchar2(255) sequence$$ number(22) snaptime$$ date(7) dmltype$$ varchar2(1) old_new$$ varchar2(1) change_vector$$ raw(255) xid$$ number(22)

the only difference is the new column xid$$ (transaction id) that uniquely identifies the transaction that made the changes to the row. For the curious, the number is a combination of the elements of the triplet (undo segment number, undo slot, undo sequence); it is simply the binary concatenation of the three numbers shifted by (48, 32, 0) bits respectively (as checked in the script).

The xid$$ column is used by the 11gR2 on-commit fast refresh engine, which can now easily retrieve the changes made by the just-committed transaction by its xid; at the opposite, the on-demand fast refresh one keeps using snaptime$$ as it did in previous versions. I will speak about this in more detail in an upcoming post.

**Commit SCN-based MV logs (the "new" type in 11gR2)**

Let's recreate the same MV log, this time adding the commit SCN clause (new in 11GR2):

create materialized view log on test_t1 with sequence, rowid, primary key (x1), COMMIT SCN including new values;

The columns of the MV log are:

pk1 number(22) x1 number(22) m_row$$ varchar2(255) sequence$$ number(22) dmltype$$ varchar2(1) old_new$$ varchar2(1) change_vector$$ raw(255) xid$$ number(22)

so, the only difference from the 11gR2 timestamp-based case is that snaptime$$ is no longer a column of the MV log; the only difference from the pre-11gR2 is that snaptime$$ has been replaced with xid$$.

For this log flavour only, the mapping between the xid that modified the table and its commit-time SCN is now tracked in a new view, all_summap (probably named after "SUMmary MAP", "summary" being yet another synonym for "MV"), which is (as of 11.2.0.1) a straight "select *" of the dictionary table sys.snap_xcmt$. To illustrate, the script makes one insert, one update and one delete on the base table, which translates into 4 rows inside the MV log with the same xid:

SQL> select distinct xid$$ from mlog$_test_t1; XID$$ --------------------- 1126024460895690

after the commit, we get

SQL> select * from all_summap where xid in (select xid$$ from mlog$_test_t1); XID COMMIT_SCN --------------------- ---------- 1126024460895690 2885433

hence, it is now possible to know the infinite-precision time (the SCN) when every modification became visible to an external observer (the commit SCN) by simply joining the MV log and all_summap (or sys.snap_xcmt$). Note that the commit SCN is not propagated to the MV log at all.

**commit SCN-based MV logs for on-demand fast refresh**

This new xid$$ column and commit-SCN mapping table are leveraged by the fast refresh of on-demand MVs as follows (on-commit ones do not need the SCN as they know exactly the xid of the committed transaction; again we will see that in an upcoming post).

With "old style" timestamp-based MV logs, the refresh is performed by using a "mark-and-propagate" algorithm, which is essentially (check this post for some additional details):

1) new log rows are inserted with snaptime$$=4000 A.D;

2) at refresh time, a snapshot of the new rows is taken, that is, all new rows are marked with snaptime$$=sysdate;

3) all modifications whose snaptime$$ is between the date of the last refresh (excluded) and sysdate(included) are propagated to the MV;

4) all obsolete log rows are deleted, that is, all rows whose snaptime$$ is less than or equal the lowest of all refresh times are removed from the log.

With "new style" SCN-based MV logs, the algorithm is, instead:

1) new log rows are inserted with xid$$=transaction id of modifing transaction;

2) at refresh time, the current SCN is retrieved (no snapshot is performed);

3) all modifications whose xid maps to a row in all_summap whose commit_scn is between the SCN of the last refresh (excluded) and the retrieved current SCN(included) are propagated to the MV;

4) obsolete rows are removed from the log as before, this time using the SCN instead of snaptime$$.

The main advantage is that the snapshot is not performed, thus removing the redo and undo generated by the update, and obviously the log visit (usually a full table scan) as well - at the cost of an additional join with all_summap (or sys.snap_xcmt$) later; if the join is calculated efficiently, that is very likely advantageous "in general" (but as always, it depends on your scenario).

It might be (rarely) beneficial to index xid$$, as it is (rarely) beneficial to index snaptime$$. In that case, having no snapshot performed reduces both the undo and redo generated for the index maintenance.

As a side and "philosophical" note, it is also worth noting that the new logging mechanism records more information - now we know which transactions modified the table and the infinite-precision time (the SCN) of modifications, and this is much more informative about the history of the logged table than the mostly meaningless refresh time contained in snaptime$$. This is definitely a better utilization of storage.

I plan to blog about how the new MV log impact fast refreshes in 11gR2 in the near future, focusing on join-only MVs; so stay tuned if you're interested.

- Tuesday, November 3, 2009 materialized views
- 16 Comments »

### CBO: NewDensity for Frequency Histograms,11g-10.2.0.4 (densities part IV)

As we have seen in the previous posts of this series, in 11g a new figure named "NewDensity" has been introduced as a replacement for the "density" column statistic for columns whose histogram has been collected; this change has been backported in 10.2.0.4 also.

In the previous post we discussed how NewDensity influences the CBO cardinality estimate for Height-Balanced histograms; in this one we are going to investigate the same for Frequency Histograms. We will see that the most important change is the introduction of the “half the least popular" rule (see the "Histogram change" post by Jonathan Lewis, which distills the findings of Randolf Geist and Riyaj Shamsudeen) - a surprising rule that might easily cause trouble (in fact as Jonathan reports in the comments - bug 6082745 was opened against this rule).

The test case (script density_post_freq.sql) considers the same test statement we focused on in the previous post (a single equality filter predicate which asks for a value inside the min-max range of the column):

select ... from t where value = 64.5;

Of course we compute a Frequency instead of an Height-Balanced histogram, and use a slightly different value distribution in order to highlight the new rule:

SQL> select value, count(*) 2 from t 3 group by value 4 order by value; VALUE COUNT(*) ---------- ---------- 8 8 16 16 64 64 128 128

The histogram generated by the test case is (from DBA_HISTOGRAMS):

VALUE EP BKT ---------- ---------- ---------- 8 8 8 16 24 16 64 88 64 128 216 128

VALUE is an abbreviation for ENDPOINT_VALUE, EP for ENDPOINT_NUMBER.

BKT is the number of buckets covered by the value (i.e.: EP minus the previous EP), that is, the number of rows whose column value was equal to VALUE at statistics collection time.

When the filter predicate selects a value contained in the histogram, the new releases behave the same as the old ones (but check the "bottom note about singleton values" at the bottom for a a minor but interesting detail): neither density nor NewDensity is used, and the cardinality estimate is the usual intuitive one. In the complementary case of a value not contained in the histogram (but still inside the min-max interval), the cardinality used to be calculated as density*num_rows and it is now NewDensity*num_rows. Note the simmetry with the Height-Balanced case: the formula is the same, with NewDensity simply replacing density.

**NewDensity with the “half the least popular" rule active**

By default the rule is active, and in this case, NewDensity is set to

NewDensity = 0.5 * bkt(least_popular_value) / num_rows

and hence, for non-existent values:

E[card] = (0.5 * bkt(least_popular_value) / num_rows) * num_rows = 0.5 * bkt(least_popular_value)

For our test case, the least_popular_value is 8 and bkt(8) = 8, hence E[card] = 0.5 * 8 = 4 thanks to NewDensity being equal to 0.5 * 8 / 216 = 0.018518519. In fact, we can verify in the 10053 traces (in 10.2.0.4, 11.1.0.7, 11.2.0.1) for our statement, that asks for a not-existent value (64.5), that E[card] and NewDensity are set as above:

NewDensity:0.018519, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Using density: 0.018519 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 216.000000 Rounded: 4 Computed: 4.00 Non Adjusted: 4.00

As another check, let's see what happens if bkt(least_popular_value) = 1, that is, if there is (at least) one value that occurred exactly one time (a singleton value) at statistics collection time. Adding such a row to our test case is trivial (just uncomment the first insert row in the script); in this scenario, our formula above predicts E[card] = 0.5 with NewDensity = 0.5 / 217 = .002304147, and in fact (check the *_least_is_one.trc traces):

NewDensity:0.002304, OldDensity:0.002304 BktCnt:217, PopBktCnt:216, PopValCnt:4, NDV:5 Using density: 0.002304 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 217.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50

note that E[card] gets rounded up from 0.5 to 1 (as usual).

What is the rationale behind this rule? Thanks to Randolf Geist (see the comment in Jonathan's blog entry above), we know that it was introduced as a patch to solve one particular scenario (see bug 5483301) and then included in the main release, for some reason. Luckily, the rule can be disabled and the old sane behaviour can be restored.

**NewDensity with the “half the least popular" rule disabled**

To disable the new rule, just switch off the patch 5483301:

alter session set "_fix_control"='5483301:off';

(or alter system if you want to make it permanent)

with this setting, NewDensity becomes simply

NewDensity = 0.5 / num_rows

and hence, again for non-existent values:

E[card] = 0.5

which is exactly what we got in pre-10.2.0.4, where density was used (and density was, and is still, set to 0.5 / num_rows by dbms_stats). So the cardinality estimate is 0.5 (rounded up to 1).

For our test case, we predict NewDensity = 0.5 / 216 = 0.002314815. In fact our 10053 traces tell us:

NewDensity:0.002315, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Table: T Alias: NOT_EXISTENT_OFF Card: Original: 216.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50

The rationale for this behaviour is sound; the CBO knows that no row with the requested value existed at statistics collection time, hence it returns the minimal cardinality estimate compatible with the non empty result set assumption (check this post for the importance of this assumption). If the statistics are reasonably fresh, this is the only sane estimate that can be made.

**Playing with density - a warning**

If you set your own column stats using dbms_stats.set_column_stats, the behaviour is different; I haven't made any extensive investigations but as far as I can tell, the value you provide for density is used instead of NewDensity. User-provided column statistics are flagged with dba_tab_cols.user_stats = 'YES'. You can disguise your user statistics as non-user by setting the flags parameter of dbms_stats.set_column_stats to 2 - but since the latter parameter is labeled as "for Oracle internal use only", I would do it only for investigations purposes - that is, never in production.

---

*Bottom note about singleton values*: actually in pre-10.2.0.4 versions, if the value was present in the Frequency histogram but covering a single bucket (hence it was present in the table exactly one time at statistic collection time), it used to be classified as "unpopular" and hence used to get the same treatment as a value not in the histogram - the end result being that the cardinality was estimated as 0.5 rounded up to 1; now it is 1 before rounding as one would intuitively expects. I hope to be able to investigate whether this change fixes the issues about join cardinality estimation I investigated - see "the mystery of halving" in this investigation of mine if interested.

Other posts belonging to this series:

densities part I

densities part II

densities part III

- Friday, October 23, 2009 CBO
- 14 Comments »

### CBO: “NewDensity” replaces “density” in 11g, 10.2.0.4 (densities part III)

In this post we are going to explore and explain the rationale for the formula used by the CBO to compute the "NewDensity" figure that replaces, from 10.2.0.4 onwards, the "density" column statistic in the cardinality estimation formulae for columns with height-balanced (HB) histograms defined.

In a previous post, we already discussed the pre-10.2.0.4 scenario: we saw how and when the "density" column statistic is used in the cardinality formula for equality filter predicates, we explained its statistical rationale and defining formula, introduced the concept of the NPS (Not Popular Subtable), and built a test case. Now we are going to use the very same test case and explain the differences in the most recent versions (the previous post zip file contains logs for them also).

To summarize the test case - we have a table T with a single column VALUE, exponentially distributed, and with a SIZE 5 Height-Balanced histogram collected on. The histogram is:

SQL> select ep, value, popularity from formatted_hist; EP VALUE POPULARITY ---------- ---------- ---------- 0 1 0 1 16 0 5 64 1

Thus, we have a single popular value, 64; all the others are unpopular.

In this "densities" series of post, we focus on a SQL statement that contains only an equality filter predicate on table T:

select ... from t where value = 2.4;

the literal value is not a popular value (but inside the 1-64 interval) and hence, in pre-10.2.0.4, the formula used for the expected cardinality calculation is equal to:

E[card] = density * num_rows;

We discussed, in the previous post, how density is carefully calculated by dbms_stats to get back the expected cardinality of the family (class) of all possible equality filter predicate statements that hit the NPS, under the usual "non-empty result set assumption" and the further (strange and strong) assumption that the more a value is represented in the NPS, the higher the probability that the value is used as the literal of the equality predicate (an assumption that mathematically translates into the formula "w(:x) = count(:x) / num_rows_nps").

Switching to 10.2.0.4 - the formula for E[card] is still the same, but with "density" replaced by "NewDensity" (as hinted by the fact that "density" is reported as "OldDensity" in the 10053 trace files, as we are going to see in a moment):

E[card] = NewDensity * num_rows;

NewDensity is not stored anywhere in the data dictionary, but it is computed at query optimization time by the CBO (note that density is still computed by dbms_stats using the old formula, but then it is ignored by the CBO). The NewDensity formula is based mainly on some histogram-derived figures; using the same names found in 10053 traces:

NewDensity = [(BktCnt - PopBktCnt) / BktCnt] / (NDV - PopValCnt)

Where BktCnt ("Bucket Count") is the number of buckets (the "N" in the "SIZE N" clause);

PopBktCnt ("Popular Bucket Count") the number of buckets *covered* by the popular values;

PopValCnt ("Popular Value Count") is the number of popular values; NDV ("Number of Distinct Values") is the traditional name used by CBO developers for the num_distinct column statistic. With the exception of NDV, all these values are derived from the histogram.

Side note: if the numerator is equal to zero, NewDensity is set to 0.5 / num_rows, thus giving an E[card] = 0.5, as far as I have seen (not exaustively) in a few test cases; it looks like a lower-bound "sanity check". The denominator cannot be zero for HB histograms.

To illustrate the formula: the histogram of our test case has 5 buckets, hence BktCnt=5; 64 is the only popular value, hence PopValCnt =1; this popular value covers 4 buckets (since its EP is 5 and the previous EP is 1), hence PopBktCnt=4; we know that the column has num_distinct=6, hence NDV=6. This is in fact what we see in the 10053 trace file (in 11.1.0.7 and 11.2.0.1):

SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for T[T] Column (#1): NewDensity:0.040000, OldDensity:0.115789 BktCnt:5, PopBktCnt:4, PopValCnt:1, NDV:6 Using density: 0.040000 of col #1 as selectivity of unpopular value pred Table: T Alias: T Card: Original: 95.000000 Rounded: 4 Computed: 3.80 Non Adjusted: 3.80

So NewDensity = [(5-4)/5] / (6-1) = 1/25 = 0.04 and E[card]=0.04*95=3.8, which is exactly what we see in the above trace fragment.

The formula is statistically based on replacing the previous versions' assumption (that we labeled "strange and strong") about w(:x) with the standard assumption that the client will ask for the values in the NPS with the same probability; mathematically, that means replacing the formula "w(:x) = count(:x) / num_rows_nps" with the standard "w(:x) = 1 / num_distinct_nps" (where num_distinct_nps is of course the number of distinct values of the NPS). If you plug this shape of w(:x) into the formula for E[card], you get

E[card] = sum ( w(:x) * E[count(:x)] ) = = sum (E[count(:x)] ) / num_distinct_nps for all values of :x (belonging to the NPS)

that is

E[card] = num_rows_nps / num_distinct_nps

which is, not surprising, the standard formula used for columns without histograms, but applied to the NPS, not the whole table.

One possibility for producing the above E[card] value at run-time could have been to change dbms_stats to compute a value for "density" equal to (num_rows_nps / num_distinct_nps) / num_rows; but forcing users to recompute statistics for all their tables in their upgraded databases is not really a viable option. Hence, the CBO designers chose to simply ignore "density" and calculate the above formula at run-time, mining the histogram, at the cost of reduced precision. In fact, the easy part is num_distinct_nps, which is obviously exactly equal to num_distinct minus the number of popular values; but num_rows_nps can only calculated approximately, since the histogram is a (deterministic) sample of the column values obtained by first sorting the column values and then sampling on a uniform grid (for more information and illustration, see the first part of this article of mine). Using the histogram, the best approximation for num_rows_nps is num_rows times the fraction of buckets not covered by popular values. Hence, using the 10053 terminology

num_distinct_nps = NDV - PopValCnt (exactly) num_rows_nps = [(BktCnt - PopBktCnt) / BktCnt] * num_rows (approximately)

which gets back (again, approximately) the E[card] formula above, as can be trivially checked.

It might be desirable that one day, NewDensity gets calculated exactly by dbms_stats and stored in the data dictionary, at least for columns with new statistics, albeit the precision reduction is probably more than acceptable (that is, I have never seen a case where that has been an issue). The test case script, just for the sake of completeness, calculates the exact figure as well; it gets back an E[card] of 6.2 instead of 3.8.

For a summary of the above discussion and some more discussion, check back this investigation of mine. By the way, NewDensity replaces "density" also in join cardinality formulae, even if I have not run a complete investigation - but that is not surprising at all.

As a final nore - NewDensity is used also for Frequency Histograms, and in a very creative way; we will discuss this in part IV of this series.

Other posts belonging to this series:

densities part I

densities part II

densities part IV

- Friday, October 16, 2009 CBO
- 7 Comments »

### CBO: the formula for the “density” column statistic (densities part II)

In this post we are going to explore and explain the rationale for the formula used by dbms_stats to compute the "density" column statistic, used by the CBO in versions less than 10.2.0.4 to estimate the cardinality of a class of SQL statements. In the next post, we will speak about its replacement, named "NewDensity" in 10053 trace files.

We will consider only the non-trivial case of Height-Balanced histograms, since for Frequency Histograms density is a constant (0.5 / num_rows) and for columns without histogram, it is simply 1/num_distinct.

Let's illustrate the test case on which we will base our discussion, contained in this zip file.

First, a table T is created with the following exponential value distribution:

SQL> select value, count(*) 2 from t 3 group by value 4 order by value; VALUE COUNT(*) ---------- ---------- 1 1 2 2 4 4 8 8 16 16 64 64

The test case then computes a SIZE 5 Height-Balanced histogram. The resulting histogram (from dba_histograms) is as follows (note that I have added the column POPULARITY that marks popular values with "1"; EP is shorthand for column ENDPOINT_NUMBER, VALUE for column ENDPOINT_VALUE):

SQL> select ep, value, popularity from formatted_hist; EP VALUE POPULARITY ---------- ---------- ---------- 0 1 0 1 16 0 5 64 1

The test case then issues this SQL statement that contains only an equality filter predicate on table T:

select ... from t where value = 2.4;

The literal value 2.4 is not contained in the table (and hence in the histogram), in order to make the CBO factor in "density" in its estimate of the expected cardinality - in fact, as it might be known, density is used when the literal is not popular (that is, not equal to 64 in our case), and it doesn't matter whether the literal is not contained in the histogram, or contained as an unpopular value (1 and 16 in our case), or even contained in the table or not. All it takes is its being not popular.

Side note: I'm assuming the literal is inside the closed min-max interval (1-64 in this case); when outside, it depends on the version.

When the literal is not popular, the formula used for the expected cardinality calculation is equal to

E[card] = density * num_rows;

That is easy to verify from the test case logs; in 9i we can see that density = 0.115789474 and num_rows=95, hence 0.115789474 * 95 = 11.000000000 which is exactly equal to the CBO estimate for our statement.

The formula used by dbms_stats to compute "density" was published in Jonathan Lewis' book Cost Based Oracle (page 172) and Wolfgang Breitling's presentation Histograms - Myths and Facts. The key fact is that the formula takes as input the rows of what I've nicknamed the not-popular subtable (NPS), that is, the original table without the rows whose values are popular values (in this case, 64 is the only popular value). Letting num_rows_nps the number of rows of the NPS (for our example, num_rows_nps=1+2+4+8+16=31), we have:

density = (1 / num_rows) * sum (count (value) ^ 2) / num_rows_nps for "value" belonging to the NPS

The script performs this calculation automatically; it is anyway instructive to perform the calculation manually at least one time:

density = (1/95) * (1*1+2*2+4*4+8*8+16*16) / 31 = .115789474

that matches perfectly the density we observed in the script log before.

What is the statistical rationale for this seemingly strange computation?

If we plug it inside the formula for E[card], we can see that num_rows is cancelled:

E[card] = sum (count (value) ^ 2) / num_rows_nps summed over all "values" belonging to the NPS

Now we must reference the statistical concepts introduced in this post, and consider the family of all statements of our kind that can reference the NPS:

select ... from t where x = :x; :x being a value belonging to the NPS

its E[card] is

E[card] = sum ( w(:x) * E[count(:x)] ) for all values of :x (belonging to the NPS)

dbms_stats takes count(:x) as the best estimate for E[count(:x)] (for example, E[count(4)] = count(4) = 4 in our case). All we have to do in order to obtain the observed formula, is to assume w(:x) = count(:x) / num_rows_nps:

E[card] = sum ( (count(:x) / num_rows_nps) * count(:x) ) = sum ( count(:x) ^ 2 ) / num_rows_nps for all values of :x (belonging to the NPS)

The meaning of the above particular shape of w(:x) is that the probability that the client submits a certain value for :x is proportional to the number of rows (in the NPS) that has that value; more precisely, that if X% of rows has a certain common value, X% of user-submitted statements that "hit" the NPS will ask for that value. Under this assumption, dbms_stats precomputes "density" to give back the above E[card] when the literal is known to be not popular, hence hitting the NPS - remember that the CBO operates under the "non-empty result set assumption", hence if the literal does not hit a popular value, it must hit a value of the NPS.

The above assumption for w(:x) is quite a strange assumption - and in fact, we will see in the next post that in 11g (and 10.2.0.4), this assumption has been dropped and replaced with a more standard one. The "density" column statistics is in fact *ignored* in 10.2.0.4+ and a value computed at run-time, named "newDensity" in 10053 trace files, is used instead.

Other posts belonging to this series:

densities part I

densities part III

densities part IV

- Saturday, October 10, 2009 CBO
- 6 Comments »

### CBO: about the statistical definition of “cardinality” (densities part I)

Let's explore the concept of cardinality from the point of view of the statistician; this is both to get a clearer vision of the matter (i.e. for fun) and to path the way for understanding the rationale for the "density" statistics as calculated by dbms_stats (the topic of an upcoming post).

Let's consider a statement with input parameters (bind variables), and consider the most fundamental of them all, the one with a filter predicate:

select ... from t where x = :x;

the cardinality "card" of the set of rows retrieved depends on the table possible values and the actual inputs provided by the client as bind variable values. What about the expected value E[card] ?

Let:

1) w(:x) ("w" stands for "workload") the probability mass function of the random variable :x (that completely characterizes the workload);

2) E[count(:x)] the expected value of the cardinality of the rows retrieved for each value of :x.

We have, assuming that the two are independent:

E[card] = sum ( w(:x) * E[count(:x)] ) for all values of :x

To solve the formula we have to know (or assume) the client-dictated w(:x). The same goes for the table potential population (that -together with the statement of course- shapes E[count(:x)]); we must either have some statistical measurements about the table (for example, a frequency histogram on column X that we consider representative of the table population) or assume them (for example, assume a certain distribution for the column X values).

It is interesting to explore the most used scenario: a uniform (or assumed uniform) distribution for the column X values, of which we know the number of distinct values num_distinct(X) and the total number of values num_rows (let them be deterministically known for simplicity, and exclude null values). That means that E[count(:x)]) is equal to num_rows / num_distinct(X) *over a finite set that contains num_distinct(X) values* and is equal to zero over the remaining ones.

It is relatively easy to see that E[card] depends on how w(:x) and E[count(:x)]) overlap. At one end of the spectrum, if the client always submits values for :x that are not contained in the table, E[card] is zero - since the client choice matematically translates into E[count(:x)]) being zero over all values of :x for which w(:x) is non zero. At the other end of the spectrum, that is, under the non empty result set assumption, we have E[card] = sum ( w(:x) ) * num_rows / num_distinct(X) = num_rows / num_distinct(X), the usual formula used by the CBO in many (most) situations.

The "density" column statistic formula can be derived in a similar way, but using a different assumption about w(:x) - as we will see in a dedicated post.

Other posts belonging to this series:

densities part II

densities part III

densities part IV

- Saturday, October 3, 2009 CBO
- 3 Comments »

### An interview with Mark Townsend

While attending the 11gR2 launch event in Milan last Thursday, I had the distinguished opportunity (invited, as a blogger, by the Oracle team that was organizing the event) to meet Mark Townsend and exchange a few words about the new features of 11gR2 and the Oracle database in general as well.

For those who don't know, Mark is (among other things) the Vice President in charge of coordinating the Product Managers and a technical expert at the same time, and this rare combination has the advantage that you can ask him about any feature you like at whatever granularity you like, from the strategic level down to the technical gory details. In fact Mark is frequently seen at public events (such as Oracle Open World), speaking to mixed audiences composed of both Engineers and Management.

We spoke about many new features of 11gR2, so much in fact that in order to do justice to the information that Mark very kindly provided me, I will use them as the foundation of some future blog posts. In brief anyway, I noticed that this release has features that are targeted mostly to vastly improve the "grid", but with very interesting features for the "core" as well (my personal favourites being the in-memory parallel execution and the new SCN-based MV log, which I plan to blog about in the very near future).

But especially, I didn't miss the unique opportunity to discuss with one of the top players of Oracle Corporation about the amount of information that Oracle shares with its professionals, obviously trying to push for much more. I'm sure that everyone that works with any kind of software product agrees with me on the fact that knowing how the product works is key not only to troubleshooting (that being almost obvious) but to good design also, or perhaps especially; the more you know, the better designer you are going to be.

We had a lot of back and forth on this topic, but to summarize, Mark agreed on detailed information being very useful for experts, but also pointed out that it takes years and years to become proficient enough to be able to digest very detailed documentation, and in the meanwhile, too much information is going to confuse, rather than clear things (especially for juniors coming from other database products). Just the current detail level is overwhelming, the current docs being composed of a staggering 21,000,000 words - and Oracle is in fact trying to organize the documentation in layers as much as possible, starting from high-level descriptions (the Two Day DBA course), than the Concept manual, and than the rest. The most details, however, will be still provided forever as Metalink (aka "My Oracle Support") notes, the only reason being to avoid confusing people, not to hide anything (besides strategic algorithms not covered by Patents of course).

So in short, from this discussion I have understood that Oracle is willing to share information with its community; it only wants to find the right way to do so, since the community is huge (hundreds of thousands of people) and the product is very, very complex. Anyway, I have insisted for more information about the two topics that I'm sure that are not documented well enough, naming the "auto" features and the CBO algorithms - and actually I feel like I have insisted perhaps too much on that with Mark ... but that's something I really care about, both for professional reasons and keen interest alone.

Well, I might add that information sharing, and community involvement as well, is in my opinion one of the main factors of Oracle's success; actually, being able to dig a lot into the inner workings of the product is the reason why I chose an Oracle career ten years ago.

PS I'm back from my vacation and I have a lot of interesting things to investigate at work that look like perfect candidates for being turned into posts, so I will be able to blog more frequently in the future. I also have a series of posts about the CBO that is "almost complete".

- Tuesday, September 29, 2009 Uncategorized
- Comment now »

### CBO: the “non-empty result set” assumption

The CBO assumes that SELECT statements are always going to retrieve at least one row - even if this is not necessarily the case, of course. Understanding *why* this is done is both useful and fascinating.

We must start from the very beginning and remember that one of the most important tasks of the CBO is estimating the statement cardinality, that is, to make a guess about the number of rows that will be fetched. In statistics, that means that the CBO must calculate (estimate) the expected value of the cardinality random variable.

In order to calculate the expected value, in our case, we can consider the table potential population (i.e. the set of all possible row values), execute (ideally!) the statement over each table in the population, and compute (again ideally) the average of the cardinality of each result set.

The population must be coherent with the set of observations stored in the data dictionary when the table and column statistics were collected; in other words, the population must satisfy a set of statistical constraints. For example the number of distinct values in each column must be equal (or statistically equal) to the num_distinct statistic; the range of values must be inside (or statistically inside) the min-max interval dictated by low_value-high_value, etc.

Now consider a simple statement with a filter predicate:

select ... from t where x = 1;

Assuming that column X contains numbers, there are an *infinite number* of values of X inside the min-max interval (assuming that min is not equal to max) that can satisfy the constraints. In the table population, how many tables have X=1, and how many rows will be retrieved by the statement?

If a frequency histogram has been collected on column X, the population is constrained to (statistically) satisfy it, and hence we have the answer: the expected cardinality is zero if value X is not contained in the histogram and strictly greater than zero (computed with the usual formula) otherwise.

But if no histogram is collected on the column, the number of tables with X=1 will be negligible, and hence the expected value will be zero. That is not very useful.

But if we assume that the **result set is never empty**, then we have another constraint to apply. That means that the value X is contained in all tables of the population, and (if we add the additional customary assumption of uniform distribution of values) we can easily derive the usual num_rows / num_distinct(X) formula.

Note that the "non-empty result set" assumption is very strong; it means that the statement and the table are not independent, but actually are highly correlated, since the assumption is equivalent to say that the client executes the statement in order to retrieve rows whose existence is certain before the statement execution. In other words, the CBO infers information about the data from the statement itself, not only from the data dictionary statistics, trusting that the user has some knowledge about the data stored in the table.

The assumption is of course more than reasonable for almost all statements and clients, but not always. For instance, X=1 might mean "new record" in a table-queue that contains the history of the last few years as well, and the table migth have no observable record with X=1 thanks to the consumer(s) being very quick or the producer(s) rarely enqueuing records. Or maybe, X=1 might mean "failed record" in a process that never fails, and the statement could be issued for checking the rows existence, not for retrieving them. In this kind of scenario, the CBO predictions can be affected by the "non-empty result set" assumption.

- Thursday, September 3, 2009 CBO, performance tuning
- 6 Comments »

### NoCOUG’s “First international SQL challenge”

Just a short note to tell my friends that I have been bestowed the *August Order of the Wooden Pretzel*, that is, that I won the NoCOUG's "First international SQL challenge" with this solution.

I'm especially happy to see that, after (way too) many years since graduation, I am still able to use my math skills to solve problems ... :).

Many thanks to Iggy Fernandez for setting up the contest (and running it with love and dedication), Dan Tow for judging the solution and of course Chen Shapira for advertising and "supporting" it on her blog!

- Monday, August 24, 2009 Uncategorized
- Comment now »

### fast refresh of single-table materialized views – algorithm summary

Today we are going to investigate how Oracle fast refreshes materialized views (MVs) of a single master table, containing no aggregate but, at most, filter predicates and additional column definitions:

create materialized view test_mv build immediate refresh fast on demand with rowid -- with primary key as select test_t1.*, x1+x2 as x1x2 from test_t1 where x1 != 0.42;

This kind of MVs might be considered a degenerate case of a join-only MV, a topic that we investigated in an earlier post, and one could expect the same algorithm. But that is not the case: the test case shows that the algorithm used is very different.

The two main differences are (as we are going to illustrate in detail) that UPDATEs are actually used in this case (as noted by Cristian Cudizio) instead of DELETE+INSERT only, and especially that row-by-row propagation is performed instead of using a couple of single SQL statements.

This kind of MV is frequently used for replication across a db-link (with a clause such as "from test_t1@db_link"); in this scenario, the MV used to be named SNAPSHOT in old releases. I have checked this scenario as well (not included in the test case) and the only difference is that, obviously, the master table test_t1 is referenced via a db-link and a few hints are injected by the refreshing engine.

In the test case, I have checked both the WITH ROWID and WITH PRIMARY KEY options for the MV DDL; the algorithm turns out as being identical, besides (obviously) that in the former the rowid and in the latter the primary key is used to identify rows.

I am going to follow the path of the previous discussion about join-only MVs referenced above, as both the test case format and some of the actual refresh steps are very similar. I have tested on 9.2.0.8, 10.2.0.4 and 11.1.0.7 for the most common DML on the base table (conventional INSERTs, UPDATEs and DELETEs). I have seen no difference in the algorithm for the three kernel versions.

**Materialized view logs configuration**

Even for this test case, I have configured the materialized view logs to "log everything" to check whether Oracle is able to take advantage of more information in the log:

create materialized view log on test_t1 with sequence, rowid, primary key (x1, x2) including new values;

but even for single-table MVs the algorithm uses only the rowid or primary key information, hence the minimal (and hence optimal) log configuration is, for the WITH ROWID option:

create materialized view log on test_t1 with rowid;

and for the WITH PRIMARY KEY option:

create materialized view log on test_t1 with primary key;

**Log snapshots**

The first step in the refresh algorithm is to take a log snapshot, exactly as in the join-only case, by setting snaptime$$ = current time. Hence the marked log rows (the ones and only ones to consider for propagation) will be characterized by snaptime$$ <= current time and > last snapshot refresh time. See the previous post about the join-only case for a more in-depth discussion.

Note: actually, for the sake of precision, two (minor) differences with the join-only case are that the snapshot statement is exactly the same in all versions (there's no special version for 11.1.0.7) and that the log is not "inspected to count the number and type of the logged modifications".

**Core algorithm: the DELETE and UPSERT steps**

Then, the core replication starts. The propagation from the master table is composed of two simple steps, steps that I've named DELETE and UPSERT (UPDate + insERT).

The first **DELETE step** is a simple select-then-delete row-by-row processing, where each row returned by a select statement is passed to a single-row delete statement.

For the WITH ROWID option, the select statement of the DELETE step is (editing for readability: removing hints, unnecessary aliases, etc):

select distinct m_row$$ from ( select m_row$$ from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'I' ) log where m_row$$ not in ( select rowid from test_t1 mas where (mas.x1 <> 0.42) and mas.rowid = log.m_row$$ );

and the delete is a trivial

delete from test_mv where m_row$$ = :1;

The select+delete purpose is to delete all marked rows that are not in the master table anymore, or that are still there but that do not satisfy the MV defining SQL (here, x1 != 0.42) anymore.

In fact, the first in-line view fetches from the log the rowid of a subset (those whose dmltype$$ != 'I') of the marked rows, since :1 is set to the date of the previous refresh of the materialized view. Well actually - the SQL, as it is, would also get the log rows inserted after the snapshot was taken, which is obviously not acceptable since the propagation must operate on a stable set of rows. I'm not sure how the non-marked rows are excluded, but probably the various "select for update" on the log data dictionary tables might play a role by locking the commits on the logs, or maybe the serialization level is set to read-only or serializable (I will investigate this in the future). For now, let's make the conjecture that only the marked rows are selected.

The last correlated subquery simply filters out the rowid of the rows that are still in the master table. The condition dmltype$$ != 'I' ('I' stands for INSERT) is only an optimization, since an inserted row would be filtered out by the subquery anyway - unless it has not been deleted after being inserted, but that would be recorded with another log row with dmltype$$ = 'D'.

Why are updates (dmltype$$ = 'U') not optimized away as well? This is to delete rows from the MV that no longer belong to the current image of the MV defining SQL statement, since they used to satisfy the filter condition (here, x1 != 0.42) but no longer do after an update. Thanks to the filter condition (x1 != 0.42) being included in the subquery, any row that does not satisfy it anymore after an update will not be filtered out, and hence will be deleted.

Note that column m_row$$ of the MV is a hidden (but not virtual) column that records, for each MV row, the rowid of the corresponding master table row. It is automatically created when you define the MV with the WITH ROWID option; an index is automatically created on m_row$$ as well (unless you specify USING NO INDEX, something that does not make sense if you want to fast refresh the MV). Hence you do not need to create any additional index, neither on the master table nor on the MV, to optimize this step of the fast refresh.

Switching to the WITH PRIMARY KEY option, the select statement of the DELETE step is

select distinct pk1 from ( select pk1 from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'I') ) log where pk1 not in ( select pk1 from test_t1 mas where (mas.x1 <> 0.42) and log.pk1 = mas.pk1 );

and the delete is simply

delete from test_mv where pk1 = :1;

That is, the statements are the same as in the WITH ROWID case, with the primary key instead of the rowid in all statements. Since the master table must have a primary key for the MV create to succeed, and since an index on the MV that spans the primary key column(s) is automatically created (unless you specify USING NO INDEX of course), even in the WITH PRIMARY KEY case you do not need to create any additional index for performance. Actually, for best performance, an index on the master table that combines the PK and the column(s) referenced by the MV filter condition - here on (pk1, x1) - might help a bit, since probably the optimal plan is a nested loop having test_t1 as the inner table. This would avoid a block get on the master tables for marked rows not satisfying the MV filter condition; the effectiveness of this index depends on whether you have a lot of updates on the column referenced in the filter condition.

The **UPSERT step** is a simple select-then-upsert row-by-row processing, where each row returned by a select statement (that calculates the current image of the row that needs to be propagated to the MV) is used to update the corresponding row in the MV; if the update finds no row, the row is inserted.

For the WITH ROWID option, the select statement of the UPSERT step is:

select current.x1, current.x2, current.pk1, current.x1x2, rowidtochar (current.rowid) m_row$$ from ( select x1, x2, pk1, x1+x2 as x1x2 from test_t1 where (x1 <> 0.42) ) current, ( select distinct m_row$$ from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'D' ) log where current.rowid = log.m_row$$;

and the update and insert statements are simply:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2 = :4 where m_row$$ = :5; insert into test_mv (x1,x2,pk1,x1x2,m_row$$) values (:1,:2,:3,:4,:5);

The select+upsert purpose is to calculate the new image of all marked rows that satisfy the MV defining SQL filter condition (here, x1 != 0.42) and then overwrite the old image in the MV with the new one. Note that an update on the master table might produce an insert if the old image did not satisfy the filter condition and the new one does.

The structure of the select statement should be obvious after the previous illustration of the DELETE step. Note of course the different optimization in the second inline view (dmltype$$ != 'D'). Even in this case, the automatically created index on the m_row$$ MV column optimizes the update statement, and no other index is necessary for performance on neither the base table nor the MV.

Switching to the WITH PRIMARY KEY option, the select statement of the UPSERT step is

select current.x1, current.x2, current.pk1, current.x1x2 from ( select x1, x2, pk1, x1+x2 x1x2 from test_t1 where (x1 <> 0.42) ) current, ( select distinct pk1 from mlog_test_t1 where snaptime$$ > :1 and dmltype$$ != 'D' ) log where current.pk1 = log.pk1;

and the update and insert statements are:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2=:4 where pk1=:3; insert into test_mv (x1, x2, pk1, x1x2) values (:1, :2, :3, :4);

And the same considerations about the substitution of rowid with the primary key hold. The index on the master table on (pk1, x1) might be of help here as well.

So here it is what the algorithm, essentially, is all about: a row-by-row propagation of all the modified (marked) rows to the MV, with a few optimizations.

**Algorithm optimizations**

Whatever the type of modifications, the algorithm is always the same: both the DELETE and UPSERT step are performed in all cases. Of course, in both cases, the select statement might select no row.

- Tuesday, August 11, 2009 materialized views
- 12 Comments »

### Xplan 2.0

A lot of new features have been added in version 2.0 of xplan, the sqlplus script I use to investigate about SQL statements performance (I spoke about version 1.0 in this post). Here's a brief description.

**wait profile (from ASH)**

For each statement, its wait profile is calculated fetching wait information from Active Session History:

----------------------------------------- |ash event |cnt |% | ----------------------------------------- |enq: HW - contention |2606|61.0| |enq: TX - row lock contention| 875|20.5| |db file sequential read | 344| 8.0| |enq: TX - index contention | 158| 3.7| |gc current grant busy | 152| 3.6| |cpu | 56| 1.3| |gc current block 2-way | 34| 0.8| |gc current block busy | 13| 0.3| |gc buffer busy | 10| 0.2| |gc cr block 2-way | 7| 0.2| |gc current grant 2-way | 5| 0.1| |read by other session | 5| 0.1| |direct path write | 3| 0.1| |gc cr block busy | 3| 0.1| |gc cr grant 2-way | 1| 0.0| |SQL*Net more data from client| 1| 0.0| |cr request retry | 1| 0.0| -----------------------------------------

By default this feature is on in 10g+ and inspects a window of ash_profile_mins=15 minutes from v$active_session_history.

Important note: you must have bought the appropriate Oracle licence (i.e. the Diagnostic Pack in 11.1) to read from that view and hence to use this feature (xplan will output a warning to remember you about that); you can disable this feature by setting ash_profile_mins=0.

**Dump of dependent object definitions**

If the statement references some database objects (e.g. a view, a pl/sql function) and hence depends on them, xplan will list them right below the statement text:

SELECT /*+ index(t,t_fbi) ordered use_nl(v) xplan_test_marker */ T.RR, PLSQL_FUNC(MAX(T.X)) FROM T, V WHERE UPPER(T.X) >= '0' AND T.X > :B1 AND V.RR ='x' GROUP BY T.RR ORDER BY T.RR - depends on view DELLERA.V - depends on function DELLERA.PLSQL_FUNC

and the object definition will be reported at the bottom of the xplan output:

############################################# function DELLERA.PLSQL_FUNC ### function plsql_func (p varchar2) return varchar2 is begin return p; end plsql_func; ############################################# view DELLERA.V ### view columns: #1 X(NUMBER),#2 PADDING(VARCHAR2),#3 RR(VARCHAR2) select x, padding, rr from t where x > 0

**Reading other RAC instance statements**

Now you can read from another instance by specifying the option inst_id (defaults to the instance you are connected). This is handy for inspecting other instances of the RAC cluster without reconnecting.

**Automatic dump of AWR most-expensive statements**

The experimental script xplan_awr.sql will inspect AWR (Active Workload Repository) and dump all the statements that are still in the library cache and that have exceeded some resource consumption thresholds in any of the periods marked by two consecutive AWR snapshots. Thresholds can be the percentage of total (e.g. dump if the CPU consumption is more that 10% of total CPU) or the ranking position (e.g. dump if the statement ranks more than 5th in the CPU chart - the typical "top-N" analysis). The thresholds are configurable in the topmost "params" WITH clause.

Again, you must have bought the appropriate Oracle licence to use AWR, and hence to run xplan_awr.sql.

- Friday, August 7, 2009 performance tuning, tools, xplan
- 18 Comments »