As we have seen in the previous posts of this series, in 11g a new figure named "NewDensity" has been introduced as a replacement for the "density" column statistic for columns whose histogram has been collected; this change has been backported in 10.2.0.4 also.
In the previous post we discussed how NewDensity influences the CBO cardinality estimate for Height-Balanced histograms; in this one we are going to investigate the same for Frequency Histograms. We will see that the most important change is the introduction of the “half the least popular" rule (see the "Histogram change" post by Jonathan Lewis, which distills the findings of Randolf Geist and Riyaj Shamsudeen) - a surprising rule that might easily cause trouble (in fact as Jonathan reports in the comments - bug 6082745 was opened against this rule).
The test case (script density_post_freq.sql) considers the same test statement we focused on in the previous post (a single equality filter predicate which asks for a value inside the min-max range of the column):
select ... from t where value = 64.5;
Of course we compute a Frequency instead of an Height-Balanced histogram, and use a slightly different value distribution in order to highlight the new rule:
SQL> select value, count(*) 2 from t 3 group by value 4 order by value; VALUE COUNT(*) ---------- ---------- 8 8 16 16 64 64 128 128
The histogram generated by the test case is (from DBA_HISTOGRAMS):
VALUE EP BKT ---------- ---------- ---------- 8 8 8 16 24 16 64 88 64 128 216 128
VALUE is an abbreviation for ENDPOINT_VALUE, EP for ENDPOINT_NUMBER.
BKT is the number of buckets covered by the value (i.e.: EP minus the previous EP), that is, the number of rows whose column value was equal to VALUE at statistics collection time.
When the filter predicate selects a value contained in the histogram, the new releases behave the same as the old ones (but check the "bottom note about singleton values" at the bottom for a a minor but interesting detail): neither density nor NewDensity is used, and the cardinality estimate is the usual intuitive one. In the complementary case of a value not contained in the histogram (but still inside the min-max interval), the cardinality used to be calculated as density*num_rows and it is now NewDensity*num_rows. Note the simmetry with the Height-Balanced case: the formula is the same, with NewDensity simply replacing density.
NewDensity with the “half the least popular" rule active
By default the rule is active, and in this case, NewDensity is set to
NewDensity = 0.5 * bkt(least_popular_value) / num_rows
and hence, for non-existent values:
E[card] = (0.5 * bkt(least_popular_value) / num_rows) * num_rows = 0.5 * bkt(least_popular_value)
For our test case, the least_popular_value is 8 and bkt(8) = 8, hence E[card] = 0.5 * 8 = 4 thanks to NewDensity being equal to 0.5 * 8 / 216 = 0.018518519. In fact, we can verify in the 10053 traces (in 10.2.0.4, 184.108.40.206, 220.127.116.11) for our statement, that asks for a not-existent value (64.5), that E[card] and NewDensity are set as above:
NewDensity:0.018519, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Using density: 0.018519 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 216.000000 Rounded: 4 Computed: 4.00 Non Adjusted: 4.00
As another check, let's see what happens if bkt(least_popular_value) = 1, that is, if there is (at least) one value that occurred exactly one time (a singleton value) at statistics collection time. Adding such a row to our test case is trivial (just uncomment the first insert row in the script); in this scenario, our formula above predicts E[card] = 0.5 with NewDensity = 0.5 / 217 = .002304147, and in fact (check the *_least_is_one.trc traces):
NewDensity:0.002304, OldDensity:0.002304 BktCnt:217, PopBktCnt:216, PopValCnt:4, NDV:5 Using density: 0.002304 of col #1 as selectivity of unpopular value pred Table: T Alias: NOT_EXISTENT Card: Original: 217.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
note that E[card] gets rounded up from 0.5 to 1 (as usual).
What is the rationale behind this rule? Thanks to Randolf Geist (see the comment in Jonathan's blog entry above), we know that it was introduced as a patch to solve one particular scenario (see bug 5483301) and then included in the main release, for some reason. Luckily, the rule can be disabled and the old sane behaviour can be restored.
NewDensity with the “half the least popular" rule disabled
To disable the new rule, just switch off the patch 5483301:
alter session set "_fix_control"='5483301:off';
(or alter system if you want to make it permanent)
with this setting, NewDensity becomes simply
NewDensity = 0.5 / num_rows
and hence, again for non-existent values:
E[card] = 0.5
which is exactly what we got in pre-10.2.0.4, where density was used (and density was, and is still, set to 0.5 / num_rows by dbms_stats). So the cardinality estimate is 0.5 (rounded up to 1).
For our test case, we predict NewDensity = 0.5 / 216 = 0.002314815. In fact our 10053 traces tell us:
NewDensity:0.002315, OldDensity:0.002315 BktCnt:216, PopBktCnt:216, PopValCnt:4, NDV:4 Table: T Alias: NOT_EXISTENT_OFF Card: Original: 216.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
The rationale for this behaviour is sound; the CBO knows that no row with the requested value existed at statistics collection time, hence it returns the minimal cardinality estimate compatible with the non empty result set assumption (check this post for the importance of this assumption). If the statistics are reasonably fresh, this is the only sane estimate that can be made.
Playing with density - a warning
If you set your own column stats using dbms_stats.set_column_stats, the behaviour is different; I haven't made any extensive investigations but as far as I can tell, the value you provide for density is used instead of NewDensity. User-provided column statistics are flagged with dba_tab_cols.user_stats = 'YES'. You can disguise your user statistics as non-user by setting the flags parameter of dbms_stats.set_column_stats to 2 - but since the latter parameter is labeled as "for Oracle internal use only", I would do it only for investigations purposes - that is, never in production.
Bottom note about singleton values: actually in pre-10.2.0.4 versions, if the value was present in the Frequency histogram but covering a single bucket (hence it was present in the table exactly one time at statistic collection time), it used to be classified as "unpopular" and hence used to get the same treatment as a value not in the histogram - the end result being that the cardinality was estimated as 0.5 rounded up to 1; now it is 1 before rounding as one would intuitively expects. I hope to be able to investigate whether this change fixes the issues about join cardinality estimation I investigated - see "the mystery of halving" in this investigation of mine if interested.
In this post we are going to explore and explain the rationale for the formula used by the CBO to compute the "NewDensity" figure that replaces, from 10.2.0.4 onwards, the "density" column statistic in the cardinality estimation formulae for columns with height-balanced (HB) histograms defined.
In a previous post, we already discussed the pre-10.2.0.4 scenario: we saw how and when the "density" column statistic is used in the cardinality formula for equality filter predicates, we explained its statistical rationale and defining formula, introduced the concept of the NPS (Not Popular Subtable), and built a test case. Now we are going to use the very same test case and explain the differences in the most recent versions (the previous post zip file contains logs for them also).
To summarize the test case - we have a table T with a single column VALUE, exponentially distributed, and with a SIZE 5 Height-Balanced histogram collected on. The histogram is:
SQL> select ep, value, popularity from formatted_hist; EP VALUE POPULARITY ---------- ---------- ---------- 0 1 0 1 16 0 5 64 1
Thus, we have a single popular value, 64; all the others are unpopular.
In this "densities" series of post, we focus on a SQL statement that contains only an equality filter predicate on table T:
select ... from t where value = 2.4;
the literal value is not a popular value (but inside the 1-64 interval) and hence, in pre-10.2.0.4, the formula used for the expected cardinality calculation is equal to:
E[card] = density * num_rows;
We discussed, in the previous post, how density is carefully calculated by dbms_stats to get back the expected cardinality of the family (class) of all possible equality filter predicate statements that hit the NPS, under the usual "non-empty result set assumption" and the further (strange and strong) assumption that the more a value is represented in the NPS, the higher the probability that the value is used as the literal of the equality predicate (an assumption that mathematically translates into the formula "w(:x) = count(:x) / num_rows_nps").
Switching to 10.2.0.4 - the formula for E[card] is still the same, but with "density" replaced by "NewDensity" (as hinted by the fact that "density" is reported as "OldDensity" in the 10053 trace files, as we are going to see in a moment):
E[card] = NewDensity * num_rows;
NewDensity is not stored anywhere in the data dictionary, but it is computed at query optimization time by the CBO (note that density is still computed by dbms_stats using the old formula, but then it is ignored by the CBO). The NewDensity formula is based mainly on some histogram-derived figures; using the same names found in 10053 traces:
NewDensity = [(BktCnt - PopBktCnt) / BktCnt] / (NDV - PopValCnt)
Where BktCnt ("Bucket Count") is the number of buckets (the "N" in the "SIZE N" clause);
PopBktCnt ("Popular Bucket Count") the number of buckets covered by the popular values;
PopValCnt ("Popular Value Count") is the number of popular values; NDV ("Number of Distinct Values") is the traditional name used by CBO developers for the num_distinct column statistic. With the exception of NDV, all these values are derived from the histogram.
Side note: if the numerator is equal to zero, NewDensity is set to 0.5 / num_rows, thus giving an E[card] = 0.5, as far as I have seen (not exaustively) in a few test cases; it looks like a lower-bound "sanity check". The denominator cannot be zero for HB histograms.
To illustrate the formula: the histogram of our test case has 5 buckets, hence BktCnt=5; 64 is the only popular value, hence PopValCnt =1; this popular value covers 4 buckets (since its EP is 5 and the previous EP is 1), hence PopBktCnt=4; we know that the column has num_distinct=6, hence NDV=6. This is in fact what we see in the 10053 trace file (in 18.104.22.168 and 22.214.171.124):
SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for T[T] Column (#1): NewDensity:0.040000, OldDensity:0.115789 BktCnt:5, PopBktCnt:4, PopValCnt:1, NDV:6 Using density: 0.040000 of col #1 as selectivity of unpopular value pred Table: T Alias: T Card: Original: 95.000000 Rounded: 4 Computed: 3.80 Non Adjusted: 3.80
So NewDensity = [(5-4)/5] / (6-1) = 1/25 = 0.04 and E[card]=0.04*95=3.8, which is exactly what we see in the above trace fragment.
The formula is statistically based on replacing the previous versions' assumption (that we labeled "strange and strong") about w(:x) with the standard assumption that the client will ask for the values in the NPS with the same probability; mathematically, that means replacing the formula "w(:x) = count(:x) / num_rows_nps" with the standard "w(:x) = 1 / num_distinct_nps" (where num_distinct_nps is of course the number of distinct values of the NPS). If you plug this shape of w(:x) into the formula for E[card], you get
E[card] = sum ( w(:x) * E[count(:x)] ) = = sum (E[count(:x)] ) / num_distinct_nps for all values of :x (belonging to the NPS)
E[card] = num_rows_nps / num_distinct_nps
which is, not surprising, the standard formula used for columns without histograms, but applied to the NPS, not the whole table.
One possibility for producing the above E[card] value at run-time could have been to change dbms_stats to compute a value for "density" equal to (num_rows_nps / num_distinct_nps) / num_rows; but forcing users to recompute statistics for all their tables in their upgraded databases is not really a viable option. Hence, the CBO designers chose to simply ignore "density" and calculate the above formula at run-time, mining the histogram, at the cost of reduced precision. In fact, the easy part is num_distinct_nps, which is obviously exactly equal to num_distinct minus the number of popular values; but num_rows_nps can only calculated approximately, since the histogram is a (deterministic) sample of the column values obtained by first sorting the column values and then sampling on a uniform grid (for more information and illustration, see the first part of this article of mine). Using the histogram, the best approximation for num_rows_nps is num_rows times the fraction of buckets not covered by popular values. Hence, using the 10053 terminology
num_distinct_nps = NDV - PopValCnt (exactly) num_rows_nps = [(BktCnt - PopBktCnt) / BktCnt] * num_rows (approximately)
which gets back (again, approximately) the E[card] formula above, as can be trivially checked.
It might be desirable that one day, NewDensity gets calculated exactly by dbms_stats and stored in the data dictionary, at least for columns with new statistics, albeit the precision reduction is probably more than acceptable (that is, I have never seen a case where that has been an issue). The test case script, just for the sake of completeness, calculates the exact figure as well; it gets back an E[card] of 6.2 instead of 3.8.
For a summary of the above discussion and some more discussion, check back this investigation of mine. By the way, NewDensity replaces "density" also in join cardinality formulae, even if I have not run a complete investigation - but that is not surprising at all.
As a final nore - NewDensity is used also for Frequency Histograms, and in a very creative way; we will discuss this in part IV of this series.
In this post we are going to explore and explain the rationale for the formula used by dbms_stats to compute the "density" column statistic, used by the CBO in versions less than 10.2.0.4 to estimate the cardinality of a class of SQL statements. In the next post, we will speak about its replacement, named "NewDensity" in 10053 trace files.
We will consider only the non-trivial case of Height-Balanced histograms, since for Frequency Histograms density is a constant (0.5 / num_rows) and for columns without histogram, it is simply 1/num_distinct.
Let's illustrate the test case on which we will base our discussion, contained in this zip file.
First, a table T is created with the following exponential value distribution:
SQL> select value, count(*) 2 from t 3 group by value 4 order by value; VALUE COUNT(*) ---------- ---------- 1 1 2 2 4 4 8 8 16 16 64 64
The test case then computes a SIZE 5 Height-Balanced histogram. The resulting histogram (from dba_histograms) is as follows (note that I have added the column POPULARITY that marks popular values with "1"; EP is shorthand for column ENDPOINT_NUMBER, VALUE for column ENDPOINT_VALUE):
SQL> select ep, value, popularity from formatted_hist; EP VALUE POPULARITY ---------- ---------- ---------- 0 1 0 1 16 0 5 64 1
The test case then issues this SQL statement that contains only an equality filter predicate on table T:
select ... from t where value = 2.4;
The literal value 2.4 is not contained in the table (and hence in the histogram), in order to make the CBO factor in "density" in its estimate of the expected cardinality - in fact, as it might be known, density is used when the literal is not popular (that is, not equal to 64 in our case), and it doesn't matter whether the literal is not contained in the histogram, or contained as an unpopular value (1 and 16 in our case), or even contained in the table or not. All it takes is its being not popular.
Side note: I'm assuming the literal is inside the closed min-max interval (1-64 in this case); when outside, it depends on the version.
When the literal is not popular, the formula used for the expected cardinality calculation is equal to
E[card] = density * num_rows;
That is easy to verify from the test case logs; in 9i we can see that density = 0.115789474 and num_rows=95, hence 0.115789474 * 95 = 11.000000000 which is exactly equal to the CBO estimate for our statement.
The formula used by dbms_stats to compute "density" was published in Jonathan Lewis' book Cost Based Oracle (page 172) and Wolfgang Breitling's presentation Histograms - Myths and Facts. The key fact is that the formula takes as input the rows of what I've nicknamed the not-popular subtable (NPS), that is, the original table without the rows whose values are popular values (in this case, 64 is the only popular value). Letting num_rows_nps the number of rows of the NPS (for our example, num_rows_nps=1+2+4+8+16=31), we have:
density = (1 / num_rows) * sum (count (value) ^ 2) / num_rows_nps for "value" belonging to the NPS
The script performs this calculation automatically; it is anyway instructive to perform the calculation manually at least one time:
density = (1/95) * (1*1+2*2+4*4+8*8+16*16) / 31 = .115789474
that matches perfectly the density we observed in the script log before.
What is the statistical rationale for this seemingly strange computation?
If we plug it inside the formula for E[card], we can see that num_rows is cancelled:
E[card] = sum (count (value) ^ 2) / num_rows_nps summed over all "values" belonging to the NPS
Now we must reference the statistical concepts introduced in this post, and consider the family of all statements of our kind that can reference the NPS:
select ... from t where x = :x; :x being a value belonging to the NPS
its E[card] is
E[card] = sum ( w(:x) * E[count(:x)] ) for all values of :x (belonging to the NPS)
dbms_stats takes count(:x) as the best estimate for E[count(:x)] (for example, E[count(4)] = count(4) = 4 in our case). All we have to do in order to obtain the observed formula, is to assume w(:x) = count(:x) / num_rows_nps:
E[card] = sum ( (count(:x) / num_rows_nps) * count(:x) ) = sum ( count(:x) ^ 2 ) / num_rows_nps for all values of :x (belonging to the NPS)
The meaning of the above particular shape of w(:x) is that the probability that the client submits a certain value for :x is proportional to the number of rows (in the NPS) that has that value; more precisely, that if X% of rows has a certain common value, X% of user-submitted statements that "hit" the NPS will ask for that value. Under this assumption, dbms_stats precomputes "density" to give back the above E[card] when the literal is known to be not popular, hence hitting the NPS - remember that the CBO operates under the "non-empty result set assumption", hence if the literal does not hit a popular value, it must hit a value of the NPS.
The above assumption for w(:x) is quite a strange assumption - and in fact, we will see in the next post that in 11g (and 10.2.0.4), this assumption has been dropped and replaced with a more standard one. The "density" column statistics is in fact ignored in 10.2.0.4+ and a value computed at run-time, named "newDensity" in 10053 trace files, is used instead.
Let's explore the concept of cardinality from the point of view of the statistician; this is both to get a clearer vision of the matter (i.e. for fun) and to path the way for understanding the rationale for the "density" statistics as calculated by dbms_stats (the topic of an upcoming post).
Let's consider a statement with input parameters (bind variables), and consider the most fundamental of them all, the one with a filter predicate:
select ... from t where x = :x;
the cardinality "card" of the set of rows retrieved depends on the table possible values and the actual inputs provided by the client as bind variable values. What about the expected value E[card] ?
1) w(:x) ("w" stands for "workload") the probability mass function of the random variable :x (that completely characterizes the workload);
2) E[count(:x)] the expected value of the cardinality of the rows retrieved for each value of :x.
We have, assuming that the two are independent:
E[card] = sum ( w(:x) * E[count(:x)] ) for all values of :x
To solve the formula we have to know (or assume) the client-dictated w(:x). The same goes for the table potential population (that -together with the statement of course- shapes E[count(:x)]); we must either have some statistical measurements about the table (for example, a frequency histogram on column X that we consider representative of the table population) or assume them (for example, assume a certain distribution for the column X values).
It is interesting to explore the most used scenario: a uniform (or assumed uniform) distribution for the column X values, of which we know the number of distinct values num_distinct(X) and the total number of values num_rows (let them be deterministically known for simplicity, and exclude null values). That means that E[count(:x)]) is equal to num_rows / num_distinct(X) over a finite set that contains num_distinct(X) values and is equal to zero over the remaining ones.
It is relatively easy to see that E[card] depends on how w(:x) and E[count(:x)]) overlap. At one end of the spectrum, if the client always submits values for :x that are not contained in the table, E[card] is zero - since the client choice matematically translates into E[count(:x)]) being zero over all values of :x for which w(:x) is non zero. At the other end of the spectrum, that is, under the non empty result set assumption, we have E[card] = sum ( w(:x) ) * num_rows / num_distinct(X) = num_rows / num_distinct(X), the usual formula used by the CBO in many (most) situations.
The "density" column statistic formula can be derived in a similar way, but using a different assumption about w(:x) - as we will see in a dedicated post.
While attending the 11gR2 launch event in Milan last Thursday, I had the distinguished opportunity (invited, as a blogger, by the Oracle team that was organizing the event) to meet Mark Townsend and exchange a few words about the new features of 11gR2 and the Oracle database in general as well.
For those who don't know, Mark is (among other things) the Vice President in charge of coordinating the Product Managers and a technical expert at the same time, and this rare combination has the advantage that you can ask him about any feature you like at whatever granularity you like, from the strategic level down to the technical gory details. In fact Mark is frequently seen at public events (such as Oracle Open World), speaking to mixed audiences composed of both Engineers and Management.
We spoke about many new features of 11gR2, so much in fact that in order to do justice to the information that Mark very kindly provided me, I will use them as the foundation of some future blog posts. In brief anyway, I noticed that this release has features that are targeted mostly to vastly improve the "grid", but with very interesting features for the "core" as well (my personal favourites being the in-memory parallel execution and the new SCN-based MV log, which I plan to blog about in the very near future).
But especially, I didn't miss the unique opportunity to discuss with one of the top players of Oracle Corporation about the amount of information that Oracle shares with its professionals, obviously trying to push for much more. I'm sure that everyone that works with any kind of software product agrees with me on the fact that knowing how the product works is key not only to troubleshooting (that being almost obvious) but to good design also, or perhaps especially; the more you know, the better designer you are going to be.
We had a lot of back and forth on this topic, but to summarize, Mark agreed on detailed information being very useful for experts, but also pointed out that it takes years and years to become proficient enough to be able to digest very detailed documentation, and in the meanwhile, too much information is going to confuse, rather than clear things (especially for juniors coming from other database products). Just the current detail level is overwhelming, the current docs being composed of a staggering 21,000,000 words - and Oracle is in fact trying to organize the documentation in layers as much as possible, starting from high-level descriptions (the Two Day DBA course), than the Concept manual, and than the rest. The most details, however, will be still provided forever as Metalink (aka "My Oracle Support") notes, the only reason being to avoid confusing people, not to hide anything (besides strategic algorithms not covered by Patents of course).
So in short, from this discussion I have understood that Oracle is willing to share information with its community; it only wants to find the right way to do so, since the community is huge (hundreds of thousands of people) and the product is very, very complex. Anyway, I have insisted for more information about the two topics that I'm sure that are not documented well enough, naming the "auto" features and the CBO algorithms - and actually I feel like I have insisted perhaps too much on that with Mark ... but that's something I really care about, both for professional reasons and keen interest alone.
Well, I might add that information sharing, and community involvement as well, is in my opinion one of the main factors of Oracle's success; actually, being able to dig a lot into the inner workings of the product is the reason why I chose an Oracle career ten years ago.
PS I'm back from my vacation and I have a lot of interesting things to investigate at work that look like perfect candidates for being turned into posts, so I will be able to blog more frequently in the future. I also have a series of posts about the CBO that is "almost complete".
The CBO assumes that SELECT statements are always going to retrieve at least one row - even if this is not necessarily the case, of course. Understanding why this is done is both useful and fascinating.
We must start from the very beginning and remember that one of the most important tasks of the CBO is estimating the statement cardinality, that is, to make a guess about the number of rows that will be fetched. In statistics, that means that the CBO must calculate (estimate) the expected value of the cardinality random variable.
In order to calculate the expected value, in our case, we can consider the table potential population (i.e. the set of all possible row values), execute (ideally!) the statement over each table in the population, and compute (again ideally) the average of the cardinality of each result set.
The population must be coherent with the set of observations stored in the data dictionary when the table and column statistics were collected; in other words, the population must satisfy a set of statistical constraints. For example the number of distinct values in each column must be equal (or statistically equal) to the num_distinct statistic; the range of values must be inside (or statistically inside) the min-max interval dictated by low_value-high_value, etc.
Now consider a simple statement with a filter predicate:
select ... from t where x = 1;
Assuming that column X contains numbers, there are an infinite number of values of X inside the min-max interval (assuming that min is not equal to max) that can satisfy the constraints. In the table population, how many tables have X=1, and how many rows will be retrieved by the statement?
If a frequency histogram has been collected on column X, the population is constrained to (statistically) satisfy it, and hence we have the answer: the expected cardinality is zero if value X is not contained in the histogram and strictly greater than zero (computed with the usual formula) otherwise.
But if no histogram is collected on the column, the number of tables with X=1 will be negligible, and hence the expected value will be zero. That is not very useful.
But if we assume that the result set is never empty, then we have another constraint to apply. That means that the value X is contained in all tables of the population, and (if we add the additional customary assumption of uniform distribution of values) we can easily derive the usual num_rows / num_distinct(X) formula.
Note that the "non-empty result set" assumption is very strong; it means that the statement and the table are not independent, but actually are highly correlated, since the assumption is equivalent to say that the client executes the statement in order to retrieve rows whose existence is certain before the statement execution. In other words, the CBO infers information about the data from the statement itself, not only from the data dictionary statistics, trusting that the user has some knowledge about the data stored in the table.
The assumption is of course more than reasonable for almost all statements and clients, but not always. For instance, X=1 might mean "new record" in a table-queue that contains the history of the last few years as well, and the table migth have no observable record with X=1 thanks to the consumer(s) being very quick or the producer(s) rarely enqueuing records. Or maybe, X=1 might mean "failed record" in a process that never fails, and the statement could be issued for checking the rows existence, not for retrieving them. In this kind of scenario, the CBO predictions can be affected by the "non-empty result set" assumption.
Just a short note to tell my friends that I have been bestowed the August Order of the Wooden Pretzel, that is, that I won the NoCOUG's "First international SQL challenge" with this solution.
I'm especially happy to see that, after (way too) many years since graduation, I am still able to use my math skills to solve problems ... :).
Many thanks to Iggy Fernandez for setting up the contest (and running it with love and dedication), Dan Tow for judging the solution and of course Chen Shapira for advertising and "supporting" it on her blog!
Today we are going to investigate how Oracle fast refreshes materialized views (MVs) of a single master table, containing no aggregate but, at most, filter predicates and additional column definitions:
create materialized view test_mv build immediate refresh fast on demand with rowid -- with primary key as select test_t1.*, x1+x2 as x1x2 from test_t1 where x1 != 0.42;
This kind of MVs might be considered a degenerate case of a join-only MV, a topic that we investigated in an earlier post, and one could expect the same algorithm. But that is not the case: the test case shows that the algorithm used is very different.
The two main differences are (as we are going to illustrate in detail) that UPDATEs are actually used in this case (as noted by Cristian Cudizio) instead of DELETE+INSERT only, and especially that row-by-row propagation is performed instead of using a couple of single SQL statements.
This kind of MV is frequently used for replication across a db-link (with a clause such as "from test_t1@db_link"); in this scenario, the MV used to be named SNAPSHOT in old releases. I have checked this scenario as well (not included in the test case) and the only difference is that, obviously, the master table test_t1 is referenced via a db-link and a few hints are injected by the refreshing engine.
In the test case, I have checked both the WITH ROWID and WITH PRIMARY KEY options for the MV DDL; the algorithm turns out as being identical, besides (obviously) that in the former the rowid and in the latter the primary key is used to identify rows.
I am going to follow the path of the previous discussion about join-only MVs referenced above, as both the test case format and some of the actual refresh steps are very similar. I have tested on 126.96.36.199, 10.2.0.4 and 188.8.131.52 for the most common DML on the base table (conventional INSERTs, UPDATEs and DELETEs). I have seen no difference in the algorithm for the three kernel versions.
Materialized view logs configuration
Even for this test case, I have configured the materialized view logs to "log everything" to check whether Oracle is able to take advantage of more information in the log:
create materialized view log on test_t1 with sequence, rowid, primary key (x1, x2) including new values;
but even for single-table MVs the algorithm uses only the rowid or primary key information, hence the minimal (and hence optimal) log configuration is, for the WITH ROWID option:
create materialized view log on test_t1 with rowid;
and for the WITH PRIMARY KEY option:
create materialized view log on test_t1 with primary key;
The first step in the refresh algorithm is to take a log snapshot, exactly as in the join-only case, by setting snaptime$$ = current time. Hence the marked log rows (the ones and only ones to consider for propagation) will be characterized by snaptime$$ <= current time and > last snapshot refresh time. See the previous post about the join-only case for a more in-depth discussion.
Note: actually, for the sake of precision, two (minor) differences with the join-only case are that the snapshot statement is exactly the same in all versions (there's no special version for 184.108.40.206) and that the log is not "inspected to count the number and type of the logged modifications".
Core algorithm: the DELETE and UPSERT steps
Then, the core replication starts. The propagation from the master table is composed of two simple steps, steps that I've named DELETE and UPSERT (UPDate + insERT).
The first DELETE step is a simple select-then-delete row-by-row processing, where each row returned by a select statement is passed to a single-row delete statement.
For the WITH ROWID option, the select statement of the DELETE step is (editing for readability: removing hints, unnecessary aliases, etc):
select distinct m_row$$ from ( select m_row$$ from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'I' ) log where m_row$$ not in ( select rowid from test_t1 mas where (mas.x1 <> 0.42) and mas.rowid = log.m_row$$ );
and the delete is a trivial
delete from test_mv where m_row$$ = :1;
The select+delete purpose is to delete all marked rows that are not in the master table anymore, or that are still there but that do not satisfy the MV defining SQL (here, x1 != 0.42) anymore.
In fact, the first in-line view fetches from the log the rowid of a subset (those whose dmltype$$ != 'I') of the marked rows, since :1 is set to the date of the previous refresh of the materialized view. Well actually - the SQL, as it is, would also get the log rows inserted after the snapshot was taken, which is obviously not acceptable since the propagation must operate on a stable set of rows. I'm not sure how the non-marked rows are excluded, but probably the various "select for update" on the log data dictionary tables might play a role by locking the commits on the logs, or maybe the serialization level is set to read-only or serializable (I will investigate this in the future). For now, let's make the conjecture that only the marked rows are selected.
The last correlated subquery simply filters out the rowid of the rows that are still in the master table. The condition dmltype$$ != 'I' ('I' stands for INSERT) is only an optimization, since an inserted row would be filtered out by the subquery anyway - unless it has not been deleted after being inserted, but that would be recorded with another log row with dmltype$$ = 'D'.
Why are updates (dmltype$$ = 'U') not optimized away as well? This is to delete rows from the MV that no longer belong to the current image of the MV defining SQL statement, since they used to satisfy the filter condition (here, x1 != 0.42) but no longer do after an update. Thanks to the filter condition (x1 != 0.42) being included in the subquery, any row that does not satisfy it anymore after an update will not be filtered out, and hence will be deleted.
Note that column m_row$$ of the MV is a hidden (but not virtual) column that records, for each MV row, the rowid of the corresponding master table row. It is automatically created when you define the MV with the WITH ROWID option; an index is automatically created on m_row$$ as well (unless you specify USING NO INDEX, something that does not make sense if you want to fast refresh the MV). Hence you do not need to create any additional index, neither on the master table nor on the MV, to optimize this step of the fast refresh.
Switching to the WITH PRIMARY KEY option, the select statement of the DELETE step is
select distinct pk1 from ( select pk1 from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'I') ) log where pk1 not in ( select pk1 from test_t1 mas where (mas.x1 <> 0.42) and log.pk1 = mas.pk1 );
and the delete is simply
delete from test_mv where pk1 = :1;
That is, the statements are the same as in the WITH ROWID case, with the primary key instead of the rowid in all statements. Since the master table must have a primary key for the MV create to succeed, and since an index on the MV that spans the primary key column(s) is automatically created (unless you specify USING NO INDEX of course), even in the WITH PRIMARY KEY case you do not need to create any additional index for performance. Actually, for best performance, an index on the master table that combines the PK and the column(s) referenced by the MV filter condition - here on (pk1, x1) - might help a bit, since probably the optimal plan is a nested loop having test_t1 as the inner table. This would avoid a block get on the master tables for marked rows not satisfying the MV filter condition; the effectiveness of this index depends on whether you have a lot of updates on the column referenced in the filter condition.
The UPSERT step is a simple select-then-upsert row-by-row processing, where each row returned by a select statement (that calculates the current image of the row that needs to be propagated to the MV) is used to update the corresponding row in the MV; if the update finds no row, the row is inserted.
For the WITH ROWID option, the select statement of the UPSERT step is:
select current.x1, current.x2, current.pk1, current.x1x2, rowidtochar (current.rowid) m_row$$ from ( select x1, x2, pk1, x1+x2 as x1x2 from test_t1 where (x1 <> 0.42) ) current, ( select distinct m_row$$ from mlog$_test_t1 where snaptime$$ > :1 and dmltype$$ != 'D' ) log where current.rowid = log.m_row$$;
and the update and insert statements are simply:
update test_mv set x1=:1, x2=:2, pk1=:3, x1x2 = :4 where m_row$$ = :5; insert into test_mv (x1,x2,pk1,x1x2,m_row$$) values (:1,:2,:3,:4,:5);
The select+upsert purpose is to calculate the new image of all marked rows that satisfy the MV defining SQL filter condition (here, x1 != 0.42) and then overwrite the old image in the MV with the new one. Note that an update on the master table might produce an insert if the old image did not satisfy the filter condition and the new one does.
The structure of the select statement should be obvious after the previous illustration of the DELETE step. Note of course the different optimization in the second inline view (dmltype$$ != 'D'). Even in this case, the automatically created index on the m_row$$ MV column optimizes the update statement, and no other index is necessary for performance on neither the base table nor the MV.
Switching to the WITH PRIMARY KEY option, the select statement of the UPSERT step is
select current.x1, current.x2, current.pk1, current.x1x2 from ( select x1, x2, pk1, x1+x2 x1x2 from test_t1 where (x1 <> 0.42) ) current, ( select distinct pk1 from mlog_test_t1 where snaptime$$ > :1 and dmltype$$ != 'D' ) log where current.pk1 = log.pk1;
and the update and insert statements are:
update test_mv set x1=:1, x2=:2, pk1=:3, x1x2=:4 where pk1=:3; insert into test_mv (x1, x2, pk1, x1x2) values (:1, :2, :3, :4);
And the same considerations about the substitution of rowid with the primary key hold. The index on the master table on (pk1, x1) might be of help here as well.
So here it is what the algorithm, essentially, is all about: a row-by-row propagation of all the modified (marked) rows to the MV, with a few optimizations.
Whatever the type of modifications, the algorithm is always the same: both the DELETE and UPSERT step are performed in all cases. Of course, in both cases, the select statement might select no row.
A lot of new features have been added in version 2.0 of xplan, the sqlplus script I use to investigate about SQL statements performance (I spoke about version 1.0 in this post). Here's a brief description.
wait profile (from ASH)
For each statement, its wait profile is calculated fetching wait information from Active Session History:
----------------------------------------- |ash event |cnt |% | ----------------------------------------- |enq: HW - contention |2606|61.0| |enq: TX - row lock contention| 875|20.5| |db file sequential read | 344| 8.0| |enq: TX - index contention | 158| 3.7| |gc current grant busy | 152| 3.6| |cpu | 56| 1.3| |gc current block 2-way | 34| 0.8| |gc current block busy | 13| 0.3| |gc buffer busy | 10| 0.2| |gc cr block 2-way | 7| 0.2| |gc current grant 2-way | 5| 0.1| |read by other session | 5| 0.1| |direct path write | 3| 0.1| |gc cr block busy | 3| 0.1| |gc cr grant 2-way | 1| 0.0| |SQL*Net more data from client| 1| 0.0| |cr request retry | 1| 0.0| -----------------------------------------
By default this feature is on in 10g+ and inspects a window of ash_profile_mins=15 minutes from v$active_session_history.
Important note: you must have bought the appropriate Oracle licence (i.e. the Diagnostic Pack in 11.1) to read from that view and hence to use this feature (xplan will output a warning to remember you about that); you can disable this feature by setting ash_profile_mins=0.
Dump of dependent object definitions
If the statement references some database objects (e.g. a view, a pl/sql function) and hence depends on them, xplan will list them right below the statement text:
SELECT /*+ index(t,t_fbi) ordered use_nl(v) xplan_test_marker */ T.RR, PLSQL_FUNC(MAX(T.X)) FROM T, V WHERE UPPER(T.X) >= '0' AND T.X > :B1 AND V.RR ='x' GROUP BY T.RR ORDER BY T.RR - depends on view DELLERA.V - depends on function DELLERA.PLSQL_FUNC
and the object definition will be reported at the bottom of the xplan output:
############################################# function DELLERA.PLSQL_FUNC ### function plsql_func (p varchar2) return varchar2 is begin return p; end plsql_func; ############################################# view DELLERA.V ### view columns: #1 X(NUMBER),#2 PADDING(VARCHAR2),#3 RR(VARCHAR2) select x, padding, rr from t where x > 0
Reading other RAC instance statements
Now you can read from another instance by specifying the option inst_id (defaults to the instance you are connected). This is handy for inspecting other instances of the RAC cluster without reconnecting.
Automatic dump of AWR most-expensive statements
The experimental script xplan_awr.sql will inspect AWR (Active Workload Repository) and dump all the statements that are still in the library cache and that have exceeded some resource consumption thresholds in any of the periods marked by two consecutive AWR snapshots. Thresholds can be the percentage of total (e.g. dump if the CPU consumption is more that 10% of total CPU) or the ranking position (e.g. dump if the statement ranks more than 5th in the CPU chart - the typical "top-N" analysis). The thresholds are configurable in the topmost "params" WITH clause.
Again, you must have bought the appropriate Oracle licence to use AWR, and hence to run xplan_awr.sql.
This post investigates how Oracle fast refreshes materialized views containing only joins of master tables:
create materialized view test_mv build immediate refresh fast on demand as select test_t1.*, test_t1.rowid as test_t1_rowid, test_t2.*, test_t2.rowid as test_t2_rowid, test_t3.*, test_t3.rowid as test_t3_rowid from test_t1, test_t2, test_t3 where test_t1.j1_2 = test_t2.j2_1 and test_t2.j2_3 = test_t3.j3_2 ;
The fast refresh algorithm is simple and very easy to understand - so trivial in fact that once examined and understood, the possible tuning techniques follow naturally.
The test case traces the fast refresh of the above materialized view (MV) using the 10046 event (aka "sql trace"). The test case has been run on 220.127.116.11, 10.2.0.4 and 18.104.22.168 (the latest versions of 9i, 10g and 11g available as of today), and on all of these versions the algorithm used by the refreshing engine (run by invoking dbms_mview.refresh) appears to be the same, with only a few implementation differences.
The test case explores the most general case: it performs inserts, updates and deletes on all the three master tables (the inserts being conventional; I will explore direct-path inserts another time).
Materialized view logs configuration
In the test case, I have configured the materialized view logs to "log everything", in order to check whether more information in the logs could trigger some special kernel code designed to take advantage of it:
create materialized view log on test_t1 with sequence, rowid, primary key (j1_2, x1) including new values;
but the engine uses only the rowid information even in 22.214.171.124, so you are better off logging only the rowid if the master table feeds join-only materialized views exclusively:
create materialized view log on test_t1 with rowid;
Minimal logging obviously improves the performance of DML against the master tables, but it also optimizes the fast refresh, since the latter, as we are going to see in a moment, reads each log twice, and of course the less you log, the more compact the logs will be.
After some preliminary visits to the data dictionary, the first operation performed by the fast refresh engine is to "mark" the modifications (recorded in the materialized view logs) to be propagated to the MV. Only the marked log rows are then fed by the fast refresh engine as input to the next steps.
The "flag" used to mark the rows is the column snaptime$$. When the refresh starts, the engine performs a "snapshot" of the materialized view logs by setting the snaptime$$ of all the new rows (those with snaptime$$ = '01/01/4000') of each log in turn to the current time (SYSDATE).
In detail, the snapshot is performed by issuing this SQL statement (slightly edited for readability) in 126.96.36.199 and 10.2.0.4:
update MLOG$_TEST_T1 set snaptime$$ = :1 where snaptime$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')
The bind variable :1 is a DATE whose value is equal to SYSDATE.
Note: In 188.8.131.52, the statement is slightly different but makes the same thing, probably in a more scalable way concurrency-wise (check the script spools if you're interested).
You might have noticed the where condition on snaptime$$; that is necessary since the logs might be used by more than one materialized view. When a refresh ends, in fact, the engine checks whether other MVs might need each log row, and deletes only the log rows that have been processed by all dependant MVs; the other ones are left unchanged (and hence keep the snaptime$$ that was set when the fast refresh started). The where condition is needed to avoid overwriting the snaptime$$, and mark with the current time only the brand new rows (those with snaptime$$ = '01/01/4000').
So, at the end of the snapshot, the log rows that must be examined by the refresh engine will be the ones that are marked by having their snaptime$$ between the date of the last refresh (excluded) and :1 (included). All the other log rows must be ignored.
Side note: marking data at a certain point in time and then replicating the marked data is the only replication strategy that can work when you cannot "freeze" the master tables, as this is definitely our case. This is a general topic worth blogging about in the future.
The marked log rows are then inspected to count the number and type of the logged modifications. This is to check whether any of the replication steps (i.e. the DEL and INS steps that we are going to discuss in a moment) could be skipped. Also, the number of modifications is used (in some versions) to inject some hints in the SQL statements of the replication steps, a topic that falls out of the scope of this post.
Core algorithm: the INS and DEL steps
Then, the core replication starts. The replication considers each master table in turn, and for each table, propagates the modifications to the MV. So we have essentially one single algorithm that propagates from a single master table, just repeated once for each master table.
The propagation for each master table is made of two simple steps, steps that I'm going to name after the comments of the SQL as a DEL (DELETE) step followed by an INS (INSERT) step.
The DEL step is (editing for readability: removing hints, unnecessary aliases, etc):
/* MV_REFRESH (DEL) */ delete from test_mv where test_t1_rowid in ( select * from ( select chartorowid (m_row$$) from mlog$_test_t1 where snaptime$$ > :1 ) as of snapshot (:2) )
The subquery simply fetches the rowid of all marked rows, since :1 is the date of the previous refresh of the materialized view, and :2 is the SCN (coded as a RAW variable) of the time when the snapshot was performed. So, this step deletes from the MV all the rows that record the result of the MV-defining join of any of the marked rows (of the current master table) with the other master tables.
This is the step that can benefit from the index on the column that stores the master table rowid (here, test_t1_rowid) that the documentation suggests to create. Note that in order to optimize this step, you need three separate single-column indexes (here, on test_t1_rowid, test_t2_rowid, test_t3_rowid), not a single composite index spanning the (here, three) columns, as it is sometimes wrongly stated.
The INS step is (again editing for readability):
/* MV_REFRESH (INS) */ insert into test_mv select jv.j1_2, jv.x1, jv.pk1, jv.rid$, mas2.j2_1, mas2.j2_3, mas2.x2, mas2.pk2, mas2.rowid, mas3.j3_2, mas3.x3, mas3.pk3, mas3.rowid from ( select log.rowid rid$, log.* from test_t1 log where rowid in ( select chartorowid(log.m_row$$) from mlog$_test_t1 where snaptime$$ > :1 ) ) as of snapshot (:2) jv, test_t2 as of snapshot (:2) mas2, test_t3 as of snapshot (:2) mas3 where jv.j1_2 = mas2.j2_1 and mas2.j2_3 = mas3.j3_2
The subquery is the same as the DEL step and serves the very same function. So, this statement replays the SQL statement of the MV definition, but limited to the marked rows only. Note that all tables are read at the same point in time in the past, the time when the snapshot of the log was performed, thanks to the argument of the "as of snapshot" clause being the same.
In order to speed up the INS step, indexes on the joined columns can be created on the master tables (not on the MV!). This is because, special cases aside, it is well known that the "fast refresh" (the name itself is quite horrible, many people prefer the adjective "incremental") can be actually "fast" only if a small fraction of the master tables is modified (otherwise, a complete refresh is better); in this scenario, almost certainly the optimal plan is a series of NESTED LOOPs that has the current table (test_t1 in this case) as the most outer table, series that can usually benefit a lot by an index on the inner tables join columns. Of course, you must remember that every table, in turn, acts as the most outer table, hence you should index every possible join permutation.
So here what the algorithm is all about: the DEL and INS steps, together, simply delete and recreate the "slice" of the MV that is referenced by the marked rows, whatever the modification type. The algorithm is as simple (and brutal) as it seems.
The only optimizations performed are the skipping of some steps when they are obviously unnecessary. For every master table, the DEL step is skipped when only INSERTs are marked in the logs; the INS is skipped when only DELETEs are marked, and of course both are skipped if no row is marked. I have not been able to spot any other optimization.
Note that this means that UPDATEs always turn into a delete+insert of the entire "slice". For example, consider the typical case of a parent table (say, CUSTOMER), with a child (say, ORDER) and a grandchild (say, ORDER_LINE); if you update a column of a row of the parent (say, ORDERS_TOTAL_AMOUNT), the parent row and its whole progeny (the "slice") will be deleted and then recreated. This was a quite surprising discovery for me - a fact that I have now committed to memory.