Fast refresh of aggregate-only materialized views with MAX – algorithm

In this post I will illustrate the algorithm used by Oracle (in 11.2.0.3) to fast refresh a materialized view (MV) containing only the MAX aggregate function:

create materialized view test_mv
build immediate
refresh fast on demand
with rowid
as
select gby        as mv_gby,
       count(*)   as mv_cnt_star,
       max  (dat) as mv_max_dat
  from test_master
 --where whe = 0
 group by gby
;

The where clause is commented out to enable fast refresh whatever type of DML occurs on the master table, in order to investigate all possible scenarios; the case with the where clause is anyway a sub-case of the former, and we will illustrate it below as well.

As usual, the MV log is configured to "log everything":

create materialized view log on test_master
with rowid ( whe, gby, dat ), sequence
including new values;

In the general introduction to aggregate-only MVs we have seen how the refresh engine first marks the log rows, then inspects TMPDLT (loading its rows into the result cache at the same time) to classify its content as insert-only (if it contains only new values), delete-only (if it contains only old values) or general (if it contains a mix of new/old values). In the MAX scenario, a specialized (and much more performant) algorithm exists only for the insert-only case, and every other case falls back to the general algorithm.

Let's illustrate, with the help of the usual supporting test case, and building on the shoulders of the already illustrated SUM case.

refresh for insert-only TMPDLT

The refresh is made using this single merge statement:

/* MV_REFRESH (MRG) */
merge into test_mv
using (
  with tmpdlt$_test_master as  (
    -- check introduction post for statement
  )
  select gby,
         sum( 1 )   as cnt_star,
         max( dat ) as max_dat
    from (select rid$, gby, dat, dml$$
            from tmpdlt$_test_master
         ) as of snapshot(:b_scn)
    -- where whe = 0 (if the where clause is specified in the MV)
  group by gby
) deltas
on (sys_op_map_nonnull(test_mv.mv_gby) = sys_op_map_nonnull(deltas.gby))
when matched then
  update set
    test_mv.mv_cnt_star = test_mv.mv_cnt_star + deltas.cnt_star,
    test_mv.mv_max_dat =
      decode( test_mv.mv_max_dat,
              null, deltas.max_dat,
              decode( deltas.max_dat,
                      null, test_mv.mv_max_dat,
                      greatest( test_mv.mv_max_dat, deltas.max_dat )
                    )
            )
when not matched then
  insert ( test_mv.mv_gby, test_mv.mv_max_dat, test_mv.mv_cnt_star )
  values ( deltas.gby, deltas.max_dat, deltas.cnt_star )

Similarly to what is done for the SUM case, it simply calculates the delta values to be propagated by grouping-and-maximizing the new values contained in TMPDLT (essentially, it executes the MV statement on the filtered mv log), and then looks for matches over the grouped-by expression (using the null-aware function sys_op_map_nonnull, already illustrated in the post about SUM). It then applies the deltas to the MV, or simply inserts them if no match is found.

The delta application algorithm is very simple: since only inserts have been performed, the MV max(dat) value cannot decrease - it can only increase, if the max(dat) calculated over the deltas is greater. Hence it is simply a matter of setting the new value to the greatest of the old value and the max of the deltas, with a few decodes to handle nulls in the obvious way.
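Just to make the null handling explicit, here is a tiny standalone query (sample values of my own, not taken from the test case) that reproduces the decode/greatest logic of the merge:

-- sample old MV max vs delta max, nulls handled as in the merge above
with samples as (
  select 100 as mv_max_dat, 150 as delta_max from dual union all
  select null, 150 from dual union all
  select 100, null from dual
)
select mv_max_dat, delta_max,
       decode( mv_max_dat,
               null, delta_max,
               decode( delta_max,
                       null, mv_max_dat,
                       greatest( mv_max_dat, delta_max )
                     )
             ) as new_mv_max_dat
  from samples;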

Note that count(dat) is not used, and is not even present in the MV definition.

Note especially, as in the SUM case, that the master table, test_master, is not accessed at all - unfortunately that cannot be done for the general case, as we will see shortly.

This algorithm is also used when the where clause is specified in the MV (adding this clause makes the MV an "insert-only MV", as per the Oracle definition, which means that it can be fast refreshed only after inserts and not after other DML types); the only difference is the obvious addition of the where clause to the deltas calculation as well (as commented in the statement above).

It's also very interesting to remember that this algorithm can be used when only inserts (new values) are present in TMPDLT, not necessarily in the log, and hence it can be used even when deletes or updates are present in the log, provided they are redundant (as seen in the general introduction post). This is especially useful for where-clause MVs, since it widens the possibility to fast refresh beyond the insert-only case, as already demonstrated by script tmpdlt_enables_fast_refresh_of_insert_only_mv.sql of the introduction post.

refresh for mixed-DML TMPDLT

The refresh is accomplished using two statements, a delete that removes every gby value referenced in the log:

/* MV_REFRESH (DEL) */
delete from test_mv
 where sys_op_map_nonnull(mv_gby) in (
        select sys_op_map_nonnull(gby)
          from (select gby
                  from mlog$_test_master
                 where snaptime$$ > :b_st0
               ) as of snapshot(:b_scn)
       )

and an insert that recalculates them reading from the master table:

/* MV_REFRESH (INS) */
insert into test_mv
select gby, count(*), max(dat)
 from (select gby, dat
         from test_master
        where ( sys_op_map_nonnull(gby) ) in (
                select sys_op_map_nonnull(gby)
                  from (select gby, dat
                          from mlog$_test_master
                         where snaptime$$ > :b_st0
                       ) as of snapshot(:b_scn)
              )
       )
 group by gby

This is necessary since a delete (old value) might remove the max value present in the MV, and to know the new max value we must necessarily visit the master table. This might not happen for all logged values, but the refresh engine takes the easiest (and drastic) option of deleting and recreating them all anyway.

Note that this might result in a massive degradation of performance - this algorithm is not O(modifications) but O(modifications * V), where V is the average number of rows per distinct value of gby, which is generally O(mv size). For example: if your order table doubles in size, you must expect the refresh time to double as well, even if the number of orders modified stays the same.

As in the SUM case, a bit surprisingly, this statement does not use TMPDLT, but reads straight from the log; the same observations made for the SUM case apply here as well.

optimizations

The insert-only case is very similar to the SUM case, and thus please refer to the high-level discussion presented there if interested (but, obviously, the creation of the index on mv_cnt_star is not needed in the MAX case). As a side note, one might notice that insert-only algorithms for aggregation are a class of their own, vastly simpler and vastly more performant (and that does not come as a surprise).

The mixed-DML case is another story altogether - to optimize it you must (almost always) create a proper index on the master table. At least the expression sys_op_map_nonnull(gby) must be indexed, but I would strongly advise to create this covering index:

create index test_master_covering on test_master ( sys_op_map_nonnull(gby), gby, dat )

this way everything needed for recalculating a given gby value is neatly clustered in a few leaf blocks, instead of being spread out across the whole table. You might spare thousands of table block visits if, as is quite often the case, you have thousands of rows for each gby value and a bad clustering_factor.

Note also that this index is probably highly compressible, so adding compress 2 (or even 3, depending on the distribution of the "dat" column values) is a great thing to do as well.
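For instance, reusing the DDL above (whether compress 2 or compress 3 pays off has to be verified against your own data):

create index test_master_covering on test_master
  ( sys_op_map_nonnull(gby), gby, dat ) compress 2;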

Script gby_max_with_covering_index.sql shows the possible index-only resulting plan:

--------------------------------------------------
|Id|Operation               |Name                |
--------------------------------------------------
| 0|INSERT STATEMENT        |                    |
| 1| LOAD TABLE CONVENTIONAL|                    |
| 2|  HASH GROUP BY         |                    |
| 3|   NESTED LOOPS         |                    |
| 4|    SORT UNIQUE         |                    |
| 5|     TABLE ACCESS FULL  |MLOG$_TEST_MASTER   |
| 6|    INDEX RANGE SCAN    |TEST_MASTER_COVERING|
--------------------------------------------------

For optimizing the log, and disabling TMPDLT, the same considerations made for the SUM case hold.

Fast refresh of aggregate-only materialized views with SUM – algorithm

In this post I will illustrate the algorithm used by Oracle (in 11.2.0.3) to fast refresh a materialized view (MV) containing only the SUM aggregate function:

create materialized view test_mv
build immediate
refresh fast on demand
with rowid
as
select gby        as mv_gby,
       count(*)   as mv_cnt_star,
       sum  (dat) as mv_sum_dat,
       count(dat) as mv_cnt_dat
  from test_master
 where whe = 0
 group by gby
;

Note that count(dat) is specified - you could avoid that if column dat is constrained to be not-null (as stated in the documentation), but I'm not covering that corner case here.

The MV log is configured to "log everything":

create materialized view log on test_master
with rowid ( whe, gby, dat ), sequence
including new values;

In the general introduction to aggregate-only MVs we have seen how the refresh engine first marks the log rows, then inspects TMPDLT (loading its rows into the result cache at the same time) to classify its content as insert-only (if it contains only new values), delete-only (if it contains only old values) or general (if it contains a mix of new/old values). Here we illustrate the refreshing SQL in all three scenarios, extracted from the supporting test case.

refresh for insert-only TMPDLT

The refresh is made using this single merge statement:

/* MV_REFRESH (MRG) */
merge into test_mv
using (
  with tmpdlt$_test_master as (
    -- check introduction post for statement
  )
  select gby,
         sum( 1 )                           as cnt_star,
         sum( 1 * decode(dat, null, 0, 1) ) as cnt_dat,
         nvl( sum( 1 * dat), 0 )            as sum_dat
    from (select gby, whe, dat
            from tmpdlt$_test_master
         ) as of snapshot(:b_scn)
   where whe = 0
   group by gby
) deltas
on ( sys_op_map_nonnull(test_mv.mv_gby) = sys_op_map_nonnull(deltas.gby) )
when matched then
  update set
    test_mv.mv_cnt_star = test_mv.mv_cnt_star + deltas.cnt_star,
    test_mv.mv_cnt_dat  = test_mv.mv_cnt_dat  + deltas.cnt_dat,
    test_mv.mv_sum_dat  = decode( test_mv.mv_cnt_dat + deltas.cnt_dat,
                                  0, null,
                                  nvl(test_mv.mv_sum_dat,0) + deltas.sum_dat
                                 )
when not matched then
  insert ( test_mv.mv_gby, test_mv.mv_cnt_dat, test_mv.mv_sum_dat, test_mv.mv_cnt_star )
  values ( deltas.gby, deltas.cnt_dat, decode (deltas.cnt_dat, 0, null, deltas.sum_dat), deltas.cnt_star)

It simply calculates the delta values to be propagated by grouping-and-summing the new values contained in TMPDLT that satisfy the where-clause (essentially, it executes the MV statement on the mv log without redundant values), and then looks for matches over the grouped-by expression (using the null-aware function sys_op_map_nonnull, more on this later). It then applies the deltas to the MV, or simply inserts them if no match is found.

Note that mv_sum_dat (that materializes sum(dat)) is set to null if, and only if, (the updated value of) mv_cnt_dat (that materializes count(dat)) is zero (signaling that for this value of mv_gby, all values of dat in the master table are null). This is done in all three scenarios of the algorithm.

The matching function sys_op_map_nonnull() is there to match null values with null values, since aggregating by null is perfectly legal, yet you cannot match null with null in a merge/join. This function returns a raw value that is never null, and set to 0xFF when the input is null and to the binary representation of the input postfixed with 0x00 for other input values. Note that a function-based index, named I_SNAP$_TEST_MV in our case, is automatically created on sys_op_map_nonnull(mv_gby) to give the CBO the opportunity to optimize the match (unless the MV is created specifying USING NO INDEX, which is probably almost never a good idea when you need to fast refresh).
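A quick way to see the function at work (a throwaway query of mine, not part of the test case):

select sys_op_map_nonnull('A')                  as varchar2_input,
       sys_op_map_nonnull(10)                   as number_input,
       sys_op_map_nonnull(cast(null as number)) as null_input
  from dual;

-- the first two columns show the internal representation of the input postfixed with 0x00,
-- the third one is FF: nulls match nulls, and never match real values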

Note also that the master table, test_master, is not accessed at all, as it is always the case for SUM (but not necessarily for MAX, as we will see in the next post). This elegant decoupling (possible thanks to the mathematical properties of the addition, of course) of the master table from the MV greatly improves performance and also simplifies performance tuning.

refresh for delete-only TMPDLT

The refresh is made in two steps, the first being this update statement:

/* MV_REFRESH (UPD) */
update /*+ bypass_ujvc */ (
  select test_mv.mv_cnt_dat,
         deltas .cnt_dat,
         test_mv.mv_sum_dat,
         deltas .sum_dat,
         test_mv.mv_cnt_star,
         deltas .cnt_star
    from test_mv,
         ( with tmpdlt$_test_master as (
             -- check introduction post for statement
           )
           select gby,
                  sum( -1 )                           as cnt_star,
                  sum( -1 * decode(dat, null, 0, 1) ) as cnt_dat,
                  nvl( sum(-1 * dat), 0)              as sum_dat
             from (select gby, whe, dat
                     from tmpdlt$_test_master mas$
                  ) as of snapshot(:b_scn)
            where whe = 0
            group by gby
         ) deltas
   where sys_op_map_nonnull(test_mv.mv_gby) = sys_op_map_nonnull(deltas.gby)
)
set mv_cnt_star = mv_cnt_star + cnt_star,
    mv_cnt_dat  = mv_cnt_dat  + cnt_dat,
    mv_sum_dat  = decode(mv_cnt_dat + cnt_dat, 0, null, nvl(mv_sum_dat,0) + sum_dat)

this calculates the same deltas as the insert-only case, just with signs reversed since, of course, we are propagating deletes instead of inserts; it then applies them using an updatable join in-line view instead of a merge.

Then, this delete statement is issued:

/* MV_REFRESH (DEL) */
delete from test_mv where mv_cnt_star = 0;

this is because, when mv_cnt_star (that materializes count(*)) is zero after the deltas application, it means that all the rows belonging to that value of mv_gby have been deleted in the master table, and hence that value must be removed from the MV as well.

Note that an index on mv_cnt_star is NOT automatically created (as of 11.2.0.3) - it might be a very good idea to create it, to avoid a full scan of the MV at every refresh, which is O(mv size) and not O(modifications) as the other steps (thus rendering the whole refresh process O(mv size)).
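Something as simple as this (the index name is mine) would do, turning the final delete into a cheap index range scan on mv_cnt_star = 0:

create index test_mv_cnt_star_idx on test_mv ( mv_cnt_star );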

refresh for mixed-DML TMPDLT

The refresh is accomplished using a single merge statement, which is an augmented version of the insert-only statement plus a delete clause that implements the last part of the delete-only refresh:

/* MV_REFRESH (MRG) */
merge into test_mv
using (
 select gby,
        sum( decode(dml$$, 'I',  1, -1) )                          as cnt_star,
        sum( decode(dml$$, 'I',  1, -1) * decode(dat, null, 0, 1)) as cnt_dat,
        nvl( sum(decode(dml$$, 'I',  1, -1) * dat), 0)             as sum_dat
   from (select chartorowid(m_row$$) rid$, gby, whe, dat,
                decode(old_new$$, 'N', 'I', 'D') as dml$$,
                dmltype$$
           from mlog$_test_master
          where snaptime$$ > :b_st0
        ) as of snapshot(:b_scn)
  where whe = 0
  group by gby
  ) deltas
on ( sys_op_map_nonnull(test_mv.mv_gby) = sys_op_map_nonnull(deltas.gby) )
when matched then
  update set
    test_mv.mv_cnt_star = test_mv.mv_cnt_star + deltas.cnt_star,
    test_mv.mv_cnt_dat  = test_mv.mv_cnt_dat  + deltas.cnt_dat,
    test_mv.mv_sum_dat  = decode( test_mv.mv_cnt_dat + deltas.cnt_dat,
                                  0, null,
                                  nvl(test_mv.mv_sum_dat,0) + deltas.sum_dat
                                )
  delete where ( test_mv.mv_cnt_star = 0 )
when not matched then
  insert ( test_mv.mv_gby, test_mv.mv_cnt_dat, test_mv.mv_sum_dat, test_mv.mv_cnt_star )
  values ( deltas.gby, deltas.cnt_dat, decode (deltas.cnt_dat, 0, null, deltas.sum_dat), deltas.cnt_star)
   where (deltas.cnt_star > 0)

Here the deltas are calculated by reversing the sign of "old" values (dml$$ not equal to 'I', which is the same as old_new$$ not equal to 'N'; note in passing that it does not distinguish between old_new$$ equal to 'O' or 'U', as stated in the introduction post), of course adjusting cnt_star and cnt_dat accordingly.

The removal of rows that get their mv_cnt_star set to zero is performed as a side case of the update, which is very nice since it does not call for an index on that column.

Surprisingly, this statement does not use TMPDLT, but reads straight from the log; I don't know the reason behind this, and whether this is always the case or if TMPDLT is sometimes used, depending, perhaps, on some heuristic decision. Surely, while using TMPDLT is mandatory in the other two cases (since the statements work only if their input is insert/delete only, and that is checked over TMPDLT only), it is just a possible optimization choice here.

optimizations

Knowing the "internal" workings presented here makes it vastly easier (I hope) to optimize the refresh process and avoid pitfalls; it is of course unfeasible to cover all possible real-life scenarios, but I can offer some general high-level considerations.

Obviously, fast refreshing is better than complete refreshing only when the number of modifications stored in the MV log is "small" compared to the MV size. In this scenario the usual optimal values-propagation plan reads from the log, computes TMPDLT (when used), and joins the MV with NESTED LOOPS using the (automatically created) index on sys_op_map_nonnull(mv_gby) - or possibly with a HASH JOIN or SORT MERGE JOIN, again using the same index.

Hence the only optimization worth making is to create the index on mv_cnt_star, unless you can be absolutely sure that you will never be in the delete-only scenario, or unless you don't care about the resulting full MV scan.

Since the master table is never read during the refresh, it can be left alone. This is great.

The best access method for the log is usually a full table scan, since all its rows are normally read, and hence usually nothing has to be done on the log. I can imagine that in rare cases one might consider optimizing the log by creating an index - for example a covering index on all the referenced columns; or one that speeds up the analytic function computations of TMPDLT by avoiding the sort; or one prefixed by snaptime$$ if other MVs read from the log and refresh at different times; etc.
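Purely as an illustration of the options above (index names and column lists are mine, to be adapted to the statements you actually observe):

create index mlog_test_master_snap_idx on mlog$_test_master ( snaptime$$ );

create index mlog_test_master_cov_idx on mlog$_test_master
  ( snaptime$$, sequence$$, m_row$$, old_new$$, dmltype$$, gby, dat, whe );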

Maybe, sometimes, it might prove beneficial to disable TMPDLT, as discussed in the introduction post.

Fast refresh of aggregate-only materialized views – introduction

This post introduces a series about the algorithm used by Oracle (in 11.2.0.3) to fast refresh a materialized view (MV) containing only an aggregate:

create materialized view test_mv
build immediate
refresh fast on demand
with rowid
as
select gby        as mv_gby,
       count(*)   as mv_cnt_star,
       AGG  (dat) as mv_AGG_dat,
       count(dat) as mv_cnt_dat
  from test_master
 where whe = 0
 group by gby
;

Where AGG is either SUM or MAX, the most important aggregates.

In the next posts, I will illustrate the algorithms used to propagate conventional (not direct-load) inserts, updates and deletes on the master table; I will also illustrate the specialized versions of the algorithms used when only one type of DML has been performed (if they exist).

In this post, we set the stage, make some general observations, and illustrate the very first steps of the algorithm, which are common to all scenarios. Everything is supported by the usual test case.

Materialized view logs configuration

I have configured the materialized view log on the master table to "log everything", to give the most complete information possible to the MV refresh engine:

create materialized view log on test_master
with rowid ( whe, gby, dat ), sequence
including new values;

With this configuration, each modification to the master table logs the rowid of the affected rows (in column m_row$$), and it is labeled with an increasing value (in sequence$$) that enables the MV refresh engine to reconstruct the order in which the modifications happened. In detail, let’s see what’s inside the logs after we modify a single row (from mvlog_examples.sql):
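The listings below can be obtained with a trivial query along these lines (my sketch; the actual script might differ slightly):

select sequence$$, m_row$$, dmltype$$, old_new$$, whe, gby, dat
  from mlog$_test_master
 order by sequence$$;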

After an INSERT:

SEQUENCE$$ M_ROW$$              DMLTYPE$$ OLD_NEW$$    WHE    GBY    DAT
---------- -------------------- --------- --------- ------ ------ ------
     10084 AAAWK0AAEAAAxTHAD6   I         N             10     10     10

This logs the new values (old_new$$=’N’) of an Insert (dmltype$$=’I’).

After a DELETE:

SEQUENCE$$ M_ROW$$              DMLTYPE$$ OLD_NEW$$    WHE    GBY    DAT
---------- -------------------- --------- --------- ------ ------ ------
     10085 AAAWK0AAEAAAxTFAAA   D         O              0      0      1

This logs the old values (old_new$$=’O’) of a Delete (dmltype$$=’D’).

After an UPDATE:

SEQUENCE$$ M_ROW$$              DMLTYPE$$ OLD_NEW$$    WHE    GBY    DAT
---------- -------------------- --------- --------- ------ ------ ------
     10086 AAAWK0AAEAAAxTHAD6   U         U             10     10     10
     10087 AAAWK0AAEAAAxTHAD6   U         N             10     10     99

This logs both the old values (old_new$$=’U’) and the new values (old_new$$=’N’) of an Update (dmltype$$=’U’). So we see that the update changed DAT from 10 to 99, without changing the other columns.

Note that the update log format is the same as a delete (at sequence 10086) immediately followed by an insert (at sequence 10087) at the same location on disk (AAAWK0AAEAAAxTHAD6), the only differences being dmltype$$=’U’ and old_new$$ set to ’U’ instead of ‘O’ for the old values.

But if you ignore these differences, you can consider the log a sequence of deletes/inserts, or if you prefer, a stream of old/new values. And this is exactly what the refresh engine does - it does not care whether an old value is present because it logs a delete or the "erase side" of an update, and ditto for new values. It "sees" the log as a stream of old/new values, as we will demonstrate.

Log snapshots

When the MV fast refresh is started, the first step is to "mark" the logged modifications to be propagated to the MV by setting snaptime$$ equal to the current time - check the description contained in this post for details (note also another possible variant with "commit-scn mv logs"). MV log purging (at the end of the refresh) is the same as well.
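Conceptually only (this is a sketch of mine, not the exact internal statement - rows not yet propagated are stamped with a snaptime$$ far in the future), the marking boils down to something like:

update mlog$_test_master
   set snaptime$$ = :refresh_time
 where snaptime$$ > :refresh_time;   -- unmarked rows still carry their "future" snaptime$$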

TMPDLT (deleting the redundant log values)

The stream of old/new values marked in the log might contain pairs of redundant values, each pair being composed of a new value (insert) immediately followed by an old value (delete) on the same row; every such pair can be ignored without affecting the refresh result. Filtering out these pairs is the job of this SQL fragment (nicknamed "TMPDLT"), heavily edited for readability:

with tmpdlt$_test_master as (
  select /*+ result_cache(lifetime=session) */
         rid$, gby, dat, whe,
         decode(old_new$$, 'N', 'I', 'D') dml$$,
         old_new$$,  snaptime$$,
         dmltype$$
   from (select log.*,
                min( sequence$$ ) over (partition by rid$) min_sequence$$,
                max( sequence$$ ) over (partition by rid$) max_sequence$$
           from (select chartorowid(m_row$$) rid$, gby, dat, whe,
                        dmltype$$, sequence$$, old_new$$, snaptime$$
                   from mlog$_test_master
                  where snaptime$$ > :b_st0
                ) as of snapshot(:b_scn) log
        )
  where ( (old_new$$ in ('O', 'U') ) and (sequence$$ = min_sequence$$) )
     or ( (old_new$$ = 'N'         ) and (sequence$$ = max_sequence$$) )
)

The "log" in-line view is the usual one selecting the marked log rows; the outer query blocks, for each rowid, keeps only the first logged value but only if it is old, and the last logged one, but only if it is new.

So for example (check tmpdlt_pair_removal_examples.sql), TMPDLT filters out the rows marked with (*) from this triple update of the same row:

SEQUENCE$$ OLD_NEW$$    GBY    DAT    WHE
---------- --------- ------ ------ ------
     10142 U              0      1      0
     10143 N              0   1000      0  *
     10144 U              0   1000      0  *
     10145 N              0   2000      0  *
     10146 U              0   2000      0  *
     10147 N              0   3000      0

and it removes completely this pair (obtained by inserting a row and then deleting it):

SEQUENCE$$ OLD_NEW$$    GBY    DAT    WHE
---------- --------- ------ ------ ------
     10162 N             -1     -1     -1  *
     10163 O             -1     -1     -1  *

As we will see, the result of TMPDLT is (almost always) the actual input to the refreshing algorithm, instead of the "raw" log rows. Note that this prefiltering is relatively expensive, and while removing some redundant values might be somewhat beneficial in itself, it is especially useful when the log contains a mix of new and old values and TMPDLT is able to turn it into a new-only (insert-only) or old-only (delete-only) stream. When that happens, the more specialized versions of the algorithm can be used, thus saving resources - even if, in general, the savings might not repay the cost of TMPDLT.

Even better, this prefiltering shows its greatest advantage when you have to refresh a so-called "insert-only MV", that is, a MV that refuses to fast-refresh if updates or deletes were performed, but fast-refreshes happily when only inserts were performed: you might be able to fast refresh and avoid a complete refresh if TMPDLT is able to filter out all the old values. This happens, for example, if you insert some rows first and then modify (or delete) only the newly inserted rows before refreshing - as demonstrated by tmpdlt_enables_fast_refresh_of_insert_only_mv.sql using the classic insert-only MV, a MV containing MAX and a where clause.

TMPDLT caching

The first operation performed by the refresh engine is to classify the log content as new-only (insert-only), old-only (delete-only) or mixed, both to decide which refresh algorithm to use (insert-only, delete-only, general) and, possibly, to raise "ORA-32314 REFRESH FAST unsupported after deletes/updates" for insert-only MVs.

To classify the log, it issues a SQL statement on TMPDLT that lists, for every possible DML type (Insert, Update, Delete), the max value of snaptime$$ contained in the log. In passing, this might enable some optimizations such as multi-step refreshes, but I have not investigated this.

Immediately after this, the chosen refreshing algorithm version might reference TMPDLT again - this time (possibly) saving resources since TMPDLT is result-cached, thanks to the hint "result_cache(lifetime=session)".

The caching is a potentially relevant optimization, since analytic functions can use a lot of resources for big (long and/or wide) MV logs. It also means that one must check the result cache size (and utilization) as well when tuning for performance - and, of course, check in the first place that the analytic functions have enough resources to operate efficiently.

Side note: the undocumented modifier "lifetime=session" simply means that the result is flushed (at least) when the session ends (check result_cache_session_lifetime.sql), which is a nice optimization since TMPDLT is flashed back in time and hence is NOT flushed when the log is modified. It is anyway explicitly flushed as soon as the refresh ends, hence this is only a safety net just in case the refresh fails for some reason (e.g. disk full).

TMPDLT disabling

What if you don't benefit from TMPDLT because your log does not contain (enough) redundant values, and you don't want to pay the cost of its processing and/or caching?

You can disable it by removing the sequence from the MV log, which actually, as far as I know, seems to be used only by this prefilter. If this is done, all the refreshing statements read directly from the log; script tmpdlt_disabling_all.sql proves this (you will better appreciate how it works and its output after the next two posts, which illustrate the actual refreshing algorithms of SUM and MAX, but you can already see that TMPDLT disappears from the refreshing statements).
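For example, a sequence-less log for our master table could be recreated as follows (keeping in mind that dropping and recreating the log generally forces a complete refresh of the dependent MVs before fast refresh becomes possible again):

drop materialized view log on test_master;

create materialized view log on test_master
with rowid ( whe, gby, dat )
including new values;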

The same script also investigates disabling it by setting the undocumented parameter "_mav_refresh_opt"=32, but as always, ask Oracle Support first (also because there's no official note on MOS explaining how it works, and I haven't used it in production yet - I actually discovered it about one week ago while preparing this post).

In the next post, we will examine the SUM scenario.

“alter session force parallel query”, and indexes

This post is a brief discussion of the advantages of activating parallelism by altering the session environment instead of using the alternative ways (hints, DDL). The latter are the most popular in my experience, but I have noticed that their popularity is quite frequently due more to imperfect understanding than to informed decision - and that's a pity, since "alter session force parallel query" can really save everyone a lot of tedious work and improve maintainability a great deal.

We will also check that issuing

alter session force parallel query parallel N;

is the same as specifying the hints

/*+ parallel (t,N)  */
/*+ parallel_index (t, t_idx, N) */

for all tables referenced in the query, and for all indexes defined on them (the former is quite obvious, the latter not that much).

Side note: it is worth remembering that hinting the table for parallelism does not cascade automatically to its indexes as well - you must explicitly specify the indexes that you want to be accessed in parallel by using the separate parallel_index hint (maybe specifying "all indexes" by using the two-parameter variant "parallel_index(t,N)"). The same holds for "alter table parallel N" and "alter index parallel N", of course.

the power of "force parallel query"

I've rarely found any reason for avoiding index parallel operations nowadays - usually both the tables and their indexes are stored on disks with the same performance figures (if not the same set of disks altogether), and the cost of the initial segment checkpoint is not generally different. Conversely, using an index can offer terrific opportunities for speeding up queries, especially when a full table scan can be replaced by a fast full scan of a (perhaps much) smaller index.

Thus, I almost always let the CBO consider index parallelism as well. Three methods can be used:
- statement hints (the most popular option)
- alter table/index parallel N
- "force parallel query".

I rather hate injecting parallel hints everywhere in my statements since it is very risky. It is far too easy to forget to specify a table or index (or simply misspell them), not to mention to overlook new, potentially good indexes added after the statement has been finalized. Also, you must change the statement even if you simply want to change the degree of parallelism, perhaps just because you are moving from an underequipped, humble and cheap test environment to a mighty production server. By contrast, "force parallel query" is simple and elegant - just a quick command and you're done, with a single place to touch in order to change the parallel degree.

"alter table/index parallel N" is another weak technique as well in my opinion, mainly for two reasons. The first one is that it is a permanent modification to the database objects, and after the query has finished, it is far too easy to fail to revert the objects back to their original degree setting (because of failure or coding bug). The second one is the risk of two concurrent sessions colliding on the same object that they both want to read, but with different degrees of parallelism.
Both the two problems above do not hold only when you always want to run with a fixed degree for all statements; but even in this case, I would consider issuing "force parallel query" (maybe inside a logon trigger) instead of having to set/change the degree for all tables/indexes accessed by the application.
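For example, a minimal sketch of such a logon trigger (the user name and the degree are made up, of course):

create or replace trigger force_pq_at_logon
after logon on database
begin
  -- only for sessions of a specific application user
  if sys_context('userenv', 'session_user') = 'REPORTING_APP' then
    execute immediate 'alter session force parallel query parallel 16';
  end if;
end;
/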

I have noticed that many people are afraid of "force parallel query" because of the word "force", believing that it switches every statement into parallel mode. But this is not the case: as Tanel Poder recently illustrated, the phrase "force parallel query" is misleading; a better one would be something like "consider parallel query", since it is perfectly equivalent to hinting the statement for parallelism as far as I can tell (see below). And hinting itself tells the CBO to consider parallelism in addition to serial execution; the CBO is perfectly free to choose a serial execution plan if it estimates that it will cost less - as demonstrated by Jonathan Lewis years ago.
Hence there's no reason to be afraid, for example, that a nice Index Range Scan that selects just one row might turn into a massively inefficient Full Table Scan (or index Fast Full Scan) of a one-million-row table/index. That is true barring bugs and CBO limitations, obviously; but in these hopefully rare circumstances, one can always use the no_parallel and no_parallel_index hints to fix the issue.

"force parallel query" and hinting: test case

Let's show that altering the session is equivalent to hinting. I will illustrate the simplest case only - a single-table statement that can be resolved either by a full table scan or an index fast full scan (check script force_parallel_main.sql in the test case) - but in the test case zip two other scenarios (a join and a subquery) are tested as well. Note: I have only checked 9.2.0.8 and 11.2.0.3 (but I would be surprised if the test case could not be reproduced in 10g as well).

Table "t" has an index t_idx on column x, and hence the statement

select sum(x) from t;

can be calculated by either scanning the table or the index. In serial, the CBO chooses to scan the smaller index (costs are from 11.2.0.3):

select /* serial */ sum(x) from t;
--------------------------------------
|Id|Operation             |Name |Cost|
--------------------------------------
| 0|SELECT STATEMENT      |     | 502|
| 1| SORT AGGREGATE       |     |    |
| 2|  INDEX FAST FULL SCAN|T_IDX| 502|
--------------------------------------
 

If we now activate parallelism for the table, but not for the index, the CBO chooses to scan the table:

select /*+ parallel(t,20) */ sum(x) from t
------------------------------------------
|Id|Operation              |Name    |Cost|
------------------------------------------
| 0|SELECT STATEMENT       |        | 229|
| 1| SORT AGGREGATE        |        |    |
| 2|  PX COORDINATOR       |        |    |
| 3|   PX SEND QC (RANDOM) |:TQ10000|    |
| 4|    SORT AGGREGATE     |        |    |
| 5|     PX BLOCK ITERATOR |        | 229|
| 6|      TABLE ACCESS FULL|T       | 229|
------------------------------------------

since the cost for the parallel table access is now down from the serial cost of 4135 (check the test case logs) to the parallel cost 4135 / (0.9 * 20) = 229, thus less than the cost (502) of the serial index access.

Hinting the index as well makes the CBO apply the same scaling factor (0.9*20) to the index as well, and hence we are back to index access:

select /*+ parallel_index(t, t_idx, 20) parallel(t,20) */ sum(x) from t
---------------------------------------------
|Id|Operation                 |Name    |Cost|
---------------------------------------------
| 0|SELECT STATEMENT          |        |  28|
| 1| SORT AGGREGATE           |        |    |
| 2|  PX COORDINATOR          |        |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|    |
| 4|    SORT AGGREGATE        |        |    |
| 5|     PX BLOCK ITERATOR    |        |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |  28|
---------------------------------------------

Note that the cost computation is 28 = 502 / (0.9 * 20), less than the previous one (229).

"Forcing" parallel query:

alter session force parallel query parallel 20;

select /* force parallel query  */ sum(x) from t
---------------------------------------------
|Id|Operation                 |Name    |Cost|
---------------------------------------------
| 0|SELECT STATEMENT          |        |  28|
| 1| SORT AGGREGATE           |        |    |
| 2|  PX COORDINATOR          |        |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|    |
| 4|    SORT AGGREGATE        |        |    |
| 5|     PX BLOCK ITERATOR    |        |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |  28|
---------------------------------------------

Note that the plan is the same (including costs), as predicted.

Side note: let's verify, just for fun, that the statement can run serially even if the session is "forced" to parallel (note that I have changed the statement, since the original one always benefits from parallelism):

alter session force parallel query parallel 20;

select /* force parallel query (with no parallel execution) */ sum(x) from t
WHERE X < 0
----------------------------------
|Id|Operation         |Name |Cost|
----------------------------------
| 0|SELECT STATEMENT  |     |   3|
| 1| SORT AGGREGATE   |     |    |
| 2|  INDEX RANGE SCAN|T_IDX|   3|
----------------------------------

Side note 2: activation of parallelism for all referenced objects can be obtained, in 11.2.0.3, using the new statement-level parallel hint (check this note by Randolf Geist for details):

select /*+ parallel(20) */ sum(x) from t
---------------------------------------------------
|Id|Operation                 |Name    |Table|Cost|
---------------------------------------------------
| 0|SELECT STATEMENT          |        |     |  28|
| 1| SORT AGGREGATE           |        |     |    |
| 2|  PX COORDINATOR          |        |     |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|     |    |
| 4|    SORT AGGREGATE        |        |     |    |
| 5|     PX BLOCK ITERATOR    |        |     |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |T    |  28|
---------------------------------------------------

This greatly simplifies hinting, but of course you must still edit the statement if you need to change the parallel degree.

fast refresh of outer-join-only materialized views – algorithm, part 2

In this post, we are going to complete part 1 by illustrating the (considerably more complex) general case of a fast refresh from a master inner table without a unique constraint on the joined column(s).

To recap, now the outer slice can be composed of more than one row, for example:

ooo inn1
ooo inn2

and hence both the DEL and INS steps must consider (and read) the whole outer slice even if only a subset of the inner rows has been modified. This requires both more resources and a considerably more complex algorithm. Let's illustrate it (the mandatory test case is here).

The DEL macro step

This sub-step (named DEL.del by me) is performed first:

/* MV_REFRESH (DEL) */
delete from test_mv where rowid in (
select rid
  from (
select test_mv.rowid rid,
       row_number() over (partition by test_outer_rowid order by rid$ nulls last) r,
       count(*)     over (partition by test_outer_rowid ) t_cnt,
       count(rid$)  over (partition by test_outer_rowid ) in_mvlog_cnt
  from test_mv, (select distinct rid$ from mlog$_test_inner) mvlog
 where /* read touched outer slices start */
       test_mv.test_outer_rowid in
          (
          select test_outer_rowid
            from test_mv
           where test_inner_rowid in (select rid$ from mlog$_test_inner)
          )
       /* read touched outer slices end */
   and test_mv.test_inner_rowid = mvlog.rid$(+)
       )
 /* victim selection start */
 where t_cnt > 1
   and ( (in_mvlog_cnt = t_cnt and r > 1)
          or
         (in_mvlog_cnt < t_cnt and r <= in_mvlog_cnt)
       )
 /* victim selection end */
)

followed by the DEL.upd one:

/* MV_REFRESH (UPD) */
update test_mv
   set jinner = null, xinner = null, pkinner = null, test_inner_rowid = null
 where test_inner_rowid in (select rid$ from mlog$_test_inner)

These two steps combined change all the rows of the MV marked in the log (and only them; other rows are not modified at all); the first step deletes some of them, leaving all the others to the second one, which sets to null their columns coming from the inner table.

DEL.upd is straightforward. Let's illustrate the DEL.del algorithm:

a) the section "read touched outer slices" fetches all the MV outer slices that have at least one of their rows marked in the log;
b) the slices are outer joined with the "mvlog" in-line view, so that rid$ will be nonnull for all rows marked in the log;
c) the analytic functions, for each outer slice separately, compute the number of rows (column t_cnt) and the number of marked rows (column in_mvlog_cnt), and then attach a label (column r) that orders the rows (the order is not important at all, besides non-marked rows being ordered last);
d) the where-predicate "victim selection" dictates which rows to delete.

The victim selection predicate has three sub-components, each implementing a different case (again, considering each slice separately):

"t_cnt > 1": do not delete anything if the slice contains only one row (since it is for sure marked and hence will be nulled by DEL.upd)

             rid$ t_cnt in_mvlog_cnt r  action
ooo inn1 not-null     1            1 1  updated by DEL.upd   


"in_mvlog_cnt = t_cnt and r > 1": all rows are marked, delete all but one (that will be nulled by DEL.upd)

             rid$ t_cnt in_mvlog_cnt r  action
ooo inn1 not-null     3            3 1  updated by DEL.upd
ooo inn2 not-null     3            3 2  deleted by DEL.del
ooo inn3 not-null     3            3 3  deleted by DEL.del


"in_mvlog_cnt < t_cnt and r <= in_mvlog_cnt": only some rows are marked; delete all marked rows, keep all the others.

             rid$ t_cnt in_mvlog_cnt r  action
ooo inn1 not-null     3            2 1  deleted by DEL.del
ooo inn2 not-null     3            2 2  deleted by DEL.del
ooo inn3     null     3            2 3  nothing


The INS macro step

The first sub-step is INS.ins:

/* MV_REFRESH (INS) */
insert into test_mv
select  o.jouter,  o.xouter,  o.pkouter, o.rowid,
       jv.jinner, jv.xinner, jv.pkinner, jv.rid
  from ( select test_inner.rowid rid,
                test_inner.*
           from test_inner
          where rowid in (select rid$ from mlog$_test_inner)
       ) jv, test_outer o
 where jv.jinner = o.jouter

this sub-step simply finds matches in the outer table for the marked inner table rows (note that it is an inner join, not an outer join), and inserts them into the MV.

Then, INS.del:

/* MV_REFRESH (DEL) */
delete from test_mv sna$ where rowid in (
select rid
 from (
select test_mv.rowid rid,
       row_number()            over (partition by test_outer_rowid order by test_inner_rowid nulls first) r,
       count(*)                over (partition by test_outer_rowid ) t_cnt,
       count(test_inner_rowid) over (partition by test_outer_rowid ) nonnull_cnt
  from test_mv
 where /* read touched outer slices start */
       test_mv.test_outer_rowid in
          (
          select o.rowid
            from ( select test_inner.rowid rid$,
                          test_inner.*
                     from test_inner
                    where rowid in (select rid$ from mlog$_test_inner)
                 ) jv, test_outer o
           where jv.jinner = o.jouter
          )
      /* read touched outer slices end */
      )
 /* victim selection start */
 where t_cnt > 1
   and ( (nonnull_cnt = 0 and r > 1)
          or
         (nonnull_cnt > 0 and r <= t_cnt - nonnull_cnt)
       )
 /* victim selection end */
)

this sub-step has a SQL structure very similar to DEL.del, hence I will simply outline the algorithm: first, the statement identifies (in the "read touched outer slices" section) all the outer slices that had at least one row inserted by INS.ins, by replaying its join; then, for each slice, it deletes any row, if one exists, that has column "test_inner_rowid" set to null (check the "victim selection" predicate).

Side note: I cannot understand how nonnull_cnt could be = 0 - possibly that is for robustness only or because it can handle variants of the DEL step I haven't observed.

speeding up

These are the indexes that the CBO might enjoy using to optimize the steps of the propagation from the inner table:
- DEL.del: test_mv(test_inner_rowid, test_outer_rowid)
- DEL.upd: test_mv(test_inner_rowid)
- INS.ins: test_outer(jouter)
- INS.del: test_outer(jouter) and test_mv(test_outer_rowid , test_inner_rowid)

And hence, to optimize all steps:
- test_outer(jouter)
- test_mv(test_inner_rowid, test_outer_rowid)
- test_mv(test_outer_rowid , test_inner_rowid)

And of course we need the usual index on test_inner(jinner) to optimize the propagation from the outer table (not shown in this post), unless we positively know that the outer table is never modified.
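Putting it all together, a possible set of DDL (index names are mine) could be:

create index test_outer_jouter_idx   on test_outer ( jouter );
create index test_mv_inner_outer_idx on test_mv    ( test_inner_rowid, test_outer_rowid );
create index test_mv_outer_inner_idx on test_mv    ( test_outer_rowid, test_inner_rowid );
create index test_inner_jinner_idx   on test_inner ( jinner );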

Note that the two indexes test_mv(test_inner_rowid, test_outer_rowid) and test_mv(test_outer_rowid, test_inner_rowid) make it possible to skip visiting the MV altogether (except for deleting rows, obviously) and hence might reduce the number of consistent gets dramatically (they are both "covering" indexes for the SQL statements we observed in the DEL.del and INS.del steps).

For example, in my test case (check ojoin_mv_test_case_indexed.sql), the plan for the DEL.del step is:

--------------------------------------------------------------
| 0|DELETE STATEMENT                |                        |
| 1| DELETE                         |TEST_MV                 |
| 2|  NESTED LOOPS                  |                        |
| 3|   VIEW                         |VW_NSO_1                |
| 4|    SORT UNIQUE                 |                        |
| 5|     VIEW                       |                        |
| 6|      WINDOW SORT               |                        |
| 7|       HASH JOIN OUTER          |                        |
| 8|        HASH JOIN SEMI          |                        |
| 9|         INDEX FULL SCAN        |TEST_MV_TEST_INNER_ROWID|
|10|         VIEW                   |VW_NSO_2                |
|11|          NESTED LOOPS          |                        |
|12|           TABLE ACCESS FULL    |MLOG$_TEST_INNER        |
|13|           INDEX RANGE SCAN     |TEST_MV_TEST_INNER_ROWID|
|14|        VIEW                    |                        |
|15|         SORT UNIQUE            |                        |
|16|          TABLE ACCESS FULL     |MLOG$_TEST_INNER        |
|17|   MAT_VIEW ACCESS BY USER ROWID|TEST_MV                 |
--------------------------------------------------------------
5 - filter[ (T_CNT>1 AND ((IN_MVLOG_CNT=T_CNT AND R>1)
OR (IN_MVLOG_CNT<T_CNT AND R<=IN_MVLOG_CNT))) ]
...

Note the absence of any access to the MV to identify the rows to be deleted (row source operation 5 and its progeny; note the filter operation, which is the final "victim selection predicate"); the MV is only accessed to physically delete the rows.

Ditto for the INS.del step:

-------------------------------------------------------------------
| 0|DELETE STATEMENT                     |                        |
| 1| DELETE                              |TEST_MV                 |
| 2|  NESTED LOOPS                       |                        |
| 3|   VIEW                              |VW_NSO_1                |
| 4|    SORT UNIQUE                      |                        |
| 5|     VIEW                            |                        |
| 6|      WINDOW SORT                    |                        |
| 7|       HASH JOIN SEMI                |                        |
| 8|        INDEX FULL SCAN              |TEST_MV_TEST_INNER_ROWID|
| 9|        VIEW                         |VW_NSO_2                |
|10|         NESTED LOOPS                |                        |
|11|          NESTED LOOPS               |                        |
|12|           TABLE ACCESS FULL         |MLOG$_TEST_INNER        |
|13|           TABLE ACCESS BY USER ROWID|TEST_INNER              |
|14|          INDEX RANGE SCAN           |TEST_OUTER_JOUTER_IDX   |
|15|   MAT_VIEW ACCESS BY USER ROWID     |TEST_MV                 |
-------------------------------------------------------------------
5 - filter[ (T_CNT>1 AND ((NONNULL_CNT=0 AND R>1)
OR (NONNULL_CNT>0 AND R<=T_CNT-NONNULL_CNT))) ]
...



You might anyway create just the two "standard" single-column indexes on test_mv(test_inner_rowid) and test_mv(test_outer_rowid) and be happy with the resulting performance, even if you will now access the MV to get the "other" rowid - it all depends, of course, on your data (how many rows you have in each slice, and how many slices are touched by the marked rows) and on how you modify the master tables.

fast refresh of outer-join-only materialized views – algorithm, part 1

In this series of posts we will discuss how Oracle refreshes materialized views (MV) containing only OUTER joins, covering only 11.2.0.3. We will use the very same scenario (MV log configuration, DML type, etc) as in the inner join case, "just" turning the inner join into an outer join:

create materialized view test_mv
build immediate
refresh fast on demand
as
select test_outer.*, test_outer.rowid as test_outer_rowid,
       test_inner.*, test_inner.rowid as test_inner_rowid
  from test_outer, test_inner
 where test_outer.jouter = test_inner.jinner(+)
;

For the outer case, the overall strategy is the same we already saw for inner joins: modifications from each master table are propagated separately to the MV, and by still performing, for each master table, the same two macro steps (first the DEL, and then the INS one).

Actually, propagation from the outer table is exactly the same (with the obvious slight difference of performing an outer join to recalculate the "slice" in the INS step), and hence we will discuss only the propagation from the inner table, which is considerably more complex.

There are actually two different propagation algorithms; one is much simpler and less resource-intensive, but requires a unique constraint on the joined columns (here, "jinner" alone). In this post I will discuss the details of this specialized algorithm only, leaving the details of the other, "general" one to the next post.

Does the existence of a unique constraint on the joined column(s) enable such a dramatic simplification of the propagation that it justifies a specialized algorithm? Yes, absolutely - and it is interesting to understand the reason, since it follows naturally from the very semantics of the outer join SQL construct, and hence, as a by-product, we can also improve our understanding of this important component of the SQL syntax.

Let's start by remembering that every row of the outer table is always represented in the MV, possibly with all columns coming from the inner table set to null (including, most importantly, test_inner_rowid) if no match is found in the inner table. If M matches are found for a given outer row, M rows will be present in the MV; e.g. for M=2, we will see the following "outer slice" (my definition) corresponding to outer row ooo:

ooo inn1
ooo inn2

Now, if a unique constraint exists on the joined column(s), M can be at most 1, and hence only two states are possible for our outer slice:

(a) ooo inn1
(b) ooo *null*

Hence, if inn1 is marked in the log, propagating its deletion in the DEL step is just a matter of simply switching the slice from state (a) to (b) using an update statement, and conversely, propagating its insertion in the INS step is just a matter of updating the slice from state (b) to state (a). In other words, the possible matching rows of the outer table are already there, in the MV, and all we need to do is to "flip their state" if necessary. Thus it is possible to propagate using only quite efficient update statements - no delete or insert needs to be performed at all.

Now, consider how the absence of a unique constraint adds complexity. In this case the following scenario is possible:

ooo inn1
ooo inn2

if only one of (inn1, inn2) is marked in the log, the DEL step can simply delete only the corresponding row in the MV, but if both are marked, it must leave a single row with all the columns coming from the inner table set to null:

ooo *null*

conversely, the INS step must remember to remove the above row if it finds at least a match in the outer table.

In other words, the whole "outer slice" must be considered and examined by both steps; it is not enough to consider only marked rows "in isolation", as it was the case in the inner join scenario and the constrained outer join scenario. This is considerably more complex, and in fact, the "general" algorithm was designed and implemented only in 10g - before 10g it was mandatory to have a unique constraint on the joined columns to enable fast refresh.

conventions and investigation scope
To reduce the visual clutter, instead of this log reading fragment (whose meaning we already discussed in the previous post)

(select rid$
   from (select  chartorowid(mas$.m_row$$) rid$
           from mlog$_test_inner mas$
          where mas$.snaptime$$ > :b_st0
        )
) as of snapshot(:b_scn) mas$

I will use the following simplified notation

(select rid$ from mlog$_test_inner)

And of course, I will restructure and simplify the SQL statements to increase readability (the original statements are included in the test case of course).

I will also define a row as "marked in the log" if it has a match in the set "select rid$ from mlog$_test_inner" - the matching column being test_inner_rowid for the MV and the actual rowid for the inner table test_inner.

I am covering only 11.2.0.3 in the same scenario of the inner join post quoted above: MV logs configured to "log everything" and a single fast refresh propagating all possible types of regular (no direct-path) INSERTs, UPDATEs and DELETEs performed on the master tables (I haven't investigated possible variants of the refresh algorithm if only a subset of those DML types is performed).

The DEL macro step with the unique constraint
As stated above, it consists of a simple update:

/* MV_REFRESH (UPD) */
update test_mv
   set jinner = null, xinner = null, pkinner = null, test_inner_rowid = null
 where test_inner_rowid in (select rid$ from mlog$_test_inner)

that simply flips to null the columns coming from the inner table of all marked rows of the MV.

The INS macro step with the unique constraint
Again, as stated above, this step consists of a (not so simple) update:

/* MV_REFRESH (UPD) */
update /*+ bypass_ujvc */ (
select test_mv.jinner           target_0, jv.jinner  source_0,
       test_mv.xinner           target_1, jv.xinner  source_1,
       test_mv.pkinner          target_2, jv.pkinner source_2,
       test_mv.test_inner_rowid target_3, jv.rid$    source_3
 from ( select test_inner.rowid rid$, test_inner.*
          from test_inner
         where rowid in (select rid$ from mlog$_test_inner)
      ) jv,
      test_outer,
      test_mv
where test_outer.jouter = jv.jinner
  and test_mv.test_outer_rowid = test_outer.rowid
)
set target_0 = source_0,
    target_1 = source_1,
    target_2 = source_2,
    target_3 = source_3

this statement joins the marked rows of the inner table with the outer table (using an inner join, not an outer join, of course) and then looks for matching slices (by test_outer_rowid) in the MV; for every match, it flips the columns coming from the inner table from null to their actual values.

As a side note, it's worth noting that the statement updates an updatable in-line view, which is actually "tagged as updatable" by the hint "bypass_ujvc" (probably "bypass the Updatable Join Validity Check"), a hint that only Oracle code can use nowadays.

speeding up

Looking at the above SQL statements, it follows naturally that, if your fast refresh process propagates a small number of modifications, it is beneficial to create the following indexes to speed up the fast refresh from the inner table:
- for the DEL step: an index on test_mv (test_inner_rowid)
- for the INS step: an index on test_mv(test_outer_rowid) and another on test_outer(jouter).

To also speed up the refresh from the outer table (not shown in this post), you would also create an index on test_inner(jinner) and one on test_mv(test_outer_rowid).
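In DDL terms, something like this (index names are mine):

create index test_mv_inner_rowid_idx on test_mv    ( test_inner_rowid );
create index test_mv_outer_rowid_idx on test_mv    ( test_outer_rowid );
create index test_outer_jouter_idx   on test_outer ( jouter );
create index test_inner_jinner_idx   on test_inner ( jinner );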

So in essence, the very same indexes as in the inner join case need to be created for the outer join. But note that if you never propagate from the outer table, the index on test_mv(test_outer_rowid) has to be created anyway - that index was not necessary in the inner join case.

Of course, as the number of modifications increases, and/or if you fast-refresh in parallel, the indexes might not be used by the CBO, which could prefer full-scanning the tables; in that case they would just be update overhead. Your mileage may vary, as always - but knowing the actual algorithm and the SQL submitted is certainly very helpful when deciding. Hope this helps.

OLTP compression: migrated rows are compressed

In his articles Compression in Oracle – Part 2: Read-Only Data and Compression in Oracle – Part 3: OLTP Compression, Jonathan Lewis has shown that block (re)compression is never attempted on updates - it is attempted only on inserts (and, of course, only if the used space crosses the PCTFREE threshold).

Now, since a row migration could be considered a row re-insertion - does it trigger a compression attempt of the block that the row is migrated into? The answer is yes, as I will show you in a while.

It is worth remembering that row migration can be quite dramatic in an OLTP-compressed table, because (as shown by Jonathan above) when a tokenized column of a compressed row is updated to a new value, the whole token is expanded, modified and left uncompressed, even if the new value could be substituted by a token already present in the block. Hence any kind of update, even the tiniest, has the potential to inflate the row considerably, and of course, if the row can't fit into the free space, the row has to be migrated to a new block.

Compressing migrated rows has the benefit that, at least, the table size is probably not going to grow much, assuming a reference scenario where rows were inserted randomly in the first place (i.e. without cleverly colocating similar rows in the same block, e.g. by inserting them in bulk using a well-designed order by), and assuming that the updating process that causes migration is random as well (which is probably almost always true). The "only" overhead is the additional block get necessary to follow the rowid that acts as the "forwarding address" from the original row position (where only the row header is stored) to the new one (where the row piece is now stored).

Side note: it's interesting that this overhead is not present when doing a full table scan: full scans simply do not follow the "forwarding address", since they are going to find the row piece anyway (a well-known fact that is also checked in my test case for completeness). Since, as reasoned above for our reference scenario, the table size probably does not change much, a frequently-full-scanning DWH is probably going to enjoy similar performance even after updates (which is a bit ironic for a feature named "OLTP compression"); not so for OLTP systems, or for DWHs that use single-row access a lot (e.g. by following rowids from bitmap indexes), which have to pay the migration penalty.

But let's leave speculations alone, and get back to hacking ...

test case: migrated rows are compressed
Note: the test case (main.sql), its spool (main.lst) and block dumps (main.trc) are available in oltp_compress_migrated_rows.zip, alongside other test cases mentioned in the following.

For simplicity, we are going to use a tablespace with manual segment space management, i.e. where free blocks are linked together by one freelist, and with a block size of 4K; the Oracle version is 11.2.0.3.
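
For reference, such a tablespace could be created with something along these lines (just a sketch: the name, datafile path and size are placeholders, and a db_4k_cache_size must be configured if the database default block size is not 4K):

create tablespace test_4k_mssm
  datafile '/u01/oradata/test_4k_mssm_01.dbf' size 100m
  blocksize 4096
  extent management local
  segment space management manual;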

The test table is (note the PCTFREE set to 50%):

create table t (x varchar2(400 char))
pctfree 50
compress for oltp
;

The initial row set is generated by:

insert /*+ append */ into t
select rpad( 'A', 400, 'A') from dual connect by level <= 10000;

That is, all identical rows, all chars set to 'A' (coded as 0x41 in the WE8ISO8859P1 single-byte charset of my test database). Note that each row is sized 400 bytes and hence, when not compressed, only 10 rows can fit in a 4K block, and only 5 rows would be inserted given the PCTFREE=50 setting.
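
To locate the interesting blocks and dump them, one could use something like this (a sketch, not necessarily what main.sql does; the datafile and block numbers in the dump command are placeholders):

-- how many rows landed in each block (about 137 per block when compressed)
select dbms_rowid.rowid_relative_fno (rowid) as file_no,
       dbms_rowid.rowid_block_number (rowid) as block_no,
       count(*)                              as rows_in_block
  from t
 group by dbms_rowid.rowid_relative_fno (rowid),
          dbms_rowid.rowid_block_number (rowid)
 order by file_no, block_no;

-- dump one block of interest to the session trace file
alter system dump datafile 4 block 258;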

The block dump of the first block after the segment header shows that we have 2017 bytes free (tosp=0x7e1), as expected. In the row dump we find, in order, first the (lonely) token:

tab 0, row 0, @0xdf3
tl: 405 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [400] {repeat 41, 400 times}
bindmp: 00 8a fa 01 90 {repeat 41, 400 times}



and then 137 highly-compressed rows, all with the same structure, referencing the token:

tab 1, row 0, @0xdee
tl: 5 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [400] {repeat 41, 400 times}
bindmp: 2c 00 01 01 00



Note: check the above-mentioned articles by Jonathan for a detailed explanation of the compression format. The bindmp line shows what is actually stored in the block: "2c 00 01" are the flag byte, the lock byte and the column count; "01" means "1 token follows" and "00" is the pointer to token zero.

Let's update every row to lowercase, i.e. to a string of 400 'a' (0x61), thus causing massive row migration:

update t set x = lower(x);

After the update, in our block the token is gone; the first 5 rows were expanded (uncompressed) but kept in the block, since they fit in the free space:

tab 1, row 0, @0xc5c
tl: 407 fb: --H-FL-- lb: 0x2  cc: 1
col  0: [400] {repeat 61, 400 times}
bindmp: 2c 02 01 00 fa 01 90 {repeat 61, 400 times}



but, indeed, all the other 132 (= 137 - 5) rows were migrated:

tab 1, row 5, @0x5f7
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x04000148.0
bindmp: 20 02 00 04 00 01 48 00 00



The flag byte "--H-----" means "this is just the row header" (check this great note by Steve Adams) and nrid is the "forwarding address" rowid we spoke about previously.

Now, the interesting part - migrated rows got compressed.

Indeed, walking down the blocks containing the migrated rows, we see
a) token 0, holding 400 bytes set to 0x61 (same as above, not shown)
b) a certain number (usually a dozen) of compressed rows:

tab 1, row 0, @0xce0
tl: 11 fb: ----FL-- lb: 0x1  cc: 1
hrid: 0x04000101.5
col  0: [400] {repeat 61, 400 times}
bindmp: 0c 01 01 04 00 01 01 00 05 01 00



note that the row is migrated: the flag byte "----FL--" means "this is the First and Last row piece, the Header is not here" and hrid is the pointer to the header (i.e. the original row position). Note also that the row is compressed: the bindmp shows the usual row header triplet "0c 01 01", then the hrid "04 00 01 01 00 05", and then the usual "01" meaning "1 token follows" plus "00", the pointer to token zero.

c) one or two uncompressed rows:

tab 1, row 10, @0xae0
tl: 413 fb: ----FL-- lb: 0x1  cc: 1
hrid: 0x04000101.f
col  0: [400] {repeat 61, 400 times}
bindmp: 0c 01 01 04 00 01 01 00 0f 00 fa 01 90 {repeat 61, 400 times}



note the bindmp, and the flag byte telling us, again, that this row is indeed migrated.
In addition, tosp is usually set to 0x907 (2311 bytes), about half of the block, honoring the PCTFREE setting of 50%.

This layout is typical of a series of row insertions into an empty block: the first inserts are simply stored uncompressed, until one that would make the used space cross the pctfree threshold triggers a potential compression before being inserted uncompressed. Note the order (already noted by Jonathan Lewis): first the compression is potentially attempted, then the row is stored - hence at least one uncompressed row is always present. There can be more, because the compression is not always attempted, depending, possibly, on some cost consideration by the kernel. Check thr.sql and thr.trc if interested.

uncompressed un-migration
As a side investigation, I have also checked what happens when a row is "un-migrated"; I was expecting that un-migration could trigger a compression, but this does not seem to be the case, at least in my scenario.

I have prepared the first block (check the last part of main.trc) with only 12 migrated rows (leaving a very big free space of tosp=0xde6=3558 bytes), and then I have updated them to the original value in uppercase using "update t set x = upper(x)". These rows, being identical, could comfortably be un-migrated and compressed all together in the original block (just remember that we observed a block with 137 such rows at the start of the main investigation), but that has not happened. Instead, the first 8 rows have indeed been un-migrated but left uncompressed, and the last 4 have simply been left migrated; also, the block free space has dropped to a mere tosp=0x176=374 bytes, thus severely eating into the PCTFREE reserve.

Un-migration is therefore a potential source of table size increase and/or pctfree space consumption, if it happens every time and in all scenarios - but I have not checked this extensively.

overlapping ranges with priority

A customer of ours (a leading Italian consumer goods retailer) has asked us to solve the following problem, which occurs quite frequently and is not trivial to solve efficiently - and which is very interesting to design for and fun to blog about!

the problem
Prices of skus (i.e. goods) have validity ranges (time intervals) and can overlap; on an overlapping range, the strongest priority (lower number) wins. In pictures:

b---(0,$200)---d
          c---(1,$300)---e



the expected output is, since priority 0 is stronger than 1:

b----($200)----d---($300)---e



I'm going to illustrate in detail the pure-SQL algorithm I have designed, and then discuss its performance and efficiency. As usual, everything I discuss is supported by an actual demo. Please note that the algorithm uses analytic functions very, very heavily.

pure SQL solution
The input table:

create table ranges (
  sku   varchar2(10) not null,
  a     int not null,
  b     int not null,
  prio  int not null,
  price int not null
);
alter table ranges add constraint ranges_pk primary key (sku, a, b);

Let's provide the opening example as input:

insert into ranges(sku, a, b, prio, price) values ('sku1', ascii('b'), ascii('d'), 0, 200);
insert into ranges(sku, a, b, prio, price) values ('sku1', ascii('c'), ascii('e'), 1, 300);

The algorithm is implemented as a single view; let's comment on each step and show its output over the example:

create or replace view ranges_output_view
as

The instants in time where the ranges start or end:

with instants as (
  select sku, a as i from ranges
  union
  select sku, b as i from ranges
),

Output: b,c,d,e.

The base ranges, i.e. the consecutive ranges that connect all the instants:

base_ranges as (
  select *
    from (
  select sku,
         i as ba,
         lead(i) over (partition by sku order by i) as bb
    from instants
         )
   where bb is not null
),
 b------c------d------e 



The original ranges factored over the base ranges; in other words, "cut" by the instants:

factored_ranges as (
  select i.sku, bi.ba, bi.bb, i.a, i.b, i.prio, i.price
    from ranges i, base_ranges bi
   where i.sku = bi.sku
     and (i.a <= bi.ba and bi.ba < i.b)
),
b---(0,$200)---c---(0,$200)---d
               c---(1,$300)---d---(1,$300)---e



Then, let's filter out the factored ranges with weaker priority (that have a stronger priority range with the same extremes "covering" them):

strongest_factored_ranges as (
  select sku, ba, bb, prio, price
    from (
  select sku, ba, bb, prio, price,
         dense_rank () over (partition by sku, ba, bb order by prio) as rnk
    from factored_ranges
         )
   where rnk = 1
),
b---(0,$200)---c---(0,$200)---d---(1,$300)---e



The problem could now be considered solved, if you could live with consecutive intervals showing the same price (such as b--c and c--d above). If you can't for whatever reason (I couldn't), we can join them using analytics again, in a way similar to this asktom technique (look at the bottom for "Analytics to the Rescue (Again)").
First, we calculate "step", a nonnegative number that will be zero if a range can be joined to the previous one, i.e. if:
a) they are consecutive (no gap between them)
b) they have the same price:

ranges_with_step as (
  select sku, ba, bb, prio, price,
         decode ( price, lag(price) over (partition by sku order by ba),  ba - lag(bb) over (partition by sku order by ba), 1000 ) step
    from strongest_factored_ranges
),
RANGE_CODED                    STEP
------------------------ ----------
b---(0,$200)---c               1000
c---(0,$200)---d                  0
d---(1,$300)---e               1000



Then we compute the integral of step over the ranges; joinable ranges will hence have the same value for "integral", since their step is zero:

ranges_with_step_integral as (
  select sku, ba, bb, prio, price, step,
         sum(step) over (partition by sku order by ba rows between unbounded preceding and current row) as integral
    from ranges_with_step
),
RANGE_CODED                INTEGRAL
------------------------ ----------
b---(0,$200)---c               1000
c---(0,$200)---d               1000
d---(1,$300)---e               2000



The joined joinable ranges:

ranges_joined as (
  select *
    from (
  select sku, ba, bb, prio, price, step, integral,
         min(ba) over (partition by sku, integral) as a,
         max(bb) over (partition by sku, integral) as b
    from ranges_with_step_integral
         )
   where step > 0
)
select sku, a, b, price from ranges_joined;
b---(0,$200)---d---(1,$300)---e
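
For the record, a plain query of the view over the example input (chr() is used here only to render the numeric instants back as letters) would return the two final ranges; a sketch:

select sku, chr(a) as from_instant, chr(b) as to_instant, price
  from ranges_output_view
 order by sku, a;

-- SKU    FROM_INSTANT TO_INSTANT PRICE
-- sku1   b            d            200
-- sku1   d            e            300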



predicate-"pushability"
The first desirable property of this view is that a predicate on sku (such as an equality predicate, but it works even for the "between" operator, less-than, etc.) can be pushed down through the view to the base tables. For:

select * from ranges_output_view where sku = 'k100';

the plan is:

----------------------------------------------------------
| Id  | Operation                            | Name      |
----------------------------------------------------------
...
|  10 |           TABLE ACCESS BY INDEX ROWID| RANGES    |
|* 11 |            INDEX RANGE SCAN          | RANGES_PK |
...
|* 17 |                INDEX RANGE SCAN      | RANGES_PK |
|* 18 |                INDEX RANGE SCAN      | RANGES_PK |
----------------------------------------------------------
...
11 - access("I"."SKU"='k100')
...
17 - access("SKU"='k100')
18 - access("SKU"='k100')

That means that only the required SKU(s) are fed to the view, and proper indexes (such as RANGES_PK in this case) can be used. So, if you need to refresh only a few skus, the response time is going to be almost instantaneous - provided that you have only a sane number (a few) of ranges per sku. Hence you can use the same view both for calculating the prices of all skus (say, in a nightly batch) and for calculating a small subset of skus (say, online), and that is a great help for maintenance and testing.

running in parallel
Another desirable property is that the view can operate efficiently in parallel, at least in 11.2.0.3 (I have not tested other versions):

-------------------------------------------------------------------
| Operation                                   |IN-OUT| PQ Distrib |
-------------------------------------------------------------------
| SELECT STATEMENT                            |      |            |
|  PX COORDINATOR                             |      |            |
|   PX SEND QC (RANDOM)                       | P->S | QC (RAND)  |
|    VIEW                                     | PCWP |            |
|     WINDOW SORT                             | PCWP |            |
|      VIEW                                   | PCWP |            |
|       WINDOW SORT                           | PCWP |            |
|        VIEW                                 | PCWP |            |
|         WINDOW BUFFER                       | PCWP |            |
|          VIEW                               | PCWP |            |
|           WINDOW SORT PUSHED RANK           | PCWP |            |
|            HASH JOIN                        | PCWP |            |
|             PX RECEIVE                      | PCWP |            |
|              PX SEND HASH                   | P->P | HASH       |
|               PX BLOCK ITERATOR             | PCWC |            |
|                TABLE ACCESS FULL            | PCWP |            |
|             PX RECEIVE                      | PCWP |            |
|              PX SEND HASH                   | P->P | HASH       |
|               VIEW                          | PCWP |            |
|                WINDOW SORT                  | PCWP |            |
|                 PX RECEIVE                  | PCWP |            |
|                  PX SEND HASH               | P->P | HASH       |
|                   VIEW                      | PCWP |            |
|                    SORT UNIQUE              | PCWP |            |
|                     PX RECEIVE              | PCWP |            |
|                      PX SEND HASH           | P->P | HASH       |
|                       UNION-ALL             | PCWP |            |
|                        PX BLOCK ITERATOR    | PCWC |            |
|                         INDEX FAST FULL SCAN| PCWP |            |
|                        PX BLOCK ITERATOR    | PCWC |            |
|                         INDEX FAST FULL SCAN| PCWP |            |
-------------------------------------------------------------------

There's no serialization point (all servers communicate parallel-to-parallel), the rows are distributed evenly using a hash distribution function (probably over the sku), and all operations are parallel.
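
A parallel run of the view can be requested, for example, with a statement-level parallel hint (just a sketch; the degree of 8 is arbitrary, and this is not necessarily how the plan above was produced):

select /*+ parallel(8) */ *
  from ranges_output_view;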

sku subsetting and partitioning
It is well known that analytic functions use sort operations heavily, and that means (whether or not you are running in parallel) that the temporary tablespace may be used a lot, possibly too much - as actually happened to me, leading (in my case) to unacceptable performance.

Side note: as I'm writing this, I realize that I had probably been hit by the bug illustrated by Jonathan Lewis in Analytic Agony but, of course, overuse of temp could happen for large datasets even without the bug kicking in.

A possible solution is to process only a sub-batch of the skus at a time, to keep the sorts running in memory (or with a one-pass spill to temp), leveraging the predicate-pushability of the view. In my case, I have gone one step further: I have partitioned the table "ranges" by "sku_group", replaced in the view every occurrence of "sku" with the pair "sku_group, sku", and then run something like:

 for s in (select sku_group from "list of sku_group") loop
  select .. from ranges_output_view where sku_group = s.sku_group;
end loop;

The predicate gets pushed down to the table, hence partition elimination kicks in and I can visit the input table only once, one partition at a time, using only a fraction of the resources at any given time, thus vastly improving performance.

And that naturally leads to "do-it-yourself parallelism": running a job for every partition in parallel. I'm going to implement it since the customer is salivating about it ... even if it is probably over-engineering :D .
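
A minimal sketch of such do-it-yourself parallelism, using one-shot dbms_scheduler jobs (the job naming scheme, the driving query and the procedure process_sku_group, which would run the per-partition query and store its results, are hypothetical placeholders of mine):

begin
  for s in (select distinct sku_group from ranges) loop
    dbms_scheduler.create_job (
      job_name   => 'PRICE_CALC_' || s.sku_group,
      job_type   => 'PLSQL_BLOCK',
      job_action => 'begin process_sku_group (''' || s.sku_group || '''); end;',
      enabled    => true,   -- start the job immediately
      auto_drop  => true    -- drop the job definition once it has run
    );
  end loop;
end;
/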

refresh “fast” of materialized views optimized by Oracle as “complete”

In my current "big" project, I am building a network of nested materialized views to transform rows of one schema into rows of another (very different) schema. The former is used by the old (but still live) version of an application of ours, the latter by the new version; our idea is to incrementally (aka "fast") refresh the network daily in order to have the new schema ready when the new version goes live. We need this nework because we have only a few hours of allowed downtime, and the transformations are very complex: the MV network is going to be composed of at least 200+ MVs, each containing tens of millions of rows.

We have carefully designed all MVs as fast refreshable, and built a simple PL/SQL engine to refresh them in parallel using Oracle jobs; we are hence sure that we can meet our time constraints. Now I'm looking at optimizing the COMPLETE refresh, both for damage limitation (should some MVs need to be refreshed as complete, due to corruption or to recreation mandated by last-minute functional changes) and to speed up the initial network build.

One of the optimizations I was thinking about was to refresh as complete any MV whose masters have been completely refreshed before, since it is common knowledge that in this case "complete" is much faster than "fast" (pardon the pun) - mostly because using the MV log to identify modified rows is far from cheap, and it's a huge, useless overhead when ALL rows have been modified. But actually I don't need to worry, since I have discovered that in this case Oracle silently turns the fast refresh into a complete one, at least in 11.2.0.3.

The test case (script mv_fast_becomes_complete.sql) builds a chain of three MVs:

create table dellera_t ..
create materialized view dellera_mv1 .. as select .. from dellera_t;
create materialized view dellera_mv2 .. as select .. from dellera_mv1;
create materialized view dellera_mv3 .. as select .. from dellera_mv2;

Then the test case modifies some rows of the master table and asks for fast refresh:

exec dbms_mview.refresh('dellera_mv1', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv2', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv3', method=>'f', atomic_refresh => true);

It's no surprise that we get:


SQL> select mview_name, last_refresh_type from user_mviews
where mview_name like 'DELLERA%' order by mview_name;
MVIEW_NAME           LAST_REFRESH_TYPE
-------------------- ------------------------
DELLERA_MV1          FAST
DELLERA_MV2          FAST
DELLERA_MV3          FAST

But, when the test case refreshes the first alone as complete, keeping the next ones as fast:

exec dbms_mview.refresh('dellera_mv1', method=>'c', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv2', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv3', method=>'f', atomic_refresh => true);

We get:

MVIEW_NAME           LAST_REFRESH_TYPE
-------------------- ------------------------
DELLERA_MV1          COMPLETE
DELLERA_MV2          COMPLETE
DELLERA_MV3          COMPLETE

Hence, our request for a fast refresh has been silently turned by Oracle into a complete one, for all MVs down the chain. By the way, I have verified that this is true also for join-only and aggregate-only MVs (check the test case if interested): all it takes to trigger this silent optimization is that at least one master has been completely refreshed before.

Moreover, the trace file shows the following steps, for the fast-turned-complete refresh of dellera_mv2 (edited for readability):

a) update mlog$_dellera_mv1  set snaptime$$ = :1  where ...;
b) delete from dellera_mv2;
c) delete from mlog$_dellera_mv2;
d) insert into dellera_mv2 (m_row$$, x , y , rid#)
select rowid,x,y,rowid from dellera_mv1;
e) delete from mlog$_dellera_mv1 where snaptime$$ <= :1;

This is indeed the signature of a complete refresh: a complete delete (b) of the refreshing MV (dellera_mv2) followed by a straight insert (d) from the master (dellera_mv1), with no visit of the MV log of the master besides the usual initial marking (a) and the final purge (e).
What is also interesting is that the MV log of dellera_mv2 is completely deleted (no reference to snaptime$$) in (c) BEFORE the insert - an insert that does not populate the MV log, since the kernel trigger is "disabled" during a complete refresh (otherwise we would find a lot of rows in the log, given that the delete happens before and not after the insert; the script verifies that no row is found). No MV log management overhead, as expected.
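
A minimal check along the lines of what the script does (just a sketch):

-- right after the fast-turned-complete refresh of dellera_mv2, its MV log is empty:
select count(*) from mlog$_dellera_mv2;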

It's also interesting to check what changes when the refresh is done with atomic_refresh => false (NB: I have skipped table-locking and index-maintenance operations, and non-pertinent hints, for brevity):

a) same as above
b) same as above
c) truncate table dellera_mv2 purge snapshot log;
d) insert /*+ append */ into dellera_mv2 (m_row$$, x , y , rid#)
select rowid,x,y,rowid from dellera_mv1;
e) same as above

That is, the delete has turned into a "truncate purge snapshot log" and the insert now runs in append mode; the rest is the same. In passing, (b) looks a tad redundant.

As always, you can find all scripts in the test case .zip above - including of course spool files and traces, the latter also processed with xtrace exec flow to quickly mine the SQL statements of interest.

Xplan: now with “self” measures for row source operations

Some of the most useful pieces of information that the Oracle kernel attaches to plans in the library cache are the measures of various resource consumption figures, such as elapsed time, consistent and current gets, disk reads, etcetera. These can be made available for each plan line (aka "row source operation").

These figures are always cumulative, that is, they include both the resources consumed by the line itself and those consumed by all of its progeny. It is very often extremely useful to exclude the progeny from the measure, to get what we could name the "self" figure (following, of course, the terminology introduced by Cary Millsap and Jeff Holt in their famous book Optimizing Oracle Performance).

My sqlplus script xplan now implements the automatic calculation of the "self" figure for the most important measures, including elapsed time and buffer gets, the ones most used when tuning a statement.

Let's see an example, and then elaborate on their most important application: as a resource profile when tuning.

A simple example
Here's an illustrative example for the measure "elapsed time":

---------------------------------------------------
|Ela    |Ela+    |Id|Operation                    |
-last----last--------------------------------------
|801,097|       =| 0|SELECT STATEMENT             |
|801,097| +79,017| 1| HASH JOIN                   |
|673,010|+262,274| 2|  TABLE ACCESS BY INDEX ROWID|
|410,736| 410,736| 3|   INDEX FULL SCAN           |
| 49,070|  49,070| 4|  TABLE ACCESS FULL          |
-usec----usec--------------------------------------

The first column ("Ela"), whose values are read straight from v$sql_plan_statistics, is the cumulative elapsed time of each row source operation and all its progeny (children, grandchildren, etc). Hence for example, you can see that line#1 (HASH JOIN) run for 801msec, including the time spent by line #2,3,4 (its progeny).

The second column ("Ela+") is the corresponding "self" column, derived from "Ela" by subtracting the time spent by the children - line#1 has two children (#2 and #4), and hence we get 801-673-49=79msec.

Self measures as a resource profile for the plan
Having the "self" measures available makes extremely easy to identify the most expensive row source operations, which are (usually) the first worth considering when tuning (or studying) a SQL statement. Actually, the "self" set is the resource profile of the plan: it blames each consumer (here, the plan lines) for its share of the resource consumed.

For example, line#3 is the most expensive, with its 410 msec worth of time - if we are lucky and can reduce its time consumption almost to zero, we would cut the consumption of the whole statement by (about) 50%. It is definitely a line on which to invest some of our tuning time - e.g. by investigating whether a predicate failed to be pushed down, by building a more optimal (e.g. smaller) index, by hinting it into a "FAST FULL SCAN", etc.

The second-best option for tuning is line#2, a "TABLE ACCESS BY INDEX ROWID"... maybe we could eliminate it completely by adding the fetched columns at the end of the index read by line#3, thus possibly saving 262msec (about 25%) of the time.

And so on.

I have found these "self" figures extremely useful in all my recent tuning projects - I hope that the same could turn true for some of you, and maybe that you could suggest me some way to improve xplan :)
