“alter session force parallel query”, and indexes
This post is a brief discussion about the advantages of activating parallelism by altering the session environment instead of using the alternative ways (hints, DDL). The latter ways are the most popular in my experience, but I have noticed that their popularity is actually due, quite frequently, more to imperfect understanding rather than informed decision - and that's a pity since "alter session force parallel query" can really save everyone a lot of tedious work and improve maintainability a great deal.
We will also check that issuing
alter session force parallel query parallel N;
is the same as specifying the hints
/*+ parallel (t,N) */ /*+ parallel_index (t, t_idx, N) */
for all tables referenced in the query, and for all indexes defined on them (the former is quite obvious, the latter not that much).
Side note: it is worth remembering that hinting the table for parallelism does not cascade automatically to its indexes as well - you must explicitly specify the indexes that you want to be accessed in parallel by using the separate parallel_index hint (maybe specifying "all indexes" by using the two-parameter variant "parallel_index(t,N)"). The same holds for "alter table parallel N" and "alter index parallel N", of course.
the power of "force parallel query"
I've rarely found any reason for avoiding index parallel operations nowadays - usually both the tables and their indexes are stored on disks with the same performance figures (if not the same set of disks altogether), and the cost of the initial segment checkpoint is not generally different. At the opposite, using an index can offer terrific opportunities for speeding up queries, especially when a full table scan can be substituted by a fast full scan on a (perhaps much) smaller index.
Thus, I almost always let the CBO consider index parallelism as well. Three methods can be used:
- statement hints (the most popular option)
- alter table/index parallel N
- "force parallel query".
I rather hate injecting parallel hints everywhere in my statements since it is very risky. It is far too easy to forget to specify a table or index (or simply misspell them), not to mention to forget new potentially good indexes added after the statement had been finalized. Also, you must change the statement as well even if you simply want to change the degree of parallelism, perhaps just because you are moving from an underequipped, humble and cheap test environment to a mighty production server. At the opposite, "force parallel query" is simple and elegant - just a quick command and you're done, and with a single place to touch in order to change the parallel degree.
"alter table/index parallel N" is another weak technique as well in my opinion, mainly for two reasons. The first one is that it is a permanent modification to the database objects, and after the query has finished, it is far too easy to fail to revert the objects back to their original degree setting (because of failure or coding bug). The second one is the risk of two concurrent sessions colliding on the same object that they both want to read, but with different degrees of parallelism.
Both the two problems above do not hold only when you always want to run with a fixed degree for all statements; but even in this case, I would consider issuing "force parallel query" (maybe inside a logon trigger) instead of having to set/change the degree for all tables/indexes accessed by the application.
I have noticed that many people are afraid of "force parallel query" because of the word "force", believing that it switches every statement into parallel mode. But this is not the case: as Tanel Poder recently illustrated, the phrase "force parallel query" is misleading; a better one would be something like "consider parallel query", since it is perfectly equivalent to hinting the statement for parallelism as far as I can tell (see below). And hinting itself tells the CBO to consider parallelism in addition to serial execution; the CBO is perfectly free to choose a serial execution plan if it estimates that it will cost less - as demonstrated by Jonathan Lewis years ago.
Hence there's no reason to be afraid, for example, that a nice Index Range Scan that selects just one row might turn into a massively inefficient Full Table Scan (or index Fast Full Scan) of a one million row table/index. That is true besides bugs and CBO limitations, obviously; but in these hopefully rare circumstances, one can always use the no_parallel and no_parallel_index to fix the issue.
"force parallel query" and hinting: test case
Let's show that altering the session is equivalent to hinting. I will illustrate the simplest case only - a single-table statement that can be resolved either by a full table scan or an index fast full scan (check script force_parallel_main.sql in the test case), but in the test case zip two other scenarios (a join and a subquery) are tested as well. Note: I have only checked 9.2.0.8 and 11.2.0.3 (but I would be surprised if the test case could not reproduce in 10g as well).
Table "t" has an index t_idx on column x, and hence the statement
select sum(x) from t;
can be calculated by either scanning the table or the index. In serial, the CBO chooses to scan the smaller index (costs are from 11.2.0.3):
select /* serial */ sum(x) from t; -------------------------------------- |Id|Operation |Name |Cost| -------------------------------------- | 0|SELECT STATEMENT | | 502| | 1| SORT AGGREGATE | | | | 2| INDEX FAST FULL SCAN|T_IDX| 502| --------------------------------------
If we now activate parallelism for the table, but not for the index, the CBO chooses to scan the table:
select /*+ parallel(t,20) */ sum(x) from t ------------------------------------------ |Id|Operation |Name |Cost| ------------------------------------------ | 0|SELECT STATEMENT | | 229| | 1| SORT AGGREGATE | | | | 2| PX COORDINATOR | | | | 3| PX SEND QC (RANDOM) |:TQ10000| | | 4| SORT AGGREGATE | | | | 5| PX BLOCK ITERATOR | | 229| | 6| TABLE ACCESS FULL|T | 229| ------------------------------------------
since the cost for the parallel table access is now down from the serial cost of 4135 (check the test case logs) to the parallel cost 4135 / (0.9 * 20) = 229, thus less than the cost (502) of the serial index access.
Hinting the index as well makes the CBO apply the same scaling factor (0.9*20) to the index as well, and hence we are back to index access:
select /*+ parallel_index(t, t_idx, 20) parallel(t,20) */ sum(x) from t --------------------------------------------- |Id|Operation |Name |Cost| --------------------------------------------- | 0|SELECT STATEMENT | | 28| | 1| SORT AGGREGATE | | | | 2| PX COORDINATOR | | | | 3| PX SEND QC (RANDOM) |:TQ10000| | | 4| SORT AGGREGATE | | | | 5| PX BLOCK ITERATOR | | 28| | 6| INDEX FAST FULL SCAN|T_IDX | 28| ---------------------------------------------
Note that the cost computation is 28 = 502 / (0.9 * 20), less than the previous one (229).
"Forcing" parallel query:
alter session force parallel query parallel 20; select /* force parallel query */ sum(x) as from t --------------------------------------------- |Id|Operation |Name |Cost| --------------------------------------------- | 0|SELECT STATEMENT | | 28| | 1| SORT AGGREGATE | | | | 2| PX COORDINATOR | | | | 3| PX SEND QC (RANDOM) |:TQ10000| | | 4| SORT AGGREGATE | | | | 5| PX BLOCK ITERATOR | | 28| | 6| INDEX FAST FULL SCAN|T_IDX | 28| ---------------------------------------------
Note that the plan is the same (including costs), as predicted.
Side note: let's verify, just for fun, that the statement can run serially even if the session is "forced" as parallel (note that I have changed the statement since the original always benefits from parallelism):
alter session force parallel query parallel 20; select /* force parallel query (with no parallel execution) */ sum(x) from t WHERE X < 0 ---------------------------------- |Id|Operation |Name |Cost| ---------------------------------- | 0|SELECT STATEMENT | | 3| | 1| SORT AGGREGATE | | | | 2| INDEX RANGE SCAN|T_IDX| 3| ----------------------------------
Side note 2: activation of parallelism for all referenced objects can be obtained, in 11.2.0.3, using the new statement-level parallel hint (check this note by Randolf Geist for details):
select /*+ parallel(20) */ sum(x) from t --------------------------------------------------- |Id|Operation |Name |Table|Cost| --------------------------------------------------- | 0|SELECT STATEMENT | | | 28| | 1| SORT AGGREGATE | | | | | 2| PX COORDINATOR | | | | | 3| PX SEND QC (RANDOM) |:TQ10000| | | | 4| SORT AGGREGATE | | | | | 5| PX BLOCK ITERATOR | | | 28| | 6| INDEX FAST FULL SCAN|T_IDX |T | 28| ---------------------------------------------------
This greatly simplifies hinting, but of course you must still edit the statement if you need to change the parallel degree.
- Friday, May 17, 2013 CBO
- 1 Comment »
fast refresh of outer-join-only materialized views – algorithm, part 2
In this post, we are going to complete part 1 illustrating the (considerably more complex) general case of a fast refresh from a master inner table without a unique constraint on the joined column(s).
To recap, now the outer slice can be composed of more than one row, for example:
ooo inn1
ooo inn2
and hence, both the DEL and INS step must consider (and read) the whole outer slice even if only a subset of the inner rows have been modified. This requires both more resources and a considerably more complex algorithm. Let's illustrate it (the mandatory test case is here).
The DEL macro step
This sub step (named DEL.del by me) is performed first:
/* MV_REFRESH (DEL) */
delete from test_mv where rowid in (
select rid
from (
select test_mv.rowid rid,
row_number() over (partition by test_outer_rowid order by rid$ nulls last) r,
count(*) over (partition by test_outer_rowid ) t_cnt,
count(rid$) over (partition by test_outer_rowid ) in_mvlog_cnt
from test_mv, (select distinct rid$ from mlog$_test_inner) mvlog
where /* read touched outer slices start */
test_mv.test_outer_rowid in
(
select test_outer_rowid
from test_mv
where test_inner_rowid in (select rid$ from mlog$_test_inner)
)
/* read touched outer slices end */
and test_mv.test_inner_rowid = mvlog.rid$(+)
)
/* victim selection start */
where t_cnt > 1
and ( (in_mvlog_cnt = t_cnt and r > 1)
or
(in_mvlog_cnt < t_cnt and r <= in_mvlog_cnt)
)
/* victim selection end */
)
followed by the DEL.upd one:
/* MV_REFRESH (UPD) */ update test_mv set jinner = null, xinner = null, pkinner = null, test_inner_rowid = null where test_inner_rowid in (select rid$ from mlog$_test_inner)
This two steps combined do change all the rows of the MV marked in the log (and only them, other rows are not modified at all); the first step deletes some of them, leaving all the others to the second one, that sets to null their columns coming from the inner table.
DEL.upd is straighforward. Let's illustrate the DEL.del algorithm:
a) the section "read touched outer slices" fetches all the MV outer slices that have at least one of their rows marked in the log;
b) the slices are outer joined with the "mvlog" in-line view, so that rid$ will be nonnull for all rows marked in the log;
c) the analytic functions, for each outer slice separately, compute the number of rows (column t_cnt), the number of rows marked (column in_mvlog_cnt), and then attach a label (column r) that orders the row (order is not important at all besides non-marked rows being ordered last)
d) the where-predicate "victim selection" dictates which rows to delete.
The victim selection predicate has three sub-components, each implementing a different case (again, considering each slice separately):
"t_cnt > 1": do not delete anything if the slice contains only one row (since it is for sure marked and hence will be nulled by DEL.upd)
rid$ t_cnt in_mvlog_cnt r action
ooo inn1 not-null 1 1 1 updated by DEL.upd
"in_mvlog_cnt = t_cnt and r > 1": all rows are marked, delete all but one (that will be nulled by DEL.upd)
rid$ t_cnt in_mvlog_cnt r action
ooo inn1 not-null 3 3 1 updated by DEL.upd
ooo inn2 not-null 3 3 2 deleted by DEL.del
ooo inn3 not-null 3 3 3 deleted by DEL.del
"in_mvlog_cnt < t_cnt and r <= in_mvlog_cnt": only some rows are marked; delete all marked rows, keep all the others.
rid$ t_cnt in_mvlog_cnt r action
ooo inn1 not-null 3 2 1 deleted by DEL.del
ooo inn2 not-null 3 2 2 deleted by DEL.del
ooo inn3 null 3 2 3 nothing
The INS macro step
The first sub-step is INS.ins:
/* MV_REFRESH (INS) */
insert into test_mv
select o.jouter, o.xouter, o.pkouter, o.rowid,
jv.jinner, jv.xinner, jv.pkinner, jv.rid
from ( select test_inner.rowid rid,
test_inner.*
from test_inner
where rowid in (select rid$ from mlog_test_inner)
) jv, test_outer o
where jv.jinner = o.jouter
this sub-step simply find matches in the outer table for the marked inner table rows (note that it is an inner join, not an outer join), and inserts them in the MV.
Then, INS.del:
/* MV_REFRESH (DEL) */
delete from test_mv sna$ where rowid in (
select rid
from (
select test_mv.rowid rid,
row_number() over (partition by test_outer_rowid order by test_inner_rowid nulls first) r,
count(*) over (partition by test_outer_rowid ) t_cnt,
count(test_inner_rowid) over (partition by test_outer_rowid ) nonnull_cnt
from test_mv
where /* read touched outer slices start */
test_mv.test_outer_rowid in
(
select o.rowid
from ( select test_inner.rowid rid$,
test_inner.*
from test_inner
where rowid in (select rid$ from mlog$_test_inner)
) jv, test_outer o
where jv.jinner = o.jouter
)
/* read touched outer slices end */
)
/* victim selection start */
where t_cnt > 1
and ( (nonnull_cnt = 0 and r > 1)
or
(nonnull_cnt > 0 and r <= t_cnt - nonnull_cnt)
)
/* victim selection end */
)
this substep has a SQL structure very similar to DEL.upd, hence I will simply outline the algorith: first, the statement identifies (in the "read touched outer slices" section) all the outer slices that had at least one rows inserted by INS.ins, by replaying its join; then, for each slice, it deletes any row, if it exists, that has column "test_inner_rowid" set to null (check the "victim selection predicate").
Side note: I cannot understand how nonnull_cnt could be = 0 - possibly that is for robustness only or because it can handle variants of the DEL step I haven't observed.
speeding up
These are the indexes that the CBO might enjoy using to optimize the steps of the propagation from the inner table:
- DEL.del: test_mv(test_inner_rowid, test_outer_rowid)
- DEL.upd: test_mv(test_inner_rowid)
- INS.ins: test_outer(jouter)
- INS.del: test_outer(jouter) and test_mv(test_outer_rowid , test_inner_rowid)
And hence, to optimize all steps:
- test_outer(jouter)
- test_mv(test_inner_rowid, test_outer_rowid)
- test_mv(test_outer_rowid , test_inner_rowid)
And of course we need the usual index on test_inner(jinner) to optimize the propagation from the outer table (not shown in this post), unless we positively know that the outer table is never modified.
Note that the two indexes test_mv(test_inner_rowid, test_outer_rowid) and test_mv(test_outer_rowid , test_inner_rowid) allow to skip visiting the MV altogether (except for deleting rows, obviously) and hence might reduce the number of consistent gets dramatically (the indexes are both "covering" indexes for the SQL statements we observed in the DEL.del and INS.del) .
For example, in my test case (check ojoin_mv_test_case_indexed.sql), the plan for the DEL.del step is:
-------------------------------------------------------------- | 0|DELETE STATEMENT | | | 1| DELETE |TEST_MV | | 2| NESTED LOOPS | | | 3| VIEW |VW_NSO_1 | | 4| SORT UNIQUE | | | 5| VIEW | | | 6| WINDOW SORT | | | 7| HASH JOIN OUTER | | | 8| HASH JOIN SEMI | | | 9| INDEX FULL SCAN |TEST_MV_TEST_INNER_ROWID| |10| VIEW |VW_NSO_2 | |11| NESTED LOOPS | | |12| TABLE ACCESS FULL |MLOG$_TEST_INNER | |13| INDEX RANGE SCAN |TEST_MV_TEST_INNER_ROWID| |14| VIEW | | |15| SORT UNIQUE | | |16| TABLE ACCESS FULL |MLOG$_TEST_INNER | |17| MAT_VIEW ACCESS BY USER ROWID|TEST_MV | -------------------------------------------------------------- 5 - filter[ (T_CNT>1 AND ((IN_MVLOG_CNT=T_CNT AND R>1) OR (IN_MVLOG_CNT<T_CNT AND R<=IN_MVLOG_CNT))) ] ...
Note the absence of any access to the MV to identify the rows to be deleted (row source operation 5 and its progeny; note the filter operation, which is the final "victim selection predicate"); the MV is only accessed to physically delete the rows.
Ditto for the INS.del step:
------------------------------------------------------------------- | 0|DELETE STATEMENT | | | 1| DELETE |TEST_MV | | 2| NESTED LOOPS | | | 3| VIEW |VW_NSO_1 | | 4| SORT UNIQUE | | | 5| VIEW | | | 6| WINDOW SORT | | | 7| HASH JOIN SEMI | | | 8| INDEX FULL SCAN |TEST_MV_TEST_INNER_ROWID| | 9| VIEW |VW_NSO_2 | |10| NESTED LOOPS | | |11| NESTED LOOPS | | |12| TABLE ACCESS FULL |MLOG$_TEST_INNER | |13| TABLE ACCESS BY USER ROWID|TEST_INNER | |14| INDEX RANGE SCAN |TEST_OUTER_JOUTER_IDX | |15| MAT_VIEW ACCESS BY USER ROWID |TEST_MV | ------------------------------------------------------------------- 5 - filter[ (T_CNT>1 AND ((NONNULL_CNT=0 AND R>1) OR (NONNULL_CNT>0 AND R<=T_CNT-NONNULL_CNT))) ] ...
You might anyway create just the two "standard" single-column indexes on test_mv(test_inner_rowid) and test_mv(test_outer_rowid) and be happy with the resulting performance, even if you now will access the MV to get the "other" rowid - it all depends, of course, on your data (how many rows you have in each slice, and how many slices are touched by the marked rows) and how you modify the master tables.
- Monday, April 29, 2013 materialized views
- Comment now »
fast refresh of outer-join-only materialized views – algorithm, part 1
In this series of posts we will discuss how Oracle refreshes materialized views (MV) containing only OUTER joins, covering only 11.2.0.3. We will use the very same scenario (MV log configuration, DML type, etc) as in the inner join case, "just" turning the inner join into an outer join:
create materialized view test_mv
build immediate
refresh fast on demand
as
select test_outer.*, test_outer.rowid as test_outer_rowid,
test_inner.*, test_inner.rowid as test_inner_rowid
from test_outer, test_inner
where test_outer.jouter = test_inner.jinner(+)
;
For the outer case, the overall strategy is the same we already saw for inner joins: modifications from each master table are propagated separately to the MV, and by still performing, for each master table, the same two macro steps (first the DEL, and then the INS one).
Actually, propagation from the outer table is exactly the same (with the obvious slight difference of performing an outer join to recalculate the "slice" in the INS step), and hence we will discuss only the propagation from the inner table, which is considerably more complex.
There are actually two different propagation algorithms, one much simpler and less resource-intensive that requires a unique constraint on the joined columns (here, "jinner" alone); in this post I will discuss the details of this specialized algorithm only, leaving the details of the "general" other one for the next post.
Does the existence of a a unique constraint on the joined colum(s) enable such a dramatic simplification of the propagation that justifies a specialized algorithm? Yes, absolutely - and it is interesting to understand the reason since it comes out naturally from the very semantic of the outer join SQL construct, and hence we can also improve our understanding of this important component of the SQL syntax as a by-product.
Let's start by remembering that every row of the outer table is always represented in the MV, possibly with all columns coming from the inner table set to null (including, most importantly, test_inner_rowid) if no match is found in the inner table. If M matches are found for a given outer row, M rows will be present in the MV; e.g. for M=2, we will see the following "outer slice" (my definition) corresponding to outer row ooo:
ooo inn1
ooo inn2
Now, if a unique constraint exists on the joined column(s), M can be at most 1, and hence only two possible states are possible for our outer slice:
(a) ooo inn1
(b) ooo *null*
Hence, if inn1 is marked in the log, propagating its deletion in the DEL step is just a matter of simply switching the slice from state (a) to (b) using an update statement, and conversely, propagating its insertion in the INS step is just a matter of updating the slice from state (b) to state (a). In other words, the possible matching rows of the outer table are already there, in the MV, and all we need to do is to "flip their state" if necessary. Thus it is possible to propagate using only quite efficient update statements - no delete or insert needs to be performed at all.
Now, consider how the absence of unique constraint adds additional complexity. In this case this scenario is possible:
ooo inn1
ooo inn2
if only one of (inn1, inn2) is marked in the log, the DEL step can simply delete only the corresponding row in the MV, but if both are marked, it must leave a single row with all the columns coming from the inner table set to null:
ooo *null*
conversely, the INS step must remember to remove the above row if it finds at least a match in the outer table.
In other words, the whole "outer slice" must be considered and examined by both steps; it is not enough to consider only marked rows "in isolation", as it was the case in the inner join scenario and the constrained outer join scenario. This is considerably more complex, and in fact, the "general" algorithm was designed and implemented only in 10g - before 10g it was mandatory to have a unique constraint on the joined columns to enable fast refresh.
conventions and investigation scope
To reduce the visual clutter, instead of this log reading fragment (whose meaning we already discussed in the previous post)
(select rid$
from (select chartorowid(mas$.m_row$$) rid$
from mlog$_test_inner mas$
where mas$.snaptime$$ > :b_st0
)
) as of snapshot(:b_scn) mas$
I will use the following simplified notation
(select rid$ from mlog$_test_inner)
And of course, I will restructure and simplify the SQL statements to increase readability (the original statements are included in the test case of course).
I will also define a row as "marked in the log" if it has a match in the set "select rid$ from mlog$_test_inner" - the matching column being test_inner_rowid for the MV and the actual rowid for the inner table test_inner.
I am covering only 11.2.0.3 in the same scenario of the inner join post quoted above: MV logs configured to "log everything" and a single fast refresh propagating all possible types of regular (no direct-path) INSERTs, UPDATEs and DELETEs performed on the master tables (I haven't investigated possible variants of the refresh algorithm if only a subset of those DML types is performed).
The DEL macro step with the unique constraint
As stated above , it consists of a simple update:
/* MV_REFRESH (UPD) */ update test_mv set jinner = null, xinner = null, pkinner = null, test_inner_rowid = null where test_inner_rowid in (select rid$ from mlog$_test_inner)
that simply flips to null the columns coming from the inner table of all marked rows of the MV.
The INS macro step with the unique constraint
Again, as stated above, this step consists of a (not so simple) update:
/* MV_REFRESH (UPD) */
update /*+ bypass_ujvc */ (
select test_mv.jinner target_0, jv.jinner source_0,
test_mv.xinner target_1, jv.xinner source_1,
test_mv.pkinner target_2, jv.pkinner source_2,
test_mv.test_inner_rowid target_3, jv.rid$ source_3
from ( select test_inner.rowid rid$, test_inner.*
from test_inner
where rowid in (select rid$ from mlog$_test_inner)
) jv,
test_outer,
test_mv
where test_outer.jouter = jv.jinner
and test_mv.test_outer_rowid = test_outer.rowid
)
set target_0 = source_0,
target_1 = source_1,
target_2 = source_2,
target_3 = source_3
this statement joins the marked rows of the inner table with the outer table (using an inner join, not an outer join, of course) and then looks for matching slices (by test_outer_rowid) in the MV; for every match, it flips the columns coming from the inner table from null to their actual values.
As a side note, it's worth noting that the statement updates an updatable in-line view, which is actually "tagged as updatable" by the hint "bypass_ujvc" ("bypass the Updatable Join Validity Check" probably), an hint that only Oracle code can use nowadays.
speeding up
Looking at the above SQL statements, it comes out naturally that if your fast refresh process propagates a small number of modifications, it is beneficial, to speed up the fast refresh from the inner table, to create
- for the DEL step: an index on test_mv (test_inner_rowid)
- for the INS step: an index on test_mv(test_outer_rowid) and another on test_outer(jouter).
To also speed up the refresh from the outer table (not shown in this post), you would also create an index on on test_inner(jinner) and test_mv(test_outer_rowid).
So in essence, the very same indexes as in the inner join case need to be created for the outer join. But note that if you never propagate from the outer table, the index on test_mv(test_outer_rowid) has to be created anyway - that index was not necessary in the inner join case.
Of course, as the number of modifications increase, and/or if you fast-refresh in parallel, the indexes might not be used by the CBO that could prefer full-scanning the tables; in that case they would just be update overhead. Your mileage may vary, as always - but knowing the actual algorithm and SQL submitted is for sure very helpful to decide. Hope this helps.
- Monday, April 22, 2013 materialized views
- 2 Comments »
OLTP compression: migrated rows are compressed
In his articles Compression in Oracle – Part 2: Read-Only Data and Compression in Oracle – Part 3: OLTP Compression, Jonathan Lewis has shown that block (re)compression is never attempted on updates - it is attempted only on inserts (and, of course, only if the used space crosses the PCTFREE threshold).
Now, since a row migration could be considered a row re-insertion - does it trigger a compression attempt of the block that the row is migrated into? The answer is yes, as I will show you in a while.
It is worth remembering that row migration can be quite dramatic in an OLTP-compressed table, because (as shown by Jonathan above) when a tokenized column of a compressed row is updated to a new value, the whole token is expanded, modified and left uncompressed, even if the new value could be substituted by a token already present in the block. Hence any kind of update, even the tiniest, has the potential to inflate the row considerably, and of course, if the row can't fit into the free space, the row has to be migrated to a new block.
Compressing migrated rows has the benefit that at least the whole size of the table is not probably going to grow much, assuming a reference scenario where rows were inserted randomly in the first place (i.e. without cleverly colocating similar rows in the same block, e.g. by inserting them in bulk using a well-designed order by), and assuming that the updating process that causes migration is random as well (which is probably almost always true). The "only" overhead is the additional block get necessary to follow the rowid that acts as the "forwarding address" from the original row position (where only the row header is stored) to the new (where the row piece is now stored).
Side note: it's interesting to note that this overhead is not present when doing a full table scan, since full scans simply do not follow the "forwarding address" since they are going to find the row piece anyway (a well-known fact that is also checked in my test case for completeness). Since, as reasoned about above for our reference scenario, the table size probably does not change much, a frequently-full-scanning DWH is probably going to enjoy similar performance even after updates (and this is a bit ironic for a feature named "OLTP compression"); not so for OLTP systems or DWHs that use single row access a lot (e.g. by following rowids from bitmap indexes), that have to pay the migration penalty.
But let's leave speculations alone, and get back to hacking ...
test case: migrated rows are compressed
Note: the test case (main.sql), its spool (main.lst) and block dumps (main.trc) are available in oltp_compress_migrated_rows.zip, alongside other test cases mentioned in the following.
For simplicity, we are going to use a tablespace with segment space managed manually, i.e. where free blocks are linked together by one freelist, and with a block size of 4K; Oracle version is 11.2.0.3.
The test table is (note the PCTFREE set to 50%):
create table t (x varchar2(400 char)) pctfree 50 compress for oltp ;
The initial row set is generated by:
insert /*+ append */ into t select rpad( 'A', 400, 'A') from dual connect by level <= 10000;
That is, all identical rows, all chars set to 'A' (coded as 0x41 in the WE8ISO8859P1 single-byte charset of my test database). Note that each row is sized 400 bytes and hence, when not compressed, only 10 rows can fit in a 4K block, and only 5 rows would be inserted given the PCTFREE=50 setting.
The block dump of the first block after the segment header shows that we have 2017 bytes free (tosp=0x7e1), as expected. In the row dump we find, in order, first the (lonely) token:
tab 0, row 0, @0xdf3
tl: 405 fb: --H-FL-- lb: 0x0 cc: 1
col 0: [400] {repeat 41, 400 times}
bindmp: 00 8a fa 01 90 {repeat 41, 400 times}
and then 137 highly-compressed rows, all with the same structure, referencing the token:
tab 1, row 0, @0xdee
tl: 5 fb: --H-FL-- lb: 0x0 cc: 1
col 0: [400] {repeat 41, 400 times}
bindmp: 2c 00 01 01 00
Note: check the above mentioned articles by Jonathan for a detailed explanation of the compression format. The bindmp line shows what's actually stored in the block: "2c 00 01" are the flag byte, the lock byte, and the column count; "01" means "1 token follows" and "00" is the pointer to token zero.
Let's update every row to lowercase. i.e. to a string of 400 'a' (0x61), thus causing massive row migration:
update t set x = lower(x);
After the update, in our block the token is gone; the first 5 rows were uncompressed but were kept in the block since they fit in the free space:
tab 1, row 0, @0xc5c
tl: 407 fb: --H-FL-- lb: 0x2 cc: 1
col 0: [400] {repeat 61, 400 times}
bindmp: 2c 02 01 00 fa 01 90 {repeat 61, 400 times}
but , indeed, all the other 132 (=137-5) ones were migrated:
tab 1, row 5, @0x5f7 tl: 9 fb: --H----- lb: 0x2 cc: 0 nrid: 0x04000148.0 bindmp: 20 02 00 04 00 01 48 00 00
The flag byte "--H-----" means "this is just the row header" (check this great note by Steve Adams) and nrid is the "forwarding address" rowid we spoke about previously.
Now, the interesting part - migrated rows got compressed.
Indeed, walking down the blocks containing the migrated rows, we see
a) token 0, holding 400 bytes set to 0x61 (same as above, not shown)
b) a certain number (usually a dozen) of compressed rows:
tab 1, row 0, @0xce0
tl: 11 fb: ----FL-- lb: 0x1 cc: 1
hrid: 0x04000101.5
col 0: [400] {repeat 61, 400 times}
bindmp: 0c 01 01 04 00 01 01 00 05 01 00
note that the row is migrated: the flag byte "----FL--" means "this is the First and Last row piece, the Header is not here" and hrid is the pointer to the header (i.e. the original row position). Of course, note that the row is compressed: the bindmp shows the usual row header triplet "0c 01 01", then the hrid "04 00 01 01 00 05" and then the usual "01" meaning "1 token follows" plus "00", the pointer to token zero.
c) uno or two uncompressed rows:
tab 1, row 10, @0xae0
tl: 413 fb: ----FL-- lb: 0x1 cc: 1
hrid: 0x04000101.f
col 0: [400] {repeat 61, 400 times}
bindmp: 0c 01 01 04 00 01 01 00 0f 00 fa 01 90 {repeat 61, 400 times}
note the bindmp, and the flag byte telling us, again, that this row is indeed migrated.
In addition, tosp is usually set to 0x907 (2311 bytes), about half of the block, honoring the PCTFREE setting to 50%.
This layout is typical of a series of row insertions in a empty block: the first inserts get simply stored uncompressed, until one that would make the used space cross the pctfree threshold triggers a potential compression before being inserted uncompressed. Note the order (already noted by Jonathan Lewis): first the compression is potentially attempted, then the row is stored - hence at least one uncompressed row is always present. There can be more because the compression is not always attempted, depending, possibly, on some cost consideration by the kernel. Check thr.sql and thr.trc if interested.
uncompressed un-migration
As a side investigation, I have also checked what happens when a row is "un-migrated"; I was expecting that un-migration could trigger a compression, but this does not seem to be the case, at least in my scenario.
I have prepared the first block (check the last part of main.trc) with only 12 migrated rows (leaving a very big free space of tosp=0xde6=3558 bytes), and then I have updated them to the original value in uppercase using "update t set x = upper(x)" . These rows, being identical, could be comfortably be un-migrated and compressed all together in the original block (just remember that we observed a block with 137 such rows at the start of the main investigation), but that has not happened. Instead, the first 8 rows have been un-migrated indeed but left uncompressed, and the last 4 has been simply left migrated; also, the block free space has dropped to a mere tosp=0x176=374 bytes, thus severely hitting the PCTFREE reserve.
Un-migration is a potential source of table size increase and/or pctfree space consumption, if this happens all the times in all scenarios - but I have not checked this extensively, though.
- Sunday, April 7, 2013 compression
- Comment now »
overlapping ranges with priority
A customer of ours (a leading Italian consumer goods retailer) has asked us to solve the following problem, that occurs quite frequently and that is not trivial to solve efficiently - and that is very interesting to design and fun to blog about!
the problem
Prices of skus (i.e. goods) have validity ranges (time intervals) and can overlap; on an overlapping range, the strongest priority (lower number) wins. In pictures:
b---(0,$200)---d
c---(1,$300)---e
the expected output is, since priority 0 is stronger then 1:
b----($200)----d---($300)---e
I'm going to illustrate in detail the pure-SQL algorithm I have designed, and then discuss about its performance and efficiency. As usual, everything I discuss is supported by an actual demo. Please note that the algorithm uses analytics functions very, very heavily.
pure SQL solution
The input table:
create table ranges ( sku varchar2(10) not null, a int not null, b int not null, prio int not null, price int not null ); alter table ranges add constraint ranges_pk primary key (sku, a, b);
Let's provide the opening example as input:
insert into ranges(sku, a, b, prio, price) values ('sku1', ascii('b'), ascii('d'), 0, 200);
insert into ranges(sku, a, b, prio, price) values ('sku1', ascii('c'), ascii('e'), 1, 300);
The algorith is implemented as a single view; let's comment each step and show its output over the example:
create or replace view ranges_output_view as
The instants in time where the ranges start or begin:
with instants as ( select sku, a as i from ranges union select sku, b as i from ranges ),
Output: b,c,d,e.
The base ranges, i.e. the consecutive ranges that connect all the instants:
base_ranges as (
select *
from (
select sku,
i as ba,
lead(i) over (partition by sku order by i) as bb
from instants
)
where bb is not null
),
b------c------d------e
The original ranges factored over the base ranges; in other words, "cut" by the instants:
factored_ranges as (
select i.sku, bi.ba, bi.bb, i.a, i.b, i.prio, i.price
from ranges i, base_ranges bi
where i.sku = bi.sku
and (i.a <= bi.ba and bi.ba < i.b)
),
b---(0,$200)---c---(0,$200)---d
c---(1,$300)---d---(1,$300)---e
Then, let's filter out the factored ranges with weaker priority (that have a stronger priority range with the same extremes "covering" them):
strongest_factored_ranges as (
select sku, ba, bb, prio, price
from (
select sku, ba, bb, prio, price,
dense_rank () over (partition by sku, ba, bb order by prio) as rnk
from factored_ranges
)
where rnk = 1
),
b---(0,$200)---c---(0,$200)---d---(1,$300)---e
The problem could be now considered solved, if you could live with consecutive intervals showing the same price (such as b--c and c--d above). If you can't for whatever reason (I couldn't), we can join them using analytics again in a way similar to this asktom technique (look at the bottom for "Analytics to the Rescue (Again)").
First, we calculate "step", a nonnegative number that will be zero if a range can be joined to the previous one, since:
a) they are consecutive (no gap between them)
b) they have the same price:
ranges_with_step as (
select sku, ba, bb, prio, price,
decode ( price, lag(price) over (partition by sku order by ba), ba - lag(bb) over (partition by sku order by ba), 1000 ) step
from strongest_factored_ranges
),
RANGE_CODED STEP ------------------------ ---------- b---(0,$200)---c 1000 c---(0,$200)---d 0 d---(1,$300)---e 1000
Then we compute the integral of step over the ranges; joinable ranges will hence have the same value for "interval" since step is zero:
ranges_with_step_integral as (
select sku, ba, bb, prio, price, step,
sum(step) over (partition by sku order by ba rows between unbounded preceding and current row) as integral
from ranges_with_step
),
RANGE_CODED INTEGRAL ------------------------ ---------- b---(0,$200)---c 1000 c---(0,$200)---d 1000 d---(1,$300)---e 2000
The joined joinable ranges :
ranges_joined as (
select *
from (
select sku, ba, bb, prio, price, step, integral,
min(ba) over (partition by sku, integral) as a,
max(bb) over (partition by sku, integral) as b
from ranges_with_step_integral
)
where step > 0
)
select sku, a, b, price from ranges_joined;
b---(0,$200)---c---(1,$300)---e
predicate-"pushability"
The first desirable property of this view is that a predicate (such as an equality predicate, but it works even for the "between" operator, less-than, etc) on sku can be pushed down the view to the base tables. For:
select * from ranges_output_view where sku = 'k100';
the plan is:
----------------------------------------------------------
| Id | Operation | Name |
----------------------------------------------------------
...
| 10 | TABLE ACCESS BY INDEX ROWID| RANGES |
|* 11 | INDEX RANGE SCAN | RANGES_PK |
...
|* 17 | INDEX RANGE SCAN | RANGES_PK |
|* 18 | INDEX RANGE SCAN | RANGES_PK |
----------------------------------------------
---
11 - access("I"."SKU"='k100')
---
17 - access("SKU"='k100')
18 - access("SKU"='k100')
That means that only the required SKU(s) are fed to the view, and proper indexes (such as RANGES_PK in this case) can be used. So, if you need to refresh only a few skus the response time is going to be almost istantaneous - provided that you have only sane (a few) ranges per sku. Hence you can use the same view for both calculating prices of all skus (say, in a nightly batch) and calculating a small subset of skus (say, online), and that is a great help for maintenance and testing.
running in parallel
Another desirable property is that the view can operate efficiently in parallel, at least in 11.2.0.3 (I have not tested other versions):
------------------------------------------------------------------- | Operation |IN-OUT| PQ Distrib | ------------------------------------------------------------------- | SELECT STATEMENT | | | | PX COORDINATOR | | | | PX SEND QC (RANDOM) | P->S | QC (RAND) | | VIEW | PCWP | | | WINDOW SORT | PCWP | | | VIEW | PCWP | | | WINDOW SORT | PCWP | | | VIEW | PCWP | | | WINDOW BUFFER | PCWP | | | VIEW | PCWP | | | WINDOW SORT PUSHED RANK | PCWP | | | HASH JOIN | PCWP | | | PX RECEIVE | PCWP | | | PX SEND HASH | P->P | HASH | | PX BLOCK ITERATOR | PCWC | | | TABLE ACCESS FULL | PCWP | | | PX RECEIVE | PCWP | | | PX SEND HASH | P->P | HASH | | VIEW | PCWP | | | WINDOW SORT | PCWP | | | PX RECEIVE | PCWP | | | PX SEND HASH | P->P | HASH | | VIEW | PCWP | | | SORT UNIQUE | PCWP | | | PX RECEIVE | PCWP | | | PX SEND HASH | P->P | HASH | | UNION-ALL | PCWP | | | PX BLOCK ITERATOR | PCWC | | | INDEX FAST FULL SCAN| PCWP | | | PX BLOCK ITERATOR | PCWC | | | INDEX FAST FULL SCAN| PCWP | | -------------------------------------------------------------------
There's no point of serialization (all servers communicate parallel-to-parallel), the rows are distributed evenly using an hash distribution function (probably over the sku) and all operations are parallel.
sku subsetting and partitioning
It is well known that analytics functions use sort operations heavily, and that means (whether or not you are running in parallel) that the temporary tablespace may be used a lot, possibly too much - as it actually happened to me , leading to (in my case) unacceptable performance.
Side note: as I'm writing this, I realize now that I had probably been hit by the bug illustrated by Jonathan Lewis in Analytic Agony, but of course, overuse of temp could happen, for large datasets, without the bug kicking in.
A possible solution is to process only a sub-batch of the skus at a time, to keep the sorts running in memory (or with one-pass to temp), leveraging the predicate-pushability of the view. In my case, I have made one step further: I have partitioned the table "ranges" by "sku_group", replaced in the view every occurrence of "sku" with the pair "sku_group, sku", and then run something like:
for s in (select sku_group from "list of sku_group") loop select .. from ranges_output_view where sku_group = s.sku_group; end loop;
The predicate gets pushed down to the table, hence partition elimination kicks in and I can visit the input table one time only, one partition at a time, using a fraction of the resources at a time, and hence vastly improving performance.
And that naturally leads to "do-it-yourself parallelism": running a job for every partition in parallel. I'm going to implement it since the customer is salivating about it ... even if it is probably over-engineering :D .
- Monday, October 1, 2012 case studies
- Comment now »
refresh “fast” of materialized views optimized by Oracle as “complete”
In my current "big" project, I am building a network of nested materialized views to transform rows of one schema into rows of another (very different) schema. The former is used by the old (but still live) version of an application of ours, the latter by the new version; our idea is to incrementally (aka "fast") refresh the network daily in order to have the new schema ready when the new version goes live. We need this nework because we have only a few hours of allowed downtime, and the transformations are very complex: the MV network is going to be composed of at least 200+ MVs, each containing tens of millions of rows.
We have carefully designed all MVs as fast refreshable, and built a simple PL/SQL engine to refresh them in parallel using Oracle jobs; we are hence sure that we can meet our time constraints. Now I'm looking at optimizing the COMPLETE refresh, both for damage limitation (should some MVs required to be refreshed as complete, due to corruption or recreation mandated by to last-minute functional changes) and to speed up the initial network build.
One of the optimization I was thinking about was to refresh as complete any MV whose masters have been completely refreshed before, since it is common knowledge that in this case, "complete" is much faster then "fast" (pardon the pun) - mostly because using the MV log to identify modified rows is far from cheap and it's a huge, useless overhead when ALL rows have been modified. But actually I don't need to worry, since I have discovered that in this case, Oracle silently turns the fast refresh into a complete one, at least in 11.2.0.3.
The test case (script mv_fast_becomes_complete.sql) builds a chain of three MVs:
create table dellera_t .. create materialized view dellera_mv1 .. as select .. from dellera_t; create materialized view dellera_mv2 .. as select .. from dellera_mv1; create materialized view dellera_mv3 .. as select .. from dellera_mv2;
Then the test case modifies some rows of the master table and asks for fast refresh:
exec dbms_mview.refresh('dellera_mv1', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv2', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv3', method=>'f', atomic_refresh => true);
It's no surprise that we get:
SQL> select mview_name, last_refresh_type from user_mviews where mview_name like 'DELLERA%' order by mview_name; MVIEW_NAME LAST_REFRESH_TYPE -------------------- ------------------------ DELLERA_MV1 FAST DELLERA_MV2 FAST DELLERA_MV3 FAST
But, when the test case refreshes the first alone as complete, keeping the next ones as fast:
exec dbms_mview.refresh('dellera_mv1', method=>'c', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv2', method=>'f', atomic_refresh => true);
exec dbms_mview.refresh('dellera_mv3', method=>'f', atomic_refresh => true);
We get:
MVIEW_NAME LAST_REFRESH_TYPE -------------------- ------------------------ DELLERA_MV1 COMPLETE DELLERA_MV2 COMPLETE DELLERA_MV3 COMPLETE
Hence, our request for fast has been silently turned by Oracle into a complete refresh, for all MVs down the chain. By the way, I have verified that this is true also for join-only and aggregates-only Mvs (check the test case if interested): all it takes to trigger this silent optimization is that at least one master has been completely refreshed before.
Moreover, the trace file shows the following steps, for the fast-turned-complete refresh of dellera_mv2 (edited for readability):
a) update mlog$_dellera_mv1 set snaptime$$ = :1 where ...; b) delete from dellera_mv2; c) delete from mlog$_dellera_mv2; d) insert into dellera_mv2 (m_row$$, x , y , rid#) select rowid,x,y,rowid from dellera_mv1; e) delete from mlog$_dellera_mv1 where snaptime$$ <= :1;
This is indeed the signature of a complete refresh: a complete delete (b) of the refreshing MV (dellera_mv2) followed by a straight insert (d) from the master (dellera_mv1), with no visit of the MV log of the master besides the usual initial marking (a) and the final purge (e).
What is also interesting is that the MV log is completely (no reference to snaptime$$) deleted in (c) BEFORE the insert; insertion that does not populate the MV log since the kernel trigger is "disabled" during a complete refresh (otherwise we would find a lot of rows in the log since the delete happens before and not after, but the script verifies that no row is found). No MV log management overhead, as expected.
It's also interesting to check what changes when the refresh is done with atomic => false (NB I have skipped table-locking and index-maintenance operations, and non pertinent hints, for brevity):
a) same as above b) same as above c) truncate table dellera_mv2 purge snapshot log; d) insert /*+ append */ into dellera_mv2 (m_row$$, x , y , rid#) select rowid,x,y,rowid from dellera_mv1; e) same as above
That is, the delete has turned into a "truncate purge snapshot log" and the insert is now running in append mode, the rest is the same. In passing, (b) looks a tad redundant.
As always, you can find all scripts in the test case .zip above - including of course spool files and traces, the latter also processed with xtrace exec flow to quickly mine the SQL statements of interest.
- Sunday, September 16, 2012 materialized views
- 1 Comment »
Xplan: now with “self” measures for row source operations
One of the most useful information that the Oracle kernel attaches to plans in the library cache are measures of various resource consumption figures, such as elapsed time, consistent and current gets, disk reads, etcetera. These can be made available for each plan line (aka "row source operation").
These figures are always cumulative, that is, include both the resource consumed by the line itself and all of its progeny. It is very often extremely useful to exclude the progeny from the measure, to get what we could name the "self" figure (following, of course, the terminology introduced by Cary Millsap and Jeff Holt in their famous book Optimizing Oracle Performance).
My sqlplus script xplan now implements the automatic calculation of the "self" for the most important measures, including elapsed time and buffer gets, the most used ones when tuning a statement.
Let's see an example, and then elaborate on their most important application: as a resource profile when tuning.
A simple example
Here's an illustrative example for the measure "elapsed time":
--------------------------------------------------- |Ela |Ela+ |Id|Operation | -last----last-------------------------------------- |801,097| =| 0|SELECT STATEMENT | |801,097| +79,017| 1| HASH JOIN | |673,010|+262,274| 2| TABLE ACCESS BY INDEX ROWID| |410,736| 410,736| 3| INDEX FULL SCAN | | 49,070| 49,070| 4| TABLE ACCESS FULL | -usec----usec--------------------------------------
The first column ("Ela"), whose values are read straight from v$sql_plan_statistics, is the cumulative elapsed time of each row source operation and all its progeny (children, grandchildren, etc). Hence for example, you can see that line#1 (HASH JOIN) run for 801msec, including the time spent by line #2,3,4 (its progeny).
The second column ("Ela+") is the corresponding "self" column, derived from "Ela" by subtracting the time spent by the children - line#1 has two children (#2 and #4), and hence we get 801-673-49=79msec.
Self measures as a resource profile for the plan
Having the "self" measures available makes extremely easy to identify the most expensive row source operations, which are (usually) the first worth considering when tuning (or studying) a SQL statement. Actually, the "self" set is the resource profile of the plan: it blames each consumer (here, the plan lines) for its share of the resource consumed.
For example, line#3 is the most expensive with its 410 msec worth of time - if we are lucky and can reduce its time consumption almost to zero, we would cut the consumption of the whole statement by (about) 50%. It is definitely a line on which to invest some of our tuning time - by e.g. investigating whether a predicate failed to being pushed down; try building a more optimal (e.g. smaller) index; try hinting it into a "FAST FULL SCAN", etc etc.
The second best option for tuning is line#2, a "TABLE ACCESS BY INDEX ROWID"... maybe we could eliminate it completely by adding the fetched columns at the end of the index read by line#3, thus possibly saving 262msec (about 25%) of time.
And so on.
I have found these "self" figures extremely useful in all my recent tuning projects - I hope that the same could turn true for some of you, and maybe that you could suggest me some way to improve xplan :)
- Monday, August 27, 2012 CBO, performance tuning, xplan
- 7 Comments »
Third International NoCOUG SQL & NoSQL Challenge!
For anyone into using their SQL skills creatively, and getting out of the boring SQL-coding daily routine ... here is a puzzle that is both entertaining and challenging, and with a real prize for the winner!
Official abstract:
"In this challenge (see page 25), the Wicked Witch of the West needs help in creating a magic spell to ensure that the Third Annual Witching & Wizarding Ball is a grand success. The winner will receive the August Order of the Wooden Pretzel in keeping with the Steven Feuerstein’s observation that "some people can perform seeming miracles with straight SQL, but the statements end up looking like pretzels created by somebody who is experimenting with hallucinogens." There are currently four knights of the August Order of the Wooden Pretzel: Alberto Dell’Era (Italy) who won the first challenge in 2009 and Andre Araujo (Australia), Rob van Wijk (Netherlands), and Ilya Chuhnakov (Russia) who won the second challenge in 2011."
I remember having a lot of fun when I joined the first edition of this Challenge - I hope the same for you :)
- Monday, May 28, 2012 Uncategorized
- 1 Comment »
Xtrace: an Oracle session trace browser – exec flow
Tracing a session is extremely useful when you need to investigate how a client interacts with the database - the client could be an application of yours, a third-party application, or an Oracle module such as dbms_stats or dbms_mview. To get the perfect picture of the client-server dialogue, you "simply" need to consider all EXEC lines in the trace file, and associate to each line the executed statement and the bind variable values; a very tedious and error-prone task when done manually, that Xtrace can make for you (and for free).
Let's see the tool in action. Consider tracing a call to this stored procedure, that executes a recursive SQL statement :
create or replace procedure test_proc( p_x int )
is
begin
for i in 1..p_x loop
for k in (select count(*) from t where x > i) loop
null;
end loop;
end loop;
end;
Here is the output of Xtrace:

Reading it bottom-up, you can see that the client called the SP, which in turn executed recursively (note the indentation) the SQL statement twice.
You can also ask Xtrace to display the bind variable values used for each execution:

So - the client passed the value "2" for :p_x to the SP, which in turn executed the SQL statement first passing "1" for :B1, and then passing "2".
Interested ? Try it live (requires Java Web Start):
When Xtrace opens up, press the "options" button and then the "EXEC FLOW analysis" button. Enable/disable the bind variable values using the "display BINDS under EXEC" checkbox; color the statements as you like.
We introduced Xtrace in this post; the Xtrace home page contains the tool (which can be used online or downloaded) - and a manual for advanced uses.
- Monday, May 17, 2010 xtrace
- 1 Comment »