NoCOUG’s “First international SQL challenge”

Just a short note to tell my friends that I have been bestowed the August Order of the Wooden Pretzel, that is, that I won the NoCOUG's "First international SQL challenge" with this solution.

I'm especially happy to see that, after (way too) many years since graduation, I am still able to use my math skills to solve problems ... :).

Many thanks to Iggy Fernandez for setting up the contest (and running it with love and dedication), Dan Tow for judging the solution and of course Chen Shapira for advertising and "supporting" it on her blog!

fast refresh of single-table materialized views - algorithm summary

Today we are going to investigate how Oracle fast refreshes materialized views (MVs) of a single master table, containing no aggregate but, at most, filter predicates and additional column definitions:

create materialized view test_mv
build immediate
refresh fast on demand
with rowid
-- with primary key
as
select test_t1.*, x1+x2 as x1x2
  from test_t1
 where x1 != 0.42;

This kind of MVs might be considered a degenerate case of a join-only MV, a topic that we investigated in an earlier post, and one could expect the same algorithm. But that is not the case: the test case shows that the algorithm used is very different.
The two main differences are (as we are going to illustrate in detail) that UPDATEs are actually used in this case (as noted by Cristian Cudizio) instead of DELETE+INSERT only, and especially that row-by-row propagation is performed instead of using a couple of single SQL statements.

This kind of MV is frequently used for replication across a db-link (with a clause such as "from test_t1@db_link"); in this scenario, the MV used to be named SNAPSHOT in old releases. I have checked this scenario as well (not included in the test case) and the only difference is that, obviously, the master table test_t1 is referenced via a db-link and a few hints are injected by the refreshing engine.

In the test case, I have checked both the WITH ROWID and WITH PRIMARY KEY options for the MV DDL; the algorithm turns out as being identical, besides (obviously) that in the former the rowid and in the latter the primary key is used to identify rows.

I am going to follow the path of the previous discussion about join-only MVs referenced above, as both the test case format and some of the actual refresh steps are very similar. I have tested on 9.2.0.8, 10.2.0.4 and 11.1.0.7 for the most common DML on the base table (conventional INSERTs, UPDATEs and DELETEs). I have seen no difference in the algorithm for the three kernel versions.

Materialized view logs configuration

Even for this test case, I have configured the materialized view logs to "log everything" to check whether Oracle is able to take advantage of more information in the log:

create materialized view log on test_t1
with sequence, rowid, primary key (x1, x2)
including new values;

but even for single-table MVs the algorithm uses only the rowid or primary key information, hence the minimal (and hence optimal) log configuration is, for the WITH ROWID option:

create materialized view log on test_t1 with rowid;
 

and for the WITH PRIMARY KEY option:

create materialized view log on test_t1 with primary key;
 

Log snapshots

The first step in the refresh algorithm is to take a log snapshot, exactly as in the join-only case, by setting snaptime$$ = current time. Hence the marked log rows (the ones and only ones to consider for propagation) will be characterized by snaptime$$ <= current time and > last snapshot refresh time. See the previous post about the join-only case for a more in-depth discussion.

Note: actually, for the sake of precision, two (minor) differences with the join-only case are that the snapshot statement is exactly the same in all versions (there's no special version for 11.1.0.7) and that the log is not "inspected to count the number and type of the logged modifications".

Core algorithm: the DELETE and UPSERT steps

Then, the core replication starts. The propagation from the master table is composed of two simple steps, steps that I've named DELETE and UPSERT (UPDate + insERT).

The first DELETE step is a simple select-then-delete row-by-row processing, where each row returned by a select statement is passed to a single-row delete statement.
For the WITH ROWID option, the select statement of the DELETE step is (editing for readability: removing hints, unnecessary aliases, etc):

select distinct m_row$$
  from (
select m_row$$
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'I'
       ) log
 where m_row$$ not in
       (
select rowid from test_t1 mas
 where (mas.x1 <> 0.42)
   and mas.rowid = log.m_row$$
       );

and the delete is a trivial

delete from test_mv where m_row$$ = :1;

The select+delete purpose is to delete all marked rows that are not in the master table anymore, or that are still there but that do not satisfy the MV defining SQL (here, x1 != 0.42) anymore.

In fact, the first in-line view fetches from the log the rowid of a subset (those whose dmltype$$ != 'I') of the marked rows, since :1 is set to the date of the previous refresh of the materialized view. Well actually - the SQL, as it is, would also get the log rows inserted after the snapshot was taken, which is obviously not acceptable since the propagation must operate on a stable set of rows. I'm not sure how the non-marked rows are excluded, but probably the various "select for update" on the log data dictionary tables might play a role by locking the commits on the logs, or maybe the serialization level is set to read-only or serializable (I will investigate this in the future). For now, let's make the conjecture that only the marked rows are selected.

The last correlated subquery simply filters out the rowid of the rows that are still in the master table. The condition dmltype$$ != 'I' ('I' stands for INSERT) is only an optimization, since an inserted row would be filtered out by the subquery anyway - unless it has not been deleted after being inserted, but that would be recorded with another log row with dmltype$$ = 'D'.

Why are updates (dmltype$$ = 'U') not optimized away as well? This is to delete rows from the MV that no longer belong to the current image of the MV defining SQL statement, since they used to satisfy the filter condition (here, x1 != 0.42) but no longer do after an update. Thanks to the filter condition (x1 != 0.42) being included in the subquery, any row that does not satisfy it anymore after an update will not be filtered out, and hence will be deleted.

Note that column m_row$$ of the MV is a hidden (but not virtual) column that records, for each MV row, the rowid of the corresponding master table row. It is automatically created when you define the MV with the WITH ROWID option; an index is automatically created on m_row$$ as well (unless you specify USING NO INDEX, something that does not make sense if you want to fast refresh the MV). Hence you do not need to create any additional index, neither on the master table nor on the MV, to optimize this step of the fast refresh.

Switching to the WITH PRIMARY KEY option, the select statement of the DELETE step is

select distinct pk1
  from (
select pk1
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'I')
       ) log
 where pk1 not in
       (
select pk1
  from test_t1 mas
 where (mas.x1 <> 0.42)
   and log.pk1 = mas.pk1
       );

and the delete is simply

delete from test_mv where pk1 = :1;

That is, the statements are the same as in the WITH ROWID case, with the primary key instead of the rowid in all statements. Since the master table must have a primary key for the MV create to succeed, and since an index on the MV that spans the primary key column(s) is automatically created (unless you specify USING NO INDEX of course), even in the WITH PRIMARY KEY case you do not need to create any additional index for performance. Actually, for best performance, an index on the master table that combines the PK and the column(s) referenced by the MV filter condition - here on (pk1, x1) - might help a bit, since probably the optimal plan is a nested loop having test_t1 as the inner table. This would avoid a block get on the master tables for marked rows not satisfying the MV filter condition; the effectiveness of this index depends on whether you have a lot of updates on the column referenced in the filter condition.

The UPSERT step is a simple select-then-upsert row-by-row processing, where each row returned by a select statement (that calculates the current image of the row that needs to be propagated to the MV) is used to update the corresponding row in the MV; if the update finds no row, the row is inserted.

For the WITH ROWID option, the select statement of the UPSERT step is:

select current.x1, current.x2, current.pk1, current.x1x2,
       rowidtochar (current.rowid) m_row$$
  from (
select x1, x2, pk1, x1+x2 as x1x2
  from test_t1
 where (x1 <> 0.42)
       ) current,
       (
select distinct m_row$$
  from mlog$_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'D'
       ) log
 where current.rowid = log.m_row$$;

and the update and insert statements are simply:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2 = :4 where m_row$$ = :5;
insert into test_mv (x1,x2,pk1,x1x2,m_row$$) values (:1,:2,:3,:4,:5);

The select+upsert purpose is to calculate the new image of all marked rows that satisfy the MV defining SQL filter condition (here, x1 != 0.42) and then overwrite the old image in the MV with the new one. Note that an update on the master table might produce an insert if the old image did not satisfy the filter condition and the new one does.

The structure of the select statement should be obvious after the previous illustration of the DELETE step. Note of course the different optimization in the second inline view (dmltype$$ != 'D'). Even in this case, the automatically created index on the m_row$$ MV column optimizes the update statement, and no other index is necessary for performance on neither the base table nor the MV.

Switching to the WITH PRIMARY KEY option, the select statement of the UPSERT step is

select current.x1, current.x2, current.pk1, current.x1x2
  from (
select x1, x2, pk1, x1+x2 x1x2
  from test_t1
 where (x1 <> 0.42)
       ) current,
       (
select distinct pk1
  from mlog_test_t1
 where snaptime$$ > :1
   and dmltype$$ != 'D'
       ) log
 where current.pk1 = log.pk1;

and the update and insert statements are:

update test_mv set x1=:1, x2=:2, pk1=:3, x1x2=:4 where pk1=:3;
insert into test_mv  (x1, x2, pk1, x1x2) values (:1, :2, :3, :4);

And the same considerations about the substitution of rowid with the primary key hold. The index on the master table on (pk1, x1) might be of help here as well.

So here it is what the algorithm, essentially, is all about: a row-by-row propagation of all the modified (marked) rows to the MV, with a few optimizations.

Algorithm optimizations

Whatever the type of modifications, the algorithm is always the same: both the DELETE and UPSERT step are performed in all cases. Of course, in both cases, the select statement might select no row.

Xplan 2.0

A lot of new features have been added in version 2.0 of xplan, the sqlplus script I use to investigate about SQL statements performance (I spoke about version 1.0 in this post). Here's a brief description.

wait profile (from ASH)

For each statement, its wait profile is calculated fetching wait information from Active Session History:

-----------------------------------------
|ash event                    |cnt |%   |
-----------------------------------------
|enq: HW - contention         |2606|61.0|
|enq: TX - row lock contention| 875|20.5|
|db file sequential read      | 344| 8.0|
|enq: TX - index contention   | 158| 3.7|
|gc current grant busy        | 152| 3.6|
|cpu                          |  56| 1.3|
|gc current block 2-way       |  34| 0.8|
|gc current block busy        |  13| 0.3|
|gc buffer busy               |  10| 0.2|
|gc cr block 2-way            |   7| 0.2|
|gc current grant 2-way       |   5| 0.1|
|read by other session        |   5| 0.1|
|direct path write            |   3| 0.1|
|gc cr block busy             |   3| 0.1|
|gc cr grant 2-way            |   1| 0.0|
|SQL*Net more data from client|   1| 0.0|
|cr request retry             |   1| 0.0|
-----------------------------------------

By default this feature is on in 10g+ and inspects a window of ash_profile_mins=15 minutes from v$active_session_history.

Important note: you must have bought the appropriate Oracle licence (i.e. the Diagnostic Pack in 11.1) to read from that view and hence to use this feature (xplan will output a warning to remember you about that); you can disable this feature by setting ash_profile_mins=0.

Dump of dependent object definitions

If the statement references some database objects (e.g. a view, a pl/sql function) and hence depends on them, xplan will list them right below the statement text:

SELECT /*+ index(t,t_fbi) ordered use_nl(v) xplan_test_marker */
       T.RR, PLSQL_FUNC(MAX(T.X))
  FROM T, V
 WHERE UPPER(T.X) >= '0'
   AND T.X > :B1
   AND V.RR ='x'
 GROUP BY T.RR
 ORDER BY T.RR

- depends on view DELLERA.V
- depends on function DELLERA.PLSQL_FUNC

and the object definition will be reported at the bottom of the xplan output:

############################################# function DELLERA.PLSQL_FUNC ###
function plsql_func (p varchar2)
return varchar2
is
begin
  return p;
end plsql_func;
############################################# view DELLERA.V ###
view columns: #1 X(NUMBER),#2 PADDING(VARCHAR2),#3 RR(VARCHAR2)
select x, padding, rr
  from t
 where x > 0

Reading other RAC instance statements

Now you can read from another instance by specifying the option inst_id (defaults to the instance you are connected). This is handy for inspecting other instances of the RAC cluster without reconnecting.

Automatic dump of AWR most-expensive statements

The experimental script xplan_awr.sql will inspect AWR (Active Workload Repository) and dump all the statements that are still in the library cache and that have exceeded some resource consumption thresholds in any of the periods marked by two consecutive AWR snapshots. Thresholds can be the percentage of total (e.g. dump if the CPU consumption is more that 10% of total CPU) or the ranking position (e.g. dump if the statement ranks more than 5th in the CPU chart - the typical "top-N" analysis). The thresholds are configurable in the topmost "params" WITH clause.

Again, you must have bought the appropriate Oracle licence to use AWR, and hence to run xplan_awr.sql.

fast refresh of join-only materialized views - algorithm summary

This post investigates how Oracle fast refreshes materialized views containing only joins of master tables:

create materialized view test_mv
build immediate
refresh fast on demand
as
select test_t1.*, test_t1.rowid as test_t1_rowid,
       test_t2.*, test_t2.rowid as test_t2_rowid,
       test_t3.*, test_t3.rowid as test_t3_rowid
  from test_t1, test_t2, test_t3
 where test_t1.j1_2 = test_t2.j2_1
   and test_t2.j2_3 = test_t3.j3_2
;

The fast refresh algorithm is simple and very easy to understand - so trivial in fact that once examined and understood, the possible tuning techniques follow naturally.

The test case traces the fast refresh of the above materialized view (MV) using the 10046 event (aka "sql trace"). The test case has been run on 9.2.0.8, 10.2.0.4 and 11.1.0.7 (the latest versions of 9i, 10g and 11g available as of today), and on all of these versions the algorithm used by the refreshing engine (run by invoking dbms_mview.refresh) appears to be the same, with only a few implementation differences.

The test case explores the most general case: it performs inserts, updates and deletes on all the three master tables (the inserts being conventional; I will explore direct-path inserts another time).

Materialized view logs configuration

In the test case, I have configured the materialized view logs to "log everything", in order to check whether more information in the logs could trigger some special kernel code designed to take advantage of it:

create materialized view log on test_t1
with sequence, rowid, primary key (j1_2, x1)
including new values;

but the engine uses only the rowid information even in 11.1.0.7, so you are better off logging only the rowid if the master table feeds join-only materialized views exclusively:

create materialized view log on test_t1 with rowid;

Minimal logging obviously improves the performance of DML against the master tables, but it also optimizes the fast refresh, since the latter, as we are going to see in a moment, reads each log twice, and of course the less you log, the more compact the logs will be.

Log snapshots

After some preliminary visits to the data dictionary, the first operation performed by the fast refresh engine is to "mark" the modifications (recorded in the materialized view logs) to be propagated to the MV. Only the marked log rows are then fed by the fast refresh engine as input to the next steps.

The "flag" used to mark the rows is the column snaptime$$. When the refresh starts, the engine performs a "snapshot" of the materialized view logs by setting the snaptime$$ of all the new rows (those with snaptime$$ = '01/01/4000') of each log in turn to the current time (SYSDATE).

In detail, the snapshot is performed by issuing this SQL statement (slightly edited for readability) in 9.2.0.8 and 10.2.0.4:

update MLOG$_TEST_T1
   set snaptime$$ = :1
 where snaptime$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')

The bind variable :1 is a DATE whose value is equal to SYSDATE.

Note: In 11.1.0.7, the statement is slightly different but makes the same thing, probably in a more scalable way concurrency-wise (check the script spools if you're interested).

You might have noticed the where condition on snaptime$$; that is necessary since the logs might be used by more than one materialized view. When a refresh ends, in fact, the engine checks whether other MVs might need each log row, and deletes only the log rows that have been processed by all dependant MVs; the other ones are left unchanged (and hence keep the snaptime$$ that was set when the fast refresh started). The where condition is needed to avoid overwriting the snaptime$$, and mark with the current time only the brand new rows (those with snaptime$$ = '01/01/4000').

So, at the end of the snapshot, the log rows that must be examined by the refresh engine will be the ones that are marked by having their snaptime$$ between the date of the last refresh (excluded) and :1 (included). All the other log rows must be ignored.

Side note: marking data at a certain point in time and then replicating the marked data is the only replication strategy that can work when you cannot "freeze" the master tables, as this is definitely our case. This is a general topic worth blogging about in the future.

The marked log rows are then inspected to count the number and type of the logged modifications. This is to check whether any of the replication steps (i.e. the DEL and INS steps that we are going to discuss in a moment) could be skipped. Also, the number of modifications is used (in some versions) to inject some hints in the SQL statements of the replication steps, a topic that falls out of the scope of this post.

Core algorithm: the INS and DEL steps

Then, the core replication starts. The replication considers each master table in turn, and for each table, propagates the modifications to the MV. So we have essentially one single algorithm that propagates from a single master table, just repeated once for each master table.

The propagation for each master table is made of two simple steps, steps that I'm going to name after the comments of the SQL as a DEL (DELETE) step followed by an INS (INSERT) step.

The DEL step is (editing for readability: removing hints, unnecessary aliases, etc):

/* MV_REFRESH (DEL) */
delete from test_mv
 where test_t1_rowid in
       (
select * from
       (
select chartorowid (m_row$$)
  from mlog$_test_t1
 where snaptime$$ > :1
       ) as of snapshot (:2)
       )

The subquery simply fetches the rowid of all marked rows, since :1 is the date of the previous refresh of the materialized view, and :2 is the SCN (coded as a RAW variable) of the time when the snapshot was performed. So, this step deletes from the MV all the rows that record the result of the MV-defining join of any of the marked rows (of the current master table) with the other master tables.

This is the step that can benefit from the index on the column that stores the master table rowid (here, test_t1_rowid) that the documentation suggests to create. Note that in order to optimize this step, you need three separate single-column indexes (here, on test_t1_rowid, test_t2_rowid, test_t3_rowid), not a single composite index spanning the (here, three) columns, as it is sometimes wrongly stated.

The INS step is (again editing for readability):

/* MV_REFRESH (INS) */
insert into test_mv
select jv.j1_2, jv.x1, jv.pk1, jv.rid$,
       mas2.j2_1, mas2.j2_3, mas2.x2, mas2.pk2, mas2.rowid,
       mas3.j3_2, mas3.x3, mas3.pk3, mas3.rowid
  from (
select log.rowid rid$, log.*
  from test_t1 log
 where rowid in
       (
select chartorowid(log.m_row$$)
  from mlog$_test_t1
 where snaptime$$ > :1
       )
       ) as of snapshot (:2) jv,
       test_t2 as of snapshot (:2)  mas2,
       test_t3 as of snapshot (:2)  mas3
 where   jv.j1_2 = mas2.j2_1
   and mas2.j2_3 = mas3.j3_2

The subquery is the same as the DEL step and serves the very same function. So, this statement replays the SQL statement of the MV definition, but limited to the marked rows only. Note that all tables are read at the same point in time in the past, the time when the snapshot of the log was performed, thanks to the argument of the "as of snapshot" clause being the same.

In order to speed up the INS step, indexes on the joined columns can be created on the master tables (not on the MV!). This is because, special cases aside, it is well known that the "fast refresh" (the name itself is quite horrible, many people prefer the adjective "incremental") can be actually "fast" only if a small fraction of the master tables is modified (otherwise, a complete refresh is better); in this scenario, almost certainly the optimal plan is a series of NESTED LOOPs that has the current table (test_t1 in this case) as the most outer table, series that can usually benefit a lot by an index on the inner tables join columns. Of course, you must remember that every table, in turn, acts as the most outer table, hence you should index every possible join permutation.

So here what the algorithm is all about: the DEL and INS steps, together, simply delete and recreate the "slice" of the MV that is referenced by the marked rows, whatever the modification type. The algorithm is as simple (and brutal) as it seems.

Algorithm optimizations

The only optimizations performed are the skipping of some steps when they are obviously unnecessary. For every master table, the DEL step is skipped when only INSERTs are marked in the logs; the INS is skipped when only DELETEs are marked, and of course both are skipped if no row is marked. I have not been able to spot any other optimization.

Note that this means that UPDATEs always turn into a delete+insert of the entire "slice". For example, consider the typical case of a parent table (say, CUSTOMER), with a child (say, ORDER) and a grandchild (say, ORDER_LINE); if you update a column of a row of the parent (say, ORDERS_TOTAL_AMOUNT), the parent row and its whole progeny (the "slice") will be deleted and then recreated. This was a quite surprising discovery for me - a fact that I have now committed to memory.

Bind Variables Checker for Oracle - now install-free

I've finally managed to implement an install-free version of my utility to check for bind variables usage. The new script is named bvc_check.sql and when run, it examines the SQL statements stored in the library cache (through gv$sql) and dumps the ones that would be the same if the literals were replaced by bind variables.

An example of the output:

------------------
statements count :  0000000003
bound    : select*from t where x=:n
example 1: select * from t where x = 2
example 2: select * from t where x = 3
------------------

So we have 3 statements that are the same once literals are replaced with bind variables. Two examples are provided; the action of replacing the literals 2 and 3 with the bind variable :n makes the statements the same.

The script are available on this page, that also explains the script workings in more detail and describes other scripts that might be of interest.

Tuning Oracle for Siebel - SQL template

The time has come to write down some of the most relevant discoveries I've made so far while being part of a team that is tuning a huge Siebel installation for a leading Italian company ("huge" especially because of the user base dimension and secondarily because of the hardware deployed, a three-node RAC on pretty powerful SMP machines).

This blog entry is about the structure of the Siebel queries and the effect of the settings of some CBO-related parameters - settings made by the Siebel clients by altering the session at connect time, or required as mandatory in the Siebel installation notes. Other postings may follow in case I discover something worth writing about.

But first of all, let me thank for their support all the fellow members of the OakTable network (especially Tim Gorman that has exchanged long emails with me) and Andy Cowling (introduced to me by Doug Burns) that kindly provided me with a lot of detailed information coming from their vast hands-on experience with Siebel on Oracle.

For the record, the environment is Siebel 8.0, using a three-node 10.2.0.4 RAC cluster on identical SMP machines with 16 CPUs each.

Most of the Siebel queries follow this template:

select ...
  from base, inner1, ..., innerN
 where base.j1 = inner1.j1(+)
   and base.j2 = inner2.j2(+)
   ...
   and base.jN = innerN.jN(+)
   and "filter conditions on base"
  order by base.order1 [asc|desc], base.order2 [asc|desc], ..., base.orderK [asc|desc];

which must be discussed keeping in mind another critical and subtle information: the Siebel clients read only a subset of the returning rows (probably the ones that fit on the screen, sometimes only 2 or 3 on average) - an intention that the Siebel client wisely communicates to the CBO by altering the session and setting optimizer_mode = first_rows_10.

Side note: there are variants to this template; sometimes there are two or three base tables that are joined and filtered together, and more frequently, some of the outer join relations are based on one of the innerN tables, not on the base tables (eg. innerM.jN = innerN.jN), but that is inessential for what we are going to discuss here.

The Siebel client's intention is clear: get some rows from the base table, order them by the columns order1, order2, ..., orderK, get only the first rows, and then get additional information (columns) by following the outer join relations. Note the order of the operations.

Visiting the outer-joined tables almost always comes out as the most expensive part. The reason is two-fold; first, the sheer number of outer-joined tables (ten tables is the norm, up to about 50, since the Siebel relational model is highly normalized and hence the information is highly dispersed), and second and most importantly, because of the join method the CBO is forced to follow.

In fact, Siebel blocks HASH JOINS (_hash_join_enabled=false) and SORT MERGE JOINS (_optimizer_sortmerge_join_enabled=false), which leaves NESTED LOOPS as the only surviving join method. When nested looping, Oracle must obviously start from the outer table listed in the SQL statement (base) and then visit the SQL inner tables (inner1 ... innerN) using the join conditions, something that can be efficiently done only by visiting an index that has innerN.jN as the leading column of its definition. Each of these visits requires blevel+1 consistent gets for the index and some additional consistent gets for every row that is found - usually one or two, sometimes zero, at least in the scenarios I've seen.

So it is clear that every row that is fetched from the base table may easily produce tens and tens of consistent gets coming from the outer joins - and that can easily lead to a disaster if the number of rows from the base table is not very very small. In one of our queries, for example, about one million rows came out from the base table, and since the CBO chose to outer join first and then order, each execution caused a whopping 15,000,000 consistent gets - only to have the Siebel client select the first 3 rows or so. Needless to say, such a huge number of gets is completely unacceptable; the impact on the response time and CPU consumption is obvious, but scalability suffers as well (gets cause latches or mutexes acquisitions that are the most frequent cause of contention) - that means that even a slight increase on the workload may cause severe degradation of response time (especially on RAC).

The solution is to have Oracle order first, and then join - to very quickly feed the first 3 or 4 rows to the client thus sparing the effort to outer join the remaing ones, that will never (or rarely) be fetched. That usually means that you must prepare an ad-hoc index for the query that has the order1,order2, orderK column as the last columns; the first_rows_10 setting will strongly encourage the CBO into choosing the index. For example, say that the last "filter conditions on base" and the order clause are

...
and base.col1 = :1 and upper(base.col2) = :2
order by base.order1 desc, base.order2 asc;

a suitable index might be (col1, upper(col2), order1 desc, order2 asc).
Oracle will hopefully reach the start of the index range (or "slice") corresponding to the filter condition, a range that has the rows sorted exactly as Siebel wants them, and then scan the index range rows in succession; for each of them, it will outer join and then return the result to Siebel, and hence stop after a few rows read from the index and a few outer joins performed.

This is the main "tuning" technique in Siebel; of course there are others (using covering indexes for the outer join tables for example) that can provide some additional advantage to our beloved Oracle instances, that we are going to investigate and that I will blog about if they turn out as being useful. But I doubt that they can be as effective as this one.

Optimizing SQL statements with xplan

Xplan is a utility to simplify and automate the first part of every SQL statement tuning effort, that is, collecting the real plan of the statement, its execution statistics (number of executions, number of buffer gets performed, etc), getting the definition of all the accessed tables (and their indexes), and, last but not least, the CBO-related statistics of the accessed tables (and their indexes and columns) stored in the data dictionary by dbms_stats or ANALYZE.

The utility doesn't need to install any object inside the database since it is a read-only sqlplus script, thus needing minimal support from the customers' DBA production team. It is also so simple to run that I am normally able to ask people with minimal Oracle skills to run xplan on my behalf and then to ship me the output report - with obvious benefits for all.

The report is light and concise, designed with the needs of the SQL tuner in mind; to illustrate, the following small fragment of the report helps to get an immediate picture of the indexes layout:

-----------------------------------
|ColName     |1|2|3|4|5|P|U1|U2|R1|
--------------------U-U------------
|X           |1|1|2|1| |1|  |2 |  |
|PADDING     | | |1|2|1|2|R1|1 |  |
|RR          | | | | | | |  |  |1 |
|SYS_NC00004$|2| | | | | |  |  |  |
|SYS_NC00005$| |2| | | | |  |  |  |
-----------------------------------

You can see at a glance that column X is indexed as the first column of indexes #1, #2 and #4 (#4 being a unique index) and as the second of index #3. Constraints are reported as well, for example the PK is (X, PADDING) and there are two unique constraints (U1, U2) and a FK (R1) constraint.

To see all the information that the report provides, you can check the showcase example in the xplan main page linked above, whose report is annotated to explain the various sections meaning. The most interesting ones are surely the plan section and the table/index/column/partition information.

Using xplan is very simple; just connect to the database with sqlplus (SELECT ANY DICTIONARY and SELECT ANY TABLE are the only necessary privileges) and run the xplan.sql script. There are various ways to tell xplan which statements to report about; probably the most useful one is to ask for statements whose text matches a SQL like expression, for example

SQL>@xplan "select%from%customer%" ""

will dump all the SELECT statements that were run on the CUSTOMER table.

It is also possible to dump a statement by hash value (or by sql_id, module, action, parsing user, and even child number):

SQL>@xplan "" "hash=3280933266"

Some further customizations are possible - for example, you can order the matching statements (technically, the matching child cursors in the shared sql area) by elapsed_time, buffer_gets, etc; you can get a different output file for each hash_value, instead of a single output file; you can suppress or enable certain sections of the report; and so on. For the full list and further details, please check the xplan.sql header.

Xplan is free to download and use. If you decide to try it - for any question, comment or feature request, feel free to drop a comment on this post.

Order of keys inside index blocks

In this post we are going to illustrate how index keys are ordered inside a leaf block of an Oracle B+tree index.

It is well known that index blocks share most of their structure with "regular" (heap) table blocks; in particular, they share most of the way entries are stored inside the block. In tables, a row can be placed anywhere in the bottom part of the block (which is essentially managed like an "heap", in which the exact memory address that the row is stored at is not important); its row address is recorded into one of the slots of a vector named "row directory", located near the beginning of the block.

The address of the row is not published externally: only its position inside the row directory is (as the last part of the rowid, usually named "row number"), in order to enable the kernel to move the row inside the "bottom heap" as it likes - all it has to do is to update the row directory with the new address after the move, and the change will be perfectly transparent to the outside world. This of course enables the optimal utilization of the limited space inside the block, since space inside it can be re-organized at will.

Index entries are stored in the index blocks in exactly the same fashion (and probably much of the code that manages them is the very same, since, for example, they are frequently named "rows" in block dumps and documentation, a convention we will keep in this post). The only difference is that the position inside the row directory is not published as well (there is not such thing as a "keyid" that needs to be known outside of the block); the kernel uses this additional degree of freedom to keep the row directory ordered by key value (in binary order).

Let's check it by using this test case. The script creates a table and defines a one-column index on it:

create table t (x varchar2(10));
create index t_idx on t(x);

and then inserts eight rows in "random" order:

insert into t values('000000');
insert into t values('777777');
insert into t values('111111');
insert into t values('666666');
insert into t values('222222');
insert into t values('555555');
insert into t values('333333');
insert into t values('444444');

The block dump (in 11.1.0.7) of the leaf (and root) block reveals the following index keys (my annotations are enclosed in curly braces):

...
kdxlebksz 8036
row#0[8020] flag: ------, lock: 2, len=16
col 0; len 6; (6):  30 30 30 30 30 30    { '000000' }
col 1; len 6; (6):  01 00 1c 74 00 00
row#1[7988] flag: ------, lock: 2, len=16
col 0; len 6; (6):  31 31 31 31 31 31    { '111111' }
col 1; len 6; (6):  01 00 1c 74 00 02
row#2[7956] flag: ------, lock: 2, len=16
col 0; len 6; (6):  32 32 32 32 32 32    { '222222' }
col 1; len 6; (6):  01 00 1c 74 00 04
row#3[7924] flag: ------, lock: 2, len=16
col 0; len 6; (6):  33 33 33 33 33 33    { '333333' }
col 1; len 6; (6):  01 00 1c 74 00 06
row#4[7908] flag: ------, lock: 2, len=16
col 0; len 6; (6):  34 34 34 34 34 34    { '444444' }
col 1; len 6; (6):  01 00 1c 74 00 07
row#5[7940] flag: ------, lock: 2, len=16
col 0; len 6; (6):  35 35 35 35 35 35    { '555555' }
col 1; len 6; (6):  01 00 1c 74 00 05
row#6[7972] flag: ------, lock: 2, len=16
col 0; len 6; (6):  36 36 36 36 36 36    { '666666' }
col 1; len 6; (6):  01 00 1c 74 00 03
row#7[8004] flag: ------, lock: 2, len=16
col 0; len 6; (6):  37 37 37 37 37 37    { '777777' }
col 1; len 6; (6):  01 00 1c 74 00 01
----- end of leaf block dump -----
...

The address where the "row" (actually an index key) is stored is dumped between square brackets; for example, row#0 (which is the '000000' entry) is stored at address [8020], near the end of the block. You might note that the rows are placed bottom-up in insert order (the first row that was inserted was '000000' and was placed at [8020]; the second was '777777' and was placed at [8004], right above the first, etcetera); they are not ordered by binary value.

The binary dump of the portion of the block containing the "row directory" is

C6E0280 00000000 00001F64 1F341F54 1EF41F14  [....d...T.4.....]
C6E0290 1F041EE4 1F441F24 10E81128 106810A8  [....$.D.(.....h.]

removing auxiliary tracing info, grouping the bytes in two-byte chunks and converting them in decimal, we have

  1F64=8036 end of "bottom heap" (kdxlebksz), also 8020 + 16
  1F34=7988 address of '111111'
  1F54=8020 address of '000000'
  1EF4=7924 address of '333333'
  1F14=7956 address of '222222'
  1F04=7940 address of '555555'
  1EE4=7908 address of '444444'
  1F44=8004 address of '777777'
  1F24=7972 address of '666666'

that is, the addresses in the row directory are ordered by the (binary) value of the key they point to - remember that you must swap the high and low 2-byte quantities in each 32-bit word (the usual little-endian ordering; I haven't checked whether the results would be different on a machine that is not little-endian, as my one - an x86 - is).

So the number after the hash is, indeed, the position (aka index) in the row directory vector; so "row#0[8020]" means "row at position 0 in the row directory, the row being placed at address 8020", and the index positions are actually stored ordered.

The reason for this ordering is to improve the efficiency of typical visits to the index block. Consider a range scan, when the kernel has to find all the index keys contained in the search interval [min_extreme, max_extreme] (possibly min_extreme=max_extreme for equality predicates). As it is well known, the kernel first walks the branch blocks to identify the leaf block that contains min_extreme (or max_extreme, but that would be the same for our discussion); then, it has to identify all the keys that are >= min_extreme and <= max_extreme. Thanks to the ordering, it can simply perform a binary search to locate the first key >= min_extreme, and then walk the next entries in the row directory until it gets to the last one that is <= max_extreme (possibly jumping on the next leaf block). Without this ordering, the kernel would be forced to visit all the keys contained in the block.

The advantage is huge, since visiting all the keys is a O( N ) operation, while a binary search has only O ( log(2,N) ) complexity.

To fully appreciate this on a concrete example, consider that each key of our example needs 16+2 bytes of storage, hence a perfectly packed 8K block might contain approximately 8000 / 18 = 444 keys. Thanks to the ordering, for the very frequent equality search (classic for Primary Key searches for example), the processing consists essentially of the binary search only - hence the number of keys to be considered are down to about ceil( log(2, 444) ) = 9, thus consuming only 9/444 = 2% (!!) of the resources.

It is also worth remembering that this way, only a portion of the block is accessed, thus needing less accesses to central memory (it is unlikely in fact that the whole 8K can be found in the processor's caches), thus reducing the elapsed time considerably since stalling for central memory is a very common bottleneck in database systems - and thus improving scalability for the whole system thanks to the reduction of the traffic on the memory bus.

Of course there's a price to be paid for modifications, since each modification has to keep the row directory ordered, thus shifting the slots of the row directory, which is a relatively expensive operation. Oracle pays an higher bill at modification time to save (hugely) at read time.

As a final observation: note that the branch block visit can be considered a binary search as well, "pre-calculated" by the B+tree structure - the ordering inside a block is just the same divide-and-conquer algorithm applied in a different way.

Why blogging ?

Mainly because I like to write, and people usually like my writings. So, why not ?

Also - because it is a good way of learning. When I was a student, I remember that my favorite reharsal technique was to pretend I was explaining the topic to an imaginary person that knew nothing about it - and that technique worked very well, because when explaining, I very frequently discovered flaws in my knowledge that prompted me to improve even more. Blogging has the same advantage - with the additional bonus that the readers are real, hence they will discover (and hopefully tell me) much more flaws than my good old imaginary friend. A flaw discovered today is an error avoided in production tomorrow.

Then there's interaction with others, that will for sure provide me with interesting insights and other topics to investigate, or even a full-blown discussion extending across posts, blog pages and even ... restaurant tables.

So, here I am. Hope that you will find this Oracle blog useful ...

« Previous Page

Links (alphabetical order)

Blogroll (alphabetical order)

Blog Aggregators (alphabetical order)