Tuning Oracle for Siebel – SQL template

The time has come to write down some of the most relevant discoveries I've made so far while being part of a team that is tuning a huge Siebel installation for a leading Italian company ("huge" especially because of the user base dimension and secondarily because of the hardware deployed, a three-node RAC on pretty powerful SMP machines).

This blog entry is about the structure of the Siebel queries and the effect of the settings of some CBO-related parameters - settings made by the Siebel clients by altering the session at connect time, or required as mandatory in the Siebel installation notes. Other postings may follow in case I discover something worth writing about.

But first of all, let me thank for their support all the fellow members of the OakTable network (especially Tim Gorman that has exchanged long emails with me) and Andy Cowling (introduced to me by Doug Burns) that kindly provided me with a lot of detailed information coming from their vast hands-on experience with Siebel on Oracle.

For the record, the environment is Siebel 8.0, using a three-node 10.2.0.4 RAC cluster on identical SMP machines with 16 CPUs each.

Most of the Siebel queries follow this template:

select ...
  from base, inner1, ..., innerN
 where base.j1 = inner1.j1(+)
   and base.j2 = inner2.j2(+)
   ...
   and base.jN = innerN.jN(+)
   and "filter conditions on base"
  order by base.order1 [asc|desc], base.order2 [asc|desc], ..., base.orderK [asc|desc];

which must be discussed keeping in mind another critical and subtle information: the Siebel clients read only a subset of the returning rows (probably the ones that fit on the screen, sometimes only 2 or 3 on average) - an intention that the Siebel client wisely communicates to the CBO by altering the session and setting optimizer_mode = first_rows_10.

Side note: there are variants to this template; sometimes there are two or three base tables that are joined and filtered together, and more frequently, some of the outer join relations are based on one of the innerN tables, not on the base tables (eg. innerM.jN = innerN.jN), but that is inessential for what we are going to discuss here.

The Siebel client's intention is clear: get some rows from the base table, order them by the columns order1, order2, ..., orderK, get only the first rows, and then get additional information (columns) by following the outer join relations. Note the order of the operations.

Visiting the outer-joined tables almost always comes out as the most expensive part. The reason is two-fold; first, the sheer number of outer-joined tables (ten tables is the norm, up to about 50, since the Siebel relational model is highly normalized and hence the information is highly dispersed), and second and most importantly, because of the join method the CBO is forced to follow.

In fact, Siebel blocks HASH JOINS (_hash_join_enabled=false) and SORT MERGE JOINS (_optimizer_sortmerge_join_enabled=false), which leaves NESTED LOOPS as the only surviving join method. When nested looping, Oracle must obviously start from the outer table listed in the SQL statement (base) and then visit the SQL inner tables (inner1 ... innerN) using the join conditions, something that can be efficiently done only by visiting an index that has innerN.jN as the leading column of its definition. Each of these visits requires blevel+1 consistent gets for the index and some additional consistent gets for every row that is found - usually one or two, sometimes zero, at least in the scenarios I've seen.

So it is clear that every row that is fetched from the base table may easily produce tens and tens of consistent gets coming from the outer joins - and that can easily lead to a disaster if the number of rows from the base table is not very very small. In one of our queries, for example, about one million rows came out from the base table, and since the CBO chose to outer join first and then order, each execution caused a whopping 15,000,000 consistent gets - only to have the Siebel client select the first 3 rows or so. Needless to say, such a huge number of gets is completely unacceptable; the impact on the response time and CPU consumption is obvious, but scalability suffers as well (gets cause latches or mutexes acquisitions that are the most frequent cause of contention) - that means that even a slight increase on the workload may cause severe degradation of response time (especially on RAC).

The solution is to have Oracle order first, and then join - to very quickly feed the first 3 or 4 rows to the client thus sparing the effort to outer join the remaing ones, that will never (or rarely) be fetched. That usually means that you must prepare an ad-hoc index for the query that has the order1,order2, orderK column as the last columns; the first_rows_10 setting will strongly encourage the CBO into choosing the index. For example, say that the last "filter conditions on base" and the order clause are

...
and base.col1 = :1 and upper(base.col2) = :2
order by base.order1 desc, base.order2 asc;

a suitable index might be (col1, upper(col2), order1 desc, order2 asc).
Oracle will hopefully reach the start of the index range (or "slice") corresponding to the filter condition, a range that has the rows sorted exactly as Siebel wants them, and then scan the index range rows in succession; for each of them, it will outer join and then return the result to Siebel, and hence stop after a few rows read from the index and a few outer joins performed.

This is the main "tuning" technique in Siebel; of course there are others (using covering indexes for the outer join tables for example) that can provide some additional advantage to our beloved Oracle instances, that we are going to investigate and that I will blog about if they turn out as being useful. But I doubt that they can be as effective as this one.

Optimizing SQL statements with xplan

Xplan is a utility to simplify and automate the first part of every SQL statement tuning effort, that is, collecting the real plan of the statement, its execution statistics (number of executions, number of buffer gets performed, etc), getting the definition of all the accessed tables (and their indexes), and, last but not least, the CBO-related statistics of the accessed tables (and their indexes and columns) stored in the data dictionary by dbms_stats or ANALYZE.

The utility doesn't need to install any object inside the database since it is a read-only sqlplus script, thus needing minimal support from the customers' DBA production team. It is also so simple to run that I am normally able to ask people with minimal Oracle skills to run xplan on my behalf and then to ship me the output report - with obvious benefits for all.

The report is light and concise, designed with the needs of the SQL tuner in mind; to illustrate, the following small fragment of the report helps to get an immediate picture of the indexes layout:

-----------------------------------
|ColName     |1|2|3|4|5|P|U1|U2|R1|
--------------------U-U------------
|X           |1|1|2|1| |1|  |2 |  |
|PADDING     | | |1|2|1|2|R1|1 |  |
|RR          | | | | | | |  |  |1 |
|SYS_NC00004$|2| | | | | |  |  |  |
|SYS_NC00005$| |2| | | | |  |  |  |
-----------------------------------

You can see at a glance that column X is indexed as the first column of indexes #1, #2 and #4 (#4 being a unique index) and as the second of index #3. Constraints are reported as well, for example the PK is (X, PADDING) and there are two unique constraints (U1, U2) and a FK (R1) constraint.

To see all the information that the report provides, you can check the showcase example in the xplan main page linked above, whose report is annotated to explain the various sections meaning. The most interesting ones are surely the plan section and the table/index/column/partition information.

Using xplan is very simple; just connect to the database with sqlplus (SELECT ANY DICTIONARY and SELECT ANY TABLE are the only necessary privileges) and run the xplan.sql script. There are various ways to tell xplan which statements to report about; probably the most useful one is to ask for statements whose text matches a SQL like expression, for example

SQL>@xplan "select%from%customer%" ""

will dump all the SELECT statements that were run on the CUSTOMER table.

It is also possible to dump a statement by hash value (or by sql_id, module, action, parsing user, and even child number):

SQL>@xplan "" "hash=3280933266"

Some further customizations are possible - for example, you can order the matching statements (technically, the matching child cursors in the shared sql area) by elapsed_time, buffer_gets, etc; you can get a different output file for each hash_value, instead of a single output file; you can suppress or enable certain sections of the report; and so on. For the full list and further details, please check the xplan.sql header.

Xplan is free to download and use. If you decide to try it - for any question, comment or feature request, feel free to drop a comment on this post.

Order of keys inside index blocks

In this post we are going to illustrate how index keys are ordered inside a leaf block of an Oracle B+tree index.

It is well known that index blocks share most of their structure with "regular" (heap) table blocks; in particular, they share most of the way entries are stored inside the block. In tables, a row can be placed anywhere in the bottom part of the block (which is essentially managed like an "heap", in which the exact memory address that the row is stored at is not important); its row address is recorded into one of the slots of a vector named "row directory", located near the beginning of the block.

The address of the row is not published externally: only its position inside the row directory is (as the last part of the rowid, usually named "row number"), in order to enable the kernel to move the row inside the "bottom heap" as it likes - all it has to do is to update the row directory with the new address after the move, and the change will be perfectly transparent to the outside world. This of course enables the optimal utilization of the limited space inside the block, since space inside it can be re-organized at will.

Index entries are stored in the index blocks in exactly the same fashion (and probably much of the code that manages them is the very same, since, for example, they are frequently named "rows" in block dumps and documentation, a convention we will keep in this post). The only difference is that the position inside the row directory is not published as well (there is not such thing as a "keyid" that needs to be known outside of the block); the kernel uses this additional degree of freedom to keep the row directory ordered by key value (in binary order).

Let's check it by using this test case. The script creates a table and defines a one-column index on it:

create table t (x varchar2(10));
create index t_idx on t(x);

and then inserts eight rows in "random" order:

insert into t values('000000');
insert into t values('777777');
insert into t values('111111');
insert into t values('666666');
insert into t values('222222');
insert into t values('555555');
insert into t values('333333');
insert into t values('444444');

The block dump (in 11.1.0.7) of the leaf (and root) block reveals the following index keys (my annotations are enclosed in curly braces):

...
kdxlebksz 8036
row#0[8020] flag: ------, lock: 2, len=16
col 0; len 6; (6):  30 30 30 30 30 30    { '000000' }
col 1; len 6; (6):  01 00 1c 74 00 00
row#1[7988] flag: ------, lock: 2, len=16
col 0; len 6; (6):  31 31 31 31 31 31    { '111111' }
col 1; len 6; (6):  01 00 1c 74 00 02
row#2[7956] flag: ------, lock: 2, len=16
col 0; len 6; (6):  32 32 32 32 32 32    { '222222' }
col 1; len 6; (6):  01 00 1c 74 00 04
row#3[7924] flag: ------, lock: 2, len=16
col 0; len 6; (6):  33 33 33 33 33 33    { '333333' }
col 1; len 6; (6):  01 00 1c 74 00 06
row#4[7908] flag: ------, lock: 2, len=16
col 0; len 6; (6):  34 34 34 34 34 34    { '444444' }
col 1; len 6; (6):  01 00 1c 74 00 07
row#5[7940] flag: ------, lock: 2, len=16
col 0; len 6; (6):  35 35 35 35 35 35    { '555555' }
col 1; len 6; (6):  01 00 1c 74 00 05
row#6[7972] flag: ------, lock: 2, len=16
col 0; len 6; (6):  36 36 36 36 36 36    { '666666' }
col 1; len 6; (6):  01 00 1c 74 00 03
row#7[8004] flag: ------, lock: 2, len=16
col 0; len 6; (6):  37 37 37 37 37 37    { '777777' }
col 1; len 6; (6):  01 00 1c 74 00 01
----- end of leaf block dump -----
...

The address where the "row" (actually an index key) is stored is dumped between square brackets; for example, row#0 (which is the '000000' entry) is stored at address [8020], near the end of the block. You might note that the rows are placed bottom-up in insert order (the first row that was inserted was '000000' and was placed at [8020]; the second was '777777' and was placed at [8004], right above the first, etcetera); they are not ordered by binary value.

The binary dump of the portion of the block containing the "row directory" is

C6E0280 00000000 00001F64 1F341F54 1EF41F14  [....d...T.4.....]
C6E0290 1F041EE4 1F441F24 10E81128 106810A8  [....$.D.(.....h.]

removing auxiliary tracing info, grouping the bytes in two-byte chunks and converting them in decimal, we have

  1F64=8036 end of "bottom heap" (kdxlebksz), also 8020 + 16
  1F34=7988 address of '111111'
  1F54=8020 address of '000000'
  1EF4=7924 address of '333333'
  1F14=7956 address of '222222'
  1F04=7940 address of '555555'
  1EE4=7908 address of '444444'
  1F44=8004 address of '777777'
  1F24=7972 address of '666666'

that is, the addresses in the row directory are ordered by the (binary) value of the key they point to - remember that you must swap the high and low 2-byte quantities in each 32-bit word (the usual little-endian ordering; I haven't checked whether the results would be different on a machine that is not little-endian, as my one - an x86 - is).

So the number after the hash is, indeed, the position (aka index) in the row directory vector; so "row#0[8020]" means "row at position 0 in the row directory, the row being placed at address 8020", and the index positions are actually stored ordered.

The reason for this ordering is to improve the efficiency of typical visits to the index block. Consider a range scan, when the kernel has to find all the index keys contained in the search interval [min_extreme, max_extreme] (possibly min_extreme=max_extreme for equality predicates). As it is well known, the kernel first walks the branch blocks to identify the leaf block that contains min_extreme (or max_extreme, but that would be the same for our discussion); then, it has to identify all the keys that are >= min_extreme and <= max_extreme. Thanks to the ordering, it can simply perform a binary search to locate the first key >= min_extreme, and then walk the next entries in the row directory until it gets to the last one that is <= max_extreme (possibly jumping on the next leaf block). Without this ordering, the kernel would be forced to visit all the keys contained in the block.

The advantage is huge, since visiting all the keys is a O( N ) operation, while a binary search has only O ( log(2,N) ) complexity.

To fully appreciate this on a concrete example, consider that each key of our example needs 16+2 bytes of storage, hence a perfectly packed 8K block might contain approximately 8000 / 18 = 444 keys. Thanks to the ordering, for the very frequent equality search (classic for Primary Key searches for example), the processing consists essentially of the binary search only - hence the number of keys to be considered are down to about ceil( log(2, 444) ) = 9, thus consuming only 9/444 = 2% (!!) of the resources.

It is also worth remembering that this way, only a portion of the block is accessed, thus needing less accesses to central memory (it is unlikely in fact that the whole 8K can be found in the processor's caches), thus reducing the elapsed time considerably since stalling for central memory is a very common bottleneck in database systems - and thus improving scalability for the whole system thanks to the reduction of the traffic on the memory bus.

Of course there's a price to be paid for modifications, since each modification has to keep the row directory ordered, thus shifting the slots of the row directory, which is a relatively expensive operation. Oracle pays an higher bill at modification time to save (hugely) at read time.

As a final observation: note that the branch block visit can be considered a binary search as well, "pre-calculated" by the B+tree structure - the ordering inside a block is just the same divide-and-conquer algorithm applied in a different way.

Why blogging ?

Mainly because I like to write, and people usually like my writings. So, why not ?

Also - because it is a good way of learning. When I was a student, I remember that my favorite reharsal technique was to pretend I was explaining the topic to an imaginary person that knew nothing about it - and that technique worked very well, because when explaining, I very frequently discovered flaws in my knowledge that prompted me to improve even more. Blogging has the same advantage - with the additional bonus that the readers are real, hence they will discover (and hopefully tell me) much more flaws than my good old imaginary friend. A flaw discovered today is an error avoided in production tomorrow.

Then there's interaction with others, that will for sure provide me with interesting insights and other topics to investigate, or even a full-blown discussion extending across posts, blog pages and even ... restaurant tables.

So, here I am. Hope that you will find this Oracle blog useful ...

« Previous Page

Links (alphabetical order)

Blogroll (alphabetical order)

Blog Aggregators (alphabetical order)