Order of keys inside index blocks

In this post we are going to illustrate how index keys are ordered inside a leaf block of an Oracle B+tree index.

It is well known that index blocks share most of their structure with "regular" (heap) table blocks; in particular, they share most of the way entries are stored inside the block. In tables, a row can be placed anywhere in the bottom part of the block (which is essentially managed like an "heap", in which the exact memory address that the row is stored at is not important); its row address is recorded into one of the slots of a vector named "row directory", located near the beginning of the block.

The address of the row is not published externally: only its position inside the row directory is (as the last part of the rowid, usually named "row number"), in order to enable the kernel to move the row inside the "bottom heap" as it likes - all it has to do is to update the row directory with the new address after the move, and the change will be perfectly transparent to the outside world. This of course enables the optimal utilization of the limited space inside the block, since space inside it can be re-organized at will.

Index entries are stored in the index blocks in exactly the same fashion (and probably much of the code that manages them is the very same, since, for example, they are frequently named "rows" in block dumps and documentation, a convention we will keep in this post). The only difference is that the position inside the row directory is not published as well (there is not such thing as a "keyid" that needs to be known outside of the block); the kernel uses this additional degree of freedom to keep the row directory ordered by key value (in binary order).

Let's check it by using this test case. The script creates a table and defines a one-column index on it:

create table t (x varchar2(10));
create index t_idx on t(x);

and then inserts eight rows in "random" order:

insert into t values('000000');
insert into t values('777777');
insert into t values('111111');
insert into t values('666666');
insert into t values('222222');
insert into t values('555555');
insert into t values('333333');
insert into t values('444444');

The block dump (in 11.1.0.7) of the leaf (and root) block reveals the following index keys (my annotations are enclosed in curly braces):

...
kdxlebksz 8036
row#0[8020] flag: ------, lock: 2, len=16
col 0; len 6; (6):  30 30 30 30 30 30    { '000000' }
col 1; len 6; (6):  01 00 1c 74 00 00
row#1[7988] flag: ------, lock: 2, len=16
col 0; len 6; (6):  31 31 31 31 31 31    { '111111' }
col 1; len 6; (6):  01 00 1c 74 00 02
row#2[7956] flag: ------, lock: 2, len=16
col 0; len 6; (6):  32 32 32 32 32 32    { '222222' }
col 1; len 6; (6):  01 00 1c 74 00 04
row#3[7924] flag: ------, lock: 2, len=16
col 0; len 6; (6):  33 33 33 33 33 33    { '333333' }
col 1; len 6; (6):  01 00 1c 74 00 06
row#4[7908] flag: ------, lock: 2, len=16
col 0; len 6; (6):  34 34 34 34 34 34    { '444444' }
col 1; len 6; (6):  01 00 1c 74 00 07
row#5[7940] flag: ------, lock: 2, len=16
col 0; len 6; (6):  35 35 35 35 35 35    { '555555' }
col 1; len 6; (6):  01 00 1c 74 00 05
row#6[7972] flag: ------, lock: 2, len=16
col 0; len 6; (6):  36 36 36 36 36 36    { '666666' }
col 1; len 6; (6):  01 00 1c 74 00 03
row#7[8004] flag: ------, lock: 2, len=16
col 0; len 6; (6):  37 37 37 37 37 37    { '777777' }
col 1; len 6; (6):  01 00 1c 74 00 01
----- end of leaf block dump -----
...

The address where the "row" (actually an index key) is stored is dumped between square brackets; for example, row#0 (which is the '000000' entry) is stored at address [8020], near the end of the block. You might note that the rows are placed bottom-up in insert order (the first row that was inserted was '000000' and was placed at [8020]; the second was '777777' and was placed at [8004], right above the first, etcetera); they are not ordered by binary value.

The binary dump of the portion of the block containing the "row directory" is

C6E0280 00000000 00001F64 1F341F54 1EF41F14  [....d...T.4.....]
C6E0290 1F041EE4 1F441F24 10E81128 106810A8  [....$.D.(.....h.]

removing auxiliary tracing info, grouping the bytes in two-byte chunks and converting them in decimal, we have

  1F64=8036 end of "bottom heap" (kdxlebksz), also 8020 + 16
  1F34=7988 address of '111111'
  1F54=8020 address of '000000'
  1EF4=7924 address of '333333'
  1F14=7956 address of '222222'
  1F04=7940 address of '555555'
  1EE4=7908 address of '444444'
  1F44=8004 address of '777777'
  1F24=7972 address of '666666'

that is, the addresses in the row directory are ordered by the (binary) value of the key they point to - remember that you must swap the high and low 2-byte quantities in each 32-bit word (the usual little-endian ordering; I haven't checked whether the results would be different on a machine that is not little-endian, as my one - an x86 - is).

So the number after the hash is, indeed, the position (aka index) in the row directory vector; so "row#0[8020]" means "row at position 0 in the row directory, the row being placed at address 8020", and the index positions are actually stored ordered.

The reason for this ordering is to improve the efficiency of typical visits to the index block. Consider a range scan, when the kernel has to find all the index keys contained in the search interval [min_extreme, max_extreme] (possibly min_extreme=max_extreme for equality predicates). As it is well known, the kernel first walks the branch blocks to identify the leaf block that contains min_extreme (or max_extreme, but that would be the same for our discussion); then, it has to identify all the keys that are >= min_extreme and <= max_extreme. Thanks to the ordering, it can simply perform a binary search to locate the first key >= min_extreme, and then walk the next entries in the row directory until it gets to the last one that is <= max_extreme (possibly jumping on the next leaf block). Without this ordering, the kernel would be forced to visit all the keys contained in the block.

The advantage is huge, since visiting all the keys is a O( N ) operation, while a binary search has only O ( log(2,N) ) complexity.

To fully appreciate this on a concrete example, consider that each key of our example needs 16+2 bytes of storage, hence a perfectly packed 8K block might contain approximately 8000 / 18 = 444 keys. Thanks to the ordering, for the very frequent equality search (classic for Primary Key searches for example), the processing consists essentially of the binary search only - hence the number of keys to be considered are down to about ceil( log(2, 444) ) = 9, thus consuming only 9/444 = 2% (!!) of the resources.

It is also worth remembering that this way, only a portion of the block is accessed, thus needing less accesses to central memory (it is unlikely in fact that the whole 8K can be found in the processor's caches), thus reducing the elapsed time considerably since stalling for central memory is a very common bottleneck in database systems - and thus improving scalability for the whole system thanks to the reduction of the traffic on the memory bus.

Of course there's a price to be paid for modifications, since each modification has to keep the row directory ordered, thus shifting the slots of the row directory, which is a relatively expensive operation. Oracle pays an higher bill at modification time to save (hugely) at read time.

As a final observation: note that the branch block visit can be considered a binary search as well, "pre-calculated" by the B+tree structure - the ordering inside a block is just the same divide-and-conquer algorithm applied in a different way.


  1. Two Excellent Index Related Blog Posts « Richard Foote’s Oracle Blog

    Monday, May 25, 2009

    [...] second is by Alberto Dell’Era who discusses in a post called Order of keys inside index blocks exactly how Oracle orders and stores the index keys within an index block. I’ve exchanged a [...]

  2. Cristian

    Friday, May 29, 2009

    Hi Alberto,
    congratulations, i will try to follow your example writing a question in English (correct me if i make some mistakes :) ). I’ve get in trouble with the point in wich you demonstrate that items in row directory are ordered by binary value of the key. Why are the byte grouped by groups of two? Another question is about storage needed by each key, you say that in your example is 16+2 bytes: what are the 4+2 bytes over keys used for?

  3. zfriese

    Wednesday, June 3, 2009

    Great article! However, your use of asymptotic notation is non-standard, and might be confusing. Consider simplifying to O(log n). Binary searches have logarithmic complexity. Thanks.

  4. Alberto Dell'Era

    Thursday, June 4, 2009

    Sorry for answering late, since I’ve just got back from a internet-less vacation :)

    @Cristian

    the bytes are ordered in groups of two because in order to address a byte inside the block we need 16 bits (2^16 = 65535, which is enough for all configurable dimensions of the Oracle block).

    We need 16+2 bytes because 16 is the length of the entry itself (see len=16 in the block dump) and 2 is the size of the pointer in the row directory that points to the entry.

    @zfriese

    I will check whether I can find a standard notation to express the logarithm base as well; even if the base is not strictly necessary when performing big-O analysis, I think it adds value in this context.
    Thanks for your comment.

  5. Dennis Sun

    Wednesday, July 1, 2009

    Great job. I tested it on sun sparc server(big-endians). don’t need to swap the high and low 2 bytes:

    106751080 00000000 1F600000 1F501F30 1F101EF0  [.....`...P.0....]
    106751090 1EE01F00 1F201F40 1D391D1D 1D011CE5  [..... .@.9......]
    
    row#0[8016] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  30 30 30 30 30 30
    col 1; len 6; (6):  01 80 05 7f 00 00
    row#1[7984] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  31 31 31 31 31 31
    col 1; len 6; (6):  01 80 05 7f 00 02
    row#2[7952] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  32 32 32 32 32 32
    col 1; len 6; (6):  01 80 05 7f 00 04
    row#3[7920] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  33 33 33 33 33 33
    col 1; len 6; (6):  01 80 05 7f 00 06
    row#4[7904] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  34 34 34 34 34 34
    col 1; len 6; (6):  01 80 05 7f 00 07
    row#5[7936] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  35 35 35 35 35 35
    col 1; len 6; (6):  01 80 05 7f 00 05
    row#6[7968] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  36 36 36 36 36 36
    col 1; len 6; (6):  01 80 05 7f 00 03
    row#7[8000] flag: ------, lock: 2, len=16
    col 0; len 6; (6):  37 37 37 37 37 37
    col 1; len 6; (6):  01 80 05 7f 00 01
    
  6. Alberto Dell'Era

    Thursday, July 2, 2009

    @Dennis

    Thank you very much! Your observation makes this post complete …

    For better readability, here’s your observation “decoded” from hex to decimal, in the same format as above:

    1F50=8016 address of '000000'
    1F30=7984 address of '111111'
    1F10=7952 address of '222222'
    1EF0=7920 address of '333333'
    1EE0=7904 address of '444444'
    1F00=7936 address of '555555'
    1F20=7968 address of '666666'
    1F40=8000 address of '777777'
    
  7. Kostas Hairopoulos

    Monday, August 31, 2009

    Alberto, thank you for sharing this article with us however can please give some lights in your phrase like ” … that is, the addresses in the row directory are ordered by the (binary) value of the key they point to”.
    Its still not clear to me.
    As far as I understand from the leaf block dump, the row slot in row directory follows the key order ( ‘0000000′ is in slot 0, ‘11111111′ is in slot 1 etc) and not the. What will be the order of a new row number 8 and value ‘88888888′ in position 7898?

    Best Regards,
    Kostas Hairopoulos

    Thank you in advance,
    khair

  8. Alberto Dell'Era

    Tuesday, September 1, 2009

    @Kostas

    Yes, slot#0 “contains” ‘00000000′, slot#1 “contains” ‘11111111′, and hence a new row with the key value of ‘88888888′ would be “in” slot#8, wherever it gets inserted in the block.

    Note that I have placed “contains” in double quotes because actually, each slot contains the address of the index key, not the index key itself. For example, in the example of the post above, slot#0 contains the number 8020 that is the address of first byte of its key ‘00000000′.

    The block dump routine simply walks the row directory (an array of 16 bit elements) starting from the first slot and prints both the address and the key value.

    Note: of course the index key order is as above because my database charset is AL32UTF8, and hence the binary representation of the pure-ASCII strings I have used is ordered as the language conventions dictate. Index keys are always stored in the block using the binary value.

  9. DSLR-A900

    Sunday, November 13, 2011

    è stato molto interessante da leggere. Voglio citare il tuo post nel mio blog . Si può ? E tu et un account su Twitter?

  10. Alberto Dell'Era

    Monday, November 14, 2011

    Certo che puoi citarlo … no, non ho account Twitter ;)

Leave a Comment

Please enclose SQL code inside the tag pair [sql] ... [/sql]

Subscribe without commenting

Links (alphabetical order)

Blogroll (alphabetical order)

Blog Aggregators (alphabetical order)


Switch to our mobile site