
Hash Join

Background

It is very common to hear that hash join is good for joining huge tables while nested loop is good for small tables. I will not comment on that statement, but to me the fundamental difference between hash join and nested loop is that in a nested loop we can look up the inner table using any value from the outer table (and benefit from index access, if any), while in a hash join we cannot do that. The only way to look up the inner table (the probe table, in hash join terms) is by using a constant value.

In the example below, we can effectively divide the query into 2 separate queries.

SELECT /*+ leading(a) use_hash(b) */ a.*, b.*
FROM tbuild a, tprobe b
WHERE a.id = b.id;

Query 1:

SELECT a.* FROM tbuild a;

Query 2:

SELECT b.* FROM tprobe b;

So even though we have an index on the ID column in both tables, Oracle will not be able to use index access for the join; it falls back to full table scans.
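One way to confirm this is to display the execution plan; as a hedged sketch (assuming the tables and the ID indexes from the attached script exist), we would expect to see TABLE ACCESS FULL on both tables rather than index access on ID:

EXPLAIN PLAN FOR
SELECT /*+ leading(a) use_hash(b) */ a.*, b.*
FROM tbuild a, tprobe b
WHERE a.id = b.id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);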

In general, there are only 2 major steps in performing a hash join:

1. Build a hash table from one table using a pre-defined hash function

2. Probe the other table to get the result

Based on how Oracle performs those 2 steps, there are 3 types of hash join:

1. Optimal

2. Onepass

3. Multipass

The objective of this article is to examine the differences between those 3 types, along with a few other scenarios, to see how Oracle handles them.

During the build phase, Oracle creates in-memory hash buckets (we may call this the hash table). The number of hash buckets should be more than enough to avoid hash collisions. Technically, the hash buckets are split into several partitions. Every partition contains several slots, and those slots in turn contain several blocks. If I can give an analogy, it is very close to a partitioned table. Apart from that, there is a bitmap structure that maintains the slot usage.


In-Memory Hash Table                       | Table Segment
-------------------------------------------|-------------------------------------------
Hash partition                             | Table partition
Every partition consists of several slots  | Every segment consists of several extents
Every slot consists of several blocks      | Every extent consists of several blocks

Start the Exercise

In the first 3 exercises I will use the attached script below (create_tables.txt) to build the required tables. Each table has 10,000 unique rows. To get detailed information on the hash join operation, we need to turn on event 10104. Besides that, we need to change "workarea_size_policy" to MANUAL and configure "hash_area_size" to create the required scenarios (ibegin.sql).

create_tables.txt

ibegin.sql
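As a rough sketch of what such a session setup looks like (the exact contents of ibegin.sql are in the attachment; the trace level used here for event 10104 is an assumption):

ALTER SESSION SET workarea_size_policy = MANUAL;
ALTER SESSION SET hash_area_size = 10485760;  -- adjusted per scenario (10 MB, 1.5 MB, 128 kB)
ALTER SESSION SET events '10104 trace name context forever, level 10';  -- hash join trace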

The query for this exercise is:

SELECT /*+ leading(a) */ a.*, b.*
FROM tbuild a, tprobe b
WHERE a.id = b.id;


From the execution plan above, the join will produce 10,000 rows and requires around 2 MB of memory for the hash join. The size of the hash join workarea can be estimated as follows: the query selects all columns of the two tables, the total column size is 201 + 4 = 205 bytes, and we have 10,000 rows in the table, so in total we require roughly 205 * 10,000 / 1,024 = 2,000 kB.
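As a hedged sketch, the same estimate can be cross-checked from the dictionary statistics, assuming the tables have been analyzed (avg_row_len is only an approximation of the 205 bytes derived from the column sizes):

SELECT table_name, num_rows, avg_row_len,
       ROUND(num_rows * avg_row_len / 1024) AS approx_kb
FROM user_tables
WHERE table_name IN ('TBUILD', 'TPROBE');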

Hash Join - Optimal

workarea_size_policy = MANUAL

hash_area_size = 10485760 (10 MB)

The optimal hash join is the best type: Oracle does not require any temporary space to store the hash buckets. Everything is done in memory, since there is enough memory to do that.


Let's analyze the statistics which we can see in the above section of the trace file.

- In the first 3 lines after "Join Type" we can see the size of the memory:
  o Hash Area is 10,304,443
  o Slot Table is 9,478,144
    This value is calculated from the figures below as 13 * 712 * 1,024, where:
    13 is the number of slots
    712 kB is the size of each slot
    1,024 is the number of bytes in 1 kB
  o Overhead is 826,299
    This is calculated as Hash Area - Slot Table
- Number of slots/clusters is 13
- Number of partitions is 8
- Number of blocks in every slot is 89
- Block size is 8 kB
- Slot size is 89 * 8 kB = 712 kB
- Bitmap size for each partition is 64 kB
- Bitmap for all partitions is 8 * 64 kB
- Size of a row is 220 bytes
  The per-row overhead is approximately 15 bytes in this table (220 - 205).
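As a quick arithmetic check of those figures (just a worked verification, not anything Oracle does):

SELECT 89 * 8 AS slot_kb,                          -- blocks per slot * block size in kB = 712
       13 * 712 * 1024 AS slot_table_bytes,        -- slots * slot size * 1,024 = 9,478,144
       10304443 - 13 * 712 * 1024 AS overhead      -- Hash Area - Slot Table = 826,299
FROM dual;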

Next, in the section below, the Build phase starts, and since we have an optimal hash join, we do not see any operation against the temporary tablespace.

Later comes the Probe phase. Let's check a few statistics from the trace file:

- All partitions (8) fit and are available in memory
- Only 8 out of 13 slots are used
- In the "Partition Distribution" we can see (taking one line as an example):


  o The number of rows across all partitions
  o The number of clusters/slots in each partition
  o The number of slots available in memory for each partition
  o The status, an indication of whether the partition is still in memory or not
    If all partitions have kept=1 -> optimal hash join
    If all partitions have kept=0 -> multipass hash join
    If at least 1 (but not all) partition has kept=1 -> onepass hash join
- The number of buckets is 16,384
  This value is 2^14, the next power of 2 above 10,000; with this many buckets we can expect hash collisions to be rare.
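As a side note, the next power of 2 above the 10,000-row build input can be computed like this (a hedged illustration of the sizing, not necessarily how Oracle derives it internally):

SELECT POWER(2, CEIL(LOG(2, 10000))) AS bucket_count FROM dual;  -- returns 16384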

The part below shows the histogram of the number of rows inside the hash buckets.



The last part is the overall statistics information, which is quite self-explanatory. The most interesting part is the first line after the title. From 16,384 available buckets, Oracle only uses 7,538 buckets (the other 8,846 buckets are empty). That means there are some buckets that hold more than 1 row (this can also be seen in the histogram above). So the hash function is efficient enough to manage/address more than 1 row in a bucket without any collision.

Lastly, below is the output of autotrace from the SQL*Plus session along with a few session statistics. We can see the statistic "workarea executions - optimal" increases by 3, where actually only 1 is relevant for the above query (the other scenarios also show an increment of 2 in this statistic, so we should count only 1). There is no temporary tablespace activity in this test case.
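These session statistics can be pulled for the current session with a query like the following (a sketch using v$mystat and v$statname; exact statistic names can vary slightly between versions):

SELECT sn.name, ms.value
FROM v$mystat ms
JOIN v$statname sn ON sn.statistic# = ms.statistic#
WHERE sn.name LIKE 'workarea executions%';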

Hash Join - Onepass

workarea_size_policy = MANUAL

hash_area_size = 1572864 (1.5 MB)


In the onepass type, Oracle needs to dump data to the temporary tablespace due to insufficient hash area memory, but when probing the second table, Oracle iterates over each spilled partition only ONCE. This is why it is called onepass.

In the above capture, the important point is the number of blocks in each slot, which is only 13 (compared to 89 in the optimal type). This makes the slot size 104 kB (13 blocks * 8 kB); there are still 13 slots in this exercise. The reason Oracle reduces the number of blocks is to make sure at least 1 slot of every partition remains available in memory.

The next section (Build phase) is quite interesting: Oracle starts spilling data to the temporary tablespace.

Once the Build phase completes, Oracle starts the Probe phase as below. Let's highlight a few points:

- In the first operation, only 6 slots are in memory
- 2 partitions are in memory
- 2,481 rows are processed

The above 3 items are also expressed in the capture below.


- The number of buckets is 4,096
  Again, this value is 2^12, so we can conclude that Oracle always sizes the bucket count as a power of 2 (smaller here, since the hash area is smaller)
- All 13 slots are used

In the trace file output we see a lot of writing and reading, and we see a new section in the Probe phase, like the one below (HASH JOIN GET FLUSHED PARTITIONS). This is the process of reading back the build table and then continuing to probe the second table to get the result. Oracle does this operation for the rest of the partitions, and since the memory is not sufficient to do the operation in one shot, Oracle iterates the operation.

We can see clearly below that 1,224 rows are processed from the build and probe tables for this partition (Partition 0, if we trace back to the initial step of the Probe phase), and at the end of the iteration the number of rows left to be iterated over is 0.


This is the list of all iterations in this test case. I am not sure why Oracle did not process them in an ordered fashion (from Partition 0 up to 5, or from Partition 5 down to 0).

These are the overall statistics for the onepass type (not all rows are shown).


Below is the output of autotrace from the SQL*Plus session along with a few session statistics. We can see the statistic "workarea executions - onepass" increases by 1 (we can ignore the increment of "workarea executions - optimal", as I mentioned before). There is balanced activity between reads and writes against the temporary tablespace, which means Oracle writes to the temporary tablespace once and reads from it once.
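The read/write balance can also be checked from the session statistics; a sketch, assuming the standard statistic names for direct temporary tablespace I/O:

SELECT sn.name, ms.value
FROM v$mystat ms
JOIN v$statname sn ON sn.statistic# = ms.statistic#
WHERE sn.name IN ('physical reads direct temporary tablespace',
                  'physical writes direct temporary tablespace');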

Hash Join - Multipass

workarea_size_policy = MANUAL

hash_area_size = 131072 (128 kB)

In the multipass type, Oracle also needs to dump data to the temporary tablespace due to insufficient hash area memory, but when probing the second table, Oracle iterates over each spilled partition SEVERAL times. This is why it is called multipass. It is the least efficient type of hash join.


In the above output, every slot has only a single block, so the slot size is 8 kB. The total memory for the slot table is 14 * 8 * 1,024 = 114,688 bytes. The Build phase takes longer since the slot size is smaller (for brevity, please refer to the attached trace file if you want to see how long the "writing" activity of the Build phase takes).

These are a few captures of the Probe phase for the multipass type. Note that the number of buckets is, again, a power of 2.


Below is the detailed iteration for Partition 0. There are 4 iterations to process the 1,224 rows in this partition. Theoretically, if we want to change the type to a onepass operation, we need to multiply "hash_area_size" by the number of iterations (4), so in this case we need to configure at least 131,072 * 4 = 524,288 (512 kB).
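A minimal sketch of that retry (whether 512 kB is exactly enough would still need to be confirmed against a new 10104 trace):

ALTER SESSION SET hash_area_size = 524288;  -- 131,072 * 4
-- re-run the join and check that "workarea executions - onepass" increases instead of multipass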


Below is the output of autotrace from the SQL*Plus session along with a few session statistics. We can see the statistic "workarea executions - multipass" increases by 1 (we can ignore the increment of "workarea executions - optimal", as I mentioned before). There is imbalanced activity between reads and writes against the temporary tablespace: Oracle does more reads than writes.

Which One Should Be The Build Table

Now let's create another test case which will show us the impact of the build table's size on performance and memory usage. The configuration is as follows:

- Create 2 tables, one with 10,000 rows (TBIG) and the other with 2,500 rows (TSMALL). Both tables have 100 distinct values
- Set "workarea_size_policy" to MANUAL
- Set "hash_area_size" to 3 MB
- Create 2 scenarios: the first uses TBIG as the build table and the second uses TSMALL as the build table (a sketch of the two queries follows after the attachments below)

The complete table creation script is attached:

create_tables2.txt

The trace file outputs are attached as well.
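A hedged sketch of the two scenario queries, reusing the leading/use_hash hints from the earlier exercises and assuming the join column is again named ID (the exact queries behind the attached trace files may differ slightly):

-- Scenario 1: TBIG as the build table
SELECT /*+ leading(a) use_hash(b) */ a.*, b.*
FROM tbig a, tsmall b
WHERE a.id = b.id;

-- Scenario 2: TSMALL as the build table
SELECT /*+ leading(a) use_hash(b) */ a.*, b.*
FROM tsmall a, tbig b
WHERE a.id = b.id;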


These are the execution plans for both queries. The consistent gets figure is bigger when we use TSMALL as the build table. (Initially I did not turn on event 10200, so I did not know where those consistent gets were coming from.) The reason behind this symptom can be explained if we turn on event 10200 (which dumps consistent gets). Please find attached below an Excel file with the details of the consistent gets, along with its trace files.

DBA series - Hash Join.xlsx

Again, I attach below the statistics from the TBIG and TSMALL tables. The next capture is a summary of consistent gets for both scenarios (TSMALL as the "build" table and TBIG as the "build" table).

- During the Build phase, Oracle reads all rows from the "build" table. So in the case of TBIG, Oracle requires 1,000 consistent gets (the number of blocks in TBIG), and in the case of TSMALL, Oracle requires 292 consistent gets (yet I cannot explain the difference of 78 in this case).
- The Probe phase is more interesting: instead of loading all available blocks of the "probe" table, it looks like Oracle reads the second table row by row, 1 consistent get per row. The result is 10,000 consistent gets when we use TSMALL as the "build" table (there are 10,000 rows in TBIG). In the case of TBIG as the "build" table, Oracle requires 2,542 consistent gets (there are 2,500 rows in TSMALL; again I cannot explain the difference of 42, but during the test I filled up the buffer cache by doing full table scans against TBIG and TSMALL, and I am not sure whether this was the root cause).
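For cross-checking these figures against the segment sizes, a simple dictionary query can be used (a sketch, assuming statistics are up to date; the BLOCKS figure may differ slightly from the consistent gets reported in the trace):

SELECT table_name, blocks, num_rows
FROM user_tables
WHERE table_name IN ('TBIG', 'TSMALL');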

Apart from that, everything is similar.

(Captures: session statistics with TSMALL as the "build" table and with TBIG as the "build" table.)


Now let's analyze the trace files to see the differences in the hash memory configuration and its components. From the comparison above, we see that Oracle works more efficiently when the build table is small: the number of blocks per slot is bigger and the number of buckets is smaller, which makes the overall memory consumption smaller when we have a smaller build table.


Before we jump to the conclusion that a smaller build table is always better than a bigger one, let's retry the test case with "workarea_size_policy" = AUTO, which is the default and is recommended by Oracle.
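A minimal sketch of the session change for this retry (with AUTO, the work area sizing is driven by the instance-level PGA_AGGREGATE_TARGET rather than by hash_area_size):

ALTER SESSION SET workarea_size_policy = AUTO;
-- hash_area_size is ignored in this mode; PGA_AGGREGATE_TARGET governs the memory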

Again we see that memory consumption is better for the smaller build table. But this time, Oracle decided to configure more blocks for each slot when the bigger build table is used (this is the reverse of the previous test case, where we set "workarea_size_policy" to MANUAL).

Conclusion

1. Multipass hash join is the most inefficient type, and we can change it to at least onepass by multiplying "hash_area_size" by the number of iterations seen in one of the hash partitions.

2. A smaller table is generally a good starting point as the build table, unless you see a significant downgrade in performance; a small table leads to a smaller in-memory hash table.

3. That said, a smaller table is not always better as the "build" table, it depends ;-)

   o It is good as the "build" table since it requires a smaller in-memory hash table to start the join

   o On the other hand, it can generate more consistent gets (again, it depends on the size of the probe table, i.e. its number of rows)

   Saying this table is good as the "build" table, or that table is not good, without confirming it with the numbers, is not wise.

-heri-