Ch.12 Indexing and Hashing - Tarleton State University



Ch.12 Indexing and Hashing

Common DB operations we want to support: random lookup + sequential scan

READ p.482 → Five factors for evaluating indexing/hashing algorithms





Clustered (a.k.a. primary) vs. non-clustered (a.k.a. secondary)

Dense vs. sparse
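A minimal sketch of how a lookup differs between a dense and a sparse index, assuming a file sorted on the search key and split into small blocks. The data and the names `blocks`, `dense_index`, `sparse_keys`, `lookup_dense` and `lookup_sparse` are illustrative assumptions, not taken from the text.

```python
from bisect import bisect_right

# Hypothetical sorted file, stored as blocks of records (search keys only).
blocks = [[2, 3, 5], [7, 11, 13], [17, 19, 23]]

# Dense index: one entry per search-key value -> (block number, slot).
dense_index = {key: (b, s)
               for b, blk in enumerate(blocks)
               for s, key in enumerate(blk)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [blk[0] for blk in blocks]


def lookup_dense(key):
    """The index entry leads straight to the record (or there is none)."""
    return dense_index.get(key)


def lookup_sparse(key):
    """Find the last index entry <= key, then scan that block sequentially."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None  # key is smaller than every key in the file
    for s, k in enumerate(blocks[b]):
        if k == key:
            return (b, s)
    return None
```

The sparse lookup relies on the file being ordered on the search key: it has 3 entries instead of 9, but needs a short sequential scan after the index probe.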




Clustered or non-clustered?

Other minor practical issues: Overflow blocks

Long records that extend over multiple blocks

Duplicates that extend over multiple blocks

Major practical issue: For a large table, the index itself will be large!

Solutions: Store index in RAM

Store index on disk → how many blocks?

o Since the index is sorted → logarithmic (binary) search → ⌈log2(b)⌉ disk accesses

o Logarithmic search vs. linear search, worst-case
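A small sketch to put numbers on the comparison, assuming an index that occupies b blocks and that each block read is one disk access. The function names are made up for illustration; the binary-search figure is the ⌈log2(b)⌉ estimate used in these notes.

```python
def binary_search_accesses(b):
    """Estimated worst-case block reads for binary search: ceil(log2(b))."""
    # (b - 1).bit_length() equals ceil(log2(b)) for b >= 1, using integers only.
    return (b - 1).bit_length()


def linear_search_accesses(b):
    """Worst-case block reads for a linear scan: every block."""
    return b


for b in (100, 10_000, 1_000_000):
    print(b, binary_search_accesses(b), linear_search_accesses(b))
```

For an index of 10,000 blocks this gives 14 accesses versus 10,000.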

Multi-level index → example on next page

Index updates:


o Insertion



o Deletion



Multi-level ……..

READ and take notes: Section 12.2.3 → Detailed algorithms for the above

What if the file is not ordered on the desired search key?

Secondary index

All secondary indices must be dense!

Problem with all index-sequential files:

Both random lookups and sequential scans get slower after many

insertions and deletions, due to overflow blocks

o Solution: reorganize file periodically → O(K) linear time

o Solution: leave room to grow → wasted memory

o Use a different type of index!


Introduction to Section 12.3 – Trees

Fundamental benefit of trees: LOGARITHMIC HEIGHT

N = 15 = 2^4 – 1

H = 4 = log2(N + 1)

Fundamental problem of trees: BALANCING



1] List the 3 classification criteria we covered for indices.

2] A further classification criterion for indices is whether their search key (SK) is

a candidate key (CK) of the table or not.

If SK ≠ CK, then we have to solve this problem: how does a unique index entry

point to multiple tuples?

With clustered indices, we can simply point to the tuple containing the first

occurrence of SK:

Explain why this works!

Does this solution work for unclustered (secondary) indices? Explain why or

why not.

12.3 B+ Trees for index files

B is for balanced … but there are many definitions of balanced!


Each key stored in the node is the minimal key in the right sub-tree


The non-leaf levels form a hierarchy of sparse indices!

Logarithmic height property:

If there are K search-key values in the file, H ≤ ⌈log_⌈n/2⌉(K)⌉ → Explain this for a BT

Why is it important?

Random searches can be performed in logarithmic time b/c the

height of the tree need only be traversed once!

(algorithm below)
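A minimal sketch of the random search, assuming the convention used in these notes (each key in a non-leaf node is the minimal key of the sub-tree to its right). The `Node` class, the tiny hand-built tree and the `find` function are hypothetical illustrations, not the textbook's pseudocode.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted search keys
        self.children = children  # list of child nodes, or None for a leaf


def find(root, key):
    """Walk from the root to one leaf; return (found, nodes_visited)."""
    node, visited = root, 1
    while node.children is not None:
        # Keys equal to a separator live in the sub-tree to its right.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    return key in node.keys, visited


# Hand-built two-level tree: three leaves under one root.
leaves = [Node([2, 3]), Node([5, 7]), Node([11, 17])]
root = Node([5, 11], children=leaves)
```

Each search visits exactly one node per level, so the number of node reads equals the height of the tree.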

“Back-of-the-envelope” estimate:
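One way to run such an estimate, sketched under the assumption that every node has at least ⌈n/2⌉ pointers so the height is at most ⌈log_⌈n/2⌉(K)⌉. The sample values n = 100 and K = 1,000,000 are illustrative assumptions, and `ceil_log` and `max_height` are made-up helper names.

```python
def ceil_log(base, k):
    """Smallest h with base**h >= k, computed with integers only."""
    h, reach = 0, 1
    while reach < k:
        reach *= base
        h += 1
    return h


def max_height(n, num_keys):
    """Upper bound on B+ tree height for max fan-out n (n >= 3) and K keys."""
    min_fanout = (n + 1) // 2  # ceil(n / 2)
    return ceil_log(min_fanout, num_keys)


print(max_height(100, 1_000_000))  # B+ tree with n = 100
print(ceil_log(2, 1_000_000))      # balanced binary tree, for comparison
```

With these numbers the B+ tree needs at most 4 node reads per lookup, against about 20 for a balanced binary tree.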


Week 14, Lect 3/3

Quiz: A DB file has a B+ tree index.

The node size in the B+ tree is 4 KB, the search keys are 24-byte strings, and each

pointer is represented in 8 bytes. What is the maximum # of pointers n that can be

stored in a node?

What is the minimum?

If the file has 5 million search keys, what is the number of disk accesses when we

search for a random key?

What is the number of disk accesses when we access all keys sequentially?
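A possible way to set up the first part of such a problem, sketched under two assumptions: a node holds n pointers and n – 1 keys, and no space is lost to a node header. The helper names are hypothetical, and the numbers in the usage line are made up (they differ from the quiz values).

```python
def max_pointers(node_bytes, key_bytes, pointer_bytes):
    """Largest n with n * pointer_bytes + (n - 1) * key_bytes <= node_bytes."""
    return (node_bytes + key_bytes) // (key_bytes + pointer_bytes)


def min_pointers(n):
    """Minimum number of pointers in a non-root internal node: ceil(n / 2)."""
    return (n + 1) // 2


# Made-up example: 8 KB nodes, 12-byte keys, 4-byte pointers.
n = max_pointers(8 * 1024, 12, 4)
print(n, min_pointers(n))
```

Plugging the minimum fan-out into the height bound ⌈log_⌈n/2⌉(K)⌉ then gives the worst-case number of node reads for a random lookup.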

Insertions and deletions to the main file can be handled efficiently,

as the index can be reorganized in logarithmic time.

Important exceptions:

o When inserting, a node becomes too big → split nodes

o When deleting, a node becomes too small → merge nodes
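A minimal sketch of the split case for a leaf, assuming a leaf holds at most n – 1 keys and that on overflow the first ⌈n/2⌉ keys stay put while the rest move to a new leaf. The function `insert_into_leaf` is a hypothetical illustration; propagating the separator into the parent (and splitting the parent in turn) is left out.

```python
import math
from bisect import insort


def insert_into_leaf(leaf_keys, key, n):
    """Insert key into a sorted leaf that may hold at most n - 1 keys.

    Returns (left, right, separator). If no split is needed,
    right and separator are None.
    """
    keys = list(leaf_keys)
    insort(keys, key)                  # keep the leaf sorted
    if len(keys) <= n - 1:
        return keys, None, None        # the key fits, no split
    cut = math.ceil(n / 2)
    left, right = keys[:cut], keys[cut:]
    # The smallest key of the new right leaf is copied up to the parent.
    return left, right, right[0]


print(insert_into_leaf([2, 3], 5, n=4))
print(insert_into_leaf([2, 3, 5], 7, n=4))
```

With n = 4 the second call overflows the leaf and yields two leaves of two keys each, with 5 as the key to insert into the parent.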

Insert “Clearview”

Delete “Downtown”

It’s not always possible to merge nodes

Delete “Perryridge” → Node a is left with too few pointers (remember ⌈n/2⌉)

Solution: merge it w/its sibling node → root now has too few pointers → simply

eliminate root and merged node becomes new root!

It’s not always possible to merge nodes!

What if the sibling is (almost) full?

Solution: redistribute the pointers between siblings.

Delete “Perryridge” → As before, node a has too few pointers, but its sibling now has too

many pointers for a merge, so node a borrows the rightmost pointer of its sibling.

The rightmost key of the sibling can always overwrite the leftmost one of its own parent (root here)!


12.3.4 – B+ Tree File Organization

12.3.5 – Indexing strings


12.4 – B-Trees

12.5 – Multiple-Key Access

12.6 Static Hashing

Hash = implicit index

Notation: set of all search keys K

set of all “bucket” addresses B (buckets are disk blocks)

hash function h is a function from K to B → h(Ki)

A bucket may contain tuples with different search keys → after being read from the disk,

the entire bucket must be searched.

Worst hash function ever: all search keys are mapped into the same bucket!

Properties of a good hash function:

Uniform distribution

Random distribution (Why?)

o Typically, h() operates on the low-level binary representation of the search key

READ example p.508 (31 is prime!)


Week 15, Lect.1/3 (last!)


Practice exercise 12.3 (a): Construct a B+ tree from empty, by inserting the

following values in order:

(2, 3, 5, 7, 11, 17, 19, 23, 29, 31)

The max. # of pointers is n = 4.

Practice exercise 12.4 (d): From the previous tree, delete 23.

Back to hashing …

p.508 “The function can be implemented efficiently …” → Horner’s algorithm!
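A minimal sketch of the idea, assuming the string hash is the polynomial s[0]·31^(n-1) + s[1]·31^(n-2) + … + s[n-1], kept to 32 bits and then reduced modulo the number of buckets. The function names and the 32-bit wrap-around are assumptions made for illustration.

```python
def string_hash(s, num_buckets):
    """Horner's rule: one multiply and one add per character."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) % 2**32   # keep the running value in 32 bits
    return h % num_buckets


def string_hash_naive(s, num_buckets):
    """Same polynomial, evaluated term by term with explicit powers of 31."""
    n = len(s)
    total = sum(ord(ch) * 31 ** (n - 1 - i) for i, ch in enumerate(s))
    return (total % 2**32) % num_buckets


print(string_hash("Perryridge", 8), string_hash_naive("Perryridge", 8))
```

Both functions return the same bucket; Horner's form simply avoids computing the powers of 31 separately.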

12.6.2 Bucket overflow

Even if the hash function is perfect (i.e. uniform/random), overflow can still occur

due to:

the growth of the DB!

multiple records w/same search key K

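Delay overflow by using a fudge factor d (typically around 0.2) → n_B = (n_r / f_r) × (1 + d), where n_r = # of records and f_r = # of records that fit in a bucket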
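A quick sketch of the bucket-count formula, assuming n_r is the number of records, f_r the number of records per bucket and d the fudge factor. The function name and the sample numbers are made up for illustration.

```python
import math
from fractions import Fraction


def num_buckets(n_r, f_r, d=0.2):
    """Number of buckets n_B = (n_r / f_r) * (1 + d), rounded up."""
    # Exact rational arithmetic avoids float round-off before the ceiling.
    return math.ceil(Fraction(n_r, f_r) * (1 + Fraction(str(d))))


print(num_buckets(10_000, 20))       # with 20% slack
print(num_buckets(10_000, 20, d=0))  # no slack at all
```

Here 10,000 records at 20 per bucket need 500 buckets when packed full, and 600 with d = 0.2, so about one sixth of the space starts out empty.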

When overflow happens, use overflow buckets.

Hash index w/overflow buckets

Do you see why overflow buckets lead to degraded performance?
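To make the cost visible, here is a minimal sketch of a static hash file in which each bucket is a chain of blocks (the primary block plus any overflow blocks) and a lookup counts the blocks it reads. The class `StaticHashFile` and its parameters are hypothetical, and the usage example deliberately uses a poor hash function to force overflow.

```python
class StaticHashFile:
    def __init__(self, num_buckets, bucket_size, hash_fn=hash):
        self.bucket_size = bucket_size
        self.hash_fn = hash_fn
        # One chain per bucket; each chain starts with an empty primary block.
        self.chains = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.chains[self.hash_fn(key) % len(self.chains)]

    def insert(self, key):
        chain = self._chain(key)
        if len(chain[-1]) == self.bucket_size:
            chain.append([])          # last block is full: add an overflow block
        chain[-1].append(key)

    def lookup(self, key):
        """Return (found, blocks_read)."""
        chain = self._chain(key)
        for reads, block in enumerate(chain, start=1):
            if key in block:
                return True, reads
        return False, len(chain)      # a miss reads the whole chain


# Two buckets of two records each; all even keys land in bucket 0.
f = StaticHashFile(num_buckets=2, bucket_size=2, hash_fn=lambda k: k)
for k in (0, 2, 4, 6, 8):
    f.insert(k)
print(f.lookup(0), f.lookup(8), f.lookup(10))
```

Key 0 is found after one block read, key 8 only after three, and an unsuccessful search for 10 also reads all three blocks of the chain.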

Solution …

12.7 Dynamic Hashing

Extendable hashing idea: The hashing function generates a “large” number of

bits b (e.g. 32), but not all of them are being used as bucket addresses. Only i (i <

b) are.
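A compact sketch of the idea, assuming the i high-order bits of a 32-bit hash value index a bucket directory that doubles when a full bucket can no longer be split within the current i. The hash function `crc_hash` (built on `zlib.crc32`) is a stand-in assumption, not the handout's hash values, and the class and method names are hypothetical. Keys that all share one hash value are simply kept in the same bucket, which plays the role of an overflow bucket.

```python
import zlib

B_BITS = 32  # b: number of bits the hash function produces


def crc_hash(key):
    """Stand-in 32-bit hash (an assumption; the handout has its own values)."""
    return zlib.crc32(key.encode("utf-8"))


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # prefix bits this bucket is tied to
        self.keys = []


class ExtendableHash:
    def __init__(self, bucket_size=2, hash_fn=crc_hash):
        self.bucket_size = bucket_size
        self.hash_fn = hash_fn
        self.i = 0                    # number of hash bits currently in use
        self.directory = [Bucket(0)]  # always 2**i entries

    def _slot(self, key):
        # The i high-order bits of the hash select the directory entry.
        return self.hash_fn(key) >> (B_BITS - self.i)

    def lookup(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._slot(key)]
            if len(bucket.keys) < self.bucket_size:
                bucket.keys.append(key)
                return
            same = all(self.hash_fn(k) == self.hash_fn(key) for k in bucket.keys)
            if same or bucket.local_depth == B_BITS:
                bucket.keys.append(key)  # splitting cannot help: overflow
                return
            if bucket.local_depth == self.i:
                # Double the directory: each entry is duplicated, i grows by 1.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)

    def _split(self, bucket):
        depth = bucket.local_depth + 1
        zero, one = Bucket(depth), Bucket(depth)
        for k in bucket.keys:
            bit = (self.hash_fn(k) >> (B_BITS - depth)) & 1
            (one if bit else zero).keys.append(k)
        # Re-point every directory entry that referred to the old bucket.
        for slot, b in enumerate(self.directory):
            if b is bucket:
                bit = (slot >> (self.i - depth)) & 1
                self.directory[slot] = one if bit else zero
```

Only the bucket that overflowed is split; all other directory entries keep pointing at their old buckets, so several entries may share one bucket after the directory doubles.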

Nice example in text pp.515-517

We have the following branch names and the associated hash values (handout):

Buckets can hold only 2 records.

We start w/empty hash table, i = 0 bits → 2^0 = 1 bucket

Insert Brighton and two Downtown:

Insert Mianus:

Insert three Perryridge:

12.8 Comparison

Ordered indexing (sequential or B+ tree) vs. hashing:

Performance depends on what type of queries we perform most often:

Lookup of individual values vs. range queries
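A rough in-memory sketch of the trade-off, assuming a Python dict stands in for a hash index and a sorted list stands in for the leaf level of an ordered index. The key values and function names are chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

keys = [2, 3, 5, 7, 11, 17, 19, 23, 29, 31]

hash_index = {k: pos for pos, k in enumerate(keys)}  # stand-in for hashing
ordered_index = sorted(keys)                         # stand-in for B+ tree leaves


def point_lookup(k):
    """Hashing: go straight to the entry, no ordering needed."""
    return hash_index.get(k)


def range_query_ordered(lo, hi):
    """Ordered index: locate lo once, then read neighbours until hi."""
    return ordered_index[bisect_left(ordered_index, lo):
                         bisect_right(ordered_index, hi)]


def range_query_hashed(lo, hi):
    """Hashing scatters keys, so every entry has to be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)


print(point_lookup(11))
print(range_query_ordered(5, 19))
print(range_query_hashed(5, 19))
```

Both range functions return the same keys, but the hashed version touches all entries while the ordered one touches only the matching run.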

End of material required for final.
