
Ch.12 Indexing and Hashing - Tarleton State University


Common DB operations we want to support: random lookup + sequential scan

READ p.482 → Five factors for evaluating indexing/hashing algorithms

Insertion

Deletion

Concepts:

Classifications:

Clustered (a.k.a. primary) vs. non-clustered (a.k.a. secondary)

Dense vs. sparse


Examples:

Dense:

Sparse:

Clustered or non-clustered?
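The example figures are not reproduced here, so below is a minimal sketch of the two kinds of index over a file sorted on its search key (all names and values made up). Note that the sparse lookup only works because the file is clustered: the block whose first key precedes the target is the only block where the target can live.

```python
from bisect import bisect_right

# A clustered (sorted) file: each block holds records ordered by search key.
blocks = [
    [("Adams", 100), ("Baker", 200)],
    [("Curry", 300), ("Davis", 400)],
    [("Evans", 500), ("Frost", 600)],
]

# Dense index: one entry per search-key value -> (block#, slot#).
dense = {key: (b, s)
         for b, blk in enumerate(blocks)
         for s, (key, _) in enumerate(blk)}

# Sparse index: one entry per block, holding that block's first key.
sparse = [blk[0][0] for blk in blocks]

def lookup_dense(key):
    b, s = dense[key]                          # direct hit
    return blocks[b][s]

def lookup_sparse(key):
    b = max(0, bisect_right(sparse, key) - 1)  # last block starting <= key
    for k, v in blocks[b]:                     # then scan within that block
        if k == key:
            return (k, v)
    return None

assert lookup_dense("Davis") == lookup_sparse("Davis") == ("Davis", 400)
```

The trade-off is visible in the sizes: the sparse index stores one entry per block instead of one per key, at the cost of scanning inside the fetched block. For a non-clustered file this sparse scheme breaks down, since matching records need not sit in that block.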


Other minor practical issues: Overflow blocks

Long records that extend over multiple blocks

Duplicates that extend over multiple blocks


Major practical issue: For a large table, the index itself will be large!

Solutions: Store index in RAM

Store index on disk → how many blocks?

o Since the index is sorted → logarithmic (binary) search → ≈ log2(b) disk accesses, where b = # of index blocks

o Logarithmic search vs. linear search, worst-case

Multi-level index → worked example below
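A quick illustration with assumed numbers (not from the slides): if the index occupies b = 10,000 blocks, a linear scan costs up to 10,000 block reads, while binary search costs ⌈log2(10,000)⌉ = 14. Adding an outer sparse index with 100 entries per block shrinks the top level to 100 blocks: binary search there costs ⌈log2(100)⌉ = 7 reads plus 1 read of the inner index block, and if the outer level fits in RAM, the index lookup takes a single disk access.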


Index updates:

Single-level

o Insertion

dense

sparse

o Deletion

Dense

Sparse

Multi-level …

READ and take notes: Section 12.2.3 → Detailed algorithms for the above


What if the file is not ordered on the desired search key?

Secondary index

All secondary indices must be dense! (The file is not sorted on the secondary search key, so a sparse index would have no way to locate the records falling between two index entries.)


Problem with all index-sequential files:

Both random lookups and sequential scans get slower after many insertions and deletions, due to overflow blocks

o Solution: reorganize the file periodically → O(K) linear time

o Solution: leave room to grow → wasted space

o Use a different type of index!

TREES!


Introduction to Section 12.3 – Trees

Fundamental benefit of trees: LOGARITHMIC HEIGHT

N = 15 = 2^4 – 1 nodes

H = 4 = log2(N + 1) levels


Fundamental problem of trees: BALANCING

---------------------------------------------------------------------------------------------------------


Quiz:

1] List the 3 classification criteria we covered for indices.


2] A further classification criterion for indices is whether their search key (SK) is a candidate key (CK) of the table or not.

If SK ≠ CK, then we have to solve this problem: how does a unique index entry point to multiple tuples?

With clustered indices, we can simply point to the tuple containing the first occurrence of SK:

Explain why this works!

Does this solution work for unclustered (secondary) indices? Explain why or why not.


12.3 B+ Trees for index files

B is for balanced … but there are many definitions of balanced!

Properties:

Each key stored in the node is the minimal key in the right sub-tree


Example:

The non-leaf levels form a hierarchy of sparse indices!


Logarithmic height property:

If there are K search-key values in the file, H ≤ ⌈log⌈n/2⌉(K)⌉, where n = max. # of pointers per node. Explain this for a BT!

Why is it important?

Random searches can be performed in logarithmic time, b/c the height of the tree needs to be traversed only once!

(sketch of the algorithm below)
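The algorithm from the following slides is not reproduced; here is a hedged sketch of the lookup under the node layout described above (keys[j] is the minimal key of child j+1); the Node/bptree_search names are made up:

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys            # sorted search-key values
        self.children = children    # internal node: len(keys)+1 child pointers
        self.records = records      # leaf node: one record pointer per key
        self.next_leaf = next_leaf  # leaf chain, used for sequential scans

def bptree_search(root, key):
    node = root
    while node.children is not None:                        # one node read per
        node = node.children[bisect_right(node.keys, key)]  # level: H reads total
    for k, rec in zip(node.keys, node.records):     # final scan inside the leaf
        if k == key:
            return rec
    return None                                     # key not in the file

# Tiny example: root [5] over leaves [2, 3] and [5, 7]
l1 = Node(keys=[2, 3], records=["r2", "r3"])
l2 = Node(keys=[5, 7], records=["r5", "r7"])
l1.next_leaf = l2
root = Node(keys=[5], children=[l1, l2])
assert bptree_search(root, 7) == "r7"
```

Because bisect_right sends key = keys[j] to child j+1, an equal key correctly lands in the subtree whose minimal key it is.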


“Back-of-the-envelope” estimate:
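The slide's numbers are not reproduced; a plausible version of the estimate, with assumed values: take 4 KB nodes holding up to n = 100 pointers, so every node has at least ⌈n/2⌉ = 50 children. A file with K = 1,000,000 search keys then gives H ≤ ⌈log50(1,000,000)⌉ = 4, i.e. a random lookup reads at most 4 nodes from disk (fewer if the upper levels are cached in RAM).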

------------------------------------------------------------------------


Week 14, Lect 3/3

Quiz: A DB file has a B+ tree index.

The node size in the B+ tree is 4 KB, the search keys are 24-byte strings, and each pointer is represented on 8 bytes. What is the maximum # of pointers n that can be stored in a node?

What is the minimum?

If the file has 5 million search keys, what is the number of disk accesses when we search for a random key?

What is the number of disk accesses when we access all keys sequentially?
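A worked solution sketch (assuming a node holds n pointers and n − 1 keys, ignoring any header bytes): 8n + 24(n − 1) ≤ 4096 → 32n ≤ 4120 → n = 128 pointers max.; the minimum is ⌈n/2⌉ = 64. With K = 5,000,000 keys, H ≤ ⌈log64(5,000,000)⌉ = 4 (since 64^3 ≈ 262,000 < 5,000,000 ≤ 64^4 ≈ 16,800,000), so a random search costs about 4 disk accesses. A sequential scan follows the leaf chain: with 64–127 keys per leaf, that is roughly 5,000,000/127 ≈ 39,400 to 5,000,000/64 = 78,125 leaf reads.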


Insertions and deletions to the main file can be handled efficiently, as the index can be reorganized in logarithmic time.

Important exceptions:

o When inserting, a node becomes too big → split nodes

o When deleting, a node becomes too small → merge nodes
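A minimal sketch of the split step for a leaf (simplified to a bare list of keys; not the textbook's full pseudocode):

```python
def split_leaf(keys, n):
    # n = max pointers per node, so a leaf normally holds at most n - 1 keys;
    # here an insertion has temporarily pushed it to n keys.
    assert len(keys) == n
    mid = (n + 1) // 2                 # left node keeps ceil(n/2) keys
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]       # right[0] is copied up into the parent

# e.g. inserting a 4th key into a full leaf when n = 4:
left, right, sep = split_leaf([2, 3, 5, 7], 4)
assert (left, right, sep) == ([2, 3], [5, 7], 5)
```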


Insert “Clearview”


Delete “Downtown”



Delete “Perryridge” → Node a is left with too few pointers (remember the ⌈n/2⌉ minimum)

o Solution: merge it w/ its sibling node → the root now has too few pointers → simply eliminate the root, and the merged node becomes the new root!


It’s not always possible to merge nodes!

What if the sibling is (almost) full?

Solution: redistribute the pointers between siblings.

Delete “Perryridge” → As before, the node is left with too few pointers, but its sibling now has too many! The underfull node borrows the rightmost pointer of its sibling.

The rightmost key of the sibling can always overwrite the leftmost key of their parent (the root here)!


READ

12.3.4 – B+ Tree File Organization

12.3.5 – Indexing strings

SKIP

12.4 – B-Trees

12.5 – Multiple-Key Access


12.6 Static Hashing

Hash = implicit index

Notation: K = set of all search keys

B = set of all “bucket” addresses (buckets are disk blocks)

hash function h: K → B; a key Ki is stored in bucket h(Ki)

A bucket may contain tuples with different search keys → after being read from the disk, the entire bucket must be searched.

Worst hash function ever: all search keys are mapped into the same bucket!

Properties of a good hash function:

Uniform distribution

Random distribution (Why?)

o Typically, h() operates on the low-level binary representation of the search key

READ example p.508 (31 is prime!)

-----------------------------------------------------------------------------------------------


Week 15, Lect.1/3 (last!)

Quiz

Practice exercise 12.3 (a): Construct a B+ tree from empty, by inserting the following values in order: (2, 3, 5, 7, 11, 17, 19, 23, 29, 31). The max. # of pointers is n = 4.

Practice exercise 12.4 (d): From the previous tree, delete 23.
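One consistent worked result, assuming splits keep ⌈n/2⌉ pointers in the left node (check against the algorithms in the text): the ten insertions yield root [19] with internal children [5, 11] and [29], over leaves [2,3] [5,7] [11,17] [19,23] [29,31]. Deleting 23 underflows leaf [19,23]; its sibling [29,31] is already at minimum, so they merge into [19,29,31]; the parent [29] then underflows and borrows through the root, leaving root [11] with internal children [5] and [19], over leaves [2,3] [5,7] [11,17] [19,29,31].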


Back to hashing …

p.508 “The function can be implemented efficiently …” → Horner’s algorithm!
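A sketch of that implementation (the base 31 follows the p.508 example; mapping characters with ord() is an assumption):

```python
def string_hash(s, n_buckets):
    # Horner's rule: computes s[0]*31^(L-1) + s[1]*31^(L-2) + ... + s[L-1]
    # with one multiply and one add per character, reducing mod n_buckets
    # along the way so intermediate values stay small.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % n_buckets
    return h

print(string_hash("Perryridge", 10))   # a bucket address in 0..9
```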

12.6.2 Bucket overflow

Even if the hash function is perfect (i.e. uniform/random), overflow can still occur due to:

the growth of the DB!

multiple records w/ same search key K

Delay overflow by using a fudge factor d → nB = (nr / fr) × (1 + d), where nr = expected # of records and fr = # of records that fit in a bucket (d is typically around 0.2)

When overflow happens, use overflow buckets.


Hash index w/overflow buckets

Do you see why overflow buckets lead to degraded performance?

Solution …


12.7 Dynamic Hashing

Extendable hashing idea: The hash function generates a “large” number of bits b (e.g. 32), but not all of them are used as bucket addresses; only i (i < b) are.
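A minimal runnable sketch of the scheme, assuming b = 32-bit hash values, a directory indexed by the top i bits, and the 2-record buckets of the example below; the class and method names are made up, and this is an illustration rather than the textbook's exact algorithm:

```python
BUCKET_CAP = 2                      # buckets hold 2 records, as in the example

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth: # of hash bits this bucket owns
        self.items = []             # (key, hash) pairs

class ExtendableHash:
    def __init__(self):
        self.i = 0                  # global depth: hash bits currently in use
        self.dir = [Bucket(0)]      # 2^0 = 1 directory entry to start

    def _index(self, h):
        return h >> (32 - self.i) if self.i else 0   # top i bits of the hash

    def insert(self, key, h):
        b = self.dir[self._index(h)]
        if len(b.items) < BUCKET_CAP:
            b.items.append((key, h))
            return
        if b.depth == self.i:       # no spare bit to split on: double the directory
            self.dir = [bk for bk in self.dir for _ in (0, 1)]
            self.i += 1
        # Split b on its next hash bit: re-point the directory entries ...
        b0, b1 = Bucket(b.depth + 1), Bucket(b.depth + 1)
        for j, bk in enumerate(self.dir):
            if bk is b:
                self.dir[j] = b1 if (j >> (self.i - b.depth - 1)) & 1 else b0
        # ... redistribute b's records, then retry (the split may repeat).
        for k, hh in b.items:
            self.dir[self._index(hh)].items.append((k, hh))
        self.insert(key, h)
```

Keeping the high-order bits as the directory index is what lets the directory grow by simple duplication: when i increases by one, each old entry is copied into two adjacent slots.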


Nice example in text pp.515-517

We have the following branch names and the associated hash values (handout):


Buckets can hold only 2 records.

We start w/ empty hash table, i = 0 bits → 2^0 = 1 bucket

Insert Brighton and two Downtown:


Insert Mianus:


Insert three Perryridge:


12.8 Comparison

Ordered indexing (sequential or B+ tree) vs. hashing:

Performance depends on what type of queries we perform most often:

Lookup of individual values (favors hashing) vs. range queries (favor ordered indices, since hashing scatters consecutive key values across buckets)

End of material required for final.