48
Indexing Techniques

Indexing Techniques. Advanced DatabasesIndexing Techniques2 The Problem What can we introduce to make search more efficient? –Indices! What is an index?

Embed Size (px)

Citation preview

Indexing Techniques

Advanced Databases Indexing Techniques 2

The Problem

• What can we introduce to make search more efficient?– Indices!

• What is an index?

… …

Advanced Databases Indexing Techniques 3

Definitions

• Index: an auxiliary data structure to speed up record retrieval• Search key: the field/s of a table which is/are indexed• Storage: index files that contain index records

– Each entry storing• Actual data record • or, search key value k and record ID <k,rid> • or, search key value k and list of records IDs <k,rid list>

• Types: ordered and unordered (hash) indices

Page i Page i+1

Paul

Anna

Tim

Advanced Databases Indexing Techniques 4

Types of Ordered Indices (1/3)

• Assuming ordered data files• Depending on which field is indexed

– Primary index: search key is ordering key field• Pointer for each page

– Secondary index: search key is non ordering field

Paul00112233Anna00112234Matt00112235Tim00112236

Carol00112237Rob00112238

00112233001122350011223600112238

AnnaCarolPaulTim

primary

secondary

Advanced Databases Indexing Techniques 5

Types of Ordered Indices (2/3)

• Depending on the density of index records– Dense index: an index record for each distinct search key value, ie

every record

– Sparse index: index records for only some search key values• search key value for first record in page• pointer to page

Paul00112233Anna00112234

Matt00112235

Tim00112236Carol00112237

Rob00112238

00112233001122350011223600112238

sparse

001122330011223400112235001122360011223700112238

dense

Advanced Databases Indexing Techniques 6

Types of Ordered Indices (3/3)

• Ordering field is nonkey (may have duplicates)– Clustered index

– Unclustered index

Paul00112233

Anna00112234

Matt00112235

Tim00112236

Carol00112237

Rob00112238Paul01112233

Tim01112236Tim02112236

AnnaCarolMattPaulRobTim

001122330011223400112235001122360011223700112238011122330111223602112236

clustered

unclustered

Advanced Databases Indexing Techniques 7

Indices Exercise

• 215 records• 128 bytes/record• 210 bytes/page• ordered file equality search on ordering field, unspanned

organization– without an index

– with a primary index• on field of size 12 bytes• assume pointer 4 bytes long

Advanced Databases Indexing Techniques 8

Multi-level Indices (1/2)

• If access using first-level index is still expensive• Build a sparse index on the first-level index

– Multi-level Index

• Fan-out: index blocking factor

Paul00112233Anna00112234

Matt00112235

Tim00112236Carol00112237

Rob00112238

0011223300112234

00112235

001122360011223700112238

00112233

00112235

00112236

first-level index

second-level index

Advanced Databases Indexing Techniques 9

Multi-level Indices (2/2)

• 26 index records/page (fan-out)• 215 index records• 1st-level

– 29 pages

• 2nd-level– 29 index records

– 23 pages

• 3rd-level– 23 index records

– 1 page

• 1 <= 215 / (26)t

• t = ceil(log26 215 ) = 3

• t = ceil(logfo#index-records)

Advanced Databases Indexing Techniques 10

Dynamic multi-level indices

• So far assumed indices are physically ordered files– expensive insertions and deletions

• Dynamic multi-level indices– B trees

– B+ trees

Advanced Databases Indexing Techniques 11

Tree-structured Indices

• For each node: K1 < K2 < … Kq-1

• For each value X in subtree pointed to by Pi

– Ki-1< X < Ki, 1<i<q

– X < Ki, i=1

– Ki-1< X, i=q

P1 K1 … Ki-1 Pi Ki … Kq-1 Pq

X X X

Advanced Databases Indexing Techniques 12

B tree

• Problems: empty nodes, unbalanced trees– solution: B trees

… …

… …

Advanced Databases Indexing Techniques 13

B tree: Definition

• Each node: <P1,<K1, Pr1>, P2,…,<Kq-1, Prq-1>, Pq>• Pi tree pointer, Ki search value, Pri data pointer • For each node: K1 < K2 < … Kq-1

• For each value X in subtree pointed to by Pi – Ki-1< X < Ki, 1<i<q– X < Ki, i=1– Ki-1< X, i=q

• Each node at most q pointers– B tree is order q

• Each node at least ceil(q/2) tree pointers– except from root

• Internal node with p pointers has p-1 values• All leaves at the same level

– balanced tree

Advanced Databases Indexing Techniques 14

B tree: Example

5 8

ø 1 ø 3 ø ø 6 ø 7 ø ø 9 ø 12 ø

tree pointer

data pointer

ø null pointer

Advanced Databases Indexing Techniques 15

B+ tree

• Most implementations of B tree are B+ tree• Data pointers only in leaves

– more entries in internal nodes than regular B trees

– less internal nodes

– less levels

– faster access

Advanced Databases Indexing Techniques 16

B+ tree: Definition

• Internal nodes: <P1,K1, P2,…, Pq-1, Kq-1, Pq>

• Leaf nodes: <<K1, Pr1>, <K2, Pr2>,…,<Kp-1, Prp-1>, Pnext>

• Pri points a data records or block of pointers of such records

• leaf order

120 150 180

150 156 179 180 200

100 101 110 120 130

Advanced Databases Indexing Techniques 17

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35

120 150 18030

100

B+ tree: Search

• At each level, find smallest Ki larger than search key

• Follow associated pointer Pi

Advanced Databases Indexing Techniques 18

B+ tree: Insert

• Nodes may overflow or underflow• Ignoring overflow or underflow• Inserting data record with with search key value k

– find leaf node

– if k found• add record to file, create indirect block if there isn’t one• add record pointer to indirect block

– if k not found• add data record to file• insert record pointer in leaf node (all search keys in order)

Advanced Databases Indexing Techniques 19

B+ tree: Delete

• Ignoring overflow or underflow• Find leaf node with search key value k• Find data record pointer, delete record• delete index record

– and indirect block, if any, if empty

Advanced Databases Indexing Techniques 20

B+ tree: Simple Insert

• Insert 42

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35

120 150 18030

100k < 100

42

Advanced Databases Indexing Techniques 21

B+ tree: Leaf Overflow (1/2)

• Insert 9

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35 42

120 150 18030

100k < 100

Advanced Databases Indexing Techniques 22

B+ tree: Leaf Overflow (2/2)

• first ceil(n/2) in existing node, rest in new leaf node• n=3+1=4

100 101 110 120 130 150 156 179 180 200

120 150 1809 30

100k < 100

3 5 30 35 429 11

Advanced Databases Indexing Techniques 23

9 30

k < 100

3 5 30 35 429 11

B+ tree: Internal Node Overflow (1/3)

• Insert 210, insert 205

100 101 110 120 130 150 156 179 180 200 210

120 150 180

100

Advanced Databases Indexing Techniques 24

B+ tree: Internal Node Overflow (2/3)

• Leaf Split

9 30

k < 100

3 5 30 35 429 11

100 101 110 120 130 150 156 179 180 200

120 150 180

100

205 210

Advanced Databases Indexing Techniques 25

B+ tree: Internal Node Overflow (3/3)

9 30

k < 100

3 5 30 35 429 11

100 101 110 120 130 150 156 179 180 200

120

100 150

205 210

180 205

Advanced Databases Indexing Techniques 26

B+ tree: New Root (1/2)

• Insert 210, insert 205

100 101 110 120 130 150 156 179 180 200

120 150 180

205 210

Advanced Databases Indexing Techniques 27

B+ tree: New Root (2/2)

180 205

100 101 110 120 130 150 156 179 180 200

120

205 210

150

Advanced Databases Indexing Techniques 28

Index Insert Exercise

• Insert 8, 7, 41

9 30

3 5 30 35 429 11

Advanced Databases Indexing Techniques 29

B+ tree: Delete

• Simple delete case• Underflow case:

– redistribute records

– coalesce with siblings

– update parents

Advanced Databases Indexing Techniques 30

B+ tree: Simple Delete (1/2)

• Delete 110

180 205

100 101 110 120 130 150 156 179 180 200

120

205 210 215

150

Advanced Databases Indexing Techniques 31

B+ tree: Simple Delete (2/2)

• Leaf Updated

180 205

100 101 120 130 150 156 179 180 200

120

205 210 215

150

Advanced Databases Indexing Techniques 32

B+ tree: Delete Redistribution (1/2)

• Delete 180

180 205

100 101 120 130 150 156 179 180 200

120

205 210 215

150

Advanced Databases Indexing Techniques 33

B+ tree: Delete Redistribution (2/2)

• Redistribute entries– left or right sibling

179 205

100 101 120 130 150 156 179 200

120

205 210

150

Advanced Databases Indexing Techniques 34

B+ tree: Delete Coalesce (1/4)

• Delete 101

179 205

100 101 120 130 150 156 179 200

120

205 210 215

150

Advanced Databases Indexing Techniques 35

B+ tree: Delete Coalesce (2/4)

• Leaf updated• No redistribution

– sibling coalesce

179 205

100 120 130 150 156 179 200

120

205 210 215

150

Advanced Databases Indexing Techniques 36

B+ tree: Delete Coalesce (3/4)

• Leaf updated• No redistribution

– sibling coalesce

179 205

100 120 130 150 156 179 200

205 210 215

150

Advanced Databases Indexing Techniques 37

B+ tree: Delete Coalesce (4/4)

• Redistribution

205

100 120 130 150 156 179 200

150

205 210 215

179

Hashing Techniques

Advanced Databases Indexing Techniques 39

Static Hashing (1/2)

• Store records in buckets with overflow chains• Allocate a fixed number of buckets M• Problems:

– small M• long overflow chains, slow search-delete-insert

null

h

null

Advanced Databases Indexing Techniques 40

Static Hashing (2/2)

• Problems:– large M

• wasted space, slow scan null

h

null

null

Advanced Databases Indexing Techniques 41

Dynamic Hashing

• Splitting and coalescing buckets as the database grows-shrinks• One scheme: Extendible Hashing• Hash function generates large values, eg 32 bits

– use i bits, change i as database size changes

• If overflow, double the number of buckets– use i+1 bits of the hash function

– but, expensive: read all pages M and distribute records in 2*M pages

• solution: use a directory and double the size of the directory– only split bucket that overflowed

Advanced Databases Indexing Techniques 42

Extendible Hashing (1/4)

h(18) = 10010

2

01

00

11

10

16 20

2

1

2

2

Directory

Buckets

3 7

2

A

B

C

D

18

Advanced Databases Indexing Techniques 43

Extendible Hashing (2/4)

h(4) = 00100

2

01

00

11

10

16 20

2

1

2

2

3 7

2

A

B

C

D

18

Advanced Databases Indexing Techniques 44

Extendible Hashing (3/4)

2

01

00

11

10

16

3

1

2

2

3 7

2

A

B

C

D

18

20 4

3A1

Advanced Databases Indexing Techniques 45

Extendible Hashing (4/4)

3

001

000

011

010

16

3

1

2

2

3 7

2

A

B

C

D

18

20 4

3A1

101

100

111

110

• Global Depth• Local Depth• If bucket full:

– split bucket

– increment LD

• If GD=LD– increment GD

– double directory

Advanced Databases Indexing Techniques 46

Extendible Hashing: Delete

• If deletion make bucket empty– merge with split image

• If directory pointers point to same bucket as split image– directory halved

Advanced Databases Indexing Techniques 47

Extendible Hashing: Summary

• Avoids overflow pages• Directory can get large• Key search requires just 2 page reads• Space utilization fluctuates

– 59-90% for uniformly distributed records

Advanced Databases Indexing Techniques 48

Extendible Hashing: Exercise

• Initially GD = LD = 1• M = 2 buckets• Hash function: h(k) = k mod 2i

• inserts: 14, 18, 22, 3, 9• deletes 9, 22, 3

1

01

00

12 8

1

5

1