24
Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

Embed Size (px)

Citation preview

Page 1: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

1

Indices

Advanced Database Systems

Dr. Fatemeh Ahmadi-Abkenari

Page 2: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

2

Indices

An index on a database table provides a convenient mechanism for locating a row (data record) without scanning the entire table and thus greatly reduces the time it takes to process a query.

Definition:

Mechanism for locating

Index Entries

Index Entries

Page 3: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

3

Clustered versus Unclustered IndicesA Clustered index or Main index is a sorted index in which the index entries and the data records are sorted on the same search key (So there is a single clustered index); Otherwise it is said to be Unclustered or Secondary index that could be several.

Data R

ecords

Data File

Inde

x F

ile

Index Entries

Mechanism for locating Index Entries

Page 4: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

4

Clustered versus Unclustered Indices

Another Definition:

In a Clustered index, the physical proximity of index entries in the index implies some degree of proximity among the corresponding data records in the data file. Such indices enable certain queries to be executed more efficiently than with unclustered indices. (Indices created with CREATE TABLE statement)

One Advantage:In retrieving a particular data record in the range, the probability of a cache hit is high.

Page 5: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

5

Inverted File and Fully Inverted File

A file is said to be Inverted on a column if a secondary index exists with that column.

A file is Fully Inverted if a secondary index exists on all columns that are not contained in the primary key.

Page 6: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

6

Sparse versus Dense Indices

A Dense index is the one whose entries are one-to-one correspondence with the records in the data file. A secondary or unclustered index must be dense but a clustered index need not be.

009406321 Jacob Taylor MGT

101202303 John Smyth CS

121232348 David Jones EE

131141151 Anita Cohen CS

234567891 Marry Brown ECO

444555666 Sanjay Sen ENG

666777888 Ann White MAT

Anita Cohen

Sanjay Sen

Marry Brown

John Smyth

Jacob Taylor

Ann White

David Jones

Dense Matrix

Page 7: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

7

Sparse versus Dense Indices

A Sparse index over a sorted file is one in which there is a one-to-one correspondence between index entries and pages of that data file. For having a Sparse matrix, it is essential that the data file be ordered on the same key as the index.

009406321 Jacob Taylor MGT

101202303 John Smyth CS

121232348 David Jones EE

131141151 Anita Cohen CS

234567891 Marry Brown ECO

444555666 Sanjay Sen ENG

666777888 Ann White MAT

009406321

131141151

666777888

Sparse Matrix

Page 8: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

8

Multilevel Indexing

Focusing on Location Mechanism not only index entries

7 19 42 77

7 11 12 15

89 99

19 28 33 34 89 92 94 97

Leaf Entry

Separator Entry

Interpretation:

The leaf entries contains pointers to the data records in a separate file.

The leaf entries contain the data records == A storage structure

A two-level index (Sparse Index) with at most four entries fit in a page

Page 9: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

9

Multilevel Indexing

Terminology:

Index Level ===Any level of a tree index (Separator or Leaf)

Separator Level===Location Mechanism

Leaf Level === Index Entries

Examples:

ISAM B+ Trees

Q: Number of Pages of Index EntriesF: Number of Pages of Data Records Q < F

Page 10: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

10

Index Sequential Access Method (ISAM)

ISAM is based on multi level indexing.

Generally, the data records are contained in leaf level, so ISAM==A storage structure for the data file.

ISAM is a main clustered index over the ordered records on the search key.

Inserting and deleting a row cause a serious problem in ISAM structure.

Suitable index structure for a relatively static table.

Insertion problem could be temporarily avoided by using Fillfactor<1.

Characteristics:

Page 11: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

11

Index Sequential Access Method (ISAM)

P0 K1 P1 K2 ---- Kn Pn

Judy Rick

TomMike Pete

Bob Edie

Abe Al Jane Joe

Bob Jane

Rick Sol

Tom

P0 P2

P1

P0 P0P2P1 P1

Page 12: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

12

Constructing ISAM Index StructureJudy Rick

TomMike Pete

Bob Edie

Abe Al Jane Joe

Bob Jane

Rick Sol

Tom

P0 P2

P1

P0 P0P2P1 P1

1- Allocating pages sequentially in the storage structure for the leaf pages.2- Constructing the separator levels from bottom up.3- The root is the top most index built.

Search-key values appear more than once in the tree

Page 13: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

13

Deletion in ISAM Indices

Judy Rick

Bob Edie

Abe Al Jane Joe

Bob Jane

P0

P0 P2P1

e.g. Jane

1- Search for Jane, starts from root, Jane < Judy P0 is followed. Jane== Jane P2 is followed. 2- Item found and Jane (the corresponding leaf entry) is deleted from the

leaf level page but no change are made to the separator level. (The separator levels never change once constructed)

Because ISAM is a static index

1- A search-key value in separator entry has no corresponding value in a leaf entry.

2- The most serious problem here is the potential waste of space where the deallocated leaf entries reside.

Page 14: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

14

Insertion in ISAM Indices e.g. Ivan

Judy Rick

Bob Edie

Abe Al Joe

Bob Jane

P0

P0 P2P1

Ivan

Overflow chain

The new leaf entry is an overflow of the existing leaf-level page, not a new level.

In a dynamic table with frequent insertion, overflow chains can become long, the index structure becomes less efficient since the overflow chains must be searched to satisfy queries.

Insertion is a serious problem if the appropriate leaf page is full

Fillfactor < 1

Page 15: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

15

B+ Trees

B+ tree is the most commonly used index structure.

B+ tree is based on multilevel indexing.

The data records either could be contained in leaf level, or in a separate data file so, B+ could be both only index or storage structure.

B+ tree has additional sibling pointers in leaf level. Searching at separator level is identical to ISAM technique.

Inserting and deleting a row is easy in B+ tree index structure, so it is a suitable index structure for a dynamic table.

B+ tree is a balanced tree so any path from the root to a leaf page has the same length as any other despite the deletion or insertion.

Characteristics:

Page 16: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

16

B+ Trees

In Insertion, instead of creating overflow chain, the tree structure will be modified. So the number of separators in each page will vary from φ/2 to φ (Fan-out= φ).

CREATE INDEX Trans ON Transcript (Grade)

DROP INDEX Trans

Secondary, Unclustered Index , B+ Tree

Page 17: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

17

B+ Trees - Insertion

Judy Rick

Tom

Rick Sol Tom

1- Vince

2- VeraJudy Rick

Tom Vince

Rick Sol Tom Vera Vince

There is room, so no modification is needed in the tree structure.

There is no room, so the tree structure is modified and a new leaf page is added.

A B C

D

Following Rule No. 1

Page 18: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

18

B+ Trees – Insertion Rules

Rule 1:

In general, when a full leaf page containing φ entries must accommodate an insertion, two leaf pages are created one containing φ/2+1 entries and the other containing φ/2 entries. A separator at the next upper index level will be inserted equals to the smallest entry at the new leaf page.

Page 19: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

19

B+ Trees - Insertion

3- RobSol Vince

Sol Tom Vera Vince

A2 B C

D2

Rick Rob

A1

D1

tom

Fan-out=2. Assuming each node is a page that includes two separator entries.

Judy Rick

Tom Vince

Rick Sol Tom Vera Vince

Following Rule No. 2

A

D

Page 20: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

20

B+ Trees – Insertion Rules

Rule 2:

In general, when a page at the separator level must accommodate φ+1 separators (Sol, Tom and Vince), the middle separator (Tom) in the separator sequence is not sorted in either of the two resulting separator pages but instead is pushed up the tree.

Page 21: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

21

B+ TreesWhy Sibling Pointers?

Sibling pointers in ISAM is not necessary because the leaf pages (that generally contain data records) are sorted in the file when the file is constructed. Since the index is static, the ordering is maintained. Overflow chains supports the dynamically inserted index entries.

The B+ tree is a dynamic index structure. Upon deletion and insertion, the order of leaf pages in the file will alter. So sibling pointers link pages at the leaf level in such a way that the link list contains the search-key values of the data records of the table in sorted order.

Page 22: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

22

B+ Trees

Fan-Out

Fan-out(φ) refers to the number of index separator entry in a page.

1- Fan-out(φ) controls the number of levels in the tree in a way that if φ is a small number, the number of levels would be increased.

2- The number of levels equals the number of I/O operations needed to fetch a leaf entry.

Root index occupies one page and could be maintained in main memory for reducing the cost.

Page 23: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

23

Example:

There are 106 rows in the data file, 10000 pages at the leaf level, the Fan-out is 100. Assume that the size of leaf and separator entries are the same and leaf entries and data records are not integrated. How many I/Os are necessary to retrieve a particular leaf?

The number of I/Os to retrieve a particular leaf page equals to: (Log φ Q) + 1.

So Q= 10000 , φ=100 and The number of I/Os= 3

Fan-Out

Page 24: Indices Advanced Database Systems Dr. Fatemeh Ahmadi-Abkenari 1

24

For Further Reading:

Database Systems, An application-Oriented Approach

Second EditionChapter 9

Michael Kifer, Arthur Bernstein, Philip M. LewisPearson, Addison Wesley Publication

2006