Indexing in DBMSs

1

Erik Selberg
590db
4/29/98

2

Outline

Motivation
Cost Functions & 521
B+-Trees
ISAM
Unstructured Text & IR
Conclusion

3

Motivation

Data stored on disk pages in one way
  O(n) space

Data can be ordered one way (if at all)
  O(log n) or O(1) lookup for one attribute
  O(n) lookup for the rest

Make lookups faster
  Increase space necessary
  What about speed of other operations?

4

Cost Functions

B  data pages on disk
R  records per page
     O(n) = O(BR)
D  I/O time (~25ms)
C  CPU time (~1-10ms)
H  Hash function time (~1-10ms)

5

DBMS operations

Scan - fetch all records
Search w/ Equality
  Lookups and Modifications
Search w/ Range
Insert
Delete

Bulk operations may be amortized!

6

Baseline Storage

Unorganized (heap)
Sorted
  Sorted on one key
Hashed
  static hashing using chaining

7

Unorganized Heaps

Scan       BD + BRC
Search =   1/2 (BD + BRC)
Search <>  BD + BRC
Insert     2D + C
Delete     C + D

Challenge: make this worse

8

Sorted

Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #
Insert     (D lg B + C lg R) + (BD + BRC)
Delete     (D lg B + C lg R) + (BD + BRC)

Good for range, crappy for rest

9

Static Hash w/ Chaining

Scan       1.25(BD + BRC)
Search =   H + D + 1/2RC
Search <>  1.25(BD + BRC)
Insert     (H + D + 1/2RC) + (C + D)
Delete     (H + D + 1/2RC) + (C + D)

Need to grow and shrink hash table
Bad hashes hose you
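
A minimal sketch of static hashing with chaining, to show why a fixed table and a bad hash hurt. The class name StaticHashFile, the page size, and the bucket count are assumptions for illustration, not from the slides. Pages are modeled as Python lists, and the count of pages read stands in for the D term.

```python
class StaticHashFile:
    """Fixed number of buckets; each bucket is a chain of fixed-size pages."""

    def __init__(self, num_buckets, records_per_page):
        self.records_per_page = records_per_page
        # Each bucket starts with one empty primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        # The number of buckets never changes: that is the "static" part.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        if len(chain[-1]) == self.records_per_page:
            chain.append([])  # last page is full: add an overflow page
        chain[-1].append((key, value))

    def search(self, key):
        """Return (matching values, number of pages read)."""
        matches, pages_read = [], 0
        for page in self._chain(key):
            pages_read += 1
            matches.extend(v for k, v in page if k == key)
        return matches, pages_read


f = StaticHashFile(num_buckets=4, records_per_page=2)
for key in range(0, 32, 4):  # 8 integer keys that all land in bucket 0
    f.insert(key, f"rec{key}")
values, pages_read = f.search(28)
print(values, pages_read)  # ['rec28'] 4
```

With a well-spread hash the same lookup would read one page, matching the H + D + 1/2RC estimate. Here every key collides, so the search walks the whole overflow chain.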

10

Cost summary

File Type  Scan            Search =         Search <>            Insert           Delete
Heap       BD + BRC        ½(BD + BRC)      BD + BRC             2D + C           C + D
Sorted     BD + BRC        D lg B + C lg R  D lg B + C lg R + #  Srch + BD + BRC  Srch + BD + BRC
Hashed     1.25(BD + BRC)  H + D + ½RC      1.25(BD + BRC)       Srch + C + D     Srch + C + D
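
To get a feel for the magnitudes, the formulas in the summary can be evaluated directly. This is a sketch, not part of the original slides: the function name costs and the example values of B and R are assumptions, D = 25 ms comes from the Cost Functions slide, and C = H = 1 ms is the low end of the ranges given there. The query-dependent "#" term of the sorted range search is left out.

```python
from math import log2


def costs(B, R, D, C, H):
    """Return {file type: {operation: estimated time}} from the slide formulas."""
    scan = B * D + B * R * C
    sorted_search = D * log2(B) + C * log2(R)
    hash_search = H + D + 0.5 * R * C
    return {
        "heap": {
            "scan": scan,
            "search_eq": 0.5 * scan,
            "search_range": scan,
            "insert": 2 * D + C,
            "delete": C + D,
        },
        "sorted": {
            "scan": scan,
            "search_eq": sorted_search,
            "search_range": sorted_search,  # plus the '#' term, omitted here
            "insert": sorted_search + scan,
            "delete": sorted_search + scan,
        },
        "hashed": {
            "scan": 1.25 * scan,
            "search_eq": hash_search,
            "search_range": 1.25 * scan,
            "insert": hash_search + C + D,
            "delete": hash_search + C + D,
        },
    }


# Assumed example file: 1,024 pages of 64 records each; times in ms.
table = costs(B=1024, R=64, D=25.0, C=1.0, H=1.0)
for file_type, ops in table.items():
    print(file_type, {op: round(t, 1) for op, t in ops.items()})
```

Under these assumed values an equality search costs 45,568 ms on the heap, 256 ms on the sorted file, and 58 ms on the hashed file, while an insert into the sorted file costs about as much as a full scan.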

11

What’s the best structure if:

You’re Amazon.com. Lots of equality lookups, some bulk insertions.

You’re United. Lots of range lookups.

You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

12

What is stored in the index?

k key; k* data entry
r1 = (Malone, Karl, 123, 13, 4)
r2 = (Malone, Moses, 456, 16, 5)

k* = data            k* = r1
k* = <k, rid>        k* = <Malone, r1>
k* = <k, rid list>   k* = <Malone, (r1, r2)>
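
The three kinds of data entry can be written out for the two example records. A small sketch: the variable names alt1, alt2, alt3 are made up for illustration, and the first field of each record is taken as the search key k.

```python
from collections import defaultdict

# rid -> record; the rids "r1" and "r2" follow the slide's naming.
records = {
    "r1": ("Malone", "Karl", 123, 13, 4),
    "r2": ("Malone", "Moses", 456, 16, 5),
}

# Alternative 1: k* is the data record itself.
alt1 = list(records.values())

# Alternative 2: k* = <k, rid>, one entry per record.
alt2 = [(rec[0], rid) for rid, rec in records.items()]

# Alternative 3: k* = <k, rid list>, one entry per distinct key value.
by_key = defaultdict(list)
for rid, rec in records.items():
    by_key[rec[0]].append(rid)
alt3 = list(by_key.items())

print(alt2)  # [('Malone', 'r1'), ('Malone', 'r2')]
print(alt3)  # [('Malone', ['r1', 'r2'])]
```

Alternative 3 stores each key once, which saves space when keys repeat, at the price of variable-length entries.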

13

Clustered Indices

Order data entries in a similar way to data records on disk

Only one clustered index per table

[Figure: two indexes, each drawn as Index, Data entries, and Data Records]

14

Sparse and Dense Indices

Dense - one entry per record (1-1)

Sparse - one entry per page
  Clustered, therefore only one per table

Inverted on a field
  Dense secondary index

Fully Inverted
  All fields have index

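[Figure: sparse index on name (Baker, Hawkins, Payton) and dense index on number (4, 5, 7, 7, 9, 12, 13, 14) over the data records (Baker, 4), (Ellis, 14), (Foster, 7), (Hawkins, 9), (Keefe, 5), (Malone, 12), (Payton, 7), (Stockton, 13)]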
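
A sketch of how the two kinds of lookup differ, using the records from the figure. The grouping of records into three pages is an assumption read off the figure residue, and the function names are hypothetical. The sparse index holds one key per page and relies on the file being sorted by name. The dense index holds one entry per record and works even though the file is not sorted by number.

```python
from bisect import bisect_left, bisect_right

# Data file sorted (clustered) by name; page boundaries are assumed.
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: the first name on each page.
sparse_keys = [page[0][0] for page in pages]


def sparse_lookup(name):
    """Pick the one page that could hold `name`, then search inside it."""
    i = bisect_right(sparse_keys, name) - 1
    if i < 0:
        return None
    for n, number in pages[i]:
        if n == name:
            return number
    return None


# Dense index on number: one <key, rid> entry per record, rid = (page, slot).
dense = sorted(
    (number, (p, s))
    for p, page in enumerate(pages)
    for s, (_, number) in enumerate(page)
)
dense_keys = [k for k, _ in dense]


def dense_lookup(number):
    """Return the names of all records whose number equals `number`."""
    lo, hi = bisect_left(dense_keys, number), bisect_right(dense_keys, number)
    return [pages[p][s][0] for _, (p, s) in dense[lo:hi]]


print(sparse_lookup("Keefe"))  # 5
print(dense_lookup(7))         # ['Foster', 'Payton']
```

The two matches for 7 live on different pages, which is why a dense secondary index can cost one page request per match.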

15

Primary and Secondary Indices

Primary Index is over the Primary Key
Primary stores data entry as records
Primary has no duplicates
Should only be one

Secondary stores as <k, rid> or <k, rid list>

16

B-Trees

B is for Balanced (that’s good enough for me)

B-Tree
  Each node has d items, at most d+1 children
  Balanced tree

B+-Tree
  Data at leaves
  Leaves doubly-linked

17

A B+-Tree

[Figure: example B+-tree. Root keys: 20 40 60 80. Index-level keys shown: 6 15 30 98. Leaf entries shown: 1* 2* 3* 6* 9* 18* 19* 24* 29* ... 99*]

Keys are at leaves
Not all nodes / leaves are full
  Common impls keep 50% minimum occupancy
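
A sketch of B+-tree search on a small hand-built tree. The tree below is hypothetical (two levels, keys borrowed from the figure), and insertion with node splitting is not shown. An equality search descends from the root to one leaf. A range search finds the first leaf and then follows the sibling links, so it never goes back up the tree. Only the right-pointing link is modeled here.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # right sibling


class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # d separator keys
        self.children = children  # d + 1 children


def find_leaf(node, key):
    """Descend from the root to the leaf that could contain `key`."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_search(root, lo, hi):
    """Return all keys k with lo <= k <= hi, walking the leaf chain."""
    leaf = find_leaf(root, lo)
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out


leaves = [Leaf([1, 2, 3]), Leaf([6, 9]), Leaf([18, 19]), Leaf([24, 29])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([6, 15, 20], leaves)

print(find_leaf(root, 9).keys)    # [6, 9]
print(range_search(root, 3, 19))  # [3, 6, 9, 18, 19]
```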

18

B+-Tree Costs

Assume: d == R
Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #
Insert     RCD lg B
Delete     RCD lg B

Some extra work to keep balance

19

Summary + B+-Tree costs

File Type  Scan            Search =         Search <>            Insert           Delete
Heap       BD + BRC        ½(BD + BRC)      BD + BRC             2D + C           C + D
Sorted     BD + BRC        D lg B + C lg R  D lg B + C lg R + #  Srch + BD + BRC  Srch + BD + BRC
Hashed     1.25(BD + BRC)  H + D + ½RC      1.25(BD + BRC)       Srch + C + D     Srch + C + D
B+-Tree    BD + BRC        D lg B + C lg R  D lg B + C lg R + #  RCD lg B         RCD lg B

20

ISAM Trees

Similar to B+-Tree
Not balanced, uses chaining
  Faster Insert / Delete, slower Search
Internal nodes are static

Good for static DBs and data warehouses
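
A simplified sketch of the ISAM idea: the index is built once from sorted data and never changes, and later inserts go to overflow chains. The class name ISAMFile and the single-level index are assumptions for illustration. As a further simplification, every later insert goes straight to the overflow chain, even when the primary page has room.

```python
from bisect import bisect_right


class ISAMFile:
    def __init__(self, sorted_keys, page_size):
        # Primary pages are laid out once, from already sorted keys.
        self.primary = [
            sorted_keys[i:i + page_size]
            for i in range(0, len(sorted_keys), page_size)
        ]
        # Static index: first key of each primary page, fixed after the build.
        self.index = [page[0] for page in self.primary]
        # One overflow chain per primary page.
        self.overflow = [[] for _ in self.primary]

    def _page_no(self, key):
        return max(bisect_right(self.index, key) - 1, 0)

    def insert(self, key):
        # No page split and no index update: just chain the new key.
        self.overflow[self._page_no(key)].append(key)

    def search(self, key):
        p = self._page_no(key)
        return key in self.primary[p] or key in self.overflow[p]


f = ISAMFile([10, 20, 30, 40, 50, 60], page_size=2)
f.insert(23)
f.insert(27)
print(f.index)        # [10, 30, 50]
print(f.overflow[0])  # [23, 27]
print(f.search(23))   # True
```

Inserts are cheap because nothing is rebalanced, but each overflow chain is unsorted and grows without bound, so searches slow down as the file drifts from its initial state.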

21

Sparse and Clustered Indices

Remember that bit about only one clustered index per table?

Only one clustered index per table
Therefore, only one index has values that can be read sequentially without lots of page requests

22

How many locks do you need to...

Insert a new item into DB
  Unsorted?
  Sorted?
  Hash?
  B+-Tree?
  ISAM?

23

Unstructured Text

Database => structured data
  Schemas
  Tables
  OLTP

Information Retrieval => unstructured

So they don’t have much to do with one another, right?

24

IR Queries

Karl AND Malone
  SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D

“Karl Malone”
  SELECT Docs(D) WHERE “Karl Malone” in D
  Does this mean “X Y” is a single term?

Karl NEAR/2 Malone
  SELECT Docs(D) WHERE …uh…?

25

Structuring Text

Position is structure!

Karl:   par 1, sen 1, word 4

Malone: par 1, sen 1, word 5
        par 2, sen 1, word 2
        par 3, sen 1, word 7, zone quote

Admiral KO’d by Jazz power-forward; Malone fined and suspended.

SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.

Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.

"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.

Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99.

Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.

The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.
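
The (paragraph, sentence, word) positions listed for Karl and Malone can be computed mechanically. A rough sketch, assuming a naive sentence splitter and tokenizer, and using shortened excerpts of the first three body paragraphs (the headline is not counted). The "zone quote" attribute is not modeled, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Shortened excerpts of the first three body paragraphs.
paragraphs = [
    "SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow "
    "blow was unintentional. Robinson doubts the blow was intended to hurt him.",
    "Nevertheless, Malone on Friday was suspended without pay for one game.",
    '"While I do not believe that Malone intentionally elbowed Robinson," '
    "Thorn said.",
]


def build_positional_index(paragraphs):
    """Map each word to a list of (paragraph, sentence, word) positions."""
    index = defaultdict(list)
    for p, par in enumerate(paragraphs, start=1):
        sentences = re.split(r"(?<=[.!?])\s+", par)  # naive sentence split
        for s, sen in enumerate(sentences, start=1):
            words = re.findall(r"[A-Za-z']+", sen)
            for w, word in enumerate(words, start=1):
                index[word].append((p, s, w))
    return index


def near(index, a, b, k):
    """Pairs of positions where a and b share a sentence, at most k words apart."""
    return [
        (pa, pb)
        for pa in index[a]
        for pb in index[b]
        if pa[:2] == pb[:2] and abs(pa[2] - pb[2]) <= k
    ]


index = build_positional_index(paragraphs)
print(index["Karl"])    # [(1, 1, 4)]
print(index["Malone"])  # [(1, 1, 5), (2, 1, 2), (3, 1, 7)]
print(near(index, "Karl", "Malone", 2))  # [((1, 1, 4), (1, 1, 5))]
```

Once positions are stored, a query like Karl NEAR/2 Malone becomes a comparison on numbers rather than a scan of the text.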

26

IR Queries in SQL

Query: “Karl Malone”, Robinson
Meaning: Docs w/ “Karl Malone” and Robinson

TextIndex(word: string, doc: int, pos: int)

SELECT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
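
The three-way self-join can be tried out with SQLite from Python. A sketch under assumptions: the sample rows are invented, DISTINCT is added so each document is reported once, and the position test is written as W2.pos = W1.pos + 1 so that "Malone" directly follows "Karl".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")
rows = [
    ("Karl", 1, 4), ("Malone", 1, 5), ("Robinson", 1, 9),  # phrase + Robinson
    ("Malone", 2, 1), ("Karl", 2, 2), ("Robinson", 2, 7),  # words reversed
    ("Karl", 3, 1), ("Malone", 3, 2),                      # no Robinson
]
conn.executemany("INSERT INTO TextIndex VALUES (?, ?, ?)", rows)

query = """
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
"""
matches = [doc for (doc,) in conn.execute(query)]
print(matches)  # [1]
```

Document 2 contains all three words but has "Malone Karl" rather than "Karl Malone", so the position test rejects it. Document 3 has the phrase but no "Robinson".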

27

Indexing Issues in IR

Index method: hash table on word
IR folks think about attributes
IR folks munge attributes
  elbow* => elbow, elbowing, elbowed, etc.
  “to be or not to be” => “”
IR folks create search keys
  Malone => Malone, Stockton, Jazz, Sloan, …
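
Two of these munging steps are easy to sketch: wildcard expansion against the indexed vocabulary, and stop-word removal. The stop-word list and the vocabulary below are tiny assumed examples, and the function names are hypothetical.

```python
# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}


def strip_stop_words(query):
    """Drop stop words; a query made only of stop words becomes empty."""
    return " ".join(w for w in query.lower().split() if w not in STOP_WORDS)


def expand_prefix(pattern, vocabulary):
    """Expand a trailing-* pattern into the matching indexed words."""
    prefix = pattern.rstrip("*")
    return sorted(w for w in vocabulary if w.startswith(prefix))


vocabulary = {"elbow", "elbowing", "elbowed", "elbows", "blow"}

print(expand_prefix("elbow*", vocabulary))
# ['elbow', 'elbowed', 'elbowing', 'elbows']
print(repr(strip_stop_words("to be or not to be")))  # ''
```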

28

IR and DBMSs

IR uses DBMS for low-level storage
  e.g. hash table storage

Hash table lookup is only first step
  Clustering
  Relevance Ranking
  Feedback, Expansion, ...

Full SQL not needed
  Custom optimized DB performs better

29

How AltaVista returns so quickly...

Hash indexes mean lots of page requests if there are lots of matches...

Trick #1: use memory.
Trick #2: threshold (find 10 pages > 75% rel).
Trick #3: hard time limit.
  More users, less CPU time / query
Trick #4: prioritize
  Try to find 10 in memory
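
Tricks #2 through #4 combine naturally into one early-terminating loop. This is a guess at the general shape, not a description of AltaVista's actual code. The function name, parameters, and the injectable clock are assumptions. Candidates are given as two iterables, one for postings already in memory and one for postings that need disk reads.

```python
import time


def search(in_memory, on_disk, score, want=10, min_rel=0.75,
           time_limit=0.05, clock=time.monotonic):
    """Return up to `want` documents scoring above `min_rel`."""
    deadline = clock() + time_limit
    hits = []
    # Trick #4: exhaust the in-memory candidates before touching disk.
    for source in (in_memory, on_disk):
        for doc in source:
            if clock() >= deadline:   # Trick #3: hard time limit
                return hits
            if score(doc) > min_rel:  # Trick #2: relevance threshold
                hits.append(doc)
                if len(hits) == want:
                    return hits       # enough good results: stop early
    return hits


# Toy relevance scores: even-numbered documents are relevant.
scores = {d: (0.9 if d % 2 == 0 else 0.1) for d in range(40)}
top = search(range(20), range(20, 40), scores.get, want=3, time_limit=10)
print(top)  # [0, 2, 4]
```

Because the loop returns as soon as it has enough results, a query with thousands of matches costs no more than one with a handful. Shrinking time_limit as load rises gives the "more users, less CPU time per query" behavior.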

30

Summary

Concerned about B, R, not just n
Hash for equality, B+-Tree for range
One index gives good disk performance
IR uses hash indexing
IR stores term information

Indexing helps performance, but you still need to think about what to index!
