Indexing in DBMSs, Erik Selberg, 590db, 4/29/98


Page 1: Indexing in DBMSs

Indexing in DBMSs

Erik Selberg
590db
4/29/98

Page 2: Indexing in DBMSs

Outline

Motivation
Cost Functions & 521
B+-Trees
ISAM
Unstructured Text & IR
Conclusion

Page 3: Indexing in DBMSs

Motivation

Data stored on disk pages in one way
  O(n) space
Data can be ordered one way (if at all)
  O(log n) or O(1) lookup for one attribute
  O(n) lookup for the rest
Make lookups faster
  Increase space necessary
  What about speed of other operations?

Page 4: Indexing in DBMSs

Cost Functions

B: data pages on disk
R: records per page
  O(n) = O(BR)
D: I/O time (~25ms)
C: CPU time (~1-10ms)
H: Hash function time (~1-10ms)

Page 5: Indexing in DBMSs

DBMS operations

Scan - fetch all records
Search w/ Equality
  Lookups and Modifications
Search w/ Range
Insert
Delete

Bulk operations may be amortized!

Page 6: Indexing in DBMSs

Baseline Storage

Unorganized (heap)
Sorted
  Sorted on one key
Hashed
  static hashing using chaining

Page 7: Indexing in DBMSs

Unorganized Heaps

Scan       BD + BRC
Search =   1/2 (BD + BRC)
Search <>  BD + BRC
Insert     2D + C
Delete     C + D

Challenge: make this worse

Page 8: Indexing in DBMSs

Sorted

Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #   (# = number of matches)
Insert     (D lg B + C lg R) + (BD + BRC)
Delete     (D lg B + C lg R) + (BD + BRC)

Good for range, crappy for rest
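
The D lg B + C lg R search cost can be illustrated with a small sketch: binary search over pages (one simulated I/O per page touched), then binary search inside the page. The in-memory list of pages and the function name `search_sorted_file` are assumptions made for illustration, not part of the original slides.

```python
from bisect import bisect_left

def search_sorted_file(pages, key):
    """Equality search on a sorted file; returns (found, pages_read)."""
    lo, hi = 0, len(pages) - 1
    pages_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]              # simulated page fetch, cost D
        pages_read += 1
        if key < page[0]:
            hi = mid - 1
        elif key > page[-1]:
            lo = mid + 1
        else:
            i = bisect_left(page, key)  # in-page binary search, cost C lg R
            return (i < len(page) and page[i] == key), pages_read
    return False, pages_read

# Hypothetical file: B = 8 pages, R = 4 records per page, even keys 0..62.
pages = [[2 * (p * 4 + r) for r in range(4)] for p in range(8)]
found, reads = search_sorted_file(pages, 42)
print(found, reads)
```

With 8 pages the loop never touches more than 4 of them, while an unsorted heap would read 4 pages on average and 8 in the worst case.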

Page 9: Indexing in DBMSs

Static Hash w/ Chaining

Scan       1.25(BD + BRC)
Search =   H + D + 1/2RC
Search <>  1.25(BD + BRC)
Insert     (H + D + 1/2RC) + (C + D)
Delete     (H + D + 1/2RC) + (C + D)

Need to grow and shrink hash table
Bad hashes hose you
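
A minimal sketch of static hashing with overflow chains, to show why a bad hash hurts: the number of buckets is fixed, so a skewed key distribution turns a one-page lookup into a long chain walk. The class `StaticHashFile`, its parameters, and the use of Python's built-in `hash` are illustrative assumptions.

```python
class StaticHashFile:
    """Fixed number of buckets; each bucket is a chain of fixed-size pages."""

    def __init__(self, num_buckets, page_capacity):
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity
        self.buckets = [[[]] for _ in range(num_buckets)]  # one primary page each

    def _chain(self, key):
        return self.buckets[hash(key) % self.num_buckets]

    def insert(self, key, value):
        chain = self._chain(key)
        if len(chain[-1]) >= self.page_capacity:
            chain.append([])            # last page is full: add an overflow page
        chain[-1].append((key, value))

    def search(self, key):
        """Returns (value or None, pages_read)."""
        pages_read = 0
        for page in self._chain(key):
            pages_read += 1             # each page in the chain costs one I/O
            for k, v in page:
                if k == key:
                    return v, pages_read
        return None, pages_read

good = StaticHashFile(num_buckets=4, page_capacity=2)
for k in range(16):                     # keys spread evenly over the buckets
    good.insert(k, str(k))

bad = StaticHashFile(num_buckets=4, page_capacity=2)
for k in range(0, 64, 4):               # every key lands in bucket 0
    bad.insert(k, str(k))

print(good.search(15), bad.search(60))
```

Both files hold 16 records, but the skewed one needs 8 page reads to reach its last key.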

Page 10: Indexing in DBMSs

Cost summary

File Type  Scan            Search =         Search <>            Insert           Delete
Heap       BD + BRC        ½(BD + BRC)      BD + BRC             2D + C           C + D
Sorted     BD + BRC        D lg B + C lg R  D lg B + C lg R + #  Srch + BD + BRC  Srch + BD + BRC
Hashed     1.25(BD + BRC)  H + D + ½RC      1.25(BD + BRC)       Srch + C + D     Srch + C + D
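
To get a feel for the magnitudes, the table can be evaluated for concrete numbers. This is a sketch under assumptions: B = 1000 and R = 100 are made-up sizes, D = 25 comes from the cost-function slide, C = H = 1 takes the low end of the slide's range, and the `matches` parameter is a stand-in for the `#` term whose unit the slides do not specify.

```python
from math import log2

def file_costs(B, R, D, C, H, matches=0):
    """Evaluate the summary-table formulas; all times share one unit (ms here)."""
    scan = B * D + B * R * C
    sorted_search = D * log2(B) + C * log2(R)
    hash_search = H + D + 0.5 * R * C
    return {
        "heap": {"scan": scan, "search_eq": 0.5 * scan, "search_range": scan,
                 "insert": 2 * D + C, "delete": C + D},
        "sorted": {"scan": scan, "search_eq": sorted_search,
                   "search_range": sorted_search + matches,  # '#' term (assumed unit)
                   "insert": sorted_search + scan,
                   "delete": sorted_search + scan},
        "hashed": {"scan": 1.25 * scan, "search_eq": hash_search,
                   "search_range": 1.25 * scan,
                   "insert": hash_search + C + D,
                   "delete": hash_search + C + D},
    }

costs = file_costs(B=1000, R=100, D=25, C=1, H=1)
for name, row in costs.items():
    print(name, {op: round(t, 1) for op, t in row.items()})
```

Under these assumed numbers, equality search is cheapest on the hashed file, then the sorted file, then the heap, while insert is cheapest on the heap.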

Page 11: Indexing in DBMSs

What’s the best structure if:

You’re Amazon.com. Lots of equality lookups, some bulk insertions.

You’re United. Lots of range lookups.

You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

Page 12: Indexing in DBMSs

What is stored in the index?

k: key; k*: data entry
r1 = (Malone, Karl, 123, 13, 4)
r2 = (Malone, Moses, 456, 16, 5)

k* = data             k* = r1
k* = <k, rid>         k* = <Malone, r1>
k* = <k, rid list>    k* = <Malone, (r1, r2)>
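
The three alternatives for a data entry k* can be written out as plain data structures. This is only a sketch: the records are the two from the slide, while the dictionary keyed by record id and the variable names are assumptions for illustration.

```python
# Records from the slide, stored under their record ids (rids).
records = {
    "r1": ("Malone", "Karl", 123, 13, 4),
    "r2": ("Malone", "Moses", 456, 16, 5),
}

# Alternative 1: the data entry is the record itself.
entries_as_data = list(records.values())

# Alternative 2: one <k, rid> pair per record.
entries_as_pairs = [(rec[0], rid) for rid, rec in records.items()]

# Alternative 3: one <k, rid list> entry per distinct key.
entries_as_lists = {}
for rid, rec in records.items():
    entries_as_lists.setdefault(rec[0], []).append(rid)

print(entries_as_pairs)
print(entries_as_lists)
```

With duplicate keys such as Malone, the rid-list form stores the key once, at the price of variable-length entries.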

Page 13: Indexing in DBMSs

Clustered Indices

Order data entries in a similar way to data records on disk
Only one clustered index per table

[Figure: two indexes (labelled Index, Index), their data entries, and the data records]

Page 14: Indexing in DBMSs

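Sparse and Dense Indices

Dense - one entry per record (1-1)
Sparse - one entry per page
  Clustered, therefore only one per table
Inverted on a field
  Dense secondary index
Fully Inverted
  All fields have index

[Figure: a sparse index on name (Baker, Hawkins, Payton) with one entry per data page; data pages hold (Baker, 4), (Ellis, 14), (Foster, 7) | (Hawkins, 9), (Keefe, 5) | (Malone, 12) and (Payton, 7), (Stockton, 13); a dense index on the numeric field with one entry for each of the eight records]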
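
A small sketch of the two lookups, using the names and numbers from the slide's figure. The grouping of records into three pages is an assumption based on the three sparse entries (Baker, Hawkins, Payton), and the function names are made up.

```python
from bisect import bisect_right

# Data pages, sorted by name (the clustering order).
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: one entry per page (the first name on that page).
sparse = [page[0][0] for page in pages]

# Dense secondary index on the number: one entry per record.
dense = sorted((num, (p, slot))
               for p, page in enumerate(pages)
               for slot, (_, num) in enumerate(page))

def lookup_by_name(name):
    p = bisect_right(sparse, name) - 1   # last page whose first name <= name
    if p < 0:
        return None
    for n, num in pages[p]:              # then scan that single page
        if n == name:
            return num
    return None

def lookup_by_number(num):
    return [pages[p][slot][0] for n, (p, slot) in dense if n == num]

print(sparse, lookup_by_name("Malone"), lookup_by_number(7))
```

The sparse index only works because the pages are sorted by name; the index on the number has to be dense because the records are not ordered by it.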

Page 15: Indexing in DBMSs

Primary and Secondary Indices

Primary Index is over the Primary Key
Primary stores data entry as records
Primary has no duplicates
Should only be one
Secondary stores as <k, rid> or <k, rid list>

Page 16: Indexing in DBMSs

B-Trees

B is for Balanced (that’s good enough for me)
B-Tree
  Each node has d items, at most d+1 children
  Balanced tree
B+-Tree
  Data at leaves
  Leaves doubly-linked

Page 17: Indexing in DBMSs

A B+-Tree

[Figure: a B+-tree; keys shown in index nodes: 20, 40, 60, 80 and 6, 15, 30, 98; leaf entries include 1*, 2*, 3*, 6*, 9*, ..., 18*, 19*, 24*, 29*, 99*]

Keys are at leaves
Not all nodes / leaves are full
  Common impls keep 50% minimum occupancy
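
A minimal, read-only sketch of how a B+-tree answers equality and range queries: descend through the index nodes to one leaf, then follow the leaf links. The tree below is hand-built and hypothetical (it reuses a few leaf keys from the figure but not its exact shape), it uses only forward links although the slides describe doubly-linked leaves, and it omits insertion and rebalancing.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # link to the right sibling

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # d separator keys
        self.children = children  # d + 1 children

def find_leaf(node, key):
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def range_scan(root, lo, hi):
    leaf = find_leaf(root, lo)    # one root-to-leaf descent
    out = []
    while leaf is not None:       # then walk the leaf chain
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

leaves = [Leaf([1, 2, 3]), Leaf([6, 9]), Leaf([18, 19]), Leaf([24, 29])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([6, 18, 24], leaves)

print(find_leaf(root, 9).keys, range_scan(root, 3, 19))
```

The range scan pays for one descent and then reads only the leaves that overlap the range, which is where the extra `#` term in the range-search cost comes from.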

Page 18: Indexing in DBMSs

B+-Tree Costs

Assume: d == R
Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #
Insert     RCD lg B
Delete     RCD lg B

Some extra work to keep balance

Page 19: Indexing in DBMSs

Summary + B+-Tree costs

File Type  Scan            Search =         Search <>            Insert           Delete
Heap       BD + BRC        ½(BD + BRC)      BD + BRC             2D + C           C + D
Sorted     BD + BRC        D lg B + C lg R  D lg B + C lg R + #  Srch + BD + BRC  Srch + BD + BRC
Hashed     1.25(BD + BRC)  H + D + ½RC      1.25(BD + BRC)       Srch + C + D     Srch + C + D
B+-Tree    BD + BRC        D lg B + C lg R  D lg B + C lg R + #  RCD lg B         RCD lg B

Page 20: Indexing in DBMSs

ISAM Trees

Similar to B+-Tree
Not balanced, uses chaining
Faster Insert / Delete, slower Search
Internal nodes are static
Good for static DBs and data warehouses
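
A rough sketch of the ISAM idea: the index is built once over the sorted primary pages and never changes, so inserts that do not fit go into an overflow chain hanging off the page. The class `ISAMFile` is hypothetical, a single index level stands in for the tree, and each overflow chain is a flat list rather than a chain of pages.

```python
from bisect import bisect_right

class ISAMFile:
    def __init__(self, sorted_keys, page_capacity):
        self.capacity = page_capacity
        self.primary = [sorted_keys[i:i + page_capacity]
                        for i in range(0, len(sorted_keys), page_capacity)]
        self.index = [page[0] for page in self.primary]   # static after build
        self.overflow = [[] for _ in self.primary]        # one chain per page

    def _page_no(self, key):
        return max(bisect_right(self.index, key) - 1, 0)

    def insert(self, key):
        p = self._page_no(key)
        if len(self.primary[p]) < self.capacity:
            self.primary[p].append(key)
        else:
            self.overflow[p].append(key)   # no split, no index update

    def search(self, key):
        p = self._page_no(key)
        return key in self.primary[p] or key in self.overflow[p]

isam = ISAMFile([10, 20, 30, 40, 50, 60], page_capacity=2)
for k in (35, 36, 37):
    isam.insert(k)

print(isam.index, isam.overflow, isam.search(36))
```

Inserts are cheap because nothing is rebalanced, but every key added to a chain makes later searches on that page slower, which is why the structure suits mostly static data.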

Page 21: Indexing in DBMSs

Sparse and Clustered Indices

Remember that bit about only one clustered index per table?
Only one clustered index per table
Therefore, only one index has values that can be read sequentially without lots of page requests

Page 22: Indexing in DBMSs

How many locks do you need to...

Insert a new item into DB
  Unsorted?
  Sorted?
  Hash?
  B+-Tree?
  ISAM?

Page 23: Indexing in DBMSs

Unstructured Text

Database => structured data
  Schemas
  Tables
  OLTP
Information Retrieval => unstructured
So they don’t have much to do with one another, right?

Page 24: Indexing in DBMSs

IR Queries

Karl AND Malone
  SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D
“Karl Malone”
  SELECT Docs(D) WHERE “Karl Malone” in D
  Does this mean “X Y” is a single term?
Karl NEAR/2 Malone
  SELECT Docs(D) WHERE …uh…?

Page 25: Indexing in DBMSs

Structuring Text

Position is structure!

Karl: par 1, sen 1, word 4
Malone: par 1, sen 1, word 5
        par 2, sen 1, word 2
        par 3, sen 1, word 7, zone quote

Admiral KO’d by Jazz power-forward; Malone fined and suspended.

SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.

Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.

"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.

Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99.

Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.

The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.
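
The (paragraph, sentence, word) positions listed on this slide can be reproduced with a small positional index. This is a sketch with assumptions: the paragraphs are abridged excerpts of the article above, the sentence splitter and tokenizer are deliberately naive regular expressions, and the `near` helper is a hypothetical way to answer a query like Karl NEAR/2 Malone.

```python
import re
from collections import defaultdict

# Abridged excerpts of the first three paragraphs of the article.
paragraphs = [
    "SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow "
    "blow that knocked Robinson unconscious was unintentional.",
    "Nevertheless, Malone on Friday was suspended without pay for one game.",
    '"While I do not believe that Malone intentionally elbowed Robinson," '
    "Thorn said.",
]

def build_positional_index(paragraphs):
    index = defaultdict(list)
    for p, paragraph in enumerate(paragraphs, start=1):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for s, sentence in enumerate(sentences, start=1):
            words = re.findall(r"[\w$']+", sentence)
            for w, word in enumerate(words, start=1):
                index[word].append((p, s, w))
    return index

def near(index, a, b, k):
    """True if a and b occur within k words of each other in one sentence."""
    return any(pa == pb and sa == sb and abs(wa - wb) <= k
               for pa, sa, wa in index.get(a, [])
               for pb, sb, wb in index.get(b, []))

index = build_positional_index(paragraphs)
print(index["Karl"], index["Malone"], near(index, "Karl", "Malone", 2))
```

Once positions are stored, phrase and proximity queries reduce to comparisons on those position tuples.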

Page 26: Indexing in DBMSs

IR Queries in SQL

Query: “Karl Malone”, Robinson
Meaning: Docs w/ “Karl Malone” and Robinson

TextIndex(word: string, doc: int, pos: int)

SELECT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE (W1.doc = W2.doc AND W2.doc = W3.doc)
  AND (W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson')
  AND W2.pos = W1.pos + 1
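
The query can be tried against an in-memory SQLite database. This is a sketch under assumptions: the three tiny documents are invented test data, positions are 1-based word offsets, and the adjacency test is written so that Malone immediately follows Karl (note the direction of the position comparison).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")

# Invented documents: only doc 1 has the phrase "Karl Malone" plus "Robinson".
docs = {
    1: "Karl Malone elbowed Robinson",
    2: "Malone and Karl met Robinson",
    3: "Karl Malone scored",
}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO TextIndex VALUES (?, ?, ?)",
        [(word, doc_id, pos) for pos, word in enumerate(text.split(), start=1)],
    )

query = """
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)
```

Each extra query term adds another self-join on TextIndex, which hints at why IR systems prefer a custom index over general SQL for this kind of query.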

Page 27: Indexing in DBMSs

Indexing Issues in IR

Index method: hash table on word
IR folks think about attributes
IR folks munge attributes
  elbow* => elbow, elbowing, elbowed, etc.
  “to be or not to be” => “”
IR folks create search keys
  Malone => Malone, Stockton, Jazz, Sloan,
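
Two of the attribute-munging steps above can be sketched in a few lines: stop-word removal (which is how "to be or not to be" collapses to an empty query) and wildcard expansion against the indexed vocabulary. The stop-word list, the vocabulary, and the function names are assumptions chosen for illustration.

```python
# Hypothetical stop-word list; real systems use much longer ones.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "and"}

def remove_stopwords(query):
    return " ".join(w for w in query.lower().split() if w not in STOPWORDS)

def expand_prefix(pattern, vocabulary):
    """Expand a trailing-* pattern into the matching indexed words."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(prefix))
    return [pattern] if pattern in vocabulary else []

vocabulary = {"elbow", "elbowing", "elbowed", "elbows", "blow", "malone"}

print(repr(remove_stopwords("to be or not to be")))
print(expand_prefix("elbow*", vocabulary))
```

Both steps change what the hash index is asked for: the first drops terms entirely, the second turns one query term into several lookups.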

Page 28: Indexing in DBMSs

IR and DBMSs

IR uses DBMS for low-level storage
  e.g. hash table storage
Hash table lookup is only first step
  Clustering
  Relevance Ranking
  Feedback, Expansion, ...
Full SQL not needed
  Custom optimized DB performs better

Page 29: Indexing in DBMSs

How AltaVista returns so quickly...

Hash indexes mean lots of page requests if there are lots of matches...
Trick #1: use memory.
Trick #2: threshold (find 10 pages > 75% rel).
Trick #3: hard time limit.
  More users, less CPU time / query
Trick #4: prioritize
  Try to find 10 in memory
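
The tricks combine naturally into one early-terminating loop. The sketch below is an illustration only, not a description of AltaVista's actual implementation: the function name, the scoring callback, and the split into in-memory and on-disk candidates are all assumptions.

```python
import time
from itertools import chain

def search_with_limits(in_memory, on_disk, score,
                       want=10, threshold=0.75, time_limit_s=0.05):
    deadline = time.monotonic() + time_limit_s      # trick 3: hard time limit
    hits = []
    for doc in chain(in_memory, on_disk):           # tricks 1 and 4: memory first
        if time.monotonic() > deadline:
            break
        if score(doc) > threshold:                  # trick 2: relevance threshold
            hits.append(doc)
            if len(hits) == want:                   # stop at the first `want` hits
                break
    return hits

# Hypothetical scoring: even document ids are relevant, odd ones are not.
def score(doc):
    return 0.9 if doc % 2 == 0 else 0.1

hits = search_with_limits(range(0, 30), range(30, 1000), score, time_limit_s=1.0)
print(hits)
```

In this run all ten hits come from the in-memory candidates, so the on-disk candidates are never touched.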

Page 30: Indexing in DBMSs

Summary

Concerned about B, R, not just n
Hash for equality, B+-Tree for range
One index gives good disk performance
IR uses hash indexing
IR stores term information

Indexing helps performance, but you still need to think about what to index!