

1

Indexing in DBMSs

Erik Selberg
590db
4/29/98

2

Outline
- Motivation
- Cost Functions & 521
- B+-Trees
- ISAM
- Unstructured Text & IR
- Conclusion

3

Motivation
- Data stored on disk pages in one way
  - O(n) space
- Data can be ordered one way (if at all)
  - O(log n) or O(1) lookup for one attribute
  - O(n) lookup for the rest
- Make lookups faster
  - Increase space necessary
  - What about speed of other operations?

4

Cost Functions
- B data pages on disk
- R records per page
  - O(n) = O(BR)
- D I/O time (~25ms)
- C CPU time (~1-10ms)
- H Hash function time (~1-10ms)

5

DBMS operations
- Scan - fetch all records
- Search w/ Equality
  - Lookups and Modifications
- Search w/ Range
- Insert
- Delete

Bulk operations may be amortized!

6

Baseline Storage
- Unorganized (heap)
- Sorted
  - Sorted on one key
- Hashed
  - static hashing using chaining

7

Unorganized Heaps
- Scan        BD + BRC
- Search =    1/2 (BD + BRC)
- Search <>   BD + BRC
- Insert      2D + C
- Delete      C + D

Challenge: make this worse

8

Sorted
- Scan        BD + BRC
- Search =    D lg B + C lg R
- Search <>   D lg B + C lg R + #
- Insert      (D lg B + C lg R) + (BD + BRC)
- Delete      (D lg B + C lg R) + (BD + BRC)

Good for range, crappy for rest

9

Static Hash w/ Chaining
- Scan        1.25(BD + BRC)
- Search =    H + D + 1/2RC
- Search <>   1.25(BD + BRC)
- Insert      (H + D + 1/2RC) + (C + D)
- Delete      (H + D + 1/2RC) + (C + D)

- Need to grow and shrink hash table
- Bad hashes hose you

10

Cost summary

File Type | Scan           | Search =        | Search <>           | Insert          | Delete
Heap      | BD + BRC       | ½(BD + BRC)     | BD + BRC            | 2D + C          | C + D
Sorted    | BD + BRC       | D lg B + C lg R | D lg B + C lg R + # | Srch + BD + BRC | Srch + BD + BRC
Hashed    | 1.25(BD + BRC) | H + D + ½RC     | 1.25(BD + BRC)      | Srch + C + D    | Srch + C + D
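The cost summary can be turned into a small calculator for comparing file types on a given workload. The following is a minimal Python sketch, not part of the original slides: the function name `file_costs`, the parameter `match_cost` (standing in for the table's `#` term), and the sample values of B, R, D, C, and H are all assumptions chosen for illustration.

```python
from math import log2


def file_costs(B, R, D, C, H, match_cost=0.0):
    """Estimated cost per operation for each file type, following the cost summary.

    B: data pages, R: records per page, D: I/O time, C: CPU time,
    H: hash function time. match_cost is an assumed stand-in for the '#' term.
    """
    scan = B * D + B * R * C
    sorted_search = D * log2(B) + C * log2(R)   # "Srch" for a sorted file
    hashed_search = H + D + 0.5 * R * C         # "Srch" for a hashed file
    return {
        "heap": {
            "scan": scan,
            "search_eq": 0.5 * scan,
            "search_range": scan,
            "insert": 2 * D + C,
            "delete": C + D,
        },
        "sorted": {
            "scan": scan,
            "search_eq": sorted_search,
            "search_range": sorted_search + match_cost,
            "insert": sorted_search + scan,
            "delete": sorted_search + scan,
        },
        "hashed": {
            "scan": 1.25 * scan,
            "search_eq": hashed_search,
            "search_range": 1.25 * scan,
            "insert": hashed_search + C + D,
            "delete": hashed_search + C + D,
        },
    }


# Illustrative values only (times in ms): 1024 pages, 64 records per page.
costs = file_costs(B=1024, R=64, D=25.0, C=1.0, H=1.0)
for file_type, ops in costs.items():
    print(file_type, ops)
```

Plugging in different B and R values shows why the summary cares about pages and records per page separately rather than just n.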

11

What’s the best structure if:
- You’re Amazon.com. Lots of equality lookups, some bulk insertions.
- You’re United. Lots of range lookups.
- You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

12

What is stored in the index?
- k key; k* data entry
- r1 = (Malone, Karl, 123, 13, 4)
- r2 = (Malone, Moses, 456, 16, 5)

- k* = data             e.g. k* = r1
- k* = <k, rid>         e.g. k* = <Malone, r1>
- k* = <k, rid list>    e.g. k* = <Malone, (r1, r2)>
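The three kinds of data entry can be written out concretely. This is a rough Python sketch using the two records from the slide; the string rids `"r1"` and `"r2"`, the dictionary standing in for the data file, and the helper `key_of` are assumptions made for illustration.

```python
from collections import defaultdict

# The two records from the slide; the search key k is the first field.
r1 = ("Malone", "Karl", 123, 13, 4)
r2 = ("Malone", "Moses", 456, 16, 5)

# Assumed stand-in for the data file: rid -> record.
data_file = {"r1": r1, "r2": r2}


def key_of(record):
    """Search key k of a record (here, the first field)."""
    return record[0]


# k* = data: the index stores the records themselves, ordered by key.
alt1 = sorted(data_file.values(), key=key_of)

# k* = <k, rid>: one entry per record, pointing at the record.
alt2 = sorted((key_of(rec), rid) for rid, rec in data_file.items())

# k* = <k, rid list>: one entry per distinct key, with all matching rids.
alt3 = defaultdict(list)
for rid, rec in data_file.items():
    alt3[key_of(rec)].append(rid)

print(alt1)
print(alt2)
print(dict(alt3))
```

The third form stores the key `Malone` once, which is the space saving that motivates rid lists when keys repeat.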

13

Clustered Indices
- Order data entries in a similar way to data records on disk
- Only one clustered index per table

[Figure: index, data entries, data records]

14

Sparse and Dense Indices
- Dense - one entry per record (1-1)
- Sparse - one entry per page
  - Clustered, therefore only one per table
- Inverted on a field
  - Dense secondary index
- Fully Inverted
  - All fields have index

[Figure: a sparse index on name (Baker, Hawkins, Payton) over the data records (Baker, 4; Ellis, 14; Foster, 7; Hawkins, 9; Keefe, 5; Malone, 12; Payton, 7; Stockton, 13), and a dense index on the numeric field (4, 5, 7, 7, 9, 12, 13, 14)]
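The difference between the two index kinds shows up in how a lookup proceeds. Below is a small Python sketch using the names and numbers from the figure; the grouping of records into three pages, the `(page, slot)` record ids, and the function names are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# Data pages sorted by name (values from the figure; page boundaries assumed).
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: one entry per page, holding the first name on that page.
sparse = [page[0][0] for page in pages]

# Dense secondary index on the number: one (number, (page, slot)) entry per record.
dense = sorted(
    (num, (p, s))
    for p, page in enumerate(pages)
    for s, (_, num) in enumerate(page)
)
dense_keys = [num for num, _ in dense]


def sparse_lookup(name):
    """Pick the page via the sparse index, then search inside that page."""
    p = bisect_right(sparse, name) - 1
    if p < 0:
        return None
    for rec in pages[p]:
        if rec[0] == name:
            return rec
    return None


def dense_lookup(number):
    """Follow every dense-index entry with the given number to its record."""
    i = bisect_left(dense_keys, number)
    found = []
    while i < len(dense) and dense[i][0] == number:
        p, s = dense[i][1]
        found.append(pages[p][s])
        i += 1
    return found


print(sparse_lookup("Malone"))
print(dense_lookup(7))
```

The sparse index only works because the pages are sorted by name, which is the reason it has to be the clustered index.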

15

Primary and Secondary Indices
- Primary Index is over the Primary Key
- Primary stores data entry as records
- Primary has no duplicates
- Should only be one
- Secondary stores as <k, rid> or <k, rid list>

16

B-Trees
- B is for Balanced (that’s good enough for me)
- B-Tree
  - Each node has d items, at most d+1 children
  - Balanced tree
- B+-Tree
  - Data at leaves
  - Leaves doubly-linked

17

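A B+-Tree

[Figure: example B+-tree. Root keys: 20, 40, 60, 80. Second-level keys shown: 6, 15; 30; ...; 98. Leaf entries shown: 1* 2* 3*; 6* 9*; 18* 19*; 24* 29*; ...; 99*]

- Keys are at leaves
- Not all nodes / leaves are full
  - Common impls keep 50% minimum occupancy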
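Equality and range search on such a tree can be sketched in a few lines of Python. This is an illustrative sketch only: the `Leaf` and `Inner` classes, the function names, and the tiny hand-built tree are assumptions (the tree reuses a few keys from the figure but is not a copy of it), leaves carry only a forward link rather than the double links mentioned earlier, and insertion and rebalancing are left out.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted data entries (just the keys here)
        self.next = None    # link to the right sibling leaf


class Inner:
    def __init__(self, keys, children):
        self.keys = keys            # separator keys
        self.children = children    # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Walk from the root to the leaf whose key range covers `key`."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def search_eq(root, key):
    return key in find_leaf(root, key).keys


def search_range(root, lo, hi):
    """Find the leaf for `lo`, then follow sibling links until past `hi`."""
    leaf = find_leaf(root, lo)
    found = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return found
            if k >= lo:
                found.append(k)
        leaf = leaf.next
    return found


# Hand-built example tree (hypothetical).
leaves = [Leaf([1, 2, 3]), Leaf([6, 9]), Leaf([18, 19]), Leaf([24, 29]), Leaf([30, 38])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([20], [Inner([6, 15], leaves[0:3]), Inner([30], leaves[3:5])])

print(search_eq(root, 9))
print(search_range(root, 3, 24))
```

The range search descends the tree once and then reads leaves sequentially, which is why linked leaves matter for range queries.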

18

B+-Tree Costs
- Assume: d == R
- Scan        BD + BRC
- Search =    D lg B + C lg R
- Search <>   D lg B + C lg R + #
- Insert      RCD lg B
- Delete      RCD lg B

Some extra work to keep balance

19

Summary + B+-Tree costs

File Type | Scan           | Search =        | Search <>           | Insert          | Delete
Heap      | BD + BRC       | ½(BD + BRC)     | BD + BRC            | 2D + C          | C + D
Sorted    | BD + BRC       | D lg B + C lg R | D lg B + C lg R + # | Srch + BD + BRC | Srch + BD + BRC
Hashed    | 1.25(BD + BRC) | H + D + ½RC     | 1.25(BD + BRC)      | Srch + C + D    | Srch + C + D
B+-Tree   | BD + BRC       | D lg B + C lg R | D lg B + C lg R + # | RCD lg B        | RCD lg B

20

ISAM Trees
- Similar to B+-Tree
- Not balanced, uses chaining
  - Faster Insert / Delete, slower Search
- Internal nodes are static
  - Good for static DBs and data warehouses

21

Sparse and Clustered Indices

Remember that bit about only one clustered index per table?
- Only one clustered index per table
- Therefore, only one index has values that can be read sequentially without lots of page requests

22

How many locks do you need to...
- Insert a new item into DB
  - Unsorted?
  - Sorted?
  - Hash?
  - B+-Tree?
  - ISAM?

23

Unstructured Text
- Database => structured data
  - Schemas
  - Tables
  - OLTP
- Information Retrieval => unstructured

So they don’t have much to do with one another, right?

24

IR Queries
- Karl AND Malone
    SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D
- “Karl Malone”
    SELECT Docs(D) WHERE “Karl Malone” in D
    Does this mean “X Y” is a single term?
- Karl NEAR/2 Malone
    SELECT Docs(D) WHERE …uh…?

25

Structuring Text
- Position is structure!
  - Karl: par 1, sen 1, word 4
  - Malone: par 1, sen 1, word 5; par 2, sen 1, word 2; par 3, sen 1, word 7, zone quote

Admiral KO’d by Jazz power-forward; Malone fined and suspended.

SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.

Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.

"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.

Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99.

Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.

The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.

26

IR Queries in SQL
- Query: “Karl Malone”, Robinson
- Meaning: Docs w/ “Karl Malone” and Robinson
- TextIndex(word: string, doc: int, pos: int)

    SELECT W1.doc
    FROM TextIndex W1, TextIndex W2, TextIndex W3
    WHERE W1.doc = W2.doc AND W2.doc = W3.doc
      AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
      AND W2.pos = W1.pos + 1
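The phrase query can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the three short documents are made up for illustration, word positions are simple whitespace-split offsets starting at 1, and `DISTINCT` and `ORDER BY` are added so each matching document is listed once.

```python
import sqlite3

# Hypothetical mini-corpus: doc id -> text.
docs = {
    1: "Karl Malone has assured David Robinson",   # phrase and Robinson
    2: "Malone and Karl met Robinson",             # both words, but not the phrase
    3: "Karl Malone missed the game",              # phrase, but no Robinson
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO TextIndex VALUES (?, ?, ?)",
        [(word, doc_id, pos) for pos, word in enumerate(text.split(), start=1)],
    )

# "Karl Malone" as a phrase means Malone sits one position after Karl.
PHRASE_QUERY = """
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
ORDER BY W1.doc
"""

matches = [row[0] for row in conn.execute(PHRASE_QUERY)]
print(matches)
```

Each extra query term adds another self-join on `TextIndex`, which hints at why a general SQL engine is an awkward fit for long text queries.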

27

Indexing Issues in IR
- Index method: hash table on word
- IR folks think about attributes
- IR folks munge attributes
  - elbow* => elbow, elbowing, elbowed, etc.
  - “to be or not to be” => “”
- IR folks create search keys
  - Malone => Malone, Stockton, Jazz, Sloan, ...

28

IR and DBMSs
- IR uses DBMS for low-level storage
  - e.g. hash table storage
- Hash table lookup is only first step
  - Clustering
  - Relevance Ranking
  - Feedback, Expansion, ...
- Full SQL not needed
  - Custom optimized DB performs better

29

How AltaVista returns so quickly...

Hash indexes mean lots of page requests if there are lots of matches...

- Trick #1: use memory.
- Trick #2: threshold (find 10 pages > 75% rel).
- Trick #3: hard time limit.
  - More users, less CPU time / query
- Trick #4: prioritize
  - Try to find 10 in memory
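Tricks #2 and #3 amount to stopping early. The Python sketch below illustrates only that idea and makes no claim about how AltaVista was actually implemented: the function `top_hits`, its parameters, and the synthetic candidate list are assumptions. Trick #4 would correspond to ordering the candidates so that in-memory ones come first.

```python
import time


def top_hits(scored_docs, want=10, min_rel=0.75, time_limit_s=0.05, clock=time.monotonic):
    """Collect up to `want` docs with relevance above `min_rel`.

    Stops as soon as enough hits are found (threshold) or the time budget
    runs out (hard time limit), whichever comes first.
    """
    deadline = clock() + time_limit_s
    hits = []
    for doc, rel in scored_docs:
        if clock() >= deadline:
            break                     # hard time limit reached
        if rel > min_rel:
            hits.append(doc)
            if len(hits) == want:
                break                 # threshold satisfied, stop scanning
    return hits


# Synthetic candidates: every odd-numbered doc scores 0.9, the rest 0.5.
candidates = [(f"d{i}", 0.5 + (i % 2) * 0.4) for i in range(100)]
hits = top_hits(candidates, time_limit_s=60)
print(hits)
```

With a generous time limit the scan stops after the tenth good hit, having looked at only 20 of the 100 candidates.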

30

Summary
- Concerned about B, R, not just n
- Hash for equality, B+-Tree for range
- One index gives good disk performance
- IR uses hash indexing
- IR stores term information

Indexing helps performance, but you still need to think about what to index!
