Indexing in DBMSs

Erik Selberg, 590db, 4/29/98

Outline

Motivation
Cost Functions & 521
B+-Trees
  ISAM
Unstructured Text & IR
Conclusion

Motivation

Data stored on disk pages in one way
  O(n) space
Data can be ordered one way (if at all)
  O(log n) or O(1) lookup for one attribute
  O(n) lookup for the rest
Make lookups faster
  Increase space necessary
  What about speed of other operations?

Cost Functions

B: data pages on disk
R: records per page
  O(n) = O(BR)
D: I/O time (~25ms)
C: CPU time (~1-10ms)
H: Hash function time (~1-10ms)

DBMS operations

Scan - fetch all records
Search w/ Equality
  Lookups and Modifications
Search w/ Range
Insert
Delete

Bulk operations may be amortized!

Baseline Storage

Unorganized (heap)
Sorted
  Sorted on one key
Hashed
  static hashing using chaining

Unorganized Heaps

Scan       BD + BRC
Search =   1/2 (BD + BRC)
Search <>  BD + BRC
Insert     2D + C
Delete     C + D

Challenge: make this worse

Sorted

Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #
Insert     (D lg B + C lg R) + (BD + BRC)
Delete     (D lg B + C lg R) + (BD + BRC)

Good for range, crappy for rest

Static Hash w/ Chaining

Scan       1.25(BD + BRC)
Search =   H + D + 1/2RC
Search <>  1.25(BD + BRC)
Insert     (H + D + 1/2RC) + (C + D)
Delete     (H + D + 1/2RC) + (C + D)

Need to grow and shrink hash table
Bad hashes hose you
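To make the chaining idea concrete, here is a minimal Python sketch of a static hash file, assuming a fixed number of buckets and a fixed page capacity. The class name `StaticHashFile` and its parameters are hypothetical, chosen only for illustration; pages are modeled as in-memory lists rather than disk pages.

```python
class StaticHashFile:
    """Static hash file: fixed bucket count, overflow pages chained per bucket."""

    def __init__(self, num_buckets, page_capacity):
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity
        # Each bucket is a chain (list) of pages; each page is a list of (key, record).
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        # The bucket count never changes, which is what makes the scheme "static".
        return self.buckets[hash(key) % self.num_buckets]

    def insert(self, key, record):
        chain = self._chain(key)
        if len(chain[-1]) >= self.page_capacity:
            chain.append([])  # last page is full: chain a new overflow page
        chain[-1].append((key, record))

    def search(self, key):
        """Return (matching records, number of pages read)."""
        matches, pages_read = [], 0
        for page in self._chain(key):
            pages_read += 1
            matches.extend(rec for k, rec in page if k == key)
        return matches, pages_read

    def delete(self, key):
        """Remove every entry for key; return how many were removed."""
        removed = 0
        for page in self._chain(key):
            before = len(page)
            page[:] = [(k, r) for k, r in page if k != key]
            removed += before - len(page)
        return removed
```

With a single bucket every key lands in one chain, so each search walks all the overflow pages; that is the "bad hash" case in miniature.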

Cost summary

File Type   Scan             Search =          Search <>             Insert            Delete
Heap        BD + BRC         ½(BD + BRC)       BD + BRC              2D + C            C + D
Sorted      BD + BRC         D lg B + C lg R   D lg B + C lg R + #   Srch + BD + BRC   Srch + BD + BRC
Hashed      1.25(BD + BRC)   H + D + ½RC       1.25(BD + BRC)        Srch + C + D      Srch + C + D
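The formulas in the table can be evaluated directly. The following Python sketch is one way to do that; the function name `file_costs` is hypothetical, "Srch" is read as the equality-search cost for the same file type, and the "#" term is modeled as a caller-supplied `matches_cost` (an assumption, since the slides do not define it precisely).

```python
from math import log2

def file_costs(B, R, D, C, H, matches_cost=0.0):
    """Return {file type: {operation: cost}} using the formulas in the table.

    B: data pages, R: records per page, D: I/O time, C: CPU time,
    H: hash function time. matches_cost stands in for the '#' term.
    """
    scan = B * D + B * R * C
    sorted_eq = D * log2(B) + C * log2(R)
    hashed_eq = H + D + 0.5 * R * C
    return {
        "heap": {
            "scan": scan,
            "search_eq": 0.5 * scan,
            "search_range": scan,
            "insert": 2 * D + C,
            "delete": C + D,
        },
        "sorted": {
            "scan": scan,
            "search_eq": sorted_eq,
            "search_range": sorted_eq + matches_cost,
            "insert": sorted_eq + scan,
            "delete": sorted_eq + scan,
        },
        "hashed": {
            "scan": 1.25 * scan,
            "search_eq": hashed_eq,
            "search_range": 1.25 * scan,
            "insert": hashed_eq + C + D,
            "delete": hashed_eq + C + D,
        },
    }

# Example parameters (illustrative): 1024 pages, 128 records per page,
# D = 25 ms, C = 1 ms, H = 1 ms.
costs = file_costs(B=1024, R=128, D=25, C=1, H=1)
best_for_equality = min(costs, key=lambda f: costs[f]["search_eq"])
```

Under these example parameters the hashed file wins for equality search, while the heap has by far the cheapest insert.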

What’s the best structure if:

You’re Amazon.com. Lots of equality lookups, some bulk insertions.

You’re United. Lots of range lookups.

You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

What is stored in the index?

k: key; k*: data entry
r1 = (Malone, Karl, 123, 13, 4)
r2 = (Malone, Moses, 456, 16, 5)

k* = data             k* = r1
k* = <k, rid>         k* = <Malone, r1>
k* = <k, rid list>    k* = <Malone, (r1,
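The three data-entry alternatives can be sketched in Python as follows. The record ids 1 and 2 and the helper `key_of` are hypothetical, introduced only for illustration; the two records are the ones from the slide, keyed on the first field.

```python
from collections import defaultdict

# Records from the slide; the rid values 1 and 2 are hypothetical record ids.
records = {
    1: ("Malone", "Karl", 123, 13, 4),
    2: ("Malone", "Moses", 456, 16, 5),
}

def key_of(record):
    """Search key: the first field of the record."""
    return record[0]

# Alternative 1: the data entry is the record itself.
alt1 = list(records.values())

# Alternative 2: one <k, rid> pair per record.
alt2 = [(key_of(rec), rid) for rid, rec in records.items()]

# Alternative 3: one <k, rid list> entry per distinct key.
grouped = defaultdict(list)
for rid, rec in records.items():
    grouped[key_of(rec)].append(rid)
alt3 = sorted(grouped.items())
```

Alternative 3 stores each key once, which saves space when many records share a key, at the price of variable-length entries.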

Clustered Indices

Order data entries in a similar way to data records on disk

Only one clustered index per table

[Figure: two index diagrams, each layered as Index, Data entries, Data Records]

Sparse and Dense Indices

Dense - one entry per record (1-1)
Sparse - one entry per page
  Clustered, therefore only one per table
Inverted on a field
  Dense secondary index
Fully Inverted
  All fields have index

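[Figure: sparse index over sorted data pages]
Sparse index entries: Hawkins, Payton
Page 1: Baker, 4; Ellis, 14; Foster, 7
Page 2: Hawkins, 9; Keefe, 5; Malone, 12
Page 3: Payton, 7; Stockton, 13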
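A lookup through a sparse index can be sketched in Python like this. It assumes the records above sit on three sorted pages and that the index holds the first key of every page after the first (Hawkins, Payton); the page boundaries and the function name `lookup` are assumptions made for illustration.

```python
from bisect import bisect_right

# Data pages sorted by name (page boundaries are assumed).
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: one separator per page boundary, not one entry per record.
separators = ["Hawkins", "Payton"]

def lookup(name):
    """Use the sparse index to pick one page, then search only that page."""
    page_no = bisect_right(separators, name)
    for key, value in pages[page_no]:
        if key == name:
            return value
    return None
```

Because the index only says which page to read, this works only when the data file itself is sorted on the key, which is why a sparse index has to be clustered.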

Primary and Secondary Indices

Primary Index is over the Primary Key
Primary stores data entry as records
Primary has no duplicates
Should only be one

Secondary stores as <k, rid> or <k, rid list>

B-Trees

B is for Balanced (that’s good enough for me)

B-Tree
  Each node has d items, at most d+1 children
  Balanced tree
B+-Tree
  Data at leaves
  Leaves doubly-linked

A B+-Tree

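[Figure: example B+-Tree. Root keys: 20, 40, 60, 80. Internal keys shown: 6, 15, 30, 98. Leaf entries shown: 1*, 2*, 3*, 6*, 9*, 18*, 19*, 24*, 29*, 99*]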

Keys are at leaves
Not all nodes / leaves are full
  Common impls keep 50% minimum occupancy
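As a rough illustration of why linked leaves help range queries, here is a minimal Python sketch of equality and range search. It is a simplification under stated assumptions: the tree is bulk-built with a single root over the leaves, the key list is non-empty, only the forward leaf link is modeled, and inserts, deletes, and rebalancing are left out. The names `Leaf`, `Inner`, `build`, `search`, and `range_scan` are hypothetical.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted data entries
        self.next = None   # link to the right sibling leaf

class Inner:
    def __init__(self, keys, children):
        self.keys = keys           # separators; len(children) == len(keys) + 1
        self.children = children

def build(sorted_keys, leaf_size):
    """Bulk-build a two-level tree: one root over a chain of linked leaves."""
    leaves = [Leaf(sorted_keys[i:i + leaf_size])
              for i in range(0, len(sorted_keys), leaf_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return Inner(separators, leaves)

def find_leaf(node, key):
    """Descend from the root to the leaf that would hold key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key):
    return key in find_leaf(root, key).keys

def range_scan(root, lo, hi):
    """Find the leaf for lo once, then follow sibling links until past hi."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

root = build([1, 2, 3, 6, 9, 18, 19, 24, 29, 99], leaf_size=3)
```

The range scan pays for one root-to-leaf descent and then reads leaves sequentially, which is the behavior the range-search cost is meant to capture.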

B+-Tree Costs

Assume: d == R
Scan       BD + BRC
Search =   D lg B + C lg R
Search <>  D lg B + C lg R + #
Insert     RCD lg B
Delete     RCD lg B

Some extra work to keep balance

Summary + B+-Tree costs

File Type   Scan             Search =          Search <>             Insert            Delete
Heap        BD + BRC         ½(BD + BRC)       BD + BRC              2D + C            C + D
Sorted      BD + BRC         D lg B + C lg R   D lg B + C lg R + #   Srch + BD + BRC   Srch + BD + BRC
Hashed      1.25(BD + BRC)   H + D + ½RC       1.25(BD + BRC)        Srch + C + D      Srch + C + D
B+-Tree     BD + BRC         D lg B + C lg R   D lg B + C lg R + #   RCD lg B          RCD lg B

ISAM Trees

Similar to B+-Tree
Not balanced, uses chaining
  Faster Insert / Delete, slower Search
Internal nodes are static
  Good for static DBs and data warehouses

Sparse and Clustered Indices

Remember that bit about only one clustered index per table?

Only one clustered index per table
Therefore, only one index has values that can be read sequentially without lots of page requests

How many locks do you need to...

Insert a new item into DB
  Unsorted?
  Sorted?
  Hash?
  B+-Tree?
  ISAM?

Unstructured Text

Database => structured data
  Schemas
  Tables
  OLTP

Information Retrieval => unstructured

So they don’t have much to do with one another, right?

IR Queries

Karl AND Malone
  SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D

“Karl Malone”
  SELECT Docs(D) WHERE “Karl Malone” in D
  Does this mean “X Y” is a single term?

Karl NEAR/2 Malone
  SELECT Docs(D) WHERE …uh…?
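One way to answer all three query types is to record word positions in the index. The Python sketch below assumes whitespace tokenization and reads NEAR/k as "within k words, in either order"; both are assumptions, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map word -> {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word][doc_id].append(pos)
    return index

def query_and(index, a, b):
    """Docs containing both words, anywhere."""
    return sorted(set(index.get(a, {})) & set(index.get(b, {})))

def query_phrase(index, a, b):
    """Docs where b immediately follows a."""
    hits = []
    for doc in query_and(index, a, b):
        following = set(index[b][doc])
        if any(pa + 1 in following for pa in index[a][doc]):
            hits.append(doc)
    return hits

def query_near(index, a, b, k):
    """Docs where a and b occur within k words of each other (either order)."""
    hits = []
    for doc in query_and(index, a, b):
        if any(abs(pa - pb) <= k
               for pa in index[a][doc] for pb in index[b][doc]):
            hits.append(doc)
    return hits

docs = {
    1: "Karl Malone was suspended",
    2: "Malone said Karl was fine",
    3: "Karl and Moses Malone",
    4: "Malone was fined",
}
index = build_index(docs)
```

Here `query_and(index, "Karl", "Malone")` matches documents 1, 2, and 3, the phrase query matches only document 1, and NEAR/2 matches documents 1 and 2, so a phrase does not need to be stored as a single term.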

Structuring Text

Position is structure!

Karl: par 1, sen 1, word 4

Malone: par 1, sen 1, word 5
        par 2, sen 1, word 2
        par 3, sen 1, word 7, zone quote

Admiral KO’d by Jazz power-forward; Malone fined and suspended.

SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.

Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.

"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.

Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99.

Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.

The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.

IR Queries in SQL

Query: “Karl Malone”, Robinson
Meaning: Docs w/ “Karl Malone” and Robinson

TextIndex(word: string, doc: int, pos: int)

SELECT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
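The query can be tried out with Python's built-in `sqlite3` module. This is a sketch on three made-up documents, with word positions counted from 1 and the phrase condition written so that "Malone" sits exactly one position after "Karl"; the sample documents and the DISTINCT / ORDER BY clauses are additions for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")

# Hypothetical sample documents; one TextIndex row per word occurrence.
docs = {
    1: "Karl Malone elbowed Robinson",
    2: "Malone and Karl met Robinson",
    3: "Karl Malone scored",
}
rows = [(word, doc, pos)
        for doc, text in docs.items()
        for pos, word in enumerate(text.split(), start=1)]
conn.executemany("INSERT INTO TextIndex VALUES (?, ?, ?)", rows)

# Three-way self-join: W1 = Karl, W2 = Malone (adjacent), W3 = Robinson.
PHRASE_QUERY = """
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
ORDER BY W1.doc
"""
result = [doc for (doc,) in conn.execute(PHRASE_QUERY)]
```

Only document 1 qualifies: document 2 has the words in the wrong order, and document 3 never mentions Robinson.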

Indexing Issues in IR

Index method: hash table on word
IR folks think about attributes
IR folks munge attributes
  elbow* => elbow, elbowing, elbowed, etc.
  “to be or not to be” => “”
IR folks create search keys
  Malone => Malone, Stockton, Jazz, Sloan,
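The two kinds of munging shown above can be sketched in a few lines of Python. The stop list and the vocabulary are hypothetical, and real systems typically use stemming rather than the plain prefix matching assumed here.

```python
# Hypothetical stop list: words considered too common to index.
STOP_WORDS = {"to", "be", "or", "not"}

def remove_stop_words(query):
    """Drop stop words; a query made only of stop words becomes empty."""
    return " ".join(w for w in query.lower().split() if w not in STOP_WORDS)

def expand_wildcard(pattern, vocabulary):
    """Expand a trailing-* pattern to every indexed word with that prefix."""
    if not pattern.endswith("*"):
        return [pattern] if pattern in vocabulary else []
    prefix = pattern[:-1]
    return sorted(w for w in vocabulary if w.startswith(prefix))

vocabulary = {"elbow", "elbowing", "elbowed", "blow", "malone"}
expanded = expand_wildcard("elbow*", vocabulary)
stripped = remove_stop_words("to be or not to be")
```

Each expanded word then becomes its own hash lookup, so one user term can turn into several index probes.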

IR and DBMSs

IR uses DBMS for low-level storage
  e.g. hash table storage
Hash table lookup is only first step
  Clustering
  Relevance Ranking
  Feedback, Expansion, ...
Full SQL not needed
  Custom optimized DB performs better

How AltaVista returns so quickly...

Hash indexes mean lots of page requests if there are lots of matches...

Trick #1: use memory.
Trick #2: threshold (find 10 pages > 75% rel).
Trick #3: hard time limit.
  More users, less CPU time / query
Trick #4: prioritize
  Try to find 10 in memory
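Tricks #2 through #4 amount to stopping early. The Python sketch below shows the general idea only; it is not a description of how AltaVista was actually implemented. The function `top_hits`, its parameters, and the scoring callback are all hypothetical.

```python
import time

def top_hits(candidates, score, want=10, threshold=0.75, time_limit_s=0.05):
    """Collect up to `want` docs scoring above `threshold`.

    Stops as soon as enough hits are found (trick #2) or the time budget
    is used up (trick #3). Callers can pass in-memory candidates first so
    they are tried before anything that needs a disk read (trick #4).
    """
    deadline = time.monotonic() + time_limit_s
    hits = []
    for doc in candidates:
        if len(hits) >= want or time.monotonic() >= deadline:
            break
        if score(doc) > threshold:
            hits.append(doc)
    return hits

# Toy relevance function: even-numbered docs are highly relevant.
def toy_score(doc):
    return 0.9 if doc % 2 == 0 else 0.1

first_ten = top_hits(range(1000), toy_score, want=10, time_limit_s=5)
```

The result is "good enough" rather than complete: documents later in the candidate list are never scored once ten hits are in hand.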

Summary

Concerned about B, R, not just n
Hash for equality, B+-Tree for range
One index gives good disk performance
IR uses hash indexing
IR stores term information

Indexing helps performance, but you still need to think about what to index!
