View
227
Download
2
Category
Preview:
Citation preview
1/2/2015
1
1
CS232: Database System Principles
INDEXING
2
Given condition on attribute find qualified
records
Attr = value
Condition may also be
• Attr>value
• Attr>=value
Indexing
? value
Qualified records
value
value
Indexing • Data Stuctures used for quickly locating tuples that
meet a specific type of condition – Equality condition: find Movie tuples where Director=X
– Other conditions possible, eg, range conditions: find Employee tuples where Salary>40 AND Salary<50
• Many types of indexes. Evaluate them on – Access time
– Insertion time
– Deletion time
– Disk Space needed (esp. as it effects access time)
1/2/2015
2
4
Topics
• Conventional indexes
• B-trees
• Hashing schemes
Terms and Distinctions • Primary index
– the index on the attribute (a.k.a. search key) that determines the sequencing of the table
• Secondary index
– index on any other attribute
• Dense index
– every value of the indexed attribute appears in the index
• Sparse index
– many values do not appear
10
20
30
40
10
20
30
40
50
70
80
90
100
120
50
70
80
90
A Dense Primary Index
100
120
140
150
Sequential
File
Dense and Sparse Primary Indexes
10
20
30
40
10
20
30
40
50
70
80
90
100
120
50
70
80
90
Dense Primary Index
100
120
140
150
Sparse Primary Index
10
30
50
80
100
140
160
200
10
20
30
40
50
70
80
90
100
120
Find the index record with largest
value that is less or equal to the
value we are looking. + can tell if a value exists without
accessing file (consider projection)
+ better access to overflow records
+ less index space
more + and - in a while
1/2/2015
3
7
Sparse vs. Dense Tradeoff
• Sparse: Less index space per record
can keep more of index in memory
• Dense: Can tell if any record exists
without accessing file
(Later:
– sparse better for insertions – dense needed for secondary indexes)
Multi-Level Indexes
• Treat the index as a file and build an index on it
• “Two levels are usually sufficient. More than three levels are rare.”
• Q: Can we build a dense second level index for a dense index ?
10
30
50
80
100
140
160
200
10
20
30
40
50
70
80
90
100
120
10
100
250
400
250
270
300
350
400
460
500
550
600
750
920
1000
A Note on Pointers
• Record pointers consist of block pointer and position of record in the block
• Using the block pointer only, saves
space at no extra accesses cost
• But a block pointer cannot serve as
record identifier
1/2/2015
4
Representation of Duplicate Values in Primary Indexes
• Index may point to first instance of each value only
10
40
70
100
10
10
10
40
40
70
70
70
100
120
Deletion from Dense Index
10
20
30
10
20
30
50
70
90
100
120
50
70
90
Delete 40, 80
HeaderHeader
Lists of available entries
• Deletion from dense primary index file with no duplicate values is handled in the same way with deletion from a sequential file
• Q: What about deletion from dense primary index with duplicates
Deletion from Sparse Index
• if the deleted entry does not appear in the index do nothing
10
30
50
80
100
140
160
200
10
20
30
50
70
80
90
100
120
HeaderDelete 40
1/2/2015
5
Deletion from Sparse Index (cont’d)
• if the deleted entry does not appear in the index do nothing
• if the deleted entry appears in the index replace it with the next search-key value
– comment: we could leave the deleted value in the index assuming that no part of the system may assume it still exists without checking the block
Delete 30
10
40
50
80
100
140
160
200
10
20
40
50
70
80
90
100
120
Header
Deletion from Sparse Index (cont’d)
• if the deleted entry does not appear in the index do nothing
• if the deleted entry appears in the index replace it with the next search-key value
• unless the next search key value has its own index entry. In this case delete the entry
Delete 40, then 30
10
50
80
100
140
160
200
10
20
50
70
80
90
100
120
HeaderHeader
Insertion in Sparse Index
• if no new block is created then do nothing
10
30
50
80
100
140
160
200
10
20
30
35
50
70
80
90
100
120
HeaderInsert 35
1/2/2015
6
16
Insertion in Sparse Index
• if no new block is created then do nothing
• else create overflow record – Reorganize periodically
– Could we claim space of next block?
– How often do we reorganize and how much expensive it is?
– B-trees offer convincing answers
10
30
50
80
100
140
160
200
10
20
30
50
70
80
90
100
120
HeaderInsert 15
17
Secondary indexes Sequence field
50 30
70 20
40 80
10 100
60 90
File not sorted on secondary search key
18
Secondary indexes Sequence field
50 30
70 20
40 80
10 100
60 90
• Sparse index
30 20 80 100
90 ...
does not make sense!
1/2/2015
7
19
Secondary indexes Sequence field
50 30
70 20
40 80
10 100
60 90
• Dense index
10 20 30 40
50 60 70 ...
10 50 90 ...
sparse high level
First level has to be dense, next levels are sparse (as usual)
20
Duplicate values & secondary indexes
10 20
40 20
40 10
40 10
40 30
21
Duplicate values & secondary indexes
10 20
40 20
40 10
40 10
40 30
10 10 10 20
20 30 40 40
40 40 ...
one option...
Problem:
excess overhead! • disk space
• search time
1/2/2015
8
22
Duplicate values & secondary indexes
10 20
40 20
40 10
40 10
40 30
10
another option: lists of pointers
40 30
20 Problem:
variable size records in
index!
23
Duplicate values & secondary indexes
10 20
40 20
40 10
40 10
40 30
10 20 30 40
50 60 ...
Yet another idea : Chain records with same key?
Problems: • Need to add fields to records, messes up maintenance • Need to follow chain to know records
24
Duplicate values & secondary indexes
10 20
40 20
40 10
40 10
40 30
10 20 30 40
50 60 ...
buckets
1/2/2015
9
25
Why “bucket” + record pointers is useful
Indexes Records
Name: primary EMP (name,dept,year,...)
Dept: secondary
Year: secondary
• Enables the processing of queries working with pointers only. • Very common technique in Information Retrieval
Advantage of Buckets: Process Queries Using Pointers Only
Find employees of the Toys dept with 4 years in the company SELECT Name FROM Employee
WHERE Dept=“Toys” AND Year=4
Toys
PCs
Pens
Suits
Dept Index Aaron Suits 4
Helen Pens 3
Jack PCs 4
Jim Toys 4
Joe Toys 3
Nick PCs 2
Walt Toys 5
Yannis Pens 1
1
2
3
4
Year Index
Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s
27
This idea used in text information retrieval
Documents
...the cat is fat ...
...my cat and my dog like each other...
...Fido the dog ... Buckets known as
Inverted lists
cat
dog
1/2/2015
10
Summary of Indexing So Far • Basic topics in conventional indexes
– multiple levels
– sparse/dense
– duplicate keys and buckets
– deletion/insertion similar to sequential files
• Advantages
– simple algorithms
– index is sequential file
• Disadvantages
– eventually sequentiality is lost because of overflows, reorganizations are needed
29
Example Index (sequential)
continuous
free space
10 20 30
40 50 60
70 80 90
39 31 35 36
32 38 34
33
overflow area (not sequential)
30
Outline:
• Conventional indexes
• B-Trees NEXT
• Hashing schemes
1/2/2015
11
31
• NEXT: Another type of index
– Give up on sequentiality of index
– Try to get “balance”
32
Root
B+Tree Example n=3
100
120
150
180
30
3
5
11
30
35
100
101
110
120
130
150
156
179
180
200
33
Sample non-leaf
to keys to keys to keys to keys
< 57 57 k<81 81k<95 95
57
81
95
1/2/2015
12
34
Sample leaf node:
From non-leaf node
to next leaf
in sequence 57
81
95
To r
eco
rd
with k
ey 5
7
To
reco
rd
with k
ey 8
1
To
reco
rd
with k
ey 8
5
35
In textbook’s notation n=3
Leaf:
Non-leaf:
30
35
30
30 35
30
36
Size of nodes: n+1 pointers
n keys (fixed)
1/2/2015
13
37
Non-root nodes have to be at least half-full
• Use at least
Non-leaf: (n+1)/2 pointers
Leaf: (n+1)/2 pointers to data
38
Full node min. node
Non-leaf
Leaf
n=3
120
150
180
30
3
5
11
30
35
39
B+tree rules tree of order n
(1) All leaves at same lowest level
(balanced tree)
(2) Pointers in leaves point to records
except for “sequence pointer”
1/2/2015
14
40
(3) Number of pointers/keys for B+tree
Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2- 1
Leaf (non-root) n+1 n
Root n+1 n 1 1
Max Max Min Min ptrs keys ptrsdata keys
(n+1)/2 (n+1)/2
41
Insert into B+tree
(a) simple case – space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
42
(a) Insert key = 32 n=3
3
5
11
30
31
30
100
32
1/2/2015
15
43
(a) Insert key = 7 n=3
3
5
11
30
31
30
100
3
5 7
7
44
(c) Insert key = 160
n=3
100
120
150
180
150
156
179
180
200
160
180
160
179
45
(d) New root, insert 45 n=3
10
20
30
1
2
3
10
12
20
25
30
32
40
40
45
40
30
new root
1/2/2015
16
46
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
Deletion from B+tree
47
(b) Coalesce with sibling
– Delete 50
10
40
100
10
20
30
40
50
n=4
40
48
(c) Redistribute keys
– Delete 50
10
40
100
10
20
30
35
40
50
n=4
35
35
1/2/2015
17
49 40
45
30
37
25
26
20
22
10
14
1
3
10
20
30
40
(d) Non-leaf coalese
– Delete 37 n=4
40
30
25
25
new root
50
B+tree deletions in practice
– Often, coalescing is not implemented – Too hard and not worth it!
51
Is LRU a good policy for B+tree buffers?
Of course not!
Should try to keep root in memory
at all times (and perhaps some nodes from second level)
1/2/2015
18
52
Hardware+ indexing problem:
For B+tree, how large should n be?
…
n is number of keys / node
Assumptions
• You have the right to set the block size for the disk where a B-tree will reside.
• Compute the optimum page size n assuming that
– The items are 4 bytes long and the pointers are also 4 bytes long.
– Time to read a node from disk is 12+.003n
– Time to process a block in memory is unimportant
– B+tree is full (I.e., every page has the maximum number of items and pointers
54
Can get:
f(n) = time to find a record
f(n)
nopt n
1/2/2015
19
55
FIND nopt by f’(n) = 0
Answer should be nopt = “few hundred”
What happens to nopt as
• Disk gets faster?
• CPU get faster?
56
Outline/summary
• Conventional Indexes • Sparse vs. dense
• Primary vs. secondary
• B+ trees
• Hashing schemes --> Next
• Bitmap indices
Hashing
• hash function h(key) returns address of bucket
• if the keys for a specific hash value do not fit into one page the bucket is a linked list of pages
key h(key)
Buckets Records
key
1/2/2015
20
58
Example hash function
• Key = ‘x1 x2 … xn’ n byte character string
• Have b buckets
• h: add x1 + x2 + ….. xn
– compute sum modulo b
59
This may not be best function …
Read Knuth Vol. 3 if you really
need to select a good function.
Good hash Expected number of
function: keys/bucket is the
same for all buckets
60
Within a bucket:
• Do we keep keys sorted?
• Yes, if CPU time critical
& Inserts/Deletes not too frequent
1/2/2015
21
61
Next: example to illustrate inserts, overflows, deletes
h(K)
62
EXAMPLE 2 records/bucket
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
0
1
2
3
d
a
c
b
h(e) = 1
e
63
0
1
2
3
a
b
c
e
d
EXAMPLE: deletion
Delete: e f
f
g maybe move
“g” up
c d
1/2/2015
22
64
Rule of thumb:
• Try to keep space utilization
between 50% and 80%
Utilization = # keys used total # keys that fit
• If < 50%, wasting space
• If > 80%, overflows significant depends on how good hash
function is & on # keys/bucket
65
How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing
• Extensible
• Linear
66
Extensible hashing: two ideas
(a) Use i of b bits output by hash function
b
h(K)
use i grows over time….
00110101
1/2/2015
23
67
(b) Use directory
h(K)[0-i ] to bucket . . .
.
.
.
Example: h(k) is 4 bits; 2 keys/bucket
i = 1
1
1
0001
1001
1100
“slide” conventions: • slide shows h(k), while actual directory has key+pointer
69
Example: h(k) is 4 bits; 2 keys/bucket
i = 1
1
1
0001
1001
1100
Insert 1010 1
1100
1010
New directory
2
00
01
10
11
i =
2
2
1/2/2015
24
70
1
0001
2
1001
1010
2
1100
Insert:
0111
0000
00
01
10
11
2 i =
Example continued
0111
0000
0111
0001
2
2
71
00
01
10
11
2 i =
2 1001
1010
2 1100
2 0111
2 0000
0001
Insert:
1001
Example continued
1001
1001
1010
000
001
010
011
100
101
110
111
3 i =
3
3
72
Extensible hashing: deletion
• No merging of blocks
• Merge blocks and cut directory if possible
(Reverse insert procedure)
1/2/2015
25
73
Deletion example:
• Run thru insert example in reverse!
74
Extensible hashing
Can handle growing files
- with less wasted space
- with no full reorganizations
Summary
+
Indirection
(Not bad if directory in memory)
Directory doubles in size
(Now it fits, now it does not)
-
-
75
Linear hashing
• Another dynamic hashing scheme
Two ideas:
(a) Use i low order bits of hash 01110101
grows
b
i
(b) File grows linearly
1/2/2015
26
76
Example b=4 bits, i =2, 2 keys/bucket
00 01 10 11
0101
1111
0000
1010
m = 01 (max used block)
Future growth buckets
If h(k)[i ] m, then
look at bucket h(k)[i ]
else, look at bucket h(k)[i ] - 2i -1
Rule
0101 • can have overflow chains!
• insert 0101
77
Example b=4 bits, i =2, 2 keys/bucket
00 01 10 11
0101
1111
0000
1010
m = 01 (max used block)
Future growth buckets
10
1010
0101 • insert 0101
11
1111 0101
78
Example Continued: How to grow beyond this?
00 01 10 11
1111 1010 0101
0101
0000
m = 11 (max used block)
i = 2
0 0 0 0 100 101 110 111
3
. . .
100
100
101
101
0101
0101
1/2/2015
27
79
• If U > threshold then increase m
(and i, when m reaches 2i )
When do we expand file?
• Keep track of: #used slots (incl. overflow) #total slots in primary buckets
= U
equiv, #(indexed key ptr pairs) #total slots in primary buckets
80
Linear Hashing
Can handle growing files
- with less wasted space
- with no full reorganizations
No indirection like extensible hashing
Summary
+
+
Can still have overflow chains -
81
Example: BAD CASE
Very full
Very empty Need to move
m here…
Would waste
space...
1/2/2015
28
82
Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear
Summary
83
Next:
• Indexing vs Hashing
• Index definition in SQL
• Multiple key access
84
• Hashing good for probes given key
e.g., SELECT …
FROM R
WHERE R.A = 5
Indexing vs Hashing
1/2/2015
29
85
• INDEXING (Including B Trees) good for
Range Searches:
e.g., SELECT
FROM R
WHERE R.A > 5
Indexing vs Hashing
86
Index definition in SQL
• Create index name on rel (attr)
• Create unique index name on rel (attr)
defines candidate key
• Drop INDEX name
87
CANNOT SPECIFY TYPE OF INDEX
(e.g. B-tree, Hashing, …)
OR PARAMETERS
(e.g. Load Factor, Size of Hash,...)
... at least in SQL...
Note
1/2/2015
30
88
ATTRIBUTE LIST MULTIKEY INDEX
(next)
e.g., CREATE INDEX foo ON R(A,B,C)
Note
89
Motivation: Find records where
DEPT = “Toy” AND SAL > 50k
Multi-key Index
90
Strategy I:
• Use one index, say Dept.
• Get all Dept = “Toy” records and check their salary
I1
1/2/2015
31
91
• Use 2 Indexes; Manipulate Pointers
Toy Sal > 50k
Strategy II:
92
• Multiple Key Index
One idea:
Strategy III:
I1
I2
I3
93
Example
Example Record
Dept Index Salary Index
Name=Joe DEPT=Sales SAL=15k
Art Sales Toy
10k 15k 17k 21k
12k 15k 15k 19k
1/2/2015
32
94
For which queries is this index good?
Find RECs Dept = “Sales” SAL=20k
Find RECs Dept = “Sales” SAL > 20k
Find RECs Dept = “Sales”
Find RECs SAL = 20k
95
Interesting application:
• Geographic Data
DATA:
<X1,Y1, Attributes>
<X2,Y2, Attributes>
x
y
. .
.
96
Queries:
• What city is at <Xi,Yi>?
• What is within 5 miles from <Xi,Yi>?
• Which is closest point to <Xi,Yi>?
1/2/2015
33
97
h
n b
i a
c o
d
10 20
10 20
Example e
g
f
m
l
k
j 25 15 35 20
40
30
20
10
h i a b c d e f g
n o m l j k
• Search points near f • Search points near b
5
15 15
98
Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points “close” to i = <12,38>
• Find points “close” to b = <7,24>
99
• Many types of geographic index
structures have been suggested • Quad Trees
• R Trees
1/2/2015
34
100
Outline/summary
• Conventional Indexes • Sparse vs. dense
• Primary vs. secondary
• B+ trees
• Hashing schemes
• Bitmap indices --> Next
Revisit: Processing queries without accessing records until last step Find employees of the Toys dept with 4 years in the company
SELECT Name FROM Employee
WHERE Dept=“Toys” AND Year=4
Toys
PCs
Pens
Suits
Dept Index Aaron Suits 4
Helen Pens 3
Jack PCs 4
Jim Toys 4
Joe Toys 3
Nick PCs 2
Walt Toys 5
Yannis Pens 1
1
2
3
4
Year Index
Bitmap indices: Alternate structure, heavily used in OLAP
102
Toys 00011010
PCs 00100100
Pens 01000001
Suits 10000000
Dept Index Aaron Suits 4
Helen Pens 3
Jack PCs 4
Jim Toys 4
Joe Toys 3
Nick PCs 2
Walt Toys 1
Yannis Pens 1
00000011 1
00000100 2
01001000 3
10110000 4
Assume the tuples of the Employees table are ordered.
+ Find even more quickly intersections and unions (e.g., Dept=“Toys” AND Year=4) ? Seems it needs too much space -> We’ll do compression ? How do we deal with insertions and deletions -> Easier than you think
Year Index Conceptually only!
1/2/2015
35
2nlogm Compression
• Naive solution needs mn bits, where m is #distinct values and n is #tuples
• But there is just n 1’s=> let’s utilize this
• Bit encoding of sequence of runs (e.g. [3,0,1])
103
Toys: 00011010 3 0 1
First run says: The first ace appears after 3 zeros
Second run says: The 2nd ace appears immediately after the 1st
Third run says: The 3rd ace appears after 1 zero after the 2nd
1011 00 0 1 10 says: The binary encoding of the first number needs 1+1 digits. 11 says: The first number is 3
2nlog m compression
• Example
• Pens: 01000001
• Sequence [1,5]
• Encoding: 01110101
104
Insertions and deletions & miscellaneous engineering
• Assume tuples are inserted in order
• Deletions: Do nothing
• Insertions: If tuple t with value v is inserted, add one more run in v’s sequence (compact bitmap)
105
1/2/2015
36
106
The BIG picture….
• Chapters 2 & 3: Storage, records, blocks...
• Chapter 4 & 5: Access Mechanisms - Indexes
- B trees
- Hashing
- Multi key
• Chapter 6 & 7: Query Processing NEXT
Recommended