CS 455: Principles of Database Systems - 7 - Indexing and Hashing
Topics
‣ Indexing
‣ B+Trees
‣ Extendible Hash Index
‣ Consistent Hash Index
‣ Bitmap Index
‣ Conclusion
Consistent Hashing
‣ Problem: Extendible hash index grows at an exponential rate
• Caused by increasing the mod function
‣ Consistent Hashing
• Back to a static hash function
- h(k) = k % r
• View "key space" as a ring
- Goes from 0 to r-1
- Blocks/Buckets stored on the ring
[Figure: hash ring running from 0 to r-1, with positions i, j, and k marked]
Consistent Hashing: Insert
‣ Inserting a tuple with key k:
• Use h(k) to hash k onto ring
• If h(k) is assigned a bucket:
- Insert tuple in bucket if space allows
• Otherwise:
- Move clockwise to the first bucket you find
- Insert tuple in bucket if space allows
‣ What if bucket runs out of space?
[Figure: hash ring running from 0 to r-1, with positions i, j, and k marked]
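The clockwise search in the insert procedure can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it assumes "clockwise" means increasing ring position, and it keeps the occupied positions in a sorted Python list searched with `bisect` rather than a tree. The function name `find_bucket` is made up for this example.

```python
import bisect

def find_bucket(bucket_positions, k, r):
    """Return the ring position of the bucket that should receive key k.

    bucket_positions: sorted list of ring positions that hold a bucket.
    Assumption: moving clockwise means moving toward larger positions.
    """
    h = k % r                                     # static hash onto the ring
    i = bisect.bisect_left(bucket_positions, h)   # first bucket at or after h(k)
    if i == len(bucket_positions):                # walked past r-1: wrap around to 0
        i = 0
    return bucket_positions[i]

print(find_bucket([5, 12], 7, 17))    # 12
print(find_bucket([5, 12], 72, 17))   # 5, since 72 % 17 = 4
print(find_bucket([5, 12], 13, 17))   # 5, wraps around the end of the ring
```

The binary search is what makes a lookup logarithmic in the number of buckets instead of constant time.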
Example
‣ Show the hash ring for an r = 17 ring.
• Blocks hold 3 tuples
• Initially, two empty blocks on 5 and 12
• Insert: 7, 11, 72, 6, 19, 17, 23, 27, 18
[Figure: ring with positions 0 through 16 and empty blocks at 5 and 12]
Consistent Hashing: Overflow
‣ When bucket overflows, split:
• Create new bucket
• Place it here:
\[
B_{\mathrm{new}} =
\begin{cases}
\left\lfloor \dfrac{\mathrm{pred}(B_k) + B_k}{2} \right\rfloor, & \text{if } B_k > \mathrm{pred}(B_k) \\[6pt]
\left( \mathrm{pred}(B_k) + \left\lfloor \dfrac{r - \mathrm{pred}(B_k) + B_k}{2} \right\rfloor \right) \bmod r, & \text{otherwise}
\end{cases}
\]
- Where pred(B) is the clockwise predecessor of bucket B.
• Rehash all tuples in overflown bucket
• Attempt insert again, repeating split routine if necessary.
[Figure: ring showing the overflown bucket B_k and the new bucket B_new]
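The split rule can be checked by running the r = 17 exercise from the earlier Example slide. The sketch below is one possible reading, under stated assumptions: clockwise means increasing position, the ring is a sorted list rather than a red-black tree, and the class and method names (`ConsistentHashIndex`, `_owner`, `_split_position`) are invented for illustration.

```python
import bisect

class ConsistentHashIndex:
    def __init__(self, r, capacity, positions):
        self.r = r
        self.capacity = capacity
        self.positions = sorted(positions)             # ring positions holding a bucket
        self.buckets = {p: [] for p in self.positions}

    def _owner(self, k):
        """First bucket at or clockwise after h(k) = k % r."""
        i = bisect.bisect_left(self.positions, k % self.r)
        return self.positions[i % len(self.positions)]  # i == len wraps to index 0

    def _split_position(self, bk):
        """Midpoint between bucket bk and its predecessor pred(bk)."""
        pred = self.positions[self.positions.index(bk) - 1]  # index -1 wraps around
        if bk > pred:
            return (pred + bk) // 2
        return (pred + (self.r - pred + bk) // 2) % self.r

    def insert(self, k):
        while True:
            bk = self._owner(k)
            if len(self.buckets[bk]) < self.capacity:
                self.buckets[bk].append(k)
                return
            new = self._split_position(bk)
            if new in self.buckets:                     # assumption: give up, no free slot
                raise RuntimeError("no free ring position left for a split")
            bisect.insort(self.positions, new)
            old, self.buckets[bk], self.buckets[new] = self.buckets[bk], [], []
            for t in old:                               # rehash only the overflown bucket
                self.buckets[self._owner(t)].append(t)

index = ConsistentHashIndex(r=17, capacity=3, positions=[5, 12])
for key in [7, 11, 72, 6, 19, 17, 23, 27, 18]:
    index.insert(key)
print(sorted(index.buckets.items()))
# [(0, [17]), (5, [72, 19, 18]), (8, [7, 6, 23]), (12, [11, 27])]
```

Inserting 23 overflows bucket 12 and creates bucket 8 (first case of the rule). Inserting 18 overflows bucket 5, whose predecessor 12 is larger, so the wrap-around case places the new bucket at 0.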
Summary of Consistent Hashing
‣ Often used for distributed file sharing: Web caching, Amazon S3
‣ Pros:
• Index size grows linearly
• Like extendible hashing, hash disruption is localized between new bucket and clockwise predecessor!
‣ Cons:
• Equality queries aren't O(1) anymore; they're O(log n)
- Uses a red-black tree to implement the ring.
- Bucket = tree node
• Ring size is fixed once defined.
- A maximum of r buckets
Bitmap Indexing (David's DB Research)
‣ Bitmaps are commonly used to index high-dimensional data
• Idea: Discretize search-key attributes into a set of "bins"
• Found in, e.g., Oracle, Apache Hive, FastBit, ...
‣ Advantages:
• Exploit CPU's fast bit-wise operations to resolve queries
• Highly compressible
• Paradox: query speedup proportional to compression ratio (How?)
Bitmap Example
Tuple  Name   Age  Salary   City       Favorite Drink  ...
t1     Sara   42   65,000   Beaverton  Ninkasi
t2     Julie  50   130,000  West Linn  Bridgeport
t3     Tom    21   25,000   Portland   Boneyard
...

Binning gives one bit-vector (column) per bin:

Name
Tuple  Sara?  Julie?  Tom?
t1     1      0       0
t2     0      1       0
t3     0      0       1
...

Age
Tuple  Age <= 25  25 < Age < 50  50 <= Age
t1     0          1              0
t2     0          1              0
t3     1          0              0
...
Bitmap Example (Cont.)
‣ Query processing is fast (sometimes)
• Find everyone named “Julie” who’s under 40 years of age
[Figure: Julie? (010...) & (Age <= 25 (001...) | 25 < Age < 50 (110...)) = 010...; bit-vectors the query does not need are pruned, and the set bits of the result identify the tuples to fetch from disk]
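The query can be replayed with plain integers standing in for bit-vectors. This is a minimal sketch: the three vectors are copied from the slide's tables (leftmost bit is t1), and the helper `matching_tuples` is a made-up name.

```python
NUM_TUPLES = 3

# Bit-vectors as shown on the slide; the leftmost bit belongs to tuple t1.
julie = int("010", 2)        # Name: Julie?
age_le_25 = int("001", 2)    # Age <= 25
age_25_50 = int("110", 2)    # 25 < Age < 50

# "Named Julie AND under 40": OR every age bin that overlaps the range below 40,
# then AND with the name vector. The 50 <= Age vector is never read (pruned).
candidates = julie & (age_le_25 | age_25_50)

def matching_tuples(bits, n):
    """Names t1..tn of the tuples whose bit is set (leftmost bit = t1)."""
    return [f"t{i + 1}" for i in range(n) if (bits >> (n - 1 - i)) & 1]

print(format(candidates, f"0{NUM_TUPLES}b"))       # 010
print(matching_tuples(candidates, NUM_TUPLES))     # ['t2']
```

Because the bin 25 < Age < 50 straddles 40, a set bit only marks a candidate. The tuple still has to be read from disk to check its actual age, which is why the speedup is only "sometimes".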
Bitmap Example (Cont.)
‣ Querying is fast (sometimes)
• Find everyone with a salary between $24,000 and $50,000
[Figure: Salary bit-vectors < 25,000 (001...) and >= 25,000 (110...); the range overlaps both bins, so 001... | 110... = 111... and every tuple goes to disk for checking]
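A small sketch of why this query degrades to a scan. The bin boundaries and bit-vectors are taken from the slide as shown; the `DISK` dictionary is a stand-in for the data file, and `range_query` is a hypothetical helper, not part of any real system.

```python
import math

# Salary bins as (low inclusive, high exclusive, bit-vector).
# Vectors are copied verbatim from the slide; the leftmost bit is t1.
BINS = [
    (0, 25_000, int("001", 2)),
    (25_000, math.inf, int("110", 2)),
]
DISK = {"t1": 65_000, "t2": 130_000, "t3": 25_000}   # stand-in for the data file
NUM_TUPLES = 3

def range_query(lo, hi):
    """Return (tuples fetched from disk, tuples that really match lo <= salary <= hi)."""
    candidates = 0
    for bin_lo, bin_hi, vector in BINS:
        if lo < bin_hi and hi >= bin_lo:              # bin overlaps [lo, hi]
            candidates |= vector
    ids = [f"t{i + 1}" for i in range(NUM_TUPLES)
           if (candidates >> (NUM_TUPLES - 1 - i)) & 1]
    # Candidate check: each set bit costs one tuple fetch.
    return ids, [t for t in ids if lo <= DISK[t] <= hi]

fetched, result = range_query(24_000, 50_000)
print(fetched)   # ['t1', 't2', 't3']
print(result)    # ['t3']
```

All three tuples are fetched to return one. With only two coarse bins the index prunes nothing for this range.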
Bitmap Takeaways
‣ Attribute-binning matters
• More ranges (more bit-vectors)
- Pros: Approximates exact-match
- Cons: Diminishing returns and size increases
– Vector per name in previous example seems like overkill
- Have to get the bin-ranges right (could add bit-vectors that don't add selectivity)
• Coarser ranges (fewer bit-vectors)
- Pros: Size
- Cons: Approximates file scan
– Worse, random disk access depending on tuple distribution in files
Bitmap Takeaways
‣ Bitmap size: O(mn) bits
• Troubling for high-dimensional (e.g., many attributes) data sets:
- n = attributes * discretization factor per attribute (the number of bit-vectors)
- m = tuples (observations, transactions, etc.) -- large!
• But comparison is fast: a bitwise operation processes w bits at a time
- To compare two vectors, we read words from cache or memory
- w = word size
- Comparing two vectors reads 2 × m/w words
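To get a feel for the two quantities, here is the arithmetic for one made-up configuration. The values of m, n, and w below are hypothetical and chosen only to make the numbers concrete.

```python
m = 100_000_000   # tuples (hypothetical)
n = 200           # bit-vectors, e.g. 20 attributes * 10 bins each (hypothetical)
w = 64            # word size in bits (hypothetical)

index_bits = m * n                      # uncompressed size: m bits per vector, n vectors
index_gib = index_bits / 8 / 2**30      # bits -> bytes -> GiB

words_per_vector = -(-m // w)           # ceil(m / w)
word_reads = 2 * words_per_vector       # one pass over each of the two operands

print(f"{index_gib:.2f} GiB")           # 2.33 GiB
print(word_reads)                       # 3125000
```

The index is far too large for cache, yet a single AND or OR of two vectors touches only a few million words.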
Bitmap Takeaways
‣ Opposing goals:
• Minimize disk scans (finer-grained bins)
• Still want bitmap index to fit in cache or core memory (coarser bins)
‣ But bit vectors are long and sparse... what if we compress each vector?
• Typically a bad idea (decoding overhead)
• (But if we didn’t have to decode...)
[Figure: the 50 <= Age bit-vector, 0000010010...]
Topics
‣ Indexing
‣ B+Trees
‣ Extendible Hashing
‣ Bitmaps
• Bitmap Compression
‣ Conclusion
Word-Aligned Hybrid Code (WAH)
‣ Allow both run-length and literal bit-strings to be encoded
• Code size is fixed to CPU-word length (32 bits in our examples)
‣ Allows for two types of code formats:
• Literal Word: 0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (flag bit 0, then 31 payload bits)
- Encodes 31 bits of literal string: xxxxxxx...x
- 01010101...10 is really: "1010101...10"
• Fill Word: 1Fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (flag bit 1, fill bit F, then a 30-bit count)
- Encodes a run of 31 * (xxxxxx...x) F-bits
- 10000...01111 is really: 31*15 (= 465) consecutive 0s
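The two word formats can be told apart with a couple of shifts and masks. The sketch below follows the 32-bit layout described above; the function name and the tuple it returns are choices made for this example.

```python
def decode_wah_word(word):
    """Decode one 32-bit WAH word.

    Returns ("literal", bits) with bits a 31-bit integer, or
    ("fill", fill_bit, run_length_in_bits).
    """
    if word >> 31 == 0:                    # flag bit 0: literal word
        return ("literal", word & 0x7FFFFFFF)
    fill_bit = (word >> 30) & 1            # F: the bit value being repeated
    groups = word & 0x3FFFFFFF             # low 30 bits: number of 31-bit groups
    return ("fill", fill_bit, 31 * groups)

print(decode_wah_word(0x8000000F))   # ('fill', 0, 465), the 10000...01111 example
print(decode_wah_word(0x00000005))   # ('literal', 5)
```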
Example: WAH Compression
[Figure: WAH compression of a raw bit-vector, in four stages]
‣ Raw bit-vector: 10100...1 (31 bits), then 1798 bits that are all 0s, then 111...1 (31 bits, all 1s)
‣ Organize in groups of 31 bits
‣ Merge neighboring groups into group-runs
• 31 bits of literal
• 58 groups of 31 zeros
• 31 bits of literal
‣ Encode as WAH words
• 010100000..1 (literal)
• 1000..111010 (fill: 58 groups of zeros)
• 011111111..1 (literal)
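The stages above map onto a short encoder. This is a sketch under assumptions: the input is a string of '0'/'1' characters whose length is a multiple of 31, runs are shorter than 2^30 groups, and a lone all-0 or all-1 group is kept as a literal, as the slide's last word suggests. `wah_encode` is a name chosen for this example.

```python
from itertools import groupby

def wah_encode(bits):
    """Compress a '0'/'1' string into a list of 32-bit WAH words."""
    if len(bits) % 31 != 0:
        raise ValueError("length must be a multiple of 31 in this sketch")
    # Stage 2: organize in groups of 31 bits.
    groups = [bits[i:i + 31] for i in range(0, len(bits), 31)]
    words = []
    # Stage 3: merge neighboring identical groups into group-runs.
    for group, run in groupby(groups):
        count = len(list(run))
        uniform = group in ("0" * 31, "1" * 31)
        if uniform and count > 1:
            # Stage 4, fill word: flag 1, fill bit F, run length in groups.
            words.append(0x80000000 | (int(group[0]) << 30) | count)
        else:
            # Stage 4, literal word(s): flag 0 followed by the 31 raw bits.
            words.extend([int(group, 2)] * count)
    return words

raw = "101" + "0" * 27 + "1" + "0" * (58 * 31) + "1" * 31   # 1860 bits
print([hex(w) for w in wah_encode(raw)])
# ['0x50000001', '0x8000003a', '0x7fffffff']
```

The 1860 raw bits shrink to three 32-bit words; 0x8000003a is the fill word 1000..111010 with run length 58.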
WAH Query Processing Example
‣ Consider taking a logical `&' over the following two vectors:
‣ 1860 bits (rows) are represented by each vector
[Figure: the two compressed operands of v1 & v2]
v1: 010100..0  100..111010  0111111..1
    lit (31 bits), 1798 zeros, lit (31 bits)
v2: 00010..01  001..110110  110..0111001  000..010100
    lit (31 bits), lit (31 bits), 1767 ones, lit (31 bits)
WAH Query Processing Example
‣ Step 1: Fetch first word from both vectors.
• These two are now known as "active words"
[Figure: active words are 010100..0 (v1) and 00010..01 (v2); result is still empty]
WAH Query Processing Example
‣ Step 2: Decode them. They’re both literal words!
• So apply the `&' directly between the words
[Figure: 010100..0 & 00010..01 = 000100..0, which becomes the first word of result]
WAH Query Processing Example
‣ Step 3: Fetch next two words
‣ Step 4: Decode: one fill-word and one literal word!
[Figure: v1's active word 100..111010 is a run of 58*31 (= 1798) consecutive zeros; v2's active word 001..110110 is a 31 bit literal; result so far: 000100..0]
WAH Query Processing Example
‣ Step 4a: Apply bit-wise & with the fill-bit (0)
• (This step demonstrates importance of 31-bit groupings: a literal word lines up with exactly one 31-bit group of the fill)
[Figure: one group of v1's run of 58*31 (= 1798) consecutive zeros & the 31 bit literal 001..110110 = 000000..0; result so far: 000100..0 000000..0]
WAH Query Processing Example
‣ Step 5: v1 fill-word is still active, so only fetch next word from v2
• We've expended 31 zeros from v1's fill word! Now 57*31 (= 1767) consecutive zeros remain
• v2's next word, 110..0111001, is a fill of 57*31 (= 1767) ones
[Figure: 57 groups of zeros & 57 groups of ones = the zero-fill word 10..0111001; result so far: 000100..0 000000..0 10..0111001]
WAH Query Processing Example
‣ After step 5: v1's and v2's fill-words no longer active, so fetch next words from both
• They're both literals, so apply the & directly again
[Figure: 0111111..1 & 000..010100 = 000..010100; final result: 000100..0 000000..0 10..0111001 000..010100]
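The walkthrough can be condensed into a loop that never decompresses a fill. This is a sketch, not FastBit's actual code. The word values mirror the slides, but the middle bits that the slides elide with ".." are assumed to be 0 purely for illustration, and all names are made up.

```python
ALL_ONES = 0x7FFFFFFF

def active_words(words):
    """Decode each WAH word to [groups_left, 31-bit pattern, is_fill]."""
    for w in words:
        if w >> 31:                                   # fill word
            yield [w & 0x3FFFFFFF, ALL_ONES if (w >> 30) & 1 else 0, True]
        else:                                         # literal word = one group
            yield [1, w, False]

def wah_and(v1, v2):
    """AND two WAH word lists that cover the same number of bits."""
    it1, it2 = active_words(v1), active_words(v2)
    a = b = None
    result = []
    while True:
        # Fetch a new word only from a vector whose active word is used up.
        if a is None or a[0] == 0:
            a = next(it1, None)
        if b is None or b[0] == 0:
            b = next(it2, None)
        if a is None or b is None:
            return result
        pattern = a[1] & b[1]
        if a[2] and b[2]:                             # two fills: consume the shorter run
            n = min(a[0], b[0])
            result.append(0x80000000 | ((pattern & 1) << 30) | n)
        else:                                         # a literal is involved: one group
            n = 1
            result.append(pattern)
        a[0] -= n
        b[0] -= n

# Elided middle bits assumed to be 0 (illustration only).
v1 = [0x50000000, 0x8000003A, 0x7FFFFFFF]               # lit, 58 zero groups, lit
v2 = [0x10000001, 0x20000036, 0xC0000039, 0x00000014]   # lit, lit, 57 one groups, lit
print([hex(w) for w in wah_and(v1, v2)])
# ['0x10000000', '0x0', '0x80000039', '0x14']
```

The four output words correspond to 000100..0, 000000..0, the zero fill 10..0111001, and 000..010100 from the slides. The 57 remaining groups of both fills are handled in one iteration, which is where the speedup from compression comes from.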
In Conclusion...
‣ Ordered Index vs. Hash Index: Which Is Better?
‣ Depends on workload characteristics!
• Cost of periodic index/file reorganization
• Frequency of insertions and deletes
• What are the dominant "SELECT" query types?
- Lots of range queries: Use ordered indices (B+Tree)
- Lots of equality queries: Use hash index
- Or, use both!
‣ Most databases implement both. Choice left up to DBA.
• Pro tip: Be an informed DBA
In Conclusion... (Cont.)
‣ Are there other index structures? You bet!
• R-Trees (For geospatial 2D data)
• KD-Trees (For multi-dimensional data)
• Grid Files (For multi-dimensional data)
• Linear Hashing (Extendible hashing's cousin)
• Consistent Hashing (used in P2P networks, like BitTorrent)
‣ So many more... (large area of research, even still)
Administrivia 11/18
‣ Reminders:
• Hwk 6 (Joins!) due tonight
• Offline team meetings this week
• Capstone party tomorrow (4-6pm) in this room
‣ Last time: B+Tree multi-level index
• Properties
• Excels for
- Exact match (point) queries
- Range queries
Administrivia 11/20
‣ Projects
• No more homeworks! Focus on project.
‣ Meetings tomorrow
• Olivia/Skye/Spencer/Ana-Lea: 11p
• Katrina/Daniel/Matthew: 12p
• Brody/Lukas/Aaron: 3p
• Ethan/Maddy/Ricardo: 3:30p
• Still nothing set up with (Montana, Ashton, Jiman).