CS 455: Principles of Database Systems - 7 - Indexing and Hashing

Topics

‣ Indexing

‣ B+Trees

‣ Extendible Hash Index

‣ Consistent Hash Index

‣ Bitmap Index

‣ Conclusion


Consistent Hashing

‣ Problem: Extendible hash index grows at an exponential rate
  • Caused by increasing the modulus of the hash function (the directory doubles each time it grows)
‣ Consistent Hashing
  • Back to a static hash function
    - h(k) = k % r
  • View "key space" as a ring
    - Goes from 0 to r-1
    - Blocks/Buckets stored on the ring

[Figure: hash ring with positions running from 0 to r-1 and buckets at positions i, j, and k]

Consistent Hashing: Insert

‣ Inserting a tuple with key k:
  • Use h(k) to hash k onto the ring
  • If h(k) is assigned a bucket:
    - Insert tuple in bucket if space allows
  • Otherwise:
    - Move clockwise to the first bucket you find
    - Insert tuple in bucket if space allows
‣ What if bucket runs out of space?
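The lookup half of the insert routine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "clockwise" is taken to mean increasing ring position with wrap-around past r-1, and the names `find_bucket` and `bucket_positions` are hypothetical, not from the slides. Overflow handling is left out here.

```python
from bisect import bisect_left

def h(k, r):
    """Static hash function from the slide: h(k) = k % r."""
    return k % r

def find_bucket(bucket_positions, k, r):
    """Return the ring position of the bucket that should hold key k.

    bucket_positions must be a sorted list of distinct ring positions.
    """
    pos = h(k, r)
    # First bucket at or after pos; bisect_left keeps an exact hit where it is.
    i = bisect_left(bucket_positions, pos)
    # Running off the end means we passed r-1, so wrap around to the start.
    return bucket_positions[i % len(bucket_positions)]

buckets = [5, 12]                      # hypothetical buckets on an r = 17 ring
print(find_bucket(buckets, 7, 17))     # h = 7  -> 12
print(find_bucket(buckets, 72, 17))    # h = 4  -> 5
print(find_bucket(buckets, 30, 17))    # h = 13 -> wraps around to 5
```

Keeping the bucket positions sorted is what makes the clockwise search a binary search rather than a walk around the ring.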

[Figure: the same hash ring, positions 0 to r-1, with buckets at i, j, and k]

Example

‣ Show the hash ring for an r = 17 ring.
  • Blocks hold 3 tuples
  • Initially, two empty blocks on 5 and 12
  • Insert: 7, 11, 72, 6, 19, 17, 23, 27, 18

[Figure: ring with positions 0 to 16 and empty blocks at 5 and 12]

Consistent Hashing: Overflow

‣ When bucket overflows, split:
  • Create new bucket
  • Place it here:

\[
B_{new} =
\begin{cases}
\left\lfloor \dfrac{\mathrm{pred}(B_k) + B_k}{2} \right\rfloor, & \text{if } B_k > \mathrm{pred}(B_k) \\[6pt]
\left( \mathrm{pred}(B_k) + \left\lfloor \dfrac{r - \mathrm{pred}(B_k) + B_k}{2} \right\rfloor \right) \,\%\, r, & \text{otherwise}
\end{cases}
\]

    - Where pred(B) is the clockwise predecessor of bucket B.
  • Rehash all tuples in the overflowed bucket
  • Attempt insert again, repeating split routine if necessary.

[Figure: ring with buckets labeled B_k and B_new]


Summary of Consistent Hashing

‣ Often used for distributed file sharing: Web caching, Amazon S3

‣ Pros:
  • Index size grows linearly
  • Like extendible hashing, hash disruption is localized between new bucket and clockwise predecessor!
‣ Cons:
  • Equality queries aren't O(1) anymore; they're O(log n)
    - Uses a red-black tree to implement the ring.
    - Bucket = tree node
  • Ring size is fixed once defined.
    - A maximum of r buckets


Topics

‣ Indexing

‣ B+Trees

‣ Extendible Hash Index

‣ Bitmap Index

‣ Conclusion


Bitmap Indexing (David's DB Research)

‣ Bitmaps are commonly used to index high-dimensional data
  • Idea: Discretize search-key attributes into a set of "bins"
  • Used in practice (e.g., Oracle, Apache Hive, FastBit, ...)
‣ Advantages:
  • Exploit CPU's fast bit-wise operations to resolve queries
  • Highly compressible
  • Paradox: query speedup proportional to compression ratio (How?)


Bitmap Example

Tuple  Name   Age  Salary   City       Favorite Drink  ...
t1     Sara   42   65,000   Beaverton  Ninkasi
t2     Julie  50   130,000  West Linn  Bridgeport
t3     Tom    21   25,000   Portland   Boneyard
...

Binning:

Name
Sara?  Julie?  Tom?
1      0       0
0      1       0
0      0       1
...

Age
Age <= 25  25 < Age < 50  50 <= Age
0          1              0
0          1              0
1          0              0
...


Bitmap Example (Cont.)

‣ Query processing is fast (sometimes)
  • Find everyone named “Julie” who’s under 40 years of age

[Figure: the Name and Age bit-vectors from the previous slide. Vectors the query does not need are marked (prune). The vectors 010..., 001..., and 110... are combined with bitwise operators (!, &, |) to give 010..., which is sent "To Disk".]
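As a rough illustration of how such a query can be resolved with bitwise operations, the Python sketch below uses the three bit-vectors that appear in the figure (010, 001, 110) as plain integers. Reading them as Julie?, the lowest age bin, and the middle age bin is an interpretation of the figure, and the helper `tuples_set` is a hypothetical name introduced here.

```python
# Bit-vectors copied from the slide's figure, one bit per tuple (t1 t2 t3).
N_TUPLES = 3
name_julie = 0b010       # Julie?
age_low = 0b001          # lowest age bin
age_mid = 0b110          # 25 < Age < 50
age_high = 0b000         # 50 <= Age (pruned: it can never satisfy "under 40")

def tuples_set(vector, n=N_TUPLES):
    """List 1-based tuple numbers whose bit is set; the leftmost bit is t1."""
    return [i + 1 for i in range(n) if (vector >> (n - 1 - i)) & 1]

# Ages under 40 can fall in either of the first two bins, so OR them,
# then AND with the Julie? vector.
candidates = name_julie & (age_low | age_mid)
print(format(candidates, "03b"))   # 010
print(tuples_set(candidates))      # [2]
```

The result is only a candidate set: the middle bin also covers ages 40 to 49, so the matching tuple still has to be fetched from disk and checked, which is the "To Disk" arrow in the figure.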


Bitmap Example (Cont.)

‣ Querying is fast (sometimes)
  • Find everyone making between $24,000 and $50,000 salary

Salary
< 25,000  >= 25,000
0         1
0         1
1         0
...

[Figure: the two Salary bit-vectors 001... and 110... are combined with bitwise operators (!, |) to give 111...; every tuple is sent "To Disk".]


Bitmap Takeaways

‣ Attribute-binning matters
  • More ranges (more bit-vectors)
    - Pros: Approximates exact-match
    - Cons: Diminishing returns and size increases
      – Vector per name in previous example seems like overkill
    - Have to get the bin-ranges right (could add bit-vectors that don't add selectivity)
  • Coarser ranges (fewer bit-vectors)
    - Pros: Size
    - Cons: Approximates file scan
      – Worse, random disk access depending on tuple distribution in files


Bitmap Takeaways

‣ Bitmap size: O(mn)
  • Troubling for high-dimensional (e.g., many attributes) data sets:
    - n = Attributes * discretization factor per attribute
    - m = tuples (observations, transactions, etc.) -- large!
  • But comparison is fast: bitwise operation processes w-bits at a time
    – To compare two vectors, we read 2 × m/w words from cache or memory
    – w = word size
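To get a feel for these two quantities, the short Python calculation below plugs in some numbers. The values of m, n, and w are hypothetical, chosen only for illustration, and the ceiling in `words_read_to_compare` is an added detail for the case where m is not a multiple of w.

```python
def bitmap_index_bits(m, n):
    """Uncompressed index size: one m-bit vector per bin, so m * n bits."""
    return m * n

def words_read_to_compare(m, w):
    """Words fetched to combine two m-bit vectors: 2 * ceil(m / w)."""
    return 2 * -(-m // w)

m = 1_000_000_000   # hypothetical: one billion tuples
n = 10 * 20         # hypothetical: 10 attributes, 20 bins each
w = 64              # hypothetical word size in bits

print(bitmap_index_bits(m, n) / 8 / 1e9, "GB uncompressed")
print(words_read_to_compare(m, w), "words read per two-vector operation")
```

With these numbers the uncompressed index is 25 GB, far too large for cache, while a single two-vector operation touches about 31 million words.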


Bitmap Takeaways

‣ Opposing goals:
  • Minimize disk scans (finer-grained bins)
  • Still want bitmap index to fit in cache or core memory (coarser bins)
‣ But bit vectors are long and sparse... what if we compress each vector?
  • Typically a bad idea (decoding overhead)
  • (But if we didn’t have to decode...)

[Figure: a sparse bit-vector for the bin 50 <= Age: 0000010010...]


Topics

‣ Indexing

‣ B+Trees

‣ Extendible Hashing

‣ Bitmaps
  • Bitmap Compression

‣ Conclusion


Word-Aligned Hybrid Code (WAH)

‣ Allow both run-length and literal bit-strings to be encoded
  • Code size is fixed to CPU-word length (32-bits in our examples)
‣ Allows for two types of code formats:
  • Literal Word: 0xxxxxxx...x (flag bit 0, then 31 bits x)
    - Encodes 31 bits of literal string: xxxxxxx...x
    - 01010101...10 is really: "1010101...10"
  • Fill Word: 1Fxxxxxx...x (flag bit 1, fill bit F, then a 30-bit count x)
    - Encodes a run of 31 * (xxxxxx...x) F-bits
    - 10000...01111 is really: 31*15 (= 465) consecutive 0s
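The two word formats can be written down as a pair of small encode/decode helpers. The Python sketch below assumes the 32-bit layout described above (flag bit, then either 31 literal bits or a fill bit plus a 30-bit group count); the function names `literal_word`, `fill_word`, and `decode` are hypothetical.

```python
WORD_BITS = 32
GROUP_BITS = WORD_BITS - 1          # 31 payload bits per literal word
FILL_FLAG = 1 << (WORD_BITS - 1)    # top bit: 1 = fill word, 0 = literal word
FILL_BIT = 1 << (WORD_BITS - 2)     # second bit of a fill word: the repeated bit F
COUNT_MASK = FILL_BIT - 1           # low 30 bits: number of 31-bit groups in the run

def literal_word(bits31):
    """Encode 31 literal bits; the flag bit is 0, so the value is unchanged."""
    assert 0 <= bits31 < (1 << GROUP_BITS)
    return bits31

def fill_word(fill_bit, groups):
    """Encode a run of `groups` 31-bit groups that are all `fill_bit`."""
    assert fill_bit in (0, 1) and 0 < groups <= COUNT_MASK
    return FILL_FLAG | (fill_bit << (WORD_BITS - 2)) | groups

def decode(word):
    """Return (kind, value, bits covered) for a 32-bit WAH word."""
    if word & FILL_FLAG:
        bit = 1 if word & FILL_BIT else 0
        return ("fill", bit, (word & COUNT_MASK) * GROUP_BITS)
    return ("literal", word, GROUP_BITS)

print(format(fill_word(0, 15), "032b"))
print(decode(fill_word(0, 15)))     # ('fill', 0, 465)
```

The fill word printed here is the slide's 10000...01111 example: a count of 15 groups, that is 465 consecutive 0s.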


Example: WAH Compression

[Figure: WAH compression of a raw bit-vector, in three steps]

Raw bit-vector: 10100...1 (31 bits), then 0...0 (1798 bits, all 0s), then 111...1 (31 bits, all 1s)

1. Organize in groups of 31-bits
2. Merge neighboring groups into group-runs:
   31-bits of literal | 58 groups of 31-zeros | 31-bits of literal
3. Encode as WAH word:
   010100000..1 | 1000..111010 | 011111111..1
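The three steps in the figure can be sketched as a small Python function. This is an illustrative sketch rather than FastBit's actual implementation: the name `wah_compress` is hypothetical, the input is assumed to be a list of 0/1 values whose length is a multiple of 31, run counts are assumed to fit in 30 bits, and a lone all-0 or all-1 group is left as a literal so that the output matches the figure (other implementations may emit a fill word of length 1 instead).

```python
GROUP_BITS = 31
FILL_FLAG = 1 << 31
ALL_ONES = (1 << GROUP_BITS) - 1

def wah_compress(bits):
    """Compress a list of 0/1 values whose length is a multiple of 31."""
    assert len(bits) % GROUP_BITS == 0
    # Step 1: organize into 31-bit groups, each stored as an integer.
    groups = [int("".join(map(str, bits[i:i + GROUP_BITS])), 2)
              for i in range(0, len(bits), GROUP_BITS)]
    words, i = [], 0
    while i < len(groups):
        g = groups[i]
        # Step 2: measure the run of identical all-0 or all-1 groups.
        run = 1
        if g in (0, ALL_ONES):
            while i + run < len(groups) and groups[i + run] == g:
                run += 1
        # Step 3: emit a fill word for a run of 2+ groups, else a literal.
        if run >= 2:
            fill_bit = 1 if g == ALL_ONES else 0
            words.append(FILL_FLAG | (fill_bit << 30) | run)
        else:
            words.append(g)
        i += run
    return words

# The figure's vector: a 31-bit literal, 1798 zeros, then 31 ones (1860 bits).
raw = [1, 0, 1] + [0] * 27 + [1] + [0] * 1798 + [1] * 31
for word in wah_compress(raw):
    print(format(word, "032b"))
```

The 1860 raw bits shrink to three 32-bit words, and the middle word carries the count 58 (binary 111010) seen in the figure.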


WAH Query Processing Example

‣ Consider taking a logical `&' over the following two vectors:

v1: 010100..0 | 100..111010 | 0111111..1
    lit (31 bits), fill (1798 zeros), lit (31 bits)

v2: 00010..01 | 001..110110 | 110..0111001 | 000..010100
    lit (31 bits), lit (31 bits), fill (1767 ones), lit (31 bits)

‣ 1860 bits (rows) are represented by each vector

v1 & v2 = ?


WAH Query Processing Example

‣ Step 1: Fetch first word from both vectors.
  • These two are now known as "active words"

(Active words are shown in brackets.)
v1: [010100..0] 100..111010 0111111..1
v2: [00010..01] 001..110110 110..0111001 000..010100
result: (empty)


WAH Query Processing Example

‣ Step 2: Decode them. They’re both literal words!
  • So apply the `&' directly between the words

010100..0 & 00010..01 = 000100..0

v1: [010100..0] 100..111010 0111111..1
v2: [00010..01] 001..110110 110..0111001 000..010100
result: 000100..0


WAH Query Processing Example

‣ Step 3: Fetch next two words
‣ Step 4: Decode: one fill-word and one literal word!
  • v1: 100..111010 is a run of 58*31 (= 1798) consecutive zeros
  • v2: 001..110110 is a 31 bit literal

v1: 010100..0 [100..111010] 0111111..1
v2: 00010..01 [001..110110] 110..0111001 000..010100
result: 000100..0


WAH Query Processing Example

‣ Step 4a: Apply bit-wise & with the fill-bit (0)
  • (This step demonstrates importance of 31-bit groupings)

31 zeros taken from v1's run of 58*31 (= 1798) consecutive zeros
& 001..110110 (v2's 31 bit literal)
= 000000..0

v1: 010100..0 [100..111010] 0111111..1
v2: 00010..01 [001..110110] 110..0111001 000..010100
result: 000100..0 000000..0


WAH Query Processing Example

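‣ Step 5: v1 fill-word is still active, so only fetch next word from v2
  • We've expended 31 zeroes from this fill word! Now 57*31 (= 1767) consecutive zeros
  • v2's next word, 110..0111001, is a fill word of 57*31 (= 1767) ones
  • Both active words are fill words, so the & gives a fill word of 57*31 zeros: 10..0111001

v1: 010100..0 [100..111010] 0111111..1
v2: 00010..01 001..110110 [110..0111001] 000..010100
result: 000100..0 000000..0 10..0111001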


WAH Query Processing Example

‣ After step 5: v1's and v2's fill-words no longer active, so fetch next words from both
  • They're both literals, so apply the & directly again

v1: 010100..0 100..111010 [0111111..1]
v2: 00010..01 001..110110 110..0111001 [000..010100]
result: 000100..0 000000..0 10..0111001 000..010100
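The whole walkthrough can be condensed into a short Python sketch that ANDs two WAH-compressed vectors without decompressing them. It assumes the 32-bit word layout used in these slides and that both vectors cover the same number of bits; `load` and `wah_and` are hypothetical names. The slides elide the middle bits of each literal (".."), so the concrete literal values below are made-up stand-ins that only keep the visible leading and trailing bits and put 0s in between.

```python
FILL_FLAG = 1 << 31            # top bit set: fill word
FILL_BIT = 1 << 30             # the bit a fill word repeats
COUNT_MASK = FILL_BIT - 1      # low 30 bits: run length in 31-bit groups
ALL_ONES = (1 << 31) - 1       # a 31-bit group of 1s

def load(word):
    """Decode a word into a mutable [is_fill, group_value, groups_left]."""
    if word & FILL_FLAG:
        return [True, ALL_ONES if word & FILL_BIT else 0, word & COUNT_MASK]
    return [False, word, 1]

def wah_and(v1, v2):
    out, i, j = [], 0, 0
    a, b = None, None                      # the two active words
    while True:
        if a is None or a[2] == 0:         # v1's active word is used up
            if i == len(v1):
                break
            a, i = load(v1[i]), i + 1
        if b is None or b[2] == 0:         # v2's active word is used up
            if j == len(v2):
                break
            b, j = load(v2[j]), j + 1
        if a[0] and b[0]:
            n = min(a[2], b[2])            # two fills: consume the overlap at once
            bit = 1 if a[1] & b[1] else 0
            out.append(FILL_FLAG | (bit << 30) | n)
        else:
            n = 1                          # a literal is involved: one 31-bit group
            out.append(a[1] & b[1])
        a[2] -= n
        b[2] -= n
    return out

# Hypothetical stand-ins for the slide's words (elided bits set to 0).
v1 = [0b101 << 28, FILL_FLAG | 58, ALL_ONES]
v2 = [(1 << 28) | 1, (1 << 29) | 0b110110, FILL_FLAG | FILL_BIT | 57, 0b10100]
for word in wah_and(v1, v2):
    print(format(word, "032b"))
```

Like the slides, this sketch does not merge the all-zero literal from step 4a into the zero fill that follows it, so the result has four words. The fill-against-literal case is why the 31-bit grouping matters: one group of a fill lines up exactly with one literal word.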


Topics

‣ Indexing

‣ B+Trees

‣ Extendible Hashing

‣ Bitmaps

‣ Conclusion


In Conclusion...

‣ Ordered Index vs. Hash Index: Which Is Better?

‣ Depends on workload characteristics!
  • Cost of periodic index/file reorganization
  • Frequency of insertions and deletes
  • What are the dominant "SELECT" query types?
    - Lots of range queries: Use ordered indices (B+-Tree)
    - Lots of equality queries: Use hash index
    - Or, use both!
‣ Most databases implement both. Choice left up to DBA.
  • Pro tip: Be an informed DBA
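The range-versus-equality trade-off can be seen in miniature with two in-memory stand-ins, sketched below in Python. A dict plays the role of the hash index and a sorted list with binary search plays the role of the ordered index; neither is a real B+-tree or disk-based hash file, and the helper names `equality` and `age_range` are hypothetical. The rows reuse the names and ages from the bitmap example.

```python
from bisect import bisect_left, bisect_right

rows = [(21, "Tom"), (42, "Sara"), (50, "Julie")]   # (age, name)

# Hash index stand-in: direct lookup by key, but no useful key order.
hash_index = {}
for age, name in rows:
    hash_index.setdefault(age, []).append(name)

# Ordered index stand-in: keys kept sorted, like the leaf level of a B+-tree.
ordered_index = sorted(rows)
ages = [age for age, _ in ordered_index]

def equality(age):
    """Equality query: one hash probe."""
    return hash_index.get(age, [])

def age_range(lo, hi):
    """Range query lo <= age <= hi: two binary searches, then a contiguous scan."""
    start, end = bisect_left(ages, lo), bisect_right(ages, hi)
    return [name for _, name in ordered_index[start:end]]

print(equality(42))          # ['Sara']
print(age_range(20, 45))     # ['Tom', 'Sara']
```

Answering the range query from the dict alone would mean probing every possible key or scanning all entries, which is why the ordered structure is preferred for range-heavy workloads.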


In Conclusion... (Cont.)

‣ Are there other index structures? You bet!
  • R-Trees (For geospatial 2D data)
  • KD-Trees (For multi-dimensional data)
  • Grid Files (For multi-dimensional data)
  • Linear Hashing (Extendible hashing's cousin)
  • Consistent Hashing (used in P2P networks, like BitTorrent)

‣ So many more... (large area of research, even still)


Administrivia 11/18

‣ Reminders:
  • Hwk 6 (Joins!) due tonight
  • Offline team meetings this week
  • Capstone party tomorrow (4-6pm) in this room
‣ Last time.. B+Tree multi-level index
  • Properties
  • Excels for
    - Exact match (point) queries
    - Range queries


Administrivia 11/20

‣ Projects
  • No more homework! Focus on project.
‣ Meetings tomorrow
  • Olivia/Skye/Spencer/Ana-Lea: 11p
  • Katrina/Daniel/Matthew: 12p
  • Brody/Lukas/Aaron: 3p
  • Ethan/Maddy/Ricardo: 3:30p
  • Still nothing set up with (Montana, Ashton, Jiman).
