47
hashing prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ class 18

class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

Embed Size (px)

Citation preview

Page 1: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

hashingprof. Stratos Idreos

HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

class 18

Page 2: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 2

professors(id,name,…)

courses(id,name, profId,…)

students(id,name,…)

database

give me all students enrolled in cs165select student.name from students, enrolled, courses where courses.name=“cs165” and enrolled.courseId=course.id and student.id=enrolled.studentId

enrolled(studentId,

courseId,…) foreign key

join

Page 3: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 3

new resL[]; new resR[]; k=0 for (i=0;i<L.size;i=i++) for (j=0;j<R.size;j++) if L[i]==R[j] resL[k]=i resR[k++]=j

nested loops

L R

Page 4: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 4

hash join

hash

join input 1 hash table

hash

join input 2

Page 5: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 5

hash table

bucket

val (key),pos

hash

h=f(val)bucket=h mod k

k bucketsN keys

[(val1,pos1), (val7,pos7), …]

Page 6: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 6

stream input

p1 p2 p3 p4

1. read input into stream buffer, hash and write to respective partition buffer 2. when input buffer is consumed, bring the next one 3. when a partition buffer is full, write to L2 we can partition into L1-1 pieces in one pass

partition

p1

p2 p3

p4

Level2

Level 1

Page 7: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 7

grace hash join

hash partitioning

then join each pair of partitionsindependently in memory

join input 1 join input 2

Page 8: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 8

stream input left

p1 right hash table result

probe

p1 left

p1 right

result

L1-2

Level2

Level 1

as long as at least one of the pieces <=L1-2

Page 9: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 9

grace hash join

hash partitioning

apply recursively if a partition does not fit in memory (L1-2)

Page 10: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 10

today: how to create the hash table

Page 11: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 11

selectR.C

selectS.F

join

fetchR.A

fetchS.A

maxR.D

minS.G

Page 12: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 12

static hashing

hash table

bucket

updates may create long lists

can this happen anyway, i.e., even with no updates

Page 13: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 13

simple solution would be to double the hash table and rehash everything…

but…

Page 14: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 14

dynamic hashing

extendible hashing - linear hashing

Page 15: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 15

extendible hashing

directory bucketswhen there is no more space

split individual bucketsand if needed double the directory

hash table

with directory in memory we need 1 I/O to fetch the bucket

not true from memory to caches in general

Page 16: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 16

directory maps hash values to buckets

h=hash(key)

use binary form of h use X last bits to map 2^x buckets

00 01 10 11

for X=2000 001 010 011 100 101 110 111

for X=3

extending the directory …

Page 17: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2buckets

full

Page 18: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 19: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 20: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 21: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 22: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 23: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

Page 24: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

new

split

Page 25: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 17

00 01 10 11

dic. X=2

000 001 010 011 100 101 110 111

for X=3

buckets

buckets

full

newextend directory only when no split is possible

cannot split

can split

split

Page 26: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 18

linear hashingdo not use a directory (indirection cost)

instead incrementally extend the hash table one bucket at a time

Page 27: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

next bucket to split00

01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

Page 28: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

next bucket to split00

01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

*

Page 29: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

next bucket to split00

01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

*full

Page 30: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

next bucket to split00

01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

*full

Page 31: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

next bucket to split00

01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

*full

Page 32: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 19

e.g.,use 2 bits of hash value (h0)

00 01 10 11

dictionary is implicit when insert causes

overflow somewhere split next bucket

use +1 bit of hash value

next bucket to split

*full

Page 33: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 20

000 001 010 011 100 101 110 111

for X=3buckets

newdictionary is

implicit

00 01 10 11

dictionary is implicit

next bucket to split

for X=2

search: use current hash function (bits) if after split bucket ok else use next hash function (bits+1)

Page 34: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 21

000 001 010 011 100 101 110 111

for X=3buckets

C

00 01 10 11

next bucket to split

for X=2

keep splitting next bucket with each overflow until all buckets are split (for x=2) then restart for x=3

any problemswhat would happen

when we split bucket C

Page 35: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 22

selectR.C

selectS.F

join

fetchR.A

fetchS.A

maxR.D

minS.G

What can we do to start working immediately? (hint: vectorized processing)

What can we do if we wait for all data to arrive? (hint: bulk processing)

Page 36: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 2323

symmetric hash joinL R

while there exist buffered values from L, hash and

probe HT of R

1) hash

2) probe

3) output

when buffer is empty, switch!

Page 37: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 2323

symmetric hash joinL R

while there exist buffered values from R, hash and

probe HT of L

2) probe1) hash

3) output

when buffer is empty, switch!

Page 38: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 2424

static hashing with 2 passeskeys

hash

count size of each bucket

phase 1

hash tablekeys

hash

phase 2

we now exactlywhat

hash tableto build

Page 39: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 25

keys

hash

count size of each bucket

phase 1

size1 size2 size3

… sizeK

array of k slots where k the buckets we want to have

for each bucket we know its size

D

Page 40: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 26

hash tablekeys

hash

phase 2

we now exactlywhat

hash tableto build

size1 size2 size3

… sizeK

0 D[0]+s1 D[1]+s2 D[2]+s3

D D

pass D and sum all counts to get offsets in a sequentially stored hash table

Page 41: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 27

can be done with binary ops only

use X LSBs

h=a*key+bb=h mod K

which hash function should we use?

Page 42: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 28

what happens after the join?

selectR.C

selectS.F

join

fetchR.A

fetchS.A

maxR.D

minS.G

select max(R.D),min(S.G) from R,S where R.A=S.A and R.C<10 and S.F>30

block operator

access patterns

Page 43: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 29

select R.A, R.B, R.C, S.A, S.B, S.C from R, S where R.J=S.J and …

R.* scan

pos R.Jfetch

posR.J

ordered sparse

ordered sparse

preparing the R join input

same for the S join input

we need the original positions so we can fetch other R columns after the join

Page 44: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 30

join input R.J R.J + posR

join input S.J S.J + posS partition (=reorder)

both join inputs to join

join result posR + posS

both sides are unordered

join

ordered sparse

ordered sparse

select R.A, R.B, R.C, S.A, S.B, S.C from R, S where R.J=S.J and …

Page 45: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 31

join result posR + posS

cluster on R

join result clustered on R posR + posS

fetch R payload with sequential pattern

1 2 3 4

ID + posS

cluster on S4 2 3 1

ID + posS

clus

tere

d

clus

tere

d fetch S payload with sequential pattern

4 2 3 1

e.g., ID + S.A

sort on ID decluster

all result columns aligned

radix declustering

both sides are unordered

select R.A, R.B, R.C, S.A, S.B, S.C from R, S where R.J=S.J and …

Cache-Conscious Radix Decluster ProjectionsBy S. Manegold, P. Boncz, N. Nes, and M. Kersten Very Large Databases Conference, 2004

Page 46: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

/33CS165, Fall 2015 Stratos Idreos 32

Textbook Chapter 11

Cache-Conscious Radix Decluster ProjectionsS. Manegold, P. Boncz, N. Nes, and M. Kersten Very Large Databases Conference, 2004

Page 47: class 18 hashing - daslab.seas.harvard.edudaslab.seas.harvard.edu/classes/cs165/doc/class_slides/Class18CS... · extendible hashing directory buckets when there is no more space split

DATA SYSTEMSprof. Stratos Idreos

class 18

hashing