CENG 351: Hashing for files

Hashing for filessaksagan.ceng.metu.edu.tr/.../week6_Hashing_section1.pdf · 2009. 11. 24. · Extendible Hashing:Intr. Extendible hashing does not have chains of buckets, contrary


Introduction

  • Idea: reference items in a table directly by doing arithmetic operations that transform keys into table addresses.
  • Steps:
      1. Compute a hash function that transforms the search key into a table address.
      2. Apply collision resolution to deal with keys that hash to the same table address.
  • Hashing is a good example of a time-space tradeoff.
  • It is a classical computer science problem.

Motivation

  • The primary goal is to locate the desired record in a single disk access.
      • Sequential search: $O(N)$
      • B+ trees: $O(\log_k N)$
      • Hashing: $O(1)$
  • In hashing, the key of a record is transformed into an address, and the record is stored at that address.
  • Hash-based indexes are best for equality selections. They cannot support range searches.
  • Static and dynamic hashing techniques exist.

Hashing-based Index

  • Data entries are kept in buckets (an abstract term).
  • Each bucket is a collection of primary data pages and zero or more overflow pages.
  • Given a search key value k, we can find the bucket where the data entry k* is stored as follows:
      • Use a hash function, denoted by h().
      • The value of h(k) is the address of the bucket where k* is located.
      • h(k) should distribute the search key values uniformly over the collection of buckets.

Example for hash mapping

  • h(k) = k * 100, rounded to the nearest integer or truncated.

Design Factors

  • Bucket size: the number of records that can be held at the same address.
  • Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets, measured in number of records.
  • Good hash function: should distribute the keys evenly among the addresses.
  • Overflow resolution technique: must be effective.

Example simple hash functions

  • Key mod N: N is the size of the table; it is better if N is prime.
  • Folding: e.g. 123|456|789: add the parts and take the sum mod the table size.
  • Truncation: e.g. map 123456789 to a table of 1000 addresses by picking 3 digits of the key.
  • Squaring: square the key and then truncate.
  • Radix conversion: e.g. treat 1 2 3 4 as a base-11 number; truncate if necessary.
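
The five functions listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of keeping the last digits when truncating, and the 3-digit part length for folding are assumptions, not something the slides prescribe.

```python
def mod_hash(key: int, n: int) -> int:
    """Key mod N, where n is the table size (preferably prime)."""
    return key % n


def folding_hash(key: int, n: int, part_len: int = 3) -> int:
    """Cut the decimal digits into parts, add the parts, take the sum mod n."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % n


def truncation_hash(key: int, n_digits: int = 3) -> int:
    """Keep n_digits decimal digits of the key (assumption: the last ones)."""
    return key % 10 ** n_digits


def squaring_hash(key: int, n_digits: int = 3) -> int:
    """Square the key, then truncate (assumption: keep the last n_digits)."""
    return (key * key) % 10 ** n_digits


def radix_hash(key: int, n: int, base: int = 11) -> int:
    """Read the decimal digits of the key as a base-11 number, then reduce mod n."""
    return int(str(key), base) % n


print(folding_hash(123456789, 1000))  # 123 + 456 + 789 = 1368 -> 368
print(truncation_hash(123456789))     # 789
print(radix_hash(1234, 10000))        # 1*11**3 + 2*11**2 + 3*11 + 4 = 1610
```

Which digits survive truncation matters in practice: digits that vary little across keys (for example a common prefix) make poor addresses.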

Static Hashing

  • Primary area: the number of primary pages is fixed; pages are allocated sequentially and never de-allocated (say M buckets).
      • A simple hash function: h(k) = f(k) mod M
  • Overflow area: disjoint from the primary area. It keeps buckets that hold records whose keys map to a full bucket.
      • Chaining: the address of an overflow bucket is linked to its bucket in the primary area.
  • A collision does not cause a problem as long as there is still room in the mapped bucket.
  • Overflow occurs during insertion when a record is hashed to a bucket that is already full.

Example

  • Assume f(k) = k and let M = 5, so h(k) = k mod 5.
  • Bucket factor bf = 3 records. Place the records 35, 60, 6, 12, 57, 46, 33, 62, 44, 17 in 5 buckets, each with a bf of 3.

Bucket | Primary area | Overflow
0      | 35 60        |
1      | 6 46         |
2      | 12 57 62     | 17
3      | 33           |
4      | 44           |
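
The placement in this example can be reproduced with a short simulation. It is a minimal sketch, assuming the overflow area is modeled as one list per primary bucket; the function name and the list-based layout are illustrative only.

```python
def static_hash_place(keys, num_buckets, bucket_factor):
    """Place keys with h(k) = k mod num_buckets; full buckets spill to overflow."""
    primary = [[] for _ in range(num_buckets)]
    overflow = [[] for _ in range(num_buckets)]
    for key in keys:
        b = key % num_buckets
        if len(primary[b]) < bucket_factor:
            primary[b].append(key)   # collision, but there is still room
        else:
            overflow[b].append(key)  # bucket already full: chain to overflow
    return primary, overflow


records = [35, 60, 6, 12, 57, 46, 33, 62, 44, 17]
primary, overflow = static_hash_place(records, num_buckets=5, bucket_factor=3)
for b in range(5):
    print(b, primary[b], overflow[b])
```

Only key 17 overflows: it hashes to bucket 2, which already holds 12, 57 and 62.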

Load Factor (Packing density)

  • To limit the amount of overflow, we allocate more space to the primary area than is needed (i.e. the primary area will be, say, 70% full).

$$f = \frac{\text{number of records in the file}}{\text{number of spaces in the primary area}} = \frac{n}{M \cdot m}$$

where n is the number of records, m is the blocking factor, and M is the number of blocks.
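
As a quick check of the formula, the sketch below computes f for a file with n = 10 records in M = 5 blocks of blocking factor m = 3, and then inverts the formula to size the primary area for a target load factor. The function names are hypothetical, and rounding up to a whole number of blocks is an assumption.

```python
import math


def load_factor(n: int, M: int, m: int) -> float:
    """f = n / (M * m): records stored over record slots in the primary area."""
    return n / (M * m)


def primary_blocks_needed(n: int, m: int, f: float) -> int:
    """Smallest M whose load factor does not exceed the target f."""
    return math.ceil(n / (m * f))


print(load_factor(10, 5, 3))              # 10 / 15, about 0.67
print(primary_blocks_needed(10, 3, 0.7))  # 5 blocks keep the file at most 70% full
```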

Effects of load factor f and m

  • Performance can be enhanced by the choice of bucket size and load factor.
  • In general, a smaller load factor means less overflow and a faster fetch time, but more wasted space.
  • A larger blocking factor m means less overflow in general, but a slower fetch.

Insertion and Deletion

  • Insertion: new records are inserted at the end of the chain.
  • Deletion: two ways are possible:
      1. Mark the record as deleted.
      2. Consolidate sparse buckets when deleting records.
  • In the 2nd approach:
      • When a record is deleted, fill its place with the last record in the chain of the current bucket.
      • Deallocate the last bucket when it becomes empty.

Problem of Static Hashing

  • The main problem with static hashing is that the number of buckets is fixed:
      • Long overflow chains can develop, which will eventually degrade performance.
      • On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted.
  • Some other hashing techniques allow the hash index to grow and shrink dynamically. These include:
      • linear hashing
      • extendible hashing

Linear Hashing

  • It maintains a constant load factor.
  • It avoids reorganization.
  • It does so by incrementally adding new buckets to the primary area.
  • With linear hashing, the last (rightmost) bits of the hash number are used for placing the records.

Example

Last 3 bits | Records
000         | 8 16 32
001         | 17 25
010         | 34 50
011         | 11 27
100         | 28 12
101         | 5
110         | 14
111         | 55 15

  • e.g. 34: 100010, 28: 011100, 13: 001101, 21: 010101
  • Insert: 13, 21, 37
  • f = 15/24 ≈ 63%

Insertion of records

  • To expand the table: split an existing bucket denoted by k digits into two buckets using the last k+1 digits.
  • e.g. bucket 000 is split into buckets 0000 and 1000.

Expanding the table

Bucket | Records  | Overflow
0000   | 16 32    |
001    | 17 25    |
010    | 34 50    |
011    | 11 27    |
100    | 28 12    |
101    | 5 13 21  | 37
110    | 14       |
111    | 55 15    |
1000   | 8        |

  • Boundary value: 001 (the next bucket to split)

Bucket | Records  | Overflow
0000   | 16 32    |
0001   | 17       |
0010   | 34 50    |
011    | 11 27    |
100    | 28 12    |
101    | 5 13 21  | 37
110    | 14       |
111    | 55 15    |
1000   | 8        |
1001   | 25       |
1010   | 26       |

  • Boundary value: 011
  • k = 3
  • Hash # 1000: uses last 4 digits
  • Hash # 1101: uses last 3 digits

Fetching a record

  • Apply the hash function.
  • Look at the last k digits.
      • If their value is less than the boundary value, the record is in the bucket labeled with the last k+1 digits.
      • Otherwise, it is in the bucket labeled with the last k digits.
  • Follow overflow chains as with static hashing.
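
The fetch rule can be written as a small address computation. This is a sketch under the assumption that hash numbers are plain integers and that the "digits" are binary digits; the function name is illustrative.

```python
def linear_hash_address(hash_value: int, k: int, boundary: int) -> int:
    """Return the primary-area bucket address for a hash number."""
    addr = hash_value & ((1 << k) - 1)            # value of the last k bits
    if addr < boundary:
        # This bucket was already split, so the last k+1 bits decide.
        addr = hash_value & ((1 << (k + 1)) - 1)
    return addr


# k = 3 and boundary value 011 (= 3): buckets 000, 001, 010 are already split.
print(format(linear_hash_address(0b1000, 3, 3), "b"))  # 1000 (last 4 bits used)
print(format(linear_hash_address(0b1101, 3, 3), "b"))  # 101  (last 3 bits used)
```

After the address is found, any overflow chain attached to that bucket still has to be followed, as in static hashing.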

Insertion

  • Search for the correct bucket into which to place the new record.
  • If the bucket is full, allocate a new overflow bucket.
  • If there are now f*m more records than needed for the given f:
      • Add one more bucket to the primary area.
      • Distribute the records from the bucket chain at the boundary value between the original bucket and the new primary-area bucket.
      • Add 1 to the boundary value.
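
A compact sketch of this insertion procedure follows. Several simplifications are assumptions made for illustration: the hash number is the key itself, each bucket chain (primary page plus overflow pages) is modeled as one Python list, expansion is triggered when the record count reaches `max_load * m * (number of primary buckets)`, and the rollover to k+1 bits once every bucket has been split is the usual linear hashing convention rather than something stated on the slide.

```python
class LinearHashFile:
    def __init__(self, k: int, m: int, max_load: float):
        self.k = k                # bits used by buckets that are not yet split
        self.m = m                # blocking factor (records per primary bucket)
        self.max_load = max_load  # target load factor f
        self.boundary = 0         # address of the next bucket to split
        self.buckets = [[] for _ in range(1 << k)]
        self.count = 0

    def address(self, key: int) -> int:
        addr = key % (1 << self.k)
        if addr < self.boundary:              # already split: use k+1 bits
            addr = key % (1 << (self.k + 1))
        return addr

    def insert(self, key: int) -> None:
        self.buckets[self.address(key)].append(key)
        self.count += 1
        if self.count >= self.max_load * self.m * len(self.buckets):
            self._expand()

    def _expand(self) -> None:
        old = self.boundary
        chain, self.buckets[old] = self.buckets[old], []
        self.buckets.append([])               # new bucket at address old + 2**k
        for key in chain:                     # redistribute using k+1 bits
            self.buckets[key % (1 << (self.k + 1))].append(key)
        self.boundary += 1
        if self.boundary == 1 << self.k:      # all buckets split: next round
            self.k += 1
            self.boundary = 0


table = LinearHashFile(k=3, m=3, max_load=0.75)
for key in [8, 16, 32, 17, 25, 34, 50, 11, 27, 28, 12, 5, 14, 55, 15, 13, 21, 37]:
    table.insert(key)
print(table.boundary, table.buckets[0], table.buckets[8], table.buckets[5])
```

With these settings the 18th record (37) triggers the first expansion: bucket 000 is split into 0000 (16, 32) and 1000 (8), the boundary value becomes 1, and the chain of bucket 101 holds 5, 13, 21, 37.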

Deletion

  • Read in a chain of records.
  • Replace the deleted record with the last record in the chain.
      • If the last overflow bucket becomes empty, deallocate it.
  • When the number of records is less than the number needed, contract the primary area by one bucket.
  • Compressing the table is the exact opposite of expanding it:
      • Keep a count of the total number of records in the file and of the buckets in the primary area.
      • Compute f * m; if we have fewer records than needed, consolidate the last bucket with the bucket that shares the same last k digits.

Extendible Hashing

Extendible Hashing: Introduction

  • Extendible hashing does not have chains of buckets, contrary to linear hashing.
  • Hashing is based on an index table that holds pointers to the data buckets.
      • The number of entries in the index table is $2^i$, where i is the number of bits used for indexing.
      • Records with the same first i bits are placed in the same bucket.
      • If the maximum content of a bucket is reached, a new bucket is added, provided the corresponding index entry is free.
      • If the maximum content of a bucket is reached and a new bucket cannot be added according to the first i bits, then the table size is doubled.

Extendible Hashing: Overflow

  • When i = j and an overflow occurs, the index table is doubled, where j is the level of the index and i is the level of the data bucket.
  • Successive doublings of the index table may happen in the case of a poor hashing function.
  • The main disadvantage of extendible hashing is that the index table may grow too big and too sparse.
      • The best case: there are as many buckets as index table entries.
      • If the index table fits into memory, access to a record requires only one disk access.

Extendible Hashing

  • The hash function returns b bits.
  • Only the first i bits (the prefix) are used to hash the item.
  • There are $2^i$ entries in the index table.
  • Let $i_j$ be the length of the common hash prefix for data bucket j; then $2^{(i - i_j)}$ entries in the index table point to j.

[Figure: a bucket address table (index table) labeled with the hash prefix length i, whose entries point to the data buckets bucket1, bucket2 and bucket3; each bucket is labeled with the length of its common hash prefix, i1, i2, i3.]
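
Both quantities on this slide are one-line computations. The sketch below assumes hash values are integers of exactly b bits; the function names are illustrative.

```python
def directory_index(hash_value: int, b: int, i: int) -> int:
    """Index table slot for a b-bit hash value: its first i bits."""
    return hash_value >> (b - i)


def entries_pointing_to(i: int, i_j: int) -> int:
    """Number of index table entries that point to a bucket with prefix length i_j."""
    return 2 ** (i - i_j)


print(directory_index(0b101, b=3, i=2))  # first two bits of 101 are 10 -> slot 2
print(entries_pointing_to(i=3, i_j=2))   # 2 entries share that bucket
print(entries_pointing_to(i=3, i_j=3))   # 1 entry: a split forces doubling
```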

Splitting a bucket: Case 1

  • Splitting (Case 1: $i_j = i$)
      • Only one entry in the bucket address table points to data bucket j.
      • i++; split data bucket j; rehash all items previously in j.

[Figure: the bucket address table grows from i = 2 (entries 00, 01, 10, 11; bucket prefix lengths 2, 2, 1) to i = 3 (entries 000 to 111; bucket prefix lengths 3, 3, 2, 1).]

Splitting: Case 2

  • Splitting (Case 2: $i_j < i$)
      • More than one entry in the bucket address table points to data bucket j.
      • Split data bucket j into j and z; $i_j = i_z = i_j + 1$; adjust the pointers that previously pointed only to j so that they now point to j and z; rehash all items previously in j.

[Figure: the bucket address table keeps i = 3 (entries 000 to 111); the bucket prefix lengths 2, 2, 1 before the split become 2, 2, 2, 2 after it.]

Extendible Hashing: Example

  • Suppose the hash function is h(x) = x mod 8 and each bucket can hold at most two records. Show the extendible hash structure after inserting 1, 4, 5, 7, 8, 2, 20.

x    | 1   | 4   | 5   | 7   | 8   | 2   | 20
h(x) | 001 | 100 | 101 | 111 | 000 | 010 | 100

  • Use zero bits (i = 0): a single bucket holds 1, 4.
  • Use one bit (i = 1): entry 0 points to the bucket holding 1; entry 1 points to the bucket holding 4, 5. Both buckets have prefix length 1.
  • Use two bits (i = 2): entries 00 and 01 point to the bucket holding 1, 8 (prefix length 1); entry 10 points to the bucket holding 4, 5 (prefix length 2); entry 11 points to the bucket holding 7 (prefix length 2).

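Example

Inserting 1, 4, 5, 7, 8, 2, 20:

x    | 1   | 4   | 5   | 7   | 8   | 2   | 20
h(x) | 001 | 100 | 101 | 111 | 000 | 010 | 100

  • After inserting 2 (i = 2): entry 00 points to the bucket holding 1, 8; entry 01 to the bucket holding 2; entry 10 to the bucket holding 4, 5; entry 11 to the bucket holding 7. Every bucket has prefix length 2.
  • After inserting 20 (i = 3): entries 000 and 001 point to the bucket holding 1, 8 (prefix length 2); entries 010 and 011 to the bucket holding 2 (prefix length 2); entry 100 to the bucket holding 4, 20 (prefix length 3); entry 101 to the bucket holding 5 (prefix length 3); entries 110 and 111 to the bucket holding 7 (prefix length 2).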
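
The example can be traced with a small in-memory model. This is a sketch, not a disk-based implementation: the class and method names are hypothetical, buckets are Python objects shared between directory slots, and the hash function h(x) = x mod 8 is taken from the example (b = 3 bits, prefix bits used for indexing).

```python
class Bucket:
    def __init__(self, depth: int):
        self.depth = depth  # i_j: length of the common hash prefix
        self.keys = []


class ExtendibleHash:
    def __init__(self, hash_bits: int = 3, capacity: int = 2):
        self.b = hash_bits
        self.capacity = capacity
        self.i = 0                       # prefix bits used by the index table
        self.directory = [Bucket(0)]     # always 2**i entries

    def _index(self, key: int) -> int:
        h = key % (1 << self.b)          # h(x) = x mod 8 when b = 3
        return h >> (self.b - self.i)    # first i bits of the hash

    def insert(self, key: int) -> None:
        bucket = self.directory[self._index(key)]
        while len(bucket.keys) >= self.capacity:
            if bucket.depth == self.b:
                raise OverflowError("bucket full and all hash bits are in use")
            if bucket.depth == self.i:   # Case 1: double the index table
                self.directory = [e for e in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)          # Case 2 (also finishes Case 1)
            bucket = self.directory[self._index(key)]
        bucket.keys.append(key)

    def _split(self, bucket: Bucket) -> None:
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        shift = self.i - bucket.depth
        for idx, entry in enumerate(self.directory):
            # Slots whose newly significant prefix bit is 1 move to the sibling.
            if entry is bucket and (idx >> shift) & 1:
                self.directory[idx] = sibling
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:               # rehash all items previously in the bucket
            self.directory[self._index(k)].keys.append(k)


table = ExtendibleHash(hash_bits=3, capacity=2)
for x in [1, 4, 5, 7, 8, 2, 20]:
    table.insert(x)
for slot, bucket in enumerate(table.directory):
    print(format(slot, "03b"), bucket.keys, bucket.depth)
```

The final index table has i = 3: slots 000 and 001 share the bucket (1, 8), slots 010 and 011 share (2), slot 100 holds (4, 20), slot 101 holds (5), and slots 110 and 111 share (7).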

Comments on Extendible Hashing

  • If the directory fits in memory, an equality search is realized with one disk access.
      • A typical example: a 100 MB file with 100-byte records and a page (bucket) size of 4K contains 1,000,000 records (as data entries) but only 25,000 [= 1,000,000 / (4,000/100)] directory elements ⇒ chances are high that the directory will fit in memory.
  • If the distribution of hash values is skewed (e.g., a large number of search key values all hash to the same bucket), the directory can grow very large!
      • This kind of skew must be avoided with a well-tuned hashing function.
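
The arithmetic behind the typical example, spelled out. The variable names are illustrative; the sizes follow the slide, which rounds 100 MB to 100,000,000 bytes and a 4K page to 4,000 bytes.

```python
file_size_bytes = 100_000_000   # 100 MB, as rounded on the slide
record_size_bytes = 100
page_size_bytes = 4_000         # 4K page, as rounded on the slide

records = file_size_bytes // record_size_bytes           # data entries in the file
records_per_page = page_size_bytes // record_size_bytes  # entries per bucket
directory_elements = records // records_per_page         # one element per full bucket

print(records, records_per_page, directory_elements)  # 1000000 40 25000
```

The count assumes every bucket is full and that no two directory elements share a bucket, so it is a lower bound on the directory size rather than a guarantee.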

Comments on Extendible Hashing

  • Delete: if the removal of a data entry makes a bucket empty, the bucket can be merged with its “buddy” bucket.
      • If each directory element points to the same bucket as its split image, the directory can be halved.

Summary, so far

  • Hash-based indexes: best for equality searches; cannot support range searches.
  • Static hashing can lead to long overflow chains.
  • Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it.
      • A directory keeps track of the buckets and doubles periodically.
      • The directory can get large with skewed data; additional I/O is needed if it does not fit in main memory.

Time considerations for Hashing: Reading Assignment

Simple Hashing: Search

  • No-overflow case, i.e. a direct hit:
    $T_F = s + r + dtt$
  • Overflow, successful case, for an average chain length of x:
    $T_{Fs} = s + r + dtt + (x/2)(s + r + dtt)$
  • Overflow, unsuccessful case, for an average chain length of x:
    $T_{Fu} = s + r + dtt + x(s + r + dtt)$
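
The three search formulas translate directly into code. The slides do not define the symbols, so this sketch assumes the common reading: s is the seek time, r the rotational latency and dtt the data transfer time for one block, all in the same time unit. The numeric values in the usage lines are made up for illustration.

```python
def t_f(s: float, r: float, dtt: float) -> float:
    """Direct hit: one block access."""
    return s + r + dtt


def t_fs(s: float, r: float, dtt: float, x: float) -> float:
    """Successful search: on average half of a chain of length x is read."""
    return t_f(s, r, dtt) + (x / 2) * t_f(s, r, dtt)


def t_fu(s: float, r: float, dtt: float, x: float) -> float:
    """Unsuccessful search: the whole chain of length x is read."""
    return t_f(s, r, dtt) + x * t_f(s, r, dtt)


# Hypothetical timings in milliseconds, with an average chain length of 2.
print(t_f(10, 5, 1))      # 16
print(t_fs(10, 5, 1, 2))  # 16 + 1 * 16 = 32.0
print(t_fu(10, 5, 1, 2))  # 16 + 2 * 16 = 48
```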

Deletion

  • Deletion from a hashed file: delete without consolidation, i.e., mark the deleted record as deleted, which requires occasional reorganization.
    $T_D = T_{Fs} + 2r$
  • Here x/2 of the chain is involved in $T_{Fs}$ (successful search).

Delete with consolidation

  • Delete with consolidation requires the entire chain to be read in and written out after being updated. This does not require any reorganization.
  • The approximate formula:
    $T_D = T_{Fu} + 2r$
  • Here the full chain (x blocks) is involved in $T_{Fu}$.

Insertion

  • Assuming that insertion involves the modification of the last bucket, just like deletion:
    $T_I = T_{Fu} + 2r$

Sequential Access

  • In hashing, each record is randomly placed and randomly accessed, so sequential record processing takes very long:
    $T_X = n \cdot T_F$


Linear hashing: Review-1

  • A bucket overflow causes a split to take place.
  • Splitting starts from the first bucket address. It causes k+1 bits to be used for the split buckets and k bits for the others.
  • E.g., if the bucket whose last 3 address bits are 010 is split, the records with 0010 go to one bucket (with a 4-bit address) and the records with 1010 go to the other.

Page 41: Hashing for filessaksagan.ceng.metu.edu.tr/.../week6_Hashing_section1.pdf · 2009. 11. 24. · Extendible Hashing:Intr. Extendible hashing does not have chains of buckets, contrary

CENG 351 CENG 351 4141

Linear hashing: Review-2

The address of the next bucket to split, known as the boundary value, needs to be recorded in the file header.

If the value of the last k bits of the hashed key is less than the boundary value, use k+1 bits to find the address of the related bucket; otherwise k bits are enough.

Split buckets become part of the primary area.

For access time considerations, Knuth's tables in Fig 6.2 in the textbook by Salzberg can be used. E.g., for blocking factor m = 50 and load factor f = 75%, the average access times for the successful and unsuccessful cases can be stated as

$T_{Fs} = 1.05\,(s + r + dtt)$

$T_{Fu} = 1.27\,(s + r + dtt)$
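The boundary rule for choosing between k and k+1 address bits can be written as a small function. This is a minimal sketch; the name `bucket_address` and the use of an integer `h` for the hashed key are assumptions for illustration.

```python
def bucket_address(h, k, boundary):
    """Return the bucket address for hashed key h.

    k is the current number of address bits and boundary is the address of
    the next bucket to split (the value kept in the file header).
    """
    addr = h & ((1 << k) - 1)              # value of the last k bits
    if addr < boundary:                    # this bucket has already been split
        addr = h & ((1 << (k + 1)) - 1)    # so use the last k+1 bits instead
    return addr


# With k = 3 and buckets 000, 001 and 010 already split (boundary = 3):
a = bucket_address(0b1010, k=3, boundary=3)   # last 3 bits 010 < 3, so 4 bits are used
b = bucket_address(0b1101, k=3, boundary=3)   # last 3 bits 101 >= 3, so 3 bits are used
```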

Linear hashing: Insertion-1

Find the correct bucket in which to place the new record.

If the bucket is full, allocate a new bucket, place the new record in it, and link the new bucket to the related chain.

If the load factor f is now out of balance, add one more bucket to the primary area and split the records of the boundary bucket between it and the new bucket.

Increment the boundary bucket address.
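The insertion steps above can be sketched as an in-memory Python class. Everything here is an illustrative assumption rather than the course's implementation: the class name `LinearHashFile`, the use of Python lists in place of disk buckets, integer hashed keys, and the rule that the file expands whenever n/(M*m) exceeds `max_load`.

```python
class LinearHashFile:
    """In-memory sketch: each primary bucket is a chain (list) of buckets of up to m keys."""

    def __init__(self, m=2, max_load=0.75, k=1):
        self.m = m                    # blocking factor: records per bucket
        self.max_load = max_load      # load factor f above which the file expands
        self.k = k                    # current number of address bits
        self.boundary = 0             # address of the next bucket to split
        self.n = 0                    # number of records stored
        self.chains = [[[]] for _ in range(1 << k)]   # primary area

    def _address(self, h):
        addr = h & ((1 << self.k) - 1)
        if addr < self.boundary:      # already split: use k+1 bits
            addr = h & ((1 << (self.k + 1)) - 1)
        return addr

    def _place(self, chain, h):
        if len(chain[-1]) == self.m:  # last bucket of the chain is full:
            chain.append([])          # allocate a new bucket and link it
        chain[-1].append(h)

    def insert(self, h):
        self._place(self.chains[self._address(h)], h)
        self.n += 1
        if self.n / (len(self.chains) * self.m) > self.max_load:
            self._expand()

    def _expand(self):
        # Collect the records of the boundary bucket and its overflow chain.
        old = [h for bucket in self.chains[self.boundary] for h in bucket]
        self.chains[self.boundary] = [[]]
        self.chains.append([[]])      # new primary bucket at boundary + 2**k
        for h in old:                 # redistribute using k+1 bits
            self._place(self.chains[h & ((1 << (self.k + 1)) - 1)], h)
        self.boundary += 1
        if self.boundary == 1 << self.k:   # every bucket of this round is split
            self.k += 1
            self.boundary = 0

    def contains(self, h):
        return any(h in bucket for bucket in self.chains[self._address(h)])


hf = LinearHashFile(m=2, max_load=0.75, k=1)
for key in range(20):
    hf.insert(key)
```

The sketch splits one bucket per expansion and doubles the address space only after a full round of splits, which is why the primary area grows by one bucket at a time.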

Linear hashing: Insertion-2

An insertion formula needs to consider all the timing aspects caused by the above algorithm.

$T_I = (T_{Fu} + 2r) + \frac{1}{m}\,(s + r + dtt) + \frac{1}{f \cdot m}\,[(s + r + dtt) + 2r + (s + r + dtt)]$

The first term, $(T_{Fu} + 2r)$, is for finding the bucket to insert into and writing it back to the disk.

The second term is for linking a new bucket.

The third term is for expanding the primary area, with probability 1/(f*m), by adding a new bucket and reading and writing back the boundary bucket.
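As a worked example, the insertion formula can be evaluated term by term. The disk parameters below (s = 16 ms, r = 8.3 ms, dtt = 1 ms) are hypothetical values chosen only for illustration, and the sketch takes s, r and dtt to be the seek time, rotational latency and data transfer time used elsewhere in the course. The factor 1.27 is the unsuccessful-search value quoted for m = 50 and f = 75%.

```python
def insertion_time(s, r, dtt, m, f, unsuccessful_factor=1.27):
    """Evaluate T_I = (T_Fu + 2r) + (1/m)(s+r+dtt) + 1/(f*m) [(s+r+dtt) + 2r + (s+r+dtt)]."""
    access = s + r + dtt                      # one random bucket access
    t_fu = unsuccessful_factor * access       # unsuccessful search time T_Fu
    find_and_rewrite = t_fu + 2 * r           # first term
    link_new_bucket = access / m              # second term
    expand_primary = (access + 2 * r + access) / (f * m)   # third term
    return find_and_rewrite + link_new_bucket + expand_primary


# Hypothetical disk: s = 16 ms, r = 8.3 ms, dtt = 1 ms; m = 50, f = 0.75.
t_i = insertion_time(s=16.0, r=8.3, dtt=1.0, m=50, f=0.75)   # about 51.03 ms
```

With these numbers the first term dominates (about 48.7 ms), while linking and expansion together add only about 2.3 ms.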

Linear hashing: Deletion

Every delete may cause the linked buckets to be read and rewritten, or even deleted.

Deletion may also cause contraction of the primary area if f falls below what it should be. The last buckets are consolidated with the buckets sharing the same last k bits.

A rough estimate for the deletion time is

$T_D = T_{Fu} + 2r$

where the other contributing terms and factors are ignored, as they are small compared to these two terms.

Linear hashing: Sequential Reading in record order

The hash function does not preserve order. Therefore, reading each record is an independent search, giving

$T_X = n \cdot 1.05\,(s + r + dtt)$ for the f = 75% and m = 50 case.

Note that there is no provision about which blocks or buckets to allocate to an indexed file. There may be a small disk space allocation directory in the file header to indicate which disk extents are allocated to the file.

Linear hashing: Reorganizing

It is possible to reorganize a hashed file, which costs:

$T_{reorg} = n\,(T_F + 2r) = n\,(s + r + dtt + 2r)$

The amount of space used by a hashed file, in buckets, is:

$n / (f \cdot m)$

The number of bits k required to form the hash table for M buckets satisfies:

$2^k \le M < 2^{k+1}$, or $k \le \log_2 M < k + 1$
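The space and address-bit formulas can be checked with a short calculation. The function names and the example file size (100,000 records) are assumptions for illustration; rounding the bucket count up to a whole bucket is also a choice made here.

```python
import math


def buckets_needed(n, f, m):
    """Number of buckets n / (f * m), rounded up to a whole bucket."""
    return math.ceil(n / (f * m))


def address_bits(M):
    """Largest k with 2**k <= M < 2**(k+1), i.e. the floor of log2(M)."""
    return M.bit_length() - 1


# Hypothetical file: n = 100000 records, load factor f = 0.75, blocking factor m = 50.
M = buckets_needed(100_000, 0.75, 50)   # 100000 / 37.5, rounded up
k = address_bits(M)
```

For this example M is 2667 buckets and k is 11, since 2^11 = 2048 <= 2667 < 4096 = 2^12.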

Redundant slides

Predicting the distribution of records

In the random case, one can approximate the collision probability P(x), which is the probability that a given address will have x records assigned to it when n records are hashed to N addresses. x can take the values 0, 1, 2, 3, etc.; the probability falls off quickly as x grows large.

Rather than taking p(x) as a probability, it can be taken as the proportion of the addresses having x logical records assigned to them by hashing. Thus, p(0) is the proportion of the addresses with 0 records assigned to them. Given p(0), the expected number of addresses with no assignments is N*p(0).

Let n/N also represent f, the load factor (or packing density).
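The slide does not state which approximation is meant. A common choice for random hashing is the Poisson distribution, p(x) = (n/N)^x e^(-n/N) / x!, and the sketch below assumes that formula. The function names and the example sizes are likewise illustrative assumptions.

```python
import math


def p(x, n, N):
    """Poisson approximation (an assumption) of the proportion of addresses with x records."""
    load = n / N                       # n/N, the load factor f
    return load ** x * math.exp(-load) / math.factorial(x)


def expected_addresses(x, n, N):
    """Expected number of addresses with exactly x records: N * p(x)."""
    return N * p(x, n, N)


# Hypothetical case: n = 1000 records hashed to N = 1000 addresses.
empty = expected_addresses(0, 1000, 1000)              # N * p(0)
total = sum(p(x, 1000, 1000) for x in range(30))       # the proportions sum to about 1
```

Under this approximation p(0) is e^(-1) when n = N, so a little over a third of the addresses are expected to stay empty.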

Collision resolution

Collisions can be resolved by chaining. A simple (separate) chaining method allows overflow from a full bucket to a bucket in a separate overflow area.

Usually N = M*m, where M is the number of buckets and m is the blocking factor.

Given a bucket size and the load factor, one can compute the average number of accesses (a) required to fetch a record. For example, for m = 10 and f = 70%, a = 1.201; for m = 50 and f = 70%, a = 1.018. (See the textbook for computed a values for a given f and m.)

Bounded Extendible hashing

A buffer of fixed size (tuned to the size of the available memory) is designated to serve as the index. The index table grows according to the extendible hashing principle; however, the index computed from a record key is assumed not to fall outside the size of this table.

The file grows by doubling the data block size when an overflow occurs. Thus, the size of a bucket is a power of 2.

Bounded Extendible hashing: Hash key digits

As an example, the digits of a hashed key may have the following meanings:

First y digits: the corresponding block address in the index area.

Next z digits: the correct record index entry in the located index block.

Next w digits: the offset, from the beginning of the bucket, of the correct block containing the record. This is normally $\log_2 m$ bits, where m is the number of blocks in the bucket.

To avoid sparsely populated buckets, chaining may be allowed to a certain extent…
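The field layout described above can be sketched by slicing a hashed key into three bit fields. This is only an illustration: it assumes the digits are binary, that the hashed key has a fixed width `total_bits`, and that the fields are taken from the most significant end. The function name and the field widths in the example are hypothetical.

```python
def split_hashed_key(h, y, z, w, total_bits):
    """Split the leading y + z + w bits of a total_bits-wide hashed key into three fields.

    Returns (index_block, index_entry, block_offset).
    """
    used = y + z + w
    top = h >> (total_bits - used)              # keep only the leading bits that are used
    block_offset = top & ((1 << w) - 1)         # last w of those bits
    index_entry = (top >> w) & ((1 << z) - 1)   # middle z bits
    index_block = top >> (w + z)                # first y bits
    return index_block, index_entry, block_offset


# Hypothetical 16-bit hashed key with y = 4, z = 4, w = 2 (so m = 4 blocks per bucket).
fields = split_hashed_key(0b1011011001000000, y=4, z=4, w=2, total_bits=16)
```

In this example the first four bits 1011 select index block 11, the next four bits 0110 select entry 6, and the next two bits 01 give block offset 1.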