Chapter 11. Hashing

File Structure


Contents

Introduction

A Simple Hashing Algorithm

Hashing Functions and Record Distributions

How Much Extra Memory Should Be Used?

Collision Resolution by Progressive Overflow

Storing More Than One Record per Address: Buckets

Making Deletions

Other Collision Resolution Techniques

Patterns of Record Access





1. Introduction

O-notation
  O(1)
  O(N) : sequential searching
  O(log2 N)
  O(logk N) : B-Tree (k : leaf node size)

What is Hashing?
  a = h(K)
  h : hash function, K : key, a : home address

Example
  K = BASS
  h = (first char * second char) mod 1000
  a = h(K) = (66 * 65) mod 1000 = 4,290 mod 1000 = 290



Collision

Example
  LOWELL  => a = (76 * 79) mod 1000 = 6,004 mod 1000 = 4
  OLIVIER => a = (79 * 76) mod 1000 = 6,004 mod 1000 = 4

Several ways to reduce the number of collisions
  1. Spread out the records : good hashing algorithms
  2. Use extra memory
  3. Put more than one record at a single address : buckets
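The example hash function and the LOWELL / OLIVIER collision can be sketched in Python. This is only an illustration, assuming ASCII keys of at least two characters; the function names `h` and `home_addresses` are not from the slides.

```python
from collections import defaultdict

def h(key: str) -> int:
    """Example hash from the slides: (first char * second char) mod 1000."""
    return (ord(key[0]) * ord(key[1])) % 1000

def home_addresses(keys):
    """Group keys by home address; a group with more than one key is a collision."""
    groups = defaultdict(list)
    for key in keys:
        groups[h(key)].append(key)
    return dict(groups)

print(h("BASS"))                                       # 290
print(home_addresses(["BASS", "LOWELL", "OLIVIER"]))   # {290: ['BASS'], 4: ['LOWELL', 'OLIVIER']}
```

LOWELL and OLIVIER are synonyms because the product of the first two character codes does not depend on their order.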



2. A Simple Hashing Algorithm

3 Steps
  1. Represent the key in numerical form
  2. Fold and add
  3. Divide by a prime number and use the remainder as the address

Example
Step 1. Represent the Key in Numerical Form

  LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
           (L  O  W  E  L  L, followed by six blanks)



Example (continued)
Step 2. Fold and Add

  76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

  7679 + 8769 + 7676 + 3232 + 3232 = 30588
  (30588 + 3232 = 33820 exceeds 32767, the maximum 2-byte value, so the running sum is reduced mod 19937 after each addition)

   7679 + 8769 = 16448 => 16448 mod 19937 = 16448
  16448 + 7676 = 24124 => 24124 mod 19937 = 4187
   4187 + 3232 =  7419 =>  7419 mod 19937 = 7419
   7419 + 3232 = 10651 => 10651 mod 19937 = 10651
  10651 + 3232 = 13883 => 13883 mod 19937 = 13883

Step 3. Divide by the Size of the Address Space
  a = s mod n (n : # of addresses in file)
  a = 13883 mod 100 = 83
  a = 13883 mod 101 = 46
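The three steps can be put together in a short Python sketch. It assumes a 12-character, blank-padded key and the modulus 19937 from the example; the function name `fold_and_add` and its parameters are illustrative, not taken from the slides.

```python
def fold_and_add(key: str, n_addresses: int, width: int = 12, modulus: int = 19937) -> int:
    """Fold-and-add hash following the worked LOWELL example."""
    # Step 1: pad with blanks (code 32) to a fixed width and use character codes.
    padded = key.ljust(width)[:width]
    total = 0
    # Step 2: take the codes two at a time (76, 79 -> 7679) and add them,
    # reducing mod 19937 after each addition so the sum stays below 32767.
    # Multiplying by 100 mirrors the slide's decimal pairing for two-digit codes.
    for i in range(0, width, 2):
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % modulus
    # Step 3: divide by the size of the address space and keep the remainder.
    return total % n_addresses

print(fold_and_add("LOWELL", 100))   # 83
print(fold_and_add("LOWELL", 101))   # 46
```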



3. Hashing Functions and Record Distributions

Distributing Records among Addresses

<Figure 11.3> Different distributions of records A-G among addresses 1-10. (a) Uniform distribution (Best) (b) Worst case (c) Random distribution (Acceptable)



Some Other Hashing Methods

Better than random
  Examine keys for a pattern (e.g., resident registration numbers)
  Divide the key by a prime number

Random
  Square the key and take the middle
    453^2 = 205209
  Radix transformation
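A rough Python sketch of the "square the key and take the middle" method follows. How many middle digits to keep is an assumption here (two, by default), and `mid_square` is an illustrative name.

```python
def mid_square(key: int, n_digits: int = 2) -> int:
    """Square the key and return its middle n_digits digits."""
    squared = str(key * key)
    start = (len(squared) - n_digits) // 2
    return int(squared[start:start + n_digits])

print(mid_square(453))   # 453^2 = 205209, middle two digits -> 52
```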



4. How Much Extra Memory Should Be Used?

Packing Density

  packing density = (# of records) / (# of spaces) = r / N

Example
  r = 75 records, N = 100 addresses
  packing density = 75 / 100 = 0.75 = 75%


Predicting Collisions for Different Packing Densities

  Packing density (%)   Synonyms (%)
          10                 4.8
          40                17.6
          70                28.1
          90                34.1
         100                36.8

<Table 11.2> Effect of packing density on the proportion of records not stored at their home addresses
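The values in Table 11.2 can be approximated with a short calculation. The sketch below assumes that keys hash randomly, so the number of records per address follows a Poisson distribution, and that each address holds one record; this model is not stated on the slide, and `synonym_percent` is an illustrative name.

```python
import math

def synonym_percent(density: float) -> float:
    """Percent of records not stored at their home address (one record per address)."""
    # Under the Poisson assumption, an address is used by at least one record
    # with probability 1 - e^(-density); each such address keeps one record at home.
    at_home_per_address = 1 - math.exp(-density)
    return 100 * (1 - at_home_per_address / density)

for d in (0.10, 0.40, 0.70, 0.90, 1.00):
    print(f"{d:4.0%}  {synonym_percent(d):4.1f}")
```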


5. Collision Resolution by Progressive Overflow

Progressive Overflow
  Open addressing
  Linear probing

  Address   Record
     0
     1
     2      Rosen    <- Novak's home address (h(Novak) = 2)
     3      Jasper   <- York's home address (h(York) = 3)
     4      York



Search Length

  Key     Home address   # of accesses (search length)
  Adams        0                    1
  Bates        1                    1
  Cole         1                    2
  Dean         2                    2
  Evans        0                    5

  Address   Record
     0      Adams
     1      Bates
     2      Cole
     3      Dean
     4      Evans
     5



Search Length (continued)

  Average search length = (total search length) / (total number of records)

Example
  Average search length = (1 + 1 + 2 + 2 + 5) / 5 = 2.2

<Figure 11.7> Average search length versus packing density in a hashed file
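The search lengths in the example can be reproduced by simulating progressive overflow. This sketch assumes a file of six addresses (0 to 5) and takes the home addresses directly from the table instead of computing them with a hash function; `load_with_linear_probing` is an illustrative name.

```python
def load_with_linear_probing(records, n_addresses):
    """Insert (key, home address) pairs; return the table and each key's search length."""
    table = [None] * n_addresses
    search_length = {}
    for key, home in records:
        addr, accesses = home, 1
        while table[addr] is not None:        # slot taken: try the next address
            addr = (addr + 1) % n_addresses   # wrap around at the end of the file
            accesses += 1
            if accesses > n_addresses:
                raise RuntimeError("file is full")
        table[addr] = key
        search_length[key] = accesses
    return table, search_length

records = [("Adams", 0), ("Bates", 1), ("Cole", 1), ("Dean", 2), ("Evans", 0)]
table, lengths = load_with_linear_probing(records, 6)
print(lengths)                               # {'Adams': 1, 'Bates': 1, 'Cole': 2, 'Dean': 2, 'Evans': 5}
print(sum(lengths.values()) / len(lengths))  # 2.2
```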


6. Storing More Than One Record per Address: Buckets

Buckets

  Key     Home address
  Green        0
  Hall         0
  Jenks        2
  King         3
  Land         3
  Marks        3
  Nutt         3

  Bucket address   Bucket contents
        0          Green   Hall
        1
        2          Jenks
        3          King    Land    Marks
        4          Nutt


Effects of Buckets on Performance

  packing density = r / (bN)

  r : # of records
  N : # of addresses
  b : # of records in a bucket

                                  File without buckets   File with buckets
  # of records                         r = 750               r = 750
  # of addresses                       N = 1000              N = 500
  Bucket size                          b = 1                 b = 2
  Packing density                      0.75                  0.75
  Ratio of records to addresses        r/N = 0.75            r/N = 1.5


<Table 11.4> Synonyms causing collisions as a percent of records for different packing densities and different bucket sizes

  Packing              Bucket size
  density        1       2       5      10
    20 %        9.4     2.2     0.1     0.0
    50 %       21.3    10.4     2.5     0.4
    80 %       31.2    20.4    10.3     5.3
   100 %       36.8    27.1    17.6    12.5
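Table 11.4 can be approximated in the same spirit. The sketch below assumes the number of records hashing to one address is Poisson-distributed with mean r/N = (packing density) * b, which the slides do not state; it matches the table to within rounding, not necessarily digit for digit. The names `poisson` and `overflow_percent` are illustrative.

```python
import math

def poisson(x: int, mean: float) -> float:
    """Probability that exactly x records hash to a given address."""
    return math.exp(-mean) * mean ** x / math.factorial(x)

def overflow_percent(density: float, bucket_size: int) -> float:
    """Percent of records that do not fit in their home bucket."""
    mean = density * bucket_size                    # r / N
    below = [poisson(x, mean) for x in range(bucket_size)]
    # Expected records kept per bucket: x if fewer than b arrive, otherwise b.
    kept = sum(x * p for x, p in enumerate(below)) + bucket_size * (1 - sum(below))
    return 100 * (mean - kept) / mean

for d in (0.2, 0.5, 0.8, 1.0):
    print(f"{d:4.0%}", [round(overflow_percent(d, b), 1) for b in (1, 2, 5, 10)])
```

For the same packing density, larger buckets leave fewer records outside their home address, which is the effect the table shows.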


7. Making Deletions

Initial state

  Key      Home address   Actual address
  Adams         0               0
  Jones         1               1
  Morris        1               2
  Smith         0               3


(1) Tombstones for Handling Deletions

  Before                  After deletion of Morris
  Address   Record        Address   Record
     0      Adams            0      Adams
     1      Jones            1      Jones
     2      Morris           2      ###
     3      Smith            3      Smith

  If address 2 were simply left empty, a search for Smith (home address 0) would stop there: "Smith cannot be found."

  ### : tombstone
  This mark indicates that a record once lived there but no longer does.


(2) Implications of Tombstones for Insertions
  Inserting "Smith"

(3) Effects of Deletions and Additions on Performance
  Solution to the problem of deteriorating average search length : reorganization
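Deletion with tombstones can be sketched as follows. This is one possible arrangement, not the slides' own code: it assumes progressive overflow in a five-address file, takes home addresses from the table above instead of hashing, and checks for an existing copy of the key before an insertion reuses a tombstone. The class name `HashedFile` is illustrative.

```python
EMPTY = None
TOMBSTONE = object()   # plays the role of the "###" mark

class HashedFile:
    """Progressive overflow (linear probing) with tombstones."""

    def __init__(self, n_addresses):
        self.slots = [EMPTY] * n_addresses

    def search(self, key, home):
        """Return the address holding key, or None. Tombstones do not stop the search."""
        n = len(self.slots)
        for i in range(n):
            addr = (home + i) % n
            slot = self.slots[addr]
            if slot is EMPTY:
                return None                  # a truly empty slot ends the search
            if slot is not TOMBSTONE and slot == key:
                return addr
        return None

    def insert(self, key, home):
        """Store key in the first empty or tombstone slot, unless it is already present."""
        found = self.search(key, home)
        if found is not None:
            return found                     # avoid a duplicate hidden behind a tombstone
        n = len(self.slots)
        for i in range(n):
            addr = (home + i) % n
            if self.slots[addr] is EMPTY or self.slots[addr] is TOMBSTONE:
                self.slots[addr] = key
                return addr
        raise RuntimeError("file is full")

    def delete(self, key, home):
        """Replace the record with a tombstone; return its address, or None if absent."""
        addr = self.search(key, home)
        if addr is not None:
            self.slots[addr] = TOMBSTONE
        return addr

f = HashedFile(5)
for key, home in [("Adams", 0), ("Jones", 1), ("Morris", 1), ("Smith", 0)]:
    f.insert(key, home)
f.delete("Morris", 1)
print(f.search("Smith", 0))   # 3: the tombstone at address 2 keeps Smith reachable
```

Tombstones still lengthen searches, which is why the file is reorganized from time to time.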


8. Other Collision Resolution Techniques

(1) Double Hashing
  A second hashing function produces an increment (c)
  The increment c is added to the home address to find the overflow address
  Drawback : seek time overhead
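A minimal Python sketch of double hashing is given below. The second hash function `h2`, the table size of 11, and the rule c = 1 + h2(K) mod (n - 1) are assumptions chosen for illustration; the slides only say that a second hashing function supplies the increment.

```python
def h1(key: str) -> int:
    """First hash, as in the introduction: first char * second char."""
    return ord(key[0]) * ord(key[1])

def h2(key: str) -> int:
    """Hypothetical second hash: sum of the character codes."""
    return sum(ord(ch) for ch in key)

def insert(table: list, key: str) -> int:
    """Insert key by double hashing and return the address used."""
    n = len(table)                    # assumed prime, so every increment reaches all addresses
    home = h1(key) % n
    c = 1 + h2(key) % (n - 1)         # increment between 1 and n - 1, never 0
    for i in range(n):
        addr = (home + i * c) % n     # home, home + c, home + 2c, ...
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")

table = [None] * 11
print(insert(table, "LOWELL"))    # 9 (home address)
print(insert(table, "OLIVIER"))   # 7 (home address 9 is taken, increment c = 9)
```

Because overflow records land far from their home address, each extra probe may need a separate seek, which is the overhead noted above.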



(2) Chained Progressive Overflow

  Key     Home address   Actual address   Search length (1)   Search length (2)
  Adams        0               0                  1                   1
  Bates        1               1                  1                   1
  Cole         0               2                  3                   2
  Dean         1               3                  3                   2
  Evans        4               4                  1                   1
  Flint        0               5                  6                   3

  (1) : progressive overflow without chaining
  (2) : chained progressive overflow

  Address   Record   Address of next synonym
     0      Adams              2
     1      Bates              3
     2      Cole               5
     3      Dean              -1
     4      Evans             -1
     5      Flint             -1
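Both search-length columns can be recomputed from the addresses and link fields. The sketch below copies its data from the table; the helper names `probe_length` and `chain_length` are illustrative.

```python
N_ADDRESSES = 6   # addresses 0..5

# (key, home address, actual address)
records = [("Adams", 0, 0), ("Bates", 1, 1), ("Cole", 0, 2),
           ("Dean", 1, 3), ("Evans", 4, 4), ("Flint", 0, 5)]

# Link field of each address: address of the next synonym, -1 ends the chain.
next_synonym = {0: 2, 1: 3, 2: 5, 3: -1, 4: -1, 5: -1}

def probe_length(home: int, actual: int) -> int:
    """Accesses needed when every address from home to actual is examined."""
    return (actual - home) % N_ADDRESSES + 1

def chain_length(home: int, actual: int) -> int:
    """Accesses needed when only the synonym chain is followed."""
    addr, accesses = home, 1
    while addr != actual:
        addr = next_synonym[addr]
        if addr == -1:
            raise KeyError("record is not on this chain")
        accesses += 1
    return accesses

plain = [probe_length(home, actual) for _, home, actual in records]
chained = [chain_length(home, actual) for _, home, actual in records]
print(plain)     # [1, 1, 3, 3, 1, 6]
print(chained)   # [1, 1, 2, 2, 1, 3]
```

Averaged over the six records, chaining reduces the search length from 15/6 = 2.5 to 10/6, about 1.7.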



(3) Chaining with a Separate Overflow Area

  Primary data area
  Home address   Record   Link to overflow area
       0         Adams            0
       1         Bates            1
       2
       3
       4         Evans           -1

  Overflow area
  Address   Record   Link
     0      Cole       2
     1      Dean      -1
     2      Flint     -1



(4) Scatter Tables: Indexing Revisited

[Figure: a scatter table with addresses 0-4 whose entries point to linked lists of records (Adams, Bates, Cole, Dean, Evans, Flint); -1 marks the end of a list.]



9. Patterns of Record Access

A small percentage of the records in a file account for a large percentage of the accesses.

80 / 20 Rule
  80% of the accesses are performed on 20% of the records
