anandhavalli.files.wordpress.com … · Web viewIf we store an element on to the array with size 10 then the calculated hash value should be within {0 to 9}. Store the elements

Chapter

7

Hashing

HASHING

7.1. INTRODUCTION Hashing is a technique used for performing insertion, deletion and find operation in

constant average time. Tree operations are not supported in linear time efficiently when it requires any

ordering information among the elements such as FindMin, FindMax and printing of the entire table in sorted order.

Basically, the search time for searching technique such as sequential search, binary search is depends on the number of input elements. All search tree techniques has so many key comparisons to perform searching. But hashing supports to search the elements in constant time with less key comparisons.

For Example,1. Using Unsorted sequential array, Insertion=O(1), deletion=O(n) & find=O(n)2. Using Sorted sequential array, Insertion =O(n), deletion=O(n) & find=O(log n)3. Using Linked list, Insertion: add to front O(1) & for sorted list O(n), Deletion:

O(n), Search: O(n)4. Using AVL tree, find =O(log n), Insertion=O(log n), Deletion=O(log n).

Basic idea of HashingLet us take 5 elements or Keys 19,14,6,18,3. Calculate the index (Hash value) for

each element to store which may be the last digit in a key. If ‘19’ is an element then its index is 9. The generated index range should be within the ranges of array index (i.e.) If we store an element on to the array with size 10 then the calculated hash value should be within {0 to 9}.

Store the elements in the array only based on their index where array index and hash values are same. Same hash value is used for both storing & retrieving the data from an array.

The function used for hash value generation is called as “Hash function”. The array used to hold an element is referred as “Hash table”. The entire process is (hash value & hash table generation, hash function, and mapping key to index) called as HASHING.

Hash table data structure merely an array of some fixed size containing elements. Each element is identified by the index of table which is from 0 to (table size-1). The hash table is sometime referred to as scatter table, because we are trying to scatter the data throughout the table.

Each key is mapped into some hash number in the range 0 to (table size-1) and placed in the appropriate cell. This mapping is called hash function. It ensures the

Data Structures Using C 199

Hashing

different keys get different hash value or different cells and it also distributes the elements evenly among the cells.

There may be a possibility that the hash function generate same hash table address or index for different keys. This situation is called “Collision”.

For example: In the above example, if an element 29 is inserted on to the hash table, then the generated hash value is ‘9’ but an element ‘19’ is already placed in the 9 th index of hash table. This situation is called as Collision and the key ‘29’ becomes a collide record.

7.2. HASH FUNCTION A hash table is a data structure that works just like an array, except instead of

forcing you to use integers as your index you can use any arbitrary data type as your index.

We basically need a function to map a key ‘K’ to an integer index ‘i’ of the hash table (which has indices 0 to n), such a function is referred to as Hashing function. Ideally, the hash function is used to determine the location of any record by giving its key value.

Hash functions transform the keys into numbers within a predetermined interval (0 to N). These numbers are then used as indices in an array in the hash table to store the records.

If the indices are numbers, if N is the size of the array, then Hash(key)=key % N. This will map all the keys into numbers within the interval (0 to N-1).

If the indices are strings of characters, then the binary representations of a key is a number, and then apply as before. If each character is representation with m bits, then the string can be treated as base m number.

We should choose a hash function such that it gives us distinct values for different values of primary key.

To insert values into the hash table, use the hash function to generate an address for each value to be inserted. To search for a key in the table the same hash function is reused.

An important consideration in hashing performance is value of the Load Factor (L), defined as the ratio between the number of records to be stored and the size of the table. As the load factor increases, the hashing performance will degrade because of the time required to search collision chains.

L= No . of . KeysStoredTableSize

Types of Hash functions1. Truncation method


KEY or Element

HASH FUNCTION

Hash Address or Hash Table

Index

Hashing

2. Folding method3. Mid square method4. Division method

We can also combine the above methods to generate a complex hash function which gives the hash address for the corresponding key.1. Truncation Method

It is a Simplest and easiest method to create a hash function. This method truncates the part of the given keys, depending upon the size of the hash table. If we assume that the size of the hash table is 1000, then the rightmost (or leftmost) 3 digits are truncated and used as hash table addresses.For example: 987456, 125978, 963294 are the keys.

The hash address for given keys are 456, 978, 294Since we are using only the last 3 digits for computing the hash table address, the

chances of collisions are more in this method.

2. Folding Method In this method, the given keys are broken into groups of digits (say 3) and add these

groups and get the hash table address. For example: if the key to be mapped is 143765980 breaks into 3 groups of the 3 digit numbers (143) + (765) + (980) and add then up resulting in 1888. You can use the result as it is, as the hash table address. But if the size of the hash table is small, you can truncate the result to last 2 or 3 digits, depending upon the size of table. (Note that you can also multiply the digits).

3. Mid Square Method In this method, first find the square value for the given keys, and then take middle

digits (either 2 or 3) and use that us a hash address.For example: 456, 978 are the keys. 207936, 956484 are square result. 79, 64 are the middle values (i.e.) hash table address.

4. Division Method This is one of the best method to get the address, for the key to be mapped. Take the

key depending upon the size of the hash table, do the modulus operation, and get the reminder of the key value as the address for the hash (value) table.For example: Key is 987456782. Always use any prime number near to the hash table size for the modulus operation (i.e.) 61. Therefore 987456782 % 61=46 (hash address).

Properties of a good hash function1. The function should compute quickly.2. The function should easy to understand.3. The function should rely on all are most bits of key.4. The function should distribute the key apparently and randomly.5. The function will leads less collision.


Hashing

6. It should distribute values uniformly over the address range and avoid collisions as far as possible. To uniformly distribute the values over the address range we have to know the address bound before and define the hash function.

7.3. COLLISION RESOLUTION TECHNIQUES

There are two major methods used for resolving the collision. They are Closed Hashing and Open Hashing techniques.

1. Closed Hashing (or) Open addressing: When a data item cannot be placed at the index calculated by the hash function (because of collision), we look for the availability of another empty location within the hash table and store that collide record in hash table itself. (i.e. in closed hash table)

2. Open Hashing (or) Separate Chaining: When a data item cannot be placed at the index calculated by the hash function (because of collision), then create new location to store that collide record outside of the hash table. (i.e. in the open hash table)

7.3.1. OPEN ADDRESSING (CLOSED HASHING) To perform insertion using open addressing, we successively examine or probe, the

hash table until we find an empty slot in which to put the key. The sequence of positions probe depends upon the key being inserted. With open

addressing we require that for every key k, the probe sequence {h(k,0), h(k,1), …., h(k, m-1)}

The following pseudo codes are used for inserting a key ‘K’ into the hash table ‘T’ and search a given key ‘K’ in a hash table ‘T’

Pseudo CodeHash-Insert (T, k)i = 0Repeat j=h(k, i)

if T[j] = Nil thenT[j] = kReturn j

else i=i+1

Until (i= =m)Error “hash table overflow”Hash-Search (T, k)i = 0Repeat j=h(k, i)

if T[j] = k thenReturn j

i=i+1Until (T[j]= = Nil or i= =m)


Hashing

Return Nil

There are five collision resolution techniques based on open addressing method,1. Linear probing2. Quadratic probing3. Double hashing4. Rehashing5. Extendible hashing

7.3.1.1. LINEAR PROBING It is one of the popular methods used to resolve collision. The hash function is

resolved by placing that collide record in the next available empty position in the hash table, if the hashed address mapped by the key is already being occupied by a previously mapped hash address.

Since this method, searches for the empty position, in a linear way (i.e. sequential), is referred as linear probing.

Linear probing uses the hash function,

For example: Consider the keys to be mapped in the hash table with a size ‘7’ are 18, 72, 65, 34 and 13. Assume that division hash function is used to find the hash address for the keys specified. Use Prime number ’7’ for the modulus operation. So the primary hash function is h1(k) = (Key % 7)

Insert key 18, When i=0, h(18,0) = ( (18 % 7) + 0 ) % 7 = 4 No keys in index 4. So hashing completed.Insert key 72,When i=0, h(72,0) = ( (72 % 7) + 0 ) % 7 = 2No keys in index 2. So hashing completed.Insert key 65,When i=0, h(65,0) = ( (65 % 7) + 0 ) % 7 = 2Already there is a key in index 2. So Collision occurs. Increment i by 1 and redo hashing again.When i=1, h(65,1) = ( (65 % 7) + 1 ) % 7 = 3No keys in index 3. So hashing completed.Insert key 34,When i=0, h(34,0) = ( (34 % 7) + 0 ) % 7 = 6No keys in index 6. So hashing completed.Insert key 13,When i=0, h(13,0) = ( (13 % 7) + 0 ) % 7 = 6


h(k, i) = (h1(k) + i) mod m where ‘m’ is a table size and h1(k) is a Primary hash function.

Hashing

Already there is a key in index 6. So Collision occurs. Increment i by 1 and redo hashing again.When i=1, h(13,1) = ( (13 % 7) + 1 ) % 7 = 0No keys in index 0. So hashing completed.After the hashing of all given keys, the hash table becomes,

0 131

2 723 654 185

6 34

To search an element in the hash table, we check hash address position, corresponding to the key, if the key is not found at that position, then we linearly search the element, following the hash address position.

Even though this method seems to be simpler, this method doesn’t offer, uniform hashing. The main disadvantage of this collision resolution technique is the primary clustering problem. That is, if the hash table becomes half full, and if a collision occurs, it is difficult to find an empty location in the hash table, and hence the insertion process takes a longer time.

7.3.1.2. QUADRATIC PROBING It is similar to linear probing, except that, instead of looking just one more index

ahead each time until we find an empty index, we do the following. On the first collision we look ahead h+1 position, and place the key in the hash

table. On the second collision we look h+4 i.e. (22) position ahead, and on the third collision we look h+9 i.e. (32) positions ahead and so on. If ‘h’ is the position in the array where the collision occurs, in quadratic probing the probing sequence become h+1, h+4, h+9, … , h+i2.

Quadratic probing uses the hash function,

The initial position probed is h1(k); later positions probed are offset by amounts that depend in a quadratic manner on the probe number i. this method works much better than linear probing, but to make full use of the hash table, the values of c1, c2 and m are constrained.


h(k, i) = (h1(k) + C1i+C2i2) mod m where ‘m’ is a table size, h1(k) is a Primary hash function and C1,C2 are auxiliary constants and should be >0. i ranges from (0,1,…. m-1).

Hashing

For example: Consider the keys to be mapped in the hash table with a size ‘7’ are 18, 72, 65, 34 and 13. Assume that division hash function is used to find the hash address for the keys specified. Use Prime number ’7’ for the modulus operation. So the primary hash function is h1(k) = (Key % 7). Use c1, c2 as 1.Insert key 18, When i=0 & c1=c2 =1, h(18,0) = ( (18 % 7) + 0 + 0 ) % 7 = 4 No keys in index 4. So hashing completed.Insert key 72,When i=0 & c1=c2 =1, h(72,0) = ( (72 % 7) + 0 + 0 ) % 7 = 2No keys in index 2. So hashing completed.Insert key 65,When i=0 & c1=c2 =1, h(65,0) = ( (65 % 7) + 0 + 0 ) % 7 = 2Already there is a key in index 2. So Collision occurs. Increment i by 1 and redo hashing again.When i=1 & c1=c2 =1, h(65,1) = ( (65 % 7) + 1 + 1 ) % 7 = 4Already there is a key in index 4. So Collision occurs. Increment i by 1 and redo hashingWhen i=2 & c1=c2 =1, h(65,1) = ( (65 % 7) + (1 *2)+( 1*4) ) % 7 = 1No keys in index 1. So hashing completed.Insert key 34,When i=0 & c1=c2 =1, h(34,0) = ( (34 % 7) + 0 + 0 ) % 7 = 6No keys in index 6. So hashing completed.Insert key 13,When i=0 & c1=c2 =1, h(13,0) = ( (13 % 7) + 0 + 0 ) % 7 = 6Already there is a key in index 6. So Collision occurs. Increment i by 1 and redo hashing again.When i=1 & c1=c2 =1, h(13,1) = ( (13 % 7) + 1 + 1) % 7 = 1Already there is a key in index 1. So Collision occurs. Increment i by 1 and redo hashingWhen i=2 & c1=c2 =1, h(13,1) = ( (13 % 7) + 2 + 4) % 7 = 5No keys in index 5. So hashing completed.After the hashing of all given keys, the hash table becomes,

0

1 652 723

4 185 136 34

Quadratic probing is just as easy to implement as linear probing, and has less of a clustering effect than linear probing. The problem with quadratic probing is that it gives rise to secondary clustering. That is, this method will not search all locations in the hash


Hashing

table to find an empty slot, due to this insertion takes a longer time when compared to linear probing.

7.3.1.3. DOUBLE HASHING It is one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations. It uses a hash function of the form,

In this method, two hash functions are used. First hash function is used as like to perform the hashing. When collision occurs in first hash function then we use a second hash function which will resolve the collision.

For example: Consider the keys to be mapped in the hash table with a size ‘13’ are 69, 79, 72, 98 and 14. Assume that division hash function is used to find the hash address for the keys specified. The primary hash function is h1(k) = (Key % 13) and the secondary hash function is h2(k) = 1 + (Key % 11).

Insert key 69, When i=0, h(69,0) = ( (69 % 13) + 0 ) % 13 = 4 No keys in index 4. So hashing completed.

Insert key 79,When i=0, h(79,0) = ( (79 % 13) + 0 ) % 13 = 1No keys in index 1. So hashing completed.

Insert key 72,When i=0, h(72,0) = ( (72 % 13) + 0 ) % 13 = 7No keys in index 7. So hashing completed.

Insert key 98,When i=0, h(98,0) = ( (98 % 13) + 0 ) % 13 = 7Already there is a key in index 7. So Collision occurs. Increment i by 1 and redo hashing again.When i=1, h(98,1) = ( (98 % 13) + 1 ( 1 + 98 % 11 ) ) % 13 = 5No keys in index 5. So hashing completed.

Insert key 14,When i=0, h(14,0) = ( (14 % 13) + 0 ) % 13 = 1Already there is a key in index 1. So Collision occurs. Increment i by 1 and redo hashing again.When i=1, h(14,1) = ( (14 % 13) + 1 ( 1 + 14 % 11 ) ) % 13 = 5


h(k, i) = (h1(k) + ih2(k)) mod m where ‘m’ is a table size, h1(k) is a primary hash function and h2(k) is a secondary hash function.

Hashing

Already there is a key in index 5. So Collision occurs. Increment i by 1 and redo hashing again.When i=2, h(14,1) = ( (14 % 13) + 2 ( 1 + 14 % 11 ) ) % 13 = 9No keys in index 9. So hashing completed.After the hashing of all given keys, the hash table becomes,

01 79234 695 9867 7289 14

101112

Example:Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88 and 59 into the hash table of

length m=11using open addressing with the primary hash function h1(k) = k mod m. Apply linear probing, quadratic probing with c1=1 and c2=3 and double hashing with h2(k) = 1 + (k mod (m-1))

SolutionUsing Linear Probing, the hash table becomes

0 221 88234 45 156 287 178 599 31


Hashing

10 10

Using Quadratic Probing, the hash table becomes

0 22

1 88

2

3 17

4 4

5

6 28

7 59

8 15

9 31

10 10

Using Double hashing, the hash table becomes

0 2212 593 174 45 156 287 8889 3110 10


Hashing

7.3.1.4. REHASHING It is a process of building another table that is twice as big as the original hash table.

Next, seen down the entire original hash table, compute the new hash value for each non-deleted element and inserting it in the newly created hash table.Need

If the table gets too full, then the running time for insertion will be too long. Sometimes might fail to insert an element based on quadratic probing under open addressing hashing. This may be of too many removals intermixed with insertion.Process

Rehashing can be implemented in several ways with quadratic probing. For example, suppose the element 13, 15, 24 and 6 are inserted into the hash table of size 7, using the hash function h(x) = x mod 7.Consider linear probing to resolve the collision. So the resultant hash table becomes,

If 23 is inserted into the table then if will be over 70% full, So a new table is created with the size 17, because this is the first prime that is twice as large as the old table size. Therefore, the new hash function is h(x) = x mod 17. Next, the old table is scanned and elements are inserted into the new one. So the resultant table becomes,


Insert (23)

0 61 1523 24456 13

0 61 152 233 24456 13

0123456 67 238 249

10111213 131415 1516

Hashing

When the rehashing take place?1. Rehash as soon as the table is half full.2. Rehash only when an insertion fails.3. Rehash when the table reaches a certain load factor. This could be best,

because the performance degrade when load factor increases.Advantage

It frees the programmer from worrying about the table size and is important because hash tables cannot be made arbitrarily large in complete program.Disadvantage

It is very expensive operation. The running time is O (N), since there are N elements to rehash and table size is roughly 2N. But it happens very infrequently.

7.3.1.5. EXTENDIBLE HASHING If Open addressing or separate chaining hashing is used, the major problem is that

collisions could cause several blocks to be examined during a find, even for a well-distributed hash table. Furthermore, when the table gets too full, an extremely expensive rehashing step must be performed, which requires O(N) disk accesses.

A clever alternative, known as extendible hashing, allows a find to be performed in two disk accesses. Insertions also require few disk accesses.

We recall a B-tree has depth O(logM/2 N). As M increases, the depth of a B-tree decreases. We choose M to be so large then the depth of the B-tree would be 1. So any find after the first would take one disk access, since the root node could be stored in main memory.

The problem with this strategy is that the branching factor is so high that it would take considerable processing to determine which leaf the data was in. If the time to perform this step could be reduced, then we would have a practical scheme. This is exactly the strategy used by extendible hashing.

For example, suppose our data consists of several six-bit integers. The following

figure shows an extendible hashing scheme for these data.


Hashing

The root of the “tree” contains four pointers determine by the leading two bits of the data each leaf has up to M=4 elements. It happens that in each leaf the first two bits are identical; this is indicated by the number in parentheses.

D will represent the number of bits used by the root, which is sometimes known as the directory. The number of entries in the directory is 2D. dL is the number of leading bits that all the elements of some leaf L have in common. dL will depend on the particular leaf, and dL ≤ D.

Suppose that we want to insert the key 100100. This would go into the third leaf, but as the third leaf is already full, there is no room. We thus split this leaf into two leaves, which are now determined by the first three bits. This requires increasing the directory size to 3. These changes are reflected in the following figure.

Notice that all of the leaves not involved in the split are now pointed to by two adjacent directory entries. Thus an entire directory is rewritten and none of the other leaves is actually accessed.


Hashing

If the key 000000 is now inserted, then the first leaf is split, generating two leaves with dL = 3.since D = 3, the only change required in the directory is the updating of the directory 000 and 001 pointers. The figure becomes,

This very simple strategy provides quick access times for insert and finds operations on large databases. There are a few important details we have not considered.

First, it is possible that several directory splits will be required if the elements in a leaf agree in more than D+1 leading bits. For instance, starting at the original example, with D = 2, if 111010, 111011, and finally 111100 are inserted, the directory size must be increased to 4 to distinguish between the five keys. This is an easy detail to take care of, but most not be forgotten.

Second, there is the possibility of duplicate keys; if there are more than M duplicates, then this algorithm does not work at all. In this case, some other arrangements need to be made.

These possibilities suggest that it is important for the bits to be fairly random. This can be accomplished by hashing the keys into a reasonably long integer-hence the name.

7.3.2. SEPARATE CHAINING (OPEN HASHING)

This method maintains the chain of elements, which maps to the same hash address. An array of Pointers is used for representing the hash table. Size of hash table can

be number of keys. Here each pointer points to one linked list and the keys have same hash address will be maintained in that linked list alone.

When a data item cannot be placed at the hash address calculated by the hash function (because of collision), a chain or link is allocated and stores the key element in that chain.

This allows an unlimited number of collisions to the same hash address can be handled and also doesn’t require a prior knowledge of how many elements are stored in the hash table.

This method as referred to as separate chaining, because you have a bunch of separate chains in your hash table.


0

1

2

3

4

5

6

NULL

NULL

NULL

NULL

NULL

NULL

NULL

0

1

2

3

4

5

6

NULL

NULL

NULL

NULL

NULL

NULL

NULL

18

0

1

2

3

4

5

6

NULL72

NULL

NULL

NULL

NULL

NULL

NULL

18

Hashing

In this method of resolving collisions there is no problem of limited storage hence insertions and searching elements are carried out in less time.

For example, consider the keys (18, 72, 65, 34 and 13) are inserted in hash table of size 7 using separate chain method. The primary hash function is h1(k) = k mod m. The hash address mapped for the given keys are 4, 2, 2, 6 and 6 respectively. The hash table structure is shown below.Initially,

Insert the key 18,

Insert the key 72,


0

1

2

3

4

5

6

NULL6572

NULL

NULL

NULL

NULL

NULL

34 13 NULL

18

Hashing

After inserting all the keys,

The main advantage of hashing using separate chaining is that to get uniform and perfect collision resolution hashing. Since separate chaining is implemented using linked lists, there is no wastage of memory (since memory for hash table is allocated when needed).

The elements which have the same hash address (i.e. collide records) will be in the same chain, hence search operation is fast, compare to other collision resolution method.

7.4. REVIEW QUESTIONS & EXERCISES 1. Define the following :

a) Hashing b) Hash tablec) Hash function

2. Compare hashing with linear search and binary search techniques?3. What is the expected time for insertion, deletion, and search operations in

hash table?4. Give example of hash function?5. Differentiate division and truncation hash functions?6. Define collision? Give an example?7. Write the properties of a good hash function?8. Mention any two collision resolution techniques?9. Compare open and closed hashing techniques?10. Define load factor?11. What is the significance of load factor in the performance of hashing?12. What is meant by primary clustering and secondary clustering?


Hashing

13. Illustrate all type of hash functions discussed in this chapter on the list of keys 4566234, 23456,12356, 67894, 12345, 345, 89467, and a hash table of size 10 using linear probing.

14. Why we go for quadratic probing?15. How do you handle overflow in hashing?16. What is a purpose of double hashing?17. When the rehashing take place?18. Compare chaining and linear probing?19. Using a hash table with 7 locations and hash function h(i)=i mod 7. Show

the hash table that results when the following integers are inserted in the order given: 46, 62, 26, 24, 82, 35, 19. Use linear probing, Quadratic probing, and Separate chaining to resolve collision if any.

20. Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length m=11 using open addressing with the primary hash function h(k)=k mod m. Illustrate the result of inserting these keys using linear probing, using quadratic probing with C1=1 and C2=3 and using double hashing with h2(k)=1+(k mod (m-1)).


Documents

anandhavalli.files.wordpress.com … · Web viewIf we store an element on to the array with size 10 then the calculated hash value should be within {0 to 9}. Store the elements