Chapter 10: Hashing
Objectives
Looking ahead – in this chapter, we’ll consider
• Hash Functions
• Collision Resolution
• Deletion
• Perfect Hash Functions
• Rehashing
2 Data Structures and Algorithms in C++, Fourth Edition
Introduction
• In earlier chapters, the main process used by the searching techniques was comparing keys
• In sequential search for instance, the table storing the elements is searched in order, using key comparisons to determine a match
• In binary searching, we successively divide the table into halves to determine which cell to check, and again use key comparison to determine a match
• In binary search trees, the direction to take in the tree is determined by comparing keys in the nodes
• A different way to search can be based on calculating the position of the key in the table, based on the key’s value
Introduction (continued)
• Since the value of the key is the only indication of position, if the key is known, we can access the table directly
• This reduces the search time from O(n) or O(lg n) to a single access, that is, O(1)
• No matter how many elements there are, the run time is the same
• Unfortunately, this is just an ideal; in real applications we can only approximate this
• The task is to develop a function, h, that can transform a key, K, into an index for a table used to store items of the same type as K
• The function h is called a hash function
Introduction (continued)
• If h is able to transform different key values into different hash values, it is called a perfect hash function
• For the hash function to be perfect, the table must have as many positions as there are items to be hashed
• However, it is not always possible to know how many elements will be hashed in advance, so some estimating is needed
• Consider a symbol table for a compiler, to store all the variable names
• Given the nature of the variable names typically used, a table with 1000 positions may be more than adequate
• However, even if we wanted to handle all possible variable names, we still need to design an appropriate h
Introduction (continued)
• For example, we could define h to be the sum of the ASCII values of the letters in the variable name
• If we restrict variable names to 31 letters, we will need 3782 positions, since a variable of 31 characters, all “z” (ASCII code 122), would sum to 31 ∙ 122 = 3782
• Even then, the function will not produce unique values, for h(“abc”) = 97 + 98 + 99 = 294, and h(“acb”) = 97 + 99 + 98 = 294
• Such a clash of keys is called a collision, and the frequency of collisions is a measure of the usefulness of a hash function
• Avoiding collisions can be achieved by making h more complex, but complexity and speed must be balanced
Hash Functions
• The total possible number of hash functions for n items assigned to m positions in a table (n < m) is mⁿ
• The number of perfect hash functions is equal to the number of different placements of these items, and is m!/(m − n)!
• With 50 elements and a 100-position array, we would have a total of 100⁵⁰ = 10¹⁰⁰ hash functions and about 10⁹⁴ perfect hash functions (about 1 in a million)
• Most of the perfect hashes are impractical and cannot be expressed in a simple formula
• However, even among those that can, a number of possibilities exist
Hash Functions (continued)
• Division
– Hash functions must guarantee that the value they produce is a valid index to the table
– A fairly easy way to ensure this is to use modular division, and divide the keys by the size of the table, so h(K) = K mod TSize where TSize = sizeof(table)
– This works best if the table size is a prime number, but if not, we can use h(K) = (K mod p) mod TSize for a prime p > TSize
– However, nonprimes work well for the divisor provided they do not have any prime factors less than 20
– The division method is frequently used when little is known about the keys
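The division method above can be sketched as follows; the table sizes 101 and 127 used below are illustrative primes, not values from the text.

```cpp
#include <cassert>

// Division method: reduce the key modulo the table size to obtain a
// valid index in [0, TSize). This works best when TSize is prime.
unsigned int hashDiv(unsigned int key, unsigned int tSize) {
    return key % tSize;
}

// Variant for a nonprime TSize: first reduce by a prime p > TSize.
unsigned int hashDivPrime(unsigned int key, unsigned int p, unsigned int tSize) {
    return (key % p) % tSize;
}
```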
Hash Functions (continued)
• Folding
– In folding, the keys are divided into parts which are then combined (or “folded”) together and often transformed into the address
– Two types of folding are used, shift folding and boundary folding
– In shift folding, the parts are placed underneath each other and then processed (for example, by adding)
– Using a Social Security number, say 123-45-6789, we can divide it into three parts (123, 456, and 789) and add them to get 1368
– This can then be divided modulo TSize to get the address
– With boundary folding, the key is visualized as being written on a piece of paper and folded on the boundaries between the parts
Hash Functions (continued)
• Folding (continued)
– The result is that alternating parts of the key are reversed, so the Social Security number parts become 123, 654, 789, totaling 1566
– As can be seen, in both versions the key is divided into even-length parts of some fixed size, plus any leftover digits
– Then these are added together and the result is divided modulo the table size
– Consequently this is very fast and efficient, especially if bit strings are used instead of numbers
– With character strings, one approach is to exclusive-or the individual characters together and use the result
– In this way, h(“abcd”) = “a” ⋁ “b” ⋁ “c” ⋁ “d”
Hash Functions (continued)
• Folding (continued)
– However, this is limited, because it will only generate values between 0 and 127
– A better approach is to use chunks of characters, where each chunk has as many characters as there are bytes in an integer
– On the IBM PC, integers are often 2 bytes long, so h(“abcd”) = “ab” ⋁ “cd”, which would then be divided modulo TSize
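The two folding schemes for the Social Security number example can be sketched as follows; this is a minimal sketch assuming numeric keys supplied as digit strings, and the function names are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Shift folding: split the digit string into parts of partLen digits
// (plus any leftover digits) and add the parts together.
int shiftFold(const std::string& key, int partLen) {
    int sum = 0;
    for (std::size_t i = 0; i < key.size(); i += partLen)
        sum += std::stoi(key.substr(i, partLen));
    return sum;
}

// Boundary folding: the same, but every other part is reversed, as if
// the key were folded along the part boundaries.
int boundaryFold(const std::string& key, int partLen) {
    int sum = 0;
    bool reversed = false;
    for (std::size_t i = 0; i < key.size(); i += partLen) {
        std::string part = key.substr(i, partLen);
        if (reversed)
            std::reverse(part.begin(), part.end());
        sum += std::stoi(part);
        reversed = !reversed;
    }
    return sum;
}
```

For 123-45-6789 split into 123, 456, and 789, shift folding yields 1368 and boundary folding 1566, matching the totals above; either sum would then be divided modulo TSize.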
Hash Functions (continued)
• Mid-Square Function
– In the mid-square approach, the numeric value of the key is squared and the middle part is extracted to serve as the address
– If the key is non-numeric, some type of preprocessing needs to be done to create a numeric value, such as folding
– Since the entire key participates in generating the address, there is a better chance of generating different addresses for different keys
– So if the key is 3121, 3121² = 9,740,641, and if the table has 1000 locations, h(3121) = 406, the middle part of 3121²
– In application, powers of two are more efficient for the table size, and the middle of the bit string of the square of the key is used
– Assuming a table size of 1024, 3121² is represented by the bit string 1001010 0101000010 1100001; the middle 10 bits, 0101000010, give the address 322
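The power-of-two variant can be sketched as below; here bits is lg TSize (10 for TSize = 1024) and totalBits is the width of the square (24 bits for 3121²). The parameterization is ours.

```cpp
#include <cassert>

// Mid-square for a power-of-two table: square the key and extract the
// middle `bits` bits of a square that occupies `totalBits` bits.
unsigned int midSquare(unsigned int key, unsigned int bits, unsigned int totalBits) {
    unsigned long long square = (unsigned long long)key * key;
    unsigned int shift = (totalBits - bits) / 2;   // low-order bits to drop
    return (unsigned int)((square >> shift) & ((1u << bits) - 1));
}
```

midSquare(3121, 10, 24) extracts the middle 10 bits of 9,740,641, giving 322 as in the example above.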
Hash Functions (continued)
• Extraction
– In the extraction approach, the address is derived by using a portion of the key
– Using the SSN 123-45-6789, we could use the first four digits, 1234, the last four, 6789, or the first two combined with the last two, 1289
– Other combinations are also possible, but each time only a portion of the key is used
– With careful choice of digits, this may be sufficient for address generation
– For example, some universities give international students ID numbers beginning with 999; ISBNs start with digits representing the publisher
– So these could be excluded from the address generation if the nature of the data is appropriately limited
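The first-two-plus-last-two variant from the SSN example can be sketched as follows (digit-string keys; the helper name is ours):

```cpp
#include <cassert>
#include <string>

// Extraction: form the address from a portion of the key, here the
// first two digits concatenated with the last two.
int extractFirstLast(const std::string& key) {
    return std::stoi(key.substr(0, 2) + key.substr(key.size() - 2));
}
```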
Hash Functions (continued)
• Radix Transformation
– With radix transformation, the key is transformed into a different base
– For instance, if K is 345 in decimal, its value in base 9 is 423
– This result is then divided modulo TSize, and the resulting value becomes the address to which K is hashed
– The drawback to this approach is that collisions cannot be avoided
– For example, if TSize is 100, then although 345 and 245 in decimal will not collide, 345 and 264 will, because 264 is 323 in base 9
– Since 345 in base 9 is 423, these two values collide when divided modulo 100
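The transformation just described can be sketched as below: write K in base 9, read the digits back as a decimal number, and reduce modulo TSize.

```cpp
#include <cassert>

// Radix transformation: the base-9 digits of the key, read back as a
// decimal number, then reduced modulo the table size.
unsigned int radixHash(unsigned int key, unsigned int tSize) {
    unsigned int digitsAsDecimal = 0;
    unsigned int place = 1;
    while (key > 0) {
        digitsAsDecimal += (key % 9) * place;   // next base-9 digit
        key /= 9;
        place *= 10;
    }
    return digitsAsDecimal % tSize;
}
```

This reproduces the collision above: 345 becomes 423 and 264 becomes 323, and both give 23 modulo 100.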
Hash Functions (continued)
• Universal Hash Functions
– When little is known about the keys, a universal class of hash functions can be used
– Functions are universal when a randomly chosen member of the class will be expected to distribute a sample evenly, guaranteeing low collisions
– This idea was first considered by Larry Carter and Mark Wegman in 1979
– We define H as a class of functions from a set of keys to a hash table of size TSize
– H is called universal if, for any pair of distinct keys, a function chosen at random from H maps them to the same position in the table with probability 1/TSize
– This basically means there is one chance in TSize that two randomly chosen keys collide when hashed with a randomly chosen function
Hash Functions (continued)
• Universal Hash Functions (continued)
– One example of such a class of functions is defined for a prime number p > |keys| and random numbers a and b:

H = {ha,b(K): ha,b(K) = ((aK + b) mod p) mod TSize and 0 ≤ a, b < p}

– Another class of functions is for keys considered as sequences of bytes, K = K0K1 . . . Kr−1
– For a prime p > 2⁸ = 256 and a sequence a = a0, a1, . . . , ar−1, the class is
H = {ha(K): ha(K) = ((a0K0 + a1K1 + · · · + ar−1Kr−1) mod p) mod TSize and 0 ≤ a0, a1, . . . , ar−1 < p}
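A member of the first class, h(a,b), can be sketched as follows; the prime p = 101 and the coefficients a = 34, b = 57 are illustrative choices, assuming keys smaller than p.

```cpp
#include <cassert>

// One member h_{a,b} of a universal class:
// h_{a,b}(K) = ((aK + b) mod p) mod TSize, with 0 <= a, b < p.
unsigned int universalHash(unsigned int key, unsigned int a, unsigned int b,
                           unsigned int p, unsigned int tSize) {
    return ((a * key + b) % p) % tSize;
}
```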
Collision Resolution
• The hashing we’ve looked at so far does have problems with multiple keys hashing to the same location in the table
• For example, consider a function that places names in a table based on hashing the ASCII code of the first letter of the name
• Using this function, all names beginning with the same letter would hash to the same position
• If we attempt to improve the function by hashing the first two letters, we achieve better results, but still have problems
• In fact, even if we used all the letters in the name, there is still a possibility of collisions
• Also, while using all the letters of the name gives a better distribution, if the table only has 26 positions there is no improvement in using the other versions
Collision Resolution (continued)
• So in addition to using more efficient functions, we also need to consider the size of the table being hashed into
• Even then, we cannot guarantee to eliminate collisions; we have to consider approaches that assure a solution
• A number of methods have been developed; we will consider a few in the following slides
Collision Resolution (continued)
• Open Addressing
– In open addressing, collisions are resolved by finding an available table position other than the one to which the key hashed
– If the position h(K) is already occupied, positions are tried in the probing sequence

norm(h(K) + p(1)), norm(h(K) + p(2)), . . . , norm(h(K) + p(i)), . . .

until an open location is found, the same positions are tried again, or the table is full
– The function p is called a probing function, i is the probe number, and norm is a normalization function, often division modulo the table size
– The simplest realization of this is linear probing, where the search proceeds sequentially from the point of the collision
– If the end of the table is reached before finding an empty cell, the search continues from the beginning of the table
– If it reaches the cell immediately before the one that caused the collision, it stops
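Linear probing insertion can be sketched as below, with integer keys, h(K) = K mod TSize, and −1 marking an empty cell; all the names are ours.

```cpp
#include <cassert>
#include <vector>

// Linear probing: start at the home position and scan forward, wrapping
// around the table, until an empty cell (-1) is found.
// Returns the index used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    int tSize = (int)table.size();
    int home = key % tSize;
    for (int i = 0; i < tSize; ++i) {
        int pos = (home + i) % tSize;
        if (table[pos] == -1) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;   // table full
}
```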
Collision Resolution (continued)
• Open Addressing (continued)
– The drawback to linear probing is that clusters of displaced keys tend to form
– This is illustrated in Figure 10.1, where keys Ki are hashed to locations i
– In Figure 10.1a, three keys have been hashed to their home locations
– In Figure 10.1b, the key B5 arrives, but since A5 is stored there, it is moved to the next location
– Then A9 is stored without trouble, but when B2 arrives, it has to be placed in location 4, and a large cluster is forming (Figure 10.1b)
– When B9 arrives, it has to be placed at the beginning of the table, and finally, when C2 shows up, it is placed five locations away from its home address
Collision Resolution (continued)
Fig. 10.1 Resolving collisions with the linear probing method. Subscripts indicate the home positions of the keys being hashed
Collision Resolution (continued)
• Open Addressing (continued)
– As can be seen from the figure, empty cells immediately following clusters tend to be filled more quickly than other locations
– So if a cluster is created, it tends to grow, and as it grows, it increases the likelihood of growing even larger
– This behavior significantly reduces the efficiency of the hash table for processing data
– So to avoid cluster creation and buildup, a better choice of the probing function, p, needs to be found
– One possibility is to use a quadratic function, producing the formula

p(i) = h(K) + (−1)ⁱ⁻¹((i + 1)/2)² for i = 1, 2, . . . , TSize − 1

– Expressed as a sequence of probes, this is

h(K) + i², h(K) − i² for i = 1, 2, . . . , (TSize − 1)/2
Collision Resolution (continued)
• Open Addressing (continued)
– Starting with the first hash, this produces the sequence

h(K), h(K) + 1, h(K) − 1, h(K) + 4, h(K) − 4, . . . , h(K) + (TSize − 1)²/4, h(K) − (TSize − 1)²/4

– Each of these values is divided modulo TSize
– Because a given h(K) tries only the even or only the odd positions of an even-sized table, the size of the table should not be an even number
– The ideal value for the table size is a prime of the form 4j + 3, where j is an integer
– This will guarantee that all the table locations will be checked in the probing process
– Applying this to the example of Figure 10.1 yields the configuration in Figure 10.2; B2 still takes two probes, but C2 only takes four
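The i-th probe of this ±i² sequence can be sketched as:

```cpp
#include <cassert>

// Quadratic probing: the i-th probe adds +1, -1, +4, -4, +9, -9, ...
// to the home position, i.e. (-1)^(i-1) * ((i+1)/2)^2, then normalizes
// modulo TSize (keeping the result nonnegative).
int quadraticProbe(int home, int i, int tSize) {
    int step = (i + 1) / 2;
    int offset = (i % 2 == 1) ? step * step : -step * step;
    return ((home + offset) % tSize + tSize) % tSize;
}
```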
Collision Resolution (continued)
Fig. 10.2 Using quadratic probing for collision resolution
Collision Resolution (continued)
• Open Addressing (continued)
– Notice that though we obtain better results with this approach, we don’t avoid clustering entirely
– This is because the same probe sequence is used for every key that collides at a given position, creating secondary clusters
– These are less of a problem than primary clusters, however
– Another variation is to have p be a random number generator (RNG)
– This eliminates the need for special conditions on the table size, and does prevent secondary cluster formation
– However, there is an issue with reproducing the same probing sequence for the same key
Collision Resolution (continued)
• Open Addressing (continued)
– If the RNG is simply initialized at its first invocation, different probing sequences will occur for the same key K
– As a result, the key may be hashed more than once into the table, and there is a chance it will not be found when searched for
– So the RNG needs to be initialized to the same seed for the same key before the probing sequence is generated
– In C++ this can be done using the srand() function with a parameter that is dependent on the key
– An RNG could be written that generates values between 1 and TSize − 1
– The following code, due to Robert Morris, works for tables with TSize = 2ⁿ for an integer n; rendered in C++ (with n a constant known to the program) it is

    int generateNumber() {
        static int r = 1;
        r = 5 * r;
        r &= (1 << (n + 2)) - 1;   // keep only the n + 2 low-order bits
        return r / 4;
    }
Collision Resolution (continued)
• Open Addressing (continued)
– The best approach to secondary clustering is the technique of double hashing
– This utilizes two hash functions, one for the primary hash and the other to resolve collisions
– In this way the probing sequence becomes

h(K), h(K) + hp(K), . . . , h(K) + i ∙ hp(K), . . .

– Here, h is the primary hash function and hp is the secondary hash
– The table size should be a prime number so every location is included in the sequence, since the values above are divided modulo TSize
– Empirical evidence shows that this approach works well to eliminate secondary clustering, since the probe sequence is based on hp
Collision Resolution (continued)
• Open Addressing (continued)
– This is because the probing sequence for a key K1 hashed to location j is

j, j + hp(K1), j + 2 ∙ hp(K1), . . .

– And if another key K2 hashes to j + hp(K1), the next location to be checked is j + hp(K1) + hp(K2), not j + 2 ∙ hp(K1)
– This avoids secondary clustering, as long as hp is well chosen
– So even if two keys hash to the same position initially, the probing sequences can be different for each key
– The use of two different hash functions can be time-consuming, so it is possible to define the second hash in terms of the first
– For example, the function could be hp(K) = i ∙ h(K) + 1; for a key K1 with h(K1) = j, the probe sequence is j, 2j + 1, 5j + 2, . . . ; if K2 hashes to 2j + 1, its sequence is 2j + 1, 4j + 3, 10j + 7, . . .
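A double-hashing probe sequence can be sketched as below; the secondary hash hp(K) = 1 + K mod (TSize − 2) is an illustrative choice (not from the text) that can never be zero.

```cpp
#include <cassert>

// Double hashing: the i-th probe is h(K) + i * hp(K), modulo TSize,
// with h(K) = K mod TSize and an illustrative secondary hash hp.
int doubleHashProbe(int key, int i, int tSize) {
    int h = key % tSize;
    int hp = 1 + key % (tSize - 2);   // never zero, so probing always advances
    return (h + i * hp) % tSize;
}
```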
Collision Resolution (continued)
• Open Addressing (continued)
– Figure 10.4 shows the average number of searches for different percentages of occupied cells

Fig. 10.4 The average numbers of successful searches and unsuccessful searches for different collision resolution methods
Collision Resolution (continued)
• Chaining
– In chaining, the keys are not stored in the table, but in the info portion of a linked list of nodes associated with each table position
– This technique is called separate chaining, and the table is called a scatter table
– This way the table never overflows, as the lists are extended when new keys arrive, as can be seen in Figure 10.5
– This is very fast for short lists, but as they increase in size, performance can degrade sharply
– Gains in performance can be made if the lists are ordered so unsuccessful searches don’t traverse the entire list, or by using self-organizing linked lists
– This approach requires additional space for the pointers, so if there are a large number of keys involved, space requirements can be high
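Separate chaining can be sketched with standard-library lists, assuming nonnegative integer keys and h(K) = K mod TSize; the class name is ours.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Separate chaining: each table position holds a linked list of the
// keys that hash to it; collisions simply extend the list.
struct ChainedTable {
    std::vector<std::list<int>> slots;
    explicit ChainedTable(std::size_t tSize) : slots(tSize) {}
    void insert(int key) {
        slots[key % slots.size()].push_front(key);
    }
    bool search(int key) const {
        const std::list<int>& chain = slots[key % slots.size()];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }
};
```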
Collision Resolution (continued)
• Chaining (continued)
Fig. 10.5 In chaining, colliding keys are put on the same linked list.
– A variation of chaining called coalesced hashing (or coalesced chaining) combines chaining and linear probing
– In this approach, the first available position is found for the colliding key, and the index of this position is stored in the next field of the key it collided with
Collision Resolution (continued)
• Chaining (continued)
– This way the next element in the list can be accessed without doing a sequential search down the table
– Each position in the table is an object that consists of two elements: info for the key, and next, which is the index of the key that collided with this position
– Available positions can be marked in their next fields, and a unique value can be used to mark the end of a chain
– This requires less space than chaining, but the number of keys that can be hashed is limited by the table size
– For keys for which there is no room in the table, an overflow area known as the cellar can be allocated within the array
– Figure 10.6 shows an example of coalesced hashing where a key ends up in the last position of the table
Collision Resolution (continued)
• Chaining (continued)
Fig. 10.6 Coalesced hashing puts a colliding key in the last available position of the table
– There is no collision in Figure 10.6a; when B5 arrives it is placed at the end of the table and linked to A5, and when A9 arrives it is linked to B5
– Figure 10.6b shows this; Figure 10.6c shows two more arriving keys
Collision Resolution (continued)
• Chaining (continued) – Figure 10.7 shows the use of a cellar
Fig. 10.7 Coalesced hashing that uses a cellar
Collision Resolution (continued)
• Chaining (continued)
– Keys that hash without colliding are stored in their home locations, as can be seen in Figure 10.7a
– Keys that collide are put in the last available position in the cellar and linked back to their home position, shown in Figure 10.7b
– If the cellar is filled, available cells are used in the table, as seen in Figure 10.7c
Collision Resolution (continued)
• Bucket Addressing
– Yet another approach to collision handling is to store all the colliding elements in the same position in the table
– This is done by allocating a block of space, called a bucket, with each address
– Each bucket is capable of storing multiple items
– However, even with buckets, we cannot avoid collisions, because buckets can fill up, requiring items to be stored elsewhere
– If open addressing is incorporated into the design, the item can be stored in the next bucket if space is available, using linear probing
– This is shown in Figure 10.8
Collision Resolution (continued)
• Bucket Addressing (continued)
Fig. 10.8 Collision resolution with buckets and linear probing method
– Colliding items can instead be stored in an overflow area, in which case the bucket includes a field to indicate whether a search should consider that area or not
– If used with chaining, the field could indicate the beginning of the list in the overflow area associated with the bucket, as shown in Figure 10.9
Collision Resolution (continued)
Fig. 10.9 Collision resolution with buckets and overflow area
Deletion
• How can data be removed from a hash table?
• If chaining is used, the deletion of an element entails deleting the node from the linked list holding the element
• For the other techniques we’ve considered, deletion usually involves more careful handling of collision issues, unless a perfect hash function is used
• This is illustrated in Figure 10.10a, which stores keys using linear probing
• In Figure 10.10b, when A4 is deleted, attempts to find B4 check location 4, which is empty, indicating B4 is not in the table
• A similar situation occurs in Figure 10.10c, when A2 is deleted, causing searches for B1 to stop at position 2
Deletion (continued)
Fig. 10.10 Linear search in the situation where both insertion and deletion of keys are permitted
– A solution to this is to leave the deleted keys in the table with some type of indicator that the keys are not valid
– This way, searches for elements won’t terminate prematurely
– When new keys are inserted, they can overwrite the marked keys
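The marker idea can be sketched for linear probing as below; the EMPTY and DELETED sentinels, integer keys, and all names are ours.

```cpp
#include <cassert>
#include <vector>

// Lazy deletion with linear probing: deleted cells are marked rather
// than emptied, so later searches probe past them, and inserts reuse them.
enum Cell { EMPTY = -1, DELETED = -2 };

struct ProbedTable {
    std::vector<int> cells;
    explicit ProbedTable(int tSize) : cells(tSize, EMPTY) {}
    void insert(int key) {
        int n = (int)cells.size();
        for (int i = 0; i < n; ++i) {
            int pos = (key % n + i) % n;
            if (cells[pos] == EMPTY || cells[pos] == DELETED) {
                cells[pos] = key;
                return;
            }
        }
    }
    bool search(int key) const {
        int n = (int)cells.size();
        for (int i = 0; i < n; ++i) {
            int pos = (key % n + i) % n;
            if (cells[pos] == EMPTY) return false;   // DELETED does not stop the search
            if (cells[pos] == key) return true;
        }
        return false;
    }
    void remove(int key) {
        int n = (int)cells.size();
        for (int i = 0; i < n; ++i) {
            int pos = (key % n + i) % n;
            if (cells[pos] == EMPTY) return;
            if (cells[pos] == key) { cells[pos] = DELETED; return; }
        }
    }
};
```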
Deletion (continued)
• The drawback to this is that if the table has far more deletions than insertions, it will become overloaded with deleted records, slowing down search times, because open addressing requires testing each element
• So the table needs to be purged periodically by moving undeleted elements to cells occupied by deleted elements
• Those cells containing deleted elements not overwritten can then be marked as available
• This is shown in Figure 10.10d
Perfect Hash Functions
• All the examples we’ve considered to this point have assumed the data being hashed is not completely known
• Consequently, the hashing that took place only coincidentally turned out to be ideal in that collisions were avoided
• The majority of the time collisions had to be resolved because of conflicting keys
• In addition, the number of keys is usually not known in advance, so the table size had to be large enough
• Table size also played a role in the number of collisions; larger tables had fewer collisions if the hash took this into account
• All these factors were the result of not knowing ahead of time about the body of data to be hashed
Perfect Hash Functions (continued)
• Therefore the hash function was developed first and then the data was processed into the table
• In a number of cases, though, the data is known in advance, and the hash function can be derived after the fact
• This function may turn out to be a perfect hash if items hash on the first try
• Additionally, if the function uses only as many cells as are available in the table with no empty cells left after the hash, it is called a minimal perfect hash function
• Minimal perfect hash functions avoid the need for collision resolution and also avoid wasting table space
Perfect Hash Functions (continued)
• Processing fixed bodies of data occurs frequently in a variety of applications
• Dictionaries and tables of reserved words in compilers are among several examples
• Creating perfect hash functions requires a great deal of work
• Such functions are rare; as we saw earlier, for 50 elements in a 100-position array, only 1 function in a million is perfect
• All the other functions will lead to collisions
Perfect Hash Functions (continued)
• Cichelli’s Method
– Richard J. Cichelli developed one technique for constructing minimal perfect hash functions in 1980
– It is used to hash a fairly small number of reserved words, and has the form
h(word) = (length(word) + g(firstletter(word)) + g(lastletter(word))) mod TSize
– In the expression, g is the function to be constructed; it assigns values to letters so that the function h returns unique hash values
– The values that g assigns to particular letters do not have to be unique
– There are three parts to the algorithm: computation of letter occurrences, ordering words, and searching
– The last step is critical, and uses an auxiliary function, try()
Perfect Hash Functions (continued)
• Cichelli’s Method (continued) – The pseudocode for the algorithm is shown on pages 563 and 564
– It is used to construct the hash function for the names of the nine Muses: Calliope, Clio, Erato, Euterpe, Melpomene, Polyhymnia, Terpsichore, Thalia, and Urania
– Counting the first and last letters of each name gives the distribution A (3), C (2), E (6), M (1), O (2), P (1), T (2), and U (1)
– Based on this, we can order the names as follows: Euterpe, Calliope, Erato, Terpsichore, Melpomene, Thalia, Clio, Polyhymnia, and Urania
– Following this, the search() routine is applied, as shown in Figure 10.11, with the value of Max set to 4
– Euterpe is tried first and given the g-value 0, so h(Euterpe) = 7, which is put on the reserved hash value list
Perfect Hash Functions (continued)
• Cichelli’s Method (continued)
– All the other values go well until the last name, Urania, is tried and fails to hash after all five g-values are tried
– At this point, the routine backtracks to the previous step, where Polyhymnia was tried, and detaches its current hash value from the list
– A g-value of 1 is tried for P, which fails, but a value of 2 gives a hash value of 3, which works, and the algorithm proceeds
– Urania is then tried again, and on the fifth attempt, it is successful
– The search process completes with each name assigned a unique hash value
– Using the g-values A = C = E = O = M = T = 0, P = 2, and U = 4, h is a minimal perfect hash function for the nine Muses
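These g-values can be checked directly; the sketch below recomputes h for any of the Muses (names are case-normalized, and the helper name is ours).

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Cichelli's hash for the nine Muses: word length plus the g-values of
// the first and last letters, modulo TSize = 9.
int cichelliHash(const std::string& word, const std::map<char, int>& g) {
    char first = (char)std::toupper(word.front());
    char last = (char)std::toupper(word.back());
    return (int)(word.size() + g.at(first) + g.at(last)) % 9;
}
```

With the g-values above, the nine names hash to the nine distinct values 0 through 8.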
Perfect Hash Functions (continued)
• Cichelli’s Method (continued)
Fig. 10.11 Subsequent invocations of the searching procedure with Max = 4 in Cichelli’s algorithm assign the indicated values to letters and to the list of reserved hash values; the asterisks indicate failures
Perfect Hash Functions (continued)
• Cichelli’s Method (continued)
– This algorithm cannot practically be applied to a large number of words because its exhaustive search is exponential
– It also does not guarantee a perfect hash function will be found
– For small numbers of words it works well, however, usually needing to run only once and yielding a hash function that can be used in other programs
– There have also been a number of successful attempts to extend this technique and address the limitations
– One approach, by Robert Sebesta and Mark Taylor (1986), modified the terms used in defining the function, using the alphabetical position of the second to last letter in the word
– Another definition, by Gary Haggard and Kevin Karplus (1986) is:
h(word) = length(word) + g1(firstletter(word)) + · · · + gm(lastletter(word)), where m = length(word)
Perfect Hash Functions (continued)
• Cichelli’s Method (continued)
– Another modification partitions the body of data into separate buckets for which minimal perfect hash functions are found
– The partitioning is carried out by a grouping function, gr, which indicates the bucket each word belongs to
– A general hash function of the form
h(word) = bucketgr(word) + hgr(word)(word)
is then generated, as suggested by Ted Lewis and Curtis Cook (1986)
– The drawback to this approach is the difficulty in finding grouping functions that support the creation of minimal perfect hash functions
– Neither of these approaches is completely successful if it relies on the algorithm developed by Cichelli
– Although Cichelli advocated brute force as a last resort, most efforts focus on finding more efficient searching algorithms, such as the FHCD algorithm
Perfect Hash Functions (continued)
• The FHCD Algorithm – Thomas Sager proposed a modification of Cichelli’s method in 1985.
– The FHCD algorithm (named for its developers, Edward Fox, Lenwood Heath, Qi Chen, and Amjad Daoud) is an extension of Sager’s work, and was proposed in 1992
– It searches for minimal perfect hash functions of the form:
h(word) = h0(word) + g(h1(word)) + g(h2(word))
– The result is divided modulo TSize, and g is the function determined by the algorithm
– Three tables of random numbers, T0, T1, and T2 are defined, one for each corresponding function hi
– Each word is a string of characters c1c2 . . . cm and corresponds to a triple (h0(word), h1(word), h2(word))
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – The elements of the triple are calculated using the formulas
h0 = (T0(c1) + · · · + T0(cm)) mod n
h1 = (T1(c1) + · · · + T1(cm)) mod r
h2 = ((T2(c1) + · · · + T2(cm)) mod r) + r
– In these formulas
• n is the number of all the words in the body of data
• r is a parameter typically < n/2
• Ti(cj) is the number generated in table Ti for character cj
– The function g is then found in three steps: mapping, ordering and searching
– In mapping, n triples of the form (h0(word), h1(word), h2(word)) are formed
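The mapping step can be sketched as follows; the tables Ti of random character values are stubbed here with fixed arithmetic (an illustrative stand-in for the rand()-generated tables), with n = 9 and r = 3 as in the Muses example below.

```cpp
#include <cassert>
#include <string>

// One component hash of the FHCD triple (h0, h1, h2):
// h0 is reduced mod n, h1 mod r, and h2 mod r shifted into r..2r-1.
int componentHash(const std::string& word, int which, int n, int r) {
    int sum = 0;
    for (char c : word)
        sum += (c * (which + 7)) % 251;   // stand-in for Ti(c)
    if (which == 0) return sum % n;
    if (which == 1) return sum % r;
    return sum % r + r;                    // h2 lies in r .. 2r - 1
}
```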
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– The randomness of the hi functions typically guarantees the uniqueness of the triples; if they are not unique, new Ti tables are created
– A dependency graph is then constructed; this is a bipartite graph
– Half the vertices corresponding to values of h1 are labeled 0 to r – 1; the other half correspond to h2 and are labeled r to 2r – 1
– The edge of the graph between vertices h1(word) and h2(word) corresponds to a word
– To illustrate this, the names of the nine Muses are again used in Figure 10.12
– The three tables, Ti, are created using the standard rand() function, and from these, nine triples are created, shown in Figure 10.12a
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
Fig. 10.12 Applying the FHCD algorithm to the names of the nine Muses
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – The corresponding dependency graph (with r = 3) is shown in Figure 10.12b
– Note some vertices cannot be connected to other vertices, and some pairs can be connected with more than one arc
– In the ordering step the vertices are rearranged so they can be partitioned into a series of levels
– Once a sequence v1, . . . , vt of vertices is established, a level K(vi) of keys is defined as the set of all the edges (words) that connect vi with those vj for which j < i
– The sequence is started with the vertex of maximal degree
– Then, for each position i of the sequence, a vertex vi is chosen from among the vertices of maximal degree that have at least one connection to the vertices v1, . . . , vi–1
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – If no vertex of that form can be found, any vertex of maximal degree is chosen from among the unselected vertices
– An example of this is shown in Figure 10.12c
– Finally, in the searching step, the keys are assigned hash values level by level
– The first vertex has its g value chosen randomly from among the numbers 0, …, n – 1
– For the other vertices, we have the relation: if vi < r, then vi = h1(word) for the words in level K(vi)
– So each word in level K(vi) has the same value g(h1(word)) = g(vi)
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – In addition, g(h2(word)) is already defined, because h2(word) equals some vj that has been processed
– By analogy, we can apply this to the case of vi ≥ r and vi = h2, so for each word either g(h1(word)) or g(h2(word)) is known
– For each level, the second g value is found randomly so the values obtained from the formula for h indicate available table positions
– Since the first choice of random number will not always fit all the words on a particular level into the table, several numbers may need to be tried
– In considering the hashing of the names of the nine Muses, the search step starts by randomly choosing g(v1)
– For v1 = 2, we let g(2) = 2; the next vertex is v2 = 5, so K(v2) = {Erato}
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – From Figure 10.12a, h0(Erato) = 3, and because v1 and v2 are connected by the edge Erato, either h1(Erato) or h2(Erato) must be equal to v1
– From the figure, we can see h1(Erato) = 2 = v1; so g(h1(Erato)) = g(v1) = 2 and we can randomly choose g(v2) = g(h2(Erato)) = 6
– Thus, h(Erato) = (h0(Erato) + g(h1(Erato)) + g(h2(Erato))) mod TSize = (3 + 2 + 6) mod 9 = 2
– Consequently, position 2 of the table is no longer open, and the new g-value, g(5) = 6, is retained for later use
– Next, v3 = 1 is tried, with K(v3) = {Calliope, Melpomene}
– From the table of triples, the h0 values are retrieved for both words, and since h2 = v2 for both of them, the g(h2) values are 6 for both words
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – Next, a random g(h1) value must be found that will produce two numbers other than 2 when h computes the hash function for the two words, because location 2 is occupied
– For purposes of the example, assume the number is 4, so h(Calliope) = 1 and h(Melpomene) = 4
– A summary of these steps is shown in Figure 10.12d; Figure 10.12e shows the values of the function g
– By means of this function g, the function h becomes a minimal perfect hash function
– However, because g is presented as a table and not a formula, it has to be stored so it can be used every time the function h is needed
– This may be a problem
Perfect Hash Functions (continued)
• The FHCD Algorithm (continued) – The reason is that as r increases, so does the size of the domain of the function g: {0, …, 2r – 1} → {0, …, n – 1}
– Recall from earlier that r is approximately n/2, so for large databases the table storing all the values of g can be of significant size
– Consequently, this table needs to be kept in main memory to make hash computations efficient
Rehashing
• When hash tables become full, no more items can be added
• As they fill up and reach certain levels of occupancy (saturation), their efficiency falls due to increased searching needed to place items
• A solution to these problems is rehashing, allocating a new, larger table, possibly modifying the hash function (and at least TSize), and hashing all the items from the old table to the new
• The old table is then discarded and all further hashes are done to the new table with the new function
• The size of the new table can be determined in a number of ways: doubled, a prime closest to doubled, etc.
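• One of these choices, the prime closest to (at least) double the old size, can be sketched as follows; the function names are assumptions for illustration

```cpp
#include <cassert>

// Trial-division primality test; adequate for table sizes.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// New table size for rehashing: the smallest prime that is
// no less than twice the old size.
int nextTableSize(int oldSize) {
    int n = 2 * oldSize;       // at least double the capacity
    while (!isPrime(n)) ++n;   // advance to the nearest prime
    return n;
}
```

• A prime table size is preferred because, with division-based hash functions, it spreads keys more evenly across the table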
Rehashing (continued)
• All the methods we’ve looked at can use rehashing, as they then continue using the processes of hashing and collision resolution that were in place before rehashing
• One method in particular for which rehashing is important is cuckoo hashing, first described by Rasmus Pagh and Flemming Rodler in 2001
Rehashing (continued)
• Cuckoo hashing – Two tables, T1 and T2, and two hash functions, h1 and h2, are used in cuckoo hashing
– Inserting a key K1 into table T1 uses hash function h1, and if T1[h1(K1)] is open, the key is inserted there
– If the location is occupied by a key, say K2, this key is removed to allow K1 to be placed, and K2 is placed in the second table at T2[h2(K2)]
– If this location is occupied by key K3, it is moved to make room for K2, and an attempt is made to place K3 at T1[h1(K3)]
– So the key that is being placed in or moved to the table has priority over a key that is already there
– There is a possibility that a sequence like this could lead to an infinite loop if it ends up back at the first position that was tried
Rehashing (continued)
• Cuckoo hashing (continued) – It is also possible the search will fail because both tables are full
– To circumvent this, a limit is set on the number of tries; if it is exceeded, rehashing takes place with two new, larger tables and new hash functions
– Then the keys from the old tables are rehashed to the new ones
– If during this the limit on number of tries is exceeded, rehashing is performed again with yet larger tables and new hash functions
– The pseudocode for this process is shown on the next slide
– Execution of this algorithm is illustrated in Figure 10.13
Rehashing (continued)
• Cuckoo hashing (continued)
insert(K)
    if K is already in T1[h1(K)] or in T2[h2(K)]
        do nothing;
    for i = 0 to maxLoop-1
        swap(K,T1[h1(K)]);
        if K is null
            return;
        swap(K,T2[h2(K)]);
        if K is null
            return;
    rehash();
    insert(K);
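– A minimal C++ sketch of this insertion procedure is given below; the particular hash functions, the use of 0 as an empty-slot marker (so keys must be nonzero), and the choice maxLoop = 2 · size are illustrative assumptions, not part of the published scheme

```cpp
#include <utility>
#include <vector>

// Sketch of cuckoo hashing with two integer tables. A key being
// placed has priority: it evicts any occupant, which then moves
// to its slot in the other table, and so on.
class Cuckoo {
    std::vector<int> T1, T2;
    int size;
    int h1(int k) const { return k % size; }          // illustrative hash
    int h2(int k) const { return (k / size) % size; } // illustrative hash
public:
    explicit Cuckoo(int s) : T1(s, 0), T2(s, 0), size(s) {}

    // Searching requires at most two probes, one per table.
    bool member(int k) const {
        return T1[h1(k)] == k || T2[h2(k)] == k;
    }

    void insert(int k) {
        if (member(k)) return;           // already present: do nothing
        const int maxLoop = 2 * size;    // assumed try limit
        for (int i = 0; i < maxLoop; ++i) {
            std::swap(k, T1[h1(k)]);     // evicted key (if any) moves on
            if (k == 0) return;          // landed in an empty slot
            std::swap(k, T2[h2(k)]);
            if (k == 0) return;
        }
        rehash();                        // loop suspected: grow and retry
        insert(k);
    }

private:
    void rehash() {
        std::vector<int> old;
        for (int x : T1) if (x != 0) old.push_back(x);
        for (int x : T2) if (x != 0) old.push_back(x);
        size = 2 * size + 1;             // new, larger tables
        T1.assign(size, 0);
        T2.assign(size, 0);
        for (int x : old) insert(x);     // rehash keys to the new tables
    }
};
```

– The eviction chain in insert() mirrors the swap steps of the pseudocode: the displaced key alternates between the two tables until it finds an empty slot or the try limit forces a rehash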
Rehashing (continued)
Fig. 10.13 An example of the application of cuckoo hashing
Rehashing (continued)
• Cuckoo hashing (continued) – Initially, key 2 is inserted in T1[h1(2)], as shown in Figure 10.13a
– When key 7 is inserted, h1(7) = h1(2) so 2 is moved out to make room for 7 in T1, and 2 is placed in T2[h2(2)], shown in Figure 10.13b
– Then an attempt is made to insert key 12, but h1(12) = h1(7), so 12 replaces 7 in T1; in transferring 7 to T2, h2(7) = h2(2), so 2 is moved out (Figure 10.13c-d)
– This process continues in Figure 10.13e-h until the value of maxLoop is exceeded, which breaks the potential loop, and rehashing occurs
– Figure 10.13i-l illustrates this, including the addition of key 13
– Searching for a key in this scheme requires at most two tests to check both T1[h1(K)] and T2[h2(K)]
Rehashing (continued)
• Cuckoo hashing (continued) – One point to note is that rehashing may be limited so that, instead of creating new tables, only new hash functions are created and the keys are reprocessed
– However, this would be a global operation requiring that both tables be completely processed
Hash Functions for Extendible Files
• While rehashing adds flexibility to hashing by allowing for dynamic expansion of the hash table, it has drawbacks
• In particular, the entire process comes to a halt while the new table(s) are created and the values rehashed to the new table
• The time required for this may be unacceptable in many cases
• An alternative approach is to expand the table rather than replace it, and only allow for local rehashing and local changes
• This approach won’t work with arrays, because expanding an array can’t be done by simply adding locations to the end
• However, it can be managed if the data is kept in a file
Hash Functions for Extendible Files (continued)
• There are some hashing techniques that take into account variable sizes of tables or files
• These fall into two groups: directory and directoryless
• In directory schemes, a directory or index of keys controls access to the keys themselves
• There are a number of techniques that fall into this category – Expandable hashing, developed by Gary D. Knott in 1971
– Dynamic hashing, developed by Per-Åke Larson in 1978
– Extendible hashing, developed by Ronald Fagin and others in 1979
• All of these distribute keys among buckets in similar ways
• The structure of the directory or index is the main difference
Hash Functions for Extendible Files (continued)
• A binary tree is used as the bucket index in expandable and dynamic hashing; extendible hashing uses a table
• Virtual hashing, developed in 1978, is a directoryless technique, defined as “any hashing which may dynamically change its hashing function” by its developer, Witold Litwin
• One example of this technique is linear hashing, developed by Litwin in 1980
Hash Functions for Extendible Files (continued)
• Extendible Hashing – Consider a dynamically changing file with fixed-size buckets to which a hashing method has been applied
– The data stored in the buckets is indirectly accessed through an index that is dynamically adjusted to reflect file changes
– The index, which is an expandable table, is the characteristic feature of extendible hashing
– When certain keys are hashed, they indicate a position in the index, not the file
– The values the hash function returns are called pseudokeys
– This way, the file doesn’t need reorganization as data are added or deleted, because the changes are recorded in the index
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – Only one hash function is used, but depending on the size of the directory, only a portion of the address h(K) is used
– An easy way to do this is to look at the address as a string of bits from which only the leftmost i bits can be used
– This number i is called the depth of the directory; a directory of depth two is shown in Figure 10.14a
– As an example, assume h generates a pattern of five bits with a value of 01011
– With a depth of two, the two leftmost bits are 01, and this is the position in the directory containing a pointer to the bucket where the key can be found or inserted
– Figure 10.14 shows the h values in the buckets, which represent the keys actually stored in the buckets
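– The index computation described above can be sketched as follows, assuming a 5-bit pseudokey as in the example; the function and parameter names are illustrative

```cpp
// Directory index in extendible hashing: the leftmost `depth` bits
// of the pseudokey h(K). keyBits is the width of the pseudokey
// (5 bits in the running example).
int directoryIndex(unsigned pseudokey, int keyBits, int depth) {
    return static_cast<int>(pseudokey >> (keyBits - depth));
}
```

– For the pseudokey 01011 with depth two, the function returns 01, the directory position holding the pointer to the key's bucket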
Hash Functions for Extendible Files (continued)
Fig. 10.14 An example of extendible hashing
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – There is a local depth associated with each bucket that indicates the number of leftmost bits in h(K); these bits are the same for all keys in the bucket
– These local depths are shown at the top of each bucket in Figure 10.14; for example, bucket b00 holds all keys for which h(K) starts with 00
– Importantly, the local depth indicates whether the bucket can be accessed from only one location in the directory or at least two
– If the local depth is equal to the depth of the directory, the size of the directory must be changed when the bucket is split because of overflow
– If the local depth is smaller than the directory depth, only half the pointers pointing to the bucket need to be changed to point to the new one
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – This is illustrated in Figure 10.14b; when a key with h-value 11001 arrives, the first two bits send it to the fourth directory position
– From there it goes to bucket b1, where keys whose h-values start with 1 are stored
– This causes an overflow, splitting bucket b1 into b10, which is the new name for b1, and b11; their local depths are set to two
– Bucket b11 is now pointed at by the pointer in position 11, and b1’s keys are split between b10 and b11
– Things are more complicated if a bucket whose local depth is equal to the depth of the directory overflows
– Consider what happens when a key with h-value 0001 arrives at the table in Figure 10.14b
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – It is hashed through position 00 to bucket b00
– This causes a split, but there is no room in the directory for a pointer to the new bucket
– As a consequence, the directory size is doubled, making its depth three, b00 becomes b000, and b001 is the new bucket
– The keys are divided between these two buckets; the ones with an h-value starting with 000 go in b000, and remaining ones go in b001
– In addition, all the slots of the new directory have to have their proper values set by having newdirectory[2·i] = olddirectory[i] and newdirectory[2·i + 1] = olddirectory[i]
– This is done for the values of i that range over the positions of olddirectory, except for the position referring to the bucket that was just split
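– The doubling step can be sketched as follows, with bucket pointers simplified to integer bucket ids for illustration; updating the two entries for the split bucket is left to the caller, as described above

```cpp
#include <cstddef>
#include <vector>

// Doubling the extendible-hashing directory: each old entry i is
// copied to the two new entries 2i and 2i+1, so every bucket is
// initially reachable from both of its new directory slots.
std::vector<int> doubleDirectory(const std::vector<int>& olddir) {
    std::vector<int> newdir(2 * olddir.size());
    for (std::size_t i = 0; i < olddir.size(); ++i) {
        newdir[2 * i]     = olddir[i];
        newdir[2 * i + 1] = olddir[i];
    }
    return newdir;  // entries for the just-split bucket are then
                    // repointed separately by the caller
}
```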
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – The algorithm for inserting a record into a file using extendible hashing
is shown on page 573
– The main advantage of extendible hashing is that it avoids file reorganization on directory overflow; only the directory is impacted
– Since the directory is most often kept in main memory, costs associated with expanding and updating it are small
– On the other hand, if we have a large file of small buckets, the directory can become so large as to require virtual memory or a file
– This can slow down processing significantly
– The size of the directory doesn’t grow uniformly, since it doubles if a bucket with local depth equal to the directory depth is split
Hash Functions for Extendible Files (continued)
• Extendible Hashing (continued) – This means a large directory will have many redundant entries in it
– To address this problem, a proposal by David Lomet in 1983 suggested using extendible hashing until the directory was too large for main memory
– At that point, the buckets are doubled instead of the directories, and the bits that come after the first depth bits in h(K) are used to distinguish different parts of the bucket
– So if depth = 3 and bucket b101 has been quadrupled, we use bit strings 00, 01, 10, and 11 to distinguish the parts
– Then, if h(K) = 10101101, we look for K in the second portion (01) of bucket b101
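– This two-stage lookup can be sketched as follows, assuming an 8-bit pseudokey as in the example; the function and parameter names are illustrative

```cpp
// Lomet's variant: the first `depth` bits of h(K) choose the bucket,
// and the next `partBits` bits choose the portion of an expanded
// bucket. keyBits is the width of the pseudokey (8 in the example).
void locate(unsigned hK, int keyBits, int depth, int partBits,
            int& bucket, int& part) {
    bucket = static_cast<int>(hK >> (keyBits - depth));
    part   = static_cast<int>(
        (hK >> (keyBits - depth - partBits)) & ((1u << partBits) - 1));
}
```

– For h(K) = 10101101 with depth = 3 and a quadrupled bucket, this yields bucket 101 and portion 01, matching the example above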
Hash Functions for Extendible Files (continued)
• Linear hashing – While extendible hashing allows a file to expand without reorganization, it does require storage space for an index
– In Litwin's method there is no need for an index, because new buckets resulting from splitting existing buckets are always added at the end of the table in the same linear way
– To facilitate this, a pointer split is used to indicate which bucket should be split next
– After that bucket is split, the keys in the bucket are divided between that bucket and the new one, which is added to the end of the table
– Figure 10.15 illustrates a series of initial splits where TSize = 3
– To begin with, the pointer split is 0
Hash Functions for Extendible Files (continued)
• Linear hashing (continued)
Fig. 10.15 Splitting buckets in the linear hashing technique
– If the loading factor exceeds a threshold, a new bucket is made, keys from bucket zero are distributed between buckets zero and three, and split is incremented
Hash Functions for Extendible Files (continued)
• Linear hashing (continued) – The question is, how to perform this distribution?
– With only one hash function, keys from bucket zero are hashed to bucket zero before and after splitting
– So one function isn’t adequate to perform this
– Linear hashing maintains two hash functions at each level of splitting, hlevel and hlevel+1, with hlevel(K) = K mod (TSize ∙ 2^level)
– The first function hashes keys to buckets on the current level that have not yet been split; the second hashes keys to already split buckets
– The algorithm for linear hashing is shown on pages 574 and 575
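– The resulting address computation can be sketched as follows, with TSize = 3 as in the running example; the function name and the use of the split pointer as a parameter are illustrative assumptions

```cpp
const int TSize = 3;  // number of buckets before any splitting

// Address of key K in linear hashing: h_level(K) = K mod (TSize * 2^level).
// If that sends K to a bucket that lies before the split pointer, the
// bucket has already been split, so h_{level+1} is applied instead.
int address(int K, int level, int split) {
    int addr = K % (TSize << level);          // h_level(K)
    if (addr < split)                         // bucket already split?
        addr = K % (TSize << (level + 1));    // use h_{level+1}(K)
    return addr;
}
```

– For example, with level 0 and split = 1, key 3 hashes to bucket 0 under h0 but, since bucket 0 has split, h1 redirects it to bucket 3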
Hash Functions for Extendible Files (continued)
• Linear hashing (continued) – It may not be clear when a bucket should be split; as the algorithm assumes, a threshold value of the loading factor is used to decide
– This threshold has to be determined in advance, usually by the designer of the program
– To illustrate, assume keys can be hashed to buckets in a file, and if a bucket is full, overflowing keys are put on a linked list in an overflow area
– This situation is shown in Figure 10.16a, with TSize = 3, h0(K) = K mod TSize, and h1(K) = K mod (2 ∙ TSize)
– We’ll let the overflow area size OSize = 3, and the highest acceptable loading factor be 80 percent
– The loading factor is defined as the number of elements divided by the number of slots in the file and overflow area
Hash Functions for Extendible Files (continued)
Fig. 10.16 Inserting keys to buckets and overflow areas with the linear hashing technique
Hash Functions for Extendible Files (continued)
• Linear hashing (continued) – Current loading in Figure 10.16a is 75%, but when key 10 arrives, it is hashed to position 1 and loading increases to 83%
– The first bucket splits and keys are distributed using function h1, shown in Figure 10.16b
– Of note is the fact that the first bucket has the lowest load, but was the first one that split
– Now assume 21 and 36 are hashed to the table, and 25 arrives as seen in Figure 10.16c
– The loading factor rises to 87%, so another split, this time the second bucket, gives rise to the configuration in Figure 10.16d
– After hashing 27 and 37 another split occurs, and this new situation is shown in Figure 10.16e
Hash Functions for Extendible Files (continued)
• Linear hashing (continued) – In this case, split reached the last allowed value on this level, so it is assigned the value zero, and h1 is retained for use in further hashing
– In addition, a new function, h2, is defined as K mod (4 ∙ TSize)
– These steps are presented in a table on the next slide
– Note that since the order of splitting is predetermined, some overflow area is needed with linear hashing
– If files are used, this may lead to more than one file access
– The area can be distinct from buckets, but can also work in the fashion of coalesced hashing by using empty space in buckets, as suggested by James Mullin in 1981
– Overflow areas are not necessary in a directory scheme, but can be used
Hash Functions for Extendible Files (continued)
Table 10-1 Steps in carrying out the hashing and splitting of Figure 10.16
Hash Functions for Extendible Files (continued)
• Linear hashing (continued) – As with directory schemes, linear hashing increases the address space by splitting buckets
– Keys from the split bucket are also redistributed between the resulting buckets
– Since no indexes need to be maintained, linear hashing is faster and requires less space than previous methods
– This increase in efficiency is especially noticeable for large files