leo-george
Hash Tables & Hash Functions
AADS-14
Significance of complexity of Search
In an unordered list, the time required to find a value is O(n).
In an ordered list this time can be improved, and the modification operations could also be improved.
In a balanced Binary Search Tree the search time improves to O(log n).
AVL trees guarantee the same bound.
Dictionary Data Structure
A Dictionary is a general data structure for storing keys and values.
It can be implemented using array or linked-list structures.
In a Dictionary, each element can be addressed directly by using its key as the index, provided the Dictionary is large enough to hold every possible key.
Key Search/Dictionary Storage
In complex applications, however, the memory is shared by many processes at the same time.
There can also be frequent accesses to the keys at run time.
So there is a need to reduce both the space used and the search time.
Example 1
A 4-digit number used as a key may need 10,000 locations (0000 to 9999).
If the key stands for the Employee ID of a company with 500 employees, only 500 of those locations are used when all the keys are placed in memory.
Example 2
A hospital may have a large number of patients, both inpatients and outpatients.
The database system can be modeled to group the patients and then index them so that records are retrieved quickly.
Another way is not to group, but to assign a single number to each case.
Search time of O(1)
For both large and small amounts of data, an amortized time of O(1), or close to it, can be achieved if we know the location of the data or key we are looking for.
This location can be obtained by mapping the key to a new hashed key using a suitable function.
Hashing
Hashing can provide either a unique location for each key or a reference to a shorter list, from which the data for a key is easily retrieved.
It can also use less memory: instead of one large array, we can use a short array or linked list.
Hash Table
A Hash Table is a data structure.
Hash tables provide O(1) time on average for search, insert and delete on any value in the set held in the table.
Hash Table?
A hash table is an array, say T[1..m], where m is a positive integer called the table size.
When we try to put an item into a spot in the hash table that is already occupied, the situation is called a collision.
It is resolved using a collision resolution policy.
Hashing: Mathematical Definition
Hashing is a mapping operation.
Consider a set K of keys.
Let H be a function that maps the keys to a new set L, such that
H: K → L
Hash Function & Hash Address
The function H is called the HASH FUNCTION.
The mapping done by the function H is called HASHING.
The object L is the hash table.
Each cell/location in L is identified by its hash address.
Hash Address
Let k be a key in K, that is, k ∈ K.
Then k has a mapped address in L given by H(k), known as the hash address.
The hash address d of a key k is the address/location given by the hashing operation
d = H(k)
Indexing on the Hash table
The hash address d points directly to a location in L.
This address d is also called the Hash Code for the key k.
The process of hashing is also called compression.
Notes
There is no meaningful relationship between the actual data value k and the hash address d.
So there is no practical way to traverse a hash table in key order; access is by direct search using d.
Hash table items are not in any order.
There is no mapping function from d back to k, other than the hash table itself.
The purpose of hash tables is to provide fast lookups.
Illustration: Bucket Array Structure for Hash Table
Hash address    Key
1               k1
2               k2
3               k3
...             ...
L-1             kN-1
L               kN
Uses of Hash Tables
Compilers use hash tables for symbol storage.
The Linux kernel uses hash tables to manage memory pages and buffers.
High-speed routing tables use hash tables.
Database systems use hash tables.
Operations on Hash Tables
Initialize
Insert(k)
Search(k)
Remove(k)
Sizeof
Isempty
Types of hashing
There are two types:
1. Open hashing, also called open chaining, closed addressing or separate chaining
2. Closed hashing, also called open addressing
Open hashing (Open Chaining)
Used when the amount of data to be stored is large.
A hash function is used to obtain the hash address.
All data with the same hash address are stored as a shorter list, referenced from that hash address.
Bucket in Open hashing
Each hash location in the hash table is said to be a bucket for the data with that index.
Data within a bucket is best organized as a linked list.
Hash address    Bucket
1               k1
2               k2
3               k3
...             ...
L-1             kN-1
L               kN
Closed hashing (Open Addressing)
Closed hashing uses a fixed space.
Hashing maps a key into one of the locations in the earmarked space.
If multiple keys hash to the same address (a collision), the tie must be resolved.
A bucket may be small enough to hold only one value at a time.
Topics in Hashing
There are two basic subareas under “Hashing”:
1. Hash Functions
2. Collision Resolution
Hash Functions
1. The Hash Function H should be easy to compute
2. The function H should, as far as possible, uniformly distribute the hash addresses throughout the set L so that there are a minimum number of collisions
Requirement of Hash Functions
The main idea is that, for a key k, the hash function H yields a value H(k) that serves as an index to a cell/bucket of the hash table, so that the key k can be located easily for search/insert.
Hash Functions
Division Method
Mid Square Method
Multiplication Method
Division Method
Choose a prime number that is not close to a power of 2.
Let m be the selected number.
Then m is also the size of the hash table in the ideal case, with one cell in each bucket.
The hash address/bucket address is given by
H(k) = k mod m
Example
Given keys are
4845, 5679, 6381, 3636, 7180, 8126, 1127
Use table size m = 7 and hash the keys to a table with 7 cells.
Repeat the exercise with m = 11 and m = 8.
Answer (m = 7)
Hash address    Key
0               1127
1               4845
2               5679
3               3636
4               6381
5               7180
6               8126
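The division-method exercise can be checked with a few lines of Python. This is only a minimal sketch: the function name `division_hash` is chosen here for illustration and does not come from the slides.

```python
def division_hash(key: int, m: int) -> int:
    """Division method: the hash address is the remainder of key / m."""
    return key % m


keys = [4845, 5679, 6381, 3636, 7180, 8126, 1127]

# Repeat the exercise for the three table sizes suggested in the example.
for m in (7, 11, 8):
    addresses = {k: division_hash(k, m) for k in keys}
    print(f"m = {m}: {addresses}")
```

With m = 7 every key lands in its own cell; the other table sizes can be compared by looking for repeated addresses in the printed dictionaries.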
Choosing Table size in Division Method
When using the division method, ample consideration must be given to the size of the table.
The best choice for table size is usually a prime number not too close to a power of 2.
Division Method for Chaining
Here the hash table has many cells.
Hash addresses map multiple keys to a single location, so there can be multiple entries in one location.
The multiple entries under a single hash code are held as a linked list.
Illustration
Take the table size m as 11 to map a set of keys.
Keys: 122, 221, 661, 90, 167, 57, 69
Divide each key modulo 11 to get the hash addresses.
Answer: we get the following table
Hash address    Keys (chain)
0
1               122, 221, 661
2               90, 167, 57
3               69
4
Addresses 5 to 10 are also empty.
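A minimal sketch of the chained table for this illustration follows. Python lists stand in for the linked lists, and the helper names `build_chained_table` and `search` are assumptions made for the example. The keys are the ones listed in the illustration (122, 221, 661, 90, 167, 57, 69).

```python
def build_chained_table(keys, m):
    """Separate chaining: one list (chain) per hash address 0..m-1."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)  # division method picks the bucket
    return table


def search(table, key):
    """Only the chain at the key's hash address has to be scanned."""
    return key in table[key % len(table)]


table = build_chained_table([122, 221, 661, 90, 167, 57, 69], 11)
for address, chain in enumerate(table):
    print(address, chain)
```

A search never looks outside one chain, which is why short chains (a low load on each bucket) keep lookups fast.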
Load Factor
Let there be m slots in a hash table.
At the instant of observation, the number of elements is n.
The load factor is then n/m.
This is the average number of elements stored per slot; it can be less than, equal to or greater than 1.
Find the Load Factor
Hash address    Keys (chain)
0               110
1               89, 45
2               68, 167, 57
3
4               554
5               82, 225
6
7
8
9               108
10              109
Solution
There are 11 slots and 11 elements, so the load factor = 11/11 = 1.
The load factor indicates the average number of elements per position.
We get a load factor of 1 even though there are vacant slots, because it only shows the average.
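The load factor calculation can be sketched as follows. The bucket contents are an assumption: they are a best reading of the exercise figure, with each key placed at key mod 11, and the order inside a chain is arbitrary. The function name `load_factor` is illustrative.

```python
def load_factor(buckets):
    """Load factor n/m: stored elements divided by the number of slots."""
    n = sum(len(chain) for chain in buckets)
    m = len(buckets)
    return n / m


# Assumed reading of the exercise table (m = 11 slots, chained buckets).
buckets = [
    [110], [89, 45], [68, 167, 57], [], [554], [82, 225],
    [], [], [], [108], [109],
]

print(load_factor(buckets))
```

Four of the eleven slots are empty, yet the result is still 1, because the longer chains make up for the vacant slots in the average.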
Notes on the Load Factor
The load factor takes various values as the number of keys in the hash table changes.
Accordingly, it can be less than, equal to or greater than one in a hash table formed using Separate Chaining (Open Hashing).
In a hash table formed using Open Addressing (Closed Hashing) it can never exceed one.
The load factor decides the complexity of operations on the hash table such as insert, search and delete.
Hashing the Strings
Exercise
Map the following keys using this hash function:
Find the ASCII values of the first and last characters.
If there is only one character, it serves as both the first and the last.
Multiply the ASCII value of the first character by 256 and add the ASCII value of the last character.
Apply mod m division to the resulting number.
Keys
A, BABU, CHOWHAN, SUMAN, DILIP
The 5 symbols are: AA, BU, CN, SN, DP
These 5 symbols are then converted to a numerical code using the rule given previously, employing the ASCII values of their characters.
ASCII Values
A-65  B-66  C-67  D-68  E-69  F-70  G-71  H-72  I-73
J-74  K-75  L-76  M-77  N-78  O-79  P-80  Q-81  R-82
S-83  T-84  U-85  V-86  W-87  X-88  Y-89  Z-90
Example: Answer
AA    256*65 + 65 = 16705
BU    256*66 + 85 = 16981
CN    256*67 + 78 = 17230
SN    256*83 + 78 = 21326
DP    256*68 + 80 = 17488
Solution
Take m = 7 and obtain the hash addresses.
AA    16705 mod 7 = 3
BU    16981 mod 7 = 6
CN    17230 mod 7 = 3
SN    21326 mod 7 = 4
DP    17488 mod 7 = 2
Hash address    Keys (chain)
0
1
2               DILIP
3               A, CHOWHAN
4               SUMAN
5
6               BABU
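The first-and-last-character rule of this exercise can be sketched in Python, where `ord` returns the ASCII value of a character. The function names are illustrative. Note that 256*67 + 78 evaluates to 17230, so CHOWHAN hashes to address 3 when m = 7.

```python
def first_last_code(word: str) -> int:
    """256 * ASCII(first character) + ASCII(last character).

    For a one-character word, word[0] and word[-1] are the same character.
    """
    return 256 * ord(word[0]) + ord(word[-1])


def string_hash(word: str, m: int) -> int:
    """Reduce the numeric code to a table address with the division method."""
    return first_last_code(word) % m


for word in ["A", "BABU", "CHOWHAN", "SUMAN", "DILIP"]:
    print(word, first_last_code(word), string_hash(word, 7))
```

Keeping `first_last_code` separate from the final `% m` step means only one place changes if the table size changes.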
Symbol Table
Compilers use a method similar to the previous one to form a symbol table for parsing during compilation.
Hash Functions for string hashing
Hash functions perform two separate functions:
1 – Convert the string to a key.
2 – Constrain the key to a non-negative value less than the size of the table.
The best strategy is to keep the two functions separate so that there is only one part to change if the size of the table changes.
Notes: Chaining method
In principle, the chaining method gives unlimited space in the hash table.
In practical applications, however, only limited space is allotted to one hash table in memory.
Collisions need no probing in chaining: keys with the same hash address simply join the same list.
Collisions
Collision
In closed hashing (open addressing), H would ideally give a distinct address in L for each member of K. In real situations, two or more keys may LEAD TO A SINGLE hash address when a given hash function is used.
This situation is called a collision.
We need some method to resolve collisions.
The method is called a “Collision Resolution Policy”.
Collision Resolution Policy
Linear Probing
Quadratic Probing
Double Hashing
Linear Probing
For an insert, if a collision occurs, look for the next free location and use it for storage.
For a search, if the key is not found at its home position, look for it in the following cells in a linear manner.
Example
Let H be mod 11.
Let the keys 56, 78, 100 appear in this order for hashing.
All of these have position 1 as home.
The table is considered a circular array.
Hash address    Key
0
1               56
2               78
3               100
4
...
8
9
10
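Exercise
Hash 45, 39, 66, 74 in that order with table size m = 7.
45 mod 7 = 3
39 mod 7 = 4
66 mod 7 = 3 (positions 3 and 4 are occupied, so 66 goes to 5)
74 mod 7 = 4 (positions 4 and 5 are occupied, so 74 goes to 6)
Hash address    Key
0
1
2
3               45
4               39
5               66
6               74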
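Exercise
Let H be mod 11.
Let the keys 46, 122, 222, 441 appear in this order for hashing.
46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2 (position 2 is occupied, so 222 goes to 3)
441 mod 11 = 1 (positions 1, 2 and 3 are occupied, so 441 goes to 4)
Solution
Hash address    Key
0
1               122
2               46
3               222
4               441
...
8
9
10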
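The linear probing examples above can be reproduced with a short sketch. It assumes the division method for the home position and uses `None` to mark a free cell; the function names are illustrative, and deletion is not handled.

```python
def linear_probe_insert(table, key):
    """Insert key at its home slot or at the next free slot (circular scan)."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m  # wrap around: the table is a circular array
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def linear_probe_search(table, key):
    """Return the slot holding key, or None if an empty slot is reached first."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


def build_table(keys, m):
    """Insert the keys in the given order into an empty table of size m."""
    table = [None] * m
    for k in keys:
        linear_probe_insert(table, k)
    return table


print(build_table([56, 78, 100], 11))
print(build_table([45, 39, 66, 74], 7))
print(build_table([46, 122, 222, 441], 11))
```

The search stops at the first empty cell, which is correct only as long as keys are never removed; supporting removal would need an extra "deleted" marker.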
More on Hash Functions
Mid Square Method of hashing
Mid square method
1. The key k is squared to get k².
2. This value is now treated as a string of digits.
3. The hash function H(k) is then defined as H(k) = f.
4. This f is obtained by deleting digits from both ends of k².
5. Once chosen, the same positions of k² must be used consistently for all keys.
Example (the fourth and fifth digits of k², counting from the right, are kept)
k       3205          7148          2345
k²      10 272 025    51 093 904    5 499 025
H(k)    72            93            99
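The example keeps the fourth and fifth digits of k², counting from the right. A minimal sketch of that choice is given below; the parameter names `drop_right` and `width` are assumptions introduced for illustration, and other digit positions could be chosen as long as they are applied to every key.

```python
def mid_square_hash(key: int, drop_right: int = 3, width: int = 2) -> int:
    """Square the key, drop `drop_right` low digits, keep the next `width` digits."""
    square = key * key
    return (square // 10 ** drop_right) % 10 ** width


for k in (3205, 7148, 2345):
    print(k, k * k, mid_square_hash(k))
```

Integer division removes the digits on the right end and the modulo removes the digits on the left end, which matches step 4 of the method.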
Multiplication Method for hashing
Multiplication method for Hashing
This method uses a hash function different from that of the division method.
The function takes the form
H(k) = ⌊m (kA mod 1)⌋ = floor(m * (kA mod 1))
where 0 < A < 1 and kA mod 1 refers to the fractional part of kA.
Since 0 ≤ kA mod 1 < 1, H(k) ranges from 0 to m - 1.
Advantage of Multiplication Method
The advantage of the multiplication method is that it works equally well with any table size m.
A should be chosen carefully.
Rational numbers should not be chosen for A.
An example of a good choice for A is
A = (√5 − 1)/2 ≈ 0.618
Obtain the Hash Codes for the keys
2343, 4345, 6567, 3476, 1215
m = 11, A = 0.618
Key     Computation                           H(k)
2343    floor(11 * (2343 * 0.618 mod 1))     10
4345    floor(11 * (4345 * 0.618 mod 1))     2
6567    floor(11 * (6567 * 0.618 mod 1))     4
3476    floor(11 * (3476 * 0.618 mod 1))     1
1215    floor(11 * (1215 * 0.618 mod 1))     9
MATLAB command: floor(11*mod((k*0.618),1))
Solution
Hash address    Key
0
1               3476
2               4345
3
4               6567
...
8
9               1215
10              2343
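A Python counterpart to the MATLAB command is sketched below, using the same m = 11 and A = 0.618. The function name `multiplication_hash` is illustrative. The computation uses floating-point arithmetic, which is adequate here because none of the products falls close to an integer boundary.

```python
import math


def multiplication_hash(key: int, m: int, a: float = 0.618) -> int:
    """Multiplication method: floor(m * fractional part of key * a)."""
    fractional = (key * a) % 1.0  # kA mod 1
    return math.floor(m * fractional)


for k in (2343, 4345, 6567, 3476, 1215):
    print(k, multiplication_hash(k, 11))
```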
More on Collision Resolution
Quadratic Probing for Collision Resolution
Notes on Linear Probing
Linear probing is simple to program.
Linear probing has better locality of reference and hence better cache performance.
Primary Clustering in Linear Probing
Linear probing uses the probe sequence H+1, H+2, H+3 and so on to find space for a key whose primary hash value is H.
This leads to clustering of hash codes near some cells, called primary clustering.
The larger the cluster, the lower the search efficiency.
Uniform Hashing & Random Probing
If we generate hash codes in a uniformly distributed manner with a larger table size, the process may avoid collisions.
Even if collisions occur, we may use a pseudo-random sequence to probe the locations.
But this approach reduces the locality of reference, because the probe location then becomes a random variable.
So it is better to use a middle path between linear probing and random probing.
Quadratic Probing
Instead of traversing the hash table slots linearly after a collision, quadratic probing introduces more spacing between the slots that are tried.
This reduces the clustering effect seen in linear probing.
Clustering can still occur, because quadratic probing is not immune to it.
Quadratic probing preserves some locality of reference and hence gives good cache performance, though lower than that of linear probing.
Hash Function for quadratic probing
H(k,i) = (H'(k) + c1*i + c2*i²) mod m
where c1 and c2 are auxiliary constants.
H' is an auxiliary hash function. It could be k mod m.
i = 0, 1, 2, …, m-1 is called the probe number.
For a given hash table, c1 and c2 remain constant.
Choices for c1 and c2 include c1 = c2 = ½; c1 = c2 = 1; and c1 = 0, c2 = 1.
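Example
Take c1 = c2 = ½ and m = 11.
Let the keys 46, 122, 222, 441 appear in this order for hashing.
46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2 (collision); i = 1: (2 + 0.5*1 + 0.5*1²) mod 11 = 3
441 mod 11 = 1 (collision); i = 1: (1 + 0.5*1 + 0.5*1²) mod 11 = 2, which is occupied by 46; i = 2: (1 + 0.5*2 + 0.5*2²) mod 11 = 4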
Exercise
Apply quadratic probing (c1 = c2 = ½, m = 11) to the following hash addresses:
78 mod 11 = 1
89 mod 11 = 1
111 mod 11 = 1
166 mod 11 = 1
Answer
Key     Probe computation                       Address
78      78 mod 11                               1
89      (1 + 0.5*1 + 0.5*1²) mod 11             2
111     (1 + 0.5*2 + 0.5*2²) mod 11             4
166     (1 + 0.5*3 + 0.5*3²) mod 11             7
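The quadratic probing exercise can be reproduced with the sketch below. It assumes c1 = c2 = ½ and H'(k) = k mod m, as in the example; with these constants the offset 0.5*i + 0.5*i² equals i(i+1)/2, which is always a whole number, so integer arithmetic is used. The function name is illustrative.

```python
def quadratic_probe_insert(table, key):
    """Insert key using H(k, i) = (k mod m + i/2 + i*i/2) mod m, i = 0, 1, 2, ..."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + (i + i * i) // 2) % m  # i*(i+1) is even, so // is exact
        if table[slot] is None:
            table[slot] = key
            return slot
    # Quadratic probing may miss free slots, so this is not the same as "full".
    raise RuntimeError("no free slot found within m probes")


table = [None] * 11
for k in (78, 89, 111, 166):
    print(k, "->", quadratic_probe_insert(table, k))
```

All four keys share home position 1 and therefore follow the same probe sequence 1, 2, 4, 7, each one stopping at the first free slot.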
Notes
If two keys have the same initial probe position, then their probe sequences are the same, since H(k1, 0) = H(k2, 0) implies H(k1, i) = H(k2, i).
This property leads to a milder form of clustering called secondary clustering.
Clustering
Problems with Linear Probing
Linear probing leads to primary clustering: the hashed keys share substantial segments of a probe sequence, because keys hashed to the same home position have the same probe sequence.
Hash addresses that collide at a home address, say b, extend the cluster.
Primary Clustering
As we have seen, once a block of a few contiguous occupied positions emerges in the hash table, it becomes a “target” for subsequent collisions.
As clusters grow, they also merge to form larger clusters.
Primary clustering means that elements which hash to different cells probe the same alternative cells.
Clustering is reduced only if the hash addresses home at different positions.
Example
Suppose we have 10 hash codes with value 1 and 5 hash codes with value 2.
All these codes will cluster around positions 1 and 2.
Problems with Quadratic Probing
Adjacent clusters may join to form composite clusters.
This is called secondary clustering.
It happens because keys that have the same home hash address follow the same probe sequence.
In quadratic probing too, the probe sequence is a function of the home position and not of the original key value.
Double hashing for Collision Resolution
Double Hashing
To avoid secondary clustering, the probe sequence must make use of the original key value in its decision process.
This is achieved using double hashing, so called because the hashing is done in two stages.
A second hash function is also used, so as to reduce collisions.
Double Hashing
Let H1(k) and H2(k) be two hash functions for the same key k.
The hash address for the i-th probe is obtained as
H(k,i) = {H1(k) + i*H2(k)} mod m
If the table size m is a prime number, the above sequence is likely to access all locations in the hash table.
Notes
The functions H1(k) and H2(k) are auxiliary hash functions, selected like any hash function so that the keys are distributed in a uniform and random manner.
Example 1
We let H1(k) = k mod m and H2(k) = 1 + (k mod m'), where m' is slightly less than m, say m – 1 or m – 2.
For example, m = 11 and m' = 9.
Example 2
First use the Mid Square Method and then use Modulo Division.
Double hashing
Double hashing can be used to avoid both primary and secondary clustering.
H2(k) must be chosen with care.
m and H2(k) must be relatively prime; this can be ensured by making m a prime number.
If m is a power of two, choose an H2(k) that is always odd.
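The reason for the relatively-prime requirement can be seen by listing the slots that a probe sequence visits. The sketch below is illustrative only; `probe_sequence` and the sample values of H1 and H2 are assumptions chosen for the demonstration.

```python
from math import gcd


def probe_sequence(h1: int, h2: int, m: int):
    """Slots visited by H(k, i) = (h1 + i*h2) mod m for i = 0 .. m-1."""
    return [(h1 + i * h2) % m for i in range(m)]


# m = 8 (a power of two): an odd step reaches every slot, an even step does not.
print(probe_sequence(3, 5, 8), "gcd =", gcd(5, 8))
print(probe_sequence(3, 4, 8), "gcd =", gcd(4, 8))

# m = 11 (prime): every step from 1 to 10 reaches all 11 slots.
print(all(len(set(probe_sequence(0, h2, 11))) == 11 for h2 in range(1, 11)))
```

When the step and the table size share a common factor, the sequence cycles through only a fraction of the table, so an insert can fail even though free slots remain.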
Example
Generate hash codes using double hashing for the following keys:
2227, 3545, 4537, 8981, 7857, 3433, 6965
Use the division method with H1(k) = k mod m and H2(k) = 1 + (k mod m').
We have H(k,i) = {H1(k) + i*H2(k)} mod m. Use m = 11 and m' = 9.
Steps
First generate hash codes with H1(k) = k mod m, using m = 11.
Then apply the second hash function wherever collisions occur. Take m' = 9.
Step 1: Answer
2227 mod 11 = 5
3545 mod 11 = 3
4537 mod 11 = 5
8981 mod 11 = 5
7857 mod 11 = 3
3433 mod 11 = 1
6965 mod 11 = 2
Step 2
To resolve collisions, use the second hash function: twice for hash code 5 and once for hash code 3, and see how the mapping evolves.
Answer: Step 2
Key     H1(k) = k mod 11    H2(k) = 1 + (k mod 9)
2227    5                   5
3545    3                   9
4537    5                   2
8981    5                   9
7857    3                   1
3433    1                   5
6965    2                   9
Step 3
Key     Probe computation                                               Final address
2227    home position 5 is free                                         5
3545    home position 3 is free                                         3
4537    (5 + 1*2) mod 11                                                7
8981    (5 + 1*9) mod 11 = 3 is occupied; (5 + 2*9) mod 11              1
7857    (3 + 1*1) mod 11                                                4
3433    home position 1 is now held by 8981; (1 + 1*5) mod 11           6
6965    home position 2 is free                                         2
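The whole double hashing example can be traced with a short sketch, assuming H1(k) = k mod m and H2(k) = 1 + (k mod m') with m = 11 and m' = 9 as stated in the example. The function name `double_hash_insert` is illustrative.

```python
def double_hash_insert(table, key, m_prime):
    """Insert key using H(k, i) = (H1(k) + i*H2(k)) mod m for i = 0, 1, 2, ..."""
    m = len(table)
    h1 = key % m            # home position
    h2 = 1 + key % m_prime  # step size, never zero
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found within m probes")


table = [None] * 11
for k in (2227, 3545, 4537, 8981, 7857, 3433, 6965):
    print(k, "->", double_hash_insert(table, k, 9))
```

Keys 4537 and 8981 share the home position 5 but use different step sizes (2 and 9), so they follow different probe sequences. This is how double hashing avoids secondary clustering.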
Sparse Matrices