We use hashtable when we only want to insert, search and delete. There are three main stream methood for hashtable: 1. direct address. 2. chainning and 3. open addressing.
Direct addressing.
We directly access the array that contains the pointer to satellite data. It works well when the universe of k is small and we can afford to allocate enough space.(we assume each key is unique). Since everyone has slot already, why not store the data directly?
Exercise for direct addressing
1
A bit vector is simply an array of bits (0s and 1s). A bit vector of length m takes much less space than an array of m pointers. Describe how to use a bit vector to represent a dynamic set of distinct elements with no satellite data. Dictionary operations should run in O(1) time.
Given a bit vector b, it contains 1 in position k if k is in the dynamic set.
2
Suggest how to implement a direct-address table in which the keys of stored elements do not need to be distinct and the elements can have satellite data. All three dictionary operations (INSERT, DELETE, and SEARCH) should run in O(1) time. (Don’t forget that DELETE takes as an argument a pointer to an object to be deleted, not a key.)
The keys are not distinct so we could let T[x.key] points to a doubly linked list and store all the data that has the same key. Insert takes O(1). When we delete, we take pointer to the object as an argument and removing a node from linked list is O(1).
Search will take key as the argument and check if T[k] is null so O(1).
Down side of direct addressing
We need to allocate space for all possible key in the universe and this become infeasible when Universe is large. More importantly the actual keys used might be way smaller than U. So we use hashtable
Hashtable
Hashtable uses hash function to computer key for the table. For an element with key k, we will store it at slot h(k). But multiple key values might be hashed into the same value: h(k1) = h(k2)… when k1 != k2 !=…. a collision has occurred.
What constitutes a good hashing function
A good hashing function will appear to be “random” while being deterministic(else we can’t retrieve).
Load factor
Assume uniform hashing, where any given element is equally likely to be hashed into any of the m slot. n = number of elements in the dictionary. m = number of slot in the table. n/m is the load factor, the average length of linked list pointed by the table.
Chaining
Collision occurs when two element’s key got hash to the same value. Then we can do Chaining; we will store the elements with the same h(k) into the same linked list.
Run time of Insert, search and delete
Insert(T,x)
we will insert at the front of the doubly linked list pointed by h(x.key). O(1) time
Search(T,k) k is the key value.
Go to h(k) first and traverse the linked list until we either find k or fail to find it. It takes O(n/m) to search. (1 if we fail, we need to traverse the whole list, O(n/m)
(2 if we the key is in the dictionary…well we need to take the average of 1 plus the expected number of elements added to x’s list after x was added to the list(because we have to traverse through all the element before x, who are added after x
Where P(h(k_i)) == h(k_j)) = 1/m
\dfrac{n}{m}
\dfrac{n}{m}
h(k) = k mod m
h(k) = \floor{m (kA mod1)}
h : U x {1,2,3…m – 1} = {1,2,3… m – 1}
<(h(k,0), h(k,1)….h(k, m – 1)>
h(k) = (h'(k) + 1) mod m)
h(k) = (h'(k) + c_1 i + c_2 i^2) mod m
c_1, c_2, m
h(k) = h_1(k) + i h_2(k) mod m
h_1(k)
(h_2(k) + i) mod m
So the probe sequence depends on two hashing function is not dictated by the initial starting point.
Perfect hashing
Two level hashing using universal hashing at each level.