CLRS Chapter 11: Hash Tables

We use a hash table when we only need to insert, search, and delete. There are three main approaches: 1. direct addressing, 2. chaining, and 3. open addressing.

Direct addressing

We directly index an array whose slots contain pointers to the satellite data. It works well when the universe of keys is small and we can afford to allocate one slot per possible key (we assume each key is unique). Since every key already has its own slot, we can even store the data in the slot directly.
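The idea can be sketched in a few lines; a minimal sketch, assuming integer keys drawn from a universe {0, ..., M − 1} (the class name and M are illustrative):

```python
# Minimal direct-address table sketch. Assumes a small key universe
# {0, 1, ..., M - 1}; the key itself is the array index.
M = 16  # size of the key universe (illustrative)

class DirectAddressTable:
    def __init__(self, universe_size):
        self.slots = [None] * universe_size  # one slot per possible key

    def insert(self, key, data):
        self.slots[key] = data   # O(1): the key IS the index

    def search(self, key):
        return self.slots[key]   # O(1); None means "absent"

    def delete(self, key):
        self.slots[key] = None   # O(1)
```

All three dictionary operations are single array accesses, which is exactly why direct addressing is O(1) per operation.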

Exercise for direct addressing

1

A bit vector is simply an array of bits (0s and 1s). A bit vector of length m takes much less space than an array of m pointers. Describe how to use a bit vector to represent a dynamic set of distinct elements with no satellite data. Dictionary operations should run in O(1) time.

Given a bit vector b of length m, set bit k to 1 if k is in the dynamic set and 0 otherwise. Setting, clearing, and testing a single bit are all O(1), which covers INSERT, DELETE, and SEARCH.
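A minimal sketch of the bit-vector answer; using a single arbitrary-precision Python integer as the vector of bits is an implementation convenience, not part of the exercise:

```python
# Bit-vector dictionary sketch: bit k of one Python int stands in for
# position k of the bit vector.
class BitVectorSet:
    def __init__(self):
        self.bits = 0

    def insert(self, k):
        self.bits |= (1 << k)        # set bit k: O(1)

    def delete(self, k):
        self.bits &= ~(1 << k)       # clear bit k: O(1)

    def search(self, k):
        return (self.bits >> k) & 1 == 1   # test bit k: O(1)
```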

2

Suggest how to implement a direct-address table in which the keys of stored elements do not need to be distinct and the elements can have satellite data. All three dictionary operations (INSERT, DELETE, and SEARCH) should run in O(1) time. (Don’t forget that DELETE takes as an argument a pointer to an object to be deleted, not a key.)

The keys are not distinct, so we can let T[x.key] point to a doubly linked list holding every element with that key. INSERT splices the element onto the front of its list in O(1). DELETE takes a pointer to the object as its argument, and removing a given node from a doubly linked list is O(1).
SEARCH takes a key k as its argument, checks whether T[k] is null, and returns any element of the list, so it is O(1) as well.
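A sketch of this answer, with illustrative Node/MultiTable names; note that delete takes the node pointer itself, as the exercise requires:

```python
# Direct-address table whose slots hold doubly linked lists, so
# duplicate keys coexist. Node/MultiTable names are illustrative.
class Node:
    def __init__(self, key, data):
        self.key, self.data = key, data
        self.prev = self.next = None

class MultiTable:
    def __init__(self, universe_size):
        self.head = [None] * universe_size

    def insert(self, node):
        # splice at the front of slot node.key: O(1)
        node.prev, node.next = None, self.head[node.key]
        if node.next is not None:
            node.next.prev = node
        self.head[node.key] = node

    def search(self, key):
        return self.head[key]   # any element with this key, or None

    def delete(self, node):
        # unlink the node we hold a pointer to: O(1)
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head[node.key] = node.next
        if node.next is not None:
            node.next.prev = node.prev
```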

Downside of direct addressing

We need to allocate space for every possible key in the universe, which becomes infeasible when the universe U is large. More importantly, the set of keys actually used may be far smaller than U, so most of that space is wasted. So we use a hash table.

Hashtable

A hash table uses a hash function to compute a slot from the key. An element with key k is stored at slot h(k). But multiple keys may hash to the same slot: when h(k1) = h(k2) with k1 != k2, a collision has occurred.

What constitutes a good hashing function

A good hash function appears “random” while remaining deterministic (otherwise we could not retrieve what we stored).

Load factor

Assume simple uniform hashing: any given element is equally likely to hash into any of the m slots. With n = number of elements in the dictionary and m = number of slots in the table, the load factor n/m is the average length of the linked list hanging off a table slot.

Chaining

A collision occurs when two elements' keys hash to the same value. With chaining, we store all elements with the same h(k) in the same linked list.
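A minimal chained table sketch; plain Python lists stand in for the linked lists, and inserting at index 0 mirrors insertion at the front of a chain (names are illustrative):

```python
# Chained hash table sketch. Each table slot holds a chain of
# (key, data) pairs; a Python list stands in for the linked list.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, k):
        return hash(k) % self.m   # built-in hash as an illustrative choice

    def insert(self, k, data):
        # push onto the front of the chain, like a linked-list insert
        self.table[self._h(k)].insert(0, (k, data))

    def search(self, k):
        for key, data in self.table[self._h(k)]:  # expected O(1 + n/m)
            if key == k:
                return data
        return None

    def delete(self, k):
        chain = self.table[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return
```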

Running time of INSERT, SEARCH, and DELETE

Insert(T,x)

We insert at the front of the doubly linked list pointed to by slot h(x.key), which takes O(1) time.

Search(T,k), where k is the key.

Go to slot h(k) and traverse the linked list until we either find k or reach the end. Search takes O(1 + n/m) expected time:
1. If the search is unsuccessful, we traverse the whole list, whose expected length is n/m.
2. If the key is in the dictionary, we average, over all stored elements x, 1 plus the expected number of elements added to x's list after x was inserted (because insertion is at the front, those are exactly the elements we traverse before reaching x):

\dfrac{1}{n} \sum_{i=1}^{n} \left( 1 + \sum_{j=i+1}^{n} E[X_{ij}] \right)

Here X_{ij} is the indicator variable for the event h(k_i) = h(k_j). The inner summation is the expected number of elements that collide with x and were inserted after it; the earlier x was inserted, the larger this sum. By simple uniform hashing, P(h(k_i) = h(k_j)) = 1/m, so the inner sum is at most n/m and the whole average works out to O(1 + n/m). (The exact value in CLRS is 1 + n/(2m) − 1/(2m), but the general idea is the same.)

Exercise

1

Professor Marley hypothesizes that he can obtain substantial performance gains by modifying the chaining scheme to keep each list in sorted order. How does the professor's modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?

It makes insertion slower, O(n/m), since we must traverse the list to find the sorted position. It does NOT make deletion asymptotically faster, since sorted order does not help once we hold a pointer to the node. Unsuccessful searches can stop as soon as they pass the place the key would occupy, but that is only a constant-factor improvement, still O(1 + n/m). If instead of a linked list we stored each chain in a sorted array, binary search would reduce search time to O(1 + lg(n/m)).

Hash functions

Division method

h(k) = k mod m. For this method we want m not to be a power of two; otherwise h(k) is just the low-order bits of k, which lacks the "random" behavior a good hash function should have. A prime not too close to an exact power of two is a good choice for m.

Multiplication method

h(k) = ⌊m (kA mod 1)⌋, where 0 < A < 1 and kA mod 1 is the fractional part of kA. This method has the advantage that m can be a power of 2.

Universal hashing

We choose the hash function at random from a suitable family of functions. The probability of picking one that behaves badly on our particular set of keys is slim (the same idea as randomized quicksort).

Open addressing

In open addressing, all elements occupy the table itself: NO POINTERS, and each slot stores the key directly. The table can fill up, so no more than m elements can be stored and the load factor never exceeds 1.
Instead of following pointers, for each key we generate a sequence of slots to examine in case of collision. To insert and search, we probe the table until we find a suitable position for our element. The hash function takes the probe number as a second input:

h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}

For every key k we want the probe sequence <h(k, 0), h(k, 1), ..., h(k, m − 1)> to be a permutation of {0, 1, ..., m − 1}, so that every slot is eventually considered as the table fills up.

```python
def hash_insert(T, k):
    i = 0
    # probe until we find an empty slot or exhaust the probe sequence
    while i != m and T[h(k, i)] is not None:
        i += 1
    if i != m:
        T[h(k, i)] = k
        return h(k, i)
    else:
        raise RuntimeError("hash table overflow")
```

The same logic goes for search: follow the probe sequence until we find the key, hit an empty slot, or exhaust the sequence.

```python
def hash_search(T, k):
    i = 0
    while i != m and T[h(k, i)] is not None:
        if T[h(k, i)] == k:
            return h(k, i)
        i += 1
    return None  # not found
```

But for delete, we cannot simply set the slot back to null: that would cut probe sequences short and make later searches miss keys stored past it. Instead we store a special DELETED marker in the slot, and let insert treat DELETED slots as empty.

We define h'(k), an ordinary hash function, as the auxiliary hash function.

Linear probing

h(k, i) = (h'(k) + i) mod m. Linear probing examines h'(k), then h'(k) + 1, ..., up to slot m − 1, then wraps around to 0 and continues up to h'(k) − 1, so it eventually covers the whole table. It suffers from primary clustering: long runs of occupied slots build up, and searches through them take longer. The initial probe position dictates the entire sequence.

Quadratic probing

h(k, i) = (h'(k) + c_1 i + c_2 i^2) mod m. The offset is given by a polynomial in the probe number i. It works better than linear probing, but c_1, c_2, and m must be constrained so that the probe sequence covers the whole table.
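Putting linear probing and the DELETED sentinel together, a minimal runnable sketch; m, the sentinel object, and the use of Python's built-in hash as the auxiliary hash are illustrative choices:

```python
# Open-addressing sketch: linear probing plus a DELETED sentinel.
m = 8
DELETED = object()   # sentinel so searches can probe past deleted slots
T = [None] * m

def probe(k, i):
    return (hash(k) + i) % m   # linear probing: h(k, i) = (h'(k) + i) mod m

def oa_insert(k):
    for i in range(m):
        j = probe(k, i)
        if T[j] is None or T[j] is DELETED:   # insert reuses deleted slots
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

def oa_search(k):
    for i in range(m):
        j = probe(k, i)
        if T[j] is None:       # a truly empty slot ends the probe sequence
            return None
        if T[j] == k:          # DELETED never equals a key, so we probe past it
            return j
    return None

def oa_delete(k):
    j = oa_search(k)
    if j is not None:
        T[j] = DELETED         # mark, don't empty, so later probes continue
```

Deleting a key leaves DELETED behind, so a later search for a key that probed past that slot still succeeds.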
The initial probe position dictates the sequence here as well.

Double hashing

h(k, i) = (h_1(k) + i·h_2(k)) mod m, where h_1 and h_2 are both auxiliary hash functions. The first probe goes to h_1(k), and each successive probe is offset by a further h_2(k), mod m.
The probe sequence therefore depends on two hash functions and is not dictated by the initial starting point alone.
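A small sketch of the double-hashing probe sequence; h1, h2, and the prime m are illustrative choices, with h2 kept nonzero (and thus relatively prime to the prime m) so each sequence is a permutation of all slots:

```python
# Double-hashing probe sequence sketch.
m = 11  # prime, so any h2 value in 1..m-1 is relatively prime to m

def h1(k):
    return k % m

def h2(k):
    return 1 + (k % (m - 1))   # in 1..m-1, never 0

def probe_sequence(k):
    # h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0..m-1
    return [(h1(k) + i * h2(k)) % m for i in range(m)]
```

Keys 5 and 16 share the same first probe (h1 = 5) but different step sizes, so their sequences diverge immediately, unlike linear or quadratic probing.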

Perfect hashing

Two-level hashing, using universal hashing at each level. A second-level table for a slot holding n_j keys gets size n_j², which makes a collision at the second level unlikely, so SEARCH is O(1) in the worst case for a static key set.
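A hedged sketch of the two-level construction for a static set of integer keys, assuming every key is smaller than a chosen prime p; the retry loops rely on universal hashing making second-level collisions unlikely, and all names are illustrative:

```python
import random

# Two-level perfect hashing sketch for a static key set.
p = 101  # prime larger than every key used below

def make_hash(m):
    # universal family: h(k) = ((a*k + b) mod p) mod m
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys):
    n = len(keys)
    h = make_hash(n)                 # level-1 hash into n buckets
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k)].append(k)
    second = []
    for bucket in buckets:
        mj = len(bucket) ** 2        # quadratic size: collisions unlikely
        while True:                  # retry until the level-2 hash is injective
            hj = make_hash(mj) if mj else (lambda k: 0)
            slots = [None] * mj
            ok = True
            for k in bucket:
                s = hj(k)
                if slots[s] is not None:
                    ok = False       # collision: pick a new random hash
                    break
                slots[s] = k
            if ok:
                second.append((hj, slots))
                break
    return h, second

def perfect_search(k, h, second):
    hj, slots = second[h(k)]
    # two hash evaluations and one slot check: O(1) worst case
    return bool(slots) and slots[hj(k)] == k
```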
