## 8.4 Hashing

Suppose the keys are integers from 1 to 100, and there are about 100 records. An efficient method for storing the records is to create an array *S* of 100 items and store the record whose key is equal to *i* in *S*[*i*]. The retrieval is immediate, with no comparisons of keys being done. This same strategy can be used if there are about 100 records and the keys are nine-digit social security numbers. However, in this case the strategy is very inefficient in terms of memory, because an array of 1 billion items is needed to store only 100 records. Alternatively, we could create an array of only 100 items indexed from 0 to 99, and "hash" each key to a value from 0 to 99. A ***hash function*** is a function that transforms a key to an index, and an application of a hash function to a particular key is called "hashing the key." In the case of social security numbers, a possible hash function is

$$hash(key) = key \mathbin{\%} 100$$

(% returns the remainder when *key* is divided by 100.) This function simply returns the last two digits of the key. If a particular key hashes to *i*, we store the key and its record at *S*[*i*].

This strategy does not store the keys in sorted sequence, which means that it is applicable only if we never need to retrieve the records efficiently in sorted sequence. If it is necessary to do this, one of the methods discussed previously should be used.

If no two keys hash to the same index, this method works fine. However, this is rarely the case when there are a substantial number of keys. For example, if there are 100 keys and each key is equally likely to hash to each of the 100 indices, the probability of no two keys hashing to the same index is

$$\frac{100!}{100^{100}} \approx 9.3 \times 10^{-43}$$

We are almost certain that at least two keys will hash to the same index. Such an occurrence is called a ***collision*** or ***hash clash***. There are various ways to resolve a collision. One of the best ways is through the use of ***open hashing*** (also called ***separate chaining***). With open hashing we create a bucket for each possible hash value and place all the keys that hash to a value in the bucket associated with that value. Open hashing is usually implemented with linked lists. For example, if we hash to the last two digits of a number, we create an array of pointers *Bucket* that is indexed from 0 to 99. All the keys that hash to *i* are placed in a linked list starting at *Bucket*[*i*]. This is illustrated in [Figure 8.8](#ch08fig08).

Figure 8.8: An illustration of open hashing. Keys with the same last two digits are in the same bucket.

The number of buckets need not equal the number of keys. For example, if we hash to the last two digits of a number, the number of buckets must be 100. However, we could store 100, 200, 1,000, or any number of keys. Of course, the more keys we store, the more likely we are to have collisions. If the number of keys is greater than the number of buckets, we are guaranteed to have collisions. Because a bucket stores only a pointer, not much space is wasted by allocating a bucket. Therefore, it is often reasonable to allocate at least as many buckets as there are keys. When searching for a key, it is necessary to do a sequential search through the bucket (linked list) containing the key.
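As a concrete illustration of the bucket structure just described, here is a minimal C++ sketch of open hashing with linked-list buckets and the hash function *key* % 100. It is only a sketch under the assumptions above; the names `Bucket`, `hashKey`, `insertKey`, and `searchKey` are illustrative, not from the text.

```cpp
#include <iostream>
#include <list>
#include <vector>

const int m = 100;                       // number of buckets

int hashKey(long key) {
    return static_cast<int>(key % m);    // last two digits of the key
}

// Bucket[i] holds a linked list of all the keys that hash to i.
std::vector<std::list<long>> Bucket(m);

void insertKey(long key) {
    Bucket[hashKey(key)].push_back(key);
}

bool searchKey(long key) {
    // Sequential search through the one bucket the key can be in.
    for (long k : Bucket[hashKey(key)])
        if (k == key) return true;
    return false;
}

int main() {
    insertKey(100328213);                // hashes to 13
    insertKey(198513);                   // also hashes to 13: a collision
    insertKey(964203);                   // hashes to 3
    std::cout << searchKey(198513) << " " << searchKey(555555) << std::endl;  // prints 1 0
}
```

Note that a search never examines more than one bucket; its cost is the length of the sequential search within that bucket, which is the quantity analyzed next.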
If all the keys hash to the same bucket, the search degenerates into a sequential search. How likely is this to happen? If there are 100 keys and 100 buckets, and a key is equally likely to hash to each of the buckets, the probability of the keys all ending up in the same bucket is

$$100\left(\frac{1}{100}\right)^{100} = \frac{1}{100^{99}}$$

Therefore, it is almost impossible for all the keys to end up in the same bucket. What about the chances of 90, 80, 70, or any other large number of keys ending up in the same bucket? Our real interest should be how likely it is that hashing will yield a better average search time than Binary Search. We show that if the file is reasonably large, this too is almost a certainty. But first we show how good things can be with hashing.

Intuitively, it should be clear that the best thing that can happen is for the keys to be uniformly distributed in the buckets. That is, if there are *n* keys and *m* buckets, each bucket contains *n/m* keys. Actually, each bucket contains exactly *n/m* keys only when *n* is a multiple of *m*. When this is not the case, we have an approximately uniform distribution. The following theorems show what happens when the keys are uniformly distributed. For simplicity, they are stated for an exact uniform distribution (that is, for the case in which *n* is a multiple of *m*).

Theorem 8.4: If *n* keys are uniformly distributed in *m* buckets, the number of comparisons in an unsuccessful search is given by *n/m*.

Proof: Because the keys are uniformly distributed, the number of keys in each bucket is *n/m*, which means that every unsuccessful search requires *n/m* comparisons.

Theorem 8.5: If *n* keys are uniformly distributed in *m* buckets and each key is equally likely to be the search key, the average number of comparisons in a successful search is given by

$$\frac{n}{2m}+\frac{1}{2}$$

Proof: The average search time in each bucket is equal to the average search time when doing a sequential search of *n/m* keys. The average-case analysis of [Algorithm 1.1](LiB0008.html#27) in [Section 1.3](LiB0010.html#57) shows that this average equals the expression in the statement of the theorem.

The following example applies [Theorem 8.5](#ch08ex12).

Example 8.1: If the keys are uniformly distributed and *n* = 2*m*, every unsuccessful search requires only 2*m/m* = 2 comparisons, and the average number of comparisons for a successful search is

$$\frac{2m}{2m}+\frac{1}{2}=1.5$$

When the keys are uniformly distributed, we have a very small search time.
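The following short C++ sketch (not from the text) checks Theorems 8.4 and 8.5 numerically under their exact-uniformity assumption. Hashing the keys 0 through *n* − 1 by *key* % *m*, with *n* a multiple of *m*, happens to place exactly *n/m* keys in every bucket, so the measured costs should agree with *n/m* and *n*/(2*m*) + 1/2.

```cpp
#include <iostream>
#include <list>
#include <vector>

int main() {
    const int n = 2048, m = 1024;                 // n = 2m, as in Example 8.1

    // Keys 0..n-1 hashed by key % m fall exactly uniformly into the buckets.
    std::vector<std::list<int>> bucket(m);
    for (int key = 0; key < n; key++)
        bucket[key % m].push_back(key);

    // Successful searches: look up every stored key and count comparisons.
    long long comparisons = 0;
    for (int key = 0; key < n; key++)
        for (int k : bucket[key % m]) {
            comparisons++;
            if (k == key) break;
        }

    std::cout << "unsuccessful search comparisons (n/m):  "
              << bucket[0].size() << std::endl;                  // 2
    std::cout << "average successful search comparisons: "
              << (double)comparisons / n << std::endl;           // 1.5
    std::cout << "Theorem 8.5 value n/(2m) + 1/2:         "
              << (double)n / (2 * m) + 0.5 << std::endl;         // 1.5
}
```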
However, even though hashing has the possibility of yielding such exceptionally good results, one might argue that we should still use Binary Search to guarantee that the search does not degenerate into something close to a sequential search. The following theorem shows that if the file is reasonably large, the chances of hashing being as bad as Binary Search are very small.

The theorem assumes that a key is equally likely to hash to each of the buckets. When social security numbers are hashed to their last two digits, this criterion should be satisfied. However, not all hash functions satisfy this criterion. For example, if names are hashed to their last two letters, the probability of hashing to "th" is much greater than that of hashing to "qz," because many more names end in "th." A data structures text such as Kruse (1994) discusses methods for choosing a good hash function. Our purpose here is to analyze how well hashing solves the Searching problem.

Theorem 8.6: If there are *n* keys and *m* buckets, the probability that at least one bucket contains at least *k* keys is less than or equal to

$$m\binom{n}{k}\left(\frac{1}{m}\right)^{k}$$

given the assumption that a key is equally likely to hash to any bucket.

Proof: For a given bucket, the probability of any specific combination of *k* keys ending up in that bucket is $(1/m)^k$, which means that the probability that the bucket contains at least the keys in that combination is $(1/m)^k$. In general, for two events *S* and *T*,

$$p(S \text{ or } T) \le p(S) + p(T) \tag{8.1}$$

Therefore, the probability that a given bucket contains at least *k* keys is less than or equal to the sum of the probabilities that it contains at least the keys in each distinct combination of *k* keys. Because

$$\binom{n}{k}$$

distinct combinations of *k* keys can be obtained from *n* keys (see [Section A.7](LiB0099.html#1323) in [Appendix A](LiB0093.html#1281)), the probability that any given bucket contains at least *k* keys is less than or equal to

$$\binom{n}{k}\left(\frac{1}{m}\right)^{k}$$

The theorem now follows from Expression 8.1 and the fact that there are *m* buckets.

Recall that the average search time for Binary Search is about lg *n*. [Table 8.1](#ch08table01) shows the bounds on the probabilities of at least one bucket containing at least lg *n* keys and 2 lg *n* keys for several values of *n*. It is assumed in the table that *n* = *m*. Even when *n* is only 128, the chances of the search time ever exceeding 2 lg *n* are less than 1 in a billion. For *n* = 1,024, the chances of the search time ever exceeding 2 lg *n* are less than 3 in 10 million billion. For *n* = 65,536, the chances of the search time ever exceeding lg *n* are about 3 in 1 billion.

The probability of dying in a single trip in a jet plane is about 6 in 1 million, and the probability of dying in an auto accident during an average year of driving is about 270 in 1 million. Yet many people do not take serious measures to avoid these activities. The point is that, to make reasonable decisions, we often neglect exceptionally small probabilities as if they were impossibilities. Because for large *n* we can be almost certain of obtaining better performance using hashing instead of Binary Search, it is reasonable to do so. We amusingly recall that in the 1970s a certain computer manufacturer routinely talked data-processing managers into buying expensive new hardware by describing a catastrophe that would take place if a contrived scenario of usage occurred.
The flaw in their argument, which the data-processing managers often overlooked, was that the probability of this scenario occurring was negligible. Risk aversion is a matter of personal preference. Therefore, an exceedingly risk-averse individual may choose not to take a 1-in-a-billion or even a 3-in-10-million-billion chance. However, the individual should not make such a choice without giving serious deliberation to whether such a decision truly models the individual's attitude toward risk. Methods for doing such an analysis are discussed in Clemen (1991).

Table 8.1: Upper Bounds on the Probability That at Least One Bucket Contains at Least *k* Keys[*](#ch08fnt01)

| *n*    | Bound when *k* = lg *n* | Bound when *k* = 2 lg *n* |
| ------ | ----------------------- | ------------------------- |
| 128    | .021                    | $7.02 \times 10^{-10}$    |
| 1,024  | .00027                  | $3.49 \times 10^{-16}$    |
| 8,192  | .0000013                | $1.95 \times 10^{-23}$    |
| 65,536 | $3.1 \times 10^{-9}$    | $2.47 \times 10^{-31}$    |

[*](#ch08fnt01) It is assumed that the number of keys *n* equals the number of buckets.
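To see where entries such as those in Table 8.1 come from, the following C++ sketch (not from the text) evaluates the bound of Theorem 8.6 for *n* = *m* with *k* = lg *n* and *k* = 2 lg *n*; the helper name `bound` is illustrative.

```cpp
#include <cmath>
#include <cstdio>

// Evaluates the Theorem 8.6 bound  m * C(n, k) * (1/m)^k.
// The product form avoids computing the huge value C(n, k) directly:
//     C(n, k) / m^k  =  product over i = 0..k-1 of (n - i) / ((i + 1) * m).
double bound(int n, int m, int k) {
    double p = 1.0;
    for (int i = 0; i < k; i++)
        p *= (double)(n - i) / ((double)(i + 1) * m);
    return m * p;
}

int main() {
    for (int n : {128, 1024, 8192, 65536}) {
        int lgn = (int)std::round(std::log2(n));
        std::printf("n = %6d   k = lg n: %.2e   k = 2 lg n: %.2e\n",
                    n, bound(n, n, lgn), bound(n, n, 2 * lgn));
    }
}
```

Running this reproduces the orders of magnitude shown in Table 8.1, for example a bound of about .021 for *n* = 128 with *k* = lg *n*.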