## 8.3 Searching in Trees

We next show that even though Binary Search and its variations, Interpolation Search and Robust Interpolation Search, are very efficient, they cannot be used in many applications, because an array is not an appropriate structure for storing the data in these applications. We then show that a tree is an appropriate structure for these applications, and that there are Θ(lg *n*) algorithms for searching trees.

By ***static searching*** we mean a process in which the records are all added to the file at one time and there is no need to add or delete records later. An example of a situation in which static searching is appropriate is the searching done by operating systems commands. Many applications, however, require ***dynamic searching***, which means that records are frequently added and deleted. An airline reservation system is an application that requires dynamic searching, because customers frequently call to schedule and cancel reservations.

An array structure is inappropriate for dynamic searching, because when we add a record in sequence to a sorted array, we must move all the records following the added record. Binary Search requires that the keys be stored in an array, because there must be an efficient way to find the middle item. This means that Binary Search cannot be used for dynamic searching. Although we can readily add and delete records using a linked list, there is no efficient way to search a linked list, so linked lists do not satisfy the searching needs of a dynamic searching application either. If it is necessary to retrieve the keys quickly in sorted sequence, direct-access storage (hashing) will not work (hashing is discussed in the next section).

Dynamic searching can be implemented efficiently using a tree structure. First we discuss binary search trees. After that, we discuss B-trees, which are an improvement on binary search trees in that they are guaranteed to remain balanced.
Our purpose here is to further analyze the problem of searching, so we only touch on the algorithms pertaining to binary search trees and B-trees. These algorithms are discussed in detail in many data structures texts, such as Kruse (1994).

### 8.3.1 Binary Search Trees

Binary search trees were introduced in [Section 3.5](LiB0029.html#308). However, the purpose there was to discuss a static searching application. That is, we wanted to create an optimal tree based on the probabilities of searching for the keys. The algorithm that builds the tree ([Algorithm 3.9](LiB0029.html#327)) requires that all the keys be added at one time, which means that the application requires static searching.

Binary search trees are also appropriate for dynamic searching. Using a binary search tree, we can usually keep the average search time low while being able to add and delete keys quickly. Furthermore, we can quickly retrieve the keys in sorted sequence by doing an in-order traversal of the tree. Recall that in an ***in-order traversal*** of a binary tree, we first visit all the nodes in the left subtree using an in-order traversal, then visit the root, and finally visit all the nodes in the right subtree using an in-order traversal.

[Figure 8.5](#ch08fig05) shows a binary search tree containing the first seven integers. The search algorithm ([Algorithm 3.8](LiB0029.html#317)) searches the tree by comparing the search key *x* with the value at the root. If they are equal, we are done. If *x* is smaller, the search strategy is applied to the left child. If *x* is larger, the strategy is applied to the right child. We proceed down the tree in this fashion until *x* is found or it is determined that *x* is not in the tree.
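The search strategy and the in-order traversal just described can be sketched as follows. This is an illustrative Python sketch, not the text's code; the class and function names are ours, and the tree built at the end is the one in Figure 8.5.

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def search(root, x):
    """Return True if key x is in the tree: compare at the current node,
    then go left if x is smaller and right if it is larger."""
    while root is not None:
        if x == root.key:
            return True
        root = root.left if x < root.key else root.right
    return False

def inorder(root):
    """Yield the keys in sorted sequence (in-order traversal)."""
    if root is not None:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

# Build the tree of Figure 8.5: 4 at the root, 2 and 6 below it,
# and the leaves 1, 3, 5, 7.
root = Node(4)
root.left, root.right = Node(2), Node(6)
root.left.left, root.left.right = Node(1), Node(3)
root.right.left, root.right.right = Node(5), Node(7)

print(list(inorder(root)))               # [1, 2, 3, 4, 5, 6, 7]
print(search(root, 5), search(root, 8))  # True False
```

Note that searching for 5 follows the path 4, 6, 5, exactly the comparisons Binary Search would make on the sorted array of these seven keys.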
You may have noticed that when we apply this algorithm to the tree in [Figure 8.5](#ch08fig05), we do the same sequence of comparisons as done by the decision tree (see [Figure 8.1](LiB0068.html#765)) corresponding to Binary Search. This illustrates a fundamental relationship between Binary Search and binary search trees: the comparison nodes in the decision tree corresponding to Binary Search when searching *n* keys also form a binary search tree in which [Algorithm 3.8](LiB0029.html#317) does the same comparisons as Binary Search. Therefore, like Binary Search, [Algorithm 3.8](LiB0029.html#317) is optimal for searching *n* keys when it is applied to that tree.

We can efficiently add keys to and delete keys from the tree in [Figure 8.5](#ch08fig05). For example, to add the key 5.5 we simply proceed down the tree, going to the right if 5.5 is greater than the key at a given node and to the left otherwise, until we locate the leaf containing 5. We then add 5.5 as the right child of that leaf. As previously mentioned, the actual algorithms for adding and deleting can be found in Kruse (1994).

[![A binary search tree](https://box.kancloud.cn/9bb090c240dc2e33798828945e569dc8_300x184.jpg)](fig8-5_0.jpg)

Figure 8.5: A binary search tree containing the first seven integers.

The drawback of binary search trees is that when keys are dynamically added and deleted, there is no guarantee that the resulting tree will be balanced. For example, if the keys are all added in increasing sequence, we obtain the tree in [Figure 8.6](#ch08fig06). This tree, which is called a ***skewed tree***, is simply a linked list. An application of [Algorithm 3.8](LiB0029.html#317) to this tree results in a sequential search. In this case, we have gained nothing by using a binary search tree instead of a linked list.
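A minimal sketch of the addition step, with the names being ours rather than the text's, shows how adding keys in increasing sequence produces the skewed tree of Figure 8.6:

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, x):
    """Add key x by walking down exactly as in the search (left when x is
    smaller, right otherwise) and attaching a new leaf where the walk ends."""
    if root is None:
        return Node(x)
    if x < root.key:
        root.left = insert(root.left, x)
    else:
        root.right = insert(root.right, x)
    return root

def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

# Adding the first seven integers in increasing sequence: every new key is
# larger than every key already present, so each node gets only a right child.
skewed = None
for x in range(1, 8):
    skewed = insert(skewed, x)

print(height(skewed))  # 7 -- the tree is effectively a linked list
```

With seven keys the height equals seven, so searching this tree is a sequential search, just as the text observes.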
[![A skewed binary search tree](https://box.kancloud.cn/03630d8e97a76f2854d40db8987e832c_300x377.jpg)](fig8-6_0.jpg)

Figure 8.6: A skewed binary search tree containing the first seven integers.

If the keys are added at random, intuitively it seems that the resulting tree will be closer to a balanced tree much more often than it will be closer to a linked list. (See [Section A.8.1](LiB0100.html#1340) in [Appendix A](LiB0093.html#1281) for a discussion of randomness.) Therefore, on the average, we would expect an efficient search time. Indeed, we have a theorem to that effect. First we explain the result in the theorem.

The theorem obtains an average search time for inputs containing *n* keys under the assumption that all inputs are equally probable. By this we mean that every possible ordering of *n* keys has the same probability of being the input to the algorithm that builds the tree. For example, if *n* = 3 and *s*1 < *s*2 < *s*3 are the three keys, these inputs are all equally probable:

\[*s*1, *s*2, *s*3\]  \[*s*1, *s*3, *s*2\]  \[*s*2, *s*1, *s*3\]  \[*s*2, *s*3, *s*1\]  \[*s*3, *s*1, *s*2\]  \[*s*3, *s*2, *s*1\]

Notice that two inputs can result in the same tree. For example, \[*s*2, *s*3, *s*1\] and \[*s*2, *s*1, *s*3\] result in the same tree, namely the one with *s*2 at the root, *s*1 on the left, and *s*3 on the right. These are the only two inputs that produce that tree. Sometimes a tree is produced by only one input. For example, the tree produced by the input \[*s*1, *s*2, *s*3\] is produced by no other input. It is the inputs, not the trees, that are assumed to be equiprobable. Therefore, each of the inputs listed has probability 1/6, the tree produced by the inputs \[*s*2, *s*3, *s*1\] and \[*s*2, *s*1, *s*3\] has probability 2/6, and the tree produced by the input \[*s*1, *s*2, *s*3\] has probability 1/6.
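The count of inputs versus trees for *n* = 3 can be checked mechanically. This illustrative sketch (ours, not the text's) represents a tree as nested tuples so that identical trees compare equal, builds the tree for each of the six orderings, and counts how many orderings produce each tree:

```python
from collections import Counter
from itertools import permutations

def insert(tree, x):
    """Insert x into a binary search tree represented as nested
    (key, left, right) tuples; None denotes an empty tree."""
    if tree is None:
        return (x, None, None)
    key, left, right = tree
    if x < key:
        return (key, insert(left, x), right)
    return (key, left, insert(right, x))

# Build the tree for every ordering of three keys and tally the results.
trees = Counter()
for ordering in permutations((1, 2, 3)):
    tree = None
    for x in ordering:
        tree = insert(tree, x)
    trees[tree] += 1

print(len(trees))              # 5 distinct trees arise from the 6 inputs
print(sorted(trees.values()))  # [1, 1, 1, 1, 2]
```

Only the balanced tree (the middle key at the root) is produced by two inputs, so it has probability 2/6, while each of the other four trees has probability 1/6.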
We also assume that the search key *x* is equally probable to be any of the *n* keys. The theorem now follows.

Theorem 8.3

Under the assumptions that all inputs are equally probable and that the search key *x* is equally probable to be any of the *n* keys, the average search time over all inputs containing *n* distinct keys, using binary search trees, is given approximately by

*A*(*n*) ≈ 1.38 lg *n*

Proof: We obtain the proof under the assumption that the search key *x* is in the tree. In the exercises, we show that the result still holds if we remove this restriction, as long as we consider each of the 2*n* + 1 possible outcomes to be equally probable. (These outcomes are discussed in the average-case analysis of Binary Search in [Section 8.1.2](LiB0068.html#774).)

Consider all binary search trees containing *n* keys that have the *k*th-smallest key at the root. Each of these trees has *k* − 1 nodes in its left subtree and *n* − *k* nodes in its right subtree. The average search time for the inputs that produce these trees is given by the sum of the following three quantities:

- The average search time in the left subtrees of such trees times the probability of *x* being in the left subtree
- The average search time in the right subtrees of such trees times the probability of *x* being in the right subtree
- The one comparison at the root

The average search time in the left subtrees of such trees is *A*(*k* − 1), and the average search time in the right subtrees is *A*(*n* − *k*).
Because we have assumed that the search key *x* is equally probable to be any of the keys, the probabilities of *x* being in the left and right subtrees are, respectively,

(*k* − 1)/*n*  and  (*n* − *k*)/*n*

If we let *A*(*n* | *k*) denote the average search time over all inputs of size *n* that produce binary search trees with the *k*th-smallest key at the root, we have established that

A(n | k) = ((k − 1)/n) A(k − 1) + ((n − k)/n) A(n − k) + 1

Because all inputs are equally probable, every key has the same probability of being the first key in the input and therefore the key at the root. Therefore, the average search time over all inputs of size *n* is the average of *A*(*n* | *k*) as *k* goes from 1 to *n*. We then have

A(n) = (1/n) Σ_{k=1}^{n} [((k − 1)/n) A(k − 1) + ((n − k)/n) A(n − k) + 1]

If we set *C*(*n*) = *nA*(*n*) and substitute *C*(*n*)/*n* for *A*(*n*) in this expression, we obtain

C(n)/n = (1/n) Σ_{k=1}^{n} [(C(k − 1) + C(n − k))/n + 1]

Simplifying yields (the two sums run over the same values, and C(0) = 0)

C(n) = n + (2/n) Σ_{k=1}^{n−1} C(k)

The initial condition is

C(1) = 1 × A(1) = 1

This recurrence is almost identical to the one for the average case of Quicksort ([Algorithm 2.6](LiB0018.html#177)). Mimicking the average-case analysis for Quicksort yields

C(n) ≈ 1.38 n lg n

which means that

A(n) ≈ 1.38 lg n

You must be careful not to misinterpret the result in [Theorem 8.3](#ch08ex10).
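Theorem 8.3 can also be checked empirically. The sketch below is ours, not the text's: it builds a binary search tree from each of many random orderings of *n* keys and measures the average number of comparisons needed to find a key (the depth of its node, counting the root as depth 1). For moderate *n* the observed average sits somewhat below 1.38 lg *n*, since the approximation is asymptotic, but it is far below the Θ(*n*) worst case.

```python
import math
import random

def average_comparisons(keys):
    """Insert the keys in the given order into a binary search tree and
    return the average number of comparisons needed to find each key."""
    root = None  # each node is a list [key, left_child, right_child]
    total = 0
    for x in keys:
        if root is None:
            root = [x, None, None]
            total += 1  # the root is found with one comparison
            continue
        node, depth = root, 1
        while True:
            depth += 1
            side = 1 if x < node[0] else 2
            if node[side] is None:
                node[side] = [x, None, None]
                total += depth
                break
            node = node[side]
    return total / len(keys)

random.seed(1)  # fixed seed so the run is repeatable
n, trials = 500, 200
avg = sum(average_comparisons(random.sample(range(n), n))
          for _ in range(trials)) / trials
print(round(avg, 2), round(1.38 * math.log2(n), 2))
```

For *n* = 500, the bound 1.38 lg *n* is about 12.4, and the measured average over random inputs lands in the same neighborhood, nowhere near the 250 comparisons a sequential search would average.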
This theorem does not mean that the average search time for a particular input containing *n* keys is about 1.38 lg *n*. A particular input could produce a tree that degenerates into the one in [Figure 8.6](#ch08fig06), which means that the average search time for that input is in Θ(*n*). [Theorem 8.3](#ch08ex10) concerns the average search time over all inputs containing *n* keys. Therefore, for any given input containing *n* keys, it is probable that the average search time is in Θ(lg *n*), but it could be in Θ(*n*). There is no guarantee of an efficient search.

### 8.3.2 B-Trees

In many applications, performance can be severely degraded by a linear search time. For example, the keys to records in large databases often cannot all fit in a computer's high-speed memory (called "RAM," for "random access memory") at one time. Therefore, multiple disk accesses are needed to accomplish the search. (Such a search is called an ***external search***, whereas when all the keys are simultaneously in memory it is called an ***internal search***.) Because disk access involves the mechanical movement of read/write heads and RAM access involves only electronic data transfer, disk access is orders of magnitude slower than RAM access. A linear-time external search could therefore prove to be unacceptable, and we would not want to leave such a possibility to chance.

One solution to this dilemma is to write a balancing program that takes as input an existing binary search tree and outputs a balanced binary search tree containing the same keys. The program is then run periodically. [Algorithm 3.9](LiB0029.html#327) is an algorithm for such a program. That algorithm is more powerful than a simple balancing algorithm because it is able to consider the probability of each key being the search key. In a very dynamic environment, however, it would be better if the tree never became unbalanced in the first place.
Algorithms for adding and deleting nodes while maintaining a balanced binary tree were developed in 1962 by two Russian mathematicians, G. M. Adel'son-Velskii and E. M. Landis. (For this reason, balanced binary trees are often called ***AVL trees***.) These algorithms can be found in Kruse (1994). The addition and deletion times for these algorithms are guaranteed to be Θ(lg *n*), as is the search time.

In 1972, R. Bayer and E. M. McCreight developed an improvement over binary search trees called ***B-trees***. When keys are added to or deleted from a B-tree, all leaves are guaranteed to remain at the same level, which is even better than maintaining balance. The actual algorithms for manipulating B-trees can be found in Kruse (1994). Here we illustrate only how we can add keys while keeping all leaves at the same level. B-trees actually represent a class of trees, of which the simplest is a 3-2 tree. We illustrate the process of adding nodes to such trees.

A ***3-2 tree*** is a tree with the following properties:

- Each node contains one or two keys.
- If a nonleaf contains one key, it has two children, whereas if it contains two keys, it has three children.
- The keys in the left subtree of a given node are less than or equal to the key stored at that node.
- The keys in the right subtree of a given node are greater than or equal to the key stored at that node.
- If a node contains two keys, the keys in the middle subtree of the node are greater than or equal to the left key and less than or equal to the right key.
- All leaves are at the same level.

[Figure 8.7](#ch08fig07)(a) shows a 3-2 tree, and the remainder of that figure shows how a new key is added to the tree. Notice that the tree remains balanced because the tree grows in depth at the root instead of at the leaves. Similarly, when it is necessary to delete a node, the tree shrinks in depth at the root. In this way, all leaves always remain at the same level, and the search, addition, and deletion times are guaranteed to be in Θ(lg *n*). Clearly, an in-order traversal of the tree retrieves the keys in sorted sequence. For these reasons, B-trees are used in most modern database management systems.

[![A new key added to a 3-2 tree](https://box.kancloud.cn/286425fe2319fcbbf8ea62db66609fec_350x422.jpg)](fig8-7_0.jpg)

Figure 8.7: The way a new key is added to a 3-2 tree.
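The Θ(lg *n*) guarantee follows from counting keys. A 3-2 tree of height *h* (counting the edges on a root-to-leaf path) holds the fewest keys when every node has one key and two children (a full binary tree), and the most when every node has two keys and three children (a full ternary tree), so *n* is wedged between two exponentials in *h* and *h* is in Θ(lg *n*). The bounds below are our own back-of-the-envelope formula, not taken from the text:

```python
def key_bounds(h):
    """Fewest and most keys a 3-2 tree of height h can hold.
    Fewest: 1 key per node in a full binary tree, 2**(h+1) - 1 nodes.
    Most: 2 keys per node in a full ternary tree, (3**(h+1) - 1) / 2 nodes."""
    return 2 ** (h + 1) - 1, 3 ** (h + 1) - 1

for h in range(4):
    low, high = key_bounds(h)
    print(h, low, high)
```

For example, at height 1 the tree holds between 3 and 8 keys, and at height 2 between 7 and 26; inverting either bound gives a height logarithmic in the number of keys, which is why the search, addition, and deletion times stay in Θ(lg *n*).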