Appendix C: Data Structures for Disjoint Sets · Foundations of Algorithms Using C++ Pseudocod

# Appendix C: Data Structures for Disjoint Sets Kruskal's algorithm ([Algorithm 4.2](LiB0033.html#389) in [Section 4.1.2](LiB0033.html#386)) requires that we create disjoint subsets, each containing a distinct vertex in a graph, and repeatedly merge the subsets until all the vertices are in the same set. To implement this algorithm, we need a data structure for disjoint sets. There are many other useful applications of disjoint sets. For example, they can be used in [Section 4.3](LiB0035.html#405) to improve the time complexity of [Algorithm 4.4](LiB0035.html#422) (Scheduling with Deadlines). Recall that an ***abstract data type*** consists of data objects along with permissible operations on those objects. Before we can implement a disjoint set abstract data type, we need to specify the objects and operations that are needed. We start with a universe *U* of elements. For example, we could have ![](https://box.kancloud.cn/f3f96884eeda958244df5896238ccb3e_169x22.jpg) We then want a procedure *makeset* that makes a set out of a member of *U.* The disjoint sets in [Figure C. 1(a)](#ap-cfig01) should be created by the following calls: ``` for (each x ∊ U) makeset(x); ``` [![Click To expand](https://box.kancloud.cn/67bc1f62fe3394fbff4b3b81233483dc_350x169.jpg)](figap-c-1_0.jpg) Figure C.1: An example of a disjoint set data structure. We need a type set\_pointer and a function *find* such that if *p* and *q* are of type set\_pointer and we have the calls ``` p = find ('B'); q = find('C'); ``` then *p* should point to the set containing B, and *q* should point to the set containing C. This is illustrated in [Figure C.1(a)](#ap-cfig01). We also need a procedure *merge* to merge two sets into one. For example, if we do ``` p = find('B'); q = find('C'); merge(p,q); ``` our sets in [Figure C.1(a)](#ap-cfig01) should become the sets in [Figure C.1(b)](#ap-cfig01). Given the disjoint sets in [Figure C.1(b)](#ap-cfig01), if we have the call ``` p = find('B'); ``` we should obtain the result shown in [Figure C.1(c)](#ap-cfig01). Finally, we need a routine equal to check whether two sets are the same set. For example, if we have the sets in [Figure C.1(b)](#ap-cfig01), and we have the calls ``` p = find('B'); q = find('C'); r = find('A'); ``` then *equal*(*p,q*) should return ture and *equal*(*p,r*) should return false. We have specified an abstract data type whose objects consist of elements in a universe and disjoint sets of those elements, and whose operations are *makeset, find, merge,* and *equal.* One way to represent disjoint sets is to use trees with inverted pointers. In these trees each nonroot points to its parent, whereas each root points to itself. [Figure C.2(a)](#ap-cfig02) shows the trees corresponding to the disjoint sets in [Figure C.1(a)](#ap-cfig01), and [Figure C.2(b)](#ap-cfig02) shows the trees corresponding to the disjoint sets in [Figure C.1(b)](#ap-cfig01). To implement these trees as simply as possible, we assume that our universe contains only indices (integers). To extend this implementation to another finite universe we need only index the elements in that universe. We can implement the trees using an array *U*, where each index to *U* represents one index in the universe. If an index *i* represents a nonroot, the value of *U* \[*i*\] is the index representing its parent. If an index *i* represents a root, the value of *U* \[*i*\] is *i.* For example, if there are 10 indices in our universe, we store them in an array of indices *U* indexed from 1 to 10. [![Click To expand](https://box.kancloud.cn/c8e5816ad5046bc6cbfe092733a480f0_350x303.jpg)](figap-c-2_0.jpg) Figure C.2: The inverted tree representation of a disjoint set data structure. To initially place all the indices in disjoint sets, we set ``` U[i] = i; ``` for all *i.* The tree representation of 10 disjoint sets and the corresponding array implementation are shown in [Figure C.3(a)](#ap-cfig03). An example of a merge is illustrated in [Figure C.3(b)](#ap-cfig03). When the sets {4} and {10} are merged, we make the node containing 10 a child of the node containing 4. This is accomplished by setting *U*\[10\] = 4. In general, when we merge two sets, we first determine which tree has the larger index stored at its root. We then make that root a child of the root of the other tree. [Figure C.3(c)](#ap-cfig03) shows the tree representation and corresponding array implementation after several merges have been done. At this point there are only three disjoint sets. An implementation of the routines follows. For the sake of notational simplicity, both here and in the discussion of Kruskal's algorithm (see [Section 4.1.2](LiB0033.html#386)), we do not list our universe *U* as a parameter to the routines. [![Click To expand](https://box.kancloud.cn/1a6c62b4c54bde8921af63d28e500a1d_350x494.jpg)](figap-c-3_0.jpg) Figure C.3: The array implementation of the inverted tree representation of a disjoint set data structure. Disjoint Set Data Structure**![Start example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** ``` const int n = the number of elements in the universe; typedef int index; typedef index set_pointer; typedef index universe [1..n]; // universe is indexed from 1 to n. universe U; void makeset (index i) { U[i] = i; } set_pointer find (index i) { index j; j = i; while (U[j]! = j) j=U[j]; return j; } void merge (set_pointer p, set_pointer q) { if (p<q) // p points to merged set; U[q] = p; //q no longer points to a set. else U[p] = q; // q points to merged set; } // p no longer points to a set. bool equal (set_pointer p, set_pointer q) { if (p == q) return true; else return false; } void initial (int n) { index i; for (i = 1; <= n; i++) makeset(i); } ``` **![End example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** The value returned by the call *find*(*i*) is the index stored at the root of the tree containing *i.* We have included a routine *initial* that initializes *n* disjoint sets, because such a routine is often needed in algorithms that use disjoint sets. In many algorithms that use disjoint sets, we initialize *n* disjoint sets, and then do *m* passes through a loop (the values of *n* and *m* are not necessarily equal). Inside the loop there are a constant number of calls to routines *equal, find*, and *merge.* When analyzing the algorithm, we need the time complexities of both the initialization and the loop in terms of *n* and *m.* Clearly, the time complexity for routine *initial* is in ![](https://box.kancloud.cn/11301c8c7d193e6c090b85e88055cbcd_52x22.jpg) Because order is not affected by a multiplicative constant, we can assume that routines *equal, find*, and *merge* are each called just once in each of the *m* passes through the loop. Clearly, *equal* and *merge* run in constant time. Only function *find* contains a loop. Therefore, the order of the time complexity of all the calls is dominated by function *find.* Let's count the worst-case number of times the comparison in *find* is done. Suppose, for example, that *m* = 5. The worst case happens if we have this sequence of merges: > *merge*({5}, {6}); > *merge*({4}, {5, 6}); > *merge*({3}, {4, 5, 6}); > *merge*({2}, {3, 4, 5, 6}); > *merge*({1}, {2, 3, 4, 5, 6}); and, after each *merge*, we call *find* looking for index 6. (The actual sets were written as the inputs to *merge* for illustration.) The final tree and array implementation are shown in [Figure C.4](#ap-cfig04). The total number of times the comparison in *find* is done is ![](https://box.kancloud.cn/0fb0df2159866257fca7c80fe4a06f79_140x18.jpg) [![Click To expand](https://box.kancloud.cn/bbce76e3cdab0d2c10d47de133bd6e47_350x474.jpg)](figap-c-4_0.jpg) Figure C.4: An example of the worst case for [Disjoint Data Structure I](#ap-cex01) when m = 5. Generalizing this result to an arbitrary *m*, we have that the worst-case number of comparisons is equal to ![](https://box.kancloud.cn/173250566fd4368e820671b53dea5e9e_255x24.jpg) We did not consider function *equal* because that function has no effect on the number of times the comparison in function *find* is done. The worst case occurs when the order in which we do the merging results in a tree whose depth is one less than the number of nodes in the tree. If we modify procedure *merge* so that this cannot happen, we should improve the efficiency. We can do this by keeping track of the depth of each tree, and, when merging, always making the tree with the smaller depth the child. [Figure C.5](#ap-cfig05) compares our old way of merging with this new way. Notice that the new way results in a tree with a smaller depth. To implement this method, we need to store the depth of the tree at each root. The following implementation does this. [![Click To expand](https://box.kancloud.cn/9a6c54af8f1ce65d3f8c75f85e04f78e_350x273.jpg)](figap-c-5_0.jpg) Figure C.5: In the new way of merging, we make the root of the tree with the smaller depth a child of the root of the other tree. Disjoint Set Data Structure II**![Start example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** ``` const int n = the number of elements in the universe; typedef int index; typedef index set_pointer; struct nodetype { index parent; int depth; } typedef nodetype universe [1...n]; // universe in indexed //from 1 to n. universe U; void makeset (index i); { U[i]. parent = i; U[i]. depth = 0; } set_pointer find (index i) { index j; j = i; while (U[j]. parent ! = j) j = U[j]. parent; return j; } void merge (set_pointer p, set_pointer q) { if (U[p]. depth == U[q]. depth){ U[p]. depth = U[p]. depth + 1; //Tree's depth U[q]. parent = p; //must increase. } else if (U[p]. depth < U[q]. depth) // Make tree with lesser U[p]. parent = q; // depth the child. else U[q]. parent = p; } bool equal (set_pointer p, set_pointer q) { if (p == q) return true; else return false; } void initial (int n) { index i; for (i = 1; i <= n; i++) makeset(i); } ``` **![End example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** It can be shown that the worst-case number of comparisons done in *m* passes through a loop containing a constant number of calls to routines *equal, find*, and *merge* is in ![](https://box.kancloud.cn/6edae29c7726cfe5185460a4fc304bc7_92x22.jpg) In some applications it is necessary to locate efficiently the smallest member of a set. Using our first implementation, this is straightforward, because the smallest member is always at the root of the tree. In our second implementation, however, this is not necessarily so. We can easily modify that implementation to return the smallest member efficiently by storing a variable *smallest* at the root of each tree. This variable contains the smallest index in the tree. The following implementation does this. Disjoint Set Data Structure III**![Start example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** ``` const int n = the number of elements in the universe; typedef int index; typedef index set_pointer; struct nodetype { index parent; int depth; int smallest; }; typedef nodetype universe [1..n]; // universe is indexed // from 1 to n. universe U; void makeset (index i) { U[i]. parent = i; U[i]. depth = 0; U[i]. smallest = i; // The only index i is smallest. } void merge (set_pointer p, set_pointer q) { if (U[p]. depth == U[q]. depth){ U[p]. depth = U[p]. depth + 1; // Tree's depth must increase. U[q]. parent = p; if (U[q]. smallest < U[p]. smallest) // q's tree contains smallest U[p]. smallest = U[q]. smallest; // index. } else if (U[p]. depth < U[q]. depth){ // Make tree with lesser depth U[p]. parent = q; // the child. if (U[p]. smallest < U[q]. smallest) // p's tree contains U[p]. smallest = U[p]. smallest; // smallest index. } else{ U[q]. parent = p; if (U[q]. smallest < U[p]. smallest) // q's tree contains U[p]. smallest = U[p]. smallest; // smallest index. } } int small (set_pointer p) { return U[p]. smallest; } ``` **![End example](https://box.kancloud.cn/e95df9d604ab2d89febe370ae4d88fb1_1x1.gif)** We have included only the routines that differ from those in [Disjoint set Data Structure II](#ap-cex02). Function *small* returns the smallest member of a set. Because function *small* has constant running time, the worst-case number of comparisons done in *m* passes through a loop containing a constant number of calls to routines *equal, find, merge*, and *small* is the same as that of [Disjoint Set Data Structure II](#ap-cex02). That is, it is in ![](https://box.kancloud.cn/17b58cc48c8a795056b9a62d168fbd25_84x21.jpg) Using a technique called ***path compression***, it is possible to develop an implementation whose worst-case number of comparisons, in *m* passes through a loop, is almost linear in *m*. This implementation is discussed in Brassard and Bratley (1988).