## 4.4 Huffman Code

Even though the capacity of secondary storage devices keeps getting larger and their cost keeps getting smaller, the devices continue to fill up due to increased storage demands. Given a data file, it is therefore desirable to store the file as efficiently as possible. The problem of ***data compression*** is to find an efficient method for encoding a data file. Next we discuss an encoding method called ***Huffman code*** and a greedy algorithm for finding a Huffman encoding for a given file.

A common way to represent a file is to use a ***binary code.*** In such a code, each character is represented by a unique binary string, called the ***codeword.*** A ***fixed-length binary code*** represents each character using the same number of bits. For example, suppose our character set is {*a, b, c*}. Then we could use 2 bits to code each character, since this gives us four possible codewords and we need only three. We could code as follows:

![](https://box.kancloud.cn/a69d278115a202c04fce4617247e8cc5_235x20.jpg)

Given this code, if our file is

![](https://box.kancloud.cn/50b4d2bc60ad4803fbf11cb61bb48c31_279x31.jpg)

our encoding is

![](https://box.kancloud.cn/38b3468131220dc56c2d62bb4e633051_193x22.jpg)

We can obtain a more efficient coding using a ***variable-length binary code.*** Such a code can represent different characters using different numbers of bits. In the current example, we can code one of the characters as 0. Since ‘b’ occurs most frequently, it would be most efficient to code ‘b’ using this codeword. However, then ‘a’ could not be coded as ‘00’ because we would not be able to distinguish one ‘a’ from two ‘b’s. Furthermore, we would not want to code ‘a’ as ‘01’ because when we encountered a 0, we could not determine whether it represented a ‘b’ or the beginning of an ‘a’ without looking beyond the current digit.
So we could code as follows:

![](https://box.kancloud.cn/c28f974f72487aff2e40725139181a05_332x25.jpg)

Given this code, File 4.1 would be encoded as

![](https://box.kancloud.cn/f46e5b5b95c509e0beb21933e54d26e8_140x19.jpg)

With this encoding it takes only 13 bits to represent the file, whereas the previous one took 18 bits. It is not hard to see that this encoding is optimal: it uses the minimum number of bits necessary to represent the file with a binary character code.

Given a file, the Optimal Binary Code problem is to find a binary character code for the characters in the file that represents the file in the least number of bits. First we discuss prefix codes, and then we develop Huffman's algorithm for solving this problem.

### 4.4.1 Prefix Codes

One particular type of variable-length code is a ***prefix code.*** In a prefix code, no codeword for one character constitutes the beginning of the codeword for another character. For example, if 01 is the codeword for ‘a’, then 011 could not be the codeword for ‘b’. Code 4.2 is an example of a prefix code. A fixed-length code is also a prefix code.

Every prefix code can be represented by a binary tree whose leaves are the characters that are to be encoded. The binary tree corresponding to Code 4.2 appears in [Figure 4.9](#ch04fig09). The advantage of a prefix code is that we need not look ahead when parsing the file. This is readily apparent from the tree representation of the code. To parse, we start at the first bit on the left in the file and the root of the tree. We sequence through the bits, and go left or right down the tree depending on whether a 0 or a 1 is encountered. When we reach a leaf, we obtain the character at that leaf; then we return to the root and repeat the procedure starting with the next bit in sequence.

[![Click To expand](https://box.kancloud.cn/a75bc25122aa0d9cfc7800b4bfd78d92_260x204.jpg)](fig4-9_0.jpg)
Figure 4.9: Binary tree corresponding to Code 4.2.
Example 4.7

Suppose our character set is {*a, b, c, d, e, f*} and each character appears in the file the number of times indicated in [Table 4.1](#ch04table01). That table also shows three different codes we could use to encode the file.

Table 4.1: Three codes for the same file. C3 is optimal.

| Character | Frequency | C1 (Fixed-Length) | C2 | C3 (Huffman) |
| --- | --- | --- | --- | --- |
| *a* | 16 | 000 | 10 | 00 |
| *b* | 5 | 001 | 11110 | 1110 |
| *c* | 12 | 010 | 1110 | 110 |
| *d* | 17 | 011 | 110 | 01 |
| *e* | 10 | 100 | 11111 | 1111 |
| *f* | 25 | 101 | 0 | 10 |

Let's compute the number of bits for each encoding:

bits(C1) = (16 + 5 + 12 + 17 + 10 + 25) × 3 = 255
bits(C2) = 16(2) + 5(5) + 12(4) + 17(3) + 10(5) + 25(1) = 231
bits(C3) = 16(2) + 5(4) + 12(3) + 17(2) + 10(4) + 25(2) = 212

We see that Code C2 is an improvement on the fixed-length code C1, but C3 (Huffman) is even better than C2. As we shall see in the next subsection, C3 is optimal. The binary trees corresponding to Codes C2 and C3 appear in [Figures 4.10(a)](#ch04fig10) and [(b)](#ch04fig10), respectively. The frequency of each character appears with the character in the tree.

[![Click To expand](https://box.kancloud.cn/9f2fcddf13975b629a432b027f599dea_350x171.jpg)](fig4-10_0.jpg)
Figure 4.10: The binary character code for Code C2 in [Example 4.7](#ch04ex18) appears in (a), while the one for Code C3 (Huffman) appears in (b).

As can be seen from the preceding example, the number of bits it takes to encode a file, given the binary tree *T* corresponding to some code, is

bits(*T*) = *frequency*(*v*<sub>1</sub>) × *depth*(*v*<sub>1</sub>) + *frequency*(*v*<sub>2</sub>) × *depth*(*v*<sub>2</sub>) + ⋯ + *frequency*(*v<sub>n</sub>*) × *depth*(*v<sub>n</sub>*)

where {*v*<sub>1</sub>, *v*<sub>2</sub>, …, *v<sub>n</sub>*} is the set of characters in the file, *frequency*(*v<sub>i</sub>*) is the number of times *v<sub>i</sub>* occurs in the file, and *depth*(*v<sub>i</sub>*) is the depth of *v<sub>i</sub>* in *T*.

### 4.4.2 Huffman's Algorithm

Huffman developed a greedy algorithm that produces an optimal binary character code by constructing a binary tree corresponding to an optimal code.
A code produced by this algorithm is called a **Huffman code.** We present a high-level version of the algorithm. However, because it involves constructing a tree, we need to be more detailed than in our other high-level algorithms. In particular, we first have the following type declaration:

```
struct nodetype
{
    char symbol;       // The value of a character.
    int frequency;     // The number of times the character
                       // is in the file.
    nodetype* left;
    nodetype* right;
};
```

Furthermore, we need to use a priority queue. In a ***priority queue***, the element with the highest priority is always removed next. In this case, the element with the highest priority is the character with the lowest frequency in the file. A priority queue can be implemented as a linked list, but more efficiently as a heap. (See [Section 7.6](LiB0066.html#752) for a discussion of heaps.)

**Huffman's algorithm** now follows.

*n* = number of characters in the file;
Arrange *n* pointers to nodetype records in a priority queue *PQ* as follows: for each pointer *p* in *PQ*,

![](https://box.kancloud.cn/5712c9ff173c85f5ef22a1292d1a7054_400x54.jpg)

The priority is according to the value of frequency, with lower frequencies having higher priority.

```
for (i = 1; i <= n - 1; i++) {   // There is no solution check; rather, the
    remove(PQ, p);               // solution is obtained when i == n - 1.
    remove(PQ, q);               // Selection procedure.
    r = new nodetype;            // There is no feasibility check.
    r->left = p;
    r->right = q;
    r->frequency = p->frequency + q->frequency;
    insert(PQ, r);
}
remove(PQ, r);
return r;
```

If a priority queue is implemented as a heap, it can be initialized in Θ(*n*) time. Furthermore, each heap operation requires Θ(lg *n*) time. Since there are *n* − 1 passes through the **for**-*i* loop, the algorithm runs in Θ(*n* lg *n*) time.

Next we show an example of applying the preceding algorithm.
Example 4.8

Suppose our character set is {*a, b, c, d, e, f*} and each character appears in the file the number of times indicated in [Table 4.1](#ch04table01). [Figure 4.11](#ch04fig11) shows the set of trees constructed by the algorithm after each pass through the **for**-*i* loop. The first set in the figure is the state before the loop is entered. The value stored at each node is the value of the *frequency* field in the node. Note that the final tree is the one in [Figure 4.10(b)](#ch04fig10).

[![Click To expand](https://box.kancloud.cn/63c8f42379fb78dfea5d4e315a990b6c_350x355.jpg)](fig4-11_0.jpg)
Figure 4.11: Given the file whose frequencies are shown in [Table 4.1](#ch04table01), this shows the state of the subtrees, constructed by Huffman's algorithm, after each pass through the for-*i* loop. The first tree is the state before the loop is entered.

Next we prove that the algorithm always produces an optimal binary character code. We start with a lemma:

Lemma 4.4

The binary tree corresponding to an optimal binary prefix code is full. That is, every nonleaf has two children.

Proof: The proof is left as an exercise.

Before proceeding, we need some terminology. Two nodes are called ***siblings*** in a tree if they have the same parent. A ***branch*** with root *v* in a tree *T* is the subtree whose root is *v*. We now prove the optimality of Huffman's algorithm.

Theorem 4.5

Huffman's algorithm produces an optimal binary code.

Proof: The proof is by induction.
Assuming the set of trees obtained in the *i*th step are branches in a binary tree corresponding to an optimal code, we show that the set of trees obtained in the (*i* + 1)st step are also branches in a binary tree corresponding to an optimal code. We conclude then that the binary tree produced in the (*n* − 1)st step corresponds to an optimal code.

Induction base: Clearly, the set of single nodes obtained in the 0th step are branches in a binary tree corresponding to an optimal code.

Induction hypothesis: Assume the set of trees obtained in the *i*th step are branches in a binary tree corresponding to an optimal code. Let *T* be that tree.

Induction step: Let *u* and *v* be the roots of the trees combined in the (*i* + 1)st step of Huffman's algorithm. If *u* and *v* are siblings in *T*, then we are done, because the set of trees obtained in the (*i* + 1)st step of Huffman's algorithm are branches in *T*.

Otherwise, without loss of generality, assume *u* is at a level in *T* at least as low as *v*. As before, let *frequency*(*v*) be the sum of the frequencies of the characters stored at the leaves of the branch rooted at *v*. Due to [Lemma 4.4](#ch04ex20), *u* has some sibling *w* in *T*. Let *S* be the set of trees that exist after the *i*th step of Huffman's algorithm. Clearly, the branch in *T* whose root is *w* either is one of the trees in *S* or contains one of those trees as a subtree. In either case, since the tree whose root is *v* was chosen by Huffman's algorithm in this step,

*frequency*(*v*) ≤ *frequency*(*w*)

Furthermore, in *T*,

*depth*(*v*) ≤ *depth*(*w*)

We can create a new binary tree *T′* by swapping the positions of the branches rooted at *v* and *w* in *T*. This is depicted in [Figure 4.12](#ch04fig12).
From Equality 4.3 and the previous two inequalities, it is not hard to see that

bits(*T′*) = bits(*T*) − [*frequency*(*w*) − *frequency*(*v*)] × [*depth*(*w*) − *depth*(*v*)] ≤ bits(*T*)

[![Click To expand](https://box.kancloud.cn/57d32afa8e0a57281fe9733dcba67f40_350x171.jpg)](fig4-12_0.jpg)
Figure 4.12: The branches rooted at *v* and *w* are swapped.

which means the code corresponding to *T′* is optimal. Clearly, the set of trees obtained in the (*i* + 1)st step of Huffman's algorithm are branches in *T′*.