## A.8 Probability

You may recall using probability theory in situations such as drawing a ball from an urn, drawing the top card from a deck of playing cards, and tossing a coin. We call the act of drawing a ball, drawing the top card, or tossing a coin an *experiment*. In general, probability theory is applicable when we have an experiment that has a set of distinct outcomes that we can describe. The set of all possible outcomes is called a ***sample space*** or ***population***. Mathematicians usually say "sample space," whereas social scientists usually say "population" (because they study people). We use these terms interchangeably. Any subset of a sample space is called an ***event***. A subset containing only one element is called an ***elementary event***.

**Example A.11**

In the experiment of drawing the top card from an ordinary deck of playing cards, the sample space contains the 52 different cards. The set

*S* = {king of hearts, king of spades, king of clubs, king of diamonds}

is an event, and the set

*E* = {king of hearts}

is an elementary event. There are 52 elementary events in the sample space.

The meaning of an event (subset) is that one of the elements in the subset is the outcome of the experiment. In [Example A.11](#ap-aex14), the meaning of the event *S* is that the card drawn is any one of the four kings, and the meaning of the elementary event *E* is that the card drawn is the king of hearts. We measure our certainty that an event contains the outcome of the experiment with a real number called the probability of the event. The following is a general definition of probability when the sample space is finite.
**Definition** Suppose we have a sample space containing *n* distinct outcomes:

{*e*1, *e*2, …, *en*}

A function that assigns a real number *p*(*S*) to each event *S* is called a ***probability function*** if it satisfies the following conditions:

1. 0 ≤ *p*(*ei*) ≤ 1 for 1 ≤ *i* ≤ *n*
2. *p*(*e*1) + *p*(*e*2) + … + *p*(*en*) = 1
3. For each event *S* that is not an elementary event, *p*(*S*) is the sum of the probabilities of the elementary events whose outcomes are in *S*. For example, if *S* = {*e*1, *e*2, *e*7}, then *p*(*S*) = *p*(*e*1) + *p*(*e*2) + *p*(*e*7).

The sample space along with the function *p* is called a ***probability space***.

Because we define probability as a function of a set, we should write *p*({*ei*}) instead of *p*(*ei*) when referring to the probability of an elementary event. However, to avoid clutter, we do not do this. In the same way, we do not use the braces when referring to the probability of an event that is not elementary. For example, we write *p*(*e*1, *e*2, *e*7) for the probability of the event {*e*1, *e*2, *e*7}. We can associate an outcome with the elementary event containing that outcome, and therefore we can speak of the probability of an outcome. Clearly, this means the probability of the elementary event containing the outcome.

The simplest way to assign probabilities is to use the ***Principle of Indifference***. This principle says that outcomes are to be considered equiprobable if we have no reason to expect or prefer one over the other. According to this principle, when there are *n* distinct outcomes, the probability of each of them is the ratio 1/*n*.

**Example A.12**

Suppose we have four balls marked A, B, C, and D in an urn, and the experiment is to draw one ball.
The sample space is {A, B, C, D}, and, according to the Principle of Indifference,

*p*(A) = *p*(B) = *p*(C) = *p*(D) = 1/4

The event {A, B} means that either ball A or ball B is drawn. Its probability is given by

*p*(A, B) = *p*(A) + *p*(B) = 1/4 + 1/4 = 1/2

**Example A.13**

Suppose we have the experiment of drawing the top card from an ordinary deck of playing cards. Because there are 52 cards, according to the Principle of Indifference, the probability of each card is 1/52. For example,

*p*(king of hearts) = 1/52

The event

{king of hearts, king of spades, king of clubs, king of diamonds}

means that the card drawn is a king. Its probability is given by

*p*(king of hearts, king of spades, king of clubs, king of diamonds) = 1/52 + 1/52 + 1/52 + 1/52 = 4/52 = 1/13

Sometimes we can compute probabilities using the formulas for permutations and combinations given in the preceding section. The following example shows how this is done.

**Example A.14**

Suppose there are five balls marked A, B, C, D, and E in an urn, and the experiment is to draw three balls, where the order does not matter. We will compute *p*(A and B and C). Recall that by "A and B and C" we mean the outcome that A, B, and C are picked in any order. To determine the probability using the Principle of Indifference, we need to compute the number of distinct outcomes. That is, we need the number of combinations of five objects taken three at a time.
Using the formula in the preceding section, that number is given by

5!/(3!(5 − 3)!) = 10

Therefore, according to the Principle of Indifference,

*p*(A and B and C) = 1/10

which is the same as the probabilities of the other nine outcomes.

Too often, students who do not have the opportunity to study probability theory in depth are left with the impression that probability is simply about ratios. It would be unfair, even in this cursory overview, to give this impression. In fact, most important applications of probability have nothing to do with ratios. To illustrate, we give two simple examples.

A classic textbook example of probability involves tossing a coin. Because of the symmetry of a coin, we ordinarily use the Principle of Indifference to assign probabilities. Therefore, we assign

*p*(heads) = *p*(tails) = 1/2

On the other hand, we could toss a thumbtack. Like a coin, a thumbtack can land in two ways. It can land on its flat end (head) or it can land with the edge of the flat end (and the point) touching the ground. We assume that it cannot land only on its point. These two ways of landing are illustrated in [Figure A.4](#ap-afig04). Using coin terminology, we will call the flat end "heads" and the other outcome "tails." Because the thumbtack lacks symmetry, there is no reason to use the Principle of Indifference and assign the same probability to heads and tails. How then do we assign probabilities? In the case of a coin, when we say *p*(heads) = 1/2, we are implicitly assuming that if we tossed the coin 1,000 times, it should land on its head about 500 times. Indeed, if it landed on its head only 100 times, we would become suspicious that it was unevenly weighted and that the probability was not 1/2.
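Counts like the one in Example A.14 are easy to verify by machine. The following is a minimal Python sketch (using only the standard library) that enumerates the unordered draws and confirms the probability assigned by the Principle of Indifference:

```python
from itertools import combinations
from math import comb

balls = ["A", "B", "C", "D", "E"]

# All unordered draws of three balls from the five.
outcomes = list(combinations(balls, 3))
print(len(outcomes))   # 10, matching 5!/(3!(5 - 3)!)
print(comb(5, 3))      # 10, computed directly from the formula

# By the Principle of Indifference, each outcome has probability 1/10.
p = 1 / len(outcomes)
print(p)               # 0.1, the probability of "A and B and C"
```

Enumerating the outcomes, rather than only counting them, also makes it easy to check that ("A", "B", "C") really is one of the ten equiprobable draws.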
This notion of repeatedly performing the same experiment gives us a way of actually computing a probability. That is, if we repeat an experiment many times, we can be fairly certain that the probability of an outcome is about equal to the fraction of times the outcome actually occurs. (Some philosophers actually define probability as the limit of this fraction as the number of trials approaches infinity.) For example, one of our students tossed a thumbtack 10,000 times, and it landed on its flat end (heads) 3,761 times. Therefore, for that tack,

*p*(heads) ≈ 3,761/10,000 = 0.3761

[![Click To expand](https://box.kancloud.cn/67f942d0e7b77c47787fa2e9d3ec1b4d_322x146.jpg)](figap-a-4_0.jpg)

Figure A.4: The two ways a thumbtack can land. Because of the asymmetry of a thumbtack, these two ways do not necessarily have the same probability.

We see that the probabilities of the two events need not be the same, but that the probabilities still sum to 1. This way of determining probabilities is called the ***relative frequency approach*** to probability. When probabilities are computed from the relative frequency, we use the ≈ symbol because we cannot be certain that the relative frequency is exactly equal to the probability, regardless of how many trials are performed. For example, suppose we have two balls marked A and B in an urn and we repeat the experiment of picking one ball 10,000 times. We cannot be certain that the ball marked A will be picked exactly 5,000 times. It may be picked only 4,967 times. Using the Principle of Indifference, we would have

*p*(A) = 0.5

whereas using the relative frequency approach we would have

*p*(A) ≈ 0.4967

The relative frequency approach is not limited to experiments with only two possible outcomes.
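The relative frequency approach is easy to simulate. The sketch below repeats a biased "thumbtack toss" many times and estimates the probability of each outcome from the observed fraction; the true bias of 0.38 is an arbitrary assumption for illustration, not the tack measured above:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

TRUE_P_HEADS = 0.38  # assumed bias (illustrative only)
trials = 10_000

# Count how often the simulated tack lands on its flat end.
heads = sum(random.random() < TRUE_P_HEADS for _ in range(trials))

# Relative frequency estimates of p(heads) and p(tails).
p_heads = heads / trials
p_tails = (trials - heads) / trials

print(p_heads, p_tails)  # each close to 0.38 and 0.62

# The two estimates need not be equal, but they always sum to 1.
assert abs(p_heads + p_tails - 1.0) < 1e-12
```

Rerunning with a different seed gives slightly different estimates, which is exactly why the ≈ symbol is used: the relative frequency only approximates the underlying probability, however many trials are performed.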
For example, if we had a six-sided die that was not a perfect cube, the probabilities of the six elementary events could all be different. However, they would still sum to 1. The following example illustrates this situation.

**Example A.15**

Suppose we have an asymmetrical six-sided die, and in 1,000 throws we determine that the six sides come up the following numbers of times:

| *Side* | *Number of Times* |
| --- | --- |
| 1 | 200 |
| 2 | 150 |
| 3 | 100 |
| 4 | 250 |
| 5 | 120 |
| 6 | 180 |

Then

*p*(1) ≈ 200/1,000 = 0.2
*p*(2) ≈ 150/1,000 = 0.15
*p*(3) ≈ 100/1,000 = 0.1
*p*(4) ≈ 250/1,000 = 0.25
*p*(5) ≈ 120/1,000 = 0.12
*p*(6) ≈ 180/1,000 = 0.18

By Condition 3 in the definition of a probability space,

*p*(2, 3) ≈ 0.15 + 0.1 = 0.25

This is the probability that either a 2 or a 3 comes up in a throw of the die.

There are other approaches to probability, not the least of which is the notion of probability as a degree of *belief* in an outcome. For example, suppose the Chicago Bears were going to play the Dallas Cowboys in a football game. At the time this text is being written, one of its authors has little reason to believe that the Bears would win. Therefore, he would not assign equal probabilities to each team winning. Because the game could not be repeated many times, he could not obtain the probabilities using the relative frequency approach. However, if he was going to bet on the game, he would want to assess the probability of the Bears winning. He could do so using the ***subjectivistic approach*** to probability. One way to assess probabilities using this approach is as follows: If a lottery ticket for the Bears winning cost $1, an individual would determine how much he or she felt the ticket should be worth if the Bears did win. One of the authors feels that it would have to be worth $5.
This means that he would be willing to pay $1 for the ticket only if it would be worth at least $5 in the event that the Bears won. For him, the probability of the Bears winning is given by

*p*(Bears win) = $1/$5 = 0.2

That is, the probability is computed from what he believes would be a fair bet. This approach is called "subjective" because someone else might say that the ticket would need to be worth only $4. For that person, *p*(Bears win) = $1/$4 = 0.25. Neither person would be logically incorrect. When a probability simply represents an individual's belief, there is no unique correct probability. A probability is a function of the individual's beliefs, which means it is subjective. If someone believed that the amount won should be the same as the amount bet (that is, that the ticket should be worth $2), then for that person

*p*(Bears win) = $1/$2 = 0.5

We see that probability is much more than ratios. You should read Fine (1973) for a thorough coverage of the meaning and philosophy of probability. The relative frequency approach to probability is discussed in Neapolitan (1992). The expression "Principle of Indifference" first appeared in Keynes (1948) (originally published in 1921). Neapolitan (1990) discusses paradoxes resulting from use of the Principle of Indifference.

### A.8.1 Randomness

Although the term "random" is used freely in conversation, it is quite difficult to define rigorously. Randomness involves a process. Intuitively, by a ***random process*** we mean the following. First, the process must be capable of generating an arbitrarily long sequence of outcomes. For example, the process of repeatedly tossing the same coin can generate an arbitrarily long sequence of outcomes that are either heads or tails. Second, the outcomes must be unpredictable. What it means to be "unpredictable," however, is somewhat vague.
It seems we are back where we started; we have simply replaced "random" with "unpredictable." In the early part of the 20th century, Richard von Mises made the concept of randomness more concrete. He said that an "unpredictable" process should not allow a successful gambling strategy. That is, if we chose to bet on an outcome of such a process, we could not improve our chances of winning by betting on some subsequence of the outcomes instead of betting on every outcome.

For example, suppose we decided to bet on heads in the repeated tossing of a coin. Most of us feel that we could not improve our chances by betting on every other toss instead of on every toss. Furthermore, most of us feel that there is no other "special" subsequence that could improve our chances. If indeed we could not improve our chances by betting on some subsequence, then the repeated tossing of the coin would be a random process.

As another example, suppose that we repeatedly sampled individuals from a population that contained individuals with and without cancer and that we put each sampled individual back into the population before sampling the next individual. (This is called *sampling with replacement*.) Let's say we chose to bet on cancer. If we sampled in such a way as to never give preference to any particular individual, most of us feel that we would not improve our chances by betting only on some subsequence instead of betting every time. If indeed we could not improve our chances by betting on some subsequence, then the process of sampling would be random. However, if we sometimes gave preference to individuals who smoked by sampling every fourth time only from smokers, and we sampled the other times from the entire population, the process would no longer be random, because we could improve our chances by betting every fourth time instead of every time.
Intuitively, when we say that "we sample in such a way as to never give preference to any particular individual," we mean that there is no pattern in the way the sampling is done. For example, if we were sampling balls with replacement from an urn, we would never give any ball preference if we shook the urn vigorously to thoroughly mix the balls before each sample. You may have noticed how thoroughly the balls are mixed before they are drawn in state lotteries. When sampling is done from a human population, it is not as easy to ensure that preference is not given. The discussion of sampling methods is beyond the scope of this appendix.

Von Mises' requirement of not allowing a successful gambling strategy gives us a better grasp of the meaning of randomness. A predictable, or *nonrandom*, process does allow a successful gambling strategy. One example of a nonrandom process is the one mentioned above in which we sampled every fourth time from smokers. A less obvious example concerns the exercise pattern of one of the authors. He prefers to exercise at his health club on Tuesday, Thursday, and Sunday, but if he misses a day he makes up for it on one of the other days. If we were going to bet on whether he exercises on a given day, we could do much better by betting every Tuesday, Thursday, and Sunday than by betting every day. This process is not random.

Even though von Mises was able to give us a better understanding of randomness, he was not able to create a rigorous, mathematical definition. Andrei Kolmogorov eventually did so with the concept of compressible sequences. Briefly, a finite sequence is defined as ***compressible*** if it can be encoded in fewer bits than it takes to encode every item in the sequence. For example, the sequence

1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0,

which is simply "1 0" repeated 16 times, can be represented by 16 1 0.
Because it takes fewer bits to encode this representation than it does to encode every item in the sequence, the sequence is compressible. A finite sequence that is not compressible is called a ***random sequence***. For example, the sequence

1 0 0 1 1 0 1 0 0 0 1 0 1 1 0 1

is random because it does not have a more efficient representation. Intuitively, a random sequence is one that shows no regularity or pattern. According to the Kolmogorov theory, a ***random process*** is a process that generates a random sequence when the process is continued long enough. For example, suppose we repeatedly toss a coin and associate 1 with heads and 0 with tails. After six tosses we may see the sequence 1 0 1 0 1 0, but, according to the Kolmogorov theory, eventually the entire sequence will show no such regularity.

There is some philosophical difficulty with defining a random process as one that *definitely* generates a random sequence. Many probabilists feel that it is only highly probable that the sequence will be random, and that the possibility exists that the sequence will not be random. For example, in the repeated tossing of a coin, they believe that, although it is very unlikely, the coin could come up heads forever. As mentioned previously, randomness is a difficult concept. Even today there is controversy over its properties.

Let's discuss how randomness relates to probability. A random process determines a probability space (see the definition given at the beginning of this section), and the experiment in the space is performed each time the process generates an outcome. This is illustrated by the following examples.

**Example A.16**

Suppose we have an urn containing one black ball and one white ball, and we repeatedly draw a ball and replace it. This random process determines a probability space in which *p*(black) = *p*(white) = 0.5.
We perform the experiment in the space each time we draw a ball.

**Example A.17**

The repeated throwing of the asymmetrical six-sided die in [Example A.15](#ap-aex18) is a random process that determines a probability space in which

*p*(1) ≈ 0.2
*p*(2) ≈ 0.15
*p*(3) ≈ 0.1
*p*(4) ≈ 0.25
*p*(5) ≈ 0.12
*p*(6) ≈ 0.18

We perform the experiment in the space each time we throw the die.

**Example A.18**

Suppose we have a population of *n* people, some of whom have cancer, we sample people with replacement, and we sample in such a way as to never give preference to any particular individual. This random process determines a probability space in which the population is the sample space (recall that "sample space" and "population" can be used interchangeably) and the probability of each person being sampled (elementary event) is 1/*n*. The probability of a person with cancer being sampled is

*p*(cancer) = (number of people with cancer)/*n*

Each time we perform the experiment, we say that we sample (pick) a person ***at random*** from the population. The set of outcomes in all repetitions of the experiment is called a ***random sample*** of the population. Using statistical techniques, it can be shown that if a random sample is large, then it is highly probable that the sample is representative of the population.
For example, if the random sample is large and a certain fraction of the people sampled have cancer, it is highly probable that the fraction of people in the population who have cancer is close to that same fraction.

**Example A.19**

Suppose we have an ordinary deck of playing cards, and we turn over the cards in sequence. This process is not random, and the cards are not picked at random. This nonrandom process determines a different probability space each time an outcome is generated. On the first trial, each card has a probability of 1/52. On the second trial, the card turned over in the first trial has a probability of 0 and each of the other cards has a probability of 1/51, and so on.

Suppose we repeatedly draw the top card, replace it, and shuffle once. Is this a random process, and are the cards picked at random? The answer is no. The magician and statistician Persi Diaconis has shown that the cards must be shuffled seven times to thoroughly mix them and make the process random (see Aldous and Diaconis, 1986).

Although von Mises' notion of randomness is intuitively very appealing today, his views were not widely held at the time he developed his theory (in the early part of the 20th century). His strongest opponent was the philosopher K. Marbe. Marbe held that nature is endowed with a memory.
According to his theory, if tails comes up 15 consecutive times in repeated tosses of a fair coin—that is, a coin for which the relative frequency of heads is 0.5—the probability of heads coming up on the next toss is increased because nature will compensate for all the previous tails. If this theory were correct, we could improve our chances of winning by betting on heads only after a long sequence of tails. Iverson et al. (1971) conducted experiments that substantiated the views of von Mises and Kolmogorov. Specifically, their experiments showed that coin tosses and dice throws do generate random sequences. Today few scientists subscribe to Marbe's theory, although quite a few gamblers seem to.

Von Mises' original theory appeared in von Mises (1919) and is discussed more accessibly in von Mises (1957). A detailed coverage of compressible sequences and random sequences can be found in Van Lambalgen (1987). Neapolitan (1992) and Van Lambalgen (1987) both address the difficulties in defining a random process as one that definitely generates a random sequence.

### A.8.2 The Expected Value

We introduce the expected value (average) with an example.

**Example A.20**

Suppose we have four students with heights of 68, 72, 67, and 74 inches. Their average height is given by

(68 + 72 + 67 + 74)/4 = 70.25 inches

Suppose now that we have 1,000 students whose heights are distributed according to the following percentages:

| *Percentage of Students* | *Height in Inches* |
| --- | --- |
| 20 | 66 |
| 25 | 68 |
| 30 | 71 |
| 10 | 72 |
| 15 | 74 |

To compute the average height, we could first determine the height of each student and proceed as before.
However, it is much more efficient to simply obtain the average as follows:

0.20(66) + 0.25(68) + 0.30(71) + 0.10(72) + 0.15(74) = 69.8 inches

Notice that the percentages in this example are simply probabilities obtained using the Principle of Indifference. That is, the fact that 20% of the students are 66 inches tall means that 200 students are 66 inches tall, and if we pick a student at random from the 1,000 students, then

*p*(66 inches) = 200/1,000 = 0.20

In general, the expected value is defined as follows.

**Definition** Suppose we have a probability space with the sample space

{*e*1, *e*2, …, *en*}

and each outcome *ei* has a real number *f*(*ei*) associated with it. Then *f* is called a ***random variable*** on the sample space, and the ***expected value***, or average, of *f* is given by

Expected value = *f*(*e*1)*p*(*e*1) + *f*(*e*2)*p*(*e*2) + … + *f*(*en*)*p*(*en*)

Random variables are called "random" because random processes can determine the values of random variables. The terms "chance variable" and "stochastic variable" are also used. We use "random variable" because it is the most popular.

**Example A.21**

Suppose we have the asymmetrical six-sided die in [Example A.15](#ap-aex18). That is,

*p*(1) ≈ 0.2, *p*(2) ≈ 0.15, *p*(3) ≈ 0.1, *p*(4) ≈ 0.25, *p*(5) ≈ 0.12, *p*(6) ≈ 0.18

Our sample space consists of the six different sides that can come up, a random variable on this sample space is the number written on the side, and the expected value of this random variable is

1(0.2) + 2(0.15) + 3(0.1) + 4(0.25) + 5(0.12) + 6(0.18) = 3.48

If we threw the die many times, we would expect the average of the numbers showing up to equal about 3.48.
A sample space does not have a unique random variable defined on it. Another random variable on this sample space might be the function that assigns 0 if an odd number comes up and 1 if an even number comes up. The expected value of this random variable is

0(0.2 + 0.1 + 0.12) + 1(0.15 + 0.25 + 0.18) = 0.58

**Example A.22**

Suppose the 1,000 students in [Example A.20](#ap-aex23) are our sample space. The height, as computed in [Example A.20](#ap-aex23), is one random variable on this sample space. Another one is the weight. If the weights are distributed according to the following percentages:

| *Percentage of Students* | *Weight in Pounds* |
| --- | --- |
| 15 | 130 |
| 35 | 145 |
| 30 | 160 |
| 10 | 170 |
| 10 | 185 |

the expected value of this random variable is

0.15(130) + 0.35(145) + 0.30(160) + 0.10(170) + 0.10(185) = 153.75 pounds
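The expected value in the definition above is a single weighted sum, so it is straightforward to compute by machine. The following Python sketch evaluates it for the asymmetrical die of Example A.21 and the student weights of Example A.22:

```python
def expected_value(values, probs):
    """Compute f(e1)p(e1) + f(e2)p(e2) + ... + f(en)p(en)."""
    # Condition 2 of a probability function: the probabilities sum to 1.
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(v * p for v, p in zip(values, probs))

# The asymmetrical die of Example A.21.
die_faces = [1, 2, 3, 4, 5, 6]
die_probs = [0.2, 0.15, 0.1, 0.25, 0.12, 0.18]
print(round(expected_value(die_faces, die_probs), 2))   # 3.48

# The student weights of Example A.22 (percentages used as probabilities).
pounds = [130, 145, 160, 170, 185]
weight_probs = [0.15, 0.35, 0.30, 0.10, 0.10]
print(round(expected_value(pounds, weight_probs), 2))   # 153.75
```

The same function handles any finite random variable, such as the 0/1 "even number" variable on the die, whose expected value comes out to 0.58 with the probabilities above.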