
Elements of Mathematics: an embarrassingly simple (but practical) introduction to probability

Jordi Villà i Freixa ([email protected])

November 23, 2011

Class note
With class notes!

MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences

Most of the material for this report has been modified from [?, ?]. See also [?].

Class note

Propose the experiment of throwing the same die twice. An experiment is a trial, and it can give as outcome any pair of values from 1 to 6; the sample space contains the 36 such pairs. Play at placing events on it graphically.

1 Venn diagrams

• We call a single performance of an experiment a trial and each possible result an outcome.

• The sample space Ω of the experiment is the set of all possible outcomes of an individual trial.

• Sample spaces can be finite (throwing a die) or infinite (measuring people's heights).

An event is a subset of the sample space. A Venn diagram can help us understand the classification of events (Figure 1).


Figure 1: Example of a Venn diagram representing two different events in a sample space Ω. Every possible outcome is assigned to an appropriate region.

For example, when throwing a six-sided die,

Ω = {1, 2, 3, 4, 5, 6}

the event "getting an even result" is

A ⊂ Ω; A = {2, 4, 6}

or, for the sample space of all heights of a human population, we may decide to have two events: greater than 180 cm (event A) or less than or equal to 180 cm (event B).


In the case of the six-sided die, Figure 2 represents two different events:

• A is the event that the number is divisible by 2

• B is the event that the number is divisible by 3


Figure 2: Example of a Venn diagram representing two different events in a sample space Ω. Every possible outcome is assigned to an appropriate region.

Figure 3 shows how different regions can be described by a simple inspection of the Venn diagram. For example, two events are called disjoint or mutually exclusive if their intersection is the empty event, ∅ (the A and B above are not disjoint, since A ∩ B = {6}). Thus, in the case of the six-sided die, A ∪ B = {2, 3, 4, 6}. Note too that the complement Aᶜ = Ω − A = {1, 3, 5}, that A ∪ Aᶜ = Ω and that A ∩ Aᶜ = ∅.

Using the Venn diagrams, it is easy to see how the following properties hold:

commutative A ∩ B = B ∩ A, and A ∪ B = B ∪ A;

associative (A ∩ B) ∩ C = A ∩ (B ∩ C), and (A ∪ B) ∪ C = A ∪ (B ∪ C);

distributive A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); and, finally

idempotency A ∩ A = A and A ∪ A = A.

Exercise 1
Show that A ∪ (A ∩ B) = A ∩ (A ∪ B) = A and that (A − B) ∪ (A ∩ B) = A.

From the Venn diagrams we can also see how

• if A ⊂ B then Aᶜ ⊃ Bᶜ;

• (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ;

• (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ (these identities are checked numerically in the sketch below).
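These identities, and those of Exercise 1, are easy to check numerically. Below is a minimal sketch in Python; the example sets (the die events A = {2, 4, 6} and B = {3, 6} of Figure 2) are our own choice.

```python
# Check set identities on the six-sided die example:
# A = divisible by 2, B = divisible by 3.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 6}

def comp(X):
    """Complement with respect to the sample space omega."""
    return omega - X

# De Morgan laws
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)

# Identities from Exercise 1
assert A | (A & B) == A
assert A & (A | B) == A
assert (A - B) | (A & B) == A

print("all identities hold")
```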



Figure 3: The shaded regions show (a) A ∩ B, the intersection between two events; (b) A ∪ B, the union; (c) the complement Aᶜ of an event; and (d) A − B, those outcomes in A that do not belong to B.

2 Probability

Pr(A) is defined as the expected relative frequency of event A in a large number of trials. If an experiment has a total number of possible outcomes n_Ω, and n_A of them correspond to event A, then:

Pr(A) = n_A / n_Ω

From this definition we can deduce a number of properties:

1. For any event A in a sample space Ω, 0 ≤ Pr(A) ≤ 1.

2. For the entire sample space Ω, Pr(Ω) = n_Ω / n_Ω = 1.

3. For any two events A and B in Ω,

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B). (1)

4. (complement law) Pr(Aᶜ) = 1 − Pr(A).
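For a finite sample space with equally likely outcomes these properties reduce to direct counting, Pr(A) = n_A/n_Ω. A minimal sketch (the die events are again our own illustrative choice):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # divisible by 2
B = {3, 6}      # divisible by 3

def pr(event):
    """Pr(A) = n_A / n_Omega for equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Property 3, Eq. (1)
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
# Complement law
assert pr(omega - A) == 1 - pr(A)
print(pr(A | B))   # 2/3
```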


Exercise 2
Calculate the probability of drawing an ace or a spade from a pack of cards.

Exercise 3
A biased six-sided die has probabilities 3p, 2p, p/2, p, p, 3p of showing 1, 2, 3, 4, 5, 6, respectively. Find p.

Eq. (1) above can be generalized as follows:

Pr(A_1 ∪ A_2 ∪ · · · ∪ A_n) = Σ_i Pr(A_i) − Σ_{i<j} Pr(A_i ∩ A_j) + Σ_{i<j<k} Pr(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} Pr(A_1 ∩ A_2 ∩ · · · ∩ A_n)

Exercise 4
Find the probability of drawing from a pack a card that has at least one of the following properties:

• it is an ace

• it is a spade

• it is a black honour card (ace, king, queen, jack or 10)

• it is a black ace

Class note

Start with the probabilities of each individual event: 4/52, 13/52, 10/52, 2/52. Then find the intersections between pairs of events, then between triples, and finally of all of them at once (which turns out to be 1/52). The formula finally gives P = 20/52.
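The result can be cross-checked by enumerating the deck and counting the cards that satisfy at least one of the four properties; a sketch (the card encoding is ours):

```python
from fractions import Fraction
from itertools import product

ranks = ['A', 'K', 'Q', 'J', '10', '9', '8', '7', '6', '5', '4', '3', '2']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = list(product(ranks, suits))

def qualifies(card):
    rank, suit = card
    black = suit in ('spades', 'clubs')
    honour = rank in ('A', 'K', 'Q', 'J', '10')
    return (rank == 'A' or suit == 'spades'
            or (black and honour) or (black and rank == 'A'))

n = sum(1 for card in deck if qualifies(card))
print(Fraction(n, len(deck)))   # 5/13, i.e. 20/52
```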

Exercise 5
John and Mary are taking a mathematics course. The course has only three grades: A, B, and C. The probability that John gets a B is .3. The probability that Mary gets a B is .4. The probability that neither gets an A but at least one gets a B is .1. What is the probability that at least one gets a B but neither gets a C?

Class note

Let the sample space be:

ω_1 = (A,A)   ω_4 = (B,A)   ω_7 = (C,A)
ω_2 = (A,B)   ω_5 = (B,B)   ω_8 = (C,B)
ω_3 = (A,C)   ω_6 = (B,C)   ω_9 = (C,C)


where the first grade is John's and the second is Mary's. You are given that

P(ω_4) + P(ω_5) + P(ω_6) = .3,
P(ω_2) + P(ω_5) + P(ω_8) = .4,
P(ω_5) + P(ω_6) + P(ω_8) = .1.

Adding the first two equations and subtracting the third, we obtain the desired probability as

P(ω_2) + P(ω_4) + P(ω_5) = .6.

Finally, the total probability law states that, for a series of mutually exclusive events A_i that exhaust the sample space,

Pr(B) = Σ_i Pr(A_i ∩ B)

3 Conditional probability and Bayes’ theorem

3.1 Conditional probability

Suppose we assign a distribution function to a sample space and then learn that an event E has occurred. How should we change the probabilities of the remaining events? We shall call the new probability for an event F the conditional probability of F given E and denote it by P(F|E).

This amounts to asking for the probability of one event given another. From Pr(A ∩ B) = Pr(A)Pr(B|A), we obtain (see Figure 4):

Pr(B|A) = Pr(B ∩ A) / Pr(A) (2)

We can formalize this result in the following manner. Let Ω = {ω_1, ω_2, . . . , ω_r} be the original sample space with distribution function m(ω_j) assigned. Suppose we learn that the event E has occurred. We want to assign a new distribution function m(ω_j|E) to Ω to reflect this fact. Clearly, if a sample point ω_j is not in E, we want m(ω_j|E) = 0. Moreover, in the absence of information to the contrary, it is reasonable to assume that the probabilities for ω_k in E should have the same relative magnitudes that they had before we learned that E had occurred. For this we require that

m(ω_k|E) = c m(ω_k)

for all ω_k in E, with c some positive constant. But we must also have

Σ_{ω_k ∈ E} m(ω_k|E) = c Σ_{ω_k ∈ E} m(ω_k) = 1.

Thus,

c = 1 / Σ_{ω_k ∈ E} m(ω_k) = 1/P(E).



Figure 4: Conditional probability: evaluating Pr(A ∩ B) = 3/30 is equivalent to taking the product of Pr(B) = 7/30 times Pr(A|B) = 3/7, or the product of Pr(A) = 13/30 times Pr(B|A) = 3/13.

(Note that this requires us to assume that P(E) > 0.) Thus, we will define

m(ω_k|E) = m(ω_k)/P(E)

for ω_k in E. We will call this new distribution the conditional distribution given E. For a general event F, this gives

P(F|E) = Σ_{ω_k ∈ F∩E} m(ω_k|E) = Σ_{ω_k ∈ F∩E} m(ω_k)/P(E) = P(F ∩ E)/P(E).

We call P(F|E) the conditional probability of F occurring given that E occurs, and compute it using the formula

P(F|E) = P(F ∩ E)/P(E).

Exercise 6
An experiment consists of rolling a die once. If the die has been rolled and we are told the result is greater than 4, what is the probability that it is a 5?

Class note

Let X be the outcome. Let F be the event X = 5, and let E be the event X > 4. We assign the distribution function m(ω) = 1/6 for ω = 1, 2, . . . , 6. Thus, P(F) = 1/6. Now suppose that the die is rolled and we are told that the event E has occurred. This leaves only two possible outcomes: 5 and 6. In the absence of any other information, we would still regard these outcomes to be equally likely, so the probability of F becomes 1/2, making P(F|E) = 1/2.
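A quick simulation agrees with this result; a sketch (the sample size is arbitrary):

```python
import random

random.seed(1)
N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

# condition on E: the result is greater than 4
big = [x for x in rolls if x > 4]
p_est = sum(1 for x in big if x == 5) / len(big)
print(p_est)   # close to 1/2
```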


Exercise 7
In a Life Table, one finds that in a population of 100,000 females, 89.835% can expect to live to age 60, while 57.062% can expect to live to age 80. Given that a woman is 60, what is the probability that she lives to age 80?

Class note

This is an example of a conditional probability. In this case, the original sample space can be thought of as a set of 100,000 females. The events E and F are the subsets of the sample space consisting of all women who live at least 60 years, and at least 80 years, respectively. We consider E to be the new sample space, and note that F is a subset of E. Thus, the size of E is 89,835, and the size of F is 57,062. So, the probability in question equals 57,062/89,835 = .6352. Thus, a woman who is 60 has a 63.52% chance of living to age 80.

In terms of the Venn diagram, we can consider Pr(B|A) to be the probability of B in the reduced sample space defined by A. Thus, for two mutually exclusive events,

Pr(A|B) = 0 = Pr(B|A).

When randomly drawing an object from a set of objects we say we are sampling the space, and we can do so with replacement or without replacement.

Example
We have two urns, I and II. Urn I contains 2 black balls and 3 white balls. Urn II contains 1 black ball and 1 white ball. An urn is drawn at random and a ball is chosen at random from it. We can represent the sample space of this experiment as the paths through a tree as shown in Figure 5. The probabilities assigned to the paths are also shown.

Let B be the event "a black ball is drawn," and I the event "urn I is chosen." Then the branch weight 2/5, which is shown on one branch in the figure, can now be interpreted as the conditional probability P(B|I).

Suppose we wish to calculate P(I|B). Using the formula, we obtain

P(I|B) = P(I ∩ B)/P(B) = P(I ∩ B)/[P(B ∩ I) + P(B ∩ II)] = (1/5)/(1/5 + 1/4) = 4/9.

Another way to visualize this is by drawing the table:

       w      b
I     3/10   1/5
II    1/4    1/4


Figure 5: Tree diagram.

which represents the probabilities for the events in the sample space Ω: 3/10+1/5+1/4+1/4 = 1.
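The forward-tree computation can also be written out directly with exact fractions; a sketch (the variable names are ours):

```python
from fractions import Fraction as F

p_urn = {'I': F(1, 2), 'II': F(1, 2)}      # an urn is chosen at random
p_black = {'I': F(2, 5), 'II': F(1, 2)}    # P(B | urn)

# joint probabilities along the forward tree
joint = {u: p_urn[u] * p_black[u] for u in p_urn}
p_B = sum(joint.values())                  # total probability of black

print(p_B)                 # 9/20
print(joint['I'] / p_B)    # P(I | B) = 4/9
```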

Exercise 8
Find the probability of drawing two aces if (i) we replace the card after the first draw or (ii) we do not.

Class note

In the first case we have statistically independent events (Pr(A ∩ B) = Pr(A)Pr(B)); in the second they are not independent and we must use Eq. (2). In general, one should think of two successive events as a single event in the form of a pair.

The above exercise shows what it means for two events A and B to be statistically independent. This occurs when Pr(A|B) = Pr(A) or, equivalently, Pr(B|A) = Pr(B) (or at least one of the events has probability 0). It can be shown that, for two independent events, Pr(A ∩ B) = Pr(A)Pr(B).

A set of events A_1, A_2, . . . , A_n is said to be mutually independent if for any subset A_i, A_j, . . . , A_m of these events we have

P(A_i ∩ A_j ∩ · · · ∩ A_m) = P(A_i)P(A_j) · · · P(A_m),

or equivalently, if for any sequence Â_1, Â_2, . . . , Â_n with Â_j = A_j or Â_j = A_jᶜ,

P(Â_1 ∩ Â_2 ∩ · · · ∩ Â_n) = P(Â_1)P(Â_2) · · · P(Â_n).

It is natural to ask: if all pairs of a set of events are independent, is the whole set mutually independent? The answer is: not necessarily.


Exercise 9
A coin is tossed twice. Consider the following events:
A: heads on the first toss.
B: heads on the second toss.
C: the two tosses come out the same.

1. Show that A, B, C are pairwise independent but not independent.

2. Show that C is independent of A and B but not of A ∩B.

Class note

• We have

P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4
P(A)P(B) = P(A)P(C) = P(B)P(C) = 1/4
P(A ∩ B ∩ C) = 1/4 ≠ P(A)P(B)P(C) = 1/8

• We have

P(A ∩ C) = P(A)P(C) = 1/4

so C and A are independent,

P(C ∩ B) = P(B)P(C) = 1/4

so C and B are independent, and

P(C ∩ (A ∩ B)) = 1/4 ≠ P(C)P(A ∩ B) = 1/8

so C and A ∩ B are not independent.

It is important to note that the statement

P(A_1 ∩ A_2 ∩ · · · ∩ A_n) = P(A_1)P(A_2) · · · P(A_n)

does not imply that the events A_1, A_2, . . . , A_n are mutually independent (see Exercise 10).

Exercise 10
Let Ω = {a, b, c, d, e, f}. Assume that m(a) = m(b) = 1/8 and m(c) = m(d) = m(e) = m(f) = 3/16. Let A, B, and C be the events A = {d, e, a}, B = {c, e, a}, C = {c, d, a}. Show that P(A ∩ B ∩ C) = P(A)P(B)P(C) but no two of these events are independent.
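A direct check with exact arithmetic; a sketch (we take C = {c, d, a} as reconstructed above, since with the set {c, d, f} that appears in the garbled source the claimed identity would fail):

```python
from fractions import Fraction as F

m = {'a': F(1, 8), 'b': F(1, 8),
     'c': F(3, 16), 'd': F(3, 16), 'e': F(3, 16), 'f': F(3, 16)}

def pr(event):
    return sum(m[w] for w in event)

A, B, C = {'d', 'e', 'a'}, {'c', 'e', 'a'}, {'c', 'd', 'a'}

print(pr(A & B & C) == pr(A) * pr(B) * pr(C))   # True: both equal 1/8
print(pr(A & B) == pr(A) * pr(B))               # False: 5/16 vs 1/4
print(pr(A & C) == pr(A) * pr(C))               # False
print(pr(B & C) == pr(B) * pr(C))               # False
```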


Even more, if A is the union of n mutually exclusive events A_i:

Pr(A ∩ B) = Σ_i Pr(A_i ∩ B)
Pr(A|B) = Σ_i Pr(A_i|B)

which is the addition law for conditional probabilities. If the set of A_i's exhausts Ω, then

Pr(B) = Σ_i Pr(A_i)Pr(B|A_i) (3)

Exercise 11
Suppose that we have a coin which comes up heads with probability p, and tails with probability q. Now suppose that this coin is tossed twice. Let E be the event that heads turns up on the first toss and F the event that tails turns up on the second toss. Show that P(E ∩ F) = P(E)P(F).

Class note

Using a frequency interpretation of probability, it is reasonable to assign to the outcome (H, H) the probability p², to the outcome (H, T) the probability pq, and so on. We will now check that, with the above probability assignments, these two events are independent, as expected. We have P(E) = p² + pq = p and P(F) = pq + q² = q. Finally, P(E ∩ F) = pq, so P(E ∩ F) = P(E)P(F).

Class note

It is often, but not always, intuitively clear when two events are independent. In the above example, let A be the event "the first toss is a head" and B the event "the two outcomes are the same." Then

P(B|A) = P(B ∩ A)/P(A) = P({HH})/P({HH, HT}) = (1/4)/(1/2) = 1/2 = P(B).

Therefore, A and B are independent, but the result was not so obvious.

3.2 Bayes’ (inverse) probability

From the fact that the probability of both A and B occurring can be calculated either as Pr(A)Pr(B|A) or as Pr(B)Pr(A|B), we obtain Bayes' theorem (in other words, a rule for finding the inverse probability):

Pr(A|B) = [Pr(A)/Pr(B)] Pr(B|A) (4)

This theorem also shows that Pr(B|A) ≠ Pr(A|B) unless Pr(A) = Pr(B).


Figure 6: Reverse tree diagram.

Class note

In the following example, use the table extracted from the tree in the urn example above.

Example
For the example of the white and black balls, the paths through the reverse tree (Figure 6) are in one-to-one correspondence with those in the forward tree, since they correspond to individual outcomes of the experiment, and so they are assigned the same probabilities. From the forward tree, we find that the probability of a black ball is

(1/2) · (2/5) + (1/2) · (1/2) = 9/20.

The probabilities for the branches at the second level are found by simple division. For example, if x is the probability to be assigned to the top branch at the second level, we must have

(9/20) · x = 1/5

or x = 4/9. Thus, P(I|B) = 4/9, in agreement with our previous calculations. The reverse tree then displays all of the inverse, or Bayes, probabilities.

If we take into account that Pr(B) = Pr(A)Pr(B|A) + Pr(Aᶜ)Pr(B|Aᶜ) (see Eq. (3)), we can rewrite the above formula as:

Pr(A|B) = Pr(A)Pr(B|A) / [Pr(A)Pr(B|A) + Pr(Aᶜ)Pr(B|Aᶜ)] (5)


Next, we give a more general derivation for the case of more than one alternative event.

We return now to the calculation of more general Bayes probabilities. Suppose we have a set of events H_1, H_2, . . . , H_m that are pairwise disjoint and such that the sample space Ω satisfies the equation

Ω = H_1 ∪ H_2 ∪ · · · ∪ H_m.

We call these events hypotheses. We also have an event E that gives us some information about which hypothesis is correct. We call this event evidence.

Before we receive the evidence, then, we have a set of prior probabilities P(H_1), P(H_2), . . . , P(H_m) for the hypotheses. If we know the correct hypothesis, we know the probability for the evidence. That is, we know P(E|H_i) for all i. We want to find the probabilities for the hypotheses given the evidence, that is, the conditional probabilities P(H_i|E). These probabilities are called the posterior probabilities.

To find these probabilities, we write them in the form

P(H_i|E) = P(H_i ∩ E)/P(E). (6)

We can calculate the numerator from our given information by

P(H_i ∩ E) = P(H_i)P(E|H_i). (7)

Since one and only one of the events H_1, H_2, . . . , H_m can occur, we can write the probability of E as

P(E) = P(H_1 ∩ E) + P(H_2 ∩ E) + · · · + P(H_m ∩ E).

Using Equation (7), the above expression can be seen to equal

P(H_1)P(E|H_1) + P(H_2)P(E|H_2) + · · · + P(H_m)P(E|H_m). (8)

Using (6), (7), and (8) yields Bayes' formula:

P(H_i|E) = P(H_i)P(E|H_i) / Σ_{k=1}^{m} P(H_k)P(E|H_k).

Although this is a very famous formula, we will rarely use it. If the number of hypotheses is small, a simple tree measure calculation is easily carried out, as we have done in our examples. If the number of hypotheses is large, then we should use a computer.

Bayes probabilities are particularly appropriate for medical diagnosis. A doctor is anxious to know which of several diseases a patient might have. She collects evidence in the form of the outcomes of certain tests. From statistical studies the doctor can find the prior probabilities of the various diseases before the tests, and the probabilities for specific test outcomes, given a particular disease. What the doctor wants to know is the posterior probability for the particular disease, given the outcomes of the tests. In other words, here the diseases play the role of the hypotheses H_i and the test outcomes play the role of the evidence E.


Disease   Number having    Test results
          this disease     ++     +−     −+     −−
d1            3215         2110    301    704    100
d2            2125          396    132   1187    410
d3            4660          510   3568     73    509
Total        10000

Table 1: Diseases data.

        d1     d2     d3
++     .700   .131   .169
+−     .075   .033   .892
−+     .358   .604   .038
−−     .098   .403   .499

Table 2: Posterior probabilities.

Example
A doctor is trying to decide if a patient has one of three diseases d1, d2, or d3. Two tests are to be carried out, each of which results in a positive (+) or a negative (−) outcome. There are four possible test patterns ++, +−, −+, and −−. National records have indicated that, for 10,000 people having one of these three diseases, the distribution of diseases and test results are as in Table 1.

From this data, we can estimate the prior probabilities for each of the diseases and, given a particular disease, the probability of a particular test outcome (Pr(E|H_i)). For example, the prior probability of disease d1 may be estimated to be 3215/10,000 = .3215. The probability of the test result +−, given disease d1, may be estimated to be 301/3215 = .094.

We can now use Bayes' formula to compute various posterior probabilities. The results for this example are shown in Table 2.

We note from the outcomes that, when the test result is ++, the disease d1 has a significantly higher probability than the other two. When the outcome is +−, this is true for disease d3. When the outcome is −+, this is true for disease d2. Note that these statements might have been guessed by looking at the data. If the outcome is −−, the most probable cause is d3, but the probability that a patient has d2 is only slightly smaller. If one looks at the data in this case, one can see that it might be hard to guess which of the two diseases d2 and d3 is more likely.

Our final example shows that one has to be careful when the prior probabilities are small.

Example
A doctor gives a patient a test for a particular cancer. Before the results of the test, the only evidence the doctor has to go on is that 1 woman in 1000 has this cancer. Experience has shown that, in 99 percent of the cases in which cancer is present, the test is positive; and in 95 percent of the cases in which it is not present, it is negative. If the test turns out to be positive, what probability should the doctor assign to the event that cancer is present? An alternative form of this question is to ask for the relative frequencies of false positives and cancers.


Figure 7: Forward and reverse tree diagrams.

We are given that prior(cancer) = .001 and prior(not cancer) = .999. We know also that P(+|cancer) = .99, P(−|cancer) = .01, P(+|not cancer) = .05, and P(−|not cancer) = .95. Using this data gives the result shown in Figure 7.

We see now that the probability of cancer given a positive test has only increased from .001 to .019. While this is nearly a twenty-fold increase, the probability that the patient has the cancer is still small. Stated in another way, among the positive results, 98.1 percent are false positives and 1.9 percent are cancers. When a group of second-year medical students was asked this question, over half of the students incorrectly guessed the probability to be greater than .5.
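The arithmetic behind Figure 7 can be reproduced in a few lines; a sketch:

```python
# Prior and test characteristics from the example.
p_cancer = 0.001
p_pos_given_cancer = 0.99
p_pos_given_healthy = 0.05

# Total probability of a positive test, Eq. (3).
p_pos = (p_cancer * p_pos_given_cancer
         + (1 - p_cancer) * p_pos_given_healthy)

# Bayes' theorem, Eq. (5).
posterior = p_cancer * p_pos_given_cancer / p_pos
print(round(posterior, 3))   # 0.019
```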

Exercise 12
Suppose that a blood test for a disease gives a false negative rate of 0.01% and a false positive rate of 0.02%. In general, one person in 10,000 individuals of the population is infected. What is the probability that an individual taken from the population who has tested positive for the disease actually has the disease?

Class note

Use Eq. (5) and substitute Pr(A) = 1/10000 (with Pr(Aᶜ) = 1 − Pr(A)), together with Pr(B|A) = 9999/10000 and Pr(B|Aᶜ) = 2/10000.


4 Permutations and combinations

We need to be able to count the number of possible outcomes in various common situations:

nPk = n!/(n − k)! (9)

nCk = n!/[(n − k)! k!] (10)

where nPk is the number of permutations (ordered selections) of k objects from n, and nCk the number of combinations (unordered selections).

Exercise 13
Find the probability that in a group of k people at least two have the same birthday.

Class note

The number of ways in which all k birthdays are different is 365Pk = 365!/(365 − k)!, and the probability that all the birthdays are different is p = 365!/[(365 − k)! 365^k]. The required probability is therefore 1 − p.
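Numerically, using math.perm (available from Python 3.8); a sketch:

```python
from math import perm

def p_shared_birthday(k):
    """Probability that at least two of k people share a birthday."""
    return 1 - perm(365, k) / 365**k

for k in (10, 23, 50):
    print(k, round(p_shared_birthday(k), 3))
# 23 people already give a probability above 1/2 (about 0.507)
```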

Exercise 14
A hand of 13 playing cards is dealt from a well-shuffled pack of 52. What is the probability that the hand contains two aces?

Class note

Since we do not care about the order of the cards, the total number of distinct hands is simply the number of combinations of 13 objects taken from a pile of 52, 52C13. The number of hands that contain 2 aces equals the number of ways, 4C2, of choosing 2 of the 4 aces, multiplied by the number of ways, 48C11, in which the remaining 11 cards in the hand can be chosen from the 48 cards that are not aces:

4C2 · 48C11 / 52C13 = 0.213
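The same computation with math.comb; a sketch:

```python
from math import comb

p = comb(4, 2) * comb(48, 11) / comb(52, 13)
print(round(p, 3))   # 0.213
```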

5 Random Variables and distributions

The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be "heads" or "tails". However, we often want to represent outcomes as numbers. A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated. Thus, it is possible to assign a probability distribution to any random variable. Random variables can be discrete or continuous.


5.1 Discrete random variables

A discrete random variable X takes only discrete values x_1, x_2, . . . , x_n, with probabilities p_1, p_2, . . . , p_n. A probability distribution or probability function (PF) f(x) assigns probabilities to all the distinct values that X may take:

f(x) = Pr(X = x) = p_i if x = x_i, and 0 otherwise (11)

It is obvious that for a discrete random variable

Σ_{i=1}^{n} f(x_i) = 1

We may also define a cumulative probability function (CPF), F(x), as

F(x) = Pr(X ≤ x) = Σ_{x_i ≤ x} f(x_i)

which helps in obtaining, e.g., the probability that X lies between two values:

Pr(l_1 < X ≤ l_2) = Σ_{l_1 < x_i ≤ l_2} f(x_i) = F(l_2) − F(l_1)

Exercise 15
A bag contains seven red balls and three white balls. Three balls are drawn at random and not replaced. Find the probability function for the number of red balls drawn.

Class note

Draw a tree, and immediately:

Pr(X = 0) = f(0) = (3/10)(2/9)(1/8) = 1/120
Pr(X = 1) = f(1) = (3/10)(2/9)(7/8) × 3 = 7/40
Pr(X = 2) = f(2) = (3/10)(7/9)(6/8) × 3 = 21/40
Pr(X = 3) = f(3) = (7/10)(6/9)(5/8) = 7/24
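The same numbers follow from counting combinations (a hypergeometric distribution); a sketch:

```python
from fractions import Fraction
from math import comb

def f(x, red=7, white=3, drawn=3):
    """P(X = x red balls) when drawing 3 balls without replacement."""
    return Fraction(comb(red, x) * comb(white, drawn - x),
                    comb(red + white, drawn))

dist = {x: f(x) for x in range(4)}
print(dist)   # f(0)=1/120, f(1)=7/40, f(2)=21/40, f(3)=7/24
assert sum(dist.values()) == 1
```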

5.2 Continuous random variables

(See [?] for an example of the analysis of fat-tailed distributions of biological networks.)


X has a continuous distribution if it is defined for a continuous range of values between given limits l_1 and l_2 (possibly −∞ to ∞). In this case, the probability density function (PDF) f(x) is defined by

Pr(x < X ≤ x + dx) = f(x)dx

It is also obvious here that

∫_{l_1}^{l_2} f(x)dx = 1

The cumulative distribution function (CDF), F(x), for a continuous random variable is then written as:

F(x) = Pr(X ≤ x) = ∫_{l_1}^{x} f(u)du

and

Pr(a < X ≤ b) = F(b) − F(a)

Exercise 16
A random variable X has PDF f(x) = A e^{−x} on ]0, ∞[ and zero elsewhere. Find the probability that X lies in ]1, 2].

Class note

First find A by normalization (integrating between 0 and ∞).

Exercise 17
Use a Monte Carlo procedure to find the area below the curve y = x² in the interval [0, 1].
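A minimal hit-or-miss Monte Carlo sketch for this exercise (the sample size and seed are arbitrary):

```python
import random

random.seed(1)
N = 100_000
hits = 0
for _ in range(N):
    x, y = random.random(), random.random()   # uniform point in the unit square
    if y < x**2:                              # point falls below the curve
        hits += 1
print(hits / N)   # estimates the area, which is exactly 1/3
```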

Example
A real number is chosen at random from [0, 1] with uniform probability, and then this number is squared. Let X represent the result. What is the cumulative distribution function of X? What is the density of X?

We begin by letting U represent the chosen real number. Then X = U². If 0 ≤ x ≤ 1, then we have

F(x) = P(X ≤ x) = P(U² ≤ x) = P(U ≤ √x) = √x.


Figure 8: Distribution and density for X = U².

It is clear that X always takes on a value between 0 and 1, so the cumulative distribution function of X is given by

F(x) = 0 if x ≤ 0, √x if 0 ≤ x ≤ 1, and 1 if x ≥ 1.

From this we easily calculate that the density function of X is

f_X(x) = 1/(2√x) if 0 < x ≤ 1, and 0 otherwise.

Note that F(x) is continuous, but f_X(x) is not. (See Figure 8.)

5.3 Joint probability distributions

We need to define them when we consider sets of random variables that are not necessarily independent of each other. For, say, two random variables X and Y, one may define a joint probability function by (discrete variables)

Pr(X = x_i, Y = y_j) = f(x_i, y_j)

or (continuous variables)

Pr(x < X ≤ x + dx, y < Y ≤ y + dy) = f(x, y)dxdy

If the variables are independent, with respective probability distributions g(x) and h(y),

Pr(X = x_i, Y = y_j) = g(x_i)h(y_j)
Pr(x < X ≤ x + dx, y < Y ≤ y + dy) = g(x)h(y)dxdy


Figure 9: Solution sketch for Exercise 19 (Buffon's needle).

Exercise 18
The independent random variables X and Y have the PDFs g(x) = e^{−x} and h(y) = 2e^{−2y}, respectively. Calculate the probability that X lies in the interval 1 < X ≤ 2 and Y lies in the interval 0 < Y ≤ 1.

Class note

Since the variables are independent, we simply need to integrate the previous equation; see the sketch below.
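Since the joint PDF factorizes, the probability is a product of two one-dimensional integrals. A sketch comparing the closed form with numerical quadrature (we assume scipy is available):

```python
from math import exp
from scipy.integrate import dblquad

# closed form: (int_1^2 e^{-x} dx) * (int_0^1 2 e^{-2y} dy)
exact = (exp(-1) - exp(-2)) * (1 - exp(-2))

# numerical check over x in [1, 2], y in [0, 1]
numeric, _ = dblquad(lambda y, x: exp(-x) * 2 * exp(-2 * y), 1, 2, 0, 1)
print(exact, numeric)   # both about 0.201
```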

Exercise 19
A flat table is ruled with parallel straight lines a distance D apart, and a thin needle of length l < D is tossed onto the table at random. What is the probability that the needle will cross a line? (See http://en.wikipedia.org/wiki/Buffon%27s_needle for a solution.)

Class note

See page 1040 of [?] (Figure 9).


6 Properties of distributions

6.1 Expectation value

The average or expectation value E[g(X)] of any function g(X) of the random variable X that follows a distribution f(x) is defined as

E[g(X)] = Σ_i g(x_i)f(x_i) for a discrete distribution, or ∫_{−∞}^{∞} g(x)f(x)dx for a continuous distribution

The extension to a joint distribution is immediate. Thus, for any probability density function f(x, y), the expectation value of any function g(X, Y) of the random variables X and Y is given by:

E[g(X, Y)] = Σ_i Σ_j g(x_i, y_j)f(x_i, y_j) for a discrete distribution, or ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y)f(x, y)dxdy for a continuous distribution

It easily follows that the expectation value fulfills these properties:

1. if a is a constant, E[a] = a;

2. if a is a constant, then E[ag(X)] = aE[g(X)];

3. if g(X) = s(X) + t(X) then E[g(X)] = E[s(X)] + E[t(X)].

6.2 Mean

The mean of a random variable X is simply the expectation value of X itself:

µ = ⟨x⟩ = E[X] = Σ_i x_i f(x_i) for a discrete distribution, or ∫ x f(x)dx for a continuous distribution

Exercise 20
The probability of finding a 1s electron in a hydrogen atom in a given infinitesimal volume dV is ψ*ψ dV, where the quantum mechanical wavefunction ψ is given by

ψ = A e^{−r/a_0}.

Find the value of the real constant A and deduce the mean distance of the electron from the origin.

Class note

Let us consider the random variable R, the distance of the electron from the origin. Since the 1s orbital is spherically symmetric, we may consider the infinitesimal volume element dV as the


spherical shell with inner radius r and outer radius r + dr. Thus, dV = 4πr²dr and the PDF of R is simply:

Pr(r < R ≤ r + dr) = f(r)dr = 4πr²A²e^{−2r/a_0}dr

The value of A is obtained by requiring the total probability to be 1; integrating by parts we obtain A = 1/(πa_0³)^{1/2}. Finally, by using the definition of the expectation value, we obtain:

E[R] = ∫_0^∞ r f(r)dr = (4/a_0³) ∫_0^∞ r³ e^{−2r/a_0} dr = 3a_0/2
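The normalization and the mean can be checked numerically; a sketch (we assume scipy and set a_0 = 1):

```python
from math import pi, exp, sqrt
from scipy.integrate import quad

a0 = 1.0
A = 1 / sqrt(pi * a0**3)

def f(r):
    """PDF of the distance R for the 1s orbital."""
    return 4 * pi * r**2 * A**2 * exp(-2 * r / a0)

norm, _ = quad(f, 0, float('inf'))
mean, _ = quad(lambda r: r * f(r), 0, float('inf'))
print(norm, mean)   # 1.0 and 1.5, i.e. 3 a0 / 2
```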

6.3 Mode and median

Mode: the value of the random variable X at which the probability (density) function reaches its maximum. If this occurs more than once, all the maxima are equally considered modes.

Median: the value of the random variable X at which the cumulative probability (density) function F(x) takes the value 1/2. Related to it are the lower and upper quartiles Q_l and Q_u, defined by F(Q_l) = 1/4 and F(Q_u) = 3/4.

Exercise 21
Find the mode of the PDF for the distance from the origin of the electron whose wavefunction was given in Exercise 20.

Class note

We only need to differentiate the f(r) of the previous exercise with respect to r.

6.4 Variance and standard deviation; moments

The variance V[X] of a distribution, also written σ², is defined as

V[X] = E[(X − µ)²] = Σ_j (x_j − µ)² f(x_j) for a discrete distribution, or ∫ (x − µ)² f(x)dx for a continuous distribution (12)

The standard deviation σ is the square root of the variance and, roughly speaking, measures the spread of the distribution. From the definition of the variance it follows that:

• V[a] = 0,

• V[aX + b] = a²V[X]


Exercise 22
Find the standard deviation of the PDF for the distance from the origin of the electron whose wavefunction was discussed above.

Class note

Page 988 of [?]. Basically we only need to apply the formula, obtaining V[R] = 3a_0²/4.

Using the definition of the variance it can also be shown that (the Bienaymé–Chebyshev inequality)

Pr(|X − µ| ≥ c) ≤ σ²/c²

which is very useful to bound the probability mass outside a given interval around the mean. For example,

Pr(|X − µ| ≥ 2σ) ≤ 1/4

and

Pr(|X − µ| ≥ 3σ) ≤ 1/9

which holds for any distribution f(x) that possesses a variance.

The mean is the first moment of X, since it is defined as the sum or the integral of the probability (density) function multiplied by the first power of x. We can extend this definition to other moments:

µ_k = E[X^k] = Σ_i x_i^k f(x_i) for a discrete distribution, or ∫ x^k f(x)dx for a continuous distribution

It is easy to see that the variance can be expressed in terms of the second moment:

V[X] = E[X²] − µ²

Exercise 23
A biased die has probabilities p/2, p, p, p, p, 2p of showing 1, 2, 3, 4, 5, 6, respectively. Find the mean, the second moment and the variance of the probability distribution.

Class note

See Figure 10, taken from page 990 of [?].
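With the reconstructed probabilities, normalization (6.5p = 1) gives p = 2/13 and the moments follow directly; a sketch (recall that the fifth probability, p, is our reading of the garbled statement):

```python
from fractions import Fraction as F

p = F(2, 13)                        # from requiring the probabilities to sum to 1
probs = [p / 2, p, p, p, p, 2 * p]  # fifth entry assumed, see the note above
faces = [1, 2, 3, 4, 5, 6]
assert sum(probs) == 1

mean = sum(x * w for x, w in zip(faces, probs))
second = sum(x**2 * w for x, w in zip(faces, probs))
var = second - mean**2
print(mean, second, var)   # 53/13, 253/13, 480/169
```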

In an analogous way to the moments, we also define the central moments:

ν_k = E[(X − µ)^k] = Σ_i (x_i − µ)^k f(x_i) for a discrete distribution, or ∫ (x − µ)^k f(x)dx for a continuous distribution


Figure 10: Solution for Exercise 23.

7 Functions of random variables

Sometimes we are more interested in functions derived from a given random variable X whose PDF f(x) is known. The question is how to find the PDF of a related random variable Y = Y(X), where Y(X) is some function of X.

7.1 Discrete random variables

The probability function for Y is given by

g(y) = Σ_j f(x_j) if y is one of the allowed values y_i, and 0 otherwise,

where the sum extends over those values of j for which y_i = Y(x_j). This is easy to evaluate only in the case that the function Y(X) possesses a single-valued inverse X(Y).

7.2 Continuous random variables

In this case, we obtain

g(y) = f(x(y)) |dx/dy|

Again, this is easy to evaluate if Y(X) possesses a single-valued inverse, but not so easy otherwise. However, in general, a closed-form expression may be obtained in the case where there exist single-


Figure 11: Solution for Exercise 24.

valued functions x_1(y) and x_2(y) giving the two values of x that correspond to any given value of y. Thus:

g(y) = f(x_1(y)) |dx_1/dy| + f(x_2(y)) |dx_2/dy|

Exercise 24
The random variable X is Gaussian distributed with mean µ and variance σ². Find the PDF of the new variable Y = (X − µ)²/σ².

Class note

See Figure 11, taken from [?].

8 Important distributions

8.1 Discrete distributions

8.1.1 Uniform distribution

The uniform distribution on a sample space Ω containing n elements is the function m defined by

m(ω) = 1/n

for every ω ∈ Ω.


Example
Consider the experiment that consists of rolling a pair of dice. We take as the sample space Ω the set of all ordered pairs (i, j) of integers with 1 ≤ i ≤ 6 and 1 ≤ j ≤ 6. Thus,

Ω = {(i, j) : 1 ≤ i, j ≤ 6}.

To determine the size of Ω, we note that there are six choices for i, and for each choice of i there are six choices for j, leading to 36 different outcomes. Let us assume that the dice are not loaded. In mathematical terms, this means that we assume that each of the 36 outcomes is equally likely, or equivalently, that we adopt the uniform distribution function on Ω by setting

m((i, j)) = 1/36, 1 ≤ i, j ≤ 6.

What is the probability of getting a sum of 7 on the roll of two dice, or a sum of 11? The first event, denoted by E, is the subset

E = {(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)}.

A sum of 11 is the subset F given by

F = {(5, 6), (6, 5)}.

Consequently,

P(E) = Σ_{ω∈E} m(ω) = 6 · (1/36) = 1/6,
P(F) = Σ_{ω∈F} m(ω) = 2 · (1/36) = 1/18.

What is the probability of getting neither snake eyes (double ones) nor boxcars (double sixes)? The event of getting either one of these two outcomes is the set

E = {(1, 1), (6, 6)}.

Hence, the probability of obtaining neither is given by

P(Eᶜ) = 1 − P(E) = 1 − 2/36 = 17/18.

Example
Suppose two pennies are flipped once each. There are several "reasonable" ways to describe the sample space. One way is to count the number of heads in the outcome; in this case, the sample space can be written {0, 1, 2}. Another description of the sample space is the set of all ordered pairs of H's and T's, i.e.,

{(H, H), (H, T), (T, H), (T, T)}.

Both of these descriptions are accurate ones, but it is easy to see that (at most) one of these, if assigned a constant probability function, can claim to accurately model reality. In this case, as opposed to the preceding example, the reader will probably say that the second description, with each outcome being assigned a probability of 1/4, is the "right" description. This conviction is due to experience; there is no proof that this is the way reality works.


8.1.2 Binomial distribution

The binomial distribution describes processes consisting of a number of independent identical trials with two possible outcomes, A and B = Aᶜ (success and failure). X is the number of times A occurs in n trials. The binomial distribution takes the form:

f(x) = Pr(X = x) = nCx p^x q^{n−x} = nCx p^x (1 − p)^{n−x} (13)

Exercise 25
If a single six-sided die is rolled five times, what is the probability that a six is thrown exactly three times?

One can show that, for the binomial distribution:

Σ_{x=0}^{n} f(x) = Σ_{x=0}^{n} nCx p^x q^{n−x} = (p + q)^n = 1

and that E[X] = np and V[X] = npq.
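Exercise 25 and these moment formulas can be checked with a short script; a sketch:

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability function, Eq. (13)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 1 / 6
print(binom_pmf(3, n, p))   # Exercise 25: about 0.032

# sanity checks: total probability, mean and variance
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * w for x, w in enumerate(pmf))
var = sum(x**2 * w for x, w in enumerate(pmf)) - mean**2
assert abs(sum(pmf) - 1) < 1e-12
assert abs(mean - n * p) < 1e-12 and abs(var - n * p * (1 - p)) < 1e-12
```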

8.2 A continuous distribution: Gaussian

The PDF for a Gaussian distribution of a random variable X, with mean µ and variance σ², takes the form:

f(x) = (1/(σ√(2π))) exp[−(1/2)((x − µ)/σ)²]

Exercise 26
Find the inflection points of the Gaussian distribution.

Let us consider the random variable Z = (X − µ)/σ, whose PDF takes the form:

φ(z) = (1/√(2π)) exp(−z²/2)

This is called the standard Gaussian distribution, with µ = 0 and σ² = 1, Z being the standard variable.

The cumulative probability function for the standard Gaussian distribution can be evaluated with

Φ(z) = Pr(Z < z) = (1/√(2π)) ∫_{−∞}^{z} exp(−u²/2) du
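Φ(z) has no elementary closed form, but it can be written in terms of the error function as Φ(z) = [1 + erf(z/√2)]/2; a sketch:

```python
from math import erf, sqrt

def Phi(z):
    """CDF of the standard Gaussian via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for z in (1, 2, 3):
    # probability of lying within z standard deviations of the mean
    print(z, round(Phi(z) - Phi(-z), 4))
# 1 0.6827, 2 0.9545, 3 0.9973
```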


Figure 12: An example of a Markov model with 5 states and different transition probabilities.

9 Markov chains

A Markov model is a probabilistic process over a finite set of states, {s_1, s_2, . . . , s_N}. Each state transition generates a character from the alphabet of the process. The probability of each state coming up next is denoted by Pr(q_t = s_j), where q_t is the actual state at time t. Depending on whether or not this probability depends on previous states, we distinguish Markov models of different orders:

Order 0 which have no memory: Pr(q_t = s_j | q_{t−1} = s_i, q_{t−2} = s_k, . . .) = Pr(q_t = s_j) = Pr(q_{t′} = s_j) for all points t and t′ in the sequence. It is equivalent to a multinomial probability distribution (see Eq. (13) for a case with just two possible states).

Order 1 having memory of size 1: Pr(q_t = s_j | q_{t−1} = s_i, q_{t−2} = s_k, . . .) = Pr(q_t = s_j | q_{t−1} = s_i). Such models can be rationalized as N order-0 Markov models, one for each previous state s_i.

Order m In general, m is called the length of the context upon which the probabilities of the possible values of the next step depend.

Here we will just consider order-1 processes in which the right-hand side of the above expressions is independent of time (time-homogeneous processes), thereby leading to the set of transition probabilities a_ij of the form:

a_ij = Pr(q_t = s_j | q_{t−1} = s_i), 1 ≤ i, j ≤ N

with a_ij ≥ 0 and Σ_{j=1}^{N} a_ij = 1. See Figure 12 for a graphical example.

The above stochastic process is an observable Markov model, since the output of the process is the set of states at each time, where each state is a physical observable event.


A Markov chain is a random process where all information about the future is contained in the present state (a Markov model of order 1).

The probability of going from state i to state j in n steps is given by

a_ij^(n) = Pr(q_n = s_j | q_0 = s_i)

so that

a_ij^(n) = Σ_{r∈S} a_ir^(k) a_rj^(n−k)

where S is the state space of the Markov chain.

A state j is accessible from a state i (i → j) if a system started in state i has non-zero probability of reaching j at some point:

Pr(q_n = s_j | q_0 = s_i) = a_ij^(n) > 0 for some n

A state i communicates with state j if both i → j and j → i. We can then define a communicating class as a set of states that communicate with each other but not with the exterior, and a closed class when there is no probability of leaving the class. A Markov chain is said to be irreducible if S is a single communicating class: if it is possible to get to any state from any state.

A state s_i has period k if any return to it occurs in multiples of k steps, k being the greatest common divisor of all possible return times. A state is aperiodic if k = 1.

A state s_i is transient if, given that we start at s_i, there is a non-zero probability that we will never return to s_i. Formally:

Pr(T_i = ∞) > 0

where T_i is a random variable that represents the first return time to state s_i (the "hitting time"). On the contrary, a state is said to be recurrent or persistent if it is not transient, and positive recurrent if, in addition, the expected value of the return time is finite:

M_i = E[T_i] < ∞

Finally, a state is called absorbing if it is not possible to leave it: a_ii = 1 and, thus, a_ij = 0 for all i ≠ j.

An important definition is that of an ergodic state: one that is aperiodic and positive recurrent (it can be reached in a finite time). If all states in a Markov chain are ergodic, the chain is said to be ergodic.

9.1 Steady state analysis of Markov chains

In a time-homogeneous Markov chain, the vector π is called a stationary distribution (or invariant measure) if Σ_j π_j = 1, π_j ≥ 0, and

π_j = Σ_{i∈S} π_i a_ij


An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent. In that case, π is unique and is related to the expected return time:

π_j = 1/M_j

and, if the chain is also aperiodic, then for any i and j:

lim_{n→∞} a_ij^(n) = 1/M_j

The chain converges to the stationary distribution regardless of where it begins! Such a π is called the equilibrium distribution of the chain.

9.2 Transition matrix

Class note

Nice examples at http://www.sosmath.com/matrix/markov/markov.html

The transition matrix P, with entries

a_ij = Pr(q_{t+1} = s_j | q_t = s_i)

is a right stochastic matrix, because each row sums to 1 and all values are non-negative. If the Markov chain is time-homogeneous, the k-step transition probabilities can be computed as the matrix power P^k.

The stationary distribution π is a row vector that fulfills:

π = πP

(a left eigenvector of the transition matrix with eigenvalue 1). One can also see the vector π as a fixed (equilibrium) point of the linear transformation P, which always exists but is only unique when the Markov chain is irreducible and aperiodic. In such a case, P^k converges to a rank-one matrix in which each row is the stationary distribution π:

lim_{k→∞} P^k = 1π

where 1 is a column vector of ones.

Class note

To be completed with reversible Markov chains and Bernoulli schemes and processes, as well as general state spaces.

Exercise 27
For example, imagine a 3-state Markov model of the weather, taken from [?]. Every day the weather is observed at noon and annotated as one of three different states:


Figure 13: Ideas for the solution of the problem.

state 1: rainy

state 2: cloudy

state 3: sunny

The transition probabilities are set to

A = (a_ij) = | 0.4  0.3  0.3 |
             | 0.2  0.6  0.2 |
             | 0.1  0.1  0.8 |

Provided that on day 1 it is sunny, what is the probability that we will have the series sunny-cloudy-sunny-sunny-sunny-rainy? What is the stationary distribution for the Markov chain? How would you design a random generation of "weathers" based on this Markov chain?

Class note

We only need to multiply probabilities once the tree has been built; see also the sketch below.
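A sketch with numpy: the series probability is a product of transition-matrix entries, the stationary distribution shows up in the rows of P^k for large k, and random generation samples each next state from the row of the current state. We use the state order rainy, cloudy, sunny, and read the observed series as starting on day 2:

```python
import numpy as np

P = np.array([[0.4, 0.3, 0.3],    # from rainy
              [0.2, 0.6, 0.2],    # from cloudy
              [0.1, 0.1, 0.8]])   # from sunny
RAINY, CLOUDY, SUNNY = 0, 1, 2

# day 1 sunny, then sunny-cloudy-sunny-sunny-sunny-rainy on days 2-7
path = [SUNNY, SUNNY, CLOUDY, SUNNY, SUNNY, SUNNY, RAINY]
prob = np.prod([P[i, j] for i, j in zip(path, path[1:])])
print(prob)   # 0.8 * 0.1 * 0.2 * 0.8 * 0.8 * 0.1 = 0.001024

# stationary distribution: every row of P^k converges to pi
print(np.linalg.matrix_power(P, 100)[0])

# random generation of "weathers"
rng = np.random.default_rng(0)
state, days = SUNNY, []
for _ in range(10):
    state = rng.choice(3, p=P[state])
    days.append(int(state))
print(days)
```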

Exercise 28
A large biased coin A is described by the order-0 Markov model Pr(H) = 0.6, Pr(T) = 0.4; a small biased coin B by Pr(h) = 0.3, Pr(t) = 0.7; and a third biased coin C by Pr(H) = 0.8, Pr(T) = 0.2. Build the Markov chain and the transition matrix for the following process:

1. Let X equal one of the first two coins, A or B

2. repeat

(a) throw the coin X and write down the result

(b) toss coin C

(c) if C shows T then change X to the other coin; otherwise leave X unchanged

3. until enough results have been collected

Class note

Easy: we only need to build a Markov model with the 4 states H, T, h, t and think about the probability of making transitions i → i or i → j, which will depend on the results of tossing two coins each time (C to decide, and A or B to record).

9.3 Selected applications

• state of ion channels in cell membranes

• distribution of shapes of cells in a tissue (Figure 14)

• Brownian motion and the ergodic hypothesis in physics

• the Metropolis-Hastings algorithm, which samples from a given distribution by constructing a Markov chain whose equilibrium distribution is the target distribution

10 Hidden Markov Models

Real-world processes produce signals that can be:

• discrete (letters in an alphabet) or continuous (temperature)

• stationary (invariant to time) or nonstationary (signal properties varying with time)

• pure (coming from just one source) or corrupted (by noise)

We aim to find characteristics of real-world signals in terms of signal models. Why?

1. the signal model can provide the basis for a theoretical model that allows reproducing the signal

2. conversely, when the source is not available (its origin is unknown or the signal is expensive to reproduce), we may learn a lot by simply analyzing the signal

3. they work very well in practice... ;-)

The signal model can be deterministic (we know something about the signal, like its mathematical expression, the physics behind it...) or statistical (Gaussian processes, Poisson processes, Markov processes, hidden Markov processes...). In a statistical model we assume that the signal can be well characterized as a parametric random process, whose parameters can be precisely determined.

HMM design includes the solving of three fundamental problems [?]:

1. what is the probability (or likelihood) of a sequence of observations given a specific HMM?

2. what is the best sequence of model states?

3. how do we parametrize the HMM in order to best account for the observed signal?


Figure 14: A simple Markov chain model to show why many different multicellular organisms all have the same epithelial cell topology. The variable "how many neighbors each cell has" shows the same distribution in different organisms.


Figure 15: Toy model for an HMM applied to sequence analysis, taken from [?]. The example represents a simplified view of a 5' splice site recognition problem with three possible states: exon, 5' splice site, and intron.

An HMM involves having the following material:

1. the symbol alphabet to be emitted, with size M.

2. the number of states in the model, N.

3. the emission probabilities b_i(k) = Pr(v_k at t | q_t = s_i), 1 ≤ i ≤ N, 1 ≤ k ≤ M.

4. the transition probabilities, a_ij.

5. the initial state distribution, π_i = Pr(q_1 = s_i), 1 ≤ i ≤ N.

A toy model for an HMM is shown in Figure 15; a minimal numerical sketch of the likelihood computation follows.
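For problem 1 above (the likelihood of an observation sequence), the standard tool is the forward algorithm. Below is a minimal sketch for a two-state, two-symbol HMM; all parameter values are invented for illustration and are not taken from Figure 15:

```python
import numpy as np

# hypothetical 2-state HMM over the alphabet {0, 1}
pi = np.array([0.5, 0.5])     # initial state distribution
A = np.array([[0.9, 0.1],     # transition probabilities a_ij
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],     # emission probabilities b_i(k)
              [0.1, 0.9]])

def likelihood(obs):
    """Forward algorithm: P(obs | model), summing over all state paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(likelihood([0, 1, 1, 0]))
```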

HMMs do not deal well with correlations between observations, since observations depend only on the underlying states. For example, for secondary structure analysis of RNA, or for long-range correlations in the analysis of protein 3D structure, an HMM is not appropriate.

11 Sources of information

• Numerical recipes: http://www.nr.com


• LTCC math department: http://www.ltcconline.net/greenl/courses/204/204.htm

• SOS MATH: http://www.sosmath.com/diffeq/diffeq.html

• Probability: http://www.dartmouth.edu/~chance/teaching_aids/articles.html

• Software:

– R: http://www.r-project.org/

– Octave: http://www.octave.org/

– gnuplot: http://www.gnuplot.info/
