PROBABILITY THEORY AND STATISTICS V. Tassion D-ITET Spring 2020 (updated: May 24, 2020)




Introduction

Some questions you may ask

What is probability?

    Ü A mathematical language describing systems involving randomness.

    Where are probabilities used?

Ü Describe random experiments in the real world (coin flip, dice rolling, arrival times of customers in a shop, weather in 1 year, ...).

Ü Express uncertainty. For example, when a machine performs a measurement, the value is rarely exact. One may use probability theory in this context by saying that the value obtained is equal to the real value plus a small random error.

Ü Decision making. Probability theory can be used to describe a system when only part of the information is known. In such a context, it may help to make a decision.

Ü Randomized algorithms in computer science. Sometimes, it is more efficient to add some randomness to an algorithm. Examples: Google web search, ants searching for food.

Ü Simplify complex systems. Examples: water molecules in a glass of water, or cars on the highway.

The notion of probabilistic model

If one wants a precise physical description of a coin flip, one would need a lot (really!) of information: the exact position of the fingers and the coin, the initial velocity, the initial angular velocity, imperfections of the coin, the surface characteristics of the table, air currents, the brain activity of the gambler... These parameters are almost impossible to measure precisely, and a tiny change in one of them may completely change the result. For practical use, it is very convenient to use a probabilistic description, which simplifies the system a lot. We are only interested in the result, namely the set of possible outcomes.



For the coin flip, the probabilistic model is given by 2 possible outcomes (head and tail), and each outcome has probability phead = ptail = 1/2 of being realized. In other words,

    Coin flip = {{head, tail}, phead = 1/2, ptail = 1/2}

A surprising analysis: coin flips are not fair! If one tosses a coin, it is more likely to land on the same face it started on! See the YouTube video of Persi Diaconis: How random is a coin toss? - Numberphile

Probability laws: randomness vs ordering

If one performs a single random experiment (for example a coin flip), the result is unpredictable. In contrast, when one performs many random experiments, some general laws can be observed. For example, if one tosses 10000 independent coins, one should generally observe approximately 5000 heads and 5000 tails: this is an instance of a fundamental probability law, called the law of large numbers. One goal of probability theory is to describe how ordering can emerge out of many random experiments, and to establish some general probability laws.

    https://www.youtube.com/watch?v=AYnJv68T3MM

Chapter 1

    Mathematical framework

    The probabilistic model for the die

When one throws a cubic die (with six faces), we know that half of the time, the upper face of the die shows an even number: 2, 4, or 6. One would like to give a precise mathematical meaning to this phenomenon. More precisely, our goal is to give a rigorous sense to

P[“the upper face of the die is even”] = 1/2. (1.1)

First, we have to give a sense to “the upper face of the die is even”, and then to the notation P[⋅] which represents the probability. A first goal of the chapter is to introduce the notion of probability space used to model general random experiments. It will be given by a triple (Ω,F ,P), where Ω is called the sample space, F the set of events and P the probability measure. Before giving the general definitions, we construct a first example of probability space in the simple case of the die.

Step 1: Determine the set of possible outcomes Ω.

In probability, we are not interested in the exact protocol of the experiment; we focus on its result. The first thing we identify is the set of possible outcomes, called the sample space and denoted by Ω. For the throw of a die, we have

    Ω = {1,2,3,4,5,6}

Step 2: Give a mathematical sense to “the die is even”.

We want to evaluate the probability of certain “observations”, for example that the upper face of the die is even. These observations are formalized by the notion of event:

    An event is given by a subset A ⊂ Ω.



We write F for the set of events. Here, we have F = P(Ω), the set of all subsets of Ω.

Examples of events:

A = {2,4,6}, “the die is even”
B = {1,2,3}, “the die is smaller than or equal to 3”

Set operations on events:

The intersection corresponds to the logical operation and:

C ∶= A ∩ B = {2}. “the die is even and smaller than or equal to 3”

The union corresponds to the logical operation or:

D ∶= A ∪ B = {1,2,3,4,6}. “the die is even or smaller than or equal to 3”

The complement corresponds to the logical operation not:

E ∶= Ac = {1,3,5}. “the die is not even”
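These set operations translate directly into code; here is a minimal sketch in Python, with the names Omega, A, B, C, D, E mirroring the text:

```python
# A minimal sketch: the sample space and events of the die as Python sets.
Omega = {1, 2, 3, 4, 5, 6}

A = {2, 4, 6}    # "the die is even"
B = {1, 2, 3}    # "the die is smaller than or equal to 3"

C = A & B        # intersection: "even AND at most 3"
D = A | B        # union: "even OR at most 3"
E = Omega - A    # complement: "NOT even"

print(C)          # {2}
print(sorted(D))  # [1, 2, 3, 4, 6]
print(sorted(E))  # [1, 3, 5]
```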

Step 3: Give a mathematical sense to the probability P.

We want to associate to each event A a probability P[A] ∈ [0,1]. This number represents the “frequency” at which the event A occurs if one performs many experiments. Before giving the general formula, let us look at some examples.

If one performs many throws of the die, the event A = {1} (“the die is equal to 1”) will occur typically one time out of six. Hence, it is natural to define P[A] = 1/6. The event A = {1,2,3} will typically occur half of the time, and we define P[A] = 1/2 in this case. More generally, we use the following definition:

    The probability measure is the function P ∶ F → [0,1] defined by

∀A ∈ F P[A] = ∣A∣/6.

For example, the probability of the event A = {2,4,6} (“the die is even”) is

P[A] = ∣{2,4,6}∣/6 = 3/6 = 1/2.

Ü The equation above is the mathematical way to express Equation (1.1).

Notice that the probability measure P satisfies the following properties:

    • P[Ω] = 1,

    • P[A ∪B] = P[A] + P[B] if A and B are disjoint.

    Summary: the probabilistic model for the die

    Sample space Ω = {1,2,3,4,5,6}

Set of events F = P(Ω)

Probability measure P ∶ F → [0,1], A ↦ ∣A∣/6
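The summary above can be checked with a short sketch using exact fractions; the helper name P is our own:

```python
# A minimal sketch of the probabilistic model for the die.
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Probability measure of the fair die: P[A] = |A| / 6."""
    return Fraction(len(A), len(Omega))

A = {2, 4, 6}   # "the die is even"
B = {1, 3}      # an event disjoint from A
print(P(Omega))                    # 1
print(P(A))                        # 1/2
print(P(A | B) == P(A) + P(B))     # True: additivity for disjoint events
```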


    1.1 Sample spaces and events (textbook p.1-4)

    Sample space

We want to model a random experiment. The first mathematical object needed is the set of all possible outcomes of the experiment, denoted by Ω.

Definition 1.1. The set Ω is called the sample space. An element ω ∈ Ω is called an outcome (or elementary experiment).

    Example 1: Coin toss

    Ω = {head, tail} or Ω = {0,1}.

    Example 2: Throw of a die

    Ω = {1,2,3,4,5,6}.

    Example 3: Throw of three dice

Ω = {1,2,3,4,5,6}3. In this case an outcome ω ∈ Ω has three components ω = (ω1, ω2, ω3), and ωi represents the value of the i-th die.

Example 4: Throw of infinitely many dice

Ω = {1,2,3,4,5,6}N. In this case an outcome ω ∈ Ω has infinitely many components ω = (ω1, ω2, . . .), and ωi represents the value of the i-th die.

Example 5: Number of customers visiting a shop in one day

    Ω = N(= {0,1, . . .}).

    Example 6: Droplet of water falling on a square

    Ω = [0,1]2.


    Events

Our goal is to measure the probability that the outcome of the experiment satisfies a certain property. Mathematically, such a property can be represented by the set of all the outcomes that satisfy it. For example, for the throw of a die, the property “the upper face of the die is even” corresponds to the subset

    {2,4,6} ⊂ {1,2,3,4,5,6}.

A priori, an event could be any subset A ⊂ Ω. This works very well if the sample space Ω is finite or countable. But for more complicated (uncountable) sample spaces (e.g. [0,1]), one needs to impose some conditions on the events. This is due to the fact that later, we want to be able to estimate the probability of the events, and this is possible only if the events are sufficiently “nice”. For this course, this technicality is not important, but it explains why the following definition may look complicated at first sight.

Definition 1.2. We write F for the set of all the events. It is given by a subset of P(Ω) satisfying the following hypotheses:

    H1. Ω ∈ F

H2. A ∈ F ⇒ Ac ∈ F . “if A is an event, ‘not A’ is also an event”

H3. A1, A2, . . . ∈ F ⇒ ⋃∞i=1 Ai ∈ F . “if A1, A2, . . . are events, then ‘A1 or A2 or . . .’ is an event”

Remark 1.3. A set F ⊂ P(Ω) satisfying H1, H2 and H3 above is called a σ-algebra.

Example 1: Throw of a die

The event corresponding to “the upper face of the die is even” is mathematically defined by

    A = {2,4,6}.

Example 2: Droplet falling on a square

The event corresponding to “the droplet falls on the upper part of the square” is mathematically defined by

    A = [0,1] × [1/2,1].

In general, an event can be defined in two ways:

Ü one can list all the elements in the event, as in the two examples above.

Ü one can define it as the set of ω ∈ Ω satisfying a certain property. For example, for the throw of a die, the event “the upper face is smaller than or equal to 3” can be defined by

    A = {ω ∈ Ω ∶ ω ≤ 3}.


    Terminology

Let ω ∈ Ω (a possible outcome). Let A be an event. We say that the event A occurs (for ω) if ω ∈ A. We say that it does not occur if ω ∉ A.

Remark 1.4. The event A = ∅ never occurs: we never have ω ∈ ∅. The event A = Ω always occurs: we always have ω ∈ Ω.

    Operations on events and interpretation

Since events are defined as subsets of Ω, one can use operations from set theory (union, intersection, complement, symmetric difference, . . . ). From the definition, we know that we can take the complement of an event (by H2), or a countable union of events (by H3). The following proposition asserts that the other standard set operations are allowed.

    Proposition 1.5 (Consequences of the definition). We have

    i. ∅ ∈ F ,

ii. A1, A2, . . . ∈ F ⇒ ⋂∞i=1 Ai ∈ F ,

    iii. A,B ∈ F ⇒ A ∪B ∈ F ,

iv. A,B ∈ F ⇒ A ∩ B ∈ F .

Proof. We prove the items one after the other.

i. By H1 in Definition 1.2, we have Ω ∈ F . Hence, by H2 in Definition 1.2, we have

∅ = Ωc ∈ F .

ii. Let A1, A2, . . . ∈ F . By H2, we also have Ac1, Ac2, . . . ∈ F . Then, by H3, we have ⋃∞i=1(Ai)c ∈ F . Finally, using H2 again, we conclude that

⋂∞i=1 Ai = (⋃∞i=1(Ai)c)c ∈ F .


iii. Let A,B ∈ F . Define A1 = A, A2 = B, and for every i ≥ 3, Ai = ∅. By H3, we have

A ∪ B = ⋃∞i=1 Ai ∈ F .

iv. Let A,B ∈ F . By H2, Ac,Bc ∈ F . Then by iii. above, we have Ac ∪ Bc ∈ F . Finally, by H2, we deduce

A ∩ B = (Ac ∪ Bc)c ∈ F .

Let A,B ∈ F . “A and B are two events”

Event: probabilistic interpretation
Ac: A does not occur
A ∩ B: A and B both occur
A ∪ B: A or B occurs
A∆B: one and only one of A or B occurs


Relations between events and interpretations

Relation: probabilistic interpretation
A ⊂ B: if A occurs, then B occurs
A ∩ B = ∅: A and B cannot occur at the same time
Ω = A1 ∪ A2 ∪ A3 with A1, A2, A3 pairwise disjoint: for each outcome ω, one and only one of the events A1, A2, A3 is satisfied

    1.2 Mathematical definition of probability spaces

    Motivation: frequencies of events in random experiments

As previously mentioned, in the random experiments that we consider, if we look at the occurrence of a certain event, a stable relative frequency appears after repeating the experiment many times (for instance: when throwing a needle onto a parquet floor, “the needle intersects two laths”). Hence, we introduce

n = how many times the experiment is repeated,

and

nA = the number of occurrences of the event A.

We assume that n is very large (e.g. n = 1000000000), and we want to say that the probability of A corresponds to the proportion of times the event A occurs. Hence, we would like to define

P[A] ≃ nA/n.

In other words, we assign a number P[A] ∈ [0,1], “the probability of A”, to every event A. Let us assume that this is rigorous and let us look at the properties satisfied by P.

• By definition, one has nΩ = n, since the event A = Ω (sample space) always occurs. We will thus assume that

P[Ω] = 1.


• In the case when A = A1 ∪ A2 ∪ ⋯ ∪ Ak and the Ai are pairwise disjoint, one has

nA = nA1 + nA2 + ⋯ + nAk .

Dividing by n, we obtain

nA/n = nA1/n + nA2/n + ⋯ + nAk/n.

Hence, the probability should satisfy

P[A] = P[A1] + ⋯ + P[Ak], if A = A1 ∪ A2 ∪ ⋯ ∪ Ak (disjoint union).
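The frequency heuristic above can be illustrated by simulation; this is a minimal sketch, where the number of repetitions n and the event A are our own choices:

```python
# A simulation sketch of the frequency heuristic n_A / n ~ P[A].
import random

random.seed(0)
n = 100_000
A = {2, 4, 6}                    # "the die is even", P[A] = 1/2
n_A = sum(1 for _ in range(n) if random.randint(1, 6) in A)
freq = n_A / n                   # the relative frequency n_A / n
print(abs(freq - 0.5) < 0.01)    # True: the frequency stabilizes near 1/2
```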

    Probability measure and probability space

Definition 1.6. Let Ω be a sample space, let F be a set of events. A probability measure on (Ω,F) is a map

P ∶ F → [0,1], A ↦ P[A]

that satisfies the following two properties:

    P1. P[Ω] = 1.

    P2. (countable additivity) P[A] = ∑∞i=1 P[Ai] if A = ⋃∞i=1Ai (disjoint union).

    “A probability measure is a map that associates to each event a number in [0,1].”

    Definition 1.7. Let Ω be a sample space, F a set of events, and P a probability measure.The triple (Ω,F ,P) is called a probability space.

To summarize, if one wants to construct a probabilistic model, one gives

    • a sample space Ω, “all the possible outcomes of the experiment”

    • a set of events F ⊂ P(Ω), “the set of possible observations”

    • a probability measure P. “gives a number in [0,1] to every event”

    Example 1: Throw of three dice

• Ω = {1,2,3,4,5,6}3, an outcome is ω = (ω1, ω2, ω3), where ωi is the value of die i,

    • F = P(Ω), any subset A ⊂ Ω is an event

• for A ∈ F , P[A] = ∣A∣/∣Ω∣ (∣A∣ = number of elements in A).


Example 2: First time we see “tail” with a biased coin

We throw a biased coin multiple times; at each throw, the coin falls on head with probability p, and it falls on tail with probability 1 − p (p is a fixed parameter in [0,1]). We stop at the first time we see a tail. The probability that we stop exactly at time k is given by

pk = p^(k−1) (1 − p).

(Indeed, we stop at time k if we have seen exactly k − 1 heads followed by 1 tail.)

    For this experiment, one possible probability space is given by

• Ω = N ∖ {0} = {1,2,3, . . .},

    • F = P(Ω),

• for A ∈ F , P[A] = ∑k∈A pk.
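A quick sanity check of this model; the parameter value p = 0.3 and the truncation at k < 200 are our own choices for illustration:

```python
# Sanity check of the "first tail" model: the p_k sum to 1, and the
# probability of the event "we stop at an even time" matches the closed
# form p/(1+p) obtained from the geometric series.
p = 0.3  # probability of head at each throw (our choice)

def p_k(k):
    """Probability that the first tail appears exactly at throw k."""
    return p ** (k - 1) * (1 - p)

# The p_k should sum to 1 (the tail of the series beyond k = 199 is negligible):
total = sum(p_k(k) for k in range(1, 200))

# P[A] for the event A = "we stop at an even time k":
P_even = sum(p_k(k) for k in range(2, 200, 2))
print(abs(total - 1.0) < 1e-10)           # True
print(abs(P_even - p / (1 + p)) < 1e-10)  # True
```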

A tempting approach to define a probability measure is to first associate to every ω the probability pω that the output of the experiment is ω. Then, for an event A ⊂ Ω, define the probability of A by the formula

P[A] = ∑ω∈A pω. (1.2)

This approach works perfectly well when the sample space Ω is finite or countable (this is the case in the two examples above). But it does not work well if Ω is uncountable, for example in the case of the droplet of water in a square Ω = [0,1]2. In this case, the probability of landing on any fixed point is 0, and equation (1.2) does not make sense. It is for this reason that we use an axiomatic definition (Definition 1.6) of probability measure.

Example 3: Droplet of water on a square

    • Ω = [0,1]2,

• F = Borel σ-algebra¹,

• for A ∈ F , P(A) = Area(A).

¹The Borel σ-algebra F is defined as follows: it contains all A = [x1, x2] × [y1, y2], with 0 ≤ x1 ≤ x2 ≤ 1, 0 ≤ y1 ≤ y2 ≤ 1, and it is the smallest collection of subsets of Ω which satisfies H1, H2 and H3 in Definition 1.2.


    Direct consequences of the definition

    Proposition 1.8. Let P be a probability measure on (Ω,F).

i. We have P[∅] = 0.

    ii. (additivity) Let k ≥ 1, let A1, . . . ,Ak be k pairwise disjoint events, then

    P[A1 ∪⋯ ∪Ak] = P[A1] +⋯ + P[Ak].

iii. Let A be an event; then P[Ac] = 1 − P[A].

    iv. If A and B are two events (not necessarily disjoint), then

    P[A ∪B] = P[A] + P[B] − P[A ∩B].

Proof. We prove the items one after the other.

i. Define x = P[∅]. We already know that x ∈ [0,1], because x is the probability of some event. Defining A1 = A2 = ⋯ = ∅, we have

∅ = ⋃∞i=1 Ai.

The events Ai are disjoint, and countable additivity implies

∑∞i=1 P[Ai] = P[∅].

Since P[Ai] = x for every i and P[∅] ≤ 1, we have

∑∞i=1 x ≤ 1,

and therefore x = 0.

ii. Define Ak+1 = Ak+2 = ⋯ = ∅. In this way we have

A1 ∪ ⋯ ∪ Ak = A1 ∪ ⋯ ∪ Ak ∪ ∅ ∪ ∅ ∪ ⋯ = ⋃∞i=1 Ai.


Since the events Ai are pairwise disjoint, one can apply countable additivity as follows:

P[A1 ∪ ⋯ ∪ Ak] = P[⋃∞i=1 Ai] = ∑∞i=1 P[Ai] (countable additivity) = P[A1] + ⋯ + P[Ak] + ∑i>k P[Ai], where the last sum equals 0.

iii. By definition of the complement, we have Ω = A ∪ Ac, and therefore

1 = P[Ω] = P[A ∪ Ac].

    Since the two events A, Ac are disjoint, additivity finally gives

    1 = P[A] + P[Ac].

iv. A ∪ B is the disjoint union of A and B ∖ A. Hence, by additivity, we have

P[A ∪ B] = P[A] + P[B ∖ A]. (1.3)

Also B = (B ∩ A) ∪ (B ∩ Ac) = (B ∩ A) ∪ (B ∖ A). Hence, by additivity,

P[B] = P[B ∩ A] + P[B ∖ A],

which gives P[B ∖ A] = P[B] − P[A ∩ B]. Plugging this identity into Eq. (1.3), we obtain the result.
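The identities of Proposition 1.8 can be verified numerically on the die model; a small sketch (the helper name P is ours):

```python
# Numerical check of Proposition 1.8 (items i, iii, iv) on the die model.
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}

def P(E):
    """Laplace probability on the die: P[E] = |E| / 6."""
    return Fraction(len(E), len(Omega))

A, B = {2, 4, 6}, {1, 2, 3}
assert P(set()) == 0                          # i.   P[empty set] = 0
assert P(Omega - A) == 1 - P(A)               # iii. P[A complement] = 1 - P[A]
assert P(A | B) == P(A) + P(B) - P(A & B)     # iv.  inclusion-exclusion
print("identities i, iii, iv hold on the die model")
```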

    Useful Inequalities

    Proposition 1.9 (Monotonicity). Let A,B ∈ F , then

    A ⊂ B ⇒ P[A] ≤ P[B].

Proof. If A ⊂ B, then we have B = A ∪ (B ∖ A) (disjoint union). Hence, by additivity, we have

P[B] = P[A] + P[B ∖ A] ≥ P[A].


A simple application: Consider the “droplet of water falling on a square”. Consider the event En that the droplet falls at Euclidean distance at most 1/n from the point (1/2,1/2). One can see that En ⊂ [1/2 − 1/n, 1/2 + 1/n]2. Therefore,

P[En] ≤ P[[1/2 − 1/n, 1/2 + 1/n]2] = (2/n)2 = 4/n2,

which implies that P[En] converges to 0 as n tends to infinity.

Proposition 1.10 (Union bound). Let A1, A2, . . . be a sequence of events (not necessarily disjoint). Then we have

P[⋃∞i=1 Ai] ≤ ∑∞i=1 P[Ai].

    Remark 1.11. The union bound also applies to a finite collection of events.

Proof. For i ≥ 1, define

Ãi = Ai ∩ Aci−1 ∩ ⋯ ∩ Ac1.

One can check that

⋃∞i=1 Ai = ⋃∞i=1 Ãi.

(To prove the direct inclusion, consider ω in the left-hand side. Then define the smallest i such that ω ∈ Ai. For this i, we have ω ∈ Ãi, which implies that ω belongs to the right-hand side. The other inclusion is clear because Ãi ⊂ Ai for every i.) Now, one can apply countable additivity to the Ãi, because they are disjoint. We get

P[⋃∞i=1 Ai] = P[⋃∞i=1 Ãi] = ∑∞i=1 P[Ãi] ≤ ∑∞i=1 P[Ai].

    Application: We throw a die n times. We consider the probability space given by

• Ω = {1,2,3,4,5,6}n, an outcome is ω = (ω1, . . . , ωn), where ωi is the value of die i,

    • F = P(Ω),

• for A ∈ F , P[A] = ∣A∣/∣Ω∣.


We want to prove that the probability to see more than ℓ ∶= 7 log n successive 1’s is small if n is large. Consider the event A that there exist ℓ successive 1’s. For every 0 ≤ k ≤ n − ℓ, define the event

Ak = {ω ∶ ωk+1 = ωk+2 = ⋯ = ωk+ℓ = 1} “ℓ successive 1’s between k + 1 and k + ℓ”.

Then, we have

A = A0 ∪ A1 ∪ ⋯ ∪ An−ℓ.

Using the union bound, we have

P[A] ≤ P[A0] + ⋯ + P[An−ℓ] ≤ n ⋅ (1/6)^ℓ = n ⋅ n^(−7 log 6),

and therefore the probability of seeing more than 7 log n consecutive 1’s converges to 0 as n tends to infinity.
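A simulation sketch of the bound above; the values n = 2000 and 200 trials are our own choices (log denotes the natural logarithm, as in the text):

```python
# Simulation: runs of ell = ceil(7 log n) successive 1's essentially never
# appear in n throws of a fair die.
import math
import random

random.seed(1)
n = 2000
ell = math.ceil(7 * math.log(n))          # run-length threshold, 54 here
trials = 200
count = 0
for _ in range(trials):
    omega = [random.randint(1, 6) for _ in range(n)]
    run = best = 0
    for x in omega:
        run = run + 1 if x == 1 else 0    # length of the current run of 1's
        best = max(best, run)
    if best >= ell:
        count += 1
print(count)   # 0: such a long run is astronomically unlikely
```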

    Continuity properties of probability measures

Proposition 1.12. Let (An) be an increasing sequence of events (i.e. An ⊂ An+1 for every n). Then

lim_{n→∞} P[An] = P[⋃∞n=1 An]. "increasing limit"

Let (Bn) be a decreasing sequence of events (i.e. Bn ⊃ Bn+1 for every n). Then

lim_{n→∞} P[Bn] = P[⋂∞n=1 Bn]. "decreasing limit"


Remark 1.13. By monotonicity, we have P[An] ≤ P[An+1] and P[Bn] ≥ P[Bn+1] for every n. Hence the limits in the proposition are well defined as monotone limits.

Proof. Let (An)n≥1 be an increasing sequence of events. Define Ã1 = A1 and, for every n ≥ 2,

Ãn = An ∖ An−1.

The events Ãn are disjoint and satisfy

⋃∞n=1 An = ⋃∞n=1 Ãn and AN = ⋃Nn=1 Ãn.


Using first countable additivity and then additivity, we have

P[⋃∞n=1 An] = P[⋃∞n=1 Ãn] = ∑∞n=1 P[Ãn] = lim_{N→∞} ∑Nn=1 P[Ãn] = lim_{N→∞} P[AN].

Now, let (Bn) be a decreasing sequence of events. Then (Bcn) is increasing, and we can apply the previous result in the following way:

P[⋂∞n=1 Bn] = 1 − P[⋃∞n=1 Bcn] = 1 − lim_{n→∞} P[Bcn] = lim_{n→∞} P[Bn].

1.3 Laplace models and counting (textbook p. 21–26)

We now discuss a particular type of probability space that appears in many concrete examples. The sample space Ω is an arbitrary finite set, and all the outcomes have the same probability pω = 1/∣Ω∣.

Definition 1.14. Let Ω be a finite sample space. The Laplace model on Ω is the triple (Ω,F ,P), where

    • F = P(Ω),

• P ∶ F → [0,1] is defined by

∀A ∈ F P[A] = ∣A∣/∣Ω∣.

One can easily check that the mapping P above defines a probability measure in the sense of Definition 1.6. In this context, estimating the probability P[A] boils down to counting the number of elements in A and in Ω.

Example:

We consider n ≥ 3 points on a circle, from which we select 2 at random. What is the probability that the two selected points are neighbors? We consider the Laplace model on

    Ω = {E ⊂ {1,2,⋯, n} ∶ ∣E∣ = 2}.


Figure 1.1: A circle with n = 6 points, and the subset {1,3} is selected.

    The event “the two points of E are neighbors” is given by

    A = {{1,2},{2,3},⋯,{n − 1, n},{n,1}},

and we have

P[A] = ∣A∣/∣Ω∣ = n / (n(n − 1)/2) = 2/(n − 1).
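This counting can be reproduced by brute force; a short sketch using itertools to enumerate Ω, with n = 6 as our own choice:

```python
# Brute-force check of the neighbors-on-a-circle example.
from fractions import Fraction
from itertools import combinations

n = 6
Omega = list(combinations(range(1, n + 1), 2))     # all 2-element subsets

def are_neighbors(pair):
    i, j = pair                                    # combinations gives i < j
    return j - i == 1 or (i, j) == (1, n)          # adjacent on the circle

A = [E for E in Omega if are_neighbors(E)]
print(len(Omega), len(A))                          # 15 6
print(Fraction(len(A), len(Omega)))                # 2/5, i.e. 2/(n-1) for n = 6
```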

1.4 Random variables and distribution functions (textbook p. 9–12)

Most often, the probabilistic model under consideration is rather complicated, and one is only interested in certain quantities in the model. For this reason, one introduces the notion of random variables.

Definition 1.15. Let (Ω,F ,P) be a probability space. A random variable (r.v.) is a map X ∶ Ω → R such that for all a ∈ R,

    {ω ∈ Ω ∶ X(ω) ≤ a} ∈ F .

Ü The condition {ω ∈ Ω ∶ X(ω) ≤ a} ∈ F is needed for P[{ω ∈ Ω ∶ X(ω) ≤ a}] to be well-defined.

Notation: When events are defined in terms of a random variable, we will omit the dependence on ω. For example, for a ≤ b we write

{X ≤ a} = {ω ∈ Ω ∶ X(ω) ≤ a}, {a < X ≤ b} = {ω ∈ Ω ∶ a < X(ω) ≤ b}.


Example 1: Gambling with one die

We throw a fair die. The sample space is Ω = {1,2,3,4,5,6} and the associated probability space (Ω,F ,P) is defined earlier in this chapter. Suppose that we gamble on the outcome in such a way that our profit is

−1 if the outcome is 1, 2 or 3,
0 if the outcome is 4, (1.4)
2 if the outcome is 5 or 6,

where a negative profit corresponds to a loss. Our profit can be represented by the random variable X defined by

∀ω ∈ Ω X(ω) = −1 if ω = 1,2,3; 0 if ω = 4; 2 if ω = 5,6. (1.5)
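The probabilistic behaviour of X can be computed directly from the model; a minimal sketch:

```python
# The profit random variable X of Eq. (1.5) and its probability P[X <= a],
# computed exactly on the uniform die.
from fractions import Fraction

X = {1: -1, 2: -1, 3: -1, 4: 0, 5: 2, 6: 2}   # X(omega) for each face omega

def F_X(a):
    """P[X <= a] under the uniform measure on {1, ..., 6}."""
    return Fraction(sum(1 for w in X if X[w] <= a), 6)

print(F_X(-1))   # 1/2
print(F_X(0))    # 2/3
print(F_X(2))    # 1
```

Note that the function a ↦ P[X ≤ a] jumps by 1/2 at a = −1, which is exactly P[X = −1].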

Example 2: First coordinate of a random point

As in Example 3 above, we define

    • Ω = [0,1]2,

    • F = Borel σ-algebra,

    • for A ∈ F , P(A) = Area(A).

Ü The map which returns the x-coordinate, defined by

Y ∶ Ω → R, (ω1, ω2) ↦ ω1,

is a random variable. Indeed, for every a ∈ R, we have

{Y ≤ a} = {(ω1, ω2) ∈ Ω ∶ ω1 ≤ a} = ∅ if a < 0; [0, a] × [0,1] if 0 ≤ a ≤ 1; Ω if a > 1,

and ∅, [0, a] × [0,1] and Ω are all elements of F .

Example 3: Area of the lower-left part of a random point

We consider the same probability space as in Example 2 above (Ω = [0,1]2). For ω ∈ Ω, let R(ω) = [0, ω1] × [0, ω2] (the rectangle to the left of and below ω).

Ü The map which returns the area of R(ω), defined by

Z ∶ Ω → R, (ω1, ω2) ↦ ω1ω2, (1.6)


is a random variable. To see this, notice that for every a ∈ R, we have

{Z ≤ a} = {ω ∈ Ω ∶ Z(ω) ≤ a} = ∅ if a < 0; Ea if 0 ≤ a ≤ 1; Ω if a > 1,

where Ea = {(x, y) ∈ [0,1]2 ∶ xy ≤ a} is represented below.

Figure 1.2: The region Ea.

    Figure 1.2: The set Ea corresponds to the region Ea.

The sets ∅, Ω and Ea are elements of F for every a. As an aside, we now give the formal proof that Ea is in F , but this proof is not important for applications and can be skipped on a first reading.

One just needs to check that Ea is in F if 0 < a ≤ 1 (the case a = 0 can easily be treated independently). Let us first prove that

Eca = ⋃q∈Q, 0<q≤1 (q,1] × (a/q,1].

⊂ Assume that ω ∉ Ea. Since ω1ω2 > a, one can pick a rational number q in the interval (a/ω2, ω1). For such a choice, we have ω ∈ (q,1] × (a/q,1].

⊃ Assume that ω ∈ (q,1] × (a/q,1] for some rational 0 < q ≤ 1. Then ω1ω2 > q ⋅ a/q = a, and therefore ω ∉ Ea.

Now for every q, the rectangle (q,1] × (a/q,1] is in F . Therefore, Eca is also in F , since it is a countable union of elements of F . Taking the complement, this shows that Ea is an event.

    Example 4: Indicator function of an event

Let A ∈ F . Consider the indicator function 1A of A, defined by

∀ω ∈ Ω 1A(ω) = 0 if ω ∉ A; 1 if ω ∈ A.


Then 1A is a random variable. Indeed, we have

{1A ≤ a} = ∅ if a < 0; Ac if 0 ≤ a < 1; Ω if a ≥ 1,

and ∅, Ac and Ω are three elements of F .

Definition 1.16. Let X be a random variable on a probability space (Ω,F ,P). The distribution function of X is the function FX ∶ R → [0,1] defined by

∀a ∈ R FX(a) = P[X ≤ a].

Proposition 1.17 (Basic identity). Let a < b be two real numbers. Then

P[a < X ≤ b] = FX(b) − FX(a).


Figure 1.3: Graph of the distribution function FX .

Hence,

FY (a) = P[Y ≤ a] = 0 if a < 0; a if 0 ≤ a ≤ 1; 1 if a > 1.

Figure 1.4: Graph of the distribution function FY .

Example 3: Area of the lower-left part of a random point

Figure 1.5: Graph of the distribution function FZ .

Consider the random variable Z defined by Eq. (1.6). For 0 ≤ a ≤ 1, the event

{Z ≤ a} = {(ω1, ω2) ∈ [0,1]2 ∶ ω1ω2 ≤ a} = Ea

was represented in Fig. 1.2, and its probability is equal to its area, given by

Area(Ea) = a + ∫_a^1 (a/x) dx = a + a log(1/a)


(with the convention 0 log(1/0) = 0). The event {Z ≤ a} is empty if a < 0, and it is equal to Ω if a > 1. Therefore

FZ(a) = P[Z ≤ a] = 0 if a < 0; a + a log(1/a) if 0 ≤ a ≤ 1; 1 if a > 1.

    Example 4: Indicator function of an event

Let A be an event. Let X = 1A be the indicator function of the event A. Then

FX(a) = 0 if a < 0; 1 − P[A] if 0 ≤ a < 1; 1 if a ≥ 1.

Theorem 1.18 (Properties of distribution functions). Let X be a random variable on some probability space (Ω,F ,P). The distribution function F = FX ∶ R → [0,1] of X satisfies the following properties.

    (i) F is nondecreasing.

(ii) F is right continuous².

(iii) lim_{a→−∞} F (a) = 0 and lim_{a→+∞} F (a) = 1.

    Proof. We first prove (i), then (iii) and finally (ii).

(i) For a ≤ b, we have {X ≤ a} ⊂ {X ≤ b}. Hence, by monotonicity, we have P[X ≤ a] ≤ P[X ≤ b], i.e.

F (a) ≤ F (b).

(iii) Let an ↑ ∞. For every ω ∈ Ω, there exists n large enough such that X(ω) ≤ an. Hence,

Ω = ⋃n≥1 {X ≤ an}.

Furthermore, we have {X ≤ an} ⊂ {X ≤ an+1}, and the continuity properties of probability measures imply

1 = P[Ω] = P[⋃n≥1 {X ≤ an}] = lim_{n→∞} P[X ≤ an] = lim_{n→∞} F (an).

²i.e. F (a) = lim_{h↓0} F (a + h) for every a ∈ R.


In the same way, one has lim_{a→−∞} F (a) = 0. Indeed, using that for every an ↓ −∞,

∅ = ⋂n≥1 {X ≤ an}

and {X ≤ an} ⊃ {X ≤ an+1}, it follows from the continuity properties of probability measures that

0 = P[∅] = lim_{n→∞} P[X ≤ an] = lim_{n→∞} F (an).

(ii) Let a ∈ R, let hn ↓ 0. We have

{X ≤ a} = ⋂n≥1 {X ≤ a + hn},

where {X ≤ a + hn} ⊃ {X ≤ a + hn+1}. Hence, by the continuity properties of probability measures, we have

F (a) = P[X ≤ a] = P[⋂n≥1 {X ≤ a + hn}] = lim_{n→∞} P[X ≤ a + hn] = lim_{n→∞} F (a + hn).

Properties (i), (ii) and (iii) actually characterize distribution functions, in the following sense:

Theorem 1.19. Let F ∶ R → [0,1] satisfy (i), (ii) and (iii). Then there exists a probability space (Ω,F ,P) and a random variable X ∶ Ω → R such that F = FX .

    Proof. Admitted.

Theorem 1.19 above plays an important role in probability theory. Given a nondecreasing, right-continuous function F ∶ R → [0,1] with lim−∞ F = 0 and lim+∞ F = 1, one can always consider a random variable X on some probability space (Ω,F ,P) such that F = FX . This is very useful because it allows one to define a random variable via its distribution function. As soon as we have a function F as above, we have the right to say:

    “Let X be a random variable on some probability space (Ω,F ,P) with distri-bution function F .”

    The precise choice of the probability space is often not important and we simply say:

    “Let X be a r.v. with distribution function F .”


Discontinuity/continuity points of F We have seen that the distribution function F = FX of a random variable X is always right continuous. What about left continuity? In Example 1 (gambling on a die), we can see that it is not always left continuous, and the following proposition gives an interpretation of the left limit

F (a−) = lim_{h↓0} F (a − h)

    at a given point a for a general distribution function.

Proposition 1.20 (probability of a given value). Let X ∶ Ω → R be a random variable with distribution function F . Then for every a in R we have

P[X = a] = F (a) − F (a−).

We omit the proof, which can easily be obtained using the basic identity (Proposition 1.17) together with the continuity properties of probability measures. We rather insist on the interpretation of this proposition. Fix a ∈ R.

Ü If F is not continuous at the point a, then the “jump size” F (a) − F (a−) is equal to the probability that X = a.

Ü If F is continuous at the point a, then P[X = a] = 0. This is what happens at every point a in Examples 2 and 3.

    Discrete and continuous random variables

    Two different types of random variables will play an important role in this class:

    • the discrete random variables,

    • the continuous random variables.

We give the definitions here; these two classes of random variables will be studied in more detail later.

Definition 1.21 (Discrete random variables). A random variable X ∶ Ω → R is said to be discrete if its image

X(Ω) = {x ∈ R ∶ ∃ω ∈ Ω, X(ω) = x}

is at most countable.

In other words, a random variable is said to be discrete if it only takes finitely or countably many values x1, x2, . . .. In this case the probabilistic properties of X are fully described by the numbers

    p(x1) = P[X = x1], p(x2) = P[X = x2], . . .

Example: The random variable of Example 1 (defined by Eq. (1.5), page 18) is discrete: it takes only three possible values, x1 = −1, x2 = 0 and x3 = 2.


Definition 1.22 (Continuous random variables). A random variable X ∶ Ω → R is said to be continuous if its distribution function FX can be written as

FX(a) = ∫_{−∞}^{a} f(x) dx for all a in R (1.7)

for some nonnegative function f ∶ R → R+, called the density of X.

To understand the terminology “continuous”, observe that formula (1.7) implies that FX is a continuous function. In particular, by Proposition 1.20, the r.v. X satisfies P[X = x] = 0 for every fixed x. The probability that X is exactly equal to x is 0, but X may well be very close to x, and f(x)dx represents the probability that X takes a value in the infinitesimal interval [x, x + dx].

Example: The random variables Y and Z of Examples 2 and 3 are continuous.

1.5 Conditional probabilities (textbook p. 12–19)

Consider a random experiment represented by some probability space (Ω,F ,P). We may sometimes possess partial information about the actual outcome of the experiment without knowing this outcome exactly. For example, if we throw a die and a friend tells us that an even number is showing, then this information affects all our calculations of probabilities. In general, if A and B are two events and we are given that B occurs, then the new probability of A may no longer be P[A]. In this new circumstance, we know that A occurs if and only if A ∩ B occurs, suggesting that the new probability of A should be proportional to P[A ∩ B].

Definition 1.23 (Conditional probability). Let (Ω,F ,P) be some probability space. Let A, B be two events with P[B] > 0. The conditional probability of A given B is defined by

P[A ∣ B] = P[A ∩ B] / P[B].

Remark 1.24. P[B ∣ B] = 1. Conditioned on B, the event B always occurs.

Example: We consider the probability space (Ω,F ,P) corresponding to the throw of one die. Let A = {1,2,3} be the event that the die is smaller than or equal to 3, and let B = {2,4,6} be the event that the die is even. Then

P[A ∣ B] = P[A ∩ B] / P[B] = (1/6) / (1/2) = 1/3.
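For finite Laplace models, the definition can be checked by direct enumeration. A small Python sketch (illustrative only) reproduces the computation above:

```python
from fractions import Fraction

omega = set(range(1, 7))                  # Laplace model for one die
P = lambda E: Fraction(len(E), len(omega))

A = {1, 2, 3}                             # die is smaller than or equal to 3
B = {2, 4, 6}                             # die is even

P_A_given_B = P(A & B) / P(B)             # = (1/6) / (1/2) = 1/3
```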

    Proposition 1.25. Let (Ω,F ,P) be some probability space. Let B be an event withpositive probability. Then P[ . ∣B] is a probability measure on Ω.


Proposition 1.26 (Formula of total probability). Let B1, . . . ,Bn be a partition³ of the sample space Ω with P[Bi] > 0 for every 1 ≤ i ≤ n. Then one has

∀A ∈ F   P[A] = ∑_{i=1}^{n} P[A ∣ Bi] P[Bi].

    Proof. Using the distributivity of the intersection, we have

    A = A ∩Ω = A ∩ (B1 ∪⋯ ∪Bn) = (A ∩B1) ∪⋯ ∪ (A ∩Bn).

    Since the events A ∩Bi are pairwise disjoint, we have

    P[A] = P[A ∩B1] +⋯ + P[A ∩Bn].

By definition, we have P[A ∩ Bi] = P[A ∣ Bi] P[Bi] for every i, and using this expression in the equation above, we finally get

    P[A] = P[A ∣B1] P[B1] +⋯ + P[A ∣Bn] P[Bn].

The formula above is very useful when we have a random variable X and we want to distinguish between the different values of X. More precisely, assume that X is a random variable taking the possible values x1, . . . , xn with P[X = xi] > 0 for every i. Then we have a natural partition Ω = {X = x1} ∪ ⋯ ∪ {X = xn}, and the formula of total probability gives

    ∀A ∈ F P[A] = P[A ∣X = x1] P[X = x1] +⋯ + P[A ∣X = xn] P[X = xn].

Example: A sender emits 0–1 bits. Due to noise in the communication channel, each bit is wrongly transmitted with probability ε and correctly with probability 1 − ε. The sender emits a 0 with probability p0 and a 1 with probability p1 (= 1 − p0): what is the probability to receive a 1? Consider the random variables:

X = value of the bit sent,
Y = value of the bit received.

    The formula of total probability implies that the probability to receive 1 is

P[Y = 1] = P[Y = 1 ∣ X = 1] P[X = 1] + P[Y = 1 ∣ X = 0] P[X = 0] = (1 − ε) p1 + ε p0.
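In Python, the total-probability computation for this channel is a one-liner; the values eps = 0.1 and p0 = 0.7 below are hypothetical, chosen only to make the sketch concrete:

```python
eps = 0.1                      # hypothetical channel error probability
p0 = 0.7                       # hypothetical probability of sending a 0
p1 = 1 - p0

# P[Y=1] = P[Y=1 | X=1] P[X=1] + P[Y=1 | X=0] P[X=0]
p_receive_1 = (1 - eps) * p1 + eps * p0
```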

One could also consider the converse problem in the previous example: “if one receives a 1, what is the probability that a 1 was sent?” This is the a posteriori probability of the source bit being 1 given the received bit, i.e. P[X = 1 ∣ Y = 1]. This leads us to the next formula.

    3i.e. Ω = B1 ∪⋯ ∪Bn and the events are pairwise disjoint.


    Proposition 1.27 (Bayes formula). Let B1, . . . ,Bn ∈ F be a partition of Ω with P[Bi] > 0for every i. For every event A with P[A] > 0, we have

∀i = 1, . . . , n   P[Bi ∣ A] = P[A ∣ Bi] P[Bi] / ∑_{j=1}^{n} P[A ∣ Bj] P[Bj].

Typical application: A test is performed in order to diagnose a certain rare disease, which concerns 1/10000 of a population. The test is quite reliable and gives the right answer 99 percent of the time. If a patient has a “positive test” (i.e. the test indicates that he is sick), what is the probability that he is actually sick?

    The situation is modeled by setting

    Ω = {0,1} × {0,1}.

    and F = P(Ω). An outcome is a pair ω = (ω1, ω2) representing a patient, where

ω1 = 0 if the patient is healthy, ω1 = 1 if the patient is sick,
ω2 = 0 if the test is negative, ω2 = 1 if the test is positive.

We consider the random variables S and T defined by

S(ω) = ω1 and T (ω) = ω2.

With this notation, {S = 1} is the event that the patient is sick, and {T = 1} is the event that the test is positive. From the hypotheses, the information that we have on the probability measure is

P[S = 1] = 1/10000,   P[T = 1 ∣ S = 1] = 99/100,   P[T = 1 ∣ S = 0] = 1/100.

We are looking for the a posteriori probability of being sick given that the test is positive, i.e.

P[S = 1 ∣ T = 1] = P[T = 1 ∣ S = 1] P[S = 1] / ( P[T = 1 ∣ S = 1] P[S = 1] + P[T = 1 ∣ S = 0] P[S = 0] )
= (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999) ≃ 0.0098.
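The same computation in Python (an illustrative sketch of the Bayes formula with the numbers above):

```python
p_sick = 1 / 10000           # prevalence P[S = 1]
p_pos_sick = 99 / 100        # P[T = 1 | S = 1]
p_pos_healthy = 1 / 100      # P[T = 1 | S = 0]

# Bayes formula for P[S = 1 | T = 1]
num = p_pos_sick * p_sick
den = num + p_pos_healthy * (1 - p_sick)
p_sick_given_pos = num / den          # roughly 0.0098
```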

This result is quite surprising: the probability of actually being sick when the test is positive is very small! What is happening? If one looks at the whole population, there are two types of persons who will have a positive test:

- the healthy individuals with a (wrongly) positive test, who represent roughly one percent of the population,


- the sick individuals with a (correctly) positive test, who represent roughly 1/10000 of the population.

Given that the test is positive, a person is therefore much more likely to belong to the first group.

    1.6 Independence

    Independence of events

Definition 1.28 (Independence of two events). Let (Ω,F ,P) be a probability space. Two events A and B are said to be independent if

    P[A ∩B] = P[A] P[B].

    Remark 1.29. If P[A] ∈ {0,1}, then A is independent of every event, i.e.

    ∀B ∈ F P[A ∩B] = P[A] P[B].

If an event A is independent of itself (i.e. P[A ∩ A] = P[A]²), then P[A] ∈ {0,1}.

A is independent of B if and only if A is independent of B^c.

The concept of independence is fundamental in probability: it corresponds to the intuitive idea that two events do not influence each other, as illustrated in the following proposition.

    Proposition 1.30. Let A, B ∈ F be two events with P[A],P[B] > 0. Then the followingare equivalent:

(i) P[A ∩ B] = P[A] P[B] (A and B are independent),

(ii) P[A ∣ B] = P[A] (the occurrence of B has no influence on A),

(iii) P[B ∣ A] = P[B] (the occurrence of A has no influence on B).

    Proof. Since P[B] > 0 we have

(i) ⇔ ( P[A ∩ B]/P[B] = P[A] ) ⇔ ( P[A ∣ B] = P[A] ) ⇔ (ii).

Since (iii) is just (ii) with the roles of A and B reversed, the equivalence (i) ⇔ (iii) can be proved in the same way.

Typical examples of independent events occur when one performs a random experiment several times in succession, as illustrated below with the throw of two dice.

    Example: Throw of two independent dice.


We throw two dice independently. This is modeled by the Laplace model on the sample space

Ω = {1,2,3,4,5,6}².

Let X and Y be the two random variables defined by X(ω) = ω1 and Y (ω) = ω2 (X and Y represent the values of the first and the second die, respectively). Consider the events

A = {X ∈ 2Z}   (“the first die is even”),
B = {1 + Y ∈ 2Z}   (“the second die is odd”),
C = {X + Y ≤ 3}   (“the sum of the two dice is at most 3”),
D = {X ≤ 2, Y ≤ 2}   (“both dice are smaller than or equal to 2”).

    Check that:

    • A and B are independent,

    • A and C are not independent,

    • A and D are independent.
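These three checks can be carried out by brute-force enumeration of the 36 outcomes; the Python sketch below (illustrative only) does exactly that:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # Laplace model on {1,...,6}^2
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] % 2 == 0}            # first die even
B = {w for w in omega if (1 + w[1]) % 2 == 0}      # second die odd
C = {w for w in omega if w[0] + w[1] <= 3}         # sum at most 3
D = {w for w in omega if w[0] <= 2 and w[1] <= 2}  # both dice at most 2

indep = lambda E1, E2: P(E1 & E2) == P(E1) * P(E2)

checks = (indep(A, B), indep(A, C), indep(A, D))   # (True, False, True)
```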

Definition 1.31. We say that n events A1, . . . ,An are independent if

∀J ⊂ {1, . . . , n}   P[ ⋂_{j∈J} Aj ] = ∏_{j∈J} P[Aj].

Remark: Three events A, B and C are independent if the following 4 equations are satisfied (and not only the last one!):

P[A ∩ B] = P[A] P[B],
P[A ∩ C] = P[A] P[C],
P[B ∩ C] = P[B] P[C],
P[A ∩ B ∩ C] = P[A] P[B] P[C].

Example: We use the same notation as in the example above Definition 1.31. The events A, B and D are independent (check it!).

    Independence of random variables

Definition 1.32. Let X1, . . . ,Xn be n random variables on some probability space (Ω,F ,P). We say that X1, . . . ,Xn are independent if

∀a1, a2, . . . , an ∈ R   P[X1 ≤ a1, . . . ,Xn ≤ an] = P[X1 ≤ a1]⋯P[Xn ≤ an]. (1.8)

Definition 1.33. Let X1,X2, . . . be an infinite sequence of random variables. We say that X1,X2, . . . are independent if X1, . . . ,Xn are independent for every n.


Example 1: Throw of n dice

Consider the sample space Ω = {1, . . . ,6}^n equipped with the Laplace model (F ,P). Define the random variables

Xi ∶ Ω → R, (ω1, . . . , ωn) ↦ ωi.

(Xi represents the value of the i-th die.) Then the random variables X1, . . . ,Xn are independent. To prove this, it suffices to prove Equation (1.8) for a1, . . . , an ∈ {1,2,3,4,5,6}. For such numbers, we have

P[X1 ≤ a1, . . . ,Xn ≤ an] = P[{1, . . . , a1} × ⋯ × {1, . . . , an}]
= ∣{1, . . . , a1} × ⋯ × {1, . . . , an}∣ / ∣Ω∣
= (a1/6)⋯(an/6)
= P[X1 ≤ a1]⋯P[Xn ≤ an].

Example 2: The coordinates of a random point in [0,1]²

Consider the probability space (Ω = [0,1]², F, P) for the random point in a 1-by-1 square. Let X and Y be the two random variables defined by

X((ω1, ω2)) = ω1 and Y ((ω1, ω2)) = ω2,

which correspond to the x-coordinate and the y-coordinate of the random point. For every a, b ∈ [0,1] we have

P[X ≤ a, Y ≤ b] = P[ [0, a] × [0, b] ]
= Area([0, a] × [0, b])
= a ⋅ b
= Area([0, a] × [0,1]) ⋅ Area([0,1] × [0, b])
= P[X ≤ a] P[Y ≤ b].

The equality P[X ≤ a, Y ≤ b] = P[X ≤ a] P[Y ≤ b] also holds if a ∉ [0,1] or b ∉ [0,1], because at least one of the two events is then ∅ or Ω. Hence the two random variables X and Y are independent.

Theorem 1.34. Let F1, . . . , Fn be n distribution functions⁴. Then there exist a probability space (Ω,F ,P) and n random variables X1, . . . ,Xn on this probability space such that

    • for every i Xi has distribution function Fi (i.e. ∀a P[Xi ≤ a] = Fi(a)), and

    • X1, . . . ,Xn are independent.

Proof. Admitted.

⁴ i.e. for every i, Fi ∶ R → [0,1] satisfies (i), (ii) and (iii) in Theorem 1.18.


The theorem above is important in the theory because it allows us to work with random variables directly, without defining the probability space (Ω,F ,P) precisely. For example, if F and G are two given distribution functions, it allows us to write: “Let X, Y be two independent random variables with distribution functions F and G respectively.”

Definition 1.35. Some random variables X1,X2, . . . are said to be independent and identically distributed (iid) if they are independent and have the same distribution function, i.e.

∀i, j   FXi = FXj.

(The sequence may be finite or infinite.)

  • Chapter 2

    Discrete distributions

2.1 Preliminaries: infinite sums

Let E be finite or countable. We would like to give a sense to

∑_{x∈E} ax, (2.1)

where (ax)x∈E is a sequence of real numbers.

If E is finite, the sum (2.1) is just a finite sum and is always well defined.

If E is countably infinite, a natural approach is to “approximate” E by finite sets and define the sum as a limit. We write Fn ↑ E if

• for every n, Fn ⊂ E and Fn ⊂ Fn+1,

• E = ⋃_{n∈N} Fn.

The existence of such a sequence (Fn) is guaranteed by the fact that E is countable. For example, if E = Z, we can choose Fn = {−n, . . . , n}. For every n, the finite sum

∑_{x∈Fn} ax

always makes sense. To define the infinite sum (2.1), it is natural to try to take the limit of the finite sums as n tends to infinity:

lim_{n→∞} ( ∑_{x∈Fn} ax ). (2.2)

One problem is that the limit above is not always well defined. For example, if E = Z, Fn = {−n, . . . , n} and ax = (−1)^∣x∣, we have ∑_{x∈Fn} ax = (−1)^n and therefore the limit in (2.2) does not exist. Another problem is that the limit may depend on the chosen sequence (Fn). For example, as above with E = Z and ax = (−1)^∣x∣, the limit exists if one chooses Fn = {−2n, . . . ,2n} or Fn = {−2n − 1, . . . ,2n + 1}, but the two limits do not coincide (it is +1 in the first case and −1 in the second case).


  • CHAPTER 2. DISCRETE DISTRIBUTIONS 33

Definition 2.1 (Sum of nonnegative numbers). Let (ax)x∈E be a sequence of nonnegative numbers (i.e. ax ≥ 0 for all x). The sum of the ax is defined by

∑_{x∈E} ax ∶= sup_{F⊂E, F finite} ∑_{x∈F} ax.

    Remark 2.2.

• The sum can be infinite, for example if E is infinite and ax = 1 for every x.

• For every sequence Fn ↑ E, one can check that the limit in (2.2) makes sense and

lim_{n→∞} ( ∑_{x∈Fn} ax ) = ∑_{x∈E} ax.

    “The sum of nonnegative numbers is always well defined, it can be finite or infinite”

When E = N, we can use the notation

∑_{x∈N} ax = ∑_{x=0}^{∞} ax,

and the sum corresponds to the increasing limit

∑_{x∈N} ax = lim_{n→∞} ( ∑_{x=0}^{n} ax ).

Definition 2.3 (Sum of an integrable sequence). A sequence (ax)x∈E of real numbers is said to be integrable if

∑_{x∈E} ∣ax∣ < ∞.

In this case, for every sequence Fn ↑ E, the limit (2.2) exists and does not depend on the choice of (Fn); it defines the sum ∑_{x∈E} ax.


We finish this section with the statement of Fubini’s theorem, which allows one to permute infinite sums. The take-home message is that we can always permute the sums when all the terms are nonnegative. If the terms are not all nonnegative, one needs an integrability condition, which is obtained by summing the absolute values of the sequence.

Theorem 2.5 (Fubini’s theorem for nonnegative sequences). Let E, F be two finite or countable sets. Let (ux,y)(x,y)∈E×F be a family of nonnegative numbers. Then

∑_{(x,y)∈E×F} ux,y = ∑_{x∈E} ( ∑_{y∈F} ux,y ) = ∑_{y∈F} ( ∑_{x∈E} ux,y ).

Theorem 2.6 (Fubini’s theorem for integrable sequences). Let E, F be two finite or countable sets. Let (ux,y)(x,y)∈E×F be a family of real numbers. Assume that

∑_{x∈E} ( ∑_{y∈F} ∣ux,y∣ ) < ∞.

Then the sums below are well defined and

∑_{(x,y)∈E×F} ux,y = ∑_{x∈E} ( ∑_{y∈F} ux,y ) = ∑_{y∈F} ( ∑_{x∈E} ux,y ).


Proof. The random variable X takes values in E, hence

⋃_{x∈E} {X = x} = Ω.

Since the union is disjoint and the set E is at most countable, we have

∑_{x∈E} P[X = x] = P[ ⋃_{x∈E} {X = x} ] = P[Ω] = 1.

Remark 2.10. Conversely, if we are given a sequence of numbers (px) with values in [0,1] such that

∑_{x∈E} px = 1, (2.3)

then there exist a probability space (Ω,F ,P) and a random variable X with associated distribution (px). This is a consequence of the existence Theorem 1.19 in Chapter 1. This observation is important in practice: it allows us to write “Let X be a random variable with distribution (px)x∈E.”

Example 1: Bernoulli random variable

Let 0 ≤ p ≤ 1. A random variable X is said to be a Bernoulli random variable with parameter p if it takes values in E = {0,1} and

    P[X = 0] = 1 − p and P[X = 1] = p .

    In this case, we write X ∼ Ber(p).

Example 2: Binomial random variable

Let 0 ≤ p ≤ 1 and let n ∈ N. A random variable X is said to be a binomial random variable with parameters n and p if it takes values in E = {0, . . . , n} and

∀k ∈ {0, . . . , n}   P[X = k] = (n choose k) p^k (1 − p)^(n−k).

    In this case, we write X ∼ Bin(n, p).

Remark 2.11. If we define pk = (n choose k) p^k (1 − p)^(n−k), we have

∑_{k=0}^{n} pk = ∑_{k=0}^{n} (n choose k) p^k (1 − p)^(n−k) = (p + 1 − p)^n = 1,

hence equation (2.3) is satisfied. This guarantees the existence of binomial random variables.


Proposition 2.12 (Sum of independent Bernoullis and binomial). Let 0 ≤ p ≤ 1 and let n ∈ N. Let X1, . . . ,Xn be independent Bernoulli random variables with parameter p. Then

Sn ∶= X1 + ⋯ + Xn

is a binomial random variable with parameters n and p.

Proof. One can easily check that Sn is a random variable taking values in {0, . . . , n}. Furthermore, for every k ∈ {0, . . . , n} we have

{Sn = k} = ⋃_{x1,...,xn∈{0,1}, x1+⋯+xn=k} {X1 = x1, . . . ,Xn = xn}.

Since the union is disjoint, we get

P[Sn = k] = ∑_{x1,...,xn∈{0,1}, x1+⋯+xn=k} P[X1 = x1, . . . ,Xn = xn]
= ∑_{x1,...,xn∈{0,1}, x1+⋯+xn=k} P[X1 = x1]⋯P[Xn = xn]
= ∑_{x1,...,xn∈{0,1}, x1+⋯+xn=k} p^k (1 − p)^(n−k)
= (n choose k) p^k (1 − p)^(n−k).

Remark 2.13. In particular, the distribution Bin(1, p) is the same as the distribution Ber(p). One can also check that if X ∼ Bin(m, p), Y ∼ Bin(n, p) and X, Y are independent, then X + Y ∼ Bin(m + n, p).
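Proposition 2.12 can be verified by exhaustive enumeration for small n; the sketch below (illustrative, with n = 4 and p = 1/3 chosen arbitrarily) recomputes P[Sn = k] from the product formula and compares it with the binomial probabilities:

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 4, Fraction(1, 3)

# Enumerate all outcomes (x1,...,xn) of n independent Ber(p) trials
# and accumulate P[Sn = k] directly from the product formula.
pmf = {k: Fraction(0) for k in range(n + 1)}
for xs in product((0, 1), repeat=n):
    prob = Fraction(1)
    for x in xs:
        prob *= p if x == 1 else (1 - p)
    pmf[sum(xs)] += prob

# Compare with the binomial probabilities C(n,k) p^k (1-p)^(n-k).
binom = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
```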

Example 3: Geometric random variable

Let 0 < p ≤ 1. A random variable X is said to be a geometric random variable with parameter p if it takes values in E = N ∖ {0} and

∀k ∈ N ∖ {0}   P[X = k] = (1 − p)^(k−1) ⋅ p.

    In this case, we write X ∼ Geom(p).

Remark 2.14. If we define pk = (1 − p)^(k−1) p, we have

∑_{k=1}^{∞} pk = p ∑_{k=1}^{∞} (1 − p)^(k−1) = p ⋅ (1/p) = 1,

hence equation (2.3) is satisfied. This guarantees the existence of geometric random variables.


The geometric random variable appears naturally as the time of the first success in an infinite sequence of Bernoulli experiments with parameter p. This is formalized by the following proposition.

Proposition 2.15. Let X1,X2, . . . be a sequence of infinitely many independent¹ Bernoulli r.v.’s with parameter p. Then

    T ∶= min{n ≥ 1 ∶ Xn = 1}

    is a geometric random variable with parameter p.

Remark 2.16. When saying that T is a geometric random variable, we make a slight abuse: indeed, the random variable T may take the value +∞ if all the random variables Xi are equal to 0. Nevertheless, this is not a problem for the calculations, because one can check that P[T = ∞] = 0.

Proof. We have T = k if the first k − 1 trials fail and the k-th one is a success. Formally, we have

    {T = k} = {X1 = 0, . . . ,Xk−1 = 0,Xk = 1}.

    Hence, by independence,

P[T = k] = P[X1 = 0, . . . ,Xk−1 = 0,Xk = 1]
= P[X1 = 0]⋯P[Xk−1 = 0] P[Xk = 1]
= (1 − p)^(k−1) p.

The previous proposition gives us an easy way to remember the definition of the geometric r.v., and also some simple formulas related to the geometric distribution. For example, if T is a geometric random variable with parameter p, we have T > n if the first n Bernoulli experiments fail, and therefore

P[T > n] = (1 − p)^n. (2.4)
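Formula (2.4) and the pmf of Proposition 2.15 can be checked numerically; the sketch below (illustrative, with p = 1/4) uses exact rational arithmetic, truncating the infinite tail sum at a cutoff:

```python
from fractions import Fraction

p = Fraction(1, 4)
pmf = lambda k: (1 - p) ** (k - 1) * p          # P[T = k], k >= 1

# Truncated tail sum_{k=n+1}^{cutoff} P[T = k];
# the neglected mass beyond the cutoff is exactly (1 - p)^cutoff.
def tail(n, cutoff=200):
    return sum(pmf(k) for k in range(n + 1, cutoff + 1))

# Exact identity: tail(n) + (1-p)^cutoff = (1-p)^n, i.e. P[T > n] = (1-p)^n.
check = tail(3) + (1 - p) ** 200 == (1 - p) ** 3

# Memorylessness via P[T >= m] = (1-p)^(m-1):
P_ge = lambda m: (1 - p) ** (m - 1)
memoryless = P_ge(5 + 2) / (1 - p) ** 5 == P_ge(2)
```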

It also gives an important interpretation of equation (2.5) in the proposition below: if we are waiting for a first success in a sequence of experiments, and if we know that the first n steps were failures, then the remaining waiting time is again a geometric random variable with parameter p.

Proposition 2.17 (Absence of memory of the geometric distribution). Let T ∼ Geom(p) for some 0 < p < 1. Then

∀n ≥ 0, ∀k ≥ 1   P[T ≥ n + k ∣ T > n] = P[T ≥ k]. (2.5)

¹ this means: (Xj)j∈J are independent r.v.’s for every finite J ⊂ N.


    Proof. It follows directly from the formula (2.4).

Example 4: Poisson random variable

Let λ > 0 be a positive real number. A random variable X is said to be a Poisson random variable with parameter λ if it takes values in E = N and

∀k ∈ N   P[X = k] = (λ^k / k!) e^(−λ).

    In this case, we write X ∼ Poisson(λ).

Remark 2.18. If we define pk = (λ^k/k!) e^(−λ), we have

∑_{k=0}^{∞} pk = e^(−λ) ∑_{k=0}^{∞} λ^k/k! = e^(−λ) ⋅ e^λ = 1,

hence equation (2.3) is satisfied. This guarantees the existence of Poisson random variables.

The Poisson distribution appears naturally as an approximation of the binomial distribution when the parameter n is large and the parameter p is small, as stated formally in the following proposition.

Proposition 2.19 (Poisson approximation of the binomial). Let λ > 0. For every n ≥ 1, consider a random variable Xn ∼ Bin(n, λ/n). Then

∀k ∈ N   lim_{n→∞} P[Xn = k] = P[N = k], (2.6)

where N is a Poisson random variable with parameter λ.

Remark 2.20. The convergence (2.6) is called convergence in distribution. Intuitively, it says that Xn and N have very similar probabilistic properties for n large.

Proof. Fix k ∈ N. For every n ≥ 1, we have

P[Xn = k] = (n choose k) (λ/n)^k (1 − λ/n)^(n−k)
= (λ^k/k!) ⋅ [ n(n − 1)⋯(n − k + 1)/n^k ] ⋅ (1 − λ/n)^(−k) ⋅ (1 − λ/n)^n.

As n → ∞, the factor n(n − 1)⋯(n − k + 1)/n^k ⋅ (1 − λ/n)^(−k) tends to 1 and (1 − λ/n)^n tends to e^(−λ), which concludes the proof.


This approximation may be useful in practice. For example, consider a single page of the “Neue Zürcher Zeitung” containing, say, n = 10⁴ characters, and suppose that the typesetter mis-sets approximately 1/1000 of the characters. In other words, each character has probability p = 10/n to be mis-set. The number M of mistakes on the page corresponds to a binomial random variable with parameters n and p = 10/n. Hence, by the Poisson approximation with λ = np = 10, we have for example

P[M = 5] ≃ (10⁵/5!) e^(−10) ≃ 0.0378.

2.3 Joint distribution and image of random variables

Definition 2.21. Let X1 ∶ Ω → E1, . . . ,Xn ∶ Ω → En be n discrete random variables on (Ω,F ,P) with values in some finite or countable sets E1, . . . ,En. The family (px1,...,xn)x1∈E1,...,xn∈En defined by

px1,...,xn = P[X1 = x1, . . . ,Xn = xn]

    is called the joint distribution of the r.v.’s X1, . . . ,Xn.

The joint distribution characterizes the probabilistic properties of the random vector (X1, . . . ,Xn). If the random variables X1, . . . ,Xn are independent, then the joint distribution is given by the product

px1,...,xn = P[X1 = x1]⋯P[Xn = xn].

For example, consider two independent Poisson random variables X,Y ∶ Ω → N with the same parameter λ; then the joint distribution of X and Y is given by

∀k, ℓ ∈ N   P[X = k, Y = ℓ] = (λ^(k+ℓ) / (k! ℓ!)) e^(−2λ).

In general the random variables may have dependencies. For example, let Z be a third random variable defined by Z = X. Then Z is also a Poisson random variable with parameter λ. This time, the joint distribution is

∀k, ℓ ∈ N   P[X = k, Z = ℓ] = 1_{k=ℓ} (λ^k/k!) e^(−λ).

One of the main advantages of working with random variables is that we can “manipulate” them as numbers. For instance, if we are given n random variables X1,X2, . . . ,Xn, we can think of them as n “random” numbers and do arithmetic with them. For example, we can consider the sum S = X1 + X2 + ⋯ + Xn or the product Y = X1 ⋅ X2 ⋅ ⋯ ⋅ Xn, which are two new random variables. More generally, the following proposition allows us to consider random variables as images of discrete random variables.


Proposition 2.22. Let n ≥ 1 and let φ ∶ R^n → R be an arbitrary function. Let X1 ∶ Ω → E1, . . . ,Xn ∶ Ω → En be n discrete random variables on (Ω,F ,P) with values in some finite or countable sets E1, . . . ,En. Then Z = φ(X1, . . . ,Xn) is a discrete random variable with values in the discrete set F = φ(E1 × ⋯ × En) and with distribution given by

∀z ∈ F   P[Z = z] = ∑_{x1∈E1,...,xn∈En ∶ φ(x1,...,xn)=z} P[X1 = x1, . . . ,Xn = xn].

    Proof. To prove that Z is a discrete random variable, it suffices to show that

    • it takes values in the discrete set F , and

    • for every z in F , we have {Z = z} ∈ F .

Indeed, the second item implies that for every a ∈ R,

{Z ≥ a} = ⋃_{z∈F, z≥a} {Z = z}

is also an event (as a countable union of events).

Now, the first item follows from the hypotheses on X1, . . . ,Xn and the definition of F. For the second item, let z ∈ F and observe that

{Z = z} = ⋃_{x1∈E1,...,xn∈En ∶ φ(x1,...,xn)=z} {X1 = x1, . . . ,Xn = xn}. (2.7)

Hence {Z = z} ∈ F, since it is a countable union of events. This implies that Z is a discrete r.v. To compute the distribution, observe that the union in (2.7) is disjoint and at most countable, hence

P[Z = z] = ∑_{x1∈E1,...,xn∈En ∶ φ(x1,...,xn)=z} P[X1 = x1, . . . ,Xn = xn].

    2.3.1 A parenthesis: almost sure events

An important notion when working with random variables is the notion of almost sure occurrence of an event.

    Definition 2.23. Let A ∈ F be an event. We say that A occurs almost surely (a.s.) if

    P[A] = 1.

Remark 2.24. This notion can be extended to any set A ⊂ Ω (not necessarily an event): we say that A occurs almost surely if there exists an event A′ ∈ F such that A′ ⊂ A and P[A′] = 1.


In other words, something occurs a.s. if it occurs with probability 1. For example, if X, Y are two random variables, we write

X ≤ Y a.s.

if P[X ≤ Y ] = 1, and

X ≤ a a.s.

if P[X ≤ a] = 1.

2.4 Expectation

The expectation of a random variable is a fundamental concept, which corresponds intuitively to the notion of “average”. Before defining it for general random variables, let us start with our favorite example: the throw of a die!

Let X ∶ Ω → E = {1, . . . ,6} be a uniform r.v. in {1,2,3,4,5,6}, i.e. P[X = x] = 1/6 for every x = 1, . . . ,6. It is natural to say that the average value is

(1 + 2 + 3 + 4 + 5 + 6) / 6.

With the probabilistic notation, this can be rewritten as

1 ⋅ (1/6) + 2 ⋅ (1/6) + 3 ⋅ (1/6) + 4 ⋅ (1/6) + 5 ⋅ (1/6) + 6 ⋅ (1/6) = ∑_{x∈E} x ⋅ P[X = x],

which is the general formula for the expectation.

If a random variable is nonnegative, then the expectation is always well defined.

Definition 2.25. Let X ∶ Ω → E be a discrete random variable, and assume that X ≥ 0 a.s. Then the expectation of X is defined by

E[X] = ∑_{x∈E} x ⋅ P[X = x].

Remark 2.26. If X ≥ 0 a.s., we have E[X] ∈ R+ ∪ {+∞}.

When a random variable does not have a constant sign, one needs an integrability assumption to be able to define the expectation.

Definition 2.27. Let X ∶ Ω → E be a discrete random variable. We say that X is integrable if

E[∣X∣] < ∞.

In this case, the expectation E[X] = ∑_{x∈E} x ⋅ P[X = x] is a well-defined real number.


    Remark 2.28. If X is an integrable r.v., then E[X] ∈ R.

Example 1: Bernoulli r.v.

Let X be a Bernoulli random variable with parameter p. We have

E[X] = p.

Indeed,

E[X] = 0 ⋅ P[X = 0] + 1 ⋅ P[X = 1] = 0 ⋅ (1 − p) + 1 ⋅ p = p.

Example 2: Bet on a die

Consider the random variable X ∶ Ω → {−1,0,+2} defined by Eq. (1.5), page 18. Then

E[X] = −1 ⋅ (1/2) + 0 ⋅ (1/6) + 2 ⋅ (1/3) = 1/6.

Example 3: Poisson r.v.

Let X be a Poisson random variable with parameter λ > 0. Then

E[X] = λ.

Indeed,

E[X] = ∑_{k=0}^{∞} k ⋅ (λ^k/k!) ⋅ e^(−λ) = λ ⋅ ( ∑_{k=1}^{∞} λ^(k−1)/(k − 1)! ) ⋅ e^(−λ) = λ.

Example 4: Indicator of an event

Let A be an event. Then 1A is a Bernoulli r.v. with parameter P[A]. Hence,

    E[1A] = P[A] .

Proposition 2.29 (Tailsum formula). Let X be a discrete r.v. taking values in N = {0,1,2, . . .}. Then

E[X] = ∑_{n=1}^{∞} P[X ≥ n]. (2.8)

Proof. Write pn = P[X = n] for the distribution of X. Since X takes values in N, we have for every n ≥ 1

P[X ≥ n] = ∑_{k≥n} P[X = k] = ∑_{k=1}^{∞} 1_{k≥n} ⋅ pk.


Using this identity in the right-hand side of (2.8) and applying Fubini’s theorem (for nonnegative sequences) to permute the sums, we obtain

∑_{n=1}^{∞} P[X ≥ n] = ∑_{n=1}^{∞} ( ∑_{k=1}^{∞} 1_{k≥n} ⋅ pk )
= ∑_{k=1}^{∞} ( ∑_{n=1}^{∞} 1_{k≥n} ) ⋅ pk
= ∑_{k=1}^{∞} k ⋅ pk
= E[X].

Application: Computation of the expectation of a geometric random variable.

Let T be a geometric random variable with parameter 0 < p ≤ 1. Then

E[T ] = 1/p.

Indeed, T takes values in N and the proposition above gives

E[T ] = ∑_{n≥1} P[T ≥ n] = ∑_{n≥1} (1 − p)^(n−1) = 1/(1 − (1 − p)) = 1/p.

Theorem 2.30 (Image random variables). Let X1, . . . ,Xn ∶ Ω → E be n random variables and let φ ∶ R^n → R. Then Z = φ(X1, . . . ,Xn) defines a discrete random variable. Furthermore, if

∑_{x1,...,xn∈E} ∣φ(x1, . . . , xn)∣ ⋅ P[X1 = x1, . . . ,Xn = xn] < ∞,

then Z is integrable and

E[φ(X1, . . . ,Xn)] = ∑_{x1,...,xn∈E} φ(x1, . . . , xn) P[X1 = x1, . . . ,Xn = xn]. (2.9)

Proof. Using the formula of Prop. 2.22 and Fubini’s theorem for nonnegative sequences, we have

∑_{z∈F} ∣z∣ P[Z = z] = ∑_{z∈F} ∑_{x1,...,xn∈E} ∣z∣ ⋅ 1_{φ(x1,...,xn)=z} P[X1 = x1, . . . ,Xn = xn]
= ∑_{x1,...,xn∈E} ( ∑_{z∈F} ∣z∣ 1_{φ(x1,...,xn)=z} ) P[X1 = x1, . . . ,Xn = xn]
= ∑_{x1,...,xn∈E} ∣φ(x1, . . . , xn)∣ ⋅ P[X1 = x1, . . . ,Xn = xn] < ∞.

Hence Z is integrable, and formula (2.9) is proved by the same computation without the absolute values (the permutation of the two sums being justified by Fubini’s theorem for integrable sequences).


Remark 2.31. The formula for n = 1 is also important. Let X ∶ Ω → E be a discrete random variable and let φ ∶ R → R. Assume that

∑_{x∈E} ∣φ(x)∣ ⋅ P[X = x] < ∞.

Then E[φ(X)] = ∑_{x∈E} φ(x) ⋅ P[X = x].


and equivalently, for fixed y,

P[Y = y] = ∑_x P[X = x, Y = y]. (2.10)

Plugging these two equations into the expression for E[X + Y ], we obtain

E[X + Y ] = ∑_x x P[X = x] + ∑_y y P[Y = y],

    which concludes the proof.

Application: Expectation of a binomial random variable.

Let n ≥ 1 and 0 ≤ p ≤ 1. Let S be a binomial random variable with parameters n and p. What is the expectation of S?

By definition we have

E[S] = ∑_{k=0}^{n} k ⋅ (n choose k) p^k (1 − p)^(n−k),

and this sum does not look so nice... However, we can use the fact that S has the same distribution as Sn = X1 + ⋯ + Xn, where X1, . . . ,Xn are n iid Bernoulli r.v.’s with parameter p. By linearity we have

    E[Sn] = E[X1] +⋯ +E[Xn].

    Using that E[Xi] = p for every i, we deduce directly

    E[S] = E[Sn] = np.

Theorem 2.36 (Jensen’s inequality). Let X be a discrete random variable. Let φ ∶ R → R be a convex function. If E[φ(X)] and E[X] are well defined, then

    φ(E[X]) ≤ E[φ(X)].

Jensen’s inequality has several important consequences. First, by applying it to φ(x) = ∣x∣, we obtain the triangle inequality: for every integrable discrete random variable X, we have

∣E[X]∣ ≤ E[∣X ∣].

Another important consequence relates the average of ∣X∣ to the average of X². By applying Jensen’s inequality to ∣X∣ with the convex function φ(x) = x², we obtain that for every discrete random variable X,

E[∣X ∣] ≤ √(E[X²]). (2.11)
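Both consequences can be observed on the bet variable of Example 2 (values −1, 0, 2 with probabilities 1/2, 1/6, 1/3); the chain ∣E[X]∣ ≤ E[∣X∣] ≤ √E[X²] holds, as the following sketch (illustrative only) confirms:

```python
from fractions import Fraction
from math import sqrt

# Bet variable of Example 2: values -1, 0, 2 with probs 1/2, 1/6, 1/3.
dist = {-1: Fraction(1, 2), 0: Fraction(1, 6), 2: Fraction(1, 3)}
E = lambda f: sum(f(x) * q for x, q in dist.items())

mean = E(lambda x: x)              # E[X]   = 1/6
mean_abs = E(lambda x: abs(x))     # E[|X|] = 7/6
second = E(lambda x: x * x)        # E[X^2] = 11/6

chain_holds = abs(mean) <= mean_abs <= sqrt(second)
```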


2.5 Independence of discrete random variables

Theorem 2.37. Let X, Y be two discrete random variables. Then the following are equivalent:

    (i) X,Y are independent,

    (ii) For every a, b ∈ R we have P[X = a, Y = b] = P[X = a] ⋅ P[Y = b],

    (iii) For every f ∶ R→ R, g ∶ R→ R,

    E[f(X)g(Y )] = E[f(X)]E[g(Y )]

    whenever the expectations are well-defined.

    Proof.

    (i)⇔ (ii) Already seen.

    (ii)⇒ (iii) Let f, g be as in (iii). By Theorem 2.30 applied to φ(x, y) = f(x)g(y), we have

    E[f(X)g(Y )] = ∑x,y

    f(x)g(y) ⋅ P[X = x,Y = y]

    (ii)= ∑x,y

    f(x)g(y) ⋅ P[X = x]P[Y = y]

    = ∑x

    f(x) ⋅ P[X = x]∑y

    g(y) ⋅ P[Y = y]

    = E[f(X)] ⋅E[g(Y )].

(iii) ⇒ (ii) Consider the functions f and g defined by f(x) = 1_{x=a} and g(y) = 1_{y=b}. We have

P[X = a, Y = b] = E[1_{X=a, Y=b}] = E[f(X)g(Y)] (iii)= E[f(X)] ⋅ E[g(Y)] = P[X = a] P[Y = b].
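The equivalence (ii) ⇔ (iii) can be illustrated numerically for two independent fair dice; the choices f(x) = x² and g(y) = y are arbitrary examples:

```python
# For independent X, Y uniform on {1,...,6}, the joint distribution
# factorizes, so E[f(X)g(Y)] must equal E[f(X)] * E[g(Y)].
vals = range(1, 7)
lhs = sum(x**2 * y * (1 / 36) for x in vals for y in vals)   # E[X^2 * Y]
rhs = sum(x**2 * (1 / 6) for x in vals) * sum(y * (1 / 6) for y in vals)
assert abs(lhs - rhs) < 1e-9
```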

Above, we only considered a pair (X, Y) of random variables, but the same ideas apply to n random variables X1, . . . , Xn, as in the following theorem.

Theorem 2.38. Let X1, . . . , Xn be n discrete random variables. Then the following are equivalent:


    (i) X1, . . . ,Xn are independent,

    (ii) For every x1, . . . , xn ∈ R we have P[X1 = x1, . . . ,Xn = xn] = P[X1 = x1]⋯P[Xn = xn],

(iii) For every f1 : R → R, . . . , fn : R → R such that f1(X1), . . . , fn(Xn) are integrable,

E[f1(X1)⋯fn(Xn)] = E[f1(X1)]⋯E[fn(Xn)].

2.6 Variance

Definition 2.39. Let X be a discrete random variable such that E[X²] < ∞. The variance of X is defined by

σ²_X = E[(X − E[X])²].

The variance satisfies the following two properties, proved below.

1. σ²_X = E[X²] − E[X]².

2. If X1, . . . , Xn are independent discrete random variables with finite second moments and S = X1 + ⋯ + Xn, then σ²_S = σ²_{X1} + ⋯ + σ²_{Xn}.


    Proof.

1. Let m = E[X]. Using linearity of the expectation, we get

E[(X − m)²] = E[X² − 2mX + m²] = E[X²] − 2mE[X] + m² = E[X²] − m².

2. Writing mi = E[Xi], we have

S − E[S] = ∑_{i=1}^{n} (Xi − mi),

and therefore

σ²_S = ∑_{1≤i,j≤n} E[(Xi − mi)(Xj − mj)].

However, for i ≠ j, independence implies E[(Xi − mi)(Xj − mj)] = E[Xi − mi] ⋅ E[Xj − mj] = 0. Hence only the diagonal terms (for which i = j) survive in the sum, and we obtain

σ²_S = ∑_{i=1}^{n} E[(Xi − mi)²] = ∑_{i=1}^{n} σ²_{Xi}.

Application: Let S be a binomial random variable with parameters n and p. What is the variance of S?

Here, again, we can use that S has the same distribution as Sn = X1 + ⋯ + Xn, where X1, . . . , Xn are iid Bernoulli random variables with parameter p. Hence

σ²_S = σ²_{Sn} (by independence) = σ²_{X1} + ⋯ + σ²_{Xn} (identical distribution) = n ⋅ σ²_{X1}.

One has σ²_{X1} = E[X1²] − p² = p − p² = p(1 − p). Hence

σ²_S = n ⋅ p(1 − p).

    Here, we have discovered an important effect of summing iid random variables. One has

E[S] = n ⋅ p and σ_S = √n ⋅ √(p(1 − p)),

so the expectation of S grows like n, while the fluctuations of Sn grow like √n, thanks to cancellations in the sum Sn = X1 + ⋯ + Xn.
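Both formulas, and the n versus √n growth, can be checked directly from the binomial weights. A Python sketch (p = 0.3 and the values of n are arbitrary examples):

```python
from math import comb, sqrt

p = 0.3
for n in (10, 100, 1000):
    mean = n * p
    # variance computed directly from the binomial probability weights
    var = sum((k - mean) ** 2 * comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(n + 1))
    assert abs(var - n * p * (1 - p)) < 1e-6
    # the mean grows like n, the standard deviation only like sqrt(n)
    print(n, mean, sqrt(var))
```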

  • Chapter 3

    Continuous distributions

Throughout this chapter, we fix some probability space (Ω, F, P).

3.1 Definition and examples (textbook p. 85-88)

Let us recall the definitions from Chapter 1. A random variable is a function X : Ω → R such that {X ≤ a} is an event for every a, and the distribution function of X is defined by FX(a) = P[X ≤ a].

Definition 3.1 (Continuous random variables). A random variable X : Ω → R is said to be continuous if its distribution function FX can be written as

FX(a) = ∫_{−∞}^{a} f(x) dx for all a ∈ R, (3.1)

for some nonnegative function f : R → R+, called the density of X.

In this case, we have already noticed in Section 1.4 that FX is a continuous function, and for a ∈ R

    P[X = a] = F (a) − F (a−) = 0.

If X has a continuous distribution, we thus have: for all a ≤ b,

P[a < X ≤ b] = P[a ≤ X ≤ b] = ∫_{a}^{b} f(x) dx.


Example 1: Uniform distribution in [a, b], a < b

A continuous random variable X is said to be uniform in [a, b] if its density is equal to

fa,b(x) =
  1/(b − a)  if a ≤ x ≤ b,
  0          otherwise.

Figure 3.1: Density and distribution function of a uniform random variable in [a, b].

Intuition: X represents a uniformly chosen point in [a, b].

Properties of a uniform random variable X in [a, b]:

• The probability to fall in an interval [c, c + ℓ] ⊂ [a, b] depends only on its length ℓ:

P[X ∈ [c, c + ℓ]] = ℓ/(b − a).

    • The distribution function of X is equal to

FX(x) =
  0                if x < a,
  (x − a)/(b − a)  if a ≤ x ≤ b,
  1                if x > b.
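The first property is easy to test by simulation. A Python sketch (a = 2, b = 5, c = 3, ℓ = 1 are arbitrary example values):

```python
import random

random.seed(0)
a, b, c, ell = 2.0, 5.0, 3.0, 1.0
samples = [random.uniform(a, b) for _ in range(200_000)]
# empirical frequency of [c, c + ell] vs the exact value ell/(b - a)
freq = sum(c <= x <= c + ell for x in samples) / len(samples)
assert abs(freq - ell / (b - a)) < 0.01
```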

Example 2: Exponential distribution with λ > 0

The exponential distribution is the continuous analogue of the geometric distribution.

A continuous random variable T is said to be exponential with parameter λ > 0 if its density is equal to

fλ(x) =
  λe^{−λx}  if x ≥ 0,
  0         if x < 0.

    In this case, we write T ∼ Exp(λ).

Figure 3.2: Density fλ(x) = λe^{−λx} and distribution function FX(x) = 1 − e^{−λx} (for x ≥ 0) of an exponential random variable with parameter λ.


Intuition/application: T represents the time of a "clock ring". For example, the time at which a first customer arrives in a shop is well modeled by an exponential random variable.

Properties of an exponential random variable T with parameter λ.

    • The waiting probability is exponentially small:

    ∀t ≥ 0 P[T > t] = e−λt .

• It has the absence of memory (memoryless) property:

∀t, s ≥ 0 P[T > t + s | T > t] = P[T > s].

    The first item follows from the definition:

P[T ≥ t] = ∫_{t}^{∞} λe^{−λx} dx = e^{−λt}.

    The second item is a direct computation of the conditional probability:

P[T > t + s | T > t] = P[T > t + s]/P[T > t] = e^{−λ(t+s)}/e^{−λt} = e^{−λs}.
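The memoryless property can also be observed by simulation. A Python sketch (λ = 2, t = 0.5, s = 0.3 are arbitrary example values):

```python
import random
from math import exp

random.seed(1)
lam, t, s = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(400_000)]
survived = [x for x in samples if x > t]
cond = sum(x > t + s for x in survived) / len(survived)  # P[T > t+s | T > t]
uncond = sum(x > s for x in samples) / len(samples)      # P[T > s]
# both should be close to exp(-lam * s)
assert abs(cond - uncond) < 0.01
assert abs(uncond - exp(-lam * s)) < 0.01
```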

Example 3: Normal distribution with parameters m ∈ R and σ² > 0

A continuous random variable X is said to be normal with parameters m and σ² > 0 if its density is equal to

fm,σ(x) = (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}, x ∈ R.

    In this case, we write X ∼ N (m,σ2).

Figure 3.3: Density of a normal random variable with parameters m and σ².

Intuition/application: The normal distribution arises in many applications. For example, imagine that we measure a physical quantity: the real value is m and the measured


value is in general very well modeled by a normal random variable X with parameters m and σ². The quantity σ, which represents the fluctuations of X, can also be interpreted as the quality of the measurement in this context. A small σ corresponds to small fluctuations of X, which means that X is typically close to m. On the contrary, a large σ corresponds to large fluctuations and can be interpreted as an inaccurate measurement. We will see later a mathematical justification explaining why the normal random variable appears in many places.

    Properties of normal random variables

• If X1, . . . , Xn are independent normal random variables with parameters (m1, σ²1), . . . , (mn, σ²n) respectively, and m0, λ1, . . . , λn ∈ R, then

Z = m0 + λ1X1 + ⋯ + λnXn

is a normal random variable with parameters m = m0 + λ1m1 + ⋯ + λnmn and σ² = λ²1σ²1 + ⋯ + λ²nσ²n.

• In particular, if X ∼ N(0, 1) (in this case we say that X is a standard normal random variable), then

Z = m + σ ⋅ X

is a normal random variable with parameters m and σ².

• If X is a normal random variable with parameters m and σ², then all the "probability mass" is mainly in the interval [m − 3σ, m + 3σ]. Namely, we have

P[|X − m| ≥ 3σ] ≤ 0.0027.

At a first look, it may be surprising that the right hand side (0.0027) does not depend on the parameters σ and m... This is explained as follows: consider the random variable Z = (X − m)/σ. The first property above implies that Z = (1/σ)X − m/σ is a standard normal random variable. The left hand side can be rewritten as

P[|X − m| ≥ 3σ] = P[|X − m|/σ ≥ 3] = P[|Z| ≥ 3].

Then the inequality P[|Z| ≥ 3] ≤ 0.0027 can be directly checked from a table.
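Instead of a table, the value can also be obtained from the complementary error function, since P[|Z| ≥ 3] = 2(1 − Φ(3)) = erfc(3/√2):

```python
from math import erfc, sqrt

# P[|Z| >= 3] for a standard normal Z, computed without a normal table
p = erfc(3 / sqrt(2))
assert 0.0026 < p < 0.0028  # matches the 0.0027 quoted above
```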

    How to recognize a continuous random variable?

    The following theorem will be useful in applications to calculate densities.

Theorem 3.2. Let X be a random variable. Assume the distribution function FX is continuous and piecewise C1, i.e. that there exist x0 = −∞ < x1 < ⋯ < xn−1 < xn = +∞ such that FX is C1 on every interval (xi, xi+1). Then X is a continuous random variable and a density f can be constructed by defining

∀x ∈ (xi, xi+1), f(x) = F′X(x),

and setting arbitrary values at x1, . . . , xn−1.


Proof. We simply write F = FX. Let 0 ≤ i < n. If xi < a < b < xi+1, the fundamental theorem of calculus implies that

F(b) − F(a) = ∫_{a}^{b} F′(y) dy = ∫_{a}^{b} f(y) dy.

Now, let x ∈ R and let i be such that x ∈ [xi, xi+1). Using the convention F(x0) = 0, we can write F(x) as a telescopic sum

    F (x) = F (x) − F (x0) = (F (x) − F (xi)) +⋯ + (F (x1) − F (x0)). (3.2)

    By continuity of F , we have

F(x) − F(xi) = lim_{a↓xi} (F(x) − F(a)) = lim_{a↓xi} ∫_{a}^{x} f(y) dy = ∫_{xi}^{x} f(y) dy,

and similarly

F(xi) − F(xi−1) = ∫_{xi−1}^{xi} f(y) dy.

    Plugging these identities in (3.2), we get

F(x) = ∫_{xi}^{x} f(y) dy + ∫_{xi−1}^{xi} f(y) dy + ⋯ + ∫_{x0}^{x1} f(y) dy = ∫_{−∞}^{x} f(y) dy.

    A method to compute the density of a transformed random variable

As we have seen, it is very useful to take the image Y = φ(X) of a random variable X by some function φ : R → R. Here we consider the following problem:

Problem: Let X be a continuous random variable with density f. Assuming it exists, what is the density of Y = φ(X)?

First, remark that φ(X) may not be continuous. For example, if φ is a constant function, then φ(X) is NOT continuous, and therefore it has no density. The method below works well if X and φ are sufficiently "nice".


    How to compute the density of Y = φ(X)?

    Step 1: Compute the distribution function.

    FY (x) = P[φ(X) ≤ x] = ⋯

Step 2: If the function FY is continuous and piecewise C1, then we directly conclude that Y is continuous and its density is given by

    f(x) = F ′Y (x)

    on each interval where the derivative is defined.

    We give two examples where the method can be applied.

Example 1: What is the density of Y = X², for X a standard normal r.v.?

First, we compute the distribution function of Y. For every x < 0, FY(x) = 0, and for every x ≥ 0

FY(x) = P[X² ≤ x] = P[−√x ≤ X ≤ √x] = Φ(√x) − Φ(−√x),

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy.

The function FY is continuous and piecewise C1, so we can apply Theorem 3.2. We compute the derivative on the intervals (−∞, 0) and (0, ∞). For x < 0, we have F′Y(x) = 0, and for x > 0, we have

F′Y(x) = (1/(2√x)) (Φ′(√x) + Φ′(−√x)) = (1/√(2πx)) e^{−x/2}.

    Therefore, by Theorem 3.2, the random variable Y is continuous and it has density

f(x) =
  0                    if x ≤ 0,
  (1/√(2πx)) e^{−x/2}  if x > 0.
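The computed distribution function can be cross-checked by simulation, using FY(x) = Φ(√x) − Φ(−√x) = erf(√(x/2)). A Python sketch (the test point x = 1 is an arbitrary choice):

```python
import random
from math import erf, sqrt

random.seed(2)
exact = erf(sqrt(0.5))  # F_Y(1) = P[-1 <= X <= 1] for X standard normal
mc = sum(random.gauss(0, 1) ** 2 <= 1 for _ in range(300_000)) / 300_000
assert abs(mc - exact) < 0.01  # both close to 0.683
```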

Example 2: What is the density of Y = X^n, for X uniformly distributed on [0,1]?

First we compute the distribution function of Y. The random variable Y takes values in [0,1], and for 0 < x ≤ 1

FY(x) = P[Y ≤ x] = P[X^n ≤ x] = P[X ≤ x^{1/n}] = x^{1/n}.

    Therefore,

FY(x) =
  0        if x ≤ 0,
  x^{1/n}  if 0 < x ≤ 1,
  1        if x > 1.


Second, we observe that FY is continuous and piecewise C1. By computing the derivative of FY on the intervals (−∞, 0), (0, 1) and (1, ∞), we find that Y has density

f(x) =
  0                  if x ≤ 0,
  (1/n) x^{1/n − 1}  if 0 < x ≤ 1,
  0                  if x > 1.
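The formula FY(x) = x^{1/n} is easy to verify by simulation. A Python sketch (n = 3 and the test point x = 0.5 are arbitrary choices):

```python
import random

random.seed(3)
n, x = 3, 0.5
# empirical P[X^n <= x] for X uniform on [0, 1] vs the exact value x^(1/n)
mc = sum(random.random() ** n <= x for _ in range(300_000)) / 300_000
assert abs(mc - x ** (1 / n)) < 0.01
```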

3.2 Expectation and variance

Definition 3.3 (Expectation for a positive random variable). Let X : Ω → R be a continuous random variable with density f, and assume that X ≥ 0 a.s. Then the expectation of X is defined by

E[X] = ∫_{−∞}^{∞} x ⋅ f(x) dx.

Remark 3.4. The fact that X ≥ 0 a.s. implies that f(x) = 0 for x < 0, hence

E[X] = ∫_{0}^{∞} x ⋅ f(x) dx,

and the integral is always well defined. It can be finite or infinite.

Definition 3.5. Let X : Ω → R be a continuous random variable with density f. We say that X is integrable if E[|X|] = ∫_{−∞}^{∞} |x| ⋅ f(x) dx < ∞. In this case, the expectation of X is defined by

E[X] = ∫_{−∞}^{∞} x ⋅ f(x) dx.

More generally, for φ : R → R, the expectation of φ(X) is given by

E[φ(X)] = ∫_{−∞}^{∞} φ(x) ⋅ f(x) dx,

whenever the integral is well defined.


If Y = φ(X) is discrete and takes values in a finite or countable set E, then the definition above is consistent with Definition 2.27, in the sense that

E[Y] = ∑_{y∈E} y ⋅ P[Y = y] = ∫_{−∞}^{∞} φ(x) ⋅ f(x) dx = E[φ(X)].

These facts are not obvious in the formalism used in this class, and we admit them.

    Example 1: Uniform random variable in [a, b], a < b

E[X] = (1/(b − a)) ∫_{a}^{b} x dx = (1/(b − a)) ⋅ (b²/2 − a²/2).

Therefore,

E[X] = (a + b)/2.

Example 2: Exponential random variable with parameter λ > 0

By integration by parts, we have

E[X] = ∫_{0}^{∞} x λe^{−λx} dx = [−x e^{−λx}]_{0}^{∞} + ∫_{0}^{∞} e^{−λx} dx.

Therefore,

E[X] = 1/λ.
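The integral can also be evaluated numerically as a sanity check. A Python sketch using the midpoint rule (λ = 2 and the truncation point are arbitrary choices; the tail beyond it is negligible):

```python
from math import exp

lam = 2.0
steps, upper = 100_000, 50 / lam
h = upper / steps
# midpoint rule for the integral of x * lam * exp(-lam * x) over [0, upper]
integral = sum((i + 0.5) * h * lam * exp(-lam * (i + 0.5) * h) * h
               for i in range(steps))
assert abs(integral - 1 / lam) < 1e-4
```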

    Example 3: Normal random variable with parameters m and σ2

If X is a normal random variable with parameters m and σ², then it has the same distribution as m + σ ⋅ Y, where Y is a standard normal random variable. Using Definition 3.6, we see that

    E[X] = E[m + σ ⋅ Y ] =m + σE[Y ],

hence it suffices to compute the expectation of Y. Writing f0,1 for the density of Y, we have

E[Y] = ∫_{−∞}^{∞} x ⋅ f0,1(x) dx = 0

    because x ⋅ f0,1(x) is an odd function. Finally, we obtain

    E[X] =m.


    Example 4: Cauchy distribution

A random variable X is said to have a Cauchy distribution if it is continuous and has density

f(x) = (1/π) ⋅ 1/(1 + x²), x ∈ R.

    In this case,

E[|X|] = (1/π) ∫_{−∞}^{∞} |x|/(1 + x²) dx = +∞.

The expectation of X is not well defined. Actually, there is no natural definition of

∫_{−∞}^{+∞} (1/π) ⋅ x/(1 + x²) dx.

    Indeed, one can note for instance that

lim_{N→∞} ∫_{−N}^{2N} (1/π) ⋅ x/(1 + x²) dx = (1/π) log 2,

while

lim_{N→∞} ∫_{−3N}^{N} (1/π) ⋅ x/(1 + x²) dx = −(1/π) log 3.
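Using the antiderivative (1/(2π)) log(1 + x²) of (1/π) x/(1 + x²), both truncated integrals have closed forms, and one can watch them approach the two different limits. A Python sketch:

```python
from math import log, pi

for N in (10.0, 1_000.0, 100_000.0):
    i1 = (log(1 + 4 * N * N) - log(1 + N * N)) / (2 * pi)  # over [-N, 2N]
    i2 = (log(1 + N * N) - log(1 + 9 * N * N)) / (2 * pi)  # over [-3N, N]
    print(N, i1, i2)
# the two limits disagree, so no symmetric value can be assigned
assert abs(i1 - log(2) / pi) < 1e-6
assert abs(i2 + log(3) / pi) < 1e-6
```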

As in the discrete case, Jensen's inequality allows us to compare the expectation of φ(X) to the expectation of X when φ is convex.

Theorem 3.8 (Jensen's inequality). Let φ : R → R be a convex function. Let X be a continuous random variable such that E[X] and E[φ(X)] are well defined. Then

    φ(E[X]) ≤ E[φ(X)].

Definition 3.9. Let X : Ω → R be a continuous random variable with density f. If E[X²] < ∞, the variance of X is defined by

σ²_X = E[(X − E[X])²].


Proposition 3.11.

1. Let X be a continuous random variable with E[X²] < ∞. Then σ²_X = E[X²] − E[X]².

2. If X is a continuous random variable with E[X²] < ∞ and a, b ∈ R, then σ²_{a+bX} = b² ⋅ σ²_X.


Example 2: Exponential random variable with parameter λ > 0

By integration by parts, we have

E[X²] = ∫_{0}^{∞} x² ⋅ λe^{−λx} dx = [−x² e^{−λx}]_{0}^{∞} + 2 ∫_{0}^{∞} x e^{−λx} dx = 0 + (2/λ) E[X] = 2/λ²,

and therefore, using the formula σ²_X = E[X²] − E[X]², we obtain

σ²_X = 1/λ².

    Example 3: Normal random variable with parameters m and σ2

We use that X has the same distribution as m + σ ⋅ Y, where Y is standard normal. For a standard normal random variable we have E[Y] = 0, and using integration by parts, we find

σ²_Y = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx (writing x² e^{−x²/2} = x ⋅ (x e^{−x²/2}))

= (1/√(2π)) [−x ⋅ e^{−x²/2}]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = 0 + 1 = 1.

    The second item of Proposition 3.11 implies that σ2X = σ2 ⋅ σ2Y and therefore

    σ2X = σ2.

3.3 Joint distribution

Definition 3.12. Two random variables X, Y : Ω → R have a continuous joint distribution if there exists a function f : R² → R+ such that

P[X ∈ [a, a′], Y ∈ [b, b′]] = ∫_{a}^{a′} (∫_{b}^{b′} f(x, y) dy) dx

for every −∞ ≤ a ≤ a′ ≤ ∞ and −∞ ≤ b ≤ b′ ≤ ∞. A function f as above is called a joint density of (X, Y).

Remark 3.13. By choosing a = b = −∞ and a′ = b′ = +∞ in the definition, one can see that a joint density always satisfies

∫_{−∞}^{∞} (∫_{−∞}^{∞} f(x, y) dy) dx = 1. (3.3)

Conversely, given a nonnegative function f satisfying (3.3), one can always construct a probability space (Ω, F, P) and two random variables X, Y : Ω → R with joint density f (admitted).


Interpretation: Informally, f(x, y) dx dy represents the probability that the random point (X, Y) lies in the small rectangle [x, x + dx] × [y, y + dy].

    Example 1: Uniform point in the square

Consider two random variables X and Y with joint density f(x, y) = 1_{0≤x,y≤1}, i.e.

f(x, y) =
  1  if (x, y) ∈ [0,1]²,
  0  if (x, y) ∉ [0,1]².

    In this case, we have for every rectangle R = (a, a′) × (b, b′) ⊂ [0,1]2

    P[(X,Y ) ∈ R] = (a′ − a)(b′ − b) = Area(R),

    and (X,Y ) intuitively represents a uniform point in the square [0,1]2.

    Example 2: Uniform point in the disk

Let D = {(x, y) : x² + y² ≤ 1} be the disk of radius 1 around 0. Consider two random variables X and Y with joint density f(x, y) = (1/π) 1_{x²+y²≤1}, i.e.

f(x, y) =
  1/π  if x² + y² ≤ 1,
  0    if x² + y² > 1.

In this case, we have for every rectangle R = (a, a′) × (b, b′) ⊂ D

P[(X, Y) ∈ R] = (1/π)(a′ − a)(b′ − b) = Area(R)/Area(D),

and (X, Y) intuitively represents a uniform point in the disk D.

Marginal densities. If X, Y possess a joint density fX,Y, then we have

P[X ≤ a] = P[X ∈ [−∞, a], Y ∈ [−∞, ∞]] = ∫_{−∞}^{a} (∫_{−∞}^{∞} f(x, y) dy) dx,

and therefore X is continuous with density

fX(x) = ∫_{−∞}^{∞} f(x, y) dy.

Similarly, Y is continuous with density

fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

Let us calculate the marginal densities in the two examples of joint densities we have seen above.


    Example 1: Uniform point in the square

If fX,Y(x, y) = 1_{0≤x,y≤1}, then X has density

fX(x) = ∫_{0}^{1} 1_{0≤x≤1} 1_{0≤y≤1} dy = 1_{0≤x≤1},

and similarly fY(y) = 1_{0≤y≤1}. In other words, both X and Y are uniform random variables in [0,1].

    Example 2: Uniform point in the disk

If fX,Y(x, y) = (1/π) 1_{x²+y²≤1}, then the density of X is

fX(x) = ∫_{−∞}^{∞} (1/π) 1_{x²+y²≤1} dy = ∫_{−√(1−x²)}^{√(1−x²)} (1/π) dy = (2/π) √(1 − x²) for −1 ≤ x ≤ 1,

and similarly fY(y) = (2/π) √(1 − y²).
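The semicircle marginal can be cross-checked by rejection sampling, using the antiderivative (x√(1 − x²) + arcsin x)/π + 1/2 of fX. A Python sketch (the test point 0.5 is an arbitrary choice):

```python
import random
from math import asin, pi, sqrt

random.seed(4)

def F(a):
    # distribution function of X: integral of (2/pi)*sqrt(1 - x^2) up to a
    return (a * sqrt(1 - a * a) + asin(a)) / pi + 0.5

hits = total = 0
while total < 200_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:   # keep only points inside the disk
        total += 1
        hits += (x <= 0.5)
freq = hits / total
assert abs(freq - F(0.5)) < 0.01  # both close to 0.80
```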

Expectation of φ(X, Y)

Let φ : R² → R. If X, Y have joint density fX,Y, then the expectation of the random variable Z = φ(X, Y) can be calculated by the formula

E[φ(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x, y) ⋅ fX,Y(x, y) dx dy, (3.4)

whenever the integral is well defined.

Application: proof of linearity of the expectation for jointly continuous distributions. For every λ, µ ∈ R, we have

E[λX + µY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (λx + µy) ⋅ fX,Y(x, y) dx dy

= λ ∫_{−∞}^{∞} x (∫_{−∞}^{∞} fX,Y(x, y) dy) dx + µ ∫_{−∞}^{∞} y (∫_{−∞}^{∞} fX,Y(x, y) dx) dy

= λ ∫_{−∞}^{∞} x ⋅ fX(x) dx + µ ∫_{−∞}^{∞} y ⋅ fY(y) dy.

    Hence, we obtain