Lesson 6 Probability Distributions Notes


    Probability Distributions

A random variable is a mechanism that generates data. The probability distribution (also called the marginal distribution if univariate) of the random variable describes the probabilities by which the data are generated. From the probability distribution we can then infer the expectation (weighted mean) and/or the variance of the distribution (also described as the expectation and/or the variance of the random variable, respectively). Thus, understanding the most common random variables and their probability distributions can save us significant time in solving the most common probabilistic problems we encounter in our studies. In this lesson, we will discuss discrete random variables, continuous random variables, and then joint and conditional probability distributions.

    Discrete Distributions

    Discrete #1: The Bernoulli Distribution

The simplest discrete random variable is a Bernoulli random variable, which is used to model experiments that can only succeed or fail. Examples include flipping a coin (and hoping for heads), choosing a female from the population, observing a daily stock return of less than -3%, observing a white car on El Camino Real, and so on. The Bernoulli random variable takes on values 1 and 0 with probabilities $p$ and $1-p$, respectively. In general, in order to fully describe a discrete random variable, we need to list the outcomes and the corresponding probabilities of those outcomes, as we have just done with the Bernoulli. More formally, if $X$ is a Bernoulli random variable, then we write the following:

$$P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}
\qquad \text{or} \qquad
X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

$X$ denotes the random variable and $x$ denotes the arbitrary value that $X$ can take on. Collectively, this description is called the probability distribution of the random variable $X$. A shorthand notation is $X \sim \mathrm{Bernoulli}(p)$. The $\sim$ notation means "is distributed as" and is used generally for other types of random variables as well. What does the probability distribution look like? A histogram! Imagine the example of a coin flip: in this case the histogram will have two bins (heads and tails) that will each have a relative frequency of 50% (or, for a sample, closer and closer to that distribution as the sample size increases). This plot of probabilities against outcomes for discrete random variables is also called a probability mass function.

    Discrete #2: The Binomial Distribution

The Bernoulli distribution is for a single trial. The probabilities expressed by this distribution derive, of course, from many repetitions of this trial, but the end result is a description of the probability of doing a success/fail experiment one time: flipping a coin once, etc. If we are interested in the outcome of repeated Bernoulli experiments (for example, the probability of flipping three heads in five tosses), then we can turn to the Binomial distribution. Another example could be describing how many people out of 100 vote for Obama over Romney (or vice versa, to be politically correct) when the (known) probability of voting for Obama is 0.40. This latter probability of any individual vote would be described by a Bernoulli distribution, while the former probability of how many out of all 100 vote for Obama would be described by a Binomial distribution.

More formally and generally, the Binomial distribution describes the probabilities of $k$ successes out of $n$ independent Bernoulli success/fail trials. If the random variable $X$ describes the number of successes, then the probability distribution of $X$ is

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ and is spoken "$n$ choose $k$". The $!$ operator is the factorial operator, and represents the product $n! = n(n-1)(n-2)\cdots(2)(1)$. For example, $5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120$, while $1! = 1$. By definition, $0! = 1$ as well. The choose or combination operator takes into account the different orderings in which the successes and failures can happen; if we were interested in the probability of a specific ordering (e.g. five successes followed by five failures in a trial of ten repetitions), we would have a different formula (since there are more ways to permute the number of successes than there are to just combine them; the permutation formula is just $\frac{n!}{(n-k)!}$).

We say $X \sim \mathrm{Binomial}(n, p)$. If you do some math, you will find that $E(X) = np$ and $V(X) = np(1-p)$. Hopefully, these results make intuitive sense: we expect to get the same proportion of successes as the probability of success in one trial would be, and we expect the variance to depend on both the probability of success and the probability of failure, again scaled up to the number of trials we run.

    Example 6.1 Say Dirk Nowitzki, the 2011 Most Valuable Player of the National Basketball

    Association, has a 90% chance of making any given free throw he takes. Say he takes ten free

    throws in a row. What is the probability that he makes exactly five of those shots, and misses the

    other five? What number of shots is he most likely to make?

Solution 6.1 $P(X = 5) = \frac{10!}{5!(10-5)!}(0.9)^5(1-0.9)^{10-5} = 252(0.59)(1\times 10^{-5}) \approx 0.15\%$. It makes sense that this shouldn't be very large, since this would represent making 50% of his shots when he has a 90% chance of making any given shot. Intuitively, we can expect him to be most likely to make 9 shots out of the 10; to actually prove this, we'd have to find the probability of all 11 possible events (making 0 shots through making all 10). However, since the Binomial probability distribution only has one peak, we'll check our intuition just by making sure that the probability of making 10 shots and the probability of making 8 shots are both lower than the probability of making 9 shots.

$$P(X = 9) = \frac{10!}{9!(10-9)!}(0.9)^9(1-0.9)^1 \approx 38.7\%$$

while also

$$P(X = 10) = \frac{10!}{10!(10-10)!}(0.9)^{10}(1-0.9)^0 \approx 34.9\%, \qquad P(X = 8) = \frac{10!}{8!(10-8)!}(0.9)^8(1-0.9)^2 \approx 19.4\%$$

so we can indeed say that he is most likely to make 9 of his 10 shots. It is not always the case that the most likely outcome is related to the underlying parameter $p$ in the same manner it is in this case.
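As a quick numerical check of Example 6.1, here is a minimal sketch assuming a Python environment with SciPy available (scipy.stats.binom implements the Binomial pmf):

```python
from scipy.stats import binom

n, p = 10, 0.9                 # ten free throws, 90% chance of making each
print(binom.pmf(5, n, p))      # P(X = 5)  ~ 0.0015 (about 0.15%)
print(binom.pmf(9, n, p))      # P(X = 9)  ~ 0.387
print(binom.pmf(10, n, p))     # P(X = 10) ~ 0.349
print(binom.pmf(8, n, p))      # P(X = 8)  ~ 0.194
```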

    Discrete #3: The Poisson Distribution

The Binomial distribution describes probabilities for successes in a certain number of trials. In contrast, the Poisson distribution applies to successes over a time period or an area. For example, we could model the probability that a biotechnology firm files for a vaccination patent in some time interval with a Poisson distribution. Poisson distributions can be useful in regression analysis when the dependent variable (the modeled variable) is a count variable.
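For reference, if $X \sim \mathrm{Poisson}(\lambda)$, where the rate $\lambda$ is the expected number of successes per interval, then

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, 2, \ldots$$

and it can be shown that $E(X) = V(X) = \lambda$.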

    Characteristics of Discrete Random Variables

Aside from the probabilities and the distribution, we are interested in certain characteristics of the random variable. For example, what is the central tendency of the random variable? What about the spread and variation? These are the same questions that we asked in the previous lesson about descriptive statistics, only now we are going to compute them for a particular random variable. No data will be involved because these are population characteristics. Earlier we talked about the difference between population and sample characteristics. In the case of a random variable, the expectation $E(X)$ (a.k.a. mean or average) and variance $\mathrm{Var}(X)$ or $V(X)$ are considered population or "true" characteristics, rather than estimates such as $\bar{x}$ and $s^2$. Additionally, when computing the expectation or variance, it is usually a good idea notationally to write what distribution the expectation or variance is taken with respect to, i.e. $E_X(X)$ rather than just $E(X)$. In these notes it should always be clear what distribution the variance or expectation is taken with respect to, and so we omit the subscript for now. However, it may be good to keep this in mind if you start dealing with complex expressions and many random variables.

Let's assume that our random variable $X$ has the general discrete probability distribution

$$P(X = x) = \begin{cases} p_1 & \text{if } x = x_1 \\ p_2 & \text{if } x = x_2 \\ \vdots \\ p_n & \text{if } x = x_n \end{cases} \qquad \text{where } \sum_{i=1}^{n} p_i = 1$$


We could also use the more compact notation $P(X = x_i) = p_i$, where $\sum_{i=1}^{n} p_i = 1$. Note that, technically, adding the specification that $\sum_{i=1}^{n} p_i = 1$ is redundant, because it is implied when we say that the above is a probability distribution. Regardless, using this general probability distribution, we can define the expectation of a discrete random variable $X$ as

$$E(X) = \sum_{i=1}^{n} p_i x_i.$$

Note that, therefore, $E(X)$ is a single number, not a random variable.

Example 6.2 If $X \sim \mathrm{Bernoulli}(0.8)$, what is $E(X)$?

Solution 6.2 In a Bernoulli distribution, the random variable can only take on the values 0 and 1. So we have $E(X) = 0.2(0) + 0.8(1) = 0.8$. We can see that, in general for a Bernoulli random variable, the expectation is $p$.

Example 6.3 Let $X$ be the random variable describing how many students in the boot camp are sleeping (or, at least, their eyes are getting droopy) at any one time after lunch:

$$P(X = x) = \begin{cases} 0.8 & \text{if } x = 0 \\ 0.1 & \text{if } x = 1 \\ 0.1 & \text{if } x = 2 \end{cases}$$

What is the expected value of sleeping students?

Solution 6.3 $E(X) = 0(0.8) + 1(0.1) + 2(0.1) = 0.3$. Note that even though the random variable is discrete and can only take on the values 0, 1, and 2, the expected value need not be one of those values; it can be any value in that range.

The expectation operator $E(\cdot)$ is a linear operator. This means that certain properties hold:

$E(a) = a$ for any constant $a$. Note that this means that $E(E(X)) = E(X)$.

$E(aX + bY) = aE(X) + bE(Y)$, where $a$, $b$ are constants and $X$, $Y$ are random variables.

In general, $E(XY) \neq E(X)E(Y)$. In fact, as we will see later, this relationship would only be true if the two variables had a correlation of exactly zero.

We call $E(X)$ the first moment of the distribution (or of the random variable). More generally, $E(X^k) = \sum_{i=1}^{n} p_i x_i^k$ is called the $k$th moment. Note that we have taken the outcomes to the $k$th power inside the sum, but not the probabilities. Even more generally, for any continuous function $g(x)$, $E(g(X)) = \sum_{i=1}^{n} p_i\, g(x_i)$.

Just as we can define the expected value of a discrete random variable, we can define the variance as well. Remember that in the case of descriptive statistics, the variance was the average squared deviation from the mean. The current context is no different. Thus, the variance of a discrete random variable $X$ is defined as $V(X) = E\left[(X - E(X))^2\right]$ (and NOT $\left\{E\left[X - E(X)\right]\right\}^2$). Again, a convenient shortcut formula is $V(X) = E(X^2) - \left[E(X)\right]^2$.

Example 6.4 What is the variance of our sleepy-time distribution from Example 6.3?

Solution 6.4 $V(X) = E(X^2) - [E(X)]^2 = 0^2(0.8) + 1^2(0.1) + 2^2(0.1) - (0.3)^2 = 0.41$. Recall, of course, that the more meaningful measure of spread is the standard deviation. That's not difficult to find, however, since it is just $SD(X) = \sqrt{V(X)} = \sqrt{0.41} \approx 0.64$.
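These calculations are easy to script; here is a sketch in plain Python for the sleepy-time distribution (the variable names are our own):

```python
# Sleepy-time distribution from Examples 6.3 and 6.4
outcomes = [0, 1, 2]
probs    = [0.8, 0.1, 0.1]

mean = sum(p * x for p, x in zip(probs, outcomes))     # E(X)   = 0.3
ex2  = sum(p * x**2 for p, x in zip(probs, outcomes))  # E(X^2) = 0.5
var  = ex2 - mean**2                                   # V(X)   = 0.41
print(mean, var, var ** 0.5)                           # SD(X) is about 0.64
```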

Just as we defined properties of the expectation, we can do so for the variance operator $V(\cdot)$:

$V(aX) = a^2 V(X)$ for any constant $a$

$V(aX + b) = a^2 V(X)$ (additive constants disappear, since they have no variance)

$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab\,\mathrm{cov}(X, Y)$

This last property can easily be generalized for sums of more than two random variables. Essentially what you get is that the variance of a sum of random variables is the sum of the variances, plus a bunch of covariance terms. Showing that any of these properties is true is not difficult; all you need is the definition of variance and a little patience with algebra.

By now you may be able to see that $E(X)$ tells us something about the mean of a distribution and $E(X^2)$ tells us something about the variance. As it turns out, for particular probability distributions, $E(X^3)$ tells us about the skewness of the distribution and $E(X^4)$ tells us about the kurtosis (how fat the tails of the distribution are).

If, instead of $E(X)$, you wish to use the median as your measure of central tendency for a discrete random variable, find the value that the random variable takes such that there is equal 0.5 probability of being greater than or less than that point. Depending on the number of values that the random variable can take, this may not be a unique number. Formally, the median is defined as the number $M$ such that $P(X \leq M) \geq 0.5$ and $P(X \geq M) \geq 0.5$. For example, $M = 0$ for our sleepy-time distribution from Example 6.3.


    Continuous Distributions

The primary difference between the properties of a continuous probability distribution and those of a discrete distribution is that the continuous case requires a little calculus. Recall that the continuous analogue of a histogram is a probability density function, $f(x)$, where

$f(x) \geq 0$ for all $x$ (that is, every value of $x$ has at least zero probability of occurring), and

$\int_{-\infty}^{\infty} f(x)\,dx = 1$ (that is, all the probabilities add up to one).

These should seem intuitive because, with the histogram (or, more formally, the probability mass function (pmf)) in the discrete case, the y-axis was relative frequency of occurrence. In the continuous world, it is actually the area under the probability density function over an interval on the x-axis that corresponds to a probability. For example, $P(a < X < b) = \int_{a}^{b} f(x)\,dx$. Note that this means that the probability of being at exactly one infinitesimal point is zero in the continuous case: $P(X = a) = \int_{a}^{a} f(x)\,dx = 0$.

Example 6.5 Verify that

$$f(x) = \begin{cases} x & 0 \leq x \leq \sqrt{2} \\ 0 & \text{otherwise} \end{cases}$$

is a probability density function.

Solution 6.5 We can see by inspection that this function is always above the x-axis. The more interesting question is the area under the function:

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{\sqrt{2}} x\,dx = \frac{1}{2}x^2 \Bigg|_{0}^{\sqrt{2}} = 1 - 0 = 1$$

We could have also checked this geometrically since, by inspection, the density function is a triangle. As a result, notice that the distribution is skewed left (its mean, about 0.94, lies below its median, 1).

An alternative to the probability density function (often abbreviated pdf) is the cumulative distribution function (often abbreviated cdf). The cdf, denoted by $F(x)$ instead of the pdf's $f(x)$, is defined as $F(x_0) = P(X \leq x_0)$. In other words, this function describes cumulative probabilities: the cdf at any point is the probability that the random variable $X$ takes on a value of less than or equal to $x_0$. Additionally, it is related to the pdf by $F'(x) = f(x)$. Though less common, cdfs can be quite useful in certain cases. For example, the median $M$ of a random variable is easily found by solving $F(M) = 1 - F(M) = 0.5$.
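For instance, the density $f(x) = x$ of Example 6.5 has cdf $F(x) = \frac{1}{2}x^2$ on $[0, \sqrt{2}]$, and the median can be found numerically; a sketch assuming SciPy is available:

```python
from scipy.optimize import brentq

F = lambda x: 0.5 * x**2        # cdf of f(x) = x on [0, sqrt(2)], Example 6.5

# The median is the root of F(M) - 0.5 on the support [0, sqrt(2)]
M = brentq(lambda x: F(x) - 0.5, 0, 2**0.5)
print(M)                        # 1.0
```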

    With this background on probability and cumulative distribution functions, we can now proceed

    to describe the most common continuous random variables, just as we did above with discrete

    ones.

    Continuous #1: The Uniform Distribution


The uniform distribution is a simple example of a continuous probability distribution. The height of the uniform distribution is constant; therefore, the probability that a random variable $X$ that is uniformly distributed takes on a value in some interval depends only on the length of the interval itself:

$$X \sim U[a, b] \implies f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$


Continuous #2: The Normal Distribution

Probabilities for the standard normal distribution $N(0, 1)$ can be looked up directly, since most statistics texts provide you with a table for it (for $-\infty < x < \infty$). Finding a probability for a general normal $N(\mu, \sigma^2)$ is slightly more involved, and requires the use of a process known as standardization.

Standardization is the process by which we can convert our $X \sim N(\mu, \sigma^2)$ to a $Z \sim N(0, 1)$. For this, we subtract off the mean and divide by the standard deviation, to find the z-score of a point of interest $x$. Thus, the z-score of $x$ is $\frac{x - \mu}{\sigma}$, and,

$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{b - \mu}{\sigma}\right) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right)$$

This works because the process of standardization preserves probabilities, allowing us to find a probability on a $N(5, 9)$ distribution by finding it on $N(0, 1)$ instead, and because

$$X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1).$$

This second assertion is easily proved using the properties of expectation and variance, as well as the rule $R \sim N(\mu, \sigma^2) \implies S = aR + b \sim N(a\mu + b, a^2\sigma^2)$.

Example 6.7 Suppose $X \sim N(5, 9)$; what is $P(X > 6)$?

Solution 6.7 Rather than integrate to find this probability, we will standardize and use the tables.

$$P(X > 6) = P\left(\frac{X - 5}{3} > \frac{6 - 5}{3}\right) = P(Z > 0.33) = 0.37073$$
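A sketch verifying Example 6.7 with SciPy (the table value 0.37073 reflects rounding the z-score to 0.33; the exact z-score is 1/3):

```python
from scipy.stats import norm

mu, sigma = 5, 3                        # X ~ N(5, 9), so sigma = sqrt(9) = 3
print(1 - norm.cdf((6 - mu) / sigma))   # standardize by hand: ~0.3694
print(norm.sf(6, loc=mu, scale=sigma))  # or let scipy handle mu and sigma: same
```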

[Standard normal probability table]

Normal variables have several other interesting properties. For example, when you scale a normal variable (that is, you multiply it by a constant), the result is a normal variable whose mean and variance are scaled accordingly: $X \sim N(\mu, \sigma^2) \implies \alpha X \sim N(\alpha\mu, \alpha^2\sigma^2)$. Also, the sum of two (or more) independent normal variables is normally distributed, with its mean the sum of the means and its variance the sum of the variances (if the variables are not independent then the variance also involves covariance terms, but the mean of the sum is still just the sum of the individual means).

Example 6.8 Consider a series of random variables $\{X_i\}_{i=1}^{n}$. Assume they are independent and distributed according to $X_i \sim N(\mu, \sigma^2)$. What is the probability distribution of $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$?

Solution 6.8

$$X_i \sim N(\mu, \sigma^2) \implies \sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2) \implies \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

This result, that the variance of a mean decreases with sample size, is a key result in statistical inference.
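A simulation sketch of this result, assuming NumPy and illustrative values $\mu = 5$, $\sigma = 3$, $n = 25$ of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 3.0, 25     # arbitrary illustrative values

# 100,000 samples of size n; their means should have variance sigma**2 / n
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
print(means.var())              # close to 0.36
print(sigma**2 / n)             # 9/25 = 0.36
```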

    Continuous #3: The Student t Distribution

Interestingly, the inventor of the Student t distribution invented it while working at the Guinness Brewery in Dublin in 1908. He wasn't allowed to publish under his own name, so he published under the name "Student". The main parameter for the t distribution is the degrees of freedom. This sole number parameterizes the distribution (that is, it determines the distribution's shape and location). Think of the Student t distribution as the close sibling of the standard normal distribution. Both are symmetric about the (zero) mean and both are used extensively in hypothesis testing. In fact, as the degrees of freedom of the t distribution go to infinity, the t distribution converges to the standard normal.

    Continuous #4: The F Distribution

The F distribution was invented by another anonymous brewmaster named "F". Well, okay, that's not actually true, but it would make these notes a lot more interesting! The F distribution is NOT centered at zero like the t and the standard normal. Instead, the F distribution is defined for positive $x$ only and is skewed to the right. In similar fashion to the t distribution, the F is parameterized by its degrees of freedom. The only difference is that the F has two sets of degrees of freedom, called the numerator degrees of freedom and the denominator degrees of freedom. This is because the ratio of two independent chi-squared random variables (defined next), each divided by its degrees of freedom, is F distributed.

Continuous #5: The $\chi^2$ (Chi-squared) Distribution

The chi-squared distribution is similar to the F in that it is defined for positive $x$ only and is skewed to the right. It also is used frequently for particular types of statistical inference. The most interesting aspect of the chi-squared distribution is that it is defined as the sum of independent squared standard normal random variables, where the degrees of freedom are equal to the number of standard normal variables in the sum. That is, if $Z \sim N(0, 1)$ and we observe $v$ draws from $Z$, then $Z_1^2 + Z_2^2 + \cdots + Z_v^2 \sim \chi_v^2$.
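A simulation sketch of this definition, assuming NumPy (it relies on the standard facts that a $\chi_v^2$ variable has mean $v$ and variance $2v$):

```python
import numpy as np

rng = np.random.default_rng(1)
v = 5                                   # degrees of freedom

# Sum v squared standard normal draws, repeated 100,000 times
draws = (rng.standard_normal((100_000, v)) ** 2).sum(axis=1)
print(draws.mean(), draws.var())        # close to v = 5 and 2v = 10
```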

    Characteristics of Continuous Random Variables

Just like we did for discrete random variables, we want to calculate the expectation and variance of continuous random variables. So, we will. The expectation of a continuous random variable $X$ with probability density function $f(x)$ is defined as $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$. Remember, integration is a kind of sum (of areas of really narrow rectangles), so this is a logical progression from the discrete expectation discussed above. Just as before, $E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$ and $E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$ for any continuous function $g(x)$.

Example 6.9 Find $E(X)$ for the distribution in Example 6.5.

Solution 6.9 $E(X) = \int_{0}^{\sqrt{2}} x \cdot x\,dx = \frac{x^3}{3}\Bigg|_{0}^{\sqrt{2}} = \frac{2\sqrt{2}}{3} \approx 0.94$.

The variance of a continuous random variable is the exact same formula as the variance for a discrete random variable, $V(X) = E\left[(X - E(X))^2\right] = E(X^2) - [E(X)]^2$; it's just that now these expectation operators make use of the continuous definition rather than the discrete one.

Example 6.10 If $X \sim U[a, b]$, what are $E(X)$ and $V(X)$?

Solution 6.10

$$E(X) = \int_{a}^{b} \frac{x}{b-a}\,dx = \frac{x^2}{2(b-a)}\Bigg|_{a}^{b} = \frac{b^2 - a^2}{2(b-a)} = \frac{(b-a)(b+a)}{2(b-a)} = \frac{a+b}{2}$$

$$E(X^2) = \int_{a}^{b} \frac{x^2}{b-a}\,dx = \frac{x^3}{3(b-a)}\Bigg|_{a}^{b} = \frac{b^3 - a^3}{3(b-a)} = \frac{b^2 + ab + a^2}{3}$$

$$V(X) = E(X^2) - [E(X)]^2 = \frac{b^2 + ab + a^2}{3} - \frac{(a+b)^2}{4} = \frac{4b^2 + 4ab + 4a^2 - 3a^2 - 6ab - 3b^2}{12} = \frac{b^2 - 2ab + a^2}{12} = \frac{(b-a)^2}{12}$$
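A symbolic check of these two results, a sketch assuming SymPy:

```python
import sympy as sp

x, a, b = sp.symbols('x a b', real=True)
f = 1 / (b - a)                                       # U[a, b] density (a < b)

EX  = sp.cancel(sp.integrate(x * f, (x, a, b)))       # (a + b)/2
EX2 = sp.cancel(sp.integrate(x**2 * f, (x, a, b)))    # (a**2 + a*b + b**2)/3
V   = sp.factor(sp.simplify(EX2 - EX**2))             # (a - b)**2/12, i.e. (b - a)**2/12
print(EX, V)
```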

Example 6.11 Find the variance of the distribution in Example 6.5.


Solution 6.11

$$E(X^2) = \int_{\text{all } x} x^2 f(x)\,dx = \int_{0}^{\sqrt{2}} x^2 \cdot x\,dx = \frac{x^4}{4}\Bigg|_{0}^{\sqrt{2}} = 1$$

$$V(X) = E(X^2) - [E(X)]^2 = 1 - (0.94)^2 \approx 0.116$$

Finally, since the median is the middle number, that must mean that half of the probability in the distribution is on one side of the median and half of the probability in the distribution is on the other side. So, for a continuous probability distribution, the median is defined as the point $M$ such that $\int_{-\infty}^{M} f(x)\,dx = \int_{M}^{\infty} f(x)\,dx = 0.5$. Solving this for $M$ is not always trivial, which is why the definition involving cumulative distribution functions is significantly preferable in this case (assuming the cdf has a closed form expression). For the distribution in Example 6.5, for instance, $F(M) = \frac{1}{2}M^2 = 0.5$ gives $M = 1$.

    Multivariate Probability Distributions

So far we have discussed marginal density functions, or probability density functions for one variable. In the previous lesson, we discussed correlation and how multiple variables may be related. We can capture the interdependence (and independence) of multiple random variables by looking at the multivariate probability distribution, specifically joint and conditional distributions. For our purposes, we will stick to the context of just two variables.

    Joint Probability Distributions

We will mostly focus on the continuous case here, noting only for the discrete case that, for two random variables $X$ and $Y$, the joint probability mass function is denoted by $P(X = x, Y = y)$. But we shall let our discussion of how to find such a probability be couched in the continuous case. If $X$ is the foreign aid that a country receives from the U.S. and $Y$ is the growth rate of that country's economy the month prior to aid being disseminated, we would expect the two variables to be related, since we would expect that aid would go to low-growth countries, so prior growth may decrease aid.

The probability that $X$ is in some interval and $Y$ is in some other interval can be calculated by knowing the joint density function (or bivariate density function) $f(x, y)$. Probabilities are found by integrating this function, as before:

$$P(x_1 < X < x_2,\ y_1 < Y < y_2) = \int_{y_1}^{y_2} \int_{x_1}^{x_2} f(x, y)\,dx\,dy.$$

Since $f(x, y)$ is a density function, we still have $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$. And we can actually derive either marginal distribution from the joint distribution, by integrating with respect to the variable we don't want: $f(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ and $f(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$.

Example 6.12 Consider the joint density function $f(x, y) = xy$ for $0 \leq x \leq 1$ and $0 \leq y \leq 2$. Verify that both this and the two marginal density functions are all proper density functions.


Solution 6.12 By inspection, $f(x, y) \geq 0$. The area under the function is

$$\int_{0}^{1}\int_{0}^{2} xy\,dy\,dx = \int_{0}^{1} x\,\frac{y^2}{2}\Bigg|_{0}^{2}\,dx = \int_{0}^{1} 2x\,dx = x^2\Big|_{0}^{1} = 1.$$

The marginal density functions are $f(x) = \int_{0}^{2} xy\,dy = 2x$ and $f(y) = \int_{0}^{1} xy\,dx = \frac{1}{2}y$. So then $\int_{0}^{1} f(x)\,dx = \int_{0}^{1} 2x\,dx = x^2\Big|_{0}^{1} = 1$ and $\int_{0}^{2} \frac{1}{2}y\,dy = \frac{1}{4}y^2\Bigg|_{0}^{2} = 1$.
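A symbolic check of Solution 6.12, a sketch assuming SymPy:

```python
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = x * y                                     # joint density on 0<=x<=1, 0<=y<=2

print(sp.integrate(f, (x, 0, 1), (y, 0, 2)))  # total probability: 1
print(sp.integrate(f, (y, 0, 2)))             # marginal f(x) = 2*x
print(sp.integrate(f, (x, 0, 1)))             # marginal f(y) = y/2
```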

    Conditional Probability Distributions

Recall the definition of conditional probability, $P(A|B) = \frac{P(A \cap B)}{P(B)}$, and note that this definition relates the marginal, joint, and conditional probabilities to one another. The same relationship exists in the context of probability distributions: $f(y|x) = \frac{f(x, y)}{f(x)}$, where $f(y|x)$ is the conditional density function. The discrete case is an even more direct application of the definition:

$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} \qquad \text{and} \qquad P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)}.$$

Example 6.13 Consider the following joint distribution of two random variables $X$ and $Y$:

            $X = 1$    $X = 0$
$Y = 2$       0.4        0.1
$Y = 4$       0.2        0.3

Find the conditional distribution of X given Y.

Solution 6.13 All we need to do is find the conditional probabilities for all possible combinations of values that $X$ and $Y$ can take. Thus,

$$P(X = 1 \mid Y = 2) = \frac{P(X = 1, Y = 2)}{P(Y = 2)} = \frac{0.4}{0.4 + 0.1} = 0.8$$

$$P(X = 1 \mid Y = 4) = \frac{P(X = 1, Y = 4)}{P(Y = 4)} = \frac{0.2}{0.2 + 0.3} = 0.4$$

$$P(X = 0 \mid Y = 2) = \frac{P(X = 0, Y = 2)}{P(Y = 2)} = \frac{0.1}{0.4 + 0.1} = 0.2$$

$$P(X = 0 \mid Y = 4) = \frac{P(X = 0, Y = 4)}{P(Y = 4)} = \frac{0.3}{0.2 + 0.3} = 0.6$$

If asked, we could find the conditional distribution of $Y$ given $X$ similarly.
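A sketch in plain Python that recovers Solution 6.13 from the joint table (the dictionary representation is our own choice):

```python
# Joint distribution from Example 6.13, keyed by (x, y)
joint = {(1, 2): 0.4, (0, 2): 0.1, (1, 4): 0.2, (0, 4): 0.3}

# Marginal distribution of Y ...
pY = {}
for (x, y), p in joint.items():
    pY[y] = pY.get(y, 0.0) + p

# ... then the conditional P(X = x | Y = y)
cond = {(x, y): p / pY[y] for (x, y), p in joint.items()}
print(cond)   # {(1, 2): 0.8, (0, 2): 0.2, (1, 4): 0.4, (0, 4): 0.6}
```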


Now that we have defined conditional probability distributions, we can formally define what it means for two random variables to be independent and, therefore, more generally, how random variables may be related or unrelated. Recall that two events A and B are independent if and only if $P(A \cap B) = P(A)P(B)$, or, equivalently, iff $P(A|B) = P(A)$ and $P(B|A) = P(B)$. Well, two discrete random variables are independent iff $P(X = x, Y = y) = P(X = x)P(Y = y)$ or $P(X = x \mid Y = y) = P(X = x)$ and $P(Y = y \mid X = x) = P(Y = y)$, and continuous random variables are independent iff $f(x, y) = f(x)f(y)$ or $f(x|y) = f(x)$ and $f(y|x) = f(y)$. Note that the variables in Example 6.12 were independent, since $f(x)f(y) = (2x)(0.5y) = xy = f(x, y)$.

    Characteristics of Multivariate Probability Distributions

Just like for marginal distributions, we can also calculate the expectation of jointly or conditionally distributed random variables. Depending on the joint distribution, this may involve finding $E(X + Y) = E(X) + E(Y)$, $E(XY)$, or $E(X^2 Y)$. Furthermore, the covariance of two random variables is

$$\mathrm{cov}(X, Y) = E\left[(X - E(X))(Y - E(Y))\right] = E(XY) - E(X)E(Y),$$

so being able to find $E(XY)$ can be especially important. In other words, we can determine how two variables are correlated from the joint distribution of those variables.

Example 6.14 Find $E(XY)$ for the joint distribution from Example 6.12.

Solution 6.14 $E(XY) = \int_{0}^{1}\int_{0}^{2} xy\,f(x, y)\,dy\,dx = \int_{0}^{1}\int_{0}^{2} x^2 y^2\,dy\,dx = \int_{0}^{1} x^2\,\frac{y^3}{3}\Bigg|_{0}^{2}\,dx = \int_{0}^{1} \frac{8}{3}x^2\,dx = \frac{8}{9}$.

Calculating expectations for conditional distributions is pretty much the same as calculating them for marginal distributions. The only difference is that we need to use the conditional probabilities or density functions for the discrete and continuous cases, respectively. That is,

$$E(X \mid Y = y) = \sum_{\text{all } x} x\,P(X = x \mid Y = y)$$

and vice versa for discrete random variables, while

$$E(X \mid Y = y) = \int_{\text{all } x} x\,f(x \mid y)\,dx$$

and vice versa for continuous variables.

Example 6.15 Using the discrete probability distribution in Example 6.13, find $E(Y \mid X = 1)$.

Solution 6.15

$$E(Y \mid X = 1) = \sum_{\text{all } y} y\,P(Y = y \mid X = 1) = 4\,P(Y = 4 \mid X = 1) + 2\,P(Y = 2 \mid X = 1)$$
$$= 4\left(\frac{0.2}{0.2 + 0.4}\right) + 2\left(\frac{0.4}{0.2 + 0.4}\right) = \frac{4}{3} + \frac{4}{3} = \frac{8}{3} \approx 2.67$$

Note that conditional expectations are functions of the variable that is given. Thus, for an arbitrary value of the random variable $X$ (i.e. $X = x$) that is given, the conditional expectation is a function of that arbitrary value of $X$; in other words, the conditional expectation is itself a random variable! As a result, it is possible to take the expectation of it again. The Law of Iterated Expectations tells us that $E(E(X|Y)) = E(X)$; in fact, this holds for any function of $X$ and $Y$: $E\left(E\left(g(X, Y) \mid X\right)\right) = E\left(E\left(g(X, Y) \mid Y\right)\right) = E\left(g(X, Y)\right)$. The implication here is that we can find unconditional expectations from conditional expectations, if the latter is easier to find initially. One use of this law is to write random variables' covariance in terms of conditional expectations: $\mathrm{cov}(X, Y) = E\left(E(XY \mid X)\right) - E(X)E(Y) = E\left(X\,E(Y \mid X)\right) - E(X)E(Y)$.