
Page 1: Lecture 10: Probability distributions

DANIEL WELLER

TUESDAY, FEBRUARY 19, 2019

Page 2: Agenda

What is probability? (again)

Describing probabilities (distributions)

Understanding probabilities (expectation)

Partial information (conditional probability)

Probability distributions are useful for describing and predicting how likely random events are to occur. Histograms, like the red, green, and blue color channel plots on this slide, provide empirical probability distributions for real-world signals such as digital images.
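To make this concrete, here is a minimal Python sketch of estimating an empirical pmf from pixel intensities; the image array is random stand-in data, not the slide's image:

```python
# Minimal sketch: estimating an empirical pmf from pixel intensities.
# The image here is random stand-in data, not the slide's actual image.
import numpy as np

image = np.random.randint(0, 256, size=(64, 64))    # hypothetical 8-bit image

counts = np.bincount(image.ravel(), minlength=256)  # histogram of intensities
pmf = counts / counts.sum()                         # relative frequencies

assert abs(pmf.sum() - 1.0) < 1e-12                 # a valid pmf sums to 1
```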


Image credit: Daniel Schwen/Wikipedia.

Page 3: What is probability?

Recall: Probability describes the likelihood of a random event occurring.

◦ Likelihood: chance or relative frequency of occurrence

◦ Random event: a set of possible outcomes of a trial or experiment, determined by chance.

Outcomes are fairly arbitrary objects (e.g., heads or tails, dice roll, playing card) drawn from a set called the sample space. Unfortunately, such generality is also inconvenient – mathematics works best with actual numbers.

How do we transform coin flips, die rolls, card draws, etc., into numbers? Solution: define a random variable.

Page 4: Random variables

A random variable assigns a real number to any possible outcome of a trial or experiment.

◦ E.g.: heads = 0, tails = 1

Random variables allow us to translate general ideas about probability to numeric concepts (sketched in code below):
◦ Finite or discrete sample space → finite or discrete value range of the random variable

◦ Continuous sample space → continuous value range of the random variable

◦ Events as sets of outcomes → events as intervals, or collections of intervals, on the real number line

◦ Probability function on events → probability function on intervals, or ranges, on the real line
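A minimal Python sketch of this mapping, using the heads = 0, tails = 1 encoding from the previous slide; the function name X is our own:

```python
# A random variable is a function from outcomes to real numbers.
# Here the sample space is {"heads", "tails"}, with X(heads)=0, X(tails)=1.
import random

def X(outcome: str) -> int:
    return {"heads": 0, "tails": 1}[outcome]

outcome = random.choice(["heads", "tails"])  # one trial of the experiment
print(outcome, "->", X(outcome))             # e.g., "tails -> 1"
```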

Page 5: Random variables

Example: Define a random variable for a birthdate (discrete).

Example: Define a random variable for a temperature (continuous).


Image credits: (top) iambaker.net, (bottom) homesciencetools.com.

Page 6: Probability functions

We have already described probabilities as assigned to events, satisfying important axioms:

◦ 1) The probability of an event is a nonnegative real number between 0 and 1 (higher = more likely).

◦ 2) The probability of the sample space P(Ω) = 1.

◦ 3) The probability of any collection of mutually exclusive events occurring is the sum of the probabilities of the individual events:

Today:
◦ Let us translate these probability axioms from outcomes to random variables, and from events to intervals.

◦ Instead of talking about P(heads) or P(two coin flips the same), we will talk about P(X > 1) or P(X ≤ 3).

◦ This leads to the definition of a cumulative distribution function, or cdf, defined on the real numbers.

$P\left(\bigcup_n E_n\right) = \sum_n P(E_n)$

The "∪" operator is the union of sets; the collection of events may be finite or countably infinite.

$P(X \le x) = F_X(x)$

Page 7: Probability functions

Recall: the probability function is defined on a collection of events.

This collection has certain requirements:
◦ 1) It contains the sample space Ω (and the empty set).

◦ 2) It is closed under complements: if it contains event E, it contains E's complement.

◦ 3) It is closed under countable unions of its events.

It turns out this is not enough to guarantee we can talk about probabilities for random variables.
◦ To do that, we would need what's called a Borel σ-algebra: the smallest σ-algebra containing the open sets of the real line.

◦ Instead, we will work backwards and assume that any interval of values of the random variable X we want to measure corresponds to a well-defined event. It turns out the topological theory would lead to this anyway.

Page 8: Cumulative distribution functions

Cumulative distribution functions $F_X(x)$ have three important properties:

◦ 1) The function is non-decreasing: if $a > b$, then $F_X(a) \ge F_X(b)$.

◦ 2) The function is right-continuous: $\lim_{x \to a^+} F_X(x) = F_X(a)$.

◦ 3) The function increases from 0 to 1: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$.

Three cases of the cdf, for discrete, continuous, and hybrid (mixed) random variables:

◦ Discrete: jumps at discrete values, constant between them

◦ Continuous: continuous function, no jumps

◦ Hybrid: jumps at discrete values, continuous (but not necessarily constant) between them


Image credit: Incnis Mrsi/Wikipedia.

Page 9: Cumulative distribution functions

Example: temperature range in the classroom

Example: sum of values from a roll of a pair of fair 6-sided dice
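For the dice example, a short Python sketch (our own illustration) that builds the pmf by enumerating the 36 equally likely outcomes and evaluates the resulting step-function cdf:

```python
# cdf of the sum of two fair six-sided dice, by enumeration.
from fractions import Fraction
from itertools import product

pmf = {}
for a, b in product(range(1, 7), repeat=2):   # 36 equally likely outcomes
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

def F(x):
    """F_X(x) = P(X <= x), a step function with jumps at 2, 3, ..., 12."""
    return sum((p for s, p in pmf.items() if s <= x), Fraction(0))

print(F(1), F(7), F(12))  # 0, 21/36, 1
```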

Page 10: Cumulative distribution function

Why a cumulative distribution function versus a probability function?

◦ We said that random variables simplify analysis by describing probability outcomes with numbers.

◦ The cdf allows us to determine probabilities for these random variable representations.

◦ What kinds of "intervals" can we describe?
◦ P(X ≤ x) – interval closed at (contains) its maximum

◦ P(X > x) – interval open at (does not contain) minimum

◦ Empty interval (empty set) and full range (sample space)

◦ Unions, intersections of these: P(a < X ≤ b), P(X ≤ a or X > b), …

◦ For a discrete value, we can also define P(X = a) as the size of the jump in the cdf at a:

◦ For continuous values, this P(X = a) equals zero! Does this mean that continuous-valued random variables never occur? No, it just means we would need a different way to interpret them.

◦ We will focus on discrete-valued random variables in this course.

$P(X = a) = F_X(a) - \lim_{x \to a^-} F_X(x)$

Page 11: Discrete probability distributions

For random variables with discrete values, we can use P(X = x) to describe how relatively likely different random variable values, and their associated outcomes, are to occur.

◦ This description is called a probability mass function (pmf).

◦ Note: pmf’s do not apply to continuous random variables.

Discrete probability distributions provide a common language for describing these pmf's:
◦ Bernoulli(p) distribution: two values (usually 0 and 1), with P(X = 1) = p.

◦ Binomial(N,p) distribution: integer values from 0 to N, formed by a sum of N independent Bernoulli(p) variables.

◦ Many other discrete distributions exist and are covered in probability courses. We will focus on these two given their relation to information theory and digital communications (later).

Page 12: Prediction

Suppose we want to use probability to predict the outcome of an experiment. Given a random variable X for the outcome, and a probability distribution for that random variable, how do we come up with the best prediction of X?

Some candidates:
◦ The most likely outcome (mode)

◦ The outcome with the smallest bias (mean)

◦ The “middle” outcome (median)

◦ The outcome with the smallest total error

The issue is how we define "best," and how we use the information provided by the probability distribution to achieve it.

Page 13: Prediction

Why not the most likely outcome (mode)? Here's a counterexample:

Better alternative: mean (easy)

Even better alternative: median (harder)

[Figure: plot of the pmf P(X = x) versus x for the counterexample.]
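A numerical sketch of the kind of counterexample pictured; the pmf values are made up for illustration:

```python
# A pmf whose mode is far from most of the probability mass.
pmf = {0: 0.3, 7: 0.2, 8: 0.2, 9: 0.2, 10: 0.1}  # made-up values

mode = max(pmf, key=pmf.get)               # most likely single value: 0
mean = sum(x * p for x, p in pmf.items())  # about 5.8

cum, median = 0.0, None                    # smallest x with F(x) >= 0.5
for x in sorted(pmf):
    cum += pmf[x]
    if median is None and cum >= 0.5:
        median = x

print(mode, mean, median)  # 0, 5.8, 7: the mode is a poor prediction here
```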

Page 14: Prediction and expected value

So, what about the mean?

The mean, or expected value, includes the relative frequencies of all the values, not just the single largest. We compute the expected value E{X} as:

The expected value is a fixed number, since the sum runs over all possible values of X.

Some useful properties of expectations:
◦ Linearity: if X, Y are random variables, and a, b are constants, then E{aX+bY} = aE{X}+bE{Y}

◦ Unbiased: we define bias as the difference between a prediction and the expected value E{X}; predicting E{X} itself gives bias E{X} − E{X} = 0

◦ If X, Y are uncorrelated, with means μX, μY, then E{(X-μX)(Y-μY)} = 0. More on this later.

$E\{X\} = \sum_x x \cdot P(X = x)$
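A direct implementation of this sum, as a sketch using exact fractions; the pmfs anticipate the examples on the next slide:

```python
# E{X} = sum over x of x * P(X = x), with exact arithmetic.
from fractions import Fraction

def expect(pmf):
    return sum(x * p for x, p in pmf.items())

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # fair coin, heads=0, tails=1
die = {k: Fraction(1, 6) for k in range(1, 7)}  # fair six-sided die

print(expect(coin))  # 1/2
print(expect(die))   # 7/2, i.e., 3.5
```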

Page 15: Expected value

Example: Let's imagine a fair coin flip:

Example: How about rolling a fair six-sided die?

Page 16: Expected value and variance

The mean does not capture the variability of the random variable: how much the outcome is expected to change from trial to trial. It is possible for random outcomes to vary widely from the expected value (or any other prediction). How do we quantify this?

Variance, or second central moment, measures squared-variability of X:

Standard deviation, σX, is the square root of the variance.

This is one common measure of variability, or spread, of possible outcomes of a random trial.

$\sigma_X^2 = \mathrm{Var}(X) = E\{(X - \mu_X)^2\} = \sum_x (x - \mu_X)^2 \cdot P(X = x)$
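The same kind of sketch for the variance sum, with the coin and die pmfs repeated so it runs on its own:

```python
# Var(X) = sum over x of (x - mu_X)^2 * P(X = x), with exact arithmetic.
from fractions import Fraction

def expect(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expect(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}
print(variance(coin))  # 1/4, which is p(1-p) with p = 1/2
print(variance(die))   # 35/12
```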

Page 17: Variance examples

Example: variance of a fair coin flip (what if it's not fair?)

Example: variance of a fair six-sided die roll

Page 18: Properties of variance

Just like expected value, the variance has certain properties:

◦ If we scale a random variable X by a constant a, then Var(aX) = |a|²Var(X).

◦ If we add a constant to a random variable, the variance does not change.

◦ The variance of the sum of uncorrelated random variables is the sum of the variances:

Example: Variance of a sum of uncorrelated die rolls (sketched in code below)

$\mathrm{Var}(X + Y) = E\{((X - \mu_X) + (Y - \mu_Y))^2\}$
$= E\{(X - \mu_X)^2\} + E\{(Y - \mu_Y)^2\} + 2\,E\{(X - \mu_X)(Y - \mu_Y)\}$
$= \mathrm{Var}(X) + \mathrm{Var}(Y) + 0$
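A quick numerical check of this property (a sketch, assuming two independent fair dice):

```python
# Check Var(X + Y) = Var(X) + Var(Y) for two independent fair dice.
from fractions import Fraction
from itertools import product

die = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expect(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

sum_pmf = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    sum_pmf[x + y] = sum_pmf.get(x + y, 0) + px * py  # independence: P(x,y) = P(x)P(y)

print(variance(sum_pmf))              # 35/6
print(variance(die) + variance(die))  # 35/6: the variances add
```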

Page 19: Conditional probability

So far, we have mainly ignored relationships that can exist among random events. However, real events frequently are interdependent.

◦ For instance, the event of UVA men’s basketball losing its second game against Duke probably had something to do with the first loss (e.g., most of the same players).

Depending on the outcome of one experiment or trial (A), the likelihood of a subsequent trial (B) may change! We ask how likely B is to occur if A occurs (or does not occur). We write: P(B|A), or the probability of B given A. Its definition is the ratio of two probabilities:

◦ It is undefined if P(A) = 0

◦ The numerator indicates the joint probability of A occurring and B occurring.

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
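A minimal sketch of this definition by enumeration; the events A and B here are our own illustration on a single fair die roll:

```python
# P(B | A) = P(A and B) / P(A), by enumeration on one fair die roll.
from fractions import Fraction

omega = set(range(1, 7))                 # sample space, equally likely
A = {w for w in omega if w % 2 == 0}     # A = "roll is even": {2, 4, 6}
B = {w for w in omega if w >= 4}         # B = "roll is at least 4": {4, 5, 6}

P = lambda event: Fraction(len(event), len(omega))
print(P(A & B) / P(A))   # P(B|A) = (2/6) / (3/6) = 2/3
print(P(B))              # 1/2: knowing A occurred changed the odds of B
```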

Page 20: Conditional probability

Example: Roll a six-sided die and flip that many fair coins. What is the probability distribution of the number of heads given a die roll of 3? Of 6?
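A sketch of the conditional distributions asked for here: given the die shows n, the number of heads among n fair flips is binomial(n, 1/2):

```python
# P(k heads | roll = n) = C(n, k) / 2^n for n fair coin flips.
from fractions import Fraction
from math import comb

def heads_given_roll(n):
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}

print(heads_given_roll(3))  # {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
print(heads_given_roll(6))  # {0: 1/64, 1: 3/32, ..., 6: 1/64}
```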

Page 21: Independence

Events A and B are independent if A tells us nothing about B, and vice versa. So P(A|B) = P(A), and P(B|A) = P(B). This also means that P(A ∩ B) = P(A)P(B). Otherwise, they are dependent.

A similar definition holds for probabilities of random variables. We say X and Y are independent if P(X|Y) = P(X) and P(Y|X) = P(Y), where P(X,Y) is the joint distribution of X and Y, and

Independent random variables are also uncorrelated, but the converse frequently is not true. Why?

$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}, \qquad P(Y \mid X) = \frac{P(X, Y)}{P(X)}$

Page 22: Independence

What about more than two random variables?

◦ We can talk about independence of any pair of random variables.

◦ We can talk about independence among any group of random variables:

◦ If an entire group of random variables is independent, we say they are mutually independent.

◦ Mutual independence implies pairwise independence, but the converse is not true (see the sketch below).

In general, random variables X, Y, Z, … are mutually independent if and only if E{f(X)g(Y)h(Z)…} = E{f(X)}E{g(Y)}E{h(Z)}… for any* function f, g, h, …. This is an equivalent definition and also a powerful tool (proves uncorrelatedness among other things).

$P(X, Y, Z, \ldots) = P(X)\,P(Y)\,P(Z) \cdots$

* Technically, only those functions that have valid expected values
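A sketch of the classic counterexample behind that earlier bullet, using two fair bits and their XOR; the variable names are our own:

```python
# Counterexample: X, Y fair bits, Z = X XOR Y. Every pair is independent,
# but the three are not mutually independent (Z is determined by X and Y).
from fractions import Fraction
from itertools import product

joint = {(x, y, x ^ y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

def marginal(idxs):
    out = {}
    for w, p in joint.items():
        key = tuple(w[i] for i in idxs)
        out[key] = out.get(key, 0) + p
    return out

pX, pZ, pXZ = marginal([0]), marginal([2]), marginal([0, 2])
print(all(pXZ[x, z] == pX[(x,)] * pZ[(z,)]
          for (x, z) in pXZ))    # True: X and Z are pairwise independent
print(joint.get((1, 1, 1), 0))   # 0, but P(X)P(Y)P(Z) would be 1/8
```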

Page 23: Bernoulli distribution

Let's give our first probability distribution example:

Bernoulli(p) is a two-valued discrete distribution (on 0 and 1) with P(X = 1) = p, which means P(X = 0) = 1 − p. Let's plot the cdf:

What is the expected value E{X} in terms of p?

What is the variance Var(X) in terms of p?
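A numerical sketch that checks both answers for one (arbitrary) choice of p:

```python
# Bernoulli(p): mean p and variance p(1 - p), checked for one choice of p.
from fractions import Fraction

p = Fraction(3, 10)       # any p in [0, 1]
pmf = {0: 1 - p, 1: p}

mean = sum(x * q for x, q in pmf.items())
var = sum((x - mean) ** 2 * q for x, q in pmf.items())
print(mean, p)            # 3/10 3/10: E{X} = p
print(var, p * (1 - p))   # 21/100 21/100: Var(X) = p(1-p)
```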

Page 24: Towards the binomial distribution

A Bernoulli distribution is one way to describe a coin flip, but the coin does not have to be fair.

What if we flip a large number of coins? What is the fraction that comes up tails (1)?

What can we say about the distribution of coin flips that come up heads or tails?

What are we assuming? Independence, of course…

Page 25: Towards the binomial distribution

Let's imagine N coins are fair: P(0) = P(1) = 0.5. If we flip them, we can get 2^N possible outcomes. That's a lot to count:

How many ways are there to choose n things from N items? How does this relate to counting heads or tails?

Let’s see, what is the probability of flipping 10 fair coins and getting exactly 5 heads?

Pascal's triangle of binomial coefficients (rows N = 0 to 4):

        1
       1 1
      1 2 1
     1 3 3 1
    1 4 6 4 1

Page 26: Binomial distribution

Now, let's imagine the flips are not necessarily fair; P(1) = p.

What is the probability of flipping N coins and getting n tails (1’s)?

What does the cdf look like for binomial(5,0.5)?

Page 27: Binomial distribution

What is the expected value for X ~ binomial(N,p)? We use "~" to signify X's distribution.

What is the variance for X ~ binomial(N,p)?

Hint: use the properties of expected value and variance we've discussed (checked numerically in the sketch below).
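A sketch of the hint in action, checking E{X} = Np and Var(X) = Np(1−p) directly against the binomial pmf for one choice of (N, p):

```python
# X ~ binomial(N, p) is a sum of N independent Bernoulli(p) variables, so
# linearity gives E{X} = N*p, and the variance-sum property for independent
# variables gives Var(X) = N*p*(1-p). Checked against the pmf directly.
from math import comb

N, p = 5, 0.5
pmf = {n: comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)}

mean = sum(n * q for n, q in pmf.items())
var = sum((n - mean) ** 2 * q for n, q in pmf.items())
print(mean, N * p)           # 2.5 2.5
print(var, N * p * (1 - p))  # 1.25 1.25
```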

Page 28: Why Bernoulli and binomial?

Why these distributions? It turns out coin flips and bits have a lot in common:

◦ Two values (heads/tails, 0 and 1)

◦ Fair or biased coin → random bits (e.g., communication errors)

◦ We can describe large numbers of bits easily using a binomial distribution.

Many other discrete (and continuous) probability distributions exist. Curious? Take APMA 3100.

Page 29: Announcements

Next time: Probability and information theory

Homework #4 due today.

Lab #4 will happen today.

Homework #5 will go out next week.
