Probability and Sampling Recitation
Avishek Dutta
CS155 Machine Learning and Data Mining
February 2, 2017
Avishek Dutta Probability and Sampling Recitation
Motivation
Uncertainty is everywhere around us
"What is the chance that it will rain today?" "When will the next bus arrive?" "Will I go to recitation today?"
Machine learning tries to understand these uncertainties and interact with the real world
Probability theory is the mathematical study of uncertainty
Basic Concepts
Sample Space Ω: set of all possible outcomes
An event, A, is a subset of Ω
P(A) is the probability of event A occurring
0 ≤ P(A) ≤ 1
P(∅) = 0
P(Ω) = 1
P(Aᶜ) = 1 − P(A)
Examples
Suppose you are rolling a six-sided die. Then we have:
Ω = {1, 2, 3, 4, 5, 6}
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
P({1, 2, 3, 4, 5, 6}) = 1
Union
Given two events, A and B, the probability of A or B is
P(A ∪ B) = P(A) + P(B)− P(A ∩ B) (1)
= P(A) + P(B) (2)
where (1) and (2) are equal only if A and B are mutually exclusive, i.e. P(A ∩ B) = 0
Examples
Suppose you are rolling a six-sided die. Then we have:
P({2, 4, 6}) = P(2) + P(4) + P(6) = 1/2
P({1, 2, 3} ∪ {2, 4, 6}) = P({1, 2, 3}) + P({2, 4, 6}) − P({2}) = 1/2 + 1/2 − 1/6 = 5/6
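As a quick sanity check, inclusion-exclusion can be verified by enumerating the die's sample space directly; a minimal sketch (the events A and B follow the example above):

```python
from fractions import Fraction

# Sample space of one fair six-sided die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Probability of an event under the uniform distribution on omega.
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3}
B = {2, 4, 6}

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
```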
Joint and Conditional Probabilities
Given two events, A and B,
P(A,B) is the joint probability of A and B occurring together
P(A | B) is the conditional probability of A occurring given that we know B has occurred
Joint and Conditional Probabilities are related in the following way:
P(A | B) = P(A, B) / P(B)
where we assume that P(B) ≠ 0.
Joint and Conditional Probabilities
Examples
Consider a standard deck of cards. What is the probability that the first two cards drawn are both Kings?
Event A = First card drawn is a King
Event B = Second card drawn is a King
A standard deck of cards has 4 Kings and 52 total cards.
P(A) = 4/52
P(B | A) = 3/51
This means that
P(A, B) = P(B | A)P(A) = 3/51 · 4/52 = 1/221 ≈ 0.005
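This result can also be checked by simulation; a minimal sketch (the deck encoding, seed, and trial count are arbitrary choices):

```python
import random

# Estimate P(first two cards drawn are both Kings) by repeated sampling
# without replacement; the exact answer is 1/221 ≈ 0.0045.
deck = ["K"] * 4 + ["x"] * 48   # 4 Kings plus 48 non-King cards
random.seed(0)

trials = 200_000
hits = sum(
    all(card == "K" for card in random.sample(deck, 2))
    for _ in range(trials)
)
estimate = hits / trials
```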
Joint Probabilities - Chain Rule
An extension of P(A,B) = P(B | A)P(A):
P(A1, A2, ..., An) = P(An, ..., A2, A1)
  = P(An | An−1, ..., A2, A1) P(An−1, ..., A2, A1)
  ...
  = ∏_{i=1}^{n} P(Ai | A1, A2, ..., Ai−1)
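For instance, the chain rule gives the probability that the first four cards drawn from a deck are all Kings; a sketch (the counting cross-check is an extra step, not from the slides):

```python
from fractions import Fraction
from math import comb

# Ai = "the i-th card drawn is a King"; multiply the conditional probabilities.
p = Fraction(1)
for i in range(4):
    # Given i Kings already drawn, 4 − i Kings remain among 52 − i cards.
    p *= Fraction(4 - i, 52 - i)

# Cross-check: the same probability obtained by counting is 1 / C(52, 4).
matches_counting = (p == Fraction(1, comb(52, 4)))
```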
Independence
Two events, A and B, are independent if
P(A,B) = P(A)P(B)
or equivalently
P(A | B) = P(A)
In other words, knowledge of whether B occurred does not affect the probability that A occurs
Independence
Examples
Suppose that you roll a six-sided die twice. What is the probability that you roll 1 both times?
A = rolling a 1 in the first roll
B = rolling a 1 in the second roll
These two events are independent so
P(A, B) = P(A)P(B) = 1/6 · 1/6 = 1/36
Bayes Theorem
We know that the following is true
P(A,B) = P(A | B)P(B) = P(B | A)P(A)
This implies that
P(A | B) = P(B | A)P(A) / P(B)
P(A | B) ∝ P(B | A)P(A)
P(A | B) - Posterior Probability
P(B | A) - Likelihood Function
P(A) - Prior Information
P(B) - Evidence
Bayes Theorem
Examples
If a person has an allergy (A), sneezing (S) is observed with probability P(S | A) = 0.8. What is the chance a person has an allergy given that they are sneezing: P(A | S)?
Assume that P(A) = 0.001 (few people have allergies)
Assume that P(S) = 0.1 (many people sneeze).
P(A | S) = P(S | A)P(A) / P(S) = (0.8 · 0.001) / 0.1 = 0.008
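The computation is a one-liner in code; a minimal sketch using the numbers above:

```python
# Bayes' theorem with the numbers from the allergy example.
p_S_given_A = 0.8    # P(sneezing | allergy)
p_A = 0.001          # prior: P(allergy)
p_S = 0.1            # evidence: P(sneezing)

# Posterior: P(A | S) = P(S | A) P(A) / P(S)
p_A_given_S = p_S_given_A * p_A / p_S
```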
Random Variables
So far, we have only considered simple examples with simple events as the possible outcomes
A = rolling 1 on a six-sided die
B = first card drawn is a King
This is mathematically imprecise and can be limiting. The notion of random variables resolves these issues
A random variable X is a function X : Ω → ℝ
Abuse of notation: P(X = x) is the probability that the random variable takes on value x
Random Variables
Examples
Dice Roll: X = i for i ∈ {1, 2, 3, 4, 5, 6} with P(X = i) = 1/6
Biased Coin: X = 1 with P(X = 1) = p and X = 0 with P(X = 0) = 1 − p
All random variables have a Cumulative Distribution Function (CDF):
F (x) = P(X ≤ x)
Some properties of the CDF are:
0 ≤ F(x) ≤ 1
F(x) is monotonically increasing
lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
Discrete Random Variables
A random variable that can take on only finitely or countably many different values
Discrete random variables have a Probability Mass Function (PMF):
p(x) = P(X = x)
The PMF satisfies ∑_x P(X = x) = 1
Examples
Suppose you are flipping two coins. Let X be the random variable that equals the number of heads
P(X = 0) P(X = 1) P(X = 2)
0.25 0.5 0.25
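The PMF above can be recovered by enumerating the four equally likely outcomes; a minimal sketch:

```python
from itertools import product
from collections import Counter

# All outcomes of two fair coin flips, encoding heads as 1 and tails as 0.
outcomes = list(product([0, 1], repeat=2))

# X = number of heads; count outcomes per value and normalize.
counts = Counter(sum(o) for o in outcomes)
pmf = {x: c / len(outcomes) for x, c in counts.items()}
```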
Continuous Random Variables
A random variable that can take on uncountably many different values
Continuous random variables have a Probability Density Function (PDF):
p(x) = f(x) = d/dx F(x)
The PDF satisfies ∫_{−∞}^{∞} f(x) dx = 1
Examples
Suppose that X is a random variable that is equal to a value chosen uniformly at random from the interval [0, a]. Then
F(x) = x/a and f(x) = 1/a for x ∈ [0, a]
Marginal Distribution
Suppose that for two random variables, X and Y, the joint distribution p(x, y) is known for all combinations of X and Y. Then the marginal distribution of X is
p(x) = ∑_y p(x, y) (discrete)    p(x) = ∫_{−∞}^{∞} p(x, y) dy (continuous)
and the marginal distribution of Y is
p(y) = ∑_x p(x, y) (discrete)    p(y) = ∫_{−∞}^{∞} p(x, y) dx (continuous)
Marginal Distribution
Examples
Consider two random variables X and Y. Suppose the joint distribution is given by
        x1     x2     x3     x4     p(y)
y1    4/32   2/32   1/32   1/32    8/32
y2    2/32   4/32   1/32   1/32    8/32
y3    2/32   2/32   2/32   2/32    8/32
y4    8/32      0      0      0    8/32
p(x) 16/32   8/32   4/32   4/32   32/32
p(x1) = ∑_y p(x1, y) = 16/32
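With the joint distribution stored as an array, marginalization is a sum over one axis; a sketch using numpy (the row/column layout mirrors the table above):

```python
import numpy as np

# Joint distribution from the example: rows are y1..y4, columns are x1..x4.
joint = np.array([
    [4, 2, 1, 1],
    [2, 4, 1, 1],
    [2, 2, 2, 2],
    [8, 0, 0, 0],
]) / 32

p_x = joint.sum(axis=0)   # marginal of X: sum out y (down each column)
p_y = joint.sum(axis=1)   # marginal of Y: sum out x (across each row)
```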
Expected Value
The expected value of a random variable, X, is the mean of the distribution
E[X] = ∑_x x P(X = x) (discrete)    E[X] = ∫ x f(x) dx (continuous)
Expectation is linear
E[aX + b] = aE[X ] + b
E[X + Y ] = E[X ] + E[Y ]
Examples
Suppose that X and Y are random variables that are 1 with probability p and 0 otherwise. What is the expected value of X + Y?
E[X + Y ] = E[X ] + E[Y ] = P(X = 1) + P(Y = 1) = 2p
Variance
The variance of a random variable is a measure of the 'spread' of the values the variable takes on:
Var(X) = E[(X − E[X])²]
The variance can also be written as
Var(X) = E[X²] − E[X]²
Variance is generally not linear
Var(aX + b) = a²Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Covariance
The covariance between random variables X and Y measures thedegree to which X and Y are related
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
This can also be written as
Cov(X, Y) = E[XY] − E[X]E[Y]
Note that Cov(X ,X ) = Var(X )
Variance and Covariance
Examples
Suppose that X and Y are independent random variables that are 1 with probability p and 0 otherwise. What is the covariance of X and Y? What is the variance of X + Y?
Cov(X, Y) = E[XY] − E[X]E[Y] = p²(1) + (1 − p²)(0) − p² = 0
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
           = E[X²] − E[X]² + E[Y²] − E[Y]²
           = p − p² + p − p² = 2p(1 − p)
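A quick simulation check of the closed form 2p(1 − p); a sketch (p, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

# Simulate two independent Bernoulli(p) variables and compare the sample
# variance of X + Y against the derived value 2p(1 − p).
rng = np.random.default_rng(0)
p, n = 0.3, 200_000

X = (rng.random(n) < p).astype(int)   # Bernoulli(p) samples
Y = (rng.random(n) < p).astype(int)

sample_var = (X + Y).var()
closed_form = 2 * p * (1 - p)
```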
Important Discrete Distributions
Bernoulli (p)
P(X = x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}
E[X] = p
Binomial (n, p)
P(X = x) = (n choose x) p^x (1 − p)^(n−x)
E[X] = np
Geometric (p)
P(X = x) = p(1 − p)^(x−1)
E[X] = 1/p
Poisson (λ)
P(X = x) = λ^x e^(−λ) / x!
E[X] = λ
Important Discrete Distributions
Examples
Bernoulli - flipping a biased coin that lands heads with probability p
Binomial - number of heads when flipping n biased coins that land heads with probability p
Geometric - number of flips of a biased coin (heads with probability p) until it first lands heads
Poisson - letters received in a given day when the average letters received per day is λ
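All four can be sampled with np.random; a sketch that checks sample means against the expected values above (the parameter values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
p, n, lam = 0.25, 10, 4.0

bern = rng.binomial(1, p, N)    # Bernoulli(p) is Binomial(1, p); E[X] = p
binom = rng.binomial(n, p, N)   # E[X] = np
geom = rng.geometric(p, N)      # flips until first heads; E[X] = 1/p
pois = rng.poisson(lam, N)      # E[X] = lam
```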
Important Continuous Distributions
Uniform (a, b) with a ≤ b
f(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise
E[X] = (a + b)/2
Normal (µ, σ²)
f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
E[X] = µ
Examples
Uniform - spinning a board game spinner
Normal - height of a person in the general population
Law of Large Numbers
Let X1, X2, ..., Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ. Then the sample average given by
X̄n = (1/n)(X1 + X2 + ... + Xn)
converges to the expected value
X̄n → µ as n → ∞
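A simulation illustrates the convergence; a sketch with Uniform(0, 1) draws, whose mean is 1/2 (sample size and seed are arbitrary):

```python
import numpy as np

# Running sample mean of iid Uniform(0, 1) draws; by the law of large
# numbers it approaches the true mean mu = 0.5 as n grows.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)

running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
final_error = abs(running_mean[-1] - 0.5)
```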
Central Limit Theorem
Let X1, X2, ..., Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ and Var(Xi) = σ², both finite. If we define the random variable Y as
Y = (1/n)(X1 + X2 + ... + Xn)
then Y is approximately normal with mean µ and variance σ²/n
The approximation improves as n → ∞
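A sketch illustrating the theorem with Uniform(0, 1) draws, for which µ = 1/2 and σ² = 1/12 (n, the trial count, and the seed are arbitrary):

```python
import numpy as np

# Means of n iid Uniform(0, 1) draws should be approximately
# Normal(mu, sigma^2 / n) with mu = 1/2 and sigma^2 = 1/12.
rng = np.random.default_rng(0)
n, trials = 50, 20_000

sample_means = rng.uniform(0, 1, (trials, n)).mean(axis=1)

mean_of_means = sample_means.mean()   # should be close to 1/2
var_of_means = sample_means.var()     # should be close to (1/12) / n
```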
Sampling
Sampling is the process by which elements are drawn from a population or distribution
In this course, we typically assume elements are sampled independently (i.e. with replacement)
The np.random module contains many functions for sampling randomly from various distributions
np.random.uniform(0,1) # uniform on [0,1]
Independent and Identically Distributed
The importance of the iid assumption cannot be overstated!
Independence assumption simplifies many things:
P(X ,Y ) = P(X )P(Y )
P(X | Y ) = P(X )
Var(X + Y ) = Var(X ) + Var(Y )
Parameter Estimation
Suppose we have a parametrized distribution P(X; θ) with θ unknown
Suppose we have an iid set of samples x1, . . . , xn
How can we estimate θ? From Bayes’ Theorem, we have that
P(θ | X ) ∝ P(X | θ)P(θ)
MAP: θ̂ = arg max_θ P(θ | X)
MLE: θ̂ = arg max_θ P(X | θ)
Parameter Estimate - log trick
The logarithmic function is monotonically increasing - it will not change the location of where a function achieves its maximum
Multiplication turns into summation, resulting in less numerical underflow
Considering the negative of the log-likelihood function allows us to use gradient descent to minimize the function
arg max_θ f(X | θ) = arg min_θ − log f(X | θ)
Parameter Estimation - Binomial Distribution
Examples
Suppose you toss a (possibly biased) coin n times and record the number of heads, k. What is p, the probability that it lands heads on a given coin flip? We can use the binomial distribution and MLE to find p:
p̂ = arg max_p P(k | p) = arg max_p (n choose k) p^k (1 − p)^(n−k)
  = arg max_p p^k (1 − p)^(n−k)
  = arg min_p − log[p^k (1 − p)^(n−k)]
  = arg min_p − k log p − (n − k) log(1 − p)
Taking the derivative with respect to p and setting it to zero implies that p̂ = k/n
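The closed form p̂ = k/n can be checked numerically by minimizing the negative log-likelihood on a grid; a sketch (n and k are arbitrary values):

```python
import numpy as np

# Negative log-likelihood of Binomial data: −k log p − (n − k) log(1 − p).
n, k = 100, 37
p_grid = np.linspace(0.001, 0.999, 9981)

nll = -k * np.log(p_grid) - (n - k) * np.log(1 - p_grid)
p_hat = p_grid[np.argmin(nll)]   # numeric MLE; should match k/n
```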
MLE with Logistic Regression
Examples
Suppose you have an iid dataset, S = {(xi, yi)}_{i=1}^{N}, and want to model it using a "log-linear" model:
P(y | x, w, b) = 1 / (1 + e^(−y(wᵀx + b)))
We can use MLE to find w, b:
ŵ, b̂ = arg max_{w,b} P(y | x, w, b)
     = arg max_{w,b} ∏_{i=1}^{N} P(yi | xi, w, b)    (iid assumption)
     = arg min_{w,b} ∑_{i=1}^{N} − ln P(yi | xi, w, b)    (log-loss in logistic regression)
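A minimal sketch of this MLE by gradient descent on the log-loss; the synthetic data, step size, and iteration count are arbitrary choices, with labels encoded as y ∈ {−1, +1} to match the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 2

# Synthetic dataset drawn from a known log-linear model.
X = rng.normal(size=(N, d))
true_w, true_b = np.array([2.0, -1.0]), 0.5
prob_pos = 1 / (1 + np.exp(-(X @ true_w + true_b)))
y = np.where(rng.random(N) < prob_pos, 1, -1)

# Gradient descent on the average negative log-likelihood (log-loss).
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(2000):
    margin = y * (X @ w + b)
    g = -y / (1 + np.exp(margin))   # per-sample d(log-loss)/d(wᵀx + b)
    w -= lr * (X.T @ g) / N
    b -= lr * g.mean()

# Average log-loss of the fitted model on the training data.
train_nll = np.mean(np.log1p(np.exp(-y * (X @ w + b))))
```

Because the data were generated from a log-linear model with a strong signal, the fitted weights should recover the signs of true_w and the training log-loss should fall well below log 2 ≈ 0.693 (the loss of a constant 50/50 predictor).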