Probability and Sampling Recitation
Avishek Dutta
CS155 Machine Learning and Data Mining
February 2, 2017
Avishek Dutta Probability and Sampling Recitation
Motivation
Uncertainty is everywhere around us
"What is the chance that it will rain today?" "When will the next bus arrive?" "Will I go to recitation today?"
Machine learning tries to understand these uncertainties and interact with the real world
Probability theory is the mathematical study of uncertainty
Basic Concepts
Sample Space Ω: set of all possible outcomes
An event, A, is a subset of Ω
P(A) is the probability of event A occurring
0 ≤ P(A) ≤ 1
P(∅) = 0
P(Ω) = 1
P(Aᶜ) = 1 − P(A)
Examples
Suppose you are rolling a six-sided die. Then we have:
Ω = {1, 2, 3, 4, 5, 6}
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
P({1, 2, 3, 4, 5, 6}) = 1
Union
Given two events, A and B, the probability of A or B is
P(A ∪ B) = P(A) + P(B)− P(A ∩ B) (1)
= P(A) + P(B) (2)
where (1) and (2) are equal only if A and B are mutually exclusive, i.e. P(A ∩ B) = 0
Examples
Suppose you are rolling a six-sided die. Then we have:
P({2, 4, 6}) = P(2) + P(4) + P(6) = 1/2
P({1, 2, 3} ∪ {2, 4, 6}) = P({1, 2, 3}) + P({2, 4, 6}) − P({2}) = 1/2 + 1/2 − 1/6 = 5/6
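As a quick sanity check, inclusion-exclusion can be verified by enumerating the die's sample space directly; a minimal sketch (the events A and B follow the example above):

```python
from fractions import Fraction

# Sample space of one fair six-sided die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Probability of an event under the uniform distribution on omega.
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3}
B = {2, 4, 6}

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
```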
Joint and Conditional Probabilities
Given two events, A and B,
P(A,B) is the joint probability of A and B occurring together
P(A | B) is the conditional probability of A occurring given that we know B has occurred
Joint and Conditional Probabilities are related in the following way:
P(A | B) = P(A, B) / P(B)
where we assume that P(B) ≠ 0.
Joint and Conditional Probabilities
Examples
Consider a standard deck of cards. What is the probability that the first two cards drawn are both Kings?
Event A = First card drawn is a King
Event B = Second card drawn is a King
A standard deck of cards has 4 Kings and 52 total cards.
P(A) = 4/52
P(B | A) = 3/51
This means that
P(A, B) = P(B | A)P(A) = 3/51 · 4/52 = 1/221 ≈ 0.005
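This result can also be checked by simulation; a minimal sketch (the deck encoding, seed, and trial count are arbitrary choices):

```python
import random

# Estimate P(first two cards drawn are both Kings) by repeated sampling
# without replacement; the exact answer is 1/221 ≈ 0.0045.
deck = ["K"] * 4 + ["x"] * 48   # 4 Kings plus 48 non-King cards
random.seed(0)

trials = 200_000
hits = sum(
    all(card == "K" for card in random.sample(deck, 2))
    for _ in range(trials)
)
estimate = hits / trials
```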
Joint Probabilities - Chain Rule
An extension of P(A,B) = P(B | A)P(A):
P(A1, A2, ..., An) = P(An, ..., A2, A1)
  = P(An | An−1, ..., A2, A1) P(An−1, ..., A2, A1)
  ...
  = ∏_{i=1}^{n} P(Ai | A1, A2, ..., Ai−1)
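For instance, the chain rule gives the probability that the first four cards drawn from a deck are all Kings; a sketch (the counting cross-check is an extra step, not from the slides):

```python
from fractions import Fraction
from math import comb

# Ai = "the i-th card drawn is a King"; multiply the conditional probabilities.
p = Fraction(1)
for i in range(4):
    # Given i Kings already drawn, 4 − i Kings remain among 52 − i cards.
    p *= Fraction(4 - i, 52 - i)

# Cross-check: the same probability obtained by counting is 1 / C(52, 4).
matches_counting = (p == Fraction(1, comb(52, 4)))
```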
Independence
Two events, A and B, are independent if
P(A,B) = P(A)P(B)
or equivalently
P(A | B) = P(A)
In other words, knowledge of whether B occurred does not affect the probability that A occurs
Independence
Examples
Suppose that you roll a six-sided die twice. What is the probability that you roll 1 both times?
A = rolling a 1 in the first roll
B = rolling a 1 in the second roll
These two events are independent so
P(A, B) = P(A)P(B) = 1/6 · 1/6 = 1/36
Bayes Theorem
We know that the following is true
P(A,B) = P(A | B)P(B) = P(B | A)P(A)
This implies that
P(A | B) = P(B | A)P(A) / P(B)
P(A | B) ∝ P(B | A)P(A)
P(A | B) - Posterior Probability
P(B | A) - Likelihood Function
P(A) - Prior Information
P(B) - Evidence
Bayes Theorem
Examples
If a person has an allergy (A), sneezing (S) is observed with probability P(S | A) = 0.8. What is the chance a person has an allergy given that they are sneezing: P(A | S)?
Assume that P(A) = 0.001 (few people have allergies)
Assume that P(S) = 0.1 (many people sneeze).
P(A | S) = P(S | A)P(A) / P(S) = (0.8 · 0.001) / 0.1 = 0.008
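The computation is a one-liner in code; a minimal sketch using the numbers above:

```python
# Bayes' theorem with the numbers from the allergy example.
p_S_given_A = 0.8    # P(sneezing | allergy)
p_A = 0.001          # prior: P(allergy)
p_S = 0.1            # evidence: P(sneezing)

# Posterior: P(A | S) = P(S | A) P(A) / P(S)
p_A_given_S = p_S_given_A * p_A / p_S
```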
Random Variables
So far, we have only considered simple examples with simple events as the possible outcomes
A = rolling 1 on a six-sided die
B = first card drawn is a King
This is mathematically imprecise and can be limiting. The notion of random variables resolves these issues
A random variable X is a function X : Ω → ℝ
Abuse of notation: P(X = x) is the probability that the random variable takes on value x
Random Variables
Examples
Dice Roll: X = i for i ∈ {1, 2, 3, 4, 5, 6} with P(X = i) = 1/6
Biased Coin: X = 1 with P(X = 1) = p and X = 0 with P(X = 0) = 1 − p
All random variables have a Cumulative Distribution Function (CDF):
F (x) = P(X ≤ x)
Some properties of the CDF are:
0 ≤ F(x) ≤ 1
F(x) is monotonically increasing
lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
Discrete Random Variables
A random variable that can take on only finitely or countably many different values
Discrete random variables have a Probability Mass Function (PMF):
p(x) = P(X = x)
The PMF satisfies ∑_x P(X = x) = 1
Examples
Suppose you are flipping two coins. Let X be the random variable that equals the number of heads
P(X = 0) P(X = 1) P(X = 2)
0.25 0.5 0.25
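The PMF above can be recovered by enumerating the four equally likely outcomes; a minimal sketch:

```python
from itertools import product
from collections import Counter

# All outcomes of two fair coin flips, encoding heads as 1 and tails as 0.
outcomes = list(product([0, 1], repeat=2))

# X = number of heads; count outcomes per value and normalize.
counts = Counter(sum(o) for o in outcomes)
pmf = {x: c / len(outcomes) for x, c in counts.items()}
```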
Continuous Random Variables
A random variable that can take on uncountably many different values
Continuous random variables have a Probability Density Function (PDF):
p(x) = f(x) = d/dx F(x)
The PDF satisfies ∫_{−∞}^{∞} f(x) dx = 1
Examples
Suppose that X is a random variable that is equal to a value chosen uniformly at random from the interval [0, a]. Then
F(x) = x/a and f(x) = 1/a for x ∈ [0, a]
Marginal Distribution
Suppose that for two random variables, X and Y, the joint distribution p(x, y) is known for all combinations of X and Y. Then the marginal distribution of X is
p(x) = ∑_y p(x, y) (discrete)    p(x) = ∫_{−∞}^{∞} p(x, y) dy (continuous)
and the marginal distribution of Y is
p(y) = ∑_x p(x, y) (discrete)    p(y) = ∫_{−∞}^{∞} p(x, y) dx (continuous)
Marginal Distribution
Examples
Consider two random variables X and Y. Suppose the joint distribution is given by
        x1     x2     x3     x4     p(y)
y1    4/32   2/32   1/32   1/32    8/32
y2    2/32   4/32   1/32   1/32    8/32
y3    2/32   2/32   2/32   2/32    8/32
y4    8/32      0      0      0    8/32
p(x) 16/32   8/32   4/32   4/32   32/32
p(x1) = ∑_y p(x1, y) = 16/32
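With the joint distribution stored as an array, marginalization is a sum over one axis; a sketch using numpy (the row/column layout mirrors the table above):

```python
import numpy as np

# Joint distribution from the example: rows are y1..y4, columns are x1..x4.
joint = np.array([
    [4, 2, 1, 1],
    [2, 4, 1, 1],
    [2, 2, 2, 2],
    [8, 0, 0, 0],
]) / 32

p_x = joint.sum(axis=0)   # marginal of X: sum out y (down each column)
p_y = joint.sum(axis=1)   # marginal of Y: sum out x (across each row)
```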
Expected Value
The expected value of a random variable, X, is the mean of the distribution
E[X] = ∑_x x P(X = x) (discrete)    E[X] = ∫ x f(x) dx (continuous)
Expectation is linear
E[aX + b] = aE[X ] + b
E[X + Y ] = E[X ] + E[Y ]
Examples
Suppose that X and Y are random variables that are 1 with probability p and 0 otherwise. What is the expected value of X + Y?
E[X + Y ] = E[X ] + E[Y ] = P(X = 1) + P(Y = 1) = 2p
Variance
The variance of a random variable is a measure of the 'spread' of the values the variable takes on:
Var(X) = E[(X − E[X])²]
The variance can also be written as
Var(X) = E[X²] − E[X]²
Variance is generally not linear
Var(aX + b) = a²Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Covariance
The covariance between random variables X and Y measures thedegree to which X and Y are related
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
This can also be written as
Cov(X, Y) = E[XY] − E[X]E[Y]
Note that Cov(X ,X ) = Var(X )
Variance and Covariance
Examples
Suppose that X and Y are independent random variables that are 1 with probability p and 0 otherwise. What is the covariance of X and Y? What is the variance of X + Y?
Cov(X, Y) = E[XY] − E[X]E[Y] = p²(1) + (1 − p²)(0) − p² = 0
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
           = E[X²] − E[X]² + E[Y²] − E[Y]²
           = p − p² + p − p² = 2p(1 − p)
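A quick simulation check of the closed form 2p(1 − p); a sketch (p, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

# Simulate two independent Bernoulli(p) variables and compare the sample
# variance of X + Y against the derived value 2p(1 − p).
rng = np.random.default_rng(0)
p, n = 0.3, 200_000

X = (rng.random(n) < p).astype(int)   # Bernoulli(p) samples
Y = (rng.random(n) < p).astype(int)

sample_var = (X + Y).var()
closed_form = 2 * p * (1 - p)
```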
Important Discrete Distributions
Bernoulli (p)
P(X = x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}
E[X] = p
Binomial (n, p)
P(X = x) = (n choose x) p^x (1 − p)^(n−x)
E[X] = np
Geometric (p)
P(X = x) = p(1 − p)^(x−1)
E[X] = 1/p
Poisson (λ)
P(X = x) = λ^x e^(−λ) / x!
E[X] = λ
Important Discrete Distributions
Examples
Bernoulli - flipping a biased coin that lands heads with probability p
Binomial - number of heads when flipping n biased coins that land heads with probability p
Geometric - number of flips of a biased coin (heads with probability p) until it first lands heads
Poisson - letters received in a given day when the average letters received per day is λ
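All four can be sampled with np.random; a sketch that checks sample means against the expected values above (the parameter values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
p, n, lam = 0.25, 10, 4.0

bern = rng.binomial(1, p, N)    # Bernoulli(p) is Binomial(1, p); E[X] = p
binom = rng.binomial(n, p, N)   # E[X] = np
geom = rng.geometric(p, N)      # flips until first heads; E[X] = 1/p
pois = rng.poisson(lam, N)      # E[X] = lam
```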
Important Continuous Distributions
Uniform (a, b) with a ≤ b
f(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise
E[X] = (a + b)/2
Normal (µ, σ²)
f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
E[X] = µ
Examples
Uniform - spinning a board game spinner
Normal - height of a person in the general population
Law of Large Numbers
Let X1, X2, ..., Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ. Then the sample average given by
X̄n = (1/n)(X1 + X2 + ... + Xn)
converges to the expected value
X̄n → µ as n → ∞
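A simulation illustrates the convergence; a sketch with Uniform(0, 1) draws, whose mean is 1/2 (sample size and seed are arbitrary):

```python
import numpy as np

# Running sample mean of iid Uniform(0, 1) draws; by the law of large
# numbers it approaches the true mean mu = 0.5 as n grows.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)

running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
final_error = abs(running_mean[-1] - 0.5)
```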
Central Limit Theorem
Let X1, X2, ..., Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ and Var(Xi) = σ², both finite. If we define the random variable Y as
Y = (1/n)(X1 + X2 + ... + Xn)
then Y is approximately normal with mean µ and variance σ²/n
The approximation improves as n → ∞
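A sketch illustrating the theorem with Uniform(0, 1) draws, for which µ = 1/2 and σ² = 1/12 (n, the trial count, and the seed are arbitrary):

```python
import numpy as np

# Means of n iid Uniform(0, 1) draws should be approximately
# Normal(mu, sigma^2 / n) with mu = 1/2 and sigma^2 = 1/12.
rng = np.random.default_rng(0)
n, trials = 50, 20_000

sample_means = rng.uniform(0, 1, (trials, n)).mean(axis=1)

mean_of_means = sample_means.mean()   # should be close to 1/2
var_of_means = sample_means.var()     # should be close to (1/12) / n
```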
Sampling
Sampling is the process by which elements are drawn from a population or distribution
In this course, we typically assume elements are sampled independently (i.e. with replacement)
The np.random module contains many functions for sampling randomly from various distributions
np.random.uniform(0,1) # uniform on [0,1]
Independent and Identically Distributed
The importance of the iid assumption cannot be overstated!
Independence assumption simplifies many things:
P(X ,Y ) = P(X )P(Y )
P(X | Y ) = P(X )
Var(X + Y ) = Var(X ) + Var(Y )
Parameter Estimation
Suppose we have a parametrized distribution P(X; θ) with θ unknown
Suppose we have an iid set of samples x1, . . . , xn
How can we estimate θ? From Bayes’ Theorem, we have that
P(θ | X ) ∝ P(X | θ)P(θ)
MAP: θ̂ = arg max_θ P(θ | X)
MLE: θ̂ = arg max_θ P(X | θ)
Parameter Estimate - log trick
The logarithmic function is monotonically increasing - it will not change the location of where a function achieves its maximum
Multiplication turns into summation, resulting in less numerical underflow
Considering the negative of the log-likelihood function allows us to use gradient descent to minimize the function
arg max_θ f(X | θ) = arg min_θ − log f(X | θ)
Parameter Estimation - Binomial Distribution
Examples
Suppose you toss a (possibly biased) coin n times and record the number of heads, k. What is p, the probability that it lands heads on a given coin flip? We can use the binomial distribution and MLE to find p:
p̂ = arg max_p P(k | p) = arg max_p (n choose k) p^k (1 − p)^(n−k)
  = arg max_p p^k (1 − p)^(n−k)
  = arg min_p − log[p^k (1 − p)^(n−k)]
  = arg min_p − k log p − (n − k) log(1 − p)
Taking the derivative with respect to p and setting it to zero implies that p̂ = k/n
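The closed form p̂ = k/n can be checked numerically by minimizing the negative log-likelihood on a grid; a sketch (n and k are arbitrary values):

```python
import numpy as np

# Negative log-likelihood of Binomial data: −k log p − (n − k) log(1 − p).
n, k = 100, 37
p_grid = np.linspace(0.001, 0.999, 9981)

nll = -k * np.log(p_grid) - (n - k) * np.log(1 - p_grid)
p_hat = p_grid[np.argmin(nll)]   # numeric MLE; should match k/n
```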
MLE with Logistic Regression
Examples
Suppose you have an iid dataset, S = {(xi, yi)}_{i=1}^{N}, and want to model it using a "log-linear" model:
P(y | x, w, b) = 1 / (1 + e^(−y(wᵀx + b)))
We can use MLE to find w, b:
ŵ, b̂ = arg max_{w,b} P(y | x, w, b)
     = arg max_{w,b} ∏_{i=1}^{N} P(yi | xi, w, b)    (iid assumption)
     = arg min_{w,b} ∑_{i=1}^{N} − ln P(yi | xi, w, b)    (log-loss in logistic regression)
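A minimal sketch of this MLE by gradient descent on the log-loss; the synthetic data, step size, and iteration count are arbitrary choices, with labels encoded as y ∈ {−1, +1} to match the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 2

# Synthetic dataset drawn from a known log-linear model.
X = rng.normal(size=(N, d))
true_w, true_b = np.array([2.0, -1.0]), 0.5
prob_pos = 1 / (1 + np.exp(-(X @ true_w + true_b)))
y = np.where(rng.random(N) < prob_pos, 1, -1)

# Gradient descent on the average negative log-likelihood (log-loss).
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(2000):
    margin = y * (X @ w + b)
    g = -y / (1 + np.exp(margin))   # per-sample d(log-loss)/d(wᵀx + b)
    w -= lr * (X.T @ g) / N
    b -= lr * g.mean()

# Average log-loss of the fitted model on the training data.
train_nll = np.mean(np.log1p(np.exp(-y * (X @ w + b))))
```

Because the data were generated from a log-linear model with a strong signal, the fitted weights should recover the signs of true_w and the training log-loss should fall well below log 2 ≈ 0.693 (the loss of a constant 50/50 predictor).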