PROBABILITY THEORY CMPUT 296 Martha White Winter, 2020

PROBABILITY THEORY - GitHub Pages · probability of seeing a stopping time of exactly 3.141596? (How much mass or space does it take up in [3,15]?) - More reasonable to ask the probability


Page 1

PROBABILITY THEORY CMPUT 296

Martha White

Winter, 2020

Page 2

QUICK CHECK ON PRE-REQ KNOWLEDGE

• You should know how to take derivatives
• Will rarely, if ever, integrate
• I will teach you about Partial Derivatives

• I assume you know about vectors and dot products, hopefully also about matrices

• Need to have learned about probability, I will cover
  • Expected Value (Mean)
  • Variance
  • Random Variables
  • Probability Density Functions…

Page 3

WHY DO WE NEED PROBABILITIES?

• We could just assume a deterministic world
  • I see an input x, I can produce the output y = f(x)

• Example: in a game (e.g., chess), you take an action, and the outcome is deterministic

• But, even in a deterministic world, we have a problem: partial observability

• Outcomes look random because we don’t have enough information

• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy

• You can only see if it has candy

Page 4

WHY DO WE NEED PROBABILITIES?

• But, even in a deterministic world, we have a problem: partial observability

• Outcomes look random because we don’t have enough information

• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy

• You can only see if it has candy
• From your perspective, when x = has candy, sometimes candy is outputted and sometimes not
• Looks stochastic (dependent on hidden battery charge)
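A minimal simulation of this idea, as a sketch: the battery-charge probability of 0.7 and the function names are assumptions for illustration. The machine's function f is fully deterministic, but an observer who cannot see the battery sees what looks like a random outcome.

```python
import random

def gumball(has_candy, battery_charged):
    # Fully deterministic: candy comes out iff both inputs are true.
    return has_candy and battery_charged

def observe_machine(n_trials, p_charged=0.7, seed=0):
    # The observer sees only has_candy (always True here); the battery
    # is hidden state, so outcomes look random from their perspective.
    rng = random.Random(seed)
    outputs = [gumball(True, rng.random() < p_charged) for _ in range(n_trials)]
    return sum(outputs) / n_trials

print(observe_machine(10_000))  # ≈ 0.7: looks stochastic, yet f is deterministic
```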

Page 5

SPACE OF OUTCOMES AND EVENTS


Page 6

SAMPLE SPACES

e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}

e.g., Outcome of Dice Roll

Page 7

A FEW COMMENTS ON TERMINOLOGY

• A few new terms, including countable, closure
  • only a small amount of terminology used, can google these terms and learn on your own
  • notation sheet in notes

• Countable: integers, {0.1, 2.0, 3.6}, …
• Uncountable: real numbers, intervals, …
• Interchangeably I use (though it's somewhat loose)
  • discrete and countable
  • continuous and uncountable

Page 8

(MEASURABLE) SPACE OF OUTCOMES AND EVENTS

an event space


Page 9

WHY IS THIS THE DEFINITION?

Intuitively,

1. A collection of outcomes is an event (e.g., either a 1 or 6 was rolled)

2. If we can measure two events separately, then their union should also be a measurable event

3. If we can measure an event, then we should be able to measure that that event did not occur (the complement)
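These closure properties can be checked mechanically. A sketch, where the helper `is_event_space` is a hypothetical name and only finite unions are tested (enough for a finite event space):

```python
def is_event_space(F, omega):
    # Work with frozensets so events are hashable and comparable.
    F = {frozenset(e) for e in F}
    omega = frozenset(omega)
    if frozenset() not in F or omega not in F:
        return False
    for A in F:
        if omega - A not in F:      # closed under complement
            return False
        for B in F:
            if A | B not in F:      # closed under union
                return False
    return True

omega = {1, 2, 3, 4, 5, 6}
F = [set(), {1, 2}, {3, 4, 5, 6}, omega]
print(is_event_space(F, omega))                       # True
print(is_event_space([set(), {1, 2}, omega], omega))  # False: missing {3,4,5,6}
```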


Page 10

QUICK CHECK ON BACKGROUND

• The complement of a set

• The union of sets

• A set of sets

• Any other confusing notation?

Page 11

SAMPLE SPACES

e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}


Page 12

AXIOMS OF PROBABILITY

1. (unit measure) P(Ω) = 1

2. (σ-additivity) Any countable sequence of disjoint events A₁, A₂, … ∈ F satisfies P(⋃_{i=1}^∞ Aᵢ) = ∑_{i=1}^∞ P(Aᵢ)
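For a finite sample space such as a fair die, both axioms can be verified numerically. A minimal sketch, using exact fractions so the checks are not muddied by floating-point error:

```python
from fractions import Fraction

omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}   # fair-die PMF

def P(A):
    # Probability of an event = sum of its outcome probabilities.
    return sum(p[w] for w in A)

# Unit measure: P(Omega) = 1
assert P(omega) == 1

# Additivity for disjoint events: P(A ∪ B) = P(A) + P(B)
A, B = {1, 2}, {3, 4, 5, 6}
assert P(A | B) == P(A) + P(B)

print(P({1, 6}))  # Fraction(1, 3)
```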


Page 13

FINDING PROBABILITY DISTRIBUTIONS


Do you recognize this distribution?

Page 14

PROBABILITY MASS FUNCTIONS


Page 15

ARBITRARY PMFS

e.g., PMF for a fair die (table of values): Ω = {1, 2, 3, 4, 5, 6}

p(ω) = 1/6 ∀ω ∈ Ω

Page 16

EXERCISE: EXAMPLES OF EVENTS

• What is a possible event? What is its probability?

• What is the event space?

Ω = {1, 2, 3, 4, 5, 6}, p(ω) = 1/6 ∀ω ∈ Ω

Page 17

EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?

• Histograms!

• Imagine you recorded your commute times for a year, in minutes (365 recorded times)

• How do you get p(t)?

• How is p(t) useful?

Page 18

EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?

• Histograms!

• Imagine you recorded your commute times for a year, in minutes

• Get p(t): count the number of times each t = 1, 2, 3, … occurs and then normalize the counts by the # of samples
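A sketch of this counting-and-normalizing step, using a short hypothetical list of commute times in place of the 365 recorded ones:

```python
from collections import Counter

# Hypothetical commute times in minutes (the course example has 365 of them).
times = [22, 25, 22, 30, 25, 22, 28, 25, 22, 30]

counts = Counter(times)                     # count occurrences of each t
n = len(times)
p = {t: c / n for t, c in counts.items()}   # normalize counts by # samples

print(p[22])            # 0.4 (4 out of 10 days)
print(sum(p.values()))  # ≈ 1: a valid PMF
```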

Page 19

USEFUL PMFS

Page 20

REMINDERS: JANUARY 9, 2020

• Assignment 1 is available on the website

• https://marthawhite.github.io/mlbasics/

• You should start reading the notes

• Chapters 1, 2 and 3 (about 30 pages)

• The notes are in-progress

• Some sections still say “Coming soon…”

• Avoid printing the full set of notes (if at all). At most, print what you are reading

Page 21

A FEW OTHER NOTES ON POLICY

• If you have an issue (sickness, injury, family problem, etc.), and need an extension on an assignment, contact me about it before the assignment deadline.

• I cannot give extensions after.

• You cannot submit Thought Questions late

• only Assignments have a late policy that allows for late submission

• if you submit 1 hour late, we won’t penalize you

Page 22

USEFUL PMFS

e.g., amount of mail received in a day, or number of calls received by a call center in an hour

Page 23

USEFUL PMFS

e.g., amount of mail received in a day, or number of calls received by a call center in an hour

1. Can we use a table for this?
2. How do we know this is a valid PMF? How can you check?
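The validity check can be done numerically: a PMF must be non-negative and sum to 1 over its support. The Poisson PMF p(k) = λᵏ e^(−λ) / k! has infinite support, so a full table is impossible, but the tail vanishes quickly. A sketch (λ = 4 is an assumed rate for illustration):

```python
import math

def poisson_pmf(k, lam):
    # p(k) = lam^k * e^(-lam) / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4.0  # e.g., an average of 4 calls per hour (assumed)

# Can't tabulate all of k = 0, 1, 2, ..., but the tail is negligible:
total = sum(poisson_pmf(k, lam) for k in range(100))
print(total)  # ≈ 1, so this is a valid PMF
```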

Page 24

EXERCISE: CAN WE USE A POISSON FOR COMMUTE TIMES?

• Used a probability table (histogram) for minutes: count the number of times t = 1, 2, 3, … occurs and then normalize the counts by the # of samples

• Can we use a Poisson?

Why Poisson? Why not just a histogram?

Page 25

PROBABILITY DENSITY FUNCTIONS


Who has never seen an integral?

Page 26

PROBABILITY DENSITY FUNCTIONS


e.g. normal distribution (Gaussian)

Page 27

PROBABILITY DENSITY FUNCTIONS


Who has never seen an integral?

Page 28

PMFS VS. PDFS

Example:

- Stopping time of a car, in interval [3, 15]. What is the probability of seeing a stopping time of exactly 3.141596? (How much mass or space does it take up in [3, 15]?)
- More reasonable to ask the probability of stopping between 3 to 3.5 seconds. How do we get that probability?
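A sketch of this difference, assuming (purely for illustration; the slides don't fix the density) that the stopping time is uniform on [3, 15]. A single point gets probability 0, while an interval gets the integral of the density over it:

```python
def p(x):
    # Assumed density: uniform on [3, 15], so p(x) = 1/12 there.
    return 1.0 / 12.0 if 3.0 <= x <= 15.0 else 0.0

def prob(a, b, n=100_000):
    # P(a <= T <= b) = integral of p over [a, b], here via a midpoint Riemann sum.
    dx = (b - a) / n
    return sum(p(a + (i + 0.5) * dx) for i in range(n)) * dx

print(prob(3.0, 3.5))            # ≈ 0.5/12 ≈ 0.0417
print(prob(3.141596, 3.141596))  # 0.0: a single point takes up no space
```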


Page 29

USEFUL PDFS

Page 30

USEFUL PDFS

Page 31

USEFUL PDFS

Page 32

WHY CAN THE DENSITY BE ABOVE 1?

• p(x) can be big, because probabilities come from p(x)Δx and Δx is small
• P(A) MUST be less than or equal to 1
• p(x) can be bigger than 1
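A narrow Gaussian makes this concrete: the density value at the peak far exceeds 1, but any probability, obtained by integrating the density, stays at most 1. A sketch (σ = 0.01 chosen for illustration):

```python
import math

def normal_pdf(x, mu=0.0, sigma=0.01):
    # Narrow Gaussian: the density peaks well above 1.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

peak = normal_pdf(0.0)
print(peak)  # ≈ 39.9 — a density value, not a probability

# But probabilities (integrals of the density) never exceed 1:
n, lo, hi = 200_000, -1.0, 1.0
dx = (hi - lo) / n
total = sum(normal_pdf(lo + (i + 0.5) * dx) * dx for i in range(n))
print(total)  # ≈ 1
```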

Page 33

RANDOM VARIABLES

Age: 35 Likes sports: Yes Height: 1.85m Smokes: No Weight: 75kg Marital st.: Single IQ: 104 Occupation: Musician

Age: 26 Likes sports: Yes Height: 1.75m Smokes: No Weight: 79kg Marital st.: Divorced IQ: 103 Occupation: Athlete

Musician is a random variable (a function)
A is a new event; let's call it 1, and Not-A is 0
The new outcome space Ω is {0, 1}
Can ask P(M = 0) and P(M = 1)


Page 34

WE INSTINCTIVELY CREATE THIS TRANSFORMATION

Assume Ω is a set of people.

Compute the probability that a randomly selected person ω ∈ Ω has a cold.

Define event A = {ω ∈ Ω : Disease(ω) = cold}.

Disease is our new random variable: P(Disease = cold)

Disease is a function that maps outcome space to new outcome space {cold, not cold}

Disease is a function, which is neither a variable nor random. BUT, this term is still a good one, since we treat Disease as a variable and assume it can take on different values (randomly, according to some distribution).
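A sketch of this transformation with a tiny hypothetical population (the records and names are made up for illustration): Disease is an ordinary function on outcomes, and P(Disease = cold) is the probability of the pre-image event A, here under a uniform distribution over people:

```python
# Outcomes omega are people; Disease is a function on them.
population = [
    {"name": "a", "disease": "cold"},
    {"name": "b", "disease": "none"},
    {"name": "c", "disease": "cold"},
    {"name": "d", "disease": "none"},
]

def Disease(omega):
    # Maps an outcome (a person) into the new outcome space {cold, not cold}.
    return "cold" if omega["disease"] == "cold" else "not cold"

# P(Disease = cold) = P(A), where A = {omega : Disease(omega) = cold}:
A = [w for w in population if Disease(w) == "cold"]
print(len(A) / len(population))  # 0.5
```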

Page 35

RANDOM VARIABLE: FORMAL DEFINITION

Example: X : Ω → [0, ∞)

Ω is the set of (measured) people in a population,

with associated measurements such as height and weight.

X(ω) = height of person ω

A = an interval, e.g., [5′1″, 5′2″]

P(X ∈ A) = P(5′1″ ≤ X ≤ 5′2″) = P({ω : X(ω) ∈ A})
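In code, this event-based computation might look like the following sketch. The population and its heights (in inches, so 61 inches = 5′1″) are made up for illustration, and a uniform distribution over the population is assumed.

```python
# Sketch: P(X in A) = P({w : X(w) in A}), computed by counting outcomes,
# for a small made-up population with an assumed uniform distribution.
from fractions import Fraction

heights = {"ann": 61.5, "bob": 70.0, "cy": 61.0, "dee": 66.0}  # X(w), inches

def prob_X_in(lo, hi):
    """P(lo <= X <= hi): count the outcomes whose height lands in [lo, hi]."""
    favourable = [w for w, h in heights.items() if lo <= h <= hi]
    return Fraction(len(favourable), len(heights))

print(prob_X_in(61, 62))  # P(5'1" <= X <= 5'2") = 1/2 for this population
```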


3 MINUTE BREAK AND EXERCISE

• Let X be a random variable that corresponds to the ratio of hard-to-easy problems on an assignment. Assume it takes values in {0.1, 0.25, 0.7}.

• Is this discrete or continuous? Does it have a PMF or PDF?

• Further, where could the variability come from? i.e., why is this a random variable?

• Let X be the stopping time of a car, taking values in [3, 5] ∪ [7, 9]. Is this discrete or continuous?

• Think of an example of a discrete random variable (RV) and a continuous RV


WHAT IF WE HAVE MORE THAN TWO VARIABLES…

• So far, we have considered scalar random variables

• Axioms of probability defined abstractly, apply to vector random variables

Ω = ℝ², e.g., ω = [−0.5, 10]

Ω = [0, 1] × [2, 5], e.g., ω = [0.2, 3.5]

But, when defining probabilities, we will want to consider how the variables interact


TWO DISCRETE RANDOM VARIABLES

Random variables X and Y, with outcome spaces 𝒳 and 𝒴

…about your commute time tomorrow. For this setting, your random variable X corresponds to the commute time, and you need to define probabilities for this random variable. This data could be considered to be discrete, taking values, in minutes, in {4, 5, 6, . . . , 26}. You could then create histograms of this data (tables of probability values), as shown in Figure 1.4, to reflect the likelihood of commute times.

The commute time, however, is not actually discrete, and so you would like to model it as a continuous RV. One reasonable choice is a gamma distribution. How, though, does one take the recorded data and determine the parameters α, β of the gamma distribution? Estimating these parameters is actually quite straightforward, though not as immediately obvious as estimating tables of probability values; we discuss how to do so in Chapter 3. The learned gamma distribution is also depicted in Figure 1.4.

Given the gamma distribution, one could now ask the question: what is the most likely commute time today? This corresponds to max_ω p(ω), which is called the mode of the distribution. Another natural question is the average or expected commute time. To obtain this, you need the expected value (mean) of this gamma distribution, which we define below in Section 1.4. □

1.3 Multivariate random variables

Much of the above development extends to multivariate random variables (a vector of random variables), because the definition of outcome spaces and probabilities is general. The examples so far, however, have dealt with scalar random variables, because for multivariate random variables we need to understand how variables interact. In this section, we discuss several new notions that only arise when there are multiple random variables, including joint distributions, conditional distributions, marginals and dependence between variables.

Let us start with a simpler example, with two discrete random variables X and Y with outcome spaces 𝒳 and 𝒴. There is a joint probability mass function p : 𝒳 × 𝒴 → [0, 1], and corresponding joint probability distribution P, such that

p(x, y) ≝ P(X = x, Y = y),

where the pmf needs to satisfy

∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) = 1.

For example, if 𝒳 = {young, old} and 𝒴 = {no arthritis, arthritis}, then the pmf could be the following table of joint probabilities:

             Y = 0                      Y = 1
X = 0    P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1    P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100

Table 1.1: A joint probability table for random variables X and Y.

* these numbers are completely made up

This fits within the definition of probability spaces, because Ω = 𝒳 × 𝒴 is a valid outcome space, and ∑_{ω ∈ Ω} p(ω) = ∑_{(x,y) ∈ Ω} p(x, y) = ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y). The random variable Z = (X, Y) is a multivariate random variable, of two dimensions.

Looking at the joint probabilities in the table, we can definitely see that the two random variables interact. The joint probability is small for young and having arthritis, for example. Further, there seems to be more magnitude in the row corresponding to young, suggesting that the probabilities are also influenced by the proportion of people in the population that are old or young. In fact, one might ask if we can figure out this proportion just from this table.

The answer is a resounding yes, and leads us to marginal distributions. Given a joint distribution over random variables, one would hope that we could extract more specific probabilities, like the distribution over just one of those variables, which is called the marginal distribution. The marginal can be computed simply, by summing over all values of the other variable:

P(X = young) = p(young, no arthritis) + p(young, arthritis) = 1/2 + 1/100 = 51/100.

A young person either does or does not have arthritis, so summing over these two possible cases factors out that variable.
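The table and the marginal computation can be checked numerically. A minimal Python sketch, using exact fractions and the encoding young = 0, old = 1, no arthritis = 0, arthritis = 1:

```python
# Sketch of the joint probability table and the marginal computation,
# using the probabilities from the table (which are made up).
from fractions import Fraction as F

# p[(x, y)] with x in {0: young, 1: old}, y in {0: no arthritis, 1: arthritis}
p = {(0, 0): F(1, 2),  (0, 1): F(1, 100),
     (1, 0): F(1, 10), (1, 1): F(39, 100)}

assert sum(p.values()) == 1  # a valid joint pmf sums to 1

# Marginal: sum over all values of the other variable
p_x_young = p[(0, 0)] + p[(0, 1)]
p_x_old = p[(1, 0)] + p[(1, 1)]
print(p_x_young, p_x_old)  # 51/100 49/100
```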


SOME QUESTIONS WE MIGHT ASK NOW THAT WE HAVE TWO RANDOM VARIABLES


Are these two variables related? Or do they change completely independently of each other?

Given this joint distribution, can we determine just the distribution over arthritis? i.e., P(Y = 1)? (Marginal distribution)

If we knew something about one of the variables, say that the person is young, what then is the distribution over Y? (Conditional distribution)


EXAMPLE: MARGINAL AND CONDITIONAL DISTRIBUTION


P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 1/100 + 39/100 = 40/100. What is P(Y = 0)?

P(X = 1) = 49/100. So what is P(X = 0)?

P(Y = 1 | X = 0) = ? Is it 1/100, the value the table gives for P(Y = 1, X = 0)?

No


CONDITIONAL DISTRIBUTIONS

p(y|x) = p(x, y) / p(x)

If p(x, y) is small, does this imply that p(y|x) is small?
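It does not: if p(x) is also small, the ratio can be large. A quick sketch (the 0.01 and 0.02 values are made up to illustrate the point; the second computation uses the arthritis table from earlier):

```python
# Sketch: a small joint probability does not force a small conditional.
from fractions import Fraction as F

def cond(p_xy, p_x):
    """p(y|x) = p(x, y) / p(x)."""
    return p_xy / p_x

# The joint is tiny, but x itself is rare, so the conditional is large:
print(cond(0.01, 0.02))  # 0.5, even though p(x, y) = 0.01 is small

# With the arthritis table: p(Y=1 | X=0) = (1/100) / (51/100)
print(cond(F(1, 100), F(51, 100)))  # 1/51
```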


EXERCISE: CONDITIONAL DISTRIBUTION


P(Y = 1 | X = 0) = ?

What is P(Y = 0 | X = 0)? Should P(Y = 1 | X = 0) + P(Y = 0 | X = 0) = 1?

p(y|x) = p(x, y) / p(x)


JOINT DISTRIBUTIONS FOR MANY VARIABLES

Z = (X, Y), we can determine the proportion of the population that is young and the proportion that is old.

In general, we can consider a d-dimensional random variable X = (X_1, X_2, \ldots, X_d) with vector-valued outcomes x = (x_1, x_2, \ldots, x_d), such that each x_i is chosen from some \mathcal{X}_i. Then, for the discrete case, any function p : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_d \to [0, 1] is called a multidimensional probability mass function if

\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} \cdots \sum_{x_d \in \mathcal{X}_d} p(x_1, x_2, \ldots, x_d) = 1,

or, for the continuous case, p : \mathcal{X}_1 \times \cdots \times \mathcal{X}_d \to [0, \infty) is a multidimensional probability density function if

\int_{\mathcal{X}_1} \int_{\mathcal{X}_2} \cdots \int_{\mathcal{X}_d} p(x_1, x_2, \ldots, x_d) \, dx_1 dx_2 \ldots dx_d = 1.

A marginal distribution is defined for a subset of X = (X_1, X_2, \ldots, X_d) by summing or integrating over the remaining variables. For the discrete case, the marginal distribution p(x_i) is defined as

p(x_i) = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_{i-1} \in \mathcal{X}_{i-1}} \sum_{x_{i+1} \in \mathcal{X}_{i+1}} \cdots \sum_{x_d \in \mathcal{X}_d} p(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d),

where the variable x_i is fixed to some value and we sum over all possible values of the other variables. Similarly, for the continuous case, the marginal distribution p(x_i) is defined as

p(x_i) = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_{i-1}} \int_{\mathcal{X}_{i+1}} \cdots \int_{\mathcal{X}_d} p(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) \, dx_1 \ldots dx_{i-1} dx_{i+1} \ldots dx_d.

Notice that we use p to define the density over x, but then we overload this terminology and also use p for the density only over x_i. To be more precise, we should define two separate functions (pdfs), say p_X for the density over the multivariate random variable and p_{X_i} for the marginal. It is common, however, to simply use p, and infer the random variable from context. In most cases, it is clear; if it is not, we will explicitly highlight the pdfs with additional subscripts.

We can define common multivariate pmfs and pdfs that are extensions of the scalar pmfs and pdfs. The most useful extensions include tables of probability values, uniform distributions and Gaussian distributions. We define these concretely in the final section of this chapter, Section 1.5, for reference. Before we provide these definitions, however, it will be useful to understand how multiple variables interact, because this will influence the extension from univariate to multivariate. In particular, it will be useful to understand conditional distributions and dependence, which we discuss next.

1.3.1 Conditional distributions

Conditional probabilities define probabilities of a random variable Y, given information about the value of another random variable X. More formally, the conditional probability p(y|x) for two random variables X and Y is defined as

p(y|x) = p(x, y) / p(x)    (1.1)
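The discrete marginalization formula above can be sketched in a few lines of Python. The joint table here is a hypothetical pmf over three binary variables, not one from the notes:

```python
# A hypothetical joint pmf p(x1, x2, x3) over binary variables,
# stored as a dict from outcome tuples to probabilities (values sum to 1).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def marginal(i, value):
    # p(x_i = value): fix coordinate i and sum the joint over
    # all settings of the remaining variables.
    return sum(p for outcome, p in joint.items() if outcome[i] == value)

# Each marginal is itself a pmf: it sums to 1 over the values of x_i.
for i in range(3):
    assert abs(marginal(i, 0) + marginal(i, 1) - 1.0) < 1e-9

print(round(marginal(0, 0), 2))  # 0.5 = 0.10 + 0.05 + 0.20 + 0.15
```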


Page 44

MARGINAL DISTRIBUTIONS


Natural question: Why do you use p for p(x_i) and for p(x_1, ..., x_d)? They have different domains, so they can't be the same function!

Page 45

DROPPING SUBSCRIPTS

Instead of: p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x)

We will write: p(y|x) = p(x, y) / p(x)

Page 46

ANOTHER EXAMPLE FOR CONDITIONAL DISTRIBUTIONS

• Let X be a Bernoulli random variable (i.e., 1 with probability alpha and 0 with probability 1 - alpha)

• Let Y be a random variable in {10, 11, …, 1000}

• p(y | X = 0) and p(y | X = 1) are different distributions

• Two types of books: fiction (X=0) and non-fiction (X=1)

• Let Y correspond to number of pages

• Distribution over number of pages different for fiction and non-fiction books (e.g., average different)

Page 47

EXAMPLE CONTINUED

• Two types of books: fiction (X=0) and non-fiction (X=1)
• Y corresponds to number of pages
• If most books are non-fiction, p(X = 0, y) is small even if y is a likely number of pages for a fiction book
• p(X = 0) accounts for the fact that the joint probability is small if p(X = 0) is small
• p(y | X = 0) = p(X = 0, y)/p(X = 0)
• p(X = 0, y) = probability that a book is fiction and has y pages (imagine randomly sampling a book)
• p(X = 0) = probability that a book is fiction

Page 48

ANOTHER EXAMPLE

• Two types of books: fiction (X=0) and non-fiction (X=1)
• Let Y be a random variable over the reals, which corresponds to amount of money made
• p(y | X = 0) and p(y | X = 1) are different distributions
• e.g., even if both p(y | X = 0) and p(y | X = 1) are Gaussian, they likely have different means and variances

Page 49

THINK-PAIR-SHARE (5 MINUTES)

• How might you use a given Poisson distribution that models commute times? (Hint: recall modes)

• How might you pick lambda for a Poisson distribution, to model commute times? (Hint: the mean of a Poisson is lambda)

Page 50

Review so far

• PMFs (discrete) and PDFs (continuous)
• Joint probabilities
• Marginals
• Conditional probabilities


Page 51

CHAIN RULE

Two variable example: p(x, y) = p(x|y) p(y) = p(y|x) p(x)

General case:

p(x_1, \ldots, x_k) = p(x_k) \prod_{i=1}^{k-1} p(x_i \mid x_{i+1}, \ldots, x_k) = p(x_1) \prod_{i=2}^{k} p(x_i \mid x_1, \ldots, x_{i-1})
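The two-variable chain rule p(x, y) = p(y|x) p(x) can be sanity-checked numerically on any joint table; the table below is a hypothetical one over binary X and Y:

```python
# Verify the chain rule p(x, y) = p(y | x) p(x) on a small
# hypothetical joint table over binary X and Y.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_x(x):
    # Marginal of X: sum over y.
    return sum(p for (xv, _), p in joint.items() if xv == x)

def p_y_given_x(y, x):
    # Conditional from the definition p(y|x) = p(x, y) / p(x).
    return joint[(x, y)] / p_x(x)

# The factorization recovers every joint entry exactly.
for (x, y), p in joint.items():
    assert abs(p_y_given_x(y, x) * p_x(x) - p) < 1e-9
print("chain rule holds on this table")
```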

Page 52

CHAIN RULE

Three variable example:

p(x, y, z) = p(y|x, z) p(x, z) = p(y|x, z) p(x|z) p(z)
           = p(x|y, z) p(y|z) p(z)
           = p(x|y, z) p(z|y) p(y)

In general:

p(x_1, \ldots, x_k) = p(x_k) \prod_{i=1}^{k-1} p(x_i \mid x_{i+1}, \ldots, x_k)

Page 53

HOW DO WE GET BAYES RULE?

Recall chain rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x)

Bayes rule:

p(y|x) = p(x|y) p(y) / p(x)

Page 54

EXAMPLE: DRUG TEST

• Imagine p(DT = True | User = True) = 0.99
• Imagine p(User = True) = 0.005
• Imagine p(DT = True | User = False) = 0.01
• What is p(User = True | DT = True)?
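Plugging the slide's numbers into Bayes rule, with p(DT = True) obtained by marginalizing over User (the variable names below are just for readability):

```python
# Bayes rule for the drug-test example:
# p(User | DT) = p(DT | User) p(User) / p(DT).
p_dt_given_user = 0.99
p_user = 0.005
p_dt_given_nonuser = 0.01

# Marginalize over User to get p(DT = True).
p_dt = p_dt_given_user * p_user + p_dt_given_nonuser * (1 - p_user)
p_user_given_dt = p_dt_given_user * p_user / p_dt

print(round(p_user_given_dt, 3))  # 0.332
```

Despite the test being 99% sensitive, a positive result only implies about a 1/3 chance of actually being a user, because users are so rare in the population.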

Page 55

REMINDERS: JANUARY 14, 2020

• Thought Questions 1 due on January 23
  • Chapters 1-3
• You should be reading now
  • if you read the material before I lecture, it will (a) help you understand a lot better and (b) be easier to write coherent thought questions
• Assignment 1 due on January 30
• Any questions?

Page 56

INDEPENDENCE OF RANDOM VARIABLES

X and Y are independent if (dropping subscripts as before): p(x, y) = p(x) p(y)

X and Y are conditionally independent given Z if: p(x, y|z) = p(x|z) p(y|z)

Page 57

CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES

• Imagine you have a biased coin (does not flip 50% heads and 50% tails, but skewed towards one)

• Let Z = bias of a coin (say outcomes are 0.3, 0.5, 0.8 with associated probabilities 0.7, 0.2, 0.1)

  • what other outcome space could we consider?
  • what kinds of distributions?

• Let X and Y be consecutive flips of the coin
• Are X and Y independent?
• Are X and Y conditionally independent, given Z?

**(Basic example about an important issue in ML: hidden variables)

Page 58

CONDITIONAL INDEPENDENCE IS RELATIVE TO DISTRIBUTION YOU PICK, NOT OBJECTIVE FOR ALL DISTRIBUTIONS

• Explained on whiteboard
• What are p(X) and p(X | Z)?

Page 59

CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES

• Are X and Y independent? (don't know Z)
• Are X and Y conditionally independent, given Z?

**(Basic example about an important issue in ML: hidden variables)

(Diagram: the hidden bias Z points to the two flips X and Y, each taking value 0 or 1.)

  z    | 0.3 | 0.5 | 0.8    (bias)
  p(z) | 0.7 | 0.2 | 0.1    (probability of that bias)

Imagine we don't know Z and flip two 0s. Does that tell you anything about Z?

Is p(X, Y | Z) = p(X|Z) p(Y|Z)?
Is p(X, Y) = p(X) p(Y)?
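An exact computation for this example (assuming a flip comes up 1 with probability equal to the bias z) shows that X and Y are not independent once Z is marginalized out, even though they are independent given Z by construction:

```python
# Hidden-bias coin: Z is the bias, X and Y are two flips.
# Given Z, the flips are independent; marginally they are not.
biases = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}  # p(z) for each bias z

def p_flip(v, z):
    # A flip is 1 with probability z (the bias), else 0.
    return z if v == 1 else 1 - z

def p_xy(x, y):
    # Marginalize over the hidden bias:
    # p(x, y) = sum_z p(z) p(x|z) p(y|z).
    return sum(pz * p_flip(x, z) * p_flip(y, z) for z, pz in biases.items())

def p_x(x):
    return p_xy(x, 0) + p_xy(x, 1)

print(round(p_xy(0, 0), 4))       # 0.397
print(round(p_x(0) * p_x(0), 4))  # 0.3721 -- different, so X and Y are NOT independent
```

Seeing X = 0 shifts our belief about the hidden bias toward the tails-heavy coins, which in turn makes Y = 0 more likely: exactly the "two 0s tell you something about Z" intuition on the slide.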

Page 60

EXPECTED VALUE (MEAN, AVERAGE)

E[X] = \sum_{x \in \mathcal{X}} x \, p(x)       (X discrete)
E[X] = \int_{\mathcal{X}} x \, p(x) \, dx       (X continuous)

Exercise: Biased coin with p(Heads) = 0.8
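One way to check the exercise, encoding Heads as 1 and Tails as 0:

```python
# Expected value of a biased coin with p(Heads) = 0.8,
# encoding Heads as 1 and Tails as 0.
p = {1: 0.8, 0: 0.2}
expected = sum(x * px for x, px in p.items())
print(expected)  # 0.8
```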

Page 61

EXPECTATIONS WITH FUNCTIONS

1.2.5 Expectations and moments

Expectations of functions are defined as sums (or integrals) of function values weighted according to the probability mass (or density) function. Given a probability space (\mathcal{X}, \mathcal{B}(\mathcal{X}), P_X), we consider a function f : \mathcal{X} \to \mathbb{C} and define its expectation as

E[f(X)] = \sum_{x \in \mathcal{X}} f(x) p(x)       (X discrete)
E[f(X)] = \int_{\mathcal{X}} f(x) p(x) \, dx       (X continuous)

Note that we use a capital X for the random variable (with f(X) the random variable transformed by f) and lower case x when it is an instance (e.g., p(x) is the probability of a specific outcome). The capital X in E[f(X)] also specifies that the summation (integration) is defined over p(x); this will become more important later when we consider multiple random variables. It can happen that E[f(X)] = \pm\infty; in such cases we say that the expectation does not exist or is not well-defined.[4] For f(x) = x, we have the standard expectation E[X] = \sum_x x \, p(x), or the mean value of X. Using f(x) = x^k results in the k-th moment, f(x) = \log(1/p(x)) gives the well-known entropy function H(X), or differential entropy for continuous random variables, and f(x) = (x - E[X])^2 provides the variance of a random variable X, denoted by V[X]. Interestingly, the probability of some event A \subseteq \mathcal{X}[5] can also be expressed in the form of an expectation; i.e.,

P(A) = E[1(X \in A)],

where

1(t) = 1 if t is true, 0 if t is false       (1.5)

is an indicator function. With this, it is possible to express the cumulative distribution function as F_X(t) = E[1(X \in (-\infty, t])].

The function f(x) inside the expectation can also be complex-valued. For example, \varphi_X(t) = E[e^{itX}], where i is the imaginary unit, defines the characteristic function of X. The characteristic function is closely related to the inverse Fourier transform of p(x) and is useful in many forms of statistical inference. Several expectation functions are summarized in Table 1.1.

Given two random variables X and Y and a specific value x assigned to X, we define the conditional expectation as

E[f(Y)|x] = \sum_{y \in \mathcal{Y}} f(y) p(y|x)       (Y discrete)
E[f(Y)|x] = \int_{\mathcal{Y}} f(y) p(y|x) \, dy       (Y continuous)

[4] There is sometimes disagreement on terminology, and some definitions allow the expected value to be infinite, which, for example, still allows the strong law of large numbers. In that setting, an expectation is not well-defined only if both left and right improper integrals are infinite. For our purposes, this is splitting hairs.
[5] This notation is a bit loose; we should say A \in \mathcal{B}(\mathcal{X}) instead of A \subseteq \mathcal{X}. We used it to re-emphasize (through this footnote) that some subsets of continuous sets are not measurable.

(Slide annotation: in this course, f : \mathcal{X} \to \mathbb{R}.)

Exercise: Imagine you get 10 dollars if Heads is flipped, and lose 3 dollars if Tails.
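One way to work this exercise, assuming the biased coin from the previous slide with p(Heads) = 0.8 (the slide leaves the coin's bias open):

```python
# E[f(X)] where f maps the coin outcome to dollars:
# f(Heads) = 10, f(Tails) = -3, with p(Heads) = 0.8 assumed.
p = {"Heads": 0.8, "Tails": 0.2}
f = {"Heads": 10, "Tails": -3}
expected_winnings = sum(f[x] * px for x, px in p.items())
print(round(expected_winnings, 2))  # 7.4 = 10 * 0.8 - 3 * 0.2
```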

Page 62

CONDITIONAL EXPECTATIONS

E[Y | X = x] = \sum_{y \in \mathcal{Y}} y \, p(y|x)       (Y discrete)
E[Y | X = x] = \int_{\mathcal{Y}} y \, p(y|x) \, dy       (Y continuous)

Different expected value, depending on which x is observed

Page 63

3 MINUTE BREAK

E[Y | X = x] = \sum_{y \in \mathcal{Y}} y \, p(y|x)       (Y discrete)
E[Y | X = x] = \int_{\mathcal{Y}} y \, p(y|x) \, dy       (Y continuous)

What is E[Y | x] for the following?
- p(y | x) = Gaussian with N(mu = x, sigma^2 = 10)
- p(y | x) = Gaussian with N(mu = f(x), sigma^2 = 0.1)
- p(y | x) = Bernoulli with alpha = x

Turn to your neighbour and discuss a question that you have. If you have nothing, consider the questions above.

Page 64

PROPERTIES OF EXPECTATIONS

• E[cX] = c E[X], for a constant c
• E[X + Y] = E[X] + E[Y] (linearity of expectation)
• If X and Y independent, then E[XY] = E[X] E[Y]
• E[Y] = E[E[Y | X]], where the outer expectation is over X
  • called the Law of Total Expectation
• (prove these on the whiteboard)
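The Law of Total Expectation can be checked numerically on the hidden-bias coin example from earlier, taking Y to be the number of heads in two flips (a sketch, not from the notes):

```python
# Check E[Y] = E[E[Y | Z]] for Y = number of heads in two flips
# of a coin with hidden bias Z (biases and their probabilities
# taken from the earlier conditional-independence example).
p_z = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}

def p_flip(v, z):
    # A flip is 1 with probability z, else 0.
    return z if v == 1 else 1 - z

# Direct: enumerate the joint over (z, flip1, flip2) and weight by y = a + b.
direct = sum(
    pz * p_flip(a, z) * p_flip(b, z) * (a + b)
    for z, pz in p_z.items() for a in (0, 1) for b in (0, 1)
)

# Via conditioning: E[Y | Z = z] = 2z for two Bernoulli(z) flips,
# then take the outer expectation over Z.
total = sum(pz * 2 * z for z, pz in p_z.items())

print(round(direct, 4), round(total, 4))  # 0.78 0.78
```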

Page 65

VARIANCE

Variance(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2

Why? See if you can get this formula
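One way to get it: expand the square and use linearity of expectation (E[X] is just a constant, so it pulls out of the expectation):

```latex
\begin{align*}
\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]
  &= \mathbb{E}\big[X^2 - 2X\,\mathbb{E}[X] + \mathbb{E}[X]^2\big] \\
  &= \mathbb{E}[X^2] - 2\,\mathbb{E}[X]\,\mathbb{E}[X] + \mathbb{E}[X]^2 \\
  &= \mathbb{E}[X^2] - \mathbb{E}[X]^2
\end{align*}
```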

Page 66

COVARIANCE

where f : \mathcal{Y} \to \mathbb{C} is some function. Again, using f(y) = y results in E[Y|x] = \sum_y y \, p(y|x) or E[Y|x] = \int y \, p(y|x) \, dy. We shall see later that under some conditions E[Y|x] is referred to as the regression function. These types of integrals are often seen and evaluated in Bayesian statistics.

For two random variables X and Y we also define

E[f(X, Y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f(x, y) p(x, y)       (X, Y discrete)
E[f(X, Y)] = \int_{\mathcal{X}} \int_{\mathcal{Y}} f(x, y) p(x, y) \, dx \, dy       (X, Y continuous)

Expectations can also be defined over a single variable

E[f(X, y)] = \sum_{x \in \mathcal{X}} f(x, y) p(x)       (X discrete)
E[f(X, y)] = \int_{\mathcal{X}} f(x, y) p(x) \, dx       (X continuous)

where E[f(X, y)] is now a function of y.

We define the covariance function as

Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y],

with Cov[X, X] = V[X] being the variance of the random variable X. Similarly, we define a correlation function as

Corr[X, Y] = Cov[X, Y] / \sqrt{V[X] \cdot V[Y]},

which is simply the covariance function normalized by the product of standard deviations. Both covariance and correlation functions have wide applicability in statistics, machine learning, signal processing and many other disciplines. Several important expectations for two random variables are listed in Table 1.2.

Example 6: Three tosses of a fair coin (yet again). Consider the two random variables from Examples 3 and 5, and calculate the expectation and variance for both X and Y. Then calculate E[Y | X = 0].

We start by calculating E[X] = 0 \cdot p_X(0) + 1 \cdot p_X(1) = 1/2. Similarly,

E[Y] = \sum_{y=0}^{3} y \cdot p_Y(y) = p_Y(1) + 2 p_Y(2) + 3 p_Y(3) = 3/2.

The conditional expectation can be found as

E[Y | X = 0] = \sum_{y=0}^{3} y \cdot p_{Y|X}(y|0) = p_{Y|X}(1|0) + 2 p_{Y|X}(2|0) + 3 p_{Y|X}(3|0) = 1.


Page 67

PROPERTIES OF VARIANCES

• V[c] = 0 for a constant c
• V[cX] = c^2 V[X]
• V[X + Y] = V[X] + V[Y] + 2 Cov[X,Y]
• If X and Y are independent, V[X + Y] = V[X] + V[Y]
  • i.e., Cov[X,Y] = 0
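These variance properties are algebraic identities, so they hold exactly even for sample moments (with a consistent normalization). A quick check, assuming NumPy; the data is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = rng.normal(size=1_000)
c = 3.0

# V[cX] = c^2 V[X]
assert np.isclose(np.var(c * x), c**2 * np.var(x))

# V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
assert np.isclose(np.var(x + y), np.var(x) + np.var(y) + 2 * cov_xy)
print("identities hold")
```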

Page 68

INDEPENDENCE AND DECORRELATION

• Independent RVs have zero correlation
  • How can we tell? Hint: use Cov[X,Y] = E[XY] - E[X]E[Y]
• Uncorrelated RVs (zero correlation) might be dependent
• Correlation (Pearson’s correlation) shows linear relationships; can miss nonlinear ones
• Example: X normal RV, Y = X^2 (whiteboard)
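The whiteboard example (X normal, Y = X^2) can be sketched numerically, assuming NumPy: Y is a deterministic function of X, yet the covariance is approximately zero, since Cov[X, X^2] = E[X^3] − E[X] E[X^2] = 0 for a zero-mean symmetric X.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)  # zero-mean, symmetric
y = x**2                      # fully determined by x

# Cov[X, X^2] = E[X^3] - E[X] E[X^2] = 0, despite total dependence
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(abs(cov) < 0.05)  # near zero: correlation misses this nonlinear relationship
```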

Page 69

ALTERNATIVES: MUTUAL INFORMATION (USING KL)

• KL-divergence measures how different two distributions are

KL(p || q) = Σ_{x∈X} p(x) log (p(x) / q(x))
           = Σ_{x∈X} p(x) [log p(x) − log q(x)]
           = Σ_{x∈X} p(x) log p(x) − Σ_{x∈X} p(x) log q(x)

or, for continuous random variables,

KL(p || q) = ∫_X p(x) log (p(x) / q(x)) dx
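The discrete KL formula translates directly into code. A minimal sketch, assuming NumPy; the distributions p and q are illustrative:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, p))            # 0.0: a distribution has zero divergence from itself
print(kl(p, q), kl(q, p))  # both positive, but not equal: KL is asymmetric
```

The asymmetry is worth noticing: KL is a divergence, not a distance.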

Page 70

ALTERNATIVES: MUTUAL INFORMATION

• KL-divergence measures how different two distributions are
• Mutual Information between X and Y:
  • I(X; Y) = KL(p_{xy} || p_x p_y)
  • only zero when X and Y are independent
  • measure of the price for encoding (X,Y) as independent RVs, even when they are not
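Mutual information is just KL applied to a joint table versus the product of its marginals. A sketch assuming NumPy; the joint tables are made up:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = KL(p_xy || p_x p_y) for a joint probability table."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)  # marginal of X
    py = pxy.sum(axis=0, keepdims=True)  # marginal of Y
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log((pxy / (px * py))[mask])))

independent = np.outer([0.3, 0.7], [0.6, 0.4])  # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])              # X determines Y

print(mutual_information(independent))  # 0: nothing lost by encoding separately
print(mutual_information(dependent))    # log 2: one full bit of shared information
```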

Page 71

EXERCISE: MODELLING COMMUTE TIMES

• Let’s imagine we have 365 samples of commute times
• Say you wanted to model commute time C as a continuous RV (takes 7.6 minutes to get to work)
• This means we have to specify (or learn) the pdf; how?
• One option: pick a distribution type (e.g., Gaussian), and find the “best” parameters that match the data
• What are the parameters to learn?
• Is a Gaussian a good choice?
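One hedged sketch of the “pick a Gaussian and fit it” option: the maximum-likelihood parameters are simply the sample mean and sample variance. NumPy is assumed, and the commute data below is simulated, not real:

```python
import numpy as np

rng = np.random.default_rng(3)
commute_minutes = rng.gamma(shape=9.0, scale=2.5, size=365)  # simulated stand-in data

# Maximum-likelihood Gaussian fit: mu = sample mean, sigma^2 = sample variance
mu = float(np.mean(commute_minutes))
sigma2 = float(np.var(commute_minutes))

def gaussian_pdf(y, mu, sigma2):
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The fitted density is highest at the mean; note it also puts (small)
# mass on negative commute times, one hint that a Gaussian may be a poor choice
print(gaussian_pdf(mu, mu, sigma2) > gaussian_pdf(0.0, mu, sigma2))
```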

Page 72

EXERCISE: MODELLING COMMUTE TIMES (CONT.)

• Note a better choice is actually a Gamma distribution
• A Gaussian (or Gamma) distribution for commute times extrapolates between the recorded times in minutes

Page 73

EXERCISE: CONDITIONAL PROBABILITIES

• Using conditional probabilities, we can incorporate other external information (features)
• Let y be the commute time, x the day of the year
• Array of conditional probability values → p(y | x)
  • y = 1, 2, … and x = 1, 2, …, 365
• What other x could we use?

Page 74

EXERCISE: ADDING IN AUXILIARY INFORMATION (1)

• Mean, variance for p(y | x) could depend on the value of x
• Example: X = 1 if slippery out, and X = 0 else

p(y | X = 0) = N(µ0, σ0²)
p(y | X = 1) = N(µ1, σ1²)

Gaussian denoted by N

Page 75

EXERCISE: ADDING IN AUXILIARY INFORMATION (2)

• Can incorporate external information (features) by modelling parameter = function(features)

µ = Σ_{j=1}^{d} w_j x_j

p(y | x) = N(µ = Σ_{j=1}^{d} w_j x_j, σ²)

Gaussian denoted by N
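This conditional model can be sketched directly; the weights, features, and σ below are made-up values for illustration, not learned ones:

```python
import numpy as np

rng = np.random.default_rng(4)
w = np.array([0.5, 2.0, -1.0])  # illustrative weights
x = np.array([1.0, 3.0, 0.5])   # illustrative feature vector
sigma = 1.5

# mu = sum_{j=1}^{d} w_j x_j: the conditional mean is linear in the features
mu = float(w @ x)
print(mu)  # 0.5 + 6.0 - 0.5 = 6.0

# p(y | x) = N(mu, sigma^2): draw samples of y for this particular x
samples = rng.normal(loc=mu, scale=sigma, size=10_000)
print(abs(samples.mean() - mu) < 0.1)
```

This is exactly the modelling choice behind linear regression: only the mean of p(y | x) depends on the features.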

Page 76

MIXTURES OF DISTRIBUTIONS

Page 77

MIXTURES OF GAUSSIANS

Page 78

* Image from https://people.ucsc.edu/~ealdrich/Teaching/Econ114/LectureNotes/kde.html

MIXTURES CAN PRODUCE COMPLEX DISTRIBUTIONS
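As one concrete (hypothetical) instance, a two-component Gaussian mixture already produces a bimodal density that no single Gaussian can match; the weights and components below are arbitrary:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    return np.exp(-(y - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(y, weights, mus, sigmas):
    """p(y) = sum_k w_k N(y; mu_k, sigma_k^2), with weights summing to 1."""
    return sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))

weights, mus, sigmas = [0.4, 0.6], [0.0, 5.0], [1.0, 1.0]
ys = np.linspace(-5.0, 10.0, 1501)
density = mixture_pdf(ys, weights, mus, sigmas)

# Still a valid pdf: a Riemann sum over a wide range is close to 1
print(float(np.sum(density) * (ys[1] - ys[0])))
# Bimodal: peaks near both component means, with a dip in between
print(mixture_pdf(2.5, weights, mus, sigmas) < mixture_pdf(0.0, weights, mus, sigmas))
```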

Page 79

THINK-PAIR-SHARE (5 MINUTES)

• Notice that moving to continuous RVs puts more restrictions on the distributions we can define
• For discrete RVs, distributions are tables of probabilities and so are highly flexible
  • we can define any possible distribution
• So, why not just discretize our variables?
• Example: imagine you have an RV in range [-10, 10]
• You decide to discretize into chunks of size 0.01
• How many variables do you need to define the PMF?
• What if we had instead modelled it as a Gaussian? Or a mixture of two Gaussians?

Page 80

[Figure: histogram of occurrences vs. commute time, with a “Red lights” series]

THINK-PAIR-SHARE

• Example: imagine you have an RV in range [-10, 10]
• You decide to discretize into chunks of size 0.01
• How many variables do you need to define the PMF?
• What if we had instead modelled it as a Gaussian? Or a mixture of two Gaussians?
• Additional question if you have time: imagine you have a 2-dim. RV (in [-10, 10] x [-10, 10]). Now imagine you discretize to the same level. How many variables in this PMF? (i.e., this multidimensional array?)
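The counts asked for in this exercise can be checked with a few lines; the mixture parameter count below assumes two means, two variances, and one mixing weight:

```python
# Discretizing [-10, 10] into chunks of size 0.01
bins_1d = round(20 / 0.01)
print(bins_1d)  # 2000 entries in the PMF table (1999 free, since they sum to 1)

# Parametric alternatives over the same range
gaussian_params = 2         # a mean and a variance
mixture_params = 2 * 2 + 1  # two means, two variances, one mixing weight
print(gaussian_params, mixture_params)

# 2-dimensional RV in [-10, 10] x [-10, 10]: the table grows multiplicatively
bins_2d = bins_1d ** 2
print(bins_2d)  # 4,000,000 entries
```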

Page 81

EXERCISE: SAMPLE AVERAGE IS AN UNBIASED ESTIMATOR

Obtain instances x1, …, xn. What can we say about the sample average?

This sample is random, so we consider i.i.d. random variables X1, …, Xn. This reflects that we could have seen a different set of instances xi.

E[(1/n) Σ_{i=1}^{n} Xi] = (1/n) Σ_{i=1}^{n} E[Xi]
                        = (1/n) Σ_{i=1}^{n} µ
                        = µ

For any one sample x1, …, xn, it is unlikely that (1/n) Σ_{i=1}^{n} xi = µ.
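The derivation can be illustrated by simulation, assuming NumPy; the distribution, n, and number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, trials = 3.0, 10, 50_000

# Each row is one dataset x_1, ..., x_n; each trial records its sample average
averages = rng.normal(loc=mu, scale=2.0, size=(trials, n)).mean(axis=1)

# Any single sample average almost never equals mu exactly ...
print(np.mean(np.isclose(averages, mu)))
# ... but averaged over many repetitions it centers on mu (unbiasedness)
print(abs(averages.mean() - mu) < 0.02)
```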