Lecture 3: Probability and Statistics

Machine Learning

Queens College

• Probability and Statistics– Naïve Bayes Classification

• Linear Algebra– Matrix Multiplication– Matrix Inversion

• Calculus– Vector Calculus– Optimization– Lagrange Multipliers

Classical Artificial Intelligence

• Expert Systems• Theorem Provers• Shakey• Chess

• Largely characterized by determinism.

Modern Artificial Intelligence

• Fingerprint ID• Internet Search• Vision – facial ID, object recognition• Speech Recognition• Asimo• Jeopardy!

• Statistical modeling to generalize from data.

Two Caveats about Statistical Modeling

• Black Swans• The Long Tail

Black Swans

• In the 17th Century, all known swans were white.• Based on evidence, it is impossible for a swan to

be anything other than white.

• In the 18th Century, black swans were discovered in Western Australia

• Black Swans are rare, sometimes unpredictable events, that have extreme impact

• Almost all statistical models underestimate the likelihood of unseen events.

The Long Tail

• Many events follow an exponential distribution• These distributions have a very long “tail”.

– I.e. A large region with significant probability mass, but low likelihood at any particular point.

• Often, interesting events occur in the Long Tail, but it is difficult to accurately model behavior in this region.

Boxes and Balls

• 2 Boxes, one red and one blue.• Each contain colored balls.

Boxes and Balls

• Suppose we randomly select a box, then randomly draw a ball from that box.

• The identity of the Box is a random variable, B.

• The identity of the ball is a random variable, L.

• B can take 2 values, r, or b• L can take 2 values, g or o.

Boxes and Balls

• Given some information about B and L, we want to ask questions about the likelihood of different events.

• What is the probability of selecting an apple?

• If I chose an orange ball, what is the probability that I chose from the blue box?

Some basics

• The probability (or likelihood) of an event is the fraction of times that the event occurs out of n trials, as n approaches infinity.

• Probabilities lie in the range [0,1]• Mutually exclusive events are events that cannot

simultaneously occur.– The sum of the likelihoods of all mutually exclusive events

must equal 1.

• If two events are independent then,

p(X, Y) = p(X)p(Y)

p(X|Y) = p(X)

Joint Probability – P(X,Y)

• A Joint Probability function defines the likelihood of two (or more) events occurring.

• Let nij be the number of times event i and event j simultaneously occur.

Orange Green

Blue box 1 3 4

Red box 6 2 8

7 5 12

Generalizing the Joint Probability

Marginalization

• Consider the probability of X irrespective of Y.

• The number of instances in column j is the sum of instances in each cell

• Therefore, we can marginalize or “sum over” Y:

Conditional Probability

• Consider only instances where X = xj.

• The fraction of these instances where Y = yi is the conditional probability– “The probability of y given x”

Relating the Joint, Conditional and Marginal

Sum and Product Rules

• In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x).

Sum Rule

Product Rule

Bayes Rule

Interpretation of Bayes Rule

• Prior: Information we have before observation.

• Posterior: The distribution of Y after observing X

• Likelihood: The likelihood of observing X given Y

PriorPosterior

Likelihood

Boxes and Balls with Bayes Rule

• Assuming I’m inherently more likely to select the red box (66.6%) than the blue box (33.3%).

• If I selected an orange ball, what is the likelihood that I selected the red box? – The blue box?

Boxes and Balls

Naïve Bayes Classification

• This is a simple case of a simple classification approach.

• Here the Box is the class, and the colored ball is a feature, or the observation.

• We can extend this Bayesian classification approach to incorporate more independent features.

• Some theory first.

• Assuming independent features simplifies the math.

Naïve Bayes Example Data

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD LIGHT SOFT BLUE

COLD HEAVY FIRM BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT HEAVY FIRM ?????

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD HEAVY FIRM BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

Prior:

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD HEAVY SOFT BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD HEAVY SOFT BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

Continuous Probabilities

• So far, X has been discrete where it can take one of M values.

• What if X is continuous?• Now p(x) is a continuous probability density

function.• The probability that x will lie in an interval (a,b)

Continuous probability example

Properties of probability density functions

Sum Rule

Product Rule

Expected Values

• Given a random variable, with a distribution p(X), what is the expected value of X?

Multinomial Distribution

• If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.

• The probability of x being in state k is μk

Expected Value of a Multinomial

• The expected value is the mean values.

Gaussian Distribution

• One Dimension

• D-Dimensions

Gaussians

How machine learning uses statistical modeling

• Expectation– The expected value of a function is the

hypothesis

• Variance– The variance is the confidence in that

hypothesis

Variance

• The variance of a random variable describes how much variability around the expected value there is.

• Calculated as the expected squared error.

Covariance

• The covariance of two random variables expresses how they vary together.

• If two variables are independent, their covariance equals zero.

Next Time: Math Primer

• Linear Algebra– Definitions and Properties

• Calculus– Review of derivation and integration

• Vector Calculus– Derivation w.r.t. a vector or matrix

• Lagrange Multipliers– Constrained optimization

Lecture 3: Probability and Statistics

Documents

Lecture 1 Dustin Lueker. Statistical terminology Descriptive statistics Probability and distribution functions Inferential statistics ◦ Estimation

Probability & Statistics Lecture 1

Lecture notes 031 Probability and Statistics Lecture notes 03

lecture notes on probability, statistics and linear algebra

Statistics 67 Introduction to Probability and Statistics ...sternh/courses/67/stat67slides.pdf · Introduction to Probability and Statistics for Computer Science Lecture notes for

June 3, 2008Stat 111 - Lecture 6 - Probability1 Probability Introduction to Probability, Conditional Probability and Random Variables Statistics 111 -

MA-250 Probability and Statistics Nazar Khan PUCIT Lecture 3

Machine Learning Queens College Lecture 3: Probability and Statistics

G. Cowan 2011 CERN Summer Student Lectures on Statistics / Lecture 21 Introduction to Statistics − Day 2 Lecture 1 Probability Random variables, probability

Statistics Lecture Notes Dr. Halil İbrahim CEBECİ Chapter 05 Probability

Lecture Notes(Introduction to Probability and Statistics)

MA-250 Probability and Statistics Nazar Khan PUCIT Lecture 5

lecture 1: introduction · · 2017-11-21Probability, statistics and computation Probability: • analysisofrandomphenomena(propertiesofprobability distributionsandmodels). Statistics

MA-250 Probability and Statistics Nazar Khan PUCIT Lecture 21

Standford - HRP 259 Introduction to Probability and Statistics - Lecture 12

Lecture 1 Probability and Statistics

G. Cowan 2011 CERN Summer Student Lectures on Statistics / Lecture 31 Introduction to Statistics − Day 3 Lecture 1 Probability Random variables, probability

MA-250 Probability and Statistics Nazar Khan PUCIT Lecture 4

SDS 321: Introduction to Probability and Statistics ...psarkar/sds321/lecture3-ps.pdf · SDS 321: Introduction to Probability and Statistics Lecture 3: Conditional probability and

MA-250 Probability and Statistics Nazar Khan PUCIT Lecture 13