Machine Learning
Week 2 Lecture 1
Quiz and Hand in data
• Test what you know so I can adapt!
• We need data for the hand in
Quiz
Any Problems? Any Questions?
Recap
Data Set
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Supervised Learning
[Figure: sample handwritten digit images with labels 5 0 4 1 9 2 1 3 1 4]
Target: House Price
Input: Size, Rooms, Age, Garage, …
Data: Historical Data of House Sales
Regression
Classification (10 classes)
Linear Models
House Price = 1234 · 1 + 88 · Size + 42 · Rooms − 666 · Age + 0.01 · Garage
Example:
Target: House Price
Input: Size, Rooms, Age, Garage, …
Data: Historical House Sales
Weight each input dimension so that it affects the target function in a useful way
h(x) = θ0 · x0 + θ1 · x1 + θ2 · x2 + θ3 · x3 + θ4 · x4
Linear in θ
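The linear model above can be sketched in code. This is a Python sketch (the course itself uses Matlab); the coefficients are the slide's illustrative house-price model, and the example house is hypothetical:

```python
import numpy as np

# Coefficients from the slide's illustrative model:
# price = 1234*1 + 88*size + 42*rooms - 666*age + 0.01*garage
theta = np.array([1234.0, 88.0, 42.0, -666.0, 0.01])

def predict(x):
    """Linear hypothesis h(x) = theta^T x, with x0 = 1 prepended as bias."""
    x = np.concatenate(([1.0], x))
    return theta @ x

# Hypothetical house: 120 m^2, 4 rooms, 10 years old, 1 garage
print(predict(np.array([120.0, 4.0, 10.0, 1.0])))  # -> 5302.01
```

Note that the model is linear in θ, which is what makes fitting it easy, even after a nonlinear transform of the inputs.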
Nonlinear Transform
(matrix product)
Three Models
Logistic Regression
Estimating Probabilities
Classification (Perceptron)
Regression
Classify y = 1 if θ(wᵀx) ≥ 1/2
Equivalent to wᵀx ≥ 0
Maximum Likelihood
Likelihood: L(w) = ∏ₙ P(yₙ | xₙ; w)
Use the logarithm to turn the product into a sum. Then optimize.
Assumption: independent data
For logistic regression this gives the cross-entropy error:
E_in(w) = (1/N) ∑ₙ ln(1 + e^(−yₙ wᵀxₙ))
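The cross-entropy error is short to write down in code. A Python sketch with labels yₙ ∈ {−1, +1} and a tiny hypothetical data set; at w = 0 every margin is zero, so the error is exactly ln 2:

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """In-sample cross-entropy error for logistic regression:
    E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)), labels in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

# Tiny hypothetical data set; the first column of X is the bias x0 = 1
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
print(cross_entropy_error(np.zeros(2), X, y))  # ln(2) at w = 0
```

Using `log1p` instead of `log(1 + …)` keeps the computation accurate for very small margins.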
Convex Optimization
[Figure: a convex and a non-convex function. For convex f, the chord between (x, f(x)) and (y, f(y)) lies above the graph, while the tangent line f(x) + f'(x)(y − x) lies below it.]
If f and g are convex and h is affine, combinations such as f + g and f ∘ h are convex
Local Minima are Global Minima
Descent Methods
Iteratively move toward a better solution
where f is twice continuously differentiable
• Pick start point x
• Repeat until stopping criterion satisfied:
  • Compute descent direction v
  • Line search: compute step size t
  • Update: x = x + t v
Simple Gradient Descent
• Pick start point x
• LR = 0.1
• Repeat 50 rounds:
  • Set v = −∇f(x)
  • Update: x = x + LR · v
Descent direction is the negative gradient: v = −∇f(x)
Step size: fixed at the learning rate LR
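The loop above is a few lines of code. A Python sketch; the quadratic objective and the step size used in the call are illustrative choices (with LR = 0.1 the second coordinate of this particular objective would oscillate instead of converging):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, rounds=50):
    """Fixed-step gradient descent: v = -grad(x), x = x + lr * v,
    repeated for a fixed number of rounds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        v = -grad(x)       # descent direction: negative gradient
        x = x + lr * v     # fixed step size (learning rate)
    return x

# Illustrative objective f(x) = x1^2 + 10*x2^2, gradient (2*x1, 20*x2)
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = gradient_descent(grad, [10.0, 1.0], lr=0.05)
print(x_min)   # close to the minimum at (0, 0)
```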
[Figure: gradient descent paths for three different learning rates]
Gradient Descent Jumps Around
Use Exact Line Search Starting From (10,1)
Gradient Checking
If you use gradient descent, make sure you compute the gradient correctly.
Choose a small h and compute
(f(x + h) − f(x − h)) / (2h)
Use this two-sided formula. It reduces the estimation error significantly (O(h²) instead of O(h)).
n-dimensional gradient: use the formula for each variable.
Usually works well.
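The per-variable recipe can be sketched directly. A Python sketch of the two-sided check; the test function and evaluation point are arbitrary choices:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Two-sided estimate (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    for each coordinate i."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

# f(x) = x1^2 + 3*x1*x2 has analytic gradient (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # close to (8, 3)
```

Comparing this estimate against your analytic gradient at a few random points catches most implementation bugs.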
Handin 1
• It comes online after class today
• Include Matlab examples but not a long intro. Google is your friend.
• Questions are always welcome
• Get busy
Supervised Learning
[Figure: sample handwritten digit images with labels 5 0 4 1 9 2 1 3 1 4]
Today
• Learning feasibility
• Probabilistic Approach
• Learning Formalized
Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Impossibility of Learning!

x1 x2 x3 | f(x)
0  0  0  | 1
1  0  0  | 0
0  1  0  | 1
1  1  0  | 1
0  0  1  | 0
1  0  1  | ?
0  1  1  | ?
1  1  1  | ?

What is f?
There are 256 potential functions on three binary inputs; 8 of them have in-sample error 0.
Assumptions are needed
No Free Lunch
"All models are wrong, but some models are useful." – George Box
Machine Learning has many different models and algorithms.
Assumptions that work well in one domain may fail in another.
There is no single model that works best for all problems (No Free Lunch Theorem)
Probabilistic Games
Probabilistic Approach
Repeat N times independently
What does the sample mean say about μ?
Sample mean: ν = #heads/N
With certainty? Nothing, really.
Probabilistically? Yes, the sample mean is likely close to the bias.
Sample: h,h,h,t,t,h,t,t,h
μ is unknown
Hoeffding's Inequality (Binary Variables)
P(|ν − μ| > ε) ≤ 2e^(−2ε²N)
The sample mean is probably close to μ.
The bound is independent of the sample mean and the actual probability, e.g. the probability distribution P(x).
The bound improves as the number of samples N grows.
Hoeffding's Inequality
Sample mean ν ≈ coin bias μ:
the sample mean is Probably Approximately Correct (PAC)
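A quick simulation makes the bound concrete. A sketch in Python (the choices of μ, N, ε, and the number of trials are arbitrary); the observed frequency of large deviations should stay below the Hoeffding bound 2e^(−2ε²N):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.5, 100, 0.1, 10000

# Flip N coins with bias mu, repeated over many independent trials
flips = rng.random((trials, N)) < mu
nu = flips.mean(axis=1)                          # sample mean per trial
empirical = np.mean(np.abs(nu - mu) > eps)       # frequency of big deviations

hoeffding = 2.0 * np.exp(-2.0 * eps ** 2 * N)    # the Hoeffding bound
print(empirical, hoeffding)                      # empirical is well below
```

The bound is loose here; Hoeffding holds for every μ and every distribution, which is exactly why it is useful.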
Classification Connection: Testing a Hypothesis
Fixed hypothesis h, unknown target f
μ is the probability of picking x such that f(x) ≠ h(x)
1 − μ is the probability of picking x such that f(x) = h(x)
μ is the sum of the probabilities of all the points x where the hypothesis is wrong
Probability distribution P(x) over the inputs
Sample mean ν ↔ in-sample error; μ ↔ out-of-sample error
Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Unknown Input Probability Distribution P(x)
Coins to hypotheses
Sample of size N: h,h,h,t,t,h,t,t,h
Sample mean ν
Unknown μ
Not Learning Yet
• Hypothesis fixed before seeing data
• Every hypothesis has its own error (a different coin for each hypothesis)
• In learning we have a training algorithm that picks the "best" hypothesis from the set
• We are only verifying a fixed hypothesis
• Hoeffding has left the building again.
Coin Analogy – Exercise 1.10 Book
• Flip a fair coin 10 times
• What is the probability of 10 heads?
• Repeat 1000 times (1000 coins)
• What is the probability that some coin has 10 heads? Approximately 63%
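The 63% figure is a one-line calculation, sketched here in Python:

```python
# Chance that at least one of 1000 fair coins comes up heads all 10 times
p_one = 0.5 ** 10                       # single coin: 1/1024
p_some = 1.0 - (1.0 - p_one) ** 1000    # complement of "no coin does it"
print(round(p_some, 3))                 # about 0.62
```

So even though each individual coin almost never shows 10 heads, some coin very likely does. This is why Hoeffding for a single fixed hypothesis does not cover a hypothesis chosen after seeing the data.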
Crude Approach: Apply the Union Bound
Union bound: P(A₁ ∪ A₂ ∪ … ∪ A_M) ≤ P(A₁) + P(A₂) + … + P(A_M)
P(|ν_h − μ_h| > ε for some hypothesis h) ≤ ∑ᵢ P(|ν_hᵢ − μ_hᵢ| > ε)
Apply the union bound, and then Hoeffding to each term.
Result
Finite Hypothesis set with M hypotheses.
Data Set with N points
Classification Problem. Error is f(x)≠h(x)
It explains the idea of what we are looking for (model complexity is a factor it seems)Our “simple” linear models have infinite size hypothesis sets…
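Solving the finite-M bound 2M·e^(−2ε²N) ≤ δ for N shows how the required sample size scales with the number of hypotheses. A sketch (the values of M, ε, and δ are arbitrary illustrative choices):

```python
import math

def samples_needed(M, eps, delta):
    """Smallest N with 2*M*exp(-2*eps**2*N) <= delta,
    i.e. N >= ln(2*M/delta) / (2*eps^2)."""
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))

print(samples_needed(1, 0.1, 0.05))     # single hypothesis
print(samples_needed(1000, 0.1, 0.05))  # N grows only logarithmically in M
```

Multiplying M by 1000 only roughly triples the required N, which is why a finite but large hypothesis set is still manageable; the real trouble starts when M is infinite.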
New Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Input Probability Distribution P(x)
finite X
Learning Feasibility
• Deterministically / with no assumptions: NOT SO MUCH
• Probabilistically: YES
  • Generalization: out-of-sample error close to in-sample error
  • Make the in-sample error small
• If the target function is complex, learning should be harder?
Error Functions
User specified, heavily problem dependent.
Identity system, fingerprints: is the person who he says he is?

h(x) \ f(x):   Lying            True
Est. Lying:    True Negative    False Negative
Est. True:     False Positive   True Positive

Walmart: discount for a given person. Error function (rejecting an honest customer is the costly mistake):

h(x) \ f(x):   Lying   True
Est. Lying:    0       10
Est. True:     1       0

CIA access (Friday bar stock). Error function (accepting an intruder is the costly mistake):

h(x) \ f(x):   Lying   True
Est. Lying:    0       1
Est. True:     1000    0
Error Functions: If Not Given
Base it on making the problem "solvable": making the problem smooth and convex seems like a good idea. Least-squares linear regression was very nice indeed.
Or base it on assumptions about the target and the noise:
Logistic regression gives cross entropy.
Assuming a linear target with Gaussian noise gives least squares.
Formalize Everything
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target
Hypothesis Set
Unknown Probability Distribution P(x)
Final Diagram
Unknown Target
Unknown Probability Distribution P(x)
Learn Importance
P(y | x)
Data Set
Learning Algorithm
Hypothesis Set
Final Hypothesis
Error Measure e
Words on the out-of-sample error
Imagine X and y are finite sets
Quick Summary
• Learning without assumptions is impossible
• Probabilistically, learning is possible
  – Hoeffding bound
  – Work needed for infinite hypothesis spaces!
• The error function depends on the problem
• Formalized learning approach
  – Ensure out-of-sample error is close to in-sample error
  – Minimize in-sample error
  – Complexity of the hypothesis set (size M currently) matters
  – More data helps