Machine Learning
Week 2 Lecture 1
Quiz and Hand in data
• Test what you know so I can adapt!
• We need data for the hand in
Quiz
Any Problems? Any Questions?
Recap
Data Set
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Supervised Learning
[Figure: sample handwritten digit images with labels 5 0 4 1 9 2 1 3 1 4]
Target: House Price
Input: Size, Rooms, Age, Garage, …
Data: Historical Data of House Sales
Regression
Classification (10 classes)
Linear Models
House Price = 1234 · 1 + 88 · Size + 42 · Rooms − 666 · Age + 0.01 · Garage
Example:
Target: House Price
Input: Size, Rooms, Age, Garage, …
Data: Historical House Sales
Weight each input dimension so that it affects the target function in a useful way
h(x) = θ0 · x0 + θ1 · x1 + θ2 · x2 + θ3 · x3 + θ4 · x4
Linear in θ
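The linear model above can be sketched in code. This is a Python sketch (the course itself uses Matlab); the coefficients are the slide's illustrative house-price model, and the example house is hypothetical:

```python
import numpy as np

# Coefficients from the slide's illustrative model:
# price = 1234*1 + 88*size + 42*rooms - 666*age + 0.01*garage
theta = np.array([1234.0, 88.0, 42.0, -666.0, 0.01])

def predict(x):
    """Linear hypothesis h(x) = theta^T x, with x0 = 1 prepended as bias."""
    x = np.concatenate(([1.0], x))
    return theta @ x

# Hypothetical house: 120 m^2, 4 rooms, 10 years old, 1 garage
print(predict(np.array([120.0, 4.0, 10.0, 1.0])))  # -> 5302.01
```

Note that the model is linear in θ, which is what makes fitting it easy, even after a nonlinear transform of the inputs.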
Nonlinear Transform
(matrix product)
Three Models
Logistic Regression
Estimating Probabilities
Classification (Perceptron)
Regression
Classify y = 1 if θ(wᵀx) ≥ 1/2
Equivalent to wᵀx ≥ 0
Maximum Likelihood
Likelihood: L(w) = ∏ₙ P(yₙ | xₙ; w)
Use the logarithm to turn the product into a sum. Then optimize.
Assumption: independent data
For logistic regression this gives the cross-entropy error:
E_in(w) = (1/N) ∑ₙ ln(1 + e^(−yₙ wᵀxₙ))
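The cross-entropy error is short to write down in code. A Python sketch with labels yₙ ∈ {−1, +1} and a tiny hypothetical data set; at w = 0 every margin is zero, so the error is exactly ln 2:

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """In-sample cross-entropy error for logistic regression:
    E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)), labels in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

# Tiny hypothetical data set; the first column of X is the bias x0 = 1
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
print(cross_entropy_error(np.zeros(2), X, y))  # ln(2) at w = 0
```

Using `log1p` instead of `log(1 + …)` keeps the computation accurate for very small margins.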
Convex Optimization
[Figure: a convex and a non-convex function. For convex f, the chord between (x, f(x)) and (y, f(y)) lies above the graph, while the tangent line f(x) + f'(x)(y − x) lies below it.]
If f and g are convex and h is affine, combinations such as f + g and f ∘ h are convex
Local Minima are Global Minima
Descent Methods
Iteratively move toward a better solution
where f is twice continuously differentiable
• Pick start point x
• Repeat until stopping criterion satisfied:
  • Compute descent direction v
  • Line search: compute step size t
  • Update: x = x + t v
Simple Gradient Descent
• Pick start point x
• LR = 0.1
• Repeat 50 rounds:
  • Set v = −∇f(x)
  • Update: x = x + LR · v
Descent direction is the negative gradient: v = −∇f(x)
Step size: fixed at the learning rate LR
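The loop above is a few lines of code. A Python sketch; the quadratic objective and the step size used in the call are illustrative choices (with LR = 0.1 the second coordinate of this particular objective would oscillate instead of converging):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, rounds=50):
    """Fixed-step gradient descent: v = -grad(x), x = x + lr * v,
    repeated for a fixed number of rounds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        v = -grad(x)       # descent direction: negative gradient
        x = x + lr * v     # fixed step size (learning rate)
    return x

# Illustrative objective f(x) = x1^2 + 10*x2^2, gradient (2*x1, 20*x2)
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = gradient_descent(grad, [10.0, 1.0], lr=0.05)
print(x_min)   # close to the minimum at (0, 0)
```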
[Figure: gradient descent paths for three different learning rates]
Gradient Descent Jumps Around
Use Exact Line Search Starting From (10,1)
Gradient Checking
If you use gradient descent, make sure you compute the gradient correctly.
Choose a small h and compute
(f(x + h) − f(x − h)) / (2h)
Use this two-sided formula. It reduces the estimation error significantly (O(h²) instead of O(h)).
n-dimensional gradient: use the formula for each variable.
Usually works well.
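The per-variable recipe can be sketched directly. A Python sketch of the two-sided check; the test function and evaluation point are arbitrary choices:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Two-sided estimate (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    for each coordinate i."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

# f(x) = x1^2 + 3*x1*x2 has analytic gradient (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # close to (8, 3)
```

Comparing this estimate against your analytic gradient at a few random points catches most implementation bugs.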
Handin 1
• It comes online after class today
• Include Matlab examples but not a long intro. Google is your friend.
• Questions are always welcome
• Get busy
Supervised Learning
[Figure: sample handwritten digit images with labels 5 0 4 1 9 2 1 3 1 4]
Today
• Learning feasibility
• Probabilistic Approach
• Learning Formalized
Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Impossibility of Learning!

x1 x2 x3 | f(x)
0  0  0  | 1
1  0  0  | 0
0  1  0  | 1
1  1  0  | 1
0  0  1  | 0
1  0  1  | ?
0  1  1  | ?
1  1  1  | ?

What is f?
There are 256 potential functions on three binary inputs; 8 of them have in-sample error 0.
Assumptions are needed
No Free Lunch
"All models are wrong, but some models are useful." – George Box
Machine Learning has many different models and algorithms.
Assumptions that work well in one domain may fail in another.
There is no single model that works best for all problems (No Free Lunch Theorem)
Probabilistic Games
Probabilistic Approach
Repeat N times independently
What does the sample mean say about μ?
Sample mean: ν = #heads/N
With certainty? Nothing, really.
Probabilistically? Yes, the sample mean is likely close to the bias.
Sample: h,h,h,t,t,h,t,t,h
μ is unknown
Hoeffding's Inequality (Binary Variables)
P(|ν − μ| > ε) ≤ 2e^(−2ε²N)
The sample mean is probably close to μ.
The bound is independent of the sample mean and the actual probability, e.g. the probability distribution P(x).
The bound improves as the number of samples N grows.
Hoeffding's Inequality
Sample mean ν ≈ coin bias μ:
the sample mean is Probably Approximately Correct (PAC)
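A quick simulation makes the bound concrete. A sketch in Python (the choices of μ, N, ε, and the number of trials are arbitrary); the observed frequency of large deviations should stay below the Hoeffding bound 2e^(−2ε²N):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.5, 100, 0.1, 10000

# Flip N coins with bias mu, repeated over many independent trials
flips = rng.random((trials, N)) < mu
nu = flips.mean(axis=1)                          # sample mean per trial
empirical = np.mean(np.abs(nu - mu) > eps)       # frequency of big deviations

hoeffding = 2.0 * np.exp(-2.0 * eps ** 2 * N)    # the Hoeffding bound
print(empirical, hoeffding)                      # empirical is well below
```

The bound is loose here; Hoeffding holds for every μ and every distribution, which is exactly why it is useful.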
Classification Connection: Testing a Hypothesis
Fixed hypothesis h, unknown target f
μ is the probability of picking x such that f(x) ≠ h(x)
1 − μ is the probability of picking x such that f(x) = h(x)
μ is the sum of the probabilities of all the points x where the hypothesis is wrong
Probability distribution P(x) over the inputs
Sample mean ν ↔ in-sample error; μ ↔ out-of-sample error
Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Unknown Input Probability Distribution P(x)
Coins to hypotheses
Sample of size N: h,h,h,t,t,h,t,t,h
Sample mean ν
Unknown μ
Not Learning Yet
• Hypothesis fixed before seeing data
• Every hypothesis has its own error (a different coin for each hypothesis)
• In learning we have a training algorithm that picks the "best" hypothesis from the set
• We are only verifying a fixed hypothesis
• Hoeffding has left the building again.
Coin Analogy – Exercise 1.10 Book
• Flip a fair coin 10 times
• What is the probability of 10 heads?
• Repeat 1000 times (1000 coins)
• What is the probability that some coin has 10 heads? Approximately 63%
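The 63% figure is a one-line calculation, sketched here in Python:

```python
# Chance that at least one of 1000 fair coins comes up heads all 10 times
p_one = 0.5 ** 10                       # single coin: 1/1024
p_some = 1.0 - (1.0 - p_one) ** 1000    # complement of "no coin does it"
print(round(p_some, 3))                 # about 0.62
```

So even though each individual coin almost never shows 10 heads, some coin very likely does. This is why Hoeffding for a single fixed hypothesis does not cover a hypothesis chosen after seeing the data.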
Crude Approach: Apply the Union Bound
Union bound: P(A₁ ∪ A₂ ∪ … ∪ A_M) ≤ P(A₁) + P(A₂) + … + P(A_M)
P(|ν_h − μ_h| > ε for some hypothesis h) ≤ ∑ᵢ P(|ν_hᵢ − μ_hᵢ| > ε)
Apply the union bound, and then Hoeffding to each term.
Result
Finite Hypothesis set with M hypotheses.
Data Set with N points
Classification Problem. Error is f(x)≠h(x)
It explains the idea of what we are looking for (model complexity is a factor it seems)Our “simple” linear models have infinite size hypothesis sets…
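Solving the finite-M bound 2M·e^(−2ε²N) ≤ δ for N shows how the required sample size scales with the number of hypotheses. A sketch (the values of M, ε, and δ are arbitrary illustrative choices):

```python
import math

def samples_needed(M, eps, delta):
    """Smallest N with 2*M*exp(-2*eps**2*N) <= delta,
    i.e. N >= ln(2*M/delta) / (2*eps^2)."""
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))

print(samples_needed(1, 0.1, 0.05))     # single hypothesis
print(samples_needed(1000, 0.1, 0.05))  # N grows only logarithmically in M
```

Multiplying M by 1000 only roughly triples the required N, which is why a finite but large hypothesis set is still manageable; the real trouble starts when M is infinite.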
New Learning Diagram
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target f
Hypothesis Set
Input Probability Distribution P(x)
finite X
Learning Feasibility
• Deterministically / with no assumptions: NOT SO MUCH
• Probabilistically: YES
  • Generalization: out-of-sample error close to in-sample error
  • Make the in-sample error small
• If the target function is complex, learning should be harder?
Error Functions
User specified, heavily problem dependent.
Identity system, fingerprints: is the person who he says he is?

h(x) \ f(x):   Lying            True
Est. Lying:    True Negative    False Negative
Est. True:     False Positive   True Positive

Walmart: discount for a given person. Error function (rejecting an honest customer is the costly mistake):

h(x) \ f(x):   Lying   True
Est. Lying:    0       10
Est. True:     1       0

CIA access (Friday bar stock). Error function (accepting an intruder is the costly mistake):

h(x) \ f(x):   Lying   True
Est. Lying:    0       1
Est. True:     1000    0
Error Functions: If Not Given
Base it on making the problem "solvable": making the problem smooth and convex seems like a good idea. Least-squares linear regression was very nice indeed.
Or base it on assumptions about the target and the noise:
Logistic regression gives cross entropy.
Assuming a linear target with Gaussian noise gives least squares.
Formalize Everything
Data Set(x1,y1,...,xn,yn)
Learning Algorithm
Hypothesis h: h(x) ≈ f(x)
Unknown Target
Hypothesis Set
Unknown Probability Distribution P(x)
Final Diagram
Unknown Target
Unknown Probability Distribution P(x)
Learn Importance
P(y | x)
Data Set
Learning Algorithm
Hypothesis Set
Final Hypothesis
Error Measure e
Words on the out-of-sample error
Imagine X and y are finite sets
Quick Summary
• Learning without assumptions is impossible
• Probabilistically, learning is possible
  – Hoeffding bound
  – Work needed for infinite hypothesis spaces!
• The error function depends on the problem
• Formalized learning approach
  – Ensure out-of-sample error is close to in-sample error
  – Minimize in-sample error
  – Complexity of the hypothesis set (size M currently) matters
  – More data helps