CS 273A: Machine Learning Fall 2021
Lecture 9: Logistic Regression
Roy Fox, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
All slides in this course adapted from Alex Ihler & Sameer Singh
Logistics
Assignments:
• Assignment 3 due next Tuesday, Oct 26
Midterm:
• Midterm exam on Nov 4, 11am–12:20 in SH 128
• If you're eligible to be remote — let us know by Oct 28
• If you're eligible for more time — let us know by Oct 28
• Review during lecture next Thursday
Separability
• Separable dataset = there's a model (in our class) with perfect prediction
• Separable problem = there's a model with 0 test loss: 𝔼x,y∼p[ℓ(y, ŷ(x))] = 0
‣ Also called realizable
• Linearly separable = separable by a linear classifier (hyperplane)
[Figures: linearly separable data vs. linearly non-separable data in the (x1, x2) plane]
Adding features
• How to make the perceptron more expressive?
‣ Add features — recall linear → polynomial regression
• Linearly non-separable:
• Linearly separable in quadratic features:
• Visualized in original feature space:
‣ Decision boundary: ax² + bx + c = 0
[Figures: 1D data on x1 that is not linearly separable; adding the quadratic feature x2 = x1² makes it linearly separable; prediction ŷ = T(ax² + bx + c)]
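To make this concrete, here is a minimal sketch (my own example, not from the slides) of adding the quadratic feature and checking that a hand-picked linear model in the augmented space separates 1D data that no single threshold on x can:

```python
import numpy as np

# Hypothetical 1D data: class +1 in the middle, class -1 on both sides.
# No single threshold on x separates it.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Augment with the quadratic feature x^2 (and a constant feature for the bias c).
phi = np.column_stack([np.ones_like(x), x, x ** 2])   # features [1, x, x^2]

# A hand-picked linear model in the augmented space: theta^T phi = c + b*x + a*x^2.
theta = np.array([1.0, 0.0, -1.0])                    # boundary 1 - x^2 = 0, i.e. |x| = 1

y_hat = np.sign(phi @ theta)                          # T(a*x^2 + b*x + c) with T = sign
print(np.all(y_hat == y))                             # True: separable in [1, x, x^2]
```

Any perceptron-style or gradient-based learner in the augmented feature space could find such a θ; the point is only that the boundary ax² + bx + c = 0 is linear in the features [1, x, x²].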
Adding features
• Which functions do we need to represent the decision boundary?
‣ When linear functions aren't sufficiently expressive
‣ Perhaps quadratic functions are
ax1² + bx1 + cx2² + dx2 + ex1x2 + f = 0

AND:
x1 x2 y
0  0  −1
0  1  −1
1  0  −1
1  1  +1

XOR:
x1 x2 y
0  0  −1
0  1  +1
1  0  +1
1  1  −1
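To see the contrast in code, a small brute-force check (my own construction; the coarse weight grid is just for illustration) that XOR admits no separating hyperplane in the raw features but becomes linearly separable once the cross-term x1·x2 is added:

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, +1, +1, -1])

def separable(features, labels, grid=np.linspace(-3, 3, 13)):
    """Brute-force search for a separating hyperplane over a coarse weight grid."""
    n = features.shape[1]
    for w in product(grid, repeat=n):
        for b in grid:
            if np.all(np.sign(features @ np.array(w) + b) == labels):
                return True
    return False

print(separable(X, y_xor))                       # False: XOR is not linearly separable
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])  # add the feature x1*x2
print(separable(X_aug, y_xor))                   # True: separable with the cross-term
```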
Separability in high dimension
• As we add more features → the dimensionality of each instance increases:
‣ Separability becomes easier: more parameters, more models that could separate
‣ Add enough (good) features, and even a linear classifier can separate
- Given a decision boundary f(x) = 0: add f(x) as a feature → linearly separable!
• However:
‣ Do these features explain test data or just training data?
‣ Increasing model complexity can lead to overfitting
Recap
• Perceptron = linear classifier
‣ Linear response → step decision function → discrete class prediction
‣ Linear decision boundary
• Separability = existence of a perfect model (in the class)
‣ Separable data: 0 loss on this data
‣ Separable problem: 0 loss on the data distribution
‣ Perceptron: linear separability
• Adding features:
‣ Complex features: complex decision boundary, easier separability
‣ Can lead to overfitting
Today's lecture
Learning perceptrons
Logistic regression
Multi-class classifiers
VC dimension
Learning a perceptron
• What do we need to learn the parameters θ of a perceptron?
‣ Training data 𝒟 = labeled instances
‣ Loss function ℒθ = error rate on labeled data
‣ Optimization algorithm = method for minimizing training loss
Error rate
• Error rate: ℒθ = (1/m) ∑i δ(y^(i) ≠ fθ(x^(i)))
‣ With the indicator δ(ŷ ≠ y) = 1 if ŷ ≠ y, 0 else
[Figure: example dataset with 2 of 9 points misclassified: ℒθ = 2/9]
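In code, the error rate is one line; the labels below are a made-up example matching the 2-out-of-9 figure:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """0-1 loss averaged over the dataset: (1/m) * sum of indicator(y != y_hat)."""
    return np.mean(y_true != y_pred)

# Hypothetical labels with 2 mistakes out of 9 points.
y_true = np.array([+1, +1, +1, -1, -1, -1, +1, -1, +1])
y_pred = np.array([+1, +1, -1, -1, -1, -1, +1, +1, +1])
print(error_rate(y_true, y_pred))   # 0.222... = 2/9
```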
Use linear regression?
• Idea: find θ using linear regression
• Affected by large regression losses
‣ We only care about the classification loss
Perceptron: gradient-based learning
• Problem: the loss function ℒθ(x, y) = δ(y ≠ sign(θ⊺x)) is not differentiable
‣ Write it differently: ℒθ(x, y) = ¼ (y − sign(θ⊺x))²
- ∇θ sign(θ⊺x) = 0 almost everywhere
‣ But we also don't want MSE = ½ (y − θ⊺x)²
‣ Compromise: ℒθ(x, y) = (y − sign(θ⊺x))(y − θ⊺x)
- ∇θℒθ = −(y − sign(θ⊺x))x = −(y − ŷ)x
[Figure: the step decision function T(r) as a function of the response r]
Perceptron training algorithm
• Similar to linear regression with MSE loss
‣ Except that ŷ is the class prediction, not the linear response
‣ No update for correct predictions: ŷ^(j) = y^(j)
‣ For incorrect predictions: ŷ^(j) − y^(j) = ±2
- update towards x (for false negative) or towards −x (for false positive)
[Algorithm figure: for each point j, predict the output ŷ^(j), then take a gradient step on the “weird” loss]
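A minimal sketch of this training loop, as I read the slide's pseudocode (the learning rate, epoch cap, and bias handling are my own choices):

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, epochs=100):
    """Online perceptron: only misclassified points trigger an update toward +/- x."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend a constant feature for the bias
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for x_j, y_j in zip(X, y):
            y_hat = np.sign(theta @ x_j) or -1.0   # treat sign(0) as -1
            if y_hat != y_j:                       # incorrect prediction: update weights
                theta -= alpha * (y_hat - y_j) * x_j   # gradient step on the "weird" loss
                updated = True
        if not updated:                            # convergence: no more updates
            return theta
    return theta
```

On a linearly separable dataset the loop exits once an entire pass makes no updates; otherwise it simply stops after the epoch budget.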
[Algorithm walkthrough figures: incorrect prediction → update weights; correct prediction → no update; convergence → no more updates]
Today's lecture
Learning perceptrons
Logistic regression
Multi-class classifiers
VC dimension
Perceptron
• Perceptron = linear classifier
‣ Parameters = weights θ (also denoted w)
‣ Response = weighted sum of the features: r = θ⊺x
‣ Prediction = thresholded response: ŷ(x) = T(r) = T(θ⊺x)
‣ Decision function (for T(r) = sign(r)): ŷ(x) = +1 if θ⊺x > 0, −1 otherwise
• Update rule: θ ← θ − α(ŷ − y)x, where ŷ − y is the error
Widening the classification margin
• Which decision boundary is “better”?
‣ Both have 0 training loss
‣ But one seems more robust, expected to generalize better
• Benefit of smooth loss function: care about margin
‣ Encourage distancing the boundary from data points
Surrogate loss functions
• Alternative: use a differentiable loss function
‣ E.g., approximate the step function with a smooth sigmoid function (= “looks like s”)
‣ Popular choice: logistic / sigmoid function σ(x) = 1 / (1 + exp(−x)) ∈ [0, 1]
‣ For this part, let's assume y ∈ {0, 1}
• MSE loss: ℒθ(x, y) = (y − σ(r(x)))²
‣ Far from the boundary: σ ≈ 0 or 1, loss approximates the 0–1 loss
‣ Near the boundary: σ ≈ ½, loss near ¼, but clear improvement direction
[Figure: step function T(r) vs. sigmoid σ(r), which crosses 0.5 at r = 0]
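A small sketch of the sigmoid and this surrogate loss (placeholder data shapes; assumes y ∈ {0, 1} as above):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def mse_surrogate_loss(theta, X, y):
    """(y - sigma(theta^T x))^2 averaged over the data, with y in {0, 1}."""
    r = X @ theta                        # linear responses, one per data point
    return np.mean((y - sigmoid(r)) ** 2)
```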
Learning smooth linear classifiers
• Use a gradient-based optimizer on the loss ℒθ(x, y) = (y − σ(θ⊺x))²
‣ −∇θℒθ(x, y) = 2 (y − σ(θ⊺x)) σ′(θ⊺x) x
• Logistic function: σ(r) = 1 / (1 + exp(−r))
• Its derivative: σ′(r) = σ(r)(1 − σ(r))
‣ Saturates for both r → ∞ and r → −∞
• Confidently correct prediction: σ(r) ≈ y ∈ {0, 1} ⟹ ∇θℒθ ≈ 0 (good)
• Confidently incorrect prediction: σ(r) ≈ 1 − y ⟹ ∇θℒθ ≈ 0 (bad)
[Figure: the gradient is the error (y − σ(r)) times the sensitivity σ′(r)]
Learning smooth linear classifiers
• With a smooth loss function ℒθ(x, y) = (y − σ(r(x)))² we can use Stochastic Gradient Descent
[Figures: SGD iterations, training loss decreasing ℒ = 1.9 → 0.4 → 0.1 until the minimum training MSE is reached]
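A sketch of the SGD loop on this loss (step size, epoch count, and initialization are arbitrary choices for illustration):

```python
import numpy as np

def sgd_logistic_mse(X, y, alpha=0.1, epochs=100, seed=0):
    """SGD on the surrogate loss (y - sigma(theta^T x))^2, with y in {0, 1}."""
    sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(X)):          # visit points in random order
            s = sigmoid(X[j] @ theta)
            # gradient of (y - sigma)^2 wrt theta: -2 (y - sigma) sigma' x, sigma' = s (1 - s)
            grad = -2.0 * (y[j] - s) * s * (1.0 - s) * X[j]
            theta -= alpha * grad                  # one stochastic gradient step
    return theta
```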
Maximum likelihood
• What if we had a probabilistic predictor pθ(y|x)?
• The better the parameter θ, the more likely the training data:
pθ(y^(1), …, y^(m) | x^(1), …, x^(m)) = ∏j pθ(y^(j) | x^(j))
• Bayesian interpretation?
‣ MAP: arg maxθ p(θ|𝒟) = arg maxθ p(θ) p(𝒟x) pθ(𝒟y|𝒟x) = arg maxθ pθ(𝒟y|𝒟x)
(except, often there's no uniform distribution over parameter space)
• Maximum log-likelihood: maxθ (1/m) ∑j log pθ(y^(j) | x^(j)) (an average over the training dataset)
Logistic Regression
• Can we turn a linear response into a probability? Sigmoid! σ : ℝ → [0, 1]
• Think of σ(θ⊺x) = pθ(y = 1 | x)
• Negative Log-Likelihood (NLL) loss:
ℒθ(x, y) = −log pθ(y|x) = −y log σ(θ⊺x) − (1 − y) log(1 − σ(θ⊺x))
(the first term is active for y = 1, the second for y = 0)
[Figure: example per-point losses, e.g. −log 0.99 and −log 0.7 for confident correct predictions, −log 0.1 for an incorrect one]
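The same NLL in code, as a hedged sketch; the clamp on the probabilities is a numerical-safety detail of mine, not part of the slide's math:

```python
import numpy as np

def nll_loss(theta, X, y, eps=1e-12):
    """NLL for y in {0, 1}: -y*log(p) - (1-y)*log(1-p), with p = sigma(theta^T x)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p = np.clip(p, eps, 1.0 - eps)      # avoid log(0); not part of the math itself
    return np.mean(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))
```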
Logistic Regression: gradient
• Logistic NLL loss: ℒθ(x, y) = −y log σ(θ⊺x) − (1 − y) log(1 − σ(θ⊺x))
• Gradient:
−∇θℒθ(x, y) = (y σ′(θ⊺x) / σ(θ⊺x) − (1 − y) σ′(θ⊺x) / (1 − σ(θ⊺x))) x
(the first term is the error for y = 1; the second is the error for y = 0, but updates toward −x)
= (y (1 − σ(θ⊺x)) − (1 − y) σ(θ⊺x)) x
= (y − pθ(y = 1|x)) x
• Compare:
‣ Perceptron: (y − ŷ)x (constant error ±2, insensitive to margin)
‣ Logistic MSE: −∇θℒθ(x, y) = 2 (y − σ(θ⊺x)) σ′(θ⊺x) x (0 gradient for bad mistakes)
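Using the simplified gradient (y − pθ(y = 1|x))x, a minimal batch gradient-descent sketch (learning rate and iteration count are placeholders):

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Gradient descent on the logistic NLL; uses -grad = (y - p) x averaged over the data."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # p_theta(y = 1 | x) for every point
        theta += alpha * X.T @ (y - p) / len(y)  # step along the negative gradient
    return theta
```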
Recap
• Linear classifiers:
‣ Perceptron
‣ Logistic classifier
• Measuring decision quality:
‣ Error rate / 0–1 loss
‣ MSE loss
‣ Negative log-likelihood (Logistic Regression)
• Learning the weights
‣ Perceptron algorithm — not quite gradient-based (or gradient of weird loss)
‣ Gradient-based optimization of surrogate loss (MSE / NLL)
Today's lecture
Learning perceptrons
Logistic regression
Multi-class classifiers
VC dimension
Multi-class linear models
• How to predict multiple classes?
• Idea: have a linear response per class: rc = θc⊺x
‣ Choose the class with the largest response: fθ(x) = arg maxc θc⊺x
• Linear boundary between classes c1, c2:
‣ θc1⊺x ≶ θc2⊺x ⟺ (θc1 − θc2)⊺x ≶ 0
Multi-class linear models
• More generally: add features Φ(x, y) — they can even depend on y!
fθ(x) = arg maxy θ⊺Φ(x, y)
• Example: y = ±1, Φ(x, y) = xy
‣ ⟹ fθ(x) = arg maxy y θ⊺x = +1 if +θ⊺x > −θ⊺x, −1 if +θ⊺x < −θ⊺x
= sign(θ⊺x): the perceptron!
Multi-class linear models
• More generally: add features Φ(x, y) — they can even depend on y: fθ(x) = arg maxy θ⊺Φ(x, y)
• Example: y ∈ {1, 2, …, C}
‣ Φ(x, y) = [0 0 ⋯ x ⋯ 0] = one-hot(y) ⊗ x
‣ θ = [θ1 ⋯ θC]
⟹ fθ(x) = arg maxc θc⊺x (the largest linear response)
Multi-class perceptron algorithm
• While not done:
‣ For each data point (x, y) ∈ 𝒟:
- Predict: ŷ = arg maxc θc⊺x
- Increase the response for the true class: θy ← θy + αx
- Decrease the response for the predicted class: θŷ ← θŷ − αx
• More generally:
‣ Predict: ŷ = arg maxy θ⊺Φ(x, y)
‣ Update: θ ← θ + α(Φ(x, y) − Φ(x, ŷ))
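A sketch of the per-class form of this algorithm (classes assumed to be encoded as 0, …, C−1; the stopping rule is a fixed epoch budget):

```python
import numpy as np

def multiclass_perceptron(X, y, num_classes, alpha=1.0, epochs=100):
    """One weight vector per class; raise the true class, lower the predicted one."""
    theta = np.zeros((num_classes, X.shape[1]))    # theta_c for c = 0..C-1
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            y_hat = int(np.argmax(theta @ x_j))    # class with the largest linear response
            if y_hat != y_j:
                theta[y_j] += alpha * x_j          # increase response for the true class
                theta[y_hat] -= alpha * x_j        # decrease response for the predicted class
    return theta
```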
Multilogit Regression
• Define multi-class probabilities: pθ(y|x) = exp(θy⊺x) / ∑c exp(θc⊺x) = softmaxc(θc⊺x)y
‣ For binary y:
pθ(y = 1|x) = exp(θ1⊺x) / (exp(θ1⊺x) + exp(θ2⊺x)) = 1 / (1 + exp((θ2 − θ1)⊺x)) = σ((θ1 − θ2)⊺x)
(Logistic Regression with θ = θ1 − θ2)
• Benefits:
‣ Probabilistic predictions: knows its confidence
‣ Linear decision boundary: arg maxy exp(θy⊺x) = arg maxy θy⊺x
‣ NLL is convex
Multilogit Regression: gradient
• NLL loss: ℒθ(x, y) = −log pθ(y|x) = −θy⊺x + log ∑c exp(θc⊺x)
• Gradient:
−∇θc ℒθ(x, y) = δ(y = c) x − (∇θc ∑c′ exp(θc′⊺x)) / ∑c′ exp(θc′⊺x)
= (δ(y = c) − exp(θc⊺x) / ∑c′ exp(θc′⊺x)) x
= (δ(y = c) − pθ(c|x)) x
‣ Makes the true class more likely and all other classes less likely
• Compare to the multi-class perceptron: (δ(y = c) − δ(ŷ = c)) x
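A small sketch of the softmax probabilities and this per-point gradient (classes again assumed to be encoded as 0, …, C−1; the max-shift in the softmax is a standard numerical-stability detail, not from the slide):

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def multilogit_neg_gradient(theta, x, y):
    """-grad_{theta_c} of the NLL for one (x, y): (delta(y = c) - p_theta(c | x)) * x."""
    p = softmax(theta @ x)               # p_theta(c | x) for every class c
    delta = np.zeros(len(p))
    delta[y] = 1.0                       # indicator of the true class
    return np.outer(delta - p, x)        # one row per class c
```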
Today's lecture
Learning perceptrons
Logistic regression
Multi-class classifiers
VC dimension
Complexity measures
• What are we looking for in a measure of model class complexity?
‣ Tell us something about the generalization error ℒtest − ℒtraining (also called: risk − empirical risk)
‣ Tell us how it depends on the amount of data m
‣ Be easy to find for a given model class — haha jk not gonna happen (more later)
• Ideally: a way to select model complexity (other than validation)
‣ Akaike Information Criterion (AIC) — roughly: loss + #parameters
‣ Bayesian Information Criterion (BIC) — roughly: loss + #parameters ⋅ log m
- But what's the #parameters, effectively? Should fθ1,θ2 = gθ=h(θ1,θ2) change the complexity?
Model expressiveness
• Model complexity also measures expressiveness / representational power
• Tradeoff:
‣ More expressive can reduce error, but may also overfit to training data
‣ Less expressive may not be able to represent the true pattern / trend
• Example: sign(θ0 + θ1x1 + θ2x2)
[Figure: linear decision boundaries in the (x1, x2) plane]
• Another example: sign(x1² + x2² − θ)
[Figure: circular decision boundaries centered at the origin]
Shattering
• Separability / realizability: there's a model that classifies all points correctly
• Shattering: the points are separable regardless of their labels
‣ Our model class can shatter points x^(1), …, x^(h) if for any labeling y^(1), …, y^(h) there exists a model that classifies all of them correctly
- The model class must have at least as many models as labelings: C^h
• Example: can fθ(x) = sign(θ0 + θ1x1 + θ2x2) shatter these points? [Figure: example points in the plane]
• Example: can fθ(x) = sign(x1² + x2² − θ) shatter these points? [Figure: example points in the plane]
Vapnik–Chervonenkis (VC) dimension
• VC dimension H: the maximum number of points that can be shattered by a class
• A game:
‣ Fix a model class fθ : x → y, θ ∈ Θ
‣ Player 1: choose h points x^(1), …, x^(h)
‣ Player 2: choose labels y^(1), …, y^(h)
‣ Player 1: choose a model θ
‣ Are all y^(j) = fθ(x^(j))? ⟹ Player 1 wins
• h ≤ H ⟹ Player 1 can win, otherwise cannot win:
∃x^(1), …, x^(h) : ∀y^(1), …, y^(h) : ∃θ : ∀j : y^(j) = fθ(x^(j))
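To make the game concrete, a brute-force shattering check for the circle classifier sign(x1² + x2² − θ); the point placements and threshold grid are my own, and this is only feasible for tiny numbers of points:

```python
import numpy as np
from itertools import product

def circle_classifier(points, theta):
    return np.sign(points[:, 0] ** 2 + points[:, 1] ** 2 - theta)

def can_shatter(points, thetas=np.linspace(-10, 10, 201)):
    """Player 2 picks any labeling; Player 1 must find some theta that matches it."""
    for labels in product([-1, +1], repeat=len(points)):
        if not any(np.all(circle_classifier(points, t) == labels) for t in thetas):
            return False                 # some labeling cannot be realized
    return True

print(can_shatter(np.array([[1.0, 0.0]])))               # True: one point can be shattered
print(can_shatter(np.array([[1.0, 0.0], [2.0, 0.0]])))   # False: consistent with VC-dim 1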
VC dimension: example (1)
• VC dimension: the maximum number of points that can be shattered by a class
• To find H, think like the winning player: Player 1 for h ≤ H, Player 2 for h > H
• Example: fθ(x) = sign(x1² + x2² − θ)
‣ We can place one point and “shatter” it
‣ We can prevent shattering any two points: make the distant one blue
‣ ⟹ H = 1
VC dimension: example (2)
• Example: fθ(x) = sign(θ0 + θ1x1 + θ2x2)
‣ We can place 3 points and shatter them
‣ We can prevent shattering any 4 points:
- If they form a convex shape, alternate labels
- Otherwise, label differently the point inside the triangle
‣ ⟹ H = 3
• Linear classifiers (perceptrons) of d features have VC-dim d + 1
‣ But VC-dim is generally not #parameters
VC Generalization bound
• The VC-dim H of a model class can be used to bound the generalization loss:
‣ With probability at least 1 − η, we will get a “good” dataset, for which
test loss − training loss ≤ √((H log(2m/H) + H − log(η/4)) / m)
(the left-hand side is the generalization loss)
• We need a larger training size m:
‣ The better generalization we need
‣ The more complex (higher VC-dim) our model class
‣ The more likely we want to get a good training sample
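A direct transcription of the bound for getting a feel for the numbers; the square root follows the standard form of the Vapnik bound, and the values of H, m, and η below are examples:

```python
import numpy as np

def vc_generalization_gap(H, m, eta=0.05):
    """Bound on (test loss - training loss) that holds with probability at least 1 - eta."""
    return np.sqrt((H * np.log(2 * m / H) + H - np.log(eta / 4)) / m)

# Example: a perceptron on d = 10 features (H = d + 1 = 11) and various training sizes.
for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(vc_generalization_gap(H=11, m=m), 3))
```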
Model selection with VC-dim
• Using validation / cross-validation:
‣ Estimate the loss on a held-out set
‣ Use the validation loss to select the model
• Using VC dimension:
‣ Use the generalization bound to select the model
‣ Structural Risk Minimization (SRM)
‣ The bound is not tight, so it may be too conservative
[Figures: training loss and validation loss vs. model complexity; training loss, VC bound, and test loss bound vs. model complexity]