Machine Learning
Subfield of AI concerned with learning from data. Broadly, using:
• Experience
• To Improve Performance
• On Some Task
(Tom Mitchell, 1997)
Unsupervised Learning
Input: X = {x1, …, xn}
Try to understand the structure of the data.
E.g., how many types of cars? How can they vary?
Dimensionality Reduction
X = {x1, …, xn}
If n is high, data can be hard to deal with:
• High-dimensional decision boundary.
• Need more data.
• But data is often not really high-dimensional.

Dimensionality reduction:
• Reduce or compress the data
• Try not to lose too much!
• Find intrinsic dimensionality
Dimensionality Reduction
For example, imagine if x1 and x2 are meaningful features, and x3 … xn are random noise.
What happens to k-nearest neighbors?
What happens to a decision tree?
What happens to the perceptron algorithm?
What happens if you want to do clustering?
Dimensionality Reduction
Often can be phrased as a projection:

f : X → X′

where:
• |X′| ≪ |X|
• Our goal: retain as much variance as possible.

Variance captures what varies within the data.
PCA
Principal Components Analysis. Project data into a new space:
• Dimensions are linearly uncorrelated.
• We have a measure of importance for each dimension.
PCA
PCA
• Gather data X1, …, Xm.
• Adjust data to be zero-mean:
  Xi ← Xi − (1/m) Σ_j Xj
• Compute covariance matrix C.
• Compute unit eigenvectors Vi and eigenvalues vi of C.

Each Vi is a direction, and each vi is its importance: the amount of the data's variance it accounts for.

New data points:
  X̂i = [V1, ..., Vp]ᵀ Xi
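A minimal sketch of these steps in Python with NumPy (the function name and return values are illustrative, not from the slides):

```python
import numpy as np

def pca(X, p):
    """PCA sketch. X: (m, d) array of m data points; p: target dimensionality."""
    # Adjust data to be zero-mean.
    X_centered = X - X.mean(axis=0)
    # Compute covariance matrix C (d x d).
    C = np.cov(X_centered, rowvar=False)
    # Unit eigenvectors V_i and eigenvalues v_i of C (eigh, since C is symmetric).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort directions by importance: the amount of variance each accounts for.
    order = np.argsort(eigvals)[::-1]
    V = eigvecs[:, order[:p]]              # d x p matrix [V_1, ..., V_p]
    # New data points: project onto the top-p eigenvectors.
    X_hat = X_centered @ V                 # (m, p)
    return X_hat, V, eigvals[order]
```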
Eigenfaces
(courtesy ORL database)
ISOMAP
Another approach:
• Estimate intrinsic geometric dimensionality of data.
• Recover natural distance metric.
ISOMAP
Core idea: the distance metric is locally Euclidean.
• Pick a small radius r and connect each point to its neighbors within r.
• Weight each edge by Euclidean distance.
ISOMAP
Solve all-pairs shortest paths:
• Transforms local distances into global distances.
• Compute embedding.
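A sketch of this pipeline (radius graph, shortest paths, then classical MDS for the embedding). The function name and the use of SciPy are my own choices, and it assumes r is large enough that the neighborhood graph is connected:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, r, d):
    """ISOMAP sketch. X: (n, p) data, r: neighborhood radius, d: embedding dimension."""
    # Connect each point to its neighbors within radius r, weighted by Euclidean distance.
    E = cdist(X, X)
    graph = np.where(E <= r, E, 0.0)       # zero entries are treated as "no edge"
    # All-pairs shortest paths: turns local distances into global (geodesic) distances.
    D = shortest_path(graph, directed=False)
    # Classical MDS on the geodesic distances gives the embedding.
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```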
ISOMAP
From Tenenbaum, de Silva, and Langford, Science 290:2319-2323, December 2000.
Reinforcement Learning
Reinforcement Learning
Learning counterpart of planning.

max_π R = Σ_{t=0}^∞ γ^t r_t,    π : S → A
RL
The problem of learning how to interact with an environment to maximize reward.
RL
Agent interacts with an environment. At each time t:
• Receives sensor signal s_t
• Executes action a_t
• Transition:
  • new sensor signal s_{t+1}
  • reward r_t
Goal: find policy π that maximizes expected return (sum of discounted future rewards):

max_π E[ R = Σ_{t=0}^∞ γ^t r_t ]
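As a concrete sketch of this interaction loop and the discounted return it accumulates, assuming a hypothetical gym-style environment interface (reset() → s, step(a) → (s', r, done)):

```python
def rollout(env, policy, gamma=0.99, horizon=1000):
    """Run one episode and accumulate the discounted return R = sum_t gamma^t * r_t.

    Assumes a gym-style interface: env.reset() -> s, env.step(a) -> (s', r, done).
    """
    s = env.reset()
    R, discount = 0.0, 1.0
    for t in range(horizon):
        a = policy(s)             # policy pi : S -> A
        s, r, done = env.step(a)  # transition: new sensor signal and reward
        R += discount * r         # add gamma^t * r_t
        discount *= gamma
        if done:
            break
    return R
```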
RL
This formulation is general enough to encompass a wide variety of learned control problems.
Markov Decision Processes
An MDP is a tuple ⟨S, A, γ, R, T⟩:
• S: set of states
• A: set of actions
• γ: discount factor
• R(s, a, s′): reward function. The reward received taking action a from state s and transitioning to state s′.
• T(s′|s, a): transition function. The probability of transitioning to state s′ after taking action a in state s.

RL: one or both of T, R unknown.
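One way to write this tuple down in code, as a tabular container (the class and field names are illustrative, not from the slides):

```python
from dataclasses import dataclass
from typing import Dict, Sequence

@dataclass
class MDP:
    """Tabular MDP <S, A, gamma, R, T>."""
    states: Sequence    # S: set of states
    actions: Sequence   # A: set of actions
    gamma: float        # discount factor
    R: Dict             # R[s][a][s']: reward for taking a in s and landing in s'
    T: Dict             # T[s][a][s']: probability of landing in s' after taking a in s
```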
RL
Example:
• States: set of grid locations
• Actions: up, down, left, right
• Transition function: move in direction of action with p=0.9
• Reward function: -1 for every step, 1000 for finding the goal
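A sketch of this gridworld using the MDP container above. The grid size, discount factor, and what happens with the remaining probability 0.1 (here: the agent stays put) are not specified on the slide; they are assumptions for illustration:

```python
def make_gridworld(width=5, height=5, goal=(4, 4), p_move=0.9, gamma=0.95):
    """Build the gridworld example as a tabular MDP."""
    states = [(x, y) for x in range(width) for y in range(height)]
    actions = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    T, R = {}, {}
    for s in states:
        T[s], R[s] = {}, {}
        for a, (dx, dy) in actions.items():
            # Move in the direction of the action with p=0.9 (clipped at the walls),
            # otherwise stay put (assumption).
            nxt = (min(max(s[0] + dx, 0), width - 1),
                   min(max(s[1] + dy, 0), height - 1))
            T[s][a] = {nxt: p_move, s: 1.0 - p_move} if nxt != s else {s: 1.0}
            # -1 for every step, 1000 for finding the goal.
            R[s][a] = {s2: (1000.0 if s2 == goal else -1.0) for s2 in T[s][a]}
    return MDP(states, list(actions), gamma, R, T)
```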
RL
Example:
• States: (θ1, θ̇1, θ2, θ̇2) (a real-valued vector)
• Actions: +1, -1, 0 units of torque added to elbow
• Transition function: physics!
• Reward function: -1 for every step
MDPs
Our target is a policy:

π : S → A

A policy maps states to actions.

The optimal policy maximizes:

max_π ∀s, E[ R(s) = Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

This means that we wish to find a policy that maximizes the return from every state.
Value Functions
Given a policy, we can estimate R(s) for every state.
• This is a value function.
• It can be used to improve our policy.

V_π(s) = E[ Σ_{t=0}^∞ γ^t r_t | π, s_0 = s ]

This is the value of state s under policy π.
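When T and R are known (the planning setting), V_π can be estimated by iterating the Bellman expectation backup to convergence. A sketch using the tabular MDP container from above, where policy is a dict mapping states to actions:

```python
def policy_evaluation(mdp, policy, tol=1e-6):
    """Estimate V_pi(s) for every state by repeatedly applying
    V(s) <- sum_{s'} T(s'|s, pi(s)) * (R(s, pi(s), s') + gamma * V(s'))."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            a = policy[s]
            v_new = sum(p * (mdp.R[s][a][s2] + mdp.gamma * V[s2])
                        for s2, p in mdp.T[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```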
Value Functions
Similarly, we define a state-action value function as follows:

Q_π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | π, s_0 = s, a_0 = a ]

This is the value of executing a in state s, then following π.

Note that:
Q_π(s, π(s)) = V_π(s)
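Given V_π and a known model, the state-action values follow from a one-step lookahead; plugging in a = π(s) recovers V_π(s), matching the identity above. A sketch continuing the tabular setup:

```python
def q_from_v(mdp, V, s, a):
    """Q_pi(s, a) = sum_{s'} T(s'|s, a) * (R(s, a, s') + gamma * V_pi(s'))."""
    return sum(p * (mdp.R[s][a][s2] + mdp.gamma * V[s2])
               for s2, p in mdp.T[s][a].items())
```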
Policy Iteration
Recall that we seek the policy that maximizes V_π(s), ∀s.

Therefore we know that, for the optimal policy π*:

V_π*(s) ≥ V_π(s), ∀π, s
Q_π*(s, a) ≥ Q_π(s, a), ∀π, s, a

This means that any change to π that increases Q anywhere obtains a better policy.
Policy Iteration
This leads to a general policy improvement framework:
1. Start with a policy π
2. Learn Q_π
3. Improve π:
   a. π(s) = argmax_a Q(s, a), ∀s
Repeat.

This is known as policy iteration. It is guaranteed to converge to the optimal policy.
Steps 2 and 3 can be interleaved as rapidly as you like. Usually, perform 3a every time step.
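A sketch of this loop in the model-based case, reusing policy_evaluation and q_from_v from above. Here Q_π is computed exactly from the known model rather than learned; learning Q is the part deferred to Friday:

```python
def policy_iteration(mdp):
    """Policy iteration: evaluate the current policy, then act greedily w.r.t. Q_pi."""
    # 1. Start with an arbitrary policy pi.
    policy = {s: mdp.actions[0] for s in mdp.states}
    while True:
        # 2. Compute Q_pi (via V_pi and one-step lookaheads).
        V = policy_evaluation(mdp, policy)
        # 3. Improve: pi(s) = argmax_a Q(s, a), for all s.
        stable = True
        for s in mdp.states:
            best_a = max(mdp.actions, key=lambda a: q_from_v(mdp, V, s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:   # Repeat until the policy stops changing.
            return policy, V
```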
Friday
How to learn Q!