Machine Learning
Subfield of AI concerned with learning from data. Broadly, using:
• Experience
• To Improve Performance
• On Some Task
(Tom Mitchell, 1997)
Unsupervised Learning
Input: X = {x1, …, xn}
Try to understand the structure of the data.
E.g., how many types of cars? How can they vary?
Dimensionality Reduction
X = {x1, …, xn}
If n is high, data can be hard to deal with:
• High-dimensional decision boundary.
• Need more data.
• But data is often not really high-dimensional.

Dimensionality reduction:
• Reduce or compress the data
• Try not to lose too much!
• Find intrinsic dimensionality
Dimensionality Reduction
For example, imagine if x1 and x2 are meaningful features, and x3 … xn are random noise.
What happens to k-nearest neighbors?
What happens to a decision tree?
What happens to the perceptron algorithm?
What happens if you want to do clustering?
Dimensionality Reduction
Often can be phrased as a projection:

f : X → X′

where:
• |X′| ≪ |X|
• Our goal: retain as much variance as possible.

Variance captures what varies within the data.
PCA
Principal Components Analysis. Project data into a new space:
• Dimensions are linearly uncorrelated.
• We have a measure of importance for each dimension.
PCA
PCA
• Gather data X1, …, Xm.
• Adjust data to be zero-mean:
  Xi ← Xi − (1/m) Σ_j Xj
• Compute covariance matrix C.
• Compute unit eigenvectors Vi and eigenvalues vi of C.

Each Vi is a direction, and each vi is its importance: the amount of the data's variance it accounts for.

New data points:
  X̂i = [V1, ..., Vp]ᵀ Xi
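A minimal sketch of these steps in Python with NumPy (the function name and return values are illustrative, not from the slides):

```python
import numpy as np

def pca(X, p):
    """PCA sketch. X: (m, d) array of m data points; p: target dimensionality."""
    # Adjust data to be zero-mean.
    X_centered = X - X.mean(axis=0)
    # Compute covariance matrix C (d x d).
    C = np.cov(X_centered, rowvar=False)
    # Unit eigenvectors V_i and eigenvalues v_i of C (eigh, since C is symmetric).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort directions by importance: the amount of variance each accounts for.
    order = np.argsort(eigvals)[::-1]
    V = eigvecs[:, order[:p]]              # d x p matrix [V_1, ..., V_p]
    # New data points: project onto the top-p eigenvectors.
    X_hat = X_centered @ V                 # (m, p)
    return X_hat, V, eigvals[order]
```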
Eigenfaces
(courtesy ORL database)
ISOMAP
Another approach:
• Estimate intrinsic geometric dimensionality of data.
• Recover natural distance metric.
ISOMAP
Core idea: the distance metric is locally Euclidean.
• Pick a small radius r and connect each point to its neighbors within r.
• Weight each edge by Euclidean distance.
ISOMAP
Solve all-pairs shortest paths:
• Transforms local distances into global distances.
• Compute embedding.
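A sketch of this pipeline (radius graph, shortest paths, then classical MDS for the embedding). The function name and the use of SciPy are my own choices, and it assumes r is large enough that the neighborhood graph is connected:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, r, d):
    """ISOMAP sketch. X: (n, p) data, r: neighborhood radius, d: embedding dimension."""
    # Connect each point to its neighbors within radius r, weighted by Euclidean distance.
    E = cdist(X, X)
    graph = np.where(E <= r, E, 0.0)       # zero entries are treated as "no edge"
    # All-pairs shortest paths: turns local distances into global (geodesic) distances.
    D = shortest_path(graph, directed=False)
    # Classical MDS on the geodesic distances gives the embedding.
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```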
ISOMAP
From Tenenbaum, de Silva, and Langford, Science 290:2319-2323, December 2000.
Reinforcement Learning
Reinforcement Learning
Learning counterpart of planning.

max_π R = Σ_{t=0}^∞ γ^t r_t,    π : S → A
RL
The problem of learning how to interact with an environment to maximize reward.
RL
Agent interacts with an environment. At each time t:
• Receives sensor signal s_t
• Executes action a_t
• Transition:
  • new sensor signal s_{t+1}
  • reward r_t
Goal: find policy π that maximizes expected return (sum of discounted future rewards):

max_π E[ R = Σ_{t=0}^∞ γ^t r_t ]
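As a concrete sketch of this interaction loop and the discounted return it accumulates, assuming a hypothetical gym-style environment interface (reset() → s, step(a) → (s', r, done)):

```python
def rollout(env, policy, gamma=0.99, horizon=1000):
    """Run one episode and accumulate the discounted return R = sum_t gamma^t * r_t.

    Assumes a gym-style interface: env.reset() -> s, env.step(a) -> (s', r, done).
    """
    s = env.reset()
    R, discount = 0.0, 1.0
    for t in range(horizon):
        a = policy(s)             # policy pi : S -> A
        s, r, done = env.step(a)  # transition: new sensor signal and reward
        R += discount * r         # add gamma^t * r_t
        discount *= gamma
        if done:
            break
    return R
```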
RL
This formulation is general enough to encompass a wide variety of learned control problems.
Markov Decision Processes
An MDP is a tuple ⟨S, A, γ, R, T⟩:
• S: set of states
• A: set of actions
• γ: discount factor
• R(s, a, s′): reward function. The reward received taking action a from state s and transitioning to state s′.
• T(s′|s, a): transition function. The probability of transitioning to state s′ after taking action a in state s.

RL: one or both of T, R unknown.
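One way to write this tuple down in code, as a tabular container (the class and field names are illustrative, not from the slides):

```python
from dataclasses import dataclass
from typing import Dict, Sequence

@dataclass
class MDP:
    """Tabular MDP <S, A, gamma, R, T>."""
    states: Sequence    # S: set of states
    actions: Sequence   # A: set of actions
    gamma: float        # discount factor
    R: Dict             # R[s][a][s']: reward for taking a in s and landing in s'
    T: Dict             # T[s][a][s']: probability of landing in s' after taking a in s
```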
RL
Example:
• States: set of grid locations
• Actions: up, down, left, right
• Transition function: move in direction of action with p=0.9
• Reward function: -1 for every step, 1000 for finding the goal
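A sketch of this gridworld using the MDP container above. The grid size, discount factor, and what happens with the remaining probability 0.1 (here: the agent stays put) are not specified on the slide; they are assumptions for illustration:

```python
def make_gridworld(width=5, height=5, goal=(4, 4), p_move=0.9, gamma=0.95):
    """Build the gridworld example as a tabular MDP."""
    states = [(x, y) for x in range(width) for y in range(height)]
    actions = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    T, R = {}, {}
    for s in states:
        T[s], R[s] = {}, {}
        for a, (dx, dy) in actions.items():
            # Move in the direction of the action with p=0.9 (clipped at the walls),
            # otherwise stay put (assumption).
            nxt = (min(max(s[0] + dx, 0), width - 1),
                   min(max(s[1] + dy, 0), height - 1))
            T[s][a] = {nxt: p_move, s: 1.0 - p_move} if nxt != s else {s: 1.0}
            # -1 for every step, 1000 for finding the goal.
            R[s][a] = {s2: (1000.0 if s2 == goal else -1.0) for s2 in T[s][a]}
    return MDP(states, list(actions), gamma, R, T)
```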
RL
Example:
• States: (θ1, θ̇1, θ2, θ̇2) (a real-valued vector)
• Actions: +1, -1, 0 units of torque added to elbow
• Transition function: physics!
• Reward function: -1 for every step
MDPs
Our target is a policy:

π : S → A

A policy maps states to actions.

The optimal policy maximizes:

max_π ∀s, E[ R(s) = Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

This means that we wish to find a policy that maximizes the return from every state.
Value Functions
Given a policy, we can estimate R(s) for every state.
• This is a value function.
• It can be used to improve our policy.

V_π(s) = E[ Σ_{t=0}^∞ γ^t r_t | π, s_0 = s ]

This is the value of state s under policy π.
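When T and R are known (the planning setting), V_π can be estimated by iterating the Bellman expectation backup to convergence. A sketch using the tabular MDP container from above, where policy is a dict mapping states to actions:

```python
def policy_evaluation(mdp, policy, tol=1e-6):
    """Estimate V_pi(s) for every state by repeatedly applying
    V(s) <- sum_{s'} T(s'|s, pi(s)) * (R(s, pi(s), s') + gamma * V(s'))."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            a = policy[s]
            v_new = sum(p * (mdp.R[s][a][s2] + mdp.gamma * V[s2])
                        for s2, p in mdp.T[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```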
Value Functions
Similarly, we define a state-action value function as follows:

Q_π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | π, s_0 = s, a_0 = a ]

This is the value of executing a in state s, then following π.

Note that:
Q_π(s, π(s)) = V_π(s)
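Given V_π and a known model, the state-action values follow from a one-step lookahead; plugging in a = π(s) recovers V_π(s), matching the identity above. A sketch continuing the tabular setup:

```python
def q_from_v(mdp, V, s, a):
    """Q_pi(s, a) = sum_{s'} T(s'|s, a) * (R(s, a, s') + gamma * V_pi(s'))."""
    return sum(p * (mdp.R[s][a][s2] + mdp.gamma * V[s2])
               for s2, p in mdp.T[s][a].items())
```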
Policy Iteration
Recall that we seek the policy that maximizes V_π(s), ∀s.

Therefore we know that, for the optimal policy π*:

V_π*(s) ≥ V_π(s), ∀π, s
Q_π*(s, a) ≥ Q_π(s, a), ∀π, s, a

This means that any change to π that increases Q anywhere obtains a better policy.
Policy Iteration
This leads to a general policy improvement framework:
1. Start with a policy π
2. Learn Q_π
3. Improve π:
   a. π(s) = argmax_a Q(s, a), ∀s
Repeat.

This is known as policy iteration. It is guaranteed to converge to the optimal policy.
Steps 2 and 3 can be interleaved as rapidly as you like. Usually, perform 3a every time step.
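A sketch of this loop in the model-based case, reusing policy_evaluation and q_from_v from above. Here Q_π is computed exactly from the known model rather than learned; learning Q is the part deferred to Friday:

```python
def policy_iteration(mdp):
    """Policy iteration: evaluate the current policy, then act greedily w.r.t. Q_pi."""
    # 1. Start with an arbitrary policy pi.
    policy = {s: mdp.actions[0] for s in mdp.states}
    while True:
        # 2. Compute Q_pi (via V_pi and one-step lookaheads).
        V = policy_evaluation(mdp, policy)
        # 3. Improve: pi(s) = argmax_a Q(s, a), for all s.
        stable = True
        for s in mdp.states:
            best_a = max(mdp.actions, key=lambda a: q_from_v(mdp, V, s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:   # Repeat until the policy stops changing.
            return policy, V
```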
Friday
How to learn Q!