

Reinforcement learning is not directly modulated by sensory prediction errors

Darius E. Parvin¹, Matthew J. Boggess¹, Samuel D. McDougle², Jordan A. Taylor², Richard B. Ivry¹

¹University of California, Berkeley; ²Princeton University

[email protected]

ivrylab.berkeley.edu

Overview

• The credit assignment problem: Action Execution vs Action Selection
• We use a reaching version of the classic two-armed bandit task
• We test the predictions of a model where reinforcement learning is gated by sensory prediction errors (SPEs)

E1: Does delaying sensory prediction errors influence risk seeking?

E2: Does adaptation or sense of control influence risk seeking?

General Task

Top: Task Irrelevant Clamp feedback induced adaptation.

2x2 Task Design

• Factor 1: Presence of SPE on miss trials.
  • Clamp (SPE): Task Irrelevant Clamped SPE on ‘miss’ trials only
  • Binary (No SPE): No sensory feedback

E1 Results – Adaptation: Delayed feedback reduces adaptation from SPE

E2 Results: Instruction of control, but not adaptation, affects risk preference

Right: Risk preference affected by instruction of control but not clamp feedback.

Negative outcomes have two possible sources:

• Execution Errors
• Selection Errors

Previous work (McDougle et al., 2016): Subjects show a reversal of the usual ‘Risk Averse’ behavior observed in the classic key-press version of the task.

Gating Model

[Figure labels: Trial Start; Reward Feedback; Sensory Feedback; Clamp Groups ‘Miss’ Feedback; All Groups ‘Hit’ Feedback]

Left: At start of training, participants exhibit considerable biases in reaching direction (e.g., tend to reach to clockwise position of target). Feedback reduces these biases; however, the reduction was much faster and more complete.

Delayed feedback condition

Modeling the Effect of Movement Errors on Action Selection

Gating Model: Sensory prediction errors (SPE), an important signal for motor adaptation, are exploited to disrupt value updating.
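One way to make the Gating idea concrete is a delta-rule value update whose learning rate is scaled down on miss trials when an SPE is present. The sketch below is illustrative only: the `gate` multiplier, the `alpha` learning rate, and the payoff numbers are assumptions chosen for this example, not the model or parameters used on the poster.

```python
def update_value(value, reward, hit, alpha=0.2, gate=1.0):
    """One delta-rule update of a target's value.

    On a miss, the learning rate is multiplied by `gate` (assumed parameter):
    gate=1.0 means misses update value normally, gate=0.0 means the SPE
    fully blocks the update.
    """
    rate = alpha if hit else alpha * gate
    return value + rate * (reward - value)


def steady_state_value(p_hit, payoff, gate):
    """Value at which the expected update is zero.

    Solves p_hit * (payoff - V) - (1 - p_hit) * gate * V = 0 for V.
    Requires p_hit > 0 or gate > 0.
    """
    return p_hit * payoff / (p_hit + (1.0 - p_hit) * gate)


if __name__ == "__main__":
    # Hypothetical targets with equal expected payoff (0.3 * 10 == 0.75 * 4).
    risky = (0.3, 10.0)
    safe = (0.75, 4.0)
    for gate in (1.0, 0.5, 0.0):
        print(gate, steady_state_value(*risky, gate), steady_state_value(*safe, gate))
```

Under these assumptions, ungated updating (`gate=1.0`) values both targets at their expected payoff, while any gating below 1 inflates the learned value of the target that is missed more often, which is the risky one. This is the sense in which a weaker SPE would be expected to make choices less risk seeking.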

[Figure panels: Target Preference Over Time (left); Risk Preference (right), with axis labels Safe, Neutral, Risky]

Left: Subjects' target preferences track changes in the risk associated with each target.

Right: Ratio of trials where the riskier target was picked.

*Binary group did reaching but with only reward feedback.
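The risk preference measure described for the right panel can be computed as a simple proportion. The following is a minimal sketch; the function name, the input layout, and the reading of 0.5 as the ‘Neutral’ point on the Safe/Neutral/Risky axis are assumptions made for illustration.

```python
def risk_preference(choices, riskier):
    """Fraction of scored trials on which the riskier target was picked.

    choices[i] is the label of the target chosen on trial i; riskier[i] is
    the label of the riskier target on that trial, or None when neither
    target is riskier (such trials are skipped).
    """
    scored = [(c, r) for c, r in zip(choices, riskier) if r is not None]
    if not scored:
        raise ValueError("no trials with a defined riskier target")
    return sum(c == r for c, r in scored) / len(scored)


if __name__ == "__main__":
    # Hypothetical four-trial session with targets labeled "A" and "B".
    print(risk_preference(["A", "B", "B", "A"], ["A", "A", "B", None]))
```

With this convention, values above 0.5 would indicate a bias toward the risky target and values below 0.5 a bias toward the safe one.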

Competence Model

Movement errors help signal motor competence, one source of input for value representation.

Goal of Study

If SPE is critical for modulating action selection, manipulation of the strength of this signal should result in a change in choice behavior.

Examine two measures, sensorimotor adaptation and choice behavior, to verify efficacy of our manipulation of SPE.

Motivation: Efficacy of SPE to produce sensorimotor adaptation is severely weakened when feedback is delayed. Thus, delaying feedback by 2 s should reduce sensorimotor adaptation and make people more risk averse in their choice behavior.

Repeat Choice: Participants in all groups shift their reach direction opposite that of the previous error. This effect may reflect re-aiming.

Repeat Choice after intervening trials: Shift in reach direction is significant, but only for the no-delay groups. This result is consistent with the hypothesis that SPE is severely attenuated in the delay group.
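A trial-by-trial error correction measure of this kind could be sketched as follows. This is an assumed formulation rather than the analysis used on the poster: the function name, the signed-angle inputs, and the pairing of trial t with trial t + lag whenever the same target was chosen are all illustrative choices.

```python
def mean_correction(targets, angles, errors, lag):
    """Mean change in reach angle, signed so positive = opposite the earlier error.

    Pairs trial t with trial t + lag when the same target was chosen on both
    and trial t had a nonzero signed error. Returns NaN if no pairs qualify.
    """
    shifts = []
    for t in range(len(targets) - lag):
        if targets[t] == targets[t + lag] and errors[t] != 0:
            change = angles[t + lag] - angles[t]
            sign = 1.0 if errors[t] > 0 else -1.0
            # A change in the direction opposite to the error counts as positive.
            shifts.append(-sign * change)
    return sum(shifts) / len(shifts) if shifts else float("nan")


if __name__ == "__main__":
    # Hypothetical data: target labels, reach angles and signed errors in degrees.
    targets = ["A", "A", "B", "A"]
    angles = [0.0, -2.0, 5.0, -3.0]
    errors = [4.0, 1.0, 0.0, 0.0]
    for lag in (1, 2, 3):
        print(lag, mean_correction(targets, angles, errors, lag))
```

Here `lag=1` corresponds to an immediate repeat of the same choice, and larger lags to repeats separated by other trials.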

E1 Results – Risk Preference: Feedback delay did NOT affect choice behavior

2 s Delay group shows more risk aversion, as predicted. However, the difference disappears when matching the inter-trial interval (‘Matched ITI’).

Greater ‘risky’ bias for shorter ITI may reflect stronger iconic motor memory, biasing participants to believe they can accurately modify future reaches. Alternatively, a longer ITI may lead to a decay of value representation, resulting in more random (neutral) behavior.

• Motivation: Task irrelevant SPEs presented on miss trials should induce implicit adaptation (Morehead et al., 2014). Will this affect choice behavior?

• Also manipulated participants’ sense of control via instructions to provide a test of the Competence model.

• Factor 2: Manipulate belief of (but not actual) control over rewards.
  • In Control: “You get points only if you hit the target”
  • No Control: “Your reach accuracy does not affect points”

[Figure labels. Design grid: Clamp (SPE) / Binary (No SPE) by In Control / No Control. Panel titles: Mean reach error; Trial by trial error correction; Delay vs Risk Preference; Clamp SPE Causes Adaptation; Adaptation After Effect; Risk Preference. Axis labels: Trial Number; N Trials back]

Instruction: Select a target by reaching through it. 0 Points if the target is missed.

Two Targets:
‘Risky target’ - high payoff, low hit probability
‘Safe target’ - low payoff, high hit probability

Cursor feedback: Sometimes manipulated in order to match predetermined hit/miss outcomes

Conclusion

• Efficacy of SPE for motor adaptation was evident in both experiments. Nonetheless, these manipulations did NOT affect choice behavior, arguing against the Gating model.

• Greater willingness of participants to choose the risky target when they believe they are in control is consistent with the Competence model.

• Effect of ITI on risk preference may reflect an iconic motor memory influencing belief of motor competence, or a decaying value representation.