
Page 1: Cobot:  A Social Reinforcement Learning Agent

Cobot: A Social Reinforcement Learning Agent

Charles Lee Isbell, Jr., Christian R. Shelton, Michael Kearns, Satinder Singh, Peter Stone

Presented by Josh Waxman

Page 2: Cobot:  A Social Reinforcement Learning Agent

Applications of RL
- Control
- Game playing
- Optimization
- Recently: human-computer interaction
  - Previous systems encounter humans one at a time, e.g. spoken dialog systems

Challenges
- Data sparsity
- Inevitable violations of the Markov property
- Irreproducibility of experiments (everything happens live in a MOO)
- Variability in users' understanding of how Cobot works
- Drift of users' desires; inconsistency of reward
- Choosing an appropriate state space

Page 3: Cobot:  A Social Reinforcement Learning Agent

LambdaMOO (λμ)
- MUD (Multi-User Dungeon): a class of online worlds with roots in text-based multiplayer role-playing games.
- Virtual world, often created by the participants themselves.
- Users choose characters to represent them.
- Mechanisms of social interaction reinforce the illusion that the user is present in the virtual space.
- MOO (MUD, Object-Oriented): a MUD that uses an object-oriented programming language to manipulate objects in the virtual world.
- LambdaMOO is a complex, open-ended, multi-user chat environment, populated by a community of human users with rich and often enduring social relationships.

Page 4: Cobot:  A Social Reinforcement Learning Agent

LambdaMOO (λμ) (2)
- Interconnected rooms.
- Rooms contain users and objects that can move between them.
- Each room has a chat channel (people in a room can talk to each other).
- Each room (and object) has a text description that gives it a "look and feel."

Page 5: Cobot:  A Social Reinforcement Learning Agent

Verbs and Speech in λμ
Users can talk and also have a set of verbs, allowing a rich range of actions and expression of emotional states.

1. Buster is overwhelmed by all these deadlines.
2. Buster begins to slowly tear his hair out, one strand at a time.
3. HFh comforts Buster. [standard verb comfort]
4. HFh [to Buster]: Remember, the mighty oak was once a nut like you.
5. Buster [to HFh]: Right, but his personal growth was assured. Thanks anyway, though.
6. Buster feels better now.

(Slide callouts label the emote lines as verbs and the "[to ...]:" lines as speech.)

Page 6: Cobot:  A Social Reinforcement Learning Agent

LambdaMOO (λμ) (3)
- Rooms are created by users, who write their descriptions and control access by other users.
- Users can also create objects.
- 4,836 active user accounts; 118,154 objects.
- Oldest continuously operated MUD, founded in 1990.
- Good environment for AI experiments, including learning.

Page 7: Cobot:  A Social Reinforcement Learning Agent

Cobot

Cobot is an RL-based agent for LambdaMOO.

Long-term goal: to build an agent that can learn to perform useful, interesting, and entertaining actions in LambdaMOO on the basis of user feedback.

Page 8: Cobot:  A Social Reinforcement Learning Agent

Cobot (2)

Originally a social-statistics agent:
- Tracked how frequently, and in what ways, users interact
- Provided these statistics as a service
- Rudimentary chatting capabilities
- Reactive: did not initiate interaction
- Very popular with LambdaMOO users

Page 9: Cobot:  A Social Reinforcement Learning Agent

Cobot (3): Modifications
- Not just reactive, but also proactive: takes actions under his own initiative.
  - Propose conversation topics
  - Introduce users
  - Word play
- Hope: Cobot will eventually take unprompted actions that are meaningful, useful, or amusing to users.

Page 10: Cobot:  A Social Reinforcement Learning Agent

Reinforcement Learning
- In RL, decision making by an agent in an uncertain environment is often modeled as a Markov Decision Process (MDP).
- Markov property: the current state alone is sufficient for choosing the next action; no further history is needed.
- At time t, the agent senses the environment's state s and chooses an action a from A(s), the set of actions available in state s.
- The action causes a change in the environment, and the agent receives a scalar reward from the environment.

Page 11: Cobot:  A Social Reinforcement Learning Agent

Reinforcement Learning (2)
- Goal: maximize expected reward over some time horizon.
- A policy π maps a state s and an action a to the probability of taking action a in state s: π(s, a) = Pr(a | s).
- π* denotes the optimal policy.
- A value function is a function of states (V) or of state-action pairs (Q) that tells how good it is to be in a particular state, where goodness is defined in terms of expected future return.
- Q^π(s, a), the action-value function for policy π, is the expected return when taking action a in state s and following policy π thereafter (standard formulas are written out below).
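For reference, the quantities named above can be written out explicitly. This is the standard discounted-return formulation with discount factor γ; the slides do not specify the horizon or discount, so treat γ here as an assumption:

```latex
% Policy: probability of choosing action a in state s
\pi(s, a) = \Pr(a_t = a \mid s_t = s)

% Action-value function for policy \pi, assuming a discounted return with factor \gamma
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a \right]

% Optimal action-value function; an optimal policy acts greedily with respect to it
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```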

Page 12: Cobot:  A Social Reinforcement Learning Agent

Reinforcement Learning (3)
- π* denotes the optimal policy, whose value function is greater than or equal to that of any other policy for all states s and actions a.
- Q* is the optimal action-value function.
- Most RL algorithms approximate π* from the agent's experience in its environment, by learning Q*.
- The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability.
- Many RL algorithms use function approximators (parametric representations of complex value functions) both to map state-action features to values and to map states to distributions over actions (i.e., the policy).

Page 13: Cobot:  A Social Reinforcement Learning Agent

Linear Function Approximator
- Cobot uses a linear function approximator: for each state feature, he maintains a vector of real-valued weights indexed by the possible actions (see the sketch after the diagram note below).
- Positive weight: the feature increases the probability of taking that action.
- Negative weight: the feature decreases it.

[Slide diagram: each state feature (state feature 1, state feature 2, …) has its own weight vector indexed by the nine actions, Action 1 through Action 9.]
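A minimal sketch of this scheme, assuming a softmax over the feature-weighted scores. The slides only say that higher-value actions are chosen with higher probability; the softmax, feature names, and dimensions here are illustrative, not the authors' implementation:

```python
import numpy as np

NUM_ACTIONS = 9  # the nine proactive actions described later in the slides

class LinearPolicy:
    """Linear function approximator: one weight vector per state feature,
    indexed by action. Action probabilities come from a softmax over the
    feature-weighted sums (an assumed selection rule)."""

    def __init__(self, num_features: int, num_actions: int = NUM_ACTIONS):
        # All weights start at zero, so the initial policy is uniform.
        self.weights = np.zeros((num_features, num_actions))

    def action_distribution(self, features: np.ndarray) -> np.ndarray:
        """features: vector of state-feature values (one entry per feature)."""
        scores = features @ self.weights      # one score per action
        scores -= scores.max()                # numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()  # softmax -> probabilities

    def sample_action(self, features: np.ndarray, rng=np.random) -> int:
        probs = self.action_distribution(features)
        return rng.choice(len(probs), p=probs)

# Example usage with a hypothetical 3-feature state:
policy = LinearPolicy(num_features=3)
state = np.array([1.0, 0.5, 0.0])            # e.g. bias feature, an event rate, ...
print(policy.action_distribution(state))     # uniform at first, since weights are 0
```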

Page 14: Cobot:  A Social Reinforcement Learning Agent

Reactive Actions (Social Statistics)
- Tells the questioner various facts about herself, including the verbs she likes to use most and the verbs most often directed at her.
- who are your playmates: Lists those who interact with Cobot the most.
- who loves me: Tells the questioner those with whom she interacts the most.
- who acts like me: Tells the questioner which users perform actions similar to hers.
- who does not act like me: Tells the questioner which users do not perform actions similar to hers.
- relate me to user: Indicates how the questioner interacts with user, providing their rankings on each other's lists of playmates, how many playmates they have in common, and how similarly they act.
- tell me my verb-buddies: Tells the questioner which users she uses verb on and who use verb on her.
- who is spammy: Lists those users who generate and are the targets of the most verbs.

http://cobot.research.att.com/papers/cobot.pdf

Page 15: Cobot:  A Social Reinforcement Learning Agent

Example
HFh [to cobot]: relate me to Buster
cobot whispers, "Here are your relationships with Buster. You like to use: - (62%), poke (7%), hug (3%), eye (3%), nod (2%), hi5, h5, zap, comfort, and grin on each other. Buster is ranked #14 on your list of playmates. You are ranked #1 on Buster's list. Your socializing overlap is 75.4% and your playmate overlap is 33.7%. Your actions have a similarity of 95.9% but ignoring common speech verbs it's 58.3%. Others act on you with a similarity of 96.6% but ignoring common speech verbs it's 81.9%."

Page 16: Cobot:  A Social Reinforcement Learning Agent

Technicolor Guest [to cobot]: What's your religion?
cobot [to Technicolor Guest]: technology unless you REALLY believe in that religion yourself and find that it arouses a deep, strong, genuine response in many other people.
…
cobot [to DragonBoi]: The line of conflict should be drawn between the mass of the people and the power-holding elite of industrial society.
DragonBoi eyes cobot warily.
cobot [to DragonBoi]: You have something cute in your eye.
DragonBoi [to HFh]: you're bot is turning into a communist.
DragonBoi [to cobot]: so what are you gonna do about it?
DragonBoi [to HFh]: did you see what it told me? you are turning it into a communist!
Wikked [to cobot]: are you a communist?
DragonBoi [to cobot]: are you now, or ever been a member of the communist party?
cobot mumbles something in binary.
DragonBoi [to cobot]: so you're taking the fifth are you? well this is a special commision, that won't work here!

Page 17: Cobot:  A Social Reinforcement Learning Agent

9 Proactive Actions
- Null Action: Choose to remain silent for this time period.
- Topic Starters (4): Introduce a conversational topic. Cobot declares that he wants to discuss sports or politics, or he utters a sentence from either the sports or the politics section of the Boston Globe.
- Roll Call (2): Initiate a "roll call," a common word-play routine in LambdaMOO. For example, someone may declare that she is tired of Monica Lewinsky by announcing "TIRED OF LEWINSKY ROLL CALL." Each user feeling the same way will agree with the roll call. Cobot initiates a roll call by taking a recent utterance and extracting either a single noun or a verb phrase. These are treated as two separate RL actions.
- Social Commentary: Make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly." These statements are based on Cobot's statistics from recent activity. Several different utterances are possible, but they are treated as a single action for RL purposes.
- Introductions: Introduce two users who have not yet interacted with one another in front of Cobot. (An illustrative enumeration of all nine action slots follows.)
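As a rough illustration of how the nine action slots break down (one null action, four topic starters, two roll calls, one social commentary, one introductions), here is a hypothetical enumeration; the identifiers are mine, not the authors':

```python
from enum import Enum

class CobotAction(Enum):
    """The nine proactive RL actions described on this slide
    (labels are illustrative, not the authors' identifiers)."""
    NULL = 0                    # remain silent this time period
    TOPIC_DECLARE_SPORTS = 1    # declare a desire to discuss sports
    TOPIC_DECLARE_POLITICS = 2  # declare a desire to discuss politics
    TOPIC_QUOTE_SPORTS = 3      # utter a sentence from the Globe sports section
    TOPIC_QUOTE_POLITICS = 4    # utter a sentence from the Globe politics section
    ROLL_CALL_NOUN = 5          # roll call built from a noun in a recent utterance
    ROLL_CALL_VERB_PHRASE = 6   # roll call built from a verb phrase
    SOCIAL_COMMENTARY = 7       # comment on the current social state of the room
    INTRODUCTION = 8            # introduce two users who have not interacted
```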

Page 18: Cobot:  A Social Reinforcement Learning Agent

Actions (2)
- These actions were chosen to fit in with what goes on in LambdaMOO, so as not to irritate users.
- They cover the most common routines: conversation, word play, and emoting.
- The range of possible utterances is effectively unbounded, since they are drawn from recent conversation (roll calls) or from the Boston Globe online.

Page 19: Cobot:  A Social Reinforcement Learning Agent

Reinforcement Learning
- At set time intervals, Cobot chooses an action according to a distribution based on the Q-values of the current state.
- Rewards and punishments received between time t and t+1 are applied to the action taken at time t.
- Possible erroneous reward/punishment: if a user actually meant to reward a reactive rather than a proactive action, this shows up as noise in the training process.

Page 20: Cobot:  A Social Reinforcement Learning Agent

Feedback Actions
- Explicit: reward and punish verbs give a numeric training signal to Cobot; the feedback applies immediately to the current state and action, and is backed up to previous states and actions.
- Implicit: standard LambdaMOO verbs, e.g. hug, kiss, spank, and spit; numerically weaker than explicit feedback. (A sketch follows.)
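A small sketch of how feedback verbs might be turned into a numeric signal. The polarities and magnitudes below are illustrative assumptions, not the values used by the Cobot authors:

```python
# Illustrative verb-to-reward mapping; values and polarities are assumed, not the paper's.
EXPLICIT_FEEDBACK = {
    "reward": +1.0,   # explicit reward verb
    "punish": -1.0,   # explicit punish verb
}

IMPLICIT_FEEDBACK = {
    # Standard LambdaMOO verbs, treated as numerically weaker signals.
    "hug": +0.5,
    "kiss": +0.5,
    "spank": -0.5,
    "spit": -0.5,
}

def reward_signal(verb: str) -> float:
    """Return the numeric training signal for a feedback verb (0 if none)."""
    if verb in EXPLICIT_FEEDBACK:
        return EXPLICIT_FEEDBACK[verb]
    return IMPLICIT_FEEDBACK.get(verb, 0.0)

# Example: total reward accumulated for the action taken at time t
events = ["hug", "reward", "spit"]
print(sum(reward_signal(v) for v in events))  # 1.0 with these made-up values
```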

Page 21: Cobot:  A Social Reinforcement Learning Agent

Train for individual user or community? A design choice.
- Train for the entire community, or for each individual user?
- Chosen approach: learn a value function per user, and combine the value functions of the users present.
- Thus, it is like several RL processes running in parallel, each with a different state space.

Why?
- If Cobot just stored which users are present as another state feature, he would have to learn the primacy of this feature on his own.
- Learning should be fast and significant: if users don't get feedback that they influenced Cobot's behavior, they will be discouraged.
- Curse of dimensionality: the size of the state space increases exponentially with the number of state features. We don't want to represent the presence or absence of ~250 users; keeping the state space small speeds up learning.
- Certain users interact with Cobot much more often than others. We don't want their input to dwarf the impact of others.

Page 22: Cobot:  A Social Reinforcement Learning Agent

State space for a generic user (a rough sketch of these features follows the list)
- Social Summary Vector (4):
  - Rate at which the user produces events
  - Rate of events produced by others and directed at the user
  - Percentage of the other users present who are among the user's "playmates"
  - Percentage of the other users present who count the user among their playmates
  - (Playmates = the top 10 users one interacts with)
- Mood Vector: recent use of eight groups of common words (e.g. grin and smile form a single group)
- Rates Vector: rate at which events are produced by the users present, including Cobot
- Current Room: which room Cobot is currently in
- Roll Call Vector:
  - Has the saved roll-call text been used by Cobot before?
  - Has someone done a roll call since the last time Cobot did one?
  - Has there been a roll call since the last time Cobot grabbed text?
- Bias feature: always on; indicates simply that the user is present
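To make the feature list concrete, here is a rough sketch of a per-user state record; the field shapes and encodings are assumptions for illustration only, not the paper's representation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserState:
    """Per-user state features named on this slide (encodings are assumed)."""
    social_summary: List[float] = field(default_factory=lambda: [0.0] * 4)
    mood: List[float] = field(default_factory=lambda: [0.0] * 8)   # 8 word groups
    rates: Dict[str, float] = field(default_factory=dict)          # event rate per present user, incl. cobot
    current_room: str = "Living Room"
    roll_call: List[bool] = field(default_factory=lambda: [False] * 3)
    bias: float = 1.0                                               # always on

    def to_vector(self) -> List[float]:
        """Flatten into the feature vector fed to the linear approximator."""
        return (
            self.social_summary
            + self.mood
            + [self.rates.get("cobot", 0.0)]                        # illustrative: Cobot's own rate
            + [1.0 if self.current_room == "Living Room" else 0.0]  # illustrative room encoding
            + [float(b) for b in self.roll_call]
            + [self.bias]
        )
```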

Page 23: Cobot:  A Social Reinforcement Learning Agent

State space for a single user
- Too complex to model with a table-based representation.
- A linear function approximator is used for each user.
- The policies of the users present are mixed (one possible mixing rule is sketched below).
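The slides do not spell out the mixing rule. One simple possibility, shown only to make "mix policies" concrete, is to average the action distributions produced by the per-user approximators of everyone present:

```python
import numpy as np

def mixed_action_distribution(per_user_policies, user_states, present_users):
    """Average the action distributions of the users currently present.

    per_user_policies: dict mapping user name -> LinearPolicy (see earlier sketch)
    user_states:       dict mapping user name -> that user's feature vector
    present_users:     list of user names currently in the room

    The uniform averaging here is an illustrative assumption, not necessarily
    the combination rule used by the Cobot authors.
    """
    distributions = [
        per_user_policies[u].action_distribution(user_states[u])
        for u in present_users
    ]
    return np.mean(distributions, axis=0)  # still sums to 1
```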

Page 24: Cobot:  A Social Reinforcement Learning Agent

Experimental Procedure
- Cobot has been in LambdaMOO since September 1999; the RL version went live in May 2000.
- Cobot is a real working system with real human users, and the experiment was conducted in this context.
- RL functionality was launched in the Living Room.
- Cobot logged RL-related data from May 10 to October 10, 2000: states visited, actions taken, rewards from each user, parameters of the value function, etc.
- 63,123 RL actions were taken (plus reactive actions), with 3,171 reward/punishment events from 254 users.

Page 25: Cobot:  A Social Reinforcement Learning Agent

Findings: Inappropriateness of Average Reward
- Conventionally, successful RL would show an increase in average reward over time, but (as the next slide argues) that is not an appropriate measure here.

Page 26: Cobot:  A Social Reinforcement Learning Agent

- The lack of growth in average reward is not because users grow more dissatisfied as Cobot learns.
- Humans are fickle and their preferences change over time (indeed, novelty is highly valued in LambdaMOO): what was popular and exciting becomes irritating. Cobot is trying to learn a moving target.
- So perhaps average reward shouldn't be the primary measure of performance.
- Even users with fixed preferences tend to give less reward/punishment feedback once Cobot has learned their preferences accurately enough.
- [Presenter's note: the authors didn't mention that users simply get bored.] Typical RL assumes the environment consistently gives rewards and punishments.
- M and S were dedicated users; other measures are explored later.

Page 27: Cobot:  A Social Reinforcement Learning Agent

Users M and S

Page 28: Cobot:  A Social Reinforcement Learning Agent

Findings: A Small Set of Dedicated "Parents"
- Of the 254 users, 218 gave 20 or fewer reward/punishment events, while 15 gave 50 or more.
- Many had a passing interest; a few were willing to invest significant time to teach their preferences to Cobot.
- User M gave 594 feedback events; user S gave 69.

Page 29: Cobot:  A Social Reinforcement Learning Agent

Findings: Some Parents Have Strong Opinions
- For the majority of users, the policy learned was close to a uniform distribution.
- Policies are dependent on state, but for most users this dependence was weak, hence the near-uniform distributions.
- Most users did not provide enough feedback, and may not have been consistent and strong in the feedback they did provide.
- For a small group, Cobot did learn a non-uniform policy.
  - For M and S the learned policies are relatively independent of state; for other users the effect is not as dramatic, but still non-uniform.
  - This makes sense: if a user does not like sports, it does not matter which room she is in or what other users are doing.
  - M likes roll calls: Cobot selects them with probability 0.99. S likes social commentary: Cobot selects it with probability 0.38 (S interacted less, with only 69 feedback events).

Page 30: Cobot:  A Social Reinforcement Learning Agent

Findings: Cobot Learns Matching Policies
The policy learned for user M reflects the empirical pattern of M's rewards over time.

Page 31: Cobot:  A Social Reinforcement Learning Agent

[Chart legend]
- Blue bars: average reward given by user M for each action (note: relative, see 8)
- Yellow bars: the policy learned for user M
- Red bars: the empirical frequency at which each action was taken
- Action 6 is a roll call; recall from the earlier chart that M likes roll calls.

Page 32: Cobot:  A Social Reinforcement Learning Agent

Findings: Cobot Responds to Dedicated Parents
- The users who train Cobot have a strong impact on him: his behavior shifts towards M's preferences when M is present. [Presenter's note: of course! No one else trained him, so this is where reward/punishment will have the most impact. It is only worth saying because so few users actually trained him.]
- Some preferences depend on state. We can deduce which features are relevant to a given user (sketched below):
  - By construction, the bias feature is independent of state (it is always on).
  - All weights are initialized to 0, so only nonzero features contribute. A feature is considered relevant if its weight vector is far from both the bias feature's weight vector and the all-zero vector.
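A small sketch of that relevance test; the distance measure and threshold are assumptions for illustration:

```python
import numpy as np

def relevant_features(weights, bias_index, threshold=0.1):
    """Return indices of state features whose weight vectors differ noticeably
    from both the all-zero vector and the bias feature's weight vector.

    weights: array of shape (num_features, num_actions), as in the earlier sketch.
    The Euclidean distance and the threshold value are illustrative assumptions.
    """
    bias_vec = weights[bias_index]
    relevant = []
    for i, w in enumerate(weights):
        if i == bias_index:
            continue
        if np.linalg.norm(w) > threshold and np.linalg.norm(w - bias_vec) > threshold:
            relevant.append(i)
    return relevant
```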

Page 33: Cobot:  A Social Reinforcement Learning Agent

Findings: some learned policies do in fact depend on state

Page 34: Cobot:  A Social Reinforcement Learning Agent

Conclusions
- Reported on efforts to apply RL in a complex human online social environment (a MOO) where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated.
- We feel that the results obtained with Cobot so far are compelling, and offer promise for the application of RL in such open-ended social settings.
- Cobot continues to take RL actions and receive rewards and punishments from LambdaMOO users, and we plan to continue and embellish this work as part of our overall efforts on Cobot.