Learning Social Affordance for Human-Robot Interaction

Tianmin Shu¹   M. S. Ryoo²   Song-Chun Zhu¹

¹Center for Vision, Cognition, Learning, and Autonomy, UCLA   ²Indiana University, Bloomington

Introduction

Objective: learn explainable knowledge from noisy observations of human interactions in RGB-D videos to enable human-robot interactions.

Key idea: beyond traditional object and scene affordances, we propose weakly supervised learning of social affordances for HRI.

Contributions:
• First formulation and hierarchical representation of social affordance
• Weakly supervised learning from noisy skeleton input
• Efficient motion synthesis based on the learned hierarchical affordances

Model

[Figure: the hierarchical model. An interaction category c (prior p(c)) generates a sequence of latent sub-events s_1, s_2, ..., s_K (interaction parsing); each sub-event s_k has a latent joint selection and grouping Z_{s_k} (prior p(Z)) that generates the observed skeleton joints J^t, t \in T_k. Illustrated on a "Shake Hands" example with the current frame and the learned affordance.]

For one instance of category c:

p(G \mid Z_c) \propto \underbrace{\prod_k p(\{J^t\}_{t \in T_k} \mid Z_{s_k}, s_k, c)}_{\text{likelihood}} \cdot \underbrace{p(c)}_{\text{interaction prior}} \cdot \underbrace{\prod_{k=2}^{K} p(s_k \mid s_{k-1}, c)}_{\text{sub-event transition}} \cdot \underbrace{\prod_{k=1}^{K} p(s_k \mid c)}_{\text{sub-event prior}}

where the grouping prior factorizes over sub-events and the likelihood of a sub-event's joints is the product of two terms g and m:

p(Z_c) = \prod_{s \in S} p(Z_s \mid c), \qquad p(\{J^t\}_{t \in T} \mid Z_s, s, c) = g(\{J^t\}_{t \in T}, Z_s, s) \, m(\{J^t\}_{t \in T}, Z_s, s)
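To make the factorization concrete, here is a minimal sketch of scoring one parse of an interaction instance (our own illustration, not the authors' code; the callables and tables log_lik, log_trans, log_prior, and log_p_c are hypothetical names):

def log_p_parse(sub_events, segments, c, log_lik, log_trans, log_prior, log_p_c):
    """Log-score of one parse under the factorization above.

    sub_events: [s_1, ..., s_K] latent sub-event labels
    segments:   [T_1, ..., T_K] frame-index lists, one per sub-event
    log_lik(seg, s, c):      log p({J^t}_{t in T_k} | Z_{s_k}, s_k, c)
    log_trans[c][s_prev][s]: log p(s_k | s_{k-1}, c)
    log_prior[c][s]:         log p(s_k | c)
    log_p_c[c]:              log p(c)
    """
    score = log_p_c[c]                                   # interaction prior
    for k, (s, seg) in enumerate(zip(sub_events, segments)):
        score += log_lik(seg, s, c)                      # likelihood of joints in T_k
        score += log_prior[c][s]                         # sub-event prior
        if k > 0:
            score += log_trans[c][sub_events[k - 1]][s]  # sub-event transition
    return score

Working in log space turns the products of the factorization into sums, and the proportionality constant can be ignored when comparing candidate parses.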

Learning

Goal: obtain the optimal joint selection and grouping and the optimal interaction parsing by maximizing the joint probability p(G, Z_c) = p(G \mid Z_c)\, p(Z_c). For N training examples of category c, G = \{G_n\}_{n=1,\dots,N}:

p(G, Z_c) = p(Z_c) \prod_{n=1}^{N} p(G_n \mid Z_c)

Algorithm:
• Initialization: skeleton clustering for the initial sub-event parsing.
• Inner loop: Gibbs sampling for our modified CRP (see the sketch after this list), drawing z^s_{ai} \sim p(G \mid Z'_c)\, p(z^s_{ai} \mid Z^s_{-ai}), where

p(z^s_{ai} \mid Z^s_{-ai}) =
\begin{cases}
\gamma\beta / (M - 1 + \beta) & \text{if } z^s_{ai} > 0,\; M_{z^s_{ai}} = 0 \\
\gamma M_{z^s_{ai}} / (M - 1 + \beta) & \text{if } z^s_{ai} > 0,\; M_{z^s_{ai}} > 0 \\
1 - \gamma & \text{if } z^s_{ai} = 0
\end{cases}

• Outer loop: a Metropolis-Hastings algorithm for latent sub-event parsing; dynamics: splitting, merging, and relabeling.
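As a concrete reading of the modified CRP term, here is a minimal sketch (our own illustration, not the authors' code; we assume M counts all currently selected joints in sub-event s including the one being sampled, and the function and argument names are ours):

from collections import Counter

def crp_conditional(z_ai, others, gamma, beta):
    """p(z^s_ai | Z^s_-ai) for the modified CRP above.

    z_ai:   candidate label for joint i of agent a (0 = joint not selected)
    others: labels of all other joints in sub-event s (Z^s_-ai); 0 = not selected
    gamma:  probability that a joint is selected at all
    beta:   concentration parameter for opening a new group
    """
    if z_ai == 0:
        return 1.0 - gamma                      # joint is not selected
    selected = [z for z in others if z > 0]
    M = len(selected) + 1                       # assumed: selected joints incl. this one
    m = Counter(selected)[z_ai]                 # occupancy of group z_ai among the others
    if m == 0:
        return gamma * beta / (M - 1 + beta)    # open a new group
    return gamma * m / (M - 1 + beta)           # join an existing group

In the inner loop, each candidate label would be weighted by this prior times the likelihood p(G | Z'_c) and sampled proportionally; the outer loop then proposes split/merge/relabel moves on the sub-event parse and accepts them with the standard Metropolis-Hastings ratio.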

Motion Synthesis

Goal: given the initial 10 frames (25 fps), synthesize the motion of one agent given the motion of the other agent and the interaction type.

Algorithm: at time t,
1) estimate the current sub-event by dynamic programming;
2) predict the ending time t' and the corresponding joint positions;
3) obtain the joint positions at t + 5 through interpolation (see the sketch below).

[Figure: the human agent's observed motion and the synthesized agent's motion across sub-events s_1 and s_2, marking times t, t + 5, and the predicted ending time t'.]
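The poster does not spell out the interpolation itself; here is a minimal sketch assuming linear interpolation between the current pose and the predicted sub-event end pose (array and argument names are ours):

import numpy as np

def interpolate_pose(J_t, J_end, t, t_end, step=5):
    """Step 3: joint positions at t + step, linearly interpolated between
    the current pose J_t and the predicted end pose J_end at time t_end.

    J_t, J_end: (num_joints, 3) arrays of 3D joint positions
    t, t_end:   current frame and predicted sub-event ending frame
    """
    if t_end <= t:
        return J_end                        # sub-event already over: hold the end pose
    alpha = min(step / (t_end - t), 1.0)    # fraction of the remaining time covered
    return (1.0 - alpha) * J_t + alpha * J_end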

Experiment

UCLA Human-Human-Object Interaction Dataset
• Five types of interactions (Shake Hands, High-Five, Pull Up, Hand Over a Cup, Throw and Catch); on average 23.6 instances per interaction, performed by 8 actors in total. Each instance lasts 2-7 s, presented at 10-15 fps.
• RGB-D videos, skeletons, and annotations are available at http://www.stat.ucla.edu/~tianmin.shu/SocialAffordance

[Figure: examples of discovered latent sub-events and their sub-goals for each of the five interactions.]

Exp 1: average joint distance in meters, compared with GT skeletons (see the metric sketch below).
[Figure: bar chart on a 0-1 m scale.]

Exp 2: user study (14 subjects). Q1: Successful? Q2: Natural? Q3: Human vs. robot? Rated from 1 (worst) to 5 (best).
[Figure: frequencies of high scores (4 or 5) for Q3.]

Synthesis examples
[Figure: rows of GT, our, and HMM-baseline skeletons at sampled frames (#50 #80 #110, #30 #80 #130, #15 #45 #75, #15 #40 #65, #20 #70 #120) for Shake Hands, Pull Up, Hand Over a Cup, Throw and Catch, and High-Five.]
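For Exp 1, a minimal sketch of the reported metric, assuming the mean is taken over all frames and joints (the poster does not spell out the averaging; names are ours):

import numpy as np

def avg_joint_distance(pred, gt):
    """Mean Euclidean distance (meters) between predicted and GT joints.

    pred, gt: (num_frames, num_joints, 3) arrays of 3D joint positions
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())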

Acknowledgment

This research has been sponsored by grants DARPA SIMPLEX project N66001-15-C-4035 and ONR MURI project N00014-16-1-2007.

Scan to visit our project website.





