Scene-centric Joint Parsing of Cross-view Videos

Tianfu Wu (NCSU), Song-Chun Zhu (UCLA), Yuanlu Xu* (UCLA), Tao Yuan* (UCLA), Hang Qi* (UCLA)

* Contributed equally.

web.cs.ucla.edu/~yuanluxu/publications/scene_parse_aaai18_oral.pdf

Outline

• Introduction

• Representation
  • Ontology graph, parse graph, parse graph hierarchy

• Joint parsing problem
  • Formulation

• Cross-view Compatibility

• Inference

• Experiments

A Multi-camera Scenario

Overlapping fields of view

Establishing cross-references

View-centric recognition can be noisy and ambiguous.

Representation

Ontology Graph

• The domain knowledge on scenes and events.

• A set of plausible objects, actions, and semantic attributes.

Parse Graph

• Subgraphs of the ontology graph.

• Grounded facts given a video sequence.
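The relation between the two graphs can be illustrated with a small sketch. The class names, the dictionary-based schema, and the `is_valid` helper below are hypothetical and chosen only for illustration; the label values are taken from the example figures in these slides. The ontology lists what is plausible, and a parse graph node is accepted only if every grounded fact is allowed by it.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyGraph:
    """Domain knowledge: plausible values per concept and property (toy schema)."""
    # concept -> property name -> set of allowed values
    schema: dict = field(default_factory=dict)

    def allows(self, concept, prop, value):
        return value in self.schema.get(concept, {}).get(prop, set())


@dataclass
class ParseGraphNode:
    """One grounded entity in a parse graph."""
    concept: str
    grounded: dict  # property -> value grounded from a video sequence


def is_valid(node, ontology):
    """True if every grounded fact of the node is plausible under the ontology."""
    return all(ontology.allows(node.concept, p, v) for p, v in node.grounded.items())


ontology = OntologyGraph({
    "person": {
        "gender": {"male", "female"},
        "action": {"throwing", "catching", "moving", "running", "driving"},
        "hat": {"hat", "no-hat"},
        "sleeves": {"short-sleeves", "long-sleeves"},
    }
})

person1 = ParseGraphNode("person", {"gender": "male", "action": "throwing", "hat": "no-hat"})
bad = ParseGraphNode("person", {"action": "flying"})  # not in the ontology

print(is_valid(person1, ontology), is_valid(bad, ontology))  # True False
```

A real ontology graph would also encode part-of relations (for example head and torso under person); the flat schema here is a simplification.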

[Figure: input data from View 1 and View 2, the view-centric parse graphs, and the scene-centric parse graph. Nodes: person 1 (male, throwing, no-hat, short-sleeves) and person 2 (female, catching, no-hat, long-sleeves), each with head and torso parts. A "?" marks a label that is missing in a single view.]

View-centric Parse Graphs


View-centric detection and recognition

Ambiguities due to resolution, occlusion, illumination differences, etc.

Scene-centric Parse Graph


Multi-view joint inference

A cohesive set of scene-centric beliefs about the data.

Parse Graph Hierarchy

[Figure: the two-view example shown as a parse graph hierarchy, comprising the scene-centric parse graph and the view-centric parse graphs of View 1 and View 2.]

[Figure: the parse graph hierarchy over time steps t-1, t, and t+1. At each step, a scene-centric parse graph (g_{t-1}, g_t, g_{t+1}) is shown with three view-centric parse graphs (ḡ_t^(1), ḡ_t^(2), ḡ_t^(3)). Example nodes: person (parts: head, torso, lower body; attributes: male or female, long hair or short hair, long sleeves or T-shirt, jeans or shorts; actions: moving, running, driving) and vehicle (parts: trunk, hood, door, each open or closed; action: moving).]

Spatio-Temporal Joint Parsing Problem

Infer the parse graph hierarchy from videos captured by a network of cameras.

Spatio-Temporal Joint Parsing Problem

Likelihood: models the grounding of nodes in view-centric parse graphs to the input video sequences.

Prior: models the compatibility of scene-centric and view-centric parse graphs across time.
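Read as a maximum a posteriori problem, the likelihood and prior terms combine as sketched below. The symbols G (the parse graph hierarchy) and M (the input videos from all cameras) are notation assumed for this sketch; the slide text itself does not define them.

```latex
G^{*} \;=\; \arg\max_{G} \; p(G \mid M)
      \;=\; \arg\max_{G} \; \underbrace{p(M \mid G)}_{\text{likelihood}} \cdot \underbrace{p(G)}_{\text{prior}}
```

The second equality is Bayes' rule with the evidence p(M) dropped, since it does not depend on G.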

Cross-view Compatibility

Appearance similarity. The appearance energy regulates the appearance similarity of objects in the scene-centric parse graph and in the view-centric parse graphs.

Spatial consistency. For each object node in the parse graph hierarchy, we keep a scene-centric location in the scene-centric parse graph and a view-centric location in each view-centric parse graph.
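These two terms can be sketched as simple energies. Everything below is an assumption made for illustration: the squared-distance form of both energies, the use of a 3x3 homography to map a view-centric pixel location to the scene frame, and all function names. The slides do not specify the actual functional forms.

```python
def appearance_energy(scene_feat, view_feats):
    """Sum of squared distances between the scene-centric appearance
    descriptor and the descriptor observed in each view."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(scene_feat, feat))
        for feat in view_feats
    )


def to_scene(H, pixel):
    """Map an image point (x, y) to scene coordinates with a 3x3 homography
    given as nested lists (assumed calibration model)."""
    x, y = pixel
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def spatial_energy(scene_xy, view_pixels, homographies):
    """Sum of squared distances between the scene-centric location and
    each view-centric location after mapping it to the scene frame."""
    total = 0.0
    for pixel, H in zip(view_pixels, homographies):
        gx, gy = to_scene(H, pixel)
        total += (gx - scene_xy[0]) ** 2 + (gy - scene_xy[1]) ** 2
    return total


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
HALF = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]

print(appearance_energy([1.0, 0.0], [[1.0, 0.0], [0.0, 0.0]]))  # 1.0
print(spatial_energy((2.0, 3.0), [(2.0, 3.0), (4.0, 6.0)], [IDENTITY, HALF]))  # 0.0
```

Both views in the second call map to the same scene point, so the spatial energy is zero; moving either pixel away raises it.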

Cross-view Compatibility

Action compatibility. Scene-centric human action predictions should agree with the human pose observed in individual views from different viewing angles.

[Figure labels: action labels, pose.]

Attribute consistency. In cross-view sequences, entities observed from multiple cameras should have a consistent set of attributes.
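The attribute consistency idea can be sketched as follows, assuming each view supplies a probability for every candidate label and that views are combined by summing log-probabilities (a naive independence assumption made here, not taken from the slides). An ambiguous view, the "?" case in the figures, is modelled as a uniform distribution. Function names are hypothetical.

```python
import math


def fuse_attribute(view_probs):
    """Pick the label with the highest summed log-probability across views."""
    labels = view_probs[0].keys()
    scores = {label: sum(math.log(p[label]) for p in view_probs) for label in labels}
    return max(scores, key=scores.get)


def consistency_energy(scene_label, view_probs):
    """Negative log-probability that the views assign to the scene-centric label.
    Lower energy means the views agree better with that label."""
    return -sum(math.log(p[scene_label]) for p in view_probs)


# View 1 cannot tell the sleeve type; View 2 sees it clearly.
view1 = {"long-sleeves": 0.5, "short-sleeves": 0.5}
view2 = {"long-sleeves": 0.9, "short-sleeves": 0.1}

print(fuse_attribute([view1, view2]))  # long-sleeves
```

The uninformative view does not change the decision, so the confident view resolves the ambiguity for the scene-centric node.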

Inference

1. Initialize with view-centric object, action, and attribute proposals from pre-trained detectors.

2. Sample the parse graph structure with Markov Chain Monte Carlo (MCMC).

3. For a fixed parse graph hierarchy, estimate the value of each node with dynamic programming.

4. Iterate steps 2 and 3 until convergence.

[Figure panels: merging, splitting, swapping.]
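The alternation between structure sampling and value estimation can be sketched on a toy problem. This is a heavily simplified stand-in under stated assumptions, not the authors' algorithm: observations are single numbers, the energy is a made-up within-entity spread plus a per-entity penalty, only merge and split moves are implemented (swapping is omitted), the acceptance rule is Metropolis-style and ignores the proposal ratio, and step 3's dynamic programming is replaced by taking a plain mean per entity. All names are hypothetical.

```python
import math
import random


def energy(clusters, penalty=1.0):
    """Toy energy: squared deviation from each entity's mean plus a per-entity penalty."""
    total = penalty * len(clusters)
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total


def propose(clusters, rng):
    """Propose a new structure by merging two entities or splitting one."""
    new = [list(c) for c in clusters]
    if len(new) > 1 and rng.random() < 0.5:
        i, j = rng.sample(range(len(new)), 2)
        merged = new[i] + new[j]
        new = [c for k, c in enumerate(new) if k not in (i, j)] + [merged]
    else:
        splittable = [k for k, c in enumerate(new) if len(c) > 1]
        if splittable:
            c = new.pop(rng.choice(splittable))
            rng.shuffle(c)
            cut = rng.randrange(1, len(c))
            new += [c[:cut], c[cut:]]
    return new


def sample_structure(observations, steps=2000, temperature=0.1, seed=0):
    """Stochastic search over groupings, keeping the lowest-energy structure seen."""
    rng = random.Random(seed)
    current = [[x] for x in observations]  # step 1: one entity per proposal
    best = current
    for _ in range(steps):
        candidate = propose(current, rng)
        delta = energy(candidate) - energy(current)
        # Metropolis-style acceptance; the proposal ratio is ignored for brevity.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if energy(current) < energy(best):
            best = current
    return sorted(sorted(c) for c in best)


def estimate_values(clusters):
    """Stand-in for step 3: with the structure fixed, use the mean as each entity's value."""
    return [sum(c) / len(c) for c in clusters]


structure = sample_structure([0.0, 0.1, 5.0, 5.1])
print(structure)  # [[0.0, 0.1], [5.0, 5.1]]
print(estimate_values(structure))
```

The four observations are grouped into two entities because merging nearby observations lowers the energy while merging distant ones raises it sharply.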

Experiments

• CAMPUS Dataset

• TUM Kitchen Dataset

[Result panels: detection; multi-object tracking.]

Action recognition

[Figure: confusion matrices of action recognition on CAMPUS, view-centric and scene-centric.]

[Figure: accuracy vs. number of cameras. The breakdown of action recognition accuracy according to the number of camera views in which each entity is observed.]
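A breakdown of this kind can be computed by grouping predictions by the number of views in which each entity is observed. The record layout and function name below are assumptions, and the records are made-up placeholders for illustration only; they are not results from the CAMPUS or TUM Kitchen experiments.

```python
from collections import defaultdict


def accuracy_by_view_count(records):
    """records: iterable of (num_views, predicted_label, true_label).
    Returns accuracy per number of camera views, keyed in ascending order."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for num_views, pred, truth in records:
        totals[num_views] += 1
        hits[num_views] += int(pred == truth)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Placeholder records, not experimental data.
toy_records = [
    (1, "throwing", "catching"),
    (1, "throwing", "throwing"),
    (2, "catching", "catching"),
    (2, "throwing", "throwing"),
    (3, "catching", "catching"),
]

print(accuracy_by_view_count(toy_records))  # {1: 0.5, 2: 1.0, 3: 1.0}
```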

Comparison with More Baselines

Human Attributes

Action Recognition

Conclusion

• A parse graph hierarchy for representing a comprehensive understanding of cross-view videos.

• A joint parsing framework that infers a set of coherent scene-centric predictions.

• Explored various constraints that reflect the appearance and geometry correlations among objects across multiple views and the correlations among different semantic properties of objects.

Thank you! Q & A