Scene-centric Joint Parsing of Cross-view Videos · web.cs.ucla.edu/~yuanluxu/publications/scene_parse_aaai18_oral.pdf

Scene-centric Joint Parsing of Cross-view Videos

Tianfu Wu (NCSU)

Song-Chun Zhu (UCLA)

Yuanlu Xu* (UCLA)

Tao Yuan* (UCLA)

Hang Qi* (UCLA)

* Contributed equally.

Outline

• Introduction

• Representation
  • Ontology graph, parse graph, parse graph hierarchy

• Joint parsing problem
  • Formulation

• Cross-view Compatibility

• Inference

• Experiments

A Multi-camera Scenario

Overlapping fields of view

Establishing cross-references

View-centric recognition can be noisy and ambiguous.

Representation

Ontology Graph

• The domain knowledge on scenes and events.

• A set of plausible objects, actions and semantic attributes.

Parse Graph

• Subgraphs of the ontology graph.

• Grounded facts given a video sequence.
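The relationship between the two graphs can be sketched in a few lines. This is only an illustration: the `ONTOLOGY` dictionary, the `ParseGraph` class, and the edge set are hypothetical names and toy values, loosely based on the labels in the throwing/catching example, and not the authors' data structures.

```python
from dataclasses import dataclass, field

# Hypothetical toy ontology: each concept maps to the child concepts
# (parts, attributes, actions) that may be attached to it.
ONTOLOGY = {
    "person": {"head", "torso", "male", "female", "throwing", "catching"},
    "head": {"no-hat"},
    "torso": {"short-sleeves", "long-sleeves"},
}


@dataclass
class ParseGraph:
    """Grounded facts for one video, stored as (parent, child) edges."""
    edges: set = field(default_factory=set)

    def add(self, parent: str, child: str) -> None:
        # Only edges licensed by the ontology may be grounded.
        if child not in ONTOLOGY.get(parent, set()):
            raise ValueError(f"{parent} -> {child} is not in the ontology")
        self.edges.add((parent, child))


def is_subgraph(pg: ParseGraph) -> bool:
    """True if every edge of the parse graph also exists in the ontology."""
    return all(child in ONTOLOGY.get(parent, set()) for parent, child in pg.edges)


pg = ParseGraph()
pg.add("person", "male")
pg.add("person", "throwing")
pg.add("torso", "short-sleeves")
```

Here `is_subgraph(pg)` holds by construction, while an edge such as `head -> long-sleeves` is rejected because the toy ontology does not contain it.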

[Figure: input data from View 1 and View 2; the view-centric parse graphs for person 1 and person 2, with part nodes (head, torso), attribute labels (male, female, no-hat, short-sleeves, long-sleeves), action labels (throwing, catching), and uncertain nodes marked "?"; and the scene-centric parse graph.]

View-centric Parse Graphs

[Figure: the two-view throwing/catching example; several nodes in the view-centric parse graphs are uncertain ("?").]

View-centric Detection and Recognition

Ambiguities due to resolution, occlusion, illumination differences, etc.

Scene-centric Parse Graph

[Figure: the two-view throwing/catching example, with the view-centric parse graphs combined into a scene-centric parse graph.]

Multi-view Joint Inference

A cohesive set of scene-centric beliefs about the data.

Parse Graph Hierarchy

[Figure: parse graph hierarchy over time (t-1, t, t+1). At each time step a scene-centric parse graph (g_{t-1}, g_t, g_{t+1}) sits above the view-centric parse graphs of three views (ḡ_t^(1), ḡ_t^(2), ḡ_t^(3)). Node labels include person, head, torso, lower body; vehicle, trunk, hood, door; attributes long hair, short hair, long sleeves, T-shirt, jeans, shorts, male, female; and actions or states moving, running, driving, closed, open.]

Spatio-Temporal Joint Parsing Problem

Infer the parse graph hierarchy from videos captured by a network of cameras.

Likelihood: models the grounding of nodes in view-centric parse graphs to the input video sequences.

Prior: models the compatibility of scene-centric and view-centric parse graphs across time.
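One way to read the two terms, as a sketch only: writing G for the parse graph hierarchy and M for the input videos (both symbols are assumptions introduced here, not taken from the slide), the joint parsing problem can be posed as maximum a posteriori inference, where Bayes' rule splits the posterior into a likelihood and a prior.

```latex
G^{*} \;=\; \arg\max_{G} \; p(G \mid M)
      \;=\; \arg\max_{G} \; \underbrace{p(M \mid G)}_{\text{likelihood}} \;
                            \underbrace{p(G)}_{\text{prior}}
```

The second equality holds because p(M) does not depend on G.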

Cross-view Compatibility

Appearance similarity. The appearance energy regulates the appearance similarity of objects in the scene-centric parse graph and in the view-centric parse graphs.

Spatial consistency. Each object node in the parse graph hierarchy keeps a scene-centric location in the scene-centric parse graphs and a view-centric location in the view-centric parse graphs.

[Figure: the two-view throwing/catching example, shown once for each constraint.]
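A minimal sketch of how such terms might be scored follows. The functional forms are assumptions: squared feature distance for appearance, and for spatial consistency the distance between each view-centric location and the scene-centric location projected into that view by a per-view homography. The slides do not give the actual definitions, and all function names are hypothetical.

```python
import math


def appearance_energy(scene_feat, view_feats):
    """Sum of squared distances between the scene-centric appearance vector
    and each view-centric appearance vector (assumed form)."""
    return sum(
        sum((s - v) ** 2 for s, v in zip(scene_feat, feat))
        for feat in view_feats
    )


def project(H, xy):
    """Apply a 3x3 homography H (nested lists) to a 2D point."""
    x, y = xy
    u, v, w = (row[0] * x + row[1] * y + row[2] for row in H)
    return (u / w, v / w)


def spatial_energy(scene_xy, view_xys, homographies):
    """Distance between each view-centric location and the scene-centric
    location projected into that view (assumed form)."""
    return sum(
        math.dist(project(H, scene_xy), xy)
        for H, xy in zip(homographies, view_xys)
    )


# Toy calibration: view 1 uses the identity, view 2 shifts points 2 units along x.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
SHIFT_X = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]
```

With these toy homographies, a scene-centric location of (1, 1) projects to (1, 1) in view 1 and (3, 1) in view 2, so a view-2 detection at (3, 4) contributes a distance of 3 to the spatial term.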

Cross-view Compatibility

Action compatibility. Scene-centric human action predictions shall agree with the human pose observed in individual views from different viewing angles.

Attribute consistency. In cross-view sequences, entities observed from multiple cameras shall have a consistent set of attributes.

[Figure: action labels related to pose; the two-view throwing/catching example, shown for each constraint.]
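The attribute constraint can be illustrated with a small sketch. It assumes each view reports a confidence per attribute value, and that a view that cannot observe the attribute (a "?" node) reports nothing. Summing confidences and counting disagreeing views are illustrative choices, not the energy used in the talk.

```python
from collections import defaultdict


def fuse_attribute(view_scores):
    """Pick one scene-centric value from per-view confidences.

    view_scores: one dict per view mapping attribute value -> confidence.
    Views that cannot see the attribute pass an empty dict.
    """
    totals = defaultdict(float)
    for scores in view_scores:
        for value, conf in scores.items():
            totals[value] += conf
    return max(totals, key=totals.get) if totals else None


def inconsistency(scene_value, view_scores):
    """Count views whose top-scoring value differs from the scene-centric one."""
    return sum(
        1
        for scores in view_scores
        if scores and max(scores, key=scores.get) != scene_value
    )


# Toy confidences for one person's gender attribute in three views.
views = [{"male": 0.9, "female": 0.1}, {}, {"male": 0.4, "female": 0.6}]
scene_value = fuse_attribute(views)
```

In this toy case the fused value is `male`, and `inconsistency(scene_value, views)` is 1 because the third view alone would have preferred `female`.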

Inference

1. Initialize with view-centric object, action, and attribute proposals from pre-trained detectors.

2. Sample the parse graph structure with Markov Chain Monte Carlo (MCMC).

3. For a fixed parse graph hierarchy, estimate the value of each node with dynamic programming.

4. Iterate steps 2 and 3 until convergence.
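Step 3 can be pictured with a Viterbi-style recursion over one node's label sequence. The chain structure, the per-frame costs, and the constant switching penalty are assumptions made for this sketch; the slides state only that dynamic programming is used.

```python
def viterbi(unary, switch_cost):
    """Return the minimum-cost label sequence for one node over time.

    unary[t][k] is the cost of label k at time t; changing label between
    consecutive frames costs `switch_cost` (both are assumed quantities).
    """
    n_labels = len(unary[0])
    cost = list(unary[0])
    back = []
    for t in range(1, len(unary)):
        new_cost, ptr = [], []
        for k in range(n_labels):
            # Best previous label for ending in label k at time t.
            best_j = min(
                range(n_labels),
                key=lambda j: cost[j] + (0 if j == k else switch_cost),
            )
            step = 0 if best_j == k else switch_cost
            ptr.append(best_j)
            new_cost.append(cost[best_j] + step + unary[t][k])
        cost = new_cost
        back.append(ptr)
    # Trace the best path backwards from the cheapest final label.
    k = min(range(n_labels), key=cost.__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]


# Toy costs for labels 0 and 1 over three frames; the middle frame is noisy.
unary = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]
smoothed = viterbi(unary, switch_cost=0.5)
per_frame = viterbi(unary, switch_cost=0.0)
```

With the switching penalty the noisy middle frame is smoothed to `[0, 0, 0]`; without it the result is the per-frame minimum `[0, 1, 0]`.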

[Figure: merging, splitting, and swapping moves.]
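Reading the three moves as MCMC proposals for step 2, the structure sampling can be sketched as a Metropolis-style search over groupings of view-centric detections into scene-centric entities. Everything here is hypothetical: the pairwise `energy`, the detection names, the temperature, and the acceptance rule, which treats the proposal as symmetric for simplicity.

```python
import math
import random


def energy(groups, same):
    """Toy energy: count detection pairs whose grouping disagrees with `same`,
    a set of sorted pairs that should belong to one entity."""
    label = {d: i for i, g in enumerate(groups) for d in g}
    dets = sorted(label)
    return sum(
        (label[a] == label[b]) != ((a, b) in same)
        for i, a in enumerate(dets)
        for b in dets[i + 1:]
    )


def propose(groups, rng):
    """Return a new grouping after one random merge, split, or swap move."""
    groups = [set(g) for g in groups]
    move = rng.choice(["merge", "split", "swap"])
    if move == "merge" and len(groups) > 1:
        a, b = rng.sample(range(len(groups)), 2)
        groups[a] |= groups[b]
        del groups[b]
    elif move == "split":
        big = [g for g in groups if len(g) > 1]
        if big:
            g = rng.choice(big)
            d = rng.choice(sorted(g))
            g.remove(d)
            groups.append({d})
    elif move == "swap" and len(groups) > 1:
        a, b = rng.sample(range(len(groups)), 2)
        da = rng.choice(sorted(groups[a]))
        db = rng.choice(sorted(groups[b]))
        groups[a].remove(da)
        groups[b].remove(db)
        groups[a].add(db)
        groups[b].add(da)
    return [g for g in groups if g]


def sample(dets, same, steps=2000, temperature=0.5, seed=0):
    """Run the sampler and return the lowest-energy grouping seen."""
    rng = random.Random(seed)
    state = [{d} for d in dets]
    e = energy(state, same)
    best, best_e = state, e
    for _ in range(steps):
        cand = propose(state, rng)
        ce = energy(cand, same)
        # Metropolis rule; assumes a symmetric proposal (a simplification).
        if ce <= e or rng.random() < math.exp((e - ce) / temperature):
            state, e = cand, ce
            if e < best_e:
                best, best_e = state, e
    return best, best_e


dets = ["v1_p1", "v1_p2", "v2_p1", "v2_p2"]
same = {("v1_p1", "v2_p1"), ("v1_p2", "v2_p2")}
best, best_e = sample(dets, same)
```

The chain starts with every detection as its own entity; a zero-energy grouping pairs each person's detections across the two views.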

Experiments

• CAMPUS Dataset

• TUM Kitchen Dataset

Detection and Multi-object Tracking

[Figure panels: View-centric, Scene-centric.]

Action recognition

[Figure: Confusion matrices of action recognition on CAMPUS.]

[Figure: Accuracy vs. number of cameras. The breakdown of action recognition accuracy according to the number of camera views in which each entity is observed.]
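The breakdown described above amounts to grouping predictions by view count. The sketch below shows one way to compute it; the record format is an assumption, and the example records are made-up toy values, not results from the CAMPUS or TUM Kitchen datasets.

```python
from collections import defaultdict


def accuracy_by_view_count(records):
    """Group predictions by the number of views in which the entity is seen.

    records: iterable of (n_views, predicted_label, true_label) tuples.
    Returns {n_views: accuracy}, ordered by n_views.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n_views, pred, truth in records:
        totals[n_views] += 1
        hits[n_views] += pred == truth
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Made-up toy records for illustration only.
toy_records = [
    (1, "running", "moving"),
    (1, "running", "running"),
    (2, "running", "running"),
    (2, "driving", "driving"),
]
breakdown = accuracy_by_view_count(toy_records)
```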

Comparison with More Baselines

[Tables: Human Attributes; Action Recognition.]

Conclusion

• A parse graph hierarchy for representing a comprehensive understanding of cross-view videos.

• A joint parsing framework that infers a set of coherent scene-centric predictions.

• An exploration of various constraints that reflect the appearance and geometry correlations among objects across multiple views, and the correlations among different semantic properties of objects.

Thank you! Q & A
