Scene-centric Joint Parsing of Cross-view Videos
Tianfu Wu
NCSU
Song-Chun Zhu
UCLA
Yuanlu Xu*
UCLA
Tao Yuan*
UCLA
Hang Qi*
UCLA
* Contributed equally.
Outline
• Introduction
• Representation
  • Ontology graph, parse graph, parse graph hierarchy
• Joint parsing problem
  • Formulation
  • Cross-view compatibility
  • Inference
• Experiments
A Multi-camera Scenario
Overlapping fields of view
Establishing cross-references
View-centric recognition can be noisy and ambiguous.
Representation
Ontology Graph
• The domain knowledge on scenes and
events.
• A set of plausible objects, actions and
semantic attributes.
Parse Graph
• Subgraphs of the ontology graph.
• Grounded facts given a video
sequence.
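The relationship between the two graphs can be sketched as a small data structure. This is only an illustration: the `ONTOLOGY` table, the `Node` class, and `conforms_to_ontology` are hypothetical names, and the labels are taken from the example figures in these slides rather than from the authors' actual ontology.

```python
from dataclasses import dataclass, field

# Hypothetical ontology: the labels each kind of node is allowed to take.
ONTOLOGY = {
    "object": {"person", "vehicle"},
    "part": {"head", "torso"},
    "action": {"throwing", "catching"},
    "attribute": {"male", "female", "no-hat", "short-sleeves", "long-sleeves"},
}

@dataclass
class Node:
    kind: str                      # "object", "part", "action", or "attribute"
    label: str                     # grounded value, e.g. "person"
    children: list = field(default_factory=list)

def conforms_to_ontology(node):
    """Return True if the node and all of its descendants use labels the ontology allows."""
    if node.label not in ONTOLOGY.get(node.kind, set()):
        return False
    return all(conforms_to_ontology(child) for child in node.children)

# A parse graph for one person observed in a video sequence.
person_1 = Node("object", "person", [
    Node("attribute", "male"),
    Node("action", "throwing"),
    Node("part", "head"),
    Node("part", "torso"),
])
```

Here `conforms_to_ontology(person_1)` holds, while a node with a label outside the ontology (for example an action named "flying") would be rejected.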
[Figure: input data from View 1 and View 2, together with the view-centric parse graphs and the scene-centric parse graph. Nodes shown: person 1 (male, throwing, no-hat, short-sleeves) and person 2 (female, catching, no-hat, long-sleeves), each with head and torso parts. Some view-centric nodes are marked "?".]
View-centric Parse Graphs
View-centric detection and recognition
Ambiguities due to resolution, occlusion, illumination differences, etc.
Scene-centric Parse Graph
View-centric detection and recognition
Ambiguities due to resolution, occlusion, illumination differences, etc.
Multi-view joint inference
A cohesive set of scene-centric beliefs about the data.
Parse Graph Hierarchy
[Figure: parse graph hierarchy over time. At each time step (t-1, t, t+1) a scene-centric parse graph (g_{t-1}, g_t, g_{t+1}) is linked to the view-centric parse graphs of the individual views (ḡ_t^(1), ḡ_t^(2), ḡ_t^(3), and likewise for t-1 and t+1). Example nodes: person (parts: head, torso, lower body; attributes: male/female, long hair/short hair, long sleeves/T-shirt, jeans/shorts; actions: moving, running, driving) and vehicle (parts: trunk, hood, door; states: open/closed; action: moving).]
Spatio-Temporal Joint Parsing Problem
Infer the parse graph hierarchy
from videos captured by a network of cameras.
Spatio-Temporal Joint Parsing Problem
• Likelihood: models the grounding of nodes in view-centric parse graphs to the input video sequences.
• Prior: models the compatibility of scene-centric and view-centric parse graphs across time.
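The likelihood and prior named on this slide fit the usual Bayesian (MAP) reading of the problem. As a sketch, writing G for the parse graph hierarchy and M for the input videos (both symbols are assumed here for illustration; the slide's own formula did not survive extraction):

```latex
G^{*} \;=\; \arg\max_{G} \, p(G \mid M)
      \;=\; \arg\max_{G} \, \underbrace{p(M \mid G)}_{\text{likelihood}} \;\cdot\; \underbrace{p(G)}_{\text{prior}}
```

The second equality holds because p(M) does not depend on G, so it can be dropped from the maximization.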
Cross-view Compatibility
Appearance similarity. Spatial consistency.
• Appearance similarity: the appearance energy regulates the appearance similarity of objects in the scene-centric parse graph and in the view-centric parse graphs.
• Spatial consistency: for each object node in the parse graph hierarchy, we keep a scene-centric location in the scene-centric parse graph and a view-centric location in each view-centric parse graph.
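One way to picture these two terms is as energies that grow when a view disagrees with the scene-centric estimate. The sketch below is an assumption-laden illustration, not the authors' definition: it assumes appearance is a feature vector compared by cosine distance, and that a scene-centric ground-plane location maps into each view through a known 3x3 homography. The function names are hypothetical.

```python
import math

def appearance_energy(scene_feat, view_feats):
    """Sum of cosine distances between the scene-centric feature and each view's feature."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
    return sum(cosine_distance(scene_feat, f) for f in view_feats)

def project(H, xy):
    """Apply a 3x3 homography H (list of rows) to a 2D point."""
    x, y = xy
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def spatial_energy(scene_xy, view_xys, homographies):
    """Sum of squared distances between the projected scene location and each view location.

    view_xys and homographies are dicts keyed by view id (an assumed layout).
    """
    total = 0.0
    for view_id, (vx, vy) in view_xys.items():
        px, py = project(homographies[view_id], scene_xy)
        total += (px - vx) ** 2 + (py - vy) ** 2
    return total
```

Both energies are zero when every view agrees exactly with the scene-centric node and increase with the amount of disagreement.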
Cross-view Compatibility
Action compatibility. Attribute consistency.
• Action compatibility: scene-centric human action predictions should agree with the human pose observed in individual views from different viewing angles.
• Attribute consistency: in cross-view sequences, an entity observed from multiple cameras should have a consistent set of attributes.
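Attribute consistency can be illustrated by fusing per-view classifier outputs into a single scene-centric belief. The sketch below assumes each view provides a strictly positive probability for every label and combines them by summing log-probabilities; this fusion rule and the function names are assumptions made for illustration, not the method described in the slides.

```python
import math

def fuse_attribute(view_probs):
    """Combine per-view probability tables for one attribute into one normalized belief."""
    labels = list(view_probs[0].keys())
    # Sum of log-probabilities across views (assumes every probability is > 0).
    log_scores = {l: sum(math.log(p[l]) for p in view_probs) for l in labels}
    top = max(log_scores.values())
    unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {l: v / z for l, v in unnormalized.items()}

def attribute_energy(scene_label, view_probs):
    """Energy of a scene-centric label: lower when the views support it."""
    return -sum(math.log(p[scene_label]) for p in view_probs)

# View 1 is fairly sure the person is male; view 2 cannot tell.
views = [{"male": 0.9, "female": 0.1}, {"male": 0.5, "female": 0.5}]
belief = fuse_attribute(views)
```

An uninformative view (0.5 / 0.5) leaves the fused belief unchanged, so the confident view decides the scene-centric attribute.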
Inference
1. Initialize with view-centric objects, actions, and attributes proposals from pre-trained detectors.
2. Sample parse graph structure with Markov Chain Monte Carlo (MCMC).
3. For a fixed parse graph hierarchy, estimate the value of each node with dynamic programming.
4. Iterate steps 2 and 3 until convergence.
MCMC moves: merging, splitting, swapping.
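The alternation between structure sampling and node-value estimation can be sketched on a toy problem. Everything here is a simplification under stated assumptions: detections carry a single scalar feature, the energy is a made-up clustering cost, the sampler uses one symmetric "reassign a detection" move with plain Metropolis acceptance instead of the merging, splitting, and swapping moves, and the dynamic program is a standard Viterbi pass over per-frame label costs. All function names are hypothetical.

```python
import math
import random

def grouping_energy(labels, feats, penalty=1.0):
    """Toy energy: squared distance to each group's mean plus a fixed cost per group."""
    groups = {}
    for lab, f in zip(labels, feats):
        groups.setdefault(lab, []).append(f)
    energy = penalty * len(groups)
    for members in groups.values():
        mean = sum(members) / len(members)
        energy += sum((f - mean) ** 2 for f in members)
    return energy

def sample_structure(feats, steps=2000, temperature=0.1, seed=0):
    """Metropolis sampling over assignments of detections to entities."""
    rng = random.Random(seed)
    n = len(feats)
    labels = list(range(n))            # start with one entity per detection
    cur = grouping_energy(labels, feats)
    best, best_e = labels[:], cur
    for _ in range(steps):
        proposal = labels[:]
        # Symmetric proposal: pick a detection and a label uniformly at random.
        proposal[rng.randrange(n)] = rng.randrange(n)
        new = grouping_energy(proposal, feats)
        if new <= cur or rng.random() < math.exp((cur - new) / temperature):
            labels, cur = proposal, new
            if cur < best_e:
                best, best_e = labels[:], cur
    return best, best_e

def viterbi(unary, switch_cost=1.0):
    """Minimum-cost label sequence; unary[t][k] is the cost of label k at frame t."""
    num_labels = len(unary[0])
    cost = list(unary[0])
    back = []
    for t in range(1, len(unary)):
        prev, cost, ptr = cost, [], []
        for k in range(num_labels):
            cands = [prev[j] + (0.0 if j == k else switch_cost)
                     for j in range(num_labels)]
            j_best = min(range(num_labels), key=cands.__getitem__)
            cost.append(cands[j_best] + unary[t][k])
            ptr.append(j_best)
        back.append(ptr)
    k = min(range(num_labels), key=cost.__getitem__)
    path = [k]
    for ptr in reversed(back):         # follow back-pointers to the first frame
        k = ptr[k]
        path.append(k)
    return path[::-1]
```

In a fuller version, `sample_structure` would propose changes to the parse graph hierarchy itself, and `viterbi` would be rerun on each accepted structure to re-estimate the node values, repeating until the energy stops improving.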
Experiments
• CAMPUS Dataset
• TUM Kitchen Dataset
Detection. Multi-object tracking.
View-centric Scene-centric
Action recognition
The breakdown of action recognition accuracy
according to the number of camera views in which
each entity is observed.
Confusion matrices of action recognition on CAMPUS.
Accuracy vs. number of cameras.
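A breakdown of accuracy by the number of camera views can be computed by grouping predictions on that count. The sketch below uses a hypothetical record layout of `(num_views, predicted_label, true_label)`; the records are toy values for illustration, not results from the experiments.

```python
from collections import defaultdict

def accuracy_by_num_views(records):
    """Group (num_views, predicted, truth) records by num_views and compute accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for num_views, predicted, truth in records:
        total[num_views] += 1
        correct[num_views] += int(predicted == truth)
    return {k: correct[k] / total[k] for k in sorted(total)}

# Toy records only: entities seen in one or two views.
toy_records = [
    (1, "running", "moving"),
    (1, "running", "running"),
    (2, "driving", "driving"),
]
breakdown = accuracy_by_num_views(toy_records)
```

The result maps each view count to the fraction of correctly labeled entities observed in that many views.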
Comparison with More Baselines
Human Attributes
Action Recognition
Conclusion
• A parse graph hierarchy for representing a comprehensive understanding of cross-view videos.
• A joint parsing framework that infers a set of coherent scene-centric predictions.
• Explored various constraints that reflect the appearance and geometry correlations among objects across multiple views and the correlations among different semantic properties of objects.
Thank you!
Q & A