Spatiotemporal Graphs for Object Segmentation, Human Pose Estimation and
Action Detection in Videos
Mubarak Shah
Center for Research in Computer Vision
University of Central Florida
Spatiotemporal Graphs (STG)
• Video-based problems
• Nodes and edges
• Spatiotemporal
• Type I
• Type II
Type I Spatiotemporal Graph (STG)
• Nodes represent entities in single frames
(Figure: nodes within each frame, from Frame 1 onward)
Nodes can be: object proposals, pixels, super-pixels, object locations, …
Edges can be: color similarities, distances, shape similarities, …
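For concreteness, such a Type I graph can be assembled in a few lines of Python. This is a toy sketch: the `frames` layout, the `similarity` callback, and the threshold are illustrative assumptions, not an implementation from the talk.

```python
def build_type1_stg(frames, similarity, threshold=0.5):
    """Build a Type I spatiotemporal graph: one node per entity per frame,
    with weighted edges between sufficiently similar entities in
    consecutive frames."""
    # Nodes are keyed by (frame index, entity index).
    nodes = {(f, i): p for f, props in enumerate(frames) for i, p in enumerate(props)}
    edges = {}
    for f in range(len(frames) - 1):
        for i, p in enumerate(frames[f]):
            for j, q in enumerate(frames[f + 1]):
                s = similarity(p, q)
                if s >= threshold:  # keep only sufficiently similar pairs
                    edges[((f, i), (f + 1, j))] = s
    return nodes, edges
```

Any of the node and edge types listed above can be plugged in by changing what `frames` holds (proposals, pixels, super-pixels, …) and how `similarity` compares two entities.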
Type II Spatiotemporal Graph (STG)
• Nodes represent entities in multiple frames
Nodes can be: object tracklets, super-voxels, …
Edges can be: appearance similarities, motion models, overlaps, …
Examples of Spatiotemporal Graphs
Original Video Object Segmentation
Video Object Segmentation (VOS)
Spatiotemporal Graph (STG): Video Object Segmentation
(Figure: spatiotemporal graph over frames i-1, i, and i+1, with s and t axes)
Video Object Co-Segmentation (VOCS)
(Figure: object proposal tracklets extracted from Video 1 and Video 2)
STG – Video Object Co-Segmentation
Human Pose Estimation in Videos (HPEV)
STG – Human Pose Estimation in Videos
(Figure: body-part nodes: Head Top, Head Bottom, Shoulder, Elbow, Hand, Hip, Knee, Ankle)
Action Detection (HAD)
(Figure: training videos for action c, e.g. Diving: Video 1 … Video n)
Spatiotemporal Context Graphs for Training Videos
Context Graphs: G1(V1, E1), …, Gn(Vn, En)
Composite Graph
Outline
• Video Object Segmentation (VOS)
• Video Object Co-Segmentation (VOCS)
• Human Pose Estimation in Videos (HPEV)
• Human Action Detection (HAD)
Video Object Segmentation (VOS)
Dong Zhang, Omar Javed, and Mubarak Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions”, CVPR, 2013
Video Object Segmentation (VOS)
• Applications
  • Object Recognition
  • Activity Recognition
  • Surveillance
Video Object Segmentation (VOS)
• Challenges
  • Camera movement
  • Variety of objects
  • Deformable objects
Framework
Input Video → Object Proposal Generation → Spatiotemporal Graph for Object Selection → GMMs and MRF based Optimization → Object Segmentation
Object Proposal Generation
• Object proposal methods [1,2]
[1] Ian Endres and Derek Hoiem, “Category Independent Object Proposals”, ECCV, 2010
[2] Alexe, B., Deselaers, T. and Ferrari, V., “What is an object?”, CVPR, 2010
(Figure: ranked object proposals per frame on SegTrack "monkeydog" and "parachute")
Sample a lot of proposals! Select the right ones!
Ranked object proposals expansion: multiple proposals per frame
Spatiotemporal Graph for Object Selection
(Figure: graph with a beginning node and an ending node; each node is an object proposal; unary edges represent object-ness)
Unary Edge
S_unary(r) = M(r) + A(r)
A(r): appearance (objectness) score
M(r): average Frobenius norm of the optical flow gradient
‖∇U‖_F = ‖[u_x u_y; v_x v_y]‖_F = √(u_x² + u_y² + v_x² + v_y²)
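As a sketch, the motion score M(r) can be computed with NumPy roughly as below. The boundary mask and the plain averaging are assumptions for illustration; the slide does not spell out the exact boundary definition.

```python
import numpy as np

def motion_score(u, v, boundary_mask):
    """Average Frobenius norm of the optical flow gradient over a
    proposal's boundary region (sketch of M(r))."""
    # Spatial gradients of the horizontal (u) and vertical (v) flow fields.
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    # Frobenius norm of the 2x2 flow-gradient matrix at each pixel:
    # sqrt(u_x^2 + u_y^2 + v_x^2 + v_y^2)
    frob = np.sqrt(ux ** 2 + uy ** 2 + vx ** 2 + vy ** 2)
    # Average over the pixels in the proposal's boundary region.
    return float(frob[boundary_mask].mean())
```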
Unary Edge: Score
(Figure: original video frame, optical flow, object region (proposal), optical flow gradient, boundary region, and optical flow gradient around the boundary)
Unary Edge: Motion Score
Binary edge
(Figure: binary edges connect object proposals in consecutive frames i, i+1, i+2)
Binary Edges
S_binary(r_m, r_n) = λ · S_overlap(r_m, r_n) · S_color(r_m, r_n)
S_color(r_m, r_n) = hist(r_m) · hist(r_n)^T
S_overlap(r_m, r_n) = |r_m ∩ warp_mn(r_n)| / |r_m ∪ warp_mn(r_n)|
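A minimal sketch of the three scores, assuming binary region masks, an already-warped second proposal, and L1-normalized color histograms (all illustrative details not specified on the slide):

```python
import numpy as np

def color_similarity(hist_m, hist_n):
    # S_color: dot product of the two proposals' color histograms.
    return float(np.dot(hist_m, hist_n))

def overlap_similarity(mask_m, warped_mask_n):
    # S_overlap: intersection-over-union of r_m and the warped r_n.
    inter = np.logical_and(mask_m, warped_mask_n).sum()
    union = np.logical_or(mask_m, warped_mask_n).sum()
    return inter / union if union else 0.0

def binary_edge_score(hist_m, hist_n, mask_m, warped_mask_n, lam=1.0):
    # S_binary = lambda * S_overlap * S_color
    return lam * overlap_similarity(mask_m, warped_mask_n) * color_similarity(hist_m, hist_n)
```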
Binary Edge Score
(Figure: spatiotemporal graph over frames i-1, i, and i+1, with s and t axes)
Goal: Find only one object proposal from each frame, such that all of them have high object-ness and high similarity across frames.
Find the highest-weighted path in the DAG: this is the longest-path problem on a DAG, which has a dynamic programming solution.
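That dynamic program can be sketched as follows; the `layers` / `unary` / `binary` dictionary layout is an illustrative assumption about how the graph is stored:

```python
def highest_weighted_path(layers, unary, binary):
    """Pick one proposal per frame maximizing the total unary (object-ness)
    plus binary (inter-frame similarity) score on the layered DAG.
    layers[f] lists proposal ids in frame f; unary[(f, i)] and
    binary[(f, i, j)] hold the node and edge scores."""
    best = {(0, i): unary[(0, i)] for i in layers[0]}  # best score ending at each node
    back = {}                                          # backpointers for path recovery
    for f in range(1, len(layers)):
        for j in layers[f]:
            score, prev = max((best[(f - 1, i)] + binary[(f - 1, i, j)], i)
                              for i in layers[f - 1])
            best[(f, j)] = score + unary[(f, j)]
            back[(f, j)] = prev
    # Trace the best proposal per frame backwards from the last layer.
    last = len(layers) - 1
    end = max(layers[last], key=lambda j: best[(last, j)])
    path = [end]
    for f in range(last, 0, -1):
        path.append(back[(f, path[-1])])
    return best[(last, end)], path[::-1]
```

Because each node is processed once per incoming edge, the run time is linear in the number of edges, which is what makes selecting one proposal per frame tractable.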
Final Spatiotemporal Graph
Results
Qualitative Results – “Girl”
Original video Ground truth
Selected object proposals Segmentation results
Region within the red boundary is the object region
Qualitative Results – “Parachute”
Original video Ground truth
Selected object proposals Segmentation results
Region within the red boundary is the object region
Qualitative Results – “Birdfall”
Original video Ground truth Segmentation results
Region within the red boundary is the object region
Original video Ground truth Segmentation results
Qualitative Results – “Cheetah”
Region within the red boundary is the object region
Original video Ground truth Segmentation results
Qualitative Results – “Monkeydog”
Region within the red boundary is the object region
SegTrack: Quantitative Results*

Video     | Ours | [14] | [13] | [20] | [6]
Use GTs?  | N    | N    | N    | Y    | Y
Birdfall  | 155  | 189  | 288  | 252  | 454
Cheetah   | 633  | 806  | 905  | 1142 | 1217
Girl      | 1488 | 1698 | 1785 | 1304 | 1755
Monkeydog | 365  | 472  | 521  | 533  | 683
Parachute | 220  | 221  | 201  | 235  | 502
Avg.      | 452  | 542  | 592  | 594  | 791

* Average per-frame pixel error rate; the smaller, the better.
Summary
• An STG selects the moving object proposals
• GMM and MRF based optimization yields pixel-level segmentation
• Performance improved by ~20%
How about multiple videos?
Video Object Co-Segmentation (VOCS)
Dong Zhang, Omar Javed, and Mubarak Shah, “Video object co-segmentation by regulated maximum weight cliques”, ECCV, 2014
Video Object Co-Segmentation (VOCS)
• Applications
  • Automatic Annotation
  • Unsupervised object detection & recognition
  • Re-Identification
(Figure: training image with annotation, and a testing image)
Video Object Co-Segmentation (VOCS)
• Challenges
  • Appearance variation
  • Multiple object classes
  • High complexity
Framework
Input Videos → Object Proposal Tracklets Generation → Regulated Maximum Weight Cliques for Tracklets → MRF based Optimization → Object Co-Segmentation
Object Proposal Tracklets Generation
(Figure: object proposals in each frame of the video; each proposal, e.g. in frame 31, is tracked forward and backward to form tracklets)
S_simi(x_m, x_n) = S_app(x_m, x_n) · S_loc(x_m, x_n) · S_shape(x_m, x_n)
(computed for all proposals, in all frames)
Regulated Maximum Weight Cliques for Tracklets
(Figure: tracklets from Video 1 and Video 2 form the graph nodes; Clique 1 contains all the chickens, Clique 2 all the turtles)
Each tracklet is a node, with node weight W(X) = Σ_{i=1}^{f} S_object(x_i).
Regulated Maximum Weight Cliques are found by our modified Bron-Kerbosch algorithm.
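For reference, the textbook Bron-Kerbosch enumeration, extended to keep the maximal clique of largest total node weight, looks roughly like this sketch; the "regulated" modification from the paper is not reproduced here:

```python
def max_weight_clique(adj, weight):
    """Enumerate maximal cliques with Bron-Kerbosch and return the one
    with the largest total node weight. adj[v] is the neighbor set of v."""
    best = (0.0, [])

    def expand(R, P, X):
        nonlocal best
        if not P and not X:                 # R is a maximal clique
            w = sum(weight[v] for v in R)
            if w > best[0]:
                best = (w, sorted(R))
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return best
```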
Results
Chicken & Turtle
Red: first object Green: second object
Original Videos CoSegmentation Results
Elephant & Giraffe
Red: first object Green: second object
Original Videos CoSegmentation Results
Lion & Zebra
Red: first object Green: second object
Original Videos Co-Segmentation Results
Quantitative Results: MOViCS Dataset

Video Set        | Ours1 | Ours2 | VCS[4] | ICS[13]
Chicken&turtle   | 0.860 | 0.860 | 0.65   | 0.08
Zebra&lion       | 0.588 | 0.636 | 0.48   | 0.23
Giraffe&elephant | 0.528 | 0.639 | 0.52   | 0.07
Tiger            | 0.336 | 0.336 | 0.30   | 0.30
Overall          | 0.578 | 0.617 | 0.49   | 0.17

Ours1: same parameters for all video sets; Ours2: different parameters for each video set.
Numbers are intersection-over-union; the larger, the better.
Summary
• Type I STG for object segmentation
• Type II STG for object co-segmentation
• Results improved more than 20%
What is the most important object?
Human!
Human Pose Estimation in Videos (HPEV)
Dong Zhang and Mubarak Shah, “Human Pose Estimation in Videos”, ICCV, 2015
Dong Zhang and Mubarak Shah, “A Framework for Human Pose Estimation in Videos” (submitted), PAMI, 2016
An Example for Human Segmentation
Coarse segmentation
Pose Estimation
Human Pose Estimation in Videos (HPEV)
• Applications
  • Action recognition
  • HCI
  • Surveillance
Human Pose Estimation in Videos (HPEV)
• Challenges
  • Huge appearance variation
  • Multiple people
  • Consistent estimation
Framework
• Input Videos
• Body Part Hypotheses Generation
• Body Part Tracking
• Pose Hypotheses Generation
• Tree-based Pose Estimation
(Figure: graph over frames f, f+1, f+2 with a node per body part. Yellow edges: commonly used intra-frame edges; blue edges: symmetric intra-frame edges; red edges: inter-frame edges)
Intra-frame and inter-frame simple cycles: too many simple cycles, making exact inference NP-hard!
Idea 1: Abstraction
Abstract Body Parts Relational Graph Real Body Parts Relational Graph
Remove intra-frame simple cycles
Idea 2: Association
Pose Relational Graph (Tracklet Graph)
Remove the inter-frame simple cycles
N-Best Hypotheses | Real Body Part Hypotheses | Abstract Body Part Hypotheses | Abstract Body Part Tracklets | Tree-based Pose Estimation
Generate many full body pose hypotheses for each video frame
Generate real body part hypotheses for the frames
Combine Symmetric Parts
Real Body Parts Relational Graph
Abstract Body Parts Relational Graph
Tracklet Hypotheses Graph
Get Best Tracklets for each part
Pose Hypotheses Graph
Select Best Poses
Qualitative Results
Outdoor Dataset (video: warmup)
Ours N-Best
Outdoor Dataset (video: bounce)
Ours N-Best
Outdoor Dataset (videos: walk2, kick)
Ours
N-Best
Ours
N-Best
N-Best Dataset (video: baseball)
Ours N-Best
N-Best Dataset (video: walkstraight)
Ours N-Best
HumanEva Dataset (video: Jog)
Ours N-Best
HumanEva Dataset (video: Walking)
Ours N-Best
Quantitative Results
Outdoor Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 0.99 | 1.00  | 1.00 | 0.97 | 0.91 | 0.66 | 0.92
       | Ramakrishna et al. | 0.99 | 0.86  | 0.95 | 0.96 | 0.86 | 0.52 | 0.86
       | Park et al.        | 0.99 | 0.83  | 0.92 | 0.86 | 0.79 | 0.52 | 0.82
KLE    | Ours               | 0.19 | 0.22  | 0.35 | 0.37 | 0.41 | 0.61 | 0.36
       | Ramakrishna et al. | 0.39 | 0.58  | 0.48 | 0.48 | 0.88 | 1.42 | 0.71
       | Park et al.        | 0.44 | 0.58  | 0.55 | 0.69 | 1.03 | 1.65 | 0.82

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
Probability of a Correct Pose (PCP)
Keypoint Localization Error (KLE)
HumanEva I Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 1.00 | 1.00  | 1.00 | 0.94 | 0.93 | 0.67 | 0.92
       | Ramakrishna et al. | 0.99 | 1.00  | 0.99 | 0.98 | 0.99 | 0.53 | 0.91
       | Park et al.        | 0.97 | 0.97  | 0.97 | 0.90 | 0.83 | 0.48 | 0.85
KLE    | Ours               | 0.16 | 0.42  | 0.13 | 0.15 | 0.20 | 0.24 | 0.22
       | Ramakrishna et al. | 0.27 | 0.48  | 0.13 | 0.22 | 1.14 | 1.07 | 0.55
       | Park et al.        | 0.23 | 0.52  | 0.24 | 0.35 | 1.10 | 1.18 | 0.60

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
N-Best Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 1.00 | 1.00  | 0.92 | 0.94 | 0.93 | 0.65 | 0.91
       | Ramakrishna et al. | 1.00 | 0.69  | 0.91 | 0.89 | 0.85 | 0.42 | 0.80
       | Park et al.        | 1.00 | 0.61  | 0.86 | 0.84 | 0.66 | 0.41 | 0.73
KLE    | Ours               | 0.15 | 0.17  | 0.24 | 0.37 | 0.30 | 0.60 | 0.31
       | Ramakrishna et al. | 0.53 | 0.88  | 0.67 | 1.01 | 1.70 | 2.68 | 1.25
       | Park et al.        | 0.54 | 0.74  | 0.80 | 1.39 | 2.39 | 4.08 | 1.66

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
Summary
• HPEV can be naturally formulated using STGs
• STGs can be employed in multiple stages of HPEV
• Improved results
Action Localization in Videos through Context Walk
Khurram Soomro, Haroon Idrees and Mubarak Shah ICCV-2015
Action Recognition
(Examples: Diving, Lifting, Golf Swing, Swing Bench, Walking)
Action Localization
1. Action Recognition
2. Action Detection
   a. Trimmed Videos
      i. Spatio-Temporal
   b. Untrimmed Videos
      i. Temporal
      ii. Spatio-Temporal
(Examples: Diving, Lifting, Swing Bench)
Challenges: Action Localization
• Cluttered Background
• Multiple Actors/Actions
• Untrimmed Videos
(Examples: Basketball Dunk, Salsa Spin, Hand Waving/Clapping/Boxing)
Applications of Action Localization
•Video Search
•Action Retrieval
•Multimedia Event Recounting
•Video Understanding
Existing Solutions to Action Localization
• 1) Learn an action detector
• 2) Exhaustively search in testing videos
• The sliding-window approach is IMPRACTICAL and WASTEFUL for videos that are:
  • Untrimmed (longer duration)
  • High resolution
• Action Localization in Videos through Context Walk: an efficient approach for action localization
• Uses the context relations that exist in videos: action-scene and intra-action
• Produces action contours instead of bounding boxes
Motivation Context Graph Context Walk CRF Results
• Context Relations
  • Learn spatio-temporal relations between all the supervoxels and those within the action (actor bounding box)
  • Arrows represent three-dimensional displacement vectors capturing action-scene and intra-action relations
• Context Graph
  • Given the supervoxels in the nth training video, construct a directed graph Gn(Vn, En)
    • Vn: supervoxel nodes
    • En: spatio-temporal relations
  • Edges emanate from all the nodes (supervoxels) to the nodes (supervoxels) contained within the actor bounding box
(Figure: directed graph with action-scene and intra-action relations)
• Context Walk
  • Given a testing video:
    1. Construct an undirected graph G(V, E); edges exist between spatio-temporal neighbors
    2. Randomly select an initial node
    3. Find the nearest-neighbor supervoxel in the training data
    4. Project its displacement vectors onto the testing supervoxels
    5. Select the next node with maximum probability; repeat steps 3-5
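Steps 2-5 can be sketched as a simple loop. The data structures (centroid dictionaries, a `train_index` callback returning displacement vectors, nearest-centroid voting) are simplifying assumptions, not the paper's exact conditional-distribution update:

```python
import random

def context_walk(test_svs, features, train_index, steps=10, seed=0):
    """Accumulate confidence over testing supervoxels by walking the graph.
    test_svs: supervoxel id -> (x, y, t) centroid; features: id -> feature;
    train_index(feature) -> displacement vectors of the nearest training SV."""
    rng = random.Random(seed)
    conf = {sv: 0.0 for sv in test_svs}
    node = rng.choice(list(test_svs))                 # step 2: random initial node
    visited = []
    for _ in range(steps):
        visited.append(node)
        displacements = train_index(features[node])   # step 3: nearest neighbor
        cx, cy, ct = test_svs[node]
        for dx, dy, dt in displacements:              # step 4: project vectors
            target = (cx + dx, cy + dy, ct + dt)
            # Vote for the test supervoxel whose centroid is closest to the target.
            nearest = min(test_svs, key=lambda s: sum((a - b) ** 2
                                                      for a, b in zip(test_svs[s], target)))
            conf[nearest] += 1.0
        node = max(conf, key=conf.get)                # step 5: most confident node next
    return conf, visited
```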
Proposed Framework for Context Walk
(Figure: (a) segment video into supervoxels (SVs); (b) construct spatio-temporal graph G(V, E) using all SVs and their features; (c) search nearest neighbors (NNs) using SV features, then project displacement vectors; (d) update the SVs' conditional distribution using all NNs; (e) select the SV with highest confidence; (f) repeat for T steps; (g) segment action proposals through CRF + SVM classification)
•UCF Sports Dataset
Annotated Actor Bounding Box Action Localization Contour
Action Localization Contour
•UCF Sports Dataset
Annotated Actor Bounding Box
• Sub-JHMDB Dataset
Action Localization Contour Annotated Actor Bounding Box
• Sub-JHMDB Dataset
Action Localization Contour Annotated Actor Bounding Box
• THUMOS’13 Dataset
Action Localization Contour Annotated Actor Bounding Box
• THUMOS’13 Dataset
Action Localization Contour Annotated Actor Bounding Box
•Quantitative Results (UCFSports)
•Quantitative Results (sub-JHMDB)
•Quantitative Results (THUMOS’13)
Summary
• Efficient and Effective approach for Action Localization
• Learn Contextual Relations in the form of relative locations between different video regions
• Use Context Walk to select supervoxel at each step and predict the Action Location
Conclusion
• Generic Object Segmentation in Videos • Single video (CVPR-2013)
• Multiple videos (ECCV-2014)
• Human Pose Estimation in Videos (ICCV-2015)
• Human Action Detection in Videos (ICCV-2015)
YouTube Presentations
https://www.youtube.com/user/UCFCRCV
Thank You