Spatiotemporal Graphs for Object Segmentation, Human Pose Estimation and
Action Detection in Videos
Mubarak Shah
Center for Research in Computer Vision
University of Central Florida
Spatiotemporal Graphs (STG)
• Video-based problems
• Nodes and edges
• Spatiotemporal
• Type I
• Type II
Type I Spatiotemporal Graph (STG)
• Nodes represent entities in single frames
(Figure: nodes within each frame, from Frame 1 onward)
Nodes can be: object proposals, pixels, super-pixels, object locations, …
Edges can be: color similarities, distances, shape similarities, …
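For concreteness, such a Type I graph can be assembled in a few lines of Python. This is a toy sketch: the `frames` layout, the `similarity` callback, and the threshold are illustrative assumptions, not an implementation from the talk.

```python
def build_type1_stg(frames, similarity, threshold=0.5):
    """Build a Type I spatiotemporal graph: one node per entity per frame,
    with weighted edges between sufficiently similar entities in
    consecutive frames."""
    # Nodes are keyed by (frame index, entity index).
    nodes = {(f, i): p for f, props in enumerate(frames) for i, p in enumerate(props)}
    edges = {}
    for f in range(len(frames) - 1):
        for i, p in enumerate(frames[f]):
            for j, q in enumerate(frames[f + 1]):
                s = similarity(p, q)
                if s >= threshold:  # keep only sufficiently similar pairs
                    edges[((f, i), (f + 1, j))] = s
    return nodes, edges
```

Any of the node and edge types listed above can be plugged in by changing what `frames` holds (proposals, pixels, super-pixels, …) and how `similarity` compares two entities.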
Type II Spatiotemporal Graph (STG)
• Nodes represent entities in multiple frames
Nodes can be: object tracklets, super-voxels, …
Edges can be: appearance similarities, motion models, overlaps, …
Examples of Spatiotemporal Graphs
Original Video Object Segmentation
Video Object Segmentation (VOS)
Spatiotemporal Graph (STG): Video Object Segmentation
(Figure: spatiotemporal graph over frames i-1, i, and i+1, with s and t axes)
Video Object Co-Segmentation (VOCS)
(Figure: object proposal tracklets extracted from Video 1 and Video 2)
STG – Video Object Co-Segmentation
Human Pose Estimation in Videos (HPEV)
STG – Human Pose Estimation in Videos
(Figure: body-part nodes: Head Top, Head Bottom, Shoulder, Elbow, Hand, Hip, Knee, Ankle)
Action Detection (HAD)
(Figure: training videos for action c, e.g. Diving: Video 1 … Video n)
Spatiotemporal Context Graphs for Training Videos
Context Graphs: G1(V1, E1), …, Gn(Vn, En)
Composite Graph
Outline
• Video Object Segmentation (VOS)
• Video Object Co-Segmentation (VOCS)
• Human Pose Estimation in Videos (HPEV)
• Human Action Detection (HAD)
Video Object Segmentation (VOS)
Dong Zhang, Omar Javed, and Mubarak Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions”, CVPR, 2013
Video Object Segmentation (VOS)
• Applications
  • Object Recognition
  • Activity Recognition
  • Surveillance
Video Object Segmentation (VOS)
• Challenges
  • Camera movement
  • Variety of objects
  • Deformable objects
Framework
Input Video → Object Proposal Generation → Spatiotemporal Graph for Object Selection → GMMs and MRF based Optimization → Object Segmentation
Object Proposal Generation
• Object proposal methods [1,2]
[1] Ian Endres and Derek Hoiem, “Category Independent Object Proposals”, ECCV, 2010
[2] Alexe, B., Deselaers, T. and Ferrari, V., “What is an object?”, CVPR, 2010
(Figure: ranked object proposals per frame on SegTrack "monkeydog" and "parachute")
Sample a lot of proposals! Select the right ones!
Ranked object proposals expansion: multiple proposals per frame
Spatiotemporal Graph for Object Selection
(Figure: graph with a beginning node and an ending node; each node is an object proposal; unary edges represent object-ness)
Unary Edge
S_unary(r) = M(r) + A(r)
A(r): appearance (objectness) score
M(r): average Frobenius norm of the optical flow gradient
‖∇U‖_F = ‖[u_x u_y; v_x v_y]‖_F = √(u_x² + u_y² + v_x² + v_y²)
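As a sketch, the motion score M(r) can be computed with NumPy roughly as below. The boundary mask and the plain averaging are assumptions for illustration; the slide does not spell out the exact boundary definition.

```python
import numpy as np

def motion_score(u, v, boundary_mask):
    """Average Frobenius norm of the optical flow gradient over a
    proposal's boundary region (sketch of M(r))."""
    # Spatial gradients of the horizontal (u) and vertical (v) flow fields.
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    # Frobenius norm of the 2x2 flow-gradient matrix at each pixel:
    # sqrt(u_x^2 + u_y^2 + v_x^2 + v_y^2)
    frob = np.sqrt(ux ** 2 + uy ** 2 + vx ** 2 + vy ** 2)
    # Average over the pixels in the proposal's boundary region.
    return float(frob[boundary_mask].mean())
```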
Unary Edge: Score
(Figure: original video frame, optical flow, object region (proposal), optical flow gradient, boundary region, and optical flow gradient around the boundary)
Unary Edge: Motion Score
Binary edge
(Figure: binary edges connect object proposals in consecutive frames i, i+1, i+2)
Binary Edges
S_binary(r_m, r_n) = λ · S_overlap(r_m, r_n) · S_color(r_m, r_n)
S_color(r_m, r_n) = hist(r_m) · hist(r_n)^T
S_overlap(r_m, r_n) = |r_m ∩ warp_mn(r_n)| / |r_m ∪ warp_mn(r_n)|
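A minimal sketch of the three scores, assuming binary region masks, an already-warped second proposal, and L1-normalized color histograms (all illustrative details not specified on the slide):

```python
import numpy as np

def color_similarity(hist_m, hist_n):
    # S_color: dot product of the two proposals' color histograms.
    return float(np.dot(hist_m, hist_n))

def overlap_similarity(mask_m, warped_mask_n):
    # S_overlap: intersection-over-union of r_m and the warped r_n.
    inter = np.logical_and(mask_m, warped_mask_n).sum()
    union = np.logical_or(mask_m, warped_mask_n).sum()
    return inter / union if union else 0.0

def binary_edge_score(hist_m, hist_n, mask_m, warped_mask_n, lam=1.0):
    # S_binary = lambda * S_overlap * S_color
    return lam * overlap_similarity(mask_m, warped_mask_n) * color_similarity(hist_m, hist_n)
```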
Binary Edge Score
(Figure: spatiotemporal graph over frames i-1, i, and i+1, with s and t axes)
Goal: Find only one object proposal from each frame, such that all of them have high object-ness and high similarity across frames.
Find the highest-weighted path in the DAG: this is the longest-path problem on a DAG, which has a dynamic programming solution.
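That dynamic program can be sketched as follows; the `layers` / `unary` / `binary` dictionary layout is an illustrative assumption about how the graph is stored:

```python
def highest_weighted_path(layers, unary, binary):
    """Pick one proposal per frame maximizing the total unary (object-ness)
    plus binary (inter-frame similarity) score on the layered DAG.
    layers[f] lists proposal ids in frame f; unary[(f, i)] and
    binary[(f, i, j)] hold the node and edge scores."""
    best = {(0, i): unary[(0, i)] for i in layers[0]}  # best score ending at each node
    back = {}                                          # backpointers for path recovery
    for f in range(1, len(layers)):
        for j in layers[f]:
            score, prev = max((best[(f - 1, i)] + binary[(f - 1, i, j)], i)
                              for i in layers[f - 1])
            best[(f, j)] = score + unary[(f, j)]
            back[(f, j)] = prev
    # Trace the best proposal per frame backwards from the last layer.
    last = len(layers) - 1
    end = max(layers[last], key=lambda j: best[(last, j)])
    path = [end]
    for f in range(last, 0, -1):
        path.append(back[(f, path[-1])])
    return best[(last, end)], path[::-1]
```

Because each node is processed once per incoming edge, the run time is linear in the number of edges, which is what makes selecting one proposal per frame tractable.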
Final Spatiotemporal Graph
Results
Qualitative Results – “Girl”
Original video Ground truth
Selected object proposals Segmentation results
Region within the red boundary is the object region
Qualitative Results – “Parachute”
Original video Ground truth
Selected object proposals Segmentation results
Region within the red boundary is the object region
Qualitative Results – “Birdfall”
Original video Ground truth Segmentation results
Region within the red boundary is the object region
Original video Ground truth Segmentation results
Qualitative Results – “Cheetah”
Region within the red boundary is the object region
Original video Ground truth Segmentation results
Qualitative Results – “Monkeydog”
Region within the red boundary is the object region
SegTrack: Quantitative Results*

Video     | Ours | [14] | [13] | [20] | [6]
Use GTs?  | N    | N    | N    | Y    | Y
Birdfall  | 155  | 189  | 288  | 252  | 454
Cheetah   | 633  | 806  | 905  | 1142 | 1217
Girl      | 1488 | 1698 | 1785 | 1304 | 1755
Monkeydog | 365  | 472  | 521  | 533  | 683
Parachute | 220  | 221  | 201  | 235  | 502
Avg.      | 452  | 542  | 592  | 594  | 791

* Average per-frame pixel error rate; the smaller, the better.
Summary
• An STG selects the moving object proposals
• GMM and MRF based optimization yields pixel-level segmentation
• Performance improved by ~20%
How about multiple videos?
Video Object Co-Segmentation (VOCS)
Dong Zhang, Omar Javed, and Mubarak Shah, “Video object co-segmentation by regulated maximum weight cliques”, ECCV, 2014
Video Object Co-Segmentation (VOCS)
• Applications
  • Automatic Annotation
  • Unsupervised object detection & recognition
  • Re-Identification
(Figure: training image with annotation, and a testing image)
Video Object Co-Segmentation (VOCS)
• Challenges
  • Appearance variation
  • Multiple object classes
  • High complexity
Framework
Input Videos → Object Proposal Tracklets Generation → Regulated Maximum Weight Cliques for Tracklets → MRF based Optimization → Object Co-Segmentation
Object Proposal Tracklets Generation
(Figure: object proposals in each frame of the video; each proposal, e.g. in frame 31, is tracked forward and backward to form tracklets)
S_simi(x_m, x_n) = S_app(x_m, x_n) · S_loc(x_m, x_n) · S_shape(x_m, x_n)
(computed for all proposals, in all frames)
Regulated Maximum Weight Cliques for Tracklets
(Figure: tracklets from Video 1 and Video 2 form the graph nodes; Clique 1 contains all the chickens, Clique 2 all the turtles)
Each tracklet is a node, with node weight W(X) = Σ_{i=1}^{f} S_object(x_i).
Regulated Maximum Weight Cliques are found by our modified Bron-Kerbosch algorithm.
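For reference, the textbook Bron-Kerbosch enumeration, extended to keep the maximal clique of largest total node weight, looks roughly like this sketch; the "regulated" modification from the paper is not reproduced here:

```python
def max_weight_clique(adj, weight):
    """Enumerate maximal cliques with Bron-Kerbosch and return the one
    with the largest total node weight. adj[v] is the neighbor set of v."""
    best = (0.0, [])

    def expand(R, P, X):
        nonlocal best
        if not P and not X:                 # R is a maximal clique
            w = sum(weight[v] for v in R)
            if w > best[0]:
                best = (w, sorted(R))
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return best
```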
Results
Chicken & Turtle
Red: first object Green: second object
Original Videos CoSegmentation Results
Elephant & Giraffe
Red: first object Green: second object
Original Videos CoSegmentation Results
Lion & Zebra
Red: first object Green: second object
Original Videos Co-Segmentation Results
Quantitative Results: MOViCS Dataset

Video Set        | Ours1 | Ours2 | VCS[4] | ICS[13]
Chicken&turtle   | 0.860 | 0.860 | 0.65   | 0.08
Zebra&lion       | 0.588 | 0.636 | 0.48   | 0.23
Giraffe&elephant | 0.528 | 0.639 | 0.52   | 0.07
Tiger            | 0.336 | 0.336 | 0.30   | 0.30
Overall          | 0.578 | 0.617 | 0.49   | 0.17

Ours1: same parameters for all video sets; Ours2: different parameters for each video set.
Numbers are intersection-over-union; the larger, the better.
Summary
• Type I STG for object segmentation
• Type II STG for object co-segmentation
• Results improved more than 20%
What is the most important object?
Human!
Human Pose Estimation in Videos (HPEV)
Dong Zhang and Mubarak Shah, “Human Pose Estimation in Videos”, ICCV, 2015
Dong Zhang and Mubarak Shah, “A Framework for Human Pose Estimation in Videos” (submitted), PAMI, 2016
An Example for Human Segmentation
Coarse segmentation
Pose Estimation
Human Pose Estimation in Videos (HPEV)
• Applications
  • Action recognition
  • HCI
  • Surveillance
Human Pose Estimation in Videos (HPEV)
• Challenges
  • Huge appearance variation
  • Multiple people
  • Consistent estimation
Framework
• Input Videos
• Body Part Hypotheses Generation
• Body Part Tracking
• Pose Hypotheses Generation
• Tree-based Pose Estimation
(Figure: graph over frames f, f+1, f+2 with a node per body part. Yellow edges: commonly used intra-frame edges; blue edges: symmetric intra-frame edges; red edges: inter-frame edges)
Intra-frame and inter-frame simple cycles: too many simple cycles, making exact inference NP-hard!
Idea 1: Abstraction
Abstract Body Parts Relational Graph Real Body Parts Relational Graph
Remove intra-frame simple cycles
Idea 2: Association
Pose Relational Graph (Tracklet Graph)
Remove the inter-frame simple cycles
N-Best Hypotheses | Real Body Part Hypotheses | Abstract Body Part Hypotheses | Abstract Body Part Tracklets | Tree-based Pose Estimation
Generate many full body pose hypotheses for each video frame
Generate real body part hypotheses for the frames
Combine Symmetric Parts
Real Body Parts Relational Graph
Abstract Body Parts Relational Graph
Tracklet Hypotheses Graph
Get Best Tracklets for each part
Pose Hypotheses Graph
Select Best Poses
Qualitative Results
Outdoor Dataset (video: warmup)
Ours N-Best
Outdoor Dataset (video: bounce)
Ours N-Best
Outdoor Dataset (videos: walk2, kick)
Ours
N-Best
Ours
N-Best
N-Best Dataset (video: baseball)
Ours N-Best
N-Best Dataset (video: walkstraight)
Ours N-Best
HumanEva Dataset (video: Jog)
Ours N-Best
HumanEva Dataset (video: Walking)
Ours N-Best
Quantitative Results
Outdoor Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 0.99 | 1.00  | 1.00 | 0.97 | 0.91 | 0.66 | 0.92
       | Ramakrishna et al. | 0.99 | 0.86  | 0.95 | 0.96 | 0.86 | 0.52 | 0.86
       | Park et al.        | 0.99 | 0.83  | 0.92 | 0.86 | 0.79 | 0.52 | 0.82
KLE    | Ours               | 0.19 | 0.22  | 0.35 | 0.37 | 0.41 | 0.61 | 0.36
       | Ramakrishna et al. | 0.39 | 0.58  | 0.48 | 0.48 | 0.88 | 1.42 | 0.71
       | Park et al.        | 0.44 | 0.58  | 0.55 | 0.69 | 1.03 | 1.65 | 0.82

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
Probability of a Correct Pose (PCP)
Keypoint Localization Error (KLE)
HumanEva I Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 1.00 | 1.00  | 1.00 | 0.94 | 0.93 | 0.67 | 0.92
       | Ramakrishna et al. | 0.99 | 1.00  | 0.99 | 0.98 | 0.99 | 0.53 | 0.91
       | Park et al.        | 0.97 | 0.97  | 0.97 | 0.90 | 0.83 | 0.48 | 0.85
KLE    | Ours               | 0.16 | 0.42  | 0.13 | 0.15 | 0.20 | 0.24 | 0.22
       | Ramakrishna et al. | 0.27 | 0.48  | 0.13 | 0.22 | 1.14 | 1.07 | 0.55
       | Park et al.        | 0.23 | 0.52  | 0.24 | 0.35 | 1.10 | 1.18 | 0.60

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
N-Best Dataset

Metric | Method             | Head | Torso | U.L. | L.L. | U.A. | L.A. | Average
PCP    | Ours               | 1.00 | 1.00  | 0.92 | 0.94 | 0.93 | 0.65 | 0.91
       | Ramakrishna et al. | 1.00 | 0.69  | 0.91 | 0.89 | 0.85 | 0.42 | 0.80
       | Park et al.        | 1.00 | 0.61  | 0.86 | 0.84 | 0.66 | 0.41 | 0.73
KLE    | Ours               | 0.15 | 0.17  | 0.24 | 0.37 | 0.30 | 0.60 | 0.31
       | Ramakrishna et al. | 0.53 | 0.88  | 0.67 | 1.01 | 1.70 | 2.68 | 1.25
       | Park et al.        | 0.54 | 0.74  | 0.80 | 1.39 | 2.39 | 4.08 | 1.66

PCP is a precision metric (the larger, the better); KLE is an error metric (the smaller, the better).
Summary
• HPEV can be naturally formulated using STGs
• STGs can be employed in multiple stages of HPEV
• Improved results
Action Localization in Videos through Context Walk
Khurram Soomro, Haroon Idrees and Mubarak Shah ICCV-2015
Action Recognition
(Examples: Diving, Lifting, Golf Swing, Swing Bench, Walking)
Action Localization
1. Action Recognition
2. Action Detection
   a. Trimmed Videos
      i. Spatio-Temporal
   b. Untrimmed Videos
      i. Temporal
      ii. Spatio-Temporal
(Examples: Diving, Lifting, Swing Bench)
Challenges: Action Localization
• Cluttered Background
• Multiple Actors/Actions
• Untrimmed Videos
(Examples: Basketball Dunk, Salsa Spin, Hand Waving/Clapping/Boxing)
Applications of Action Localization
•Video Search
•Action Retrieval
•Multimedia Event Recounting
•Video Understanding
Existing Solutions to Action Localization
• 1) Learn an action detector
• 2) Exhaustively search in testing videos
• The sliding-window approach is IMPRACTICAL and WASTEFUL for videos that are:
  • Untrimmed (longer duration)
  • High resolution
• Action Localization in Videos through Context Walk: an efficient approach for action localization
• Uses the context relations that exist in videos: action-scene and intra-action
• Produces action contours instead of bounding boxes
Motivation Context Graph Context Walk CRF Results
• Context Relations
  • Learn spatio-temporal relations between all the supervoxels and those within the action (actor bounding box)
  • Arrows represent three-dimensional displacement vectors capturing action-scene and intra-action relations
• Context Graph
  • Given the supervoxels in the nth training video, construct a directed graph Gn(Vn, En)
    • Vn: supervoxel nodes
    • En: spatio-temporal relations
  • Edges emanate from all the nodes (supervoxels) to the nodes (supervoxels) contained within the actor bounding box
(Figure: directed graph with action-scene and intra-action relations)
• Context Walk
  • Given a testing video:
    1. Construct an undirected graph G(V, E); edges exist between spatio-temporal neighbors
    2. Randomly select an initial node
    3. Find the nearest-neighbor supervoxel in the training data
    4. Project its displacement vectors onto the testing supervoxels
    5. Select the next node with maximum probability; repeat steps 3-5
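Steps 2-5 can be sketched as a simple loop. The data structures (centroid dictionaries, a `train_index` callback returning displacement vectors, nearest-centroid voting) are simplifying assumptions, not the paper's exact conditional-distribution update:

```python
import random

def context_walk(test_svs, features, train_index, steps=10, seed=0):
    """Accumulate confidence over testing supervoxels by walking the graph.
    test_svs: supervoxel id -> (x, y, t) centroid; features: id -> feature;
    train_index(feature) -> displacement vectors of the nearest training SV."""
    rng = random.Random(seed)
    conf = {sv: 0.0 for sv in test_svs}
    node = rng.choice(list(test_svs))                 # step 2: random initial node
    visited = []
    for _ in range(steps):
        visited.append(node)
        displacements = train_index(features[node])   # step 3: nearest neighbor
        cx, cy, ct = test_svs[node]
        for dx, dy, dt in displacements:              # step 4: project vectors
            target = (cx + dx, cy + dy, ct + dt)
            # Vote for the test supervoxel whose centroid is closest to the target.
            nearest = min(test_svs, key=lambda s: sum((a - b) ** 2
                                                      for a, b in zip(test_svs[s], target)))
            conf[nearest] += 1.0
        node = max(conf, key=conf.get)                # step 5: most confident node next
    return conf, visited
```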
Proposed Framework for Context Walk
(Figure: (a) segment video into supervoxels (SVs); (b) construct spatio-temporal graph G(V, E) using all SVs and their features; (c) search nearest neighbors (NNs) using SV features, then project displacement vectors; (d) update the SVs' conditional distribution using all NNs; (e) select the SV with highest confidence; (f) repeat for T steps; (g) segment action proposals through CRF + SVM classification)
•UCF Sports Dataset
Annotated Actor Bounding Box Action Localization Contour
Action Localization Contour
•UCF Sports Dataset
Annotated Actor Bounding Box
• Sub-JHMDB Dataset
Action Localization Contour Annotated Actor Bounding Box
• Sub-JHMDB Dataset
Action Localization Contour Annotated Actor Bounding Box
• THUMOS’13 Dataset
Action Localization Contour Annotated Actor Bounding Box
• THUMOS’13 Dataset
Action Localization Contour Annotated Actor Bounding Box
•Quantitative Results (UCFSports)
•Quantitative Results (sub-JHMDB)
•Quantitative Results (THUMOS’13)
Summary
• Efficient and Effective approach for Action Localization
• Learn Contextual Relations in the form of relative locations between different video regions
• Use Context Walk to select supervoxel at each step and predict the Action Location
Conclusion
• Generic Object Segmentation in Videos • Single video (CVPR-2013)
• Multiple videos (ECCV-2014)
• Human Pose Estimation in Videos (ICCV-2015)
• Human Action Detection in Videos (ICCV-2015)
YouTube Presentations
https://www.youtube.com/user/UCFCRCV
Thank You