Activity Understanding

“ProcNets: Learning to Segment Procedures in Untrimmed and Unconstrained Videos” by Zhou, Xu and Corso

Thomas Leyh

University of Freiburg

June 28th, 2017

Seminar on Current Works in Computer Vision

Thomas Leyh Activity Understanding June 28th, 2017 1 / 24

Outline

1 Introduction

2 Network Architecture

    Context-Aware Video Encoding

    Procedure Segment Proposal

    Sequential Prediction

3 Performance

4 Conclusion


What is this about?

1 Grill the tomatoes in a pan

2 Add oil to a pan

3 Grill bacon until crispy

...

8 Finish with bread

Number of segments and positions are inferred automatically!


Why is this useful?

Video Description Generation

Activity Recognition

First step towards a self-learning robot cook?

Figure: Kim Kyung-Hoon/Reuters


Network has three stages.


Stage 1

Reduce dimensionality of each frame.


What is ResNet?

What is Bi-LSTM?


What is ResNet? → Residual Network

Very popular convolutional network model

Can be very deep

Figure: medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106




State-of-the-art performance in image classification

Easy to train

Figure: chaosmail.github.io/deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/
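The reason a residual network "can be very deep" and is "easy to train" is its skip connection: each block outputs its input plus a learned transformation of it. The following is a minimal sketch of that idea only, not the ResNet used in ProcNets; `tiny_layer` is a hypothetical stand-in for a stack of convolutions.

```python
def residual_block(x, transform):
    """Return x + F(x): the skip connection adds the input back onto F(x)."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def tiny_layer(x):
    # Hypothetical stand-in for convolution + ReLU: scale, then clip at zero.
    return [max(0.0, 0.5 * xi) for xi in x]


features = [1.0, -2.0, 3.0]
out = residual_block(features, tiny_layer)
print(out)  # [1.5, -2.0, 4.5]
```

If the transformation outputs zeros, the block passes its input through unchanged, which is why stacking many such blocks does not make the signal vanish.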



What is Bi-LSTM? → Bidirectional Long Short-Term Memory

Long Short-Term Memory (LSTM)

Special kind of Recurrent Neural Network (RNN)

Adds a ‘forgetting’ mechanism

Captures long-term dependencies

Easier to train than a traditional RNN
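To make the ‘forgetting’ mechanism concrete, here is a sketch of a single LSTM step on scalar values. It follows the standard LSTM equations; the weight dictionary `w` and its gate names are illustrative assumptions, not parameters from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. `w` maps a gate name to (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # share of the old cell state to keep
    i = gate("input", sigmoid)         # share of the new candidate to write
    o = gate("output", sigmoid)        # share of the cell state to expose
    g = gate("candidate", math.tanh)   # new candidate content
    c = f * c_prev + i * g             # forget part of the past, add new content
    h = o * math.tanh(c)
    return h, c


# Illustrative weights: a large forget bias keeps the memory,
# a large negative forget bias erases it.
keep = {"forget": (0, 0, 100.0), "input": (0, 0, -100.0),
        "output": (0, 0, 0.0), "candidate": (0, 0, 0.0)}
erase = dict(keep, forget=(0, 0, -100.0))

_, c_keep = lstm_step(1.0, 0.0, 2.0, keep)
_, c_erase = lstm_step(1.0, 0.0, 2.0, erase)
```

With the forget gate saturated near 1, the cell state survives the step almost unchanged; saturated near 0, it is wiped. This is what lets the network carry information over long spans.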



Bidirectional LSTM (Bi-LSTM)

For capturing past and future context

One network runs forward over sequence

One network runs backwards

Combine output of both
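The three steps above can be sketched without any deep learning library. In this sketch, `running_sum` is a deliberately simple stand-in for an LSTM cell, and pairing the two outputs is one assumed way of combining them (concatenation); the point is only how past and future context end up in every position.

```python
def run_recurrent(sequence, step, state=0):
    """Feed the sequence through a recurrent step and collect each state."""
    outputs = []
    for x in sequence:
        state = step(x, state)
        outputs.append(state)
    return outputs


def bidirectional(sequence, step):
    forward = run_recurrent(sequence, step)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = run_recurrent(sequence[::-1], step)[::-1]
    return list(zip(forward, backward))


def running_sum(x, state):
    # Stand-in for an LSTM cell: the state simply accumulates the input.
    return state + x


context = bidirectional([1, 2, 3], running_sum)
print(context)  # [(1, 6), (3, 5), (6, 3)]
```

Each output pair holds a summary of everything up to that position and a summary of everything from that position onward.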




Dimensionality Reduction with Context

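$$
\text{frame} \in \mathbb{R}^{720 \times 360 \times 3}
\;\overset{(1)}{\longmapsto}\;
\begin{pmatrix} 0.19 \\ 0.94 \\ 0.84 \\ \vdots \end{pmatrix} \in \mathbb{R}^{512}
$$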

(Numbers are made-up)


Stage 2

Produce segment proposals and their likelihood.


What are Temporal Convolutional Anchors?

Region Proposal Networks

Introduce an ‘attention’ mechanism

Originally for object detection on images

Here used on the temporal axis

Generates multiple proposals, each with a score, for every feature
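As a rough illustration of anchors on the temporal axis, the sketch below places candidate segments of several lengths around every time step and keeps those that fit inside the video. The lengths `(3, 5)` and the function name are assumptions for the example; the real network also predicts a score for each anchor, which is not modelled here.

```python
def temporal_anchors(num_steps, lengths):
    """Return (start, end) candidates, end exclusive, centred on each step."""
    anchors = []
    for center in range(num_steps):
        for length in lengths:
            start = center - length // 2
            end = start + length
            if start >= 0 and end <= num_steps:  # drop anchors leaving the video
                anchors.append((start, end))
    return anchors


anchors = temporal_anchors(num_steps=6, lengths=(3, 5))
print(anchors)  # [(0, 3), (1, 4), (0, 5), (2, 5), (1, 6), (3, 6)]
```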


What are Temporal Convolutional Anchors?

Segment Proposals

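$$
\underbrace{\begin{pmatrix}
\cdots & 0.19 & \cdots \\
\cdots & 0.94 & \cdots \\
\cdots & 0.84 & \cdots \\
 & \vdots &
\end{pmatrix}}_{\leftarrow\ \text{time axis}\ \rightarrow}
\in \mathbb{R}^{512 \times 11}
\;\overset{(2)}{\longmapsto}\;
\begin{pmatrix}
0.95 & 0.74 & 0.92 \\
0.51 & 0.25 & 0.90 \\
0.28 & 0.10 & 0.46 \\
\vdots & \vdots & \vdots
\end{pmatrix}
\in \mathbb{R}^{15 \times 3}
\qquad (k = 11)
$$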


Stage 3

Choose a variable number of segment proposals.


Another Long Short-Term Memory network.

LSTM Input:

All Proposal Scores

Discretized Location, e.g. second 3 to 5 ↦ [0 1 0 ⋯]

Segment Content
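The discretized location input can be pictured as a one-hot vector. The sketch below reproduces the slide's example under an assumed binning: bins are 3 seconds wide and a segment is assigned to the bin containing its start. Both choices, and the function name, are illustrative and not taken from the paper.

```python
def one_hot_location(start_s, bin_width_s, num_bins):
    """Encode a segment start time as a one-hot vector over time bins."""
    index = min(int(start_s // bin_width_s), num_bins - 1)
    return [1 if i == index else 0 for i in range(num_bins)]


vec = one_hot_location(start_s=3, bin_width_s=3, num_bins=5)
print(vec)  # [0, 1, 0, 0, 0]
```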



LSTM Output:

Likelihood that a proposal is next segment

Maximize the likelihood over all segments and you get an array of segment positions
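One simple way to read "maximize the likelihood" is greedy decoding: repeatedly pick the most likely next proposal until an end marker wins. The sketch below assumes this greedy strategy, which may differ from the search used in the paper. `toy_scorer` is a hypothetical stand-in for the LSTM that only allows proposals starting after the previous segment.

```python
END = None  # marker meaning "no further segment"


def greedy_decode(proposals, score_next, max_segments=10):
    """Pick the most likely next proposal until END is the best choice."""
    chosen = []
    for _ in range(max_segments):
        scores = score_next(chosen, proposals)  # candidate -> likelihood
        best = max(scores, key=scores.get)
        if best is END:
            break
        chosen.append(best)
    return chosen


def toy_scorer(chosen, proposals):
    # Stand-in for the LSTM: a fixed END likelihood plus the proposal
    # scores of all segments that start after the last chosen one.
    last_end = chosen[-1][1] if chosen else 0
    scores = {END: 0.3}
    for (start, end), score in proposals.items():
        if start >= last_end:
            scores[(start, end)] = score
    return scores


proposals = {(0, 4): 0.9, (2, 6): 0.8, (5, 9): 0.7, (9, 12): 0.2}
segments = greedy_decode(proposals, toy_scorer)
print(segments)  # [(0, 4), (5, 9)]
```

The number of segments is not fixed in advance: decoding stops as soon as ending is more likely than any remaining proposal.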



Using different convolutional and recurrent network models to

Encode video frames

Encode temporal dependencies

And search for the most likely arrangement.


YouCookII Dataset

Since existing datasets were not sufficient, a new dataset was collected.

Set of YouTube cooking videos with different recipes.

Includes 2007 videos.


New Metrics

Existing methods are hardly comparable (e.g. the number of segments needs to be given).

Existing metrics fail to measure ordering information.

Therefore new metrics:

Average Recall at 0.5

Mean Intersection-over-Union (mIoU)

They essentially measure the overlap between the predicted segments and the ground truth.
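A sketch of how such overlap metrics can be computed for temporal segments follows. The matching rule here (each ground-truth segment is compared with its best-overlapping prediction) is an assumption for illustration; the exact definitions in the paper may differ.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(predicted, ground_truth, threshold=0.5):
    """Return (mean IoU, recall at `threshold`) over ground-truth segments."""
    best = [max((temporal_iou(gt, p) for p in predicted), default=0.0)
            for gt in ground_truth]
    miou = sum(best) / len(best)
    recall = sum(iou >= threshold for iou in best) / len(best)
    return miou, recall


ground_truth = [(0, 10), (10, 20)]
predicted = [(0, 8), (15, 30)]
miou, recall = evaluate(predicted, ground_truth)
print(miou, recall)  # 0.525 0.5
```

In this example the first ground-truth segment is matched with IoU 0.8 and the second with IoU 0.25, so only one of the two counts as recalled at the 0.5 threshold.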


Performance Comparison



Uniform model produces 8 segments

The others produce 7 segments

r is relaxation factor
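For reference, a uniform baseline of this kind can be sketched in a few lines: it ignores the video content and cuts the video into equally long segments. The function name and the 240-second duration are assumptions for the example.

```python
def uniform_segments(duration_s, num_segments=8):
    """Split a video of the given duration into equally long segments."""
    step = duration_s / num_segments
    return [(i * step, (i + 1) * step) for i in range(num_segments)]


baseline = uniform_segments(240.0)
print(baseline[0], baseline[-1])  # (0.0, 30.0) (210.0, 240.0)
```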










Conclusion

The performance looks quite nice and the approach is interesting.

But is this the right path to human level understanding?

Well...

Go try it out yourself!¹

https://github.com/LuoweiZhou/Procedure-Segmentation-Networks

¹ Unfortunately not finished at time of presentation.
