Activity Understanding
“ProcNets: Learning to Segment Procedures in Untrimmed and Unconstrained Videos” by Zhou, Xu and Corso
Thomas Leyh
University of Freiburg
June 28th, 2017
Seminar on Current Works in Computer Vision
Thomas Leyh Activity Understanding June 28th, 2017 1 / 24
Outline
1 Introduction
2 Network Architecture
    Context-Aware Video Encoding
    Procedure Segment Proposal
    Sequential Prediction
3 Performance
4 Conclusion
What is this about?
1 Grill the tomatoes in a pan
2 Add oil to a pan
3 Grill bacon until crispy
...
8 Finish with bread
The number of segments and their positions are inferred automatically!
Why is this useful?
Video Description Generation
Activity Recognition
First step towards a self-learning robot cook?
Figure: Kim Kyung-Hoon/Reuters
The network has three stages.
Stage 1
Reduce dimensionality of each frame.
What is ResNet?
What is Bi-LSTM?
What is ResNet? → Residual Network
Very popular convolutional network model
Can be very deep
Figure: medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106
State-of-the-art performance in image classification
Easy to train
Figure: chaosmail.github.io/deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/
What is Bi-LSTM? → Bidirectional Long Short-Term Memory
Long Short-Term Memory (LSTM)
Special kind of Recurrent Neural Network (RNN)
Adds a ‘forgetting’ mechanism
Captures long-term dependencies
Easier to train than a traditional RNN
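The ‘forgetting’ mechanism can be written down compactly. The following is the standard LSTM cell update from the general literature, not notation taken from the slides or from the ProcNets paper; all symbols ($x_t$, $h_t$, $c_t$, the gates $f_t$, $i_t$, $o_t$ and the weights $W$, $U$, $b$) are introduced here for illustration only.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

When $f_t$ is close to 0 the old cell state is dropped, and when it is close to 1 the state is carried forward almost unchanged, which is what lets information survive over many time steps.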
Bidirectional LSTM (Bi-LSTM)
For capturing past and future context
One network runs forward over sequence
One network runs backwards
Combine output of both
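The bidirectional idea above can be sketched in a few lines. This is only an illustration, not the ProcNets encoder: it swaps the LSTM cell for a plain scalar tanh recurrence, the weights `w_in` and `w_rec` are made-up constants, and "combining" is assumed to mean pairing the forward and backward state at each time step.

```python
import math

def run_rnn(sequence, w_in=0.5, w_rec=0.8):
    """Run a scalar tanh recurrence (a stand-in for an LSTM) and return every hidden state."""
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def bidirectional(sequence):
    """Pair the forward state (past context) with the backward state (future context)."""
    forward = run_rnn(sequence)
    # Run over the reversed sequence, then flip back so index t lines up with forward[t].
    backward = run_rnn(sequence[::-1])[::-1]
    return list(zip(forward, backward))

# Hypothetical per-frame feature values, one scalar per time step.
features = [0.19, 0.94, 0.84, 0.10]
outputs = bidirectional(features)
```

At the first time step the forward state has seen only the first input, while the backward state has already seen the whole sequence, so each pair carries both past and future context.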
Dimensionality Reduction with Context
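$$
\text{frame} \in \mathbb{R}^{720 \times 360 \times 3}
\;\overset{(1)}{\longmapsto}\;
\begin{pmatrix} 0.19 \\ 0.94 \\ 0.84 \\ \vdots \end{pmatrix} \in \mathbb{R}^{512}
$$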
(Numbers are made-up)
Stage 2
Produce segment proposals and their likelihood.
What are Temporal Convolutional Anchors?
Region Proposal Networks
Introduce an ‘attention’ mechanism
Originally for object detection on images
Here used on the temporal axis
Generates multiple proposals, each with a score, for each feature
Segment Proposals
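$$
\underbrace{\begin{pmatrix}
\cdots & 0.19 & \cdots \\
\cdots & 0.94 & \cdots \\
\cdots & 0.84 & \cdots \\
 & \vdots &
\end{pmatrix}}_{\leftarrow\ \text{time axis}\ \rightarrow} \in \mathbb{R}^{512 \times 11}
\;\overset{(2)}{\longmapsto}\;
\begin{pmatrix}
0.95 & 0.74 & 0.92 \\
0.51 & 0.25 & 0.90 \\
0.28 & 0.10 & 0.46 \\
\vdots & \vdots & \vdots
\end{pmatrix} \in \mathbb{R}^{15 \times 3}
\qquad (k = 11)
$$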
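To make the anchor idea concrete, here is a minimal sketch of how candidate segments could be enumerated along the time axis. It only shows the enumeration; the scoring, which the network learns with temporal convolutions, is left out. The function name and the anchor lengths `[3, 5, 7]` are hypothetical and not taken from the paper.

```python
def temporal_anchors(num_steps, anchor_lengths):
    """Return (start, end) candidate segments centred on every time step (end is exclusive)."""
    anchors = []
    for center in range(num_steps):
        for length in anchor_lengths:
            start = center - length // 2
            end = start + length
            # Keep only anchors that lie fully inside the video.
            if start >= 0 and end <= num_steps:
                anchors.append((start, end))
    return anchors

# 11 time steps, three hypothetical anchor lengths.
anchors = temporal_anchors(num_steps=11, anchor_lengths=[3, 5, 7])
```

Each anchor is a fixed-length window at a fixed position, so the network only has to assign it a score (and possibly refine it) instead of searching over all start and end points.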
Stage 3
Choose a variable number of segment proposals.
Another Long Short-Term Memory (LSTM).
LSTM Input:
All Proposal Scores
Discretized Location, e.g. second 3 to 5 ↦ [0 1 0 ···]
Segment Content
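The discretized location input can be illustrated with a small one-hot encoder. This is a sketch under assumptions: the slide does not say how locations are binned, so the fixed bin width, the use of the segment midpoint, and the names `one_hot_location`, `bin_width_s` and `num_bins` are all hypothetical.

```python
def one_hot_location(start_s, end_s, bin_width_s, num_bins):
    """Encode a segment's location as a one-hot vector over fixed-width time bins."""
    midpoint = (start_s + end_s) / 2
    # Clamp to the last bin so segments near the end of the video stay in range.
    index = min(int(midpoint // bin_width_s), num_bins - 1)
    vector = [0] * num_bins
    vector[index] = 1
    return vector

# With 3-second bins, the segment from second 3 to 5 falls into the second bin.
location = one_hot_location(3, 5, bin_width_s=3, num_bins=6)
print(location)  # prints [0, 1, 0, 0, 0, 0]
```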
LSTM Output:
Likelihood that a proposal is next segment
Maximize likelihood for all segments and you get an array of segment positions
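One simple way to turn per-step likelihoods into a variable-length list of segments is greedy selection, sketched below. This is an assumption for illustration, not necessarily the decoding used in ProcNets: the likelihood tables are made-up stand-ins for the LSTM output (a real model would compute each step from the previous choices), and the `"END"` marker is a hypothetical way to stop.

```python
def greedy_decode(step_likelihoods, proposals, end_token="END"):
    """Pick the most likely proposal at each step; stop once END is the most likely choice."""
    chosen = []
    for likelihoods in step_likelihoods:
        best = max(likelihoods, key=likelihoods.get)
        if best == end_token:
            break
        chosen.append(proposals[best])
    return chosen

# Hypothetical proposals as (start, end) positions and made-up likelihoods per step.
proposals = {"a": (0, 3), "b": (4, 8), "c": (9, 11)}
step_likelihoods = [
    {"a": 0.70, "b": 0.20, "c": 0.05, "END": 0.05},
    {"a": 0.10, "b": 0.60, "c": 0.20, "END": 0.10},
    {"a": 0.05, "b": 0.10, "c": 0.25, "END": 0.60},
]
segments = greedy_decode(step_likelihoods, proposals)
print(segments)  # prints [(0, 3), (4, 8)]
```

Because decoding stops when the end marker wins, the number of output segments is not fixed in advance.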
Different convolutional and recurrent network models are used to
Encode video frames
Encode temporal dependencies
Search for the most likely arrangement of segments
YouCookII Dataset
Since existing datasets were not sufficient, a new dataset was collected.
Set of YouTube cooking videos with different recipes.
Includes 2007 videos.
New Metrics
Existing methods are hardly comparable (e.g. the number of segments needs to be given).
Existing metrics fail to measure ordering information.
Therefore new metrics:
Average Recall at 0.5
Mean Intersection-over-Union (mIoU)
They essentially measure the overlap between the predicted segments and the ground truth.
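The overlap these metrics build on is the temporal intersection-over-union of two segments. The sketch below computes it and averages it over ground-truth segments; matching each ground-truth segment to its best-overlapping prediction is an assumption made here, and the function names are hypothetical, so the exact definition in the paper may differ.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments on the time axis."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_iou(predicted, ground_truth):
    """Average, over ground-truth segments, of the best IoU with any predicted segment."""
    if not ground_truth:
        return 0.0
    best = [max((temporal_iou(gt, p) for p in predicted), default=0.0)
            for gt in ground_truth]
    return sum(best) / len(best)

# One perfect match and one missed ground-truth segment.
score = mean_iou(predicted=[(0, 10)], ground_truth=[(0, 10), (20, 30)])
print(score)  # prints 0.5
```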
Performance Comparison
The uniform model produces 8 segments
The others produce 7 segments
r is the relaxation factor
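For reference, a uniform baseline of this kind can be sketched as follows. The slide only states that the uniform model produces 8 segments; that they are equal-length and cover the whole video is an assumption made here, and `uniform_segments` is a hypothetical name.

```python
def uniform_segments(duration_s, num_segments=8):
    """Split a video of the given duration into equal-length (start, end) segments."""
    step = duration_s / num_segments
    return [(i * step, (i + 1) * step) for i in range(num_segments)]

# A hypothetical 4-minute video split into 8 segments of 30 seconds each.
baseline = uniform_segments(240)
```

Such a baseline ignores the video content entirely, which is why it is a useful lower bound when comparing against learned models.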
Conclusion
The performance looks quite nice and the approach is interesting.
But is this the right path to human-level understanding?
Well...
Go try it out yourself! [1]
github.com/LuoweiZhou/Procedure-Segmentation-Networks
[1] Unfortunately not finished at time of presentation.