Shuffle and Learn: Unsupervised Learning using Temporal Order Verification
Slides by Xunyu Lin, ReadCV, UPC
20th February, 2017
Ishan Misra, C. Lawrence Zitnick, Martial Hebert [arxiv] (26 July 2016) [code] [demo]

Shuffle and learn: Unsupervised Learning using Temporal Order Verification (UPC Reading Group)


Index
1. Introduction
2. Unsupervised Representations Learning
3. Video Representations Learning
4. Temporal Order Verification
5. In Practice
6. Evaluations
7. Conclusions


Introduction: What is Unsupervised Learning?

● Unsupervised Learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.

● A key question in Unsupervised Learning is how to cluster the data.

Introduction: Why Unsupervised Learning?

“Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.” —— Yann LeCun

Introduction: Why Unsupervised Learning?

● It reflects how intelligent beings naturally perceive the world.

● It can save a great deal of effort in building a human-like intelligent agent, compared with a fully supervised approach.

● It could be the next breakthrough on the way to true AI!


Unsupervised Representations Learning: Popular Unsupervised Representations Learning frameworks

Auto-Encoder

Unsupervised Representations Learning: Popular Unsupervised Representations Learning frameworks

Variational Auto-Encoder (VAE)


Unsupervised Representations Learning: Popular Unsupervised Representations Learning frameworks

GAN


Video Representations Learning

● Humans perceive the world by observing the dynamic changes in their daily lives, which can be regarded as videos.

● Unsupervised video representation learning therefore plays an essential role in building a human-like intelligent agent.

Video Representations Learning: Related Works

Video Prediction with LSTMs

Video Representations Learning: Related Works

Spatiotemporally Coherent Reconstruction


Temporal Order Verification: The internal temporal order of videos

Temporal Order Verification

Temporal Order Verification: Take temporal order as the supervisory signal for learning

Shuffled sequences → Binary classification: In order / Not in order
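The verification task can be read as ordinary binary classification over frame tuples. The sketch below is only an illustration in NumPy: the shared linear `encode` function, the feature sizes, and the logistic classifier are hypothetical stand-ins, not the network used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM, FEAT_DIM = 64, 16  # hypothetical sizes, chosen only for the demo
W_enc = rng.normal(scale=0.1, size=(FRAME_DIM, FEAT_DIM))  # shared encoder weights
w_cls = rng.normal(scale=0.1, size=3 * FEAT_DIM)           # classifier over 3 features

def encode(frame):
    """Map one flattened frame to a feature vector (same weights for every frame)."""
    return np.tanh(frame @ W_enc)

def order_probability(frames):
    """Probability that a tuple of three frames is in the correct temporal order."""
    feats = np.concatenate([encode(f) for f in frames])
    return 1.0 / (1.0 + np.exp(-(feats @ w_cls)))

def binary_cross_entropy(p, label):
    """Loss for one tuple; label is 1 (in order) or 0 (not in order)."""
    eps = 1e-12
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

frames = [rng.normal(size=FRAME_DIM) for _ in range(3)]
p = order_probability(frames)
loss_if_ordered = binary_cross_entropy(p, 1)
loss_if_shuffled = binary_cross_entropy(p, 0)
```

Because every frame passes through the same `encode` weights, whatever the classifier learns about order has to be expressed in the shared frame representation, which is the part that is reused afterwards.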


In Practice: How to sample the tuple of frames?

1. The number of frames for each tuple
- 2 frames: may be ambiguous (picking up or placing down a cup?)
- 3 frames: practically useful, but still not enough for a cyclical case
- ...

In Practice: How to sample the tuple of frames?

Original video: a b c d e

Positive: (b, c, d)

Negative: (b, a, d), (b, e, d)
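A minimal sketch of this sampling scheme, reading the slide as one positive tuple (b, c, d) and two negative tuples (b, a, d) and (b, e, d) drawn from five frames a < b < c < d < e. The function name `tuples_from_window` and the example indices are hypothetical.

```python
def tuples_from_window(a, b, c, d, e):
    """Build one positive and two negative triples from five frame indices."""
    if not a < b < c < d < e:
        raise ValueError("expected a < b < c < d < e")
    positive = (b, c, d)                 # middle frame lies between b and d
    negatives = [(b, a, d), (b, e, d)]   # middle frame lies outside the span b..d
    return [(positive, 1)] + [(neg, 0) for neg in negatives]

examples = tuples_from_window(10, 20, 30, 40, 50)
```

All three tuples share the outer frames b and d, so the label depends only on whether the middle frame falls between them.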

In Practice: How to sample the tuple of frames?

2. Ambiguity in frames with small motion

- The order of frames with small motion is indistinguishable.
- Only sample from frames with high motion (smart sampling).
- Use coarse frame-level optical flow as a proxy to measure the motion between frames.
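Motion-aware sampling can be sketched as drawing frames with probability proportional to their average optical-flow magnitude. The sketch assumes the flow fields are already computed and stored as an array of shape (T, H, W, 2); the helper names and the toy data are hypothetical.

```python
import numpy as np

def motion_weights(flows):
    """Mean optical-flow magnitude per frame, normalized to sum to 1.

    flows: array of shape (T, H, W, 2) holding (dx, dy) per pixel.
    """
    magnitude = np.linalg.norm(flows, axis=-1)   # (T, H, W)
    per_frame = magnitude.mean(axis=(1, 2))      # (T,)
    total = per_frame.sum()
    if total == 0:
        # No motion anywhere: fall back to uniform sampling.
        return np.full(len(per_frame), 1.0 / len(per_frame))
    return per_frame / total

def sample_high_motion_frames(flows, n, rng):
    """Draw n distinct frame indices, favouring frames with more motion."""
    return rng.choice(len(flows), size=n, replace=False, p=motion_weights(flows))

# Toy data: only frames 3 and 7 contain motion.
flows = np.zeros((10, 4, 4, 2))
flows[3] = 1.0
flows[7] = 2.0
w = motion_weights(flows)
picked = sample_high_motion_frames(flows, n=2, rng=np.random.default_rng(0))
```

Static frames receive zero weight here, so they are never selected; a softer variant could add a small constant to every weight.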

In Practice: How to sample the tuple of frames?

3. The distance of frames in positive tuples (difficulty of the task)

- Too close: results in ambiguous small motion or an overly easy task.
- Too far: the frames in the tuple are only weakly related, which makes the learning task too difficult.

In Practice: How to sample the tuple of frames?

3. The distance of frames in positive tuples (difficulty of the task)

- Two metrics control the difficulty of positive and negative samples.

Positive: (b, c, d); Negative: (b, a, d), (b, e, d)
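The slide does not name the two metrics. One plausible reading, assuming five frame indices a < b < c < d < e, is a pair of frame gaps: the span |b - d| of the positive tuple, and the smaller of the gaps |a - b| and |d - e| that separate the negative frames from it. The names `tau_max` and `tau_min` below are recalled from the paper and should be checked against it.

```python
def difficulty_metrics(a, b, d, e):
    """Return (tau_max, tau_min) for frame indices a < b < d < e.

    tau_max: span of the positive tuple; a larger span gives harder positives.
    tau_min: smallest gap to an outside frame; a smaller gap gives harder negatives.
    """
    tau_max = abs(b - d)
    tau_min = min(abs(a - b), abs(d - e))
    return tau_max, tau_min

tau_max, tau_min = difficulty_metrics(10, 20, 40, 50)
```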

In Practice: How to sample the tuple of frames?

4. Ratio of negative and positive samples


Evaluation: Action Recognition on UCF-101 & HMDB-51

- Comparison to random initialization & transfer learning

- Pre-trained on ImageNet and finetuned on UCF-101 gives an accuracy of 67.1%.
- Pre-trained on ImageNet and finetuned on HMDB-51 gives an accuracy of 28.5%.

+11.6%

+4.8%

* UCF-101 is twice as large as HMDB-51

Evaluation: Action Recognition on UCF-101 & HMDB-51

- Comparison to other unsupervised frameworks

- Two Close: measures whether two frames are close or not.
- Two Order: temporal verification with only 2 frames.
- DrLim: measures temporal coherency with an L2 loss.
- TempCoh: measures temporal coherency with an L1 loss.
- Obj. Patch: imitates the instinctive human ability of eye fixation.
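To illustrate the difference between an L2 and an L1 temporal coherency loss, here is a simplified contrastive form: features of temporally close frames are pulled together, and features of distant frames are pushed at least a margin apart. This is a generic sketch, not the exact formulation of DrLim or TempCoh; the function name and the `margin` parameter are assumptions.

```python
import numpy as np

def coherency_loss(x, y, is_close, margin=1.0, norm="l2"):
    """Contrastive temporal coherency loss between two feature vectors.

    is_close: True if the two frames are temporally adjacent.
    norm: "l2" (Euclidean distance) or "l1" (sum of absolute differences).
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if norm == "l1":
        dist = np.abs(diff).sum()
    else:
        dist = np.sqrt((diff ** 2).sum())
    # Close pairs are penalized by their distance;
    # far pairs are penalized only if they are closer than the margin.
    return dist if is_close else max(0.0, margin - dist)

l2_close = coherency_loss([0, 0], [3, 4], is_close=True, norm="l2")
l1_close = coherency_loss([0, 0], [3, 4], is_close=True, norm="l1")
l2_far = coherency_loss([0, 0], [3, 4], is_close=False, margin=10.0)
```

Unlike order verification, a loss of this kind only asks whether two frames are near each other in time, not which one comes first.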

Evaluation: Nearest Neighbor retrieval
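Nearest neighbor retrieval is typically done by comparing feature vectors of a query frame against a gallery of frames. The sketch below uses cosine similarity over generic feature vectors; the choice of similarity, the function name, and the toy features are assumptions and may differ from the setup behind the slide.

```python
import numpy as np

def nearest_neighbors(query, gallery, k=3):
    """Indices of the k gallery features most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                     # cosine similarity to each gallery row
    return np.argsort(-sims)[:k]     # highest similarity first

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
top = nearest_neighbors(query, gallery, k=3)
```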

Evaluation: Visualizing pool5 Unit Responses

Evaluation: Pose Estimation on FLIC & MPII


Conclusions

● Temporal verification exploits the potential of a network to capture the sequential logic in videos.

● Further work should explore capturing longer temporal logic. For now it only utilizes single frames within a window of fewer than about 60 frames. Architectures like RNNs could be used to extend the temporal range.

● The only drawbacks lie in its weak constraint and tedious sampling techniques.

● A more general constraint with a simplified procedure? → My research line