Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)


@DocXavi

Module 4 - Lecture 6

Video Analysis with CNNs (31 January 2017)

Xavier Giró-i-Nieto

[http://pagines.uab.cat/mcv/]

Acknowledgments


Víctor Campos, Alberto Montes

Linked slides

Outline

1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models


Recognition

Demo: Clarifai

MIT Technology Review: "A Start-Up's Neural Network Can Understand Video" (3/2/2015)

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. "Large-scale video classification with convolutional neural networks." CVPR 2014.


Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.


Previous lectures


Slides extracted from the ReadCV seminar by Víctor Campos.

Recognition: DeepVideo

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. "Large-scale video classification with convolutional neural networks." CVPR 2014.

Recognition: DeepVideo: Demo


Recognition: DeepVideo: Architectures


Unsupervised learning [Le et al. '11] vs. supervised learning [Karpathy et al. '14]

Recognition: DeepVideo: Features


Recognition: DeepVideo: Multiscale


Recognition: DeepVideo: Results


Recognition: C3D

Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.

Recognition: C3D: Demo

Recognition: C3D: Spatial dimension

The spatial dimensions (XY) of the kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015):

K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015.


Recognition: C3D: Temporal dimension

3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.


A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets.
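A minimal sketch of such a homogeneous design, assuming PyTorch; the channel widths and pooling schedule loosely follow C3D, but the exact configuration here is illustrative:

import torch
import torch.nn as nn

# Every conv layer uses the same small 3x3x3 kernel (time x height x width)
# with padding 1, so resolution is only reduced by the pooling layers.
def conv3d_block(in_ch, out_ch, pool):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool, stride=pool),
    )

c3d_features = nn.Sequential(
    conv3d_block(3, 64, pool=(1, 2, 2)),    # keep full temporal depth early
    conv3d_block(64, 128, pool=(2, 2, 2)),
    conv3d_block(128, 256, pool=(2, 2, 2)),
    conv3d_block(256, 512, pool=(2, 2, 2)),
)

clip = torch.randn(1, 3, 16, 112, 112)      # (batch, RGB, frames, H, W)
print(c3d_features(clip).shape)             # torch.Size([1, 512, 2, 7, 7])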

No gain when varying the temporal depth across layers.

Recognition: C3D: Architecture

Feature vector


Recognition: C3D: Feature vector

The video sequence is split into 16-frame clips with an 8-frame overlap between consecutive clips.

Each 16-frame clip is passed through C3D to obtain a 4096-dim clip feature; the features of all clips are averaged and L2-normalized to produce a single 4096-dim video descriptor.
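A minimal sketch of this pipeline, assuming PyTorch and a hypothetical c3d_fc6 function that maps one clip tensor to its 4096-dim C3D feature:

import torch

def video_descriptor(frames, c3d_fc6, clip_len=16, stride=8):
    # frames: (num_frames, 3, 112, 112) tensor of decoded video frames.
    feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, stride):
        clip = frames[start:start + clip_len]           # 16 consecutive frames
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)    # (1, 3, 16, 112, 112)
        feats.append(c3d_fc6(clip).squeeze(0))          # 4096-dim clip feature
    desc = torch.stack(feats).mean(dim=0)               # average over all clips
    return desc / desc.norm()                           # L2 normalization

The 8-frame stride is what produces the 8-frame overlap between consecutive 16-frame clips.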


Recognition: C3D: Visualization

Based on deconvnets by Zeiler and Fergus [ECCV 2014]. See [ReadCV Slides] for more details.


Recognition: C3D: Compactness


Recognition: C3D: Performance

Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks, and are comparable with the state of the art on the other 2.

Recognition: C3D: Software

Implementation by Michael Gygli (GitHub).

Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.

Recognition: Two-stream

Two CNNs in parallel:

● One for RGB images
● One for optical flow (hand-crafted features)

Fusion after the softmax layer
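A minimal sketch of this late fusion, assuming PyTorch and two hypothetical pretrained networks (spatial_cnn on an RGB frame, temporal_cnn on a stack of optical-flow fields) that output class logits:

import torch

def two_stream_predict(rgb_frame, flow_stack, spatial_cnn, temporal_cnn):
    # Each stream classifies the action independently.
    p_spatial = torch.softmax(spatial_cnn(rgb_frame), dim=1)
    p_temporal = torch.softmax(temporal_cnn(flow_stack), dim=1)
    # Late fusion: combine the two streams after their softmax layers.
    return (p_spatial + p_temporal) / 2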

Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]

Recognition: Two-stream

Two CNNs in parallel:

● One for RGB images
● One for optical flow (hand-crafted features)

Fusion at a convolutional layer

Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs." CVPR 2016. (Slidecast and slides by Alberto Montes)

Recognition: Localization


Outline

1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models


Optical Flow

Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. "DeepFlow: Large displacement optical flow with deep matching." ICCV 2013.

Optical Flow: DeepFlow


Andrei Bursuc, Postdoc at INRIA (@abursuc)


● Deep (hierarchy) ✔

● Convolution ✔

● Learning ❌

Optical Flow: Small vs. large displacements


Optical Flow

Classic approach: rigid matching of HOG or SIFT descriptors.

Deep Matching: allow each subpatch to move:

● independently
● in a limited range depending on its size


Optical Flow: Deep Matching

Optical Flow: 2D correlation

Figure: 2D correlation between an image and a sub-image; the correlation peak gives the offset of the sub-image with respect to the image [0,0]. (Source: MATLAB R2015b documentation for normxcorr2 by MathWorks)

Optical Flow: Deep Matching

Instead of pre-trained filters, a convolution is defined between each patch of the reference image and the target image; as a result, a correlation map is generated for each reference patch.
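A minimal sketch of this per-patch correlation with NumPy/SciPy; the 4x4 patch size matches the bottom level of the Deep Matching pyramid, everything else is illustrative:

import numpy as np
from scipy.signal import correlate2d

def patch_correlation_maps(reference, target, patch=4):
    # One correlation map per (non-overlapping) 4x4 reference patch:
    # high values mark candidate positions of that patch in the target.
    maps = {}
    h, w = reference.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = reference[y:y + patch, x:x + patch]
            maps[(y, x)] = correlate2d(target, p - p.mean(), mode='same')
    return maps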

Response maps range from the most discriminative to the least discriminative.


Optical Flow: Deep Matching

Key idea: build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.

Figure: the pyramid aggregates correlation maps over 4x4, 8x8, 16x16, and 32x32 patches via bottom-up extraction (BU); matches are then retrieved by top-down matching (TD).

Optical Flow: Deep Matching (BU)


Optical Flow: Deep Matching (TD)


Each local maximum in the top layer corresponds to a shift of one of the biggest (32x32) patches. Focusing on a local maximum, we can retrieve the corresponding responses one scale below and refine the shifts of the sub-patches that generated it.

Figure: matching results compared with ground truth, for dense HOG matching [Brox & Malik 2011] and Deep Matching.

Optical Flow: FlowNet

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T. "FlowNet: Learning Optical Flow with Convolutional Networks." ICCV 2015.

End-to-end supervised learning of optical flow.

Optical Flow: FlowNet (contracting)


Option A: Stack both input images together and feed them through a generic network.


Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.


Correlation layer: a convolution between data patches from the two streams to be combined.
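A minimal sketch of such a correlation layer, assuming PyTorch; the loops are written for clarity rather than speed, and the displacement range d is illustrative:

import torch
import torch.nn.functional as F

def correlation_layer(f1, f2, d=4):
    # f1, f2: (B, C, H, W) feature maps from the two streams.
    # For each displacement (dy, dx) in [-d, d]^2, score how well f1
    # matches the shifted f2 via a dot product over channels.
    B, C, H, W = f1.shape
    f2_pad = F.pad(f2, (d, d, d, d))
    out = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = f2_pad[:, :, dy:dy + H, dx:dx + W]
            out.append((f1 * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(out, dim=1)   # (B, (2d+1)^2, H, W) correlation volume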

Optical Flow: FlowNet (expanding)


Upconvolutional layers: unpooling feature maps + convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contracting part.
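A minimal sketch of one such refinement step, assuming PyTorch; channel sizes are illustrative, and a transposed convolution stands in for the unpooling + convolution pair:

import torch
import torch.nn as nn

class RefinementStep(nn.Module):
    # Upconvolve the coarse feature map to double its resolution, then
    # concatenate it with the corresponding map from the contracting part.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch,
                                         kernel_size=4, stride=2, padding=1)

    def forward(self, coarse, skip):
        up = torch.relu(self.upconv(coarse))
        return torch.cat([up, skip], dim=1)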

Optical Flow: FlowNet (data augmentation)

Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).

ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.


Outline

1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models


Object tracking: MDNet

Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop 2015.


Object tracking: MDNet: Architecture


Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.

Object tracking: MDNet: Online update


MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.
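A minimal sketch of hard negative mining, assuming PyTorch and a hypothetical scorer net that returns the positive-class score of each candidate box:

import torch

def hard_negatives(net, negatives, k=32):
    # Score all negative candidates and keep the k with the highest
    # positive score: the background patches most easily confused
    # with the target, which are the most informative to train on.
    with torch.no_grad():
        scores = net(negatives)              # (N,) positive scores
    hard = torch.topk(scores, k).indices
    return negatives[hard]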

Object tracking: FCNT

Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015. [code]


Focus on the conv4-3 and conv5-3 layers of a VGG-16 network pre-trained for ImageNet image classification.


Object tracking: FCNT: Specialization


Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.

Object tracking: FCNT: Localization


Although trained for image classification, feature maps in conv5-3 enable object localization… but they are not discriminative enough to distinguish different objects of the same category.

Object tracking: Localization

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.

Other works have shown how feature maps in convolutional layers allow object localization.

Object tracking: FCNT: Localization


On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…


Object tracking: FCNT: Architecture


SNet = Specific Network (updated online)
GNet = General Network (fixed)
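A minimal sketch of how the two branches can be combined, assuming PyTorch; snet and gnet are hypothetical heat-map regressors on the conv4-3 and conv5-3 features, and the paper's distracter-detection scheme is reduced here to a boolean flag:

import torch

def track_step(snet, gnet, feat_conv4, feat_conv5, use_general):
    heat_s = snet(feat_conv4)      # specific branch, updated online
    heat_g = gnet(feat_conv5)      # general branch, kept fixed
    heat = heat_g if use_general else heat_s
    # Target location = position of the strongest response in the map.
    idx = torch.argmax(heat.view(-1)).item()
    return divmod(idx, heat.shape[-1])     # (row, col) of the peak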

Object tracking: FCNT: Results


Outline

1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models


Audio and Video


Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Object and scene recognition in videos by analyzing (only) the audio track.


Videos used for training are unlabeled; SoundNet relies on CNNs trained on labeled images.


The hidden-layer activations of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
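A minimal sketch of this transfer, assuming scikit-learn and a hypothetical soundnet_features function returning a fixed-length hidden-layer activation vector per audio clip:

import numpy as np
from sklearn.svm import LinearSVC

def train_audio_classifier(clips, labels, soundnet_features):
    # Represent each clip by frozen SoundNet hidden-layer activations,
    # then fit a standard linear SVM on top of them.
    X = np.stack([soundnet_features(clip) for clip in clips])
    clf = LinearSVC()
    return clf.fit(X, labels)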


Visualization of the 1D filters over raw audio in conv1.


Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):


Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learn to synthesize sounds from videos of people hitting objects with a drumstick.


Audio and Video: Visual Sounds


Not an end-to-end system.



Learn more

Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)


What are Generative Models?

We want our model, with parameters θ = {weights, biases} and outputs distributed like P_model, to estimate the distribution of our training data P_data.

Example: y = f(x), where y is a scalar; we make P_model similar to P_data by training the parameters θ to maximize their similarity.

Key idea: our model cares about which distribution generated the input data points, and we want to mimic it with our probabilistic model. The learned model should be able to make up new samples from the distribution, not just copy and paste existing samples!
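Written out as maximum likelihood (a standard formalization, not spelled out on the slide), with i.i.d. training samples x_1, ..., x_N drawn from P_data:

\theta^{*} \;=\; \arg\max_{\theta}\; \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_{\text{model}}(x;\theta)\big] \;\approx\; \arg\max_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} \log P_{\text{model}}(x_i;\theta)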


Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)


Video Frame Prediction

Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016


Adversarial Training analogy

Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.

D: "It's not even green."

D: "There is no watermark."

D: "Watermark should be rounded."


After enough iterations, and if the counterfeiter is good enough (in terms of G network it means “has enough parameters”), the police should be confused.

Adversarial Training (batch update)

● Pick a sample x from the training set; show x to D and update its weights to output 1 (real)
● G maps a sample z to x̂; show x̂ to D and update D's weights to output 0 (fake)
● Freeze the D weights; update the G weights to make D output 1 (just the G weights!); unfreeze the D weights and repeat (see the sketch below)
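A minimal sketch of this alternating update, assuming PyTorch; G and D are any compatible generator/discriminator modules (D ending in a sigmoid that outputs one score per sample), and the latent size is illustrative:

import torch
import torch.nn.functional as F

def gan_batch_update(G, D, opt_G, opt_D, real, z_dim=100):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Update D: push real samples toward 1 and generated samples toward 0.
    fake = G(torch.randn(b, z_dim)).detach()   # detach: no gradient into G here
    loss_d = (F.binary_cross_entropy(D(real), ones) +
              F.binary_cross_entropy(D(fake), zeros))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()

    # Update G with D "frozen" (only opt_G steps, so only G's weights change):
    # push D's output on generated samples toward 1.
    fake = G(torch.randn(b, z_dim))
    loss_g = F.binary_cross_entropy(D(fake), ones)
    opt_G.zero_grad()
    loss_g.backward()
    opt_G.step()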


Generative Adversarial Networks (GANs)

Slide credit: Víctor Garcia

Figure: the generator G(·) maps a random seed z to a synthetic sample; the discriminator D(·) must classify samples, drawn either from the real world or from G, as real or synthetic.
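Formally, this is the minimax game of Goodfellow et al. (NIPS 2014); the formula itself is not on the slide:

\min_{G} \max_{D} \; \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] \;+\; \mathbb{E}_{z \sim P_{z}}[\log(1 - D(G(z)))]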


Conditional Adversarial Networks

Figure: the same setup, but both the generator G(·) and the discriminator D(·) also receive a condition as input.

Generative Adversarial Networks (GANs)

Generating images/frames

(Radford et al. 2015)

The Deep Convolutional GAN (DCGAN) effectively generated 64x64 RGB images in a single shot, for example bedrooms from the LSUN dataset.

Generating images/frames conditioned on captions

(Reed et al. 2016b) (Zhang et al. 2016)

Unsupervised feature extraction/learning representations

Similarly to word2vec, GANs learn a distributed representation that disentangles concepts such that we can perform operations on the data manifold:

v(Man with glasses) - v(man) + v(woman) = v(woman with glasses)

(Radford et al. 2015)
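A minimal sketch of this latent arithmetic with NumPy; the concept vectors and the generator are hypothetical stand-ins (in practice each vector is the mean latent code of several samples showing that concept):

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for averaged latent codes of three visual concepts.
v_man_glasses, v_man, v_woman = rng.normal(size=(3, 100))

# Arithmetic on the latent manifold; decoding z with the trained
# generator (image = G(z), not defined here) should show a woman
# with glasses.
z = v_man_glasses - v_man + v_woman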

Image super-resolution

Bicubic interpolation does not use data statistics. SRResNet is trained with MSE. SRGAN is able to understand that there are multiple correct answers, rather than averaging them.

(Ledig et al. 2016)

Image super-resolution

Averaging is a serious problem we face when dealing with complex distributions.

(Ledig et al. 2016)

Manipulating images and assisted content creation

https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161

(Zhu et al. 2016)


Adversarial Networks


Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016).

Figure: the generator produces generated (input, output) pairs; the discriminator must distinguish them from ground-truth pairs from the real world, and is trained with a binary cross-entropy (BCE) loss.
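This corresponds to the conditional GAN objective with condition x, output y and noise z (the pix2pix paper additionally adds a weighted L1 term; the formula is not spelled out on the slide):

\mathcal{L}_{\text{cGAN}}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]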

Víctor Garcia and Xavier Giró-i-Nieto (work in progress)

Figure: generator and discriminator, with a GAN loss (binary cross-entropy over real/fake 1/0 decisions).

Generative Adversarial Networks (GANs)

Generative models for video

Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.



Adversarial Networks

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014.

Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016).

F. Van Veen, “The Neural Network Zoo” (2016)

Outline

1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models


Recommended