Unsupervised Computer Vision: The Current State of the Art


Unsupervised Computer Vision

Stitch Fix, Styling Algorithms Research Talk

The Current State of the Art

TJ Torres, Data Scientist, Stitch Fix

WHY DEEP LEARNING?

Before deep learning, much of computer vision was focused on feature descriptors and image stats.

[Figure: classical feature detectors, e.g. SURF features, MSER regions, and corner detection]

Image Credit: http://www.mathworks.com/products/computer-vision/features.html

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

Leaderboard:

Team name | Entry description | Classification error | Localization error
GoogLeNet | No localization. Top5 val score is 6.66% error. | 0.06656 | 0.606257
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431
VGG | a combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501


WHY DEEP LEARNING?

Convolution gives a local, translation-invariant feature hierarchy.

Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
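As a minimal sketch (not from the talk), the following Python snippet slides one hand-written edge kernel over an image; a convolutional layer learns many such kernels, each producing a feature map that responds to its pattern wherever it appears in the image.

import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D sliding-window filter (cross-correlation, as in
    deep learning libraries) of a grayscale image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)       # stand-in for a real image
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]])     # classic vertical-edge detector
feature_map = conv2d(image, sobel_x) # one "feature map"
print(feature_map.shape)             # (26, 26)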

WHY DEEP LEARNING?

[Figure: feature hierarchy, from edges to curves to the tops of three shapes, feeding a softmax output layer for classification]

Image Credit: http://parse.ele.tue.nl/education/cluster2

WHY DEEP LEARNING?

[Figures: learned filter visualizations from the Keras blog post "How convolutional neural networks see the world"]

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html


LEARN MORE

http://cs231n.github.io/convolutional-networks/


WHY UNSUPERVISED?

Unfortunately, very few image sets come with labels.

What are the best labels for fashion/style?

THE UNSUPERVISED M.O.

Try to learn an embedding space of the image data (this generally includes a generative process).

1) Train an encoder and decoder to encode and then reconstruct the image.

2) Generate an image from a random embedding and reinforce “good”-looking images.

DOWNSIDES

Higher-dimensional embeddings = non-interpretable.

Latent distributions may contain gaps: no sensible continuum.

OUTLINE

1. Variational Auto-encoders (VAE)

2. Generative Adversarial Networks (GAN)

3. The combination of the two (VAE/GAN)

4. Generative Moment Matching Networks (GMMN)

5. Adversarial Auto-encoders (AAE)

(the last two covered briefly)

Code: stitchfix/fauxtograph

VARIATIONAL AUTO-ENCODERS

ENCODING

[Diagram: input image → convolution layers → latent vector]

VARIATIONAL STEP

Sample from the approximate posterior distribution:

q_\phi(z) = \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)} I)

DECODING

[Diagram: sampled latent vector → deconvolution layers → output reconstruction]
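A minimal PyTorch sketch of the variational step above, the "reparameterization trick" (the names mu and log_var and the shapes are illustrative assumptions, not from the talk): writing z = mu + sigma * eps with eps ~ N(0, I) keeps the sampling step differentiable with respect to the encoder outputs.

import torch

def sample_latent(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I); gradients flow into the encoder
    std = torch.exp(0.5 * log_var)  # log-variance parameterization (assumed)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(8, 32)         # batch of 8 images, 32-dim latent (illustrative)
log_var = torch.zeros(8, 32)
z = sample_latent(mu, log_var)  # fed to the deconvolutional decoder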

CALCULATE LOSS

\mathcal{L}(x) = D_{KL}(q_\phi(z) \,\|\, \mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{out})
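A sketch of this loss in PyTorch, assuming the encoder outputs mu and log_var and the decoder outputs x_recon (all names illustrative): the KL term uses the closed form for D_KL(N(mu, sigma^2 I) || N(0, I)).

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Pixel-wise MSE reconstruction term from the slide
    mse = F.mse_loss(x_recon, x, reduction='sum')
    return kl + mse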

UPDATE WEIGHTS

W^{(l)*}_{ij} = W^{(l)}_{ij} \left( 1 - \alpha \frac{\partial \mathcal{L}}{\partial W_{ij}} \right)

\frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}} = \left( \frac{\partial \mathcal{L}}{\partial x_{out}} \right) \left( \frac{\partial x_{out}}{\partial f^{(n-1)}} \right) \cdots \left( \frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}} \right)
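In practice autograd computes the chain-rule product above; a minimal PyTorch sketch, written in the common subtractive form W - alpha * dL/dW (the loss here is a stand-in, not the VAE loss):

import torch

W = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4)
loss = ((W @ x) ** 2).sum()  # stand-in loss
loss.backward()              # autograd applies the chain rule for dL/dW
with torch.no_grad():
    W -= 0.01 * W.grad       # gradient-descent update, alpha = 0.01
    W.grad.zero_()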

OUTPUT

source: @genekogan

Note the blurring of the hair: because of the pixel-wise MSE loss, non-centered features are disproportionately penalized.

GENERATIVE ADVERSARIAL NETWORKS

GAN STRUCTURE

[Diagram: a latent random vector feeds the Generator, which produces an image; generated and training images are passed to the Discriminator, which outputs Yes/No]

TRAINING

Generator and Discriminator play a minimax game.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Discriminator: lower loss for correctly identifying training vs. generated data.

Generator: lower loss for fooling the Discriminator.

\mathcal{L}_D = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]

\mathcal{L}_G = \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))

http://arxiv.org/pdf/1406.2661v1.pdf
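A minimal sketch of one training step under this objective (G, D, and the two optimizers are assumed to exist; D is assumed to end in a sigmoid so its output lies in (0, 1)):

import torch

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator ascends V: lower loss for labeling real as 1, fake as 0
    opt_d.zero_grad()
    loss_d = -(torch.log(D(real)).mean()
               + torch.log(1 - D(fake.detach())).mean())
    loss_d.backward()
    opt_d.step()

    # Generator descends V: minimize log(1 - D(G(z))), i.e. fool D
    opt_g.zero_grad()
    loss_g = torch.log(1 - D(fake)).mean()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()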

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Unfortunately, a GAN is only generative: there is no encoder to map an image into the latent space.

VAE+GAN

VAE+GAN STRUCTURE

[Diagram: Encoder → Generator → Discriminator. An original image O is encoded and reconstructed as G(E(O)); a sample S from the prior is decoded as G(S). The Discriminator outputs Yes/No on originals vs. generated images, and reconstructions are additionally scored with MSE.]

TRAINING

Train the Encoder, Generator, and Discriminator with separate optimizers.

\mathcal{L}_E = D_{KL}(q_\phi(z) \,\|\, \mathcal{N}(0, I)) + \mathrm{MSE}(D_l(x), D_l(G(E(x))))   [VAE prior + learned similarity]

\mathcal{L}_G = \gamma \times \mathrm{MSE}(D_l(x), D_l(G(E(x)))) - \mathcal{L}_{GAN}   [learned similarity + GAN]

\mathcal{L}_D = \mathcal{L}_{GAN} = \| \log(D(x)) + \log(1 - D(G(E(x)))) + \log(1 - D(G(z))) \|_1   [GAN discriminator loss]
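A sketch of the learned-similarity term, assuming D_l is a callable exposing the activations of an intermediate discriminator layer (an assumption about how the model is wired up, not an API from the talk):

import torch.nn.functional as F

def learned_similarity(D_l, x, x_recon):
    # MSE(D_l(x), D_l(G(E(x)))): feature-level rather than pixel-level distance
    return F.mse_loss(D_l(x_recon), D_l(x))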

OUTPUT

http://arxiv.org/pdf/1512.09300v1.pdf

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

A learned similarity metric provides feature-level distance rather than pixel-level distance.

The result: the latent space of a GAN with the encoder of a VAE.

…BUT NOT THAT EASY TO TRAIN

GENERATIVE MOMENT MATCHING NETWORKS

DESCRIPTION

Train a generative network to output a distribution whose moments match the dataset.

Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss:

\mathcal{L}_{MMD^2} = \left\| \frac{1}{N} \sum_{i=0}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=0}^{M} \phi(y_j) \right\|^2

Expanding with the kernel trick:

\mathcal{L}_{MMD^2} = \frac{1}{N^2} \sum_{i=0}^{N} \sum_{i'=0}^{N} k(x_i, x_{i'}) - \frac{2}{MN} \sum_{i=0}^{N} \sum_{j=0}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=0}^{M} \sum_{j'=0}^{M} k(y_j, y_{j'})
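A numpy sketch of the kernelized squared MMD above, using an RBF kernel (the bandwidth sigma is an assumed free parameter, as are the batch shapes):

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise squared distances, then Gaussian kernel k(a, b)
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    n, m = len(x), len(y)
    return (rbf_kernel(x, x, sigma).sum() / n**2
            - 2 * rbf_kernel(x, y, sigma).sum() / (n * m)
            + rbf_kernel(y, y, sigma).sum() / m**2)

x = np.random.randn(64, 10)        # generated batch (illustrative)
y = np.random.randn(64, 10) + 0.5  # data batch (illustrative)
print(mmd2(x, y))                  # shrinks as the distributions match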

OUTPUT

http://arxiv.org/pdf/1502.02761v1.pdf

ADVERSARIAL AUTO-ENCODERS

DESCRIPTION

We want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.

Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.

Train the encoder/decoder with reconstruction metrics. Additionally: sample from the specified prior and train the encoder to produce codes indistinguishable from it.

[Diagram: a GAN path regularizes the code space; the AE path handles reconstruction]
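A minimal sketch of the adversarial regularization phase (E, the code discriminator D_code, prior_sample, and the optimizers are assumed names; the reconstruction phase is trained separately):

import torch

def aae_reg_step(E, D_code, x, prior_sample, opt_d, opt_e):
    z_fake = E(x)                     # codes from the encoder
    z_real = prior_sample(x.size(0))  # samples from the chosen prior

    # Code discriminator: separate prior samples from encoder codes
    opt_d.zero_grad()
    loss_d = -(torch.log(D_code(z_real)).mean()
               + torch.log(1 - D_code(z_fake.detach())).mean())
    loss_d.backward()
    opt_d.step()

    # Encoder acts as the "generator": fool the code discriminator,
    # pushing the coding distribution toward the prior
    opt_e.zero_grad()
    loss_e = -torch.log(D_code(z_fake)).mean()
    loss_e.backward()
    opt_e.step()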

SEMI-SUPERVISED

Regularize the encoding space; disentangle the encoding space.

[Figures: codes matched to a mixture of 10 2D Gaussians and to a swiss-roll prior]

http://arxiv.org/pdf/1511.05644v1.pdf
