Unsupervised Computer Vision: The Current State of the Art


Unsupervised Computer Vision

Stitch Fix, Styling Algorithms Research Talk

The Current State of the Art

TJ Torres, Data Scientist, Stitch Fix

WHY DEEP LEARNING?

Before deep learning, much of computer vision was focused on feature descriptors and image stats.

[Figure: classical feature detectors, e.g. SURF features, MSER regions, and corner detection]

Image Credit: http://www.mathworks.com/products/computer-vision/features.html

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

Leaderboard:

Team name | Entry description | Classification error | Localization error
GoogLeNet | No localization. Top5 val score is 6.66% error. | 0.06656 | 0.606257
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431
VGG | a combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501


WHY DEEP LEARNING?

Convolution gives a local, translation-invariant feature hierarchy.

Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
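As a minimal sketch (not from the talk), the following Python snippet slides one hand-written edge kernel over an image; a convolutional layer learns many such kernels, each producing a feature map that responds to its pattern wherever it appears in the image.

import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D sliding-window filter (cross-correlation, as in
    deep learning libraries) of a grayscale image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)       # stand-in for a real image
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]])     # classic vertical-edge detector
feature_map = conv2d(image, sobel_x) # one "feature map"
print(feature_map.shape)             # (26, 26)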

WHY DEEP LEARNING?

[Figure: feature hierarchy, from edges to curves to the tops of three shapes, feeding a softmax output layer for classification]

Image Credit: http://parse.ele.tue.nl/education/cluster2

WHY DEEP LEARNING?

[Figures: learned filter visualizations from the Keras blog post "How convolutional neural networks see the world"]

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html


LEARN MORE

http://cs231n.github.io/convolutional-networks/


WHY UNSUPERVISED?

Unfortunately, very few image sets come with labels.

What are the best labels for fashion/style?

THE UNSUPERVISED M.O.

Try to learn an embedding space of the image data (this generally includes a generative process).

1) Train an encoder and decoder to encode and then reconstruct the image.

2) Generate an image from a random embedding and reinforce “good”-looking images.

DOWNSIDES

Higher-dimensional embeddings = non-interpretable.

Latent distributions may contain gaps: no sensible continuum.

OUTLINE

1. Variational Auto-encoders (VAE)

2. Generative Adversarial Networks (GAN)

3. The combination of the two (VAE/GAN)

4. Generative Moment Matching Networks (GMMN)

5. Adversarial Auto-encoders (AAE)

(the last two covered briefly)

Code: stitchfix/fauxtograph

VARIATIONAL AUTO-ENCODERS

ENCODING

[Diagram: input image → convolution layers → latent vector]

VARIATIONAL STEP

Sample from the approximate posterior distribution:

q_\phi(z) = \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)} I)

DECODING

[Diagram: sampled latent vector → deconvolution layers → output reconstruction]
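A minimal PyTorch sketch of the variational step above, the "reparameterization trick" (the names mu and log_var and the shapes are illustrative assumptions, not from the talk): writing z = mu + sigma * eps with eps ~ N(0, I) keeps the sampling step differentiable with respect to the encoder outputs.

import torch

def sample_latent(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I); gradients flow into the encoder
    std = torch.exp(0.5 * log_var)  # log-variance parameterization (assumed)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(8, 32)         # batch of 8 images, 32-dim latent (illustrative)
log_var = torch.zeros(8, 32)
z = sample_latent(mu, log_var)  # fed to the deconvolutional decoder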

CALCULATE LOSS

\mathcal{L}(x) = D_{KL}(q_\phi(z) \,\|\, \mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{out})
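A sketch of this loss in PyTorch, assuming the encoder outputs mu and log_var and the decoder outputs x_recon (all names illustrative): the KL term uses the closed form for D_KL(N(mu, sigma^2 I) || N(0, I)).

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Pixel-wise MSE reconstruction term from the slide
    mse = F.mse_loss(x_recon, x, reduction='sum')
    return kl + mse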

UPDATE WEIGHTS

W^{(l)*}_{ij} = W^{(l)}_{ij} \left( 1 - \alpha \frac{\partial \mathcal{L}}{\partial W_{ij}} \right)

\frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}} = \left( \frac{\partial \mathcal{L}}{\partial x_{out}} \right) \left( \frac{\partial x_{out}}{\partial f^{(n-1)}} \right) \cdots \left( \frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}} \right)
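In practice autograd computes the chain-rule product above; a minimal PyTorch sketch, written in the common subtractive form W - alpha * dL/dW (the loss here is a stand-in, not the VAE loss):

import torch

W = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4)
loss = ((W @ x) ** 2).sum()  # stand-in loss
loss.backward()              # autograd applies the chain rule for dL/dW
with torch.no_grad():
    W -= 0.01 * W.grad       # gradient-descent update, alpha = 0.01
    W.grad.zero_()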

OUTPUT

source: @genekogan

Note the blurring of the hair: because of the pixel-wise MSE loss, non-centered features are disproportionately penalized.

GENERATIVE ADVERSARIAL NETWORKS

GAN STRUCTURE

[Diagram: a latent random vector feeds the Generator, which produces an image; generated and training images are passed to the Discriminator, which outputs Yes/No]

TRAINING

Generator and Discriminator play a minimax game.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Discriminator: lower loss for correctly identifying training vs. generated data.

Generator: lower loss for fooling the Discriminator.

\mathcal{L}_D = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]

\mathcal{L}_G = \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))

http://arxiv.org/pdf/1406.2661v1.pdf
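A minimal sketch of one training step under this objective (G, D, and the two optimizers are assumed to exist; D is assumed to end in a sigmoid so its output lies in (0, 1)):

import torch

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator ascends V: lower loss for labeling real as 1, fake as 0
    opt_d.zero_grad()
    loss_d = -(torch.log(D(real)).mean()
               + torch.log(1 - D(fake.detach())).mean())
    loss_d.backward()
    opt_d.step()

    # Generator descends V: minimize log(1 - D(G(z))), i.e. fool D
    opt_g.zero_grad()
    loss_g = torch.log(1 - D(fake)).mean()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()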

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Unfortunately, a GAN is only generative: there is no encoder to map an image into the latent space.

VAE+GAN

VAE+GAN STRUCTURE

[Diagram: Encoder → Generator → Discriminator. An original image O is encoded and reconstructed as G(E(O)); a sample S from the prior is decoded as G(S). The Discriminator outputs Yes/No on originals vs. generated images, and reconstructions are additionally scored with MSE.]

TRAINING

Train the Encoder, Generator, and Discriminator with separate optimizers.

\mathcal{L}_E = D_{KL}(q_\phi(z) \,\|\, \mathcal{N}(0, I)) + \mathrm{MSE}(D_l(x), D_l(G(E(x))))   [VAE prior + learned similarity]

\mathcal{L}_G = \gamma \times \mathrm{MSE}(D_l(x), D_l(G(E(x)))) - \mathcal{L}_{GAN}   [learned similarity + GAN]

\mathcal{L}_D = \mathcal{L}_{GAN} = \| \log(D(x)) + \log(1 - D(G(E(x)))) + \log(1 - D(G(z))) \|_1   [GAN discriminator loss]
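A sketch of the learned-similarity term, assuming D_l is a callable exposing the activations of an intermediate discriminator layer (an assumption about how the model is wired up, not an API from the talk):

import torch.nn.functional as F

def learned_similarity(D_l, x, x_recon):
    # MSE(D_l(x), D_l(G(E(x)))): feature-level rather than pixel-level distance
    return F.mse_loss(D_l(x_recon), D_l(x))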

OUTPUT

http://arxiv.org/pdf/1512.09300v1.pdf

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

A learned similarity metric provides feature-level distance rather than pixel-level distance.

The result: the latent space of a GAN with the encoder of a VAE.

…BUT NOT THAT EASY TO TRAIN

GENERATIVE MOMENT MATCHING NETWORKS

DESCRIPTION

Train a generative network to output a distribution whose moments match the dataset.

Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss:

\mathcal{L}_{MMD^2} = \left\| \frac{1}{N} \sum_{i=0}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=0}^{M} \phi(y_j) \right\|^2

Expanding with the kernel trick:

\mathcal{L}_{MMD^2} = \frac{1}{N^2} \sum_{i=0}^{N} \sum_{i'=0}^{N} k(x_i, x_{i'}) - \frac{2}{MN} \sum_{i=0}^{N} \sum_{j=0}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=0}^{M} \sum_{j'=0}^{M} k(y_j, y_{j'})
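A numpy sketch of the kernelized squared MMD above, using an RBF kernel (the bandwidth sigma is an assumed free parameter, as are the batch shapes):

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise squared distances, then Gaussian kernel k(a, b)
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    n, m = len(x), len(y)
    return (rbf_kernel(x, x, sigma).sum() / n**2
            - 2 * rbf_kernel(x, y, sigma).sum() / (n * m)
            + rbf_kernel(y, y, sigma).sum() / m**2)

x = np.random.randn(64, 10)        # generated batch (illustrative)
y = np.random.randn(64, 10) + 0.5  # data batch (illustrative)
print(mmd2(x, y))                  # shrinks as the distributions match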

OUTPUT

http://arxiv.org/pdf/1502.02761v1.pdf

ADVERSARIAL AUTO-ENCODERS

DESCRIPTION

We want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.

Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.

Train the encoder/decoder with reconstruction metrics. Additionally: sample from the specified prior and train the encoder to produce codes indistinguishable from it.

[Diagram: a GAN path regularizes the code space; the AE path handles reconstruction]
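A minimal sketch of the adversarial regularization phase (E, the code discriminator D_code, prior_sample, and the optimizers are assumed names; the reconstruction phase is trained separately):

import torch

def aae_reg_step(E, D_code, x, prior_sample, opt_d, opt_e):
    z_fake = E(x)                     # codes from the encoder
    z_real = prior_sample(x.size(0))  # samples from the chosen prior

    # Code discriminator: separate prior samples from encoder codes
    opt_d.zero_grad()
    loss_d = -(torch.log(D_code(z_real)).mean()
               + torch.log(1 - D_code(z_fake.detach())).mean())
    loss_d.backward()
    opt_d.step()

    # Encoder acts as the "generator": fool the code discriminator,
    # pushing the coding distribution toward the prior
    opt_e.zero_grad()
    loss_e = -torch.log(D_code(z_fake)).mean()
    loss_e.backward()
    opt_e.step()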

SEMI-SUPERVISED

Regularize the encoding space; disentangle the encoding space.

[Figures: codes matched to a mixture of 10 2D Gaussians and to a swiss-roll prior]

http://arxiv.org/pdf/1511.05644v1.pdf
