Unsupervised Visual Representation Learning by Context Prediction

Most slides in this presentation are adopted from the authors' original presentation at ICCV 2015.

Berkan Demirel



Page 2: Unsupervised Visual Representation Learning by Context Prediction

ImageNet + Deep Learning

Beagle

- Image Retrieval
- Detection (R-CNN)
- Segmentation (FCN)
- Depth Estimation
- …

Page 3: Unsupervised Visual Representation Learning by Context Prediction

ImageNet + Deep Learning

Beagle

Do we need semantic labels? Pose?

Boundaries? Geometry?

Parts? Materials?

Page 4: Unsupervised Visual Representation Learning by Context Prediction

Context as Supervision [Collobert & Weston 2008; Mikolov et al. 2013]

Deep Net
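For intuition, the text version of "context as supervision" turns raw, unlabeled text into (center, context) training pairs, in the spirit of skip-gram word embeddings. This is a minimal sketch; the function name `context_pairs` and the window size are illustrative assumptions, not taken from the slides.

```python
def context_pairs(tokens, window=2):
    """Pair each word with every neighbor within `window` positions.

    No labels are needed: the surrounding words are the training signal."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


print(context_pairs(["the", "cat", "sat"], window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

The image version replaces "which word appears nearby" with "where does this patch sit relative to that one".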

Page 5: Unsupervised Visual Representation Learning by Context Prediction

Context Prediction for Images

[Figure: patches A and B, with the eight positions surrounding the center cell of a 3×3 grid each marked "?"]

Page 6: Unsupervised Visual Representation Learning by Context Prediction

Semantics from a non-semantic task

Page 7: Unsupervised Visual Representation Learning by Context Prediction

Randomly Sample Patch
Sample Second Patch

CNN CNN

Classifier

Relative Position Task: 8 possible locations
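The relative-position task is an 8-way classification problem: the label says which of the eight neighboring cells the second patch came from. A minimal sketch of the label encoding; the ordering of `OFFSETS` (and therefore the class numbering) and the helper names are arbitrary choices for illustration.

```python
import random

# (row, col) offsets of the second patch relative to the first on a 3x3 grid.
# The center (0, 0) is excluded, leaving exactly 8 classes.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]


def offset_to_label(offset):
    """Map a (row, col) offset to a class index in 0..7."""
    return OFFSETS.index(offset)


def label_to_offset(label):
    """Inverse of offset_to_label."""
    return OFFSETS[label]


def sample_example(rng):
    """Draw one training target: where the second patch goes, and its class."""
    label = rng.randrange(len(OFFSETS))
    return label_to_offset(label), label


offset, label = sample_example(random.Random(0))
```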

Page 8: Unsupervised Visual Representation Learning by Context Prediction

CNN CNN

Classifier

Patch Embedding

Input Nearest Neighbors

CNN Note: connects across instances!

Page 9: Unsupervised Visual Representation Learning by Context Prediction

Architecture

[Figure: two AlexNet-style stacks, one for Patch 1 and one for Patch 2, with tied weights. Each stack contains five Convolution layers, three Max Pooling layers, two LRN layers, and a Fully connected layer; the outputs of both stacks feed two further Fully connected layers and a Softmax loss.]
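The essential idea in the figure is weight tying: both patches pass through the same parameters, and only the top layers see the pair together. A toy pure-Python sketch, assuming each stack is collapsed to a single linear + ReLU layer (a deliberate simplification, not the real convolutional stack); all dimensions and function names are hypothetical.

```python
import math
import random


def make_layer(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) for a toy linear layer."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear_relu(weights, x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def forward(tower, head, patch1, patch2):
    # Tied weights: the SAME `tower` parameters embed both patches.
    f1 = linear_relu(tower, patch1)
    f2 = linear_relu(tower, patch2)
    joint = f1 + f2  # list concatenation joins the two stacks' outputs
    logits = [sum(w * v for w, v in zip(row, joint)) for row in head]
    return softmax(logits)  # distribution over the 8 relative positions


rng = random.Random(0)
tower = make_layer(in_dim=16, out_dim=4, rng=rng)
head = make_layer(in_dim=8, out_dim=8, rng=rng)  # 8 inputs = 2 stacks x 4 features
p1 = [rng.random() for _ in range(16)]
p2 = [rng.random() for _ in range(16)]
probs = forward(tower, head, p1, p2)
```

Because the stacks share weights, the embedding learned for one patch is directly usable on its own, which is what the nearest-neighbor experiments rely on.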

Page 10: Unsupervised Visual Representation Learning by Context Prediction

Avoiding Trivial Shortcuts

Include a gap

Jitter the patch locations
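The two fixes can be sketched as a function that places a patch pair on the grid. The values (96-pixel patches, a 48-pixel gap, jitter of up to 7 pixels per axis) come from the implementation-details slide; the function names and the `OFFSETS` ordering are hypothetical.

```python
import random

PATCH = 96   # patch side, pixels
GAP = 48     # gap between neighboring grid cells, pixels
JITTER = 7   # maximum jitter per axis, pixels

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]


def jittered(y, x, rng):
    """Shift a top-left corner by a random amount in [-JITTER, JITTER] per axis."""
    return (y + rng.randint(-JITTER, JITTER), x + rng.randint(-JITTER, JITTER))


def pair_corners(cy, cx, label, rng):
    """Top-left corners of the (first, second) patch for a given label.

    (cy, cx) is the first patch's nominal grid position; neighboring grid
    cells are PATCH + GAP pixels apart, and each patch is jittered separately."""
    dy, dx = OFFSETS[label]
    step = PATCH + GAP
    first = jittered(cy, cx, rng)
    second = jittered(cy + dy * step, cx + dx * step, rng)
    return first, second


first, second = pair_corners(300, 300, 4, random.Random(1))
```

With these values two neighboring patches are always at least 48 - 2 × 7 = 34 pixels apart and never exactly aligned, so the network cannot simply match continuing edges or textures across the patch boundary.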

Page 11: Unsupervised Visual Representation Learning by Context Prediction

Position in Image

A Not-So-"Trivial" Shortcut

Page 12: Unsupervised Visual Representation Learning by Context Prediction

Chromatic Aberration

Page 13: Unsupervised Visual Representation Learning by Context Prediction

Solutions

Color Dropping: Randomly drop 2 of the 3 color channels from each patch, then replace the dropped channels with Gaussian noise (standard deviation ~1/100 the standard deviation of the remaining channel).

Projection: Shift green and magenta (red + blue) towards gray.
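Both fixes can be sketched per pixel. The color-dropping function follows the slide directly. For projection, the slide only says "shift green and magenta towards gray"; removing each pixel's component along the green-magenta axis a = (-1, 2, -1) is one way to do that and is an assumption here. Pixels are (r, g, b) tuples and the function names are hypothetical.

```python
import random
import statistics

# Green-magenta axis: green minus the mean of red and blue (times 2).
A = (-1.0, 2.0, -1.0)


def project_pixel(p):
    """Return p - (p.a / a.a) * a, removing the green-magenta component.

    Pure green (0, 1, 0) maps to gray (1/3, 1/3, 1/3); gray is unchanged."""
    dot = sum(pi * ai for pi, ai in zip(p, A))
    norm2 = sum(ai * ai for ai in A)
    return tuple(pi - dot / norm2 * ai for pi, ai in zip(p, A))


def drop_colors(patch, rng):
    """Keep one random channel; replace the other two with Gaussian noise
    whose std is 1/100 of the kept channel's std. Returns (patch, kept)."""
    keep = rng.randrange(3)
    sigma = statistics.pstdev([px[keep] for px in patch]) / 100.0
    out = [tuple(px[c] if c == keep else rng.gauss(0.0, sigma) for c in range(3))
           for px in patch]
    return out, keep


rng = random.Random(0)
patch = [(float(i), float(2 * i), float(3 * i)) for i in range(50)]
dropped, kept = drop_colors(patch, rng)
```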

Page 14: Unsupervised Visual Representation Learning by Context Prediction

Implementation Details
• Train on the ImageNet 2012 training set (1.3M images), using only the images and discarding the labels.
• Resize each image to between 150K and 450K total pixels, preserving the aspect ratio.
• Sample patches at resolution 96-by-96.
• Sample the patches from a grid-like pattern. Each sampled patch can participate in as many as 8 separate pairings.
• Allow a gap of 48 pixels between the sampled patches in the grid, but also jitter the location of each patch in the grid by -7 to 7 pixels in each direction.
• Preprocess patches by (1) mean subtraction, (2) projecting or dropping colors, and (3) randomly downsampling some patches to as little as 100 total pixels and then upsampling them, to build robustness to pixelation.
• Use batch normalization, without the scale and shift.
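The last two bullets can be sketched in pure Python. This is a minimal illustration under stated assumptions: square images stored as lists of rows, nearest-neighbor resampling (the slide does not name the interpolation), and a 10 × 10 = 100-pixel floor; the function names are hypothetical.

```python
import math


def pixelate(img, target_pixels=100):
    """Downsample a square image to about `target_pixels` total pixels with
    nearest-neighbor sampling, then upsample back to the original size."""
    n = len(img)
    s = max(1, int(math.sqrt(target_pixels)))  # side of the small image
    small = [[img[i * n // s][j * n // s] for j in range(s)] for i in range(s)]
    return [[small[i * s // n][j * s // n] for j in range(n)] for i in range(n)]


def batch_norm_no_affine(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch,
    with no learned scale (gamma) or shift (beta) applied afterwards."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dims)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dims)] for row in batch]


img = [[i * 96 + j for j in range(96)] for i in range(96)]  # 96x96, all distinct
pix = pixelate(img)
normed = batch_norm_no_affine([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```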

Page 15: Unsupervised Visual Representation Learning by Context Prediction

Experiments
• Chromatic Aberration
• Nearest-Neighbor Matching
• Object Detection
• Geometry Estimation
• Visual Data Mining
• Layout Prediction

Page 16: Unsupervised Visual Representation Learning by Context Prediction

Chromatic Aberration

CNN

Page 17: Unsupervised Visual Representation Learning by Context Prediction

Chromatic Aberration

CNN

Page 18: Unsupervised Visual Representation Learning by Context Prediction

Nearest-Neighbor Matching
• Only one of the two stacks is used, and features are taken from its fc6 layer.
• fc7 and higher layers are removed.
• Normalized cross-correlation is used to find similar patches.
• Randomly selected 96x96 patches are used in the comparison.
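The matching step itself is simple once features are extracted. A minimal sketch, assuming each patch is already represented by a flat fc6 feature vector (feature extraction is not shown); the function names are hypothetical.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length vectors:
    subtract each vector's mean, then take the cosine similarity."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return num / den if den > 0 else 0.0


def nearest_neighbors(query, database, k=5):
    """Indices of the k database vectors with the highest NCC to `query`."""
    scores = [(ncc(query, feat), i) for i, feat in enumerate(database)]
    scores.sort(reverse=True)  # highest correlation first
    return [i for _, i in scores[:k]]


db = [[3.0, 2.0, 1.0], [10.0, 20.0, 30.0], [0.0, 0.0, 1.0]]
nn = nearest_neighbors([1.0, 2.0, 3.0], db, k=3)
```

NCC ignores both offset and overall scale of the feature vectors, so [10, 20, 30] counts as a perfect match for [1, 2, 3].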

Page 19: Unsupervised Visual Representation Learning by Context Prediction

What is learned?

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

Page 20: Unsupervised Visual Representation Learning by Context Prediction

Still don’t capture everything

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

You don’t always need to learn!

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

Page 21: Unsupervised Visual Representation Learning by Context Prediction

Object Detection

Pre-train on relative-position task, w/o labels

[Girshick et al. 2014]

Page 22: Unsupervised Visual Representation Learning by Context Prediction

Object Detection

[Girshick et al. 2014]

Page 23: Unsupervised Visual Representation Learning by Context Prediction

Object Detection

[Girshick et al. 2014]

Page 24: Unsupervised Visual Representation Learning by Context Prediction

Multi-Task Training?

Page 25: Unsupervised Visual Representation Learning by Context Prediction

Surface-Normal Estimation

                    Error (lower is better)     % Good Pixels (higher is better)
                    Mean      Median            11.25°    22.5°    30°
No Pretraining      38.6      26.5              33.1      46.8     52.5
Unsup. Track.       34.2      21.9              35.7      50.6     57.0
Ours                33.2      21.3              36.0      51.2     57.8
ImageNet Labels     33.3      20.8              36.7      51.7     58.1

Page 26: Unsupervised Visual Representation Learning by Context Prediction

Visual Data Mining
• Sample a constellation of four adjacent patches from an image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance).
• Find the top 100 images that have the strongest matches for all four patches, ignoring spatial layout.
• Use a form of geometric verification to filter out the images where the four matches are not geometrically consistent.
• Apply the described mining algorithm to Pascal VOC 2011.
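The retrieve-then-verify loop can be sketched as follows. Assumptions, all hypothetical: an image is scored by summing, over the query patches, each patch's best similarity to any patch in that image; `sim` is any similarity function (for example NCC on fc6 features); and the consistency test is a deliberately crude stand-in (layout must agree up to translation and uniform scale after normalization), not the verification procedure simplified from Chum et al.

```python
import math


def image_score(query_feats, image_patch_feats, sim):
    """Best match per query patch, ignoring layout. Returns (score, matches)."""
    total, matches = 0.0, []
    for q in query_feats:
        best = max(range(len(image_patch_feats)),
                   key=lambda i: sim(q, image_patch_feats[i]))
        total += sim(q, image_patch_feats[best])
        matches.append(best)
    return total, matches


def top_images(query_feats, images, sim, k=100):
    """Rank images by image_score; return [(image_index, matches)] for the top k."""
    scored = [(image_score(query_feats, feats, sim), idx)
              for idx, feats in enumerate(images)]
    scored.sort(key=lambda t: t[0][0], reverse=True)
    return [(idx, matches) for (_, matches), idx in scored[:k]]


def _normalize(points):
    """Center (x, y) points on their centroid and divide by RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    rms = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    return None if rms == 0 else [(x / rms, y / rms) for x, y in centered]


def geometrically_consistent(query_xy, match_xy, tol=0.25):
    """True if matched patch centers reproduce the query constellation's
    layout up to translation and uniform scale (within `tol`)."""
    q, m = _normalize(query_xy), _normalize(match_xy)
    if q is None or m is None:
        return False
    return all(math.hypot(qx - mx, qy - my) <= tol
               for (qx, qy), (mx, my) in zip(q, m))


square = [(0, 0), (1, 0), (0, 1), (1, 1)]  # query constellation of 4 patches
ok = geometrically_consistent(square, [(10, 10), (12, 10), (10, 12), (12, 12)])
```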

Page 27: Unsupervised Visual Representation Learning by Context Prediction

Visual Data Mining

Via Geometric Verification

Simplified from [Chum et al. 2007]

Page 28: Unsupervised Visual Representation Learning by Context Prediction

Mined from Pascal VOC 2011

Page 29: Unsupervised Visual Representation Learning by Context Prediction

Layout Prediction

Visual data mining algorithm results for 15,000 Street View images from Paris

Page 30: Unsupervised Visual Representation Learning by Context Prediction

Purity Test

Page 31: Unsupervised Visual Representation Learning by Context Prediction

So, do we need semantic labels?

Page 32: Unsupervised Visual Representation Learning by Context Prediction

Source Code & Supplementary Materials

• MagicInit
• Unsupervised Visual Representation Learning by Context Prediction
• Visual Data Mining Results on unlabeled PASCAL VOC 2011 Images
• Nearest Neighbors on PASCAL VOC 2007
• More

Page 33: Unsupervised Visual Representation Learning by Context Prediction

THANK YOU!