Unsupervised Visual Representation Learning by Context Prediction


Most slides in this presentation are adapted from the authors' original presentation at ICCV 2015.

Berkan Demirel

ImageNet + Deep Learning

Beagle

- Image Retrieval
- Detection (RCNN)
- Segmentation (FCN)
- Depth Estimation
- …

ImageNet + Deep Learning

Beagle

Do we need semantic labels?

Pose? Boundaries? Geometry? Parts? Materials?

Context as Supervision [Collobert & Weston 2008; Mikolov et al. 2013]

Deep Net

Context Prediction for Images

[Figure: patches A and B; eight candidate positions around the center patch, each marked "?"]

Semantics from a non-semantic task

Randomly Sample Patch
Sample Second Patch

CNN CNN

Classifier

Relative Position Task: 8 possible locations

CNN CNN

Classifier
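The eight possible locations make this an 8-way classification problem. A minimal sketch of one way to encode the class labels follows; the numbering order and the names `NEIGHBOR_OFFSETS`, `offset_to_label`, and `label_to_offset` are illustrative assumptions, not taken from the slides.

```python
# Eight neighbor positions around a center patch, as (row, col) grid steps.
# The ordering (row by row, top-left first) is an arbitrary illustrative choice.
NEIGHBOR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]


def offset_to_label(d_row, d_col):
    """Map the second patch's grid offset from the first to a class index 0..7."""
    return NEIGHBOR_OFFSETS.index((d_row, d_col))


def label_to_offset(label):
    """Inverse mapping: class index 0..7 back to a (row, col) grid offset."""
    return NEIGHBOR_OFFSETS[label]
```

The classifier's softmax output would then have eight units, one per entry in the list.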

Patch Embedding

[Figure columns: Input, Nearest Neighbors]

CNN
Note: connects across instances!

Architecture

[Figure: two-stack network. Patch 1 and Patch 2 each pass through their own stack, and the two stacks have tied weights. Layer labels per stack: five Convolution layers, two Max Pooling + LRN stages, one Max Pooling, one Fully connected layer. The stacks feed further Fully connected layers and a Softmax loss.]

Avoiding Trivial Shortcuts

Include a gap

Jitter the patch locations
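A minimal sketch of pair sampling with a gap and jitter, using the values given in the implementation details (96-pixel patches, 48-pixel gap, jitter of -7 to 7 pixels). The function name `sample_pair`, the label ordering, and the choice to model only the relative jitter (applied to the second patch) are simplifying assumptions, not the authors' code.

```python
import numpy as np

PATCH = 96    # patch side in pixels
GAP = 48      # nominal gap between neighboring patches
JITTER = 7    # maximum jitter in pixels, in each direction

# Illustrative label order: (row, col) grid steps for the 8 neighbor positions.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def sample_pair(image, rng):
    """Return (patch1, patch2, label) sampled from an H x W x C image array."""
    h, w = image.shape[:2]
    step = PATCH + GAP              # distance between patch origins on the grid
    margin = step + JITTER          # room needed on every side of the first patch
    if h < 2 * margin + PATCH or w < 2 * margin + PATCH:
        raise ValueError("image too small for a full 3x3 neighborhood")

    # Top-left corner of the first patch, chosen so every neighbor fits.
    top = int(rng.integers(margin, h - margin - PATCH + 1))
    left = int(rng.integers(margin, w - margin - PATCH + 1))

    # Pick one of the 8 neighbor positions and jitter it.
    label = int(rng.integers(0, 8))
    d_row, d_col = OFFSETS[label]
    jy, jx = rng.integers(-JITTER, JITTER + 1, size=2)
    top2 = top + d_row * step + int(jy)
    left2 = left + d_col * step + int(jx)

    patch1 = image[top:top + PATCH, left:left + PATCH]
    patch2 = image[top2:top2 + PATCH, left2:left2 + PATCH]
    return patch1, patch2, label
```

Usage: `sample_pair(image, np.random.default_rng(0))`. The gap removes the shared boundary between patches, and the jitter stops the network from relying on exact line continuation across the gap.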

Position in Image

A Not-So “Trivial” Shortcut

Chromatic Aberration

Solutions

Color Dropping: Randomly drop 2 of the 3 color channels from each patch, then replace the dropped channels with Gaussian noise (standard deviation ~1/100 the standard deviation of the remaining channel).

Projection: Shift green and magenta (red + blue) towards gray.
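Both solutions can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: patches are assumed to be mean-subtracted float arrays of shape H x W x 3 in RGB order (so zero-mean noise is appropriate), and the green-magenta axis `a = (-1, 2, -1)` is an assumed choice that the slide does not state.

```python
import numpy as np


def drop_colors(patch, rng):
    """Keep one random channel; replace the other two with Gaussian noise.

    The noise standard deviation is 1/100 of the kept channel's standard deviation.
    Returns the new patch and the index of the kept channel.
    """
    patch = np.asarray(patch, dtype=np.float64)
    keep = int(rng.integers(0, 3))
    noise_std = patch[..., keep].std() / 100.0
    out = rng.normal(0.0, noise_std, size=patch.shape)
    out[..., keep] = patch[..., keep]
    return out, keep


def project_out_green_magenta(patch):
    """Remove each pixel's component along an assumed green-magenta axis."""
    a = np.array([-1.0, 2.0, -1.0])              # assumed axis: green vs. red + blue
    B = np.eye(3) - np.outer(a, a) / a.dot(a)    # projector onto the complement of a
    return np.asarray(patch, dtype=np.float64) @ B
```

Gray pixels (equal R, G, B) are orthogonal to `a`, so the projection leaves them unchanged while pulling green and magenta fringes towards gray.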

Implementation Details
• Train on the ImageNet 2012 training set (1.3M images), using only the images and discarding the labels.
• Resize each image to between 150K and 450K total pixels, preserving the aspect ratio.
• Sample patches at resolution 96-by-96.
• Sample the patches from a grid-like pattern. Each sampled patch can participate in as many as 8 separate pairings.
• Allow a gap of 48 pixels between the sampled patches in the grid, but also jitter the location of each patch in the grid by -7 to 7 pixels in each direction.
• Preprocess patches by (1) mean subtraction, (2) projecting or dropping colors, (3) randomly downsampling some patches to as little as 100 total pixels and then upsampling them, to build robustness to pixelation.
• Use batch normalization, without the scale and shift.
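The pixelation step in item (3) can be sketched as follows. The slides do not say which interpolation is used or how often the step is applied, so nearest-neighbor sampling, the probability `prob`, and the function names are assumptions made for illustration; `min_side=10` corresponds to 100 total pixels.

```python
import numpy as np


def pixelate(patch, side):
    """Downsample a square patch to side x side, then upsample to the original size.

    Both steps use nearest-neighbor sampling (an assumption).
    """
    size = patch.shape[0]
    # Downsample: pick `side` evenly spaced source rows and columns.
    src = (np.arange(side) * size) // side
    small = patch[src][:, src]
    # Upsample: map every output pixel back to its low-resolution cell.
    dst = (np.arange(size) * side) // size
    return small[dst][:, dst]


def random_pixelate(patch, rng, prob=0.5, min_side=10):
    """With probability `prob`, pixelate to a random resolution >= min_side."""
    if rng.random() < prob:
        side = int(rng.integers(min_side, patch.shape[0] + 1))
        return pixelate(patch, side)
    return patch
```

Usage: `random_pixelate(patch, np.random.default_rng(0))` on a 96 x 96 x 3 patch returns an array of the same shape.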

Experiments
• Chromatic Aberration
• Nearest-Neighbor Matching
• Object Detection
• Geometry Estimation
• Visual Data Mining
• Layout Prediction

Chromatic Aberration

CNN

Nearest-Neighbor Matching
• fc6 layer features are used, from only one of the two stacks.
• fc7 and higher layers are removed.
• Normalized cross-correlation is used to find similar patches.
• Randomly selected 96x96 patches are used in the comparison.
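A minimal sketch of nearest-neighbor retrieval with normalized cross-correlation, assuming the fc6 features have already been extracted into NumPy arrays (one row per patch). The function names and the small epsilon guarding against division by zero are illustrative assumptions.

```python
import numpy as np


def normalized_cross_correlation(query, database):
    """NCC between one feature vector and each row of a feature matrix.

    Each vector is mean-subtracted and scaled to unit length, so the score
    lies in [-1, 1] and is invariant to affine rescaling of the features.
    """
    q = query - query.mean()
    d = database - database.mean(axis=1, keepdims=True)
    q = q / (np.linalg.norm(q) + 1e-12)
    d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
    return d @ q


def nearest_neighbors(query, database, k=5):
    """Return indices and scores of the k database rows most similar to query."""
    scores = normalized_cross_correlation(query, database)
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

Usage: `nearest_neighbors(features[i], features, k=5)` returns the patch itself first, followed by its closest matches.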

What is learned?

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

Still don’t capture everything

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

You don’t always need to learn!

[Figure columns: Input, Ours, Random Initialization, ImageNet AlexNet]

Object Detection

Pre-train on relative-position task, w/o labels

[Girshick et al. 2014]

Multi-Task Training?

Surface-normal Estimation

                   Error (Lower Better)   % Good Pixels (Higher Better)
No Pretraining     38.6   26.5            33.1   46.8   52.5
Unsup. Track.      34.2   21.9            35.7   50.6   57.0
Ours               33.2   21.3            36.0   51.2   57.8
ImageNet Labels    33.3   20.8            36.7   51.7   58.1

Visual Data Mining
• Sample a constellation of four adjacent patches from an image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance).
• Find the top 100 images which have the strongest matches for all four patches, ignoring spatial layout.
• Use a type of geometric verification to filter away the images where the four matches are not geometrically consistent.
• Apply the described mining algorithm to Pascal VOC 2011.
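The retrieval step (top 100 images, ignoring spatial layout) might be sketched as below. The slides do not specify how the four per-patch matches are combined, so summing each query patch's best match per image is an assumption, as are the function name and argument layout. The geometric verification step is not described in enough detail here to sketch and is omitted.

```python
import numpy as np


def rank_images(similarity, image_ids, top_k=100):
    """Rank images by how well they match all query patches, ignoring layout.

    similarity: (n_query, n_patches) scores of each query patch (four in the
                slides) against every database patch.
    image_ids:  (n_patches,) integer index of the image each database patch
                comes from.
    Returns the top_k image indices and their scores, best first.
    """
    n_query = similarity.shape[0]
    n_images = int(image_ids.max()) + 1
    best = np.full((n_query, n_images), -np.inf)
    for q in range(n_query):
        # Best match of query patch q within each image (in-place on the row view).
        np.maximum.at(best[q], image_ids, similarity[q])
    # Assumed aggregation: an image scores well only if all query patches match.
    scores = best.sum(axis=0)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

The candidates returned here would then be passed to the geometric verification filter described in the third bullet.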

Visual Data Mining

…

Via Geometric Verification

Simplified from [Chum et al. 2007]

Mined from Pascal VOC 2011

Layout Prediction

Visual Data Mining Algorithm results for 15,000 Street View images from Paris

Purity Test

So, do we need semantic labels?

Source Code & Supplementary Materials

• MagicInit
• Unsupervised Visual Representation Learning by Context Prediction
• Visual Data Mining Results on unlabeled PASCAL VOC 2011 Images
• Nearest Neighbors on PASCAL VOC 2007
• More

THANK YOU!
