Unsupervised Visual Representation Learning by Context Prediction
Most slides in this presentation are adapted from the authors' original presentation at ICCV 2015
Berkan Demirel
ImageNet + Deep Learning
Beagle
- Image Retrieval
- Detection (RCNN)
- Segmentation (FCN)
- Depth Estimation
- …
ImageNet + Deep Learning
Beagle
Do we need semantic labels?
Pose? Boundaries? Geometry? Parts? Materials?
Context as Supervision [Collobert & Weston 2008; Mikolov et al. 2013]
Deep Net
Context Prediction for Images
[Figure: patches A and B, with the eight cells surrounding the center of a 3-by-3 grid marked "?"]
Semantics from a non-semantic task
Randomly Sample Patch
Sample Second Patch
CNN CNN
Classifier
Relative Position Task: 8 possible locations
Patch Embedding
Input, Nearest Neighbors
CNN
Note: connects across instances!
Architecture
[Figure: two parallel stacks with tied weights, one for Patch 1 and one for Patch 2. Each stack contains five Convolution layers, three Max Pooling layers (two of them followed by LRN), and one Fully connected layer. The outputs of the two stacks feed two further Fully connected layers and a Softmax loss.]
Avoiding Trivial Shortcuts
Include a gap
Jitter the patch locations
Position in Image
A Not-So "Trivial" Shortcut
Chromatic Aberration
Solutions
Color Dropping: Randomly drop 2 of the 3 color channels from each patch, then replace the dropped colors with Gaussian noise (standard deviation ~1/100 the standard deviation of the remaining channel).
Projection: Shift green and magenta (red + blue) towards gray.
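The color-dropping step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `drop_colors`, the (H, W, 3) array layout, and the zero mean of the noise are assumptions (the slides give only the noise standard deviation).

```python
import numpy as np

def drop_colors(patch, rng):
    """Keep one random color channel and replace the other two with noise.

    patch: array of shape (H, W, 3). Returns (new_patch, kept_channel).
    """
    patch = np.asarray(patch, dtype=np.float64)
    keep = int(rng.integers(3))                 # index of the channel that survives
    # Noise std is ~1/100 of the std of the remaining channel (from the slides).
    noise_std = patch[..., keep].std() / 100.0
    # Assumption: zero-mean noise, which suits mean-subtracted patches.
    out = rng.normal(0.0, noise_std, size=patch.shape)
    out[..., keep] = patch[..., keep]           # restore the kept channel unchanged
    return out, keep

rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 255.0, size=(96, 96, 3))
dropped, kept = drop_colors(patch, rng)
```

With two channels reduced to low-amplitude noise, the network can no longer compare color channels against each other, which is the cue chromatic aberration provides.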
Implementation Details
• Train on the ImageNet 2012 training set (1.3M images), using only the images and discarding the labels.
• Resize each image to between 150K and 450K total pixels, preserving the aspect ratio.
• Sample patches at resolution 96-by-96.
• Sample the patches from a grid-like pattern. Each sampled patch can participate in as many as 8 separate pairings.
• Allow a gap of 48 pixels between the sampled patches in the grid, but also jitter the location of each patch in the grid by -7 to 7 pixels in each direction.
• Preprocess patches by (1) mean subtraction, (2) projecting or dropping colors, (3) randomly downsampling some patches to as little as 100 total pixels and then upsampling them, to build robustness to pixelation.
• Use batch normalization, without the scale and shift.
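The patch geometry above (96-pixel patches, a 48-pixel gap, jitter of up to 7 pixels) can be illustrated with a small sampler. This is a hedged sketch: it draws a single pair around a random center rather than walking a full grid as the slides describe, and the name `sample_pair` and the ordering of `OFFSETS` (and therefore the label numbering) are assumptions.

```python
import random

PATCH = 96    # patch side in pixels (from the slides)
GAP = 48      # nominal gap between neighboring patches (from the slides)
JITTER = 7    # maximum jitter per axis in pixels (from the slides)

# The 8 neighbor positions as (dx, dy) grid steps; the ordering is an assumption.
OFFSETS = [(-1, -1), (0, -1), (1, -1),
           (-1, 0),           (1, 0),
           (-1, 1),  (0, 1),  (1, 1)]

def sample_pair(width, height, rng):
    """Return (first_xy, second_xy, label) as top-left corners plus a class in 0..7."""
    stride = PATCH + GAP
    margin = stride + JITTER          # room for a jittered neighbor on every side
    x_max = width - margin - PATCH
    y_max = height - margin - PATCH
    if x_max < margin or y_max < margin:
        raise ValueError("image too small for a full 3-by-3 neighborhood")
    x0 = rng.randint(margin, x_max)   # un-jittered position of the first patch
    y0 = rng.randint(margin, y_max)
    label = rng.randrange(8)          # which of the 8 locations the second patch takes
    dx, dy = OFFSETS[label]
    first = (x0 + rng.randint(-JITTER, JITTER),
             y0 + rng.randint(-JITTER, JITTER))
    second = (x0 + dx * stride + rng.randint(-JITTER, JITTER),
              y0 + dy * stride + rng.randint(-JITTER, JITTER))
    return first, second, label

first, second, label = sample_pair(500, 400, random.Random(0))
```

Because both patches are jittered independently, the actual gap between horizontally or vertically adjacent patches varies between 34 and 62 pixels, so the classifier cannot rely on edges that line up exactly across the two patches.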
Experiments
• Chromatic Aberration
• Nearest-Neighbor Matching
• Object Detection
• Geometry Estimation
• Visual Data Mining
• Layout Prediction
Chromatic Aberration
CNN
Nearest-Neighbor Matching
• fc6 layer features from only one of the two stacks are used.
• fc7 and higher layers are removed.
• Normalized cross-correlation is used to find similar patches.
• Randomly selected 96x96 patches are used in the comparison.
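Matching by normalized cross-correlation can be sketched as follows, assuming the fc6 features have already been extracted into plain vectors. The function name `nearest_neighbors`, the array shapes, and the small epsilon guard are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors(query, database, k=3):
    """Rank database rows by normalized cross-correlation with the query.

    query: feature vector of shape (D,); database: array of shape (N, D).
    Returns (indices, scores) of the k best matches, best first.
    """
    q = query - query.mean()                            # center the query
    d = database - database.mean(axis=1, keepdims=True)  # center each row
    denom = np.linalg.norm(q) * np.linalg.norm(d, axis=1)
    scores = (d @ q) / np.maximum(denom, 1e-12)         # guard against zero norm
    order = np.argsort(-scores)[:k]                     # highest correlation first
    return order, scores[order]

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))      # stand-in for fc6 features of 50 patches
query = 2.0 * features[7] + 3.0           # row 7, rescaled and shifted
idx, scores = nearest_neighbors(query, features, k=3)
```

Normalized cross-correlation ignores the offset and positive scale of a feature vector, so the rescaled copy of row 7 in the usage lines still matches row 7 with a score of 1.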
What is learned?
Columns: Input, Ours, Random Initialization, ImageNet AlexNet
Still don't capture everything
Columns: Input, Ours, Random Initialization, ImageNet AlexNet
You don't always need to learn!
Columns: Input, Ours, Random Initialization, ImageNet AlexNet
Object Detection
Pre-train on relative-position task, w/o labels
[Girshick et al. 2014]
Multi-Task Training?
Surface-normal Estimation
Method             Error (Lower Better)    % Good Pixels (Higher Better)
No Pretraining     38.6    26.5            33.1    46.8    52.5
Unsup. Track.      34.2    21.9            35.7    50.6    57.0
Ours               33.2    21.3            36.0    51.2    57.8
ImageNet Labels    33.3    20.8            36.7    51.7    58.1
Visual Data Mining
• Sample a constellation of four adjacent patches from an image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance).
• Find the top 100 images which have the strongest matches for all four patches, ignoring spatial layout.
• Use a form of geometric verification to filter away the images where the four matches are not geometrically consistent.
• Apply the described mining algorithm to Pascal VOC 2011.
Visual Data Mining
…
Via Geometric Verification
Simplified from [Chum et al. 2007]
Mined from Pascal VOC 2011
Layout Prediction
Visual data mining algorithm results for 15,000 Street View images from Paris
Purity Test
So, do we need semantic labels?
Source Code & Supplementary Materials
• Magic Init
• Unsupervised Visual Representation Learning by Context Prediction
• Visual Data Mining Results on unlabeled PASCAL VOC 2011 Images
• Nearest Neighbors on PASCAL VOC 2007
• More
THANK YOU!