DrillDown: Interactive Retrieval of Complex Scenes Using Natural Language Queries
When we’d like to retrieve an image of a complex scene
Difficult to describe the whole scene in one sentence
Image Search Engine
Single sentence as query
No refinement (no interaction)
Find a specific image in our gallery album
or online image collection
Image Retrieval with Multiple Rounds of Queries
Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries.
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez.
Conf. on Neural Information Processing Systems. NeurIPS 2019. Vancouver, Canada. December 2019.
Previous efforts on Image-Text Matching
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
CNN RNN
1D Feature Space
[1] DeViSE: A Deep Visual-Semantic Embedding Model. Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS 2013.
[2] Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Andrej Karpathy, Armand Joulin, Li Fei-Fei. NIPS 2014.
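The single-vector matching setup on this slide can be sketched as follows. This is a minimal illustration, not the models in [1] or [2]: the 3-d vectors are hypothetical stand-ins for CNN image embeddings and an RNN sentence embedding, and cosine similarity is assumed as the matching score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by descending similarity to the query."""
    scores = {img_id: cosine(query_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings (assumption): one vector per image, one per sentence.
image_vecs = {
    "sofa_scene": [0.9, 0.1, 0.2],
    "beach_scene": [0.1, 0.8, 0.3],
    "street_scene": [0.2, 0.3, 0.9],
}
query_vec = [1.0, 0.0, 0.1]  # stand-in for "Two women sitting on the sofa"

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

With one vector per sentence and one per image, retrieval reduces to a nearest-neighbor search in the shared 1D feature space.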
Previous efforts on Image-Text Matching
[3] Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma. CVPR 2019.
Observations
Feature channels
Spatial dimensions
2D image representation can help distinguish instances sharing the same feature subspace
Observations
Feature channels
Spatial dimensions
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
1D sentence representation can NOT distinguish instances sharing the same feature subspace
Observations
Feature channels
Spatial dimensions
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
2D sentence representation
“person” subspace
“dog” subspace
“cat” subspace
Instance1
Instance2
Instance3
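A toy example of why the 1D representation loses instance information. Everything here is an illustrative assumption: the five one-hot "subspaces" and the bag-of-concepts encoder stand in for a learned text encoder, and summation stands in for pooling into one vector.

```python
# Hypothetical concept subspaces (assumption), one dimension each.
CONCEPTS = ["person", "white", "yellow", "dog", "cat"]

def encode(words):
    """Bag-of-concepts vector for one sentence (toy stand-in for a text encoder)."""
    return [1 if c in words else 0 for c in CONCEPTS]

def pooled_1d(sentences):
    """Collapse all sentences into a single vector by summing."""
    vecs = [encode(s) for s in sentences]
    return [sum(col) for col in zip(*vecs)]

def per_instance_2d(sentences):
    """Keep one row per described instance (sorted so row order does not matter)."""
    return sorted(encode(s) for s in sentences)

# Scene A: white shirt + dog, yellow shirt + cat. Scene B swaps the pets.
scene_a = [{"person", "white", "dog"}, {"person", "yellow", "cat"}]
scene_b = [{"person", "white", "cat"}, {"person", "yellow", "dog"}]

same_1d = pooled_1d(scene_a) == pooled_1d(scene_b)
same_2d = per_instance_2d(scene_a) == per_instance_2d(scene_b)
print(same_1d, same_2d)  # True False
```

The pooled vectors are identical for both scenes, while the per-instance rows still record which attributes belong to which person.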
We still want compact representations
Especially for retrieval applications
Sentence 1 → Feature vector 1
Sentence 2 → Feature vector 2
Sentence 3 → Feature vector 3
...
Text input
Pre-allocated state vectors
Text feature
Action: which state vector to update
Update the state vector
Pairwise alignment between state vectors and image regions
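The pipeline on these slides (text feature, choice of state vector, update, alignment with image regions) can be sketched roughly as below. The specifics are assumptions for illustration only, not the method from the paper: the slot is chosen by a cosine-similarity threshold, the update is a simple average, and the image score is the mean over state vectors of the best-matching region. The class `QueryState` and the function `alignment_score` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

class QueryState:
    """A fixed number of pre-allocated state vectors (hypothetical sketch)."""

    def __init__(self, num_states, dim):
        self.states = [[0.0] * dim for _ in range(num_states)]
        self.used = [False] * num_states

    def choose_slot(self, text_feat, threshold=0.5):
        """Action: reuse a similar used slot, else take the first empty slot."""
        best, best_sim = None, threshold
        for i, state in enumerate(self.states):
            if self.used[i]:
                sim = cosine(state, text_feat)
                if sim > best_sim:
                    best, best_sim = i, sim
        if best is not None:
            return best
        for i, is_used in enumerate(self.used):
            if not is_used:
                return i
        # All slots are used and none is similar enough: pick the closest one.
        return max(range(len(self.states)),
                   key=lambda i: cosine(self.states[i], text_feat))

    def update(self, text_feat):
        """Update the chosen state vector (averaging is an assumed stand-in)."""
        i = self.choose_slot(text_feat)
        if self.used[i]:
            self.states[i] = [(a + b) / 2 for a, b in zip(self.states[i], text_feat)]
        else:
            self.states[i] = list(text_feat)
            self.used[i] = True
        return i

def alignment_score(query_state, regions):
    """Mean over used state vectors of the best cosine match among image regions."""
    used = [s for s, u in zip(query_state.states, query_state.used) if u]
    if not used:
        return 0.0
    return sum(max(cosine(s, r) for r in regions) for s in used) / len(used)

# Three rounds of hypothetical text features; the third refines the first.
state = QueryState(num_states=3, dim=3)
slots = [state.update(f) for f in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0])]

score_a = alignment_score(state, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
score_b = alignment_score(state, [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
print(slots, score_a > score_b)
```

The query representation stays at a fixed size regardless of how many rounds of text the user provides, which is the compactness argued for on the earlier slide.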
Simulated queries through region-phrase annotations at training time
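One way such simulated queries might be generated is sketched below. The sampling scheme is an assumption (uniform sampling without replacement), and `simulate_queries` is a hypothetical helper; the example phrases are the ones used earlier in these slides.

```python
import random

def simulate_queries(region_phrases, num_rounds, seed=0):
    """Sample up to num_rounds distinct region phrases as a multi-round query."""
    rng = random.Random(seed)
    k = min(num_rounds, len(region_phrases))
    return rng.sample(region_phrases, k)

# Region-phrase annotations for one training image (illustrative).
annotations = [
    "two women sitting on the sofa",
    "woman in white shirt holding a dog",
    "woman in yellow shirt holding a cat",
]

queries = simulate_queries(annotations, num_rounds=2)
print(queries)
```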
Human queries
Quantitative evaluation on a test set of 10000 images
Although the more state vectors, the better,
we could have an even more compact representation
Target
Future work: instance-aware text encoder for dialog-based applications?
Potential challenges:
● Named entity detection
● Coreference resolution
● Negation
● ...
Q&A