DrillDown: Interactive Retrieval of Complex Scenes Using Natural Language Queries

When we’d like to retrieve an image of a complex scene

Difficult to describe the whole scene in one sentence

Image Search Engine

Single sentence as query

No refinement (no interaction)

Find a specific image in our gallery album

or online image collection

Image Retrieval with Multiple Rounds of Queries

Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries. Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez. Conf. on Neural Information Processing Systems (NeurIPS 2019). Vancouver, Canada. December 2019.

Previous efforts on Image-Text Matching

Two women sitting on the sofa

Woman in white shirt holding a dog

Woman in yellow shirt holding a cat

CNN RNN

1D Feature Space

[1] DeViSE: A Deep Visual-Semantic Embedding Model. Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS 2013.

[2] Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Andrej Karpathy, Armand Joulin, Li Fei-Fei. NIPS 2014.

Previous efforts on Image-Text Matching

[3] Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma. CVPR 2019.

Observations

Feature channels

Spatial dimensions

2D image representation can help distinguish instances sharing the same feature subspace

Observations

Feature channels

Spatial dimensions

1D sentence representation can NOT distinguish instances sharing the same feature subspace

Observations

Feature channels

Spatial dimensions

2D sentence representation

“person” subspace

“dog” subspace

“cat” subspace

Instance1

Instance2

Instance3
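The observation above can be checked with a small numerical sketch. The feature layout below (one channel each for "woman", "white shirt", "yellow shirt", "dog", "cat") and the two scenes are hypothetical, chosen to mirror the example captions; they are not taken from the DrillDown model. Summing per-instance features into a single vector makes the two scenes indistinguishable, while keeping one row per instance does not.

```python
# Hypothetical feature channels: [woman, white shirt, yellow shirt, dog, cat]
WOMAN_WHITE_DOG = [1, 1, 0, 1, 0]
WOMAN_YELLOW_CAT = [1, 0, 1, 0, 1]
WOMAN_WHITE_CAT = [1, 1, 0, 0, 1]
WOMAN_YELLOW_DOG = [1, 0, 1, 1, 0]

# Each scene is a 2D representation: one row per instance.
scene_a = [WOMAN_WHITE_DOG, WOMAN_YELLOW_CAT]
scene_b = [WOMAN_WHITE_CAT, WOMAN_YELLOW_DOG]


def pool_to_1d(scene):
    """Collapse a scene into one vector by summing over instances."""
    return [sum(channel) for channel in zip(*scene)]


def same_instances(scene_x, scene_y):
    """Compare scenes as unordered collections of instance vectors."""
    return sorted(scene_x) == sorted(scene_y)


pooled_a = pool_to_1d(scene_a)
pooled_b = pool_to_1d(scene_b)

print(pooled_a == pooled_b)              # True: the 1D vectors collide
print(same_instances(scene_a, scene_b))  # False: the 2D rows still differ
```

Both scenes contain the same attributes overall, so any pooling that discards which instance owns which attribute maps them to the same point.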

We still want compact representations

Especially for retrieval applications

Feature vector 1

Sentence 1



...

Text input

Pre-allocated state vectors

Text feature

Action: which state vector to update

Update the state vector

Pairwise alignment between state vectors and image regions
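The pipeline above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the action is a greedy rule (fill an empty state vector first, otherwise pick the most similar one), the update is a plain average, and the alignment score sums, over state vectors, the best cosine similarity to any image region. The names `choose_state`, `update_state`, and `alignment_score` are hypothetical, as are the toy features.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def choose_state(states, text_feature):
    """Action: pick the index of the state vector to update.

    Assumed greedy rule: use the first empty (all-zero) slot if one
    exists, otherwise the slot most similar to the new text feature.
    """
    for i, state in enumerate(states):
        if not any(state):
            return i
    sims = [cosine(state, text_feature) for state in states]
    return sims.index(max(sims))


def update_state(state, text_feature):
    """Assumed update: copy into an empty slot, else average."""
    if not any(state):
        return list(text_feature)
    return [(s + t) / 2.0 for s, t in zip(state, text_feature)]


def alignment_score(states, regions):
    """Sum, over non-empty state vectors, of the best region match."""
    return sum(
        max(cosine(state, region) for region in regions)
        for state in states
        if any(state)
    )


# Two pre-allocated state vectors in a toy 3-dimensional feature space.
states = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy text features

for text_feature in queries:
    k = choose_state(states, text_feature)
    states[k] = update_state(states[k], text_feature)

image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # regions match both queries
image_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # regions match only the first

score_a = alignment_score(states, image_a)
score_b = alignment_score(states, image_b)
print(score_a > score_b)  # True: image_a ranks above image_b
```

Because the representation is a fixed number of state vectors, its size stays constant no matter how many query rounds the user issues.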

Simulated queries through region-phrase annotations at training time

Human queries

Quantitative evaluation on a test set of 10,000 images

Although the more state vectors, the better

We could have an even more compact representation



Target

Future work: instance-aware text encoder for dialog-based applications?

Potential challenges:
● Named entity detection
● Coreference resolution
● Negation
● ...

Q&A