
Renmin University of China and Zhejiang Gongshang University at TRECVID 2018: Deep Cross-Modal Embeddings for Video-Text Retrieval

Xirong Li†, Jianfeng Dong∗, Chaoxi Xu†, Jing Cao†, Xun Wang∗, Gang Yang†

†AI & Media Computing Lab, Renmin University of China

Beijing, China

∗School of Computer and Information Engineering, Zhejiang Gongshang University

Hangzhou, China

Abstract

In this paper we summarize our TRECVID 2018 [1] video retrieval experiments. We participated in two tasks: Ad-hoc Video Search (AVS) and Video-to-Text (VTT) Matching and Ranking. For the AVS task, we develop our solutions based on W2VV++, a super version of Word2VisualVec (W2VV) [4]. For the VTT task, our entry is built on top of a recently proposed dual encoding network [5], which encodes an input, be it a video or a natural language sentence, at multiple levels. The 2018 edition of the TRECVID benchmark has been a fruitful participation for our joint team, resulting in the best overall result for both the AVS and VTT tasks. Retrospective experiments show that our ad-hoc video search system, used as is, also outperforms the previous best results of the TRECVID 2016 and 2017 AVS tasks. We have released feature data at https://github.com/li-xirong/avs.

1 Ad-hoc Video Search

1.1 Approach

For the ad-hoc video search task, we develop a super version of Word2VisualVec (W2VV) [4], which we term W2VV++. The original W2VV model is a deep neural network that projects a given sentence into a visual feature space by first vectorizing the sentence with a multi-scale encoding strategy. The encoding result then goes through a multilayer perceptron (MLP) to produce a feature vector r(s). The network is trained such that the loss between a given video v and the sentence s, defined as the Mean Squared Error (MSE) between r(s) and the video feature φ(v), is minimized. For cross-modal matching the cosine similarity between φ(v) and r(s) is used, denoted by Sθ(v, s), where θ indicates all the trainable parameters in the model. While W2VV has been shown to be effective in the 2016 and 2017 TRECVID video-to-text matching tasks [11, 12], the MSE based loss limits its ability to exploit many negative samples during the training stage.
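To make the description concrete, the following minimal PyTorch sketch mirrors this pipeline: an already-vectorized sentence is mapped by an MLP to r(s), trained with an MSE loss against φ(v), and cosine similarity is used for matching. The class name, layer sizes and random tensors are illustrative assumptions, not the exact implementation of [4].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class W2VVSketch(nn.Module):
    """Toy W2VV head: maps an already-vectorized sentence into the
    video feature space (sizes are illustrative, not the paper's)."""
    def __init__(self, sent_dim=1024, vis_dim=2048, hidden=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, vis_dim),
        )

    def forward(self, sent_vec):
        return self.mlp(sent_vec)              # r(s)

model = W2VVSketch()
sent_vec = torch.randn(8, 1024)                # multi-scale sentence encodings
phi_v = torch.randn(8, 2048)                   # video features φ(v)

r_s = model(sent_vec)
mse_loss = F.mse_loss(r_s, phi_v)              # training objective of the original W2VV

# At matching time, videos are ranked by cosine similarity S(v, s).
similarity = F.cosine_similarity(r_s, phi_v, dim=1)
```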

W2VV++ improves over W2VV by substituting an improved marginal ranking loss [7] for the MSE loss. Given a video-sentence pair (v, s), the new loss is defined as:

l(v, s; θ) = max_{v−} (0, α + Sθ(v−, s) − Sθ(v, s)),    (1)

where v− is the hardest negative video sample for the sentence s. Following [7], we define the hardest negative example as the most similar negative sample to s in a mini-batch. As illustrated in Fig. 1, we investigate three variants of W2VV++, depending on 1) the choice of video features and 2) whether an extra fully connected layer is added for video feature re-learning [6].
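The following minimal PyTorch sketch illustrates Eq. (1) together with the hardest-negative definition above: all video-sentence similarities within a mini-batch are computed, the diagonal holds the matching pairs, and only the most similar negative video per sentence enters the hinge. The margin value and batch layout are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

def improved_ranking_loss(video_feats, sent_feats, margin=0.2):
    """Max-hinge loss of Eq. (1): for each sentence, only the hardest
    (most similar) negative video in the mini-batch is penalized."""
    v = F.normalize(video_feats, dim=1)
    s = F.normalize(sent_feats, dim=1)
    sim = v @ s.t()                               # sim[j, i] = S(v_j, s_i)
    pos = sim.diag()                              # S(v_i, s_i) for matching pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    hardest_neg = sim.masked_fill(eye, float('-inf')).max(dim=0).values
    return F.relu(margin + hardest_neg - pos).mean()

# Toy mini-batch of 16 aligned video/sentence embeddings.
loss = improved_ranking_loss(torch.randn(16, 2048), torch.randn(16, 2048))
```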

For video representation, we uniformly sample frames at an interval of 0.5 seconds. Deep visual features are extracted per frame by pre-trained CNN models. In particular, we adopt a ResNet-152 model used in [3] and a ResNeXt-101 model used in [12]. Accordingly, two 2,048-dim video-level features are obtained by mean pooling over the frames.
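A minimal sketch of this video-level feature construction is shown below, using torchvision's ResNet-152 trunk as a stand-in for the pre-trained CNNs; the random frame tensor is a placeholder for frames sampled every 0.5 seconds and preprocessed to the network's expected input size.

```python
import torch
import torchvision.models as models

# ResNet-152 with the classification layer removed -> 2048-dim per frame
# (downloads ImageNet weights on first use).
cnn = models.resnet152(weights="IMAGENET1K_V1")
cnn.fc = torch.nn.Identity()
cnn.eval()

# Placeholder for N preprocessed frames (N, 3, 224, 224).
frames = torch.randn(20, 3, 224, 224)

with torch.no_grad():
    frame_feats = cnn(frames)             # (20, 2048) per-frame features
video_feat = frame_feats.mean(dim=0)      # 2048-dim video-level feature φ(v)
```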

For sentence representation, we keep the sentence encoder of W2VV. That is, a given sentence is first vectorized in parallel by three vectorization strategies, namely Bag-of-Words (BoW), word2vec and a Gated Recurrent Unit (GRU). The outputs of the three encoding blocks are concatenated and forwarded to a fully connected layer for common space learning.
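The multi-scale sentence vectorization can be sketched as follows; the vocabulary size, the dimensions, and the trainable embedding standing in for pre-trained word2vec are simplifying assumptions rather than the exact W2VV configuration.

```python
import torch
import torch.nn as nn

class MultiScaleSentenceEncoder(nn.Module):
    """BoW + (word2vec-style) embedding average + GRU, concatenated and
    mapped to the target space by a fully connected layer."""
    def __init__(self, vocab=8000, w2v_dim=500, gru_dim=512, out_dim=2048):
        super().__init__()
        self.vocab = vocab
        self.embed = nn.Embedding(vocab, w2v_dim)   # stands in for pre-trained word2vec
        self.gru = nn.GRU(w2v_dim, gru_dim, batch_first=True)
        self.fc = nn.Linear(vocab + w2v_dim + gru_dim, out_dim)

    def forward(self, word_ids):                    # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)                  # (batch, seq_len, w2v_dim)
        bow = torch.zeros(word_ids.size(0), self.vocab).scatter_add_(
            1, word_ids, torch.ones_like(word_ids, dtype=torch.float))
        w2v = emb.mean(dim=1)                       # averaged word embeddings
        _, h = self.gru(emb)                        # final GRU hidden state
        return self.fc(torch.cat([bow, w2v, h.squeeze(0)], dim=1))

encoder = MultiScaleSentenceEncoder()
r_s = encoder(torch.randint(0, 8000, (4, 12)))      # (4, 2048) sentence vectors
```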

We train W2VV++ on a joint collection of MSR-VTT [8] and TGIF [10], with hyper-parameters tuned on the training sets of the previous TRECVID VTT task.

1.2 Submissions

We submit the following runs:

• Run 4 is W2VV++ that predicts the ResNeXt-101 + ResNet-152 feature for a given sentence. The cross-modal similarity between the given sentence and any video from the IACC.3 collection is implemented as the cosine similarity between their corresponding feature vectors.

• Run 3 differs from Run 4 in two ways. First, it uses only the ResNeXt-101 feature. Second, it adds a fully connected layer on top of the video feature.


Figure 1: Conceptual diagrams of the W2VV++ models used in our runs: (a) the original Word2VisualVec (W2VV) [4], trained by minimizing the mean squared error; (b) the model for Run 4, (c) the model for Run 3, and (d) the model for Run 2, all trained by minimizing the improved marginal ranking loss. Our best run, i.e. Run 1, equally combines the W2VV++ models used in the other runs.

• Run 2 is similar to Run 3 but re-uses the ResNeXt-101 + ResNet-152 feature.

• Run 1 equally combines the models from all the other runs, trained with different setups (a scoring and fusion sketch follows this list).
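As a concrete illustration of how the runs score and combine, the numpy sketch below ranks a video collection by cosine similarity for a single text query and fuses several models, Run 1 style, by averaging their similarity scores with equal weights; the array shapes and the two-model setup are illustrative assumptions.

```python
import numpy as np

def cosine_scores(query_vec, video_feats):
    """Cosine similarity between one predicted sentence vector and all videos."""
    q = query_vec / np.linalg.norm(query_vec)
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    return v @ q

# Precomputed video-level features for the collection (e.g. IACC.3 shots).
video_feats = np.random.randn(1000, 2048)

# Each single model (Run 2/3/4 style) produces its own sentence embedding.
scores_per_model = [cosine_scores(np.random.randn(2048), video_feats)
                    for _ in range(2)]

# Run 1 style fusion: average the per-model similarity scores with equal
# weights, then rank videos by the fused score (best first).
fused = np.mean(scores_per_model, axis=0)
ranking = np.argsort(-fused)
```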

An overview of the AVS task benchmark is shown in Fig. 2. Run 4 serves as our baseline and is already better than all submissions from the other teams, which shows the effectiveness of W2VV++. Run 2, by adding an additional feature re-learning layer, outperforms Run 4. While the improvement appears to be marginal, their ensemble, as demonstrated by Run 1, gives a noticeable performance boost. The result indicates that these single models are complementary to each other.

Table 1: Retrospective experiments on the AVS tasks of the previous years. Our runs outperform the previous best runs.

                       TRECVID edition
                       2016        2017        2018
  Previous best run    0.054 [9]   0.206 [12]  –
  Ours:
    Run 4              0.149       0.176       0.104
    Run 3              0.140       0.171       0.103
    Run 2              0.151       0.213       0.106
    Run 1              0.149       0.220       0.121

A retrospective experiment on the AVS tasks of the previous years is reported in Table 1. Our models, used as is, outperform the previous best runs. The results again confirm the effectiveness of W2VV++. Moreover, considering that the video pool stays the same while the queries change each year, the retrospective experiment suggests that the 2018 topics are the most difficult, while the 2017 topics seem to be the easiest.

2 Video to Text Matching

For video-to-text matching, we participate in the Matching and Ranking subtask. Given a video, participants were asked to rank a list of pre-defined candidate sentences in terms of their relevance with respect to the given video. In the 2018 edition, the test video set consists of 1,904 videos collected from Twitter Vine. Five sentence sets are provided by the task organizers, denoted as setA, setB, setC, setD and setE. Each sentence set has 1,921 sentences.
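For reference, the sketch below shows how such a ranking is typically scored with the mean inverted rank (MIR) metric reported in the official results, under the assumption that each video has exactly one ground-truth sentence per set; the similarity scores here are random placeholders.

```python
import numpy as np

def mean_inverted_rank(score_matrix, gt_index):
    """score_matrix[i, j]: similarity of video i to candidate sentence j.
    gt_index[i]: index of the ground-truth sentence of video i in the set.
    Returns the mean of 1 / rank of the ground-truth sentence."""
    inv_ranks = []
    for scores, gt in zip(score_matrix, gt_index):
        order = np.argsort(-scores)                  # best-first ranking
        rank = int(np.where(order == gt)[0][0]) + 1  # 1-based rank of ground truth
        inv_ranks.append(1.0 / rank)
    return float(np.mean(inv_ranks))

# Toy example: 3 videos, 5 candidate sentences, random similarity scores.
scores = np.random.rand(3, 5)
print(mean_inverted_rank(scores, gt_index=[0, 3, 2]))
```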

2.1 Approach

Our entry is built on top of a recently proposed dual encoding network [5]. The dual encoding network uses the same architecture to learn powerful representations for two sequential inputs of distinct modalities, i.e. video as a sequence of frames and sentence as a sequence of words, at multiple levels. In particular, the encoding network consists of three encoding blocks that are implemented by mean pooling, a bidirectional GRU (biGRU) and a biGRU-CNN, respectively. The three blocks are stacked to explicitly model global, local and temporal patterns in both videos and sentences. The output of a specific encoding block is not only used as input of a follow-up encoding block, but also re-used via a skip connection to contribute to the final output of the entire encoding network. The network thus generates new, higher-level features progressively. These features, generated at distinct levels, are powerful and complementary to each other.


Figure 2: Overview of the TRECVID 2018 ad-hoc video search task benchmark, all runs ranked according to mean infAP (inferred average precision). We were the best overall performer.

We therefore combine them to obtain robust video / sentence representations by a simple concatenation operation. After videos and sentences have been encoded by the dual encoding network, we employ a state-of-the-art common space learning algorithm [7] to project the two modalities into a common space. Consequently, the relevance between a given video v and a sentence s is computed as the cosine similarity between the video feature f(v) and the sentence feature f(s) in the common space. For more technical details we refer interested readers to [5].
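The multi-level encoding and concatenation described above can be sketched as follows for one branch (the other branch is symmetric); the dimensions, the single convolution standing in for the biGRU-CNN block, and the max pooling over time are illustrative assumptions rather than the exact configuration of [5].

```python
import torch
import torch.nn as nn

class MultiLevelEncoder(nn.Module):
    """Level 1: mean pooling; level 2: biGRU; level 3: CNN on the biGRU outputs.
    The three outputs are concatenated into one multi-level representation."""
    def __init__(self, in_dim=500, rnn_dim=512, cnn_dim=512, kernel=3):
        super().__init__()
        self.bigru = nn.GRU(in_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * rnn_dim, cnn_dim, kernel, padding=kernel // 2)

    def forward(self, x):                    # x: (batch, steps, in_dim)
        level1 = x.mean(dim=1)               # global encoding
        rnn_out, _ = self.bigru(x)           # (batch, steps, 2*rnn_dim)
        level2 = rnn_out.mean(dim=1)         # local / sequential encoding
        conv_out = torch.relu(self.conv(rnn_out.transpose(1, 2)))
        level3 = conv_out.max(dim=2).values  # temporal-pattern encoding
        return torch.cat([level1, level2, level3], dim=1)

enc = MultiLevelEncoder()
feat = enc(torch.randn(2, 30, 500))          # (2, 500 + 1024 + 512) = (2, 2036)
```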

The model is trained on a combined set of MSR-VTT [8], MSVD [2] and TGIF [10], with hyper-parameters tuned on the TRECVID 2016 VTT training set.

2.2 Submissions

We submit the following four runs:

• Run 0 is the dual encoding model using the ResNeXt-101 feature.

• Run 1 equally combines eight models, among which four models are based on Run 0 but vary in their last FC layer: a FC layer, a FC layer with a tanh activation, a FC layer with a BN layer, and a FC layer with a BN layer and a tanh activation (these output heads are sketched after Table 2). The other four models are trained in a similar manner, but using the ResNet-152 feature.

• Run 2 equally combines eight W2VV++ models. Four models are separately trained using the ResNeXt-101 feature, with the sentence vectorization varied: BoW, GRU using the last output, GRU using the mean of all outputs, and multi-scale sentence vectorization, respectively. The other four models are trained in a similar manner, but using the ResNeXt-101 + ResNet-152 feature.

• Run 3 combines Run 1, Run 2 and eight VSE++ models [7]. Besides the original VSE++, we train multiple variants, including 1) VSE++ with two FC layers and 2) substituting BoW for the GRU. Four models use the ResNeXt-101 feature, while the other four models use the ResNet-152 feature.

Table 2: Our runs in the TRECVID 2018 VTT matching and ranking task.

  Ours     setA    setB    setC    setD    setE
  Run 0    0.450   0.448   0.430   0.436   0.448
  Run 1    0.505   0.502   0.495   0.494   0.500
  Run 2    0.458   0.453   0.448   0.436   0.455
  Run 3    0.516   0.505   0.492   0.491   0.509
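The four last-layer variants mentioned for Run 1 can be instantiated as in the sketch below; the input and output sizes are assumed, and this covers only the output heads, not the full dual encoding ensemble.

```python
import torch.nn as nn

def make_head(in_dim=2048, out_dim=1536, use_bn=False, use_tanh=False):
    """Build one of the four last-layer variants used by the Run 1 ensemble."""
    layers = [nn.Linear(in_dim, out_dim)]
    if use_bn:
        layers.append(nn.BatchNorm1d(out_dim))
    if use_tanh:
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)

heads = [make_head(use_bn=bn, use_tanh=tanh)
         for bn in (False, True) for tanh in (False, True)]
# -> FC, FC + tanh, FC + BN, FC + BN + tanh
```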

An overview of all submissions is shown in Fig. 3 and Fig. 4. Concrete numbers are summarized in Table 2. The leading position of our runs clearly demonstrates the effectiveness of the dual encoding network.

Acknowledgments

The authors are grateful to NIST and the TRECVID coordinators for the benchmark organization effort. The Renmin University of China acknowledges support by NSFC (No. 61672523, No. 61773385), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 18XNLG19). The Zhejiang Gongshang University acknowledges support by NSFC (No. U1609215, No. 61672460) and the Zhejiang Provincial Natural Science Foundation of China (No. Q19F020013).

References

[1] G. Awad, A. Butt, K. Curtis, Y. Lee, J. Fiscus, A. Godil, D. Joy, A. Delgado, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quénot, J. Magalhaes, D. Semedo, and S. Blasi. TRECVID 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In Proceedings of TRECVID 2018. NIST, USA, 2018.

[2] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.

[3] J. Dong, S. Huang, D. Xu, and D. Tao. DL-61-86 at TRECVID 2017: Video-to-text description. In TRECVID Workshop, 2017.

[4] J. Dong, X. Li, and C. G. M. Snoek. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 20(12):3377–3388, 2018.

[5] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang. Dual encoding for zero-example video retrieval. In CVPR, 2019.

[6] J. Dong, X. Li, C. Xu, G. Yang, and X. Wang. Feature re-learning with data augmentation for content-based video recommendation. In ACMMM, 2018.

[7] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2018.

[8] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.

[9] D.-D. Le, S. Phan, V.-T. Nguyen, B. Renoust, T. A. Nguyen, V.-N. Hoang, T. D. Ngo, M.-T. Tran, Y. Watanabe, M. Klinkigt, et al. NII-HITACHI-UIT at TRECVID 2016. In TRECVID Workshop, 2016.

[10] Y. Li, Y. Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo. TGIF: A new dataset and benchmark on animated GIF description. In CVPR, 2016.

[11] C. G. M. Snoek, J. Dong, X. Li, X. Wang, Q. Wei, W. Lan, E. Gavves, N. Hussein, D. C. Koelma, and A. W. M. Smeulders. University of Amsterdam and Renmin University at TRECVID 2016: Searching video, detecting events and describing video. In TRECVID Workshop, 2016.

[12] C. G. M. Snoek, X. Li, C. Xu, and D. C. Koelma. University of Amsterdam and Renmin University at TRECVID 2017: Searching video, detecting events and describing video. In TRECVID Workshop, 2017.


Figure 3: Overview of the TRECVID 2018 video-to-text matching and ranking task benchmark on (a) setA, (b) setB and (c) setC, all runs ranked according to mean inverted rank (MIR). We were the best overall performer.


Figure 4: Overview of the TRECVID 2018 video-to-text matching and ranking task benchmark on (a) setD and (b) setE, all runs ranked according to mean inverted rank (MIR). We were the best overall performer.