Confidential + Proprietary
Collaborative Deep Metric Learning for Video Understanding
Machine Perception, Google Research
August 23, 2018
Balakrishnan Varadarajan
Paul Natsev
Joonseok Lee
Sami Abu-El-Haija
What is Video Understanding?
Figure skating, Winter sports, Ice rink, Pair skating
What is Video Understanding?
{(238, 204, 187), (238, 187, 187), …, (255, 221, 221), (255, 238, 204), …, (255, 238, 221), (238, 238, 221), …}
Figure skating, Winter sports, Ice rink, Pair skating
Goal
Collaborative Deep Metric Learning
We’d like to learn a content-aware video embedding,
preserving video-video similarity using collaborative filtering.
Overview
Pre-trained Deep Models → Visual/Audio Features → Embedding Model → Video Embedding
Video Embedding → Nearest Neighbor Search → Related Video Retrieval
Video Embedding → User Modeling → Video Recommendation
Video Embedding → Classifier → Video Annotation
Feature Extraction
[Figure: feature extraction pipeline. Diagram labels: Frames, Image feature extractor, Pooling, L2, Video feature; Audio, Audio feature extractor, Pooling, L2, Audio feature; FC, L2, FC, Final embedding.]
Embedding Models: Triplet Loss
● Train a model, preserving pairwise distance between videos.
● Triplet loss [1]: {anchor, positive} closer than {anchor, negative}.
[1] F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015.
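The triplet objective can be illustrated with a minimal NumPy sketch. It assumes squared Euclidean distance and a hinge with a margin of 0.5; the function name `triplet_loss`, the margin value, and the toy embeddings are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss averaged over a batch.

    Each argument has shape (batch, dim). The margin value is an assumption.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive squared distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative squared distance
    # The loss is zero once the negative is farther than the positive by `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy 2-D embeddings (illustrative values only).
anchor = np.array([[1.0, 0.0]])
positive = np.array([[0.9, 0.1]])
negative = np.array([[-1.0, 0.0]])

loss_good = triplet_loss(anchor, positive, negative)  # positive already closer
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped: large loss
```

When the positive is already much closer than the negative, the loss vanishes; swapping the roles yields a large penalty, which is what pushes related videos together during training.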
Ground Truth for “Related Videos”?
● Collaborative Filtering [2]: Users (implicitly) collaborate to filter items relevant to themselves by annotating their preferences.
● Collected users’ aggregated watch history from YouTube.
○ Co-watched videos are regarded as related.
[2] D. Goldberg, D. Nichols, B. M. Oki, D. Terry. Using collaborative filtering to weave an information tapestry, Communications of the ACM, 1992.
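One way to picture how co-watch data could become training triplets is sketched below. The slides do not describe the actual pipeline, so the threshold `min_count`, the random negative sampling, and the helper names `cowatch_counts` and `sample_triplets` are all hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def cowatch_counts(histories):
    """Count how many users watched each unordered pair of videos."""
    counts = Counter()
    for videos in histories:
        for a, b in combinations(sorted(set(videos)), 2):
            counts[(a, b)] += 1
    return counts

def sample_triplets(counts, all_videos, min_count=2, seed=0):
    """Build (anchor, positive, negative) triplets from co-watch counts.

    A pair is treated as related if at least `min_count` users co-watched it
    (an assumed threshold). The negative is a random video that was never
    co-watched with the anchor (an assumed sampling scheme).
    """
    rng = random.Random(seed)
    cowatched = {}
    for a, b in counts:
        cowatched.setdefault(a, set()).add(b)
        cowatched.setdefault(b, set()).add(a)
    triplets = []
    for (a, b), c in sorted(counts.items()):
        if c < min_count:
            continue
        negatives = [v for v in all_videos if v != a and v not in cowatched[a]]
        if negatives:
            triplets.append((a, b, rng.choice(negatives)))
    return triplets

# Toy watch histories, one list per user (illustrative only).
histories = [["v1", "v2", "v3"], ["v1", "v2"], ["v4", "v5"], ["v4", "v5", "v2"]]
counts = cowatch_counts(histories)
videos = sorted({v for h in histories for v in h})
triplets = sample_triplets(counts, videos)
```

In this toy data only (v1, v2) and (v4, v5) are co-watched by two users, so they become the positive pairs.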
Related Video Retrieval
Related Video Retrieval
● Task: Given a query video q with its content features x_q, rank videos in a candidate set according to relevance to q.
● Cold-start
○ Training on triplets from T2T.
○ Evaluation on T2E and E2T.
● Dataset
○ Trained on 500M triplets.
○ Tested on 100M eval set (T2E + E2T).
Cold-start split of (anchor, positive) pairs:

                          Positive
                          Training (70%)   Eval (30%)
Anchor  Training (70%)    1: T2T           2: T2E
        Eval (30%)        3: E2T
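The split above can be sketched as a small grouping function. It assumes each video has already been assigned to the training (T) or eval (E) partition; the dictionary `partition_of`, the helper `split_pairs`, and the toy IDs are hypothetical.

```python
def split_pairs(pairs, partition_of):
    """Group (anchor, positive) pairs by the partitions of their two videos.

    `partition_of` maps a video ID to "T" (training) or "E" (eval).
    The slides name only T2T, T2E and E2T; pairs with both videos in the
    eval partition are kept in a separate "E2E" bucket here.
    """
    groups = {"T2T": [], "T2E": [], "E2T": [], "E2E": []}
    for anchor, positive in pairs:
        key = f"{partition_of[anchor]}2{partition_of[positive]}"
        groups[key].append((anchor, positive))
    return groups

# Toy assignment (illustrative only).
partition_of = {"a": "T", "b": "T", "c": "E", "d": "E"}
pairs = [("a", "b"), ("a", "c"), ("c", "b"), ("c", "d")]

groups = split_pairs(pairs, partition_of)
train_pairs = groups["T2T"]                 # used to build training triplets
eval_pairs = groups["T2E"] + groups["E2T"]  # cold-start evaluation
```

Every evaluation pair involves at least one video the model never saw during training, which is what makes the setting cold-start.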
Personalized Video Recommendation
Personalized Video Recommendation
● Task: Given a user with watch history Q = {x_1, x_2, …, x_|Q|}, rank videos in a candidate set according to the user’s preference.
○ Similar to related video retrieval, but with multiple query videos.
● Average aggregation: favors videos that are, on the whole, in harmony with the entire watch history.
● Max aggregation: favors videos most related to any one of the user’s tastes.
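The difference between the two aggregations can be seen in a small NumPy sketch. It assumes relevance is the dot product between embeddings; the function `rank_candidates` and the 2-D toy embeddings are illustrative assumptions.

```python
import numpy as np

def rank_candidates(history, candidates, aggregation="avg"):
    """Rank candidates against a watch history.

    history: (|Q|, dim) embeddings; candidates: (|V|, dim) embeddings.
    Returns (candidate indices, best first; aggregated score per candidate).
    """
    scores = history @ candidates.T  # (|Q|, |V|) pairwise dot products
    if aggregation == "avg":
        agg = scores.mean(axis=0)
    elif aggregation == "max":
        agg = scores.max(axis=0)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return np.argsort(-agg, kind="stable"), agg

s = np.sqrt(0.5)
# A user with two tastes: two videos of taste A, one of taste B.
history = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Candidates: pure taste A, pure taste B, and a blend of both.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [s, s]])

order_avg, avg_scores = rank_candidates(history, candidates, "avg")
order_max, max_scores = rank_candidates(history, candidates, "max")
```

With average aggregation the blended candidate ranks first because it matches the whole history reasonably well. With max aggregation the two pure candidates tie at the top, since each matches one taste exactly.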
Scalability Issues
● A naive implementation: |Q|·|V| dot-products

Dot-product scores (rows: watch history Q; columns: candidate set V), with MAX and AVG taken over the watch history:

         v1    v2    v3    v4
q1      0.92  0.75  0.23  0.90
q2      0.05  0.14  0.76  0.06
q3      0.43  0.56  0.37  0.14
q4      0.02  0.11  0.04  0.03
MAX     0.92  0.75  0.76  0.90
AVG     0.36  0.39  0.35  0.28

Size labels in the figure: 5B, 200, 200.

● At YouTube scale?
● But, ranking must be done in ~100ms.
● With prefiltering, |V| = 100~1000.
● Still, computing 40,000 dot-products per user in 100ms is hard.
● So, what should we do?
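The MAX and AVG values in the figure follow from the 4×4 score matrix when aggregation runs over the watch-history axis. The check below assumes rows index watch-history videos and columns index candidates, an orientation inferred from the numbers. It also works out the time budget per dot-product implied by 40,000 dot-products in 100ms.

```python
import numpy as np

# Score matrix from the slide: rows = watch history, columns = candidates (assumed).
scores = np.array([
    [0.92, 0.75, 0.23, 0.90],
    [0.05, 0.14, 0.76, 0.06],
    [0.43, 0.56, 0.37, 0.14],
    [0.02, 0.11, 0.04, 0.03],
])

max_scores = scores.max(axis=0)   # one value per candidate
avg_scores = scores.mean(axis=0)  # one value per candidate

# Latency budget per dot-product for the prefiltered case.
num_dot_products = 40_000
budget_seconds = 0.100
per_dot_product_us = budget_seconds / num_dot_products * 1e6  # microseconds
```

The budget works out to roughly 2.5 microseconds per dot-product, which has to cover the arithmetic and fetching the embeddings.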
Confidential + Proprietary
Optimizations for Scalability
● Pre-computing averaged watch history
○ Thanks to the linearity of the dot product, we compute the averaged watch history once and reuse it for every candidate.
○ Total time complexity is O(|Q| + |V|) instead of O(|Q||V|).
○ Unfortunately, max aggregation cannot be optimized in this way.
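The linearity argument can be checked numerically. The sketch below uses random embeddings; the sizes (200 history videos, 200 candidates, 64 dimensions) and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
history = rng.normal(size=(200, 64))     # |Q| watch-history embeddings
candidates = rng.normal(size=(200, 64))  # |V| candidate embeddings

# Naive: all |Q|*|V| dot products, then average over the watch history.
naive_avg = (history @ candidates.T).mean(axis=0)

# Optimized: average the history once (O(|Q|)), then one dot product per
# candidate (O(|V|)).
mean_history = history.mean(axis=0)
fast_avg = candidates @ mean_history

# Max aggregation has no such shortcut: the max of the dot products is not,
# in general, the dot product with any single precomputed vector.
naive_max = (history @ candidates.T).max(axis=0)
```

Both routes give the same averaged scores up to floating-point error, because mean_i(x_i · v) = (mean_i x_i) · v.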
MovieLens Experiment
● Collected MovieLens trailers from YouTube.
○ 22,798 trailers (out of 27,279 movies, 83.6%) for MovieLens 20M
○ Released pre-computed features through the official MovieLens site.
● Cold-start experiment
○ Our content-based models outperform the CF model when we know less about the users!
Video Annotation/Classification
Video Annotation Problem
● Model as a multi-label classification problem, learning a mapping from video features x to d binary labels y ∈ {0, 1}^d.
● Data preprocessing:
○ PCA to 256-D, then L2 normalize.
○ Input x is correlated; PCA decorrelates the dimensions.
○ Whitening ensures each dimension is equally important.
○ z: L2-normalized input data
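The preprocessing steps can be sketched in NumPy as PCA projection, whitening, then L2 normalization. The sketch assumes an eigendecomposition of the sample covariance and a small `eps` for numerical stability. It uses 4 output dimensions on toy data rather than 256, and the function names are hypothetical.

```python
import numpy as np

def fit_pca_whitening(x, out_dim):
    """Estimate the mean, top principal directions and their variances."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:out_dim]    # indices of the largest ones
    return mean, eigvecs[:, top], eigvals[top]

def preprocess(x, mean, components, variances, eps=1e-6):
    """Project (decorrelate), whiten (unit variance), then L2-normalize rows."""
    projected = (x - mean) @ components
    whitened = projected / np.sqrt(variances + eps)
    norms = np.linalg.norm(whitened, axis=1, keepdims=True)
    return whitened / np.maximum(norms, eps)

# Toy correlated input (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

mean, components, variances = fit_pca_whitening(x, out_dim=4)
z = preprocess(x, mean, components, variances)
```

After projection and whitening, the retained dimensions are uncorrelated with unit variance on the fitting data, and the final step places every example on the unit sphere.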
Video Annotation Problem
● Mixture of Experts (MoE) classifier:
○ Given z, the MoE model estimates the probability p(e|z) that an entity e exists in the video as a weighted average over experts h. For each expert h, we use a binary logistic regression classifier.
○ We train a separate MoE for each label. Each MoE is trained independently, in parallel.
○ More info about the model is available here.
p(e|z) = Σ_h p(h|z) · p(e|z, h)
p(e|z): probability of label e given features z (e.g., Hyundai Sonata)
p(h|z): probability of a hidden state h given features z (e.g., interior of car, engine)
p(e|z, h): probability of label e given features z and hidden state h (e.g., probability of Sonata given it is a view of the engine); modeled by logistic regression
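For a single label, the weighted average over experts can be sketched as follows. The slides state that each expert is a binary logistic regression; the softmax gate used for p(h|z), the parameter shapes, and the name `moe_predict` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def moe_predict(z, gate_w, gate_b, expert_w, expert_b):
    """p(e|z) = sum_h p(h|z) * p(e|z, h) for one label.

    z: (batch, dim); gate_w, expert_w: (dim, H); gate_b, expert_b: (H,).
    """
    gate = softmax(z @ gate_w + gate_b)        # p(h|z), assumed softmax gate
    expert = sigmoid(z @ expert_w + expert_b)  # p(e|z, h), logistic regression
    return np.sum(gate * expert, axis=1)

# Toy parameters (random, illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 16))
gate_w, gate_b = rng.normal(size=(16, 3)), np.zeros(3)
expert_w, expert_b = rng.normal(size=(16, 3)), np.zeros(3)

p = moe_predict(z, gate_w, gate_b, expert_w, expert_b)
```

Because the gate weights are non-negative and sum to one, the output is a convex combination of the expert probabilities and stays in [0, 1].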
Experimental Result: YouTube-8M
                        Video-level features only    Frame-level features used
Rank  Team Name         Single      Ensemble         Single      Ensemble
1     WILLOW            -           -                0.8300      0.8469
2     monkeytyping      0.8106      0.8225           0.8179      0.8458
3     offline           0.8082      -                0.8275      0.8454
4     FDT               -           -                0.8178      0.8419
5     You8M             -           0.8308           -           0.8418
6     Rankyou           0.8041      -                0.8246      0.8408
7     Yeti              -           -                0.8254      0.8396
8     SNUVL X SKT       -           -                0.8200      0.8389
9     LanzanRamen       -           -                -           0.8372
10    Samartian         -           -                0.8139      0.8366
      Ours              0.8430      -                -           -

Scores in GAP; higher values are better.
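The slides do not define GAP. The sketch below assumes the Global Average Precision definition used in the YouTube-8M challenge: keep the top 20 predictions per video, pool them across videos, sort by confidence, and compute average precision over the pooled list. The function name and toy data are illustrative, and this is a simplified version rather than the official evaluation code.

```python
import numpy as np

def global_average_precision(scores, labels, top_k=20):
    """Simplified GAP. scores, labels: (num_videos, num_classes); labels are 0/1."""
    k = min(top_k, scores.shape[1])
    # Keep the k highest-scoring classes for each video.
    top_idx = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    top_scores = np.take_along_axis(scores, top_idx, axis=1).ravel()
    top_labels = np.take_along_axis(labels, top_idx, axis=1).ravel()
    # Pool all kept predictions and sort them by confidence.
    order = np.argsort(-top_scores, kind="stable")
    hits = top_labels[order].astype(float)
    total_positives = labels.sum()
    if total_positives == 0:
        return 0.0
    precision_at_i = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(np.sum(precision_at_i * hits) / total_positives)

# Toy example: 2 videos, 3 classes, top 2 predictions kept per video.
scores = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]])
labels = np.array([[1, 0, 0], [0, 1, 0]])
gap = global_average_precision(scores, labels, top_k=2)
```

In the toy example the pooled, sorted predictions are hit, miss, hit, miss, so the precision at the two hits is 1 and 2/3 and their mean is 5/6.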
Summary
Take-home Messages
● Signals that are indirectly related can be useful to various tasks.
○ CF signals are useful for video annotation as well.
● Pure content models can perform comparably to CF models.
○ They even outperform CF models in cold/cool-start cases.
● Analyzing video content at large scale is challenging, but we are improving.
○ Video features are extracted and widely used in Google products (YouTube, Photos, and more).
Thank you!