


Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders | QUVA Lab, University of Amsterdam

IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, Honolulu


We pose ZED as learning from a set of predefined events. Given video exemplars of events such as "removing drywall" or "fit wall tiles", one may detect a novel event "renovate home" as a probability distribution over the predefined events.

[Figure: predefined events ("remove drywall", "fit wall tiles", "skateboarding", "snowboarding") and the novel events ("renovate home", "board trick") detected as distributions over them.]

Experiments and Results

Event detection accuracies: per-event average precision (AP) and per-dataset mean average precision (mAP) for the MED-13 and MED-14 datasets.

Left: retrieval accuracy (mAP) of our model vs. related works for the MED-13 and MED-14 datasets. Right: retrieval accuracy (mAP) of our model (unified embedding) vs. other baselines.

• The unified embedding (c) does a better job of discriminating the text articles of the events.
• In (c), projecting the videos onto their related events works much better than in the baselines (a) and (b).

[Figure: embedding comparison for (a) visual embedding (model_V), (b) separate embedding (model_S), (c) unified embedding (model_U).]

Our textual embedding f_T maps the text descriptions of MED events to EventNet events better than off-the-shelf LDA, LSI or Doc2vec.
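The text on the poster's model figure is represented with LSI features before any learning happens. A minimal sketch of how such features could be computed with gensim; the toy corpus, preprocessing, and topic count are placeholders, not the paper's setup:

```python
from gensim import corpora, models

# Toy corpus of event articles (placeholder documents, not WikiHow data).
articles = [
    "how to remove drywall from a wall".split(),
    "how to fit wall tiles in a bathroom".split(),
    "how to train a dog for a dog show".split(),
]
dictionary = corpora.Dictionary(articles)
bow_corpus = [dictionary.doc2bow(doc) for doc in articles]

# Fit an LSI model over the article corpus; num_topics is illustrative.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

# LSI feature of a novel query text, expressed in the same topic space.
query = "renovate a home".split()
print(lsi[dictionary.doc2bow(query)])
```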

[Figure: model overview of the baseline methods, with a video v (feature x), a text query t, the encoders f_V and f_T, a textual space Y (features y_v, y_t) and a semantic space Z (embeddings z_v, z_t). Top: visual embedding (model_V). Bottom: separate embedding (model_S).]

Loss functions used to train the baseline models: visual embedding model_V (2), contrastive visual model_C (3), separate embedding model_S (4) and non-metric embedding model_N (5).
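The exact formulations are equations (2)-(5) in the paper. As an illustration of the contrastive term L_c used by the metric models, a standard margin-based contrastive loss over (video, article) embedding pairs might look like the sketch below; the margin and the 0.5 weighting are assumptions, not the paper's values:

```python
import torch

def contrastive_loss(z_v, z_t, same_event, margin=1.0):
    """Margin-based contrastive loss between video embeddings z_v and
    event-article embeddings z_t.
    same_event: 1 if the pair belongs to the same event, else 0.
    Matching pairs are pulled together; non-matching pairs are pushed
    apart until they are at least `margin` away."""
    d = torch.norm(z_v - z_t, p=2, dim=1)                 # distance per pair
    pos = same_event * d.pow(2)                           # pull matching pairs
    neg = (1 - same_event) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (pos + neg).mean()

# Toy usage: 8 pairs of 100-d embeddings.
z_v, z_t = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(z_v, z_t, labels))
```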

Success (left) and failure (right) video examples of three different events: (a) birthday party, (b) dog show, (c) renovating home.

Problem

Zero-exemplar Event Detection (ZED) is posed as a video retrieval task. Given test videos and a novel query, the model is required to rank the videos accordingly.

[Figure: example ranking of test videos for the novel query "Birthday Party".]
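At test time the ranking itself is straightforward once the query and the videos live in one space: score each test video by its similarity to the query embedding and sort. A minimal sketch, assuming precomputed embeddings; z_query and z_videos are illustrative names for the outputs of the textual and visual networks described in the Model section below:

```python
import numpy as np

def rank_videos(z_query, z_videos):
    """Rank test videos for a novel text query by cosine similarity
    in the shared embedding space (higher similarity = higher rank)."""
    q = z_query / np.linalg.norm(z_query)
    v = z_videos / np.linalg.norm(z_videos, axis=1, keepdims=True)
    scores = v @ q                       # cosine similarity per video
    return np.argsort(-scores), scores   # video indices, best first

# Toy usage: 4 test videos with 300-d embeddings.
rng = np.random.default_rng(0)
order, scores = rank_videos(rng.normal(size=300), rng.normal(size=(4, 300)))
print(order, scores[order])
```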

Approach

We use video samples from EventNet and event articles from WikiHow.

[Figure: the 500 EventNet events (Event 001: Dancing, ..., Event 500: Gardening), each with its videos, video titles, and event article.]

Novelties

1. Unified embedding for cross-modalities, with a metric loss for maximum discrimination between events.
2. The textual embedding poses a novel query as a probability distribution over the predefined events (see the sketch below).
3. An external data source of event articles and related videos, with end-to-end learning from cross-modal pairs.
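A rough sketch of novelty 2: the textual network f_T takes a text feature of a query and outputs a probability distribution over the M = 500 predefined EventNet events. The layer sizes and the input dimensionality below are illustrative, not those of the released model:

```python
import torch
import torch.nn as nn

M_EVENTS = 500    # number of predefined EventNet events
LSI_DIM = 2000    # dimensionality of the LSI text feature (illustrative)

# f_T: maps a text feature to scores over the predefined events.
f_T = nn.Sequential(
    nn.Linear(LSI_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, M_EVENTS),
)

def query_as_event_distribution(y_query):
    """Pose a novel text query as a probability distribution
    over the predefined events (novelty 2)."""
    return torch.softmax(f_T(y_query), dim=-1)

probs = query_as_event_distribution(torch.randn(1, LSI_DIM))
print(probs.shape, float(probs.sum()))  # torch.Size([1, 500]), ~1.0
```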

Model

At the top, the network f_T learns to classify the title feature y_k into one of M event categories. In the middle, we borrow the network f_T to embed the event-article feature y_t as z_t ∈ Z. Then, at the bottom, the network f_V learns to embed the video feature x as z_v ∈ Z such that the distance between (z_t, z_v) is minimized in the learned metric space Z.
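Putting the description above into code, one training step of the unified model might look like the following sketch: classify title features with a logistic (cross-entropy) loss, borrow the same textual branch to embed the article features, embed the video features with the visual branch, and add a distance term that pulls matching (z_t, z_v) pairs together. Layer sizes, the plain squared-distance term (the paper also uses non-matching pairs via the contrastive loss), and the unweighted sum are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

M_EVENTS, LSI_DIM, CNN_DIM, Z_DIM = 500, 2000, 4096, 512   # illustrative sizes

# Textual branch f_T: embeds LSI features into the metric space Z,
# with a classifier head over the M predefined events on top.
embed_T = nn.Sequential(nn.Linear(LSI_DIM, Z_DIM), nn.ReLU())
classify = nn.Linear(Z_DIM, M_EVENTS)

# Visual branch f_V: embeds CNN video features into the same space Z.
embed_V = nn.Sequential(nn.Linear(CNN_DIM, Z_DIM), nn.ReLU())

xent = nn.CrossEntropyLoss()

def training_step(y_k, k, y_t, x):
    """One step of the unified objective.
    y_k: LSI features of video titles, k: their event labels,
    y_t: LSI features of the event articles paired with the videos,
    x:   CNN features of the videos."""
    # Top: logistic loss -- classify title features into the M events.
    loss_l = xent(classify(embed_T(y_k)), k)
    # Middle: borrow f_T to embed the article features as z_t in Z.
    z_t = embed_T(y_t)
    # Bottom: embed the video features as z_v and pull them toward z_t.
    z_v = embed_V(x)
    loss_c = (z_v - z_t).pow(2).sum(dim=1).mean()   # matching-pair distance term
    return loss_l + loss_c                          # total loss

loss = training_step(torch.randn(8, LSI_DIM), torch.randint(0, M_EVENTS, (8,)),
                     torch.randn(8, LSI_DIM), torch.randn(8, CNN_DIM))
print(loss)
```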

    GitHub: https://git.io/vS2Cf

Take Home

The Good: external knowledge (EventNet, WikiHow) is leveraged for better zero-exemplar event detection.
The Bad: no fine-grained event detection, e.g. "fixing musical instrument" vs. "tuning musical instrument".
The Ugly: is average pooling enough for video representation, or is temporal modeling required?

[Figure: model architecture. Textual embedding: the video title k and the event article are mapped by LSI to features y_k and y_t, then by f_T to z_k and z_t; the title branch is trained with a logistic loss L_l. Visual embedding: the video v is mapped by a CNN to the feature x, then by f_V to z_v; the distance between (z_t, z_v) is trained with a contrastive loss L_c. The total loss L_U is the sum (Σ) of L_l and L_c.]