
Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

Seungwhan Moon, Suyoun Kim, Haohan Wang
Carnegie Mellon University
MMML'15 Workshop

Objective

Leverage source-domain data to improve a target-domain task.

Input: imbalanced (e.g. in label space) multimodal parallel datasets for training (e.g. source: audio, target: video).
• Train: audio covers labels A-Z; video covers only labels A-M.

Output: a robust deep neural network for the target task (a video recognition network).
• Test: video labels A-Z (some unforeseen during training).
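The label-space setup above (audio: A-Z and video: A-M at training time; video: A-Z at test time) can be sketched concretely. The variable names below are illustrative, not from the poster:

```python
import string

# Full label space shared by both modalities.
labels = list(string.ascii_uppercase)       # A..Z, 26 labels

# Source modality (audio) covers the full label space at training time...
train_audio_labels = labels                 # A..Z
# ...but the target modality (video) only covers a subset.
train_video_labels = labels[:13]            # A..M

# At test time the video network must handle the full label space,
# including labels never seen in the video modality during training.
test_video_labels = labels                  # A..Z
unseen = sorted(set(test_video_labels) - set(train_video_labels))
print(unseen)                               # the 13 letters N..Z
```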

Applications

Multimodal tasks with imbalanced datasets:
• Audio-visual recognition
  - Lip-reading recognition
  - In-video action recognition
• Multi-lingual natural language learning
  - Rare-language text classification
• Text-image joint learning

Our Approach

Fine-tune a target network with source instances transferred at intermediate layers.

[Figure: two deep networks, one mapping audio data to an output label and one mapping video data to an output label, with steps ① to ④ (described below) marked between their intermediate layers.]

① Train a separate model for each modality (one for audio, one for video), and define the activation at the i-th layer of each network.

② Learn a transfer function between the two networks' intermediate layers, using source-target correspondent instances.

③ Transfer the auxiliary source data to the target network, and compute activations at the upper layers.

④ Fine-tune the target network with the transferred source instances.
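A minimal numerical sketch of steps ① to ④, using toy two-layer networks and a linear least-squares transfer function. All dimensions, weights, and the single fine-tuning step are illustrative assumptions, not the poster's actual architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Step 1: a separate (toy) two-layer network per modality.
# Random weights stand in for networks pre-trained on each modality.
d_audio, d_video, d_hid, n_labels = 20, 30, 16, 5
W1_a = rng.normal(size=(d_audio, d_hid)) * 0.1   # audio: input -> hidden
W1_v = rng.normal(size=(d_video, d_hid)) * 0.1   # video: input -> hidden
W2_v = rng.normal(size=(d_hid, n_labels)) * 0.1  # video: hidden -> output

def audio_hidden(x): return relu(x @ W1_a)       # i-th layer activation (audio)
def video_hidden(x): return relu(x @ W1_v)       # i-th layer activation (video)

# Step 2: learn a linear transfer function between intermediate layers
# from parallel (source, target) instances, via least squares.
n_pair = 200
X_pair_a = rng.normal(size=(n_pair, d_audio))
X_pair_v = rng.normal(size=(n_pair, d_video))
H_a, H_v = audio_hidden(X_pair_a), video_hidden(X_pair_v)
T, *_ = np.linalg.lstsq(H_a, H_v, rcond=None)    # audio hidden -> video hidden

# Step 3: transfer auxiliary audio-only data into the video network
# and compute activations at its upper layers.
n_aux = 100
X_aux_a = rng.normal(size=(n_aux, d_audio))
y_aux = rng.integers(0, n_labels, size=n_aux)    # labels of the audio instances
H_transferred = audio_hidden(X_aux_a) @ T        # pseudo video-hidden activations

# Step 4: fine-tune the video network's upper layer on the transferred
# instances (one softmax cross-entropy gradient step, for illustration).
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(H_transferred @ W2_v)
grad = probs.copy()
grad[np.arange(n_aux), y_aux] -= 1.0             # dL/dlogits for cross-entropy
W2_v -= 0.1 * (H_transferred.T @ grad) / n_aux   # gradient step on upper layer
```

In this sketch the transfer function is a single linear map; the poster's transfer function could equally be a learned nonlinear mapping, with the rest of the pipeline unchanged.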

Results

Label-space setup:
• Train: audio: full, video: partial
• Fine-tune: video: full (+ transferred)
• Test: video: full

Datasets: AV_Letters (26 labels), Stanford (49 labels)

Interpretation
• Intractable or less reliable transfer: fine-tune more layers; more reliable transfer: fine-tune fewer layers.
• Performance is soft-upper-bounded by the feature-mapping accuracy.
• Trade-off between transfer reliability and the number of layers to fine-tune.

Future work
• Comparison with state-of-the-art transfer learning methods (heterogeneous transfer, deep shared representation, etc.)
• Artificial construction of target-modality instances via top-down inference, using source-modality instances.
