



Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

Seungwhan Moon, Suyoun Kim, Haohan Wang
Carnegie Mellon University
MMML'15 Workshop

Objective

Leverage source domain data to improve the target domain task.

Input: imbalanced (e.g., in label space) multimodal parallel datasets for training (e.g., source: audio and target: video)

Train: audio: A-Z, video: A-M

Output: a robust deep neural network for the target task (a video recognition network)

Test: video: A-Z (some labels unseen during training); a small sketch of this label-space setup follows below.
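For concreteness, here is a minimal sketch of the label-space imbalance just described: the source modality (audio) covers all 26 letters while the target modality (video) covers only A-M during training, yet the test set spans A-Z. The 13/26 split is taken from the example above; the variable names and the use of Python's string module are illustrative.

```python
# Illustrative construction of the imbalanced label-space split (not the authors' code).
import string

LABELS = list(string.ascii_uppercase)       # the full label space: A-Z

audio_train_labels = LABELS                 # source modality: full label space (A-Z)
video_train_labels = LABELS[:13]            # target modality: partial label space (A-M)
video_test_labels  = LABELS                 # test covers the full label space (A-Z)

# Letters the video network must recognize without ever seeing video examples of them.
unseen_in_video_training = [c for c in video_test_labels if c not in video_train_labels]
print(unseen_in_video_training)             # ['N', 'O', ..., 'Z']
```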

Applications

Multimodal tasks with imbalanced datasets:
• Audio-visual recognition
  - Lip-reading recognition
  - In-video action recognition

• Multi-lingual natural language learning
  - Rare language text classification

• Text-image joint learning

Our Approach

Fine-tune a target network with source instances transferred at intermediate layers.

[Figure: two deep networks side by side, one taking audio data and one taking video data, each ending in an output-label layer; circled markers ① through ④ indicate where the steps below act on the two networks.]

① Train a separate model for each modality (one for audio, one for video), and define the activation at the i-th layer of each network (a plausible form is sketched after step ④).

② Learn a transfer function that maps source activations at that layer to the corresponding target activations, using source-target correspondent (parallel) instances.

③ Transfer auxiliary source data into the target network through this function, and compute activations at the upper layers.

④ Fine-tune the target network with the transferred source instances.
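The formulas elided in steps ① and ② are not recoverable from this copy. A plausible reading, stated as an assumption: the activation at the i-th layer is a^(i) = σ(W^(i) a^(i-1) + b^(i)), and the transfer function is a mapping f: a_s^(i) → a_t^(i) fitted on parallel source-target pairs. The PyTorch sketch below walks through steps ① to ④ under that reading; the layer sizes, the transfer layer index i, the MSE objective for f, and the dummy data are all illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of steps ① - ④ (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

# ① One network per modality; input/output sizes are placeholders.
#   (Supervised pre-training of each network on its own data is omitted for brevity.)
audio_net = nn.Sequential(nn.Linear(600, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, 26))     # audio labels: A-Z
video_net = nn.Sequential(nn.Linear(4800, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, 26))     # output covers A-Z; only A-M is seen in video training

def activation(net, x, i):
    """Activation at the i-th hidden layer (output of the i-th Linear+ReLU block)."""
    return net[:2 * i](x)

# Dummy parallel and auxiliary data for illustration (stand-ins for real AV_Letters batches).
parallel_pairs = [(torch.randn(32, 600), torch.randn(32, 4800)) for _ in range(10)]
auxiliary_audio = [(torch.randn(32, 600), torch.randint(13, 26, (32,))) for _ in range(10)]

# ② Learn a linear transfer function f: a_s^(i) -> a_t^(i) on correspondent instances.
i = 2
f = nn.Linear(128, 128)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
for audio_x, video_x in parallel_pairs:
    a_s = activation(audio_net, audio_x, i).detach()
    a_t = activation(video_net, video_x, i).detach()
    loss = F.mse_loss(f(a_s), a_t)
    opt_f.zero_grad(); loss.backward(); opt_f.step()

# ③ Transfer auxiliary audio instances into the video network at layer i, then
# ④ fine-tune the video layers above layer i with the transferred instances
#    (in practice mixed with the original video training data).
upper = video_net[2 * i:]
opt_v = torch.optim.Adam(upper.parameters(), lr=1e-4)
for audio_x, y in auxiliary_audio:
    transferred = f(activation(audio_net, audio_x, i)).detach()
    loss = F.cross_entropy(upper(transferred), y)
    opt_v.zero_grad(); loss.backward(); opt_v.step()
```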

Results

Label Space Setup

Train:      audio: full, video: partial
Fine-tune:  video: full (+ transferred source instances)
Test:       video: full

Datasets: AV_Letters (26 labels), Stanford (49 labels)

Interpretation

• Intractable or less reliable transfer: fine-tune more layers
• More reliable transfer: fine-tune fewer layers

• Performance is soft-upper-bounded by the feature-mapping accuracy
• Trade-off between transfer reliability and how many layers to fine-tune (a heuristic sketch follows below)
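One way to read this trade-off in code, assuming an nn.Sequential target network like the one sketched above: the less reliable the learned transfer appears (e.g., higher validation error of f), the more of the top layers get fine-tuned. The threshold, the error measure, and the layer counts are purely illustrative assumptions, not part of the poster.

```python
import torch.nn as nn

def choose_finetune_layers(video_net, transfer_error, threshold=0.05):
    """Illustrative heuristic: unfreeze more of the top layers when the learned
    transfer function is less reliable, fewer when it is more reliable."""
    # Indices of the Linear layers, from bottom to top.
    linear_idx = [k for k, m in enumerate(video_net) if isinstance(m, nn.Linear)]
    n_top = 2 if transfer_error > threshold else 1   # how many top Linear layers to fine-tune
    start = linear_idx[-n_top]
    for k, m in enumerate(video_net):
        for p in m.parameters():
            p.requires_grad = k >= start
    return video_net[start:]                         # the sub-network to hand to the optimizer

# Example with the video network from the earlier sketch:
video_net = nn.Sequential(nn.Linear(4800, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, 26))
upper = choose_finetune_layers(video_net, transfer_error=0.10)   # unreliable -> deeper fine-tuning
```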

Future work

• Comparison with state-of-the-art transfer learning methods (heterogeneous transfer, deep shared representation, etc.)

• Artificial construction of target-modality instances via top-down inference, using source-modality instances
