
Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

Seungwhan Moon, Suyoun Kim, Haohan Wang
Carnegie Mellon University
MMML'15 Workshop

Objective

Leverage source-domain data to improve a target-domain task.

Input: imbalanced (e.g. in label space) multimodal parallel datasets for training (e.g. source: audio, target: video).
• Train: audio covers labels A-Z; video covers only labels A-M.

Output: a robust deep neural network for the target task (a video recognition network).
• Test: video labels A-Z (some unforeseen during training).
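The label-space setup above (audio: A-Z and video: A-M at training time; video: A-Z at test time) can be sketched concretely. The variable names below are illustrative, not from the poster:

```python
import string

# Full label space shared by both modalities.
labels = list(string.ascii_uppercase)       # A..Z, 26 labels

# Source modality (audio) covers the full label space at training time...
train_audio_labels = labels                 # A..Z
# ...but the target modality (video) only covers a subset.
train_video_labels = labels[:13]            # A..M

# At test time the video network must handle the full label space,
# including labels never seen in the video modality during training.
test_video_labels = labels                  # A..Z
unseen = sorted(set(test_video_labels) - set(train_video_labels))
print(unseen)                               # the 13 letters N..Z
```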

Applications

Multimodal tasks with imbalanced datasets:
• Audio-visual recognition
  - Lip-reading recognition
  - In-video action recognition
• Multi-lingual natural language learning
  - Rare-language text classification
• Text-image joint learning

Our Approach

Fine-tune a target network with source instances transferred at intermediate layers.

[Figure: two deep networks, one mapping audio data to an output label and one mapping video data to an output label, with steps ① to ④ (described below) marked between their intermediate layers.]

① Train a separate model for each modality (one for audio, one for video), and define the activation at the i-th layer of each network.

② Learn a transfer function between the two networks' intermediate layers, using source-target correspondent instances.

③ Transfer the auxiliary source data to the target network, and compute activations at the upper layers.

④ Fine-tune the target network with the transferred source instances.
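A minimal numerical sketch of steps ① to ④, using toy two-layer networks and a linear least-squares transfer function. All dimensions, weights, and the single fine-tuning step are illustrative assumptions, not the poster's actual architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Step 1: a separate (toy) two-layer network per modality.
# Random weights stand in for networks pre-trained on each modality.
d_audio, d_video, d_hid, n_labels = 20, 30, 16, 5
W1_a = rng.normal(size=(d_audio, d_hid)) * 0.1   # audio: input -> hidden
W1_v = rng.normal(size=(d_video, d_hid)) * 0.1   # video: input -> hidden
W2_v = rng.normal(size=(d_hid, n_labels)) * 0.1  # video: hidden -> output

def audio_hidden(x): return relu(x @ W1_a)       # i-th layer activation (audio)
def video_hidden(x): return relu(x @ W1_v)       # i-th layer activation (video)

# Step 2: learn a linear transfer function between intermediate layers
# from parallel (source, target) instances, via least squares.
n_pair = 200
X_pair_a = rng.normal(size=(n_pair, d_audio))
X_pair_v = rng.normal(size=(n_pair, d_video))
H_a, H_v = audio_hidden(X_pair_a), video_hidden(X_pair_v)
T, *_ = np.linalg.lstsq(H_a, H_v, rcond=None)    # audio hidden -> video hidden

# Step 3: transfer auxiliary audio-only data into the video network
# and compute activations at its upper layers.
n_aux = 100
X_aux_a = rng.normal(size=(n_aux, d_audio))
y_aux = rng.integers(0, n_labels, size=n_aux)    # labels of the audio instances
H_transferred = audio_hidden(X_aux_a) @ T        # pseudo video-hidden activations

# Step 4: fine-tune the video network's upper layer on the transferred
# instances (one softmax cross-entropy gradient step, for illustration).
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(H_transferred @ W2_v)
grad = probs.copy()
grad[np.arange(n_aux), y_aux] -= 1.0             # dL/dlogits for cross-entropy
W2_v -= 0.1 * (H_transferred.T @ grad) / n_aux   # gradient step on upper layer
```

In this sketch the transfer function is a single linear map; the poster's transfer function could equally be a learned nonlinear mapping, with the rest of the pipeline unchanged.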

Results

Label-space setup:
• Train: audio: full, video: partial
• Fine-tune: video: full (+ transferred)
• Test: video: full

Datasets: AV_Letters (26 labels), Stanford (49 labels)

Interpretation
• Intractable or less reliable transfer: fine-tune more layers; more reliable transfer: fine-tune fewer layers.
• Performance is soft-upper-bounded by the feature-mapping accuracy.
• Trade-off between transfer reliability and the number of layers to fine-tune.

Future work
• Comparison with state-of-the-art transfer learning methods (heterogeneous transfer, deep shared representation, etc.)
• Artificial construction of target-modality instances via top-down inference, using source-modality instances.
