Automatic video annotation via hierarchical topic trajectory model considering cross-modal correlations

Takuho Nakano (1), Akisato Kimura (2), Hirokazu Kameoka (1,2), Shigeki Sagayama (1), Shigeki Miyabe (1), Nobutaka Ono (1), Kunio Kashino (2), Takuya Nishimoto (1)
(1) The University of Tokyo (2) NTT Communication Science Laboratories
Contact: Akisato Kimura <akisato@ieee.org>

ABSTRACT
• We propose a generative model named Hierarchical Topic Trajectory Model (HTTM) for automatic video annotation and video retrieval.
• HTTM is a unified model for automatic video annotation and video retrieval.
• It incorporates low-level features together with keyframe-wise topic models that can encode an arbitrary number of cross-modal correlations.
• It deals with temporal topic transitions by hidden Markov models (HMM).
• Some behaviors have been analyzed through the TRECVID Semantic Indexing task.

METHOD
(1) Motivation
• Modeling relationships between images and texts
  Objects: Co-occurrences are useful.
  Events: Co-occurrences are almost necessary; sometimes they might be the only cue.
• Incorporating temporal information into the model
  A hierarchical temporal structure would be necessary.
• Standard approaches: Discriminative (e.g. SVM)
  Each classifier performs well in general.
  Difficult to incorporate "co-occurrences".
• Our approach: Generative (topic models)
[Figure labels: cross road, traffic jam; cars, buses, bikes]

(2) Hierarchical topic trajectory model (HTTM)
[Figure: model layers labeled Hidden states, Latent variables, Low-level features, Images / text tags; annotations: Hierarchical temporal structures, Cross-modal co-occurrences, Feature correlations; example tags: Car, Road]

Topic trajectory expression
• HMM is used (Observations = latent variables).
• Each hidden state might correspond to a concept or a story expressed by a sequence of topics.
• HMM is feasible for simple expression of topic transition.
• Topic variation can be easily introduced by using Gaussian mixtures (GMM).

Keyframe-wise topic model
• Incorporates co-occurrences in a natural way
• Serves as a unified framework for semantic indexing and retrieval
• Tool: Probabilistic CCA [Bach+ 2005]
• Low computational cost for model learning and inference
• Easy to extend the model to non-linear ones via the kernel trick

PROCEDURE
(1) Model learning
(A) Extracts low-level features for each modality
    [Images] SIFT (keypoints & descriptors) + BoVW (Bag of Visual Words)
    [Texts] Tag occurrences weighted by idf (inverse document frequency)
(B) Calculates latent variables using PCCA
    [Equation not recoverable from the extraction; surviving labels: "Projection matrices obtained via CCA", "Diagonal matrix"]
(C) Unsupervised learning of model parameters for the topic transition model (HMM) via Viterbi learning
    Observation: Sequence of latent variables

(2) Inference (= annotation)
(a) Extracts low-level image features
(b) Calculates latent variables only from image features
(c) Estimates hidden states by Viterbi search
(d) Latent variable modification with image features & hidden states
(e) Missing feature (= semantic index) estimation

EXPERIMENTS
(1) Conditions
[Dataset] 127 video clips, 56191 shots from TRECVID2005
[Image features] VIREO-374 (SIFT keypoints & descriptors + normalized BoVWs with 500 dims)
[Labels] 47 concepts selected from LSCOM and LSCOM-lite
[Evaluation] Mean average precision (meanAP, a commonly used measure in TRECVID)

(2) Results
1. Latent variable modification is effective.
2. Appropriate setting of hyper-parameters provides high precision.
[Figures: plots with axes labeled "mean AP" and "average precision". Legend entries: chance, image only, Framewise PCCA, HTTM (temporal structure + latent variable modification); s1_m240, s2_m120, s3_m80, s4_m60, s5_m48, s6_m40, s7_m35, s10_m24, s20_m12, s30_m8, s40_m6. Concept labels: Airplane, Airplane_Flying, Maps, Urban, Sports, Studio.]
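Procedure steps (B) and (b) compute latent variables with PCCA, once from both modalities and once from image features only. The poster's own equation did not survive extraction, so the following is only a minimal sketch of the standard probabilistic CCA posterior means as described in [Bach+ 2005], assuming the admissible choice M1 = M2 = P^(1/2) for the diagonal matrix P of canonical correlations. The function names, the regularization constant `reg`, and the synthetic data are illustrative assumptions, not taken from the poster.

```python
import numpy as np


def inv_sqrt(sym, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(sym)
    vals = np.maximum(vals, eps)
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_pcca(X, Y, dim, reg=1e-3):
    """Fit CCA on paired rows of X (image features) and Y (tag features).

    `reg` is a small ridge term (an assumption) that keeps the
    covariance matrices invertible.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    A, rho, Bt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    rho = np.clip(rho[:dim], 0.0, 1.0 - 1e-9)
    return {
        "mu_x": mu_x, "mu_y": mu_y,
        "Ux": Wx @ A[:, :dim],    # projection matrix for images
        "Uy": Wy @ Bt[:dim].T,    # projection matrix for tags
        "rho": rho,               # diagonal of P
    }


def latent_from_images(model, X):
    """Posterior mean E[z | x] = M^T Ux^T (x - mu_x), with M = P^(1/2)."""
    return (X - model["mu_x"]) @ model["Ux"] * np.sqrt(model["rho"])


def latent_from_both(model, X, Y):
    """Posterior mean E[z | x, y]; for M1 = M2 = P^(1/2) it reduces to
    sqrt(rho) * (a + b) / (1 + rho) per canonical component."""
    rho = model["rho"]
    a = (X - model["mu_x"]) @ model["Ux"]
    b = (Y - model["mu_y"]) @ model["Uy"]
    return np.sqrt(rho) * (a + b) / (1.0 + rho)


# Synthetic paired data sharing a 2-dimensional latent variable.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
model = fit_pcca(X, Y, dim=2)
Z_train = latent_from_both(model, X, Y)   # learning: both modalities
Z_test = latent_from_images(model, X)     # inference: images only
```

In this sketch the learning-time latent variables use both modalities, while the inference-time ones use images alone, which mirrors the split between steps (B) and (b).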

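Inference step (c) estimates hidden states by Viterbi search over the sequence of latent variables. As a rough sketch of that step, the code below runs a log-domain Viterbi decoder on an HMM whose emissions are isotropic Gaussians. The Gaussian emission form, the helper names, and all numeric values are assumptions made for illustration; the poster only states that an HMM is used and that Gaussian mixtures can be introduced.

```python
import numpy as np


def gaussian_log_lik(Z, means, var=1.0):
    """Log-likelihood of each latent vector under each state's
    isotropic Gaussian. Returns an array of shape (T, S)."""
    Z = np.asarray(Z, dtype=float)
    means = np.asarray(means, dtype=float)
    sq = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    dim = Z.shape[1]
    return -0.5 * (sq / var + dim * np.log(2.0 * np.pi * var))


def viterbi(log_pi, log_A, log_B):
    """Most likely hidden-state path.

    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) with log_A[i, j] = log p(state j | state i)
    log_B:  (T, S) emission log-likelihoods
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_A           # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)       # best predecessor of each j
        delta = cand[back[t], np.arange(S)] + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):               # backtrack
        path[t - 1] = back[t, path[t]]
    return path, float(np.max(delta))


# Two hypothetical topics with well-separated means.
means = [[0.0, 0.0], [3.0, 3.0]]
Z_seq = [[0.1, -0.2], [0.2, 0.1], [2.9, 3.2], [3.1, 2.8]]
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
path, score = viterbi(log_pi, log_A, gaussian_log_lik(Z_seq, means))
print(path)  # [0 0 1 1]
```

Viterbi learning in step (C) could reuse the same decoder by alternating between decoding a path and re-estimating the state parameters from the frames assigned to each state, although the poster does not spell out its exact update rules.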

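The experiments are evaluated by mean average precision (meanAP). The sketch below shows one common way to compute it: non-interpolated average precision per concept over the full ranked list of shots, then the mean over concepts. This is an assumption about the exact variant; official TRECVID evaluations may truncate the ranked list or use an estimated variant, and the scores and labels here are made up for illustration.

```python
import numpy as np


def average_precision(scores, relevant):
    """Non-interpolated AP: mean of precision@k taken at the ranks k
    where a relevant shot appears. Returns 0.0 if nothing is relevant."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rel = np.asarray(relevant, dtype=bool)[order]
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                  # relevant shots seen so far
    ranks = np.arange(1, rel.size + 1)
    return float(np.mean(hits[rel] / ranks[rel]))


def mean_average_precision(score_matrix, label_matrix):
    """Rows are shots, columns are concepts."""
    S = np.asarray(score_matrix, dtype=float)
    L = np.asarray(label_matrix)
    aps = [average_precision(S[:, c], L[:, c]) for c in range(S.shape[1])]
    return float(np.mean(aps))


# Four shots, two hypothetical concepts.
scores = [[0.9, 0.2],
          [0.8, 0.9],
          [0.7, 0.1],
          [0.1, 0.4]]
labels = [[1, 1],
          [0, 0],
          [1, 0],
          [0, 0]]
m_ap = mean_average_precision(scores, labels)
```

For the first concept the relevant shots land at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2; for the second the single relevant shot lands at rank 3, giving AP = 1/3.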
