Transcript
Page 1: Going Deeper into First-Person Activity Recognition

Going Deeper into First-Person Activity Recognition

Slides by Alejandro Cartas

Minghuang Ma, Haoqi Fan and Kris M. Kitani

All diagrams and images are originally from Minghuang Ma et al. unless otherwise stated.

Page 2: Going Deeper into First-Person Activity Recognition

What is this paper about?

•  Proposes a two-stream CNN to recognize activities (object + action) in short egocentric videos.

Preparing a Hotdog sequence (pictures taken from the GTEA dataset):

Take + BREAD: Take bread

Take + HOTDOG (SAUSAGE): Take Hotdog

Put + BREAD, HOTDOG (SAUSAGE): Put Hotdog on Bread

… …

Page 3: Going Deeper into First-Person Activity Recognition

Proposed approach

hand segmentation object localization

action 'take'

object 'milk container'

activity 'take milk container'

optical flow

Motion stream

Appearance stream

Input video

Take bread sequence (pictures taken from the GTEA dataset):

ARM+HAND  ARM+HAND  ARM+HAND  BREAD  BREAD

… …

Page 4: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss
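The two loss-related labels on this slide can be illustrated with a minimal pure-Python sketch. It assumes that the binary softmax scores each pixel as hand versus background and that the per-pixel Euclidean loss compares a predicted map with a target map. The slide does not spell out that pairing. The function names and the 0.5 scaling factor are illustrative choices, not taken from the slides.

```python
import math


def binary_softmax(score_bg, score_fg):
    """Softmax over two class scores for one pixel; returns (p_bg, p_fg)."""
    m = max(score_bg, score_fg)  # subtract the max for numerical stability
    e_bg = math.exp(score_bg - m)
    e_fg = math.exp(score_fg - m)
    z = e_bg + e_fg
    return e_bg / z, e_fg / z


def per_pixel_euclidean_loss(pred, target):
    """Half the sum of squared per-pixel differences between two 2D maps.

    The 0.5 factor is a common convention and an assumption here.
    """
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
    return 0.5 * total


if __name__ == "__main__":
    print(binary_softmax(0.0, 0.0))
    print(per_pixel_euclidean_loss([[1, 2], [3, 4]], [[1, 2], [3, 2]]))
```

Only the pixel (row 1, column 1) differs between the two maps in the usage lines, so the loss reduces to half of the squared difference at that pixel.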

Page 5: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss

Appearance Stream

Page 6: Going Deeper into First-Person Activity Recognition

Appearance training

Hand mask

Location heatmap

Segmentation CNN

Localization CNN

fine-tune

Training data

Images

Ground-truth hand masks

Heatmaps (Gaussian bumps)
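A "Gaussian bump" heatmap of the kind listed in the training data can be sketched as below. This is a minimal illustration, assuming one unnormalized 2D Gaussian centred on the annotated object location. The function name, the `sigma` default and the peak value of 1.0 are hypothetical choices, not values given in the slides.

```python
import math


def gaussian_heatmap(height, width, center_row, center_col, sigma=2.0):
    """Return a height x width list of lists with a Gaussian bump at the centre.

    The peak value is 1.0 at (center_row, center_col) and decays with distance.
    """
    heatmap = []
    for r in range(height):
        row = []
        for c in range(width):
            squared_distance = (r - center_row) ** 2 + (c - center_col) ** 2
            row.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


if __name__ == "__main__":
    for row in gaussian_heatmap(5, 5, 2, 2, sigma=1.0):
        print(" ".join(f"{v:.2f}" for v in row))
```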

Page 7: Going Deeper into First-Person Activity Recognition

Appearance stream

Localization CNN

ObjectNet

(Appearance-based)

Segmentation CNN

Input video clip

Hand segmentation

interest region

Results
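One plausible way to turn a location heatmap into the "interest region" shown in this diagram is to crop a fixed-size window around the heatmap peak. The sketch below assumes exactly that. The helper names, the window size and the clamping rule are hypothetical, and the slides do not state how the region is actually chosen.

```python
def peak_location(heatmap):
    """Return (row, col) of the largest value in a 2D list of lists."""
    best = (0, 0)
    best_value = heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_value:
                best_value = value
                best = (r, c)
    return best


def crop_box(center, box_h, box_w, img_h, img_w):
    """Return (top, left, bottom, right) of a box_h x box_w window.

    The window is centred on `center` and shifted to stay inside the image.
    Assumes the box is no larger than the image.
    """
    r, c = center
    top = min(max(r - box_h // 2, 0), img_h - box_h)
    left = min(max(c - box_w // 2, 0), img_w - box_w)
    return top, left, top + box_h, left + box_w


if __name__ == "__main__":
    heatmap = [[0.0, 0.0, 0.0, 0.0],
               [0.0, 0.2, 0.9, 0.0],
               [0.0, 0.0, 0.1, 0.0]]
    peak = peak_location(heatmap)
    print(peak, crop_box(peak, 2, 2, 3, 4))
```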

Page 8: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss

Page 9: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss

Motion Stream

Page 10: Going Deeper into First-Person Activity Recognition

Motion stream

Results

(Motion-based)

ActionNet

Input video clip

Optical flow

~ ~

Start-End

Image frames

Start-End

Optical flow frames

Average

A fixed set of L frames

Pair of vertical and horizontal frames
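The motion-stream input described above (a fixed set of L frames, each a pair of horizontal and vertical optical-flow frames) can be sketched as a 2L-channel stack. This is a minimal illustration under stated assumptions: the even sampling between the clip's start and end, the interleaved channel order and both function names are hypothetical, and the "Average" step in the diagram is not modelled.

```python
def sample_indices(num_frames, L):
    """Pick L frame indices spread evenly from the start to the end of a clip."""
    if num_frames <= 0 or L <= 0:
        raise ValueError("num_frames and L must be positive")
    if L == 1:
        return [0]
    return [i * (num_frames - 1) // (L - 1) for i in range(L)]


def stack_flow(flow_x_frames, flow_y_frames):
    """Interleave L horizontal and L vertical flow frames into 2L channels."""
    if len(flow_x_frames) != len(flow_y_frames):
        raise ValueError("need the same number of horizontal and vertical frames")
    channels = []
    for flow_x, flow_y in zip(flow_x_frames, flow_y_frames):
        channels.append(flow_x)  # horizontal component of this frame
        channels.append(flow_y)  # vertical component of this frame
    return channels


if __name__ == "__main__":
    indices = sample_indices(10, 4)
    # Toy 1x1 "flow frames": the value encodes the frame index.
    flow_x = [[[float(i)]] for i in indices]
    flow_y = [[[float(-i)]] for i in indices]
    print(indices, len(stack_flow(flow_x, flow_y)))
```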

Page 11: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss

Page 12: Going Deeper into First-Person Activity Recognition

CNN Architecture

Fully Convolutional networks

Late Fusion

Binary Softmax layer

Per-pixel Euclidean loss

Full architecture
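As a rough illustration of the late-fusion idea in the full architecture, the sketch below concatenates a feature vector from each stream and applies one linear layer followed by a softmax over activity classes. The slides name only "late fusion", so the concatenation, the single linear layer, the function names and the toy weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(object_features, action_features, weights, biases):
    """Fuse two stream outputs into activity probabilities.

    weights holds one row per activity class, each row as long as the
    concatenated feature vector; biases holds one value per class.
    """
    fused = list(object_features) + list(action_features)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


if __name__ == "__main__":
    # Toy example: 2 appearance features, 2 motion features, 2 activities.
    probs = late_fusion([1.0, 0.0], [0.0, 1.0],
                        [[1, 0, 0, 1], [0, 1, 1, 0]], [0.0, 0.0])
    print(probs)
```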

Page 13: Going Deeper into First-Person Activity Recognition

Results

GTEA  GAZE  GAZE +44

Page 14: Going Deeper into First-Person Activity Recognition

Closer look at the GAZE Confusion matrix

Pictures taken from the GAZE dataset

Take peanut Open peanut

Close peanut

Scoop peanut

