Going Deeper into First-Person Activity Recognition. Slides by Alejandro Cartas. Minghuang Ma, Haoqi Fan and Kris M. Kitani. All diagrams and images are originally from Minghuang Ma et al. unless otherwise stated.

Going Deeper into First-Person Activity Recognition

Page 1: Going Deeper into First-Person Activity Recognition

Going Deeper into First-Person Activity Recognition

Slides by Alejandro Cartas

Minghuang Ma, Haoqi Fan and Kris M. Kitani

All diagrams and images are originally from Minghuang Ma et al. unless otherwise stated.

Page 2: Going Deeper into First-Person Activity Recognition

What is this paper about?

•  Proposes a two-stream CNN to recognize activities (object + action) in short egocentric videos.

[Figure: "Preparing a Hotdog" sequence. Pictures taken from the GTEA dataset.]

•  Take bread: action "Take", object BREAD
•  Take hotdog: action "Take", object HOTDOG (SAUSAGE)
•  Put hotdog on bread: action "Put", objects BREAD and HOTDOG (SAUSAGE)
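The "activity = object + action" idea on this slide can be illustrated with a minimal sketch. The label lists and the helper name `activity_label` are hypothetical and only mirror the hotdog example; a real dataset annotates only the pairings that actually occur.

```python
from itertools import product

# Hypothetical label sets taken from the hotdog example on this slide.
ACTIONS = ["take", "put"]
OBJECTS = ["bread", "hotdog"]


def activity_label(action, obj):
    """Join an action and an object into a single activity label."""
    return f"{action} {obj}"


# Every possible (action, object) pairing; datasets usually use a subset.
ALL_ACTIVITIES = [activity_label(a, o) for a, o in product(ACTIONS, OBJECTS)]
```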

Page 3: Going Deeper into First-Person Activity Recognition

Proposed approach

[Figure: overview of the proposed approach. Pictures taken from the GTEA dataset.]

•  Input video
•  Appearance stream: hand segmentation, object localization; output object 'milk container'
•  Motion stream: optical flow; output action 'take'
•  Combined output: activity 'take milk container'

Take bread sequence: frames labeled ARM+HAND and BREAD.

Page 4: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss
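As an illustration of the binary softmax layer listed above, here is a minimal sketch of a per-pixel two-class softmax followed by thresholding into a hand mask. The function names, the nested-list image layout and the 0.5 threshold are assumptions made for the example, not details taken from the slides.

```python
import math


def binary_softmax(logit_bg, logit_hand):
    """Return P(hand) for one pixel from two logits (numerically stable)."""
    m = max(logit_bg, logit_hand)
    e_bg = math.exp(logit_bg - m)
    e_hand = math.exp(logit_hand - m)
    return e_hand / (e_bg + e_hand)


def hand_mask(bg_logits, hand_logits, threshold=0.5):
    """Threshold per-pixel hand probabilities into a 0/1 mask.

    Both inputs are nested lists (rows of pixels) of the same shape.
    The threshold value is an assumption for this sketch.
    """
    return [
        [1 if binary_softmax(b, h) > threshold else 0
         for b, h in zip(bg_row, hand_row)]
        for bg_row, hand_row in zip(bg_logits, hand_logits)
    ]
```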

Page 5: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss

Appearance Stream

Page 6: Going Deeper into First-Person Activity Recognition

Appearance training

[Figure: training of the appearance stream.]

•  Segmentation CNN: predicts the hand mask
•  Localization CNN (fine-tune): predicts the location heatmap

Training data

•  Images
•  Ground-truth hand masks
•  Heatmaps (Gaussian bumps)
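The heatmap targets ("Gaussian bumps") and the per-pixel Euclidean loss can be sketched as follows. This is a minimal illustration under assumptions: the `sigma` value, the nested-list image layout and the 1/2 scaling of the loss are choices made here, not values stated in the slides.

```python
import math


def gaussian_heatmap(height, width, center_row, center_col, sigma=1.5):
    """Build a heatmap with a single Gaussian bump peaking at 1.0.

    sigma is an assumed value; the slides do not state one.
    """
    return [
        [math.exp(-((r - center_row) ** 2 + (c - center_col) ** 2)
                  / (2.0 * sigma ** 2))
         for c in range(width)]
        for r in range(height)
    ]


def per_pixel_euclidean_loss(pred, target):
    """Half the sum of squared per-pixel differences.

    The 1/2 factor is one common convention and is an assumption here.
    """
    return 0.5 * sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
```

A prediction that matches the target exactly gives a loss of zero, and the loss grows as the predicted bump drifts away from the annotated object location.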

Page 7: Going Deeper into First-Person Activity Recognition

Appearance stream

[Figure: appearance stream pipeline. Labels shown: input video clip, Segmentation CNN, hand segmentation, Localization CNN, interest region, ObjectNet (appearance-based), results.]
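One plausible way to turn a location heatmap into the "interest region" shown in the figure is to crop a fixed-size window around the heatmap peak. The sketch below assumes exactly that; the helper names, the square window and the clamping behavior are hypothetical and are not confirmed by the slides.

```python
def heatmap_peak(heatmap):
    """Return the (row, col) of the largest value in a nested-list heatmap."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def crop_region(frame, center, size):
    """Crop a size x size window around `center`, clamped to the frame.

    The fixed square window is an assumption made for this sketch.
    """
    height, width = len(frame), len(frame[0])
    top = min(max(center[0] - size // 2, 0), max(height - size, 0))
    left = min(max(center[1] - size // 2, 0), max(width - size, 0))
    return [row[left:left + size] for row in frame[top:top + size]]
```

The cropped window would then be the input to the object classifier (ObjectNet in the figure).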

Page 8: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss

Page 9: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss

Motion Stream

Page 10: Going Deeper into First-Person Activity Recognition

Motion stream

[Figure: motion stream pipeline. Labels shown: input video clip, optical flow, ActionNet (motion-based), results.]

•  Start-End image frames
•  Start-End optical flow frames
•  Average
•  A fixed set of L frames
•  Pair of vertical and horizontal frames
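The "fixed set of L frames" and "pair of vertical and horizontal frames" notes suggest an input built by stacking flow components. The sketch below shows one way to do this: it picks L evenly spaced flow frames from a clip and interleaves the horizontal and vertical components into 2L channels. The even spacing and the function names are assumptions; the slides do not say how the L frames are chosen.

```python
def sample_indices(num_frames, L):
    """Pick L evenly spaced frame indices from a clip (assumed strategy)."""
    if num_frames < 1 or L < 1:
        raise ValueError("num_frames and L must be positive")
    if L == 1:
        return [0]
    return [round(i * (num_frames - 1) / (L - 1)) for i in range(L)]


def stack_flow(flow_x, flow_y, L):
    """Interleave L horizontal/vertical flow frames into 2L channels.

    flow_x and flow_y are equal-length sequences of per-frame flow fields.
    """
    if len(flow_x) != len(flow_y):
        raise ValueError("flow_x and flow_y must have the same length")
    channels = []
    for i in sample_indices(len(flow_x), L):
        channels.append(flow_x[i])  # horizontal component
        channels.append(flow_y[i])  # vertical component
    return channels
```

Sampling a fixed L regardless of clip length gives the network an input with a constant number of channels, which is what a standard convolutional first layer expects.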

Page 11: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss

Page 12: Going Deeper into First-Person Activity Recognition

CNN Architecture

•  Fully convolutional networks
•  Late fusion
•  Binary softmax layer
•  Per-pixel Euclidean loss

Full architecture
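To make the idea of late fusion concrete, the sketch below combines the two streams by a weighted average of their per-class softmax scores. This is a deliberately simplified stand-in: the slides do not specify the fusion operator, and the equal weighting, function names and label list are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(appearance_scores, motion_scores, weight=0.5):
    """Weighted average of the two streams' class probabilities.

    Score averaging is a stand-in chosen for this sketch, not
    necessarily the fusion used in the paper.
    """
    p_app = softmax(appearance_scores)
    p_mot = softmax(motion_scores)
    return [weight * a + (1.0 - weight) * m for a, m in zip(p_app, p_mot)]


def predict(appearance_scores, motion_scores, labels, weight=0.5):
    """Return the label with the highest fused probability."""
    fused = late_fusion(appearance_scores, motion_scores, weight)
    return labels[max(range(len(fused)), key=fused.__getitem__)]
```

Each stream is run independently up to its class scores, and only those scores are merged, which is what distinguishes late fusion from combining the raw inputs.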

Page 13: Going Deeper into First-Person Activity Recognition

Results

[Figure: results on GTEA, GAZE and GAZE+ 44.]

Page 14: Going Deeper into First-Person Activity Recognition

Closer look at the GAZE confusion matrix

[Figure: example frames. Pictures taken from the GAZE dataset.]

•  Take peanut
•  Open peanut
•  Close peanut
•  Scoop peanut