Activity Recognition
Justin Liang
March 27, 2016
Agenda
• End-to-end Learning of Action Detection from Frame Glimpses in Videos. S. Yeung, O. Russakovsky, G. Mori, L. Fei-Fei. CVPR 2016.
• Detecting Events and Key Actors in Multi-Person Videos. V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy and L. Fei-Fei. CVPR 2016.
What is Activity Recognition?
• The idea is to be able to detect what event occurs in a video
  • Ex. diving, successful layup, failed layup, successful slam dunk, blocking, setting, standing
• Different subdomains of activity recognition:
  • Individual activity recognition
  • Group activity recognition
  • Temporal activity recognition
[Ibrahim et al. CVPR 2016]
End-to-end Learning of Action Detection from Frame Glimpses in Videos
End-to-end Learning of Action Detection from Frame Glimpses in Videos
• Paper from Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei in CVPR 2016.
• Objective:
  • Predict actions and their temporal bounds: how long and where they occur in a video clip. The video clips used are untrimmed.
• Key Contributions:
  • End-to-end approach to action detection and temporal localization in videos
  • Train an agent policy to skip video frames to find where the actions are in the video
  • Show that this method can outperform state-of-the-art results
Approach
• Action detection is a process of observation and refinement. Effectively choosing a sequence of frame observations allows us to quickly narrow down when the baseball swing occurs.
Approach (Pipeline)
• $o_n$: observation feature vector
• $h_n$: internal hidden state
• $d_n$: candidate detection
  • $s_n$: action start
  • $e_n$: action end
  • $c_n$: action confidence level
• $p_n$: indicator to emit action
• $l_{n+1}$: location of next observation, $l_n \in [0, 1]$
Observation Network
• Both the location $l_n$ and the video frame $v_{l_n}$ are mapped to a hidden space and then combined with a fully connected layer to produce the observation vector $o_n$
• $v_{l_n}$ is mapped using the VGG16 network, and fc7 features are extracted from it
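To make the fusion concrete, here is a minimal PyTorch-style sketch of such an observation network. It assumes 4096-d VGG16 fc7 features and a scalar normalized location; the hidden sizes, activations, and fusion layout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ObservationNetwork(nn.Module):
    """Sketch: fuse the normalized location l_n with fc7 features of the
    observed frame v_{l_n} into the observation vector o_n."""
    def __init__(self, fc7_dim=4096, hidden_dim=1024):  # sizes assumed
        super().__init__()
        self.frame_fc = nn.Linear(fc7_dim, hidden_dim)        # frame branch
        self.loc_fc = nn.Linear(1, hidden_dim)                # location branch
        self.fuse_fc = nn.Linear(2 * hidden_dim, hidden_dim)  # combine both

    def forward(self, fc7_feat, loc):
        # fc7_feat: (B, 4096) VGG16 fc7 features; loc: (B, 1) in [0, 1]
        frame_h = torch.relu(self.frame_fc(fc7_feat))
        loc_h = torch.relu(self.loc_fc(loc))
        return torch.relu(self.fuse_fc(torch.cat([frame_h, loc_h], dim=1)))
```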
Recurrent Network
• Observation features $o_n$ and the previous internal hidden state $h_{n-1}$ are inputs to the recurrent network $f_h$, which is parameterized by $\theta_h$ and produces $h_n$
• Candidate detection $d_n$:
  • $d_n = f_d(h_n; \theta_d)$, where $f_d$ is a fully connected layer
• Prediction indicator $p_n$:
  • $p_n = f_p(h_n; \theta_p)$, where $f_p$ is a fully connected layer
  • During training, $f_p$ is used to parameterize a Bernoulli distribution from which $p_n$ is sampled. At test time the MAP estimate is used.
• Location of next observation $l_{n+1}$:
  • $l_{n+1} = f_l(h_n; \theta_l)$, where $f_l$ is a fully connected layer
  • During training, $l_{n+1}$ is sampled from a Gaussian distribution with mean $f_l(h_n; \theta_l)$ and fixed variance. At test time the MAP estimate is used.
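A single agent step could look like the following sketch. It assumes an LSTM cell for $f_h$ and a sigmoid to keep the location in $[0, 1]$; the layer sizes and the fixed location standard deviation are placeholder assumptions. Sampling during training supports the REINFORCE training described next, while test time takes the MAP estimates.

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Normal

class GlimpseAgentStep(nn.Module):
    """Sketch of one timestep: consume o_n, update h_n, emit d_n, p_n, l_{n+1}."""
    def __init__(self, obs_dim=1024, hidden_dim=1024, loc_std=0.1):  # assumed
        super().__init__()
        self.rnn = nn.LSTMCell(obs_dim, hidden_dim)  # f_h(.; theta_h)
        self.det_fc = nn.Linear(hidden_dim, 3)       # f_d -> (s_n, e_n, c_n)
        self.pred_fc = nn.Linear(hidden_dim, 1)      # f_p -> Bernoulli prob
        self.loc_fc = nn.Linear(hidden_dim, 1)       # f_l -> Gaussian mean
        self.loc_std = loc_std                       # fixed variance

    def forward(self, o_n, state, training=True):
        h_n, cell_n = self.rnn(o_n, state)
        d_n = self.det_fc(h_n)                       # candidate detection
        p_prob = torch.sigmoid(self.pred_fc(h_n))
        loc_mean = torch.sigmoid(self.loc_fc(h_n))   # keep l in [0, 1]
        if training:                                 # sample for REINFORCE
            p_n = Bernoulli(p_prob).sample()
            l_next = Normal(loc_mean, self.loc_std).sample().clamp(0.0, 1.0)
        else:                                        # MAP estimates at test time
            p_n = (p_prob > 0.5).float()
            l_next = loc_mean
        return d_n, p_n, l_next, (h_n, cell_n)
```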
Training
• The goal is to train three outputs: candidate detection $d_n$, prediction indicator $p_n$, and location of next observation $l_{n+1}$
  • This is difficult due to the challenges of designing suitable loss and reward functions and handling non-differentiable model components
• We use backpropagation to train $d_n$, and REINFORCE to train $p_n$ and $l_{n+1}$
Training (Candidate Detection $d_n$)
• Match each candidate detection $D = \{d_n \mid n = 1, \ldots, N\}$ from the recurrent network to the ground truths $g_1, \ldots, g_M$
• Matching function:
  • $y_{nm} = \begin{cases} 1 & \text{if } m = \operatorname{argmin}_{j=1,\ldots,M} \, dist(l_n, g_j) \\ 0 & \text{otherwise} \end{cases}$
  • $g_m = (s_m, e_m)$
  • $dist(l_n, g_m) = \min(|s_m - l_n|, |e_m - l_n|)$
• Loss function:
  • $\sum_n L_{cls}(d_n) + \gamma \sum_n \sum_m \mathbb{1}[y_{nm} = 1] \, L_{loc}(d_n, g_m)$
  • $L_{cls}(d_n)$: cross-entropy loss on the detection confidence $c_n$
  • $L_{loc}(d_n, g_m)$: L2 loss to further minimize the distance $\|(s_n, e_n) - (s_m, e_m)\|$
• Optimize the loss using backpropagation
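As a worked illustration, the NumPy sketch below matches every candidate to its nearest ground-truth segment and accumulates the two loss terms. For simplicity it treats each matched candidate's confidence target as positive, which glosses over the label assignment details the slide leaves unspecified; the function name and its calling convention are hypothetical.

```python
import numpy as np

def match_and_loss(dets, locs, gts, gamma=1.0):
    """Sketch of the detection loss. dets: list of (s_n, e_n, c_n) candidates;
    locs: observation locations l_n; gts: list of (s_m, e_m) ground truths."""
    loss = 0.0
    for (s_n, e_n, c_n), l_n in zip(dets, locs):
        # dist(l_n, g_m) = min(|s_m - l_n|, |e_m - l_n|)
        dists = [min(abs(s - l_n), abs(e - l_n)) for (s, e) in gts]
        m = int(np.argmin(dists))                 # y_nm = 1 for this m only
        s_m, e_m = gts[m]
        # L_cls: cross-entropy on confidence c_n (positive label assumed here)
        loss += -np.log(np.clip(c_n, 1e-8, 1.0))
        # L_loc: L2 loss pulling (s_n, e_n) toward the matched (s_m, e_m)
        loss += gamma * ((s_n - s_m) ** 2 + (e_n - e_m) ** 2)
    return loss
```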
Training (Location $l_{n+1}$ and Prediction Indicator $p_n$)
• Use REINFORCE to learn the observation and emission policies
• REINFORCE:
  • Objective: $J(\theta) = \sum_{a \in \mathcal{A}} p_\theta(a) \, r(a)$
    • $\mathcal{A}$: space of action sequences
    • $p_\theta(a)$: probability of an action sequence
    • $r(a)$: reward
  • Gradient: $\nabla J(\theta) = \sum_{a \in \mathcal{A}} p_\theta(a) \, \nabla \log p_\theta(a) \, r(a)$
    • This is a nontrivial optimization problem due to the high-dimensional space of possible action sequences!
    • Instead, we can use Monte Carlo sampling to approximate the expectation:
  • $\nabla J(\theta) \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{n=1}^{N} \nabla \log \pi_\theta(a_n^k \mid h_{1:n}^k, a_{1:n-1}^k) \, R_n^k$
    • $K$: interaction sequences
    • $N$: RNN time steps
    • $\pi_\theta$: agent's policy
    • $a_n$: current action ($l_{n+1}$ or $p_n$)
    • $R_n$: cumulative reward from the current time step onward
    • $h_n$: hidden state
  • Optimize by maximizing the objective
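In an autograd framework, the Monte Carlo estimate reduces to a surrogate loss: sum $-\log \pi_\theta(a_n^k)\, R_n^k$ over sampled episodes and backpropagate. A minimal PyTorch sketch, assuming the log-probabilities come from the sampled Bernoulli/Gaussian actions of the agent step above:

```python
import torch

def reinforce_loss(log_probs, returns):
    """Surrogate loss whose gradient is the REINFORCE estimate:
    -(1/K) sum_k sum_n log pi_theta(a_n^k | .) * R_n^k.
    log_probs: K lists of scalar tensors, e.g. Bernoulli(p).log_prob(a);
    returns:   K lists of floats, the cumulative rewards R_n^k."""
    total = torch.zeros(())
    K = len(log_probs)
    for lp_seq, R_seq in zip(log_probs, returns):
        for lp, R in zip(lp_seq, R_seq):
            total = total - lp * R   # minimizing this ascends J(theta)
    return total / K
```

In practice a baseline is usually subtracted from $R_n$ to reduce the variance of this estimator; that refinement is omitted here for brevity.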
Training (Location $l_{n+1}$ and Prediction Indicator $p_n$)
• Reward function:
  • We want high precision and recall
  • $r_N = \begin{cases} R_p & \text{if } M > 0 \text{ and } N_p = 0 \\ N_+ R_+ + N_- R_- & \text{otherwise} \end{cases}$
    • $N_p$: # predictions emitted by the agent
    • $N_+$, $R_+$: # true positive predictions and their reward
    • $N_-$, $R_-$: # false positive predictions and their reward
    • $R_p$: penalty for not emitting a prediction when # ground truths $M > 0$
  • A prediction is correct if its overlap with a ground truth is greater than a threshold and higher than that of any other prediction
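A direct transcription of this reward, with the magnitudes $R_+$, $R_-$, $R_p$ left as placeholder values since the slide does not specify them:

```python
def episode_reward(n_emitted, n_tp, n_fp, n_gt,
                   R_pos=1.0, R_neg=-1.0, R_p=-1.0):  # magnitudes assumed
    """r_N = R_p                 if M > 0 and N_p = 0
           = N+ * R+ + N- * R-   otherwise
    n_emitted = N_p, n_tp = N+, n_fp = N-, n_gt = M."""
    if n_gt > 0 and n_emitted == 0:
        return R_p                         # penalty for staying silent
    return n_tp * R_pos + n_fp * R_neg     # precision/recall trade-off
```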
Strengths/Weaknesses of Approach
• Strengths:
  • Does not need to look at all the frames
  • End-to-end learning
• Weaknesses:
  • Needs all the frames in a clip (cannot do online detection)
  • It can be difficult to learn the observation policy if the event contains less discriminative movements
Results
• Results on THUMOS'14 compared with the top 3 performers. mAP is reported for different IOU thresholds $\alpha$
• Ablation studies show that without localization regression and learning where to observe next, results are significantly worse
Results (Learned Observation Policy)
Future Direction
• Learn joint spatio-temporal observation policies
Detecting Events and Key Actors in Multi-Person Videos
Detecting Events and Key Actors in Multi-Person Videos
• Paper from Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy and Li Fei-Fei in CVPR 2016.
• Objective:
  • Predict events and key actors in videos where multiple people are involved
• Key Contributions:
  • Introduce a large-scale basketball event dataset
  • Use attention to decide which people are most relevant to the action being performed
  • Show that the attention model results in better event recognition
Dataset
• Introduced a large dataset of multi-person action videos. The dataset consists of 257 NCAA games, each around 1.5 hours long. 11 different basketball events are densely annotated in the videos.
Approach
• Events in a team sport are performed by a set of key players. It is sufficient to focus only on the participating players to recognize an event. For example, a "steal" event in basketball is defined by the actions of the player attempting to pass the ball and the player stealing it.
• The idea is to focus on key players to predict events.
Approach (Pipeline)
• Each player track is processed by a BLSTM network. The output hidden state is processed by an attention model to identify key players.
• The thickness of the boxes shows the attention weights.
• Each video frame is also processed by a BLSTM network.
Feature Extraction
• Each video frame $t$ is represented as a feature vector $f_t$ from the activation of the last fully connected layer of the Inception7 network.
• Each player $i$'s bounding box is represented as a feature vector $p_{ti}$ from Inception7.
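Inception7 was an internal Google model; as a stand-in, comparable per-frame and per-box features could be pulled from torchvision's Inception v3, as in this sketch. The substitution is mine, not the paper's, and the helper names are hypothetical.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Stand-in extractor: torchvision Inception v3 in place of "Inception7".
# In practice, load pretrained weights and normalize inputs accordingly.
net = models.inception_v3(weights=None)
net.fc = torch.nn.Identity()   # expose the 2048-d pooled features
net.eval()

def frame_feature(img):
    """img: (3, H, W) float tensor in [0, 1]; returns a 2048-d vector f_t."""
    x = TF.resize(img, [299, 299]).unsqueeze(0)   # Inception v3 input size
    with torch.no_grad():
        return net(x).squeeze(0)

def player_feature(img, box):
    """Crop a player bounding box (x1, y1, x2, y2) and embed it as p_ti."""
    x1, y1, x2, y2 = box
    return frame_feature(img[:, y1:y2, x1:x2])
```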
Event Classification
• Compute a global context vector for each frame $t$:
  • $h_t^f = BLSTM_{frame}(h_{t-1}^f, h_{t+1}^f, f_t)$
• Next, compute the hidden state of the event at time $t$:
  • $h_t^e = LSTM(h_{t-1}^e, h_t^f, a_t)$
  • $a_t$ is the feature vector for the players from the attention model
• Predict the class label using $w_k^T h_t^e$
• Squared hinge loss function:
  • $L = \frac{1}{T} \sum_{k=1}^{K} \sum_{t=1}^{T} \max(0, 1 - y_k w_k^T h_t^e)^2$
  • $y_k$ is $1$ if the video belongs to class $k$ and $-1$ otherwise
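A compact PyTorch sketch of this classification head, with batch-first tensors and illustrative hidden sizes; `nn.LSTM(bidirectional=True)` stands in for the slide's $BLSTM_{frame}$, and the loss averages over the batch and classes as well as time for simplicity.

```python
import torch
import torch.nn as nn

class EventClassifier(nn.Module):
    """Sketch: frame BLSTM gives context h_t^f; an event LSTM consumes
    [h_t^f, a_t] to give h_t^e, scored per class by a linear layer w_k."""
    def __init__(self, frame_dim=1024, att_dim=1024, hidden=512, n_classes=11):
        super().__init__()
        self.frame_blstm = nn.LSTM(frame_dim, hidden, bidirectional=True,
                                   batch_first=True)
        self.event_lstm = nn.LSTM(2 * hidden + att_dim, hidden,
                                  batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)   # w_k^T h_t^e

    def forward(self, frame_feats, att_feats):
        # frame_feats: (B, T, frame_dim); att_feats: (B, T, att_dim)
        ctx, _ = self.frame_blstm(frame_feats)    # h_t^f: (B, T, 2*hidden)
        ev, _ = self.event_lstm(torch.cat([ctx, att_feats], dim=2))  # h_t^e
        return self.cls(ev)                       # per-frame class scores

def squared_hinge_loss(scores, labels):
    """L ~ mean over t, k of max(0, 1 - y_k * w_k^T h_t^e)^2.
    scores: (B, T, K); labels: (B, K) with entries in {+1, -1}."""
    margins = 1.0 - labels.unsqueeze(1) * scores
    return margins.clamp(min=0.0).pow(2).mean()
```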
Attention
• How do we get the feature vector $a_t$ for the players from the attention model?
Attention Models (with tracking)
• Attention model with KLT tracking for player $i$ and frame $t$:
  • $h_{ti}^p = BLSTM_{track}(h_{t-1,i}^p, h_{t+1,i}^p, p_{ti})$
  • $a_t^{track} = \sum_{i=1}^{N_t} \gamma_{ti}^{track} h_{ti}^p$
  • $\gamma_{ti}^{track} = softmax(\phi(h_t^f, h_{ti}^p, h_{t-1}^e); \tau)$
• $a_t$: weighted combination over players in frame $t$
• $\gamma_{ti}$: attention weights
• $N_t$: # player detections in frame $t$
• $\phi(\cdot)$: multilayer perceptron
• $\tau$: softmax temperature
Attention Models (without tracking)
• Attention model without tracking:
  • $a_t^{notrack} = \sum_{i=1}^{N_t} \gamma_{ti}^{notrack} p_{ti}$
  • $\gamma_{ti}^{notrack} = softmax(\phi(h_t^f, p_{ti}, h_{t-1}^e); \tau)$
• $a_t$: weighted combination over players in frame $t$
• $\gamma_{ti}$: attention weights
• $N_t$: # player detections in frame $t$
• $\phi(\cdot)$: multilayer perceptron
• $\tau$: softmax temperature
• $p_{ti}$: player feature vector from Inception7
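The attention weighting in both variants could be sketched as below: an MLP $\phi$ scores each player against the frame context and the previous event state, and a temperature-scaled softmax over the players in the frame gives the weights $\gamma_{ti}$. This version consumes raw player features $p_{ti}$ (the no-tracking case); the tracking variant would feed the BLSTM track states $h_{ti}^p$ instead. Layer sizes and the MLP form are assumptions.

```python
import torch
import torch.nn as nn

class PlayerAttention(nn.Module):
    """Sketch of the player attention: a_t = sum_i gamma_ti * p_ti, with
    gamma_ti = softmax(phi(h_t^f, p_ti, h_{t-1}^e); tau) over players."""
    def __init__(self, ctx_dim=1024, player_dim=1024, ev_dim=512,
                 hidden=256, tau=1.0):  # sizes and MLP depth assumed
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(ctx_dim + player_dim + ev_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.tau = tau

    def forward(self, h_f, players, h_e_prev):
        # h_f: (B, ctx_dim); players: (B, N_t, player_dim); h_e_prev: (B, ev_dim)
        B, N, _ = players.shape
        ctx = torch.cat([h_f, h_e_prev], dim=1)      # shared per-frame context
        ctx = ctx.unsqueeze(1).expand(-1, N, -1)     # broadcast to each player
        scores = self.phi(torch.cat([players, ctx], dim=2)).squeeze(2)
        gamma = torch.softmax(scores / self.tau, dim=1)   # attention weights
        a_t = (gamma.unsqueeze(2) * players).sum(dim=1)   # weighted combination
        return a_t, gamma
```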
Strengths/Weaknesses of Approach
• Strengths:
  • Attention focuses on the key players
• Weaknesses:
  • Needs all the frames in a clip (cannot do online detection)
  • The model tends to be reluctant to switch attention between players in a scene
Results (Event Classification)
• Here we compare the ability to classify isolated video clips into 11 classes
• Attention is particularly good for shot-based events, where attending to the shot-making person or the defenders can be useful
Results (Event Detection)
• Here we compare the ability to temporally localize events in untrimmed videos, using a 4-second sliding window through all the videos
• The steal event is particularly challenging, as it is often mistaken for a pass
• Combining the player features by averaging, without using attention, performs very well too
  • Possibly because the algorithm has difficulty changing attention, since we are dealing with untrimmed videos
Results (Attention)
• The attended player is in cyan and the ball is in yellow
• Results show that the model attends to the player making the shot at the beginning
Results (Attention Heatmap)
• The distribution of attention shows that attention initially focuses on the shooter and then disperses later in the event
Wrap Up
• Questions?
• Suggestions?