
Action Recognition using Visual Attention
Shikhar Sharma, Ryan Kiros and Ruslan Salakhutdinov
Department of Computer Science, University of Toronto

MOTIVATION
Attention-based models have been shown to achieve promising results on several challenging tasks, including caption generation [9], machine translation [1], game-playing and tracking [4].
Attention-based models can potentially infer the action happening in videos by focusing only on the relevant places in each video frame.
Soft-attention models are deterministic and can be trained using backpropagation.
We propose a soft-attention based model for action recognition in videos.
We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
Our model tends to recognize important elements in video frames based on the activities it detects.

THE ATTENTION MECHANISM AND THE MODEL

(a) The soft-attention mechanism (b) Our recurrent model

We extract the last GoogLeNet [8] convolutional layer for the video frames.
The last convolutional layer is a feature cube of shape K × K × D (7 × 7 × 1024 here).
Feature slices: the K^2 D-dimensional vectors within a feature cube,

\[ X_t = [X_{t,1}, \ldots, X_{t,K^2}], \qquad X_{t,i} \in \mathbb{R}^D. \]
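As a concrete illustration, a minimal numpy sketch of this slicing step; the array contents and names below are stand-ins, not the actual feature-extraction pipeline:

```python
import numpy as np

# Assumed sizes from the poster: a 7 x 7 x 1024 GoogLeNet feature cube.
K, D = 7, 1024
feature_cube = np.random.randn(K, K, D)          # stand-in for a real conv-layer output

# Flatten the K x K spatial grid into K^2 feature slices X_{t,1..K^2},
# each a D-dimensional vector.
feature_slices = feature_cube.reshape(K * K, D)  # shape (49, 1024)
```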

Each of these K^2 vertical feature slices maps to a different overlapping region of the input space, and our model chooses to focus its attention on these K^2 regions.
The location softmax l_t over the K^2 locations is:

\[ l_{t,i} = p(L_t = i \mid h_{t-1}) = \frac{\exp(W_i^\top h_{t-1})}{\sum_{j=1}^{K \times K} \exp(W_j^\top h_{t-1})}, \qquad i \in 1 \ldots K^2, \]

where
  W_i     - the weights mapping to the i-th element of the location softmax
  L_t     - a random variable which can take 1-of-K^2 values
  h_{t-1} - the hidden state at time-step t-1

The soft attention mechanism computes the expected value of the input at the next time-step, x_t:

\[ x_t = \mathbb{E}_{p(L_t \mid h_{t-1})}[X_t] = \sum_{i=1}^{K \times K} l_{t,i} X_{t,i}, \]

where
  X_t - the feature cube at time-step t
  x_t - the input to the LSTM at time-step t
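The two equations above amount to a softmax over locations followed by a weighted average of the feature slices. A minimal numpy sketch; the weight matrix, hidden state, and hidden size H are illustrative assumptions, not values taken from the poster:

```python
import numpy as np

def soft_attention_step(feature_slices, h_prev, W):
    """One soft-attention step: location softmax l_t and expected input x_t.

    feature_slices: (K*K, D) array, the slices X_{t,i} of the feature cube
    h_prev:         (H,)     array, the LSTM hidden state h_{t-1}
    W:              (K*K, H) array, row i holds the weights W_i for location i
    """
    scores = W @ h_prev                            # W_i^T h_{t-1} for every location i
    scores -= scores.max()                         # stabilise the softmax numerically
    l_t = np.exp(scores) / np.exp(scores).sum()    # location softmax over the K^2 regions
    x_t = l_t @ feature_slices                     # x_t = sum_i l_{t,i} X_{t,i}, shape (D,)
    return l_t, x_t

# Toy usage with assumed sizes (K = 7, D = 1024, hidden size H = 512).
rng = np.random.default_rng(0)
K, D, H = 7, 1024, 512
l_t, x_t = soft_attention_step(rng.standard_normal((K * K, D)),
                               rng.standard_normal(H),
                               rng.standard_normal((K * K, H)))
```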

LOSS FUNCTION
Loss function: cross-entropy loss coupled with the doubly stochastic penalty introduced in [9]:

\[ L = -\sum_{t=1}^{T} \sum_{i=1}^{C} y_{t,i} \log \hat{y}_{t,i} \;+\; \lambda \sum_{i=1}^{K \times K} \Big(1 - \sum_{t=1}^{T} l_{t,i}\Big)^{2} \;+\; \gamma \sum_{i} \sum_{j} \theta_{i,j}^{2}, \]

where
  y_t  - the one-hot label vector
  ŷ_t  - the vector of class probabilities at time-step t
  T    - the total number of time-steps
  C    - the number of output classes
  λ    - the attention penalty coefficient
  γ    - the weight-decay coefficient on the model parameters θ
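A minimal numpy sketch of this loss under the definitions above; the penalty coefficients and array shapes below are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def attention_loss(y_true, y_prob, attn, params, lam=1.0, gamma=1e-5):
    """Cross-entropy + doubly stochastic attention penalty + weight decay.

    y_true: (T, C)   one-hot label vectors y_t
    y_prob: (T, C)   predicted class probabilities (the y-hat_t above)
    attn:   (T, K*K) location softmax values l_{t,i}
    params: list of weight arrays theta
    lam, gamma: placeholder penalty coefficients, not the paper's settings
    """
    eps = 1e-8
    cross_entropy = -np.sum(y_true * np.log(y_prob + eps))
    # Push the total attention each region receives over all T steps towards 1,
    # so the model looks at every part of the frame at some point in the video.
    attn_penalty = lam * np.sum((1.0 - attn.sum(axis=0)) ** 2)
    weight_decay = gamma * sum(np.sum(p ** 2) for p in params)
    return cross_entropy + attn_penalty + weight_decay
```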

UCF-11, HMDB-51 AND HOLLYWOOD2: QUALITATIVE ANALYSIS
Success cases: Figures (a)-(i)

(a) cycling (b) pushup (c) DriveCar

(d) walking with a dog (e) draw_sword (f) Kiss

(g) push (h) trampoline jumping (i) golf swinging
Failure cases: Figures (j)-(l)

(j) “kick_ball” misclassified as “somersault” (k) “soccer juggling” misclassified as “diving” (l) “flic_flac” misclassified as “hit”

We can see that to classify the corresponding activities correctly, the model focuses on:
Fig. (a): parts of the cycle
Fig. (b): the person doing push-ups
Fig. (c): the steering wheel and the rear-view mirror
Fig. (d): the dogs and the person

Among the failure cases:
Fig. (j): the model misclassifies the example despite attending to the relevant location
Fig. (l): the model misclassifies the example and does not even attend to the relevant location

OPTIMIZING ATTENTION WITH THE CORRECT LABELS

(First) The original video frames
(Second) Failure case of the model. Prediction: tennis swinging
(Third) Random initialization. Prediction: tennis swinging
(Fourth) Attention after optimization. Prediction: soccer juggling

UCF-11, HMDB-51 AND HOLLYWOOD2: QUANTITATIVE ANALYSIS
Table 1: Performance on UCF-11 (acc %), HMDB-51 (acc %) and Hollywood2 (mAP %)

Model                                         UCF-11   HMDB-51   Hollywood2
Softmax Regression (full CNN feature cube)     82.37     33.46        34.62
Avg pooled LSTM (@ 30 fps)                     82.56     40.52        43.19
Max pooled LSTM (@ 30 fps)                     81.60     37.58        43.22
Soft attention model (@ 30 fps)                84.96     41.31        43.91

Table 2: Comparison of performance on HMDB-51 and Hollywood2 with state-of-the-art models

Model                                            HMDB-51 (acc %)   Hollywood2 (mAP %)
Spatial stream ConvNet [5]                                  40.5                    -
Soft attention model (Our model)                            41.3                 43.9
Composite LSTM Model [6]                                    44.0                    -
DL-SFA [7]                                                     -                 48.1
VideoDarwin [2]                                             63.7                 73.7
Objects+Traditional+Stacked Fisher Vectors [3]              71.3                 66.4

We have divided Table 2 into three sections:
First section: models using only RGB data
Second section: models using both RGB and optical flow data
Third section: models using RGB, optical flow, and object responses on some ImageNet categories

Using hybrid soft- and hard-attention models in the future can potentially reduce computation cost and allow scaling to larger datasets like Sports-1M.

REFERENCES
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.

[2] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.

[3] M. Jain, J. C. v. Gemert, and C. G. M. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, June 2015.

[4] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[5] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS. 2014.

[6] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. ICML, 2015.

[7] L. Sun, K. Jia, T. Chan, Y. Fang, G. Wang, and S. Yan. DL-SFA: deeply-learned slow feature analysis for action recognition. In CVPR, 2014.

[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[9] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
