Human Action Recognition with Kinect using a
Joint Motion Descriptor
Somar Boubou, Einoshin Suzuki
09.02.2015
Journal of Intelligent Information Systems 44 (1), 49-65
• Introduction /7 slides/.
• Related work /6 slides/.
• Proposed approach /9 slides/.
• Experimental results /9 slides/.
• Detailed analysis /5 slides/.
• Experiments and detailed analysis with a public challenging MSR-Action3D dataset. /7 slides/
• Summary.
Structure of this presentation
Vision-based human action recognition
Objective: using machine learning techniques to detect and understand the meaningful motion of the human body.
Kinect V1.8
Kinect is a motion-sensing input device with an RGB camera, a depth sensor, and a multi-array microphone.
Skeleton joints: [0] HipCenter, [1] Spine, [2] ShoulderCenter, [3] Head, [4] ShoulderLeft, [5] ElbowLeft, [6] WristLeft, [7] HandLeft, [8] ShoulderRight, [9] ElbowRight, [10] WristRight, [11] HandRight, [12] HipLeft, [13] KneeLeft, [14] AnkleLeft, [15] FootLeft, [16] HipRight, [17] KneeRight, [18] AnkleRight, [19] FootRight
According to the Aggarwal review * :
Action primitives /Gestures/
- Elementary movements of a person's body part
- Can be described at the limb level (e.g., “Raising an arm”)
Actions
- Consist of several action primitives
- Could be temporally organized (e.g., “Walking”)
Activities
- Contain a number of subsequent actions (e.g., “Playing soccer”)
* [Aggarwal and Ryoo. Human activity analysis: A review, 2011.]
Taxonomies of human motion
• Depth frames are a rich and informative source of data:
- They support recognition of complex actions and activities.
- Processing them is time-consuming (not suitable for real-time applications).
• Recognizing gestures and actions in real time is very useful for human-machine interaction.
• Researchers are seeking an effective and fast descriptor for human action recognition that exploits a simple data source such as skeleton data.
Significance
Intra- and inter-class variations
- Speed of movement - Style of movement
• A single action can be performed in a wide variety of ways.
• Different actions can be highly similar to one another.
Environment and recording settings
- View point - Recording rate
- Partial occlusion
Computational time
- Complex descriptors lead to long computation times.
Challenges of action recognition
A new feature descriptor for human actions represented by 3D skeleton series acquired by Kinect.
• Fast and effective
• Scale-invariant
• Speed-invariant
• Length-invariant
• Outperforms the state-of-the-art descriptors.
Publications:
S. Boubou & E. Suzuki, “Classifying Actions Based on Histogram of Oriented Velocity Vectors”, J. Intell. Inf. Syst. 44(1): 49-65 (2015)
Contribution
Related work
Methodologies of human action recognition
Single-layered approach:
- Recognition of gestures
- Actions with sequential characteristics
- Low computational cost
Hierarchical approach:
- Complex activities
- High-level human actions
- High computational cost
Methodologies
Single-layered approach
Hierarchical approach
- To demonstrate a low computational cost.
- Suitable to be implemented for action recognition by robots.
- Fast and accurate enough for human-robot interaction.
Methodologies
Single-layered approach
- Space-time approach
- Sequential approach (e.g., Hidden Conditional Random Fields) [Wang and Mori, CVPR 2009.]
Hierarchical approach
Space-time approaches
- Space-time volumes
- Trajectories
- Feature descriptors
[Sheikh et al., ICCV 2005]
[Liu and Shah, CVPR 2008]
[Gorelick et al., PAMI 2007]
Histograms feature descriptors
[Ikizler and Duygulu. Histogram of oriented rectangles: A new pose descriptor for human action recognition. Image and Vision Computing, 2009.]
[Oreifej and Liu. HON4D. CVPR 2013.]
[Li, Zhang, and Liu. Action recognition based on a bag of 3D points, CVPR 2010.]
Histogram of Oriented Velocity Vectors(HOVV)
Proposed approach:
We base our proposed motion understanding approach on the fact that the orientations of joint velocity vectors change over time with respect to the actions carried out.
Proposed approach
Proposed action classification approach
[Diagram: skeleton sequence → feature extraction (HOVV, descriptors of joint groups) → feature vectors → classifier (KNN with a distance function, or SVM or ELM) → class labels.]
Feature extraction
The orientation of the velocity vector is defined in a spherical coordinate system by two angles:
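$$V_t^{i,j} = \left(v_{t,x}^{i,j},\; v_{t,y}^{i,j},\; v_{t,z}^{i,j}\right) = \left(x_{t+1}^{i,j} - x_t^{i,j},\; y_{t+1}^{i,j} - y_t^{i,j},\; z_{t+1}^{i,j} - z_t^{i,j}\right)$$
$$\alpha_t^{i,j} = \arctan\frac{v_{t,y}^{i,j}}{v_{t,x}^{i,j}}$$
$$\beta_t^{i,j} = \arctan\frac{v_{t,z}^{i,j}}{\sqrt{\left(v_{t,x}^{i,j}\right)^2 + \left(v_{t,y}^{i,j}\right)^2}} + 90^\circ$$
$i$: action identifier. $j$: joint identifier. $t$: time.
$$v_{tot} = \sum_{t=1}^{T-1} \sum_{j=1}^{J} \left(v_{t,x}^{j} + v_{t,y}^{j} + v_{t,z}^{j}\right)$$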
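The velocity and orientation definitions above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, `atan2` is an assumption used so that α can cover the full [0°, 360°] range, and β is assumed to be the elevation angle computed from the z component so that it falls in [0°, 180°].

```python
import math

def velocity_vectors(frames):
    """Frame-to-frame joint velocities.

    frames: list of T frames, each a list of J (x, y, z) joint positions.
    Returns T-1 lists of J velocity tuples (position at t+1 minus position at t).
    """
    return [
        [(q[0] - p[0], q[1] - p[1], q[2] - p[2]) for p, q in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]

def orientation(v):
    """Return (alpha, beta) in degrees for one velocity vector v = (vx, vy, vz).

    Assumption: atan2 replaces arctan so that alpha lies in [0, 360];
    beta is the elevation angle shifted by 90 degrees into [0, 180].
    A zero vector maps to (0, 90) because atan2(0, 0) is 0.
    """
    vx, vy, vz = v
    alpha = math.degrees(math.atan2(vy, vx)) % 360.0
    beta = math.degrees(math.atan2(vz, math.hypot(vx, vy))) + 90.0
    return alpha, beta
```

How motionless joints (zero velocity) should be treated is not stated on the slide; here they simply fall into the bin containing (0°, 90°).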
Feature arrays
𝑖: action identifier. 𝑗 → 𝐽: joint identifier. 𝑡 → 𝑇: time.
Our proposed histogram descriptor
[Figure: histogram grid with 𝛼 (torso joints) on one axis and 𝛽 on the other; bins run from 𝒃1,1 to 𝒃𝐵,𝐵/2.]
• B = 24.
• 𝛼 ∊ [0°, 360°], 𝛽 ∊ [0°, 180°].
• Each bin represents a degree interval equal to 360/B = 360/24 = 15°.
• 𝒃1,1 represents the occurrence of the joints where 𝛼 ∊ [0°, 15°], 𝛽 ∊ [0°, 15°].
• The occurrence counts of all bins are scaled into the interval [0, 255], where 0 is shown as black and 255 as white.
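The binning described above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and scaling by the largest bin count is an assumption, since the slide only says the occurrences are mapped into [0, 255].

```python
def hovv_histogram(angles, B=24):
    """Build a B x B/2 orientation histogram scaled to [0, 255].

    angles: iterable of (alpha, beta) pairs in degrees,
            alpha in [0, 360] and beta in [0, 180].
    Each bin covers 360 / B degrees on both axes (15 degrees for B = 24).
    """
    width = 360.0 / B
    hist = [[0] * (B // 2) for _ in range(B)]
    for alpha, beta in angles:
        # Clamp so that alpha = 360 and beta = 180 land in the last bin.
        a = min(int(alpha // width), B - 1)
        b = min(int(beta // width), B // 2 - 1)
        hist[a][b] += 1
    peak = max(max(row) for row in hist)
    if peak == 0:
        return [[0.0] * (B // 2) for _ in range(B)]
    # Assumption: linear scaling so the fullest bin becomes 255 (white).
    return [[255.0 * count / peak for count in row] for row in hist]
```

For example, with B = 24 a pair (100°, 95°) falls into row 6 and column 6, because both angles lie in the interval [90°, 105°).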
[Figure: histograms of 𝛽 against 𝛼 for three joint groups: torso, upper limbs, and lower limbs.]
Our proposed histogram descriptor
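One way to split the orientation angles by joint group before building one histogram per group is sketched below. The joint indices follow the Kinect joint list given earlier, but the exact assignment of joints to the torso, upper-limb, and lower-limb groups is an assumption made for illustration; the slides do not list it, and the names used here are hypothetical.

```python
# Kinect v1 joint indices (0 = HipCenter ... 19 = FootRight).
# Assumption: this three-way split is illustrative, not taken from the slides.
JOINT_GROUPS = {
    "torso": [0, 1, 2, 3],
    "upper_limbs": [4, 5, 6, 7, 8, 9, 10, 11],
    "lower_limbs": [12, 13, 14, 15, 16, 17, 18, 19],
}

def angles_by_group(angles):
    """Split per-joint angles into joint groups.

    angles: list over time steps, each a list of 20 (alpha, beta) pairs
            indexed by joint.
    Returns {group name: flat list of (alpha, beta) pairs for that group}.
    """
    return {
        name: [frame[j] for frame in angles for j in joints]
        for name, joints in JOINT_GROUPS.items()
    }
```

Each group's list can then be binned into its own histogram, which gives one descriptor per group (g = 3 in this sketch).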
Feature vector
$$f_{kl} = \left(b_{11}, b_{21}, \ldots, b_{k1}, b_{12}, b_{22}, \ldots, b_{k2}, \ldots, b_{1l}, \ldots, b_{kl}\right)$$
The attributes of the feature vector $f$ are the normalized densities of the histogram bins, in [0, 255], and $v_{tot}$ is added to the feature vector.
The new feature vector will be: $f_{kl} = \left(f_{kl}, v_{tot}\right)$,
where $v_{tot}$ is defined as:
$$v_{tot} = \sum_{t=1}^{T-1} \sum_{j=1}^{J} \left(v_{t,x}^{j} + v_{t,y}^{j} + v_{t,z}^{j}\right)$$
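Flattening the histogram and appending the total velocity could look like the sketch below. The function name is hypothetical, and the plain sum of velocity components follows the formula as it appears on the slide; whether the original used absolute values or magnitudes cannot be told from the extracted text.

```python
def feature_vector(hist, velocities):
    """Flatten a k x l histogram and append v_tot.

    hist: nested list where hist[a][b] holds bin b_{a+1, b+1}.
    velocities: list over time steps of lists over joints of (vx, vy, vz).
    Returns [b11, b21, ..., bk1, b12, ..., bkl, v_tot], with the first
    bin index varying fastest as in the slide's ordering.
    """
    k, l = len(hist), len(hist[0])
    flat = [hist[a][b] for b in range(l) for a in range(k)]
    # Sum of velocity components over all time steps and joints.
    v_tot = sum(vx + vy + vz for frame in velocities for (vx, vy, vz) in frame)
    return flat + [v_tot]
```

With B = 24 and a single joint group this yields 24 × 12 bin values plus one extra attribute.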
• No need for a spatial transform.
• Each skeleton is rotated around a vertical axis passing through the skeleton’s Hip-center.
Rotation transform:
• The rotation angle is 𝜽𝒊.
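A rotation about a vertical axis through the hip center can be sketched as below. It assumes that y is the vertical axis of the skeleton coordinate system and that the hip center is joint 0; the angle is passed in as a parameter because the slide does not say how 𝜽𝒊 is obtained. The function name is hypothetical.

```python
import math

def rotate_about_hip(skeleton, theta_deg, hip_index=0):
    """Rotate all joints about the vertical line through the hip center.

    skeleton: list of (x, y, z) joint positions.
    theta_deg: rotation angle in degrees.
    Assumption: y is the vertical axis, so only x and z change.
    """
    hx, _, hz = skeleton[hip_index]
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    rotated = []
    for x, y, z in skeleton:
        dx, dz = x - hx, z - hz  # position relative to the hip center
        rotated.append((hx + c * dx + s * dz, y, hz - s * dx + c * dz))
    return rotated
```

The hip center stays fixed and every joint keeps its height and its horizontal distance to the hip center.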
Experimental Results
Office dataset / parameter tuning /
One subject, 9 actions, 100 segmented samples. Recording rate is 15 frames/sec. Each sample has a fixed length of 30 frames.
Actions: “standing-up”, “sitting-down”, “stretching”, “scratching the head”, “crossing hands”, “sliding the chair forward/backward while sitting”, “re-positioning chair”, “taking a rest by bending on the desk” and “putting two hands on the table”.
Office dataset
[Figure: bar chart of accuracy (0% to 100%) for distance functions 1 to 5, with series for 14 joints, 16 joints, and 20 joints.]
• 1-nearest neighbor classification method
• Distance functions are: 1. Bhattacharyya 2. Correlation 3. Chi-Square χ² 4. Intersection 5. Hellinger
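A 1-nearest-neighbor classifier over histogram feature vectors can be sketched as follows, shown with two of the listed distance functions. The formulas are common textbook definitions (a symmetric chi-square and the Hellinger distance on normalized histograms); the slides do not give the exact variants used, so treat them, and the function names, as assumptions.

```python
import math

def chi_square(a, b):
    """Symmetric chi-square distance: sum of (a_i - b_i)^2 / (a_i + b_i).

    Bins where both histograms are zero are skipped.
    """
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)

def hellinger(a, b):
    """Hellinger distance after normalizing each histogram to sum to 1.

    Requires non-negative entries and a non-zero sum in both histograms.
    """
    sa, sb = sum(a), sum(b)
    bc = sum(math.sqrt((x / sa) * (y / sb)) for x, y in zip(a, b))
    return math.sqrt(max(0.0, 1.0 - bc))

def nearest_neighbor(train, query, dist):
    """Return the label of the training sample closest to the query.

    train: list of (feature vector, label) pairs.
    dist: function taking two feature vectors and returning a distance.
    """
    return min(train, key=lambda item: dist(item[0], query))[1]
```

A usage example: `nearest_neighbor([([4, 0, 0], "a"), ([0, 4, 0], "b")], [3, 1, 0], chi_square)` picks the label of the first sample, whose histogram is closer to the query. Intersection and correlation are similarity scores rather than distances, so they would need to be negated before being used with `min`.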
3-D skeleton dataset
3-D skeleton dataset
9 actions, 9 subjects × 2 trials = 162 samples. Recording rate is 15 frames/sec. In total 10687 frames (jpg, depth).
Actions: “sitting”, “standing”, “waving”, “walking”, “picking-up”, “stretching”, “using a hammer”, “drawing a circle”, “forward-punching”.
3-D skeleton dataset
• 𝑘-nearest neighbor classification method (𝑔 = 3)
• 20 × 10-fold cross validation
[Figure: accuracy (0% to 100%) against number of histogram bins (4, 8, 16, 24, 36, 72) for k = 5, k = 6, and k = 7.]
Comparison with the state-of-the-art methods
Method                            Accuracy
Our approach (HOVV)               88.90%
Other approaches:
*   HCRF [ω=0]                    54.17%
*   HCRF [ω=1]                    56.94%
**  HON4D                         81.94%
**  HON4DA                        76.39%
*** Chen and Koskela [4] (SVM)    78.32%
*** Chen and Koskela [4] (ELM)    87.71%
We compare accuracy with:
*   Sequential representation approach with HCRF classifier.
**  4-D space-time representation approach with SVM classifier.
*** Sequential data with SVM and ELM classifiers.
Method                   Computational latency (sec), feature extraction
Our approach (HOVV)      0.055
Other approaches:
*  HCRF [ω=0]            0.017
*  HCRF [ω=1]            0.058
** HON4D                 34.755
** HON4DA                6.706
We compare computational latency with:
*  Sequential representation approach with HCRF classifier.
** 4-D space-time representation approach with SVM classifier.
Comparison with the state-of-the-art methods
Method                 Total classification latency       Classification
                       (train time + test time) (sec)     accuracy
Proposed descriptor
  Average              3.168                              88.75%
  Minimum              2.964                              84.72%
  Maximum              3.51                               91.67%
Chen and Koskela
  Average              3.389                              87.71%
  Minimum              3.136                              82.27%
  Maximum              3.651                              89.88%
We compare with the Chen and Koskela 2013* descriptor. ELM is used as the classification method.
* Classification of RGB-D and Motion Capture Sequences Using Extreme Learning Machine. Scandinavian Conference on Image Analysis, 2013.
Comparison with the state-of-the-art methods
Detailed analysis of proposed descriptor performance
• Joint grouping proposal.
• Number of histogram bins.
• Performance per action.
• What kind of actions is HOVV suitable for?
Joint grouping with 3-D skeleton dataset
• Three scenarios of joint grouping:
• To obtain a fast and efficient motion descriptor, the three joint-grouping scenarios mentioned above are analyzed.
• A motion descriptor with optimal efficiency must show the highest recognition accuracy with the lowest computational time.
[Figure: "Accuracy & computational time", classification accuracy against time (ms), for HOVV_SVM (g=3), HOVV_SVM (g=8), HOVV_SVM (g=20), HCRF [w=1], and POS (SVM); highlighted point (8.33, 81.94%) at g=3, B=16.]
Method    Accuracy    Time
HON4D     81.94%      34.76 sec
HON4DA    76.39%      6.71 sec
Grouping joints into three groups allows us to outperform the state-of-the-art methods with very low computational time.
[Figure: "Accuracy & computational time (Logarithmic scale)", classification accuracy against time (ms), for HOVV_SVM (g=3), HOVV_SVM (g=8), HOVV_SVM (g=20), HCRF [w=1], POS (SVM), HON4D, and HON4DA; highlighted point (8.33, 81.94%) at g=3, B=16.]
Grouping joints into three groups allows us to outperform the state-of-the-art methods with very low computational time.
Recognition accuracy per action
Action              My method (SVM),   HON4D    HON4DA   Chen [SVM]   Chen [ELM]
                    B=24, g=20
Sitting             75.00%             75%      75%      79.8%        77.50%
Standing            100%               62.5%    62.5%    93.9%        98.21%
Waving              100%               37.5%    25%      59%          65.54%
Walking             100%               100%     100%     100%         94.71%
Picking-up          100%               87.5%    75%      51%          64.18%
Stretching          70.00%             100%     75%      65%          92.30%
Using a hammer      100%               100%     87.5%    53%          82.91%
Drawing a circle    87.00%             87.5%    87.5%    100%         96.40%
Forward-punching    75.00%             87.5%    100%     98.7%        98.41%
Overall             88.89%             81.94%   76.39%   78.32%       85.57%
Comparison of accuracy for each action with the 3-D skeleton dataset
For a more detailed analysis of the joint grouping and of the method's performance per action, we apply the proposed method to a public dataset provided by Microsoft:
MSR-Action3D dataset
• What kind of actions is HOVV suitable for?
Descriptor performance by action
- To show which kinds of actions the proposed descriptor is effective with, I conducted experiments on the public Microsoft MSR-Action3D dataset*:
* http://research.microsoft.com/en-us/um/people/zliu/ActionRecoRsrc/
10 subjects, 20 actions
Each subject performs each action 2 or 3 times.
There are 557 samples in total.
- Recognition accuracy of each action is demonstrated under various conditions (i.e., number of bins and joint grouping).
- In the following three slides we show the confusion matrices of descriptor performance with
- Bins number = 36
- g=20, g=8 and g=3
Comparison with the state-of-the-art methods
Method                              Overall accuracy    Accuracy by action
My method                           85.64%              Available
Actionlet                           88.2%               Available
HON4D (HON4DA)                      88.89% (85.85%)     -
Action Graph on Bag of 3D Points    74.7%               -
HMM                                 63%                 -
Dynamic Temporal Warping            54%                 -
Recurrent Neural Network            42.5%               -
MSR-Action3D dataset
Accuracy for each action (B=36, g=8)
Action    Recognition accuracy (My method)    Recognition accuracy (Actionlet)
High arm waving 88.9% 91.67%
Horizontal arm waving 96.2% 100%
Hammering 74.1% 83.33%
Hand catching 76% 25.00%
Forward punching 65.4% 72.73%
High throwing 88.5% 72.73%
Drawing X 63% 53.85%
Drawing tick 76.7% 100%
Drawing circle 76.7% 100%
Hand clapping 93.3% 100%
Two hand waving 93.3% 100%
Side boxing 100% 86.67%
Bending 81.5% 93.33%
Forward kicking 86.2% 100%
Side kicking 95% 100%
Jogging 100% 100%
Tennis swinging 90% 100%
Tennis serving 90% 100%
Golf swinging 96.7% 100%
Pick-Up throwing 77.8% 64.29%
Overall 85.64% 88.2%
Recognition accuracy per action (MSR3D)
Action    B=36, g=20    B=36, g=8    B=36, g=3
High arm waving 7.4% 88.9% 88.9%
Horizontal arm waving 26.9% 96.2% 84.6%
Hammering 7.4% 74.1% 77.8%
Hand catching 40% 76% 72%
Forward punching 15.4% 65.4% 61.5%
High throwing 19.2% 88.5% 88.5%
Drawing X 18.5% 63% 66.7%
Drawing tick 36.7% 76.7% 73.3%
Drawing circle 23.3% 76.7% 76.7%
Hand clapping 43.3% 93.3% 90%
Two hand waving 76.7% 93.3% 90%
Side boxing 26.7% 100% 100%
Bending 63% 81.5% 74.1%
Forward kicking 51.7% 86.2% 82.8%
Side kicking 40% 95% 95%
Jogging 83.3% 100% 96.7%
Tennis swinging 43.3% 90% 93.3%
Tennis serving 53.3% 90% 90%
Golf swinging 60% 96.7% 96.7%
Pick-Up throwing 74.1% 77.8% 81.5%
Legend: Accuracy increased / No change / Accuracy decreased < 5% / Accuracy decreased > 5%
Recognition accuracy per action (3DOffice)
Action              B=16, g=20    B=16, g=8    B=16, g=3
Sitting             62.5%         62.5%        50%
Standing            100%          100%         100%
Waving              100%          100%         100%
Walking             100%          100%         100%
Picking-up          100%          75%          75%
Stretching          87.5%         87.5%        62.5%
Using a hammer      75%           87.5%        75%
Drawing a circle    87.5%         87.5%        87.5%
Forward-punching    75%           87.5%        87.5%
[Figure: accuracy against number of bins (36, 24, 16, 8) for each of the 20 MSR-Action3D actions, with actions marked as Very Effective, Effective, or Not Effective. Legend: Accuracy increased / No change / Accuracy decreased < 5% / Accuracy decreased > 5%.]
Effective with:
- Periodic actions.
- Actions with unique joint trajectories.
Grouping is less effective with actions that involve the movement of only a few joints with small displacements.
Grouping joints into eight groups is always effective with actions of the MSR3D dataset.
Summary
We proposed a novel descriptor for the motion of skeleton joints.
The proposed descriptor proved to outperform state-of-the-art descriptors such as HON4D and the one proposed by Chen et al. 2013.
Our proposed approach proved to be effective for periodic actions (e.g., Waving, Walking, Jogging, Side-Boxing, etc.).
Grouping was effective for actions with unique joint trajectories (e.g., Tennis serving, Side kicking, etc.).
Grouping joints into eight groups is always effective with actions of the MSR3D dataset.
The end