
Upper Body Poses and Gestures in Classroom Speaker Videos

Fall 2017

Ishan Manjani
Master of Science in Computer Science

Columbia University
[email protected]

January 2, 2018


Contents

1 Introduction

2 Database

3 Feature Representation
  3.1 Joint Location Detection
  3.2 Upper Body Joint Angles
  3.3 Upper Body Features via Coordinate Information

4 Upper Body Poses
  4.1 K-means
  4.2 PCA and Eigen Poses

5 Upper Body Gestures
  5.1 PCA and Eigen Gestures

6 Relationship with Hand Annotated Upper Body Gestures

7 Future Work


List of Figures

1  A video frame from the "Visual Perspective" module.
2  Body keypoints found on running the ConvNet model [Belagiannis and Zisserman, 2017] on a frame from the "Visual Perspective" video.
3  A visual of body keypoint detection by the OpenPose library.
4  Body keypoint locations on the human skeleton as detected by the OpenPose library.
5  Left shoulder angles (radians) vs. frame number for the "Visual Perspective" video.
6  Right shoulder angles (radians) vs. frame number for the "Visual Perspective" video.
7  Left elbow angles (radians) vs. frame number for the "Visual Perspective" video.
8  Right elbow angles (radians) vs. frame number for the "Visual Perspective" video.
9  A plot of the X coordinate value of the left elbow.
10 A plot of the Y coordinate value of the left elbow.
11 A plot of the X coordinate value of the left wrist.
12 A plot of the Y coordinate value of the left wrist.
13 Mean poses obtained via the K-means algorithm run on the X, Y coordinates of both wrists, elbows, and shoulders with k set to 4.
14 A graph of SSE vs. k while running K-means on coordinate information from the "Visual Perspective" video.
15 Mean pose computed while running PCA using coordinate information from the "Visual Perspective" video.
16 Visualization of a few top eigenvectors ("eigen poses") generated from PCA using coordinate information from the "Visual Perspective" video.
17 Cumulative eigen energy preserved while running PCA using temporal coordinate information from the "Visual Perspective" video.


Video                  Length (m:s)   fps   # Frames   Resolution
"Visual Perspective"   4:56           30    8873       1920 × 1080
"Bicycle"              4:59           30    8967       1920 × 1080
"Road Paving"          4:39           30    8147       1920 × 1080

Table 1: Details of each video as part of the database.

1 Introduction

The broader objective of this project is to understand which attributes of a speaker in a speech or lecture setting lead to retention of concepts and are highly correlated with audience engagement. Some of these attributes are visual appearance (poses, gestures, hand movement, head movement), speech content and delivery, and the visual data and text on the accompanying presentation.

This project focuses on visual appearance, i.e., speaker poses and gestures. We use a database of three speaker videos shot in a controlled environment for the study, and employ computer vision techniques to make sense of the raw video data.

The report is organized as follows. We describe the database videos and then look at the meaningful features that can be extracted from the visual data, along with the tools used to extract these feature representations. We then explore several algorithms for extracting meaningful information from upper body speaker poses and gestures, looking at how much information each pose and gesture conveys and how common each one is. Finally, we examine whether the algorithmic processing of the visual data agrees with manual annotation of speaker gestures.

2 Database

The database consists of three modules titled "Visual Perspective", "Bicycles", and "Road Paving". Each module is made up of three videos: a frontal view of the speaker beside the presentation projection, a speaker overlay in the right corner of the presentation screen, and the presentation slides in isolation. Figure 1 shows a frame from the "Visual Perspective" module. Each video has the speaker's narration as the audio signal.

Each video is approximately 5 minutes in length, shot at 30 fps, with a resolution of 1920 × 1080. Table 1 details the properties of each video file. For upper body pose and gesture understanding, the video with the frontal view of the speaker beside the presentation projection is used from each module.

3 Feature Representation

Each frame is processed and represented by four features which encode information about the speaker's upper body. The feature vector corresponding to each frame can be analyzed in isolation to understand various speaker poses.


Figure 1: A video frame from the "Visual Perspective" module.

If temporal information is included, i.e., feature vectors are studied as part of a sequence across time, then such a sequence can be understood as a gesture. A gesture is a sequence of poses. While analyzing a sequence, a particular pose can be said to belong to, or be part of, the gesture.

3.1 Joint Location Detection

Detecting upper body joint locations is a principal step towards understanding upper body poses and gestures. Joint detection is a well studied problem and various techniques have been proposed as solutions. A few of these are based on ConvNet models [Belagiannis and Zisserman, 2017] or Part Affinity Fields (PAFs) [Cao et al., 2016]. Implementations and models are available for both techniques. We tried both methods on a couple of frames from our videos to determine which is more suitable for our task.

• ConvNet Model [Belagiannis and Zisserman, 2017]

On running the implementation on frames from our videos, we find many false markings of body joint locations. When a raw frame is provided without any processing, the body keypoints for certain frames are found in the content of the slides; this is illustrated in Figure 2. To overcome this, the speaker is cropped from the frame before running the algorithm. Even then there are multiple false markings.


Figure 2: Body keypoints found on running the ConvNet model [Belagiannis and Zisserman, 2017] on a frame from the "Visual Perspective" video.


Figure 3: A visual of body keypoint detection by the OpenPose library.

For example, certain lower body keypoints are marked in the head/neck region, and other upper body keypoints are mixed up among themselves.

• Part Affinity Fields (implemented as the OpenPose library) [Cao et al., 2016]
The OpenPose library worked notably well on small clips from the video dataset, providing accurate body keypoint locations. There are some drops in a few body keypoints across time, but the drop frequency is very low. This problem can be solved by either ignoring frames with dropped keypoints or interpolating values across time (a sketch of such interpolation follows this list). A visual of body keypoint detection by OpenPose is shown in Figure 3.
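
The report does not specify how dropped keypoints would be interpolated; the following is a minimal sketch of one possibility, linear interpolation across time of low-confidence detections, assuming the keypoints are stored as a (frames × keypoints × 2) coordinate array with a matching confidence array. The function name and threshold are illustrative.

```python
import numpy as np

def interpolate_dropped_keypoints(xy, conf, min_conf=0.1):
    """Fill in keypoints that were dropped (low confidence) by linear
    interpolation across time.

    xy:   (num_frames, num_keypoints, 2) array of x, y pixel coordinates
    conf: (num_frames, num_keypoints) array of detection confidences
    """
    filled = xy.copy()
    frames = np.arange(xy.shape[0])
    for k in range(xy.shape[1]):             # treat each keypoint independently
        good = conf[:, k] >= min_conf        # frames where the detection succeeded
        if good.sum() < 2:
            continue                         # not enough data to interpolate
        for d in range(2):                   # x and y separately
            filled[~good, k, d] = np.interp(frames[~good], frames[good], xy[good, k, d])
    return filled
```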

From these short experiments, OpenPose seems the more reliable choice for further experiments. It has several additional advantages: OpenPose is a real-time multi-person system which can jointly detect human body, hand, and facial keypoints in each frame of the input videos, and the time to process a frame is independent of the number of people in the image.

We used OpenPose to detect the 18 keypoints appearing on the speaker's body for each of the three videos. The body keypoint locations on the human skeleton are shown in Figure 4. The videos were resized from 1920 × 1080 to 854 × 480 for faster computation. The implementation processes approximately 2 frames (of size 854 × 480) per second on an NVIDIA Tesla K80 GPU.
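
As an illustration of how the detections can be collected, the sketch below reads the per-frame JSON keypoint files that OpenPose can write. The exact field name (pose_keypoints_2d vs. pose_keypoints) varies between OpenPose versions, and keeping only the first detected person is our assumption for the single-speaker videos.

```python
import glob
import json
import numpy as np

def load_openpose_keypoints(json_dir):
    """Collect per-frame OpenPose output into a (num_frames, 18, 3) array of
    (x, y, confidence), keeping only the first detected person (the speaker)."""
    frames = []
    for path in sorted(glob.glob(json_dir + "/*.json")):
        with open(path) as f:
            people = json.load(f).get("people", [])
        if not people:
            frames.append(np.zeros((18, 3)))        # nothing detected in this frame
            continue
        # The field name differs between OpenPose versions.
        flat = people[0].get("pose_keypoints_2d") or people[0].get("pose_keypoints")
        frames.append(np.asarray(flat, dtype=float).reshape(-1, 3)[:18])
    return np.stack(frames)
```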


Figure 4: Body keypoint locations on the human skeleton as detected by the OpenPose library.


Figure 5: Left shoulder angles (radians) vs. frame number for the "Visual Perspective" video.


Figure 6: Right shoulder angles (radians) vs. frame number for the "Visual Perspective" video.


3.2 Upper Body Joint Angles

In this section we look at upper body joint angles computed from the body joint locations. The joint angles are features of the speaker's body computed from the raw data; they bring the data into a form which is easier to interpret with respect to upper body poses and gestures than raw video frames.

We examine four body angles: the right and left shoulder angles, and the right and left elbow angles. They are computed in the following way. From the body joint locations, the lines joining the shoulders to the elbows, the elbows to the wrists, and the two shoulders to each other are treated as vectors, and simple vector algebra yields the required angles.
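
A minimal sketch of this computation, assuming each joint is available as an (x, y) pixel coordinate; the joint names and the exact sign convention are assumptions, since the report does not spell them out.

```python
import numpy as np

def signed_angle(u, v):
    """Signed angle (radians) from vector u to vector v in the image plane."""
    return np.arctan2(np.cross(u, v), np.dot(u, v))

def upper_body_angles(joints):
    """joints: dict mapping an (assumed) joint name to an np.array([x, y]).
    Returns the four angles examined in this section."""
    r_sh, r_el, r_wr = joints["RShoulder"], joints["RElbow"], joints["RWrist"]
    l_sh, l_el, l_wr = joints["LShoulder"], joints["LElbow"], joints["LWrist"]
    shoulder_line = l_sh - r_sh
    return {
        # shoulder angles: between the shoulder line and the upper arm
        "right_shoulder": signed_angle(shoulder_line, r_el - r_sh),
        "left_shoulder": signed_angle(-shoulder_line, l_el - l_sh),
        # elbow angles: between the upper arm and the forearm
        "right_elbow": signed_angle(r_sh - r_el, r_wr - r_el),
        "left_elbow": signed_angle(l_sh - l_el, l_wr - l_el),
    }
```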

Figures 5 and 6 show plots of the two shoulder angles (radians) vs. frame number for the "Visual Perspective" video. The sign of an angle value indicates the clockwise or counterclockwise direction. The angle values rest around a mean value but frequently beat, i.e., increase or decrease briefly. The mean angle values can be interpreted as the mean position of the speaker throughout the video, while the short sequences of frames over which a value deviates and returns to the mean represent movements performed by the speaker.

The elbow angles are illustrated in Figures 7 and 8. They display behavior similar to the shoulder angles: the mean angles correspond to the mean resting position, whereas deviations represent motion in the speaker's upper body. However, close inspection of the left elbow angle plot reveals excessive spikes in the direction opposite to the general trend of the angle values. The spike values are π radians. The frames corresponding to the spikes show that they are caused by 3D body parts being projected to 2D: from the camera's perspective, the elbow angle appears to be π radians whenever the shoulder, elbow, and wrist fall on the same straight line in the image, even though the angle in 3D is not π radians.

Because of this problem arising from the 3D to 2D projection, which is especially amplified when angles are computed from 2D data alone, we look at another way of encoding the frames into meaningful upper body features.

3.3 Upper Body Features via Coordinate Information

We deploy a much simpler and hopefully more reliable method of extracting upper body features from frames. The X and Y coordinates of the shoulders, elbows, and wrists serve as a feature vector of size 12 for the upper body. This feature vector is easily extracted from the 18-keypoint output provided by OpenPose, as sketched below.
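
A sketch of the extraction, assuming OpenPose's 18-keypoint COCO ordering (right shoulder, elbow, wrist at indices 2-4 and left shoulder, elbow, wrist at indices 5-7); the indices should be checked against the OpenPose version actually used.

```python
import numpy as np

# Assumed OpenPose COCO-18 indices: 2 RShoulder, 3 RElbow, 4 RWrist,
# 5 LShoulder, 6 LElbow, 7 LWrist.
UPPER_BODY = [2, 3, 4, 5, 6, 7]

def pose_features(keypoints):
    """keypoints: (num_frames, 18, 3) array of (x, y, confidence).
    Returns a (num_frames, 12) matrix of the X, Y coordinates of both
    shoulders, elbows, and wrists, one row per frame."""
    return keypoints[:, UPPER_BODY, :2].reshape(len(keypoints), 12)
```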

Figures 9, 10, 11, and 12 show plots of the X and Y values of the left elbow and wrist. They display a pattern similar to the one observed in the angle plots: the coordinate values remain near a mean position and show slight deviations, moving away from the mean in both the positive and negative directions when the speaker makes a hand or body gesture. The deviations form hills in the plots, indicating sequences of frames which are likely to gather audience attention, lead to a spike in the audience EEG data, or change the diameter of the audience's pupils.

The problem of excessive spikes towards extreme values is not present when features are derived from coordinate data instead of joint angles.


Figure 7: Left elbow angles (radians) vs. frame number for the "Visual Perspective" video.


Figure 8: Right elbow angles (radians) vs. frame number for the "Visual Perspective" video.


Figure 9: A plot of the X coordinate value of the left elbow.


Figure 10: A plot of the Y coordinate value of the left elbow.


Figure 11: A plot of the X coordinate value of the left wrist.


Figure 12: A plot of the Y coordinate value of the left wrist.


The 2D projection of the 3D joint positions does not introduce distortions into the X, Y values when the 3D to 2D mapping is done.

The plots show reliable data and have no unexplained spikes, and are therefore used for further experiments, with the following important observation: the deviations of the coordinate values from their means are probably the regions of interest for determining which speaker gestures or poses gather audience attention or lead the audience to listen carefully.

4 Upper Body Poses

The 12 values (the X, Y coordinates of both wrists, elbows, and shoulders) serve as a feature vector for the upper body. Each video is approximately five minutes long at 30 fps, leading to approximately 9000 frames per video, and each frame is represented by its upper body pose information. Upper body poses can be grouped and studied further via techniques like K-means and Principal Component Analysis.

4.1 K-means

For each video, the per-frame feature vectors are taken as raw data and fed as input to the K-means algorithm, with k = 3, 4, and 10. The means obtained when k is set to 4 are shown in Figure 13; each subfigure represents a mean pose. The stick figure in each subplot is color coded as follows (see the plotting sketch below): the line joining the shoulders is red, the right shoulder and right elbow are joined by a blue line, green joins the right elbow to the right wrist, the left shoulder to left elbow segment is pale yellow, and the left elbow to left wrist segment is black.
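
For reference, a small sketch of how such a stick figure could be drawn with the color coding above, assuming the 12-D vector ordering from the earlier feature-extraction sketch; "khaki" stands in for the pale yellow.

```python
import matplotlib.pyplot as plt

# Segments of the stick figure and their colors, as described above.
# Index pairs refer to (x, y) positions inside the 12-D pose vector, assuming
# the ordering RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist.
SEGMENTS = [
    ((0, 1), (6, 7), "red"),      # shoulder to shoulder
    ((0, 1), (2, 3), "blue"),     # right shoulder to right elbow
    ((2, 3), (4, 5), "green"),    # right elbow to right wrist
    ((6, 7), (8, 9), "khaki"),    # left shoulder to left elbow (pale yellow)
    ((8, 9), (10, 11), "black"),  # left elbow to left wrist
]

def draw_pose(pose, ax=None):
    """Draw a single 12-D pose vector as a color-coded stick figure."""
    ax = ax or plt.gca()
    for (x1, y1), (x2, y2), color in SEGMENTS:
        ax.plot([pose[x1], pose[x2]], [pose[y1], pose[y2]], color=color)
    ax.set_aspect("equal")
    ax.invert_yaxis()             # image coordinates grow downwards
```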

Some of these poses are variations of the mean pose assumed by the speaker throughout the video. The mean pose has the speaker's arms by the sides of his body, resting downwards, with the wrists raised a little relative to where they would be if the elbows were completely straight. The other mean poses obtained from K-means represent a pointing gesture (towards the presentation to the right of the speaker) or a pose with the arms raised.

The mean poses computed from the coordinate information of the joint locations lead to visualizations which are plausible human poses. It could have happened that the means found from the data did not look like common or feasible human poses; that they do indicates that the algorithm is functioning well and that the data is not excessively noisy.

K-means for this experiment is run with various values of k. A standard technique for picking k is to run the algorithm with multiple values, say k = 1, 2, ..., 10, and look at the sum of squared errors (SSE) between the points and their assigned means. Figure 14 shows a graph of SSE vs. k. The SSE falls quickly for small values of k and then decreases only slowly as k increases further, so the plot is expected to look like an elbow marking the transition between the fast fall and the slow decrease.


Figure 13: Mean poses obtained via the K-means algorithm run on the X, Y coordinates of both wrists, elbows, and shoulders with k set to 4.


Figure 14: A graph of SSE vs. k while running K-means on coordinate information from the "Visual Perspective" video.


In the plot for our data the elbow is not very distinct, but it appears near k = 3 and k = 4. If K-means is run with, say, k = 4, it produces a classification of all the poses into 4 categories, each represented by a mean pose.
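
A sketch of this clustering and of the SSE-vs-k curve using scikit-learn; the report does not state which K-means implementation was used, so this is illustrative only, operating on the (num_frames, 12) pose matrix from the earlier sketch.

```python
from sklearn.cluster import KMeans

def mean_poses(features, k=4):
    """Cluster the (num_frames, 12) pose matrix; each center is a mean pose."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return km.cluster_centers_, km.labels_

def sse_curve(features, k_values=range(1, 11)):
    """Sum of squared errors (inertia) for a range of k, as in Figure 14."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
            for k in k_values]
```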

4.2 PCA and Eigen Poses

The feature vectors representing each video frame can also be analyzed using Principal Component Analysis (PCA). The mean feature vector, referred to here as the mean pose, is subtracted from each data point. The features corresponding to each frame are stacked one on top of another to create the data matrix F, and the data matrix is used to compute the scatter matrix S.

Let f_1, f_2, ..., f_n represent the feature vectors of the n frames. The data matrix F is computed as follows:

\bar{f} = \frac{1}{n} \sum_{j=1}^{n} f_j \qquad (1)

f'_i = f_i - \bar{f} \qquad (2)

F = [f'_1, f'_2, \ldots, f'_n] \qquad (3)

The scatter matrix S is computed as follows:

S = \sum_{j=1}^{n} (f_j - \bar{f})(f_j - \bar{f})^T \qquad (4)

The eigenvectors and corresponding eigenvalues of the scatter matrix S are computed. The eigenvectors represent the directions of maximum variance in the data, in decreasing order, starting with the eigenvector corresponding to the largest eigenvalue. Let e_1, e_2, ..., e_m denote the m eigenvectors and λ_1, λ_2, ..., λ_m the corresponding eigenvalues.

Each mean-subtracted feature vector f' is represented in the eigen space as follows:

f' = \sum_{i=1}^{m} a_i e_i \qquad (5)

Here a_1, a_2, ..., a_m are scalars giving the projection of the vector f' onto each eigenvector; since the eigenvectors are orthonormal, a_i = e_i^T f'.

The original feature vector can be recovered by adding the mean back:

f = \bar{f} + \sum_{i=1}^{m} a_i e_i \qquad (6)

The mean pose is illustrated in Figure 15. The pose is similar to the mean poses of the speaker found via K-means. The appearance resembles the speaker's resting position.


Figure 15: Mean pose computed while running PCA using coordinate information from the "Visual Perspective" video.


The variation of poses from the mean pose is captured in the directions of the eigenvectors, and the eigenvectors can be visualized as poses in the following way. An eigenvector can be interpreted as a pose having a projection value of +1 (the maximum normalized projection value) in its own direction and a projection of 0 on all other eigenvectors. This visualization represents the extreme effect of the eigenvector on the mean pose in the positive direction; similarly, the extreme effect in the negative direction can be examined. Eigenvector e_i is visualized as a pose in the following way:

ep_i^{+} = \bar{f} + 1 \cdot e_i \qquad (7)

ep_i^{-} = \bar{f} - 1 \cdot e_i \qquad (8)

Visualizations of the eigen poses ep_1, ep_2, ..., ep_m convey the directions in which the data varies from the mean pose. Each frame's feature vector (pose information) is nothing but the mean pose plus a linear combination of these eigen poses.

A few of the top eigenvectors generated from the pose information are visualized in Figure 16. The solid blue line denotes the mean pose, while the dotted line represents the eigen pose. The eigen directions show pointing poses, hands-raised poses, and poses with the arms flexed towards the ground.

The original data lives in a 12-dimensional space; however, fewer dimensions may be sufficient to represent the data with minimal loss of information. If the top k eigenvectors are used instead of all m, the dimensionality of the pose information can be reduced while keeping the loss of information small:

f \approx \bar{f} + \sum_{i=1}^{k} a_i e_i \qquad (9)

Each frame's feature vector f can then be represented by a new feature vector of size k composed of the projections a_1, a_2, ..., a_k. The eigenvalues of the scatter matrix represent the amount of the data's variance preserved along each eigenvector. The eigen energy, a metric for how much of the data's variance is preserved, can be used to select the number of eigen components required:

E_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \times 100\% \qquad (10)

E_i represents the fraction of the data's variance preserved by eigenvector i. The cumulative eigen energy of a set of eigenvectors is used to select how many are required; in our case, we select the top eigenvectors which preserve at least 90% of the cumulative eigen energy.

For each video, the top 4 eigenvectors preserve 90% of the cumulative eigen energy. Therefore, via PCA, the dimensionality of the data is reduced from 12 to 4. The projections of the data points onto the eigenvectors carry the important information about the data, which is used for further understanding of the poses or for classification.
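
A compact sketch of the PCA procedure of this section (Eqs. (1)-(10)), including the 90% cumulative eigen energy cutoff; the function and variable names are ours.

```python
import numpy as np

def eigen_poses(features, energy=0.90):
    """PCA on the (n, 12) pose matrix following Eqs. (1)-(10). Returns the mean
    pose, the eigenvectors as columns sorted by decreasing eigenvalue, and the
    number of components needed to keep `energy` of the cumulative eigen energy."""
    mean_pose = features.mean(axis=0)                   # Eq. (1)
    centered = features - mean_pose                     # Eq. (2)
    scatter = centered.T @ centered                     # Eq. (4)
    eigvals, eigvecs = np.linalg.eigh(scatter)          # symmetric, so eigh
    order = np.argsort(eigvals)[::-1]                   # decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()     # cumulative eigen energy
    k = int(np.searchsorted(cumulative, energy)) + 1
    return mean_pose, eigvecs, k

def project(features, mean_pose, eigvecs, k):
    """Projection coefficients a_1 ... a_k of each pose, Eq. (5)."""
    return (features - mean_pose) @ eigvecs[:, :k]
```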


Figure 16: Visualization of a few top eigenvectors ("eigen poses") generated from PCA using coordinate information from the "Visual Perspective" video.


5 Upper Body Gestures

Upper body gestures can be understood in a way similar to upper body poses. The X, Y coordinate plots for some joints, shown in Figures 9, 10, 11, and 12, show that the hills and dips away from the mean positions of the X, Y values represent gestures of the speaker.

Gestural information can be derived from the data through a sequence of pose information. To encode gestural information, a window of, say, 2 seconds is moved over the coordinate feature vectors. Each instant of time is then represented by a feature vector which includes temporal information from 1 second before to 1 second after the current instant. Since the videos are shot at 30 fps, each feature vector would be of size 720 (12 × 2 × 30). A feature vector of size 720 is quite large, so sub-sampling is performed: only every 5th frame within the 2-second window is kept, so 12 of each window's 60 frames are used, giving a feature vector of size 144.
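
A sketch of this windowing, assuming the per-frame 12-D pose matrix from Section 3.3; frames within one second of either end of the video are skipped since their window would be incomplete (the report does not say how the boundaries were handled).

```python
import numpy as np

def gesture_features(pose_feats, fps=30, window_s=2, step=5):
    """Slide a 2-second window over the (num_frames, 12) pose matrix, keeping
    every 5th frame, so each vector has (fps * window_s / step) * 12 = 144 values."""
    half = fps * window_s // 2                            # 30 frames on each side
    vectors = []
    for t in range(half, len(pose_feats) - half):
        window = pose_feats[t - half : t + half : step]   # 12 subsampled frames
        vectors.append(window.reshape(-1))                # flatten to 144-D
    return np.array(vectors)
```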

5.1 PCA and Eigen Gestures

PCA is run on these 144-dimensional feature vectors just as it was run on the pose feature vectors: the mean vector is subtracted from each feature vector, the scatter matrix is computed on the mean-subtracted data, and its eigenvectors and eigenvalues are computed.

Figure 17 shows a graph of the eigen energy preserved as new eigenvectors are added in order of decreasing eigenvalue. Merely 9-10 eigenvectors out of 144 preserve more than 90% of the variability in the data.

The mean vector is referred to as the mean gesture, and each eigenvector is referred to as an eigen gesture.

Similar to the eigen poses, visualizations of the eigen gestures convey the directions in which the data varies from the mean gesture. Each feature vector (gestural information) can be understood as the mean gesture plus a linear combination of the eigen gestures. Visualizations of the eigen gestures show a lot of hand and elbow movement; some of them represent pointing gestures, moving gestures, hand raising gestures, and so on.

Animated gif files of eigen gestures for each video module are attached as additionalmaterial with this report.

6 Relationship with Hand Annotated Upper Body Gestures

As part of the project, the videos have been hand annotated with gestures for the intervals when the speaker makes a gesture. The movements have been labeled using the four classical gesture categories: 'beats', 'metaphorics', 'deictics', and 'iconics'. These gesture labels can be summarized as follows:

• Iconics: they illustrate the semantic content of speech (concrete physical features of objects, actions, or events).


Figure 17: Cumulative eigen energy preserved while running PCA using temporal coordinate information from the "Visual Perspective" video.


• Metaphorics: they illustrate the semantic content of speech (abstract and imaginary ideas or objects; note that the concept they represent has no physical form).

• Beats: simple gestures of emphasis and pacing.

• Deictics: pointing to objects present in the speaker's environment (actual physical entities and locations, or spaces that have previously been marked as relating to some idea or concept).

We look forward to seeing whether algorithmic approaches to coding gestural information, such as PCA, can pick up specific hand annotated gestures. The important thing to note is that algorithmic approaches work strictly from the information supplied as data; as the illustrations show, they encode the movements into the eigenvectors quite well. The hand annotated labels, on the other hand, incorporate information beyond what appears in the video frames: they capture the overall setting in which a gesture is made. For example, a gesture of both hands raised to the chest may be seen in two manually annotated categories, beats and metaphorics. Without context the movement is labeled as a beat, while the same movement in a metaphorical context falls in the metaphoric category.

An important difference between the algorithmic encoding of gestures and the hand annotated gestures is the time scale at which each operates. The algorithmic approaches work at the frame level, i.e., their timeline is 1/30th of a second (at 30 fps), whereas the manual annotation is done with respect to the speaker's sentences, so a label appears every 1-3 seconds. In order to compare the two approaches, the timelines need to be aligned. One way to do this is to upscale the algorithmic output: each video frame is assigned a gesture category based on the eigenvector with the highest PCA score, and the most frequent gesture category in a block of 30 frames (1 s) is the category assigned to that second (a sketch of this appears below). This procedure assigns a label to each second of the video. For several instances in the video, both the visible gesture and the algorithmic encoding were inspected. In some cases the eigenvector with the top score clearly shows the gesture; in most cases, however, the exact gesture performed does not lie in the top few eigenvectors but in the middle eigen gestures, roughly in the range 10-40.
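
A sketch of this up-scaling step; using the absolute value of the PCA projection as the "score" of an eigen gesture is our assumption.

```python
import numpy as np

def per_second_labels(projections, fps=30):
    """projections: (num_frames, num_eigen_gestures) PCA scores per frame.
    Label each frame with its highest-scoring eigen gesture, then take a
    majority vote inside every 1-second (30-frame) block."""
    frame_labels = np.abs(projections).argmax(axis=1)
    second_labels = []
    for start in range(0, len(frame_labels), fps):
        block = frame_labels[start:start + fps]
        second_labels.append(np.bincount(block).argmax())  # most frequent label
    return np.array(second_labels)
```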

The top eigenvectors, when sorted by eigen energy, capture the maximum variability of the data. They are broad, and every gesture uses some of these movements. Common gestures such as raising the hands, turning from left to right or vice versa, and pointing to the screen are all captured in the top 9 eigenvectors. Subtle actions, such as moving one hand rather than both, hand movement in a particular direction, or pointing while the hand is raised, are captured in the middle eigenvectors; such subtle gestures are better captured there than in the top-ranked ones. The middle eigenvectors may also have a high score in addition to the top ones, since the data can never be represented entirely by a single eigenvector.

When looking at the hand annotated gestures and trying to find them in the algorithmic processing, the algorithms clearly capture all the visual movements as a linear combination of several eigenvectors.


For example, a manually annotated pointing gesture is well encoded in one of the top eigenvectors: for a time instant in the video when a pointing gesture is made, PCA assigns a high score to the top eigenvector showing a pointing gesture. However, if the same visual gesture is manually labeled into two different categories, such as beat or metaphoric, depending on its context, it is not possible for an algorithm trained solely on visual data to differentiate between them.

7 Future Work

The current pose feature vector uses only the x and y coordinates of the upper body joints. As an extension, z coordinate information can also be incorporated. Even though the OpenPose library provides only x and y coordinates, the z coordinate can be estimated for joints such as the elbows and wrists. Assuming the z coordinate of the shoulders does not change and that arm length remains constant, the maximum arm length can be computed from the data, and the difference in apparent arm length across frames can be attributed to displacement in the z direction (a sketch of this estimate follows). Incorporating z information gives a richer feature vector and will hopefully lead to a better understanding of the pose and gestural information.
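
A sketch of this idea for one arm segment, assuming the true segment length equals the largest 2-D length observed across the video; only the magnitude of the depth offset can be recovered this way, not its sign.

```python
import numpy as np

def relative_depth(shoulder_xy, elbow_xy):
    """Rough per-frame depth offset of the elbow relative to the shoulder.
    shoulder_xy, elbow_xy: (num_frames, 2) arrays of pixel coordinates.
    Assumes the true upper-arm length is constant and equals the maximum
    projected 2-D length observed in the video."""
    proj_len = np.linalg.norm(elbow_xy - shoulder_xy, axis=1)
    true_len = proj_len.max()                     # frames with the arm in the image plane
    # Pythagoras: true_len^2 = proj_len^2 + dz^2, so recover |dz| per frame.
    return np.sqrt(np.maximum(true_len ** 2 - proj_len ** 2, 0.0))
```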

Just as K-means was applied to upper body poses, it can be applied to upper body gestures to understand which gestures are common and which ones correlate with gathering audience attention and higher retention.

The processing of the video data, from feature representation to pose and gestural analysis, serves as a baseline for further experiments. For each module the audience is asked a specific set of questions based on the module, and their answers are recorded; correct answers can be read as high retention. Pose and gestural analysis of the visual data can be correlated with the answers to see how well it predicts them. This can answer useful questions: What role do speaker poses and gestures play in the retention of concepts? Which poses and gestures emphasize the concepts better and lead to higher retention?

As it becomes available, eye tracking of the audience while watching the speaker videos will be useful information to have. The coordinates of the pupil, its diameter, and its velocity can be used as features and correlated with the visual pose and gestural information. Eye movement and pupil diameter are also useful for determining audience attention, and answering questions such as which poses or gestures lead the audience to pay attention is important for assessing speaker effectiveness.


References

[Belagiannis and Zisserman, 2017] Belagiannis, V. and Zisserman, A. (2017). Recurrent human pose estimation. In International Conference on Automatic Face and Gesture Recognition. IEEE.

[Cao et al., 2016] Cao, Z., Simon, T., Wei, S., and Sheikh, Y. (2016). Realtime multi-person 2D pose estimation using part affinity fields. CoRR, abs/1611.08050.
