
Multimodal Affect Recognition in Learning Environments

Ashish Kapoor and Rosalind W. Picard
MIT Media Laboratory

Cambridge, MA 02139, USA

{kapoor, picard}@media.mit.edu

ABSTRACT
We propose a multi-sensor affect recognition system and evaluate it on the challenging task of classifying interest (or disinterest) in children trying to solve an educational puzzle on the computer. The multimodal sensory information from facial expressions and postural shifts of the learner is combined with information about the learner's activity on the computer. We propose a unified approach, based on a mixture of Gaussian Processes, for achieving sensor fusion under the problematic conditions of missing channels and noisy labels. This approach generates separate class labels corresponding to each individual modality. The final classification is based upon a hidden random variable, which probabilistically combines the sensors. The multimodal Gaussian Process approach achieves accuracy of over 86%, significantly outperforming classification using the individual modalities and several other combination schemes.

Categories and Subject Descriptors
I.5 [Computing Methodologies]: Pattern Recognition; I.4.9 [Image Processing and Computer Vision]: Applications; J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms
Algorithms, Design, Human Factors, Performance

1. INTRODUCTION
We present a framework to automatically extract, process and model sequences of naturally occurring non-verbal behavior for recognizing affective states that arise during natural learning situations. The framework will be a component in computerized learning companions [1, 6] that could provide effective personalized assistance to children engaged in learning explorations, and it will also help in developing a theoretical understanding of human behavior in learning situations.

This work tackles a number of challenging issues. First, while most of the prior work on emotion recognition has focused on emotions posed by actors, our emphasis is on naturally occurring non-verbal behaviors, as it is crucial that the system be capable of dealing with the unconstrained nature of real data. Second, despite the advances in face analysis and gesture recognition, real-time sensing of non-verbal behaviors is still a challenging problem. In this work we demonstrate a multimodal system that can automatically extract non-verbal behaviors and features from the face and postures, which can be used to detect affective states. Third, training a pattern classification system needs labeled data. Unlike the case of posed emotions, obtaining ground-truth labels for natural data is a challenging task. We discuss how we can label the data reliably for different affective states. Moreover, there is always uncertainty about the true labels: some data points might have incorrect labels, which requires a principled approach to handle labeling noise in the data. Finally, we address pattern recognition in the multimodal scenario, which has previously been handled using either data-level fusion or classifier combination schemes. In the former, a single classifier is trained on joint features; however, sensors often fail and produce missing or bad data, a frequent problem in many multimodal scenarios that significantly reduces the performance of the pattern recognition system.

We demonstrate an affect recognition system that addresses all four challenges mentioned above. The goal of the system is to classify affective states related to interest in children trying to solve a puzzle on a computer. The system uses real-time tracking of facial features and behaviors and monitors postures to extract relevant non-verbal cues. The extracted sensory information from the face, the postures and the state of the puzzle is combined using a unified Bayesian approach based on a mixture of Gaussian Process classifiers, where classification using each channel is learned via Expectation Propagation (EP) [11]. The decision about the affective state is made by combining the beliefs of each individual expert using another meta-level classification system. The approximate Bayesian inference in the model is very efficient, as EP approximates the probability distribution over each expert's classification as a product of Gaussians that can be updated very quickly. The system is evaluated on natural data and achieves an accuracy of over 86%, significantly outperforming classification using the individual modalities and several other combination schemes.


2. PREVIOUS WORK
A lot of research has been done to develop methods for inferring affective states. Many researchers have used static methods such as questionnaires and dialogue boxes, which are easy to administer but have been criticized for being static and thus unable to recognize changes in affective states. A more dynamic and objective approach for sensing affect is via sensors such as cameras, microphones and wearable devices. However, most of the work on affect recognition using such sensors focuses on emotions (happy, sad, angry, etc.) deliberately expressed by actors, and not on those that arise in natural situations such as classroom learning. In the context of learning there have been very few approaches aimed at affect recognition. Notable among them is Conati's [2] work on probabilistic assessment of affect in educational games. Also, Mota and Picard [12] have described a system that uses dynamic posture information to classify different levels of interest in learning environments, which we significantly extend to the multimodal scenario.

Despite the advances in machine recognition of human emotion, much of this work has relied on a single modality. Exceptions include the work of Picard et al. [15], which achieved 81% classification accuracy of eight emotional states of an individual over many days of data, based on four physiological signals, and several efforts that have combined audio of the voice with video of the face, e.g. Huang et al. [3], who combined these channels to recognize six different affective states. Pantic and Rothkrantz [14] provide a survey of other audio-video combination efforts and an overview of issues in building a multimodal affect recognition system.

Many researchers have looked into the general problem of combining information from multiple channels. A common approach is "feature-level fusion", where a single classifier is trained on joint features formed by stacking the features extracted from all the modalities into one big vector. Often in affect-sensing scenarios the sensors or feature extraction fail, resulting in data with missing channels and reducing the performance of this approach. An alternative is decision-level fusion. Kittler et al. [9] have described a common framework for combining classifiers and provided theoretical justification for using operators such as vote, sum, product, maximum and minimum. One problem with these fixed rules is that it is difficult to predict which rule will perform best. There are methods, such as layered HMMs [13], which perform decision fusion and sensor selection depending upon utility functions and stacked classifiers. One main disadvantage of stacking-based classification is that these methods require a large amount of training data. Alternatively, there are mixture-of-experts [4] and critic-driven approaches [10], where base-level experts are combined using critics that predict how well an expert is going to perform on the current input. In a similar spirit, Toyama and Horvitz [16] demonstrate a head tracking system that uses contextual features as reliability indicators to select different algorithms. In earlier work [8] we proposed an expert-critic system based on HMMs to combine multiple modalities. In this paper we show an alternative fusion strategy within the Bayesian framework which significantly outperforms the earlier method.

Figure 1: The overall architecture

3. THE PROPOSED FRAMEWORK
Figure 1 describes the architecture of the proposed system. The non-verbal behaviors are sensed through a camera and a pressure-sensing chair. The camera is equipped with infrared (IR) LEDs for structured lighting that help in real-time tracking of the pupils and in extracting other features from the face. Similarly, the data sensed through the chair is used to extract information about the postures. Features are also extracted from the activity that the subject is doing on the computer; all of this information is then sent to a multimodal pattern analyzer that combines it to predict the current affective state. In this paper we focus on a scenario where children try to solve puzzles on a computer.

3.1 Facial Features & Head Gestures
The feature extraction module for face and head gestures is shown in figure 2. We use an in-house built version of the IBM Blue Eyes camera that tracks pupils unobtrusively using two sets of IR LEDs. One set of LEDs is on the optical axis and produces the red-eye effect. The two sets of LEDs are switched on and off to generate two interlaced images for a single frame. The image where the on-axis LEDs are on has white pupils, whereas the image where the off-axis LEDs are on has black pupils. These two images are subtracted to get a difference image, which is used to detect and track the pupils. The difference image is noisy due to motion artifacts and other specularities, and we have elsewhere described [7] an algorithm to track pupils reliably despite this noise. Once tracked, the pupil positions are passed to an HMM-based head-nod and head-shake detection system, which provides the likelihoods of head-nods and head-shakes. Similarly, we have also trained an HMM that uses the radii of the visible pupils as inputs to produce the likelihoods of blinks. Further, we use the system described in [7] to recover shape information of the eyes and the eyebrows. Given the pupil positions we can also localize the image around the mouth. Rather than extracting the shape of the mouth we extract two real numbers which correspond to two kinds of mouth activity: smiles and fidgets. We look at the sum of the absolute difference of pixels between the extracted mouth image in the current frame and the mouth images in the last 10 frames. A large difference in images should correspond to mouth movements, namely the fidgets. Besides a numerical score that corresponds to fidgets, the system also uses a support vector machine (SVM) to compute the probability of smiles. Specifically, an SVM was trained on natural examples of mouth images to classify mouth images as smiling or not smiling.


Figure 2: Module to extract facial features.

The localized mouth image in the current frame is used as an input to this SVM classifier, and the resulting output is passed through a sigmoid to compute the probability of a smile in the current frame. The system can extract features in real time at 27-29 frames per second on a 1.8 GHz Pentium 4 machine. The system tracks well as long as the subject is within a reasonable range of the camera. The system can detect whenever it is unable to find eyes in its field of view, which might occasionally happen due to large head and body movements. Also, sometimes the camera can only see the upper part of the face and cannot extract lower facial features, which happens if the subject leans forward. Due to these problems we often have missing information from the face; thus, we need an affect recognition system that is robust to such tracking failures.
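The two mouth features above lend themselves to a compact illustration. The following sketch is not the authors' code; the array shapes, the averaging over the history, and the Platt-style sigmoid parameters are assumptions. It computes a fidget score from absolute pixel differences against the last 10 mouth crops and squashes a raw SVM decision value into a smile probability:

```python
from collections import deque

import numpy as np


def fidget_score(mouth_history, current_mouth):
    """Sum of absolute pixel differences between the current mouth crop and
    each stored past crop, averaged; large values suggest mouth movement."""
    diffs = [np.abs(current_mouth.astype(float) - past.astype(float)).sum()
             for past in mouth_history]
    return float(np.mean(diffs)) if diffs else 0.0


def smile_probability(svm_decision_value, a=-1.0, b=0.0):
    """Map a raw SVM decision value to (0, 1) with a sigmoid; the slope a and
    offset b are placeholders and would normally be fit on held-out data."""
    return 1.0 / (1.0 + np.exp(a * svm_decision_value + b))


history = deque(maxlen=10)   # keep only the last 10 mouth crops, as described above
```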

3.2 The Posture Sensing Chair
Postures are recognized using two matrices of pressure sensors made by Tekscan. One matrix is positioned on the seat-pan of a chair; the other is placed on the backrest. Each matrix is 0.10 millimeters thick and consists of a 42-by-48 array of pressure-sensing units distributed over an area of 41 x 47 centimeters. A pressure unit is a variable resistor whose resistance is determined by the normal force applied to its surface. This resistance is transformed into an 8-bit pressure reading, which can be interpreted as an 8-bit grayscale value and visualized as a grayscale image. Figure 3 shows the feature extraction strategy used for postures in [12]. First, the pressure maps sensed by the chair are pre-processed to remove noise, and the structure of each map is modeled with a mixture of Gaussians. The parameters of the Gaussian mixture (means and variances) are fed to a 3-layer feed-forward neural network that classifies the static set of postures (for example, sitting upright, leaning back, etc.) and the activity level (low, medium and high) in real time at 8 frames per second; these outputs are then used as posture features by the multimodal affect classification module.
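As a rough sketch of this posture pipeline (not the authors' implementation; the component count, the activation threshold, fitting the mixture to active cell coordinates, and the use of scikit-learn are all assumptions), one could fit a Gaussian mixture to the active cells of a pressure map and use its means and variances as the feature vector fed to the neural network:

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def posture_features(pressure_map, n_components=4, threshold=10):
    """Fit a Gaussian mixture to the (row, col) coordinates of pressure cells
    above a threshold and return the component means and variances."""
    rows, cols = np.nonzero(pressure_map > threshold)
    points = np.column_stack([rows, cols]).astype(float)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(points)
    return np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])


# A synthetic 42-by-48 "pressure map" just to exercise the function.
demo_map = np.random.randint(0, 256, size=(42, 48))
features = posture_features(demo_map)   # would feed the 3-layer network
```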

4. RECOGNIZING AFFECTIVE STATES
Table 1 shows all the features that are extracted every one eighth of a second. We deliberately grouped them under "channels", separating for example the upper and lower face features because often the upper face features were present but not the lower. We form four different channels (table 1), and each channel corresponds to a group of features that can go missing simultaneously depending upon which sensor fails.

Figure 3: Module to extract posture features.

Our strategy will be to fuse decisions from these channels, rather than individual features, thus preserving some higher-order statistics between different features.

Figure 4 shows the model we follow to solve the problem. The data xp from P different channels generate soft decisions yp corresponding to each channel. The variable λ determines the channel that decides the final decision t. Note that λ is never observed, and we have to marginalize over λ to compute the final decision. Intuitively, this can be thought of as weighting the individual decisions yp appropriately and combining them to infer the distribution over the class label t. The weights, which correspond to a probability distribution over λ, depend upon the data point being classified and are determined using another meta-level decision system. While the results in this paper are obtained with λ selecting one channel at a time, it is possible more generally to have it select multiple channels. The system described in this paper uses Gaussian Process (GP) classification to first infer the probability distribution over yp for all the channels. The final decision is gated through λ, whose probability distribution conditioned on the test point is determined using another multi-label GP classification system. A small numerical illustration of this marginalization is given below.
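For intuition, suppose the four channels output soft decisions P(t = interested | yp) of 0.70, 0.55, 0.90 and 0.60, and the meta-level system assigns switch probabilities P(λ = p | x) of 0.15, 0.05, 0.60 and 0.20 (made-up numbers for illustration). Marginalizing over λ gives

P(t = interested | x) = 0.15 · 0.70 + 0.05 · 0.55 + 0.60 · 0.90 + 0.20 · 0.60 = 0.7925,

so the channel the meta-level system trusts most, here the posture channel, dominates the combined decision.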

In this work we follow a Bayesian paradigm, where the aim is to compute the posterior probability of the affective label of a test point given all the annotated training data and the model. The next subsection describes the acquisition of the annotated training data. Following that, we review GP classification using EP. Section 4.3 then shows how to extend the idea to a mixture of Gaussian Processes in order to handle multiple modalities in the same Bayesian framework.

Table 1: Extracted features from different modalities, grouped into channels.

Channel 1 (Upper Face): Brow Shape, Eye Shape, Likelihood of Nod, Likelihood of Shake, Likelihood of Blink
Channel 2 (Lower Face): Probability of Fidget, Probability of Smile
Channel 3 (Posture): Current Posture, Level of Activity
Channel 4 (Game): Level of Difficulty, State of the Game


Figure 4: A mixture of GPs for P channels.


4.1 Collecting the Database
Data collection for affective studies is a challenging task: the subject needs to be exposed to conditions that can elicit the emotional state in an authentic way; if we elicit affective states on demand, it is almost guaranteed not to bring out the required emotional state genuinely. Affective states associated with interest and boredom were elicited through an experiment with children aged 8 to 11 years, coming from relatively affluent areas of the state of Massachusetts in the USA. Each child was asked to solve a constraint satisfaction game called Fripples Place for approximately 20 minutes. The space where the experiment took place was a naturalistic setting allowing the subject to move freely. It was arranged with the sensor chair; one computer with a monitor, standard mouse and keyboard that ran the game and recorded the screen activity; two video cameras, one capturing a side view and one the frontal view; and finally a Blue Eyes camera capturing the face. We made the cameras unobtrusive to encourage natural responses, preserving as much as possible of the original behavior. Given that we cannot directly observe a student's internal thoughts and emotions, nor can children aged 8 to 11 reliably articulate their feelings, we chose to have the affective states labeled by observers who are teachers by profession. We engaged in several iterations with teachers to ascertain a set of meaningful labels that could be reliably inferred from the data. The teachers were allowed to look at frontal video, side video and screen activity, recognizing that people are not used to looking at chair pressure patterns. Eventually, we found that teachers could reliably label the states of high, medium and low interest, bored, and a state that we call "taking a break," which typically involved a forward-backward postural fidget and sometimes stretching. Working separately and without being aware of the final purpose of the coding task, the teachers obtained an average overall agreement (Cohen's Kappa) of 78.6%. In this work, we did not use data classified as "bored" or "other" even though teachers identified them consistently. The bored state was dropped since teachers only classified very few episodes as bored, which was not enough to develop separate training and test sets. The final database used to train and test the system included 8 different children, with 61 samples of "high interest," 59 samples of "low interest" and 16 samples of "taking a break." Each sample is at most 8 seconds long, with observations recorded at 8 samples per second. Only 50 samples had features present from all four modalities, whereas the other 86 samples had the face channel missing.

4.2 Gaussian Process Classification
GP classification is related to kernel machines such as SVMs and has been well explored in the machine learning community. Under the Bayesian framework, given a set of labeled data points X = {x1, .., xn} with class labels t = {t1, .., tn} and an unlabeled point x∗, we are interested in the distribution p(t∗|X, t, x∗). Here t∗ is a random variable denoting the class label for the point x∗. The idea behind GP classification is that the hard labels t depend upon hidden soft labels y = {y1, .., yn}, which are assumed to be jointly Gaussian, with the covariance between two outputs yi and yj specified using a kernel function applied to xi and xj. Formally, {y1, .., yn} ∼ N(0, K), where K is an n-by-n kernel matrix with Kij = k(xi, xj). The observed labels t are assumed to be conditionally independent given the soft labels y, and each ti depends upon yi through the conditional distribution:

p(ti|yi) = Φ(β yi · ti)

Here, Φ(z) = ∫_{−∞}^{z} N(u; 0, 1) du is the cumulative distribution function of the standard normal, which provides a quadratic slack for labeling errors; thus, the model should be more robust to label noise. We are also looking at other Bayesian techniques to handle label noise and uncertainty, which we will report in a future publication.
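As a concrete, purely illustrative sketch of this setup (toy data, an assumed RBF kernel width and β, not values from the paper), the snippet below builds a kernel matrix K, draws soft labels y from the GP prior N(0, K), and evaluates the probit likelihood Φ(β yi · ti):

```python
import numpy as np
from scipy.stats import norm


def rbf_kernel(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for the rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # five toy points, three features each
K = rbf_kernel(X, sigma=2.0)                     # GP prior covariance over soft labels
y = rng.multivariate_normal(np.zeros(5), K)      # one draw of the hidden soft labels
t = np.array([1, -1, 1, 1, -1])                  # hypothetical hard labels in {-1, +1}

beta = 1.0
likelihood = norm.cdf(beta * y * t)              # p(t_i | y_i) = Phi(beta * y_i * t_i)
```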

Our task is to infer p(t∗|D), where D = {X, t, x∗}. We use Expectation Propagation (EP) to first approximate p(y, y∗|D) as a Gaussian and then use the approximate marginal p(y∗|D) ≈ N(M∗, V∗) to classify the test point x∗:

p(t∗|D) ∝ ∫ p(t∗|y∗) N(y∗; M∗, V∗) dy∗    (1)

One of the useful byproducts of EP is the Gaussian approximation of each likelihood term p(ti|yi):

p(ti|yi) ≈ t̃i = si exp( −(1/(2vi)) (yi · ti − mi)² )    (2)

Conceptually, we can think of EP as starting with the GP prior N(0, K) over the hidden soft labels (y, y∗) and incorporating all the approximate terms t̃i to approximate the posterior p(y, y∗|D) ≈ N(M, V) as a Gaussian. For details readers are encouraged to look at [11].
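Given the Gaussian approximation N(M∗, V∗) from EP, the integral in equation (1) with a probit likelihood has the standard closed form Φ(βM∗ / √(1 + β²V∗)). The snippet below (illustrative values only, not outputs of the authors' system) evaluates it:

```python
from scipy.stats import norm


def predictive_prob(m_star, v_star, beta=1.0):
    """p(t* = +1 | D): a probit likelihood integrated against N(m_star, v_star),
    using the closed form Phi(beta * m / sqrt(1 + beta^2 * v))."""
    return norm.cdf(beta * m_star / (1.0 + beta ** 2 * v_star) ** 0.5)


print(predictive_prob(m_star=0.8, v_star=0.25))   # about 0.76 for these toy values
```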

4.3 Mixture of GPs for Sensor Fusion
Given n data points x1, .., xn obtained from P different sensors, our approach follows the mixture of GPs model described in figure 4. Let every ith data point be represented as xi = {xi^(1), .., xi^(P)}, and the corresponding soft labels as yi = {yi^(1), .., yi^(P)}. Given λi ∈ {1, .., P}, the random variable that determines the channel used for each point's classification, the classification likelihood can be written as:

P(ti|yi, λi = j) = P(ti|yi^(j)) = Φ(β ti · yi^(j))

Given a test point x∗, let X = {x1, .., xn, x∗} denote all the training and test points. Further, let Y = {y^(1), .., y^(P)} denote the hidden soft labels corresponding to each channel of all the data, including the test point. Let Q(Y) = ∏_{p=1}^{P} Q(y^(p)) and Q(Λ) = ∏_{i=1}^{n} Q(λi) denote the approximate posteriors over the hidden variables Y and Λ, where Λ = {λ1, .., λn} are the switches corresponding only to the n labeled data points. Let p(Y) and p(Λ) be the priors, with p(Y) = ∏_{p=1}^{P} p(y^(p)), the product of the GP priors, and p(Λ) uniform.


Figure 5: Plots comparing accuracy of Mixture of GP with GP (a)-(b) and SVM (c)-(d), where the latter methods use feature-level fusion. In (a) and (c) only the subset of data where all modes are intact is used; in (b) and (d) all the data is used, as described in the text. There are 100 points on each graph and each point is (accuracy SVM/GP, accuracy Mixture GP), corresponding to one test run. Circle width is proportional to the number of points having that coordinate. Points above the diagonal indicate where Mixture GP was more accurate. While Mixture GP is better (a) or comparable (c) on average when all the modes are intact, it is particularly better when there are noisy and missing channels (b) and (d).

Our algorithm aims to compute good approximations Q(Y) and Q(Λ) to the real posteriors by iteratively optimizing the variational bound:

F = ∫_{Y,Λ} Q(Y) Q(Λ) log[ p(Y) p(Λ) p(t|X, Y, Λ) / (Q(Y) Q(Λ)) ]    (3)

Once we have the posteriors over the switches, Q(λi) for all i ∈ [1..n], we infer the switch for the test data x∗ using a meta-level GP classification system. For this, we do a multi-label P-way classification using the GP algorithm described in section 4.2, with arg max_Λ Q(Λ) as labels. Specifically, for an unlabeled point x∗, P different classifications are done, where each classification provides q∗_r, for r ∈ {1, .., P}, equal to the probability that channel r is chosen to classify x∗. The posterior Q(λ∗ = r) is then set to q∗_r / ∑_{p=1}^{P} q∗_p. In our experiments, to perform this multi-label P-way classification, we clubbed all the channels together, using -1 as the observation for any modality that was missing. Note that we are not limited to using all the channels clubbed together; various combinations of the modalities can be used, including other indicator and contextual variables. Once we have the posterior over the switch for the test data, Q(λ∗), we can infer the class probability of the unlabeled point x∗ using:

p(t∗|X, t) = ∫_{Y,λ∗} p(t∗|Y, λ∗) Q(λ∗) Q(Y)    (4)

The main feature of the algorithm is that the classificationusing EP is required only once and the bound in equation 3can be optimized very quickly using the Gaussian approxi-mations provided by EP. For details please refer to [5].
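The final fusion step can be illustrated with a short sketch (hypothetical numbers and function names, not the authors' code): the raw meta-classifier outputs q∗_r are normalized into Q(λ∗), and the per-channel predictive probabilities are averaged under that posterior, in the spirit of equation (4):

```python
import numpy as np


def fuse_channels(channel_probs, switch_scores):
    """Normalize raw switch scores q*_r into Q(lambda* = r) and return the
    switch-weighted average of the per-channel probabilities p(t* | y^(p))."""
    q = np.asarray(switch_scores, dtype=float)
    q_lambda = q / q.sum()                 # Q(lambda* = r) = q*_r / sum_p q*_p
    return float(q_lambda @ np.asarray(channel_probs, dtype=float))


# Hypothetical per-channel predictions (upper face, lower face, posture, game)
# and raw switch scores for one test point.
print(fuse_channels([0.72, 0.51, 0.88, 0.60], [0.2, 0.1, 0.5, 0.2]))   # about 0.755
```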

5. EMPIRICAL EVALUATION
We performed experiments on the collected database, using the framework to classify the state of interest (65 samples) vs. disinterest (71 samples). The experimental methodology was to use 50% of the data for training and the rest for testing. Besides the comparison with the individual modalities, we also compare the mixture of GPs with the HMM-based expert-critic scheme [8] and a naive feature-level fusion. In the naive feature-level fusion, the observations from all the channels are stacked to form one big vector, and these fused observation vectors are then used to train and test the classifiers. However, in our case this is not trivial, as we have data with missing channels. We therefore test a naive feature-level fusion where we use -1 as the value of every observation that is missing, thus fusing all the channels into one single vector.
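A minimal sketch of this naive fusion (channel names and dimensionalities are illustrative, loosely following Table 1; -1 fills any channel that is missing):

```python
import numpy as np

# Illustrative per-channel dimensionalities, loosely following Table 1.
CHANNEL_DIMS = {"upper_face": 5, "lower_face": 2, "posture": 2, "game": 2}


def stack_features(channels):
    """Concatenate all channels into one vector, using -1 for every feature
    of a channel whose observation is missing (None)."""
    parts = []
    for name, dim in CHANNEL_DIMS.items():
        feats = channels.get(name)
        parts.append(np.full(dim, -1.0) if feats is None
                     else np.asarray(feats, dtype=float))
    return np.concatenate(parts)


# Example: face tracking failed, so both face channels are filled with -1.
x = stack_features({"upper_face": None, "lower_face": None,
                    "posture": [3, 1], "game": [2, 0]})
```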

All the experiments were done using GP classification as the base classifier, and for completeness we also perform comparisons with SVMs. Both GP classification and SVMs used RBF kernels, and the hyperparameters for GP classification (σ, β) were selected by evidence maximization. For SVMs we used a leave-one-out validation procedure to select the penalty parameter C and the kernel width σ (a sketch of such a selection is shown below). We randomly selected 50% of the points and computed the hyperparameters for both GPs and SVMs for all individual modalities and the naive feature-level fusions. This process was repeated five times, and the mean values of the hyperparameters were used in our experiments.
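The leave-one-out selection of C and the RBF width for the SVMs could be reproduced along the following lines with scikit-learn (a modern stand-in, not the tooling used in the paper; note that scikit-learn's gamma corresponds to 1/(2σ²), and the data here is random):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

# Toy stand-in for one modality's feature vectors and interest labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)

# Leave-one-out grid search over the penalty C and the kernel width.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.01, 0.1, 1.0]},
                    cv=LeaveOneOut())
grid.fit(X, y)
print(grid.best_params_)
```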

Table 2: Recognition rates (standard deviation in parentheses) averaged over 100 runs.

              GP              SVM
Upper Face    66.81% (6.33)   69.84% (6.74)
Lower Face    53.11% (9.49)   57.06% (9.16)
Posture       81.97% (3.67)   82.52% (4.21)
Game          57.22% (4.57)   58.85% (5.99)
Mix of GP     86.55% (4.24)   -

Table 2 shows the results for the individual modalities using GP classification and SVMs. These numbers were generated by averaging over 100 runs, and we report the mean and the standard deviation. We can see that the posture channel classifies the affective states best, followed by features from the upper face, the game and the lower face. Although the performance obtained using GP classification is similar to SVMs, and slightly worse for the upper and the lower face, we find that the extension of GP to a mixture boosts the performance and leads to significant gains over SVMs. The Mix of GP also outperformed fixed-rule decision fusion of the individual SVM and GP classifiers; the reader is referred to [5] for similar results.


Figure 6: Performance comparison of the proposed Mixture of GP approach on the affect dataset with naive feature-level fusions and the HMM-based expert-critic framework. Each point was generated by averaging over 100 runs. Non-overlapping error bars (standard errors scaled by 1.64) indicate a 95% significant performance difference.


Next, we compare the Mix of GP to feature-level fusion while restricting the training and testing database to the points where all the channels are available. This restriction results in a significant decrease in the available data (only 50 datapoints in total); the Mix of GP obtained an average accuracy of 75.01% ± 1.15 over 100 runs, whereas the accuracy using feature-level fusion was 73.77% ± 1.22. When we perform naive feature-level fusion using all the data, without any restriction and using -1 for the channels that were missing, we get a significant gain in average accuracy, to 82.93% ± 0.81. However, the Mix of GP outperforms all of these significantly, with an average accuracy of 86.55% ± 0.55 (see figure 6 for significance). Figures 5(a) and (b) show that the Mixture of GP not only beats feature-level fusion on the subset of data where all channels are present, but also beats it significantly when the incomplete data are used. Similar results are obtained in the comparison to SVMs, shown graphically in figures 5(c) and (d). Figure 6 graphically demonstrates the performance gain obtained by the Mix of GP approach over the feature-level fusions and the HMM-based expert-critic framework as implemented in [8] with FACS features. The points in figure 6 were generated by averaging over 100 runs. The non-overlapping error bars (standard errors scaled by 1.64) signify 95% confidence in the performance difference. The Mix of GP uses all the data and handles the missing information better than the naive fusion methods; thus, it provides a significant performance boost.

In our earlier work [8, 5], we used manually encoded FACS-based Action Units (AUs) as features extracted from the face. The accuracy of 66.81% ± 6.33 obtained when performing GP classification with automatically extracted upper facial features was significantly better than the accuracy of 54.19% ± 3.79 obtained with GP classification using the manually coded upper AUs. The difference is due to two factors. First, the set of automatically extracted upper facial features is richer than the AUs. Second, the AUs were manually encoded by just one FACS expert, resulting in features prone to noise.

6. CONCLUSION
In this paper, we proposed a Mixture of Gaussian Processes approach to classifying interest in a learning scenario using multiple modalities, under the challenging conditions of missing channels and uncertain labels. Using information from the upper and lower face, postures and task information, the proposed multi-sensor classification scheme outperformed several sensor fusion schemes and obtained a recognition rate of over 86%. Future work includes the incorporation of active learning and the application of this framework to other challenging problems with limited labeled data.

Acknowledgments
Thanks to Selene Mota for help with the data collection. This research was supported by NSF ITR grant 0325428.

7. REFERENCES
[1] T. W. Chan and A. Baskin. Intelligent Tutoring Systems: At the Crossroads of Artificial Intelligence and Education, chapter 1: Learning companion systems. 1990.
[2] C. Conati. Probabilistic assessment of user's emotions in educational games. Applied Artificial Intelligence, special issue on Merging Cognition and Affect in HCI, 16, 2002.
[3] T. S. Huang, L. S. Chen, and H. Tao. Bimodal emotion recognition by man and machine. In ATR Workshop on Virtual Communication Environments, 1998.
[4] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
[5] A. Kapoor, H. Ahn, and R. W. Picard. Mixture of Gaussian processes to combine multiple modalities. In Workshop on MCS, 2005.
[6] A. Kapoor, S. Mota, and R. W. Picard. Towards a learning companion that recognizes affect. In AAAI Fall Symposium, Nov 2001.
[7] A. Kapoor and R. W. Picard. Real-time, fully automatic upper facial feature tracking. In Automatic Face and Gesture Recognition, May 2002.
[8] A. Kapoor, R. W. Picard, and Y. Ivanov. Probabilistic combination of multiple modalities to detect interest. In ICPR, August 2004.
[9] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. PAMI, 20(3):226-239, 1998.
[10] D. J. Miller and L. Yan. Critic-driven ensemble classification. Signal Processing, 47(10), 1999.
[11] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.
[12] S. Mota and R. W. Picard. Automated posture analysis for detecting learner's interest level. In CVPR Workshop on HCI, June 2003.
[13] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring office activity from multiple sensory channels. In ICMI, 2002.
[14] M. Pantic and L. J. M. Rothkrantz. Towards an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9), 2003.
[15] R. W. Picard, E. Vyzas, and J. Healey. Toward machine emotional intelligence: Analysis of affective physiological state. PAMI, 2001.
[16] K. Toyama and E. Horvitz. Bayesian modality fusion: Probabilistic integration of multiple vision algorithms for head tracking. In ACCV, 2000.