
Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction

Marian Stewart Bartlett, Gwen Littlewort, Ian Fasel, Javier R. Movellan

Machine Perception Laboratory, Institute for Neural Computation
University of California, San Diego, CA 92093
&
Intelligent Robotics and Communication Laboratories
ATR, Kyoto, Japan

Abstract

Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale in the order of 40 milliseconds. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory rich perceptual primitives rather than slow symbolic inference processes. In this paper we present progress on one such perceptual primitive. The system automatically detects frontal faces in the video stream and codes them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [16, 3]. The expression recognizer receives image patches located by the face detector. A Gabor representation of the patch is formed and then processed by a bank of SVM classifiers. A novel combination of Adaboost and SVM's enhances performance. The system was tested on the Cohn-Kanade dataset of posed facial expressions [7]. The generalization performance to new subjects for a 7-way forced choice was 93% correct. Most interestingly, the outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system has been deployed on a wide variety of platforms including Sony's Aibo pet robot, ATR's RoboVie, and CU animator, and is currently being evaluated for applications including automatic reading tutors and assessment of human-robot interaction.

1. Introduction

Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale in the order of 40 milliseconds. The level of uncertainty at this time scale is considerable, making it necessary for humans to rely on sensory rich perceptual primitives rather than slow symbolic inference processes. Thus fulfilling the idea of machines that interact face to face with us requires development of robust real-time perceptive primitives. In this paper we present some first steps towards the development of one such primitive: a system that automatically finds faces in the visual video stream and codes facial expression dynamics in real time. The system has been deployed on a wide variety of platforms including Sony's Aibo pet robot, ATR's RoboVie [6], and CU animator [10]. The performance of the system is currently being evaluated for applications including automatic reading tutors, assessment of human-robot interaction, and evaluation of psychiatric intervention.

Charles Darwin was one of the first scientists to recognize that facial expression is one of the most powerful and immediate means for human beings to communicate their emotions, intentions, and opinions to each other. In addition to providing information about affective state, facial expressions also provide information about cognitive state, such as interest, boredom, confusion, and stress, and conversational signals with information about speech emphasis and syntax. A number of ground breaking systems have appeared in the computer vision literature for automatic facial expression recognition. See [12, 1] for reviews. Automated systems will have a tremendous impact on basic research by making facial expression measurement more accessible as a behavioral measure, and by providing data on the dynamics of facial behavior at a resolution that was previously unavailable. Computer systems with this capability have a wide range of applications in basic and applied research areas, including man-machine communication, security, law enforcement, psychiatry, education, and telecommunications.

In this paper we present results on a user independent fully automatic system for real time recognition of basic emotional expressions from video. The system automatically detects frontal faces in the video stream and codes each frame with respect to 7 dimensions: neutral, anger, disgust, fear, joy, sadness, surprise. The system presented here differs from previous work in that it is fully automatic and operates in real-time at a high level of accuracy (93% generalization to new subjects on a 7-alternative forced choice). Another distinction is that the preprocessing does not include explicit detection and alignment of internal facial features. This provides a savings in processing time which is important for real-time applications. We present a method for further speed advantage by combining feature selection based on Adaboost with feature integration based on support vector machines.


2. Preparing training data

2.1. Dataset

The system was trained and tested on Cohn and Kanade's DFAT-504 dataset [7]. This dataset consists of 100 university students ranging in age from 18 to 30 years. 65% were female, 15% were African-American, and 3% were Asian or Latino. Videos were recorded in analog S-video using a camera located directly in front of the subject. Subjects were instructed by an experimenter to perform a series of 23 facial expressions. Subjects began and ended each display with a neutral face. Before performing each display, an experimenter described and modeled the desired display. Image sequences from neutral to target display were digitized into 640 by 480 pixel arrays with 8-bit precision for grayscale values.

For our study, we selected 313 sequences from the dataset. The only selection criterion was that a sequence be labeled as one of the 6 basic emotions. The sequences came from 90 subjects, with 1 to 6 emotions per subject. The first and last frames (neutral and peak) were used as training images and for testing generalization to new subjects, for a total of 625 examples. The trained classifiers were later applied to the entire sequence.

2.2. Locating the faces

We recently developed a real-time face-detection system based on [16], capable of detection and false positive rates equivalent to the best published results [14, 15, 13, 16]. The system scans across all possible 24x24 pixel patches in the image and classifies each as face vs. non-face. Larger faces are detected by applying the same classifier at larger scales in the image (using a scale factor of 1.2). Any detection windows with significant overlap are averaged together. The face detector was trained on 5000 faces and millions of non-face patches from about 8000 images collected from the web by Compaq Research Laboratories.
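For illustration, the multi-scale scanning strategy just described might be sketched as follows (a minimal sketch, not the implementation used in this work; the `classify_patch` callable, the stride, and the 24x24 base window used here are assumptions made for the example):

```python
import numpy as np

def scan_for_faces(image, classify_patch, base=24, scale_factor=1.2):
    """Scan every base-by-base window of a grayscale image at multiple scales.

    `classify_patch` is assumed to map a base x base patch to True ("face") or
    False ("non-face").  Larger faces are found by enlarging the window by
    `scale_factor` per level, equivalent to rescanning a shrunken image.
    """
    h, w = image.shape
    detections = []
    size = base
    while size <= min(h, w):
        step = max(1, int(0.1 * size))            # coarse stride, for speed
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                # Subsample the window back down to the classifier's base resolution.
                idx = np.linspace(0, size - 1, base).astype(int)
                patch = image[y:y + size, x:x + size][np.ix_(idx, idx)]
                if classify_patch(patch):
                    detections.append((x, y, size))
        size = int(np.ceil(size * scale_factor))
    return detections  # overlapping windows would then be averaged, as described above
```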

The system consists of a cascade of classifiers, each of which contains a subset of filters reminiscent of Haar basis functions, which can be computed very fast at any location and scale in constant time (see Figure 1). In a 24x24 pixel window, there are over 160,000 possible filters of this type. For each classifier in the cascade a subset of 2 to 200 of these filters is chosen using a feature selection procedure based on Adaboost [4], as follows. First, all 5000 face patches and a random set of 10000 non-face patches are taken from the labeled image set from Compaq. Then, using a random sample of 5% of the possible filters, a simple classifier (or "weak learner") using only one filter at a time is trained to minimize the weighted classification error on this sample for each of the filters. The single-filter classifier that gives the best performance is selected. Rather than use this result directly, we refine the selection by finding the best performing single-feature classifier from a new set of filters generated by shifting and scaling the chosen filter by two pixels in each direction, as well as composite filters made by reflecting each shifted and scaled feature horizontally about the center and superimposing it on the original. This can be thought of as a single-generation genetic algorithm, and is much faster than exhaustively searching for the best classifier among all 160,000 possible filters and their reflection-based cousins. Using the chosen classifier as the weak learner for this round of boosting, the weights over the examples are then adjusted according to its performance on each example using the Adaboost rule. This feature selection process is then repeated with the new weights, and the entire boosting procedure continues until the "strong classifier" (i.e., the combined classifier using all the weak learners for that stage) can achieve a minimum desired performance rate on a validation set. Finally, after training each strong classifier, a bootstrap round (a la [15]) is performed, in which the full system up to that point is scanned across a database of non-face images, and false alarms are collected and used as the non-faces for training the subsequent strong classifier in the sequence.

Figure 1: Integral image filters (after Viola & Jones, 2001 [16]). a. The value of the pixel at (x, y) is the sum of all the pixels above and to the left. b. The sum of the pixels within any rectangle can be computed from just four values of the integral image (the bottom-right corner value, minus the top-right and bottom-left corner values, plus the top-left corner value). c. Each feature is computed by taking the difference of the sums of the pixels in the white boxes and grey boxes. Features include those shown in (c), as in [16], plus (d) the same features superimposed on their reflection about the Y axis.
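As a concrete illustration of the integral-image arithmetic summarized in Figure 1 (a minimal sketch following the standard formulation of [16]; the specific two-rectangle feature layout is only an example):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0:y+1, 0:x+1], i.e. all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), from four integral-image lookups."""
    a = ii[y - 1, x - 1] if y > 0 and x > 0 else 0   # above and left of the rectangle
    b = ii[y - 1, x + w - 1] if y > 0 else 0         # above the rectangle
    c = ii[y + h - 1, x - 1] if x > 0 else 0         # left of the rectangle
    d = ii[y + h - 1, x + w - 1]                     # bottom-right corner
    return d - b - c + a

def two_rect_feature(ii, y, x, h, w):
    """A horizontal two-rectangle feature: sum of the top half minus sum of the bottom half."""
    top = rect_sum(ii, y, x, h // 2, w)
    bottom = rect_sum(ii, y + h // 2, x, h // 2, w)
    return top - bottom
```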

While [16] use Adaboost in their feature selection algorithm, which requires binary classifiers, we have recently been experimenting with Gentleboost, described in [5], which uses real valued features. Figure 2 shows the first two filters chosen by the system along with the real valued output of the weak learners (or tuning curves) built on those filters. We have also developed a training procedure to eliminate the cascade of classifiers, so that after each single feature, the system can decide whether to test another feature or to make a decision. Preliminary results show potential for dramatic improvements in speed with no loss of accuracy over the current system.
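The real-valued weak learners ("tuning curves") can be thought of as piecewise-constant functions of a filter's output. The following is one possible reading of that idea, not the exact procedure used here; the binning scheme and bin count are assumptions:

```python
import numpy as np

def fit_tuning_curve(filter_outputs, labels, weights, n_bins=32):
    """Fit a Gentleboost-style weak learner for one filter: in each bin of the
    filter's output, return the weighted mean of the labels (+1 face, -1 non-face),
    i.e. the evidence for face (positive) vs. non-face (negative)."""
    edges = np.quantile(filter_outputs, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, filter_outputs) - 1, 0, n_bins - 1)
    curve = np.zeros(n_bins)
    for b in range(n_bins):
        in_bin = bins == b
        if weights[in_bin].sum() > 0:
            curve[b] = np.sum(weights[in_bin] * labels[in_bin]) / weights[in_bin].sum()
    return edges, curve

def gentleboost_reweight(weights, labels, predictions):
    """Gentleboost update: w_i <- w_i * exp(-y_i * f(x_i)), then renormalize."""
    w = weights * np.exp(-labels * predictions)
    return w / w.sum()
```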

Because the strong classifiers early in the sequence need very few features to achieve good performance (the first stage can reject the majority of the non-faces using only a few features, at a cost of about 20 simple operations, or about 60 microprocessor instructions), the average number of features that need to be evaluated for each window is very small, making the overall system very fast. The current system achieves an excellent tradeoff in speed and accuracy. We host an on-line demo of the face detector, along with the facial expression recognition system of [2], on the world wide web at http://mplab.ucsd.edu.

Figure 2: The first two features (top) and their respective tuning curves (bottom), plotted as a function of filter output. Each feature is shown over the average face. The tuning curves show the evidence for face (high) vs. non-face (low). The first tuning curve shows that a dark horizontal region over a bright horizontal region in the center of the window is evidence for a face, and for non-face otherwise. The output of the second filter is bimodal: both a strong positive and a strong negative output is evidence for a face, while output closer to zero is evidence for non-face.

Performance on the CMU-MIT dataset (a standard, public dataset for benchmarking frontal face detection systems) is comparable to [16]. While CMU-MIT contains wide variability in the images due to illumination, occlusions, and differences in image quality, the performance was much more accurate on the dataset used for this study, because the faces were frontal, focused and well lit, with simple background [3]. All faces were detected for this dataset.

2.3. Preprocessing

The automatically located faces were rescaled to 48x48 pixels. A comparison was also made at double resolution (96x96). No further registration was performed. The typical distance between the centers of the eyes was roughly 24 pixels. The images were converted into a Gabor magnitude representation, using a bank of Gabor filters at 8 orientations and 5 spatial frequencies (4:16 pixels per cycle at 1/2 octave steps) [8].
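A minimal sketch of such a Gabor magnitude representation is given below, assuming standard complex Gabor kernels; the envelope width ratio and exact wavelength values beyond the stated 8 orientations and 5 wavelengths of 4 to 16 pixels per cycle are assumptions made for the example:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(wavelength, orientation, sigma_ratio=0.8):
    """Complex Gabor kernel; wavelength in pixels per cycle, orientation in radians.
    The envelope width sigma = sigma_ratio * wavelength is an illustrative choice."""
    sigma = sigma_ratio * wavelength
    half = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    yr = -x * np.sin(orientation) + y * np.cos(orientation)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_magnitudes(face, wavelengths=(4, 5.7, 8, 11.3, 16), n_orient=8):
    """Gabor magnitude representation: one |response| map per wavelength/orientation
    pair (5 x 8 = 40 maps; for a 48x48 face this gives 92160 coefficients)."""
    maps = []
    for lam in wavelengths:
        for k in range(n_orient):
            kern = gabor_kernel(lam, np.pi * k / n_orient)
            maps.append(np.abs(fftconvolve(face, kern, mode="same")))
    return np.stack(maps)
```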

3. Facial Expression Classification

Facial expression classification was based on support vector machines (SVM's). SVM's are well suited to this task because the high dimensionality of the Gabor representation does not affect training time for kernel classifiers. The system performed a 7-way forced choice between the following emotion categories: happiness, sadness, surprise, disgust, fear, anger, neutral. The classification was performed in two stages. First, support vector machines performed binary decision tasks: seven SVM's were trained to discriminate each emotion from everything else. The emotion category decision was then implemented by choosing the classifier with the maximum margin for the test example. Generalization to novel subjects was tested using leave-one-subject-out cross-validation. Linear, polynomial, and RBF kernels with Laplacian and Gaussian basis functions were explored. Linear and RBF kernels employing a unit-width Gaussian performed best, and are presented here.
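A minimal sketch of this two-stage decision, written with scikit-learn as a modern stand-in for the software actually used: `X` is assumed to hold the flattened Gabor magnitude vectors, `y` the seven expression labels (0-6), and `groups` the subject identity for leave-one-subject-out testing.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut

N_EMOTIONS = 7  # happiness, sadness, surprise, disgust, fear, anger, neutral

def train_one_vs_rest(X, y, kernel="linear"):
    """Train seven binary SVMs, one per emotion vs. everything else."""
    return [SVC(kernel=kernel).fit(X, (y == k).astype(int)) for k in range(N_EMOTIONS)]

def classify(svms, x):
    """Pick the emotion whose binary SVM gives the largest (signed) margin."""
    margins = [svm.decision_function(x.reshape(1, -1))[0] for svm in svms]
    return int(np.argmax(margins))

def leave_one_subject_out_accuracy(X, y, groups):
    """Hold out all examples of one subject at a time and average the accuracy."""
    correct, total = 0, 0
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        svms = train_one_vs_rest(X[train_idx], y[train_idx])
        preds = [classify(svms, x) for x in X[test_idx]]
        correct += sum(p == t for p, t in zip(preds, y[test_idx]))
        total += len(test_idx)
    return correct / total
```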

We compared recognition performance using the output of the automatic face detector to performance on images with explicit feature alignment using hand-labeled features. For the manually aligned face images, the faces were rotated so that the eyes were horizontal and then warped so that the eyes and mouth were aligned in each face. The results are given in Table 1. There was no significant difference between performance on the automatically detected faces and performance on the manually aligned faces (z=0.25, p=0.4, n=625).

SVM      Automatic   Manually aligned
Linear   84.8        85.3
RBF      87.5        87.6

Table 1: Facial expression recognition performance for manually aligned versus automatically detected faces (96x96 images).

3.1. SVM's and Adaboost

SVM performance was next compared to Adaboost for emotion classification. The features employed for the Adaboost emotion classifier were the individual Gabor filters. There were 48x48x40 = 92160 possible features. A subset of these filters was chosen using Adaboost. On each training round, the threshold and scale parameter of each filter was optimized and the feature that provided best performance on the boosted distribution was chosen. Since Adaboost is significantly slower to train than SVM's, we did not do 'leave one subject out' cross validation. Instead we separated the subjects randomly into ten groups of roughly equal size and did 'leave one group out' cross validation.
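A heavily simplified sketch of this feature selection loop is shown below: one threshold stump per Gabor coefficient and discrete Adaboost weight updates. The per-round threshold/scale optimization described above is reduced here to a crude median threshold, and the exhaustive inner loop over all 92160 coefficients is written for clarity rather than speed.

```python
import numpy as np

def adaboost_select_gabors(F, labels, n_rounds):
    """F: (n_examples, n_features) Gabor magnitudes; labels in {-1, +1}.
    Each round picks the single Gabor coefficient whose threshold stump minimizes
    weighted error, then reweights the examples (discrete Adaboost)."""
    n, d = F.shape
    w = np.full(n, 1.0 / n)
    chosen = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            theta = np.median(F[:, j])                     # crude threshold choice
            for polarity in (1, -1):
                pred = np.where(polarity * (F[:, j] - theta) > 0, 1, -1)
                err = np.sum(w * (pred != labels))
                if best is None or err < best[0]:
                    best = (err, j, theta, polarity)
        err, j, theta, polarity = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)              # weak learner weight
        pred = np.where(polarity * (F[:, j] - theta) > 0, 1, -1)
        w *= np.exp(-alpha * labels * pred)                # upweight mistakes
        w /= w.sum()
        chosen.append((j, theta, polarity, alpha))
    return chosen
```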

During Adaboost, training for each emotion classifier continued until the distributions for the positive and negative samples were completely separated by a gap proportional to the widths of the two distributions (see Figure 3). The total number of filters selected using this procedure was 538. Here, the system calculated the output of the Gabor filters less efficiently, as the convolutions were done in pixel space rather than Fourier space, but the use of 200 times fewer Gabor filters nevertheless resulted in a substantial speed benefit. The generalization performance, 85.0%, was comparable to linear SVM performance on the leave-group-out testing paradigm, but Adaboost was substantially faster, as shown in Table 2.

Adaboost provides an added value of choosing which features are most informative to test at each step in the cascade. Figure 4 illustrates the first 5 Gabor features chosen for each emotion. The chosen features show no preference for direction, but the highest frequencies are chosen more often. Figure 5 shows the number of chosen features at each of the 5 wavelengths used.

Figure 3: Stopping criteria for Adaboost training. a. Output of one expression classifier during Adaboost training. The response for each of the training examples is shown as a function of the number of features as the classifier grows. b. Generalization error as a function of the number of features chosen by Adaboost. Generalization error does not increase with "overtraining."

Figure 4: Gabors selected by Adaboost for each expression (anger, disgust, fear, joy, sadness, surprise). White dots indicate locations of all selected Gabors. Below each expression is a linear combination of the real part of the first 5 Adaboost features selected for that expression. Faces shown are a mean of 10 individuals.

3.2. AdaSVM's

A combination approach, in which the Gabor features chosen by Adaboost were used as a reduced representation for training SVM's (AdaSVM's), outperformed Adaboost by 3.8 percentage points, a difference that was statistically significant (z=1.99, p=0.02). AdaSVM's outperformed SVM's by an average of 2.7 percentage points, an improvement that was marginally significant (z=1.55, p=0.06).
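In other words, the AdaSVM simply reuses the Gabor coefficients picked by Adaboost as a compact feature vector for the SVM. A minimal sketch, reusing the hypothetical `adaboost_select_gabors` helper from the sketch in Section 3.1:

```python
from sklearn.svm import SVC

def train_adasvm(F_train, y_train, chosen, kernel="rbf"):
    """Train an SVM on only the Adaboost-selected Gabor coefficients.
    `chosen` is the list of (feature_index, threshold, polarity, alpha) tuples
    returned by the Adaboost selection sketch; only the indices are needed here."""
    idx = sorted({j for (j, *_rest) in chosen})
    return idx, SVC(kernel=kernel).fit(F_train[:, idx], y_train)

def adasvm_margin(model, F_test):
    """Signed margins of the AdaSVM on new examples, restricted to the selected features."""
    idx, svm = model
    return svm.decision_function(F_test[:, idx])
```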

Figure 5: Wavelength distribution of features selected by Adaboost (number of chosen features as a function of wavelength, in pixels).

After examination of the frequency distribution of the Gabor filters selected by Adaboost, it became apparent that higher spatial frequency Gabors and higher resolution images could potentially improve performance. Indeed, doubling the resolution to 96x96 and increasing the number of Gabor wavelengths from 5 to 9, so that they spanned 2:32 pixels in 1/2 octave steps, improved performance of the nonlinear AdaSVM to 93.3% correct. As the resolution goes up, the speed benefit of AdaSVM's becomes even more apparent. At the higher resolution, the full Gabor representation increased by a factor of 7, whereas the number of Gabors selected by Adaboost only increased by a factor of 1.75.

          Leave-group-out        Leave-subject-out
          Adaboost   SVM         SVM      AdaSVM
Linear    85.0       84.8        86.2     88.8
RBF                  86.9        88.0     90.7

Table 2: Performance of Adaboost, SVM's and AdaSVM's (48x48 images).

          SVM                Adaboost   AdaSVM
          Lin      RBF                  Lin      RBF
Time t    t        90t       0.01t      0.01t    0.0125t
Time t'   t        90t       0.16t      0.16t    0.2t
Memory    m        90m       3m         3m       3.3m

Table 3: Processing time and memory considerations. Time t' includes the extra time to calculate the outputs of the 538 Gabors in pixel space for Adaboost and AdaSVM, rather than the full FFT employed by the SVM's.

4. Real Time Emotion Mirroring

Although each individual image is separately processed and classified, the system output for a sequence of video frames changes smoothly as a function of time (see Figure 6). This provides a potentially valuable representation to code facial expression in real time.

Figure 6: The neutral output decreases and the output for the relevant emotion increases as a function of time. Two test sequences for one subject are shown.

Figure 7: Examples of the emotion mirror. The animated character mirrors the facial expression of the user.

To demonstrate the potential of this idea we developed a real time 'emotion mirror'. The emotion mirror renders a 3D character in real time that mimics the emotional expression of a person.

In the emotion mirror, the face-finder captures a face image which is sent to the emotion classifier. The outputs of the 7-emotion classifier constitute a 7-D emotion code. This code was sent to CU Animate, a set of software tools for rendering 3D computer animated characters in real time, developed at the Center for Spoken Language Research at CU Boulder. The 7-D emotion code gave a weighted combination of morph targets for each emotion. The emotion mirror was demonstrated at NIPS 2002. Figure 7 shows the prototype system at work.
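A toy sketch of how a frame-by-frame 7-D emotion code could drive a character's morph targets is shown below. The CU Animate interface is not reproduced here, so the rendering call is a placeholder, and the normalization and temporal smoothing choices are assumptions made for the example:

```python
import numpy as np

EMOTIONS = ["neutral", "anger", "disgust", "fear", "joy", "sadness", "surprise"]

def emotion_code_to_morph_weights(margins, prev_weights=None, smoothing=0.7):
    """Map the seven classifier margins for one frame to morph-target weights in [0, 1].

    The margins are shifted to be non-negative, normalized to sum to 1, and
    exponentially smoothed over time so the character animates without jitter.
    """
    m = np.asarray(margins, dtype=float)
    m = m - m.min()                                   # weakest category gets weight 0
    weights = m / m.sum() if m.sum() > 0 else np.full(len(m), 1.0 / len(m))
    if prev_weights is not None:
        weights = smoothing * prev_weights + (1.0 - smoothing) * weights
    return weights

# Hypothetical per-frame loop:
# weights = None
# for frame in video:
#     face = find_face(frame)                   # face detector from Section 2.2
#     margins = [svm.decision_function(face_features(face)) for svm in emotion_svms]
#     weights = emotion_code_to_morph_weights(margins, weights)
#     renderer.set_morph_targets(dict(zip(EMOTIONS, weights)))   # placeholder call
```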

The emotion mirror is a prototype system that recognizes the emotion of the user and responds in an engaging way. The long-term goal is to incorporate this system into robotic and computer animation applications in which it is important to engage the user at an emotional level and/or have the computer recognize and adapt to the emotions of the user.

5. Deployment and Evaluation

The real time system presented here has been deployed in a variety of platforms, including Sony's Aibo robot, ATR's RoboVie [6], and CU animator [10]. The performance of the system is currently being evaluated at homes, schools, and in laboratory environments.

Automated tutoring systems may be more effective if they adapt to the emotional and cognitive state of the student, like good teachers do. We are presently integrating automatic face tracking and expression analysis in automatic animated tutoring systems (see Figure 8). This work is in collaboration with Ron Cole at U. Colorado.

Face tracking and expression recognition may also make robots more engaging. For this purpose, the real time system has been deployed in the Aibo robot and in the RoboVie robot (see Figure 9). The system also provides a method for measuring the goodness of interaction between humans and robots. We have employed automatic expression analysis to evaluate whether new features of the Robovie robot enhanced user enjoyment (see Figure 10).

Figure 8: We are presently integrating automatic face tracking and expression analysis in automatic animated tutoring systems.

Figure 9: The real time system has been deployed in the Aibo robot (left) and the RoboVie robot (right).

Figure 10: Human response during interaction with the RoboVie robot at ATR is measured by automatic expression analysis.

6. Conclusions

Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale in the order of 40 milliseconds. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory rich perceptual primitives rather than slow symbolic inference processes. In this paper we present progress on one such perceptual primitive: real time recognition of facial expressions.

Our results suggest that user independent, fully automatic real time coding of basic expressions is an achievable goal with present computer power, at least for applications in which frontal views can be assumed. The problem of classification into 7 basic expressions can be solved with high accuracy by a simple linear system, after the images are preprocessed by a bank of Gabor filters. These results are consistent with those reported by Padgett and Cottrell on a smaller dataset [11]. A previous system [9] employed discriminant analysis (LDA) to classify facial expressions from Gabor representations. Here we explored using SVM's for facial expression classification. While LDA's are optimal when the class distributions are Gaussian, SVM's may be more effective when the class distributions are not Gaussian.

Good performance results were obtained for directly processing the output of an automatic face detector without the need for explicit detection and registration of facial features. Performance of a nonlinear SVM on the output of the automatic face finder was almost identical to performance on the same set of faces using explicit feature alignment with hand-labeled features.

Using Adaboost to perform feature selection greatly sped up the application. SVM's trained on this representation also showed an improved classification performance over Adaboost.

Acknowledgments

Support for this project was provided by ONR N00014-02-1-0616, NSF-ITR IIS-0086107, California Digital Media Innovation Program DIMI 01-10130, and the MIND Institute. Two of the authors (Movellan and Fasel) were also supported by the Intelligent Robotics and Communication Laboratory at ATR. We would also like to thank Hiroshi Ishiguro at ATR for help porting the software presented here to RoboVie. This research was supported in part by the Telecommunications Advancement Organization of Japan.

References

[1] Marian S. Bartlett. Face Image Analysis by Unsupervised Learning, volume 612 of The Kluwer International Series on Engineering and Computer Science. Kluwer Academic Publishers, Boston, 2001.

[2] M.S. Bartlett, G. Littlewort, B. Braathen, T.J. Sejnowski, and J.R. Movellan. A prototype for automatic recognition of spontaneous facial actions. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15, Cambridge, MA, 2003. MIT Press.

[3] I. Fasel and J.R. Movellan. Comparison of neurally inspired face detection algorithms. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 2002). UAM, 2002.

[4] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.

[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2):337–374, 2000.

[6] H. Ishiguro, T. Ono, M. Imai, T. Maeda, T. Kanda, and R. Nakatsu. Robovie: an interactive humanoid robot. Industrial Robot, 28(6):498–503, 2001.

[7] T. Kanade, J.F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), pages 46–53, Grenoble, France, 2000.

[8] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, W. Konen, C. von der Malsburg, and R. Wurtz. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300–311, 1993.

[9] M. Lyons, J. Budynek, A. Plante, and S. Akamatsu. Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis. In Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition, pages 202–207, 2000.

[10] Jiyong Ma, Jie Yan, and Ron Cole. CU Animate: Tools for enabling conversations with animated characters. In Proceedings of ICSLP-2002, Denver, USA, 2002.

[11] C. Padgett and G. Cottrell. Representing face images for emotion classification. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, Cambridge, MA, 1997. MIT Press.

[12] M. Pantic and J.M. Rothkrantz. Automatic analysis of facial expressions: State of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–1445, 2000.

[13] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.

[14] H. Schneiderman and T. Kanade. Probabilistic modeling of local appearance and spatial relationships for object recognition. In Proc. IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 45–51, 1998.

[15] Kah Kay Sung and Tomaso Poggio. Example based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.

[16] Paul Viola and Michael Jones. Robust real-time object detection. Technical Report CRL 2001/01, Cambridge Research Laboratory, 2001.
