
Detecting and Tracking Human Faces in Videos

Yadong Li, Ardeshir Goshtasby & Oscar Garcia
Computer Science and Engineering Department

Wright State University, Dayton, OH 45435

{yli, agoshtas, ogarcia}@cs.wright.edu

Abstract

A method for detecting and tracking human faces in color videos is presented. The method first uses a chroma chart with information about skin colors of various races to determine regions of skin color in the first frame of a video. A new chroma chart is computed for each region, which more precisely represents the color contents of that region. Chroma charts for different regions that are similar are combined, while those that are considerably different are kept separate. Model facial patterns are then used to detect faces within the skin regions. Once a face is detected, the particular pattern and color of the face are used to track the face. Regions where facial patterns are not detected are expected to correspond to exposed parts of the body or of the background and are ignored. The proposed method can track faces with a high degree of accuracy once they are identified.

1. Introduction and Background

This paper tackles the problem of detecting and tracking human faces in videos. It is assumed that a video clip containing one or more human faces is given and the camera capturing the video is stationary. Motion in the video is then due to human movement, not to the camera. Videos captured for surveillance in banks and supermarkets are the kind of videos considered for analysis here. By tracking human faces in videos, spatiotemporal motion signatures representing the motions of faces are generated. The motion signatures can then be used to characterize human activities.

Attempts to detect human faces in images have been made before by Yow and Cipolla [5], and others, and research to detect skin regions in images has been performed by Miyake et al. [2], and others. Considerable effort has also gone into tracking humans and understanding the actions of humans in videos. Wachter and Nagel [4] fitted a model figure to projections of a person to track the person in a video, and Polana and Nelson [3] attached reference points that could be tracked to the body of a person to study body motions. Surveys of past work on tracking humans in videos and understanding the performed actions have been given by Aggarwal and Cai [1].

The proposed tracking process involves detecting skin-color regions in the first video frame, identifying which of the skin-color regions correspond to human faces, and using the colors and patterns of the faces to track them.

The innovation proposed here is a new tracking algorithm that is robust and fast. Once a face is detected, its color and pattern are used to track it. The proposed algorithm can track a face even when considerable changes in the orientation and size of the face occur. Steps are also built into the system to detect and track new faces as they enter the video.

2. Detecting Skin-Color Regions

To characterize skin colors independent of scene lighting, the color components of each pixel are transformed into a luminance-chrominance representation and the luminance component is discarded. A chart representing the two chrominance components of the colors is then prepared. This chart is referred to as the chroma chart since it contains information about the chroma of colors. The chart shown in Fig. 1a was obtained by selecting 1300 skin samples from a large number of images. We see that the chart is denser in some areas than in others. However, even in very dense areas, there are gaps in the chart. Since it is very unlikely that in a small neighborhood many points would belong to skin color while the other points do not, it is necessary to fill the gaps in the chart. To fill the gaps, a radial basis function, such as a Gaussian, is centered at each sample and the sum of the basis functions is computed at each chart entry. Chart values are then normalized so that the largest value becomes one. Therefore, values in the normalized chroma chart vary between 0 and 1. Higher values correspond to neighborhoods where a larger number of samples are available, while smaller values fall in the neighborhoods where a smaller number of samples are available. The chart obtained in this manner is shown in Fig. 1b. Chart entries, therefore, show the likelihoods of different chromas representing skin color.

[Note: This paper appeared in Int'l Conf. Pattern Recognition, Barcelona, Spain, Sept. 3-7, 2000, pp. 807-810.]
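The gap-filling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the chart resolution, the value of sigma, and all names are our own choices.

```python
import numpy as np

def build_chroma_chart(samples, size=256, sigma=3.0):
    """Build a skin-likelihood chroma chart from (u, v) chroma samples.

    A Gaussian radial basis function is centered at each sample, the
    basis functions are summed at every chart entry, and the chart is
    normalized so that the largest value becomes one.
    """
    grid_u, grid_v = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    chart = np.zeros((size, size))
    for su, sv in samples:
        # Sum of Gaussian basis functions centered at each sample.
        chart += np.exp(-((grid_u - su) ** 2 + (grid_v - sv) ** 2) / (2 * sigma ** 2))
    # Normalize so values vary between 0 and 1.
    return chart / chart.max()

# Three illustrative chroma samples clustered in one neighborhood.
samples = [(100, 120), (102, 121), (104, 119)]
chart = build_chroma_chart(samples)
```

Entries near the samples then carry a high skin likelihood, while entries far from any sample remain near zero.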

Figure 1. (a) 1300 skin samples. (b) Skin-likelihood chart obtained from the skin samples.

By transforming the skin colors in Fig. 2a to skin likelihoods using the chroma chart of Fig. 1b, a grayscale image similar to that shown in Fig. 2b is obtained. Segmenting this image by an optimal thresholding method produces regions similar to those shown in Fig. 2c. To improve the segmentation accuracy, after the initial skin regions from the chroma chart of Fig. 1b are obtained, the colors of the pixels in each region are used to create a new chroma chart for that region. Chroma charts from two or more regions are then combined if they are sufficiently close. Two chroma charts are considered sufficiently close if the sum of their absolute differences is smaller than a given threshold value, which is determined experimentally. In Fig. 2b, the chroma charts obtained for the two regions were close enough to be combined. Using the combined chroma chart, the skin-likelihood image of Fig. 2b was obtained. Segmenting this image by the optimal thresholding method produced the regions shown in Fig. 2c. The optimal thresholding method finds the threshold value at which the change in the size of a region as a function of change in intensity is minimum. The optimal threshold value, therefore, is the one that produces a region most stable under variation of the threshold value.
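The stability criterion for the optimal threshold can be sketched as a sweep over candidate thresholds; the candidate grid and the function name below are illustrative, not taken from the paper.

```python
import numpy as np

def optimal_threshold(likelihood, candidates=None):
    """Pick the threshold at which the segmented region is most stable,
    i.e., where the change in region area per change in threshold is
    smallest (a sketch of the rule described in the text)."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)
    # Region area (pixel count) as a function of the threshold.
    areas = np.array([(likelihood >= t).sum() for t in candidates])
    # Change in area between consecutive thresholds; the first minimum wins.
    deltas = np.abs(np.diff(areas))
    return float(candidates[int(np.argmin(deltas))])

# A sharp-edged skin-like blob: its area is stable over a wide threshold range.
likelihood = np.zeros((20, 20))
likelihood[5:15, 5:15] = 0.9
t = optimal_threshold(likelihood)
```

For this synthetic blob the area is constant over most thresholds, so any threshold in the stable range recovers the same 10x10 region.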

Although the regions obtained as a result of this thresholding have the color of skin, there is no guarantee that they actually belong to skin. Some regions may belong to objects in the scene that have a similar color. This process, however, can reliably identify and eliminate those regions that do not represent skin. Faces are then located in the remaining skin-color regions using facial models.

Figure 2. (a) An image containing two faces. (b) The skin-likelihood image. (c) Segmentation by optimal thresholding. (d) Detected faces.

3. Detecting the Faces

In order to detect faces in an image, models representing different views of a face are first created. At present, only three (frontal, left, and right) views of an average face are used. To create an average face, different facial images are overlaid by an affine transformation in such a way that the centers of the eyes and mouth in the frontal view, and the center of an eye, the tip of the nose, and one ear in a side-view image, coincide. Then the average intensities of the overlaid faces are determined. The average facial models obtained in this manner are shown in Fig. 3. The geometry of an average face is obtained by averaging the coordinates of the centers of the eyes and the mouth in the front-view faces, and the coordinates of the center of an eye, the tip of the nose, and the center of an ear in the side-view images. Only the interior portions of the averaged faces are used in matching, in order to eliminate differences due to hairstyles, hats, and the background in images. Each model is obtained by averaging 16 faces with no facial hair or glasses from the face database of Kriegman and Belhumeur at Yale University and the face database of Achermann at the University of Bern.
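The overlay step relies on the fact that three point correspondences determine a 2D affine transformation exactly. A minimal sketch of that solve is below; the landmark coordinates and function name are illustrative.

```python
import numpy as np

def affine_from_landmarks(src, dst):
    """Solve the 2x3 affine transform mapping three source landmarks
    (e.g., the two eye centers and the mouth center in a frontal view)
    onto three destination landmarks."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates: rows [x, y, 1].
    A = np.hstack([src, np.ones((3, 1))])
    # Least squares is exact here since three points fix the transform.
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M.T  # 2x3 matrix applied as M @ [x, y, 1]

src = [(30, 40), (70, 40), (50, 80)]  # eyes and mouth (illustrative)
dst = [(32, 42), (72, 41), (51, 83)]  # same landmarks in another face
M = affine_from_landmarks(src, dst)
```

Each face image is warped by its transform so the landmarks coincide, and the overlaid intensities are then averaged.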

A facial region contains features, such as the eyes and the mouth, which do not have the color of skin. After image segmentation, these features appear as subregions within skin regions. The subregions are hypothesized as the eyes and the mouth in a frontal-view image, or as an eye in a left or a right side-view image. In a side-view image, the mouth region may connect to the background, the ear may not be detected, and only one of the eyes may be visible. Since the direction at which a person is viewed is not known, all three views are considered in the matching, and the correct view is determined by finding the model that produces the highest match rating with the skin region.

Figure 3. (a) Front-view, (b) left-view, and (c) right-view models of the face.

All combinations of zero, one, two, and three subregions are used in the matching. Fewer than three subregions are considered in the matching even when more than three subregions are available, because some of the subregions may be due to noise. Since all detected subregions could be due to noise, a facial model is also matched to the region in such a way that the vertical axis of the model aligns with the major axis of the region and its width is the same as that of the region. Among all matches, the one producing the highest match rating is selected to identify the orientation and size of the face. If the highest match rating is below a threshold value, it is assumed that a face does not exist in the skin region. In our implementation, the match-rating threshold was determined experimentally in such a way that the probability of falsely detecting a face is the same as the probability of missing a face.
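The threshold criterion at the end of this paragraph is an equal-error-rate choice. A sketch of how such a threshold could be found from labeled match ratings follows; the data and names are illustrative, and the paper does not describe its exact search procedure.

```python
import numpy as np

def equal_error_threshold(face_ratings, nonface_ratings):
    """Pick the match-rating threshold at which the probability of
    falsely accepting a non-face equals the probability of missing a
    true face (a sweep over all observed ratings)."""
    face = np.asarray(face_ratings, dtype=float)
    nonface = np.asarray(nonface_ratings, dtype=float)
    candidates = np.unique(np.concatenate([face, nonface]))
    best_t, best_gap = float(candidates[0]), float("inf")
    for t in candidates:
        miss = (face < t).mean()              # true faces rejected
        false_accept = (nonface >= t).mean()  # non-faces accepted
        gap = abs(miss - false_accept)
        if gap < best_gap:
            best_t, best_gap = float(t), gap
    return best_t

faces = [0.8, 0.85, 0.9, 0.95]     # ratings of true faces (illustrative)
nonfaces = [0.3, 0.4, 0.5, 0.6]    # ratings of non-face regions
t = equal_error_threshold(faces, nonfaces)
```

With well-separated rating distributions, the sweep lands on a threshold that rejects every non-face while missing no face.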

This method of hypothesis and verification enables detection of faces at arbitrary scales and orientations. Figure 2d shows the faces detected in the skin regions obtained in Fig. 2c. Due to occlusion, scene lighting, and other factors, some faces may be missed or some non-face regions may be classified as faces in the first frame of a video clip. Such mistakes are corrected later when subsequent video frames are processed.

4. Tracking the Detected Faces

Once a face is detected, its color is used to track the skin region containing the face in subsequent frames, and the pattern of the detected face is used to locate the face in the next frame. Using the specific color and pattern of a face simplifies detection and improves accuracy. Since the motion of a person's head from one frame to the next is limited, a bounding window is defined for each face, and segmentation and matching are performed only within the defined window. Image segmentation is still needed in order to determine the boundary of the face, but the process is much faster now since it is limited to a small window. After the first video frame, detection is faster and more accurate because the exact chroma and the exact pattern of the face are known.

The template-matching process is demonstrated in Fig. 4. Figure 4a shows the template obtained from the first frame in a video. This template contains the facial pattern for a particular individual in frame one. This pattern is searched for in frame two of the video, within a slightly larger area called the search area, shown in Fig. 4b. The search area is a window centered at the face region representing the template but slightly larger, to allow for possible motion of the face from one frame to the next. The size of the search area can be estimated from the maximum motion of a person in a video.
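The search described above can be sketched as an exhaustive scan of the template over the search area. The paper does not name its similarity measure, so the normalized cross-correlation below is an assumption, and all names are illustrative.

```python
import numpy as np

def match_template(search_area, template):
    """Scan the template over every position in the search area and
    return the top-left offset of the best match, scored by
    normalized cross-correlation."""
    H, W = search_area.shape
    h, w = template.shape
    tz = template - template.mean()
    tnorm = np.sqrt((tz ** 2).sum())
    best, best_score = (0, 0), -np.inf
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            win = search_area[y:y + h, x:x + w]
            wz = win - win.mean()
            denom = np.sqrt((wz ** 2).sum()) * tnorm
            score = (wz * tz).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best, best_score = (y, x), score
    return best, best_score

rng = np.random.default_rng(0)
frame = rng.random((40, 40))               # stand-in for a search area
template = frame[12:22, 18:28].copy()      # face pattern cut from frame one
(y, x), score = match_template(frame, template)
```

Restricting the scan to the bounding window described earlier keeps this brute-force search cheap.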

The results of both image segmentation and template matching are needed to correctly track a face. Image segmentation captures the boundary of the face and helps eliminate parts of the scene that do not belong to the face, such as the background or the hair. Segmentation alone, however, may miss the exact facial boundary when a skin-color region in the scene falls next to the face or a part of the face is occluded by an object having the color of skin. The role of template matching is to locate the face within a skin region and, if the detected skin region is much smaller or larger than the template, to adjust the template for matching in the subsequent frame.

Figure 4. (a) A template in a video frame. (b) The search area in the subsequent frame. (c) Segmentation of the search area. (d) Next template selected for matching.

The above process can track a face as long as a portion of it is visible. Even when facial features such as the eyes or the mouth become invisible, the process can still track the face.

After every n frames are processed, the new frame is assumed to be the first frame in the video and faces in it are again detected using the process described above. This process not only detects faces newly entering the video, it also detects faces that were missed earlier due to small size, occlusion, lighting, shadow, or other factors. If new faces are found, they are added to the list of faces to be tracked. If there is a need to track the faces that were missed earlier, the faces are backtracked until either the beginning of the video or a frame where the face enters the video is reached.
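The periodic re-detection loop can be sketched as follows. This is a high-level stub, not the authors' code: the detector and tracker are stand-ins, and the interval n is a placeholder for the paper's redetection interval, which is unreadable in this copy.

```python
def track_video(frames, detect_faces, track_face, n=15):
    """Track faces frame to frame, and every n frames treat the current
    frame as a "first" frame and rerun full detection to pick up faces
    that are new or were missed earlier."""
    faces = list(detect_faces(frames[0]))
    for i, frame in enumerate(frames[1:], start=1):
        # Track every known face into the current frame.
        faces = [track_face(frame, f) for f in faces]
        if i % n == 0:
            # Periodic full re-detection on this frame.
            for f in detect_faces(frame):
                if f not in faces:  # newly entered or previously missed face
                    faces.append(f)
    return faces

# Toy stand-ins: a "frame" is just the list of face labels visible in it.
frames = [["A"], ["A"], ["A", "B"], ["A", "B"]]
result = track_video(frames, detect_faces=list,
                     track_face=lambda frame, f: f, n=2)
```

The backtracking step for late-detected faces would run the same tracker on the stored frames in reverse order.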

Figure 5. (a) A video sequence containing a person moving back and then forward while facing the camera. (b) The depth map of the tubular structure obtained by tracking the facial region in the spatiotemporal domain. (c) A horizontal cross-section of the tubular structure. (d) The motion signature obtained by tracing the centroid of the tubular structure.

The motion of a person's face contains considerable information about the activities performed by the person. By tracking a face, we can trace a tubular structure of the kind shown in Fig. 5b in spatiotemporal space. A cross-section of this tubular structure with a horizontal plane is shown in Fig. 5c. By tracing the centroid of this tubular structure, we obtain a motion signature as shown in Fig. 5d. Such motion signatures can be used to characterize human activities.
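Tracing the centroid of the tracked region through the frames can be sketched as below, assuming the tracked face region in each frame is available as a boolean mask; the names are illustrative.

```python
import numpy as np

def motion_signature(masks):
    """Trace the centroid of the tracked face region through the
    frames, yielding (frame, x, y) samples of the motion signature."""
    signature = []
    for t, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        signature.append((t, float(xs.mean()), float(ys.mean())))
    return signature

# Illustrative input: a 3x3 face region moving one pixel right per frame.
masks = []
for t in range(3):
    m = np.zeros((10, 10), dtype=bool)
    m[4:7, 2 + t:5 + t] = True
    masks.append(m)
sig = motion_signature(masks)
```

The resulting sequence of centroids is the spatiotemporal trajectory that Fig. 5d visualizes.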

5. Results

The computational requirements of the face detection and tracking algorithm are as follows. Determination of faces in a video frame takes 25 seconds on average. The exact time depends on the number of faces in an image. Tracking of faces and skin regions from one video frame to the next takes about 2 seconds when a few persons are present. These measurements reflect computation times on an SGI O2 R10000 computer.

The face detection algorithm detects about 90 percent of the faces in images containing a few faces. The face detection accuracy decreases as the number of faces increases: as more faces appear in an image, the faces become smaller and facial features become less obvious. Among the faces missed in the first frame, 80 percent are detected later when subsequent video frames are processed. This rate increases when a person walks toward the camera, and it decreases when the person walks away from the camera. Once a face is detected, it is tracked without error until it becomes occluded. An occluded face will be detected again when it becomes visible.

6. Conclusions

The first step in analyzing videos containing humans is the detection and tracking of human faces. A robust method for detecting and tracking human faces in color videos was presented. Once the algorithm locks onto a face, it tracks the face even when only a small portion of it is visible. By tracking the face, motion signatures are generated, which can then be used to characterize human activities. Future work will focus on analyzing these motion signatures to recognize various human activities.

Acknowledgements

We would like to thank the National Science Foundation (Grants IIS-9906340 and EIA-9601670) and the Information Technology Research Institute, Wright State University, for supporting this work.

References

[1] J. K. Aggarwal and Q. Cai, Human motion analysis: A review, Computer Vision and Image Understanding, vol. 73, no. 3, 1999, pp. 428-440.

[2] Y. Miyake, H. Saitoh, H. Yaguchi, and N. Tsukada, Facial pattern detection and color correction from television picture and newspaper printing, Journal of Imaging Technology, vol. 16, no. 5, 1990, pp. 165-169.

[3] R. Polana and R. Nelson, Detecting activities, Proc. Computer Vision and Pattern Recognition, 1991, pp. 2-7.

[4] W. Wachter and H.-H. Nagel, Tracking persons in monocular image sequences, Computer Vision and Image Understanding, vol. 74, no. 3, 1999, pp. 174-192.

[5] K. C. Yow and R. Cipolla, Feature-based human face detection, Image and Vision Computing, vol. 15, 1997, pp. 713-735.