
Detecting and Tracking Human Faces in Videos

Yadong Li, Ardeshir Goshtasby & Oscar Garcia
Computer Science and Engineering Department

Wright State University, Dayton, OH 45435

{yli, agoshtas, ogarcia}@cs.wright.edu

Abstract

A method for detecting and tracking human faces in color videos is presented. The method first uses a chroma chart with information about skin colors of various races to determine regions of skin color in the first frame of a video. A new chroma chart is computed for each region, which more precisely represents the color contents of that region. Chroma charts for different regions that are similar are combined, while those that are considerably different are kept separate. Model facial patterns are then used to detect faces within the skin regions. Once a face is detected, the particular pattern and color of the face are used to track the face. Regions where facial patterns are not detected are expected to correspond to exposed parts of the body or of the background and are ignored. The proposed method can track faces with a high degree of accuracy once they are identified.

1. Introduction and Background

This paper tackles the problem of detecting and tracking human faces in videos. It is assumed that a video clip containing one or more human faces is given and the camera capturing the video is stationary. Motion in the video is then due to human movement, not to the camera. Videos captured for surveillance in banks and supermarkets are the kind of videos considered for analysis here. By tracking human faces in videos, spatiotemporal motion signatures representing the motions of faces are generated. The motion signatures can then be used to characterize human activities.

Attempts to detect human faces in images have been made before by Yow and Cipolla [5], and others, and research to detect skin regions in images has been performed by Miyake et al. [2], and others. Considerable effort has also gone into tracking humans and understanding the actions of humans in videos. Wachter and Nagel [4] fitted a model figure to projections of a person to track the person in a video, and Polana and Nelson [3] attached reference points that could be tracked to the body of a person to study body motions. Surveys of past work on tracking humans in videos and understanding the performed actions have been given by Aggarwal and Cai [1].

The proposed tracking process involves detecting skin-color regions in the first video frame, identifying which of the skin-color regions correspond to human faces, and using the colors and patterns of the faces to track them.

The innovation proposed here is a new tracking algorithm that is robust and fast. Once a face is detected, its color and pattern are used to track it. The proposed algorithm can track a face even when considerable changes in the orientation and size of the face occur. Steps are also built into the system to detect and track new faces as they enter the video.

2. Detecting Skin-Color Regions

To characterize skin colors independent of scene lighting, the color components of each pixel are transformed into a luminance-chrominance representation and the luminance component is discarded. A chart representing the two chrominance components of the colors is then prepared. This chart is referred to as the chroma chart since it contains information about the chroma of colors. The chart shown in Fig. 1a was obtained by selecting 1300 skin samples from a large number of images. We see that the chart is denser in some areas than in others. However, even in very dense areas, there are gaps in the chart. Since it is very unlikely that in a small neighborhood many points would belong to skin color while the other points do not, it is necessary to fill the gaps in the chart. To fill the gaps, a radial basis function, such as a Gaussian, is centered at each sample and the sum of the basis functions is computed at each chart entry. Chart values are then normalized so that the largest value becomes one. Therefore, values in the normalized chroma chart vary between 0 and 1. Higher values correspond to neighborhoods where a larger number of samples are available, while smaller values fall in the neighborhoods where a smaller number of samples are available. The chart obtained in this manner is shown in Fig. 1b. Chart entries, therefore, show the likelihoods of different chromas representing skin color.

[Note: This paper appeared in Int'l Conf. Pattern Recognition, Barcelona, Spain, Sept. 3-7, 2000, pp. 807-810.]
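The gap-filling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the chart resolution, the value of sigma, and all names are our own choices.

```python
import numpy as np

def build_chroma_chart(samples, size=256, sigma=3.0):
    """Build a skin-likelihood chroma chart from (u, v) chroma samples.

    A Gaussian radial basis function is centered at each sample, the
    basis functions are summed at every chart entry, and the chart is
    normalized so that the largest value becomes one.
    """
    grid_u, grid_v = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    chart = np.zeros((size, size))
    for su, sv in samples:
        # Sum of Gaussian basis functions centered at each sample.
        chart += np.exp(-((grid_u - su) ** 2 + (grid_v - sv) ** 2) / (2 * sigma ** 2))
    # Normalize so values vary between 0 and 1.
    return chart / chart.max()

# Three illustrative chroma samples clustered in one neighborhood.
samples = [(100, 120), (102, 121), (104, 119)]
chart = build_chroma_chart(samples)
```

Entries near the samples then carry a high skin likelihood, while entries far from any sample remain near zero.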

Figure 1. (a) 1300 skin samples. (b) Skin-likelihood chart obtained from the skin samples.

By transforming the skin colors in Fig. 2a to skin likelihoods using the chroma chart of Fig. 1b, a grayscale image similar to that shown in Fig. 2b is obtained. Segmenting this image by an optimal thresholding method produces regions similar to those shown in Fig. 2c. To improve the segmentation accuracy, after the initial skin regions from the chroma chart of Fig. 1b are obtained, the colors of the pixels in each region are used to create a new chroma chart for that region. Chroma charts from two or more regions are then combined if they are sufficiently close. Two chroma charts are considered sufficiently close if the sum of their absolute differences is smaller than a given threshold value, which is determined experimentally. In Fig. 2b, the chroma charts obtained for the two regions were close enough to be combined. Using the combined chroma chart, the skin-likelihood image of Fig. 2b was obtained. Segmenting this image by the optimal thresholding method produced the regions shown in Fig. 2c. The optimal thresholding method finds the threshold value at which the change in the size of a region as a function of change in intensity is minimum. The optimal threshold value, therefore, is the one that produces a region most stable under variation of the threshold value.
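The stability criterion for the optimal threshold can be sketched as a sweep over candidate thresholds; the candidate grid and the function name below are illustrative, not taken from the paper.

```python
import numpy as np

def optimal_threshold(likelihood, candidates=None):
    """Pick the threshold at which the segmented region is most stable,
    i.e., where the change in region area per change in threshold is
    smallest (a sketch of the rule described in the text)."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)
    # Region area (pixel count) as a function of the threshold.
    areas = np.array([(likelihood >= t).sum() for t in candidates])
    # Change in area between consecutive thresholds; the first minimum wins.
    deltas = np.abs(np.diff(areas))
    return float(candidates[int(np.argmin(deltas))])

# A sharp-edged skin-like blob: its area is stable over a wide threshold range.
likelihood = np.zeros((20, 20))
likelihood[5:15, 5:15] = 0.9
t = optimal_threshold(likelihood)
```

For this synthetic blob the area is constant over most thresholds, so any threshold in the stable range recovers the same 10x10 region.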

Although the regions obtained as a result of this thresholding have the color of skin, there is no guarantee that they actually belong to skin. Some regions may belong to objects in the scene that have a similar color. This process, however, can reliably identify and eliminate those regions that do not represent skin. Faces are then located in the remaining skin-color regions using facial models.

Figure 2. (a) An image containing two faces. (b) The skin-likelihood image. (c) Segmentation by optimal thresholding. (d) Detected faces.

3. Detecting the Faces

In order to detect faces in an image, models representing different views of a face are first created. At present, only three (frontal, left, and right) views of an average face are used. To create an average face, different facial images are overlaid by an affine transformation in such a way that the centers of the eyes and mouth in the frontal view, and the center of an eye, the tip of the nose, and one ear in a side-view image, coincide. Then the average intensities of the overlaid faces are determined. The average facial models obtained in this manner are shown in Fig. 3. The geometry of an average face is obtained by averaging the coordinates of the centers of the eyes and the mouth in the front-view faces, and the coordinates of the center of an eye, the tip of the nose, and the center of an ear in the side-view images. Only the interior portions of the averaged faces are used in matching, in order to eliminate differences due to hairstyles, hats, and the background in images. Each model is obtained by averaging 16 faces with no facial hair or glasses from the face database of Kriegman and Belhumeur at Yale University and the face database of Achermann at the University of Bern.
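The overlay step relies on the fact that three point correspondences determine a 2D affine transformation exactly. A minimal sketch of that solve is below; the landmark coordinates and function name are illustrative.

```python
import numpy as np

def affine_from_landmarks(src, dst):
    """Solve the 2x3 affine transform mapping three source landmarks
    (e.g., the two eye centers and the mouth center in a frontal view)
    onto three destination landmarks."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates: rows [x, y, 1].
    A = np.hstack([src, np.ones((3, 1))])
    # Least squares is exact here since three points fix the transform.
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M.T  # 2x3 matrix applied as M @ [x, y, 1]

src = [(30, 40), (70, 40), (50, 80)]  # eyes and mouth (illustrative)
dst = [(32, 42), (72, 41), (51, 83)]  # same landmarks in another face
M = affine_from_landmarks(src, dst)
```

Each face image is warped by its transform so the landmarks coincide, and the overlaid intensities are then averaged.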

A facial region contains features, such as the eyes and the mouth, which do not have the color of skin. After image segmentation, these features appear as subregions within skin regions. The subregions are hypothesized as the eyes and the mouth in a frontal-view image, or as an eye in a left or a right side-view image. In a side-view image, the mouth region may connect to the background, the ear may not be detected, and only one of the eyes may be visible. Since the direction at which a person is viewed is not known, all three views are considered in the matching, and the correct view is determined by finding the model that produces the highest match rating with the skin region.

Figure 3. (a) Front-view, (b) left-view, and (c) right-view models of the face.

All combinations of zero, one, two, and three subregions are used in the matching. Fewer than three subregions are considered in the matching even when more than three subregions are available, because some of the subregions may be due to noise. Since all detected subregions could be due to noise, a facial model is also matched to the region in such a way that the vertical axis of the model aligns with the major axis of the region and its width is the same as that of the region. Among all matches, the one producing the highest match rating is selected to identify the orientation and size of the face. If the highest match rating is below a threshold value, it is assumed that a face does not exist in the skin region. In our implementation, the match-rating threshold was determined experimentally in such a way that the probability of falsely detecting a face is the same as the probability of missing a face.
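The threshold criterion at the end of this paragraph is an equal-error-rate choice. A sketch of how such a threshold could be found from labeled match ratings follows; the data and names are illustrative, and the paper does not describe its exact search procedure.

```python
import numpy as np

def equal_error_threshold(face_ratings, nonface_ratings):
    """Pick the match-rating threshold at which the probability of
    falsely accepting a non-face equals the probability of missing a
    true face (a sweep over all observed ratings)."""
    face = np.asarray(face_ratings, dtype=float)
    nonface = np.asarray(nonface_ratings, dtype=float)
    candidates = np.unique(np.concatenate([face, nonface]))
    best_t, best_gap = float(candidates[0]), float("inf")
    for t in candidates:
        miss = (face < t).mean()              # true faces rejected
        false_accept = (nonface >= t).mean()  # non-faces accepted
        gap = abs(miss - false_accept)
        if gap < best_gap:
            best_t, best_gap = float(t), gap
    return best_t

faces = [0.8, 0.85, 0.9, 0.95]     # ratings of true faces (illustrative)
nonfaces = [0.3, 0.4, 0.5, 0.6]    # ratings of non-face regions
t = equal_error_threshold(faces, nonfaces)
```

With well-separated rating distributions, the sweep lands on a threshold that rejects every non-face while missing no face.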

This method of hypothesis and verification enables detection of faces at arbitrary scales and orientations. Figure 2d shows the faces detected in the skin regions obtained in Fig. 2c. Due to occlusion, scene lighting, and other factors, some faces may be missed or some non-face regions may be classified as faces in the first frame of a video clip. Such mistakes are corrected later when subsequent video frames are processed.

4. Tracking the Detected Faces

Once a face is detected, its color is used to track the skin region containing the face in subsequent frames, and the pattern of the detected face is used to locate the face in the next frame. Using the specific color and pattern of a face simplifies detection and improves accuracy. Since the motion of a person's head from one frame to the next is limited, a bounding window is defined for each face, and segmentation and matching are performed only within the defined window. Image segmentation is still needed in order to determine the boundary of the face, but the process is much faster now since it is limited to a small window. After the first video frame, detection is faster and more accurate because the exact chroma and the exact pattern of the face are known.

The template-matching process is demonstrated in Fig. 4. Figure 4a shows the template obtained from the first frame in a video. This template contains the facial pattern for a particular individual in frame one. This pattern is searched for in frame two of the video, within a slightly larger area called the search area, shown in Fig. 4b. The search area is a window centered at the face region representing the template but slightly larger, to allow for possible motion of the face from one frame to the next. The size of the search area can be estimated from the maximum motion of a person in a video.
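The search described above can be sketched as an exhaustive scan of the template over the search area. The paper does not name its similarity measure, so the normalized cross-correlation below is an assumption, and all names are illustrative.

```python
import numpy as np

def match_template(search_area, template):
    """Scan the template over every position in the search area and
    return the top-left offset of the best match, scored by
    normalized cross-correlation."""
    H, W = search_area.shape
    h, w = template.shape
    tz = template - template.mean()
    tnorm = np.sqrt((tz ** 2).sum())
    best, best_score = (0, 0), -np.inf
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            win = search_area[y:y + h, x:x + w]
            wz = win - win.mean()
            denom = np.sqrt((wz ** 2).sum()) * tnorm
            score = (wz * tz).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best, best_score = (y, x), score
    return best, best_score

rng = np.random.default_rng(0)
frame = rng.random((40, 40))               # stand-in for a search area
template = frame[12:22, 18:28].copy()      # face pattern cut from frame one
(y, x), score = match_template(frame, template)
```

Restricting the scan to the bounding window described earlier keeps this brute-force search cheap.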

The results of both image segmentation and template matching are needed to correctly track a face. Image segmentation captures the boundary of the face and helps eliminate parts of the scene that do not belong to the face, such as the background or the hair. Segmentation alone, however, may miss the exact facial boundary when a skin-color region in the scene falls next to the face or a part of the face is occluded by an object having the color of skin. The role of template matching is to locate the face within a skin region and, if the detected skin region is much smaller or larger than the template, to adjust the template for matching in the subsequent frame.

Figure 4. (a) A template in a video frame. (b) The search area in the subsequent frame. (c) Segmentation of the search area. (d) Next template selected for matching.

The above process can track a face as long as a portion of it is visible. Even when facial features such as the eyes or the mouth become invisible, the process can still track the face.

After every n frames are processed, the new frame is assumed to be the first frame in the video and faces in it are again detected using the process described above. This process not only detects faces newly entering the video, it also detects faces that were missed earlier due to small size, occlusion, lighting, shadow, or other factors. If new faces are found, they are added to the list of faces to be tracked. If there is a need to track the faces that were missed earlier, the faces are backtracked until either the beginning of the video or a frame where the face enters the video is reached.
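The periodic re-detection loop can be sketched as follows. This is a high-level stub, not the authors' code: the detector and tracker are stand-ins, and the interval n is a placeholder for the paper's redetection interval, which is unreadable in this copy.

```python
def track_video(frames, detect_faces, track_face, n=15):
    """Track faces frame to frame, and every n frames treat the current
    frame as a "first" frame and rerun full detection to pick up faces
    that are new or were missed earlier."""
    faces = list(detect_faces(frames[0]))
    for i, frame in enumerate(frames[1:], start=1):
        # Track every known face into the current frame.
        faces = [track_face(frame, f) for f in faces]
        if i % n == 0:
            # Periodic full re-detection on this frame.
            for f in detect_faces(frame):
                if f not in faces:  # newly entered or previously missed face
                    faces.append(f)
    return faces

# Toy stand-ins: a "frame" is just the list of face labels visible in it.
frames = [["A"], ["A"], ["A", "B"], ["A", "B"]]
result = track_video(frames, detect_faces=list,
                     track_face=lambda frame, f: f, n=2)
```

The backtracking step for late-detected faces would run the same tracker on the stored frames in reverse order.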

Figure 5. (a) A video sequence containing a person moving back and then forward while facing the camera. (b) The depth map of the tubular structure obtained by tracking the facial region in the spatiotemporal domain. (c) A horizontal cross-section of the tubular structure. (d) The motion signature obtained by tracing the centroid of the tubular structure.

The motion of a person's face contains considerable information about the activities performed by the person. By tracking a face, we can trace a tubular structure of the kind shown in Fig. 5b in spatiotemporal space. A cross-section of this tubular structure with a horizontal plane is shown in Fig. 5c. By tracing the centroid of this tubular structure, we obtain a motion signature as shown in Fig. 5d. Such motion signatures can be used to characterize human activities.
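Tracing the centroid of the tracked region through the frames can be sketched as below, assuming the tracked face region in each frame is available as a boolean mask; the names are illustrative.

```python
import numpy as np

def motion_signature(masks):
    """Trace the centroid of the tracked face region through the
    frames, yielding (frame, x, y) samples of the motion signature."""
    signature = []
    for t, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        signature.append((t, float(xs.mean()), float(ys.mean())))
    return signature

# Illustrative input: a 3x3 face region moving one pixel right per frame.
masks = []
for t in range(3):
    m = np.zeros((10, 10), dtype=bool)
    m[4:7, 2 + t:5 + t] = True
    masks.append(m)
sig = motion_signature(masks)
```

The resulting sequence of centroids is the spatiotemporal trajectory that Fig. 5d visualizes.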

5. Results

The computational requirements of the face detection and tracking algorithm are as follows. Determination of faces in a video frame takes 25 seconds on average. The exact time depends on the number of faces in an image. Tracking of faces and skin regions from one video frame to the next takes about 2 seconds when a few persons are present. These measurements reflect computation times on an SGI O2 R10000 computer.

The face detection algorithm detects about 90 percent of the faces in images containing a few faces. The face detection accuracy decreases as the number of faces increases: as more faces appear in an image, the faces become smaller and facial features become less obvious. Among the faces missed in the first frame, 80 percent are detected later when subsequent video frames are processed. This rate increases when a person walks toward the camera, and it decreases when the person walks away from the camera. Once a face is detected, it is tracked without error until it becomes occluded. An occluded face will be detected again when it becomes visible.

6. Conclusions

The first step in analyzing videos containing humans is the detection and tracking of human faces. A robust method for detecting and tracking human faces in color videos was presented. Once the algorithm locks onto a face, it tracks the face even when only a small portion of it is visible. By tracking the face, motion signatures are generated, which can then be used to characterize human activities. Future work will focus on analyzing these motion signatures to recognize various human activities.

Acknowledgements

We would like to thank the National Science Foundation (Grants IIS-9906340 and EIA-9601670) and the Information Technology Research Institute, Wright State University, for supporting this work.

References

[1] J. K. Aggarwal and Q. Cai, Human motion analysis: A review, Computer Vision and Image Understanding, vol. 73, no. 3, 1999, pp. 428-440.

[2] Y. Miyake, H. Saitoh, H. Yaguchi, and N. Tsukada, Facial pattern detection and color correction from television picture and newspaper printing, Journal of Imaging Technology, vol. 16, no. 5, 1990, pp. 165-169.

[3] R. Polana and R. Nelson, Detecting activities, Proc. Computer Vision and Pattern Recognition, 1991, pp. 2-7.

[4] W. Wachter and H.-H. Nagel, Tracking persons in monocular image sequences, Computer Vision and Image Understanding, vol. 74, no. 3, 1999, pp. 174-192.

[5] K. C. Yow and R. Cipolla, Feature-based human face detection, Image and Vision Computing, vol. 15, 1997, pp. 713-735.