
ISSC 2009, Dublin, June 10–11

Region of Interest Extraction using Colour Based Methods on the CUAVE Database

Craig Berry and Naomi Harte

Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland

E-mail: [email protected]

Abstract — Region of interest (ROI) extraction is an important step in deriving visual features for an audio-visual speech recognition system. Colour based segmentation offers the potential of computationally inexpensive algorithms for ROI selection. This paper presents a comparative study of two colour based techniques, one using hue and accumulated difference, the other chrominance. Results are presented for the CUAVE database. The two methods achieved 69% and 72% correct ROI extraction. The experiment prompted investigation of a new method using a chrominance based accumulated difference image. The new method achieved 79% correct ROI identification. The overall results suggest that a dual approach using chrominance to locate the mouth region and only employing an accumulated difference image when significant motion is not present would offer good robustness with lower computational cost.

Keywords — Audio Visual Speech Recognition, Region of Interest, Colour Based Segmentation, CUAVE database.

I Introduction

Automatic Speech Recognition (ASR) is a highly enabling technology [1]. Its value lies in making human interaction with machines more natural and efficient. The bimodal nature of human speech interpretation, with its use of both audio and visual cues, has prompted research into Audio-Visual Speech Recognition (AVSR) as a means of improving the accuracy and robustness of conventional audio only speech recognition systems. The fusion of both audio and visual features has been shown to improve the reliability of speech recognition, making the task less susceptible to operating conditions such as high levels of background noise or multiple speaker environments [2].

In a Hidden Markov Model (HMM) based AVSR system, the first stage in developing a new system is feature extraction from the audio and video stream. This study is focused on the visual feature extraction stage for an AVSR system. In order to extract useful visual features it is essential that a reliable estimate of the mouth region of interest (ROI) can be established from the visual information stream. The visual features in this system will capture cues used by humans in lipreading. Hence the ROI extracted is required to encompass the lip region and chin. ROI extraction has generally been implemented using either image processing techniques (colour segmentation, edge detection, motion information) or statistical modeling (template matching, snakes, active appearance models) [2]. The attraction of image processing techniques is that they can be less computationally expensive than those that require statistical modeling. To date there is no strong consensus in the AVSR community as to which approach is most robust.

This paper presents a comparative study of two prominent image processing ROI extraction algorithms [3][4]. Both methods use colour information to segment the lip region. In order to reduce the computational requirements of AVSR systems, colour information is not always used. However, it has been shown that it can significantly improve lip recognition in comparison to graylevel techniques [5]. The database used in evaluating the performance of the methods is the CUAVE audio-visual database [6]. This database presents a reasonably challenging task and allows comparison between AVSR systems.

This paper is organised as follows: Section II provides a brief overview of ROI extraction techniques, followed by a detailed description of the two implemented algorithms. Section III introduces the CUAVE audio-visual database and in Section IV the experimental procedure is outlined. Finally, in Section V our findings are given before some conclusions are drawn.

II Colour Based Segmentation of Mouth ROI

Establishing the location of the mouth region within an image through colour space analysis is a difficult problem. The brightness of human skin/lips can vary significantly with differing lighting conditions but is found to have consistent chromatic features [7]. Many approaches therefore transform the RGB signal to discard the luminance component, utilising chrominance components in the segmentation [3][4][7][8][9].

In this paper two colour based segmentation methods are implemented. The first method was proposed by Zhang et al. [3]. It employs the hue component obtained from HSV transformation coupled with the accumulated difference of an image sequence to find the mouth region. The second method was proposed by Hsu et al. [4] and uses the red and blue chrominance components to locate the mouth region.

a) Zhang’s Method

Zhang et al. evaluated the effectiveness of different colour spaces to discriminate between the face and lip regions. By plotting the histograms of each component of RGB, normalised rgb, YCbCr and HSV for both the face region and the mouth region, it was shown that the hue component histograms for the face and mouth region showed the least similarity.

The red component of hue is located at both the low and high ends of the hue colour range. To obtain a connected red region in the high end, the hue component is rotated about 0 by 0.0333 for hue defined within the range [0, 1] (or 12° within the range [0, 360]). Then, by performing a suitable thresholding operation all non-red hue components can be removed from the image. Another thresholding operation is required to remove hue components which have low saturation values as these are more susceptible to noise. Once the thresholding operations are completed, all remaining pixels are designated as lip pixels. The result is a binary image:

$$T(i,j) = \begin{cases} 1, & \text{for } H(i,j) > H_0 \text{ and } S(i,j) > S_0 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $H$ is the hue image, $S$ is the saturation image, $i$ and $j$ are the pixel locations and $H_0 = 0.8$ and $S_0 = 0.25$ for $H, S \in [0, 1]$.
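As a concrete illustration, a minimal sketch of this thresholding in Python with OpenCV follows. The function name is our own, and the rotation direction is an assumption: we wrap the low-end reds up towards 1 so that the red region becomes connected at the high end of the hue range, consistent with the $H > H_0$ test.

```python
import cv2
import numpy as np

def hue_saturation_mask(bgr_frame, h0=0.8, s0=0.25, rotation=0.0333):
    """Binary lip mask per equation (1): 1 where H > H0 and S > S0.

    The hue channel is rotated about 0 by 0.0333 (hue in [0, 1]) so that
    red hues, split across both ends of the range, form one connected
    region at the high end.
    """
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    # OpenCV stores H in [0, 180) and S in [0, 255]; normalise to [0, 1].
    h = hsv[:, :, 0].astype(np.float32) / 180.0
    s = hsv[:, :, 1].astype(np.float32) / 255.0
    h = (h - rotation) % 1.0  # wrap low-end reds up towards 1
    return ((h > h0) & (s > s0)).astype(np.uint8)
```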

To remove non-lip red hue pixels in the resultant binary image, an accumulated difference image (ADI) is calculated on the R component of RGB for the preceding 100 frames as:

$$\mathrm{ADI}_k(i,j) = \sum_{q=k-98}^{k} \Delta R_q(i,j) \qquad (2)$$

where:

$$\Delta R_q(i,j) = |R_q(i,j) - R_{q-1}(i,j)| \qquad (3)$$

$k$ is the frame number and $R$ is the red component of RGB.
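A direct sketch of equations (2) and (3) in Python; the frame container and function name are our own for illustration:

```python
import numpy as np

def accumulated_difference_image(red_frames, k):
    """ADI_k per equations (2)-(3): sum of absolute differences of the
    red channel over the 100 frames up to and including frame k.

    red_frames: sequence of 2-D arrays, the R component of each frame.
    """
    adi = np.zeros(red_frames[k].shape, dtype=np.float32)
    for q in range(k - 98, k + 1):
        adi += np.abs(red_frames[q].astype(np.float32)
                      - red_frames[q - 1].astype(np.float32))
    return adi
```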

The lips can be extracted from this image as their movement during speaking will cause a large ADI within the mouth region. Two sequential thresholding operations are then performed on the ADI, the first using Otsu's method [10] on the entire image, the second by applying Otsu's method again to all pixels greater than the first threshold.
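The double Otsu step might look as follows; this is a sketch assuming the ADI is first rescaled to 8-bit, which the paper does not specify:

```python
import cv2
import numpy as np

def double_otsu_mask(adi):
    """Two sequential Otsu thresholds on the ADI: the second threshold is
    computed only over the pixels that passed the first."""
    img = cv2.normalize(adi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    t1, _ = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    bright = img[img > t1].reshape(1, -1)  # pixels above the first threshold
    t2, _ = cv2.threshold(bright, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return (img > t2).astype(np.uint8)
```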

Finally, in order to obtain a location for the mouth region, an AND operation is performed on the thresholded binary ADI and hue images. Then by selecting the largest connected region a bounding box can be drawn around the mouth region (see Figure 2).

b) Hsu’s Method

Hsu et al. acknowledge that the red chrominance component $C_r$ is more prevalent in the mouth region than the blue chrominance component $C_b$. It was shown that the mouth region had a high $C_r^2$ response, but a low $C_r/C_b$ response. The mouth map is found as:

$$M = C_r^2 \left( C_r^2 - \eta \frac{C_r}{C_b} \right)^2 \qquad (4)$$

where $\eta$ is the ratio of the average $C_r^2$ to the average $C_r/C_b$,

$$\eta = 0.95 \times \frac{\frac{1}{n} \sum_{(i,j) \in F} C_r(i,j)^2}{\frac{1}{n} \sum_{(i,j) \in F} C_r(i,j)/C_b(i,j)} \qquad (5)$$

$C_r^2$ and $C_r/C_b$ are in the range [0, 255] and $n$ is the number of pixels in the face mask $F$. The face mask, $F$, is created by finding the skin colour pixels in a facial image and grouping them within a pseudo-convex hull (see Figure 3).
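A minimal sketch of the mouth map computation in Python follows. The rescaling of $C_r^2$ and $C_r/C_b$ to a common [0, 255] range and the small epsilon guarding the division are our assumptions; the $1/n$ factors in equation (5) cancel, so means are used directly.

```python
import cv2
import numpy as np

def mouth_map(bgr_frame, face_mask):
    """Mouth map per equations (4)-(5).

    face_mask: boolean array marking the face region F.
    """
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1].astype(np.float32)
    cb = ycrcb[:, :, 2].astype(np.float32)
    cr2 = cr ** 2
    ratio = cr / (cb + 1e-6)              # guard against division by zero
    # Bring both terms into the [0, 255] range used by equation (5).
    cr2 = 255.0 * cr2 / cr2.max()
    ratio = 255.0 * ratio / ratio.max()
    eta = 0.95 * cr2[face_mask].mean() / ratio[face_mask].mean()
    m = cr2 * (cr2 - eta * ratio) ** 2
    return m / m.max()                    # normalised to [0, 1]
```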

III CUAVE Database

The video sequences used for this study were taken from the Clemson University Audio-Visual Experiments (CUAVE) database [6]. This database is freely available and therefore is an excellent candidate for comparison of results in audio-visual research. Many researchers have previously used their own recorded data and therefore it can be difficult to compare the robustness of individual methods.

The CUAVE database consists of 36 individual speakers, 19 male and 17 female, and 20 pairs of speakers speaking both connected and continuous strings of integers. The database comprises over 7,000 utterances, and though its size is limited to fit on one DVD, it has been designed to be as comprehensive and challenging as possible. A wide variety of subjects with different skin tones and visual features such as spectacles, hats and facial hair are included. Each speaker speaks in a variety of positions. This study is focused on the frontal profile.

IV Experimental Setup

To quantify the operation of the two algorithms, and their ability to find the mouth region of interest within the face region, four example frames of each subject in CUAVE were analysed (a total of 144 frames). The test frames are 250, 500, 600 and 850. Hsu's method only requires one frame to find the mouth region; however, the ADI for Zhang's method is calculated on the preceding 100 frames with the hue thresholding operation performed on the test frames.

Zhang's method was implemented as follows: an OpenCV implementation [11] of the Viola Jones [12] face detector was used to isolate the face region in the frame under test. This reduces the operation of the algorithm to the extraction of the lip region from the face, as opposed to the face finding itself. The ADI for that frame is calculated using 100 frames as defined in equation 2. The hue calculations as denoted in II (a) are performed on the test frame. All pixels outside the face region are zeroed (masked) in both the ADI and thresholded hue images.
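For reference, face isolation with a current OpenCV cascade classifier might look like the sketch below; the cascade file name is an assumption (the original study used the 2009-era OpenCV library).

```python
import cv2

# A stock frontal-face cascade shipped with OpenCV (assumed file name).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_region(bgr_frame):
    """Return (x, y, w, h) of the largest detected face, or None."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    return max(faces, key=lambda f: f[2] * f[3])  # pick largest by area
```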

Early testing showed that the AND operation frequently failed to return a useful result. This occurred when the ADI highlighted the centre of the mouth region and the thresholded hue returned only the lip region. Depending on the current mouth position, the two might not overlap. Therefore it was necessary to dilate both the thresholded ADI and hue images before ANDing. The largest connected region within this resultant image is then selected as the mouth region. The centre pixel within this region is noted and a bounding box is drawn around this. This step is not specifically detailed in Zhang's original paper.
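A sketch of this dilate-AND-select step (the kernel size is an assumption, as the paper does not state the dilation parameters):

```python
import cv2
import numpy as np

def mouth_centre(adi_mask, hue_mask, kernel_size=7):
    """Dilate both binary masks, AND them, and return the centroid of the
    largest connected region as the mouth centre, or None."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    combined = cv2.dilate(adi_mask, kernel) & cv2.dilate(hue_mask, kernel)
    count, _, stats, centroids = cv2.connectedComponentsWithStats(combined)
    if count < 2:                 # only the background label was found
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return centroids[largest]     # (x, y) of the mouth region centre
```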

Hsu's method is implemented directly using the formula in equation 4. It is necessary to determine the number of pixels within the face region, $n$. In Hsu's paper [4] skin pixels are found and a pseudo-convex hull is fitted to them. The number of pixels within this region is $n$. In this study $n$ is chosen to be the number of pixels within the face region returned from the Viola Jones algorithm. The resultant image is normalised to the region [0, 1]. Equation 4 returns the mouth region with higher intensity values than non-mouth region pixels.

In this study a threshold is applied to remove all pixels with intensity less than 0.6. The image is dilated and then segmented into connected regions, the largest of which is selected as the mouth region. The centre pixel within this region is chosen and a bounding box is drawn around this. These concluding steps are not specifically detailed in Hsu's paper.
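These concluding steps might be sketched as follows (the dilation kernel size is again an assumption):

```python
import cv2
import numpy as np

def hsu_mouth_centre(mouth_map_img, threshold=0.6, kernel_size=7):
    """Threshold the normalised mouth map, dilate, and return the centroid
    of the largest connected region (the assumed mouth), or None."""
    mask = (mouth_map_img > threshold).astype(np.uint8)
    mask = cv2.dilate(mask, np.ones((kernel_size, kernel_size), np.uint8))
    count, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if count < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return centroids[largest]
```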

It was determined empirically that a suitable size of bounding box should comprise 26% of the pixels that are contained in the face region. This allows for the lips and the chin to be selected. From observation it was noted that the height of the bounding box should be 80% that of the width. The positioning of the bounding box is different for the two methods. The centre pixel in Zhang's method should lie within the centre of the mouth as the ADI is largest there. Hsu's method will return the region with the largest chrominance, i.e. the lips. In fact, the highest red chromatic pixels are contained within the lower lip, and in most cases the centre pixel is within this region. For Zhang's method, it was found that the centre pixel of the segmented mouth should be 30% of the vertical height from the top of the bounding box, while, since Hsu's method generally finds the lower lip, the centre pixel is placed 44% from the top. In both methods the centre pixel is centered in the horizontal direction.
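Putting these proportions together, a small helper (entirely our own construction from the percentages above) would be:

```python
import math

def roi_bounding_box(face_w, face_h, centre_x, centre_y, method="zhang"):
    """Bounding box covering 26% of the face-region pixels, with height
    equal to 80% of the width; the mouth-centre pixel sits 30% (Zhang) or
    44% (Hsu) of the box height below the top edge."""
    area = 0.26 * face_w * face_h
    width = int(math.sqrt(area / 0.8))   # since height = 0.8 * width
    height = int(0.8 * width)
    top_offset = 0.30 if method == "zhang" else 0.44
    x0 = int(centre_x - width / 2)       # horizontally centred
    y0 = int(centre_y - top_offset * height)
    return x0, y0, width, height
```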

Fig. 1: Four types of ROI.

To qualitatively assess performance and hence compare the two methods, a returned ROI was categorised as one of four types. Figure 1 provides an example of each category. Full ROI refers to an ROI that encloses a centered mouth and chin. A partial ROI is where the chin is not included in the ROI but the mouth is centered, or where the mouth is shifted slightly to one side resulting in some minor cropping. Poor ROI refers to cases where the mouth region was returned but one lip might not be fully included. If some other region is selected other than the mouth this is classified as a failure of method.

Fig. 2: A working example of Zhang's method (panels: RGB image; hue image; thresholded, dilated and masked hue; ADI; thresholded, dilated and masked ADI; AND; result).

Fig. 3: A working example of Hsu's method.

V Results

Figure 2 shows the outlined steps in Zhang's method for one frame. This frame is typical of successful ROI detection. Both the ADI and hue image show a significant outline of the mouth region. It can also be seen that this person's shirt has a large proportion of red hue. Applying the Viola Jones mask (zeroing all pixels outside the circle) and thresholding results in a well defined mouth and some blotches around the right eye and nose. The output of the AND stage is a well defined bounding box enclosing the mouth and chin.

Figure 3 shows the steps outlined for Hsu's method using the same frame as Figure 2. Again this is typical of successful segmentation of ROI. The $C_r^2$ image in Figure 3 shows a significantly large (dark) region around the lips, and also for some of the blotches on the subject's face. The $C_r/C_b$ image shows much less of a response for these regions of high red chrominance. Hence the difference between the two should be greatest for these regions. The resultant image has a large mouth region, with some other smaller blotches also included. By selecting the largest region the mouth region is selected and the result is a full ROI.

Table 1 summarises the results obtained from both Hsu's and Zhang's methods. Both methods have comparable results with high success rates. Hsu's method has a slightly higher success rate than Zhang's (at 72.2% and 69.4% respectively) and its principal advantage over Zhang's method is that it does not require multiple frames to find the mouth region. The ADI does however provide useful information: for situations where there are multiple people in frame it can indicate which individual is speaking. Both had low failure of method rates (3.5% and 5.6% respectively). Even though the overall failure rates (frames not achieving a full ROI) are quite significant, none of the methods failed consistently for any one individual. Also, the majority of the frames that returned a partial ROI returned a full mouth region but did not enclose the chin. In most cases this was due to the position of the subject relative to the camera. With some of the male subjects, the video sequences have been shot with the top of their heads not in view (see Figure 4(a)). The result is that the Viola Jones face detector can return a smaller enclosing region for those subjects and the bounding box made relative to this region is smaller than normal. This problem occurred for both methods.

                        Full ROI   Partial ROI   Poor ROI   Failure of Method
Zhang's Method
  Frames                   100          31            5              8
  Percentage              69.4%       21.5%         3.5%           5.6%
Hsu's Method
  Frames                   104          33            2              5
  Percentage              72.2%       22.9%         1.4%           3.5%
Red Chrominance ADI
  Frames                   115          25            2              2
  Percentage              79.9%       17.4%         1.4%           1.4%

Table 1: Results of Methods

Figures 4(b) and 4(c) show an example of one of the failure of method cases for Zhang's algorithm. Here the eyes are the most prominent features of the ADI. Figures 4(d) and 4(e) show another failure example of Zhang's method. The original image is shown in 4(d) and as can be seen in 4(e) the ADI selects the glasses as showing the most movement. The metallic glasses cause large fluctuations in luminance when they move.

Fig. 4: Some examples of where the methods are unsuccessful: (a) ROI is too small due to a cropped head (all methods); (b) large movement of the eyes compared to the mouth for Zhang's ADI; (c) shows that the eye is selected for (b); (d) and (e) ADI is not within the mouth region for Zhang's method, instead the glasses are selected; (f) Hsu's mouth map, large chrominance region selected on cheek (g); and in (h) and (i) the eye is selected from the chrominance ADI.

Hsu's method fails to find the mouth region when other skin regions yield high amounts of red chrominance, most notably the cheek region. As can be seen from Table 1, this only occurred in 3.5% of cases and did not occur consistently for any one individual. However, this effect is more prominent in individuals with pale skin as typified by the subject shown in figures 4(f) and 4(g).

The poor ROI is typified in figure 1. This is the worst case example; all other frames select much more of the full mouth region. This frame is also notable in that it fails for all methods. The subject's head is tilted backwards, which causes the problem. A much larger bounding box is required in this case than is normally used.

Many of the failings of Zhang's method were caused by varying luminance effects, especially where metallic objects adorned the face. By transforming the images used in the ADI the luminance issues can be removed from its result. Therefore an additional experiment was conducted to investigate the use of red chrominance for an ADI. This involved using the calculation as described in equation 2 with the red components $R$ being replaced with the red-chrominance components $C_r$ of the corresponding frames. A threshold was then applied to remove all pixels with intensity less than 0.65. The largest connected region was selected as the mouth and its centre pixel was positioned 35% (of the bounding box vertical height) from the top. The results for this are included in Table 1.
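A sketch of this variant, reusing the ADI of equation (2) with $C_r$ in place of $R$; the normalisation to [0, 1] before thresholding is our assumption:

```python
import cv2
import numpy as np

def chrominance_adi_mask(frames_bgr, k):
    """Red-chrominance ADI: equation (2) with R replaced by Cr, followed
    by a 0.65 threshold on the normalised result."""
    def cr(frame):
        return cv2.cvtColor(frame,
                            cv2.COLOR_BGR2YCrCb)[:, :, 1].astype(np.float32)

    adi = np.zeros(cr(frames_bgr[k]).shape, dtype=np.float32)
    for q in range(k - 98, k + 1):
        adi += np.abs(cr(frames_bgr[q]) - cr(frames_bgr[q - 1]))
    adi /= adi.max()                      # normalise to [0, 1]
    return (adi > 0.65).astype(np.uint8)  # keep high-motion pixels
```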

The chrominance ADI method outperforms both existing methods. As with Hsu's and Zhang's methods, one of the largest contributions to a partial ROI occurs in cases where the mouth region does not enclose the chin due to the whole head not being visible within the image. The two occasions in which the method fails are when the subject's eye is selected instead of the mouth, in cases where eye movement is significantly large when compared to the speaking rate. Figures 4(h) and 4(i) show an example of one of these occurrences.

The greatest advantage of using Hsu's algorithm is that it provides an indication of the mouth region from just one frame. Both the ADIs have been implemented using 100 frames so cannot be used to provide an initial estimate at the beginning of a sequence. However, once a suitable time period has elapsed the chrominance ADI method provides a more accurate estimation of the mouth region. This holds only if the subject has remained relatively stationary for this period, though it has been shown here that ADIs can provide accurate results even with the natural head movement in the CUAVE database. Also, once this time period has elapsed, the computation is just a single frame subtraction operation, making it more efficient than Hsu's algorithm. Given that Zhang's method requires a thresholded hue image and an ANDing operation to define the ROI, it is computationally less efficient with no performance gain.

VI Conclusion

This paper presented a comparative study of two colour based segmentation techniques. The algorithms of Zhang [3] and Hsu [4] achieved similar ROI extraction at 69% and 72% respectively. A new method based on a red chrominance ADI was shown to outperform both existing methods with 79% successful ROI extraction. Figure 5 shows a variety of successful ROI extractions from the three methods. The majority of errors in all methods were related to the choice of ROI bounding box based on the facial region returned by the Viola Jones algorithm. None of the methods failed consistently for any one individual. Hence by employing a Viterbi tracking algorithm [13] it would be possible to prevent isolated errors causing an AVSR system to lose track of the ROI.

Fig. 5: Examples of successful ROI extraction from all reported methods.

A system implementation that utilises Hsu's algorithm to initially find the mouth region and then, after a suitable period of minimal movement, switches to the chrominance ADI during relatively stationary periods warrants further investigation. Through the analysis of motion vectors it would be possible to detect significant movement within the sequence and then revert to Hsu's algorithm. By performing the ADI an indication of the person speaking in group situations is also given.

The results have been tested on 144 frames in CUAVE over a range of conditions. With the proposed improvements outlined, it would be worthwhile verifying performance on a larger dataset such as XM2VTSDB.

Acknowledgements

The authors would like to thank the Irish Research Council for Science, Engineering and Technology for funding this research.

References

[1] D. O'Shaughnessy. Interacting with computers by voice: automatic speech recognition and synthesis. Proceedings of the IEEE, 91(9):1272–1305, 2003.

[2] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior. Recent Advances in the Automatic Recognition of Audio-Visual Speech. Proceedings of the IEEE, 91(9):1306–1326, 2003.

[3] X. Zhang, C. Broun, R. Mersereau, and M. Clements. Automatic speechreading with applications to human-computer interfaces. EURASIP Journal on Applied Signal Processing, 1:1228–1247, 2002.

[4] R.L. Hsu, M. Abdel-Mottaleb, and A.K. Jain. Face detection in color images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):696–706, 2002.

[5] P. Daubias. Is colour information really useful for lipreading? In INTERSPEECH-2005, pages 1193–1196, 2005.

[6] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. A new audio-visual database for multimodal human-computer interface research. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing - ICASSP, 2002.

[7] N. Eveno, A. Caplier, and P.Y. Coulon. New color transformation for lips segmentation. Multimedia Signal Processing, 2001 IEEE Fourth Workshop on, pages 3–8, 2001.

[8] T. Coianiz, L. Torresani, and B. Caprile. 2D deformable models for visual speech analysis. In Speechreading by Humans and Machines, NATO ASI Series, Springer, vol. 150:391–398, 1996.

[9] M. Lievin and F. Luthon. Lip features automatic extraction. Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, 3:168–172, 1998.

[10] N. Otsu. A threshold selection method from gray level histograms. Systems, Man, and Cybernetics, IEEE Transactions on, 9(1):62–66, 1979.

[11] Open Source Computer Vision Library (OpenCV). http://sourceforge.net/projects/opencvlibrary/.

[12] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the International Conference on Image Processing (ICIP), 1:511–518, 2001.

[13] F. Pitie, S-A. Berrani, R. Dahyot, and A. Kokaram. Off-line multiple object tracking using candidate selection and the Viterbi algorithm. IEEE International Conference on Image Processing (ICIP'05), 3:109–112, 2005.