
Gesture-Based Control of Augmented Webcam Experience

Ashwin Bhat (abhat4)
Johns Hopkins University
abhat4@jhu.edu

Eric Chiang (echiang3)
Johns Hopkins University
echiang3@jhu.edu

I. INTRODUCTION

As there is an increased focus on human-computer interaction, great emphasis has been placed on creating new, user-friendly ways of allowing humans to control computers [1]. In this project, we developed an augmented webcam experience that could be intuitively controlled by users. A large part of creating this environment was developing a robust method of finger tracking to allow users to easily choose different filters to apply to the webcam video, giving them a sense of more futuristic interaction than a simple touchpad or computer mouse. In addition, this setup required no extra hardware besides a webcam, which nowadays comes built into most laptops. The full implementation of this augmented webcam experience was done using eye tracking and finger tracking.

A. Related Works

Previous methods of hand tracking that were very advanced or accurate often required extra hardware or were computationally expensive, making it difficult to do finger tracking on real-time video without significant latency. The Kinect sensor is often used for hand tracking because its infrared camera provides depth images. Some studies have used it to recognize hands and classify gestures [2]. One of the most advanced articulated hand tracking implementations using the Kinect sensor was done by Microsoft Research. This setup works very well and allows for fully articulated hand tracking via depth images in real time [3]. However, it is limited in real-world use because it requires an infrared camera, which most people do not have. The Kinect sensor, while somewhat portable, is also not as seamless an experience as using a laptop-integrated webcam or even a USB-interfaced webcam. Since that seamlessness was our goal in making our augmented webcam experience, we steered clear of any camera device besides a simple webcam and made an environment where users could have their fingers tracked and use them to control the output of the webcam (change filters, remove filters, etc.).

B. Our Approach

Our approach, in broad strokes, is outlined below. The details are discussed in the following sections.

1) Detect and track eyes in the video feed using a combination of a cascading object detector and extraction of salient features around the eyes to track them if moved.

2) Track fingers in the video feed using thresholding and morphological operators.

3) Apply facial masks/overlays.

II. EYE TRACKING

We describe here the method used for tracking the eyes in the video feed. We used an algorithm known for its robustness in face tracking that has also been applied to eye tracking. However, while it is fairly robust, the computational cost can become too expensive when running on every frame of a live video while also applying other image processing. This led to occasional lags or freezes in the output video. To reduce the cost of eye tracking, we implemented a less expensive feature tracking algorithm to detect specific points in the region where eyes were detected. Later on, we decided upon using a combination of the two algorithms. This is discussed in greater detail below.

A. Viola-Jones Algorithm

We first utilized the Viola-Jones algorithm, which involves using Haar features and representing the image with an integral image.

For learning, the Viola-Jones algorithm uses a variant of AdaBoost training and then utilizes cascading classifiers. The use of cascading classifiers is highly important, as it allows simple classifiers to quickly reject negative image regions, with more complex classifiers added to reduce the false-positive rate to an almost insignificant percentage [4].

The Viola-Jones algorithm is often used for face detection, but it can easily be used for eye detection as well. When used for eye detection, it is still highly accurate with a low false-positive rate. In all of our observations, the only false positive was an instance in which the eyebrows were detected as a second pair of eyes right above the person's actual eyes.
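For concreteness, a minimal sketch of this detection step is shown below. This report does not prescribe a particular library, so the sketch assumes Python with OpenCV, whose bundled pretrained Haar cascade (haarcascade_eye.xml) implements Viola-Jones eye detection; it is illustrative only.

import cv2

# OpenCV ships a pretrained Viola-Jones (Haar cascade) eye detector.
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_eyes(frame):
    """Return a list of (x, y, w, h) boxes around detected eyes."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale slides the cascade over an image pyramid;
    # minNeighbors suppresses weak overlapping hits (e.g., eyebrows).
    return eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    for (x, y, w, h) in detect_eyes(frame):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()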

Fig. 1: Eye Detection/Tracking with Viola-Jones Algorithm at Multiple Instances

However, while the Viola-Jones algorithm is rather robust and has a very low false-positive percentage, it becomes computationally expensive when run on every single frame of real-time video. This is even more of an issue when trying to detect the eyes while also tracking the fingers, as described later on.

B. Kanade-Lucas-Tomasi Feature Tracking

In order to reduce the computational cost, we decided to combine the Kanade-Lucas-Tomasi (KLT) feature tracking algorithm with our existing Viola-Jones eye tracker. The initial difference was that we would run the eye tracker once, at the very beginning of the live video feed, instead of on every single frame.

Fig. 2: Eye Tracking with KLT Algorithm

After that, we used the KLT algorithm for feature extraction to find the important points inside the bounding box where our Viola-Jones eye tracker found eyes in the video feed. The points were determined by finding corners using the minimum eigenvalue algorithm. With the salient features now known, those points were tracked, and the geometric transformation as the points moved from frame to frame was estimated [5]. This let us keep a bounding box around that same set of points as they moved in the image.
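As a rough sketch of this step (again assuming OpenCV rather than any particular toolkit), cv2.goodFeaturesToTrack with its default minimum-eigenvalue (Shi-Tomasi) criterion can stand in for the corner detector, cv2.calcOpticalFlowPyrLK for the KLT point tracker, and cv2.estimateAffinePartial2D for the frame-to-frame geometric transform:

import cv2
import numpy as np

def init_features(gray, bbox):
    """Find minimum-eigenvalue (Shi-Tomasi) corners inside the eye box."""
    x, y, w, h = bbox
    mask = np.zeros_like(gray)
    mask[y:y + h, x:x + w] = 255
    # useHarrisDetector=False selects the minimum-eigenvalue criterion.
    return cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01,
                                   minDistance=5, mask=mask,
                                   useHarrisDetector=False)

def track_step(prev_gray, gray, prev_pts, bbox):
    """Track points with pyramidal Lucas-Kanade and move the box along."""
    pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    good_old = prev_pts[status.ravel() == 1]
    good_new = pts[status.ravel() == 1]
    # Estimate the geometric transform the surviving points underwent.
    tf, _ = cv2.estimateAffinePartial2D(good_old, good_new)
    if tf is None:                      # too few points left; keep old box
        return good_new.reshape(-1, 1, 2), bbox
    x, y, w, h = bbox
    corners = np.float32([[x, y], [x + w, y + h]]).reshape(-1, 1, 2)
    moved = cv2.transform(corners, tf).reshape(-1, 2)
    new_bbox = (int(moved[0, 0]), int(moved[0, 1]),
                int(moved[1, 0] - moved[0, 0]),
                int(moved[1, 1] - moved[0, 1]))
    return good_new.reshape(-1, 1, 2), new_bbox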

The bounding box was important to have because it was used in the calculations for face masking and face filters. One of the main benefits of using the KLT algorithm, aside from the increased efficiency, was that the points were a bit more resistant to the changes in angle and rotation that we found to be shortcomings of eye tracking with the Viola-Jones algorithm.

This method of a single instance of eye tracking followed by the KLT algorithm to keep track of the eyes worked fairly well, but it did have a few problems. The primary one was that if part of the eyes moved off screen momentarily, the points that were being used to track the eyes were lost. If the eyes moved back on screen, those points would not be detected again by the basic KLT algorithm.

Another issue was that after many frames and movements, some points would begin to be considered outliers based on the calculations of the KLT implementation, and then the eyes were no longer detected in their entirety. Therefore, despite the increase in speed of the eye tracking and the decreased computational cost, there was a fundamental issue with this eye tracking method due to its decreased accuracy.

C. Combined Approach

In order to compensate for this issue, we devised a method that compromised between speed/computational cost and accuracy. In this method we used the Viola-Jones eye tracker more often than just at the start of the video feed, and we were able to have the eye tracking and KLT feature tracking work together to achieve better speed than the Viola-Jones algorithm alone and better accuracy than when the KLT algorithm alone was doing the frame-to-frame tracking.

After initially finding the eyes and then using KLT to track the salient points in the eye region, the eye detector would once again be run on a single frame of the video. Based on some trial and error, we found every 90 frames to be a good frequency for running the eye detector on the video. Once the eyes were detected, the KLT algorithm was run to find salient features in the newly found eye region, and those new points were used for the next 90 frames before the eye detector was utilized again.

Naturally, this does slightly increase the computational cost, since the Viola-Jones eye detector has to be run more than the single time it was used when allowing the KLT tracker to do the brunt of the work. However, it was still much more lightweight than running the Viola-Jones algorithm on every single frame as we initially tried. That method used the eye detector 90 times in three seconds, whereas this combined approach only used the eye detector once in three seconds.

Algorithm 1: Eye Tracking - Combined Method
Input: V - a live video stream
Output: BB - a bounding box containing the eyes (changes every frame)

Initialize video stream V
Take a single frame of video: I ← V(current)
Detect the eyes and initialize a bounding box: BB ← detectEyes(I)
Initialize the list of salient features (corners): f ← corners(BB → I)
Initialize the list of previous points: prev ← f
Initialize a forward transformation tf
while V continues do
    if frameCount mod 90 == 0 then
        I ← V(current)
        BB ← detectEyes(I)
        if eyes are detected then
            f ← corners(BB → I)
            prev ← f
        end
    end
    I ← V(current)
    f ← findPoints(I, prev)
    tf ← findTransform(prev, f)
    // Move the bounding box to the new eye region
    BB ← forwardTF(BB, tf)
    prev ← f
    return BB
end
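A rough Python rendering of this loop is sketched below, reusing the hypothetical detect_eyes, init_features, and track_step helpers from the earlier sketches; Algorithm 1 specifies the structure, not this exact code.

import cv2

REDETECT_EVERY = 90  # frames between Viola-Jones re-detections

cap = cv2.VideoCapture(0)
prev_gray, pts, bbox = None, None, None
frame_count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if frame_count % REDETECT_EVERY == 0:
        eyes = detect_eyes(frame)
        if len(eyes) > 0:
            bbox = tuple(eyes[0])            # re-seed from the detector
            pts = init_features(gray, bbox)
    elif pts is not None and len(pts) > 0:
        pts, bbox = track_step(prev_gray, gray, pts, bbox)
    prev_gray = gray
    frame_count += 1
    # bbox now holds this frame's eye bounding box (None if never detected)
cap.release()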

Another benefit of this combined method was that it improved the accuracy of our usage of the Viola-Jones algorithm. The Viola-Jones eye detection/tracking returned false negatives when a person's eyes in the video frame were rotated or angled away from the camera. However, with this combined method, if such an issue were to happen when trying to re-detect the eyes, the previously found salient features from the KLT algorithm would continue to be used. Thus, if eyes were detected at some point previously, the KLT algorithm still kept track of them and reduced the rate of false negatives. Therefore, this combined approach improved the robustness of the eye tracking.


We also implemented facial overlays as part of the augmented webcam experience and as a fun feature. Depending on the number of fingers found/tracked as described in the section below, a different face filter was placed into the video feed. The position was determined based on the bounding box continually returned by the eye tracking algorithm, with some shifts depending on the type of filter. For example, a beard filter would be placed below the eye bounding box, while a sunglasses filter could go near or in the bounding box.

III. FINGER TRACKING

We describe here our approach to counting the fingers shown on a hand. At a high level, we first thresholded the hand to isolate it. We then used morphological operators to isolate the palm of the hand and subtracted the result from the original thresholded binary image. The resulting binary image contained only the thresholded fingers, which were then counted. The subsequent sections explain each aspect of our method in depth.

Algorithm 2: Finger Tracking
Input: V - a live video stream
Output: F - tracked fingers

Initialize hand image H
Initialize palm image P
Initialize finger image fing
Initialize tracked fingers F
while V continues do
    H ← thresholdHand(V)
    P ← isolatePalm(H)
    fing ← H − P
    F ← findFingers(fing)
    return F
end

A. Thresholding Technique and Preprocessing

First, using the bounding box from the eye tracking method described above, we assumed that the area directly below the bounding box was the subject's body, so we immediately masked that out in order to decrease the complexity of cleanly thresholding the hand.

Fig. 3: (a) Image captured from USB-interfaced webcam, (b) Thresholded image, (c) Palm isolation, (d) Finger isolation

We experimented with a couple of different thresholding techniques to isolate the hand.

We first attempted to utilize a skin-tone-based thresholding technique described by Jusoh et al., which involved thresholding the hue component of an image to get candidate areas of skin in the image [6]. These candidate areas were then used to determine areas in which to do RGB thresholding. However, in our testing we found this method had a couple of limitations in our application. First, with the steps involved in this thresholding technique, we found that there was a noticeable delay in processing images; in some cases, there was noticeable hanging while performing real-time thresholding. Second, there were issues with detecting a range of skin colors. The technique worked best with peach-colored skin in well-lit conditions; without those conditions, the thresholder would not function accurately.

We decided to use a traditional intensity-based thresholding technique on grayscale images. With stable lighting and a mostly homogeneous background, we were able to threshold hands found in the images quickly and consistently, as seen in Figure 3a.

After the hand was thresholded based on intensity, we applied an algorithm that filled holes in the binary image (Figure 3b) in order to prepare it for the next steps.
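A minimal sketch of this preprocessing, assuming an OpenCV/SciPy implementation: the threshold shown (Otsu's automatic method) and the crude body mask are placeholders rather than our exact settings.

import cv2
import numpy as np
from scipy.ndimage import binary_fill_holes

def threshold_hand(frame, eye_bbox):
    """Intensity-threshold the frame and fill holes in the binary result."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    x, y, w, h = eye_bbox
    gray[y + h:, x:x + w] = 0        # mask the body directly below the eyes
    # Otsu picks a global intensity threshold automatically (illustrative;
    # the exact threshold/polarity in our experiments may differ).
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    filled = binary_fill_holes(binary > 0)       # fill holes (cf. Fig. 3b)
    return filled.astype(np.uint8) * 255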


B. Binary Morphological Operators

Once the hand was isolated with thresholding, we used morphological operators to determine the finger count the hand displayed.

1) Palm Isolation: The first step was to isolate the "palm" of the hand. In order to do this, we first utilized binary erosion with a large circular structuring element. This allowed us to effectively erode any fingers extending from the palm. We then used binary dilation on the palm in order to smooth its edges, increase its area slightly, and ensure that it acted as a proper mask of the palm for the next step, as shown in Figure 3c.

2) Finger Isolation: We subtracted the "palm mask" from the original image to get a rough binary image of the fingers and applied one more binary erosion with a small circular structuring element to eliminate any remaining artifacts and noise in the binary image. Lastly, we thresholded the remaining binary elements by area in order to isolate the final positions and count of the fingers, as shown in Figure 3d.
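These two steps might look like the following sketch; the structuring-element sizes and the minimum finger area are illustrative placeholders rather than our exact settings.

import cv2

def count_fingers(hand_binary, min_area=200):
    """Erode away the fingers to get the palm, subtract, then count blobs."""
    big = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (61, 61))
    small = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    palm = cv2.erode(hand_binary, big)     # large erosion removes fingers
    palm = cv2.dilate(palm, big)           # dilate back into a palm mask
    fingers = cv2.subtract(hand_binary, palm)
    fingers = cv2.erode(fingers, small)    # clean residual artifacts/noise
    # Keep only connected components large enough to be fingers.
    n, _, stats, _ = cv2.connectedComponentsWithStats(fingers)
    return sum(1 for i in range(1, n)
               if stats[i, cv2.CC_STAT_AREA] >= min_area)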

IV. MASKS AND IMPLEMENTATION

With algorithms for eye tracking and finger tracking in place, we could properly implement our gesture-based masks.

Masks were overlaid on the frames of video read in. The masks were scaled proportionally to the size of the bounding boxes drawn around the eyes. This ensured the masks scaled with the user properly. We then applied an offset, also proportional to the bounding box size, to ensure the mask stayed in roughly the same position as the user moved.
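A small sketch of this overlay logic, assuming a mask image with an alpha channel; the scale and offset factors below are placeholders, chosen only to show the proportionality.

import cv2

def place_mask(frame, mask_rgba, eye_bbox, scale=2.5, dy=1.2):
    """Scale a transparent mask to the eye box and alpha-blend it in place.
    Frame-boundary clipping is omitted for brevity."""
    x, y, w, h = eye_bbox
    mw = int(w * scale)                    # width proportional to box width
    mh = int(mw * mask_rgba.shape[0] / mask_rgba.shape[1])   # keep aspect
    resized = cv2.resize(mask_rgba, (mw, mh))
    top = y + int(h * dy)                  # offset proportional to box height
    left = x + (w - mw) // 2               # center horizontally on the eyes
    roi = frame[top:top + mh, left:left + mw]
    alpha = resized[:, :, 3:4] / 255.0
    roi[:] = (alpha * resized[:, :, :3] + (1 - alpha) * roi).astype(roi.dtype)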

We combined both algorithms into a script that read in frames from a USB-interfaced webcam and applied the mask in real time using our eye tracking method. Users were able to put up a number of fingers, from zero to five, in order to switch masks as shown in Figure 4.

One limitation we came across, as described earlier, was significant slowdowns due to the computationally expensive behavior of the Viola-Jones algorithm. At times, the webcam output would skip frames, and it would sometimes even completely freeze. Our solution to this issue was to use the KLT algorithm in conjunction with the Viola-Jones algorithm. This significantly increased the speed of our operation, as described below.

Fig. 4: Output from our implementation. (a) No fingers hid all masks, (b-c) showing different fingers changed the mask being used

V. EXPERIMENTAL RESULTS

While many of our results are qualitative, in that we observed the output video response to ensure that fingers and eyes were being correctly tracked, there were some attributes that we looked at in determining the speed and accuracy of our system.

The primary accuracy we looked at was the accuracy of eye tracking, because over the course of our experimentation we implemented three variations of eye tracking. In our observation, the Viola-Jones algorithm, when not running alongside the finger tracking, ran rather quickly and detected eyes when they came into the image frame. One of the main problems with its accuracy was that too much tilting or rotating of the subject's head would result in the cascading object detector no longer detecting the person's eyes, even though they were visible in the image.

The implementation where we used the KLT algorithm to do continuous eye tracking, based on feature points found in the eye region by running the cascading object detector on the video a single time, was also flawed. The problem here was that since feature extraction was based on those points from the eye region found the very first time, after a lot of motion and movement of the head some of those points were lost and would not get re-detected. While the eye detection/tracking with this method was slightly more resistant to rotation than the cascading object detector alone, losing the eyes over time was not useful. Furthermore, the lack of re-detection prevented our design from being robust, because a person moving half their head out of the image would only have one eye still tracked even after their entire head moved back into the image.

The final combined method worked best in that regard, because we catered it to meet our design requirements. With the cascading object detector being re-run every 90 frames, the eyes would always be re-detected if lost momentarily, and the important features extracted by the KLT algorithm were updated whenever the eyes were re-detected, so if any salient points had somehow been lost, that was rectified at a fixed interval.

The primary factor in determining the speed of our system was the frame rate of the webcam video after it was processed and output with tracked eyes and fingers. As mentioned previously, using only a cascading object detector for eye detection on every single frame resulted in frame-rate drops. Table I quantifies the average frame rate for the different versions of eye tracking combined with finger tracking. Each of these frame rates is the average of 20 samples. Each sample's frame rate was the average observed over 3000 frames of video with roughly the same movement of the test subject's head and hand.

TABLE I: Frame rates (in fps) in testing with eye tracking and finger tracking (with differing eye tracking methods)

    Cascading Object Detector | Viola-Jones + KLT
              12 fps          |      23 fps

As is evident, the combined Viola-Jones and KLT approach has a higher frame rate and was subject to less frame-rate dropping. This was a fast enough frame rate to have good-quality video output. The Viola-Jones eye detector alone is fairly quick as well, as its frame rates would often approach 24 or 25 fps. However, since our end goal was to do finger tracking and eye tracking together in real time to create our augmented experience, the 12 fps frame rate when tracking both was too low to produce good video output. The combined approach ran at a solid 23 fps when doing both eye tracking and finger tracking, so it was clearly superior.
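Measurements of this kind can be scripted in a few lines; the sketch below mirrors the setup described above (average fps over 3000 frames), with process_frame standing in for the combined eye and finger tracking pipeline.

import time

def average_fps(cap, process_frame, n_frames=3000):
    """Average frame rate while running the full pipeline on n_frames."""
    start = time.perf_counter()
    done = 0
    while done < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        process_frame(frame)     # eye tracking + finger tracking + masks
        done += 1
    return done / (time.perf_counter() - start)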

VI. CONCLUSION

In our project, we were able to develop an augmented webcam experience that was fairly robust for finger tracking and eye tracking. Due to our combination of the Viola-Jones algorithm with the KLT algorithm and our computationally inexpensive method of finger tracking and gesture recognition, we developed a system that works in real time without noticeable frame-rate dropping or latency. Our next steps would be to expand this environment beyond our novelty application of an augmented webcam experience. Due to its speed and lightweight computational cost, we believe it could be applied to further intuitive control of systems beyond just the webcam. Using the webcam, we could use our finger tracking as an interface for controlling the entire computer. For example, because we can locate the fingers in the image, we could use specific fingers as cursors or even use the webcam as a keyboard input based on where the user moves their fingers in the perspective of the video feed. Movement of a finger in specific directions in the webcam view could also be used to scroll in windows or switch windows, meaning the user could control what they see on their screen simply by waving their fingers around. These are just a few possible situations in which we could apply our system, and our next step would be exploring their implementation.

VII. ACKNOWLEDGEMENTS

We thank the TAs and CAs for helping us learn and understand the material. We'd especially like to thank Dr. Austin Reiter for his guidance in the class and on this project.

REFERENCES

[1] Rautaray, Siddharth S., and Anupam Agrawal. "Vision based hand gesture recognition for human computer interaction: a survey." Artificial Intelligence Review, 43.1 (2015): 1-54.

[2] Sharp, Toby, et al. "Accurate, robust, and flexible real-time hand tracking." Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, ACM, 2015.

[3] Li, Yi. "Hand gesture recognition using Kinect." Software Engineering and Service Science (ICSESS), 2012 IEEE 3rd International Conference on. IEEE, 2012.

[4] Viola, Paul, and Michael J. Jones. "Robust real-time face detection." International Journal of Computer Vision, 57.2 (2004): 137-154.

[5] Bourel, Fabrice, Claude C. Chibelushi, and Adrian A. Low. "Robust Facial Feature Tracking." BMVC, 2000.

[6] Jusoh, Rizal Mat, et al. "Skin detection based on thresholding in RGB and hue component." Industrial Electronics and Applications (ISIEA), 2010 IEEE Symposium on. IEEE, 2010.
