

Modeling and prediction of driver behavior by foot gesture analysis

Cuong Tran*, Anup Doshi, Mohan Manubhai Trivedi
Laboratory for Intelligent and Safe Automobiles (LISA), University of California, San Diego, CA 92093, USA

Article history: Available online xxxx

Keywords: Gesture analysis; Hidden Markov Models; Optical flow; Intelligent Driver Assistance; Active safety; Accident prevention; Pedal misapplications; Foot tracking; Intelligent vehicle


* Corresponding author. E-mail address: [email protected] (C. Tran).


Abstract

Understanding driver behavior is an essential component in human-centric Intelligent Driver Assistance Systems. Specifically, driver foot behavior is an important factor in controlling the vehicle, though there have been very few research studies on analyzing foot behavior. While embedded pedal sensors may reveal some information about driver foot behavior, vision-based foot behavior analysis has additional advantages. The foot movement before and after a pedal press can provide valuable information for better semantic understanding of driver behaviors, states, and styles. It can also be used to gain a time advantage in predicting a pedal press before it actually happens, which is very important for providing proper assistance to the driver in time-critical (e.g. safety related) situations. In this paper, we propose and develop a new vision-based framework for driver foot behavior analysis using optical flow based foot tracking and a Hidden Markov Model (HMM) based technique to characterize the temporal foot behavior. In our experiment with a real-world driving testbed, we also use our trained HMM foot behavior model for prediction of brake and acceleration pedal presses. The experimental results over different subjects provided high accuracy (~94% on average) for both foot behavior state inference and pedal press prediction. By 133 ms before the actual press, ~74% of the pedal presses were predicted correctly. This shows the promise of applying this approach for real-world driver assistance systems.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Human behavior analysis from vision input is a challenging but attractive research area with many promising applications, such as image and scene understanding, advanced human-computer interaction, intelligent environments, driver assistance systems, video surveillance, and video indexing and retrieval. In general, human behavior can be analyzed at several levels of resolution, such as the full body [15,22], upper body [3,4,33,34], lower body [2], hands [5,16], and head [9,14,24,37]. There are trade-offs between achieving detailed information about human behavior at different levels and the efficiency as well as robustness of the algorithm. Therefore research studies typically focus on behavior analysis at one level depending on the application. In this paper, we focus on driver foot behavior analysis with applications to Intelligent Driver Assistance. It should be mentioned that an effective driver assistance system needs to be human-centric, and take into account information about all three main components (i.e. environment, vehicle, and driver) interacting in a holistic manner [35,36]. Among those, driver foot behavior is an important source of information that has a strong impact on vehicle control.

One problem of recent interest to the automotive safety community is that of "pedal misapplication", in which the driver accidentally presses the wrong pedal, as seen in Fig. 1. Several recent unintended acceleration-related accidents in the US could have been a result of this pedal misapplication phenomenon [12]. Incidents related to pedal misapplication have been observed for many years [28], and the investigation into Toyota's recent "sudden unintended accelerations" [30] has led to a renewed interest in avoiding such incidents. We propose that understanding the driver foot behavior could help to predict and mitigate this kind of problem.

To our knowledge, there are very few research studies on foot gesture and behavior analysis; for example, Choi and Ricci [6] developed a foot-mounted device which can recognize walking gestures. In the domain of driver assistance, some studies related to analyzing driver foot behavior have been published. Park and Sheridan used pressure-based sensors in a driving simulator to show that driver leg motion can help to improve the performance of an Antilock Brake System (ABS) [26]. Tanaka et al. [31] analyzed a mechanical model of the driver foot and pedal, which has potential for improving pedal design and layout. McCall and Trivedi [21] developed a brake assistance system, which took into account both the driver's intent to brake (from pedal positions and camera-based foot movement) and the need to brake given the current situation, in order to determine at what level the driver should be warned. Also, in an effort to reduce rear-end collisions, which account for a large portion of traffic accidents [25], Mulder et al. have introduced a haptic gas pedal feedback system for car-following [1,23].


Fig. 1. Sample scenario depicting a motivation for foot gesture analysis. In the following research we show that by using vision-based modeling and prediction of foot gestures, we are able to predict instances of "pedal misapplication" at about 200 ms prior to the actual pedal press. This time could provide a critical advantage for an Advanced Driver Assistance System (ADAS) in reducing the severity of a potential collision.

Fig. 2. Vision-based framework for driver foot behavior analysis.


In addition to the information of lead-vehicle separation, they showed that the performance can be improved by using a deceleration control algorithm based on the gas pedal position.

In this study, we develop a new vision-based framework for driver foot behavior analysis. Although embedded vehicle sensor parameters from the Controller Area Network (CAN-bus), like brake or acceleration pedal states, tell us something about foot behaviors, the foot movement before and after a pedal press detected from vision-based sensors can provide valuable information for better semantic understanding of driver behaviors, states, and styles. It can also be used to predict a pedal press before it actually happens. This is very important in time-critical (e.g. safety related) situations in which we need time to provide proper assistance to the driver when needed. In the proposed approach, an optical flow based method is used to track foot movement and a Hidden Markov Model (HMM) is trained to characterize the temporal foot behavior. The vehicle parameters from the CAN-bus are also utilized in live estimation as well as in post-processing, as part of an automatic data labeling procedure for validation purposes. This makes the proposed framework easier to adapt to different subjects and situations and thereby improve performance. The resulting system appears to be the first such system developed for vision-based driver foot behavior modeling and prediction.

The remaining sections are organized as follows. Section 2 presents the details of our proposed framework for driver foot behavior analysis. Section 3 describes our real-world driving testbed and data acquisition process, and Section 4 shows experimental results and performance analysis with real-world driving data. Finally, Section 5 provides discussion and concluding remarks.

2. Vision-based framework for driver foot behavior analysis

In this paper, our goal is to develop a computer vision system that takes the input of driver foot video, along with vehicle-based pedal sensor measurements, and gives the output of a set of higher-level semantic descriptions for the driver foot behavior. It is also desirable if these semantic descriptions can be used to predict a pedal press before it actually happens.

Fig. 2 shows the components of our proposed vision-based framework for driver foot behavior analysis. First, using data captured from a camera facing the driver's foot, an optical flow based method is used to track the foot movement. Then we design a Hidden Markov Model (HMM) to learn the temporal foot behavior from the extracted foot movement and vehicle CAN information.


Using the trained HMM, we estimate the current semantic state of the driver foot at each frame as well as use that information to predict a brake or acceleration press before it actually happens. Utilizing reliable information from the vehicle CAN data, we also develop an automatic data labeling procedure so that the learned HMM model can be evaluated in an online manner after each pedal press has occurred. This is an important aspect of the proposed framework, since it makes the framework easier to adapt to different subjects and situations and can therefore potentially improve performance.

2.1. Optical flow based foot tracking

Optical flow is a well-known computer vision technique for motion estimation. It is based on the assumption of a constant brightness profile:

$$I(x, y, t) = I(x + u\,dt,\; y + v\,dt,\; t + dt)$$

where (u, v) are the velocities (optical flow) at pixel location (x, y). If we also assume small motion between frames, we can apply the Taylor expansion and keep only first-order terms:

$$I(x, y, t) = I(x, y, t) + dx\,\frac{\partial I}{\partial x} + dy\,\frac{\partial I}{\partial y} + dt\,\frac{\partial I}{\partial t}$$

$$\Rightarrow\; u\,dt\,\frac{\partial I}{\partial x} + v\,dt\,\frac{\partial I}{\partial y} + dt\,\frac{\partial I}{\partial t} = 0$$

$$\Rightarrow\; u\,\frac{\partial I}{\partial x} + v\,\frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0$$

Optical flow has also been applied in several human motion tracking and human gesture analysis studies. For example, DeCarlo and Metaxas use an optical flow constraint on the motion of a deformable model for face tracking [7]. Holte et al. use a 3D version of optical flow for view-invariant gesture recognition [13]. In this paper, we use the coarse-to-fine Lucas–Kanade algorithm [20] for optical flow detection. The global foot motion is extracted by a majority vote over the detected optical flows and is then combined with a simple linear motion model for foot tracking. This method works quite well for the driver foot video, as shown visually in Fig. 3. The output of this foot tracking step provides the foot position (px, py) and velocity (vx, vy) at each frame.
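To make this tracking step concrete, the sketch below shows one way to implement it in Python with OpenCV, under our own assumptions: pyramidal (coarse-to-fine) Lucas–Kanade flow on sparse corners, the dominant (median) flow above a small magnitude threshold taken as the global foot motion, and a simple linear motion model propagating the position. The class name FootTracker and all parameter values are illustrative, not taken from the paper.

import cv2
import numpy as np

class FootTracker:
    """Sketch of optical flow based foot tracking (illustrative, not the authors' code)."""

    def __init__(self, init_pos, motion_thresh=0.5):
        self.pos = np.array(init_pos, dtype=float)   # foot position (px, py)
        self.vel = np.zeros(2)                       # foot velocity (vx, vy)
        self.motion_thresh = motion_thresh           # minimum flow magnitude (pixels)
        self.prev_gray = None

    def update(self, gray):
        if self.prev_gray is not None:
            # Sparse features in the previous frame (e.g. up to 200 corners).
            pts = cv2.goodFeaturesToTrack(self.prev_gray, maxCorners=200,
                                          qualityLevel=0.01, minDistance=5)
            if pts is not None:
                # Coarse-to-fine (pyramidal) Lucas-Kanade optical flow.
                nxt, status, _ = cv2.calcOpticalFlowPyrLK(
                    self.prev_gray, gray, pts, None,
                    winSize=(21, 21), maxLevel=3)
                flows = (nxt - pts).reshape(-1, 2)[status.ravel() == 1]
                mags = np.linalg.norm(flows, axis=1)
                moving = flows[mags > self.motion_thresh]
                # Dominant (median) flow of the moving features = global foot motion.
                self.vel = np.median(moving, axis=0) if len(moving) else np.zeros(2)
            # Simple linear motion model: propagate position by the estimated velocity.
            self.pos += self.vel
        self.prev_gray = gray
        return self.pos, self.vel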


Fig. 3. Examples of optical flow based foot tracking output for visual evaluation. Red arrows – detected optical flows. Blue arrow – tracking output of foot position and velocity at the current frame. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Illustration of foot trajectories extracted from optical flow based foot tracking in a single run; Green: trajectories towards the acceleration, Red: trajectories towards the brake. The trajectories are time-aligned to the start of the foot motion, and proceed until a pedal is pressed. The differences between braking and acceleration trajectories can be observed over time. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 4 visualizes all the tracked foot trajectories moving forward from a stopped state to a pedal press, over a single drive (128 stop or go trials). Trajectories towards the brake pedal are marked in red and trajectories towards the acceleration pedal are marked in green. Although the end points of the brake and acceleration trajectories are quite separate, the beginnings of those trajectories overlap considerably. Therefore, using information from single points (frames) in this overlapped region to understand driver foot behavior would be ambiguous. In such cases, behavioral models like HMMs that take into account temporal information can help disambiguate these trajectories.


2.2. HMM-based foot behavior model

By observing the driver foot movement, e.g. Fig. 3, we see that the foot motion can be divided into the following semantic states:

1. Neutral (hover off pedal).
2. Moving towards brake pedal.
3. Moving towards acceleration pedal.
4. Engaging brake pedal.
5. Engaging acceleration pedal.
6. Release from brake pedal.
7. Release from acceleration pedal.

Based on this intuitive interpretation, we design a state model for driver foot behavior as shown in Fig. 5. "Clean" pedal press actions would follow the path: Neutral → Move Towards Brake/Accel → Brake/Accel Engaged → Release Brake/Accel → Neutral, and so on. However, there are cases in which, after the Release Brake/Accel state, the foot does not actually come back to the Neutral state but continues its motion and changes into the next Move Towards Brake or Acceleration state.

We choose the HMM-based technique since it can characterize time series data with both spatial and temporal variability. It has been used widely in speech recognition [29] and behavior recognition [27], and recently it has been successfully applied in vision-based gesture recognition, e.g. [16]. To learn the temporal foot behavior, we use a continuous HMM with Gaussian output probability. The elements of our HMM are as follows:

• Hidden states: We have seven states {s1, s2, s3, s4, s5, s6, s7}: Neutral, BrkEngage, AccEngage, TowardsBrk, TowardsAcc, ReleaseBrk, ReleaseAcc. The state at time t is denoted by the random variable q_t.
• Observation: The observation at time t is denoted by the random variable O_t, which has six components O_t = {px, py, vx, vy, B, A}, where {px, py, vx, vy} are the current position and (optical flow) velocity of the driver foot estimated from the optical flow based foot tracking step, and {B, A} are obtained from vehicle CAN information and determine whether the brake and accelerator are currently engaged or not.
• Observation probability distributions: In our HMM model, we assume a Gaussian output probability distribution P(O_t | q_t = s_i) = N(μ_i, σ_i).
• Transition matrix: A = {a_ij} is a 7 × 7 state transition matrix, where a_ij is the probability of making a transition from state s_i to s_j:

$$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$$

• Initial state distribution: We assume a uniform distribution over the initial states.

Fig. 5. Foot behavior HMM state model with seven states. The states are defined as described in Section 2.2.


2.2.1. HMM parameter learning

Given the complete training data (containing both observations and hidden states) obtained from the automatic labeling procedure in Section 2.3, the set of HMM model parameters Λ, including the Gaussian observation probability distributions and the transition matrix, can be learned using the Baum–Welch algorithm. In our implementation, we use the Probabilistic Modeling Toolkit (PMTK) for Matlab,¹ which supports several probabilistic models including HMMs.
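Because the automatic labeling procedure of Section 2.3 yields fully labeled training sequences (both O_t and q_t are known), the Gaussian emissions and the transition matrix can also be estimated directly by counting, as a simpler alternative to iterating Baum–Welch. The sketch below (Python/NumPy, not the authors' PMTK/Matlab code) illustrates that supervised estimation; the function name, state ordering, and covariance regularization are our assumptions.

import numpy as np

N_STATES = 7  # Neutral, BrkEngage, AccEngage, TowardsBrk, TowardsAcc, ReleaseBrk, ReleaseAcc

def fit_hmm_supervised(obs, states, eps=1e-6):
    """Estimate HMM parameters from one labeled run.

    obs    : (T, 6) array of observations O_t = (px, py, vx, vy, B, A)
    states : (T,)   array of labeled states q_t in {0, ..., 6}
    """
    T, D = obs.shape
    means = np.zeros((N_STATES, D))
    covs = np.zeros((N_STATES, D, D))
    trans = np.full((N_STATES, N_STATES), eps)           # small floor avoids empty rows

    for s in range(N_STATES):
        x = obs[states == s]
        means[s] = x.mean(axis=0)
        covs[s] = np.cov(x, rowvar=False) + eps * np.eye(D)   # regularized covariance

    for t in range(T - 1):                                # transition counts
        trans[states[t], states[t + 1]] += 1
    trans /= trans.sum(axis=1, keepdims=True)             # row-normalize

    init = np.full(N_STATES, 1.0 / N_STATES)              # uniform initial distribution
    return means, covs, trans, init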

2.2.2. Foot behavior state estimation

Given the learned HMM model, at time t we use the observations in a time window O_{t-TimeWindow} ... O_t to estimate the most likely current state q_t:

$$s_t = \arg\max_{i=1:7} \; P(q_t = s_i \mid O_{t-\mathrm{TimeWindow}} \ldots O_t; \Lambda) \qquad (1)$$

Using the current framework, however, in certain cases there is some confusion between TowardsBrake and BrakeEngage, as well as between TowardsAccel and AccelEngage. In our currently defined Brk/AccEngage states, the foot movement can take different forms such as no motion, moving forward, and moving backward, which could create some ambiguities in learning the HMM model. Moreover, the current foot camera setup does not cover the whole pedal area. Therefore, when the pedals are pressed a bit hard, part of the foot will be out of the field of view and the optical flow based foot tracking will not have enough information to provide good tracking output.

However, since we have the CAN information to verify if the brake or acceleration is actually engaged or not, this kind of confusion can be corrected with the following practical rule:

If s_t == BrkEngage AND B_t == 0 then s_t = TowardsBrk
If s_t == AccEngage AND A_t == 0 then s_t = TowardsAcc
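The windowed estimate of Eq. (1) can be sketched as a forward pass over the last TimeWindow observations followed by an argmax over the posterior at the current frame, with the practical CAN-based correction applied afterwards. The snippet below assumes the parameters from the previous sketch and uses SciPy's multivariate normal for the Gaussian likelihoods; it is an illustration, not the paper's implementation.

import numpy as np
from scipy.stats import multivariate_normal

NEUTRAL, BRK_ENG, ACC_ENG, TWD_BRK, TWD_ACC, RLS_BRK, RLS_ACC = range(7)

def estimate_current_state(window_obs, means, covs, trans, init, B_t, A_t):
    """Most likely current state given the last TimeWindow observations (Eq. (1))."""
    n = trans.shape[0]
    # Per-frame Gaussian likelihoods P(O_t | q_t = s_i).
    lik = np.array([[multivariate_normal.pdf(o, means[s], covs[s]) for s in range(n)]
                    for o in window_obs])
    # Forward algorithm with per-step normalization for numerical stability.
    alpha = init * lik[0]
    alpha /= alpha.sum()
    for t in range(1, len(window_obs)):
        alpha = (alpha @ trans) * lik[t]
        alpha /= alpha.sum()
    s_t = int(np.argmax(alpha))        # argmax of the posterior at the current frame

    # Practical CAN-based correction rule from above.
    if s_t == BRK_ENG and B_t == 0:
        s_t = TWD_BRK
    if s_t == ACC_ENG and A_t == 0:
        s_t = TWD_ACC
    return s_t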

2.2.3. Prediction of brake and acceleration pedal press

The meaning of our estimated foot behavior states directly connects to the prediction of actual pedal presses. Whenever the foot is in the state Move Towards Brake or Move Towards Acceleration, we can predict that the corresponding brake or acceleration pedal will be pressed in the near future. To avoid possible noise in the HMM estimation of the current foot behavior state, we accumulate over a small time period (a few hundred milliseconds in our experiment) before the current time t:

$$p_b = \sum \left(s = \mathrm{MoveTowardsBrake}\right) \qquad (2)$$

$$p_a = \sum \left(s = \mathrm{MoveTowardsAcceleration}\right) \qquad (3)$$

Using these accumulated values, the prediction is made as follows:

IF (pa == 0 AND pb == 0)
    Prediction = No pedal will be pressed
ELSE
    IF (pa > pb)
        Prediction = Acceleration pedal will be pressed
    ELSE
        Prediction = Brake pedal will be pressed
    END
END

Note that although the absolute value and relative proportion between p_b and p_a have not been exploited in our current implementation, they could give us some useful information about the prediction confidence.
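As a small illustration, the accumulation of Eqs. (2) and (3) and the decision rule above can be combined into a single function over the recent state estimates (the last few hundred milliseconds); the state indices follow the earlier sketches and the return labels are ours.

TWD_BRK, TWD_ACC = 3, 4                     # state indices as in the earlier sketches

def predict_pedal_press(recent_states):
    """Predict an upcoming pedal press from accumulated 'Towards' state estimates."""
    pb = sum(s == TWD_BRK for s in recent_states)   # Eq. (2)
    pa = sum(s == TWD_ACC for s in recent_states)   # Eq. (3)
    if pa == 0 and pb == 0:
        return "no pedal press predicted"
    return "acceleration pedal" if pa > pb else "brake pedal"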

1 http://code.google.com/p/pmtk3/.


Fig. 6. LISA-P testbed configuration for data acquisition.


2.3. Automatic data labeling based on vehicle CAN-bus information

Based on embedded pedal sensor data via the vehicle CAN-bus, we can reliably determine the state (engaged/not engaged) of the brake and acceleration pedals. Utilizing this information combined with the optical flow based driver foot tracking output (px, py, vx, vy, B, A), we develop a post-processing procedure to automatically label the observations into the seven behavior states.

• BrakeEngage state: Observations with B = 1 are labeled as BrakeEngage.
• AccelEngage state: Observations with A = 1 are labeled as AccelEngage.
• Neutral state: Observations with A = 0 AND B = 0 AND sqrt(vx² + vy²) < MotionThres (no motion and off pedal) are labeled as Neutral.
• TowardsBrake state: We detect the events when B changes from 0 to 1 (the start of a BrakeEngage). Looking backward in time from each such event t1 until:
  – A Neutral state is detected at time t2: Label all the observations between t2 and t1 as TowardsBrake.
  – A BrakeEngage state is detected at time t2: This means the foot changed from a previous ReleaseBrake state to TowardsBrake without going through the Neutral state (no stop in the motion), and then to BrakeEngage. Therefore we need to separate the ReleaseBrake and TowardsBrake states. Although there is no stop point, the foot trajectory in the ReleaseBrk/Acc state and in the TowardsBrk/Acc state must follow opposite directions along the x (backward and forward) axis. Therefore we determine the time t3 between t2 and t1 where px is minimum and then label observations from t2 to t3 as ReleaseBrake and from t3 to t1 as TowardsBrake.
  – An AccelEngage state is detected at time t2: Similarly, we find the time t3 between t2 and t1 where px is minimum and then label observations from t2 to t3 as ReleaseAccel and from t3 to t1 as TowardsBrake.
• TowardsAccel state: We detect the events when A changes from 0 to 1 (the start of an AccelEngage). Looking backward in time from each such event t1 until:
  – A Neutral state is detected at time t2: Label all the observations between t2 and t1 as TowardsAccel.
  – A BrakeEngage state is detected at time t2: Determine the time t3 between t2 and t1 where px is minimum and then label observations from t2 to t3 as ReleaseBrake and from t3 to t1 as TowardsAccel.
  – An AccelEngage state is detected at time t2: Determine the time t3 between t2 and t1 where px is minimum and then label observations from t2 to t3 as ReleaseAccel and from t3 to t1 as TowardsAccel.
• ReleaseBrake state: We detect the events when B changes from 1 to 0 (the end of a BrakeEngage). Looking forward in time from each such event t1 until:
  – A Neutral state is detected at time t2: Label all the observations between t1 and t2 as ReleaseBrake.
  – A BrakeEngage state is detected at time t2: Determine the time t3 between t1 and t2 where px is minimum and then label observations from t1 to t3 as ReleaseBrake and from t3 to t2 as TowardsBrake.
  – An AccelEngage state is detected at time t2: Determine the time t3 between t1 and t2 where px is minimum and then label observations from t1 to t3 as ReleaseBrake and from t3 to t2 as TowardsAccel.
• ReleaseAccel state: We detect the events when A changes from 1 to 0 (the end of an AccelEngage). Looking forward in time from each such event t1 until:
  – A Neutral state is detected at time t2: Label all the observations between t1 and t2 as ReleaseAccel.
  – A BrakeEngage state is detected at time t2: Determine the time t3 between t1 and t2 where px is minimum and then label observations from t1 to t3 as ReleaseAccel and from t3 to t2 as TowardsBrake.
  – An AccelEngage state is detected at time t2: Determine the time t3 between t1 and t2 where px is minimum and then label observations from t1 to t3 as ReleaseAccel and from t3 to t2 as TowardsAccel.

The output of this automatic procedure was qualitatively validated by inspecting the labeled states over time in a synchronized mode with the video of the foot, and the labeled data looked reasonable. Note that this automatic labeling procedure needs a whole long data sequence in which we can look into the "past" and "future" of a pedal press event. In the estimation and prediction mode, we only have a TimeWindow of previous (past) observations; this is why we need to learn the HMM model for that purpose.
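The backward pass that produces the TowardsBrake labels can be sketched as follows (the TowardsAccel pass and the forward-looking Release passes are symmetric). The split at the px minimum follows the description above; the helper name, state encoding, and MotionThres value are illustrative assumptions.

import numpy as np

NEUTRAL, BRK_ENG, ACC_ENG, TWD_BRK, TWD_ACC, RLS_BRK, RLS_ACC = range(7)

def label_towards_brake(px, vx, vy, B, A, motion_thres=1.0):
    """Backward pass labeling TowardsBrake; px, vx, vy, B, A are 1-D arrays over time."""
    T = len(B)
    labels = np.full(T, -1)                       # -1 means not yet labeled
    labels[B == 1] = BRK_ENG
    labels[A == 1] = ACC_ENG
    speed = np.hypot(vx, vy)
    labels[(A == 0) & (B == 0) & (speed < motion_thres)] = NEUTRAL

    # Each start of a BrakeEngage (B goes 0 -> 1) is an event t1; look backward.
    for t1 in np.where((B[1:] == 1) & (B[:-1] == 0))[0] + 1:
        t = t1 - 1
        while t >= 0 and labels[t] not in (NEUTRAL, BRK_ENG, ACC_ENG):
            t -= 1
        t2 = max(t, 0)
        if labels[t2] in (BRK_ENG, ACC_ENG):
            # No stop in between: split at the px minimum between the previous
            # Release movement (t2..t3) and the Towards movement (t3..t1).
            t3 = t2 + int(np.argmin(px[t2:t1 + 1]))
            labels[t2 + 1:t3] = RLS_BRK if labels[t2] == BRK_ENG else RLS_ACC
            labels[t3:t1] = TWD_BRK
        else:
            labels[t2 + 1:t1] = TWD_BRK
    return labels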

3. Real-world driving testbed and data acquisition

The driver foot behavior data used in this study were captured from our intelligent vehicular testbed, the LISA-P, seen in Fig. 6. An experimental data collection paradigm was designed to approximate stop-and-go traffic, in order to maximize the number of pedal presses observed per participant. This is one of a series of experiments carefully designed in cooperation with a cognitive scientist to study "sequential effects" in complex and naturalistic tasks [10] (i.e. moving from the simple task of a two-button press, to a carefully designed driving simulation task, to a real-world driving experiment). Prior research has shown these contexts tend to correlate with a higher number of driver errors, potentially resulting in "pedal misapplications" [10,11,19]. The design of the experimental data collection is as follows:

• Apparatus: The experiment was conducted in our LISA-P testbed with the configuration shown in Fig. 6. There is a laser-based head-up display which can display visual cues anywhere on the windshield; more details on the head-up display can be found in [8]. Two speakers were also used to generate audio cues when needed. Driver foot movements were captured by a foot camera at 640 × 480 resolution (30 frames per second). Vehicle parameters like brake and acceleration pedal states, steering angle, etc. were also acquired from the CAN-bus every 10 ms, timestamped and synchronized with the video and cue data.


Table 1
Confusion matrix of seven gesture states averaged over 15 runs (each subject does three runs: visual cues only, audio cues only, and both audio and visual cues). For each run, the first 2 min are used for training and the ~8 remaining minutes are used for testing. Each row sums up to 100%.

Actual    Predicted state
          Neutral   BrkEng    AccEng    TwdBrk    TwdAcc    RlsBrk    RlsAcc
Neutral   0.9034    0         0         0.0087    0.0254    0.0199    0.0426
BrkEng    0         0.9952    0         0.0045    0         0         0.0003
AccEng    0         0.0009    0.9955    0         0.0027    0.0003    0.0007
TwdBrk    0.0912    0         0         0.8762    0.0175    0.0009    0.0143
TwdAcc    0.1181    0.0006    0         0.0090    0.8664    0         0.0059
RlsBrk    0.0371    0         0.0007    0.0007    0.0006    0.9596    0.0014
RlsAcc    0.0082    0.0037    0         0.0098    0.0049    0.0057    0.9678


• Procedure: For safety reasons, the experiment was conducted in an empty parking lot, with cues given to the driver under three different conditions. A set of random sequences of cues, to brake or accelerate, was presented to the driver (1) visually, (2) by audio, or (3) using both audio and visual simultaneously. These sequences of stop and go cues were designed to be similar to a recent study on sequential effects in driving [10]. The driver responded as soon as possible by tapping the brake or acceleration pedal accordingly. Each subject did three runs, one under each of the three cueing conditions. A run includes 128 trials, where each trial includes one stop or go cue with the relevant response.
• Participants: Experimental data was recorded from 12 subjects of various nationalities and genders, ranging in age from their 20s to their 50s. All subjects were licensed drivers, and ranged in experience from novice to decades of experience.

The experimental paradigm captured data of naturalistic driving behavior in scenarios similar to stop-and-go traffic. Given this setup, we use the captured data for driver foot behavior modeling and prediction. In addition to this well controlled experiment in the parking lot, we also have some preliminary analysis with data captured from a few naturalistic driving scenarios (Section 4.6).

4. Experimental results and analysis

Since the proposed post-processing framework can automatically label the data, the HMM parameters can be learned specifically for different subjects for better performance. Therefore, in this section we analyze the behavior state estimation and pedal press prediction using data from all three conditions of the experiment to show the performance of the proposed framework. We also test a subject-wise cross-validation procedure (i.e. train on subject 1 and test on subject 2) to illustrate the benefit of the proposed framework in adapting the learned model to different subjects and situations.

4.1. HMM foot behavior state estimation

For each experimental run of a subject (about 10 min of data), we use the first 2 min for training and the remaining part for testing. Fig. 7 visualizes an example of the estimated foot behavior states in comparison with the ground truth from the automatic data labeling procedure.

Fig. 7. Plot of HMM estimated behavior states compared to the ground truth obtained from automatic labeling.


The confusion matrix of the seven states, averaged over 15 runs (each subject does three runs: visual cues only, audio cues only, and both audio and visual cues), is shown in Table 1. The mean correct classification rate over the seven classes is 93.77%. The significant source of confusion remains mostly in the distinction between the end of a "Neutral" state and the transition to a movement "Towards" the pedals. The identification of the initial movement point is more difficult in live analysis than in post-processing. Examples of this can be seen in Fig. 7, which shows a consistent pattern of slightly later transitions out of "Neutral" in the predicted behavior states.

4.2. Brake/acceleration pedal press prediction

In applying this state detection methodology to pedal press prediction, we would like to accurately predict the pedal presses as soon as possible. However, there is typically a trade-off between the time advantage of prediction (how soon the prediction is made compared to the actual pedal press) and both the recall² and precision³ rates. To analyze this trade-off, we attempt to make predictions at various points leading up to an actual pedal press (as determined from embedded pedal sensor data). For example, we may set a threshold time of 300 ms before the pedal press, and using the accumulated information leading up to that time (Eq. (2)), we can determine the performance of a corresponding predictive classifier (i.e. classifying whether/which pedal will be pressed).
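Such an evaluation can be sketched as follows: fix a lead time (e.g. 4 frames, i.e. 133 ms at 30 fps), take the prediction available at that point before each actual press, and score it per pedal with the recall and precision definitions in the footnotes. The pairing of predictions with CAN-detected presses and the label strings are our assumptions, not the paper's code.

def precision_recall_at_lead_time(events):
    """Score per-press predictions made a fixed lead time before the actual press.

    events : list of (actual, predicted) pairs, one per pedal press, where
             'predicted' is the prediction available at the chosen lead time
             and values are "brake pedal", "acceleration pedal",
             or "no pedal press predicted".
    """
    scores = {}
    for pedal in ("brake pedal", "acceleration pedal"):
        tp = sum(1 for a, p in events if a == pedal and p == pedal)
        n_pred = sum(1 for _, p in events if p == pedal)
        n_act = sum(1 for a, _ in events if a == pedal)
        scores[pedal] = {
            "precision": tp / n_pred if n_pred else float("nan"),
            "recall": tp / n_act if n_act else float("nan"),
        }
    return scores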

Table 2 shows the precision and recall of brake and acceleration predictions with different thresholds of time before the actual press. These results are the average over all 15 runs. We see that at 133 ms prior to the actual pedal press, a major part (~74%) of the pedal presses can be predicted (an 82.66% recall rate for acceleration predictions, and 65.61% for brake predictions).

4.3. Validation across subjects

We use leave-one-out cross validation to analyze the performance of the learned HMM behavior model across subjects. For each test, we select a subject to test on and train the HMM behavior model with data from the remaining subjects. The results are then averaged over all tests as shown in Table 3. We see that the performance degrades in comparison to Table 1, where the HMM model is trained and tested on the same subject for each test. There are more confusions, especially in the TowardsBrake, TowardsAccel, and ReleaseBrk states, which further indicates that there are some differences in foot movement style between subjects.

² Recall = |{Actual pedal presses} ∩ {Predicted pedal presses}| / |{Actual pedal presses}|.

³ Precision = |{Actual pedal presses} ∩ {Predicted pedal presses}| / |{Predicted pedal presses}|.


Table 2
The recall and precision of pedal press prediction (averaged over 15 runs, ~10 min each).

Time before actual pedal press    Brake prediction          Acceleration prediction
Frames    Milliseconds            Precision    Recall       Precision    Recall
30        999                     1.0000       0.0458       0.8980       0.1042
15        499                     0.8016       0.0532       0.9153       0.1412
10        333                     0.7947       0.0854       0.8882       0.3246
8         266                     0.8869       0.2284       0.9065       0.5264
7         233                     0.9010       0.3085       0.9221       0.6265
6         200                     0.9037       0.3972       0.9352       0.7265
5         167                     0.9220       0.5090       0.9398       0.7841
4         133                     0.9564       0.6561       0.9529       0.8266
3         100                     0.9695       0.8111       0.9630       0.8622
2         67                      0.9781       0.8816       0.9694       0.8971
1         33                      0.9823       0.8970       0.9732       0.9138
0         0                       1            1            1            1

Table 3
Confusion matrix of seven states with leave-one-out cross validation. Each row sums up to 100%. For each subject, we train the HMM behavior model with data from the other subjects and test on the selected one. The results are then averaged over all tests.

Actual    Predicted state
          Neutral   BrkEng    AccEng    TwdBrk    TwdAcc    RlsBrk    RlsAcc
Neutral   0.9066    0         0         0.0114    0.0359    0.0200    0.0262
BrkEng    0         0.9022    0         0.0978    0         0         0
AccEng    0         0         1         0         0         0         0
TwdBrk    0.1697    0         0         0.6701    0.1532    0         0.0069
TwdAcc    0.1951    0         0         0.0110    0.7940    0         0
RlsBrk    0.1432    0.0002    0.0002    0         0.0590    0.6267    0.1709
RlsAcc    0.1086    0.0022    0         0.0113    0         0.0066    0.8713

Fig. 8. Effect of changing the observation time window for estimating the foot behavior states (illustrated for Subject 1).

Fig. 9. HMM confidence in estimating behavior states with different observation time windows (measured by max_{i=1:7} P(q_t = s_i | O_{t-TimeWindow} ... O_t; Λ)).


This demonstrates the importance of the automatic data labeling procedure, which makes it easier to train individual models for each of the subjects in order to achieve better performance.

4.4. Benefits of temporal information

Another analysis of interest is the relative contribution of temporal information, in comparison with the instantaneous observation information at each single frame. Fig. 8 shows the recall rate for the seven states (the diagonal of the confusion matrix) as a function of the observation time window used for estimation (a time window of 1 means only the observation at the current frame is used). We see that for most of the states the temporal information did help to improve the performance. Fig. 9 also shows that with a few historical frames, the average confidences of the learned HMM model in estimating the foot behavior state are all over 90% (the confidence is determined by the max state estimation probability in Eq. (1)).

4.5. Foot gesture analysis for prediction of pedal misapplication

As mentioned above, one potential application for the detection of foot gestures is in predicting and mitigating the effects of pedal misapplication. Due to the nature of the experimental data, we were able to observe several instances of unintended pedal presses, or "misapplications". In these cases, the subjects were cued to hit a specific pedal but instead applied the wrong pedal. These cases tended to occur when the subjects had a significant workload to deal with (e.g. environmental stimuli), or a rapid set of alternating cues to respond to (e.g. confusing historical context [10]).

Fig. 10 visualizes the outputs of the proposed framework for a brake misapplication example (when the subject was cued to hit acceleration but instead applied the brake) and an acceleration misapplication example (when the subject was cued to hit the brake but instead applied acceleration). Over 15 runs, there were five cases of brake misapplications and 15 cases of acceleration misapplications. In all these 20 cases, the movement towards the wrong pedal was correctly predicted by the HMM prediction framework.


Fig. 10. Trajectories of an actual brake misapplication (in red – the subject was cued to hit acceleration but instead applied the brake) and an acceleration misapplication (in blue – the subject was cued to hit the brake but instead applied acceleration). The trajectories are obtained by optical flow based foot tracking (the X, Y axes are the image coordinates of the tracked foot). The labeled points show the outputs of the HMM based foot behavior analysis. In each case, the pedal misapplication is correctly predicted over 100 ms in advance of the actual pedal press. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 11. Statistics (mean and standard deviation) of the prediction time prior to the actual pedal press for brake misapplications (the subject was cued to hit acceleration but instead applied the brake) and acceleration misapplications (the subject was cued to hit the brake but instead applied acceleration). Over 15 runs (~10 min each), there were 5 brake misapplications and 15 acceleration misapplications.


The average prediction time prior to the actual pedal press was ~193 ms for brake misapplications and ~206 ms for acceleration misapplications. The statistics of these prediction times are shown in Fig. 11.

In real-world instances, if an intelligent ADAS is aware that a driver needs to begin applying the brakes, but instead detects a move towards the accelerator, the ADAS could take measures to reduce the effects of the error. Haptic feedback, either alerting the driver or making it more difficult to press the pedal, could possibly prevent a simple error from escalating into a critical incident.

4.6. Preliminary analysis in naturalistic driving scenarios

We also performed a preliminary test of the proposed system on some naturalistic driving scenarios (Fig. 12). In order to have an adequate number of pedal presses, we asked the subject to drive mainly in an urban area with some entering/exiting of on-ramps. However, in this test the subject just drives normally without any generated audio or visual cue. We captured data from a few runs of such a naturalistic driving scenario, ~45 min each. Based on the CAN data, there are over 300 pedal (brake and acceleration) presses, and typically there are a bit more acceleration presses than brake presses. The optical flow based foot tracking worked well (illustrated in the bottom-right images of Fig. 13).

Based on the automatic data labeling procedure, the foot behavior modeling and prediction was trained and tested individually for each run (half of the data was used for training, the other half for testing). Tables 4 and 5 show the averaged results for foot behavior state estimation and pedal press prediction, respectively. We see that the results are not as good as in the controlled experiment in the parking lot. From Table 4, the Neutral state was the main source of errors in estimating the TowardsBrake and TowardsAccel states. Digging deeper into the data, we see that one common driver foot behavior in naturalistic driving is keeping the foot on the acceleration (or brake) pedal and subtly adjusting the speed. This kind of subtle pedal release/press is hard to observe visually (e.g. visually, the foot is still on the pedal, but from the CAN data we see that the pedal was released and pressed again). This seems to be the main reason for the misclassification of the TowardsBrake and TowardsAccel states as Neutral, as well as for the low recall rates in Table 5. Note, however, that when a pedal press is predicted, the prediction accuracy is still high.

4.7. Remaining difficulties

As illustrated in Fig. 13, the optical flow based foot tracking works quite well with various foot appearances and lighting conditions. However, due to the nature of the optical flow approach, there are some difficult situations, as shown in Fig. 14. The left images illustrate a case of shoes with no texture. In this case, optical flows are only detected at the boundary of the shoes, which sometimes makes the estimated position and motion of the foot more ambiguous.


Fig. 12. Samples of data captured from naturalistic driving scenarios.

Fig. 13. Optical flow foot tracking worked well with various foot appearances and lighting conditions.

Table 4
Confusion matrix of seven states (averaged over multiple runs in naturalistic driving scenarios). Each row sums up to 100%.

Actual    Predicted state
          Neutral   BrkEng    AccEng    TwdBrk    TwdAcc    RlsBrk    RlsAcc
Neutral   0.9931    0         0         0.0017    0.0052    0         0
BrkEng    0.0091    0.9897    0         0.0012    0         0         0
AccEng    0.0061    0         0.9934    0         0.0002    0         0.0002
TwdBrk    0.2962    0         0         0.6005    0.0054    0         0.0978
TwdAcc    0.4648    0         0         0.0047    0.4601    0.0704    0
RlsBrk    0.1864    0         0         0         0         0.8136    0
RlsAcc    0.1962    0.0096    0         0         0         0.0048    0.7895


For example, the ReleaseAcceleration foot movement is less prominent due to the lack of texture, and when part of the shoe boundary is out of the scene, the detected motion center becomes jumpier. Of course, if the shoe appearance is indistinguishable from the background, then the optical flow approach will not work (we do not have such a case in our data, though). In our proposed system, we adapt the HMM based behavior model for each individual driver based on the automatic data labeling procedure. Therefore, some characteristics of foot movement with textureless shoes can be taken into account in the training process, which helps to alleviate the performance degradation in the final foot behavior modeling and prediction.

The right images in Fig. 14 illustrate a situation in which there is no actual foot movement but the optical flow approach detects some movement due to a change in extreme lighting conditions.


Table 5
The recall and precision of pedal press prediction (averaged over multiple runs in naturalistic driving scenarios, ~45 min each).

Time before actual pedal press    Brake prediction          Acceleration prediction
Frames    Milliseconds            Precision    Recall       Precision    Recall
30        999                     0.9333       0.3182       0.8696       0.2083
15        499                     0.9302       0.4086       0.9615       0.2604
10        333                     0.9411       0.4205       0.9643       0.2813
8         266                     0.9411       0.4205       0.9643       0.2813
7         233                     0.9411       0.4205       0.9643       0.2813
6         200                     0.9487       0.4432       0.9655       0.2917
5         167                     0.9487       0.4432       0.9667       0.3524
4         133                     0.9524       0.4545       0.9667       0.3524
3         100                     0.9524       0.4545       0.9685       0.3711
2         67                      0.9656       0.5417       0.9717       0.3916
1         33                      0.9656       0.5417       0.9755       0.4158
0         0                       1            1            1            1

Fig. 14. Examples of difficult situations for optical flow foot tracking. Left: Textureless shoes degrade the quality of optical flow tracking. Right: Spurious foot movement detected by optical flow due to a change in extreme lighting conditions.


Since we use the majority of the detected optical flows with a threshold on the magnitude, this problem only happens occasionally (i.e. lighting condition changes in general will not result in a false detection of foot motion). Moreover, in our data this kind of false motion detection seems to happen only for a very short amount of time; therefore a smoothing preprocess (e.g. the median filter we used) helps to alleviate the issue.

As mentioned in Section 4.6, one important issue with the naturalistic driving scenarios is the kind of subtle speed adjustment which is hard to observe from visual foot movement. Our current 7-state HMM model and automatic data labeling procedure did not characterize this kind of situation well (e.g. the behavior state when the foot seems to visually stay still on the pedal but, from the CAN data, we see that the pedal was released and engaged again). Therefore, the performance of foot behavior modeling and prediction was not very good in our preliminary analysis. For future work, it is possible to modify our behavior model and the automatic labeling procedure (e.g. introduce other behavior states) to take this issue into account. Moreover, since the visual information is less informative in this kind of situation, developing a mechanism which can adapt to focus more on information from the CAN-bus in such cases could be useful.

5. Discussion and concluding remarks

In this paper, we have proposed and implemented a new vision-based framework for driver foot behavior modeling and prediction, which is an important but still open area in developing Intelligent Driver Assistance Systems. To our knowledge, the proposed system is the first system developed for vision-based semantic understanding of driver foot behavior, including both modeling and prediction. Using the output of optical flow based foot tracking combined with embedded pedal sensor information, we design an HMM model to learn the temporal foot behavior. The learned HMM model is then used to interpret driver foot movement into seven behavior states (Neutral, BrakeEngage, AccelerationEngage, TowardsBrake, TowardsAcceleration, ReleaseBrake, ReleaseAcceleration) as well as to predict a brake or acceleration press before it actually happens. The proposed framework also utilizes reliable vehicle sensor information for an automatic post-hoc data labeling procedure. An opportunity in follow-up research is to use this labeled data to update the HMM model for different subjects and situations in an online manner.

In our experiment with real-world driving data, the proposed framework provided good results, with a recall rate of ~94% averaged over all seven behavior states in estimating foot behavior states compared to the labeled ground truth. The analysis of different observation time windows showed that using temporal information did help to improve the performance. Results of our cross-subject test (i.e. training the HMM with one subject and testing on another subject) implied that different drivers may have different foot behavior styles. Therefore, the potential of online updating of the HMM behavior model for different subjects and situations is an important aspect of the proposed framework. Regarding the pedal press prediction based on the estimated behavior states, a major portion of the pedal presses can be precisely predicted before they actually happen (e.g. a recall rate of ~74% at 133 ms before the actual press). This indicates the potential of using the proposed framework for some open problems in Intelligent Driver Assistance, like predicting and mitigating the pedal misapplication phenomenon [32]. Of course, there are still remaining issues that we need to deal with if we want to apply the proposed system to more naturalistic, real-world driving scenarios. Some of these issues are discussed in our preliminary analysis on naturalistic driving data.


It should be mentioned that ~133 ms does not seem to be enough to provide a useful visual feedback, but it might be possible to generate an appropriate haptic pedal feedback. As reported in [1], the response time delay for visual feedback is around 200–500 ms, while responses to continuous haptic feedback are significantly faster, on the order of ~50 ms, with less cognition involved. Given that we are able to detect several instances of pedal misapplication about 200 ms prior to the pedal press, we have demonstrated that the approach is a feasible tactic to help avoid dangerous situations.

When considering real-world driver assistance applications, the system processing time is also an important factor that needs to be taken into account. In our implementation, the time for HMM behavior state estimation is not an issue, since it typically takes only a few milliseconds per estimation. The current bottleneck is the computation of the optical flow based foot tracking. It runs at about 10 frames per second, implying a delay of up to 100 ms for the optical flow computation, which is slow for time-critical applications. There is potential to make this approach more feasible for real-world driver assistance applications, for example by using parallel algorithms for optical flow. We can also reduce the image resolution and the number of image features used for optical flow estimation (currently we use 640 × 480 image resolution and 200 image features). This kind of reduction, however, might lead to some trade-offs in the quality of the output that we will need to consider.

The implications for improving the safety and comfort of driving are significant. By proposing a novel system that is able to model and predict foot gestures in vehicles, we have ultimately demonstrated a unique opportunity to harness computer vision in improving safety on our roads. Driver foot gesture analysis should also enable future studies to address and develop countermeasures for dual-task interference and distraction related issues during driving [17,18].

Acknowledgments

We thank the reviewers and editors for their insightful and constructive comments, which helped us improve our paper. Research presented in this paper is sponsored in part by the UC Discovery Program and the National Science Foundation. The authors would also like to thank the volunteers and our colleagues in the LISA lab for their valuable advice and assistance.

References

[1] D.A. Abbink, E.R. Boer, M. Mulder, Motivation for continuous haptic gas pedal feedback to support car following, in: IEEE Intelligent Vehicles Symposium, 2008.
[2] M.A. Brubaker, D.J. Fleet, A. Hertzmann, Physics-based person tracking using the anthropomorphic walker, Int. J. Comput. Vis. (2009), doi:10.1007/s11263-009-0274-5.
[3] S.Y. Cheng, S. Park, M.M. Trivedi, Multiperspective and multimodal video arrays for 3D body tracking and activity analysis, Comput. Vis. Image Understand. 106 (2007). Special Issue on Advances in Vision Algorithms and Systems Beyond the Visible Spectrum.
[4] S.Y. Cheng, M.M. Trivedi, Turn-intent analysis using body pose for intelligent driver assistance, IEEE Pervas. Comput. 5 (4) (2006) 28–37.
[5] S.Y. Cheng, M.M. Trivedi, Vision-based infotainment user determination by hand recognition for driver assistance, IEEE Trans. Intell. Transport. Syst. (2010).
[6] I. Choi, C. Ricci, Foot-mounted gesture detection and its application in virtual environments, in: IEEE Int. Conf. Syst., Man, Cybern., 1997.
[7] D. DeCarlo, D. Metaxas, Optical flow constraints on deformable models with applications to face tracking, Int. J. Comput. Vis. 38 (2000) 99–127.
[8] A. Doshi, S.Y. Cheng, M.M. Trivedi, A novel, active heads-up display for driver assistance, IEEE Trans. Syst., Man, Cybern. – Part B 39 (2009).
[9] A. Doshi, B. Morris, M.M. Trivedi, On-road prediction of driver's intent with multimodal sensory cues, IEEE Pervasive (2011). Special Issue on Automotive Pervasive Computing.
[10] A. Doshi, C. Tran, M. Wilder, M.C. Mozer, M.M. Trivedi, Sequential dependencies in driving, Cognitive Sci. (2011).
[11] B. Freund, L.A. Colgrovea, D. Petrakosa, R. McLeod, In my car the brake is on the right: pedal errors among older drivers, Accident Anal. Prevent. (2008).
[12] J.R. Healey, S.S. Carty, Driver error found in some Toyota acceleration cases, USA Today, 2010.
[13] M.B. Holte, T.B. Moeslund, P. Fihl, View-invariant gesture recognition using 3D optical flow and harmonic motion context, Comput. Vis. Image Understand. 114 (2010) 1353–1361.
[14] K.S. Huang, M.M. Trivedi, Robust real-time detection, tracking, and pose estimation of faces in video streams, in: Proceedings of International Conference on Pattern Recognition, 2004.
[15] K.S. Huang, M.M. Trivedi, 3D shape context based gesture analysis integrated with tracking using omni video array, in: IEEE Workshop on Vision for Human–Computer Interaction (V4HCI), CVPR, 2005.
[16] H. Lee, J.H. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Trans. Pattern Anal. Mach. Intell. 21 (10) (1999).
[17] J. Levy, H. Pashler, Task prioritisation in multitasking during driving: opportunity to abort a concurrent task does not insulate braking responses from dual-task slowing, Appl. Cognitive Psych. 22 (2008) 507–525.
[18] J. Levy, H. Pashler, E. Boer, Central interference in driving: is there any stopping the psychological refractory period, Psychol. Sci. 17 (3) (2006) 228–235.
[19] L. Livne, D. Shinar, Effects of uncertainty, transmission type, driver age and gender on brake reaction and movement time, J. Safety Res. (2002).
[20] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of Imaging Understanding Workshop, 1981.
[21] J.C. McCall, M.M. Trivedi, Driver behavior and situation aware brake assistance for intelligent vehicles, Proc. IEEE 95 (2) (2007).
[22] T. Moeslund, A. Hilton, V. Kruger, A survey on advances in vision-based human motion capture and analysis, Comput. Vis. Image Understand. 104 (2006).
[23] M. Mulder, J.J.A. Pauwelussen, M.M. van Paassen, M. Mulder, D.A. Abbink, Active deceleration support in car following, IEEE Trans. Syst., Man, Cybern. – Part A 40 (6) (2010).
[24] E. Murphy-Chutorian, M.M. Trivedi, Head pose estimation and augmented reality tracking: an integrated system and evaluation for monitoring driver awareness, IEEE Trans. Intell. Transport. Syst. (2010).
[25] NHTSA, Traffic Safety Facts 2006 – A Compilation of Motor Vehicle Crash Data from the Fatality Analysis Reporting System and the General Estimates System, Washington, DC: Nat. Center Stat. Anal., US Dept. Transp., 2006.
[26] S. Park, T.B. Sheridan, Enhanced human machine interface in braking, IEEE Trans. Syst., Man, Cybern. – Part A 34 (5) (2004).
[27] A. Pentland, A. Liu, Modeling and prediction of human behavior, Neural Computation 11 (1999).
[28] J. Pollard, E.D. Sussman, An Examination of Sudden Acceleration, Report DOT HS 807367, NHTSA, U.S. Department of Transportation, 1989.
[29] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989).
[30] M. Ramsey, Toyota rethinks pedal design, The Wall Street J. (2010).
[31] Y. Tanaka, H. Kaneyuki, T. Tsuji, T. Miyazaki, K. Nishikawa, T. Nouzawa, Mechanical and perceptual analyses of human foot movements in pedal operation, in: IEEE International Conference on Systems, Man, and Cybernetics, 2009.
[32] C. Tran, A. Doshi, M.M. Trivedi, Pedal errors prediction by driver foot gesture analysis: a vision-based inquiry, in: IEEE Intelligent Vehicles Symposium, 2011.
[33] C. Tran, M.M. Trivedi, Driver assistance for keeping hands on the wheel and eyes on the road, in: IEEE International Conference on Vehicular Electronics and Safety, 2009.
[34] C. Tran, M.M. Trivedi, Towards a vision-based system exploring 3D driver posture dynamics for driver assistance: issues and possibilities, in: IEEE Intelligent Vehicles Symposium, 2010.
[35] M.M. Trivedi, S.Y. Cheng, Holistic sensing and active displays for intelligent driver support systems, IEEE Comput. (2007).
[36] M.M. Trivedi, T. Gandhi, J. McCall, Looking-in and looking-out of a vehicle: computer-vision-based enhanced vehicle safety, IEEE Trans. Intell. Transport. Syst. (2007) 108–120.
[37] J. Wu, M.M. Trivedi, A two-stage head pose estimation framework and evaluation, Pattern Recogn. 41 (3) (2008) 1138–1158.
