Driver2Vec: Discovering Driver Signatures in Automotive Data
CS 341 Final Report

Ruge Zhao, Meixian Zhu, Jing Bo Yang
[email protected], [email protected], [email protected]

Abstract

Driver behavior analysis has numerous applications, ranging from smart personalized vehicle assistants to insurance risk assessment. These analyses benefit from a scheme that converts vehicle sensor data into a concise embedding. Our study proposes a deep learning architecture to extract driver behavior into a vector (Driver2vec). The model is built on the recently developed temporal convolutional network and leverages the strength of triplet loss. Our model achieved an average pairwise driver identification accuracy of 81.8% among 51 drivers, much higher than accuracies obtained in previous studies. Downstream tasks such as driver identification, embedding clustering and risk profiling were performed to verify the effectiveness of the Driver2vec embedding.

1. Problem Description

Modern vehicles are equipped with an increasing number of sensors that capture internal data (e.g. brake pedal, acceleration) and how the vehicle interacts with its surroundings (e.g. GPS location, lane positions, distance to other vehicles). Such data provide a comprehensive summary of not only the trip but also the driver's driving style.

The primary goal of this project is to build a model that can quantify and encode driving behavior in the form of a vector (we call it driver2vec), a process commonly referred to as embedding generation. Generating embeddings from sensor data to analyze drivers' driving behavior and using the driver2vec embedding to identify a driver have numerous potential applications. These include driver identification, which is helpful for building a personalized assistant that automatically adjusts settings (music playlist, temperature, radio station, etc.) based on the driver's preference. Besides classifying drivers, driver2vec embeddings can serve as a medium for clustering methods that identify drivers with similar driving styles. With accurate clustering, businesses can improve their understanding of the risks associated with different driving behaviors. We can also use these embeddings and other available labels to train downstream tasks such as driver risk profiling or safety scoring.

In this report, we draw on previous work on the analysis of driving behavior and on methods used for embedding generation. We show that our embedding generation model, built on triplet loss, wavelet features and the Temporal Convolutional Network (TCN) [1], is highly effective at classifying drivers. We also demonstrate that the generated embeddings are of high quality, enabling computation of collision risk and identification of driving styles.

2. Dataset

The dataset employed for this project was collected from a high-end driving simulator by Nervtech. It contains 51 volunteer drivers' test drives on four disjoint road areas, for a total of 55 rides. The data was sampled at 100 Hz, and each driver spent between 20 minutes and an hour on the simulator. The dataset also separately tracks whether collisions happened during each driver's simulated drive. Among the 51 drivers, 29 had at least one accident.

Each driver's simulated ride contains similar data, depending on the scenario associated with each simulation. We categorize these data into three main groups: those directly associated with the vehicle, those about the surrounding environment, and the absolute location within the simulation. Internal data are extracted from sensors installed on the vehicle and relate to the vehicle itself, such as speed, acceleration, gas/brake pedal, and steering wheel angle. External data include distance to lane markers and distance to other vehicles. These data columns are manually separated into a number of groups; for example, the three axes of acceleration are under one group. Categorizing available data helps us understand the effectiveness of different data sources. Each group contributes differently to model accuracy, as demonstrated in Section 5.7.

2.1. Preprocessing

The provided dataset has four area IDs (-1, 0, 1, 2), and area 2 corresponds to a combined urban and highway area. Hence, we further split area 2 into two separate areas with distinct labels to accommodate all the different area types.

Since sensor readings are based on different units and scales, it is important to normalize their values. To avoid building a "road2vec", local conditions are removed by creating an evenly spaced grid with 5,000 micro-regions according to (x, y, z) coordinates. Values are normalized to have zero mean and unit standard deviation within the corresponding grid cell, computed across all available training samples falling in the same local grid.
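An illustrative sketch of this grid-based normalization, assuming a pandas DataFrame with (x, y, z) coordinate columns; the 17-bin-per-axis grid (about 4,900 cells) and all names are our own choices, not the exact implementation:

```python
import pandas as pd

def grid_normalize(df, coord_cols=("x", "y", "z"), n_bins=17, feature_cols=None):
    """Bin (x, y, z) into roughly 5,000 micro-regions (17^3 = 4,913 cells here)
    and z-score every sensor column within its own region."""
    bins = [pd.cut(df[c], n_bins, labels=False) for c in coord_cols]
    region = bins[0] * n_bins * n_bins + bins[1] * n_bins + bins[2]
    feature_cols = feature_cols or [c for c in df.columns if c not in coord_cols]
    grouped = df[feature_cols].groupby(region)
    # zero mean and unit standard deviation within each local grid cell
    return (df[feature_cols] - grouped.transform("mean")) / (grouped.transform("std") + 1e-8)
```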

2.2. Data Split

Since each area ID represents a distinct type of road (tutorial, urban, suburban and highway), an 8:1:1 train:eval:test split is used within each area. This splitting method guarantees that the train, eval and test sets come from disjoint data, helping the trained model generalize.

In this work, gaps are introduced between windows to lessen the effect of overlap, which in turn decreases the correlation among input samples. A random offset of 1/4 of the gap length is introduced to make the dataset slightly different, as a form of data augmentation.
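A sketch of the windowing scheme under the interpretation that the window gap is the stride between window start positions; the function name, defaults and this stride interpretation are assumptions rather than the exact implementation:

```python
import numpy as np

def make_windows(seq, window=1000, gap=200, rng=None):
    """Extract overlapping fixed-length windows whose start positions are
    'gap' samples apart, each shifted by a random offset of up to 1/4 gap."""
    rng = rng or np.random.default_rng()
    windows, start = [], 0
    while start + window <= len(seq):
        offset = int(rng.integers(0, gap // 4 + 1))   # random jitter <= gap/4
        if start + offset + window <= len(seq):
            windows.append(seq[start + offset:start + offset + window])
        start += gap                                   # stride between windows
    return np.stack(windows)
```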

3. Related Work

Driver identification has already been investigated in [2]. In that work, Hallac et al. used decision trees on derived wavelet features of vehicle data to identify drivers using data from a single turn, achieving 76.9% overall 2-way accuracy on selected turns. Sensor importance was also analyzed, but no particular sensor was found to be significantly more important than others. The paper also highlights the difficulty of identifying drivers on highways. Miyajima et al. [3] demonstrate that a Gaussian Mixture Model (GMM) with a smaller set of sensors can achieve 76.8% accuracy after employing cepstrum transforms. Although they use simple machine learning models, these two works demonstrate the effectiveness of wavelet features.

Prediction of future values is also commonly used to train embedding models. Hallac et al. [4] demonstrated that a driving embedding can be learned by predicting sensor values in the immediate and distant future. In that work, a GRU-based recurrent neural network is used, leading to F1 scores much higher than random. Downstream tasks utilizing this embedding include predicting sensor values and collision risk assessment.

Aside from direct applications of neural networks to driving embeddings, methods employed for natural language and speech processing are also worth investigating. As in [5], driver embedding generation should be similar to speaker embedding generation for diarization. Triplet-loss networks had previously been used mostly for face identification, as in [6], but Song et al.'s work [5] shows that the method is also effective on time series data trained with recurrent structures.

Most importantly, as outlined in [4], many potential downstream tasks exist for an effective driver embedding. These include, but are not limited to, future action prediction and risk assessment.

Our work is the first to employ advanced deep learning techniques for driver embedding generation. Our model demonstrates that a driver embedding can be learned with a triplet loss trained jointly with driver identification (cross-entropy loss). With this high-quality embedding, our work is also the first to demonstrate good results on numerous tasks involving analysis of driver behaviors.

4. Methodology

Our embedding generation model is built by combining the Haar wavelet, TCN, triplet loss and LightGBM. This section explains the key ideas behind each component. The complete model architecture is illustrated in Section 4.5.

4.1. TCN

The temporal convolutional network (TCN), as described in [7], uses dilated causal convolutions to support variable-length input sequences while maintaining RNN-like behavior. This architecture has been shown to train faster and perform better than recurrent models of comparable complexity. For the first point, the TCN uses a fully-convolutional network (FCN) architecture [8], where each hidden layer is the same length as the input layer, and zero padding of length (kernel size − 1) is added to keep subsequent layers the same length as previous ones. For the second point, the TCN uses causal convolutions: convolutions where the output at time t is convolved only with elements from time t and earlier in the previous layer.

However, a simple causal convolution is only able to look back at a history whose size is linear in the depth of the network. The solution is to introduce dilated convolutions, which enable an exponentially large receptive field [9], as shown in Figure 1. More formally, for a 1-D sequence input $x \in \mathbb{R}^n$ and a filter $f : \{0, \ldots, k-1\} \to \mathbb{R}$, the dilated convolution operation $F$ on element $s$ of the sequence is defined as

$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$$

where $d$ is the dilation factor, $k$ is the filter size, and $s - d \cdot i$ accounts for the direction of the past.
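As an illustration, a direct (unoptimized) implementation of this formula might look like the following sketch, with out-of-range indices treated as zero padding:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i] for every position s,
    treating indices before the start of the sequence as zeros."""
    k, n = len(f), len(x)
    out = np.zeros(n)
    for s in range(n):
        for i in range(k):
            j = s - d * i            # look back d*i steps: causal and dilated
            if j >= 0:
                out[s] += f[i] * x[j]
    return out
```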

Residual blocks, as shown in Figure 1, are employed to stabilize the TCN. The residual block for our baseline TCN is shown in Figure 1(b). Within a residual block, the TCN has two layers of dilated causal convolution and ReLU. For normalization, we applied weight normalization [10] to the convolutional filters. In addition, a spatial dropout is added after each dilated convolution for regularization: at each training step, a whole channel is zeroed out.

Figure 1: Structure of the temporal convolutional network [7]

4.2. Triplet Loss

Figure 2: Illustration of the relationship between embeddings in a triplet loss computation.

[5] explored the use of joint representation learning and similarity metric learning with triplet loss for speaker diarization. In particular, the model leveraged attention networks to model the temporal characteristics of speech segments (see Figure 2). The representations from the deep attention model (Figure 3) were then used to learn a similarity metric with the triplet ranking loss.

Figure 3: Illustration of the attention model used for computing embeddings from features of speech segments.

In a triplet network, each input is constructed as a set of three samples $x = (x_p, x_r, x_n)$, where $x_r$ denotes an anchor, $x_p$ denotes a positive sample belonging to the same driver as $x_r$, and $x_n$ a negative sample from a different driver. Each of the samples in $x$ is processed by the attention model $A$, and distances are computed in the resulting latent space:

$$D^2_{rp} = \|A(x_r) - A(x_p)\|^2$$
$$D^2_{rn} = \|A(x_r) - A(x_n)\|^2$$

The triplet loss is defined as $l(x_p, x_r, x_n) = \max(0, D^2_{rp} - D^2_{rn} + \alpha)$, where $\alpha$ is the margin. In other words, the objective is to achieve $D^2_{rn} \geq D^2_{rp} + \alpha$.
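A minimal sketch of this loss on precomputed embeddings; the variable names are ours, and the default margin of 1 simply mirrors the best model reported in Section 5.1:

```python
import numpy as np

def triplet_loss(e_r, e_p, e_n, margin=1.0):
    """Triplet ranking loss on embeddings A(x_r), A(x_p), A(x_n)."""
    d_rp = np.sum((e_r - e_p) ** 2)   # squared distance anchor-positive
    d_rn = np.sum((e_r - e_n) ** 2)   # squared distance anchor-negative
    return max(0.0, d_rp - d_rn + margin)
```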

4.3. LightGBM

Gradient Boosting Decision Tree (GBDT) is a widely used machine learning algorithm due to its accuracy and interpretability. GBDT is an ensemble of decision trees trained in sequence, where each tree learns by fitting the negative gradients. GBDT achieves strong performance on tasks such as multi-class classification and has popular implementations such as XGBoost. However, traditional implementations face a trade-off between efficiency and accuracy. Microsoft developed LightGBM [11] to address this, introducing two key techniques:

• Gradient-based One-Side Sampling (GOSS)

GOSS excludes data instances with small gradients and uses only the rest to estimate the information gain. It keeps all instances with large gradients and performs random sampling on instances with small gradients. To compensate for the effect on the data distribution, GOSS introduces a constant multiplier for the data instances with small gradients when computing the information gain.

• Exclusive Feature Bundling (EFB)

EFB bundles mutually exclusive features (i.e., features that rarely take nonzero values simultaneously), which are often sparse, to reduce the number of features.

In our model, LightGBM is used to supplement the triplet loss. Instead of using a fully connected linear layer and cross-entropy loss to train the neural network directly on the driver classification task, we use LightGBM, which takes the driver embeddings generated with triplet loss as inputs.
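As an illustration, training the driver-identification classifier on the embeddings could look like the sketch below; the stand-in data and default hyperparameters are our own, not the tuned configuration:

```python
import numpy as np
import lightgbm as lgb

# Hypothetical stand-in data: 62-dimensional embeddings for 51 drivers.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 62))
train_driver = rng.integers(0, 51, size=1000)

clf = lgb.LGBMClassifier()                      # multi-class objective inferred from labels
clf.fit(train_emb, train_driver)
driver_scores = clf.predict_proba(train_emb)    # per-driver probabilities, shape (1000, 51)
```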

4.4. Haar Wavelet

To capture the spectral component of the features, we borrowed the idea from [2] and computed the discrete wavelet transform (DWT) [12] in our feature engineering process. We used the Haar wavelet transformation, with the Haar matrix

$$H = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix}$$

For any sequence $(a_0, a_1, \ldots, a_{2n}, a_{2n+1})$ of even length, we can break it into a sequence of two-component vectors $((a_0, a_1), \ldots, (a_{2n}, a_{2n+1}))$. We then right-multiply each vector with the matrix $H$ to get the result $((s_0, d_0), \ldots, (s_n, d_n))$. The $s_i$'s and $d_i$'s are the approximation and detail coefficients respectively, which together can be used to reconstruct the original signal.
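One level of this transform follows directly from the matrix above; the sketch below is illustrative rather than the exact feature-engineering code:

```python
import numpy as np

def haar_step(a):
    """One level of the Haar DWT on an even-length array: right-multiplying
    each pair (a_{2i}, a_{2i+1}) by H yields approximation and detail coeffs."""
    pairs = np.asarray(a).reshape(-1, 2)
    s = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation coefficients
    d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # detail coefficients
    return s, d
```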

4.5. Model Architecture

Figure 4 shows an overview of our proposed TCN+Triplet+Wavelet model. With an original sequence of length 1000 (10 seconds) and 31 features, we use 8 TCN layers with 32 output channels and a kernel size of 16 for the TCN's receptive field. The corresponding embedding size for this structure is 32. In parallel, we extract Haar wavelet features and send the two output vectors through fully connected layers to obtain two vectors of size 15 each. As the Haar wavelet already contains information in the frequency domain, we believe an FC layer is sufficient. These are then concatenated with the TCN output to form an embedding of size 62.

As shown, this structure is executed two additional times during training. The first additional pass (second overall) processes a sequence from the same driver, while the second additional pass (third overall) processes a sequence from a different driver. The original embedding and the two additional embeddings are used to compute the triplet loss discussed in Section 4.2. For most downstream tasks, LightGBM is trained with different ground truth labels (driver identification, collision prediction) but in general with the same input embeddings.
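The sketch below outlines the shape of this embedding head only; the TCN backbone is assumed as a given module, the class and argument names are our own, and the wavelet feature dimension is left as a parameter since it depends on how the per-channel coefficients are flattened:

```python
import torch
import torch.nn as nn

class Driver2VecHead(nn.Module):
    """Concatenate the last TCN time step with two 15-unit projections of the
    Haar approximation/detail features to form a 62-dimensional embedding."""
    def __init__(self, tcn, wavelet_dim):
        super().__init__()
        self.tcn = tcn                              # assumed TCN, 32 output channels
        self.fc_approx = nn.Linear(wavelet_dim, 15)
        self.fc_detail = nn.Linear(wavelet_dim, 15)

    def forward(self, x, haar_s, haar_d):
        h = self.tcn(x)[:, :, -1]                   # last time step -> (batch, 32)
        w = torch.cat([self.fc_approx(haar_s), self.fc_detail(haar_d)], dim=1)
        return torch.cat([h, w], dim=1)             # (batch, 62) embedding
```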

We compare our model's performance relative to the baseline model and study variations of hyperparameters (e.g. features, window size, gaps between adjacent windows, embedding size) in Section 5.

4.6. Evaluation Metrics

4.6.1 One-hot prediction

Logits needed for one-hot prediction can be generated either from fully connected layers after embedding generation or from a multi-class LightGBM classifier. These two methods of generating logits can be evaluated with the same metrics. A confusion matrix can subsequently be generated for visual examination of model behavior.

4.6.2 P-way accuracy

In real life, a vehicle is usually shared among only a few drivers, so a model deployed in the real world should identify a driver from a small set of drivers, while the model can be trained on all available drivers. Hence, we report average pairwise accuracy (p = 2) and p-way average accuracy over combinations of p drivers for p = 3, 4, 5.

There are two stages in computing p-way accuracy; a short code sketch follows the list below:

1. Mask generation

For p = 2, 3, for any given driver, we enumerated all $\binom{50}{p-1}$ possible subsets of size p that contain the ground truth (the ground truth is always unmasked) to generate masks. Drivers in these subsets are considered unmasked.

For p = 4, 5, we sampled 4000 of all possible subsets, since $\binom{50}{3}$ and $\binom{50}{4}$ are too large to fit into memory.

2. Evaluation time

Unlike 51-way prediction, where we take the argmax over all drivers' scores, in p-way prediction we only take the argmax over the unmasked drivers (the p drivers) as the final prediction.

We take the average result over all such size-p combinations as the p-way accuracy for this use case.
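The following sketch illustrates the p-way evaluation for small p, assuming `probs` holds per-window classifier probabilities over all drivers and `labels` the ground-truth driver IDs; it enumerates subsets exhaustively rather than sampling as done for p = 4, 5:

```python
import numpy as np
from itertools import combinations

def p_way_accuracy(probs, labels, p=2, n_drivers=51):
    """Average accuracy over all size-p candidate sets containing the ground
    truth, taking the argmax only among the unmasked (candidate) drivers."""
    correct, total = 0, 0
    for prob, y in zip(probs, labels):
        others = [d for d in range(n_drivers) if d != y]
        for subset in combinations(others, p - 1):
            candidates = (y,) + subset
            pred = candidates[int(np.argmax(prob[list(candidates)]))]
            correct += int(pred == y)
            total += 1
    return correct / total
```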

4.6.3 Area specific p-way accuracy

We repeated Section 4.6.2 for each area by only taking embeddings from the particular area ID.

4.6.4 p-way accuracy with none-of-the-above option

On top of Section 4.6.2, we introduce a "none-of-the-above" option at evaluation time. The following changes are made at each stage compared to normal p-way accuracy.



Figure 4: Model architecture of TCN+Triplet+Wavelet

1. Mask generation

For p = 2, 3, for any given driver, we enumerated all $\binom{50}{p-1}$ subsets that contain the ground truth and sampled $\binom{50}{p-1}$ subsets that do not contain the ground truth (out of all $\binom{51}{p}$ possible subsets) as masks. Drivers in these subsets are considered unmasked.

For p = 4, 5, we sampled 2000 of all possible subsets that contain the ground truth, since $\binom{50}{3}$ and $\binom{50}{4}$ are too large to fit into memory. Similarly, we generated 2000 subsets that do not contain the ground truth.

2. Evaluation time

Unlike normal p-way prediction, where we take the argmax over the unmasked drivers (the p drivers) as the final prediction, with the none-of-the-above option we introduce a threshold. If the probabilities of the unmasked drivers sum to less than the threshold, none-of-the-above becomes the final prediction. In this case, if the ground truth is indeed not in the unmasked subset, the prediction is considered correct. A short sketch of this thresholding rule follows this list.

This setting introduces 50% "noise" at evaluation time, making it a valid measurement of the robustness of the model.
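A minimal sketch of the thresholding rule, assuming `prob` is the classifier's probability vector over all 51 drivers and `unmasked` is the set of p candidate drivers (names are illustrative):

```python
def nota_predict(prob, unmasked, threshold):
    """Return None ("none-of-the-above") when the probability mass on the
    unmasked drivers is below the threshold; otherwise the best candidate."""
    if sum(prob[d] for d in unmasked) < threshold:
        return None
    return max(unmasked, key=lambda d: prob[d])
```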

5. Results and Analysis

Experiments are performed to evaluate our model's capability to identify drivers and to evaluate the effectiveness of the generated embeddings for subsequent tasks. Most of the experiments are executed on machines with K80 and T4 GPUs on Google Cloud. Experiment logs are stored on Google Cloud Filestore for shared accessibility and to support an always-on TensorBoard server.

5.1. Best Model and Hyper-parameters

Pairwise accuracy (p = 2) on eval data is used as the selection criterion for the best model, which is found through a grid hyperparameter search. The search space includes learning rate, decay ratio, embedding size, kernel size, triplet margin and triplet weight. Window size is fixed at 10 seconds and window gap at 2 seconds. Models are run for 9000 steps, slightly past the point where they show signs of overfitting.

The best model uses learning rate = $4 \times 10^{-4}$, decay = 0.975, embedding size = 32, kernel size = 16 and triplet margin = 1. This set of parameters leads to a 2-way evaluation accuracy of 81.8% and test accuracy of 83.1%.

5.2. Confusion Matrix

Figure 5 shows the confusion matrix produced by the best model on evaluation data. There is a clear diagonal, meaning that the model is only uncertain among a small group (two or three) of alternative choices, which may in fact reflect similarity among drivers. This observation is consistent with the high performance on the subsequent p-way driver identification tasks.

Figure 5: Confusion matrix of the best model on evaluation data

5.3. P-way Predictions and "None-of-the-above"

         Without N-o-t-a (%)   With 50% N-o-t-a (%)
2-way    83.1                  67.8
3-way    73.8                  62.2
4-way    67.3                  59.1
5-way    62.5                  56.3

Table 1: p-way accuracy with and without none-of-the-above (N-o-t-a) noise on test data

The model remains robust even with 50% noise, i.e., when the unmasked p drivers do not contain the ground truth. A threshold search was performed on evaluation data for each p = 2, 3, 4, 5. Table 1 shows the normal p-way accuracy and the p-way accuracy calculated using 50% "none-of-the-above" noisy masks, using the best threshold. The model is robust enough to make "none-of-the-above" predictions at evaluation time. In addition, as p increases, the reduction in accuracy shrinks, which is expected behavior.

5.4. Performance on Different Road Areas

Table 2 shows the 2-way accuracy on evaluation and test data for each area type. Apart from the tutorial area, drivers are most accurately distinguished in urban areas.

Driving Area Type   2-way Accuracy Eval (%)   2-way Accuracy Test (%)
Highway             78.4                      81.1
Suburban            70.7                      81.9
Urban               78.9                      82.4
Tutorial            80.4                      84.7

Table 2: Model performance by driving area type

5.5. Window Size and Window Gap

The purpose of studying the effect of window size and window gap is to find the shortest window length that achieves reasonable performance.

Window size refers to the length of the input sequence. Window gap is defined as the number of time stamps between neighbouring windows. A larger window gap (less overlap) makes the sequences less correlated, but it leads to a smaller effective dataset size due to less overlapping data.

Figure 6: Effect of window size on 2-way accuracy on evaluation data with a 2-second window gap

Figure 7: Effect of window gap on 2-way accuracy on eval data with a 10-second window size

Figure 6 shows that a longer time window leads to significantly better accuracy. With 30-second sequences, our model achieves over 90% 2-way accuracy on eval data. However, windows that are too long leave limited data available and are of little practical use.



Figure 7 shows that as the gap between adjacent sequences becomes smaller (more overlapping data), model performance improves. However, decreasing the window gap linearly increases training time while the improvement in performance is limited.

5.6. Embedding Size

TCN embedding size (Total embedding size)   2-way Accuracy Eval (%)
16 (46)                                     79.7
32 (62)                                     81.8
64 (94)                                     79.4

Table 3: 2-way accuracy generated by LightGBM on eval data using different embedding sizes

Table 3 illustrates the effect of embedding size, varied by changing the TCN layers' output channel size. The total embedding size is the combination of the TCN embedding size and the wavelet embedding size, which is fixed at 30.

5.7. Feature Importance

Available Sensor Groups              2-way Accuracy Eval (%)
all non-weather related included     81.8
speed, acceleration only             66.3
no stop sign/following distance      74.6
no lane distance                     77.8
no gas/brake pedal                   78.1
no speed                             78.8
no gear box                          79.0
no acceleration                      79.1
no steering wheel angle              79.2
no turn indicators                   79.3

Table 4: Feature ablation and 2-way accuracy on evaluation data

The dataset contains weather-related features such as rain, snow, windshield wipers, fog and fog lights. There is a significant drop in model performance once weather-related features are removed. However, it is important to prevent the model from overfitting to these exogenous variables. Hence, weather-related variables were excluded from all subsequent models.

Table 4 shows the results after removing weather-related features: the remaining variables are grouped, and a particular group is removed in each experiment. The row "speed, acceleration only" refers to keeping only speed- and acceleration-related data. These data can be captured via smartphone or GPS devices alone, without installing additional sensors. The observed drop in performance suggests that many other aspects of driving are also critical for analyzing driving style.

Among the other features, the group with the highest predictive power is stop sign/following distance, which captures the distance from the driver's vehicle to a stop sign or the vehicle ahead. Removing this group results in a 7.2% decrease in 2-way accuracy, while removing other groups results in performance drops of about 2-4% on evaluation data. We hypothesize that features such as gear box, gas/brake pedal, speed and acceleration are highly correlated, so the model successfully recovers the missing information even when one of these groups is removed.

5.8. Model Ablation

Model                  2-way Accuracy Eval (%)
RNN CE                 72.3
RNN Wavelet CE         78.4
RNN Triplet            70.7
RNN Triplet Wavelet    78.9
TCN CE                 80.4
TCN Wavelet CE         80.0
TCN Triplet            75.4
TCN Triplet Wavelet    81.8

Table 5: Model performance with different combinations of components (models with "CE" are trained with cross-entropy loss)

Our baseline model is an RNN (LSTM) with a fully connected linear layer as the last layer, which conducts 51-way driver classification using cross-entropy loss. We introduced the TCN as a replacement for the LSTM to handle the sequence data. Incrementally, we replaced the cross-entropy loss with the triplet loss setup and added wavelet features to the embeddings, arriving at the architecture illustrated in Figure 4.

Since the TCN, triplet loss and wavelet features were all introduced into the best model, an ablation study (Table 5) was conducted to analyze the marginal effect of each component on eval data.

6. Downstream Tasks

Multiple downstream tasks and real-life applications can be built upon the generated driver embeddings.

6.1. Driver Identification

The major application of our model is to differentiate the driver behind the wheel from only a short interval of the driving sequence, for family cars or shared cars (e.g. Zipcar, rental cars). The results in Sections 5.2 to 5.4 show that our model is able to distinguish drivers in this use case, even with "none-of-the-above" noise, which is likely to occur often in real life and allows the system to alert when an infrequent user of the car is driving.

Figure 8: t-SNE visualization with 6 sampled drivers. The remaining 45 drivers are colored gray.

6.2. Data Visualization

Driver embeddings are projected onto a 2-dimensional latent space using t-SNE. In Figure 8, a number of drivers' embeddings are colored while those of the other drivers are in light grey. Although scattered, each driver's embeddings are only present in a portion of the graph and have little overlap with embeddings from other drivers. The degree of scatter can potentially serve as a measure of consistency in driving style.

6.3. Driver Safety Score

Lastly, we generate a safety score to assess a driver's safety level based on a segment of his or her driving record, which could potentially be very useful for insurance companies.

Embeddings are first labelled according to whether the driver has a collision history. We then train a binary classifier using LightGBM to predict the probability of an embedding coming from a driver with a collision history. This classifier achieves a cross-validated AUC of 0.7 on the evaluation set. To calculate the driver safety score, we feed each driver's embeddings into the classifier and, using a threshold of 0.5, measure the percentage of that driver's embeddings which exceed the threshold. We define this percentage as the driver's safety score.
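A minimal sketch of this score computation, reusing a trained binary collision classifier with a scikit-learn-style predict_proba interface (function and variable names are illustrative):

```python
import numpy as np

def safety_score(driver_embeddings, collision_clf, threshold=0.5):
    """Fraction of a driver's window embeddings that the binary collision
    classifier scores above the threshold, reported as the safety score."""
    p_collision = collision_clf.predict_proba(driver_embeddings)[:, 1]
    return float(np.mean(p_collision > threshold))
```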

To compare with the original data, drivers are sorted by their safety score. The top 29 are labelled as "collision" (dangerous) and the bottom 22 drivers are labelled as "non-collision" (safe); we picked 29 and 22 to match the number of dangerous drivers in the original dataset. A confusion matrix for the test set is shown in Figure 9. We obtained a false positive rate of 31% and a false negative rate of 24%, which is a reasonable result given the limited number of drivers available.

7. Future Work

As discussed in Section 3, embedding generation from driving behavior can leverage techniques currently used for facial recognition and speaker diarization. Song et al.'s work [5] discusses the use of attention and Gaussian Mixture Models; these can be integrated into the current model architecture and evaluation framework with relative ease. Furthermore, Snell et al.'s Prototypical Network [13] demonstrates additional ways to improve the quality of driving embedding clusters. For real-world deployment, where complete data and accurate labels are expensive to obtain, unsupervised and semi-supervised methods are worth exploring. These require prior knowledge of embeddings of certain driving habits, much like the UBM (Universal Background Model) used for speaker identification tasks.



Figure 9: Driver safety score confusion matrix

8. Conclusion

This paper proposes a deep learning architecture to convert driver behavior into an embedding. Our proposed model combines a temporal convolutional network with triplet loss. Performance on the core task of pairwise driver identification is 81.8%, beating the RNN implementation's 72.3%. With a high-quality embedding, we graphically demonstrate that embeddings from the same driver form acceptable clusters. We are also able to classify whether a driver is collision-free from sensor data with more than 0.7 AUC. To summarize, the effectiveness of this deep learning model makes it a great starting point for applications that depend on driving style analytics.

References

[1] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.

[2] D. Hallac, A. Sharang, R. Stahlmann, A. Lamprecht, M. Huber, M. Roehder, R. Sosic, and J. Leskovec, "Driver identification using automobile sensor data from a single turn," CoRR, vol. abs/1708.04636, 2017.

[3] T. Wakita, K. Ozawa, C. Miyajima, K. Igarashi, K. Itou, K. Takeda, and F. Itakura, "Driver identification using driving behavior signals," IEICE Transactions on Information and Systems, vol. 89, no. 3, pp. 1188–1194, 2006.

[4] D. Hallac, S. Bhooshan, M. Chen, K. Abida, R. Sosic, and J. Leskovec, "Drive2vec: Multiscale state-space embedding of vehicular sensor data," 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3233–3238, 2018.

[5] H. Song, M. Willi, J. J. Thiagarajan, V. Berisha, and A. Spanias, "Triplet network with attention for speaker diarization," 2018.

[6] J. Deng, Y. Zhou, and S. Zafeiriou, "Marginal loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 60–68, 2017.

[7] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.

[8] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

[9] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.

[10] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, pp. 901–909, 2016.

[11] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, 2017.

[12] A. Jensen and A. la Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform. Springer Science & Business Media, 2001.

[13] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4077–4087, Curran Associates, Inc., 2017.
