
Rising Stars track PervasiveHealth’19, May 20–23, 2019, Trento, Italy

Passive mobile sensing and psychological traits for large scale mood prediction

Dimitris Spathis
University of Cambridge, UK
Dept. of Comp. Sci. & Tech.
[email protected]

Sandra Servia-Rodriguez
University of Cambridge, UK
Dept. of Comp. Sci. & Tech.
[email protected]

Katayoun Farrahi
University of Southampton, UK
Dept. of Electr. & Comp. Sci.
[email protected]

Cecilia Mascolo
University of Cambridge, UK
Dept. of Comp. Sci. & Tech.
[email protected]

Jason Rentfrow
University of Cambridge, UK
Dept. of Psychology
[email protected]

ABSTRACT
Experience sampling has long been the established method to sample people's mood in order to assess their mental state. Smartphones are starting to be used as experience sampling tools for mental health, as they accompany individuals during their day and can therefore gather in-the-moment data. However, the granularity of the data needs to be traded off against the level of interruption these tools introduce. As a consequence, the data collected with this technique is often sparse. This has been obviated by the use of passive sensing in addition to mood reports; however, this adds additional noise.

In this paper we show that psychological traits collected through one-off questionnaires, combined with passively collected sensing data (movement from the accelerometer and noise levels from the microphone), can be used to detect individuals whose general mood deviates from the common relaxed characteristic of the general population. By using the reported mood as a classification target, we show how to design models that depend only on passive sensors and one-off questionnaires, without bothering users with tedious experience sampling. We validate our approach using a large dataset of mood reports and passive sensing data collected in the wild from tens of thousands of participants, finding that the combination of these modalities achieves the best classification performance, and that passive sensing yields a +5% boost in accuracy. We also show that sensor data collected over a week performs better than single days for this task. We discuss feature extraction techniques and appropriate classifiers for this kind of multimodal data, as well as overfitting shortcomings of using deep learning to handle static and dynamic features. We believe these findings have significant implications for mobile health applications that can benefit from the correct modeling of passive sensing along with extra user metadata.

KEYWORDS
machine learning, mobile sensing, multi-modal data, mood prediction

ACM Reference Format:
Dimitris Spathis, Sandra Servia-Rodriguez, Katayoun Farrahi, Cecilia Mascolo, and Jason Rentfrow. 2019. Passive mobile sensing and psychological traits for large scale mood prediction. In The 13th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth'19), May 20–23, 2019, Trento, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3329189.3329213

PervasiveHealth'19, May 20–23, 2019, Trento, Italy
2019. ACM ISBN 978-1-4503-6126-2/19/05…$15.00
https://doi.org/10.1145/3329189.3329213

1 INTRODUCTION
Experience sampling, which involves asking participants to report on their behaviors or environment on repeated occasions over time, has long been used as a mechanism to longitudinally assess the mental health of individuals by prompting them to report their mood using questionnaires, traditionally delivered through pen and paper but also through the web. Psychologists have used different tools or scales that help users assess their mood. These include the Positive and Negative Affect Schedule (PANAS) [46], a self-report questionnaire of two 10-item scales that measures both positive and negative affect; and the Affect Grid [32], a 2-dimensional grid where the x-axis indicates the feeling in terms of its positiveness or negativeness, while the y-axis indicates its intensity. Independently of the scale used, timely and accurate mood reporting is important to anticipate clinical outcomes such as depression [7], longevity [43] or mortality [1].

The pervasiveness of smartphones and wearable devices has enabled timely delivery of experience sampling [10], allowing near real-time detection of clinical outcomes and relapses. This has led to the development of several mobile phone applications that prompt their users to assess and report their mood one or more times per day, using one or more different scales [35, 37]. Apart from potentially inducing biases in the measurements, interrupting users during their daily lives at a high frequency and with the same purpose is seen as a high burden by many users [26], as evidenced by the high dropout rates reported in these applications. Indeed, according to recent statistics, more than two thirds of people who download a mobile health app use it only once [21].

Previous research has pointed out the link between self-reported mood and some personality traits such as emotional stability [8, 11]. Exploiting this link to track mental health would mitigate users' burden, as assessing their personality as well as other psychological traits would only require one-off questionnaires. At the same time, personal mobile devices also come equipped with a growing set of built-in sensors, such as an accelerometer, microphone and gyroscope. A proper and rigorous analysis of the data passively collected with these sensors provides valuable insights into users' physical behaviour [2], but could also act as a proxy of their mental health [44]. However, how to use psychological traits and passive sensing data to accurately track mental health is still an open research question. Also, the use of low sampling rates for passive sensing data collection, due to battery consumption issues, often leads to very sparse sensing data, which adds to the challenge.



Figure 1: Aggregate of 735,778 self-reported mood scores in the Emotionsense dataset, collected from 17,251 users. (a) Heatmap of valence and arousal: most users report neutral (around 0.5, 0.5) and calm-happy (down-right quadrant) mood on the affect grid. (b) Distribution of valence and arousal: the two multi-modal distributions of the mood (Pearson r = -0.23, p < 0.00001).

Figure 2: The mood tracking application. Users can report their mood in an affect grid, and complete personality and other questionnaires.

The penetration of mobile devices has also introduced scale: many more individuals can now be reached and assessed. For example, in a hospital environment, mobile experience sampling enabled the collection of 11,381 survey responses over a 12-month period from 304 physicians and nurses, completed with minimal initial training [39]. Mobile sensors enable researchers to collect not only the explicit reports of the participants, but also the context in which these answers were provided. Indeed, a recent survey of 110 papers concluded that a total of 70 studies (63.6%) passively or actively collected sensor data from the participants' study device [41]. On a larger scale, Utsureko [37] and Emotionsense [35], two different smartphone applications for mood monitoring through self-reports, were used by more than 24,000 and 17,000 users, respectively. However, most studies investigating the use of smartphones to track and improve mental health and well-being have been conducted through controlled experiments, with limited numbers of participants and observations [19, 22, 33, 45]. Conducting such studies in the wild would allow reaching many more participants, broadening the significance of the findings. However, the absence of rigid control over participation, and the limited mechanisms to promote engagement, make the data collected noisier and sparser than in controlled setups, and it is unclear whether previous findings and methodologies can be transferred to these large natural datasets. Robust methodologies for anticipating clinical outcomes and relapses using very sparse data are key to the widespread adoption of smartphones as tools to provide mental health support.

Mobile sensing applications often require inputs from sensors in the form of high-dimensional time-series coming from accelerometers, gyroscopes, microphones or other user-generated data [18]. However, these sensor measurements are quite noisy and, although simple first-order features have proved effective for some purposes, it is not straightforward to select robust features across different noise levels of individual user behaviors, since every user introduces a different level of noise according to their device, environment, etc. For example, the MoodExplorer study [47] extracted the mean, variance, and signal-to-noise ratio from the microphone sensor, while the Emotionsense study [35] calculated the standard deviation of the magnitude of acceleration from the three axes (x, y, z) of the accelerometer. Noise in mobile measurements is hard to model because it is correlated over time [29] and presents a non-linear structure [3].
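As an illustration, the kinds of first-order features mentioned above can be computed in a few lines. This is a hedged sketch in the spirit of [35] and [47], not the exact code used in those studies; the function and feature names are ours.

```python
import numpy as np

def accel_features(x, y, z):
    """Mean and standard deviation of the acceleration magnitude
    computed from the three axes, in the spirit of [35]."""
    mag = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    return {"mag_mean": float(mag.mean()), "mag_std": float(mag.std())}

def mic_features(amplitude):
    """Mean, variance and a simple signal-to-noise proxy (mean / std)
    of the microphone amplitude, loosely following [47]."""
    mu, sd = float(amplitude.mean()), float(amplitude.std())
    return {"amp_mean": mu, "amp_var": sd ** 2,
            "amp_snr": mu / sd if sd > 0 else float("inf")}
```

Each function consumes one short sensing window (e.g. a 5–10 second burst) and returns a flat feature dictionary that can be aggregated per day or per week.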

In this paper we investigate whether individuals' perceived mood can be obtained through their psychological traits, collected through one-off questionnaires, as well as passively collected mobile sensing data, thus avoiding sending frequent experience sampling questionnaires. More specifically, we investigate whether these psychological traits and passive sensing data can be used to detect individuals whose general mood deviates from the common relaxed mood distinctive among mentally healthy individuals [32]. To do so, we propose a machine learning methodology to classify individuals according to their general mood, which takes as inputs sparse answers to one-off surveys covering different profile-related characteristics of the individuals, as well as features extracted from noisy and sparse accelerometer and microphone sensor readings passively collected with their smartphones. We evaluate our methodology using a large-scale dataset of mobile sensing and self-reported data collected in the wild for more than 3 years, containing data from more than 17,000 participants. We conduct extensive experimentation by training over 100 models in order to find the best combination of modalities. We also conduct extensive first- and second-order feature extraction from the sensor time-series.

This paper makes the following contributions:

• We conducted an extensive data exploration of the self-reported moods provided by 17,251 users of an experience-sampling-based smartphone application, with the aim of identifying the most common reporting behaviour so as to characterize mentally healthy individuals in the context of our research. Our findings showed that the majority of the population in our dataset reported feeling, on average, relaxed (down-right side of the affect grid), which is in line with previous research [32].

• We provide a supervised learning methodology to detect individuals whose general mood deviates from the common relaxed mood distinctive among mentally healthy individuals [32]. Our methodology does not involve any kind of cumbersome experience sampling, but only uses one-off questionnaires (demographics, personality, etc.) as well as sparse and noisy passive sensing data collected with the accelerometer and microphone sensors of individuals' smartphones.

• We performed an extensive evaluation of our methodology using a large scale dataset collected in the wild. Our results showed that the combination of one-off questionnaires and passive sensing data gives the best performance in mood prediction. Indeed, by adding passive sensing data we achieve a +5% gain in accuracy (75% in absolute terms) with respect to only using questionnaires.

2 THE PROBLEM AND THE DATA
Mobile health applications aimed at assisting users with their mental health, so as to prevent clinical outcomes, should minimize the burden on the user in order to increase adherence and satisfaction with the app. Instead of the timely and continuous collection of mood self-reports, psychological traits obtained through one-off questionnaires, as well as passive sensing data, should be preferred in order to design effective and useful applications. Our aim in the rest of this paper is to investigate how psychological traits and passive sensing data can be used to detect individuals who might not feel mentally well, i.e., users who have reported moods that deviate from the general reports of the population.

To do so, we first conduct an exploratory analysis of the mood reports provided by more than 17,000 individuals over a period of more than 3 years, in order to identify the most common set of mental states (moods) reported by these individuals (Section 3). Given the scale and the in-the-wild nature of the data collection, we believe our results are general enough to be representative of the whole population. We then use these findings as the ground truth to validate our machine learning methodology to identify individuals whose record of reported moods deviates from that of the majority, by only using one-off questionnaires and passive sensing data (Section 4). We provide further details of the data used in our analysis and experiments in the rest of this section.

Figure 3: CCDF of the mood reports made by users during the time they were using the application. This includes (i) the self-reports actually done (done), (ii) those that users were prompted to report but did not (missed), and (iii) the sum of both (expected).

2.1 The data
We use the Emotion Sense dataset [35], which contains sensor and self-reported data collected with a mobile phone application for Android (Fig. 2) designed to study subjective well-being and behavior. From February 2013 until October 2016, this application collected 735,778 self-reports from 17,251 users, through surveys presented on the phone via experience sampling, and behavioral data from physical and software sensors in the phone (accelerometer, microphone, location, text messages, phone calls, etc.). The participants signed a consent form that restricts the use of the data to University of Cambridge researchers, in accordance with the Institutional Review Board (IRB). For this analysis, we consider self-reported mood collected graphically using the Affect Grid [32], profile-related surveys, as well as sensed data collected with the accelerometer and microphone sensors. Twice per day, between 8AM and 10PM and at least 120 minutes apart, participants received a notification asking them to report their mood in the affect grid (Figure 1). Meanwhile, sensed data were collected passively in the background at different moments during the day, depending on the version of the application. At different stages of the application, participants were requested to complete profile-related questionnaires covering a broad range of topics: demographics, personality, gratitude, health, sociability, job satisfaction, life aspirations and connectedness, where the questions were answered using Likert scales. Below we describe the specific data we use in our experiments.

Experience sampling. The Emotion Sense application for mood monitoring prompted its users to report, twice per day, how they felt using an Affect Grid scale. Figure 1 shows the aggregate of mood self-reports for all the users of the application, where the down-right quadrant, corresponding to relaxed mood, is the most densely populated, a result that matches previous studies in the area [32]. Due to the in-the-wild nature of the data collection, users did not always report their mood even when prompted to do so, which might be a consequence of the burden that experience sampling places on users. In more detail, Figure 3 shows the CCDF of moods reported per participant, including the ones they were expected to complete given the time they were using the app, the ones they were prompted to complete but did not (missed), and those that they actually did. Thus, alternatives to experience sampling are required to design effective, long-term, mobile health applications for mental health. As we will show later, by using the reported mood as a classification target we can design systems that depend only on passive sensors and one-off surveys.
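A complementary CDF like the one in Figure 3 can be computed directly from per-user report counts. The following is a small illustrative sketch; the variable names and sample counts are ours, not taken from the dataset.

```python
import numpy as np

def ccdf(counts):
    """Empirical CCDF: for each observed count x, the fraction of
    users with strictly more than x reports."""
    xs = np.sort(np.asarray(counts, dtype=float))
    ps = 1.0 - np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

# hypothetical per-user numbers of completed mood reports
reports_done = [3, 7, 10, 25, 60, 150, 400, 1200]
xs, ps = ccdf(reports_done)  # ps is non-increasing and ends at 0
```

Plotting `xs` against `ps` on a log-scaled x-axis reproduces the heavy-tailed shape typical of in-the-wild engagement data.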

One-off questionnaires. Previous research has found a link between self-reported mood and personality traits such as emotional stability [8, 11]. However, to the best of our knowledge, it is not yet clear how to use personality, and other psychological traits, to detect potentially mentally unhealthy individuals. In the Emotion Sense dataset, a subset of the users (12,106, 70% of the total) completed some one-off surveys providing information regarding their demographics, personality, gratitude, health, sociability, job satisfaction, life aspirations, connectedness, and satisfaction with life.

Passive sensing data. Data collected through the built-in accelerometer sensor of our smartphones provide valuable insights into our activity level throughout the day. At the same time, previous research has demonstrated the link between activity level and happiness [20, 35]. We hypothesize that our activity level throughout the day has a high impact on how we feel on that day, and therefore we also use these sensing data in our experiments. In the Emotion Sense dataset, accelerometer samples consist of (x, y, z) axis data (in m/s²) for periods of 5, 8 or 10 seconds, collected at different intervals throughout the day depending on the version of the application. Microphone samples, on the other hand, provide insights into the noise level in the user's environment. As with activity, we hypothesize that how we feel (our mood) influences/is influenced by the kind of places or environments we visit and the level of noise in these spaces; therefore we also use this in our experiments. To preserve privacy, the Emotion Sense application only recorded the amplitude level of noise, at 20Hz, for periods of 5, 8 or 10 seconds at different intervals throughout the day, depending on the version of the application.

Varied amounts of data are available for each of the sensors and self-reports, mainly due to the uncontrolled way in which users were recruited. Also, the in-the-wild nature of the data collection makes the available data noisy and sparse, which adds to the challenge. We present more details on how we dealt with this noisiness and sparseness, as well as on the number of participants and days of sensed and self-reported data used for each analysis, later in Sections 3 and 4.

3 FINDING GROUPS OF USERS FROM SELF-REPORTED MOOD TRAJECTORIES

The main goal of our research is to investigate whether psychological traits and passive sensing data can be used to identify users whose set of mood reports deviates from those of the general population, which might be indicative of some mental condition. Fig. 1a shows a visualization of the aggregated self-reports provided by the users in the Emotion Sense dataset, where the most common mood reported is in the down-right side of the affect grid, corresponding to the relaxed mental state. However, it is not clear where to fix the boundaries on the affect grid. We propose not to hard-code the thresholds, thereby potentially inducing biases in our labels, but instead to rely on clustering techniques so that labels naturally emerge from the data. The rest of this section describes in detail the methodology to label users as relaxed/non-relaxed in the Emotion Sense dataset.

Figure 4: Hierarchical clustering of the users (y-axis, only some user IDs are visible) and features (x-axis) extracted from their historical mood (_x = valence, _y = arousal). The colorbar represents the actual value of the feature.

3.1 Methodology
A mood self-report in the affect grid is described by means of two coordinates: the x-coordinate indicates the feeling in terms of its positiveness or negativeness, and the y-coordinate indicates the intensity of alertness. The history of mood reports of an individual consists of time-series trajectories of [x, y] tuples recorded over time in the affect grid. Also, the noise and sparseness of an in-the-wild setup mean that (i) the number of self-reports provided by different individuals might differ, and (ii) for a given individual, the reported moods might not be consecutive (as a consequence of users missing reports). In order to cope with this variability and obtain independent features that allow clustering algorithms to learn representative clusters, we extract 8 simple features for each axis or coordinate, namely count, mean, std, min, max and quantiles (25%, 50%, 75%), resulting in 16 final features for every user. Missing values are replaced with zeros and min-max [0,1] normalization is applied to the final features column-wise. Due to the sparsity of the mood reports and the power-law distribution of the counts, the two count features that measure non-missed reports are affected the most by the normalization, concentrating all their mass close to zero.
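The per-user feature extraction and normalization described above can be sketched as follows. This is an illustrative reimplementation under our own naming (column names `x`/`y` for valence/arousal), not the authors' exact code.

```python
import numpy as np
import pandas as pd

def user_features(reports: pd.DataFrame) -> pd.Series:
    """8 summary features per axis (count, mean, std, min, max and the
    25/50/75% quantiles) -> 16 features per user.  `reports` holds one
    row per self-report with columns 'x' (valence) and 'y' (arousal)."""
    feats = {}
    for axis in ("x", "y"):
        s = reports[axis]
        feats.update({
            f"count_{axis}": float(s.count()), f"mean_{axis}": s.mean(),
            f"std_{axis}": s.std(), f"min_{axis}": s.min(),
            f"max_{axis}": s.max(), f"25%_{axis}": s.quantile(0.25),
            f"50%_{axis}": s.quantile(0.50), f"75%_{axis}": s.quantile(0.75),
        })
    return pd.Series(feats)

def normalise(features: pd.DataFrame) -> pd.DataFrame:
    """Zero-fill missing values, then min-max scale each column to [0, 1]."""
    features = features.fillna(0.0)
    rng = (features.max() - features.min()).replace(0, 1.0)
    return (features - features.min()) / rng
```

Stacking `user_features` over all users produces the [users, 16] matrix that the clustering algorithms below operate on.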

We then apply the k-means [24] clustering algorithm to produce mutually exclusive clusters of spherical shape based on distance. In order to come up with the optimal number of clusters, we conduct the Elbow method [40], where we increase the number of clusters and observe the drop in the evaluation metric. Here, we use the silhouette metric [31], which measures how similar a sample is to its own cluster compared to other clusters.

Figure 5: Clustering the historical mood trajectories of 17,251 users (every dot is a user) into 2, 3, and 4 clusters: (a,b,c) parallel coordinate plots of the cluster centroids for each feature; (d,e,f) affect grid plots of the mean valence and arousal of the clustered users. The clusters of the first plot (d) are used as prediction labels for the mood classification task.

Figure 6: Elbow plot to determine the optimal number of clusters, estimated with the silhouette score.
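The elbow sweep over k can be written compactly with scikit-learn; the following is a minimal sketch (the function name and default parameters are ours), not the exact experimental code.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_sweep(X, k_range=range(2, 11), seed=0):
    """Fit k-means for each candidate k and record the silhouette
    score; the k with the highest score suggests the 'elbow'."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores
```

In our setting, `X` would be the [users, 16 features] matrix from Section 3.1.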

Other clustering algorithms might also be used. In fact, techniques such as hierarchical (agglomerative) clustering [17], applied to the [users, features] matrix, can be used to find partitions in the data, but also to uncover overlapping patterns between features.
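A hedged sketch of how such an agglomerative clustering could be run on the [users, features] matrix with SciPy (the function name is ours; clustering the transposed matrix groups features instead of users, as in the column dendrogram of Figure 4):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerative_labels(features: np.ndarray, n_clusters: int = 2):
    """Ward-linkage hierarchical clustering of the rows of `features`,
    cut into `n_clusters` flat clusters (labels are 1..n_clusters)."""
    Z = linkage(features, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Unlike k-means, the full linkage tree can be inspected at every merge level, which is what makes the overlapping feature groups in Figure 4 visible.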

3.2 Findings
We applied our methodology to identify non-relaxed users (those that deviate from the most common mood reported) in the Emotion Sense dataset. For each of the 17,251 users who reported their mood at least once, we obtain 2,682 sparse mood reports (for valence and arousal) completed over 3 years. This is the final sample we used for this experiment.

Exploratory analysis. As a first exploratory analysis, we apply hierarchical clustering to the historical mood of the users. Figure 4 shows the resulting trees. Specifically, the y-axis shows the clustering of users whereas the x-axis shows the clustering of features (16 features, 8 per valence and 8 per arousal). We observe that there are multiple user groups in the left-side tree, pointing out that some users' mood reporting behaviours resemble others'. However, it is not easy to spot clear relationships due to the number of users. The features are also clustered, with the two most prominent groups being valence and arousal. However, there are some intruders in those clusters: for example, the maximum arousal (max_y) belongs to the valence cluster, while the count (count_x) and the minimum (min_x) of valence go into the arousal group. These feature clusters provide hints regarding the non-linear relationships between the mood components.

k-means. We now apply k-means to obtain the labels used in our experiments. We repeat the clustering varying k, the number of clusters, in order to visually identify the clusters in the affect grid. Figure 5 shows the result when increasing the number of clusters from 2 to 4. For 2 clusters (Fig. 5d), plotting the mean valence and arousal on the affect grid reveals a group of consistently relaxed users in the bottom-right quadrant and another group, consisting of depressed, stressed and excited users, across the rest of the grid. When we further increase the number of clusters, pattern finding becomes harder. For example, with 3 clusters (Fig. 5e) we spot a now-distinct central neutral group, while the rest resembles the previous plot (relaxed and non-relaxed). Finally, for 4 clusters (Fig. 5f), we again spot the middle neutral users, but this time the high-valence side breaks down into two areas: excitement (top right) and relaxation (bottom right). Interestingly, the negative feelings (left side) do not break down into sub-clusters, hinting that the two ends of the arousal spectrum for unpleasant feelings (stress and depression) might share some common characteristics. However, these last plots (3 and 4 clusters) present significant cluster overlap.

We also show the cluster centroid for every feature in a parallel-coordinate plot (Figure 5a-c) in order to identify the features that matter for clustering. Intuitively, the clustering algorithm found 2 centers in the high-dimensional space, and we simply plot the value of every feature for these center points. For instance, for 2 clusters (Fig. 5a) the largest distance appears between the maximum valence and the minimum arousal; these two features alone could be enough to separate the two clusters. As the number of clusters grows, this becomes harder, since we need features on which all the centroids sit equally far apart. Namely, for 3 clusters (Fig. 5b) the minimum valence differs for every group, while for 4 clusters (Fig. 5c) no single feature has distinct centroids.

Finally, we perform the elbow method to quantitatively find the optimal number of clusters. Figure 6 shows that the top silhouette score is 0.30 (higher is better) with two clusters, dropping to 0.23 with three clusters; it plateaus at around 0.20 with seven clusters or more. These two groups will be used as labels in the machine learning pipeline to infer non-relaxed users from one-off questionnaires and passive sensing data in the next section. We are aware that these clusters are inferred information and thus could include some errors; however, we keep the clustering with the highest silhouette score, i.e., the lowest error. Please note that there is a class imbalance between the clusters at the user level, cluster 1 (65%) versus cluster 2 (35%), which we will address later in that section.

4 PSYCHOLOGICAL TRAITS AND MOBILE SENSING TO PREDICT NON-RELAXED MOOD

We now describe our methodology to identify non-relaxed individuals from their psychological traits, obtained through one-off questionnaires, and passive sensing data collected using the accelerometer and microphone sensors of their smartphones. We follow the workflow in Figure 8: we begin by extracting features from the raw accelerometer and microphone data, as well as one-hot encoding the answers to the one-off questionnaires regarding users' psychological traits. We then perform a two-step feature selection, where we first calculate the significance of each real-valued feature with respect to the binary target as a p-value using the univariate Mann-Whitney U test [25], and then transform the selected features with Principal Component Analysis (PCA) [30] to obtain feature combinations with maximum variance. These features are finally fed to classifiers. We detail these steps below.

4.1 Feature extraction

Questionnaires. One-off surveys cover a wide range of user-profile attributes such as demographics, personality, gratitude, health, sociability, job satisfaction, life aspirations, connectedness, and satisfaction with life. These 92 features are represented as Likert scales or categories. To be suitable for machine learning models, the categorical features are transformed into individual features with one-hot encoding, so that a feature with e.g. 3 possible choices (Yes, No, missing) is transformed into 3 different features. Categorical features include gender, age group, education level and ethnic group, among others. The total number of questionnaire features is 131.
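The one-hot step can be illustrated with pandas; the column names below are hypothetical, not the actual survey fields:

```python
import pandas as pd

# Hypothetical slice of one-off questionnaire answers.
answers = pd.DataFrame({
    "age_group": ["18-24", "25-34", None],   # categorical, with a missing value
    "smoker":    ["Yes", "No", None],        # Yes / No / missing
    "gratitude": [4, 5, 3],                  # Likert scales stay numeric
})

# dummy_na=True gives missing answers their own column, so a question with
# choices (Yes, No, missing) becomes 3 binary features.
encoded = pd.get_dummies(answers, columns=["age_group", "smoker"], dummy_na=True)
```

Each categorical column expands into one binary column per observed value plus one for missing, while numeric Likert scores pass through untouched.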

Accelerometer. We consider the 3 dimensions (x, y, z) of the accelerometer and compute the magnitude of the acceleration over the 5, 8, and 10-second samples, resulting in 48 time-steps for every user-day (336 time-steps for every user-week). We aggregate the sensor into 30-minute bins, since this level of granularity is the best trade-off between data sparsity and modeling the sub-hourly movement of individuals. This light processing leaves us with one time-series instead of three, combining the three axes. Based on the sparsity histogram (Fig. 7b), we keep those samples that have at least 50 time-steps during the week (20 time-steps during the day). This time-series is normalized with min-max scaling to a [0.05, 1] range, and missing values are replaced with zeros. We extract 721 simple and second-order features that cover a wide range of attributes of a signal, such as energy, auto-correlation, entropy, trends, wavelet and Fourier coefficients, peaks, etc. For a comprehensive list of the features we refer the reader to the documentation of the tsfresh library [9].
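A rough sketch of this preprocessing (magnitude, 30-minute binning, min-max scaling to [0.05, 1], zero-filling) on synthetic data; the subsequent 721 features would come from tsfresh's feature extraction, which is omitted here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic raw accelerometer samples: one (x, y, z) reading every 10 s.
ts = pd.date_range("2016-01-04", periods=5000, freq="10s")
acc = pd.DataFrame(rng.normal(size=(5000, 3)), columns=["x", "y", "z"], index=ts)

# Collapse the three axes into a single magnitude time-series.
magnitude = np.sqrt((acc ** 2).sum(axis=1))

# Aggregate into 30-minute bins (48 such bins cover a full user-day).
binned = magnitude.resample("30min").mean()

# Min-max scale to [0.05, 1], then zero-fill any missing bins.
lo, hi = binned.min(), binned.max()
scaled = (0.05 + 0.95 * (binned - lo) / (hi - lo)).fillna(0.0)
```

Reserving [0.05, 1] for observed values lets the zero used for missing bins act as an unambiguous "no data" marker.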

Microphone. Similarly to the accelerometer data, we compute the mean over the 5, 8, and 10-second windows of the raw microphone amplitude (noise level sampled at 20 Hz), ending up with 48 time-steps for every user-day (336 time-steps for every user-week). We apply the same filtering, normalization and feature extraction as for the accelerometer above, resulting in 717 features.

Seasonality. Temporal features are extracted from the end timestamp of each sensor user-week in order to capture inherent seasonality patterns. Namely, we compute 5 increasingly fine-grained time-aware features: the quarter, month, week, day of week and hour of day. We consider these features to belong to the sensor modality introduced later.
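These five features can be derived from the week's end timestamp, e.g. with pandas (the timestamp below is illustrative):

```python
import pandas as pd

# End timestamp of a hypothetical sensor user-week.
end = pd.Timestamp("2016-03-13 18:30:00")

seasonality = {
    "quarter":     end.quarter,            # calendar quarter (1-4)
    "month":       end.month,              # 1-12
    "week":        end.isocalendar()[1],   # ISO week number
    "day_of_week": end.dayofweek,          # Monday = 0
    "hour_of_day": end.hour,               # 0-23
}
```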

4.2 Classifiers

We considered three different classifiers for our inference task: Logistic Regression, Gradient Boosting Trees and a Deep Neural Network. Below we describe the details of our implementation.

Logistic Regression (LR). An sklearn implementation of binary logistic regression, with an L2 regularization penalty and C = 1 (inverse of regularization strength), was tested.

Gradient Boosting Trees (GB). An sklearn implementation of gradient boosting was tested. Reportedly the state-of-the-art in feature-based machine learning [28], this classifier forms an ensemble of weak prediction models, typically decision trees.
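The two feature-based baselines can be reproduced with scikit-learn roughly as follows; toy data stands in for the real features, and any hyperparameters beyond those stated in the text are assumed to be the library defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary task standing in for the relaxed / non-relaxed labels.
X, y = make_classification(n_samples=600, n_features=100, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L2-penalized logistic regression with C = 1 (inverse regularization strength).
lr = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_tr, y_tr)
# Gradient boosting: an ensemble of shallow decision trees (library defaults).
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
auc_gb = roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1])
```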

Deep Neural Network (NN). We use a straightforward bottleneck architecture of 4 feed-forward dense layers of dimensionality 100-50-100. The reduced dimensionality in the middle (50 units) has been shown to lead to better generalization in deep learning architectures [13, 23]. A rectified linear unit (ReLU) [12] activation



Figure 7: Sparsity analysis of the sensors. Missing values for the sensors at the weekly level (a,c). Cumulative distribution functions (CDF) of the missing time-steps (b,d) show the long-tail distribution of sparsity. Some weekly periodicity is also spotted. Similar conclusions are drawn with the daily-level sensors.

Figure 8: Workflow of the data processing and model development for the mood prediction task (feature extraction and one-hot encoding, feature selection, PCA, then LR/GB/DNN classifiers).

is applied at the output of every layer, followed by a batch normalization layer that transforms the output to have zero mean and unit variance [14]. Dropout with 50% probability is applied to every layer to reduce overfitting [36]. The final layer applies a softmax activation, the network is trained with a cross-entropy loss, and the optimizer is Adam [16]. We train for 300 epochs or until the validation loss stops improving for 10 consecutive epochs. Our implementation is based on Tensorflow/Keras.

5 EVALUATION

We now detail the evaluation of our methodology to identify non-relaxed users from one-off questionnaires and passive sensing data

described in Section 4. We used the Emotion Sense dataset for our experiments, and the clustered mood obtained using k-means in Section 3 as the labels for the classifiers. Below we indicate how we merged the data from the different modalities and how we partition the dataset for our experiments (Section 5.1). Findings and results are provided in Section 5.2.

5.1 Experimental setup

Modality merge. Experiments in the wild such as this one do not guarantee complete and fine-grained data, especially when they involve battery-consuming tasks like sensor tracking or input-based prompts like self-reports. Therefore, not all modalities appear for the same users. We start by merging the accelerometer and microphone modalities, resulting in 141,261 user-weeks, and concatenate their features along with the seasonality ones. Then, we find which users from those weeks have completed at least a single questionnaire and concatenate these static features to the feature vector, resulting in 131,793 user-weeks. Finally, we merge with the clusters produced in the previous section, so that every user-week feature vector corresponds to one of the two user mood clusters. Please note that these clusters were obtained by taking the full mood history of the users into account; we do not imply that mood is static. Notably, the high class imbalance at the user level reported earlier is exaggerated here, because only 7% of the user-weeks belong to cluster 2 (green in Fig. 5). As a result, we subsample the majority class, resulting in 18,998 balanced user-weeks from 2,812 users. The same processing is followed for the
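The merge-then-balance logic can be sketched with pandas on a tiny mock dataset; the frames and column names below are illustrative, not the actual pipeline:

```python
import pandas as pd

# Mock weekly sensor features, keyed by (user, week).
acc = pd.DataFrame({"user": ["a", "a", "b", "b", "c"], "week": [1, 2, 1, 2, 1],
                    "acc_f": [0.1, 0.2, 0.3, 0.4, 0.5]})
mic = pd.DataFrame({"user": ["a", "a", "b", "b"], "week": [1, 2, 1, 2],
                    "mic_f": [0.6, 0.7, 0.8, 0.9]})
# One-off questionnaire features and k-means cluster label, both per user.
quest = pd.DataFrame({"user": ["a", "b"], "age_group": [1, 2]})
labels = pd.DataFrame({"user": ["a", "b"], "cluster": [1, 2]})

# Inner joins keep only user-weeks present in every modality.
merged = (acc.merge(mic, on=["user", "week"])
             .merge(quest, on="user")
             .merge(labels, on="user"))

# Balance the two mood clusters by subsampling the majority class.
n_min = merged["cluster"].value_counts().min()
balanced = (merged.groupby("cluster", group_keys=False)
                  .apply(lambda g: g.sample(n=n_min, random_state=0)))
```

User "c" is dropped at the first merge because it has no microphone weeks, mirroring how the modality intersection shrinks the real dataset.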



Table 1: Mean classification performance (AUC) for predicting mood group based on weekly or daily sensors, across 10 cross-validation runs, with standard deviation in brackets (NN = neural network, LR = Logistic Regression, GB = Gradient Boosting).

Modality            Weekly                                     Daily
                    LR            GB            NN             LR            GB            NN
Sensors (S)         0.575 (0.03)  0.555 (0.03)  0.550 (0.04)   0.543 (0.04)  0.514 (0.02)  0.510 (0.03)
Questionnaires (Q)  0.690 (0.05)  0.627 (0.10)  0.687 (0.09)   0.671 (0.11)  0.729 (0.09)  0.701 (0.09)
All (S + Q)         0.749 (0.06)  0.721 (0.03)  0.725 (0.06)   0.706 (0.07)  0.740 (0.09)  0.697 (0.10)

daily sensors: 112,161 user-days after the sensor merge, 106,672 after the questionnaire merge, and 16,470 user-days from 1,859 unique users after merging with the labels and sub-sampling.

Feature ablation studies. In order to identify which feature modality contributes more to the classification, we repeat our experiments with 3 different modality sets: only sensors (accelerometer, microphone and seasonality), only one-off questionnaires (psychological profile), and combined. For a fair comparison, for every modality we keep only 100 features to feed to the classifiers. Since the modalities contain different numbers of features (combined = 1,564, sensors = 1,434, questionnaires = 130), we perform a two-step feature selection. First, we calculate the significance of each real-valued feature with respect to the binary target as a p-value using the univariate Mann-Whitney U test [25]. Then, the selected features are transformed with Principal Component Analysis (PCA) [30], a common decorrelation method that produces feature combinations with maximum variance, ending up with 100 components/features.
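A compact sketch of the two-step selection (Mann-Whitney U filtering followed by PCA) on synthetic data; 10 components are used instead of the paper's 100 to keep the toy example small:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_features = 400, 300
y = rng.integers(0, 2, size=n_samples)            # binary mood-cluster target
X = rng.normal(size=(n_samples, n_features))
X[:, :20] += y[:, None]                           # make the first 20 features informative

# Step 1: univariate Mann-Whitney U p-value of each feature vs. the target.
pvals = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(n_features)])
selected = pvals < 0.05

# Step 2: decorrelate the surviving features with PCA.
components = PCA(n_components=10, random_state=0).fit_transform(X[:, selected])
```

The non-parametric U test makes no normality assumption about the features, which suits the heterogeneous sensor and survey distributions.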

User-based cross validation. Typical cross-validation would not be adequate for our task, since static features such as age or gender are repeated across different weeks belonging to the same user. Therefore, we create training and test sets from disjoint user splits, making sure that weeks from the same user do not appear in both splits. Please note that this does not result in perfectly balanced class splits, but the evaluation metric we use, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC, or simply AUC), is robust to class imbalance. Even then, it is not easy to guarantee that a single split picks a representative test set, so we perform a 10-fold-like cross-validation using 10 different seeds to pick disjoint users. Consequently, we conduct extensive experimentation by testing 180 models (3 modalities × 10 user splits × 3 classifiers × 2 temporal levels). The size of the test set is 10% of the dataset, and of the remaining 90% used for training we keep a random 10% for validation (used only with the neural networks). This validation set belongs to the same distribution as the training set. We report the average performance of the folds and the standard deviation.
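This disjoint-user protocol corresponds to group-aware splitting, for example with scikit-learn's GroupShuffleSplit using the user as the group; here is a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
users = np.repeat(np.arange(100), 5)   # 5 weeks per synthetic user

aucs = []
splitter = GroupShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
for train_idx, test_idx in splitter.split(X, y, groups=users):
    # No user contributes weeks to both splits.
    assert set(users[train_idx]).isdisjoint(users[test_idx])
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

mean_auc, std_auc = float(np.mean(aucs)), float(np.std(aucs))
```

Reporting the mean and standard deviation over the 10 group-disjoint splits matches the evaluation protocol described above.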

5.2 Results

We now present the classification results for predicting whether a user-week/day belongs to the relaxed group or the rest of the mood spectrum, based on sensors, questionnaires and other metadata. As discussed earlier, we performed extensive experiments and trained 180 models to evaluate the impact of the different modalities and user splits. In Table 1, we present the mean classification performance of the experimental setup described in the previous section: predicting the mood cluster group (relaxed or not) based on each user's weekly/daily sensors and questionnaire metadata.

Week level. Using the sensors at the week level we achieve the best overall performance of 0.749 AUC, which comes from the LR model, while the NN comes second with 0.725. Even though the NN and GB are non-linear classifiers, they under-perform, possibly due to overfitting or the data compression performed by PCA. The LR model also shows stability, with the lowest standard deviation across all cross-validation runs. Regarding the modalities, in the best case (LR) the combined representation of sensors and questionnaires outperforms the questionnaire-only modality by +5.9% AUC, and the gain reaches +9.4% in the case of GB (albeit with a lower maximum AUC in the combined representation). The sole use of sensors achieves less than 0.60 AUC for all models. This ranking is consistent across all the classifiers.

Day level. Considering only one day of sensing data, the absolute results are slightly lower than at the weekly level. Here, the GB model achieves an AUC of 0.740, while the LR comes second with 0.706. The NN shows similar performance for the combined and questionnaire-only representations, hinting that the daily sensors do not contribute much to it. However, the remaining models show a rise of +1.1% (GB) and +3.5% (LR) in AUC when we add the sensors to the questionnaires.

Discussion. These results show that by adding passive sensing to traditional personality and demographics surveys we are able to predict the mood group of individual users with higher precision. Specifically, for our task we achieve ~75% AUC when classifying users into relaxed or not. Also, we observe that tracking users for longer (week over day level) yields better performance. In hindsight, this is intuitive, since movement and noise levels are expected to relate to relaxation levels. Beyond the binary task, extra experiments with 3 or 4 clusters (multi-class) yielded worse results, due to the significant cluster overlap and fewer data points per class to learn from. Last, putting our results in the context of related work, we see that similar datasets yield lower accuracy (around 65%) for slightly different tasks such as predicting tomorrow's mood [38] or the daily mood average [22].

6 RELATED WORK

As noted in one of the first seminal review papers in 2010 [18], the main obstacle for the field of mobile sensing and pervasive health is not lack of adoption, since billions already carry sensor-rich devices, but rather how to perform privacy-aware and resource-sensitive reasoning with noisy and missing data, and how to deliver effective interventions. When these issues are solved, mobile sensing will act as a catalyst for diverse domains such as social networking, health, and energy. Here, we focus on the challenges of learning robust and informative features from noisy signals and on how they can assist with user modeling and interventions.



While the motivation for building mood prediction systems seems well-founded, their implementation appears to be challenging. Numerous mobile apps for mental health monitoring have been proposed, like BeWell [19] or MoodScope [22]. Specific groups, like undergraduate students, have been studied in controlled setups: e.g., StudentLife [45] measured the impact of student workload on stress with sensors and self-reports, whereas Snapshot [33] tracked mood and sleep. Other efforts have focused on detecting depression by tracking medication, sleep patterns and actions [37], location [5], or even keypress acceleration [6]. Like in our case, static personality metadata have been combined with sensor time-series [4]. Please note that the paper that introduced this dataset [35] also predicted mood using smartphone sensor data, but it used a smaller subset of users and, most importantly, a different prediction target (mood at time t with data sensed before and after t).

One of the biggest limitations of the above works is the relatively small sample size, with participants often belonging to similar socioeconomic backgrounds, which makes it hard to draw robust conclusions. Besides, participants were often tracked for short periods and in controlled setups. For instance, the MoodScope study [22] monitored 32 people over 2 months, the StudentLife project [45] tracked 48 students over 10 weeks, whereas Snapshot [33] is probably the largest such published study, with 206 students tracked for over 1 month. In contrast, we draw robust conclusions from an initial dataset of more than 17,000 users, collected in the wild over more than 3 years.

Putting aside the limitations of sample size, perhaps the work most closely related to ours is the Snapshot study [33]. This study investigated how daily behavior, gathered through passive sensing, influences sleep, stress, mood, and other wellbeing-related factors. Multiple papers focused on different aspects of the collected dataset, such as personalization with multi-task learning to predict tomorrow's mood, stress, and health [38], prediction of happy/sad mood based on sleep history [34], or a denoising autoencoder to fill in missing sensor data for mood prediction [15]. Similar to us, they first cluster the users before classification [38], although their goal there is to provide personalized predictions for these clusters. In contrast, our models do not distinguish between healthy and depressed patients, but predict the clustered mood group, which roughly corresponds to relaxed or non-relaxed users. From a more practical perspective, personalized models are difficult to deploy in a real-world scenario, since they require training N personalized models, with N being the number of users. Even though previous research has shown that better performance can be achieved by averaging the individual model accuracies [5, 22], no results are reported on unseen disjoint users. Instead, we provide single end-to-end trainable models, and in all of our experiments we report performance on a disjoint user set that the model has not seen during training.

The majority of the related literature has applied supervised learning algorithms, like Logistic Regression or SVMs, without focusing on systematic first- and second-order feature extraction from the sensors. The only alternative seems to be some kind of deep learning, which has yielded moderate results (e.g., the StudentLife dataset with deep feed-forward neural networks [27]). Other neural approaches include the DeepMood paper, which uses RNNs for depression prediction [37]. We build upon this growing body of literature on machine learning for mood prediction by proposing end-to-end models that exploit thorough feature extraction from the sensors as well as rich information about the demographics and personality of the users.

7 CONCLUSION

The pervasiveness of smartphones has turned them into experience sampling tools to collect people's mood and assess their mental state. However, the granularity of the data needs to be traded off against the level of interruption these tools introduce into users' activities, which often results in very sparse data. In this paper we propose a machine learning methodology to detect whether an individual's perceived mood differs from that of the general population, by solely considering their psychological traits, collected through one-off questionnaires, and passively collected mobile sensing data, thus avoiding the use of experience sampling questionnaires.

We evaluate our methodology using a large-scale dataset collected in the wild over more than 3 years from 17,000 participants. An exploratory analysis of the data revealed that relaxed is the most common state reported by our population. Our experiments also confirmed that our methodology is able to distinguish between generally relaxed and non-relaxed individuals with 75% AUC, when using a combination of weekly sensors (accelerometer and microphone) and one-off questionnaire data (personality, demographics, etc.) as inputs. Moreover, the use of passive sensing data yields a +5% boost in accuracy. In a healthcare context, this means that we can group users correctly 3 out of 4 times using only short-term mobile phone sensing and sparse surveys. While that level of accuracy might not be adequate for medical deployments, our focus is mostly on the positive contribution of passive sensing.

As future work, we plan to study data imputation techniques in order to ameliorate the significant data loss incurred when merging the modalities [15], as well as to focus on feature importance analysis. Also, our current setup uses the aggregate approach of assigning users to target clusters, which one could argue might change over time; we are working on continuous predictions of both the sensors and the mood. We also plan to adapt models that operate on raw time-series, such as Wavenet [42], and to combine them with multi-modal approaches for the static features.

ACKNOWLEDGMENTS

This work was supported by the Embiricos Trust Scholarship of Jesus College Cambridge, the EPSRC through Grants DTP (EP/N509620/1) and UBHAVE (EP/I032673/1), and Nokia Bell Labs.

REFERENCES

[1] Stephen Aichele, Patrick Rabbitt, and Paolo Ghisletta. 2016. Think fast, feel fine, live long: A 29-year study of cognition, health, and survival in middle-aged and older adults. Psychological Science 27, 4 (2016), 518–529.

[2] Tim Althoff, Jennifer L Hicks, Abby C King, Scott L Delp, Jure Leskovec, et al. 2017. Large-scale physical activity data reveal worldwide activity inequality. Nature 547, 7663 (2017), 336.

[3] Wei Tech Ang, Pradeep K Khosla, and Cameron N Riviere. 2007. Nonlinear regression model of a low-g MEMS accelerometer. IEEE Sensors Journal 7, 1 (2007), 81–88.

[4] Andrey Bogomolov, Bruno Lepri, Michela Ferron, Fabio Pianesi, and Alex Sandy Pentland. 2014. Daily stress recognition from mobile phone data, weather conditions and individual traits. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 477–486.

[5] Luca Canzian and Mirco Musolesi. 2015. Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 1293–1304.

[6] Bokai Cao, Lei Zheng, Chenwei Zhang, Philip S Yu, Andrea Piscitello, John Zulueta, Olu Ajilore, Kelly Ryan, and Alex D Leow. 2017. DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 747–755.

[7] Helen Cheng and Adrian Furnham. 2003. Personality, self-esteem, and demographic predictions of happiness and depression. Personality and Individual Differences 34, 6 (2003), 921–942.

[8] Charles M Ching, A Timothy Church, Marcia S Katigbak, Jose Alberto S Reyes, Junko Tanaka-Matsumi, Shino Takaoka, Hengsheng Zhang, Jiliang Shen, Rina Mazuera Arias, Brigida Carolina Rincon, et al. 2014. The manifestation of traits in everyday behavior and affect: A five-culture study. Journal of Research in Personality 48 (2014), 1–16.

[9] Maximilian Christ, Nils Braun, Julius Neuffer, and Andreas W Kempa-Liehr. 2018. Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing (2018).

[10] Mihaly Csikszentmihalyi and Reed Larson. 2014. Validity and reliability of the experience-sampling method. In Flow and the Foundations of Positive Psychology. Springer, 35–54.

[11] Katharina Geukes, Steffen Nestler, Roos Hutteman, Albrecht CP Küfner, and Mitja D Back. 2017. Trait personality and state variability: Predicting individual differences in within- and cross-context fluctuations in affect, self-evaluations, and behavior in everyday life. Journal of Research in Personality 69 (2017), 124–138.

[12] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 315–323.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

[14] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).

[15] Natasha Jaques, Sara Taylor, Akane Sano, and Rosalind Picard. 2017. Multimodal Autoencoder: A Deep Learning Approach to Filling In Missing Sensor Data and Enabling Better Mood Prediction. In Proc. International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, Texas.

[16] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[17] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. 2015. Integrative analysis of 111 reference human epigenomes. Nature 518, 7539 (2015), 317.

[18] Nicholas D Lane, Emiliano Miluzzo, Hong Lu, Daniel Peebles, Tanzeem Choudhury, and Andrew T Campbell. 2010. A survey of mobile phone sensing. IEEE Communications Magazine 48, 9 (2010).

[19] Nicholas D. Lane, Mashfiqui Mohammod, Mu Lin, Xiaochao Yang, Hong Lu, Shahid Ali, Afsaneh Doryab, Ethan Berke, Tanzeem Choudhury, and Andrew T. Campbell. 2011. BeWell: A smartphone application to monitor, model and promote wellbeing. In Pervasive Computing Technologies for Healthcare.

[20] Neal Lathia, Gillian M Sandstrom, Cecilia Mascolo, and Peter J Rentfrow. 2017. Happier people live more active lives: using smartphones to link happiness and physical activity. PloS one 12, 1 (2017), e0160589.

[21] Kyunghee Lee, Hyeyon Kwon, Byungtae Lee, Guna Lee, Jae Ho Lee, Yu Rang Park, and Soo-Yong Shin. 2018. Effect of self-monitoring on long-term patient engagement with mobile health applications. PloS one 13, 7 (2018), e0201166.

[22] Robert LiKamWa, Yunxin Liu, Nicholas D. Lane, and Lin Zhong. 2013. MoodScope: Building a Mood Sensor from Smartphone Usage Patterns. In MobiSys '13. ACM.

[23] Alicia Lozano-Diez, Ruben Zazo, Doroteo T Toledano, and Joaquin Gonzalez-Rodriguez. 2017. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition. PloS one 12, 8 (2017), e0182580.

[24] James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Oakland, CA, USA, 281–297.

[25] Henry B Mann and Donald R Whitney. 1947. On a test of whether one oftwo random variables is stochastically larger than the other. The annals ofmathematical statistics (1947), 50–60.

[26] Abhinav Mehrotra, Jo Vermeulen, Veljko Pejovic, and Mirco Musolesi. 2015.Ask, but don’t interrupt: the case for interruptibility-aware mobile experiencesampling. InAdjunct Proceedings of the 2015 ACM International Joint Conference onPervasive and Ubiquitous Computing and Proceedings of the 2015 ACM InternationalSymposium on Wearable Computers. ACM, 723–732.

[27] Gatis Mikelsons, Matthew Smith, Abhinav Mehrotra, and Mirco Musolesi. 2017.Towards Deep Learning Models for Psychological State Prediction using Smart-phone Data: Challenges and Opportunities. In In Workshop on Machine Learningfor Health (ML4H) at NIPS 2017.

[28] RS Olson,WCava, ZMustahsan, A Varik, and JHMoore. 2018. Data-driven advicefor applying machine learning to bioinformatics problems.. In Pacific Symposiumon Biocomputing. Pacific Symposium on Biocomputing, Vol. 23. 192–203.

[29] Minha Park. 2004. Error analysis and stochastic modeling of MEMS based inertialsensors for land vehicle navigation applications. (2004).

[30] Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points inspace. The London, Edinburgh, and Dublin Philosophical Magazine and Journal ofScience 2, 11 (1901), 559–572.

[31] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation andvalidation of cluster analysis. Journal of computational and applied mathematics20 (1987), 53–65.

[32] James A. Russell, Anna Weiss, and Gerald A. Mendelsohn. 1989. Affect Grid:A single-item scale of pleasure and arousal. Journal of Personality and SocialPsychology (1989).

[33] Akane Sano. 2016. Measuring college students’ sleep, stress, mental health and well-being with wearable sensors and mobile phones. Ph.D. Dissertation. MassachusettsInstitute of Technology.

[34] Akane Sano, Z Yu Amy, Andrew W McHill, Andrew JK Phillips, Sara Taylor,Natasha Jaques, Elizabeth B Klerman, and Rosalind W Picard. 2015. Prediction ofhappy-sad mood from daily behaviors and previous sleep history. In Engineeringin Medicine and Biology Society (EMBC), 2015 37th Annual International Conferenceof the IEEE. IEEE, 6796–6799.

[35] Sandra Servia-Rodríguez, Kiran K Rachuri, Cecilia Mascolo, Peter J Rentfrow,Neal Lathia, and Gillian M Sandstrom. 2017. Mobile sensing at the service ofmental well-being: a large-scale longitudinal study. In Proceedings of the 26thInternational Conference on World Wide Web. International World Wide WebConferences Steering Committee, 103–112.

[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.

[37] Yoshihiko Suhara, Yinzhan Xu, and Alex 'Sandy' Pentland. 2017. DeepMood: Forecasting depressed mood based on self-reported histories via recurrent neural networks. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 715–724.

[38] Sara Ann Taylor, Natasha Jaques, Ehimwenma Nosakhare, Akane Sano, and Rosalind Picard. 2017. Personalized Multitask Learning for Predicting Tomorrow's Mood, Stress, and Health. IEEE Transactions on Affective Computing (2017).

[39] Nabyl Tejani, Timothy R Dresselhaus, and Matthew B Weinger. 2010. Development of a hand-held computer platform for real-time behavioral assessment of physicians and nurses. Journal of Biomedical Informatics 43, 1 (2010), 75–80.

[40] Robert L Thorndike. 1953. Who belongs in the family? Psychometrika 18, 4 (1953), 267–276.

[41] Niels Van Berkel, Denzil Ferreira, and Vassilis Kostakos. 2018. The experience sampling method on mobile devices. ACM Computing Surveys (CSUR) 50, 6 (2018), 93.

[42] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. CoRR abs/1609.03499 (2016).

[43] Ruut Veenhoven. 2008. Healthy happiness: Effects of happiness on physical health and the consequences for preventive health care. Journal of Happiness Studies 9, 3 (2008), 449–469.

[44] Rui Wang, Min SH Aung, Saeed Abdullah, Rachel Brian, Andrew T Campbell, Tanzeem Choudhury, Marta Hauser, John Kane, Michael Merrill, Emily A Scherer, et al. 2016. CrossCheck: toward passive sensing and detection of mental health changes in people with schizophrenia. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 886–897.

[45] Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T Campbell. 2014. StudentLife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 3–14.

[46] David Watson, Lee Anna Clark, and Auke Tellegen. 1988. Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology 54, 6 (1988), 1063.

[47] Xiao Zhang, Wenzhong Li, Xu Chen, and Sanglu Lu. 2018. MoodExplorer: Towards Compound Emotion Detection via Smartphone Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 176.