

Monitoring Depression Trend on Twitter during the COVID-19 Pandemic

Yipeng Zhang1, Hanjia Lyu1∗, Yubao Liu1∗, Xiyang Zhang2, Yu Wang1, Jiebo Luo1

1 University of Rochester, 2 University of Akron
{yzh232,hlyu5}@ur.rochester.edu

Abstract

The COVID-19 pandemic has severely affected people’s daily lives and caused tremendous economic loss worldwide. However, its influence on people’s mental health conditions has not received as much attention. To study this subject, we choose social media as our main data resource and create by far the largest English Twitter depression dataset, containing 2,575 distinct identified depression users with their past tweets. To examine the effect of depression on people’s Twitter language, we train three transformer-based depression classification models on the dataset, evaluate their performance with progressively increased training sizes, and compare the models’ “tweet chunk”-level and user-level performances. Furthermore, inspired by psychological studies, we create a fusion classifier that combines deep learning model scores with psychological text features and users’ demographic information, and investigate these features’ relations to depression signals. Finally, we demonstrate our model’s capability of monitoring both group-level and population-level depression trends by presenting two of its applications during the COVID-19 pandemic. We hope this study can raise awareness among researchers and the general public of COVID-19’s impact on people’s mental health.

1 Introduction

COVID-19, or the coronavirus disease of 2019, is an infectious disease that reportedly originated in China and has been spreading rapidly across the globe in 2020. It was first identified on December 31st, 2019, and was officially declared a pandemic by the World Health Organization (WHO) on March 11th, 2020.1 As of May 29th, 2020, COVID-19 has spread to 216 countries, areas or territories, with over 5.7 million confirmed cases and over 350,000 confirmed deaths.1 In response to the pandemic, over 170 countries have issued nationwide closures of educational facilities,2 and many governments have issued flight restrictions and stay-at-home orders, affecting the everyday lives of people around the globe.

∗ Equal contribution.
1 https://www.who.int/emergencies/diseases/novel-coronavirus-2019

Multiple studies have investigated the economic and social impacts of COVID-19 (Fernandes, 2020; Baker et al., 2020; Nicola et al., 2020), but what mental impact the drastic life changes bring to people, and how to quantify it at the population level, are yet to be studied. Mental disorders affect approximately 380 million people of all ages worldwide.3 Previous studies have shown that mental disorders lead to many negative outcomes including suicide (Inskip et al., 1998; San Too et al., 2019), but individuals who suffer from mental disorders are sometimes unwilling or ashamed to seek help (Yoshikawa et al., 2017). Moreover, it is infeasible for psychological studies to obtain and track a large sample of diagnosed individuals and perform statistically significant numerical analysis.

In the past decade, people have been increasingly relying on social media platforms such as Facebook, Twitter, and Instagram to express their feelings. We believe that social media is a resourceful medium to mine information about the general public’s mental health conditions. As shown in Figure 1, we use data from ForSight by Crimson Hexagon4 to plot the word frequencies of several mental disorders on Twitter, including “depression”, “PTSD”, “bipolar disorder”, and “autism”, from January 1st, 2020 to May 4th, 2020. Note that we exclude false positive tweets that contain misleading phrases such as “economic depression” or “great depression”. We notice a rapid growth of the word frequencies of autism and depression starting from March 17th, when the pandemic spread across most of the globe.1

2 https://en.unesco.org/covid19/educationresponse
3 https://www.who.int/en/news-room/fact-sheets/detail/mental-disorders
4 https://www.brandwatch.com/



Since depression occurs substantially more frequently than the other three mental disorders, we focus on understanding COVID-19’s impact on depression in this study.

Figure 1: Density of Twitter coverage regarding “depression”, “ptsd”, “bipolar disorder”, and “autism”.

Previous studies have used n-gram language models (Coppersmith et al., 2014), topic models (Armstrong, 2018), and deep learning models such as 1-dimensional convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) (Orabi et al., 2018) to classify depression at the user level using Twitter data. All these works use small samples of fewer than 500 users. Shen et al. (2017) extend previous studies by expanding the dataset to contain 1,402 depression users and using a multimodal dictionary learning approach to learn the latent features of the data.

In this study, we create a dataset of 5,150 Twitter users, half identified depression users and half control users, along with their tweets within the past three months and their Twitter activity data. We develop a chunking and regrouping method to construct 32,420 tweet chunks of 250 words each in the dataset. Recently, transformer-based deep learning language models have achieved state-of-the-art performance in multiple language modeling tasks. We investigate the performance of several of these models, including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019), on our dataset. We progressively add data to our training set and notice a clear performance growth on all models, which validates the importance of our dataset. We compare the models’ performance at the chunk level as well as the user level, and observe a further performance gain, which adds credibility to our chunking method.

Furthermore, we build a more accurate classification model upon the deep learning models, along with linguistic analysis of dimensions including personality, Linguistic Inquiry and Word Count (LIWC), a well-validated psycholinguistic dictionary (Tausczik and Pennebaker, 2010), sentiment features, and demographic information.

De Choudhury et al. (2013a) demonstrate that depression prediction models can potentially be used at the population level. However, to the best of our knowledge, all the Twitter user depression identification studies reviewed above focus on either tweet-level or user-level classification rather than applying the model to analyzing the mental health trends of a large population. The word frequency trend shown in Figure 1 is apparently filled with noise and lacks a plausible explanation, making it unable to reflect the actual mental health trends of the population. In this study, we present two applications that monitor the change of depression level in different groups as the disease propagates. We use latent Dirichlet allocation (LDA) and psychological theories to analyze the trends we discover. To the best of our knowledge, we are the first to apply transformer-based models to identifying depression users on Twitter using a large-scale dataset and to monitoring the public depression trend.

In summary, our main contributions are:

• We develop an effective search method to identify depression users on Twitter. Using this method, we are able to create by far the largest dataset of its kind, with 2,575 depression users along with their past tweets within three months. Our experiments show that the performance of deep learning models increases as the size of the training set grows.

• To the best of our knowledge, we are the first to investigate the potential of transformer-based deep learning models on Twitter user depression classification.

• We build a tool marrying the deep learning models and psychological text analysis to further improve the classification performance.

• We build a pipeline to monitor the public depression trend by aggregating individuals’ past tweets within a time frame using our model, and we analyze the depression level trends during COVID-19, thus shedding light on the psychological impacts of the pandemic.


2 Related Work

The potential of machine learning models for identifying Twitter users who have been diagnosed with depression was pioneered by De Choudhury et al. (2013b), who analyzed how features obtained by LIWC are related to depression signals on social media and how they can be used for user-level classification on a dataset containing 171 depression users. The data was collected by administering surveys to volunteers through crowdsourcing. Following this work, Coppersmith et al. (2014) used LIWC, a 1-gram language model, a character 5-gram model, and users’ engagement on social media (user mention rate, tweet frequency, etc.) to perform tweet-level classification on a dataset containing 441 depression users.

The CLPsych 2015 Shared Task dataset containing 447 diagnosed depression users (Coppersmith et al., 2015) was published in 2015 and was favored by a wide range of studies (Armstrong, 2018; Nadeem, 2016; Jamil et al., 2017; Orabi et al., 2018). The data was gathered by regular expression search over tweets in combination with manual annotation. Among these studies, topic modeling was shown effective by Armstrong (2018) using LDA; the performance of traditional machine learning classification algorithms (decision trees, SVM, naive Bayes, logistic regression) on 1-grams and 2-grams was investigated by Nadeem (2016); Jamil et al. (2017) used SVM on bag of words (BOW) and depression word count, along with LIWC features and NRC sentiment features; Orabi et al. (2018) explored the performance of small deep neural architectures, namely 1-dimensional CNN and BiLSTM with context-aware attention, on the task.

Tsugawa et al. (2015) performed an analysis of models using BOW, LDA, and social media engagement features on a dataset containing 81 Japanese-speaking depression Twitter users collected by crowdsourcing. One recent work (Shen et al., 2017) proposed a multimodal dictionary learning method that utilizes topic, social media engagement, profile image, and emotional features to learn a latent feature dictionary; it performed well on a dataset of 1,402 depression users, the largest dataset used to the best of our knowledge.

3 Data Collection and Analysis

We use the Tweepy API to obtain 41.3 million tweets posted from March 23rd to April 18th and the information of their authors. We look for signals that can tell whether an author suffers from depression from both the text and the user profile description. Twitter users suffering from depression often describe themselves as depression fighters in their descriptions. Some of them may also post relevant tweets declaring that they have been diagnosed with depression. Inspired by Coppersmith et al. (2014), we use regular expressions to find these authors by examining their tweets and descriptions. Building upon their method, we further extend our regular expression search based on patterns we notice among manually identified depression users, in pursuit of efficacy. In tweets, we search for phrases such as “I have/developed/got/suffer(ed) from X depression”, “my X depression”, “I’m healing from X depression”, and “I’m diagnosed with X depression”, where X is a descriptive word such as “severe” or “major”. In descriptions, we further add phrases such as “depression fighter/sufferer/survivor” to the regular expression list; we remove users that have “practitioner” or “counselor” in their descriptions to exclude mental health practitioners.
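The full pattern list is not published in the paper, so the following Python sketch is illustrative only; the descriptor handling and the helper name are our assumptions:

import re

# Illustrative patterns approximating the search described above.
DESC = r"(?:\w+\s+)?"  # optional descriptive word X, e.g. "severe", "major"
TWEET_PATTERNS = [
    re.compile(r"i (?:have|developed|got|suffer(?:ed)? from) " + DESC + r"depression"),
    re.compile(r"my " + DESC + r"depression"),
    re.compile(r"i'?m healing from " + DESC + r"depression"),
    re.compile(r"i'?m diagnosed with " + DESC + r"depression"),
]
PROFILE_PATTERNS = [re.compile(r"depression (?:fighter|sufferer|survivor)")]
EXCLUDE = re.compile(r"practitioner|counselor")

def looks_depressed(tweet_text: str, description: str) -> bool:
    """Return True if a user matches the depression search patterns."""
    text, desc = tweet_text.lower(), description.lower()
    if EXCLUDE.search(desc):
        return False  # exclude mental health practitioners
    return (any(p.search(text) for p in TWEET_PATTERNS)
            or any(p.search(desc) for p in PROFILE_PATTERNS))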

Once we have found the targeted Twitter users, we use the Tweepy API to retrieve the public tweets posted by these users within the 3 months before the depression-related tweet was posted, with a cap of 200 tweets per user. If a user is identified from the description, we limit the time scope to the 3 months ending April 18th. In the end, 2,575 distinct Twitter users are considered to be suffering from depression (DP, Depression). We randomly select another 2,575 distinct users such that depression-related terms do not appear in their past 200 tweets or descriptions as our control group. Users in this group are not considered to be suffering from depression (ND, Non-Depression).
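A sketch of the timeline retrieval with Tweepy (v3-style API) is shown below; the credentials are placeholders, and filtering each status by its created_at timestamp against the 3-month window is omitted:

import tweepy

# Placeholder credentials; wait_on_rate_limit pauses at the API rate limit.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_past_tweets(user_id, max_tweets=200):
    """Fetch up to max_tweets of a user's most recent public tweets."""
    statuses = api.user_timeline(user_id=user_id, count=max_tweets,
                                 tweet_mode="extended")
    return [s.full_text for s in statuses]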

3.1 Personality

Previous psychological research shows that the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism) are related to depression (Bagby et al., 1996). We estimate individuals’ personality scores using IBM’s Personality Insights service.5 For each individual, we aggregate all their tweets into a single textual input and call the Personality Insights API to obtain the scores. The minimum number of words for using the API is 100, and we are able to retrieve scores for 4,697 (91.2%) of the 5,150 users. Summary statistics are shown in Appendix A.

5 https://www.ibm.com/cloud/watson-personality-insights

3.2 Sentiments

Besides personality, we hypothesize that individuals’ sentiments and emotions could also reflect whether they are experiencing depression. We estimate individuals’ sentiments using the Valence Aware Dictionary and sEntiment Reasoner (VADER). VADER is a lexicon- and rule-based model developed by researchers from the Georgia Institute of Technology (Hutto and Gilbert, 2014). We aggregate a user’s tweets into a single chunk, apply VADER, and retrieve its scores for positive and negative emotions. In Figure 2, we report the distributions of positive emotions and negative emotions among DP and ND, respectively.
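Obtaining the two emotion scores is straightforward with the vaderSentiment package; a minimal sketch (the example tweets are illustrative):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

user_tweets = ["feeling hopeful today", "so tired of everything"]  # toy input
user_text = " ".join(user_tweets)  # aggregate a user's tweets into one chunk

# polarity_scores returns 'pos', 'neg', 'neu', and a normalized 'compound' score
scores = analyzer.polarity_scores(user_text)
positive, negative = scores["pos"], scores["neg"]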

Figure 2: Compared with individuals with no depression, those with depression tend to exhibit both stronger positive and negative emotions. (Density plots of the VADER positive-emotion and negative-emotion scores for the two groups.)

3.3 Demographics

Previous psychological studies show differences in depression rates among people of different ages and genders (Mirowsky and Ross, 1992; Wang et al., 2020; Wainwright and Surtees, 2002). To estimate the age and gender of each user, we adopt the M3-inference model proposed by Wang et al. (2019). The M3 model performs multimodal analysis on a user’s profile image, username, and description. Following M3’s structure, we give each user a binary gender label (as an approximation) and a one-hot age label over four age intervals (≤18, 19-29, 30-39, ≥40), which are then used in our fusion model. We are able to retrieve demographic information for 5,059 of the 5,150 users (98.2%).
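The conversion from model outputs to fusion features can be sketched as follows; the probability dictionaries are illustrative stand-ins, not M3’s actual output format:

import numpy as np

AGE_BINS = ["<=18", "19-29", "30-39", ">=40"]

def demographic_features(gender_probs, age_probs):
    """Binary gender label plus a one-hot age label over the four intervals."""
    gender = int(gender_probs["female"] >= gender_probs["male"])  # binary approximation
    age_onehot = np.eye(len(AGE_BINS))[int(np.argmax(age_probs))]
    return np.concatenate(([gender], age_onehot))

demographic_features({"male": 0.3, "female": 0.7}, [0.1, 0.6, 0.2, 0.1])
# -> array([1., 0., 1., 0., 0.])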

3.4 Linguistic Inquiry and Word Count (LIWC)

We use LIWC to capture people’s psychological states by analyzing the contents of their tweets.6 We choose 8 features that were analyzed in previous works (Coppersmith et al., 2014; Stirman and Pennebaker, 2001; Rude et al., 2004), as well as 7 other features that we find relevant to our study. Similar to the method of Chen et al. (2020), we apply LIWC to the concatenated tweets of individuals. Figure 3 shows the linguistic profiles for the tweets of DP and ND. Both DP and ND exhibit a slightly positive tone, with negligible differences. The tweets of ND show more analytical thinking, more clout, and less authentic expression than those of DP. The tweets of DP score higher in both positive and negative emotion categories than those of ND, which suggests a higher degree of immersion (Holmes et al., 2007). Moreover, the tweets of DP also show more anxiety and anger emotions and include more “swear” words, consistent with the findings of Coppersmith et al. (2014). Death-related words appear more frequently in the tweets of DP, which echoes Stirman and Pennebaker (2001). Similar to Coppersmith et al. (2014) and Stirman and Pennebaker (2001), we find more first-person singular pronouns in the tweets of DP.

6 https://liwc.wpengine.com/how-it-works/

We also find that the tweets of DP express more sadness and use words related to biological processes more frequently. While there is no clear link between biological-process-related words and depression, this finding suggests that people who suffer from depression may pay more attention to their biological states. The “power” score for the tweets of ND is higher, which reflects a higher need for power according to the findings of McClelland (1979). By comparing the “work” scores of DP and ND, we find that the users of ND pay more attention to work-related issues as well.

Figure 3: Linguistic profiles for the DP/ND tweets.

3.5 Social Media Engagement

We use the proportion of tweets with mentions, the number of responses, unique user mentions, user mentions, and tweets to measure the social media engagement of each user, as did Coppersmith et al. (2014). Our analysis details are reported in Appendix A.


4 Model

4.1 Tweet Chunking and Preprocessing

We perform stratified random sampling on our dataset. We first sample 500 users to form our testing set. On the rest of the users, we progressively add users to the training sets and record the performance of the models trained on sets of 1,000, 2,000, and 4,650 users, respectively. All the training and testing sets have a 1:1 DP-ND ratio.

Jamil et al. (2017) have shown that one single tweet does not contain enough signals to determine whether a user has depression. Thus, we concatenate consecutive tweets of the same user to create tweet chunks of 250 words and label the chunks based on the user’s label. We preprocess the text using the tweet preprocessing pipeline proposed by Baziotis et al. (2017). We adopt this method especially for its capability of marking Twitter-specific text habits and converting them to special tokens such as “〈allcaps〉” (capitalized words), “〈elongated〉” (repeated letters), “〈repeated〉” (repeated words), etc. For example,

YESSSSS, I love it so so much!!!

after preprocessing will be in the form of

Yes 〈allcaps〉 〈elongated〉, I love it so 〈repeated〉 much! 〈elongated〉
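This transformation can be reproduced approximately with the ekphrasis library cited in Appendix B; the exact option set below is our assumption, since the paper specifies only the annotation tokens:

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    normalize=["url", "email", "user", "number"],   # replace entities with tags
    annotate={"allcaps", "elongated", "repeated"},  # emit the special tokens
    segmenter="twitter", corrector="twitter",       # Twitter word statistics
    unpack_hashtags=True,
    spell_correct_elong=True,                       # "YESSSSS" -> "yes"
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tokens = text_processor.pre_process_doc("YESSSSS, I love it so so much!!!")
print(" ".join(tokens))  # yields a token stream like the example above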

After chunking and preprocessing, each user has 6-7 text chunks on average, making the actual sizes of the 4,650-user train-val set and the 500-user testing set 29,315 and 3,105 chunks, respectively. The preprocessed tweet chunk datasets are then passed to the deep learning models for training.
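A minimal sketch of the chunk-and-label step described above (how a trailing partial chunk is handled is our assumption):

def make_chunks(tweets, user_label, chunk_size=250):
    """Concatenate a user's consecutive tweets into fixed-size word chunks,
    each inheriting the user's DP/ND label."""
    chunks, words = [], []
    for tweet in tweets:                     # tweets in chronological order
        words.extend(tweet.split())
        while len(words) >= chunk_size:
            chunks.append((" ".join(words[:chunk_size]), user_label))
            words = words[chunk_size:]
    return chunks                            # trailing partial chunk is dropped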

4.2 Deep Learning Models

We use deep learning models to perform chunk-level classification. We set up two baseline models, a multi-channel CNN and a bidirectional LSTM with context-aware attention (Attention BiLSTM), as described in Orabi et al. (2018). We use the pretrained GloVe embedding (840B tokens, 300d vectors) (Pennington et al., 2014) augmented with the special tokens added during preprocessing. The embedding weights are further trained jointly with the model. We also train three representative transformer-based sequence classification models, BERT, RoBERTa, and XLNet, with their own pretrained tokenizers augmented with the special tokens for tokenization. We choose to use the base models for all of them since we find no noticeable performance gains using their larger counterparts.
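As a sketch, the XLNet setup with the Hugging Face transformers library might look like the following; the special-token list and the label convention are assumptions, and the optimizer settings are given in Appendix B:

import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

SPECIAL_TOKENS = ["<allcaps>", "<elongated>", "<repeated>"]  # from preprocessing

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.add_tokens(SPECIAL_TOKENS)

model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased",
                                                       num_labels=2)
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

chunk = "yes <allcaps> <elongated> , i love it so <repeated> much !"
enc = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))   # 1 = DP, 0 = ND (assumed)
out.loss.backward()                            # cross-entropy loss for fine-tuning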

4.3 Signal Fusion

We run the models on all the tweet chunks of the same user and take the average of the confidence scores to get the user-level confidence score. There are 4,163/4,650 (89.5%) users remaining in the training set and 446/500 (89.2%) in the testing set whose features are all retrievable. We then pass different combinations of user-level scores (personality, VADER, demographics, engagement, LIWC, and average confidence) to machine learning classification algorithms including random forest, logistic regression, and support vector machines (SVM), provided by the scikit-learn library.7 We only use the explainable LIWC features mentioned in the data collection section for training the classifiers.
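A sketch of this fusion step with scikit-learn follows; the feature matrices and labels below are toy stand-ins for the precomputed per-user VDEPL features and chunk-level confidences:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy stand-ins: per-user VDEPL feature rows, per-user chunk confidences,
# and DP/ND labels.
rng = np.random.default_rng(0)
X_vdepl = rng.random((100, 20))
chunk_scores = [rng.random(rng.integers(4, 10)) for _ in range(100)]
y = rng.integers(0, 2, 100)

xlnet_user = np.array([s.mean() for s in chunk_scores])   # average per user
X = np.hstack([X_vdepl, xlnet_user.reshape(-1, 1)])       # VDEPL + XLNet score

for clf in (RandomForestClassifier(), LogisticRegression(max_iter=1000), SVC()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))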

4.4 Training Details

During training, we randomly split the train-val set into training and validation sets with a ratio of 9:1. We record the models’ performance on the validation set after each epoch and keep the model with the highest accuracy and F1 scores while training until convergence. Hyper-parameters and other experimental settings are described in Appendix B.

5 Experimental Results

5.1 Chunk-Level Classification

In Table 1, we report our classification results at the chunk level on the testing set. Our evaluation metrics include accuracy, F1 score, AUC, precision, and recall.8 One immediate observation is that regardless of the model type, the classification performance improves as we increase the size of our train-val set. This shows that for building depression classification models, it is imperative to have a large number of training samples. At the same time, it also confirms that the larger number of training samples in our experiments is indeed an advantage.

Another observation is the performance gain of transformer-based models over the BiLSTM and CNN models. The CNN model slightly outperforms BiLSTM, which replicates the findings of Orabi et al. (2018). We observe that BERT, RoBERTa, and XLNet invariably outperform BiLSTM and CNN regardless of the size of our training set. In particular, the XLNet model records the best AUC as well as accuracy of all models when trained with our full training set.

7 https://scikit-learn.org/stable/
8 We use 0.5 as the threshold when calculating the scores.


Model             Train-Val Set  Accuracy  F1    AUC   Precision  Recall
Attention BiLSTM  1k users       70.7      69.0  76.5  70.9       67.3
Attention BiLSTM  2k users       70.3      68.3  77.4  70.7       66.1
Attention BiLSTM  4.65k users    72.7      71.6  79.3  72.1       71.1
CNN               1k users       71.8      72.6  77.4  72.7       72.6
CNN               2k users       72.8      74.5  80.3  72.2       76.9
CNN               4.65k users    74.0      70.9  81.0  77.4       68.9
BERT              1k users       72.7      74.4  79.8  72.0       76.9
BERT              2k users       75.7      76.3  82.9  76.1       75.7
BERT              4.65k users    76.5      77.5  83.9  76.3       78.8
RoBERTa           1k users       74.4      75.7  82.0  74.2       77.3
RoBERTa           2k users       75.9      77.9  83.2  73.8       82.5
RoBERTa           4.65k users    76.2      78.0  84.1  74.4       81.9
XLNet             1k users       73.7      75.1  80.7  73.2       77.2
XLNet             2k users       74.6      76.8  82.6  72.6       81.5
XLNet             4.65k users    77.1      77.9  84.4  77.5       78.3

Table 1: Chunk-level performance (%) of all 5 different models using training-validation sets of different sizes.

5.2 User-Level Classification

Features            Accuracy  F1    AUC
VADER               54.9      61.7  54.6
Demographics        58.7      56.0  61.4
Engagement          58.7      62.3  61.7
Personality         64.8      67.8  72.4
LIWC                70.6      70.8  76.0
V+D+E+P+L           71.5      72.0  78.3
XLNet               78.1      77.9  84.9
All (Rand. Forest)  78.4      78.1  84.9
All (Log. Reg.)     78.3      78.5  86.4
All (SVM)           78.9      79.2  86.1

Table 2: User-level performance (%) using different features. We use SVM for classifying individual features.

Next, we report our experimental results at the user level. Since XLNet trained on the 4,650-user dataset outperforms the other models, we take it for the user-level performance comparison. Our experimental results demonstrate a large increase in the user-level scores of XLNet shown in Table 2 compared to the chunk-level scores shown in Table 1. This indicates that more textual information about a user yields more reliable results in determining whether the user has depression. Building on the user-level XLNet scores, we further include VADER (V), demographics (D), engagement (E), personality (P), and LIWC (L) scores as signals. We first use all features and compare the performance of random forest, logistic regression, and SVM. We notice that SVM achieves the best scores on accuracy and F1, slightly surpassing logistic regression. Thus, we use SVM for testing the performance when using subsets of the collected features.

The results are shown in Table 2. Using VADER, demographics, or social media engagement features alone does not help the classification by much. Classifiers using personality features and LIWC features perform relatively better. We then use all five of these feature groups and obtain a better result (accuracy 71.5%, F1 72.0%). However, this classifier is still outperformed by XLNet, showing that the transformer-based models indeed work better on depressive Twitter text modeling than the other approaches. We further increase the classifier’s performance by using all the features, namely the VDEPL features and the averaged XLNet confidence score; the performance of the three ML algorithms does not vary much, and the SVM classifier achieves the best accuracy (78.9%) and F1 (79.2%) scores.

Figure 4: Permutation importance of different features, grouped into Personality, VADER, Social Engagement, and LIWC.

In an attempt to investigate which specific textual features, besides those extracted by XLNet, have the most impact on depression classification, we calculate the permutation feature importance (Altmann et al., 2010) on the trained random forest classifier using the VEPL features with 10 repeats. The importance scores of individual features are shown in Figure 4. Among the LIWC features, “i”, “bio”, “power”, “sad”, “authentic”, “clout”, and “analytic” are shown to be important in classification. Among the five personality features, “conscientiousness” and “neuroticism” are shown to be closely related to depression cues. We do not observe a strong relation between VADER sentiment features or social media engagement features and the depression signals. As for the LIWC sentiment features, only “sad” and “anxiety” are shown to be relatively important; we hypothesize that the ideas a person tries to express on Twitter are more related to depression cues than his/her sentiments, which are likely to have only short-term effects.
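This computation maps directly onto scikit-learn’s permutation_importance with 10 repeats; a sketch, with toy stand-ins for the real VEPL feature matrix and labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy stand-ins for the real VEPL feature matrix and DP/ND labels.
rng = np.random.default_rng(0)
feature_names = ["neuroticism", "conscientiousness", "i", "bio", "sad"]
X_vepl = rng.random((200, len(feature_names)))
y = rng.integers(0, 2, 200)

rf = RandomForestClassifier(random_state=0).fit(X_vepl, y)
result = permutation_importance(rf, X_vepl, y, n_repeats=10, random_state=0)

for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.4f}")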

6 Applications

In this section, we report two COVID-19-related applications of our XLNet-based depression classifier: (a) monitoring the evolution of depression level in the depression group and the non-depression group, and (b) monitoring the depression level at the U.S. country level and state level during the pandemic.

6.1 Depression Monitoring on the DP/ND Groups

We take the 500 users from the testing set along with their tweets from January 1st, 2020 to May 22nd, 2020. We concatenate a user’s tweets consecutively from January 1st, one by one, until reaching 250 words, and label this chunk’s date as the date of the tweet in the middle of the chunk. We group every three days into a bin starting from January 1st and assign the chunks to the bins according to the labeled date. We run the XLNet model on the preprocessed tweet chunks and record the confidence scores. We trim the upper and lower 10% of the data to reduce the skew in the score distribution. We then take the mean of the scores for each time bin and plot the depression trend shown in Figure 5a. We further take a moving average over 5 time bins to smooth the curves.

We mark three important time points on the plot: the first confirmed case of COVID-19 in the U.S. (January 21st), the U.S. National Emergency announcement (March 13th), and the last stay-at-home order issued (South Carolina, April 7th). In January, both groups experience a drop in depression scores. This may be explained by the fact that people’s mood usually hits its lowest in winter (Thompson et al., 1988). From the day of the first confirmed case in the U.S. to the day of the announcement of the U.S. National Emergency, the trends of DP and ND differ. The depression level of DP goes down slightly while the depression level of ND goes up. Aided by psychological findings, we hypothesize that depressive users are less affected by negative events happening in the outside world because they focus on their own feelings and life events, since (a) they are mostly affected by negative events that threaten them directly (Yue et al., 2016), and (b) more interactions with the outside world give them more negative feedback (Winer and Salem, 2016). Moreover, the depression levels of DP and ND both increase after the announcement of the U.S. National Emergency.

To better understand the trend, we apply the LDA model to retrieve the topics before and after the announcement of the U.S. National Emergency. Each chunk of tweets is assigned 5 weights for the 5 topics, respectively. We label the topic with the highest weight as the dominant topic of the chunk and count the frequency of each topic, shown in Figure 5b. Details about the keywords of the topics are reported in Appendix C. Before the announcement, the 2 most frequent topics of both DP and ND are discussions about U.S. President Donald Trump and about school and work. The third most frequent topic of ND is about health, while that of DP is about entertainment. This supports the difference between the depression level trends of the two groups. After the announcement of the U.S. National Emergency, the most frequent topic of DP is depression and anxiety during COVID-19, while this is only the third most frequent topic of ND. Further, all 5 topics of each group are about COVID-19. This shows that when people mostly talk about COVID-19, depression signals rise for both groups.

6.2 Aggregated Depression in COVID-19

To investigate country-level and state-level depression trends during COVID-19, we randomly sample users who have U.S. state locations stated in their profiles and crawl their tweets between March 3rd, 2020 and May 22nd, 2020, the period right before and after the U.S. announced a National Emergency on March 13th. Using the same logic as in Section 6.1, we plot the change of depression scores of all 9,050 geo-located users as the country-level trend. For state-level comparison, we plot the aggregated scores of three representative states: New York, the economic center on the East Coast that is highly affected by the virus; California, the tech center on the West Coast that is also struck hard by the virus; and Florida, the less affected tourism center in the southeast. Each selected state has at least 550 users in the dataset to validate our findings. Their depression levels are shown in Figure 5c.

Figure 5: (a) Aggregated depression level trends of DP users and ND users from January 1st, 2020 to May 22nd, 2020. We use different y-axes for the two groups in order to compare them side by side. (b) Topics of DP and ND before and after the announcement of the U.S. National Emergency. (c) Aggregated depression level trends of the U.S., NY, CA, and FL from March 3rd, 2020 to May 22nd, 2020. (d) Top 5 state-level topics after the announcement of the U.S. National Emergency.

The first observation from the plot is that the depression scores of all three states and the U.S. overall behave similarly during the pandemic; they experience a decrease right before the National Emergency, a steady increase after that, a slight decrease past April 23rd, and another sharp increase after May 10th. We also notice that the overall depression score of Florida is significantly lower than the U.S. average as well as those of the other two states. Since Florida has a lower score both before and after the virus outbreak, we hypothesize that it has a lower depression level overall compared to the average U.S. level, irrespective of the pandemic.

We calculate the topics at the state level after the announcement of the U.S. National Emergency. As shown in Figure 5d, the most frequent topic is the government’s policy on COVID-19. California and Florida are the states that pay relatively more attention to this topic compared to the U.S. average and New York. Florida also talks more about the life change during COVID-19. Another finding is that people in New York talk more about hospital news, likely because the state contains the majority of cases in the country by May 22nd.9

7 Conclusion

COVID-19 has infected over five million people globally, virtually bringing the whole world to a halt. During this period, social media witnessed a spike in depression terms. Against this backdrop, we have developed transformer-based models trained with by far the largest dataset on depression. We have analyzed our models’ performance in comparison to existing models and verified that the large training set we compile is beneficial to improving the models’ performance. We further show that our models can be readily applied to monitoring the stress and depression trends of targeted groups over geographical entities such as states. We notice significant increases in depression signals as people talk more about COVID-19. We hope researchers and mental health practitioners find our models useful and that this study can raise awareness of the mental health impacts of the pandemic.

9 https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html


References

Andre Altmann, Laura Tolosi, Oliver Sander, and Thomas Lengauer. 2010. Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347.

William Armstrong. 2018. Using topic models to investigate depression on social media. Technical report, University of Maryland, USA. Scholarly paper.

R Michael Bagby, Deborah R Schuller, Anthony J Levitt, Russell T Joffe, and Kate L Harkness. 1996. Seasonal and non-seasonal depression and the five-factor model of personality. Journal of Affective Disorders, 38(2-3):89–95.

Scott R Baker, Nicholas Bloom, Steven J Davis, and Stephen J Terry. 2020. Covid-induced economic uncertainty. Technical report, National Bureau of Economic Research.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada. Association for Computational Linguistics.

Long Chen, Hanjia Lyu, Tongyu Yang, Yu Wang, and Jiebo Luo. 2020. In the eyes of the beholder: Sentiment and topic analyses on social media use of neutral and controversial terms for COVID-19. arXiv preprint arXiv:2004.10225.

Glen Coppersmith, Mark Dredze, and Craig Harman. 2014. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 51–60.

Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell. 2015. CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 31–39.

Munmun De Choudhury, Scott Counts, and Eric Horvitz. 2013a. Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference, pages 47–56.

Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013b. Predicting depression via social media. In Seventh International AAAI Conference on Weblogs and Social Media.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Nuno Fernandes. 2020. Economic effects of coronavirus outbreak (COVID-19) on the world economy. Available at SSRN 3557504.

Danielle Holmes, Georg W Alpers, Tasneem Ismailji, Catherine Classen, Talor Wales, Valerie Cheasty, Andrew Miller, and Cheryl Koopman. 2007. Cognitive and emotional processing in narratives of women abused by intimate partners. Violence Against Women, 13(11):1192–1205.

Clayton J Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Hazel Inskip, Clare Harris, and Brian Barraclough. 1998. Lifetime risk of suicide for affective disorder, alcoholism and schizophrenia. The British Journal of Psychiatry, 172(1):35–37.

Zunaira Jamil, Diana Inkpen, Prasadith Buddhitha, and Kenton White. 2017. Monitoring tweets for depression to detect at-risk users. In Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 32–40.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

David C McClelland. 1979. Inhibited power motivation and high blood pressure in men. Journal of Abnormal Psychology, 88(2):182.

John Mirowsky and Catherine E Ross. 1992. Age and depression. Journal of Health and Social Behavior, pages 187–205.

Moin Nadeem. 2016. Identifying depression on Twitter. arXiv preprint arXiv:1607.07384.

Maria Nicola, Zaid Alsafi, Catrin Sohrabi, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, Maliha Agha, and Riaz Agha. 2020. The socio-economic implications of the coronavirus and COVID-19 pandemic: A review. International Journal of Surgery.

Ahmed Husseini Orabi, Prasadith Buddhitha, Mahmoud Husseini Orabi, and Diana Inkpen. 2018. Deep learning for depression detection of Twitter users. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 88–97.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. 2004. Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18(8):1121–1133.

Lay San Too, Matthew J Spittal, Lyndal Bugeja, Lennart Reifels, Peter Butterworth, and Jane Pirkis. 2019. The association between mental disorders and suicide: A systematic review and meta-analysis of record linkage studies. Journal of Affective Disorders.

Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui Hu, Tat-Seng Chua, and Wenwu Zhu. 2017. Depression detection via harvesting social media: A multimodal dictionary learning solution. In IJCAI, pages 3838–3844.

Shannon Wiltsey Stirman and James W Pennebaker. 2001. Word use in the poetry of suicidal and non-suicidal poets. Psychosomatic Medicine, 63(4):517–522.

Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54.

Chris Thompson, Deborah Stinson, Margaret Fernandez, Jeffrey Fine, and Geoffrey Isaacs. 1988. A comparison of normal, bipolar and seasonal affective disorder subjects using the seasonal pattern assessment questionnaire. Journal of Affective Disorders, 14(3):257–264.

Sho Tsugawa, Yusuke Kikuchi, Fumio Kishino, Kosuke Nakajima, Yuichi Itoh, and Hiroyuki Ohsaki. 2015. Recognizing depression from Twitter activity. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3187–3196.

NWJ Wainwright and Paul G Surtees. 2002. Childhood adversity, gender and depression over the life-course. Journal of Affective Disorders, 72(1):33–44.

Cuiyan Wang, Riyu Pan, Xiaoyang Wan, Yilin Tan, Linkang Xu, Cyrus S Ho, and Roger C Ho. 2020. Immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (COVID-19) epidemic among the general population in China. International Journal of Environmental Research and Public Health, 17(5):1729.

Zijian Wang, Scott A. Hale, David Adelani, Przemyslaw A. Grabowicz, Timo Hartmann, Fabian Flöck, and David Jurgens. 2019. Demographic inference and representative population estimates from multilingual social media data. In Proceedings of the 2019 World Wide Web Conference. ACM.

E Samuel Winer and Taban Salem. 2016. Reward devaluation: Dot-probe meta-analytic evidence of avoidance of positive information in depressed persons. Psychological Bulletin, 142(1):18.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Eisho Yoshikawa, Toshiatsu Taniguchi, Nanako Nakamura-Taira, Shin Ishiguro, and Hiromichi Matsumura. 2017. Factors associated with unwillingness to seek professional help for depression: a web-based survey. BMC Research Notes, 10(1):673.

Li Yue, Zhang Dajun, Liang Yinghao, and Hu Tianqiang. 2016. Meta-analysis of the relationship between life events and depression in adolescents. Journal of Pediatric Care, 2(1):1–13.

A Supplementary Data Statistics

In Table 3, we report the summary statistics of the personality estimates. We observe that all the estimates fall between 0 and 1. The standard deviations range from 0.24 to 0.28. Openness has the highest mean value at 0.61 and conscientiousness has the lowest mean value at 0.28.

                   Count  Min   Max   Mean  SD
Openness           4697   0.00  1.00  0.61  0.28
Conscientiousness  4697   0.00  1.00  0.28  0.26
Extraversion       4697   0.00  1.00  0.32  0.24
Agreeableness      4697   0.00  1.00  0.30  0.26
Neuroticism        4697   0.00  1.00  0.56  0.28

N = 4697

Table 3: Summary statistics of personality scores of users in the data set. SD stands for standard deviation.

In Table 4, we further report the correlation coefficients between the personality variables. We observe that extraversion is highly correlated with conscientiousness and agreeableness (correlation coefficients > 0.45). Meanwhile, neuroticism is negatively correlated with openness, conscientiousness, and extraversion.

               Openness    Conscientious  Extraversion  Agreeableness  Neuroticism
Openness       1
Conscientious  0.238***    1
Extraversion   0.265***    0.475***       1
Agreeableness  -0.0768***  0.441***       0.469***      1
Neuroticism    -0.232***   -0.279***      -0.285***     0.0460**       1

* p < 0.05, ** p < 0.01, *** p < 0.001

Table 4: Correlation between the personality variables.

To better understand the difference in social media engagement between DP and ND, we add 0.1 to the number of responses, unique user mentions, user mentions, and tweets, and take the logarithm.


Figure 6: Density plots of the proportion of tweets with mentions and the log-scale numbers of responses, unique user mentions, user mentions, and tweets.

The density comparisons are shown in Figure 6. By applying the Mann-Whitney rank test, we find that except for the number of unique user mentions, all features are statistically different (p < 0.05) between DP and ND. The users of DP post more tweets and reply more. They tend to post fewer tweets with mentions, while the number of users mentioned is larger for DP, which suggests that when users of DP post tweets to interact with other users, more users are involved.
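The test itself is a one-liner with scipy; a sketch with toy stand-ins for the per-user engagement counts of the two groups:

import numpy as np
from scipy.stats import mannwhitneyu

def log_scale(counts):
    """Add 0.1 before taking the logarithm, as described above."""
    return np.log(np.asarray(counts, dtype=float) + 0.1)

# Toy stand-ins for per-user response counts in each group.
rng = np.random.default_rng(0)
dp_responses = rng.poisson(12, 500)
nd_responses = rng.poisson(9, 500)

stat, p_value = mannwhitneyu(log_scale(dp_responses), log_scale(nd_responses))
print(f"p = {p_value:.4g}")  # p < 0.05 indicates a significant difference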

B Hyper-Parameters and Other Experimental Settings

We manually select the hyper-parameters that give the best accuracy and F1 scores for the deep learning models. We use the Adam optimizer with a learning rate of 7e-3 and weight decay of 1e-4 for training the Attention BiLSTM. We use the Adam optimizer with a learning rate of 5e-4 for training the CNN. We use the AdamW optimizer with a learning rate of 2e-5 for training BERT and RoBERTa, and 8e-6 for training XLNet. We use the cross-entropy loss for all our models during training. We use the SGD optimizer with an adaptive learning rate (initial learning rate 0.1) for training the SVM and logistic regression classifiers.

The experiments on the deep learning models are performed on a single NVIDIA Tesla T4 GPU. We record the models’ numbers of parameters and their average training time per epoch on the 4,650-user training set. The BiLSTM and CNN models have 23 million and 57 million parameters, respectively (including the embedding weights). We train the BiLSTM with a batch size of 32, and it takes 27 seconds per epoch. We train the CNN with a batch size of 32, and it takes 39 minutes per epoch. The transformer-based models (BERT, RoBERTa, XLNet) have 110 million parameters, and their speeds do not vary much. We train these models with a batch size of 8, and they take 1 hour 37 minutes per epoch. We train the SVM, logistic regression, and random forest classifiers on CPU.

We use the ekphrasis Python library10 for the majority of the preprocessing and the transformers library11 for building our transformer-based models.

C LDA Topics

We use latent Dirichlet allocation (LDA) as provided by the gensim library12 to model the topics of the tweets. To better understand the topics, we remove all adjectives, adverbs, and verbs from the text, keeping only the nouns. The number of topics is set to 5. In total, we train three LDA models: one for tweets before the announcement of the U.S. National Emergency, one for tweets after the announcement, and one for national- and state-level tweets between March 3rd, 2020 and May 22nd, 2020. Table 5 shows the top 15 keywords for each topic.
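A sketch of this setup with gensim is given below; the noun-only documents are toy stand-ins for the real tweet chunks:

from gensim import corpora
from gensim.models import LdaModel

# Toy noun-only documents standing in for the real tweet chunks.
docs = [["depression", "anxiety", "covid", "home"],
        ["trump", "president", "virus", "vote"],
        ["quarantine", "house", "game", "video"]] * 50

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, random_state=0)

# Dominant topic of a chunk = the topic with the highest weight.
dominant = max(lda.get_document_topics(corpus[0]), key=lambda tw: tw[1])[0]

for topic_id, words in lda.show_topics(num_topics=5, num_words=15, formatted=False):
    print(topic_id, [w for w, _ in words])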

10 https://pypi.org/project/ekphrasis/
11 https://github.com/huggingface/transformers
12 https://pypi.org/project/gensim/


Model 1: Before Announcement of U.S. National Emergency

School and Work: time, day, people, anyone, thing, support, work, man, class, today, school, week, watch, love, dog
Health: day, world, time, week, thank, today, ko, health, hope, video, news, something, hell, job, people
Trump Discussions: people, life, time, day, work, love, trump, lot, man, today, person, someone, thing, way, hope
Cancer: time, year, people, day, work, way, tweet, city, cancer, head, friend, death, today, love, problem
Entertainment: love, people, time, way, song, something, day, man, nothing, everyone, tonight, year, today, game, guy

Model 2: After Announcement of U.S. National Emergency

Hoarding related to COVID-19: time, dog, people, man, everything, call, work, way, covid, hope, news, food, night, thank, someone
Quarantine and Depression: time, work, love, day, people, something, man, thank, thing, hope, everyone, life, quarantine, house, home
Chinese News about COVID-19: china, people, time, love, day, street, song, name, person, trump, news, government, expert, virus, dey
Depression and Anxiety during COVID-19: people, today, time, day, covid, work, home, love, virus, man, hope, tweet, thing, depression, everyone
Work from Home: day, people, time, love, everyone, job, quarantine, today, year, thing, life, home, way, week, video

Model 3: State-level Topics

Life Change during COVID-19: day, today, march, morning, week, life, run, friend, show, death, man, place, mayor, order, change
Quarantine: day, time, love, quarantine, today, ass, night, year, thank, tomorrow, house, video, game, life, miss
Hospital News: covid, state, home, work, health, news, county, week, hospital, testing, today, day, world, time, order
Government’s policy on COVID-19: trump, people, president, time, virus, care, vote, country, china, money, need, medium, nothing, job, everyone
Family and Work: time, people, love, man, way, job, thank, day, work, lot, someone, something, thing, family, today

Table 5: Top 15 keywords for each topic