Thesis Release Permission Form

Northeastern University
College of Computer and Information Science

Title:

Predicting Satisfaction with Life from Facebook Features

© 2015 Susan Katrina Collins

The author hereby grants to Northeastern University and The Charles Stark Draper Laboratory, Inc. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

© Copyright 2015 by Susan Katrina Collins

All Rights Reserved

Disclaimer

The views expressed in this article are those of the author and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government.

This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States.

Dedication

To my wife, whose love and support encouraged me to do things I never dreamed. Thank you for always being there through the late nights, early mornings, and never-ending school hours. To all my friends and family, thank you for your understanding and encouragement through this adventure.

Acknowledgments

I would like to express my sincere gratitude to my Northeastern advisor, Yizhou Sun, for her guidance and technical expertise in Data Mining and Machine Learning. To my Draper advisor, Natasha Markuzon, thank you for your help; our partnership has taught me many lifelong lessons about perseverance and resilience that I will carry with me in my professional and personal life. To the Draper Laboratory and Education Staff, I am forever grateful for the research and collaboration opportunity granted to me. A special thanks to Dr. Michal Kosinski and Dr. David Stillwell for giving me guidance from a psychology perspective and allowing me to collaborate on the myPersonality project. Finally, an incredible thanks to TSgt Cassie Beauchene, who gave me insight into real-world operations and how my research could help others. Her continued enthusiasm and support helped keep this project going.

Abstract

Predicting Life Satisfaction with Facebook Features

Thesis

Susan Katrina Collins

Supervising Professor: Dr. Yizhou Sun

Social media can be beneficial in detecting early signs of emotional difficulty. We utilized the Satisfaction with Life (SWL) index as a cognitive health measure and presented models to predict an individual’s SWL. Our models considered ego, temporal, and link Facebook features collected through the myPersonality.org project. We demonstrated the strong correlation between Big Five personality features and SWL, and we used this insight to build two-step Random Forest Regression models from ego features. As an intermediate step, the two-step model predicts Big Five features that are later incorporated in the SWL prediction models. We showed that the two-step approach predicted SWL more accurately than one-step models. By incorporating temporal features we demonstrated that “mood swings” do not affect SWL prediction and confirmed SWL’s high temporal consistency. Strong link features, such as the SWL of top friends or significant others, increased prediction accuracy. Our final model incorporated ego features, predicted personality features, and the SWL of strong links. The final model outperformed previous research on the same dataset by 45%.

Contents

Disclaimer
Dedication
Acknowledgments
Abstract

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Thesis Organization

2 Background
  2.1 Satisfaction with Life and Subjective Well Being
    2.1.1 Satisfaction With Life Questionnaire
    2.1.2 Reliability, Stability and Validity of Satisfaction with Life Scales
    2.1.3 Making Satisfaction with Life Judgments
    2.1.4 Variability of Life Satisfaction
  2.2 Social Media and Facebook
    2.2.1 Facebook Validity
  2.3 Big Five Personality

3 Related Work
  3.1 Satisfaction with Life in Social Media
    3.1.1 Static Ego Features
    3.1.2 Link Features
    3.1.3 Summary

4 Data Description
    4.0.4 Static Ego Features
    4.0.5 Temporal Features
    4.0.6 Link Features
    4.0.7 Satisfaction With Life Score
    4.0.8 Sample Size

5 Methodology
  5.1 Machine Learning Models
    5.1.1 Decision Trees
    5.1.2 Random Forest Regression (RFR)
    5.1.3 Linear Regression
  5.2 Model Creation
  5.3 Feature Extraction
    5.3.1 Facebook “Likes”
    5.3.2 Facebook Status Updates
    5.3.3 Friendships
    5.3.4 Partnerships
  5.4 Feature Selection
    5.4.1 Pearson’s Correlation Coefficient (R)
    5.4.2 R-squared (R2)
    5.4.3 Feature Importance with Random Forest Regression
  5.5 Model Types
    5.5.1 Static Models
    5.5.2 Temporal Models
    5.5.3 Link Models
    5.5.4 Combined Models
  5.6 Handling Missing Data
  5.7 Model Populations
  5.8 Evaluation methods
    5.8.1 Mean Absolute Error
    5.8.2 Random Baseline
    5.8.3 Precision
    5.8.4 Linear Regression Baseline
    5.8.5 10-Fold Cross Validation

6 Results and Discussion
  6.1 Feature Selection
    6.1.1 Pearson Correlation
    6.1.2 Wrapper Method
  6.2 Static Models
    6.2.1 Predicted Big Five as an Intermediate
    6.2.2 Predicted Big Five Models
    6.2.3 Combined Static Models
    6.2.4 Summary
  6.3 Temporal Models
  6.4 Link Models
  6.5 Combined Static and Link Models

7 Conclusions
  7.1 Conclusion
  7.2 Future Work

Bibliography

A Supplemental Graphs
  A.1 Big Five Density

B Pearson Correlations
  B.1 LIWC Correlations
  B.2 Ego Feature Correlations
  B.3 Big5 Correlations

C Data Descriptions
  C.1 General Acronyms
  C.2 LIWC Category Description
  C.3 LDA “Like” Topic Descriptions

D Algorithms
  D.1 Decision Tree

List of Tables

4.1 Data Description. Note that the number of samples is based on users who have an SWL score.

4.2 One-hot encoding example for 4 categories.

5.1 Summary of sample sizes for each model.

6.1 Summary of correlations between a feature and SWL. R refers to Pearson’s R, whereas R’ is the Pearson’s R for averaged features calculated for SWL values. SE is the standard error for the regression when raw features are used. SE’ is the standard error when averaged features are used. Features shown have correlation with SWL and a standard error of 1.5.

6.2 Correlation coefficients for linear regression using relationship status as the explanatory variable and SWL as the target variable. Note that relationship types where n 100 were not included.

6.3 Positively and negatively correlated “Like” categories grouped into 600 categories using LDA. Influence of categories is listed in descending order.

6.4 Static Ego Models

6.5 Important Facebook demographic features for predicting Big Five as identified with RFR, n = 30,766

6.6 Important LIWC features for predicting Big Five scores as identified with RFR, n = 115,874

6.7 Important “Like” features for predicting Big Five scores identified with RFR, n = 92,255. Note that the categories are summarized by the most prominent features.

6.8 Static Ego Model Results

6.9 Combined Static Ego Model Results

6.10 Temporal Model Results

6.11 Link Model Results

6.12 Combined Static and Link Models Results

List of Figures

2.1 Histograms showing the density of the Big Five features for users who have an SWL score. n = 86,073

4.1 This graph shows the percentage of users who have a particular feature and SWL score. For example, 85% of users who have an SWL score also have a Big Five score.

5.1 Example of a decision tree for determining SWL score. Branches correspond to the values of the attributes and leaves indicate prediction value. Note that the full decision tree is not shown.

5.2 Prediction Model Approach. First, features are extracted from Facebook. These features fall into three main categories: static ego, temporal, and link. Then they are analyzed and used to create RFR models to predict SWL. Highly dimensional features are reduced to Big Five and then all models are combined into a final model.

5.3 Process for extracting friendships

5.4 Process for extracting friendships

5.5 Process for extracting significant other relationships

5.6 Distribution of SWL for complete samples. Sample features include: Big Five, Network Size, Number of Photo Tags, Relationship Status, and Age. n = 8,608

6.1 Graphs of SWL versus the mean of each Big Five personality score. Pearson’s R is annotated in the top right corner of the graphs.

6.3 Graphs of averaged demographic features versus SWL. Network Size shows the strongest correlation (R’).

6.4 Pearson Correlation between averaged “common” Facebook features and Big Five personality scores. The most highly correlated features are shown.

A.1 Histograms showing the density of the Big Five features for all users in the myPersonality dataset. n = 3,137,694

Chapter 1

Introduction

Have you ever Googled “happiness”? If you have, then you have noticed that there are about 325,000,000 results and counting. The results for happiness are so abundant because of its relevance to all aspects of life. Happiness is often used interchangeably with an individual’s satisfaction with life due to its all-encompassing relation to self-worth and subjective well-being. However, satisfaction with life has proven to be a dauntingly hard concept to express and obtain, due to its extreme depth and varying quantitative accuracy. Regardless of these challenges, many nations across the world have spent considerable time and resources attempting to obtain happiness and satisfaction with life for their people. The question is, how can something so abstract be compared and measured over time? How can these countries better utilize their resources to achieve these goals? Could social media be the answer? This thesis explains some of the possibilities behind social media data, and how it can be utilized and manipulated to achieve such a difficult task as predicting satisfaction with life.

1.1 Motivation

The motivation for creating a satisfaction with life (SWL) prediction model is two-fold. The first motivation arises from individual well-being. SWL is not an epiphenomenon [33]. Its reach into all aspects of life can be seen from both a cognitive and a biological viewpoint. Studies have shown that SWL can predict depression [43], occupational functioning [46], and successful interpersonal relationships [28]. Other studies have shown its biological connection to health and longevity [17]. By predicting individual SWL from social domains, perhaps early warning schemes could be developed for those in distress. For instance, there have been several detailed cases of Facebook users who posted suicide notes before committing the act [57]. Is it possible to identify early signs of these ailments, perhaps before they lead to drastic measures like suicide? Such a pursuit is certainly backed by a noble cause and worth further investigation.

The second motivation emerges from the desire to have a “happy” community. Since the 19th century, the conviction of self-actualization has driven social reform throughout the world [68], causing societies to strive for personal happiness. In the Utilitarian creed, the best society is believed to be one that provides ‘the greatest happiness for the greatest number’ [3]. For this reason, numerous societies, such as the European Union, have conducted large-scale efforts to understand SWL in order to improve the quality of life for their people. In 2010, David Cameron, the prime minister of the United Kingdom, asked the Office of National Statistics to survey the nation for its life satisfaction as part of a £2m/year well-being project [35]. Clearly, the identification and understanding of SWL has attracted national and international attention, making it a noteworthy pursuit.

The question is, how can we accomplish such a task using minimal resources and still obtain the greatest accuracy? One interesting consideration is social media. With the ubiquity of social media, research in data mining, natural language processing, and other computational sciences has grown dramatically [45]. People are posting about their lives, family, and social interactions, making sites like Twitter, Facebook, and LinkedIn gold mines for data. In 2013, it was estimated that 74% of online adults use Facebook [22]. In other words, these users have already accomplished the tedious, time- and resource-consuming work of cataloging their interactions for us.

The challenge now is to use these resources effectively and transform them from raw data into interpretable knowledge. However, it is unclear whether domains like social media are appropriate sources for predicting psychological states like SWL. Although research in SWL has found many correlations with factors such as personality, cultural background, and demographic information, predicting SWL from social media is still developing [19, 21, 43, 40, 49].

1.2 Goals

The goal of this thesis is to select and extract features from social media in order to create a prediction model for an individual’s SWL. Specifically, this thesis uses data from Facebook because of its popularity and high user activity [22]. The prediction models focus on common Facebook features so that they generalize to a wider group of Facebook users. Common Facebook features are information collected by Facebook itself rather than by third-party applications. This thesis not only creates a prediction model for SWL, but also identifies and validates different features that affect life satisfaction.

1.3 Thesis Organization

This thesis is organized as follows: Chapter 2 discusses the background of SWL and subjective well-being (SWB), social media, and Big Five personality; Chapter 3 reviews work related to SWL and focuses primarily on studies of online content; Chapter 4 describes the data used in this research; Chapter 5 details the machine learning methods used for prediction, the methodology for building a prediction model for SWL, and the evaluation methods; Chapter 6 presents the results of the prediction models and identifies the strong correlates of SWL; finally, Chapter 7 ends the thesis with conclusions and future work.

Chapter 2

Background

2.1 Satisfaction with Life and Subjective Well Being

Satisfaction with Life (SWL) is a component of the Subjective Well-Being (SWB) measure. It is defined as a cognitive judgmental process in which individuals evaluate their lives according to their own criteria. In layman’s terms, we call this “happiness” or “satisfaction” [64]. Since the 1970s, SWL and SWB have been studied quantitatively by social scientists [20, 51]. The study of human well-being from a positive orientation, rather than from the traditional perspective of treating mental illness, has served as a complement to traditional treatments for mental illness [51]. By creating questionnaires, scientists have captured quantitative scores for life satisfaction. These scores were then used to gain a more robust understanding of a patient’s life and of how to prevent pathologies rather than eliminate them [63].

2.1.1 Satisfaction With Life Questionnaire

There are multiple scales by which life satisfaction can be measured. This experiment uses the Satisfaction with Life Scale (SWLS). The SWLS is a short 5-item instrument designed to measure global cognitive judgments of satisfaction with one’s life [18]. The scale was developed in 1985 by Dr. Ed Diener and his colleagues and continues to be used as a tool for subjective well-being evaluation. The questions include:

1. In most ways my life is close to my ideal.

2. The conditions of my life are excellent.

3. I am satisfied with life.

4. So far I have gotten the important things I want in life.

5. If I could live my life over, I would change almost nothing.

Respondents answer each question with a value from 1 to 7, where 1 corresponds to “strongly disagree” and 7 corresponds to “strongly agree”.
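The scoring of the five items can be illustrated with a minimal sketch. It assumes the conventional SWLS aggregation, in which the five responses are summed to a total between 5 and 35, and it also reports the per-item mean on the 1 to 7 scale; this passage does not state which aggregate the thesis data uses, so both are shown. The function name `swls_score` and the sample responses are hypothetical.

```python
from statistics import mean


def swls_score(responses):
    """Aggregate five SWLS item responses (each 1-7).

    Returns a (total, average) pair: the summed score (5-35) and the
    per-item mean (1-7). Which aggregate a dataset stores is an
    assumption to verify against its documentation.
    """
    if len(responses) != 5:
        raise ValueError("the SWLS has exactly 5 items")
    if any(not 1 <= r <= 7 for r in responses):
        raise ValueError("each response must be between 1 and 7")
    return sum(responses), mean(responses)


# Hypothetical respondent, invented for illustration only.
total, average = swls_score([6, 5, 6, 4, 5])
print(total, average)
```

Validating the item count and range up front keeps malformed survey rows from silently producing out-of-range scores.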

2.1.2 Reliability, Stability and Validity of Satisfaction with Life Scales

An initial question may be, how can life satisfaction be reliably measured? Reliability of the SWL scales means achieving equivalent results when conditions remain unchanged [20]. Results are expected to change over longer periods of time; these changes reflect new life-altering experiences that can shift a person’s cognitive evaluation of SWL. However, when tests are performed repeatedly over short time intervals, the scores are expected to remain consistent.

The next question is, how reliable are life satisfaction scores in practice? Although there are many types of SWL scales, all seem to have moderate to high reliability. Cronbach’s alpha, a measure of internal consistency, tends to fall in the 0.80s for the Satisfaction with Life Scale [18]. A 2004 study conducted by Eid and Diener calculated even higher scores, between 0.90 and 0.96 [25]. This evidence supports the reliability of the SWL scales.
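For readers unfamiliar with Cronbach’s alpha, the following minimal sketch computes it from item-level responses using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. It is an illustration under stated assumptions, not the procedure used in the cited studies; the function name `cronbach_alpha` and the response matrix are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents.

    rows: one list of k item scores per respondent (at least two
    respondents and two items). Uses sample variances throughout.
    """
    k = len(rows[0])
    items = list(zip(*rows))                       # one column per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])   # variance of summed scores
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses (5 items, 1-7 scale), invented for illustration.
responses = [
    [6, 5, 6, 4, 5],
    [3, 3, 4, 2, 3],
    [7, 6, 7, 6, 6],
    [4, 4, 5, 3, 2],
]
print(round(cronbach_alpha(responses), 3))
```

Alpha approaches 1 when the items move together across respondents, which is the sense in which values in the 0.80s indicate a consistent scale.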

What about the stability of the scales? Will scores actually change over longer periods of time as life conditions change? In a study by Diener and Fujita, respondents showed a stability coefficient of 0.56 over one year. This coefficient progressively declined to 0.24 after 19 years [29]. The study indicates that SWL scores are moderately stable over the short term and shift substantially over longer periods as life conditions change.
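A stability coefficient of this kind is commonly computed as the Pearson correlation between the scores the same respondents give at two points in time. The sketch below shows that calculation; it assumes this common definition rather than the exact method of the cited study, and the function name `pearson_r` and both score lists are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical SWL totals for the same five people, one year apart.
wave_1 = [26, 15, 32, 18, 22]
wave_2 = [24, 19, 30, 16, 27]
print(round(pearson_r(wave_1, wave_2), 2))
```

A coefficient near 1 means respondents keep their relative ordering between waves; lower values over longer intervals indicate that the ordering has shifted.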

The previous examples show the stability and reliability of the SWL scale, but how valid is it? It is common for society to compare the life conditions and social status of individuals. Poor life conditions and low social status tend to be highly correlated with low SWL. Studies that surveyed the SWL of groups such as prisoners [37], the homeless [69], PTSD patients with brain injury [8], and prostitutes [2] have consistently shown lower satisfaction with life, whereas participants surveyed in better circumstances, such as the wealthy, have higher SWL scores [24]. These are reasonable and expected outcomes that give a sense of qualitative validity, but there are also quantitative measures that validate the SWLS. For instance, Schneider and Schimmack reviewed forty-four studies and found a mean correlation of 0.42 between reports collected from participants and reports from informants (i.e., family and friends) [60].

2.1.3 Making Satisfaction with Life Judgments

Because this thesis focuses on predicting SWL and understanding the factors that affect SWL, it is important to understand how people make satisfaction with life judgments. When attempting to understand how individuals make these judgments, there are four facets to consider: attention and availability, relevance and values, standards, and top-down effects [20].

When referring to attention and availability, one must consider how an individual was primed before answering SWL questions. The surroundings and immediate experiences may change the focus of how a person evaluates their SWL. For example, Dermer et al. found that people who read articles about happy or sad life stories compared their lives and adjusted their SWL accordingly [14]. Other momentary factors, such as having a handicapped individual within eyesight while answering SWL questions, made respondents report more positively about their own lives [65]. With that said, it is important to acknowledge that there will be some variance in the SWL score due to the current situation; in this experiment, momentary experiences are considered by looking at word categories in Facebook status updates. For instance, if someone has a bad day at work, do they use more negative words than usual? If we can find indicators such as this, it may help to predict SWL with more precision.

The second consideration for understanding SWL judgments is relevance and values. Because people are inherently different (due to personality, culture, life situation, etc.), they place different weights on various aspects of their lives [20]. For example, in a 2011 study, researchers found that people in religious societies gave more weight to religiosity than people in less religious societies [67]. This is a limitation of the current experiment: because the data is sparse, populations are not separated into sub-cultures such as differing religious views, political views, or ethnicities.

The third consideration, standards, refers to the scale by which people judge their SWL. It is not surprising that many of these standards arise from individuals comparing their lives to others. For example, a 2009 study showed that even though income levels are rising, people still report dissatisfaction if they do not meet their aspirations [34]. In addition to comparison, a 2010 experiment proposed that there are basic human needs and values (i.e., food and shelter, safety and security, feelings of pride, self-direction, etc.) that also shape someone’s SWL judgment [32]. From a basic needs perspective, it is assumed that physical basic needs such as food and water are not concerns for Facebook users. This is because access to technologies such as Facebook is assumed to matter less to those whose basic physical needs are unmet, so such people are unlikely to use Facebook before those needs are met. Although not controlled for directly, comparison of lives is incorporated by analyzing link relationships (i.e., friendships). These link features help determine how relationships affect SWL prediction.

Finally, the fourth consideration is top-down effects. Top-down effects refer to the biases people have about their lives due to cultural orientation, life circumstance, or general outlook on life [20]. For instance, a study by Wirtz et al. found that people’s memory of their vacation was a better predictor of taking a future vacation at the same place than their recorded experience while on vacation at that location [71]. This suggests that even though people have biases about how they should feel in a situation, these biases are not necessarily a flaw in the experimental set-up, because they are often better predictors of future outcomes than recorded experience. In this thesis we incorporate demographic information such as age, gender, and personality to try to capture biases that may affect SWL prediction.

2.1.4 Variability of Life Satisfaction

With so many factors contributing to SWL, one may wonder what causes the variability. In particular, how much do long-term factors such as personality affect SWL compared with temporary moods and other situational factors? In a 2004 study [25], Eid and Diener reported that approximately 74% of the variance could be attributed to chronically accessible information, 16% to temporarily accessible information, and 10% to random error. Similarly, a 2005 study [62] found a split of 80%, 10%, and 10%. However, in a 2007 study [44], Lucas and Donnellan found that 36% of the variance was due to stable trait differences, 31% to moderately stable traits, and the remainder to random variance and occasion-specific variance. From the studies discussed, it appears that 60-80% of the variance can be attributed to long- and mid-term factors, and the remaining 20-40% is due to specific situational factors and random variance.

2.2 Social Media and Facebook

Merriam-Webster defines social media as “forms of electronic communication (as Web sites for social networking and micro-blogging) through which users create online communities to share information, ideas, personal messages, and other content.” However, in recent years social media has grown to be much more than a place for personal interactions. With the vast amount of information collected, social media has been used for business decisions [38, 39], public health [53], psychological research [50], and much more. This high interest in social media could be attributed to its rapid gain in popularity. A Pew Research Center study found that social media use by adults rose from 8% in 2005 to 74% in 2014 [22].

Social media can take on many forms, as mentioned above, but in this thesis the focus is on predicting the psychological well-being of users. Because of this, media with high self-presentation (i.e., disclosure of personal information) and rich social interactions are desired. One site that fits this description is Facebook. Facebook is by far the most popular social media site, with 71% of online adults having membership and 1.35 billion users around the world. Other sites such as LinkedIn and Pinterest are a distant second, with only 28% of online adults. Facebook’s high user activity also serves as a great benefit for social research, since approximately 70% of its users engage with the site daily; this is in contrast to other social media sites like Instagram (49%), Pinterest (17%), Twitter (36%), and LinkedIn (13%) [22]. One reason for this popularity and activity may be Facebook’s ubiquity. In recent years Facebook has extended itself to mobile platforms, third-party sites, third-party applications, and more. This essentially allows Facebook to be the center of people’s daily social media intake. For these reasons, Facebook is chosen as the platform for research into SWL.

2.2.1 Facebook Validity

As previously noted, this thesis focuses on utilizing Facebook to mine data about users in order to predict individual SWL. One major factor for using Facebook is the availability of psychometric data (e.g., Big Five scores and SWL scores) that is collected through surveys on Facebook. Although these surveys are standard measurement tools in psychology, readers may still question the ability to use Facebook for such questionnaires. This section addresses some of those concerns and drawbacks in using Facebook.

One concern a reader may have is that users are distracted or less motivated to take the psychometric surveys seriously. Although this could be true, the researchers at Cambridge University validated the data by removing inattentively answered surveys (e.g., only using a score of 1) and ensuring that the statistics gathered were similar to those of surveys done in traditional environments. Another concern a reader may have is repeat participation. For example, a user may repeat a survey in order to generate a particular output to show other users. These repeated tests are removed, or used in test-retest studies. In line with the removal of other invalid data, profiles that are deemed to be “fake” are removed by looking at network properties and whether the profile is “verified” by individuals in the user’s circle.

Probably one of the largest concerns a reader may have with collecting information from Facebook is sample bias. Indeed, this is a real issue for making general conclusions about society at large; however, there are two things to consider. First, although there may be a bias, sample sizes are generally large (up to 5.5 million users), so even under-represented groups can have many data points. Second, this thesis limits itself to predicting individual SWL from social media, and does not necessarily make generalizations to other arenas. Finally, one may be concerned about technical problems in data collection. Although such problems are possible, the Facebook data in this research has been vetted by several organizations, including Microsoft Research, Stanford, Harvard, Carnegie Mellon, Facebook, Google, and a slew of other research and educational institutions.

2.3 Big Five Personality

The Big Five features refer to the five dimensions of human personality: openness, conscientiousness, extraversion, agreeableness, and neuroticism [10]. The Big Five features were collected through the IPIP proxy for Costa and McCrae’s NEO-PI-R questionnaire and are numerical variables in the range [1.0, 5.0]. Previous research shows that the Big Five, particularly neuroticism and extraversion, correlate strongly with SWL [21]. This thesis verifies these findings and exploits the strong correlation in more robust models. A description of each Big Five dimension is given below:

• Openness (to experience): Individuals who score higher in this area tend to reflect a degree of intellectual curiosity. They tend to enjoy adventure, creativity, variety of experience, etc. This trait also depicts how imaginative or independent an individual may be.

• Conscientiousness: Individuals scoring higher in this category tend to show more organization and dependability versus carelessness and spontaneity. Conscientious individuals tend to show more self-discipline and have goals planned for achievement.

• Extraversion (energy, outgoingness, assertiveness, sociability): Individuals who score higher in this category tend to seek stimulation through the company of others. They tend to communicate and interact well in social situations.

• Agreeableness: Individuals who score higher in this category tend to display more compassion and cooperation towards others rather than suspicion or antagonism. This trait measures trustful nature and general temperament.

• Neuroticism: Individuals scoring higher in this category tend to be more sensitive and nervous rather than secure and confident. They tend to feel more vulnerable and can experience anger, anxiety, and depression more easily. This term can also be interpreted as an individual’s degree of emotional instability.

Figure 2.1 shows the density of these features for users who have an SWL score. Note that all distributions have some skew to them; however, this is especially prominent in the Openness feature, where the mean is µ = 3.97. This could be due to a sample bias among Facebook users. In particular, one can imagine that those who are willing to interact and post personal information on Facebook are also likely to be more open.

Though the graphs shown below are for users with an SWL score, the Big Five scores for the entire database show similar results. See Appendix A.1 for the distribution of scores (n = 3,137,694).

Figure 2.1: Histograms showing the density of each Big Five feature for users who have an SWL score (n = 86,073). Panels: (a) Density of Agreeableness, (b) Density of Conscientiousness, (c) Density of Extraversion, (d) Density of Openness, (e) Density of Neuroticism.

Chapter 3

Related Work

3.1 Satisfaction with Life in Social Media

In recent years, many studies have taken advantage of social media data to study SWL and SWB [5, 12, 13, 40, 41, 61, 70]. As in other fields of study, the large amount of data available has made the study of SWL and SWB multidisciplinary. Researchers now include traditional psychologists, data scientists, statisticians, and more. This is not too surprising, as the stakeholders include organizations such as educational institutions, businesses, and even government agencies. Although there has been a recent surge in using social media to understand SWL, this area of study is fairly new because online social media has grown dramatically only in the last 10 years [22]. In this section, findings and methods of previous research predicting SWL and SWB are reviewed. The discussion is divided into two sections: models using static ego variables (i.e. personal information such as gender, age, work place, etc.) and models using link relationships (i.e. features based on influence from others).

3.1.1 Static Ego Features

This thesis defines static ego features as those features that originate from a user but have no timestamp. Static ego features are a common selection for SWL models due to their direct connection with a user. These features are often demographic features, but can also include activity and behavior, personality, and natural language. This section discusses studies that use these types of features for SWL and SWB prediction.


In a geographic study of SWL, researchers used Tweets from 1,300 different U.S. counties to identify whether language could predict SWL at the county level. The researchers utilized Latent Dirichlet Allocation (LDA) and Linguistic Inquiry and Word Count (LIWC) to identify topics in Tweets. These topics were used as features in an ordinary least squares linear regression model. A baseline model for predicting SWL was created with socio-economic status information from the U.S. Census Bureau, and ground truth was taken from a previous study [42]. Results of this research indicated that the combination of Tweet information and socio-economic information produced the most accurate model for prediction of SWL. From a qualitative perspective, words relating to the outdoors, spiritual meaning, exercise, and good jobs correlated positively with life satisfaction. Conversely, words that signified disengagement, like bored and tired, had a negative correlation with SWL [61]. In this thesis, daily, weekly, and overall word usage is extracted from Facebook status updates and combined with demographic information to enhance model performance.

In a study using Facebook data, researchers used “Like” data to predict a wide range of private traits and behaviors, including sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis used the myPersonality dataset and 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The model used dimensionality reduction for preprocessing the Likes data, which were then entered into logistic/linear regression models to predict individual psycho-demographic profiles from Likes. The models were successful at predicting many dichotomous variables. For example, they correctly discriminated between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and Democrats and Republicans in 85% of cases. However, for numerical and psychological measures like SWL, their prediction accuracy was poor, with R = 0.17 against a baseline test-retest R = 0.44. One explanation the researchers gave for this result is that SWL could vary more due to “mood swings”, unlike “Likes”, which remain fairly consistent over time [40]. This thesis attempts to account for variations such as “mood swings” by analyzing changes in mood (through word usage) and incorporating these features into predictive models. For instance, we hypothesize that if there is a shift in word usage, such as increased use of “anger” words, then there will be a shift in SWL scores.

3.1.2 Link Features

In addition to ego features, link relationships also play a role in predicting SWL [5, 13]. Link features are defined as those features that do not originate from an ego but are somehow connected to the ego by another entity. For example, a link feature could be the number of times a friend posts on the Facebook wall of an ego. The friend is connected to the ego via “friendship”, but the number of times a friend posts is not controlled by the ego. Unlike static ego features, these features are often trickier to define, extract, and control. For instance, if one wants to determine how friends’ emotions are spread on Facebook, a good representation of “emotion” must be defined and controlled. Control of these types of features is especially difficult because data points are often collected passively, and one may not know whether it was truly the link feature or other outside circumstances that contributed to the outcome. This section discusses some previous research done in these areas and the methods employed.

In a study using Facebook News Feeds, researchers examined whether emotional contagion, the spread of one person’s emotions to another, occurs in a virtual environment. They tested this hypothesis by reducing the amount of emotional content displayed to users. When positive expressions were reduced, people posted fewer positive posts and more negative expressions. The converse was true when negative expressions were reduced [13]. Although this is interesting, there is some question about how the mood of a user is actually affected. For instance, are people just re-posting positive news about last night’s football game, or are they specifically referring to their mood? The experiments in this thesis attempt to incorporate link relationships directly by evaluating how the SWL of friends affects the SWL of a user.

In another study of Tweets, researchers examined whether assortative mixing takes place in an online social network context [5]. Assortative mixing is the tendency of individuals with similar characteristics to favor one another. Results from the experiment showed that the general happiness of Twitter users, as measured from a 6-month record of their individual tweets, was assortative across the Twitter social network. In particular, stronger relationships (ones with more interconnected links) are more influential than weak relationships. We utilize this idea by identifying strong link relationships and using them in the individual SWL prediction models. In particular, the SWL of best friends, top 3 friends, friends in the same city, and significant others is extracted and incorporated into models that predict individual SWL.

3.1.3 Summary

This section discussed previous work related to SWL and social media. We found that many works focus on one type of feature to predict SWL. This thesis aims to pull together all of the findings from these works to create a combined model that is more accurate than previous research. Specifically, we analyzed and combined static ego, link, and temporal information extracted from Facebook profiles to predict individual SWL.


Chapter 4

Data Description

The data for this thesis was collected by Cambridge University under the myPersonality.org project [40]. The myPersonality project is a collection of psychometric test results and Facebook data used for social science research. Currently, there are thirty-eight published works on this data, making it a popular and reliable source of data. Studies utilizing it have a wide range of interests, from the study of personality [27, 50, 73], user traits [1, 40, 58], emotion [26], and self-monitoring [30, 72] to geographic characterization [48, 52].

The dataset for this work contains 101,069 users with an SWL score. Information was collected from January 1, 2009 to November 27, 2011. Table 4.1 describes the databases used from the myPersonality project. The data is further separated into three categories: static ego, temporal, and link features.


Table 4.1: Data Description. Note that the number of samples is based on users who have an SWL score.

| Title | Description | # of Samples |
|---|---|---|
| Big Five | Feature containing scores for the Five Factor Model or Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism. Each factor is a different feature in the range [1.00 - 5.00]. | 86,073 |
| SWL Score | Target feature of study. Values range between [1.0 - 7.0]. | 101,069 |
| Demographic Data | Includes: gender, age, relationship status, network size, timezone, interested in, and birthday. | 32,326 - 91,587 |
| Facebook Activity | Shows number of a user’s status updates, friend dyads, photo tags, likes, events, concentrations (in school), groups, education, and work places. | 802 - 23,197 |
| Facebook Likes | Topical decomposition of a user’s “Like” data (600 topics). Like data spans a wide variety of interests (e.g. sports, TV shows, actors, celebrities, food, etc.) that a user may like. Topics were extracted using Latent Dirichlet Allocation (LDA). | 3,943 |
| Facebook Status Updates | Free text entered by the user updating his/her current status. Data includes userid, date, status update (free text). | 553,267 |
| Facebook Status (LIWC) | Linguistic inquiry word counts for users over the entire span of Facebook posts. This feature contains 64 variables, each of which corresponds to a construct represented in the language participants use in their status updates. | 3,505 |
| Couples | Couples as recorded by Facebook; includes geographical distance, overlap in friendship networks, and some other features. | 8,169 |
| Dyads (friendships) | Simple table showing friendship between friend1 and friend2. | 692,649 |

4.0.4 Static Ego Features

Static ego features belong to a user but do not have associated timestamps (e.g. number of friends, number of photo tags, number of likes, age). Note that categorical features such as relationship status were transformed using one-hot encoding so that they could be used in linear regression and support vector machines.

The transformation for one-hot encoding is done by taking one feature that has n values and creating n features that each have two values (0 or 1). For any given sample, exactly one of the n new features is “hot” (equal to 1). An example of one-hot encoding is shown in Table 4.2.

Table 4.2: One-hot encoding example for 4 categories.

| Binary | One-hot |
|---|---|
| 00 | 0001 |
| 01 | 0010 |
| 10 | 0100 |
| 11 | 1000 |
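As an illustration of the encoding described above, the following is a minimal sketch in plain Python. The category names in `STATUSES` and the helper `one_hot` are hypothetical and chosen only for this example; the thesis does not list the actual relationship status values or the code used. The position of the “hot” bit is a convention and may differ from the bit order shown in Table 4.2.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Hypothetical relationship-status categories, used only for illustration.
STATUSES = ["single", "in a relationship", "engaged", "married"]

encoded = {status: one_hot(status, STATUSES) for status in STATUSES}
for status, vector in encoded.items():
    print(f"{status:>18}: {vector}")
```

Each of the four categories becomes its own 0/1 column, so a linear model can assign a separate weight to each category instead of treating the category codes as an ordered number.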

4.0.5 Temporal Features

The Facebook status update feature contains temporal information. The status update is free text posted by a user and associated with a timestamp. The goal of using temporal features is to identify whether Facebook statuses posted near the time of an SWL test have an effect on the SWL score. The potential for “mood swings” is evaluated by identifying changes in word usage over time. This hypothesis is tested by calculating LIWC per user over different time frames. If the features extracted closer to an SWL test have better prediction accuracy, it is concluded that temporal features play a role in predicting SWL.

4.0.6 Link Features

Link features are any features that do not originate from an ego but are somehow connected to the ego by another entity. For instance, the news feed for an ego could be considered a link feature. News feeds are displayed to the ego, but do not originate from the ego. Another example could be posts made by friends on the ego’s wall. The possibilities for these types of features are almost endless. In this thesis, the SWL of friends and significant others is considered. In particular, we are interested in the SWL of top 3 friends, best friends, same city friends, and significant others. Each link feature is discussed further in Section 5.5.3.

4.0.7 Satisfaction With Life Score

The SWL score is the target variable for this study. It is a numerical label in the range [1.0 - 7.0], where 1.0 corresponds to highly unsatisfied individuals and 7.0 corresponds to highly satisfied individuals.

4.0.8 Sample Size

Models had variable sample sizes due to missing values for features. For example, of users with an SWL score, only 85% had Big Five features and only 4% had LIWC. We calculated the sample size for any particular model by taking the intersection of the users who had all of the model’s features. The sparsity in the features caused models to have drastically different sample sizes. To combat some of these small sample sizes, we chose Facebook demographic and activity features with n ≥ 20,000. Figure 4.1 shows the population of each feature.
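The intersection rule described above can be sketched with Python sets. The user IDs, the feature names in `users_with`, and the helper `model_sample` are all invented for illustration; they are not taken from the myPersonality data.

```python
# Hypothetical user-ID sets per feature; the IDs are made up for illustration.
users_with = {
    "swl": {1, 2, 3, 4, 5, 6, 7, 8},
    "big5": {1, 2, 3, 4, 5, 6},
    "age": {2, 3, 4, 6, 8},
    "liwc": {3, 8},
}


def model_sample(features, users_with):
    """Users that have an SWL score and every feature the model needs."""
    sample = set(users_with["swl"])
    for feature in features:
        sample &= users_with[feature]
    return sample


big5_age = model_sample(["big5", "age"], users_with)
big5_liwc = model_sample(["big5", "liwc"], users_with)
print(len(big5_age), len(big5_liwc))
```

Adding a sparse feature such as LIWC shrinks the usable sample sharply, which is the trade-off between predictive and populated features discussed in this section.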


Figure 4.1: This graph shows the percentage of users who have a particular feature and SWL score. For example, 85% of users who have an SWL score also have a Big Five score.


Chapter 5

Methodology

5.1 Machine Learning Models

Although past research [40, 61] predicting SWL has used linear regression as a supervised learning model, we utilized Random Forest Regression (RFR) [7]. RFR was used for its interpretability (features can be ranked by importance), its ability to model non-linear relationships, its efficiency, and its accuracy. Other methods such as linear regression and support vector regression were explored during feature analysis but were not used as prediction models because they did not provide better prediction accuracy and afforded less interpretability than RFR.

5.1.1 Decision Trees

To understand Random Forest Regression, one must first understand decision trees. Decision trees are a supervised machine learning method that can be used for a classification or regression task. In the case of predicting SWL, the trees are used for a regression task. The structure of a decision tree is similar to that of a control flow structure. At each internal (non-leaf) node, a feature and threshold are used to separate the samples. These feature-threshold pairs are inferred from the data by finding which pair yields the largest information gain. The decision tree method recursively separates the samples until no information can be gained (or another criterion is met, such as maximum depth). Once the samples can no longer be separated, a leaf node is created. For regression tasks, the average of the samples in the leaf is used as the predicted label for future test samples. An example of a decision tree is shown in Figure 5.1.


Figure 5.1: Example of a decision tree for determining SWL score. Branches correspond to the values of the attributes and leaves indicate prediction value. Note that the full decision tree is not shown.

The scikit-learn implementation of decision trees uses an optimized version of CART (classification and regression trees) [6]. The mathematical formulation for decision trees is described in Appendix D.1.
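To make the feature-threshold search at a single node concrete, the following is a minimal sketch in plain Python. It assumes a squared-error criterion, where the “gain” of a split is the reduction in the sum of squared deviations from the mean. It is not scikit-learn's implementation, and the helper names (`sse`, `best_split`) and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature_index, threshold, gain) with the largest error reduction.

    Samples with feature value <= threshold go left, the rest go right.
    Returns None when no split reduces the squared error.
    """
    parent_error = sse(targets)
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            gain = parent_error - (sse(left) + sse(right))
            if gain > 0 and (best is None or gain > best[2]):
                best = (feature, threshold, gain)
    return best


# Toy data: columns are (neuroticism, extraversion); targets are SWL scores.
# All numbers are invented for illustration.
rows = [(1.5, 3.0), (2.0, 4.0), (2.5, 2.0), (4.0, 3.5), (4.5, 2.5), (5.0, 4.5)]
targets = [6.0, 6.5, 6.0, 3.0, 2.5, 3.5]

feature, threshold, gain = best_split(rows, targets)
print(feature, threshold, round(gain, 3))
```

A full tree would apply `best_split` recursively to the left and right subsets until no split helps or a depth limit is reached, and would predict the mean target of the samples in each leaf.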

5.1.2 Random Forest Regression (RFR)

Random Forest Regression is a simple yet effective extension of decision trees. RFR is an ensemble model that uses a series of decision trees, each built with randomly selected features and trained on a sample drawn with replacement from the training set (bootstrap aggregation). Because deep regression trees have a tendency to overfit (low bias and high variance), RFR reduces overfitting by averaging the output of multiple trees. The output is generally more accurate than that of a single regression tree [7]. During experimentation, the parameters of RFR were tuned by adjusting the maximum tree depth and the number of trees. Once accuracy converged with respect to tree depth and number of estimators, these values were saved for the final models. The final models use a maximum tree depth of 7 and 30 trees.

We employed scikit-learn’s implementation of RFR [59]. Mean squared error was used as the splitting criterion [54].
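The averaging idea behind RFR can be sketched without any library. The example below shows only the bootstrap aggregation part: each “tree” is a depth-1 regression stump on a single feature, trained on a sample drawn with replacement, and the ensemble prediction is the mean over stumps. A real random forest also considers a random subset of features at each split and grows deeper trees. All function names and data here are hypothetical; this is not the code used in the thesis.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the mean everywhere
        mean = statistics.fmean(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_trees=30, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Ensemble prediction: the average of the individual stump predictions."""
    return statistics.fmean([predict_stump(stump, x) for stump in stumps])


# Invented neuroticism scores (xs) and SWL scores (ys).
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [6.5, 6.0, 6.2, 5.8, 3.1, 2.9, 3.4, 2.6]
stumps = fit_bagged(xs, ys)
print(predict_bagged(stumps, 1.0), predict_bagged(stumps, 4.5))
```

In scikit-learn, the settings reported above would presumably correspond to the `n_estimators` and `max_depth` parameters of `RandomForestRegressor`; the exact call used in the thesis is not given.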


5.1.3 Linear Regression

Linear regression is a statistical modeling technique used to create prediction models [54]. It assumes an additive and linear relationship between explanatory variables and the target variable, as seen in Equations 5.1 and 5.2. Linear regression models are formed by estimating the weight, $w_d$, of each explanatory variable, $x_d$, to find the strength of the relationship between each explanatory variable and the target variable. There are several ways to estimate these weights, but one of the simplest is least squares regression. In least squares regression, we estimate the weights that minimize the squared error (equivalently, the mean squared error), as shown in Equation 5.3.

$$h_w(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D \tag{5.1}$$

or more compactly, with $x_0 = 1$:

$$h_w(x) = \sum_{d=0}^{D} w_d x_d \tag{5.2}$$

$$J(w) = \sum_{t} \left(h_w(x_t) - y_t\right)^2 \tag{5.3}$$

Although linear regression was not employed as a final model for predicting SWL, it was still used to gain insight about how variables (particularly categorical variables) influence SWL.
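For a single explanatory variable, the least squares weights have a closed form, which makes the objective in Equation 5.3 easy to check by hand. The sketch below assumes one feature plus an intercept; the function names and the data are hypothetical and only illustrate the technique.

```python
def fit_least_squares(xs, ys):
    """Closed-form least squares fit of y = w0 + w1 * x for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def squared_error(w0, w1, xs, ys):
    """J(w): sum over samples of (h_w(x_t) - y_t)^2."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))


# Invented neuroticism scores (xs) and SWL scores (ys).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [6.1, 5.2, 4.0, 3.1, 1.9]

w0, w1 = fit_least_squares(xs, ys)
print(round(w0, 2), round(w1, 2), round(squared_error(w0, w1, xs, ys), 4))
```

Any other choice of weights gives a larger value of `squared_error` on the same data, which is what "minimizing J(w)" means. The sign and size of `w1` are what the thesis uses to gain insight into how a variable relates to SWL.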

5.2 Model Creation

Model creation follows the standard machine learning pipeline. Features are first extracted from the data and binned into types (i.e. static ego, temporal, and link). This allows separate analysis of feature types to identify their influence on SWL. Next, features are analyzed and selected for basic machine learning models. By selecting each feature and iteratively including it in a model, its effect on individual SWL can be measured. After basic models are built, they are analyzed for performance. The intuition is that combining the best performing model types will create a more robust and accurate model. Figure 5.2 gives a visual representation of the model creation process. Note that there is an extra step called Big Five Intermediate. This step is included after analysis of the Big Five feature. It is utilized to reduce high dimensional features (e.g. LIWC) into highly predictive ones (i.e. Big Five).

Figure 5.2: Prediction Model Approach. First, features are extracted from Facebook. These features fall into three main categories: static ego, temporal, and link. Then they are analyzed and used to create RFR models to predict SWL. Highly dimensional features are reduced to Big Five and then all models are combined into a final model.


5.3 Feature Extraction

The myPersonality dataset contained many features extracted directly from Facebook profiles; however, some features such as “likes”, status updates, friendships, and partnerships required processing in order to be incorporated into models. The following section reviews why and how these features were processed in order to be utilized in prediction models.

5.3.1 Facebook “Likes”

The Facebook “likes” feature was extracted by myPersonality and contained 128,774 dimensions representing the interests of users. These interests included sports, movies, foods, celebrities, etc. The high dimensionality of this feature caused noise in prediction models, so the feature needed to be reduced. The reduction method employed was Latent Dirichlet Allocation (LDA). LDA grouped the “likes” into more general categories, reducing the feature dimension from K = 128,774 to K = 600. The number of categories was based on the interpretability and accuracy of models that predicted age, gender, and Big Five.

LDA is a generative probabilistic model that allows a set of observations (i.e. user “likes”) to be explained by unobserved groups (i.e. topics) [4]. In this experiment, a topic was a distribution over all “likes”. For example, the topic of sports would have a high probability of containing “likes” such as basketball, softball, football, etc. It is important to note that all topics contain all “likes”, but with different probabilities. Each user, then, was a mixture of these topics. Each “like” was drawn (with some probability) from each topic. Using LDA, the underlying structure (i.e. topics) was inferred.


Figure 5.3: Process for extracting friendships

5.3.2 Facebook Status Updates

As noted in Section 4.0.5, one hypothesis for predicting SWL is that it is affected by “mood swings”. In order to extract a feature that could reflect a mood swing, Linguistic Inquiry and Word Count (LIWC) categories are computed for each status update of each user. LIWC is a text analysis program that counts words in psychologically meaningful categories [66]. The program was first developed to efficiently and effectively analyze text from emotional essays. The program has two main features: processing and dictionaries. The processing part essentially tags the words, and the dictionary part identifies the tag for a word. The classification was developed by having three judges hand-label words into particular categories.

LIWC gives not only syntactic labeling, but also semantic labeling of words. There are 80 different labels used, all linked to hundreds of studies about psychological processes and word use. See Appendix C.2 for a description of the categories. Today, LIWC is used in many studies that aim to study the affect of persons through linguistics [66]. In our study, we use LIWC as a proxy for feelings and moods a user may express in status updates.
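The dictionary-based counting that LIWC performs can be illustrated with a toy example. The category names and word lists in `CATEGORY_WORDS` below are invented for illustration; the real LIWC dictionaries are proprietary, much larger, and not reproduced here. In this sketch a word may count toward more than one category.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; not the actual LIWC word lists.
CATEGORY_WORDS = {
    "posemo": {"happy", "love", "great"},
    "negemo": {"sad", "angry", "hate"},
    "anger": {"angry", "hate"},
}


def category_proportions(text):
    """Fraction of the words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in CATEGORY_WORDS.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORY_WORDS}


status = "So happy today, I love this great weather but I hate Mondays"
print(category_proportions(status))
```

Computing such proportions per status update yields one numeric vector per post, which can then be compared across time to look for shifts in word usage.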

5.3.3 Friendships

In order to turn the dyads table into a usable feature, it must be combined with other tables. Figure 5.4 diagrams how tables are joined in order to get friends who have SWL scores. Note that the SWL of a friend is estimated through the Big Five model discussed in Section 5.5.1 because not all friends have a true SWL score.

Figure 5.4: Process for extracting friendships

Although the dyads feature provides a table of friendships, it does not indicate the strengths of friendships. Friendship strength is estimated by finding the mutual friendships between users and dividing by the total number of friends. Equation 5.4 shows how $friend_i$ ranks $friend_j$, and similarly Equation 5.5 shows how $friend_j$ ranks $friend_i$. By dividing mutual friends by a user’s total friends, rank is normalized. That is, if $friend_i$ has more total friends than $friend_j$, then $rank_{ij} < rank_{ji}$. Of course, this is only an estimation of friendship strength, and certain situations, like friending all your co-workers, may not reflect the true strength of a relationship.

$$rank_{ij} = \frac{mutualfriends(i, j)}{friends(i)} \tag{5.4}$$

$$rank_{ji} = \frac{mutualfriends(i, j)}{friends(j)} \tag{5.5}$$
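The rank in Equations 5.4 and 5.5 can be computed directly from a table of dyads. The sketch below uses an invented friendship list, and the helpers `rank` and `top_friends` are hypothetical names; the table joins actually used in the thesis are not shown in the text.

```python
from collections import defaultdict

# Hypothetical dyads table: each pair is an undirected friendship.
dyads = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "e")]

friends = defaultdict(set)
for u, v in dyads:
    friends[u].add(v)
    friends[v].add(u)


def rank(i, j):
    """rank_ij = mutualfriends(i, j) / friends(i), as in Equation 5.4."""
    mutual = friends[i] & friends[j]
    return len(mutual) / len(friends[i])


def top_friends(i, k=3):
    """The k friends of i with the highest rank_ij (ties broken by name)."""
    return sorted(friends[i], key=lambda j: (-rank(i, j), j))[:k]


print(rank("a", "d"), rank("d", "a"))
print(top_friends("a"))
```

Here user `a` has three friends and user `d` has two, with one mutual friend, so `rank("a", "d")` is smaller than `rank("d", "a")`. This is the normalization effect described in the text.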

5.3.4 Partnerships

Similar to friendships, the couples feature was used to extract the SWL of a significant other. Unlike with friends, rank is not required since only one person can be designated as a partner. Figure 5.5 diagrams how the tables are joined to create the significant other link feature. As with friends, the SWL of a significant other must be estimated due to lack of data.

Figure 5.5: Process for extracting significant other relationships

5.4 Feature Selection

Identifying a representative feature set for a prediction model can be one of the most important steps in creating an accurate machine learning model. In theory, more features should result in more discriminating power; in practice, however, this is not always the case, as more features can cause overfitting, noise, and unreliability. During model creation, features are preliminarily selected for their predictive power using filter and wrapper methods. Filter methods rank features by their correlation with the label. They identify correlations between variables independently of the machine learning model [36]. In this case, Pearson correlation and $R^2$ are used as filter methods to find linear correlation between the explanatory variables and the target. Wrapper methods identify important features in a way that depends on a prediction model. Typically they train a model on a subset of the data and identify the error rate of the model. Each model is given a score based on the error rate, and the best features are determined from the models [36]. Because wrapper methods require the creation of a prediction model, they are usually more computationally intensive than filter methods. The feature importance attribute calculated with Random Forest is the wrapper method used to select and analyze features.

Although it would be ideal to use the “best features” in the SWL prediction models, sparse data requires a compromise between predictive and populated features. In order to deal with this issue, features that have some correlation, |R| ≥ 0.1, and n ≥ 20,000 are used. Although some features cannot be used due to sparsity, their correlations still give insight into useful features for further research on SWL.

5.4.1 Pearson’s Correlation Coefficient (R)

To identify linear correlation between features and SWL, Pearson’s Correlation Coefficient (R) was calculated for each sample population. R measures the linear dependency between two variables (i.e. a Facebook feature and SWL). A value between -1 and 1 is given to identify negative, positive, or no correlation. For example, R = -1 indicates total negative correlation, R = 1 indicates total positive correlation, and R = 0 indicates no correlation. The formula for calculating R is expressed below:

$$R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{5.6}$$
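The formula for R translates directly into a few lines of Python. The sketch below also applies a correlation threshold of 0.1 on |R| as a simple filter; the feature values are invented, and the names `pearson_r`, `features`, and `selected` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


# Invented SWL scores and two invented candidate features.
swl = [3.0, 4.0, 5.0, 6.0]
features = {
    "extraversion": [2.0, 3.0, 4.0, 5.0],  # rises with SWL
    "noise": [1.0, -1.0, -1.0, 1.0],       # constructed to be uncorrelated
}

# Filter step: keep features whose absolute correlation is at least 0.1.
selected = [
    name for name, values in features.items()
    if abs(pearson_r(values, swl)) >= 0.1
]
print(selected)
```

Because the filter looks only at each feature's correlation with the target, it can be run before any prediction model is trained.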

5.4.2 R-squared (R2)

R-squared, also called the coefficient of determination, indicates how well data fits a statistical model. Specifically, it describes the percentage of the response variable variation that is explained by the regressors in a model. $R^2$ ranges from 0.0 to 1.0. An $R^2$ value of 1.0 indicates that the model perfectly fits the data. In the preliminary feature selection, categorical variables are correlated with SWL using linear regression. $R^2$ is used to identify how well a linear model explains the relationship between SWL and a categorical variable.

If $\bar{y}$ is the mean of the observed data (i.e. SWL), then the variability of the data set can be measured using the sums of squares formulas (Equations 5.7 and 5.8). Equation 5.7 is the total sum of squares. If $y_i$ is the observed target value, then this describes the variation in $y$. Equation 5.8 is the regression sum of squares. If $\hat{y}_i$ is the predicted value from the model, then Equation 5.8 describes how much variation is in the model's predictions as compared to $\bar{y}$. To find the percentage of variation in $y$ that is described by the explanatory variables, we use $R^2$ as described in Equation 5.9.

$$SS_{tot} = \sum_{i} (y_i - \bar{y})^2 \tag{5.7}$$

$$SS_{reg} = \sum_{i} (\hat{y}_i - \bar{y})^2 \tag{5.8}$$

$$R^2 = \frac{SS_{reg}}{SS_{tot}} \tag{5.9}$$
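For a least squares fit that includes an intercept, the share of variation explained can be written either as $SS_{reg}/SS_{tot}$ or as $1 - SS_{res}/SS_{tot}$, where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares (a quantity not defined in the text above). The sketch below computes both for a categorical variable. It relies on the fact that a least squares fit on a one-hot encoded categorical variable predicts each sample's group mean. The data and the function name are hypothetical.

```python
from collections import defaultdict


def r_squared_categorical(categories, ys):
    """R^2 of a least squares fit of ys on a one-hot encoded categorical variable.

    For such a fit the predicted value of each sample is the mean of its group.
    Returns the value computed two ways: SS_reg / SS_tot and 1 - SS_res / SS_tot.
    """
    groups = defaultdict(list)
    for category, y in zip(categories, ys):
        groups[category].append(y)
    group_mean = {c: sum(v) / len(v) for c, v in groups.items()}

    mean_y = sum(ys) / len(ys)
    predictions = [group_mean[c] for c in categories]

    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_reg = sum((p - mean_y) ** 2 for p in predictions)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return ss_reg / ss_tot, 1 - ss_res / ss_tot


# Invented relationship statuses and SWL scores.
statuses = ["single", "single", "married", "married"]
swl = [3.0, 4.0, 5.0, 6.0]

explained, one_minus_residual = r_squared_categorical(statuses, swl)
print(explained, one_minus_residual)
```

The two values agree here because the fit is a least squares fit with an intercept. For arbitrary predictions they need not agree.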

5.4.3 Feature Importance with Random Forest Regression

The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature for predicting SWL. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees, one can reduce the variance of such an estimate and use it for feature selection [59]. We use the feature importance method implemented in scikit-learn to identify features that could be useful for predicting SWL.


5.5 Model Types

5.5.1 Static Models

Static models are models that include features of a user that do not have associated timestamps. There were many features we considered, but our final models included the following:

• Big Five: The Big Five features refer to the five dimensions of human personality: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The Big Five features were collected through the IPIP proxy for Costa and McCrae’s NEO-PI-R questionnaire and are numerical variables in the range [1.0 - 5.0].

• Age: Reported age of a user.

• Network Size: Number of “friends” a user has in his network.

• Number of Photo Tags: Number of tagged photos for a user.

• Relationship Status: Categorical value representing a user’s relationship status.

• Likes: Topical decomposition of a user’s Like data into 600 topics. Topics were extracted using Latent Dirichlet Allocation (LDA) [4].

• Linguistic Inquiry and Word Count (LIWC) Overall: LIWC is a text analysis program that counts words in psychologically meaningful categories [66].

5.5.2 Temporal Models

Temporal models use the features extracted from Facebook status updatesas described in Section 5.3.2. For each status update of each user, LIWCwas calculated. Posts were then aggregated into daily and weekly averages.
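The aggregation step can be sketched as grouping per-status scores by user and by day or ISO week, then averaging within each group. The scores, user ID, and dates below are invented, and the helper `average_by` is hypothetical. The thesis does not specify how weeks were delimited, so the use of ISO calendar weeks here is an assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-status scores for one LIWC category (e.g. "anger").
statuses = [
    ("u1", date(2011, 3, 7), 0.10),
    ("u1", date(2011, 3, 7), 0.30),
    ("u1", date(2011, 3, 9), 0.20),
    ("u1", date(2011, 3, 15), 0.40),
]


def average_by(statuses, key):
    """Average score per (user, key(day)) bucket."""
    buckets = defaultdict(list)
    for user, day, score in statuses:
        buckets[(user, key(day))].append(score)
    return {bucket: sum(scores) / len(scores) for bucket, scores in buckets.items()}


def iso_week(day):
    """(ISO year, ISO week number) of a date; assumed week definition."""
    return tuple(day.isocalendar())[:2]


daily = average_by(statuses, key=lambda day: day)
weekly = average_by(statuses, key=iso_week)
print(len(daily), len(weekly))
```

The same function would be applied to each LIWC category in turn, giving one daily and one weekly series per user and category.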


5.5.3 Link Models

Link models incorporate the extracted link features described in Sections 5.3.3 and 5.3.4. The SWL score of a friend or significant other is predicted from Big Five, since not all friends or significant others have an SWL feature. Each type of link feature is described in more detail below:

• Top 3 Friends: The top 3 friends for each user are calculated by ranking friends by their mutual friends. Each friend’s SWL score is then used as a feature to determine whether a friend’s “happiness” affects a user.

• Top Friend: Similar to Top 3 Friends, we calculate the rank among friends and only use the predicted SWL of the top/best friend. We do this to assess the influence of a best friend.

• Same City Friends: We want to identify how much of a role physical proximity may play in influencing SWL. We test whether friends in the same city have more of an influence than friends who are not.

• Significant Other: The SWL score of a significant other was used as a feature to identify how a significant other’s “happiness” may influence the user.

5.5.4 Combined Models

Combined models use the best predicting features from each model and combine them into one model. When multiple Big Five scores are used, each component is averaged and then used as a feature. The resulting combination models are expected to outperform all other models.

5.6 Handling Missing Data

Sparse and missing values are a common occurrence for real-world dataand can have significant effects of machine learning model accuracy. Thereare three classification schemes that are commonly used when referring to

35

missing data: Missing Completely at Random (MCAR), Missing at Random(MAR) and Not Missing at Random (NMAR) [55].

• MCAR: Values are missing completely at random when the probability of an observation being missing does not depend on observed or unobserved measurements. That is, if data are MCAR, then consistent results should be obtained from sets of data that have completely filled values.

• MAR: Values that are missing at random have the property that the missingness does not depend on unobserved data. In other words, once the observed data are taken into account, missing values share the same statistical behavior as observed values.

• NMAR: Values are not missing at random when neither MCAR nor MAR holds.

Figure 5.6: Distribution of SWL for complete samples. Sample features include: Big Five, Network Size, Number of Photo Tags, Relationship Status, and Age. n = 8,608

The myPersonality dataset used in this experimentation was collected for various research projects. For example, some data points were collected from a Big Five study and others were collected from an SWL study. From the definitions stated above, the dataset for this experimentation would fall under NMAR. This is a particularly tricky situation which calls for a joint inference model that considers an inference on both observed and unobserved data; however, due to limitations in the implementation of Random Forest Regression (i.e. it does not handle missing values), a simplified version of this problem (MCAR) is performed. By formulating the missing data issue as MCAR, only fully complete samples are considered in models. Although this is a simpler problem, the sample size of the models is large and the distribution of SWL (Figure 5.6) seems to reflect other estimations of “happiness” and life satisfaction [31], making it a reasonable assumption.
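Restricting a model to fully complete samples amounts to a complete-case filter. The sketch below assumes samples are stored as dictionaries with `None` marking a missing value; the representation, the feature names, and the name `complete_cases` are assumptions for illustration.

```python
def complete_cases(samples, features):
    """Keep only samples that have a non-missing value for every feature."""
    return [
        sample
        for sample in samples
        if all(sample.get(feature) is not None for feature in features)
    ]


samples = [
    {"age": 25, "network_size": 300, "swl": 4.6},
    {"age": None, "network_size": 120, "swl": 5.2},  # age missing
    {"network_size": 80, "swl": 3.0},                # age never recorded
    {"age": 31, "network_size": 45, "swl": 6.0},
]
usable = complete_cases(samples, ["age", "network_size", "swl"])
```

Because each model uses a different feature set, the filter would be applied per model, which is consistent with the sample size varying from model to model.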


5.7 Model Populations

As noted in Section 4.0.8, models had varying sizes due to feature population. Table 5.1 gives a quick-reference overview of sample sizes for each model.

Table 5.1: Summary of sample sizes for each model.

Type                            Model                          # of Samples
Static-Ego                      Big5                           86,073
                                FBAttrib                       9,461
                                Likes                          3,920
                                LIWC                           3,251
                                Big5.Likes                     3,693
                                Big5.FBAttrib                  9,242
                                Big5.LIWC                      3,251
Combined Static-Ego             Combo.Static.1                 1,360
                                Combo.Static.2                 1,160
                                Combo.Static.3                 2,055
                                Combo.Static.4                 9,242
                                Combo.Static.5                 1,845
Temporal Models                 Temporal.Daily                 3,275
                                Temporal.Weekly                3,275
Link Models                     3FriendSWL                     695
                                3RandomFriendSWL               695
                                FriendSWL                      3,671
                                RandomFriendSWL                3,671
                                CityFriendSWL                  1,654
                                OtherSWL                       171
Combined Link and Static-Ego    FBAttrib.Big5.3FriendSWL       695
                                FBAttrib.Big5.No3Friend        695
                                FBAttrib.Big5.FriendSWL        695
                                FBAttrib.Big5.NoFriend         695
                                FBAttrib.Big5.CityFriendSWL    1,654
                                FBAttrib.Big5.NoCityFriend     1,654
                                FBAttrib.Big5.OtherSWL         171
                                FBAttrib.Big5.NoOther          171


5.8 Evaluation methods

5.8.1 Mean Absolute Error

To evaluate our models we used the mean absolute error (MAE) measure [54]. MAE is defined as the average of the absolute errors, $|f_i - y_i|$, over $n$ samples, where $f_i$ is the predicted value and $y_i$ is the actual value.

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |f_i - y_i| \tag{5.10}
\]
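As a small worked illustration of Equation 5.10, the sketch below computes MAE for a handful of made-up predictions. The function name and the example values are illustrative only and are not taken from the experiments.

```python
def mean_absolute_error(predicted, actual):
    """Average of |f_i - y_i| over all n samples (Equation 5.10)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(predicted)


# Absolute errors are 1.0, 0.5 and 0.0, so the mean is 0.5.
mae = mean_absolute_error([5.0, 3.0, 6.5], [4.0, 3.5, 6.5])
```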

5.8.2 Random Baseline

We evaluated our model by calculating MAE for SWL prediction and comparing it to the MAE of a random model generated from the probability distribution of a sample. The probability distribution function was estimated by interpolating over a 10-bin histogram of the labeled data.
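One way to realize such a random model is sketched below. It is a simplification under stated assumptions: the density is treated as piecewise-uniform within each of the 10 bins rather than smoothly interpolated, labels are assumed to lie in the SWL range [1.0, 7.0], and the function name and the seed parameter are hypothetical.

```python
import random


def random_baseline_predictions(labels, n, bins=10, low=1.0, high=7.0, seed=0):
    """Draw n random predictions that follow a histogram of the labels.

    Assumes every label lies in [low, high] and that labels is non-empty.
    """
    rng = random.Random(seed)
    width = (high - low) / bins
    counts = [0] * bins
    for y in labels:
        index = min(int((y - low) / width), bins - 1)  # y == high goes in last bin
        counts[index] += 1
    # Pick a bin with probability proportional to its count, then a point
    # uniformly inside that bin.
    chosen = rng.choices(range(bins), weights=counts, k=n)
    return [low + (b + rng.random()) * width for b in chosen]


labels = [5.0] * 50 + [6.0] * 50
random_predictions = random_baseline_predictions(labels, n=200)
```

The MAE of these random predictions against the true labels then gives the baseline value reported alongside each model.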

5.8.3 Precision

Although the random baseline allows us to compare our model to another model, we still would like an evaluation metric that shows the usefulness of the SWL prediction. For this reason, we reference the “Understanding Scores on the Satisfaction with Life Scale” [16] worksheet to identify categories of SWL. The worksheet interprets the SWL scores in six categories: very high score, high score, average score, slightly below average score, low score, and very low score. As noted previously, the SWL scores range from [1.0 - 7.0], so the categories are broken into [1.0 - 2.0), [2.0 - 3.0), [3.0 - 4.0), [4.0 - 5.0), [5.0 - 6.0), [6.0 - 7.0]. Because each category spans one point, we define a good prediction model as a model whose average error is ≤ 1.0. We also consider prediction models that are better than the random baseline and have an average error ≤ 2.0 as acceptable, based on the qualitative description of the categories.
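The category boundaries and the two quality thresholds can be written down directly. The sketch below assumes the thresholds are read as "at most 1.0" and "at most 2.0", and it pairs the six intervals with the six category names in ascending order; the function names and the "poor" label for everything else are illustrative assumptions.

```python
def swl_category(score):
    """Map an SWL score in [1.0, 7.0] to its one-point-wide category."""
    if not 1.0 <= score <= 7.0:
        raise ValueError("SWL scores range from 1.0 to 7.0")
    upper_bounds = [
        (2.0, "very low score"),
        (3.0, "low score"),
        (4.0, "slightly below average score"),
        (5.0, "average score"),
        (6.0, "high score"),
    ]
    for upper, name in upper_bounds:
        if score < upper:
            return name
    return "very high score"  # [6.0, 7.0]


def rate_model(mae, random_mae):
    """Label a model by its MAE relative to the thresholds and the baseline."""
    if mae <= 1.0:
        return "good"
    if mae <= 2.0 and mae < random_mae:
        return "acceptable"
    return "poor"  # hypothetical label for everything else
```

For example, an MAE of 0.97 rates as good, while an MAE of 1.19 against a random MAE of 1.34 rates as acceptable.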


5.8.4 Linear Regression Baseline

We also compared our model to a previously discussed model which uses linear regression and user likes to predict SWL [40]. We replicated their methods and found that MAE = 1.22 ± 0.04 (n = 3,920) using the same data as in the Likes model. All experimental results were based on 10-fold cross validation.

5.8.5 10-Fold Cross Validation

We used this method to estimate how accurately our predictive models would generalize to an independent data set. This method first splits the data set into 10 equal parts. It then trains on 9 partitions and validates on the 10th. To reduce variability, cross-validation was performed for 10 rounds. The validation set was rotated for each round so that every partition was used for validation exactly once. The final evaluation of the model was obtained by averaging the results over all validation sets.
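The rotation of the validation partition can be sketched as below. To keep the example self-contained, a trivial mean predictor stands in for Random Forest Regression; the names `k_fold_mae` and `fit_mean`, the shuffling seed, and the `fit` interface are assumptions for illustration.

```python
import random


def k_fold_mae(xs, ys, fit, k=10, seed=0):
    """Average validation MAE over k rotating folds.

    fit(train_xs, train_ys) must return a predict(x) function.
    Assumes len(xs) >= k so that no fold is empty.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal partitions
    fold_maes = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(predict(xs[i]) - ys[i]) for i in held_out]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / k


def fit_mean(train_xs, train_ys):
    """Stand-in model: always predict the mean of the training targets."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean


xs = list(range(20))
constant_score = k_fold_mae(xs, [4.0] * 20, fit_mean)
alternating_score = k_fold_mae(xs, [3.0, 5.0] * 10, fit_mean)
```

Any regressor that exposes the same fit-then-predict shape could be substituted for `fit_mean`.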


Chapter 6

Results and Discussion

6.1 Feature Selection

6.1.1 Pearson Correlation

In initial feature selection, features were correlated to SWL using Pearson's R. Table 6.1 summarizes important correlations (|R| ≥ 0.10) between the features and SWL. See Appendix B for all correlation results.

During analysis, many features were shown to be weakly correlated to SWL; however, when the mean of a feature was correlated to SWL, many strong linear relationships were uncovered. Using the mean of a feature served two purposes: First, it allowed for a better understanding of how a feature affects SWL in general. Second, it enabled clearer visualization of the data with much less noise. The downside of using the mean of a feature is that a strong correlation may indicate that the feature is predictive of SWL in general but not of an individual's SWL. Another possibility, however, is that these features are predictive of SWL, but the relationship is not completely linear. Thus, these features are also analyzed using RFR important features.
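The difference between correlating raw features and correlating feature means can be seen in a small sketch. It assumes the averaged correlation is computed by taking the mean of the feature at each distinct SWL value and then correlating those means with the SWL values; the function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def averaged_r(feature, swl):
    """Pearson's R between each SWL value and the mean feature at that value."""
    groups = defaultdict(list)
    for x, y in zip(feature, swl):
        groups[y].append(x)
    levels = sorted(groups)
    means = [sum(groups[level]) / len(groups[level]) for level in levels]
    return pearson_r(means, levels)


feature = [1, 5, 2, 6, 3, 7]  # noisy at the individual level
swl = [1, 1, 2, 2, 3, 3]
raw = pearson_r(feature, swl)
averaged = averaged_r(feature, swl)
```

In this toy data the raw correlation is weak while the averaged correlation is perfect, mirroring how averaging removes individual-level noise without making individual predictions any more precise.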


Table 6.1: Summary of correlations between a feature and SWL. R refers to Pearson's R, whereas R' is the Pearson's R for averaged features calculated for SWL values. SE is the standard error for the regression when raw features are used. SE' is the standard error when averaged features are used. Features shown have correlation with SWL and a standard error of 1.5.

Feature            R           R'          SE         SE'        # of Samples
con                0.277651    0.985919    0.006271   0.201948   86073
ext                0.295244    0.996589    0.005425   0.085871   86073
agr                0.241992    0.988066    0.006938   0.227220   86073
neu               -0.471156   -0.997529    0.004977   0.044451   86073
age                0.013145    0.248857    0.000736   0.694510   42264
network size       0.061001    0.845528    0.000028   0.011462   60863
# of groups       -0.055199   -0.678348    0.000362   0.030468   5443
# of likes        -0.078601   -0.721082    0.000035   0.003121   7173
# of statuses      0.013460    0.311264    0.000127   0.016274   3503
# of photo tags    0.029699    0.596474    0.000112   0.057597   23197
anger             -0.158459   -0.944416    0.036384   0.750013   3505
negemo            -0.159933   -0.929049    0.022741   0.532171   3505
swear             -0.147914   -0.880167    0.049320   1.493586   3505
body              -0.105941   -0.835244    0.047152   2.445377   3505

Big Five. On the other hand, Big Five features were moderately to strongly correlated to SWL regardless of whether mean values were used. This indicates Big Five's strong influence on an individual's SWL and is consistent with previous research [9, 15, 21].

In particular, positive correlations were shown with agreeableness, conscientiousness, extraversion, and openness; negative correlation was shown with neuroticism. When analyzing the results and graphs (Figures 6.2(a), 6.2(b), 6.2(c)), agreeableness, conscientiousness and extraversion exhibited low standard errors (0.227, 0.202 and 0.086) and high correlation (R = 0.242, R = 0.277, R = 0.295) to SWL, confirming their predictive power for individual SWL. Similar results were found in a Costa and McCrae study [9] where the researchers hypothesized that the positive affect associated with extraversion and agreeableness allows people to foster additional quality relationships and thus have better life satisfaction. This makes sense, as extraversion often deals with the quantity of relationships a person has and agreeableness focuses on the quality of those relationships. Conscientiousness, on the other hand, relates to self-discipline and goal-related activity. Costa and McCrae argue that these qualities tend to make people more satisfied because they are able to assert control over their environments to enhance quality of life.

When reviewing Figure 6.3(d) for openness, it was shown that the standard error for prediction of SWL was 2.9. This error is high compared to the precision metric discussed in Section 5.8.3, which looks for error ≤ 1.0. This result indicated that openness was a poor predictor of individual SWL. One possible explanation for this is found in the definition of openness. Openness refers to the willingness of individuals to have new experiences; therefore, it is plausible that open people may have both positive and negative experiences. Both types of experiences may have an effect on SWL, but the type of experience may depend on circumstance rather than openness. In general, this may be a reason why openness appears uncorrelated to SWL.

Neuroticism versus SWL showed the strongest correlation, with R = -0.471 and a standard error of 0.044. Diener et al. explain this phenomenon by hypothesizing that neurotic people are more likely to have bad expectations for their lives and are therefore less satisfied [21]. Later it is shown that some linguistic features with negative connotation are positively correlated with neuroticism and negatively correlated with SWL. These findings also supported the Diener hypothesis.

Numerical Correlations. When analyzing common Facebook features, the average age, network size, number of statuses, and number of photo tags showed positive correlations with SWL. It is important to note, however, that the correlation was very weak when not averaged. This indicated the general usefulness of these features for predicting SWL, but showed that the features may not be precise enough to predict individual SWL.

Network Size (Figure 6.4(a)) showed a strong positive correlation to SWL, which is not too surprising considering extraversion's positive correlation to SWL. Because extraverts are characterized by their sociability, outgoingness and assertiveness, it seems likely they would have characteristics to promote this behavior, such as having a large network of friends.

Figure 6.1: Graphs of SWL versus the mean of each Big Five personality score. Pearson's R is annotated in the top right corner of the graphs. Panels: (a) Agreeableness, (b) Conscientiousness, (c) Extraversion, (d) Openness, (e) Neuroticism.

Similarly, number of statuses and number of photo tags were positively correlated to SWL, perhaps because these behaviors indicate sociability. In Figure 6.4(f), the average age graph showed a slight positive correlation to SWL. Although the correlation was only slightly positive, it could still be useful with a standard error of 0.69. In fact, Section 6.2.1 explains how age played a large role when predicting some Big Five attributes, particularly conscientiousness.

On the other hand, negative correlations were shown for number of groups and number of likes. Perhaps SWL is linked to active versus passive activity on Facebook [70]. If an individual has more friends and more photo tags, this may indicate more social activity. In contrast, if an individual spends all their time on Facebook joining virtual groups and liking random pages, this may indicate the lack of a social and active lifestyle. The idea of active and passive activity becomes a recurring theme in the following sections and is an important consideration.

Facebook Categorical Correlations. For the three categorical variables (gender, relationship status and interested in), we used one-hot encoding in multivariate linear regression to find R². Through this method, none of the features showed linear correlation with SWL (R² = 0.0009, R² = 0.0128, R² = 0.0019 respectively); however, we noted an interesting pattern in the regression coefficients for relationship status. Table 6.2 shows that being in a relationship is associated with higher SWL, whereas being in a complicated relationship or widowed is associated with lower SWL. This is consistent with Diener's research [21], which showed that people with strong social ties tend to have higher SWL scores. Although these variables had no linear correlation with SWL, they were still reviewed with the RFR important features tool.
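For a single categorical variable, the R² of a one-hot regression can be computed without a linear algebra library, because the least-squares fit on a full set of indicator columns predicts each category's mean. The sketch below relies on that property; the function names and the toy data are illustrative assumptions and do not reproduce the values in Table 6.2.

```python
from collections import defaultdict


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


def r_squared_categorical(labels, targets):
    """R^2 of regressing targets on the one-hot encoding of labels.

    With one categorical predictor, the fitted value for each sample is
    the mean target of its category.
    """
    groups = defaultdict(list)
    for label, y in zip(labels, targets):
        groups[label].append(y)
    group_mean = {g: sum(values) / len(values) for g, values in groups.items()}
    grand_mean = sum(targets) / len(targets)
    ss_res = sum((y - group_mean[label]) ** 2 for label, y in zip(labels, targets))
    ss_tot = sum((y - grand_mean) ** 2 for y in targets)
    return 1 - ss_res / ss_tot


status = ["Single", "Single", "Married", "Married"]
swl = [3.0, 5.0, 4.0, 6.0]
encoded = one_hot("Married", ["Single", "Married"])
r2 = r_squared_categorical(status, swl)
```

A low R² combined with clearly different category means, as in this toy data, is the same situation described for relationship status above.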

LIWC Correlations. Table 6.1 shows that the prominent LIWC categories are all negatively correlated with SWL. Specifically, it shows that anger, negative emotion, swear and body words are associated with a lower SWL. This was an interesting and plausible finding. As one may expect, those who are unsatisfied with their lives tend to choose words expressing discontent.

Figure 6.3: Graphs of averaged demographic features versus SWL. Network Size shows the strongest correlation (R'). Panels: (a) Average Network Size, (b) Average Number of Facebook Statuses, (c) Average Number of Photo Tags, (d) Average Number of Groups, (e) Average Number of Likes, (f) Average Age.

Table 6.2: Correlation coefficients for linear regression using relationship status as the explanatory variable and SWL as the target variable. Note that relationship types where n < 100 were not included.

Relationship Type          Correlation Coefficient    # of Samples
Single                      0.0625593                 27828
In a relationship           0.2827495                 11194
Married                     0.45827924                7281
Engaged                     0.34196561                1719
It's Complicated           -0.07448584                1710
In an Open Relationship     0.20190719                305
Widowed                    -0.08986092                153

As noted in the Big Five analysis, these types of words are also related to personality, as further explained in Section 6.2.1. Body, although not an inherently negative category, also displayed negative correlation. This correlation was the weakest of the LIWC categories, and further analysis would be required to understand why this category was significant.

“Like” Correlations. The 600-dimension “Like” features showed correlation to SWL with R² = 0.191. When reviewing the results for influential “Like” categories, it was important to remember that categories were created using LDA and therefore may not always appear to have related topics. For example, the relationship among Family Feud, Support Our Troops, The Pink Ribbon, Pedigree Adoption Drive, and www.peopleofwalmart.com is not clear. These “Likes” commonly appear together and therefore are grouped together. On the other hand, some categories, like Volleyball, Basketball, Softball, Sports and Soccer, are all related to team sports. The top ten positively correlated categories and negatively correlated categories are shown in Table 6.3. The positively correlated features show topics like health, fitness, and sports stars. These categories were all indicative of active lifestyles. On the other hand, some categories for negatively correlated “likes” refer to reality TV shows, comedy TV shows and rock/metal music. Unlike sports or fitness, these topics were generally more passive. Again, the theme emerged of an active lifestyle correlating positively and a passive lifestyle correlating negatively.


Table 6.3: Positively and negatively correlated “Like” categories grouped into 600 categories using LDA. Categories are listed in descending order of influence.

Positively Correlated Likes
1. House, Skittles, Dr. House, YouTube, Oreo
2. FaceMoods, My Personality, My Top Fans, Facebook, SKETCH YOUR PHOTO
3. Roger Federer, Rafael Nadal, Tennis, Maria Sharapova, Michael Phelps
4. Shopping mall, Shoes, Fashion, Traveling, Dancing
5. Skittles, Starburst, Reese's, Oreo, Duck Tape
6. Jillian Michaels, P90X, Bob Harper, Active.com, Zumba Fitness
7. Volleyball, Basketball, Softball, Sports, Soccer
8. App Store, iTunes, Angry Birds, Facebook for Android, Facebook for iPhone
9. Nitro Circus, Travis Pastrana, Fox Racing, Rob Dyrdek, Monster Energy
10. Indianapolis Colts, Buffalo Wild Wings, Peyton Manning, Reggie Wayne, Indianapolis Colts

Negatively Correlated Likes
1. DJ Pauly D, Jersey Shore, Mike The Situation, JWOWW, Snookie
2. Family Feud, Bejeweled Blitz, Zuma Blitz, Wheel Of Fortune, UNO
3. Metallica, AC/DC, Guns N' Roses, Bon Jovi, Linkin Park
4. Dallas Cowboys, Dallas Cowboys Fanatics!, Dallas Cowboys, George Lopez, Texas Rangers
5. Converse All Star, Coca-Cola, Facebook, Oreo, Pringles
6. Heavy metal music, Punk rock, Alternative rock, Rock music, Death metal
7. Family Feud, Support our Troops, The Pink Ribbon, Pedigree Adoption Drive, www.peopleofwalmart.com
8. Avenged Sevenfold, My Chemical Romance, Green Day, Linkin Park, Fall Out Boy
9. Family Guy, The Simpsons, South Park, Futurama, The Simpsons
10. Family Guy, Eminem, South Park, Michael Jackson, Megan Fox


It is also noted that some interests, such as “Facebook”, appear to be both positively and negatively correlated. This may occur due to the type of activity related to these interests. For instance, Wenninger et al. found that active use of Facebook (i.e. chatting and posting) had a positive effect on teenagers' SWL and SWB compared with passive following.

6.1.2 Wrapper Method

Although Pearson's R gave a good idea of linearly correlated features, it did not find non-linearly correlated features. For example, relationship status seemed relevant to SWL, but did not have a strong linear correlation. For this reason, we used a wrapper method to identify useful features. A wrapper method selects features based on their usefulness to a given predictor [36]. Feature importance, as noted in Section 5.4.3, was used to find predictive features for RFR.

By finding feature importance for the common Facebook features, we confirmed previous correlations with network size, age and number of photo tags. Feature importance also showed relationship status, number of education places and number of work places as important features for predicting SWL. However, because some features were sparsely populated, only features with sample size n ≥ 20,000 (i.e. network size, age, number of photo tags, and relationship status) were included.

6.2 Static Models

After finding the important features, RFR models were created to predict SWL. Because of the sparsity in the data, models were limited to the most influential and most populated features. First, static ego features were utilized from each dataset to train RFR models to predict SWL. Four different models were created from the main features discussed in Section 6.1. The results for the models are shown in Table 6.4.

• Big5: Big Five scores collected from a questionnaire

51

• FBAttrib: Age, Network Size, Relationship Status and Number ofPhoto Tags

• Likes: “Likes” of a user as represented by 600-dimensional vector

• LIWC: Overall LIWC for a user represented by a 64-dimensional vec-tor

Table 6.4: Static Ego Models

Model       # of Samples    MAE     Random MAE
Big5        86073           0.97    1.58
FBAttrib    9461            1.19    1.34
Likes       3920            1.15    1.57
LIWC        3251            1.16    1.60

6.2.1 Predicted Big Five as an Intermediate

To maintain focus on predicting SWL from Facebook, feature selection was limited to features commonly available for most users. However, we still wanted to harness the predictive power of Big Five personality. This dilemma motivated the use of Big Five personality scores as an intermediate variable for predicting SWL. If common Facebook information such as age, gender, “likes”, etc. could be utilized to predict Big Five, perhaps the benefits of personality's strong correlation to SWL could still be used. Like the other models, RFR was used to predict Big Five.

Although RFR gives an idea of which features were important, one should note that feature importance does not imply positive correlation. Rather, it indicates an important splitting feature for the prediction of personality. For example, achievement words may be an important feature for predicting neuroticism; however, this does not mean achievement and neuroticism positively correlate. Instead, this indicates that achievement is a deciding factor of the neuroticism score.

Demographic Features Predicting Big Five. From the demographic features, a model was created for predicting each Big Five construct. The results for predicting Big Five are shown in Table 6.5. Graphs of the most highly correlated features are shown in Figure 6.4. Some features, such as age, were weakly correlated with SWL; however, by using Big Five as an intermediate, these features were still indirectly predictive of SWL.

LIWC Predicting Big Five. Because LIWC has high dimensionality, a separate RFR model was created to pick up any small contributions LIWC features made toward predicting Big Five. In Table 6.6, it is noted that negative connotation categories, such as negative emotion and sad, were highly predictive of Neuroticism as hypothesized in Section 6.1.1, Big Five.

“Likes” Predicting Big Five. Similarly to LIWC, “Likes” were used to predict Big Five. The results are shown in Table 6.7. Again, the theme of active versus passive lifestyles appeared in some of the Big Five predictors. For example, the positively correlated feature, extraversion, was associated with topics like basketball, sports, and outdoor activities, whereas the negatively correlated feature, neuroticism, was related to reality TV, games, and anime.

Other traits, like agreeableness and conscientiousness, were less straightforward in their connection to passive or active interests. For example, many topics that were predictors of agreeableness were related to the Christian religion. Although these interests do not point directly to physically active interests like sports, they suggest social and community activity. Conscientiousness, on the other hand, was associated with topics like home goods and cooking. While these topics do not seem as active as other topics, they are consistent with the non-spontaneous behavior of conscientious individuals. In particular, these topics may relate to stability of home life and may be active interests in that area.

6.2.2 Predicted Big Five Models

After predicting Big Five scores from “common” Facebook features, the scores were used to predict SWL. Although these models did not outperform the original Big5 model, they performed as well as or better than the models using the features directly. Specifically, Big5.FBAttrib is better than the FBAttrib model and

Table 6.5: Important Facebook demographic features for predicting Big Five as identified with RFR, n = 30,766

Openness                               Conscientiousness
Random Forest MAE: 0.47                Random Forest MAE: 0.56
1. number of like (0.261486)           1. age (0.538828)
2. age (0.131944)                      2. number of like (0.116656)
3. number of group (0.126521)          3. network size (0.097241)
4. number of status (0.111631)         4. number of group (0.083196)
5. number of tags (0.100288)           5. number of status (0.046087)
6. network size (0.094449)             6. number of tags (0.036197)
7. relationship status (0.093745)      7. relationship status (0.031632)
8. number of education (0.060224)      8. number of work (0.030706)
9. number of work (0.019713)           9. number of education (0.019457)

Extraversion                           Agreeableness
Random Forest MAE: 0.65                Random Forest MAE: 0.54
1. network size (0.627355)             1. network size (0.251437)
2. number of group (0.096529)          2. age (0.197810)
3. number of status (0.075881)         3. number of tags (0.136323)
4. number of tags (0.073915)           4. number of like (0.132674)
5. number of like (0.046137)           5. number of group (0.093844)
6. age (0.034730)                      6. number of status (0.093179)
7. relationship status (0.017319)      7. relationship status (0.054738)
8. number of education (0.014922)      8. number of work (0.032081)
9. number of work (0.013212)           9. number of education (0.022333)

Neuroticism
Random Forest MAE: 0.69
1. network size (0.251437)
2. number of likes (0.213813)
3. number of status (0.193442)
4. age (0.098579)
5. number of group (0.087906)
6. number of tags (0.077159)
7. relationship status (0.034802)
8. number of work (0.026912)
9. number of education (0.015950)

Figure 6.4: Pearson Correlation between averaged “common” Facebook features and Big Five personality scores. The most highly correlated features are shown. Panels: (a) Agreeableness, (b) Conscientiousness, (c) Extraversion, (d) Openness, (e) Neuroticism, (f) Neuroticism.

Table 6.6: Important LIWC features for predicting Big Five scores as identified with RFR, n = 115,874

Openness                        Conscientiousness
Random Forest MAE: 0.53         Random Forest MAE: 0.58
1. death (0.231061)             1. posemo (0.260162)
2. article (0.159621)           2. relativ (0.220944)
3. family (0.123684)            3. swear (0.099381)
4. insight (0.108446)           4. negemo (0.095408)
5. affect (0.058550)            5. preps (0.070630)
6. percept (0.035793)           6. achieve (0.062061)
7. relativ (0.026079)           7. assent (0.016817)
8. social (0.024329)            8. anger (0.013758)
9. motion (0.016042)            9. social (0.012895)
10. posemo (0.013007)           10. death (0.012768)

Extraversion                    Agreeableness
Random Forest MAE: 0.64         Random Forest MAE: 0.55
1. sexual (0.303806)            1. anger (0.364546)
2. posemo (0.146730)            2. posemo (0.301972)
3. insight (0.060779)           3. relativ (0.026353)
4. death (0.054196)             4. swear (0.025740)
5. leisure (0.042724)           5. death (0.023222)
6. negemo (0.030493)            6. relig (0.021642)
7. humans (0.028119)            7. incl (0.016721)
8. past (0.026305)              8. humans (0.015216)
9. discrep (0.020933)           9. negemo (0.011938)
10. swear (0.018870)            10. shehe (0.011407)

Neuroticism
Random Forest MAE: 0.64
1. negemo (0.270750)
2. achieve (0.124964)
3. sad (0.090772)
4. leisure (0.042204)
5. health (0.033696)
6. adverb (0.031334)
7. relativ (0.027536)
8. family (0.024485)
9. preps (0.019936)
10. discrep (0.019758)


Table 6.7: Important “Like” features for predicting Big Five scores identified with RFR, n = 92,255. Note that the categories are summarized by the most prominent features.

Openness (Random Forest MAE: 0.51)
1. writers/authors
2. social sciences (psych, history, philo, politics, music)
3. metal/rock bands
4. rock/indie/alternative music
5. alternative/
6. social science

Conscientiousness (Random Forest MAE: 0.58)
1. home goods
2. anime/art
3. hard-core/rock music male models
4. rock/indie/alternative music
5. food/cooking
6. pop culture

Extraversion (Random Forest MAE: 0.65)
1. post hard-core/rock music, male models
2. basketball
3. anime
4. basketball/rappers
5. sports
6. outdoor activities

Agreeableness (Random Forest MAE: 0.55)
1. Christian music
2. metal bands
3. Jesus/Bible
4. metal/rock bands
5. Mormons
6. Bible/Jesus

Neuroticism (Random Forest MAE: 0.65)
1. anime
2. reality tv
3. rappers
4. anime
5. video games/board games
6. anime


Table 6.8: Static Ego Model Results

Model            # of Samples    MAE     Random MAE
Big5.FBAttrib    9242            1.10    1.61
Big5.Likes       3693            1.13    1.56
Big5.LIWC        3251            1.16    1.60

Big5.Likes is better than the Likes model. Descriptions of each model are listed below:

• Big5.FBAttrib: Big Five scores predicted by FBAttrib features

• Big5.Likes: Big Five scores predicted by “Likes” of a user

• Big5.LIWC: Big Five scores predicted from LIWC features

6.2.3 Combined Static Models

After predicting Big Five from Facebook features, all static models were combined to find combinations of features that would perform better than individual features. The best combined model was Combo.Static.1, which combined network size, age, relationship status, number of photo tags, and the mean of the predicted Big Five features. All combined static models are described below with the results in Table 6.9.

• Combo.Static.1: FBAttrib features and the mean of Big5.FBAttrib, Big5.Likes, and Big5.LIWC features

• Combo.Static.2: FBAttrib features and the mean of Big5.Likes and Big5.LIWC features

• Combo.Static.3: FBAttrib features and Big5.LIWC features

• Combo.Static.4: FBAttrib features and Big5.FBAttrib features

• Combo.Static.5: FBAttrib features and Big5.Likes features


Table 6.9: Combined Static Ego Model Results

Model             # of Samples    MAE     Random MAE
Combo.Static.1    1360            1.04    1.64
Combo.Static.2    1190            1.07    1.61
Combo.Static.3    2055            1.19    1.61
Combo.Static.4    9242            1.10    1.56
Combo.Static.5    1845            1.08    1.51

6.2.4 Summary

Tables 6.4 and 6.8 show that all models perform better than the Random Baseline and that all models outperformed a more sophisticated model [40] using only “Like” features (MAE = 1.22). The high accuracy of the Big5 model motivated the creation of predicted Big Five models that use common Facebook features to predict Big Five. When these models were created, they matched or outperformed their direct feature counterparts.

Combining the best static ego and predicted Big Five features into the single Combo.Static.1 model yielded greater accuracy than employing static ego features alone (MAE = 1.04); however, the Big5 model still outperformed the Combo.Static.1 model. This underscores the importance of Big Five when predicting SWL.

6.3 Temporal Models

Temporal models tested whether words expressed in Facebook statuses closer to the time of the SWL test had greater prediction accuracy than words posted earlier. We considered two granularities: daily and weekly statuses. The following summarizes the features for temporal models:

• Temporal.Daily: LIWC derived from Facebook statuses “n” days before the SWL test, where n = [1-7]

• Temporal.Weekly: LIWC derived from Facebook statuses “n” weeks before the SWL test, where n = [1-7]

Although the temporal models proved to be predictive of SWL, there was little variance over time. This suggests that “mood swings” (expressed through LIWC) do not affect SWL. In particular, Temporal.Daily's performance showed no significant difference in prediction accuracy when using recent posts (see Table 6.10). Similarly, Temporal.Weekly showed no significant difference on a weekly scale.

Table 6.10: Temporal Model Results

Model              T1 MAE    T2 MAE    T3 MAE    T4 MAE    T5 MAE    T6 MAE    T7 MAE
Temporal.Daily     1.18      1.18      1.18      1.19      1.19      1.18      1.18
Temporal.Weekly    1.18      1.17      1.18      1.17      1.17      1.17      1.17

6.4 Link Models

Link models incorporated the SWL of a friend or significant other in order to predict individual SWL. The features used in each model are described below with the results shown in Table 6.11. When only link features are used, we see that the SWL of the top 3 friends and of the significant other have the most impact on SWL prediction accuracy. The results for significant others are not too surprising, since significant others typically have a unique and stronger bond than other relationships. On the other hand, it is interesting to note that the top 3 friends give better prediction accuracy than a best friend. This could be because these relationships are not as strong as other types of relationships, and therefore one cannot rely on just one friend. City friends, on the other hand, incorporate more than one friend, but may not have a significant impact due to the strength of the relationship. One could imagine that friends in the same city are just acquaintances or friends due to proximity rather than strength of relationship, and therefore may have little effect on a person. Note that all predicted SWLs were predicted from the Big5 model.

• 3FriendSWL: Predicted SWL of Top 3 friends.

• 3RandomFriendSWL: Predicted SWL of 3 random friends. Used as a control group to compare with 3FriendSWL.

• FriendSWL: Predicted SWL of best friend.

• RandomFriendSWL: Predicted SWL of a random friend. Used as a control group to compare with FriendSWL.

• CityFriendSWL: Predicted SWL of friends who are in the same city.

• OtherSWL: Predicted SWL of a user who is listed under “in a relationship with”.

Table 6.11: Link Model Results

Model               # of Samples    MAE      Random MAE
3FriendSWL          695             0.865    1.54
3RandomFriendSWL    695             1.13     1.56
FriendSWL           3671            1.14     1.56
RandomFriendSWL     3671            1.13     1.57
CityFriendSWL       1654            1.15     1.55
OtherSWL            171             0.804    1.48

6.5 Combined Static and Link Models

The final models merged Combo.Static.1 model with the each of the linkfeatures. The following summarizes the features for combined static egoand link models and the results are shown in Table 6.12.

• FBAttrib.Big5.FriendSWL: Combo.Static.1 features combined with3FriendSWL model.

• FBAttrib.Big5.No3Friend: Combo.Static.1 features of users who have top 3 friends, without the 3FriendSWL features. We used this model as a baseline to determine the effect of the top 3 friends on SWL.

• FBAttrib.Big5.FriendSWL: Combo.Static.1 features combined with the FriendSWL model.

• FBAttrib.Big5.NoFriend: Combo.Static.1 features of users who have a top friend, without the FriendSWL feature. We used this model as a baseline to determine the effect of the top friend on SWL.

• FBAttrib.Big5.CityFriendSWL: Combo.Static.1 features combined with the CityFriendSWL model.

• FBAttrib.Big5.NoCityFriend: Combo.Static.1 features of users who have friends that live in the same city, without the CityFriendSWL features. We used this model as a baseline to determine the effect of same-city friends’ SWL.

• FBAttrib.Big5.OtherSWL: Combo.Static.1 features combined with the OtherSWL model.

• FBAttrib.Big5.NoOther: Combo.Static.1 features of users who have a significant other listed, without the OtherSWL features. We used this model as a baseline to determine the effect of significant others on SWL.

Table 6.12: Combined Static and Link Models Results


Model                          # of Samples   MAE
FBAttrib.Big5.3FriendSWL       695            0.822   1.54
FBAttrib.Big5.No3Friend        695            0.827   1.54
FBAttrib.Big5.FriendSWL        695            0.822   1.54
FBAttrib.Big5.NoFriend         695            0.827   1.54
FBAttrib.Big5.CityFriendSWL    1654           0.95    1.55
FBAttrib.Big5.NoCityFriend     1654           0.95    1.55
FBAttrib.Big5.OtherSWL         171            0.670   1.48
FBAttrib.Big5.NoOther          171            0.681   1.48

Table 6.12 summarizes the findings when link information was added as a feature. When incorporating the top 3 friends’ SWL, only a slight performance boost was observed in the FBAttrib.Big5.3FriendSWL model (MAE = 0.822) over the FBAttrib.Big5.No3Friend model (MAE = 0.827); there are a few reasons this may have occurred. One reason may stem from the features used in FBAttrib.Big5.No3Friend. The efficacy of the features in Combo.Static.1 (MAE = 1.04) may have dominated the prediction accuracy of the model, leaving room for only small improvements from link features. Another reason for the small improvement could be the way the link-related SWL scores were extracted. Since these scores were predicted from the Big5 model, they may not have been accurate enough to have a significant effect in the link model.

When incorporating the best friend feature, the result was similar to that of FBAttrib.Big5.3FriendSWL. This was surprising because the SWL of a best friend alone was not as predictive as that of the top 3 friends, as seen in Table 6.11. As stated previously, this result may be due to the features in Combo.Static.1: the strong static ego features combined with a weak SWL score may have prevented better results.

Interestingly, adding the SWL of city friends did not affect prediction accuracy, so it may not be an indicator of SWL. This is a plausible result, as friends who are in the same city may be friends due to proximity rather than strength of relationship.

The final link feature, the significant other’s SWL, showed the best performance (MAE = 0.670) of all the models. As with the top 3 friends, the model utilizing the significant other’s SWL was only slightly better than the model that did not include link features (MAE = 0.681). Note that a significant other’s SWL predicted a user’s SWL more accurately than the user’s top 3 friends. This finding is plausible, since a significant other is more likely to share in daily life events and may be more influential than “friends” on Facebook.
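To make the size of these boosts concrete, the following sketch computes MAE and the relative MAE reduction of each link model over its no-link baseline, using the MAE values reported in Table 6.12. The helper names `mae` and `relative_reduction` are hypothetical and introduced only for this illustration; the relative-reduction measure is one reasonable way to express the gap, not necessarily the comparison used elsewhere in this work.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae


# MAE pairs copied from Table 6.12: (with link feature, without link feature).
table_6_12 = {
    "top 3 friends": (0.822, 0.827),
    "city friends": (0.95, 0.95),
    "significant other": (0.670, 0.681),
}

reductions = {
    name: relative_reduction(without_link, with_link)
    for name, (with_link, without_link) in table_6_12.items()
}

for name, value in reductions.items():
    print(f"{name}: {100 * value:.1f}% lower MAE with the link feature")
```

Expressed this way, the top 3 friends and significant other features each lower MAE by under two percent relative to their baselines, and the city friend feature leaves it unchanged.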

Chapter 7

Conclusions

7.1 Conclusion

In this study, several models were created to predict individual SWL. The findings showed that static ego features such as network size, number of photo tags, age, relationship status, likes, and overall word usage (LIWC) can be combined to make a good predictor of SWL (MAE = 1.04). It was also shown that the Big Five consistently predicted SWL, and that reducing high-dimensional variables (e.g. 600-dimensional “Likes”) to highly predictive variables (e.g. Big Five) increased the performance of our model. This was an interesting result for three reasons. First, it showed consistency with other literature confirming the Big Five’s strong correlation with SWL. Second, it showed that the Big Five can be predicted from common Facebook features accurately enough to predict SWL. Third, it showed that reducing high-dimensional features into highly predictive ones created more accurate predictions.

When using link features, we found that a user’s top 3 friends and significant other had the most impact on prediction accuracy. However, when compared to the combined static ego feature models, the boost was minimal. This may be attributed to the noise of using a predicted SWL score for friends and couples. If we had the true SWL values for link relationships, these models might have shown more lift. Another possibility is that the link features were overshadowed by ego features (i.e. personality). This too is a plausible explanation, as previous research [20] noted that up to 80% of the variability in SWL could be attributed to long-lasting features.

Although LIWC was a good predictor of SWL, the temporal features of Facebook statuses showed no improvement to our models. This may be attributed to SWL’s high internal and temporal consistency, as noted in previous research [18]. Because SWL measures a cognitive-judgmental process, it is plausible that “mood swings”, as expressed by LIWC, would not be a large indicator of a user’s overall SWL. Another explanation could be that the time frames were not granular enough to capture the transient mood of a user prior to the SWL test.

Overall, all of our models outperformed the Random Baseline by at least 11%. When compared to a linear regression model that used “Likes” features [40], our best model was 45% more accurate. We believe that the selection of Random Forest Regression, a combination of static ego features, and “important” link features provided the increase in prediction accuracy.

In our study, it was demonstrated that social media sites, such as Facebook, contain interactions that could be utilized for predicting private traits. The ability to predict user attributes like SWL may be useful for the social sciences at the individual and community level. From an individual standpoint, we can potentially create early warning schemes to identify users who are in distress. From a community standpoint, collection of SWL and SWB information from social media can provide an efficient evaluation of public wellness.

7.2 Future Work

The results of these experiments were promising, but there are many areas left for research and improvement. For example, a major limitation of this study was sparse features. Some features (e.g. number of groups) correlated highly with SWL (R’ = -0.678) but were not well populated, and therefore could not be utilized as predictive features. Future work in this area should focus on fine-tuning feature collection and selection.

Although the ability to collect all features from all users would be ideal, this is typically not a realistic goal. This experiment treated the missing features as missing completely at random (MCAR); however, a more complete analysis could be done if the problem utilized two inference models, one for observed and one for unobserved data.

Unfortunately, temporal features showed little predictive power in this experiment; however, based on other research, it appears that occasion-specific events and short-term events do have an effect on SWL. Future research should explore which social media features could indicate these phenomena. For instance, sentiment analysis of posts, rather than LIWC, could prove more predictive of “mood swings” or significant events.

Link relationships proved to be a promising predictor of SWL; however, their importance is still unclear due to the small improvement seen over the combined static models. The model that included the SWL of significant others was particularly interesting since it gave the best results (MAE = 0.67); however, the sample size (n = 171) was relatively small. Future work should consider a larger sample size to ensure better generalization.

As expressed in the conclusion, expanding this work could aid in the detection of early warning signs for users in distress. SWL is not an epiphenomenon; it is related to other psychological and social occurrences. Predicting SWL from social media could be the first step toward further research in depression detection, PTSD detection, and more. In order to carry out such experiments, subgroups, such as unsatisfied users, should be explored in more depth.

Finally, this study focused on only a small subset of features from previous research; many other Facebook features have yet to be explored. For instance, although not explicitly studied in this experiment, factors such as culture and location have been shown to have an effect on SWL. By dividing samples into subgroups, perhaps new insights can be gained.

Bibliography

[1] Bachrach, Yoram, et al. “Your digital image: factors behind demographic and psychometric predictions from social network profiles.” Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[2] Baker, L. M., Wilson, F. L., and Winebarger, A. (2004). An exploratory study of the health problems, stigmatization, life satisfaction, and literacy skills of urban, street-level sex workers. Women and Health, 39, 83-96.

[3] Bentham, Jeremy. A fragment on government. The Lawbook Exchange, Ltd., 1891.

[4] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.

[5] Bollen, Johan, et al. “Happiness is assortative in online social networks.” Artificial life 17.3 (2011): 237-251.

[6] Breiman, Leo, et al. Classification and regression trees. CRC press, 1984.

[7] Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.

[8] Bryant, R. A., Marosszeky, J. E., Crooks, J., Baguley, I. J., and Gurka, J. A. (2001). Posttraumatic stress disorder and psychosocial functioning after severe traumatic brain injury. The Journal of Nervous and Mental Disease, 189, 109-113.

[9] Costa, Paul T., and Robert R. McCrae. “Influence of extraversion and neuroticism on subjective well-being: happy and unhappy people.” Journal of personality and social psychology 38.4 (1980): 668.

[10] Costa Jr, P. T., and Robert R. McCrae. “Neo personality inventory-revised (neo-pi-r) and neo five-factor inventory (neo-ffi) professional manual.” Odessa, FL: Psychological Assessment Resources (1992).

[11] Constine, Josh. “Facebook Ups Character Limit to 60,000, Google+’s Is Still Bigger”. 30 Nov. 2011. Web. 12 Oct. 2014.

[12] Correa, Teresa, Amber Willard Hinsley, and Homero Gil De Zuniga. “Who interacts on the Web?: The intersection of users’ personality and social media use.” Computers in Human Behavior 26.2 (2010): 247-253.

[13] Coviello, Lorenzo, et al. “Detecting Emotional Contagion in Massive Social Networks.” PloS one 9.3 (2014): e90315.

[14] Dermer, Marshall, et al. “Evaluative judgments of aspects of life as a function of vicarious exposure to hedonic extremes.” Journal of Personality and Social Psychology 37.2 (1979): 247.

[15] DeNeve, Kristina M., and Harris Cooper. “The happy personality: a meta-analysis of 137 personality traits and subjective well-being.” Psychological bulletin 124.2 (1998): 197.

[16] Diener, Edward, et al. “Understanding scores on the satisfaction with life scale.” Retrieved January 11 (2006): 2008.

[17] Diener, Ed, and Micaela Y. Chan. “Happy people live longer: Subjective wellbeing contributes to health and longevity.” Applied Psychology: Health and Well-Being 3.1 (2011): 1-43.

[18] Diener E, Emmons RA, Larsen RJ, Griffin S (1985) The satisfaction with life scale. J Pers Assess 49(1):71-75.

[19] Diener, Ed. “Subjective well-being: The science of happiness and a proposal for a national index.” American psychologist 55.1 (2000): 34.

[20] Diener, Ed, Ronald Inglehart, and Louis Tay. “Theory and validity of life satisfaction scales.” Social Indicators Research 112.3 (2013): 497-527.

[21] Diener, Ed, Shigehiro Oishi, and Richard E. Lucas. “Personality, culture, and subjective well-being: Emotional and cognitive evaluations of life.” Annual review of psychology 54.1 (2003): 403-425.

[22] Duggan, Maeve, Nicole B. Ellison, Cliff Lampe, Amanda Lenhart, and Mary Madden. “Social Media Update 2014.” Pew Research Center’s Internet & American Life Project RSS. Pew Research Center, 09 Jan. 2015. Web. 09 Jan. 2015.

[23] Drucker, Harris, et al. “Support vector regression machines.” Advances in neural information processing systems 9 (1997): 155-161.

[24] Economist Intelligence Unit. (2004). The economist intelligence unit’s quality-of-life index. Retrieved November 17, 2012 from http://www.economist.com/media.

[25] Eid, M., and Diener, E. (2004). Global judgments of subjective well-being: Situational variability and long-term stability. Social Indicators Research, 65, 245-277.

[26] Farnadi, Golnoosh, et al. “How are you doing? Emotions and personality in Facebook.” Proceedings of the EMPIRE Workshop of the 22nd International Conference on User Modeling, Adaptation and Personalization (UMAP 2014). 2014.

[27] Farnadi, Golnoosh et al. “Computational Personality Recognition in Social Media”, User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI), in press.

[28] Furr, R. M. and Funder, D. “A multimodal analysis of personal negativity.” Journal of Personality and Social Psychology, 74, 1580-1591. 1998.

[29] Fujita, F., and Diener, E. “Life satisfaction set point: Stability and change.” Journal of Personality and Social Psychology, 88, 158-164. 2005.

[30] He, Qiwei, et al. “Predicting self-monitoring skills using textual posts on Facebook.” Computers in human behavior 33 (2014): 69-78.

[31] Helliwell, John F., Richard Layard, and Jeffrey Sachs, eds. World happiness report 2013. Sustainable Development Solutions Network, 2013.

[32] Hsee, C. K., and Zhang, J. “General evaluability theory.” Perspectives on Psychological Science, 5, 343-355. 2010.

[33] Gilman, Rich, and Scott Huebner. “A review of life satisfaction research with children and adolescents.” School Psychology Quarterly 18.2 (2003): 192.

[34] Graham, C. “Happiness around the world: The paradox of happy peasants and miserable millionaires.” Oxford: Oxford University Press. 2009.

[35] GOV.UK. “Wellbeing: Introduction to Subjective Wellbeing Datasets.” Research and Analysis. Cabinet Office, 27 Mar. 2013. Web. 08 Aug. 2014.

[36] Guyon, Isabelle; Elisseeff, André. “An Introduction to Variable and Feature Selection”. JMLR 3. 2003.

[37] Joy, R. H. “Path analytic investigation of stress-symptom relationships: Physical and psychological symptom models.” Unpublished doctoral dissertation, University of Illinois at Champaign-Urbana. 1990.

[38] Kaplan, Andreas M., and Michael Haenlein. “Users of the world, unite! The challenges and opportunities of Social Media.” Business horizons 53.1 (2010): 59-68.

[39] Kietzmann, Jan H., et al. “Social media? Get serious! Understanding the functional building blocks of social media.” Business horizons 54.3 (2011): 241-251.

[40] Kosinski, M., Stillwell D.J., Graepel T. “Private traits and attributes are predictable from digital records of human behavior”. Proceedings of the National Academy of Sciences (PNAS). 2013.

[41] Krasnova, Hanna, et al. “Envy on Facebook: A Hidden Threat to Users’ Life Satisfaction?.” Wirtschaftsinformatik 92 (2013).

[42] Lawless, N. M., and Lucas, R. E. “Predictors of regional well-being: a county level analysis.” Social Indicators Research 101(3):341-357. 2011.

[43] Lewinsohn, Peter M., J. Redner, and J. Seeley. “The relationship between life satisfaction and psychosocial variables: New perspectives.” Subjective well-being: An interdisciplinary perspective (1991): 141-169.

[44] Lucas, Richard E., and M. Brent Donnellan. “How stable is happiness? Using the STARTS model to estimate the stability of life satisfaction.” Journal of Research in Personality 41.5 (2007): 1091-1098.

[45] Manyika, James, et al. “Big data: The next frontier for innovation, competition, and productivity.” (2011).

[46] Marks, Gary N., and Nicole Fleming. “Influences and consequences of well-being among Australian young people: 1980-1995.” Social Indicators Research 46.3 (1999): 301-323.

[47] Monica, Paul R. La. “Facebook Now worth $200 Billion.” CNNMoney. Cable News Network, 9 Sept. 2014. Web. 12 Oct. 2014.

[48] Na, J., Kosinski, M., and Stillwell, D. “When a new tool is introduced in different cultural contexts: Individualism-Collectivism and social network on Facebook”. Journal of Cross-Cultural Psychology, in press.

[49] Oishi, Shigehiro, and Helen W. Sullivan. “The predictive value of daily vs. retrospective well-being judgments in relationship stability.” Journal of Experimental Social Psychology 42.4 (2006): 460-470.

[50] Park, Namsu, Kerk F. Kee, and Sebastián Valenzuela. “Being immersed in social networking environment: Facebook groups, uses and gratifications, and social outcomes.” CyberPsychology and Behavior 12.6 (2009): 729-733.

[51] Pavot, William, and Ed Diener. “Review of the satisfaction with life scale.” Psychological assessment 5.2 (1993): 164.

[52] Quercia, Daniele. “Don’t worry, be happy: the geography of happiness on Facebook.” Proceedings of the 5th Annual ACM Web Science Conference. ACM, 2013.

[53] Ramanadhan, Shoba, et al. “Social media use by community-based organizations conducting health promotion: a content analysis.” BMC public health 13.1 (2013): 1129.

[54] Rice, John. Mathematical statistics and data analysis. Cengage Learning, 2006.

[55] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley and Sons, Inc., 1987.

[56] Rogers, Simon. “Happiness Index: The UK in Happiness, Anxiety and Job Satisfaction.” Theguardian. The Guardian, 20 Nov. 2012. Web. 01 July 2014.

[57] Ruder, Thomas D., et al. “Suicide announcement on Facebook.” Crisis: The Journal of Crisis Intervention and Suicide Prevention 32.5 (2011): 280-282.

[58] Sap, M. et al. “Developing Age and Gender Predictive Lexica over Social Media”. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[59] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[60] Schneider, Leann, and Ulrich Schimmack. “Self-informant agreement in well-being ratings: A meta-analysis.” Social Indicators Research 94.3 (2009): 363-376.

[61] Schwartz, Hansen Andrew, et al. “Characterizing Geographic Variation in Well-Being Using Tweets.” ICWSM. 2013.

[62] Schimmack, Ulrich, and Shigehiro Oishi. “The influence of chronically and temporarily accessible information on life satisfaction judgments.” Journal of personality and social psychology 89.3 (2005): 395.

[63] Seligman, Martin EP, and Mihaly Csikszentmihalyi. Positive psychology: An introduction. Vol. 55. No. 1. American Psychological Association, 2000.

[64] Shin, Doh C., and Dan M. Johnson. “Avowed happiness as an overall assessment of the quality of life.” Social indicators research 5.1-4 (1978): 475-492.

[65] Strack, Fritz, et al. “Salience of comparison standards and the activation of social norms: Consequences for judgements of happiness and their communication.” British Journal of Social Psychology 29.4 (1990): 303-314.

[66] Tausczik, Yla R., and James W. Pennebaker. “The psychological meaning of words: LIWC and computerized text analysis methods.” Journal of language and social psychology 29.1 (2010): 24-54.

[67] Tay, Louis, and Ed Diener. “Needs and subjective well-being around the world.” Journal of personality and social psychology 101.2 (2011): 354.

[68] Veenhoven, Ruut. The study of life-satisfaction. 1996.

[69] Vittersø, Joar, Robert Biswas-Diener, and Ed Diener. “The divergent meanings of life satisfaction: Item response modeling of the satisfaction with life scale in Greenland and Norway.” Social Indicators Research 74.2 (2005): 327-348.

[70] Wenninger, Helena, Hanna Krasnova, and Peter Buxmann. “Activity Matters: Investigating the Influence of Facebook on Life Satisfaction of Teenage Users.” Twenty Second European Conference on Information Systems, 2014.

[71] Wirtz, Derrick, et al. “What to do on spring break? The role of predicted, on-line, and remembered experience in future choice.” Psychological Science 14.5 (2003): 520-524.

[72] Wilmot, M., C.G. DeYoung, D. Stillwell and M. Kosinski. “Self-Monitoring and the Metatraits”. Journal of Personality, 2015.

[73] Youyou, Wu, Michal Kosinski, and David Stillwell. “Computer-based personality judgments are more accurate than those made by humans.” Proceedings of the National Academy of Sciences (2015): 201418680.

Appendix A

Supplemental Graphs

A.1 Big Five Density

Figure A.1: Histograms showing the density of each Big Five feature for all users in the myPersonality dataset (n = 3,137,694).

(a) Density of Agreeableness

(b) Density of Conscientiousness

(c) Density of Extraversion

(d) Density of Openness

(e) Density of Neuroticism

Appendix B

Pearson Correlations

B.1 LIWC Correlations

Feature   R            R’          # of Samples
achieve   0.042560     0.595418    3505
adverb    -0.057497    -0.662586   3505
affect    -0.052232    -0.572170   3505
anger     -0.158459*   -0.944416   3505
anx       -0.030429    -0.425586   3505
article   0.042317     0.541564    3505
assent    -0.024270    -0.281534   3505
auxverb   -0.063327    -0.591759   3505
bio       -0.090912    -0.720758   3505
body      -0.105941*   -0.835244   3505
cause     -0.049897    -0.557251   3505
certain   -0.007785    -0.073090   3505
cogmech   -0.043152    -0.508757   3505
conj      -0.014088    -0.238587   3505
death     -0.049767    -0.502260   3505
discrep   -0.076119    -0.697549   3505
excl      -0.058647    -0.655101   3505
family    0.012077     -0.092474   3505
feel      -0.046374    -0.476030   3505
filler    -0.015947    -0.236960   3505
friend    -0.006006    -0.008868   3505
funct     -0.029431    -0.350704   3505
future    -0.066235    -0.694945   3505
health    -0.069294    -0.731503   3505
hear      -0.031863    -0.133496   3505
home      0.058177     0.602766    3505
humans    -0.030948    -0.479601   3505
i         -0.070849    -0.601048   3505
incl      0.035071     0.373532    3505
ingest    0.016065     0.208542    3505
inhib     0.004064     0.149162    3505
insight   -0.030014    -0.408507   3505
ipron     -0.040504    -0.523688   3505
leisure   0.071732     0.597954    3505
money     0.012197     0.205470    3505
motion    0.043362     0.566081    3505
negate    -0.089598    -0.859773   3505
negemo    -0.159933*   -0.929049   3505
nonfl     -0.049053    -0.506875   3505
number    0.020785     0.331345    3505
past      -0.012796    -0.268714   3505
percept   -0.049534    -0.612704   3505
posemo    0.039235     0.211055    3505
ppron     -0.056868    -0.443792   3505
preps     0.005822     -0.009251   3505
present   -0.057636    -0.503600   3505
pronoun   -0.058101    -0.500969   3505
quant     -0.000531    -0.075164   3505
relativ   0.022571     0.197992    3505
relig     0.054916     0.332988    3505
sad       -0.041164    -0.482742   3505
see       -0.022406    -0.277629   3505
sexual    -0.046376    -0.562598   3505
shehe     0.003924     -0.020562   3505
social    -0.007290    -0.135795   3505
space     -0.008032    -0.102725   3505
swear     -0.147914*   -0.880167   3505

B.2 Ego Feature Correlations

Feature               R           R’          # of Samples
# of concentrations   -0.009628   -0.282290   1854
# of educations       -0.025117   -0.204770   21022
# of events           -0.013790   -0.071350   802
# of groups           -0.055199   -0.678348   5443
# of likes            -0.078601   -0.721082   7173
# of statuses         0.013460    0.311264    3503
# of photo tags       0.029699    0.596474    23197
# of work places      -0.026692   -0.378117   19540
age                   0.013145    0.248857    42264
network size          0.061001    0.845528    60863

B.3 Big5 Correlations

Feature   R           R’          # of Samples
ope       0.051915    0.900966    86073
con       0.277651    0.985919    86073
ext       0.295244    0.996589    86073
agr       0.241992    0.988066    86073
neu       -0.471156   -0.997529   86073
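For reference, the R column in the tables above is Pearson's correlation coefficient between a feature and SWL. The following is a minimal sketch of that computation in plain Python; the function name `pearson_r` and the toy values are hypothetical and chosen only for illustration. It does not cover the R’ column, whose definition is not restated in this appendix.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Toy example (invented values): a feature that rises with SWL and one that falls.
swl = [2.0, 3.5, 4.0, 5.5, 6.0]
rising_feature = [10, 14, 15, 21, 22]
falling_feature = [9, 7, 6, 3, 2]
r_up = pearson_r(rising_feature, swl)
r_down = pearson_r(falling_feature, swl)
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0, as for most rows above, indicate little linear association with SWL.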

Appendix C

Data Descriptions

C.1 General Acronyms

Agr    Agreeableness
Con    Conscientiousness
Ext    Extraversion
LIWC   Linguistic Inquiry and Word Count
LDA    Latent Dirichlet Allocation
Ope    Openness
Neu    Neuroticism
SWLS   Satisfaction with Life Scale
SWL    Satisfaction with Life
SWB    Subjective Well Being
RFR    Random Forest Regression
R      Pearson’s Correlation Coefficient

C.2 LIWC Category Description

Category                 Abbreviation   Examples

Linguistic processes
Function words           funct
Personal pronouns        ppron          I, them, itself
First-person singular    i              I, me, mine
First-person plural      we             We, us, our
Second person            you            You, your, thou
Third-person singular    shehe          She, her, him
Third-person plural      they           They, their, they’d
Indefinite pronouns      ipron          It, its, those
Articles                 article        A, an, the
Common verbs             verb           Walk, went, see
Auxiliary verbs          auxverb        Am, will, have
Past tense               past           Went, ran, had
Present tense            present        Is, does, hear
Future tense             future         Will, gonna
Adverbs                  adverb         Very, really, quickly
Prepositions             preps          To, with, above
Conjunctions             conj           And, but, whereas
Negations                negate         No, not, never
Quantifiers              quant          Few, many, much
Numbers                  number         Second, thousand
Swear words              swear          Damn, piss, fuck

Psychological processes
Social processes         social         Mate, talk, they
Family                   family         Daughter, husband
Friends                  friend         Buddy, friend, neighbor
Humans                   humans         Adult, baby, boy
Affective processes      affect         Happy, cried, abandon
Positive emotion         posemo         Love, nice, sweet
Negative emotion         negemo         Hurt, ugly, nasty
Anxiety                  anx            Worried, nervous
Anger                    anger          Hate, kill, annoyed
Sadness                  sad            Crying, grief, sad
Cognitive processes      cogmech        Cause, know, ought
Insight                  insight        Think, know, consider
Causation                cause          Because, effect, hence
Discrepancy              discrep        Should, would, could
Tentative                tentat         Maybe, perhaps, guess
Certainty                certain        Always, never
Inhibition               inhib          Block, constrain, stop
Inclusive                incl           And, with, include
Exclusive                excl           But, without, exclude
Perceptual processes     percept        Observing, heard, feeling
See                      see            View, saw, seen
Hear                     hear           Listen, hearing
Feel                     feel           Feels, touch
Biological processes     bio            Eat, blood, pain
Body                     body           Cheek, hands, spit
Health                   health         Clinic, flu, pill
Sexual                   sexual         Horny, love, incest
Ingestion                ingest         Dish, eat, pizza
Relativity               relativ        Area, bend, go
Motion                   motion         Arrive, car, go
Space                    space          Down, in, thin
Time                     time           End, until, season

Personal concerns
Work                     work           Job, majors, xerox
Achievement              achieve        Earn, hero, win
Leisure                  leisure        Cook, chat, movie
Home                     home           Apartment, kitchen, family
Money                    money          Audit, cash, owe
Religion                 relig          Altar, church, mosque
Death                    death          Bury, coffin, kill

Spoken categories
Assent                   assent         Agree, OK, yes
Nonfluencies             nonfl          Er, hm, umm
Fillers                  filler         Blah, I mean, ya know

C.3 LDA “Like” Topic Descriptions

Topic # Topics

t1    Survivor, Big Brother, American Idol, The Amazing Race, Amazing Race
t2    Paintball, Anything About Guns, Airsoft, Guns, Modern Warfare 2
t3    National Treasure, Shrek X-Men Movies, The Mummy, Spider-Man
t4    Hamlet, The Great Gatsby, Wuthering Heights, Frankenstein, Macbeth
t5    Hollister Co., American Eagle Outfitters, Abercrombie & Fitch, Victoria’s Secret Pink, Cheerleading
t6    Green Bay Packers, Aaron Rodgers, Milwaukee Brewers, Wisconsin Badgers, Donald Driver
t7    FRIENDS (TV Show), Chandler Bing, Barney Stinson, Matthew Perry, Inception
t8    Rush Hour, Rush Hour Rush Hour Will Smith, Family Guy
t9    Impression In Style, Inspiring Designs, Urban Design Concepts, K Bridals, Generation
t10   Remember the Titans, Coach Carter, Step Up, Basketball, Friday Night Lights
t11   Skittles, Dr Pepper, YouTube, Oreo, Reese’s
t12   hillsong united, Hillsong Live, Bible, Jesus Culture, Chris Tomlin
t13   Tide, Swag Bucks, Downy, Clairol, Extra Gum
t14   Walmart, Target, Walt Disney World, Disney, Disneyland
t15   Buffalo Sabres, New York Yankees, Upstate New York, Buffalo Bills, Adirondacks
t16   Scarface, The Godfather, Goodfellas, Casino, Goodfellas
t17   The Twilight Saga, Twilight, Victoria’s Secret, Victoria’s Secret Pink, Team Twilight
t18   Jamba Juice, Arizona Iced Tea, California, In-N-Out Burger, Sushi
t19   T.I., Usher, Keri Hilson, Trey Songz, Ciara
t20   Maher Zain, Golden Screen Cinemas, Lisa Surihani, ILoveAllaah.com, Yuna
t21   Cristiano Ronaldo, Manchester United, David Beckham, Real Madrid C.F., FC Barcelona
t22   Cupcakes!, Chocolate, Cotton Candy, Penguins!!!, cats

Appendix D

Algorithms

D.1 Decision Tree

Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and a label vector $y \in \mathbb{R}^l$, a decision tree recursively partitions the space such that the samples with the same labels are grouped together. Let the data at node $m$ be represented by $Q$. For each candidate split $\theta = (j, t_m)$, consisting of a feature $j$ and threshold $t_m$, partition the data into subsets $Q_{left}(\theta)$ and $Q_{right}(\theta)$:

\[ Q_{left}(\theta) = \{(x, y) \mid x_j \le t_m\} \tag{D.1} \]

\[ Q_{right}(\theta) = Q \setminus Q_{left}(\theta) \tag{D.2} \]

The impurity at $m$ is computed using an impurity function $H()$:

\[ G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta)) \tag{D.3} \]

where $n_{left}$ and $n_{right}$ are the numbers of samples in the left and right subsets and $N_m$ is the number of samples at node $m$.

We select the parameters that minimize the impurity:

\[ \theta^* = \operatorname{argmin}_{\theta} G(Q, \theta) \tag{D.4} \]

Then we recurse for subsets $Q_{left}(\theta^*)$ and $Q_{right}(\theta^*)$ until the maximum allowable depth is reached, $N_m < \mathrm{min}_{samples}$, or $N_m = 1$.

Because we are trying to predict the numerical value of SWL (a regression task), we choose Mean Squared Error as the impurity function. At each node $m$, we find the feature-threshold pair, $X_m$, that minimizes $H$.

\[ c_m = \frac{1}{N_m} \sum_{i \in N_m} y_i \tag{D.5} \]

\[ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - c_m)^2 \tag{D.6} \]
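A single split search following equations D.1 through D.6 can be sketched in plain Python as below. This is a minimal illustration and not the scikit-learn implementation used for the experiments: the function names `mse_impurity` and `best_split`, the exhaustive scan over observed feature values as candidate thresholds, and the toy data are all assumptions made for the example.

```python
def mse_impurity(ys):
    """H in D.6: mean squared error of labels around the node mean c_m (D.5)."""
    c_m = sum(ys) / len(ys)
    return sum((y - c_m) ** 2 for y in ys) / len(ys)


def best_split(rows, ys):
    """Return (G, j, t) minimizing the weighted impurity G of D.3 (see D.4).

    rows: list of feature vectors at the node; ys: their numeric labels.
    Candidate thresholds are the observed values of each feature (an
    assumption of this sketch). Returns None if no split separates the data.
    """
    n_m = len(rows)
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            # D.1 and D.2: left gets x_j <= t, right gets the remainder.
            left = [y for row, y in zip(rows, ys) if row[j] <= t]
            right = [y for row, y in zip(rows, ys) if row[j] > t]
            if not left or not right:
                continue
            g = (len(left) / n_m) * mse_impurity(left) \
                + (len(right) / n_m) * mse_impurity(right)
            if best is None or g < best[0]:
                best = (g, j, t)
    return best


# Toy node (invented values): feature 0 separates the labels cleanly,
# feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 7.0], [4.0, 3.0]]
ys = [1.0, 1.0, 5.0, 5.0]
split = best_split(rows, ys)
```

A full tree would apply `best_split` recursively to the left and right subsets until one of the stopping conditions above is met; a random forest then averages the predictions of many such trees.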
