
Causal Inference Struggles with Agency on Online Platforms

Smitha Milli∗
UC Berkeley
[email protected]

Luca Belli
Twitter
[email protected]

Moritz Hardt†
UC Berkeley
[email protected]

Abstract

Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate “Catch-22”s that suggest that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.

1 Background

Online platforms commonly use machine learning algorithms to personalize user experience. However, such algorithms may lead to harmful unintended consequences including disparate impacts [Buolamwini and Gebru, 2018, Angwin et al., 2016, Obermeyer et al., 2019], amplification of misinformation [Starbird et al., 2019, Bradshaw and Howard, 2018], and political polarization [Allcott et al., 2020, Levy, 2021]. In response, activists, regulators, researchers, and even the platforms themselves have been pushing to let users have greater control and agency over their experiences on the platform, particularly when it comes to the use of machine learning algorithms [Lukoff et al., 2021, Burrell et al., 2019, Harambam et al., 2019, Kaminski, 2018, Jones, 2017, Dean et al., 2020]. For example, after finding that its image cropping algorithm was more likely to crop out black or masculine-presenting individuals than white or feminine-presenting individuals, Twitter has planned to remove the use of algorithmic cropping altogether [Yee et al., 2021].

However, user control by itself is not a panacea, as, for example, the case of privacy controls indicates [Barocas and Nissenbaum, 2014]. When a user can pick between multiple options, the platform must choose a default setting, and since most users simply leave the default setting [Manber et al., 2000, Mackay, 1991], it is crucial to understand how the choice of default setting affects outcomes on the platform. Moreover, as the options to users increase, the number of different user experiences does too; hence, any attempt to improve the platform must take these different user experiences into account. For example, suppose users

∗Work done while the author was an intern at Twitter.
†MH was a paid consultant at Twitter. Work performed while consulting for Twitter.


[Figure 1: four panels comparing the RCT, DIGM, EM, RA, and IPTW estimates for the push notifications, email notifications, timeline, and quality filter experiments; the y-axes are User Minutes on Platform, User Minutes on Platform, User Minutes on Timeline, and Number of Stranger Blocks, respectively.]

Figure 1: The results for the four within-study comparisons in which we compare the ground-truth randomized controlled trial (RCT) estimate to four observational estimates: the difference in group means (DIGM), exact matching (EM), regression adjustment (RA), and inverse probability of treatment weighting (IPTW). In all four cases, the observational methods do poorly at recovering the RCT estimate, and in all except the quality filter, they recover the incorrect sign. All error bars are calculated through bootstrapped sampling.

can choose whether the content they see is algorithmically filtered to remove offensive content. If at some point, the platform sees an increase in reports of offensive content, then a researcher trying to isolate the cause for the increase will now likely need to test whether the reports primarily stem from those who do not have the filter on.

Even if the user is given control over a setting, we will typically still need to understand the effects of the setting, and especially of the default. To determine how changes to the platform causally affect various outcomes of interest, online platforms regularly conduct randomized experiments, also referred to as A/B tests. Randomized experiments are typically thought to be the gold standard for causal inference [Byar et al., 1976, Byar, 1980]. However, experimentation on online platforms has received widespread criticism on ethical and legal grounds [Felten, 2015, Tufekci, 2015, Grimmelmann, 2015, Hunter and Evans, 2016, Jones, 2017], much of it sparked by the controversial Facebook emotional contagion study [Kramer et al., 2014], which manipulated users’ newsfeeds with the aim of understanding whether user moods would be positively/negatively influenced by showing more positive/negative content. Among many other concerns, online experimentation has been particularly criticized for requiring little user consent [Benbunan-Fich, 2017, Metcalf and Crawford, 2016].

As platforms give users greater agency, observational data is naturally generated for each setting the user can toggle, raising the intriguing possibility of conducting observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. For example, we could compare users who choose to keep personalized ads on to those who opt-out of them to investigate the effect that personalization has on ad revenue. Importantly, the platform does not autonomously manipulate the user’s settings; the users are the ones who select which setting they receive. Thus, the use of observational studies is appealing as it may mitigate some of the issues with online experimentation such as lack of user consent.

However, it is unclear how effective such observational studies will be. The efficacy of observational methods differs between domains and is often controversial. For example, in the medical domain, Benson and Hartz [2000] and Concato et al. [2000] found that observational studies produced results remarkably similar to those from randomized experiments (although the editorial accompanying these articles cast doubt on their results [Pocock and Elbourne, 2000]). On the other hand, when used to evaluate the effectiveness of job training programs, observational methods have generally been found to be less reliable than randomized experiments [Glazerman et al., 2003, Smith and Todd, 2005, Wong et al., 2018].


In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population [LaLonde, 1986, Fraker and Maynard, 1987, Wong and Steiner, 2018]. In the four cases we look at, the user can choose whether (a) they receive algorithmic email notifications, (b) they receive algorithmic push notifications, (c) their notifications are filtered by a model that predicts whether the notifications are quality or not, and (d) their feed is algorithmically curated. We test the naive difference in group means estimator, exact matching, regression adjustment, and propensity score weighting while controlling for plausible confounding variables that are available to the experimenter. In all cases, all four estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except the quality filter, the observational estimates, in fact, have the opposite sign as the randomized estimate. Our results are shown in Figure 1 and are elaborated upon in Section 3.

Our empirical results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. We end by discussing deeper “Catch-22”s that question whether the success of causal inference is even compatible with the motivations for user agency. For instance, a major motivation for giving users greater agency is the belief that there is no adequate model for predicting user behavior; however, performing observational causal inference successfully requires exactly that. All observational methods need as an assumption some form of unconfoundedness, and this Catch-22 suggests that it will be hard to justify this assumption. Alternatively, even if we could predict users, we may still want to avoid prediction out of concerns for human dignity, and instead opt to give users greater agency [Kaminski, 2018, Jones, 2017]. This brings us to our second Catch-22: out of dignitary concerns, we give users greater control to avoid user prediction, but observational causal inference requires such prediction, and thus jeopardizes the dignitary considerations that originally motivated user control.

2 Methodology

As a side-effect of giving users greater control on the platform, it becomes possible to conduct observational studies in which users self-select into treatments of interest. Our goal in this paper is to experimentally investigate the effectiveness of observational studies from user self-selection on online platforms. To do so, we conduct four large-scale within-study comparisons on the Twitter platform. For each case, we have both a randomized A/B test and an observational study for the same population and treatment of interest. We then assess the estimates from the observational study by measuring how closely they replicate the “ground-truth” estimate from the randomized experiment. Our methodology is very similar to that of Gordon et al. [2019], who used within-study comparisons to assess the effectiveness of observational methods for recovering the effects of ad campaigns at Facebook.

The randomized A/B test splits a small portion of users into a treatment and control group. Users in the control group never receive the treatment. Users in the treatment group receive the treatment by default; however, they can also opt-out if they wish to. In other words, the experiment has one-sided compliance. Users in the treatment group that opt-out are unexposed and referred to as the treatment-unexposed (TU) group. The rest are exposed and referred to as the treatment-exposed (TE) group. For the observational study, we compare the treatment-exposed group to the treatment-unexposed group while ignoring the control group. In many cases where users can control a treatment of interest, there is no corresponding A/B test, and thus no control group, so comparing the treatment-exposed and treatment-unexposed groups mimics how we would ordinarily conduct an observational study. See Figure 2 for an illustration of the set-up.

For a concrete example, let us return to our personalization example. By default, users on a platform have personalized ads on; however, they can also opt-out if they wish to. The platform runs an A/B test to measure how the use of ads personalization affects user activity on the platform. Regardless of what the user’s personalization setting is, the platform never personalizes ads for those in the control group. However,


[Figure 2: diagram of the set-up. The randomized experiment compares the Control group to the Treatment group (TE and TU together); the observational study compares only the TE and TU groups.]

Figure 2: An illustration of the set-up for the within-study comparisons. In the randomized experiment, each user is randomly assigned to either the control group or the treatment group. Those in the control group never receive the treatment, whereas those in the treatment group can opt-out if they wish to. The treatment-unexposed (TU) group are users in the treatment group who opt-out and the treatment-exposed (TE) group are users in the treatment group who do not opt-out. In the observational study, we only compare the TE and TU group, while ignoring the control group.

those in the treatment group only receive personalized ads, i.e., are exposed, if they have the setting on. If they have opted out, then they do not receive personalization and are unexposed.

We now introduce notation to formalize the setting. There are a total of n users in the experiment. For each user i ∈ [n], the variable Zi represents their group assignment: Zi = 0 if they were assigned to the control group and Zi = 1 if they were assigned to the treatment group. However, assignment does not necessarily imply exposure. For each user i ∈ [n], the exposure outcome Wi is the user’s exposure and the potential exposure outcome Wi(Zi) is the user’s exposure given their treatment status. Since no users in the control group are ever exposed, Wi(0) = 0 for all users. Finally, for each user i ∈ [n], the outcome Yi is the user’s actual outcome and Yi(Zi, Wi(Zi)) is the user’s potential outcome given their assignment and exposure status.

In the randomized experiment, we have access to both the treatment and control group. The simplest effect to estimate in the randomized experiment is the intent-to-treat (ITT) effect, which calculates the effect of assignment to the treatment or control group while ignoring whether users were actually exposed to the treatment, i.e., ignoring whether users opted-out or not:

$$\mathrm{ITT} := \mathbb{E}\left[\,Y_i(1, W_i(1)) - Y_i(0, W_i(0))\,\right]. \tag{1}$$

An alternative measure that only calculates the effect of treatment on those who would not opt-out, and thus would actually receive the treatment, is the average treatment effect on the treated (ATT):

$$\mathrm{ATT} := \mathbb{E}\left[\,Y_i(1, W_i(1)) - Y_i(0, W_i(0)) \mid W_i(1) = 1\,\right]. \tag{2}$$

Both the ITT and ATT could be relevant measures depending on the interests of the researcher. However, since we only use users in the treatment group for the observational study, we cannot compute observational estimates of the ITT. Thus, in order to have a comparable measure across the observational study and the randomized experiment, we focus on estimation of the ATT, as is done in other within-study comparisons [Gordon et al., 2019, Litwok, 2020].

2.1 Experimental estimate of ATT

First, we describe how we estimate the ATT from the randomized experiment. Since users are unaware of their assignment Zi, it is reasonable to assume exclusion restriction holds, i.e. that assignment only affects outcomes through exposure Wi:

$$Y_i(0, w) = Y_i(1, w) \qquad \forall\, i \in [n],\ w \in \{0, 1\}. \tag{3}$$


Thus, we henceforth write the potential outcome Yi(Zi) as a function of only the assignment.

Imbens and Angrist [1994] show that the ATT can be estimated in an instrumental variables framework. Given exclusion restriction and one-sided noncompliance, the ATT is equal to the ITT divided by the proportion of compliers:

$$\mathrm{ATT} = \frac{\mathrm{ITT}}{\mathbb{P}(W_i(1) = 1)}. \tag{4}$$

The sample-based version that we use as an estimate of the ATT is $\widehat{\mathrm{ATT}} = \widehat{\mathrm{ITT}} / (n_{te}/n_t)$, where

$$\widehat{\mathrm{ITT}} = \frac{\sum_{i=1}^{n} \mathbf{1}\{Z_i = 1\}\, Y_i}{n_t} - \frac{\sum_{i=1}^{n} \mathbf{1}\{Z_i = 0\}\, Y_i}{n_c}, \tag{5}$$

and nt, nc, and nte are the number of users in the treatment, control, and treatment-exposed group, respectively.
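To make the estimator concrete, here is a minimal sketch (not the authors' code) of how Eqs. (4)–(5) could be computed from per-user arrays; the array names Z, W, and Y are hypothetical and follow the notation above, and the toy data at the end is purely illustrative.

```python
import numpy as np

def att_from_experiment(Z, W, Y):
    """Estimate the ATT from an A/B test with one-sided compliance.

    Z: assignment indicator (1 = treatment group, 0 = control group)
    W: exposure indicator (1 = received treatment, i.e. did not opt out)
    Y: observed outcome
    Implements ATT_hat = ITT_hat / (n_te / n_t) from Eqs. (4)-(5).
    """
    Z, W, Y = map(np.asarray, (Z, W, Y))
    n_t = np.sum(Z == 1)                 # users assigned to treatment
    n_c = np.sum(Z == 0)                 # users assigned to control
    n_te = np.sum((Z == 1) & (W == 1))   # treatment-exposed users (compliers)

    itt_hat = Y[Z == 1].sum() / n_t - Y[Z == 0].sum() / n_c
    return itt_hat / (n_te / n_t)

# Toy usage with simulated data (illustration only, not Twitter data).
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=100_000)
W = Z * rng.binomial(1, 0.8, size=100_000)   # one-sided compliance: control never exposed
Y = 1.0 + 2.0 * W + rng.normal(size=100_000)
print(att_from_experiment(Z, W, Y))           # close to the true effect of 2 on compliers
```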

2.2 Observational estimates of ATT

We calculate four different estimates from the observational study based on (1) the difference in group means (DIGM), (2) exact matching, (3) regression adjustment, and (4) inverse probability of treatment weighting (IPTW).

Difference in group means. The difference in group means (DIGM) estimator is simply the difference in the sample mean of the outcomes for the treatment-exposed group and the treatment-unexposed group:

$$\widehat{\mathrm{ATT}}_{\mathrm{DIGM}} = \frac{1}{n_{te}} \sum_{i=1}^{n} \mathbf{1}\{Z_i = 1, W_i = 1\}\, Y_i \;-\; \frac{1}{n_{tu}} \sum_{i=1}^{n} \mathbf{1}\{Z_i = 1, W_i = 0\}\, Y_i, \tag{6}$$

where $n_{te} = \sum_{i=1}^{n} \mathbf{1}\{Z_i = 1, W_i = 1\}$ and $n_{tu} = \sum_{i=1}^{n} \mathbf{1}\{Z_i = 1, W_i = 0\}$ are the number of treatment-exposed and treatment-unexposed users, respectively. Since the DIGM estimator would be an unbiased estimate of the ITT if the observational study were actually a randomized experiment, it provides a useful, naive baseline for us to compare to.
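For comparison with the experimental estimate above, here is a sketch of the DIGM baseline under the same hypothetical array conventions; it simply restricts to the treatment group and differences the two sample means, as in Eq. (6).

```python
import numpy as np

def att_digm(Z, W, Y):
    """Naive difference-in-group-means estimate of the ATT (Eq. 6):
    mean outcome of the treatment-exposed users minus mean outcome of the
    treatment-unexposed users, using only users with Z == 1."""
    Z, W, Y = map(np.asarray, (Z, W, Y))
    exposed = (Z == 1) & (W == 1)
    unexposed = (Z == 1) & (W == 0)
    return Y[exposed].mean() - Y[unexposed].mean()
```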

Exact matching. The exact matching estimator controls for a set of covariates exactly. Each user i ∈ [n] is represented by a vector of covariates Xi. Let Mi be the set of treatment-unexposed users who have the same covariates as treatment-exposed user i. We estimate the counterfactual outcome for a treatment-exposed user i by averaging the outcomes of treatment-unexposed users with the same covariates:

$$\widehat{Y}_i(0) = \frac{1}{|M_i|} \sum_{j \in M_i} Y_j. \tag{7}$$

Then, the exact matching estimator of the ATT is

$$\widehat{\mathrm{ATT}}_{\mathrm{Exact}} = \frac{1}{n_{te}} \sum_{i=1}^{n} \left[ \mathbf{1}\{Z_i = 1, W_i = 1\} \left( Y_i - \widehat{Y}_i(0) \right) \right]. \tag{8}$$

The downside to exact matching is that we can only use a few covariates, typically binary, so that there is enough overlap between the groups. The next two methods allow us greater flexibility in the choice of covariates.
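Before moving on, here is one way the exact matching estimator could be implemented with pandas, restricted to users in the treatment group. The DataFrame layout and column names (`W` for exposure, `Y` for the outcome) are hypothetical, and dropping exposed users with an empty match set is one plausible convention, not necessarily the paper's.

```python
import pandas as pd

def att_exact_matching(df, covariates, exposure="W", outcome="Y"):
    """Exact-matching estimate of the ATT (Eqs. 7-8) among treatment-group users.

    df: one row per user in the treatment group
    covariates: small list of (typically binary) columns to match on exactly
    """
    exposed = df[df[exposure] == 1]
    unexposed = df[df[exposure] == 0]

    # Y_hat_i(0): mean outcome of unexposed users in the same covariate cell (Eq. 7).
    cell_means = (unexposed.groupby(covariates)[outcome]
                           .mean()
                           .rename("y0_hat")
                           .reset_index())

    # Inner merge keeps only exposed users who have at least one exact match.
    matched = exposed.merge(cell_means, on=covariates, how="inner")

    # Eq. (8): average gap between observed and matched counterfactual outcomes.
    return (matched[outcome] - matched["y0_hat"]).mean()
```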


Regression adjustment. In regression adjustment, we learn a linear function f to predict the outcomes of the treatment-unexposed group from their covariates. We then use this function to predict the counterfactual outcomes for the treatment-exposed group. Notably, unlike exact matching, we can use a greater number of covariates as well as continuous covariates because we use the regression function to extrapolate the outcomes. The regression adjustment estimator is:

$$\widehat{\mathrm{ATT}}_{\mathrm{Regression}} = \frac{1}{n_{te}} \sum_{i=1}^{n} \left[ \mathbf{1}\{Z_i = 1, W_i = 1\} \left( Y_i - f(X_i) \right) \right]. \tag{9}$$
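A minimal sketch of the regression-adjustment estimator, assuming a covariate matrix X, exposure indicator W, and outcome Y already restricted to treatment-group users; scikit-learn's LinearRegression stands in for the linear outcome model f (the paper does not specify a particular implementation).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def att_regression_adjustment(X, W, Y):
    """Regression-adjustment estimate of the ATT (Eq. 9) among treatment-group users."""
    X, W, Y = np.asarray(X), np.asarray(W), np.asarray(Y)

    # Fit the linear outcome model f on the treatment-unexposed users only.
    f = LinearRegression().fit(X[W == 0], Y[W == 0])

    # Predicted counterfactual (unexposed) outcomes for the treatment-exposed users.
    y0_hat = f.predict(X[W == 1])
    return np.mean(Y[W == 1] - y0_hat)
```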

IPTW. The last method we consider is inverse probability of treatment weighting (IPTW) [Lunceford and Davidian, 2004, Austin and Stuart, 2015], which reweighs samples with the propensity score [Rosenbaum and Rubin, 1983], which in our setting is e(Xi) = P(Wi = 1 | Xi, Zi = 1), the probability of being exposed given assignment to the treatment group and the covariates Xi. As is commonly done, we estimate the propensity score via a logistic regression model.

Let e1, . . . , en be the estimated propensity scores of the users. Then, we assign to user i the weight αi = 1 if the user is in the treatment-exposed group and αi = ei/(1 − ei) if the user is in the treatment-unexposed group. Since large weights can cause high variance in the estimates produced, it is common to trim the weights [Lee et al., 2011]. We trim our weights to the 0.01 and 0.99 quantiles of the weight distribution. Then, the IPTW estimate of the ATT is the difference in the weighted sample means:

$$\widehat{\mathrm{ATT}}_{\mathrm{IPTW}} = \frac{1}{n_{te}} \sum_{i=1}^{n} \mathbf{1}\{Z_i = 1, W_i = 1\}\, Y_i \;-\; \frac{1}{\sum_{i} \alpha_i \mathbf{1}\{Z_i = 1, W_i = 0\}} \sum_{i=1}^{n} \alpha_i \mathbf{1}\{Z_i = 1, W_i = 0\}\, Y_i. \tag{10}$$
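A sketch of the IPTW estimator under the same hypothetical inputs as the previous snippets; a logistic regression supplies the propensity scores, and the trimming follows the 0.01/0.99 quantile choice described above. The paper does not spell out whether trimming is applied to all weights or only to the unexposed group's weights, so trimming the unexposed weights here is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_iptw(X, W, Y, trim=(0.01, 0.99)):
    """IPTW estimate of the ATT (Eq. 10) among treatment-group users.

    Exposed users get weight 1; unexposed users get the odds weight
    e(x) / (1 - e(x)), where e(x) is a logistic-regression propensity score.
    """
    X, W, Y = np.asarray(X), np.asarray(W), np.asarray(Y)

    # Propensity of exposure given covariates: e(X_i) = P(W_i = 1 | X_i, Z_i = 1).
    e = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]

    # Odds weights for the unexposed group, trimmed to reduce variance (assumption:
    # trimming is applied to the unexposed group's weights).
    alpha = e[W == 0] / (1 - e[W == 0])
    lo, hi = np.quantile(alpha, trim)
    alpha = np.clip(alpha, lo, hi)

    exposed_mean = Y[W == 1].mean()
    unexposed_mean = np.average(Y[W == 0], weights=alpha)  # weighted sample mean
    return exposed_mean - unexposed_mean
```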

3 Experiments and Results

We now present the four within-study comparisons and their results.

3.1 Experiments

First, we describe the four settings.

Personalized push notifications. On Twitter, users can receive algorithmically personalized push notifications that are sent to their phone. Most notifications on Twitter are not algorithmically chosen; these notifications are simply triggered by other users directly interacting with you, e.g., Favoriting or Retweeting your Tweet. In contrast, algorithmically-chosen notifications typically notify users of activity that doesn’t directly involve them, such as “X user tweeted for the first time in a while” or “X, Y, Z just liked A’s tweet”. In the experiment we analyze, a subset of users is drawn from all users that have standard push notifications enabled; these users are randomly assigned to either a control or treatment group. The control group never receives algorithmic push notifications. Users in the treatment group do not receive algorithmic push notifications if they have opted out of them; otherwise, they do. We measure the effect that algorithmic push notifications have on the total minutes users spend on the platform for a certain interval of time.¹

Personalized email notifications. Similar to the algorithmic push notifications, users may also receive algorithmic email notifications. Such email notifications typically provide “digests” of activity that the user has missed while not active. In the experiment we analyze, a control group of users never receives algorithmic email notifications. Users in the treatment group do not receive algorithmic email notifications if they have opted out of them; otherwise, they do. We measure the effect that algorithmic email notifications have on the total minutes users spend on the platform for a certain interval of time.

¹We do not reveal what the time interval is due to the sensitive nature of the measure.


Variable          Description
is_protected      Whether the account's Tweets are public or only visible to followers
has_prof_pic      Whether the user has uploaded a profile picture or not
has_dms_open      Whether the user allows direct messages from anyone or only users they follow
days_old          The number of days since the user account was created
has_geo_enabled   Whether the user has geolocation enabled
allows_ads        Whether the user allows ads personalization or not
is_verified       Whether the account is verified
is_restricted     Whether the account has been restricted for violating Twitter terms of service
is_sensitive      Whether the user has marked their Tweets as containing sensitive media
num_tweets        The number of Tweets the user has published
num_followers     The number of followers the user has
num_followings    The number of users that the user follows

Table 1: Covariates used for the observational methods. IPTW and regression adjustment use all covariates while exact matching only uses the first three.

Personalized timeline. Twitter, like most social media platforms, is by default algorithmically curated, meaning that machine learning algorithms select and rank the content that a user sees. However, users can opt-out of algorithmic curation and switch to a chronological timeline that only shows Tweets from people they follow, ordered by recency. We analyze an experiment in which a control group of users never receives the algorithmic timeline, regardless of what their opt-out setting is. Users in the treatment group receive the algorithmic timeline by default, but can also switch to the chronological timeline. Since users may switch back and forth between the two timelines, we subsample our data further. In particular, for a certain time duration, we only consider the subset of users in the experiment who visited Twitter at least once and only used one type of timeline for the entire duration. For these users, we measure the effect that algorithmic curation has on the number of minutes the users spend on their timeline.

Quality filter. On Twitter, notifications are, by default, filtered by a model that predicts whether the notification is low-quality, e.g., a Tweet that appears to be duplicated or automatically generated. However, users can also opt-out of the filter in their settings. In the experiment we analyze, a control group never has notifications filtered by the quality filter. In the treatment group, users who opt-out do not have their notifications filtered; otherwise, they do. We measure the effect that quality filtering has on the number of times a user blocks a stranger (a user that they do not follow) who mentions them in a Tweet. Note that on Twitter, users typically receive a notification when another user mentions them. A user would typically only block a stranger who mentions them after receiving a notification of the stranger’s mention. So if the quality filter works, and filters out low-quality notifications, then we would expect the number of blocks after stranger mentions to decrease.

3.2 Covariates

We control for plausible covariates that were available to the researcher. The twelve covariates we use are listed in Table 1.

A plausible confounder across all cases is whether the user is a “power user” [Zhong, 2013]. Prior research has shown that a small subset of users who are active and technologically savvy, the power users, take considerable advantage of user controls, while most other users simply leave the default setting [Manber et al., 2000, Mackay, 1991, Sundar and Marathe, 2010]. Therefore, being a power user is a potential confounder because, for example, a power user is more likely to opt-out of the personalized timeline but also, by virtue of being a power user, is more likely to be active on the platform, thus confounding our estimate of the effect of the personalized timeline on the user’s time spent on the timeline. We try to control for potential indicators


of being a power user, such as the age of the account (days_old), whether the user has a profile picture (has_prof_pic), whether the account is verified (is_verified), the number of Tweets (num_tweets), the number of followers (num_followers), and the number of users the user follows (num_followings).

Another type of confounder is having a preference for keeping content or data private. For example, we expect a user who has a preference for greater privacy to be more likely to have a protected account, i.e., an account where their Tweets are only visible to their followers, and also more likely to opt-out of the personalization features we consider. But because their account is protected, they have fewer interactions with others, thus reducing the amount of time they spend on the platform, and therefore confounding our estimate of, e.g., the effect of personalized push notifications on time spent on the platform. We control for the user’s use of the following privacy controls: whether the user’s Tweets are public or only visible to followers (is_protected), whether the user has geolocation enabled (has_geo_enabled), whether the user allows direct messages from anyone or only users they follow (has_dms_open), and whether the user allows personalized ads (allows_ads). We also note that the use of privacy controls may also be indicative of being a power user [Sundar and Marathe, 2010, Kang and Shin, 2016], and thus, controlling for the use of privacy controls may also be prudent as a means of controlling for being a power user.

Finally, we control for whether the user’s account is restricted for a violation of Twitter’s terms of service (is_restricted) or whether the user has marked their Tweets as potentially containing sensitive media (is_sensitive). These two covariates seem especially pertinent in the quality filter case. For example, users who are restricted may be more likely to turn the quality filter off because they may themselves be more likely to post low-quality content. However, if this is the case, they may also be less likely to block strangers for the notifications that are normally filtered out. Thus, is_restricted may confound our estimate of the effect that the quality filter has on the number of blocks of strangers who mention the user.

For all cases, we use all twelve covariates for regression adjustment and IPTW. Since exact matching can only handle a small number of variables while still ensuring overlap between the treatment-exposed and treatment-unexposed groups, we only use three binary covariates for it. We avoid choosing covariates like is_verified or is_restricted, which are rare and would make overlap less likely. Instead, we opt to use three more common covariates: is_protected, has_prof_pic, and has_dms_open.

3.3 Results

Our results are shown in Figure 1. In all cases, all four observational methods do poorly at recovering the true causal effect, and in all cases except the quality filter, the observational methods not only get the magnitude wrong, they also get the sign of the estimate wrong.

We also note that, when using IPTW, the treatment-exposed group and the treatment-unexposed group should have the same distribution of covariates after weighting [Austin and Stuart, 2015]. In Appendix A, as a sanity check, we plot the standardized difference between the mean of the covariates before and after weighting, finding evidence that weighting does indeed appear to balance the covariates.

The effect of personalization on user activity. For personalized push notifications, email notifications, and timeline, the DIGM estimator returns a negative estimate, meaning that those who opt-out of personalization spend more time on the platform, and implying that personalization decreases time on the platform. This result seems counterintuitive, as personalization is typically thought to increase the time that users spend on the platform. And indeed, the randomized experiments show that personalization does increase user activity.

The large discrepancy between the DIGM estimate and the randomized estimate implies a large amount of selection bias. One potential interpretation of this selection bias is that, although personalization increases activity for the average user, those who opt-out are the small number of users who prefer an unpersonalized experience and will spend more time on the platform if they opt-out. In a sense, this would mean that giving users control is “working”: users can choose the option they prefer.


Another interpretation could be that the only users who opt-out of these settings are “power” users who also use the platform a lot for reasons other than personalization. For regression adjustment and IPTW, we do control for available covariates that could be indicators of being a power user (as described in Section 3.2); however, the estimates still do not recover the correct sign. Nonetheless, we cannot rule out the power-user interpretation.

The effect of the quality filter on blocks. For the quality filter, although all the observational estimates do retrieve the correct sign of the estimate, they over-estimate, by a factor of 2.5 to 5 times, the amount that the quality filter decreases the number of blocks. This suggests that users who opt-out of the quality filter block users more frequently than a random sample of users would. It is unclear why this is. It could be that users opt-out because they notice that notifications are being hidden from them, which could mean that those who opt-out of the quality filter tend to be more likely to receive low-quality notifications in the first place.

4 Conclusion: “Catch-22”s of causal inference and user agency

As platforms give users greater agency, observational data is naturally generated for each setting that the user can toggle, which raises the intriguing possibility of applying observational methods for causal inference to study the outcomes of these settings. The use of observational studies, as opposed to randomized experimentation, is appealing as a way to mitigate some of the issues with online experimentation, such as lack of user consent. However, our empirical results suggest that observational studies from user self-selection are not currently a suitable replacement for randomized experimentation on online platforms. Furthermore, we believe that our findings are only a facet of a deeper tension between causal inference and user agency.

A major motivation for giving users agency is that humans are complex and their preferences and behavior cannot be fully predicted by an algorithm. This brings us to our first Catch-22 between causal inference and user agency: we give users control because we postulate that there is no model that can adequately predict a user, but causal inference requires exactly that. All observational methods need as an assumption some form of unconfoundedness, and it will be difficult to justify this assumption without the existence of an adequate model for the user.

Furthermore, even if users could be predicted accurately, many have argued that autonomously predicting user behavior with little oversight or feedback from the users themselves violates principles of human dignity, and thus have called for greater user agency as a means of avoiding autonomous prediction [Kaminski, 2018, Jones, 2017]. Yet, observational methods for causal inference generally require us to resort to prediction. For example, for propensity score based methods, we need a good estimate of the propensity score, which in our case means we need to predict which users will choose to opt-out of the treatment. Similarly, in regression adjustment, we must predict the outcome behaviors of the user from the covariates. This brings us to our second Catch-22: out of dignitary concerns, we give users control to avoid user prediction, but observational causal inference requires such prediction, and thus jeopardizes the dignitary considerations that originally motivated user control.

What the Catch-22s suggest is that the success of observational studies based upon user self-selection may be at odds with the original motivations for giving users greater control in the first place. Even as researchers make progress in improving causal inference methods, these fundamental tensions between prediction and agency will likely need to be grappled with.


References

Hunt Allcott, Luca Braghieri, Sarah Eichmeyer, and Matthew Gentzkow. The welfare effects of social media. American Economic Review, 110(3):629–76, 2020.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, 2016.

Peter C Austin and Elizabeth A Stuart. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28):3661–3679, 2015.

Solon Barocas and Helen Nissenbaum. Big data's end run around anonymity and consent. Privacy, Big Data, and the Public Good: Frameworks for Engagement, 1:44–75, 2014.

Raquel Benbunan-Fich. The ethics of online research with unsuspecting users: From A/B testing to C/D experimentation. Research Ethics, 13(3-4):200–218, 2017.

Kjell Benson and Arthur J Hartz. A comparison of observational studies and randomized, controlled trials. New England Journal of Medicine, 342(25):1878–1886, 2000.

Samantha Bradshaw and Phillip N. Howard. Why does junk news spread so quickly across social media? Algorithms, advertising and exposure in public life, Jan 2018. URL https://kf-site-production.s3.amazonaws.com/media_elements/files/000/000/142/original/Topos_KF_White-Paper_Howard_V1_ado.pdf.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91. PMLR, 2018.

Jenna Burrell, Zoe Kahn, Anne Jonas, and Daniel Griffin. When users control the algorithms: Values expressed in practices on Twitter. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–20, 2019.

David P Byar. Why data bases should not replace randomized clinical trials. Biometrics, pages 337–342, 1980.

David P Byar, Richard M Simon, William T Friedewald, James J Schlesselman, David L DeMets, Jonas H Ellenberg, Mitchell H Gail, and James H Ware. Randomized clinical trials: Perspectives on some recent ideas. New England Journal of Medicine, 295(2):74–80, 1976.

John Concato, Nirav Shah, and Ralph I Horwitz. Randomized, controlled trials, observational studies, and the hierarchy of research designs. New England Journal of Medicine, 342(25):1887–1892, 2000.

Sarah Dean, Sarah Rich, and Benjamin Recht. Recommendations and user agency: The reachability of collaboratively-filtered information. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 436–445, 2020.

Edward W Felten. Privacy and A/B experiments. Colo. Tech. LJ, 13:193, 2015.

Thomas Fraker and Rebecca Maynard. The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources, pages 194–227, 1987.

Steven Glazerman, Dan M Levy, and David Myers. Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589(1):63–93, 2003.

Brett R Gordon, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook. Marketing Science, 38(2):193–225, 2019.

James Grimmelmann. The law and ethics of experiments on social media users. Colo. Tech. LJ, 13:219, 2015.

Jaron Harambam, Dimitrios Bountouridis, Mykola Makhortykh, and Joris Van Hoboken. Designing for the better by taking users into account: A qualitative evaluation of user control mechanisms in (news) recommender systems. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 69–77, 2019.

David Hunter and Nicholas Evans. Facebook emotional contagion experiment controversy. Research Ethics, 2016.

Guido W Imbens and Joshua D Angrist. Identification and estimation of local average treatment effects. Econometrica: Journal of the Econometric Society, pages 467–475, 1994.

Meg Leta Jones. The right to a human in the loop: Political constructions of computer automation and personhood. Social Studies of Science, 47(2):216–239, 2017.

Margot E Kaminski. Binary governance: Lessons from the GDPR's approach to algorithmic accountability. S. Cal. L. Rev., 92:1529, 2018.

Hyunjin Kang and Wonsun Shin. Do smartphone power users protect mobile privacy better than nonpower users? Exploring power usage as a factor in mobile privacy protection and disclosure. Cyberpsychology, Behavior, and Social Networking, 19(3):179–185, 2016.

Adam DI Kramer, Jamie E Guillory, and Jeffrey T Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790, 2014.

Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Weight trimming and propensity score weighting. PLoS One, 6(3):e18174, 2011.

Ro'ee Levy. Social media, news consumption, and polarization: Evidence from a field experiment. American Economic Review, 111(3):831–70, 2021.

Daniel Litwok. Using nonexperimental methods to address noncompliance. 2020.

Kai Lukoff, Ulrik Lyngs, Himanshu Zade, J Vera Liao, James Choi, Kaiyue Fan, Sean A Munson, and Alexis Hiniker. How the design of YouTube influences user sense of agency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2021.

Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.

Wendy E Mackay. Triggers and barriers to customizing software. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 153–160, 1991.

Udi Manber, Ash Patel, and John Robison. Experience with personalization of Yahoo! Communications of the ACM, 43(8):35–39, 2000.

Jacob Metcalf and Kate Crawford. Where are human subjects in big data research? The emerging ethics divide. Big Data & Society, 3(1):2053951716650211, 2016.

Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.

Stuart J Pocock and Diana R Elbourne. Randomized trials or observational tribulations?, 2000.

Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Jeffrey A Smith and Petra E Todd. Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125(1-2):305–353, 2005.

Kate Starbird, Ahmer Arif, and Tom Wilson. Disinformation as collaborative work: Surfacing the participatory nature of strategic information operations. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–26, 2019.

S Shyam Sundar and Sampada S Marathe. Personalization versus customization: The importance of agency, privacy, and power usage. Human Communication Research, 36(3):298–322, 2010.

Zeynep Tufekci. Algorithmic harms beyond Facebook and Google: Emergent challenges of computational agency. Colo. Tech. LJ, 13:203, 2015.

Vivian C Wong and Peter M Steiner. Designs of empirical evaluations of nonexperimental methods in field settings. Evaluation Review, 42(2):176–213, 2018.

Vivian C Wong, Peter M Steiner, and Kylie L Anglin. What can be learned from empirical evaluations of nonexperimental methods? Evaluation Review, 42(2):147–175, 2018.

Kyra Yee, Uthaipon Tantipongpipat, and Shubhanshu Mishra. Image cropping on Twitter: Fairness metrics, their limitations, and the importance of representation, design, and agency. arXiv preprint arXiv:2105.08667, 2021.

Bu Zhong. From smartphones to iPad: Power users' disposition toward mobile media devices. Computers in Human Behavior, 29(4):1742–1748, 2013.


A Balance in IPTW

In IPTW, the treatment-unexposed and treatment-exposed groups are supposed to have the same distribution of covariates after weighting. The following figures show the absolute standardized difference between the means of the covariates before and after weighting in all four cases. Achieving perfect balance is generally unattainable in practice, and there is no single accepted heuristic for assessing balance, but generally a difference of less than 0.1 has been accepted as reasonably balanced [Austin and Stuart, 2015]. In all cases, weighting appears to generally balance the covariates.
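As a reference point for how such a balance check could be computed, here is a sketch of the absolute standardized difference in means for a single covariate, before or after weighting. The pooled unweighted-variance convention below is a common choice and is an assumption, since the paper does not state the exact formula it uses.

```python
import numpy as np

def abs_standardized_difference(x_exposed, x_unexposed, weights_unexposed=None):
    """Absolute standardized difference in means for one covariate.

    Pass weights_unexposed=None for the 'before weighting' value, or the IPTW
    odds weights for the 'after weighting' value. Values below 0.1 are commonly
    treated as reasonably balanced [Austin and Stuart, 2015].
    """
    x1 = np.asarray(x_exposed, dtype=float)
    x0 = np.asarray(x_unexposed, dtype=float)
    if weights_unexposed is None:
        weights_unexposed = np.ones_like(x0)

    mean_diff = x1.mean() - np.average(x0, weights=weights_unexposed)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)  # unweighted pooled SD
    return abs(mean_diff) / pooled_sd
```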

[Figure: Absolute standardized difference in means for each covariate, before and after weighting, in the push notifications experiment.]

[Figure: Absolute standardized difference in means for each covariate, before and after weighting, in the email notifications experiment.]

[Figure: Absolute standardized difference in means for each covariate, before and after weighting, in the timeline experiment.]

[Figure: Absolute standardized difference in means for each covariate, before and after weighting, in the quality filter experiment.]