gvt 201 23spring - Harvard University1. Load and explore data data

GVT 201: Data Analysis and Politics

Professor Elena Llaudet

Lecture 23 | April 11, 2017

Experiment

- The votes are in and the winner is: the experiment about the effects of music on productivity!
- Protocol
- On a related note:

https://youtu.be/2HmQRb2uzlc?t=90

Next Thursday: Student Research Conference

8-9:15am: Breakfast and posters

9:25-10:40am
- Panel: Equity, Identity, and American Democracy
- Panel: Lobbying and the Executive Branch

10:50am-12:05pm
- Panel: Variations of Political Participation in Elections
- Panel: International Relations

1:40-2:55pm
- Panel: Justice and Inequality
- Panel: Education Policy
- Panel: Internship Panel

3:05-4:20pm: Student Panel
4:30-5:30pm: Award Ceremony
5:30-7:00pm: Alumni Speed Networking!

Highly recommend it (especially the alumni speed networking)! Attendance is mandatory (at least during our time slot). Location: Sargent Hall.

Plan for Today

- Review: Hypothesis Testing of Diffs-in-Means
- Hypothesis Testing of Coefficients
- Example: Do Women Promote Different Policies?
  1. Load and explore data
  2. Identify/calculate outcome and independent variables
  3. Estimate the effect of X on Y
  4. Identify whether the effect is statistically significant at the 95% confidence level using hypothesis testing

Uncertainty: Hypothesis Testing of Diffs-in-Means

- We have seen how to determine whether the effect of X on Y, as measured by the diffs-in-means estimator, is statistically significant at the 95% confidence level using hypothesis testing
  - $H_0: E(X_t) - E(X_c) = 0$ (no effect)
  - $H_1: E(X_t) - E(X_c) \neq 0$ (effect can be + or -)
  - we calculate the observed test statistic
  - we calculate the p-value of the observed test statistic
  - we will reject the null hypothesis of no causal effect at the population level and conclude the effect is statistically significant at the 95% confidence level
    - if the p-value $\leq$ 0.05, or
    - if |observed test statistic| $\geq$ 1.96
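The review above can be sketched in R on simulated data. Everything here is hypothetical (the vectors `y.treat` and `y.control`, the sample sizes, and the simulated effect), and the standard error uses the usual large-sample formula for a difference between two independent means, which is an assumption about how the test statistic is built rather than something stated on the slide.

```r
set.seed(201)  # arbitrary seed so the simulation is reproducible

# Hypothetical outcomes for a treatment group and a control group
y.treat <- rnorm(500, mean = 1, sd = 2)
y.control <- rnorm(500, mean = 0, sd = 2)

# Diffs-in-means estimator
diff.means <- mean(y.treat) - mean(y.control)

# Assumed large-sample standard error of the difference in means
se.diff <- sqrt(var(y.treat) / length(y.treat) +
                var(y.control) / length(y.control))

# Observed test statistic and two-sided p-value under the null of no effect
z.obs <- diff.means / se.diff
p.value <- 2 * pnorm(-abs(z.obs))

# Decision rule: reject the null if |observed test statistic| >= 1.96
reject.null <- abs(z.obs) >= 1.96
reject.null
```

Because the simulated effect is large relative to its standard error, both decision rules (p-value and 1.96 cutoff) lead to rejecting the null in this sketch.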

Uncertainty: Hypothesis Testing of Coefficients

- How about if we use regression estimates/coefficients to measure the effect of X on Y? How would we determine whether the effect is statistically significant at the 95% confidence level using hypothesis testing in this case?
- First, how can we calculate the diffs-in-means estimator using regression analysis?
  - By running a regression where X is binary and identifies treatment assignment

$$\widehat{Y} = \hat{\alpha} + \hat{\beta} X \quad \text{where } X = \begin{cases} 1 & \text{if Treatment} \\ 0 & \text{if Control} \end{cases}$$

  - In this case, what coefficient can be interpreted as the diffs-in-means estimator? $\hat{\beta}$

- How about if we had more than one treatment? How can we calculate the multiple diffs-in-means estimators using regression analysis?
  - By running a regression where we have a binary $X_j$ identifying each of the j treatments and the baseline category is the control group
  - For example, if the experiment uses 2 different treatments

$$\widehat{Y} = \hat{\alpha} + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$$

$$X_1 = \begin{cases} 1 & \text{if Treatment 1} \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad X_2 = \begin{cases} 1 & \text{if Treatment 2} \\ 0 & \text{otherwise} \end{cases}$$

  - In this case, what coefficient can be interpreted as the diffs-in-means estimator of treatment 1? $\hat{\beta}_1$
  - What coefficient can be interpreted as the diffs-in-means estimator of treatment 2? $\hat{\beta}_2$
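The equivalence between regression coefficients on treatment dummies and diffs-in-means can be checked numerically. The sketch below uses simulated data; the group labels, sample sizes, and effect sizes are all hypothetical and chosen only for illustration.

```r
set.seed(201)  # arbitrary seed so the simulation is reproducible

# Hypothetical experiment: 100 units each in control, treatment 1, treatment 2
group <- rep(c("control", "treat1", "treat2"), each = 100)
x1 <- as.numeric(group == "treat1")  # 1 if Treatment 1, 0 otherwise
x2 <- as.numeric(group == "treat2")  # 1 if Treatment 2, 0 otherwise
y <- 5 + 2 * x1 - 1 * x2 + rnorm(300)

# Regression with one dummy per treatment; control is the baseline category
fit <- lm(y ~ x1 + x2)

# Diffs-in-means computed directly, each treatment against the control group
control.mean <- mean(y[group == "control"])
diff1 <- mean(y[group == "treat1"]) - control.mean
diff2 <- mean(y[group == "treat2"]) - control.mean

# Each slope equals the corresponding diffs-in-means (up to rounding error)
same1 <- isTRUE(all.equal(unname(coef(fit)["x1"]), diff1))
same2 <- isTRUE(all.equal(unname(coef(fit)["x2"]), diff2))
c(same1, same2)
```

The single-treatment case is the same check with only one dummy in the regression; in both cases the intercept equals the control-group mean.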

Hypothesis Testing of $\hat{\beta}$: Formal Procedure

1. In this class, we will always test whether the population parameter ($\beta$) is zero (i.e., there is no effect of X on Y, which is what we ultimately want to refute)
   - $H_0: \beta = 0$
2. We will always use as our alternative hypothesis that the effect is different from zero and can be positive or negative
   - $H_1: \beta \neq 0$
3. The test statistic in this case is defined as:

$$\text{test statistic} = \frac{\hat{\beta} - \beta}{SE(\hat{\beta})}$$

   - where $\hat{\beta}$ is the observed coefficient
   - $\beta$ is the population parameter
   - $SE(\hat{\beta})$ stands for the standard error of $\hat{\beta}$
4. If n is large enough and the true $\beta$ is indeed zero:

$$\text{test statistic} = \frac{\hat{\beta}}{SE(\hat{\beta})} \overset{\text{if null is true}}{\sim} N(0, 1)$$

5. Calculate the observed test statistic and draw conclusions about $\beta$
   - We can compute the p-value (the probability, under the null, of the observed test statistic or a more extreme value)
     - p-value = P(test statistic $\geq$ |observed value|) + P(test statistic $\leq$ -|observed value|)
     - if p-value $\leq$ 0.05 (our cutoff) $\rightarrow$ reject the null hypothesis and conclude that the effect is statistically significant at the 95% confidence level
   - Or, we can compare |observed test statistic| to 1.96
     - if |observed test statistic| $\geq$ 1.96 $\rightarrow$ reject the null hypothesis and conclude that the effect is statistically significant at the 95% confidence level
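The five steps above can be collected into a small helper. This is only a sketch: the function name `test.coefficient` and the input values are hypothetical, and it relies on the large-sample normal approximation described in step 4.

```r
# Hypothetical helper: two-sided test of a coefficient against a null value
test.coefficient <- function(estimate, se, null.value = 0) {
  # Step 3: observed test statistic
  z.obs <- (estimate - null.value) / se
  # Step 5: p-value as the sum of the two tail probabilities under N(0, 1)
  p.value <- pnorm(abs(z.obs), lower.tail = FALSE) + pnorm(-abs(z.obs))
  # Step 5 (alternative): compare |observed test statistic| to 1.96
  list(statistic = z.obs, p.value = p.value, reject = abs(z.obs) >= 1.96)
}

# Made-up numbers, purely for illustration
res <- test.coefficient(estimate = 1.5, se = 0.5)
res$statistic  # 1.5 / 0.5 = 3
res$reject     # TRUE, since 3 >= 1.96
```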

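Standard Error of $\hat{\beta}$

$$\frac{\hat{\beta}}{SE(\hat{\beta})} \quad \text{where } SE(\hat{\beta}) \text{ is the standard error of } \hat{\beta}$$

- The standard error of $\hat{\beta}$ represents the estimated standard deviation of its sampling distribution (in the hypothetical repeated sampling)

$$SE(\hat{\beta}) = \sqrt{\frac{\frac{1}{n}\sum_{i=1}^{n} \hat{\epsilon}_i^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

- No need to memorize the formula, but you do need to understand its implications
  - bigger residuals $\rightarrow$ noisier estimator
  - larger n $\rightarrow$ more precise estimator
  - more variation in x $\rightarrow$ more precise estimator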
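The standard error formula above can be computed by hand and compared with what `lm()` reports. The data below are simulated and hypothetical. One detail worth knowing: `lm()` divides the sum of squared residuals by n - 2 rather than n, so the two numbers differ by a factor of $\sqrt{(n-2)/n}$, which is negligible when n is large.

```r
set.seed(201)  # arbitrary seed so the simulation is reproducible

# Hypothetical data
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
eps <- resid(fit)  # estimated residuals

# Standard error following the slide formula (divides by n)
se.formula <- sqrt(mean(eps^2) / sum((x - mean(x))^2))

# Standard error reported by lm() (divides by n - 2)
se.lm <- summary(fit)$coefficients["x", "Std. Error"]

# The two agree once the degrees-of-freedom adjustment is accounted for
all.equal(se.formula, se.lm * sqrt((n - 2) / n))
```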

Example: Do Women Promote Different Policies?

- Randomized policy experiment in India, in which some council seats were randomly assigned to women
- Particular policy outcomes we look into:
  - number of new and repaired drinking water facilities
  - number of new and repaired irrigation facilities

1. Load and explore data

data <- read.csv("women.csv")

head(data)

## village female water irrigation

## 1 1.2 1 10 0

## 2 1.1 1 0 5

## 3 2.2 1 2 2

## 4 2.1 1 31 4

## 5 3.2 0 0 0

## 6 3.1 0 0 0

- village: village identifier
- female: council assigned to women
- water: number of drinking water facilities increased or repaired
- irrigation: number of irrigation facilities increased or repaired

2. Identify/calculate outcome and independent variables

- What model should we estimate if we want to run a regression whose coefficient is equivalent to the diffs-in-means estimator and can be interpreted as the estimated effect of having a female leader on the number of new and repaired drinking water facilities?
- In other words, what should be our Y?
  - water
- What should be our X?
  - female
- So the model to estimate is:

$$\widehat{\text{Drinking Water Facilities}}_i = \hat{\alpha} + \hat{\beta} \, \text{Female Leader}_i$$

3. Estimate the effect of X on Y. Does female leadership affect the number of new and repaired water facilities?

- So, we want to estimate the following regression:

$$\widehat{\text{Drinking Water Facilities}}_i = \hat{\alpha} + \hat{\beta} \, \text{Female Leader}_i$$

- What should be the R code?

regression <- lm(data$water ~ data$female)

regression

##

## Call:

## lm(formula = data$water ~ data$female)

##

## Coefficients:

## (Intercept) data$female

## 14.738 9.252

- So, the estimated model is:

$$\widehat{\text{Drinking Water Facilities}}_i = 14.74 + 9.25 \, \text{Female Leader}_i$$

- Interpretation of $\hat{\alpha}$? 14.74
  - mathematically, $\hat{\alpha}$ is always $\widehat{Y}$ when X = 0
  - in the special case where X is a binary/dummy variable that identifies treatment assignment, $\hat{\alpha}$ can be interpreted as the average outcome for the control group
  - in this case, then, $\hat{\alpha}$, which is estimated to be 14.74, is the expected average number of new and repaired drinking water facilities in villages with male leaders
- Interpretation of $\hat{\beta}$? 9.25
  - mathematically, $\hat{\beta}$ is always $\Delta \widehat{Y}$ when $\Delta X = 1$
  - in the special case where X is a binary/dummy variable that identifies treatment assignment, $\hat{\beta}$ can be interpreted as the diffs-in-means estimator
  - in this case, then, $\hat{\beta}$ is the estimated average causal effect of female leadership on water facilities; i.e., having a female leader leads to 9.25 more new or repaired drinking water facilities on average, as compared to having a male leader
- Can this observed effect be due to noise alone (i.e., due to sampling variability)? In other words, is this effect statistically distinguishable from zero?
  - we need to do hypothesis testing
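Plugging the two possible values of X into the estimated model makes the interpretations of the intercept and slope concrete. The figures are the rounded coefficients reported in the regression output (14.74 and 9.25):

$$\widehat{Y}\,\big|_{X=0} = 14.74 + 9.25 \times 0 = 14.74 \quad \text{(villages with male leaders)}$$

$$\widehat{Y}\,\big|_{X=1} = 14.74 + 9.25 \times 1 = 23.99 \quad \text{(villages with female leaders)}$$

$$\widehat{Y}\,\big|_{X=1} - \widehat{Y}\,\big|_{X=0} = 23.99 - 14.74 = 9.25$$

The difference between the two predicted averages is exactly the slope, which is why the slope can be read as the diffs-in-means estimator.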

4. Identify whether the effect is statistically significant at the 95% confidence level using hypothesis testing

- What is our null hypothesis? $H_0: \beta = 0$
- What is our alternative hypothesis? $H_1: \beta \neq 0$
- What is the test statistic? $\hat{\beta} / SE(\hat{\beta})$
- If n is large enough and the null is true, the test statistic will be distributed like what? The standard normal distribution: test statistic $\sim N(0, 1)$
- Then, if |observed test statistic| $\geq$ 1.96, we will reject the null hypothesis and conclude that the effect is statistically significant at the 95% confidence level

- OK, so let's calculate the observed test statistic...
- We want to calculate this: $\hat{\beta} / SE(\hat{\beta})$
- We already have $\hat{\beta}$. We need $SE(\hat{\beta})$
- One way to find the standard errors of the estimated coefficients is to use the function tidy() from the broom package (you may have to install the package first)

# if package not installed: install.packages("broom")

library(broom)

tidy(regression)

## term estimate std.error statistic p.value

## 1 (Intercept) 14.738318 2.286300 6.446363 4.216474e-10

## 2 data$female 9.252423 3.947746 2.343723 1.970398e-02

- We can store the coefficient and standard error in objects, and then calculate the observed test statistic like so

slope <- tidy(regression)$estimate[2] # estimate column, second row (data$female)

slope.se <- tidy(regression)$std.error[2] # std.error column, second row (data$female)

t.stat <- slope / slope.se

t.stat

## [1] 2.343723

- Or just look at the reported test statistic in the table
- Based on the observed test statistic of $\hat{\beta}$, is the effect of female leadership on drinking water facilities statistically significant at the 95% confidence level?
  - |observed test statistic| $\geq$ 1.96, so we reject the null and conclude that the effect IS statistically significant
- We could also go ahead and calculate the p-value associated with that observed test statistic

p.val <- 2 * pnorm(-abs(t.stat))

p.val

## [1] 0.01909235

- Based on the p-value associated with $\hat{\beta}$, is the effect of female leadership on drinking water facilities statistically significant at the 95% confidence level?
  - because p-value $\leq$ 0.05, we reject the null $\rightarrow$ the slope coefficient is statistically significant (i.e., distinguishable from zero) at the 95% confidence level
  - the effect IS statistically significant
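The p-value computed with pnorm() (0.0191) is slightly smaller than the one in the tidy() table (0.0197). The reason is that lm(), and therefore tidy(), bases its p-values on the t distribution with the residual degrees of freedom rather than on the standard normal. The sketch below compares the two approaches; the degrees of freedom `df.hyp` is a hypothetical value, since the slides do not report the sample size.

```r
# Observed test statistic, as reported in the tidy() table
t.stat <- 2.343723

# Hypothetical residual degrees of freedom (the sample size is not given)
df.hyp <- 300

# Two-sided p-value using the standard normal approximation (as in class)
p.normal <- 2 * pnorm(-abs(t.stat))

# Two-sided p-value using the t distribution (the approach lm() takes)
p.t <- 2 * pt(-abs(t.stat), df = df.hyp)

# The t-based p-value is a bit larger because the t distribution has
# heavier tails; the gap shrinks as the degrees of freedom grow
c(normal = p.normal, t = p.t)
```

With a reasonably large sample, both p-values fall below 0.05, so the conclusion does not depend on which distribution is used.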

- Does statistical significance mean that the effect is important/meaningful?
  - not necessarily; it just means that it's not likely to be 0
  - to gauge magnitude, we can compare the size of the effect (9.25) with the standard deviation of the outcome variable

sd(data$water)

## [1] 33.67894

slope/sd(data$water)

## [1] 0.2747243

- the effect is equivalent to about 1/4 of the standard deviation of Y; that is a sizable effect (recall that if Y is normally distributed, about 2/3 of its data will be within 1 standard deviation of the mean)

Today’s Class and Next

Today
- Hypothesis Testing with Regression Coefficients
  - Concept
  - Example

Next Thursday: Student Research Conference
Next Tuesday: A Suffolk Monday
Following Thursday: Another example of hypothesis testing with regression coefficients in class (very similar to PSet # 11)
- bring your computers!!
