
𝑋: π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’π‘ .𝑦: outcomes

The goal is to model outcomes as a function of features.

The width of $X$ is $p$.

The length of $X$ and $y$ is $n$.

𝑋: π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’π‘ .𝑦: outcomes

π‘Œ = 𝑓 𝑋 + πœ–

$y_i = f(x_i) + \epsilon_i$

where $\epsilon_i$ is called the error term.

π‘Œ = 𝑓 𝑋 + πœ–

Example (coin flip): $\text{flip}_i = p + \epsilon_i$

If the outcome is 1, then $\epsilon_i = 1 - p$; if the outcome is 0, then $\epsilon_i = -p$.

$p$ is a constant, called the "success rate".
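A minimal R sketch of this decomposition (the numbers are illustrative, not from the notes):

p <- 0.5                      # the "success rate"
outcome <- rbinom(10, 1, p)   # ten simulated flips: 1 = success, 0 = failure
eps <- outcome - p            # the error term: 1 - p when the outcome is 1, -p when it is 0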

$y_i = \alpha + \beta_1 x_i + \epsilon_i$

ΰ·œπ›Ό

ෝ𝑦𝑖 = ΰ·œπ›Ό + π‘₯𝑖

ΰ·π’š βˆ’ π’š 𝟐

ΰ·œπ›Όgives the intercept

αˆ˜π›½ gives the slope

General formula:

β„Žπ‘– = 𝑓 π‘Žπ‘”π‘’π‘– + πœ–π‘–

πœ–π‘– allows people of he same age to have different heights. Even if we have the correct growth model using age alone, we expect some variation in height conditional on age.

To use a linear model, polynomial features allow for a non-linear relationship between the feature (age) and the outcome (height):

β„Žπ‘– = 𝛼 + 𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ π›½π‘π‘Žπ‘”π‘’π‘–

𝑝+ πœ–π‘–

Linear regression allows you to use non-linear functions of the features, provided they enter the model as additive terms, each with its own weight $\beta$:

$y_i = \alpha + \beta_1 g_1(x_{i,1}) + \cdots + \beta_p g_p(x_{i,p}) + \epsilon_i$

The simplest example is where all features enter without any transformation:

𝑦𝑖 = 𝛼 + 𝛽1π‘₯𝑖,1 +β‹―+ 𝛽𝑝π‘₯𝑖,𝑝 + πœ–π‘–

𝑦𝑖 = 𝛼 + 𝛽1π‘₯𝑖 + πœ–π‘–

ෝ𝑦𝑖 = ΰ·œπ›Ό + π‘₯𝑖 π‘₯𝑖2 + π‘₯𝑖

3

ΰ·œπ‘¦ βˆ’ 𝑦 2

$y_i = \alpha + \beta_1 \text{educ}_i + \beta_2 \text{seniority}_i + \epsilon_i$

1. Define the relationship we are trying to model. What is the outcome? What raw features do we have? E.g.

β„Žπ‘– = 𝑓(π‘Žπ‘”π‘’π‘–) + πœ–π‘–2. Define a linear model to approximate this relationship, e.g.

β„Žπ‘– = 𝛼 + 𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ π›½π‘π‘Žπ‘”π‘’π‘–

𝑝+ πœ–π‘–

3. Create the necessary features, e.g. age2 = age^2 (or use poly()).

4. Estimate the model:

mymodel <- lm(y ~ age + age2 + … + ageP)

summary(mymodel)

5. Evaluate the model with an evaluation metric.

6. Repeat steps 2-5 to improve model fit (a worked sketch is shown below).
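A sketch of steps 3-5 in R, assuming a hypothetical data frame mydata with columns height and age:

mydata$age2 <- mydata$age^2                                       # step 3: create a polynomial feature by hand
mymodel <- lm(height ~ age + age2, data = mydata)                 # step 4: estimate the model
mymodel <- lm(height ~ poly(age, 3, raw = TRUE), data = mydata)   # ...or let poly() build the terms
summary(mymodel)
mean(resid(mymodel)^2)                                            # step 5: one evaluation metric (in-sample MSE)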

1. After we have fit a model, we can predict the outcome for any input.

β„Žπ‘– = 𝛼 +𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ 𝛽𝑝 π‘Žπ‘”π‘’π‘–

𝑝

2. For any given observation, $h_i - \hat{h}_i$ is the prediction error.

$(h_i - \hat{h}_i)^2$: squared loss

$|h_i - \hat{h}_i|$: absolute loss

Fit refers to how well our model, an approximation of reality, matches what we actually observe in the data

Mean squared error (MSE) is the average of the squared loss over the data points (rows) we evaluate, as in the snippet below.
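A minimal sketch with illustrative numbers (h is the observed outcome, h_hat the prediction):

h     <- c(150, 162, 175)
h_hat <- c(148, 165, 174)
squared_loss <- (h - h_hat)^2    # per-observation squared loss
abs_loss     <- abs(h - h_hat)   # per-observation absolute loss
mse <- mean(squared_loss)        # mean squared error over the rows evaluated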

We typically want to "train" the model on the data "we have" and test the model on "new data".

This is a realistic test of how well the model predicts outcomes.

In practice, we will subset our data:

X%: Training

Y%: Test

80% is commonly used for training.

• Randomly select observations (rows) to be in the test set. The remainder are the training set.

• Why random: we want complete coverage. E.g. we don't want to train on February data and test on July data.

• Subset to the training set for all model estimation, e.g. any data.frame you pass to lm().

• Use the predict command to make predictions ($\hat{y}$) for the test set.

• Compute the evaluation metric using the test set (you have the observed value and the predicted, or modeled, value).

• R-squared is reported for the training data. That is, it is "in sample".

• We learned that we typically want to use out-of-sample evaluation metrics, so what do we do?

• It turns out that $R^2 = \text{corr}(\hat{y}_i, y_i)^2$.

• This means that $\text{corr}(\hat{y}_i, y_i)^2$, computed for the test set, can be interpreted just like R-squared (the % of variation explained) and is a "fair test"; a worked sketch follows below.
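A sketch of the whole split-train-predict-evaluate loop, assuming a hypothetical data frame mydata with outcome y and feature x:

set.seed(1)
n <- nrow(mydata)
test_rows <- sample(n, size = round(0.2 * n))   # randomly pick 20% of rows for the test set
train <- mydata[-test_rows, ]
test  <- mydata[test_rows, ]
fit  <- lm(y ~ x, data = train)                 # estimate on the training set only
yhat <- predict(fit, newdata = test)            # predictions for the test set
mean((test$y - yhat)^2)                         # out-of-sample MSE
cor(yhat, test$y)^2                             # test-set analogue of R-squared: corr(yhat, y)^2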

Binary features: equal to 1 or 0. For example, is_weekend (=1 on weekends, =0 otherwise).

Continuous features: can take any real value, e.g. temperature.

Categorical features: can take one of N values, e.g. Saturday, Sunday, Monday… If you declare the feature with as.factor(), R will treat it as N binary variables, one per category (e.g. =1 if Saturday, =0 otherwise).
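A minimal sketch, assuming a hypothetical trips data frame with a categorical day column, a continuous temp column, and an outcome rides:

trips$day <- as.factor(trips$day)                  # declare day as categorical
day_model <- lm(rides ~ day + temp, data = trips)  # lm() creates one binary variable per day
summary(day_model)                                 # each day gets a coefficient relative to the baseline level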

$\beta_1$ allows for a different y-intercept for students:

$y_i = \alpha + \beta_1 1\{\text{student}_i\} + \beta_2 \text{income}_i + \epsilon_i$

$\beta_1$ allows for a different y-intercept for students, and $\beta_3$ allows for a different slope for students:

$y_i = \alpha + \beta_1 1\{\text{student}_i\} + \beta_2 \text{income}_i + \beta_3 1\{\text{student}_i\} \cdot \text{income}_i + \epsilon_i$

Interaction term: multiplying two features by each other. When done as such:

{continuous feature}*{binary feature}

it allows for a different slope for the group represented by the binary feature. Note, be sure to include the continuous feature without the interaction as well.
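A minimal sketch, assuming a hypothetical data frame mydf with columns y, student, and income. In R, a*b expands to a + b + a:b, so the continuous feature also enters on its own, as the note above requires:

fit <- lm(y ~ student * income, data = mydf)
fit <- lm(y ~ student + income + student:income, data = mydf)   # equivalent explicit form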

Example: suppose temperature has a different impact on the number of bikeshare trips taken on weekdays vs. weekends. We can create a binary variable is_weekend.

If we add is_weekend to the model, we allow for a different baseline ridership on weekends. If we add is_weekend*temp, we allow temperature to have a different effect for weekends.

We can "interact" is_weekend with a polynomial in temp. This effectively gives us a separate temperature model for weekends vs. weekdays.
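A sketch of these three models, assuming a hypothetical bikes data frame with columns trips, temp, and is_weekend:

base_model  <- lm(trips ~ is_weekend + temp, data = bikes)   # different weekend baseline
slope_model <- lm(trips ~ is_weekend * temp, data = bikes)   # temperature slope differs on weekends
poly_model  <- lm(trips ~ is_weekend * poly(temp, 2, raw = TRUE), data = bikes)   # a separate temperature curve for weekends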

If we have two raw features, A and B, how many models can we make?

Without interactions (4): none, just A, just B, A + B

With interactions (8): the four above plus A*B, A + A*B, B + A*B, A + B + A*B

If we have $p$ features then we have $2^{p+1}$ possible models!

$p = 29 \rightarrow 2^{30} = 1{,}073{,}741{,}824$

We cannot possibly run all these models… so we'll learn methods to guide us.

Human intelligence is often a great guide as well.

summary(my_model) gives the coefficient estimates (the betas), t-statistics, standard errors, and p-values.

t-statistics evaluate the hypothesis that a coefficient is equal to zero. Thus a t-statistic greater than 1.96 in absolute value allows us to reject this hypothesis at the 95% level. p-values simply convert t-statistics into the probability that we would get an estimate this far from zero due to sampling chance alone.

Note that t-statistics evaluate features "one by one". Overall model fit is a better guide to explanatory power (e.g. we might be better off leaving insignificant features in sometimes).

t-stats can be a good guide to throw out irrelevant features
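A minimal sketch of pulling the coefficient table from a fitted lm object (my_model here is whatever model you estimated):

coefs <- summary(my_model)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[abs(coefs[, "t value"]) > 1.96, ]   # features individually significant at the 95% level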

π‘Œ = 𝑓 𝑋 + πœ–

Parametric: $f$ is defined by a model we write down, with parameters that we have to estimate (e.g. $\alpha$, $\beta_1$, etc.).

Non-parametric: we directly fit the data

We just saw that regressions can be used to predict data.

Let's combine predict with grouping types of observations that have similar outcomes.

Called a "SUPERVISED LEARNING" algorithm.

Why supervised? There is a LHS variable that we want to predict…

The alternative is "UNSUPERVISED LEARNING", where we don't have a LHS variable.

More on that…

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,\gamma + \epsilon_{it}$

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} < 55\}\,\gamma' + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} \ge 55\}\,\gamma'' + \epsilon_{it}$

common task

πœ‹ 𝛾′, 𝛾′′ > πœ‹ 𝛾

ID   Age   Educ   Politics   Treated
1    59    BA     D          0
2    33    PhD    n/a        1
3    41    HS     I          1
…    …     …      …          …

[Figure: Loss in CTR from Link Demotion (US, All Non-Navigational); loss from demotion vs. gain from increased relevance for demotion pairs (1,3), (1,5), (1,10) relative to the control (1st position).]

$y$; $X$ where $X$ is $N \times P$

$y$; $X$ where $X$ is $N \times K$

$y$; $X$ where $X$ is $N \times K$

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} < 55\}\,\gamma' + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} \ge 55\}\,\gamma'' + \epsilon_{it}$

How does this matter for treatment?

IDEA: Force treatment to the top of the tree and run subtrees from that… the "two tree" technique. Two-tree has problems w.r.t. design and inference.

library(rpart)

library(rpart.plot)

library(partykit)

library(permute)

library(maptree)

mydata <- read.csv("C:/Users/jlariv/Documents/HeteroATE/ln_RequiredData.csv")

mydata<- mydata[,2:ncol(mydata)]

colnames(mydata)

# This just gets the data into a format where everything has a name I can work with.

dataToPass <- mydata[,c("ln_resid_use","heat_pump","gas_heating","vinyl","wood","brick","beds","top_floor","basement","tot_sq_ft","yearbuilt","unregistered","undeclared","rep","dem","treated","kWh","expense","CO2")]

#This creates a dataframe with only the variables I want to use to explain variation in the variable I care to examine (e.g., ln(residual use)).

#This next line actually fits the tree.

bfit<-rpart(as.formula(ln_resid_use ~ .),data=dataToPass,method="anova",cp=0.0003)

#NOTE: cp=.0003 is the "complexity parameter" and the lower it goes, the more complicated the tree gets (e.g., more leaves).

draw.tree(bfit)

#This command draws the tree

dataToPass$leaf = bfit$where

#This final command identifies which observation is associated with what leaf and creates a new variable for it in the original dataframe

#disco

#Now you know where hashtags came from

So… regression trees are cool at categorizing observations when there is a continuous LHS variable (e.g., minutes on an application).

What if there is no LHS variable?

Example: rather than modeling minutes on an application as a function of covariates, you care about how people use the application, e.g., user types or "road types".

library(geosphere)

x <- collapsed[ ,c(2:45, 65:72)] ## pull the variables to cluster on from the collapsed dataframe

colnames(x) <- colnames(collapsed)[c(2:45, 65:72)]

##Clustering

for(i in 1:1){ ## loop over random seeds (just one here; use 1:10 for ten tries per cluster size so we can test multiple groupings)

set.seed(i)

##set the number of clusters we want to use

for(j in c(8)){ ## cluster sizes to try (just 8 here; e.g., c(8, 10, 15))

k <- j

##K means clustering

cl <- kmeans(x, k) ##k is the number of clusters, x is our set of clustering variables

collapsed <- cbind(collapsed, cl$cluster) ##append data with cluster ids

colnames(collapsed)[ncol(collapsed)] <- paste("cluster_", k, "_seed_", i, sep="")

}

}

segment_clusters = data.frame(cbind(as.character(collapsed$tmc), collapsed$cluster_8_seed_1)) ## using k = 8, seed 1

segmentscsv$tmc <- as.character(segmentscsv$tmc) ## making sure segments aren't stored as factors

segment_clusters$X1 <- as.character(segment_clusters$X1) ##X1 is the segment variable

segmentscsv <- merge(segmentscsv, segment_clusters, by.x = "tmc", by.y="X1")

colnames(segmentscsv) <- c(colnames(segmentscsv)[1:(ncol(segmentscsv)-1)], "cluster_id")
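A quick follow-up check, using the objects created above:

table(segmentscsv$cluster_id)   ## how many road segments landed in each of the k clusters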