
𝑋: π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’π‘ .𝑦: outcomes

The goal is to model outcomes as a function of features.

The width of $X$ is $p$.

The length of $X$ and $y$ is $n$.

𝑋: π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’π‘ .𝑦: outcomes

π‘Œ = 𝑓 𝑋 + πœ–

$y_i = f(x_i) + \epsilon_i$

where $\epsilon_i$ is called the error term.

π‘Œ = 𝑓 𝑋 + πœ–

Example (coin flip): $\text{flip}_i = p + \epsilon_i$

If the outcome is 1, then $\epsilon_i = 1 - p$; if the outcome is 0, then $\epsilon_i = -p$.

$p$ is a constant, called the "success rate".
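A minimal R sketch of this decomposition (the numbers are illustrative, not from the notes):

p <- 0.5                      # the "success rate"
outcome <- rbinom(10, 1, p)   # ten simulated flips: 1 = success, 0 = failure
eps <- outcome - p            # the error term: 1 - p when the outcome is 1, -p when it is 0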

$y_i = \alpha + \beta_1 x_i + \epsilon_i$

ΰ·œπ›Ό

ෝ𝑦𝑖 = ΰ·œπ›Ό + π‘₯𝑖

ΰ·π’š βˆ’ π’š 𝟐

ΰ·œπ›Όgives the intercept

αˆ˜π›½ gives the slope

General formula:

β„Žπ‘– = 𝑓 π‘Žπ‘”π‘’π‘– + πœ–π‘–

πœ–π‘– allows people of he same age to have different heights. Even if we have the correct growth model using age alone, we expect some variation in height conditional on age.

To use a linear model, polynomial features allow for a non-linear relationship between the feature (age) and the outcome (height):

β„Žπ‘– = 𝛼 + 𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ π›½π‘π‘Žπ‘”π‘’π‘–

𝑝+ πœ–π‘–

Linear regression allows you to use non-linear functions of the features, provided they enter the model as additive terms, each with its own weight $\beta$:

$y_i = \alpha + \beta_1 g_1(x_{i,1}) + \cdots + \beta_p g_p(x_{i,p}) + \epsilon_i$

The simplest example is where all features enter without any transformation:

𝑦𝑖 = 𝛼 + 𝛽1π‘₯𝑖,1 +β‹―+ 𝛽𝑝π‘₯𝑖,𝑝 + πœ–π‘–

𝑦𝑖 = 𝛼 + 𝛽1π‘₯𝑖 + πœ–π‘–

ෝ𝑦𝑖 = ΰ·œπ›Ό + π‘₯𝑖 π‘₯𝑖2 + π‘₯𝑖

3

ΰ·œπ‘¦ βˆ’ 𝑦 2

$y_i = \alpha + \beta_1 \text{educ}_i + \beta_2 \text{seniority}_i + \epsilon_i$

1. Define the relationship we are trying to model. What is the outcome? What raw features do we have? E.g.

β„Žπ‘– = 𝑓(π‘Žπ‘”π‘’π‘–) + πœ–π‘–2. Define a linear model to approximate this relationship, e.g.

β„Žπ‘– = 𝛼 + 𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ π›½π‘π‘Žπ‘”π‘’π‘–

𝑝+ πœ–π‘–

3. Create the necessary features, e.g. age2 = age^2 (or use poly()).

4. Estimate the model:

mymodel <- lm(y ~ age + age2 + … + ageP)

summary(mymodel)

5. Evaluate the model with an evaluation metric.

6. Repeat steps 2-5 to improve model fit (a worked sketch is shown below).
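A sketch of steps 3-5 in R, assuming a hypothetical data frame mydata with columns height and age:

mydata$age2 <- mydata$age^2                                       # step 3: create a polynomial feature by hand
mymodel <- lm(height ~ age + age2, data = mydata)                 # step 4: estimate the model
mymodel <- lm(height ~ poly(age, 3, raw = TRUE), data = mydata)   # ...or let poly() build the terms
summary(mymodel)
mean(resid(mymodel)^2)                                            # step 5: one evaluation metric (in-sample MSE)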

1. After we have fit a model, we can predict the outcome for any input.

β„Žπ‘– = 𝛼 +𝛽1π‘Žπ‘”π‘’π‘– + 𝛽2π‘Žπ‘”π‘’π‘–2 +β‹―+ 𝛽𝑝 π‘Žπ‘”π‘’π‘–

𝑝

2. For any given observation, $h_i - \hat{h}_i$ is the prediction error.

$(h_i - \hat{h}_i)^2$: squared loss

$|h_i - \hat{h}_i|$: absolute loss

Fit refers to how well our model, an approximation of reality, matches what we actually observe in the data

Mean squared error (MSE) is the average of the squared loss over the data points (rows) we evaluate, as in the snippet below.
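A minimal sketch with illustrative numbers (h is the observed outcome, h_hat the prediction):

h     <- c(150, 162, 175)
h_hat <- c(148, 165, 174)
squared_loss <- (h - h_hat)^2    # per-observation squared loss
abs_loss     <- abs(h - h_hat)   # per-observation absolute loss
mse <- mean(squared_loss)        # mean squared error over the rows evaluated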

We typically want to "train" the model on the data "we have" and test the model on "new data".

This is a realistic test of how well the model predicts outcomes.

In practice, we will subset our data:

X%: Training

Y%: Test

80% is commonly used for training.

• Randomly select observations (rows) to be in the test set. The remainder are the training set.

• Why random: we want complete coverage. E.g. we don't want to train on February data and test on July data.

• Subset to the training set for all model estimation, e.g. any data.frame you pass to lm().

• Use the predict command to make predictions ($\hat{y}$) for the test set.

• Compute the evaluation metric using the test set (you have the observed value and the predicted, or modeled, value).

• R-squared is reported for the training data. That is, it is "in sample".

• We learned that we typically want to use out-of-sample evaluation metrics, so what do we do?

• It turns out that $R^2 = \text{corr}(\hat{y}_i, y_i)^2$.

• This means that $\text{corr}(\hat{y}_i, y_i)^2$, computed for the test set, can be interpreted just like R-squared (the % of variation explained) and is a "fair test"; a worked sketch follows below.
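A sketch of the whole split-train-predict-evaluate loop, assuming a hypothetical data frame mydata with outcome y and feature x:

set.seed(1)
n <- nrow(mydata)
test_rows <- sample(n, size = round(0.2 * n))   # randomly pick 20% of rows for the test set
train <- mydata[-test_rows, ]
test  <- mydata[test_rows, ]
fit  <- lm(y ~ x, data = train)                 # estimate on the training set only
yhat <- predict(fit, newdata = test)            # predictions for the test set
mean((test$y - yhat)^2)                         # out-of-sample MSE
cor(yhat, test$y)^2                             # test-set analogue of R-squared: corr(yhat, y)^2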

Binary features: equal to 1 or 0. For example, is_weekend (=1 on weekends, =0 otherwise).

Continuous features: can take any real value, e.g. temperature.

Categorical features: can take one of N values, e.g. Saturday, Sunday, Monday… If you declare the feature with as.factor(), R will treat it as N binary variables, one per category (e.g. =1 if Saturday, =0 otherwise).
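A minimal sketch, assuming a hypothetical trips data frame with a categorical day column, a continuous temp column, and an outcome rides:

trips$day <- as.factor(trips$day)                  # declare day as categorical
day_model <- lm(rides ~ day + temp, data = trips)  # lm() creates one binary variable per day
summary(day_model)                                 # each day gets a coefficient relative to the baseline level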

$\beta_1$ allows for a different y-intercept for students:

$y_i = \alpha + \beta_1 1\{\text{student}_i\} + \beta_2 \text{income}_i + \epsilon_i$

$\beta_1$ allows for a different y-intercept for students, and $\beta_3$ allows for a different slope for students:

$y_i = \alpha + \beta_1 1\{\text{student}_i\} + \beta_2 \text{income}_i + \beta_3 1\{\text{student}_i\} \cdot \text{income}_i + \epsilon_i$

Interaction term: multiplying two features by each other. When done as such:

{continuous feature}*{binary feature}

it allows for a different slope for the group represented by the binary feature. Note, be sure to include the continuous feature without the interaction as well.
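A minimal sketch, assuming a hypothetical data frame mydf with columns y, student, and income. In R, a*b expands to a + b + a:b, so the continuous feature also enters on its own, as the note above requires:

fit <- lm(y ~ student * income, data = mydf)
fit <- lm(y ~ student + income + student:income, data = mydf)   # equivalent explicit form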

Example: suppose temperature has a different impact on the number of bikeshare trips taken on weekdays vs. weekends. We can create a binary variable is_weekend.

If we add is_weekend to the model, we allow for a different baseline ridership on weekends. If we add is_weekend*temp, we allow temperature to have a different effect for weekends.

We can "interact" is_weekend with a polynomial in temp. This effectively gives us a separate temperature model for weekends vs. weekdays.
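A sketch of these three models, assuming a hypothetical bikes data frame with columns trips, temp, and is_weekend:

base_model  <- lm(trips ~ is_weekend + temp, data = bikes)   # different weekend baseline
slope_model <- lm(trips ~ is_weekend * temp, data = bikes)   # temperature slope differs on weekends
poly_model  <- lm(trips ~ is_weekend * poly(temp, 2, raw = TRUE), data = bikes)   # a separate temperature curve for weekends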

If we have two raw features, A and B, how many models can we make?

Without interactions (4): none, just A, just B, A + B

With interactions (8): the four above plus A*B, A + A*B, B + A*B, A + B + A*B

If we have $p$ features then we have $2^{p+1}$ possible models!

$p = 29 \rightarrow 2^{30} = 1{,}073{,}741{,}824$

We cannot possibly run all these models… so we'll learn methods to guide us.

Human intelligence is often a great guide as well.

summary(my_model) gives the coefficient estimates (the betas), t-statistics, standard errors, and p-values.

t-statistics evaluate the hypothesis that a coefficient is equal to zero. Thus a t-statistic greater than 1.96 in absolute value allows us to reject this hypothesis at the 95% level. p-values simply convert t-statistics into the probability that we would get an estimate this far from zero due to sampling chance alone.

Note that t-statistics evaluate features "one by one". Overall model fit is a better guide to explanatory power (e.g. we might be better off leaving insignificant features in sometimes).

t-stats can be a good guide to throw out irrelevant features
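A minimal sketch of pulling the coefficient table from a fitted lm object (my_model here is whatever model you estimated):

coefs <- summary(my_model)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[abs(coefs[, "t value"]) > 1.96, ]   # features individually significant at the 95% level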

π‘Œ = 𝑓 𝑋 + πœ–

Parametric: $f$ is defined by a model we write down, with parameters that we have to estimate (e.g. $\alpha$, $\beta_1$, etc.).

Non-parametric: we directly fit the data

We just saw that regressions can be used to predict data.

Let's combine predict with grouping types of observations that have similar outcomes.

Called a "SUPERVISED LEARNING" algorithm.

Why supervised? There is a LHS variable that we want to predict…

The alternative is "UNSUPERVISED LEARNING", where we don't have a LHS variable.

More on that…

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,\gamma + \epsilon_{it}$

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} < 55\}\,\gamma' + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} \ge 55\}\,\gamma'' + \epsilon_{it}$

common task

πœ‹ 𝛾′, 𝛾′′ > πœ‹ 𝛾

ID   Age   Educ   Politics   Treated
1    59    BA     D          0
2    33    PhD    n/a        1
3    41    HS     I          1
…    …     …      …          …

[Figure: Loss in CTR from Link Demotion (US, All Non-Navigational); loss from demotion vs. gain from increased relevance for demotion pairs (1,3), (1,5), (1,10) relative to the control (1st position).]

$y$; $X$ where $X$ is $N \times P$

$y$; $X$ where $X$ is $N \times K$

$y$; $X$ where $X$ is $N \times K$

$y_{it} = \alpha + X_{it}'\beta + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} < 55\}\,\gamma' + 1\{\text{treated}_{it}\}\,1\{\text{age}_{it} \ge 55\}\,\gamma'' + \epsilon_{it}$

How does this matter for treatment?

IDEA: Force treatment to the top of the tree and run subtrees from that… the "two tree" technique. Two-tree has problems w.r.t. design and inference.

library(rpart)

library(rpart.plot)

library(partykit)

library(permute)

library(maptree)

mydata <- read.csv("C:/Users/jlariv/Documents/HeteroATE/ln_RequiredData.csv")

mydata<- mydata[,2:ncol(mydata)]

colnames(mydata)

# This just gets the data into a format where everything has a name I can work with.

dataToPass <- mydata[,c("ln_resid_use","heat_pump","gas_heating","vinyl","wood","brick","beds","top_floor","basement","tot_sq_ft","yearbuilt","unregistered","undeclared","rep","dem","treated","kWh","expense","CO2")]

#This creates a dataframe with only the variables I want to use to explain variation in the variable I care to examine (e.g., ln(residual use)).

#This next line actually fits the tree.

bfit<-rpart(as.formula(ln_resid_use ~ .),data=dataToPass,method="anova",cp=0.0003)

#NOTE: cp=.0003 is the "complexity parameter" and the lower it goes, the more complicated the tree gets (e.g., more leaves).

draw.tree(bfit)

#This command draws the tree

dataToPass$leaf = bfit$where

#This final command identifies which observation is associated with what leaf and creates a new variable for it in the original dataframe

#disco

#Now you know where hashtags came from

So… regression trees are cool at categorizing observations when there is a continuous LHS variable (e.g., minutes on an application).

What if there is no LHS variable?

Example: rather than modeling minutes on an application as a function of covariates, you care about how people use the application, e.g., user types or "road types".

library(geosphere)

x <- collapsed[ ,c(2:45, 65:72)] ## pull the variables to cluster on from the collapsed dataframe

colnames(x) <- colnames(collapsed)[c(2:45, 65:72)]

##Clustering

for(i in 1:1){ ## loop over random seeds (just one here; use 1:10 for ten tries per cluster size so we can test multiple groupings)

set.seed(i)

##set the number of clusters we want to use

for(j in c(8)){ ## cluster sizes to try (just 8 here; e.g., c(8, 10, 15))

k <- j

##K means clustering

cl <- kmeans(x, k) ##k is the number of clusters, x is our set of clustering variables

collapsed <- cbind(collapsed, cl$cluster) ##append data with cluster ids

colnames(collapsed)[ncol(collapsed)] <- paste("cluster_", k, "_seed_", i, sep="")

}

}

segment_clusters = data.frame(cbind(as.character(collapsed$tmc), collapsed$cluster_8_seed_1)) ## using k = 8, seed 1

segmentscsv$tmc <- as.character(segmentscsv$tmc) ## making sure segments aren't stored as factors

segment_clusters$X1 <- as.character(segment_clusters$X1) ##X1 is the segment variable

segmentscsv <- merge(segmentscsv, segment_clusters, by.x = "tmc", by.y="X1")

colnames(segmentscsv) <- c(colnames(segmentscsv)[1:(ncol(segmentscsv)-1)], "cluster_id")
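A quick follow-up check, using the objects created above:

table(segmentscsv$cluster_id)   ## how many road segments landed in each of the k clusters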