X: features. y: outcomes
Goal is to model outcomes as a function of features.
Width of X is p
Length of X and y is n
y = f(X) + ε
y_i = f(x_i) + ε_i
where ε_i is called the error term.
y = f(X) + ε
Example: a coin flip, coin_i = p + ε_i
If outcome = 1 then ε_i = 1 − p; if outcome = 0 then ε_i = −p
p is a constant, called the "success rate"
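As a check that the error term is mean zero: the outcome equals 1 with probability p and 0 with probability 1 − p, so E[ε_i] = p(1 − p) + (1 − p)(−p) = 0.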
y_i = α + β_1 x_i + ε_i
Estimating the parameters gives the fitted line ŷ_i = α̂ + β̂_1 x_i
α̂ gives the intercept
β̂_1 gives the slope
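A minimal sketch in R, assuming vectors x and y are already loaded (hypothetical names):
fit <- lm(y ~ x)   # ordinary least squares
coef(fit)          # first element is the intercept (alpha-hat), second is the slope (beta-hat)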
General formula:
h_i = f(age_i) + ε_i
ε_i allows people of the same age to have different heights. Even if we have the correct growth model using age alone, we expect some variation in height conditional on age.
To use a linear model, polynomial features allow for a non-linear relationship between the feature, age, and the outcome, height.
h_i = α + β_1 age_i + β_2 age_i^2 + ⋯ + β_P age_i^P + ε_i
Linear regression allows you to use non-linear functions of the features, provided they enter the function as an additive term, with weight β.
y_i = α + β_1 f_1(x_i,1) + ⋯ + β_p f_p(x_i,p) + ε_i
The simplest example is where all features simply enter without any transformations
y_i = α + β_1 x_i,1 + ⋯ + β_p x_i,p + ε_i
1. Define the relationship we are trying to model. What is the outcome? What raw features do we have? E.g.
h_i = f(age_i) + ε_i
2. Define a linear model to approximate this relationship, e.g.
h_i = α + β_1 age_i + β_2 age_i^2 + ⋯ + β_P age_i^P + ε_i
3. Create the necessary features, e.g. age2 = age^2 (or use poly)
4. Estimate the model:
mymodel <- lm(y ~ age + age2 + ... + ageP)
summary(mymodel)
5. Evaluate the model with an evaluation metric.
6. Repeat steps 2-5 to improve model fit.
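A minimal sketch of steps 2-5 for the height example, assuming a data frame kids with columns height and age (hypothetical names):
kids$age2 <- kids$age^2                          # step 3: create the squared-age feature
mymodel <- lm(height ~ age + age2, data = kids)  # step 4: estimate the model
summary(mymodel)                                 # coefficients, standard errors, t-stats, p-values
mean(residuals(mymodel)^2)                       # step 5: in-sample mean squared error
# equivalent, letting poly() build the polynomial features:
mymodel2 <- lm(height ~ poly(age, 2, raw = TRUE), data = kids)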
1. After we have fit a model, we can predict the outcome for any input.
ĥ_i = α̂ + β̂_1 age_i + β̂_2 age_i^2 + ⋯ + β̂_P age_i^P
2. For any given observation, h_i − ĥ_i = prediction error
(h_i − ĥ_i)^2 : squared loss
|h_i − ĥ_i| : absolute loss
Fit refers to how well our model, an approximation of reality, matches what we actually observe in the data
Mean squared error is the average of squared loss for each data point (row) we evaluate
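In R, assuming a vector of observed values h and predictions h_hat (hypothetical names):
mse <- mean((h - h_hat)^2)   # mean squared error: average squared loss over all rows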
We typically want to "train" the model on the data "we have" and test the model on "new data"
This is a realistic test of how well the model predicts outcomes.
In practice, we will subset our data:
X%: Training
Y%: Test
80% is commonly used for training.
• Randomly select observations (rows) to be in the test set. The remainder are the training set.
• Why random: want complete coverage. E.g. don't want to train on February and test on July.
• Subset on the training set for all model estimation. E.g. any data.frame you pass to lm.
• Use the predict command to make predictions (ŷ) for the test set.
• Compute the evaluation metric using the test set (you have the observed value and the predicted or modeled value).
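A minimal sketch of this procedure, assuming a data frame mydata with outcome y (hypothetical names):
set.seed(1)
test_rows <- sample(nrow(mydata), size = round(0.2 * nrow(mydata)))  # random 20% test set
train <- mydata[-test_rows, ]
test  <- mydata[test_rows, ]
fit  <- lm(y ~ ., data = train)        # estimate on the training set only
pred <- predict(fit, newdata = test)   # predictions (y-hat) for the test set
mean((test$y - pred)^2)                # out-of-sample mean squared error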
• R-squared is reported for the training data. That is, it is "in sample".
• We learned that we typically want to use out-of-sample evaluation metrics, so what do we do?
• It turns out that R^2 = corr(ŷ_i, y_i)^2
• This means corr(ŷ_i, y_i)^2, computed for the test set, can be interpreted just like R-squared (% of variation explained) and is a "fair test"
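Continuing the sketch above (same hypothetical names), the out-of-sample R-squared is just the squared correlation between predicted and observed values on the test set:
cor(pred, test$y)^2   # % of test-set variation explained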
Binary features: equal to 1 or 0. For example, = 1 if student, = 0 otherwise.
Continuous features: can take any real value
Categorical features: can take one of N values, e.g. Saturday, Sunday, Monday… If you declare the feature with as.factor, R will treat it as N binary variables, one for each category (e.g. = 1 if Saturday, = 0 otherwise)
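For instance, a minimal sketch assuming a data frame trips_df with a day-of-week column day (hypothetical names):
trips_df$day <- as.factor(trips_df$day)   # declare day as categorical
m <- lm(trips ~ day, data = trips_df)     # R expands day into dummy variables (one level is absorbed as the baseline)
summary(m)                                # one coefficient per non-baseline day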
β_1 allows for a different y-intercept for students:
y_i = α + β_1 1{i is student} + β_2 income_i + ε_i
β_3 allows for a different slope for students:
y_i = α + β_1 1{i is student} + β_2 income_i + β_3 1{i is student} × income_i + ε_i
Interaction term: multiplying two features by each other. When done as such:
{continuous feature}*{binary feature}
it allows for a different slope for the group represented by the binary feature. Note, be sure to include the continuous feature without the interaction as well.
Example: suppose temperature has a different impact on # of bikeshare trips taken on weekdays vs. weekends. We can create a binary variable is_weekend.
If we add is_weekend to the model, we allow for a different baseline ridership on weekends. If we add is_weekend*temp, we allow temperature to have a different effect for weekends.
We can "interact" is_weekend with a polynomial in temp. This effectively gives us a new model for temperature for weekends vs. weekdays.
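A minimal sketch of these three models, assuming a data frame bikes with columns trips, temp, and is_weekend (hypothetical names):
m1 <- lm(trips ~ temp + is_weekend, data = bikes)           # different baseline ridership on weekends
m2 <- lm(trips ~ temp * is_weekend, data = bikes)           # temp*is_weekend: different temperature slope on weekends
m3 <- lm(trips ~ poly(temp, 2) * is_weekend, data = bikes)  # separate temperature polynomial for weekends vs. weekdays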
If we have two raw features, A and B, how many models can we make?
Without interactions (4): none, just A, just B, A + B
With interactions (8): A*B, A + B + A*B, A + A*B, B + A*B
If we have p features then we have 2^(p+1) possible models!
p = 29 → 2^30 = 1,073,741,824
We cannot possibly run all these models… so we'll learn methods to guide us.
Human intelligence is often a great guide as well.
summary(my_model) gives the coefficient estimates (betas), t-statistics, standard errors and p-values.
t-statistics evaluate the hypothesis that the coefficient is equal to zero. Thus t greater than 1.96 in absolute value allows us to reject this hypothesis at the 95% level. P-values simply convert t-stats into the probability we would get something this far from zero due to sampling chance alone.
Note that t-statistics evaluate features "one by one". Overall model fit is a better guide for explanatory power (e.g. we might be better off leaving insignificant features in sometimes).
t-stats can be a good guide to throw out irrelevant features
y = f(X) + ε
Parametric: f defined by a model we write down with parameters that we have to estimate (e.g. α, β_1, etc.)
Non-parametric: we directly fit the data
We just saw that regressions can be used to predict data.
Let's combine prediction with grouping together observations that have similar outcomes.
This is called a "SUPERVISED LEARNING" algorithm.
Why supervised? There is a LHS variable that we want to predict…
The alternative is "UNSUPERVISED LEARNING", where we don't have a LHS variable. More on that later…
y_it = α + x_it'β + 1{treatment} γ + ε_it
y_it = α + x_it'β + 1{treatment} 1{age < 55}_it γ' + 1{treatment} 1{age ≥ 55}_it γ'' + ε_it
Estimating this kind of subgroup (heterogeneous) treatment effect is a common task.
se(γ'), se(γ'') > se(γ): splitting the sample into subgroups inflates the standard errors on the estimated treatment effects.
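A hedged sketch of this interacted regression in R, assuming a data frame panel with columns y, x, treated, and age (hypothetical names):
panel$old <- as.numeric(panel$age >= 55)              # 1{age >= 55}
fit_pooled <- lm(y ~ x + treated, data = panel)       # single treatment effect, gamma
fit_hetero <- lm(y ~ x + treated*old, data = panel)   # treated:old lets the effect differ for the 55+ group
summary(fit_hetero)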
[Figure: Loss in CTR from Link Demotion (US All Non-Navigational), showing the loss from demotion vs. the gain from increased relevance for position swaps (1,3), (1,5), and (1,10) relative to the control (1st position).]
y_it = α + x_it'β + 1{treatment} 1{age < 55}_it γ' + 1{treatment} 1{age ≥ 55}_it γ'' + ε_it
How does this matter for treatment?
IDEA: Force treatment to the top of the tree and run subtrees from that… the "Two Tree" technique -> Two Tree has problems w.r.t. design and inference.
library(rpart)    # needed for rpart()
library(maptree)  # needed for draw.tree()
mydata <- read.csv("C:/Users/jlariv/Documents/HeteroATE/ln_RequiredData.csv")
mydata<- mydata[,2:ncol(mydata)]
colnames(mydata)
# This just gets the data in a format where everything has a name I can work with.
dataToPass <- mydata[,c("ln_resid_use","heat_pump","gas_heating","vinyl","wood","brick","beds","top_floor","basement","tot_sq_ft","yearbuilt","unregistered","undeclared","rep","dem","treated","kWh","expense","CO2")]
#This creates a dataframe with only the variables I want to use to explain variation in the variable I care to examine (e.g., ln(residual use))
#This next line actually fits the tree.
bfit<-rpart(as.formula(ln_resid_use ~ .),data=dataToPass,method="anova",cp=0.0003)
#NOTE: cp=.0003 is the "complexity parameter" and the lower it goes, the more complicated the tree gets (e.g., more leaves).
draw.tree(bfit)
#This command draws the tree
dataToPass$leaf = bfit$where
#This final command identifies which observation is associated with which leaf and stores it as a new variable (leaf) in dataToPass
#disco
#Now you know where hashtags came from
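One possible next step, not part of the original script: summarize the outcome by leaf to see how the groups the tree found differ (uses the leaf column created above).
aggregate(ln_resid_use ~ leaf, data = dataToPass, FUN = mean)   # average ln(residual use) within each leaf
table(dataToPass$leaf)                                          # number of observations per leaf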
So⦠Regression trees are cool at categorizing variables with a continuous LHS variable (e.g., minutes on application)
What if there is no LHS variable?
Example: rather than minutes on an application as a function
of covariates, you care about how people use the application.
-> e.g., user types or βroad typesβ
library(geosphere)
x <- collapsed[ ,c(2:45, 65:72)] ##pull the variables to cluster on from the dataframe 'collapsed'
colnames(x) <- colnames(collapsed)[c(2:45, 65:72)]
##Clustering
for(i in 1:1){ ##loop over random seeds (1 here; use 1:10 for 10 tries per cluster size so we can test multiple groupings)
set.seed(i)
##set the number of clusters we want to use
for(j in c(8)){ ##number of clusters (8 here; could also try c(8, 10, 15))
k <- j
##K means clustering
cl <- kmeans(x, k) ##k is the number of clusters, x is our set of clustering variables
collapsed <- cbind(collapsed, cl$cluster) ##append data with cluster ids
colnames(collapsed)[ncol(collapsed)] <- paste("cluster_", k, "_seed_", i, sep="")
}
}
segment_clusters = data.frame(cbind(as.character(collapsed$tmc), collapsed$cluster_8_seed_1)) ##using k=8, seed 1
segmentscsv$tmc <- as.character(segmentscsv$tmc) ##making sure segments aren't stored as factors
segment_clusters$X1 <- as.character(segment_clusters$X1) ##X1 is the segment variable
segmentscsv <- merge(segmentscsv, segment_clusters, by.x = "tmc", by.y="X1")
colnames(segmentscsv) <- c(colnames(segmentscsv)[1:(ncol(segmentscsv)-1)], "cluster_id")
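A possible quick sanity check, not part of the original script: look at how many segments land in each cluster and at the cluster centers (using the objects created above).
table(collapsed$cluster_8_seed_1)   # number of road segments assigned to each of the 8 clusters
round(cl$centers, 2)                # cluster centers on the clustering variables (from the last kmeans fit)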