Rattle to R
R Bohn, April 21, 2015 (updated 2018)
R is …
• A programming language
• (Source) code
• Interactive, interpreted language (as opposed to compiled)
• All R code and (non-visual) results are just text
• Copy and paste
• Manipulate the text = manipulate the code
• Save in word processor
We should look at the data here (Exploratory)
Paste this into RStudio
# Rattle timestamp: 2015-04-21 11:25:13 x86_64-apple-darwin13.4.0
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 1436 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 1005 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 215 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 216 observations
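The 70/15/15 partition above can be tried without Rattle. This is a minimal sketch under assumptions: the built-in mtcars data frame stands in for crs$dataset, the seed 42 stands in for crv$seed, and the names dataset, train, validate and test are illustrative rather than Rattle's.

```r
# Stand-in for crs$dataset: mtcars ships with R (32 rows).
dataset <- mtcars
set.seed(42)                      # assumed seed; Rattle uses crv$seed
nobs <- nrow(dataset)

# 70% of the row indices for training (a fractional size is truncated).
train <- sample(nobs, 0.7 * nobs)
# 15% of all rows, drawn from the indices not used for training.
validate <- sample(setdiff(seq_len(nobs), train), 0.15 * nobs)
# Whatever is left over becomes the test set.
test <- setdiff(setdiff(seq_len(nobs), train), validate)

# Sizes of the three index sets; they are disjoint and cover every row.
print(lengths(list(train = train, validate = validate, test = test)))
```

The three objects hold row numbers, not data. That is why the model line later indexes the data frame with crs$train.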
# The following variable selections have been noted.
crs$input <- c("Age_08_04", "Mfg_Year", "KM", "Fuel_Type", "HP", "Doors", "Quarterly_Tax", "Weight", "Guarantee_Period", "TFC_Met_Color", "TFC_Automatic", "TFC_Mfr_Guarantee", "TFC_BOVAG_Guarantee", "TFC_ABS", "TFC_Airbag_1", "TFC_Airco", "TFC_Automatic_airco", "TFC_Power_Steering")
crs$numeric <- c("Age_08_04", "Mfg_Year", "KM", "HP", "Doors", "Quarterly_Tax", "Weight", "Guarantee_Period")
crs$categoric <- c("Fuel_Type", "TFC_Met_Color", "TFC_Automatic", "TFC_Mfr_Guarantee", "TFC_BOVAG_Guarantee", "TFC_ABS", "TFC_Airbag_1", "TFC_Airco", "TFC_Automatic_airco", "TFC_Power_Steering")
crs$target <- "Price"
crs$risk <- NULL
crs$ident <- "Id"
crs$ignore <- c ("Model", "Mfg_Month", "Met_Color", "Color", "Automatic", "CC", "Cylinders", "Gears", "Mfr_Guarantee", "BOVAG_Guarantee", "ABS", "Airbag_1", "Airbag_2", "Airco", "Automatic_airco", "Boardcomputer", "CD_Player", "Central_Lock", "Powered_Windows", "Power_Steering", "Radio", "Mistlamps", "Sport_Model", "Backseat_Divider", "Metallic_Rim", "Radio_cassette", "Parking_Assistant", "Tow_Bar", "TFC_Airbag_2", "TFC_Boardcomputer", "TFC_CD_Player", "TFC_Central_Lock", "TFC_Powered_Windows", "TFC_Radio", "TFC_Mistlamps", "TFC_Sport_Model", "TFC_Backseat_Divider", "TFC_Metallic_Rim", "TFC_Radio_cassette", "TFC_Parking_Assistant", "TFC_Tow_Bar")
crs$weights <- NULL
#============================================================
# Regression model
# Build a Regression model.
crs$glm <- lm(Price ~ ., data=crs$dataset[crs$train,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm)) # This is the second key line
cat('==== ANOVA ====\n')
print(anova(crs$glm))
print(" ")
# Time taken: 0.03 secs
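The script above fits the model on the training rows only. A natural next step is to score the held-out rows with Mean Absolute Error. The sketch below is not Rattle output. It assumes mtcars as stand-in data (mpg plays the role of Price), a seed of 42, and the illustrative names dataset, train, holdout, fit, pred and mae.

```r
# Stand-in data: a few mtcars columns, with mpg as the target.
dataset <- mtcars[, c("mpg", "wt", "hp", "qsec")]
set.seed(42)                                   # assumed seed
train <- sample(nrow(dataset), 0.7 * nrow(dataset))
holdout <- setdiff(seq_len(nrow(dataset)), train)

# Same call shape as the Rattle line: target regressed on all other columns,
# using only the training rows.
fit <- lm(mpg ~ ., data = dataset[train, ])

# Predict the held-out rows and average the absolute errors.
pred <- predict(fit, newdata = dataset[holdout, ])
mae <- mean(abs(dataset$mpg[holdout] - pred))
print(mae)
```

MAE is in the units of the target (miles per gallon here, price in the course data), which makes it easier to explain than R-squared.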
R cookbook page on regression
http://proquest.safaribooksonline.com/book/programming/r/9780596809287/11dot-linear-regression-and-anova/id3419297
The lm function returns a model object that you can assign to a variable:
> m <- lm(y ~ u + v + w)
From the model object, you can extract important information using specialized functions. The most important function is summary:
> summary(m)
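To try the cookbook's pattern, the sketch below simulates y, u, v and w (the data and the true coefficients 1, 2, -3 and 0.5 are made up for illustration) and then pulls pieces out of the model object with standard extractor functions.

```r
# Simulated data: the coefficients below are invented for this example.
set.seed(1)
u <- rnorm(50)
v <- rnorm(50)
w <- rnorm(50)
y <- 1 + 2 * u - 3 * v + 0.5 * w + rnorm(50, sd = 0.1)

m <- lm(y ~ u + v + w)      # the model object
s <- summary(m)             # the summary is itself an object

print(coef(m))              # estimated coefficients
print(s$r.squared)          # overall fit
print(confint(m))           # 95% confidence intervals for the coefficients
print(head(residuals(m)))   # first few residuals
```

Because both the model and its summary are ordinary R objects, results can be stored and reused rather than copied by hand from printed output.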
Specifying a model
• crs$glm <- lm(Price ~ . + KM*Mfg_Year, data=crs$dataset[crs$train, c(crs$input, crs$target)])
• This line runs the model with an interaction term: KM*Mfg_Year
• summary(crs$glm)
• This prints the result
• Comment: You can see why you should start variable names with CAPS or another signifier
• In the data argument: rows = training set; columns = causes (inputs) and target
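To see what the interaction formula adds, the sketch below compares a model with and without it. It assumes mtcars as stand-in data, with wt and hp playing the roles of KM and Mfg_Year. The names base and inter are illustrative.

```r
# Stand-in data: mpg is the target, the other columns are inputs.
dataset <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# All inputs, no interaction.
base <- lm(mpg ~ ., data = dataset)
# All inputs plus the wt-by-hp interaction. wt * hp means wt + hp + wt:hp,
# and the main effects already covered by "." are not duplicated.
inter <- lm(mpg ~ . + wt * hp, data = dataset)

print(names(coef(base)))
print(names(coef(inter)))
# Term labels show what the formula expanded to.
print(attr(terms(inter), "term.labels"))
```

The only new coefficient is the product term, whose name contains a colon.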
Learning Objectives week 4; How did we do?
• Introduce continuous outcomes aka “regression”
• Setting up linear models:
  • Dummy variables for ‘factors’
  • Modeling nonlinearities: interactions
  • Transformations e.g. square, log of x
• Interpreting regression results:
  • Physical meaning of the coefficients
  • Which variables matter? Importance vs statistical significance
  • Measure overall model performance: Mean Absolute Error
• Other differences BDA vs Classical Stats = Hypothesis testing
• Homework: estimating EPA fuel efficiency
• Using Rattle and R together
• Transforming data
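Two of the objectives above, dummy variables for factors and transformations, can be sketched in a few lines. The example assumes the built-in iris data as a stand-in, not the course data. The variable choices are for illustration only and are not a recommended model.

```r
# iris ships with R; Species is a factor with three levels.
# model.matrix shows the 0/1 dummy columns that lm builds for a factor.
X <- model.matrix(~ Species, data = iris)
print(colnames(X))   # intercept plus one dummy per non-baseline level
print(head(X, 3))

# Transformations go directly in the formula. log() works as is;
# arithmetic such as squaring must be wrapped in I().
fit <- lm(Sepal.Length ~ Species + log(Petal.Length) + I(Petal.Width^2),
          data = iris)
print(names(coef(fit)))
```

The first factor level (setosa here) is the baseline, so each dummy coefficient is read as a difference from that level.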